'ITWeb/개발일반'에 해당되는 글 489건

정규표현식 가이드 (펌)

ITWeb/개발일반 2012. 5. 4. 15:02

더 쉽게 표현된 문서들도 많겠지만, 나름 괜찮은것 같아서 올려 봅니다.. :)

[원본]

http://www.zytrax.com/tech/web/regex.htm

[원본글]

Regular Expressions - User Guide

A Regular Expression is the term used to describe a codified method of searching invented, or defined, by the American mathematician Stephen Kleene.

The syntax (language format) described on this page is compliant with extended regular expressions (EREs) defined in IEEE POSIX 1003.2 (Section 2.8). EREs are now commonly supported by Apache, PERL, PHP4, Javascript 1.3+, MS Visual Studio, MS Frontpage, most visual editors, vi, emac, the GNU family of tools (including grep, awk and sed) as well as many others. Extended Regular Expressions (EREs) will support Basic Regular Expressions (BREs are essentially a subset of EREs). Most applications, utilities and laguages that implement RE's, especially PERL, extend the capabilities defined. The appropriate documentation should always be consulted.

Translation: The page has been translated into Bulgarian, courtesy of Albert Ward - thanks.

A Gentle Introduction: - the Basics
Simple Searches
Brackets, Ranges and Negation
Search Positioning (aka Anchors)
Iteration (aka Quantifiers)
Parenthesis and Alternation (OR)
POSIX Standard Character Classes:
Commonly Available extensions: - \w etc
Submatches, Groups and Backreferences:
Regular Expression Tester: - Experiment with your own target strings and search expressions in your browser
Some Examples: - A worked example and some samples
Notes: - general notes when using utilities and lanuages
Utility notes: - using Visual Studio regular expressions
Utility notes: - using sed for file manipulation (not for the faint hearted)

A Gentle Introduction: The Basics

The title is a misnomer - there is no gentle beginning to regular expressions. You are either into hieroglyphics big time - in which case you will love this stuff - or you need to use them, in which case your only reward may be a headache.

Some Definitions before we start

We are going to be using the terms literal, metacharacter, target string, escape sequence and search expression (aka regular expression) in this overview. Here is a definition of our terms:

literal	A literal is any character we use in a search or matching expression, for example, to find ind in windows theind is a literal string - each character plays a part in the search, it is literally the string we want to find.
metacharacter	A metacharacter is one or more special characters that have a unique meaning and are NOT used as literalsin the search expression, for example, the character ^ (circumflex or caret) is a metacharacter.
target string	This term describes the string that we will be searching, that is, the string in which we want to find our match or search pattern.
search expression	Most commonly called the regular expression. This term describes the search expression that we will be using to search our target string, that is, the pattern we use to find what we want.
escape sequence	An escape sequence is a way of indicating that we want to use one of our metacharacters as a literal. In a regular expression an escape sequence involves placing the metacharacter \ (backslash) in front of themetacharacter that we want to use as a literal, for example, if we want to find (s) in the target stringwindow(s) then we use the search expression $s$ and if we want to find \\file in the target string c:\\filethen we would need to use the search expression \\\\file (each \ we want to search for as a literal (there are 2) is preceded by an escape sequence \).

Our Example Target Strings

Throughout this guide we will use the following as our target strings:

STRING1   Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)
STRING2   Mozilla/4.75 [en](X11;U;Linux2.2.16-22 i586)

These are Browser ID Strings and appear as the Apache Environmental variable HTTP_USER_AGENT (full list of Apache environmental variables).

Simple Matching

We are going to try some simple matching against our example target strings:

Note: You can also experiment as you go through the examples.

Search for (search expression)
m	STRING1	match	Finds the m in compatible
	STRING2	no match	There is no lower case m in this string. Searches are case sensitive unless you take special action.
a/4	STRING1	match	Found in Mozilla/4.0 - any combination of characters can be used for the match
	STRING2	match	Found in same place as in STRING1
5 [	STRING1	no match	The search is looking for a pattern of '5 [' and this does NOT exist in STRING1. Spaces are valid in searches.
	STRING2	match	Found in Mozilla/4.75 [en]
in	STRING1	match	found in Windows
	STRING2	match	Found in Linux
le	STRING1	match	found in compatible
	STRING2	no match	There is an l and an e in this string but they are not adjacent (or contiguous).

Check the results in our Regular Expression Tester.

Brackets, Ranges and Negation

Bracket expressions introduce our first metacharacters, in this case the square brackets which allow us to define list of things to test for rather than the single characters we have been checking up until now. These lists can be grouped into what are known as Character Classes typically comprising well know groups such as all numbers etc.

Metacharacter	Meaning
[ ]	Match anything inside the square brackets for ONE character position once and only once, for example, [12] means match the target to 1 and if that does not match then match the target to 2 while [0123456789] means match to any character in the range 0 to 9.
-	The - (dash) inside square brackets is the 'range separator' and allows us to define a range, in our example above of [0123456789] we could rewrite it as [0-9]. You can define more than one range inside a list, for example, [0-9A-C] means check for 0 to 9 and A to C (but not a to c). NOTE: To test for - inside brackets (as a literal) it must come first or last, that is, [-0-9] will test for - and 0 to 9.
^	The ^ (circumflex or caret) inside square brackets negates the expression (we will see an alternate use for the circumflex/caret outside square brackets later), for example, [^Ff] means anything except upper or lower case F and [^a-z] means everything except lower case a to z. Notes: There are no spaces between the range delimiter values, if there was, depending on the range, it would be added to the possible range or rejected as invalid. Be very careful with spaces. Some regular expression systems, notably VBScript, provide a negation operator (!) for use with strings. This is a non-standard feature and therefore the resulting expressions are not portable. Negation can be very tricky - you may want to read these additional notes on this and other topics.

NOTE: There are some special range values (Character Classes) that are built-in to most regular expression software and have to be if it claims POSIX 1003.2 compliance for either BRE or ERE.

So lets try this new stuff with our target strings.

Search for (search expression)
in[du]	STRING1	match	finds ind in Windows
	STRING2	match	finds inu in Linux
x[0-9A-Z]	STRING1	no match	Again the tests are case sensitive to find the xt in DigExt we would need to use [0-9a-z] or [0-9A-Zt]. We can also use this format for testing upper and lower case e.g. [Ff] will check for lower and upper case F.
	STRING2	match	Finds x2 in Linux2
[^A-M]in	STRING1	match	Finds Win in Windows
	STRING2	no match	We have excluded the range A to M in our search so Linux is not found but linux (if it were present) would be found.

Check the results in our Regular Expression Tester.

Positioning (or Anchors)

We can control where in our target strings the matches are valid. The following is a list of metacharacters that affect the position of the search:

Metacharacter	Meaning
^	The ^ (circumflex or caret) outside square brackets means look only at the beginning of the target string, for example, ^Win will not find Windows in STRING1 but ^Moz will find Mozilla.
$	The $ (dollar) means look only at the end of the target string, for example, fox$ will find a match in 'silver fox' since it appears at the end of the string but not in 'the fox jumped over the moon'.
.	The . (period) means any character(s) in this position, for example, ton. will find tons, tone and tonneau but notwanton because it has no following character.

NOTE: Many systems and utilities, but not all, support special positioning macros, for example \< match at beginning of word, \> match at end of word, \b match at the begining OR end of word , \B except at the beginning or end of a word. List of the common values.

So lets try this lot out with our example target strings..

Search for (search expression)
[a-z]\)$	STRING1	match	finds t) in DigiExt) Note: The \ is an escape character and is required to treat the ) as a literal
	STRING2	no match	We have a numeric value at the end of this string but we would need [0-9a-z]) to find it.
.in	STRING1	match	Finds Win in Windows.
	STRING2	match	Finds Lin in Linux.

Check the results in our Regular Expression Tester.

Iteration 'metacharacters'

The following is a set of iteration metacharacters (a.k.a. quantifiers) that can control the number of times the preceding character is found in our searches. The iteration meta characters can also be used in conjunction with parenthesis meta characters.

Metacharacter	Meaning
?	The ? (question mark) matches the preceding character 0 or 1 times only, for example, colou?r will find both color (0 times) and colour (1 time).
*	The * (asterisk or star) matches the preceding character 0 or more times, for example, tre* will find tree (2 times) and tread (1 time) and trough (0 times).
+	The + (plus) matches the previous character 1 or more times, for example, tre+ will find tree (2 times) and tread (1 time) but NOT trough (0 times).
{n}	Matches the preceding character, or character range, n times exactly, for example, to find a local phone number we could use [0-9]{3}-[0-9]{4} which would find any number of the form 123-4567. Note: The - (dash) in this case, because it is outside the square brackets, is a literal. Value is enclosed in braces (curly brackets).
{n,m}	Matches the preceding character at least n times but not more than m times, for example, 'ba{2,3}b' will find 'baab' and 'baaab' but NOT 'bab' or 'baaaab'. Values are enclosed in braces (curly brackets).

Note: While it may be obvious it is worth emphasizing. In all the above examples only the character immediately preceding the iteration character takes part in the iteration, all other characters in the search expression (regular expression) are literals. Thus, in the first example search expression colou?r, the string colo is a literal and must be found before the iteration sequence (u?) is triggered which, if satisfied, must also be followed by the literal r for a match to occur.

So lets try them out with our example target strings.

Search for (search expression)
\(.*l	STRING1	match	finds the ( and l in (compatible. The opening \ is an escape character used to indicate the ( it precedes is a literal (search character) not a metacharacter. Note: If you use the tester with STRING1 and the above expression it will return the match (compatibl. The literal ( essentially anchors the search - it simply says start the search only when an ( is found. The following .* says the ( may be followed by any character (.), zero or more times () (thus compatib* are essentially random characters that happen to appear in this string - they were not part of the search) and terminate the search on finding an l literal. Only the ( and l are truly part of the search expression.
	STRING2	no match	Mozilla contains lls but not preceded by an open parenthesis (no match) and Linux has an upper case L (no match).
We had previously defined the above test using the search value l? (thanks to David Werner Wiebe for pointing out our error). The search expression l? actually means find anything, even if it has no l (l 0 or 1 times), so would match on both strings. We had been looking for a method to find a single l and exclude ll which, without lookahead (a relatively new extension to regular expressions pioneered by PERL) is pretty difficult. Well that is our excuse.
W*in	STRING1	match	Finds the Win in Windows.
	STRING2	match	Finds in in Linux preceded by W zero times - so a match.
[xX][0-9a-z]{2}	STRING1	no match	Finds x in DigExt but only one t.
	STRING2	match	Finds X and 11 in X11.

Check the results in the Regular Expression Tester.

More 'metacharacters'

The following is a set of additional metacharacters that provide added power to our searches:

Metacharacter	Meaning
()	The ( (open parenthesis) and ) (close parenthesis) may be used to group (or bind) parts of our search expression together - see this example.
\|	The \| (vertical bar or pipe) is called alternation in techspeak and means find the left hand OR right values, for example, gr(a\|e)y will find 'gray' or 'grey' and has the sense that if the first test is not valid the second will be tried, if the first is valid the second will not be tried. Alternation can be nested within each expression, thus gr((a\|e)\|i)ywill find 'grey', 'grey' and 'griy'.

<humblepie> In our examples, we blew this expression ^([L-Z]in), we incorrectly stated that this would negate the tests [L-Z], the '^' only performs this function inside square brackets, here it is outside the square brackets and is an anchor indicating 'start from first character'. Many thanks to Mirko Stojanovic for pointing it out and apologies to one and all.</humblepie>

So lets try these out with our example strings..

Search for (search expression)
^([L-Z]in)	STRING1	no match	The '^' is an anchor indicating first position. Win does not start the string so no match.
	STRING2	no match	The '^' is an anchor indicating first position. Linux does not start the string so no match.
((4\.[0-3])\|(2\.[0-3]))	STRING1	match	Finds the 4.0 in Mozilla/4.0.
	STRING2	match	Finds the 2.2 in Linux2.2.16-22.
(W\|L)in	STRING1	match	Finds Win in Windows.
	STRING2	match	Finds Lin in Linux.

Check the results in the Regular Expression Tester.

More Stuff

POSIX Standard Character Classes
Apache browser recognition - a worked example
Commonly Available extensions - \w etc
Submatches, Groups and Backreferences
Regular Expression Tester - Experiment with your own strings and expressions in your browser
Common examples - regular expression examples
Notes - general notes when using utilities and lanuages
Utility notes - using Visual Studio regular expressions
Utility notes - using sed for file manipulation (not for the faint hearted)

For more information on regular expressions go to our links pages under Languages/regex. There are lots of folks who get a real buzz out of making any search a 'one liner' and they are incredibly helpful at telling you how they did it. Welcome to the wonderful, if arcane, world of Regular Expressions. You may want to play around with your new found knowledge using this tool.

POSIX Character Class Definitions

POSIX 1003.2 section 2.8.3.2 (6) defines a set of character classesthat denote certain common ranges. They tend to look very ugly but have the advantage that also take into account the 'locale', that is, any variant of the local language/coding system. Many utilities/languages provide short-hand ways of invoking these classes. Strictly the names used and hence their contents reference the LC_CTYPE POSIX definition (1003.2 section 2.5.2.1).

Value	Meaning
[:digit:]	Only the digits 0 to 9
[:alnum:]	Any alphanumeric character 0 to 9 OR A to Z or a to z.
[:alpha:]	Any alpha character A to Z or a to z.
[:blank:]	Space and TAB characters only.
[:xdigit:]	Hexadecimal notation 0-9, A-F, a-f.
[:punct:]	Punctuation symbols . , " ' ? ! ; : # $ % & ( ) * + - / < > = @ [ ] \ ^ _ { } \| ~
[:print:]	Any printable character.
[:space:]	Any whitespace characters (space, tab, NL, FF, VT, CR). Many system abbreviate as \s.
[:graph:]	Exclude whitespace (SPACE, TAB). Many system abbreviate as \W.
[:upper:]	Any alpha character A to Z.
[:lower:]	Any alpha character a to z.
[:cntrl:]	Control Characters NL CR LF TAB VT FF NUL SOH STX EXT EOT ENQ ACK SO SI DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC IS1 IS2 IS3 IS4 DEL.

These are always used inside square brackets in the form [[:alnum:]] or combined as [[:digit:]a-d]

Common Extensions and Abbreviations

Some utitlities and most languages provide extensions or abbreviations to simplify(!) regular expressions. These tend to fall into Character Classes or position extensions and the most common are listed below. In general these extensions are defined by PERL and implemented in what is called PCRE's (Perl Compatible Regular Expressions) which has been implemented in the form of a libary that has been ported to many systems. Full details of PCRE. PERL 5.8.8 regular expression documentation.

While the \x type syntax for can look initially confusing the backslash precedes a character that does not normally need escaping and hence can be interpreted correctly by the utility or language - whereas we simple humans tend to become confused more easily. The following are supported by: .NET, PHP, PERL, RUBY, PYTHON, Javascript as well as many others.

Character Class Abbreviations
\d	Match any character in the range 0 - 9 (equivalent of POSIX [:digit:])
\D	Match any character NOT in the range 0 - 9 (equivalent of POSIX [^[:digit:]])
\s	Match any whitespace characters (space, tab etc.). (equivalent of POSIX [:space:] EXCEPT VT is not recognized)
\S	Match any character NOT whitespace (space, tab). (equivalent of POSIX [^[:space:]])
\w	Match any character in the range 0 - 9, A - Z and a - z (equivalent of POSIX [:alnum:])
\W	Match any character NOT the range 0 - 9, A - Z and a - z (equivalent of POSIX [^[:alnum:]])
Positional Abbreviations
\b	Word boundary. Match any character(s) at the beginning (\bxx) and/or end (xx\b) of a word, thus \bton\b will find ton but not tons, but \bton will find tons.
\B	Not word boundary. Match any character(s) NOT at the beginning(\Bxx) and/or end (xx\B) of a word, thus \Bton\B will find wantons but not tons, but ton\B will find both wantons and tons.

Submatches, Groups and Backreferences

Some regular expression implementations provide the last results of each separate match enclosed in parenthesis (called a submatch, group or backreference) in variables that may subsequently be used or substituted in an expression. There may one or more such groupings in an expression. These variables are usually numbered $1 to $9. Where $1 will contain the first submatch, $2 will contain the second submatch and so on. The $x format typically persists until it is referenced in some expression or until another regular expression is encountered. Example:

# assume target string = "cat"
search expression = (c|a)(t|z)
$1 will contain "a"
# $1 contains "a" because it is the last
# character found using (c|a) 
# if the target string was "act"
# $1 would contain "c"
$2 will contain "t" 

# OpenLDAP 'access to' directive example: assume target dn 
# is "ou=something,cn=my name,dc=example,dc=com"
# then $1 = 'my name' at end of match below
# because first regular expression does not have ()
access to dn.regex="ou=[^,]+,cn=([^,]+),dc=example,dc=com"
 by dn.exact,expand="cn=$1,dc=example,dc=com"

PERL, Ruby and the OpenLDAP access to directive support submatches.

When used within a single expression these submatches are typically called groups or backreferences and are placed in numeric variables (typically addressed using \1 to \9). These groups or backreferences (variables) may be substituted within the regular expression. The following demonstrates usage:

# the following expression finds any occurrence of double characters
(.)\1
# the parenthesis creates the grouping (or submatch or backreference) 
# in this case it is the first (only), so is referenced by \1
# the . (dot) finds any character and the \1 substitutes whatever 
# character was found by the dot in the next character position, 
# thus to match it must find two consecutive characters which are the same

Regular Expression - Experiments and Testing

This simple regular expression tester lets you experiment using your browser's regular expression Javascript function (use View Source in your browser for the Javascript source code).

Enter or copy/paste the string you want to search in the box labled String: and the regular expression in the box labeled RE:, click theSearch button and results will appear in the box labeled Results:. If you are very lucky the results may even be what you expect. This tester displays the whole searched string in the <>Results field and encloses in < > the first found result. That may not be terribly helpful if you are dealing with HTML - but our heart is in the right place. All matches are then displayed separately showing the found text and its character position in the string. Checking the Case Insensitive: box makes the search case insensitive thus [AZ] will find the "a" in "cat", whereas without checking the box [aZ] would be required to find the "a" in "cat". Note: Not all regular expression systems provide a case insensitivity feature and therefore the regular expression may not be portable. Checking Results only will supress display of the marked up original string and only show the results found, undoing all our helpful work, but which can make things a little less complicated especially if dealing with HTML strings or anything else with multiple < > symbols. Clear will zap all the fields - including the regular expression that you just took 6 hours to develop. Use with care. See the notes below for limitations, support and capabilities.

Note: If the regular expression is invalid or syntactically incorrect the tester will display in the Results field exactly what your browser thought of your attempt - sometimes it might even be useful.

<ouch> We had an error such that if the match occurred in the first position the enclosing <> was incorrectly displayed.</ouch>

If you plan to experiment with the target strings used to illustrate usage of the various meta characters we have thoughtfully replicated them below to save you all that scrolling. Just copy and paste into the String box. Are we helpful or not?

STRING1   Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)
STRING2   Mozilla/4.75 [en](X11;U;Linux2.2.16-22 i586)

Javascript implementations may vary from browser to browser. This feature was tested with MSIE 6.x, Gecko (Firefox 2.something) and Opera 9 (a recent email indicates it works in Google's Chrome - so that suggests it will work with Safari - both use Webkit). If the tester does not work for you we are very, very sad - but yell at your browser supplier not us.Notes:

The ECMA-262 (Javascript 1.2'ish) spec defines the regex implementation to be based on Perl 5 which means that submatches/backreferences, short forms such as \d etc should be supported as well as standard BRE and ERE functionality.
In Opera and MSIE the following backreference/submatch worked:
```
(.)\1
```
Which finds any occurence of double characters, such as, oo in spoon. Using Gecko (Firefox 2.something) it did not work. Since, at least, Gecko/20100228 (possibly even before that) the expression now works.
If you get the message Match(es) = zero length in the results field this implies your browser's Javascript engine has choked on executing the regular expression. It has returned a valid match (there may be others in the string) but has (incorrectly) returned a length of 0 for the number of characters in the match. This appears to happen (Moz/FF and IE tested) when searching for a single character in a string when using the meta characters ? and *. For example, the regular expression e? on any string will generate the error (this expression will yield a match on every character in the string and is probably not what the user wanted).
For those of you familiar with Javascript's regular expressions there is no need to add the enclosing //, just enter the raw regular expression. Those of you not familiar with Javascript regular expressions - ignore the preceding sentence.

Some Examples

The following sections show a number of worked examples which may help to clarify regular expression. Most likely they will not.

Apache Browser Identification - a Worked Example

All we ever wanted to do with Regular Expressions was to find enough about visiting browsers arriving at our Apache powered web site to decide what HTML/CSS to supply or not for our pop-out menus. The Apache BrowserMatch directives will set a variable if the expression matches the USER_AGENT string.

We want to know:

If we have any browser that supports Javascript (isJS).
If we have any browser that supports the MSIE DHTML Object Model (isIE).
If we have any browser that supports the W3C DOM (isW3C).

Here in their glory are the Apache regular expression statements we used (maybe you can understand them now)

BrowserMatchNoCase [Mm]ozilla/[4-6] isJS
BrowserMatchNoCase MSIE isIE
BrowserMatchNoCase [Gg]ecko isW3C
BrowserMatchNoCase MSIE.((5\.[5-9])|([6-9]|1[0-9])) isW3C
BrowserMatchNoCase W3C_ isW3C

Notes:

Line 1 checks for any upper or lower case variant of Mozilla/4-6 (MSIE also sets this value). This test sets the variable isJS for all version 4-6 browsers (we assume that version 3 and lower do not support Javascript or at least not a sensible Javascript).
Line 2 checks for MSIE only (line 1 will take out any MSIE 1-3 browsers even if this variable is set.
Line 3 checks for any upper or lower case variant of the Gecko browser which includes Firefox, Netscape 6, 7 and now 8 and the Moz clones (all of which are Mozilla/5).
Line 4 checks for MSIE 5.5 (or greater) OR MSIE 6 - 19 (future proofing - though at the rate MS is updating MSIE it will probably be out-of-date next month).
NOTE about binding:This expression does not work:
```
BrowserMatchNoCase MSIE.(5\.[5-9])|([6-9]) isW3C
```
It incorrectly sets variable isW3C if the number 6 - 9 appears in the string. Our guess is the binding of the first parenthesis is directly to the MSIE expression and the OR and second parenthesis is treated as a separate expression. Adding the inner parenthesis fixed the problem.
Line 5 checks for W3C_ in any part of the line. This allows us to identify the W3C validation services (either CSS or HTML/XHTML page validation).

Some of the above checks may be a bit excessive, for example, is Mozilla ever spelled mozilla?, but it is also pretty silly to have code fail just because of this 'easy to prevent' condition. There is apparently no final consensus that all Gecko browsers will have to use Gecko in their 'user-agent' string but it would be extremely foolish not to since this would force guys like us to make huge numbers of tests for branded products and the more likely outcome would be that we would not.

Common Examples

The following examples may be useful, they are particularly aimed at extracting parameters but cover some other ground as well. If anyone wants to email us some more examples we'd be happy to post with an appropriate credit.

# split on simple space 
string "aaa bbb ccc"
re = \S+ 
result = "aaa", "bbb", "ccc"
# Note: If you want the location of the whitespace (space) use \s+

# css definition split on space or comma but keep "" enclosed items
string = '10pt "Times Roman", Helvetica,Arial, sans-serif'
re = \s*("[^"]+"|[^ ,]+)
result = "10pt" ,"\"Times Roman\"", "Helvetica" ,"Arial", "sans-serif"

# extract HTML <> enclosed tags
string = '<a href="#">A link</a>'
re = <[^>]*>
result = '<a href="#">', "</a>"

# find all double characters
string = 'aabcdde'
re = (.)\1
result = "aa" "dd"

# separate comma delimted values into groups (submatches or backreferences)
string = ou=people,cn=web,dc=example,dc=com
re = ou=[^,]+,cn=([^,]+),dc=example,dc=com
result $1 variable will contain "web" - first expression has no grouping ()

Utility and Language Notes - General

Certain utilities, notably grep, suggest that it is a good idea to enclose any complex search expression inside single quotes. In fact it is not a good idea - it is absolutely essential! Example:
```
grep 'string\\' *.txt # this works correctly
grep string\\ *.txt # this does not work
```
Some utilities and most languages use / (forward slash) to start and end (de-limit or contain) the search expression others may use single quotes. This is especially true when there may be optional following arguments (see the grep example above). These characters do not play any role in the search itself.

Utility Notes - Using Visual Studio

For reasons best know to itself MS Visual Studio (VS.NET) uses a bizarre set of extensions to regular expressions. (MS VS standard documentation) But there is a free regular expression add-in if you want to return to sanity.

Utility Notes - Using sed

Stream editor (sed) is one of those amazingly powerful tools for manipulating files that are simply horrible when you try to use them - unless you get a buzz out of ancient Egyptian hieroglyphics. But well worth the effort. So if you are hieroglyphically-challenged, like us, these notes may help. There again they may not. There is also a useful series of tutorials on sed and this list of sed one liners.

not all seds are equal: Linux uses GNU sed, the BSDs use their own, slightly different, version.
sed on windows: GNU sed has been ported to windows.
sed is line oriented: sed operates on lines of text within the file or input stream.
expression quoting: To avoid shell expansion (in BASH especially) quote all expressions in single quotes as in a 'search expression'.
sed defaults to BRE: The default behaviour of sed is to support Basic Regular Expressions (BRE). To use all the features described on this page set the -r (Linux) or -E (BSD) flag to use Extended Regular Expressions (ERE) as shown:
```
# use ERE on Linux (GNU sed)
sed -r 'search expression' file
# use ERE on BSD
sed -E 'search expression' file
```
in-situ editing: By default sed outputs to 'Standard Out' (normally the console/shell). There are two mutually exclusive options to create modified files. Redirect 'standard out' to a file or use in-situ editing with the -i option. The following two lines illustrate the options:
```
# in-situ: saves the unmodified file to file.bak BEFORE
# modifying
sed -i .bak 'search expression' file

# redirection: file is UNCHANGED the modified file is file.bak
sed 'search expression' file > file.bak
```
sed source: Sed will read from a file or 'Standard In' and therefore may be used in piped sequences. The following two lines are functionally equivalent:
```
cat file |sed 'search expression' > file.mod
sed 'search expression' file > file.mod
```

sed with substitution: sed's major use for most of us is in changing the contents of files using the substitution feature. Subsitution uses the following expression:

# substitution syntax
sed '[position]s/find/change/flag' file > file.mod
# where 
# [position] - optional - normally called address in most documentation
#  s         - indicates substitution command
#  find      - the expression to be changed
#  change    - the expression to be substituted
#  flag      - controls the actions and may be
#              g = repeat on same line
#              N = Nth occurence only on line
#              p = output line only if find was found!
#              (needs -n option to suppress other lines)
#              w ofile = append line to ofile only if find 
#                        was found
# if no flag given changes only the first occurrence of 
# find on every line is substituted

# examples
# change every occurrence of abc on every line to def
sed 's/abc/def/g' file > file.mod

# change only 2nd occurrence of abc on every line to def
sed 's/abc/def/2' file > file.mod

# creates file changed consisting of only lines in which
# abc was changed to def
sed 's/abc/def/w changed' file
# functionally identical to above
sed -n 's/abc/def/p' file > changed

Line deletion: sed provides for simple line deletion. The following examples illustrate the syntax and a trivial example:

# line delete syntax
sed '/find/d' file > file.mod
# where
# find - find regular expression
# d    - delete command

# delete every comment line (starting with #) in file
sed '/^#/d' file > file.mod

Delete vs Replace with null: If you use the delete feature of sed it deletes the entire line on which 'search expression' appears, which may not be the desired outcome. If all you want to do it delete the 'search expression' from the line then use replace with null. The following examples illustrate the difference:
```
# delete (substitute with null) every occurrence of abc in file
sed 's/abc//g' file > file.mod

# delete every line with abc in file
sed '/abc/d' file > file.mod
```
Escaping: You need to escape certain characters when using as literals using the standard \ technique. This removes the width attribute from html pages that many web editors, such as frontpage, annoyingly place on every line. The " are used as literals in the expression and are escaped by using \:
```
# delete (substitue with null) every occurrence of width="x" in file
# where x may be pure numeric or a percentage
sed 's/width=\"[0-9.%]*\"//g' file.html > file.mod
```

Delimiters: If you use sed when working with, say, paths which contain / it can be a royal pain to escape them all so you can use any sensible delimiter for the expresssions. The following example illustrates the principle:

# use of / delimiter with a path containing /
# replaces all occurences of /var/www/ with /var/local/www/
sed 's/\/var\/www\//\/var\/local\/www\//g' file > file.mod

# functionally identical using : as delimiter
sed 's:/var/www/:/var/local/www/:g' file > file.mod

Positioning with sed: sed documentation uses, IOHO, the confusing term address for what we call [position]. Positional expressions can optionally be placed before sed commands to position the execution of subsequent expressions/commands. Commands may take 1 or 2 positional expressions which may be line or text based. The following are simple examples:

# delete (subsitute with null) every occurrence of abc 
# in file only on lines starting with xyz (1 positional expression)
sed '/^xyz/s/abc//g' file > file.mod

# delete (subsitute with null) every occurrence of abc 
# only in lines 1 to 50
# 2 positional expression separated by comma
sed '1,50s/abc//g' file > file.mod

# delete (subsitute with null) every occurrence of abc 
# except lines 1 - 50 
# 2 positional expression separated by comma
sed '1,50!s/abc//g' file > file.mod

# delete (subsitute with null) every occurrence of abc 
# between lines containing aaa and xxx
# 2 positional expression separated by comma
sed '/aaa/,/xxx/s/abc//g' file > file.mod

# delete first 50 lines of file
# 2 positional expression separated by comma
sed '1,50d' file > file.mod

# leave first 50 lines of file - delete all others
# 2 positional expression separated by comma
sed '1,50!d' file > file.mod

when to use -e: you can use -e (indicating sed commands) with any search expression but when you have multiple command sequences you must use -e. The following are functionality identical:
```
# delete (substitute with null) every occurrence of width="x" in file
sed 's/width=\"[0-9.%]*\"//g' file.html > file.mod
sed -e 's/width=\"[0-9.%]*\"//g' file.html > file.mod
```

Strip HTML tags: Regular expressions take the longest match and therefore when stripping HTML tags may not yield the desired result:

# target line
<b>I</b> want you to <i>get</i> lost.

# this command finds the first < and last > on line
sed 's/<.*>//g' file.html
# and yields
lost.

# instead delimit each < with >
sed 's/<[^>]*>//g' file.html
# yields
I want you to get lost.

# finally to allow for multi-line tags you must use
# following (attributed to S.G Ravenhall)
sed -e :aloop -e 's/<[^>]*>//g;/</N;//bloop'
[see explantion below]

labels, branching and multiple commands: sed allows mutiple commands on a single line separated by semi-colons (;) and the definition of labels to allow branching (looping) within commands. The following example illustrates these features:

# this sequence strips html tags including multi-line ones
sed -e :aloop -e 's/<[^>]*>//g;/</N;//bloop'

# Explanation:
# -e :aloop consists of :a which creates a label followed by its name 
# in this case 'loop' that can be branched to by a later command 
# next -e s/<[^>]*>//g; removes tags on a single line and
# the ; terminates this command when the current line is exhausted.
# At this point the line buffer (called the search space) holds the 
# current line text with any transformations applied, so <> 
# sequences within the line have been removed from the search space.
# However we may have either an < or no < left in the current
# search space which is then processed by the next command which is:
# /</N; which is a positioning command looking for <
# in any remaining part of the search space. If < is found, the N  
# adds a NL char to the search space (a delimiter) and tells sed
# to ADD the next line in the file to the search space, control 
# then passes to the next command.
# If < was NOT found the search buffer is cleared (output normally) 
# and a new line read into the search space as normal. Then control
# passes to the next command, which is:
# //bloop which comprises // = do nothing, b = branch to and loop
# which is the label to which we will branch and was created 
# with -e :aloop. This simply restarts the sequence with EITHER just
# the next line of input (no < was left in the search space)
# or with the next line ADDED to the search space (there was a < 
# left in the search space but no corresponding >)
# all pretty obvious really!

adding line numbers to files: Sometimes it's incredibly useful to be able to find the line number within a file, say, to match up with error messages for example a parser outputs the message 'error at line 327'. The following adds a line number followed by a single space to each line in file:
```
# add the line number followed by space to every line in file
sed = file|sed 's/\n/ /' > file.lineno
# the first pass (sed = file) creates a line number and 
# terminates it with \n (creating a new line)
# the second piped pass (sed 's/\n/ /') substitutes a space 
# for \n making a single line
```
Note: We got email asking why the above does not also remove the real end of line (EOL) as well. OK. When any data is read into the buffer for processing the EOL is removed (and appended again only if written to a file). The line number created by the first command is pre-pended (with an EOL to force a new line if written to file) to the processing buffer and the whole lot piped to the second command and thus is the only EOL found.

MongoDB - Strategies when hitting disk

ITWeb/개발일반 2012. 5. 3. 09:46

역시 퍼왔습니다.. ^^;

[원본]

http://www.colinhowe.co.uk/2012/apr/26/mongodb-strategies-when-hitting-disk/

[원본 글]

MongoDB - Strategies when hitting disk

April 26th, 2012 — 0 Comments — Permalink

I gave a lightning talk on this at the London MongoDB User Group and thought I'd write it up here

MongoDB sucks when it hits disk (ignoring SSDs). The general advice is to never hit disk. What if you have to hit disk? Conversocial's new metrics infrastructure will allow people to see statistics for their Facebook and Twitter channels going back indefinitely. In general, the data being queried and updated will be in the past month and we can keep this is memory. But, we want to let them query the data going back further than this - which means hitting disk.

We found three good strategies for making hitting the disk less painful:

1. Use Single Big Documents

The naive implementation of our metrics system stored documents like this:

{ metric: "content_count", client: 5, value: 51, date: ISODate("2012-04-01 13:00") } 
{ metric: "content_count", client: 5, value: 49, date: ISODate("2012-04-02 13:00") }

An alternative implementation is:

{ metric: "content_count", client: 5, month: "2012-04", 1: 51, 2: 49, ... }

In this case we have a single document that spans an entire month with the value for each day being a field inside the document.

For a simple test we filled our database so that we had ~7gb of data on an Amazon c1.medium instance (1.7gb RAM) then tested how long it would take to read the data for an entire year and averaged this over multiple runs:

Naive implementation: 1.6s for a single year
Single document per month: 0.3s

That's a huge difference. The reasoning behind it is fairly simple:

The naive implementation has a worst case scenario where it has to read from the disk for all 365 documents and each of these results in a random seek
Having a single document per month has a worst case scenario where it has to read from the disk for 12 documents

An added benefit of this strategy is that there is less overhead per day which means the working set can contain much more data.

Foursquare do this.

2. Unusual Indices

Sometimes it pays to experiment with unusual index layouts. The naive index for our metrics system is on metric, client and then date:

db.metrics.ensureIndex({ metric: 1, client: 1, date: 1})

A common tip with indexing is to have all new values go to one side of the index. We reasoned that although the date was at the end of our index we would be writing to the right of lots of parts of the index so performance should be OK. We were wrong. We compared the performance of the above index with a new one:

db.metrics.ensureIndex({ date: 1, metric: 1, client: 1 })

The naive implementation performed 10k/sec inserts but after 20 million inserts the performance dropped down to 2.5k/sec inserts and occasionally stalled with lots of IO to disk. Ouch
By switching to date at the start of the index our performance was kept constant at 10k/sec inserts

What about queries? By putting the date at the front of the index we realised we'd now have to query an entire year of data using an in query:

db.metrics.find({ 
    metric: 'content_count', client: 1, date: { $in: [ "2012-01", "2012-02", ... ] } 
})

A test of the read performance of this displayed no noticeable impact.

The reasoning for this is that the naive implementation will be causing a lot of rebalancing of the trees used for the index. By switching the index around we ensured that all inserts went to one side of the index and rebalancing became a trivial operation.

3. Pre-Allocate for Locality

For most disks (not SSDs) the sequential read performance is vastly better than the random read performance. This means that we can read our metrics really fast from disk if we read them all from the same part of the disk. With MongoDB documents will reside on disk in the order that you wrote them unless they are resized and need to be moved around.

If we pre-allocate zero filled documents then we can force values for nearby months for the same metric to be stored on disk in the same location and then exploit the speed of sequential reads:

db.metrics.insert([ 
    { metric: 'content_count', client: 3, date: '2012-01', 0: 0, 1: 0, 2: 0, ... } 
    { .................................., date: '2012-02', ... }) 
    { .................................., date: '2012-03', ... }) 
    { .................................., date: '2012-04', ... }) 
    { .................................., date: '2012-05', ... }) 
    { .................................., date: '2012-06', ... }) 
    { .................................., date: '2012-07', ... }) 
    { .................................., date: '2012-08', ... }) 
    { .................................., date: '2012-09', ... }) 
    { .................................., date: '2012-10', ... }) 
    { .................................., date: '2012-11', ... }) 
    { .................................., date: '2012-12', ... }) 
])

Now, when client 3 wants their values for 'content_count' for the past year we can serve it using one big sequential read.

And the benchmarks?

Reading an entire year without pre-allocation: 62ms
Reading an entire year with pre-allocation: 6.6ms

Despite the performance gains from this we decided not to do this. Pre-allocation can get expensive for sparse data: you end up wasting a lot of space storing zeros that are never changed.

Conclusions

MongoDB can be made to have decent disk performance. You've just got to do some of the work yourself to ensure that reads aren't too expensive.

NoSQL Data Modeling....

ITWeb/개발일반 2012. 5. 3. 09:42

좋은 글은.. 두루 두루 공유되어야 합니다..
^^;
사실 Keep 하고 싶어서.. 퍼왔습니다.

[원본]

http://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques/

[원본 글]

NoSQL Data Modeling Techniques

Posted on March 1, 2012

NoSQL databases are often compared by various non-functional criteria, such as scalability, performance, and consistency. This aspect of NoSQL is well-studied both in practice and theory because specific non-functional properties are often the main justification for NoSQL usage and fundamental results on distributed systems like the CAP theorem apply well to NoSQL systems. At the same time, NoSQL data modeling is not so well studied and lacks the systematic theory found in relational databases. In this article I provide a short comparison of NoSQL system families from the data modeling point of view and digest several common modeling techniques.

I would like to thank Daniel Kirkdorffer who reviewed the article and cleaned up the grammar.

To explore data modeling techniques, we have to start with a more or less systematic view of NoSQL data models that preferably reveals trends and interconnections. The following figure depicts imaginary “evolution” of the major NoSQL system families, namely, Key-Value stores, BigTable-style databases, Document databases, Full Text Search Engines, and Graph databases:

NoSQL Data Models

First, we should note that SQL and relational model in general were designed long time ago to interact with the end user. This user-oriented nature had vast implications:

The end user is often interested in aggregated reporting information, not in separate data items, and SQL pays a lot of attention to this aspect.
No one can expect human users to explicitly control concurrency, integrity, consistency, or data type validity. That’s why SQL pays a lot of attention to transactional guaranties, schemas, and referential integrity.

On the other hand, it turned out that software applications are not so often interested in in-database aggregation and able to control, at least in many cases, integrity and validity themselves. Besides this, elimination of these features had an extremely important influence on the performance and scalability of the stores. And this was where a new evolution of data models began:

Key-Value storage is a very simplistic, but very powerful model. Many techniques that are described below are perfectly applicable to this model.
One of the most significant shortcomings of the Key-Value model is a poor applicability to cases that require processing of key ranges. Ordered Key-Value model overcomes this limitation and significantly improves aggregation capabilities.
Ordered Key-Value model is very powerful, but it does not provide any framework for value modeling. In general, value modeling can be done by an application, but BigTable-style databases go further and model values as a map-of-maps-of-maps, namely, column families, columns, and timestamped versions.
Document databases advance the BigTable model offering two significant improvements. The first one is values with schemes of arbitrary complexity, not just a map-of-maps. The second one is database-managed indexes, at least in some implementations. Full Text Search Engines can be considered a related species in the sense that they also offer flexible schema and automatic indexes. The main difference is that Document database group indexes by field names, as opposed to Search Engines that group indexes by field values. It is also worth noting that some Key-Value stores like Oracle Coherence gradually move towards Document databases via addition of indexes and in-database entry processors.
Finally, Graph data models can be considered as a side branch of evolution that origins from the Ordered Key-Value models. Graph databases allow one model business entities very transparently (this depends on that), but hierarchical modeling techniques make other data models very competitive in this area too. Graph databases are related to Document databases because many implementations allow one model a value as a map or document.

General Notes on NoSQL Data Modeling

The rest of this article describes concrete data modeling techniques and patterns. As a preface, I would like to provide a few general notes on NoSQL data modeling:

NoSQL data modeling often starts from the application-specific queries as opposed to relational modeling:
- Relational modeling is typically driven by the structure of available data. The main design theme is ”What answers do I have?”
- NoSQL data modeling is typically driven by application-specific access patterns, i.e. the types of queries to be supported. The main design theme is ”What questions do I have?”
NoSQL data modeling often requires a deeper understanding of data structures and algorithms than relational database modeling does. In this article I describe several well-known data structures that are not specific for NoSQL, but are very useful in practical NoSQL modeling.
Data duplication and denormalization are first-class citizens.
Relational databases are not very convenient for hierarchical or graph-like data modeling and processing. Graph databases are obviously a perfect solution for this area, but actually most of NoSQL solutions are surprisingly strong for such problems. That is why the current article devotes a separate section to hierarchical data modeling.

Although data modeling techniques are basically implementation agnostic, this is a list of the particular systems that I had in mind while working on this article:

Key-Value Stores: Oracle Coherence, Redis, Kyoto Cabinet
BigTable-style Databases: Apache HBase, Apache Cassandra
Document Databases: MongoDB, CouchDB
Full Text Search Engines: Apache Lucene, Apache Solr
Graph Databases: neo4j, FlockDB

Conceptual Techniques

This section is devoted to the basic principles of NoSQL data modeling.

(1) Denormalization

Denormalization can be defined as the copying of the same data into multiple documents or tables in order to simplify/optimize query processing or to fit the user’s data into a particular data model. Most techniques described in this article leverage denormalization in one or another form.

In general, denormalization is helpful for the following trade-offs:

Query data volume or IO per query VS total data volume. Using denormalization one can group all data that is needed to process a query in one place. This often means that for different query flows the same data will be accessed in different combinations. Hence we need to duplicate data, which increases total data volume.
Processing complexity VS total data volume. Modeling-time normalization and consequent query-time joins obviously increase complexity of the query processor, especially in distributed systems. Denormalization allow one to store data in a query-friendly structure to simplify query processing.

Applicability: Key-Value Stores, Document Databases, BigTable-style Databases

(2) Aggregates

All major genres of NoSQL provide soft schema capabilities in one way or another:

Key-Value Stores and Graph Databases typically do not place constraints on values, so values can be comprised of arbitrary format. It is also possible to vary a number of records for one business entity by using composite keys. For example, a user account can be modeled as a set of entries with composite keys like UserID_name, UserID_email, UserID_messages and so on. If a user has no email or messages then a corresponding entry is not recorded.
BigTable models support soft schema via a variable set of columns within a column family and a variable number of versions for one cell.
Document databases are inherently schema-less, although some of them allow one to validate incoming data using a user-defined schema.

Soft schema allows one to form classes of entities with complex internal structures (nested entities) and to vary the structure of particular entities.This feature provides two major facilities:

Minimization of one-to-many relationships by means of nested entities and, consequently, reduction of joins.
Masking of “technical” differences between business entities and modeling of heterogeneous business entities using one collection of documents or one table.

These facilities are illustrated in the figure below. This figure depicts modeling of a product entity for an eCommerce business domain. Initially, we can say that all products have an ID, Price, and Description. Next, we discover that different types of products have different attributes like Author for Book or Length for Jeans. Some of these attributes have a one-to-many or many-to-many nature like Tracks in Music Albums. Next, it is possible that some entities can not be modeled using fixed types at all. For example, Jeans attributes are not consistent across brands and specific for each manufacturer. It is possible to overcome all these issues in a relational normalized data model, but solutions are far from elegant. Soft schema allows one to use a single Aggregate (product) that can model all types of products and their attributes:

Entity Aggregation

Embedding with denormalization can greatly impact updates both in performance and consistency, so special attention should be paid to update flows.

Applicability: Key-Value Stores, Document Databases, BigTable-style Databases

(3) Application Side Joins

Joins are rarely supported in NoSQL solutions. As a consequence of the “question-oriented” NoSQL nature, joins are often handled at design time as opposed to relational models where joins are handled at query execution time. Query time joins almost always mean a performance penalty, but in many cases one can avoid joins using Denormalization and Aggregates, i.e. embedding nested entities. Of course, in many cases joins are inevitable and should be handled by an application. The major use cases are:

Many to many relationships are often modeled by links and require joins.
Aggregates are often inapplicable when entity internals are the subject of frequent modifications. It is usually better to keep a record that something happened and join the records at query time as opposed to changing a value . For example, a messaging system can be modeled as a User entity that contains nested Message entities. But if messages are often appended, it may be better to extract Messages as independent entities and join them to the User at query time:

Applicability: Key-Value Stores, Document Databases, BigTable-style Databases, Graph Databases

General Modeling Techniques

In this section we discuss general modeling techniques that applicable to a variety of NoSQL implementations.

(4) Atomic Aggregates

Many, although not all, NoSQL solutions have limited transaction support. In some cases one can achieve transactional behavior using distributed locks or application-managed MVCC, but it is common to model data using an Aggregates technique to guarantee some of the ACID properties.

One of the reasons why powerful transactional machinery is an inevitable part of the relational databases is that normalized data typically require multi-place updates. On the other hand, Aggregates allow one to store a single business entity as one document, row or key-value pair and update it atomically:

Atomic Aggregates

Of course, Atomic Aggregates as a data modeling technique is not a complete transactional solution, but if the store provides certain guaranties of atomicity, locks, or test-and-set instructions then Atomic Aggregates can be applicable.

Applicability: Key-Value Stores, Document Databases, BigTable-style Databases

(5) Enumerable Keys

Perhaps the greatest benefit of an unordered Key-Value data model is that entires can be partitioned across multiple servers by just hashing the key. Sorting makes things more complex, but sometimes an application is able to take some advantages of ordered keys even if storage doesn’t offer such a feature. Let’s consider the modeling of email messages as an example:

Some NoSQL stores provide atomic counters that allow one to generate sequential IDs. In this case one can store messages using userID_messageID as a composite key. If the latest message ID is known, it is possible to traverse previous messages. It is also possible to traverse preceding and succeeding messages for any given message ID.
Messages can be grouped into buckets, for example, daily buckets. This allows one to traverse a mail box backward or forward starting from any specified date or the current date.

Applicability: Key-Value Stores

(6) Dimensionality Reduction

Dimensionality Reduction is a technique that allows one to map multidimensional data to a Key-Value model or to other non-multidimensional models.

Traditional geographic information systems use some variation of a Quadtree or R-Tree for indexes. These structures need to be updated in-place and are expensive to manipulate when data volumes are large. An alternative approach is to traverse the 2D structure and flatten it into a plain list of entries. One well known example of this technique is a Geohash. A Geohash uses a Z-like scan to fill 2D space and each move is encoded as 0 or 1 depending on direction. Bits for longitude and latitude moves are interleaved as well as moves. The encoding process is illustrated in the figure below, where black and red bits stand for longitude and latitude, respectively:

Geohash Index

An important feature of a Geohash is its ability to estimate distance between regions using bit-wise code proximity, as is shown in the figure. Geohash encoding allows one to store geographical information using plain data models, like sorted key values preserving spatial relationships. The Dimensionality Reduction technique for BigTable was described in [6.1]. More information about Geohashes and other related techniques can be found in [6.2] and [6.3].

Applicability: Key-Value Stores, Document Databases, BigTable-style Databases

(7) Index Table

Index Table is a very straightforward technique that allows one to take advantage of indexes in stores that do not support indexes internally. The most important class of such stores is the BigTable-style database. The idea is to create and maintain a special table with keys that follow the access pattern. For example, there is a master table that stores user accounts that can be accessed by user ID. A query that retrieves all users by a specified city can be supported by means of an additional table where city is a key:

Index Table Example

An Index table can be updated for each update of the master table or in batch mode. Either way, it results in an additional performance penalty and become a consistency issue.

Index Table can be considered as an analog of materialized views in relational databases.

Applicability: BigTable-style Databases

(8) Composite Key Index

Composite key is a very generic technique, but it is extremely beneficial when a store with ordered keys is used. Composite keys in conjunction with secondary sorting allows one to build a kind of multidimensional index which is fundamentally similar to the previously described Dimensionality Reduction technique. For example, let’s take a set of records where each record is a user statistic. If we are going to aggregate these statistics by a region the user came from, we can use keys in a format (State:City:UserID) that allow us to iterate over records for a particular state or city if that store supports the selection of key ranges by a partial key match (as BigTable-style systems do):

1SELECT Values WHERE state="CA:*"
2SELECT Values WHERE city="CA:San Francisco*"

Composite Key Index

Applicability: BigTable-style Databases

(9) Aggregation with Composite Keys

Composite keys may be used not only for indexing, but for different types of grouping. Let’s consider an example. There is a huge array of log records with information about internet users and their visits from different sites (click stream). The goal is to count the number of unique users for each site. This is similar to the following SQL query:

`1`	`SELECT` `count(distinct(user_id))` `FROM` `clicks` `GROUP` `BY` `site`

We can model this situation using composite keys with a UserID prefix:

Counting Unique Users using Composite Keys

The idea is to keep all records for one user collocated, so it is possible to fetch such a frame into memory (one user can not produce too many events) and to eliminate site duplicates using hash table or whatever. An alternative technique is to have one entry for one user and append sites to this entry as events arrive. Nevertheless, entry modification is generally less efficient than entry insertion in the majority of implementations.

Applicability: Ordered Key-Value Stores, BigTable-style Databases

(10) Inverted Search – Direct Aggregation

This technique is more a data processing pattern, rather than data modeling. Nevertheless, data models are also impacted by usage of this pattern. The main idea of this technique is to use an index to find data that meets a criteria, but aggregate data using original representation or full scans. Let’s consider an example. There are a number of log records with information about internet users and their visits from different sites (click stream). Let assume that each record contains user ID, categories this user belongs to (Men, Women, Bloggers, etc), city this user came from, and visited site. The goal is to describe the audience that meet some criteria (site, city, etc) in terms of unique users for each category that occurs in this audience (i.e. in the set of users that meet the criteria).

It is quite clear that a search of users that meet the criteria can be efficiently done using inverted indexes like {Category -> [user IDs]} or {Site -> [user IDs]}. Using such indexes, one can intersect or unify corresponding user IDs (this can be done very efficiently if user IDs are stored as sorted lists or bit sets) and obtain an audience. But describing an audience which is similar to an aggregation query like

`1`	`SELECT` `count(distinct(user_id)) ...` `GROUP` `BY` `category`

cannot be handled efficiently using an inverted index if the number of categories is big. To cope with this, one can build a direct index of the form {UserID -> [Categories]} and iterate over it in order to build a final report. This schema is depicted below:

Counting Unique Users using Inverse and Direct Indexes

And as a final note, we should take into account that random retrieval of records for each user ID in the audience can be inefficient. One can grapple with this problem by leveraging batch query processing. This means that some number of user sets can be precomputed (for different criteria) and then all reports for this batch of audiences can be computed in one full scan of direct or inverse index.

Applicability: Key-Value Stores, BigTable-style Databases, Document Databases

Hierarchy Modeling Techniques

(11) Tree Aggregation

Trees or even arbitrary graphs (with the aid of denormalization) can be modeled as a single record or document.

This techniques is efficient when the tree is accessed at once (for example, an entire tree of blog comments is fetched to show a page with a post).
Search and arbitrary access to the entries may be problematic.
Updates are inefficient in most NoSQL implementations (as compared to independent nodes).

Tree Aggregation

Applicability: Key-Value Stores, Document Databases

(12) Adjacency Lists

Adjacency Lists are a straightforward way of graph modeling – each node is modeled as an independent record that contains arrays of direct ancestors or descendants. It allows one to search for nodes by identifiers of their parents or children and, of course, to traverse a graph by doing one hop per query. This approach is usually inefficient for getting an entire subtree for a given node, for deep or wide traversals.

Applicability: Key-Value Stores, Document Databases

(13) Materialized Paths

Materialized Paths is a technique that helps to avoid recursive traversals of tree-like structures. This technique can be considered as a kind of denormalization. The idea is to attribute each node by identifiers of all its parents or children, so that it is possible to determine all descendants or predecessors of the node without traversal:

Materialized Paths for eShop Category Hierarchy

This technique is especially helpful for Full Text Search Engines because it allows one to convert hierarchical structures into flat documents. One can see in the figure above that all products or subcategories within the Men’s Shoes category can be retrieved using a short query which is simply a category name.

Materialized Paths can be stored as a set of IDs or as a single string of concatenated IDs. The latter option allows one to search for nodes that meet a certain partial path criteria using regular expressions. This option is illustrated in the figure below (path includes node itself):

Query Materialized Paths using RegExp

Applicability: Key-Value Stores, Document Databases, Search Engines

(14) Nested Sets

Nested sets is a standard technique for modeling tree-like structures. It is widely used in relational databases, but it is perfectly applicable to Key-Value Stores and Document Databases. The idea is to store the leafs of the tree in an array and to map each non-leaf node to a range of leafs using start and end indexes, as is shown in the figure below:

Modeling of eCommerce Catalog using Nested Sets

This structure is pretty efficient for immutable data because it has a small memory footprint and allows one to fetch all leafs for a given node without traversals. Nevertheless, inserts and updates are quite costly because the addition of one leaf causes an extensive update of indexes.

Applicability: Key-Value Stores, Document Databases

(15) Nested Documents Flattening: Numbered Field Names

Search Engines typically work with flat documents, i.e. each document is a flat list of fields and values. The goal of data modeling is to map business entities to plain documents and this can be challenging if the entities have a complex internal structure. One typical challenge mapping documents with a hierarchical structure, i.e. documents with nested documents inside. Let’s consider the following example:

Nested Documents Problem

Each business entity is some kind of resume. It contains a person’s name and a list of his or her skills with a skill level. An obvious way to model such an entity is to create a plain document withSkill and Level fields. This model allows one to search for a person by skill or by level, but queries that combine both fields are liable to result in false matches, as depicted in the figure above.

One way to overcome this issue was suggested in [4.6]. The main idea of this technique is to index each skill and corresponding level as a dedicated pair of fields Skill_i and Level_i, and to search for all these pairs simultaneously (where the number of OR-ed terms in a query is as high as the maximum number of skills for one person):

Nested Document Modeling using Numbered Field Names

This approach is not really scalable because query complexity grows rapidly as a function of the number of nested structures.

Applicability: Search Engines

(16) Nested Documents Flattening: Proximity Queries

The problem with nested documents can be solved using another technique that were also described in [4.6]. The idea is to use proximity queries that limit the acceptable distance between words in the document. In the figure below, all skills and levels are indexed in one field, namely, SkillAndLevel, and the query indicates that the words “Excellent” and “Poetry” should follow one another:

Nested Document Modeling using Proximity Queries

[4.3] describes a success story for this technique used on top of Solr.

Applicability: Search Engines

(17) Batch Graph Processing

Graph databases like neo4j are exceptionally good for exploring the neighborhood of a given node or exploring relationships between two or a few nodes. Nevertheless, global processing of large graphs is not very efficient because general purpose graph databases do not scale well. Distributed graph processing can be done using MapReduce and the Message Passing pattern that was described, for example, in one of my previous articles. This approach makes Key-Value stores, Document databases, and BigTable-style databases suitable for processing large graphs.

Applicability: Key-Value Stores, Document Databases, BigTable-style Databases

References

Finally, I provide a list of useful links related to NoSQL data modeling:

javamail pop3 ssl 적용.

ITWeb/개발일반 2012. 5. 2. 20:47

그냥 일반 웹메일에서 외부메일 가져오기로 기능 확인 하려다 삽질만 했내요.
기냥 코딩 할걸..ㅡ.ㅡ;;
- 네이버 메일에서 외부메일 가져오기는 SSL 지원이 없구요.
- G메일에서는 SSL Port 변경이 없내요..

javamail api 를 이용해서 구현 가능합니다.
아래는 샘플 코드 이구요.
퍼왔습니다.

[원본링크]

http://www.java-tips.org/other-api-tips/javamail/connecting-gmail-using-pop3-connection-with-ssl-2.html

[샘플코드]

You can use following utility class to conect to Gmail. Since Gmail only supports POP3 connection with SSL, the connection is established via SSL.

package org.javatipsjavaemaillistimporter;

import com.sun.mail.pop3.POP3SSLStore;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.Date;
import java.util.Properties;
import javax.mail.Address;
import javax.mail.FetchProfile;
import javax.mail.Flags;
import javax.mail.Folder;
import javax.mail.Message;
import javax.mail.MessagingException;
import javax.mail.Multipart;
import javax.mail.Part;
import javax.mail.Session;
import javax.mail.Store;
import javax.mail.URLName;
import javax.mail.internet.ContentType;
import javax.mail.internet.InternetAddress;
import javax.mail.internet.MimeBodyPart;
import javax.mail.internet.ParseException;

public class GmailUtilities {
    
    private Session session = null;
    private Store store = null;
    private String username, password;
    private Folder folder;
    
    public GmailUtilities() {
        
    }
    
    public void setUserPass(String username, String password) {
        this.username = username;
        this.password = password;
    }
    
    public void connect() throws Exception {
        
        String SSL_FACTORY = "javax.net.ssl.SSLSocketFactory";
        
        Properties pop3Props = new Properties();
        
        pop3Props.setProperty("mail.pop3.socketFactory.class", SSL_FACTORY);
        pop3Props.setProperty("mail.pop3.socketFactory.fallback", "false");
        pop3Props.setProperty("mail.pop3.port",  "995");
        pop3Props.setProperty("mail.pop3.socketFactory.port", "995");
        
        URLName url = new URLName("pop3", "pop.gmail.com", 995, "",
                username, password);
        
        session = Session.getInstance(pop3Props, null);
        store = new POP3SSLStore(session, url);
        store.connect();
        
    }
    
    public void openFolder(String folderName) throws Exception {
        
        // Open the Folder
        folder = store.getDefaultFolder();
        
        folder = folder.getFolder(folderName);
        
        if (folder == null) {
            throw new Exception("Invalid folder");
        }
        
        // try to open read/write and if that fails try read-only
        try {
            
            folder.open(Folder.READ_WRITE);
            
        } catch (MessagingException ex) {
            
            folder.open(Folder.READ_ONLY);
            
        }
    }
    
    public void closeFolder() throws Exception {
        folder.close(false);
    }
    
    public int getMessageCount() throws Exception {
        return folder.getMessageCount();
    }
    
    public int getNewMessageCount() throws Exception {
        return folder.getNewMessageCount();
    }
    
    public void disconnect() throws Exception {
        store.close();
    }
    
    public void printMessage(int messageNo) throws Exception {
        System.out.println("Getting message number: " + messageNo);
        
        Message m = null;
        
        try {
            m = folder.getMessage(messageNo);
            dumpPart(m);
        } catch (IndexOutOfBoundsException iex) {
            System.out.println("Message number out of range");
        }
    }
    
    public void printAllMessageEnvelopes() throws Exception {
        
        // Attributes & Flags for all messages ..
        Message[] msgs = folder.getMessages();
        
        // Use a suitable FetchProfile
        FetchProfile fp = new FetchProfile();
        fp.add(FetchProfile.Item.ENVELOPE);        
        folder.fetch(msgs, fp);
        
        for (int i = 0; i < msgs.length; i++) {
            System.out.println("--------------------------");
            System.out.println("MESSAGE #" + (i + 1) + ":");
            dumpEnvelope(msgs[i]);
            
        }
        
    }
    
    public void printAllMessages() throws Exception {
     
        // Attributes & Flags for all messages ..
        Message[] msgs = folder.getMessages();
        
        // Use a suitable FetchProfile
        FetchProfile fp = new FetchProfile();
        fp.add(FetchProfile.Item.ENVELOPE);        
        folder.fetch(msgs, fp);
        
        for (int i = 0; i < msgs.length; i++) {
            System.out.println("--------------------------");
            System.out.println("MESSAGE #" + (i + 1) + ":");
            dumpPart(msgs[i]);
        }
        
    
    }
    
    
    public static void dumpPart(Part p) throws Exception {
        if (p instanceof Message)
            dumpEnvelope((Message)p);
       
        String ct = p.getContentType();
        try {
            pr("CONTENT-TYPE: " + (new ContentType(ct)).toString());
        } catch (ParseException pex) {
            pr("BAD CONTENT-TYPE: " + ct);
        }
        
        /*
         * Using isMimeType to determine the content type avoids
         * fetching the actual content data until we need it.
         */
        if (p.isMimeType("text/plain")) {
            pr("This is plain text");
            pr("---------------------------");
            System.out.println((String)p.getContent());        
        } else {
            
            // just a separator
            pr("---------------------------");
            
        }
    }
    
    public static void dumpEnvelope(Message m) throws Exception {        
        pr(" ");
        Address[] a;
        // FROM
        if ((a = m.getFrom()) != null) {
            for (int j = 0; j < a.length; j++)
                pr("FROM: " + a[j].toString());
        }
        
        // TO
        if ((a = m.getRecipients(Message.RecipientType.TO)) != null) {
            for (int j = 0; j < a.length; j++) {
                pr("TO: " + a[j].toString());                
            }
        }
        
        // SUBJECT
        pr("SUBJECT: " + m.getSubject());
        
        // DATE
        Date d = m.getSentDate();
        pr("SendDate: " +
                (d != null ? d.toString() : "UNKNOWN"));
        

    }
    
    static String indentStr = "                                               ";
    static int level = 0;
    
    /**
     * Print a, possibly indented, string.
     */
    public static void pr(String s) {
        
        System.out.print(indentStr.substring(0, level * 2));
        System.out.println(s);
    }
    
}

And the following code snippet shows how to use the above utility class. You can uncomment printAllMessageEnvelopes() method to just print the envelopes of the messages instead of whole messages.

package org.javatipsjavaemaillistimporter;

public class Main {
    
    /** Creates a new instance of Main */
    public Main() {
    }
    
    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        
        try {
            
            GmailUtilities gmail = new GmailUtilities();
            gmail.setUserPass(" myemail@gmail.com", "mypassword");
            gmail.connect();
            gmail.openFolder("INBOX");
            
            int totalMessages = gmail.getMessageCount();
            int newMessages = gmail.getNewMessageCount();
            
            System.out.println("Total messages = " + totalMessages);
            System.out.println("New messages = " + newMessages);
            System.out.println("-------------------------------");
            
            //gmail.printAllMessageEnvelopes();
            gmail.printAllMessages();
            
        } catch(Exception e) {
            e.printStackTrace();
            System.exit(-1);
        }
        
    }
    
}

오픈소스 MQ, Reporting tool..

ITWeb/개발일반 2012. 5. 2. 08:53

[Message Queue]

- RabbitMQ
- ActiveMQ

[Reporting Tool]
- BIRT Project : http://www.eclipse.org/birt/phoenix/
- JasperReports : http://jasperforge.org/
- Pentaho Reporting Project : http://reporting.pentaho.com/

RabbitMQ 와 BIRT 를 추천 한다고 합니다. :)

JavaMail 링크

ITWeb/개발일반 2012. 4. 19. 15:42

메일서비스 MUA 구현을 위한 API 검토..
뭐 JavaMail 사용하면 될 것 같다.

http://www.oracle.com/technetwork/java/javamail/index.html
http://java.sun.com/developer/onlineTraining/JavaMail/contents.html
http://javamail.kenai.com/nonav/javadocs/index.html

subclipse 설치 링크.

ITWeb/개발일반 2012. 4. 16. 20:18

subversion 1.7 을 설치 했더니 subclipse 버전도 올리라 그러는군요.. ㅋ

Links for 1.8.x Release:
Changelog: http://subclipse.tigris.org/subclipse_1.8.x/changes.html
Eclipse update site URL: http://subclipse.tigris.org/update_1.8.x
Zipped downloads: http://subclipse.tigris.org/servlets/ProjectDocumentList?folderID=2240

Links for 1.6.x Release:
Changelog: http://subclipse.tigris.org/subclipse_1.6.x/changes.html
Eclipse update site URL: http://subclipse.tigris.org/update_1.6.x
Zipped downloads: http://subclipse.tigris.org/servlets/ProjectDocumentList?folderID=2240

Links for 1.4.x Release:
Changelog: http://subclipse.tigris.org/subclipse_1.4.x/changes.html
Eclipse update site URL: http://subclipse.tigris.org/update_1.4.x

Zipped downloads: http://subclipse.tigris.org/servlets/ProjectDocumentList?folderID=2240

그래서 1.8.x 로 올렸습니다.. ㅎㅎ

10 Points about Java Heap Space

ITWeb/개발일반 2012. 4. 10. 16:27

JVM Heap 관련 글 찾던중.. 하나 keep..

[원본링크]

http://javarevisited.blogspot.com/2011/05/java-heap-space-memory-size-jvm.html

[원본글]

10 Points about Java Heap Space

1. Java Heap Memory is part of Memory allocated to JVM by Operating System.

2. Whenever we create objects they are created inside Heap in Java.

3. Java Heap space is divided into three regions or generation for sake of garbage collection called New Generation, Old or tenured Generation or Perm Space.

4. You can increase or change size of Java Heap space by using JVM command line option -Xms, -Xmx and -Xmn. don't forget to add word "M" or "G" after specifying size to indicate Mega or Giga. for example you can set java heap size to 258MB by executing following command java -Xmx256m HelloWord.

5. You can use either JConsole or Runtime.maxMemory(), Runtime.totalMemory(), Runtime.freeMemory() to query about Heap size programmatic in Java.

6. You can use command "jmap" to take Heap dump in Java and "jhat" to analyze that heap dump.

7. Java Heap space is different than Stack which is used to store call hierarchy and local variables.

8. Java Garbage collector is responsible for reclaiming memory from dead object and returning to Java Heap space.

9. Don’t panic when you get java.lang.outofmemoryerror, sometimes its just matter of increasing heap size but if it’s recurrent then look for memory leak in Java.

10. Use Profiler and Heap dump Analyzer tool to understand Java Heap space and how much memory is allocated to each object.

전자정부 표준프레임워크에서 GPKI 빼고 Component 테스트하기.

ITWeb/개발일반 2012. 3. 26. 15:29

GPKI 는 별도로 신청을 해야 하기 때문에 기존 Component 에서 dependency 가 있는 것들을 테스트 할때 불편한 점이 있습니다.
그래서 걍 해당 package 를 삭제해버리고 ID/PWD 방식으로만 테스트 하시면 충분히 원하시는 걸 확인 하실 수 있을 겁니다.

[삭제 패키지]

package egovframework.com.sec.pki.*

※ interface 와 implement 두개 파일이 있는데 삭제 하시고 실행 하시면 됩니다.

혹시라도 전체 Component 를 선택해서 테스트 하실때 아마도 SMS 관련 부분도 오류가 날텐데요.
이것도 그냥 삭제 하시고 테스트 하시면 됩니다.

[삭제 항목들...]

[패키지]
package egovframework.com.cop.sms.*
package egovframework.com.utl.sys.srm.*

[리소스]
resources/egovframework/sqlmap/com/cop/sms
resources/egovframework/sqlmap/com/utl/sys/srm
resources/egovframework/sqlmap/config/mysql/sql-map-config-mysql-cop-sms.xml
resources/egovframework/sqlmap/config/mysql/context-scheduling-cop-sms.xml
resources/egovframework/sqlmap/config/mysql/sql-map-config-mysql-utl-sys-srm.xml

[bean 설정]
resources/egovframework/spring/com/context-idgen.xml

resources/egovframework/spring/com/context-scheduling-utl-sys-srm.xml

※ 이도저도 다 귀찮으시다면 걍 component 에서 SMS 관련된거랑 시스템 모니터링 관련 된거 선택을 하지 않으시면 됩니다.

전자정부 표준프레임워크에서 실명인증 빼기.

ITWeb/개발일반 2012. 3. 23. 16:08

그냥 ID/PWD 방식으로만 사용을 해보려고 Q/A 에 올렸었으나 그 전에 처리했던 내용 올려봅니다.
뭐 어려울건 하나도 없고 기냥 회원가입 시 실명 인증 부분을 스킵하고 넘어가면 됩니다.

제가 테스트 하고 확인 없이 copy & paste 를 해서 넣었더니.. 잘못된 내용이 들어갔내요.. ^^;;
정정합니다.

[관련패키지]

~~package egovframework.com.dam.per.web;~~
package egovframework.com.sec.rnc.web;

[관련Controller]

[EgovRlnmManageController.java]
- 아래 코드에서 빨간색 부분을 주석 처리 하면 jsp 단에서 실명처리 된것 처럼 넘어 갈 수 있습니다.
- 또는 걍.. 회원가입 링크를 바로 회원정보 입력 페이지로 넘기셔도 됩니다.

뭐 어려울게 하나도 없죠.. ^^;

◀ PREV : [1] : [···] : [28] : [29] : [30] : [31] : [32] : [33] : [34] : [···] : [49] : NEXT ▶

'ITWeb/개발일반'에 해당되는 글 489건

Regular Expressions - User Guide

Contents

A Gentle Introduction: The Basics

Some Definitions before we start

Our Example Target Strings

Simple Matching

Brackets, Ranges and Negation

Metacharacter

Meaning

Positioning (or Anchors)

Metacharacter

Meaning

Iteration 'metacharacters'

Metacharacter

Meaning

More 'metacharacters'

Metacharacter

Meaning

More Stuff

Contents

POSIX Character Class Definitions

Value

Meaning

Common Extensions and Abbreviations

Submatches, Groups and Backreferences

Regular Expression - Experiments and Testing

Some Examples

Apache Browser Identification - a Worked Example

Common Examples

Utility and Language Notes - General

Utility Notes - Using Visual Studio

Utility Notes - Using sed

1. Use Single Big Documents

2. Unusual Indices

3. Pre-Allocate for Locality

Conclusions

NoSQL Data Modeling Techniques

General Notes on NoSQL Data Modeling

Conceptual Techniques

(1) Denormalization

(2) Aggregates

(3) Application Side Joins

General Modeling Techniques

(4) Atomic Aggregates

(5) Enumerable Keys

(6) Dimensionality Reduction

(7) Index Table

(8) Composite Key Index

(9) Aggregation with Composite Keys

(10) Inverted Search – Direct Aggregation

Hierarchy Modeling Techniques

(11) Tree Aggregation

(12) Adjacency Lists

(13) Materialized Paths

(14) Nested Sets

(15) Nested Documents Flattening: Numbered Field Names

(16) Nested Documents Flattening: Proximity Queries

(17) Batch Graph Processing

References

10 Points about Java Heap Space

티스토리툴바