71 posts tagged 'lucene'

  1. 2013.09.16 [Elasticsearch] General notes on sizing and configuration.
  2. 2013.08.23 [Elasticsearch] Highlight feature.
  3. 2013.08.21 [Lucene] Analysis JavaDoc
  4. 2013.07.12 [lucene score] (repost) Score calculation formula.
  5. 2013.06.19 [lucene] phrase query
  6. 2013.05.27 [lucene] field options for indexing - StringField.java
  7. 2013.05.16 [Lucene] Apache Lucene - Index File Formats (rough translation)
  8. 2013.04.19 [Elasticsearch] Plugins - building a site plugin and a custom analyzer plugin 1
  9. 2013.01.24 Korean morphological analyzer for Lucene: from lucene-core 3.2 to 3.6
  10. 2013.01.23 Lucene 2.4.3 Field options for term vectors

[Elasticsearch] General notes on sizing and configuration.

Elastic/Elasticsearch 2013. 9. 16. 10:09

These are general guidelines for putting together an Elasticsearch cluster.

The real answer is always to configure for the characteristics and purpose of your own service.

Below are the ES settings and config items that relate to server sizing and performance.


[Shard & Replica]

- Shard size guidelines

The number of shards should be greater than the number of replicas.

Size each shard at roughly 10G to 100G.


- Replica size guidelines

General rule: ceil(N/2) + 1

The easiest way to raise search performance is full replication (every node holds a complete copy).

Set replicas to match the characteristics of your service; 1 or 2 is typical (a small example follows below).
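As a rough illustration (not from the original post; the index name and the numbers are made up), shard and replica counts are set per index:

curl -XPUT http://localhost:9200/my_index -d '{
    "settings" : {
        "index" : {
            "number_of_shards" : 5,
            "number_of_replicas" : 1
        }
    }
}'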


[System Recommended]

- In general it is better to size the cluster around your documents, search patterns, and indexing characteristics than around raw hardware specs.

- Physical servers are recommended where possible, but VMs also work without much trouble.

- Minimum specs for a service (some services run on lower-spec servers than this):

CPU: 2.4GHz quad-core

RAM: 12G


[Search Engine Settings]

- index settings

refresh_interval

term_index_interval

merge

store

Configure the settings above to match your search and indexing characteristics.


- elasticsearch.yml

bootstrap.mlockall

Set this to true to prevent the process from swapping.

index.fielddata.cache

resident, node, soft 

In 0.90.3 the default value is node.

If performance is not the top priority, soft is a good choice;

if performance comes first, use resident or node.

indices.fielddata.cache.size

30~40% of the heap size

indices.cache.filter.size

index.translog

These are transaction log settings for write operations.

The type is one of simple or buffer.

If stability is the priority, use simple.

indices.memory.index_buffer_size

Set this to 20~30% of total memory.

threadpool

It is best to configure the search and index pools separately.

These settings directly affect performance; a sketch of how they might look in elasticsearch.yml follows.
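The fragment below is only an illustration, not from the original post; the exact keys depend on the ES version (0.90.x here), and the right values depend on your hardware and workload.

# elasticsearch.yml (illustrative values)
bootstrap.mlockall: true                  # lock the JVM heap in memory to prevent swapping

index.fielddata.cache: node               # resident | node | soft (node is the 0.90.3 default)
indices.fielddata.cache.size: 30%         # roughly 30~40% of the heap
indices.cache.filter.size: 10%            # filter cache, tune to the query mix

index.translog.fs.type: simple            # simple (safer) or buffered
indices.memory.index_buffer_size: 20%     # roughly 20~30%

threadpool.search.type: fixed             # per-pool settings for search ...
threadpool.search.size: 12
threadpool.index.type: fixed              # ... and for indexing
threadpool.index.size: 8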



:

[Elasticsearch] Highlight feature.

Elastic/Elasticsearch 2013. 8. 23. 16:11

[lucene]

Highlighter.java

FastVectorHighlighter.java


[elasticsearch]

PlainHighlighter.java

FastVectorHighlighter.java


The method that actually wraps a term with the highlight tags is highlightTerm().

SimpleHTMLFormatter.java

GradientFormatter.java

See these for reference.


To do their highlighting, these classes need two basic pieces of information:

CharTermAttribute.java

OffsetAttribute.java


You can see how they work by reading the source code.

In short:

1. Fetch the stored original text.

2. Reconstruct the text using the char term and offset information.

2.1 During reconstruction, highlightTerm() produces the tagged version of each matched term.


For the details, reading the source is good for your health. :) A rough sketch of how the pieces fit together follows.
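The snippet below is only a sketch of the Lucene side (the field name "content" and the surrounding method are made up); Highlighter, QueryScorer, and SimpleHTMLFormatter come from the lucene-highlighter module, and SimpleHTMLFormatter.highlightTerm() is what wraps each matched term.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;

public class HighlightSketch {
  public static String highlight(Query query, Analyzer analyzer, String storedText)
      throws java.io.IOException, InvalidTokenOffsetsException {
    SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("<em>", "</em>");
    Highlighter highlighter = new Highlighter(formatter, new QueryScorer(query));
    // Re-analyzes the stored text; CharTermAttribute and OffsetAttribute drive the tagging.
    return highlighter.getBestFragment(analyzer, "content", storedText);
  }
}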

:

[Lucene] Analysis JavaDoc

Elastic/Elasticsearch 2013. 8. 21. 10:34

The Lucene package documentation explains this well.

If you need to write your own analyzer, tokenizer, or filter, it is worth a read.

The parts I mainly looked at are

- Invoking the analyzer

- TokenStream API

These two sections alone should be enough to understand it fairly easily.


[Original]

http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/analysis/package-summary.html


Package org.apache.lucene.analysis Description

API and code to convert text into indexable/searchable tokens. Covers Analyzer and related classes.

Parsing? Tokenization? Analysis!

Lucene, an indexing and search library, accepts only plain text input.

Parsing

Applications that build their search capabilities upon Lucene may support documents in various formats – HTML, XML, PDF, Word – just to name a few. Lucene does not care about the Parsing of these and other document formats, and it is the responsibility of the application using Lucene to use an appropriate Parser to convert the original format into plain text before passing that plain text to Lucene.

Tokenization

Plain text passed to Lucene for indexing goes through a process generally called tokenization. Tokenization is the process of breaking input text into small indexing elements – tokens. The way input text is broken into tokens heavily influences how people will then be able to search for that text. For instance, sentence beginnings and endings can be identified to provide for more accurate phrase and proximity searches (though sentence identification is not provided by Lucene).

In some cases simply breaking the input text into tokens is not enough – a deeper Analysis may be needed. Lucene includes both pre- and post-tokenization analysis facilities.

Pre-tokenization analysis can include (but is not limited to) stripping HTML markup, and transforming or removing text matching arbitrary patterns or sets of fixed strings.

There are many post-tokenization steps that can be done, including (but not limited to):

  • Stemming – Replacing words with their stems. For instance with English stemming "bikes" is replaced with "bike"; now query "bike" can find both documents containing "bike" and those containing "bikes".
  • Stop Words Filtering – Common words like "the", "and" and "a" rarely add any value to a search. Removing them shrinks the index size and increases performance. It may also reduce some "noise" and actually improve search quality.
  • Text Normalization – Stripping accents and other character markings can make for better searching.
  • Synonym Expansion – Adding in synonyms at the same token position as the current word can mean better matching when users search with words in the synonym set.

Core Analysis

The analysis package provides the mechanism to convert Strings and Readers into tokens that can be indexed by Lucene. There are four main classes in the package from which all analysis processes are derived. These are:

  • Analyzer – An Analyzer is responsible for building a TokenStream which can be consumed by the indexing and searching processes. See below for more information on implementing your own Analyzer.
  • CharFilter – CharFilter extends Reader to perform pre-tokenization substitutions, deletions, and/or insertions on an input Reader's text, while providing corrected character offsets to account for these modifications. This capability allows highlighting to function over the original text when indexed tokens are created from CharFilter-modified text with offsets that are not the same as those in the original text. Tokenizers' constructors and reset() methods accept a CharFilter. CharFilters may be chained to perform multiple pre-tokenization modifications.
  • Tokenizer – A Tokenizer is a TokenStream and is responsible for breaking up incoming text into tokens. In most cases, an Analyzer will use a Tokenizer as the first step in the analysis process. However, to modify text prior to tokenization, use a CharStream subclass (see above).
  • TokenFilter – A TokenFilter is also a TokenStream and is responsible for modifying tokens that have been created by the Tokenizer. Common modifications performed by a TokenFilter are: deletion, stemming, synonym injection, and down casing. Not all Analyzers require TokenFilters.

Hints, Tips and Traps

The synergy between Analyzer and Tokenizer is sometimes confusing. To ease this confusion, some clarifications:

Lucene Java provides a number of analysis capabilities, the most commonly used one being the StandardAnalyzer. Many applications will have a long and industrious life with nothing more than the StandardAnalyzer. However, there are a few other classes/packages that are worth mentioning:

  1. PerFieldAnalyzerWrapper – Most Analyzers perform the same operation on all Fields. The PerFieldAnalyzerWrapper can be used to associate a different Analyzer with different Fields.
  2. The analysis library located at the root of the Lucene distribution has a number of different Analyzer implementations to solve a variety of different problems related to searching. Many of the Analyzers are designed to analyze non-English languages.
  3. There are a variety of Tokenizer and TokenFilter implementations in this package. Take a look around, chances are someone has implemented what you need.

Analysis is one of the main causes of performance degradation during indexing. Simply put, the more you analyze the slower the indexing (in most cases). Perhaps your application would be just fine using the simple WhitespaceTokenizer combined with a StopFilter. The benchmark/ library can be useful for testing out the speed of the analysis process.

Invoking the Analyzer

Applications usually do not invoke analysis – Lucene does it for them:

  • At indexing, as a consequence of addDocument(doc), the Analyzer in effect for indexing is invoked for each indexed field of the added document.
  • At search, a QueryParser may invoke the Analyzer during parsing. Note that for some queries, analysis does not take place, e.g. wildcard queries.

However an application might invoke Analysis of any text for testing or for any other purpose, something like:

    Version matchVersion = Version.LUCENE_XY; // Substitute desired Lucene version for XY
    Analyzer analyzer = new StandardAnalyzer(matchVersion); // or any other analyzer
    TokenStream ts = analyzer.tokenStream("myfield", new StringReader("some text goes here"));
    OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
    
    try {
      ts.reset(); // Resets this stream to the beginning. (Required)
      while (ts.incrementToken()) {
        // Use AttributeSource.reflectAsString(boolean)
        // for token stream debugging.
        System.out.println("token: " + ts.reflectAsString(true));

        System.out.println("token start offset: " + offsetAtt.startOffset());
        System.out.println("  token end offset: " + offsetAtt.endOffset());
      }
      ts.end();   // Perform end-of-stream operations, e.g. set the final offset.
    } finally {
      ts.close(); // Release resources associated with this stream.
    }

Indexing Analysis vs. Search Analysis

Selecting the "correct" analyzer is crucial for search quality, and can also affect indexing and search performance. The "correct" analyzer differs between applications. Lucene java's wiki page AnalysisParalysis provides some data on "analyzing your analyzer". Here are some rules of thumb:

  1. Test test test... (did we say test?)
  2. Beware of over analysis – might hurt indexing performance.
  3. Start with same analyzer for indexing and search, otherwise searches would not find what they are supposed to...
  4. In some cases a different analyzer is required for indexing and search, for instance:
    • Certain searches require more stop words to be filtered. (I.e. more than those that were filtered at indexing.)
    • Query expansion by synonyms, acronyms, auto spell correction, etc.
    This might sometimes require a modified analyzer – see the next section on how to do that.

Implementing your own Analyzer

Creating your own Analyzer is straightforward. Your Analyzer can wrap existing analysis components — CharFilter(s) (optional), a Tokenizer, and TokenFilter(s) (optional) — or components you create, or a combination of existing and newly created components. Before pursuing this approach, you may find it worthwhile to explore the analyzers-common library and/or ask on the java-user@lucene.apache.org mailing list first to see if what you need already exists. If you are still committed to creating your own Analyzer, have a look at the source code of any one of the many samples located in this package.

The following sections discuss some aspects of implementing your own analyzer.

Field Section Boundaries

When document.add(field) is called multiple times for the same field name, we could say that each such call creates a new section for that field in that document. In fact, a separate call to tokenStream(field,reader) would take place for each of these so called "sections". However, the default Analyzer behavior is to treat all these sections as one large section. This allows phrase search and proximity search to seamlessly cross boundaries between these "sections". In other words, if a certain field "f" is added like this:

    document.add(new Field("f","first ends",...);
    document.add(new Field("f","starts two",...);
    indexWriter.addDocument(document);

Then, a phrase search for "ends starts" would find that document. Where desired, this behavior can be modified by introducing a "position gap" between consecutive field "sections", simply by overriding Analyzer.getPositionIncrementGap(fieldName):

  Version matchVersion = Version.LUCENE_XY; // Substitute desired Lucene version for XY
  Analyzer myAnalyzer = new StandardAnalyzer(matchVersion) {
    public int getPositionIncrementGap(String fieldName) {
      return 10;
    }
  };

Token Position Increments

By default, all tokens created by Analyzers and Tokenizers have a position increment of one. This means that the position stored for that token in the index would be one more than that of the previous token. Recall that phrase and proximity searches rely on position info.

If the selected analyzer filters the stop words "is" and "the", then for a document containing the string "blue is the sky", only the tokens "blue", "sky" are indexed, with position("sky") = 3 + position("blue"). Now, a phrase query "blue is the sky" would find that document, because the same analyzer filters the same stop words from that query. But the phrase query "blue sky" would not find that document because the position increment between "blue" and "sky" is only 1.

If this behavior does not fit the application needs, the query parser needs to be configured to not take position increments into account when generating phrase queries.
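For example, a minimal sketch (the analyzer variable is assumed; the classic QueryParser exposes this switch directly):

  Version matchVersion = Version.LUCENE_XY; // Substitute desired Lucene version for XY
  QueryParser parser = new QueryParser(matchVersion, "f", analyzer);
  parser.setEnablePositionIncrements(false); // phrase queries then ignore gaps left by removed stop words
  Query query = parser.parse("\"blue sky\"");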

Note that a StopFilter MUST increment the position increment in order not to generate corrupt tokenstream graphs. Here is the logic used by StopFilter to increment positions when filtering out tokens:

  public TokenStream tokenStream(final String fieldName, Reader reader) {
    final TokenStream ts = someAnalyzer.tokenStream(fieldName, reader);
    TokenStream res = new TokenStream() {
      CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);

      public boolean incrementToken() throws IOException {
        int extraIncrement = 0;
        while (true) {
          boolean hasNext = ts.incrementToken();
          if (hasNext) {
            if (stopWords.contains(termAtt.toString())) {
              extraIncrement += posIncrAtt.getPositionIncrement(); // filter this word
              continue;
            } 
            if (extraIncrement>0) {
              posIncrAtt.setPositionIncrement(posIncrAtt.getPositionIncrement()+extraIncrement);
            }
          }
          return hasNext;
        }
      }
    };
    return res;
  }

A few more use cases for modifying position increments are:

  1. Inhibiting phrase and proximity matches in sentence boundaries – for this, a tokenizer that identifies a new sentence can add 1 to the position increment of the first token of the new sentence.
  2. Injecting synonyms – here, synonyms of a token should be added after that token, and their position increment should be set to 0. As a result, all synonyms of a token are considered to appear at exactly the same position as that token, and that is how phrase and proximity searches see them. A minimal sketch follows this list.
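The filter below is only a sketch of the second case, assuming a hypothetical one-to-one synonym map (the class name and the SYNONYMS table are made up; production code should use the SynonymFilter that ships with Lucene):

import java.io.IOException;
import java.util.Collections;
import java.util.Map;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public final class SingleSynonymFilter extends TokenFilter {
  // Hypothetical one-to-one synonym table, for illustration only.
  private static final Map<String, String> SYNONYMS = Collections.singletonMap("fast", "quick");

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
  private State savedState;        // captured state of the original token
  private String pendingSynonym;   // synonym waiting to be emitted

  public SingleSynonymFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (pendingSynonym != null) {
      restoreState(savedState);                 // reuse the offsets of the original token
      termAtt.setEmpty().append(pendingSynonym);
      posIncrAtt.setPositionIncrement(0);       // same position as the original token
      pendingSynonym = null;
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    String synonym = SYNONYMS.get(termAtt.toString());
    if (synonym != null) {
      savedState = captureState();              // remember the token we just returned
      pendingSynonym = synonym;
    }
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    savedState = null;
    pendingSynonym = null;
  }
}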

Token Position Length

By default, all tokens created by Analyzers and Tokenizers have a position length of one. This means that the token occupies a single position. This attribute is not indexed and thus not taken into account for positional queries, but is used by e.g. suggesters.

The main use case for positions lengths is multi-word synonyms. With single-word synonyms, setting the position increment to 0 is enough to denote the fact that two words are synonyms, for example:

Term                 red    magenta
Position increment   1      0

Given that position(magenta) = 0 + position(red), they are at the same position, so anything working with analyzers will return the exact same result if you replace "magenta" with "red" in the input. However, multi-word synonyms are more tricky. Let's say that you want to build a TokenStream where "IBM" is a synonym of "International Business Machines". Position increments are not enough anymore:

Term                 IBM    International   Business   Machines
Position increment   1      0               1          1

The problem with this token stream is that "IBM" is at the same position as "International" although it is a synonym with "International Business Machines" as a whole. Setting the position increment of "Business" and "Machines" to 0 wouldn't help as it would mean that "International" is a synonym of "Business". The only way to solve this issue is to make "IBM" span across 3 positions, and this is where position lengths come to the rescue.

Term                 IBM    International   Business   Machines
Position increment   1      0               1          1
Position length      3      1               1          1

This new attribute makes clear that "IBM" and "International Business Machines" start and end at the same positions.

How to not write corrupt token streams

There are a few rules to observe when writing custom Tokenizers and TokenFilters:

  • The first position increment must be > 0.
  • Positions must not go backward.
  • Tokens that have the same start position must have the same start offset.
  • Tokens that have the same end position (taking into account the position length) must have the same end offset.

Although these rules might seem easy to follow, problems can quickly happen when chaining badly implemented filters that play with positions and offsets, such as synonym or n-grams filters. Here are good practices for writing correct filters:

  • Token filters should not modify offsets. If you feel that your filter would need to modify offsets, then it should probably be implemented as a tokenizer.
  • Token filters should not insert positions. If a filter needs to add tokens, then they should all have a position increment of 0.
  • When they remove tokens, token filters should increment the position increment of the following token.
  • Token filters should preserve position lengths.

TokenStream API

"Flexible Indexing" summarizes the effort of making the Lucene indexer pluggable and extensible for custom index formats. A fully customizable indexer means that users will be able to store custom data structures on disk. Therefore an API is necessary that can transport custom types of data from the documents to the indexer.

Attribute and AttributeSource

Classes Attribute and AttributeSource serve as the basis upon which the analysis elements of "Flexible Indexing" are implemented. An Attribute holds a particular piece of information about a text token. For example, CharTermAttribute contains the term text of a token, and OffsetAttribute contains the start and end character offsets of a token. An AttributeSource is a collection of Attributes with a restriction: there may be only one instance of each attribute type. TokenStream now extends AttributeSource, which means that one can add Attributes to a TokenStream. Since TokenFilter extends TokenStream, all filters are also AttributeSources.

Lucene provides seven Attributes out of the box:

  • CharTermAttribute: The term text of a token. Implements CharSequence (providing methods length() and charAt(), and allowing e.g. for direct use with regular expression Matchers) and Appendable (allowing the term text to be appended to).
  • OffsetAttribute: The start and end offset of a token in characters.
  • PositionIncrementAttribute: See above for detailed information about position increments.
  • PositionLengthAttribute: The number of positions occupied by a token.
  • PayloadAttribute: The payload that a Token can optionally have.
  • TypeAttribute: The type of the token. Default is 'word'.
  • FlagsAttribute: Optional flags a token can have.
  • KeywordAttribute: Keyword-aware TokenStreams/-Filters skip modification of tokens that return true from this attribute's isKeyword() method.

Using the TokenStream API

There are a few important things to know in order to use the new API efficiently which are summarized here. You may want to walk through the example below first and come back to this section afterwards.
  1. Please keep in mind that an AttributeSource can only have one instance of a particular Attribute. Furthermore, if a chain of a TokenStream and multiple TokenFilters is used, then all TokenFilters in that chain share the Attributes with the TokenStream.

  2. Attribute instances are reused for all tokens of a document. Thus, a TokenStream/-Filter needs to update the appropriate Attribute(s) in incrementToken(). The consumer, commonly the Lucene indexer, consumes the data in the Attributes and then calls incrementToken() again until it returns false, which indicates that the end of the stream was reached. This means that in each call of incrementToken() a TokenStream/-Filter can safely overwrite the data in the Attribute instances.

  3. For performance reasons a TokenStream/-Filter should add/get Attributes during instantiation; i.e., create an attribute in the constructor and store references to it in an instance variable. Using an instance variable instead of calling addAttribute()/getAttribute() in incrementToken() will avoid attribute lookups for every token in the document.

  4. All methods in AttributeSource are idempotent, which means calling them multiple times always yields the same result. This is especially important to know for addAttribute(). The method takes the type (Class) of an Attribute as an argument and returns an instance. If an Attribute of the same type was previously added, then the already existing instance is returned, otherwise a new instance is created and returned. Therefore TokenStreams/-Filters can safely call addAttribute() with the same Attribute type multiple times. Even consumers of TokenStreams should normally call addAttribute() instead of getAttribute(), because it would not fail if the TokenStream does not have this Attribute (getAttribute() would throw an IllegalArgumentException, if the Attribute is missing). More advanced code could simply check with hasAttribute(), if a TokenStream has it, and may conditionally leave out processing for extra performance.

Example

In this example we will create a WhitespaceTokenizer and use a LengthFilter to suppress all words that have only two or fewer characters. The LengthFilter is part of the Lucene core and its implementation will be explained here to illustrate the usage of the TokenStream API.

Then we will develop a custom Attribute, a PartOfSpeechAttribute, and add another filter to the chain which utilizes the new custom attribute, and call it PartOfSpeechTaggingFilter.

Whitespace tokenization

public class MyAnalyzer extends Analyzer {

  private Version matchVersion;
  
  public MyAnalyzer(Version matchVersion) {
    this.matchVersion = matchVersion;
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    return new TokenStreamComponents(new WhitespaceTokenizer(matchVersion, reader));
  }
  
  public static void main(String[] args) throws IOException {
    // text to tokenize
    final String text = "This is a demo of the new TokenStream API";
    
    Version matchVersion = Version.LUCENE_XY; // Substitute desired Lucene version for XY
    MyAnalyzer analyzer = new MyAnalyzer(matchVersion);
    TokenStream stream = analyzer.tokenStream("field", new StringReader(text));
    
    // get the CharTermAttribute from the TokenStream
    CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);

    try {
      stream.reset();
    
      // print all tokens until stream is exhausted
      while (stream.incrementToken()) {
        System.out.println(termAtt.toString());
      }
    
      stream.end();
    } finally {
      stream.close();
    }
  }
}
In this easy example a simple white space tokenization is performed. In main() a loop consumes the stream and prints the term text of the tokens by accessing the CharTermAttribute that the WhitespaceTokenizer provides. Here is the output:
This
is
a
demo
of
the
new
TokenStream
API

Adding a LengthFilter

We want to suppress all tokens that have 2 or less characters. We can do that easily by adding a LengthFilter to the chain. Only the createComponents() method in our analyzer needs to be changed:
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    final Tokenizer source = new WhitespaceTokenizer(matchVersion, reader);
    TokenStream result = new LengthFilter(matchVersion, source, 3, Integer.MAX_VALUE);
    return new TokenStreamComponents(source, result);
  }
Note how now only words with 3 or more characters are contained in the output:
This
demo
the
new
TokenStream
API
Now let's take a look at how the LengthFilter is implemented:
public final class LengthFilter extends FilteringTokenFilter {

  private final int min;
  private final int max;
  
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  /**
   * Create a new LengthFilter. This will filter out tokens whose
   * CharTermAttribute is either too short
   * (< min) or too long (> max).
   * @param version the Lucene match version
   * @param in      the TokenStream to consume
   * @param min     the minimum length
   * @param max     the maximum length
   */
  public LengthFilter(Version version, TokenStream in, int min, int max) {
    super(version, in);
    this.min = min;
    this.max = max;
  }

  @Override
  public boolean accept() {
    final int len = termAtt.length();
    return (len >= min && len <= max);
  }

}

In LengthFilter, the CharTermAttribute is added and stored in the instance variable termAtt. Remember that there can only be a single instance of CharTermAttribute in the chain, so in our example the addAttribute() call in LengthFilter returns the CharTermAttribute that the WhitespaceTokenizer already added.

The tokens are retrieved from the input stream in FilteringTokenFilter's incrementToken() method (see below), which calls LengthFilter's accept() method. By looking at the term text in the CharTermAttribute, the length of the term can be determined and tokens that are either too short or too long are skipped. Note how accept() can efficiently access the instance variable; no attribute lookup is necessary. The same is true for the consumer, which can simply use local references to the Attributes.

LengthFilter extends FilteringTokenFilter:

public abstract class FilteringTokenFilter extends TokenFilter {

  private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);

  /**
   * Create a new FilteringTokenFilter.
   * @param in      the TokenStream to consume
   */
  public FilteringTokenFilter(Version version, TokenStream in) {
    super(in);
  }

  /** Override this method and return if the current input token should be returned by incrementToken. */
  protected abstract boolean accept() throws IOException;

  @Override
  public final boolean incrementToken() throws IOException {
    int skippedPositions = 0;
    while (input.incrementToken()) {
      if (accept()) {
        if (skippedPositions != 0) {
          posIncrAtt.setPositionIncrement(posIncrAtt.getPositionIncrement() + skippedPositions);
        }
        return true;
      }
      skippedPositions += posIncrAtt.getPositionIncrement();
    }
    // reached EOS -- return false
    return false;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
  }

}

Adding a custom Attribute

Now we're going to implement our own custom Attribute for part-of-speech tagging and call it consequently PartOfSpeechAttribute. First we need to define the interface of the new Attribute:
  public interface PartOfSpeechAttribute extends Attribute {
    public static enum PartOfSpeech {
      Noun, Verb, Adjective, Adverb, Pronoun, Preposition, Conjunction, Article, Unknown
    }
  
    public void setPartOfSpeech(PartOfSpeech pos);
  
    public PartOfSpeech getPartOfSpeech();
  }

Now we also need to write the implementing class. The name of that class is important here: By default, Lucene checks if there is a class with the name of the Attribute with the suffix 'Impl'. In this example, we would consequently call the implementing class PartOfSpeechAttributeImpl.

This should be the usual behavior. However, there is also an expert-API that allows changing these naming conventions: AttributeSource.AttributeFactory. The factory accepts an Attribute interface as argument and returns an actual instance. You can implement your own factory if you need to change the default behavior.

Now here is the actual class that implements our new Attribute. Notice that the class has to extend AttributeImpl:

public final class PartOfSpeechAttributeImpl extends AttributeImpl 
                                  implements PartOfSpeechAttribute {
  
  private PartOfSpeech pos = PartOfSpeech.Unknown;
  
  public void setPartOfSpeech(PartOfSpeech pos) {
    this.pos = pos;
  }
  
  public PartOfSpeech getPartOfSpeech() {
    return pos;
  }

  @Override
  public void clear() {
    pos = PartOfSpeech.Unknown;
  }

  @Override
  public void copyTo(AttributeImpl target) {
    ((PartOfSpeechAttribute) target).setPartOfSpeech(pos);
  }
}

This simple Attribute implementation has only a single variable that stores the part-of-speech of a token. It extends the AttributeImpl class and therefore implements its abstract methods clear() and copyTo(). Now we need a TokenFilter that can set this new PartOfSpeechAttribute for each token. In this example we show a very naive filter that tags every word with a leading upper-case letter as a 'Noun' and all other words as 'Unknown'.

  public static class PartOfSpeechTaggingFilter extends TokenFilter {
    PartOfSpeechAttribute posAtt = addAttribute(PartOfSpeechAttribute.class);
    CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    
    protected PartOfSpeechTaggingFilter(TokenStream input) {
      super(input);
    }
    
    public boolean incrementToken() throws IOException {
      if (!input.incrementToken()) {return false;}
      posAtt.setPartOfSpeech(determinePOS(termAtt.buffer(), 0, termAtt.length()));
      return true;
    }
    
    // determine the part of speech for the given term
    protected PartOfSpeech determinePOS(char[] term, int offset, int length) {
      // naive implementation that tags every uppercased word as noun
      if (length > 0 && Character.isUpperCase(term[0])) {
        return PartOfSpeech.Noun;
      }
      return PartOfSpeech.Unknown;
    }
  }

Just like the LengthFilter, this new filter stores references to the attributes it needs in instance variables. Notice how you only need to pass in the interface of the new Attribute and instantiating the correct class is automatically taken care of.

Now we need to add the filter to the chain in MyAnalyzer:

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    final Tokenizer source = new WhitespaceTokenizer(matchVersion, reader);
    TokenStream result = new LengthFilter(matchVersion, source, 3, Integer.MAX_VALUE);
    result = new PartOfSpeechTaggingFilter(result);
    return new TokenStreamComponents(source, result);
  }
Now let's look at the output:
This
demo
the
new
TokenStream
API
Apparently it hasn't changed, which shows that adding a custom attribute to a TokenStream/Filter chain does not affect any existing consumers, simply because they don't know the new Attribute. Now let's change the consumer to make use of the new PartOfSpeechAttribute and print it out:
  public static void main(String[] args) throws IOException {
    // text to tokenize
    final String text = "This is a demo of the new TokenStream API";
    
    Version matchVersion = Version.LUCENE_XY; // Substitute desired Lucene version for XY
    MyAnalyzer analyzer = new MyAnalyzer(matchVersion);
    TokenStream stream = analyzer.tokenStream("field", new StringReader(text));
    
    // get the CharTermAttribute from the TokenStream
    CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
    
    // get the PartOfSpeechAttribute from the TokenStream
    PartOfSpeechAttribute posAtt = stream.addAttribute(PartOfSpeechAttribute.class);

    try {
      stream.reset();

      // print all tokens until stream is exhausted
      while (stream.incrementToken()) {
        System.out.println(termAtt.toString() + ": " + posAtt.getPartOfSpeech());
      }
    
      stream.end();
    } finally {
      stream.close();
    }
  }
The change that was made is to get the PartOfSpeechAttribute from the TokenStream and print out its contents in the while loop that consumes the stream. Here is the new output:
This: Noun
demo: Unknown
the: Unknown
new: Unknown
TokenStream: Noun
API: Noun
Each word is now followed by its assigned PartOfSpeech tag. Of course this is a naive part-of-speech tagging. The word 'This' should not even be tagged as noun; it is only spelled capitalized because it is the first word of a sentence. Actually this is a good opportunity for an exercise. To practice the usage of the new API the reader could now write an Attribute and TokenFilter that can specify for each word if it was the first token of a sentence or not. Then the PartOfSpeechTaggingFilter can make use of this knowledge and only tag capitalized words as nouns if not the first word of a sentence (we know, this is still not a correct behavior, but hey, it's a good exercise). As a small hint, this is how the new Attribute class could begin:
  public class FirstTokenOfSentenceAttributeImpl extends AttributeImpl
                              implements FirstTokenOfSentenceAttribute {
    
    private boolean firstToken;
    
    public void setFirstToken(boolean firstToken) {
      this.firstToken = firstToken;
    }
    
    public boolean getFirstToken() {
      return firstToken;
    }

    @Override
    public void clear() {
      firstToken = false;
    }

  ...

Adding a CharFilter chain

Analyzers take Java Readers as input. Of course you can wrap your Readers with FilterReaders to manipulate content, but this would have the big disadvantage that character offsets might be inconsistent with your original text.

CharFilter is designed to allow you to pre-process input like a FilterReader would, but also preserve the original offsets associated with those characters. This way mechanisms like highlighting still work correctly. CharFilters can be chained.

Example:

public class MyAnalyzer extends Analyzer {

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    return new TokenStreamComponents(new MyTokenizer(reader));
  }
  
  @Override
  protected Reader initReader(String fieldName, Reader reader) {
    // wrap the Reader in a CharFilter chain.
    return new SecondCharFilter(new FirstCharFilter(reader));
  }
}


:

[lucene score] (repost) Score calculation formula.

Elastic/Elasticsearch 2013. 7. 12. 17:29

Original URL : http://www.lucenetutorial.com/advanced-topics/scoring.html


Lucene Scoring

The authoritative document for scoring is found on the Lucene site here. Read that first.

Lucene implements a variant of the TfIdf scoring model. That is documented here.

The factors involved in Lucene's scoring algorithm are as follows:

  1. tf = term frequency in document = measure of how often a term appears in the document
  2. idf = inverse document frequency = measure of how often the term appears across the index
  3. coord = number of terms in the query that were found in the document
  4. lengthNorm = measure of the importance of a term according to the total number of terms in the field
  5. queryNorm = normalization factor so that queries can be compared
  6. boost (index) = boost of the field at index-time
  7. boost (query) = boost of the field at query-time

The implementation, implication and rationales of factors 1,2, 3 and 4 in DefaultSimilarity.java, which is what you get if you don't explicitly specify a similarity, are: 

note: the implication of these factors should be read as, "Everything else being equal, ... [implication]"

1. tf 
Implementation: sqrt(freq) 
Implication: the more frequent a term occurs in a document, the greater its score
Rationale: documents which contains more of a term are generally more relevant

2. idf
Implementation: log(numDocs/(docFreq+1)) + 1
Implication: the greater the occurrence of a term in different documents, the lower its score 
Rationale: common terms are less important than uncommon ones

3. coord
Implementation: overlap / maxOverlap
Implication: of the terms in the query, a document that contains more terms will have a higher score
Rationale: self-explanatory

4. lengthNorm
Implementation: 1/sqrt(numTerms)
Implication: a term matched in a field with fewer terms gets a higher score
Rationale: a term in a field with fewer terms is more important than one in a field with more

queryNorm is not related to the relevance of the document, but rather tries to make scores between different queries comparable. It is implemented as 1/sqrt(sumOfSquaredWeights).
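As a quick worked example with made-up numbers (freq = 4, numDocs = 1000, docFreq = 9, a field of 16 terms), ignoring coord, queryNorm and boosts:

  double tf = Math.sqrt(4);                    // = 2.0
  double idf = Math.log(1000.0 / (9 + 1)) + 1; // = ln(100) + 1 ≈ 5.61
  double lengthNorm = 1 / Math.sqrt(16);       // = 0.25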

So, in summary (quoting Mark Harwood from the mailing list),

* Documents containing *all* the search terms are good
* Matches on rare words are better than for common words
* Long documents are not as good as short ones
* Documents which mention the search terms many times are good

The mathematical definition of the scoring can be found at http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/Similarity.html

Hint: look at NutchSimilarity in Nutch to see an example of how web pages can be scored for relevance

Customizing scoring

It's easy to customize the scoring algorithm. Subclass DefaultSimilarity and override the method you want to customize.

For example, if you want to ignore how common a term appears across the index,

Similarity sim = new DefaultSimilarity() {
  public float idf(int docFreq, int numDocs) {
    return 1;
  }
};

and if you think for the title field, more terms is better

Similarity sim = new DefaultSimilarity() {
  public float lengthNorm(String field, int numTerms) {
    if (field.equals("title")) return (float) (0.1 * Math.log(numTerms));
    else return super.lengthNorm(field, numTerms);
  }
};


:

[lucene] phrase query

Elastic/Elasticsearch 2013. 6. 19. 14:22

http://www.avajava.com/tutorials/lessons/how-do-i-query-for-words-near-each-other-with-a-phrase-query.html


Sharing this because it explains slop well.


Here are some foods that Deron likes:
hamburger
french fries
steak
mushrooms
artichokes
Query: contents:"french fries"
Number of hits: 1
Hit: C:\projects\workspace\demo\filesToIndex\deron-foods.txt

Query: contents:"hamburger steak"
Number of hits: 0

Query: contents:"hamburger steak"~1
Number of hits: 0

Query: contents:"hamburger steak"~2
Number of hits: 1
Hit: C:\projects\workspace\demo\filesToIndex\deron-foods.txt

Query: contents:"hamburger steak"~3
Number of hits: 1
Hit: C:\projects\workspace\demo\filesToIndex\deron-foods.txt

Searching for 'french fries' using QueryParser
Type of query: BooleanQuery
Query: contents:french contents:fries
Number of hits: 1
Hit: C:\projects\workspace\demo\filesToIndex\deron-foods.txt

Searching for '"french fries"' using QueryParser
Type of query: PhraseQuery
Query: contents:"french fries"
Number of hits: 1
Hit: C:\projects\workspace\demo\filesToIndex\deron-foods.txt

Searching for '"hamburger steak"~1' using QueryParser
Type of query: PhraseQuery
Query: contents:"hamburger steak"~1
Number of hits: 0

Searching for '"hamburger steak"~2' using QueryParser
Type of query: PhraseQuery
Query: contents:"hamburger steak"~2
Number of hits: 1
Hit: C:\projects\workspace\demo\filesToIndex\deron-foods.txt

Let's talk briefly about the console output. The first phrase query searches for "french" and "fries" with a slop of 0, meaning that the phrase search ends up being a search for "french fries", where "french" and "fries" are next to each other. Since this exists in deron-foods.txt, we get 1 hit.

In the second query, we search for "hamburger" and "steak" with a slop of 0. Since "hamburger" and "steak" don't exist next to each other in either document, we get 0 hits. The third query also involves a search for "hamburger" and "steak", but with a slop of 1. These words are not within 1 word of each other, so we get 0 hits.

The fourth query searches for "hamburger" and "steak" with a slop of 2. In the deron-foods.txt file, we have the words "... hamburger french fries steak ...". Since "hamburger" and "steak" are within two words of each other, we get 1 hit. The fifth phrase query is the same search but with a slop of 3. Since "hamburger" and "steak" are within three words of each other (they are two words from each other), we get a hit of 1.

The next four queries utilize QueryParser. Notice that in the first of the QueryParser queries, we get a BooleanQuery rather than a PhraseQuery. This is because we passed QueryParser's parse() method "french fries" rather than "\"french fries\"". If we want QueryParser to generate a PhraseQuery, the search string needs to be surrounded by double quotes. The next query does search for "\"french fries\"" and we can see that it generates a PhraseQuery (with the default slop of 0) and gets 1 hit in response to the query.

The last two QueryParser queries demonstrate setting slop values. We can see that a slop value can be set by following the double quotes of the search string with a tilde (~) followed by the slop number.

As we have seen, phrase queries are a great way to produce queries that have a degree of leeway to them in terms of the proximity and ordering of the words to be searched. The total allowed spacing between words can be controlled using the setSlop() method of PhraseQuery.
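Building the same kind of query programmatically looks roughly like this (pre-5.0 PhraseQuery API, which matches the era of this post; the field name "contents" follows the example above):

PhraseQuery query = new PhraseQuery();
query.add(new Term("contents", "hamburger"));
query.add(new Term("contents", "steak"));
query.setSlop(2); // allow up to two positions of movement between the terms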


:

[lucene] field options for indexing - StringField.java

Elastic/Elasticsearch 2013. 5. 27. 10:52

This class does not exist in Lucene 3.6.x;

it first appeared in 4.x.

Looking at the code briefly:

public final class StringField extends Field {

  /** Indexed, not tokenized, omits norms, indexes
   *  DOCS_ONLY, not stored. */
  public static final FieldType TYPE_NOT_STORED = new FieldType();

  /** Indexed, not tokenized, omits norms, indexes
   *  DOCS_ONLY, stored */
  public static final FieldType TYPE_STORED = new FieldType();

  static {
    TYPE_NOT_STORED.setIndexed(true);
    TYPE_NOT_STORED.setOmitNorms(true);
    TYPE_NOT_STORED.setIndexOptions(IndexOptions.DOCS_ONLY);
    TYPE_NOT_STORED.setTokenized(false);
    TYPE_NOT_STORED.freeze();

    TYPE_STORED.setIndexed(true);
    TYPE_STORED.setOmitNorms(true);
    TYPE_STORED.setIndexOptions(IndexOptions.DOCS_ONLY);
    TYPE_STORED.setStored(true);
    TYPE_STORED.setTokenized(false);
    TYPE_STORED.freeze();
  }

  /** Creates a new StringField.
   *  @param name field name
   *  @param value String value
   *  @param stored Store.YES if the content should also be stored
   *  @throws IllegalArgumentException if the field name or value is null.
   */
  public StringField(String name, String value, Store stored) {
    super(name, value, stored == Store.YES ? TYPE_STORED : TYPE_NOT_STORED);
  }
}


It is always indexed (indexed = true).

That is a bit different from how it used to be.

If you only remember the old field options you could easily trip over this, so I thought I'd post it, along with a small usage sketch below.
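The snippet below is only an illustration (the field names and values are made up): StringField indexes the whole value as a single un-tokenized term, while TextField runs the value through the analyzer.

Document doc = new Document();
doc.add(new StringField("id", "product-42", Field.Store.YES));               // one term, not tokenized, DOCS_ONLY
doc.add(new TextField("title", "Apache Lucene in action", Field.Store.YES)); // analyzed into tokens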

:

[Lucene] Apache Lucene - Index File Formats (rough translation)

Elastic/Elasticsearch 2013. 5. 16. 13:34

 

Original : http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/codecs/lucene42/package-summary.html#Overview

--> Upgrade : lucene.apache.org/core/8_8_2/core/org/apache/lucene/codecs/lucene87/package-summary.html#package.description

Definitions

 

The fundamental concepts in Lucene are the index, document, field, and term.

An index contains a sequence of documents.

A document is a sequence of fields.

A field is a named sequence of terms.

A term is a sequence of bytes.

The same sequence of bytes in two different fields is considered a different term.

A term is therefore represented as a pair: the string naming the field, and the bytes within the field.

Inverted Indexing

 

The index stores statistics about terms in order to make term-based search efficient.

Lucene's index falls into the family of indexes known as inverted indexes, because for each term it can list the documents that contain it.

This is the inverse of the natural relationship, in which documents list their terms.

Types of Fields

 

In Lucene, a field may be stored, in which case its text is kept in the index literally, in a non-inverted manner.

Fields that are inverted are called indexed.

A field may be both stored and indexed.

The text of a field may be tokenized into terms to be indexed, or it may be used literally as a single term to be indexed.

Most fields are tokenized, but sometimes it is useful to index certain identifier fields literally.

Segments

 

A Lucene index may be composed of multiple sub-indexes, or segments. Each segment is a fully independent index that can be searched separately.

An index evolves by:

creating new segments as new documents are added, and

merging existing segments.

A search may involve multiple segments and/or multiple indexes, each index potentially consisting of a set of segments.

Document Numbers

 

Internally, Lucene refers to documents by an integer document number. The first document added to an index is numbered 0, and each subsequent document gets a number one greater than the previous one.

Note that a document's number may change, so be careful when storing these numbers outside of Lucene.

In particular, document numbers may change in the following situations:

The numbers stored in each segment are unique only within that segment and must be converted before they can be used in a larger context.

The standard technique is to allocate each segment a range of values, based on the range of numbers used in that segment.

To convert a document number from a segment to an external value, the segment's base document number is added.

To convert an external value back to a segment-specific value, the segment is identified by the range the external value falls in, and the segment's base value is subtracted.

When documents are deleted, gaps are created in the numbering. These are eventually removed as the index evolves through merging: deleted documents are dropped when segments are merged, and a freshly merged segment has no gaps in its numbering.

Index Structure Overview

 

Each segment index maintains the following:

- Segment info. Metadata about the segment, such as the number of documents and which files it uses.

- Field names. The set of field names used in the index.

- Stored field values. For each document, a list of attribute-value pairs where the attributes are field names. These are used to store auxiliary information about the document, such as its title, URL, or an identifier used to access a database. The set of stored fields is what is returned for each matching hit when searching, keyed by document number.

- Term dictionary. A dictionary containing all of the terms used in all of the indexed fields of all of the documents. The dictionary also contains the number of documents that contain each term, and pointers to the term's frequency and proximity data.

- Term frequency data. For each term in the dictionary, the numbers of all the documents that contain the term and the frequency of the term in each document, unless frequencies are omitted.

- Term proximity data. For each term in the dictionary, the positions at which the term occurs in each document; this does not exist if proximity data is omitted.

- Normalization factors. For each field in each document, a value that is multiplied into the score for hits on that field.

- Term vectors. For each field in each document, the term vector (sometimes called the document vector) may be stored. A term vector consists of term text and term frequency.

- Per-document values. Like stored values, these are keyed by document number, but they are generally intended to be loaded into main memory for fast access. Whereas stored values are generally intended for summary results from searches, per-document values are useful for things like scoring factors.

- Deleted documents. An optional file indicating which documents have been deleted.

File Naming

All files belonging to a segment have the same name, with varying extensions.

Typically, all segments of an index are stored in a single directory, although this is not required.

As of version 2.1, file names are never re-used (with the exception of segments.gen).

File names are generated from a sequentially increasing long integer.

Summary of File Extensions

Name                          | Extension                 | Brief Description
Segments File                 | segments.gen, segments_N  | Stores information about a commit point
Lock File                     | write.lock                | The write lock prevents multiple IndexWriters from writing to the same file
Segment Info                  | .si                       | Stores metadata about a segment
Compound File                 | .cfs, .cfe                | An optional "virtual" file made up of all the other index files, for systems that struggle with frequent file handling
Fields                        | .fnm                      | Stores information about the fields
Field Index                   | .fdx                      | Contains pointers to the field data
Field Data                    | .fdt                      | The stored fields for documents
Term Dictionary               | .tim                      | The term dictionary, storing term info
Term Index                    | .tip                      | The index into the term dictionary
Frequencies                   | .doc                      | Contains the list of documents containing each term, with term frequencies
Positions                     | .pos                      | Stores the positions at which a term occurs in the index
Payloads                      | .pay                      | Stores additional per-position metadata, such as character offsets and user payloads
Norms                         | .nvd, .nvm                | Encodes length and boost factors for documents and fields
Per-Document Values           | .dvd, .dvm                | Encodes additional scoring factors or other per-document information
Term Vector Index             | .tvx                      | Stores offsets into the document data file
Term Vector Data (Documents)  | .tvd                      | Contains the term vector data for each document
Term Vector Fields            | .tvf                      | Field-level info about term vectors
Deleted Documents             | .del                      | Info about which documents are deleted
Live Documents                | .liv                      | Created when a segment has deletion info
Point Values                  | .dii, .dim                | Holds indexed points

 

 

:

[Elasticsearch] Plugins - building a site plugin and a custom analyzer plugin

Elastic/Elasticsearch 2013. 4. 19. 10:55

This post was written based on my own tests, elasticsearch.org, and community material;

its purpose is information sharing.


Please point out anything that is wrong.

(The example code has not been verified for performance or security.)



[elasticsearch API review]

Original link : http://www.elasticsearch.org/guide/reference/modules/plugins/


The things people probably use most with elasticsearch are the head plugin and the kr lucene (Korean morphological) analyzer.

So how are plugins like these actually built?

The page above lists all of the available plugins towards the bottom.

You can also find them via the links below.


[git]

- https://github.com/elasticsearch

- https://github.com/search?q=elasticsearch&type=&ref=simplesearch


First, let's look at how a site plugin such as head is structured.

Honestly, this part needs no explanation. ^^;;


[_site plugin]

- plugin location : ES_HOME/plugins

- site plugin name : helloworld

- helloworld site plugin location : ES_HOME/plugins/helloworld

    . Create a _site folder under the helloworld folder.

    . Put your html, js, css, and other files under _site, then open the URL below to check it.

- helloworld site plugin url

    . http://localhost:9200/_plugin/helloworld/index.html

- Talk to the elasticsearch server with ajax calls to implement whatever features you need.


[kr lucene analyzer plugin] 

- A plugin for this already exists.

- See the links below:

http://cafe.naver.com/korlucene

https://github.com/chanil1218/elasticsearch-analysis-korean

- There are two ways to use it:

    . First: install elasticsearch-analysis-korean (you may need to rebuild it to match your es version).

    . Second: build your own plugin around the lucene kr analyzer library and install that.

- What follows describes the second approach: building and installing it as a plugin.

If you go with the analyzer library, you can get something working quickly by using the code kimchy wrote as a base template:

https://github.com/elasticsearch/elasticsearch-analysis-smartcn


- Let's build it.

[Project layout]

- Create a Maven project in Eclipse.


[Packages and resources]

- org.elasticsearch.index.analysis

    . KrLuceneAnalysisBinderProcessor.java

public class KrLuceneAnalysisBinderProcessor extends AnalysisModule.AnalysisBinderProcessor {

    @Override
    public void processAnalyzers(AnalyzersBindings analyzersBindings) {
        analyzersBindings.processAnalyzer("krlucene_analyzer", KrLuceneAnalyzerProvider.class);
    }

    @Override
    public void processTokenizers(TokenizersBindings tokenizersBindings) {
        tokenizersBindings.processTokenizer("krlucene_tokenizer", KrLuceneTokenizerFactory.class);
    }

    @Override
    public void processTokenFilters(TokenFiltersBindings tokenFiltersBindings) {
        tokenFiltersBindings.processTokenFilter("krlucene_filter", KrLuceneTokenFilterFactory.class);
    }
}

    . This class registers the analyzer, tokenizer, and filter under their names.

    . Those names are what you put in the analyzer, tokenizer, and filter entries of the settings.

    . In the settings, the type can also be given as the full package path of the provider class.

curl -XPUT http://localhost:9200/test -d '{
    "settings" : {
        "index": {
            "analysis": {
                "analyzer": {
                    "krlucene_analyzer": {
                        "type": "org.elasticsearch.index.analysis.KrLuceneAnalyzerProvider",
                        "tokenizer" : "krlucene_tokenizer",
                        "filter" : ["trim","lowercase", "krlucene_filter"]
                    }
                }
            }
        }
    }
}'


    . KrLuceneAnalyzerProvider.java

public class KrLuceneAnalyzerProvider extends AbstractIndexAnalyzerProvider<KoreanAnalyzer> {

    private final KoreanAnalyzer analyzer;

    @Inject
    public KrLuceneAnalyzerProvider(Index index, @IndexSettings Settings indexSettings, Environment env, @Assisted String name, @Assisted Settings settings) throws IOException {
        super(index, indexSettings, name, settings);

        analyzer = new KoreanAnalyzer(Lucene.VERSION.LUCENE_36);
    }

    @Override
    public KoreanAnalyzer get() {
        return this.analyzer;
    }
}


    . KrLuceneTokenFilterFactory.java

public class KrLuceneTokenFilterFactory extends AbstractTokenFilterFactory {

    @Inject
    public KrLuceneTokenFilterFactory(Index index, @IndexSettings Settings indexSettings, @Assisted String name, @Assisted Settings settings) {
        super(index, indexSettings, name, settings);
    }

    @Override
    public TokenStream create(TokenStream tokenStream) {
        return new KoreanFilter(tokenStream);
    }
}


    . KrLuceneTokenizerFactory.java

public class KrLuceneTokenizerFactory extends AbstractTokenizerFactory {

    @Inject
    public KrLuceneTokenizerFactory(Index index, @IndexSettings Settings indexSettings, @Assisted String name, @Assisted Settings settings) {
        super(index, indexSettings, name, settings);
    }

    @Override
    public Tokenizer create(Reader reader) {
        return new KoreanTokenizer(Lucene.VERSION.LUCENE_36, reader);
    }
}


- org.elasticsearch.plugin.analysis.krlucene

    . AnalysisKrLucenePlugin.java

    . This class registers the created plugin with es (a minimal sketch follows below).

    . If the plugin is named analysis-krlucene, the jar file must be placed in the following path:

    ES_HOME/plugins/analysis-krlucene
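The sketch below shows roughly what AnalysisKrLucenePlugin.java could look like, modeled on the smartcn plugin mentioned above (the name and description strings are placeholders):

public class AnalysisKrLucenePlugin extends AbstractPlugin {

    @Override
    public String name() {
        return "analysis-krlucene";
    }

    @Override
    public String description() {
        return "kr lucene analysis support"; // placeholder description
    }

    // Called by elasticsearch so the plugin can register its analysis components.
    public void onModule(AnalysisModule module) {
        module.addProcessor(new KrLuceneAnalysisBinderProcessor());
    }
}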


- src/main/assemblies/plugin.xml

<?xml version="1.0"?>

<assembly>

    <id>plugin</id>

    <formats>

        <format>zip</format>

    </formats>

    <includeBaseDirectory>false</includeBaseDirectory>

    <dependencySets>

        <dependencySet>

            <outputDirectory>/</outputDirectory>

            <useProjectArtifact>true</useProjectArtifact>

            <useTransitiveFiltering>true</useTransitiveFiltering>

            <excludes>

                <exclude>org.elasticsearch:elasticsearch</exclude>

            </excludes>

        </dependencySet>

        <dependencySet>

            <outputDirectory>/</outputDirectory>

            <useProjectArtifact>true</useProjectArtifact>

            <scope>provided</scope>

        </dependencySet>

    </dependencySets>

</assembly>


- src/main/resources/es-plugin.properties

plugin=org.elasticsearch.plugin.analysis.krlucene.AnalysisKrLucenePlugin


- Build the project, place the generated jar file in the path mentioned above, restart ES, and then test as shown below.


[Test]

- Create the test index (see the index creation code above)

- Test URL

    . http://localhost:9200/test/_analyze?analyzer=krlucene_analyzer&text=이것은 루씬한국어 형태소 분석기 플러그인 입니다.&pretty=1

{
  "tokens" : [ {
    "token" : "이것은",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "이것",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "루씬한국어",
    "start_offset" : 4,
    "end_offset" : 9,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "루씬",
    "start_offset" : 4,
    "end_offset" : 6,
    "type" : "word",
    "position" : 4
  }, {
    "token" : "한국어",
    "start_offset" : 6,
    "end_offset" : 9,
    "type" : "word",
    "position" : 5
  }, {
    "token" : "형태소",
    "start_offset" : 10,
    "end_offset" : 13,
    "type" : "word",
    "position" : 6
  }, {
    "token" : "분석기",
    "start_offset" : 14,
    "end_offset" : 17,
    "type" : "word",
    "position" : 7
  }, {
    "token" : "분석",
    "start_offset" : 14,
    "end_offset" : 16,
    "type" : "word",
    "position" : 8
  }, {
    "token" : "플러그인",
    "start_offset" : 18,
    "end_offset" : 22,
    "type" : "word",
    "position" : 9
  }, {
    "token" : "플러그",
    "start_offset" : 18,
    "end_offset" : 21,
    "type" : "word",
    "position" : 10
  }, {
    "token" : "입니다",
    "start_offset" : 23,
    "end_offset" : 26,
    "type" : "word",
    "position" : 11
  }, {
    "token" : "입니",
    "start_offset" : 23,
    "end_offset" : 25,
    "type" : "word",
    "position" : 12
  } ] 

}


※ If you want to move from lucene 3.x to 4.x, you will have to update the code yourself.

- For elasticsearch-analysis-korean there is quite a bit to change.

    . First, the Korean morphological analyzer source itself has to be upgraded from 3.x to 4.x.

    . The code is available via the cvs link on the Korean analyzer cafe:

:pserver:anonymous@lucenekorean.cvs.sourceforge.net:/cvsroot/lucenekorean

    . If you also want to bump the es version, adjust pom.xml:

<properties>
    <elasticsearch.version>0.20.4</elasticsearch.version>
    <lucene.version>3.6.2</lucene.version>
</properties>


- To apply it by building the plugin yourself, create the plugin as above and drop in the Korean analyzer library for the matching version.

    . Of course, the library versions in the plugin's pom.xml have to match as well.


:

Korean morphological analyzer for Lucene: from lucene-core 3.2 to 3.6

Elastic/Elasticsearch 2013. 1. 24. 16:03

If you use the lucene kr analyzer and move lucene-core from 3.2 to 3.6, the class below turns red with compile errors.
Here is the fixed code; it is so basic I wasn't sure it was even worth writing up,
but I needed it myself, so here it is.

[KoreanAnalyzer.java]

/** Builds an analyzer with the stop words from the given file.
 * @see WordlistLoader#getWordSet(File)
 */
public KoreanAnalyzer(Version matchVersion, File stopwords) throws IOException {
    this(matchVersion, WordlistLoader.getWordSet(new InputStreamReader(new FileInputStream(stopwords), DIC_ENCODING), matchVersion));
}

/** Builds an analyzer with the stop words from the given file.
 * @see WordlistLoader#getWordSet(File)
 */
public KoreanAnalyzer(Version matchVersion, File stopwords, String encoding) throws IOException {
    this(matchVersion, WordlistLoader.getWordSet(new InputStreamReader(new FileInputStream(stopwords), encoding), matchVersion));
}

/** Builds an analyzer with the stop words from the given reader.
 * @see WordlistLoader#getWordSet(Reader)
 */
public KoreanAnalyzer(Version matchVersion, Reader stopwords) throws IOException {
    this(matchVersion, WordlistLoader.getWordSet(stopwords, matchVersion));
}

The existing KoreanAnalyzer constructors had no Version argument, so all I did was add one. :)

:

Lucene 2.4.3 Field options for term vectors

Elastic/Elasticsearch 2013. 1. 23. 19:02
I posted this in a hurry on my way home, so it ended up as just a scrap with no commentary of my own.

[In summary]
Term vectors are a mix between an indexed field and a stored field. They’re similar to a stored field because you can quickly retrieve all term vector fields for a given document: term vectors are keyed first by document ID . But then, they’re keyed secondarily by term, meaning they store a miniature inverted index for that one document. Unlike a stored field, where the original

[When would you use it]
Sometimes when you index a document you’d like to retrieve all its unique terms at search time. One common use is to speed up highlighting the matched tokens in stored fields. (Highlighting is covered more in sections 8.3 and 8.4.) Another use is to enable a link, “Find similar documents,” that when clicked runs a new search using the salient terms in an original document. Yet another example is automatic categorization of documents. Section 5.9 shows concrete examples of using term vectors once they’re in your index.


2.4.3 Field options for term vectors
Sometimes when you index a document you’d like to retrieve all its unique terms at search time. One common use is to speed up highlighting the matched tokens in stored fields. (Highlighting is covered more in sections 8.3 and 8.4.) Another use is to enable a link, “Find similar documents,” that when clicked runs a new search using the salient terms in an original document. Yet another example is automatic categorization of documents. Section 5.9 shows concrete examples of using term vectors once they’re in your index.
But what exactly are term vectors? Term vectors are a mix between an indexed field and a stored field. They’re similar to a stored field because you can quickly retrieve all term vector fields for a given document: term vectors are keyed first by document ID . But then, they’re keyed secondarily by term, meaning they store a miniature inverted index for that one document. Unlike a stored field, where the original
String content is stored verbatim, term vectors store the actual separate terms that were produced by the analyzer, allowing you to retrieve all terms for each field, and the frequency of their occurrence within the document, sorted in lexicographic order. Because the tokens coming out of an analyzer also have position and offset information (see section 4.2.1), you can choose separately whether these details are also stored in your term vectors by passing these constants as the fourth argument to the Field constructor:
TermVector.YES —Records the unique terms that occurred, and their counts, in each document, but doesn’t store any positions or offsets information
TermVector.WITH_POSITIONS —Records the unique terms and their counts, and also the positions of each occurrence of every term, but no offsets
TermVector.WITH_OFFSETS —Records the unique terms and their counts, with the offsets (start and end character position) of each occurrence of every term, but no positions
TermVector.WITH_POSITIONS_OFFSETS —Stores unique terms and their counts, along with positions and offsets
TermVector.NO —Doesn’t store any term vector information
Note that you can’t index term vectors unless you’ve also turned on indexing for the field. Stated more directly: if Index.NO is specified for a field, you must also specify
TermVector.NO .

We’re done with the detailed options to control indexing, storing, and term vec-tors. Now let’s see how you can create a field with values other than String .


: