Posts tagged '루씬' (Lucene): 30 entries

  1. 2016.03.23 [Elasticsearch] Introduction to Timeouts
  2. 2016.03.18 [Elasticsearch] this-week-in-elasticsearch-and-apache-lucene-2016-03-14 summary
  3. 2016.03.17 [Elasticsearch] Example index settings for applying synonyms
  4. 2016.03.17 [Elasticsearch] Registering external Arirang dictionaries with Elasticsearch
  5. 2013.08.21 [Lucene] Analysis JavaDoc
  6. 2013.04.19 [Elasticsearch] Plugins - building a site plugin and a custom analyzer plugin 1
  7. 2013.01.24 Lucene Korean morphological analyzer: from lucene-core 3.2 to 3.6
  8. 2013.01.23 Testing the Lucene Korean morphological analyzer locally
  9. 2013.01.22 Lucene Korean morphological analyzer: dictionary layout and tips
  10. 2012.12.10 Lucene indexing options

[Elasticsearch] Introduction to Timeouts

Elastic/Elasticsearch 2016. 3. 23. 11:39

This is less an introduction to timeouts than a refresher; it had been so long since I last looked at this that I went through it again.

I originally read this code in 2013 against version 0.90, so this write-up is based on 2.2.0.


Reference link)


Original snippet)

By default, the coordinating node waits to receive a response from all shards. If one node is having trouble, it could slow down the response to all search requests.



Relevant classes)

TransportService.java

SearchService.java

SearchRequestBuilder.java


Not much has changed since then.


The first timeout applies to the search operation on each shard.

As you know, when a search request is sent, one thread per shard executes the search action, so this setting is the timeout applied to each of those individual threads.


The second timeout is on the search coordinating node, i.e. how long it waits to receive data from all shards.
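
A minimal sketch using the 2.x Java client (index name, query, and timeout values are illustrative, and an already-built Client instance is assumed). The per-shard timeout corresponds to SearchRequestBuilder.setTimeout(), while actionGet() simply bounds how long the caller waits for the coordinating node's merged response.

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.QueryBuilders;

public class SearchTimeoutExample {

    public static SearchResponse search(Client client) {
        return client.prepareSearch("my_index")
                .setQuery(QueryBuilders.matchAllQuery())
                // Per-shard search operation timeout (the first timeout above);
                // shards that exceed it return what they have and the response
                // is flagged as timed_out.
                .setTimeout(TimeValue.timeValueMillis(500))
                .execute()
                // Upper bound on how long this caller waits for the coordinating
                // node to return the merged response from all shards.
                .actionGet(TimeValue.timeValueSeconds(5));
    }
}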



[Elasticsearch] this-week-in-elasticsearch-and-apache-lucene-2016-03-14 summary

Elastic/Elasticsearch 2016. 3. 18. 10:28

This is the Elasticsearch & Lucene news that gets posted almost every week.

I plan to treat it as study material and just summarize the key points.


Original link)


Summary of the original)

I picked out only the items I personally want to keep.


Changes in master:

  • `string` fields will be replaced by `text` and `keyword` fields in 5.0, with the following bwc layer:
    • String mappings in old indices will not be upgraded.
    • Text/Keyword mappings can be added to old and new indices.
    • String mappings on new indices will be upgraded automatically to text/keyword mappings, if possible, with deprecation logging.
    • If it is not possible to automatically upgrade, an exception will be thrown.
  • Norms can no longer be lazy loaded. This is no longer needed as they are no longer loaded into memory. The `norms` setting now takes a boolean. Index time boosts are no longer stored as norms.
  • Queries deprecated in 2.0 have now been removed.
  • The generic thread pool is now bound to 4x the number of processors.

Ongoing changes:


Among the changes merged into master, the one that really stands out is the string field. It looks like mappings will now have to use text and keyword.

Existing string mappings are not upgraded automatically, but string mappings on newly created indices are.

And the queries deprecated in 2.0 have now been removed. If you kept using any of them, watch out for errors.


Among the ongoing changes, dot support in field names and the percolator query catch my eye. Once the percolator moves from an API to a query, it should be more convenient and useful to use.





[Elasticsearch] Example Index Settings for Applying Synonyms

Elastic/Elasticsearch 2016. 3. 17. 18:34

Recording this here so I don't forget it again later.


Reference document)


Example)

"index": {
"analysis": {
"analyzer": {
"arirang_custom": {
"type": "custom",
"tokenizer": "arirang_tokenizer",
"filter": ["lowercase", "trim", "arirang_filter"]
},
"arirang_custom_searcher": {
"tokenizer": "arirang_tokenizer",
"filter": ["lowercase", "trim", "arirang_filter", "meme_synonym"]
}
},
"filter": {
"meme_synonym": {
"type": "synonym",
"synonyms": [
"henry,헨리,앙리"
]
}
}
}
}


A few points to watch out for:

1. When creating the synonym analyzer, either declare its type as custom or leave the type out entirely.

2. Define the synonym as a filter and assign that filter to the analyzer.

3. Decide whether to apply synonyms at index time or at query time, weighing the trade-offs against the characteristics of your service.

4. Use synonyms_path. (This is less a caveat than a matter of manageability.)

5. Only match-type queries work with query-time synonyms; if you want to use term-type queries, you have to apply synonyms at index time.


So what does it mean in point 1 to not declare a type?

If you don't declare one, the analyzer is simply created as a custom analyzer.

For anyone who doesn't believe me, here is the source code.


[AnalysisModule.java]

String typeName = analyzerSettings.get("type");
Class<? extends AnalyzerProvider> type;
if (typeName == null) {
    if (analyzerSettings.get("tokenizer") != null) {
        // custom analyzer, need to add it
        type = CustomAnalyzerProvider.class;
    } else {
        throw new IllegalArgumentException("Analyzer [" + analyzerName + "] must have a type associated with it");
    }
} else if (typeName.equals("custom")) {
    type = CustomAnalyzerProvider.class;
} else {
    type = analyzersBindings.analyzers.get(typeName);
    if (type == null) {
        throw new IllegalArgumentException("Unknown Analyzer type [" + typeName + "] for [" + analyzerName + "]");
    }
}
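
If you want to double-check that the synonym filter really kicks in at query time, a quick sanity check through the 2.x Java client's analyze API helps. This is a sketch only: the index name test and the client setup are assumptions, while the analyzer name comes from the example above.

import org.elasticsearch.action.admin.indices.analyze.AnalyzeResponse;
import org.elasticsearch.client.Client;

public class SynonymSanityCheck {

    public static void print(Client client) {
        AnalyzeResponse response = client.admin().indices()
                .prepareAnalyze("test", "henry")            // index, text to analyze
                .setAnalyzer("arirang_custom_searcher")     // analyzer with the synonym filter
                .get();

        // Expect henry, 헨리 and 앙리 to come back at the same position.
        for (AnalyzeResponse.AnalyzeToken token : response.getTokens()) {
            System.out.println(token.getTerm() + " @ position " + token.getPosition());
        }
    }
}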



[Elasticsearch] Registering External Arirang Dictionaries with Elasticsearch

Elastic/Elasticsearch 2016. 3. 17. 12:49

Once you deploy the arirang Korean morphological analyzer, you regularly need to update its dictionary data.

Dictionary data bundled inside the jar requires rebuilding the package, redeploying it, and even restarting the cluster.

I would rather skip all that: manage the dictionary data as external files and have updates take effect immediately, without a restart.


In a previous post I already implemented a REST API that reloads the dictionary data.

With that, the functionality itself is essentially done.


See previous post)


So where in Elasticsearch should the dictionary files live for this to work?

As covered in the previous post, 수명님's arirang.morph first reads files placed on the classpath under org/apache/lucene/analysis/ko/dic, so create and place them in that layout and they will be picked up before the packaged ones.


See previous post)


Note that unless you add that directory to the classpath when starting Elasticsearch, the dictionary files will not be found, so keep this in mind.


Elasticsearch classpath setting)

Elasticsearch's guidance is not to modify this file, but there is no way to use this feature without editing it, so edit it we must.


$ vi bin/elasticsearch.in.sh

.....

ES_CLASSPATH="$ES_HOME/lib/elasticsearch-2.2.0.jar:$ES_HOME/lib/*:$ES_HOME/<path_you_created>"

.....


After making this change, restart Elasticsearch, update the dictionary files directly, and use the reload API to confirm that the changes take effect.


Reference - quick summary)

How arirang.morph loads the properties and dic files


Step 1)

Load korean.properties from the classpath if an external copy is present.

The dic files are handled the same way.


Step 2)

If no external copy is found, load korean.properties from inside the jar.

The dic files are handled the same way.
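
The same flow in code, as a minimal sketch of the general classpath-first, jar-fallback pattern (illustrative Java only, not the actual arirang.morph implementation; the externalDir argument stands for whatever directory you added to ES_CLASSPATH):

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class DictionaryResource {

    // resource is e.g. "org/apache/lucene/analysis/ko/dic/korean.properties"
    public static InputStream open(String externalDir, String resource) throws IOException {
        // Step 1: prefer an external copy under the directory that was added
        // to the classpath; dic files are resolved the same way.
        File external = new File(externalDir, resource);
        if (external.exists()) {
            return new FileInputStream(external);
        }
        // Step 2: fall back to the copy packaged inside the jar.
        InputStream packaged = DictionaryResource.class.getClassLoader().getResourceAsStream(resource);
        if (packaged == null) {
            throw new IOException("dictionary resource not found: " + resource);
        }
        return packaged;
    }
}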


If you are wondering how to register the dictionary data itself, see the previous post.


Dictionary data registration example)


[Lucene] Analysis JavaDoc

Elastic/Elasticsearch 2013. 8. 21. 10:34

The Lucene package documentation explains this well.

If you need to build your own analyzer, tokenizer, or filter, it is worth a read.

The parts I mainly looked at below are

- Invoking the Analyzer

- TokenStream API

these two sections.

Reading just these should make it fairly easy to understand.


[Original]

http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/analysis/package-summary.html


Package org.apache.lucene.analysis Description

API and code to convert text into indexable/searchable tokens. Covers Analyzer and related classes.

Parsing? Tokenization? Analysis!

Lucene, an indexing and search library, accepts only plain text input.

Parsing

Applications that build their search capabilities upon Lucene may support documents in various formats – HTML, XML, PDF, Word – just to name a few. Lucene does not care about the Parsing of these and other document formats, and it is the responsibility of the application using Lucene to use an appropriate Parser to convert the original format into plain text before passing that plain text to Lucene.

Tokenization

Plain text passed to Lucene for indexing goes through a process generally called tokenization. Tokenization is the process of breaking input text into small indexing elements – tokens. The way input text is broken into tokens heavily influences how people will then be able to search for that text. For instance, sentences beginnings and endings can be identified to provide for more accurate phrase and proximity searches (though sentence identification is not provided by Lucene).

In some cases simply breaking the input text into tokens is not enough – a deeper Analysis may be needed. Lucene includes both pre- and post-tokenization analysis facilities.

Pre-tokenization analysis can include (but is not limited to) stripping HTML markup, and transforming or removing text matching arbitrary patterns or sets of fixed strings.

There are many post-tokenization steps that can be done, including (but not limited to):

  • Stemming – Replacing words with their stems. For instance with English stemming "bikes" is replaced with "bike"; now query "bike" can find both documents containing "bike" and those containing "bikes".
  • Stop Words Filtering – Common words like "the", "and" and "a" rarely add any value to a search. Removing them shrinks the index size and increases performance. It may also reduce some "noise" and actually improve search quality.
  • Text Normalization – Stripping accents and other character markings can make for better searching.
  • Synonym Expansion – Adding in synonyms at the same token position as the current word can mean better matching when users search with words in the synonym set.

Core Analysis

The analysis package provides the mechanism to convert Strings and Readers into tokens that can be indexed by Lucene. There are four main classes in the package from which all analysis processes are derived. These are:

  • Analyzer – An Analyzer is responsible for building a TokenStream which can be consumed by the indexing and searching processes. See below for more information on implementing your own Analyzer.
  • CharFilter – CharFilter extends Reader to perform pre-tokenization substitutions, deletions, and/or insertions on an input Reader's text, while providing corrected character offsets to account for these modifications. This capability allows highlighting to function over the original text when indexed tokens are created from CharFilter-modified text with offsets that are not the same as those in the original text. Tokenizers' constructors and reset() methods accept a CharFilter. CharFilters may be chained to perform multiple pre-tokenization modifications.
  • Tokenizer – A Tokenizer is a TokenStream and is responsible for breaking up incoming text into tokens. In most cases, an Analyzer will use a Tokenizer as the first step in the analysis process. However, to modify text prior to tokenization, use a CharStream subclass (see above).
  • TokenFilter – A TokenFilter is also a TokenStream and is responsible for modifying tokens that have been created by the Tokenizer. Common modifications performed by a TokenFilter are: deletion, stemming, synonym injection, and down casing. Not all Analyzers require TokenFilters.

Hints, Tips and Traps

The synergy between Analyzer and Tokenizer is sometimes confusing. To ease this confusion, some clarifications:

Lucene Java provides a number of analysis capabilities, the most commonly used one being the StandardAnalyzer. Many applications will have a long and industrious life with nothing more than the StandardAnalyzer. However, there are a few other classes/packages that are worth mentioning:

  1. PerFieldAnalyzerWrapper – Most Analyzers perform the same operation on all Fields. The PerFieldAnalyzerWrapper can be used to associate a different Analyzer with different Fields.
  2. The analysis library located at the root of the Lucene distribution has a number of different Analyzer implementations to solve a variety of different problems related to searching. Many of the Analyzers are designed to analyze non-English languages.
  3. There are a variety of Tokenizer and TokenFilter implementations in this package. Take a look around, chances are someone has implemented what you need.

Analysis is one of the main causes of performance degradation during indexing. Simply put, the more you analyze the slower the indexing (in most cases). Perhaps your application would be just fine using the simple WhitespaceTokenizer combined with a StopFilter. The benchmark/ library can be useful for testing out the speed of the analysis process.

Invoking the Analyzer

Applications usually do not invoke analysis – Lucene does it for them:

  • At indexing, as a consequence of addDocument(doc), the Analyzer in effect for indexing is invoked for each indexed field of the added document.
  • At search, a QueryParser may invoke the Analyzer during parsing. Note that for some queries, analysis does not take place, e.g. wildcard queries.

However an application might invoke Analysis of any text for testing or for any other purpose, something like:

    Version matchVersion = Version.LUCENE_XY; // Substitute desired Lucene version for XY
    Analyzer analyzer = new StandardAnalyzer(matchVersion); // or any other analyzer
    TokenStream ts = analyzer.tokenStream("myfield", new StringReader("some text goes here"));
    OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
    
    try {
      ts.reset(); // Resets this stream to the beginning. (Required)
      while (ts.incrementToken()) {
        // Use AttributeSource.reflectAsString(boolean)
        // for token stream debugging.
        System.out.println("token: " + ts.reflectAsString(true));

        System.out.println("token start offset: " + offsetAtt.startOffset());
        System.out.println("  token end offset: " + offsetAtt.endOffset());
      }
      ts.end();   // Perform end-of-stream operations, e.g. set the final offset.
    } finally {
      ts.close(); // Release resources associated with this stream.
    }

Indexing Analysis vs. Search Analysis

Selecting the "correct" analyzer is crucial for search quality, and can also affect indexing and search performance. The "correct" analyzer differs between applications. Lucene java's wiki page AnalysisParalysis provides some data on "analyzing your analyzer". Here are some rules of thumb:

  1. Test test test... (did we say test?)
  2. Beware of over analysis – might hurt indexing performance.
  3. Start with same analyzer for indexing and search, otherwise searches would not find what they are supposed to...
  4. In some cases a different analyzer is required for indexing and search, for instance:
    • Certain searches require more stop words to be filtered. (I.e. more than those that were filtered at indexing.)
    • Query expansion by synonyms, acronyms, auto spell correction, etc.
    This might sometimes require a modified analyzer – see the next section on how to do that.

Implementing your own Analyzer

Creating your own Analyzer is straightforward. Your Analyzer can wrap existing analysis components — CharFilter(s) (optional), a Tokenizer, and TokenFilter(s) (optional) — or components you create, or a combination of existing and newly created components. Before pursuing this approach, you may find it worthwhile to explore the analyzers-common library and/or ask on the java-user@lucene.apache.org mailing list first to see if what you need already exists. If you are still committed to creating your own Analyzer, have a look at the source code of any one of the many samples located in this package.

The following sections discuss some aspects of implementing your own analyzer.

Field Section Boundaries

When document.add(field) is called multiple times for the same field name, we could say that each such call creates a new section for that field in that document. In fact, a separate call to tokenStream(field,reader) would take place for each of these so called "sections". However, the default Analyzer behavior is to treat all these sections as one large section. This allows phrase search and proximity search to seamlessly cross boundaries between these "sections". In other words, if a certain field "f" is added like this:

    document.add(new Field("f","first ends",...);
    document.add(new Field("f","starts two",...);
    indexWriter.addDocument(document);

Then, a phrase search for "ends starts" would find that document. Where desired, this behavior can be modified by introducing a "position gap" between consecutive field "sections", simply by overriding Analyzer.getPositionIncrementGap(fieldName):

  Version matchVersion = Version.LUCENE_XY; // Substitute desired Lucene version for XY
  Analyzer myAnalyzer = new StandardAnalyzer(matchVersion) {
    public int getPositionIncrementGap(String fieldName) {
      return 10;
    }
  };

Token Position Increments

By default, all tokens created by Analyzers and Tokenizers have a position increment of one. This means that the position stored for that token in the index would be one more than that of the previous token. Recall that phrase and proximity searches rely on position info.

If the selected analyzer filters the stop words "is" and "the", then for a document containing the string "blue is the sky", only the tokens "blue", "sky" are indexed, with position("sky") = 3 + position("blue"). Now, a phrase query "blue is the sky" would find that document, because the same analyzer filters the same stop words from that query. But the phrase query "blue sky" would not find that document because the position increment between "blue" and "sky" is only 1.

If this behavior does not fit the application needs, the query parser needs to be configured to not take position increments into account when generating phrase queries.
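
With the classic query parser that is a single switch; a minimal sketch (the field name and analyzer here are placeholders, not tied to any particular application):

  Version matchVersion = Version.LUCENE_44; // or any other 4.x constant
  // org.apache.lucene.queryparser.classic.QueryParser (lucene-queryparser module)
  QueryParser parser = new QueryParser(matchVersion, "myfield",
      new StandardAnalyzer(matchVersion));
  // Phrase queries built by this parser now use consecutive positions and
  // ignore the gaps that stop-word removal leaves in the analyzed query text.
  parser.setEnablePositionIncrements(false);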

Note that a StopFilter MUST increment the position increment in order not to generate corrupt tokenstream graphs. Here is the logic used by StopFilter to increment positions when filtering out tokens:

  public TokenStream tokenStream(final String fieldName, Reader reader) {
    final TokenStream ts = someAnalyzer.tokenStream(fieldName, reader);
    TokenStream res = new TokenStream() {
      CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);

      public boolean incrementToken() throws IOException {
        int extraIncrement = 0;
        while (true) {
          boolean hasNext = ts.incrementToken();
          if (hasNext) {
            if (stopWords.contains(termAtt.toString())) {
              extraIncrement += posIncrAtt.getPositionIncrement(); // filter this word
              continue;
            } 
            if (extraIncrement>0) {
              posIncrAtt.setPositionIncrement(posIncrAtt.getPositionIncrement()+extraIncrement);
            }
          }
          return hasNext;
        }
      }
    };
    return res;
  }

A few more use cases for modifying position increments are:

  1. Inhibiting phrase and proximity matches in sentence boundaries – for this, a tokenizer that identifies a new sentence can add 1 to the position increment of the first token of the new sentence.
  2. Injecting synonyms – here, synonyms of a token should be added after that token, and their position increment should be set to 0. As result, all synonyms of a token would be considered to appear in exactly the same position as that token, and so would they be seen by phrase and proximity searches.
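
As a sketch of case 2, here is a deliberately minimal single-term synonym injector. It only illustrates the zero position increment; the real, multi-word-capable implementation to use is SynonymFilter in analyzers-common.

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

public final class SingleTermSynonymFilter extends TokenFilter {

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);

  private final String match;
  private final String synonym;
  private AttributeSource.State pendingOriginal; // set after a matching token was emitted

  public SingleTermSynonymFilter(TokenStream in, String match, String synonym) {
    super(in);
    this.match = match;
    this.synonym = synonym;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (pendingOriginal != null) {
      // Emit the synonym as an extra token at the same position (and with the
      // same offsets) as the original token returned by the previous call.
      restoreState(pendingOriginal);
      pendingOriginal = null;
      termAtt.setEmpty().append(synonym);
      posIncrAtt.setPositionIncrement(0);
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    if (termAtt.toString().equals(match)) {
      pendingOriginal = captureState(); // inject the synonym on the next call
    }
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pendingOriginal = null;
  }
}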

Token Position Length

By default, all tokens created by Analyzers and Tokenizers have a position length of one. This means that the token occupies a single position. This attribute is not indexed and thus not taken into account for positional queries, but is used by eg. suggesters.

The main use case for positions lengths is multi-word synonyms. With single-word synonyms, setting the position increment to 0 is enough to denote the fact that two words are synonyms, for example:

Term                  red   magenta
Position increment      1         0

Given that position(magenta) = 0 + position(red), they are at the same position, so anything working with analyzers will return the exact same result if you replace "magenta" with "red" in the input. However, multi-word synonyms are more tricky. Let's say that you want to build a TokenStream where "IBM" is a synonym of "International Business Machines". Position increments are not enough anymore:

Term                  IBM   International   Business   Machines
Position increment      1               0          1          1

The problem with this token stream is that "IBM" is at the same position as "International" although it is a synonym with "International Business Machines" as a whole. Setting the position increment of "Business" and "Machines" to 0 wouldn't help as it would mean that "International" is a synonym of "Business". The only way to solve this issue is to make "IBM" span across 3 positions, and this is where position lengths come to the rescue.

Term                  IBM   International   Business   Machines
Position increment      1               0          1          1
Position length         3               1          1          1

This new attribute makes clear that "IBM" and "International Business Machines" start and end at the same positions.

How to not write corrupt token streams

There are a few rules to observe when writing custom Tokenizers and TokenFilters:

  • The first position increment must be > 0.
  • Positions must not go backward.
  • Tokens that have the same start position must have the same start offset.
  • Tokens that have the same end position (taking into account the position length) must have the same end offset.

Although these rules might seem easy to follow, problems can quickly happen when chaining badly implemented filters that play with positions and offsets, such as synonym or n-grams filters. Here are good practices for writing correct filters:

  • Token filters should not modify offsets. If you feel that your filter would need to modify offsets, then it should probably be implemented as a tokenizer.
  • Token filters should not insert positions. If a filter needs to add tokens, then they should all have a position increment of 0.
  • When they remove tokens, token filters should increment the position increment of the following token.
  • Token filters should preserve position lengths.

TokenStream API

"Flexible Indexing" summarizes the effort of making the Lucene indexer pluggable and extensible for custom index formats. A fully customizable indexer means that users will be able to store custom data structures on disk. Therefore an API is necessary that can transport custom types of data from the documents to the indexer.

Attribute and AttributeSource

Classes Attribute and AttributeSource serve as the basis upon which the analysis elements of "Flexible Indexing" are implemented. An Attribute holds a particular piece of information about a text token. For example, CharTermAttribute contains the term text of a token, and OffsetAttribute contains the start and end character offsets of a token. An AttributeSource is a collection of Attributes with a restriction: there may be only one instance of each attribute type. TokenStream now extends AttributeSource, which means that one can add Attributes to a TokenStream. Since TokenFilter extends TokenStream, all filters are also AttributeSources.

Lucene provides the following Attributes out of the box:

CharTermAttribute           The term text of a token. Implements CharSequence (providing methods length() and charAt(), and allowing e.g. for direct use with regular expression Matchers) and Appendable (allowing the term text to be appended to.)
OffsetAttribute             The start and end offset of a token in characters.
PositionIncrementAttribute  See above for detailed information about position increment.
PositionLengthAttribute     The number of positions occupied by a token.
PayloadAttribute            The payload that a Token can optionally have.
TypeAttribute               The type of the token. Default is 'word'.
FlagsAttribute              Optional flags a token can have.
KeywordAttribute            Keyword-aware TokenStreams/-Filters skip modification of tokens that return true from this attribute's isKeyword() method.

Using the TokenStream API

There are a few important things to know in order to use the new API efficiently which are summarized here. You may want to walk through the example below first and come back to this section afterwards.
  1. Please keep in mind that an AttributeSource can only have one instance of a particular Attribute. Furthermore, if a chain of a TokenStream and multiple TokenFilters is used, then all TokenFilters in that chain share the Attributes with the TokenStream.

  2. Attribute instances are reused for all tokens of a document. Thus, a TokenStream/-Filter needs to update the appropriate Attribute(s) in incrementToken(). The consumer, commonly the Lucene indexer, consumes the data in the Attributes and then calls incrementToken() again until it returns false, which indicates that the end of the stream was reached. This means that in each call of incrementToken() a TokenStream/-Filter can safely overwrite the data in the Attribute instances.

  3. For performance reasons a TokenStream/-Filter should add/get Attributes during instantiation; i.e., create an attribute in the constructor and store references to it in an instance variable. Using an instance variable instead of calling addAttribute()/getAttribute() in incrementToken() will avoid attribute lookups for every token in the document.

  4. All methods in AttributeSource are idempotent, which means calling them multiple times always yields the same result. This is especially important to know for addAttribute(). The method takes the type (Class) of an Attribute as an argument and returns an instance. If an Attribute of the same type was previously added, then the already existing instance is returned, otherwise a new instance is created and returned. Therefore TokenStreams/-Filters can safely call addAttribute() with the same Attribute type multiple times. Even consumers of TokenStreams should normally call addAttribute() instead of getAttribute(), because it would not fail if the TokenStream does not have this Attribute (getAttribute() would throw an IllegalArgumentException, if the Attribute is missing). More advanced code could simply check with hasAttribute(), if a TokenStream has it, and may conditionally leave out processing for extra performance.

Example

In this example we will create a WhitespaceTokenizer and use a LengthFilter to suppress all words that have only two or fewer characters. The LengthFilter is part of the Lucene core and its implementation will be explained here to illustrate the usage of the TokenStream API.

Then we will develop a custom Attribute, a PartOfSpeechAttribute, and add another filter to the chain which utilizes the new custom attribute, and call it PartOfSpeechTaggingFilter.

Whitespace tokenization

public class MyAnalyzer extends Analyzer {

  private Version matchVersion;
  
  public MyAnalyzer(Version matchVersion) {
    this.matchVersion = matchVersion;
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    return new TokenStreamComponents(new WhitespaceTokenizer(matchVersion, reader));
  }
  
  public static void main(String[] args) throws IOException {
    // text to tokenize
    final String text = "This is a demo of the new TokenStream API";
    
    Version matchVersion = Version.LUCENE_XY; // Substitute desired Lucene version for XY
    MyAnalyzer analyzer = new MyAnalyzer(matchVersion);
    TokenStream stream = analyzer.tokenStream("field", new StringReader(text));
    
    // get the CharTermAttribute from the TokenStream
    CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);

    try {
      stream.reset();
    
      // print all tokens until stream is exhausted
      while (stream.incrementToken()) {
        System.out.println(termAtt.toString());
      }
    
      stream.end();
    } finally {
      stream.close();
    }
  }
}
In this easy example a simple white space tokenization is performed. In main() a loop consumes the stream and prints the term text of the tokens by accessing the CharTermAttribute that the WhitespaceTokenizer provides. Here is the output:
This
is
a
demo
of
the
new
TokenStream
API

Adding a LengthFilter

We want to suppress all tokens that have 2 or less characters. We can do that easily by adding a LengthFilter to the chain. Only the createComponents() method in our analyzer needs to be changed:
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    final Tokenizer source = new WhitespaceTokenizer(matchVersion, reader);
    TokenStream result = new LengthFilter(matchVersion, source, 3, Integer.MAX_VALUE);
    return new TokenStreamComponents(source, result);
  }
Note how now only words with 3 or more characters are contained in the output:
This
demo
the
new
TokenStream
API
Now let's take a look how the LengthFilter is implemented:
public final class LengthFilter extends FilteringTokenFilter {

  private final int min;
  private final int max;
  
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  /**
   * Create a new LengthFilter. This will filter out tokens whose
   * CharTermAttribute is either too short
   * (< min) or too long (> max).
   * @param version the Lucene match version
   * @param in      the TokenStream to consume
   * @param min     the minimum length
   * @param max     the maximum length
   */
  public LengthFilter(Version version, TokenStream in, int min, int max) {
    super(version, in);
    this.min = min;
    this.max = max;
  }

  @Override
  public boolean accept() {
    final int len = termAtt.length();
    return (len >= min && len <= max);
  }

}

In LengthFilter, the CharTermAttribute is added and stored in the instance variable termAtt. Remember that there can only be a single instance of CharTermAttribute in the chain, so in our example the addAttribute() call in LengthFilter returns the CharTermAttribute that the WhitespaceTokenizer already added.

The tokens are retrieved from the input stream in FilteringTokenFilter's incrementToken() method (see below), which calls LengthFilter's accept() method. By looking at the term text in the CharTermAttribute, the length of the term can be determined and tokens that are either too short or too long are skipped. Note how accept() can efficiently access the instance variable; no attribute lookup is necessary. The same is true for the consumer, which can simply use local references to the Attributes.

LengthFilter extends FilteringTokenFilter:

public abstract class FilteringTokenFilter extends TokenFilter {

  private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);

  /**
   * Create a new FilteringTokenFilter.
   * @param in      the TokenStream to consume
   */
  public FilteringTokenFilter(Version version, TokenStream in) {
    super(in);
  }

  /** Override this method and return if the current input token should be returned by incrementToken. */
  protected abstract boolean accept() throws IOException;

  @Override
  public final boolean incrementToken() throws IOException {
    int skippedPositions = 0;
    while (input.incrementToken()) {
      if (accept()) {
        if (skippedPositions != 0) {
          posIncrAtt.setPositionIncrement(posIncrAtt.getPositionIncrement() + skippedPositions);
        }
        return true;
      }
      skippedPositions += posIncrAtt.getPositionIncrement();
    }
    // reached EOS -- return false
    return false;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
  }

}

Adding a custom Attribute

Now we're going to implement our own custom Attribute for part-of-speech tagging and call it consequently PartOfSpeechAttribute. First we need to define the interface of the new Attribute:
  public interface PartOfSpeechAttribute extends Attribute {
    public static enum PartOfSpeech {
      Noun, Verb, Adjective, Adverb, Pronoun, Preposition, Conjunction, Article, Unknown
    }
  
    public void setPartOfSpeech(PartOfSpeech pos);
  
    public PartOfSpeech getPartOfSpeech();
  }

Now we also need to write the implementing class. The name of that class is important here: By default, Lucene checks if there is a class with the name of the Attribute with the suffix 'Impl'. In this example, we would consequently call the implementing class PartOfSpeechAttributeImpl.

This should be the usual behavior. However, there is also an expert-API that allows changing these naming conventions: AttributeSource.AttributeFactory. The factory accepts an Attribute interface as argument and returns an actual instance. You can implement your own factory if you need to change the default behavior.

Now here is the actual class that implements our new Attribute. Notice that the class has to extend AttributeImpl:

public final class PartOfSpeechAttributeImpl extends AttributeImpl 
                                  implements PartOfSpeechAttribute {
  
  private PartOfSpeech pos = PartOfSpeech.Unknown;
  
  public void setPartOfSpeech(PartOfSpeech pos) {
    this.pos = pos;
  }
  
  public PartOfSpeech getPartOfSpeech() {
    return pos;
  }

  @Override
  public void clear() {
    pos = PartOfSpeech.Unknown;
  }

  @Override
  public void copyTo(AttributeImpl target) {
    ((PartOfSpeechAttribute) target).setPartOfSpeech(pos);
  }
}

This simple Attribute implementation has only a single variable that stores the part-of-speech of a token. It extends the AttributeImpl class and therefore implements its abstract methods clear() and copyTo(). Now we need a TokenFilter that can set this new PartOfSpeechAttribute for each token. In this example we show a very naive filter that tags every word with a leading upper-case letter as a 'Noun' and all other words as 'Unknown'.

  public static class PartOfSpeechTaggingFilter extends TokenFilter {
    PartOfSpeechAttribute posAtt = addAttribute(PartOfSpeechAttribute.class);
    CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    
    protected PartOfSpeechTaggingFilter(TokenStream input) {
      super(input);
    }
    
    public boolean incrementToken() throws IOException {
      if (!input.incrementToken()) {return false;}
      posAtt.setPartOfSpeech(determinePOS(termAtt.buffer(), 0, termAtt.length()));
      return true;
    }
    
    // determine the part of speech for the given term
    protected PartOfSpeech determinePOS(char[] term, int offset, int length) {
      // naive implementation that tags every uppercased word as noun
      if (length > 0 && Character.isUpperCase(term[0])) {
        return PartOfSpeech.Noun;
      }
      return PartOfSpeech.Unknown;
    }
  }

Just like the LengthFilter, this new filter stores references to the attributes it needs in instance variables. Notice how you only need to pass in the interface of the new Attribute and instantiating the correct class is automatically taken care of.

Now we need to add the filter to the chain in MyAnalyzer:

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    final Tokenizer source = new WhitespaceTokenizer(matchVersion, reader);
    TokenStream result = new LengthFilter(matchVersion, source, 3, Integer.MAX_VALUE);
    result = new PartOfSpeechTaggingFilter(result);
    return new TokenStreamComponents(source, result);
  }
Now let's look at the output:
This
demo
the
new
TokenStream
API
Apparently it hasn't changed, which shows that adding a custom attribute to a TokenStream/Filter chain does not affect any existing consumers, simply because they don't know the new Attribute. Now let's change the consumer to make use of the new PartOfSpeechAttribute and print it out:
  public static void main(String[] args) throws IOException {
    // text to tokenize
    final String text = "This is a demo of the new TokenStream API";
    
    Version matchVersion = Version.LUCENE_XY; // Substitute desired Lucene version for XY
    MyAnalyzer analyzer = new MyAnalyzer(matchVersion);
    TokenStream stream = analyzer.tokenStream("field", new StringReader(text));
    
    // get the CharTermAttribute from the TokenStream
    CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
    
    // get the PartOfSpeechAttribute from the TokenStream
    PartOfSpeechAttribute posAtt = stream.addAttribute(PartOfSpeechAttribute.class);

    try {
      stream.reset();

      // print all tokens until stream is exhausted
      while (stream.incrementToken()) {
        System.out.println(termAtt.toString() + ": " + posAtt.getPartOfSpeech());
      }
    
      stream.end();
    } finally {
      stream.close();
    }
  }
The change that was made is to get the PartOfSpeechAttribute from the TokenStream and print out its contents in the while loop that consumes the stream. Here is the new output:
This: Noun
demo: Unknown
the: Unknown
new: Unknown
TokenStream: Noun
API: Noun
Each word is now followed by its assigned PartOfSpeech tag. Of course this is a naive part-of-speech tagging. The word 'This' should not even be tagged as noun; it is only spelled capitalized because it is the first word of a sentence. Actually this is a good opportunity for an exercise. To practice the usage of the new API the reader could now write an Attribute and TokenFilter that can specify for each word if it was the first token of a sentence or not. Then the PartOfSpeechTaggingFilter can make use of this knowledge and only tag capitalized words as nouns if not the first word of a sentence (we know, this is still not a correct behavior, but hey, it's a good exercise). As a small hint, this is how the new Attribute class could begin:
  public class FirstTokenOfSentenceAttributeImpl extends AttributeImpl
                              implements FirstTokenOfSentenceAttribute {
    
    private boolean firstToken;
    
    public void setFirstToken(boolean firstToken) {
      this.firstToken = firstToken;
    }
    
    public boolean getFirstToken() {
      return firstToken;
    }

    @Override
    public void clear() {
      firstToken = false;
    }

  ...

Adding a CharFilter chain

Analyzers take Java Readers as input. Of course you can wrap your Readers with FilterReaders to manipulate content, but this would have the big disadvantage that character offsets might be inconsistent with your original text.

CharFilter is designed to allow you to pre-process input like a FilterReader would, but also preserve the original offsets associated with those characters. This way mechanisms like highlighting still work correctly. CharFilters can be chained.

Example:

public class MyAnalyzer extends Analyzer {

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    return new TokenStreamComponents(new MyTokenizer(reader));
  }
  
  @Override
  protected Reader initReader(String fieldName, Reader reader) {
    // wrap the Reader in a CharFilter chain.
    return new SecondCharFilter(new FirstCharFilter(reader));
  }
}



[Elasticsearch] Plugins - Building a site plugin and a custom analyzer plugin

Elastic/Elasticsearch 2013. 4. 19. 10:55

This post was written based on my own testing, elasticsearch.org, and the community, and

its purpose is information sharing.


Please point out anything that is wrong.

(The example code has not been validated for performance or security.)



[elasticsearch API review]

Original link : http://www.elasticsearch.org/guide/reference/modules/plugins/


The things people probably use most with Elasticsearch are head and the Korean (kr) Lucene morphological analyzer.

So you may be wondering how such plugins are built.

The bottom of the page linked above lists all the available plugins.

You can also find them at the links below.


[git]

- https://github.com/elasticsearch

- https://github.com/search?q=elasticsearch&type=&ref=simplesearch


First, let's look at how a site plugin such as head is put together.

Honestly, this barely needs an explanation. ^^;;


[_site plugin]

- plugin location : ES_HOME/plugins

- site plugin name : helloworld

- helloworld site plugin location : ES_HOME/plugins/helloworld

    . Create a _site folder under the helloworld folder.

    . Place your html, js, css, and other files under _site, then open the URL below to check it.

- helloworld site plugin url

    . http://localhost:9200/_plugin/helloworld/index.html

- The plugin talks to the Elasticsearch server via ajax calls; implement whatever features you need on top of that.


[kr lucene analyzer plugin] 

- Such plugins are already available.

- See the links below.

http://cafe.naver.com/korlucene

https://github.com/chanil1218/elasticsearch-analysis-korean

- There are two ways to get one in place.

    . First : install elasticsearch-analysis-korean. (A separate build may be needed to match your ES version.)

    . Second : build your own plugin around the lucene kr analyzer library and install that.

- What follows describes the second approach: building and installing it as a plugin.

If you are wrapping an analyzer library, using the code kimchy wrote as a base template makes it quick and easy to get working:

https://github.com/elasticsearch/elasticsearch-analysis-smartcn


- Let's build one.

[Project setup]

- Create a Maven project in Eclipse.


[Packages and resources]

- org.elasticsearch.index.analysis

    . KrLuceneAnalysisBinderProcessor.java

public class KrLuceneAnalysisBinderProcessor extends AnalysisModule.AnalysisBinderProcessor {

    @Override
    public void processAnalyzers(AnalyzersBindings analyzersBindings) {
        analyzersBindings.processAnalyzer("krlucene_analyzer", KrLuceneAnalyzerProvider.class);
    }

    @Override
    public void processTokenizers(TokenizersBindings tokenizersBindings) {
        tokenizersBindings.processTokenizer("krlucene_tokenizer", KrLuceneTokenizerFactory.class);
    }

    @Override
    public void processTokenFilters(TokenFiltersBindings tokenFiltersBindings) {
        tokenFiltersBindings.processTokenFilter("krlucene_filter", KrLuceneTokenFilterFactory.class);
    }
}

    . This class registers the analyzer, tokenizer, and filter under the given names.

    . Those names are what you reference in the analyzer, tokenizer, and filter sections of the index settings.

    . In the settings, the type field takes the fully qualified class name.

curl -XPUT http://localhost:9200/test  -d '{
    "settings" : {
        "index": {
            "analysis": {
                "analyzer": {
                    "krlucene_analyzer": {
                        "type": "org.elasticsearch.index.analysis.KrLuceneAnalyzerProvider",
                        "tokenizer" : "krlucene_tokenizer",
                        "filter" : ["trim","lowercase", "krlucene_filter"]
                    }
                }
            }
        }
    }
}'


    . KrLuceneAnalyzerProvider.java

public class KrLuceneAnalyzerProvider extends AbstractIndexAnalyzerProvider<KoreanAnalyzer> {

    private final KoreanAnalyzer analyzer;

    @Inject
    public KrLuceneAnalyzerProvider(Index index, @IndexSettings Settings indexSettings, Environment env, @Assisted String name, @Assisted Settings settings) throws IOException {
        super(index, indexSettings, name, settings);

        analyzer = new KoreanAnalyzer(Lucene.VERSION.LUCENE_36);
    }

    @Override
    public KoreanAnalyzer get() {
        return this.analyzer;
    }
}


    . KrLuceneTokenFilterFactory.java

public class KrLuceneTokenFilterFactory extends AbstractTokenFilterFactory {

    @Inject
    public KrLuceneTokenFilterFactory(Index index, @IndexSettings Settings indexSettings, @Assisted String name, @Assisted Settings settings) {
        super(index, indexSettings, name, settings);
    }

    @Override
    public TokenStream create(TokenStream tokenStream) {
        return new KoreanFilter(tokenStream);
    }
}


    . KrLuceneTokenizerFactory.java

public class KrLuceneTokenizerFactory extends AbstractTokenizerFactory {

    @Inject
    public KrLuceneTokenizerFactory(Index index, @IndexSettings Settings indexSettings, @Assisted String name, @Assisted Settings settings) {
        super(index, indexSettings, name, settings);
    }

    @Override
    public Tokenizer create(Reader reader) {
        return new KoreanTokenizer(Lucene.VERSION.LUCENE_36, reader);
    }
}


- org.elasticsearch.plugin.analysis.krlucene

    . AnalysisKrLucenePlugin.java

    . This class registers the plugin with ES (a sketch of what it might look like follows below).

    . If you name the plugin analysis-krlucene, the jar file has to be placed at the following path:

    ES_HOME/plugins/analysis-krlucene
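
A sketch of what this class might look like, modeled on the smartcn plugin mentioned above and the 0.20-era plugin API; treat it as illustrative rather than a drop-in file.

package org.elasticsearch.plugin.analysis.krlucene;

import org.elasticsearch.index.analysis.AnalysisModule;
import org.elasticsearch.index.analysis.KrLuceneAnalysisBinderProcessor;
import org.elasticsearch.plugins.AbstractPlugin;

public class AnalysisKrLucenePlugin extends AbstractPlugin {

    @Override
    public String name() {
        return "analysis-krlucene";
    }

    @Override
    public String description() {
        return "KR Lucene analysis support";
    }

    // Called by ES when the analysis module is created; registers the
    // binder processor defined above so the names become available.
    public void onModule(AnalysisModule module) {
        module.addProcessor(new KrLuceneAnalysisBinderProcessor());
    }
}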


- src/main/assemblies/plugin.xml

<?xml version="1.0"?>

<assembly>

    <id>plugin</id>

    <formats>

        <format>zip</format>

    </formats>

    <includeBaseDirectory>false</includeBaseDirectory>

    <dependencySets>

        <dependencySet>

            <outputDirectory>/</outputDirectory>

            <useProjectArtifact>true</useProjectArtifact>

            <useTransitiveFiltering>true</useTransitiveFiltering>

            <excludes>

                <exclude>org.elasticsearch:elasticsearch</exclude>

            </excludes>

        </dependencySet>

        <dependencySet>

            <outputDirectory>/</outputDirectory>

            <useProjectArtifact>true</useProjectArtifact>

            <scope>provided</scope>

        </dependencySet>

    </dependencySets>

</assembly>


- src/main/resources/es-plugin.properties

plugin=org.elasticsearch.plugin.analysis.krlucene.AnalysisKrLucenePlugin


- Build the project, place the generated jar at the path mentioned above, restart ES, and then test as follows.


[Test]

- Create the test index (see the creation command above).

- Test URL

    . http://localhost:9200/test/_analyze?analyzer=krlucene_analyzer&text=이것은 루씬한국어 형태소 분석기 플러그인 입니다.&pretty=1

{
  "tokens" : [ {
    "token" : "이것은",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "이것",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "루씬한국어",
    "start_offset" : 4,
    "end_offset" : 9,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "루씬",
    "start_offset" : 4,
    "end_offset" : 6,
    "type" : "word",
    "position" : 4
  }, {
    "token" : "한국어",
    "start_offset" : 6,
    "end_offset" : 9,
    "type" : "word",
    "position" : 5
  }, {
    "token" : "형태소",
    "start_offset" : 10,
    "end_offset" : 13,
    "type" : "word",
    "position" : 6
  }, {
    "token" : "분석기",
    "start_offset" : 14,
    "end_offset" : 17,
    "type" : "word",
    "position" : 7
  }, {
    "token" : "분석",
    "start_offset" : 14,
    "end_offset" : 16,
    "type" : "word",
    "position" : 8
  }, {
    "token" : "플러그인",
    "start_offset" : 18,
    "end_offset" : 22,
    "type" : "word",
    "position" : 9
  }, {
    "token" : "플러그",
    "start_offset" : 18,
    "end_offset" : 21,
    "type" : "word",
    "position" : 10
  }, {
    "token" : "입니다",
    "start_offset" : 23,
    "end_offset" : 26,
    "type" : "word",
    "position" : 11
  }, {
    "token" : "입니",
    "start_offset" : 23,
    "end_offset" : 25,
    "type" : "word",
    "position" : 12
  } ] 

}


※ If you want to move Lucene from 3.x to 4.x, you will have to modify the code yourself.

- For elasticsearch-analysis-korean there is quite a bit to fix.

    . First, the Lucene Korean morphological analyzer source itself has to be ported from 3.x to 4.x.

    . The related code is available via the CVS link on the Lucene Korean analyzer cafe.

:pserver:anonymous@lucenekorean.cvs.sourceforge.net:/cvsroot/lucenekorean

    . If you also want to bump the ES version, edit pom.xml accordingly:

<properties>
    <elasticsearch.version>0.20.4</elasticsearch.version>
    <lucene.version>3.6.2</lucene.version>
</properties>


- To apply a plugin you built yourself, create it as shown above and drop in the Lucene Korean morphological analyzer library that matches your Lucene version.

    . Of course, you also have to align each library's version in the plugin's pom.xml.



Lucene Korean morphological analyzer: from lucene-core 3.2 to 3.6

Elastic/Elasticsearch 2013. 1. 24. 16:03

If you bump lucene-core from 3.2 to 3.6 while using the lucene kr analyzer, the class below lights up with compile errors.
Here is the fixed code; it is so basic you may wonder whether it was worth writing down ^^;
Anyway, I needed it, so here it is.

[KoreanAnalyzer.java]

/** Builds an analyzer with the stop words from the given file.
 * @see WordlistLoader#getWordSet(File)
 */
public KoreanAnalyzer(Version matchVersion, File stopwords) throws IOException {
    this(matchVersion, WordlistLoader.getWordSet(new InputStreamReader(new FileInputStream(stopwords), DIC_ENCODING), matchVersion));
}

/** Builds an analyzer with the stop words from the given file.
 * @see WordlistLoader#getWordSet(File)
 */
public KoreanAnalyzer(Version matchVersion, File stopwords, String encoding) throws IOException {
    this(matchVersion, WordlistLoader.getWordSet(new InputStreamReader(new FileInputStream(stopwords), encoding), matchVersion));
}

/** Builds an analyzer with the stop words from the given reader.
 * @see WordlistLoader#getWordSet(Reader)
 */
public KoreanAnalyzer(Version matchVersion, Reader stopwords) throws IOException {
    this(matchVersion, WordlistLoader.getWordSet(stopwords, matchVersion));
}

The existing KoreanAnalyzer constructors simply lacked the Version argument, so all I did was add it. :)


Testing the Lucene Korean morphological analyzer locally

Elastic/Elasticsearch 2013. 1. 23. 16:09

Oops... the code below is for 2.X.
It does not work on 3.X.
I will just have to write a new one. ^^;

--------------------------------------------------------------

Special Thanks to : 이창민

Testing the Korean morphological analyzer locally.


On the http://cafe.naver.com/korlucene cafe, the developer himself uploaded krmorph-20091117.war,

a web application that lets you try out the Korean morphological analyzer. (It seems to have been a contest entry.)

The attached file appears to be a manual the developer wrote for that contest submission; it describes the program in detail. (I found it on the cafe.)



krmorph-20091117.war

 

How to run the file

1. In Eclipse, right-click -> Import -> WAR file

2. Select the downloaded krmorph-20091117.war file

3. Finish importing the project

4. Select the project, right-click -> Run As -> Run On Server (Tomcat must already be set up; same as for any ordinary web project)

5. Restart Tomcat


※ Access URL : http://localhost:8080/krmorph-20091117/


This should be handy for testing how the analyzer actually behaves after you modify its source or update the dictionaries.
I needed it yesterday to test why stopwords were not working. Thanks, 창민. :)



Lucene Korean morphological analyzer: dictionary layout and tips

Elastic/Elasticsearch 2013. 1. 22. 14:25

Original source : http://cafe.naver.com/korlucene


The morphological dictionaries consist of 8 files in total. One of them holds syllable information, so in practice there are 7.

The dictionaries live under org/apache/lucene/analysis/kr/dic.

They are all packaged inside the jar, but KoreanAnalyzer first looks for the files on the classpath and

only falls back to the packaged copies if they are missing. So to use customized dictionaries,

place your own files under %CLASSPATH%/org/apache/lucene/analysis/kr/dic.

 

Each dictionary is described in detail below.

 

1. total.dic : base dictionary

The base dictionary, containing both inflecting words (verbs/adjectives) and substantives (nouns). Entries look like this:

================

납부,10011X

================

To the left of the comma (,) is the word; to the right is the word information.

The word information is 6 characters long, and each character encodes a usage rule as follows.

=========================================================

   1       2        3            4             5             6

 noun    verb   other POS    noun+하다     noun+되다    irregular type

=========================================================

Characters 1-3 carry part-of-speech information: whether the word is each of the parts of speech listed above.

Characters 4-5 indicate, for nouns, whether "하다" and "되다" can be attached. Note: for verbs these must be 0.

Character 6 indicates, for verbs, the type of irregular conjugation, as follows.

    B: ㅂ irregular, H: ㅎ irregular, L: 르 irregular, U: ㄹ irregular, S: ㅅ irregular, D: ㄷ irregular, R: 러 irregular, X: regular
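
A minimal sketch of reading one entry in this format; the class and field names are mine, not the analyzer's, and error handling is omitted.

public final class TotalDicEntry {

    public final String word;
    public final boolean noun;          // flag 1
    public final boolean verb;          // flag 2
    public final boolean otherPos;      // flag 3
    public final boolean nounPlusHada;  // flag 4: noun usable with 하다
    public final boolean nounPlusDoeda; // flag 5: noun usable with 되다
    public final char irregularType;    // flag 6: B, H, L, U, S, D, R or X (regular)

    // line is e.g. "납부,10011X"
    public TotalDicEntry(String line) {
        String[] parts = line.split(",");
        this.word = parts[0];
        String flags = parts[1];
        this.noun          = flags.charAt(0) == '1';
        this.verb          = flags.charAt(1) == '1';
        this.otherPos      = flags.charAt(2) == '1';
        this.nounPlusHada  = flags.charAt(3) == '1';
        this.nounPlusDoeda = flags.charAt(4) == '1';
        this.irregularType = flags.charAt(5);
    }
}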

 

2. extension.dic : extension dictionary

It is best to use the base dictionary as-is whenever possible; when the dictionary needs a small supplement,

use the extension dictionary. Its format is identical to the base dictionary.

 

3. josa.dic : postposition (josa) dictionary

A dictionary containing only postpositions, one per line.

 

4. eomi.dic : ending (eomi) dictionary

A dictionary containing only verb endings, one per line.

 

5. prefix.dic : prefix dictionary

When compound nouns are decomposed, only words of 2 or more characters are produced. For a word like "과소비",

this dictionary lets "과" be split off as a prefix so that both "과소비" and "소비" are extracted as index terms.

 

6. suffix.dic : suffix dictionary

When decomposing a compound noun like "현관문", this dictionary lets "문" be split off as a suffix so that

both "현관문" and "현관" are extracted as index terms.

 

7. compounds.dic : pre-analyzed compound noun dictionary

Compound nouns are decomposed by longest match against the noun dictionary. A word like "근로자의날", however, contains a postposition

in the middle and cannot be decomposed that way. Such words are registered in the compound noun dictionary, with the following format.

=========================================

근로자의날:근로자,날

=========================================

To the left of the colon (:) is the compound noun; to the right are the index terms to be extracted with it. So in the case above,

three index terms are extracted: "근로자의날", "근로자", and "날".

 

 

 


Lucene indexing options

Elastic/Elasticsearch 2012. 12. 10. 10:50

A short summary...

※ Store option
Defines whether the field value is stored.
In other words, decide based on whether you need to display the value in search results.

Store.YES : store the value
Store.NO : do not store the value
Store.COMPRESS : store the value compressed (large text or binary content)


※ Index option
Defines whether the field is indexed for search.
The list below is from the 2.x line, so feel free to skip it; in 4.0 everything shows up as deprecated.
Still, it is worth knowing what each option means.

Index.NO : not indexed (the field cannot be searched)
Index.TOKENIZED : indexed for search; the value is tokenized by an analyzer.
Index.UN_TOKENIZED : indexed for search without analysis, so indexing is faster (for numbers or values that need no analysis).
Index.NO_NORMS : indexed for search when indexing speed really matters; no analysis is performed and field length normalization is not applied.
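
A sketch using the pre-4.0 Field constructor these options belong to (3.x naming, where ANALYZED/NOT_ANALYZED replaced TOKENIZED/UN_TOKENIZED; field names and values are illustrative):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class FieldOptionExample {

    public static Document build() {
        Document doc = new Document();
        // stored (can be shown in results) and analyzed for full-text search
        doc.add(new Field("title", "lucene indexing options",
                Field.Store.YES, Field.Index.ANALYZED));
        // searchable as an exact, unanalyzed value; not stored
        doc.add(new Field("docId", "B-0042",
                Field.Store.NO, Field.Index.NOT_ANALYZED));
        // exact value with norms disabled, e.g. when indexing speed matters
        doc.add(new Field("category", "search",
                Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));
        return doc;
    }
}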


http://lucene.apache.org/core/4_0_0/core/index.html

Enum Constant           Description
ANALYZED                Deprecated. Index the tokens produced by running the field's value through an Analyzer.
ANALYZED_NO_NORMS       Deprecated. Expert: Index the tokens produced by running the field's value through an Analyzer, and also separately disable the storing of norms.
NO                      Deprecated. Do not index the field value.
NOT_ANALYZED            Deprecated. Index the field's value without using an Analyzer, so it can be searched.
NOT_ANALYZED_NO_NORMS   Deprecated. Expert: Index the field's value without an Analyzer, and also disable the indexing of norms.
