[Elasticsearch] Fixing the synonym filter position problem

Elastic/Elasticsearch 2018. 4. 24. 15:52

I previously shared a post about testing the synonym filter.

- [Elasticsearch] Synonym filter test

The problem found in that test was that the position information for synonyms came out wrong.

The test ran on a default Elasticsearch 6.x setup.

Below is a comment from the SynonymFilter source code.

[SynonymFilter.java]

Matches single or multi word synonyms in a token stream.
This token stream cannot properly handle position increments != 1, ie, you should place this filter before filtering out stop words.
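
To make the comment above concrete, here is a minimal sketch of how removing a stop word leaves a position increment larger than 1 on the next token. It does not use Lucene; the class `StopGapSketch`, the type `Tok`, and the sample words are hypothetical names chosen for illustration, and the logic only mirrors the general behavior of adding skipped positions to the next kept token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class StopGapSketch {
    /** A term paired with its position increment (hypothetical type, not Lucene's). */
    static final class Tok {
        final String term;
        final int posInc;

        Tok(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }
    }

    /** Drops stop words and adds each skipped position to the next kept token. */
    static List<Tok> removeStopWords(List<String> terms, Set<String> stopWords) {
        List<Tok> out = new ArrayList<>();
        int skipped = 0;
        for (String term : terms) {
            if (stopWords.contains(term)) {
                skipped++;          // remember the hole left by the removed word
                continue;
            }
            out.add(new Tok(term, 1 + skipped));
            skipped = 0;
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("of"));
        List<Tok> toks = removeStopWords(Arrays.asList("king", "of", "pop"), stopWords);
        for (Tok t : toks) {
            System.out.println(t.term + " posInc=" + t.posInc);
        }
    }
}
```

In this sketch "pop" ends up with a position increment of 2, which is exactly the kind of token the comment says SynonymFilter cannot handle properly.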

And below is the code where the problem arises during synonym processing.

[SynonymMap.java]

/** Sugar: analyzes the text with the analyzer and
 *  separates by {@link SynonymMap#WORD_SEPARATOR}.
 *  reuse and its chars must not be null. */
public CharsRef analyze(String text, CharsRefBuilder reuse) throws IOException {
  try (TokenStream ts = analyzer.tokenStream("", text)) {
    CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
    PositionIncrementAttribute posIncAtt = ts.addAttribute(PositionIncrementAttribute.class);
    ts.reset();
    reuse.clear();
    while (ts.incrementToken()) {
      int length = termAtt.length();
      if (length == 0) {
        throw new IllegalArgumentException("term: " + text + " analyzed to a zero-length token");
      }
      if (posIncAtt.getPositionIncrement() != 1) {
        throw new IllegalArgumentException("term: " + text + " analyzed to a token (" + termAtt +
                                           ") with position increment != 1 (got: " + posIncAtt.getPositionIncrement() + ")");
      }
      reuse.grow(reuse.length() + length + 1); /* current + word + separator */
      int end = reuse.length();
      if (reuse.length() > 0) {
        reuse.setCharAt(end++, SynonymMap.WORD_SEPARATOR);
        reuse.setLength(reuse.length() + 1);
      }
      System.arraycopy(termAtt.buffer(), 0, reuse.chars(), end, length);
      reuse.setLength(reuse.length() + length);
    }
    ts.end();
  }
  if (reuse.length() == 0) {
    throw new IllegalArgumentException("term: " + text + " was completely eliminated by analyzer");
  }
  return reuse.get();
}
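
The checks in the method above can be tried without a Lucene dependency. The following is a simplified stand-in, not the Lucene implementation: the class `SynonymJoinSketch` and the type `Tok` are hypothetical, and a `StringBuilder` replaces `CharsRefBuilder`. It only reproduces the three rejections (zero-length token, position increment != 1, empty result) and the joining with the separator character.

```java
import java.util.Arrays;
import java.util.List;

final class SynonymJoinSketch {
    // Lucene's SynonymMap.WORD_SEPARATOR is the NUL character; mirrored here.
    static final char WORD_SEPARATOR = 0;

    /** A term paired with its position increment (hypothetical type, not Lucene's). */
    static final class Tok {
        final String term;
        final int posInc;

        Tok(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }
    }

    /** Joins analyzed tokens with WORD_SEPARATOR, rejecting gaps and stacked tokens. */
    static String join(String text, List<Tok> tokens) {
        StringBuilder sb = new StringBuilder();
        for (Tok t : tokens) {
            if (t.term.isEmpty()) {
                throw new IllegalArgumentException("term: " + text + " analyzed to a zero-length token");
            }
            if (t.posInc != 1) {
                throw new IllegalArgumentException("term: " + text + " analyzed to a token (" + t.term
                        + ") with position increment != 1 (got: " + t.posInc + ")");
            }
            if (sb.length() > 0) {
                sb.append(WORD_SEPARATOR);
            }
            sb.append(t.term);
        }
        if (sb.length() == 0) {
            throw new IllegalArgumentException("term: " + text + " was completely eliminated by analyzer");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Every token advances by exactly one position: accepted.
        String ok = join("king pop", Arrays.asList(new Tok("king", 1), new Tok("pop", 1)));
        System.out.println(ok.replace(WORD_SEPARATOR, '|'));

        // A removed word left a gap (increment 2): rejected.
        try {
            join("king of pop", Arrays.asList(new Tok("king", 1), new Tok("pop", 2)));
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

A position increment of 0 (a token stacked on the previous one) is rejected by the same check as a gap, so any analyzer that emits such tokens for a synonym rule entry triggers the exception.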

The synonym handling problem has already been addressed at the Lucene level.

The relevant reference links are below.

[Reference links]

https://issues.apache.org/jira/browse/LUCENE-6664

http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html

In Elasticsearch, the improved synonym filter can be used as follows.

[Reference links]

https://www.elastic.co/guide/en/elasticsearch/reference/master/analysis-synonym-graph-tokenfilter.html

[Create index]

PUT /syngtest
{
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 0,
    "index": {
      "analysis": {
        "analyzer": {
          "arirang_custom": {
            "tokenizer": "arirang_tokenizer",
            "filter": [
              "lowercase",
              "trim",
              "arirang_filter",
              "custom_synonym"
            ]
          }
        },
        "filter": {
          "custom_synonym": {
            "type": "synonym_graph",
            "synonyms": [
              "henry,헨리,앙리",
              "신해철,마왕"
            ]
          }
        }
      }
    }
  }
}

[Request analyze]

GET /syngtest/_analyze
{
  "tokenizer": "arirang_tokenizer",
  "filter": [
    "lowercase",
    "trim",
    "arirang_filter",
    "custom_synonym"
  ],
  "text": "신해철은 henry"
}

[Analyzed result]

{
  "tokens": [
    {
      "token": "마왕",
      "start_offset": 0,
      "end_offset": 3,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "신해철",
      "start_offset": 0,
      "end_offset": 3,
      "type": "korean",
      "position": 0
    },
    {
      "token": "헨리",
      "start_offset": 5,
      "end_offset": 10,
      "type": "SYNONYM",
      "position": 1
    },
    {
      "token": "앙리",
      "start_offset": 5,
      "end_offset": 10,
      "type": "SYNONYM",
      "position": 1
    },
    {
      "token": "henry",
      "start_offset": 5,
      "end_offset": 10,
      "type": "word",
      "position": 1
    }
  ]
}
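
To read the result above, it helps to recall that a token's position is the running sum of position increments, and that a synonym stacked on the same word carries an increment of 0. The sketch below is an illustration only: the class `PositionSketch` is hypothetical, and the increments `{1, 0, 1, 0, 0}` are an assumption inferred from the positions in the result, not values returned by Elasticsearch.

```java
final class PositionSketch {
    /** Converts position increments into absolute positions, starting from -1. */
    static int[] positions(int[] posIncs) {
        int[] out = new int[posIncs.length];
        int pos = -1;
        for (int i = 0; i < posIncs.length; i++) {
            pos += posIncs[i];   // an increment of 0 keeps the token on the same position
            out[i] = pos;
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed increments for the five tokens of the result, in order.
        int[] posIncs = {1, 0, 1, 0, 0};
        int[] pos = positions(posIncs);
        for (int i = 0; i < pos.length; i++) {
            System.out.println("token[" + i + "] position=" + pos[i]);
        }
    }
}
```

Under that assumption the computed positions are 0, 0, 1, 1, 1, matching the "position" values in the result: each synonym shares the position of the word it was derived from.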

You can see that the problem is solved without modifying any code.

For why this works, see the synonym graph filter documentation linked above.


jjeong
