Posts tagged 'synonym'

[Elasticsearch] Fixing the synonym filter position problem

Elastic/Elasticsearch 2018. 4. 24. 15:52

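I previously shared a post about testing the synonym filter.

- [Elasticsearch] Synonym filter test

The problem that came up in that test was that the position information for synonyms comes out wrong.

The test was run on a stock Elasticsearch 6.x setup.

Below is the comment from the SynonymFilter code.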

[SynonymFilter.java]

Matches single or multi word synonyms in a token stream. This token stream cannot properly handle position increments != 1, ie, you should place this filter before filtering out stop words.

And below is the code where the problem occurs during synonym processing.

[SynonymMap.java]

/** Sugar: analyzes the text with the analyzer and
 *  separates by {@link SynonymMap#WORD_SEPARATOR}.
 *  reuse and its chars must not be null. */
public CharsRef analyze(String text, CharsRefBuilder reuse) throws IOException {
    try (TokenStream ts = analyzer.tokenStream("", text)) {
        CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
        PositionIncrementAttribute posIncAtt = ts.addAttribute(PositionIncrementAttribute.class);
        ts.reset();
        reuse.clear();
        while (ts.incrementToken()) {
            int length = termAtt.length();
            if (length == 0) {
                throw new IllegalArgumentException("term: " + text + " analyzed to a zero-length token");
            }
            if (posIncAtt.getPositionIncrement() != 1) {
                throw new IllegalArgumentException("term: " + text + " analyzed to a token (" + termAtt +
                        ") with position increment != 1 (got: " + posIncAtt.getPositionIncrement() + ")");
            }
            reuse.grow(reuse.length() + length + 1); /* current + word + separator */
            int end = reuse.length();
            if (reuse.length() > 0) {
                reuse.setCharAt(end++, SynonymMap.WORD_SEPARATOR);
                reuse.setLength(reuse.length() + 1);
            }
            System.arraycopy(termAtt.buffer(), 0, reuse.chars(), end, length);
            reuse.setLength(reuse.length() + length);
        }
        ts.end();
    }
    if (reuse.length() == 0) {
        throw new IllegalArgumentException("term: " + text + " was completely eliminated by analyzer");
    }
    return reuse.get();
}
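
To make the check in that method easier to follow, here is a minimal Python sketch of the same logic. It is not Lucene code: the analyzer's TokenStream is replaced by a plain list of (term, position increment) pairs, and the separator value is my assumption about what SynonymMap.WORD_SEPARATOR holds.

```python
# Assumed to play the role of SynonymMap.WORD_SEPARATOR (a NUL character).
WORD_SEPARATOR = "\u0000"


def analyze(text, tokens):
    """Join analyzed tokens, rejecting any gap in positions.

    `tokens` is a list of (term, position_increment) pairs that stands in
    for the analyzer output; this is a simplification for the sketch.
    """
    parts = []
    for term, pos_inc in tokens:
        if len(term) == 0:
            raise ValueError(f"term: {text} analyzed to a zero-length token")
        if pos_inc != 1:
            # A removed stop word (or similar) leaves a gap, so the
            # increment becomes 2 or more and the synonym is rejected.
            raise ValueError(
                f"term: {text} analyzed to a token ({term}) "
                f"with position increment != 1 (got: {pos_inc})"
            )
        parts.append(term)
    if not parts:
        raise ValueError(f"term: {text} was completely eliminated by analyzer")
    return WORD_SEPARATOR.join(parts)
```

For example, analyze("new york", [("new", 1), ("york", 1)]) joins the two words with the separator, while analyze("state of art", [("state", 1), ("art", 2)]) raises, because the dropped word leaves a position increment of 2.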

Fundamentally, this synonym handling problem has already been improved at the Lucene level.

The relevant reference links are below.

[Reference links]

https://issues.apache.org/jira/browse/LUCENE-6664

http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html

The improved synonym filter can be used in Elasticsearch as shown below.

[Reference links]

https://www.elastic.co/guide/en/elasticsearch/reference/master/analysis-synonym-graph-tokenfilter.html

[Create index]

PUT /syngtest
{
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 0,
    "index": {
      "analysis": {
        "analyzer": {
          "arirang_custom": {
            "tokenizer": "arirang_tokenizer",
            "filter": [
              "lowercase",
              "trim",
              "arirang_filter",
              "custom_synonym"
            ]
          }
        },
        "filter": {
          "custom_synonym": {
            "type": "synonym_graph",
            "synonyms": [
              "henry,헨리,앙리",
              "신해철,마왕"
            ]
          }
        }
      }
    }
  }
}

[Request analyze]

GET /syngtest/_analyze
{
  "tokenizer": "arirang_tokenizer",
  "filter": [
    "lowercase",
    "trim",
    "arirang_filter",
    "custom_synonym"
  ],
  "text": "신해철은 henry"
}

[Analyzed result]

{
  "tokens": [
    {
      "token": "마왕",
      "start_offset": 0,
      "end_offset": 3,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "신해철",
      "start_offset": 0,
      "end_offset": 3,
      "type": "korean",
      "position": 0
    },
    {
      "token": "헨리",
      "start_offset": 5,
      "end_offset": 10,
      "type": "SYNONYM",
      "position": 1
    },
    {
      "token": "앙리",
      "start_offset": 5,
      "end_offset": 10,
      "type": "SYNONYM",
      "position": 1
    },
    {
      "token": "henry",
      "start_offset": 5,
      "end_offset": 10,
      "type": "word",
      "position": 1
    }
  ]
}

You can see that the problem is resolved without modifying any code.

As for why it is resolved, see the synonym graph filter documentation linked above.
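
A quick way to check the result is to group the tokens by position and confirm that each synonym shares a position with its original term. The Python sketch below does this with the tokens copied from the response above (offsets omitted); the helper name is my own.

```python
from collections import defaultdict

# Tokens copied from the _analyze response above (offsets omitted).
tokens = [
    {"token": "마왕", "type": "SYNONYM", "position": 0},
    {"token": "신해철", "type": "korean", "position": 0},
    {"token": "헨리", "type": "SYNONYM", "position": 1},
    {"token": "앙리", "type": "SYNONYM", "position": 1},
    {"token": "henry", "type": "word", "position": 1},
]


def group_by_position(tokens):
    """Return {position: [token, ...]} preserving the response order."""
    groups = defaultdict(list)
    for t in tokens:
        groups[t["position"]].append(t["token"])
    return dict(groups)


positions = group_by_position(tokens)
for pos in sorted(positions):
    print(pos, positions[pos])
```

Both synonym groups sit at a single position each (0 and 1), which is what we want to see here.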

License: Attribution-NonCommercial-NoDerivs

:

[Elasticsearch] Using synonym graph in simple query

Elastic/Elasticsearch 2017. 12. 20. 10:11

Keeping this here because I might forget it later.

Ref.

https://www.elastic.co/guide/en/elasticsearch/reference/6.1/query-dsl-simple-query-string-query.html#_synonyms_2

This is a parameter that was added to the simple query string query, and I think that, used well, it could replace a query expansion (query rewrite) feature.

So I'm noting it down for now!

Synonyms

The simple_query_string query supports multi-terms synonym expansion with the synonym_graph token filter. When this filter is used, the parser creates a phrase query for each multi-terms synonyms. For example, the following synonym: "ny, new york" would produce:

(ny OR ("new york"))

It is also possible to match multi terms synonyms with conjunctions instead:

GET /_search
{
  "query": {
    "simple_query_string" : {
      "query" : "ny city",
      "auto_generate_synonyms_phrase_query" : false
    }
  }
}

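To elaborate a little: in e-commerce, it is rare to search using only the query the user typed.

In most cases the query takes the form of the user's query plus expanded terms.

The most common approach is to expand terms through synonyms at index time.

That happens at index time; if you use the feature above well, you can match products through query expansion at query time instead.

I usually call this Query Expansion. This work is also sometimes done in the area called a Query Rewriter.

A simple example:

Say the query "나이키" (Nike) comes in and we apply personalized query expansion. If the user who typed that keyword prefers "운동화" (sneakers), the query actually used for matching becomes "나이키" + "운동화".

This is just an illustration.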
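
As a rough sketch of that idea, the Python below builds a simple_query_string request body from the user's query plus preferred terms. The helper names and the idea of passing preferences in as a list are hypothetical; only simple_query_string and auto_generate_synonyms_phrase_query come from the documentation quoted above.

```python
import json


def expand_query(user_query, preferred_terms):
    """Append expansion terms to the user's query (hypothetical helper)."""
    return " ".join([user_query, *preferred_terms])


def build_search_body(user_query, preferred_terms):
    """Build a request body for GET /_search using simple_query_string."""
    return {
        "query": {
            "simple_query_string": {
                "query": expand_query(user_query, preferred_terms),
                "auto_generate_synonyms_phrase_query": False,
            }
        }
    }


body = build_search_body("나이키", ["운동화"])
print(json.dumps(body, ensure_ascii=False, indent=2))
```

How the added terms should be combined with the original query (required, optional, boosted) depends on the service, so treat the plain concatenation here as a placeholder.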


:

[Lucene] SynonymFilter -> SynonymGraphFilter + FlattenGraphFilter

ITWeb/General Search 2017. 7. 31. 18:37

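I was looking into something today and thought I'd share it.
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-tokenfilter.html

In Lucene 6.6, SynonymFilter is marked @Deprecated.

The replacement filter, as the article also says, is SynonymGraphFilter.

The funny thing is that this one works at search time, so at index time you still have to use SynonymFilter, or follow SynonymGraphFilter with FlattenGraphFilter.

I haven't analyzed it in depth yet ^^;

It was fun to dig into the Lucene code after a long while and test this and that.

Just posting it for your reference.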
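
On the Elasticsearch side, my understanding is that this split looks roughly like the settings below, written as a Python dict so it can be dumped to JSON. The analyzer and filter names and the choice of the standard tokenizer are placeholders for illustration; synonym_graph and flatten_graph are the token filter types I am assuming here.

```python
import json

synonyms = ["henry,헨리,앙리"]

# Hypothetical names: graph_synonym, index_time, search_time.
settings = {
    "settings": {
        "analysis": {
            "filter": {
                "graph_synonym": {"type": "synonym_graph", "synonyms": synonyms},
            },
            "analyzer": {
                # Index time: flatten the token graph after the synonym filter.
                "index_time": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "graph_synonym", "flatten_graph"],
                },
                # Search time: keep the graph so multi-word synonyms work.
                "search_time": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "graph_synonym"],
                },
            },
        }
    }
}

print(json.dumps(settings, ensure_ascii=False, indent=2))
```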


:

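[Elasticsearch] Considerations when using synonyms in Elasticsearch.

Elastic/Elasticsearch 2016. 4. 22. 17:59

I'm not sure these even count as considerations, but I'm writing this to clear my head.

Synonyms can basically be used both at search time and at index time.

For the pros and cons of each, please see the link below.

Reference link)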

https://www.elastic.co/guide/en/elasticsearch/guide/2.x/synonyms-expand-or-contract.html

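To apply synonyms at search time, you must use a match-family query.

Some people use term-family queries and wonder why it doesn't work, so be careful: term queries are not analyzed, so the synonym filter never runs on them.

To apply synonyms at index time, check carefully where the synonym filter sits in the filter order.

A filter placed earlier in the chain can prevent the synonyms from being applied, so be careful.

With index-time synonyms, the term-family queries that did not work at search time are supported, so choose whichever fits your requirements.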
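
The difference can be illustrated with a toy model in Python. Nothing here talks to Elasticsearch: the "index" is just a set of terms and the analyzer is a made-up function, so treat it as a sketch of the behavior rather than how the engine is implemented.

```python
SYNONYMS = {"henry": ["henry", "헨리", "앙리"]}


def toy_analyze(text):
    """Toy analyzer: lowercase, split on whitespace, expand synonyms."""
    out = []
    for tok in text.lower().split():
        out.extend(SYNONYMS.get(tok, [tok]))
    return out


def match_query(index_terms, text):
    # match-style: the query text is analyzed, so synonyms are applied
    return any(t in index_terms for t in toy_analyze(text))


def term_query(index_terms, term):
    # term-style: the term is looked up exactly as given, no analysis
    return term in index_terms


# Document indexed WITHOUT synonyms: only the literal term is stored.
plain_index = {"앙리"}
print(match_query(plain_index, "henry"))  # True
print(term_query(plain_index, "henry"))   # False

# Document indexed WITH synonyms: every synonym is stored.
expanded_index = set(toy_analyze("henry"))
print(term_query(expanded_index, "앙리"))  # True
```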


:

[Elasticsearch] Example index settings for applying synonyms

Elastic/Elasticsearch 2016. 3. 17. 18:34

Recording this in case I forget again later.

Reference docs)

https://www.elastic.co/guide/en/elasticsearch/guide/current/synonyms.html

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-tokenfilter.html

Example)

"index": {
  "analysis": {
    "analyzer": {
      "arirang_custom": {
        "type": "custom",
        "tokenizer": "arirang_tokenizer",
        "filter": ["lowercase", "trim", "arirang_filter"]
      },
      "arirang_custom_searcher": {
        "tokenizer": "arirang_tokenizer",
        "filter": ["lowercase", "trim", "arirang_filter", "meme_synonym"]
      }
    },
    "filter": {
      "meme_synonym": {
        "type": "synonym",
        "synonyms": [
          "henry,헨리,앙리"
        ]
      }
    }
  }
}

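Here are a few points to watch out for.

1. When creating the synonym analyzer, either declare type as custom or do not declare type at all.

2. Create the synonym as a filter and assign it to the analyzer as a filter.

3. Decide whether to use it at index time or at query time based on the pros and cons and the characteristics of your service.

4. Use synonyms_path. (This is less a caution than a matter of manageability.)

5. With search-time synonyms only match-type queries work; if you want to use term-type queries, you must apply the synonyms at index time.

So what does "not declaring it" in item 1 mean?

If you don't declare it, the analyzer is simply created as custom.

For those who don't believe me, here is the source code.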

[AnalysisModule.java]

String typeName = analyzerSettings.get("type");
Class<? extends AnalyzerProvider> type;
if (typeName == null) {
    if (analyzerSettings.get("tokenizer") != null) {
        // custom analyzer, need to add it
        type = CustomAnalyzerProvider.class;
    } else {
        throw new IllegalArgumentException("Analyzer [" + analyzerName + "] must have a type associated with it");
    }
} else if (typeName.equals("custom")) {
    type = CustomAnalyzerProvider.class;
} else {
    type = analyzersBindings.analyzers.get(typeName);
    if (type == null) {
        throw new IllegalArgumentException("Unknown Analyzer type [" + typeName + "] for [" + analyzerName + "]");
    }
}
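
If Java is not your thing, the same decision can be read as the Python sketch below. It only mirrors the branching of the snippet above; KNOWN_TYPES is a made-up stand-in for the registered analyzer bindings.

```python
# Illustrative subset standing in for analyzersBindings.analyzers.
KNOWN_TYPES = {"standard", "simple", "whitespace"}


def resolve_analyzer_type(name, analyzer_settings):
    """Return the analyzer type, defaulting to custom when a tokenizer is set."""
    type_name = analyzer_settings.get("type")
    if type_name is None:
        if analyzer_settings.get("tokenizer") is not None:
            # No type but a tokenizer: treated as a custom analyzer.
            return "custom"
        raise ValueError(f"Analyzer [{name}] must have a type associated with it")
    if type_name == "custom":
        return "custom"
    if type_name not in KNOWN_TYPES:
        raise ValueError(f"Unknown Analyzer type [{type_name}] for [{name}]")
    return type_name


searcher = {
    "tokenizer": "arirang_tokenizer",
    "filter": ["lowercase", "trim", "arirang_filter", "meme_synonym"],
}
print(resolve_analyzer_type("arirang_custom_searcher", searcher))  # custom
```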


:

[Elasticsearch] A look at synonym expand or contract

Elastic/Elasticsearch 2016. 2. 26. 15:00

Synonyms are a very useful feature. Before using them, though, I recommend understanding the pros and cons of applying them at index time versus search time.

The content is taken from "The Definitive Guide" on elastic.co.

[Original link]

https://www.elastic.co/guide/en/elasticsearch/guide/current/synonyms-expand-or-contract.html

[Original snippet]

Simple Expansion

With simple expansion, any of the listed synonyms is expanded into all of the listed synonyms:

"jump,hop,leap"

Expansion can be applied either at index time or at query time.
Each has advantages (⬆)︎ and disadvantages (⬇)︎.
When to use which comes down to performance versus flexibility.

Index size

Index time: ⬇︎ Bigger index because all synonyms must be indexed.
Query time: ⬆︎ Normal.

Relevance

Index time: ⬇︎ All synonyms will have the same IDF (see What Is Relevance?), meaning that more commonly used words will have the same weight as less commonly used words.
Query time: ⬆︎ The IDF for each synonym will be correct.

Performance

Index time: ⬆︎ A query needs to find only the single term specified in the query string.
Query time: ⬇︎ A query for a single term is rewritten to look up all synonyms, which decreases performance.

Flexibility

Index time: ⬇︎ The synonym rules can’t be changed for existing documents. For the new rules to have effect, existing documents have to be reindexed.
Query time: ⬆︎ Synonym rules can be updated without reindexing documents.

Simple Contraction

Simple contraction maps a group of synonyms on the left side to a single value on the right side:

"leap,hop => jump"

It must be applied both at index time and at query time, to ensure that query terms are mapped to the same single value that exists in the index.

This approach has some advantages and some disadvantages compared to the simple expansion approach:

Index size

⬆︎ The index size is normal, as only a single term is indexed.

Relevance

⬇︎ The IDF for all terms is the same, so you can’t distinguish between more commonly used words and less commonly used words.

Performance

⬆︎ A query needs to find only the single term that appears in the index.

Flexibility

⬆︎ New synonyms can be added to the left side of the rule and applied at query time.
For instance, imagine that we wanted to add the word bound to the rule specified previously.
The following rule would work for queries that contain bound or for newly added documents that contain bound:

"leap,hop,bound => jump"

But we could expand the effect to also take into account existing documents that contain bound by writing the rule as follows:

"leap,hop,bound => jump,bound"

When you reindex your documents, you could revert to the previous rule to gain the performance benefit of querying only a single term.

Genre Expansion

Genre expansion is quite different from simple contraction or expansion.
Instead of treating all synonyms as equal, genre expansion widens the meaning of a term to be more generic. Take these rules, for example:

"cat    => cat,pet",
"kitten => kitten,cat,pet",
"dog    => dog,pet",
"puppy  => puppy,dog,pet"

By applying genre expansion at index time:

A query for kitten would find just documents about kittens.
A query for cat would find documents about kittens and cats.
A query for pet would find documents about kittens, cats, puppies, dogs, or pets.

Alternatively, by applying genre expansion at query time,
a query for kitten would be expanded to return documents that mention kittens, cats, or pets specifically.

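According to the document, applying synonyms at search time comes out as more advantageous than applying them at index time.

Still, I think that using simple expansion and contraction appropriately could improve both search performance and quality.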
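
To see the index-time genre expansion from the snippet in action, here is a small Python simulation. The documents and the one-word-per-document setup are invented for the example; only the four rules come from the guide.

```python
GENRE_RULES = {
    "cat": ["cat", "pet"],
    "kitten": ["kitten", "cat", "pet"],
    "dog": ["dog", "pet"],
    "puppy": ["puppy", "dog", "pet"],
}


def index_terms(word):
    """Terms stored for a word when genre expansion runs at index time."""
    return set(GENRE_RULES.get(word, [word]))


# Hypothetical documents, each containing a single word.
docs = {"doc_kitten": "kitten", "doc_cat": "cat", "doc_puppy": "puppy", "doc_dog": "dog"}
index = {doc_id: index_terms(word) for doc_id, word in docs.items()}


def search(term):
    """Return the ids of documents whose indexed terms contain `term`."""
    return sorted(doc_id for doc_id, terms in index.items() if term in terms)


print(search("kitten"))  # ['doc_kitten']
print(search("cat"))     # ['doc_cat', 'doc_kitten']
print(search("pet"))     # ['doc_cat', 'doc_dog', 'doc_kitten', 'doc_puppy']
```

The more generic the query term, the more documents it reaches, which matches the three bullet points quoted above.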


:

[Elasticsearch] A look at formatting synonyms

Elastic/Elasticsearch 2016. 2. 26. 14:48

Some of you may be wondering how synonyms are laid out, so here is a summary.

The content is taken from "The Definitive Guide" on elastic.co.

[Original link]

https://www.elastic.co/guide/en/elasticsearch/guide/current/synonym-formats.html

[Original snippet]

In their simplest form, synonyms are listed as comma-separated values:

"jump,leap,hop"

If any of these terms is encountered, it is replaced by all of the listed synonyms. For instance:

Original terms:   Replaced by:
────────────────────────────────
jump            → (jump,leap,hop)
leap            → (jump,leap,hop)
hop             → (jump,leap,hop)

Alternatively, with the => syntax, it is possible to specify a list of terms to match (on the left side),
and a list of one or more replacements (on the right side):

"u s a,united states,united states of america => usa"
"g b,gb,great britain => britain,england,scotland,wales"

Original terms:   Replaced by:
────────────────────────────────
u s a           → (usa)
united states   → (usa)
great britain   → (britain,england,scotland,wales)

If multiple rules for the same synonyms are specified, they are merged together.
The order of rules is not respected. Instead, the longest matching rule wins.
Take the following rules as an example:

"united states            => usa",
"united states of america => usa"

If these rules conflicted, Elasticsearch would turn United States of America into the terms (usa),(of), (america). Instead, the longest sequence wins, and we end up with just the term (usa).

The snippet shows two ways of writing the rules.

Case 1) Listing comma-separated terms inside double quotes, as in "jump,leap,hop"

When the term jump occurs at index time, the two terms leap and hop are added and indexed along with it.

Because of that, the index size can grow.

Case 2) Explicit mapping with the => symbol

This replaces the terms on the left side with the terms on the right side.
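
The "longest matching rule wins" part of the snippet is easier to see in code. Below is a greedy Python sketch over a list of tokens using the two "united states" rules from the guide. It is a simplification I wrote for illustration; the real filter is implemented differently inside Lucene.

```python
# Multi-word rules keyed by the token sequence on the left side of "=>".
RULES = {
    ("united", "states"): ["usa"],
    ("united", "states", "of", "america"): ["usa"],
}
MAX_LEN = max(len(key) for key in RULES)


def apply_synonyms(tokens):
    """Replace token sequences, always trying the longest candidate first."""
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in RULES:
                out.extend(RULES[key])
                i += n
                break
        else:
            # No rule starts here: keep the token as it is.
            out.append(tokens[i])
            i += 1
    return out


print(apply_synonyms("united states of america".split()))  # ['usa']
print(apply_synonyms("united states of mexico".split()))   # ['usa', 'of', 'mexico']
```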


:

[Elasticsearch] synonyms_path configuration

Elastic/Elasticsearch 2016. 2. 26. 14:30

To use the synonym feature, the dictionary file has to be placed on the engine.

Some of you may be wondering what path to use, so I'm recording it here.

[Original link]

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-tokenfilter.html

[Original snippet]

The above configures a synonym filter, with a path of analysis/synonym.txt (relative to the config location).
The synonym analyzer is then configured with the filter.
Additional settings are: ignore_case (defaults to false), and expand (defaults to true).

As you can see from the text, it is a relative path.

After unpacking elasticsearch, create an analysis folder under the config folder and place the synonym.txt file inside it.
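
In other words, the lookup behaves roughly like the Python sketch below. The config directory used in the example is a made-up location, and treating absolute paths as "use as given" is my assumption about the behavior, not something the snippet above states.

```python
from pathlib import PurePosixPath


def resolve_synonyms_path(config_dir, synonyms_path):
    """Resolve synonyms_path against the config directory when relative."""
    path = PurePosixPath(synonyms_path)
    if path.is_absolute():
        # Assumption: an absolute path is used as given.
        return path
    return PurePosixPath(config_dir) / path


# Hypothetical install location, for illustration only.
resolved = resolve_synonyms_path("/opt/elasticsearch/config", "analysis/synonym.txt")
print(resolved)
```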


:

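Cautions when applying synonyms in elasticsearch.

Elastic/Elasticsearch 2012. 12. 24. 11:07

For synonym-related details, see the post below.
Here I'm sharing the mistakes I made while applying it.

1. If you have set up a cluster, you must create the synonym.txt file on every server.
: It sounds obvious, but I completely forgot that I had set up a cluster, applied it to only one server, and kept wondering why it didn't work. ㅡ.ㅡ;;

2. Specifying the synonym.txt file location
: It's on elasticsearch.org too, but I didn't read the docs properly and wasted time wondering where to put the file.
: The path is resolved relative to the config folder under the default install path. (analysis/synonym.txt)
: Looking at the source code, you can also give a full path. (Reading the source is a good way to resolve errors.)
: Environment.java (it's under the org.elasticsearch.env package.)

Also, I tested using the Solr synonym format.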

:

Using a synonym dictionary in Elasticsearch

Elastic/Elasticsearch 2012. 12. 18. 22:22

[title=Elasticsearch synonym settings]

- Must be configured when the index is created

- The base kr_analysis must already be applied

- Without it, Korean text is not processed

- Create the synonym.txt file in an appropriate location

- http://www.elasticsearch.org/guide/reference/index-modules/analysis/synonym-tokenfilter.html

[title=synonym.txt sample]

Solr synonyms

The following is a sample format of the file:

# blank lines and lines starting with pound are comments.

#Explicit mappings match any token sequence on the LHS of "=>"
#and replace with all alternatives on the RHS. These types of mappings
#ignore the expand parameter in the schema.
#Examples:
i-pod, i pod => ipod,
sea biscuit, sea biscit => seabiscuit

#Equivalent synonyms may be separated with commas and give
#no explicit mapping. In this case the mapping behavior will
#be taken from the expand parameter in the schema. This allows
#the same synonym file to be used in different synonym handling strategies.
#Examples:
ipod, i-pod, i pod
foozball , foosball
universe , cosmos

# If expand==true, "ipod, i-pod, i pod" is equivalent to the explicit mapping:
ipod, i-pod, i pod => ipod, i-pod, i pod
# If expand==false, "ipod, i-pod, i pod" is equivalent to the explicit mapping:
ipod, i-pod, i pod => ipod

#multiple synonym mapping entries are merged.
foo => foo bar
foo => baz
#is equivalent to
foo => foo bar, baz
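
To check my reading of the rules in that sample (comments, explicit mappings, the expand parameter, and merging), here is a small Python parser. It is only a sketch of the format as described in the comments above; the function name and the dict it returns are my own choices, not part of Solr or Elasticsearch.

```python
def parse_synonyms(text, expand=True):
    """Parse Solr-style synonym lines into {source term: [replacements]}."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are skipped
        if "=>" in line:
            # Explicit mapping: the expand parameter is ignored.
            lhs, rhs = line.split("=>", 1)
            sources = [s.strip() for s in lhs.split(",") if s.strip()]
            targets = [t.strip() for t in rhs.split(",") if t.strip()]
        else:
            # Equivalent synonyms: behavior depends on expand.
            terms = [t.strip() for t in line.split(",") if t.strip()]
            sources = terms
            targets = terms if expand else terms[:1]
        for src in sources:
            merged = mapping.setdefault(src, [])
            for tgt in targets:
                if tgt not in merged:
                    merged.append(tgt)  # multiple entries are merged
    return mapping


sample = """
# comment line
ipod, i-pod, i pod
foo => foo bar
foo => baz
"""
print(parse_synonyms(sample)["i pod"])                # ['ipod', 'i-pod', 'i pod']
print(parse_synonyms(sample, expand=False)["i-pod"])  # ['ipod']
print(parse_synonyms(sample)["foo"])                  # ['foo bar', 'baz']
```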

[Index creation sample code - with synonym applied]

curl -XPUT 'http://localhost:9200/test' -d '{
  "settings" : {
    "number_of_shards" : 5,
    "number_of_replicas" : 1,
    "index" : {
      "analysis" : {
        "analyzer" : {
          "kr_analyzer" : {
            "type" : "custom",
            "tokenizer" : "kr_tokenizer",
            "filter" : ["trim", "kr_filter", "kr_synonym"]
          }
        },
        "filter" : {
          "kr_synonym" : {
            "type" : "synonym",
            "synonyms_path" : "analysis/synonym.txt"
          }
        }
      }
    }
  }
}'

~~[title=Index creation sample code]~~

~~curl -XPUT 'http://10.101.254.223:9200/test' -d '{~~

~~"settings" : {~~

~~"number_of_shards" : 5,~~

~~"number_of_replicas" : 1~~

},

~~"index" : {~~

~~"analysis" : {~~

~~"analyzer" : {~~

~~"synonym" : {~~

~~"tokenizer" : "kr_analyzer",~~

~~"filter" : ["synonym"]~~

}

},

~~"filter" : {~~

~~"synonym" : {~~

~~"type" : "synonym",~~

~~"synonyms_path" : "/home/<account>/apps/elasticsearch/plugins/analysis-korean/analysis/synonym.txt"~~

}

},

~~"mappings" : {~~

~~"docs" : {~~

~~"properties" : {~~

~~...................................(For this part, please refer to other documents.)~~

}

}'

:

jjeong

10 posts for 'synonym'
