Posts in the 'Elastic/Elasticsearch' category (page 16)

[Elasticsearch] Synonym expansion or contraction

Elastic/Elasticsearch 2016. 2. 26. 15:00

Synonyms are a very useful feature. Before using them, though, I recommend understanding the trade-offs of applying them at index time versus at search time.

The content below is taken from "The Definitive Guide" on elastic.co.

[Original link]

https://www.elastic.co/guide/en/elasticsearch/guide/current/synonyms-expand-or-contract.html

[Original snippet]

Simple Expansion

With simple expansion, any of the listed synonyms is expanded into all of the listed synonyms:

"jump,hop,leap"

Expansion can be applied either at index time or at query time.
Each has advantages (⬆)︎ and disadvantages (⬇)︎.
When to use which comes down to performance versus flexibility.

| | Index time | Query time |
|---|---|---|
| Index size | ⬇︎ Bigger index because all synonyms must be indexed. | ⬆︎ Normal. |
| Relevance | ⬇︎ All synonyms will have the same IDF (see What Is Relevance?), meaning that more commonly used words will have the same weight as less commonly used words. | ⬆︎ The IDF for each synonym will be correct. |
| Performance | ⬆︎ A query needs to find only the single term specified in the query string. | ⬇︎ A query for a single term is rewritten to look up all synonyms, which decreases performance. |
| Flexibility | ⬇︎ The synonym rules can’t be changed for existing documents. For the new rules to have effect, existing documents have to be reindexed. | ⬆︎ Synonym rules can be updated without reindexing documents. |
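
The trade-off in the table can be made concrete with a toy inverted index. This is a minimal Python sketch under stated assumptions, not Elasticsearch code: tokens are split on whitespace, and the names expand, build_index, search and the two sample documents are hypothetical, made up for illustration.

```python
from collections import defaultdict

SYNONYMS = {"jump", "hop", "leap"}  # the rule "jump,hop,leap"


def expand(term):
    """Simple expansion: any listed synonym becomes all listed synonyms."""
    return sorted(SYNONYMS) if term in SYNONYMS else [term]


def build_index(docs, expand_at_index_time):
    """Build a toy inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            terms = expand(token) if expand_at_index_time else [token]
            for term in terms:
                index[term].add(doc_id)
    return index


def search(index, term, expand_at_query_time):
    """Look up one term, or every synonym of it when expanding at query time."""
    terms = expand(term) if expand_at_query_time else [term]
    hits = set()
    for t in terms:
        hits |= index.get(t, set())
    return hits


docs = {1: "rabbits hop", 2: "frogs leap"}  # hypothetical sample documents

index_time = build_index(docs, expand_at_index_time=True)
query_time = build_index(docs, expand_at_index_time=False)

# Index-time expansion: more terms in the index, one lookup per query term.
hits_a = search(index_time, "jump", expand_at_query_time=False)
# Query-time expansion: normal index, one lookup per synonym.
hits_b = search(query_time, "jump", expand_at_query_time=True)
```

Both searches return the same two documents; what differs is where the cost lands. The index-time variant stores five distinct terms instead of four, while the query-time variant performs three lookups instead of one.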

Simple Contraction

Simple contraction maps a group of synonyms on the left side to a single value on the right side:

"leap,hop => jump"

It must be applied both at index time and at query time, to ensure that query terms are mapped to the same single value that exists in the index.

This approach has some advantages and some disadvantages compared to the simple expansion approach:

Index size

⬆︎ The index size is normal, as only a single term is indexed.

Relevance

⬇︎ The IDF for all terms is the same, so you can’t distinguish between more commonly used words and less commonly used words.

Performance

⬆︎ A query needs to find only the single term that appears in the index.

Flexibility

⬆︎ New synonyms can be added to the left side of the rule and applied at query time.
For instance, imagine that we wanted to add the word bound to the rule specified previously.
The following rule would work for queries that contain bound or for newly added documents that contain bound:

"leap,hop,bound => jump"

But we could expand the effect to also take into account existing documents that contain bound by writing the rule as follows:

"leap,hop,bound => jump,bound"

When you reindex your documents, you could revert to the previous rule to gain the performance benefit of querying only a single term.
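
The bound example can be traced in a few lines. The following is a rough Python sketch, assuming whitespace tokenization; parse_rule, analyze, search and the two sample documents are hypothetical names, and the real token filter is considerably more involved.

```python
def parse_rule(rule):
    """Parse 'a,b => x,y' into {a: [x, y], b: [x, y]}."""
    left, right = rule.split("=>")
    targets = [t.strip() for t in right.split(",")]
    return {term.strip(): targets for term in left.split(",")}


def analyze(text, mapping):
    """Replace each token by its mapped terms; unmapped tokens pass through."""
    terms = []
    for token in text.split():
        terms.extend(mapping.get(token, [token]))
    return terms


old_rule = parse_rule("leap,hop => jump")

# Documents that were indexed while only the old rule existed.
indexed = {
    1: set(analyze("frogs leap", old_rule)),  # stored as frogs, jump
    2: set(analyze("deer bound", old_rule)),  # stored as deer, bound
}


def search(query, mapping):
    """Return ids of documents that contain any analyzed query term."""
    terms = set(analyze(query, mapping))
    return {doc_id for doc_id, doc_terms in indexed.items() if terms & doc_terms}


narrow = parse_rule("leap,hop,bound => jump")
wide = parse_rule("leap,hop,bound => jump,bound")

hits_narrow = search("bound", narrow)  # query becomes "jump" only
hits_wide = search("bound", wide)      # query becomes "jump" and "bound"
```

With the narrow rule, the existing document that still contains the raw term bound is missed; the wide rule finds it at the price of querying two terms until the documents are reindexed.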

Genre Expansion

Genre expansion is quite different from simple contraction or expansion.
Instead of treating all synonyms as equal, genre expansion widens the meaning of a term to be more generic. Take these rules, for example:

"cat    => cat,pet",
"kitten => kitten,cat,pet",
"dog    => dog,pet",
"puppy  => puppy,dog,pet"

By applying genre expansion at index time:

A query for kitten would find just documents about kittens.
A query for cat would find documents about kittens and cats.
A query for pet would find documents about kittens, cats, puppies, dogs, or pets.

Alternatively, by applying genre expansion at query time,
a query for kitten would be expanded to return documents that mention kittens, cats, or pets specifically.
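
The difference between the two application points can be checked with a small model. This is only an illustrative Python sketch: each hypothetical document holds a single term, and search_index_time and search_query_time are made-up helper names.

```python
GENRE_RULES = {
    "cat": ["cat", "pet"],
    "kitten": ["kitten", "cat", "pet"],
    "dog": ["dog", "pet"],
    "puppy": ["puppy", "dog", "pet"],
}

# Hypothetical one-term documents, keyed by document id.
DOCS = {1: "kitten", 2: "cat", 3: "puppy", 4: "dog", 5: "pet"}


def expand(term):
    """Apply genre expansion; terms without a rule are left unchanged."""
    return GENRE_RULES.get(term, [term])


def search_index_time(query):
    """Documents were expanded when indexed; the query term is used as is."""
    return {doc_id for doc_id, term in DOCS.items() if query in expand(term)}


def search_query_time(query):
    """Documents are indexed as is; the query term is expanded instead."""
    wanted = set(expand(query))
    return {doc_id for doc_id, term in DOCS.items() if term in wanted}
```

In this model search_index_time("kitten") matches only the kitten document, whereas search_query_time("kitten") also matches the cat and pet documents, which mirrors the behavior described in the guide.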

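Reading the guide, applying synonyms at search time comes out as more advantageous than applying them at index time.

Still, I think that combining simple expansion and contraction appropriately could improve both search performance and search quality.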

Attribution, NonCommercial, NoDerivs


[Elasticsearch] Formatting synonyms

Elastic/Elasticsearch 2016. 2. 26. 14:48

Some of you may be wondering how synonym rules are written, so here is a summary.

The content is taken from "The Definitive Guide" on elastic.co.

[Original link]

https://www.elastic.co/guide/en/elasticsearch/guide/current/synonym-formats.html

[Original snippet]

In their simplest form, synonyms are listed as comma-separated values:

"jump,leap,hop"

If any of these terms is encountered, it is replaced by all of the listed synonyms. For instance:

Original terms:   Replaced by:
────────────────────────────────
jump            → (jump,leap,hop)
leap            → (jump,leap,hop)
hop             → (jump,leap,hop)

Alternatively, with the => syntax, it is possible to specify a list of terms to match (on the left side),
and a list of one or more replacements (on the right side):

"u s a,united states,united states of america => usa"
"g b,gb,great britain => britain,england,scotland,wales"

Original terms:   Replaced by:
────────────────────────────────
u s a           → (usa)
united states   → (usa)
great britain   → (britain,england,scotland,wales)

If multiple rules for the same synonyms are specified, they are merged together.
The order of rules is not respected. Instead, the longest matching rule wins.
Take the following rules as an example:

"united states            => usa",
"united states of america => usa"

If these rules conflicted, Elasticsearch would turn United States of America into the terms (usa),(of), (america). Instead, the longest sequence wins, and we end up with just the term (usa).
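
The "longest matching rule wins" behavior can be approximated with a greedy scan. This is a simplified Python sketch, not Lucene's actual implementation; RULES and apply_synonyms are hypothetical names, and the input is assumed to be lowercased and split on whitespace already.

```python
# Multi-word rules keyed by their token sequence.
RULES = {
    ("united", "states"): ["usa"],
    ("united", "states", "of", "america"): ["usa"],
}
MAX_LEN = max(len(key) for key in RULES)


def apply_synonyms(tokens):
    """At each position, apply the longest rule that matches."""
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, then progressively shorter ones.
        for size in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + size])
            if key in RULES:
                out.extend(RULES[key])
                i += size
                break
        else:
            # No rule matched at this position; keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


result = apply_synonyms("united states of america".split())
```

Because the four-token rule is tried before the two-token rule, result contains only usa rather than usa followed by of and america.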

The snippet shows two ways of writing a rule.

Case 1) A comma-separated list inside double quotes, such as "jump,leap,hop"

When the term jump occurs at index time, the two terms leap and hop are added and indexed along with it.

Because of this, the index size can grow.

Case 2) The alternative form using the => symbol

This form replaces the terms on the left side with the terms on the right side.
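
The two cases can be expressed as one small parser. This is a minimal Python sketch, assuming the default expand behavior for the comma-separated form; parse_synonym_line is a hypothetical helper, not part of any Elasticsearch client.

```python
def parse_synonym_line(line):
    """Return {input term: [replacement terms]} for one synonym rule."""
    if "=>" in line:
        # Case 2: terms on the left are replaced by the terms on the right.
        left, right = line.split("=>", 1)
        inputs = [t.strip() for t in left.split(",")]
        outputs = [t.strip() for t in right.split(",")]
    else:
        # Case 1: every listed term is replaced by all listed terms.
        inputs = outputs = [t.strip() for t in line.split(",")]
    return {term: outputs for term in inputs}


case1 = parse_synonym_line("jump,leap,hop")
case2 = parse_synonym_line("g b,gb,great britain => britain,england,scotland,wales")
```

Here case1 maps each of jump, leap and hop to all three terms, and case2 maps g b, gb and great britain to the four replacement terms, matching the tables in the snippet.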


[Elasticsearch] synonyms_path settings

Elastic/Elasticsearch 2016. 2. 26. 14:30

To use the synonym feature, the dictionary file has to be placed where the engine can find it.

Some of you may be wondering what that path is, so I'm noting it here.

[Original link]

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-tokenfilter.html

[Original snippet]

The above configures a synonym filter, with a path of analysis/synonym.txt (relative to the config location).
The synonym analyzer is then configured with the filter.
Additional settings are: ignore_case (defaults to false), and expand (defaults to true).

As the snippet says, the path is relative.

After unpacking Elasticsearch, create an analysis folder under the config folder and put the synonym.txt file in it.
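
For orientation, here is a Python sketch of the kind of settings the snippet refers to and of how the relative path resolves. The settings shape is reconstructed from the reference page, so treat it as an assumption; ES_HOME is a hypothetical install directory, and the analyzer and filter are both simply named synonym.

```python
import json
from pathlib import PurePosixPath

ES_HOME = PurePosixPath("/opt/elasticsearch")  # hypothetical unpack location

# Index settings with a synonym token filter that reads its rules from a file.
index_settings = {
    "index": {
        "analysis": {
            "analyzer": {
                "synonym": {
                    "tokenizer": "whitespace",
                    "filter": ["synonym"],
                }
            },
            "filter": {
                "synonym": {
                    "type": "synonym",
                    "synonyms_path": "analysis/synonym.txt",
                }
            },
        }
    }
}

# synonyms_path is interpreted relative to the config directory.
synonyms_path = index_settings["index"]["analysis"]["filter"]["synonym"]["synonyms_path"]
resolved = ES_HOME / "config" / synonyms_path

print(json.dumps(index_settings, indent=2))
print(resolved)
```

With the assumed install directory, the file is expected at /opt/elasticsearch/config/analysis/synonym.txt.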


[Elasticsearch] Delimited Payload Token Filter

Elastic/Elasticsearch 2016. 2. 23. 16:31

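I had not gone through the whole API since the move to 2.x, but while setting up a morphological analyzer something caught my eye, so I'm noting it here.

Original link)

https://www.elastic.co/guide/en/elasticsearch/reference/2.2/analysis-delimited-payload-tokenfilter.html

Original excerpt)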

Named delimited_payload_filter. Splits tokens into tokens and payload whenever a delimiter character is found.

Example: "the|1 quick|2 fox|3" is split by default into tokens the, quick, and fox with payloads 1, 2, and 3 respectively.

Parameters:

delimiter

Character used for splitting the tokens. Default is |.

encoding

The type of the payload. int for integer, float for float and identity for characters. Default is float.
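
What the filter does to the example input can be imitated in plain Python. This is an illustrative sketch only: split_payloads is a hypothetical function, whitespace tokenization is assumed, and the real filter stores each payload as encoded bytes on the token rather than as a Python value.

```python
def split_payloads(text, delimiter="|", encoding="float"):
    """Split 'term|payload' tokens into (term, payload) pairs."""
    decoders = {"float": float, "int": int, "identity": str}
    decode = decoders[encoding]
    pairs = []
    for token in text.split():
        # Assumption: split at the first delimiter in the token.
        term, sep, payload = token.partition(delimiter)
        # Tokens without a delimiter carry no payload.
        pairs.append((term, decode(payload) if sep else None))
    return pairs


pairs = split_payloads("the|1 quick|2 fox|3")
```

With the default float encoding, pairs holds the, quick and fox with payloads 1.0, 2.0 and 3.0.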

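In the past I generated per-document data such as ranking, boosting, and keyword scores (document weights, rankings, and recommendation data derived from search logs), stored it in the documents, and used it at query time.

This token filter did not exist when I first used 0.90, so I wrote a separate script plugin instead.

As a result, the string operations ran inside the script, which caused performance problems. With this feature it looks like the same thing can be implemented easily and without that performance issue.

The API seems to have been available since 1.3; I wonder how I missed it all this time.

I'll experiment with it and share the results.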


[Elasticsearch] Gradle issue when importing the latest Elasticsearch into IntelliJ

Elastic/Elasticsearch 2016. 2. 1. 23:27

The symptom is the error below, which prevents importing the master branch of a cloned elasticsearch into IntelliJ.

I have not had time to check whether this happens only to me or is an environment problem, so I'm recording it first.

The workaround is noted below.

[Error message]

- build.gradle contains the following condition.

if (System.getProperty('idea.active') != null && ideaMarker.exists() == false) {
    throw new GradleException('You must run gradle idea from the root of elasticsearch before importing into IntelliJ')
}

[Workaround]

- Check out another branch that is still organized as a Maven project and import that into IntelliJ.

$ git checkout 2.2


[Elasticsearch] es blog - this week in es 2016.01.25

Elastic/Elasticsearch 2016. 1. 26. 18:45

Just tossing this out there.

[Original link]

https://www.elastic.co/blog/this-week-in-elasticsearch-and-apache-lucene-2016-01-25

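I can't remember whether I shared this before, but it looks like the jar hell check can now be disabled. (lol)

That check gave me a really hard time when testing...

Another thing that stands out: scripting now supports throw and try/catch.

Also, on restart the existing primary is reused as is... (yes, that's how it should be.)

There is plenty more besides.

But, hmm, is it because es has grown so big?

It somehow feels a little less active than before.

(lol) It's probably just because I only recently became unemployed.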


[Elasticsearch] A misunderstanding about _version

Elastic/Elasticsearch 2016. 1. 5. 18:06

I had this wrong.

It was my own fault for not reading the documentation carefully.

So I'm writing it down.

https://www.elastic.co/guide/en/elasticsearch/reference/2.1/docs-index_.html#index-versioning

The version that Elasticsearch provides is used for concurrency control when processing writes.

That is, it exists to control what happens when different update requests arrive for the same document.

The document linked above explains this in more detail.
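
The idea can be shown with an in-memory stand-in. This is a minimal Python sketch of version-based optimistic concurrency control, not the Elasticsearch implementation; VersionedStore, VersionConflict and the expected_version argument are hypothetical names that play the role of the version parameter on an index request.

```python
class VersionConflict(Exception):
    """Raised when the caller's expected version is stale."""


class VersionedStore:
    """Toy document store that bumps a version number on every write."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def index(self, doc_id, source, expected_version=None):
        current_version, _ = self._docs.get(doc_id, (0, None))
        # Reject the write if someone else changed the document in between.
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                f"current version is {current_version}, expected {expected_version}"
            )
        new_version = current_version + 1
        self._docs[doc_id] = (new_version, source)
        return new_version


store = VersionedStore()
v1 = store.index("doc-1", {"counter": 1})                       # first write
v2 = store.index("doc-1", {"counter": 2}, expected_version=v1)  # client A wins
try:
    store.index("doc-1", {"counter": 99}, expected_version=v1)  # client B is stale
    conflict = False
except VersionConflict:
    conflict = True
```

Both clients read version 1, but only the first write against that version succeeds; the second is rejected instead of silently overwriting the first.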

Don't do what I did and move on, assuming it must work like the things you already know.

I apologize for having added noise to the information out there.


[Elasticsearch] multi match query types

Elastic/Elasticsearch 2015. 12. 15. 10:39

These are the descriptions of the types used by the multi match query.

They are taken straight from the source code.

[MultiMatchQueryBuilder.java]

/**
 * Uses the best matching boolean field as main score and uses
 * a tie-breaker to adjust the score based on remaining field matches
 */
BEST_FIELDS(MatchQuery.Type.BOOLEAN, 0.0f, new ParseField("best_fields", "boolean")),

/**
 * Uses the sum of the matching boolean fields to score the query
 */
MOST_FIELDS(MatchQuery.Type.BOOLEAN, 1.0f, new ParseField("most_fields")),

/**
 * Uses a blended DocumentFrequency to dynamically combine the queried
 * fields into a single field given the configured analysis is identical.
 * This type uses a tie-breaker to adjust the score based on remaining
 * matches per analyzed terms
 */
CROSS_FIELDS(MatchQuery.Type.BOOLEAN, 0.0f, new ParseField("cross_fields")),

/**
 * Uses the best matching phrase field as main score and uses
 * a tie-breaker to adjust the score based on remaining field matches
 */
PHRASE(MatchQuery.Type.PHRASE, 0.0f, new ParseField("phrase")),

/**
 * Uses the best matching phrase-prefix field as main score and uses
 * a tie-breaker to adjust the score based on remaining field matches
 */
PHRASE_PREFIX(MatchQuery.Type.PHRASE_PREFIX, 0.0f, new ParseField("phrase_prefix"));
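
The float in each enum constant is the default tie-breaker. As a rough illustration of what a tie-breaker does, here is a Python sketch of the usual "best score plus tie_breaker times the rest" combination; combine_field_scores and the sample scores are hypothetical, and real scoring involves more factors than this. The query dictionary shows where the type names from the ParseField entries go in a request body.

```python
def combine_field_scores(field_scores, tie_breaker):
    """Best field score plus tie_breaker times the remaining field scores."""
    best = max(field_scores)
    return best + tie_breaker * (sum(field_scores) - best)


scores = [2.0, 1.0, 0.5]  # hypothetical per-field scores for one document

best_fields_like = combine_field_scores(scores, tie_breaker=0.0)  # best only
most_fields_like = combine_field_scores(scores, tie_breaker=1.0)  # plain sum
in_between = combine_field_scores(scores, tie_breaker=0.5)

# Request body using one of the type names from the enum.
query = {
    "multi_match": {
        "query": "quick brown fox",
        "fields": ["title", "body"],  # hypothetical field names
        "type": "best_fields",
        "tie_breaker": 0.5,
    }
}
```

A tie-breaker of 0.0 keeps only the best field's score, while 1.0 makes every matching field count fully, which lines up with the defaults of BEST_FIELDS and MOST_FIELDS.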


[Elasticsearch] Tuning merge throttle settings

Elastic/Elasticsearch 2015. 12. 11. 15:56

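When doing bulk indexing, you may notice indexing slowing down partway through.

There can be several causes, but here is a simple way to improve performance through settings alone.

The basics are already covered in the Elasticsearch Reference, so looking up the related pages will help with understanding.

References)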

https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-store.html#store-throttling?q=store

https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing-performance.html

Elasticsearch is also a Lucene-based search engine, so it cannot avoid the long-standing performance degradation that occurs during segment merges.

The settings below help handle this more efficiently.

Merge throttling can be configured in two ways.

1. Node level throttle

Merges happen per shard, so shards on the same node share the same resources.

In other words, they inevitably compete for disk I/O.

That is why the node level setting is used.

indices.store.throttle.type: "merge" # all, none

indices.store.throttle.max_bytes_per_sec: "20mb"

※ The part to adjust here is max_bytes_per_sec: set it to a value that suits the size of the data being indexed.

2. Index level throttle

This lets you manage a specific index by overriding the node level throttle settings for it.

It can be set through the index update settings API.

index.store.throttle.type: "node"

index.store.throttle.max_bytes_per_sec: "20mb"

※ The part to adjust here is max_bytes_per_sec: set it to a value that suits the size of the data being indexed.

※ Setting the throttle type to none disables merge throttling.
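
If these settings are applied through the REST API rather than the configuration file, the request bodies could be assembled as below. This is a hedged Python sketch: throttle_settings is a hypothetical helper, the setting names and values are copied from the post, and wrapping the node level keys in "transient" assumes they are sent to the cluster update settings API.

```python
import json

NODE_TYPES = {"merge", "all", "none"}
INDEX_TYPES = NODE_TYPES | {"node"}


def throttle_settings(level, throttle_type, max_bytes_per_sec):
    """Build the throttle settings for the node level or the index level."""
    if level == "node":
        prefix, allowed = "indices.store.throttle", NODE_TYPES
    elif level == "index":
        prefix, allowed = "index.store.throttle", INDEX_TYPES
    else:
        raise ValueError(f"unknown level: {level}")
    if throttle_type not in allowed:
        raise ValueError(f"unsupported throttle type: {throttle_type}")
    return {
        f"{prefix}.type": throttle_type,
        f"{prefix}.max_bytes_per_sec": max_bytes_per_sec,
    }


# Node level: assumed to go to the cluster update settings API.
node_body = {"transient": throttle_settings("node", "merge", "20mb")}
# Index level: body for the index update settings API; values as in the post.
index_body = throttle_settings("index", "node", "20mb")

print(json.dumps(node_body, indent=2))
print(json.dumps(index_body, indent=2))
```

Keeping the prefix in one place avoids mixing up indices.store.throttle (node level) with index.store.throttle (index level), which differ by a single letter.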


[Elasticsearch] A caveat when configuring Shard Allocation Filtering (on 2.1)

Elastic/Elasticsearch 2015. 12. 10. 11:01

Sharing a tip from my experience setting up a hot-warm architecture.

It is a very minor tip.

Original link)

https://www.elastic.co/guide/en/elasticsearch/reference/2.1/shard-allocation-filtering.html

Allocation filtering is configured through index settings, using "index.routing.allocation.{include|require|exclude}.{attribute}".

There are two REST API calls that can be used for this.

[_settings API]

$ curl -XPUT "http://localhost:9200/db/_settings" -d'
{
  "index.routing.allocation.require.box_type":"warm"
}'

[settings in the request body]

$ curl -XPUT "http://localhost:9200/db" -d'
{
  "settings": {
    "index.routing.allocation.require.box_type":"warm"
  }
}'

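Personally I assumed both of the approaches above should work, so I ran them.

As you will see if you try it, the second approach fails with an index_already_exists_exception error.

When I checked with Elastic, they said this is a case of returning the wrong error message. In other words, it can probably be regarded as a trivial bug(?).

In any case they said they would fix it, so I expect it to be addressed in a later release.

Also, as the HTTP method in the examples shows, you have to use PUT.

I had used POST.

I'm told PUT is the correct method in this case.

That was the reason for my wasted effort.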
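
To avoid the POST/PUT mix-up when scripting this, the method can be pinned in a small helper. The following is a minimal Python sketch using the standard library; put_index_settings is a hypothetical helper name, the host and index are the ones from the curl examples, and the request is only built, not sent.

```python
import json
from urllib.request import Request


def put_index_settings(base_url, index, settings):
    """Build a PUT request for the _settings endpoint of an existing index."""
    body = json.dumps(settings).encode("utf-8")
    return Request(
        url=f"{base_url}/{index}/_settings",
        data=body,
        method="PUT",  # a Request with data would default to POST otherwise
        headers={"Content-Type": "application/json"},
    )


request = put_index_settings(
    "http://localhost:9200",
    "db",
    {"index.routing.allocation.require.box_type": "warm"},
)

# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
print(request.get_method(), request.full_url)
```

Targeting /db/_settings updates the existing index, whereas a PUT to /db alone is an index creation call, which is why the second curl example fails when the index already exists.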


jjeong
