Posts in the 'Elastic/TheDefinitiveGuide' category

[Elasticsearch - The Definitive Guide] Controlling Stemming

Elastic/TheDefinitiveGuide 2015. 12. 22. 13:42

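This post is about stemming in morphological analysis.

Knowledge of linguistics is very important in search technology, too.

I majored in computer science, though, so whatever I don't know I have to write down and study... hence these notes.

Original link)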

https://www.elastic.co/guide/en/elasticsearch/guide/current/controlling-stemming.html

Original snippet)

- Preventing Stemming

As you surely know, this is different from stopwords.

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "no_stem": {
          "type": "keyword_marker",
          "keywords": [ "skies" ]
        }
      },
      "analyzer": {
        "my_english": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "no_stem", "porter_stem" ]
        }
      }
    }
  }
}

GET /my_index/_analyze?analyzer=my_english
sky skies skiing skis

- Customizing Stemming

This approach applies mapping rules to the extracted terms. It is slightly different from synonyms, but similar in spirit.

Like synonyms, it can be used at both index time and search time.

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "custom_stem": {
          "type": "stemmer_override",
          "rules": [ "skies=>sky", "mice=>mouse", "feet=>foot" ]
        }
      },
      "analyzer": {
        "my_english": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "custom_stem", "porter_stem" ]
        }
      }
    }
  }
}

GET /my_index/_analyze?analyzer=my_english
The mice came down from the skies and ran over my feet
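To make the two filters concrete, here is a minimal Python sketch of the ordering in these analysis chains. It is not Elasticsearch code: `toy_stem` is a hypothetical stand-in for `porter_stem` with only three suffix rules, and `analyze` is an illustrative helper. The point it tries to show is that protected keywords and override rules are resolved before the stemmer runs, and any token they touch skips stemming.

```python
def toy_stem(token):
    """Hypothetical stand-in for porter_stem: three suffix rules only."""
    if token.endswith("ies"):
        return token[:-3] + "i"
    if token.endswith("ing"):
        return token[:-3]
    if token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def analyze(text, keywords=(), overrides=None):
    """Lowercase, then apply override rules / keyword protection, then stem."""
    overrides = overrides or {}
    out = []
    for token in text.lower().split():
        if token in overrides:
            # Like stemmer_override: emit the mapped form and skip the stemmer.
            out.append(overrides[token])
        elif token in keywords:
            # Like keyword_marker: pass the token through unstemmed.
            out.append(token)
        else:
            out.append(toy_stem(token))
    return out


print(analyze("sky skies skiing skis", keywords={"skies"}))
# ['sky', 'skies', 'ski', 'ski']

rules = {"skies": "sky", "mice": "mouse", "feet": "foot"}
print(analyze("The mice came down from the skies and ran over my feet",
              overrides=rules))
# ['the', 'mouse', 'came', 'down', 'from', 'the', 'sky', 'and', 'ran',
#  'over', 'my', 'foot']
```

The real Porter stemmer has many more rules, so treat the outputs as illustrative of the filter order only.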

Attribution, NonCommercial, NoDerivatives


[Elasticsearch - The Definitive Guide] One Language per Document.

Elastic/TheDefinitiveGuide 2015. 12. 22. 11:18

I'm recording this because the guide warns against using types when modeling indices for multilingual support.

If you were thinking about a layout based on types, it's worth a read.

Original link)

https://www.elastic.co/guide/en/elasticsearch/guide/current/one-lang-docs.html

Original snippet)

Don’t Use Types for Languages

You may be tempted to use a separate type for each language, instead of a separate index. For best results, you should avoid using types for this purpose. As explained in Types and Mappings, fields from different types but with the same field name are indexed into the same inverted index. This means that the term frequencies from each type (and thus each language) are mixed together.

To ensure that the term frequencies of one language don’t pollute those of another, either use a separate index for each language, or a separate field, as explained in the next section.
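A small Python sketch of why mixed term statistics are a problem. The two tiny corpora are made up for illustration, and `doc_freq_ratio` is a hypothetical helper, not an Elasticsearch API. The word "die" is a very common article in German but a rare word in English, so a shared inverted index reports a document frequency that is right for neither language.

```python
def doc_freq_ratio(docs, term):
    """Fraction of documents whose whitespace tokens include `term`."""
    hits = sum(1 for d in docs if term in d.lower().split())
    return hits / len(docs)


# Hypothetical sample documents, one list per language.
english_docs = ["the cat sat", "never say die", "a quick brown fox", "hello world"]
german_docs = ["die katze schläft", "die sonne scheint",
               "der hund und die maus", "guten morgen"]

print(doc_freq_ratio(english_docs, "die"))                # 0.25 (rare in English)
print(doc_freq_ratio(german_docs, "die"))                 # 0.75 (common in German)
print(doc_freq_ratio(english_docs + german_docs, "die"))  # 0.5  (mixed index)
```

With one index (or one field) per language, each language keeps its own frequencies, which is the separation the excerpt recommends.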


[Elasticsearch - The Definitive Guide] Single Query String

Elastic/TheDefinitiveGuide 2015. 12. 22. 11:13

When you query several fields at once, these are the techniques used to produce the most relevant results when scoring documents.

To apply them well, though, you need to know what each one actually does.

Original link)

https://www.elastic.co/guide/en/elasticsearch/guide/current/_single_query_string.html

Original snippet)

Best fields

When searching for words that represent a concept, such as “brown fox,” the words mean more together than they do individually. Fields like the title and body, while related, can be considered to be in competition with each other. Documents should have as many words as possible in the same field, and the score should come from the best-matching field.

Most fields

A common technique for fine-tuning relevance is to index the same data into multiple fields, each with its own analysis chain.

The main field may contain words in their stemmed form, synonyms, and words stripped of their diacritics, or accents. It is used to match as many documents as possible.

The same text could then be indexed in other fields to provide more-precise matching. One field may contain the unstemmed version, another the original word with accents, and a third might use shingles to provide information about word proximity.

These other fields act as signals to increase the relevance score of each matching document. The more fields that match, the better.

Cross fields

For some entities, the identifying information is spread across multiple fields, each of which contains just a part of the whole:

Person: first_name and last_name
Book: title, author, and description
Address: street, city, country, and postcode

A quick summary:

- best fields returns the score of the single best-matching field.

- most fields returns the sum of the scores of the matching fields.

- cross fields blends the fields together when calculating the score. (It uses the lowest idf value found among the fields.)
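The three strategies can be sketched in Python as different ways of combining per-field numbers. This is a deliberately simplified model: the per-field scores and idf values below are made-up inputs, the function names are hypothetical, and real Elasticsearch scoring also involves factors (such as coordination and query normalization) that are left out here. The `tie_breaker` parameter mirrors the option of the same name on `dis_max` and `multi_match`.

```python
def best_fields(scores, tie_breaker=0.0):
    """Best field wins; other fields contribute only via tie_breaker."""
    best = max(scores.values())
    others = sum(scores.values()) - best
    return best + tie_breaker * others


def most_fields(scores):
    """Every matching field adds to the total."""
    return sum(scores.values())


def cross_fields_idf(idf_by_field):
    """Blend term statistics across fields by taking the lowest idf."""
    return min(idf_by_field.values())


# Hypothetical per-field scores for one document.
scores = {"title": 2.0, "body": 0.5}
print(best_fields(scores))   # 2.0
print(most_fields(scores))   # 2.5

# Hypothetical idf of the term "smith" in two fields.
print(cross_fields_idf({"first_name": 5.2, "last_name": 1.3}))  # 1.3
```

Using the minimum idf keeps a term that is rare in one field (a first name "smith") from dominating just because it landed in an unusual field.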


[Elasticsearch - The Definitive Guide] Pitfalls of Mixing Languages

Elastic/TheDefinitiveGuide 2015. 12. 22. 10:48

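This one is about documents in which several languages are mixed together.

How you implement multilingual support depends on the functional requirements, so it inevitably takes a lot of thought.

As easy starting points, Elasticsearch offers _all and a custom all field (copy_to, multi fields).

I'm recording this post because it links to language detection resources, which is what I need.

Original link)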

https://www.elastic.co/guide/en/elasticsearch/guide/current/language-pitfalls.html

Language Detection)

https://github.com/mikemccand/chromium-compact-language-detector

https://code.google.com/p/cld2/


[Elasticsearch - The Definitive Guide] Dealing with Human Language

Elastic/TheDefinitiveGuide 2015. 12. 16. 17:47

What follows may or may not line up with the post title.

I'm recording it because it explains precision and recall briefly and well.

Original link)

https://www.elastic.co/guide/en/elasticsearch/guide/current/languages.html

Original snippet)

Full-text search is a battle between precision—returning as few irrelevant documents as possible—and recall—returning as many relevant documents as possible.
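The two terms have standard definitions that are easy to compute. The following Python sketch uses made-up document IDs, and `precision_recall` is a hypothetical helper name: precision is the share of returned documents that are relevant, and recall is the share of relevant documents that were returned.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a result list against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical IDs: the search returned 4 documents, 3 documents are relevant.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(p)  # 0.5 (2 of the 4 returned documents are relevant)
print(r)  # 2 of the 3 relevant documents were found
```

Returning more documents tends to raise recall and lower precision, which is the tension the excerpt describes.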

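The original chapter is really about processing human language.

So I've summarized only the five section titles it defines.

- Normalizing Tokens

Removes unneeded characters from the extracted tokens.

- Reducing Words To Their Root Form

Removes unneeded information attached to a word. (Think of it as producing the word's root form.)

- Stopwords

Handles stopwords. (That is, excludes them from what gets indexed.)

- Synonyms

Handles synonyms and near-synonyms.

- Typoes and Mispelings

Handles typos.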
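As a rough illustration of the first and third items, here is a Python sketch that normalizes tokens (lowercasing and stripping accents) and then drops stopwords. The stopword list and the function names are assumptions made for this example; Elasticsearch does the equivalent work with token filters inside an analyzer.

```python
import unicodedata

# Hypothetical, tiny stopword list for the example.
STOPWORDS = {"the", "a", "and"}


def normalize(token):
    """Lowercase the token and strip combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", token.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def simple_analyze(text):
    """Whitespace-tokenize, normalize each token, then remove stopwords."""
    tokens = (normalize(t) for t in text.split())
    return [t for t in tokens if t not in STOPWORDS]


print(simple_analyze("The Déjà Vu and a Café"))
# ['deja', 'vu', 'cafe']
```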


[Elasticsearch - The Definitive Guide] Controlling Relevance

Elastic/TheDefinitiveGuide 2015. 12. 16. 14:08

There is a lot of material on relevance, so for now I've gathered only the pages that are worth reading at least once.

Original links)

https://www.elastic.co/guide/en/elasticsearch/guide/current/query-scoring.html

- This page covers how queries compete with each other inside a bool query.

https://www.elastic.co/guide/en/elasticsearch/guide/current/not-quite-not.html

- This page covers the boosting query.

https://www.elastic.co/guide/en/elasticsearch/guide/current/ignoring-tfidf.html

- This page covers the constant_score query.

- In other words, every matching document scores 1, and the score is then assigned according to the boost value you set.

https://www.elastic.co/guide/en/elasticsearch/guide/current/boosting-by-popularity.html

- This page covers the function_score query.
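To illustrate the constant_score note in the list above, here is a Python sketch that builds a bool query out of constant_score clauses and mimics how the boosts add up. The field name `description`, the terms, and the helper names are assumptions for the example, and the query shape follows the 1.x/2.x-era syntax this guide uses. `toy_score` is only a model: it ignores the coordination and query normalization factors that Elasticsearch still applies.

```python
def constant_score_clause(field, value, boost=1.0):
    """One clause that scores `boost` when it matches, however often."""
    return {"constant_score": {"boost": boost,
                               "query": {"match": {field: value}}}}


def build_query(weighted_terms, field="description"):
    """Combine the clauses as optional (should) clauses of a bool query."""
    return {"query": {"bool": {"should": [
        constant_score_clause(field, term, boost)
        for term, boost in weighted_terms
    ]}}}


def toy_score(text, weighted_terms):
    """Simplified model: sum the boosts of the terms present in the text."""
    tokens = set(text.lower().split())
    return sum(boost for term, boost in weighted_terms if term in tokens)


# Hypothetical feature weights: a pool counts double.
weighted = [("wifi", 1.0), ("garden", 1.0), ("pool", 2.0)]
query = build_query(weighted)

# Repeating "pool" does not raise the score: term frequency is ignored.
print(toy_score("Garden view and pool pool pool", weighted))  # 3.0
```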


[Elasticsearch - The Definitive Guide] Theory Behind Relevance Scoring

Elastic/TheDefinitiveGuide 2015. 12. 16. 11:21

This explains TF/IDF scoring, the most fundamental piece.

I'm recording it by way of review.

Original links)

https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html

https://www.elastic.co/guide/en/elasticsearch/reference/1.7/mapping-core-types.html

Original snippet)

Term frequency

How often does the term appear in this document? The more often, the higher the weight. A field containing five mentions of the same term is more likely to be relevant than a field containing just one mention. The term frequency is calculated as follows:

tf(t in d) = √frequency

The term frequency (tf) for term t in document d is the square root of
the number of times the term appears in the document.

Inverse document frequency

How often does the term appear in all documents in the collection? The more often, the lower the weight. Common terms like "and" or "the" contribute little to relevance, as they appear in most documents, while uncommon terms like "elastic" or "hippopotamus" help us zoom in on the most interesting documents. The inverse document frequency is calculated as follows:

idf(t) = 1 + log ( numDocs / (docFreq + 1))

The inverse document frequency (idf) of term t is the logarithm of the number
of documents in the index, divided by the number of documents that contain the term.

Field-length norm

How long is the field? The shorter the field, the higher the weight. If a term appears in a short field, such as a title field, it is more likely that the content of that field is about the term than if the same term appears in a much bigger body field. The field length norm is calculated as follows:

norm(d) = 1 / √numTerms

The field-length norm (norm) is the inverse square root of the number of terms in the field.

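In short)

- tf is how often the term occurs within the document: the higher the frequency, the higher the weight.

- idf reflects how many documents in the whole collection contain the term: the fewer, the higher the weight.

- norm is the length of the text in the field: the shorter, the higher the weight.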
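The three formulas quoted above translate directly into Python. The function names are my own, the log is taken as the natural logarithm, and `field_weight` is a simplified combination (the plain product of the three factors) rather than the full Lucene scoring function, which adds boosts, query normalization, and a lossy encoding of the norm.

```python
import math


def tf(frequency):
    """tf(t in d) = sqrt(frequency)"""
    return math.sqrt(frequency)


def idf(num_docs, doc_freq):
    """idf(t) = 1 + log(numDocs / (docFreq + 1))"""
    return 1 + math.log(num_docs / (doc_freq + 1))


def norm(num_terms):
    """norm(d) = 1 / sqrt(numTerms)"""
    return 1 / math.sqrt(num_terms)


def field_weight(frequency, num_docs, doc_freq, num_terms):
    """Simplified weight of one term in one field: tf * idf * norm."""
    return tf(frequency) * idf(num_docs, doc_freq) * norm(num_terms)


print(tf(4))    # 2.0
print(norm(4))  # 0.5
# A rare term (in 9 of 1000 docs) outweighs a common one (in 899 of 1000).
print(idf(1000, 9) > idf(1000, 899))  # True
```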

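While we're here, a quick look at the related mapping settings)

Storing only the information you need avoids wasting storage, and it can also give a small performance benefit when scores are calculated.

It's worth judging carefully whether you need every option or can safely enable them selectively.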

- index_options

Allows to set the indexing options, possible values are docs (only doc numbers are indexed), freqs (doc numbers and term frequencies), and positions (doc numbers, term frequencies and positions). Defaults to positions for analyzed fields, and to docs for not_analyzed fields. It is also possible to set it to offsets (doc numbers, term frequencies, positions and offsets).

- norms: {enabled: <value>}

Boolean value if norms should be enabled or not. Defaults to true for analyzed fields, and to false for not_analyzed fields. See the section about norms.
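As a sketch of how these two options might appear together, here is a Python dict shaped like a 1.7-era mapping, using the `string` type and the `norms: {enabled: ...}` form quoted above. The field names `tag` and `body` are hypothetical. One trade-off to keep in mind: with `index_options` set to `docs`, positions are not stored, so phrase queries against that field will not work.

```python
import json

# Hypothetical fields: `tag` only needs yes/no matching, `body` is full text.
mapping = {
    "properties": {
        "tag": {
            "type": "string",
            "index_options": "docs",      # doc numbers only: no freqs, no positions
            "norms": {"enabled": False},  # skip the field-length norm
        },
        "body": {
            "type": "string",
            "index_options": "positions",  # the default for analyzed fields
        },
    }
}

print(json.dumps(mapping, indent=2))
```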


[Elasticsearch - The Definitive Guide] Index Time Optimizations

Elastic/TheDefinitiveGuide 2015. 12. 15. 13:51

I'm recording this simply because it is good material.

Original link)

https://www.elastic.co/guide/en/elasticsearch/guide/current/_index_time_optimizations.html

Original snippet)

The flexibility of query-time operations comes at a cost: search performance. Sometimes it may make sense to move the cost away from the query. In a real-time web application, an additional 100ms may be too much latency to tolerate.

By preparing your data at index time, you can make your searches more flexible and improve performance. You still pay a price: increased index size and slightly slower indexing throughput, but it is a price you pay once at index time, instead of paying it on every query.

In short)

- Query-time operations are flexible, but they cost search performance.

- Index-time operations slow down indexing and can increase storage use, but they improve query performance and allow more flexible searching.


[Elasticsearch - The Definitive Guide] Finding Associated Words

Elastic/TheDefinitiveGuide 2015. 12. 15. 12:46

You have probably wondered at some point how to implement document relevance at the lowest cost.

This page is not an exact answer to that, but you can read it as coming from this idea: rather than paying a high price at query time with match queries, why not do the work at indexing time?

That is why I recorded it.

Original links)

https://www.elastic.co/guide/en/elasticsearch/guide/current/shingles.html

https://www.elastic.co/guide/en/elasticsearch/reference/2.1/analysis-shingle-tokenfilter.html

Original snippet)

Performance

Not only are shingles more flexible than phrase queries, but they perform better as well. Instead of paying the price of a phrase query every time you search, queries for shingles are just as efficient as a simple match query. A small price is paid at index time, because more terms need to be indexed, which also means that fields with shingles use more disk space. However, most applications write once and read many times, so it makes sense to optimize for fast queries.

This is a theme that you will encounter frequently in Elasticsearch: it enables you to achieve a lot at search time, without requiring any up-front setup. Once you understand your requirements more clearly, you can achieve better results with better performance by modeling your data correctly at index time.

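※ What are shingles? Put simply, think of them as n-grams over tokens (terms).

In the end, the idea is to cut query cost by analyzing text into shingles at index time, rather than relying on features such as slop or position_offset_gap at query time.

As the guide itself notes, though, storage use goes up.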
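A minimal Python sketch of the idea, with `shingles` as a hypothetical helper rather than the Elasticsearch filter itself: word-level n-grams are produced once, at index time, so that a later query can match adjacent word pairs as ordinary terms. The `min_size` and `max_size` parameters play the role of the shingle filter's `min_shingle_size` and `max_shingle_size` settings.

```python
def shingles(tokens, min_size=2, max_size=2):
    """Return word n-grams of every size from min_size to max_size."""
    out = []
    for i in range(len(tokens)):
        for size in range(min_size, max_size + 1):
            if i + size <= len(tokens):
                out.append(" ".join(tokens[i:i + size]))
    return out


tokens = "sue ate the alligator".split()
print(shingles(tokens))
# ['sue ate', 'ate the', 'the alligator']
```

Four tokens become three extra terms here, which is where the added index size comes from.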


[Elasticsearch - The Definitive Guide] Improving Performance

Elastic/TheDefinitiveGuide 2015. 12. 15. 12:24

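This post is about improving the performance of search queries.

It's worth reading at least once, so I'm recording it.

Original link)

https://www.elastic.co/guide/en/elasticsearch/guide/current/_improving_performance.html

Summary)

- A term query is roughly 10 times faster than a phrase query, and about 20 times faster than a proximity query (a phrase query with slop).

- That does not mean phrase queries are slow. (They typically respond within a few milliseconds.)

- You need to understand how properties such as slop and position_offset_gap are used.

Rescoring also comes up here.

It is similar to function_score.

The two share a similar purpose, but they are used differently.

The key point is that rescoring takes only the top N documents from each shard and scores them again, whereas function_score is applied to every document the query matches.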
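A minimal Python sketch of the rescoring idea for a single shard. The hit structure, the `rescore` helper, and the expensive scoring function are assumptions for illustration: a cheap query ranks everything, then only the top `window_size` hits are rescored with the costly function, and the two scores are combined as a weighted sum in the spirit of the `query_weight` and `rescore_query_weight` options.

```python
def rescore(hits, rescore_fn, window_size, query_weight=1.0, rescore_weight=1.0):
    """Rescore only the top `window_size` hits; leave the rest untouched."""
    ranked = sorted(hits, key=lambda h: h["score"], reverse=True)
    window, rest = ranked[:window_size], ranked[window_size:]
    rescored = [
        {**h, "score": query_weight * h["score"] + rescore_weight * rescore_fn(h)}
        for h in window
    ]
    rescored.sort(key=lambda h: h["score"], reverse=True)
    # Hits outside the window keep their original score and order.
    return rescored + rest


# Hypothetical hits from a cheap first-pass query.
hits = [
    {"id": "a", "score": 3.0},
    {"id": "b", "score": 2.5},
    {"id": "c", "score": 1.0},
]


def expensive(hit):
    """Stand-in for a costly scorer, e.g. a phrase match."""
    return 2.0 if hit["id"] == "b" else 0.0


print([h["id"] for h in rescore(hits, expensive, window_size=2)])
# ['b', 'a', 'c']
```

The expensive function runs twice here instead of three times; on a real shard the saving is the difference between the window size and the total number of matches.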


jjeong

33 posts filed under 'Elastic/TheDefinitiveGuide'
