jjeong :: [Elasticsearch - The Definitive Guide] Theory Behind Relevance Scoring

Elastic/TheDefinitiveGuide 2015. 12. 16. 11:21

This post covers TF/IDF scoring, the most basic relevance scoring model.

I am writing it down as a review.

Original links)

https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html

https://www.elastic.co/guide/en/elasticsearch/reference/1.7/mapping-core-types.html

Original snippet)

Term frequency

How often does the term appear in this document? The more often, the higher the weight. A field containing five mentions of the same term is more likely to be relevant than a field containing just one mention. The term frequency is calculated as follows:

tf(t in d) = √frequency

The term frequency (tf) for term t in document d is the square root of
the number of times the term appears in the document.
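
As a small illustration of the formula above, here is a minimal Python sketch. The function name `term_frequency` is my own and is not part of Elasticsearch or Lucene.

```python
import math


def term_frequency(frequency: int) -> float:
    """tf(t in d): square root of how many times the term occurs in the document."""
    return math.sqrt(frequency)


# Five mentions weigh more than one, but not five times more.
print(term_frequency(1))  # 1.0
print(term_frequency(4))  # 2.0
print(term_frequency(5) > term_frequency(1))  # True
```

The square root dampens the effect of repetition: each additional occurrence adds less weight than the one before it.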

Inverse document frequency

How often does the term appear in all documents in the collection? The more often, the lower the weight. Common terms like "and" or "the" contribute little to relevance, as they appear in most documents, while uncommon terms like "elastic" or "hippopotamus" help us zoom in on the most interesting documents. The inverse document frequency is calculated as follows:

idf(t) = 1 + log ( numDocs / (docFreq + 1))

The inverse document frequency (idf) of term t is the logarithm of the number
of documents in the index, divided by the number of documents that contain the term.
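
The formula above can be sketched in Python as follows. This is only an illustration: the function name and the sample numbers are made up, and I am assuming `log` means the natural logarithm.

```python
import math


def inverse_document_frequency(num_docs: int, doc_freq: int) -> float:
    """idf(t) = 1 + log(numDocs / (docFreq + 1)), assuming a natural log."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical index of 1000 documents.
rare = inverse_document_frequency(1000, 9)      # term found in 9 documents
common = inverse_document_frequency(1000, 999)  # term found in almost all documents
print(rare > common)  # True
print(common)         # 1.0, because log(1000 / 1000) is 0
```

The `+ 1` in the denominator keeps the division safe when no document contains the term.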

Field-length norm

How long is the field? The shorter the field, the higher the weight. If a term appears in a short field, such as a title field, it is more likely that the content of that field is about the term than if the same term appears in a much bigger body field. The field length norm is calculated as follows:

norm(d) = 1 / √numTerms

The field-length norm (norm) is the inverse square root of the number of terms in the field.
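
A minimal Python sketch of the formula above. The function name `field_length_norm` and the field sizes are my own illustrative choices.

```python
import math


def field_length_norm(num_terms: int) -> float:
    """norm(d): inverse square root of the number of terms in the field."""
    return 1.0 / math.sqrt(num_terms)


# A short title-like field versus a much longer body-like field.
print(field_length_norm(4))    # 0.5
print(field_length_norm(4) > field_length_norm(400))  # True
```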

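In short)

- tf is how often the term occurs within a document: the higher the term frequency, the higher the weight.

- idf reflects how many documents in the whole collection contain the term: the fewer documents contain it, the higher the weight.

- norm is the length of the text in the field: the shorter the field, the higher the weight.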
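
To see the three factors side by side, here is a rough Python sketch that simply multiplies them together. This is a simplification for illustration only, not Lucene's actual scoring function, and the function name `term_weight` and all numbers are hypothetical.

```python
import math


def term_weight(frequency: int, num_docs: int, doc_freq: int, num_terms: int) -> float:
    """Illustrative product of tf, idf, and norm (a simplification, not the real score)."""
    tf = math.sqrt(frequency)
    idf = 1.0 + math.log(num_docs / (doc_freq + 1))  # natural log assumed
    norm = 1.0 / math.sqrt(num_terms)
    return tf * idf * norm


base = term_weight(frequency=1, num_docs=1000, doc_freq=9, num_terms=16)

# Each change below follows one line of the summary above.
print(term_weight(4, 1000, 9, 16) > base)    # higher term frequency: True
print(term_weight(1, 1000, 499, 16) < base)  # more common term: True
print(term_weight(1, 1000, 9, 4) > base)     # shorter field: True
```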

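Also, a quick look at the mapping settings)

Storing only the information you need avoids wasting storage, and it can also give a small performance gain when scores are calculated.

It is worth judging carefully whether you need every option or whether using them selectively is fine.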

- index_options

Allows to set the indexing options, possible values are docs (only doc numbers are indexed), freqs (doc numbers and term frequencies), and positions (doc numbers, term frequencies and positions). Defaults to positions for analyzed fields, and to docs for not_analyzed fields. It is also possible to set it to offsets (doc numbers, term frequencies, positions and offsets).

- norms: {enabled: <value>}

Boolean value if norms should be enabled or not. Defaults to true for analyzed fields, and to false for not_analyzed fields. See the section about norms.
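
As a rough sketch of how the two options might be set, the Python snippet below builds a mapping body in the 1.x style described by the reference linked above and prints it as JSON. The type name `article` and the field names `title` and `body` are hypothetical, and the chosen values are only an example, not a recommendation.

```python
import json

# Hypothetical mapping body for Elasticsearch 1.x ("string" fields).
mapping = {
    "mappings": {
        "article": {  # hypothetical type name
            "properties": {
                # Keep term frequencies but drop positions, and disable norms
                # so field length does not influence the score.
                "title": {
                    "type": "string",
                    "index_options": "freqs",
                    "norms": {"enabled": False},
                },
                # Leave the defaults for an analyzed field (positions, norms enabled).
                "body": {"type": "string"},
            }
        }
    }
}

print(json.dumps(mapping, indent=2))
```

Note that with `index_options` set to `freqs`, positions are not indexed, so queries that depend on positions, such as phrase queries, will not work on that field.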

Attribution, NonCommercial, NoDerivs
