[Elasticsearch - The Definitive Guide] Theory Behind Relevance Scoring
Elastic/TheDefinitiveGuide 2015. 12. 16. 11:21
This post explains TF/IDF scoring, the most basic relevance scoring model.
I am writing it down as a review.
Original links)
https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html
https://www.elastic.co/guide/en/elasticsearch/reference/1.7/mapping-core-types.html
Original snippet)
Term frequency
How often does the term appear in this document? The more often, the higher the weight. A field containing five mentions of the same term is more likely to be relevant than a field containing just one mention. The term frequency is calculated as follows:
tf(t in d) = √frequency
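The term frequency (tf) for term t in document d is the square root of the number of times the term appears in the document.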
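As a quick check of the formula above, here is a minimal Python sketch. The function name `term_frequency` is made up for illustration and is not part of any Elasticsearch or Lucene API.

```python
import math


def term_frequency(frequency: int) -> float:
    """tf(t in d) = sqrt(frequency), where frequency is how many
    times term t appears in document d (hypothetical helper)."""
    return math.sqrt(frequency)


# More mentions give a higher weight, but the square root dampens the growth:
# 25 mentions weigh 5x as much as one mention, not 25x.
for count in (1, 5, 25):
    print(count, round(term_frequency(count), 3))
```

The square root means each additional mention adds less weight than the previous one.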
Inverse document frequency
How often does the term appear in all documents in the collection? The more often, the lower the weight. Common terms like "and" or "the" contribute little to relevance, as they appear in most documents, while uncommon terms like "elastic" or "hippopotamus" help us zoom in on the most interesting documents. The inverse document frequency is calculated as follows:
idf(t) = 1 + log ( numDocs / (docFreq + 1))
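The formula above can be sketched in Python as follows. This assumes `log` is the natural logarithm; the function name and the document counts are illustrative only.

```python
import math


def inverse_document_frequency(num_docs: int, doc_freq: int) -> float:
    """idf(t) = 1 + log(numDocs / (docFreq + 1)).

    num_docs: total number of documents in the collection.
    doc_freq: number of documents that contain the term.
    Assumption: log is the natural logarithm.
    """
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical collection of 1000 documents.
common = inverse_document_frequency(num_docs=1000, doc_freq=999)  # a term like "the"
rare = inverse_document_frequency(num_docs=1000, doc_freq=9)      # a term like "hippopotamus"
print(round(common, 3), round(rare, 3))
```

A term found in almost every document ends up with a weight close to 1, while a rare term gets a noticeably larger one.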
Field-length norm
How long is the field? The shorter the field, the higher the weight. If a term appears in a short field, such as a "title" field, it is more likely that the content of that field is about the term than if the same term appears in a much bigger "body" field. The field length norm is calculated as follows:
norm(d) = 1 / √numTerms
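A minimal Python sketch of the norm formula above. The function name and the field lengths (4 terms for a title, 400 for a body) are assumptions chosen only to make the difference visible.

```python
import math


def field_length_norm(num_terms: int) -> float:
    """norm(d) = 1 / sqrt(numTerms), where numTerms is the number
    of terms in the field (hypothetical helper)."""
    return 1.0 / math.sqrt(num_terms)


title_norm = field_length_norm(4)    # short field, e.g. a title
body_norm = field_length_norm(400)   # long field, e.g. a body
print(title_norm, round(body_norm, 3))
```

With these numbers, the same term counts ten times more in the short title than in the long body.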
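To summarize briefly)
- tf is how often the term occurs within a document: the higher the term frequency, the higher the weight.
- idf is how often the term occurs across all documents: the lower the frequency, the higher the weight.
- norm is the length of the text in the field: the shorter the field, the higher the weight.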
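In addition, a quick look at the mapping settings)
Storing only the information you actually need avoids wasting storage, and can also give a small performance gain when scores are computed.
It is worth deciding carefully whether you need every option enabled or whether enabling them selectively is enough.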
- index_options
Allows to set the indexing options, possible values are docs (only doc numbers are indexed), freqs (doc numbers and term frequencies), and positions (doc numbers, term frequencies and positions). Defaults to positions for analyzed fields, and to docs for not_analyzed fields. It is also possible to set it to offsets (doc numbers, term frequencies, positions and offsets).
- norms: {enabled: <value>}
Boolean value if norms should be enabled or not. Defaults to true for analyzed fields, and to false for not_analyzed fields. See the section about norms.
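To make the two options concrete, here is a rough sketch of a mapping body built as a Python dict and printed as JSON. It uses the 1.7-style syntax from the reference linked above. The field names `title` and `tag` are hypothetical, and whether dropping frequencies and norms is acceptable depends on how the field is searched.

```python
import json

# Hypothetical mapping for two string fields (Elasticsearch 1.x syntax).
mapping = {
    "properties": {
        # Full-text field: keeps the defaults for analyzed fields
        # (positions indexed, norms enabled), so tf, idf and norm all apply.
        "title": {
            "type": "string",
            "index": "analyzed",
        },
        # Field where only "does the term occur" matters:
        # index doc numbers only and skip the field-length norm.
        "tag": {
            "type": "string",
            "index": "analyzed",
            "index_options": "docs",
            "norms": {"enabled": False},
        },
    }
}

print(json.dumps(mapping, indent=2))
```

With `index_options` set to `docs`, term frequencies are not indexed for `tag`, and with norms disabled no field-length norm is stored for it, which is the storage saving mentioned above.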