This post explains TF/IDF scoring, the most basic relevance scoring model.
I am writing it down as a review.
Original link)
Original snippet)
Term frequency
How often does the term appear in this document? The more often, the higher the weight. A field containing five mentions of the same term is more likely to be relevant than a field containing just one mention. The term frequency is calculated as follows:
tf(t in d) = √frequency
The term frequency (tf) for term t in document d is the square root of the number of times the term appears in the document.
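The formula above can be sketched in a few lines of Python. This is only an illustration of the quoted formula, not Elasticsearch code; the function name term_frequency is made up for this example.

```python
import math


def term_frequency(frequency: int) -> float:
    """tf(t in d) = sqrt(frequency).

    `frequency` is the number of times the term appears in the field.
    The function name is hypothetical, chosen for this sketch only.
    """
    if frequency < 0:
        raise ValueError("frequency must not be negative")
    return math.sqrt(frequency)


# One mention versus five mentions of the same term.
# The weight grows with the count, but more slowly than the count itself.
for count in (1, 5, 9):
    print(count, term_frequency(count))
```

Because of the square root, nine mentions give a tf of 3.0 rather than 9, so repeating a term many times brings diminishing returns.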
Inverse document frequency
How often does the term appear in all documents in the collection? The more often, the lower the weight. Common terms like "and" or "the" contribute little to relevance, as they appear in most documents, while uncommon terms like "elastic" or "hippopotamus" help us zoom in on the most interesting documents. The inverse document frequency is calculated as follows:
idf(t) = 1 + log ( numDocs / (docFreq + 1))
The inverse document frequency (idf) of term t is the logarithm of the number of documents in the index, divided by the number of documents that contain the term.
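As a rough check of the idf formula, here is a small Python sketch. It assumes the natural logarithm, and the function name and the document counts are invented for the example.

```python
import math


def inverse_document_frequency(num_docs: int, doc_freq: int) -> float:
    """idf(t) = 1 + log(numDocs / (docFreq + 1)).

    num_docs: number of documents in the index.
    doc_freq: number of documents that contain the term.
    Assumption: `log` is the natural logarithm.
    """
    if num_docs <= 0 or doc_freq < 0:
        raise ValueError("num_docs must be positive and doc_freq non-negative")
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical index of 1000 documents.
common = inverse_document_frequency(1000, 999)  # term found in almost every document
rare = inverse_document_frequency(1000, 9)      # term found in only 9 documents
print(common, rare)
```

With these made-up numbers the common term gets an idf of 1.0, while the rare term gets roughly 5.6, so the rare term contributes much more to the score.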
Field-length norm
How long is the field? The shorter the field, the higher the weight. If a term appears in a short field, such as a title field, it is more likely that the content of that field is about the term than if the same term appears in a much bigger body field. The field length norm is calculated as follows:
norm(d) = 1 / √numTerms
The field-length norm (norm) is the inverse square root of the number of terms in the field.
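The same kind of sketch works for the field-length norm. The function name and the term counts (a 4-term title and a 100-term body) are assumptions made for illustration.

```python
import math


def field_length_norm(num_terms: int) -> float:
    """norm(d) = 1 / sqrt(numTerms), numTerms = number of terms in the field."""
    if num_terms <= 0:
        raise ValueError("num_terms must be positive")
    return 1.0 / math.sqrt(num_terms)


# Hypothetical fields: a short title and a much longer body.
title_norm = field_length_norm(4)    # 1 / 2
body_norm = field_length_norm(100)   # 1 / 10
print(title_norm, body_norm)
```

A match in the 4-term title is weighted 0.5, while the same match in the 100-term body is weighted 0.1, which is the "shorter field, higher weight" behavior described above.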
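In short)
- tf is how often a term occurs within a document: the higher the term frequency, the higher the weight.
- idf is how often a term occurs across all documents: the lower the frequency, the higher the weight.
- norm is the length of the text in a field: the shorter the field, the higher the weight.
Let's also take a quick look at the mapping settings)
Storing only the information you actually need avoids wasting storage, and it can also give a small performance gain when scores are computed.
It is worth deciding carefully whether you really need every option or whether you can enable them selectively.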
- index_options
Allows to set the indexing options, possible values are docs (only doc numbers are indexed), freqs (doc numbers and term frequencies), and positions (doc numbers, term frequencies and positions). Defaults to positions for analyzed fields, and to docs for not_analyzed fields. It is also possible to set it to offsets (doc numbers, term frequencies, positions and offsets).
- norms: {enabled: <value>}
Boolean value if norms should be enabled or not. Defaults to true for analyzed fields, and to false for not_analyzed fields. See the section about norms.
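To make the two options concrete, here is a hedged sketch of a field mapping built as a Python dict and printed as JSON. It assumes the older mapping syntax that the quoted options come from (a "string" field type and norms: {enabled: ...}). The field name "tag" is hypothetical, and the fragment shows only the properties part of a mapping.

```python
import json

# Hypothetical field "tag": we only need to know whether a document contains
# the term, so term frequencies, positions and norms are all left out.
mapping = {
    "properties": {
        "tag": {
            "type": "string",             # assumption: older mapping syntax
            "index_options": "docs",      # index doc numbers only
            "norms": {"enabled": False},  # skip the field-length norm
        }
    }
}

# Print the JSON body that would be sent as part of a mapping request.
print(json.dumps(mapping, indent=2))
```

With index_options set to docs, term frequencies are not indexed for that field, and with norms disabled the field length no longer affects the score, so this combination only suits fields where those two factors are not needed.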