2 posts tagged 'td'

  1. 2013.07.12 [lucene score] (Repost) score formula.
  2. 2013.02.19 TFIDF

[lucene score] (Repost) score formula.

Elastic/Elasticsearch 2013. 7. 12. 17:29

Original URL : http://www.lucenetutorial.com/advanced-topics/scoring.html


Lucene Scoring

The authoritative document for scoring is found on the Lucene site here. Read that first.

Lucene implements a variant of the TfIdf scoring model. That is documented here.

The factors involved in Lucene's scoring algorithm are as follows:

  1. tf = term frequency in document = measure of how often a term appears in the document
  2. idf = inverse document frequency = measure of how often the term appears across the index
  3. coord = number of terms in the query that were found in the document
  4. lengthNorm = measure of the importance of a term according to the total number of terms in the field
  5. queryNorm = normalization factor so that queries can be compared
  6. boost (index) = boost of the field at index-time
  7. boost (query) = boost of the field at query-time

The implementation, implication and rationale of factors 1, 2, 3 and 4 in DefaultSimilarity.java, which is what you get if you don't explicitly specify a similarity, are:

note: the implication of these factors should be read as, "Everything else being equal, ... [implication]"

1. tf 
Implementation: sqrt(freq) 
Implication: the more frequently a term occurs in a document, the greater its score
Rationale: documents that contain more occurrences of a term are generally more relevant

2. idf
Implementation: log(numDocs/(docFreq+1)) + 1
Implication: the more documents a term occurs in, the lower its score
Rationale: common terms are less important than uncommon ones

3. coord
Implementation: overlap / maxOverlap
Implication: of the terms in the query, a document that contains more terms will have a higher score
Rationale: self-explanatory

4. lengthNorm
Implementation: 1/sqrt(numTerms)
Implication: a term matched in a field with fewer terms has a higher score
Rationale: a term in a field with fewer terms is more important than one in a field with more

queryNorm is not related to the relevance of the document, but rather tries to make scores between different queries comparable. It is implemented as 1/sqrt(sumOfSquaredWeights)
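The factor formulas above can be tried out in isolation. The following is a minimal standalone sketch, not Lucene's own source: the class name `ScoringFactorsSketch` and the sample numbers in `main` are made up for illustration, and only the formulas themselves come from the text.

```java
/** Standalone sketch of the factor formulas listed above (no Lucene dependency). */
final class ScoringFactorsSketch {
    // tf: square root of the raw term frequency in the document.
    static float tf(int freq) {
        return (float) Math.sqrt(freq);
    }

    // idf: log(numDocs / (docFreq + 1)) + 1, using floating-point division.
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // coord: fraction of the query terms that were found in the document.
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    // lengthNorm: 1 / sqrt(number of terms in the field).
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // queryNorm: 1 / sqrt(sum of squared term weights).
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    public static void main(String[] args) {
        // Hypothetical inputs, chosen only to show the direction of each factor.
        System.out.println("tf(1) vs tf(4):                 " + tf(1) + " vs " + tf(4));
        System.out.println("idf rare vs common (of 1000):   " + idf(9, 1000) + " vs " + idf(499, 1000));
        System.out.println("coord 1/3 vs 3/3:               " + coord(1, 3) + " vs " + coord(3, 3));
        System.out.println("lengthNorm 4 vs 100 terms:      " + lengthNorm(4) + " vs " + lengthNorm(100));
        System.out.println("queryNorm(4.0):                 " + queryNorm(4.0f));
    }
}
```

Each printed pair moves in the direction described in the corresponding "Implication" line: more occurrences raise tf, rarer terms raise idf, more matched query terms raise coord, and shorter fields raise lengthNorm.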

So, in summary (quoting Mark Harwood from the mailing list),

* Documents containing *all* the search terms are good
* Matches on rare words are better than for common words
* Long documents are not as good as short ones
* Documents which mention the search terms many times are good

The mathematical definition of the scoring can be found at http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/Similarity.html
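For orientation, here is a sketch of how the factors listed earlier combine, following the "practical scoring function" described in Lucene's TFIDFSimilarity documentation. The linked Javadoc remains the authority. In this notation, boost(t) stands for the query-time boost of term t, and norm(t, d) is assumed to fold together lengthNorm and the index-time boosts.

```latex
\mathrm{score}(q, d) =
  \mathrm{coord}(q, d) \cdot \mathrm{queryNorm}(q) \cdot
  \sum_{t \in q} \Big( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^{2}
  \cdot \mathrm{boost}(t) \cdot \mathrm{norm}(t, d) \Big)
```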

Hint: look at NutchSimilarity in Nutch to see an example of how web pages can be scored for relevance

Customizing scoring

It's easy to customize the scoring algorithm. Subclass DefaultSimilarity and override the method you want to customize.

For example, if you want to ignore how commonly a term appears across the index:

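// Signature as given in the tutorial; it differs between Lucene versions,
// so check the release you build against (@Override will flag a mismatch).
Similarity sim = new DefaultSimilarity() {
  @Override
  public float idf(int docFreq, int numDocs) {
    return 1; // constant idf: rare and common terms weigh the same
  }
};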

And if you think that, for the title field, more terms are better:

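Similarity sim = new DefaultSimilarity() {
  @Override
  public float lengthNorm(String field, int numTerms) {
    // For "title", let the norm grow with the number of terms.
    if (field.equals("title")) return (float) (0.1 * Math.log(numTerms));
    // All other fields keep the default behaviour.
    else return super.lengthNorm(field, numTerms);
  }
};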
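To see what the title customization does numerically, the two lengthNorm formulas can be compared side by side. This is a standalone sketch without any Lucene dependency; the class name `CustomFactorSketch` and the chosen term counts are illustrative assumptions.

```java
/** Compares the default lengthNorm with the "title" variant shown above. */
final class CustomFactorSketch {
    // Default formula from the text: 1 / sqrt(numTerms).
    static float defaultLengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Customized "title" formula from the example: 0.1 * ln(numTerms).
    static float titleLengthNorm(int numTerms) {
        return (float) (0.1 * Math.log(numTerms));
    }

    public static void main(String[] args) {
        // Hypothetical field lengths, for illustration only.
        for (int n : new int[] {1, 2, 5, 10, 50}) {
            System.out.printf("numTerms=%d default=%.4f title=%.4f%n",
                    n, defaultLengthNorm(n), titleLengthNorm(n));
        }
    }
}
```

One thing this makes visible: for a one-term title, ln(1) is 0, so the customized factor is 0 and would cancel that field's contribution. If that is not intended, an offset such as 1 + 0.1 * ln(numTerms) might be a reasonable variation to experiment with.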



TFIDF

ITWeb/General Development 2013. 2. 19. 18:48

Reference URLs :

http://en.wikipedia.org/wiki/Tf%E2%80%93idf

http://ko.wikipedia.org/wiki/TF-IDF

http://nlp.stanford.edu/IR-book/html/htmledition/index-1.html



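TF-IDF (Term Frequency - Inverse Document Frequency) is a weight used in information retrieval and text mining. Given a collection of documents, it is a statistical measure of how important a word is within a particular document. It can be used to extract a document's keywords, to rank results in a search engine, or to measure how similar documents are to one another.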


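TF (term frequency) indicates how often a particular word appears in a document; the higher the value, the more important the word can be considered within that document. However, if the word itself is used frequently across the document collection, that means the word is simply common. This is called DF (document frequency), and the inverse of this value is called IDF (inverse document frequency). TF-IDF is the product of TF and IDF.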
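Written as a formula, this is one common formulation, the one described in the Wikipedia article listed under the reference URLs. Here N is the number of documents in the collection D, and the logarithm is a conventional damping choice rather than something stated in the text above.

```latex
\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D),
\qquad
\mathrm{idf}(t, D) = \log \frac{N}{\lvert \{ d \in D : t \in d \} \rvert}
```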


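The IDF value depends on the nature of the document collection. For example, the word 'atom' rarely appears in general documents, so its IDF is high and it can become a keyword of a document. In a collection of documents about atoms, however, the word becomes commonplace, and other words that distinguish the individual documents more finely receive higher weights.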
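The 'atom' example can be reproduced with a few lines of code. The sketch below assumes raw counts for TF and ln(N / DF) for IDF, which is one common choice among several; the class name `TfIdfSketch` and both toy collections are invented for illustration.

```java
import java.util.Arrays;
import java.util.List;

/** Toy TF-IDF: raw term counts for TF, ln(N / DF) for IDF. */
final class TfIdfSketch {
    // TF: number of times the term occurs in the document.
    static int tf(String term, List<String> doc) {
        int count = 0;
        for (String word : doc) {
            if (word.equals(term)) count++;
        }
        return count;
    }

    // DF: number of documents in the collection that contain the term.
    static int df(String term, List<List<String>> corpus) {
        int count = 0;
        for (List<String> doc : corpus) {
            if (doc.contains(term)) count++;
        }
        return count;
    }

    // IDF: ln(N / DF). Assumes the term occurs in at least one document.
    static double idf(String term, List<List<String>> corpus) {
        return Math.log(corpus.size() / (double) df(term, corpus));
    }

    static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        return tf(term, doc) * idf(term, corpus);
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("atom", "has", "a", "nucleus");
        // A general collection: only one document mentions "atom".
        List<List<String>> general = Arrays.asList(doc,
                Arrays.asList("the", "cat", "sat"),
                Arrays.asList("a", "dog", "ran"),
                Arrays.asList("rain", "fell"));
        // A collection about atoms: every document mentions "atom".
        List<List<String>> physics = Arrays.asList(doc,
                Arrays.asList("atom", "and", "electron"),
                Arrays.asList("atom", "bonds"),
                Arrays.asList("an", "atom", "splits"));

        System.out.println("atom    in general: " + tfIdf("atom", doc, general));
        System.out.println("atom    in physics: " + tfIdf("atom", doc, physics));
        System.out.println("nucleus in physics: " + tfIdf("nucleus", doc, physics));
    }
}
```

In the general collection 'atom' gets a positive weight, while in the atom-themed collection its weight drops to zero and 'nucleus' becomes the distinguishing word for that document.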

