[lucene] precision_step setting

Elastic/Elasticsearch 2014. 1. 14. 17:43

When a field has a numeric type, how you set precision_step can affect search performance.

The formulas are as follows.


Update: http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/NumericRangeQuery.html

Original text: http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/search/NumericRangeQuery.html#precisionStepDesc


Precision Step

You can choose any precisionStep when encoding values. Lower step values mean more precisions and so more terms in index (and index gets larger). The number of indexed terms per value is (those are generated by NumericTokenStream):

  indexedTermsPerValue = ceil(bitsPerValue / precisionStep)
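As a rough illustration of the formula above, the following Python sketch computes the number of indexed terms per value. The function name `indexed_terms_per_value` is made up for this example and is not part of Lucene; it only reproduces the arithmetic.

```python
def indexed_terms_per_value(bits_per_value: int, precision_step: int) -> int:
    """ceil(bitsPerValue / precisionStep), computed with integer arithmetic."""
    if bits_per_value <= 0 or precision_step <= 0:
        raise ValueError("bits_per_value and precision_step must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return (bits_per_value + precision_step - 1) // precision_step


if __name__ == "__main__":
    # 32 bits corresponds to int/float fields, 64 bits to long/double fields.
    for bits in (32, 64):
        for step in (4, 8, 16):
            terms = indexed_terms_per_value(bits, step)
            print(f"bits={bits:2d} precisionStep={step:2d} -> {terms} terms per value")
```

With a smaller precisionStep, each value is indexed as more terms, which is why the index gets larger.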

As the lower precision terms are shared by many values, the additional terms only slightly grow the term dictionary (approx. 7% for precisionStep=4), but have a larger impact on the postings (the postings file will have more entries, as every document is linked to indexedTermsPerValue terms instead of one). The formula to estimate the growth of the term dictionary in comparison to one term per value:

  \mathrm{termDictOverhead} = \sum\limits_{i=0}^{\mathrm{indexedTermsPerValue}-1} \frac{1}{2^{\mathrm{precisionStep}\cdot i}}
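The sum above can be checked numerically. The sketch below is only an evaluation of that formula in Python; `term_dict_overhead` is a hypothetical helper name, not a Lucene API. For a 32-bit value with precisionStep=4 it yields a factor of roughly 1.067, which agrees with the "approx. 7%" growth quoted in the text.

```python
def term_dict_overhead(bits_per_value: int, precision_step: int) -> float:
    """Estimated term dictionary size relative to one term per value."""
    # indexedTermsPerValue = ceil(bitsPerValue / precisionStep)
    terms = (bits_per_value + precision_step - 1) // precision_step
    # Sum of 1 / 2^(precisionStep * i) for i = 0 .. indexedTermsPerValue - 1
    return sum(1.0 / 2 ** (precision_step * i) for i in range(terms))


if __name__ == "__main__":
    for step in (2, 4, 8, 16):
        factor = term_dict_overhead(32, step)
        print(f"precisionStep={step:2d} -> factor {factor:.4f} "
              f"(about {100 * (factor - 1):.1f}% growth)")
```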

On the other hand, if the precisionStep is smaller, the maximum number of terms to match reduces, which optimizes query speed. The formula to calculate the maximum number of terms that will be visited while executing the query is:

  \mathrm{maxQueryTerms} = \left[ \left( \mathrm{indexedTermsPerValue} - 1 \right) \cdot \left(2^\mathrm{precisionStep} - 1 \right) \cdot 2 \right] + \left( 2^\mathrm{precisionStep} - 1 \right)
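To get a feel for how quickly this upper bound grows with precisionStep, here is a minimal Python sketch of the formula. The name `max_query_terms` is an assumption made for this example; the code only evaluates the expression and does not call Lucene.

```python
def max_query_terms(bits_per_value: int, precision_step: int) -> int:
    """Upper bound on the terms visited by a range query, per the formula above."""
    # indexedTermsPerValue = ceil(bitsPerValue / precisionStep)
    terms = (bits_per_value + precision_step - 1) // precision_step
    per_level = 2 ** precision_step - 1
    # [(indexedTermsPerValue - 1) * (2^precisionStep - 1) * 2] + (2^precisionStep - 1)
    return (terms - 1) * per_level * 2 + per_level


if __name__ == "__main__":
    for bits in (32, 64):
        for step in (4, 8, 16):
            print(f"bits={bits:2d} precisionStep={step:2d} "
                  f"-> at most {max_query_terms(bits, step)} terms")
```

Read together with the indexed-terms formula, this shows the trade-off: a smaller precisionStep indexes more terms per value but lowers the maximum number of terms a range query has to visit.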



For an int field, a value is 4 bytes = 32 bits, so with precisionStep = 4:

indexedTermsPerValue = ceil(32 / 4) = 8

maxQueryTerms = [ (8 - 1) * (16 - 1) * 2 ] + (16 - 1) = 7 * 15 * 2 + 15 = 225
