Posts tagged 'elasticsearch' (Page 20)

[Elasticsearch] Gradle issue when importing the latest Elasticsearch into IntelliJ

Elastic/Elasticsearch 2016. 2. 1. 23:27

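The symptom is that the error below occurs, and the master branch of the elasticsearch repository I cloned cannot be imported into IntelliJ.

I ran out of time before I could confirm whether this only happens to me or is an environment problem, so for now I am just recording it.

Below is the workaround I used.

[Error message]

- build.gradle contains the following condition.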

if (System.getProperty('idea.active') != null && ideaMarker.exists() == false) {
    throw new GradleException('You must run gradle idea from the root of elasticsearch before importing into IntelliJ')
}

[Workaround]

- Check out a different branch that is still organized as a Maven project, and import that into IntelliJ.

$ git checkout 2.2

Attribution, NonCommercial, NoDerivatives

:

[Elasticsearch] es blog - this week in es 2016.01.25

Elastic/Elasticsearch 2016. 1. 26. 18:45

Just tossing this out here.

[Original link]

https://www.elastic.co/blog/this-week-in-elasticsearch-and-apache-lucene-2016-01-25

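I can't remember whether I shared this before, but it looks like the jar hell check can now be disabled. Ha.

That check gave me a really hard time during testing...

Another thing that stands out: scripting now supports throw and try/catch.

Also, on restart the existing primary is reused as is. (Yes, that is how it should be.)

There is plenty more in the post.

But hmm, is it because es has grown so large?

It somehow feels a little less active than before.

Ha, maybe that is just because I only recently became unemployed. T_T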

:

[Elasticsearch] A misunderstanding about _version.

Elastic/Elasticsearch 2016. 1. 5. 18:06

I had this wrong.

It was my own fault for not reading the documentation carefully.

So I am writing it down.

https://www.elastic.co/guide/en/elasticsearch/reference/2.1/docs-index_.html#index-versioning

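The version that elasticsearch provides is used for concurrency control when processing writes (optimistic concurrency control).

In other words, it exists to control what happens when different update requests arrive for the same document.

The document linked above explains the details well.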
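To make the idea concrete, here is a minimal sketch of version-based concurrency control. It is a toy in-memory store, not the Elasticsearch implementation; the names VersionedStore, VersionConflict, and expected_version are made up for illustration.

```python
class VersionConflict(Exception):
    """Raised when the caller's expected version is stale."""


class VersionedStore:
    """Toy in-memory document store (hypothetical, for illustration only)."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def get(self, doc_id):
        return self._docs[doc_id]

    def index(self, doc_id, source, expected_version=None):
        # A document that does not exist yet is treated as version 0.
        current_version = self._docs.get(doc_id, (0, None))[0]
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                f"expected {expected_version}, current is {current_version}"
            )
        new_version = current_version + 1
        self._docs[doc_id] = (new_version, source)
        return new_version


store = VersionedStore()
store.index("1", {"stock": 10})              # first write creates version 1
seen = store.get("1")[0]                     # two writers both read version 1
store.index("1", {"stock": 9}, expected_version=seen)   # first writer succeeds
try:
    store.index("1", {"stock": 8}, expected_version=seen)  # second writer is stale
    conflict_detected = False
except VersionConflict:
    conflict_detected = True
```

The second writer has to re-read the document and retry, which is the whole point: the version number detects conflicting updates, it is not a revision history.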

Don't do what I did and skip over it, assuming it must be similar to the versioning you already know.

I apologize for adding noise to the information out there.

:

[Elasticsearch - The Definitive Guide] Controlling Stemming

Elastic/TheDefinitiveGuide 2015. 12. 22. 13:42

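This post is about stemming in morphological analysis.

Linguistic knowledge matters a great deal in search technology as well.

But I majored in computer science, so whatever I don't know I have to write down and study... hence this note.

Original link)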

https://www.elastic.co/guide/en/elasticsearch/guide/current/controlling-stemming.html

Original snippet)

- Preventing Stemming

As you surely know, this is different from stopwords.

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "no_stem": {
          "type": "keyword_marker",
          "keywords": [ "skies" ]
        }
      },
      "analyzer": {
        "my_english": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "no_stem", "porter_stem" ]
        }
      }
    }
  }
}

GET /my_index/_analyze?analyzer=my_english
sky skies skiing skis

- Customizing Stemming

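This approach applies mapping rules to the extracted terms; it is a little different from synonyms, but similar in spirit.

Again, it can be used at both index time and search time.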

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "custom_stem": {
          "type": "stemmer_override",
          "rules": [ "skies=>sky", "mice=>mouse", "feet=>foot" ]
        }
      },
      "analyzer": {
        "my_english": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "custom_stem", "porter_stem" ]
        }
      }
    }
  }
}

GET /my_index/_analyze?analyzer=my_english
The mice came down from the skies and ran over my feet
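The two filter chains above can be imitated with a rough Python sketch. Everything here is an assumption for illustration: whitespace splitting stands in for the standard tokenizer, and toy_stem is a three-rule suffix stripper, not the Porter algorithm. It only shows the order of operations: protected or overridden tokens bypass the stemmer, everything else goes through it.

```python
def toy_stem(token):
    """Very rough suffix stripper; a stand-in for porter_stem, not the real thing."""
    if token.endswith("ies"):
        return token[:-3] + "i"
    if token.endswith("ing") and len(token) > 5:
        return token[:-3]
    if token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def analyze(text, protected=frozenset(), overrides=None):
    """lowercase -> (keyword marker / stemmer override) -> stemmer."""
    overrides = overrides or {}
    tokens = []
    for raw in text.split():
        token = raw.lower()
        if token in overrides:
            # Overridden tokens are replaced and then skipped by the stemmer.
            tokens.append(overrides[token])
        elif token in protected:
            # Protected tokens are emitted unchanged.
            tokens.append(token)
        else:
            tokens.append(toy_stem(token))
    return tokens


prevented = analyze("sky skies skiing skis", protected={"skies"})
customized = analyze(
    "The mice came down from the skies and ran over my feet",
    overrides={"skies": "sky", "mice": "mouse", "feet": "foot"},
)
print(prevented)   # ['sky', 'skies', 'ski', 'ski']
print(customized)
```

Without the protected set, "skies" would be stemmed to "ski" and become indistinguishable from "skiing" and "skis", which is exactly what the no_stem filter is there to prevent.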

:

[Elasticsearch - The Definitive Guide] One Language per Document.

Elastic/TheDefinitiveGuide 2015. 12. 22. 11:18

The guide warns against using types when modeling indices for multilingual support, so I am recording it here.

If you have been considering a type-based layout, it is worth a read.

Original link)

https://www.elastic.co/guide/en/elasticsearch/guide/current/one-lang-docs.html

Original snippet)

Don’t Use Types for Languages

You may be tempted to use a separate type for each language, instead of a separate index. For best results, you should avoid using types for this purpose. As explained in Types and Mappings, fields from different types but with the same field name are indexed into the same inverted index. This means that the term frequencies from each type (and thus each language) are mixed together.

To ensure that the term frequencies of one language don’t pollute those of another, either use a separate index for each language, or a separate field, as explained in the next section.

:

[Elasticsearch - The Definitive Guide] Single Query String

Elastic/TheDefinitiveGuide 2015. 12. 22. 11:13

When querying multiple fields, these are the techniques used to produce the most relevant results when computing each document's score.

But you need to understand what each one actually does before you can apply it.

Original link)

https://www.elastic.co/guide/en/elasticsearch/guide/current/_single_query_string.html

Original snippet)

Best fields

When searching for words that represent a concept, such as “brown fox,” the words mean more together than they do individually. Fields like the title and body, while related, can be considered to be in competition with each other. Documents should have as many words as possible in the same field, and the score should come from the best-matching field.

Most fields

A common technique for fine-tuning relevance is to index the same data into multiple fields, each with its own analysis chain.

The main field may contain words in their stemmed form, synonyms, and words stripped of their diacritics, or accents. It is used to match as many documents as possible.

The same text could then be indexed in other fields to provide more-precise matching. One field may contain the unstemmed version, another the original word with accents, and a third might use shingles to provide information about word proximity.

These other fields act as signals to increase the relevance score of each matching document. The more fields that match, the better.

Cross fields

For some entities, the identifying information is spread across multiple fields, each of which contains just a part of the whole:

Person: first_name and last_name
Book: title, author, and description
Address: street, city, country, and postcode

A quick summary:

- best fields returns the score of the best-matching field.

- most fields returns the sum of the field scores.

- cross fields blends the fields together to compute the score. (It uses the minimum idf value among the fields.)
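The summary above can be sketched in a few lines of Python. This is a simplification under stated assumptions: the per-field scores are made-up numbers, best_fields follows the dis_max rule (best score plus tie_breaker times the other scores), most_fields ignores the coordination factor, and cross_fields_idf only shows the "take the minimum idf" step described in the guide.

```python
def best_fields(field_scores, tie_breaker=0.0):
    """Score of the best-matching field, plus a fraction of the others."""
    best = max(field_scores.values())
    others = sum(field_scores.values()) - best
    return best + tie_breaker * others


def most_fields(field_scores):
    """Every matching field contributes: scores are added together."""
    return sum(field_scores.values())


def cross_fields_idf(field_idfs):
    """The same term is given one shared idf across fields: the minimum."""
    return min(field_idfs.values())


# Hypothetical per-field scores for one document.
scores = {"title": 2.0, "body": 0.5}

print(best_fields(scores))                   # 2.0
print(best_fields(scores, tie_breaker=0.5))  # 2.25
print(most_fields(scores))                   # 2.5
print(cross_fields_idf({"first_name": 3.0, "last_name": 1.5}))  # 1.5
```

With tie_breaker at 0 only the best field counts; raising it moves best fields gradually toward the most fields behavior.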

:

[Elasticsearch - The Definitive Guide] Pitfalls of Mixing Languages

Elastic/TheDefinitiveGuide 2015. 12. 22. 10:48

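This is about the case where several languages are mixed together.

Multilingual support can be implemented in different ways depending on the functional requirements, so it inevitably takes a lot of thought.

As an easy approach, Elasticsearch offers _all and a custom all field (copy_to, multi fields).

The page also links to language detection resources, which is what I need, so I am recording them.

Original link)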

https://www.elastic.co/guide/en/elasticsearch/guide/current/language-pitfalls.html

Language Detection)

https://github.com/mikemccand/chromium-compact-language-detector

https://code.google.com/p/cld2/

:

[Elasticsearch - The Definitive Guide] Dealing with Human Language

Elastic/TheDefinitiveGuide 2015. 12. 16. 17:47

What follows may or may not match the post title.

The explanation of precision and recall is short and well put, so I am recording it.

Original link)

https://www.elastic.co/guide/en/elasticsearch/guide/current/languages.html

Original snippet)

Full-text search is a battle between precision (returning as few irrelevant documents as possible) and recall (returning as many relevant documents as possible).
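As a small worked example of the two measures, the sketch below computes both from a set of returned document IDs and a set of truly relevant ones. The IDs are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result: 4 documents returned, 5 documents actually relevant.
precision, recall = precision_recall(
    retrieved={1, 2, 3, 4},
    relevant={2, 4, 5, 6, 7},
)
print(precision, recall)  # 0.5 0.4
```

Returning more documents tends to raise recall and lower precision, which is why the guide calls it a battle.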

The original document is really about language processing.

So I summarized just the five section titles it defines.

- Normalizing Tokens

Removes unnecessary characters from the extracted tokens.

- Reducing Words To Their Root Form

Removes unneeded information attached to a word. (Think of it as producing the word's root form.)

- Stopwords

Handles stopwords. (That is, excludes them from indexing.)

- Synonyms

Handles synonyms and near-synonyms.

- Typoes and Mispelings

Handles typos.

:

[Elasticsearch - The Definitive Guide] Controlling Relevance

Elastic/TheDefinitiveGuide 2015. 12. 16. 14:08

There are many pages on relevance, so I collected only the ones that should be read at least once.

Original link)

https://www.elastic.co/guide/en/elasticsearch/guide/current/query-scoring.html

- This page covers query competition within the bool query.

https://www.elastic.co/guide/en/elasticsearch/guide/current/not-quite-not.html

- This page covers the boosting query.

https://www.elastic.co/guide/en/elasticsearch/guide/current/ignoring-tfidf.html

- This page covers the constant_score query.

- That is, every match scores 1, and the final score is assigned according to the configured boost value.

https://www.elastic.co/guide/en/elasticsearch/guide/current/boosting-by-popularity.html

- This page covers the function_score query.

:

[Elasticsearch - The Definitive Guide] Theory Behind Relevance Scoring

Elastic/TheDefinitiveGuide 2015. 12. 16. 11:21

This explains TF/IDF scoring, the most basic building block.

I am recording it as a review.

Original link)

https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html

https://www.elastic.co/guide/en/elasticsearch/reference/1.7/mapping-core-types.html

Original snippet)

Term frequency

How often does the term appear in this document? The more often, the higher the weight. A field containing five mentions of the same term is more likely to be relevant than a field containing just one mention. The term frequency is calculated as follows:

tf(t in d) = √frequency

The term frequency (tf) for term t in document d is the square root of
the number of times the term appears in the document.

Inverse document frequency

How often does the term appear in all documents in the collection? The more often, the lower the weight. Common terms like and or the contribute little to relevance, as they appear in most documents, while uncommon terms like elastic or hippopotamus help us zoom in on the most interesting documents. The inverse document frequency is calculated as follows:

idf(t) = 1 + log ( numDocs / (docFreq + 1))

The inverse document frequency (idf) of term t is the logarithm of the number
of documents in the index, divided by the number of documents that contain the term.

Field-length norm

How long is the field? The shorter the field, the higher the weight. If a term appears in a short field, such as a title field, it is more likely that the content of that field is about the term than if the same term appears in a much bigger body field. The field length norm is calculated as follows:

norm(d) = 1 / √numTerms

The field-length norm (norm) is the inverse square root of the number of terms in the field.

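In short)

- tf is how often the term occurs within the document: the higher the frequency, the higher the weight.

- idf is how often the term occurs across all documents: the lower the frequency, the higher the weight.

- norm is the length of the text in the field: the shorter it is, the higher the weight.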
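The three formulas quoted above translate directly into Python. This is a sketch of the formulas only, assuming the natural logarithm for log; it is not the full Lucene scoring function, and the function names are made up for illustration.

```python
import math


def tf(frequency):
    """tf(t in d) = sqrt(frequency)"""
    return math.sqrt(frequency)


def idf(num_docs, doc_freq):
    """idf(t) = 1 + log(numDocs / (docFreq + 1))"""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def field_norm(num_terms):
    """norm(d) = 1 / sqrt(numTerms)"""
    return 1.0 / math.sqrt(num_terms)


print(tf(4))          # 2.0  (term appears 4 times)
print(idf(100, 99))   # 1.0  (term appears in almost every document)
print(field_norm(4))  # 0.5  (field contains 4 terms)
print(idf(100, 1) > idf(100, 50))        # rarer term gets the higher weight
print(field_norm(4) > field_norm(400))   # shorter field gets the higher weight
```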

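Taking a quick look at the mapping settings as well)

Storing only the information you need avoids wasting storage, and can also give a small performance gain when computing scores.

It is worth judging carefully whether you need every option or can safely use them selectively.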

- index_options

Allows to set the indexing options, possible values are docs (only doc numbers are indexed), freqs (doc numbers and term frequencies), and positions (doc numbers, term frequencies and positions). Defaults to positions for analyzed fields, and to docs for not_analyzed fields. It is also possible to set it to offsets (doc numbers, term frequencies, positions and offsets).

- norms: {enabled: <value>}

Boolean value if norms should be enabled or not. Defaults to true for analyzed fields, and to false for not_analyzed fields. See the section about norms.
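As a hedged illustration of how the two options might be combined in a 1.x-style mapping, the sketch below builds a request body in Python. The names my_type, tag, and body are hypothetical; only the option names index_options and norms.enabled come from the documentation quoted above.

```python
import json

# Hypothetical type and field names; adjust to your own index.
mapping = {
    "mappings": {
        "my_type": {
            "properties": {
                # Exact-match field: not_analyzed already defaults to
                # index_options "docs" and norms disabled.
                "tag": {"type": "string", "index": "not_analyzed"},
                # Analyzed field that keeps term frequencies but drops
                # positions and field-length norms to save space.
                "body": {
                    "type": "string",
                    "index_options": "freqs",
                    "norms": {"enabled": False},
                },
            }
        }
    }
}

request_body = json.dumps(mapping, indent=2)
print(request_body)
```

The trade-off is real: without positions, phrase queries on that field no longer work, and without norms, field length stops influencing the score.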

:

jjeong

420 posts tagged 'elasticsearch'
