385 posts in 'Elastic/Elasticsearch'

  1. 2014.01.10 [elasticsearch] elasticsearch chrome extension....
  2. 2014.01.08 [Elasticsearch] Hints for creating a template.
  3. 2014.01.08 [elasticsearch] Index creation script.
  4. 2014.01.07 [elasticsearch] Sample template for logstash.
  5. 2014.01.07 [elasticsearch] Sample settings & mappings code...
  6. 2014.01.07 [Elasticsearch] Building with maven....
  7. 2013.12.18 [Elasticsearch] Implementing autocomplete the easy way.
  8. 2013.12.16 [Elasticsearch] Rerouting shards..
  9. 2013.12.09 Testing elasticsearch-hadoop.
  10. 2013.12.06 queryWeight, queryNorm, fieldWeight, fieldNorm

[elasticsearch] elasticsearch chrome extension....

Elastic/Elasticsearch 2014. 1. 10. 17:53

I thought I had already shared this, but apparently not.

Many of you may already know it, but in the spirit of repetition.. ^^


https://github.com/bleskes/sense


Sense

A JSON aware developer's interface to ElasticSearch. Comes with handy machinery such as syntax highlighting, autocomplete, formatting and code folding.


Installation

Sense is installed as a Chrome Extension. Install it from the Chrome Web Store.

:

[Elasticsearch] Hints for creating a template.

Elastic/Elasticsearch 2014. 1. 8. 18:33

See the post below for background:

http://jjeong.tistory.com/914


Here are a few hints to make things easier to understand.

For the basics, see this document:

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-templates.html


step 1.

Create a template.

curl -XPUT localhost:9200/_template/template_1 -d '
{
    "template" : "te*",
    "settings" : {
        "number_of_shards" : 1
    },
    "mappings" : {
        "type1" : {
            "_source" : { "enabled" : false }
        }
    }
}
'

Here, "template" : "te*" is the index name pattern the template will be applied to.


step 2.

curl -XPUT 'http://localhost:9200/temp/'

If you create an index this way, the temp index (which matches te*) gets its settings/mappings from template_1.
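
You can sanity-check the result with the standard REST endpoints (a quick sketch; paths as in the 1.x reference):

curl -XGET 'http://localhost:9200/_template/template_1?pretty'

curl -XGET 'http://localhost:9200/temp/_settings?pretty'

curl -XGET 'http://localhost:9200/temp/_mapping?pretty'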


Let's look at a logstash example.

Below is the JSON used to create the template.

{
  "template" : "logstash-*",
  "settings" : {
    "index.refresh_interval" : "5s",
    "analysis" : {
      "analyzer" : {
        "default" : {
          "type" : "standard",
          "stopwords" : "_none_"
        }
      }
    }
  },
  "mappings" : {
    "_default_" : {
       "_all" : {"enabled" : true},
       "dynamic_templates" : [ {
         "string_fields" : {
           "match" : "*",
           "match_mapping_type" : "string",
           "mapping" : {
             "type" : "multi_field",
               "fields" : {
                 "{name}" : {"type": "string", "index" : "analyzed", "omit_norms" : true, "index_options" : "docs"},
                 "{name}.raw" : {"type": "string", "index" : "not_analyzed", "ignore_above" : 256}
               }
           }
         }
       } ],
       "properties" : {
         "@version": { "type": "string", "index": "not_analyzed" },
         "geoip" : {
           "type" : "object",
             "dynamic": true,
             "path": "full",
             "properties" : {
               "location" : { "type" : "geo_point" }
             }
         }
       }
    }
  }
}

As you can see, indices whose names match logstash-* will follow this template.

_all is enabled so that searches can run dynamically without targeting a specific field.

String fields in particular are mapped as searchable (analyzed), and judging from the not_analyzed .raw sub-field, the multi_field setup appears to be there so that features like faceting and sorting can work on the raw value.
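
For example, a query along these lines searches the analyzed field while sorting on the raw sub-field (the index name and the message field are only illustrative here):

curl -XGET 'http://localhost:9200/logstash-2014.01.08/_search?pretty' -d '
{
  "query" : { "match" : { "message" : "error" } },
  "sort" : [ { "message.raw" : "asc" } ]
}
'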


That's all for now... :)

:

[elasticsearch] Index creation script.

Elastic/Elasticsearch 2014. 1. 8. 12:08

This script is handy when you keep an index schema in a JSON file and create the index through the REST API.

It's just something I threw together for my own convenience.


#!/bin/bash
# Create (or recreate) an index from a JSON schema file via the REST API.

if [ $# -ne 3 ]; then
    echo "Usage: create_index.sh IP:PORT INDICE SCHEMA_FILE";
    echo "Example: create_index.sh localhost:9200 idx_local schema.json";
    exit 1;
fi

serviceUri=$1
indice=$2
schema=$3

# Drop the index if it already exists (the error is harmless if it doesn't).
curl -XDELETE 'http://'$serviceUri'/'$indice

# Create the index from the schema file.
curl -XPUT 'http://'$serviceUri'/'$indice -d @$schema
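
For reference, a minimal schema.json sketch to feed the script (the type and field names are just placeholders):

{
  "settings" : {
    "number_of_shards" : 1,
    "number_of_replicas" : 0
  },
  "mappings" : {
    "type1" : {
      "properties" : {
        "item_name" : { "type" : "string", "index" : "not_analyzed" }
      }
    }
  }
}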


:

[elasticsearch] Sample template for logstash.

Elastic/Elasticsearch 2014. 1. 7. 18:55

You can create templates to manage index schemas.

The logstash templates make a good, readily available example, so I'm sharing them.


http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-templates.html

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-root-object-type.html#_dynamic_templates


https://gist.github.com/deverton/2970285

https://github.com/logstash/logstash/blob/v1.3.1/lib/logstash/outputs/elasticsearch/elasticsearch-template.json



The first sample below comes from the gist linked above:

{
  "template": "logstash-*",
  "settings" : {
    "number_of_shards" : 1,
    "number_of_replicas" : 0,
    "index" : {
      "query" : { "default_field" : "@message" },
      "store" : { "compress" : { "stored" : true, "tv": true } }
    }
  },
  "mappings": {
    "_default_": {
      "_all": { "enabled": false },
      "_source": { "compress": true },
      "dynamic_templates": [
        {
          "string_template" : {
            "match" : "*",
            "mapping": { "type": "string", "index": "not_analyzed" },
            "match_mapping_type" : "string"
          }
        }
      ],
      "properties" : {
        "@fields": { "type": "object", "dynamic": true, "path": "full" },
        "@message" : { "type" : "string", "index" : "analyzed" },
        "@source" : { "type" : "string", "index" : "not_analyzed" },
        "@source_host" : { "type" : "string", "index" : "not_analyzed" },
        "@source_path" : { "type" : "string", "index" : "not_analyzed" },
        "@tags": { "type": "string", "index" : "not_analyzed" },
        "@timestamp" : { "type" : "date", "index" : "not_analyzed" },
        "@type" : { "type" : "string", "index" : "not_analyzed" }
      }
    }
  }
}


The second is the template that ships with logstash v1.3.1 (the repo link above):

{
  "template" : "logstash-*",
  "settings" : {
    "index.refresh_interval" : "5s",
    "analysis" : {
      "analyzer" : {
        "default" : {
          "type" : "standard",
          "stopwords" : "_none_"
        }
      }
    }
  },
  "mappings" : {
    "_default_" : {
       "_all" : {"enabled" : true},
       "dynamic_templates" : [ {
         "string_fields" : {
           "match" : "*",
           "match_mapping_type" : "string",
           "mapping" : {
             "type" : "multi_field",
               "fields" : {
                 "{name}" : {"type": "string", "index" : "analyzed", "omit_norms" : true, "index_options" : "docs"},
                 "{name}.raw" : {"type": "string", "index" : "not_analyzed", "ignore_above" : 256}
               }
           }
         }
       } ],
       "properties" : {
         "@version": { "type": "string", "index": "not_analyzed" },
         "geoip" : {
           "type" : "object",
             "dynamic": true,
             "path": "full",
             "properties" : {
               "location" : { "type" : "geo_point" }
             }
         }
       }
    }
  }
}
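
To register either template, save the JSON to a file (the file and template names below are just examples) and PUT it to the _template endpoint:

curl -XPUT 'http://localhost:9200/_template/logstash_template' -d @logstash-template.json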


:

[elasticsearch] Sample settings & mappings code...

Elastic/Elasticsearch 2014. 1. 7. 18:41

I'm posting this purely for reference.

You should tune each property to the characteristics of your own service.


http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-update-settings.html

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-put-mapping.html

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-core-types.html


{
    "settings" : {
        "number_of_shards" : 5,
        "number_of_replicas" : 0,
        "index" : {
            "refresh_interval" : "1s",
            "merge" : {
                "policy" : { "segments_per_tier" : 5 }
            },
            "analysis" : {
                "analyzer" : {
                    "analyzer_standard" : {
                        "type" : "standard",
                        "tokenizer" : "whitespace",
                        "filter" : ["lowercase", "trim"]
                    },
                    "analyzer_pattern" : {
                        "type" : "custom",
                        "tokenizer" : "tokenizer_pattern",
                        "filter" : ["lowercase", "trim"]
                    },
                    "analyzer_ngram" : {
                        "type" : "custom",
                        "tokenizer" : "tokenizer_ngram",
                        "filter" : ["lowercase", "trim"]
                    }
                },
                "tokenizer" : {
                    "tokenizer_ngram" : {
                        "type" : "nGram",
                        "min_gram" : "2",
                        "max_gram" : "10",
                        "token_chars": [ "letter", "digit" ]
                    },
                    "tokenizer_pattern" : {
                        "type" : "pattern",
                        "pattern" : ","
                    }
                }
            },
            "store" : {
                "type" : "mmapfs",
                "compress" : {
                    "stored" : true,
                    "tv" : true
                }
            }
        }
    },
    "mappings" : {
        "INDICE_TYPE_NAME" : {
            "_id" : {
                "index" : "not_analyzed",
                "path" : "KEY_FIELD_NAME"
            },
            "_source" : {
                "enabled" : "true"
            },
            "_all" : {
                "enabled" : "false"
            },
            "_boost" : {
                "name" : "_boost",
                "null_value" : 1.0
            },
            "analyzer" : "analyzer_standard",
            "index_analyzer" : "analyzer_standard",
            "search_analyzer" : "analyzer_standard",
            "properties" : {
                "LONG_KEY_FIELD" : {"type" : "long", "store" : "no", "index" : "not_analyzed",  "omit_norms" : true, "index_options" : "docs", "ignore_malformed" : true, "include_in_all" : false},
                "STRING_SEARCH_FIELD" : {"type" : "string", "store" : "no", "index" : "analyzed", "omit_norms" : false, "index_options" : "offsets", "term_vector" : "with_positions_offsets", "include_in_all" : false},
                "STRING_VIEW_FIELD" : {"type" : "string", "store" : "yes", "index" : "no", "include_in_all" : false},
                "INTEGER_KEY_FIELD" : {"type" : "integer", "store" : "no", "index" : "not_analyzed",  "omit_norms" : true, "index_options" : "docs", "ignore_malformed" : true, "include_in_all" : false},
                "FLOAT_KEY_FIELD" : {"type" : "float", "store" : "no", "index" : "not_analyzed", "omit_norms" : true, "index_options" : "docs", "ignore_malformed" : true, "include_in_all" : false},
                "LONG_VIEW_FIELD" : {"type" : "long", "store" : "yes", "index" : "no",  "ignore_malformed" : true, "include_in_all" : false},
                "STRING_KEY_FIELD" : {"type" : "string", "store" : "no", "index" : "not_analyzed", "omit_norms" : true, "index_options" : "docs", "include_in_all" : false},
                "NESTED_KEY_FIELD" : {"type" : "nested",
                "properties" : {
                    "STRING_KEY_FIELD" : {"type" : "string", "store" : "no", "index" : "not_analyzed", "omit_norms" : true, "index_options" : "docs", "include_in_all" : false},
                    "INTEGER_VIEW_FIELD" : {"type" : "integer", "store" : "yes", "index" : "no",  "ignore_malformed" : true, "include_in_all" : false}
                    }
                },
                "BOOLEAN_VIEW_FIELD" : {"type" : "boolean", "store" : "yes", "include_in_all" : false},
                "BOOLEAN_KEY_FIELD" : {"type" : "boolean", "store" : "no", "index" : "not_analyzed", "omit_norms" : true, "index_options" : "docs", "include_in_all" : false},
                "OBJECT_VIEW_FIELD" : {"type" : "object", "dynamic" : true, "store" : "yes", "index" : "no", "include_in_all" : false}
            }
        }
    }
}
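
To try it out, replace the placeholder names (INDICE_TYPE_NAME, KEY_FIELD_NAME, the *_FIELD entries) with real ones, save the JSON to a file, and create an index with it, e.g.:

curl -XPUT 'http://localhost:9200/INDICE_NAME' -d @settings_mappings.json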






:

[Elasticsearch] Building with maven....

Elastic/Elasticsearch 2014. 1. 7. 12:30

http://www.elasticsearch.org/contributing-to-elasticsearch/


Refer to this after checking out elasticsearch, when you want to modify the code or build it.

Earlier versions seemed to build with maven 2.x, but now 3.x is required.

Switch the run configuration to maven 3.x and the build completes normally.


As the document above explains, put the following options into Goals and run the build.

clean package -DskipTests
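
If you build from the command line instead of the IDE, the equivalent is:

mvn clean package -DskipTests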


:

[Elasticsearch] Implementing autocomplete the easy way.

Elastic/Elasticsearch 2013. 12. 18. 14:59

Nothing difficult or fancy here.

Real autocomplete needs typo correction, dictionary integration, and so on in combination, but you can get a basic version very easily with the prefix query that es provides.


[Reference]

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-prefix-query.html


[Test URI]

http://localhost:9200/idx_local/_search?source={"query":{"prefix":{"item_name":"나"}}}

http://localhost:9200/idx_local/_search?source={"query":{"prefix":{"item_name":"나이"}}}


[Explanation]

- It returns documents whose item_name field starts with the given prefix (e.g. "나").

- The item_name field must be declared with index: not_analyzed, as in the mapping sketch below.

- This works nicely for things like popular-keyword or search-term autocomplete on a shopping site.
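
A minimal mapping sketch matching the note above (the item type name is hypothetical):

curl -XPUT 'http://localhost:9200/idx_local' -d '
{
  "mappings" : {
    "item" : {
      "properties" : {
        "item_name" : { "type" : "string", "index" : "not_analyzed" }
      }
    }
  }
}
'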

:

[Elasticsearch] Rerouting shards..

Elastic/Elasticsearch 2013. 12. 16. 17:04

Occasionally, when elasticsearch terminates abnormally, unassigned shards are left behind.

In that case you can reallocate them as shown below.

It's worth keeping handy for operations.

[shard_reroute.sh]

#!/bin/bash
# Reallocate an unassigned shard via the cluster reroute API.
# Note: the target node ("node1") is hardcoded; adjust it to your cluster.

if [ $# -ne 2 ]; then
    echo "Usage: shard_reroute.sh INDICE_NAME SHARD_NUMBER";
    echo "Example: shard_reroute.sh products 1";
    exit 1;
fi

indice=$1
shard=$2

# Print the command that will run (inner quotes escaped so they survive echo), then run it.
# allow_primary:true forces allocation even as a primary shard; use with care, as it can lose data.
echo "curl -XPOST http://localhost:9200/_cluster/reroute -d '{ \"commands\" : [ { \"allocate\" : { \"index\" : \"$indice\", \"shard\" : $shard, \"node\" : \"node1\", \"allow_primary\" : true } } ] }'"
echo ""
curl -XPOST http://localhost:9200/_cluster/reroute -d '{ "commands" : [ { "allocate" : { "index" : "'$indice'", "shard" : '$shard', "node" : "node1", "allow_primary" : true } } ] }'
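
To check whether there are unassigned shards in the first place, cluster health reports the count and cluster state shows the full routing table:

curl -XGET 'http://localhost:9200/_cluster/health?pretty'

curl -XGET 'http://localhost:9200/_cluster/state?pretty'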

[Reference URL]

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cluster-reroute.html

:

Testing elasticsearch-hadoop.

Elastic/Elasticsearch 2013. 12. 9. 17:54

[Project]

https://github.com/elasticsearch/elasticsearch-hadoop


[Introduction]

http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/requirements.html


I tested the project above for integration with hadoop.

It has improved a lot compared to the early releases.


To summarize briefly, it migrates data stored in hdfs into elasticsearch.

It goes without saying that once the data is in es, it's searchable too.


In my case I tested only as far as mapreduce and hive.

Everything worked well, and there seem to be plenty of ways to put it to use.

For ideas on how, the link below is worth a look.

http://hortonworks.com/blog/fast-search-and-analytics-on-hadoop-with-elasticsearch-and-hdp/


Use Cases

Here are just some of the use case results from Elasticsearch:

  • Perform real-time analysis of 200 million conversations across the social web each day helping major brands make business decisions based on social data
  • Run marketing campaigns that quickly identify the right key influencers from a database of 400 million users
  • Provide real-time search results from an index of over 10 billion documents
  • Power intelligent search and better inform recommendations to millions of customers a month
  • Increase the speed of searches by 1000 times
  • Instant search for 100,000 source code repositories containing tens of billions of lines of code

Judging only by the diagram in that post, Hortonworks looks like a required piece, but you can do without it.

Plain hadoop + elasticsearch is enough.. build something good with it.


The code I actually referenced:

[MapReducer]

http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/mapreduce.html


[Hive]

http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/hive.html



[SW Version]

hadoop 1.2.1

hive 0.12.0

elasticsearch-hadoop-1.3.0-SNAPSHOT





:

queryWeight, queryNorm, fieldWeight, fieldNorm

Elastic/Elasticsearch 2013. 12. 6. 10:55

reference : http://grokbase.com/t/lucene/solr-user/12cdvgz48t/score-calculation

queryWeight = the impact of the query against the field
implementation: boost(query)*idf*queryNorm


boost(query) = boost of the field at query-time
Implication: hits in fields with higher boost get a higher score
Rationale: a term in field A could be more relevant than the same term in field B


idf = inverse document frequency = measure of how often the term appears across the index for this field
implementation: log(numDocs/(docFreq+1))+1
Implication: the greater the occurrence of a term in different documents, the lower its score
Rationale: common terms are less important than uncommon ones
numDocs = the total number of documents in the index, not including those that are marked as deleted but have not yet been purged. This is a constant (the same value for all documents in the index).
docFreq = the number of documents in the index which contain the term in this field. This is a constant (the same value for all documents in the index containing this field)
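
A quick worked example: with numDocs = 1000 and docFreq = 9, idf = log(1000/(9+1)) + 1 = ln(100) + 1 ≈ 5.61 (the log in Lucene's formula is the natural logarithm).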


queryNorm = normalization factor so that queries can be compared
implementation: 1/sqrt(sumOfSquaredWeights)
Implication: doesn't impact the relevancy of this result
Rationale: queryNorm is not related to the relevance of the document, but rather tries to make scores between different queries comparable. This value is equal for all results of the query


fieldWeight = the score of a term matching the field
implementation: tf*idf*fieldNorm


tf = term frequency in a field = measure of how often a term appears in the field
implementation: sqrt(freq)
Implication: the more frequently a term occurs in a field, the greater its score
Rationale: fields which contain more of a term are generally more relevant
freq = termFreq = number of times the term occurs in the field for this document


fieldNorm = impact of a hit in this field
implementation: lengthNorm*boost(index)
lengthNorm = measure of the importance of a term according to the total number of terms in the field
implementation: 1/sqrt(numTerms)
Implication: a term matched in a field with fewer terms has a higher score
Rationale: a term in a field with fewer terms is more important than one in a field with more
numTerms = number of terms in the field
boost(index) = boost of the field at index-time
Implication: hits in fields with higher boost get a higher score
Rationale: a term in field A could be more relevant than the same term in field B


maxDocs = the number of documents in the index, including those that are marked as deleted but have not yet been purged. This is a constant (the same value for all documents in the index)
Implication: (probably) doesn't play a role in the scoring calculation


coord = number of terms in the query that were found in the document (omitted if equal to 1)
implementation: overlap/maxOverlap
Implication: of the terms in the query, a document that contains more terms will have a higher score
Rationale: documents that match the most optional terms score highest
overlap = the number of query terms matched in the document
maxOverlap = the total number of terms in the query


FunctionQuery = could be any kind of custom ranking function, whose outcome is added to, or multiplied with, the default rank score.
Implication: various


Look at the EXPLAIN information to see how the final score is calculated.
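
In elasticsearch you can get this explain breakdown for each hit by adding explain=true to a search request (the index, field, and term below are only placeholders):

curl -XGET 'http://localhost:9200/idx_local/_search?explain=true&pretty' -d '
{ "query" : { "match" : { "item_name" : "nike" } } }
'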

: