Posts in the 'Elastic/Elasticsearch' category (page 34)

[elasticsearch] Mapping - Array/Object/Nested Type

Elastic/Elasticsearch 2013. 4. 16. 12:04

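This document is based on my own tests, elasticsearch.org, and community posts, and its purpose is to share information.

Please point out anything that is wrong.

(The example code has not been verified for performance or security.)

[elasticsearch review]

Original links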

http://www.elasticsearch.org/guide/reference/mapping/

http://www.elasticsearch.org/guide/reference/mapping/array-type/

http://www.elasticsearch.org/guide/reference/mapping/object-type/

http://www.elasticsearch.org/guide/reference/mapping/nested-type/

The examples in the original documentation are good, so I use them as they are.

[Document]

    "tweet" : {
        "message" : "some arrays in this tweet...",
        "tags" : ["elasticsearch", "wow"],
        "lists" : [
            {
                "name" : "prog_list",
                "description" : "programming list"
            },
            {
                "name" : "cool_list",
                "description" : "cool stuff list"
            }
        ]
    }

[Mapping]

    "tweet" : {
        "properties" : {
            "message" : {"type" : "string"},
            "tags" : {"type" : "string", "index_name" : "tag"},
            "lists" : {
                "properties" : {
                    "name" : {"type" : "string"},
                    "description" : {"type" : "string"}
                }
            }
        }
    }

[Description]

- This is one of the strengths of elasticsearch.

- As the mapping shows, you can embed sub-documents inside a document.

- Where the _parent field builds a parent structure between index types, this builds a parent-child structure inside the document itself.

- index_name in the mapping lets you define the field name under which the elements of an array type like this are indexed.

- This is worth using when a document needs a nested, loop-like structure.

- For example, a chat room that holds the list of messages its participants have exchanged:

    "chat_room" : {
        "properties" : {
            "room_id" : {.....},
            ........
            "chat_lists" : {
                "properties" : {
                    "chat_id" : {....},
                    "sender" : {....},
                    "receiver" : {....},
                    "message" : {....}
                }
            }
        }
    }

. A structure like this is also possible.

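[Object/Nested]

- These two types are similar to array.

- Where an array is written with [], an object is written with {}.

- The nested type holds a set of entries that all follow one defined field structure: if a structure is defined as an object, nested stores multiple instances of it.

The examples from the original documentation below make this easy to follow.

[Object example]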

    "person" : {
        "properties" : {
            "name1" : {
                "type" : "object",
                "path" : "just_name",
                "properties" : {
                    "first1" : {"type" : "string"},
                    "last1" : {"type" : "string", "index_name" : "i_last_1"}
                }
            },
            "name2" : {
                "type" : "object",
                "path" : "full",
                "properties" : {
                    "first2" : {"type" : "string"},
                    "last2" : {"type" : "string", "index_name" : "i_last_2"}
                }
            }
        }
    }

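- path takes one of two options.

. just_name and full

. With just_name, the field is indexed under the index_name defined in the mapping (or under the bare field name when no index_name is set).

. With full, the field is indexed under its full name, prefixed with the object name.

- The results from the original documentation make this clear.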

JSON Name	Document Field Name
`name1`/`first1`	`first1`
`name1`/`last1`	`i_last_1`
`name2`/`first2`	`name2.first2`
`name2`/`last2`	`name2.i_last_2`
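The naming rule in the table can be sketched in a few lines of Java. This is only my own illustration and not elasticsearch code: the class ObjectPathSketch and the method documentFieldName are hypothetical names, and the rule is inferred purely from the four rows above.

```java
// Hypothetical helper (not part of elasticsearch): reproduces the naming rule
// shown in the JSON Name / Document Field Name table.
class ObjectPathSketch {

    // path is "just_name" or "full"; indexName may be null when the mapping sets no index_name.
    static String documentFieldName(String objectName, String fieldName, String indexName, String path) {
        // The leaf name is the index_name when one is defined, otherwise the field name itself.
        String leaf = (indexName != null) ? indexName : fieldName;
        if ("just_name".equals(path)) {
            return leaf;                  // just_name: no object prefix
        }
        return objectName + "." + leaf;   // full: prefixed with the object name
    }

    public static void main(String[] args) {
        System.out.println(documentFieldName("name1", "first1", null, "just_name"));
        System.out.println(documentFieldName("name1", "last1", "i_last_1", "just_name"));
        System.out.println(documentFieldName("name2", "first2", null, "full"));
        System.out.println(documentFieldName("name2", "last2", "i_last_2", "full"));
    }
}
```

Running main prints the four values in the Document Field Name column, in the same order as the table.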

[Nested example]

{
    "type1" : {
        "properties" : {
            "obj1" : {
                "type" : "nested"
            }
        }
    }
}

- The example above leaves out the field definitions inside "obj1", but they can be declared.

- It looks like this.

    "obj1" : {
        "type" : "nested",
        "properties" : {
            "first_name" : { "type" : "string", ....},
            "last_name" : { "type" : "string", ....}
            ......
        }
    }
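To see why the choice between object and nested matters, here is a small self-contained Java sketch. It does not call elasticsearch at all; it only imitates, under my own simplifying assumptions, how an array of plain objects ends up as flat multi-valued fields while nested entries are matched one at a time. The class and method names are hypothetical, and the data is the "lists" array from the tweet example at the top of this post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: imitates object-style flattening vs nested-style matching.
class NestedVsObjectSketch {

    static Map<String, String> entry(String name, String description) {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("name", name);
        m.put("description", description);
        return m;
    }

    // The "lists" array from the tweet document.
    static List<Map<String, String>> sampleLists() {
        return Arrays.asList(
                entry("prog_list", "programming list"),
                entry("cool_list", "cool stuff list"));
    }

    // Object-style view: every inner field becomes one multi-valued field, e.g. lists.name.
    static Map<String, List<String>> flatten(String prefix, List<Map<String, String>> items) {
        Map<String, List<String>> flat = new LinkedHashMap<>();
        for (Map<String, String> item : items) {
            for (Map.Entry<String, String> e : item.entrySet()) {
                String key = prefix + "." + e.getKey();
                List<String> values = flat.get(key);
                if (values == null) {
                    values = new ArrayList<>();
                    flat.put(key, values);
                }
                values.add(e.getValue());
            }
        }
        return flat;
    }

    // Flattened match: name and description may come from different array entries.
    static boolean matchesFlattened(Map<String, List<String>> flat, String prefix,
                                    String name, String description) {
        List<String> names = flat.get(prefix + ".name");
        List<String> descriptions = flat.get(prefix + ".description");
        return names != null && descriptions != null
                && names.contains(name) && descriptions.contains(description);
    }

    // Nested-style match: both values must sit in the same array entry.
    static boolean matchesNested(List<Map<String, String>> items, String name, String description) {
        for (Map<String, String> item : items) {
            if (name.equals(item.get("name")) && description.equals(item.get("description"))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<String>> flat = flatten("lists", sampleLists());
        System.out.println(flat);
        // "prog_list" and "cool stuff list" belong to different entries.
        System.out.println(matchesFlattened(flat, "lists", "prog_list", "cool stuff list"));
        System.out.println(matchesNested(sampleLists(), "prog_list", "cool stuff list"));
    }
}
```

The flattened view matches the mixed-up pair because the link between a name and its description is lost, while the nested-style check rejects it. That is the situation where nested is the better fit.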

:

[elasticsearch] Java API : mapping property.

Elastic/Elasticsearch 2013. 4. 16. 10:57

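[elasticsearch java api review]

Original link

http://www.elasticsearch.org/guide/reference/mapping/

This post lists the mapping settings, which are also available through the Java API.

For details, be sure to read the Fields and Types sections of the original documentation at least once.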

[Mapping Template Sample]

    "mapping" : {
        "TYPE_NAME" : {
            "analyzer" : "standard",
            "index_analyzer" : "standard",
            "search_analyzer" : "standard",
            "_id" : {
                "index" : "not_analyzed",
                "store" : "yes",
                "path" : "FIELD_NAME"
            },
            "_type" : {
                "index" : "not_analyzed",
                "store" : "yes"
            },
            "_source" : {
                "enabled" : "false"
            },
            "_all" : {
                "enabled" : "false"
            },
            "_boost" : {
                "name" : "_boost",
                "null_value" : 1.0
            },
            "_parent" : {
                "type" : "PARENT_TYPE_NAME"
            },
            "_routing" : {
                "required" : true,
                "path" : "TYPE_NAME.FIELD_NAME"
            },
            "_timestamp" : {
                "enabled" : true,
                "path" : "DATE_FIELD_NAME",
                "format" : "dateOptionalTime"
            },
            "properties" : {
                "FIELD_NAME" : {
                    "type" : "string",
                    "index_name" : ,
                    "store" : ,
                    "index" : ,
                    "term_vector" : ,
                    "boost" : ,
                    "null_value" : ,
                    "omit_norms" : ,
                    "omit_term_freq_and_positions" : ,
                    "index_options" : ,
                    "analyzer" : ,
                    "index_analyzer" : ,
                    "search_analyzer" : ,
                    "include_in_all" : ,
                    "ignore_above" : ,
                    "position_offset_gap" :
                },
                "FIELD_NAME" : {
                    "type" : "float, double, byte, short, integer, and long",
                    "index_name" : ,
                    "store" : ,
                    "index" : ,
                    "precision_step" : ,
                    "null_value" : ,
                    "boost" : ,
                    "include_in_all" : ,
                    "ignore_malformed" :
                },
                "FIELD_NAME" : {
                    "type" : "date",
                    "index_name" : ,
                    "format" : ,
                    "store" : ,
                    "index" : ,
                    "precision_step" : ,
                    "null_value" : ,
                    "boost" : ,
                    "include_in_all" : ,
                    "ignore_malformed" :
                },
                "FIELD_NAME" : {
                    "type" : "boolean",
                    "index_name" : ,
                    "store" : ,
                    "index" : ,
                    "null_value" : ,
                    "boost" : ,
                    "include_in_all" :
                },
                "FIELD_NAME" : {
                    "type" : "binary",
                    "index_name" :
                }
            }
        }
    }

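[Fields & Core Type]

fields

- _id

. A document's unique id is _uid (_type combined with _id); the _id field can be used to index just the id.

. By default it is neither indexed nor stored.

- _type

. By default it is indexed but not stored.

- _source

. Holds the original JSON document that was indexed; enabled decides whether it is kept.

- _all

. A catch-all field that holds the content of one or more other fields; include_in_all on each field decides whether that field goes into it.

"simple1" : {"type" : "long", "include_in_all" : true},

"simple2" : {"type" : "long", "include_in_all" : false}

- _analyzer (a field you do not have to configure)

. At index time the registered analyzer or index_analyzer is used.

. If a document field is specified as the path, the value of that field is used as the name of the analyzer.

- _boost

. Used to raise the relevance of a document or field.

- _parent

. A child mapping definition that points to its parent type.

. With a blog type and a blog_tag type, the parent type of blog_tag is blog.

- _routing

. Used to manage routing for indexed data.

. By default the routing field is set to store : yes, index : not_analyzed.

- _index (a field you do not have to configure)

. Stores the name of the index that a document belongs to.

. Disabled by default, so nothing is stored.

- _size (a field you do not have to configure)

. The size of the original _source, indexed automatically when enabled.

. Disabled by default.

- _timestamp

. The timestamp of a document at index time.

. By default store : no, index : not_analyzed,

. and a field can be specified as its source when it is configured.

. The default format is dateOptionalTime. (http://www.elasticsearch.org/guide/reference/mapping/date-format/)

- _ttl

. Sets the expiration date of a document at index time.

. Disabled by default.

. Once it is set, documents are deleted automatically after the ttl has passed.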

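core types

string type

- index_name

. Used when declaring an array type: the field name under which the individual items of the array are indexed.

- store

. Controls whether the field value is stored. The default is no; yes stores it.

- index

. Controls how the field is analyzed for indexing and search.

. analyzed : the field is run through an analyzer at index and search time.

. not_analyzed : the field is searchable, but is indexed as-is without analysis.

. no : the field is not searchable.

- term_vector

. The default is no.

. Possible values: no, yes, with_offsets, with_positions, with_positions_offsets.

- boost

. The default is 1.0.

- null_value

. By default nothing is added for a null value; if a value is configured here, it is indexed in its place.

- omit_norms

. Defaults to false for analyzed fields and to true for not_analyzed fields.

- index_options

. Indexing options.

. docs : only document numbers are indexed (the default for not_analyzed fields).

. freqs : document numbers and term frequencies are indexed.

. positions : document numbers, term frequencies, and positions are indexed (the default for analyzed fields).

- analyzer

. A global setting, used at both search and index time.

- index_analyzer

. Used at index time.

- search_analyzer

. Used at search time.

- include_in_all

. The default is true.

. Decides whether the field is included in the _all field.

- ignore_above

. Strings longer than the given size are ignored.

- position_offset_gap

number type

- type : "float, double, byte, short, integer, and long"

- index_name

- store

- index

- precision_step

. Controls how many term values are generated for each number.

. The smaller the value, the faster range searches run, at the cost of more terms in the index.

. The default is 4; around 4 is used for 32-bit values and 6 to 8 for 64-bit values.

. 0 disables it.

- null_value

- boost

- include_in_all

- ignore_malformed

. Ignores malformed numbers.

. The default is false, so it is a good idea to set it to true.

date type

- index_name

- format

. http://www.elasticsearch.org/guide/reference/mapping/date-format.html

- store

- index

- precision_step

. Controls how many term values are generated for each value.

. The smaller the value, the faster range searches run, at the cost of more terms in the index.

. The default is 4; around 4 is used for 32-bit values and 6 to 8 for 64-bit values.

. 0 disables it.

- null_value

- boost

- include_in_all

- ignore_malformed

. Ignores malformed values.

. The default is false, so it is a good idea to set it to true.

boolean type

- index_name

- store

- index

- null_value

- boost

- include_in_all

binary type

- index_name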

:

[elasticsearch] Java API : settings property.

Elastic/Elasticsearch 2013. 4. 16. 10:54

[elasticsearch java api review]

Original links

http://www.elasticsearch.org/guide/reference/modules/

http://www.elasticsearch.org/guide/reference/index-modules/

http://www.elasticsearch.org/guide/reference/api/admin-indices-update-settings/

This post lists the settings values, which are also available through the Java API.

There are both cluster settings and index settings, so you need to check both.

Cluster and index settings are usually configured as global settings, so these values are useful when you write elasticsearch.yml.

Let's look at the update settings and at an example written as JSON.

[admin indices update settings]

Setting	Description
`index.number_of_replicas`	The number of replicas each shard has.
`index.auto_expand_replicas`	Set to an actual value (like `0-all`) or `false` to disable it.
`index.blocks.read_only`	Set to `true` to have the index read only. `false` to allow writes and metadata changes.
`index.blocks.read`	Set to `true` to disable read operations against the index.
`index.blocks.write`	Set to `true` to disable write operations against the index.
`index.blocks.metadata`	Set to `true` to disable metadata operations against the index.
`index.refresh_interval`	The async refresh interval of a shard.
`index.term_index_interval`	The Lucene index term interval. Only applies to newly created docs.
`index.term_index_divisor`	The Lucene reader term index divisor.
`index.translog.flush_threshold_ops`	When to flush based on operations.
`index.translog.flush_threshold_size`	When to flush based on translog (bytes) size.
`index.translog.flush_threshold_period`	When to flush based on a period of not flushing.
`index.translog.disable_flush`	Disables flushing. Note, should be set for a short interval and then enabled.
`index.cache.filter.max_size`	The maximum size of filter cache (per segment in shard). Set to `-1` to disable.
`index.cache.filter.expire`	The expire after access time for filter cache. Set to `-1` to disable.
`index.gateway.snapshot_interval`	The gateway snapshot interval (only applies to shared gateways).
merge policy	All the settings for the merge policy currently configured. A different merge policy can’t be set.
`index.routing.allocation.include.*`	A node matching any rule will be allowed to host shards from the index.
`index.routing.allocation.exclude.*`	A node matching any rule will NOT be allowed to host shards from the index.
`index.routing.allocation.require.*`	Only nodes matching all rules will be allowed to host shards from the index.
`index.routing.allocation.total_shards_per_node`	Controls the total number of shards allowed to be allocated on a single node. Defaults to unbounded.
`index.recovery.initial_shards`	When using local gateway a particular shard is recovered only if there can be allocated quorum shards in the cluster. It can be set to `quorum` (default), `quorum-1` (or `half`), `full` and `full-1`. Number values are also supported, e.g. `1`.
`index.gc_deletes`
`index.ttl.disable_purge`	Disables temporarily the purge of expired docs.

These settings need to be tuned to the characteristics of your service.

The sample below is for reference on the properties above.

[Sample JSON String]

    curl -XPUT 'http://localhost:9200/test/' -d '{
        "settings" : {
            "number_of_shards" : 5,
            "number_of_replicas" : 1,
            "index" : {
                "analysis" : {
                    "analyzer" : {
                        "default" : {
                            "type" : "standard",
                            "tokenizer" : "standard",
                            "filter" : ["lowercase", "trim"]
                        },
                        "default_index" : {
                            "type" : "standard",
                            "tokenizer" : "standard",
                            "filter" : ["lowercase", "trim"]
                        },
                        "default_search" : {
                            "type" : "standard",
                            "tokenizer" : "standard",
                            "filter" : ["lowercase", "trim"]
                        },
                        "my_analyzer1" : {
                            "tokenizer" : "standard",
                            "filter" : ["standard", "lowercase", "trim"]
                        },
                        "my_analyzer2" : {
                            "type" : "custom",
                            "tokenizer" : "tokenizer1",
                            "filter" : ["filter1", "trim"]
                        }
                    },
                    "tokenizer" : {
                        "tokenizer1" : {
                            "type" : "standard",
                            "max_token_length" : 255
                        }
                    },
                    "filter" : {
                        "filter1" : {
                            "type" : "lowercase",
                            "language" : "greek"
                        }
                    }
                },
                "compound_format" : false,
                "merge" : {
                    "policy" : {
                        "max_merge_at_once" : 10,
                        "segments_per_tier" : 20
                    }
                },
                "refresh_interval" : "1s",
                "term_index_interval" : 1,
                "store" : {
                    "type" : "mmapfs",
                    "compress" : {
                        "stored" : true,
                        "tv" : true
                    }
                }
            }
        }
    }'

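- Only the settings that are not in the table above are described here.

Setting	Description
number_of_shards	The number of shards for the index.
index.analysis.analyzer (.default, .default_index, .default_search, .my_analyzer1, .my_analyzer2), index.analysis.tokenizer, index.analysis.filter	Registers the analyzers used for indexing and searching. The .default* entries register the default analyzers, which act as the default for the index/type. The .my_analyzer* entries are user-defined analyzers. .tokenizer and .filter define the tokenizers and filters that the analyzers use.
index.compound_format	When a file-based store is used, setting this to false gives better performance.
index.store.type
index.store.compress (.stored, .tv)	Compression settings for the stored index data. Compression works well for small documents of 64KB or less.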

:

[elasticsearch] Java API : Search

Elastic/Elasticsearch 2013. 4. 15. 11:36

[elasticsearch java api review]

Original link : http://www.elasticsearch.org/guide/reference/java-api/search/

- This API executes a search query and returns the results that match it.

Let's start with the example from the original documentation.

    SearchResponse response = client.prepareSearch("index1", "index2")
            .setTypes("type1", "type2")
            .setSearchType(SearchType.DFS_QUERY_THEN_FETCH)
            .setQuery(QueryBuilders.termQuery("multi", "test"))             // Query
            .setFilter(FilterBuilders.rangeFilter("age").from(12).to(18))   // Filter
            .setFrom(0).setSize(60).setExplain(true)
            .execute()
            .actionGet();

- This builds a search over two indices (index1 and index2).

- It also specifies the types to search within each index.

- Expressed as REST, this becomes http://localhost:9200/index1,index2/type1,type2/_search?.......

- http://www.elasticsearch.org/guide/reference/api/search/indices-types/

- termQuery looks for the term test in the field named multi.

    /**
     * A Query that matches documents containing a term.
     *
     * @param name The name of the field
     * @param value The value of the term
     */
    public static TermQueryBuilder termQuery(String name, String value) {
        return new TermQueryBuilder(name, value);
    }

[Operation Threading Model]

- NO_THREADS : runs on the calling thread

- SINGLE_THREAD : forks one additional thread that queries all the shards

- THREAD_PER_SHARD : forks a separate thread for each individual shard

- The default is SINGLE_THREAD; the performance impact still needs to be checked.

[MultiSearch API]

    SearchRequestBuilder srb1 = node.client()
            .prepareSearch().setQuery(QueryBuilders.queryString("elasticsearch")).setSize(1);

    SearchRequestBuilder srb2 = node.client()
            .prepareSearch().setQuery(QueryBuilders.matchQuery("name", "kimchy")).setSize(1);

    MultiSearchResponse sr = node.client().prepareMultiSearch()
            .add(srb1)
            .add(srb2)
            .execute().actionGet();

    // You will get all individual responses from MultiSearchResponse#responses()
    long nbHits = 0;
    for (MultiSearchResponse.Item item : sr.responses()) {
        SearchResponse response = item.response();
        nbHits += response.hits().totalHits();
    }

- A single request that carries several separate searches returns the result of each one.

[Using Facets]

- http://www.elasticsearch.org/guide/reference/java-api/facets/

    SearchResponse sr = node.client().prepareSearch()
            .setQuery(QueryBuilders.matchAllQuery())
            .addFacet(FacetBuilders.termsFacet("f1").field("field"))
            .addFacet(FacetBuilders.dateHistogramFacet("f2").field("birth").interval("year"))
            .execute().actionGet();

    // Get your facet results
    TermsFacet f1 = (TermsFacet) sr.facets().facetsAsMap().get("f1");
    DateHistogramFacet f2 = (DateHistogramFacet) sr.facets().facetsAsMap().get("f2");

- A facet search is easiest to approach if you think of it as classifying or grouping the search results.

- In termsFacet("f1"), f1 is the facet name.
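To make the "grouping" idea concrete, here is a small Java sketch of what a terms facet boils down to: counting how many hits carry each term. It is only my own illustration and does not use the elasticsearch client; TermsFacetSketch and countTerms are hypothetical names, and the input list stands in for the values of one field across the matching documents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for what a terms facet reports; it does not call elasticsearch.
class TermsFacetSketch {

    // Counts how often each term occurs and returns the terms ordered by count
    // (highest first), breaking ties alphabetically so the output is deterministic.
    static Map<String, Integer> countTerms(List<String> fieldValues) {
        Map<String, Integer> counts = new HashMap<>();
        for (String value : fieldValues) {
            counts.merge(value, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = b.getValue().compareTo(a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        Map<String, Integer> ordered = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : entries) {
            ordered.put(entry.getKey(), entry.getValue());
        }
        return ordered;
    }

    public static void main(String[] args) {
        // One "year" value per matching document.
        List<String> years = Arrays.asList("2013", "2012", "2013", "2013", "2012", "2011");
        System.out.println(countTerms(years)); // prints {2013=3, 2012=2, 2011=1}
    }
}
```

A real terms facet does this counting on the shards and merges the results, so treat the sketch as a mental model rather than a description of the implementation.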

The code from here on is what I wrote for testing.

Please use it for reference only.

[Basic Query Search]

    response = client.prepareSearch("facebook")
            .setOperationThreading(SearchOperationThreading.THREAD_PER_SHARD)
            .setRouting("1365503894967")
            .setTypes("post")
            .setSearchType(SearchType.DFS_QUERY_THEN_FETCH)
            .setQuery(QueryBuilders.termQuery("title", "9"))
            .setFrom(0)
            .setSize(20)
            .setExplain(true)
            .execute()
            .actionGet();

    log.debug("{}", response);

[Multi Search]

    SearchRequestBuilder srb1 = client
            .prepareSearch("facebook").setQuery(QueryBuilders.queryString("93").field("title")).setSize(1);

    SearchRequestBuilder srb2 = client
            .prepareSearch("facebook").setQuery(QueryBuilders.matchQuery("title", "94")).setSize(1);

    MultiSearchResponse sr = client.prepareMultiSearch()
            .add(srb1)
            .add(srb2)
            .execute().actionGet();

    // You will get all individual responses from MultiSearchResponse#responses()
    long nbHits = 0;
    for (MultiSearchResponse.Item item : sr.responses()) {
        response = item.response();
        log.debug("{}", response);
    }

[MatchQuery]

    response = client.prepareSearch("facebook")
            .setOperationThreading(SearchOperationThreading.THREAD_PER_SHARD)
            .setTypes("post")
            .setSearchType(SearchType.DFS_QUERY_THEN_FETCH)
            .setQuery(
                QueryBuilders.matchQuery("title", "1365577624100 twitter")
                    .type(Type.BOOLEAN)           // default; PHRASE and PHRASE_PREFIX match the query text as a single phrase.
                    .analyzer("gruter_analyzer")  // if no analyzer is given, the one from settings is used.
                    .operator(Operator.OR)        // how the tokens of the query are combined.
            )
            .setFrom(0)
            .setSize(5)
            .setExplain(false)
            .execute()
            .actionGet();

    log.debug("{}", response);

[Multi MatchQuery]

    response = client.prepareSearch("facebook")
            .setOperationThreading(SearchOperationThreading.THREAD_PER_SHARD)
            .setTypes("post")
            .setSearchType(SearchType.DFS_QUERY_THEN_FETCH)
            .setQuery(
                // has the same effect as running a matchQuery against each field.
                QueryBuilders.multiMatchQuery("136557762410", "docid", "title")
                    .type(Type.PHRASE_PREFIX)
                    .operator(Operator.OR)
            )
            .setFrom(0)
            .setSize(14)
            .setExplain(false)
            .execute()
            .actionGet();

    log.debug("{}", response);

[Facet Search]

    response = client.prepareSearch("blog")
            .setQuery(QueryBuilders.matchAllQuery())
            .addFacet(FacetBuilders.termsFacet("facetYear").field("year"))
            .addFacet(FacetBuilders.termsFacet("facetMonth").field("month"))
            .addFacet(FacetBuilders.termsFacet("facetDay").field("day"))
            .execute()
            .actionGet();

    TermsFacet facetYear = (TermsFacet) response.facets().facetsAsMap().get("facetYear");
    TermsFacet facetMonth = (TermsFacet) response.facets().facetsAsMap().get("facetMonth");
    TermsFacet facetDay = (TermsFacet) response.facets().facetsAsMap().get("facetDay");

    log.debug("{}", response);

- The schema used for the facet search test is as follows.

    "mappings" : {
        "post" : {
            "properties" : {
                "docid" : { "type" : "string", "store" : "yes", "index" : "not_analyzed", "include_in_all" : false },
                "title" : { "type" : "string", "store" : "yes", "index" : "analyzed", "term_vector" : "yes", "analyzer" : "gruter_analyzer", "include_in_all" : false },
                "year" : { "type" : "integer", "store" : "no", "index" : "not_analyzed", "include_in_all" : false },
                "month" : { "type" : "integer", "store" : "no", "index" : "not_analyzed", "include_in_all" : false },
                "day" : { "type" : "integer", "store" : "no", "index" : "not_analyzed", "include_in_all" : false }
            }
        }
    }

:

[algorithm] Testing elasticsearch multimatchquery options: .maxExpansions(..)

Elastic/Elasticsearch 2013. 4. 10. 17:40

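While checking how the maxExpansions() option behaves after setting it, I got curious about the principle underneath.

Looking into it, I found that the fuzzy matching involved is based on an algorithm called Levenshtein distance (max_expansions itself limits how many terms a fuzzy or prefix match may expand to).

See below for an explanation.

(I thought it was only used for similarity, but it turns up in multiMatchQuery too.)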

Source : http://progh2.tistory.com/195

URL: http://www.merriampark.com/ld.htm

Levenshtein Distance, implemented in three commonly used languages

Author: Michael Gilleland, Merriam Park Software

The purpose of this short essay is to describe the Levenshtein distance algorithm and to show how it can be implemented in three different programming languages.

What is Levenshtein Distance?

Demo

The Algorithm

Source Code in Three Languages

References

Other Implementations

What is Levenshtein Distance?

Levenshtein distance (LD) is a measure of how similar two strings are. Here we call the source string (s) and the target string (t). The distance is the number of deletions, insertions, or substitutions needed to transform s into t. For example,

* If s is "test" and t is "test", then LD(s,t) = 0, because the strings are already identical and no transformation is needed.

* If s is "test" and t is "tent", then LD(s,t) = 1, because one substitution (changing "s" to "n") is enough to turn s into t.

The greater the Levenshtein distance, the more different the strings are.

Levenshtein distance is named after the Russian scientist Vladimir Levenshtein, who devised it in 1965.

Because the word Levenshtein is hard to spell and pronounce, it is also often called edit distance.

The Levenshtein distance algorithm is used in areas such as:

* Spell checking

* Speech recognition

* DNA analysis

* Plagiarism detection

Demo

(The original page embeds a small Java applet here that computes the Levenshtein distance between a source string and a target string.)

The Algorithm

Steps

Step	Description
1	Set n to the length of s. Set m to the length of t. If n = 0, return m and exit. If m = 0, return n and exit. Construct a matrix containing 0..m rows and 0..n columns.
2	Initialize the first row to 0..n. Initialize the first column to 0..m.
3	Examine each character of s (i from 1 to n).
4	Examine each character of t (j from 1 to m).
5	If s[i] equals t[j], the cost is 0. If s[i] does not equal t[j], the cost is 1.
6	Set cell d[i,j] of the matrix to the minimum of: a. the cell immediately above plus 1: d[i-1,j] + 1; b. the cell immediately to the left plus 1: d[i,j-1] + 1; c. the cell diagonally above and to the left plus the cost: d[i-1,j-1] + cost.
7	After the iteration steps (3, 4, 5, 6) are complete, the distance is found in cell d[n,m].

Example

This section shows how the Levenshtein distance is computed when the source string is "GUMBO" and the target string is "GAMBOL".

(The original page shows the matrix after steps 1 and 2, after steps 3 to 6 for each of i = 1 to 5, and at step 7.)

Step 7

The distance is the value in the lower right-hand corner of the matrix (here, 2).

This agrees with the intuitive result that "GUMBO" becomes "GAMBOL" by changing "U" to "A" and adding "L"

(1 substitution and 1 insertion = 2 changes).
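The seven steps above can be sketched in Java as follows. The original article's source code is omitted from this post, so this is my own minimal version rather than the author's listing, and LevenshteinSketch is a hypothetical class name. I index the matrix as d[i][j] with i running over s and j over t, so the result ends up in d[n][m] as in step 7.

```java
// A minimal sketch of the Levenshtein distance steps; not the original article's code.
class LevenshteinSketch {

    static int distance(String s, String t) {
        // Step 1: lengths and trivial cases.
        int n = s.length();
        int m = t.length();
        if (n == 0) {
            return m;
        }
        if (m == 0) {
            return n;
        }
        int[][] d = new int[n + 1][m + 1];

        // Step 2: initialize the first row and the first column.
        for (int i = 0; i <= n; i++) {
            d[i][0] = i;
        }
        for (int j = 0; j <= m; j++) {
            d[0][j] = j;
        }

        // Steps 3 and 4: examine each character of s and of t.
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                // Step 5: cost is 0 when the characters match, otherwise 1.
                int cost = (s.charAt(i - 1) == t.charAt(j - 1)) ? 0 : 1;
                // Step 6: minimum of deletion, insertion, and substitution.
                d[i][j] = Math.min(
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + cost);
            }
        }

        // Step 7: the distance is in the last cell.
        return d[n][m];
    }

    public static void main(String[] args) {
        System.out.println(distance("test", "test"));    // prints 0
        System.out.println(distance("test", "tent"));    // prints 1
        System.out.println(distance("GUMBO", "GAMBOL")); // prints 2
    }
}
```

The full matrix costs (n+1) x (m+1) integers; keeping only two rows would be enough, but the full matrix is closer to the step-by-step description.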

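Source Code in Three Languages

Religious wars sometimes break out among engineers who discuss the differences between programming languages.

A typical example is Allen Holub's assertion in a JavaWorld article (July 1999):

"Visual Basic, for example, cannot be called object-oriented at all. Neither Microsoft Foundation Classes (MFC) nor most of the other Microsoft technology can claim to be object-oriented."

A rebuttal from the other camp came in Simson Garfinkel's article in Salon (Jan. 8, 2001). It is known by the title "Java: Slow, ugly and irrelevant", which, put plainly, amounts to "I hate Java".

We decided to take a neutral and cautious position in these religious wars. As a practical matter, if a problem can be solved in one programming language, it can usually be solved in another as well. A good programmer can move from one language to another with relative ease and without great difficulty, even when that means learning a completely new language. A programming language is a means to an end, not an end in itself.

From this middle position, we implemented the Levenshtein distance algorithm in the programming languages below and present the source code.

* Java

* C++

* Visual Basic

Source code (omitted)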

References

Other discussions of Levenshtein distance can be found at the following links.

* http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Dynamic/Edit.html (Lloyd Allison)

* http://www.cut-the-knot.com/do_you_know/Strings.html (Alex Bogomolny)

* http://www-igm.univ-mlv.fr/~lecroq/seqcomp/node2.html (Thierry Lecroq)

Other Implementations

The people below kindly agreed to make their implementations of the Levenshtein distance algorithm in various languages available here.

* Eli Bendersky wrote an implementation in Perl.

* Barbara Boehmer wrote an implementation in Oracle PL/SQL.

* Rick Bourner wrote an implementation in Objective-C.

* Joseph Gama wrote an implementation in TSQL, as part of a package of TSQL functions at Planet Source Code.

* Anders Sewerin Johansen wrote an implementation in C++ that is better optimized, more polished, and closer to the spirit of C++ than mine.

* Lasse Johansen wrote an implementation in C#.

* Alvaro Jeria Madariaga wrote an implementation in Delphi.

* Lorenzo Seidenari wrote an implementation in C.

* Steve Southwell wrote an implementation in Progress 4gl.

Other implementations outside this page:

* An Emacs Lisp implementation by Art Taylor.

* A Python implementation by Magnus Lie Hetland.

* A Tcl implementation by Richard Suchenwirth (thanks to Stefan Seidler for pointing out a broken link).

:

[elasticsearch] Java API : Delete

Elastic/Elasticsearch 2013. 4. 10. 10:04

[elasticsearch java api review]

Original link : http://www.elasticsearch.org/guide/reference/java-api/delete/

- This API deletes a JSON document from an index based on its id (_id).

- The Delete API is used much like the Get API, so this is only a brief summary.

Below are the two example snippets from the original documentation.

The .setOperationThreaded(false) option is explained in an earlier post.

- http://jjeong.tistory.com/795

[Basic]

    DeleteResponse response = client.prepareDelete("twitter", "tweet", "1")
            .execute()
            .actionGet();

- index name : twitter

- index type : tweet

- doc id (_id) : 1

[Threading Model setting]

    DeleteResponse response = client.prepareDelete("twitter", "tweet", "1")
            .setOperationThreaded(false)
            .execute()
            .actionGet();

:

[elasticsearch] bulkRequest setXXX options.

Elastic/Elasticsearch 2013. 4. 9. 16:42

[ReplicationType.java]

    /**
     * The type of replication to perform.
     */
    public enum ReplicationType {
        /**
         * Sync replication, wait till all replicas have performed the operation.
         */
        SYNC((byte) 0),
        /**
         * Async replication. Will send the request to replicas, but will not wait for it
         */
        ASYNC((byte) 1),
        /**
         * Use the default replication type configured for this node.
         */
        DEFAULT((byte) 2);

[WriteConsistencyLevel.java]

    /**
     * Write Consistency Level control how many replicas should be active for a write operation to occur (a write operation
     * can be index, or delete).
     *
     */
    public enum WriteConsistencyLevel {
        DEFAULT((byte) 0),
        ONE((byte) 1),
        QUORUM((byte) 2),
        ALL((byte) 3);

I looked these up because they are options used with bulkRequest.
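As a rough mental model of what the consistency levels mean, the Java sketch below computes how many shard copies (primary plus replicas) have to be active before a write goes ahead. This is my own reconstruction from memory of how elasticsearch of this era behaved, not code taken from it: the class, the enum, and the method are hypothetical names, treating DEFAULT as QUORUM is an assumption, and so is the rule that quorum only kicks in when there are more than two copies.

```java
// Hypothetical sketch of the write consistency check; not the elasticsearch implementation.
class WriteConsistencySketch {

    enum Level { DEFAULT, ONE, QUORUM, ALL }

    // totalCopies = 1 primary + number_of_replicas for the shard being written to.
    static int requiredActiveCopies(Level level, int totalCopies) {
        // Assumption: DEFAULT resolves to QUORUM unless the node is configured otherwise.
        Level effective = (level == Level.DEFAULT) ? Level.QUORUM : level;
        if (effective == Level.ALL) {
            return totalCopies;
        }
        // Assumption: with 1 or 2 copies a quorum is just the primary.
        if (effective == Level.QUORUM && totalCopies > 2) {
            return totalCopies / 2 + 1;
        }
        return 1;
    }

    public static void main(String[] args) {
        for (Level level : Level.values()) {
            System.out.println(level + " with 3 copies -> " + requiredActiveCopies(level, 3));
        }
    }
}
```

With number_of_replicas set to 2 (three copies), QUORUM would need two active copies and ALL would need all three; please verify against the source of the version you run before relying on this.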

:

[elasticsearch] Java API : Get

Elastic/Elasticsearch 2013. 4. 9. 12:25

[elasticsearch java api review]

Original link : http://www.elasticsearch.org/guide/reference/java-api/get/

- This API retrieves a JSON document from an index based on its id (_id).

The example is tested against the data created in the post below.

http://jjeong.tistory.com/792

    GetResponse response = client.prepareGet("facebook", "post", "2")
            .execute()
            .actionGet();

    log.debug("{}", response.getId());
    log.debug("{}", response.getSource());

    response = client.prepareGet("facebook", "post", "1")
            .setOperationThreaded(false)
            .execute()
            .actionGet();

    log.debug("{}", response.getId());
    log.debug("{}", response.getSource());

- As you can see, it is very simple.

- As a REST URL it takes the form http://localhost:9200/facebook/post/2.

See the link below for more detail.

http://www.elasticsearch.org/guide/reference/api/get/

[.setOperationThreaded option]

- The default value of this option is true.

- It selects the threading model for the operation.

- It is hard to explain in words, so let's look at some source code first.

[SearchOperationThreading.java]

    /**
     * No threads are used, all the local shards operations will be performed on the calling
     * thread.
     */
    NO_THREADS((byte) 0),
    /**
     * The local shards operations will be performed in serial manner on a single forked thread.
     */
    SINGLE_THREAD((byte) 1),
    /**
     * Each local shard operation will execute on its own thread.
     */
    THREAD_PER_SHARD((byte) 2);

[ThreadingModel.java]

    NONE((byte) 0),
    OPERATION((byte) 1),
    LISTENER((byte) 2),
    OPERATION_LISTENER((byte) 3);

    /**
     * <tt>true</tt> if the actual operation the action represents will be executed
     * on a different thread than the calling thread (assuming it will be executed
     * on the same node).
     */
    public boolean threadedOperation() {
        return this == OPERATION || this == OPERATION_LISTENER;
    }

    /**
     * <tt>true</tt> if the invocation of the action result listener will be executed
     * on a different thread (than the calling thread or an "expensive" thread, like the
     * IO thread).
     */
    public boolean threadedListener() {
        return this == LISTENER || this == OPERATION_LISTENER;
    }

- The comments should make it clear.

- In short: with false the calling thread runs the operation itself, and with true the operation is forked onto another thread.

※ Judging from what has been posted in the community, the recommended usage seems to be as follows.

- read operation : setOperationThreaded(true) // the default, so it does not have to be set.

- insert/update/delete operation : setOperationThreaded(false)
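If you want to play with the two predicates from ThreadingModel.java without pulling in elasticsearch, they can be restated as a standalone enum. This is only a sketch: ThreadingModelSketch and Model are hypothetical names of mine, the byte ids of the original enum are left out, and nothing here is wired to setOperationThreaded.

```java
// Standalone restatement of the two predicates quoted from ThreadingModel.java.
// Hypothetical class; the byte ids of the original enum are omitted.
class ThreadingModelSketch {

    enum Model {
        NONE, OPERATION, LISTENER, OPERATION_LISTENER;

        // true when the operation itself runs on a thread other than the calling one.
        boolean threadedOperation() {
            return this == OPERATION || this == OPERATION_LISTENER;
        }

        // true when the result listener is invoked on a different thread.
        boolean threadedListener() {
            return this == LISTENER || this == OPERATION_LISTENER;
        }
    }

    public static void main(String[] args) {
        for (Model model : Model.values()) {
            System.out.println(model
                    + " threadedOperation=" + model.threadedOperation()
                    + " threadedListener=" + model.threadedListener());
        }
    }
}
```

Printing all four values side by side shows that OPERATION_LISTENER is simply the combination of the other two flags.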

:

[elasticsearch] es executable.

Elastic/Elasticsearch 2013. 4. 9. 10:37

I just looked at this out of curiosity.

[background]

ElasticSearch.java extends Bootstrap.java and runs Bootstrap.main().

[foreground]

When started with the -f option, ElasticSearchF.java calls System.setProperty("es.foreground", "yes"); and then runs Bootstrap.main().

:

[elasticsearch] Java API : Index

Elastic/Elasticsearch 2013. 4. 8. 18:42

[elasticsearch java api review]

Original link : http://www.elasticsearch.org/guide/reference/java-api/index_/

The original documentation describes several ways to generate a JSON document.

There are different ways of generating a JSON document:

- Manually (aka do it yourself) using native byte[] or as a String

- Using Map that will be automatically converted to its JSON equivalent

- Using a third party library to serialize your beans such as Jackson

- Using built-in helpers XContentFactory.jsonBuilder()

Of these, I test the last one, which uses the elasticsearch helper.

First, let's define a simple index and index type.

    curl -XPUT 'http://localhost:9200/facebook' -d '{
        "settings" : {
            "number_of_shards" : 5,
            "number_of_replicas" : 1
        },
        "mappings" : {
            "post" : {
                "properties" : {
                    "docid" : { "type" : "string", "store" : "yes", "index" : "not_analyzed" },
                    "title" : { "type" : "string", "store" : "yes", "index" : "analyzed", "term_vector" : "yes", "analyzer" : "standard" }
                }
            }
        }
    }'

- The index is created as facebook,

- and the index type is created as post.

- For details on settings and mappings, see the links below.

http://www.elasticsearch.org/guide/reference/api/admin-indices-create-index/

http://www.elasticsearch.org/guide/reference/index-modules/

http://www.elasticsearch.org/guide/reference/mapping/

http://www.elasticsearch.org/guide/reference/mapping/core-types/

Now that the index and index type exist, let's index some documents.

    // jsonBuilder() below is assumed to be statically imported:
    // import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;

    // Assume the documents to create are as follows
    // curl -XPUT 'http://localhost:9200/facebook/post/1' -d '{ "docid" : "henry", "title" : "This is the elasticsearch hadoop test." }'
    // curl -XPUT 'http://localhost:9200/facebook/post/2' -d '{ "docid" : "henry", "title" : "elasticsearch test." }'
    // curl -XPUT 'http://localhost:9200/facebook/post/3' -d '{ "docid" : "howook", "title" : "hadoop test." }'
    // curl -XPUT 'http://localhost:9200/facebook/post/4' -d '{ "docid" : "howook", "title" : "test." }'

    IndexRequestBuilder requestBuilder;
    IndexResponse response;

    requestBuilder = client.prepareIndex("facebook", "post");

    // pass the document to setSource as a JSON string
    requestBuilder.setId("1");
    requestBuilder.setSource("{ \"docid\" : \"henry\", \"title\" : \"This is the elasticsearch hadoop test.\" }");
    response = requestBuilder.execute().actionGet();

    // pass the document to setSource as an XContentBuilder
    XContentBuilder jsonBuilderDocument = jsonBuilder().startObject();
    jsonBuilderDocument.field("docid", "henry");
    jsonBuilderDocument.field("title", "elasticsearch test.");
    jsonBuilderDocument.endObject();

    requestBuilder.setId("2");
    requestBuilder.setSource(jsonBuilderDocument);
    response = requestBuilder.execute().actionGet();

- The setSource overloads in the IndexRequestBuilder source show which arguments it accepts.

- For the various options available when indexing a document, see the link below.

http://www.elasticsearch.org/guide/reference/api/index_/

Below is example code for the settings and mappings needed when creating an index.

It is only a taste, for reference.

    IndicesAdminClient indices = client.admin().indices();
    CreateIndexRequest indexRequest = new CreateIndexRequest("INDEX_NAME");

    indexRequest
            .settings(jsonBuilderIndexSetting)
            .mapping("INDEX_TYPE_NAME", jsonBuilderIndiceSetting);

    indices.create(indexRequest).actionGet();

- INDEX_NAME is the index being created

- INDEX_TYPE_NAME is the index type being created

- jsonBuilderIndexSetting and jsonBuilderIndiceSetting are XContentBuilder objects

:

jjeong
