Elastic/Elasticsearch 2017. 3. 9. 11:51
The way TransportClient (part of the Java API used with Elasticsearch 2.4) is created has changed in 5.x, so I am writing it up. The changes are described in detail on the official Elasticsearch site.
[References]
https://www.elastic.co/guide/en/elasticsearch/client/java-api/current/_maven_repository.html
https://www.elastic.co/guide/en/elasticsearch/client/java-api/current/transport-client.html
[Code changes]
2.x)
settings = settingsBuilder()
    .put("cluster.name", cluster)
    .put("client.transport.sniff", true)
    .put("network.tcp.blocking", false) // tcp non-blocking mode
    .put("client.transport.ping_timeout", "10s")
    .build();

5.x)
settings = builder()
    .put("cluster.name", cluster)
    .put("client.transport.sniff", true)
    .put("network.tcp.blocking", false) // tcp non-blocking mode
    .put("client.transport.ping_timeout", "10s")
    .build();
2.x) TransportClient client = TransportClient.builder().settings(settings).build();
5.x) TransportClient client = new PreBuiltTransportClient(settings);
One thing to watch out for (it is in the reference docs): the transport client has been split into its own artifact, so you need to add a separate dependency for it.
Maven dependency to add)
<dependency>
    <groupId>org.elasticsearch.client</groupId>
    <artifactId>transport</artifactId>
    <version>${elasticsearch.version}</version>
</dependency>
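Putting the pieces together, here is a minimal, self-contained sketch of the 5.x client setup. The cluster name, host, and port are illustrative; adjust them to your environment.

import java.net.InetAddress;

import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.transport.client.PreBuiltTransportClient;

public class TransportClientDemo {
    public static void main(String[] args) throws Exception {
        Settings settings = Settings.builder()
                .put("cluster.name", "my-cluster")            // illustrative cluster name
                .put("client.transport.sniff", true)
                .put("client.transport.ping_timeout", "10s")
                .build();

        // 5.x: the client is created through PreBuiltTransportClient
        TransportClient client = new PreBuiltTransportClient(settings)
                .addTransportAddress(new InetSocketTransportAddress(
                        InetAddress.getByName("127.0.0.1"), 9300));

        System.out.println(client.connectedNodes());
        client.close();
    }
}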
Nothing major, but I am writing this up in case it saves someone some fumbling around.
Elastic/Elasticsearch 2017. 2. 21. 13:23
If you take the settings you used on 2.x and run them unchanged on 5.x, there are a few errors you are likely to see. They are quickly resolved by consulting the breaking changes docs or the source code; I am recording them here mostly as a refresher.
[References]
https://www.elastic.co/guide/en/elasticsearch/reference/5.2/breaking-changes-5.2.html
https://www.elastic.co/guide/en/elasticsearch/reference/5.2/breaking-changes-5.1.html
https://www.elastic.co/guide/en/elasticsearch/reference/5.2/breaking-changes-5.0.html
[Errors encountered]
unknown setting [es.default.path.conf] please check that any required plugins are installed, or check the breaking changes documentation for removed settings
node settings must not contain any index level settings
unknown setting [action.disable_shutdown] please check that any required plugins are installed, or check the breaking changes documentation for removed settings
unknown setting [discovery.zen.ping.multicast.enabled] please check that any required plugins are installed, or check the breaking changes documentation for removed settings
unknown setting [resource.reload.interval] did you mean any of [resource.reload.interval.low, resource.reload.interval.high, resource.reload.interval.medium, resource.reload.enabled]?
unknown setting [script.indexed] did you mean any of [script.inline, script.ingest]?
node validation exception: bootstrap checks failed
memory locking requested for elasticsearch process but memory is not locked
max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]
[Fixes]
- es.default.path.conf was renamed to path.conf (and the -D prefix changed to -E).
- Remove any index-level settings from the node-level configuration.
- action.disable_shutdown appears to have been removed. (I have not verified this, but judging by the docs, the _shutdown API itself is gone...)
- Remove the multicast discovery settings as well.
- The resource.reload settings changed in name and usage, so delete or update them.
- Remove the script.indexed setting too. (It appears to have been replaced by stored scripts.)
- For the memory locking check, either run with root privileges or raise the limits in limits.conf; see the sketch below.
- The vm.max_map_count fix is well documented. ($ sudo sysctl -w vm.max_map_count=262144)
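The last two items are OS-level changes rather than elasticsearch.yml changes. A minimal sketch for Linux, assuming Elasticsearch runs as the elasticsearch user:

# /etc/security/limits.conf: allow the elasticsearch user to lock memory
# (needed when bootstrap.memory_lock: true is set)
elasticsearch soft memlock unlimited
elasticsearch hard memlock unlimited

# raise vm.max_map_count now, and persist it across reboots
$ sudo sysctl -w vm.max_map_count=262144
$ echo 'vm.max_map_count=262144' | sudo tee -a /etc/sysctl.conf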
Elastic/Elasticsearch 2017. 2. 21. 12:41
Sharing elasticsearch-analysis-arirang-5.2.1.
It is built against Lucene 6.4.1 and Elasticsearch 5.2.1.
elasticsearch-analysis-arirang-5.2.1.zip
Installation)
$ bin/elasticsearch-plugin install --verbose file:///services/apps/elasticsearch-analysis-arirang-5.2.1.zip
Elastic/Kibana 2017. 2. 9. 12:54
Writing this down again to make up for my memory.
Elasticsearch's cardinality aggregation is what Kibana exposes as Unique Count. You can set precision_threshold on it to control the accuracy of the count.
Reference)
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html
QueryDSL) {
"aggs" : {
"author_count" : {
"cardinality" : {
"field" : "author_hash",
"precision_threshold": 100
}
}
}
}
Kibana Script) { "precision_threshold": 40000 }
Use it as shown above. One caveat: a higher threshold trades memory for accuracy (the docs put memory usage at roughly precision_threshold × 8 bytes per bucket), so the aggregation is sensitive to CPU and memory usage, and you should also check your circuit breaker settings.
Elastic/Elasticsearch 2017. 1. 24. 11:47
This post covers generating recommendation data with the Elastic Stack and Apache Mahout. A recommendation data mart can be built with the Elastic Stack alone through cohort analysis, but to get better-quality recommendations we will bring in Apache Mahout.
To keep things approachable for everyone, the content here stays at the Hello World level.
[Elastic Stack] https://www.elastic.co/products
[Apache mahout] https://mahout.apache.org/
Both solutions are open source, and both ship with good example code, so anyone can put them to use easily.
Step 1) Use Elasticsearch + Logstash + Kibana to collect logs and produce the raw data to recommend from.
User item click log -> Logstash collect -> Elasticsearch store -> Kibana visualize -> CSV download
The fields extracted from the collected data are user id + item id + click count. Below is the Kibana QueryDSL example.
{
  "size": 0,
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "query": "cp:CLK AND id:[0 TO *]",
          "analyze_wildcard": true
        }
      },
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "time": {
                  "gte": 1485010800000,
                  "lte": 1485097199999,
                  "format": "epoch_millis"
                }
              }
            }
          ],
          "must_not": []
        }
      }
    }
  },
  "aggs": {
    "2": {
      "terms": {
        "field": "user_id",
        "size": 30000,
        "order": { "_count": "desc" }
      },
      "aggs": {
        "3": {
          "terms": {
            "field": "item_id",
            "size": 10,
            "order": { "_count": "desc" }
          }
        }
      }
    }
  }
}
Step 2) The Mahout recommender we will use is UserBasedRecommender. As the sample code shows, the dataset.csv file has the following format. - Creating a User-Based Recommender in 5 minutes
1,10,1.0
1,11,2.0
1,12,5.0
1,13,5.0
Format) userId,itemId,ratingValue
In Step 1 we produced user_id, item_id, click_count to match this format. Let's run UserBasedRecommender on that data.
Step 3) The sample code below is a good reference. https://github.com/apache/mahout/tree/master/examples/src/main/java/org/apache/mahout
Create a Main class and run the code shown in Step 2. I implemented my own runner around UserBasedRecommender; this part is easy to do yourself by referring to classes such as BookCrossingRecommender in the examples.
UserBasedRecommenderRunner runner = new UserBasedRecommenderRunner();
Recommender recommender = runner.buildRecommender();

// three recommended items for user 710039
List<RecommendedItem> recommendations = recommender.recommend(710039, 3);

for (RecommendedItem recommendation : recommendations) {
    LOG.debug("Recommended item : {}", recommendation);
}
[Execution log]
11:39:31.527 [main] INFO o.a.m.c.t.i.model.file.FileDataModel - Creating FileDataModel for file /git/prototype/data/user-to-item.csv
11:39:31.626 [main] INFO o.a.m.c.t.i.model.file.FileDataModel - Reading file info...
11:39:31.765 [main] INFO o.a.m.c.t.i.model.file.FileDataModel - Read lines: 63675
11:39:31.896 [main] INFO o.a.m.c.t.i.model.GenericDataModel - Processed 10000 users
11:39:31.911 [main] INFO o.a.m.c.t.i.model.GenericDataModel - Processed 19124 users
11:39:31.949 [main] DEBUG o.a.m.c.t.i.r.GenericUserBasedRecommender - Recommending items for user ID '710039'
11:39:31.965 [main] DEBUG o.a.m.c.t.i.r.GenericUserBasedRecommender - Recommendations are: [RecommendedItem[item:35222, value:4.0], RecommendedItem[item:12260, value:4.0], RecommendedItem[item:12223, value:1.5]]
11:39:31.966 [main] DEBUG o.h.p.mahout.meme.MemeProductRunner - Recommended item : RecommendedItem[item:35222, value:4.0]
11:39:31.966 [main] DEBUG o.h.p.mahout.meme.MemeProductRunner - Recommended item : RecommendedItem[item:12260, value:4.0]
11:39:31.967 [main] DEBUG o.h.p.mahout.meme.MemeProductRunner - Recommended item : RecommendedItem[item:12223, value:1.5]
[Recommender]
similarity = new PearsonCorrelationSimilarity(dataModel);

// recommend from the N nearest neighboring users
// UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, dataModel, 0.2);

// recommend from all users whose similarity exceeds a threshold; samplingRate: user sampling rate 10%
// UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.1, similarity, dataModel, 0.1);

UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.2, similarity, dataModel, 1.0);
recommender = new GenericUserBasedRecommender(dataModel, neighborhood, similarity);
- The dataset was quite small, so I used ThresholdUserNeighborhood. A self-contained version of the runner is sketched below.
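For reference, a minimal sketch that combines the pieces above into one runnable class. The CSV path is illustrative, and this simply inlines what my UserBasedRecommenderRunner does.

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.ThresholdUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class UserBasedRecommenderDemo {
    public static void main(String[] args) throws Exception {
        // userId,itemId,rating CSV produced in Step 1 (path is illustrative)
        DataModel dataModel = new FileDataModel(new File("data/user-to-item.csv"));

        UserSimilarity similarity = new PearsonCorrelationSimilarity(dataModel);
        UserNeighborhood neighborhood =
                new ThresholdUserNeighborhood(0.2, similarity, dataModel, 1.0);
        Recommender recommender =
                new GenericUserBasedRecommender(dataModel, neighborhood, similarity);

        // three recommended items for user 710039
        List<RecommendedItem> recommendations = recommender.recommend(710039, 3);
        for (RecommendedItem recommendation : recommendations) {
            System.out.println("Recommended item : " + recommendation);
        }
    }
}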
That is a very simple way to run CF over search click logs and produce recommendation data. You can also evaluate the recommendations you generate; again, refer to the xxxxxxEvaluator classes in the examples and implement one yourself.
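As a starting point, a hedged sketch using one of those evaluators, AverageAbsoluteDifferenceRecommenderEvaluator; the 70%/100% split parameters and the CSV path are illustrative.

import java.io.File;

import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.ThresholdUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderEvaluationDemo {
    public static void main(String[] args) throws Exception {
        DataModel dataModel = new FileDataModel(new File("data/user-to-item.csv"));

        // rebuild the same recommender for each training split
        RecommenderBuilder builder = trainingModel -> {
            UserSimilarity similarity = new PearsonCorrelationSimilarity(trainingModel);
            UserNeighborhood neighborhood =
                    new ThresholdUserNeighborhood(0.2, similarity, trainingModel, 1.0);
            return new GenericUserBasedRecommender(trainingModel, neighborhood, similarity);
        };

        // train on 70% of each user's preferences, evaluate on the rest, over all users
        RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
        double score = evaluator.evaluate(builder, null, dataModel, 0.7, 1.0);
        System.out.println("Average absolute difference: " + score);
    }
}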
ITWeb/검색일반 2017. 1. 19. 18:41
It seems I forget even the very basics, so I am writing this down.
Multi-value fields and the inverted index The fact that all field types support multi-value fields out of the box is a consequence of the origins of Lucene. Lucene was designed to be a full text search engine. In order to be able to search for individual words within a big block of text, Lucene tokenizes the text into individual terms, and adds each term to the inverted index separately. This means that even a simple text field must be able to support multiple values by default. When other datatypes were added, such as numbers and dates, they used the same data structure as strings, and so got multi-values for free.
The passage above is quoted from the Elasticsearch reference below.
[Reference] https://www.elastic.co/guide/en/elasticsearch/reference/2.4/array.html
Elastic/Elasticsearch 2017. 1. 2. 12:44
I have no idea why I keep forgetting this, but I am recording it to help my memory recover.
Range queries take the from, to, gt, gte, lt, and lte parameters. Looking at the RangeQueryBuilder.java source, the fields are defined as follows.
private final String name;
private Object from;
private Object to;
private String timeZone;
private boolean includeLower = true;
private boolean includeUpper = true;
private float boost = -1;
private String queryName;
private String format;
Lower and upper bounds are included by default, so from and to include their values. This works the same way as MySQL's BETWEEN min AND max, which also includes both min and max.
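A small sketch that makes the difference visible on the Java API; the field name "price" is illustrative, and printing a builder emits its JSON form.

import org.elasticsearch.index.query.QueryBuilders;

public class RangeQueryDemo {
    public static void main(String[] args) {
        // from/to leave includeLower/includeUpper at their defaults (true): 10 <= price <= 20
        System.out.println(QueryBuilders.rangeQuery("price").from(10).to(20));

        // gt/lt flip the include flags for exclusive bounds: 10 < price < 20
        System.out.println(QueryBuilders.rangeQuery("price").gt(10).lt(20));
    }
}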
Elastic/Elasticsearch 2016. 12. 6. 12:18
Just posting this because I had a reason to look it up — nodes and indices only.
[Reference] https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-nodes.html
id | id,nodeId | unique node id
pid | p | process id
host | h | host name
ip | i | ip address
port | po | bound transport port
version | v | es version
build | b | es build hash
jdk | j | jdk version
disk.avail | d,disk,diskAvail | available disk space
heap.current | hc,heapCurrent | used heap
heap.percent | hp,heapPercent | used heap ratio
heap.max | hm,heapMax | max configured heap
ram.current | rc,ramCurrent | used machine memory
ram.percent | rp,ramPercent | used machine memory ratio
ram.max | rm,ramMax | total machine memory
file_desc.current | fdc,fileDescriptorCurrent | used file descriptors
file_desc.percent | fdp,fileDescriptorPercent | used file descriptor ratio
file_desc.max | fdm,fileDescriptorMax | max file descriptors
cpu | cpu | recent cpu usage
load | l | most recent load avg
uptime | u | node uptime
node.role | r,role,dc,nodeRole | d:data node, c:client node
master | m | m:master-eligible, *:current master
name | n | node name
completion.size | cs,completionSize | size of completion
fielddata.memory_size | fm,fielddataMemory | used fielddata cache
fielddata.evictions | fe,fielddataEvictions | fielddata evictions
query_cache.memory_size | qcm,queryCacheMemory | used query cache
query_cache.evictions | qce,queryCacheEvictions | query cache evictions
request_cache.memory_size | rcm,requestCacheMemory | used request cache
request_cache.evictions | rce,requestCacheEvictions | request cache evictions
request_cache.hit_count | rchc,requestCacheHitCount | request cache hit counts
request_cache.miss_count | rcmc,requestCacheMissCount | request cache miss counts
flush.total | ft,flushTotal | number of flushes
flush.total_time | ftt,flushTotalTime | time spent in flush
get.current | gc,getCurrent | number of current get ops
get.time | gti,getTime | time spent in get
get.total | gto,getTotal | number of get ops
get.exists_time | geti,getExistsTime | time spent in successful gets
get.exists_total | geto,getExistsTotal | number of successful gets
get.missing_time | gmti,getMissingTime | time spent in failed gets
get.missing_total | gmto,getMissingTotal | number of failed gets
indexing.delete_current | idc,indexingDeleteCurrent | number of current deletions
indexing.delete_time | idti,indexingDeleteTime | time spent in deletions
indexing.delete_total | idto,indexingDeleteTotal | number of delete ops
indexing.index_current | iic,indexingIndexCurrent | number of current indexing ops
indexing.index_time | iiti,indexingIndexTime | time spent in indexing
indexing.index_total | iito,indexingIndexTotal | number of indexing ops
indexing.index_failed | iif,indexingIndexFailed | number of failed indexing ops
merges.current | mc,mergesCurrent | number of current merges
merges.current_docs | mcd,mergesCurrentDocs | number of current merging docs
merges.current_size | mcs,mergesCurrentSize | size of current merges
merges.total | mt,mergesTotal | number of completed merge ops
merges.total_docs | mtd,mergesTotalDocs | docs merged
merges.total_size | mts,mergesTotalSize | size merged
merges.total_time | mtt,mergesTotalTime | time spent in merges
percolate.current | pc,percolateCurrent | number of current percolations
percolate.memory_size | pm,percolateMemory | memory used by percolations
percolate.queries | pq,percolateQueries | number of registered percolation queries
percolate.time | pti,percolateTime | time spent percolating
percolate.total | pto,percolateTotal | total percolations
refresh.total | rto,refreshTotal | total refreshes
refresh.time | rti,refreshTime | time spent in refreshes
script.compilations | scrcc,scriptCompilations | script compilations
script.cache_evictions | scrce,scriptCacheEvictions | script cache evictions
search.fetch_current | sfc,searchFetchCurrent | current fetch phase ops
search.fetch_time | sfti,searchFetchTime | time spent in fetch phase
search.fetch_total | sfto,searchFetchTotal | total fetch ops
search.open_contexts | so,searchOpenContexts | open search contexts
search.query_current | sqc,searchQueryCurrent | current query phase ops
search.query_time | sqti,searchQueryTime | time spent in query phase
search.query_total | sqto,searchQueryTotal | total query phase ops
search.scroll_current | scc,searchScrollCurrent | open scroll contexts
search.scroll_time | scti,searchScrollTime | time scroll contexts held open
search.scroll_total | scto,searchScrollTotal | completed scroll contexts
segments.count | sc,segmentsCount | number of segments
segments.memory | sm,segmentsMemory | memory used by segments
segments.index_writer_memory | siwm,segmentsIndexWriterMemory | memory used by index writer
segments.index_writer_max_memory | siwmx,segmentsIndexWriterMaxMemory | maximum memory index writer may use before it must write buffered documents to a new segment
segments.version_map_memory | svmm,segmentsVersionMapMemory | memory used by version map
segments.fixed_bitset_memory | sfbm,fixedBitsetMemory | memory used by fixed bit sets for nested object field types and type filters for types referred in _parent fields
suggest.current | suc,suggestCurrent | number of current suggest ops
suggest.time | suti,suggestTime | time spend in suggest
suggest.total | suto,suggestTotal | number of suggest ops
[Reference] https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-indices.html
health | h | current health status
status | s | open/close status
index | i,idx | index name
pri | p,shards.primary,shardsPrimary | number of primary shards
rep | r,shards.replica,shardsReplica | number of replica shards
docs.count | dc,docsCount | available docs
docs.deleted | dd,docsDeleted | deleted docs
creation.date | cd | index creation date (millisecond value)
creation.date.string | cds | index creation date (as string)
store.size | ss,storeSize | store size of primaries & replicas
pri.store.size | | store size of primaries
completion.size | cs,completionSize | size of completion
pri.completion.size | | size of completion
fielddata.memory_size | fm,fielddataMemory | used fielddata cache
pri.fielddata.memory_size | | used fielddata cache
fielddata.evictions | fe,fielddataEvictions | fielddata evictions
pri.fielddata.evictions | | fielddata evictions
query_cache.memory_size | qcm,queryCacheMemory | used query cache
pri.query_cache.memory_size | | used query cache
query_cache.evictions | qce,queryCacheEvictions | query cache evictions
pri.query_cache.evictions | | query cache evictions
request_cache.memory_size | rcm,requestCacheMemory | used request cache
pri.request_cache.memory_size | | used request cache
request_cache.evictions | rce,requestCacheEvictions | request cache evictions
pri.request_cache.evictions | | request cache evictions
request_cache.hit_count | rchc,requestCacheHitCount | request cache hit count
pri.request_cache.hit_count | | request cache hit count
request_cache.miss_count | rcmc,requestCacheMissCount | request cache miss count
pri.request_cache.miss_count | | request cache miss count
flush.total | ft,flushTotal | number of flushes
pri.flush.total | | number of flushes
flush.total_time | ftt,flushTotalTime | time spent in flush
pri.flush.total_time | | time spent in flush
get.current | gc,getCurrent | number of current get ops
pri.get.current | | number of current get ops
get.time | gti,getTime | time spent in get
pri.get.time | | time spent in get
get.total | gto,getTotal | number of get ops
pri.get.total | | number of get ops
get.exists_time | geti,getExistsTime | time spent in successful gets
pri.get.exists_time | | time spent in successful gets
get.exists_total | geto,getExistsTotal | number of successful gets
pri.get.exists_total | | number of successful gets
get.missing_time | gmti,getMissingTime | time spent in failed gets
pri.get.missing_time | | time spent in failed gets
get.missing_total | gmto,getMissingTotal | number of failed gets
pri.get.missing_total | | number of failed gets
indexing.delete_current | idc,indexingDeleteCurrent | number of current deletions
pri.indexing.delete_current | | number of current deletions
indexing.delete_time | idti,indexingDeleteTime | time spent in deletions
pri.indexing.delete_time | | time spent in deletions
indexing.delete_total | idto,indexingDeleteTotal | number of delete ops
pri.indexing.delete_total | | number of delete ops
indexing.index_current | iic,indexingIndexCurrent | number of current indexing ops
pri.indexing.index_current | | number of current indexing ops
indexing.index_time | iiti,indexingIndexTime | time spent in indexing
pri.indexing.index_time | | time spent in indexing
indexing.index_total | iito,indexingIndexTotal | number of indexing ops
pri.indexing.index_total | | number of indexing ops
indexing.index_failed | iif,indexingIndexFailed | number of failed indexing ops
pri.indexing.index_failed | | number of failed indexing ops
merges.current | mc,mergesCurrent | number of current merges
pri.merges.current | | number of current merges
merges.current_docs | mcd,mergesCurrentDocs | number of current merging docs
pri.merges.current_docs | | number of current merging docs
merges.current_size | mcs,mergesCurrentSize | size of current merges
pri.merges.current_size | | size of current merges
merges.total | mt,mergesTotal | number of completed merge ops
pri.merges.total | | number of completed merge ops
merges.total_docs | mtd,mergesTotalDocs | docs merged
pri.merges.total_docs | | docs merged
merges.total_size | mts,mergesTotalSize | size merged
pri.merges.total_size | | size merged
merges.total_time | mtt,mergesTotalTime | time spent in merges
pri.merges.total_time | | time spent in merges
percolate.current | pc,percolateCurrent | number of current percolations
pri.percolate.current | | number of current percolations
percolate.memory_size | pm,percolateMemory | memory used by percolations
pri.percolate.memory_size | | memory used by percolations
percolate.queries | pq,percolateQueries | number of registered percolation queries
pri.percolate.queries | | number of registered percolation queries
percolate.time | pti,percolateTime | time spent percolating
pri.percolate.time | | time spent percolating
percolate.total | pto,percolateTotal | total percolations
pri.percolate.total | | total percolations
refresh.total | rto,refreshTotal | total refreshes
pri.refresh.total | | total refreshes
refresh.time | rti,refreshTime | time spent in refreshes
pri.refresh.time | | time spent in refreshes
search.fetch_current | sfc,searchFetchCurrent | current fetch phase ops
pri.search.fetch_current | | current fetch phase ops
search.fetch_time | sfti,searchFetchTime | time spent in fetch phase
pri.search.fetch_time | | time spent in fetch phase
search.fetch_total | sfto,searchFetchTotal | total fetch ops
pri.search.fetch_total | | total fetch ops
search.open_contexts | so,searchOpenContexts | open search contexts
pri.search.open_contexts | | open search contexts
search.query_current | sqc,searchQueryCurrent | current query phase ops
pri.search.query_current | | current query phase ops
search.query_time | sqti,searchQueryTime | time spent in query phase
pri.search.query_time | | time spent in query phase
search.query_total | sqto,searchQueryTotal | total query phase ops
pri.search.query_total | | total query phase ops
search.scroll_current | scc,searchScrollCurrent | open scroll contexts
pri.search.scroll_current | | open scroll contexts
search.scroll_time | scti,searchScrollTime | time scroll contexts held open
pri.search.scroll_time | | time scroll contexts held open
search.scroll_total | scto,searchScrollTotal | completed scroll contexts
pri.search.scroll_total | | completed scroll contexts
segments.count | sc,segmentsCount | number of segments
pri.segments.count | | number of segments
segments.memory | sm,segmentsMemory | memory used by segments
pri.segments.memory | | memory used by segments
segments.index_writer_memory | siwm,segmentsIndexWriterMemory | memory used by index writer
pri.segments.index_writer_memory | | memory used by index writer
segments.index_writer_max_memory | siwmx,segmentsIndexWriterMaxMemory | maximum memory index writer may use before it must write buffered documents to a new segment
pri.segments.index_writer_max_memory | | maximum memory index writer may use before it must write buffered documents to a new segment
segments.version_map_memory | svmm,segmentsVersionMapMemory | memory used by version map
pri.segments.version_map_memory | | memory used by version map
segments.fixed_bitset_memory | sfbm,fixedBitsetMemory | memory used by fixed bit sets for nested object field types and type filters for types referred in _parent fields
pri.segments.fixed_bitset_memory | | memory used by fixed bit sets for nested object field types and type filters for types referred in _parent fields
warmer.current | wc,warmerCurrent | current warmer ops
pri.warmer.current | | current warmer ops
warmer.total | wto,warmerTotal | total warmer ops
pri.warmer.total | | total warmer ops
warmer.total_time | wtt,warmerTotalTime | time spent in warmers
pri.warmer.total_time | | time spent in warmers
suggest.current | suc,suggestCurrent | number of current suggest ops
pri.suggest.current | | number of current suggest ops
suggest.time | suti,suggestTime | time spend in suggest
pri.suggest.time | | time spend in suggest
suggest.total | suto,suggestTotal | number of suggest ops
pri.suggest.total | | number of suggest ops
memory.total | tm,memoryTotal | total used memory
pri.memory.total | | total used memory
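For what it's worth, the names in the first two columns above can be passed to the cat APIs through the h parameter to pick columns (v adds the header row). The column selections below are just illustrative samples:

$ curl 'localhost:9200/_cat/nodes?v&h=name,heap.percent,ram.percent,cpu,load,node.role,master'
$ curl 'localhost:9200/_cat/indices?v&h=index,pri,rep,docs.count,store.size'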
Elastic/Elasticsearch 2016. 11. 25. 12:31
To upgrade our Elasticsearch cluster, the Korean morphological analyzer has to be upgraded first. To build the analyzer plugin you need to know and handle the following reasonably well:
- Elasticsearch
- Lucene
- Arirang
You can get the Arirang source and jar files through the links below.
Lately it seems mgkaki has been contributing alongside Soomyung. :)
Lucene & Arirang changes)
- The package structure changed between Lucene 6.1 and 6.2, and some classes changed with it.
- The pairmap-related bug in arirang was fixed. (It may have been fixed earlier. ^^;)
- CharacterUtils, provided by Lucene, was refactored.
- KoreanTokenizer in arirang must be updated to match the refactored CharacterUtils:

Remove CharacterUtils.getInstance()
CharacterUtils.codePointAt(...) to Character.codePointAt(...)

- If you download the arirang 6.2 source, these changes are already reflected.
- You also need to download arirang.morph 1.1.0.
Elasticsearch plugin changes)
The basic plugin structure changed quite a bit, so there are many modifications to make. Depending on how you look at it, it may not be much; I leave that judgment to each of you. ^^
- arirang.lucene-analyzer and arirang-morph must be updated.
- The AnalysisBinderProcessor previously used for binding is no longer used.
- Registration is now done through Plugin and AnalysisPlugin, as shown below.
public class AnalysisArirangPlugin extends Plugin implements AnalysisPlugin {
    @Override
    public Map<String, AnalysisProvider<TokenFilterFactory>> getTokenFilters() {
        return singletonMap("arirang_filter", ArirangTokenFilterFactory::new);
    }

    @Override
    public Map<String, AnalysisProvider<TokenizerFactory>> getTokenizers() {
        Map<String, AnalysisProvider<TokenizerFactory>> extra = new HashMap<>();
        extra.put("arirang_tokenizer", ArirangTokenizerFactory::new);
        return extra;
    }

    @Override
    public Map<String, AnalysisProvider<AnalyzerProvider<? extends Analyzer>>> getAnalyzers() {
        return singletonMap("arirang_analyzer", ArirangAnalyzerProvider::new);
    }
}
- The constructor arguments of AnalyzerProvider, TokenFilterFactory, and TokenizerFactory changed to: IndexSettings indexSettings, Environment env, String name, Settings settings (a sketch of the new signature follows below).
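For example, a minimal sketch of a token filter factory with the new constructor signature; the body is illustrative rather than the actual plugin code.

public class ArirangTokenFilterFactory extends AbstractTokenFilterFactory {
    // 5.x constructor signature: (IndexSettings, Environment, String, Settings)
    public ArirangTokenFilterFactory(IndexSettings indexSettings, Environment env,
                                     String name, Settings settings) {
        super(indexSettings, name, settings);
    }

    @Override
    public TokenStream create(TokenStream tokenStream) {
        return new KoreanFilter(tokenStream); // illustrative: wrap arirang's filter here
    }
}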
- The outputDirectory in plugin.xml used for assembly changed to elasticsearch.
- If outputDirectory is not set to elasticsearch, an error occurs.
After making these changes you can build and install the plugin. See my previous post: [Elasticsearch] Lucene Arirang Analyzer Plugin for Elasticsearch 5.0.1
※ While building the plugin I was briefly caught off guard by the differences between Lucene 6.1 and 6.2. I had naturally assumed there would be no package structure changes within the 6.x line, and that assumption was wrong.
I did expect Elasticsearch 5.x to have changed a lot, given the move from Lucene 5.x to 6.x. It took less time than I feared, but there was no documentation or reference material anywhere; reading the source remains the one unchanging truth. Now that I have written this, I am not sure whether it qualifies as a dev diary. ^^;
Source code) https://github.com/HowookJeong/elasticsearch-analysis-arirang
Elastic/Elasticsearch 2016. 11. 24. 19:02
First, here is the built plugin zip file. I will push the work to GitHub later. With all the projects and operations work these days, I barely managed to find time for this.
elasticsearch-analysis-arirang-5.0.1.zip
Installation)
$ bin/elasticsearch-plugin install --verbose file:///elasticsearch-analysis-arirang/target/elasticsearch-analysis-arirang-5.0.1.zip
Install log)
-> Downloading file:///elasticsearch-analysis-arirang-5.0.1.zip
Retrieving zip from file:///elasticsearch-analysis-arirang-5.0.1.zip
[=================================================] 100%
- Plugin information:
Name: analysis-arirang
Description: Arirang plugin
Version: 5.0.1
* Classname: org.elasticsearch.plugin.analysis.arirang.AnalysisArirangPlugin
-> Installed analysis-arirang
Elasticsearch startup log)
$ bin/elasticsearch
[2016-11-24T18:49:09,922][INFO ][o.e.n.Node ] [] initializing ...
[2016-11-24T18:49:10,083][INFO ][o.e.e.NodeEnvironment ] [aDGu2B9] using [1] data paths, mounts [[/ (/dev/disk1)]], net usable_space [733.1gb], net total_space [930.3gb], spins? [unknown], types [hfs]
[2016-11-24T18:49:10,084][INFO ][o.e.e.NodeEnvironment ] [aDGu2B9] heap size [1.9gb], compressed ordinary object pointers [true]
[2016-11-24T18:49:10,085][INFO ][o.e.n.Node ] [aDGu2B9] node name [aDGu2B9] derived from node ID; set [node.name] to override
[2016-11-24T18:49:10,087][INFO ][o.e.n.Node ] [aDGu2B9] version[5.0.1], pid[56878], build[080bb47/2016-11-11T22:08:49.812Z], OS[Mac OS X/10.12.1/x86_64], JVM[Oracle Corporation/Java HotSpot(TM) 64-Bit Server VM/1.8.0_72/25.72-b15]
[2016-11-24T18:49:11,335][INFO ][o.e.p.PluginsService ] [aDGu2B9] loaded module [aggs-matrix-stats]
[2016-11-24T18:49:11,335][INFO ][o.e.p.PluginsService ] [aDGu2B9] loaded module [ingest-common]
[2016-11-24T18:49:11,335][INFO ][o.e.p.PluginsService ] [aDGu2B9] loaded module [lang-expression]
[2016-11-24T18:49:11,335][INFO ][o.e.p.PluginsService ] [aDGu2B9] loaded module [lang-groovy]
[2016-11-24T18:49:11,335][INFO ][o.e.p.PluginsService ] [aDGu2B9] loaded module [lang-mustache]
[2016-11-24T18:49:11,336][INFO ][o.e.p.PluginsService ] [aDGu2B9] loaded module [lang-painless]
[2016-11-24T18:49:11,336][INFO ][o.e.p.PluginsService ] [aDGu2B9] loaded module [percolator]
[2016-11-24T18:49:11,336][INFO ][o.e.p.PluginsService ] [aDGu2B9] loaded module [reindex]
[2016-11-24T18:49:11,336][INFO ][o.e.p.PluginsService ] [aDGu2B9] loaded module [transport-netty3]
[2016-11-24T18:49:11,336][INFO ][o.e.p.PluginsService ] [aDGu2B9] loaded module [transport-netty4]
[2016-11-24T18:49:11,336][INFO ][o.e.p.PluginsService ] [aDGu2B9] loaded plugin [analysis-arirang]
[2016-11-24T18:49:14,151][INFO ][o.e.n.Node ] [aDGu2B9] initialized
[2016-11-24T18:49:14,151][INFO ][o.e.n.Node ] [aDGu2B9] starting ...
[2016-11-24T18:49:14,377][INFO ][o.e.t.TransportService ] [aDGu2B9] publish_address {127.0.0.1:9300}, bound_addresses {[fe80::1]:9300}, {[::1]:9300}, {127.0.0.1:9300}
[2016-11-24T18:49:17,511][INFO ][o.e.c.s.ClusterService ] [aDGu2B9] new_master {aDGu2B9}{aDGu2B9mQ8KkWCe3fnqeMw}{_y9RzyKGSvqYAFcv99HBXg}{127.0.0.1}{127.0.0.1:9300}, reason: zen-disco-elected-as-master ([0] nodes joined)
[2016-11-24T18:49:17,584][INFO ][o.e.g.GatewayService ] [aDGu2B9] recovered [0] indices into cluster_state
[2016-11-24T18:49:17,588][INFO ][o.e.h.HttpServer ] [aDGu2B9] publish_address {127.0.0.1:9200}, bound_addresses {[fe80::1]:9200}, {[::1]:9200}, {127.0.0.1:9200}
[2016-11-24T18:49:17,588][INFO ][o.e.n.Node ] [aDGu2B9] started
Korean morphological analysis request)
$ curl -X POST -H "Cache-Control: no-cache" -H "Postman-Token: 6d392d83-5816-71ad-556b-5cd6f92af634" -d '{
  "analyzer" : "arirang_analyzer",
  "text" : "[한국] 엘라스틱서치 사용자 그룹의 HENRY 입니다."
}' "http://localhost:9200/_analyze"
Analysis result)
{
  "tokens": [
    { "token": "[", "start_offset": 0, "end_offset": 1, "type": "symbol", "position": 0 },
    { "token": "한국", "start_offset": 1, "end_offset": 3, "type": "korean", "position": 1 },
    { "token": "]", "start_offset": 3, "end_offset": 4, "type": "symbol", "position": 2 },
    { "token": "엘라스틱서치", "start_offset": 5, "end_offset": 11, "type": "korean", "position": 3 },
    { "token": "엘라", "start_offset": 5, "end_offset": 7, "type": "korean", "position": 3 },
    { "token": "스틱", "start_offset": 7, "end_offset": 9, "type": "korean", "position": 4 },
    { "token": "서치", "start_offset": 9, "end_offset": 11, "type": "korean", "position": 5 },
    { "token": "사용자", "start_offset": 12, "end_offset": 15, "type": "korean", "position": 6 },
    { "token": "그룹", "start_offset": 16, "end_offset": 18, "type": "korean", "position": 7 },
    { "token": "henry", "start_offset": 20, "end_offset": 25, "type": "word", "position": 8 },
    { "token": "입니다", "start_offset": 26, "end_offset": 29, "type": "korean", "position": 9 }
  ]
}