420 posts tagged 'elasticsearch'

  1. 2017.03.09 [Elasticsearch] TransportClient on 5.x
  2. 2017.02.21 [Elasticsearch] elasticsearch.yml from 2.4.x to 5.2.x
  3. 2017.02.21 [Elasticsearch] elasticsearch-analysis-arirang-5.2.1
  4. 2017.02.09 [Kibana] Adding a threshold script when using Unique Count
  5. 2017.01.24 [Search Recommendation] Basic recommendations with Apache Mahout + Elastic Stack
  6. 2017.01.19 [Lucene] Multi-value fields and the inverted index
  7. 2017.01.02 [Elasticsearch] Whether Range Query From and To are inclusive
  8. 2016.12.06 [Elasticsearch] _cat nodes/indices help
  9. 2016.11.25 [Elasticsearch] elasticsearch-analysis-arirang 5.0.1 plugin development notes
  10. 2016.11.24 [Elasticsearch] Lucene Arirang Analyzer Plugin for Elasticsearch 5.0.1

[Elasticsearch] TransportClient on 5.x

Elastic/Elasticsearch 2017. 3. 9. 11:51

The way the TransportClient is created in the Java API changed between Elasticsearch 2.4 and 5.x, so I'm writing it down here.

The changes are described in detail in the official Elasticsearch documentation.


[References]

https://www.elastic.co/guide/en/elasticsearch/client/java-api/current/_maven_repository.html

https://www.elastic.co/guide/en/elasticsearch/client/java-api/current/transport-client.html


[Code changes]


2.x)

settings = settingsBuilder()

  .put("cluster.name", cluster)

  .put("client.transport.sniff", true)

  .put("network.tcp.blocking", false)           // tcp non-blocking mode

  .put("client.transport.ping_timeout", "10s")

  .build();


5.x)

settings = builder()

  .put("cluster.name", cluster)

  .put("client.transport.sniff", true)

  .put("network.tcp.blocking", false)           // tcp non-blocking mode

  .put("client.transport.ping_timeout", "10s")

  .build();


2.x)

TransportClient client = TransportClient.builder().settings(settings).build();


5.x)

TransportClient client = new PreBuiltTransportClient(settings);


One thing to note, though it is also in the reference docs: the transport client has been split out into its own artifact, so you need to add a separate dependency.


Adding the Maven dependency)

<dependency>

  <groupId>org.elasticsearch.client</groupId>

  <artifactId>transport</artifactId>

  <version>${elasticsearch.version}</version>

</dependency>
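
For completeness, here is a minimal end-to-end sketch of the 5.x setup. The cluster name, host, and port are placeholders, so adjust them to your environment.

import java.net.InetAddress;

import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.transport.client.PreBuiltTransportClient;

public class TransportClientExample {
  public static void main(String[] args) throws Exception {
    Settings settings = Settings.builder()
      .put("cluster.name", "my-cluster")            // placeholder cluster name
      .put("client.transport.sniff", true)
      .put("client.transport.ping_timeout", "10s")
      .build();

    // PreBuiltTransportClient comes from the transport artifact declared above.
    try (TransportClient client = new PreBuiltTransportClient(settings)
        .addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName("localhost"), 9300))) {
      // use the client here, e.g. client.admin().cluster().prepareHealth().get()
    }
  }
}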


Nothing special, but I wrote it up in case it saves someone some trial and error.

:

[Elasticsearch] elasticsearch.yml from 2.4.x to 5.2.x

Elastic/Elasticsearch 2017. 2. 21. 13:23

If you take the settings you were using on 2.x and start 5.x with them unchanged, you will run into a few errors.

They are quickly resolved by checking the breaking changes documentation or the source code.

I'm recording them here simply as a review.


[References]

https://www.elastic.co/guide/en/elasticsearch/reference/5.2/breaking-changes-5.2.html

https://www.elastic.co/guide/en/elasticsearch/reference/5.2/breaking-changes-5.1.html

https://www.elastic.co/guide/en/elasticsearch/reference/5.2/breaking-changes-5.0.html



[Errors you may encounter]

unknown setting [es.default.path.conf] please check that any required plugins are installed, or check the breaking changes documentation for removed settings


node settings must not contain any index level settings


unknown setting [action.disable_shutdown] please check that any required plugins are installed, or check the breaking changes documentation for removed settings


unknown setting [discovery.zen.ping.multicast.enabled] please check that any required plugins are installed, or check the breaking changes documentation for removed settings


unknown setting [resource.reload.interval] did you mean any of [resource.reload.interval.low, resource.reload.interval.high, resource.reload.interval.medium, resource.reload.enabled]?


unknown setting [script.indexed] did you mean any of [script.inline, script.ingest]?


node validation exception

bootstrap checks failed

memory locking requested for elasticsearch process but memory is not locked

max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]


- es.default.path.conf has been replaced by path.conf. (And the -D command-line prefix has been replaced by -E.)

- If you have any index-level settings, remove them.

- action.disable_shutdown appears to have been removed. (I haven't had a chance to verify it, but judging from the docs the _shutdown API is gone, so...)

- Remove the multicast discovery settings as well.

- The resource.reload settings changed names and usage, so either delete them or update them to the new names.

- Remove the script.indexed setting too. (It seems to have been replaced by script.stored.)

- For the bootstrap memory-lock check, you either need the right privileges or have to raise the memlock limit, in limits.conf if I remember correctly.

- The vm.max_map_count setting is well documented. ($ sudo sysctl -w vm.max_map_count=262144)
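
Putting the list above together, a rough before-and-after sketch of the affected elasticsearch.yml lines might look like the following. The setting names come from the errors above, but the replacement lines are only illustrative, so check the breaking changes documentation for your exact version.

# 2.x style lines that 5.x rejects
discovery.zen.ping.multicast.enabled: false
script.indexed: true
index.number_of_shards: 5
resource.reload.interval: 60s

# illustrative 5.x equivalents
discovery.zen.ping.unicast.hosts: ["127.0.0.1"]    # multicast is gone, list unicast hosts instead
script.stored: true                                # script.indexed -> script.stored
resource.reload.interval.medium: 60s               # the interval setting was split into low/medium/high
# index-level settings move into index templates or index creation requests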


:

[Elasticsearch] elasticsearch-analysis-arirang-5.2.1

Elastic/Elasticsearch 2017. 2. 21. 12:41

Sharing elasticsearch-analysis-arirang-5.2.1.


It is built against:

Lucene 6.4.1

Elasticsearch 5.2.1


elasticsearch-analysis-arirang-5.2.1.zip


Installation)

$ bin/elasticsearch-plugin install --verbose file:///services/apps/elasticsearch-analysis-arirang-5.2.1.zip


:

[Kibana] Adding a threshold script when using Unique Count

Elastic/Kibana 2017. 2. 9. 12:54

Writing this down, once again, to make up for my memory.


Kibana exposes Elasticsearch's cardinality aggregation as Unique Count.

To tune its accuracy you can set precision_threshold on it.


Reference)

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html


QueryDSL)

{
    "aggs" : {
        "author_count" : {
            "cardinality" : {
                "field" : "author_hash",
                "precision_threshold": 100 
            }
        }
    }
}



Kibana JSON Input)

{

"precision_threshold":40000

}


Use it as shown above.

Keep in mind that this aggregation is sensitive in terms of CPU and memory usage, and you should also check your circuit breaker settings.
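
For reference, 40000 is also the maximum value precision_threshold supports; according to the cardinality aggregation documentation, larger values behave the same as 40000, and the default is 3000.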


:

[Search Recommendation] Basic recommendations with Apache Mahout + Elastic Stack

Elastic/Elasticsearch 2017. 1. 24. 11:47

In this post I'd like to walk through generating recommendation data with the Elastic Stack and Apache Mahout.

You can build a recommendation data mart with the Elastic Stack alone through cohort analysis,

but to get better-quality recommendation data we'll bring in Apache Mahout.


To keep it accessible to everyone, this only covers Hello World! level material.


[Elastic Stack]

https://www.elastic.co/products


[Apache mahout]

https://mahout.apache.org/


Both solutions are open source, and their source trees contain good example code, so anyone can pick them up easily.


Step 1)

Use Elasticsearch + Logstash + Kibana to collect the logs and produce the raw data we will recommend from.


User item click log -> Logstash collect -> Elasticsearch store -> Kibana visualize -> CSV download


From the collected data we extract user id + item id + click count.

Below is the Query DSL Kibana generates for this.

{

  "size": 0,

  "query": {

    "filtered": {

      "query": {

        "query_string": {

          "query": "cp:CLK AND id:[0 TO *]",

          "analyze_wildcard": true

        }

      },

      "filter": {

        "bool": {

          "must": [

            {

              "range": {

                "time": {

                  "gte": 1485010800000,

                  "lte": 1485097199999,

                  "format": "epoch_millis"

                }

              }

            }

          ],

          "must_not": []

        }

      }

    }

  },

  "aggs": {

    "2": {

      "terms": {

        "field": "user_id",

        "size": 30000,

        "order": {

          "_count": "desc"

        }

      },

      "aggs": {

        "3": {

          "terms": {

            "field": "item_id",

            "size": 10,

            "order": {

              "_count": "desc"

            }

          }

        }

      }

    }

  }

}


Step 2)

The recommender we will use from Apache Mahout is the UserBasedRecommender.

As the sample code shows, the dataset.csv file looks like this:

- Creating a User-Based Recommender in 5 minutes


1,10,1.0
1,11,2.0
1,12,5.0
1,13,5.0

Format) userId,itemId,ratingValue


In Step 1 we produced user_id, item_id, click_count precisely to match this format.

Let's run the UserBasedRecommender on that data.


Step 3)

The sample code below is a good reference.

https://github.com/apache/mahout/tree/master/examples/src/main/java/org/apache/mahout


Create a main class and run it with the code shown in Step 2.

I implemented mine separately by implementing UserBasedRecommender.

This part is easy for anyone to do; refer to classes such as BookCrossingRecommender in the examples.


UserBasedRecommenderRunner runner = new UserBasedRecommenderRunner();

Recommender recommender = runner.buildRecommender();


// 3 recommended items for user 710039

List<RecommendedItem> recommendations = recommender.recommend(710039, 3);


for (RecommendedItem recommendation : recommendations) {

    LOG.debug("추천 아이템 : {}", recommendation);

}


[Run log]

11:39:31.527 [main] INFO  o.a.m.c.t.i.model.file.FileDataModel - Creating FileDataModel for file /git/prototype/data/user-to-item.csv

11:39:31.626 [main] INFO  o.a.m.c.t.i.model.file.FileDataModel - Reading file info...

11:39:31.765 [main] INFO  o.a.m.c.t.i.model.file.FileDataModel - Read lines: 63675

11:39:31.896 [main] INFO  o.a.m.c.t.i.model.GenericDataModel - Processed 10000 users

11:39:31.911 [main] INFO  o.a.m.c.t.i.model.GenericDataModel - Processed 19124 users

11:39:31.949 [main] DEBUG o.a.m.c.t.i.r.GenericUserBasedRecommender - Recommending items for user ID '710039'

11:39:31.965 [main] DEBUG o.a.m.c.t.i.r.GenericUserBasedRecommender - Recommendations are: [RecommendedItem[item:35222, value:4.0], RecommendedItem[item:12260, value:4.0], RecommendedItem[item:12223, value:1.5]]

11:39:31.966 [main] DEBUG o.h.p.mahout.meme.MemeProductRunner - 추천 아이템 : RecommendedItem[item:35222, value:4.0]

11:39:31.966 [main] DEBUG o.h.p.mahout.meme.MemeProductRunner - 추천 아이템 : RecommendedItem[item:12260, value:4.0]

11:39:31.967 [main] DEBUG o.h.p.mahout.meme.MemeProductRunner - 추천 아이템 : RecommendedItem[item:12223, value:1.5]


[Recommender]

similarity = new PearsonCorrelationSimilarity(dataModel);


// build recommendations from the N nearest-neighbor users

// UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, dataModel, 0.2);


// build recommendations from all users above a similarity threshold; samplingRate: user sampling rate 10%

// UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.1, similarity, dataModel, 0.1);


UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.2, similarity, dataModel, 1.0);

recommender = new GenericUserBasedRecommender(dataModel, neighborhood, similarity);


- The data set was too small, so I used ThresholdUserNeighborhood.
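
To tie the pieces above together, here is a self-contained sketch. The CSV path, user id, and threshold values are just examples, and the UserBasedRecommenderRunner in Step 3 is my own wrapper class rather than a Mahout class.

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.ThresholdUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class SimpleUserBasedRecommender {
  public static void main(String[] args) throws Exception {
    // userId,itemId,rating CSV exported from Kibana in Step 1
    DataModel dataModel = new FileDataModel(new File("data/user-to-item.csv"));

    UserSimilarity similarity = new PearsonCorrelationSimilarity(dataModel);
    // keep every user whose similarity exceeds 0.2, no sampling (1.0)
    UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.2, similarity, dataModel, 1.0);
    Recommender recommender = new GenericUserBasedRecommender(dataModel, neighborhood, similarity);

    // 3 recommended items for user 710039
    List<RecommendedItem> recommendations = recommender.recommend(710039L, 3);
    for (RecommendedItem recommendation : recommendations) {
      System.out.println(recommendation);
    }
  }
}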


So that's a very simple way to build recommendation data by running CF on search click logs.

You can also evaluate the recommendation data you produce.

Again, look at the xxxxxxEvaluator classes in the examples and implement it yourself.
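
For instance, with AverageAbsoluteDifferenceRecommenderEvaluator (org.apache.mahout.cf.taste.impl.eval) the evaluation could be sketched roughly like this, continuing from the sketch above; the 0.7/0.3 split (70% of each user's preferences for training, 30% of the users evaluated) is just an example.

RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();

// rebuild the same recommender for each training split
RecommenderBuilder builder = model -> {
  UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
  UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.2, similarity, model, 1.0);
  return new GenericUserBasedRecommender(model, neighborhood, similarity);
};

// train on 70% of each user's preferences, evaluate on 30% of the users; lower is better
double score = evaluator.evaluate(builder, null, dataModel, 0.7, 0.3);
System.out.println("average absolute difference: " + score);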


:

[Lucene] Multi-value fields and the inverted index

ITWeb/Search 2017. 1. 19. 18:41
I seem to forget even the basics, so I'm writing this down.

Multi-value fields and the inverted index

The fact that all field types support multi-value fields out of the box is a consequence of the origins of Lucene. Lucene was designed to be a full text search engine. In order to be able to search for individual words within a big block of text, Lucene tokenizes the text into individual terms, and adds each term to the inverted index separately.

This means that even a simple text field must be able to support multiple values by default. When other datatypes were added, such as numbers and dates, they used the same data structure as strings, and so got multi-values for free.


The passage above is quoted from the Elasticsearch reference below.
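
To make it concrete: indexing a document such as {"tags": ["search", "lucene"]} simply adds the terms search and lucene to the inverted index of the tags field, essentially the same way two tokens produced from a single text value would be added.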


[Document]

https://www.elastic.co/guide/en/elasticsearch/reference/2.4/array.html

:

[Elasticsearch] Whether Range Query From and To are inclusive

Elastic/Elasticsearch 2017. 1. 2. 12:44

I have no idea why I keep forgetting this, but I'm writing it down to help my memory.


A range query takes the from, to, gt, gte, lt, lte parameters.

Looking at the RangeQueryBuilder.java source, the fields are defined as follows.


private final String name;
private Object from;
private Object to;
private String timeZone;
private boolean includeLower = true;
private boolean includeUpper = true;
private float boost = -1;
private String queryName;
private String format;


By default the lower and upper bounds are included.

So from and to are inclusive of their values.

It works the same way as MySQL's BETWEEN min AND max, which also includes min and max.
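
A small sketch with the Java API to make the difference explicit (the field name and values are arbitrary):

import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.index.query.RangeQueryBuilder;

// from/to include the bounds by default: 10 <= price <= 20
RangeQueryBuilder inclusive = QueryBuilders.rangeQuery("price").from(10).to(20);

// gt/lt exclude the bounds: 10 < price < 20
RangeQueryBuilder exclusive = QueryBuilders.rangeQuery("price").gt(10).lt(20);

// from/to can also be made exclusive explicitly
RangeQueryBuilder custom = QueryBuilders.rangeQuery("price").from(10).to(20)
    .includeLower(false)
    .includeUpper(false);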


:

[Elasticsearch] _cat nodes/indices help

Elastic/Elasticsearch 2016. 12. 6. 12:18

Posting these because I needed to look them up.

Only nodes and indices.
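
For reference, the listings below are what the cat APIs print when you pass the help parameter, e.g. (assuming a node on localhost:9200):

$ curl -s 'localhost:9200/_cat/nodes?help'

$ curl -s 'localhost:9200/_cat/indices?help'

You can then pick specific columns with the h parameter, for example _cat/nodes?v&h=name,heap.percent,cpu,load.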


[Reference]

https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-nodes.html

id                               | id,nodeId                          | unique node id                                                                                                   
pid                              | p                                  | process id                                                                                                       
host                             | h                                  | host name                                                                                                        
ip                               | i                                  | ip address                                                                                                       
port                             | po                                 | bound transport port                                                                                             
version                          | v                                  | es version                                                                                                       
build                            | b                                  | es build hash                                                                                                    
jdk                              | j                                  | jdk version                                                                                                      
disk.avail                       | d,disk,diskAvail                   | available disk space                                                                                             
heap.current                     | hc,heapCurrent                     | used heap                                                                                                        
heap.percent                     | hp,heapPercent                     | used heap ratio                                                                                                  
heap.max                         | hm,heapMax                         | max configured heap                                                                                              
ram.current                      | rc,ramCurrent                      | used machine memory                                                                                              
ram.percent                      | rp,ramPercent                      | used machine memory ratio                                                                                        
ram.max                          | rm,ramMax                          | total machine memory                                                                                             
file_desc.current                | fdc,fileDescriptorCurrent          | used file descriptors                                                                                            
file_desc.percent                | fdp,fileDescriptorPercent          | used file descriptor ratio                                                                                       
file_desc.max                    | fdm,fileDescriptorMax              | max file descriptors                                                                                             
cpu                              | cpu                                | recent cpu usage                                                                                                 
load                             | l                                  | most recent load avg                                                                                             
uptime                           | u                                  | node uptime                                                                                                      
node.role                        | r,role,dc,nodeRole                 | d:data node, c:client node                                                                                       
master                           | m                                  | m:master-eligible, *:current master                                                                              
name                             | n                                  | node name                                                                                                        
completion.size                  | cs,completionSize                  | size of completion                                                                                               
fielddata.memory_size            | fm,fielddataMemory                 | used fielddata cache                                                                                             
fielddata.evictions              | fe,fielddataEvictions              | fielddata evictions                                                                                              
query_cache.memory_size          | qcm,queryCacheMemory               | used query cache                                                                                                 
query_cache.evictions            | qce,queryCacheEvictions            | query cache evictions                                                                                            
request_cache.memory_size        | rcm,requestCacheMemory             | used request cache                                                                                               
request_cache.evictions          | rce,requestCacheEvictions          | request cache evictions                                                                                          
request_cache.hit_count          | rchc,requestCacheHitCount          | request cache hit counts                                                                                         
request_cache.miss_count         | rcmc,requestCacheMissCount         | request cache miss counts                                                                                        
flush.total                      | ft,flushTotal                      | number of flushes                                                                                                
flush.total_time                 | ftt,flushTotalTime                 | time spent in flush                                                                                              
get.current                      | gc,getCurrent                      | number of current get ops                                                                                        
get.time                         | gti,getTime                        | time spent in get                                                                                                
get.total                        | gto,getTotal                       | number of get ops                                                                                                
get.exists_time                  | geti,getExistsTime                 | time spent in successful gets                                                                                    
get.exists_total                 | geto,getExistsTotal                | number of successful gets                                                                                        
get.missing_time                 | gmti,getMissingTime                | time spent in failed gets                                                                                        
get.missing_total                | gmto,getMissingTotal               | number of failed gets                                                                                            
indexing.delete_current          | idc,indexingDeleteCurrent          | number of current deletions                                                                                      
indexing.delete_time             | idti,indexingDeleteTime            | time spent in deletions                                                                                          
indexing.delete_total            | idto,indexingDeleteTotal           | number of delete ops                                                                                             
indexing.index_current           | iic,indexingIndexCurrent           | number of current indexing ops                                                                                   
indexing.index_time              | iiti,indexingIndexTime             | time spent in indexing                                                                                           
indexing.index_total             | iito,indexingIndexTotal            | number of indexing ops                                                                                           
indexing.index_failed            | iif,indexingIndexFailed            | number of failed indexing ops                                                                                    
merges.current                   | mc,mergesCurrent                   | number of current merges                                                                                         
merges.current_docs              | mcd,mergesCurrentDocs              | number of current merging docs                                                                                   
merges.current_size              | mcs,mergesCurrentSize              | size of current merges                                                                                           
merges.total                     | mt,mergesTotal                     | number of completed merge ops                                                                                    
merges.total_docs                | mtd,mergesTotalDocs                | docs merged                                                                                                      
merges.total_size                | mts,mergesTotalSize                | size merged                                                                                                      
merges.total_time                | mtt,mergesTotalTime                | time spent in merges                                                                                             
percolate.current                | pc,percolateCurrent                | number of current percolations                                                                                   
percolate.memory_size            | pm,percolateMemory                 | memory used by percolations                                                                                      
percolate.queries                | pq,percolateQueries                | number of registered percolation queries                                                                         
percolate.time                   | pti,percolateTime                  | time spent percolating                                                                                           
percolate.total                  | pto,percolateTotal                 | total percolations                                                                                               
refresh.total                    | rto,refreshTotal                   | total refreshes                                                                                                  
refresh.time                     | rti,refreshTime                    | time spent in refreshes                                                                                          
script.compilations              | scrcc,scriptCompilations           | script compilations                                                                                              
script.cache_evictions           | scrce,scriptCacheEvictions         | script cache evictions                                                                                           
search.fetch_current             | sfc,searchFetchCurrent             | current fetch phase ops                                                                                          
search.fetch_time                | sfti,searchFetchTime               | time spent in fetch phase                                                                                        
search.fetch_total               | sfto,searchFetchTotal              | total fetch ops                                                                                                  
search.open_contexts             | so,searchOpenContexts              | open search contexts                                                                                             
search.query_current             | sqc,searchQueryCurrent             | current query phase ops                                                                                          
search.query_time                | sqti,searchQueryTime               | time spent in query phase                                                                                        
search.query_total               | sqto,searchQueryTotal              | total query phase ops                                                                                            
search.scroll_current            | scc,searchScrollCurrent            | open scroll contexts                                                                                             
search.scroll_time               | scti,searchScrollTime              | time scroll contexts held open                                                                                   
search.scroll_total              | scto,searchScrollTotal             | completed scroll contexts                                                                                        
segments.count                   | sc,segmentsCount                   | number of segments                                                                                               
segments.memory                  | sm,segmentsMemory                  | memory used by segments                                                                                          
segments.index_writer_memory     | siwm,segmentsIndexWriterMemory     | memory used by index writer                                                                                      
segments.index_writer_max_memory | siwmx,segmentsIndexWriterMaxMemory | maximum memory index writer may use before it must write buffered documents to a new segment                     
segments.version_map_memory      | svmm,segmentsVersionMapMemory      | memory used by version map                                                                                       
segments.fixed_bitset_memory     | sfbm,fixedBitsetMemory             | memory used by fixed bit sets for nested object field types and type filters for types referred in _parent fields
suggest.current                  | suc,suggestCurrent                 | number of current suggest ops                                                                                    
suggest.time                     | suti,suggestTime                   | time spend in suggest                                                                                            
suggest.total                    | suto,suggestTotal                  | number of suggest ops    


[Reference]

https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-indices.html

health                               | h                                  | current health status                                                                                            
status                               | s                                  | open/close status                                                                                                
index                                | i,idx                              | index name                                                                                                       
pri                                  | p,shards.primary,shardsPrimary     | number of primary shards                                                                                         
rep                                  | r,shards.replica,shardsReplica     | number of replica shards                                                                                         
docs.count                           | dc,docsCount                       | available docs                                                                                                   
docs.deleted                         | dd,docsDeleted                     | deleted docs                                                                                                     
creation.date                        | cd                                 | index creation date (millisecond value)                                                                          
creation.date.string                 | cds                                | index creation date (as string)                                                                                  
store.size                           | ss,storeSize                       | store size of primaries & replicas                                                                               
pri.store.size                       |                                    | store size of primaries                                                                                          
completion.size                      | cs,completionSize                  | size of completion                                                                                               
pri.completion.size                  |                                    | size of completion                                                                                               
fielddata.memory_size                | fm,fielddataMemory                 | used fielddata cache                                                                                             
pri.fielddata.memory_size            |                                    | used fielddata cache                                                                                             
fielddata.evictions                  | fe,fielddataEvictions              | fielddata evictions                                                                                              
pri.fielddata.evictions              |                                    | fielddata evictions                                                                                              
query_cache.memory_size              | qcm,queryCacheMemory               | used query cache                                                                                                 
pri.query_cache.memory_size          |                                    | used query cache                                                                                                 
query_cache.evictions                | qce,queryCacheEvictions            | query cache evictions                                                                                            
pri.query_cache.evictions            |                                    | query cache evictions                                                                                            
request_cache.memory_size            | rcm,requestCacheMemory             | used request cache                                                                                               
pri.request_cache.memory_size        |                                    | used request cache                                                                                               
request_cache.evictions              | rce,requestCacheEvictions          | request cache evictions                                                                                          
pri.request_cache.evictions          |                                    | request cache evictions                                                                                          
request_cache.hit_count              | rchc,requestCacheHitCount          | request cache hit count                                                                                          
pri.request_cache.hit_count          |                                    | request cache hit count                                                                                          
request_cache.miss_count             | rcmc,requestCacheMissCount         | request cache miss count                                                                                         
pri.request_cache.miss_count         |                                    | request cache miss count                                                                                         
flush.total                          | ft,flushTotal                      | number of flushes                                                                                                
pri.flush.total                      |                                    | number of flushes                                                                                                
flush.total_time                     | ftt,flushTotalTime                 | time spent in flush                                                                                              
pri.flush.total_time                 |                                    | time spent in flush                                                                                              
get.current                          | gc,getCurrent                      | number of current get ops                                                                                        
pri.get.current                      |                                    | number of current get ops                                                                                        
get.time                             | gti,getTime                        | time spent in get                                                                                                
pri.get.time                         |                                    | time spent in get                                                                                                
get.total                            | gto,getTotal                       | number of get ops                                                                                                
pri.get.total                        |                                    | number of get ops                                                                                                
get.exists_time                      | geti,getExistsTime                 | time spent in successful gets                                                                                    
pri.get.exists_time                  |                                    | time spent in successful gets                                                                                    
get.exists_total                     | geto,getExistsTotal                | number of successful gets                                                                                        
pri.get.exists_total                 |                                    | number of successful gets                                                                                        
get.missing_time                     | gmti,getMissingTime                | time spent in failed gets                                                                                        
pri.get.missing_time                 |                                    | time spent in failed gets                                                                                        
get.missing_total                    | gmto,getMissingTotal               | number of failed gets                                                                                            
pri.get.missing_total                |                                    | number of failed gets                                                                                            
indexing.delete_current              | idc,indexingDeleteCurrent          | number of current deletions                                                                                      
pri.indexing.delete_current          |                                    | number of current deletions                                                                                      
indexing.delete_time                 | idti,indexingDeleteTime            | time spent in deletions                                                                                          
pri.indexing.delete_time             |                                    | time spent in deletions                                                                                          
indexing.delete_total                | idto,indexingDeleteTotal           | number of delete ops                                                                                             
pri.indexing.delete_total            |                                    | number of delete ops                                                                                             
indexing.index_current               | iic,indexingIndexCurrent           | number of current indexing ops                                                                                   
pri.indexing.index_current           |                                    | number of current indexing ops                                                                                   
indexing.index_time                  | iiti,indexingIndexTime             | time spent in indexing                                                                                           
pri.indexing.index_time              |                                    | time spent in indexing                                                                                           
indexing.index_total                 | iito,indexingIndexTotal            | number of indexing ops                                                                                           
pri.indexing.index_total             |                                    | number of indexing ops                                                                                           
indexing.index_failed                | iif,indexingIndexFailed            | number of failed indexing ops                                                                                    
pri.indexing.index_failed            |                                    | number of failed indexing ops                                                                                    
merges.current                       | mc,mergesCurrent                   | number of current merges                                                                                         
pri.merges.current                   |                                    | number of current merges                                                                                         
merges.current_docs                  | mcd,mergesCurrentDocs              | number of current merging docs                                                                                   
pri.merges.current_docs              |                                    | number of current merging docs                                                                                   
merges.current_size                  | mcs,mergesCurrentSize              | size of current merges                                                                                           
pri.merges.current_size              |                                    | size of current merges                                                                                           
merges.total                         | mt,mergesTotal                     | number of completed merge ops                                                                                    
pri.merges.total                     |                                    | number of completed merge ops                                                                                    
merges.total_docs                    | mtd,mergesTotalDocs                | docs merged                                                                                                      
pri.merges.total_docs                |                                    | docs merged                                                                                                      
merges.total_size                    | mts,mergesTotalSize                | size merged                                                                                                      
pri.merges.total_size                |                                    | size merged                                                                                                      
merges.total_time                    | mtt,mergesTotalTime                | time spent in merges                                                                                             
pri.merges.total_time                |                                    | time spent in merges                                                                                             
percolate.current                    | pc,percolateCurrent                | number of current percolations                                                                                   
pri.percolate.current                |                                    | number of current percolations                                                                                   
percolate.memory_size                | pm,percolateMemory                 | memory used by percolations                                                                                      
pri.percolate.memory_size            |                                    | memory used by percolations                                                                                      
percolate.queries                    | pq,percolateQueries                | number of registered percolation queries                                                                         
pri.percolate.queries                |                                    | number of registered percolation queries                                                                         
percolate.time                       | pti,percolateTime                  | time spent percolating                                                                                           
pri.percolate.time                   |                                    | time spent percolating                                                                                           
percolate.total                      | pto,percolateTotal                 | total percolations                                                                                               
pri.percolate.total                  |                                    | total percolations                                                                                               
refresh.total                        | rto,refreshTotal                   | total refreshes                                                                                                  
pri.refresh.total                    |                                    | total refreshes                                                                                                  
refresh.time                         | rti,refreshTime                    | time spent in refreshes                                                                                          
pri.refresh.time                     |                                    | time spent in refreshes                                                                                          
search.fetch_current                 | sfc,searchFetchCurrent             | current fetch phase ops                                                                                          
pri.search.fetch_current             |                                    | current fetch phase ops                                                                                          
search.fetch_time                    | sfti,searchFetchTime               | time spent in fetch phase                                                                                        
pri.search.fetch_time                |                                    | time spent in fetch phase                                                                                        
search.fetch_total                   | sfto,searchFetchTotal              | total fetch ops                                                                                                  
pri.search.fetch_total               |                                    | total fetch ops                                                                                                  
search.open_contexts                 | so,searchOpenContexts              | open search contexts                                                                                             
pri.search.open_contexts             |                                    | open search contexts                                                                                             
search.query_current                 | sqc,searchQueryCurrent             | current query phase ops                                                                                          
pri.search.query_current             |                                    | current query phase ops                                                                                          
search.query_time                    | sqti,searchQueryTime               | time spent in query phase                                                                                        
pri.search.query_time                |                                    | time spent in query phase                                                                                        
search.query_total                   | sqto,searchQueryTotal              | total query phase ops                                                                                            
pri.search.query_total               |                                    | total query phase ops                                                                                            
search.scroll_current                | scc,searchScrollCurrent            | open scroll contexts                                                                                             
pri.search.scroll_current            |                                    | open scroll contexts                                                                                             
search.scroll_time                   | scti,searchScrollTime              | time scroll contexts held open                                                                                   
pri.search.scroll_time               |                                    | time scroll contexts held open                                                                                   
search.scroll_total                  | scto,searchScrollTotal             | completed scroll contexts                                                                                        
pri.search.scroll_total              |                                    | completed scroll contexts                                                                                        
segments.count                       | sc,segmentsCount                   | number of segments                                                                                               
pri.segments.count                   |                                    | number of segments                                                                                               
segments.memory                      | sm,segmentsMemory                  | memory used by segments                                                                                          
pri.segments.memory                  |                                    | memory used by segments                                                                                          
segments.index_writer_memory         | siwm,segmentsIndexWriterMemory     | memory used by index writer                                                                                      
pri.segments.index_writer_memory     |                                    | memory used by index writer                                                                                      
segments.index_writer_max_memory     | siwmx,segmentsIndexWriterMaxMemory | maximum memory index writer may use before it must write buffered documents to a new segment                     
pri.segments.index_writer_max_memory |                                    | maximum memory index writer may use before it must write buffered documents to a new segment                     
segments.version_map_memory          | svmm,segmentsVersionMapMemory      | memory used by version map                                                                                       
pri.segments.version_map_memory      |                                    | memory used by version map                                                                                       
segments.fixed_bitset_memory         | sfbm,fixedBitsetMemory             | memory used by fixed bit sets for nested object field types and type filters for types referred in _parent fields
pri.segments.fixed_bitset_memory     |                                    | memory used by fixed bit sets for nested object field types and type filters for types referred in _parent fields
warmer.current                       | wc,warmerCurrent                   | current warmer ops                                                                                               
pri.warmer.current                   |                                    | current warmer ops                                                                                               
warmer.total                         | wto,warmerTotal                    | total warmer ops                                                                                                 
pri.warmer.total                     |                                    | total warmer ops                                                                                                 
warmer.total_time                    | wtt,warmerTotalTime                | time spent in warmers                                                                                            
pri.warmer.total_time                |                                    | time spent in warmers                                                                                            
suggest.current                      | suc,suggestCurrent                 | number of current suggest ops                                                                                    
pri.suggest.current                  |                                    | number of current suggest ops                                                                                    
suggest.time                         | suti,suggestTime                   | time spend in suggest                                                                                            
pri.suggest.time                     |                                    | time spend in suggest                                                                                            
suggest.total                        | suto,suggestTotal                  | number of suggest ops                                                                                            
pri.suggest.total                    |                                    | number of suggest ops                                                                                            
memory.total                         | tm,memoryTotal                     | total used memory                                                                                                
pri.memory.total                     |                                    | total user memory  


:

[Elasticsearch] elasticsearch-analysis-arirang 5.0.1 plugin development notes

Elastic/Elasticsearch 2016. 11. 25. 12:31

To upgrade the Elasticsearch cluster, the Korean morphological analyzer has to be upgraded first.

Building a Korean analyzer plugin requires being reasonably familiar with the following:


- Elasticsearch

- Lucene

- Arirang


You can get the Arirang source and jar files from the link below.


Recently it looks like mgkaki has been contributing alongside Soomyung. :)


Lucene & Arirang changes)

- Between Lucene 6.1 and 6.2 the package structure changed, and classes changed as well.

- The pairmap-related bug in arirang has been fixed. (It may have been fixed earlier than that. ^^;)

- Lucene's CharacterUtils has been refactored.

- The CharacterUtils usage in arirang's KoreanTokenizer has to be updated to match:


Remove CharacterUtils.getInstance()

CharacterUtils.codePointAt(...) to Character.codePointAt(...)


- If you download the arirang 6.2 source, these changes are already applied.

- You need to get arirang.morph 1.1.0.


Elasticsearch plugin changes)

On the plugin side there are quite a few modifications, because the basic plugin structure changed a lot.

Depending on how you look at it, it may not be much; I'll leave that judgment to you. ^^


- arirang.lucene-analyzer and arirang-morph need to be updated.

- The AnalysisBinderProcessor previously used for binding is gone.

- Registration is now done in the Plugin / AnalysisPlugin implementation:


public class AnalysisArirangPlugin extends Plugin implements AnalysisPlugin {

  @Override

  public Map<String, AnalysisProvider<TokenFilterFactory>> getTokenFilters() {

    return singletonMap("arirang_filter", ArirangTokenFilterFactory::new);

  }


  @Override

  public Map<String, AnalysisProvider<TokenizerFactory>> getTokenizers() {

    Map<String, AnalysisProvider<TokenizerFactory>> extra = new HashMap<>();

    extra.put("arirang_tokenizer", ArirangTokenizerFactory::new);


    return extra;

  }


  @Override

  public Map<String, AnalysisProvider<AnalyzerProvider<? extends Analyzer>>> getAnalyzers() {

    return singletonMap("arirang_analyzer", ArirangAnalyzerProvider::new);

  }

}


- The constructor arguments of AnalyzerProvider, TokenFilterFactory, and TokenizerFactory have changed (see the sketch after this list):

IndexSettings indexSettings, Environment env, String name, Settings settings


- The outputDirectory in plugin.xml used for assembling the plugin has changed to elasticsearch.

- If outputDirectory is not set to elasticsearch, you get an error.
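
As a rough sketch of what the new factory signature looks like in practice (abbreviated and illustrative rather than the actual plugin source; the KoreanFilter wrapping is assumed here):

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ko.KoreanFilter;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractTokenFilterFactory;

public class ArirangTokenFilterFactory extends AbstractTokenFilterFactory {

  public ArirangTokenFilterFactory(IndexSettings indexSettings, Environment env, String name, Settings settings) {
    super(indexSettings, name, settings);
  }

  @Override
  public TokenStream create(TokenStream tokenStream) {
    // wrap the stream with the arirang KoreanFilter
    return new KoreanFilter(tokenStream);
  }
}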


After making these changes you can build and install the plugin.

See the previous post) [Elasticsearch] Lucene Arirang Analyzer Plugin for Elasticsearch 5.0.1


※ While building the plugin I was initially thrown off by the changes between Lucene 6.1 and 6.2.

I had naturally assumed there would be no package structure changes within 6.x, and that turned out to be wrong.

I did expect Elasticsearch 5.x to change a lot, since it moved from Lucene 5.x to 6.x.

Still, it took less time than I expected, although there was no documentation or reference material to be found anywhere.

Reading the source remains the only reliable answer. Having written this, I'm not sure it really counts as a development diary. ^^;


Source code)

https://github.com/HowookJeong/elasticsearch-analysis-arirang

:

[Elasticsearch] Lucene Arirang Analyzer Plugin for Elasticsearch 5.0.1

Elastic/Elasticsearch 2016. 11. 24. 19:02

First, here is the built plugin zip file.

I'll push the work itself to GitHub later.

With all the project and operations work these days, I barely managed to find time for this.


elasticsearch-analysis-arirang-5.0.1.zip


Installation)

$ bin/elasticsearch-plugin install --verbose file:///elasticsearch-analysis-arirang/target/elasticsearch-analysis-arirang-5.0.1.zip


Install log)

-> Downloading file:///elasticsearch-analysis-arirang-5.0.1.zip

Retrieving zip from file:///elasticsearch-analysis-arirang-5.0.1.zip

[=================================================] 100%

- Plugin information:

Name: analysis-arirang

Description: Arirang plugin

Version: 5.0.1

 * Classname: org.elasticsearch.plugin.analysis.arirang.AnalysisArirangPlugin

-> Installed analysis-arirang


Elasticsearch startup log)

$ bin/elasticsearch

[2016-11-24T18:49:09,922][INFO ][o.e.n.Node               ] [] initializing ...

[2016-11-24T18:49:10,083][INFO ][o.e.e.NodeEnvironment    ] [aDGu2B9] using [1] data paths, mounts [[/ (/dev/disk1)]], net usable_space [733.1gb], net total_space [930.3gb], spins? [unknown], types [hfs]

[2016-11-24T18:49:10,084][INFO ][o.e.e.NodeEnvironment    ] [aDGu2B9] heap size [1.9gb], compressed ordinary object pointers [true]

[2016-11-24T18:49:10,085][INFO ][o.e.n.Node               ] [aDGu2B9] node name [aDGu2B9] derived from node ID; set [node.name] to override

[2016-11-24T18:49:10,087][INFO ][o.e.n.Node               ] [aDGu2B9] version[5.0.1], pid[56878], build[080bb47/2016-11-11T22:08:49.812Z], OS[Mac OS X/10.12.1/x86_64], JVM[Oracle Corporation/Java HotSpot(TM) 64-Bit Server VM/1.8.0_72/25.72-b15]

[2016-11-24T18:49:11,335][INFO ][o.e.p.PluginsService     ] [aDGu2B9] loaded module [aggs-matrix-stats]

[2016-11-24T18:49:11,335][INFO ][o.e.p.PluginsService     ] [aDGu2B9] loaded module [ingest-common]

[2016-11-24T18:49:11,335][INFO ][o.e.p.PluginsService     ] [aDGu2B9] loaded module [lang-expression]

[2016-11-24T18:49:11,335][INFO ][o.e.p.PluginsService     ] [aDGu2B9] loaded module [lang-groovy]

[2016-11-24T18:49:11,335][INFO ][o.e.p.PluginsService     ] [aDGu2B9] loaded module [lang-mustache]

[2016-11-24T18:49:11,336][INFO ][o.e.p.PluginsService     ] [aDGu2B9] loaded module [lang-painless]

[2016-11-24T18:49:11,336][INFO ][o.e.p.PluginsService     ] [aDGu2B9] loaded module [percolator]

[2016-11-24T18:49:11,336][INFO ][o.e.p.PluginsService     ] [aDGu2B9] loaded module [reindex]

[2016-11-24T18:49:11,336][INFO ][o.e.p.PluginsService     ] [aDGu2B9] loaded module [transport-netty3]

[2016-11-24T18:49:11,336][INFO ][o.e.p.PluginsService     ] [aDGu2B9] loaded module [transport-netty4]

[2016-11-24T18:49:11,336][INFO ][o.e.p.PluginsService     ] [aDGu2B9] loaded plugin [analysis-arirang]

[2016-11-24T18:49:14,151][INFO ][o.e.n.Node               ] [aDGu2B9] initialized

[2016-11-24T18:49:14,151][INFO ][o.e.n.Node               ] [aDGu2B9] starting ...

[2016-11-24T18:49:14,377][INFO ][o.e.t.TransportService   ] [aDGu2B9] publish_address {127.0.0.1:9300}, bound_addresses {[fe80::1]:9300}, {[::1]:9300}, {127.0.0.1:9300}

[2016-11-24T18:49:17,511][INFO ][o.e.c.s.ClusterService   ] [aDGu2B9] new_master {aDGu2B9}{aDGu2B9mQ8KkWCe3fnqeMw}{_y9RzyKGSvqYAFcv99HBXg}{127.0.0.1}{127.0.0.1:9300}, reason: zen-disco-elected-as-master ([0] nodes joined)

[2016-11-24T18:49:17,584][INFO ][o.e.g.GatewayService     ] [aDGu2B9] recovered [0] indices into cluster_state

[2016-11-24T18:49:17,588][INFO ][o.e.h.HttpServer         ] [aDGu2B9] publish_address {127.0.0.1:9200}, bound_addresses {[fe80::1]:9200}, {[::1]:9200}, {127.0.0.1:9200}

[2016-11-24T18:49:17,588][INFO ][o.e.n.Node               ] [aDGu2B9] started


Running the Korean analyzer)

$ curl -X POST -H "Cache-Control: no-cache" -H "Postman-Token: 6d392d83-5816-71ad-556b-5cd6f92af634" -d '{

  "analyzer" : "arirang_analyzer",

  "text" : "[한국] 엘라스틱서치 사용자 그룹의 HENRY 입니다."

}' "http://localhost:9200/_analyze"


Analysis result)

{

  "tokens": [

    {

      "token": "[",

      "start_offset": 0,

      "end_offset": 1,

      "type": "symbol",

      "position": 0

    },

    {

      "token": "한국",

      "start_offset": 1,

      "end_offset": 3,

      "type": "korean",

      "position": 1

    },

    {

      "token": "]",

      "start_offset": 3,

      "end_offset": 4,

      "type": "symbol",

      "position": 2

    },

    {

      "token": "엘라스틱서치",

      "start_offset": 5,

      "end_offset": 11,

      "type": "korean",

      "position": 3

    },

    {

      "token": "엘라",

      "start_offset": 5,

      "end_offset": 7,

      "type": "korean",

      "position": 3

    },

    {

      "token": "스틱",

      "start_offset": 7,

      "end_offset": 9,

      "type": "korean",

      "position": 4

    },

    {

      "token": "서치",

      "start_offset": 9,

      "end_offset": 11,

      "type": "korean",

      "position": 5

    },

    {

      "token": "사용자",

      "start_offset": 12,

      "end_offset": 15,

      "type": "korean",

      "position": 6

    },

    {

      "token": "그룹",

      "start_offset": 16,

      "end_offset": 18,

      "type": "korean",

      "position": 7

    },

    {

      "token": "henry",

      "start_offset": 20,

      "end_offset": 25,

      "type": "word",

      "position": 8

    },

    {

      "token": "입니다",

      "start_offset": 26,

      "end_offset": 29,

      "type": "korean",

      "position": 9

    }

  ]

}


: