Elastic/Elasticsearch 2013. 1. 22. 11:14
To dig into the morphological analyzer source, I need to start collecting the background information one piece at a time. Starting with stopwords. [English Stopword1] a about above after again against all am an and any are aren't as at be because been before being below between both but by can't cannot could couldn't did didn't do does doesn't doing don't down during each few for from further had hadn't has hasn't have haven't having he he'd he'll he's her here here's hers herself him himself his how how's i i'd i'll i'm i've if in into is isn't it it's its itself let's me more most mustn't my myself no nor not of off on once only or other ought our ours
ourselves out over own same shan't she she'd she'll she's should shouldn't so some such than that that's the their theirs them themselves then there there's these they they'd they'll they're they've this those through to too under until up very was wasn't we we'd we'll we're we've were weren't what what's when when's where where's which while who who's whom why why's with won't would wouldn't you you'd you'll you're you've your yours yourself yourselves
[English Stopword2] a a's able about above according accordingly across actually after afterwards again against ain't all allow allows almost alone along already also although always am among amongst an and another any anybody anyhow anyone anything anyway anyways anywhere apart appear appreciate appropriate are aren't around as aside ask asking associated at available away awfully b be became because become becomes becoming been before beforehand behind being believe below beside besides best better between beyond both brief but by c c'mon c's came can can't cannot cant cause causes certain certainly changes clearly co com come comes concerning consequently consider considering contain containing contains corresponding could couldn't course currently d definitely described despite did didn't different do does doesn't doing don't done down downwards during e each edu eg eight either else elsewhere enough entirely especially et etc even ever every everybody everyone everything everywhere ex exactly example except f far few fifth first five followed following follows for former formerly forth four from further furthermore g get gets getting given gives go goes going gone got gotten greetings h had hadn't happens hardly has hasn't have haven't having he he's hello help hence her here here's hereafter hereby herein hereupon hers herself hi him himself his hither hopefully how howbeit however i i'd i'll i'm i've ie if ignored immediate in inasmuch inc indeed indicate indicated indicates inner insofar instead into inward is isn't it it'd it'll it's its itself j just k keep keeps kept know knows known l last lately later latter latterly least less lest let let's like liked likely little look looking looks ltd m mainly many may maybe me mean meanwhile merely might more moreover most mostly much must my myself n name namely nd near nearly necessary need needs neither never nevertheless new next nine no nobody non none noone nor normally not nothing novel now nowhere o obviously of off often oh ok okay old on once one ones only onto or other others otherwise ought our ours ourselves out outside over overall own p particular particularly per perhaps placed please plus possible presumably probably provides q que quite qv r rather rd re really reasonably regarding regardless regards relatively respectively right s said same saw say saying says second secondly see seeing seem seemed seeming seems seen self selves sensible sent serious seriously seven several shall she should shouldn't since six so some somebody somehow someone something sometime sometimes somewhat somewhere soon sorry specified specify specifying still sub such sup sure t t's take taken tell tends th than thank thanks thanx that that's thats the their theirs them themselves then thence there there's thereafter thereby therefore therein theres thereupon these they they'd they'll they're they've think third this thorough thoroughly those though three through throughout thru thus to together too took toward towards tried tries truly try trying twice two u un under unfortunately unless unlikely until unto up upon us use used useful uses using usually uucp v value various very via viz vs w want wants was wasn't way we we'd we'll we're we've welcome well went were weren't what what's whatever when whence whenever where where's whereafter whereas whereby wherein whereupon wherever whether which while whither who who's whoever whole whom whose why will willing wish with within without won't wonder would would wouldn't x y yes yet you you'd 
you'll you're you've your yours yourself yourselves z zero
Elastic/Elasticsearch 2013. 1. 18. 12:30
I requested an L4 load balancer without thinking it through, just to keep things simple, and that turned out to be a mistake. elasticsearch answers the SYN packets coming from the L4 with RST, so binding does not work properly and connections through the VIP fail. If you search for this, you'll find a clear answer: the developer of ES himself said, "There is no need for a load balancer in elasticsearch." The reason is that the Java API already covers it. It's something I had slightly overlooked (even when you know it, if you don't stop and think... this is what happens ㅡ.ㅡ;;).

client = new TransportClient(settings).addTransportAddress(new InetSocketTransportAddress(host, port));

Here you simply add the search nodes. How? Like this:

client = new TransportClient(settings)
    .addTransportAddress(new InetSocketTransportAddress(host1, port))
    .addTransportAddress(new InetSocketTransportAddress(host2, port));

I'm posting this for anyone who is thinking about putting elasticsearch behind an L4. ;;
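For reference, a minimal sketch of a client built this way, assuming the 0.19/0.20-era Java API; the cluster name, host names, and port are placeholders:

import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class SearchClientFactory {
    public static TransportClient build() {
        // cluster.name must match the cluster; sniffing lets the client discover other nodes on its own
        Settings settings = ImmutableSettings.settingsBuilder()
                .put("cluster.name", "my-cluster")
                .put("client.transport.sniff", true)
                .build();
        // register more than one node so the client round-robins and survives a node going down
        return new TransportClient(settings)
                .addTransportAddress(new InetSocketTransportAddress("host1", 9300))
                .addTransportAddress(new InetSocketTransportAddress("host2", 9300));
    }
}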
Elastic/Elasticsearch 2013. 1. 16. 14:54
This is just an example. ㅎㅎ The settings applied at creation time:
- replica setting
- shard setting
- refresh interval setting
- term index interval setting
- compression of stored fields
- analyzer setting
- synonym setting
- routing setting
- disabling _all(?)

# delete the index
curl -XDELETE 'http://localhost:9200/index0/'
# create the index
curl -XPUT 'http://localhost:9200/index0' -d '{
  "settings" : {
    "number_of_shards" : 50,
    "number_of_replicas" : 1,
    "index" : {
      "refresh_interval" : "60s",
      "term_index_interval" : "1",
      "store" : { "compress" : { "stored" : true, "tv" : true } },
      "analysis" : {
        "analyzer" : {
          "kr_analyzer" : { "type" : "custom", "tokenizer" : "kr_tokenizer", "filter" : ["trim", "kr_filter", "kr_synonym"] },
          "kr_analyzer" : { "type" : "custom", "tokenizer" : "kr_tokenizer", "filter" : ["trim", "kr_filter", "kr_synonym"] }
        },
        "filter" : {
          "kr_synonym" : { "type" : "synonym", "synonyms_path" : "analysis/synonym.txt" }
        }
      }
    },
    "routing" : { "required" : true, "path" : "indexType.user_uniq_id" }
  },
  "mappings" : {
    "indexType" : {
      "properties" : {
        "docid" : { "type" : "string", "store" : "yes", "index" : "not_analyzed", "include_in_all" : false },
        "rm_seq" : { "type" : "long", "store" : "yes", "index" : "no", "include_in_all" : false },
        "rm_join_seq" : { "type" : "long", "store" : "yes", "index" : "no", "include_in_all" : false },
        "rm_title" : { "type" : "string", "store" : "yes", "index" : "analyzed", "term_vector" : "yes", "analyzer" : "kr_analyzer", "include_in_all" : false },
        "user_uniq_id" : { "type" : "string", "store" : "yes", "index" : "not_analyzed", "include_in_all" : false },
        "mb_nm" : { "type" : "string", "store" : "yes", "index" : "analyzed", "term_vector" : "yes", "analyzer" : "kr_analyzer", "include_in_all" : false },
        "mb_count" : { "type" : "integer", "store" : "yes", "index" : "no", "include_in_all" : false },
        "rm_ymdt" : { "type" : "date", "format" : "yyyyMMddHHmmss", "store" : "yes", "index" : "not_analyzed", "include_in_all" : false },
        "data_size" : { "type" : "long", "store" : "yes", "index" : "no", "include_in_all" : false },
        "msgs" : {
          "properties" : {
            "msg_seq" : { "type" : "long", "store" : "no", "index" : "no", "include_in_all" : false },
            "msg" : { "type" : "string", "store" : "yes", "index" : "analyzed", "term_vector" : "yes", "analyzer" : "kr_analyzer", "include_in_all" : false },
            "send_user_uniq_id" : { "type" : "string", "store" : "yes", "index" : "not_analyzed", "include_in_all" : false },
            "send_user_nick_nm" : { "type" : "string", "store" : "yes", "index" : "not_analyzed", "term_vector" : "yes", "analyzer" : "kr_analyzer", "include_in_all" : false },
            "recv_ymdt" : { "type" : "date", "format" : "yyyyMMddHHmmss", "store" : "yes", "index" : "not_analyzed", "include_in_all" : false },
            "cfn_yn" : { "type" : "string", "store" : "no", "index" : "no", "include_in_all" : false },
            "send_yn" : { "type" : "string", "store" : "yes", "index" : "not_analyzed", "include_in_all" : false },
            "msg_type" : { "type" : "integer", "store" : "yes", "index" : "not_analyzed", "include_in_all" : false }
          }
        }
      }
    }
  }
}'
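To double-check what actually got applied, you can read the settings and the mapping back; just a quick verification sketch using the index name from above:

# read back the index settings and the mapping
curl -XGET 'http://localhost:9200/index0/_settings?pretty=true'
curl -XGET 'http://localhost:9200/index0/_mapping?pretty=true'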
Elastic/Elasticsearch 2013. 1. 16. 13:20
Original: http://www.elasticsearch.org/tutorials/2012/05/19/elasticsearch-for-logging.html
Korean translation: http://socurites.com/122
http://www.elasticsearch.org/guide/reference/api/admin-indices-templates.html
http://www.elasticsearch.org/guide/reference/mapping/source-field.html
http://www.elasticsearch.org/guide/reference/mapping/all-field.html
http://www.elasticsearch.org/guide/reference/query-dsl/
http://www.elasticsearch.org/guide/reference/api/bulk.html
Something still seemed off, so I dug a little further. ㅋㅋ
http://www.elasticsearch.org/guide/reference/index-modules/store.html
Judging from this document, _all and _source look like reserved keywords (too lazy to check the source, so this is just my guess). Following the docs I added the store option, and the result: success ^^

"settings" : {
  "number_of_shards" : 50,
  "number_of_replicas" : 1,
  "index" : {
    "refresh_interval" : "1s",
    "term_index_interval" : "1",
    "store" : { "compress" : { "stored" : true, "tv" : true } }
  }
},
"mappings" : {
  "type_name" : {
    "properties" : {
      "docid" : { "type" : "string", "store" : "yes", "index" : "not_analyzed", "include_in_all" : false },
      "seq" : { "type" : "long", "store" : "yes", "index" : "no", "include_in_all" : false }
    }
  }
}

As for _all: as you can see, every field has include_in_all : false, so nothing matches through _all. See all-field.html in the links at the top for details. For reference, the compression ratio comes out to a whopping 80%. ㅎㅎ
I've been tuning away based on the links above, but I can't really tell whether it's making a difference. ㅡ.ㅡ;; (By the way, _all / all and _source / source: both spellings were accepted when applying the settings.)

"all" : { "enabled" : false },
"source" : { "enabled" : true }

I set it up this way to use disk space efficiently for this use case. The size did shrink a little, but I should run it against a larger data set. With a small number of documents: without the options - 447MB, with the options - 438MB. 9MB saved. ^^;
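For the record, inside the type mapping the canonical keys are "_all" and "_source"; a minimal sketch of how the two switches sit in a mapping (the index and type names here are made-up placeholders):

curl -XPUT 'http://localhost:9200/test_index' -d '{
  "mappings" : {
    "test_type" : {
      "_all"    : { "enabled" : false },
      "_source" : { "enabled" : true },
      "properties" : {
        "docid" : { "type" : "string", "index" : "not_analyzed", "include_in_all" : false }
      }
    }
  }
}'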
Elastic/Elasticsearch 2013. 1. 15. 23:48
As always, nothing is easy. I finally got to the bottom of the problem between elasticsearch and the kr analyzer.
I had configured the ES cluster with two master nodes, and at indexing time I spawned 20 threads that sent indexing requests, all targeting a single master node.
But internally ES automatically spread the work across the masters.
At first I was too lazy to read the source and just poked at the environment and settings, which got me nowhere, so I started patching the source and debugging it directly.
Ta-da ^^ the answer was in a fairly obvious place.
Either run only one master node, or touch up the kr analyzer source a little so that it is thread safe.
Looking only at the conclusion it seems trivial, but getting there took quite a while. Still, I'd say I found it reasonably fast. ㅎㅎ
That wraps up this bit of ES operations experience. ^^
[Workaround 1] <- This is not a real fix at all, honestly it's garbage, and the data node settings were wrong too.
- Server 1: node.master: true, node.data: true
- Server 2: node.master: false, node.data: true
It probably should have been:
- Server 1: node.master: true, node.data: false
- Server 2: node.master: false, node.data: true
-> Someone else tested this and says it doesn't work either ^^;
[Workaround 2]
- In SyllableUtil.java, make the FileUtil.readlines call inside getSyllableFeature() thread safe
- I handled it with synchronized(lock); a rough sketch of the idea follows.
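A minimal sketch of the kind of synchronized lazy load described above, not the actual analyzer code; everything except the SyllableUtil / getSyllableFeature names is simplified or assumed:

import java.util.ArrayList;
import java.util.List;

public class SyllableUtil {
    private static final Object LOCK = new Object();
    private static volatile List<String> lines;   // shared syllable dictionary, loaded once

    // The original crash came from several indexing threads triggering the
    // dictionary file read at the same time; double-checked locking makes
    // sure only one thread performs the load.
    private static List<String> getLines() {
        if (lines == null) {
            synchronized (LOCK) {
                if (lines == null) {
                    lines = readSyllableFile();   // stand-in for the FileUtil.readlines call
                }
            }
        }
        return lines;
    }

    public static String getSyllableFeature(int idx) {
        return getLines().get(idx);
    }

    // Placeholder loader; the real analyzer reads its syllable dictionary from a resource file.
    private static List<String> readSyllableFile() {
        List<String> result = new ArrayList<String>();
        result.add("example-feature");
        return result;
    }
}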
Elastic/Elasticsearch 2013. 1. 14. 11:02
The content below was wrong, so I'm correcting it. ㅡ.ㅡ;; It is true that the error occurs in the kr_filter part. However, since there is no clean fix yet, I'm sharing the schema I applied along with a stopgap for the problem. I switched kr_filter to ngram, but that is not ideal either.

[Problematic schema]
"index" : {
  "analysis" : {
    "analyzer" : {
      "kr_analyzer" : { "type" : "custom", "tokenizer" : "kr_tokenizer", "filter" : ["trim", "kr_filter", "kr_synonym"] },
      "kr_analyzer" : { "type" : "custom", "tokenizer" : "kr_tokenizer", "filter" : ["trim", "kr_filter", "kr_synonym"] }
    },
    "filter" : {
      "kr_synonym" : { "type" : "synonym", "synonyms_path" : "analysis/synonym.txt" }
    }
  }
}

[Stopgap]
"index" : {
  "analysis" : {
    "analyzer" : {
      "kr_analyzer" : { "type" : "custom", "tokenizer" : "kr_tokenizer", "filter" : ["trim", "ngram", "kr_synonym"] },
      "kr_analyzer" : { "type" : "custom", "tokenizer" : "kr_tokenizer", "filter" : ["trim", "ngram", "kr_synonym"] }
    },
    "filter" : {
      "kr_synonym" : { "type" : "synonym", "synonyms_path" : "analysis/synonym.txt" },
      "ngram" : { "type" : "ngram", "min_gram" : 2, "max_gram" : 8 }
    }
  }
}
Sharing an issue that came up in operation: elasticsearch 0.19.12 + elasticsearch-analysis-korean-1.1.0 + jdk1.6.0_32
▷ With this combination the analyzer throws errors because of the JDK version.
▷ The analysis settings need trim and kr_filter applied in the filter section.
▷ But indexing fails with errors in these two filters.
▷ As a workaround, I downgraded the JDK to match the development environment, which resolved it.
Keep that in mind when using it. Also, elasticsearch-analysis-korean is built against elasticsearch 0.19.9, so it breaks when used with 0.2x versions. Just a heads-up.
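If you want to check whether a given analyzer chain (for example the ngram stopgap above) runs without throwing, the analyze API is a quick way to do it; index0 and kr_analyzer are the names from the earlier examples:

curl -XGET 'http://localhost:9200/index0/_analyze?analyzer=kr_analyzer&pretty=true' -d '안녕하세요 elasticsearch'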
Elastic/Elasticsearch 2013. 1. 9. 12:02
By default you can build structured queries from /_plugin/head, but if you want more varied options you have to put them together yourself. Working only from the elasticsearch.org docs, it can be hard for newcomers to know where to start, so here are the queries I tested.

[Basic queries]
http://localhost:9200/jjeong0/_search?q=msg:채팅&pretty=true
- runs a search against the msg field
http://localhost:9200/jjeong0/_search?q=msg:안녕 OR title:안녕하세요&sort=cre_ymdt:desc&from=0&size=10&pretty=true
- runs an OR search across the msg and title fields

[JSON string type]
[Paging + term query]
http://localhost:9200/jjeong0/_search?source={"from":0,"size":10,"query":{"term":{"msg":"안녕하세요"}}}&pretty=true
- term query against the msg field
- from is the starting offset, size is how many documents to fetch at a time
[Sorting + term query]
http://localhost:9200/jjeong0/_search?source={"query":{"bool":{"must":[{"term":{"msg":"안녕"}}],"must_not":[],"should":[]}},"from":0,"size":50,"sort":[{"cre_ymdt":"asc"}],"facets":{}}&pretty=true
- returns only documents whose msg contains the word "안녕"
- ascending sort on cre_ymdt
- http://www.elasticsearch.org/guide/reference/api/search/sort.html
[Range search]
http://localhost:9200/jjeong0/_search?source={"query":{"range":{"recv_ymdt":{"from":"20120820163946", "to":"20120911160444"}}}}&pretty=true
- restricts recv_ymdt to the given range
- http://www.elasticsearch.org/guide/reference/query-dsl/range-query.html
[Query string search]
http://localhost:9200/jjeong0/_search?source={"query":{"bool":{"must":[{"query_string":{"default_field":"msg","query":"%EC%95%88%EB%85%95%ED%95%98%EC%84%B8%EC%9A%94"}}],"must_not":[],"should":[]}},"from":0,"size":50,"sort":[{"cre_ymdt":"desc"}],"facets":{}}&pretty=true
- http://www.elasticsearch.org/guide/reference/query-dsl/query-string-query.html
[Highlight search]
http://localhost:9200/jjeong0/_search?source={"query":{"bool":{"must":[{"query_string":{"default_field":"msg","query":"안녕 하세요"}}],"must_not":[],"should":[]}},"from":0,"size":50,"sort":[{"cre_ymdt":"desc"}],"facets":{},"highlight":{"pre_tags":["<b>"],"post_tags":["</b>"],"fields":{"msg":{}}}}&pretty=true
- http://www.elasticsearch.org/guide/reference/api/search/highlighting.html
[Term + query string + range + highlight, sort, paging, routing]
http://localhost:9200/jjeong0/_search?source={"from":0,"size":20,"query":{"bool":{"must":[{"term":{"user_uniq_id":"jjeong.tistory.com"}},{"query_string" : {"default_operator" : "OR","fields" : ["msg", "title"],"query" : "안먹어"}},{"range":{"rm_ymdt":{"from":"20121209000000","to":"20130110000000","include_lower":true,"include_upper":true}}}]}},"highlight":{"pre_tags":["<b>"],"post_tags":["</b>"],"fields":{"msg":{},"title":{}}},"sort":[{"cre_ymdt":{"order":"desc"}}]}&routing=jjeong.tistory.com&pretty=true
- http://www.elasticsearch.org/guide/reference/mapping/routing-field.html
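The same queries can also be sent as a request body instead of stuffing JSON into the source parameter; a small sketch using the term query from above (the index jjeong0 and field msg are the names used in these examples):

curl -XPOST 'http://localhost:9200/jjeong0/_search?pretty=true' -d '{
  "from" : 0,
  "size" : 10,
  "query" : { "term" : { "msg" : "안녕하세요" } }
}'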
Elastic/Elasticsearch 2013. 1. 7. 14:30
The content below is excerpted from Lucene in Action.
▷
The options for indexing (Field.Index.*) control how the text in the field will be made searchable via the inverted index. Here are the choices:
- Index.ANALYZED—Use the analyzer to break the field’s value into a stream of separate tokens and make each token searchable. This option is useful for normal text fields (body, title, abstract, etc.).
- Index.NOT_ANALYZED—Do index the field, but don’t analyze the String value. Instead, treat the Field’s entire value as a single token and make that token searchable. This option is useful for fields that you’d like to search on but that shouldn’t be broken up, such as URLs, file system paths, dates, personal names, Social Security numbers, and telephone numbers. This option is especially useful for enabling “exact match” searching. We indexed the id field in listings 2.1 and 2.3 using this option.
- Index.ANALYZED_NO_NORMS—A variant of Index.ANALYZED that doesn’t store norms information in the index. Norms record index-time boost information in the index but can be memory consuming when you’re searching. Section 2.5.3 describes norms in detail.
- Index.NOT_ANALYZED_NO_NORMS—Just like Index.NOT_ANALYZED, but also doesn’t store norms. This option is frequently used to save index space and memory usage during searching, because single-token fields don’t need the norms information unless they’re boosted.
- Index.NO—Don’t make this field’s value available for searching.
▷
The options for stored fields (Field.Store.*) determine whether the field’s exact value should be stored away so that you can later retrieve it during searching:
- Store.YES—Stores the value. When the value is stored, the original String in its entirety is recorded in the index and may be retrieved by an IndexReader. This option is useful for fields that you’d like to use when displaying the search results (such as a URL, title, or database primary key). Try not to store very large fields, if index size is a concern, as stored fields consume space in the index.
- Store.NO—Doesn’t store the value. This option is often used along with Index.ANALYZED to index a large text field that doesn’t need to be retrieved in its original form, such as bodies of web pages, or any other type of text document.
▷ Term vector options (Field.TermVector.*):
- TermVector.YES—Records the unique terms that occurred, and their counts, in each document, but doesn’t store any positions or offsets information
- TermVector.WITH_POSITIONS—Records the unique terms and their counts, and also the positions of each occurrence of every term, but no offsets
- TermVector.WITH_OFFSETS—Records the unique terms and their counts, with the offsets (start and end character position) of each occurrence of every term, but no positions
- TermVector.WITH_POSITIONS_OFFSETS—Stores unique terms and their counts, along with positions and offsets
- TermVector.NO—Doesn’t store any term vector information
Note that you can’t index term vectors unless you’ve also turned on indexing for the field. Stated more directly: if Index.NO is specified for a field, you must also specify TermVector.NO.
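To make the combinations concrete, here is a small sketch of building a document with the Lucene 3.x API the book describes; the field names and values are made up for illustration:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class FieldOptionsExample {
    public static Document build() {
        Document doc = new Document();
        // analyzed, stored, with positions+offsets term vectors: a title you search, display, and highlight
        doc.add(new Field("title", "Hello Elasticsearch",
                Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
        // exact-match id: indexed as a single token, no norms needed for an unboosted one-token field
        doc.add(new Field("id", "DOC-0001",
                Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));
        // large body text: searchable but not stored in the index
        doc.add(new Field("body", "long text that does not need to be retrieved in its original form",
                Field.Store.NO, Field.Index.ANALYZED));
        return doc;
    }
}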
Elastic/Elasticsearch 2013. 1. 3. 17:55
This is what you find when you open the config/elasticsearch.yml file. In my earlier installation write-up there were configurations I hadn't been able to test much and didn't really understand; now that I do, I'm writing them up again.

1. Search + indexing: node.master: true / node.data: true
2. Search only: node.master: false / node.data: false
3. Indexing only: node.master: false / node.data: true
4. Master + indexing delivery: node.master: true / node.data: false

Configurations 1, 2, and 3 are intuitive. In 1 the node performs search and indexing itself; in 2 the node accepts search requests, fans them out to the data node group, and returns the response; in 3 the node is a data node, building the index files directly on that server. So what about 4? Right: since clustering is configured, that node acts purely as a master, handling index requests and cluster management. In other words, it receives indexing requests and forwards them to the data node group. A minimal yml sketch for configuration 2 follows.
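A minimal elasticsearch.yml sketch for configuration 2 above (a search-only node that holds no data and is never elected master); the cluster name is a placeholder:

# elasticsearch.yml - configuration 2: search-only node
cluster.name: my-cluster
node.master: false
node.data: false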
Elastic/Elasticsearch 2013. 1. 3. 14:49
Reference URL ▷
- Any plugin can be installed by copying its plugins folder to another server after installing it once.
- In-house, if a private IP means outbound network access is a problem, just copy the folder across to set it up (a rough sketch follows below).
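A rough shell sketch of that offline flow, using the head plugin as an example; the target path and host are placeholders:

# install once on a machine with internet access
bin/plugin -install mobz/elasticsearch-head
# then copy the installed plugin folder to the offline server
scp -r plugins/head user@offline-host:/usr/local/elasticsearch/plugins/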
▷ bigdesk
- bin/plugin -install lukas-vlcek/bigdesk
- http:
▷ elasticsearch-head
- bin/plugin -install Aconex/elasticsearch-head
- bin/plugin -install mobz/elasticsearch-head
- http:
▷ paramedic
- bin/plugin -install karmi/elasticsearch-paramedic
- http:
▷ elasticsearch-analysis-korean
- bin/plugin -install chanil1218/elasticsearch-analysis-korean/1.1.0
- Because the Korean analyzer is built against elasticsearch 0.19.x, it throws errors on 0.20.x