Posts in the 'Elastic' category (page 20)

[Elasticsearch] Converting the Transport Bulk data format to the REST Bulk data format

Elastic/Elasticsearch 2016. 7. 22. 10:02

When bulk indexing is implemented in Java, the indexing data format cannot be reused as-is for REST bulk indexing.

So I wrote a simple script to convert it.

Reference)

https://www.elastic.co/guide/en/elasticsearch/reference/2.3/docs-bulk.html

Java Bulk Indexing Format)

{ "field1" : "value1" }

Rest Bulk Indexing Format)

{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }

{ "field1" : "value1" }

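As you can see, the only difference is whether the index/type/id meta information is present.

With the Java API the meta information is set through the API itself. The REST API has no such step, so the meta information has to be written into the request body as shown above.

Conversion script)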

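#!/bin/bash
# Usage: ./convertJavaToRestFormat.sh <file with one JSON document per line>
# Writes a bulk action line before every document and appends both to query_result.txt.

header='{ "index" : { "_index" : "INDEX_NAME", "_type" : "TYPE_NAME" } }'

# IFS= and -r keep leading whitespace and backslashes in each document intact.
while IFS= read -r line
do
  printf '%s\n' "$header" >> query_result.txt
  printf '%s\n' "$line" >> query_result.txt
done < "$1"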

Run)

$ ./convertJavaToRestFormat.sh query_result.json

Rest Bulk Action)

$ curl -XPOST 'http://localhost:9200/INDEX_NAME/TYPE_NAME/_bulk' --data-binary "@query_result.txt"
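The same conversion can be sketched in Python. This is only a minimal illustration, assuming the input holds one JSON document per line; the function names `to_rest_bulk` and `render_bulk_body` are made up for this example and are not part of any Elasticsearch client.

```python
import json


def to_rest_bulk(lines, index_name, type_name):
    """Yield REST bulk lines: one action line followed by one source line per document."""
    # Hypothetical helper: the action line carries the meta information (no _id, as in the script).
    action = json.dumps({"index": {"_index": index_name, "_type": type_name}})
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the input
        json.loads(line)  # raises ValueError early if a document is not valid JSON
        yield action
        yield line


def render_bulk_body(lines, index_name, type_name):
    """Join the bulk lines; every line, including the last, ends with a newline."""
    return "".join(item + "\n" for item in to_rest_bulk(lines, index_name, type_name))


if __name__ == "__main__":
    docs = ['{ "field1" : "value1" }', '{ "field1" : "value2" }']
    print(render_bulk_body(docs, "INDEX_NAME", "TYPE_NAME"), end="")
```

Unlike the shell script, this version validates each document before emitting it, so a broken line stops the conversion instead of producing a bulk file that fails later.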

License: Attribution, NonCommercial, NoDerivatives

[Elasticsearch] This Week in Elasticsearch and Apache Lucene - 2016-06-27

Elastic/Elasticsearch 2016. 6. 28. 09:53

A few items caught my eye, so I am saving them here.

[Original post]

https://www.elastic.co/blog/this-week-in-elasticsearch-and-apache-lucene-2016-06-27

[Key points]

- low-level Java REST client has landed.

Instead of building a client on top of a separate HTTP library, we can probably use the one Elasticsearch provides.

- index.store.preload

It looks like this replaces the warmer feature.

- no longer turns red when creating an index

The status used to flash red for a moment, so this should reduce false alarms.

- default similarity is now BM25

So the default is moving from TF/IDF to BM25.

- wait for status yellow

Yellow also shows up occasionally, so I will need to review how I check cluster status from now on.
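For the BM25 point above, a small sketch of the textbook BM25 term score may help show how it differs from plain TF/IDF: term frequency saturates and document length is normalized. This is not Lucene's actual implementation; the function names are hypothetical, and only the defaults k1 = 1.2 and b = 0.75 are taken from Lucene.

```python
import math


def bm25_idf(doc_count, doc_freq):
    """Inverse document frequency in the form ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_term_score(tf, doc_len, avg_doc_len, doc_count, doc_freq, k1=1.2, b=0.75):
    """Score of one query term in one document (textbook formula, illustrative only)."""
    # Length normalization: longer-than-average documents get a larger denominator.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return bm25_idf(doc_count, doc_freq) * tf * (k1 + 1) / (tf + norm)


if __name__ == "__main__":
    for tf in (1, 2, 5, 10):
        print(tf, round(bm25_term_score(tf, 100, 100, 1000, 10), 4))
```

Running it shows the score growing more and more slowly as the term frequency rises, which is the saturation behavior that classic TF/IDF lacks.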

Elasticsearch Core

Changes in 2.x:

The .scripts index now obeys the number_of_shards setting.
Deprecation logging for `_timestamp` and `_ttl`.
Failed synced flushes were reporting an incorrect number of failures.
The index-exists request shouldn't fail if the index is being recovered.
A valid translog file can be deleted incorrectly after a disk full exception and multiple attempts to recover.

Changes in master:

The low-level Java REST client has landed. It is functionally equivalent to the REST clients available in other languages.
The `index.store.preload` setting can preload the specified Lucene files (eg doc values, norms) into MMAP before a segment comes online. This completes the replacement of warmers.
The cluster health no longer turns red when creating an index, unless there is a problem assigning shards.
The default similarity is now BM25.
The `_timestamp` and `_ttl` fields will not be supported on indices created in 5.x.
The `fields` parameter has been removed in favour of `stored_fields`, `docvalue_fields` and (for `text` fields only)`fielddata_fields`.
Some percolator queries don't need in-memory validation to ensure that they match.
Painless now has capturing lambdas, supports adding static methods like `each` to whitelisted classes, has syntax for initialising arrays, lists and maps,
Nested inner hits no longer return _index, _type, and _id, and parent/child inner hits doesn't return _index.
`string` fields weren't upgraded to `text`/`keyword` if `include_in_all` was specified.
Getting a task with wait_for_completion will return the task result.
Nodes info returns the calculated size of the total indexing buffer.
Analysis factories are now MultiTermAware, which will help to remove the lowercase_expanded_terms from the query string query, and to support keyword analyzers on the `keyword` field.
JNA is now a required dependency.
Guice has been removed from the script service,

Ongoing changes:

Sequence number checkpoints are persisted to disk when a segment is flushed.
Reindex-from-remote now uses the Java REST client.
Ensure that primary handover while indexing does not cause a dead lock.
The index file which lists the snapshots in a repository should be written atomically.
The `discovery-azure` plugin doesn't work with the security manager.
It shouldn't be necessary to wait for status yellow before working with a newly created index.
Add helpers to make JSON easier to render in Mustache.
The SynonymQuery should be used for alternative terms, instead of the Bool query.
More time zone edge case bug fixes.
Changes to shard store fetching are required in order to allow for inline rerouting during node join.
Analysis components should implement AnalysisPlugin instead of calling registerTokenizer, allowing Guice to be removed from Hunspell.

Apache Lucene

5.5.2 RC2 release vote is underway
A tricky randomized explain test failure turns out to be a test bug in a recently added test case
Math.toRadians and Math.toDegrees are now banned, since their implementation changes slightly across java versions, impacting our geo tests
RandomAccessFilterStrategy comes back to life for faster filter intersection in some cases
Multi term queries that match no terms rewrite to MatchNoDocsQuery instead of an empty BooleanQuery , making it much simpler to add a helpful reason to MatchNoDocsQuery
The new Ukrainian lemmatizer uses MorfologikFilter with a custom dictionary for efficient dictionary-based Ukrainian analysis
Lucene's confusing and bushy IndexReader hierarchy strikes again
RAMDirectory now also enforces write-once files, and MockDirectoryWrapper now tries harder to corrupt unsync'd index files on close
GeoPoint gets some code cleanups
Eclipse now also fails on unused imports
Auto-prefix terms have been removed since dimensional points is better
CompressionTools has been removed
ForbiddenAPIs is upgraded to version 2.2
It's important to fsync files after copying them via Lucene's Directory!
A tricky test failure was holding up the 5.5.2 release process
Some minor code improvements to SearchGroup
Can we improve the default behavior of query parsers and multi-term queries?
A test bug in MoreLikeThisTest still remains tricky to fix
MoreLikeThis should not invoke toString on a Field object
ScandinavianFoldingFilterFactory and ScandinavianNormalizationFilterFactory are safe for multi-term queries
In the possibly not-rare case where many document share the same point value, we can better compress the docIDs
The ancient query norm and coord blocks progress and should be removed
Should we add a light weight Ukrainian stemmer?
Updating doc values and then using delete-by-query with a doc values query doesn't always work, but fixing it is likely not feasible

[Elasticsearch] Aggregation name ?

Elastic/Elasticsearch 2016. 6. 20. 18:09

Those who use aggregations heavily probably know this well already.

It may be unfamiliar to those who only use the basics, so I wrote it down.

Reference)

https://www.elastic.co/guide/en/elasticsearch/reference/2.3/search-aggregations.html#_structuring_aggregations

"aggregations" : {
    "<aggregation_name>" : {
        "<aggregation_type>" : {
            <aggregation_body>
        }
        [,"meta" : { [<meta_data_body>] } ]?
        [,"aggregations" : { [<sub_aggregation>]+ } ]?
    }
    [,"<aggregation_name_2>" : { ... } ]*
}

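This post is about "<aggregation_name>" in the structure above.

This value sets the name under which the result is returned after the aggregation runs.

Sometimes people pass a field name as the aggregation_name. That works, but I thought it would be better to use it knowing exactly what the value is for, so I left this note.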

/**
 * Constructs a new aggregation builder.
 *
 * @param name  The aggregation name
 * @param type  The aggregation type
 */
public AggregationBuilder(String name, Type type) {
    if (name == null) {
        throw new IllegalArgumentException("[name] must not be null: [" + name + "]");
    }
    if (type == null) {
        throw new IllegalArgumentException("[type] must not be null: [" + name + "]");
    }
    this.name = name;
    this.type = type;
}

... (omitted) ...

@Override
public final XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException {
    builder.startObject(name);

    if (this.metaData != null) {
        builder.field("meta", this.metaData);
    }
    builder.field(type.name());
    internalXContent(builder, params);

    if (factoriesBuilder != null && (factoriesBuilder.count()) > 0) {
        builder.field("aggregations");
        factoriesBuilder.toXContent(builder, params);
    }

    return builder.endObject();
}

The source code makes it clear, doesn't it?

XContentBuilder uses the aggregation_name value as the name of the top-level object.

I am leaving this note mostly so that I remember it myself.
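To make the role of the name concrete, here is a small Python sketch that mirrors the structure above with plain dictionaries. The helper names, the aggregation name `titles_by_count`, and the sample response are hypothetical; the point is only that the name chosen in the request is the key used to read the result.

```python
def build_aggregation(name, agg_type, body, sub_aggregations=None):
    """Build {"<aggregation_name>": {"<aggregation_type>": {...}}} as a dict."""
    if name is None:
        raise ValueError("[name] must not be null")
    if agg_type is None:
        raise ValueError("[type] must not be null")
    aggregation = {agg_type: body}
    if sub_aggregations:
        aggregation["aggregations"] = sub_aggregations
    return {name: aggregation}


def read_buckets(response, name):
    """Read the buckets back using the same name that was sent in the request."""
    return response["aggregations"][name]["buckets"]


if __name__ == "__main__":
    request = {
        "size": 0,
        "aggregations": build_aggregation("titles_by_count", "terms", {"field": "title"}),
    }
    # Hypothetical response shape for a terms aggregation.
    response = {"aggregations": {"titles_by_count": {"buckets": [{"key": "a", "doc_count": 3}]}}}
    print(request)
    print(read_buckets(response, "titles_by_count"))
```

A descriptive name such as `titles_by_count` keeps the response readable, whereas reusing the bare field name hides which aggregation produced the result.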

[Elasticsearch] minimum_should_match vs. min_should_match when using BoolQueryBuilder + TermsQueryBuilder

Elastic/Elasticsearch 2016. 5. 24. 18:11

When did this parameter name change again?

I cannot follow all of the source code with every release, so I end up running into errors like this.

Note) min_should_match cannot be used through the Java API.

[Terms Query]

This query lets you search a field with several terms at once.

It therefore performs an OR search by default. If you want AND behavior, you have to set the following parameter on the terms query.

min_should_match:TERM_SIZE

OR)

curl -XGET "http://localhost:9200/_search?pretty" -d'
{
  "size": 10,
  "query" : {
    "terms": {
      "title": [
        "포니",
        "이펙트"
      ],
      "min_should_match": 1
    }
  }
}'

AND)

curl -XGET "http://localhost:9200/_search?pretty" -d'
{
  "size": 10,
  "query" : {
    "terms": {
      "title": [
        "포니",
        "이펙트"
      ],
      "min_should_match": 2
    }
  }
}'
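The OR and AND requests above differ only in the min_should_match value, which can be derived from the number of terms. The sketch below builds the request body in Python; `build_terms_query` is a hypothetical helper, and the parameter name follows this post, so whether it is accepted depends on the Elasticsearch version in use.

```python
import json


def build_terms_query(field, terms, require_all=False, size=10):
    """Build the terms query body: 1 matching term for OR, all terms for AND."""
    terms = list(terms)
    if not terms:
        raise ValueError("at least one term is required")
    # min_should_match:TERM_SIZE turns the default OR into AND, as described above.
    min_should_match = len(terms) if require_all else 1
    return {
        "size": size,
        "query": {"terms": {field: terms, "min_should_match": min_should_match}},
    }


if __name__ == "__main__":
    body = build_terms_query("title", ["포니", "이펙트"], require_all=True)
    print(json.dumps(body, ensure_ascii=False, indent=2))
```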

[Bool Query + Terms Query]

This combination is widely used for writing compound queries.

A terms query is placed inside a bool query, which allows more varied AND/OR combinations.

Here the AND/OR behavior should be controlled through minimum_should_match, but unless min_should_match is also set on the terms query, the results are not correct. (I suspect this is because we are in the transition period from 2.3.3 to 5.0.)

When you use several should clauses, you have to set minimum_should_match.

OR)

curl -XGET "http://localhost:9200/_search?pretty" -d'
{
  "size": 10,
  "query" : {
    "bool" : {
      "should": [
        {
          "terms": {
            "title": [
              "포니",
              "이펙트"
            ],
            "min_should_match": "1"
          }
        }
      ]
    }
  }
}'

AND)

curl -XGET "http://localhost:9200/_search?pretty" -d'
{
  "size": 10,
  "query" : {
    "bool" : {
      "should": [
        {
          "terms": {
            "title": [
              "포니",
              "이펙트"
            ],
            "min_should_match": "2"
          }
        }
      ]
    }
  }
}'
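For the case with several should clauses, the bool-level minimum_should_match decides how many clauses must match. The following is a minimal sketch of building such a body; `build_bool_should` and the `body` field in the usage example are hypothetical and not taken from the requests above.

```python
import json


def build_bool_should(clauses, minimum_should_match=1, size=10):
    """Wrap several clauses in bool/should and state how many of them must match."""
    clauses = list(clauses)
    if not 1 <= minimum_should_match <= len(clauses):
        raise ValueError("minimum_should_match must be between 1 and the number of clauses")
    return {
        "size": size,
        "query": {
            "bool": {
                "should": clauses,
                "minimum_should_match": minimum_should_match,
            }
        },
    }


if __name__ == "__main__":
    clauses = [
        {"terms": {"title": ["포니", "이펙트"]}},
        {"terms": {"body": ["포니", "이펙트"]}},  # "body" is an assumed second field
    ]
    # minimum_should_match=2 requires both clauses to match (AND across clauses).
    print(json.dumps(build_bool_should(clauses, 2), ensure_ascii=False, indent=2))
```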

[Logstash] Sample config for input-http, filter-mutate, and grok

Elastic/Logstash 2016. 5. 18. 17:48

Just posting this for reference.

[logstash config]

input {
  http {
    codec => json { }
  }
}

filter {
  mutate {
    add_field => { "request_uri" => "%{headers[request_uri]}" }
    replace => { "message" => "Accessing nested variables inside headers when using the http input" }
  }

  grok {
    match => { "request_uri" => "%{URIPARAM:request}" }
  }
}

output {
  stdout { codec => rubydebug }
}

This is nothing special. While designing a monitoring system I decided to build a prototype, ran a quick test, hit a grok error, and recorded it here.

[Output when using the logstash http input]

{
    "message" => "",
    "@version" => "1",
    "@timestamp" => "2016-05-18T07:19:36.140Z",
    "host" => "127.0.0.1",
    "headers" => {
        "request_method" => "GET",
        "request_path" => "/",
        "request_uri" => "/?message=test",
        "http_version" => "HTTP/1.1",
        "http_host" => "127.0.0.1:8080",
        "http_user_agent" => "curl/7.43.0",
        "http_accept" => "*/*"
    },
    "tags" => [
        [0] "_grokparsefailure"
    ]
}
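When checking what a pattern should capture from headers.request_uri, it can help to parse the same value outside Logstash first. The sketch below uses only the Python standard library on the request_uri from the output above; `extract_uri_params` is a hypothetical helper and is not a substitute for the grok filter.

```python
from urllib.parse import parse_qs, urlsplit


def extract_uri_params(request_uri):
    """Return the query parameters of a request URI as a flat dict."""
    query = urlsplit(request_uri).query  # "/?message=test" -> "message=test"
    # parse_qs returns a list per key; keep only the first value of each key.
    return {key: values[0] for key, values in parse_qs(query).items()}


if __name__ == "__main__":
    print(extract_uri_params("/?message=test"))
```

Comparing this result with the fields grok produces makes it easier to tell whether a _grokparsefailure tag comes from the pattern or from the field not being populated.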

[Elasticsearch] This Week in Elasticsearch and Apache Lucene - 2016-04-25

Elastic/Elasticsearch 2016. 4. 26. 15:48

Personally, the two items below are what stood out to me in this weekly.

Thread local leaks when running in web containers have finally been fixed.
CamelCase support has been removed.

Original post)

https://www.elastic.co/blog/this-week-in-elasticsearch-and-apache-lucene-2016-04-25

Elasticsearch Core

Changes in 2.x:

The index name was missing from the search slowlog.
CamelCase is deprecated (and has deprecation logging).
MoreLikeThis now handles aliases correctly.

Changes in master:

The .percolator type has been replaced with the percolator field datatype.
Added a fingerprint token filter and fingerprint analyzer for duplicate detection.
TransportReplicationAction has been significantly refactored in order to make it unit testable.
RPM and Deb packages now set permissions explicitly, instead of relying on umasks.
Indexed scripts and templates are now stored in the cluster state, and are called "stored" scripts/templates.
Parameter names in ingest processors are now more consistent.
IP fields support range queries again.
readNamedWriteable and writeNamedWriteable are now public, and writable.readFrom is gone.
UUID generators moved out of Strings, to avoid spooky action at a distance.
The `action.realtime_get` setting has been removed.
Support for unquoted JSON keys can be allowed via a system property, for bwc purposes.
Cross-type mapping updates were not working for boolean fields.
Empty task IDs are now serialised in 1 byte, so that every task can have a parent ID.
Reindex child tasks weren't being marked as such.
Validation failures have been removed from the cluster health response.
Object fields now inherit their dynamic setting from their parent object or type.
Thread local leaks when running in web containers have finally been fixed.
Added a safeguard to protect against too-large rescore windows.
The elasticsearch-plugin script now prints the download URL of the plugin when in verbose mode, and has friendlier error messages.
The startup script now fails with an error code if the elasticsearch binary is not found or is not executable.
CamelCase support has been removed.
The ICU analyzer now accepts custom rule files.

Ongoing changes:

Dots in fields names are now supported, but so far only if the parent fields already exist. Tests are being added to make sure supporting dots fully doesn't break anything.
Persistence of results of long running tasks.
A `minhash` token filter for estimating the Jaccard similarity coefficient between two docs.
Pipeline aggs are only needed on the coordinating node.
Adding searchable/aggregatable tags to fields in the field stats API.
Inner hits will no longer support the top-level syntax as the inline syntax has been improved.
It should be possible to pass include/exclude values to the terms aggs using the same format that was used to render bucket keys.
Deleted index tombstones close to being merged.

[Elasticsearch] Preparing the Arirang morphological analyzer for the Lucene 6.0 upgrade ahead of Elastic Stack 5.0

Elastic/Elasticsearch 2016. 4. 26. 15:18

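Some preparation seems to be needed, so here is a short note for now.

Once Elastic Stack 5.0 is officially released, it will be based on Lucene 6.x.

The Arirang morphological analyzer has to be upgraded accordingly.

After bumping the version, I see an error in one place.

It looks like implementing a single method declared as abstract is enough.

Just implement the reflectWith(....) method in MorphemeAttributeImpl.java.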

@Override
public void reflectWith(AttributeReflector reflector) {
    reflector.reflect(MorphemeAttribute.class, "token", koreanToken);
}

I have not verified this code, so whether and how to use it is up to you.

[Elasticsearch] Things to consider when using synonyms in Elasticsearch

Elastic/Elasticsearch 2016. 4. 22. 17:59

These may hardly count as considerations, but I am writing them down to clear my head.

Synonyms can be applied both at search time and at index time.

For the pros and cons of each, please see the link below.

Reference link)

https://www.elastic.co/guide/en/elasticsearch/guide/2.x/synonyms-expand-or-contract.html

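To apply synonyms at search time, you have to use a match-family query.

Some people use term-family queries and then wonder why synonyms are not applied, so be careful.

To apply synonyms at index time, check the position of the synonym filter in the filter chain carefully.

A filter placed earlier in the chain can prevent the synonyms from being applied, so be careful.

With index-time synonyms, term-family queries, which do not work with search-time synonyms, do work, so choose between the two according to your requirements.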
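The difference can be illustrated with a toy model in Python. This is a deliberately simplified sketch, not how Elasticsearch is implemented: the synonym table, the whitespace tokenizer, and the function names are all assumptions made for the example. A match-style query analyzes the query text (so the synonym filter runs), while a term-style query looks up the exact token as given.

```python
# Hypothetical synonym table: "tv" and "television" expand to each other.
SYNONYMS = {"tv": {"tv", "television"}, "television": {"tv", "television"}}


def tokenize(text):
    """Toy analyzer step: lowercase and split on whitespace."""
    return text.lower().split()


def expand(tokens):
    """Toy synonym filter: replace each token with its synonym set."""
    expanded = set()
    for token in tokens:
        expanded |= SYNONYMS.get(token, {token})
    return expanded


def match_query(indexed_tokens, query_text):
    """Match-style: the query text is analyzed, so search-time synonyms apply."""
    return bool(expand(tokenize(query_text)) & set(indexed_tokens))


def term_query(indexed_tokens, term):
    """Term-style: the term is looked up as-is, without analysis."""
    return term in set(indexed_tokens)


if __name__ == "__main__":
    plain_index = tokenize("television stand")   # no index-time synonyms
    expanded_index = expand(plain_index)         # index-time synonyms applied
    print(match_query(plain_index, "TV"))        # search-time synonyms
    print(term_query(plain_index, "tv"))         # term query, no synonyms anywhere
    print(term_query(expanded_index, "tv"))      # term query against expanded index
```

In this model the term lookup only succeeds once the synonyms have been written into the index, which is the trade-off described above.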

[Elasticsearch] Filter order when configuring an analyzer

Elastic/Elasticsearch 2016. 4. 22. 11:42

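This is very basic, but it is easy to overlook, so I am writing it down.

In my case, I assumed synonyms were being applied and wasted time testing before I noticed.

As you know, analyzers are configured in settings.

The configured analyzer is then used in mappings.

For how to configure them, see the document below.

Reference)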

https://www.elastic.co/guide/en/elasticsearch/reference/2.3/analysis.html

Configuration example from the reference)

index :
    analysis :
        analyzer :
            standard :
                type : standard
                stopwords : [stop1, stop2]
            myAnalyzer1 :
                type : standard
                stopwords : [stop1, stop2, stop3]
                max_token_length : 500
            # configure a custom analyzer which is
            # exactly like the default standard analyzer
            myAnalyzer2 :
                tokenizer : standard
                filter : [standard, lowercase, stop]
        tokenizer :
            myTokenizer1 :
                type : standard
                max_token_length : 900
            myTokenizer2 :
                type : keyword
                buffer_size : 512
        filter :
            myTokenFilter1 :
                type : stop
                stopwords : [stop1, stop2, stop3, stop4]
            myTokenFilter2 :
                type : length
                min : 0
                max : 2000

Using the example above, the myAnalyzer2 setting defines filter : [standard, lowercase, stop].

That is, the filters are applied in this order:

1. standard

2. lowercase

3. stop

Very simple.

I had configured the order incorrectly and kept wondering why it did not work.
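Why the order matters can be shown with a toy filter chain in Python. This is a simplified sketch, not Lucene code: the stopword list and the two filter functions are assumptions for illustration, and the chain simply applies each filter to the output of the previous one.

```python
STOPWORDS = {"the"}  # hypothetical stopword list, lowercase only


def lowercase(tokens):
    """Toy lowercase filter."""
    return [token.lower() for token in tokens]


def stop(tokens):
    """Toy stop filter: exact, case-sensitive comparison against STOPWORDS."""
    return [token for token in tokens if token not in STOPWORDS]


def run_chain(tokens, filters):
    """Apply the filters in the listed order, each on the previous filter's output."""
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


if __name__ == "__main__":
    tokens = ["The", "Quick", "Fox"]
    print(run_chain(tokens, [lowercase, stop]))  # stopword removed after lowercasing
    print(run_chain(tokens, [stop, lowercase]))  # "The" slips past the stop filter
```

The same reasoning applies to a synonym filter: if it sits before a filter that produces the form it expects, it never sees a matching token.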

[Logstash] Building a logstash slack chat output plugin

Elastic/Logstash 2016. 4. 20. 14:11

I needed one, so I built it at prototype level.

Any logic you need in the input and filter stages can be implemented on your side later.

References)

https://api.slack.com/docs/oauth-test-tokens

https://api.slack.com/methods

https://github.com/logstash-plugins/logstash-output-example.git

http://www.rubydoc.info/github/cheald/manticore/Manticore/Client

Source code)

https://github.com/HowookJeong/logstash-output-slack_chat

How to run)

    $ bin/logstash -e '
        input {
            stdin{}
        }

        output {
            slack_chat {
                url => "http://slack.com/api/chat.postMessage"
                token => "YOUR_TOKEN_STRING"
                channel => "SLACK_CHANNEL_ID"
            }

            stdout { codec => rubydebug }
        }
    '

Very simple.

It does not have to be a logstash plugin. The same thing can be implemented in various ways with a general HTTP client library, so build whatever fits your purpose.
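As one example of the plain HTTP client route, the sketch below prepares a chat.postMessage request with the Python standard library. It is an illustration under assumptions, not the plugin's implementation: the helper names are hypothetical, the token is sent as a Bearer header, and the channel and text are sent as form fields. Nothing is sent unless `send` is called.

```python
import urllib.parse
import urllib.request

SLACK_POST_MESSAGE_URL = "https://slack.com/api/chat.postMessage"


def build_slack_request(token, channel, text):
    """Prepare (but do not send) a form-encoded POST to chat.postMessage."""
    body = urllib.parse.urlencode({"channel": channel, "text": text}).encode("utf-8")
    request = urllib.request.Request(SLACK_POST_MESSAGE_URL, data=body, method="POST")
    request.add_header("Authorization", "Bearer " + token)
    request.add_header("Content-Type", "application/x-www-form-urlencoded")
    return request


def send(request, timeout=10):
    """Send the prepared request and return the raw response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    prepared = build_slack_request("YOUR_TOKEN_STRING", "SLACK_CHANNEL_ID", "hello from a script")
    print(prepared.full_url, prepared.data)
    # send(prepared)  # uncomment with a real token and channel ID
```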

Other logstash slack)

https://github.com/cyli/logstash-output-slack

jjeong
