420 posts tagged 'elasticsearch'

  1. 2015.11.24 [Lucene] CustomScoreQuery vs. DisjunctionMaxQuery
  2. 2015.11.20 [arirang] Dictionary data registration example
  3. 2015.11.20 [Elasticsearch] Hangul jamo analyzer plugin.
  4. 2015.11.16 [Breaking changes in 2.0] translated by 서철규.
  5. 2015.11.10 [Elasticsearch] arirang analyzer offset extraction bug.
  6. 2015.11.04 [Elasticsearch] Installing the lucene arirang analyzer plugin on elasticsearch 2.0
  7. 2015.10.30 [Review] Modeling data for fast aggregations - on Elastic's Blog
  8. 2015.10.27 [Filebeat] Shall we give it a quick try?
  9. 2015.10.22 [Elasticsearch] Fielddata+Webinar+IRC Q&A ...
  10. 2015.10.01 [Kibana] Caveats when monitoring with Kibana dashboards - Search Thread.

[Lucene] CustomScoreQuery vs. DisjunctionMaxQuery

ITWeb/Search General 2015. 11. 24. 15:05

When you run a search service, you sometimes need to boost particular documents.

The purpose of boosting varies, but summed up in one sentence:

"it is used to push specific documents to the top of the search results."


Document boosting can be implemented in the following ways.


1. Query boosting

"Query time boosting" applies a weight to a specific field or query term, taking effect at query time.


2. Index boosting

"Index time boosting" applies a weight to a specific field or document, taking effect at index time.


3. Field boosting

"Field boosting" reflects that a specific field carries a higher weight or importance.


4. Document boosting

"Document boosting" reflects that a specific document itself carries a higher weight or importance.


5. Custom boosting

"Custom boosting" works by manipulating the weight and score of arbitrary documents.


Those are the ways boosting can be implemented, and they are realized through the following APIs:

DisjunctionMaxQuery and CustomScoreQuery.

Of course, other APIs provided by lucene, and the various mechanisms offered by Elasticsearch or Solr, can be used as well.



What DisjunctionMaxQuery can implement is roughly

1. query boosting

3. field boosting


CustomScoreQuery, by contrast, can be used in far more ways.

In Elasticsearch terms, FunctionScoreQuery can be combined with DisjunctionMaxQuery, so everything except 2. index boosting can be implemented with it.


Now, to wrap up quickly.

Terms mean whatever their coiners intend, and readers interpret them as they will, so some of you may see this differently.

If anything here is wrong or mistaken, though, please let me know. ^^;


[Dis Max Query]

- Used at query time.

- Boosts by assigning weights to fields.

- Easy to implement.

- Scoring is left entirely to the search engine.
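To make the dis_max case concrete, here is a sketch of a request body built as a plain Python dict; the field names (title, body) and boost values are illustrative assumptions, not from any particular index:

```python
import json

def dis_max_query(term):
    """Build an Elasticsearch dis_max query body with per-field boosts.

    The field names and boost values here are illustrative assumptions.
    """
    return {
        "query": {
            "dis_max": {
                # The best-matching sub-query dominates the score;
                # tie_breaker adds back a fraction of the other scores.
                "tie_breaker": 0.3,
                "queries": [
                    {"match": {"title": {"query": term, "boost": 3.0}}},
                    {"match": {"body": {"query": term}}},
                ],
            }
        }
    }

print(json.dumps(dis_max_query("elasticsearch"), ensure_ascii=False))
```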


[Custom(Function) Score Query]

- Used at query time.

- Boosts by assigning weights to fields and documents.

- Not difficult, since many APIs are provided.

- Can be implemented in many different ways.

- Used to satisfy various algorithms or requirements.

- Values stored inside a document can be used for boosting.

- The more complex it gets, the slower it can become.
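As a sketch of boosting with a value stored in the document itself (the popularity field, modifier, and weights below are my assumptions, not from the post), a function_score query can wrap a dis_max query like this:

```python
import json

def function_score_query(term):
    """function_score wrapping dis_max: text relevance multiplied by a
    per-document popularity value. Field names are assumptions."""
    return {
        "query": {
            "function_score": {
                "query": {
                    "dis_max": {
                        "queries": [
                            {"match": {"title": term}},
                            {"match": {"body": term}},
                        ]
                    }
                },
                # Multiply the relevance score by log(1 + popularity).
                "field_value_factor": {
                    "field": "popularity",
                    "modifier": "log1p",
                    "missing": 0,
                },
                "boost_mode": "multiply",
            }
        }
    }

print(json.dumps(function_score_query("elasticsearch"), ensure_ascii=False))
```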


In the past, for bulletin-board style services using fulltext search, I often reached for dis max query + query string.

These days, to boost more precisely, custom (function) score queries seem to be used much more.

For shopping malls in particular, I think a custom score query is a better fit than dis max.


[arirang] Dictionary data registration example

ITWeb/Search General 2015. 11. 20. 15:15

To make use of the dictionaries with the arirang analyzer, you need to know how the dictionary files are structured and maintained.

The official cafe below has plenty of information worth consulting.


[Official cafe]


[Morphological-analysis dictionary structure and usage]


[Dictionary entry examples]

# As described in the structure-and-usage post above, the flag positions are ordered like this:

noun  verb  other-POS  hada(-da)-verb  doeda(-da)-verb  noun-that-can-take-내  na  na  na  irregular-conjugation


# Assuming 엘사 is a noun and neither a verb, another POS, nor irregular, it is written as:

엘사,100000000X


# 노래 is a noun and forms a hada-verb (노래하다):

노래,100100000X


# 소리 is a noun that can take 내, as in 소리내다:

소리,100001000X


[Reloading after dictionary changes]

In the arirang-morph package, call loadDictionary() in DictionaryUtil.java to load the dictionaries again.

▶ This requires your own implementation.


[Irregular-conjugation tags]

Source) http://cafe.naver.com/korlucene/135

# The very last character above is 'X'. This is what that position means:

B : ㅂ irregular

H : ㅎ irregular

L : 르 irregular

U : ㄹ irregular

S : ㅅ irregular

D : ㄷ irregular

R : 러 irregular

X : regular
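To see the flag positions in action, here is a small decoder for entries in the format above; the helper name and feature labels are my own shorthand, not part of arirang:

```python
# Decode one arirang dictionary line such as "노래,100100000X".
# Feature labels follow the field order described above; the names
# themselves are my own shorthand, not part of arirang.
FEATURES = ["noun", "verb", "other_pos", "hada_verb", "doeda_verb",
            "nae_noun", "na1", "na2", "na3"]

IRREGULAR_TAGS = {"B": "ㅂ irregular", "H": "ㅎ irregular",
                  "L": "르 irregular", "U": "ㄹ irregular",
                  "S": "ㅅ irregular", "D": "ㄷ irregular",
                  "R": "러 irregular", "X": "regular"}

def parse_entry(line):
    word, flags = line.strip().split(",")
    info = {name: flags[i] == "1" for i, name in enumerate(FEATURES)}
    info["conjugation"] = IRREGULAR_TAGS[flags[9]]
    return word, info

word, info = parse_entry("노래,100100000X")
print(word, info["noun"], info["hada_verb"], info["conjugation"])  # 노래 True True regular
```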



[Elasticsearch] Hangul jamo analyzer plugin.

Elastic/Elasticsearch 2015. 11. 20. 00:07

I put this plugin together by stitching existing code snippets together.

The source code is available below.


[repository]

https://github.com/HowookJeong/elasticsearch-analysis-hangueljamo


[How to build]

$ mvn clean package


  • Elasticsearch Analyze Test URL
http://localhost:9200/test/_analyze?analyzer=hangueljamo_analyzer&text=Henry 노트북&pretty=1
  • Analyzed Result
{
  "tokens" : [ {
    "token" : "henry",
    "start_offset" : 0,
    "end_offset" : 5,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "ㄴㅌㅂ",
    "start_offset" : 6,
    "end_offset" : 9,
    "type" : "word",
    "position" : 1
  } ]
}
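The result above (노트북 reduced to its initial consonants ㄴㅌㅂ, with ASCII lowercased) can be reproduced with plain Unicode arithmetic. This is my own sketch of the idea, not the plugin's actual code:

```python
# Map each precomposed Hangul syllable (U+AC00..U+D7A3) to its initial
# consonant (choseong); pass everything else through lowercased.
CHOSEONG = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")

def initial_jamo(text):
    out = []
    for ch in text:
        code = ord(ch)
        if 0xAC00 <= code <= 0xD7A3:
            # Each initial consonant spans 588 = 21 vowels * 28 finals.
            out.append(CHOSEONG[(code - 0xAC00) // 588])
        else:
            out.append(ch.lower())
    return "".join(out)

print(initial_jamo("Henry 노트북"))  # henry ㄴㅌㅂ
```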



[Breaking changes in 2.0] translated by 서철규.

Elastic/Elasticsearch 2015. 11. 16. 17:47

The document is long, so I had only ever read the original, but 서철규 from the Elasticsearch user group shared a translation he made.

So, partly to archive the material, and with his permission, I am posting it here. :)


Original)

https://www.elastic.co/guide/en/elasticsearch/reference/current/breaking-changes-2.0.html


Facebook user group)

https://www.facebook.com/groups/elasticsearch.kr/909640302455144


Attachment)


BreakingChangesInElasticsearch2.0.pdf


Thanks for the material. :)


[Elasticsearch] arirang analyzer offset extraction bug.

Elastic/Elasticsearch 2015. 11. 10. 15:38

[Workaround for the problem below]

If you avoid the pairmap-related feature altogether, you can sidestep the problem.

Ultimately the feature itself needs to be fixed, but to resolve things quickly for now, let's just remove the code.


Target file)

KoreanTokenizer.java


Code to remove)

            if (pairstack.size() > 0 && pairstack.get(0) == c) {
                pairstack.remove(0);
                continue;
            }

            int closechar = getPairChar(c);
            if (closechar != 0) {
                if ((pairstack.size() == 0 || pairstack.get(0) != closechar) && length > 0) {
                    pairstack.add(0, closechar);
                    break;
                } else {
                    pairstack.add(0, closechar);
                    continue;
                }
            }

Comment out the code above, rebuild, and deploy.


+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


I recently built an arirang analyzer plugin and have been using it with elasticsearch.

Here is a bug found while using it.


[Original source document]

(‘14.8월)의 일환으로 ’15.3.3 상반기중(행복서울대학교 등 활용)


[Arirang Analyzed]

{
    "tokens": [
        { "token": "14", "start_offset": 2, "end_offset": 4, "type": "word", "position": 0 },
        { "token": "8월", "start_offset": 5, "end_offset": 7, "type": "korean", "position": 1 },
        { "token": "의", "start_offset": 8, "end_offset": 9, "type": "korean", "position": 2 },
        { "token": "일환", "start_offset": 10, "end_offset": 12, "type": "korean", "position": 3 },
        { "token": "15", "start_offset": 16, "end_offset": 18, "type": "word", "position": 4 },
        { "token": "3", "start_offset": 19, "end_offset": 20, "type": "word", "position": 5 },
        { "token": "3", "start_offset": 21, "end_offset": 22, "type": "word", "position": 6 },
        { "token": "상반기중행복서울대학교", "start_offset": 23, "end_offset": 34, "type": "korean", "position": 7 },
        { "token": "상반", "start_offset": 23, "end_offset": 25, "type": "korean", "position": 7 },
        { "token": "기중", "start_offset": 25, "end_offset": 27, "type": "korean", "position": 8 },
        { "token": "행복", "start_offset": 27, "end_offset": 29, "type": "korean", "position": 9 },
        { "token": "서울", "start_offset": 29, "end_offset": 31, "type": "korean", "position": 10 },
        { "token": "대학교", "start_offset": 31, "end_offset": 34, "type": "korean", "position": 11 },
        { "token": "등", "start_offset": 36, "end_offset": 37, "type": "korean", "position": 12 },
        { "token": "활용", "start_offset": 38, "end_offset": 40, "type": "korean", "position": 13 }
    ]
}


Notice that starting from "행복", each offset is off by one.


[Fix by editing the source text - one]

(‘14.8월)의 일환으로 15.3.3 상반기중(행복서울대학교 등 활용)

- the ’15 apostrophe removed


[Fix by editing the source text - two]

(‘14.8월)의 일환으로 ’15.3.3 상반기중(행복서울대학교 등 활용)

- an apostrophe added after 8월


[Fix by editing the source text - three]

(14.8월)의 일환으로 ’15.3.3 상반기중(행복서울대학교 등 활용)

- the ‘14 apostrophe removed


[Fix by editing the source text - four]

‘14.8월)의 일환으로 ’15.3.3 상반기중(행복서울대학교 등 활용)

- the ( removed


[Proper fixes?]

- Fix the arirang analyzer bug in handling compound pairmap constructs (both the tokenizer and the filter look like they need changes).

- Run a character normalization pass over the source text as a filter.
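For the second option, a minimal sketch of the normalization idea (in Elasticsearch this would usually be done with a mapping char_filter rather than client-side code) is simply to strip the typographic apostrophes before analysis:

```python
# Strip the typographic apostrophes (U+2018, U+2019) that trigger the
# pairmap offset bug. A sketch of the normalization idea only, not a
# full char_filter implementation.
APOSTROPHES = "\u2018\u2019"

def normalize(text):
    return text.translate({ord(ch): None for ch in APOSTROPHES})

print(normalize("(‘14.8월)의 일환으로 ’15.3.3"))  # (14.8월)의 일환으로 15.3.3
```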


[Elasticsearch] lucene arirang analyzer 플러그인 적용 on elasticsearch 2.0

Elastic/Elasticsearch 2015. 11. 4. 15:37

To celebrate the elasticsearch 2.0 GA, let's walk through installing 수명님's lucene arirang Korean analyzer.

For the earlier elasticsearch analyzer arirang write-up, please see the post below.


http://jjeong.tistory.com/958


[Requirement]

elasticsearch 2.0

jdk 1.7 or later (elastic recommends 1.8 or later)

maven 3.1 or later

arirang.lucene-analyzer-5.0-1.0.0.jar (http://cafe.naver.com/korlucene/1274)

arirang-morph-1.0.0.jar (http://cafe.naver.com/korlucene/1274)


[Analysis Plugins]

https://www.elastic.co/guide/en/elasticsearch/plugins/2.0/analysis.html


[Changes when writing a plugin - one]

- es-plugin.properties is gone; plugin-descriptor.properties replaces it.

- plugin-descriptor.properties looks like this:


classname=org.elasticsearch.plugin.analysis.arirang.AnalysisArirangPlugin

name=arirang

jvm=true

java.version=1.7

site=false

isolated=true

description=Arirang plugin

version=${project.version}

elasticsearch.version=${elasticsearch.version}

hash=${buildNumber}

timestamp=${timestamp}


▶ For a detailed explanation, see the link below.

https://www.elastic.co/guide/en/elasticsearch/plugins/current/plugin-authors.html#_plugin_descriptor_file


[Changes when writing a plugin - two]

- The AbstractPlugin base class is gone; plugins now extend Plugin instead.

From.

public class AnalysisArirangPlugin extends AbstractPlugin {...}


To.

public class AnalysisArirangPlugin extends Plugin {...}


Beyond that, the only remaining changes are the arirang API changes below.


[Arirang changes]

- KoreanAnalyzer used to take a lucene version argument; it no longer does.

From.

analyzer = new KoreanAnalyzer(Lucene.VERSION.LUCENE_47);


To.

analyzer = new KoreanAnalyzer();


- KoreanTokenizer used to take a reader argument; it no longer does.

From.

return new KoreanTokenizer(reader);


To.

return new KoreanTokenizer();


[Changes for 2.0]

- assemblies/plugin.xml was modified.

The layout was changed so that you can drop the zip into the plugins folder, unzip it, and have it work right away.


<?xml version="1.0"?>
<assembly>
  <id>plugin</id>
  <formats>
      <format>zip</format>
  </formats>
  <includeBaseDirectory>false</includeBaseDirectory>

  <files>
    <file>
      <source>lib/arirang.lucene-analyzer-5.0-1.0.0.jar</source>
      <outputDirectory>analysis-arirang</outputDirectory>
    </file>
    <file>
      <source>lib/arirang-morph-1.0.0.jar</source>
      <outputDirectory>analysis-arirang</outputDirectory>
    </file>
    <file>
      <source>target/elasticsearch-analysis-arirang-1.0.0.jar</source>
      <outputDirectory>analysis-arirang</outputDirectory>
    </file>
    <file>
      <source>${basedir}/src/main/resources/plugin-descriptor.properties</source>
      <outputDirectory>analysis-arirang</outputDirectory>
      <filtered>true</filtered>
    </file>
  </files>
</assembly>

The code is straightforward, so it should be easy to follow.

It bundles the required jar files and the properties file into a folder named analysis-arirang.

For the <filtered>true</filtered> option, see the link below. (It controls whether the file goes through Maven resource filtering.)

https://maven.apache.org/plugins/maven-assembly-plugin/assembly.html#class_file


If plugin-descriptor.properties is not included here, elasticsearch raises an error at startup and refuses to run.

That is the part to watch out for.


- Error message when plugin-descriptor.properties is missing


[2015-11-04 12:34:14,522][INFO ][node                     ] [Lady Jacqueline Falsworth Crichton] initializing ...

Exception in thread "main" java.lang.IllegalStateException: Unable to initialize plugins
Likely root cause: java.nio.file.NoSuchFileException: /Users/hwjeong/server/app/elasticsearch/elasticsearch-2.0.0/plugins/analysis-arirang/plugin-descriptor.properties
	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
	at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
	at java.nio.file.Files.newByteChannel(Files.java:315)
	at java.nio.file.Files.newByteChannel(Files.java:361)
	at java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:380)
	at java.nio.file.Files.newInputStream(Files.java:106)
	at org.elasticsearch.plugins.PluginInfo.readFromProperties(PluginInfo.java:86)
	at org.elasticsearch.plugins.PluginsService.getPluginBundles(PluginsService.java:306)
	at org.elasticsearch.plugins.PluginsService.<init>(PluginsService.java:112)
	at org.elasticsearch.node.Node.<init>(Node.java:144)
	at org.elasticsearch.node.NodeBuilder.build(NodeBuilder.java:145)
	at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:170)
	at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:270)
	at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:35)
Refer to the log for complete error details.


- Test code added

These are not tests meant to drive up coverage or anything like that. ;;


▶ ArirangAnalysisTest.java


This test checks that the plugin is actually registered as a module in elasticsearch and that the registered module's service can be retrieved.

It is copied verbatim from the code under plugins in the elasticsearch source tree.


▶ ArirangAnalyzerTest.java


This test was written to map the _analyze REST API to the corresponding index.analysis settings in code.

It should make it a little easier to understand how the analyzer, tokenizer, and token filter behave.


※ Elasticsearch test suite issue - self-answered(?)

The current master branch works fine.

The 2.0 branch, however, hits the error below (or others like it).

I recommend just checking out master for testing.


※ Elasticsearch test suite issue.

This may well be my own mistake, so if you have solved it, please share.


▶ Error

/Library/Java/JavaVirtualMachines/jdk1.7.0_55.jdk/Contents/Home/bin/java -ea -Didea.launcher.port=7533 "-Didea.launcher.bin.path=/Applications/IntelliJ IDEA 14 CE.app/Contents/bin" -Dfile.encoding=UTF-8 -classpath "/Applications/IntelliJ IDEA 14 CE.app/Contents/lib/idea_rt.jar:/Applications/IntelliJ IDEA 14 CE.app/Contents/plugins/junit/lib/junit-rt.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_55.jdk/Contents/Home/lib/dt.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_55.jdk/Contents/Home/lib/javafx-doclet.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_55.jdk/Contents/Home/lib/javafx-mx.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_55.jdk/Contents/Home/lib/jconsole.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_55.jdk/Contents/Home/lib/sa-jdi.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_55.jdk/Contents/Home/lib/tools.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_55.jdk/Contents/Home/jre/lib/deploy.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_55.jdk/Contents/Home/jre/lib/htmlconverter.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_55.jdk/Contents/Home/jre/lib/javaws.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_55.jdk/Contents/Home/jre/lib/jfxrt.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_55.jdk/Contents/Home/jre/lib/management-agent.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_55.jdk/Contents/Home/jre/lib/plugin.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_55.jdk/Contents/Home/jre/lib/ext/dnsns.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_55.jdk/Contents/Home/jre/lib/ext/localedata.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_55.jdk/Contents/Home/jre/lib/ext/sunec.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_55.jdk/Contents/Home/jre/lib/ext/sunjce_provider.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_55.jdk/Contents/Home/jre/lib/ext/sunpkcs11.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_55.jdk/Contents/Home/jre/lib/ext/zipfs.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_55.jdk/Contents/Home/jre/lib/resources.jar:/Library/Java/JavaVirtualMac
hines/jdk1.7.0_55.jdk/Contents/Home/lib/ant-javafx.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_55.jdk/Contents/Home/jre/lib/charsets.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_55.jdk/Contents/Home/jre/lib/jce.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_55.jdk/Contents/Home/jre/lib/jfr.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_55.jdk/Contents/Home/jre/lib/jsse.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_55.jdk/Contents/Home/jre/lib/rt.jar:/Users/hwjeong/git/elasticsearch-analysis-arirang/target/test-classes:/Users/hwjeong/git/elasticsearch-analysis-arirang/target/classes:/Users/hwjeong/.m2/repository/org/apache/lucene/lucene-core/5.2.1/lucene-core-5.2.1.jar:/Users/hwjeong/.m2/repository/org/elasticsearch/elasticsearch/2.0.0/elasticsearch-2.0.0.jar:/Users/hwjeong/.m2/repository/org/apache/lucene/lucene-backward-codecs/5.2.1/lucene-backward-codecs-5.2.1.jar:/Users/hwjeong/.m2/repository/org/apache/lucene/lucene-analyzers-common/5.2.1/lucene-analyzers-common-5.2.1.jar:/Users/hwjeong/.m2/repository/org/apache/lucene/lucene-queries/5.2.1/lucene-queries-5.2.1.jar:/Users/hwjeong/.m2/repository/org/apache/lucene/lucene-memory/5.2.1/lucene-memory-5.2.1.jar:/Users/hwjeong/.m2/repository/org/apache/lucene/lucene-highlighter/5.2.1/lucene-highlighter-5.2.1.jar:/Users/hwjeong/.m2/repository/org/apache/lucene/lucene-queryparser/5.2.1/lucene-queryparser-5.2.1.jar:/Users/hwjeong/.m2/repository/org/apache/lucene/lucene-sandbox/5.2.1/lucene-sandbox-5.2.1.jar:/Users/hwjeong/.m2/repository/org/apache/lucene/lucene-suggest/5.2.1/lucene-suggest-5.2.1.jar:/Users/hwjeong/.m2/repository/org/apache/lucene/lucene-misc/5.2.1/lucene-misc-5.2.1.jar:/Users/hwjeong/.m2/repository/org/apache/lucene/lucene-join/5.2.1/lucene-join-5.2.1.jar:/Users/hwjeong/.m2/repository/org/apache/lucene/lucene-grouping/5.2.1/lucene-grouping-5.2.1.jar:/Users/hwjeong/.m2/repository/org/apache/lucene/lucene-spatial/5.2.1/lucene-spatial-5.2.1.jar:/Users/hwjeong/.m2/repository/com/spatial4j/spatial4j/0.4.1
/spatial4j-0.4.1.jar:/Users/hwjeong/.m2/repository/com/google/guava/guava/18.0/guava-18.0.jar:/Users/hwjeong/.m2/repository/com/carrotsearch/hppc/0.7.1/hppc-0.7.1.jar:/Users/hwjeong/.m2/repository/joda-time/joda-time/2.8.2/joda-time-2.8.2.jar:/Users/hwjeong/.m2/repository/org/joda/joda-convert/1.2/joda-convert-1.2.jar:/Users/hwjeong/.m2/repository/com/fasterxml/jackson/core/jackson-core/2.5.3/jackson-core-2.5.3.jar:/Users/hwjeong/.m2/repository/com/fasterxml/jackson/dataformat/jackson-dataformat-smile/2.5.3/jackson-dataformat-smile-2.5.3.jar:/Users/hwjeong/.m2/repository/com/fasterxml/jackson/dataformat/jackson-dataformat-yaml/2.5.3/jackson-dataformat-yaml-2.5.3.jar:/Users/hwjeong/.m2/repository/org/yaml/snakeyaml/1.12/snakeyaml-1.12.jar:/Users/hwjeong/.m2/repository/com/fasterxml/jackson/dataformat/jackson-dataformat-cbor/2.5.3/jackson-dataformat-cbor-2.5.3.jar:/Users/hwjeong/.m2/repository/io/netty/netty/3.10.5.Final/netty-3.10.5.Final.jar:/Users/hwjeong/.m2/repository/com/ning/compress-lzf/1.0.2/compress-lzf-1.0.2.jar:/Users/hwjeong/.m2/repository/com/tdunning/t-digest/3.0/t-digest-3.0.jar:/Users/hwjeong/.m2/repository/org/hdrhistogram/HdrHistogram/2.1.6/HdrHistogram-2.1.6.jar:/Users/hwjeong/.m2/repository/commons-cli/commons-cli/1.3.1/commons-cli-1.3.1.jar:/Users/hwjeong/.m2/repository/com/twitter/jsr166e/1.1.0/jsr166e-1.1.0.jar:/Users/hwjeong/.m2/repository/log4j/log4j/1.2.16/log4j-1.2.16.jar:/Users/hwjeong/.m2/repository/org/slf4j/slf4j-api/1.6.2/slf4j-api-1.6.2.jar:/Users/hwjeong/.m2/repository/org/slf4j/slf4j-log4j12/1.6.2/slf4j-log4j12-1.6.2.jar:/Users/hwjeong/git/elasticsearch-analysis-arirang/lib/arirang-morph-1.0.0.jar:/Users/hwjeong/git/elasticsearch-analysis-arirang/lib/arirang.lucene-analyzer-5.0-1.0.0.jar:/Users/hwjeong/.m2/repository/junit/junit/4.11/junit-4.11.jar:/Users/hwjeong/.m2/repository/org/hamcrest/hamcrest-core/1.3/hamcrest-core-1.3.jar:/Users/hwjeong/.m2/repository/com/carrotsearch/randomizedtesting/randomizedtesting-runner/2.1.16/randomi
zedtesting-runner-2.1.16.jar:/Users/hwjeong/.m2/repository/org/hamcrest/hamcrest-all/1.3/hamcrest-all-1.3.jar:/Users/hwjeong/.m2/repository/org/apache/lucene/lucene-test-framework/5.2.1/lucene-test-framework-5.2.1.jar:/Users/hwjeong/.m2/repository/org/apache/lucene/lucene-codecs/5.2.1/lucene-codecs-5.2.1.jar:/Users/hwjeong/.m2/repository/org/apache/ant/ant/1.8.2/ant-1.8.2.jar:/Users/hwjeong/.m2/repository/org/elasticsearch/elasticsearch/2.0.0/elasticsearch-2.0.0-tests.jar:/Users/hwjeong/.m2/repository/net/java/dev/jna/jna/4.1.0/jna-4.1.0.jar" com.intellij.rt.execution.application.AppMain com.intellij.rt.execution.junit.JUnitStarter -ideVersion5 org.elasticsearch.index.analysis.ArirangAnalysisTest,testArirangAnalysis

log4j:WARN No appenders could be found for logger (org.elasticsearch.bootstrap).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

java.lang.RuntimeException: found jar hell in test classpath
	at org.elasticsearch.bootstrap.BootstrapForTesting.<clinit>(BootstrapForTesting.java:63)
	at org.elasticsearch.test.ESTestCase.<clinit>(ESTestCase.java:106)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:270)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$1.run(RandomizedRunner.java:573)
Caused by: java.lang.IllegalStateException: jar hell!
class: org.hamcrest.BaseDescription
jar1: /Users/hwjeong/.m2/repository/org/hamcrest/hamcrest-core/1.3/hamcrest-core-1.3.jar
jar2: /Users/hwjeong/.m2/repository/org/hamcrest/hamcrest-all/1.3/hamcrest-all-1.3.jar
	at org.elasticsearch.bootstrap.JarHell.checkClass(JarHell.java:267)
	at org.elasticsearch.bootstrap.JarHell.checkJarHell(JarHell.java:185)
	at org.elasticsearch.bootstrap.JarHell.checkJarHell(JarHell.java:86)
	at org.elasticsearch.bootstrap.BootstrapForTesting.<clinit>(BootstrapForTesting.java:61)
	... 4 more



[Review] Modeling data for fast aggregations - on Elastic's Blog

Elastic/Elasticsearch 2015. 10. 30. 11:43

A personal review of a post on the elastic blog, written up mainly to organize my own notes.


[Original]

https://www.elastic.co/blog/modeling-data-for-fast-aggregations


The title and the link alone probably tell you what it is about:

"modeling data for faster aggregations"

An appealing title at first glance.


What the post proposes is simple and clear.

Pre-define, on each document, attributes for your query and aggregation conditions, and thereby reduce the number of aggregation operations.

The post shows six aggregation operations being reduced to two by setting up such attributes.


Naturally, you have to separate out the query and aggregation conditions at index time and set the attributes on each document as you index.

In other words, the drawback is immediately obvious, and the post says so as well:

if the conditions change, you have to reindex.


As you well know, I have yet to see an architecture or a model that satisfies every requirement.

As always, you will not find the answer without researching, experimenting with, and applying a composition and model that meet your requirements.


To summarize:

by pre-computing attributes onto each document, you can reduce the number of aggregations and make them run faster.
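As a toy sketch of that pre-compute idea (the price_band field and the thresholds are my invention, not the blog post's): derive one attribute at index time so that a single terms aggregation can replace several range or filter aggregations at query time.

```python
# Index-time pre-compute: tag each document with a derived attribute so
# that one terms aggregation replaces several range aggregations.
# The price_band field name and thresholds are hypothetical.
def with_price_band(doc):
    price = doc["price"]
    band = "low" if price < 10000 else "mid" if price < 100000 else "high"
    return {**doc, "price_band": band}

docs = [with_price_band(d) for d in
        [{"price": 500}, {"price": 50000}, {"price": 250000}]]
print([d["price_band"] for d in docs])  # ['low', 'mid', 'high']

# Query time: a single aggregation over the pre-computed field.
agg = {"aggs": {"by_band": {"terms": {"field": "price_band"}}}}
```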



[Filebeat] Shall we give it a quick try?

Elastic/Beats 2015. 10. 27. 15:13
filebeat has been released.
▶ elastic blog : https://www.elastic.co/blog/weekly-beats-first-filebeat-release

It is not GA yet (still beta4), but it is a meaningful release, which is presumably why it was announced.

Here we will walk lightly through an FEL (Filebeat + Elasticsearch + Logstash) setup that collects and indexes the file logs under /var/log.


I could build a Kibana dashboard myself, but out of laziness I will just skip that.

elastic provides sample dashboard data, which is worth a look.

(As of 2015-10-27, filebeat is not yet included in it.)


▶ elastic reference : https://www.elastic.co/guide/en/beats/libbeat/current/getting-started.html#load-kibana-dashboards


curl -L -O http://download.elastic.co/beats/dashboards/beats-dashboards-1.0.0-beta4.tar.gz

tar xzvf beats-dashboards-1.0.0-beta4.tar.gz

cd beats-dashboards-1.0.0-beta4/

./load.sh


[FEL Architecture]

The basic architecture is well covered in the elastic docs.


[What is Filebeat?]

filebeat is built on top of logstash forwarder.

It is installed as an agent on each node, tails log directories or specific log files, and ships the lines to elasticsearch for indexing.


Reference docs)

* "logstash-forwarder" : https://github.com/elastic/logstash-forwarder

* "libbeat platform" : https://www.elastic.co/guide/en/beats/libbeat/current/index.html


[Installing Filebeat]

※ My dev machine is a macbook, so the steps below are written for mac.


Step 1) Download and unpack.

download link : https://www.elastic.co/downloads/beats/filebeat

$ tar -xvzf filebeat-1.0.0-beta4-darwin.tgz

$ cd filebeat-1.0.0-beta4-darwin

$ vi filebeat.yml


Step 2) Configure filebeat.yml
 filebeat configure : https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-configuration-details.html
Like elasticsearch, filebeat works fine with the defaults if you are unsure.
The values you do need to set are "paths", "log", "elasticsearch", and "logstash".

...snip...
      paths:
        - /var/log/*.log

      type: log
...snip...
output:

  ### Elasticsearch as output
  elasticsearch:

    # Set to true to enable elasticsearch output
    enabled: false

...snip...

  logstash:

    # Uncomment out this option if you want to output to Logstash. The default is false.
    enabled: true

    # The Logstash hosts
    hosts: ["localhost:5044"]

...snip...


※ Setting elasticsearch.enabled: false here is what gives us the F -> L -> E pipeline.


 filebeat logstash output configure : https://www.elastic.co/guide/en/beats/libbeat/master/configuration.html#logstash-output


Step 3) Set up the dynamic template

If you have used logstash, you will already know what this is for.

Briefly: via a mapping registered for a given index pattern, it pre-defines the characteristics of the fields that dynamic mapping would otherwise create on the fly.


$ curl -XPUT 'http://localhost:9200/_template/filebeat?pretty' -d@filebeat.template.json

※ filebeat.template.json ships in the directory you unpacked.


Step 4) Run filebeat

※ Start elasticsearch and logstash first, then run the command below.


$ sudo ./filebeat -e -c filebeat.yml -d "publish"


[Logstash setup]

  1. Set up a logstash instance to receive the filebeat data.
  2. logstash 1.5.4 or later
  3. Install the beats plugin
    $ bin/plugin install logstash-input-beats

[Creating a logstash config for Filebeat]

The settings below are covered in detail in the libbeat reference.

 libbeat reference : https://www.elastic.co/guide/en/beats/libbeat/current/getting-started.html

input {
  beats {
    port => 5044
  }
}

output {
  elasticsearch {
    host => "localhost"
    port => "9200"
    protocol => "http"
    index => "%{[@metadata][index]}"
    document_type => "%{[@metadata][type]}"
  }
}


These are the commands I ran on my macbook:


logstash) bin/logstash -f conf/filebeat.config

filebeat) sudo ./filebeat -e -c filebeat.yml -d "publish"

kibana) bin/kibana

elasticsearch) bin/elasticsearch


To summarize briefly:


1. Install and run filebeat on each server you want to collect from.

2. Run logstash with the beats input and point its output at elasticsearch for indexing.

3. Build a Kibana dashboard on top of the logs stored in elasticsearch.



[Elasticsearch] Fielddata+Webinar+IRC Q&A ...

Elastic/Elasticsearch 2015. 10. 22. 11:41

This is a capture of the chat from an Elastic webinar.

There were some good operations-related Q&A exchanges, so I am posting them.

The transcript below is hard to read as-is, so I am also attaching the document elastic provided.


Fielddata Webinar IRC .docx



kestenb

if we are using ELK for logging but only need slow 1-5 s loads of data, how can we minimize costs?  Right now it is 2k /month per project in servers which is too much.  Mostly due to the large ram requirements of ES.

elasticguest2489

do you allow memory swap?

jbferland

As in if you reduce allowed memory consumption in the JVM, queries fail?

izo

@kestenb : what's the size of your data ? ie: daily index size

peterkimnyc

@kestenb are you using doc values?

mta59066

How to setup a cluster on WAN? What would you suggest for somebody who is used to something like MySQL Master/Master replication, where there is a queue, eventually servers will get consistent, don’t worry about short network failures, use both ends for reads and writes.

mayzak

@mta59066 We will cover that in Q&A, good question

to start though ,we don't support a cluster across the WAN due to latency but there are options today to achieve something like that and more coming in the future

mayzak

@elasticguest2489 That's not up to Elasticsearch, its up to the JVM process and the OS.  It's always bad to swap memory with Java.  What are you trying to do that would make you wonder about that?


MealnieZamora

We are a multi-tenant application with multiple customer account sharing a single ES index. Each account has their own set of fields from the documents that are indexed (which are not known beforehand); therefore we use dynamic mapping. This could result in a mapping explosion. How many fields can an index mapping support? 10,000? 30,000? 

mta59066

@mayzak thanks for the info, obviously a setup where latency on the arrival of the data is not vital 

jpsandiego42

When setting up logstash (and other apps) to talk to the ES cluster, is it helpful to have those apps configured to use a load balancer and/or client-only nodes instead of talking directly to data nodes? 

rastro

MealnieZamora: it will also result in the same field having different mappings, which is bad. ES doesn't like a lot of fields. 

bharsh

load balancer - DNS round robin sufficient or dedicated appliance? 

spuder-450

How can you have multiple logstashes when using kafka? It is a pull based model, so you can't have a load balancer 

elasticguest1440

what is the suggested log shipper when shipping web server logs to elk cluster: install logstash on every web server versus logstash in elk cluster and lumberjack on web servers? 

mayzak

@mta59066 I hear you. Have you considered duplicating the documents on their way in or using Snapshot restore between clusters? 

granted the later is more a Master/Slave type setup 

rastro

elasticguest1440: logstash-forwarder is a nice, lightweight shipper. 

mayzak

FileBeat is also an option now 

MealnieZamora

@rastro what is the magic number for a lot of fields? 

Is there a rule of thumb for max # of fields? 

rastro

MealnieZamora: i think we're over 70,000 and elastic.co nearly fainted. I think ES is fairly OK with it, but K4 just can't cope. 

elasticguest9518

Bharsh: that depends on how sticky the connections are, for replacing secrets etc 

elasticguest1759

On Logstash high-availability: how about putting two logstashes side by side and configuring the log source to send it to both logstash instances? 

pickypg

@rastro K4's Discover screen goes through a deduplication process of all fields. With many, many fields, this can be expensive on the first request 

EugeneG

Does the Master Zone contain all eligible master nodes, even if they aren't currently acting as master nodes? 

Jakau

At what point do you decide to create those dedicated-role Elasticsearch nodes? 


peterkimnyc

@eugeneG Yes 

EugeneG

ok, he just answered my question 

pickypg

@Jakau a good rule of thumb is around 7 nodes, then you should start to separate master and data node functionality 

rastro

pickypg: we had to role back to k3 because k4 doesn't work for that. 

mta59066

@mayzak I'll look into those options 

pickypg

@rastro :( It will get better. They are working on the problem 

kestenb

@izo small daily log size: 200 MB, 

jpsandiego42

We found master's really helped when we were only at 5 nodes 

elasticguest8328

master-slave isn't a very reliable architecture. 

peterkimnyc

@Jakau It really depends on the utilization of the data nodes. I’d argue that even with 3 nodes, if they’re really being hit hard all the time, it would benefit you to have dedicated masters 

rastro

pickypg: yeah, of course. 

elasticguest8328

its also pretty expensive. 

pickypg

@jpsandiego42 Removing the master node from data nodes will remove some overhead, so it will benefit smaller clusters too. 

kestenb

@peterkimnyc mostly defaults yes 

jpsandiego42

yeah, it made a big difference in keeping the cluster available 

pickypg

@kestenb you'll probably benefit from the second part of the webinar about fielddata 

christian__

@MealnieZamora It will depend on your hardware. Large mappings will increase the size of the cluster state, which is distributed across the cluster whenever the mappings change, which could be often in your case. The size will also increase with the number of indices used. 

centran

are 3 master only nodes really needed? if they are only master then there can be only one and since they don't have data you shouldn't have to worry about split brain 

elasticguest3231

what OS's is shield tested on with Kibana? (i've failed on OSX and Arch) 

izo

@kestenb: what's your setup like ? Cluster ? Single box ? Running in AWS? or on Found ? 

pickypg

@centran If you don't use 3, then you lose high availability. Using three allows any one of them to drop without impacting your cluster's availability 

elasticmarx77

@centran: with one dedicated master you have single point of failure. 

rastro

mayzak: how can filebeat be a replacement when the project says, "Documentation: coming..." ? 

elasticguest6519

So one would have 3 master on the side that talk to each other in their config file to bring the cluster up. Both the client and data node would have those 3 master in their config to join the cluster. Logstash would be sending the log as an output to the data node or the client node ? 

pickypg

@leasticguest3231 I have had Kibana working on my Mac pretty consistently 

christian__

@centran 3 is needed in order for two of them to be able to determine that they are in majority in case the master node dies 

pickypg

with shield that is 

Jakau

How is that warm data node configured? Can you move old (7+ days) over to them easily? 

centran

I realize that... we use VMs and only 2 SANs so if a bigger datacenter issue occurs it doesn't matter cause it would knock out 2 anyway 

elasticmarx77

@Jakau: yes, you can. also have a look at curator which helps automating index management. 

pickypg

@Jakau Yes. You can use shard allocation awareness to move shards to where they need to be with little effort 

+djschny

@Jakau - yes you can use the shard filtering functionality to accomplish that 

michaltaborsky

I hear often (even here) "elastic does not like many fields:. But are there any tip to improve performance in case you just need many fields? In our case it's tens of thousands fields, sparsely populated, fairly small dataset (few gigabytes), complex queries and faceting. 

christian__

@Jakau You use rack awareness and tag nodes in the different zones. You can then have ES move indices by changing index settings 

jmferrerm

@leasticguest3231 docker container works with Debian. I tested it with Ubuntu and CentOs. 

pickypg

@centran If you're fine with the single point of failure, then a single master node is fine 

mattnrel

Anyone running multiple ES nodes as separate processes on the same hardware? 

rastro

michaltaborsky: maybe run one node and use K3? :( 

pickypg

@mattnrel People do that, but it's not common 

elasticguest8116

this may have been asked , but how does the master node count requirement option work, if you have an aws multiaz setup , and you loose the zone with the current master ? 

elasticguest2489

@michaltaborsky 

You should use object mapping with flexible keys and values 

centran

well there are two masters 

JD is now known as Guest6267 

kestenb

@izo running a 3 node cluster as containers with 4 GB ram on m4.2x ssd in AWS 

mattnrel

For instance i have spinning and ssd drives - could use 1 ES process for hot zone, 1 ES process for warm zone? 

centran

but never had the current master fail or shut it down so don't know if the second master will take over 

mattnrel

@pickypg any downside to multiple processes on same hardware? 

+djschny

@mattnrel - there is nothing stopping you from doing that, however it comes at the cost of maintenance and the two processes having contention with one another 

jpsandiego42

We're running multiple nodes on hardware needed to deal with JVM 32g limits, but haven't tried for difference zones. 

Jakau

Will common steps of performing performance tests to identify bottlenecks on your own setup be covered at all? 

michaltaborsky

@elasticguest2489 What are flexible keys and values? 

+djschny

@jpsandiego42 - are you leveraging doc values? 

pickypg

@mattnrel If you misconfigure something, then replicas will end up on the same node. You need to set the "processors" setting as well to properly split up the number of cores. And if the box goes down, so do all of those nodes 

mattnrel

another usecase for multiple processes - one for master node, one for data? 

christian__

@centran If you have 2 masters, the second should not be able to take over if the master dies. If it can, you run the risk of a split-brain scenario in case you suffer a network partition. This is why 3 master-eligible nodes are recommended 

jpsandiego42

yeah, had to put in extra config to ensure host awareness and halving the # of processors, etc 

mattnrel

@pickypg yeah i've spotted the config setting for assuring data is replicated properly when running multiple instances on same server 

elasticguest6519

In the setup shown, logstash would send his data as an output to the client or to the data node ? 

jpsandiego42

not using doc values today 

Crickes

does shifting data from hot to warm nodes require re-indexing? 

elasticmarx77

@Crickes: no 

christian__

@Crickes No. 

German23

@Crickes no just adjusting the routing tag 

+djschny

@jpsandiego42 - doc values should reduce your heap enough that you shouldn't need to run more than one node on a single host 

elasticguest2489

@michaltaborsky Object type mapping with 2 fields called key and value. Depending on the nature of your data this might avoid the sparseness and enhance performance 

+djschny

@mattnrel - generally speaking you are always better off following the gold rule of each box only runs one process (whether that be a web app, mysql, etc.) 

peterkimnyc

@Crickes No but there’s a great new feature in ES2.0 that would make you want to run an _optimize after migration to warm nodes to compress the older data at a higher compression level. 

izo

@kestenb: and those 3 containers cost you 2k a month ? 

elasticguest4713

Is there a general rule to improve performance on heavy load of aggregation and faced queries? Adding more nodes and more RAM? 

jpsandiego42

@djschny - most of our issues come from not doing enough to improve mappings/analyzed and our fielddata getting too big. 

elasticguest2489

Good question... 

michaltaborsky

@elasticguest2489 I don't think this would work for us, like I wrote, we use quite complex queries and facets 

peterkimnyc

@Crickes [warning: blatant self-promotion] I wrote a blog post about that feature recently. https://www.elastic.co/blog/elasticsearch-storage-the-true-story-2.0 

Crickes

i thought you can't change the index config once it's created, so how do you modify a tag on an index that might have several thousand records in it already? 

peterkimnyc

@Crickes There are many dynamic index config settings 

+djschny

@Crickles indexes have static and dynamic settings. the tagging is a dynamic one (similar to number of replica shards) 

Crickes

@peterkimnyc Thanks, I'll have a look at that 

peterkimnyc

You’re probably thinking of the number_of_shards config, which is not dynamic 

alanhrdy

@Crickes time series index are normally created each day. Each day you can change the settings :) 

elasticguest2489

@michaltaborsky 

If you have too many fields this often reflects a bad mapping... but it's hard to tell without knowing the use case... 

elasticmarx77

@Crickes: have a look at https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-update-settings.html 

+inqueue

clickable for the first bullet: https://www.elastic.co/blog/support-in-the-wild-my-biggest-elasticsearch-problem-at-scale 

michaltaborsky

The use case is product database. Different products have different parameters (fields). T-shirts have size and color, fabric... mobile phones have color, operating system, memory size, ... There are thousands of different product categories, hundreds or thousands products in each. 

mattnrel

With indexes having same mapping - better to have more/smaller indexes (say per day), or have fewer/larger indexes (say per week) - esp in terms of fielddata 

mattnrel

Very relevant talk to my current situation (OOM on fielddata)! Thanks for this. 

centran

should be called field data life saver 

jpsandiego42

=) 

MealnieZamora

is there a rule of thumb for how may indices you should have per cluster? 

centran

fielddata bit me in the butt too, but it was coupled with setting heap size to 32g which is too close... going down to 30g made my cluster much happier 

mattnrel

Would REALLY be nice to have a shortcut method to enable doc_values after the fact - even just a method to rebuild the entire index on the fly 

MrRobi

Are "doc values" the same as Lucene TermVectors? 

rastro

MealnieZamora: the more indexes/shards, the more overhead in ES. For us, it's been a heap management issue. 

michaltaborsky

+1 on a simple way to reindex an index 

mattnrel

@MrRobi doc values are the same as Lucene DocValues 

+djschny

@centran - correct, if your heap is above 30GB then the JVM can no longer use compressed pointers; this results in larger GC times and less usable heap memory 

rastro

daily indexes and templates FTW. 

jpsandiego42

=) 

elasticguest9087 is now known as setaou 

spuder-450

@MelnieZamora I've heard anecdotally to keep your indexes between 200 - 300 

rastro

doc_values saved us like 80%+ of heap. 

MealnieZamora

are doc values applicable to system fields like _all 

mattnrel

@rastro wow. doing much aggregation/sorting? 

elasticguest3231

+1 on re-indexing 

christian__

@MwalnieZamora No, it only works for fields that are not_analyzed 

centran

@djschny - yep... at the time I think the elastic doc was mentioning the 32g problem but didn't say that the problem can pop up between 30-32. took researching java memory management on other sites to discover a heap size of 32 is a bad idea and playing with fire 

c4urself

so we should set circuit breaker to 5-10% AFTER enabling doc values? 

rastro

mattnrel: most of our queries are aggregation, as we're building dashboards and generating alerts (by host, etc). 

+djschny

@MealnieZamora - there is no magic number here. it depends upon, number of nodes, machine sizes, size of docs, mappings, requirements around indexing rate, search rate, etc. 

mattnrel

@rastro same here so good to know your success w/ docvalues 

elasticguest3231

not_analyzed should be configurable as default option for strings 

+djschny

@MealnieZamora - best best is to run tests 

centran

@c4urself he said he recommends that after you think you got them all so it will trip and you can find anything you missed 

mattnrel

@rastro same performance under doc values? (obviously is better that you aren't filling your heap and possibly crashing nodes...) 

rastro

elasticguest3231: i use templates for that (all field types, actually). 

c4urself

centran: ok, thanks for the clarification 

rastro

mattnrel: the doc says there's a performance penalty, but I can say that a running cluster is more performant than a crashed cluster. 

+djschny

@centran - do you happen to have the link to the elastic doc mentioning 32GB? If so would like to correct it. 

centran

I think it was fixed but not sure... I can look 

rastro

centran: all the doc i found says "less than 32GB", but doesn't explain the boundary condition. 

centran

I know when I was reading up it was on the old site 

mattnrel

" I can say that a running cluster is more performant than a crashed cluster. " so true! 

elasticguest3231

@rastro - yeah, we wrote datatype conversion scripts to handle it; still seems like you should be able to set it at the index level rather than per field 

mattnrel

with same mappings - generally better to run more/smaller indexes (daily) or fewer/larger indexes (weekly)? 

rastro

djschny: "when you have 32GB or more heap space..." https://www.elastic.co/blog/found-elasticsearch-in-production 

yxxxxxxy

We need to have case-insensitive sort. So we analyze strings to lowercase them. Does that mean we can't use doc_values? 

centran

@djschny https://www.elastic.co/guide/en/elasticsearch/guide/current/_limiting_memory_usage.html 

Shawn

@yxxxxxxy - https://github.com/elastic/elasticsearch/issues/11901 

avielastic

Can I get the recording of this webnar ? I joined late 

christian__

@mattnrel You do not want too many small shards as each shard carries a bit of overhead, so the decision between daily and weekly indices often depend on data volumes 

pickypg

Recording will be posted later 

elasticguest5827

Is there any rule to find an optimal size of shard e.g. shard to heap ratio? 

elasticguest7305

If I'm using just a lowercase string analyzer (not tokenizing it). Does that work with Doc_Values? Or, do we need to duplicate before we bulk insert the record? 

elasticguest2745

Is the circuit breaker for the total cluster or just for that node? 

rastro

elasticguest3231: the template says "any string in this index...", which feels like index-level, right? 

centran

@djschny they talk about the limit but should probably be explicit that it needs to be set lower to be in the safe zone 

c4urself

what are some scaling problems that happen regularly AFTER enabling doc values (that is, not field data-related problems)? 

+djschny

@centran - I will patch the documents and add that for sure. 

setaou

In ES 1.x, we have a parameter for the Range Filter allowing to use fielddata. In our use case it gives more performance than the other setting (index), and more perfs than the Range Query. In ES 2.0, filters are no more, so what about the performance of the Range Query, which works without field data ? 

+djschny

@centran - Thanks for the link 

mattnrel

@elasticguest2745 per node 

elasticguest2745

thanks 

avielastic

what are the best possible ways to change the datatype of a field of an existing Index without re-indexing ? Will multi-field or dynamic mapping help 

rbastian

Would doc values improve nested aggregation performance or only help with stability due to less heap? 

Crickes

its the mechanism for ageing the index without using curator I'm interested in finding out. How do you manually move an index from a hot node, to a warm node? 

elasticguest2745

We are seeing that the field data cache isnt getting evicted when it hits the set limit. how can we make sure it gets cleared? 

jmferrerm

elmanytas 

Crickes

I think the anser in buried in https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-update-settings.html 

dt_ken

I know the website says you do not recommend the G1GC for elastic but we've found it is much faster and seems completely stable. Is there still fear in using G1GC? 

jbferland

If you're on the latest java 8 releases, I think G1GC is ok now despite the warnings. 

doctorcal

Huh? 

michaltaborsky

@dt_ken We use G1GC for a while, for us it is also more stable. 

doctorcal

What your data model is 

jbferland

There were historical cases of corruption but there have been bug fixes. Risk / reward and dart boards at this point. 

+djschny

you can either run 3 master nodes (one in each AZ) 

elasticguest2399

When indexes/shards are moved from hot to warm nodes, are the segments in the shards coalesced together? Or is index optimization still needed? 

+djschny

or you can put the master node in a Cloud Formation template, so that if it goes down, the CF will spin up another one in another zone 

Jakau

So I'm looking at ~35GB a day, 4 log types, and then indexing the events into ~4 indexes a piece that all have the same alias for querying across them. The separate indexes are due to different retentions. Any issues with this? We'd be looking at keeping 90 days worth of logs live 

elasticguest8116

ok so use a 3rd az just for a master node 

avielastic

whats the advantage of having dedicated master vs Master-data nodes? 

mattnrel

How much heap is recommended for master-only node? (Vs 1/2 of ram < 32G general recommendation) 

+djschny

@elasticguest2399 - shard relocation copies segments exactly, byte for byte. After that is finished, segment merging then happens independent of the node where things were copied from 

christian__

@Jakau You may want to reduce the shard count from the default of 5 in order to reduce the number of shard generated per day 

elasticguest6947

Do you have a lightweight master-quorum arbiter daemon, similar to Percona's arbiter, to deal with a 2-master scenario? 

elasticguest8116

thank you 

elasticguest2399

@+djschny: Thank you 

pickypg

@elasticguest6947 not at this time 

MIggy282

yes 

elasticguest6947

@pickypg thanks 

MIggy282

you're correct 

+djschny

Generally speaking when using log data, you don't need a distributed queue like Kafka 

Jakau

@christian__ What should it be reduced to? My thoughts right now were 1 shard per node. We're looking at starting with 3 nodes 

yxxxxxxy

how many replicas can ES reasonably handle? 

elasticguest3231

@rastro - oh, index templates - hadn't understood their use case... are you using to configure better geo_point handling? 

spuder-450

I thought elasticsearch clusters shouldn't span geo locations 

jpsandiego42

cool. I like that. 

Jakau

What's the recommended procedure for performance testing an ELK stack? I've largely seen JMeter for testing query performance 

elasticguest9203 is now known as Prabin 

rastro

elasticguest3231: i think we have a template that takes any field that ends in ".foo" and makes it a geo_point. 

Prabin

is there a way to merge two indices? 

elasticguest7305

If I'm using just a lowercase string analyzer (not tokenizing it). Does that work with Doc_Values? Or, do we need to duplicate (and lowercase) before we bulk insert the record? 

yxxxxxxy

@Prabin you can create an alias over the two indices and search against the alias 

Crickes

could you use a tribal node to join 2 geographical seperate clusters? 

jwieringa

Thanks! 

jpsandiego42

Thanks! 

elasticguest2489

Thx 

elasticguest9430

Upgrading webinar https://www.elastic.co/webinars/upgrading-elasticsearch 

elasticguest2433

Thanks 

elasticguest3231

many thanks - might solve a lot of headaches for us 

elasticguest8687

this has been one of the most useful webinars on elasticsearch I have seen. Thanks!! 

Prabin

@yxxxxxxy alias is definitely an option but with time the number of indices is going to increase, so want to merge them so that search happens on fewer index 

pickypg

@elasticguest7305 Unfortunately not yet. 

rastro

Crickes: i hope so, because we're moving in that direction with some new clusters. 

Jakau

Yes, this was an excellent webinar, thank you 

pickypg

@Crickles Yes 

bharsh

excellent presentation guys... gives me lots to look at 

pickypg

@elasticguest7305 https://github.com/elastic/elasticsearch/issues/12394 <- this will be the solution to that 

elasticguest8687

I see some questions about the number of indices, and my question might be the same (I didn't see the start of this thread). Is it ok to have hundreds of indices when the total data size is around 100GB? 

centran

agreed. good presentation. great knowledge for those who have been getting ELK going and are now realizing the mess they got themselves into 

pickypg

@elasticguest8687 So the sum of all the indices is 100 GB? You probably want to reduce the number of indices because that's less than 1 GB per index 

rastro

centran: lol 

pickypg

There's nothing wrong with that per se, but it _sounds_ wasteful 

The impact would be: a lot of shards to search through (a lot of threads) and a bloated cluster state (from extra indices) 

Crickes

thanks everyone 

chadwiki

@crickles Make sure you have unique Index name, example - region1_index1 and region2_index1 

elasticguest8687

it has more to do with the requirements for the over all application. I'll rethink the strategy, but I guess what I really want to know is if the searches will be slow or not if you have that many indices. 

pickypg

@elasticguest8687 It kind of depends on how you're searching. Are you searching a single index or all of them with a single request? 

centran

I thought I was overkilling it with indexes, especially because we have rolling ones, but then I discovered the awesomeness of setting up proper index patterns in kibana... holy crap, what a speed difference. having lots of fields is what sucks in my opinion 

elasticguest8687

it many cases it would be searching across many (or most) of the indices 

so would document types be a better approach than using many indices? 

pickypg

@centran Yeah. That is being worked on (for real), but it's not a simple problem (quickly deduping) 

@elasticguest8687 Do the indexes have the same mappings? 

and, if so, why/how are they separated? 

elasticguest8687

not necessarily (one of the reason using multiple indices came up as a solution). The idea was to have different fields between indices and search across a common field if you need to. 

pickypg

If the mappings are different, then definitely do not use different types. Types are literally just a special filter added for you at the expense of bloating your mapping. If you _can_ and _want_ to use types, then simply create an extra field and name it "type" (or whatever you want), then filter on that manually. It will limit the bloat better. 

pickypg

As for the rest: if your index is not greater than 1 GB, then it had better only have 1 shard (there are exceptions, but in general...) 

primary shard that is 

elasticguest8687

ok. thanks for the info. very helpful. 

pickypg

The downside to having a ton of indices for search is that each shard needs to be searched and the results need to be federated/combined by the originating requester node (an advantage of a client node). As such, each index needs to route all requests to all of their shards. This means that if you search 100 shards, then you have 100 threads working _across your cluster_. 

Individually they're probably going to be very quick, but the request is only as good as the weakest/slowest shard, which is _probably_ going to be impacted by the slowest node 

elasticguest8687

actually I guess I don't have a good idea of how big the index will be. but my guess is it will be more than 1 GB. 

pickypg

Also, less obvious, if you have too many shards in the request (e.g., using 5 primary shards unnecessarily), then you will run into blocked requests because of too many threads 

How much more? 

elasticguest8687

well, the data itself (files to be indexed) total to about 100GB. Most of the files are pdfs, so I plan to extract the text from those. 

pickypg

Text is tiny by comparison, so it's really quite hard to say what will come out of them 

elasticguest8687

right 

pickypg

https://www.elastic.co/blog/elasticsearch-storage-the-true-story-2.0 

Good, relevant blog post 

elasticguest8687

thanks 

pickypg

@elasticguest8687 You can also bring this up on the discuss.elastic.co forums, but my strong recommendation would be to combine indices that share the same mapping (using a separate field to represent type as described above) and deal with the quantity of shards as it happens. In my experience, it's quite good at it -- I was dealing with an issue where a user was running an aggregation across 450 shards without issues stemming from that (there were different issues), but eventually the added parallelism does itself incur a cost 

pickypg

and that cost is two fold: 1. the federated search must combine results to find the actual relevant results (top 10 from 5 shards requires up to 50 comparisons at the federated level) 2. the number of threads is a bottleneck 

elasticguest8687

ok. Yeah, i think i need to go back to the drawing board and think about this some more. 

pickypg

Also take a look at our book chapter on "Life Inside a Cluster" https://www.elastic.co/guide/en/elasticsearch/guide/current/distributed-cluster.html 

The book's free and great. The next three chapters are also highly relevant, as is sorting and relevance 

oh and this is #2 from my above comment: https://www.elastic.co/guide/en/elasticsearch/guide/current/distributed-search.html 

elasticguest8687

awesome! thanks, again. this has been very helpful. 

pickypg

Good luck 

mattnrel

thanks again to Elastic for the great preso 





:

[Kibana] Things to watch out for when using a Kibana monitoring dashboard - Search Thread.

Elastic/Kibana 2015. 10. 1. 13:56

I know many of you are using ELK for metric collection, system monitoring, and the like.

Many of you have probably already run into this, but I'm writing it up anyway for the sake of sharing.


Typically you collect data with ELK and then build a dashboard in Kibana to watch the metrics.

As you probably know, Kibana's default index pattern is logstash-*, so queries run against every index.

Because of this, performance degrades over time and errors start to appear.


As you may know, every action in Elasticsearch is executed on a thread.

So if you leave a Kibana dashboard open with auto-refresh enabled, search requests keep firing at every refresh interval.


As an example, assume each index has 5 shards (replica 0).

30 daily indices have been created so far, i.e. 30 indices.

That makes the total shard count 5 x 30 = 150.


Now assume you build a Kibana dashboard that shows 8 visualizations on one screen.

In that case, loading the dashboard runs a total of 8 queries.

How many search threads does Elasticsearch execute for those 8 queries?

8 x 5 x 30 = 1,200 search threads.


What kind of problem can this cause?


Elasticsearch lets you adjust the search thread count through the threadpool settings.

Below is a code snippet from ThreadPool.java:


defaultExecutorTypeSettings = ImmutableMap.<String, Settings>builder()

    ....

    .put(Names.SEARCH, settingsBuilder().put("type", "fixed").put("size", ((availableProcessors * 3) / 2) + 1).put("queue_size", 1000).build())

    ....


As you can see above, the default runnable size is ((availableProcessors * 3) / 2) + 1,

and the queue_size is fixed at 1000.

Assuming a 4-core CPU:

runnable thread size = ((4 x 3) / 2) + 1 = 7

queue thread size = 1000
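The arithmetic above can be sketched as a quick back-of-the-envelope calculation. The numbers used (8 panels, 5 shards per index, 30 indices, 4 cores) are just the example figures from this post, not recommendations:

```python
# Back-of-the-envelope: search load from one dashboard refresh
# versus a single node's default search threadpool capacity.
panels = 8            # visualizations on the dashboard
shards_per_index = 5  # primaries per index (replica 0)
indices = 30          # e.g. 30 daily logstash-* indices

# Each panel's query fans out one shard-level search to every shard.
shard_searches_per_refresh = panels * shards_per_index * indices
print(shard_searches_per_refresh)  # 1200

# Default fixed search pool size from ThreadPool.java, queue_size 1000.
available_processors = 4
pool_size = (available_processors * 3) // 2 + 1
print(pool_size)  # 7
```

With 7 runnable threads and a queue of 1000 per node, a couple of dashboard loads in quick succession are already enough to overflow the queue.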


Every time the 8-panel dashboard is loaded, 1,200 search requests are fired; if some system resource runs short at that moment, another application running an aggregation query against the same Elasticsearch cluster may even get back incorrect results.


When this actually happens, you will see messages like the following in the Elasticsearch error log:


[2015-09-29 00:08:40,896][DEBUG][action.search.type       ] [Madame Masque] [....][7], node[zXJSZ4IYS2KwPhj190hhEQ], [P], s[STARTED]: Failed

 to execute [org.elasticsearch.action.search.SearchRequest@542e2e00] lastShard [true]

org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution (queue capacity 1000) on org.elasticsearch.search.action.SearchServiceTransp

ortAction$23@76a71853

at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:62)

at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)

at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)

at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:79)

...... (snip) ......

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:745)


The error message says the request was rejected because the queue capacity of 1000 was exceeded.


The fixes are simple:

- Simplify the Kibana dashboard, and only keep it open when you actually need to look at it.

- Turn off Kibana's auto-refresh, or use a longer refresh interval.

- Increase threadpool.search.queue_size.
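For the last option, the setting can be placed in elasticsearch.yml (on the 1.x/2.x versions this post targets, threadpool settings could also be changed at runtime via the cluster settings API). The value 2000 below is only an illustrative number, not a recommendation; each queued request holds some memory, so don't raise it blindly:

```yaml
# elasticsearch.yml - enlarge the search queue so bursty dashboard
# refreshes get queued instead of rejected (at the cost of some heap)
threadpool.search.queue_size: 2000
```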


You may or may not ever run into this yourself, but I'm sharing it because knowing about it should help when you operate a cluster.

: