Posts in the 'Elastic/Elasticsearch' category (page 21)

[Elasticsearch] JDBC River: tips for the newly added REST API!!

Elastic/Elasticsearch 2014. 11. 13. 12:57

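This is one of the good things about open source, but it is also a slight inconvenience.

When there is no usage guide or reference anywhere, you are stuck with the grunt work of reading the source code!!

Heh, well, as long as it is fun, that is good enough for me ^^

While testing JDBC River, I tried an API that was added in October 2014 and ran into the error below. I only figured out how to use it after checking the source code.

[Error]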

{"error":"NullPointerException[null]","status":500}

[Added REST APIs]

_state

_suspend

_abort

_resume

_run

[Tested versions]

elasticsearch 1.3.4

elasticsearch jdbc river 1.3.4.4

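[How to send the request]

$ curl -d "{\"rivername\":\"my_jdbc_river\"}" -XPOST http://localhost:9200/_river/jdbc/{REST_API}

- First of all, _state is a GET. Don't send it as a POST and then complain that it doesn't work.

- This is not in the documentation: you have to build a parameter named rivername and pass it along.

- It has to be sent as JSON.

For further details, see the link below.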

Ref. https://github.com/jprante/elasticsearch-river-jdbc
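To make the three points above concrete, here is a minimal Java sketch that only builds the pieces of the request (HTTP method, URL, JSON body) and sends nothing. The class and method names are made up for illustration. The assumption that every endpoint other than _state takes POST comes from the curl example above, not from the river's documentation, and the post does not say whether _state also needs the body.

```java
import java.util.Arrays;
import java.util.List;

// Sketch only: builds (but does not send) the request described in the post.
class RiverRequestSketch {
    // Endpoint names listed in the post.
    static final List<String> ENDPOINTS =
            Arrays.asList("_state", "_suspend", "_abort", "_resume", "_run");

    // Per the post, _state is a GET; the curl example uses POST (assumed here for the rest).
    static String methodFor(String endpoint) {
        if (!ENDPOINTS.contains(endpoint)) {
            throw new IllegalArgumentException("unknown endpoint: " + endpoint);
        }
        return "_state".equals(endpoint) ? "GET" : "POST";
    }

    // URL layout taken from the curl example; host and port are the post's local values.
    static String urlFor(String endpoint) {
        return "http://localhost:9200/_river/jdbc/" + endpoint;
    }

    // JSON body carrying the "rivername" parameter, with minimal string escaping.
    static String bodyFor(String riverName) {
        String escaped = riverName.replace("\\", "\\\\").replace("\"", "\\\"");
        return "{\"rivername\":\"" + escaped + "\"}";
    }

    public static void main(String[] args) {
        for (String endpoint : ENDPOINTS) {
            System.out.println(methodFor(endpoint) + " " + urlFor(endpoint)
                    + " " + bodyFor("my_jdbc_river"));
        }
    }
}
```

Running main prints one line per endpoint, which makes it easy to compare against the curl command before pointing any real HTTP client at a node.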

Attribution, NonCommercial, NoDerivatives

[Elasticsearch] scale up limitation.

Elastic/Elasticsearch 2014. 11. 13. 11:28

Keep this in mind when considering scaling up elasticsearch.

However powerful the hardware is, it is pointless if the resources cannot be put to good use.

- 32 GB of RAM or less

- 32 cores or fewer


What is a JDBC River strategy?

Elastic/Elasticsearch 2014. 11. 12. 11:41

https://github.com/jprante/elasticsearch-river-jdbc

So far, two strategy options are provided.

1) simple

2) column

simple just fetches the rows and performs the indexing or delete operations,

while column records information about the last run and compares values against it to decide what to index or delete.
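As a rough illustration of the "remember the last run and compare" idea behind column, here is a small Java sketch. It is not the river's code: the timestamp parameters, the Action enum, and the decision rule are all assumptions made up for this example, so check the project's README for how the strategy really behaves.

```java
// Hypothetical sketch of a "compare against the last run" decision; not the river's implementation.
class ColumnStrategySketch {
    enum Action { INDEX, DELETE, SKIP }

    // Assumed rule: a row deleted after the last run is removed from the index,
    // a row updated after the last run is (re)indexed, anything else is left alone.
    static Action decide(long lastRunMillis, long updatedAtMillis, Long deletedAtMillis) {
        if (deletedAtMillis != null && deletedAtMillis > lastRunMillis) {
            return Action.DELETE;
        }
        if (updatedAtMillis > lastRunMillis) {
            return Action.INDEX;
        }
        return Action.SKIP;
    }

    public static void main(String[] args) {
        long lastRun = 1000L;
        System.out.println(decide(lastRun, 1500L, null));  // updated after the last run
        System.out.println(decide(lastRun, 500L, 1200L));  // deleted after the last run
        System.out.println(decide(lastRun, 500L, null));   // unchanged since the last run
    }
}
```

By contrast, simple would correspond to processing every fetched row on every run, with no stored state to compare against.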


[Analyzer] Morphological analyzers.

Elastic/Elasticsearch 2014. 11. 11. 16:43

[Morphological analyzers]

1) Lucene built-in standard/cjk (source code available)

https://lucene.apache.org/core/4_10_0/analyzers-common/org/apache/lucene/analysis/standard/StandardAnalyzer.html

2) Lucene arirang (source code available)

https://lucenekorean.svn.sourceforge.net/svnroot/lucenekorean/

3) mecab (jar available)

https://bitbucket.org/eunjeon/mecab-ko

4) twitter korean text (source code available)

https://github.com/twitter/twitter-korean-text

5) komoran (jar available)

http://shineware.tistory.com/entry/KOMORAN-ver-23


[Lucene] 4.9.0 analyzer & tokenizer....

Elastic/Elasticsearch 2014. 11. 5. 13:05

http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/analysis/package-summary.html

https://lucene.apache.org/core/4_9_0/core/org/apache/lucene/analysis/TokenStream.html

    Version matchVersion = Version.LUCENE_XY; // Substitute desired Lucene version for XY
    Analyzer analyzer = new StandardAnalyzer(matchVersion); // or any other analyzer
    TokenStream ts = analyzer.tokenStream("myfield", new StringReader("some text goes here"));
    OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
    
    try {
      ts.reset(); // Resets this stream to the beginning. (Required)
      while (ts.incrementToken()) {
        // Use AttributeSource.reflectAsString(boolean)
        // for token stream debugging.
        System.out.println("token: " + ts.reflectAsString(true));

        System.out.println("token start offset: " + offsetAtt.startOffset());
        System.out.println("  token end offset: " + offsetAtt.endOffset());
      }
      ts.end();   // Perform end-of-stream operations, e.g. set the final offset.
    } finally {
      ts.close(); // Release resources associated with this stream.
    }

The workflow of the new TokenStream API is as follows:

1) Instantiation of TokenStream/TokenFilters which add/get attributes to/from the AttributeSource.
2) The consumer calls reset().
3) The consumer retrieves attributes from the stream and stores local references to all attributes it wants to access.
4) The consumer calls incrementToken() until it returns false, consuming the attributes after each call.
5) The consumer calls end() so that any end-of-stream operations can be performed.
6) The consumer calls close() to release any resource when finished using the TokenStream.

Some things have changed from previous versions, so make sure to check. :)


[ElasticSearch] How is IDF computed when calculating _score?

Elastic/Elasticsearch 2014. 10. 30. 11:53

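Yesterday I gave a talk at our company event titled "Open-source search engine deployment case study".

One of the questions I received was rather interesting, so I am sharing it here.

(My impression was that the person asking had either never used elasticsearch or had only recently started with it, but had a lot of experience with lucene. Heh, that is just my impression, so I could be wrong.)

The question went something like this:

- When elasticsearch indexes documents, it must be hard to use a global IDF value, so how is IDF actually handled?

The answer is in the link below. ^^

(It is computed per shard.)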

http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/relevance-intro.html

TF is simply the term frequency, so that part should be no problem.

(By default, similarity is per field.)

So what about IDF????

Simply put, IDF is derived from the document frequency: out of all the documents in the index, how many contain the term.

In lucene this is obviously not a problem, but es has the concept of shards. A single index is split into several shards that live on different nodes, so how can IDF be computed....

My view is simple.

- If you look only at the per-shard idf value, a document's relevance could be off. But a document's score is computed from tf, idf, field length norm, and various other factors together, so I consider it normalized. In other words, the score can be trusted.
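To get a feel for how much per-shard IDF can differ from a global value, here is a small Java sketch. It uses the classic Lucene TF/IDF formula idf = 1 + ln(numDocs / (docFreq + 1)); the shard sizes and document frequencies are invented numbers for illustration only, and the class name is hypothetical.

```java
// Sketch: compares IDF computed on individual shards with IDF computed over the whole index.
class ShardIdfSketch {
    // Classic Lucene TF/IDF inverse document frequency.
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static void main(String[] args) {
        // Made-up numbers: two shards of 1000 documents each, term unevenly distributed.
        long docsPerShard = 1000;
        long docFreqShardA = 9;
        long docFreqShardB = 99;

        double idfA = idf(docFreqShardA, docsPerShard);
        double idfB = idf(docFreqShardB, docsPerShard);
        double idfGlobal = idf(docFreqShardA + docFreqShardB, 2 * docsPerShard);

        System.out.println("idf on shard A: " + idfA);
        System.out.println("idf on shard B: " + idfB);
        System.out.println("idf over both:  " + idfGlobal);
    }
}
```

With a deliberately skewed distribution like this one the two shards disagree noticeably. When documents are spread evenly across shards, the per-shard document frequencies, and therefore the IDF values, end up close to each other.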

Heh, it was fun to get a search question after such a long time.


[ElasticSearch] bitset...

Elastic/Elasticsearch 2014. 10. 17. 18:25

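I only posted this on Facebook and never on my own blog.. lol

Just in passing...

I assume you already understand and use queries vs filters when optimizing query performance, ^^;

The material here also covers bitsets, so I am tossing out the link for reference. ^^

Saying that filters are fast does not really make sense without understanding bitsets.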

https://lucene.apache.org/core/4_10_1/core/org/apache/lucene/util/OpenBitSet.html

Put simply, when a term filter runs, the bit for each document that matches the term is set to 1 in a bitset, and later requests only have to look at that bitset to return the documents.
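The idea can be sketched with the JDK's own java.util.BitSet rather than Lucene's OpenBitSet. This is only a toy model with made-up documents and a hypothetical class name: it builds one bitset per term and then combines two of them with a bitwise AND, which is roughly why reusing filters is cheap.

```java
import java.util.BitSet;

// Toy model of filter bitsets: one bit per document, set when the document matches.
class FilterBitSetSketch {
    // Scans the documents once and marks bit docId when the document contains the term.
    static BitSet buildFilter(String[] docs, String term) {
        BitSet bits = new BitSet(docs.length);
        for (int docId = 0; docId < docs.length; docId++) {
            for (String token : docs[docId].split("\\s+")) {
                if (token.equals(term)) {
                    bits.set(docId);
                    break;
                }
            }
        }
        return bits;
    }

    // Combines two already-built filters without looking at the documents again.
    static BitSet and(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.and(b);
        return result;
    }

    public static void main(String[] args) {
        String[] docs = {"red apple", "green apple", "red car", "blue sky"};
        BitSet red = buildFilter(docs, "red");
        BitSet apple = buildFilter(docs, "apple");
        System.out.println("red:       " + red);
        System.out.println("apple:     " + apple);
        System.out.println("red&apple: " + and(red, apple));
    }
}
```

Once the red and apple bitsets exist, answering "red AND apple" touches only the bits, not the text of the documents.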


[ElasticSearch] GC tuning notes.

Elastic/Elasticsearch 2014. 10. 2. 11:32

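Someone asked me about gc, and I promised to share what I have.

So here it is.

First of all, there is no single correct set of gc option values.

They need ongoing care.

For elasticsearch, the things that mainly affect gc, and that you should look at, are the following.

- segment merge policy

- query types and data size for search requests

- field and filter cache management

- types and data size of facets and aggregations

- use of jdk 7 or later

- and so on..

The gc options I mostly use are as follows.

(Come to think of it, I believe I have shared these before as well.)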

-server

-XX:+AggressiveOpts

-XX:+UseCompressedOops

-XX:MaxDirectMemorySize=<size>

-XX:+UseParNewGC

-XX:+UseConcMarkSweepGC

-XX:+CMSParallelRemarkEnabled

-XX:CMSInitiatingOccupancyFraction=75

-XX:+UseCMSInitiatingOccupancyOnly

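-XX:+UseG1GC (an alternative to the ParNew/CMS flags above, not an addition: the JVM will not start with both collectors enabled)

G1 is available in JDK 7, but it is not the default collector there, so it has to be enabled explicitly with this flag.

※ http://www.oracle.com/technetwork/java/javase/tech/g1-intro-jsp-135488.html

As you probably know, from 1.2.0 onward elasticsearch requires JDK 7 or later.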
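Since the options need ongoing attention, it helps to confirm which collectors a JVM actually ended up with and how busy they are. The sketch below uses the standard java.lang.management API. Note that it inspects the JVM that runs this small program, not an elasticsearch node, and the class name is made up for illustration.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Lists the garbage collectors active in the current JVM, with their counters.
class GcInspector {
    // Returns the names of the collectors the JVM registered at startup.
    static List<String> collectorNames() {
        List<String> names = new ArrayList<String>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Collection count and accumulated collection time since JVM start.
            System.out.println(gc.getName() + ": collections=" + gc.getCollectionCount()
                    + ", timeMs=" + gc.getCollectionTime());
        }
    }
}
```

Starting this program with different flags from the list above is a quick way to see which collector combination a given set of options selects.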

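gc tuning is not a problem that JVM settings alone can solve.

It is just as important to implement the client application so that it does not cause problems.

As a simple example, if a request pulls in and analyzes more data than the memory configured for the elasticsearch node, it simply goes OOM.

Worse, the node itself can die.

Below are some gc-related links. (I looked these up a while ago.)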

https://gist.github.com/mrflip/5366376

http://jprante.github.io/2012/11/28/Elasticsearch-Java-Virtual-Machine-settings-explained.html

https://www.found.no/foundation/elasticsearch-in-production/

http://helloworld.naver.com/helloworld/1329

http://eediom.com/logpresso-cli-araqne-core/

http://dimdim.tistory.com/entry/Java-GC-%ED%83%80%EC%9E%85-%EB%B0%8F-%EC%84%A4%EC%A0%95-%EC%A0%95%EB%B3%B4-%EC%A0%95%EB%A6%AC

http://wiki.ex-em.com/index.php/JVM_Options


[ElasticSearch] master node election...

Elastic/Elasticsearch 2014. 9. 2. 11:00

Jotting this down roughly, since I will probably forget it again..

By default, zen discovery is used to form the cluster and for communication between nodes.

So how is a new master elected when the master node dies?

When es starts up, the zen discovery module is registered as well.

As ZenDiscovery is bound, MasterFaultDetection and ElectMasterService are registered too.

MasterFaultDetection watches the master node, and when it detects a failure, ElectMasterService elects a new master node.

Easy, right? ^^;

ZenDiscovery does more than this, but here I only looked at master election.

ZenDiscovery

↓

MasterFaultDetection

↓

ElectMasterService
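The flow above can be mimicked with a toy Java sketch. Nothing here is elasticsearch code: the class and method names are hypothetical, and both the "lowest node id wins" rule and the minimum-candidates check are assumptions chosen to keep the example short, so read the ElectMasterService source for the real rules.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model of "fault detected, then elect": not the actual zen discovery implementation.
class MasterElectionSketch {
    // Assumed rule for this sketch: sort the candidates and pick the smallest id.
    // Returns null when there are too few candidates to hold an election.
    static String elect(List<String> candidateIds, int minimumMasterNodes) {
        if (candidateIds.isEmpty() || candidateIds.size() < minimumMasterNodes) {
            return null;
        }
        List<String> sorted = new ArrayList<String>(candidateIds);
        Collections.sort(sorted);
        return sorted.get(0);
    }

    // Stands in for the fault detector reporting that the current master is gone.
    static String onMasterFailure(String failedMaster, List<String> nodes, int minimumMasterNodes) {
        List<String> remaining = new ArrayList<String>(nodes);
        remaining.remove(failedMaster);
        return elect(remaining, minimumMasterNodes);
    }

    public static void main(String[] args) {
        List<String> nodes = new ArrayList<String>();
        nodes.add("node-c");
        nodes.add("node-a");
        nodes.add("node-b");
        System.out.println("new master: " + onMasterFailure("node-a", nodes, 2));
    }
}
```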


[ElasticSearch] Hash Partition test

Elastic/Elasticsearch 2014. 8. 27. 14:15

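Occasionally, indexed documents pile up on one particular shard.

When that happens, you need to check how the key behind the _id value is composed.

You can run a quick test using the hash function that es uses internally.

[Test code]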

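    // Package names as found in elasticsearch 1.x; adjust them to your version.
    import org.elasticsearch.cluster.routing.operation.hash.HashFunction;
    import org.elasticsearch.cluster.routing.operation.hash.djb.DjbHashFunction;
    import org.elasticsearch.common.math.MathUtils;
    import org.junit.Test;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class EsHashPartitionTest {
        private static final Logger log = LoggerFactory.getLogger(EsHashPartitionTest.class);
        private final HashFunction hashFunction = new DjbHashFunction();

        @Test
        public void testHashPartition() {
            int shardSize = 120;
            // partSize[n] counts the ids routed to shard n (Java zero-initializes the array).
            long[] partSize = new long[shardSize];

            // Route one million sequential ids: hash the routing value, then mod by shard count.
            for (int i = 0; i < 1000000; i++) {
                int shardId = MathUtils.mod(hash(String.valueOf(i)), shardSize);
                partSize[shardId]++;
            }

            // Print the number of documents that landed on each shard.
            for (int i = 0; i < shardSize; i++) {
                log.debug("[{}] {}", i, partSize[i]);
            }
        }

        public int hash(String routing) {
            return hashFunction.hash(routing);
        }
    }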

[Original hash function code]

    /**
     * This class implements the efficient hash function
     * developed by <i>Daniel J. Bernstein</i>.
     */
    public class DjbHashFunction implements HashFunction {

        public static int DJB_HASH(String value) {
            long hash = 5381;
            for (int i = 0; i < value.length(); i++) {
                hash = ((hash << 5) + hash) + value.charAt(i);
            }
            return (int) hash;
        }

        public static int DJB_HASH(byte[] value, int offset, int length) {
            long hash = 5381;
            final int end = offset + length;
            for (int i = offset; i < end; i++) {
                hash = ((hash << 5) + hash) + value[i];
            }
            return (int) hash;
        }

        @Override
        public int hash(String routing) {
            return DJB_HASH(routing);
        }

        @Override
        public int hash(String type, String id) {
            long hash = 5381;
            for (int i = 0; i < type.length(); i++) {
                hash = ((hash << 5) + hash) + type.charAt(i);
            }
            for (int i = 0; i < id.length(); i++) {
                hash = ((hash << 5) + hash) + id.charAt(i);
            }
            return (int) hash;
        }
    }

- https://github.com/elasticsearch/elasticsearch/tree/master/src/main/java/org/elasticsearch/cluster/routing/operation/hash

For more details on this, see the link below.

http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/routing-value.html
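If you want to try the distribution check without elasticsearch on the classpath, the sketch below inlines the DJB arithmetic quoted above. Math.floorMod stands in for MathUtils.mod on the assumption that the latter returns a non-negative remainder. The class and method names are made up for this example.

```java
// Self-contained variant of the shard distribution test; no elasticsearch dependency.
class DjbShardDemo {
    // Same arithmetic as DJB_HASH(String): start at 5381, then hash * 33 + char.
    static int djbHash(String value) {
        long hash = 5381;
        for (int i = 0; i < value.length(); i++) {
            hash = ((hash << 5) + hash) + value.charAt(i);
        }
        return (int) hash;
    }

    // Maps a routing value to a shard number in the range [0, shardCount).
    static int shardFor(String routing, int shardCount) {
        return Math.floorMod(djbHash(routing), shardCount);
    }

    // Counts how many of the ids 0..idCount-1 land on each shard.
    static long[] distribute(int idCount, int shardCount) {
        long[] perShard = new long[shardCount];
        for (int i = 0; i < idCount; i++) {
            perShard[shardFor(String.valueOf(i), shardCount)]++;
        }
        return perShard;
    }

    public static void main(String[] args) {
        long[] perShard = distribute(1000000, 120);
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (long count : perShard) {
            min = Math.min(min, count);
            max = Math.max(max, count);
        }
        // A large gap between min and max points at a skewed key pattern.
        System.out.println("min per shard: " + min + ", max per shard: " + max);
    }
}
```

Swapping String.valueOf(i) for your real _id pattern (for example a prefix plus a date) shows quickly whether that key composition is what sends documents to the same shard.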


jjeong
