[ElasticSearch] Tuning optimize when indexing large volumes of data
Elastic/Elasticsearch 2014. 8. 4. 18:04
The Definitive Guide on elasticsearch.org has some good material on this, so I'm sharing it here.
It's also the approach I used recently while indexing a large volume of data.
The original article is linked below.
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/inside-a-shard.html
The most important part is shard sizing based on your hardware spec and data size.
I'll share that in a later post. ^^;
Below is a brief summary of how I applied the original article's advice.
[One-line summary]
- Use refresh and flush instead of optimize.
※ It's best to avoid running optimize during bulk data indexing or real-time indexing.
- As the quoted passage below notes, indexing large volumes of data consumes a lot of I/O and CPU resources.
The merging of big segments can use a lot of I/O and CPU, which can hurt search performance if left unchecked. By default, Elasticsearch throttles the merge process so that search still has enough resources available to perform well.
Be aware that merges triggered by the optimize API are not throttled at all. They can consume all of the I/O on your nodes, leaving nothing for search and potentially making your cluster unresponsive. If you plan on optimizing an index, you should use shard allocation (see Appendix A, TODO) to first move the index to a node where it is safe to run.
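The "refresh and flush instead of optimize" point above can be sketched as the following sequence of REST calls. This is only an illustration: the cluster address and the index name `my_index` are assumptions, and the function builds the request payloads as data rather than sending them to a live cluster.

```python
import json

ES = "http://localhost:9200"   # assumed local cluster address
INDEX = "my_index"             # hypothetical index name

def build_bulk_indexing_plan(es=ES, index=INDEX):
    """Return (method, url, body) tuples for an optimize-free bulk load:
    disable refresh before the load, then flush and restore refresh after."""
    return [
        # 1. Turn off automatic refresh while the bulk load runs.
        ("PUT", f"{es}/{index}/_settings",
         json.dumps({"index": {"refresh_interval": "-1"}})),
        # 2. ... run the bulk requests here ...
        # 3. Flush so the transaction log is committed to disk.
        ("POST", f"{es}/{index}/_flush", None),
        # 4. Restore a normal refresh interval so new docs become searchable.
        ("PUT", f"{es}/{index}/_settings",
         json.dumps({"index": {"refresh_interval": "1s"}})),
    ]

for method, url, body in build_bulk_indexing_plan():
    print(method, url, body or "")
```

The key idea is that refresh and flush are cheap, throttled operations, while optimize-triggered merges are not throttled at all, as the quote above warns.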
[Bulk Request Size]
Original) http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/bulk.html
How big is too big?
The entire bulk request needs to be loaded into memory by the node which receives our request, so the bigger the request, the less memory available for other requests. There is an optimal size of bulk request. Above that size, performance no longer improves and may even drop off.
The optimal size, however, is not a fixed number. It depends entirely on your hardware, your document size and complexity, and your indexing and search load. Fortunately, it is easy to find this sweet spot:
Try indexing typical documents in batches of increasing size. When performance starts to drop off, your batch size is too big. A good place to start is with batches of between 1,000 and 5,000 documents or, if your documents are very large, with even smaller batches.
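The "increase the batch size until performance drops" experiment can be sketched like this. Note that `index_batch` is a stand-in for your real `_bulk` call, so this is only the shape of the measurement, not a ready-made benchmark.

```python
import time

def find_bulk_sweet_spot(docs, index_batch, sizes=(1000, 2000, 5000, 10000)):
    """Index the same docs at increasing batch sizes; return the size
    that achieved the best docs/sec throughput."""
    best_size, best_rate = sizes[0], 0.0
    for size in sizes:
        start = time.perf_counter()
        for i in range(0, len(docs), size):
            index_batch(docs[i:i + size])   # replace with your real _bulk call
        elapsed = time.perf_counter() - start
        rate = len(docs) / elapsed if elapsed > 0 else float("inf")
        if rate > best_rate:
            best_size, best_rate = size, rate
    return best_size
```

In practice you would run each size against a real cluster and watch when throughput stops improving, per the guide's advice above.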
It is often useful to keep an eye on the physical size of your bulk requests. One thousand 1kB documents is very different than one thousand 1MB documents. A good bulk size to start playing with is around 5-15MB in size.
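To keep an eye on physical size rather than just document count, you can cut bulk batches by bytes instead of by count. A minimal sketch, assuming JSON-serialized documents and using ~10MB (the middle of the 5-15MB starting range quoted above) as the cap:

```python
import json

MAX_BULK_BYTES = 10 * 1024 * 1024  # ~10MB, middle of the suggested 5-15MB range

def chunk_by_bytes(docs, max_bytes=MAX_BULK_BYTES):
    """Yield lists of docs whose serialized JSON size stays under max_bytes."""
    batch, batch_bytes = [], 0
    for doc in docs:
        size = len(json.dumps(doc).encode("utf-8")) + 1  # +1 for the newline
        if batch and batch_bytes + size > max_bytes:
            yield batch
            batch, batch_bytes = [], 0
        batch.append(doc)
        batch_bytes += size
    if batch:
        yield batch
```

This way a thousand 1MB documents produce many small bulk requests while a thousand 1kB documents fit in one, matching the point the quote makes.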