[Elasticsearch] ElasticON'16 - Stories from Support

Elastic/Elasticsearch 2016. 3. 11. 15:36

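This appears to be material presented by a support engineer at ElasticON '16.

It contains points worth keeping in mind when using Elasticsearch, so I am recording them here.

Some of you may already know these, but I suspect many more do not.

Personally, I'm pleased that it includes the evidence behind what I presented at Deview a while back. :)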


Talk title)

Stories from Support: Top Problems and Solutions


Attached document)


thursday-mark-w-chris-e-stories-from-support0problems-solutions-stage-c.pdf


Content snippets)

Doc Value Caveats

• Analyzed strings do not currently support doc_values, which means that you must avoid using such fields for sorting, aggregating, and scripting

• Analyzed strings are generally tokenized into multiple terms, which means that there is an array of values

• With few exceptions (e.g., significant terms), aggregating against analyzed strings is not doing what you want

• Unless you want the individual tokens, scripting is largely not useful

• Big improvement coming in ES 2.3 (“keyword” field)
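
A common way to live with these caveats is to index the same string twice: once analyzed for full-text search, and once not analyzed with doc values for sorting and aggregating. Below is a minimal sketch of such a mapping in the ES 1.x/2.x string syntax, built as plain Python dicts. The field name `title` and the sub-field name `raw` are assumptions chosen only for illustration.

```python
import json


def string_with_raw_subfield():
    """Mapping for one string field: analyzed for search, plus a
    not_analyzed sub-field that is backed by doc values."""
    return {
        "type": "string",
        "index": "analyzed",
        "fields": {
            "raw": {
                "type": "string",
                "index": "not_analyzed",
                "doc_values": True,
            }
        },
    }


# Hypothetical field name "title".
mapping = {"properties": {"title": string_with_raw_subfield()}}

# Search on the analyzed field, but sort and aggregate on the sub-field.
search_body = {
    "query": {"match": {"title": "elasticsearch support"}},
    "sort": [{"title.raw": "asc"}],
    "aggs": {"titles": {"terms": {"field": "title.raw"}}},
}

print(json.dumps(mapping, indent=2))
```

With this layout the analyzed parent never has to be loaded for sorting or aggregating; only `title.raw` is.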


Do You Know Where Your Shards Are At Night

• Elasticsearch 1.X defaults to 5 primary, 1 replica

• Elasticsearch 2.0 defaults to 2 primary, 1 replica

• Increase primaries for higher write throughput and to spread load

• 50GB is the rule of thumb max size for a primary shard. More for recovery than performance

• Replicas are not backups. Rarely see a benefit with more than 1
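
The 50GB rule of thumb can be turned into a quick sizing calculation. This is only a sketch of the arithmetic: the helper names and the 180GB example are my own assumptions, and the result is a starting point rather than a recommendation from the talk.

```python
import math

MAX_PRIMARY_SHARD_GB = 50  # rule-of-thumb ceiling quoted in the slides


def primaries_needed(expected_index_gb, max_shard_gb=MAX_PRIMARY_SHARD_GB):
    """Smallest primary count that keeps each primary at or under max_shard_gb."""
    if expected_index_gb <= 0:
        return 1
    return max(1, math.ceil(expected_index_gb / max_shard_gb))


def index_settings(expected_index_gb, replicas=1):
    """Index creation settings; one replica, since more rarely helps."""
    return {
        "settings": {
            "number_of_shards": primaries_needed(expected_index_gb),
            "number_of_replicas": replicas,
        }
    }


# Hypothetical index expected to grow to about 180GB.
print(index_settings(180))
```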


Queries

• Deep pagination

  • ES 2.0 has a soft limit on 10K hits per request. Linearly more expensive per shard

  • Use scan and/or scroll API

• Leading wildcards

  • Equivalent to a full table scan (bad)

• Scripting

  • Without parameters

  • Dynamically (inline)

  • Unnecessary filter caching (e.g., exact date ranges down to the millisecond)
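
Three of the points above are easy to illustrate without a cluster. The sketch below shows why deep pagination cost grows with `from + size` on every shard, how truncating date bounds to the minute keeps a range filter identical across requests, and how a script can receive its varying value through `params` (ES 2.x script syntax). The function names, the `price` field, and the example numbers are assumptions for illustration.

```python
from datetime import datetime


def docs_collected(from_, size, shards):
    """Upper bound on hits the coordinating node has to merge for one page:
    each shard returns its own top (from_ + size) hits."""
    return (from_ + size) * shards


def range_filter_rounded_to_minute(field, start, end):
    """Range filter with bounds truncated to the minute, so requests issued
    within the same minute build exactly the same filter."""
    lo = start.replace(second=0, microsecond=0).isoformat()
    hi = end.replace(second=0, microsecond=0).isoformat()
    return {"range": {field: {"gte": lo, "lt": hi}}}


def parameterized_script(factor):
    """Script whose source stays constant; only the params change.
    The doc field 'price' is hypothetical."""
    return {
        "script": {
            "inline": "doc['price'].value * factor",
            "params": {"factor": factor},
        }
    }


# Page starting at hit 10,000 with 10 hits per page on a 5-shard index.
print(docs_collected(10000, 10, 5))
```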


Aggregations

• Cardinality

  • Setting the threshold to 40K (or higher) is memory intensive and generally unnecessary

  • Using in place of search

  • Searching will be faster

  • Enormous sizes

• Requesting large shard sizes (relative to actual size)

  • Linearly more expensive per shard touched

  • Generally unnecessary

• Returning hits when you don’t want them
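
As a rough illustration of these points, here are two request bodies built as Python dicts: a cardinality aggregation with a modest `precision_threshold`, and a terms aggregation with small `size` and `shard_size`. Both set `"size": 0` at the top level so that no hits are returned. The aggregation names and the numeric values are assumptions, not figures from the slides.

```python
def unique_count_body(field, precision_threshold=1000):
    """Approximate distinct count; size 0 means aggregation results only."""
    return {
        "size": 0,
        "aggs": {
            "unique_values": {
                "cardinality": {
                    "field": field,
                    "precision_threshold": precision_threshold,
                }
            }
        },
    }


def top_terms_body(field, size=10, shard_size=50):
    """Top terms with a shard_size that stays close to the requested size."""
    return {
        "size": 0,
        "aggs": {
            "top_terms": {
                "terms": {"field": field, "size": size, "shard_size": shard_size}
            }
        },
    }


# Hypothetical field name "user_id".
print(unique_count_body("user_id"))
```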


Indexing

• Too many shards

  • If your shards are small (define: small as < 5GB) and they outnumber your nodes, then you have too many

• Refreshing too fast

  • This controls “near real time” search

• Merge throttling

  • Disable it on SSDs

  • Make single threaded on HDDs (see node sizing link)

• Not using bulk processing
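
For the bulk and refresh points, a small sketch of what the request payloads look like. The bulk endpoint takes newline-delimited JSON, with an action line followed by a source line for each document. The index name `logs`, the type `event`, and the `30s` interval are assumptions for illustration; sending the payload over HTTP is left out.

```python
import json


def bulk_body(index, doc_type, docs):
    """Newline-delimited bulk payload: one action line plus one source line
    per document, each terminated by a newline."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    return "".join(line + "\n" for line in lines)


def refresh_interval_settings(interval="30s"):
    """Index settings body that slows down refreshes during heavy indexing."""
    return {"index": {"refresh_interval": interval}}


# Hypothetical documents.
payload = bulk_body("logs", "event", [{"msg": "a"}, {"msg": "b"}])
print(payload, end="")
```

One request carrying many documents avoids paying the per-request overhead for every single document.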

