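This appears to be material presented by support engineers at ElasticON '16.
It contains points worth keeping in mind when using Elasticsearch, so I am recording them here.
Some of you may already know them, but I suspect more readers do not.
Personally, I am pleased that it includes the evidence behind what I presented at Deview some time ago. :)
Talk title)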
Stories from Support: Top Problems and Solutions
Attached document)
thursday-mark-w-chris-e-stories-from-support0problems-solutions-stage-c.pdf
Content snippet)
Doc Value Caveats
• Analyzed strings do not currently support doc_values, which means that you must avoid using such fields for sorting, aggregating, and scripting
• Analyzed strings are generally tokenized into multiple terms, which means that there is an array of values
• With few exceptions (e.g., significant terms), aggregating against analyzed strings is not doing what you want
• Unless you want the individual tokens, scripting is largely not useful
• Big improvement coming in ES 5.0 (“keyword” field)
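
To make the doc values caveat concrete, here is a minimal sketch of the usual workaround: index the string twice, once analyzed for full-text search and once un-analyzed for sorting and aggregating. The field name title, the sub-field name raw, and both helper functions are illustrative assumptions and do not come from the talk. The request bodies follow the ES 2.x string/not_analyzed mapping syntax and the text/keyword syntax of ES 5.x and later.

```python
import json


def multi_field_mapping(field, es_major):
    """Build a mapping whose main field is analyzed and whose sub-field is not.

    Returns (mapping_body, name_of_the_sub_field_to_sort_or_aggregate_on).
    """
    if es_major >= 5:
        # ES 5.x and later: "text" is analyzed, "keyword" is not.
        main = {"type": "text"}
        sub_name, sub = "keyword", {"type": "keyword"}
    else:
        # ES 2.x: a not_analyzed string keeps the whole value as a single term.
        main = {"type": "string"}
        sub_name, sub = "raw", {"type": "string", "index": "not_analyzed"}
    main["fields"] = {sub_name: sub}
    return {"properties": {field: main}}, "{}.{}".format(field, sub_name)


def terms_agg(agg_field, size=10):
    """Search body that aggregates on the un-analyzed sub-field only."""
    return {
        "size": 0,  # no hits wanted, only the buckets
        "aggs": {"top_values": {"terms": {"field": agg_field, "size": size}}},
    }


# "title" is a hypothetical field name used only for illustration.
mapping, agg_field = multi_field_mapping("title", es_major=2)
print(json.dumps(mapping, indent=2))
print(json.dumps(terms_agg(agg_field), indent=2))
```

Full-text queries would still target title, while sorts and aggregations would target the sub-field, so buckets hold whole values rather than individual tokens.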
Do You Know Where Your Shards Are At Night
• Elasticsearch 1.X defaults to 5 primary, 1 replica
• Elasticsearch 2.0 also defaults to 5 primary, 1 replica
• Increase primaries for higher write throughput and to spread load
• 50GB is the rule-of-thumb maximum size for a primary shard, more for the sake of recovery than performance
• Replicas are not backups. Rarely see a benefit with more than 1
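
As a rough illustration of the 50GB rule of thumb, the sketch below derives a primary shard count from an expected index size and builds the settings body for index creation. The helper names and the 120 GB figure are made up for the example. It only covers the size ceiling: a write-heavy index may still want more primaries than this minimum.

```python
import json
import math

MAX_SHARD_GB = 50  # rule-of-thumb ceiling quoted in the slides


def primaries_for(expected_index_gb, max_shard_gb=MAX_SHARD_GB):
    """Smallest primary count that keeps each shard at or under max_shard_gb."""
    return max(1, math.ceil(expected_index_gb / max_shard_gb))


def index_settings(expected_index_gb, replicas=1):
    """Settings body to send when creating the index (PUT /<index>)."""
    return {
        "settings": {
            "number_of_shards": primaries_for(expected_index_gb),
            "number_of_replicas": replicas,
        }
    }


# 120 GB is a hypothetical expected size for the whole index.
print(json.dumps(index_settings(120)))
```

The primary count cannot be changed in place after index creation, which is why it is worth estimating up front.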
Queries
• Deep pagination
  • ES 2.0 has a soft limit of 10K hits per request. Linearly more expensive per shard
  • Use scan and/or scroll API
• Leading wildcards
  • Equivalent to a full table scan (bad)
• Scripting
  • Without parameters
  • Dynamically (inline)
• Unnecessary filter caching (e.g., exact date ranges down to the millisecond)
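
The query pitfalls above can be sketched with three small helpers. The function names and the field names price and @timestamp are assumptions for illustration only. The script body uses the ES 2.x key "inline"; newer versions use "source" instead.

```python
import json


def deep_page_cost(from_, size, shards):
    """Hits each shard must collect, and hits the coordinating node must merge."""
    per_shard = from_ + size
    return per_shard, per_shard * shards


def rounded_range(field, window="1h", unit="m"):
    """Range filter using date math rounded to `unit`, so requests issued
    within the same unit produce an identical, reusable filter."""
    return {
        "range": {
            field: {
                "gte": "now-{}/{}".format(window, unit),
                "lt": "now/{}".format(unit),
            }
        }
    }


def parameterized_script(factor):
    """Script whose source text never changes; only params vary per request."""
    return {
        "script": {
            "inline": "doc['price'].value * factor",  # 'price' is hypothetical
            "params": {"factor": factor},
        }
    }


# Fetching hits 9990-9999 from a 5-shard index.
print(deep_page_cost(9990, 10, 5))
print(json.dumps(rounded_range("@timestamp")))
print(json.dumps(parameterized_script(0.9)))
```

Keeping the script text constant means it only has to be compiled once, whereas embedding the value in the script text produces a new script for every distinct value.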
Aggregations
• Cardinality
  • Setting the threshold to 40K (or higher) is memory intensive and generally unnecessary
• Using in place of search
  • Searching will be faster
• Enormous sizes
  • Requesting large shard sizes (relative to actual size)
  • Linearly more expensive per shard touched
  • Generally unnecessary
• Returning hits when you don’t want them
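
A minimal sketch of an aggregation-only request that follows these points: it asks for no hits, keeps the terms size small, leaves shard_size unset, and uses a modest cardinality precision_threshold. The helper names, the field name user_id, and the value 3000 are assumptions for the example. The memory figure relies on the documented estimate of roughly precision_threshold * 8 bytes per cardinality aggregation.

```python
import json


def cardinality_memory_bytes(precision_threshold):
    """Approximate memory used by one cardinality aggregation instance."""
    return precision_threshold * 8


def agg_only_request(field, precision_threshold=3000, top_n=10):
    """Search body that returns aggregations without any hits."""
    return {
        "size": 0,  # do not return hits when only the aggregations are wanted
        "aggs": {
            "unique_values": {
                "cardinality": {
                    "field": field,
                    "precision_threshold": precision_threshold,
                }
            },
            # Small size, no explicit shard_size: keeps per-shard work low.
            "top_values": {"terms": {"field": field, "size": top_n}},
        },
    }


# "user_id" is a hypothetical field name.
print(json.dumps(agg_only_request("user_id"), indent=2))
print(cardinality_memory_bytes(40000), "bytes at a 40K threshold")
```

When the cardinality aggregation is nested under a terms aggregation, that estimate applies to every bucket, which is where a 40K threshold becomes expensive.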
Indexing
• Too many shards
  • If your shards are small (defining small as < 5GB) and they outnumber your nodes, then you have too many
• Refreshing too fast
  • This controls “near real time” search
• Merge throttling
  • Disable it on SSDs
  • Make single threaded on HDDs (see node sizing link)
• Not using bulk processing
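
For the last two indexing points, here is a minimal sketch that builds a newline-delimited body for the _bulk endpoint and a settings body that relaxes the refresh interval. The helper names, the sample documents, and the 30s value are assumptions for illustration. The sketch assumes the target index (and, on ES 1.x/2.x, the type) is given in the request URL, so the action lines carry only the document id.

```python
import json


def bulk_body(docs):
    """Serialize (doc_id, source) pairs into the body the _bulk endpoint expects:
    one action line followed by one source line per document."""
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_id": doc_id}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # the body must end with a newline


def refresh_settings(interval="30s"):
    """Body for PUT /<index>/_settings; "-1" turns periodic refresh off."""
    return {"index": {"refresh_interval": interval}}


# Made-up documents for illustration.
docs = [("1", {"msg": "first"}), ("2", {"msg": "second"})]
print(bulk_body(docs), end="")
print(json.dumps(refresh_settings()))
```

A longer refresh interval trades search freshness for indexing throughput, so it is typically raised during heavy loads and restored afterwards.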