kestenb
if we are using ELK for logging but only need slow 1-5 s loads of data, how can we minimize costs? Right now it is 2k /month per project in servers which is too much. Mostly due to the large ram requirements of ES.
elasticguest2489
do you allow memory swap?
jbferland
As in if you reduce allowed memory consumption in the JVM, queries fail?
izo
@kestenb : what's the size of your data ? ie: daily index size
peterkimnyc
@kestenb are you using doc values?
mta59066
How to setup a cluster on WAN? What would you suggest for somebody who is used to something like MySQL Master/Master replication, where there is a queue, eventually servers will get consistent, don’t worry about short network failures, use both ends for reads and writes.
mayzak
@mta59066 We will cover that in Q&A, good question
To start though, we don't support a cluster across the WAN due to latency, but there are options today to achieve something like that and more coming in the future
mayzak
@elasticguest2489 That's not up to Elasticsearch, it's up to the JVM process and the OS. It's always bad to swap memory with Java. What are you trying to do that would make you wonder about that?
MealnieZamora
We are a multi-tenant application with multiple customer accounts sharing a single ES index. Each account has its own set of fields from the documents that are indexed (which are not known beforehand); therefore we use dynamic mapping. This could result in a mapping explosion. How many fields can an index mapping support? 10,000? 30,000?
mta59066
@mayzak thanks for the info, obviously a setup where latency on the arrival of the data is not vital
jpsandiego42
When setting up logstash (and other apps) to talk to the ES cluster, is it helpful to have those apps configured to use a load balancer and/or client-only nodes instead of talking directly to data nodes?
rastro
MealnieZamora: it will also result in the same field having different mappings, which is bad. ES doesn't like a lot of fields.
bharsh
load balancer - DNS round robin sufficient or dedicated appliance?
spuder-450
How can you have multiple logstashes when using kafka? It is a pull based model, so you can't have a load balancer
elasticguest1440
what is the suggested log shipper when shipping web server logs to elk cluster: install logstash on every web server versus logstash in elk cluster and lumberjack on web servers?
mayzak
@mta59066 I hear you. Have you considered duplicating the documents on their way in or using Snapshot restore between clusters?
granted, the latter is more of a Master/Slave type setup
rastro
elasticguest1440: logstash-forwarder is a nice, lightweight shipper.
mayzak
FileBeat is also an option now
MealnieZamora
@rastro what is the magic number for a lot of fields?
Is there a rule of thumb for max # of fields?
rastro
MealnieZamora: i think we're over 70,000 and elastic.co nearly fainted. I think ES is fairly OK with it, but K4 just can't cope.
elasticguest9518
Bharsh: that depends on how sticky the connections are, for replacing secrets etc
elasticguest1759
On Logstash high-availability: how about putting two logstashes side by side and configuring the log source to send it to both logstash instances?
pickypg
@rastro K4's Discover screen goes through a deduplication process of all fields. With many, many fields, this can be expensive on the first request
EugeneG
Does the Master Zone contain all eligible master nodes, even if they aren't currently acting as master nodes?
Jakau
At what point do you decide to create those dedicated-role Elasticsearch nodes?
ⓘ ChanServ set mode +v djschny
peterkimnyc
@eugeneG Yes
EugeneG
ok, he just answered my question
pickypg
@Jakau a good rule of thumb is around 7 nodes, then you should start to separate master and data node functionality
rastro
pickypg: we had to roll back to k3 because k4 doesn't work for that.
mta59066
@mayzak I'll look into those options
pickypg
@rastro :( It will get better. They are working on the problem
kestenb
@izo small daily log size: 200 MB,
jpsandiego42
We found master's really helped when we were only at 5 nodes
elasticguest8328
master-slave isn't a very reliable architecture.
peterkimnyc
@Jakau It really depends on the utilization of the data nodes. I’d argue that even with 3 nodes, if they’re really being hit hard all the time, it would benefit you to have dedicated masters
rastro
pickypg: yeah, of course.
elasticguest8328
its also pretty expensive.
pickypg
@jpsandiego42 Removing the master node from data nodes will remove some overhead, so it will benefit smaller clusters too.
kestenb
@peterkimnyc mostly defaults yes
jpsandiego42
yeah, it made a big difference in keeping the cluster available
pickypg
@kestenb you'll probably benefit from the second part of the webinar about fielddata
christian__
@MealnieZamora It will depend on your hardware. Large mappings will increase the size of the cluster state, which is distributed across the cluster whenever the mappings change, which could be often in your case. The size will also increase with the number of indices used.
centran
are 3 master only nodes really needed? if they are only master then there can be only one and since they don't have data you shouldn't have to worry about split brain
elasticguest3231
what OS's is shield tested on with Kibana? (i've failed on OSX and Arch)
izo
@kestenb: what's your setup like ? Cluster ? Single box ? Running in AWS? or on Found ?
pickypg
@centran If you don't use 3, then you lose high availability. Using three allows any one of them to drop without impacting your cluster's availability
elasticmarx77
@centran: with one dedicated master you have single point of failure.
rastro
mayzak: how can filebeat be a replacement when the project says, "Documentation: coming..." ?
elasticguest6519
So one would have 3 master on the side that talk to each other in their config file to bring the cluster up. Both the client and data node would have those 3 master in their config to join the cluster. Logstash would be sending the log as an output to the data node or the client node ?
pickypg
@elasticguest3231 I have had Kibana working on my Mac pretty consistently
christian__
@centran 3 is needed in order for two of them to be able to determine that they are in majority in case the master node dies
pickypg
with shield that is
Jakau
How is that warm data node configured? Can you move old (7+ days) over to them easily?
centran
I realize that... we use VMs and only 2 SANs so if a bigger datacenter issue occurs it doesn't matter cause it would knock out 2 anyway
elasticmarx77
@Jakau: yes, you can. also have a look at curator which helps automating index management.
pickypg
@Jakau Yes. You can use shard allocation awareness to move shards to where they need to be with little effort
+djschny
@Jakau - yes you can use the shard filtering functionality to accomplish that
michaltaborsky
I often hear (even here) "elastic does not like many fields". But are there any tips to improve performance in case you just need many fields? In our case it's tens of thousands of fields, sparsely populated, fairly small dataset (few gigabytes), complex queries and faceting.
christian__
@Jakau You use rack awareness and tag nodes in the different zones. You can then have ES move indices by changing index settings
jmferrerm
@elasticguest3231 docker container works with Debian. I tested it with Ubuntu and CentOS.
pickypg
@centran If you're fine with the single point of failure, then a single master node is fine
mattnrel
Anyone running multiple ES nodes as separate processes on the same hardware?
rastro
michaltaborsky: maybe run one node and use K3? :(
pickypg
@mattnrel People do that, but it's not common
elasticguest8116
this may have been asked, but how does the master node count requirement option work if you have an AWS multi-AZ setup and you lose the zone with the current master?
elasticguest2489
@michaltaborsky
You should use object mapping with flexible keys and values
centran
well there are two masters
ⓘ JD is now known as Guest6267
kestenb
@izo running a 3 node cluster as containers with 4 GB ram on m4.2x ssd in AWS
mattnrel
For instance i have spinning and ssd drives - could use 1 ES process for hot zone, 1 ES process for warm zone?
centran
but never had the current master fail or shut it down so don't know if the second master will take over
mattnrel
@pickypg any downside to multiple processes on same hardware?
+djschny
@mattnrel - there is nothing stopping you from doing that, however it comes at the cost of maintenance and the two processes having contention with one another
jpsandiego42
We're running multiple nodes per host to deal with the JVM 32g limit, but haven't tried it for different zones.
Jakau
Will common steps of performing performance tests to identify bottlenecks on your own setup be covered at all?
michaltaborsky
@elasticguest2489 What are flexible keys and values?
+djschny
@jpsandiego42 - are you leveraging doc values?
pickypg
@mattnrel If you misconfigure something, then replicas will end up on the same node. You need to set the "processors" setting as well to properly split up the number of cores. And if the box goes down, so do all of those nodes
mattnrel
another usecase for multiple processes - one for master node, one for data?
christian__
@centran If you have 2 masters, the second should not be able to take over if the master dies. If it can, you run the risk of a split-brain scenario in case you suffer a network partition. This is why 3 master-eligible nodes are recommended
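A minimal sketch of the majority arithmetic behind this answer, assuming the usual floor(n / 2) + 1 rule that `discovery.zen.minimum_master_nodes` is normally set to in ES 1.x/2.x. The function names are made up for illustration.

```python
def quorum(master_eligible: int) -> int:
    """Majority of master-eligible nodes: floor(n / 2) + 1."""
    return master_eligible // 2 + 1


def tolerated_failures(master_eligible: int) -> int:
    """How many master-eligible nodes can drop while a majority remains."""
    return master_eligible - quorum(master_eligible)


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} master-eligible: quorum={quorum(n)}, "
              f"can lose {tolerated_failures(n)}")
```

With 2 master-eligible nodes the quorum is 2, so losing either one leaves no majority; 3 is the smallest count that survives a single failure.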
jpsandiego42
yeah, had to put in extra config to ensure host awareness and halving the # of processors, etc
mattnrel
@pickypg yeah i've spotted the config setting for assuring data is replicated properly when running multiple instances on same server
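A rough sketch of the per-node settings discussed here, assuming ES 1.x/2.x setting names: `processors` to split the cores between nodes, and `cluster.routing.allocation.same_shard.host` to keep a primary and its replica off the same physical host. The helper name and the even core split are assumptions, not a recommendation from the webinar.

```python
import json


def per_node_settings(host_cores: int, nodes_on_host: int) -> dict:
    """Settings each ES process on a shared host might carry.

    Cores are divided evenly; at least one core is always assigned.
    """
    return {
        "processors": max(1, host_cores // nodes_on_host),
        "cluster.routing.allocation.same_shard.host": True,
    }


if __name__ == "__main__":
    # Two nodes sharing a 16-core box.
    print(json.dumps(per_node_settings(16, 2), indent=2))
```

This only covers the two settings named in the chat; the shared failure domain (the box going down takes every node on it) is not something configuration can fix.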
elasticguest6519
In the setup shown, logstash would send his data as an output to the client or to the data node ?
jpsandiego42
not using doc values today
Crickes
does shifting data from hot to warm nodes require re-indexing?
elasticmarx77
@Crickes: no
christian__
@Crickes No.
German23
@Crickes no just adjusting the routing tag
+djschny
@jpsandiego42 - doc values should reduce your heap enough that you shouldn't need to run more than one node on a single host
elasticguest2489
@michaltaborsky Object type mapping with 2 fields called key and value. Depending on the nature of your data this might avoid the sparseness and enhance performance
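One way to picture the key/value suggestion: instead of one mapped field per product parameter, every parameter becomes an entry in a single list of `{key, value}` objects, so the mapping stays at two fields no matter how many parameters exist. This is a sketch only; the field name `params` and the `fixed_fields` argument are hypothetical.

```python
def to_key_value(doc: dict, fixed_fields=("name",)) -> dict:
    """Move all non-fixed fields of a document into a key/value list."""
    out = {f: doc[f] for f in fixed_fields if f in doc}
    out["params"] = [
        {"key": k, "value": str(v)}
        for k, v in sorted(doc.items())
        if k not in fixed_fields
    ]
    return out


if __name__ == "__main__":
    tshirt = {"name": "Basic tee", "size": "M", "color": "blue"}
    print(to_key_value(tshirt))
```

Note that matching a specific key together with its value in one query generally needs the `nested` type rather than a plain object array, which has its own cost for facets and aggregations.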
+djschny
@mattnrel - generally speaking you are always better off following the golden rule that each box runs only one process (whether that be a web app, mysql, etc.)
peterkimnyc
@Crickes No but there’s a great new feature in ES2.0 that would make you want to run an _optimize after migration to warm nodes to compress the older data at a higher compression level.
izo
@kestenb: and those 3 containers cost you 2k a month ?
elasticguest4713
Is there a general rule to improve performance under a heavy load of aggregation and faceted queries? Adding more nodes and more RAM?
jpsandiego42
@djschny - most of our issues come from not doing enough to improve mappings/analyzed and our fielddata getting too big.
elasticguest2489
Good question...
michaltaborsky
@elasticguest2489 I don't think this would work for us, like I wrote, we use quite complex queries and facets
peterkimnyc
@Crickes [warning: blatant self-promotion] I wrote a blog post about that feature recently. https://www.elastic.co/blog/elasticsearch-storage-the-true-story-2.0
Crickes
i thought you can't change the index config once it's created, so how do you modify a tag on an index that might have several thousand records in it already?
peterkimnyc
@Crickes There are many dynamic index config settings
+djschny
@Crickes indexes have static and dynamic settings. the tagging is a dynamic one (similar to number of replica shards)
Crickes
@peterkimnyc Thanks, I'll have a look at that
peterkimnyc
You’re probably thinking of the number_of_shards config, which is not dynamic
alanhrdy
@Crickes time series index are normally created each day. Each day you can change the settings :)
elasticguest2489
@michaltaborsky
If you have too many fields this often reflects a bad mapping... but it's hard to tell without knowing the use case...
elasticmarx77
@Crickes: have a look at https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-update-settings.html
+inqueue
clickable for the first bullet: https://www.elastic.co/blog/support-in-the-wild-my-biggest-elasticsearch-problem-at-scale
michaltaborsky
The use case is product database. Different products have different parameters (fields). T-shirts have size and color, fabric... mobile phones have color, operating system, memory size, ... There are thousands of different product categories, hundreds or thousands products in each.
mattnrel
With indexes having same mapping - better to have more/smaller indexes (say per day), or have fewer/larger indexes (say per week) - esp in terms of fielddata
mattnrel
Very relevant talk to my current situation (OOM on fielddata)! Thanks for this.
centran
should be called field data life saver
jpsandiego42
=)
MealnieZamora
is there a rule of thumb for how may indices you should have per cluster?
centran
fielddata bit me in the butt too but it was coupled with setting heap size to 32g which is too close... going down to 30g made my cluster much happier
mattnrel
Would REALLY be nice to have a shortcut method to enable doc_values after the fact - even just a method to rebuild entire index on the fly
MrRobi
Are "doc values" the same as Lucene TermVectors?
rastro
MealnieZamora: the more indexes/shards, the more overhead in ES. For us, it's been a heap management issue.
michaltaborsky
+1 on a simple way to reindex an index
mattnrel
@MrRobi doc values are the same as Lucene DocValues
+djschny
@centran - correct, if your heap is above 30GB then the JVM can no longer use compressed pointers, this results in larger GC times and less usable heap memory
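The heap advice in this exchange (half of RAM, but staying clear of the 32g boundary) can be written as a small rule of thumb. A sketch only: the 30 GB ceiling is the figure centran and djschny mention here, and the exact point where compressed pointers stop working varies by JVM.

```python
def suggested_heap_gb(ram_gb: float, ceiling_gb: float = 30.0) -> float:
    """Half of physical RAM, capped below the compressed-pointer boundary."""
    return min(ram_gb / 2, ceiling_gb)


if __name__ == "__main__":
    for ram in (16, 64, 128):
        print(f"{ram} GB RAM: heap {suggested_heap_gb(ram)} GB")
```

The remaining RAM is not wasted: it is left to the OS file system cache, which Lucene relies on.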
rastro
daily indexes and templates FTW.
jpsandiego42
=)
ⓘ elasticguest9087 is now known as setaou
spuder-450
@MealnieZamora I've heard anecdotally to keep your indexes between 200 - 300
rastro
doc_values saved us like 80%+ of heap.
MealnieZamora
are doc values applicable to system fields like _all
mattnrel
@rastro wow. doing much aggregation/sorting?
elasticguest3231
+1 on re-indexing
christian__
@MealnieZamora No, it only works for fields that are not_analyzed
centran
@djschny - yep... at the time I think the elastic doc was mentioning the 32g problem but didn't say that the problem can pop up between 30-32. took researching java memory management on other sites to discover that a heap size of 32 is a bad idea and playing with fire
c4urself
so we should set circuit breaker to 5-10% AFTER enabling doc values?
rastro
mattnrel: most of our queries are aggregation, as we're building dashboards and generating alerts (by host, etc).
+djschny
@MealnieZamora - there is no magic number here. it depends upon, number of nodes, machine sizes, size of docs, mappings, requirements around indexing rate, search rate, etc.
mattnrel
@rastro same here so good to know your success w/ docvalues
elasticguest3231
not_analyzed should be configurable as default option for strings
+djschny
@MealnieZamora - best bet is to run tests
centran
@c4urself he said he recommends that after you think you got them all so it will trip and you can find anything you missed
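For reference, a sketch of the request body that would lower the fielddata circuit breaker as described, assuming the ES 1.4+ setting name `indices.breaker.fielddata.limit` sent to `PUT /_cluster/settings`. The helper name is made up, and the code only builds the JSON, it does not contact a cluster.

```python
import json


def fielddata_breaker_body(limit_percent: int) -> str:
    """JSON body for PUT /_cluster/settings limiting fielddata to a % of heap."""
    settings = {
        "persistent": {
            "indices.breaker.fielddata.limit": f"{limit_percent}%"
        }
    }
    return json.dumps(settings)


if __name__ == "__main__":
    # A low limit makes any field still using fielddata trip the breaker.
    print(fielddata_breaker_body(10))
```

With the limit this low, a query touching a field that was missed during the doc values migration fails fast instead of filling the heap, which is the point centran describes.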
mattnrel
@rastro same performance under doc values? (obviously is better that you aren't filling your heap and possibly crashing nodes...)
rastro
elasticguest3231: i use templates for that (all field types, actually).
c4urself
centran: ok, thanks for the clarification
rastro
mattnrel: the doc says there's a performance penalty, but I can say that a running cluster is more performant than a crashed cluster.
+djschny
@centran - do you happen to have the link to the elastic doc mentioning 32GB? If so would like to correct it.
centran
I think it was fixed but not sure... I can look
rastro
centran: all the doc i found says "less than 32GB", but doesn't explain the boundary condition.
centran
I know when I was reading up it was on the old site
mattnrel
" I can say that a running cluster is more performant than a crashed cluster. " so true!
elasticguest3231
@rastro - yeah, we wrote datatype conversion scripts to handle it; still seems like you should be able to set it at the index level rather than per field
mattnrel
with same mappings - generally better to run more/smaller indexes (daily) or fewer/larger indexes (weekly)?
rastro
djschny: "when you have 32GB or more heap space..." https://www.elastic.co/blog/found-elasticsearch-in-production
yxxxxxxy
We need to have case-insensitive sort. So we analyze strings to lowercase them. Does that mean we can't use doc_values?
centran
@djschny https://www.elastic.co/guide/en/elasticsearch/guide/current/_limiting_memory_usage.html
Shawn
@yxxxxxxy - https://github.com/elastic/elasticsearch/issues/11901
avielastic
Can I get the recording of this webinar? I joined late
christian__
@mattnrel You do not want too many small shards as each shard carries a bit of overhead, so the decision between daily and weekly indices often depends on data volumes
pickypg
Recording will be posted later
elasticguest5827
Is there any rule to find an optimal size of shard e.g. shard to heap ratio?
elasticguest7305
If I'm using just a lowercase string analyzer (not tokenizing it). Does that work with Doc_Values? Or, do we need to duplicate before we bulk insert the record?
elasticguest2745
Is the circuit breaker for the total cluster or just for that node?
rastro
elasticguest3231: the template says "any string in this index...", which feels like index-level, right?
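A sketch of the kind of index template rastro describes, assuming ES 1.x/2.x mapping syntax: a dynamic template that maps every new string field as `not_analyzed` with doc values, applied to all indices matching a pattern. The `logstash-*` pattern and the helper name are assumptions; the body would be sent to `PUT /_template/<name>`.

```python
import json


def not_analyzed_strings_template(index_pattern: str = "logstash-*") -> dict:
    """Index template body: any string field becomes not_analyzed + doc_values."""
    return {
        "template": index_pattern,
        "mappings": {
            "_default_": {
                "dynamic_templates": [
                    {
                        "strings": {
                            "match_mapping_type": "string",
                            "mapping": {
                                "type": "string",
                                "index": "not_analyzed",
                                "doc_values": True,
                            },
                        }
                    }
                ]
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(not_analyzed_strings_template(), indent=2))
```

Because the template is keyed on the index name pattern, it takes effect for each new daily index as it is created, which is why it feels index-level even though the rule itself is per field.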
centran
@djschny they talk about the limit but should probably be explicit that it needs to be set lower to be in the safe zone
c4urself
what are some scaling problems that happen regularly AFTER enabling doc values (that is, not field data-related problems)?
+djschny
@centran - I will patch the documents and add that for sure.
setaou
In ES 1.x, we have a parameter for the Range Filter allowing to use fielddata. In our use case it gives more performance than the other setting (index), and more perfs than the Range Query. In ES 2.0, filters are no more, so what about the performance of the Range Query, which works without field data ?
+djschny
@centran - Thanks for the link
mattnrel
@elasticguest2745 per node
elasticguest2745
thanks
avielastic
what are the best possible ways to change the datatype of a field of an existing Index without re-indexing ? Will multi-field or dynamic mapping help
rbastian
Would doc values improve nested aggregation performance or only help with stability due to less heap?
Crickes
its the mechanism for ageing the index without using curator I'm interested in finding out. How do you manually move an index from a hot node, to a warm node?
elasticguest2745
We are seeing that the field data cache isnt getting evicted when it hits the set limit. how can we make sure it gets cleared?
jmferrerm
elmanytas
Crickes
I think the answer is buried in https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-update-settings.html
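A hedged sketch of moving old daily indices to warm nodes by hand, without curator. It assumes nodes are tagged with a custom attribute (here `box_type`, a made-up name, set as `node.box_type: warm` in the node config) and that indices follow the `logstash-YYYY.MM.DD` naming. The dynamic setting `index.routing.allocation.require.<attribute>` is then updated through the update-settings API linked above.

```python
from datetime import date, datetime, timedelta

# Body for PUT /<index>/_settings; "box_type" is an assumed attribute name.
WARM_SETTINGS = {"index.routing.allocation.require.box_type": "warm"}


def older_than(index_names, today, days=7, prefix="logstash-"):
    """Return daily indices whose date suffix is more than `days` days old."""
    cutoff = today - timedelta(days=days)
    old = []
    for name in index_names:
        if not name.startswith(prefix):
            continue
        try:
            day = datetime.strptime(name[len(prefix):], "%Y.%m.%d").date()
        except ValueError:
            continue  # not a daily index, leave it alone
        if day < cutoff:
            old.append(name)
    return old


if __name__ == "__main__":
    names = ["logstash-2015.10.01", "logstash-2015.10.10", ".kibana"]
    for index in older_than(names, today=date(2015, 10, 12)):
        print(f"PUT /{index}/_settings", WARM_SETTINGS)
```

Once the setting changes, the cluster relocates the shards on its own; no re-indexing is involved, as answered earlier in the chat.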
dt_ken
I know the website says you do not recommend the G1GC for elastic but we've found it is much faster and seems completely stable. Is there still fear in using G1GC?
jbferland
If you're on the latest java 8 releases, I think G1GC is ok now despite the warnings.
doctorcal
Huh?
michaltaborsky
@dt_ken We use G1GC for a while, for us it is also more stable.
doctorcal
What your data model is
jbferland
There were historical cases of corruption but there have been bug fixes. Risk / reward and dart boards at this point.
+djschny
you can either run 3 master nodes (one in each AZ)
elasticguest2399
When indexes/shards are moved from hot to warm nodes, are the segments in the shards coalesced together? Or is index optimization still needed?
+djschny
or you can put the master node in a Cloud Formation template, so that if it goes down, the CF will spin up another one in another zone
Jakau
So I'm looking at ~35GB a day, 4 log types, and then indexing the events into ~4 indexes apiece that all have the same alias for querying across them. The separate indexes are due to different retentions. Any issues with this? We'd be looking at keeping 90 days worth of logs live
elasticguest8116
ok so use a 3rd az just for a master node
avielastic
whats the advantage of having dedicated master vs Master-data nodes?
mattnrel
How much heap is recommended for master-only node? (Vs 1/2 of ram < 32G general recommendation)
+djschny
@elasticguest2399 - shard relocation copies segments exactly, byte for byte. After that is finished, segment merging then happens independent of the node where things were copied from
christian__
@Jakau You may want to reduce the shard count from the default of 5 in order to reduce the number of shards generated per day
elasticguest6947
Do you have a lightweight master-quorum arbiter daemon, similar to Percona's arbiter, to deal with a 2-master scenario?
elasticguest8116
thank you
elasticguest2399
@+djschny: Thank you
pickypg
@elasticguest6947 not at this time
MIggy282
yes
elasticguest6947
@pickypg thanks
MIggy282
you're correct
+djschny
Generally speaking when using log data, you don't need a distributed queue like Kafka
Jakau
@christian__ What should it be reduced to? My thoughts right now were 1 shard per node. We're looking at starting with 3 nodes
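To see why the shard count matters for this setup, a back-of-the-envelope calculation helps. It reads Jakau's description as 4 log types times 4 indices each (16 new indices per day), assumes 1 replica, and assumes the full 90 days for every index, which overstates things since retentions differ. All of these are assumptions for illustration.

```python
def live_shards(indices_per_day: int, primaries: int,
                replicas: int, retention_days: int) -> int:
    """Total shards (primaries plus replicas) held live in the cluster."""
    return indices_per_day * primaries * (1 + replicas) * retention_days


if __name__ == "__main__":
    for primaries in (5, 3, 1):
        total = live_shards(16, primaries, replicas=1, retention_days=90)
        print(f"{primaries} primaries per index: {total} shards")
```

At roughly 35 GB a day spread over 16 indices, each index is small, so the per-shard overhead christian__ mentions dominates long before shard size becomes a concern.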
yxxxxxxy
how many replicas can ES reasonably handle?
elasticguest3231
@rastro - oh, index templates - hadn't understood their use case... are you using to configure better geo_point handling?
spuder-450
I thought elasticsearch clusters shouldn't span geo locations
jpsandiego42
cool. I like that.
Jakau
What's the recommended procedure for performance testing an ELK stack? I've largely seen JMeter for testing query performance
ⓘ elasticguest9203 is now known as Prabin
rastro
elasticguest3231: i think we have a template that takes any field that ends in ".foo" and makes it a geo_point.
Prabin
is there a way to merge two indices?
elasticguest7305
If I'm using just a lowercase string analyzer (not tokenizing it). Does that work with Doc_Values? Or, do we need to duplicate (and lowercase) before we bulk insert the record?
yxxxxxxy
@Prabin you can create an alias over the two indices and search against the alias
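A small sketch of the alias approach, building the body for `POST /_aliases` so that one alias covers several indices. The index and alias names are invented for the example, and nothing is sent to a cluster.

```python
import json


def alias_actions(alias: str, indices: list) -> dict:
    """Body for POST /_aliases adding `alias` to each index in `indices`."""
    return {
        "actions": [
            {"add": {"index": index, "alias": alias}} for index in indices
        ]
    }


if __name__ == "__main__":
    body = alias_actions("logs-all", ["logs-2015.09", "logs-2015.10"])
    print(json.dumps(body, indent=2))
```

Searching the alias still fans out to every shard of every underlying index, so it simplifies the query target without reducing the shard count.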
Crickes
could you use a tribe node to join 2 geographically separate clusters?
jwieringa
Thanks!
jpsandiego42
Thanks!
elasticguest2489
Thx
elasticguest9430
Upgrading webinar https://www.elastic.co/webinars/upgrading-elasticsearch
elasticguest2433
Thanks
elasticguest3231
many thanks - might solve a lot of headaches for us
elasticguest8687
this has been one of the most useful webinars on elasticsearch I have seen. Thanks!!
Prabin
@yxxxxxxy alias is definitely an option but with time the number of indices is going to increase, so want to merge them so that search happens on fewer index
pickypg
@elasticguest7305 Unfortunately not yet.
rastro
Crickes: i hope so, because we're moving in that direction with some new clusters.
Jakau
Yes, this was an excellent webinar, thank you
pickypg
@Crickes Yes
bharsh
excellent presentation guys... gives me lots to look at
pickypg
@elasticguest7305 https://github.com/elastic/elasticsearch/issues/12394 <- this will be the solution to that
elasticguest8687
I see some questions about the number of indices, and my question might be the same (I didn't see the start of this thread). Is it ok to have hundreds of indices when the total data size is around 100GB?
centran
agreed. good presentation. great knowledge for those who have been getting ELK going and are now realizing the mess they got themselves into
pickypg
@elasticguest8687 So the sum of all the indices is 100 GB? You probably want to reduce the number of indices because that's less than 1 GB per index
rastro
centran: lol
pickypg
There's nothing wrong with that per se, but it _sounds_ wasteful
The impact would be: a lot of shards to search through (a lot of threads) and a bloated cluster state (from extra indices)
Crickes
thanks everyone
chadwiki
@Crickes Make sure you have unique index names, for example region1_index1 and region2_index1
elasticguest8687
it has more to do with the requirements for the over all application. I'll rethink the strategy, but I guess what I really want to know is if the searches will be slow or not if you have that many indices.
pickypg
@elasticguest8687 It kind of depends on how you're searching. Are you searching a single index or all of them with a single request?
centran
I thought I was overkilling it with indexes especially because we have rolling ones but then I discovered the awesomeness of setting up proper index patterns in kibana... holy crap, the speed difference. having lots of fields is what sucks in my opinion
elasticguest8687
in many cases it would be searching across many (or most) of the indices
so would document types be a better approach than using many indices?
pickypg
@centran Yeah. That is being worked on (for real), but it's not a simple problem (quickly deduping)
@elasticguest8687 Do the indexes have the same mappings?
and, if so, why/how are they separated?
elasticguest8687
not necessarily (one of the reason using multiple indices came up as a solution). The idea was to have different fields between indices and search across a common field if you need to.
pickypg
If the mappings are different, then definitely do not use different types. Types are literally just a special filter added for you at the expense of bloating your mapping. If you _can_ and _want_ to use types, then simply create an extra field and name it "type" (or whatever you want), then filter on that manually. It will limit the bloat better.
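The "extra field instead of types" idea could look like the following query body. This sketch assumes ES 2.0 syntax, where the `bool` query accepts a `filter` clause; the field names `type` and `body` and the helper name are hypothetical, and the `type` field would need to be `not_analyzed` for the `term` filter to match exact values.

```python
import json


def search_with_type(type_value: str, text: str) -> dict:
    """Full-text match on `body`, restricted to docs whose `type` field matches."""
    return {
        "query": {
            "bool": {
                "must": {"match": {"body": text}},
                "filter": {"term": {"type": type_value}},
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(search_with_type("invoice", "overdue payment"), indent=2))
```

The filter clause does not affect scoring, so relevance comes only from the text match while the type restriction acts as a cheap yes/no check.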
pickypg
As for the rest: if your index is not greater than 1 GB, then it had better only have 1 shard (there are exceptions, but in general...)
primary shard that is
elasticguest8687
ok. thanks for the info. very helpful.
pickypg
The downside to having a ton of indices for search is that each shard needs to be searched and the results need to be federated/combined by the originating requester node (an advantage of a client node). As such, each index needs to route all requests to all of their shards. This means that if you search 100 shards, then you have 100 threads working _across your cluster_.
Individually they're probably going to be very quick, but the request is only as good as the weakest/slowest shard, which is _probably_ going to be impacted by the slowest node
elasticguest8687
actually I guess I don't have a good idea of how big the index will be. but my guess is it will be more than 1 GB.
pickypg
Also, less obvious, if you have too many shards in the request (e.g., using 5 primary shards unnecessarily), then you will run into blocked requests because of too many threads
How much more?
elasticguest8687
well, the data itself (files to be indexed) total to about 100GB. Most of the files are pdfs, so I plan to extract the text from those.
pickypg
Text is tiny by comparison, so it's really quite hard to say what will come out of them
elasticguest8687
right
pickypg
https://www.elastic.co/blog/elasticsearch-storage-the-true-story-2.0
Good, relevant blog post
elasticguest8687
thanks
pickypg
@elasticguest8687 You can also bring this up on the discuss.elastic.co forums, but my strong recommendation would be to combine indices that share the same mapping (using a separate field to represent type as described above) and deal with the quantity of shards as it happens. In my experience, it's quite good at it -- I was dealing with an issue where a user was running an aggregation across 450 shards without issues stemming from that (there were different issues), but eventually the added parallelism does itself incur a cost
pickypg
and that cost is two fold: 1. the federated search must combine results to find the actual relevant results (top 10 from 5 shards requires up to 50 comparisons at the federated level) 2. the number of threads is a bottleneck
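The first cost can be illustrated with a toy version of the federated merge: each shard returns its own top N, and the coordinating node has to look at every candidate to produce the global top N. This is a simplified sketch, not how Elasticsearch implements it internally; hits are plain `(score, doc_id)` tuples and the function name is made up.

```python
import heapq


def merge_top_n(shard_hits, n):
    """Combine per-shard top-n lists of (score, doc_id) into a global top n."""
    candidates = [hit for hits in shard_hits for hit in hits]
    return heapq.nlargest(n, candidates)


if __name__ == "__main__":
    # 5 shards, each returning 10 hits: 50 candidates for a top 10.
    shards = [
        [(shard * 10 + i, f"s{shard}-d{i}") for i in range(10)]
        for shard in range(5)
    ]
    print(sum(len(hits) for hits in shards), "candidates")
    print(merge_top_n(shards, 10))
```

The candidate count grows linearly with the number of shards searched, which is why hundreds of tiny shards add merge work even when each shard answers quickly.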
elasticguest8687
ok. Yeah, i think i need to go back to the drawing board and think about this some more.
pickypg
Also take a look at our book chapter on "Life Inside a Cluster" https://www.elastic.co/guide/en/elasticsearch/guide/current/distributed-cluster.html
The book's free and great. The next three chapters are also highly relevant, as is sorting and relevance
oh and this is #2 from my above comment: https://www.elastic.co/guide/en/elasticsearch/guide/current/distributed-search.html
elasticguest8687
awesome! thanks, again. this has been very helpful.
pickypg
Good luck
mattnrel
thanks again to Elastic for the great preso