Could anyone suggest the best pattern for gathering metrics from a cluster of nodes (each node is a Tomcat Docker container running a Java app)?
We're planning to use the ELK stack (Elasticsearch, Logstash, Kibana) as a visualization tool, but the question for us is how the metrics should be delivered to Kibana.
We're using the Dropwizard metrics library, which provides per-instance metrics (gauges, timers, histograms).
Some metrics obviously should be gathered per instance (e.g. CPU, memory) - it doesn't make any sense to aggregate those across the cluster.
But for metrics such as average API response times or database call durations, we want a clear global picture - i.e. not per individual instance.
And here is where we're hesitating. Should we:
1. Just send plain gauge values to Elasticsearch and let Kibana calculate averages, percentiles, etc. In this approach all aggregation happens in Kibana.
2. Use timers and histograms per instance and send those instead. But since this data is already aggregated per instance (i.e. a timer already provides percentiles and 1-minute, 5-minute, and 15-minute rates), how should Kibana handle it to show a global picture? Does it even make sense to aggregate already-aggregated data?
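For context, here is roughly what the two options look like with metrics-core (a minimal sketch; the metric name and the surrounding class are placeholders):

    import com.codahale.metrics.MetricRegistry;
    import com.codahale.metrics.Snapshot;
    import com.codahale.metrics.Timer;

    public class ApiMetrics {
        private final MetricRegistry registry = new MetricRegistry();
        private final Timer apiTimer = registry.timer("api.response");

        void handleRequest() {
            try (Timer.Context ignored = apiTimer.time()) {
                // handle the request; the timer records the duration on close
            }
        }

        // Option 1: report each raw duration as its own event and let
        // Kibana compute averages/percentiles across all instances.
        // Option 2: report the aggregates the per-instance timer already keeps:
        void reportPerInstanceAggregates() {
            Snapshot snapshot = apiTimer.getSnapshot();
            double p95 = snapshot.get95thPercentile();   // per-instance only
            double oneMinRate = apiTimer.getOneMinuteRate();
            // Note: per-instance percentiles cannot be averaged into a correct
            // global percentile, which is the heart of the question above.
        }
    }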
Thanks in advance,
You will want to use Metricbeat. It has modules for system-level metrics, the Docker API, and Dropwizard, and it will collect the events for you (without any pre-aggregation).
For the aggregation and visualization I'd use the Time Series Visual Builder, where you can aggregate per container, node, service, and so on. It should be flexible enough to give you the right data granularity.
Related
Let's say I want to keep track of the number of times a service does "something".
I could store every single occurrence of that event in my database along with some metadata about it, but if the event happens tens of times per second, that would increase network latency and impact application performance.
What would be a good way to track such data to be used in metrics?
You could log certain information that can later be used for analysis (using log4j or logback). I don't know what you are using for the back-end, but if you are using Spring or Play, this can be easily done.
In Spring you can implement a HandlerInterceptor or a filter that gets called for each endpoint (or only for specific ones, depending on your use case). You can use this listener to log the start timestamp, end timestamp, and other information like the endpoint that was called. Furthermore, you can store information in the ThreadContext to be used (for example, for calculating the time it took to process the request).
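A rough sketch of that interceptor (Spring 5.x style, where HandlerInterceptor has default methods; the attribute name and log format are just illustrative - on older Spring versions you would extend HandlerInterceptorAdapter instead):

    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    import org.springframework.web.servlet.HandlerInterceptor;

    public class TimingInterceptor implements HandlerInterceptor {
        private static final String START_ATTR = "timing.start";

        @Override
        public boolean preHandle(HttpServletRequest req, HttpServletResponse res,
                                 Object handler) {
            req.setAttribute(START_ATTR, System.nanoTime()); // start timestamp
            return true; // continue processing the request
        }

        @Override
        public void afterCompletion(HttpServletRequest req, HttpServletResponse res,
                                    Object handler, Exception ex) {
            Object start = req.getAttribute(START_ATTR);
            if (start == null) return; // request never reached preHandle
            long tookMs = (System.nanoTime() - (Long) start) / 1_000_000;
            // one parseable line per request; a later job can aggregate these
            System.out.printf("METRIC endpoint=%s durationMs=%d%n",
                              req.getRequestURI(), tookMs);
        }
    }

The interceptor would then be registered via WebMvcConfigurer#addInterceptors, either for all endpoints or only specific path patterns.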
Similarly, for Play you can write a class that implements play.http.ActionCreator and add it to the processing flow (more on this here: https://petrepopescu.tech/2021/09/intercepting-and-manipulating-requests-in-play-framework/ )
Now you have logs with all the information needed, and you can perform analysis on them after the fact. You can have a job that parses the logs daily and produces the statistics. If you mark your logs properly (e.g. by adding a keyword at the beginning of the message), parsing should be really easy.
If you DON'T want to do this because you need the stats in near real-time, you will need to dedicate a thread just for this if you want things to be as smooth for the user as possible. Maybe even have an actor (see Akka Actors) that processes the data asynchronously. Again, using an interceptor or a filter, send a message to the actor (using .tell(), NOT .ask()) and let it compute the statistics and save them in the database (or some other location of your choice).
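A minimal sketch of that fire-and-forget pattern with classic Akka actors (the message type and the aggregation logic are placeholders):

    import akka.actor.AbstractActor;
    import akka.actor.ActorRef;
    import akka.actor.ActorSystem;
    import akka.actor.Props;

    public class StatsActor extends AbstractActor {
        // placeholder message carrying one request's measurements
        public static final class RequestStat {
            final String endpoint;
            final long durationMs;
            public RequestStat(String endpoint, long durationMs) {
                this.endpoint = endpoint;
                this.durationMs = durationMs;
            }
        }

        @Override
        public Receive createReceive() {
            return receiveBuilder()
                .match(RequestStat.class, stat -> {
                    // update counters / write to the database off the request thread
                })
                .build();
        }

        public static void main(String[] args) {
            ActorSystem system = ActorSystem.create("stats");
            ActorRef stats = system.actorOf(Props.create(StatsActor.class), "stats");
            // from the interceptor/filter: tell(), never ask(), so the
            // request thread does not block on the statistics work
            stats.tell(new RequestStat("/orders", 42), ActorRef.noSender());
        }
    }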
You can instrument your code to collect time-series metrics. Basically, maintain counters, and log/reset them at a time interval, like every 5 minutes. Then you can use other tools to collect and visualize those logs.
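A minimal sketch of that approach (the names and the 5-minute interval are arbitrary):

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    public class IntervalCounter {
        private final AtomicLong eventCount = new AtomicLong();
        private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        public void start() {
            scheduler.scheduleAtFixedRate(() -> {
                // snapshot and reset atomically, then log the interval's total
                long n = eventCount.getAndSet(0);
                System.out.printf("METRIC events.count=%d%n", n);
            }, 5, 5, TimeUnit.MINUTES);
        }

        // called from the hot path: a single atomic increment, no I/O
        public void recordEvent() {
            eventCount.incrementAndGet();
        }
    }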
Take a look at Spectator.
In our project we're using an ELK stack for storing logs in a centralized place. However, I've noticed that recent versions of Elasticsearch support various aggregations. In addition, Kibana 4 supports nice graphical ways to build graphs. Even recent versions of Grafana can now work with an Elasticsearch 2 datasource.
So, does all this mean that the ELK stack can now be used for storing metering information gathered inside the system, or can it still not be considered a serious competitor to existing solutions such as Graphite, InfluxDB, and so forth?
If so, does anyone use ELK for metering in production? Could you please share your experience?
Just to clarify the notions: I consider metering data to be something that can be aggregated and shown in a graph 'over time', as opposed to a regular log message, where the main use case is searching.
Thanks a lot in advance
Yes you can use Elasticsearch to store and analyze time-series data.
To be more precise, it depends on your use case. For example, in my use case (financial instrument price tick history data, in development) I am able to get 40,000 documents inserted per second (~125-byte documents with 11 fields each - 1 timestamp, strings, and decimals - meaning 5MB/s of useful data) for 14 hrs/day, on a single node (a big modern server with 192GB RAM) backed by a corporate SAN (which is backed by spinning disks, not SSDs!). I went on to store up to 1TB of data, but I predict having 2-4TB could also work on a single node.
All this is with default config file settings, except for an ES_HEAP_SIZE of 30GB. I suspect it would be possible to get significantly better write performance on that hardware with some tuning (e.g. I find it strange that iostat reports device util at 25-30%, as if Elasticsearch were capping it / conserving I/O bandwidth for reads or merges... but it could also be that %util is an unreliable metric for SAN devices).
Query performance is also fine - queries / Kibana graphs return quickly as long as you restrict the result dataset with time and/or other fields.
In this case you would not be using Logstash to load your data, but bulk inserts of big batches directly into Elasticsearch: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html
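For illustration, a bulk request is just NDJSON POSTed to the _bulk endpoint; something like this sketch (Java 11+ HTTP client; the index and field names are hypothetical, and older Elasticsearch versions also require a _type in the action metadata):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class BulkInsert {
        public static void main(String[] args) throws Exception {
            // each action is two NDJSON lines: the metadata, then the document;
            // the body must end with a newline
            String body =
                "{\"index\":{\"_index\":\"ticks-2016-01-01\"}}\n" +
                "{\"ts\":\"2016-01-01T09:00:00Z\",\"symbol\":\"ABC\",\"price\":10.25}\n" +
                "{\"index\":{\"_index\":\"ticks-2016-01-01\"}}\n" +
                "{\"ts\":\"2016-01-01T09:00:01Z\",\"symbol\":\"ABC\",\"price\":10.26}\n";

            HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/_bulk"))
                .header("Content-Type", "application/x-ndjson")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body()); // per-item success/failure report
        }
    }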
You also need to define a mapping (https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html) to make sure Elasticsearch parses your data as you want it (numbers, dates, etc.) and creates the desired level of indexing.
Other recommended practices for this use case are to use a separate index for each day (or week/month, depending on your insert rate) and to make sure that each index is created with just enough shards to hold one day of data. By default new indexes are created with 5 shards, and shard performance starts degrading after a shard grows over a certain size - usually a few tens of GB, but it might differ for your use case; you need to measure/experiment.
Using Elasticsearch aliases (https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-aliases.html) helps with dealing with multiple indexes and is a generally recommended best practice.
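Putting those three recommendations together, creating a daily index might look like this sketch (ES 2.x-style mapping with a document type; the index name, field types, and shard count are examples to adapt, and the mapping syntax differs in later major versions):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class CreateDailyIndex {
        public static void main(String[] args) throws Exception {
            // shard count sized for roughly one day of data, an explicit mapping
            // so dates/numbers are parsed as intended, and an alias so queries
            // don't need to know the physical daily index name
            String body = "{"
                + "\"settings\":{\"number_of_shards\":2,\"number_of_replicas\":1},"
                + "\"mappings\":{\"tick\":{\"properties\":{"
                + "\"ts\":{\"type\":\"date\"},"
                + "\"symbol\":{\"type\":\"string\",\"index\":\"not_analyzed\"},"
                + "\"price\":{\"type\":\"double\"}}}},"
                + "\"aliases\":{\"ticks\":{}}"
                + "}";

            HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/ticks-2016-01-01"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(body))
                .build();

            System.out.println(HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString()).body());
        }
    }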
When using CodaHale Metrics in Java or Scala code in a clustered environment, what are the gotchas when reporting to Graphite?
If I have multiple instances of my application running and creating different metrics, can Graphite cope - i.e. is reporting cumulative?
For example, suppose I have AppInstances A and B: if A has a gauge reporting 1.2 and B one reporting 1.3, what would be the result in Graphite? Will it be an average? Or will one override the other?
Are counters cumulative?
Are timers cumulative?
Or should I somehow give each instance some tag to differentiate different JVM instances?
You can find the default behavior for the case where Graphite receives several points during the same aggregation period in aggregation-rules.conf.
I think Graphite's default is to take the last point received in the aggregation period.
If you might be interested in metric detail per process instance (and you probably will be at some point), you should tag instances in some way and use that tag in the metric path. Graphite is extremely useful for aggregation at request time, and finding a way to aggregate individual metrics (sum, avg, max, or more complex) won't be difficult.
One thing that might make you reluctant to have different metrics per sender process is a very volatile environment where instances change all the time (thus creating many transient metrics). Otherwise, just use ip+pid and you'll be fine.
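A small sketch of the ip+pid idea (the prefix format is arbitrary; note that dots in the IP must be escaped, since Graphite treats them as path separators):

    import java.lang.management.ManagementFactory;
    import java.net.InetAddress;

    public final class MetricPrefix {
        // builds a per-instance prefix like "myapp.10_0_0_12.12345"
        public static String forThisInstance(String app) throws Exception {
            String ip = InetAddress.getLocalHost().getHostAddress()
                    .replace('.', '_'); // dots would create extra path levels
            // on HotSpot JVMs the runtime name is conventionally "pid@hostname"
            String pid = ManagementFactory.getRuntimeMXBean()
                    .getName().split("@")[0];
            return app + "." + ip + "." + pid;
        }
    }

With the Dropwizard Graphite reporter, this prefix would go into GraphiteReporter.forRegistry(...).prefixedWith(...).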
I added a 'count' field to every set of metrics that I knew went in at the same time. Then I aggregated all of the values, including the counts, as 'sum'. This let me find the average, sum, and count for all metrics in a set. (Yes, Graphite's default is to take the most recent sample for a time period. You need to use the carbon-aggregator front end.)
Adding the IP address to the metric name lets you calculate relative speeds for different servers. If they're all the same type and some are 4x as fast as others, you have a problem. (I've seen this.) As noted above, adding a transitory value like an IP creates a dead-metric problem. If you care about history you could create a special IP for 'old' and collect dead metrics there, then remove the dead entries. In fact, the number of machines in any time period would be a very useful metric.
We've found that the easiest way to handle this is by using per-instance metrics. This way you can see how each instance is behaving independently. If you want an overall view of the cluster, it's also easy to look at the sumSeries of a set of metrics by using wildcards in the metric name (e.g. something like sumSeries(myapp.*.requests.count)).
The caveat to this approach is that you are keeping track of more metrics in your graphite instance, so if you're using a hosted solution this does cost more.
What is the best procedure for reading real-time data from a socket and plotting it in a graph? Graphing isn't the issue for me; my question is more about storing the stream of data, which gets updated via a port. Java needs to be able to read and parse it into a queue, array, or hash map and then plot the real-time graph.
The complication here is that the app can receive a large amount of data every second. Java needs to do a high-performance job of parsing the data, cleaning out old records, and adding new records continuously. Is there any real-time library I should use, or should I program it from scratch, e.g. create a queue or array to store the data and delete data after the queue/array reaches a certain size?
I would humbly suggest that you consider a LinkedBlockingQueue (with a thread/executor to read/remove the data from it).
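A minimal sketch of that idea: a bounded LinkedBlockingQueue between the socket-reader thread and a consumer that updates the graph (the capacity and the drop-oldest eviction policy are arbitrary choices):

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class SampleBuffer {
        // the capacity bound keeps memory use fixed, which also covers
        // "delete data after the queue reaches a certain size"
        private final BlockingQueue<String> samples =
                new LinkedBlockingQueue<>(100_000);

        // called by the socket-reader thread for each raw line
        public void onLine(String rawLine) {
            while (!samples.offer(rawLine)) {
                samples.poll(); // drop the oldest sample to make room
            }
        }

        // consumer thread: drains samples and feeds the plotting component
        public void startConsumer() {
            Thread consumer = new Thread(() -> {
                try {
                    while (true) {
                        String sample = samples.take(); // blocks until data arrives
                        // parse the sample and update the chart model here
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            consumer.setDaemon(true);
            consumer.start();
        }
    }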
If you do not wish to use these data structures but prefer a messaging server to take on that responsibility for some reason, ActiveMQ/ZMQ/RabbitMQ etc would help out (with either queues or topics depending on your use case - for instance in AMQ, use queue for one consumer, topic for multiple consumers unless you wish to use browsers with durable queues).
If it suits your use case, you could also look into actor libraries such as Kilim or Akka.
My need is to aggregate real time statistics of a web application server.
For example:
How many requests of content type X have been done
How long it takes to process request of type Y
And so on.
This data has to be completely in memory, not in a file, for best performance. It doesn't log each and every request but instead only stores counters of various aspects.
The easiest way I know is to store the values in a SQL-like table and do SQL-like queries. The benefit is that indexing comes off the shelf, without development effort. I guess some embedded Java database like Apache Derby would do the job.
The other way to go is to implement a collection (say, a list) plus a hash table for each "index column". That way it's all done with the Java/Scala collections API, but I'd actually have to implement the indexing mechanism myself, test it, maintain it, etc.
So my question is: which way do you think is preferable, and are there other ways to implement this feature easily and quickly?
Thanks.
I would choose the H2 database; I have very positive experiences with it, and performance is great as well.
Are you sure a SQL database is well suited to your needs? Have a look at JavaMelody to see if it covers them; if not, take a look at JRobin for a rolling database implementation.
I would imagine you only need one collection per type of information you want to collect. To improve performance and simplify the code I would use TObjectIntHashMap (from the Trove library), e.g.
How many requests of content type X have been done
import gnu.trove.map.hash.TObjectIntHashMap; // Trove 3.x package

    TObjectIntHashMap<ContentType> contentTypeCount
        = new TObjectIntHashMap<ContentType>();
    // adds 1, inserting 1 if absent (increment() ignores missing keys)
    contentTypeCount.adjustOrPutValue(contentType, 1, 1);
How long it takes to process request of type Y
import gnu.trove.map.hash.TObjectLongHashMap; // Trove 3.x package

    TObjectLongHashMap<ProcessType> contentTypeTime
        = new TObjectLongHashMap<ProcessType>();
    // adds processTime, inserting it if absent (adjustValue() ignores missing keys)
    contentTypeTime.adjustOrPutValue(processType, processTime, processTime);
I don't see how you can make it any shorter/simpler/faster by using the other approaches you mentioned.
The average time to perform increment(key) on my machine is about 15 ns (billionths of a second).
Another option worth mentioning is Twitter Ostrich, a statistics library for Scala.
It provides counters, gauges, and timing meters.
Data is accessible via an HTTP REST API.