When using CodaHale Metrics in Java or Scala code in a clustered environment, what are the gotchas when reporting to Graphite?
If I have multiple instances of my application running and creating different metrics, can Graphite cope - i.e. is reporting cumulative?
For example, say I have AppInstance A and B. If B has a gauge reporting 1.2 and A has the same gauge reporting 1.3 - what would be the result in Graphite? Will it be an average? Or will one override the other?
Are counters cumulative?
Are timers cumulative?
Or should I somehow give each instance some tag to differentiate different JVM instances?
You can configure the default behavior for the case where Graphite receives several points during the same aggregation period in aggregation-rules.conf.
I think Graphite's default is to take the last point received in the aggregation period.
If you might be interested in metric detail per process instance (and you probably will be at some point), you should tag instances in some way and use that tag in the metric path. Graphite is extremely good at aggregation at request time, so finding a way to aggregate individual metrics (sum, avg, max, or more complex) won't be difficult.
One thing that might make you reluctant to have separate metrics per sender process is a very dynamic environment where instances change all the time (thus creating many transient metrics). Otherwise, just use ip+pid and you'll be fine.
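For illustration, here is a minimal sketch of that tagging approach using the DropWizard GraphiteReporter with a per-instance prefix; the Graphite host, port, and the "myapp" prefix are placeholders:

```java
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.graphite.Graphite;
import com.codahale.metrics.graphite.GraphiteReporter;

import java.lang.management.ManagementFactory;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.util.concurrent.TimeUnit;

public class InstanceTaggedReporter {
    public static GraphiteReporter start(MetricRegistry registry) throws Exception {
        // Build a per-instance prefix, e.g. "myapp.host-1_example_com.12345"
        String host = InetAddress.getLocalHost().getHostName().replace('.', '_');
        String pid = ManagementFactory.getRuntimeMXBean().getName().split("@")[0];

        Graphite graphite = new Graphite(new InetSocketAddress("graphite.example.com", 2003));
        GraphiteReporter reporter = GraphiteReporter.forRegistry(registry)
                .prefixedWith("myapp." + host + "." + pid)
                .convertRatesTo(TimeUnit.SECONDS)
                .convertDurationsTo(TimeUnit.MILLISECONDS)
                .build(graphite);
        reporter.start(1, TimeUnit.MINUTES);
        return reporter;
    }
}
```

With a prefix like this, each instance gets its own subtree, and you can still aggregate across instances at request time with wildcards (e.g. sum or average over `myapp.*.*.requests.count`).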
I added a 'count' field to every set of metrics that I knew went in at the same time, then aggregated all of the values, including the counts, as 'sum'. This let me find the average, sum, and count for all metrics in a set. (Yes, Graphite's default is to take the most recent sample for a time period; you need to use the carbon-aggregator front end.)
Adding the IP address to the metric name lets you calculate relative speeds for different servers. If they're all the same type and some are 4x as fast as others, you have a problem (I've seen this). As noted above, adding a transitory value like the IP creates a dead-metric problem. If you care about history you could create a special IP for 'old' and collect dead metrics there, then remove the dead entries. In fact, the number of machines in any time period would be a very useful metric.
We've found that the easiest way to handle this is by using per instance metrics. This way you can see how each instance is behaving independently. If you want an overall view of the cluster it's also easy to look at the sumSeries of a set of metrics by using wildcards in the metric name.
The caveat to this approach is that you are keeping track of more metrics in your graphite instance, so if you're using a hosted solution this does cost more.
Related
Let's say that I want to keep track of the number of times a service does "something".
I could store every single occurrence of that event in my database along with some metadata about it, but if said event happens tens of times per second it would increase network latency / impact the application performance.
What would be a good way to track such data to be used in metrics?
You could log certain information that can later be used for analysis (using log4j or logback). I don't know what you are using for the back-end, but if you are using Spring or Play, this can be easily done.
In Spring you can implement a HandlerInterceptor or a filter that gets called for each endpoint (or only for specific ones, depending on your use case). You can use this listener to log the start timestamp, end timestamp, and other information like the endpoint that was called. Furthermore, you can store information in the ThreadContext to be used (for example, for calculating the time it took to process the request).
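A minimal sketch of such an interceptor, assuming Spring 5+ where HandlerInterceptor has default methods (with older Spring you would extend HandlerInterceptorAdapter instead); the attribute name and the "METRIC" log keyword are arbitrary choices:

```java
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.web.servlet.HandlerInterceptor;

public class TimingInterceptor implements HandlerInterceptor {
    private static final Logger log = LoggerFactory.getLogger(TimingInterceptor.class);
    private static final String START_ATTR = "timing.start"; // arbitrary attribute name

    @Override
    public boolean preHandle(HttpServletRequest request, HttpServletResponse response, Object handler) {
        // Remember when the request started processing.
        request.setAttribute(START_ATTR, System.currentTimeMillis());
        return true; // continue with normal request handling
    }

    @Override
    public void afterCompletion(HttpServletRequest request, HttpServletResponse response,
                                Object handler, Exception ex) {
        long start = (Long) request.getAttribute(START_ATTR);
        // One parseable line per request; a daily job or a log shipper can aggregate these later.
        log.info("METRIC endpoint={} status={} durationMs={}",
                request.getRequestURI(), response.getStatus(), System.currentTimeMillis() - start);
    }
}
```

You would register this interceptor in your WebMvcConfigurer (or the XML equivalent), optionally restricting it to the endpoints you care about.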
Similarly, for Play you can write a class that implements play.http.ActionCreator and add it to the processing flow (more on this here: https://petrepopescu.tech/2021/09/intercepting-and-manipulating-requests-in-play-framework/ )
Now you have logs that contain all the information needed, and you can perform static analysis on them. You can have a job that parses the logs daily and produces the statistics. If you mark your logs properly (e.g. by adding a keyword at the beginning of the message), parsing should be really easy.
If you DON'T want to do this because you need the stats in near real-time, you will need to dedicate a thread just for this if you want things to be as smooth for the user as possible. Maybe even have an Actor (see Akka Actors) that processes the data asynchronously. Again, using an Interceptor or a Filter, send a message to the actor (using .tell(), NOT .ask()) and it does the statistics and saves them in the database (or some other location of your choice).
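A rough sketch of that fire-and-forget actor pattern with classic Akka actors; the event class and what gets aggregated are made up for illustration:

```java
import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;

public class AsyncStats {

    // Hypothetical message describing one occurrence of the event.
    static class Event {
        final String name;
        final long durationMs;
        Event(String name, long durationMs) { this.name = name; this.durationMs = durationMs; }
    }

    // Processes events one at a time, off the request thread; flush to a DB periodically.
    static class StatsActor extends AbstractActor {
        private long count;
        private long totalMs;

        @Override
        public Receive createReceive() {
            return receiveBuilder()
                    .match(Event.class, e -> { count++; totalMs += e.durationMs; })
                    .build();
        }
    }

    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("stats");
        ActorRef stats = system.actorOf(Props.create(StatsActor.class), "statsActor");

        // From the interceptor/filter: fire-and-forget, so the user request is never blocked.
        stats.tell(new Event("checkout", 42), ActorRef.noSender());
    }
}
```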
You can instrument your code to collect time-series metrics. Basically, maintain counters, and log/reset them at a time interval, like every 5 minutes. Then you can use other tools to collect and visualize those logs.
Take a look at Spectator.
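As a minimal sketch of that counter-and-reset approach (plain JDK, not Spectator-specific; the five-minute interval and metric name are arbitrary):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class EventCounter {
    private final AtomicLong count = new AtomicLong();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public EventCounter() {
        // Every 5 minutes, log the current value and reset the counter to zero.
        scheduler.scheduleAtFixedRate(
                () -> System.out.println("something.count=" + count.getAndSet(0)),
                5, 5, TimeUnit.MINUTES);
    }

    // Call this wherever the event happens; it is cheap and lock-free.
    public void increment() {
        count.incrementAndGet();
    }
}
```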
My Spring based web app is deployed to production in a Tomcat cluster (4+ nodes) with sticky sessions. The max number of nodes will not exceed 8-10 in a few years time.
I need to cache some data (mostly configuration) to avoid hitting Oracle. Since the nature of this data is mostly configuration, I would say the ratio of reads to writes is 999999 / 1.
I don't want to use a full-blown caching solution such as Infinispan/Hazelcast/Redis, as it adds operational complexity to the product and the requirement is only to cache some small, mostly read-only data (let's say a few hundred kilobytes at most).
At first, I wanted to implement a simple replicating map myself, then I saw that JGroups ships with a ReplicatedHashMap. I think it suits my needs, but I'm not sure whether I'm missing something.
What else should I consider?
Has anyone used it in production?
ReplicatedHashMap is one class of 700 lines, so it isn't particularly complex, and uses JGroups, which has been used for decade(s) in production.
If you need something simple, without transactions/overflow-store etc., then it might be right for your job. Note that you could modify it and/or write your own, with RHM as a template.
RHM replicates all data to all nodes, so if you have many nodes (you don't), or your data is large, then ReplCache may be the better choice.
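For reference, a minimal sketch of using it, assuming a recent JGroups version where ReplicatedHashMap takes a JChannel; the cluster name and the stock udp.xml config are placeholders:

```java
import org.jgroups.JChannel;
import org.jgroups.blocks.ReplicatedHashMap;

public class ConfigCache {
    public static void main(String[] args) throws Exception {
        // Join the cluster; udp.xml is the stock JGroups UDP configuration shipped with the jar.
        JChannel channel = new JChannel("udp.xml");
        channel.connect("config-cache");

        // Every put/remove is replicated to all members of the "config-cache" cluster.
        ReplicatedHashMap<String, String> cache = new ReplicatedHashMap<>(channel);
        cache.start(10_000); // fetch existing state from the coordinator (10s timeout)

        cache.put("feature.x.enabled", "true");
        System.out.println(cache.get("feature.x.enabled"));

        cache.stop();
        channel.close();
    }
}
```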
Could anyone suggest the best pattern of gathering metrics from a cluster of nodes (each node is Tomcat Docker Container with Java app)?
We're planning to use ELK stack (ElasticSearch, Logstash, Kibana) as a visualization tool but the question for us is how metrics should be delivered to Kibana?
We're using DropWizard metrics library and it provides per instance metrics (gauges, timers, histograms).
Some metrics, obviously, should be gathered per instance (e.g. CPU, memory, etc.) - it doesn't make any sense to aggregate them per cluster.
But for metrics such as average API response times and database call durations, we want a clear global picture - i.e. not per concrete instance.
And here is where we are hesitating. Should we:
Just send plain gauge values to ElasticSearch and let Kibana calculate averages, percentiles, etc.? In this approach all aggregation happens in Kibana.
Use timers and histograms per instance and send those instead? But since this data is already aggregated per instance (i.e. a timer already provides percentiles and 1-minute, 5-minute and 15-minute rates), how should Kibana handle it to show a global picture? Does it make a lot of sense to aggregate already-aggregated data?
Thanks in advance,
You will want to use Metricbeat. It supports modules for the system level, Docker API, and Dropwizard. This will collect the events for you (without any pre-aggregation).
For the aggregation and visualization I'd use the Time Series Visual Builder, where you can aggregate per container, node, service, anything... It should be flexible enough to get the right data granularity for you.
When I troubleshoot site issues, I need to check many metrics like CPU, memory, application metrics and so on. Generally, I want to know the following items automatically (without a human checking all the metrics one by one):
How many metrics have spikes during that time.
Whether metric X has the same pattern as metric Y.
Whether metric X shows some periodicity.
For items 1 and 2, I think I can get them by calculating some change rate. For item 3, I have no idea so far.
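Roughly, the change-rate idea I have in mind looks something like this sketch (the spike threshold is arbitrary):

```java
import java.util.ArrayList;
import java.util.List;

public class SpikeDetector {
    // Flag indexes where the relative change between consecutive samples exceeds the threshold.
    public static List<Integer> spikes(double[] series, double threshold) {
        List<Integer> spikeIndexes = new ArrayList<>();
        for (int i = 1; i < series.length; i++) {
            double prev = series[i - 1];
            double changeRate = prev == 0 ? 0 : Math.abs(series[i] - prev) / Math.abs(prev);
            if (changeRate > threshold) {
                spikeIndexes.add(i);
            }
        }
        return spikeIndexes;
    }

    public static void main(String[] args) {
        double[] cpu = {0.2, 0.21, 0.22, 0.9, 0.25}; // sample data
        System.out.println(spikes(cpu, 0.5));        // -> [3, 4]
    }
}
```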
My questions are:
Is there an existing library that can be used here? Any language (Go, Java, Python) is OK.
Do you have any suggestions for requirement 3?
=====
More background here:
I have a Prometheus (a monitoring system) setup already, but my issue is that I want to analyze these metrics automatically. For example:
User input:
Here are 1000 time series, and I had an issue from time 1 to time 2; I can see that metric X spiked during that time.
Program output:
item 1/2/3 above.
I just have some issues implementing the program.
I think you need some monitoring & analytic services like:
DataDog: https://www.datadoghq.com/
Librato: https://www.librato.com/
etc...
Or a self-hosted infrastructure to run Graphite (https://github.com/hopsoft/docker-graphite-statsd) or similar tools.
In our project we're using an ELK stack for storing logs in a centralized place. However, I've noticed that recent versions of Elasticsearch support various aggregations. In addition, Kibana 4 supports nice graphical ways to build graphs. Even recent versions of Grafana can now work with an Elasticsearch 2 datasource.
So, does all this mean that the ELK stack can now be used for storing metering information gathered inside the system, or can it still not be considered a serious competitor to existing solutions such as Graphite, InfluxDB and so forth?
If so, does anyone use ELK for metering in production? Could you please share your experience?
Just to clarify the notions, I consider metering data as something that can be aggregated and shown in a graph 'over time', as opposed to a regular log message where the main use case is searching.
Thanks a lot in advance
Yes you can use Elasticsearch to store and analyze time-series data.
To be more precise - it depends on your use case. For example, in my use case (financial instrument price tick history data, in development) I am able to get 40,000 documents inserted/sec (~125-byte documents with 11 fields each - 1 timestamp, strings and decimals, meaning 5MB/s of useful data) for 14 hrs/day, on a single node (a big modern server with 192GB RAM) backed by a corporate SAN (which is backed by spinning disks, not SSD!). I want to store up to 1TB of data, but I predict that 2-4TB could also work on a single node.
All this is with default config file settings, except for an ES_HEAP_SIZE of 30GB. I suspect it would be possible to get significantly better write performance on that hardware with some tuning (e.g. I find it strange that iostat reports device util at 25-30%, as if Elastic was capping it / conserving I/O bandwidth for reads or merges... but it could also be that %util is an unreliable metric for SAN devices).
Query performance is also fine - queries / Kibana graphs return quickly as long as you restrict the result dataset with time and/or other fields.
In this case you would not be using Logstash to load your data, but rather bulk inserts of big batches directly into Elasticsearch. https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html
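As a rough sketch of bulk loading from Java, using the (now deprecated) high-level REST client rather than the clients that were current in the ES 2.x era; the index name and fields are made up to mirror the tick-data example above:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

public class TickBulkLoader {
    public static void main(String[] args) throws Exception {
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")));

        // Batch documents into a single bulk request instead of indexing them one by one.
        BulkRequest bulk = new BulkRequest();
        for (int i = 0; i < 1000; i++) {
            bulk.add(new IndexRequest("ticks-2016.01.01").source(XContentType.JSON,
                    "timestamp", System.currentTimeMillis(),
                    "instrument", "EURUSD",
                    "price", 1.0851 + i * 0.0001));
        }
        client.bulk(bulk, RequestOptions.DEFAULT);
        client.close();
    }
}
```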
You also need to define a mapping https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html to make sure Elasticsearch parses your data as you want it (numbers, dates, etc.), creates the desired level of indexing, and so on.
Other recommended practices for this use case are to use a separate index for each day (or month/week depending on your insert rate), and to make sure that each index is created with just enough shards to hold one day of data (by default new indexes get created with 5 shards, and shard performance starts degrading after a shard grows over a certain size - usually a few tens of GB, but it might differ for your use case - you need to measure/experiment).
Using Elasticsearch aliases https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-aliases.html helps with dealing with multiple indexes, and is a generally recommended best practice.