When I troubleshoot site issues, I need to check many metrics such as CPU, memory, application metrics, and so on. Generally, I want to know the following items automatically (without a human checking all the metrics one by one):
1. How many metrics have spikes during that time.
2. Whether metric X has the same pattern as metric Y.
3. Whether metric X has periodic characteristics.
For items 1 and 2, I think I can get them by calculating some change rate (a rough sketch of what I mean is below). For item 3, I have no idea so far.
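Here is the kind of change-rate check I have in mind (plain Java, just a sketch; the window size and the z-score threshold are guesses on my part):

    // Flag a spike when a point deviates from the mean of the preceding
    // window by more than k standard deviations (a simple z-score test).
    static boolean hasSpike(double[] series, int window, double k) {
        for (int i = window; i < series.length; i++) {
            double mean = 0, var = 0;
            for (int j = i - window; j < i; j++) mean += series[j];
            mean /= window;
            for (int j = i - window; j < i; j++) {
                var += (series[j] - mean) * (series[j] - mean);
            }
            double std = Math.sqrt(var / window);
            if (std > 0 && Math.abs(series[i] - mean) > k * std) {
                return true; // point i deviates strongly from recent history
            }
        }
        return false;
    }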
My questions are:
Is there already a library that can be used here? Any language (Go, Java, or Python) is fine.
Do you have any suggestions for requirement 3?
=====
More background here:
I have a Prometheus (a monitoring system) setup already, but my issue is that I want to analyze these metrics automatically. For example:
User input:
Here are 1000 time-series data points, and I have an issue from time 1 to time 2; I see metric X spiked during that time.
Program output:
Items 1/2/3 above.
I just have some issues implementing the program.
I think you need some monitoring & analytics services like:
DataDog: https://www.datadoghq.com/
Librato: https://www.librato.com/
etc...
Or a self-hosted infrastructure to run Graphite (https://github.com/hopsoft/docker-graphite-statsd) or similar tools.
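For requirement 3 (periodicity) specifically, you don't necessarily need a full service. A common baseline is autocorrelation: if the series correlates strongly with itself at lag k, its period is roughly k samples. A minimal Java sketch, assuming evenly spaced in-memory samples (the 0.5 threshold is arbitrary; the same numerator computed between two different series gives you a Pearson-style similarity for requirement 2):

    // Autocorrelation at a given lag; values near 1 mean the series
    // repeats itself with that period (lag is in samples, not seconds).
    static double autocorrelation(double[] series, int lag) {
        int n = series.length;
        double mean = 0;
        for (double v : series) mean += v;
        mean /= n;
        double num = 0, den = 0;
        for (int i = 0; i < n; i++) {
            den += (series[i] - mean) * (series[i] - mean);
            if (i + lag < n) {
                num += (series[i] - mean) * (series[i + lag] - mean);
            }
        }
        return den == 0 ? 0 : num / den;
    }

    // Scan candidate lags and report the strongest one.
    static int likelyPeriod(double[] series, int maxLag) {
        int best = 0;
        double bestScore = 0.5; // require at least moderate correlation
        for (int lag = 1; lag <= maxLag; lag++) {
            double r = autocorrelation(series, lag);
            if (r > bestScore) { bestScore = r; best = lag; }
        }
        return best; // 0 means no clear periodicity was found
    }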
A quick question regarding two queries for the time taken by an API, please.
My Spring Boot app has the out-of-the-box http_server_requests_seconds max, count, and sum metrics.
Therefore, I wanted to build a visual in order to see how long I take to answer a request.
With that in mind, I searched the web for existing queries and found two, which are both supposed to provide the request duration:
http_server_requests_seconds_max{_ws_="my_workspace"}
irate(http_server_requests_seconds::sum{_ws_="my_workspace"}[5m]) / irate(http_server_requests_seconds::count{_ws_="my_workspace"}[5m])
Unfortunately, the dashboards show completely different graphs.
May I ask if I am using one of the two queries wrong, please?
Which of the two is the most appropriate query to get the time taken for a request?
Thank you
One obvious difference is that the second example calculates the first derivative of your values (the rate), whereas the first does not.
The second query calculates the average response time over 5-minute intervals; the first query simply shows the maximum value of the timer metric.
An explanation of the different metric types in Prometheus can be found at https://prometheus.io/docs/concepts/metric_types/#summary, and irate is explained at https://prometheus.io/docs/prometheus/latest/querying/functions/#irate.
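Note that with plain Prometheus metric naming (i.e. without the workspace-specific :: syntax shown above), the conventional average-latency query uses rate rather than irate, because rate averages over the whole window instead of only the last two samples:

    rate(http_server_requests_seconds_sum[5m]) / rate(http_server_requests_seconds_count[5m])

irate reacts faster but is noisier, which may account for part of the difference between your two dashboards.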
I need some help :(
I am new to AnyLogic. The problem is that I have 4 identical machines, and each machine has 5 different critical parts.
I want these critical parts to represent one machine. What I tried to do is create a machine agent type with a population of 4, and inside the diagram of the machine agent I created 5 critical-part agent types (i.e. cp1, cp2, ..., cp5), each with the initial number of agents set to 1, and I extended those CPs to the machine agent type. Is this correct? I am confused because I have 4 machines; should the initial number of CPs be 4, so they can be distributed to the 4 machines?
I know it is a very stupid question :)
Thank you
If this behavior will only occur in the case of a failure, you can model it in a different way. Incorporate failures in the resourcePool and select the flowchart option (instead of modeling the failure with a delay). In that flowchart you have a pickup (or a similar action) from a queue that should contain the spare parts. Tweaking this behavior will probably be a better approach than modeling the 5 critical parts and using them all.
I would suggest the following approach.
Create a resource pool for each part and require its use in the service (see image):
Then, for each of the resource pools, you model failures, as in the picture, and the repair task is a flowchart.
You will need a queue to represent the spare-parts storage. From there you can remove the specific part you want (this will require you to model that information into the agent type and then search the queue; see the sketch below).
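As a rough illustration of that queue search (SparePart and partType are hypothetical names for your agent type and its identifying field; size(), get() and remove() are standard Queue block functions):

    // Find the first spare part of the needed type in the Queue block
    // 'sparesQueue' and take it out of storage for the repair task.
    SparePart found = null;
    for (int i = 0; i < sparesQueue.size(); i++) {
        SparePart p = (SparePart) sparesQueue.get(i);
        if (p.partType == neededType) { // partType is a hypothetical field
            found = p;
            break;
        }
    }
    if (found != null) {
        sparesQueue.remove(found); // the part is now available for the repair
    }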
The repair task is very simple in my example, but you can and should adapt it to your needs.
Hope this is enough for you to solve your problem.
Best regards,
Luís
When using CodaHale Metrics in Java or Scala code in a clustered environment, what are the gotchas when reporting to Graphite?
If I have multiple instances of my application running and creating different metrics, can Graphite cope, i.e. is reporting cumulative?
For example, suppose I have AppInstance A and B. If A has a gauge reporting 1.2 and B one reporting 1.3, what would be the result in Graphite? Will it be an average, or will one override the other?
Are counters cumulative?
Are timers cumulative?
Or should I somehow give each instance some tag to differentiate different JVM instances?
You can find the default behavior for the case where Graphite receives several points during the aggregation period in aggregation-rules.conf.
I think Graphite's default is to take the last received point in the aggregation period.
If you are interested in metric detail by process instance (and you probably will be at some point), you should tag instances in some way and use that tag in the metric path. Graphite is extremely useful for aggregation at request time, and finding a way to aggregate individual metrics (sum, avg, max, or more complex) later can be difficult.
One thing that might make you reluctant to have different metrics per sender process is a very volatile environment where instances change all the time (thus creating many transient metrics). Otherwise, just use ip+pid and you'll be fine.
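For example, with the Dropwizard/CodaHale GraphiteReporter you can bake the instance identifier into the metric path with prefixedWith. A minimal sketch (the Graphite host/port and the app prefix are placeholders):

    import java.lang.management.ManagementFactory;
    import java.net.InetAddress;
    import java.net.InetSocketAddress;
    import java.util.concurrent.TimeUnit;
    import com.codahale.metrics.MetricRegistry;
    import com.codahale.metrics.graphite.Graphite;
    import com.codahale.metrics.graphite.GraphiteReporter;

    static void startReporter(MetricRegistry registry) throws Exception {
        // ip+pid makes the metric path unique per JVM instance,
        // e.g. app.10_0_0_5.12345.requests.count
        String ip = InetAddress.getLocalHost().getHostAddress().replace('.', '_');
        String pid = ManagementFactory.getRuntimeMXBean().getName().split("@")[0];

        Graphite graphite = new Graphite(new InetSocketAddress("graphite.example.com", 2003));
        GraphiteReporter reporter = GraphiteReporter.forRegistry(registry)
                .prefixedWith("app." + ip + "." + pid)
                .convertRatesTo(TimeUnit.SECONDS)
                .convertDurationsTo(TimeUnit.MILLISECONDS)
                .build(graphite);
        reporter.start(1, TimeUnit.MINUTES);
    }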
I added a 'count' field to every set of metrics that I knew went in at the same time. Then I aggregated all of the values, including the counts, as 'sum'. This let me find the average, sum, and count for all metrics in a set. (Yes, Graphite's default is to take the most recent sample for a time period. You need to use the carbon aggregator front end.)
Adding the IP address to the metric name lets you calculate relative speeds for different servers. If they're all the same type and some are 4x as fast as others, you have a problem (I've seen this). As noted above, adding a transitory value like an IP creates a dead-metric problem. If you care about history, you could create a special IP for 'old' and collect dead metrics there, then remove the dead entries. In fact, the number of machines at any time period would be a very useful metric.
We've found that the easiest way to handle this is by using per-instance metrics. This way you can see how each instance is behaving independently. If you want an overall view of the cluster, it's also easy to look at the sumSeries of a set of metrics by using wildcards in the metric name.
The caveat to this approach is that you are keeping track of more metrics in your graphite instance, so if you're using a hosted solution this does cost more.
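As a concrete example of the wildcard aggregation (assuming an app.<ip>.<pid>.<metric> naming scheme like the one sketched earlier), the cluster-wide total is just:

    sumSeries(app.*.*.requests.count)

while fixing the first two path segments shows a single instance.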
I am using the OpenNLP Token Name Finder to parse unstructured data. I have created a corpus (training set) of 4MM records, but when I create a model out of this corpus using the OpenNLP APIs in Eclipse, the process takes around 3 hours, which is very time consuming. The model is built with the default parameters, that is, 100 iterations and a cutoff of 5.
So my question is: how can I speed up this process and reduce the time taken to build the model?
The size of the corpus could be the reason, but I just wanted to know whether someone has come across this kind of problem and, if so, how they solved it.
Please provide some clues.
Thanks in advance!
Usually the first approach to handling such issues is to split the training data into several chunks and let each one produce a model of its own. Afterwards you merge the models. I am not sure this is valid in this case (I'm not an OpenNLP expert); there's another solution below. Also, since the OpenNLP API seems to provide only single-threaded train() methods, I would file an issue requesting a multi-threaded option.
For a slow single-threaded operation, the two main slowing factors are IO and CPU, and both can be handled separately:
IO - which hard drive do you use? Regular (magnetic) or SSD? Moving to an SSD should help.
CPU - which CPU are you using? Moving to a faster CPU will help. Don't pay attention to the number of cores, as here you want raw speed.
An option you may want to consider is to get a high-CPU server from Amazon Web Services or Google Compute Engine and run the training there - you can download the model afterwards. Both give you high-CPU servers utilizing Xeon (Sandy Bridge or Ivy Bridge) CPUs and local SSD storage.
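If you do experiment with the training parameters mentioned in the question before changing hardware, this is roughly how they are passed. A sketch against opennlp-tools 1.6+; constant names and the train() signature vary between versions, and the Threads parameter is only honored by some trainers:

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.NameSample;
    import opennlp.tools.namefind.TokenNameFinderFactory;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.TrainingParameters;

    static TokenNameFinderModel train(ObjectStream<NameSample> samples) throws Exception {
        TrainingParameters params = new TrainingParameters();
        params.put(TrainingParameters.ITERATIONS_PARAM, "100"); // the default mentioned above
        params.put(TrainingParameters.CUTOFF_PARAM, "5");       // likewise
        params.put(TrainingParameters.THREADS_PARAM, "4");      // only some trainers use this
        return NameFinderME.train("en", "default", samples, params,
                new TokenNameFinderFactory());
    }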
I think you should make algorithm-related changes before upgrading the hardware.
Reducing the sentence size
Make sure you don't have unnecessarily long sentences in the training sample. Such sentences don't increase performance but have a huge impact on computation. (I'm not sure of the order.) I generally put a cutoff at 200 words/sentence. Also look at the features closely; these are the default feature generators:
two kinds of WindowFeatureGenerator with a default window size of only two
OutcomePriorFeatureGenerator
PreviousMapFeatureGenerator
BigramNameFeatureGenerator
SentenceFeatureGenerator
These feature generators generate the following features in the given sentence for the word Robert.
Sentence: Robert creeley authored many books such as Life and Death, Echoes and Windows.
Features:
w=robert
n1w=creeley
n2w=authored
wc=ic
w&c=robert,ic
n1wc=lc
n1w&c=creeley,lc
n2wc=lc
n2w&c=authored,lc
def
pd=null
w,nw=Robert,creeley
wc,nc=ic,lc
S=begin
ic is Initial Capital, lc is lower case
Of these features, S=begin is the only sentence-dependent feature, which marks that Robert occurred at the start of the sentence.
My point is to explain the role of a complete sentence in training. You can actually drop the SentenceFeatureGenerator and reduce the sentence size further to accommodate only a few words in the window of the desired entity. This will work just as well.
I am sure this will have a huge impact on complexity and very little on performance.
Have you considered sampling?
As I have described above, the features are a very sparse representation of the context. You may have many sentences that are duplicates as seen by the feature generators. Try to detect these and sample in a way that represents sentences with diverse patterns, i.e. it should be impossible to write only a few regular expressions that match them all. In my experience, training samples with diverse patterns did better than those that represent only a few patterns, even though the former had a much smaller number of sentences. Sampling this way should not affect the model performance at all. One crude sketch of such sampling follows.
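This hypothetical helper keys each sentence by its case-class shape (similar to the wc features above) and caps how many sentences with the same shape survive:

    import java.util.Map;

    // Map each token to a coarse class (ic/lc/num/other) and keep at most
    // maxPerPattern sentences per shape, so repetitive patterns don't
    // dominate the training set.
    static boolean keep(String[] tokens, Map<String, Integer> seen, int maxPerPattern) {
        StringBuilder shape = new StringBuilder();
        for (String t : tokens) {
            if (t.matches("[A-Z][a-z]*")) shape.append("ic ");
            else if (t.matches("[a-z]+")) shape.append("lc ");
            else if (t.matches("[0-9]+")) shape.append("num ");
            else shape.append("other ");
        }
        return seen.merge(shape.toString(), 1, Integer::sum) <= maxPerPattern;
    }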
Thank you.
Suppose I am sampling a number of signals at a fixed rate (say once per second) and extracting some metrics from the signals, such as the ratio of one to another, the rate of change, the relative rate of change, etc.
I've heard that Neural Networks can be of use in discovering relationships. Is this true?
If so, what books/internet resources can I use to learn more about how to do this?
The processing is being done in Java, so a Java slant on all your answers would be most appreciated.
Thanks
Most likely you would need to determine some sort of a "window", like maybe the last 10 samples. You would normalize your signal into an array of 10 "doubles" normalized between -1 and 1. This would form the "input" into your neural network. So you would have 10 input neurons. Then you have to decide what you want the output to be. Maybe you have 100 different classifications that you may want to classify the signals into. If this is the case you would have 100 different output neurons that would each be trained to produce a higher output than the other output neurons when they recognize a specific signal.
Between the input and output layers neural networks usually have one or more hidden layers. These just provide additional capability to the neural network.
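A minimal sketch of that windowing step in plain Java (the window size and the [-1, 1] scaling are as described above; nothing framework-specific):

    // Take the last `window` samples and scale them to [-1, 1] so they
    // can be fed to the input neurons.
    static double[] toInputs(double[] signal, int window) {
        int start = signal.length - window;
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        for (int i = start; i < signal.length; i++) {
            min = Math.min(min, signal[i]);
            max = Math.max(max, signal[i]);
        }
        double[] inputs = new double[window];
        double range = max - min;
        for (int i = 0; i < window; i++) {
            // A flat window maps to all zeros instead of dividing by zero.
            inputs[i] = range == 0 ? 0 : 2 * (signal[start + i] - min) / range - 1;
        }
        return inputs;
    }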
For Java neural network programming, you might try the Encog project. There is also a DotNet version of Encog as well.
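A minimal Encog 3 sketch of the 10-input / 100-output network described above (class names are from Encog 3.x; inputs and ideals are your prepared training arrays, and the hidden layer size of 20 is arbitrary):

    import org.encog.engine.network.activation.ActivationTANH;
    import org.encog.ml.data.MLDataSet;
    import org.encog.ml.data.basic.BasicMLDataSet;
    import org.encog.neural.networks.BasicNetwork;
    import org.encog.neural.networks.layers.BasicLayer;
    import org.encog.neural.networks.training.propagation.resilient.ResilientPropagation;

    // 10 inputs (the normalized window), one hidden layer, 100 outputs.
    BasicNetwork network = new BasicNetwork();
    network.addLayer(new BasicLayer(null, true, 10));
    network.addLayer(new BasicLayer(new ActivationTANH(), true, 20));
    network.addLayer(new BasicLayer(new ActivationTANH(), false, 100));
    network.getStructure().finalizeStructure();
    network.reset();

    // inputs[i] is a normalized window, ideals[i] the expected outputs.
    MLDataSet training = new BasicMLDataSet(inputs, ideals);
    ResilientPropagation train = new ResilientPropagation(network, training);
    do {
        train.iteration();
    } while (train.getError() > 0.01);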
That is true. You can discover relationships with NNs. The problem is that it's very hard to interpret the weights after calibration, so they are a bit of a black box (more so than other data mining algorithms).
I would actually recommend exploring the neural net algorithm that comes with MS Analysis Services. It's a good way to learn about NNs before you start programming anything (and since it's a server service, you can call it from Java).