I started working with JMH lately and wondered whether there is a way to debug it.
The first thing I did was try to debug it like any other program, but it threw "No transports initialized", so I couldn't debug it the old-fashioned way.
Next I searched the internet and found someone who said you need to set the forks to 0; I tried it and it worked.
Unfortunately, I couldn't really understand why the forks affect debugging, or how forks change the way I should look at JMH results.
All I know so far is that .forks(number) on the OptionsBuilder controls how many processes the benchmark will run in. But if I put .forks(3), does it run each @Benchmark method in 3 processes asynchronously?
An explanation of .forks(number) and .threads(number), how they change the way benchmarks run, and how they affect debugging would really clear things up.
A normal JMH run has two processes running at any given time. The first ("host") process handles the run. The second ("forked") process is the one that runs a single trial of a given benchmark -- this is how isolation is achieved. Requesting N forks, either via annotation or via the command line (which takes precedence over annotations), launches the forked process N consecutive times. Requesting zero forks runs the workload straight in the host process itself.
So, there are two ways to go about debugging:
a) Normally, you can attach to the forked process and debug it there. It helps to configure the workload to run longer, so you have plenty of time to attach and look around. The forked process usually has ForkedMain as its entry point, which is visible in jps.
b) If the above does not work, request -f 0 and attach to the host process itself.
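To make the relationship concrete, here is a minimal launcher sketch (the include pattern is illustrative). The key point: .forks(n) launches n forked JVMs one after another, each running its own full trial, while .threads(n) controls how many threads hammer the @Benchmark method concurrently inside a single trial. .forks(0) keeps everything in the host process, which is why an ordinary IDE debug session suddenly works.

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class BenchmarkLauncher {
    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include("AdditionPollution")  // regex matching the benchmark class to run
                .forks(3)       // 3 forked JVMs, run consecutively for isolation, not in parallel
                .threads(4)     // 4 threads per trial, inside each forked JVM
                // .forks(0)    // run in the host process instead -- lets a debugger attach normally
                .build();
        new Runner(opt).run();
    }
}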
This was quite a pain in the soft side to get working (now that I've written it up, it sounds trivial).
I was trying to debug DTraceAsmProfiler (perfasm, but for Mac), using Gradle.
First of all, I have a Gradle task called jmh that looks like this:
task jmh(type: JavaExec, dependsOn: jmhClasses) {
    // the actual class to run
    main = 'com.eugene.AdditionPollution'
    classpath = sourceSets.jmh.compileClasspath + sourceSets.jmh.runtimeClasspath

    // I build JMH from sources, thus need this
    repositories {
        mavenLocal()
        maven {
            url '/Users/eugene/.m2/repository'
        }
    }
}
Then I needed to create a debug configuration in IntelliJ (nothing unusual).
And then simply run:
sudo gradle -Dorg.gradle.debug=true jmh --debug-jvm
Recently, we migrated our backend APIs from Grails 3 to Grails 5.1.1
Together with it, we also upgraded the Java version to 11. Everything is running on Docker. Otherwise, nothing else has changed.
After the migration, we are now facing performance issues. But it's a weird one.
First, we got some results from NewRelic:
NewRelic is showing that org.grails.web.mapping.mvc.GrailsControllerUrlMappingInfo is to blame. There is nothing else underneath it that is slow.
Digging a bit deeper, we found an article (from a while ago) which claims that NewRelic is not instrumenting Grails very well.
At this point, we tried to reproduce the issue locally, and we did. We created a simple function that wraps whatever we need with a timer to measure how long things take to execute:
def withExecutionTimeLog(Closure closure) {
    Long start = System.currentTimeMillis()
    def result = closure()
    log.warn "Execution time -> ${System.currentTimeMillis() - start} ms"
    result
}
As an example, we used it on one of the slowest endpoints.
This is a controller action (the highest(ish) point under our control):
def create(Long workspaceId, Long projectId) {
    withExecutionTimeLog {
        canUpdateProject(projectId) {
            withExceptionsHandling {
                CecUtilization utilization =
                        cecUtilizationService.create(workspaceId, projectId, getCurrentUser())
                [utilization: utilization]
            }
        }
    }
}
So, in this case, the timer function wraps everything including the DB calls, all the way to view rendering.
And here is the result:
The full request cycle takes 524ms, of which 432ms is server-side.
From that little execution time logger, I got 161ms.
Given all of that, it looks like NewRelic is kinda right: roughly 270ms of the server-side time is spent outside anything my timer wraps. Something else is wasting cycles. The biggest question is WHAT?
How can I profile that? Maybe someone has come across a similar situation.
How can I resolve it?
On another note, since we have all the possible monitoring tools in place, I can tell for sure that the servers are not doing much (including the DB). The allocated heap size is up to 3G.
Usually, the first request to the same endpoint takes much longer; the subsequent ones are much better, but still slow.
UPDATE
I decided to dig deeper. I hooked up Async Profiler, and I think my original assumptions were proven right.
The class loader seems to be the one causing the performance issues. This also explains the slowness of the first request (of any type), while subsequent ones are faster. But if I make the same call another 4-5 minutes later, it loads the classes again and is slow.
So, unless I am mistaken, the issue here is the class loader. I am blaming TomcatEmbeddedWebappClassLoader for this.
Now, the biggest question is how do I fix that in Grails? I am running a war file in prod -> java -Dgrails.env=prod -Xms1500m -Xmx3000m -jar app.war
I found this post that points in the same direction, but I am not sure how to wire it in for Grails.
UPDATE #2
Looking for a class loader solution brought me to this issue. The proposed solution is for Spring; I am wondering how to solve that for Grails.
The link https://github.com/spring-projects/spring-boot/issues/16471 contains the following comment:
"It's quite low priority as the problem can be avoided by using .jar packaging rather than .war packaging or, I believe, by switching to another embedded container. As things stand, we have no plans to tackle this in the 2.x timeframe."
So I ran the command './gradlew bootJar'.
I did the test with siege, and the first request went from 1.62 seconds to 0.47 seconds.
I ran the siege command to check the total time for several users making the request at the same time.
I made a repository trying to do what you said:
https://github.com/fernando88to/slowness5
I can set the value of the parameter forkCount to any desired number, say 12, and I'd expect to get 12 new Java processes of type surefirebooter when running tests like these. But ps shows that I only sometimes get the 12 expected Java processes (to be precise: I get them extremely rarely). Instead I typically get fewer, sometimes only three or four. Execution of my hundreds of unit tests also appears to be slow then.
The running processes also often disappear from the ps output (terminate, I assume) before the unit tests are done. In some cases all of them disappear, and then the execution hangs indefinitely.
The documentation wasn't too clear about this, but I'd expect the given number of processes to be there the whole time, until all unit tests are done.
Maybe the surefirebooter processes run into some problem and terminate prematurely. I see no error message, though. Where should I see them? Can I switch on some debug logging? Switching on the debug mode of Surefire changed the test results, so I didn't follow that path very far.
I'm running ~1600 unit tests in ~400 classes which takes ~7 minutes in the best case. Execution time varies greatly, sometimes the whole thing terminates after more than an hour.
In some cases, on the other hand, the surefirebooter processes continue to run after execution has finished (successfully) and put massive load on the system (so they seem to be busy-waiting for something).
Questions:
Can anybody explain these observed phenomena?
Can anybody give advice on what to change in order to get a proper execution (i.e. with the desired number of surefirebooter processes at all times)?
Can anybody give advice on how to debug the situation? See messages about what happens with the surefirebooter processes? (I tried using strace but that also changed the behavior so dramatically that the call didn't terminate anymore.)
My hypothesis #1 would be that the oom_killer is the culprit. #2 would be that the forked processes go into swap and/or spend a crazy amount of time garbage collecting.
To debug:
Which platform do you run this on?
If it's some kind of *nix, could you please check dmesg or /var/log/messages after the run for messages about killed processes?
In cases where you have processes busy-waiting, could you please try to a) collect stack traces with jstack (both for the forked processes and the main one), and b) quantify the massive load on the system in terms of CPU / memory usage / amount of data paged in and out?
If none of those proves useful, I'd try forking Surefire's ForkStarter, adding more logging events, and comparing the logs of successful and failed runs for more clues (use the --debug or -X argument to Maven to output debug messages).
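As an aside, if attaching jstack from the outside proves awkward, roughly the same information can be dumped from inside the JVM. This is only a sketch of the idea, not something Surefire provides; it would have to be triggered from your own test or watchdog code:

import java.util.Map;

public class ThreadDumper {
    // Prints a thread dump roughly comparable to jstack output: name, state and stack frames.
    public static void dumpAllThreads() {
        Map<Thread, StackTraceElement[]> traces = Thread.getAllStackTraces();
        for (Map.Entry<Thread, StackTraceElement[]> entry : traces.entrySet()) {
            Thread t = entry.getKey();
            System.err.printf("\"%s\" state=%s%n", t.getName(), t.getState());
            for (StackTraceElement frame : entry.getValue()) {
                System.err.println("    at " + frame);
            }
        }
    }
}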
We have a weird problem. We are using an automatic test tool. The DSL was implemented in Scala. The system which we test with this tool is written in Java, and the interface between the two components is RMI. Indeed, the interface part of the automatic test tool is also Java (the rest is Scala). We have full control of the source code of these components.
We already have on the order of a thousand test cases. We execute these test cases automatically once every night, using Jenkins on a Linux server. The problem is that we sporadically receive a java.lang.NoClassDefFoundError exception. This typically happens when trying to access Java artifacts from Scala code.
If we execute the same test case manually, or check the result of the next nightly run, the problem typically goes away by itself, but sometimes it happens again in a completely different place. In some runs no such problem appears at all. The biggest problem is that the error is not reproducible; furthermore, as it happens during an automatic run, we have hardly any information about the exact circumstances, just the test case and the log.
Has somebody already encountered such a problem? Do you have any idea how to proceed? Any hint or piece of information would be helpful, not only the solution of the problem. Thank you!
I found the reason for the error (99% sure). We had the following 2 Jenkins jobs:
Job1: Performs a full clean build of the tested system, written in Java, then performs a full clean build of the DSL, and finally executes the test cases. This is a long running job (~5 hours).
Job2: Performs a full clean build of the tested system, and then executes something else on it. The DSL is not involved. This is a shorter job (~1 hour).
We have one single Maven repository for all the jobs. Furthermore, some parts of the tested component are part of the interface between the two components.
Judging from the timestamps, the following happened:
Job1 performed the full build of both components and started a test suite containing several test cases, whose execution lasts about half an hour.
The garbage collector might have swept out the components that had not been used yet.
Job2 started its build, and it also rebuilt the interface parts, including the ones swept out by Job1's garbage collector.
Job1 reached a test case which uses an interface component that had already been swept out.
The solution was the following: we moved Job2 to an earlier time; now it finishes before Job1 starts the tests.
I have a problem that appears in pseudo-distributed mode, but not in standalone mode, and I'm hoping to scratch up some ideas on how to debug this.
Some of my mapper tasks are returning code 143. I'd love to drop a breakpoint on System.exit() and see who's calling it and why, but I have to get the debugger running on that mapper.
I can get the task tracker up in the debugger by modifying my bin/hadoop script and remotely connecting to localhost:5000:
...
elif [ "$COMMAND" = "tasktracker" ] ; then
  CLASS=org.apache.hadoop.mapred.TaskTracker
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_TASKTRACKER_OPTS"
  # TBMark!
  HADOOP_OPTS="$HADOOP_OPTS -Xdebug -Xrunjdwp:transport=dt_socket,address=5000,server=y,suspend=n"
...and I can get the first mapper (or, with a minor tweak, the first reducer) into Eclipse by adding this to my conf/mapred-site.xml and remotely connecting to localhost:5001:
<property>
  <name>mapred.map.child.java.opts</name>
  <value>-Xdebug -Xrunjdwp:transport=dt_socket,address=5001,server=y,suspend=y</value>
</property>
My problem is that the failure happens at random and not on the first mapper.
Unsatisfactory ideas that come to mind include:
Somehow replace System.exit() with my own method that prints a stack trace. (How does one hook a call like that? A sketch of one possibility follows this list.)
Just keep trying to debug the mappers one by one and run each one to completion before debugging the next. (It might work...)
Track down every last place in Hadoop that calls System.exit() and write a distinct signature to a log. (Yuck)
Make the debugger port number variable such that, if I can guess which one is going to fail and the delay doesn't make the bug go away, I can attach to that JVM and debug it. (Many ifs, and I don't know any way to make this variable in the .xml file.)
If failure can be predicted to happen on a certain attempt, break the task tracker just before the JVM launch and hand-edit the script file. (Desperate times call for desperate measures.)
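For the first idea, one way to hook System.exit() is a SecurityManager whose checkExit is overridden. This is only a sketch, not something Hadoop ships: SecurityManager is deprecated in recent JDKs but available on JVMs of this vintage, and it would have to be installed early in the child JVM, e.g. from the mapper's setup code.

import java.security.Permission;

public class ExitTracer {
    // Installs a SecurityManager that logs a stack trace whenever someone calls System.exit().
    public static void install() {
        System.setSecurityManager(new SecurityManager() {
            @Override
            public void checkPermission(Permission perm) {
                // permit everything else; we only care about exit
            }

            @Override
            public void checkExit(int status) {
                new Throwable("System.exit(" + status + ") called").printStackTrace();
                // Optionally throw new SecurityException(...) here to veto the exit and keep
                // the JVM alive long enough to attach a debugger.
            }
        });
    }
}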
Any suggestions or ideas for how to make my bad ideas above work?
You could try to re-run a failed map task with the IsolationRunner.
If it fails again, you should be able to add the debug options!
One day our Java web application went up to 100% CPU usage.
A restart resolved the incident but not the problem, because a few hours later the problem came back.
We suspected an infinite loop introduced by a new version, but we hadn't made any changes to the code or the server.
We managed to find the problem by taking several thread dumps with kill -QUIT and by looking at and comparing every thread's details.
We found that one thread's call stack appeared in all the thread dumps.
After analysis, it turned out there was a while loop whose condition never went false for some data that was regularly updated in the database.
Analyzing several thread dumps of a web application is really tedious.
So do you know of any better way or tools to find such issues in a production environment?
After some searching, I found an answer in Monitoring and Managing Java SE 6 Platform Applications:
You can diagnose a looping thread by using the JDK-provided tool called JTop, which shows the CPU time each thread is using.
With the thread name, you can find the stack trace of that thread in the “Threads” tab, or by taking a thread dump with kill -QUIT.
You can now focus on the code that causes the infinite loop.
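JTop itself is built on the standard ThreadMXBean API, so if the demo jar is not at hand, the same per-thread CPU figures can be pulled with a few lines of code. A rough sketch; in production you would typically query the same MBean remotely over JMX (which is what JTop/JConsole do) rather than run this in-process:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class CpuHogFinder {
    // Prints CPU time per thread; the top consumers are the likely infinite-loop suspects.
    public static void printThreadCpuTimes() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (long id : threads.getAllThreadIds()) {
            ThreadInfo info = threads.getThreadInfo(id);
            long cpuNanos = threads.getThreadCpuTime(id);
            if (info != null && cpuNanos > 0) {
                System.out.printf("%-40s %8d ms%n", info.getThreadName(), cpuNanos / 1_000_000);
            }
        }
    }
}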
PS.: It seems OK to answer my own question according to https://blog.stackoverflow.com/2008/07/stack-overflow-private-beta-begins/ :
[…]
“yes, it is OK and even encouraged to answer your own questions, if you find a good answer before anyone else.”
[…]
PS.: In case the sun.com domain no longer exists:
You can run JTop as a stand-alone GUI:
$ <JDK>/bin/java -jar <JDK>/demo/management/JTop/JTop.jar
Alternately, you can run it as a JConsole plug-in:
$ <JDK>/bin/jconsole -pluginpath <JDK>/demo/management/JTop/JTop.jar
Fix the problem before it occurs! Use a static analysis tool like FindBugs or PMD as part of your build system. It won't find everything, but it is a good first step.
Think about using coverage tools like Cobertura.
It would have shown you that you didn't test these code paths.
Testing something like this can become really cumbersome, so try to avoid it by introducing quality measures.
Anyway, tools like VisualVM will give you a nice overview of all threads, so it becomes relatively easy to identify threads that have been working for an unexpectedly long time.