We've been struggling for quite some time to find the cause of XSL transformations that occasionally perform really badly.
So far there's nothing we can pin down as the real cause, since it can occur under heavy load but also when the server is basically idle. The attached example happened when there were 158 requests in 15 minutes, so no notable load at all.
We suspected some external XML documents that are used within the transformations, but that doesn't seem to be the problem either: they usually load within milliseconds, sometimes maybe seconds, but nothing that would explain the 200+ seconds the requests took.
The same transformations run just fine when we rerun them later to check whether there's a problem.
We are running FusionReactor to monitor our server, but there's nothing unusual to see there either. In yesterday's cases there was neither high CPU load nor anything else out of the ordinary.
I attached a screenshot from FusionReactor's profiler, where you can see the times taken; it always seems to be the "scanDocument" part that takes up 99.x% of the time, if we're interpreting the result correctly.
Is there any way to find out what's causing the delay here?
The versions we are currently running are:
Ubuntu: 14.04.5 LTS
Java: 1.8.0_45
Lucee: 4.5.4.017 final
Well, 99.8% of the time is in SocketInputStream.socketRead0, so I'd blame a slow network connection.
The rest of the program is just waiting for bytes to arrive over that slow connection, which is why you don't see high CPU usage.
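If those external XML documents are fetched over HTTP, one way to confirm (or rule out) the slow-network theory is to fetch them yourself with explicit connect/read timeouts and log the elapsed time, so a stalled remote host surfaces as a timeout instead of a 200-second hang. A rough Java sketch (the URL and timeout values are placeholders):
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class ExternalXmlFetch {
    public static void main(String[] args) throws Exception {
        // Placeholder URL standing in for one of the external XML documents.
        URL url = new URL("https://example.com/external.xml");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();

        // Fail fast instead of blocking in socketRead0 for minutes:
        conn.setConnectTimeout(5_000); // ms allowed to establish the TCP connection
        conn.setReadTimeout(10_000);   // ms allowed for each read to return data

        long start = System.nanoTime();
        try (InputStream in = conn.getInputStream()) {
            byte[] buf = new byte[8192];
            long total = 0;
            for (int n; (n = in.read(buf)) != -1; ) {
                total += n;
            }
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println("Fetched " + total + " bytes in " + ms + " ms");
        }
    }
}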
I am working on a Java process which is conceptually rather simple. It is a single thread constantly fetching data from various sources and making decisions based on them. I have recently noticed a suspicious delay between two log lines where I would not expect much processing to happen (tens of milliseconds of delay versus an expectation of one to a few milliseconds).
Since this suspicious delay is not always there, my first thought was that I had done a poor job of minimizing the need for garbage collection, causing the JVM to pause execution at unwanted times.
While I still believe I did a poor job with that, it doesn't seem to be the cause. I have added the following JVM parameter: -Xlog:gc*=info,safepoint::timemillis,level,tags
and I see no pause between my suspicious log lines. Could there be other JVM pauses that these JVM params would not reveal?
Anyway, would Java pros have any recommendations for efficiently tracking down the source of this latency? Any (preferably free) tools I could use to monitor and understand what's happening?
Environment info: Linux 3.10, Java 11. The process in question is running on an isolated core; other than that I have not done much tuning.
Depending on where the information is coming from, the latency may be in fetching the information itself. For example, if it has to reach out to a server, then connecting to and querying that server may introduce the latency; or maybe you are hitting your disk and are limited by the speed of the drive. Besides that, I am not sure. Maybe add some timing statements like these:
long time = System.currentTimeMillis();   // timestamp before the suspect call
foo();                                    // the code between your two log lines
System.err.println((System.currentTimeMillis() - time) + "ms");
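Beyond ad-hoc timing statements, a cheap way to separate "the whole JVM/OS stalled" from "my code path was slow" is a jHiccup-style watchdog thread: it sleeps 1 ms in a loop and logs whenever it wakes up much later than requested. A rough sketch (the 10 ms reporting threshold is arbitrary):
// A minimal "hiccup detector": stalls reported here affect the whole JVM
// (safepoints, OS scheduling, paging), not just your code path.
Thread hiccupDetector = new Thread(() -> {
    while (!Thread.currentThread().isInterrupted()) {
        long before = System.nanoTime();
        try {
            Thread.sleep(1); // ask for a 1 ms nap
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return;
        }
        long elapsedMs = (System.nanoTime() - before) / 1_000_000;
        if (elapsedMs > 10) { // woke up far later than requested
            System.err.println("Hiccup: slept 1 ms, took " + elapsedMs + " ms");
        }
    }
}, "hiccup-detector");
hiccupDetector.setDaemon(true);
hiccupDetector.start();
If this thread reports stalls at the same moments as your suspicious log gaps, the pause is JVM- or OS-wide; if it stays quiet, the delay is in your own code or in whatever it is waiting on.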
I have been fighting with this issue for ages and I cannot for the life of me figure out what the problem is. Let me set the stage by describing the stack we are using:
Web-based Java 8 application
GWT
Hibernate 4.3.11
MySQL
MongoDB
Spring
Tomcat 8 (including Tomcat connection pooling instead of, for example, C3P0)
Hibernate Search / Lucene
Terracotta and EhCache
The problem is that every couple of days (sometimes every second day, sometimes once every 10 days, it varies), in the early hours of the morning, our application "locks up". To clarify, it does not crash; you just cannot log in or do anything for that matter. All background tasks - everything - just halt. If we attempt to log in while it is in this state, we can see in our log file that it is authenticating us as a valid user, but no response is ever sent, so the application just "spins".
The only pattern we have found to date related to when these "lock ups" occur is that it happens when our morning scheduled tasks or SAP imports are running. It is not always the same process that is running, though; sometimes the lock up happens during one of our SAP imports and sometimes during internal scheduled task execution. All these things have in common is that they run outside of business hours (between 1 am and 6 am) and that they are quite intensive processes.
We are using JavaMelody for monitoring, and what we see every time is that starting at different times in this 1-6 am window, the number of used JDBC connections just starts to spike (as per the attached image). Once that starts, it is just a matter of time before the lock up occurs, and the only way to solve it is to bounce Tomcat, thereby restarting the application.
As far as I can tell, memory, CPU, etc. are all fine when the lock up occurs; the only thing that looks like it has an issue is the constantly increasing number of used JDBC connections.
I have checked the code for our transaction management many times to ensure that transactions are being closed off correctly (the transaction management code is quite old-fashioned: explicit begin and commit in the try block, rollback in the catch blocks, and entity manager close in a finally block). It all seems correct to me, so I am really, really stumped. In addition, I have recently explicitly set the Hibernate connection release mode to after_transaction, but the issue still occurs.
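To illustrate, the pattern looks roughly like this (simplified; the names are placeholders rather than our actual code), with the spots where a pooled connection could in principle leak marked in comments:
import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;

public class TransactionSketch {
    static void doWork(EntityManagerFactory emf) {
        EntityManager em = emf.createEntityManager();
        try {
            em.getTransaction().begin();
            // ... business logic here ...
            em.getTransaction().commit();
        } catch (RuntimeException e) {
            // Only roll back if the transaction is still active; a rollback
            // that itself throws would otherwise mask the original exception.
            if (em.getTransaction().isActive()) {
                em.getTransaction().rollback();
            }
            throw e;
        } finally {
            // If this close() is ever skipped, the underlying JDBC connection
            // can stay checked out of the Tomcat pool and the "used
            // connections" count creeps up exactly as described.
            em.close();
        }
    }
}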
The other weird thing is that we run several instances of the same application for different clients, and this issue only happens regularly for one client. They are the client with by far the most data to be processed, though, and although all clients run these scheduled tasks, this big client is the only one with SAP imports. That is why I originally thought that the SAP imports were the issue, but it locked up just after 1 am this morning, a couple of hours before the imports even start running. In this case it locked up while an internal scheduled task was executing.
Does anyone have any idea what could be causing this strange behavior? I have looked into everything I can think of but to no avail.
After some time and a lot of trial and error, my team and I managed to sort out this issue. It turns out that the spike in JDBC connections was not the cause of the lock-ups but a consequence of them. Terracotta was the culprit; it seems it was simply becoming unresponsive. It might have been a resource allocation issue, but I don't think so, since this was also happening on low-usage servers that had more than enough resources available.
Fortunately we no longer actually needed Terracotta, so I removed it. As I said in the question, we were getting these lock-ups every couple of days, at least once per week, every week. Since removing it we have had no such lock-ups for four months and counting. So if anyone else experiences the same issue and you are using Terracotta, try dropping it; things might come right, as they did in my case.
As coladict said, you need to look at the "Opened jdbc connections" page in the JavaMelody monitoring report just before the server "locks up".
Sorry if you need to do that at 2 or 3 in the morning, but perhaps you can run a wget command automatically during the night.
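If staying up isn't an option, wget in a cron job works, or a small scheduled fetcher can snapshot the page through the night. A rough Java sketch (the URL and report parameter are placeholders; check your JavaMelody setup for the exact path of the opened-connections page):
import java.io.InputStream;
import java.net.URL;
import java.nio.file.*;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.*;

public class NightlySnapshot {
    public static void main(String[] args) {
        // Placeholder URL: adjust host, port and parameter to your setup.
        String url = "http://localhost:8080/monitoring?part=connections";
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            String stamp = LocalDateTime.now()
                    .format(DateTimeFormatter.ofPattern("yyyyMMdd-HHmmss"));
            try (InputStream in = new URL(url).openStream()) {
                // Save each snapshot to a timestamped file for the morning.
                Files.copy(in, Paths.get("connections-" + stamp + ".html"),
                        StandardCopyOption.REPLACE_EXISTING);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }, 0, 15, TimeUnit.MINUTES); // every 15 minutes through the night
    }
}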
I'm loading about 1 million records into Oracle using a custom Java utility. The Java utility is multi-threaded and has worked numerous times in the past with no problem. My issue is that when I start the load for the very first time, it is lightning fast, around 150K objects per hour. After about an hour or two, the performance greatly decreases to around 6000 objects per hour. I'm almost certain that my performance hit has something to do with Oracle, but I can't figure out what it is. The Oracle machine has 16GB of RAM and 8 CPUs. I set the following system parameters, which have worked for me in the past:
optimizer_mode=ALL_ROWS
optimizer_index_cost_adj=10
query_rewrite_integrity=ENFORCED
pga_aggregate_target=300M
sga_target=5000M
sga_max_size=5000M
Does anyone have the Oracle knowledge to explain why my performance is great initially but drops off drastically? One additional note: if I stop the load, restart the machine, and then start the load again, I continue to see the 6000-objects-per-hour performance. So it's always the very first load after cloning our Production database that has the best performance. Hopefully someone has an idea; thanks in advance!
I assume that the load is only inserts and that the distribution of the data changes over time.
Or are they continuous inserts into the same table, like continuously loading Call Detail Records from a phone system?
In principle, Oracle does not easily get slower with increasing and prolonged use. But there are some ways to make it run slower:
Locks / latches
I would recommend checking that concurrent use by other Oracle sessions is not causing the problem through short locks or latches. Given that these are inserts, it could be the other threads trying to insert into the same data blocks, since the distribution of the data might change after some time.
Restricted inserts per block
Please check that MAXTRANS on the tables is not restricted to 1 or 2. I've seen that once, and it was really funny to watch Oracle slow to a crawl when only one session could do something in a block at a time.
SGA and kernel problems
With older Oracle releases (Oracle 7 and 8) I've seen numerous occasions on large systems where Oracle started to kill itself. This especially holds for multiprocessor systems, because locking/latching on an MP system is implemented differently: the other processor might get its work done, so an Oracle thread first just spins a little and then tries again. Also, problems with SGA fragmentation or even bad locking of the SGA can cause problems.
Please check that the insert statements use bind variables, batches, or bypass SQL completely. You might also want to try running the load in a single thread. Is single-threaded processing stable over time (although slower)? If so, you have a locking issue somewhere. Google for locks/latches/spins and follow the scenarios listed.
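To illustrate the bind-variable and batching point: with plain JDBC this usually means one PreparedStatement reused for every row plus addBatch/executeBatch, rather than building a fresh SQL string per insert. A rough sketch (the connection string, table, and column names are made up):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchInsertSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details and table/column names.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "password")) {
            conn.setAutoCommit(false);
            String sql = "INSERT INTO objects (id, name) VALUES (?, ?)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                for (int i = 0; i < 1_000_000; i++) {
                    ps.setInt(1, i);             // bind variables: one parsed
                    ps.setString(2, "obj-" + i); // statement, many executions
                    ps.addBatch();
                    if (i % 500 == 0) {          // flush in manageable chunks
                        ps.executeBatch();
                    }
                }
                ps.executeBatch();
            }
            conn.commit();
        }
    }
}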
I have been having occasional problems with a server I wrote. It's in Clojure, but I don't think that matters, and we can pretend it's in Java. Anyway, it works fine for hours at a time, but goes into fits where it behaves very badly: all activity stops for around fifteen seconds, then it works normally for a few seconds, then stops for fifteen seconds... and so on for (usually) about ten minutes, after which it goes back to behaving normally.
I've done a lot of profiling of it with YourKit, and I've ruled out a number of plausible suspects:
It's not a garbage collection issue: I'm running it with -XX:+UseConcMarkSweepGC, and I've verified that the server continues to run just fine during both minor and major collections, due to the concurrent nature of this garbage collector. And we're not thrashing as we run out of total memory or something: the current heap size is well below its max.
I don't think it's a locking/synchronization issue, but I'm not 100% sure on that. The YourKit profiler shows threads waiting sometimes, e.g. competing over the lock on System.out to produce log messages, but the only long waits are for worker threads in thread pools when there's nothing to do. And of course YourKit says it's never detected any deadlocks.
It's not something caused by having the profiler attached, because it still happens even if I boot the server up and then leave it alone without ever attaching the profiler.
It's not some other process on the system taking up all the CPU time: top shows CPU usage at 100% for my java process, and basically 0% for everything else.
My biggest problem is that I can't see what the server is doing during these strange funks, because the profiler stops receiving samples. Here's the CPU usage chart:
The left side of the graph is normal operation, during which we get profiler samples every second or so. The right side is "broken", and is very spiky because the profiler is only getting samples every ten seconds or so. In the samples it does get, the server seems to be doing its usual business: responding to requests and so on; and the logs confirm that it is doing normal stuff, but only at the times the profiler has samples for: during the upward-sloping "straight lines" on the graph, for which the profiler has no samples, the server is doing nothing at all.
So, does this graph look familiar to anyone? Have you had this problem before and fixed it? Or can you point me in the direction of a tool that can figure out what my server is doing during the time when YourKit can't? In case it matters, the server machine is running Ubuntu 10.04, and
$ java -version
java version "1.6.0_22"
OpenJDK Runtime Environment (IcedTea6 1.10.10) (rhel-1.28.1.10.10.el5_8-x86_64)
OpenJDK 64-Bit Server VM (build 20.0-b11, mixed mode)
Okay, from the comments it seems clear to me we are not going to be able to figure this out with the information you've given so far. The best we can do is to give suggestions on how to debug it...
I would try to use jstack during one of the spikes and see if you can use that to figure out where it hangs.
If you have no way to measure or debug in code, try to look at it from the outside.
First, I would try to reproduce the problem. In other words: is there an external event that produces the behavior? Try changing the load on the server. Switch everything you can in order to reproduce the problem.
Maybe it's also a good idea to sniff the network traffic (tcpdump) to find something interesting around the time your server hangs.
You can also run it on another operating system to check whether it depends on your installation environment.
If you can't reproduce a situation where the problem occurs, try to find situations where you don't get the problem. For instance, remove the server from the network, or shut down all other services.
If none of that changes the behavior of your program, try reducing the complexity of the working code and see whether you can find an internal module that seems to be related to the problem.
Have you had this problem before and fixed it? Or can you point me in the direction of a tool that can figure out what my server is doing during the time when YourKit can't?
If you have shell access on the server and can see stdout, try taking a thread dump when the server becomes unresponsive. I'm not sure whether this will give you anything different from what jstack (mentioned in the other answer) would.
On Ubuntu: kill -QUIT <java-pid> (this does not actually kill the Java process; it makes the JVM print a thread dump to stdout).
http://www.crazysquirrel.com/computing/java/basics/java-thread-dump.jspx
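If shell access is awkward at the moment things go wrong, roughly the same information can be captured from inside the JVM with ThreadMXBean, for example from a watchdog thread that fires when requests stop completing. A minimal sketch:
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumper {
    // Dumps all thread stacks to stderr; call this from a watchdog thread or
    // a remote trigger when the server stops responding.
    public static void dumpThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
            System.err.println("\"" + info.getThreadName() + "\" " + info.getThreadState());
            for (StackTraceElement frame : info.getStackTrace()) {
                System.err.println("    at " + frame);
            }
            System.err.println();
        }
    }
}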
We have a 64-bit Linux machine and we make multiple HTTP connections to other services; the Drools Guvnor website (a rule engine, if you don't know it) is one of them. In Drools, we create a knowledge base per rule being fired, and creating a knowledge base makes an HTTP connection to the Guvnor website.
All other threads get blocked and CPU utilization goes up to ~100%, resulting in an OOM. We can make changes to compile the rules only every 15-20 minutes, but I want to be sure of the problem in case someone has already faced it.
I checked for "cat /proc/sys/kernel/threads-max" and it shows 27000 threads, Can it be a reason?
I have a couple of questions:
When do we know that we are running over capacity?
How many threads can be spawned internally (any rough estimate or formula relating diff parameters will work)?
Has anyone else seen similar issues with Drools? Concurrent access to the Guvnor website is basically causing the issue.
Thanks!
I am basing my answer on the assumption that you are creating a knowledge base for each request, and that this knowledge base creation includes downloading the latest rule sources from Guvnor; please correct me if I am mistaken.
I suspect that the build/compilation of packages is taking time and hogging your system.
Instead of compiling packages on each and every request, you can download pre-built packages from Guvnor, and you can also cache these packages locally if your rules do not change much. The only restriction is that you need to use the same version of Drools both on Guvnor and in your application.
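Whichever way you obtain the package (pre-built download or local compile), the key is to do it once and reuse the result instead of rebuilding per request. The exact Guvnor/Drools calls depend on your version, so the loader below is left as a placeholder; the caching shell itself is a minimal sketch assuming Java 8+:
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

// Generic cache wrapper: build each knowledge base once per package URL and
// reuse it, instead of rebuilding (and re-downloading from Guvnor) per request.
public class KnowledgeBaseCache<T> {
    private final ConcurrentMap<String, T> cache = new ConcurrentHashMap<String, T>();

    public T get(String packageUrl, Supplier<T> loader) {
        // computeIfAbsent runs the loader at most once per key, even when
        // many request threads arrive concurrently.
        return cache.computeIfAbsent(packageUrl, key -> loader.get());
    }
}
You would use it as cache.get(packageUrl, () -> loadPrebuiltPackage(packageUrl)), where loadPrebuiltPackage is a hypothetical stand-in for whatever download/deserialization call your Drools version provides.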
I checked for "cat /proc/sys/kernel/threads-max" and it shows 27000
threads, Can it be a reason?
That number does look large, but we don't know whether the majority of those threads belong to your Java app. Create a Java thread dump to confirm this. Your thread dump will also show the CPU time taken by each thread.
When do we know that we are running over capacity?
You have 100% CPU and an OOM error. You are over capacity :) Jokes aside, you should monitor your HTTP connection queue to determine what you are doing wrong. Your post says nothing about how you are handling the HTTP connections (presumably through some sort of pooling mechanism backed by a queue?). I've seen containers and programs queue requests infinitely, causing them to crash with a big bang; see the bounded-pool sketch after the list below. Plot the following graphs to isolate your problem:
The number of blocking threads over time
Time taken for each thread
Number of threads per thread pool and how they increase / decrease with time (pool size)
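The "queue requests infinitely" failure mode mentioned above is exactly what an unbounded work queue gives you. For comparison, a hand-rolled pool with a bounded queue rejects excess work instead of accumulating it until OOM (the sizes here are placeholders, to be tuned from the load test described below):
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedPool {
    public static ThreadPoolExecutor create() {
        // Illustrative sizes only; derive real values from a load test.
        return new ThreadPoolExecutor(
                8, 32,                                  // core and max threads
                60, TimeUnit.SECONDS,                   // idle thread keep-alive
                new ArrayBlockingQueue<Runnable>(100),  // bounded queue: no infinite backlog
                new ThreadPoolExecutor.AbortPolicy());  // reject instead of piling up
    }
}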
How many threads can be spawned internally (any rough estimate or formula relating diff parameters will work)?
Only a load test can answer this question. Load your server and determine the number of concurrent users it can support at 60-70% capacity. Note the number of threads spawned internally at that point; that is your peak capacity (allowing room for unexpected traffic).
Has anyone else seen similar issues with Drools? Concurrent access to the Guvnor website is basically causing the issue.
I can't help there, since I've not accessed Drools this way. Sorry.