Recently, we migrated our backend APIs from Grails 3 to Grails 5.1.1.
Along with that, we upgraded Java to version 11. Everything is running on Docker. Nothing else has changed.
After the migration, we are now facing performance issues. But it's a weird one.
First, we got some results from NewRelic:
NewRelic is showing that org.grails.web.mapping.mvc.GrailsControllerUrlMappingInfo is to blame. There is nothing else underneath it that is slow.
Digging a bit deeper, we found an article (from a while ago) which claims that NewRelic is not instrumenting Grails very well.
At this point, we tried to reproduce the issues locally, and we did. We created a simple function that wraps whatever we need with a timer to measure how long things take to execute:
    def withExecutionTimeLog(Closure closure) {
        Long start = System.currentTimeMillis()
        def result = closure()
        log.warn "Execution time -> ${System.currentTimeMillis() - start} ms"
        result
    }
And as an example, used it for one of the slowest endpoints:
This is a controller (the highest(ish) point under our control)
    def create(Long workspaceId, Long projectId) {
        withExecutionTimeLog {
            canUpdateProject(projectId) {
                withExceptionsHandling {
                    CecUtilization utilization =
                        cecUtilizationService.create(workspaceId, projectId, getCurrentUser())
                    [utilization: utilization]
                }
            }
        }
    }
So, in this case, the timer function wraps everything including the DB calls, all the way to view rendering.
And here is the result:
The full request cycle takes 524ms (out of which 432ms is server-side).
From that little execution time logger I got 161ms.
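To make the gap between the two numbers explicit: 432ms server-side minus the 161ms measured inside the closure leaves 271ms that the timer never sees. Below is a minimal Java sketch of the same timing idea, using System.nanoTime() because it is monotonic. The class and method names (ExecutionTimer, time, unaccountedMillis) are made up for illustration and are not part of Grails or of the code above.

```java
import java.util.function.Supplier;

final class ExecutionTimer {

    /** A computed value together with how long it took to produce. */
    static final class Timed<T> {
        final T value;
        final long elapsedMillis;

        Timed(T value, long elapsedMillis) {
            this.value = value;
            this.elapsedMillis = elapsedMillis;
        }
    }

    /** Runs the work once and records the elapsed wall-clock time. */
    static <T> Timed<T> time(Supplier<T> work) {
        long start = System.nanoTime(); // monotonic, unaffected by clock changes
        T value = work.get();
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        return new Timed<>(value, elapsedMillis);
    }

    /** Time spent on the server but outside the measured block. */
    static long unaccountedMillis(long serverSideMillis, long measuredMillis) {
        return serverSideMillis - measuredMillis;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the controller body.
        Timed<String> timed = time(() -> "model");
        System.out.println(timed.value + " took " + timed.elapsedMillis + " ms");

        // Numbers from the measurements above: 432 - 161 = 271
        System.out.println("Unaccounted: " + unaccountedMillis(432, 161) + " ms");
    }
}
```

Where those 271ms actually go (request filters, URL mapping, something after the action returns) is exactly what the measurements do not show, so a profiler is still needed.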
Given all of that, it looks like NewRelic is kinda right. Something else is wasting cycles. The biggest question is WHAT?
How can I profile that? Maybe someone came across a similar situation.
How can I resolve it?
On another note, since we have all the possible monitoring tools in place, I can tell for sure that the servers are not doing much (including the DB). The allocated heap size is up to 3G.
Usually, the first request to the same endpoint takes much longer, and the consecutive ones are much better but still slow.
UPDATE
Decided to dig deeper. Hooked up the Async Profiler and I think my original assumptions were proved true.
The class loader seems to be the one causing the performance issues. This also explains the slowness of the first request (of any type), while subsequent ones start working faster. But, if I do the same call in another 4-5 minutes, it loads the classes again and is slow.
So, unless I am mistaken, the issue here is the class loader. I am blaming TomcatEmbeddedWebappClassLoader for this.
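One way to check whether classes really are being loaded again after the idle period is to read the JVM's own class loading counters before and after a slow request. This is only a sketch using the standard java.lang.management API; the ClassLoadingProbe class is a made-up name, and where you call it from (a controller, a scheduled job) is up to you. Starting the JVM with -Xlog:class+load (Java 9 and later) gives a similar picture in the logs.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;

final class ClassLoadingProbe {

    /** Classes loaded since JVM start (cumulative, never decreases). */
    static long totalLoaded() {
        return ManagementFactory.getClassLoadingMXBean().getTotalLoadedClassCount();
    }

    /** Classes unloaded since JVM start. */
    static long totalUnloaded() {
        return ManagementFactory.getClassLoadingMXBean().getUnloadedClassCount();
    }

    /** One-line summary suitable for a log statement. */
    static String summary() {
        ClassLoadingMXBean bean = ManagementFactory.getClassLoadingMXBean();
        return "loaded now=" + bean.getLoadedClassCount()
                + ", loaded total=" + bean.getTotalLoadedClassCount()
                + ", unloaded total=" + bean.getUnloadedClassCount();
    }

    public static void main(String[] args) {
        long before = totalLoaded();
        // Stand-in for handling a request: touch a class that is
        // probably not loaded yet.
        new java.util.concurrent.ConcurrentSkipListMap<String, String>();
        long after = totalLoaded();
        System.out.println("Classes loaded during the call: " + (after - before));
        System.out.println(summary());
    }
}
```

If the total loaded count jumps on the slow request and the unloaded count grew during the idle time, classes are being unloaded and reloaded. If neither moves much, the time is more likely going into class or resource lookups than into actual loading.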
Now, the biggest question is how do I fix that in Grails? I am running a war file in prod -> java -Dgrails.env=prod -Xms1500m -Xmx3000m -jar app.war
I found this post that is somewhat in the same direction, but I am not sure how to wire it in, in Grails.
UPDATE #2
Looking for the class loader solution brought me to this issue. The proposed solution is for Spring. I am wondering what is the way to solve that for grails.
The issue at https://github.com/spring-projects/spring-boot/issues/16471 includes the following comment:
"It's quite low priority as the problem can be avoided by using .jar packaging rather than .war packaging or, I believe, by switching to another embedded container. As things stand, we have no plans to tackle this in the 2.x timeframe."
So I ran the command './gradlew bootJar'.
I then ran a test with siege, and the first request went from 1.62 seconds to 0.47 seconds.
Run the siege command to check the total time of several users making the request at the same point.
I made a repository trying to do what you said
https://github.com/fernando88to/slowness5
Related
I'm having trouble with a Jetty 9 server application that seems to go into some kind of resting state after a longer period of idleness. Normally the memory usage of the Java process is ~500 MB, but after being idle for some time it seems to drop down to less than 50MB. The first request that comes in takes up to several seconds to respond, whereas requests normally take tens of milliseconds. But after one or two requests the application seems to be back to its normal responsive state.
I'm running on the 32-bit Oracle Java 8 JVM. My JVM configuration is very basic:
java -server -jar start.jar
I was hoping that this issue might be solvable through JVM configuration. Does anyone know if there's any particular parameter to disable this type of behavior?
edit: Based on the comment from Ivan, I was able to identify the source of the issue. Turns out Windows was swapping parts of the Java process out to disk. See my own answer below for a description of my solution.
Based on the comment from Ivan, I was able to identify the source of the issue. Turns out Windows was swapping parts of the Java process out to disk. This was clearly visible when comparing the private working set to the commit size in the task manager.
My solution to this was two-fold. First, I made a simple scheduled job inside my server app that runs every minute and does a simple test run to make sure that the important services never go inactive for long periods. I'm hoping this should ensure that Windows doesn't regard the related pages as inactive.
Afterwards, I also noticed that the process was executing with "Below normal" priority. So I changed the script that starts the server to ensure that it's running with "High" priority going forward. This seems likely to affect swapping behavior and may very well also have been enough to resolve the issue on its own, but I only found it after already deploying my first solution so that remains unclear. In any case, everything seems to be working as it should now.
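The scheduled keep-warm job described above could look roughly like this. It is a sketch, not the code from the answer: KeepWarm and the probe passed to it are hypothetical, and the probe would be whatever cheap call exercises the important services.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class KeepWarm {

    /**
     * Runs the probe immediately and then at a fixed rate on a daemon
     * thread, so the scheduler never blocks JVM shutdown.
     */
    static ScheduledExecutorService start(Runnable probe, long period, TimeUnit unit) {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor(runnable -> {
                    Thread thread = new Thread(runnable, "keep-warm");
                    thread.setDaemon(true);
                    return thread;
                });

        // scheduleAtFixedRate cancels all future runs if one run throws,
        // so failures are caught and logged here instead.
        Runnable guarded = () -> {
            try {
                probe.run();
            } catch (RuntimeException e) {
                System.err.println("keep-warm probe failed: " + e);
            }
        };

        scheduler.scheduleAtFixedRate(guarded, 0, period, unit);
        return scheduler;
    }

    public static void main(String[] args) throws InterruptedException {
        // Hypothetical probe; a real one would call into the services.
        ScheduledExecutorService scheduler =
                start(() -> System.out.println("touching services"), 1, TimeUnit.MINUTES);
        Thread.sleep(100);
        scheduler.shutdownNow();
    }
}
```

Wrapping the probe matters because a single uncaught exception would otherwise silently stop the job, and the pages could go cold again.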
The auto complete stalls so frequently and for so long, I quit using it altogether.
I've had success with the following using Eclipse (Classic) 3.6.1 on Windows 7 x64.
"A workaround, until the fix is released in 3.6.2 is summarized here: http://groups.google.com/group/android-developers/msg/0f9d2a852e661cba"
(copied for convenience)
"You can replace your /plugins/
org.eclipse.jdt.core_3.6.1.v_A68_R36x.jar plugin with one from
http://www.google.com/url?q=http://adt-addons.googlecode.com/svn/patches/org.eclipse.jdt.core_3.6.1.v_A68_R36x.zip&ei=vg5aTf2RIMrUgAeI-qTvDA&sa=X&oi=unauthorizedredirect&ct=targetlink&ust=1297749446528273&usg=AFQjCNFv7FGlTrnoVhRGE35JPjHxOwI_Bw
and restart Eclipse. Content Assists will be much better. Just try it.
Don't forget backup your original plugins. "
This solved part of my problem.
In preferences, I defaulted all the 'Java->Editor->Content assist' screens and the performance is much improved. Any lag I have now is due to system speed and is negligible. I've gone from minutes to seconds building the suggestion list.
UPDATE: This didn't completely solve my problem, but it got me close. The search continues...
UPDATE: I'm developing in Java for Android using the default packages that are included and any that might have come down during an update (in retrospect, choosing "update all" in the SDK update might not have been wise). The timing is fairly consistent online and offline. I did a few tests and found the following:
Start up Eclipse and enter a line of code that can use a .toString(). Typing the '.' populates the auto complete within 2-3 seconds. Type a 't' and it takes 70-75 seconds. After that, 10 seconds. Different objects do the same thing (75 the first time, 10 after that). It's the filtering process that appears to stall. My CPU does not max out and memory is OK, but the program goes "not responding" until it's done. Anything typed ahead gets cached and eventually filters the list when Eclipse starts responding.
For me the problem went away when I increased the memory for the vm.
Put this in your eclipse.ini:
-Xms512m
-Xmx1024m
On my 4GB Windows Vista system this would happen A LOT (as well as debug issues when looking up variables).
This all went away after I built my new PC with 8GB RAM. I can now run 4 emulators simultaneously and it doesn't have any debug problems any more either. Auto complete with huge lists also works just fine.
It would seem to be just an issue with how much RAM you've got.
Here is the situation: I had an app with a cold start time of about 4 seconds. I was trying to improve the cold start time by removing a bunch of libraries and code I didn't really need. After doing that the cold start time was about 3 seconds latency, and 3 seconds CPU time used.
I changed the version number in appengine-web.xml, and nothing else. And now I have two versions of my app that have the exact same code, up and running.
For cold starts, the newer version uses 1800ms to 1900ms in CPU time.
For cold starts, the older version uses 2400ms to 3000ms in CPU time.
The exact same jsp page from each version is requested to test the cold start time. So far I have sampled 7 cold starts for each version.
hmmm, I think it is possible that there is some kind of caching of the look of your application, since gae upload is basically differential update (you send only changed files).
If you posted many changes on one version id, it is possible that GAE has many snapshots of your code.
Thus, if you make big changes (this is my rule of thumb), you should always change the version of your application, just to be sure. I use additional commits only for bug fixes, never for big refactorings or for adding or removing JARs. I think at that point you also get new logs and are simply "refreshing the installation" of your application, so GAE can do some optimizations...
Agree?
Sounds like a fluke, I don't see how changing the version number of your program could generate a change in speed. Unless there was a coincidental library update or some such.
Could the version number be changing an execution path somewhere? Perhaps in the XML parser or data binding that happens before your app is running?
I have developed two small Java applications - a vanilla Java app and a Java Web application (i.e. Spring MVC, Servlets, JSP, etc.).
The vanilla application consists of several threads which read data continuously at varying rates (from once a second to twice a minute) from several websites, process the data and write it to a database.
The Web Application reads the data from the database and presents it using JSPs, etc.
I'd now like to deploy the applications to a Linux machine and have them run 24 x 7.
If the applications crash I would like them to be restarted.
What's the best way of doing this?
Your web container will run 24x7 by default. If your deployed application throws an exception, it's captured by the container. I wouldn't normally expect this process to not run. Perhaps if threads run away, then it may become unresponsive, so it's worth monitoring (perhaps by a separate process querying it via HTTP?).
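The "separate process querying it via HTTP" idea can be sketched as follows. This assumes Java 11 or later for java.net.http; HealthProbe and the URL are placeholders, and the choice to treat only 2xx responses as healthy is mine.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

final class HealthProbe {

    /**
     * Returns true if the URL answers with a 2xx status within the
     * timeout, false on any connection error, timeout, or other status.
     */
    static boolean isUp(String url, long timeoutMillis) {
        Duration timeout = Duration.ofMillis(timeoutMillis);
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(timeout)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(timeout)
                .GET()
                .build();
        try {
            HttpResponse<Void> response =
                    client.send(request, HttpResponse.BodyHandlers.discarding());
            int status = response.statusCode();
            return status >= 200 && status < 300;
        } catch (IOException e) {
            // Connection refused, timeout, and similar failures.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Placeholder URL; point this at a cheap page of the web app.
        String url = args.length > 0 ? args[0] : "http://localhost:8080/";
        System.out.println(url + " up? " + isUp(url, 2000));
    }
}
```

Run from cron or a monitoring tool, a false result would be the trigger for an alert or a restart.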
Does your vanilla application need to run at regular intervals ? If so, then cron is a good bet. It'll invoke a new instance every 'n' minutes (or however you configure it). If your instance suffers a problem, then it'll simply bail out and a new instance will be launched at the next configured interval. Again, you should probably monitor this (capture log files?) in case some problem determines that it'll never succeed completely.
With Ubuntu's upstart you can respawn processes automatically. A slightly more low-level option is to put the respawn directly in /etc/inittab. Both work well; upstart is more manageable (more tools) but requires a newer system (Ubuntu, Fedora, and Debian is switching soon).
For inittab you need to add a line like this to /etc/inittab (from the inittab manpage):
12:2345:respawn:/path/to/myapp flags
For upstart you do something similar (this is a standard script in ubuntu 9.10):
user@host:/etc/init$ cat tty1.conf
# tty1 - getty
#
# This service maintains a getty on tty1 from the point the system is
# started until it is shut down again.
start on stopped rc RUNLEVEL=[2345]
stop on runlevel [!2345]
respawn
exec /sbin/getty -8 38400 tty1
Check out the ServletContextListener, this allows you to embed your java application inside your web application (by creating a background thread). Then you can have it all running inside the web container.
Consider investigating and using a web container supported by the operating system vendor, so that all the scripts to bring it up and down (including in case of problems) are written and maintained by somebody other than you.
E.g. Ubuntu has Tomcat as a package.
I have a crontab job running every 15 minutes to see if the script is still running. If not, it restarts the service. The script itself is a piece of Perl code:
#!/usr/bin/perl
use diagnostics;
use strict;

# Collect the command-line flags into a lookup table.
my %is_set;
for (@ARGV) {
    $is_set{$_} = 1;
}
my $verbose = -1;
if ($is_set{"--verbose"}) {
    $verbose = 1;
}

# Each component runs as a process whose command line contains "<name>-xws".
my @components = ("cdk", "qsar", "rdf");
foreach my $comp (@components) {
    print "Checking component $comp\n" if ($verbose == 1);
    # Count matching processes, filtering out lines that contain "ps aux".
    my $bla = `ps aux | grep component | grep $comp-xws | grep -v "ps aux" | wc -l`;
    $bla =~ s/\n|\r//g;
    if ($bla eq "1") {
        print " ... running\n" if ($verbose == 1);
    } else {
        # Not exactly one match: start the component in the background.
        print " ... restarting component $comp\n" if ($verbose == 1);
        system "cd /home/egonw/runtime/$comp; sh runCDKcomponent.sh &";
    }
}
First, when a problem occurs, it is in general a good idea to have a human look at it to find the root cause, as restarting a service without any other action will in many cases not magically solve the issue. The common way to handle this situation is to use a monitoring solution offering some kind of alerting (by email, SMS, etc.) to let a human know that something is wrong and needs human action. For example, have a look at HypericHQ, OpenNMS, Zenoss, Nagios, etc.
Second, if you want to offer some kind of highly available service, running multiple instances of the service (this is often referred to as clustering) would be a good idea. When doing so, if one instance goes down, the service won't be totally interrupted, obviously. Note that when using a cluster, if one node goes down because of too heavy load, it's very unlikely that the remaining part of the cluster will be able to handle the load so clustering isn't an absolute guarantee in all situations. Implementing this (at least for the web application) depends on the application server or servlet engine you are using.
But actually, if you are looking for something simple and pretty straight forward, I'd warmly suggest to check monit which is really a better alternative to a custom cron job (don't reinvent the wheel, monit is precisely doing what you want in a smart way). See this article for an introduction:
monit is a utility for managing and monitoring processes, files, directories and devices on a Unix system. Monit conducts automatic maintenance and repair and can execute meaningful causal actions in error situations. For example, monit can start a process if it does not run, restart a process if it does not respond and stop a process if it uses too much resources. You may use monit to monitor files, directories and devices for changes, such as timestamps changes, checksum changes or size changes.
Java Service Wrapper may help with keeping the Java program up 24x7 (or very close).
Several years ago I worked on a project using Java 1.2 and our goal was to run 24x7. We never made it. The longest we managed to keep Java running was about 2-3 weeks. Twice it crashed after about 15 days. The first time we just restarted it; the second time a colleague did some research and found that the crash was due to an int variable overflowing in the Calendar class: the JdbcDriver had called new Date(year, month, day, hour, minute, second) more than about 300 million times and each call had incremented the int 6 times. I think this particular bug may be fixed, but you may find there are others that you encounter as you try to keep the JVM running for a long time.
So you may need to design your application to be restarted occasionally to avoid this kind of thing.
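The arithmetic behind that overflow is easy to check. The sketch below assumes a plain int counter that starts at zero and grows by 6 per call, as in the story above; OverflowDemo is just an illustrative name and not the actual Calendar code.

```java
final class OverflowDemo {

    /** Number of calls an int counter survives before it would overflow. */
    static int callsUntilOverflow(int incrementsPerCall) {
        return Integer.MAX_VALUE / incrementsPerCall;
    }

    /** Value the int counter holds after the given number of calls. */
    static int counterAfter(long calls, int incrementsPerCall) {
        // The narrowing cast reproduces the silent wrap-around of int arithmetic.
        return (int) (calls * incrementsPerCall);
    }

    public static void main(String[] args) {
        int safeCalls = callsUntilOverflow(6);
        System.out.println("Safe calls: " + safeCalls);                    // 357913941
        System.out.println("Counter then: " + counterAfter(safeCalls, 6)); // 2147483646
        System.out.println("One more call: " + counterAfter(safeCalls + 1L, 6)); // negative
    }
}
```

With 6 increments per call the counter wraps to a negative value after roughly 358 million calls, which fits the "more than about 300 million" observed before the crash.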
[UPDATE: I forgot to add that this 30 sec. freezing problem only happens the first time I try to load a file from the server. Subsequent loads are very quick. Maybe some strange reverse DNS lookup? I am hosting on Google's appengine.]
I started a little project recently called http://www.chartle.net which is built around an applet.
Startup time is an important factor in the user's experience of an applet. I collect statistics and am shocked to find that startup times are often very long (a factor of 50 to 100 higher than necessary).
The applet starts in 1-3 seconds depending on the speed of your computer and connection. Still for some users it takes up to 100 sec.
I have mixed results from my own tests. Mostly it is very fast, but sometimes it freezes the browser for a long time and the Java console doesn't tell me why. My best guess is that it stalls when loading a saved chart.
Please help me figure this out - the best test is opening an already saved chart (click on one of the 'create' links at http://www.chartle.net/gallery)
Cheers,
Dieter
This is generic help rather than specific for your demo (which loaded fine for me in a few attempts).
Freezing applets
In the JDK bin directory there is a very handy programme called jstack. Refresh your browser window until it crashes and then run:
jstack *process_id*
This will give you the stack trace of any frozen Java process. If Java is not a separate process then you can use the browser's process (eg for Opera).
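If attaching jstack from outside is awkward, roughly the same information can be collected from inside the JVM with the standard java.lang.management API. This is a sketch only: the ThreadDump class is a made-up name, and it has to be triggered from a thread that is not itself frozen.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

final class ThreadDump {

    /** Builds a jstack-like text dump of all live threads. */
    static String dump() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        // false, false: skip lock details, which keeps the call cheap.
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    /** Number of threads the JVM currently reports as deadlocked. */
    static int deadlockedThreadCount() {
        long[] ids = ManagementFactory.getThreadMXBean().findDeadlockedThreads();
        return ids == null ? 0 : ids.length; // null means no deadlock found
    }

    public static void main(String[] args) {
        System.out.println(dump());
        System.out.println("Deadlocked threads: " + deadlockedThreadCount());
    }
}
```

A non-zero deadlock count points at the synchronized block and invokeAndWait kind of problem; a dump with threads stuck in network reads points at resource fetching instead.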
The following few problems were/are common for me:
I recommend you use invokeLater rather than invokeAndWait in the init method (although you can't do this if you use start/stop methods)
Opera's custom java plugin acts very poorly...
Deadlocks caused by synchronized blocks and invokeAndWait calls
Slow applets
Possibly the browser is fetching resources from the server, unable to use the jar file?
It may be that only the old plugin causes these problems. That means basically all people running on OSX and other users with Java prior to 1.6_update_10.
So, I would really appreciate people with such setups to watch their Java console and describe the first startup behaviour.
Cheers,
Dieter