I'm building a large import script that uses functionality from a separate code base that I suspect of having a memory leak. It calls the code base as many as 10000 times for the same operations and while the first is relatively quick (2 sec) the script is requiring a long time to run (over 100 hours and counting) and by the end the same task is up to 60 sec or more (and still climbing). What is the best way to work around this while the leaks are found and fixed?
Some solutions that have been brainstormed would be:
Create a process that runs a part of the script then end it, reclaiming the resources it used.
Use a shell script to launch the program multiple times completing a sub-set of the tasks each time and have the updated data output to file to be used by the next iteration
edit: Changed the way the question was phrased to make it clear that the import and the code base are separate programs
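The second brainstormed workaround (run a subset of the tasks per process, so the operating system reclaims everything when each process exits) could look roughly like the sketch below. It is written in Java purely for illustration; the worker program name "import-worker" and its --from/--to flags are hypothetical, and the batch size of 500 is an arbitrary guess.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class BatchedImport {
    /** Splits task indices [0, total) into half-open ranges of at most batchSize tasks. */
    static List<int[]> batches(int total, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<int[]> ranges = new ArrayList<>();
        for (int start = 0; start < total; start += batchSize) {
            ranges.add(new int[] {start, Math.min(start + batchSize, total)});
        }
        return ranges;
    }

    /** Builds the command line for one child process ("import-worker" is hypothetical). */
    static List<String> commandFor(int[] range) {
        return Arrays.asList("import-worker",
                "--from", String.valueOf(range[0]),
                "--to", String.valueOf(range[1]));
    }

    public static void main(String[] args) throws Exception {
        boolean dryRun = args.length == 0; // with no arguments, only print the plan
        for (int[] range : batches(10_000, 500)) {
            List<String> command = commandFor(range);
            if (dryRun) {
                System.out.println(String.join(" ", command));
                continue;
            }
            Process child = new ProcessBuilder(command).inheritIO().start();
            int exitCode = child.waitFor(); // memory held by the child is released when it exits
            if (exitCode != 0) {
                throw new IllegalStateException("batch starting at " + range[0] + " failed: " + exitCode);
            }
        }
    }
}
```

Each child would still have to write its results somewhere the next batch can read them, as the question already notes.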
You know, none of the evidence you have presented clearly points to a storage leak. The real problem could be something completely different, like a poorly designed algorithm, or a poorly tuned database table or query.
Assuming that this is a storage leak and applying "band-aid" solutions could be a waste of time, or actually make the problem worse.
You will be better off spending the time up front to determine what the real problem is and fix it, rather than trying a series of workarounds ... which may turn out to be futile.
I solved this issue by minimizing the scope that contains references to the other codebase. Basically, every time I initialized an object or called a function from the other codebase, I jumped through hoops to make sure it existed for the shortest time possible, often setting references back to NULL to make sure all references were removed.
This ended up working excellently: it reduced the running time from over 150 hours and counting to under 30.
This is something I think I've seen before with other profiling tools in other environments, but it's particularly dramatic in this case.
I'm taking a CPU profile of a task that runs for about 12 minutes, and it's showing almost half the time spent in a method that literally does nothing: it's got an empty body. What can cause this? I don't believe that the method is being called a ridiculous number of times, certainly not to account for half the execution time.
For what it's worth, the method in question is called startContent() and it's used to notify a parsing event. The event is passed down a chain of filters (perhaps a dozen of them), and the startContent() method on each filter does almost nothing except to call startContent() on the next filter in the chain.
This is pure Java code, and I'm running it on a Mac.
Attached is a screen shot of the CPU sampler output:
and here is a sample showing the call stack:
(After a delay due to vacation) Here are a couple of pictures showing the output from the profiler. These figures are much more what I would expect the profile to look like. The profiler output seems entirely meaningful, while the sampler output is spurious.
As some of you will have guessed, the job in question is a run of the Saxon XML schema validator (on a 9 GB input file). The profile shows about half the time being spent validating element content against simple types (which happens during endElement processing) and about half being spent testing key constraints for uniqueness; the two profiler views highlight the activity involved in these two aspects of the task.
I'm not able to supply the data as it comes from a client.
I have not used VisualVM, but I suspect the problem is likely because of the instrumentation overhead on such an empty method. Here's the relevant passage in JProfiler's documentation (which I have used extensively):
If the method call recording type is set to Dynamic instrumentation, all methods of profiled classes are instrumented. This creates some overhead which is significant for methods that have very short execution times. If such methods are called very frequently, the measured time of those methods will be far too high. Also, due to the instrumentation, the hot spot compiler might be prevented from optimizing them. In extreme cases, such methods become the dominant hot spots although this is not true for an uninstrumented run. An example is the method of an XML parser that reads the next character. This method returns very quickly, but may be invoked millions of times in a short time span.
Basically, a profiler adds its own "time length detection code" around each method, and for an empty method the measured time consists almost entirely of that added code rather than of the method itself.
If possible, I recommend telling VisualVM to stop instrumenting that thread, if it supports that kind of filtering.
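To get a feel for the effect, here is a small sketch that wraps a nearly empty method in hand-written timing probes. The probes are only a stand-in that I am assuming for illustration; this is not what VisualVM or JProfiler actually inject, and the numbers will vary by machine and JIT state.

```java
final class InstrumentationOverheadDemo {
    static long sink;        // touched by the method so the call is not trivially removable
    static long probedNanos; // time the probes attribute to the method

    static void nearlyEmpty() {
        sink++;
    }

    /** Calls the method n times with no probes and returns the wall-clock time in nanoseconds. */
    static long runPlain(int n) {
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            nearlyEmpty();
        }
        return System.nanoTime() - start;
    }

    /** Same loop, but every call is bracketed by an entry and an exit timestamp. */
    static long runInstrumented(int n) {
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            long enter = System.nanoTime();          // entry probe
            nearlyEmpty();
            probedNanos += System.nanoTime() - enter; // exit probe
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int n = 5_000_000;
        System.out.println("plain loop:        " + runPlain(n) / 1_000_000 + " ms");
        System.out.println("instrumented loop: " + runInstrumented(n) / 1_000_000 + " ms");
        System.out.println("time blamed on nearlyEmpty(): " + probedNanos / 1_000_000 + " ms");
    }
}
```

Whatever time ends up "blamed" on nearlyEmpty() is mostly the cost of reading the clock, which is the kind of distortion the JProfiler passage describes.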
It is generally assumed that using a profiler is much better (for finding performance problems, as opposed to measuring things) than - anything else, really - certainly than the bone-simple way of random pausing.
This assumption is only common wisdom - it has no basis in theory or practice.
There are numerous scholarly peer-reviewed papers about profiling, but none that I've read even address the point, let alone substantiate it.
It's a blind spot in academia, not a big one, but it's there.
Now to your question -
In the screenshot showing the call stack, that is what's known as the "hot path", accounting for roughly 60% of in-thread CPU time. Assuming the code with "saxon" in the name is what you're interested in, it is this:
net.sf.saxon.event.ReceivingContentHandler.startElement
net.sf.saxon.event.ProxyReceiver.startContent
net.sf.saxon.event.ProxyReceiver.startContent
net.sf.saxon.event.StartTagBuffer.startContent
net.sf.saxon.event.ProxyReceiver.startContent
com.saxonica.ee.validate.ValidationStack.startContent
com.saxonica.ee.validate.AttributeValidator.startContent
net.sf.saxon.event.TeeOutputter.startContent
net.sf.saxon.event.ProxyReceiver.startContent
net.sf.saxon.event.ProxyReceiver.startContent
net.sf.saxon.event.Sink.startContent
First, this looks to me like it has to be doing I/O, or at least waiting for some other process to give it content. If so, you should be looking at wall-clock time, not CPU time.
Second, the problem(s) could be at any of those call sites where a function calls the one below. If any such call is not truly necessary and could be skipped or done less often, it will reduce time by a significant fraction.
My suspicion is drawn to StartTagBuffer and to validate, but you know best.
There are other points I could make, but these are the major ones.
ADDED after your edit to the question.
I tend to assume you are looking for ways to optimize the code, not just ways to get numbers for their own sake.
It still looks like only CPU time, not wall-clock time, because there is no I/O in the hot paths. Maybe that's OK in your case, but what it means is, of your 12-minute wall clock time, 11 minutes could be spent in I/O wait, with 1 minute in CPU. If so, you could maybe cut out 30 seconds of fat in the CPU part, and only shorten the time by 30 seconds.
That's why I prefer sampling on wall-clock time, so I have overall perspective.
By looking at hot paths alone, you're not getting a true picture.
For example, if the hot path says function F is on the hot path for, say 40% of the time, that only means F costs no less than 40%. It could be much more, because it could be on other paths that aren't so hot. So you could have a juicy opportunity to speed things up by a lot, but it doesn't get much exposure in the specific path that the profiler chose to show you, so you don't give it much attention.
In fact, a big time-taker might not show up at all because on any specific hot path there's always something else a little bigger, like new, or because it goes by multiple names, such as templated collection class constructors.
It's not showing you any line-resolution information.
If you want to inspect a supposedly high-cost routine for the reason for the cost, you have to look at the lines within it. There's a tendency when looking at a routine to say "It's just doing what it's supposed to do.", but if you are looking at a specific costly line of code, which most often is a method call, you can ask "Is it really necessary to do this call? Maybe I already have the information." It's far more specific in suggesting what you could fix.
Can it actually show you some raw stack samples?
In my experience these are far more informative than any summary, like a hot path, that the profiler can present.
The thing to do is examine the sample and come to a full understanding of what the program was doing, and the reason why, at that point in time.
Then repeat for several more samples.
You will see things that don't need to be done, that you can fix to get substantial speedup.
(Unless the code is already optimal, in which case it will be nice to know.)
The point is, you're looking for problems, not measurements.
Statistically, it's very rough, but good enough, and no problem will escape.
My guess is that the method Sink.startContent actually is called a ridiculous number of times.
Your screenshot shows the Sampling tab, which usually results in realistic timings if used over a long enough interval. If you use the Profiler tab instead, you will also get the invocation count. (You'll also get less realistic timings and your program will get very, very slow, but you only need to do this for a few seconds to get a good idea about the invocation counts.)
It's hard to predict what optimizations and especially inlining HotSpot performs, and the sampling profiler can only attribute the time of inlined methods to the call sites. I suspect that some of the invocation code in saxon might for some reason be attributed to your empty callback function. In that case, you're just suffering the cost of XML, and switching to a different parser might be the only option.
I've had a lot of useful information and guidance from this thread, for which many thanks. However, I don't think the core question has been answered: why is the CPU sampling in VisualVM giving an absurdly high number of hits in a method that does nothing, and that isn't called any more often than many other methods?
For future investigations I will rely on the profiler rather than the sampler, now I have gained a bit of insight into how they differ.
From the profiler I haven't really gained a lot of new information about this specific task, in so far as it has largely confirmed what I would have guessed; but that in itself is useful. It has told me that there's no magic bullet for speeding up this particular process, but it has put bounds on what might be achieved by some serious redesign. For example, a possible future enhancement that appears to have some promise is generating a bytecode validator for each user-defined simple type in the schema.
I want to filter what classes are being cpu-profiled in Java VisualVm (Version 1.7.0 b110325). For this, I tried under Profiler -> Settings -> CPU-Settings to set "Profile only classes" to my package under test, which had no effect. Then I tried to get rid of all java.* and sun.* classes by setting them in "Do not profile classes", which had no effect either.
Is this simply a bug? Or am I missing something? Is there a workaround? I mean other than:
paying for a better profiler
doing sampling by hand (see One could use a profiler, but why not just halt the program?)
switch to the Call Tree view, which is no good since only the Profiler view gives me the percentages of consumed CPU per method.
I want to do this mainly to get halfway correct percentages of consumed CPU per method. For this, I need to get rid of the annoying measurements, e.g. for sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run() (around 70%). Many users seem to have this problem, see e.g.
Java VisualVM giving bizarre results for CPU profiling - Has anyone else run into this?
rmi.transport.tcp.tcptransport Connectionhandler consumes much CPU
Can't see my own application methods in Java VisualVM.
The reason you see sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run() in the profile is that you left the option Profile new Runnables selected.
Also, if you took a snapshot of your profiling session you would be able to see the whole callstack for any hotspot method - this way you could navigate from the run() method down to your own application logic methods, filtering out the noise generated by the Profile new Runnables option.
OK, since your goal is to make the code run as fast as possible, let me suggest how to do it.
I'm no expert on VisualVM, but I can tell you what works. (Only a few profilers actually tell you what you need to know, which is - which lines of your code are on the stack a healthy fraction of wall-clock time.)
The only measuring I ever bother with is some stopwatch on the overall time, or alternatively, if the code has something like a framerate, the number of frames per second. I don't need any sort of further precision breakdown, because it's at best a remote clue to what's wasting time (and more often totally irrelevant), when there's a very direct way to locate it.
If you don't want to do random-pausing, that's up to you, but it's proven to work, and here's an example of a 43x speedup.
Basically, the idea is you get a (small, like 10) number of stack samples, taken at random wall-clock times.
Each sample consists (obviously) of a list of call sites, and possibly a non-call site at the end.
(If the sample is during I/O or sleep, it will end in the system call, which is just fine. That's what you want to know.)
If there is a way to speed up your code (and there almost certainly is), you will see it as a line of code that appears on at least one of the stack samples.
The probability it will appear on any one sample is exactly the same as the fraction of time it uses.
So if there's a call site or other line of code using a healthy fraction of time, and you can avoid executing it, the overall time will decrease by that fraction.
I don't know every profiler, but one I know that can tell you that is Zoom.
Others may be able to do it.
They may be more spiffy, but they don't work any quicker or better than the manual method when your purpose is to maximize performance.
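For Java, the manual sampling described above can be approximated in code as well as with a debugger's pause button. The following is a minimal sketch, assuming the work of interest runs on a thread you hold a reference to; the class and method names are mine, and the busy loop in main is only a placeholder workload.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.concurrent.ThreadLocalRandom;

final class StackSampler {
    /** Takes count stack samples of target, sleeping a random 0..maxGapMillis before each one. */
    static List<StackTraceElement[]> sample(Thread target, int count, int maxGapMillis)
            throws InterruptedException {
        List<StackTraceElement[]> samples = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            Thread.sleep(ThreadLocalRandom.current().nextInt(maxGapMillis + 1));
            samples.add(target.getStackTrace());
        }
        return samples;
    }

    /** For each code location, counts how many samples it appears on (at most once per sample). */
    static Map<String, Integer> tally(List<StackTraceElement[]> samples) {
        Map<String, Integer> hits = new TreeMap<>();
        for (StackTraceElement[] stack : samples) {
            Set<String> seen = new HashSet<>();
            for (StackTraceElement frame : stack) {
                String site = frame.getClassName() + "." + frame.getMethodName()
                        + ":" + frame.getLineNumber();
                if (seen.add(site)) { // recursion must not count a site twice in one sample
                    hits.merge(site, 1, Integer::sum);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> { // placeholder workload
            double x = 0;
            while (!Thread.currentThread().isInterrupted()) {
                x += Math.sqrt(x + 1);
            }
        });
        worker.setDaemon(true);
        worker.start();
        List<StackTraceElement[]> samples = sample(worker, 10, 50);
        worker.interrupt();
        tally(samples).forEach((site, n) ->
                System.out.println(n + "/" + samples.size() + "  " + site));
    }
}
```

A location that shows up on several of the ten samples is a candidate worth reading closely; the raw samples themselves are still more informative than the tally.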
I'm writing a MOS 6502 processor emulator as part of a larger project I've undertaken in my spare time. The emulator is written in Java, and before you say it, I know it's not going to be as efficient and optimized as if it were written in C or assembly, but the goal is to make it run on various platforms, and it's pulling 2.5 MHz on a 1 GHz processor, which is pretty good for an interpreted emulator. My problem is quite the contrary: I need to limit the emulated clock to 1 MHz. I've looked around but have not seen many strategies for doing this. I've tried a few things, including checking the time after a number of cycles and sleeping for the difference between the expected time and the actual time elapsed, but checking the time slows down the emulation by a factor of 8. Does anyone have any better suggestions, or perhaps ways to optimize time polling in Java to reduce the slowdown?
The problem with using sleep() is that you generally only get a granularity of 1ms, and the actual sleep that you will get isn't necessarily even accurate to the nearest 1ms as it depends on what the rest of the system is doing. A couple of suggestions to try (off the top of my head-- I've not actually written a CPU emulator in Java):
stick to your idea, but check the time between a large-ish number of emulated instructions (execution is going to be a bit "lumpy" anyway especially on a uniprocessor machine, because the OS can potentially take away the CPU from your thread for several milliseconds at a time);
as you want to execute in the order of 1000 emulated instructions per millisecond, you could also try just hanging on to the CPU between "instructions": have your program periodically work out by trial and error how many runs through a loop it needs to go between instructions to "waste" enough CPU to make the timing work out at 1 million emulated instructions / sec on average (you may want to see if setting your thread to low priority helps system performance in this case).
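A rough sketch of the first suggestion (check the clock only every so many emulated cycles, then wait until real time catches up) might look like this. The class name, the check interval, and the 2 ms park threshold are my assumptions, not measured values; it parks for most of a long wait and spins on System.nanoTime() for the remainder.

```java
import java.util.concurrent.locks.LockSupport;

final class CycleThrottle {
    private final long hz;
    private final int checkEvery; // emulated cycles between clock checks
    private final long startNanos = System.nanoTime();
    private long cycles;
    private int sinceCheck;

    CycleThrottle(long hz, int checkEvery) {
        this.hz = hz;
        this.checkEvery = checkEvery;
    }

    /** Real time that the given number of cycles should take at hz, in nanoseconds. */
    static long expectedNanos(long cycles, long hz) {
        // Split into whole seconds and remainder to avoid overflowing cycles * 1e9.
        return (cycles / hz) * 1_000_000_000L + (cycles % hz) * 1_000_000_000L / hz;
    }

    /** Call after each emulated instruction with the number of cycles it consumed. */
    void tick(int cyclesUsed) {
        cycles += cyclesUsed;
        sinceCheck += cyclesUsed;
        if (sinceCheck >= checkEvery) {
            sinceCheck = 0;
            sync();
        }
    }

    /** Waits until wall-clock time has caught up with the emulated cycle count. */
    void sync() {
        long target = startNanos + expectedNanos(cycles, hz);
        long remaining;
        while ((remaining = target - System.nanoTime()) > 0) {
            if (remaining > 2_000_000L) {
                LockSupport.parkNanos(remaining - 1_000_000L); // coarse wait, may overshoot
            }
            // otherwise spin on nanoTime() for the last stretch
        }
    }

    public static void main(String[] args) {
        CycleThrottle throttle = new CycleThrottle(1_000_000L, 10_000);
        long start = System.nanoTime();
        for (int i = 0; i < 500_000; i++) {
            throttle.tick(2); // pretend every instruction takes 2 cycles
        }
        throttle.sync();
        System.out.println("1,000,000 cycles took "
                + (System.nanoTime() - start) / 1_000_000 + " ms");
    }
}
```

With a check every 10,000 cycles the clock is read about 100 times per emulated second, so the polling cost should be negligible compared with reading it every few instructions.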
I would use System.nanoTime() in a busy wait as @pst suggested earlier.
You can speed up the emulation by generating byte code. Most instructions should translate quite well and you can add a busy wait call so each instruction takes the amount of time the original instruction would have done. You have an option to increase the delay so you can watch each instruction being executed.
To make it really cool you could generate 6502 assembly code as text with matching line numbers in the byte code. This would allow you to use the debugger to step through the code, breakpoint it and see what the application is doing. ;)
A simple way to emulate the memory is to use direct ByteBuffer or native memory with the Unsafe class to access it. This will give you a block of memory you can access as any data type in any order.
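As a small illustration of the direct ByteBuffer idea (leaving Unsafe aside), the 6502's 64 KiB address space could be modelled as below. The class name is hypothetical, and wrapping a word read at $FFFF around to $0000 is a simplifying choice of this sketch.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class Memory6502 {
    // 64 KiB of off-heap memory; the 6502 stores 16-bit values low byte first.
    private final ByteBuffer ram =
            ByteBuffer.allocateDirect(0x10000).order(ByteOrder.LITTLE_ENDIAN);

    /** Reads one byte as an unsigned value in 0..255. */
    int readByte(int address) {
        return ram.get(address & 0xFFFF) & 0xFF;
    }

    void writeByte(int address, int value) {
        ram.put(address & 0xFFFF, (byte) value);
    }

    /** Reads a 16-bit little-endian word as an unsigned value in 0..65535. */
    int readWord(int address) {
        int a = address & 0xFFFF;
        if (a != 0xFFFF) {
            return ram.getShort(a) & 0xFFFF;
        }
        return readByte(0xFFFF) | (readByte(0x0000) << 8); // wrap at the top of memory
    }

    public static void main(String[] args) {
        Memory6502 memory = new Memory6502();
        memory.writeByte(0xFFFC, 0x00); // reset vector, low byte
        memory.writeByte(0xFFFD, 0xC0); // reset vector, high byte
        System.out.println("reset vector = $" + Integer.toHexString(memory.readWord(0xFFFC)));
    }
}
```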
You might be interested in examining the Java Apple Computer Emulator (JACE), which incorporates 6502 emulation. It uses Thread.sleep() in its TimedDevice class.
Have you looked into creating a Timer object that goes off at the cycle length you need it? You could have the timer itself initiate the next loop.
Here is the documentation for the Java 6 version:
http://download.oracle.com/javase/6/docs/api/java/util/Timer.html
Debugging performance problems using a standard debugger is almost hopeless since the level of detail is too high. Another way is to use a profiler, but profilers seldom give me good information, especially when a GUI and background threads are involved, as I never know whether the user was actually waiting for the computer or not. A different way is simply pressing Control + C and seeing where in the code it stops.
What I really would like is to have Fast Forward, Play, Pause and Rewind functionality combined with some visual representation of the code. This means that I could set the code to run on Fast Forward until I navigate the GUI to the critical spot. Then I set the code to run in slow mode, while I get some visual representation of which lines are being executed (possibly some kind of zoomed-out view of the code). I could for example set the execution speed to something like 0.0001x. I believe that I would get a very good visualization this way of whether the problem is inside a specific module, or maybe in the communication between modules.
Does this exist? My specific need is in Python, but I would be interested in seeing such functionality in any language.
The "Fast Forward to critical spot" function already exists in any debugger; it's called a "breakpoint". There are indeed debuggers that can slow down execution, but that will not help you debug performance problems, because it doesn't slow down the computer. The processor, disk and memory are still exactly as slow as before; all that happens is that the debugger inserts delays between each line of code. That means that every line of code suddenly takes more or less the same time, which hides any trace of where the performance problem is.
The only way to find the performance problems is to record every call done in the application and how long it took. This is what a profiler does. Indeed, using a profiler is tricky, but there probably isn't a better option. In theory you could record every call and the timing of every call, and then play that back and forwards with a rewind, but that would use an astonishing amount of memory, and it wouldn't actually tell you anything more than a profiler does (indeed, it would tell you less, as it would miss certain types of performance problems).
With the profiler, you should be able to figure out what is taking a long time. Note that this can be certain function calls taking a long time because they do a lot of processing, or it can be system calls that take a long time because something (network/disk) is slow. Or it can be that a very fast call is called loads and loads of times. A profiler will help you figure this out. But it helps if you can turn the profiler on just at the critical section (reduces noise) and if you can run that critical section many times (improves accuracy).
The methods you're describing, and many of the comments, seem to me to be relatively weak probabilistic attempts to understand the performance impact. Profilers do work perfectly well for GUIs and other idle-thread programs, though it takes a little practice to read them. I think your best bet is there, though -- learn to use the profiler better, that's what it's for.
The specific use you describe would simply be to attach the profiler without recording yet. Navigate the GUI to the point in question. Hit the profiler record button, do the action, and stop the recording. View the results. Fix. Do it again.
I assume there is a phase in the app's execution that takes too long - i.e. it makes you wait.
I assume what you really want is to see what you could change to make it faster.
A technique that works is random-pausing.
You run the app under the debugger, and in the part of its execution that makes you wait, pause it, and examine the call stack. Do this a few times.
Here are some ways your program could be spending more time than necessary.
I/O that you didn't know about and didn't really need.
Allocating and releasing objects very frequently.
Runaway notifications on data structures.
others too numerous to mention...
No matter what it is, when it is happening, an examination of the call stack will show it.
Once you know what it is, you can find a better way to do it, or maybe not do it at all.
If the program is taking 5 seconds when it could take 1 second, then the probability you will see the problem on each pause is 4/5. In fact, any function call you see on more than one stack sample, if you could avoid doing it, will give you a significant speedup.
AND, nearly every possible bottleneck can be found this way.
Don't think about function timings or how many times they are called. Look for lines of code that show up often on the stack, that you don't need.
Example Added: If you take 5 samples of the stack, and there's a line of code appearing on 2 of them, then it is responsible for about 2/5 = 40% of the time, give or take. You don't know the precise percent, and you don't need to know.
(Technically, on average it is (2+1)/(5+2) = 3/7 = 43%. Not bad, and you know exactly where it is.)
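The arithmetic in this answer is easy to check. The sketch below computes the (hits + 1) / (samples + 2) estimate quoted above (the rule-of-succession mean, which assumes a uniform prior on the true fraction) and the chance that a problem of a given size is seen on at least one pause. The method names are mine.

```java
final class SampleEstimate {
    /** Estimated fraction of time for a line seen on `hits` of `samples` stacks. */
    static double estimatedFraction(int hits, int samples) {
        return (hits + 1.0) / (samples + 2.0);
    }

    /** Probability that something costing `fraction` of the time appears on at least one sample. */
    static double seenAtLeastOnce(double fraction, int samples) {
        return 1.0 - Math.pow(1.0 - fraction, samples);
    }

    public static void main(String[] args) {
        // The example from the text: a line on 2 of 5 samples.
        System.out.printf("estimate for 2 of 5: %.3f%n", estimatedFraction(2, 5));
        // The 5-second program that could take 1 second wastes 4/5 of its time.
        for (int n = 1; n <= 3; n++) {
            System.out.printf("P(seen within %d pauses) = %.3f%n", n, seenAtLeastOnce(0.8, n));
        }
    }
}
```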
I'm working on a system at the moment. It's a complex system but it boils down to a Solver class with a method like this:
public int solve(int problem); // returns the solution, or 0 if no solution found
Now, when the system is up and running, a run time of about 5 seconds for this method is expected and is perfectly fast enough. However, I plan to run some tests that look a bit like this:
List<Integer> problems = getProblems();
List<Integer> solutions = new ArrayList<Integer>(problems.size());
Solver solver = getSolver();
for (int problem: problems) {
    solutions.add(solver.solve(problem));
}
// see what percentage of solutions are zero
// get arithmetic mean of non-zero solutions
// etc etc
The problem is I want to run this on a large number of problems, and don't want to wait forever for the results. So say I have a million test problems and I want the tests to complete in the time it takes me to make a cup of tea, I have two questions:
Say I have a million core processor and that instances of Solver are threadsafe but with no locking (they're immutable or something), and that all the computation they do is in memory (i.e. there's no disk or network or other stuff going on). Can I just replace the solutions list with a threadsafe list and kick off threads to solve each problem and expect it to be faster? How much faster? Can it run in 5 seconds?
Is there a decent cloud computing service out there for Java where I can buy 5 million seconds of time and get this code to run in five seconds? What do I need to do to prepare my code for running on such a cloud? How much does 5 million seconds cost anyway?
Thanks.
You have expressed your problem with two major points of serialisation: Problem production and solution consumption (currently expressed as Lists of integers). You want to get the first problems as soon as you can (currently you won't get them until all problems are produced).
I am assuming as well that there is a correlation between the problem list order and the solution list order – that is solutions.get(3) is the solution for problems.get(3) – this would be a huge problem for parallelising it. You'd be better off having a Pair<P, S> of problem/solution so you don't need to maintain the correlation.
Parallelising the solver method will not be difficult, although exactly how you do it will depend a lot on the compute costs of each solve method (generally the more expensive the method the lower the overhead costs of parallelising, so if these are very cheap you need to batch them). If you end up with a distributed solution you'll have much higher costs of course. The Executor framework and the fork/join extensions would be a great starting point.
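As a starting point with the Executor framework, something like the following sketch would do. An IntUnaryOperator stands in for the question's Solver.solve(int), and the thread count is up to you; both are assumptions of the example. Because ExecutorService.invokeAll returns its futures in the same order as the submitted tasks, the result list stays aligned with the problem list without needing a thread-safe list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntUnaryOperator;

final class ParallelSolve {
    /** Solves every problem on a fixed pool; solutions.get(i) belongs to problems.get(i). */
    static List<Integer> solveAll(List<Integer> problems, IntUnaryOperator solver, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>(problems.size());
            for (int problem : problems) {
                tasks.add(() -> solver.applyAsInt(problem));
            }
            List<Integer> solutions = new ArrayList<>(problems.size());
            for (Future<Integer> future : pool.invokeAll(tasks)) { // blocks until all are done
                solutions.add(future.get());
            }
            return solutions;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        int threads = Runtime.getRuntime().availableProcessors();
        // Placeholder "solver": doubles the problem number.
        List<Integer> solutions = solveAll(Arrays.asList(1, 2, 3, 0), p -> p * 2, threads);
        System.out.println(solutions);
    }
}
```

On one machine the speedup is bounded by the number of cores, so a million 5-second problems will not finish in 5 seconds this way.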
You're asking extremely big questions. There is overhead for threads, and a key thing to note is that they run in the parent process. If you wanted to run a million of these solvers at the same time, you'd have to fork them into their own processes.
You can use one program per input, and then use a simple batch scheduler like Condor (for Linux) or HPC (for Windows). You can run those on Amazon too, but there's a bit of a learning curve, it's not just "upload Java code & go".
Sure, you could use a standard worker-thread paradigm to run things in parallel. But there will be some synchronization overhead (e.g., updates to the solutions list will cause lock contention when everything tries to finish at the same time), so it won't run in exactly 5 seconds. But it would be faster than 5 million seconds :-)
Amazon EC2 runs between US$0.085 and US$0.68 per hour depending on how much CPU you need (see pricing). Five million CPU-seconds is about 1,389 hours, so maybe about $120 at the lowest rate. Of course, you'll need to set up something separate to distribute your jobs across various CPUs. One option might be just to use Hadoop (see this question about whether Hadoop is right for running simulations).
You could read things like Guy Steele's talk on parallelism for more info on how to think parallel.
Use an appropriate Executor. Have a look at http://download.oracle.com/javase/6/docs/api/java/util/concurrent/Executors.html#newCachedThreadPool()
Check out these articles on concurrency:
http://www.vogella.de/articles/JavaConcurrency/article.html
http://www.baptiste-wicht.com/2010/09/java-concurrency-part-7-executors-and-thread-pools/
Basically, Java 7's new Fork/Join model will work really well for this approach. Essentially you can set up your million+ tasks and it will spread them as best it can across all available processors. You would have to provide your custom "Cloud" task executor, but it can be done.
This assumes, of course, that your "solving" algorithm is ridiculously parallel. In short, as long as each call to the Solver is fully self-contained, the calls can be split among an arbitrary number of processors.
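For the local-machine part of this, a fork/join version can be sketched as follows. It only spreads work over the cores of one JVM; the "Cloud" executor mentioned above is not covered. The IntUnaryOperator again stands in for Solver.solve(int), and the threshold is an arbitrary tuning guess (with 5-second solves, a threshold of 1 would be just as reasonable).

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;
import java.util.function.IntUnaryOperator;

final class ForkJoinSolve extends RecursiveAction {
    private static final int THRESHOLD = 1_000; // below this, solve sequentially (assumption)

    private final int[] problems;
    private final int[] solutions;
    private final int from;
    private final int to;
    private final IntUnaryOperator solver;

    ForkJoinSolve(int[] problems, int[] solutions, int from, int to, IntUnaryOperator solver) {
        this.problems = problems;
        this.solutions = solutions;
        this.from = from;
        this.to = to;
        this.solver = solver;
    }

    @Override
    protected void compute() {
        if (to - from <= THRESHOLD) {
            for (int i = from; i < to; i++) {
                solutions[i] = solver.applyAsInt(problems[i]); // each slot is written by one task only
            }
            return;
        }
        int mid = (from + to) >>> 1;
        invokeAll(new ForkJoinSolve(problems, solutions, from, mid, solver),
                  new ForkJoinSolve(problems, solutions, mid, to, solver));
    }

    /** Solves all problems on the common pool; solutions[i] belongs to problems[i]. */
    static int[] solveAll(int[] problems, IntUnaryOperator solver) {
        int[] solutions = new int[problems.length];
        ForkJoinPool.commonPool().invoke(
                new ForkJoinSolve(problems, solutions, 0, problems.length, solver));
        return solutions;
    }

    public static void main(String[] args) {
        int[] problems = new int[1_000_000];
        for (int i = 0; i < problems.length; i++) {
            problems[i] = i;
        }
        int[] solutions = solveAll(problems, p -> p % 7); // placeholder solver
        int zeros = 0;
        for (int s : solutions) {
            if (s == 0) zeros++;
        }
        System.out.println("zero solutions: " + zeros);
    }
}
```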