How to write a profiler? - java

I would like to know how to write a profiler. What books and/or articles are recommended? Can anyone help me, please?
Has anyone already done something like this?

Encouraging lot, aren't we :)
Profilers aren't too hard if you're just trying to get a reasonable idea of where the program's spending most of its time. If you're bothered about high accuracy and minimum disruption, things get difficult.
So if you just want the answers a profiler would give you, go for one someone else has written. If you're looking for the intellectual challenge, why not have a go at writing one?
I've written a couple, for run time environments that the years have rendered irrelevant.
There are two approaches:
adding something to each function or other significant point that logs the time and where it is;
having a timer go off regularly and taking a peek at where the program currently is.
The JVMPI version seems to be the first kind - the link provided by uzhin shows that it can report on quite a number of things (see section 1.3). What gets executed changes to do this, so the profiling can affect the performance (and if you're profiling what was otherwise a very lightweight but often called function, it can mislead).
If you can get a timer/interrupt telling you where the program counter was at the time of the interrupt, you can use the symbol table/debugging information to work out which function it was in at the time. This provides less information but can be less disruptive. A bit more information can be obtained from walking the call stack to identify callers etc. I've no idea if this is even possible in Java...
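For a sense of what the second, timer-driven approach looks like, here is a minimal sketch in pure Java using `Thread.getAllStackTraces()`. The class and method names are my own invention; a real profiler would use JVMTI or an async-sampling agent for much lower overhead and better accuracy.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Minimal sketch of the second approach: a timer fires periodically, we peek
// at every thread's stack, and count which method was on top at that moment.
public class MiniSampler {
    private final Map<String, Integer> hits = new HashMap<>();
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "sampler");
                t.setDaemon(true);
                return t;
            });

    public void start(long periodMillis) {
        timer.scheduleAtFixedRate(this::sample, periodMillis, periodMillis,
                TimeUnit.MILLISECONDS);
    }

    private synchronized void sample() {
        for (Map.Entry<Thread, StackTraceElement[]> e
                : Thread.getAllStackTraces().entrySet()) {
            StackTraceElement[] stack = e.getValue();
            if ("sampler".equals(e.getKey().getName()) || stack.length == 0) {
                continue; // skip the sampler thread itself and empty stacks
            }
            String top = stack[0].getClassName() + "." + stack[0].getMethodName();
            hits.merge(top, 1, Integer::sum);
        }
    }

    public synchronized Map<String, Integer> stop() {
        timer.shutdown();
        return new HashMap<>(hits);
    }

    public static void main(String[] args) {
        MiniSampler sampler = new MiniSampler();
        sampler.start(10);             // sample every 10 ms
        long x = 0;
        long deadline = System.currentTimeMillis() + 250;
        while (System.currentTimeMillis() < deadline) {
            x = x * 31 + 7;            // busy work for the sampler to catch
        }
        System.out.println("checksum " + x);
        sampler.stop().forEach((method, n) -> System.out.println(n + "  " + method));
    }
}
```

Note that `getAllStackTraces()` only samples at JVM safepoints, so the results are biased compared to a true interrupt-driven sampler; it is enough to get a reasonable idea of where time goes, which is the stated goal.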
Paul.

I wrote one once, mainly as an attempt to make "deep sampling" more user-friendly. The manual version of the method is explained here. It is based on sampling, but rather than take a large number of small samples, you take a small number of large samples.
It can tell you, for example, that instruction I (usually a function call) is costing you some percent X of total execution time, more or less, since it appears on the stack on X% of samples.
Think about it, because this is a key point. The call stack exists as long as the program is running. If a particular call instruction I is on the stack X% of the time, then if that instruction could disappear, that X% of time would disappear. This does not depend on how many times I is executed, or how long the function call takes. So timers and counters are missing the point. And in a sense all instructions are call instructions, even if they only call microcode.
The sampler is based on the premise that it is better to know the address of instruction I with precision (because that is what you are looking for) than to know the number X% with precision. If you know that you could save roughly 30% of time by recoding something, do you really care that you might be off by 5%? You're still going to want to fix it. The amount of time it actually saves won't be made any less or greater by your knowing X precisely.
So it is possible to drive samples off of a timer, but frankly I found it just as useful to trigger an interrupt by the user pressing both shift keys at the same time. Since 20 samples is generally plenty, and this way you can be sure to take samples at a relevant time (i.e. not while waiting for user input), it was quite adequate. Another way would be to only do the timer-driven samples while the user holds down both shift keys (or something like that).
It did not concern me that the taking of samples might slow down the program, because the goal was not to measure speed, but to locate the most costly instructions. After fixing something, the overall speedup is easy to measure.
The main thing that the profiler provided was a UI so you could examine the results painlessly. What comes out of the sampling phase is a collection of call stack samples, where each sample is a list of addresses of instructions, where every instruction but the last is a call instruction. The UI was mainly what is called a "butterfly view".
It has a current "focus", which is a particular instruction. To the left are displayed the call instructions immediately above that instruction, as culled from the stack samples. If the focus instruction is a call instruction, then the instructions below it appear to the right, as culled from the samples. On the focus instruction is displayed a percent, which is the percent of stacks containing that instruction. Similarly for each instruction on the left or right, the percent is broken down by the frequency of each such instruction. Of course, each instruction was represented by file, line number, and the name of the function it was in. The user could easily explore the data by clicking any of the instructions to make it the new focus.
A variation on this UI treated the butterfly as bipartite, consisting of alternating layers of function call instructions and the functions containing them. That can give a little more clarity of time spent in each function.
Maybe it's not obvious, so it's worth mentioning some properties of this technique.
Recursion is not an issue, because if an instruction appears more than once on any given stack sample, that still counts as only one sample containing it. It still remains true that the estimated time that would be saved by its removal is the percent of stacks it is on.
Notice this is not the same as a call tree. It gives you the cost of an instruction no matter how many different branches of a call tree it is in.
Performance of the UI is not an issue, because the number of samples need not be very large. If a particular instruction I is the focus, it is quite simple to find how many samples contain it, and for each adjacent instruction, how many of the samples containing I also contain the adjacent instruction next to it.
As mentioned before, speed of sampling is not an issue, because we're not measuring performance, we're diagnosing. The sampling does not bias the results, because the sampling does not affect what the overall program does. An algorithm that takes N instructions to complete still takes N instructions even if it is halted any number of times.
I'm often asked how to sample a program that completes in milliseconds. The simple answer is wrap it in an outer loop to make it take long enough to sample. You can find out what takes X% of time, remove it, get the X% speedup, and then remove the outer loop.
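That outer-loop trick might look like this, where `doFastTask` is a hypothetical stand-in for the millisecond-scale code under study:

```java
// Sketch of the outer-loop trick: a task that finishes in milliseconds is
// repeated until the whole run takes long enough to collect stack samples.
public class OuterLoopDemo {
    // Hypothetical stand-in for the fast task you want to profile.
    static long doFastTask(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) sum += (long) i * i;
        return sum;
    }

    public static void main(String[] args) {
        long checksum = 0;
        // Any instruction costing X% of one iteration still costs X% of the
        // whole run, so the samples point at the same problems.
        for (int rep = 0; rep < 100_000; rep++) {
            checksum ^= doFastTask(1_000);
        }
        System.out.println(checksum); // prints 0: an even count of identical XORs cancels
    }
}
```

Accumulating a checksum keeps the compiler from optimizing the repeated work away, which would otherwise distort what you are sampling.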
This little profiler, which I called YAPA (yet another performance analyzer), was DOS-based and made a nice little demo, but when I had serious work to do, I would fall back on the manual method. The main reason for this is that the call stack alone is often not enough state information to tell you why a particular cycle is being spent. You may also need to know other state information so you have a more complete idea of what the program was doing at that time. Since I found the manual method pretty satisfactory, I shelved the tool.
A point that's often missed when talking about profiling is that you can do it repeatedly to find multiple problems. For example, suppose instruction I1 is on the stack 5% of the time, and I2 is on the stack 50% of the time. Twenty samples will easily find I2, but maybe not I1. So you fix I2. Then you do it all again, but now I1 takes 10% of the time, so 20 samples will probably see it. This magnification effect allows repeated applications of profiling to achieve large compounded speedup factors.

I would look at those open-source projects first:
Eclipse TPTP (http://www.eclipse.org/tptp/)
VisualVM (https://visualvm.dev.java.net/)
Then I would look at JVMTI (not JVMPI)
http://java.sun.com/developer/technicalArticles/Programming/jvmti/

JVMPI spec: http://java.sun.com/j2se/1.5.0/docs/guide/jvmpi/jvmpi.html
I salute your courage and bravery
EDIT: And as noted by user Boune, JVMTI:
http://java.sun.com/developer/technicalArticles/Programming/jvmti/

As another answer, I just looked at LukeStackwalker on SourceForge. It is a nice, small example of a stack sampler, and a nice place to start if you want to write a profiler.
Here, in my opinion, is what it does right:
It samples the entire call stack.
Sigh ... so near yet so far. Here, IMO, is what it (and other stack samplers like xPerf) should do:
It should retain the raw stack samples. As it is, it summarizes at the function level as it samples. This loses the key line-number information locating the problematic call sites.
It need not take so many samples, if storage to hold them is an issue. Since typical performance problems cost from 10% to 90%, 20-40 samples will show them quite reliably. Hundreds of samples give more measurement precision, but they do not increase the probability of locating the problems.
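The arithmetic behind that reliability claim is easy to check with a simple binomial model (a quick sketch; it assumes samples are independent):

```java
// Probability that a problem costing a given fraction of run time shows up
// on at least two of n independent stack samples (binomial model).
public class SampleOdds {
    static double seenAtLeastTwice(double fraction, int n) {
        double zeroHits = Math.pow(1 - fraction, n);
        double oneHit = n * fraction * Math.pow(1 - fraction, n - 1);
        return 1 - zeroHits - oneHit;
    }

    public static void main(String[] args) {
        // A 30% problem is nearly certain to appear twice in 20 samples.
        System.out.printf("30%% problem, 20 samples: %.4f%n",
                seenAtLeastTwice(0.30, 20));
        // A 10% problem appears twice in 20 samples only about 6 times in 10.
        System.out.printf("10%% problem, 20 samples: %.4f%n",
                seenAtLeastTwice(0.10, 20));
    }
}
```

So 20 samples is quite reliable for the big problems, and taking a few more samples helps more with small problems than with measurement precision.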
The UI should summarize in terms of statements, not functions. This is easy to do if the raw samples are kept. The key measure to attach to a statement is the fraction of samples containing it. For example:
5/20 MyFile.cpp:326 for (i = 0; i < strlen(s); ++i)
This says that line 326 in MyFile.cpp showed up on 5 out of 20 samples, in the process of calling strlen. This is very significant, because you can instantly see the problem, and you know how much speedup you can expect from fixing it. If you replace strlen(s) by s[i], it will no longer be spending time in that call, so these samples will not occur, and the speedup will be approximately 1/(1-5/20) = 20/(20-5) = 4/3 = 33% speedup. (Thanks to David Thornley for this sample code.)
The UI should have a "butterfly" view showing statements. (If it shows functions too, that's OK, but the statements are what really matter.) For example:
3/20 MyFile.cpp:502 MyFunction(myArgs)
2/20 HisFile.cpp:113 MyFunction(hisArgs)
5/20 MyFile.cpp:326 for (i = 0; i < strlen(s); ++i)
5/20 strlen.asm:23 ... some assembly code ...
In this example, the line containing the for statement is the "focus of attention". It occurred on 5 samples. The two lines above it say that on 3 of those samples, it was called from MyFile.cpp:502, and on 2 of those samples, it was called from HisFile.cpp:113. The line below it says that on all 5 of those samples, it was in strlen (no surprise there). In general, the focus line will have a tree of "parents" and a tree of "children". If for some reason, the focus line is not something you can fix, you can go up or down. The goal is to find lines that you can fix that are on as many samples as possible.
IMPORTANT: Profiling should not be looked at as something you do once. For example, in the sample above, we got a 4/3 speedup by fixing one line of code. When the process is repeated, other problematic lines of code should show up at 4/3 the frequency they did before, and thus be easier to find. I never hear of people talking about iterating the profiling process, but it is crucial to getting overall large compounded speedups.
P.S. If a statement occurs more than once in a single sample, that means there is recursion taking place. It is not a problem. It still only counts as one sample containing the statement. It is still the case that the cost of the statement is approximated by the fraction of samples containing it.


Measure time to complete individual streaming steps?

Is there a library that allows measuring Java Stream steps?
As in "when consuming this stream, X amount of time is spent in that filter, Y amount of time is spent in this map, etc..."?
Or do I have to tinker around a bit?
By definition, there is no 'between' time for steps. The method chain that you define only defines the steps to take (the methods build an internal structure containing the steps).
Once you call the terminal operator, all steps are executed in indeterminate order. If you never call a terminal operator, no steps are executed either.
What I mean by indeterminate order is this:
intstream.peek(operation1).peek(operation2).sum();
You don't know if operation1 is called, then operation2 called, then operation1 again, or if operation1 is called a bunch of times, then operation2 a bunch of times, or something else yet. (I hope I'm correct in saying this. [edit] correct) It's probably up to the implementation of the VM to trade off storage with time complexity.
The best you can do is measure the time each operation takes, as you have full control over every operation. The time the stream engine itself takes is hard to gauge. However, know that when you do this, you're breaking the Stream API requirement that operations should be side-effect-free. So don't use it in production code.
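A sketch of that measure-it-yourself approach, using a hypothetical `timed` wrapper that accumulates nanoseconds per operation (diagnosis only, for the side-effect reason just mentioned):

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;
import java.util.stream.IntStream;

// Sketch: wrap one stream operation so it accumulates its own elapsed time.
// The side effect breaks the Stream API's statelessness rule: diagnosis only.
public class StreamTiming {
    static <T, R> Function<T, R> timed(Function<T, R> f, AtomicLong nanos) {
        return t -> {
            long start = System.nanoTime();
            try {
                return f.apply(t);
            } finally {
                nanos.addAndGet(System.nanoTime() - start);
            }
        };
    }

    public static void main(String[] args) {
        AtomicLong mapNanos = new AtomicLong();
        long sum = IntStream.rangeClosed(1, 1_000)
                .boxed()
                .map(timed(i -> i * i, mapNanos)) // time spent in this map step
                .mapToLong(Integer::longValue)
                .sum();
        System.out.println("sum = " + sum);       // prints sum = 333833500
        System.out.println("map took " + mapNanos.get() + " ns");
    }
}
```

The same wrapper idea works for `Predicate` in a `filter` step; what it cannot show you is the stream machinery's own overhead between steps, which, as noted above, has no well-defined "between" time anyway.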
The only tool I can think of is JMH, which is a general benchmarking tool for Java. However note that it might be hard to understand the effect that different Stream operators have on the benchmark due to JIT. The fact that your code contains a Stream with a bunch of different operators doesn't mean that's exactly the way JIT compiles it.

Why does Java CPU profile (using visualvm) show so many hits on a method that does nothing?

This is something I think I've seen before with other profiling tools in other environments, but it's particularly dramatic in this case.
I'm taking a CPU profile of a task that runs for about 12 minutes, and it's showing almost half the time spent in a method that literally does nothing: it's got an empty body. What can cause this? I don't believe that the method is being called a ridiculous number of times, certainly not to account for half the execution time.
For what it's worth, the method in question is called startContent() and it's used to notify a parsing event. The event is passed down a chain of filters (perhaps a dozen of them), and the startContent() method on each filter does almost nothing except to call startContent() on the next filter in the chain.
This is pure Java code, and I'm running it on a Mac.
Attached is a screen shot of the CPU sampler output:
and here is a sample showing the call stack:
(After a delay due to vacation) Here are a couple of pictures showing the output from the profiler. These figures are much more what I would expect the profile to look like. The profiler output seems entirely meaningful, while the sampler output is spurious.
As some of you will have guessed, the job in question is a run of the Saxon XML schema validator (on a 9Gb input file). The profile shows about half the time being spent validating element content against simple types (which happens during endElement processing) and about half being spent testing key constraints for uniqueness; the two profiler views highlight the activity involved in these two aspects of the task.
I'm not able to supply the data as it comes from a client.
I have not used VisualVM, but I suspect the problem is likely because of the instrumentation overhead on such an empty method. Here's the relevant passage in JProfiler's documentation (which I have used extensively):
If the method call recording type is set to Dynamic instrumentation, all methods of profiled classes are instrumented. This creates some overhead which is significant for methods that have very short execution times. If such methods are called very frequently, the measured time of those methods will be far too high. Also, due to the instrumentation, the hot spot compiler might be prevented from optimizing them. In extreme cases, such methods become the dominant hot spots although this is not true for an uninstrumented run. An example is the method of an XML parser that reads the next character. This method returns very quickly, but may be invoked millions of times in a short time span.
Basically, a profiler adds its own "time length detection code", essentially, but in an empty method the profiler will spend all its time doing that rather than actually allowing the method to run.
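A rough source-level illustration of the effect (real agents rewrite bytecode rather than Java source, and the names here are invented, but the shape is the same):

```java
// Illustration only: what an instrumenting profiler conceptually turns an
// empty method into. The injected bookkeeping is the method's entire cost.
public class InstrumentationOverhead {
    static long recordedNanos = 0;

    static void emptyMethod() {
        // original body: nothing
    }

    static void instrumentedEmptyMethod() {
        long start = System.nanoTime();               // injected by profiler
        // original body: nothing
        recordedNanos += System.nanoTime() - start;   // injected by profiler
    }

    public static void main(String[] args) {
        int calls = 1_000_000;

        long t0 = System.nanoTime();
        for (int i = 0; i < calls; i++) emptyMethod();
        long plain = System.nanoTime() - t0;

        long t1 = System.nanoTime();
        for (int i = 0; i < calls; i++) instrumentedEmptyMethod();
        long instrumented = System.nanoTime() - t1;

        // Everything "charged" to the empty method is pure timer overhead.
        System.out.println("plain loop:        " + plain + " ns");
        System.out.println("instrumented loop: " + instrumented + " ns");
        System.out.println("charged to empty method: " + recordedNanos + " ns");
    }
}
```

(The JIT may inline or eliminate the plain empty call entirely, which only widens the gap the demo shows: an instrumented empty method reports time that an uninstrumented run would not spend at all.)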
I recommend, if it's possible, to tell VisualVM to stop instrumenting that thread, if it supports such a filtering.
It is generally assumed that using a profiler is much better (for finding performance problems, as opposed to measuring things) than - anything else, really - certainly than the bone-simple way of random pausing.
This assumption is only common wisdom - it has no basis in theory or practice.
There are numerous scholarly peer-reviewed papers about profiling, but none that I've read even address the point, let alone substantiate it.
It's a blind spot in academia, not a big one, but it's there.
Now to your question -
In the screenshot showing the call stack, that is what's known as the "hot path", accounting for roughly 60% of in-thread CPU time. Assuming the code with "saxon" in the name is what you're interested in, it is this:
net.sf.saxon.event.ReceivingContentHandler.startElement
net.sf.saxon.event.ProxyReceiver.startContent
net.sf.saxon.event.ProxyReceiver.startContent
net.sf.saxon.event.StartTagBuffer.startContent
net.sf.saxon.event.ProxyReceiver.startContent
com.saxonica.ee.validate.ValidationStack.startContent
com.saxonica.ee.validate.AttributeValidator.startContent
net.sf.saxon.event.TeeOutputter.startContent
net.sf.saxon.event.ProxyReceiver.startContent
net.sf.saxon.event.ProxyReceiver.startContent
net.sf.saxon.event.Sink.startContent
First, this looks to me like it has to be doing I/O, or at least waiting for some other process to give it content. If so, you should be looking at wall-clock time, not CPU time.
Second, the problem(s) could be at any of those call sites where a function calls the one below. If any such call is not truly necessary and could be skipped or done less often, it will reduce time by a significant fraction.
My suspicion is drawn to StartTagBuffer and to validate, but you know best.
There are other points I could make, but these are the major ones.
ADDED after your edit to the question.
I tend to assume you are looking for ways to optimize the code, not just ways to get numbers for their own sake.
It still looks like only CPU time, not wall-clock time, because there is no I/O in the hot paths. Maybe that's OK in your case, but what it means is, of your 12-minute wall clock time, 11 minutes could be spent in I/O wait, with 1 minute in CPU. If so, you could maybe cut out 30 seconds of fat in the CPU part, and only shorten the time by 30 seconds.
That's why I prefer sampling on wall-clock time, so I have overall perspective.
By looking at hot paths alone, you're not getting a true picture.
For example, if the hot path says function F is on the hot path for, say 40% of the time, that only means F costs no less than 40%. It could be much more, because it could be on other paths that aren't so hot. So you could have a juicy opportunity to speed things up by a lot, but it doesn't get much exposure in the specific path that the profiler chose to show you, so you don't give it much attention.
In fact, a big time-taker might not show up at all because on any specific hot path there's always something else a little bigger, like new, or because it goes by multiple names, such as templated collection class constructors.
It's not showing you any line-resolution information.
If you want to inspect a supposedly high-cost routine for the reason for the cost, you have to look at the lines within it. There's a tendency when looking at a routine to say "It's just doing what it's supposed to do.", but if you are looking at a specific costly line of code, which most often is a method call, you can ask "Is it really necessary to do this call? Maybe I already have the information." It's far more specific in suggesting what you could fix.
Can it actually show you some raw stack samples?
In my experience these are far more informative than any summary, like a hot path, that the profiler can present.
The thing to do is examine the sample and come to a full understanding of what the program was doing, and the reason why, at that point in time.
Then repeat for several more samples.
You will see things that don't need to be done, that you can fix to get substantial speedup.
(Unless the code is already optimal, in which case it will be nice to know.)
The point is, you're looking for problems, not measurements.
Statistically, it's very rough, but good enough, and no problem will escape.
My guess is that the method Sink.startContent actually is called a ridiculous number of times.
Your screenshot shows the Sampling tab, which usually produces realistic timings if used over a long enough interval. If you use the Profiler tab instead, you will also get the invocation count. (You'll also get less realistic timings and your program will get very, very slow, but you only need to do this for a few seconds to get a good idea about the invocation counts.)
It's hard to predict what optimizations and especially inlining HotSpot performs, and the sampling profiler can only attribute the time of inlined methods to the call sites. I suspect that some of the invocation code in saxon might for some reason be attributed to your empty callback function. In that case, you're just suffering the cost of XML, and switching to a different parser might be the only option.
I've had a lot of useful information and guidance from this thread, for which many thanks. However, I don't think the core question has been answered: why is the CPU sampling in VisualVM giving an absurdly high number of hits in a method that does nothing, and that isn't called any more often than many other methods?
For future investigations I will rely on the profiler rather than the sampler, now I have gained a bit of insight into how they differ.
From the profiler I haven't really gained a lot of new information about this specific task, in so far as it has largely confirmed what I would have guessed; but that itself is useful. It has told me that there's no magic bullet for speeding up this particular process, but it has put bounds on what might be achieved by some serious redesign, e.g. a possible future enhancement that appears to have some promise is generating a bytecode validator for each user-defined simple type in the schema.

When and how to properly use loop optimization and transformation techniques

First of all, I would like to know the fundamental difference between loop optimization and loop transformation. Also:
A simple loop in C follows:
for (i = 0; i < N; i++)
{
    a[i] = b[i]*c[i];
}
but we can unroll it to:
for (i = 0; i < N/2; i++)
{
    a[i*2] = b[i*2]*c[i*2];
    a[i*2 + 1] = b[i*2 + 1]*c[i*2 + 1];
}
But we can unroll it further. What is the limit to how far we can unroll it, and how do we find that limit?
There are many more techniques, like loop tiling, loop distribution, etc. How do we determine when to use the appropriate one?
I will assume that the OP has already profiled his/her code and has discovered that this piece of code is actually important, and actually answer the question :-) :
The compiler will try to make the loop unrolling decision based on what it knows about your code and the processor architecture.
In terms of making things faster.
As someone pointed out, unrolling does reduce the number of loop termination condition compares and jumps.
Depending on the architecture, the hardware may also support an efficient way to index near memory locations (e.g., mov eax, [ebx + 4]) without adding additional instructions (this may expand to more micro-ops, though - not sure).
Most modern processors use out of order execution, to find instruction level parallelism. This is hard to do, when the next N instructions are after multiple conditional jumps (i.e., the hardware would need to be able to discard variable levels of speculation).
There is more opportunity to reorder memory operations earlier so that the data fetch latency is hidden.
Code vectorization (e.g., converting to SSE/AVX), may also occur which allows parallel execution of the code in some cases. This is also a form of unrolling.
In terms of deciding when to stop unrolling:
Unrolling increases code size. The compiler knows that there are penalties for exceeding instruction code cache size (all modern processors), trace cache (P4), loop buffer cache (Core2/Nehalem/SandyBridge), micro-op cache (SandyBridge), etc. Ideally it uses static cost-benefit heuristics (a function of the specific code and architecture) to determine which level of unrolling will result in the best overall net performance. Depending on the compiler, the heuristics may vary (often I find that it would be nice to tweak this oneself).
Generally, if the loop contains a large amount of code it is less likely to be unrolled because the loop cost is already amortized, there is plenty of ILP available, and the code bloat cost of unrolling is excessive. For smaller pieces of code, the loop is likely to be unrolled, since the cost is likely to be low. The actual number of unrolls will depend on the specifics of the architecture, compiler heuristics and code, and will be what the compiler decides is optimal (it may not be :-) ).
In terms of when YOU should be doing these optimizations:
When you don't think the compiler did the correct thing. The compiler may not be sophisticated (or sufficiently up to date) enough to use the knowledge of the architecture you are working on optimally.
Possibly, the heuristics just failed (they are just heuristics, after all). In general, if you know the piece of code is very important, try unrolling it; if it improves performance, keep it, otherwise throw it out. Also, only do this when you have roughly the whole system in place, since what may be beneficial when your code working set is 20k may not be beneficial when your code working set is 31k.
This may seem rather off topic to your question but I cannot but stress the importance of this.
The key is to write correct code and get it working as per the requirements without being bothered about micro-optimization.
If you later find your program lacking in performance, then profile your application to find the problem areas and then try to optimize them.
Remember, as one of the wise guys said: it is only 10% of your code that runs 90% of your application's total run time. The trick is to identify that code through profiling and then try to optimize it.
Well, considering that your first attempt at optimizing is already wrong in 50% of all cases (try any odd N), I really wouldn't try anything more complex.
Also instead of multiplying your indices, just add 2 to i and loop up to N again - avoids the unnecessary shifting (minor effect as long as we stay with powers of 2, but still)
To summarize: you created incorrect code that is slower than what a compiler could produce - which, I assume, is a perfect example of why you shouldn't do this stuff.
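For reference, a corrected unroll-by-two that addresses both issues above would step the index by 2 and handle the leftover element when N is odd (shown in Java here, though the question's snippet is C; the shape is the same):

```java
// Corrected unroll-by-two: step i by 2 instead of multiplying indices, and
// clean up the odd leftover element in an epilogue so every N is handled.
public class UnrollDemo {
    static void multiply(long[] a, long[] b, long[] c) {
        int n = a.length;
        int i = 0;
        for (; i + 1 < n; i += 2) {   // main unrolled body
            a[i] = b[i] * c[i];
            a[i + 1] = b[i + 1] * c[i + 1];
        }
        if (i < n) {                  // epilogue: at most one leftover element
            a[i] = b[i] * c[i];
        }
    }

    public static void main(String[] args) {
        long[] b = {1, 2, 3, 4, 5}, c = {10, 10, 10, 10, 10};
        long[] a = new long[5];       // odd length exercises the epilogue
        multiply(a, b, c);
        System.out.println(java.util.Arrays.toString(a)); // [10, 20, 30, 40, 50]
    }
}
```

In C the epilogue is identical; for an unroll factor of k, the main loop runs while `i + k - 1 < n` and the epilogue finishes the remaining `n % k` elements one at a time.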

Does filtering classes for cpu profiling work in Java VisualVM?

I want to filter what classes are being CPU-profiled in Java VisualVM (version 1.7.0 b110325). For this, I tried under Profiler -> Settings -> CPU-Settings to set "Profile only classes" to my package under test, which had no effect. Then I tried to get rid of all java.* and sun.* classes by setting them in "Do not profile classes", which had no effect either.
Is this simply a bug? Or am I missing something? Is there a workaround? I mean other than:
paying for a better profiler
doing sampling by hand (see One could use a profiler, but why not just halt the program?)
switch to the Call Tree view, which is no good since only the Profiler view gives me the percentages of consumed CPU per method.
I want to do this mainly to get halfway correct percentages of consumed CPU per method. For this, I need to get rid of the annoying measurements, e.g. for sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run() (around 70%). Many users seem to have this problem, see e.g.
Java VisualVM giving bizarre results for CPU profiling - Has anyone else run into this?
rmi.transport.tcp.tcptransport Connectionhandler consumes much CPU
Can't see my own application methods in Java VisualVM.
The reason you see sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run() in the profile is that you left the option Profile new Runnables selected.
Also, if you took a snapshot of your profiling session you would be able to see the whole callstack for any hotspot method - this way you could navigate from the run() method down to your own application logic methods, filtering out the noise generated by the Profile new Runnables option.
OK, since your goal is to make the code run as fast as possible, let me suggest how to do it.
I'm no expert on VisualVM, but I can tell you what works. (Only a few profilers actually tell you what you need to know, which is - which lines of your code are on the stack a healthy fraction of wall-clock time.)
The only measuring I ever bother with is some stopwatch on the overall time, or alternatively, if the code has something like a framerate, the number of frames per second. I don't need any sort of further precision breakdown, because it's at best a remote clue to what's wasting time (and more often totally irrelevant), when there's a very direct way to locate it.
If you don't want to do random-pausing, that's up to you, but it's proven to work, and here's an example of a 43x speedup.
Basically, the idea is you get a (small, like 10) number of stack samples, taken at random wall-clock times.
Each sample consists (obviously) of a list of call sites, and possibly a non-call site at the end.
(If the sample is during I/O or sleep, it will end in the system call, which is just fine. That's what you want to know.)
If there is a way to speed up your code (and there almost certainly is), you will see it as a line of code that appears on at least one of the stack samples.
The probability it will appear on any one sample is exactly the same as the fraction of time it uses.
So if there's a call site or other line of code using a healthy fraction of time, and you can avoid executing it, the overall time will decrease by that fraction.
I don't know every profiler, but one I know that can tell you that is Zoom.
Others may be able to do it.
They may be more spiffy, but they don't work any quicker or better than the manual method when your purpose is to maximize performance.

Debugging visually using >>, >, >|, ||, |<, <, <<

Debugging performance problems using a standard debugger is almost hopeless since the level of detail is too high. Other ways are using a profiler, but they seldom give me good information, especially when there is GUI and background threads involved, as I never know whether the user was actually waiting for the computer, or not. A different way is simply using Control + C and see where in the code it stops.
What I really would like is to have Fast Forward, Play, Pause and Rewind functionality combined with some visual representation of the code. This means that I could set the code to run on Fast Forward until I navigate the GUI to the critical spot. Then I set the code to run in slow mode, while I get some visual representation of which lines of code are being executed (possibly some kind of zoomed-out view of the code). I could for example set the execution speed to something like 0.0001x. I believe that I would get a very good visualization this way of whether the problem is inside a specific module, or maybe in the communication between modules.
Does this exist? My specific need is in Python, but I would be interested in seeing such functionality in any language.
The "Fast Forward to critical spot" function already exists in any debugger; it's called a "breakpoint". There are indeed debuggers that can slow down execution, but that will not help you debug performance problems, because it doesn't slow down the computer. The processor, disk and memory are still exactly as slow as before; all that happens is that the debugger inserts delays between each line of code. That means that every line of code suddenly takes more or less the same time, which means that it hides any trace of where the performance problem is.
The only way to find the performance problems is to record every call done in the application and how long it took. This is what a profiler does. Indeed, using a profiler is tricky, but there probably isn't a better option. In theory you could record every call and the timing of every call, and then play that back and forwards with a rewind, but that would use an astonishing amount of memory, and it wouldn't actually tell you anything more than a profiler does (indeed, it would tell you less, as it would miss certain types of performance problems).
You should be able to, with the profiler, figure out what is taking a long time. Note that this can be certain function calls taking a long time because they do a lot of processing, or it can be system calls that take a long time because something (network/disk) is slow. Or it can be that a very fast call is called loads and loads of times. A profiler will help you figure this out. But it helps if you can turn the profiler on just at the critical section (reduces noise) and if you can run that critical section many times (improves accuracy).
The methods you're describing, and many of the comments, seem to me to be relatively weak probabilistic attempts to understand the performance impact. Profilers do work perfectly well for GUIs and other idle-thread programs, though it takes a little practice to read them. I think your best bet is there, though -- learn to use the profiler better, that's what it's for.
The specific use you describe would simply be to attach the profiler but don't record yet. Navigate the GUI to the point in question. Hit the profiler record button, do the action, and stop the recording. View the results. Fix. Do it again.
I assume there is a phase in the app's execution that takes too long - i.e. it makes you wait.
I assume what you really want is to see what you could change to make it faster.
A technique that works is random-pausing.
You run the app under the debugger, and in the part of its execution that makes you wait, pause it, and examine the call stack. Do this a few times.
Here are some ways your program could be spending more time than necessary.
I/O that you didn't know about and didn't really need.
Allocating and releasing objects very frequently.
Runaway notifications on data structures.
others too numerous to mention...
No matter what it is, when it is happening, an examination of the call stack will show it.
Once you know what it is, you can find a better way to do it, or maybe not do it at all.
If the program is taking 5 seconds when it could take 1 second, then the probability you will see the problem on each pause is 4/5. In fact, any function call you see on more than one stack sample, if you could avoid doing it, will give you a significant speedup.
AND, nearly every possible bottleneck can be found this way.
Don't think about function timings or how many times they are called. Look for lines of code that show up often on the stack, that you don't need.
Example Added: If you take 5 samples of the stack, and there's a line of code appearing on 2 of them, then it is responsible for about 2/5 = 40% of the time, give or take. You don't know the precise percent, and you don't need to know.
(Technically, on average it is (2+1)/(5+2) = 3/7 = 43%. Not bad, and you know exactly where it is.)
