Best method to count bytecodes executed for Java code

I was trying to get timing data for various Java programs. Then I had to perform some regression analysis based on this timing data. Here are the two methods I used to get the timing data:
System.currentTimeMillis(): I used this initially, but I wanted the timing data to be constant when the same program was run multiple times. The variation was huge in this case. When two instances of the same code were executed in parallel, the variation was even greater. So I dropped this and started looking for some profilers.
The -XX CountBytecodes flag in the HotSpot JVM: Since the variation in timing data was huge, I thought of measuring the number of bytecodes executed when the code was run. This should have given a more stable count when the same program was executed multiple times. But this also had variations. When the programs were executed sequentially, the variations were small, but during parallel runs of the same code, the variations were huge. I also tried running with -Xint, but the results were similar.
So I am looking for some profiler that could give me the count of bytecodes executed when a piece of code runs. The count should remain constant (or have a correlation close to 1) across runs of the same program. Alternatively, is there some other metric I could base timing data on that stays almost constant across multiple runs?

I wanted the timing data to be constant when the same program was run multiple times
That is not possible on a real machine unless it is designed as a hard real-time system, which your machine almost certainly is not.
I am looking for some profiler that could give me the count of byte codes executed when a code is executed.
Assuming you could do this, it wouldn't prove anything. You wouldn't be able to see, for example, that ++ can be 90x cheaper than %, depending on the hardware you run it on. You wouldn't be able to see that a branch miss on an if can be up to 100x more expensive than a correctly speculated branch. You wouldn't be able to see that a memory access to an area of memory which triggers a TLB miss can be more expensive than copying 4 KB of data.
if there could be some other metric based on which I could get timing data, which should stay almost constant across multiple runs.
You can run it many times and take the average. This will hide any high results/outliers and give you a favourable idea of throughput. It can be a reproducible number for a given machine, if run long enough.
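For illustration, a minimal sketch of that run-it-many-times-and-average approach; workload() here is just a placeholder for the code under test, not anything from the question:

    // Minimal sketch: warm up, then time many iterations and report the average.
    public class AverageTiming {
        static void workload() {
            // placeholder for the code being measured
        }

        public static void main(String[] args) {
            int warmup = 10_000, measured = 100_000;
            for (int i = 0; i < warmup; i++) workload();   // let the JIT settle first
            long start = System.nanoTime();
            for (int i = 0; i < measured; i++) workload();
            long avgNanos = (System.nanoTime() - start) / measured;
            System.out.println("average per run: " + avgNanos + " ns");
        }
    }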

Related

Is processing a large 3d array with anywhere from 100 to 1000 instructions per array row a good problem to be solved on the GPU?

I have a problem where I need to run a complex function on a large 3d array. For each array row, I will execute anywhere from 100 to 1000 instructions, and depending on the data on that row some instructions will or not be executed.
This array is large but would still fit inside the GPU's memory (around 2 GB in size). I could execute these instructions on separate parts of the array given that they don't need to be processed in order, so I'm thinking executing on the GPU could be a good option. I'm not entirely sure, because the instructions executed will change depending on the data itself (lots of if/then/else in there), and I've read branching could be an issue.
These instructions are an abstract syntax tree representing a short program that operates over the array row and returns a value.
Does this look like an appropriate problem to be tackled by the GPU?
What other info would be needed to determine that?
I'm thinking to write this in Java and use JCuda.
Thanks!
Eduardo
It Depends. How big is your array, i.e. how many parallel tasks does your array provide (in your case it sounds like the number of rows is the number of parallel tasks you're going to execute)? If you have few rows (ASTs) but many columns (commands), then maybe it's not worth it. The other way round would work better, because more work can be parallelized.
Branching can indeed be an issue if you're not aware of it. You can do some optimizations, though, to mitigate that cost - after you've got your initial prototype running and can do some comparison measurements.
The issue with branching is that all streaming multiprocessors in one "Block" need to execute the same instruction. If one core does not need that instruction, it sleeps. So if you have two ASTs, each with 100 distinct commands, the multiprocessors will take 200 commands to complete the calculation, and some of the SMs will be sleeping while the others execute their commands.
If you have 1000 commands max and some only use a subset, the processor will take as many commands as the AST with the most commands has - in the optimal case. E.g. a set of (100, 240, 320, 1, 990) will run for at least 990 commands, even though one of the ASTs only uses one command. And if that command isn't in the set of 990 commands from the last AST, it even runs for 991 commands.
You can mitigate this (after you have the prototype working and can do actual measurements) by optimizing the array you send to the GPU, so that one set of Streaming Multiprocessors (Block) has a similar set of instructions to do. As different SMs don't interfere with each other on the execution level, they don't need to wait on each other. The size of the blocks is also configurable when you execute the code, so you can adjust it somewhat here.
For even more optimization - only 32 (NVidia "Warp")/64 (AMD "Wavefront") of the threads in a block are executed at the same time, so if you organize your array to exploit this, you can even gain a bit more.
How much of a difference those optimizations make depends on how sparse / dense / mixed your command array will be. Also, not all optimizations actually improve your execution time. Testing and comparing is key here. Another source of optimization is your memory layout, but for the use case you describe it shouldn't be a problem. You can look up Memory Coalescing for more info on that.
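As a rough illustration of the grouping idea above, here is a sketch that pre-sorts rows by how many AST commands apply to them, so rows of similar cost land in the same block; Row and commandCount are hypothetical names, not from the question, and sorting by command count is only one simple proxy for "similar set of instructions":

    import java.util.Arrays;
    import java.util.Comparator;

    // Hypothetical row holder: the data plus the number of AST commands that apply to it.
    class Row {
        float[] values;
        int commandCount;
    }

    class DivergenceMitigation {
        // Rows with similar command counts end up adjacent, so a block is less often
        // dominated by a single very long AST while the rest of its threads idle.
        static Row[] groupByCost(Row[] rows) {
            Row[] sorted = rows.clone();
            Arrays.sort(sorted, Comparator.comparingInt(r -> r.commandCount));
            return sorted;
        }
    }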

Profiling Java code changes execution times

I'm trying to optimize my code, but it's giving me problems.
I've got this list of objects:
List<DataDescriptor> descriptors;

public class DataDescriptor {
    public int id;
    public String name;
}
There are 1700 objects with unique ids (0-1699) and names; these are used to decode what type of data I get later on.
The method that I try to optimize works like that:
public void processData(ArrayList<DataDescriptor> descriptors, ArrayList<IncomingData> incomingDataList) {
    for (IncomingData data : incomingDataList) {
        DataDescriptor desc = descriptors.get(data.getDataDescriptorId());
        if (desc.getName().equals("datatype_1")) {
            doOperationOne(data);
        } else if (desc.getName().equals("datatype_2")) {
            doOperationTwo(data);
        } else if ....
            .
            .
        } else if (desc.getName().equals("datatype_16")) {
            doOperationSixteen(data);
        }
    }
}
This method is called about a million times when processing a data file, and every time incomingDataList contains about 60 elements, so this set of if/elses is executed about 60 million times.
This takes about 15 seconds on my desktop (i7-8700).
Changing the code to test integer ids instead of strings obviously shaves off a few seconds, which is nice, but I hoped for more :)
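(For reference, a sketch of what that integer-id test might look like, assuming a hypothetical int typeCode field on DataDescriptor holding 1..16 instead of the "datatype_N" name:)

    for (IncomingData data : incomingDataList) {
        DataDescriptor desc = descriptors.get(data.getDataDescriptorId());
        switch (desc.typeCode) {               // hypothetical int field, not in the class above
            case 1:  doOperationOne(data);     break;
            case 2:  doOperationTwo(data);     break;
            // ... cases 3 through 15 ...
            case 16: doOperationSixteen(data); break;
            default: break;                    // unknown type
        }
    }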
I tried profiling using VisualVM, but for this method (with string testing) it says that 66% of the time is spent in "Self time" (which I believe would be all this string testing? And why doesn't it say it's in the String.equals method?) and 33% is spent on descriptors.get - which is a simple get from an ArrayList, and I don't think I can optimize it any further, other than trying to change how the data is structured in memory (still, this is Java, so I don't know if this would help a lot).
I wrote a "simple benchmark" app to isolate this String vs int comparison. As expected, comparing integers was about 10x faster than String.equals when I simply ran the application, but when I profiled it in VisualVM (I wanted to check whether ArrayList.get would also be so slow in the benchmark), strangely both methods took exactly the same amount of time. When using VisualVM's Sample instead of Profile, the application finished with the expected results (ints being 10x faster), but VisualVM was showing that in its sample both types of comparisons took the same amount of time.
What is the reason for getting such totally different results when profiling and when not? I know that there are a lot of factors - there is the JIT, and profiling may interfere with it, etc. - but in the end, how do you profile and optimize Java code when the profiling tools change how the code runs (if that's the case)?
Profilers can be divided into two categories: instrumenting and sampling. VisualVM includes both, but both of them have disadvantages.
Instrumenting profilers use bytecode instrumentation to modify classes. They basically insert special tracing code at every method entry and exit. This makes it possible to record all executed methods and their running times. However, this approach comes with a big overhead: first, because the tracing code itself can take a lot of time (sometimes even more than the original code); second, because the instrumented code becomes more complicated and prevents certain JIT optimizations that could otherwise be applied to the original code.
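Conceptually (this is only an illustration, not the actual code VisualVM generates), an instrumented method behaves as if it were rewritten like this, with Profiler standing in for the agent's recording calls:

    int computeSomething(int x) {
        Profiler.enter("computeSomething");    // inserted at method entry
        try {
            return x * 31 + 7;                 // original method body
        } finally {
            Profiler.exit("computeSomething"); // inserted at every exit path
        }
    }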
Sampling profilers are different. They do not modify your application; instead, they periodically take a snapshot of what the application is doing, i.e. the stack traces of the currently running threads. The more often a method appears in these stack traces, the longer (statistically) its total execution time.
Sampling profilers typically have much smaller overhead; furthermore, this overhead is manageable, since it directly depends on the profiling interval, i.e. how often the profiler takes thread snapshots.
The problem with sampling profilers is that the JDK's public API for getting stack traces is flawed. The JVM does not take a stack trace at an arbitrary moment in time. Rather, it stops a thread at one of the predefined places where it knows how to reliably walk the stack. These places are called safepoints. Safepoints are located at method exits (excluding inlined methods) and inside loops (excluding short counted loops). That's why, if you have a long linear piece of code or a short counted loop, you'll never see it in a sampling profiler that relies on the JVM's standard getStackTrace API.
This problem is known as Safepoint Bias. It is described well in a great post by Nitsan Wakart. VisualVM is not the only victim. Many other profilers, including commercial tools, also suffer from the same issue, because the original problem is in the JVM rather than in a particular profiling tool.
Java Flight Recorder is much better, since it does not rely on safepoints. However, it has its own flaws: for example, it cannot get a stack trace when a thread is executing certain JVM intrinsic methods like System.arraycopy. This is especially disappointing, since arraycopy is a frequent bottleneck in Java applications.
Try async-profiler. The goal of the project is exactly to solve the above issues. It should provide a fair view of the application performance, while having a very small overhead. async-profiler works on Linux and macOS. If you are on Windows, JFR is still your best bet.

Execution time of JUnit test cases varies every time. Why?

I have a set of 196 test methods. The execution time of these test cases varies every time I run them. They are run in a controlled environment; for example, to help garbage collection, I set references to null in tearDown().
Every time before executing the tests, I also make sure CPU usage, memory usage, disk space, and system load are the same at every start.
Also, the time variation does not follow any particular pattern. I need to know why we don't get stable execution times when executing the same test cases again.
I made 93 cases stable by including a warm-up period in the class. The other cases are related to database connections (reading data from or updating data in the database). Is it possible to get the same execution time every time I run these test cases? (Execution time refers to JUnit test case execution time.)
Two primary things come to mind with Java performance:
You need to warm up the JVM; your tests are executed as bytecode and are at the mercy of the JVM. That means executing the same test upwards of thousands of times during the same run.
JUnit tests are not measured with much accuracy. In fact, it's pretty much impossible to get an exact performance reading, even with libraries built specifically for this. This is why taking an average of multiple samples is generally suggested.
Still, these points and the others Reto suggested only cover the variance that Java itself can introduce, where variance on the order of milliseconds is only to be expected. For an example of this, create a unit test that puts a thread to sleep for 10 ms. Watch as you're given results anywhere from 7 ms to 13 ms to 17 ms or more. It just isn't a reliable way to measure things.
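A minimal sketch of such a test (JUnit 4 style) that makes the jitter visible when run repeatedly:

    import org.junit.Test;

    public class SleepTimingTest {
        @Test
        public void tenMillisecondSleep() throws InterruptedException {
            long start = System.nanoTime();
            Thread.sleep(10);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            // Rerun this a few times: the printed value typically wanders around 10 ms
            // (and occasionally well above it), depending on OS scheduling.
            System.out.println("slept for ~" + elapsedMs + " ms");
        }
    }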
If you're connecting to a network, uploading data to a database, etc., I can't speak to that, but you need to take the variance of those systems into account as well.
I would suggest breaking your three tests with the greatest variance into smaller blocks. Try and isolate where your biggest bottleneck is, then concentrate on optimizing that operation or set of operations. I would think that connecting to the database takes the greatest amount of time, next to that would likely be executing the query. But you should isolate the measurement of these operations to make sure of that.

Methods of limiting emulated cpu speed

I'm writing a MOS 6502 processor emulator as part of a larger project I've undertaken in my spare time. The emulator is written in Java, and before you say it, I know it's not going to be as efficient and optimized as if it were written in C or assembly, but the goal is to make it run on various platforms, and it's pulling 2.5 MHz on a 1 GHz processor, which is pretty good for an interpreted emulator. My problem is quite the contrary: I need to limit the number of cycles to 1 MHz. I've looked around but haven't seen many strategies for doing this. I've tried a few things, including checking the time after a number of cycles and sleeping for the difference between the expected time and the actual time elapsed, but checking the time slows down the emulation by a factor of 8. Does anyone have any better suggestions, or perhaps ways to optimize time polling in Java to reduce the slowdown?
The problem with using sleep() is that you generally only get a granularity of 1ms, and the actual sleep that you will get isn't necessarily even accurate to the nearest 1ms as it depends on what the rest of the system is doing. A couple of suggestions to try (off the top of my head-- I've not actually written a CPU emulator in Java):
stick to your idea, but check the time between a large-ish number of emulated instructions (execution is going to be a bit "lumpy" anyway especially on a uniprocessor machine, because the OS can potentially take away the CPU from your thread for several milliseconds at a time);
as you want to execute in the order of 1000 emulated instructions per millisecond, you could also try just hanging on to the CPU between "instructions": have your program periodically work out by trial and error how many runs through a loop it needs to go between instructions to "waste" enough CPU to make the timing work out at 1 million emulated instructions / sec on average (you may want to see if setting your thread to low priority helps system performance in this case).
I would use System.nanoTime() in a busy wait as #pst suggested earlier.
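A minimal sketch of that idea, batching instructions so the clock is only read occasionally and then busy-waiting up to the deadline; running and executeOneInstruction() are placeholders for the emulator's own stop flag and step method:

    static final long NANOS_PER_CYCLE = 1_000;  // 1 MHz target -> 1000 ns per cycle
    static final int CYCLES_PER_BATCH = 1_000;  // read the clock once per batch, not per cycle

    void run() {
        long deadline = System.nanoTime();
        while (running) {                        // placeholder stop flag
            for (int i = 0; i < CYCLES_PER_BATCH; i++) {
                executeOneInstruction();         // placeholder for the emulator's step
            }
            deadline += CYCLES_PER_BATCH * NANOS_PER_CYCLE;
            while (System.nanoTime() < deadline) {
                // busy wait (Thread.onSpinWait() can be used here on Java 9+)
            }
        }
    }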
You can speed up the emulation by generating byte code. Most instructions should translate quite well and you can add a busy wait call so each instruction takes the amount of time the original instruction would have done. You have an option to increase the delay so you can watch each instruction being executed.
To make it really cool you could generate 6502 assembly code as text with matching line numbers in the byte code. This would allow you to use the debugger to step through the code, breakpoint it and see what the application is doing. ;)
A simple way to emulate the memory is to use a direct ByteBuffer, or native memory via the Unsafe class, to access it. This gives you a block of memory you can access as any data type in any order.
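For example, a minimal sketch of a 64 KB address space backed by a direct ByteBuffer, including the 6502's little-endian word reads:

    import java.nio.ByteBuffer;

    public class Memory {
        private final ByteBuffer ram = ByteBuffer.allocateDirect(0x10000); // 64 KB

        int readByte(int address) {
            return ram.get(address & 0xFFFF) & 0xFF;   // unsigned byte
        }

        void writeByte(int address, int value) {
            ram.put(address & 0xFFFF, (byte) value);
        }

        int readWord(int address) {                    // 6502 is little-endian
            return readByte(address) | (readByte(address + 1) << 8);
        }
    }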
You might be interested in examining the Java Apple Computer Emulator (JACE), which incorporates 6502 emulation. It uses Thread.sleep() in its TimedDevice class.
Have you looked into creating a Timer object that goes off at the cycle length you need? You could have the timer itself initiate the next loop.
Here is the documentation for the Java 6 version:
http://download.oracle.com/javase/6/docs/api/java/util/Timer.html
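A small sketch of that approach: since java.util.Timer's granularity is milliseconds, schedule a batch of cycles per tick rather than one tick per 6502 cycle (cpu.executeCycles here is a hypothetical method on your emulator):

    import java.util.Timer;
    import java.util.TimerTask;

    Timer timer = new Timer(true);                 // daemon timer thread
    timer.scheduleAtFixedRate(new TimerTask() {
        @Override
        public void run() {
            cpu.executeCycles(1_000);              // ~1000 cycles per 1 ms tick ~= 1 MHz
        }
    }, 0, 1);                                      // fire every 1 ms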

How do I make this Java code parallelizable? How do I make it cloudable?

I'm working on a system at the moment. It's a complex system but it boils down to a Solver class with a method like this:
public int solve(int problem); // returns the solution, or 0 if no solution found
Now, when the system is up and running, a run time of about 5 seconds for this method is expected and is perfectly fast enough. However, I plan to run some tests that look a bit like this:
List<Integer> problems = getProblems();
List<Integer> solutions = new ArrayList<Integer>(problems.size());
Solver solver = getSolver();

for (int problem : problems) {
    solutions.add(solver.solve(problem));
}
// see what percentage of solutions are zero
// get arithmetic mean of non-zero solutions
// etc etc
The problem is I want to run this on a large number of problems, and don't want to wait forever for the results. So say I have a million test problems and I want the tests to complete in the time it takes me to make a cup of tea, I have two questions:
Say I have a million core processor and that instances of Solver are threadsafe but with no locking (they're immutable or something), and that all the computation they do is in memory (i.e. there's no disk or network or other stuff going on). Can I just replace the solutions list with a threadsafe list and kick off threads to solve each problem and expect it to be faster? How much faster? Can it run in 5 seconds?
Is there a decent cloud computing service out there for Java where I can buy 5 million seconds of time and get this code to run in five seconds? What do I need to do to prepare my code for running on such a cloud? How much does 5 million seconds cost anyway?
Thanks.
You have expressed your problem with two major points of serialisation: Problem production and solution consumption (currently expressed as Lists of integers). You want to get the first problems as soon as you can (currently you won't get them until all problems are produced).
I am assuming as well that there is a correlation between the problem list order and the solution list order – that is solutions.get(3) is the solution for problems.get(3) – this would be a huge problem for parallelising it. You'd be better off having a Pair<P, S> of problem/solution so you don't need to maintain the correlation.
Parallelising the solver method will not be difficult, although exactly how you do it will depend a lot on the compute costs of each solve method (generally the more expensive the method the lower the overhead costs of parallelising, so if these are very cheap you need to batch them). If you end up with a distributed solution you'll have much higher costs of course. The Executor framework and the fork/join extensions would be a great starting point.
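A minimal sketch of that starting point, using an ExecutorService and keeping each solution paired with its problem; the names problems and solver come from the question, and the int[] pair is just a stand-in for a proper Pair type:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    ExecutorService pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
    List<Future<int[]>> futures = new ArrayList<>();
    for (int problem : problems) {
        futures.add(pool.submit(() -> new int[] { problem, solver.solve(problem) }));
    }
    for (Future<int[]> f : futures) {
        int[] pair = f.get();   // pair[0] = problem, pair[1] = solution (0 = unsolved);
                                // note: get() throws checked exceptions, handle them in real code
        // ... accumulate statistics here ...
    }
    pool.shutdown();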
You're asking extremely big questions. There is overhead for threads, and a key thing to note is that they run in the parent process. If you wanted to run a million of these solvers at the same time, you'd have to fork them into their own processes.
You can use one program per input, and then use a simple batch scheduler like Condor (for Linux) or HPC (for Windows). You can run those on Amazon too, but there's a bit of a learning curve, it's not just "upload Java code & go".
Sure, you could use a standard worker-thread paradigm to run things in parallel. But there will be some synchronization overhead (e.g., updates to the solutions list will cause lock contention when everything tries to finish at the same time), so it won't run in exactly 5 seconds. But it would be faster than 5 million seconds :-)
Amazon EC2 runs between US$0.085 and US$0.68 per hour depending on how much CPU you need (see pricing). So, maybe about $120. Of course, you'll need to set up something separate to distribute your jobs across various CPUs. One option might be just to use Hadoop (see this question about whether Hadoop is right for running simulations).
You could read things like Guy Steele's talk on parallelism for more info on how to think parallel.
Use an appropriate Executor. Have a look at http://download.oracle.com/javase/6/docs/api/java/util/concurrent/Executors.html#newCachedThreadPool()
Check out these articles on concurrency:
http://www.vogella.de/articles/JavaConcurrency/article.html
http://www.baptiste-wicht.com/2010/09/java-concurrency-part-7-executors-and-thread-pools/
Basically, Java 7's new Fork/Join model will work really well for this approach. Essentially you can set up your million+ tasks and it will spread them as best it can across all available processors. You would have to provide your custom "Cloud" task executor, but it can be done.
This assumes, of course, that your "solving" algorithm is ridiculously parallel. In short, as long as the Solver is fully self-contained, the tasks should be able to be split among an arbitrary number of processors.
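A rough sketch of the fork/join shape, assuming (as the question does) that Solver is thread-safe, and representing problems and solutions as plain int arrays for simplicity:

    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveAction;

    class SolveTask extends RecursiveAction {
        static final int THRESHOLD = 1_000;        // below this, solve sequentially
        final Solver solver; final int[] problems; final int[] solutions; final int lo, hi;

        SolveTask(Solver solver, int[] problems, int[] solutions, int lo, int hi) {
            this.solver = solver; this.problems = problems; this.solutions = solutions;
            this.lo = lo; this.hi = hi;
        }

        @Override
        protected void compute() {
            if (hi - lo <= THRESHOLD) {
                for (int i = lo; i < hi; i++) solutions[i] = solver.solve(problems[i]);
            } else {
                int mid = (lo + hi) >>> 1;          // split the range and solve both halves in parallel
                invokeAll(new SolveTask(solver, problems, solutions, lo, mid),
                          new SolveTask(solver, problems, solutions, mid, hi));
            }
        }
    }
    // usage: new ForkJoinPool().invoke(new SolveTask(solver, problems, solutions, 0, problems.length));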
