I'm writing a program that involves massive number crunching.
Often, I have the option of either computing a value (3-4 additions or multiplications and an if-else check or two, maybe a sort of about five numbers) or looking the value up in a lookup table. Everything is int.
How fast is a memory read in comparison to such simple operations, roughly?
The basic principle of performance tuning: "Don't guess, measure it."
This is impossible to answer in any meaningful way. It depends on the actual code, and on the platform you are using. As a general rule, if there are simple local optimizations that might work, the JIT compiler will do them for you.
You are better off doing the following:
1. Write the program in a simple and natural way.
2. Get it working.
3. Run it on a typical input dataset / problem. If it is fast enough, then stop.
4. Profile the code as it executes a typical input dataset / problem.
5. Use the profiling results to identify the most critical hotspot in your code.
6. Examine the code and identify a possible optimization.
7. Code the optimization and rerun the profiling. Did it improve things?
8. Repeat from step 3 until either the program is running fast enough or you have run out of possible optimizations.
The problem with lookup tables is that you are trading space for time, and the space usage depends on the number of input combinations your application uses. The lookup table approach only pays off in limited cases.
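If you do want a rough answer for your own machine, measure both variants yourself. A minimal sketch (the computation here is invented for illustration; for trustworthy numbers prefer a harness like JMH, since a naive timing loop like this is subject to JIT warm-up effects):

import java.util.concurrent.ThreadLocalRandom;

public class ComputeVsLookup {
    // Table filled once up front with the same values compute() produces.
    static final int[] TABLE = new int[1 << 16];
    static {
        for (int i = 0; i < TABLE.length; i++) TABLE[i] = compute(i);
    }

    // Stand-in for "3-4 additions/multiplications and an if-else check".
    static int compute(int x) {
        int y = x * 3 + 7;
        return (y & 1) == 0 ? y + x : y - x;
    }

    public static void main(String[] args) {
        int[] inputs = ThreadLocalRandom.current()
                .ints(1_000_000, 0, TABLE.length).toArray();
        long sum = 0;
        long t0 = System.nanoTime();
        for (int x : inputs) sum += compute(x);   // arithmetic version
        long t1 = System.nanoTime();
        for (int x : inputs) sum += TABLE[x];     // lookup version
        long t2 = System.nanoTime();
        System.out.println("compute: " + (t1 - t0) / 1e6 + " ms, "
                + "lookup: " + (t2 - t1) / 1e6 + " ms, sum=" + sum);
    }
}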
Related: a somewhat related question, a year old: Do any JVMs' JIT compilers generate code that uses vectorized floating point instructions?
Preface: I am trying to do this in pure Java (no JNI to C++, no GPGPU work, etc.). I have profiled, and the bulk of the processing time comes from the math operations in this method (which is probably 95% floating point math and 5% integer math). I've already reduced all Math.xxx() calls to approximations that are good enough, so most of the math is now floating point multiplies with a few adds.
I have some code that deals with audio processing. I've been making tweaks and have already seen great gains. Now I'm looking into manual loop unrolling to see if there's any benefit (at least with a manual unroll of 2, I am seeing approximately a 25% improvement). While trying my hand at a manual unroll of 4 (which is starting to get very complicated, since I am unrolling both loops of a nested loop), I am wondering if there's anything I can do to hint to the JVM that at runtime it can use vector operations (e.g. SSE2, AVX, etc.). Each sample of the audio can be calculated completely independently of other samples, which is why I've been able to see a 25% improvement already (reducing the number of dependencies between floating point calculations).
For example, I have 4 floats, one for each of the 4 unrolls of the loop, to hold a partially computed value. Does how I declare and use these floats matter? If I make them a float[4], does that hint to the JVM that they are unrelated to each other, versus having float, float, float, float, or even a class of 4 public floats? Is there something I could do without meaning to that would kill my chances of the code being vectorized?
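For concreteness, a minimal sketch of the unroll-by-4 pattern with independent accumulators I'm describing (the dot-product kernel is just a stand-in, not my actual code):

static float dot(float[] a, float[] b) {
    // Four independent accumulators; no iteration depends on the previous one.
    float acc0 = 0f, acc1 = 0f, acc2 = 0f, acc3 = 0f;
    int i = 0;
    for (; i + 3 < a.length; i += 4) {       // main loop, unrolled by 4
        acc0 += a[i]     * b[i];
        acc1 += a[i + 1] * b[i + 1];
        acc2 += a[i + 2] * b[i + 2];
        acc3 += a[i + 3] * b[i + 3];
    }
    for (; i < a.length; i++) {              // tail for lengths not divisible by 4
        acc0 += a[i] * b[i];
    }
    // Note: this reassociates the floating-point additions, so the result
    // can differ from a straight loop in the last few bits.
    return acc0 + acc1 + acc2 + acc3;
}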
I've come across articles online about writing code "normally" because the compiler/JVM knows the common patterns and how to optimize them, and deviating from those patterns can mean less optimization. In this case, at least, I wouldn't have expected unrolling the loops by 2 to improve performance as much as it did, so I'm wondering if there's anything else I can do (or at least avoid doing) to help my chances. I know that the compiler/JVM will only get better, so I also want to be wary of doing things that will hurt me in the future.
Edit for the curious: unrolling by 4 increased performance by another ~25% over unrolling by 2, so I really think vector operations would help in my case if the JVM supported them (or perhaps it is already using them).
Thanks!
How can I ... audio processing ... pure Java (no JNI to C++, no GPGPU work, etc.) ... use vector operations (e.g. SSE2, AVX, etc.)
Java is a high-level language (one Java instruction generates many hardware instructions) that is by design (e.g. garbage-collected memory management) not well suited to tasks that manipulate high data volumes in real time.
There are usually special pieces of hardware optimized for a particular role (e.g. image processing or speech recognition) that often exploit parallelization through several simplified processing pipelines.
There are also special programming languages for this sort of task, mainly hardware description languages and assembly language.
Even C++ (considered a fast language) will not automagically use some super-optimized hardware operations for you. It may just inline one of several hand-crafted assembly routines in certain places.
So my answer is that there is "probably no way" to instruct the JVM to use a hardware optimization such as SSE for your code, and even if there were, the Java runtime would still have too many other factors slowing your code down.
Use a low-level language designed for this task and link it to Java for the high-level logic.
EDIT: adding some more info based on comments
If you are convinced that a high-level "write once, run anywhere" language runtime should definitely also do lots of low-level optimizations for you and automagically turn your high-level code into optimized low-level code, then consider this: how the JIT compiler optimizes depends on the implementation of the Java Virtual Machine, and there are many of them.
In the case of the Oracle JVM (HotSpot), you can start looking for your answer by downloading the source code; the text SSE2 appears in the following files:
openjdk/hotspot/src/cpu/x86/vm/assembler_x86.cpp
openjdk/hotspot/src/cpu/x86/vm/assembler_x86.hpp
openjdk/hotspot/src/cpu/x86/vm/c1_LIRGenerator_x86.cpp
openjdk/hotspot/src/cpu/x86/vm/c1_Runtime1_x86.cpp
openjdk/hotspot/src/cpu/x86/vm/sharedRuntime_x86_32.cpp
openjdk/hotspot/src/cpu/x86/vm/vm_version_x86.cpp
openjdk/hotspot/src/cpu/x86/vm/vm_version_x86.hpp
openjdk/hotspot/src/cpu/x86/vm/x86_32.ad
openjdk/hotspot/src/os_cpu/linux_x86/vm/os_linux_x86.cpp
openjdk/hotspot/src/share/vm/c1/c1_GraphBuilder.cpp
openjdk/hotspot/src/share/vm/c1/c1_LinearScan.cpp
openjdk/hotspot/src/share/vm/runtime/globals.hpp
They're in C++ and assembly language, so you will have to learn some low-level languages to read them anyway.
I would not hunt that deep even for a +500 bounty. IMHO the question is based on wrong assumptions.
SuperWord optimizations in HotSpot are limited and quite fragile: limited because they are generally behind what a C/C++ compiler offers, and fragile because they depend on particular loop shapes (and are only supported for certain CPUs).
I understand you want to write once, run anywhere. It sounds like you already have a pure Java solution. You might want to supplement it with optional implementations for known popular platforms, making it "fast in some places", which is probably already true.
It's hard to give you more concrete feedback without some code. I suggest you take the loop in question and present it as a JMH benchmark. That makes it easy to analyze and discuss.
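A minimal JMH skeleton (the class name and kernel body are placeholders for your actual loop; JMH itself is a separate dependency) might look like:

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class AudioKernelBench {
    float[] samples;

    @Setup
    public void setup() {
        samples = new float[4096];
        for (int i = 0; i < samples.length; i++) samples[i] = (float) Math.sin(i);
    }

    @Benchmark
    public float kernel() {
        float acc = 0f;
        for (float s : samples) acc += s * 0.5f;  // placeholder for the real inner loop
        return acc;  // returning the result keeps it from being dead-code eliminated
    }
}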
I understand the JVM optimizes some things for you (I'm not clear on which things yet), but let's say I were to do this:
while (true) {
    int var = 0;
}
would doing:
int var;
while (true) {
    var = 0;
}
take less space? Since you aren't declaring a new variable every time, you don't have to specify the type every time.
I understand you would really only need to put var outside the while if you wanted to use it outside that loop (instead of only being able to use it locally, as in the first example). Also, what about objects: would it be different with them than with primitive types in that situation? I understand it's a small thing, but a build-up of this kind of stuff can cause my application to take a lot of memory/CPU. I'm trying to use the smallest number of operations possible, but I don't completely understand what's going on behind the scenes.
If someone could help me out, or even link me to somewhere I can learn about saving CPU by decreasing the number of operations, it would be highly appreciated. Please no books (unless they're free! :D), no way of getting one right now /:
Don't. Premature optimization is the root of all evil.
Instead, write your code as it makes most sense conceptually. Write it thoughtfully, yes. But don't think you can be a 'human compiler' and optimize and still write good code.
Once you have written your code (more or less naively, depending on your level of experience) you write performance tests for it. Try to think of different ways in which the code may be used (many times in a row, from front to back or reversed, many concurrent invocations etc) and try to cover these in test cases. Then benchmark your code.
If you find that some test cases are not performing well, investigate why. Measure parts of the test case to see where the time is going. Zoom into the parts where most time is spent.
Mostly, you will find weird loops where, upon reading the code again, you will think 'that was silly to write it that way. Of course this is slow' and easily fix it. In my experience most performance problems can be solved this way and 'hardcore optimization' is hardly ever needed.
In the end you will find that 99* percent of all performance problems can be solved by touching only 1 percent of the code. The other code never comes into play. This is why you should not 'prematurely' optimize. You will be spending valuable time optimizing code that had no performance issues in the first place. And making it less readable in the process.
Numbers made up of course but you know what I mean :)
Hot Licks points out that this isn't much of an answer, so let me expand on it with some good ol' performance tips:
Keep an eye out for I/O
Most performance problems are not in pure Java. Instead, they are in interfacing with other systems. In particular, disk access is notoriously slow. So is the network. So minimize their use.
Optimize SQL queries
SQL queries will add seconds, even minutes, to your program's execution time if you don't watch out. So think about them very carefully. Again, benchmark them. You can write very optimized Java code, but if it first spends ten seconds waiting for the database to run some monster SQL query, then it will never be fast.
Use the right kind of collections
Most performance problems are related to doing things lots of times. Usually when working with big sets of data. Putting your data in a Map instead of in a List can make a huge difference. Also there are specialized collection types for all sorts of performance requirements. Study them and pick wisely.
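For example, a sketch of looking an element up by key (the User type is invented for illustration):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

class UserLookup {
    static class User {
        final int id;
        User(int id) { this.id = id; }
    }

    // O(n) per lookup: scans the list every time.
    static User findInList(List<User> users, int wanted) {
        for (User u : users) if (u.id == wanted) return u;
        return null;
    }

    // Build once; afterwards each lookup is expected O(1).
    static Map<Integer, User> index(List<User> users) {
        Map<Integer, User> byId = new HashMap<>();
        for (User u : users) byId.put(u.id, u);
        return byId;
    }
}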
Don't write code
When performance really matters, squeezing the last 'drops' out of some piece of code becomes a science all in itself. Unless you are writing some very exotic code, chances are great there will be some library or toolkit to solve your kind of problems. It will be used by many in the real world. Tried and tested. Don't try to beat that code. Use it.
We humble Java developers are end-users of code. We take the building blocks that the language and its ecosystem provide and tie them together to form an application. For the most part, performance problems are caused by us not using the provided tools correctly, or not using any tools at all for that matter. But we really need specifics to be able to discuss those. Benchmarking gives you that specificity. And when the slow code is identified, it is usually just a matter of changing a collection from a list to a map, or sorting it beforehand, or dropping a join from some query, etc.
Attempting to optimise code which doesn't need to be optimised increases complexity and decreases readability.
However, there are cases where improving readability also brings improved performance.
For example,
if a numeric value cannot be null, use a primitive instead of a wrapper. This makes it clearer that the value cannot be null, but it also uses less memory and reduces pressure on the GC.
use a Set when you have a collection which cannot have duplicates. Often a List is used when in fact a Set would be more appropriate; depending on the operations you perform, the Set can also be faster by reducing time complexity.
consider using an enum with one instance for a singleton (if you have to use singletons at all). This is much simpler, as well as faster, than double-checked locking. Hint: try to only have stateless singletons.
writing simpler, well-structured code is also easier for the JIT to optimize. Trying to outsmart the JIT with more complex solutions will backfire, because you end up confusing the JIT, and what you think should be faster is actually slower (and more complicated as well).
try to reduce how much you write to the console (and I/O in general) in critical sections. Writing to the console is so expensive, both for the program and for the poor human having to read it, that it is worth spending more time producing concise console output.
use a StringBuilder when you have a loop of elements to append (see the sketch after this list). Note: avoid using StringBuilder for one-liners that are just a series of append() calls, as this can actually be slower and harder to read.
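A sketch of that StringBuilder point:

import java.util.List;

class Join {
    // Slow: each += copies the whole string built so far, O(n^2) overall.
    static String joinSlow(List<String> parts) {
        String out = "";
        for (String p : parts) out += p + ",";
        return out;
    }

    // Better in a loop: one StringBuilder, roughly O(n) overall.
    static String joinFast(List<String> parts) {
        StringBuilder sb = new StringBuilder();
        for (String p : parts) sb.append(p).append(',');
        return sb.toString();
    }
}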
"Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away." -- Antoine de Saint-Exupéry, French writer (1900-1944)
Developers like to solve hard problems, and there is a very strong temptation to solve problems which don't need to be solved. This is a very common behaviour for developers with up to 10 years of experience (it was for me, anyway ;). After about this point you have already solved most common problems before, and you start selecting the best/minimum set of solutions that will solve a problem. This is the point you want to get to in your career, and then you will be able to develop quality software in far less time than you could before.
If you dream up an interesting problem to solve, go ahead and solve it in your own time, see what difference it makes, but don't include it in your working code unless you know (because you measured) that it really makes a difference.
However, if you find a simpler, more elegant solution to a problem, it is worth including, not because it might be faster (though it might be), but because it should make the code easier to understand and maintain, and that is usually a far more valuable use of your time. Successfully used software usually costs three times as much to maintain as it cost to develop. Do what will make life easier for the poor person who has to understand why you did something (which is harder if you had no good reason in the first place), as that person might be you one day ;)
A good example of when you might make an application slower to improve reasoning is the use of immutable values with concurrency. Immutable values are usually slower than mutable ones, sometimes much slower, but when used with concurrency, mutable state is very hard to get provably right, and you need proof because testing it is good but not reliable. With concurrency you have much more CPU to burn, so the small extra cost of immutable objects is a very sensible trade-off. In some cases, using immutable objects can even let you avoid locks and actually improve throughput, e.g. CopyOnWriteArrayList, if you have a high read-to-write ratio.
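A minimal sketch of that last point (the listener list is an assumed use case):

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

class Events {
    // Reads take no lock; each write copies the backing array, so this fits
    // a high read-to-write ratio, e.g. a listener list.
    final List<Runnable> listeners = new CopyOnWriteArrayList<>();

    void subscribe(Runnable l) { listeners.add(l); }  // rare write: copies the array

    void fire() {
        for (Runnable l : listeners) l.run();         // frequent read: lock-free snapshot iteration
    }
}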
I just realized that I have no idea on how to tell whether or not a piece of Java code is efficient from the computational point of view. Reading several source codes sometimes I feel that the code I'm reading is highly inefficient, some other times I feel the opposite.
Could you list the basic one-line rules to respect and why they are so important?
edit - My question is about the Java/JVM implementation side of things, such as allocation, String management, exception handling, thread synchronization and so on.
Thanks in advance
p.s. don't take "one-line" literally, please
Basic one-line rule? Okay, here you go:
Avoid unnecessary computations.
How do you do it? Sorry, no one-line answer to that. :(
Well, people spend years in college learning about algorithms and data structures in computer science for a reason... you might want to take a course on algorithms/data structures sometime.
I'm not sure what you mean by "from a computational point of view" (it seems to imply algorithm issues), but assuming you mean tricks closer to profiling, try these:
Run the program, then suddenly pause it, and see where it paused. Do this a few times; wherever it stops most often is a bottleneck, and how often it stops there indicates how bad the bottleneck is.
Avoid boxing/unboxing (converting between int and Integer, etc.); especially avoid Integer[], List<Integer>, and other things that internally store arrays of boxed primitives (see the sketch below).
Factor out common code (sometimes a speed issue, sometimes a readability one).
Avoid looping over String concatenations; use StringBuilder/StringBuffer instead. (In short, avoid creating and/or copying data when that's not needed.)
I'll add to this if other things come to mind.
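A sketch of the boxing point above:

import java.util.ArrayList;
import java.util.List;

class Boxing {
    public static void main(String[] args) {
        // Boxed: a million separate Integer objects on the heap,
        // plus an array of references to them.
        List<Integer> boxed = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) boxed.add(i);  // autoboxes every int

        // Unboxed: one contiguous ~4 MB block, no per-element objects.
        int[] primitive = new int[1_000_000];
        for (int i = 0; i < primitive.length; i++) primitive[i] = i;
    }
}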
Use profiling. Look at JProfiler or any other profiler.
I'll second Mherdad's answer in that there definitely are no "basic one-line rules."
Regarding the answers that suggest using profiling tools: profiling isn't really useful until you understand algorithmic time complexity and big-O notation. From Wikipedia's article on big-O notation:
In mathematics, computer science, and related fields, big-O notation describes the limiting behavior of a function when the argument tends towards a particular value or infinity, usually in terms of simpler functions. Big O notation characterizes functions according to their growth rates: different functions with the same growth rate may be represented using the same O notation.
The idea behind big-O notation is that it gives you a feel for how input size affects execution time for a given algorithm. For instance, consider the following two methods:
void linearFoo(List<String> strings) {
    for (String s : strings) {
        doSomethingWithString(s);
    }
}

void quadraticFoo(List<String> strings) {
    for (String s : strings) {
        for (String s1 : strings) {
            doSomethingWithTwoStrings(s, s1);
        }
    }
}
linearFoo is said to be O(n), meaning that its running time increases linearly with the input size n (i.e. strings.size()). quadraticFoo is said to be O(n^2), meaning that the time it takes to execute quadraticFoo is a function of strings.size() squared.
Once you have a feel for the algorithmic time complexity of your program, profiling tools start to be useful. For instance, suppose profiling shows that a method typically takes 1 ms for a given input size. If that method is O(n), doubling the input size will roughly double the execution time to 2 ms. If it is O(n^2), doubling the input size will make it take around 4 ms, since (2n)^2 = 4n^2.
Take a look at the book Effective Java by Joshua Bloch if you really want a list of rules to follow in Java. The book offers guidelines not just for performance but also on the proper way of programming in Java.
You can use jconsole to monitor your application's deadlocks, memory leaks, threads and heap. In short, you can see your application's performance in graphs.
I want to ask a complex question.
I have to code a heuristic for my thesis. I need the following:
Evaluate some integral functions
Minimize functions over an interval
Do this thousands and thousands of times.
So I need a fast programming language for these jobs. Which language do you suggest? I started with Java, but evaluating the integrals became a problem, and I'm not sure about its speed.
Connecting Java to other software like MATLAB may be a good idea. Since I'm not sure, I'd like to hear your opinions.
Thanks!
C, Java, ... are all Turing-complete languages; they can compute the same functions with the same precision.
If you want to achieve performance goals, use C, which is a compiled, high-performance language. It can decrease your computation time by avoiding the method calls and high-level features present in an interpreted language like Java.
Anyway, remember that your implementation may impact performance more than which language you choose, because as the input size grows it is the computational complexity that matters (http://en.wikipedia.org/wiki/Computational_complexity_theory).
It's not the programming language, it's probably your algorithm. Determine the big-O complexity of your algorithm. If you use loops within loops where you could instead use a hash lookup in a Map (see the sketch below), your algorithm can be made many times faster.
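A sketch of that loops-in-loops point (the Customer/Order types are invented for illustration):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

class Matcher {
    static class Customer { int id; }
    static class Order { int customerId; }

    // O(n*m): for every order, scan all customers.
    static void matchSlow(List<Order> orders, List<Customer> customers) {
        for (Order o : orders)
            for (Customer c : customers)
                if (c.id == o.customerId) process(o, c);
    }

    // O(n + m): index the customers by id once, then one hash lookup per order.
    static void matchFast(List<Order> orders, List<Customer> customers) {
        Map<Integer, Customer> byId = new HashMap<>();
        for (Customer c : customers) byId.put(c.id, c);
        for (Order o : orders) process(o, byId.get(o.customerId));
    }

    static void process(Order o, Customer c) { /* ... */ }
}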
Note: modern JVMs (JDK 1.5 or 1.6) compile just-in-time to native code (as in not interpreted) for a specific OS, OS version and hardware architecture. You could try the -server flag to JIT even more aggressively (at the cost of an even longer initialization time).
Do this over thousand and thousand times.
Are you sure it's not more, something like 10^1000? Try accurately calculating how many times you need to run that loop; it might surprise you. The type of problems heuristics are used on tends to have a really big search space.
Before you start switching languages, I'd first try to do the following things:
Find the best available algorithms.
Find available implementations of those algorithms usable from your language.
There are e.g. scientific libraries for Java. Try to use these libraries.
If they are not fast enough, investigate whether there is anything to be done about it. Is your problem more specific than what the library assumes? Can you improve the algorithm based on that knowledge?
What is it that takes so much time/memory? Is it really related to your language? Make sure you are measuring the time spent calculating, not the JVM start-up time.
Then I'd consider switching languages. But don't expect it to be easy to beat optimized third-party Java libraries in C.
Order of the algorithm
Typically, switching languages only reduces the time required by a constant factor. Say you can double the speed by using C; but if your algorithm is O(n^2), it will still take four times as long when you double the data, no matter the language.
And the JVM can optimize a lot of things and get good results.
Some possible optimizations in Java
If you have methods that are called many times, make them final, and the same for entire classes. The compiler will then know it can inline the method code, avoiding the creation of method-call stack frames for those calls.
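A sketch of that pattern (note this is only a hint: modern JITs can often inline non-final methods too, and deoptimize if a subclass turns up later):

// final on the class and method tells the compiler no override can exist.
public final class Mixer {
    public final float gain(float sample) {
        return sample * 0.8f;   // trivially inlinable at every call site
    }
}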
For a project at university, we had to implement a few different algorithms to calculate the equivalence classes when given a set of elements and a collection of relations between said elements.
We were instructed to implement, among others, the Union-Find algorithm and its optimizations (Union by Depth, Size). By accident (doing something I thought was necessary for the correctness of the algorithm) I discovered another way to optimize the algorithm.
It isn't as fast as Union By Depth, but close. I couldn't for the life of me figure out why it was as fast as it was, so I consulted one of the teaching assistants who couldn't figure it out either.
The project was in Java, and the data structures I used were based on simple arrays of Integers (the object, not the int).
Later, at the project's evaluation, I was told that it probably had something to do with 'Java caching', but I can't find anything online about how caching would affect this.
What would be the best way, without calculating the complexity of the algorithm, to prove or disprove that my optimization is this fast because of the way Java does things? Implementing it in another (lower-level?) language? But who's to say that language won't do the same thing?
I hope I made myself clear,
thanks
The only way is to prove the worst-case (average case, etc) complexity of the algorithm.
Because if you don't, it might just be a consequence of a combination of
The particular data
The size of the data
Some aspect of the hardware
Some aspect of the language implementation
etc.
It is generally very difficult to perform such a task given modern VMs! As you hint, they do all sorts of things behind your back: method calls get inlined, objects are reused, and so on. A prime example is how trivial loops get compiled away if they obviously do nothing other than counting, or how a function call in functional programming gets inlined or tail-call optimized.
Further, you have the difficulty of proving your point in general on any data set. An O(n^2) algorithm can easily be much faster than a seemingly faster, say O(n), algorithm. Two examples:
Bubble sort is faster at sorting a nearly-sorted data collection than quicksort.
Quicksort, of course, is faster in the general case.
Generally, big-O notation purposely ignores constants, which in a practical situation can mean life or death for your implementation, and those constants may be what hit you here. So in practice 0.00001 * n^2 (say, the running time of your algorithm) can be faster than 1000000 * n * log(n).
So reasoning is hard given the limited information you provide.
It is likely that either the compiler or the JVM found an optimization for your code. You could try reading the bytecode output by the javac compiler, and disabling runtime JIT compilation with the -Djava.compiler=NONE option.
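For example (UnionFind is a hypothetical class name standing in for your own):

javac UnionFind.java
javap -c UnionFind                     # disassemble the bytecode javac produced
java -Djava.compiler=NONE UnionFind    # run with the JIT disabled (pure interpreter)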
If you have access to the source code -- and the JDK source code is available, I believe -- then you can trawl through it to find the relevant implementation details.