I have the following function:
public void scanText(char[] T) {
    int q = 0;
    for (int i = 0; i < T.length; i++) {
        q = transFunc[preCompRow[q] + T[i]];
        if (q == pattern.length) {
            System.out.println("match found at position: " + (i - pattern.length + 2));
        }
    }
}
This function scans a char array searching for matches of a given pattern, which is stored as a finite automaton. The transition function of the automaton is stored in the variable called transFunc.
I am testing this function on a text with 8 million characters and 800,000 patterns. The problem is that the access to the array preCompRow[q] (which is an int[]) is very slow. Performance improves greatly if I delete preCompRow[q] from the code. I think this might be because in every iteration the q variable has a different, non-sequential value (2, 56, 18, 9, ...).
Is there any better way to access an array in a non-sequential manner?
Thanks in advance!
One possible explanation is that your code is seeing poor memory performance due to poor locality in its memory access patterns.
The role of the memory caches in a modern computer is to deal with the speed mismatch between processor instruction times (less than 1 ns) and main memory (5 to 10 ns or more). They work best when your code gets a cache hit most of the time it fetches from memory.
A modern Intel CPU caches memory in lines of 64 bytes and loads from main memory in burst mode. (That corresponds to 16 int values.) The last-level cache on (say) an i7 processor is several MB; the L1 cache is only tens of KB per core.
If your application is able to access the data in a large array (roughly) sequentially, then 15 out of 16 accesses will be cache hits. If the access pattern is non-sequential and the "working set" is a large multiple of the cache size, then you may end up with a cache miss on nearly every memory access.
If memory access locality is the root of your problems, then your options are limited:
redesign your algorithm so that locality of memory references is better
buy hardware with larger caches
(maybe) redesign your algorithm to use GPUs or some other strategy to reduce the memory traffic
Recoding your existing code in C or C++ may give a performance improvement, but the same memory locality problems will bite you there as well.
I am not aware of any tools that can be used to measure cache performance in Java applications.
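That said, a quick way to see the effect for yourself, without any special tooling, is a micro-benchmark that walks the same array sequentially and then in a shuffled order. This is only a rough sketch (the array size is arbitrary, and a serious measurement would need JIT warm-up and repeated runs):

import java.util.Random;

public class LocalityDemo {
    public static void main(String[] args) {
        int n = 16 * 1024 * 1024;              // 16M ints = 64 MB, larger than any CPU cache
        int[] data = new int[n];
        int[] order = new int[n];
        for (int i = 0; i < n; i++) order[i] = i;

        // Shuffle the index array to force a non-sequential access pattern.
        Random rnd = new Random(42);
        for (int i = n - 1; i > 0; i--) {
            int j = rnd.nextInt(i + 1);
            int tmp = order[i]; order[i] = order[j]; order[j] = tmp;
        }

        long t0 = System.nanoTime();
        long sum1 = 0;
        for (int i = 0; i < n; i++) sum1 += data[i];           // sequential: mostly cache hits
        long t1 = System.nanoTime();
        long sum2 = 0;
        for (int i = 0; i < n; i++) sum2 += data[order[i]];    // shuffled: mostly cache misses
        long t2 = System.nanoTime();

        // Print the sums so the JIT cannot eliminate the loops as dead code.
        System.out.println("sequential: " + (t1 - t0) / 1_000_000 + " ms (sum=" + sum1 + ")");
        System.out.println("shuffled:   " + (t2 - t1) / 1_000_000 + " ms (sum=" + sum2 + ")");
    }
}

On typical hardware the shuffled pass is several times slower, even though it performs exactly the same number of additions.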
I have a large amount of data that I'm currently storing in an AtomicReferenceArray<X>, and processing from a large number of threads concurrently.
Each element is quite small and I've just got to the point where I'm going to have more than Integer.MAX_VALUE entries. Unfortunately, Lists and arrays in Java are limited to Integer.MAX_VALUE (or just under) entries. Now, I have enough memory to keep a larger structure in memory: the machine has about 250GB of memory and runs a 64-bit JVM.
Is there a replacement for AtomicReferenceArray<X> that is indexed by longs? (Otherwise I'm going to have to create my own wrapper that stores several smaller AtomicReferenceArrays and maps long accesses to int accesses in the smaller ones.)
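For what it's worth, the wrapper I have in mind would be something like this rough sketch (the class name and chunk size are made up for illustration); it splits the long index into a chunk index and an offset within the chunk:

import java.util.concurrent.atomic.AtomicReferenceArray;

// Hypothetical wrapper, not a library class: maps a long index onto several int-indexed chunks.
public class LongAtomicReferenceArray<X> {
    private static final int CHUNK_BITS = 28;             // 2^28 elements per chunk
    private static final int CHUNK_SIZE = 1 << CHUNK_BITS;
    private static final int CHUNK_MASK = CHUNK_SIZE - 1;

    private final AtomicReferenceArray<X>[] chunks;

    @SuppressWarnings("unchecked")
    public LongAtomicReferenceArray(long length) {
        int nChunks = (int) ((length + CHUNK_SIZE - 1) >>> CHUNK_BITS);
        chunks = new AtomicReferenceArray[nChunks];
        long remaining = length;
        for (int i = 0; i < nChunks; i++) {
            int size = (int) Math.min(remaining, CHUNK_SIZE);
            chunks[i] = new AtomicReferenceArray<>(size);
            remaining -= size;
        }
    }

    public X get(long index) {
        return chunks[(int) (index >>> CHUNK_BITS)].get((int) (index & CHUNK_MASK));
    }

    public void set(long index, X value) {
        chunks[(int) (index >>> CHUNK_BITS)].set((int) (index & CHUNK_MASK), value);
    }

    public boolean compareAndSet(long index, X expect, X update) {
        return chunks[(int) (index >>> CHUNK_BITS)]
                .compareAndSet((int) (index & CHUNK_MASK), expect, update);
    }
}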
Sounds like it is time to use native memory. Having 4+ billion objects is going to cause some dramatic GC pause times. However if you use native memory you can do this with almost no impact on the heap. You can also use memory mapped files to support faster restarts and sharing the data between JVMs.
Not sure what your specific needs are, but there are a number of open source data structures which do this, like HugeArray, Chronicle Queue and Chronicle Map. You can create an array which is 1 TB but uses almost no heap and has no GC impact.
BTW, for each object you create there is an 8-byte reference and a 16-byte header. By using native memory you can save 24 bytes per object, e.g. 4 bn * 24 bytes is 96 GB of memory.
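If you want to experiment with the memory-mapped-file idea using only the standard NIO API (rather than Chronicle's own classes), a rough sketch looks like this; the file name, record count and record layout are made up for the example:

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedRecords {
    public static void main(String[] args) throws IOException {
        long records = 100_000_000L;          // 100M records of 16 bytes = 1.6 GB, all off-heap
        int recordSize = 16;

        try (FileChannel ch = FileChannel.open(Paths.get("records.dat"),
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // A single MappedByteBuffer is limited to 2 GB; bigger data sets need several mappings.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, records * recordSize);

            long i = 12_345L;                               // write record i: a key and a timestamp
            int offset = (int) (i * recordSize);            // offset within this mapping
            buf.putLong(offset, i);
            buf.putLong(offset + 8, System.nanoTime());

            System.out.println(buf.getLong(offset));        // read it back
        }
    }
}

The data lives in the OS page cache rather than the Java heap, so the GC never has to scan it, and the same file can be mapped by another JVM or reopened after a restart.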
I am comparing a Trie with a HashMap storing English words, over 1 million. After the data is loaded, only lookup is performed. I am writing code to test both speed and memory. The speed seems easy to measure, simply recording the system time before and after the testing code.
What's the way to measure the memory usage of an object? In this case, it's either a Trie or a HashMap. I watched the system performance monitor and tested in Eclipse. The OS performance monitor shows over 1G of memory used after my testing program is launched. I doubt that storing the data needs so much memory.
Also, on my Windows machine, it shows that memory usage keeps rising throughout the testing time. This shouldn't happen, since the initial loading time of the data is short, and after that, during the lookup phase, there shouldn't be any additional memory consumption, since no new objects are created. On Linux, the memory usage seems more stable, though it also increased somewhat.
Would you please share some thoughts on this? Thanks a lot.
The short answer is: you can't.
The long answer is: you can estimate the size of objects in memory by repeating a differential memory analysis, calling the GC multiple times before and after the tests. But even then, only a very large number of rounds can approximate the real size. You need a warm-up phase first, and even if it all seems to work smoothly, you can get tripped up by the JIT and other optimizations you were not aware of.
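A rough sketch of that differential approach is below; buildStructure() is a placeholder for loading your Trie or HashMap, and because System.gc() is only a hint, the result should be treated as an estimate:

public class MemoryEstimate {
    public static void main(String[] args) throws InterruptedException {
        long before = usedMemory();

        Object structure = buildStructure();   // placeholder: build the Trie or HashMap here

        long after = usedMemory();
        System.out.println("approx. retained size: " + (after - before) / 1024 + " KB");

        // Keep a reference alive so the structure is not collected before the measurement.
        System.out.println(structure != null);
    }

    private static long usedMemory() throws InterruptedException {
        Runtime rt = Runtime.getRuntime();
        // System.gc() is only a hint; run it a few times and let things settle.
        for (int i = 0; i < 5; i++) {
            System.gc();
            Thread.sleep(50);
        }
        return rt.totalMemory() - rt.freeMemory();
    }

    private static Object buildStructure() {
        // Placeholder for loading the ~1 million words into the structure under test.
        return new java.util.HashMap<String, Boolean>();
    }
}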
In general it's a good rule of thumb to count the amount of objects you use.
If your trie implementation uses objects as the structure representing the data, it is quite possible that your memory consumption is high compared to a map.
If you have a vast amount of data, a map might become slow because of collisions.
A common approach is to optimize later in case optimization is needed.
Did you try the "jps" tool which is provided by Oracle in Java SDK? You can find this in JavaSDK/bin folder. Its a great tool for performance checking and even memory usage.
I am currently working on a system where performance is an important consideration. It is going to be used for processing large quantities of data (some of the object types number in the millions) with non-trivial algorithms (think Integer Programming problems, etc.). At the moment I have a working solution which creates all these data points as objects.
Is there any performance increase to be gained, by treating them as arrays for example? Are there any best practices for working with large numbers of objects in Java (should it be avoided?).
I suggest you start by using a commercial CPU and memory profiler. This will give you a good idea of what your bottlenecks are.
Reducing garbage and making your memory more compact helps more once you have optimised the code to the point that your profilers cannot suggest anything.
You might like to consider which structures fit in your CPU caches better, as this can improve performance by up to 2-5x. e.g. your L3 cache might be 8 MB and more than 5x faster than main memory. The more you can condense your working set to fit into it, the better.
BTW Your L1 cache is 32 KB and ~10x faster again.
This all assumes that the time to perform a GC doesn't bother you. If you create enough objects you can see multi-second, even multi-minute GC stop-the-world pauses.
Arrays or ArrayLists have similar performance although arrays are faster (up to 25% depending on what you do with them). Where you can find a significant performance gain is by avoiding boxed primitives for calculations, in which case the only solution is to use an array.
Apart from that, creating many short lived objects incurs little performance cost, apart from the fact that GC will run more often (but the cost of running minor GC depends on the number of reachable objects, not on unreachable ones).
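To make the boxed-primitive point above concrete, here is a toy comparison (the array size is arbitrary and there is no proper warm-up, so treat the numbers as illustrative only):

public class BoxedVsPrimitive {
    public static void main(String[] args) {
        int n = 10_000_000;

        Integer[] boxed = new Integer[n];    // one heap object per element plus a reference
        int[] primitive = new int[n];        // 4 bytes per element, stored contiguously
        for (int i = 0; i < n; i++) {
            boxed[i] = i;                    // autoboxing allocates an Integer for most values
            primitive[i] = i;
        }

        long t0 = System.nanoTime();
        long s1 = 0;
        for (Integer v : boxed) s1 += v;     // unboxing plus a pointer chase per element
        long t1 = System.nanoTime();
        long s2 = 0;
        for (int v : primitive) s2 += v;     // straight sequential reads
        long t2 = System.nanoTime();

        System.out.println("boxed:     " + (t1 - t0) / 1_000_000 + " ms (sum " + s1 + ")");
        System.out.println("primitive: " + (t2 - t1) / 1_000_000 + " ms (sum " + s2 + ")");
    }
}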
Premature optimization is evil. As Richard says in the comments, write your code, see if it's slow, then improve it. If you have suspicions, write an example to simulate high load. The time spent up front to determine this is worth it.
But as for your question...
Yes, creating objects is more expensive compared to creating primitives. They also occupy more heap space (memory). Also, if you are using objects for only a short time, the garbage collector will have to run more often, which will eat some CPU.
Again, only worry about this if you really need speed improvement.
Prototype key parts of your algorithms, test them in isolation, find the slowest, improve, repeat. Stay single-threaded for as long as possible, but always make a note of what can be done in parallel.
In the end your bottleneck may be one of the following:
CPU, because of the algorithm's computational complexity => try finding a better algorithm (or run on multiple CPUs in parallel if you are just slightly below the target; if you are far below it, parallel processing won't help)
CPU, because of excessive GC => profile memory, use low/zero-GC collections (trove4j etc.) or even arrays of primitive types, or even direct memory buffers from NIO (see the sketch after this list), experiment
Memory => optimize data proximity (use chunked arrays matching cache sizes, etc.)
Contention on concurrent objects => revert to a single-threaded design, try lock-free synchronization primitives, etc.
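As a sketch of the direct-buffer option mentioned above: the record layout here is invented purely for the example, with three 4-byte ints per point kept off the Java heap in a direct ByteBuffer:

import java.nio.ByteBuffer;

public class OffHeapPoints {
    private static final int RECORD = 12;                    // x, y, z as three 4-byte ints
    private final ByteBuffer buf;

    public OffHeapPoints(int count) {
        buf = ByteBuffer.allocateDirect(count * RECORD);     // off-heap, not scanned by the GC
    }

    public void set(int i, int x, int y, int z) {
        int base = i * RECORD;
        buf.putInt(base, x);
        buf.putInt(base + 4, y);
        buf.putInt(base + 8, z);
    }

    public int x(int i) { return buf.getInt(i * RECORD); }

    public static void main(String[] args) {
        OffHeapPoints pts = new OffHeapPoints(1_000_000);    // 1M points, ~12 MB off-heap
        pts.set(0, 1, 2, 3);
        System.out.println(pts.x(0));                        // prints 1
    }
}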
I'm developing a visualization app for Android (including older devices running Android 2.2).
The input model of my app contains an area which typically consists of tens of thousands of vertices. Typical models have 50,000-100,000 vertices (each with x, y, z float coordinates), i.e. they use up 600-1200 KB of total memory. The app requires all vertices to be available in memory at any time. This is all I can share about this app (I'm not allowed to share high-level use cases), so I'm wondering if my conclusions below are correct and whether there is a better solution.
For example, assume there are count=50000 vertices. I see two solutions:
1.) My earlier solution was using my own VertexObj class (better readability due to encapsulation, better locality when accessing individual coordinates):
public static class VertexObj {
    public float x, y, z;
}
VertexObj[] mVertices = new VertexObj[count]; // 50,000 objects
2.) My other idea is using a large float[] instead:
float[] mVertices = new float[count * 3]; // 150,000 float values
The problem with the first solution is the big memory overhead -- we are on a mobile device where the app's heap might be limited to 16-24MB (and my app needs memory for other things too). According to the official Android pages, object allocation should be avoided when it is not truly necessary. In this case, the memory overhead can be huge even for 50,000 vertices:
First of all, the "useful" memory is 50000*3*4 = 600K (this is used up by float values). Then we have +200K overhead due to the VertexObj elements, and probably another +400K due to Java object headers (they're probably at least 8 bytes per object on Android, too). This is 600K "wasted" memory for 50,000 vertices, which is 100% overhead (!). In case of 100,000 vertices, the overhead is 1.2MB.
The second solution is much better, as it requires only the useful 600K for float values.
Apparently, the conclusion is that I should go with float[], but I would like to know the risks in this case. Note that my doubts might be related to lower-level (not strictly Android-specific) aspects of memory management as well.
As far as I know, when I write new float[300000], the app asks the VM to reserve a contiguous block of 300000*4 = 1200K bytes. (It happened to me in Android that I requested a 1MB byte[] and got an OutOfMemoryError, even though the Dalvik heap had much more than 1MB free. I suppose this was because it could not reserve a contiguous block of 1MB.)
Since the GC of Android's VM is not a compacting GC, I'm afraid that if the memory is "fragmented", such a huge float[] allocation may result in an OOM. If I'm right here, then this risk should be handled. E.g. what about allocating several float[] objects, each storing a portion such as 200KB? Such linked-list memory management mechanisms are used by operating systems and VMs, so it sounds unusual to me that I would need to use them here (at the application level). What am I missing?
If nothing, then I guess that the best solution is using a linked list of float[] objects (to avoid OOM but keep overhead small)?
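For illustration, the chunked variant I have in mind would be something like this rough sketch (the chunk size is picked arbitrarily, just to stay well below the allocation size that caused trouble):

// Rough sketch: several moderate float[] blocks instead of one huge contiguous one.
public class ChunkedFloatBuffer {
    private static final int CHUNK_FLOATS = 48 * 1024;   // 48K floats = 192 KB per chunk
    private final float[][] chunks;

    public ChunkedFloatBuffer(int totalFloats) {
        int n = (totalFloats + CHUNK_FLOATS - 1) / CHUNK_FLOATS;
        chunks = new float[n][];
        for (int i = 0; i < n; i++) {
            int size = Math.min(CHUNK_FLOATS, totalFloats - i * CHUNK_FLOATS);
            chunks[i] = new float[size];
        }
    }

    public float get(int index) {
        return chunks[index / CHUNK_FLOATS][index % CHUNK_FLOATS];
    }

    public void set(int index, float value) {
        chunks[index / CHUNK_FLOATS][index % CHUNK_FLOATS] = value;
    }
}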
The out of memory you are facing while allocating the float array is quite strange.
If the biggest contiguous memory block available in the heap is smaller than the memory required by the float array, the heap grows in order to accommodate the required memory.
Of course, this would fail if the heap has already reached the maximum available to your application. That would mean that your application has exhausted the heap and then released a significant number of objects, resulting in memory fragmentation and no more heap to allocate. However, if this is the case, and assuming that the fragmented memory is enough to hold the float array (otherwise your application wouldn't run anyway), it's just a matter of allocation order.
If you allocate the memory required for the float array during application startup, you have plenty of contiguous memory for it. Then you just let your application do the remaining work, as the contiguous memory is already allocated.
You can easily check the memory blocks being allocated (and the free ones) using DDMS in Eclipse, by selecting your app and pressing the Update Heap button.
Just for the sake of avoiding misleading you, I tested this before posting, allocating several contiguous memory blocks of float[300000].
Regards.
I actually ran into a problem where I wanted to embed data for a test case. You'll have quite a fun time embedding huge arrays, because the compiler complains when a method exceeds something like 65,535 bytes of code, which happens easily when declaring a large array inline. However, this is actually a quite common approach.
The rest comes down to optimization. The big question is: would it be worth the trouble doing all of that optimizing? If you aren't hard up on RAM, you should be fine using 1.2 megs. There's also a chance that Java will complain if you have an array that large, but you can do things like use a fancier data structure such as a LinkedList, or chop up the array into smaller ones. For statically set data, I feel an array is a good choice if you are reading it heavily.
I know you can make .xml resource files for integers, so storing the values as integers with a tactic like multiplying the value, reading it in, and then dividing it back would be another option. You can also put things like text files into your assets folder. Just read it once in the application and you can read/write however you like.
As for double vs float, I feel that in your case, a math or science case, doubles would be safer if you can pull it off. If you do any math, you'll have less chance of error with double, especially with an operation like multiplication. Floats are usually faster. I'm not sure if Java does SIMD packing, but if it does, more floats can be packed into a SIMD register than doubles.
In the memory-based computing model, the only running-time calculations that need to be done can be done abstractly, by considering the data structure.
However, there aren't a lot of docs on high-performance disk I/O algorithms. Thus I ask the following set of questions:
1) How can we estimate running time of disk I/o operations? I assume there is a simple set of constants which we might add for looking up a value on disk, rather than in memory...
2) And more specifically, what is the difference between performance for accessing a specific index in a file? Is this a constant time operation? Or does it depend on how "far down" the index is?
3) Finally... how does the JVM optimize access of indexed portions of a file?
4) And, as far as resources in general: are there any good idioms or libraries for on-disk data structure implementations?
1) how can we estimate running time of disk I/o operations? I assume there is a simple set of constants which we might add for looking up a value on disk, rather than in memory...
In chapter 6 of Computer Systems: A Programmer's Perspective they give a pretty practical mathematical model for how long it takes to read some data from a typical magnetic disk.
To quote the last page in the linked pdf:
Putting it all together, the total estimated access time is
Taccess = Tavg seek + Tavg rotation + Tavg transfer
= 9 ms + 4 ms + 0.02 ms
= 13.02 ms
This example illustrates some important points:
• The time to access the 512 bytes in a disk sector is dominated by the seek time and the rotational latency. Accessing the first byte in the sector takes a long time, but the remaining bytes are essentially free.
• Since the seek time and rotational latency are roughly the same, twice the seek time is a simple and reasonable rule for estimating disk access time.
(Note: the linked PDF is from the author's website, so no piracy.)
Of course, if the data being accessed was recently accessed, there's a decent chance it's cached somewhere in the memory hierarchy, in which case the access time is extremely small (practically "near instant" when compared to disk access time).
2)And more specifically, what is the difference between performance for accessing a specific index in a file? Is this a constant time operation? Or does it depend on how "far down" the index is?
Another seek + rotation amount of time may be needed if the seeked location isn't stored sequentially nearby. It depends on where in the file you're seeking, and where that data is physically stored on the disk. For example, fragmented files are guaranteed to cause disk seeks to read the entire file.
Something to keep in mind is that even though you may only request to read a few bytes, the physical reads tend to occur in multiples of fixed-size chunks (the sector size), which end up in cache. So you may later do a seek to some nearby location in the file and get lucky that it's already in cache for you.
BTW, the full chapter in that book on the memory hierarchy is pure gold, if you're interested in the subject.
1) If you need to compare the speed of various IO functions, you have to just run it a thousand times and record how long it takes.
2) That depends on how you plan to get to this index. An index to the beginning of a file is exactly the same as an index to the middle of a file; it just points to a location on the disk. If you get to this index by starting at the beginning and progressing there, then yes, it will take longer.
3/4) No, these are managed by the operating system itself. Java isn't low-level enough to handle these kinds of operations.
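For example, a crude timing loop over random reads might look like the sketch below (the file name is a placeholder, and the file is assumed to be much larger than the 512-byte block):

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Random;

public class RandomReadTiming {
    public static void main(String[] args) throws IOException {
        File file = new File("big-data.bin");     // placeholder: any large existing file
        byte[] block = new byte[512];
        Random rnd = new Random();

        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            long length = raf.length();
            int reads = 1000;
            long t0 = System.nanoTime();
            for (int i = 0; i < reads; i++) {
                long pos = (long) (rnd.nextDouble() * (length - block.length));
                raf.seek(pos);                     // random position: likely a real seek if uncached
                raf.readFully(block);
            }
            long t1 = System.nanoTime();
            System.out.println("avg per random 512-byte read: "
                    + (t1 - t0) / reads / 1000 + " us");
        }
    }
}

Run it a few times: the first run mostly measures the disk, while later runs increasingly measure the OS page cache.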
high performance disk I/o algorithms.
The performance of your hardware is usually so important that what you do in software doesn't matter so much. You should first consider buying the right hardware for the job.
how can we estimate running time of disk I/o operations? I assume there is a simple set of constants which we might add for looking up a value on disk, rather than in memory...
It's simple to time them, as they are always going to take many microseconds each. For example, an HDD can perform 80-120 IOPS and an SSD can perform 80K to 230K IOPS. You can usually get to within half of what the manufacturer specifies fairly easily; getting 100% is where you might need to do tricks in software. Nevertheless, you will never get an HDD to perform like an SSD unless you have lots of memory and only ever read the data, in which case the OS will do all the work for you.
You can buy hybrid drives which give you the capacity of an HDD but performance close to that of an SSD. For commercial production use you may be willing to spend the money on a disk subsystem with multiple drives. This can increase the performance to, say, 500 IOPS, but the cost increases significantly. You usually buy a disk subsystem because you need the capacity and redundancy it provides, but you usually get a performance boost as well from having more spindles working together. Although this link on disk subsystem performance is old (2004), things haven't changed that much since then.
And more specifically, what is the difference between performance for accessing a specific index in a file? Is this a constant time operation? Or does it depend on how "far down" the index is?
It depends on whether it is in memory or not. If it is very close to data you recently read, it quite likely is; if it is far away, it depends on what accesses you have done in the past and how much memory you have free to cache disk accesses.
The typical latency for an HDD is ~8 ms per access (i.e. if you have 10 random reads queued, it can take 80 ms). The typical latency of an SSD is 25 to 100 us. It is far less likely that reads will already be queued, as it is much faster to start with.
how does the JVM optimize access of indexed portions of a file?
Assuming you are using sensible buffer sizes, there is little you can do generically in software. What can be done is done by the OS.
are there any good idioms or libraries for on disk data structure implementations?
Use a sensible buffer size like 512 bytes to 64 KB.
Much more importantly, buy the right hardware for your requirements.
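As a small illustration of the buffer-size point, this reads a stream through a BufferedInputStream with an explicit buffer size (the file name is a placeholder):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class BufferedRead {
    public static void main(String[] args) throws IOException {
        int bufferSize = 64 * 1024;   // 64 KB: the upper end of the range suggested above
        long total = 0;
        try (BufferedInputStream in =
                     new BufferedInputStream(new FileInputStream("input.dat"), bufferSize)) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {   // most iterations are served from the buffer
                total += n;
            }
        }
        System.out.println("read " + total + " bytes");
    }
}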
1) how can we estimate running time of disk I/o operations? I assume there is a simple set of constants which we might add for looking up a value on disk, rather than in memory...
There are no such universal constants. In fact, performance models of physical disk I/O, file systems and operating systems are too complicated to be able to make accurate predictions for specific operations.
2)And more specifically, what is the difference between performance for accessing a specific index in a file? Is this a constant time operation? Or does it depend on how "far down" the index is?
It is too complicated to predict. For instance, it depends on how much file buffering the OS does, physical disk parameters (e.g. seek times) and how effectively the OS can schedule disk activity ... across all applications.
3)Finally... how does the JVM optimize access of indexed portions of a file?
It doesn't. It is an operating system level thing.
4) are there any good idioms or libraries for on disk data structure implementations?
That is difficult to answer without more details of your actual requirements. But the best idea is not to try to implement this kind of thing yourself. Find an existing library that is a good fit for your requirements.
Also note that Linux systems, at least, allow different file systems. Depending on the application, one might be a better fit than the others. http://en.wikipedia.org/wiki/File_system#Linux