I have a large amount of data that I'm currently storing in an AtomicReferenceArray<X>, and processing from a large number of threads concurrently.
Each element is quite small, and I've just got to the point where I'm going to have more than Integer.MAX_VALUE entries. Unfortunately, List and arrays in Java are limited to Integer.MAX_VALUE (or just under) elements. I do have enough memory to keep a larger structure in memory: the machine has about 250 GB of memory and runs a 64-bit JVM.
Is there a replacement for AtomicReferenceArray<X> that is indexed by longs? (Otherwise I'm going to have to create my own wrapper that stores several smaller AtomicReferenceArrays and maps long indices onto int indices within them.)
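For what it's worth, the wrapper I would otherwise have to write could look roughly like this; a minimal sketch, where the class name and the 2^28 page size are arbitrary choices of mine, not an existing library:

import java.util.concurrent.atomic.AtomicReferenceArray;

// Maps a long index onto an array of AtomicReferenceArray "pages",
// each page staying well below Integer.MAX_VALUE elements.
public class LongAtomicReferenceArray<X> {
    private static final int PAGE_BITS = 28;              // 2^28 = ~268M entries per page
    private static final int PAGE_SIZE = 1 << PAGE_BITS;
    private static final int PAGE_MASK = PAGE_SIZE - 1;

    private final AtomicReferenceArray<X>[] pages;
    private final long length;

    @SuppressWarnings("unchecked")
    public LongAtomicReferenceArray(long length) {
        this.length = length;
        int pageCount = (int) ((length + PAGE_SIZE - 1) >>> PAGE_BITS);
        pages = new AtomicReferenceArray[pageCount];
        for (int i = 0; i < pageCount; i++) {
            long remaining = length - ((long) i << PAGE_BITS);
            pages[i] = new AtomicReferenceArray<>((int) Math.min(PAGE_SIZE, remaining));
        }
    }

    public X get(long index) {
        return pages[(int) (index >>> PAGE_BITS)].get((int) (index & PAGE_MASK));
    }

    public void set(long index, X value) {
        pages[(int) (index >>> PAGE_BITS)].set((int) (index & PAGE_MASK), value);
    }

    public boolean compareAndSet(long index, X expect, X update) {
        return pages[(int) (index >>> PAGE_BITS)]
                .compareAndSet((int) (index & PAGE_MASK), expect, update);
    }

    public long length() {
        return length;
    }
}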
Sounds like it is time to use native memory. Having 4+ billion objects is going to cause some dramatic GC pause times. However, if you use native memory, you can do this with almost no impact on the heap. You can also use memory-mapped files to support faster restarts and share the data between JVMs.
I'm not sure what your specific needs are, but there are a number of open-source data structures which do this, such as HugeArray, Chronicle Queue and Chronicle Map. You can create an array which is 1 TB in size but uses almost no heap and has no GC impact.
BTW, for each object you create there is an 8-byte reference and a 16-byte header. By using native memory you can save those 24 bytes per object, e.g. 4 bn * 24 bytes is 96 GB of memory.
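As a rough illustration of the memory-mapped-file idea, a sketch using plain NIO (not any particular library's API). It stores primitive longs rather than object references, the 1 GB chunk size is an arbitrary choice, and it does not give the atomic/volatile semantics of AtomicReferenceArray; it only shows keeping the data itself off-heap:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// A long-indexed array of 64-bit values kept in a memory-mapped file,
// split into 1 GB mappings because a single MappedByteBuffer is limited to 2^31 - 1 bytes.
public class MappedLongArray implements AutoCloseable {
    private static final long CHUNK_BYTES = 1L << 30;           // 1 GB per mapping
    private static final long LONGS_PER_CHUNK = CHUNK_BYTES / 8;

    private final RandomAccessFile file;
    private final MappedByteBuffer[] chunks;

    public MappedLongArray(String path, long length) throws IOException {
        long totalBytes = length * 8;
        file = new RandomAccessFile(path, "rw");
        file.setLength(totalBytes);
        FileChannel channel = file.getChannel();
        int chunkCount = (int) ((totalBytes + CHUNK_BYTES - 1) / CHUNK_BYTES);
        chunks = new MappedByteBuffer[chunkCount];
        for (int i = 0; i < chunkCount; i++) {
            long pos = (long) i * CHUNK_BYTES;
            chunks[i] = channel.map(FileChannel.MapMode.READ_WRITE, pos,
                                    Math.min(CHUNK_BYTES, totalBytes - pos));
        }
    }

    public long get(long index) {
        return chunks[(int) (index / LONGS_PER_CHUNK)]
                .getLong((int) ((index % LONGS_PER_CHUNK) * 8));
    }

    public void set(long index, long value) {
        chunks[(int) (index / LONGS_PER_CHUNK)]
                .putLong((int) ((index % LONGS_PER_CHUNK) * 8), value);
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}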
I have the following function:
public void scanText(char[] T) {
    int q = 0;
    for (int i = 0; i < T.length; i++) {
        q = transFunc[preCompRow[q] + T[i]];
        if (q == pattern.length) {
            System.out.println("match found at position: " + (i - pattern.length + 2));
        }
    }
}
This function scans a char array searching for matches of a given pattern, which is stored as a finite automaton. The transition function of the automaton is stored in the variable called transFunc.
I am testing this function on a text with 8 million characters and 800,000 patterns. The problem is that the access to preCompRow[q] (which is an int[]) is very slow. Performance improves greatly if I remove preCompRow[q] from the code. I think this might be because on every iteration q has a different, non-sequential value (2, 56, 18, 9, ...).
Is there any better way to access an array in a non-sequential manner?
Thanks in advance!
One possible explanation is that your code is seeing poor memory performance due to poor locality in its memory access patterns.
The role of the memory caches in a modern computer is to deal with the speed mismatch between processor instruction times (less than 1 ns) and main memory (tens of nanoseconds or more). They work best when your code gets a cache hit most of the times it fetches from memory.
A modern Intel chipset caches memory in blocks of 64 bytes and loads from main memory in burst mode. (That corresponds to 16 int values.) The caches on (say) an i7 processor range from about 32 KB of L1 per core up to several MB of shared last-level cache.
If your application is able to access the data in a large array (roughly) sequentially, then 15 out of 16 int accesses will be cache hits. If the access pattern is non-sequential and the "working set" is a large multiple of the cache size, then you may end up with a cache miss on almost every memory access.
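As a rough, unscientific illustration of the effect, you can compare summing the same array in index order versus in a shuffled order; the array size and the "sink" trick below are arbitrary choices:

import java.util.Random;

public class LocalityDemo {
    public static void main(String[] args) {
        int n = 10_000_000;
        int[] data = new int[n];
        int[] sequential = new int[n];
        int[] shuffled = new int[n];
        for (int i = 0; i < n; i++) {
            data[i] = i;
            sequential[i] = i;
            shuffled[i] = i;
        }
        // Fisher-Yates shuffle of the index array to force non-sequential access.
        Random rnd = new Random(42);
        for (int i = n - 1; i > 0; i--) {
            int j = rnd.nextInt(i + 1);
            int tmp = shuffled[i]; shuffled[i] = shuffled[j]; shuffled[j] = tmp;
        }
        System.out.println("sequential: " + time(data, sequential) + " ms");
        System.out.println("shuffled:   " + time(data, shuffled) + " ms");
    }

    // Sums data[] in the order given by indices[] and reports the elapsed time.
    static long time(int[] data, int[] indices) {
        long start = System.nanoTime();
        long sum = 0;
        for (int idx : indices) {
            sum += data[idx];
        }
        if (sum == 42) System.out.println();   // keep the JIT from discarding the loop
        return (System.nanoTime() - start) / 1_000_000;
    }
}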
If memory access locality is the root of your problems, then your options are limited:
redesign your algorithm so that locality of memory references is better
buy hardware with larger caches
(maybe) redesign your algorithm to use GPUs or some other strategy to reduce the memory traffic
Recoding your existing code in C or C++ may give a performance improvement, but the same memory locality problems will bite you there as well.
I am not aware of any tools that can be used to measure cache performance in Java applications.
This post would likely be a good candidate for frequently asked questions at OpenHFT.
I am playing with ChronicleMap considering it for an idea but having lots of questions. I am sure most junior programmers who are looking into this product have similar considerations.
Would you explain how memory is managed in this API?
ChronicleMap proclaims some remarkable TBs of off-heap memory available for processing its data, and I would like to get a clear vision of that.
Let's get down to a programmer with a laptop with a 500 GB HD and 4 GB RAM. In this case pure math says the total 'swapped' memory resource available is 504 GB. Let's give the OS and other programs half, and we are left with 250 GB HD and 2 GB RAM. Can you elaborate on the actual memory ChronicleMap can allocate, in numbers relative to the available resources?
Next related questions are relative to the implementation of ChronicleMap.
My understanding is that each ChronicleMap allocates a chunk of memory it works with, and optimal performance/memory usage is achieved when we can accurately predict the amount of data passed through. However, this is a dynamic world.
Let's set up an (exaggerated but possible) example:
Suppose a map of K (key) 'cities' to their V (value) 'descriptions' (of the cities), allowing users large limits on the description length.
The first user enters K = "Amsterdam", V = "City of bicycles", and this entry is used to declare the map; it sets the precedent for the pair like this:
ChronicleMap<CharSequence, CharSequence> cityDescriptions = ChronicleMap
        .of(CharSequence.class, CharSequence.class)
        .averageKey("Amsterdam")
        .averageValue("City of bicycles")
        .entries(5_000)
        .createOrRecoverPersistedTo(citiesAndDescriptions);
Now, the next user gets carried away and writes an essay about Prague.
He passes: K = "Prague", V = "City of 100 towers located in the heart of Europe ... blah, blah ... a million words ..."
Now the programmer had expected a maximum of 5_000 entries, but it gets out of hand and there are many thousands of entries.
Does ChronicleMap allocate memory automatically for such cases? If yes, is there some better approach to declaring ChronicleMaps for this dynamic scenario? If no, would you recommend an approach (best in a code example) for handling such scenarios?
How does this work with persistence to file?
Can ChronicleMaps deplete my RAM and/or disk space? Best practice to avoid that?
In other words, please explain how memory is managed in case of under-estimation and over-estimation of the value (and/or key) lengths and number of entries.
Which of these are applicable in ChronicleMap?
If I allocate a big chunk (.entries(1_000_000), .averageValueSize(1_000_000)) and the actual usage is Entries = 100 and Average Value Size = 100.
What happens?:
1.1. - all works fine, but there will be a large wasted chunk left unused?
1.2. - all works fine, the unused memory is available to:
1.2.1 - ChronicleMap
1.2.2 - given thread using ChronicleMap
1.2.3 - given process
1.2.4 - given JVM
1.2.5 - the OS
1.3. - please explain if something else happens with the unused memory
1.4. - what does the oversized declaration do to my persistence file?
The opposite of case 1 - I allocate a small chunk (.entries(10), .averageValueSize(10)) and the actual usage is 1,000,000s of entries and an Average Value Size of 1,000s of bytes.
What happens?:
Let's get down to a programmer with a laptop with a 500 GB HD and 4 GB RAM. In this case pure math says the total 'swapped' memory resource available is 504 GB. Let's give the OS and other programs half, and we are left with 250 GB HD and 2 GB RAM. Can you elaborate on the actual memory ChronicleMap can allocate, in numbers relative to the available resources?
Under such conditions Chronicle Map will be very slow, with on average 2 random disk reads and writes (4 random disk operations in total) on each operation with Chronicle Map. Traditional disk-based db engines, like RocksDB or LevelDB, should work better when the database size is much bigger than memory.
Now the programmer had expected a maximum of 5_000 entries, but it gets out of hand and there are many thousands of entries.
Does ChronicleMap allocate memory automatically for such cases? If yes, is there some better approach to declaring ChronicleMaps for this dynamic scenario? If no, would you recommend an approach (best in a code example) for handling such scenarios?
Chronicle Map will keep allocating memory as long as the actual number of entries inserted, divided by the number configured through ChronicleMapBuilder.entries(), is not higher than the configured ChronicleMapBuilder.maxBloatFactor(). E.g. if you create a map as
ChronicleMap<CharSequence, CharSequence> cityDescriptions = ChronicleMap
        .of(CharSequence.class, CharSequence.class)
        .averageKey("Amsterdam")
        .averageValue("City of bicycles")
        .entries(5_000)
        .maxBloatFactor(5.0)
        .createOrRecoverPersistedTo(citiesAndDescriptions);
It will start throwing IllegalStateException on attempts to insert new entries once the size reaches ~25,000.
However, Chronicle Map works progressively slower when the actual size grows far beyond the configured size, so the maximum possible maxBloatFactor() is artificially limited to 1000.
The solution right now is to configure the future size of the Chronicle Map via entries() (and averageKey(), and averageValue()) at least approximately correctly.
The requirement to configure a plausible Chronicle Map size in advance is acknowledged to be a usability problem. There is a way to fix this, and it's on the project roadmap.
In other words, please explain how memory is managed in case of under-estimation and over-estimation of the value (and/or key) lengths and number of entries.
Key/value size underestimation: space is wasted in the hash lookup area, ~8 bytes * underestimation factor, per entry. So it could be pretty bad if the actual average entry size (key + value) is small: e.g. if it is 50 bytes and you have configured it as 20 bytes, you will waste ~8 * 50 / 20 = 20 bytes per entry, or 40%. The bigger the average entry size, the smaller the waste.
Key/value size overestimation: if you configure just the key and value average sizes, but not actualChunkSize() directly, the actual chunk size is automatically chosen between 1/8th and 1/4th of the average entry size (key + value). The actual chunk size is the allocation unit in Chronicle Map. So if you configured the average entry size as ~1000 bytes, the actual chunk size will be chosen between 125 and 250 bytes. If the actual average entry size is just 100 bytes, you will lose a lot of space. If the overestimation is small, the expected space losses are limited to about 20% of the data size.
So if you are afraid you may overestimate the average key/value size, configure actualChunkSize() explicitly.
Number of entries underestimation: discussed above. No particular space waste, but the worse the underestimation, the slower Chronicle Map works.
Number of entries overestimation: memory is wasted in the hash lookup area, ~8 bytes * overestimation factor, per entry. See the key/value size underestimation section above on how good or bad it could be, depending on the actual average entry data size.
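If overestimation of sizes is the concern, configuring the allocation unit explicitly, as suggested above, could look roughly like this; the 64-byte chunk size is only an illustrative value, not a recommendation:

ChronicleMap<CharSequence, CharSequence> cityDescriptions = ChronicleMap
        .of(CharSequence.class, CharSequence.class)
        .averageKey("Amsterdam")
        .averageValue("City of bicycles")
        .entries(5_000)
        .maxBloatFactor(5.0)
        .actualChunkSize(64)   // explicit allocation unit instead of the derived 1/8th..1/4th of the average entry size
        .createOrRecoverPersistedTo(citiesAndDescriptions);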
I will be hosting my Cassandra database on Google Cloud. Instances are priced linearly, meaning 1 CPU with 2 GB RAM is $1, 2 CPUs with 4 GB is $2, 4 CPUs with 8 GB is $4, and so on.
I am deciding on the size of my instances and am not sure what the standard is. I was weighing fewer, larger instances (8 CPU, 64 GB) against more, lighter ones (2 CPU, 4 GB). My thought process is that with more instances each node carries less of the overall data, so a node failure has a smaller impact. As well, the OS of these smaller instances would have less overhead because it would accept fewer connections.
These are pros, but here are some cons I can think of:
1) Each instance will be less utilized
2) Cassandra + JVM overhead on so many instances can add up to a lot of overall overhead.
3) I will be using local SSDs as opposed to persistent SSDs (which are much more expensive), meaning each instance will need its own local SSD, which raises costs.
These are some reasons I can think of. Are there any other pros/cons between choosing more, smaller instances vs fewer, larger ones for a Cassandra database (or for nodes in general)? Are there any best practices for choosing Cassandra server sizes?
PS: I added the 'Java' tag because Cassandra is built in Java and runs on the JVM, and I would like to see if the JVM has any pros/cons here.
I think you've hit some of the tradeoff points, but here are a few other things:
As the amount of data stored on a single node increases, the cost of bootstrapping (adding new nodes) increases. For instance, you'll get reasonable bootstrapping times storing 100 GB per node, but the process will take eons with 10 TB per node.
SSD usage makes this less important, but consider using separate physical disks for your commitlog and data.
Configurations with fewer than 4 cores or less than 8 GB of memory are usually not recommended, but your mileage may vary.
I'm developing a visualization app for Android (including older devices running Android 2.2).
The input model of my app contains an area, which typically consists of tens of thousands of vertices. Typical models have 50,000-100,000 vertices (each with x, y, z float coordinates), i.e. they use up 600 KB-1.2 MB of raw float data. The app requires all vertices to be available in memory at any time. This is all I can share about this app (I'm not allowed to share high-level use cases), so I'm wondering if my conclusions below are correct and whether there is a better solution.
For example, assume there are count=50000 vertices. I see two solutions:
1.) My earlier solution was using my own VertexObj class (better readability due to encapsulation, better locality when accessing a vertex's individual coordinates):
public static class VertexObj {
    public float x, y, z;
}

VertexObj[] mVertices = new VertexObj[count]; // 50,000 objects
2.) My other idea is using a large float[] instead:
float[] mVertices = new float[count * 3]; // 150,000 float values
The problem with the first solution is the big memory overhead -- we are on a mobile device where the app's heap might be limited to 16-24MB (and my app needs memory for other things too). According to the official Android pages, object allocation should be avoided when it is not truly necessary. In this case, the memory overhead can be huge even for 50,000 vertices:
First of all, the "useful" memory is 50000*3*4 = 600K (this is used up by float values). Then we have +200K overhead due to the VertexObj elements, and probably another +400K due to Java object headers (they're probably at least 8 bytes per object on Android, too). This is 600K "wasted" memory for 50,000 vertices, which is 100% overhead (!). In case of 100,000 vertices, the overhead is 1.2MB.
The second solution is much better, as it requires only the useful 600K for float values.
Apparently, the conclusion is that I should go with float[], but I would like to know the risks in this case. Note that my doubts might be related with lower-level (not strictly Android-specific) aspects of memory management as well.
As far as I know, when I write new float[300000], the app requests the VM to reserve a contiguous block of 300000*4 = 1200K bytes. (It happened to me in Android that I requested a 1MB byte[], and I got an OutOfMemoryException, even though the Dalvik heap had much more than 1MB free. I suppose this was because it could not reserve a contiguous block of 1MB.)
Since the GC of Android's VM is not a compacting GC, I'm afraid that if the memory is "fragmented", such a huge float[] allocation may result in an OOM. If I'm right here, then this risk should be handled. E.g. what about allocating several float[] objects, each storing a portion such as 200 KB? Such linked-list memory management mechanisms are used by operating systems and VMs, so it sounds unusual to me that I would need to use them here (at the application level). What am I missing?
If nothing, then I guess the best solution is a linked list of float[] chunks (to avoid OOM but keep overhead small)?
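For reference, the chunked idea I have in mind would look roughly like this; the class name and the 64K-floats-per-chunk figure are arbitrary:

// Stores vertices as x,y,z float triples across several moderate-sized
// float[] chunks, so no single huge contiguous allocation is required.
public class ChunkedVertexBuffer {
    private static final int FLOATS_PER_CHUNK = 64 * 1024;   // 256 KB per chunk

    private final float[][] chunks;

    public ChunkedVertexBuffer(int vertexCount) {
        long totalFloats = (long) vertexCount * 3;
        int chunkCount = (int) ((totalFloats + FLOATS_PER_CHUNK - 1) / FLOATS_PER_CHUNK);
        chunks = new float[chunkCount][];
        for (int i = 0; i < chunkCount; i++) {
            long remaining = totalFloats - (long) i * FLOATS_PER_CHUNK;
            chunks[i] = new float[(int) Math.min(FLOATS_PER_CHUNK, remaining)];
        }
    }

    public float get(int vertex, int component) {          // component: 0=x, 1=y, 2=z
        int idx = vertex * 3 + component;
        return chunks[idx / FLOATS_PER_CHUNK][idx % FLOATS_PER_CHUNK];
    }

    public void set(int vertex, int component, float value) {
        int idx = vertex * 3 + component;
        chunks[idx / FLOATS_PER_CHUNK][idx % FLOATS_PER_CHUNK] = value;
    }
}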
The out-of-memory error you are facing while allocating the float array is quite strange.
If the biggest contiguous memory block available in the heap is smaller than the memory required by the float array, the heap increases its size in order to accommodate the required memory.
Of course, this would fail if the heap has already reached the maximum available to your application. That would mean your application exhausted the heap and then released a significant number of objects, resulting in memory fragmentation and no more heap to grow into. However, if this is the case, and assuming that the fragmented memory is enough to hold the float array (otherwise your application wouldn't run anyway), it's just a matter of allocation order.
If you allocate the memory required for the float array during application startup, you have plenty of contiguous memory for it. Then you just let your application do the remaining work, as the contiguous memory is already allocated.
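A minimal sketch of that allocate-early idea, assuming a custom Application subclass registered in the manifest; the class and field names are made up:

import android.app.Application;

public class VisualizationApp extends Application {
    // Reserve the large contiguous block before the heap has had a chance to fragment.
    public static float[] sVertices;

    @Override
    public void onCreate() {
        super.onCreate();
        sVertices = new float[300000]; // ~1.2 MB, matching the question's example
    }
}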
You can easily check the memory blocks being allocated (and the free ones) using DDMS in Eclipse: select your app and press the Update Heap button.
Just for the sake of avoiding misleading you, I tested this before posting, allocating several contiguous memory blocks of float[300000].
Regards.
I actually ran into a problem where I wanted to embed data for a test case. You'll have quite a fun time embedding huge arrays: Eclipse kept complaining when the method exceeded something like 65,535 bytes of bytecode because I declared a large array inline. However, this is actually a quite common approach.
The rest comes down to optimization. The big question is this: would it be worth the trouble doing all of that optimizing? If you aren't hard up on RAM, you should be fine using 1.2 MB. There's also a chance that Java will whine if you have an array that large, but you can do things like use a fancier data structure such as a LinkedList, or chop the array up into smaller ones. For statically set data, I feel an array might be a good choice if you are reading it like crazy.
I know you can make .xml resource files for integers, so storing each value as a scaled integer (multiply before storing, read it in, then divide back) would be another option. You can also put things like data files into your assets folder. Just do it once in the application and you can read them however you like.
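For the assets-folder route, a rough sketch of loading a binary float dump; the "vertices.bin" name and the count-prefix layout are assumptions for the example:

import android.content.Context;
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.IOException;

public final class VertexAssetLoader {
    // Reads a binary asset laid out as: int count, followed by count floats (big-endian).
    public static float[] loadVertices(Context context) throws IOException {
        DataInputStream in = new DataInputStream(
                new BufferedInputStream(context.getAssets().open("vertices.bin")));
        try {
            int count = in.readInt();
            float[] vertices = new float[count];
            for (int i = 0; i < count; i++) {
                vertices[i] = in.readFloat();
            }
            return vertices;
        } finally {
            in.close();
        }
    }
}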
As for double vs float, I feel that in your case, a math or science case, doubles would be safer if you can pull it off. If you do any math, you'll have less chance of rounding error with double, especially with operations like multiplication. Floats are usually faster, though. I'm not sure if Java does SIMD packing, but if it does, more floats can be packed into a SIMD register than doubles.
I have a Java app that uses the spymemcached library (http://code.google.com/p/spymemcached) to read and write objects to memcached.
The app always caches the same type of object to memcached. The cached object is always an ArrayList of 5 or 6 java.util.Strings. Using the SizeOf library (http://www.codeinstructions.com/2008/12/sizeof-for-java.html), I've determined that the average deep size of the ArrayList is about 800 bytes.
Overall, I have allocated 12 GB of RAM to memcached. My question is: How many of these objects can memcached hold?
It's not clear to me if it's correct to use the "800 byte" metric from SizeOf, or if that's misleading. For example, SizeOf counts each char to be 2 bytes. I know that every char in my String is a regular ASCII character. I believe spymemcached uses Java serialization, and I'm not sure if that causes each char to take up 1 byte or 2 bytes.
Also, I don't know how much per-object overhead memcached uses. So the calculation should account for the RAM that memcached uses for its own internal data structures.
I don't need a number that's 100% exact. A rough back-of-the-envelope calculation would be great.
The simple approach would be experimentation:
restart memcache
Check bytes allocated: echo "stats" | nc localhost 11211 | fgrep "bytes "
insert 1 object, check bytes allocated
insert 10 objects, check bytes allocated
etc.
This should give you a good idea of bytes-per-key.
However, even if you figure out your serialized size, that alone probably won't tell you how many objects of that size memcache will hold. Memcache's slab system and LRU implementation make any sort of estimate of that nature difficult.
Memcache doesn't really seem to be designed around guaranteeing data availability -- when you GET a key, it might be there, or it might not: maybe it was prematurely purged; maybe one or two of the servers in your pool went down.
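If you want a quick local estimate of the serialized payload size before memcached's own overhead, and assuming (as the question does) that spymemcached's default transcoder uses standard Java serialization, a small sketch; the sample strings are made up:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.Arrays;

public class PayloadSizeEstimate {
    public static void main(String[] args) throws IOException {
        // Made-up sample resembling the cached value: an ArrayList of ~6 short ASCII strings.
        ArrayList<String> value = new ArrayList<>(Arrays.asList(
                "alpha", "bravo", "charlie", "delta", "echo", "foxtrot"));

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        }
        System.out.println("serialized payload: " + bytes.size() + " bytes");
    }
}

Since ObjectOutputStream writes String contents in modified UTF-8, plain ASCII characters end up as one byte each in the payload, even though SizeOf counts two bytes per char for the in-heap representation.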