Java's BitSet is held entirely in memory and has no compression.
Say I have 1 billion entries in a bit map: that occupies 125 MB of memory.
Say I have to do AND and OR operations on 10 such bit maps: that takes about 1250 MB (roughly 1.2 GB) of memory, which is unacceptable.
How to do fast operations on such bit maps without holding them uncompressed in memory?
I do not know the distribution of the bits in the bit set.
I have also looked at JavaEWAH, which is a variant of the Java BitSet class, using run-length encoding (RLE) compression.
Is there any better solution?
One solution is to keep the arrays off the heap.
You'll want to read this answer by @PeterLawrey to a related question.
In summary the performance of Memory-Mapped files in Java is quite good and it avoids keeping huge collections of objects on the heap.
The operating system may limit the size of an individual memory-mapped region, but it's easy to work around that limitation by mapping multiple regions. If the regions are a fixed size, simple arithmetic on an entity's index is enough to find the corresponding region in the list of memory-mapped files.
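For illustration, here is a minimal sketch of that layout: one large bitmap split across fixed-size memory-mapped regions, with plain arithmetic on the bit index selecting the region. The region size, file name and class shape are assumptions for the sketch, not a recommendation of a particular layout.

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Sketch: a large off-heap bitmap split across fixed-size memory-mapped regions.
public class MappedBitmap {
    private static final long REGION_BYTES = 1L << 30; // 1 GiB per mapped region (arbitrary choice)
    private final MappedByteBuffer[] regions;

    public MappedBitmap(String file, long totalBytes) throws Exception {
        int n = (int) ((totalBytes + REGION_BYTES - 1) / REGION_BYTES);
        regions = new MappedByteBuffer[n];
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            raf.setLength(totalBytes);
            for (int i = 0; i < n; i++) {
                long offset = i * REGION_BYTES;
                long size = Math.min(REGION_BYTES, totalBytes - offset);
                regions[i] = raf.getChannel().map(FileChannel.MapMode.READ_WRITE, offset, size);
            }
        }
    }

    // Simple arithmetic on the bit index finds the region, the byte within it, and the bit.
    public void set(long bitIndex) {
        long byteIndex = bitIndex >>> 3;
        MappedByteBuffer region = regions[(int) (byteIndex / REGION_BYTES)];
        int pos = (int) (byteIndex % REGION_BYTES);
        region.put(pos, (byte) (region.get(pos) | (1 << (bitIndex & 7))));
    }

    public boolean get(long bitIndex) {
        long byteIndex = bitIndex >>> 3;
        MappedByteBuffer region = regions[(int) (byteIndex / REGION_BYTES)];
        return (region.get((int) (byteIndex % REGION_BYTES)) & (1 << (bitIndex & 7))) != 0;
    }
}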
Are you sure you need compression? Compression trades time for space. It's possible that the reduced I/O ends up saving you time, but it's also possible that it won't. Can you add an SSD?
If you haven't yet tried memory-mapped files, start with that. I'd take a close look at implementing something on top of Peter's Chronicle.
If you need more speed you could try doing your binary operations in parallel.
If you end up needing compression you could always implement it on top of Chronicle's memory mapped arrays.
From the comments, here is what I would add as a complement to your initial question:
the bit distribution is unknown, so BitSet is probably the best we can use
you have to use the bit fields in different modules and want to cache them
That being said, my advice would be to implement a dedicated cache solution, using a LinkedHashMap with access order if LRU is an acceptable eviction strategy, and permanent storage on disk for the BitSets.
Pseudo code:
class BitSetHolder {
    private final int size;            // maximum number of BitSets kept in memory
    private final BitSetCache bitSetCache;

    BitSetHolder(int size) {
        this.size = size;
        this.bitSetCache = new BitSetCache(size);
    }

    class BitSetCache extends LinkedHashMap<Integer, BitSet> {
        BitSetCache(int initialCapacity) {
            super(initialCapacity, 0.75f, true); // true = access order, needed for LRU eviction
        }

        protected boolean removeEldestEntry(Map.Entry<Integer, BitSet> eldest) {
            return size() > BitSetHolder.this.size; // size is known in BitSetHolder
        }
    }

    BitSet get(int i) { // get from cache, otherwise load from disk
        if (bitSetCache.containsKey(i)) {
            return bitSetCache.get(i);
        }
        // not in cache: load it and cache it
        BitSet bitSet = readFromDisk(i); // permanent storage on disk, not shown
        bitSetCache.put(i, bitSet);
        return bitSet;
    }
}
That way:
you have transparent access to your 10 bit sets
you keep in memory the most recently accessed bit sets
you limit memory to the size of the cache (the minimum size should be 3 if you want to create a bit set by combining 2 others)
If this is an option for your requirements, I could develop it a little more. Anyway, this is adaptable to other eviction strategies, LRU being the simplest since it is built into LinkedHashMap.
The best solution depends a great deal on the usage patterns and structure of the data.
If your data has some structure beyond a raw bit blob, you might be able to do better with a different data structure. For example, a word list can be represented very efficiently in both space and lookup time using a DAG.
Sample Directed Graph and Topological Sort Code
BitSet is internally represented as a long[], which makes it slightly more difficult to refactor. If you grab the source from OpenJDK, you'd want to rewrite it so that internally it uses iterators backed by either files or in-memory compressed blobs. You would have to rewrite all the loops in BitSet to use those iterators, so the entire blob never has to be instantiated.
http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/BitSet.java
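The full rewrite is more work than fits here, but as a rough illustration of the streaming idea, here is a sketch that ANDs two bitmaps stored on disk as raw byte blocks, one buffer at a time, so neither bitmap is ever fully resident in memory. The file names, buffer size and zero-padding convention for files of different length are assumptions.

import java.io.*;

// Sketch: AND two on-disk bitmaps block by block; neither input is held fully in memory.
public class StreamedAnd {
    public static void and(File a, File b, File out) throws IOException {
        byte[] bufA = new byte[8192], bufB = new byte[8192];
        try (InputStream inA = new BufferedInputStream(new FileInputStream(a));
             InputStream inB = new BufferedInputStream(new FileInputStream(b));
             OutputStream os = new BufferedOutputStream(new FileOutputStream(out))) {
            while (true) {
                int nA = fill(inA, bufA);
                int nB = fill(inB, bufB);
                int n = Math.min(nA, nB);          // past the shorter file the AND is all zeroes
                if (n == 0) break;
                for (int i = 0; i < n; i++) bufA[i] &= bufB[i];
                os.write(bufA, 0, n);
            }
        }
    }

    // Read as many bytes as possible into buf; returns the number actually read.
    private static int fill(InputStream in, byte[] buf) throws IOException {
        int off = 0;
        while (off < buf.length) {
            int n = in.read(buf, off, buf.length - off);
            if (n < 0) break;
            off += n;
        }
        return off;
    }
}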
Related
I need to implement a large octree in Java. The tree will be very large so I will need to use a paging mechanism and split the tree into many small files for long-term storage.
My concern is that Java objects have a very high space overhead. If I were using C, I would be able to store the 8 reference pointers and any other data with only a byte or so of overhead to store the node type.
Is there any way that I can approach this level of overhead in Java?
I would be tempted to just use a single byte array per file. I could then use offsets in place of pointers (this is how I plan on storing the files). However, even when limiting the max file size, that would easily leave me with arrays too large to fit in contiguous memory, particularly if that memory becomes heavily fragmented. This would also lead to large time overheads for adding new nodes as the entire space would need to be reallocated. A ByteBuffer might resolve the first problem (I am not entirely sure of this), however it would not solve the second, as the size of a ByteBuffer is static.
For the moment I will just stick to using node objects. If anyone knows a more space efficient solution with a low time cost, please let me know.
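To make the offsets-instead-of-pointers idea concrete (this is only an illustration, not a solution to the paging or growth problems), here is a sketch of octree nodes packed as fixed-size records in a ByteBuffer: one type byte plus eight 4-byte child offsets, about 33 bytes per node instead of one Java object per node. The layout is an assumption.

import java.nio.ByteBuffer;

// Sketch: octree nodes stored as fixed-size records in a ByteBuffer,
// with child offsets instead of object references. Layout is illustrative.
public class PackedOctree {
    static final int NODE_SIZE = 1 + 8 * 4;  // 1 type byte + 8 child offsets (int each)
    private final ByteBuffer buf;
    private int nextNode = 0;

    public PackedOctree(int maxNodes) {
        buf = ByteBuffer.allocateDirect(maxNodes * NODE_SIZE);
    }

    // Allocates a new node and returns its offset (the "pointer").
    public int newNode(byte type) {
        int offset = nextNode * NODE_SIZE;
        nextNode++;
        buf.put(offset, type);
        for (int i = 0; i < 8; i++) {
            buf.putInt(offset + 1 + i * 4, -1);  // -1 marks "no child"
        }
        return offset;
    }

    public int child(int nodeOffset, int i)            { return buf.getInt(nodeOffset + 1 + i * 4); }
    public void setChild(int nodeOffset, int i, int c) { buf.putInt(nodeOffset + 1 + i * 4, c); }
    public byte type(int nodeOffset)                   { return buf.get(nodeOffset); }
}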
I'm writing a small system in Java in which I extract n-gram features from text files and later need to perform a Feature Selection process in order to select the most discriminative features.
The Feature Extraction process for a single file returns a Map that contains, for each unique feature, its occurrences in the file. I merge all the files' Maps into one Map that contains the Document Frequency (DF) of all unique features extracted from all the files. The unified Map can contain over 10,000,000 entries.
Currently the Feature Extraction process is working well, and I want to perform Feature Selection, for which I need to implement Information Gain or Gain Ratio. I will have to sort the Map first, perform the computations, and save the results in order to finally get a list of (feature, Feature Selection score) pairs.
My question is:
What is the best practice and the best data structure to hold this large amount of data (~10M) and perform computations?
This is a very broad question, so the answer is going to be broad too. The solution depends on (at least) these three things:
The size of your entries
Storing 10,000,000 integers will require about 40MiB of memory, while storing 10,000,000 x 1KiB records will require more than 9GiB. These are two different problems. Ten million integers are trivial to store in memory in any stock Java collection, while keeping 9GiB in memory will force you to tweak and tune the Java Heap and garbage collector. If the entries are even larger, say 1MiB, then you can forget about in-memory storage entirely. Instead, you'll need to focus on finding a good disk backed data structure, maybe a database.
The hardware you're using
Storing ten million 1KiB records on a machine with 8 GiB of ram is not the same as storing them on a server with 128GiB. Things that are pretty much impossible with the former machine are trivial with the latter.
The type of computation(s) you want to do
You've mentioned sorting, so things like TreeMap or maybe PriorityQueue come to mind. But is that the most intensive computation? And what is the key you're using to sort them? Do you plan on locating (getting) entities based on other properties that aren't the key? If so, that requires separate planning. Otherwise you'd need to iterate over all ten million entries.
Do your computations run in a single thread or multiple threads? If you might have concurrent modifications of your data, that requires a separate solution. Data structures such as TreeMap and PriorityQueue would have to be either locked or replaced with concurrent structures such as ConcurrentLinkedHashMap or ConcurrentSkipListMap.
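If the scores themselves fit in memory, a one-off sort of the entry set is often simpler than keeping a TreeMap ordered on every insert. A rough sketch, assuming the per-feature scores have already been computed into a Map<String, Double> (name and types are placeholders):

import java.util.*;
import java.util.stream.Collectors;

// Sketch: rank features by their (already computed) selection score, highest first.
public class FeatureRanking {
    public static List<Map.Entry<String, Double>> rank(Map<String, Double> scores) {
        return scores.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .collect(Collectors.toList());
    }
}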
You can use a caching system. Check MapDB; it's very efficient and has a TreeMap implementation (so you can have your data ordered without any effort). It also provides data stores to save your data to disk when it cannot be held in memory.
// here a sample that uses the off-heap memory to back the map
Map<String, String> map = DBMaker.newMemoryDirectDB().make().getTreeMap("words");
//put some stuff into map
map.put("aa", "bb");
map.put("cc", "dd");
My intuition is that you could take inspiration from the initial MapReduce paradigm and partition your problem into several smaller but similar ones and then aggregate these partial results in order to reach the complete solution.
If you solve one smaller problem instance at a time (i.e. one file chunk), this guarantees that space consumption is bounded by the space requirements for that single instance.
This approach of processing the file lazily will work regardless of the data structure you choose.
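A rough sketch of that partitioning for the document-frequency case: compute one per-file Map, then fold the partial maps into the aggregate one at a time, so only one partial result is in memory alongside the running total. It assumes each partial map holds a feature's occurrence count for one file, and that document frequency counts the number of files containing the feature.

import java.util.HashMap;
import java.util.Map;

// Sketch: fold per-file feature maps into one document-frequency map, one partial map at a time.
public class DfMerge {
    public static Map<String, Integer> aggregate(Iterable<Map<String, Integer>> perFileMaps) {
        Map<String, Integer> df = new HashMap<>();
        for (Map<String, Integer> fileMap : perFileMaps) {  // one file's (or chunk's) result at a time
            for (String feature : fileMap.keySet()) {
                df.merge(feature, 1, Integer::sum);         // this file contains the feature: DF += 1
            }
        }
        return df;
    }
}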
I'm looking to implement a B-tree (in Java) for a "one use" index where a few million keys are inserted, and queries are then made a handful of times for each key. The keys are <= 40 byte ascii strings, and the associated data always takes up 6 bytes. The B-tree structure has been chosen because my memory budget does not allow me to keep the entire temporary index in memory.
My issue is about the practical details in choosing a branching factor and storing nodes on disk. It seems to me that there are two approaches:
One node always fits within one block. Achieved by choosing a branching factor k so that even for the worst-case key length the storage requirement for keys, data and control structures is <= the system block size. k is likely to be low, and nodes will in most cases have a lot of empty room.
One node can be stored on multiple blocks. Branching factor is chosen independent of key size. Loading a single node may require that multiple blocks are loaded.
The questions are then:
Is the second approach what is usually used for variable-length keys? or is there some completely different approach I have missed?
Given my use case, would you recommend a different overall solution?
I should in closing mention that I'm aware of the jdbm3 project and am considering using it. I will attempt to implement my own in any case, both as a learning exercise and to see whether case-specific optimization can yield better performance.
Edit: Reading about SB-Trees at the moment:
S(b)-Trees
Algorithms and Data Structures for External Memory
I'm missing option C here:
At least two tuples always fit into one block, and the block size is chosen accordingly. Blocks are filled up with as many key/value pairs as possible, which means the branching factor is variable. If the block size is much greater than the average size of a (key, value) tuple, the wasted space would be very low. Since the optimal I/O size for disks is usually 4k or greater and you have a maximum tuple size of 46, this is automatically true in your case.
And for all options you have some variants: B* or B+ Trees (see Wikipedia).
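As a rough sketch of what "fill the block with as many tuples as fit" can look like, here is a packer that appends length-prefixed keys plus the fixed 6-byte value into a 4 KiB buffer until the next tuple no longer fits. The block size and record layout are illustrative assumptions.

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.*;

// Sketch: pack variable-length (key, 6-byte value) tuples into a fixed-size block
// until the next tuple no longer fits. Block size and layout are illustrative choices.
public class BlockPacker {
    static final int BLOCK_SIZE = 4096;

    // Returns how many entries from the list were packed into the block.
    public static int pack(List<Map.Entry<String, byte[]>> entries, ByteBuffer block) {
        int packed = 0;
        for (Map.Entry<String, byte[]> e : entries) {
            byte[] key = e.getKey().getBytes(StandardCharsets.US_ASCII);
            int need = 1 + key.length + 6;        // 1-byte key length + key + fixed 6-byte value
            if (block.remaining() < need) break;  // the branching factor is whatever fit so far
            block.put((byte) key.length).put(key).put(e.getValue(), 0, 6);
            packed++;
        }
        return packed;
    }
}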
JDBM BTree is already self-balancing. It also has defragmentation, which is very fast and solves all the problems described above.
One node can be stored on multiple blocks. Branching factor is chosen independent of key size. Loading a single node may require that multiple blocks are loaded.
Not necessarily. JDBM3 uses mapped memory, so it never reads a full block from disk into memory. It creates a 'view' on top of the block and reads only the partial data actually needed. So instead of reading a full 4 KB block, it may read just 2x128 bytes. This depends on the underlying OS block size.
Is the second approach what is usually used for variable-length keys? or is there some completely different approach I have missed?
I think you missed the point that increasing the on-disk node size decreases performance, as more data has to be read. And a single tree can share both approaches (newly inserted nodes use the first, and they move to the second after defragmentation).
Anyway, a flat file with a memory-mapped buffer is probably best for your problem, since you have a fixed record size and just a few million records.
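A sketch of that flat-file approach: pad every key to 40 bytes, sort the fixed-size 46-byte records once, memory-map the file and binary-search it per lookup. The record layout, zero-padding of keys, single-mapping assumption and class name are all illustrative.

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Sketch: lookups in a flat file of sorted fixed-size records (40-byte key + 6-byte value),
// accessed through a memory-mapped buffer.
public class FlatIndex {
    static final int KEY_LEN = 40, VAL_LEN = 6, REC_LEN = KEY_LEN + VAL_LEN;
    private final MappedByteBuffer buf;
    private final long records;

    public FlatIndex(String file) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            records = raf.length() / REC_LEN;
            buf = raf.getChannel().map(FileChannel.MapMode.READ_ONLY, 0, raf.length());
        }
    }

    // Binary search over the sorted records; returns the 6-byte value or null if absent.
    public byte[] lookup(byte[] paddedKey) {
        long lo = 0, hi = records - 1;
        byte[] key = new byte[KEY_LEN], val = new byte[VAL_LEN];
        while (lo <= hi) {
            long mid = (lo + hi) >>> 1;
            int pos = (int) (mid * REC_LEN);  // a few million 46-byte records fit in one mapping
            for (int i = 0; i < KEY_LEN; i++) key[i] = buf.get(pos + i);
            int cmp = compare(key, paddedKey);
            if (cmp == 0) {
                for (int i = 0; i < VAL_LEN; i++) val[i] = buf.get(pos + KEY_LEN + i);
                return val;
            }
            if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return null;
    }

    private static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < KEY_LEN; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return 0;
    }
}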
Also have a look at LevelDB. It has a new Java port which almost beats JDBM:
https://github.com/dain/leveldb
http://code.google.com/p/leveldb/
You could avoid this hassle if you use some embedded database. Those have solved these problems and some more for you already.
You also write: "a few million keys" ... "[max] 40 byte ascii strings" and "6 bytes [associated data]". This does not add up. One gig of RAM would allow you more than "a few million" entries.
I am looking for an alternative to the Java BitSet implementation. I am implementing a high-performance algorithm, and it seems like using a BitSet object is killing its performance. Any ideas?
Someone here has compared boolean[] to BitSet and concluded with:
BitSet is more memory efficient than boolean[] except for very small sizes. Each boolean in the array takes a byte. The numbers from runtime.freeMemory() are a bit muddled for BitSet, but less.
boolean[] is more CPU efficient except for very large sizes, where they are about even. E.g., for size 1 million boolean[] is about four times faster (e.g. 6 ms vs 27 ms); for ten and a hundred million they are about even.
If you Google, you can find some alternative implementations as well, like JavaEWAH, used by Apache Hive, Apache Spark and Eclipse JGit. It claims:
The goal of word-aligned compression is not to achieve the best compression, but rather to improve query processing time. Hence, we try to save CPU cycles, maybe at the expense of storage. However, the EWAH scheme we implemented is always more efficient storage-wise than an uncompressed bitmap as implemented in the BitSet class. Unlike some alternatives, javaewah does not rely on a patented scheme.
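For reference, a brief sketch of what combining JavaEWAH bitmaps looks like; the class and method names are from the com.googlecode.javaewah package as I recall them, so verify them against the version you actually use.

import com.googlecode.javaewah.EWAHCompressedBitmap;

// Sketch: compressed bitmaps stay compressed while being combined.
public class EwahDemo {
    public static void main(String[] args) {
        // bit positions must be added in increasing order
        EWAHCompressedBitmap a = EWAHCompressedBitmap.bitmapOf(0, 2, 64, 1 << 20);
        EWAHCompressedBitmap b = EWAHCompressedBitmap.bitmapOf(2, 64, 1 << 30);
        EWAHCompressedBitmap and = a.and(b);   // logical AND without decompressing
        EWAHCompressedBitmap or  = a.or(b);    // logical OR
        System.out.println(and.cardinality() + " bits set in the intersection");
        System.out.println(or.cardinality() + " bits set in the union");
    }
}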
While searching for an answer to my question about single byte comparison vs multiple boolean comparisons, I found OpenBitSet.
They claim to be faster than Java's BitSet and to provide direct access to the array of words storing the bits.
I am definitely going to try that. See if it solves your purpose too.
Look at Javolution FastBitSet:
A high-performance bitset integrated with the collection framework as a set of indices and obeying the collection semantic for methods such as FastSet.size() (cardinality) or FastCollection.equals(java.lang.Object) (same set of indices).
See also http://code.google.com/p/guava-libraries/issues/detail?id=724#c3.
If you really must squeeze the maximum performance out of this thing, and if memory does not matter, you can try storing each one of your flags in an integer whose bit size is equal to the width of the data bus of your CPU.
You are probably on a 64-bit data bus CPU, so try long integers.
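A minimal sketch of that idea done by hand with a long[] (essentially what BitSet already does internally), packing 64 flags per word:

// Sketch: flags packed 64 per long, manipulated with shifts and masks.
public class LongFlags {
    private final long[] words;

    public LongFlags(int nbits) {
        words = new long[(nbits + 63) / 64];
    }

    public void set(int i)    { words[i >>> 6] |= 1L << (i & 63); }
    public void clear(int i)  { words[i >>> 6] &= ~(1L << (i & 63)); }
    public boolean get(int i) { return (words[i >>> 6] & (1L << (i & 63))) != 0; }
}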
There are a number of compressed alternatives to the BitSet class. EWAH was already mentioned (https://github.com/lemire/javaewah). More recent additions include Roaring bitmaps (https://github.com/RoaringBitmap/RoaringBitmap) that are used by Apache Lucene, Apache Spark, Elastic Search, and so forth.
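A short sketch of what using Roaring bitmaps looks like; class and method names are from the org.roaringbitmap package, so verify against the release you use.

import org.roaringbitmap.RoaringBitmap;

// Sketch: Roaring bitmaps combined without materializing uncompressed bit arrays.
public class RoaringDemo {
    public static void main(String[] args) {
        RoaringBitmap a = RoaringBitmap.bitmapOf(1, 2, 3, 1000, 1_000_000);
        RoaringBitmap b = RoaringBitmap.bitmapOf(3, 4, 1000, 2_000_000);
        RoaringBitmap intersection = RoaringBitmap.and(a, b); // static AND, inputs untouched
        RoaringBitmap union = RoaringBitmap.or(a, b);
        System.out.println(intersection.getCardinality());    // 2 (bits 3 and 1000)
        System.out.println(union.contains(2_000_000));        // true
    }
}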
How do you optimize the heap size usage of an application that has a lot (millions) of long-lived objects? (big cache, loading lots of records from a db)
Use the right data type
Avoid java.lang.String to represent other data types
Avoid duplicated objects
Use enums if the values are known in advance
Use object pools
String.intern() (good idea?)
Load/keep only the objects you need
I am looking for general programming or Java specific answers. No funky compiler switch.
Edit:
Optimize the memory representation of a POJO that can appear millions of times in the heap.
Use cases
Load a huge csv file in memory (converted into POJOs)
Use Hibernate to retrieve millions of records from a database
Summary of answers:
Use flyweight pattern
Copy on write
Instead of loading 10M objects with 3 properties, is it more efficient to have 3 arrays (or other data structure) of size 10M? (Could be a pain to manipulate data but if you are really short on memory...)
I suggest you use a memory profiler, see where the memory is being consumed and optimise that. Without quantitative information you could end up changing thing which either have no effect or actually make things worse.
You could look at changing the representation of your data, especially if your objects are small.
For example, you could represent a table of data as a series of columns with object arrays for each column, rather than one object per row. This can save a significant amount of overhead for each object if you don't need to represent an individual row. e.g. a table with 12 columns and 10,000,000 rows could use 12 objects (one per column) rather than 10 million (one per row)
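A rough sketch of that row-versus-column idea; the field names are invented purely for illustration:

// Row-oriented: ten million small objects, each paying per-object header overhead.
class TradeRow {
    long id;
    int price;
    short quantity;
}

// Column-oriented: three arrays total, regardless of how many rows there are.
class TradeTable {
    final long[] id;
    final int[] price;
    final short[] quantity;

    TradeTable(int rows) {
        id = new long[rows];
        price = new int[rows];
        quantity = new short[rows];
    }

    // "Row" i is simply index i in every column.
    long idOf(int i) { return id[i]; }
}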
You don't say what sort of objects you're looking to store, so it's a little difficult to offer detailed advice. However some (not exclusive) approaches, in no particular order, are:
Use a flyweight pattern wherever possible.
Caching to disc. There are numerous cache solutions for Java.
There is some debate as to whether String.intern is a good idea. See here for a question re. String.intern(), and the amount of debate around its suitability.
Make use of soft or weak references to store data that you can recreate/reload on demand. See here for how to use soft references with caching techniques; a rough sketch follows this list.
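As a rough sketch of that soft-reference point (the loader function and class name are illustrative): values can be reclaimed by the GC under memory pressure and are recreated on demand.

import java.lang.ref.SoftReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch: a cache whose values can be cleared by the GC under memory pressure
// and recreated on demand via the supplied loader.
public class SoftCache<K, V> {
    private final Map<K, SoftReference<V>> map = new ConcurrentHashMap<>();
    private final Function<K, V> loader;

    public SoftCache(Function<K, V> loader) {
        this.loader = loader;
    }

    public V get(K key) {
        SoftReference<V> ref = map.get(key);
        V value = (ref == null) ? null : ref.get();
        if (value == null) {                        // never cached, or cleared by the GC
            value = loader.apply(key);              // recreate/reload the value
            map.put(key, new SoftReference<>(value));
        }
        return value;
    }
}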
Knowing more about the internals and lifetime of the objects you're storing would result in a more detailed answer.
Ensure good normalization of your object model, don't duplicate values.
Ahem, and if it's only millions of objects, I think I'd just go for a decent 64-bit VM and lots of RAM ;)
Normal "profilers" won't help you much, because you need an overview of all your "live" objects. You need heap dump analyzer. I recommend the Eclipse Memory analyzer.
Check for duplicated objects, starting with Strings.
Check whether you can apply patterns like flightweight, copyonwrite, lazy initialization (google will be your friend).
Take a look at this presentation linked from here. It lays out the memory use of common Java objects and primitives and helps you understand where all the extra memory goes.
Building Memory-efficient Java Applications: Practices and Challenges
You could just store fewer objects in memory. :) Use a cache that spills to disk or use Terracotta to cluster your heap (which is virtual) allowing unused parts to be flushed out of memory and transparently faulted back in.
I want to add something to the point Peter already made (I can't comment on his answer): it's always better to use a memory profiler (check a Java memory profiler) than to go by intuition. 80% of the time it's a routine piece of code we ignore that has the problem. Also, collection classes are more prone to memory leaks.
If you have millions of Integers and Floats etc. then see if your algorithms allow for representing the data in arrays of primitives. That means fewer references and lower CPU cost of each garbage collection.
A fancy one: keep most data compressed in RAM. Only expand the current working set. If your data has good locality, that can work nicely.
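A rough sketch of that "compressed at rest, expanded on access" idea using java.util.zip; the block granularity and class shape are arbitrary choices for illustration.

import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Sketch: keep a byte[] block compressed in memory and inflate it only while it is in the working set.
public class CompressedBlock {
    private final byte[] compressed;
    private final int originalLength;

    public CompressedBlock(byte[] data) {
        originalLength = data.length;
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        compressed = out.toByteArray();
    }

    public byte[] expand() throws Exception {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] data = new byte[originalLength];
        inflater.inflate(data);
        inflater.end();
        return data;
    }
}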
Use better data structures. The standard collections in java are rather memory intensive.
[what is a better data structure]
If you take a look at the source for the collections, you'll see that if you restrict yourself in how you access the collection, you can save space per element.
The way the collections handle growing is no good for large collections. Too much copying. For large collections, you need a block-based algorithm, like a B-tree.
Spend some time getting acquainted with and tuning the VM command line options, especially those concerning garbage collection. While this won't change the memory used by your objects, it can have a big impact on performance with memory-intensive apps on machines with a lot of RAM.
Assign null to variables that are no longer used, making the objects they referenced available for garbage collection.
De-reference collections once they are no longer needed; otherwise the GC won't sweep the objects they hold.
1) Use the right data types wherever possible
class Person {
    int age;
    int status;
}
Here we can use smaller primitive types to save memory for each Person object:
class Person {
    short age;
    byte status;
}
2) Instead of returning new ArrayList<>() from a method, you can use Collections.emptyList(), which returns a shared immutable empty list instead of allocating a new ArrayList with a default capacity of 10.
For example:
// instead of this
public List getResults() {
    .....
    if (failedOperation)
        return new ArrayList<>();
    .....
}

// use this (return the List interface so the shared empty list can be returned)
public List getResults() {
    .....
    if (failedOperation)
        return Collections.emptyList();
    .....
}
3) Prefer creating objects inside methods with limited scope over keeping them as long-lived (e.g. static) fields wherever possible; short-lived objects become unreachable sooner and can be collected earlier. (Note that the objects themselves always live on the heap; only local references live on the stack.)
4) Use binary formats like Protobuf, Thrift, Avro, or MessagePack instead of JSON or XML to reduce the size of data exchanged between services.