In my Java code, I am using Guava's Multimap (com.google.common.collect.Multimap), created like this:
Multimap<Integer, Integer> index = HashMultimap.create();
Here, the Multimap key is one portion of a URL and the value is another portion of the URL (converted into an integer). I assign my JVM 2560 MB (2.5 GB) of heap space (using -Xmx and -Xms), but it can only hold about 9 million such (key, value) pairs of integers (roughly 10 million). The issue is that I can give the JVM only a limited amount of memory (say 2 GB).
So, can anybody help me with the following:
1) Is there another way or home-grown solution to this memory issue? Would a disk/DB-based Multimap be a good solution? I read in some web articles that there are disk/DB-based solutions for this, e.g. Berkeley DB or Ehcache. Can anybody tell me whether (or which one) is faster?
2) Do those disk/DB-based Multimaps have performance issues (I am asking about both storing and searching)?
3) Any brief idea or information on how to use them?
4) Any other idea would be welcome.
NB: I want Multimap solutions (a key can have multiple values) for the above issue, and I have to consider the performance of storing and searching as well.
You certainly won't store 100 million pairs of Integer objects in 2.5 GB of memory. If I'm not mistaken, an Integer uses at least 16 bytes of memory in the Oracle/Sun JVM (and alignment is also 16 bytes), so 100 million pairs means 200 million Integer objects, or 3.2 GB of memory for the Integers alone, without any structure around them.
With this data size you should definitely go with something backed by disk, or use a server with lots of memory and/or optimized data structures (in particular, try to avoid primitive-type wrappers). I have used H2 for similar tasks and found it quite good (it can use memory-mapped files instead of regular reads to access the disk), but I don't have any comparison with other similar libraries.
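As a rough illustration only, a disk-backed multimap can be emulated in H2 with a two-column table and an index on the key column; the database file name, table name, and column names below are made up for this sketch.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class H2MultimapSketch {
    public static void main(String[] args) throws SQLException {
        // "./urlindex" is a hypothetical database file name
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./urlindex")) {
            conn.createStatement().execute("CREATE TABLE IF NOT EXISTS idx(k INT, v INT)");
            conn.createStatement().execute("CREATE INDEX IF NOT EXISTS idx_k ON idx(k)");

            // insert one (key, value) pair; repeated keys are allowed, i.e. multimap semantics
            try (PreparedStatement put = conn.prepareStatement("INSERT INTO idx VALUES(?, ?)")) {
                put.setInt(1, 42);
                put.setInt(2, 7);
                put.executeUpdate();
            }

            // fetch all values stored under key 42
            try (PreparedStatement get = conn.prepareStatement("SELECT v FROM idx WHERE k = ?")) {
                get.setInt(1, 42);
                try (ResultSet rs = get.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getInt(1));
                    }
                }
            }
        }
    }
}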
JDBM3 is a very fast on-disk HashMap/TreeMap (B+Tree) library and is claimed to be 4x faster than Berkeley DB. Billions of records can be stored in a map. It caches internally, so map operations are not slowed down by disk access.
DB db = DBMaker.openFile(fileName).make();
Map<Integer,Integer> map = db.createHashMap("mapName");
map.put(5, 10);
db.close();
It does not provide a Multimap directly, but the value can be a Set or a List.
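If a library only gives you a plain Map, multimap semantics can be layered on top by storing a Set per key. A minimal in-memory sketch of that pattern (the same idea applies to a disk-backed Map):
// assumes java.util imports (HashMap, HashSet, Map, Set, Collections)
Map<Integer, Set<Integer>> index = new HashMap<>();

// put: create the per-key value set on first use, then add the value to it
index.computeIfAbsent(42, k -> new HashSet<>()).add(7);
index.computeIfAbsent(42, k -> new HashSet<>()).add(9);

// get: all values stored under a key (empty set if the key is absent)
Set<Integer> values = index.getOrDefault(42, Collections.emptySet());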
Related
I'm working on a project that requires me to store (potentially) millions of key-value mappings and make (potentially) hundreds of queries a second. There are some checks I can do on the data I'm working with, but they will only reduce the load a little. In addition, I will be making (potentially) hundreds of puts/removes a second, so my questions are: Is there a map sufficient for this task? Is there any way I might optimize the map? Is there something faster that would work for storing key-value mappings?
Some additional information:
- The key will be a point in 3D space; I feel like this means I could use arrays, but the arrays would have to be massive
- The value must be an object
Any help would be greatly appreciated!
Back-of-envelope estimates help in coming to terms with this sort of thing. If you have millions of entries in a map, let's say 32M, and a key is a 3D point (3 ints -> 3*4B -> 12 bytes), then 12B * 32M = 384MB. You didn't mention the size of the value, but assuming a similarly sized value, let's double that figure. This is Java, so assuming a 64-bit platform with compressed OOPs (which is the default and what most people are on), you pay an extra 12B of object header per object. So: 32M * 2 * 24B = 1536MB.
Now if you use a HashMap, each entry requires an extra HashMap.Node; in Java 8 on the platform above you are looking at 32B per node (use OpenJDK JOL to find out object sizes). That brings us to 2560MB. Also throw in the cost of the HashMap array: with 32M entries you are looking at a table with 64M slots (because the array size is a power of 2 and you need some slack beyond your entries), so that's an extra 256MB. All together, let's round it up to 3GB?
Most servers these days have quite large amounts of memory (tens to hundreds of GB), and adding an extra 3GB to the JVM live set should not scare you. You might find it disappointing that the overhead exceeds the data in your case, but this is not about your emotional well-being, it's a question of whether it will work ;-)
Now that you've loaded up the data, you are mutating it at a rate of hundreds of inserts/deletes per second, let's say 1024. Reusing the quantities above, we can sum it up as: 1024 * (24*2 + 32) = 80KB. Churning 80KB of garbage per second is small change for many applications and not something you necessarily need to sweat about. To put it in context, a modern JVM can collect many hundreds of MB of young generation in a matter of tens of milliseconds.
So, in summary, if all you need is to load the data and query/mutate it along the lines you describe, you may well find that a modern server can easily handle a vanilla solution. I'd recommend you give that a go, maybe prototype with a representative data set, and see how it works out. If you hit an issue you can always move to more exotic/efficient solutions.
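For reference, the vanilla version of that approach is just a HashMap keyed on an immutable point class; the Point3D class and the value used below are made up for the sketch.
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical immutable key class; equals/hashCode make it usable as a HashMap key.
final class Point3D {
    final int x, y, z;

    Point3D(int x, int y, int z) { this.x = x; this.y = y; this.z = z; }

    @Override public boolean equals(Object o) {
        if (!(o instanceof Point3D)) return false;
        Point3D p = (Point3D) o;
        return x == p.x && y == p.y && z == p.z;
    }

    @Override public int hashCode() { return Objects.hash(x, y, z); }
}

// usage: plain put/get/remove at the rates described above
Map<Point3D, Object> map = new HashMap<>();
map.put(new Point3D(1, 2, 3), "whatever value object you have");
Object v = map.get(new Point3D(1, 2, 3));
map.remove(new Point3D(1, 2, 3));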
I'm writing a small system in Java in which I extract n-gram features from text files and later need to perform a feature selection process in order to select the most discriminative features.
The feature extraction process for a single file returns a Map which contains, for each unique feature, its occurrences in the file. I merge all of the files' Maps into one Map that contains the document frequency (DF) of all unique features extracted from all the files. The unified Map can contain more than 10,000,000 entries.
Currently the feature extraction process is working great, and I want to perform feature selection, in which I need to implement Information Gain or Gain Ratio. I will have to sort the Map first, perform the computations, and save the results in order to finally get a list of (for each feature, its feature selection score).
My question is:
What is the best practice and the best data structure to hold this large amount of data (~10M) and perform computations?
This is a very broad question, so the answer is going to be broad too. The solution depends on (at least) these three things:
The size of your entries
Storing 10,000,000 integers will require about 40MiB of memory, while storing 10,000,000 x 1KiB records will require more than 9GiB. These are two different problems. Ten million integers are trivial to store in memory in any stock Java collection, while keeping 9GiB in memory will force you to tweak and tune the Java Heap and garbage collector. If the entries are even larger, say 1MiB, then you can forget about in-memory storage entirely. Instead, you'll need to focus on finding a good disk backed data structure, maybe a database.
The hardware you're using
Storing ten million 1KiB records on a machine with 8GiB of RAM is not the same as storing them on a server with 128GiB. Things that are pretty much impossible on the former machine are trivial on the latter.
The type of computation(s) you want to do
You've mentioned sorting, so things like TreeMap or maybe PriorityQueue come to mind. But is that the most intensive computation? And what is the key you're using to sort them? Do you plan on locating (getting) entities based on other properties that aren't the key? If so, that requires separate planning. Otherwise you'd need to iterate over all ten million entries.
Do your computations run in a single thread or multiple threads? If you might have concurrent modifications of your data, that requires a separate solution. Data structures such as TreeMap and PriorityQueue would have to be either locked or replaced with concurrent structures such as ConcurrentLinkedHashMap or ConcurrentSkipListMap.
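For example, if the per-feature scores fit in memory, a single sort of the entry list is often simpler than maintaining a TreeMap the whole time. A small sketch, where df is a placeholder name for the unified Map<String, Integer> of document frequencies:
// assumes java.util imports (ArrayList, List, Map, Comparator)
List<Map.Entry<String, Integer>> entries = new ArrayList<>(df.entrySet());
entries.sort(Map.Entry.comparingByValue(Comparator.reverseOrder()));   // highest DF first
// iterate once over 'entries' to compute Information Gain / Gain Ratio per feature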
You can use a caching system. Check out MapDB: it's very efficient and has a tree map implementation (so you can have your data ordered without any effort). It also provides data stores that spill your data to disk when it cannot be held in memory.
// here a sample that uses the off-heap memory to back the map
Map<String, String> map = DBMaker.newMemoryDirectDB().make().getTreeMap("words");
//put some stuff into map
map.put("aa", "bb");
map.put("cc", "dd");
My intuition is that you could take inspiration from the original MapReduce paradigm: partition your problem into several smaller but similar problems, and then aggregate the partial results to reach the complete solution.
If you solve one smaller problem instance at a time (i.e. one file chunk), this guarantees that the space consumption penalty is bounded by the space requirements of that single instance.
This approach of processing the file lazily will work regardless of the data structure you choose.
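A sketch of the aggregation step under that approach, assuming each chunk produces its own feature-count map (chunkResults is a placeholder name):
// assumes java.util imports; chunkResults holds one Map<String, Integer> per processed chunk
Map<String, Integer> globalDf = new HashMap<>();
for (Map<String, Integer> chunkDf : chunkResults) {
    for (Map.Entry<String, Integer> e : chunkDf.entrySet()) {
        // insert the count if the feature is new, otherwise add it to the existing count
        globalDf.merge(e.getKey(), e.getValue(), Integer::sum);
    }
}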
Java's BitSet is held entirely in memory and has no compression.
Say I have 1 billion entries in a bit map: that occupies 125 MB of memory.
Say I have to do AND and OR operations on 10 such bit maps: that takes 1250 MB (about 1.25 GB) of memory, which is unacceptable.
How can I do fast operations on such bit maps without holding them uncompressed in memory?
I do not know the distribution of the bits in the bit set.
I have also looked at JavaEWAH, a variant of the Java BitSet class that uses run-length encoding (RLE) compression.
Is there any better solution?
One solution is to keep the arrays off the heap.
You'll want to read this answer by @PeterLawrey to a related question.
In summary, the performance of memory-mapped files in Java is quite good, and it avoids keeping huge collections of objects on the heap.
The operating system may limit the size of an individual memory-mapped region. It's easy to work around this limitation by mapping multiple regions. If the regions are a fixed size, simple binary operations on the entity's index can be used to find the corresponding memory-mapped region in the list of memory-mapped files.
Are you sure you need compression? Compression trades time for space. It's possible that the reduced I/O ends up saving you time, but it's also possible that it won't. Can you add an SSD?
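A rough sketch of that layout: one file mapped as a list of fixed-size regions, with the region index and byte offset derived from the bit index. The class name and the 1 GiB region size are arbitrary choices made for this sketch.
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

class MappedBitStore {
    private static final long REGION_BYTES = 1L << 30;   // 1 GiB per mapped region
    private final List<MappedByteBuffer> regions = new ArrayList<>();

    MappedBitStore(Path file, long totalBytes) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // map the file as a sequence of fixed-size regions
            for (long pos = 0; pos < totalBytes; pos += REGION_BYTES) {
                long size = Math.min(REGION_BYTES, totalBytes - pos);
                regions.add(ch.map(FileChannel.MapMode.READ_WRITE, pos, size));
            }
        }
    }

    boolean get(long bitIndex) {
        long byteIndex = bitIndex >>> 3;                               // which byte holds this bit
        MappedByteBuffer region = regions.get((int) (byteIndex / REGION_BYTES));
        int offset = (int) (byteIndex % REGION_BYTES);
        return (region.get(offset) & (1 << (bitIndex & 7))) != 0;      // test the bit in place
    }
}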
If you haven't yet tried memory-mapped files, start with that. I'd take a close look at implementing something on top of Peter's Chronicle.
If you need more speed you could try doing your binary operations in parallel.
If you end up needing compression you could always implement it on top of Chronicle's memory mapped arrays.
From the comments here, what I would say as a complement to your initial question:
the bit-field distribution is unknown, so BitSet is probably the best we can use
you have to use the bit fields in different modules and want to cache them
That being said, my advice would be to implement a dedicated cache solution, using a LinkedHashMap with access order if LRU is an acceptable eviction strategy, and keeping permanent storage on disk for the BitSets.
Pseudo code :
// assumes java.util imports (BitSet, LinkedHashMap, Map)
class BitSetHolder {
    private final int size;                 // max number of BitSets kept in memory
    private final BitSetCache bitSetCache = new BitSetCache();

    class BitSetCache extends LinkedHashMap<Integer, BitSet> {
        BitSetCache() {
            super(16, 0.75f, true);         // access order => LRU eviction
        }
        @Override
        protected boolean removeEldestEntry(Map.Entry<Integer, BitSet> eldest) {
            return size() > BitSetHolder.this.size;   // size is known in BitSetHolder
        }
    }

    BitSetHolder(int size) {
        this.size = size;
    }

    BitSet get(int i) {                     // from cache if present, otherwise from disk
        BitSet bitSet = bitSetCache.get(i);
        if (bitSet == null) {
            bitSet = readFromDisk(i);       // permanent storage on disk
            bitSetCache.put(i, bitSet);
        }
        return bitSet;
    }

    private BitSet readFromDisk(int i) {
        // load the i-th BitSet from its file; left as a stub here
        throw new UnsupportedOperationException("not implemented");
    }
}
That way :
you have transparent access to your 10 bit sets
you keep in memory the most recently accessed bit sets
you limit the memory to the size of the cache (the minimum size should be 3 if you want to create a bit set by combining 2 others)
If this is an option for your requirements, I could develop it a little more. In any case, this is adaptable to other eviction strategies, LRU being the simplest as it is native in LinkedHashMap.
The best solution depends a great deal on the usage patterns and structure of the data.
If your data has some structure beyond a raw bit blob, you might be able to do better with a different data structure. For example, a word list can be represented very efficiently in both space and lookup time using a DAG.
Sample Directed Graph and Topological Sort Code
BitSet is internally represented as a long[], which makes it slightly more difficult to refactor. If you grab the source out of OpenJDK, you'd want to rewrite it so that internally it uses iterators backed by either files or in-memory compressed blobs. You would have to rewrite all the loops in BitSet to use those iterators, so the entire blob never has to be instantiated.
http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/BitSet.java
I have a huge dataset of UTF-8 strings to process, and I need to eliminate duplicates in order to obtain a unique set of strings.
I'm using a hash set to check whether a string is already known, but now that I have reached 100,000,000 strings I no longer have enough RAM and the process crashes. Moreover, I have only processed 1% of the dataset, so an in-memory solution is impossible.
What I would like is a hybrid solution, something like an "in-memory index" plus "disk-based storage", so I can use the 10 GB of RAM I have to speed up the process.
=> Do you know of a Java library already doing this? If not, which algorithm should I look at?
Using a Bloom filter in memory to check whether a string is not present could be a solution, but I would still have to check the disk sometimes (false positives), and I would like to hear about different solutions.
=> How should I store the strings on disk to get fast read and write access?
_ I don't want to use an external service like a NoSQL DB or MySQL; it must be embedded.
_ I have already tried file-based lightweight SQL DBs like H2 or HSQLDB, but they are very bad at handling massive datasets.
_ I don't consider Trove/Guava collections a solution (unless they offer a disk-based option I'm not aware of); I'm already using an extremely memory-efficient custom hash set, and I don't even store Strings but byte[] in memory. I have already tweaked -Xmx and related JVM settings.
EDIT: The dataset I'm processing is huge; the raw unsorted dataset doesn't even fit on my hard disk. I'm streaming it byte by byte and processing it.
What you could do is use an external sorting technique, such as an external merge sort, to sort your data first.
Once that is done, you can iterate through the sorted data while keeping the last element you have encountered. You then compare the current item with it: if they are the same, you move on to the next item; if not, you add the current item to the unique set and update the last element.
To avoid huge memory consumption, you can dump your list of unique items to disk whenever a particular threshold is reached and keep going.
Long story short:
let data be the data set you need to work with
let sorted_data = external_merge_sort(data)
let last_data = none
let unique_items be the set of unique items you want to yield
foreach element e in sorted_data
{
    if (e != last_data)
    {
        last_data = e
        add e to unique_items
        if (size(unique_items) == threshold)
        {
            dump_to_drive(unique_items)
            clear(unique_items)
        }
    }
}
dump_to_drive(unique_items)   // flush whatever remains
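For what it's worth, here is how that deduplication pass over the already-sorted data might look in Java. The file names are placeholders, and the external merge sort itself (sort chunks, then k-way merge them) is assumed to have been done beforehand.
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class SortedDeduplicator {
    public static void main(String[] args) throws IOException {
        Path sortedInput = Paths.get("sorted.txt");     // produced by the external merge sort
        Path uniqueOutput = Paths.get("unique.txt");

        try (BufferedReader in = Files.newBufferedReader(sortedInput, StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(uniqueOutput, StandardCharsets.UTF_8)) {
            String last = null;
            String line;
            while ((line = in.readLine()) != null) {
                if (!line.equals(last)) {               // duplicates are adjacent after sorting
                    out.write(line);
                    out.newLine();
                    last = line;
                }
            }
        }
    }
}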
What is the total data size you have? If it is not in the terabytes, and supposing you can use, say, 10 machines, I would suggest an external cache like memcached (spymemcached is a good Java client for memcached).
Install memcached on the 10 nodes. The spymemcached client should be initialized with the list of memcached servers, so that they become a virtual cluster for our program.
For each string you read:
    if it is already in memcache:
        continue with the next string
    else:
        add it to memcache
        add it to the list of strings to be flushed to disk
        if the size of the list of strings to be flushed > some threshold:
            flush them to disk
Finally, flush any remaining strings to disk.
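A rough sketch of that membership check with spymemcached: add() only stores a value when the key is not yet present, so its result tells you whether the string is new. The server addresses, the expiry of 0, and the class name are placeholders, and error handling is omitted.
import net.spy.memcached.AddrUtil;
import net.spy.memcached.MemcachedClient;

class MemcacheDeduper {
    private final MemcachedClient client;

    MemcacheDeduper(String servers) throws java.io.IOException {
        // e.g. "node1:11211 node2:11211 ..." for the 10 nodes
        this.client = new MemcachedClient(AddrUtil.getAddresses(servers));
    }

    // add() succeeds only if the key is not already stored, so it doubles as a membership test.
    // Assumes the string is a valid memcached key (<= 250 bytes, no whitespace); hash it first otherwise.
    boolean isNew(String s) throws Exception {
        return client.add(s, 0, Boolean.TRUE).get();
    }
}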
Another approach is to use a kind of map-reduce :), without Hadoop :)
Deduplicate the first 2 GB of strings and write out the deduplicated result to an intermediate file.
Repeat the above step with the next 2 GB of strings, and so on.
Now apply the same method to the intermediate deduplicated files.
When the total size of the intermediate deduplicated data is small enough, use memcache or an internal HashMap to produce the final output.
This approach doesn't involve sorting and hence may be efficient.
In some previous posts I asked questions about coding a custom hash map/table in Java. Since I couldn't solve it, and maybe I failed to properly explain what I really want, I am summarizing all of it here to make it clear and precise.
What I am going to do:
I am trying to write code for our server in which I have to find a user's access type by URL.
Now, I have 1110 million URLs (approx).
So, what we did:
1) Divided the database into 10 parts, each of 110 million URLs.
2) Built a HashMap using parallel arrays whose keys are one part of the URL (represented as a long) and whose values are the other part of the URL (represented as an int); a key can have multiple values.
3) Then searched the HashMap for some other URLs (millions of URLs saved in one day) each day, at the beginning, when the system starts.
What I have tried:
1) I have tried many NoSQL databases, but we found them not so good for our purpose.
2) I have built a custom hash map (using two parallel arrays) for that purpose.
So, what the issue is:
When the system starts, we have to load the hash table of each database and perform searches for millions of URLs:
Now, the issue is:
1) Though the HashTable performance is quite nice, the code takes more time while loading the HashTable (we are using FileChannel and a memory-mapped buffer to load it, which takes 20 seconds for a HashTable of 220 million entries; with a load factor of 0.5 we found it fastest).
So we are spending: (HashTable load + HashTable search) * no. of DBs = (20 + 5) * 10 = 250 seconds, which is quite expensive for us, and most of the time (200 out of 250 sec) goes to loading the hash tables.
Have I thought of any other way:
One way could be:
Don't worry about loading and storing, and leave the caching to the operating system by using a memory-mapped buffer. But, as I have to search for millions of keys, it gives worse performance than the above.
As we found the HashTable performance is nice but the loading time is high, we thought to cut it down in another way, like:
1) Create an array of linked lists of size Integer.MAX_VALUE (my own custom linked list).
2) Insert values (ints) into the linked list whose index is the key number (we reduce the key size to an int).
3) So, we have to store only the linked lists on disk.
Now, the issue is that creating such a huge number of linked lists takes a lot of time, and creating that many linked lists has no meaning if the data is not well distributed.
So, what are my requirements:
Simply, my requirements:
1) Keys with multiple values: insertion and searching, with good search performance.
2) A fast way to load (especially) into memory.
(Keys are 64-bit ints and values are 32-bit ints; one key can have at most 2-3 values. We could also make the keys 32-bit, but that would give more collisions, which is acceptable for us if we can make the whole thing better.)
Can anyone help me solve this, or comment on how to solve this issue?
Thanks.
NB:
1) As per previous suggestions on Stack Overflow, pre-reading the data to warm the disk cache is not possible, because our application starts working as soon as the system starts, and the system is restarted the next day.
2) We have not found that NoSQL DBs scale well for us, and our requirements are simple (just insert hashtable keys and values, then load and search (retrieve values)).
3) As our application is part of a small project to be deployed on a small campus, I don't think anybody will buy me an SSD for it. That is my limitation.
4) We also used Guava/Trove, but they were not able to store such a large amount of data even in 16 GB (we are using a 32 GB Ubuntu server).
If you need quick access to 1110 million data items then hashing is the way to go. But don't reinvent the wheel; use something like:
memcacheDB: http://memcachedb.org
MongoDB: http://www.mongodb.org
Cassandra: http://cassandra.apache.org
It seems to me (if I understand your problem correctly) that you are trying to approach the problem in a convoluted manner.
I mean, the data you are trying to pre-load is huge to begin with (let's say 220 million * 64 bytes ~ 14 GB), and you are trying to memory-map it, etc.
I think this is a typical problem that is solved by distributing the load across different machines. That is, instead of trying to locate the linked-list index, you should figure out the index of the machine onto which that specific part of the map has been loaded, and get the value from that machine (each machine has loaded part of this database map, and you get the data from the appropriate part of the map, i.e. machine, each time).
Maybe I am way off here, but I also suspect you are using a 32-bit machine.
So if you have to stay with a single-machine architecture, and it is not economically possible to improve your hardware (a 64-bit machine and more RAM, or an SSD as you point out), I don't think you can make any dramatic improvement.
I don't really understand in what form you are storing the data on disk. If what you are storing consists of URLs and some numbers, you might be able to speed up loading from disk quite a bit by compressing the data (unless you are already doing that).
Creating a multithreaded loader that decompresses while loading might give you quite a big boost.
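As an illustration of that idea, here is a sketch of a loader that streams gzip-compressed (long key, int value) pairs from disk. The on-disk format and the callback interface are assumptions made for the sketch; the decompression loop could be moved to a separate thread feeding a queue if you want the multithreaded variant.
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;

public class CompressedLoader {
    @FunctionalInterface
    public interface LongIntConsumer {
        void accept(long key, int value);
    }

    // Reads gzip-compressed (long key, int value) pairs and hands them to the sink.
    public static void load(Path file, LongIntConsumer sink) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new GZIPInputStream(new BufferedInputStream(Files.newInputStream(file))))) {
            while (true) {
                long key;
                try {
                    key = in.readLong();
                } catch (EOFException eof) {
                    break;                       // end of file reached
                }
                int value = in.readInt();
                sink.accept(key, value);         // hand the pair to the in-memory structure
            }
        }
    }
}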