Deduplicating and indexing a huge set of Strings in Java

I have a huge dataset of UTF-8 strings to process, and I need to eliminate duplicates in order to obtain a unique set of strings.
I'm using a HashSet to check whether a string is already known, but now that I have reached 100,000,000 strings I no longer have enough RAM and the process crashes. Moreover, I have only processed 1% of the dataset, so an in-memory solution is impossible.
What I would like is a hybrid solution combining an "in-memory index" with "disk-based storage", so I could use the 10 GB of RAM I have to speed up the process.
=> Do you know of a Java library that already does this? If not, which algorithm should I look into?
Using a Bloom filter in memory to check whether a string is absent could be a solution, but I would still have to check the disk sometimes (false positives), and I would like to hear about other solutions.
=> How should I store the strings on disk to get fast read and write access?
_ I don't want to use an external service like a NoSQL database or MySQL; it must be embedded.
_ I already tried file-based lightweight SQL databases like H2 or HSQLDB, but they are very bad at handling massive datasets.
_ I don't consider Trove/Guava collections a solution (unless they offer a disk-based option I'm not aware of); I'm already using an extremely memory-efficient custom hash set, and I don't even store String but byte[] in memory. I have already tweaked the -Xmx settings for the JVM.
EDIT: The dataset I'm processing is huge; the raw unsorted dataset doesn't even fit on my hard disk. I'm streaming it byte by byte and processing it.

What you could do is use an external sorting technique, such as an external merge sort, to sort your data first.
Once that is done, you can iterate through the sorted data while keeping track of the last element you encountered. Compare the current item with that last element: if they are the same, move on to the next item; if not, emit the current item and update the last element you are holding.
To avoid huge memory consumption, you can dump your list of unique items to the hard drive whenever a particular threshold is reached and keep going.
Long story short:

Let data be the data set you need to work with
Let sorted_data = External_Merge_Sort(data)
Data_Element last_data = {}
Let unique_items be the set of unique items you want to yield

foreach element e in sorted_data
{
    if (e != last_data)
    {
        last_data = e
        add e to unique_items
        if (size(unique_items) == threshold)
        {
            dump_to_drive(unique_items)
        }
    }
}
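A minimal Java sketch of that second pass, assuming the externally sorted data has already been written to a file with one string per line (the file names sorted.txt and unique.txt are placeholders):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class DedupSorted {
    public static void main(String[] args) throws IOException {
        // Because the input is sorted, duplicates are adjacent, so one streaming
        // pass with O(1) memory is enough to collapse them.
        try (BufferedReader in = Files.newBufferedReader(Paths.get("sorted.txt"), StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(Paths.get("unique.txt"), StandardCharsets.UTF_8)) {
            String previous = null;
            String line;
            while ((line = in.readLine()) != null) {
                if (!line.equals(previous)) {   // first occurrence of this value
                    out.write(line);
                    out.newLine();
                    previous = line;
                }
            }
        }
    }
}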

What is the total size of your data? If it is not in terabytes, and supposing you can use, say, 10 machines, I would suggest an external cache like memcached (spymemcached is a good Java client for memcached).
Install memcached on the 10 nodes. The spymemcached client should be initialized with the list of memcached servers so that they act as a virtual cluster for our program.
For each string you read:
    check if it is already in memcached
    if it is in memcached:
        continue with the next string
    else:
        add it to memcached
        add it to the list of strings to be flushed to disk
        if the size of the list of strings to be flushed > certain threshold:
            flush them to disk
flush any remaining strings to disk
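A rough sketch of that loop using spymemcached; the node addresses, the output file name, and the source of the strings are placeholders, not something from the original question:

import java.io.BufferedWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import net.spy.memcached.AddrUtil;
import net.spy.memcached.MemcachedClient;

public class MemcachedDedup {
    public static void dedup(Iterable<String> strings) throws Exception {
        // List every memcached node here; spymemcached treats them as one virtual cluster.
        MemcachedClient cache = new MemcachedClient(
                AddrUtil.getAddresses("node1:11211 node2:11211"));
        try (BufferedWriter out = Files.newBufferedWriter(
                Paths.get("unique.txt"), StandardCharsets.UTF_8)) {
            for (String s : strings) {
                // Note: memcached keys are limited to 250 bytes and may not contain
                // whitespace, so long strings should be hashed before use as keys.
                // add() only succeeds when the key is not already stored, so a true
                // result means this is the first occurrence of the string.
                if (cache.add(s, 0, Boolean.TRUE).get()) {
                    out.write(s);
                    out.newLine();
                }
            }
        } finally {
            cache.shutdown();
        }
    }
}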
Another approach is to use some kind of map-reduce :), without Hadoop :)
Deduplicate the first 2 GB of strings and write the de-duplicated result out to an intermediate file.
Repeat the above step with the next 2 GB of strings, and so on.
Now apply the same method to the intermediate de-duplicated files.
When the total size of the intermediate de-duplicated data is small enough, use memcached or an in-memory HashMap to produce the final output.
This approach doesn't involve sorting and hence may be efficient.
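A sketch of a single de-duplication pass along those lines (the chunk limit, directory layout, and intermediate-file naming are placeholders); re-running the same pass over the concatenated intermediate files keeps shrinking the data:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

public class ChunkedDedup {

    // One pass: removes duplicates that occur within the same in-memory chunk.
    static int dedupPass(Path input, Path outputDir, int chunkLimit) throws IOException {
        Set<String> seen = new HashSet<>();
        int fileIndex = 0;
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                seen.add(line);
                if (seen.size() >= chunkLimit) {   // roughly "2 GB worth" of strings
                    writeChunk(seen, outputDir.resolve("chunk-" + fileIndex++ + ".txt"));
                    seen.clear();
                }
            }
        }
        if (!seen.isEmpty()) {
            writeChunk(seen, outputDir.resolve("chunk-" + fileIndex++ + ".txt"));
        }
        return fileIndex;   // number of intermediate files produced
    }

    private static void writeChunk(Set<String> chunk, Path file) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            for (String s : chunk) {
                out.write(s);
                out.newLine();
            }
        }
    }
}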

Related

How to optimize the memory usage for large file processing

I have a file, and from that file I am populating a HashMap<String, ArrayList<Objects>>. The HashMap will have exactly 25 keys, but each list will be huge, say a million records per key.
What I do now is, for each key, retrieve the list of records and process them in parallel using threads. Things went well until I hit a larger file, and now I am facing "java.lang.OutOfMemoryError: Java heap space".
What is a better way than populating the HashMap with the lists of objects? What I am thinking is to record 25 offsets into the file and, instead of putting the lines I read into the ArrayList, store the file offsets and give each thread an iterator that iterates from its start offset to its end offset. I still have to try this idea, but before I do, I would like to know of any better ways to optimize memory usage.
I will populate the HashMap<String, ArrayList<Objects>>
After populating the HashMap, what do you need to do with it? I believe that just populating the Map is not your task. Whatever the scenario, you don't need to read the whole file into memory.
Increasing the heap size may not be a good solution as someday you may get a file even bigger than your heap size.
Read the file in chunks using a BufferedReader or BufferedInputStream, depending on your needs, and do your work as you read. Both APIs only keep part of the file in memory at a time.
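As a small illustrative sketch (the file name and the per-line processing are placeholders), reading line by line keeps only the reader's buffer and the current line in memory:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class StreamingProcessor {
    public static void main(String[] args) throws IOException {
        // Only the reader's internal buffer plus the current line live in memory
        // at any time, regardless of the file's total size.
        try (BufferedReader reader = Files.newBufferedReader(
                Paths.get("input.txt"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                process(line);   // do the per-record work here instead of collecting everything
            }
        }
    }

    private static void process(String line) {
        // hypothetical placeholder for the actual per-record logic
    }
}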
I read from file into the arrayList, put the offset of the file and give each thread an iterator to iterate from its start offset to end offset. I still have to try this thought.
Using multiple threads will not prevent java.lang.OutOfMemoryError, because all the threads will be in the same JVM. Furthermore, whether you read the file into one list or multiple lists, all the data from the file will end up in the same heap memory.
If you mention what you actually want to do with the data from the file, this answer can be more specific.
Ditto what ares said: we need more information. What do you plan on doing with the map? Is it an operation that requires the whole file to be loaded into memory, or can it be done in parts?
Also, have you considered splitting the file into parts once its size surpasses a threshold?
Like Pshemo's answer here: How to break a file into pieces using Java?
Also, if you want to process in parallel, you could consider building a map that covers only part of the file. Process that map in parallel and store the results in a queue of some sort, provided the queue only ever contains a subset of the data you are processing (to avoid OutOfMemory errors).

Sort a huge file in Java

I have a huge file with a unique word on each line. The file is around 1.6 GB (after this I have to sort other files that are around 15 GB). Until now, for smaller files I used Arrays.sort(), but for this file I get java.lang.OutOfMemoryError: Java heap space. I know the reason for this error. Is there any way to do this without writing a complete quicksort or merge sort program myself?
I read that Arrays.sort() uses quicksort or a hybrid sort internally. Is there any ready-made procedure like Arrays.sort() for this?
If I do have to write a sorting program, which algorithm should I use, quicksort or merge sort? I'm worried about the worst case.
Depending on the structure of the data to store, you can do many different things.
In the case of well-structured data, where you need to sort by one or more specific fields (and system tools might not be helpful), you are probably better off using a datastore that allows sorting. MongoDB comes to mind as a good fit for this, given that the size doesn't exceed a few hundred GB. Other NoSQL datastores might also fit the bill nicely, although Mongo's simplicity of use and installation and its support for JSON data make it a really strong candidate.
If you really want to go with the Java approach, it gets really tricky. This is the kind of question you ask at job interviews, and I would never actually expect anybody to implement the code. However, the general solution is merge sort (using random access files is a bad idea because it amounts to insertion sort, i.e., non-optimal running time, which can be bad given the size of your file).
By merge sort I mean reading one chunk of the file at a time, small enough to fit in memory (so it depends on how much RAM you have), sorting it, and then writing it back to a new file on disk. After you have read the whole file, you can start merging the chunk files two at a time by reading just the head of each and writing the smaller of the two records back to a third file. Do that for the 'first generation' of files and then continue with the second one, until you end up with one big sorted file. Note that this is basically a bottom-up way of implementing merge sort; the academic recursive algorithm is the top-down approach.
Note that the intermediate merge generations can be avoided altogether by using a multiway merge algorithm. This is typically based on a heap/priority queue, so the implementation may get slightly more complex, but it reduces the number of I/O operations required.
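A sketch of such a multiway merge using a PriorityQueue, assuming the chunk files are already individually sorted (file handling is kept deliberately simple):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class KWayMerge {

    // Pairs a reader with its current head line so the heap can order the chunks.
    private static final class Head {
        final String line;
        final BufferedReader reader;
        Head(String line, BufferedReader reader) { this.line = line; this.reader = reader; }
    }

    // Merges already-sorted chunk files into one sorted output in a single pass.
    static void merge(List<Path> sortedChunks, Path output) throws IOException {
        PriorityQueue<Head> heap = new PriorityQueue<>(Comparator.comparing((Head h) -> h.line));
        for (Path chunk : sortedChunks) {
            BufferedReader reader = Files.newBufferedReader(chunk, StandardCharsets.UTF_8);
            String first = reader.readLine();
            if (first != null) heap.add(new Head(first, reader));
            else reader.close();
        }
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            while (!heap.isEmpty()) {
                Head smallest = heap.poll();          // globally smallest remaining line
                out.write(smallest.line);
                out.newLine();
                String next = smallest.reader.readLine();
                if (next != null) heap.add(new Head(next, smallest.reader));
                else smallest.reader.close();         // this chunk is exhausted
            }
        }
    }
}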
Implementing the above in Java shouldn't be too difficult with some careful design, although it can definitely get tricky. I still highly recommend an out-of-the-box solution like Mongo.
As it turns out, your problem is that your heap cannot accommodate such a large array, so you must forget any solution that implies loading the whole file content into an array (as long as you can't grow your heap).
So you're left with streaming. It's the only (and typical) solution when you have to handle input sources larger than your available memory. I would suggest streaming the file content into your program, which should perform the sorting by writing either to a random access file (trickier) or to a database.
I'd take a different approach.
Given a file, say with a single element per line, I'd read the first n elements. I would repeat this m times, such that the number of lines in the file is n * m + C, with C being the left-over lines.
When dealing with integers, you may wish to use around 100,000 elements per read; with strings I would use fewer, maybe around 1,000. It depends on the data type and the memory needed per element.
From there, I would sort those n elements and write them to a temporary file with a unique name.
Now, since every file is sorted, the smallest elements will be at the start of each. You can then iterate over the files, repeatedly finding the smallest current element and writing it to the final output, until all elements have been processed.
This approach reduces the amount of RAM needed, relying on drive space instead, and allows you to sort files of any size.
Build an array of record positions inside the file (a kind of index); maybe that will fit into memory instead. You need an 8-byte Java long per file record. Sort the array, loading records only for comparison and not retaining them (use RandomAccessFile). After sorting, write the new final file using the index pointers to fetch the records in the required order.
This will also work if the records are not all the same size.
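A rough sketch of that index-based approach (the file names are placeholders; note that RandomAccessFile.readLine() does not decode UTF-8, and re-reading records during comparisons is I/O-heavy):

import java.io.BufferedWriter;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class OffsetIndexSort {
    public static void main(String[] args) throws IOException {
        // Only the record offsets (boxed here for simplicity) are kept in memory.
        try (RandomAccessFile raf = new RandomAccessFile("input.txt", "r")) {
            List<Long> offsets = new ArrayList<>();
            long pos = raf.getFilePointer();
            while (raf.readLine() != null) {
                offsets.add(pos);
                pos = raf.getFilePointer();
            }
            // Compare records by re-reading them at their offsets: slow (two seeks
            // per comparison) but very cheap on memory.
            offsets.sort((a, b) -> readAt(raf, a).compareTo(readAt(raf, b)));
            try (BufferedWriter out = Files.newBufferedWriter(
                    Paths.get("sorted.txt"), StandardCharsets.UTF_8)) {
                for (long offset : offsets) {
                    out.write(readAt(raf, offset));
                    out.newLine();
                }
            }
        }
    }

    private static String readAt(RandomAccessFile raf, long offset) {
        try {
            raf.seek(offset);
            return raf.readLine();   // readLine() decodes bytes as Latin-1, not UTF-8
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}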

Large-scale processing of serialized Integer objects

I have a large data set in the following format:
In total, there are 3,687 object files, each of which contains 2,000,000 records. Each file is 42 MB in size.
Each record contains the following:
An id (Integer value)
Value1 (Integer)
Value2 (Integer)
Value3 (Integer)
The content of each file is not sorted or ordered in any way as they are observed during a data collection process.
Ideally, I want to build an index for this data (indexed by the id), which would mean the following:
Divide the set of ids into manageable chunks.
Scan the files to get the data related to the current working set of ids.
Build the index.
Go over the next chunk and repeat steps 1, 2, and 3.
To me this sounds fine, but loading 152 GB back and forth is time-consuming, and I wonder about the best possible approach, or even whether Java is actually the right language for such a process.
I have 256 GB of RAM and 32 cores on my machine.
Update:
Let me modify this: putting aside I/O, and assuming a file is already in memory as a byte array, what would be the fastest possible way to decode a 42 MB object file that has 2,000,000 records, each containing 4 serialized Integers?
You've made a very poor choice of file format. I would convert the lot from serialized Integers to binary ints written with DataOutputStream.writeInt() and read back with DataInputStream.readInt(), with buffered streams underneath in both cases. You will save masses of disk space, which will save you I/O time as well, and you will also save all the serialization overhead. And change your collection software to use this format in the future. The conversion will take a while, but it only happens once.
Or else use a database as suggested, again with native ints rather than serialized objects.
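A possible conversion sketch, assuming the original files were written with ObjectOutputStream as a plain sequence of Integer objects (four per record); adjust the reading side to match however your collection software actually serialized them:

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;

public class ConvertToRawInts {

    static void convert(String inFile, String outFile) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                 new BufferedInputStream(new FileInputStream(inFile)));
             DataOutputStream out = new DataOutputStream(
                 new BufferedOutputStream(new FileOutputStream(outFile)))) {
            try {
                while (true) {
                    // Each record becomes 16 bytes: four raw ints, no serialization headers.
                    for (int i = 0; i < 4; i++) {
                        out.writeInt((Integer) in.readObject());
                    }
                }
            } catch (EOFException endOfStream) {
                // reached the end of the serialized stream
            }
        }
    }
}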
So, what I would do is just load up each file and store the ids in some sort of sorted structure, std::map perhaps [or Java's equivalent, but given that it's probably about 10-20 lines of code to read in the filename, read the contents of the file into a map, close the file, and ask for the next file, I'd probably just write the C++ to do that].
I don't really see what else you can or should do, unless you actually want to load it into a DBMS, which I don't think is an unreasonable suggestion at all.
Hmm... it seems the better way of doing this is to use some kind of DBMS. Load all your data into a database and you can leverage its indexing, storage, and querying facilities. Of course, this depends on your requirements, and whether or not a DBMS solution suits them.
Given that your available memory is larger than your dataset and you want very high performance, have you considered Redis? It's well suited to operations on simple data structures, and its performance is very fast.
Just be a bit careful about letting Java do default serialization when storing values; I've previously run into issues with my primitives getting autoboxed prior to serialization.

How can I load a massive data file in Java? Any structure better than hashtables?

I am trying to load a big file of 14 million lines into hashtables in memory. Each line contains three numbers (n, m, v), where:
n: is the id of a user (an object)
m: is the id of an item (an object)
v: is the rate that user n gives to the item m.
Each user n has a hashtable<item, rate> to store the items that the user rates,
and each item has a hashtable<user, rate> to store the users that rated this item.
On my machine I cannot load this structure into memory, so I get a heap memory error every time.
I tried replacing the hashtables with HashBasedTable, which allows two keys for each value, but that did not solve the problem; in addition, HashBasedTable made my program much slower.
Is there any way to load this mass of data?
14 million lines of three numbers each doesn't sound like a massive data array.
It is approximately 14M * (3 + 1) * 8 ≈ 450 MB of memory.
Just make sure you set the -Xmx setting to a big enough value (e.g. -Xmx1024m, which allows the JVM to allocate up to 1 GB of RAM).
P.S. I would suggest HashMap instead of Hashtable, though.
I suggest that you represent each rated item's users and each user's rated items using ArrayList<User> and ArrayList<Item> respectively. That will save a lot of space.
Admittedly, some operations will now be O(N) but that is only a problem if N gets large. (And if it does, consider a hybrid where you use ArrayList for small relations and HashMap for large ones.)
Suggestion #2 - use plain arrays, and keep them sorted so that you can implement lookup using binary search (see the sketch after these suggestions). This is more code-intensive (i.e. more complicated), but it will give you better memory usage than the Collection types.
Suggestion #3 - Use a database. It will scale better.
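A small sketch of suggestion #2, with assumed element types (int item ids and int rates) and one structure per user:

import java.util.Arrays;

// A user's ratings held as two parallel arrays sorted by item id, which is
// far more compact than one HashMap entry per rating.
class UserRatings {
    private final int[] itemIds;  // sorted ascending
    private final int[] rates;    // rates[i] is the rate for itemIds[i]

    UserRatings(int[] sortedItemIds, int[] rates) {
        this.itemIds = sortedItemIds;
        this.rates = rates;
    }

    // Returns the rate this user gave to the item, or -1 if the item was never rated.
    int rateFor(int itemId) {
        int pos = Arrays.binarySearch(itemIds, itemId);  // O(log n) lookup
        return pos >= 0 ? rates[pos] : -1;
    }
}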
I do not think it depends on the data structure you use. You simply cannot load that much data into RAM; you would have to process the file line by line and execute your logic as you go.
I'm a little unclear as to your access pattern, but it sounds like you probably want to use a single big table instead of one per user and per item. Especially if your data is very sparse (only a few items per user or vice versa), you will be wasting a lot of space due to the initial capacity of the hashtables (you could try lowering the initial capacity and/or raising the load factor if you wanted to keep your current organization).
Build a pair object (user id, item id) to use as the key for a single big hashtable, as in the sketch below. If you need enumeration (i.e. to list all items for a user or vice versa), keep ArrayLists of that data and call trimToSize(); that is much more compact than a hashtable.
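A sketch of such a pair key (class and field names are illustrative):

import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Key for the single big table: one entry per (user, item) pair.
final class UserItem {
    final int userId;
    final int itemId;

    UserItem(int userId, int itemId) {
        this.userId = userId;
        this.itemId = itemId;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof UserItem)) return false;
        UserItem other = (UserItem) o;
        return userId == other.userId && itemId == other.itemId;
    }

    @Override
    public int hashCode() {
        return Objects.hash(userId, itemId);
    }
}

class Ratings {
    private final Map<UserItem, Integer> rates = new HashMap<>();

    void put(int userId, int itemId, int rate) {
        rates.put(new UserItem(userId, itemId), rate);
    }

    Integer get(int userId, int itemId) {
        return rates.get(new UserItem(userId, itemId));   // null if the pair was never rated
    }
}

If both ids fit in an int, packing them into a single long key, e.g. ((long) userId << 32) | (itemId & 0xFFFFFFFFL), avoids the key objects entirely and is even more compact.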

Reading Big File in Java

I have a Swing application that works on a CSV file. It reads the full file line by line, computes some required statistics, and shows the output.
The upper part of the output screen shows each record from the file, in order, in a JTable, whereas the lower part shows statistics computed from that data. The problem is that the JVM takes 4 times as much memory as the file size (while processing an 86 MB file, the heap uses 377 MB of space; memory utilization was checked using jVisualVM).
Note:
I have used LineNumberReader for reading the file (because of a specific requirement; I can change it if that helps with memory usage).
Every line is read with readLine(), and then split(",") is called on that String to get the individual fields of the record.
Each record is stored in a Vector for display in the JTable, while other statistics are stored in a HashMap and a TreeMap, and summary data in a JavaBean class. One graph is also plotted using JFreeChart.
Please suggest how to reduce memory utilization, as I need to process a 2 GB file.
Try giving OpenCSV a shot. It only stores the last read line when you use the readNext() method, which is perfect for large files.
From their website, the following are the features they support:
Arbitrary numbers of values per line
Ignoring commas in quoted elements
Handling quoted entries with embedded carriage returns (ie entries that span multiple lines)
Configurable separator and quote characters (or use sensible defaults)
Read all the entries at once, or use an Iterator style model
Creating csv files from String[] (ie. automatic escaping of embedded quote chars)
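A minimal sketch of the readNext() loop described above (the file name and the statistics callback are placeholders; recent OpenCSV releases live in the com.opencsv package, older ones in au.com.bytecode.opencsv):

import com.opencsv.CSVReader;
import java.io.FileReader;

public class CsvStats {
    public static void main(String[] args) throws Exception {
        // Only one parsed record is held in memory at a time.
        try (CSVReader reader = new CSVReader(new FileReader("data.csv"))) {
            String[] record;
            while ((record = reader.readNext()) != null) {
                updateStatistics(record);   // accumulate running totals instead of storing rows
            }
        }
    }

    private static void updateStatistics(String[] record) {
        // hypothetical placeholder: update counts, sums, min/max, etc.
    }
}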
Use best practices to improve your program:
Use multithreading to get better CPU utilization.
Set the minimum and maximum heap sizes to make better use of RAM.
Use proper data structures and design.
Every Java object has a memory overhead, so if your Strings are really short, that could explain why you get 4 times the size of your file. You also have to account for the size of the Vector and its internals. I don't think that a Map would improve memory usage, since Java Strings already try to point to the same address in memory whenever possible.
I think you should revise your design. Given your requirements
The upper part of the output screen shows each record from the file, in order, in a JTable, whereas the lower part shows statistics computed from that data
you don't need to store the whole file in memory. You need to read it entirely to compute your statistics, but this can certainly be done using a very small amount of memory. Regarding the JTable part, it can be accomplished in a number of ways without requiring 2 GB of heap space for your program! I think there must be something wrong when someone wants to keep a whole CSV in memory! See Apache Commons IO's LineIterator.
Increase the JVM heap size (-Xms and -Xmx). If you have the memory, this is the best solution. If you cannot do that, you will need to find a compromise that will be a combination of data model and presentation (GUI) changes, usually resulting in increased code complexity and potential for bugs.
Try modifying your statistics algorithms to do their work as the data is being read, rather than requiring that it all exist in memory (see the sketch at the end of this answer). You may find that algorithms which approximate the statistics are sufficient.
If your data contains many duplicate String values, use a HashSet as a cache. Beware: caches are notorious for causing memory leaks (e.g. when they are not cleared before loading a different file).
Reduce the amount of data being displayed on the graph. It is common for a graph with a lot of data to have many points displayed at or near the same pixel. Consider truncating the data by merging multiple values at or near the same position on the x-axis. If your data set contains 2,000,000 points, for example, most of them will coincide with other nearby points, so your underlying data model does not need to store everything.
Beware of information overload. Will your JTable be meaningful to the user if it contains 2 GB worth of data? Perhaps you should paginate the table, and read only 1,000 entries from the file at a time for display.
I'm hesitant to suggest this, but during the loading process, you could convert the CSV data into a file database (such as cdb). You could accumulate statistics and store some data for the graph during the conversion, and use the database to quickly read a page of data at a time for the JTable as suggested above.
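As a sketch of computing statistics while the file is being read (the file name, the column index, and the absence of a header row are assumptions):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class RunningStats {
    public static void main(String[] args) throws IOException {
        long count = 0;
        double sum = 0, min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
        try (BufferedReader reader = Files.newBufferedReader(
                Paths.get("data.csv"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // assumes the value of interest is in the third column and there is no header row
                double value = Double.parseDouble(line.split(",")[2]);
                count++;
                sum += value;
                min = Math.min(min, value);
                max = Math.max(max, value);
            }
        }
        // Only a handful of accumulators were ever held in memory.
        System.out.printf("count=%d mean=%.3f min=%.3f max=%.3f%n",
                count, sum / count, min, max);
    }
}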
