I have several large double and long arrays of 100k values each that need to be accessed for computation at a given time. Even with largeHeap requested, the Android OS doesn't give me enough memory and I keep getting OutOfMemory exceptions on most of the tested devices. So I went researching for ways to overcome this, and based on an answer I got from Waldheinz to my previous question I implemented an array that is backed by a file: I use a RandomAccessFile to get a channel to it, map it with a MappedByteBuffer as suggested, and use the MappedByteBuffer's asLongBuffer or asDoubleBuffer. This works perfectly; I completely eliminated the OutOfMemory exceptions. But the performance is very poor: I get lots of calls to get(some index) that take about 5-15 milliseconds each, and the user experience is ruined.
Some useful information:
I am using binary search on the arrays to find start and end indices, and then I have a linear loop from start to end.
I added a print statement for any get() call that takes more than 5 milliseconds to finish (printing the time it took, the requested index, and the previously requested index). It seems like all of the binary-search get requests were printed, and a few of the linear requests were too.
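For reference, the file-backed array described above can be sketched roughly like this (the file path and element count are placeholders, and error handling is omitted):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.DoubleBuffer;
import java.nio.channels.FileChannel;

// A double[] replacement backed by a memory-mapped file instead of the heap.
public class MappedDoubleArray {
    private final DoubleBuffer values;

    public MappedDoubleArray(String path, int count) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "rw")) {
            FileChannel ch = raf.getChannel();
            // Mapping stays valid even after the channel is closed.
            values = ch.map(FileChannel.MapMode.READ_WRITE, 0, (long) count * Double.BYTES)
                       .asDoubleBuffer();
        }
    }

    public double get(int index) { return values.get(index); }

    public void set(int index, double v) { values.put(index, v); }
}
```

Each get() here may fault in a 4 KB page from disk, which is where the 5-15 ms per access comes from when the access pattern jumps around, as binary search does.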
Any suggestions on how to make it go faster?
Approach 1
Index your data: add pointers for quick searching.
Split your sorted data into 1,000 buckets of 100 values each.
Maintain an in-memory index referencing each bucket's start and end.
The algorithm is to first find your bucket in this in-memory index (even a linear loop is fine for this) and then jump to that bucket in the memory-mapped file.
This results in a single jump over the file (a single bucket to find) and an iteration over at most 100 elements.
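Approach 1 can be sketched as below, assuming the sorted values are exposed as a LongBuffer (e.g. from MappedByteBuffer.asLongBuffer()); the bucket size is a tunable assumption:

```java
import java.nio.LongBuffer;

// Small in-memory index over a sorted, file-backed array of longs.
public class BucketIndex {
    private final LongBuffer data;     // memory-mapped sorted values
    private final long[] bucketFirst;  // first value of each bucket, kept in memory
    private final int bucketSize;

    public BucketIndex(LongBuffer data, int bucketSize) {
        this.data = data;
        this.bucketSize = bucketSize;
        int buckets = (data.capacity() + bucketSize - 1) / bucketSize;
        bucketFirst = new long[buckets];
        for (int b = 0; b < buckets; b++) {
            bucketFirst[b] = data.get(b * bucketSize); // one file touch per bucket, done once
        }
    }

    // Index of the first element >= key, touching at most one bucket of the file.
    public int lowerBound(long key) {
        int b = 0;
        while (b + 1 < bucketFirst.length && bucketFirst[b + 1] <= key) b++; // in-memory scan
        int end = Math.min((b + 1) * bucketSize, data.capacity());
        for (int i = b * bucketSize; i < end; i++) {
            if (data.get(i) >= key) return i; // at most bucketSize file reads
        }
        return end;
    }
}
```

The point is that the expensive random file accesses of the binary search are replaced by a scan of a small heap-resident array, leaving only one localized read of the mapped file.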
Approach 2
Use a lightweight embedded database, e.g. MapDB, which supports Android.
Related
I have a problem. I work in Java, in Eclipse. My program calculates some mathematical physics, and I need to draw an animation (with the Java SWT package) of the process (some hydrodynamics). The problem is 2D, so each iteration returns a two-dimensional array of numbers. One iteration takes a rather long time, and the time needed varies from one iteration to another, so showing pictures dynamically as the program runs seems like a bad idea. My idea instead was to store a three-dimensional array, where the third index represents time, and build the animation when the calculations are over. But in this case, as I want accuracy from my program, I need a lot of iterations, so the program easily reaches the maximal array size. So the question is: how do I avoid creating such an enormous array, or how do I avoid the limitations on array size? I thought about creating a special file to store the data and then reading from it, but I'm not sure about this. Do you have any ideas?
When I was working on a procedural architecture generation system at university for my dissertation, I created small, extremely easily read and parsed binary files for the calculated data. This meant that the data could be read back in within an acceptable amount of time, despite being quite a large amount of data...
I would suggest doing the same for your animations... It might be worth storing maybe five seconds of animation per file and then caching each of these just before it is required...
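A minimal sketch of such a binary frame file, assuming each frame is a double[rows][cols] grid (the header layout is an arbitrary choice):

```java
import java.io.*;

// Writes and reads a chunk of animation frames as a compact binary file.
public class FrameFile {
    public static void write(String path, double[][][] frames) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(path)))) {
            // Small header: frame count, rows, cols.
            out.writeInt(frames.length);
            out.writeInt(frames[0].length);
            out.writeInt(frames[0][0].length);
            for (double[][] frame : frames)
                for (double[] row : frame)
                    for (double v : row) out.writeDouble(v);
        }
    }

    public static double[][][] read(String path) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(path)))) {
            int t = in.readInt(), rows = in.readInt(), cols = in.readInt();
            double[][][] frames = new double[t][rows][cols];
            for (double[][] frame : frames)
                for (double[] row : frame)
                    for (int c = 0; c < row.length; c++) row[c] = in.readDouble();
            return frames;
        }
    }
}
```

The simulation would call write() every few seconds of animation, and the player would read() the next file while displaying the current one.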
Also, how large are your arrays? If it's maximum memory limitations you're hitting rather than the maximum array size, you could increase the amount of memory your JVM is able to allocate.
I hope this helps and isn't just my ramblings...
I have a huge file with a unique word on each line. The size of the file is around 1.6 GB (and I have to sort other files after this which are around 15 GB). Until now, for smaller files I used Arrays.sort(), but for this file I get java.lang.OutOfMemoryError: Java heap space. I know the reason for this error. Is there any way to avoid writing a complete quicksort or merge sort program myself?
I read that Arrays.sort() uses quicksort or a hybrid sort internally. Is there any ready-made procedure like Arrays.sort() for this?
If I have to write a program for sorting, which one should I use, quicksort or merge sort? I'm worried about the worst case.
Depending on the structure of the data to store, you can do many different things.
In the case of well-structured data, where you need to sort by one or more specific fields (in which case system tools might not be helpful), you are probably better off using a datastore that allows sorting. MongoDB comes to mind as a good fit for this, given that the size doesn't exceed a few hundred GB. Other NoSQL datastores might also fit the bill nicely, although Mongo's simplicity of use and installation and its support for JSON data make it a really great candidate.
If you really want to go with the Java approach, it gets really tricky. This is the kind of question you get asked at job interviews, and I would never actually expect anybody to implement the code. However, the general solution is merge sort (using random-access files is a bad idea, because it amounts to insertion sort, i.e., non-optimal run time, which can be bad given the size of your file).
By merge sort I mean reading one chunk of the file at a time, small enough to fit in memory (so it depends on how much RAM you have), sorting it, and then writing it back to a new file on disk. After you have read the whole file, you can start merging the chunk files two at a time by reading just the head of each and writing the smaller of the two records back to a third file. Do that for the 'first generation' of files, and then continue with the second one, until you end up with one big sorted file. Note that this is basically a bottom-up way of implementing merge sort, the academic recursive algorithm being the top-down approach.
Note that having intermediate files can be avoided altogether by using a multiway merge algorithm. This is typically based on a heap/priority queue, so the implementation might get slightly more complex but it reduces the number of I/O operations required.
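A compact sketch of this scheme, using the multiway-merge variant so no intermediate generations are needed; the chunk size and the one-record-per-line format are assumptions:

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

public class ExternalSort {
    // Tracks the current head line of one sorted chunk file.
    private static final class Head {
        String line;
        final BufferedReader reader;
        Head(BufferedReader r) throws IOException { reader = r; line = r.readLine(); }
    }

    private static Path writeChunk(List<String> buf) throws IOException {
        Collections.sort(buf);                        // in-memory sort of one chunk
        Path chunk = Files.createTempFile("chunk", ".txt");
        Files.write(chunk, buf);
        buf.clear();
        return chunk;
    }

    public static void sort(Path input, Path output, int chunkSize) throws IOException {
        // Phase 1: split the input into sorted chunk files that fit in memory.
        List<Path> chunks = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input)) {
            List<String> buf = new ArrayList<>(chunkSize);
            String line;
            while ((line = in.readLine()) != null) {
                buf.add(line);
                if (buf.size() == chunkSize) chunks.add(writeChunk(buf));
            }
            if (!buf.isEmpty()) chunks.add(writeChunk(buf));
        }
        // Phase 2: multiway merge; the heap holds one head line per chunk.
        PriorityQueue<Head> heap = new PriorityQueue<>(Comparator.comparing((Head h) -> h.line));
        for (Path c : chunks) {
            Head h = new Head(Files.newBufferedReader(c));
            if (h.line != null) heap.add(h);
        }
        try (BufferedWriter out = Files.newBufferedWriter(output)) {
            while (!heap.isEmpty()) {
                Head h = heap.poll();
                out.write(h.line);
                out.newLine();
                h.line = h.reader.readLine();
                if (h.line != null) heap.add(h); else h.reader.close();
            }
        }
    }
}
```

With k chunks, each output record costs O(log k) heap work and one buffered read, so the whole merge is a single sequential pass over the data.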
Implementing the above in Java shouldn't be too difficult with some careful design, although it can definitely get tricky. I still highly recommend an out-of-the-box solution like Mongo.
As it turns out, your problem is that your heap cannot accommodate such a large array, so you must forget any solution that implies loading the whole file content into an array (as long as you can't grow your heap).
So you're facing streaming. It's the only (and typical) solution when you have to handle input sources that are larger than your available memory. I would suggest streaming the file content through your program, which should perform the sorting by outputting either to a random-access file (trickier) or to a database.
I'd take a different approach.
Given a file with, say, a single element per line, I'd read the first n elements. I would repeat this m times, such that the number of lines in the file is n * m + C, with C being the left-over lines.
When dealing with integers, you may wish to use around 100,000 elements per read; with strings I would use fewer, maybe around 1,000. It depends on the data type and the memory needed per element.
From there, I would sort those n elements and write them to a temporary file with a unique name.
Now, since all the files are sorted, the smallest elements will be at the start of each. You can then iterate over the files until you have processed all the elements, each time finding the smallest remaining element and appending it to the final output.
This approach reduces the amount of RAM needed, relying on drive space instead, and allows you to handle the sorting of any file size.
Build an array of the record positions inside the file (a kind of index); maybe that would fit into memory instead. You need one 8-byte Java long per file record. Sort the array, loading records only for comparison and not retaining them (use RandomAccessFile). After sorting, write the new final file, using the index pointers to fetch the records in the needed order.
This will also work if the records are not all the same size.
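A sketch of this pointer-sorting idea, assuming newline-delimited records (for variable-size records only the delimiter scan would change):

```java
import java.io.*;
import java.util.*;

public class PointerSort {
    // Sorts a file of newline-delimited records keeping only one long per record in memory.
    public static void sortFile(File in, File out) throws IOException {
        List<Long> offsets = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(in, "r")) {
            long pos = 0;
            while (pos < raf.length()) {            // index pass: record each line's start offset
                offsets.add(pos);
                raf.seek(pos);
                raf.readLine();
                pos = raf.getFilePointer();
            }
            // Each comparison re-reads two records instead of retaining them in memory.
            offsets.sort(Comparator.comparing((Long o) -> readAt(raf, o)));
            try (PrintWriter w = new PrintWriter(new FileWriter(out))) {
                for (long o : offsets) w.println(readAt(raf, o));
            }
        }
    }

    private static String readAt(RandomAccessFile raf, long offset) {
        try { raf.seek(offset); return raf.readLine(); }
        catch (IOException e) { throw new UncheckedIOException(e); }
    }
}
```

The trade-off is clear: O(n) heap for the offsets, but every comparison costs two seeks, so this only pays off when the records are large relative to an 8-byte pointer.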
I have a huge dataset of UTF-8 strings to process, and I need to eliminate duplicates in order to obtain a unique set of strings.
I'm using a hashset to check whether a string is already known, but now that I have reached 100,000,000 strings I do not have enough RAM and the process crashes. Moreover, I have only processed 1% of the dataset, so an in-memory solution is impossible.
What I would like is a hybrid solution, with an "in-memory index" and "disk-based storage", so I could use the 10 GB of RAM I have to speed up the process.
=> Do you know of a Java library that already does this? If not, which algorithm should I look into?
Using a Bloom filter in memory to check whether a string is not present could be a solution, but I would still have to check the disk sometimes (false positives), and I would like to hear about different solutions.
=> How should I store the strings on disk to get fast read and write access?
_ I don't want to use an external service like a NoSQL DB or MySQL; it must be embedded.
_ I already tried file-based light SQL DBs like H2 or HSQLDB, but they are very bad at handling massive datasets.
_ I don't consider Trove/Guava collections a solution (unless they offer a disk-based option I'm not aware of); I'm already using an extremely memory-efficient custom hashset, and I don't even store Strings but byte[] in memory. I have already tweaked the -Xmx settings for the JVM.
EDIT: The dataset I'm processing is huge; the raw unsorted dataset doesn't fit on my hard disk. I'm streaming it byte by byte and processing it.
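The Bloom-filter idea mentioned above can be sketched as follows; the bit-array size, hash count, and the cheap double-hashing mix are arbitrary, untuned choices:

```java
import java.util.BitSet;

// Minimal Bloom filter: "no" answers are certain, "maybe" answers still
// require checking the disk-based store.
public class StringBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    public StringBloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    private int index(String s, int i) {
        int h = s.hashCode() * 31 + i * 0x9E3779B9; // cheap double-hashing mix
        return Math.floorMod(h, size);
    }

    public void add(String s) {
        for (int i = 0; i < hashes; i++) bits.set(index(s, i));
    }

    public boolean mightContain(String s) {
        for (int i = 0; i < hashes; i++) if (!bits.get(index(s, i))) return false;
        return true;
    }
}
```

With 10 GB of RAM available, a filter of a few billion bits keeps the false-positive rate (and hence the disk-check rate) very low for 10^8 strings.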
What you could do is use an external sorting technique, such as external merge sort, to sort your data first.
Once that is done, you can iterate through the sorted data and keep the last element you have encountered. Then you check the current item against the next: if they are the same, you move on to the next item; if not, you update the item you currently hold and output it.
To avoid huge memory consumption, you can dump your list of unique items to the hard drive whenever a particular threshold is reached and keep going.
Long story short:

Let data be the data set you need to work with
Let sorted_data = External_Merge_Sort(data)
Data_Element last_data = {}
Let unique_items be the set of unique items you want to yield

foreach element e in sorted_data
{
    if (e != last_data)
    {
        last_data = e
        add e to unique_items
        if (size(unique_items) == threshold)
        {
            dump_to_drive(unique_items)
            clear(unique_items)
        }
    }
}
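A runnable version of the pseudocode above, with the periodic dump-to-disk step left as a comment since its threshold and destination are deployment choices:

```java
import java.util.ArrayList;
import java.util.List;

public class SortedDedup {
    // One pass over externally sorted input: emit each element only when it
    // differs from the previous one.
    public static List<String> dedup(Iterable<String> sortedData) {
        List<String> unique = new ArrayList<>();
        String last = null;
        for (String e : sortedData) {
            if (!e.equals(last)) {
                last = e;
                unique.add(e);
                // In the real setting, dump `unique` to disk and clear it
                // whenever it reaches the chosen threshold.
            }
        }
        return unique;
    }
}
```

Because the input is sorted, equal strings are adjacent, so a single retained element is enough to detect every duplicate.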
What is the total data size you have? If it is not in terabytes, and supposing you can use, say, 10 machines, I would suggest an external cache like memcached (spymemcached is a good Java client for memcached).
Install memcached on the 10 nodes. The spymemcached client should be initialized with the list of memcached servers, so that they become a virtual cluster for our program.
For each string you read:
    if it is already in memcache:
        continue with the next string
    else:
        add it to memcache
        add it to the list of strings to be flushed to disk
        if the size of the list of strings to be flushed > a certain threshold:
            flush them to disk
Flush any remaining strings to disk
Another approach is to use some kind of map-reduce :), without Hadoop :)
De-duplicate the first 2 GB of strings and write out the de-duplicated result to an intermediate file.
Repeat the above step with the next 2 GB of strings, and so on.
Now apply the same method to the intermediate de-duplicated files.
When the total size of the intermediate de-duplicated data is small enough, use memcache or an internal HashMap to produce the final output.
This approach doesn't involve sorting and hence may be efficient.
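One round of this chunked de-duplication might look like the sketch below; the chunk size is an assumption, and later rounds would simply feed the intermediate files back through the same method:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;

public class ChunkDedup {
    // Reads strings until the set holds chunkSize unique entries (or input ends),
    // then writes the survivors of this chunk to an intermediate file.
    public static Path dedupChunk(Iterator<String> input, int chunkSize) throws IOException {
        Set<String> seen = new LinkedHashSet<>();   // preserves first-seen order
        while (input.hasNext() && seen.size() < chunkSize) seen.add(input.next());
        Path out = Files.createTempFile("dedup", ".txt");
        Files.write(out, seen);
        return out;
    }
}
```

Note that duplicates spanning two different chunks survive the first round; that is why the method is re-applied to the intermediate files until the data fits in one in-memory pass.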
OK, so I reformatted the post to make it a little easier to understand (sorry about all the pastebins, but Stack Overflow was being dumb with code formatting).
Please note that I do not intend to store the ridiculous amount of data I state below.
The main reason I use that amount is to squeeze out as much efficiency as possible.
Let's say I have the following code(s).
The method that will be adding to the DropItemQueue (it starts at floodFill (with depth 0); the other parameters do not matter):
http://pastebin.com/3wqEb5cM
This is the same class, and it will then call the dropItem method in Utils:
http://pastebin.com/yEUW9Wad
My Utils.dropItem method is as follows
http://pastebin.com/c0eaWeMA
This is the ServerTickHandler.addDropItemQueue method and its variable storage
http://pastebin.com/Q4p5a4ja
Here is the DropItemQueue class
http://pastebin.com/wxCj9imN
If I were to add, say, 100,000,000 elements to this hashset, I have noticed that it takes around 2 seconds to iterate over everything in the hashset using the code below, and this iterator is called every 1/20th of a second:
http://pastebin.com/zSWg1kKu
2 seconds per iteration doesn't seem like much, but with that number of elements, getting rid of every single element stored in the hashset would take around 50 days.
Every time an element is added to the hashset, its maxTicks is 1 more than the previously added element's, so basically an item should be dropped every 1 second; but due to the 2 seconds it takes to iterate over everything, it actually takes 3 seconds to drop an item, which would make it take around 150 days to complete the iteration and flush every element out.
My question is: would it be quicker to have multiple hashsets with a smaller maximum number of elements, let's say 1,000 elements each?
Yes, this would give me 100,000 hashsets, but since each of them is smaller, would the iteration times be lower (albeit a very small increase in efficiency)? Or is there something better I can use than a hashset, or are hashsets the best thing to use?
Do note that if I did use multiple smaller hashsets, I could not use threads, due to cross-thread data redundancy.
The HashSet data structure is designed for really only one purpose, which is to answer the question "Does this set contain this item?". For any other use it is not necessarily the most efficient.
For your use case, it seems like a Queue would be a better choice.
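A sketch of the queue-based alternative: since each queued item's maxTicks is later than the previous one's, insertion order equals drop order, so a plain FIFO queue works and each tick only has to inspect the head instead of iterating over everything. Item here is a stand-in for the real queued drop:

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class DropQueue {
    public static final class Item {
        public final long dropTick;
        public final String payload;
        public Item(long dropTick, String payload) { this.dropTick = dropTick; this.payload = payload; }
    }

    // FIFO: items are added with strictly increasing dropTick, so the head is
    // always the next item due to drop.
    private final Queue<Item> queue = new ArrayDeque<>();

    public void add(Item item) { queue.add(item); }

    // Called every tick: O(1) per check instead of a full HashSet scan.
    public Item pollIfDue(long currentTick) {
        Item head = queue.peek();
        return (head != null && head.dropTick <= currentTick) ? queue.poll() : null;
    }
}
```

If drop times were not monotonically increasing, a PriorityQueue ordered by dropTick would give the same head-only check at O(log n) per insert.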
In some previous posts I have asked some questions about coding a custom hash map/table in Java. Now, as I can't solve it, and maybe I forgot to properly mention what I really want, I am summarizing all of them to make it clear and precise.
What I am going to do:
I am trying to write code for our server, in which I have to find users' access type by URL.
Now, I have 1,110 million URLs (approx.).
So, what we did:
1) Divided the database into 10 parts, each of 110 million URLs.
2) Built a HashMap using parallel arrays whose keys are one part of the URL (represented as a long) and whose values are the other part of the URL (represented as an int); a key can have multiple values.
3) Then search the HashMap for some other URLs (millions of URLs saved in one day) each day at the beginning, when the system starts.
What I have tried:
1) I have tried many NoSQL databases; however, we found them not so good for our purpose.
2) I have built a custom hashmap (using two parallel arrays) for that purpose.
So, what the issue is:
When the system starts, we have to load our hashtable for each database and perform a search for millions of URLs.
Now, the issue is:
1) Though the HashTable performance is quite nice, the code takes more time while loading the HashTable (we are using a FileChannel and a memory-mapped buffer to load it, which takes 20 seconds to load a HashTable of 220 million entries; with a load factor of 0.5, we found this to be the fastest).
So, we are spending: (HashTable load + HashTable search) * no. of DBs = (20 + 5) * 10 = 250 seconds, which is quite expensive for us, and most of the time (200 out of 250 sec) goes into loading the hashtables.
Have you thought of any other way:
One way can be:
Without worrying about loading and storing, leave the caching to the operating system by using a memory-mapped buffer. But, as I have to search for millions of keys, this gives worse performance than the above.
As we found the HashTable performance nice but the loading time high, we thought to cut it down in another way, like:
1) Create an array of linked lists of size Integer.MAX_VALUE (my own custom linked list).
2) Insert values (ints) into the linked list whose number is the key number (we reduce the key size to an int).
3) So, we have to store only the linked lists to disk.
Now, the issue is that it takes lots of time to create such an amount of linked lists, and creating such a large number of linked lists has no meaning if the data is not well distributed.
So, what are your requirements:
Simply, my requirements:
1) Keys with multiple values; insertion and searching. Looking for good search performance.
2) A fast way to load (especially) into memory.
(Keys are 64-bit ints and values are 32-bit ints; one key can have at most 2-3 values. We can also make our keys 32-bit, but that will give more collisions, which is acceptable for us if we can make it better.)
Can anyone help me with how to solve this, or give any comment on how to solve this issue?
Thanks.
NB:
1) As per previous suggestions on Stack Overflow, pre-reading the data for disk caching is not possible, because our application starts working as soon as the system starts, and the system is started fresh the next day.
2) We have not found that NoSQL DBs scale well for us, and our requirements are simple (just insert hashtable key/values, then load and search (retrieve values)).
3) As our application is part of a small project to be deployed on a small campus, I don't think anybody will buy me an SSD for it. That is my limitation.
4) We also used Guava/Trove, but they are not able to store such a large amount of data even in 16 GB (we are using a 32 GB Ubuntu server).
If you need quick access to 1,110 million data items then hashing is the way to go. But don't reinvent the wheel; use something like:
memcacheDB: http://memcachedb.org
MongoDB: http://www.mongodb.org
Cassandra: http://cassandra.apache.org
It seems to me (if I understand your problem correctly) that you are trying to approach the problem in a convoluted manner.
I mean, the data you are trying to pre-load is huge to begin with (let's say 220 million entries * 64 bytes ~ 14 GB), and you are trying to memory-map it, etc.
I think this is a typical problem that is solved by distributing the load across different machines. I.e., instead of trying to locate the linked-list index, you should be trying to figure out the index of the machine onto which the appropriate part of the map has been loaded, and get the value from that machine (each machine holds part of this database map, and you get the data from the appropriate part, i.e., machine, each time).
Maybe I am way off here, but I also suspect you are using a 32-bit machine.
So if you have to stay with a one-machine architecture, and it is not economically possible to improve your hardware (a 64-bit machine and more RAM, or an SSD, as you point out), I don't think you can make any dramatic improvement.
I don't really understand in what form you are storing the data on disk. If what you are storing consists of URLs and some numbers, you might be able to speed up loading from disk quite a bit by compressing the data (unless you are already doing that).
Creating a multithreaded loader that decompresses while loading might give you quite a big boost.
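A single-threaded sketch of the compression idea, assuming each entry is one long key plus one int value (the record layout and file name are placeholders):

```java
import java.io.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Stores the key/value table as gzipped binary records and loads it back.
public class CompressedTable {
    public static final class Table {
        public final long[] keys;
        public final int[] values;
        Table(long[] k, int[] v) { keys = k; values = v; }
    }

    public static void save(String path, long[] keys, int[] values) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new GZIPOutputStream(new BufferedOutputStream(new FileOutputStream(path))))) {
            out.writeInt(keys.length);
            for (int i = 0; i < keys.length; i++) {
                out.writeLong(keys[i]);
                out.writeInt(values[i]);
            }
        }
    }

    public static Table load(String path) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new GZIPInputStream(new BufferedInputStream(new FileInputStream(path))))) {
            int n = in.readInt();
            long[] k = new long[n];
            int[] v = new int[n];
            for (int i = 0; i < n; i++) {
                k[i] = in.readLong();
                v[i] = in.readInt();
            }
            return new Table(k, v);
        }
    }
}
```

The multithreaded variant would split the table into several independently compressed segment files and load them in parallel, so decompression on one core overlaps disk reads for another.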