Sorry if this has been asked before (though I can't really find a solution).
I'm not really too good at programming, but anyway: I am crawling a bunch of websites and storing information about them on a server. I need a Java program to process the vector coordinates associated with each document (about a billion documents, each with a grand total of roughly 500,000 numbers, give or take). I need to calculate the singular value decomposition of that whole matrix.
Now Java, to my knowledge, obviously can't handle a matrix as big as that. If I try creating even a relatively small array (about 44 million elements) I get a heap error. I use Eclipse, and I tried changing the -Xmx value to 1024m (it won't go any higher for some reason, even though my computer has 8 GB of RAM).
What solution is there to this? Another way of retrieving the data I need? Calculating the SVD in a different way? Using a different programming language to do this?
EDIT: Just for right now, pretend there are a billion entries with 3 words associated with each. I am setting -Xmx and -Xms correctly (from run configurations in Eclipse -> this is the equivalent of running java -XmsXXXX -XmxXXXX ...... at the command prompt).
The Java heap space can be set with the -Xmx (note the initial capital X) option and it can certainly reach far more than 1 GB, provided you are using a 64-bit JVM and the corresponding physical memory is available. You should try something along the lines of:
java -Xmx6144m ...
That said, you need to reconsider your design. There is a significant space cost associated with each object, with a typical minimum somewhere around 12 to 16 bytes per object, depending on your JVM. For example, a String has an overhead of about 36-40 bytes...
Even with a single object per document with no book-keeping overhead (impossible!), you just do not have the memory for 1 billion (1,000,000,000) documents. Even for a single int per document you need about 4 GB.
You should redesign your application to exploit any sparseness in the matrix, and to fall back on disk-based storage where needed. Having everything in memory is nice, but not always possible...
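To make the sparseness idea concrete, here is a minimal sketch (illustrative names only, not the asker's code) of how a single document's row could be held as parallel index/value arrays instead of a dense 500,000-element array. Only the non-zero entries are stored; iterative SVD methods such as Lanczos only need dot products and matrix-vector products, so rows in this form can even be streamed from disk one at a time.

// Illustrative sketch: a compact row representation, assuming most of the
// ~500,000 possible entries per document are zero.
class SparseVector {
    final int[] indices;   // column positions of the non-zero entries, sorted ascending
    final float[] values;  // the non-zero entries themselves

    SparseVector(int[] indices, float[] values) {
        this.indices = indices;
        this.values = values;
    }

    // Dot product with another sparse vector (the core operation of iterative SVD solvers).
    double dot(SparseVector other) {
        double sum = 0.0;
        int i = 0, j = 0;
        while (i < indices.length && j < other.indices.length) {
            if (indices[i] == other.indices[j]) {
                sum += (double) values[i++] * other.values[j++];
            } else if (indices[i] < other.indices[j]) {
                i++;
            } else {
                j++;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        SparseVector a = new SparseVector(new int[]{2, 7, 40}, new float[]{1.5f, 0.25f, 3f});
        SparseVector b = new SparseVector(new int[]{7, 40}, new float[]{2f, 1f});
        System.out.println(a.dot(b)); // prints 3.5
    }
}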
Are you using a 32-bit JVM? These cannot have more than about 2 GB of heap; I never managed to allocate more than 1.5 GB. Use a 64-bit JVM instead, as it can allocate a much larger heap.
Or you could apply some math and use a divide-and-conquer strategy, i.e. split the problem into smaller subproblems that combine to give the same result.
I don't know much about SVD, but maybe this page can be helpful:
http://www.netlib.org/lapack/lug/node32.html
-Xms and -Xmx are different. The one containing s is the starting heap size and the one with x is the maximum heap size.
so
java -Xms512m -Xmx1024m
would give you 512 MB to start with and a 1024 MB maximum.
As other people have said, though, you may need to break your problem down to get this to work. Are you using 32- or 64-bit Java?
For data of that size, you should not plan to store it all in memory. The most common scheme to externalize this kind of data is to store it all in a database and structure your program around database queries.
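As a rough illustration of that approach (the table name, JDBC URL and credentials below are placeholders, not anything from the question), you could keep the matrix in a doc_vectors(doc_id, term_id, value) table and stream it with a small fetch size, so only a chunk of the data is ever in memory at once:

// Sketch only: streaming (doc_id, term_id, value) triples from a hypothetical
// doc_vectors table instead of holding the whole matrix in memory.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class VectorStream {
    public static void main(String[] args) throws SQLException {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost/crawl", "user", "password")) {
            con.setAutoCommit(false); // some drivers need this for cursor-based fetching
            try (PreparedStatement ps = con.prepareStatement(
                    "SELECT doc_id, term_id, value FROM doc_vectors ORDER BY doc_id")) {
                ps.setFetchSize(10_000); // fetch in chunks rather than loading everything
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        long doc = rs.getLong("doc_id");
                        int term = rs.getInt("term_id");
                        double value = rs.getDouble("value");
                        // feed (doc, term, value) into the SVD routine here
                    }
                }
            }
        }
    }
}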
Just for right now, pretend there are a billion entries with 3 words associated with each.
If you have one billion entries you need one billion times the size of each entry. If you mean 3 ints per entry, that's 12 GB at least just for the data. If you meant words as Strings, you could enumerate the words as ints (there are only about 100K distinct words in English), and it would take the same amount of space.
Given that 16 GB costs a few hundred dollars, I would suggest buying more memory.
Related
I'm working on a project that requires me to store (potentially) millions of key-value mappings and make (potentially) hundreds of queries a second. There are some checks I can do on the data I'm working with, but they will only reduce the load a bit. In addition, I will be making (potentially) hundreds of puts/removes a second, so my questions are: Is there a map sufficient for this task? Is there any way I might optimize the map? Is there something faster that would work for storing key-value mappings?
Some additional information:
- The key will be a point in 3D space; I feel like this means I could use arrays, but the arrays would have to be massive
- The value must be an object
Any help would be greatly appreciated!
Back-of-envelope estimates help in coming to terms with this sort of thing. If you have millions of entries in a map, let's say 32M, and a key is a 3D point (3 ints -> 3 * 4 B -> 12 bytes), then 12 B * 32M = 384 MB. You didn't mention the size of the value, but assuming a similarly sized value, let's double that figure. This is Java, so assuming a 64-bit platform with compressed OOPs (which is the default and what most people are on), you pay an extra 12 B of object header per object. So: 32M * 2 * 24 B = 1536 MB.
Now, if you use a HashMap, each entry requires an extra HashMap.Node; in Java 8 on the platform above you are looking at 32 B per Node (use OpenJDK JOL to find out object sizes), which brings us to 2560 MB. Also throw in the cost of the HashMap table array: with 32M entries you are looking at a table with 64M slots (because the array size is a power of two and you need some slack beyond your entries), so that's an extra 256 MB. All together, let's round it up to 3 GB.
Most servers these days have quite large amounts of memory (10s to 100s of GB) and adding an extra 3GB to the JVM live set should not scare you. You might consider it disappointing that the overhead exceeds the data in your case, but this is not your emotional well being, it's a question of will it work ;-)
Now that you've loaded the data, you are mutating it at a rate of hundreds of inserts/deletes per second, let's say 1024; reusing the quantities above we can sum it up as: 1024 * (24*2 + 32) = 80 KB. Churning 80 KB of garbage per second is small change for many applications, and not something you necessarily need to sweat about. To put it in context, a JVM these days can collect many hundreds of MB of young generation in a matter of tens of milliseconds.
So, in summary, if all you need is to load the data and query/mutate it along the lines you describe you might just find that a modern server can easily contend with a vanilla solution. I'd recommend you give that a go, maybe prototype with some representative data set, and see how it works out. If you have an issue you can always find more exotic/efficient solutions.
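If you go the vanilla route, the main thing to get right is the map key. Here is a minimal sketch (illustrative names, assuming integer coordinates as hinted in the question) of an immutable 3D-point key with proper equals/hashCode, which is all a HashMap needs to behave well:

// Minimal sketch of a 3D-point key for a HashMap, assuming integer coordinates.
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

final class Point3D {
    final int x, y, z;

    Point3D(int x, int y, int z) {
        this.x = x;
        this.y = y;
        this.z = z;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Point3D)) return false;
        Point3D p = (Point3D) o;
        return x == p.x && y == p.y && z == p.z;
    }

    @Override
    public int hashCode() {
        return Objects.hash(x, y, z); // cheap and spreads values reasonably well
    }

    public static void main(String[] args) {
        Map<Point3D, String> values = new HashMap<>(1 << 20); // pre-size if the entry count is roughly known
        values.put(new Point3D(1, 2, 3), "payload");
        System.out.println(values.get(new Point3D(1, 2, 3))); // prints "payload"
    }
}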
I'm using SOLR-3.4, spatial filtering with the schema having LatLonType (subType=tdouble). I have an index of about 20M places. My basic problem is that if I do bbox filter with cache=true, the performance is reasonably good (~40-50 QPS, about 100-150ms latency), but a big downside is crazy fast old gen heap growth ultimately leading to major collections every 30-40 minutes (on a very large heap, 25GB). And at that point performance is beyond unacceptable. On the other hand I can turn off caching for bbox filters, but then my latency and QPS drops (the latency goes down from 100ms => 500ms). The NumericRangeQuery javadoc talks about the great performance you can get (sub 100 ms) but now I wonder if that was with filterCache enabled, and nobody bothered to look at the heap growth that results. I feel like this is sort of a catch-22 since neither configuration is really acceptable.
I'm open to any ideas. My last idea (untried) is to use geo hash (and pray that it either performs better with cache=false, or has more manageable heap growth if cache=true).
EDIT:
Precision step: default (8 for double I think)
System memory: 32GB (EC2 M2 2XL)
JVM: 24GB
Index size: 11 GB
EDIT2:
A tdouble with a precisionStep of 8 means that your doubles will be split into sequences of 8 bits. If all your latitudes and longitudes only differ in the last sequence of 8 bits, then tdouble would have the same performance as a normal double on a range query. This is why I suggested testing a precisionStep of 4.
Question: what does this actually mean for a double value?
A profile of Solr while it responds to your spatial queries would be of great help in understanding what is slow; see hprof, for example.
Still, here are a few ideas on how you could (perhaps) improve latency.
First, you could try to test what happens when decreasing the precisionStep (try 4, for example). If the latitudes and longitudes are too close to each other and the precisionStep is too high, Lucene cannot take advantage of having several indexed values.
You could also try to give a little bit less memory to the JVM in order to give the OS cache more chances to cache frequently accessed index files.
Then, if it is still not fast enough, you could try to extend or replace TrieDoubleField as a sub-field with a field type that uses a frange query for the getRangeQuery method. This would reduce the number of disk accesses while computing the range, at the cost of higher memory usage. (I have never tested it; it might provide horrible performance as well.)
We are working in a Tomcat/J2EE application.
In this application we store a lot of data in the session, and I'm wondering how much data we can store without problems.
What imposes the limit? The memory of Tomcat? Of the JVM?
How can I calculate whether I can store 200k strings?
For #1 - You can store as much data as the heap size allocated to the JVM allows. Of course, Tomcat runs inside the JVM, so it will also use some part of the allocated memory.
For #2 - It really depends on the size of the strings: 2 bytes are required per character (Java strings are UTF-16). Take the average length of your strings, multiply it by 200k, and then make sure you have enough memory allocated.
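A quick back-of-envelope in code (the average length and per-String overhead below are assumptions you would replace with your own measurements):

// Rough estimate of the session footprint for 200,000 strings, assuming an
// average length of 50 characters and ~40 bytes of per-String object overhead.
public class SessionEstimate {
    public static void main(String[] args) {
        long count = 200_000;
        long avgChars = 50;          // assumption: adjust to your real data
        long perStringOverhead = 40; // approximate String + char[] header cost
        long bytes = count * (avgChars * 2 + perStringOverhead); // 2 bytes per char (UTF-16)
        System.out.printf("~%.1f MB per session%n", bytes / (1024.0 * 1024.0));
    }
}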
Tomcat runs in a JVM. So if you have a 32-bit JRE, you can have a maximum heap size of about 1.7 GB. If you want more, you should switch to a 64-bit JRE.
About string allocation: the internal Java character encoding is UTF-16, so each char takes 2 bytes. In order to save space, you could compress those strings with zip/deflate and store them as if they were files.
What is the size that a reference in Android's Java VM consumes?
More info:
By that I mean, if we have
String str = "Watever";
I need what str itself takes, not "Watever" -- "Watever" is what is stored at the location that the reference held by str points to.
Also, if we have
String str = null;
how much memory does it consume? Is it the same as the other str?
Now, if we have:
Object obj[] = new Object[2];
how much does obj consume, and how much do obj[0] and obj[1] consume?
The reason for the question is the following (in case someone can recommend something):
I'm working on an app that manages many pictures downloaded from the internet.
I started storing those pictures on a "bank" (that consists of a list of pictures).
When displaying those pictures in a gallery, I would search for the picture in the list (SLOW) and then, if the picture wasn't there, show a temporary placeholder image until the picture had been downloaded.
Since that happened on the UI thread, the app became very slow, so I thought about implementing a hash table for the bank instead of the list I had.
As I explained before, this search occurs in the UI Thread (and I can't change that). Because of that, collisions can become a problem if they start slowing the thread.
I have read that "To balance time and space efficiency, the hash table should be around half full", but that makes collisions occur half of the time (not practical for the UI thread). That makes me think about having a very large hash table (compared to the number of pictures saved) and using more RAM (leaving less free VM heap).
Before determining the size of the hash table, I wanted to know how much memory it would consume, in order not to exaggerate.
I know that the size of the hash table might be very small compared to the memory that the pictures might consume, but I wanted to make sure I wasn't consuming more memory than necessary.
Before asking this question I searched, among other places, in:
How big is an object reference in Java and precisely what information does it contain?
reference type size in java
Hashing Tutorial
(Yes, I know two of the places contradict each other, that's part of the reason for the question).
An object or array reference occupies one 32-bit word (4 bytes) on a 32-bit JVM or Dalvik VM. A null takes the same space as a reference. (It has to, because a null has to fit in a reference-typed slot, i.e. an instance field, local variable, etc.)
On the other hand, an object occupies a minimum of two 32-bit words (8 bytes), and an array occupies a minimum of three 32-bit words (12 bytes). The actual size depends on the number and kinds of fields for an object, and on the number and kind of elements for an array.
For a 64-bit JVM, the size of a reference is 64 bits, unless you have configured the JVM to use compressed pointers:
-XX:+UseCompressedOops Enables the use of compressed pointers (object references represented as 32-bit offsets instead of 64-bit pointers) for optimized 64-bit performance with Java heap sizes less than 32 GB.
This is the nub of your question, I think.
Before determining the size of the hash table, I wanted to know how much memory it would consume, in order not to exaggerate.
If you allocate a HashMap or Hashtable with a large initial size, the majority of the space will be occupied by the hash array. This is an array of references, so its size will be (3 + initialSize) 32-bit words. It is unlikely that this will be significant ... unless you get your size estimate drastically wrong.
However, I think you are probably worrying unnecessarily about performance. If you are storing objects in a default allocated HashMap or Hashtable, the class will automatically resize the hash table as it gets larger. So, provided that your objects have a decent hash function (not too slow, not hashing everything to a small number of values) the hash table should not be a direct CPU performance concern.
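For illustration, a vanilla pre-sized HashMap keyed by URL is usually all the "bank" needs; the Picture type parameter below is a stand-in for whatever image class the app actually uses, and the class name is made up for the example:

// Illustrative sketch: a URL-keyed picture bank backed by a HashMap.
import java.util.HashMap;
import java.util.Map;

class PictureBank<Picture> {
    // Pre-sizing avoids rehashing; the table itself is just an array of references,
    // so even a few thousand slots cost only a few kilobytes.
    private final Map<String, Picture> byUrl = new HashMap<>(4096);

    void put(String url, Picture picture) {
        byUrl.put(url, picture);
    }

    // O(1) expected lookup, instead of scanning a list on the UI thread.
    Picture get(String url) {
        return byUrl.get(url);
    }
}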
References are nearly free. Even more so when compared to images.
Having a few collisions in a Map isn't a real problem. Collisions can be resolved far more quickly than a linear search through a list of items. That said, a binary search through a sorted list of items (sketched below) would be a good way to keep memory usage down (compared to a Map).
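A minimal sketch of that sorted-list alternative (illustrative names, assuming the pictures are keyed by URL; Picture is again a placeholder type):

// Sketch of the memory-leaner alternative: keep entries in a list sorted by key
// and use binary search instead of a hash table.
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class SortedPictureList<Picture> {
    static final class Entry<P> {
        final String url;
        final P picture;
        Entry(String url, P picture) { this.url = url; this.picture = picture; }
    }

    private final List<Entry<Picture>> entries = new ArrayList<>();
    private final Comparator<Entry<Picture>> byUrl = Comparator.comparing(e -> e.url);

    void put(String url, Picture picture) {
        Entry<Picture> e = new Entry<>(url, picture);
        int pos = Collections.binarySearch(entries, e, byUrl);
        if (pos < 0) entries.add(-pos - 1, e); // insert at the insertion point
        else entries.set(pos, e);              // replace an existing entry
    }

    Picture get(String url) {
        int pos = Collections.binarySearch(entries, new Entry<>(url, null), byUrl);
        return pos >= 0 ? entries.get(pos).picture : null;
    }
}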
I can vouch for the effectiveness of having smaller initial sizes for Maps - I recently wrote a program that builds a Trie structure of 170,000 English words. When I set the initial size to 26, I would run out of memory by the time I got to words starting with R. Cutting it down to 5, I was able to create the maps without memory issues and can search the tree (with many collisions) in effectively no time.
[Edit] If a reference is 32 bits (4 bytes) and your average image is around 2 megabytes, you could fit 500,000 references into the same space that a single image would take. You don't have to worry about the references.
I'm using this statement
//some code
int[][] a = new int[5000000][5000000];
//some code
and running it with command
java -mx512m Test
It is giving OutOfMemoryError: Java heap space, and the stack trace points at the line of the statement above.
How do I solve this problem?
Edit:
I'm trying to solve a practice problem on CodeChef.
You may need to consider an approach to your problem which requires less memory.
From Google Calculator (assuming a 64-bit integer size):
(5,000,000^2) * 64 bits ≈ 186,264.5 gigabytes
I've faced the same problem in Eclipse concerning the Java heap,
and the solution was to change mx512m to mx4096m or mx2048m (i.e. expand the maximum allowed memory limit),
so in your case try the command
java -mx4096m Test
which will allow Java to use 4 GB of your RAM.
I think the data structure you are looking for is a sparse matrix. Store your elements along with their coordinates in a map data structure (e.g. Map<Integer,Map<Integer,Integer>> for a 2D sparse array) and just assume anything not in the map is zero.
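For example (a minimal sketch of that idea, not a production implementation):

// Minimal sketch of the nested-map sparse matrix described above:
// anything not present in the maps is treated as zero.
import java.util.HashMap;
import java.util.Map;

class SparseMatrix {
    private final Map<Integer, Map<Integer, Integer>> rows = new HashMap<>();

    void set(int row, int col, int value) {
        if (value == 0) {
            Map<Integer, Integer> r = rows.get(row);
            if (r != null) r.remove(col); // keep the structure sparse
        } else {
            rows.computeIfAbsent(row, k -> new HashMap<>()).put(col, value);
        }
    }

    int get(int row, int col) {
        Map<Integer, Integer> r = rows.get(row);
        return r == null ? 0 : r.getOrDefault(col, 0);
    }

    public static void main(String[] args) {
        SparseMatrix m = new SparseMatrix();
        m.set(4_999_999, 4_999_999, 42);                  // indices can be huge; only stored cells cost memory
        System.out.println(m.get(4_999_999, 4_999_999));  // prints 42
        System.out.println(m.get(0, 0));                  // prints 0
    }
}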
Well, that's 25 trillion ints, each of which takes 4 bytes, so 100 trillion bytes overall. The easiest solution is to buy ~90 terabytes of RAM and a 64-bit OS.
Seriously though, the correct solution is probably to allocate a more reasonable data structure that can store the data more efficiently, assuming that you don't actually need to load 90 terabytes of data into RAM at once. Perhaps if you post more about the problem, we can give a better answer?
Amongst the other answers, you have a typo in your command.
It should be: java -Xmx512M Test
You are using too much memory. Use less or have more. Each int element is 32 bits (4 bytes), and your 512 MB memory limit is far smaller than what that array would need.
For int[5000000][5000000] you need 100,000,000,000,000 bytes, or about 100,000 GB. So all you can do is wait the roughly 100 years until something like that becomes possible.
You need a more enlightened data structure. Unless you actually NEED 25 trillion integers, which is doubtful.
That's a matrix with 5 million columns and 5 million rows. Is that really what you wanted?