I'm using Solr 3.4 with spatial filtering, and the schema has a LatLonType field (subType=tdouble). I have an index of about 20M places. My basic problem is that if I do a bbox filter with cache=true, performance is reasonably good (~40-50 QPS, about 100-150 ms latency), but the big downside is very fast old-gen heap growth, ultimately leading to major collections every 30-40 minutes (on a very large heap, 25GB), at which point performance is beyond unacceptable. On the other hand, I can turn off caching for bbox filters, but then latency and QPS suffer (latency goes up from ~100 ms to ~500 ms). The NumericRangeQuery javadoc talks about the great performance you can get (sub-100 ms), but now I wonder whether that was measured with the filterCache enabled and nobody bothered to look at the heap growth that results. I feel like this is a catch-22, since neither configuration is really acceptable.
I'm open to any ideas. My last idea (untried) is to use geohash (and pray that it either performs better with cache=false, or has more manageable heap growth with cache=true).
EDIT:
Precision step: default (8 for double I think)
System memory: 32GB (EC2 M2 2XL)
JVM: 24GB
Index size: 11 GB
EDIT2:
A tdouble with a precisionStep of 8 means that your doubles will be split into sequences of 8 bits. If all your latitudes and longitudes only differ by the last sequence of 8 bits, then tdouble would have the same performance as a normal double on a range query. This is why I suggested testing a precisionStep of 4.
Question: what does this actually mean for a double value?
A profile of Solr while it responds to your spatial queries would be of great help in understanding what is slow; see hprof, for example.
Still, here are a few ideas on how you could (perhaps) improve latency.
First, you could test what happens when you decrease the precisionStep (try 4, for example). If the latitudes and longitudes are too close to each other and the precisionStep is too high, Lucene cannot take advantage of having several indexed precision levels per value.
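For reference, this is roughly what a smaller precisionStep looks like at the Lucene level (Solr builds this query for you from the schema; the field names and bounds below are just illustrative, and the precisionStep passed here should match the one configured on the tdouble field type):

import org.apache.lucene.search.NumericRangeQuery;

// Range query over the latitude sub-field, with precisionStep = 4 instead of 8.
// A smaller step indexes more precision levels per value, so a wide range can be
// covered by fewer low-precision terms instead of many full-precision ones.
NumericRangeQuery<Double> latRange =
    NumericRangeQuery.newDoubleRange("place_0_coordinate", 4, 37.0, 38.0, true, true);
NumericRangeQuery<Double> lonRange =
    NumericRangeQuery.newDoubleRange("place_1_coordinate", 4, -123.0, -122.0, true, true);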
You could also try to give a little bit less memory to the JVM in order to give the OS cache more chances to cache frequently accessed index files.
Then, if it is still not fast enough, you could try replacing TrieDoubleField as the sub-field with a field type that uses a frange (function range) query in its getRangeQuery method. This would reduce the number of disk accesses while computing the range, at the cost of higher memory usage. (I have never tested it; it might give horrible performance as well.)
The app I'm working on is one where users can simulate tests and answer them offline. I have software that takes the data from my database (the questions, alternatives, question type, etc.) and turns it into an array.
I don't know which is the most efficient (memory-wise): creating one object with a big array holding all the questions, creating separate objects (one per subject, for example) with an array each, or creating multiple arrays in the same object. Is it OK to create an array with about 1000 arrays inside, or is it better to split it into, say, 10 arrays with 100 arrays inside each?
P.S.: During the test I will only use 30 items from the array, so I'll take the entries from the big array (or from the multiple arrays) and add them to the small 30-entry array that'll be created according to the user's inputs.
What I would like to use
I would like a big array, because for me it would be easier to sort and create random tests. Some people are saying 1000 entries aren't too much, so I think I'll stick to a big array. What would be too big? 10k? 100k?
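For illustration, drawing a 30-entry test from one big array stays simple either way; a minimal sketch (Question and the other names are placeholders for whatever your entries actually are):

import java.util.*;

// Draw `count` distinct entries at random from the single big pool.
static <T> List<T> randomTest(T[] pool, int count, Random rnd) {
    List<T> copy = new ArrayList<>(Arrays.asList(pool));
    Collections.shuffle(copy, rnd);               // O(n); trivial for ~1000 entries
    return new ArrayList<>(copy.subList(0, count));
}

// usage: List<Question> test = randomTest(questions, 30, new Random());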
There are three kinds of efficiency you need to consider:
Memory efficiency; i.e. minimizing RAM utilization
CPU efficiency
Programmer efficiency; i.e. minimizing the amount of your valuable time spent writing the code, writing test cases, debugging, and maintaining it.
Note that the above criteria work against each other.
Memory Efficiency
The memory size in bytes of an array of N references in Java is given by
N * reference_size + array_header_size + padding
where:
reference_size is the size of a reference in bytes (typically 4 or 8)
array_header_size is typically 12 bytes
padding is greater than or equal to zero, and less than the heap node size granularity.
The array itself also has a unique reference which must be held in memory somewhere.
So, if you split a large array into M smaller arrays, you will be using at least (M - 1) * 16 extra bytes of RAM, and possibly more. On the other hand, we are talking about bytes here, not kilobytes or megabytes. So this is hardly significant.
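As a sanity check, here is the arithmetic above turned into a tiny helper, assuming 4-byte compressed references, a 12-byte array header and 8-byte alignment (the exact numbers vary by JVM):

// Rough size in bytes of one Object[] with n elements, per the formula above.
static long refArrayBytes(long n) {
    long raw = n * 4 + 12;          // n references + array header
    return (raw + 7) / 8 * 8;       // padding up to 8-byte granularity
}

// one array of 1000 refs:      refArrayBytes(1000)     = 4016 bytes
// ten arrays of 100 refs each: 10 * refArrayBytes(100) = 4160 bytes
// plus one extra reference per additional array held somewhere else,
// i.e. on the order of the (M - 1) * 16 bytes mentioned above.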
CPU Efficiency
This is harder to predict. The CPU utilization effects will depend largely on what you do with the arrays, and how you do it.
If you are simply subscripting (indexing) an array, that operation doesn't depend on the array size. But if you have multiple arrays (e.g. an array of arrays), then there is additional overhead in determining which array to subscript.
If you are searching for something in an array, then the larger the array you have to search, the longer it will take (on average). But splitting a large array into smaller arrays doesn't necessarily help ... unless you know beforehand which of the smaller arrays to search.
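To make the subscripting point concrete, here is the extra step you pay when one logical index has to be mapped onto an array of arrays (the chunk size of 100 is arbitrary, and Question, questions, chunks and i are placeholders):

static final int CHUNK = 100;                       // arbitrary split size

// One flat array: a single bounds check and load.
Question flat = questions[i];

// Array of arrays: first find the chunk, then the offset inside it.
Question split = chunks[i / CHUNK][i % CHUNK];      // extra divide + extra dereference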
Programmer Efficiency
It will probably make your code more complicated if you use multiple arrays rather than one. More complicated code means more programmer effort in all phases of the application's development and maintenance lifecycle. It is hard to quantify how much extra effort is involved. However programmer effort means cost (paying for salaries) and time (deadlines, time to market, etc), and this is likely to outweigh any small savings in memory and CPU.
Scalability
You said:
Some people are saying 1000 entries aren't too much, so I think I'll stick to a big array. What would be too big? 10k, 100k?
Once again, it depends on the context. In reality, the memory used for an array of 100K instances of X depends largely on the average size of X. You will most likely run out of memory for the X instances themselves before the array becomes a problem.
So, if you want your application to scale up indefinitely, you should probably change the architecture so that it fetches the questions / answers from the database on demand rather than loading them all into memory on start up.
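For example, a hedged sketch of the on-demand approach using plain JDBC; the table and column names (questions, subject_id, body) are made up, and the LIMIT syntax varies a little between databases:

import java.sql.*;
import java.util.*;

// Fetch at most `limit` questions for one subject instead of loading everything at startup.
static List<String> fetchQuestions(Connection conn, int subjectId, int limit) throws SQLException {
    String sql = "SELECT body FROM questions WHERE subject_id = ? LIMIT ?";
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
        ps.setInt(1, subjectId);
        ps.setInt(2, limit);
        List<String> result = new ArrayList<>();
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                result.add(rs.getString("body"));
            }
        }
        return result;
    }
}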
Premature Optimization
Donald Knuth is often (mis-)quoted1 as saying:
"Premature optimization is the root of all evil."
What he is pointing out is that programmers are inclined to optimize things that don't really need optimizing, or spend their effort optimizing the wrong areas of their code based on incorrect intuitions.
My advice on this is the following:
Don't do fine-grained optimization too early. (This doesn't mean that you should ignore efficiency concerns in the design and coding stages, but my advice would be to only consider the major issues; e.g. complexity of algorithms, granularity of APIs and database queries, and so on. Especially things that would be a lot of effort to fix later.)
If and when you do your optimization, do it scientifically:
Use a benchmark to measure performance.
Use a profiler to find performance hotspots and focus your efforts on those.
Use the benchmark to see if the optimization has improved things, and abandon optimizations that don't help.
Set some realistic goals (or time limits) for your optimization and stop when you reach them.
1 - The full quotation is more nuanced. Look it up. And in fact, Knuth is himself quoting Tony Hoare. For a deeper exploration of this, see https://ubiquity.acm.org/article.cfm?id=1513451
To be honest, 1000 is not a big size; what matters is the size of the elements.
However, your entry count will grow in the future, and you definitely don't want to change your logic every time, so you should design it efficiently so that it copes with growing data as well.
Memory concern: whether you store all the data in one array or in 10 arrays, both will take roughly the same amount of memory (the difference is minimal).
But if you have 10 arrays, management becomes harder, and with growing demands you can face more complexity.
I suggest you use a single array, which will be easier to manage. You could also consider a List implementation such as ArrayList for convenience; a LinkedList, however, will not make searches faster, since access into it is O(n).
I'm working on a project that requires me to store (potentially) millions of key-value mappings and make (potentially) hundreds of queries a second. There are some checks I can do on the data I'm working with, but they will only reduce the load a bit. In addition, I will be making (potentially) hundreds of puts/removes a second, so my question is: Is a Map sufficient for this task? Is there any way I might optimize the map? Is there something faster that would work for storing key-value mappings?
Some additional information;
- The key will be a point in 3D space; I feel like this means I could use arrays, but the arrays would have to be massive
- The value must be an object
Any help would be greatly appreciated!
Back-of-envelope estimates help in getting to terms with this sort of thing. If you have millions of entries in a map, let's say 32M, and a key is a 3D point (so 3 ints -> 3 * 4B -> 12 bytes), then 12B * 32M = 384MB. You didn't mention the size of the value, but assuming it is similarly sized, let's double that figure. This is Java, so assuming a 64-bit platform with Compressed OOPs (which is the default and what most people are on), you pay an extra 12B of object header per object. So: 32M * 2 * 24B = 1536MB.
Now if you use a HashMap, each entry requires an extra HashMap.Node; in Java 8 on the platform above you are looking at 32B per Node (use OpenJDK JOL to find out object sizes). Which brings us to 2560MB. Also throw in the cost of the HashMap table array: with 32M entries you are looking at a table with 64M slots (because the array size is a power of 2 and you need some slack beyond your entries), so that's an extra 256MB. All together, let's round it up to 3GB?
Most servers these days have quite large amounts of memory (tens to hundreds of GB) and adding an extra 3GB to the JVM live set should not scare you. You might consider it disappointing that the overhead exceeds the data in your case, but this is not about your emotional well-being; it's a question of whether it will work ;-)
Now that you've loaded up the data, you are mutating it at a rate of hundreds of inserts/deletes per second, let's say 1024. Reusing the above quantities we can sum it up as: 1024 * (24*2 + 32) = 80KB. Churning 80KB of garbage per second is small change for many applications, and not something you necessarily need to sweat about. To put it in context, a JVM these days will collect many hundreds of MB of young generation in a matter of tens of milliseconds.
So, in summary, if all you need is to load the data and query/mutate it along the lines you describe you might just find that a modern server can easily contend with a vanilla solution. I'd recommend you give that a go, maybe prototype with some representative data set, and see how it works out. If you have an issue you can always find more exotic/efficient solutions.
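If it helps, a minimal sketch of that vanilla prototype; Point3 is just a hand-rolled immutable key, and someValue stands for whatever your value objects are:

import java.util.*;

final class Point3 {                                 // immutable 3-D integer key
    final int x, y, z;
    Point3(int x, int y, int z) { this.x = x; this.y = y; this.z = z; }
    @Override public boolean equals(Object o) {
        if (!(o instanceof Point3)) return false;
        Point3 p = (Point3) o;
        return x == p.x && y == p.y && z == p.z;
    }
    @Override public int hashCode() { return 31 * (31 * x + y) + z; }
}

// usage:
Map<Point3, Object> index = new HashMap<>();
index.put(new Point3(1, 2, 3), someValue);           // someValue stands for your object
Object hit = index.get(new Point3(1, 2, 3));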
I am using OPTICSXi with an R*-tree index in ELKI to cluster a geo-dataset (latitude & longitude), Gowalla, which includes about 6 million records, but the MiniGUI always shows 'java heap space' and 'error: out of memory'.
I once saw an answer by Anony-Mousse in which 1.2 million location records were processed in 11 minutes using OPTICSXi on ELKI, so I'm confused. Why is ELKI reporting these errors?
Are there any parameters I need to modify on the Java platform or in ELKI?
This is a standard out of memory error.
You will have to add more memory, or decrease memory consumption somehow.
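For example, if you launch ELKI from the command line, something like this raises the maximum heap to 8 GB (the jar name is illustrative; use whatever bundle you downloaded):

java -Xmx8g -jar elki-bundle.jar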
You could also try the cover tree (it should need much less memory than the current R*-tree implementation). Make sure to use an appropriate, and small, value of epsilon to benefit from indexing.
Sorry if this has been asked before (though I can't really find a solution).
I'm not really too good at programming, but anyway, I am crawling a bunch of websites and storing information about them on a server. I need a Java program to process vector coordinates associated with each of the documents (about a billion or so documents, with a grand total of 500,000 numbers, plus or minus, associated with each of them). I need to calculate the singular value decomposition of that whole matrix.
Now Java, obviously, can't handle a matrix as big as that, to my knowledge. If I try making a relatively small array (about 44 million entries) then I get a heap error. I use Eclipse, and so I tried changing the -Xmx value to 1024m (it won't go any higher for some reason, even though I have a computer with 8 GB of RAM).
What solution is there to this? Another way of retrieving the data I need? Calculating the SVD in a different way? Using a different programming language to do this?
EDIT: Just for right now, pretend there are a billion entries with 3 words associated with each. I am setting the Xmx and Xms correctly (from Run Configurations in Eclipse -> this is the equivalent of running java -XmsXXXX -XmxXXXX ...... at the command prompt).
The Java heap space can be set with the -Xmx option (note the initial capital X), and it can certainly reach far more than 1 GB, provided you are using a 64-bit JVM and the corresponding physical memory is available. You should try something along the lines of:
java -Xmx6144m ...
That said, you need to reconsider your design. There is a significant space cost associated with each object, with a typical minimum somewhere around 12 to 16 bytes per object, depending on your JVM. For example, a String has an overhead of about 36-40 bytes...
Even with a single object per document with no book-keeping overhead (impossible!), you just do not have the memory for 1 billion (1,000,000,000) documents. Even for a single int per document you need about 4 GB.
You should re-design your application to make use of any sparseness in the matrix, and possibly to fall back on disk-based storage where needed. Having everything in memory is nice, but not always possible...
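For the sparseness point, one common representation is to keep only the non-zero entries of each row instead of a dense double[]; a minimal sketch:

import java.util.*;

// One sparse row: parallel arrays of column indices and values, non-zeros only.
final class SparseRow {
    final int[] cols;       // sorted column indices of the non-zero entries
    final double[] vals;    // the values at those columns
    SparseRow(int[] cols, double[] vals) { this.cols = cols; this.vals = vals; }

    double get(int col) {
        int i = Arrays.binarySearch(cols, col);
        return i >= 0 ? vals[i] : 0.0;              // everything else is zero
    }
}

Iterative/truncated SVD algorithms (Lanczos-style methods, for instance) generally only need matrix-vector products, which a structure like this supports well.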
Are you using a 32-bit JVM? These cannot have more than 2 GB of heap; I never managed to allocate more than 1.5 GB. Instead, use a 64-bit JVM, as it can allocate much more heap.
Or you could apply some math to it and use a divide-and-conquer strategy. This means splitting the problem into smaller problems that combine to the same result.
Don't know much about SVD but maybe this page can be helpful:
http://www.netlib.org/lapack/lug/node32.html
-Xms and -Xmx are different. The one containing s is the starting heap size and the one with x is the maximum heap size.
so
java -Xms512m -Xmx1024m
would give you 512 MB to start with and a 1024 MB maximum (note the m suffix: without a unit the values are read as bytes).
As other people have said, though, you may need to break your problem down to get this to work. Are you using 32-bit or 64-bit Java?
For data of that size, you should not plan to store it all in memory. The most common scheme to externalize this kind of data is to store it all in a database and structure your program around database queries.
Just for right now, pretend there are a billion entries with 3 words associated with each.
If you have one billion entries you need 1 billion times the size of each entry. If you mean 3 ints as the words, that's at least 12 GB just for the data. If you meant the words as Strings, you could enumerate the words as ints, since there are only about 100K words in English, and it would take the same amount of space.
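A small sketch of that enumeration idea: intern each distinct word to an int id, so an entry stores three ints instead of three Strings (all names below are made up):

import java.util.*;

// Assign each distinct word a small int id on first sight.
final Map<String, Integer> ids = new HashMap<>();

int idOf(String word) {
    Integer id = ids.get(word);
    if (id == null) {
        id = ids.size();                 // next unused id
        ids.put(word, id);
    }
    return id;
}

// an entry of "3 words" then becomes three ints, i.e. 12 bytes of payload:
int[] entry = { idOf("foo"), idOf("bar"), idOf("baz") };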
Given that 16 GB costs a few hundred dollars, I would suggest buying more memory.
I'm trying to apply a Kalman filter to sensor readings using Java, but the matrix manipulation library I'm using is giving me a heap space error. So, does anyone know of a matrix manipulation library for the JVM with better memory allocation characteristics?
It would seem that this one -- http://code.google.com/p/efficient-java-matrix-library/ -- is "efficient" only in name. The data set has 9424 rows by 2 columns; all values are doubles (a timestamp and one of the 3 dimensions of the sensor readings).
Many thanks, guys!
1) The Kalman filter should not require massive, non-linearly scaling amounts of memory: it only calculates estimates based on two values, the initial value and the previous value, so you should expect the amount of memory you need to be proportional to the total number of data points (a minimal sketch follows at the end of this answer). See: http://rsbweb.nih.gov/ij/plugins/kalman.html
2) Switching over to floats will halve the memory required for your calculation. That will probably be insignificant in your case; I assume that if the data set is crashing due to memory, you are either running your JVM with a very small amount of memory or you have a massive data set.
3) If you really have a large data set (> 1 GB) and halving it is important, the library you mentioned can be refactored to use only floats.
4) For a comparison of Java matrix libraries, you can check out http://code.google.com/p/java-matrix-benchmark/wiki/MemoryResults_2012_02 --- the lowest-memory-footprint libs are ojAlgo, EJML, and Colt.
I've had excellent luck with Colt for large-scale calculations, but I'm not sure which of these implement the Kalman method.
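As promised in point 1, a minimal 1-D Kalman filter sketch; the noise values are placeholders, and the point is only that the filter keeps a constant amount of state no matter how many readings stream through it:

// Minimal 1-D Kalman filter: constant state per step, no matrix ever materialized.
final class Kalman1D {
    private double x;         // current estimate
    private double p = 1.0;   // estimate variance
    private final double q;   // process noise (placeholder value below)
    private final double r;   // measurement noise (placeholder value below)

    Kalman1D(double initial, double q, double r) { this.x = initial; this.q = q; this.r = r; }

    double update(double measurement) {
        p += q;                          // predict
        double k = p / (p + r);          // Kalman gain
        x += k * (measurement - x);      // correct with the new reading
        p *= (1 - k);
        return x;
    }
}

// usage: Kalman1D f = new Kalman1D(firstReading, 1e-5, 1e-2);
//        for (double reading : readings) smoothed = f.update(reading);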