Entity prepopulation for MongoDB to avoid padding with Spring - java

In an application I use the concept of buckets to store objects. All buckets are empty at creation time. Some may fill up to their maximum capacity of 20 objects within 2 hours, others over 6 months. Each object's size is pretty much fixed; I don't expect sizes to differ by more than 10%, so the sizes of full buckets won't either. The implementation looks similar to this:
@Document
public class MyBucket {
    // maximum capacity of 20
    private List<MyObject> objects;
}
One approach to keep the padding factor low would be to prepopulate my bucket with dummy data. Two options come to my mind:
Create the bucket with dummy data, save it, then reset its content and save it again
Create the bucket with dummy data and flag it as "pristine". On the first write, the flag is set to false and the data are reset.
The disadvantages are obvious: option 1 requires two DB writes; option 2 requires extra (non-business) code in my entities.
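For what it's worth, a minimal sketch of what option 2 could look like. The MyObject.dummy() factory and the clear-on-first-write behaviour are illustrative assumptions, not a finished design:

import java.util.ArrayList;
import java.util.List;
import org.springframework.data.mongodb.core.mapping.Document;

@Document
public class MyBucket {

    private static final int MAX_CAPACITY = 20;

    // True until the first real write; the dummy payload reserves space on disk.
    private boolean pristine = true;

    private List<MyObject> objects = createDummies();

    public void add(MyObject object) {
        if (pristine) {
            objects.clear();           // drop the dummy payload on the first real write
            pristine = false;
        }
        if (objects.size() >= MAX_CAPACITY) {
            throw new IllegalStateException("bucket is full");
        }
        objects.add(object);
    }

    private static List<MyObject> createDummies() {
        List<MyObject> dummies = new ArrayList<>(MAX_CAPACITY);
        for (int i = 0; i < MAX_CAPACITY; i++) {
            dummies.add(MyObject.dummy()); // hypothetical factory returning a typically-sized object
        }
        return dummies;
    }
}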
I probably won't get off cheaply with either solution. Nevertheless, is there any real-life experience with this issue, any best practices or hints?
Setup: Spring Data MongoDB 1.9.2, MongoDB 3.2

As far as I understand, your main concern is the performance overhead of growing documents, which leads to document relocation and index updates. That is an issue for the MMAPv1 storage engine; however, since MongoDB 3.0 the WiredTiger storage engine is available, which does not have this problem (check the similar question).

What is the best way to cache large data objects in Hazelcast

We have around 20k merchants' data, with a size of around 3 MB.
If we cache this much data together, Hazelcast performance is not good.
Note that if we cache all 20k merchants individually, a "get all merchants" call slows down, because reading each merchant from the cache separately costs a lot of network time.
How should we partition this data?
What should the partition key be?
What should the maximum size per partition be?
The merchant entity has the following attributes:
merchant id, parent merchant id, name, address, contacts, status, type
Merchant id is the unique attribute.
Please advise.
Adding to what Mike said, it's not unusual to see Hazelcast maps with millions of entries, so I wouldn't be concerned with the number of entries.
You should structure your map(s) to fit your application's design needs. Doing a 'getAll' on a single map seems inefficient to me. It may make more sense to create multiple maps or use a complex key that allows you to be more selective about the entries returned.
Also, you may want to look at indexes. You can index the key and/or value which can really help with performance. Predicates you construct for selections will automatically use any defined indexes.
I wouldn't worry about changing the partition key unless you have reason to believe the default partitioning scheme is not giving you a good distribution of keys.
With 20K merchants and 3MB of data per merchant, your total data is around 60GB. How many nodes are you using for your cache, and what memory size per node? Distributing the cache across a larger number of nodes should give you more effective bandwidth.
Make sure you're using an efficient serialization mechanism; the default Java serialization is very inefficient (both in terms of object size and the speed of serializing and deserializing). Using something like IdentifiedDataSerializable (if Java) or Portable (if using non-Java clients) could help a lot.
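For illustration, a minimal IdentifiedDataSerializable sketch for the merchant entity. This assumes the Hazelcast 4.x serialization API; the factory/class ids and the trimmed-down attribute set are made up for the example:

import com.hazelcast.nio.ObjectDataInput;
import com.hazelcast.nio.ObjectDataOutput;
import com.hazelcast.nio.serialization.IdentifiedDataSerializable;
import java.io.IOException;

public class Merchant implements IdentifiedDataSerializable {

    // Hypothetical ids; they only need to be unique within your DataSerializableFactory.
    public static final int FACTORY_ID = 1;
    public static final int CLASS_ID = 1;

    private long merchantId;
    private long parentMerchantId;
    private String name;
    private String status;

    @Override
    public int getFactoryId() {
        return FACTORY_ID;
    }

    @Override
    public int getClassId() {
        return CLASS_ID;
    }

    @Override
    public void writeData(ObjectDataOutput out) throws IOException {
        out.writeLong(merchantId);
        out.writeLong(parentMerchantId);
        out.writeUTF(name);
        out.writeUTF(status);
    }

    @Override
    public void readData(ObjectDataInput in) throws IOException {
        merchantId = in.readLong();
        parentMerchantId = in.readLong();
        name = in.readUTF();
        status = in.readUTF();
    }
}

The class still needs to be registered with a DataSerializableFactory (via SerializationConfig.addDataSerializableFactory) on every member and client.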
I would strongly recommend that you break your objects down from 3 MB to a few tens of KBs, otherwise you will run into problems that are not particularly related to Hazelcast: fat packets blocking other packets (resulting in heavy latency on read/write operations), heavy serialization/deserialization overhead, a choked network, etc. You have already identified high network time, and it is not going to go away without flattening the value object. If yours is a read-heavy use case, I also suggest looking into NearCache for ultra-low-latency read operations.
As for partition size, keep it under 100 MB; I'd say between 50 and 100 MB per partition. Simple math helps:
3 MB/object × 20,000 objects = 60,000 MB ≈ 60 GB
Default partition count = 271
Size per partition = 60,000 MB / 271 ≈ 221 MB
So increasing the partition count to, let's say, 751 gives:
60,000 MB / 751 ≈ 80 MB
So you can go with a partition count of 751. To cater for a possible increase in future traffic, I'd set the partition count to an even higher number, say 881.
Note: Always use a prime number for partition count.
Fyi - in one of the future releases, the default partition count will be changed from 271 to 1999.
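For reference, a minimal sketch of setting the partition count programmatically (it can equally be set in hazelcast.xml); the value 751 is the one worked out above:

import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class ClusterBootstrap {
    public static void main(String[] args) {
        Config config = new Config();
        // Must be identical on every member; prime numbers give a better key spread.
        config.setProperty("hazelcast.partition.count", "751");
        HazelcastInstance instance = Hazelcast.newHazelcastInstance(config);
    }
}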

Which data structure should I use to represent this data set?

Suppose I have a data set as follows:
Screen ID    User ID
1            24
2            50
2            80
3            23
5            50
3            60
6            64
...          ...
400,000      200,000
and I want to track the screens that each user visited. My first approach would be to create a HashMap where the keys would be the user ids and the values would be the screen ids. However, I get an OutOfMemoryError in Java. Are there efficient data structures that can handle this volume of data? There will be about 3,000,000 keys, and for each key about 1,000 values. Would Spark (Python) be the way to go for this? The original dataset has around 300,000,000 rows and 2 columns.
Why do you want to store such a large amount of data in memory? It would be better to store it in a database and load only the data you need, since any in-memory data structure in any language will consume roughly the same amount of memory.
A HashMap will not work for what you're describing, as keys must be unique and your scenario has duplicate keys.
If you want to be more memory efficient and don't have access to a relational database or an external file, consider designing something using arrays.
The advantage of arrays is the ability to store primitives which use less data than objects. Collections will always implicitly convert a primitive into its wrapper type when stored.
You could have your array index represent the screen id, and the value stored at the index could be another array or collection which stores the associated user ids.
What data type are you using? Let's say you are using a
Map<Integer,Integer>
Then each entry takes 8 bytes (32-bit) or 16 bytes (64-bit). Let's calculate your memory consumption:
8 bytes × 400,000 = 3,200,000 bytes ≈ 3,125 KB ≈ 3.05 MB
or 6.1 MB in the case of a 64-bit data type (like Long).
In short: 3.05 MB or 6 MB is nothing for your hardware.
Even if we calculate with 3 million entries, we end up with a memory usage of about 22 MB (in the case of an Integer entry set). I don't think an OutOfMemory exception is caused by the data size. Check your data type, or switch to MapDB for a quick prototype (it supports off-heap memory, see below).
Yes, handling 3,000,000,000 entries is getting more serious. We end up with a memory usage of roughly 22 GB. In this case you should consider a data store that can handle this amount of data efficiently. I don't think a Java Map (or a vector in another language) is a good fit for such an amount of data (as Brain wrote, with this amount of data you have to increase the JVM heap space or use MapDB). Also think about your deployment: if your product needs 22 GB of memory, that means high hardware costs. Then the question of cost versus in-memory performance has to be balanced. I would go with one of the following alternatives:
Riak (key-value store, fits your data structure)
Neo4j (your data structure can be handled as a graph; in this case a screen can have multiple relationships to users and vice versa)
Or, for a quick prototype, consider MapDB (http://www.mapdb.org/)
For a professional, high-performance solution, you can look at SAP HANA (but it's not free)
H2 (http://www.h2database.com/html/main.html) can also be a good choice; it's an in-memory SQL database.
With one of the solutions above, you can also persist and query your data (without hand-coding indexing, B-trees and the like). And that is what you want to do, I guess: process and operate on your data. In the end, only tests can show which technology has the best performance for your needs.
The OutOfMemory exception has nothing to do with Java or Python. Your use case can be implemented in Java with no problems.
Just looking at the data structure: you have a two-dimensional matrix, indexed by user-id and screen-id, containing a single boolean value per cell, namely whether that screen was visited by that user: visited[screen-id, user-id]
In the case where each user visits almost every screen, the optimal representation is a set of bits. This means you need 400k × 200k bits, which is roughly 10 GB. In Java I would use a BitSet and linearize the access, e.g. BitSet.get(screenId + 400000 * userId)
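A hedged sketch of that dense bitmap idea follows. Note that java.util.BitSet is indexed by int, so a single linearized BitSet tops out at roughly 2.1 billion bits; for 400k × 200k bits you would need one BitSet per screen (or per user) rather than one linearized BitSet:

import java.util.BitSet;

public class VisitMatrix {

    private static final int NUM_USERS = 200_000;

    // One bit per user, one BitSet per screen: visitedByScreen[screenId].get(userId)
    private final BitSet[] visitedByScreen;

    public VisitMatrix(int numScreens) {
        visitedByScreen = new BitSet[numScreens];
        for (int i = 0; i < numScreens; i++) {
            visitedByScreen[i] = new BitSet(NUM_USERS);
        }
    }

    public void markVisited(int screenId, int userId) {
        visitedByScreen[screenId].set(userId);
    }

    public boolean wasVisited(int screenId, int userId) {
        return visitedByScreen[screenId].get(userId);
    }
}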
If each user only visits a few screens, then there are a lot of repeated false values in the bit set. This is what is called a sparse matrix. Actually, this is a well-researched problem in computer science and you will find lots of different solutions for it.
This answers your original question, but probably does not solve your problem. In the comments you stated that you want to look up the users that visited a specific screen. That is a different problem domain; we are shifting from efficient data representation and storage to efficient data access.
Looking up the users that visited a set of screens is essentially the same problem as looking up the documents that contain a set of words. That is a basic information retrieval problem. For this problem you need a so-called inverted index data structure. One popular library for this is Apache Lucene.
You can read in the visits and build the data structure yourself. Essentially it is a map, addressed by the screen-id, returning the set of affected users, i.e. Map<Integer, Set<Integer>>. For the set of integers the first choice would be a HashSet, which is not very memory efficient. I recommend using a high-performance set library targeted at int values instead, e.g. IntOpenHashSet. Even then it will probably not fit in memory; however, if you use Spark you can split your processing into slices and join the results later.
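A minimal sketch of that inverted index, assuming fastutil is on the classpath for the primitive int set; the class and method names are illustrative, not from the question:

import it.unimi.dsi.fastutil.ints.IntOpenHashSet;
import it.unimi.dsi.fastutil.ints.IntSet;
import java.util.HashMap;
import java.util.Map;

public class ScreenVisitIndex {

    // screen-id -> set of user-ids that visited it
    private final Map<Integer, IntSet> usersByScreen = new HashMap<>();

    public void addVisit(int screenId, int userId) {
        usersByScreen.computeIfAbsent(screenId, id -> new IntOpenHashSet()).add(userId);
    }

    public IntSet usersWhoVisited(int screenId) {
        return usersByScreen.getOrDefault(screenId, new IntOpenHashSet());
    }
}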

How to retrieve a huge number (>2000) of entities from the GAE datastore in under 1 second?

We have a part of our application that needs to load a large set of data (>2000 entities) and perform computations on this set. The size of each entity is approximately 5 KB.
In our initial, naïve implementation, the bottleneck seems to be the time required to load all the entities (~40 seconds for 2000 entities), while the time required to perform the computation itself is very small (<1 second).
We had tried several strategies to speed up the entities retrieval:
Splitting the retrieval request into several parallel instances and then merging the result: ~20 seconds for 2000 entities.
Storing the entities in an in-memory cache placed on a resident backend: ~5 seconds for 2000 entities.
The computation needs to be dynamically computed, so doing a precomputation at write time and storing the result does not work in our case.
We are hoping to be able to retrieve ~2000 entities in just under one second. Is this within the capability of GAE/J? Any other strategies that we might be able to implement for this kind of retrieval?
UPDATE: Supplying additional information about our use case and parallelization result:
We have more than 200,000 entities of the same kind in the datastore, and the operation is retrieval-only.
We experimented with 10 parallel worker instances, and a typical result that we obtained could be seen in this pastebin. It seems that the serialization and deserialization required when transferring the entities back to the master instance hampers the performance.
UPDATE 2: Giving an example of what we are trying to do:
Let's say that we have a StockDerivative entity that needs to be analyzed to know whether it's a good investment or not.
The analysis performed requires complex computations based on many factors both external (e.g. user's preference, market condition) and internal (i.e. from the entity's properties), and would output a single "investment score" value.
The user can request the derivatives to be sorted by their investment score and ask to be presented with the N highest-scored derivatives.
200,000 × 5 KB is about 1 GB. You could keep all of this in memory on the largest backend instance, or spread it over multiple instances. This would be the fastest solution; nothing beats memory.
Do you need the whole 5kb of each entity for computation?
Do you need all 200k entities when querying before computation? Do queries touch all entities?
Also, check out BigQuery. It might suit your needs.
Use Memcache. I cannot guarantee that it will be sufficient, but if it isn't you probably have to move to another platform.
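As a hedged sketch of what per-entity Memcache batching could look like: the key scheme, the loadFromDatastore() helper and the placeholder StockDerivative class are assumptions, and batching per entity keeps each cached value well below Memcache's ~1 MB per-entry limit:

import com.google.appengine.api.memcache.MemcacheService;
import com.google.appengine.api.memcache.MemcacheServiceFactory;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DerivativeCache {

    // Placeholder for the entity kind from the question.
    public static class StockDerivative implements Serializable {}

    private final MemcacheService memcache = MemcacheServiceFactory.getMemcacheService();

    public List<StockDerivative> getAll(List<String> ids) {
        List<String> keys = new ArrayList<>();
        for (String id : ids) {
            keys.add("derivative:" + id);
        }

        // One batch round-trip instead of one RPC per entity.
        Map<String, Object> hits = memcache.getAll(keys);

        List<StockDerivative> result = new ArrayList<>();
        Map<String, StockDerivative> misses = new HashMap<>();
        for (String id : ids) {
            StockDerivative d = (StockDerivative) hits.get("derivative:" + id);
            if (d == null) {
                d = loadFromDatastore(id);          // hypothetical single-entity load
                misses.put("derivative:" + id, d);
            }
            result.add(d);
        }
        if (!misses.isEmpty()) {
            memcache.putAll(misses);                // backfill the cache in one call
        }
        return result;
    }

    private StockDerivative loadFromDatastore(String id) {
        throw new UnsupportedOperationException("not part of this sketch");
    }
}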
This is very interesting, but yes, it's possible, and I've seen some mind-boggling results.
I would have done the same; the map-reduce concept.
It would be great if you could provide us with more metrics: how many parallel instances do you use, and what are the results of each instance?
Also, does your process include retrieval alone, or retrieval and storing?
How many elements do you have in your datastore? 4,000? 10,000? The reason I ask is that you could cache them from the previous request.
Regards
In the end, it does not appear that we can retrieve >2000 entities from a single instance in under one second, so we are forced to use in-memory caching placed on our backend instance, as described in the original question. If someone comes up with a better answer, or if we find a better strategy/implementation for this problem, I will change or update the accepted answer.
Our solution involves periodically reading entities in a background task and storing the result in a JSON blob. That way we can quickly return more than 100k rows. All filtering and sorting is done in JavaScript using SlickGrid's DataView model.
As someone has already commented, MapReduce is the way to go on GAE. Unfortunately the Java library for MapReduce is broken for me, so we're using a non-optimal task to do all the reading, but we're planning to get MapReduce going in the near future (and/or the Pipeline API).
Mind that, last time I checked, the Blobstore wasn't returning gzipped entities > 1 MB, so at the moment we're loading the content from a compressed entity and expanding it into memory; that way the final payload gets gzipped. I don't like that, as it introduces latency, and I hope they fix the issues with GZIP soon!

SOLR performance tuning

I've read the following:
http://wiki.apache.org/solr/SolrPerformanceFactors
http://wiki.apache.org/solr/SolrCaching
http://www.lucidimagination.com/content/scaling-lucene-and-solr
And I have questions about a few things:
1. If I use the JVM option -XX:+UseCompressedStrings, what kind of memory savings can I achieve? To keep the example simple, if I have 1 indexed field (string) and 1 stored field (string) with omitNorms=true and omitTf=true, what kind of savings in the index and document cache can I expect? I'm guessing about 50%, but maybe that's too optimistic.
2. What exactly does the Solr filter cache do? If I'm just doing a simple query with AND and a few ORs, and sorting by score, do I even need it?
3. If I want to cache all documents in the document cache, how would I compute the space required? Using the example from above, if I have 20M documents, use compressed strings, and the average length of the stored field is 25 characters, is the space required basically (25 bytes + small_admin_overhead) * 20M?
4. If all documents are in the document cache, how important is the query cache?
5. If I want to autowarm every document into the document cache, will an autowarm query of *:* do it?
6. The scaling-lucene-and-solr article says FuzzyQuery is slow. If I'm using the spellcheck feature of Solr, then I'm basically using fuzzy queries, right (because spellcheck does the same edit-distance calculation)? So presumably spellcheck and fuzzy query are both equally "slow"?
7. The section describing the Lucene field cache for strings is a bit confusing. Am I reading it correctly that the space required is basically the size of the indexed string field plus an integer array whose length equals the number of unique terms in that field?
8. Finally, under maximizing throughput, there is a statement about leaving enough space for the OS disk cache. It says, "All in all, for a large scale index, it's best to be sure you have at least a few gigabytes of RAM beyond what you are giving to the JVM." So if I have a 12 GB machine (as an example), should I give at least 2-3 GB to the OS? Can I estimate the disk cache space needed by the OS by looking at the on-disk index size?
1. The only way to be sure is to try it out. However, I would expect very little savings in the index, as the index only contains each actual string once; the rest is data about the locations of that string within documents. The strings aren't a large part of the index.
2. The filter cache only caches filter queries. It may not be useful for your precise use case, but many do find it useful, for example for narrowing results by country, language, product type, etc. Solr can avoid recalculating the query results for things like this if you use them frequently. (A SolrJ sketch follows after this list.)
3. Realistically, you just have to try it and measure it with a profiler. Without in-depth knowledge of EXACTLY the data structures used, anything else is pure SWAG. Your calculation is just as good as anyone else's without profiling.
4. The document cache only saves time in constituting the results AFTER the query has been computed. If you spend most of your time computing queries, the document cache will do you little good. The query cache is only useful for re-used queries; if none of your queries are repeated, the query cache is useless.
5. Yes, assuming your document cache is large enough to hold them all.
6-8. Not positive.
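To make item 2 concrete, a small hedged SolrJ sketch of moving the reusable narrowing clauses into filter queries (fq), which is what the filter cache actually caches; the field names are invented for illustration:

import org.apache.solr.client.solrj.SolrQuery;

public class FilterQueryExample {
    public static SolrQuery build() {
        // The main query is scored and sorted by relevance as usual.
        SolrQuery query = new SolrQuery("title:(solr AND (tuning OR caching))");
        // Each fq clause is cached in the filter cache and reused by any later
        // query that repeats it, independently of the main query.
        query.addFilterQuery("country:US");
        query.addFilterQuery("language:en");
        return query;
    }
}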
From my own experience with Solr performance tuning, you should let Solr deal with queries, not document storage. The majority of your questions focus on how documents take up space. Solr is a search engine, not a document storage repository. If you want Solr to be FAST and take up minimal memory, the only thing it should hold onto is the index information needed for searching. The documents themselves should be stored, retrieved, and rendered elsewhere, preferably in a system that is optimized specifically for that job. The only field you should store in your Solr document is an ID for retrieval from the document storage system.
Caches
In general, caching looks like a good idea to improve performance, but this also has a lot of issues:
cached objects are likely to go into the old generation of the garbage collector, which is more costly to collect,
managing insertions and evictions adds some overhead.
Moreover, caching is unlikely to improve your search latency much unless there are patterns in your queries. Conversely, if 20% of your traffic is due to a few queries, then the query results cache may be interesting. Configuring caches requires you to know your queries and your documents very well. If you don't, you should probably disable caching.
Even if you disable all caches, performance could still be pretty good thanks to the OS I/O cache. In practice, this means that if you read the same portion of a file again and again, it will likely be read from disk only the first time, and from the I/O cache afterwards. Disabling all caches also allows you to give less memory to the JVM, so that there is more memory for the I/O cache. If your system has 12 GB of memory and you give 2 GB to the JVM, the I/O cache may be able to cache up to 10 GB of your index (depending on other applications that also need memory).
I recommend you read these to get more information on application-level caches vs. the I/O cache:
https://www.varnish-cache.org/trac/wiki/ArchitectNotes
http://antirez.com/post/what-is-wrong-with-2006-programming.html
Field cache
The size of the field cache for a string is (one array of integers of length maxDoc) + (one array for all unique string instances). So if you have an index with one string field which has N instances of size S on average, and if your index has M documents, then the size of the field cache for this field will be approximately M * 4 + N * S.
The field cache is mainly used for facets and sorting. Even very short strings (fewer than 10 chars) take more than 40 bytes each, so you should expect Solr to require a lot of memory if you sort or facet on a String field with a high number of unique values.
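As a rough worked example using the formula above and the question's 20M documents: assuming 1M unique values of ~25 chars each (an assumption, not a figure from the question), and taking a 25-char String as roughly 90 bytes with object overhead, the field cache for that one field would be about 20,000,000 × 4 B + 1,000,000 × 90 B ≈ 170 MB.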
Fuzzy Query
FuzzyQuery is slow in Lucene 3.x, but much faster in Lucene 4.x.
It depends on the spellchecker implementation you choose, but I think that the Solr 3.x spell checker uses n-grams to find candidates (this is why it needs a dedicated index) and then only computes distances on this set of candidates, so the performance is still reasonably good.

Algorithm to store Item-to-Item-Associations

I need some help to store some data efficiently. I have a large list of objects (about 100,000) and want to store associations between these items with a coefficient. Not all items are associated; in fact I have about 1 million associations. I need fast access to these associations when looking them up by the two items. What I have is a structure like this:
Map<Item, Map<Item, Float>>
I tried this with HashMap and Hashtable. Both work fine and are fast enough. My problem is that all these maps create a lot of memory overhead; concretely, more than 300 MB for the given scenario. Is there a Map implementation with a smaller footprint? Is there maybe a better algorithm to store this kind of data?
Here are some ideas:
Store in a Map<Pair<Item,Item>,Float>. If you are worried about allocating a new Pair for each lookup, and your code is synchronized, you can keep a single lookup Pair instance.
Loosen the outer map to be Map<Item, ?>. The value can be a simple {Item,Float} tuple for the first association, a small tuple array for a small number of associations, then promote to a full fledged Map.
Use Commons Collections' Flat3Map for the inner maps.
If you are in tight control of the Items, and Item equivalence is referential (i.e. each Item instance is not equal to any other Item instance), then you can number each instance. Since you are talking about fewer than 2 billion instances, a single long can represent an Item pair with some bit manipulation. Then the map gets much smaller if you use Trove's TLongObjectHashMap.
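A hedged sketch of that packed-long idea, using a primitive map so neither keys nor values are boxed. This uses Trove's TLongFloatHashMap rather than TLongObjectHashMap, since the value is just a float, and assumes the int id assignment has already happened elsewhere:

import gnu.trove.map.hash.TLongFloatHashMap;

public class AssociationStore {

    // Packed (itemIdA << 32 | itemIdB) -> coefficient, no boxing anywhere.
    private final TLongFloatHashMap coefficients = new TLongFloatHashMap();

    private static long key(int itemIdA, int itemIdB) {
        return (((long) itemIdA) << 32) | (itemIdB & 0xFFFFFFFFL);
    }

    public void put(int itemIdA, int itemIdB, float coefficient) {
        coefficients.put(key(itemIdA, itemIdB), coefficient);
    }

    public float get(int itemIdA, int itemIdB) {
        // Returns the map's no-entry value (0.0f by default) when the pair is absent.
        return coefficients.get(key(itemIdA, itemIdB));
    }
}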
You have two options.
1) Reduce what you're storing.
If your data is calculable, using a WeakHashMap will allow the garbage collector to remove members. You will probably want to decorate it with a mechanism that calculates lost or absent key/value pairs on the fly. This is basically a cache.
Another possibility that might trim a relatively tiny amount of RAM is to instruct your JVM to use compressed object pointers. That may save you about 3 MB with your current data size.
2) Expand your capacity.
I'm not sure what your constraint is (run-time memory on a desktop, serialization, etc.), but you can either expand the heap size and deal with it, or you can push it out of process. With all those "NoSQL" stores out there, one will probably fit your needs. Or an indexed DB table can be quite fast. If you're looking for a simple key-value store, Voldemort is extremely easy to set up and integrate.
However, I don't know what you're doing with your working set. Can you give more details? Are you performing aggregations, partitioning, cluster analysis, etc.? Where are you running into trouble?
