What is the best way to cache large data objects into Hazelcast

We have data for around 20k merchants, about 3 MB in size.
If we cache this much data together, Hazelcast does not perform well.
Please note that if we cache all 20k merchants individually, the "get all merchants" call slows down, because reading each merchant from the cache costs a lot of network time.
How should we partition this data?
What should the partition key be?
What should the maximum size per partition be?
The Merchant entity has the following attributes:
Merchant id, parent merchant id, name, address, contacts, status, type
Merchant id is the unique attribute.
Please suggest.

Adding to what Mike said, it's not unusual to see Hazelcast maps with millions of entries, so I wouldn't be concerned with the number of entries.
You should structure your map(s) to fit your application's design needs. Doing a 'getAll' on a single map seems inefficient to me. It may make more sense to create multiple maps or use a complex key that allows you to be more selective with the entries returned.
Also, you may want to look at indexes. You can index the key and/or value which can really help with performance. Predicates you construct for selections will automatically use any defined indexes.

I wouldn't worry about changing partition key unless you have reason to believe the default partitioning scheme is not giving you a good distribution of keys.
With 20K merchants and 3MB of data per merchant, your total data is around 60GB. How many nodes are you using for your cache, and what memory size per node? Distributing the cache across a larger number of nodes should give you more effective bandwidth.
Make sure you're using an efficient serialization mechanism. The default Java serialization is very inefficient (both in terms of object size and of the speed of serializing and deserializing); using something like IdentifiedDataSerializable (if Java) or Portable (if using non-Java clients) could help a lot.

I would strongly recommend that you break down your object from 3MB to a few tens of KB; otherwise you will run into problems that are not specific to Hazelcast: fat packets blocking other packets and causing heavy latency in read/write operations, heavy serialization/deserialization overhead, a choked network, and so on. You have already identified high network time, and it is not going to go away without flattening the value object. If yours is a read-heavy use case, I also suggest looking into Near Cache for very low latency read operations.
As for partition size, keep it under 100MB; I'd say between 50 and 100MB per partition. Simple maths will help you:
3mb/object x 20k objects = 60GB
Default partition count = 271
Each partition size = 60,000 MB / 271 = 221MB.
So increasing the partition count to, let's say, 751 will mean:
60,000 MB / 751 = 80MB.
So you can go with the partition count set to 751. To cater for a possible increase in future traffic, I'd set the partition count to an even higher number, such as 881.
Note: Always use a prime number for the partition count.
FYI: in one of the future releases, the default partition count will be changed from 271 to 1999.
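The sizing arithmetic above can be sketched in a few lines of Java. This is only an illustration: the class and method names are made up for this example, and the 80 MB target is simply the figure used in the calculation above. The resulting count is the value you would then set through Hazelcast's hazelcast.partition.count property.

```java
final class PartitionSizing {

    /** Trial-division primality check; fast enough for numbers this small. */
    public static boolean isPrime(int n) {
        if (n < 2) {
            return false;
        }
        for (int d = 2; (long) d * d <= n; d++) {
            if (n % d == 0) {
                return false;
            }
        }
        return true;
    }

    /** Smallest prime partition count that keeps each partition at or below targetMb. */
    public static int smallestPrimeCountFor(long totalMb, long targetMb) {
        // Ceiling division gives the minimum number of partitions for the target size.
        int count = (int) Math.max(2, (totalMb + targetMb - 1) / targetMb);
        while (!isPrime(count)) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        long totalMb = 20_000L * 3; // 20k merchants x 3 MB each = 60,000 MB
        System.out.printf("271 partitions -> %.1f MB each%n", totalMb / 271.0);
        int count = smallestPrimeCountFor(totalMb, 80);
        System.out.printf("%d partitions -> %.1f MB each%n", count, (double) totalMb / count);
    }
}
```

Picking a larger target headroom (a smaller targetMb) yields a larger prime, which matches the advice to leave room for growth.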

Which data structure should I use to represent this data set?

Suppose I have a data set as follows:
Screen ID User ID
1 24
2 50
2 80
3 23
5 50
3 60
6 64
. .
. .
. .
400,000 200,000
and I want to track the screens that each user visited. My first approach would be to create a HashMap where the keys would be the user ids, and the values would be the screen ids. However, I get an OutOfMemoryError when using Java. Are there efficient data structures that can handle this volume of data? There will be about 3,000,000 keys and for each key about 1000 values. Would Spark (Python) be the way to go for this? The original dataset has around 300,000,000 rows and 2 columns.

Why do you want to store such a large amount of data in memory? It would be better to store it in a database and load only the data you need, since any data structure in any language will consume a similar amount of memory.

HashMap will not work with what you're describing as the keys must be unique. Your scenario is duplicating the keys.
If you want to be more memory efficient and don't have access to a relational database or an external file, consider designing something using arrays.
The advantage of arrays is the ability to store primitives, which use less memory than objects. Collections will always implicitly convert a primitive into its wrapper type when stored.
You could have your array index represent the screen id, and the value stored at the index could be another array or collection which stores the associated user ids.

Which data type are you using? Let's say you are using a
Map<Integer,Integer>
Then each entry holds 8 bytes of raw data (32-bit values) or 16 bytes (64-bit values), not counting the per-object overhead that a Java Map adds on top. Let's calculate your memory consumption:
8 * 400000 = 3200000 bytes / 1024 = 3125 kbytes / 1024 = 3.05MB
or 6.1MB in the case of a 64-bit data type (like Long).
In short, 3.05 MB or 6.1 MB is nothing for your hardware.
Even if we calculate with 3 million entries, we end up with a memory usage of about 22 MB (for an Integer entry set). I don't think the OutOfMemoryError is caused by the data size alone. Check your data type, or switch to MapDB for a quick prototype (it supports off-heap memory, see below).
Handling 3,000,000,000 entries, however, is more serious: we end up with a memory usage of roughly 22 GB. In this case you should consider a data store that can handle this amount of data efficiently. I don't think a Java Map (or a vector in another language) is a good fit for such an amount of data (as Brain wrote, with this amount of data you have to increase the JVM heap space or use MapDB). Also think about your deployment: your product will need 22 GB of memory, which means high hardware costs. The cost then has to be balanced against in-memory performance. I would go with one of the following alternatives:
Riak (key-value storage, fits your data structure)
Neo4J (your data structure can be handled as a graph; in this case a screen can have multiple relationships to users and vice versa)
Or for a quick prototype consider MapDB (http://www.mapdb.org/)
For a professional, high-performance solution, you can look at SAP HANA (but it's not free)
H2 (http://www.h2database.com/html/main.html) can also be a good choice. It's an SQL in-memory database.
With one of the solutions above, you can also persist and query your data (without coding indexing, B-trees and so on). And this, I guess, is what you want to do: process and operate on your data. In the end, only tests can show which technology has the best performance for your needs.
The OutOfMemoryError has nothing to do with Java or Python. Your use case can be implemented in Java with no problems.

Just looking at the data structure: you have a two-dimensional matrix indexed by user-id and screen-id containing a single boolean value, whether the screen was visited by that user or not: visited[screen-id, user-id]
If each user visits almost every screen, the optimal representation would be a set of bits. This means you need 400k x 200k bits, which is roughly 10G bytes. In Java I would use a BitSet and linearize the access, e.g. BitSet.get(screen-id + 400000 * user-id). Note that java.util.BitSet is indexed by int, so a single BitSet cannot address all 80 billion bits; in practice the matrix has to be split across several BitSets, for example one per user.
If each user only visits a few screens, then there are a lot of repeating false-values in the bit set. This is what is called a sparse matrix. Actually, this is a well researched problem in computer science and you will find lots of different solutions for it.
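As a rough sketch of the bit-matrix idea, the following keeps one BitSet per user instead of a single linearized BitSet, which avoids the int index limit and leaves rows for users with no visits unallocated. The class name VisitMatrix and the assumption that ids are used directly as zero-based indexes are mine, not part of the question.

```java
import java.util.BitSet;

final class VisitMatrix {
    private final BitSet[] screensByUser; // one row of bits per user, allocated lazily
    private final int screenCount;

    VisitMatrix(int userCount, int screenCount) {
        this.screensByUser = new BitSet[userCount];
        this.screenCount = screenCount;
    }

    /** Records that the given user visited the given screen. */
    void markVisited(int userId, int screenId) {
        checkScreen(screenId);
        if (screensByUser[userId] == null) {
            screensByUser[userId] = new BitSet(); // grows on demand as bits are set
        }
        screensByUser[userId].set(screenId);
    }

    /** Returns true if the user visited the screen. */
    boolean hasVisited(int userId, int screenId) {
        checkScreen(screenId);
        BitSet row = screensByUser[userId];
        return row != null && row.get(screenId);
    }

    /** Number of distinct screens the user visited. */
    int visitedCount(int userId) {
        BitSet row = screensByUser[userId];
        return row == null ? 0 : row.cardinality();
    }

    private void checkScreen(int screenId) {
        if (screenId < 0 || screenId >= screenCount) {
            throw new IndexOutOfBoundsException("screenId out of range: " + screenId);
        }
    }
}
```

For a sparse matrix this is still wasteful per visited screen with a high id, which is where the sparse-matrix and inverted-index approaches come in.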
This answers your original question, but probably does not solve your problem. In the comment you stated that you want to look up the users that visited a specific screen. Now, that's a different problem domain: we are shifting from efficient data representation and storage to efficient data access.
Looking up the users that visited a set of screens is essentially the same problem as looking up the documents that contain a set of words. That is a basic information retrieval problem. For this problem, you need a so-called inverted index data structure. One popular library for this is Apache Lucene.
You can read in the visits and build a data structure yourself. Essentially it is a map, addressed by the screen-id, returning a set of the affected users, which is: Map<Integer, Set<Integer>>. For the set of integers the first choice would be a HashSet, which is not very memory efficient. I recommend using a high-performance set library targeted at integer values instead, e.g. IntOpenHashSet. This will probably still not fit in memory; however, if you use Spark you can split your processing into slices and join the processing results later.
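A minimal sketch of such a hand-built inverted index, using only java.util collections, might look like the following. The class name ScreenIndex and its method names are invented for illustration, and plain HashSet is used for brevity even though a primitive int set would be more memory efficient, as noted above.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class ScreenIndex {
    // screen id -> ids of the users who visited that screen
    private final Map<Integer, Set<Integer>> usersByScreen = new HashMap<>();

    /** Adds one (screen, user) row from the data set. */
    void addVisit(int screenId, int userId) {
        usersByScreen.computeIfAbsent(screenId, k -> new HashSet<>()).add(userId);
    }

    /** Users who visited the given screen (empty if nobody did). */
    Set<Integer> usersOf(int screenId) {
        return usersByScreen.getOrDefault(screenId, Collections.emptySet());
    }

    /** Users who visited every screen in the list (set intersection). */
    Set<Integer> usersOfAll(List<Integer> screenIds) {
        if (screenIds.isEmpty()) {
            return new HashSet<>();
        }
        Set<Integer> result = new HashSet<>(usersOf(screenIds.get(0)));
        for (int i = 1; i < screenIds.size(); i++) {
            result.retainAll(usersOf(screenIds.get(i)));
        }
        return result;
    }
}
```

With the sample rows from the question, screen 2 maps to users 50 and 80, and intersecting screens 2 and 5 leaves only user 50.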

How to retrieve huge (>2000) amount of entities from GAE datastore in under 1 second?

Part of our application needs to load a large set of data (>2000 entities) and perform a computation on this set. The size of each entity is approximately 5 KB.
In our initial, naïve implementation, the bottleneck seems to be the time required to load all the entities (~40 seconds for 2000 entities), while the time required to perform the computation itself is very small (<1 second).
We have tried several strategies to speed up the entity retrieval:
Splitting the retrieval request into several parallel instances and then merging the result: ~20 seconds for 2000 entities.
Storing the entities at an in-memory cache placed on a resident backend: ~5 seconds for 2000 entities.
The computation needs to be dynamically computed, so doing a precomputation at write time and storing the result does not work in our case.
We are hoping to be able to retrieve ~2000 entities in just under one second. Is this within the capability of GAE/J? Any other strategies that we might be able to implement for this kind of retrieval?
UPDATE: Supplying additional information about our use case and parallelization result:
We have more than 200.000 entities of the same kind in the datastore and the operation is retrieval-only.
We experimented with 10 parallel worker instances, and a typical result that we obtained could be seen in this pastebin. It seems that the serialization and deserialization required when transferring the entities back to the master instance hampers the performance.
UPDATE 2: Giving an example of what we are trying to do:
Let's say that we have a StockDerivative entity that needs to be analyzed to determine whether it's a good investment or not.
The analysis performed requires complex computations based on many factors both external (e.g. user's preference, market condition) and internal (i.e. from the entity's properties), and would output a single "investment score" value.
The user could request the derivatives to be sorted based on its investment score and ask to be presented with N-number of highest-scored derivatives.

200.000 by 5kb is 1GB. You could keep all this in memory on the largest backend instance or have multiple instances. This would be the fastest solution - nothing beats memory.
Do you need the whole 5kb of each entity for computation?
Do you need all 200k entities when querying before computation? Do queries touch all entities?
Also, check out BigQuery. It might suit your needs.

Use Memcache. I cannot guarantee that it will be sufficient, but if it isn't you probably have to move to another platform.

This is very interesting, but yes, it's possible and I've seen some mind-boggling results.
I would have done the same: the map-reduce concept.
It would be great if you could provide more metrics: how many parallel instances do you use, and what are the results of each instance?
Also, does your process include retrieval alone, or retrieval and storing?
How many elements do you have in your data store? 4000? 10000? The reason I ask is that you could cache them from the previous request.
regards

In the end, it does not appear that we could retrieve >2000 entities from a single instance in under one second, so we are forced to use in-memory caching placed on our backend instance, as described in the original question. If someone comes up with a better answer, or if we found a better strategy/implementation for this problem, I would change or update the accepted answer.

Our solution involves periodically reading entities in a background task and storing the result in a json blob. That way we can quickly return more than 100k rows. All filtering and sorting is done in javascript using SlickGrid's DataView model.
As someone has already commented, MapReduce is the way to go on GAE. Unfortunately the Java library for MapReduce is broken for me so we're using non optimal task to do all the reading but we're planning to get MapReduce going in the near future (and/or the Pipeline API).
Mind that, last time I checked, the Blobstore wasn't returning gzipped entities > 1MB so at the moment we're loading the content from a compressed entity and expanding it into memory, that way the final payload gets gzipped. I don't like that, it introduces latency, I hope they fix issues with GZIP soon!

SOLR performance tuning

I've read the following:
http://wiki.apache.org/solr/SolrPerformanceFactors
http://wiki.apache.org/solr/SolrCaching
http://www.lucidimagination.com/content/scaling-lucene-and-solr
And I have questions about a few things:
1. If I use the JVM option -XX:+UseCompressedStrings what kind of memory savings can I achieve? To keep a simple example, if I have 1 indexed field (string) and 1 stored field (string) with omitNorms=true and omitTf=true, what kind of savings in the index and document cache can I expect? I'm guessing about 50%, but maybe that's too optimistic.
2. What exactly is the Solr filter cache doing? If I'm just doing a simple query with AND and a few ORs, and sorting by score, do I even need it?
3. If I want to cache all documents in the document cache, how would I compute the space required? Using the example from above, if I have 20M documents, use compressed strings, and the average length of the stored field is 25 characters, is the space required basically (25 bytes + small_admin_overhead) * 20M?
4. If all documents are in the document cache, how important is the query cache?
5. If I want to autowarm every document into the doc cache, will an autowarm query of *:* do it?
6. The scaling-lucene-and-solr article says FuzzyQuery is slow. If I'm using the spellcheck feature of Solr then I'm basically using fuzzy query, right (because spellcheck does the same edit distance calculation)? So presumably spellcheck and fuzzy query are both equally "slow"?
7. The section describing the Lucene field cache for strings is a bit confusing. Am I reading it correctly that the space required is basically the size of the indexed string field + an integer array equal to the number of unique terms in that field?
8. Finally, under maximizing throughput, there is a statement about leaving enough space for the OS disk cache. It says, "All in all, for a large scale index, it's best to be sure you have at least a few gigabytes of RAM beyond what you are giving to the JVM.". So if I have a 12GB memory machine (as an example), I should give at least 2-3GB to the OS? Can I estimate the disk cache space needed by the OS by looking at the on-disk index size?

1. The only way to be sure is to try it out. However, I would expect very little savings in the index, as the index would only contain each actual string once; the rest is data for the locations of that string within documents. The strings aren't a large part of the index.
2. The filter cache only caches filter queries. It may not be useful for your precise use case, but many do find it useful, for example for narrowing results by country, language, product type, etc. Solr can avoid recalculating the query results for things like this if you use them frequently.
3. Realistically, you just have to try it and measure it with a profiler. Without in-depth knowledge of EXACTLY the data structure used, anything else is pure SWAG. Your calculation is just as good as anyone else's without profiling.
4. The document cache only saves time in constituting the results AFTER the query has been calculated. If you spend most of your time calculating queries, the document cache will do you little good. The query cache is only useful for re-used queries. If none of your queries are repeated, then the query cache is useless.
5. Yes, assuming your document cache is large enough to hold them all.
6-8. Not positive.
From my own experience with Solr performance tuning, you should leave Solr to deal with queries, not document storage. The majority of your questions focus on how documents take up space. Solr is a search engine, not a document storage repository. If you want Solr to be FAST and take up minimal memory, then the only thing it should hold onto is index information for searching purposes. The documents themselves should be stored, retrieved, and rendered elsewhere, preferably in a system that is optimized specifically for that job. The only field you should store in your Solr document is an ID for retrieval from the document storage system.

Caches
In general, caching looks like a good idea to improve performance, but this also has a lot of issues:
cached objects are likely to go into the old generation of the garbage collector, which is more costly to collect,
managing insertions and evictions adds some overhead.
Moreover, caching is unlikely to improve your search latency much unless there are patterns in your queries. On the contrary, if 20% of your traffic is due to a few queries, then the query results cache may be interesting. Configuring caches requires you to know your queries and your documents very well. If you don't, you should probably disable caching.
Even if you disable all caches, performance could still be pretty good thanks to the OS I/O cache. Practically, this means that if you read the same portion of a file again and again, it is likely that it will be read from disk only the first time, and then from the I/O cache. And disabling all caches allows you to give less memory to the JVM, so that there will be more memory for the I/O cache. If your system has 12GB of memory and if you give 2GB to the JVM, this means that the I/O cache might be able to cache up to 10G of your index (depending on other applications running which require memory too).
I recommend you read these to get more information on application-level cache vs. I/O cache:
https://www.varnish-cache.org/trac/wiki/ArchitectNotes
http://antirez.com/post/what-is-wrong-with-2006-programming.html
Field cache
The size of the field cache for a string is (one array of integers of length maxDoc) + (one array for all unique string instances). So if you have an index with one string field which has N instances of size S on average, and if your index has M documents, then the size of the field cache for this field will be approximately M * 4 + N * S.
The field cache is mainly used for facets and sorting. Even very short strings (less than 10 chars) take more than 40 bytes, which means that you should expect Solr to require a lot of memory if you sort or facet on a String field which has a high number of unique values.
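To get a feel for the numbers, the approximation M * 4 + N * S can be evaluated directly. The snippet below is only a back-of-the-envelope helper with made-up names; the 20M document count comes from the question, while the 5M unique values and 40 bytes per string are assumed inputs, not measurements.

```java
final class FieldCacheEstimate {

    /**
     * Approximate field cache size in bytes for one string field:
     * one 4-byte int per document plus the storage for each unique string value.
     */
    public static long estimateBytes(long maxDoc, long uniqueValues, long avgBytesPerValue) {
        return maxDoc * 4 + uniqueValues * avgBytesPerValue;
    }

    public static void main(String[] args) {
        long bytes = estimateBytes(20_000_000L, 5_000_000L, 40L);
        System.out.printf("approx. %.0f MB%n", bytes / (1024.0 * 1024.0));
    }
}
```

As with the other estimates in this thread, only profiling against the real index will confirm the actual figure.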
Fuzzy Query
FuzzyQuery is slow in Lucene 3.x, but much faster in Lucene 4.x.
It depends on the Spellchecker implementation you choose, but I think that the Solr 3.x spell checker uses N-Grams to find candidates (this is why it needs a dedicated index) and then only computes distances on this set of candidates, so the performance is still reasonably good.

JDBC/Hibernate Fetch Size and memory issues

After investigating a bit at work I noticed that the application I'm working on is using the default fetch size (which is 10 for Oracle from what I know). The problem is that in the majority of cases the users fetch large amount of data (ranging from few thousand to even hundreds of thousands) and that the default 10 is really a huge bottleneck.
So the obvious conclusion here would be to make the fetch size larger. At first I was thinking about setting the default to 100 and bumping it to a 1000 for several queries. But then I read on the net that the default is so small to prevent memory issues (i.e. when the JVM heap cannot handle so much data), should I be worried about it?
I haven't seen anywhere further explanation to this. Does it mean that a bigger fetch sizes means more overhead while fetching the result set? Or do they just mean that with the default I can fetch 10 records and then GC them and fetch another 10 and so on (whereas lets say fetching a 10000 all at once would result in an OutOfMemory exception)? In such case I wouldn't really care as I need all the records in the memory anyway. In the former case (where bigger result set means bigger memory overhead) I guess I should load test it first.

By setting the fetch size too big, you risk an OutOfMemoryError.
The fact that you need all these records anyway probably doesn't justify it. More likely, you need the entities reflected by the returned ResultSets... Setting the fetch size to 10000 means you're keeping 10000 records, represented by JDBC classes, on the heap. Of course, you don't pass these around through your application. You first transform them into your favorite business-logic entities and then hand them to your business-logic executor. This way, the records from the first fetch batch are available for GC as soon as JDBC fetches the next batch.
Typically, this transformation is done a small batch at a time, precisely because of the memory threat mentioned above.
You're absolutely right about one thing, though: you should test for performance with well-defined requirements before tweaking.

"So the obvious conclusion here would be to make the fetch size larger."
Perhaps an equally obvious conclusion should be: "Let's see if we can cut down on the number of objects that users bring back." When Google returns results, it does so in batches of 25 or 50 sorted by greatest likelihood to be considered useful by you. If your users are bringing back thousands of objects, perhaps you need to think about how to cut down on that. Can the database do more of the work? Are there other operations that could be written to eliminate some of those objects? Could the objects themselves be smarter?

Distributed sequence number generation?

I've generally implemented sequence number generation using database sequences in the past.
e.g. Using Postgres SERIAL type http://www.neilconway.org/docs/sequences/
I'm curious though as how to generate sequence numbers for large distributed systems where there is no database. Does anybody have any experience or suggestions of a best practice for achieving sequence number generation in a thread safe manner for multiple clients?

OK, this is a very old question, which I'm first seeing now.
You'll need to differentiate between sequence numbers and unique IDs that are (optionally) loosely sortable by a specific criterion (typically generation time). True sequence numbers imply knowledge of what all other workers have done, and as such require shared state. There is no easy way of doing this in a distributed, high-scale manner. You could look into things like network broadcasts, windowed ranges for each worker, and distributed hash tables for unique worker IDs, but it's a lot of work.
Unique IDs are another matter: there are several good ways of generating unique IDs in a decentralized manner:
a) You could use Twitter's Snowflake ID network service. Snowflake is a:
Networked service, i.e. you make a network call to get a unique ID;
which produces 64 bit unique IDs that are ordered by generation time;
and the service is highly scalable and (potentially) highly available; each instance can generate many thousand IDs per second, and you can run multiple instances on your LAN/WAN;
written in Scala, runs on the JVM.
b) You could generate the unique IDs on the clients themselves, using an approach derived from how UUIDs and Snowflake's IDs are made. There are multiple options, but something along the lines of:
The most significant 40 or so bits: A timestamp; the generation time of the ID. (We're using the most significant bits for the timestamp to make IDs sort-able by generation time.)
The next 14 or so bits: A per-generator counter, which each generator increments by one for each new ID generated. This ensures that IDs generated at the same moment (same timestamps) do not overlap.
The last 10 or so bits: A unique value for each generator. Using this, we don't need to do any synchronization between generators (which is extremely hard), as all generators produce non-overlapping IDs because of this value.
c) You could generate the IDs on the clients, using just a timestamp and random value. This avoids the need to know all generators, and assign each generator a unique value. On the flip side, such IDs are not guaranteed to be globally unique, they're only very highly likely to be unique. (To collide, one or more generators would have to create the same random value at the exact same time.) Something along the lines of:
The most significant 32 bits: Timestamp, the generation time of the ID.
The least significant 32 bits: 32-bits of randomness, generated anew for each ID.
d) The easy way out, use UUIDs / GUIDs.
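Option (b) above could be sketched roughly as follows, using the 40/14/10 bit split suggested there. Everything else is an assumption made for illustration: the class name, the custom epoch constant, and the choice to reuse the last timestamp if the clock steps backwards. Note that a Java long is signed, so IDs only sort correctly as plain longs while the timestamp stays below 2^39 milliseconds (about 17 years after the chosen epoch); beyond that Long.compareUnsigned would be needed.

```java
final class LocalIdGenerator {
    private static final int COUNTER_BITS = 14;
    private static final int GENERATOR_BITS = 10;
    private static final long MAX_COUNTER = (1L << COUNTER_BITS) - 1;
    private static final long MAX_GENERATOR = (1L << GENERATOR_BITS) - 1;
    // Assumed custom epoch (September 2020) so that the timestamp fits in 40 bits.
    private static final long CUSTOM_EPOCH_MILLIS = 1_600_000_000_000L;

    private final long generatorId;
    private long lastTimestamp = -1;
    private long counter = 0;

    LocalIdGenerator(long generatorId) {
        if (generatorId < 0 || generatorId > MAX_GENERATOR) {
            throw new IllegalArgumentException("generatorId must fit in " + GENERATOR_BITS + " bits");
        }
        this.generatorId = generatorId;
    }

    /** Packs timestamp (high bits), counter (middle bits) and generator id (low bits). */
    public static long compose(long timestamp, long counter, long generatorId) {
        return (timestamp << (COUNTER_BITS + GENERATOR_BITS))
                | (counter << GENERATOR_BITS)
                | generatorId;
    }

    private static long currentTimestamp() {
        return System.currentTimeMillis() - CUSTOM_EPOCH_MILLIS;
    }

    /** Returns the next ID; IDs from one generator are strictly increasing. */
    public synchronized long nextId() {
        long now = currentTimestamp();
        if (now < lastTimestamp) {
            now = lastTimestamp; // clock stepped backwards: keep using the last timestamp
        }
        if (now == lastTimestamp) {
            counter++;
            if (counter > MAX_COUNTER) {
                // Counter exhausted for this millisecond: wait for the clock to advance.
                do {
                    now = currentTimestamp();
                } while (now <= lastTimestamp);
                counter = 0;
            }
        } else {
            counter = 0;
        }
        lastTimestamp = now;
        return compose(now, counter, generatorId);
    }
}
```

Each process would be constructed with its own generatorId (0 to 1023); how those ids are assigned is exactly the coordination step that option (c) trades away for a small collision risk.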

You could have each node have a unique ID (which you may have anyway) and then prepend that to the sequence number.
For example, node 1 generates sequence 001-00001 001-00002 001-00003 etc. and node 5 generates 005-00001 005-00002
Unique :-)
Alternately if you want some sort of a centralized system, you could consider having your sequence server give out in blocks. This reduces the overhead significantly. For example, instead of requesting a new ID from the central server for each ID that must be assigned, you request IDs in blocks of 10,000 from the central server and then only have to do another network request when you run out.
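The block-allocation idea can be sketched like this. The central server is represented by a function that reserves a block and returns its first ID; in the usage note an in-process AtomicLong stands in for that server purely for demonstration, and the class name BlockSequence is hypothetical.

```java
import java.util.function.LongUnaryOperator;

final class BlockSequence {
    // Given a block size, reserves that many IDs centrally and returns the first one.
    private final LongUnaryOperator reserveBlock;
    private final long blockSize;
    private long next;   // next ID to hand out locally
    private long limit;  // exclusive end of the current block

    BlockSequence(LongUnaryOperator reserveBlock, long blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        this.reserveBlock = reserveBlock;
        this.blockSize = blockSize;
    }

    /** Hands out IDs locally and only contacts the central allocator when a block runs out. */
    public synchronized long next() {
        if (next == limit) {
            next = reserveBlock.applyAsLong(blockSize); // the only "network" call
            limit = next + blockSize;
        }
        return next++;
    }
}
```

For a quick local experiment, something like `new BlockSequence(new java.util.concurrent.atomic.AtomicLong()::getAndAdd, 10_000)` simulates the server. IDs are unique across nodes but not globally ordered, and any IDs left in a block when a node stops are simply skipped.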

Now there are more options.
Though this question is "old", I got here, so I think it might be useful to leave the options I know of (so far):
You could try Hazelcast. In its 1.9 release it includes a distributed implementation of java.util.concurrent.AtomicLong
You can also use Zookeeper. It provides methods for creating sequence nodes (appended to znode names, though I prefer using version numbers of the nodes). Be careful with this one though: if you don't want missed numbers in your sequence, it may not be what you want.
Cheers

It can be done with Redisson. It implements a distributed and scalable version of AtomicLong. Here is an example:
Config config = new Config();
config.addAddress("some.server.com:8291");
Redisson redisson = Redisson.create(config);
RAtomicLong atomicLong = redisson.getAtomicLong("anyAtomicLong");
atomicLong.incrementAndGet();

If it really has to be globally sequential, and not simply unique, then I would consider creating a single, simple service for dispensing these numbers.
Distributed systems rely on lots of little services interacting, and for this simple kind of task, do you really need or would you really benefit from some other complex, distributed solution?

There are a few strategies, but none that I know of can be truly distributed and give a real sequence.
Have a central number generator. It doesn't have to be a big database. memcached has a fast atomic counter; in the vast majority of cases it's fast enough for your entire cluster.
Separate an integer range for each node (like Steven Schlanskter's answer).
Use random numbers or UUIDs.
Use some piece of data, together with the node's ID, and hash it all (or HMAC it).
Personally, I'd lean towards UUIDs, or memcached if I want to have a mostly contiguous space.

Why not use a (thread safe) UUID generator?
I should probably expand on this.
UUIDs are guaranteed to be globally unique (if you avoid the ones based on random numbers, where the uniqueness is just highly probable).
Your "distributed" requirement is met, regardless of how many UUID generators you use, by the global uniqueness of each UUID.
Your "thread safe" requirement can be met by choosing "thread safe" UUID generators.
Your "sequence number" requirement is assumed to be met by the guaranteed global uniqueness of each UUID.
Note that many database sequence number implementations (e.g. Oracle) do not guarantee either monotonically increasing, or (even) increasing sequence numbers (on a per "connection" basis). This is because a consecutive batch of sequence numbers gets allocated in "cached" blocks on a per connection basis. This guarantees global uniqueness and maintains adequate speed. But the sequence numbers actually allocated (over time) can be jumbled when there are being allocated by multiple connections!

Distributed ID generation can be achieved with Redis and Lua. The implementation is available on GitHub. It produces distributed, k-sortable unique ids.

I know this is an old question, but we were facing the same need and were unable to find a solution that fulfilled it.
Our requirement was to get a unique sequence (0,1,2,3...n) of ids, and hence Snowflake did not help.
We created our own system to generate the ids using Redis. Redis is single-threaded, hence its list/queue mechanism always gives us one pop at a time.
What we do is create a buffer of ids. Initially, the queue holds ids 0 to 20, ready to be dispatched when requested. Multiple clients can request an id and Redis will pop one id at a time. After every pop from the left, we insert BUFFER + currentId on the right, which keeps the buffer list going. Implementation here
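The rotating buffer can be modelled in plain Java to show the mechanics. This is not the linked implementation and it does not talk to Redis: an in-process ArrayDeque stands in for the Redis list, the synchronized method stands in for Redis executing commands one at a time, and the class name is invented.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class RotatingIdBuffer {
    private final Deque<Long> queue = new ArrayDeque<>();
    private final long bufferSize;

    RotatingIdBuffer(int bufferSize) {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        this.bufferSize = bufferSize;
        for (long id = 0; id < bufferSize; id++) {
            queue.addLast(id); // pre-fill the buffer with 0 .. bufferSize - 1
        }
    }

    /** Pops the next id from the left and appends (id + bufferSize) on the right. */
    public synchronized long nextId() {
        long id = queue.removeFirst();
        queue.addLast(id + bufferSize);
        return id;
    }
}
```

Because every popped id is replaced by id + bufferSize, the queue always holds the next bufferSize ids in order, so callers receive 0, 1, 2, ... without gaps as long as the pop and the push are applied together.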

I have written a simple service which can generate semi-unique, non-sequential 64-bit long numbers. It can be deployed on multiple machines for redundancy and scalability. It uses ZeroMQ for messaging. For more information on how it works, look at the GitHub page: zUID

Using a database you can reach 1.000+ increments per second with a single core. It is pretty easy. You can use its own database as backend to generate that number (as it should be its own aggregate, in DDD terms).
I had what seems a similar problem. I had several partitions and I wanted to get an offset counter for each one. I implemented something like this:
CREATE DATABASE example;
USE example;
CREATE TABLE offsets (`partition` INTEGER, `offset` BIGINT, PRIMARY KEY (`partition`));
INSERT INTO offsets VALUES (1,0);
Then I executed the following statements inside a single transaction:
SELECT @offset := `offset` FROM offsets WHERE `partition`=1 FOR UPDATE;
UPDATE offsets SET `offset`=@offset+1 WHERE `partition`=1;
If your application allows it, you can allocate a whole block at once (that was my case):
SELECT @offset := `offset` FROM offsets WHERE `partition`=1 FOR UPDATE;
UPDATE offsets SET `offset`=@offset+100 WHERE `partition`=1;
If you need further throughput and cannot allocate offsets in advance, you can implement your own service using Flink for real-time processing. I was able to get around 100K increments per partition.
Hope it helps!

The problem is similar to:
In the iSCSI world, each LUN/volume has to be uniquely identifiable by the initiators running on the client side.
The iSCSI standard says that the first few bits have to represent the storage provider/manufacturer information, and the rest increase monotonically.
Similarly, one can use the initial bits in the distributed system of nodes to represent the nodeID and the rest can be monotonically increasing.

One solution that is decent is to use a long time based generation.
It can be done with the backing of a distributed database.

My two cents for gcloud. Using storage file.
Implemented as cloud function, can easily be converted to a library.
https://github.com/zaky/sequential-counter
