Which data structure should I use to represent this data set? - java

Suppose I have a data set as follows:
Screen ID User ID
1 24
2 50
2 80
3 23
5 50
3 60
6 64
. .
. .
. .
400,000 200,000
and I want to track the screens that each user visited. My first approach would be to create a HashMap where the keys would be the user ids and the values would be the screen ids. However, I get an OutOfMemory error when using Java. Are there efficient data structures that can handle this volume of data? There will be about 3,000,000 keys and about 1000 values per key. Would Spark (Python) be the way to go for this? The original dataset has around 300,000,000 rows and 2 columns.

Why do you want to store such a large amount of data in memory? It would be better to store it in a database and load only the data you need; a data structure in any language will consume roughly the same amount of memory.

HashMap will not work with what you're describing as the keys must be unique. Your scenario is duplicating the keys.
If you want to be more memory efficient and don't have access to a relational database or an external file, consider designing something using arrays.
The advantage of arrays is the ability to store primitives, which use less memory than objects. Collections will always implicitly convert a primitive into its wrapper type (autoboxing) when stored.
You could have your array index represent the screen id, and the value stored at the index could be another array or collection which stores the associated user ids.
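To make that concrete, here is a minimal sketch of the array-of-arrays idea, assuming screen ids are dense integers no larger than 400,000; the class and field names are illustrative, not from the answer:

import java.util.Arrays;

// Minimal sketch: index = screen id, value = primitive int[] of user ids.
// Assumes screen ids are dense integers in [1, 400_000].
public class ScreenVisits {
    private final int[][] visitors = new int[400_001][];
    private final int[] counts = new int[400_001];

    public void addVisit(int screenId, int userId) {
        int[] users = visitors[screenId];
        if (users == null) {
            users = new int[4];                              // start small
            visitors[screenId] = users;
        } else if (counts[screenId] == users.length) {
            users = Arrays.copyOf(users, users.length * 2);  // grow on demand
            visitors[screenId] = users;
        }
        users[counts[screenId]++] = userId;
    }

    public int[] visitorsOf(int screenId) {
        int[] users = visitors[screenId];
        return users == null ? new int[0] : Arrays.copyOf(users, counts[screenId]);
    }
}

Growing each int[] on demand keeps only primitives on the heap, at the cost of the occasional array copy.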

What data type are you using? Let's say you are using a
Map<Integer,Integer>
...then each entry takes 8 bytes (32-bit) or 16 bytes (64-bit). Let's calculate your memory consumption:
8 * 400000 = 3200000 bytes / 1024 = 3125 kbytes / 1024 = 3.05MB
or 6.1 MB in the case of a 64-bit data type (like Long).
In short, 3.05 MB or 6.1 MB is nothing for your hardware.
Even if we calculate with 3 million entries, we end up with a memory usage of about 22 MB (for an integer entry set). I don't think an OutOfMemory exception is caused by the data size. Check your data type or
switch to MapDB for a quick prototype (it supports off-heap memory, see below).
Handling 3,000,000,000 entries, however, is a more serious matter. We end up with a memory usage of roughly 22 GB. In that case you should consider
a data store that can handle this amount of data efficiently. I don't think a Java Map (or a vector in another language) is a good fit for such a data volume
(as Brain wrote, with this amount of data you have to increase the JVM heap space or use MapDB). Also think about your deployment: a product that needs 22 GB of memory
means high hardware costs, so the trade-off between cost and in-memory performance has to be balanced. I would go with one of the following alternatives:
Riak (Key-Value Storage, fits your data structure)
Neo4j (your data structure can be modelled as a graph; a screen can have relationships to many users and vice versa)
Or for a quick prototype consider MapDB (http://www.mapdb.org/)
For a professional, high-performance solution you can look at SAP HANA (but it's not free).
H2 (http://www.h2database.com/html/main.html) can also be a good choice. It's an in-memory SQL database.
With one of the solutions above you can also persist and query your data (without coding indexing, B-trees and the like), and that is what you want to do, I guess:
process and operate on your data. In the end, only tests can show which technology performs best for your needs.
The OutOfMemory exception has nothing to do with Java or Python. Your use case can be implemented in Java without problems.
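To illustrate the H2 suggestion above, here is a minimal sketch of persisting and querying the visits over plain JDBC; the table and column names are illustrative, and the H2 driver is assumed to be on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Minimal H2 sketch: an in-memory SQL table of (user_id, screen_id) visits.
// Use a file URL such as "jdbc:h2:./visits" to persist to disk instead.
public class H2VisitStore {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:h2:mem:visits")) {
            con.createStatement().execute(
                "CREATE TABLE visits (user_id INT, screen_id INT)");
            con.createStatement().execute(
                "CREATE INDEX idx_screen ON visits(screen_id)");

            try (PreparedStatement ins = con.prepareStatement(
                    "INSERT INTO visits VALUES (?, ?)")) {
                ins.setInt(1, 24); ins.setInt(2, 1); ins.executeUpdate();
                ins.setInt(1, 50); ins.setInt(2, 2); ins.executeUpdate();
            }

            // All users who visited screen 2
            try (PreparedStatement q = con.prepareStatement(
                    "SELECT user_id FROM visits WHERE screen_id = ?")) {
                q.setInt(1, 2);
                ResultSet rs = q.executeQuery();
                while (rs.next()) System.out.println(rs.getInt("user_id"));
            }
        }
    }
}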

Just looking at the data structure: you have a two-dimensional matrix, indexed by user-id and screen-id, where each cell holds a single boolean value, namely whether that screen was visited by that user: visited[screen-id, user-id]
In the case where each user visits almost every screen, the optimal representation would be a set of bits. This means you need 400k x 200k bits, which is roughly 10 GB. In Java I would use a BitSet and linearize the access, e.g. BitSet.get(screen-id + 400000 * user-id)
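As a rough sketch of that linearized bit-matrix idea: a single java.util.BitSet is limited to Integer.MAX_VALUE bits, so the full 400k x 200k matrix (about 80 billion bits) has to be split across several BitSet shards. The class name and shard scheme below are illustrative:

import java.util.BitSet;

// Minimal sketch of the linearized bit matrix. The ~80 billion bits are split
// into shards of 2^30 bits each, because one BitSet cannot address that range.
public class VisitedMatrix {
    private static final long SCREENS = 400_000L;
    private static final long USERS = 200_000L;
    private static final long SHARD_BITS = 1L << 30;   // ~1 billion bits per shard

    private final BitSet[] shards;

    public VisitedMatrix() {
        long totalBits = SCREENS * USERS;
        int shardCount = (int) ((totalBits + SHARD_BITS - 1) / SHARD_BITS);
        shards = new BitSet[shardCount];
        for (int i = 0; i < shardCount; i++) {
            shards[i] = new BitSet();                   // grows lazily as bits are set
        }
    }

    // Linearized index, as in the answer: screen-id + 400000 * user-id
    private long index(long screenId, long userId) {
        return screenId + SCREENS * userId;
    }

    public void setVisited(long screenId, long userId) {
        long idx = index(screenId, userId);
        shards[(int) (idx / SHARD_BITS)].set((int) (idx % SHARD_BITS));
    }

    public boolean isVisited(long screenId, long userId) {
        long idx = index(screenId, userId);
        return shards[(int) (idx / SHARD_BITS)].get((int) (idx % SHARD_BITS));
    }
}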
If each user only visits a few screens, then there are a lot of repeating false-values in the bit set. This is what is called a sparse matrix. Actually, this is a well researched problem in computer science and you will find lots of different solutions for it.
This answers your original question, but probably does not solve your problem. In the comment you stated that you want to look up the users that visited a specific screen. Now that's a different problem domain: we are shifting from efficient data representation and storage to efficient data access.
Looking up the users that visited a set of screens is essentially the same problem as looking up the documents that contain a set of words. That is a basic information retrieval problem. For this problem you need a so-called inverted index data structure. One popular library for this is Apache Lucene.
You can read in the visits and build such a data structure yourself. Essentially it is a map, keyed by the screen-id, returning the set of affected users, i.e. Map<Integer, Set<Integer>>. For the set of integers the first choice would be a HashSet, which is not very memory efficient. I recommend using a high-performance set library targeted at integer values instead, e.g. IntOpenHashSet. Even then this will probably not fit in memory; however, if you use Spark you can split your processing into slices and join the results later.
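A minimal sketch of that inverted index, using fastutil's IntOpenHashSet for the per-screen user sets (the fastutil dependency and the class/method names here are assumptions, not from the answer):

import java.util.HashMap;
import java.util.Map;

import it.unimi.dsi.fastutil.ints.IntOpenHashSet;

// Minimal sketch: screen-id -> set of user-ids, using a primitive-int set
// to avoid the per-element overhead of HashSet<Integer>.
public class InvertedVisitIndex {
    private final Map<Integer, IntOpenHashSet> usersByScreen = new HashMap<>();

    public void addVisit(int screenId, int userId) {
        usersByScreen.computeIfAbsent(screenId, k -> new IntOpenHashSet()).add(userId);
    }

    public IntOpenHashSet usersFor(int screenId) {
        return usersByScreen.getOrDefault(screenId, new IntOpenHashSet());
    }
}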

Related

What is the best way to cache large data objects into Hazelcast

We have data for around 20k merchants, with a size of around 3 MB.
If we cache this much data together, Hazelcast performance is not good.
Please note that if we cache all 20k merchants individually, then a "get all merchants" call slows down, because reading each merchant from the cache costs a lot of network time.
How should we partition this data?
What should the partition key be?
What should the maximum size per partition be?
The merchant entity has the following attributes:
Merchant id, parent merchant id, name, address, contacts, status, type
Merchant id is the unique attribute.
Please suggest.
Adding to what Mike said, it's not unusual to see Hazelcast maps with millions of entries, so I wouldn't be concerned with the number of entries.
You should structure your map(s) to fit your application's design needs. Doing a 'getAll' on a single map seems inefficient to me. It may make more sense to create multiple maps or use a complex key that allows you to be more selective about which entries are returned.
Also, you may want to look at indexes. You can index the key and/or value, which can really help with performance. Predicates you construct for selections will automatically use any defined indexes, as shown in the sketch after this answer.
I wouldn't worry about changing the partition key unless you have reason to believe the default partitioning scheme is not giving you a good distribution of keys.
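Here is a rough sketch of the index and predicate suggestion, written against the Hazelcast 3.x API (4.x+ moved IMap to com.hazelcast.map and changed addIndex to take an IndexType). The map name, the "status" attribute, and the Merchant value class (sketched under the serialization note below) are assumptions based on the question:

import java.util.Collection;

import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import com.hazelcast.query.Predicates;

// Rough sketch: index the "status" attribute and query selectively with a
// predicate instead of pulling the whole map over the network.
public class MerchantQueries {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(new Config());
        IMap<Long, Merchant> merchants = hz.getMap("merchants");

        // Unordered index on the "status" attribute of the cached values
        merchants.addIndex("status", false);

        // Selective read instead of a full getAll; predicates use the index automatically
        Collection<Merchant> active = merchants.values(Predicates.equal("status", "ACTIVE"));
        System.out.println("Active merchants: " + active.size());
    }
}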
With 20K merchants and 3MB of data per merchant, your total data is around 60GB. How many nodes are you using for your cache, and what memory size per node? Distributing the cache across a larger number of nodes should give you more effective bandwidth.
Make sure you're using an efficient serialization mechanism; the default Java serialization is very inefficient (both in terms of object size and speed to serialize and deserialize). Using something like IdentifiedDataSerializable (if Java) or Portable (if using non-Java clients) could help a lot.
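As a rough sketch of the IdentifiedDataSerializable suggestion, again against the Hazelcast 3.x API (getId() became getClassId() in 4.x+); the field selection, the id constants, and the required DataSerializableFactory registration are assumptions:

import java.io.IOException;

import com.hazelcast.nio.ObjectDataInput;
import com.hazelcast.nio.ObjectDataOutput;
import com.hazelcast.nio.serialization.IdentifiedDataSerializable;

// Rough sketch of a merchant value class using IdentifiedDataSerializable.
// A DataSerializableFactory mapping CLASS_ID to this class must be registered
// in the member/client serialization config (not shown here).
public class Merchant implements IdentifiedDataSerializable {
    public static final int FACTORY_ID = 1;
    public static final int CLASS_ID = 1;

    private long merchantId;
    private long parentMerchantId;
    private String name;
    private String status;

    @Override
    public void writeData(ObjectDataOutput out) throws IOException {
        out.writeLong(merchantId);
        out.writeLong(parentMerchantId);
        out.writeUTF(name);
        out.writeUTF(status);
    }

    @Override
    public void readData(ObjectDataInput in) throws IOException {
        merchantId = in.readLong();
        parentMerchantId = in.readLong();
        name = in.readUTF();
        status = in.readUTF();
    }

    @Override
    public int getFactoryId() {
        return FACTORY_ID;
    }

    @Override
    public int getId() {
        return CLASS_ID;
    }
}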
I would strongly recommend that you break down your object from 3 MB to a few tens of KBs, otherwise you will run into problems that are not particularly related to Hazelcast: for example, fat packets blocking other packets and causing heavy latency in read/write operations, heavy serialization/deserialization overhead, a choked network, etc. You have already identified high network time, and it is not going to go away without flattening the value object. If yours is a read-heavy use case, I also suggest looking into Near Cache for ultra-low-latency read operations.
As for partition size, keep it under 100 MB; I'd say between 50 and 100 MB per partition. Simple maths will help you:
3mb/object x 20k objects = 60GB
Default partition count = 271
Each partition size = 60,000 MB / 271 = 221MB.
So increasing the partition count to, let's say, 751 will mean:
60,000 MB / 751 = 80 MB.
So you can go with the partition count set to 751. To cater for a possible increase in future traffic, I'd set the partition count to an even higher number, such as 881.
Note: Always use a prime number for partition count.
FYI: in one of the future releases, the default partition count will be changed from 271 to 1999.
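A minimal sketch of raising the partition count as suggested, via the hazelcast.partition.count property (it can equally be set in hazelcast.xml); the class name is illustrative:

import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

// Minimal sketch: start a member with the partition count raised to the prime
// suggested above. All members of the cluster must use the same value.
public class ClusterBootstrap {
    public static void main(String[] args) {
        Config config = new Config();
        config.setProperty("hazelcast.partition.count", "751");
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        System.out.println("Started member " + hz.getName());
    }
}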

Entity prepopulation for MongoDB to avoid padding with Spring

In an application I use the concept of buckets to store objects. All buckets are empty at creation time. Some of them may fill up to their maximum capacity of 20 objects within 2 hours, some within 6 months. Each object's size is pretty much fixed, i.e. I don't expect object sizes to differ by more than 10%, so the sizes of full buckets won't differ much either. The implementation looks similar to this:
@Document
public class MyBucket {
    // maximum capacity of 20
    private List<MyObject> objects;
}
One approach to keep the padding factor low would be to prepopulate my bucket with dummy data. Two options come to my mind:
Create the bucket with dummy data, save it, then reset its content and save it again
Create the bucket with dummy data and flag it as "pristine". On the first write the flag is set to false and the dummy data is cleared.
The disadvantages are obvious: option 1 requires two DB writes, and option 2 requires extra (non-business) code in my entities.
Probably I won't get off cheaply with any solution. Nevertheless, any real-life experience with that issue, any best practices or hints?
Setup: Spring Data MongoDB 1.9.2, MongoDB 3.2
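For illustration, a rough sketch of option 2 from the question, with the pristine flag cleared on the first real write; the class, field and method names are mine, not the poster's:

import java.util.ArrayList;
import java.util.List;

import org.springframework.data.annotation.Id;
import org.springframework.data.mongodb.core.mapping.Document;

// Rough sketch of option 2: the bucket is saved pre-filled with dummy entries
// and flagged as pristine; the first real write clears the padding.
@Document
public class PrepopulatedBucket {

    @Id
    private String id;

    private boolean pristine = true;
    private List<MyObject> objects = new ArrayList<>();

    public void add(MyObject object) {
        if (pristine) {
            objects.clear();      // drop the dummy padding on the first real write
            pristine = false;
        }
        objects.add(object);
    }
}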
As far as I understand, your main concern is the performance overhead related to growing document sizes, which results in document relocation and index updates. That is relevant for the mmapv1 storage engine; however, since MongoDB 3.0 the WiredTiger storage engine is available, which does not have such issues (check the similar question).

Large 2D Array Storage in Java (Android)

I'm creating a matrix in Java, which:
Can be up to 10,000 x 10,000 elements in the worst case
May change size from time to time (assume on the order of days)
Stores an integer in the range 0-5 inclusive (presumably a byte)
Has elements accessed by referring to a pair of Long IDs (system-determined)
Is symmetrical (so can be done in half the space, if needed, although it makes things like summing the rows harder (or impossible if the array is unordered))
Doesn't necessarily need to be ordered (unless halved into a triangle, as explained above)
Needs to be persistent after the app closes (currently it's being written to file)
My current implementation is using a HashMap<Pair<Long,Long>,Integer>, which works fine on my small test matrix (10x10), but according to this article, is probably going to hit unmanageable memory usage when expanded to 10,000 x 10,000 elements.
I'm new to Java and Android and was wondering: what is the best practice for this sort of thing?
I'm thinking of switching back to a bog-standard 2D array byte[][] with a HashMap lookup table for my Long IDs. Will I take a noticeable performance hit on matrix access? Also, I take it there's no way of modifying the array size without either:
Pre-allocating for the assumed worst-case (which may not even be the worst case, and would take an unnecessary amount of memory)
Copying the array into a new array if a size change is required (momentarily doubling my memory usage)
Thought I'd answer this for posterity. I've gone with Fildor's suggestion of using an SQL database with two look-up columns to represent the row and column indices of my "matrix". The value is stored in a third column.
The main benefit of this approach is that the entire matrix doesn't need to be loaded into RAM in order to read or update elements, with the added benefit of access to summing functions (and any other features inherently in SQL databases). It's a particularly easy method on Android, because of the built-in SQL functionality.
One performance drawback is that the initialisation of the matrix is extraordinarily slow. However, the approach I've taken is to assume that if an entry isn't found in the database, it takes a default value. This eliminates the need to populate the entire matrix (and is especially useful for sparse matrices), but has the downside of not throwing an error if trying to access an invalid index. It is recommended that this approach is coupled with a pair of lists that list the valid rows and columns, and these lists are referenced before attempting to access the database. If you're trying to sum rows using the built-in SQL features, this will also not work correctly if your default is non-zero, although this can be remedied by returning the number of entries found in the row/column being summed, and multiplying the "missing" elements by the default value.
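A minimal sketch of that SQLite-backed approach on Android; the table and column names, the default value, and the helper class are illustrative rather than the poster's actual code:

import android.content.ContentValues;
import android.content.Context;
import android.database.Cursor;
import android.database.sqlite.SQLiteDatabase;
import android.database.sqlite.SQLiteOpenHelper;

// Minimal sketch: the "matrix" lives in a (row_id, col_id, value) table and
// missing cells fall back to a default value, so the full matrix never has to
// be populated or loaded into RAM.
public class MatrixDbHelper extends SQLiteOpenHelper {
    private static final int DEFAULT_VALUE = 0;

    public MatrixDbHelper(Context context) {
        super(context, "matrix.db", null, 1);
    }

    @Override
    public void onCreate(SQLiteDatabase db) {
        db.execSQL("CREATE TABLE matrix (row_id INTEGER, col_id INTEGER, value INTEGER, "
                + "PRIMARY KEY (row_id, col_id))");
    }

    @Override
    public void onUpgrade(SQLiteDatabase db, int oldVersion, int newVersion) {
        db.execSQL("DROP TABLE IF EXISTS matrix");
        onCreate(db);
    }

    public void put(long row, long col, int value) {
        ContentValues cv = new ContentValues();
        cv.put("row_id", row);
        cv.put("col_id", col);
        cv.put("value", value);
        getWritableDatabase().insertWithOnConflict("matrix", null, cv,
                SQLiteDatabase.CONFLICT_REPLACE);
    }

    public int get(long row, long col) {
        Cursor c = getReadableDatabase().rawQuery(
                "SELECT value FROM matrix WHERE row_id = ? AND col_id = ?",
                new String[] { String.valueOf(row), String.valueOf(col) });
        try {
            return c.moveToFirst() ? c.getInt(0) : DEFAULT_VALUE;
        } finally {
            c.close();
        }
    }
}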

How to efficiently process strings in Java

I am facing an optimization problem in Java. I have to process a table which has 5 attributes and contains about 5 million records. To simplify the problem, let's say I have to read every record one by one and then process each record. From each record I have to generate a mathematical lattice structure which has 500 nodes. In other words, each record generates 500 more new records, which can be referred to as parents of the original record. So in total there are 500 x 5 million records, including the original plus parent records. Now the job is to find the number of distinct records out of all 500 x 5 million records, along with their frequencies. Currently I have solved this problem as follows: I convert every record to a string, with the value of each attribute separated by "-", and I count them in a Java HashMap. Since these records involve intermediate processing, a record is converted to a string and then back to a record during the intermediate steps. The code is tested, works fine and produces accurate results for a small number of records, but it cannot process 500 x 5 million records.
For a large number of records it produces the following error:
java.lang.OutOfMemoryError: GC overhead limit exceeded
I understand that the number of distinct records is definitely not more than 50 thousand, which means that the data should not cause memory or heap overflow. Can anyone suggest any options? I will be very thankful.
Most likely, you have some data structure somewhere which is keeping references to the processed records, also known as a "memory leak". It sounds like you intend to process each record in turn and then throw away all the intermediate data, but in fact the intermediate data is being kept around. The garbage collector can't throw away this data if you have some collection or something else still pointing to it.
Note also that there is the very important Java runtime parameter "-Xmx". Without any further detail than what you've provided, I would have thought that 50,000 records would fit easily within the default value, but maybe not. Try doubling -Xmx (hopefully your computer has enough RAM). If this solves the problem, great. If it just gets you twice as far before it fails, then you know it's an algorithm problem.
An SQLite database can be used to store that much (1.3 TB?) data. With queries you can quickly find information again, and the data is also persisted when your program ends.
You probably need to adopt a different approach to calculating the frequencies of occurrence. Brute force is great when you only have a few million :)
For instance, after your calculation of the 'lattice structure' you could combine that with the original data and take either the MD5 or SHA-1 hash. This should be unique except when the data is not "distinct", which should then reduce your total data back down to below 5 million.
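As a rough sketch of that hashing idea, the frequency map can be keyed by a SHA-1 digest of each generated record string instead of the string itself, so only the roughly 50,000 distinct keys stay in memory; class and method names here are illustrative:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Minimal sketch: count frequencies keyed by the SHA-1 hex digest of the
// "-"-joined record string, instead of keeping the full strings as keys.
public class FrequencyCounter {
    private final Map<String, Long> counts = new HashMap<>();
    private final MessageDigest sha1;

    public FrequencyCounter() throws NoSuchAlgorithmException {
        sha1 = MessageDigest.getInstance("SHA-1");
    }

    public void add(String record) {
        byte[] digest = sha1.digest(record.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        counts.merge(hex.toString(), 1L, Long::sum);
    }

    public Map<String, Long> frequencies() {
        return counts;
    }
}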

Looking for a purely disk based key-value cache for a large dataset in Java

I currently have a Postgres database with a simple schema of a fixed-length key (20 bytes) and a fixed-length value (40 bytes). It's a massive table with billions of rows, but unfortunately we have lots of duplicated data. We'd like to separate this table into its own data store.
Ideally, I'm looking for ways to store this data on a large hard drive where it can be queried on occasion. Performance is not critical for reads; disk access is fast enough, so there is no need to store anything in memory. There is rarely new data added after the initial load.
If there is no product available I would be willing to roll my own with suggestions. I originally thought of using the key as a folder path based on the byte /0/32/231/32/value but obviously that results in too many files/folders on a single disk. Is there an optimization that can be used since both keys and values are fixed length?
Any suggestions?
Try some pure Java database engines like MapDb or LevelDB.
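For example, a rough sketch of a file-backed store with MapDB 3.x, using fixed-length byte[] keys and values as described in the question (the class name and tuning options are assumptions; check the MapDB documentation for details):

import org.mapdb.BTreeMap;
import org.mapdb.DB;
import org.mapdb.DBMaker;
import org.mapdb.Serializer;

// Rough sketch: a disk-backed B-tree map of 20-byte keys to 40-byte values,
// persisted in a single file and memory-mapped where the platform supports it.
public class DiskStore implements AutoCloseable {
    private final DB db;
    private final BTreeMap<byte[], byte[]> map;

    public DiskStore(String path) {
        db = DBMaker.fileDB(path).fileMmapEnableIfSupported().make();
        map = db.treeMap("kv", Serializer.BYTE_ARRAY, Serializer.BYTE_ARRAY).createOrOpen();
    }

    public void put(byte[] key20, byte[] value40) {
        map.put(key20, value40);
    }

    public byte[] get(byte[] key20) {
        return map.get(key20);
    }

    @Override
    public void close() {
        db.close();
    }
}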
