I'm creating a matrix in Java, which:
Can be up to 10,000 x 10,000 elements in the worst case
May change size from time to time (assume on the order of days)
Stores an integer in the range 0-5 inclusive (presumably a byte)
Has elements accessed by referring to a pair of Long IDs (system-determined)
Is symmetrical (so can be done in half the space, if needed, although it makes things like summing the rows harder (or impossible if the array is unordered))
Doesn't necessarily need to be ordered (unless halved into a triangle, as explained above)
Needs to be persistent after the app closes (currently it's being written to file)
My current implementation is using a HashMap<Pair<Long,Long>,Integer>, which works fine on my small test matrix (10x10), but according to this article, is probably going to hit unmanageable memory usage when expanded to 10,000 x 10,000 elements.
I'm new to Java and Android and was wondering: what is the best practice for this sort of thing?
I'm thinking of switching back to a bog-standard 2D array byte[][] with a HashMap lookup table for my Long IDs (sketched below). Will I take a noticeable performance hit on matrix access? Also, I take it there's no way of modifying the array size without either:
Pre-allocating for the assumed worst-case (which may not even be the worst case, and would take an unnecessary amount of memory)
Copying the array into a new array if a size change is required (momentarily doubling my memory usage)
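For reference, a rough sketch of the byte[][] plus lookup-table idea (class and method names are placeholders, not actual code; the resize-by-copying drawback shows up in ensureCapacity):

    import java.util.HashMap;
    import java.util.Map;

    // Rough sketch: a dense symmetric byte[][] holding values 0-5, with a HashMap
    // translating the system-assigned Long IDs to array indices.
    class ByteMatrix {
        private final Map<Long, Integer> idToIndex = new HashMap<Long, Integer>();
        private byte[][] values;

        ByteMatrix(int initialCapacity) {
            values = new byte[initialCapacity][initialCapacity];
        }

        // Returns the row/column index for an ID, registering it if unseen.
        private int indexOf(long id) {
            Integer index = idToIndex.get(id);
            if (index == null) {
                index = idToIndex.size();
                idToIndex.put(id, index);
                ensureCapacity(index + 1);
            }
            return index;
        }

        // Growing means copying into a larger array (the drawback noted above).
        private void ensureCapacity(int needed) {
            if (needed <= values.length) {
                return;
            }
            byte[][] bigger = new byte[needed * 2][needed * 2];
            for (int row = 0; row < values.length; row++) {
                System.arraycopy(values[row], 0, bigger[row], 0, values[row].length);
            }
            values = bigger;
        }

        byte get(long idA, long idB) {
            return values[indexOf(idA)][indexOf(idB)];
        }

        void set(long idA, long idB, byte value) {
            int i = indexOf(idA);
            int j = indexOf(idB);
            values[i][j] = value;
            values[j][i] = value; // keep the matrix symmetric
        }
    }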
Thought I'd answer this for posterity. I've gone with Fildor's suggestion of using an SQL database with two look-up columns to represent the row and column indices of my "matrix". The value is stored in a third column.
The main benefit of this approach is that the entire matrix doesn't need to be loaded into RAM in order to read or update elements, with the added benefit of access to summing functions (and any other features inherently in SQL databases). It's a particularly easy method on Android, because of the built-in SQL functionality.
One performance drawback is that initialising the full matrix is extraordinarily slow. However, the approach I've taken is to assume that any entry not found in the database takes a default value. This eliminates the need to populate the entire matrix (and is especially useful for sparse matrices), but has the downside of not throwing an error when an invalid index is accessed. I'd recommend coupling this approach with a pair of lists of the valid rows and columns, and checking those lists before querying the database. Summing rows with the built-in SQL functions will also not work correctly if your default is non-zero, although this can be remedied by also returning the number of entries found in the row/column being summed, and multiplying the "missing" elements by the default value.
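A minimal sketch of this approach using Android's built-in SQLiteDatabase (the table, column and class names are placeholders, not the actual implementation):

    import android.database.Cursor;
    import android.database.sqlite.SQLiteDatabase;

    // Sketch: one database row per non-default matrix element, keyed by the two Long IDs.
    class MatrixStore {
        static final int DEFAULT_VALUE = 0; // assumed default for missing entries

        private final SQLiteDatabase db;

        MatrixStore(SQLiteDatabase db) {
            this.db = db;
            db.execSQL("CREATE TABLE IF NOT EXISTS matrix ("
                    + "row_id INTEGER NOT NULL, "
                    + "col_id INTEGER NOT NULL, "
                    + "value INTEGER NOT NULL, "
                    + "PRIMARY KEY (row_id, col_id))");
        }

        void put(long rowId, long colId, int value) {
            // For the symmetric case you could also insert (colId, rowId), or normalise the order.
            db.execSQL("INSERT OR REPLACE INTO matrix (row_id, col_id, value) VALUES (?, ?, ?)",
                    new Object[]{rowId, colId, value});
        }

        // Returns the stored value, or the default if no row exists.
        int get(long rowId, long colId) {
            Cursor c = db.rawQuery("SELECT value FROM matrix WHERE row_id = ? AND col_id = ?",
                    new String[]{String.valueOf(rowId), String.valueOf(colId)});
            try {
                return c.moveToFirst() ? c.getInt(0) : DEFAULT_VALUE;
            } finally {
                c.close();
            }
        }

        // Sums a row, counting the "missing" elements at the default value.
        long rowSum(long rowId, int rowLength) {
            Cursor c = db.rawQuery(
                    "SELECT IFNULL(SUM(value), 0), COUNT(*) FROM matrix WHERE row_id = ?",
                    new String[]{String.valueOf(rowId)});
            try {
                c.moveToFirst();
                long sum = c.getLong(0);
                int found = c.getInt(1);
                return sum + (long) (rowLength - found) * DEFAULT_VALUE;
            } finally {
                c.close();
            }
        }
    }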
I have been studying the Java Collections framework recently. I noticed that ArrayList, ArrayDeque and HashMap contain helper functions that expand the capacity of the containers when necessary, but none of them has a function to shrink the capacity when the container empties out.
If I am correct, is the memory cost of the references (4 bytes each) really so irrelevant?
You're correct, most of the collections have an internal capacity that is expanded automatically and that never shrinks. The exception is ArrayList, which has methods ensureCapacity() and trimToSize() that let the application manage the list's internal capacity explicitly. In practice, I believe these methods are rarely used.
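For example, a small sketch of what that explicit capacity management looks like:

    import java.util.ArrayList;

    public class CapacityDemo {
        public static void main(String[] args) {
            ArrayList<String> list = new ArrayList<String>();
            list.ensureCapacity(1_000_000); // pre-size the backing array before a bulk insert
            for (int i = 0; i < 1_000_000; i++) {
                list.add("item-" + i);
            }
            list.subList(100, list.size()).clear(); // keep only the first 100 elements
            list.trimToSize(); // shrink the backing array to the 100 remaining elements
        }
    }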
The policy of growing but not shrinking automatically is based on some assumptions about the usage model of collections:
applications often don't know how many elements they want to store, so the collections will expand themselves automatically as elements are added;
once a collection is fully populated, the number of elements will generally remain around that number, neither growing nor shrinking significantly;
the per-element overhead of a collection is generally small compared to the size of the elements themselves.
For applications that fit these assumptions, the policy seems to work out reasonably well. For example, suppose you insert a million key-value pairs into a HashMap. The default load factor is 0.75, so the internal table size would be 1.33 million. Table sizes are rounded up to the next power of two, which would be 2^21 (2,097,152). In a sense, that's a million or so "extra" slots in the map's internal table. Since each slot is typically a 4-byte object reference, that's 4MB of wasted space!
But consider, you're using this map to store a million key-value pairs. Suppose each key and value is 50 bytes (which seems like a pretty small object). That's 100MB to store the data. Compared to that, 4MB of extra map overhead isn't that big of a deal.
Suppose, though, that you've stored a million mappings, and you want to run through them all and delete all but a hundred mappings of interest. Now you're storing 10KB of data, but your map's table of 2^21 elements is occupying 8MB of space. That's a lot of waste.
But it also seems that performing 999,900 deletions from a map is kind of an unlikely thing to do. If you want to keep 100 mappings, you'd probably create a new map, insert just the 100 mappings you want to keep, and throw away the original map. That would eliminate the space wastage, and it would probably be a lot faster as well. Given this, the lack of an automatic shrinking policy for the collections is usually not a problem in practice.
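A small sketch of that copy-the-survivors pattern (the class and method names are mine, for illustration):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    public class MapShrink {
        // Rather than deleting 999,900 entries in place, copy the few survivors into a
        // fresh map; the old map and its oversized internal table become garbage.
        static <K, V> Map<K, V> retainOnly(Map<K, V> original, Set<K> keysToKeep) {
            Map<K, V> smaller = new HashMap<K, V>();
            for (K key : keysToKeep) {
                V value = original.get(key);
                if (value != null) {
                    smaller.put(key, value);
                }
            }
            return smaller; // caller should drop its reference to 'original'
        }
    }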
Suppose I have a data set as follows:
Screen ID    User ID
1            24
2            50
2            80
3            23
5            50
3            60
6            64
...          ...
400,000      200,000
and I want to track the screens that each user visited. My first approach would be to create a HashMap where the keys would be the user ids and the values would be the screen ids. However, I get an OutOfMemoryError when using Java. Are there efficient data structures that can handle this volume of data? There will be about 3,000,000 keys, and for each key about 1,000 values. Would Spark (Python) be the way to go for this? The original dataset has around 300,000,000 rows and 2 columns.
Why do you want to store such a large amount of data in memory? It would be better to store it in a database and use only the data you need. Any in-memory data structure, in any language, will consume roughly the same amount of memory.
HashMap will not work with what you're describing as the keys must be unique. Your scenario is duplicating the keys.
If you want to be more memory efficient and don't have access to a relational database or an external file, consider designing something using arrays.
The advantage of arrays is the ability to store primitives which use less data than objects. Collections will always implicitly convert a primitive into its wrapper type when stored.
You could have your array index represent the screen id, and the value stored at the index could be another array or collection which stores the associated user ids.
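A rough sketch of that layout, assuming screen ids are (or have been remapped to) small contiguous integers (class names are placeholders):

    import java.util.ArrayList;
    import java.util.List;

    // Index = screen id, value = growable primitive array of user ids for that screen.
    // Using int[] chunks avoids boxing each user id into an Integer object.
    class ScreenVisits {
        private final List<int[]> usersByScreen;   // per-screen user ids
        private final int[] countByScreen;         // how many slots are used per screen

        ScreenVisits(int screenCount) {
            usersByScreen = new ArrayList<int[]>(screenCount);
            countByScreen = new int[screenCount];
            for (int i = 0; i < screenCount; i++) {
                usersByScreen.add(new int[4]); // small initial chunk per screen
            }
        }

        void addVisit(int screenId, int userId) {
            int[] users = usersByScreen.get(screenId);
            int count = countByScreen[screenId];
            if (count == users.length) { // grow the primitive array when full
                int[] bigger = new int[users.length * 2];
                System.arraycopy(users, 0, bigger, 0, users.length);
                usersByScreen.set(screenId, bigger);
                users = bigger;
            }
            users[count] = userId;
            countByScreen[screenId] = count + 1;
        }
    }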
What data type are you using? Let's say you are using a
Map<Integer,Integer>
...then each entry takes 8 bytes (32-bit) or 16 bytes (64-bit). Let's calculate your memory consumption:
8 × 400,000 = 3,200,000 bytes ÷ 1024 = 3,125 KB ÷ 1024 ≈ 3.05 MB
or 6.1 MB in the case of a 64-bit data type (like Long).
To put it shortly: 3.05 MB or 6 MB is nothing for your hardware.
Even if we calculate with 3 million entries, we end up with a memory usage of roughly 22 MB (in the case of an integer entry set). I don't think an OutOfMemory exception is caused by the data size. Check your data types, or
switch to MapDB for a quick prototype (it supports off-heap memory, see below).
Handling 3,000,000,000 entries is a more serious matter, though. We end up with a memory usage of roughly 22 GB. In that case you should consider
a data store that can handle this amount of data efficiently. I don't think a Java Map (or a vector in another language) is a good fit for such a data volume
(as Brain wrote, with this amount of data you have to increase the JVM heap space or use MapDB). Also think about your deployment: your product would need 22 GB of memory, which
means high hardware costs. The question of cost versus in-memory performance then has to be balanced. I would go with one of the following alternatives:
Riak (Key-Value Storage, fits your data structure)
Neo4J (your data structure can be handled as a graph; in this case a screen can have multiple relationships to users and vice versa)
Or for a quick prototype consider MapDB (http://www.mapdb.org/)
For a professional, high-performance solution, you can look at SAP Hana (but it's not free)
H2 (http://www.h2database.com/html/main.html) can also be a good choice. It's an in-memory SQL database.
With one of the solutions above, you can also persist and query your data (without coding indexing, B-trees and the like). And that is what you want to do, I guess:
process and operate on your data. In the end, only tests can show which technology performs best for your needs.
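For example, a minimal H2-over-JDBC sketch of the "persist and query without hand-coding indexes" idea (table and column names are my own, and the H2 jar must be on the classpath):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class VisitsDb {
        public static void main(String[] args) throws Exception {
            // In-memory H2 database; use a file URL (e.g. jdbc:h2:./visits) for persistence.
            try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:visits")) {
                try (Statement st = conn.createStatement()) {
                    st.execute("CREATE TABLE visits (screen_id INT, user_id INT)");
                    st.execute("CREATE INDEX idx_user ON visits(user_id)");
                }

                try (PreparedStatement insert = conn.prepareStatement(
                        "INSERT INTO visits (screen_id, user_id) VALUES (?, ?)")) {
                    insert.setInt(1, 1);
                    insert.setInt(2, 24);
                    insert.executeUpdate();
                }

                // Which screens did user 24 visit?
                try (PreparedStatement query = conn.prepareStatement(
                        "SELECT screen_id FROM visits WHERE user_id = ?")) {
                    query.setInt(1, 24);
                    try (ResultSet rs = query.executeQuery()) {
                        while (rs.next()) {
                            System.out.println(rs.getInt("screen_id"));
                        }
                    }
                }
            }
        }
    }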
The OutOfMemory exception has nothing to do with Java versus Python. Your use case can be implemented in Java with no problems.
Just looking at the data structure: you have a two-dimensional matrix, indexed by user-id and screen-id, containing a single boolean value per cell, namely whether that screen was visited by that user: visited[screen-id, user-id]
In the case where each user visits almost every screen, the optimal representation would be a set of bits. This means you need 400k x 200k bits, which is roughly 10 GB. In Java I would use a BitSet and linearize the access, e.g. BitSet.get(screen-id + 400000 * user-id). (Note that a single java.util.BitSet is limited to roughly 2^31 bits, so an index range this large has to be split across several BitSets, as in the sketch below.)
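A rough sketch of the dense-bitmap idea; because a single java.util.BitSet is indexed by int, this version keeps one BitSet per screen instead of one giant linearized bitmap (class names are placeholders):

    import java.util.BitSet;

    // Dense representation: one bit per (screen, user) pair.
    // ~200,000 bits (25 KB) per screen, times 400,000 screens ≈ 10 GB in total,
    // so this only makes sense if most users really do visit most screens.
    class VisitBitmap {
        private final BitSet[] visitedByScreen;

        VisitBitmap(int screenCount, int userCount) {
            visitedByScreen = new BitSet[screenCount];
            for (int s = 0; s < screenCount; s++) {
                visitedByScreen[s] = new BitSet(userCount);
            }
        }

        void markVisited(int screenId, int userId) {
            visitedByScreen[screenId].set(userId);
        }

        boolean hasVisited(int screenId, int userId) {
            return visitedByScreen[screenId].get(userId);
        }
    }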
If each user only visits a few screens, then there are a lot of repeating false-values in the bit set. This is what is called a sparse matrix. Actually, this is a well researched problem in computer science and you will find lots of different solutions for it.
This answers your original question, but probably does not solve your problem. In the comment you stated that you want to look up for the users that visited a specific screen. Now, that's a different problem domain, we are shifting from efficient data representation and storage to efficient data access.
Looking up the users that visited a set of screens is essentially the same problem as looking up the documents that contain a set of words. That is a basic information retrieval problem. For this problem, you need a so-called inverted index data structure. One popular library for this is Apache Lucene.
You can read in the visits and build a data structure yourself. Essentially it is a map, addressed by the screen-id, returning a set of the affected users, that is: Map<Integer, Set<Integer>>. For the set of integers, the first choice would be a HashSet, which is not very memory efficient. I recommend using a high-performance set library targeted at integer values instead, e.g. IntOpenHashSet. Still, this will probably not fit in memory; however, if you use Spark you can split your processing into slices and join the results later.
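A rough sketch of that inverted map, assuming the fastutil library is on the classpath (class names as in recent fastutil versions):

    import it.unimi.dsi.fastutil.ints.Int2ObjectOpenHashMap;
    import it.unimi.dsi.fastutil.ints.IntOpenHashSet;

    // Inverted index: screen id -> set of user ids that visited it.
    // Primitive-int collections avoid the per-Integer boxing overhead of HashSet<Integer>.
    class InvertedVisits {
        private final Int2ObjectOpenHashMap<IntOpenHashSet> usersByScreen =
                new Int2ObjectOpenHashMap<IntOpenHashSet>();

        void addVisit(int screenId, int userId) {
            IntOpenHashSet users = usersByScreen.get(screenId);
            if (users == null) {
                users = new IntOpenHashSet();
                usersByScreen.put(screenId, users);
            }
            users.add(userId);
        }

        boolean visited(int screenId, int userId) {
            IntOpenHashSet users = usersByScreen.get(screenId);
            return users != null && users.contains(userId);
        }
    }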
I am writing an application that needs to look up data from a table (20x200) for calculation inputs. The table is filled with constants (i.e. I do not need to write to the table). I am still a novice programmer and have not had a lot of experience with databases, and so prior to proceeding I would like to know the best way to achieve this.
I had intended to place the data in an array and simply perform the lookup with two loops (one for the row and one for the column); however, I feel this is very inefficient. Is it worth looking into a database such as SQLite, or is that overkill for what is a relatively small data set with no requirement for editing?
As often, the answer is: It depends.
Do you need advanced querying, like the sum of all values in the x column for which the value in the y column is greater than 23? If so, an in-memory SQL database comes in handy. Otherwise it would just be overkill.
Assuming the database is out of the discussion, the next questions are: do you need single values, or complete columns or rows (or large parts of them)? And what are the natural "names" of your columns and rows?
Here are some options:
"names" are continuous integers: Use a 2D array (I wouldn't use arrays very often in Java, but in a read only situation with fixed lengths everything else sounds like to much overhead. By choosing the order of the indices, i.e. rows first vs. columns first you can get complete columns/rows very easy and efficient.
"names" are not continuous, Strings or any other objects: Use a Map of Maps if you need access to complete rows or columns. If you only need single values, create a Pair type and use it a the key for the map.
1) You can use an in-memory database like the H2 Database Engine.
You just need to include a jar, and data retrieval will be very fast.
It shouldn't be considered an overhead for your application.
2) Or you can use a Map<Key, Map<String, String>> for the lookup.
For the outer Map, the key is your record id; for the inner Map, the key is your column name.
Whether to make it static or not, I leave to you to decide.
3) You can also explore caching options like ehcache.
I have a table called Token in my database that represents tokenized texts.
Each row has attributes like textblock, sentence and position (for identifying which text the token is from) and logical fields like text, category, chartype, etc.
What I want to do is iterate over all tokens to find patterns and perform some operations. For example, merging two adjacent tokens whose category is Name into one (and, after this, resetting the positions). I think I will need some kind of list.
What is the best way to do this? With SQL queries to find the patterns, or by iterating over all tokens in the table? I think the queries would get very complex, and perhaps iterating over a list would be simpler, but I don't know which way to go (for example, retrieving the rows into a Java list, or using a language that lets me iterate and make changes directly in the database).
So that this question isn't closed: what I want to know is the most recommended way to do this. I'm using Java, but if another language is better, that's no problem; I think I will need to use R for some statistical calculations.
Edit: The table is large (millions of rows); loading it entirely into memory is not possible.
If you are working with a small table, or proving out a merge strategy, then just set up a query that finds all of the candidate duplicate lines and dumps the relevant columns out to a table. Then view that table in a text editor or spreadsheet to see if your hypothesis about the duplication is correct.
Keep in mind that any time you try to merge two rows into one, you will be deleting data. Worst case is that you might merge ALL of your rows into one. Proceed with caution!
This is an engineering decision to be made, based mostly on the size of the corpus you want to maintain, and the kind of operations you want to perform on them.
If the size gets bigger than "what fits in the editor", you'll need some kind of database. That may or may not be an SQL database. But there is also the code part: if you want to perform non-trivial operations on the data, you might need a real programming language (it could be anything: C, Java, Python; anything goes). In that case, the communication with the database will become a bottleneck: you need to generate queries that produce results that fit in the application programme's memory. SQL is powerful enough to represent and store N-grams and do some calculations on them, but that is about as far as you are going to get. In any case the database has to be fully normalised, and that will make it more difficult to understand for non-DBAs.
My own toy project, http://sourceforge.net/projects/wakkerbot/ used a hybrid approach:
the data was obtained by a python crawler
the corpus was stored as-is in the database
the actual (modified MegaHal) Markov code stores its own version of the corpus in a (binary) flat file, containing the dictionary, N-grams, and the associated coefficients.
the training and text generation is done by a highly optimised C program
the output was picked up by another python script, and submitted to the target.
[in another life, I would probably have done some more normalisation, and stored N-grams or trees in the database. That would possibly cause the performance to drop to only a few generated sentences per second. It now is about 4000/sec]
My gut feeling is that what you want is more like a "linguistic workbench" than a program that does exactly one task efficiently (like wakkerbot). In any case you'll need to normalise a bit more: store the tokens as {tokennumber,tokentext} and refer to them only by number. Basically, a text is just a table (or array) containing a bunch of token numbers. An N-gram is just a couple of tokennumbers+the corresponding coefficients.
This is not the most optimized method but it's a design that allows you to write the code easily.
write an entity class that represents a row in your table.
write a factory method that allows you to get the entity object for a given row id, i.e. a method that creates an object of the entity class with the values from the specified row.
write methods that remove and insert a given row object into table.
write a row counting method.
Now you can iterate over your table using your Java code. Remember that if you merge two rows, you need to correctly adjust the next index.
This method keeps memory usage small, but you will be issuing a lot of queries to create the row objects.
The concept is very similar or identical to ORM (Object Relational Mapping). If you know how to use Hibernate or another ORM, then try those libraries.
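A rough sketch of that entity/factory pattern over plain JDBC (the table and column names are assumptions based on the question, not a tested implementation):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    // Entity class: one object per row of the Token table.
    class Token {
        long id;
        int position;
        String text;
        String category;

        // Factory method: load a single row by id, or return null if it doesn't exist.
        static Token load(Connection conn, long id) throws SQLException {
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT id, position, text, category FROM token WHERE id = ?")) {
                ps.setLong(1, id);
                try (ResultSet rs = ps.executeQuery()) {
                    if (!rs.next()) {
                        return null;
                    }
                    Token t = new Token();
                    t.id = rs.getLong("id");
                    t.position = rs.getInt("position");
                    t.text = rs.getString("text");
                    t.category = rs.getString("category");
                    return t;
                }
            }
        }

        // Remove this row from the table (used when merging two adjacent tokens).
        void delete(Connection conn) throws SQLException {
            try (PreparedStatement ps = conn.prepareStatement("DELETE FROM token WHERE id = ?")) {
                ps.setLong(1, id);
                ps.executeUpdate();
            }
        }
    }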
IMO it'd be easier, and likely faster overall, to load everything into Java and do your operations there to avoid continually re-querying the DB.
There are some pretty strong numerical libs for Java and statistics, too; I wouldn't dismiss it out-of-hand until you're sure what you need isn't available (or is too slow).
This sounds like you're designing a text search engine. You should first see if pgsql's full text search engine is right for you.
If you do it without full text search, loading PL/R into pgsql and learning to drive it is likely to be the fastest and most efficient solution. It'll allow you to put all this work into a few well-thought-out lines of R, and do it all in the db where access to the data is closest. The only time to avoid such a plan is when it would make the database server work VERY hard, like holding the dataset in memory and cranking a single CPU core across it. Then it's OK to do it app-side.
Whether you use PL/R or not, access large data sets through a cursor; it's by far the most efficient way to get either single rows or smaller subsets of rows. If you do it with a SELECT with a WHERE clause for each thing you want to process, then you don't have to hold all those rows in memory at once. You can grab and discard parts of result sets while doing things like running averages, etc.
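For example, with the PostgreSQL JDBC driver, cursor-style streaming looks roughly like this (connection details are placeholders; auto-commit must be off and a fetch size set, otherwise the driver buffers the whole result set):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class StreamTokens {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/mydb", "user", "password")) {
                conn.setAutoCommit(false); // required for cursor-based fetching in PostgreSQL
                try (PreparedStatement ps = conn.prepareStatement(
                        "SELECT id, text, category FROM token ORDER BY textblock, sentence, position")) {
                    ps.setFetchSize(10_000); // stream rows in batches instead of loading them all
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            // process one row at a time; earlier batches can be garbage collected
                            String text = rs.getString("text");
                            String category = rs.getString("category");
                            // ... pattern detection / merging logic goes here ...
                        }
                    }
                }
            }
        }
    }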
Think about scale here. If you had a 5 TB database, how would you access it to do this the fastest? A poor scaling solution will come back to bite you even if it's only accessing 1% of the data set. And if you're already starting on a pretty big dataset today, it'll just get worse with time.
pl/R http://www.joeconway.com/plr/
I need some help storing some data efficiently. I have a large list of objects (about 100,000) and want to store associations between these items with a coefficient. Not all items are associated; in fact I have about 1 million associations. I need fast access to these associations when referencing the two items. What I did is a structure like this:
Map<Item, Map<Item, Float>>
I tried this with HashMap and Hashtable. Both work fine and are fast enough. My problem is that all those Maps create a lot of memory overhead; concretely, for the given scenario, more than 300 MB. Is there a Map implementation with a smaller footprint? Is there maybe a better algorithm to store this kind of data?
Here are some ideas:
Store in a Map<Pair<Item,Item>,Float>. If you are worried about allocating a new Pair for each lookup, and your code is synchronized, you can keep a single lookup Pair instance.
Loosen the outer map to be Map<Item, ?>. The value can be a simple {Item,Float} tuple for the first association, a small tuple array for a small number of associations, then promote to a full fledged Map.
Use Commons Collections' Flat3Map for the inner maps.
If you are in tight control of the Items, and Item equivalence is referential (i.e. each Item instance is not equal to any other Item instance), then you can number each instance. Since you are talking about fewer than 2 billion instances, a pair of int ids will fit into a single long with some bit manipulation. Then the map gets much smaller if you use Trove's TLongObjectHashMap.
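A rough sketch of that last idea, assuming the associations are unordered and Trove is on the classpath (drop the min/max normalisation if direction matters):

    import gnu.trove.map.hash.TLongObjectHashMap;

    // Each Item gets a sequential int id; a pair of ids packs into one primitive long key.
    class Associations {
        private final TLongObjectHashMap<Float> coefficients = new TLongObjectHashMap<Float>();

        private static long key(int idA, int idB) {
            // Assumes the association is unordered: normalise so (a,b) and (b,a) share a key.
            int lo = Math.min(idA, idB);
            int hi = Math.max(idA, idB);
            return ((long) hi << 32) | (lo & 0xFFFFFFFFL);
        }

        void put(int idA, int idB, float coefficient) {
            coefficients.put(key(idA, idB), coefficient);
        }

        Float get(int idA, int idB) {
            return coefficients.get(key(idA, idB)); // null if no association is stored
        }
    }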
You have two options.
1) Reduce what you're storing.
If your data is calculable, using a WeakHashMap will allow the garbage collector to remove members. You will probably want to decorate it with a mechanism that calculates lost or absent key/value pairs on the fly. This is basically a cache.
Another possibility that might trim a relatively tiny amount of RAM is to instruct your JVM to use compressed object pointers (-XX:+UseCompressedOops). That may save you about 3 MB with your current data size.
2) Expand your capacity.
I'm not sure what your constraint is (run-time memory on a desktop, serialization, etc.) but you can either expand the heapsize and deal with it, or you can push it out of process. With all those "NoSQL" stores out there, one will probably fit your needs. Or, an indexed db table can be quite fast. If you're looking for a simple key-value store, Voldemort is extremely easy to set up and integrate.
However, I don't know what you're doing with your working set. Can you give more details? Are you performing aggregations, partitioning, cluster analysis, etc.? Where are you running into trouble?