Tracking unique identifiers over a large dataset - java

I have a standalone Java application that operates on a large number of elements read from an input file, each element being associated with an identifier. For each element, I do the following (among other things, of course):
Check that the element has not already been processed, using its identifier.
Map the element to a grid using some statistical method, each cell of the grid being responsible for tracking the unique elements that were assigned to it, along with some properties calculated on each element.
The number of elements might be quite large (several million), as might the grid itself. Each cell is created on the fly as soon as an element is assigned to it, to avoid storing empty cells.
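For reference, here is a minimal sketch of the in-memory bookkeeping described above (the Cell and Grid classes, the element identifier type and the aggregated property are hypothetical, just to make the structure concrete); it is exactly this kind of structure that runs into memory trouble at several million elements:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class Cell {
    final Set<String> elementIds = new HashSet<>();  // unique elements assigned to this cell
    double propertySum;                              // example of a property aggregated per cell

    void add(String elementId, double property) {
        if (elementIds.add(elementId)) {
            propertySum += property;
        }
    }
}

class Grid {
    private final Set<String> processedIds = new HashSet<>();  // identifiers already processed
    private final Map<Long, Cell> cells = new HashMap<>();     // cells created on the fly

    // Returns false if the identifier has already been processed.
    boolean process(String elementId, long cellIndex, double property) {
        if (!processedIds.add(elementId)) {
            return false;  // duplicate element, skip it
        }
        cells.computeIfAbsent(cellIndex, k -> new Cell()).add(elementId, property);
        return true;
    }
}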
Question is: with large amounts of data, memory issues naturally arise. What would be the best strategy to process a large amount of data while avoiding memory issues?
I have a couple of things in mind, but I'd like to know if anyone has already had this kind of problem and, if so, can share their experience:
Embedded lightweight SQL database
Caching solutions such as Ehcache or Apache JCS
NoSQL key-value stores such as Cassandra
Thoughts?

Related

How to work with large database tables in java without suffering performance problems [closed]

We have a table of vocabulary items that we use to search text documents. The Java program that uses this table currently reads it from a database, stores it in memory and then searches documents for individual items in the table. The table is brought into memory for performance reasons. This has worked for many years but the table has grown quite large over time and now we are starting to see Java Heap Space errors.
There is a brute force approach to solving this problem which is to upgrade to a larger server, install more memory, and then allocate more memory to the Java heap. But I'm wondering if there are better solutions. I don't think an embedded database will work for our purposes because the tables are constantly being updated and the application is hosted on multiple sites suggesting a maintenance nightmare. But, I'm uncertain about what other techniques are out there that might help in this situation.
Some more details: there are currently over a million vocabulary items (think of these items as short text strings, not individual words). The documents are read from a directory by our application, and each document is analyzed to determine if any of the vocabulary is present in it. If it is, we note which items are present and store them in a database. The vocabulary itself is stored and maintained in an MS SQL relational database that we have been growing for years. Since all vocabulary items must be checked against each document, repeatedly reading them from the database is inefficient. And the number of documents that need to be analyzed each day can, at some of our installations, be quite large (on the order of 100K documents a day). The documents are typically 2 to 3 pages long, although we occasionally see documents as large as 100 pages.
In the hopes of making your application more performant, you're taking all the data out of a database that is designed with efficient data operations in mind and putting it into your application's memory. This works fine for small data sets, but as those data sets grow, you are eventually going to run out of resources in the application to handle the entire dataset.
The solution is to use a database, or at least a data tier, that's appropriate for your use case. Let your data tier do the heavy lifting instead of replicating the data set into your application. Databases are incredible, and their ability to crunch through huge amounts of data is often underrated. You don't always get blazing fast performance for free (you might have to think hard about indexes and models), but few are the use cases where java code is going to be able to pull an entire data set down and process it more efficiently than a database can.
You don't say much about which database technology you're using, but most relational databases offer a lot of useful tools for full text searching. I've seen well-designed relational databases perform text searches very effectively. But if you're constrained by your database technology, or your table really is so big that a relational-database text search isn't feasible, you should put your data into a searchable cache such as Elasticsearch. If you model and index your data effectively, you can build a very performant text search platform that will scale reliably. Tom's suggestion of Lucene is another good one. There are a lot of cloud technologies that can help with this kind of thing too: S3 + Athena comes to mind, if you're into AWS.
I'd look at http://lucene.apache.org; it should be a good fit for what you've described.
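For illustration, here is a minimal Lucene sketch (assuming a recent Lucene version; the index path, field names and example text are hypothetical). One possible arrangement for the use case above is to index the documents and then run each vocabulary item as a phrase query against that index:

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class VocabularySearchSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("doc-index"));  // hypothetical index location
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Index each document once.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new StringField("id", "doc-1", Field.Store.YES));
            doc.add(new TextField("body", "full text of the document goes here", Field.Store.NO));
            writer.addDocument(doc);
        }

        // For each vocabulary item, run a phrase query against the index.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            QueryParser parser = new QueryParser("body", analyzer);
            TopDocs hits = searcher.search(parser.parse("\"text of the document\""), 10);
            System.out.println("matches: " + hits.totalHits);
        }
    }
}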
I had the same issue with a table of more than a million rows, and a client who wanted to export all of that data. My solution was very simple: I followed this question. But there was a little issue: with more than 100k records I ran out of heap space. So I just chunked my queries, WITH (NOLOCK) (I know this can return some inconsistent data, but I needed to do that because without this hint the query was blocking the DB). I hope this approach helps you.
When you had a small table, you probably implemented an approach of looping over the words in the table and for each one looking it up in the document to be processed.
Now the table has grown to the point where you have trouble loading it all in memory. I expect that the processing of each document has also slowed down due to having more words to look up in each document.
If you flip the processing around, you have more opportunities to optimize this process. In particular, to process a document you first identify the set of words in the document (e.g., by adding each word to a Set). Then you loop over each document word and look it up in the table. The simplest implementation simply does a database query for each word.
To optimize this, without loading the whole table in memory, you will want to implement an in-memory cache of some kind. Your database server will actually automatically implement this for you when you query the database for each word; the efficacy of this approach will depend on the hardware and configuration of your database server as well as the other queries that are competing with your word look-ups.
You can also implement an in-memory cache of the most-used portion of the table. You will limit the size of the cache based on how much memory you can afford to give it. All words that you look up that are not in the cache need to be checked by querying the database. Your cache might use a least-recently-used eviction strategy so that you keep the most common words in the cache.
While you could store in your cache only the words that exist in the table, you might achieve better performance if you cache the result of each lookup. This will result in your cache holding the most common words that show up in the documents (each one with a boolean value indicating whether or not the word is in the table).
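Here is a minimal sketch of such a cache, using a LinkedHashMap in access order for least-recently-used eviction (the isInTableQuery method is a hypothetical stand-in for the real database lookup):

import java.util.LinkedHashMap;
import java.util.Map;

// Caches lookup results: word -> whether it is a vocabulary item.
class VocabularyLookupCache {
    private final int maxEntries;
    private final Map<String, Boolean> cache;

    VocabularyLookupCache(int maxEntries) {
        this.maxEntries = maxEntries;
        // accessOrder = true turns removeEldestEntry into LRU eviction
        this.cache = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > VocabularyLookupCache.this.maxEntries;
            }
        };
    }

    boolean isVocabularyItem(String word) {
        Boolean cached = cache.get(word);
        if (cached == null) {
            cached = isInTableQuery(word);  // cache miss: ask the database
            cache.put(word, cached);
        }
        return cached;
    }

    private boolean isInTableQuery(String word) {
        // Hypothetical: replace with a real "SELECT 1 FROM vocabulary WHERE item = ?" query.
        return false;
    }
}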
There are several really good open source in-memory caching implementations available in Java, which will minimize the amount of code you need to write to implement a caching solution.

Looking for a purely disk based key-value cache for a large dataset in Java

I currently have a Postgres database with a simple schema of a fixed-length key (20 bytes) and a fixed-length value (40 bytes). It's a massive table with billions of rows, but unfortunately we have lots of duplicated data. We'd like to separate this table into its own data store.
Ideally, I'm looking for ways to store this data on a large hard drive where it can be queried on occasion. Performance is not critical for reads, disk access is fast enough - no need to store anything in memory. And there is rarely new data added after the initial load.
If there is no product available I would be willing to roll my own with suggestions. I originally thought of using the key as a folder path based on the byte /0/32/231/32/value but obviously that results in too many files/folders on a single disk. Is there an optimization that can be used since both keys and values are fixed length?
Any suggestions?
Try some pure Java database engines like MapDB or LevelDB.
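As an illustration, here is a minimal MapDB sketch for a file-backed map of fixed-length byte[] keys and values (assuming MapDB 3.x; the file name and map name are hypothetical):

import org.mapdb.DB;
import org.mapdb.DBMaker;
import org.mapdb.HTreeMap;
import org.mapdb.Serializer;

public class DiskKeyValueSketch {
    public static void main(String[] args) {
        // File-backed store: the data lives on disk, not in the heap.
        DB db = DBMaker.fileDB("kv-store.db").make();
        HTreeMap<byte[], byte[]> map = db
                .hashMap("pairs", Serializer.BYTE_ARRAY, Serializer.BYTE_ARRAY)
                .createOrOpen();

        byte[] key = new byte[20];    // fixed-length 20-byte key
        byte[] value = new byte[40];  // fixed-length 40-byte value
        map.put(key, value);

        System.out.println(map.get(key) != null ? "hit" : "miss");
        db.close();
    }
}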

Is it better to use HBase columns or serialize data using Avro?

I'm working on a project that stores key/value information on a user using HBase. We are in the process of redesigning the HBase schema we are using. The two options being discussed are:
Use HBase column qualifiers as names for the keys. This would make rows wide, but very sparse.
Dump all the data into a single column and serialize it using Avro or Thrift.
What are the design tradeoffs of the two approaches? Is one preferable to the other? Are there any reasons not to store the data using Avro or Thrift?
In summary, I lean towards using distinct columns per key.
1) Obviously, you are imposing that the client uses Avro/Thrift, which is another dependency. This dependency may rule out certain tooling, like BI tools that expect to find values in the data without transformation.
2) Under the Avro/Thrift scheme, you are pretty much forced to bring the entire value across the wire. Depending on how much data is in a row, this may not matter. But if you are only interested in the 'city' field/column qualifier, you still have to fetch 'payments', 'credit-card-info', etc. This may also pose a security issue.
3) Updates, if required, will be more challenging with Avro/Thrift. Example: you decide to add a 'hasIphone6' key. With Avro/Thrift you will be forced to delete the row and create a new one with the added field; under the column scheme a new entry is simply appended, with only the new column (see the sketch after this list). For a single row that is not a big deal, but if you do this to a billion rows there will need to be a big compaction operation.
4) If configured, you can use compression in HBase, which may outperform Avro/Thrift serialization, since it can compress across a column family instead of just a single record.
5) BigTable implementations like HBase do very well with very wide, sparse tables, so there won't be a performance hit like you might expect.
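To make point 3 concrete, here is a minimal sketch of appending just the new column under the one-column-per-key scheme (assuming the standard HBase 1.x/2.x client API; the table, family, row key and value are hypothetical):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class AddColumnSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Only the new key/value cell is written; existing columns in the row are untouched.
            Put put = new Put(Bytes.toBytes("user#42"));
            put.addColumn(Bytes.toBytes("attrs"), Bytes.toBytes("hasIphone6"), Bytes.toBytes("true"));
            table.put(put);
        }
    }
}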
The right answer to this is a bit more complicated, so I'll give you the tl;dr first.
Use Avro/Thrift/Protobuf
You will need to strike a balance between how many fields to pack in a record vs. columns.
You'll typically want to put fields ("keys" in your original question) that are frequently accessed together into something like an Avro record, because, as mentioned by cmonkey, you don't want the overhead of retrieving extra data you won't use.
By making your row very wide, you'll increase seek times when fetching a subset of columns, because of how HFiles are stored. Again, determining what is optimal comes down to your access patterns.
I would also like to point out that by using something like Avro, you're also giving yourself evolvability. You don't need to delete the row and re-add it with a record containing the new field. Avro has rules for backward and forward compatibility. This actually makes your life much, much easier, because you can read both new and old records WITHOUT rewriting your data or forcing updates to older client code.
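A minimal sketch of that evolvability with Avro's GenericRecord API (the schemas and field names are hypothetical): data written with the old schema is read with a newer reader schema whose added field carries a default.

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AvroEvolutionSketch {
    public static void main(String[] args) throws Exception {
        Schema oldSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"city\",\"type\":\"string\"}]}");
        Schema newSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"city\",\"type\":\"string\"},"
          + "{\"name\":\"hasIphone6\",\"type\":\"boolean\",\"default\":false}]}");

        // Write a record with the old schema.
        GenericRecord oldRecord = new GenericData.Record(oldSchema);
        oldRecord.put("city", "Oslo");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(oldSchema).write(oldRecord, encoder);
        encoder.flush();

        // Read it back with the new schema: the missing field takes its default value.
        Decoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord upgraded =
            new GenericDatumReader<GenericRecord>(oldSchema, newSchema).read(null, decoder);
        System.out.println(upgraded.get("hasIphone6"));  // prints false
    }
}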
You should nearly always use compression in HBase (SNAPPY is always a good choice).

What is the best way to iterate and process an entire table from database?

I have a table called Token in my database that represents tokenized texts.
Each row has attributes like textblock, sentence and position (for identifying which text the token is from) and logical fields like text, category, chartype, etc.
What I want to do is iterate over all tokens to find patterns and do some operations. For example, merging two adjacent tokens that have the category Name into one (and after this, resetting the positions). I think that I will need some kind of list.
What is the best way to do this? With SQL queries to find the patterns, or by iterating over all tokens in the table? I think the queries would end up quite complex, and maybe iterating as a list would be simpler, but I don't know which way to go (for example, retrieving into a Java list, or using a language in which I can iterate and make changes right in the database).
So that this question doesn't get closed: what is the most recommended way to do this? I'm using Java, but if another language is better, no problem; I think I will need to use R for some statistical calculations.
Edit: The table is large, millions of rows; loading it entirely into memory is not possible.
If you are working with a small table, or proving out a merge strategy, then just set up a query that finds all of the candidate duplicate lines and dump the relevant columns out to a table. Then view that table in a text editor or spreadsheet to see if your hypothesis about the duplication is correct.
Keep in mind that any time you try to merge two rows into one, you will be deleting data. Worst case is that you might merge ALL of your rows into one. Proceed with caution!
This is an engineering decision to be made, based mostly on the size of the corpus you want to maintain, and the kind of operations you want to perform on them.
If the size gets bigger than "what fits in the editor", you'll need some kind of database. That may or may not be an SQL database. But there is also the code part: if you want to perform non-trivial operations on the data, you might need a real programming language (it could be anything: C, Java, Python; anything goes). In that case, the communication with the database will become a bottleneck: you need to generate queries that produce results that fit in the application program's memory. SQL is powerful enough to represent and store N-grams and do some calculations on them, but that is about as far as you are going to get. In any case the database has to be fully normalised, and that will make it more difficult to understand for non-DBAs.
My own toy project, http://sourceforge.net/projects/wakkerbot/ used a hybrid approach:
the data was obtained by a python crawler
the corpus was stored as-is in the database
the actual (modified MegaHal) Markov code stores its own version of the corpus in a (binary) flatfile, containing the dictionary, N-grams, and the associated coefficients.
the training and text generation is done by a highly optimised C program
the output was picked up by another python script, and submitted to the target.
[in another life, I would probably have done some more normalisation, and stored N-grams or trees in the database. That would possibly cause the performance to drop to only a few generated sentences per second. It now is about 4000/sec]
My gut feeling is that what you want is more like a "linguistic workbench" than a program that does exactly one task efficiently (like wakkerbot). In any case you'll need to normalise a bit more: store the tokens as {tokennumber,tokentext} and refer to them only by number. Basically, a text is just a table (or array) containing a bunch of token numbers. An N-gram is just a couple of tokennumbers+the corresponding coefficients.
This is not the most optimized method but it's a design that allows you to write the code easily.
Write an entity class that represents a row in your table.
Write a factory method that allows you to get the entity object for a given row id, i.e. a method that creates an object of the entity class with the values from the specified row.
Write methods that remove and insert a given row object into the table.
Write a row counting method.
Now you can try to iterate over your table using your Java code. Remember that if you merge two rows, you need to correctly adjust the next index.
This method lets you use little memory, but you will be issuing a lot of queries to create the rows.
The concept is very similar or identical to ORM (Object-Relational Mapping). If you know how to use Hibernate or another ORM, then try those libraries.
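Here is a minimal JDBC sketch of that pattern (the Token table and its columns follow the question; there is no error handling, connection pooling or insert method, and the names are only illustrative):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

class Token {
    long id;
    int textblock, sentence, position;
    String text, category;
}

class TokenDao {
    private final Connection conn;

    TokenDao(Connection conn) { this.conn = conn; }

    // Factory method: load one row by id into an entity object.
    Token findById(long id) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT id, textblock, sentence, position, text, category FROM token WHERE id = ?")) {
            ps.setLong(1, id);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) return null;
                Token t = new Token();
                t.id = rs.getLong("id");
                t.textblock = rs.getInt("textblock");
                t.sentence = rs.getInt("sentence");
                t.position = rs.getInt("position");
                t.text = rs.getString("text");
                t.category = rs.getString("category");
                return t;
            }
        }
    }

    // Remove a row; a matching insert method would look much the same.
    void delete(long id) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement("DELETE FROM token WHERE id = ?")) {
            ps.setLong(1, id);
            ps.executeUpdate();
        }
    }

    // Row counting method, useful for driving the iteration.
    long count() throws SQLException {
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM token")) {
            rs.next();
            return rs.getLong(1);
        }
    }
}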
IMO it'd be easier, and likely faster overall, to load everything into Java and do your operations there to avoid continually re-querying the DB.
There are some pretty strong numerical libs for Java and statistics, too; I wouldn't dismiss it out-of-hand until you're sure what you need isn't available (or is too slow).
This sounds like you're designing a text search engine. You should first see if pgsql's full text search engine is right for you.
If you do it without full text search, loading PL/R into pgsql and learning to drive it is likely to be the fastest and most efficient solution. It'll allow you to put all this work into a few well-thought-out lines of R, and do it all in the DB, where access to the data is closest. The only time to avoid such a plan is when it would make the database server work VERY hard, like holding the dataset in memory and cranking a single CPU core across it. Then it's OK to do it app side.
Whether you use PL/R or not, access large data sets with a cursor; it's by far the most efficient way to get either single rows or smaller subsets of rows. If you do it with a select with a where clause for each thing you want to process, then you don't have to hold all those rows in memory at once. You can grab and discard parts of result sets while doing things like running averages etc.
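From the Java side, a minimal sketch of cursor-style streaming with the PostgreSQL JDBC driver (the connection details and the Token columns from the question are only illustrative; the driver only uses a server-side cursor when autocommit is off and a fetch size is set on a forward-only result set):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class CursorIterationSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "user", "password")) {
            conn.setAutoCommit(false);  // required for cursor-based fetching in the PostgreSQL driver
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT textblock, sentence, position, text, category FROM token "
                  + "ORDER BY textblock, sentence, position")) {
                ps.setFetchSize(1000);  // stream rows in batches instead of loading them all
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        String text = rs.getString("text");
                        // process one token at a time; earlier batches can be garbage collected
                    }
                }
            }
        }
    }
}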
Think about scale here. If you had a 5 TB database, how would you access it to do this the fastest? A poor scaling solution will come back to bite you even if it's only accessing 1% of the data set. And if you're already starting on a pretty big dataset today, it'll just get worse with time.
pl/R http://www.joeconway.com/plr/

How to reduce the total memory hogging by compacting my Objects in Java?

I have a table with around 20 columns, mostly consisting of varchars and decimals. This table has almost 1.5M rows. But a few things are common among them: for example, column1 consists of only 100 distinct strings, column2 has almost 1,000 and column3 has almost 500.
Right now, I am storing all these column values in a map, with the first 5 columns as the key and the rest of the columns as the data. My task is such that I need to initialize all of this at the start of the task.
What pattern (like Flyweight, etc.) or data structure should I use to minimize my object storage?
Why I need pre-load of all data?
Assume the whole data of the table as a tree, where the victims can be at any leaf, trunk or root. So for each entry [this is coming from a different place], I need to see if there is any match in the tree.
Interning is not the best option. Garbage collecting from the PermGen space is possible, but it's nothing the VM is optimized for.
You can implement your own CharSequence implementation that is backed by shared char[] arrays.
With a CharSequence implementation you'll be able to implement basic sharing semantics like internalized strings or more complicated ones taking substrings and other projections into account.
A custom CharSequence implementation can also be optimized to perform fewer memory allocations than the String class, which copies char[] around (for safety reasons that are not necessary if you have the backing char[] under your full control). Even new String("..").intern() will instantiate a new String instance (char[] array) that is rapidly garbage collected.
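A minimal sketch of such a CharSequence backed by a shared char[] (bounds checking, equals/hashCode and any interning of the views are left out):

// A read-only view into a shared char[] buffer; many values can share one backing array.
final class SharedCharSequence implements CharSequence {
    private final char[] buffer;  // shared, never copied
    private final int offset;
    private final int length;

    SharedCharSequence(char[] buffer, int offset, int length) {
        this.buffer = buffer;
        this.offset = offset;
        this.length = length;
    }

    @Override public int length() { return length; }

    @Override public char charAt(int index) { return buffer[offset + index]; }

    @Override public CharSequence subSequence(int start, int end) {
        // A projection over the same backing array: no copying involved.
        return new SharedCharSequence(buffer, offset + start, end - start);
    }

    @Override public String toString() {
        // Only copies when a real String is actually needed.
        return new String(buffer, offset, length);
    }
}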
My first question would be: what does your task plan to do with the data in the table? Preloading a complete table into memory is not always the best approach; for instance, keeping your current setup but loading on demand might be a better solution. And you might want to investigate flushing data that isn't used for a while, i.e. a kind of least-recently-used map.
Could you elaborate what your task tries to achieve with all that data cached in a map?
Is the "victim" identification part of the key or part of the object? If part of the object, how do you select the keys that select the objects that you need? In other words, it sounds like you are trying to reproduce functionality that the database is very good at.
If your problem is that your table contents do not map easily onto a tree-like structure, you could add that information in a way that is usable through the DB interface.
If your data loading process can support it, then it isn't too difficult to implement something like String.intern() without the GC permgen side effects.
For any hashable data element, you can simply have a Map<T,T> to look up preexisting instances. So for String:
Map<String, String> stringCache = new HashMap<String, String>();
...
String sharedValue = stringCache.get(loadedValue);
if (sharedValue == null) {  // first occurrence: remember the instance so later loads can share it
    stringCache.put(sharedValue = loadedValue, loadedValue);
}
The process that loads the data from wherever will still be creating temporary strings but these will be rapidly GC'ed. Without knowing more about the specifics of where the data is coming from, it's difficult to comment on whether those temporary objects are necessary... though I have trouble seeing a way around it. They would be reclaimed rapidly during the load process anyway.
