java efficient de-duplication


Let's say you have a large text file. Each row contains an email id and some other information (say, a product id). Assume there are millions of rows in the file. You have to load this data into a database. How would you efficiently de-duplicate the data (i.e. eliminate duplicates)?

Insane number of rows
Use a MapReduce framework (e.g. Hadoop). This is full-blown distributed computing, so it is overkill unless you have terabytes of data. (j/k :) )
Unable to fit all rows in memory
Even the result won't fit: Use merge sort, persisting intermediate data to disk. As you merge, you can discard duplicates. This can be multi-threaded if you want.
The results will fit: Instead of reading everything into memory and then putting it in a HashSet (see below), use a line iterator and keep adding to the HashSet as you read. You can use a ConcurrentHashMap and more than one thread to read the files and add to this Map. Another multi-threaded option is ConcurrentSkipListSet. In this case, you implement compareTo() instead of equals()/hashCode() (compareTo() == 0 means duplicate) and keep adding to this SortedSet.
Fits in memory
Design an object that holds your data, implement good equals()/hashCode() methods and put all the objects in a HashSet.
Or use the methods given above (you probably don't want to persist to disk though).
Oh, and if I were you, I would put a unique constraint on the DB anyway.
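As a rough illustration of the "results will fit" case, here is a minimal sketch that reads rows one at a time and keeps only the first occurrence of each. The comma-separated "email,productId" line format, the lower-casing of the email, and the names StreamingDedup and Row are assumptions made for the example, not something given in the question.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.Objects;
import java.util.Set;

class StreamingDedup {

    /** One row of the file; two rows are duplicates when email and productId match. */
    static final class Row {
        final String email;
        final String productId;

        Row(String email, String productId) {
            this.email = email;
            this.productId = productId;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Row)) return false;
            Row other = (Row) o;
            return email.equals(other.email) && productId.equals(other.productId);
        }

        @Override
        public int hashCode() {
            return Objects.hash(email, productId);
        }
    }

    /** Consumes lines lazily; only the unique rows are ever held in memory. */
    static List<Row> dedup(Iterator<String> lines) {
        Set<Row> seen = new HashSet<>();
        List<Row> unique = new ArrayList<>();
        while (lines.hasNext()) {
            String[] parts = lines.next().split(",", 2);
            if (parts.length < 2) continue; // skip malformed lines (assumed policy)
            // Assumption: emails are compared case-insensitively.
            Row row = new Row(parts[0].trim().toLowerCase(Locale.ROOT), parts[1].trim());
            if (seen.add(row)) { // add() returns false when the row was already present
                unique.add(row);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        List<String> sample = List.of("a@x.com,p1", "b@x.com,p2", "A@x.com,p1", "a@x.com,p3");
        System.out.println(dedup(sample.iterator()).size() + " unique rows"); // prints "3 unique rows"
    }
}
```

For a real file, the iterator could come from something like Files.lines(path).iterator(), so the whole file is never loaded at once.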

I will start with the obvious answer. Make a HashMap with the email id as the key and the rest of the information as the value (or make an object to hold all the information). When you get to a new line, check whether the key already exists; if it does, move on to the next line. At the end, write out all your SQL statements using the HashMap. I do agree with eqbridges that memory constraints will be important if you have a "gazillion" rows.

You have two options:
Do it in Java: you could put together something like a HashSet for testing, adding the email id of each incoming item if it doesn't already exist in the set.
Do it in the database: put a unique constraint on the table, so that duplicates will not be added to the table. An added bonus is that you can repeat the process and remove duplicates left over from previous runs.

Take a look at Duke (https://github.com/larsga/Duke), a fast dedupe and record linkage engine written in Java. It uses Lucene to index the records and reduce the number of comparisons (avoiding the unacceptable Cartesian product comparison). It supports the most common algorithms (edit distance, Jaro-Winkler, etc.) and it is extremely extensible and configurable.

Can you not index the table by email and product ID? Then reading by index should make duplicates of either email or email+prodId readily identified via sequential reads and simply matching the previous record.

Your problem can be solved with an extract, transform, load (ETL) approach:
You load your data in an import schema;
Do every transformation you like on the data;
Then load it into the target database schema.
You can do this manually or use an ETL tool.

Related

Lookup Data in Java

I am writing an application that needs to look up data from a table (20x200) for calculation inputs. The table is filled with constants (i.e. I do not need to write to the table). I am still a novice programmer and have not had a lot of experience with databases, and so prior to proceeding I would like to know the best way to achieve this.
I had intended to place the data in an array and simply perform the lookup with two loops (one row lookup and one column lookup); however, I feel this is very inefficient. Is it worth looking into a database such as SQLite, or is that overkill for a relatively small data set with no requirement for editing?

As often, the answer is: It depends.
Do you need some advanced querying, like the sum of all values in the x column for which the value in the y column is greater than 23? If so, an in-memory SQL database comes in handy. Otherwise it would just be overkill.
Assuming the database is out of the discussion, the next questions are: Do you need single values, complete (or large parts of) columns or rows? And what are the natural "names" of your columns and rows.
Here are some options:
"names" are consecutive integers: Use a 2D array. (I wouldn't use arrays very often in Java, but in a read-only situation with fixed lengths everything else sounds like too much overhead.) By choosing the order of the indices, i.e. rows first vs. columns first, you can get complete columns or rows easily and efficiently.
"names" are not consecutive, or are Strings or other objects: Use a Map of Maps if you need access to complete rows or columns. If you only need single values, create a Pair type and use it as the key for the map.
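The two options above could look roughly like this. It is only a sketch: the class and field names (LookupTables, Key, BY_INDEX, BY_NAME), the row and column names, and all the numbers are placeholders invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

class LookupTables {

    /** Option 1: rows and columns are numbered 0..n-1, so a constant 2D array is enough. */
    static final double[][] BY_INDEX = {
        {1.0, 1.5, 2.0},
        {2.5, 3.0, 3.5},
    };

    /** Option 2: a small pair type used as the map key when the "names" are arbitrary objects. */
    static final class Key {
        final String row;
        final String column;

        Key(String row, String column) {
            this.row = row;
            this.column = column;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) return false;
            Key other = (Key) o;
            return row.equals(other.row) && column.equals(other.column);
        }

        @Override
        public int hashCode() {
            return Objects.hash(row, column);
        }
    }

    static final Map<Key, Double> BY_NAME = new HashMap<>();
    static {
        // Placeholder constants; a real table would be filled from a file or literal data.
        BY_NAME.put(new Key("rowA", "col1"), 10.0);
        BY_NAME.put(new Key("rowA", "col2"), 20.0);
        BY_NAME.put(new Key("rowB", "col1"), 30.0);
    }

    /** Single-value lookup in expected constant time, with no nested loops. */
    static double lookup(String row, String column) {
        Double value = BY_NAME.get(new Key(row, column));
        if (value == null) {
            throw new IllegalArgumentException("no entry for " + row + "/" + column);
        }
        return value;
    }
}
```

With the array, BY_INDEX[1][2] reads one cell directly and BY_INDEX[1] is a whole row; with the map, lookup("rowA", "col2") replaces the two search loops from the question.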

1) You can use an in-memory database like the H2 Database Engine.
For this you just need to include a jar, and data retrieval will be very fast.
It can hardly be considered overhead for your application.
2) Or you can use a Map<Key, Map<String, String>> for the lookup.
For the outer Map, the key will be your record id, and for the inner Map the key will be your column name.
Whether to make it static or not I leave for you to decide.
3) You can also explore caching options like Ehcache.

Hbase sort on column qualifiers

I have an HBase table with a couple of million records. Each record has a couple of properties describing it, each stored in a column qualifier (mostly int or string values).
I have a requirement that I should be able to see the records paginated and sorted based on a column qualifier (or even more than one, in the future). What would be the best approach to do this? I have looked into secondary indexes using coprocessors (mostly hindex from Huawei), but it doesn't seem to match my use case exactly. I've also thought about replicating all the data into multiple tables, one for each sort property, which would be included in the rowkey, and then redirecting queries to those tables. But this seems very tedious, as I already have quite a few of these properties.
Thanks for any suggestions.

You need your NoSQL database to work just like an RDBMS, and given the size of your data your life would be a lot simpler if you stuck with an RDBMS, unless you expect exponential growth :) Also, you don't mention whether your data gets updated; this is very important for making a good decision.
Having said that, you have a lot of options, here are some:
If you can wait for the results: Write a MapReduce task to do the scan, sort it and retrieve the top X rows. Do you really need more than 1000 pages (20-50k rows) for each sort type? Another option would be using something like Hive.
If you can aggregate the data and "reduce" the dataset: Write a MapReduce task to periodically export the newest aggregated data to a SQL table (which will handle the queries). I've done this a few times and it works like a charm, but it depends on your requirements.
If you have plenty of storage: Write a MapReduce task to periodically regenerate (or append data to) a new table for each property (sorting by it in the rowkey). You don't need multiple tables; just use a prefix in your rowkeys for each case. Or, if you do not want tables and you won't have a lot of queries, simply write the sorted data to CSV files and store them in HDFS, where they can easily be read by your frontend app.
Manually maintain a secondary index: This would not be very tolerant of schema updates and new properties, but would work great for near real-time results. To do it, you have to update your code to also write to the secondary table, with a good buffer to help with performance while avoiding hot regions. Think about this type of rowkey: [4B SORT FIELD ID (4 chars)] [8B SORT FIELD VALUE] [8B timestamp], with just one column storing the rowkey of the main table. To retrieve the data sorted by any of the fields, just perform a SCAN using the SORT FIELD ID as the start row plus the starting sort field value as the pivot for pagination (omit it to get the first page, then set it to the last value retrieved). That way you'll have the rowkeys of the main table, and you can just perform a multiget on it to retrieve the full data. Keep in mind that you'll need a small script to scan the main table and write the data to the index table for the existing rows.
Rely on one of the automatic secondary indexing solutions based on coprocessors, like you mentioned, although I do not like this option at all.
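To make the rowkey layout from the manual secondary index option concrete, here is a small sketch that builds the 20-byte key with plain java.nio, without touching the HBase client API. The class name IndexRowKey, the field id "AGE_" in the usage note, and the restriction to non-negative values are assumptions of this example. The restriction matters because HBase orders rowkeys as unsigned bytes, so a big-endian two's-complement negative long would sort after the positive ones.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class IndexRowKey {

    /** 4-byte field id + 8-byte sort value + 8-byte timestamp. */
    static final int KEY_LENGTH = 4 + 8 + 8;

    static byte[] build(String fieldId, long sortValue, long timestamp) {
        byte[] id = fieldId.getBytes(StandardCharsets.US_ASCII);
        if (id.length != 4) {
            throw new IllegalArgumentException("field id must be exactly 4 ASCII chars");
        }
        if (sortValue < 0 || timestamp < 0) {
            // Assumption of this sketch: negative numbers are rejected rather than re-encoded.
            throw new IllegalArgumentException("negative values would not sort correctly");
        }
        return ByteBuffer.allocate(KEY_LENGTH) // ByteBuffer is big-endian by default
                .put(id)
                .putLong(sortValue)
                .putLong(timestamp)
                .array();
    }
}
```

With this encoding, build("AGE_", 5, 1) compares lower than build("AGE_", 300, 1) byte by byte, so a scan over the "AGE_" prefix returns index rows in ascending order of the sort value.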

You have mostly enumerated the options. As you are aware, HBase does not natively support secondary indexes. In addition to hindex, you may consider Phoenix
https://github.com/forcedotcom/phoenix
(from Salesforce), which in addition to secondary indexes has a JDBC driver and SQL support.

MongoDB Java insert vs. update and compare changes

I have a large collection of roughly 3.2 million records, this collection data is being updated monthly but the source data is being fetched as-is, meaning I don't get just the updated records but everything.
In terms of performance, is it better to simply remove the collection and insert everything or do an update for each record?
Also is there a good way to compare existing record with the one being read from the source to check if there's any change?
Thanks.

Also is there a good way to compare existing record with the one being read from the source to check if there's any change?
You're looking for a change detection system: it's a problem commonly described for ETL systems. I suggest you read something about ETL processes (Kimball's Data Warehouse ETL Toolkit is a good source). In general, detecting changes is a hard problem and involves the use of snapshots in order to calculate differences. If you're sure that your collection will always remain in MongoDB storage, you can see whether it's possible to make use of the mongo log.
Furthermore, consider that change detection is tightly coupled to the structure and the meaning of your data: e.g. if you have an insertion-only collection, you can get the changed data via _id.
The problem is too complex to give answers like "do this and that and you'll get it"; you have to analyze your data and work out which method is best. Refer to the literature to find known solutions and avoid reinventing the wheel.
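As one very simplified illustration of snapshot-based change detection, the sketch below compares two snapshots that map a record id to some fingerprint of the record's content (for example a hash of its serialized fields). The class SnapshotDiff, the use of String ids and fingerprints, and the assumption that both snapshots fit in memory are all choices made for this example; they are not a MongoDB feature.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class SnapshotDiff {

    /** Ids grouped by what happened to them between the two snapshots. */
    static final class Changes {
        final Set<String> inserted = new TreeSet<>();
        final Set<String> updated = new TreeSet<>();
        final Set<String> deleted = new TreeSet<>();
    }

    /**
     * previous and current map record id -> content fingerprint.
     * Values are assumed to be non-null.
     */
    static Changes diff(Map<String, String> previous, Map<String, String> current) {
        Changes changes = new Changes();
        for (Map.Entry<String, String> entry : current.entrySet()) {
            String oldFingerprint = previous.get(entry.getKey());
            if (oldFingerprint == null) {
                changes.inserted.add(entry.getKey());   // id not seen last month
            } else if (!oldFingerprint.equals(entry.getValue())) {
                changes.updated.add(entry.getKey());    // same id, different content
            }
        }
        for (String id : previous.keySet()) {
            if (!current.containsKey(id)) {
                changes.deleted.add(id);                // id disappeared from the source
            }
        }
        return changes;
    }
}
```

Only the ids in inserted, updated and deleted would then need to be written, which is the case where per-record updates can beat a full reload.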
In terms of performance, is it better to simply remove the collection and insert everything or do an update for each record?
Once again, you have to know how your data is structured. If more of your collection changes than stays constant, you'd better reload the entire collection and avoid tracking changes. If the changeset is considerably smaller than the whole collection, updating the existing documents leads to better performance.
Hope this helps.

What is the best way to iterate and process an entire table from database?

I have a table called Token in my database that represents tokenized texts.
Each row has attributes like textblock, sentence and position (for identifying the text that the token is from) and logical fields like text, category, chartype, etc.
What I want to do is iterate over all tokens to find patterns and perform some operations. For example, merging two adjacent tokens that both have the category Name into one (and, after this, resetting the positions). I think that I will need some kind of list.
What is the best way to do this: SQL queries to find the patterns, or iterating over all tokens in the table? I think the queries would be very complex, and iterating over a list might be simpler, but I don't know which way to go (for example, retrieving the rows into a Java list, or using a language in which I can iterate and make changes directly in the database).
So that this question is not closed: what I want to know is the most recommended way to do this. I'm using Java, but if another language is better, no problem. I think I will need to use R for some statistical calculations.
Edit: The table is large (millions of rows); loading it entirely into memory is not possible.

If you are working with a small table, or proving out a merge strategy, then just set up a query that finds all of the candidate duplicate lines and dump the relevant columns out to a table. Then view that table in a text editor or spreadsheet to see if your hypothesis about the duplication is correct.
Keep in mind that any time you try to merge two rows into one, you will be deleting data. Worst case is that you might merge ALL of your rows into one. Proceed with caution!

This is an engineering decision to be made, based mostly on the size of the corpus you want to maintain, and the kind of operations you want to perform on them.
If the size gets bigger than "what fits in the editor", you'll need some kind of database. That may or may not be an SQL database. But there is also the code part: if you want to perform non-trivial operations on the data, you might need a real programming language (could be anything: C, Java, Python; anything goes). In that case, the communication with the database will become a bottleneck: you need to generate queries that produce results that fit in the application program's memory. SQL is powerful enough to represent and store N-grams and do some calculations on them, but that is about as far as you are going to get. In any case the database has to be fully normalised, and that will make it more difficult to understand for non-DBAs.
My own toy project, http://sourceforge.net/projects/wakkerbot/ used a hybrid approach:
the data was obtained by a python crawler
the corpus was stored as-is in the database
the actual (modified MegaHal) Markov code stores its own version of the corpus in a (binary) flat file, containing the dictionary, the N-grams, and the associated coefficients.
the training and text generation is done by a highly optimised C program
the output was picked up by another python script, and submitted to the target.
[in another life, I would probably have done some more normalisation, and stored N-grams or trees in the database. That would possibly cause the performance to drop to only a few generated sentences per second. It now is about 4000/sec]
My gut feeling is that what you want is more like a "linguistic workbench" than a program that does exactly one task efficiently (like wakkerbot). In any case you'll need to normalise a bit more: store the tokens as {tokennumber,tokentext} and refer to them only by number. Basically, a text is just a table (or array) containing a bunch of token numbers. An N-gram is just a couple of tokennumbers+the corresponding coefficients.
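The {tokennumber, tokentext} normalisation can be sketched in a few lines of Java. This is an in-memory toy with an invented class name (TokenDictionary), meant only to show the idea; in the setup described above the dictionary would live in a database table or flat file.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TokenDictionary {

    private final Map<String, Integer> numberByText = new HashMap<>();
    private final List<String> textByNumber = new ArrayList<>();

    /** Returns the existing number for a token text, or assigns the next free one. */
    int intern(String text) {
        Integer existing = numberByText.get(text);
        if (existing != null) {
            return existing;
        }
        int number = textByNumber.size();
        numberByText.put(text, number);
        textByNumber.add(text);
        return number;
    }

    /** Reverse lookup: token number back to its text. */
    String text(int number) {
        return textByNumber.get(number);
    }

    /** A text becomes nothing more than an array of token numbers. */
    int[] encode(String[] tokens) {
        int[] numbers = new int[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            numbers[i] = intern(tokens[i]);
        }
        return numbers;
    }
}
```

Encoding the tokens "the", "cat", "the" with a fresh dictionary yields the numbers 0, 1, 0, and an N-gram is then just a short run of such numbers plus its coefficients.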

This is not the most optimized method, but it's a design that allows you to write the code easily.
Write an entity class that represents a row in your table.
Write a factory method that returns the entity object for a given row id, i.e. a method that creates an object of the entity class with the values from the specified row.
Write methods that remove a given row object from the table and insert one into it.
Write a row counting method.
Now you can iterate over your table from your Java code. Remember that if you merge two rows, you need to adjust the next index correctly.
This method uses little memory, but you will be issuing a lot of queries to create the row objects.
The concept is very similar or identical to ORM (Object Relational Mapping). If you know how to use Hibernate or another ORM, then try those libraries.

IMO it'd be easier, and likely faster overall, to load everything into Java and do your operations there to avoid continually re-querying the DB.
There are some pretty strong numerical libs for Java and statistics, too; I wouldn't dismiss it out-of-hand until you're sure what you need isn't available (or is too slow).

This sounds like you're designing a text search engine. You should first see if pgsql's full text search engine is right for you.
If you do it without full text search, loading pl/R into pgsql and learning to drive it is likely to be the fastest and most efficient solution. It'll allow you to put all this work into a few well thought out lines of R, and do it all in the db, where access to the data is closest. The only time to avoid such a plan is when it would make the database server work VERY hard, like holding the dataset in memory and cranking a single CPU core across it. Then it's OK to do it app side.
Whether you use pl/R or not, access large data sets through a cursor; it's by far the most efficient way to get either single rows or smaller subsets of rows. If you do it with a select with a where clause for each thing you want to process, then you don't have to hold all those rows in memory at once. You can grab and discard parts of result sets while doing things like running averages etc.
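The "grab and discard" style of processing can be sketched for the Name-merging example from the question. Here a plain Iterator stands in for the database cursor, and the Token class, the category value "Name", and joining texts with a space are assumptions of the sketch. It also ignores textblock and sentence boundaries, which real code would have to check before merging.

```java
import java.util.Iterator;
import java.util.function.Consumer;

class AdjacentNameMerger {

    static final class Token {
        final String text;
        final String category;

        Token(String text, String category) {
            this.text = text;
            this.category = category;
        }
    }

    /**
     * Streams tokens in position order and merges runs of adjacent "Name" tokens.
     * Only one pending token is held in memory at any time.
     */
    static void mergeNames(Iterator<Token> cursor, Consumer<Token> out) {
        Token pending = null;
        while (cursor.hasNext()) {
            Token next = cursor.next();
            boolean bothNames = pending != null
                    && "Name".equals(pending.category)
                    && "Name".equals(next.category);
            if (bothNames) {
                // Extend the pending name instead of emitting it.
                pending = new Token(pending.text + " " + next.text, "Name");
            } else {
                if (pending != null) {
                    out.accept(pending); // emit and forget the previous token
                }
                pending = next;
            }
        }
        if (pending != null) {
            out.accept(pending); // flush the last token
        }
    }
}
```

The consumer passed as out could write each emitted token to a new table with freshly assigned positions, which avoids updating rows in place while the cursor is still open.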
Think about scale here. If you had a 5 TB database, how would you access it to do this the fastest? A poor scaling solution will come back to bite you even if it's only accessing 1% of the data set. And if you're already starting on a pretty big dataset today, it'll just get worse with time.
pl/R http://www.joeconway.com/plr/

Java data structure to use with Hibernate to store unknown number of parameters?

Following problem: I want to render a news stream of short messages based on localized texts. In various places of these messages I have to insert parameters to "customize" them. I guess you know what I mean ;)
My question probably falls into the "Which is the best style to do it?" category: How would you store these parameters (they may be Strings and Numbers that need to be formatted according to Locale) in the database? I'm using Hibernate to do the ORM and I can think of the following solutions:
build a combined String and save it as such (ugly and hard to maintain I think)
do some kind of fancy normalization and make every parameter a single row in the database (clean I guess, but a performance nightmare)
put the params into an Array, Map or other Java data structure and save it in binary format (probably causes a lot of overhead size-wise)
I tend towards option #3, but I'm afraid that it might be too costly in terms of size in the database. What do you think?

If you can afford the performance hit of using the normalized approach of having a separate table I would go with this approach. We use the same approach as your first suggestion at work, and it gets messy, especially when you reach the column limit and key/values start getting truncated!

Do the normalization.
I would suggest something like:
Table Message
id
Table Params
message_id
key
value
Storing serialized Java objects in the database is quite a bad thing in most cases, as they are hard to maintain and you cannot access them with 'simple' SQL tools.
The performance impact is not that big, as you can fetch everything together in a single select using a join.
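To show what the single joined select gives you on the Java side, here is a sketch that regroups the flat result rows into one parameter map per message. The JoinedRow class is a stand-in for whatever the query (or Hibernate) returns, and all names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MessageParams {

    /** One row of the Message-join-Params result: (message_id, key, value). */
    static final class JoinedRow {
        final long messageId;
        final String key;
        final String value;

        JoinedRow(long messageId, String key, String value) {
            this.messageId = messageId;
            this.key = key;
            this.value = value;
        }
    }

    /** Groups the flat join result into message id -> (param key -> param value). */
    static Map<Long, Map<String, String>> group(List<JoinedRow> rows) {
        Map<Long, Map<String, String>> byMessage = new LinkedHashMap<>();
        for (JoinedRow row : rows) {
            byMessage
                    .computeIfAbsent(row.messageId, id -> new LinkedHashMap<>())
                    .put(row.key, row.value);
        }
        return byMessage;
    }
}
```

Because values stay in their own rows, they remain queryable with ordinary SQL, and formatting them according to the Locale can happen when the message is rendered.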

It depends a bit. Is the number of parameters for each entity huge? If it is not, the second option is probably the best.
If you don't want the extra queries caused by lazy loading, you can always change the fetch type for the variable number of parameters; that would only add one join to a query you were always doing anyway. Under normal conditions it is not a big price to pay.
Also, the third and the first options forever forbid any type of query over the parameters. That is a huge technical debt for the future that I would not be willing to pay.

Just build it as a string and save it.
