Lookup Data in Java

Lookup Data in Java - java

I am writing an application that needs to look up data from a table (20x200) for calculation inputs. The table is filled with constants (i.e. I do not need to write to the table). I am still a novice programmer and have not had a lot of experience with databases, and so prior to proceeding I would like to know the best way to achieve this.
I had intended to place the data in an array and simply perform the lookup with 2 loops (one row look up and one column lookup) however I feel this is very inefficient. Is it worth looking into A database such as SQLite? or is that overkill for what is a relatively small data set with no requirement for editing?

As often, the answer is: It depends.
Do you need some advanced querying, like the sum of all values in the x column for which the value in the y column is greater then 23. If so a in memory SQL database comes in handy. Otherwise it would just be overkill.
Assuming the database is out of the discussion, the next questions are: Do you need single values, complete (or large parts of) columns or rows? And what are the natural "names" of your columns and rows.
Here are some options:
"names" are continuous integers: Use a 2D array (I wouldn't use arrays very often in Java, but in a read only situation with fixed lengths everything else sounds like to much overhead. By choosing the order of the indices, i.e. rows first vs. columns first you can get complete columns/rows very easy and efficient.
"names" are not continuous, Strings or any other objects: Use a Map of Maps if you need access to complete rows or columns. If you only need single values, create a Pair type and use it a the key for the map.

1) You can use a in-memory database like H2-Datbase Engine.
For which you just need to include a jar and data retrieval will be very fast.
It can't be considered as an overhead on your application.
2) Or you can use a Map<key,Map<String,string>> for the lookup.
For the main Map, key will be your record id, and for inner Map key will be your column name.
Whether to make it static or not I leave that on you to decide.
3) You can also explore caching options like ehcache.

Related

Large 2D Array Storage in Java (Android)

I'm creating a matrix in Java, which:
Can be up to 10,000 x 10,000 elements in the worst case
May change size from time to time (assume on the order of days)
Stores an integer in the range 0-5 inclusive (presumably a byte)
Has elements accessed by referring to a pair of Long IDs (system-determined)
Is symmetrical (so can be done in half the space, if needed, although it makes things like summing the rows harder (or impossible if the array is unordered))
Doesn't necessarily need to be ordered (unless halved into a triangle, as explained above)
Needs to be persistent after the app closes (currently it's being written to file)
My current implementation is using a HashMap<Pair<Long,Long>,Integer>, which works fine on my small test matrix (10x10), but according to this article, is probably going to hit unmanageable memory usage when expanded to 10,000 x 10,000 elements.
I'm new to Java and Android and was wondering: what is the best practice for this sort of thing?
I'm thinking of switching back to a bog standard 2D array byte[][] with a HashMap lookup table for my Long IDs. Will I take a noticable performance hit on matrix access? Also, I take it there's no way of modifying the array size without either:
Pre-allocating for the assumed worst-case (which may not even be the worst case, and would take an unnecessary amount of memory)
Copying the array into a new array if a size change is required (momentarily doubling my memory usage)

Thought I'd answer this for posterity. I've gone with Fildor's suggestion of using an SQL database with two look-up columns to represent the row and column indices of my "matrix". The value is stored in a third column.
The main benefit of this approach is that the entire matrix doesn't need to be loaded into RAM in order to read or update elements, with the added benefit of access to summing functions (and any other features inherently in SQL databases). It's a particularly easy method on Android, because of the built-in SQL functionality.
One performance drawback is that the initialisation of the matrix is extraordinarily slow. However, the approach I've taken is to assume that if an entry isn't found in the database, it takes a default value. This eliminates the need to populate the entire matrix (and is especially useful for sparse matrices), but has the downside of not throwing an error if trying to access an invalid index. It is recommended that this approach is coupled with a pair of lists that list the valid rows and columns, and these lists are referenced before attempting to access the database. If you're trying to sum rows using the built-in SQL features, this will also not work correctly if your default is non-zero, although this can be remedied by returning the number of entries found in the row/column being summed, and multiplying the "missing" elements by the default value.

Hbase sort on column qualifiers

I have an Hbase table with a couple of million records. Each record has a couple of properties describing the record, stored each in a column qualifier.(Mostly int or string values)
I have a a requirement that I should be able to see the records paginated and sorted based on a column qualifier (or even more than one, in the future). What would be a best approach to do this? I have looked into secondary indexes using coprocessors (mostly hindex from huawei), but it doesn't seem to match my use case exactly. I've also thought about replicating all the data into multiple tables, one for each sort property, which would be included in the rowkey and then redirect queries to those tables. But this seems very tedious as I have a few so called properties already..
Thanks for any suggestions.

You need your NoSQL database to work just like a RDBMS, and given the size of your data your life would be a lot simpler if you stick to it, unless you expect exponential growth :) Also, you don't mention if your data gets updated, this is very important to make a good decision.
Having said that, you have a lot of options, here are some:
If you can wait for the results: Write a MapReduce task to do the scan, sort it and retrieve the top X rows, do you really need more than 1000 pages (20-50k rows) for each sort type?. Another option would be using something like Hive.
If you can aggregate the data and "reduce" the dataset: Write a MapReduce task to periodically export the newest aggregated data to a SQL table (which will handle the queries). I've done this a few times to and it works like a charm, but it depends on your requirements.
If you have plenty of storage: Write a MapReduce task to periodically regenerate (or append the data) a new table for each property (sorting by it in the row-key). You don't need multiple tables, just use a prefix in your rowkeys for each case, or, if you do not want tables and you won't have a lot queries, simply write the sorted data to csv files and store them in the HDFS, they could be easily read by your frontend app.
Manually maintain a secondary index: Which would not very tolerant to schema updates and new properties but would work great for near real-time results. To do it, you have to update your code to also to write to the secondary table with a good buffer to help with performance while avoiding hot regions. Think about this type of rowkeys: [4B SORT FIELD ID (4 chars)] [8B SORT FIELD VALUE] [8B timestamp], with just one column storing the rowkey of the main table. To retrieve the data sorted by any of the fields just perform a SCAN using the SORT FIELD ID as start row + the starting sort field value as pivot for pagination (ignore it to get the first page, then set the last one retrieved), that way you'll have the rowkeys of the main table, and you can just perform a multiget to it to retrieve the full data. Keep in mind that you'll need a small script to scan the main table and write the data to the index table for the existing rows.
Rely on any of the automatic secondary indexing through coprocessors like you mentioned, although I do not like this option at all.

You have mostly enumerated the options. HBase natively does not support secondary indexes as you are aware. In addition to hindex you may consider phoenix
https://github.com/forcedotcom/phoenix
( from SalesForce) which in addition to secondary indexes has jdbc driver and sql support.

Is it better to use HBase columns or serialize data using Avro?

I working on a project that stores key/value information on a user using HBase. We are in the process of redesiging the HBase schema we are using. The two options being discussed are:
Use HBase column qualifiers as names for the keys. This would make rows wide, but very sparse.
Dump all the data into a single column and serialize it using Avro or Thrift.
What are the design tradeoffs of the two approaches? Is one preferable to the other? Are they are any reasons not to store the data using Avro or Thrift?

In summary, I lean towards using distinct columns per key.
1) Obviously, you are imposing that the client uses Avro/Thrift, which is another dependency. This dependency means you may remove the possibility of certain tooling, like BI tools which expect to find values in the data without transformation.
2) Under the avro/thrift scheme, you are pretty much forced to bring the entire value across the wire. Depending on how much data is in a row, this may not matter. But if you are only interested in 'city' fields/column-qualifier, you still have to get 'payments', 'credit-card-info', etc. This may also pose a security issue.
3) Updates, if required, will be more challenging with Avro/Thrift. Example: you decide to add a 'hasIphone6' key. Avro/Thrift: You will be forced to delete the row and create a new one with the added field. Under the column scheme, a new entry is appended, with only the new column. For a single row, not big, but if you do this to a billion rows, there will need to be a big compaction operation.
4) If configured, you can use compression in HBase, which may exceed the avro/thrift serialization, since it can compress across a column family, instead of just for the single record.
5) BigTable implementations like HBase do very well with very wide, sparse tables, so there won't be a performance hit like you might expect.

The right answer to this is a bit more complicated, so I'll give you the tl;dr first.
Use Avro/Thrift/Protobuf
You will need to strike a balance between how many fields to pack in a record vs. columns.
You'll typically want to put fields ("keys" in your original question) that are frequently accessed together into something like an avro record because as mentioned by cmonkey you don't want the overhead of retrieving extra data you won't use.
By making your row very wide, you'll increase seek times when fetching a subset of columns because of how HFiles are stored. Again, determining what is optimal, comes down to your access patterns.
I would also like to point out that by using something like avro, you're also providing yourself with evolvability. You don't need to delete the row and re-add it with the record containing a new field. Avro has rules for backward-compatibility and forward-compatibility. This actually makes your life much much easier because you can read both new and old records WITHOUT rewriting your data or forcing updates to older client code.
You should nearly always use compression in HBase (SNAPPY is always a good choice).

What is the best way to iterate and process an entire table from database?

I have a table called Token in my database that represents texts tokenized.
Each row haves attributes like textblock, sentence and position(for identifying the text that the token is from) and logical fields like text, category, chartype, etc.
What I want to know is iterate over all tokens to find patterns and do some operations. For example, merging two adjacent tokens that have the category as Name into one (and after this, reset the positions). I think that I will need some kind of list
What is the best way to do this? With SQL queries to find the patterns or iterating over all tokens in the table. I think the queries will be complex a lot and maybe, iterating as a list will be more simple, but I don't know which is the way (as example, retrieving to a Java list or using a language that I can iterate and do changes right on database).
To this question not be closed, what I want to know is what the most recommended way to do this? I'm using Java, but if other language is better, no problem, I think I will need use R to do some statistic calculus.
Edit: The table is large, millions rows, load entire in memory is not possible.

If you are working with a small table, or proving out a merge strategy, then just setup a query that finds all of the candidate duplicate lines and dump the relevant columns out to a table. Then view that table in a text editor or spreadsheet to see if your hypothesis about the duplication is correct.
Keep in mind that any time you try to merge two rows into one, you will be deleting data. Worst case is that you might merge ALL of your rows into one. Proceed with caution!

This is an engineering decision to be made, based mostly on the size of the corpus you want to maintain, and the kind of operations you want to perform on them.
If the size gets bigger than "what fits in the editor", you'll need some kind of database. That may or may not be an SQL database. But there is also the code part: if you want perform non-trivial operations on the data, you might need a real programming language (could be anything: C, Java, Python. anything goes). In that case, the communication with the database will become a bottleneck: you need to generate queries that produce results that fit in the application programme's memory. SQL is powerful enough to represent and store N-grams and do some calculations on them, but that is about as far as you are going to get. In any case the database has to be fully normalised, and that will cause it to be more difficult to understand for non-DBAs.
My own toy project, http://sourceforge.net/projects/wakkerbot/ used a hybrid approach:
the data was obtained by a python crawler
the corpus was stored as-is in the database
the actual (modified MegaHal) Markov code stores it's own version of the corpus in a (binary) flatfile, containing the dictionary, N-grams, and the associated coefficients.
the training and text generation is done by a highly optimised C program
the output was picked up by another python script, and submitted to the target.
[in another life, I would probably have done some more normalisation, and stored N-grams or trees in the database. That would possibly cause the performance to drop to only a few generated sentences per second. It now is about 4000/sec]
My gut feeling is that what you want is more like a "linguistic workbench" than a program that does exactly one task efficiently (like wakkerbot). In any case you'll need to normalise a bit more: store the tokens as {tokennumber,tokentext} and refer to them only by number. Basically, a text is just a table (or array) containing a bunch of token numbers. An N-gram is just a couple of tokennumbers+the corresponding coefficients.

This is not the most optimized method but it's a design that allows you to write the code easily.
write an entity class that represent a row in your table.
write a factory method that allows you to get the entity object of a given row id, i.e. a method that create an object of entity class witht the values from the specified row.
write methods that remove and insert a given row object into table.
write a row counting method.
now, you can try to iterate your table using your java code. remember that if you merge between two row, you need to correctly adjust the next index.
This method allows you use small memory but you will be using a lot of query to create the row.
The concept is very similar or identical to ORM (Object Relational Mapping). If you know how tho use hibernate or other ORM then try those libraries.

IMO it'd be easier, and likely faster overall, to load everything into Java and do your operations there to avoid continually re-querying the DB.
There are some pretty strong numerical libs for Java and statistics, too; I wouldn't dismiss it out-of-hand until you're sure what you need isn't available (or is too slow).

This sounds like you're designing a text search engine. You should first see if pgsql's full text search engine is right for you.
If you do it without full text search, loading pl into pgsql and learning to drive it is likely to be the fastest and most efficient solution. It'll allow you to put all this work into a few well thought out lines of R, and do it all in the db where access to the data is closest. the only time to avoid such a plan is when it would make the database server work VERY hard, like holding the dataset in memory and cranking a single cpu core across it. Then it's ok to do it app side.
Whether you use pl/R or not, access large data sets in a cursor, it's by far the most efficient way to get either single or smaller subsets of rows. If you do it with a select with a where clause for each thing you want to process then you don't have to hold all those rows in memory at once. You can grab and discard parts of result sets while doing things like running averages etc.
Think about scale here. If you had a 5 TB database, how would you access it to do this the fastest? A poor scaling solution will come back to bite you even if it's only accessing 1% of the data set. And if you're already starting on a pretty big dataset today, it'll just get worse with time.
pl/R http://www.joeconway.com/plr/

java efficient de-duplication

Lets say you have a large text file. Each row contains an email id and some other information (say some product-id). Assume there are millions of rows in the file. You have to load this data in a database. How would you efficiently de-dup data (i.e. eliminate duplicates)?

Insane number of rows
Use Map&Reduce framework (e.g. Hadoop). This is a full-blown distributed computing so it's an overkill unless you have TBs of data though. ( j/k :) )
Unable to fit all rows in memory
Even the result won't fit : Use merge sort, persisting intermediate data to disk. As you merge, you can discard duplicates (probably this sample helps). This can be multi-threaded if you want.
The results will fit : Instead of reading everything in-memory and then put it in a HashSet (see below), you can use a line iterator or something and keep adding to this HashSet. You can use ConcurrentHashMap and use more than one thread to read files and add to this Map. Another multi-threaded option is to use ConcurrentSkipListSet. In this case, you will implement compareTo() instead of equals()/hashCode() (compareTo()==0 means duplicate) and keep adding to this SortedSet.
Fits in memory
Design an object that holds your data, implement a good equals()/hashCode() method and put them all in a HashSet.
Or use the methods given above (you probably don't want to persist to disk though).
Oh and if I were you, I will put the unique constraint on the DB anyways...

I will start with the obvious answer. Make a hashmap and put the email id in as the key and the rest of the information in to the value (or make an object to hold all the information). When you get to a new line, check to see if the key exists, if it does move to the next line. At the end write out all your SQL statements using the HashMap. I do agree with eqbridges that memory constraints will be important if you have a "gazillion" rows.

You have two options,
do it in Java: you could put together something like a HashSet for testing - adding an email id for each item that comes in if it doesnt exist in the set.
do it in the database: put a unique constraint on the table, such that dups will not be added to the table. An added bonus to this is that you can repeat the process and remove dups from previous runs.

Take a look at Duke (https://github.com/larsga/Duke) a fast dedupe and record linkage engine written in java. It uses Lucene to index and reduce the number of comparison (to avoid the unacceptable Cartesian product comparison). It supports the most common algorithm (edit distance, jaro winkler, etc) and it is extremely extensible and configurable.

Can you not index the table by email and product ID? Then reading by index should make duplicates of either email or email+prodId readily identified via sequential reads and simply matching the previous record.

Your problem can be solve with a Extract, transform, load (ETL) approach:
You load your data in an import schema;
Do every transformation you like on the data;
Then load it into the target database schema.
You can do this manually or use an ETL tool.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.