GridGain cache access performance: SQL query vs cache.get(key)? (Java)

I am accessing a GridGain cache for a large number of keys. I have two options for getting the values:
Access the cache and get the value for each key inside an IgniteClosure, then return the result.
Execute an org.apache.ignite.cache.query.SqlQuery on the cache and collect the results.
Here are my questions:
What is the recommended/optimal approach in this scenario?
Why would one be slower than the other (for example, is query parsing an extra overhead)?

Have you considered doing a getAll(Set<K> keys) operation? It sounds like it suits your use case perfectly.
If you have even more data, consider collocated processing with a local ScanQuery or a map/reduce ComputeTask/ComputeJob.
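For illustration, a minimal sketch of the getAll approach, assuming a cache of Integer keys to String values (the cache name and keys are made up):

import java.util.Map;
import java.util.Set;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;

public class GetAllSketch {
    public static void main(String[] args) {
        // Start/connect a node and obtain the cache (names are illustrative).
        Ignite ignite = Ignition.start();
        IgniteCache<Integer, String> cache = ignite.getOrCreateCache("myCache");

        // One bulk call instead of one get() per key: the keys are grouped by
        // primary node, so network round trips are minimized.
        Set<Integer> keys = Set.of(1, 2, 3);
        Map<Integer, String> values = cache.getAll(keys);

        values.forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}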

If the primary keys are known in advance, use the key-value APIs such as cache.get or cache.getAll. If those records are then used as input to a calculation, try turning the calculation into a compute task and executing it on the nodes that store the primary copies of the keys -- you can use the compute.affinityRun methods for that.
SQL is preferable if the primary keys are not known in advance, or if you need to filter data with a WHERE clause or do joins between tables.
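As a rough sketch of the affinityRun idea (the cache name and key are made up; the closure simply prints the locally stored value):

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;

public class AffinityRunSketch {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start();
        IgniteCache<Integer, String> cache = ignite.getOrCreateCache("myCache");
        int key = 42;

        // The closure runs on the node that stores the primary copy of 'key',
        // so the value is read from local memory instead of travelling over the network.
        ignite.compute().affinityRun("myCache", key,
                () -> System.out.println("Co-located value: " + cache.localPeek(key)));
    }
}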

Related

Better to filter using DISTINCT in the query, or use Java's Set when parsing the query's results?

Question as in the title. I'm querying a table to get pairs of values, and I only want to keep one instance of each pair. Is it quicker to:
use the DISTINCT clause in PostgreSQL and put that responsibility on the DB, or
use Java's Set when unwrapping the query's results into objects?
Both approaches should produce the same result, but is one inherently better than the other?
The less data you have to marshal, transmit, and unmarshal, the faster your operation will be. This means that, in your case, getting unique data out of the database is preferable to getting non-unique data out and deriving the unique set later.
Generally it is best to do as much of your data processing as possible in the database. That's what it's there for, after all. There is an effort/complexity threshold where it can make more sense to process data in your client code, but finding unique values is one of the simplest operations.
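To make the comparison concrete, here is a minimal JDBC sketch of the database-side option (the connection URL, table, and column names are made up):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.HashSet;
import java.util.Set;

public class DistinctSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/mydb", "user", "pass");
             Statement stmt = conn.createStatement();
             // The database de-duplicates, so only unique pairs cross the wire.
             ResultSet rs = stmt.executeQuery("SELECT DISTINCT col_a, col_b FROM my_table")) {
            Set<String> pairs = new HashSet<>();
            while (rs.next()) {
                pairs.add(rs.getString("col_a") + ":" + rs.getString("col_b"));
            }
            System.out.println("Unique pairs: " + pairs.size());
        }
    }
}

The client-side alternative would run the same query without DISTINCT and rely on the HashSet to drop duplicates, at the cost of transferring the duplicate rows first.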

Load many objects from the database, save them in Ehcache, and query them by primary key

I have to process an XML file, and for that I need to fetch around ~4k objects by primary key from a single table. I am using Ehcache. I have a few questions:
1) It is taking a lot of time when I query row by row based on id and save each row in the cache. Can I query once at the start, save the whole table in Ehcache, and then look rows up by primary key later in the processing?
2) I don't want to use the query cache, as I can't load 4k objects at a time and loop over them to find the correct object.
I am looking for the optimal solution, as right now my process takes around 2 hours (it involves other processing too).
Thank you for your kind help.
You can read the whole table once and store it in a Map<primary-key, table-row> to avoid the overhead of repeated DB round trips.
A HashMap gives constant-time lookups by primary key; a TreeMap is only worth considering if you also need the keys in sorted order.
Ehcache is great for handling concurrency, but if you are reading the XML with a single process you don't even need it (just keep the Map in memory).
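A minimal sketch of that preloading idea, assuming plain JDBC and a made-up table with an id and a payload column:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;

public class TablePreloader {
    // Load the whole table once and serve all later lookups from memory.
    static Map<Long, String> loadAll(String jdbcUrl) throws Exception {
        Map<Long, String> byId = new HashMap<>();
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id, payload FROM my_table")) {
            while (rs.next()) {
                byId.put(rs.getLong("id"), rs.getString("payload"));
            }
        }
        return byId;
    }
}

With ~4k rows the map fits comfortably in memory, and every subsequent lookup is a constant-time get instead of a round trip to the database.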

Using Java ExecutorCompletionService and Postgres inserts via MyBatis

I thought I would be clever and used an ExecutorCompletionService to parallelize tasks that insert a bunch of records into a Postgres database. Mostly it works great and I can see an increase in performance. However, now and then it fails with a primary key violation, most likely due to concurrent threads trying to create records at the same time. Is there a graceful way to handle this situation?
You need a coordinated, thread-safe way to generate primary keys. If your primary key is numeric, the best option is to use a database sequence -- it is safe and efficient.
Turns out my original problem had to do with the sequence being out of sync with the record count of the table. In fact, Postgres can generate new unique ids concurrently as far as I can tell, so no additional coding needs to be done.
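For reference, a minimal sketch of the sequence-based approach suggested above (the sequence, table, and column names are made up):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class SequenceIds {
    // Ask Postgres for the next id; nextval() is atomic, so concurrent
    // threads can never receive the same value.
    static long nextId(Connection conn) throws Exception {
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT nextval('my_table_id_seq')")) {
            rs.next();
            return rs.getLong(1);
        }
    }

    static void insert(Connection conn, String payload) throws Exception {
        long id = nextId(conn);
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO my_table (id, payload) VALUES (?, ?)")) {
            ps.setLong(1, id);
            ps.setString(2, payload);
            ps.executeUpdate();
        }
    }
}

Since nextval() is served by the database, each task submitted to the ExecutorCompletionService gets its own id and the inserts can no longer collide. (A SERIAL/BIGSERIAL column achieves the same thing without fetching the id explicitly.)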

Multiple key and value pair search

We plan to cache a DB table on the application side (to avoid DB calls). Our cache is a key/value pair implementation. If I use the primary key (column1) as the key and all other data as the value, how can we execute the queries below against the cache?
select * from table where column1=?
select * from table where column2=? and column3=?
select * from table where column4=? and column5=? and column6=?
The simplest option is to build 3 caches as below:
(column1) --> Data
(column2+column3) --> Data
(column4+column5+column6) --> Data
Are there any better options?
Key points:
The table contains millions of records.
We are using Java's ConcurrentHashMap for the cache implementation.
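A minimal sketch of the three-map idea from the question, assuming one Data row per composite key and string-typed columns (class and field names are made up):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class MultiIndexCache {
    // One map per access pattern; all three point at the same Row objects,
    // so the extra memory cost is the keys, not a copy of the data.
    private final Map<String, Row> byCol1 = new ConcurrentHashMap<>();
    private final Map<String, Row> byCol2Col3 = new ConcurrentHashMap<>();
    private final Map<String, Row> byCol4Col5Col6 = new ConcurrentHashMap<>();

    public void put(Row row) {
        byCol1.put(row.col1, row);
        byCol2Col3.put(row.col2 + "|" + row.col3, row);
        byCol4Col5Col6.put(row.col4 + "|" + row.col5 + "|" + row.col6, row);
    }

    public Row findByCol1(String col1) {
        return byCol1.get(col1);
    }

    public Row findByCol2Col3(String col2, String col3) {
        return byCol2Col3.get(col2 + "|" + col3);
    }

    public Row findByCol4Col5Col6(String col4, String col5, String col6) {
        return byCol4Col5Col6.get(col4 + "|" + col5 + "|" + col6);
    }

    // Placeholder for the cached row; in practice this is your table's value object.
    public static class Row {
        String col1, col2, col3, col4, col5, col6;
    }
}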
Looks like you want an in-memory cache. Guava has cool caches--you would need a LoadingCache.
Here is the link to LoadingCache
Basically, for your problem, the idea would be to have three LoadingCaches. LoadingCache has a load method that you implement; it tells the cache how to fetch the data on a cache miss. So the first time you access the loading cache for query1 there will be a cache miss: the cache will call the method you implemented (your classic DAO method) to get the data, put it in the cache, and return it to you. The next time you access it, the value will be served from the in-memory Guava cache.
So if you have three methods
Data getData(Column1 column)
Data getData(Column2 column2, Column3 column3)
Data getData(Column4 column4, Column5 column5, Column6 column6)
your three LoadingCaches will call these methods from the load implementations you write. And that's it. I find it a very clean and simple way to get what you want.
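Roughly, the first of the three caches could look like this (the DAO interface, Data type, and size limit are made up to keep the sketch self-contained):

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

public class Column1Cache {
    // Your existing DAO; only called on a cache miss.
    interface MyDao {
        Data getData(String column1);
    }

    static class Data { /* row fields */ }

    private final LoadingCache<String, Data> byColumn1;

    public Column1Cache(MyDao dao) {
        this.byColumn1 = CacheBuilder.newBuilder()
                .maximumSize(1_000_000)   // cap memory use; least-recently-used entries are evicted
                .build(new CacheLoader<String, Data>() {
                    @Override
                    public Data load(String column1) {
                        // Called automatically on a cache miss; the result is stored and returned.
                        return dao.getData(column1);
                    }
                });
    }

    public Data get(String column1) throws Exception {
        return byColumn1.get(column1); // throws ExecutionException if the load fails
    }
}

The other two caches are built the same way, just keyed on a small composite-key object (or a concatenated string) and backed by the corresponding DAO method.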
You mentioned that you have to cache millions of records. That's quite a big number, and I do not recommend building your own caching framework, especially not one based on simplistic data structures such as HashMaps.
I highly recommend Redis -- check it out at http://redis.io. Companies such as Twitter, Stack Overflow, etc. are using Redis for their caches.
Here is a live demonstration of Redis: http://try.redis.io
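As a small illustration, assuming the Jedis client and made-up keys/values, the multi-lookup idea maps onto Redis like this:

import redis.clients.jedis.Jedis;

public class RedisCacheSketch {
    public static void main(String[] args) {
        // Connect to a local Redis instance (host and port are illustrative).
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Store the row once per access pattern, mirroring the three-cache idea,
            // but keeping the data off the JVM heap.
            String row = "{\"column1\":\"a\",\"column2\":\"b\",\"column3\":\"c\"}";
            jedis.set("byCol1:a", row);
            jedis.set("byCol2Col3:b|c", row);

            System.out.println(jedis.get("byCol2Col3:b|c"));
        }
    }
}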

java efficient de-duplication

Let's say you have a large text file. Each row contains an email id and some other information (say a product-id). Assume there are millions of rows in the file. You have to load this data into a database. How would you efficiently de-dup the data (i.e. eliminate duplicates)?
Insane number of rows
Use a MapReduce framework (e.g. Hadoop). This is full-blown distributed computing, so it's overkill unless you have TBs of data though. ( j/k :) )
Unable to fit all rows in memory
Even the result won't fit: use a merge sort, persisting intermediate data to disk. As you merge, you can discard duplicates (probably this sample helps). This can be multi-threaded if you want.
The result will fit: instead of reading everything into memory and then putting it in a HashSet (see below), you can use a line iterator or similar and keep adding to the HashSet. You can use a ConcurrentHashMap and more than one thread to read the file and add to the map. Another multi-threaded option is a ConcurrentSkipListSet; in that case you implement compareTo() instead of equals()/hashCode() (compareTo() == 0 means a duplicate) and keep adding to the SortedSet.
Fits in memory
Design an object that holds your data, implement good equals()/hashCode() methods, and put the objects in a HashSet.
Or use the methods given above (you probably don't want to persist to disk though).
Oh, and if I were you, I would put a unique constraint on the DB anyway...
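A minimal sketch of the fits-in-memory case, assuming each line looks like "email,product-id" (the file name and line format are made up):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

public class DedupWithHashSet {
    // Equality is defined on the fields that make two rows "the same" row.
    static final class Row {
        final String email;
        final String productId;

        Row(String email, String productId) {
            this.email = email;
            this.productId = productId;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Row)) return false;
            Row r = (Row) o;
            return email.equals(r.email) && productId.equals(r.productId);
        }

        @Override
        public int hashCode() {
            return Objects.hash(email, productId);
        }
    }

    public static void main(String[] args) throws IOException {
        Set<Row> unique = new HashSet<>();
        // Stream the file line by line instead of loading it whole.
        try (BufferedReader reader = Files.newBufferedReader(Path.of("input.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(",", 2);
                if (parts.length < 2) continue; // skip malformed lines
                unique.add(new Row(parts[0], parts[1]));
            }
        }
        System.out.println("Unique rows: " + unique.size());
    }
}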
I will start with the obvious answer: make a HashMap, put the email id in as the key and the rest of the information in as the value (or make an object to hold all the information). When you get to a new line, check whether the key already exists; if it does, move on to the next line. At the end, write out all your SQL statements using the HashMap. I do agree with eqbridges that memory constraints will be important if you have a "gazillion" rows.
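A rough sketch of that approach, assuming the same "email,rest-of-row" line format as above (file and table names are made up):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

public class DedupByEmail {
    public static void main(String[] args) throws IOException {
        // Key on the email id: the first occurrence wins, later duplicates are skipped.
        Map<String, String> byEmail = new HashMap<>();
        try (BufferedReader reader = Files.newBufferedReader(Path.of("input.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(",", 2);
                if (parts.length < 2) continue; // skip malformed lines
                byEmail.putIfAbsent(parts[0], parts[1]);
            }
        }
        // Print one INSERT per unique email; real code would use a PreparedStatement batch.
        byEmail.forEach((email, rest) ->
                System.out.println("INSERT INTO contacts (email, info) VALUES ('" + email + "', '" + rest + "');"));
    }
}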
You have two options:
Do it in Java: you could put together something like a HashSet for testing -- adding the email id for each item that comes in if it doesn't already exist in the set.
Do it in the database: put a unique constraint on the table so that dups will not be added. An added bonus is that you can repeat the process and remove dups from previous runs.
Take a look at Duke (https://github.com/larsga/Duke), a fast dedupe and record-linkage engine written in Java. It uses Lucene to index the data and reduce the number of comparisons (avoiding the unacceptable Cartesian product comparison). It supports the most common algorithms (edit distance, Jaro-Winkler, etc.) and is extremely extensible and configurable.
Can you not index the table by email and product ID? Then reading via the index should make duplicates of either email or email+productId readily identifiable through sequential reads, simply by comparing each row with the previous record.
Your problem can be solved with an Extract, Transform, Load (ETL) approach:
Load your data into an import schema;
do whatever transformations you like on the data;
then load it into the target database schema.
You can do this manually or use an ETL tool.
