We have a plan to cache a DB table on the application side (to avoid DB calls). Our cache is a key-value implementation. If I use the primary key (column1) as the key and all the other data as the value, how can we execute the queries below against the cache?
select * from table where column1=?
select * from table where column2=? and column3=?
select * from table where column4=? and column5=? and column6=?
The simplest option is to build three caches as below (a sketch follows the key points):
(column1) --> Data
(column2+column3) --> Data
(column4+column5+column6) --> Data
Any other better options?
Key points:
The table contains millions of records.
We are using Java's ConcurrentHashMap for the cache implementation.
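A minimal sketch of that three-cache option with ConcurrentHashMap, assuming a hypothetical Row value type and simple string composite keys (all names are illustrative):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical value type holding one row (column1..column6).
class Row { }

class TableCaches {
    // Cache 1: column1 (primary key) -> row
    final ConcurrentMap<String, Row> byColumn1 = new ConcurrentHashMap<>();
    // Cache 2: column2 + column3 -> row
    final ConcurrentMap<String, Row> byColumn2And3 = new ConcurrentHashMap<>();
    // Cache 3: column4 + column5 + column6 -> row
    final ConcurrentMap<String, Row> byColumn4To6 = new ConcurrentHashMap<>();

    // Composite key helper; the delimiter must not occur in the column values.
    static String key(Object... parts) {
        StringBuilder sb = new StringBuilder();
        for (Object p : parts) {
            sb.append(p).append('|');
        }
        return sb.toString();
    }
}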
Looks like you want an in-memory cache. Guava has good caches; you would need a LoadingCache.
See Guava's documentation for LoadingCache.
Basically, for your problem, the idea is to have three LoadingCaches, one per query. A LoadingCache takes a loader method that you implement: given a key, it tells the cache how to fetch the data on a cache miss. So the first time you access the loading cache for query 1, there is a cache miss; the cache calls the method you implemented (your classic DAO method), stores the result, and returns it to you. The next time you access it, it is served straight from your in-memory Guava cache.
So if you have three methods
Data getData(Column1 column)
Data getData(Column2 column2, Column3 column3)
Data getData(Column4 column4, Column5 column5, Column6 column6)
your three LoadingCaches will call these methods from the load implementations you write. That's it; I find it a very clean and simple way to get what you want.
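For illustration, a minimal sketch of one of the three caches (the one keyed on column2+column3), assuming a hypothetical MyDao with the getData methods above and the Data type from the question:

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import java.util.Objects;
import java.util.concurrent.TimeUnit;

public class Col2Col3Cache {

    // Composite key for the (column2, column3) query.
    static final class Key {
        final String column2, column3;
        Key(String c2, String c3) { column2 = c2; column3 = c3; }
        @Override public boolean equals(Object o) {
            return o instanceof Key
                    && column2.equals(((Key) o).column2)
                    && column3.equals(((Key) o).column3);
        }
        @Override public int hashCode() { return Objects.hash(column2, column3); }
    }

    private final LoadingCache<Key, Data> cache;

    public Col2Col3Cache(MyDao dao) {            // MyDao is a hypothetical DAO
        this.cache = CacheBuilder.newBuilder()
                .maximumSize(1_000_000)          // bound memory for millions of rows
                .expireAfterWrite(1, TimeUnit.HOURS)
                .build(new CacheLoader<Key, Data>() {
                    @Override public Data load(Key key) {
                        // Cache miss: fall back to the classic DAO method.
                        return dao.getData(key.column2, key.column3);
                    }
                });
    }

    public Data get(String column2, String column3) {
        // First call loads from the DB; later calls are served from memory.
        return cache.getUnchecked(new Key(column2, column3));
    }
}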
You mentioned that you have to cache millions of records. That's quite a big number. I do not recommend building your own caching framework, especially not one based on simplistic data structures such as HashMaps.
I highly recommend Redis - see http://redis.io. Companies such as Twitter and Stack Overflow use Redis for their caches.
Here is a live demonstration of Redis - http://try.redis.io
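As an illustration only (not from the original answer), a minimal cache-aside sketch using the Jedis client, with a hypothetical loadFromDb method and plain String values:

import redis.clients.jedis.Jedis;

public class RedisRowCache {
    private final Jedis jedis = new Jedis("localhost", 6379);

    public String getRow(String primaryKey) {
        String cacheKey = "table:row:" + primaryKey;
        String cached = jedis.get(cacheKey);
        if (cached != null) {
            return cached;                       // cache hit
        }
        String row = loadFromDb(primaryKey);     // hypothetical DAO call
        // Cache with a TTL so stale rows eventually expire (3600 s here).
        jedis.setex(cacheKey, 3600, row);
        return row;
    }

    private String loadFromDb(String primaryKey) {
        // Placeholder for the real database lookup.
        return "serialized-row-for-" + primaryKey;
    }
}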
I am accessing a GridGain cache for a large number of keys. I have two options to get the values:
access the GridGain cache, get the value for each key in an IgniteClosure, and return the result.
execute org.apache.ignite.cache.query.SqlQuery on the cache and then get the result.
Below are my questions:
What is the recommended/optimal way in this scenario?
Why could one be slower than the other (e.g. query parsing might add extra overhead)?
Have you considered doing a getAll(Set<K> keys) operation? Sounds like it suits your use case perfectly.
If you have even more data, consider collocated processing with local ScanQuery or map/reduce ExecuteTask/ExecuteJob.
If the primary keys are known in advance, then use key-value APIs such as cache.get or cache.getAll. If those records are then used as part of a calculation, try to turn the calculation into a compute task and execute it on the nodes that store the primary copies of the keys -- you can use the compute.affinityRun methods for that.
SQL is favorable if the primary keys are not known in advance or if you need to filter data with the WHERE clause or do joins between tables.
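A minimal sketch of both styles, assuming an already-started Ignite node and a cache named "myCache" with Long keys and String values (all names are illustrative):

import java.util.Map;
import java.util.Set;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;

public class IgniteLookups {

    // Key-value style: fetch all values in one bulk call.
    static Map<Long, String> bulkGet(Ignite ignite, Set<Long> keys) {
        IgniteCache<Long, String> cache = ignite.cache("myCache");
        return cache.getAll(keys);
    }

    // Collocated processing: run the calculation on the node that owns the key,
    // so the value never has to travel over the network.
    static void processOnOwner(Ignite ignite, Long key) {
        ignite.compute().affinityRun("myCache", key, () -> {
            IgniteCache<Long, String> cache = Ignition.localIgnite().cache("myCache");
            String value = cache.localPeek(key);
            // ... per-key calculation here ...
            System.out.println("Processed " + key + " -> " + value);
        });
    }
}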
I have some persistent data in an RDBMS and in CSV files (they are independent objects, but I mention it because they live in different mediums;
I cannot rely on what the RDBMS provides, and I actually do not want to make a trip to the database for the next hour even if the data gets stale). I need to store the data in memory for performance, query the objects (read-only, no other operations) on multiple columns, and refresh the data every hour.
In my case, what is a good way to store and query in-memory objects other than implementing my own object store and query methods? For instance, can you provide an example/link to replace an SQL query such as
select * from employees where emplid like '%input%' or surname like '%input%' or email like '%input%';
Sorry for the dummy query but it explains what kind of queries are possible.
Find yourself a key-value store implementation with the features you want. Use your query string as the key and the result as the value. https://github.com/ben-manes/caffeine has quite a few features, including record timeouts (such as an hour).
For my own work, I use an LRU key store (limited to X entries) containing objects carrying their timeout information, and I manually decide whether a record is stale before I use it. LRU is basically a linked list that moves records to the head of the list when they are read and drops the tail when records are added beyond the maximum desired size. This keeps popular records in the store longer.
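A minimal sketch along those lines with Caffeine, using the query string as the key and a one-hour expiry; the EmployeeDao interface and its search method are hypothetical stand-ins for your RDBMS/CSV lookup:

import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;
import java.time.Duration;
import java.util.List;

public class QueryResultCache {

    interface EmployeeDao {                      // hypothetical data-access layer
        List<String> search(String input);
    }

    private final LoadingCache<String, List<String>> cache;

    public QueryResultCache(EmployeeDao dao) {
        this.cache = Caffeine.newBuilder()
                .maximumSize(10_000)                     // size-bounded, LRU-like eviction
                .expireAfterWrite(Duration.ofHours(1))   // drop results after an hour
                .build(input -> dao.search(input));      // loader runs on a cache miss
    }

    public List<String> search(String input) {
        return cache.get(input);
    }
}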
I have to process an XML file, and for that I need to fetch around 4k objects from a single table by their primary key. I am using EhCache. I have a few questions:
1) Querying row by row based on the ID and saving each row in the cache takes a lot of time. Can I query once at the start, save the whole table in EhCache, and then look up rows by primary key later in the processing?
2) I don't want to use the query cache, since I can't load 4k objects at a time and loop over them to find the correct object.
I am looking for an optimal solution, as right now my process takes around 2 hours (it involves other processing too).
Thank you for your kind help.
You can read the whole table and store it in a Map<primary-key, table-row> to reduce the overhead of the DB connection.
A plain HashMap is usually the better choice for key lookups (constant-time access); a TreeMap only helps if you also need the keys in sorted order.
Ehcache is great for handling concurrency, but if you are reading the XML in a single process you don't even need it (just keep the Map in memory).
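A minimal sketch of that approach with plain JDBC, assuming a hypothetical table my_table with an id primary key and a payload column (adjust the query and the value type to your entity):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;

public class TablePreloader {

    // Load the whole table once and index it by primary key.
    public static Map<Long, String> loadAll(Connection conn) throws SQLException {
        Map<Long, String> byId = new HashMap<>();
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT id, payload FROM my_table")) {
            while (rs.next()) {
                byId.put(rs.getLong("id"), rs.getString("payload"));
            }
        }
        return byId;
    }
}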
I want to search for inputs in a list. That list resides in a database. I see two options for doing that:
Hit the DB for each search and return the result.
Keep a copy in memory, synced with the table, search in memory, and return the result.
I like the second option, as it will be faster. However, I am not sure how to keep the list in sync with the table.
example : I have a list L = [12,11,14,42,56]
and I receive an input : 14
I need to return whether the input exists in the list or not. The list can be updated by other applications, so I need to keep it in sync with the table.
What would be the most efficient approach here, and how do I keep the list in sync with the database?
Is there any way my application can be informed of changes in the table so that I can reload the list on demand?
Instead of recreating your own implementation of something that already exists, I would leverage Hibernate's Second Level Cache (2LC) with an implementation such as EhCache.
With a 2LC you can specify a time-to-live for your entities; once they expire, any query will reload them from the database. If the cached entities have not yet expired, Hibernate hydrates them from the 2LC application cache rather than the database.
If you are using Spring, you might also want to take a look at @Cacheable. It operates at the component/bean tier, allowing Spring to cache a result set in a named region. See the documentation for more details.
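A minimal sketch of the @Cacheable approach, assuming caching is enabled with @EnableCaching; the service, the "idList" region name, and the repository call are illustrative:

import java.util.List;
import org.springframework.cache.annotation.CacheEvict;
import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Service;

@Service
public class IdLookupService {

    // The result set is cached in the "idList" region; repeated calls skip the DB.
    @Cacheable("idList")
    public List<Integer> loadIds() {
        return queryIdsFromDatabase();           // hypothetical repository/DAO call
    }

    // Evict on demand (or from a scheduled job) so the next call reloads from the DB.
    @CacheEvict(value = "idList", allEntries = true)
    public void refresh() {
    }

    private List<Integer> queryIdsFromDatabase() {
        return List.of(12, 11, 14, 42, 56);      // placeholder for a real query
    }
}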
To satisfy your requirement, you should control reads and writes in one place; otherwise there will always be cases where the data is out of sync.
I have an HBase table with a couple of million records. Each record has a couple of properties describing it, each stored in a column qualifier (mostly int or string values).
I have a requirement to see the records paginated and sorted by a column qualifier (or even by more than one, in the future). What would be the best approach to do this? I have looked into secondary indexes using coprocessors (mostly hindex from Huawei), but it doesn't seem to match my use case exactly. I've also thought about replicating all the data into multiple tables, one per sort property, with that property included in the rowkey, and then redirecting queries to those tables. But this seems very tedious, as I already have a number of such properties.
Thanks for any suggestions.
You need your NoSQL database to work just like an RDBMS, and given the size of your data your life would be a lot simpler if you stuck with an RDBMS, unless you expect exponential growth :) Also, you don't mention whether your data gets updated, which is very important for making a good decision.
Having said that, you have a lot of options, here are some:
If you can wait for the results: write a MapReduce task to do the scan, sort it, and retrieve the top X rows. Do you really need more than 1000 pages (20-50k rows) for each sort type? Another option would be to use something like Hive.
If you can aggregate the data and "reduce" the dataset: write a MapReduce task to periodically export the newest aggregated data to an SQL table (which will handle the queries). I've done this a few times and it works like a charm, but it depends on your requirements.
If you have plenty of storage: write a MapReduce task to periodically regenerate (or append to) a table for each property, sorted by that property in the rowkey. You don't need multiple tables; just use a prefix in your rowkeys for each case. Or, if you do not want tables and you won't have a lot of queries, simply write the sorted data to CSV files and store them in HDFS, where they can easily be read by your frontend app.
Manually maintain a secondary index: this would not be very tolerant of schema updates and new properties, but it works great for near-real-time results. To do it, update your code to also write to a secondary table, with a good write buffer to help performance while avoiding hot regions. Think about rowkeys of this shape: [4B SORT FIELD ID (4 chars)] [8B SORT FIELD VALUE] [8B timestamp], with just one column storing the rowkey of the main table (see the sketch after this list). To retrieve the data sorted by any of the fields, perform a SCAN using the SORT FIELD ID as the start row, plus the starting sort field value as the pivot for pagination (ignore it to get the first page, then set it to the last value retrieved). That gives you the rowkeys of the main table, and you can then do a multi-get against it to retrieve the full data. Keep in mind that you'll need a small script to scan the main table and write index entries for the existing rows.
Rely on automatic secondary indexing through coprocessors, as you mentioned, although I do not like this option at all.
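A minimal sketch of writing such an index entry with the HBase client API, following the rowkey layout above; the index table, column family "i", and qualifier "main" are illustrative assumptions:

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SecondaryIndexWriter {

    // Index rowkey layout: [4-char sort field id][8B sort field value][8B timestamp]
    static byte[] indexRowKey(String sortFieldId, long sortFieldValue, long ts) {
        return Bytes.add(
                Bytes.toBytes(sortFieldId),      // must be exactly 4 characters
                Bytes.toBytes(sortFieldValue),
                Bytes.toBytes(ts));
    }

    // Store only the main table's rowkey as the cell value.
    static void writeIndexEntry(Table indexTable, String sortFieldId,
                                long sortFieldValue, long ts,
                                byte[] mainRowKey) throws Exception {
        Put put = new Put(indexRowKey(sortFieldId, sortFieldValue, ts));
        put.addColumn(Bytes.toBytes("i"), Bytes.toBytes("main"), mainRowKey);
        indexTable.put(put);
    }
}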
You have mostly enumerated the options. As you are aware, HBase does not natively support secondary indexes. In addition to hindex you may consider Phoenix
https://github.com/forcedotcom/phoenix
(from Salesforce), which in addition to secondary indexes has a JDBC driver and SQL support.