I'm relatively new to using caching in larger programs intended to be used by a large number of people. I know what caching is and why it's beneficial in general, and I've started to integrate EhCache into my application, which uses JSP and Spring MVC. In my application the user selects an ID from a drop-down list, and a Java class grabs data from the DB according to the ID picked. First the query is executed and returns a ResultSet object. At this point I am confused about what to do and feel like I'm missing something.
I know I want the object to go into cache if it's not already in there and if it's already in cache then just continue with the loop. But doing things this way requires me to iterate over the whole returned result set from the DB query, which is obviously not the way things are supposed to be done?
So, would you recommend that I just try to cache the whole result set returned? If I did this, I guess I could update the list in the cache whenever the DB table gets a new record? Any suggestions on how to proceed and correctly put into EhCache what is returned from the DB?
I know I'm throwing out a lot of questions, and I'd certainly appreciate it if someone could offer some help! Here is a snippet of my code so you see what I mean.
rs = sta.executeQuery(QUERYBRANCHES + specifier);
while (rs.next())
{
    // For each set of fields retrieved, use them to create a Branch object.
    // String brName = rs.getString("NAME");
    String compareID = rs.getString("ID");
    String fixedRegID = rs.getString("REGIONID").replace("0", "").trim();

    // Check whether the branch is already in the cache. If it is not,
    // create the new object and add it; if it is, continue with the next row.
    if (!cacheManager.isInMemory(compareID))
    {
        Branch branch =
            new Branch(fixedRegID, rs.getString("ID"), rs.getString("NAME"),
                       rs.getString("ADDR1"), rs.getString("CITY"),
                       rs.getString("ST"), rs.getString("ZIP"));
        cacheManager.addBranch(rs.getString("ID"), branch);
    }
}
retData = cacheManager.getAllBranches();
But doing things this way requires me to iterate over the whole returned result set from the DB query, which is obviously not the way things are supposed to be done?
You need to iterate in order to fetch the results.
To avoid iterating over every element, exclude the already-cached values in the SELECT itself: add an exclusion clause (NOT IN, <>, etc.) for the values you already hold in the cache. That shrinks the result set and with it the iteration time.
Otherwise yes, I'm afraid you will have to iterate over everything the query returns if your SQL filter is not complete.
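The exclusion clause can be assembled from the cached IDs before the query runs. A minimal sketch (class and method names are invented; in production, bind the values as parameters instead of concatenating strings, to avoid SQL injection):

```java
import java.util.List;
import java.util.stream.Collectors;

public class ExclusionClauseDemo {

    // Builds an exclusion clause from IDs already in the cache so the
    // SELECT only returns branches that still need caching.
    // NOTE: string concatenation is for illustration only.
    static String buildExclusionClause(List<String> cachedIds) {
        if (cachedIds.isEmpty()) {
            return ""; // nothing cached yet, no filter needed
        }
        return " AND ID NOT IN ("
                + cachedIds.stream()
                           .map(id -> "'" + id + "'")
                           .collect(Collectors.joining(", "))
                + ")";
    }
}
```

Appending the result to the branch query means already-cached rows never leave the database, so the loop only sees genuinely new branches.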
So, would you recommend that I just try to cache the whole result set returned? If I did this I guess I could update the list in the cache if the DB table is updated with a new record? Any suggestions on how to proceed and correctly put into EhCache what is returned from the DB?
You should not cache highly dynamic business information.
What I recommend instead is using database indexes, which can dramatically increase performance, and reading your values from the database directly; use plain native SQL if needed.
If you are going to serve a lot of users, you will need a lot of memory to keep all those objects cached.
And as you start scaling horizontally, managing the cache this way becomes a real challenge.
If you can, cache only values that won't change, or that change very rarely, such as application start-up data or application parameters.
If you really need to cache business information, please share the specifics: hardware, platform, database, landscape, peak access load, etc.
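As a sketch of that start-up-only style (class and key names are invented), values loaded once at boot can live in an immutable map, which needs no invalidation logic at all:

```java
import java.util.Map;

public class ParameterCache {

    // Application parameters that never change after start-up are safe to
    // cache forever; Map.copyOf takes an immutable snapshot, so there is
    // no invalidation logic to get wrong.
    private final Map<String, String> params;

    public ParameterCache(Map<String, String> loadedAtStartup) {
        this.params = Map.copyOf(loadedAtStartup);
    }

    public String get(String key) {
        return params.get(key);
    }
}
```

Anything more dynamic than this deserves a proper cache with expiry, or no cache at all.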
Related
I want to search for inputs in a list. That list resides in a database. I see two options for doing that:
Hit the DB for each search and return the result.
Keep a copy in memory, synced with the table; search in memory and return the result.
I like the second option, as it will be faster. However, I am confused about how to keep the list in sync with the table.
Example: I have a list L = [12, 11, 14, 42, 56] and I receive the input 14.
I need to return whether the input exists in the list or not. The list can be updated by other applications, and I need to keep it in sync with the table.
What would be the most optimized approach here, and how do I keep the list in sync with the database? Is there any way my application can be informed of changes in the table, so that I can reload the list on demand?
Instead of recreating your own implementation of something that already exists, I would leverage Hibernate's second-level cache (2LC) with an implementation such as EhCache.
With a 2LC you can specify a time-to-live for your entities; once they expire, any query reloads them from the database. If a cached entity has not yet expired, Hibernate hydrates it from the 2LC rather than hitting the database.
If you are using Spring, you might also want to take a look at @Cacheable. This operates at the component/bean tier, allowing Spring to cache a result set in a named region. See the Spring documentation for more details.
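Under the hood, @Cacheable behaves roughly like memoization: a cache miss runs the method body, a hit returns the stored value and skips it. A stripped-down sketch of that behavior (not Spring's actual implementation; names are made up):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class CacheableSketch {

    private final Map<String, String> cache = new ConcurrentHashMap<>();

    // Roughly what a @Cacheable-annotated method does: on a cache miss the
    // loader runs (normally the database call); on a hit the stored value
    // is returned and the loader is skipped entirely.
    String findName(String id, Function<String, String> loader) {
        return cache.computeIfAbsent(id, loader);
    }
}
```

Spring adds expiry, named regions, and eviction on top, which is exactly the bookkeeping you don't want to hand-roll.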
To satisfy your requirement, you should control the reads and writes in one place; otherwise there will always be cases where the data is out of sync.
Let's say that I have a table with columns TABLE_ID, CUSTOMER_ID, ACCOUNT_NUMBER, PURCHASE_DATE, PRODUCT_CATEGORY, PRODUCT_PRICE.
This table contains all purchases made in some store.
Please don't concentrate on changing the database model (there are obvious improvement possibilities) because this is a made-up example and I can't change the actual database model, which is far from perfect.
The only thing I can change is the code which uses the already existing database model.
Now, I don't want to access the database all the time, so I have to store the data into cache and then read it from there. The problem is, my program has to support all sorts of things:
What is the total value of purchases made by customer X on date Y?
What is the total value of purchases made for products from category X?
Give me a list of total amounts spent grouped by customer_id.
etc.
I have to be able to preserve this hierarchy in my cache.
One possible solution is to have a map inside a map inside a map... etc.
However, that gets messy very quickly, because I need an extra nesting level for every attribute in the table.
Is there a smarter way to do this?
Have you already established that you need a cache? Are you sure the performance of your application requires it? The database itself can optimize queries, have things in memory, etc.
If you're sure you need a cache, you also need to think about cache invalidation: is the data changing from beneath your feet, i.e. is another process changing the data in the database, or is the database data immutable, or is your application the only process modifying your data.
What do you want your cache to do? Just keep track of queries and results that have already been requested, so the second time a query runs you can return the result from the cache? Or do you want to aggressively precompute some aggregates? Can the cached data fit into your application's memory, or do you want to use, for example, ReferenceMaps that shrink when memory gets tight?
For your actual question, why do you need maps inside maps? You probably should design something that's closer to your business model, and store objects that represent the data in a meaningful way. You could have each query (PurchasesByCustomer, PurchasesByCategory) represented as an object and store them in different maps so you get some type safety. Similarly don't use maps for the result but the actual objects you want.
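For instance, the "total per customer" question can be one typed aggregate instead of a map inside a map (the Purchase record and its field names are invented for illustration):

```java
import java.math.BigDecimal;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class PurchaseTotals {

    // One small value object instead of map-inside-map nesting.
    record Purchase(String customerId, String category, BigDecimal price) {}

    // One typed aggregate per question the application must answer.
    static Map<String, BigDecimal> totalByCustomer(List<Purchase> purchases) {
        return purchases.stream().collect(Collectors.groupingBy(
                Purchase::customerId,
                Collectors.reducing(BigDecimal.ZERO, Purchase::price, BigDecimal::add)));
    }
}
```

A sibling method like totalByCategory would follow the same pattern, so each cached structure stays flat and self-describing instead of growing another nesting level per attribute.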
Sorry, your question is quite vague, but hopefully I've given you some food for thought.
Even though I've searched on Google, I want to be very sure about one thing: does ScrollMode.FORWARD_ONLY mean that I will receive the results in the order they are stored in the DB?
I have something like:
Scroller<Integer> foundRecs = new Scroller<Integer>(query.scroll(ScrollMode.FORWARD_ONLY));
Maybe it's a stupid question...
That specific API is Hibernate, which I don't know too much about, but I guess it maps down to TYPE_FORWARD_ONLY in the end (and its documentation agrees by mentioning that constant).
If that's the case, then no: this will not influence the order in which items are returned.
It only means that you can only traverse the result once and can only navigate forward (not backwards).
Databases don't really have an "order" in which they store data, because their tables are sets (not lists). If you need some order for your results, then you need to add an ORDER BY to your SQL (or the equivalent in whichever query system you use).
You cannot rely on the physical order of data in the database. This might work if you query only a single table, but will fail as soon as you are using joins.
If you want your data to appear in a specific order, you need an ORDER BY clause.
I am using Spring to reduce the creation of new objects. For each row available in the ResultSet, I create a Row object and add it to a RowSet object, which maintains a list of rows.
My code goes like this.
while (rs.next()) {
    Row r = new Row();
    // Logic
    rowset.addRow(r);
}
My rs will have a minimum of 3000 rows, so at least 3000 Row objects will be created. To avoid that, I used Spring injection, like this:
while (rs.next()) {
    Row r = (Row) getApplicationContext().getBean("row");
    // Logic
    rowset.addRow(r);
}
But the problem here is that when the second Row object is added to the RowSet, the first Row object's values are overwritten with the second's; by the time the last row is added, rows 1 through last-1 all hold the last row's values.
This is my RowSet class:
public class RowSet {
    private ArrayList<Row> m_alRow;

    public RowSet() {
        m_alRow = new ArrayList<Row>();
    }

    public void addRow(Row row) {
        m_alRow.add(row);
    }
}
I need a solution such that new Row object creation is avoided inside the while loop, with the same logic preserved.
Spring injection is not your answer here. You aren't avoiding any of the problems of keeping 3,000 rows in memory: you're chewing up the same RAM whether you call new to create them or the bean factory creates them for you.
I'd question why you need all 3,000 rows in memory at once. I can see where you might ask for them in smaller bites, or perhaps ask the database server to crunch them into a usable piece of data instead of hauling all of it to the middle tier.
I'd rethink your design.
A RowSet might be a fine abstraction for a persistence tier, but I don't see any business objects here. What are you doing with that data? How is it ultimately consumed?
You should follow Google's example if you simply send it to a user interface for display. Google might get back millions of responses to your query, but it only shows them to you 25 at a time, with the best answers coming up first. They're betting that you won't need the rest of those million results if the first one meets your need. Maybe you need to try that for your problem, too.
If you are looking at 3000 records with 3000 different values, then 3000 objects are the only way to go. If you are retrieving data from only one column, this could be mitigated by adding the values to a list (which internally autoboxes primitives, so it still ends up with 3000 objects). Why not try an ORM tool like Hibernate and not worry about the data wiring?
By default, Spring beans are singletons, so you are simply fetching the same instance over and over. This is why the values are overwritten: you only ever have one instance of Row. If your goal was to create fewer objects, you have achieved it; of course, your logic has suffered for it 8-)
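For reference, the default can be changed via bean scope: declaring the bean with scope="prototype" makes every getBean call return a fresh instance. A configuration sketch (assuming XML config and a hypothetical com.example.Row class):

```xml
<!-- sketch: scope="prototype" makes getBean("row") return a new instance each call -->
<bean id="row" class="com.example.Row" scope="prototype"/>
```

Note that this only restores correctness; it still creates one object per row, so it saves no memory over plain new.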
As was mentioned, if you are going to store 3000 unique values, you will need 3000 objects; it's that simple. Even what you were attempting with Spring was still going to create those objects; you only turned a very simple new statement into an obfuscated call to Spring that does the new for you.
This is a misuse of Spring: it is meant for dependency injection, not simply to replace new. You made a very simple and clear statement complex by doing it this way.
PS - You can make Spring create a new object each time, but you have to configure it to do so. I won't elaborate, since this is not the solution to your problem.
Let's presume that you are writing an application for a retail store chain. You would design your object model with 'Store' as the core business object plus lots of supporting objects. Let's say 'Store' looks like this:
class Store implements Validatable {
    int storeNo;
    String storeName;
    // ... etc.
}
So, your client tells you that you have to import the store schedule from an Excel sheet into the application and run a series of validations on them, for instance 'StoreIsInSameCountry', 'StoreIsValid', etc. So you would design a Rule interface for checking all business conditions, something like this:
interface Rule<T extends Validatable> {
    Error check(T value) throws Exception;
}
Now, here comes the question. I am uploading 2000 stores from this Excel sheet, so I would end up running each rule defined for a store that many times. If I have 4 rules, that is 8000 queries to the database, i.e. 16000 hits to the connection pool. For a simple check of whether the store exists or not, the query would be:
SELECT STORE_ATTRIB1, STORE_ATTRIB2... from STORE where STORE_ID = ?
That way I would get my 'Store' object. When I don't get anything back from the database, the store doesn't exist. So, for such a simple check, I would have to hit the database 2000 times for 2000 stores.
Alternatively, I could just do:
SELECT STORE_ATTRIB1, STORE_ATTRIB2... from STORE where STORE_ID in (1,2,3..... )
This query would actually return much faster than running the one above it 2000 times.
However, it doesn't go well with the design that a Rule can be run for a single store only.
I know using IN is not a suggested methodology. So, what do you think I should be doing? Should I go ahead and use IN here, because it gives better performance in this scenario, or should I change my design?
What would you do if you were in my shoes, and what is the best practice?
That way I would get my 'Store' object from the database. When I don't get anything from the database, then that store doesn't exist. So, for such a simple check, I would have to hit the database 2000 times for 2000 stores.
This is what you should not do.
Create a temporary table, fill the table with your values and JOIN this table, like this:
SELECT STORE_ATTRIB1, STORE_ATTRIB2...
FROM temptable tt
JOIN STORE s
ON s.STORE_ID = tt.id
or this:
SELECT STORE_ATTRIB1, STORE_ATTRIB2...
FROM STORE s
WHERE s.STORE_ID IN
(
SELECT id
FROM temptable tt
)
I know using IN is not a suggested methodology. So, what do you think I should be doing? Should I go ahead and use IN here, because it gives better performance in this scenario, or should I change my design?
IN filters duplicates out.
If you want each eligible row to be selected for each duplicate value in the list, use JOIN.
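The difference is easy to see with in-memory data (a hypothetical sketch, not database code): filtering table rows against a set of values behaves like IN, while pairing each list value against the table behaves like a JOIN:

```java
import java.util.List;
import java.util.Set;

public class InVsJoinDemo {

    public static void main(String[] args) {
        List<Integer> storeIds = List.of(1, 2, 3); // rows in STORE
        List<Integer> wanted = List.of(2, 2, 3);   // note the duplicate 2

        // IN-style: each store row is selected at most once
        Set<Integer> wantedSet = Set.copyOf(wanted);
        List<Integer> inStyle = storeIds.stream().filter(wantedSet::contains).toList();

        // JOIN-style: one result per duplicate value in the list
        List<Integer> joinStyle = wanted.stream().filter(storeIds::contains).toList();

        System.out.println(inStyle);   // [2, 3]
        System.out.println(joinStyle); // [2, 2, 3]
    }
}
```

So with a duplicate-free value list the two are equivalent, and the difference only shows when the list contains repeats.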
IN is in no way a "not suggested methodology".
In fact, there was a time when some databases did not support IN queries efficiently, which is why folk wisdom still advises against using it.
But if your store_id is indexed properly (and it most probably is, if it's a PRIMARY KEY which it looks like), then all modern versions of major databases (that is Oracle, SQL Server, MySQL and PostgreSQL) will use an efficient plan to perform this query.
See this article in my blog for performance details in SQL Server:
IN vs. JOIN vs. EXISTS
Note that in a properly designed database, validation rules are also set-based, i.e. you implement your validation rules as queries against the temptable.
However, to support legacy rules, you can select values from temptable row-by-agonizing-row, apply the rules, and delete values which did not pass validation.
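As a sketch of the set-based style (table and column names taken from the made-up example above), the "store exists" rule becomes a single statement that weeds out failing candidates in one pass:

```sql
-- delete every candidate id that has no matching store row,
-- i.e. run the existence rule as one set-based statement
DELETE FROM temptable
WHERE id NOT IN (SELECT STORE_ID FROM STORE);
```

Whatever remains in temptable afterwards passed the rule, with one round trip instead of 2000.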
SELECT store_id FROM store WHERE store_active = 1
or even
SELECT store_id FROM store
will tell you all the active stores in a single query. You can now conduct the other tests on stores you know to exist, and you've saved yourself 1,999 hits to the database.
If you've got relatively uncontested database access, and no time constraint on how long the whole thing is going to take then you've no real need to worry about hitting the connection pool over and over again. That's what it's designed for, after all!
I think it's more of a business question, with parameters like how often the client runs the import, how long it would take you to implement either solution, and how expensive your time is per hour.
If it's something that runs once in a while, a bit of bad performance is acceptable in my opinion, especially if you can get the job done quick using clean code.
...a Rule can be run for a single store only.
Managing business rules alongside performance is a tricky task, so there are libraries ("persistence layers") that do exactly that: you define rules, then execute a bulk of commands, and the library fetches from the DB whatever the rules require in a single query (using temp tables rather than IN) and passes it to the rules.
There is an example of a validator here.