I am working on a Java application that uses a MySQL database as its data storage layer. There are a few configuration tables in the database, each with many thousands of rows. All of this configuration is cached / loaded into memory in corresponding data structures / beans (Java POJOs) when the application starts up.
Everything is fine except that the caching happens on every start-up and usually takes 15-20 minutes, as the amount of data to cache is huge and some columns contain XML strings that have to be parsed before being stored in beans.
So what's the big deal??
Why should we cache when no data has changed between consecutive start-ups? I could have all the beans encapsulated in a common Config bean and serialize it, then load that serialized object the next time, once I figure out that no data has changed - and of course loading a serialized object is far faster than a database hit plus bean population.
So is there any way I can figure this out?
At the database level, of course. When the application starts, I would query whether there has been any change in the database tables since the last start. If yes, do the same old boring caching process, store some unique identifier, and serialize; if the last identifier and the current identifier are the same, just load the serialized object. This unique identifier would of course have to be persistent.
Add a last_updated column of type timestamp to the table.
When you need to check whether there are changes to the table, simply execute the query:
select max(last_updated) from YOUR_TABLE
If the maximum last_updated is after the time you created the last cache copy, you can update the cache with only the elements changed since the cache was last created, using a query similar to this one:
select * from YOUR_TABLE where last_updated > LAST_CACHE_UPDATE
As explained in the comments, it is highly recommendable to add an index on the last_updated column. With an index, retrieving the maximum value in a table of 1,000,000,000 records takes about 30 steps (not 1,000,000,000, as wrongly stated in the comments).
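Putting it together, a minimal sketch in Java, assuming a serializable Config bean, a hypothetical buildConfigFromDatabase() standing in for your existing caching logic, and clocks close enough that the cache file's modification time can be compared with the database timestamp (persisting the queried timestamp alongside the file would be more robust):

import java.io.*;
import java.sql.*;

public class ConfigLoader {

    // Stand-in for the real aggregated configuration bean; must be Serializable.
    public static class Config implements Serializable {
        private static final long serialVersionUID = 1L;
    }

    private static final File CACHE_FILE = new File("config-cache.ser");

    public static Config load(Connection conn) throws Exception {
        Timestamp dbLastUpdated;
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("select max(last_updated) from YOUR_TABLE")) {
            rs.next();
            dbLastUpdated = rs.getTimestamp(1);
        }
        // If the serialized copy is newer than the newest row, nothing changed: deserialize it.
        if (CACHE_FILE.exists() && dbLastUpdated != null
                && CACHE_FILE.lastModified() >= dbLastUpdated.getTime()) {
            try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(CACHE_FILE))) {
                return (Config) in.readObject();
            }
        }
        Config config = buildConfigFromDatabase(conn); // the slow 15-20 minute path
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(CACHE_FILE))) {
            out.writeObject(config);
        }
        return config;
    }

    private static Config buildConfigFromDatabase(Connection conn) {
        return new Config(); // placeholder for the existing query-and-parse logic
    }
}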
If you restart your application a lot and your cache can live in an external data structure like Redis or Hazelcast, use that as the cache rather than JVM memory. When you update the data, update both sides.
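For example, a minimal sketch with the Hazelcast 4.x client, assuming a cluster already running outside the application (map name and keys are illustrative):

import com.hazelcast.client.HazelcastClient;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;

public class ExternalConfigCache {
    public static void main(String[] args) {
        // Connects to an already-running cluster, so the cached data outlives this JVM.
        HazelcastInstance client = HazelcastClient.newHazelcastClient();
        IMap<String, String> config = client.getMap("config-cache");
        config.put("someKey", "someValue"); // on writes: update the database first, then the map
        System.out.println(config.get("someKey"));
        client.shutdown();
    }
}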
Related
I have to process an XML file, and for that I need to fetch ~4k objects by primary key from a single table. I am using EhCache. I have a few questions:
1) It takes a lot of time if I query row by row by id and save each result in the cache. Can I query once at start-up, save the whole table in EhCache, and then look entries up by primary key later in the processing?
2) I don't want to use the query cache, as I can't load 4k objects at a time and loop over them to find the correct object.
I am looking for an optimal solution, as right now my process takes around 2 hours (it involves other processing too).
Thank you for your kind help.
You can read the whole table and store it in a Map<primary-key, table-row> to reduce the overhead of the DB connection.
A HashMap is probably the best choice: lookups by key are O(1), whereas a TreeMap costs O(log n) per lookup and is only worth it if you need the keys kept sorted.
EhCache is great for handling concurrency, but if you are reading the XML with a single process you don't even need it (just keep the Map in memory).
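For example, a minimal JDBC sketch of the single-query preload, assuming a hypothetical MY_TABLE with an id primary key and an xml_payload column:

import java.sql.*;
import java.util.HashMap;
import java.util.Map;

public class TablePreload {
    public static Map<Long, String> loadAll(Connection conn) throws SQLException {
        Map<Long, String> byId = new HashMap<>();
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("select id, xml_payload from MY_TABLE")) {
            while (rs.next()) {
                byId.put(rs.getLong("id"), rs.getString("xml_payload"));
            }
        }
        return byId; // ~4k entries fit comfortably in memory; get() by key is O(1)
    }
}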
I'm using spring-batch jobs to persist content of a large csv file to a database.
JpaItemWriter is used for persistence, which is fine so far.
But now I'd first like to check whether an entity already exists in the database (by id - the id field in the csv and in the database are equal), and in that case update the entity instead.
How could this be done?
When I needed to do this, the best I came up with was having my custom FieldSetMapper (used by the FlatFileItemReader) load the item from the database (or create a new instance if it doesn't exist) and then set the properties based on the input. Since JpaItemWriter uses .merge, it will update the entity if it was loaded from the database and insert it if it was new.
I also needed to run it with a batch size of 1 to ensure that, if there were duplicates in my input (which I did have), it would actually go one row at a time and insert or update for each one, rather than trying to insert them all at once and causing key problems.
As you might imagine, all this was a lot slower than I would have liked: it queries the database for every single row and then performs the corresponding update or insert. But since in my case it was a monthly overnight batch process, it was good enough for our needs, even if it took many hours to run.
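For illustration, a minimal sketch of such a mapper, assuming a hypothetical Customer entity whose id matches the csv id column (the EntityManager would be injected by the job configuration):

import javax.persistence.Entity;
import javax.persistence.EntityManager;
import javax.persistence.Id;
import org.springframework.batch.item.file.mapping.FieldSetMapper;
import org.springframework.batch.item.file.transform.FieldSet;
import org.springframework.validation.BindException;

@Entity
class Customer { // hypothetical entity; id matches the csv id column
    @Id private Long id;
    private String name;
    public Long getId() { return id; }
    public void setId(Long id) { this.id = id; }
    public void setName(String name) { this.name = name; }
}

public class UpsertingCustomerMapper implements FieldSetMapper<Customer> {
    private final EntityManager em;

    public UpsertingCustomerMapper(EntityManager em) {
        this.em = em;
    }

    @Override
    public Customer mapFieldSet(FieldSet fs) throws BindException {
        Long id = fs.readLong("id");
        Customer c = em.find(Customer.class, id); // existing row...
        if (c == null) {
            c = new Customer();                   // ...or a brand-new instance
            c.setId(id);
        }
        c.setName(fs.readString("name"));         // overwrite with the csv values
        return c;  // JpaItemWriter's merge() then updates or inserts accordingly
    }
}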
I am working on the solution described below but could not find any best practice/tool for it.
For a batch of requests (say 5000 unique ids and records) received by a web service, it has to fetch the rows for those unique ids from the database, keep them in a buffer (or cache), and compare them with the records received in the web service call. If a particular piece of data (say a column) has changed, it is updated in the table for that unique id. In turn, the child tables of that table are also affected. For example, if someone changes his laptop's model number and country, the model number is updated in one table and the country value in another. It goes on like this, accessing multiple tables in a short time. The number of records in one web service call might reach 70K in an hour.
I don't have any option other than implementing it in Java. Is there any good practice for implementing this, or can it be achieved with any open-source Java tools? Please suggest. Thanks.
Hibernate is likely the first thing you should try. I tend to avoid it because it is overkill for most of my applications, but it is a standard tool for accessing databases that anyone who knows Java should at least have an understanding of. There are dozens of other solutions you could use, but Hibernate is the most commonly used.
JDBC is the API for accessing a relational database. Useful performance and security tips:
use prepared statements
use where ... in () queries to load many rows at once, but beware of the limit on the number of values in the in clause (1000 max in Oracle)
use batched statements for your updates, rather than executing each update separately (see http://download.oracle.com/javase/1.3/docs/guide/jdbc/spec2/jdbc2.1.frame6.html; a sketch follows below)
See http://download.oracle.com/javase/tutorial/jdbc/ for a tutorial on JDBC.
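For example, a minimal sketch combining the prepared-statement and batching tips; the device table and model column are illustrative:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Map;

public class BatchedUpdates {
    public static void updateModels(Connection conn, Map<Long, String> modelById)
            throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement ps =
                 conn.prepareStatement("update device set model = ? where id = ?")) {
            for (Map.Entry<Long, String> e : modelById.entrySet()) {
                ps.setString(1, e.getValue());
                ps.setLong(2, e.getKey());
                ps.addBatch();             // queue the update instead of executing one by one
            }
            ps.executeBatch();             // one round trip for the whole batch
            conn.commit();
        } catch (SQLException ex) {
            conn.rollback();
            throw ex;
        }
    }
}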
This doesn't sound that complicated. Of course, you must know (or learn):
SQL
JDBC
Then you can go through the web service data record by record and do the following for each record (a JDBC sketch follows below):
fetch corresponding database record
for each field in record:
    if updated:
        execute corresponding update SQL statement
commit // every so many records
70K records per hour should not be the slightest problem for a decent RDBMS.
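A rough JDBC rendering of that pseudocode, assuming a hypothetical device table and a WebServiceRecord shape for the incoming data (only one field is compared, for brevity):

import java.sql.*;
import java.util.List;

public class RecordSync {

    // Hypothetical shape of one incoming web service record.
    public static class WebServiceRecord {
        final long id;
        final String model;
        WebServiceRecord(long id, String model) { this.id = id; this.model = model; }
    }

    public static void sync(Connection conn, List<WebServiceRecord> records) throws SQLException {
        conn.setAutoCommit(false);
        int processed = 0;
        try (PreparedStatement select =
                 conn.prepareStatement("select model from device where id = ?");
             PreparedStatement update =
                 conn.prepareStatement("update device set model = ? where id = ?")) {
            for (WebServiceRecord r : records) {
                select.setLong(1, r.id);               // fetch corresponding database record
                try (ResultSet rs = select.executeQuery()) {
                    if (rs.next() && !r.model.equals(rs.getString("model"))) { // field updated?
                        update.setString(1, r.model);  // execute corresponding update statement
                        update.setLong(2, r.id);
                        update.executeUpdate();
                    }
                }
                if (++processed % 500 == 0) {
                    conn.commit();                     // commit every so many records
                }
            }
            conn.commit();
        }
    }
}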
Let's imagine I have two processes A and B performing transactions on a H2 database table T.
Process A performs CRUD (Create, Read, Update, Delete) on T.
Process B wants to know when line L in T has been last modified (i.e., B provides a System.currentTimeMillis() value for example).
One could create a column in T recording the last modification time of each line, but I was wondering whether H2 already holds this information somewhere and whether it can be accessed.
To my knowledge there is no such feature in H2 (and probably not in any RDBMS). The reason is simple - an extra 4 or 8 bytes per record could have a huge impact on overall database size, especially with small records. There would also be a slight performance impact.
But it is relatively simple to implement this feature yourself using an extra column and an on-update trigger. Some databases simplify it even further, like MySQL:
ts TIMESTAMP DEFAULT 0 ON UPDATE CURRENT_TIMESTAMP
Also, please distinguish between the database server clock, process A's clock, and process B's clock. In the real world they most likely won't be the same.
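In H2 itself, a minimal sketch of the extra-column-plus-trigger approach could look like this, assuming LAST_MOD is the third column (index 2) of table T and the trigger is registered with CREATE TRIGGER T_LAST_MOD BEFORE UPDATE ON T FOR EACH ROW CALL "LastModifiedTrigger":

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Timestamp;
import org.h2.api.Trigger;

public class LastModifiedTrigger implements Trigger {
    @Override
    public void init(Connection conn, String schema, String triggerName,
                     String tableName, boolean before, int type) {
        // nothing to prepare
    }

    @Override
    public void fire(Connection conn, Object[] oldRow, Object[] newRow) throws SQLException {
        // Stamp the row before the update is applied; for embedded H2 the JVM
        // clock is the database server clock.
        newRow[2] = new Timestamp(System.currentTimeMillis());
    }

    @Override
    public void close() { }

    @Override
    public void remove() { }
}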
The database only stores what you set it up to store.
So you must set the table up to store the modification time, and one really easy way to do it is by creating a "computed column" with the value of NOW(); a computed column is evaluated each time the record is modified.
CREATE TABLE TEST(ID INT, NAME VARCHAR, LAST_MOD TIMESTAMP AS NOW());
FYI: http://www.h2database.com/html/features.html#computed_columns
The situation is as follows:
I have two databases with an identical structure. On top of each of these databases runs an instance of the same app, using Hibernate for ORM. The two are completely independent.
Now I have to merge both applications into one. In some tables, adjustments need to be made to avoid violating unique key constraints.
Since both databases are identical in structure and the same Hibernate mapping is used, is there a way to use Hibernate for the task? I'm thinking of loading an object from database A, modifying it in code, and simply saving it to a Session from a SessionFactory based on database B. I'm wondering whether Hibernate would be able to update the primary and foreign key values accordingly, and how difficult it would be to handle references to objects that are not copied from database A (because they are not needed any more). Roughly what I have in mind, with a hypothetical Item entity with a generated Long id (whether the cascaded associations get their keys rewritten is exactly what I'm unsure about):

import org.hibernate.Session;
import org.hibernate.SessionFactory;

class Item { // hypothetical mapped entity with a generated Long id
    private Long id;
    public Long getId() { return id; }
    public void setId(Long id) { this.id = id; }
}

public class CrossDatabaseCopy {
    public static void copyItem(SessionFactory factoryA, SessionFactory factoryB, Long idInA) {
        Session a = factoryA.openSession();
        Session b = factoryB.openSession();
        try {
            Item item = (Item) a.get(Item.class, idInA); // load from database A
            a.evict(item);        // detach, so session B treats it as a plain object
            item.setId(null);     // let B's id generator assign a fresh primary key
            b.beginTransaction();
            b.save(item);         // cascaded associations follow the shared mapping
            b.getTransaction().commit();
        } finally {
            a.close();
            b.close();
        }
    }
}
Any recommendations?
Isn't it easier to just do a database dump from database A and import it into database B? Or, as an alternative, use insert into B.table (col1, col2) select col1, col2 from A.table?
If your databases are MySQL, you can use the MERGE storage engine. Here are the steps:
-In one of your databases, update all your ids via Hibernate using cascade-all. In each table, every id has to be incremented by the last id of the corresponding table in your other database:
User1 (2000 rows, lastId: 2000) and User2 (3000 rows, lastId: 3000) -> User1 (2000 rows, lastId: 2000) and User2 (3000 rows, firstId: 2001, lastId: 5000)
-Create another database that merges all your databases
-Extract a dump from your new database and load that dump into your final database -> http://dev.mysql.com/doc/refman/5.0/en/merge-storage-engine.html
This is one possible way :)
I know this is an old thread, but I had a similar problem.
I solved it by adding two date fields, included_date and changed_date, to my tables. I also added another field, stored somewhere else (I have a table with configuration info), to save the date I last synced the databases.
When my system connects to the server, I send the date of the last sync, so my routine can work out which rows have been included or changed since then.
For every new row I set the date in the included_date field, so when I sync I know which rows were created after my last sync and can do an INSERT. The same happens with row changes and the changed_date field, where I do an UPDATE instead.
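A minimal sketch of the two sync queries, assuming a hypothetical my_table and a lastSync value read from the configuration table:

import java.sql.*;
import java.util.ArrayList;
import java.util.List;

public class SyncFetcher {

    // Rows created since the last sync -> candidates for INSERT on the other side.
    public static List<Long> newRowIds(Connection conn, Timestamp lastSync) throws SQLException {
        return idsFor(conn, "select id from my_table where included_date > ?", lastSync);
    }

    // Rows changed since the last sync -> candidates for UPDATE on the other side.
    public static List<Long> changedRowIds(Connection conn, Timestamp lastSync) throws SQLException {
        return idsFor(conn, "select id from my_table where changed_date > ?", lastSync);
    }

    private static List<Long> idsFor(Connection conn, String sql, Timestamp lastSync)
            throws SQLException {
        List<Long> ids = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setTimestamp(1, lastSync);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    ids.add(rs.getLong("id"));
                }
            }
        }
        return ids;
    }
}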