One piece of my Java application's functionality is to read and parse an XML file very frequently (almost every 5 minutes) and populate a database table. I have created a cron job to do that. Most of the columns' values remain the same, but for certain columns there may be a frequent update of the value. I was wondering what is the most efficient way of doing that:
1) Delete the table every time and re-create it or
2) Update the table data, and specifically the columns where a change in the source file has appeared.
The number of rows parsed and persisted every time is about 40000-50000.
I would assume that around 2000-3000 rows need to be updated on every cron job run.
I am using JPA to persist data to a MySQL server, and I have gone for the first option so far.
Obviously for both options the job would execute as a single transaction.
Any ideas which one is better and possibly any optimization suggestions?
I would suggest scheduling your jobs using something more sophisticated than cron. For instance, Quartz.
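If you go the Quartz route, a minimal sketch of a job that fires every five minutes might look something like the following (the class and trigger names are made up for illustration; the job body would simply call your existing parse-and-persist code):

```java
import org.quartz.*;
import org.quartz.impl.StdSchedulerFactory;

// Hypothetical job that wraps the existing XML-parse-and-persist logic.
public class XmlImportJob implements Job {

    @Override
    public void execute(JobExecutionContext context) throws JobExecutionException {
        // call your existing XML parsing / persistence code here
    }

    public static void main(String[] args) throws SchedulerException {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();

        JobDetail job = JobBuilder.newJob(XmlImportJob.class)
                .withIdentity("xmlImportJob")
                .build();

        // Fire every 5 minutes, forever.
        Trigger trigger = TriggerBuilder.newTrigger()
                .withIdentity("xmlImportTrigger")
                .withSchedule(SimpleScheduleBuilder.simpleSchedule()
                        .withIntervalInMinutes(5)
                        .repeatForever())
                .build();

        scheduler.scheduleJob(job, trigger);
        scheduler.start();
    }
}
```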
I'm using spring-batch jobs to persist content of a large csv file to a database.
JpaItemWriter is used for persistence, which is fine so far.
But now I'd like to first check if an entity already exists in the database (by id - the id field in the csv and in the database are equal), and in that case just update the entity instead.
How could this be done?
When I needed to do this, the best I came up with was having my custom FieldSetMapper (used by the FlatFileItemReader) load the item from the database (or create a new instance if it doesn't exist) and then set the properties based on the input. Since JpaItemWriter uses .merge, it will write the entity by updating if it was loaded from the database and inserting if it was a new entity.
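For what it's worth, a rough sketch of such a mapper, assuming a hypothetical Customer entity and a Spring Data CustomerRepository (both names are illustrative, not from the original setup):

```java
import org.springframework.batch.item.file.mapping.FieldSetMapper;
import org.springframework.batch.item.file.transform.FieldSet;
import org.springframework.validation.BindException;

// Loads the existing entity by id (or creates a new one) and copies the csv values onto it.
public class UpsertingCustomerMapper implements FieldSetMapper<Customer> {

    private final CustomerRepository repository; // hypothetical Spring Data repository

    public UpsertingCustomerMapper(CustomerRepository repository) {
        this.repository = repository;
    }

    @Override
    public Customer mapFieldSet(FieldSet fieldSet) throws BindException {
        long id = fieldSet.readLong("id");
        // Existing row -> update it; no row -> start a fresh instance.
        Customer customer = repository.findById(id).orElseGet(Customer::new);
        customer.setId(id);
        customer.setName(fieldSet.readString("name"));
        customer.setCountry(fieldSet.readString("country"));
        // JpaItemWriter's merge() will then update the loaded row or insert the new one.
        return customer;
    }
}
```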
I also needed to have it run with a batch size of 1, to ensure that if there were duplicates in my input (which I did have), it would actually go one row at a time and insert or update each one, rather than trying to insert them all at once and causing key problems.
As you might imagine, all this worked a lot slower than I would have liked. It queries the database for each and every row, and then does the corresponding update or insert. But since in my case it was a monthly overnight batch process, it was good enough for our needs, even if it took many hours to run.
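If it helps, the batch-size-of-1 part can be expressed as a chunk size of 1 on the step. A sketch assuming Spring Batch 4-style builders and the same hypothetical Customer type:

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.database.JpaItemWriter;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ImportStepConfig {

    // Chunk size 1: each row is read, mapped and merged in its own write, so
    // duplicate ids in the input hit the database one at a time instead of
    // colliding inside a single insert batch.
    @Bean
    public Step importStep(StepBuilderFactory stepBuilderFactory,
                           FlatFileItemReader<Customer> reader,
                           JpaItemWriter<Customer> writer) {
        return stepBuilderFactory.get("importStep")
                .<Customer, Customer>chunk(1)
                .reader(reader)
                .writer(writer)
                .build();
    }
}
```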
I am developing an application using a plain JDBC connection. The application is developed with Java/Java EE, Spring MVC 3.0, and SQL Server 2008 as the database. I am required to update a table based on a non-primary-key column.
Before updating the table, we had to decide on an approach, as the table may contain a huge amount of data. The update query will be executed in a batch, and we are required to design the application so that it doesn't hog system resources.
We had to decide between the following approaches:
1. SELECT DATA BEFORE YOU UPDATE or
2. UPDATE DATA AND THEN SELECT MISSING DATA.
Select data before you update is only beneficial if the chances of failure are high, i.e. if a batch of 100 update queries is executed and only 20 rows are updated successfully, then this approach should be taken.
Update data and then check for missing data is beneficial only when failing records are far fewer. With this approach one database select call can be avoided: after a batch update, the count of records updated can be taken, and the select query should be executed if and only if that count does not match the number of queries in the batch.
We are totally unaware of what the production environment will look like, but we want to cover all possibilities and want a fast system. I need your input on which is the better approach.
Since there is a 50:50 chance of successful updates or faster selects, it's hard to tell from the scenario described. You would probably want a fuzzy-logic approach: get constant feedback on how many updates were successful over a period of time, and then decide on the basis of that data whether to update before selecting or select before updating.
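One cheap way to get that feedback is the update-count array returned by JDBC's executeBatch(): run the update batch first, and fall back to a select (or an insert) only for the statements that touched zero rows. A minimal sketch, with the table, columns, and Row type invented purely for illustration:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.ArrayList;
import java.util.List;

public final class BatchUpdater {

    // Row is a placeholder for whatever holds the lookup key and the new value.
    public List<Row> updateAndCollectMisses(Connection conn, List<Row> rows) throws Exception {
        List<Row> missed = new ArrayList<>();
        String sql = "UPDATE my_table SET some_value = ? WHERE lookup_key = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (Row row : rows) {
                ps.setString(1, row.getValue());
                ps.setString(2, row.getKey());
                ps.addBatch();
            }
            // One update count per statement, in the same order as the batch
            // (some drivers may report Statement.SUCCESS_NO_INFO instead of exact counts).
            int[] counts = ps.executeBatch();
            for (int i = 0; i < counts.length; i++) {
                if (counts[i] == 0) {
                    missed.add(rows.get(i)); // nothing matched: only these need the follow-up select
                }
            }
        }
        return missed;
    }
}
```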
I am working on a solution for the scenario described below but could not find any best practice/tool for it.
For a batch of requests (say 5000 unique ids and records) received in a web service call, it has to fetch the rows for those unique ids from the database, keep them in a buffer (or cache), and compare them with the records received in the web service. If there is a change in a particular piece of data (say a column), it is updated in the table for that unique id, and in turn the child tables of that table are also affected. For example, if someone changes his laptop model number and country, the model number is updated in one table and the country value in another. Likewise it goes on accessing multiple tables in a short time. The number of records coming in a web service call might reach 70K in one call in an hour.
I don't have any option other than implementing it in Java. Is there any good practice for implementing this, or can it be achieved using any open-source Java tools? Please suggest. Thanks.
Hibernate is likely the first thing you should try. I tend to avoid it because it is overkill for most of my applications, but it is a standard tool for accessing databases that anyone who knows Java should at least have an understanding of. There are dozens of other solutions you could use, but Hibernate is the most often used.
JDBC is the API to use to access relational databases. Useful performance and security tips:
use prepared statements
use where ... in () queries to load many rows at once, but beware of the limit on the number of values in the in clause (1000 max in Oracle)
use batched statements to make your updates, rather than executing each update separately (see http://download.oracle.com/javase/1.3/docs/guide/jdbc/spec2/jdbc2.1.frame6.html); a rough sketch of these last two points follows below
See http://download.oracle.com/javase/tutorial/jdbc/ for a tutorial on JDBC.
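As a rough illustration of those last two tips (loading a chunk of rows with one in query, then applying the changed rows as one batched update), here is a sketch; the laptops table and model_number column are placeholders borrowed from the laptop example in the question:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class BulkCompareAndUpdate {

    // Load current values for a chunk of ids with a single IN query
    // (keep the chunk well under the database's IN-list limit).
    public Map<Long, String> loadExisting(Connection conn, List<Long> ids) throws Exception {
        StringBuilder sql = new StringBuilder("SELECT id, model_number FROM laptops WHERE id IN (");
        for (int i = 0; i < ids.size(); i++) {
            sql.append(i == 0 ? "?" : ", ?");
        }
        sql.append(")");
        Map<Long, String> existing = new HashMap<>();
        try (PreparedStatement ps = conn.prepareStatement(sql.toString())) {
            for (int i = 0; i < ids.size(); i++) {
                ps.setLong(i + 1, ids.get(i));
            }
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    existing.put(rs.getLong("id"), rs.getString("model_number"));
                }
            }
        }
        return existing;
    }

    // Apply all changed rows as one batched update instead of row-by-row round trips.
    public void updateChanged(Connection conn, Map<Long, String> changes) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "UPDATE laptops SET model_number = ? WHERE id = ?")) {
            for (Map.Entry<Long, String> e : changes.entrySet()) {
                ps.setString(1, e.getValue());
                ps.setLong(2, e.getKey());
                ps.addBatch();
            }
            ps.executeBatch();
        }
    }
}
```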
This doesn't sound that complicated. Of course, you must know (or learn):
SQL
JDBC
Then you can go through the web service data record by record and for each record do the following:
fetch corresponding database record
for each field in record
    if updated
        execute corresponding update SQL statement
commit  // every so many records
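A hedged JDBC rendering of that loop, with the table, columns, and Record type invented for illustration, might look like this:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.List;

public final class RecordSync {

    // Record is a stand-in for whatever the web service actually delivers.
    public void sync(Connection conn, List<Record> incoming) throws Exception {
        conn.setAutoCommit(false);
        int sinceCommit = 0;
        try (PreparedStatement sel = conn.prepareStatement(
                     "SELECT model_number, country FROM devices WHERE id = ?");
             PreparedStatement upd = conn.prepareStatement(
                     "UPDATE devices SET model_number = ?, country = ? WHERE id = ?")) {
            for (Record rec : incoming) {
                sel.setLong(1, rec.getId());
                try (ResultSet rs = sel.executeQuery()) {
                    if (!rs.next()) {
                        continue; // or insert, depending on the requirements
                    }
                    // only issue an UPDATE when something actually changed
                    if (!rs.getString("model_number").equals(rec.getModelNumber())
                            || !rs.getString("country").equals(rec.getCountry())) {
                        upd.setString(1, rec.getModelNumber());
                        upd.setString(2, rec.getCountry());
                        upd.setLong(3, rec.getId());
                        upd.executeUpdate();
                    }
                }
                if (++sinceCommit >= 500) { // commit every so many records
                    conn.commit();
                    sinceCommit = 0;
                }
            }
            conn.commit();
        }
    }
}
```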
70K records per hour should not be the slightest problem for a decent RDBMS.
I have a batch job written in Java which truncates and then loads a certain table in an Oracle database every few minutes. There are reports generated on web pages based on the data in that table. I am wondering about a good way of not affecting the report querying while the data loading process is happening, so that users won't end up with partial and/or no data.
If you process all your SQL statements inside a single transaction, there will always be a valid state seen from outside. Beware that TRUNCATE does not work inside transactions, so you have to use DELETE. While this guarantees that there is always reasonable data in your table, it needs a bigger rollback segment and will be considerably slower.
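A minimal sketch of that single-transaction reload, assuming a placeholder report_data table and a Row type standing in for your parsed data:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.Statement;
import java.util.List;

public final class TransactionalReload {

    public void reload(Connection conn, List<Row> rows) throws Exception {
        boolean oldAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);
        try (Statement delete = conn.createStatement();
             PreparedStatement insert = conn.prepareStatement(
                     "INSERT INTO report_data (id, value) VALUES (?, ?)")) {
            // DELETE instead of TRUNCATE so the whole reload stays inside one transaction.
            delete.executeUpdate("DELETE FROM report_data");
            for (Row row : rows) {
                insert.setLong(1, row.getId());
                insert.setString(2, row.getValue());
                insert.addBatch();
            }
            insert.executeBatch();
            conn.commit(); // readers only ever see the old data or the complete new data
        } catch (Exception e) {
            conn.rollback();
            throw e;
        } finally {
            conn.setAutoCommit(oldAutoCommit);
        }
    }
}
```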
You could have two tables and a meta table which tracks which table is the main one being used for querying. Your batch job truncates and loads one of the tables, and you switch the main table once the loading is completed. The query app then gets recent data, and you can load into the other table on the next run.
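A sketch of that switch, where active_table_meta, report_data_a/report_data_b, and the load step are all illustrative names rather than anything from the original setup:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public final class TableSwitchLoader {

    public void reloadAndSwitch(Connection conn) throws Exception {
        String active = readActiveTable(conn); // e.g. "report_data_a"
        String inactive = active.equals("report_data_a") ? "report_data_b" : "report_data_a";

        try (Statement st = conn.createStatement()) {
            st.executeUpdate("TRUNCATE TABLE " + inactive); // safe: no report reads this table
        }
        loadInto(conn, inactive); // your existing load logic goes here

        // Flip the pointer only after the load has finished successfully.
        try (PreparedStatement ps = conn.prepareStatement(
                "UPDATE active_table_meta SET table_name = ?")) {
            ps.setString(1, inactive);
            ps.executeUpdate();
        }
    }

    private String readActiveTable(Connection conn) throws Exception {
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT table_name FROM active_table_meta")) {
            rs.next();
            return rs.getString(1);
        }
    }

    private void loadInto(Connection conn, String tableName) {
        // placeholder for the batch job's existing insert logic
    }
}
```

The reports then resolve the table name from active_table_meta before querying, so they always hit a fully loaded table.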
What I would do is set a flag in a DB table to indicate that the update is in progress, and have the reports look for that flag, display an appropriate message, and wait for the update to finish. Once the update is complete, clear the flag.
I have a table that contains approximately 10 million rows. This table is periodically updated (a few times a day) by an external process. The table contains information that, if not in the update, should be deleted. Of course, you don't know whether it is in the update until the update has finished.
Right now, we take the timestamp of when the update began. When the update finishes, anything that has an "updated" value less than the start timestamp is wiped. This works for now, but it is problematic when the updater process crashes for whatever reason - we have to start again with a new timestamp value.
It seems to me that there must be something more robust, as this is a common problem. Any advice?
Instead of a timestamp, use an integer revision number. Increment it ONLY when you have a complete update, and then delete the elements with out-of-date revisions.
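A sketch of that idea, assuming MySQL's ON DUPLICATE KEY UPDATE for the upsert and an illustrative items table and Item type:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;

public final class RevisionedLoader {

    // The delete is only reached when the whole update ran to completion,
    // so a crashed run never wipes rows that simply weren't processed yet.
    public void applyUpdate(Connection conn, List<Item> items, long newRevision) throws Exception {
        try (PreparedStatement upsert = conn.prepareStatement(
                "INSERT INTO items (id, payload, revision) VALUES (?, ?, ?) "
                        + "ON DUPLICATE KEY UPDATE payload = VALUES(payload), revision = VALUES(revision)")) {
            for (Item item : items) {
                upsert.setLong(1, item.getId());
                upsert.setString(2, item.getPayload());
                upsert.setLong(3, newRevision);
                upsert.addBatch();
            }
            upsert.executeBatch();
        }
        // Everything still carrying an older revision was absent from this update.
        try (PreparedStatement purge = conn.prepareStatement(
                "DELETE FROM items WHERE revision < ?")) {
            purge.setLong(1, newRevision);
            purge.executeUpdate();
        }
    }
}
```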
If you use a storage engine that supports transactions, like InnoDB (you're using MySQL, right?), you can consider using transactions, so that if the update process crashes, the modifications are not committed.
Here is the official documentation.
We don't know anything about your architecture or how you do this update (pure SQL, a web service?), but you might already have a transaction management layer.