How to update existing entities with JpaItemWriter?

I'm using spring-batch jobs to persist content of a large csv file to a database.
JpaItemWriter is used for persistence, which is fine so far.
But now I'd like to first check whether an entity already exists in the database (by id - the id field in the csv and in the database are equal), and in that case update the existing entity instead of inserting a new one.
How could this be done?

When I needed to do this, the best I came up with was having my custom FieldSetMapper (used by the FlatFileItemReader) load the item from the database (or create a new instance if it doesn't exist) and then set the properties based on the input. Since JpaItemWriter uses .merge, it will write the entity with an update if it was loaded from the database and with an insert if it was a new instance.
I also needed to have it run with a batch size of 1, to ensure that if there were duplicates in my input (which I did have), it would actually go one row at a time and insert or update for each one and not try to insert them all at once causing key problems.
As you might imagine, all this ran a lot slower than I would have liked: it queries the database for each and every row and then issues the corresponding update or insert. But since in my case it was a monthly overnight batch process, that was good enough for our needs, even if it took many hours to run.
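For illustration, here is a minimal sketch of such a mapper, assuming a hypothetical Product entity whose id matches the csv's id column (the entity and field names are made up, not from the original setup):

import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;
import org.springframework.batch.item.file.mapping.FieldSetMapper;
import org.springframework.batch.item.file.transform.FieldSet;
import org.springframework.validation.BindException;

public class UpsertingFieldSetMapper implements FieldSetMapper<Product> {

    @PersistenceContext
    private EntityManager entityManager;

    @Override
    public Product mapFieldSet(FieldSet fieldSet) throws BindException {
        Long id = fieldSet.readLong("id");
        Product product = entityManager.find(Product.class, id); // existing row, if any
        if (product == null) {
            product = new Product(); // not in the database yet: this becomes an insert
            product.setId(id);
        }
        product.setName(fieldSet.readString("name")); // overwrite with the csv values
        return product; // JpaItemWriter's merge() then updates or inserts accordingly
    }
}

Combined with a commit interval of 1 on the step, this processes the file row by row exactly as described above.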

Related

Table data overrides

I'm currently sourcing some static data from a third party. It's a simple one-to-many, like this:

garage:
  id
  name
  desc
  location

garage_price:
  id
  garage_id
  price_type
  price
Sometimes, the data is incorrect, and I will need to correct it. At the same time, I'd like to preserve the original sourced data somewhere and potentially run some queries to show the changes.
My question is whether someone is doing something like this with SQL, Java and Hibernate, and what's the approach you've taken, or would take.
I could add a boolean column, "original_data", to both tables, and before an update happens, run a trigger to copy the row from garage or garage_price into an "original_garage" or "original_price" table as long as original_data is true. Then set original_data to false, and all further updates will just happen on the garage/garage_price tables.
Anything wrong with that approach, and how do people typically work with multiple tables holding the same data in Hibernate/JPA? Previously, I'd create a class that holds all the data and subclass it twice, once for each table, while setting
@Inheritance(strategy = InheritanceType.TABLE_PER_CLASS)
on the parent.
As so often, there are various options:
Use Hibernate Envers. It will keep a complete history of changes, so if you make multiple changes each will result in a row in the auditing tables. These tables are separate from your main data tables, which might be a pro or a con, depending on your requirements.
Use the approach that you described: write the original dataset and copy it before modifying it. You'll need two additional attributes: a flag marking the original row, and a technical id so that each copy has a unique primary key (see the sketch after this list).
Just as in the second option, but do the copying in a database trigger. That is probably faster, works no matter how the data gets inserted, and copying rows in the database is actually really easy, while it feels rather cumbersome in Java. Of course, many Java developers consider writing triggers a PITA in itself, and if your application doesn't usually use triggers and stored procedures it is also really easy to forget about the trigger and be rather confused about where the additional rows come from.
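For illustration, a rough sketch of the second option as a JPA entity, using the garage table from the question (the attribute names are only suggestions):

import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;

@Entity
public class Garage {

    @Id
    @GeneratedValue
    private Long id;            // technical id, so each copy gets its own primary key

    private Long sourceId;      // id from the third-party feed, shared by original and copy

    private boolean originalData = true; // stays true on the untouched copy, false on the corrected row

    private String name;
    private String desc;
    private String location;

    // getters and setters omitted
}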

Populate database table on a frequent basis using JPA

One of my Java application's functions is to read and parse an xml file very frequently (almost every 5 minutes) and populate a database table. I have created a cron job to do that. Most of the columns' values remain the same, but for certain columns there may be a frequent update to the value. I was wondering which is the most efficient way of doing that:
1) Delete the table every time and re-create it or
2) Update the table data and specifically the column where a change in the source file has appeared.
The number of rows parsed and persisted every time is about 40000-50000.
I would assume that around 2000-3000 rows need to be updated on every cron job run.
I am using JPA to persist data to a MySQL server and I have gone for the first option so far.
Obviously for both options the job would execute as a single transaction.
Any ideas which one is better and possibly any optimization suggestions?
I would suggest scheduling your jobs using something more sophisticated than cron. For instance, Quartz.
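For illustration, a minimal Quartz 2.x sketch of such a schedule (the job class and identity names are made up):

import org.quartz.Job;
import org.quartz.JobBuilder;
import org.quartz.JobExecutionContext;
import org.quartz.JobExecutionException;
import org.quartz.Scheduler;
import org.quartz.SimpleScheduleBuilder;
import org.quartz.TriggerBuilder;
import org.quartz.impl.StdSchedulerFactory;

public class XmlImportJob implements Job {
    @Override
    public void execute(JobExecutionContext context) throws JobExecutionException {
        // parse the xml file and persist the rows here
    }
}

// wiring, e.g. at application startup:
Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
scheduler.scheduleJob(
        JobBuilder.newJob(XmlImportJob.class).withIdentity("xmlImport").build(),
        TriggerBuilder.newTrigger()
                .startNow()
                .withSchedule(SimpleScheduleBuilder.simpleSchedule()
                        .withIntervalInMinutes(5)
                        .repeatForever())
                .build());
scheduler.start();

Unlike cron, this also gives you misfire handling and keeps the schedule inside the application.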

Managing history records in a database

I have a web project that uses a database to store data that is used to generate tasks, which are processed by remote machines that alter those records and store new data. My problem here is that I have to store all the changes on each table, but I don't need all of this information. For example, a table A could have 5 fields but I only need 2 for historical purposes. Another table B could have 3 and I would have to add another one (a date, for example). Also, I don't need the changes during daily task generation, only the most recent one.
Which is the best way to maintain a change history? Someone told me that a good idea is having two tables, the A (B) table and another one called A_history (B_history) with the needed fields. This is actually what I'm doing, using triggers to insert into history tables but I don't feel comfortable with this approach. My project uses Spring (Spring-data, Hibernate and JPA) and if I change the DB (currently MySQL) I'd have to migrate triggers. Is there a good way to manage history records? Tables could be generated with Hibernate/JPA annotations.
If I maintain the two tables approach, can I add a method to the repository to fetch rows from current table and history table at once?
For this purpose there is a special Hibernate Envers project; see its official documentation. Just configure it, annotate the necessary properties with the @Audited annotation, and that's all. No need for DB triggers.
One pitfall: if you want to have a record for each delete operation, then you need to use Session.delete(entity) instead of an HQL "delete ..." statement.
EDIT: Also take a look at the native auditing support of Spring Data JPA.
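For illustration, a minimal Envers sketch with a made-up Task entity:

import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import org.hibernate.envers.Audited;
import org.hibernate.envers.NotAudited;

@Entity
@Audited // Envers records every change in a generated TASK_AUD table, no triggers needed
public class Task {

    @Id
    @GeneratedValue
    private Long id;

    private String status;       // audited along with the rest of the entity

    @NotAudited
    private String scratchNotes; // fields you don't need history for can be excluded
}

Past revisions can then be read back through AuditReaderFactory.get(entityManager).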
I am not a database expert. What I have seen them do boils down to a few ways of approach.
1) They add a trigger to the transactional table that copies inserts and updates to a history table but not deletes. This means any queries that need to include history can be done from the history table since all the current info is there too.
a) They can tag each entry in the history table with time and date and keep track of all the states of the original records.
b) They can keep track of only the current state of the original record, and then it settles when the original is deleted.
2) They have a periodic task that goes around and copies data marked as deletable into the history table. It then deletes the data from the transactional table. Any queries in the transactional table have to make sure to ignore the deletable rows. Any queries that need history have to search both tables and merge the results.
3) If the volume of data isn't too large, they just leave everything in one table and mark some entries as historical. Queries have to ignore historical rows. Queries that include history are easy. This may slow down database access as the table grows to include many unused rows but that can sometimes be ameliorated by clever use of indexes.

How to make update faster with Hibernate while using huge number of records

I am facing an issue with the Hibernate update (Session.update()) portion when handling a huge number of records: it becomes very slow, but there is no issue with the insert (Session.save()) portion. Is there any way to do the update portion efficiently when we update lakhs of records? Is there any way to tune the SQL server so that the update becomes faster? When we add separate indexes to all the primary fields, the delete portion takes time instead. Is there any better way to tune the SQL server so that it performs well with insert, delete and update?
Thank you,
Saif.
Do a batch update instead of an individual update for each record. This way you will only hit the database once for all the records.
When you do a save, the data is simply written to the database, whereas when you update a record it has to first perform a search operation and then update the record; that is why you are facing issues on update and not on save. When handling a huge number of records you can use Hibernate's batch processing to update them. Here is a good link for batch processing in Hibernate from Tutorials Point:
http://www.tutorialspoint.com/hibernate/hibernate_batch_processing.htm
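For illustration, a sketch of the flush/clear batching pattern from that link, assuming a made-up Employee entity and hibernate.jdbc.batch_size set to a matching value (here 50) in the configuration:

import org.hibernate.ScrollMode;
import org.hibernate.ScrollableResults;
import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;

void batchUpdateSalaries(SessionFactory sessionFactory) {
    Session session = sessionFactory.openSession();
    Transaction tx = session.beginTransaction();
    ScrollableResults rows = session.createQuery("FROM Employee")
            .scroll(ScrollMode.FORWARD_ONLY);
    int count = 0;
    while (rows.next()) {
        Employee employee = (Employee) rows.get(0);
        employee.setSalary(employee.getSalary() * 1.1); // whatever change you need
        if (++count % 50 == 0) {
            session.flush(); // push the batched updates to the database
            session.clear(); // evict processed entities so the session stays small
        }
    }
    tx.commit();
    session.close();
}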
There may be other solutions to this but one way I know is:
Whenever you save or update an instance through the session (e.g. session.save(), session.update(), session.saveOrUpdate(), etc.), it also updates the instance's FK associations.
So if your POJO has multiple FK associations, it will fire queries on those tables as well.
So instead of updating the instance that way, I would suggest using HQL (if it applies to your requirement) to save or update it.
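For illustration, the same kind of change as a single HQL bulk update (again with a made-up Employee entity); it updates all matching rows in one statement without loading them, or their associations, into the session:

int updated = session.createQuery(
        "update Employee set salary = salary * 1.1 where department = :dept")
        .setParameter("dept", "SALES")
        .executeUpdate();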

database audit table

I have an existing application that I am working with, and the customer has defined the table structure they would like for an audit log. It has the following columns:
storeNo
timeChanged
user
tableChanged
fieldChanged
BeforeValue
AfterValue
Usually I just have simple audit columns on each table that provide a userChanged and a timeChanged value. The application that will be writing to these tables is a Java application, and the calls are made via JDBC against an Oracle database. The question I have is: what is the best way to get the before/after values? I hate to compare objects to see what changes were made in order to populate this table; this is not going to be efficient. If several columns change in one update, then this new table will have several entries. Or is there a way to do this in Oracle? What have others done in the past to track not only changes but the changed values?
This is traditionally what Oracle triggers are for. Each insert or update triggers a stored procedure which has access to the "before and after" data, which you can do with as you please, such as logging the old values to an audit table. It's transparent to the application.
http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:59412348055
If you use Oracle 10g or later, you can use the built-in auditing functions. You paid good money for the license, might as well use it.
Read more at http://www.oracle.com/technology/pub/articles/10gdba/week10_10gdba.html
"the customer has defined the table structure they would like for an audit log"
Dread words.
Here is how you would implement such a thing:
create or replace trigger emp_bur before update on emp for each row
begin
    -- insert_audit_record is assumed to be a helper procedure that writes one audit row
    if :new.ename != :old.ename then
        insert_audit_record('EMP', 'ENAME', :old.ename, :new.ename);
    end if;
    if :new.sal != :old.sal then
        insert_audit_record('EMP', 'SAL', :old.sal, :new.sal);
    end if;
    if :new.deptno != :old.deptno then
        insert_audit_record('EMP', 'DEPTNO', :old.deptno, :new.deptno);
    end if;
end;
/
As you can see, it involves a lot of repetition, but that is easy enough to handle, with a code generator built over the data dictionary. But there are more serious problems with this approach.
It has a sizeable overhead: a single update which touches ten fields will generate ten insert statements.
The BeforeValue and AfterValue columns become problematic when we have to handle different datatypes - even dates and timestamps become interesting, let alone CLOBs.
It is hard to reconstruct the state of a record at a point in time. We need to start with the earliest version of the record and apply the subsequent changes incrementally.
It is not immediately obvious how this approach would handle INSERT and DELETE statements.
Now, none of those objections are a problem if the customer's underlying requirement is to monitor changes to a handful of sensitive columns: EMPLOYEES.SALARY, CREDIT_CARDS.LIMIT, etc. But if the requirement is to monitor changes to every table, a "whole record" approach is better: just insert a single audit record for each row affected by the DML.
I'll ditto on triggers.
If you have to do it at the application level, I don't see how it would be possible without going through these steps:
start a transaction
SELECT FOR UPDATE of the record to be changed
for each field to be changed, pick up the old value from the record and the new value from the program logic
for each field to be changed, write an audit record
update the record
end the transaction
If there's a lot of this, I think I would be creating an update-record function to do the compares, either at a generic level or a separate function for each table.
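For illustration, a rough JDBC version of those steps for a single field, reusing the EMP example from the trigger above (the table, column and audit-log names are made up):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.DataSource;

void updateEnameWithAudit(DataSource dataSource, long empNo, String newEname) throws SQLException {
    try (Connection con = dataSource.getConnection()) {
        con.setAutoCommit(false); // start a transaction
        String oldEname;
        // SELECT FOR UPDATE locks the row so the before-value cannot change under us
        try (PreparedStatement lock = con.prepareStatement(
                "SELECT ename FROM emp WHERE empno = ? FOR UPDATE")) {
            lock.setLong(1, empNo);
            try (ResultSet rs = lock.executeQuery()) {
                rs.next();
                oldEname = rs.getString(1);
            }
        }
        if (!newEname.equals(oldEname)) { // field changed: write one audit record
            try (PreparedStatement audit = con.prepareStatement(
                    "INSERT INTO audit_log (tableChanged, fieldChanged, BeforeValue, AfterValue) "
                            + "VALUES ('EMP', 'ENAME', ?, ?)")) {
                audit.setString(1, oldEname);
                audit.setString(2, newEname);
                audit.executeUpdate();
            }
        }
        try (PreparedStatement upd = con.prepareStatement(
                "UPDATE emp SET ename = ? WHERE empno = ?")) {
            upd.setString(1, newEname);
            upd.setLong(2, empNo);
            upd.executeUpdate();
        }
        con.commit(); // end the transaction
    }
}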
