Hibernate bulk operations migrate databases - java

I wrote a small executable jar using Spring and Spring Data JPA to migrate data from one database to another. It converts objects from the original database (spread across several tables) into valid objects for the new database and then inserts them into the new database.
The problem is that I process a large amount of data (200,000 records) and inserting them one by one is really time-consuming: about 1 hour, all of it spent on the INSERT operations that happen after validating and transforming the incoming data, not on retrieval from the original database or on validation/conversion.
I have already received some suggestions:
[Edited because I didn't explain it well] Since I am doing extract-validate-transform-insert, do my inserts (which are valid because they are verified first) X objects at a time instead of one by one. That is the suggestion from the first answer. I tried it, but it is not very efficient and is still time-consuming.
Instead of saving directly to the database, write the inserts to a .sql file and then import the file directly into the database. But how do I translate myDao.save() into final SQL output and write it to a file?
Use Talend: I know it is probably the best way, but it would take too long to redo everything. I'd like to find a way using Java and refactor my jar.
Any other ideas?
Note: one important point is that if one validation fails, I want to continue processing the other data; I only log an error.
Thanks

You should pause and think for a minute: what could cause an error when inserting your data into the database? Short of "your database is hosed", there are two possibilities:
There is a bug in your code;
The data coming in is bad.
If you have a bug in your code, you would be better off if the whole data load is reverted. You will get another chance to transfer the data after you fix your code.
If the data coming in is bad, or is suspected bad, you should add a step for validating your data. So, your process workflow might look like this: Extract --> Validate --> Transform --> Load. If the incoming data is invalid, write it into the log or load into a separate table for erroneous data.
You should run the whole process in a single transaction, using the same Hibernate session. Keeping all 200K records in memory would be pushing it, so I suggest using batching (see http://docs.jboss.org/hibernate/orm/3.3/reference/en-US/html/batch.html). In short: after a predetermined number of records, say 1000, flush and clear your Hibernate session.
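A minimal sketch of that flush-and-clear loop. It assumes a hypothetical PersistenceContext interface standing in for the persist/flush/clear operations of a Hibernate Session; that interface and the BatchLoader class are illustrative names, not Hibernate API.

```java
// Hypothetical stand-in for the three Session operations used below.
interface PersistenceContext<T> {
    void persist(T entity);
    void flush();
    void clear();
}

final class BatchLoader {
    /**
     * Persists every item, flushing and clearing the context after each
     * full batch and once more for a trailing partial batch.
     * Returns the number of flushes performed.
     */
    static <T> int load(Iterable<T> items, PersistenceContext<T> ctx, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int pending = 0;
        int flushes = 0;
        for (T item : items) {
            ctx.persist(item);
            pending++;
            if (pending == batchSize) {
                ctx.flush();  // push the pending inserts to the database
                ctx.clear();  // drop the persisted objects from memory
                pending = 0;
                flushes++;
            }
        }
        if (pending > 0) {
            ctx.flush();
            ctx.clear();
            flushes++;
        }
        return flushes;
    }
}
```

With real Hibernate, JDBC batching also has to be enabled through the hibernate.jdbc.batch_size property, which the linked chapter covers.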

Related

Reading huge data set from database through java using multi-threading program

In my project I am generating a report, which involves transferring a huge amount of data from the DB.
The logic is as follows: the user supplies certain criteria, based on which we first fetch parent items from the DB. There may be 100,000 parent items. After getting these items, we also gather their child items and their details. We then put all of this parent and child information into one response XML.
This is fine for a small number of records, but for a huge number it takes too long. We use a tool as the back-end system that stores the records. It has its own query set, so query optimization did not work. We have to do everything in Java.
Can anyone give some ideas on how to optimize this?
Not really a true answer, but too long for a comment
You must benchmark the different steps:
database - time a select extracting all the records (parents + children) directly on the database (assuming it is a simple database)
network - time a transfer of approximately the size of the whole record set
processing - store the result in a local file and time the processing when reading from that file (you must also time a plain copy of the file to know how much time is spent reading from disk)
Multithreading will only help if the bottleneck is processing.
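One way to collect those timings, as a rough sketch: a small helper that wraps each step and accumulates elapsed time per step name. StepTimer and the step names are hypothetical, and a single run measured with System.nanoTime is only a coarse indication, not a rigorous benchmark.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

final class StepTimer {
    // Accumulated elapsed time per step, in nanoseconds, in first-seen order.
    private final Map<String, Long> elapsedNanos = new LinkedHashMap<>();

    /** Runs the given work, records how long it took, and returns its result. */
    <T> T time(String step, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            elapsedNanos.merge(step, System.nanoTime() - start, Long::sum);
        }
    }

    Map<String, Long> elapsedNanos() {
        return elapsedNanos;
    }

    /** Name of the step with the largest accumulated time, or null if none ran. */
    String slowestStep() {
        String slowest = null;
        long max = -1L;
        for (Map.Entry<String, Long> e : elapsedNanos.entrySet()) {
            if (e.getValue() > max) {
                max = e.getValue();
                slowest = e.getKey();
            }
        }
        return slowest;
    }
}
```

Wrapping the database query, the transfer, and the processing in separate time(...) calls shows which of the three dominates before any threads are added.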

Get MYSQL last updated data or lastly inserted data

I have a problem like this.
I am accessing a database which currently has over 100,000 rows in a new entry table.
Now I want to write a listener, meaning that if any new record is inserted into the table from somewhere else, I get a notification.
My question is: what is the best and fastest way to do this? There should be around 500 new rows per day in the new entry table. Is it suitable to check the database repeatedly using a thread?
I'm using Java with MySQL.
Please advise.
I am not sure whether any listener exists for MySQL changes, so it wouldn't be straightforward to get these details.
But there is something called 'The Binary Log' in MySQL, which contains “events” that describe database changes such as table creation operations or changes to table data.
So one way to track the changes is to poll these logs. The challenge is that they are written in a binary format. MySQL provides a utility called mysqlbinlog to display these logs in text format.
Here is a Java parser to your rescue, which can read the MySQL binary logs:
https://github.com/tangfl/jbinlog
By integrating all these bits and pieces, you may be able to get what you need.
Try this:
numero = stmt.executeUpdate(query, Statement.RETURN_GENERATED_KEYS);
Take a look at the documentation for the JDBC Statement interface.
I used the Java Timer class as an alternative to this solution, and it works fine now. It checks the database every 10 seconds and, if the condition is true, executes what I want.
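A sketch of what such a timer-based check might look like, assuming the table has an increasing numeric id so that "new" can be defined as "id greater than the last one seen". NewRowPoller and the fetchIdsAfter function are hypothetical; in a real application that function would run a query along the lines of SELECT id FROM new_entry WHERE id > ? ORDER BY id (table and column names are made up here).

```java
import java.util.List;
import java.util.Timer;
import java.util.TimerTask;
import java.util.function.LongConsumer;
import java.util.function.LongFunction;

final class NewRowPoller {
    private final LongFunction<List<Long>> fetchIdsAfter; // hypothetical query hook
    private final LongConsumer onNewRow;                  // notification callback
    private long lastSeenId;

    NewRowPoller(long startAfterId,
                 LongFunction<List<Long>> fetchIdsAfter,
                 LongConsumer onNewRow) {
        this.lastSeenId = startAfterId;
        this.fetchIdsAfter = fetchIdsAfter;
        this.onNewRow = onNewRow;
    }

    /** One polling pass: notifies for every id above the last one seen. */
    synchronized int pollOnce() {
        List<Long> ids = fetchIdsAfter.apply(lastSeenId);
        for (long id : ids) {
            onNewRow.accept(id);
            lastSeenId = Math.max(lastSeenId, id);
        }
        return ids.size();
    }

    /** Schedules pollOnce() on a daemon timer thread; cancel() the result to stop. */
    Timer start(long periodMillis) {
        Timer timer = new Timer("new-row-poller", true);
        timer.schedule(new TimerTask() {
            @Override
            public void run() {
                try {
                    pollOnce();
                } catch (RuntimeException e) {
                    // An uncaught exception would terminate the timer thread.
                    System.err.println("Poll failed: " + e);
                }
            }
        }, 0L, periodMillis);
        return timer;
    }
}
```

Calling start(10_000) would reproduce the 10-second interval mentioned above. At roughly 500 new rows per day, most passes return nothing, so the query should be cheap as long as the id column is indexed.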

Accessing database multiple times

I am working on a solution for the problem below but could not find any best practice or tool for it.
For a batch of requests (say 5000 unique ids and records) received by a web service, the service has to fetch the rows for those unique ids from the database, keep them in a buffer (or cache), and compare them with the records received. If a particular piece of data (say a column) has changed, it is updated in the table for that unique id, and the child tables of that table are affected in turn. For example, if someone changes his laptop model number and country, the model number is updated in one table and the country value in another. In this way, multiple tables are accessed within a short time. The maximum number of records coming in a web service call might reach 70K in one call in an hour.
I have no option other than implementing it in Java. Is there a good practice for implementing this, or can it be achieved using open source Java tools? Please suggest. Thanks.
Hibernate is likely to be the first thing you should try. I tend to avoid it because it is overkill for most of my applications, but it is a standard tool for accessing databases that anyone who knows Java should at least understand. There are dozens of other solutions you could use, but Hibernate is the most often used.
JDBC is the API to use to access a relational database. Useful performance and security tips:
use prepared statements
use where ... in () queries to load many rows at once, but beware of the limit on the number of values in the in clause (1000 max in Oracle)
use batched statements to make your updates, rather than executing each update separately (see http://download.oracle.com/javase/1.3/docs/guide/jdbc/spec2/jdbc2.1.frame6.html)
See http://download.oracle.com/javase/tutorial/jdbc/ for a tutorial on JDBC.
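To illustrate the first two tips together, here is a small sketch that splits a list of ids into chunks that respect an IN-clause limit and builds the matching parameterized SQL for a prepared statement. InClauseChunks, the table name and the column name are hypothetical, and the table and column arguments are assumed to be trusted constants, never user input.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class InClauseChunks {
    /** Splits ids into consecutive chunks of at most maxPerQuery elements. */
    static <T> List<List<T>> partition(List<T> ids, int maxPerQuery) {
        if (maxPerQuery <= 0) {
            throw new IllegalArgumentException("maxPerQuery must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int from = 0; from < ids.size(); from += maxPerQuery) {
            int to = Math.min(from + maxPerQuery, ids.size());
            chunks.add(new ArrayList<>(ids.subList(from, to)));
        }
        return chunks;
    }

    /** Builds a SELECT with one "?" placeholder per id in the chunk. */
    static String selectByIds(String table, String idColumn, int count) {
        if (count <= 0) {
            throw new IllegalArgumentException("count must be positive");
        }
        String placeholders = String.join(", ", Collections.nCopies(count, "?"));
        return "SELECT * FROM " + table + " WHERE " + idColumn
                + " IN (" + placeholders + ")";
    }
}
```

Each chunk then gets its own PreparedStatement, with the ids bound one placeholder at a time. For the third tip, the JDBC calls are addBatch() on the prepared update statement followed by a single executeBatch().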
This does not sound that complicated. Of course, you must know (or learn):
SQL
JDBC
Then you can go through the web service data record by record and for each record do the following:
fetch corresponding database record
for each field in record
    if updated
        execute corresponding update SQL statement
commit // every so many records
70K records per hour should not be the slightest problem for a decent RDBMS.
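The "for each field, if updated" step of the loop above could be sketched as follows, assuming each record is represented as a map from column name to value. RecordDiff is a hypothetical helper; the returned map holds only the columns that would need an UPDATE.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

final class RecordDiff {
    /**
     * Returns the fields of the incoming record whose values differ from
     * the stored record, keyed by field name, in the incoming order.
     */
    static Map<String, Object> changedFields(Map<String, Object> stored,
                                             Map<String, Object> incoming) {
        Map<String, Object> changes = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : incoming.entrySet()) {
            Object oldValue = stored.get(field.getKey());
            // Objects.equals treats two nulls as equal and avoids NPEs.
            if (!Objects.equals(oldValue, field.getValue())) {
                changes.put(field.getKey(), field.getValue());
            }
        }
        return changes;
    }
}
```

An empty result means the record can be skipped entirely, which keeps the number of UPDATE statements proportional to the real changes rather than to the 70K incoming records.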

Merging a large table with a large text file using JPA?

We have a large table of approximately 1 million rows, and a data file with millions of rows. We need to regularly merge a subset of the data in the text file into a database table.
The main reason it is slow is that the data in the file has references to other JPA objects, meaning those other JPA objects need to be read back for each row in the file. For example, imagine we have 100,000 people and 1,000,000 asset objects:
Person object --> Asset list
Our application currently uses pure JPA for all of its data manipulation requirements. Is there an efficient way to do this using JPA/ORM methodologies, or am I going to need to revert to pure SQL and vendor-specific commands?
Why not use the age-old technique of divide and conquer? Split the file into small chunks and then have parallel processes work on these smaller files concurrently.
Also use the batch inserts/updates offered by JPA and Hibernate.
The ideal way, in my opinion, is to use the batch support provided by plain JDBC and commit at regular intervals.
You might also want to look at Spring Batch, as it provides splitting, parallelization, iterating through files, etc. out of the box. I have used all of these successfully for an application of considerable size.
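As a rough illustration of the split-and-parallelize idea using only the JDK (not Spring Batch), the sketch below cuts the lines of the file into chunks and hands each chunk to a fixed thread pool. ChunkedMerge and the mergeChunk callback are hypothetical; the callback is assumed to merge one chunk in its own transaction and return how many rows it handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.ToIntFunction;

final class ChunkedMerge {
    /** Merges the lines chunk by chunk on a thread pool; returns rows handled. */
    static int mergeInParallel(List<String> lines, int chunkSize, int threads,
                               ToIntFunction<List<String>> mergeChunk) {
        if (chunkSize <= 0 || threads <= 0) {
            throw new IllegalArgumentException("chunkSize and threads must be positive");
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int from = 0; from < lines.size(); from += chunkSize) {
                int to = Math.min(from + chunkSize, lines.size());
                List<String> chunk = new ArrayList<>(lines.subList(from, to));
                Callable<Integer> task = () -> mergeChunk.applyAsInt(chunk);
                results.add(pool.submit(task));
            }
            int total = 0;
            for (Future<Integer> result : results) {
                total += result.get(); // waits for the chunk to finish
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while merging", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A chunk failed to merge", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Parallel chunks only help if two chunks rarely touch the same Person row; otherwise the workers will contend on the same records.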
One possible answer, which is painfully slow, is to do the following:
For each line in the file:
    Read data line
    fetch reference object
    check if data is attached to reference object
    if not add data to reference object and persist
So slow it is not worth considering.

how to create a copy of a table in HBase on same cluster? or, how to serve requests using original state while operating on a working state

Is there an efficient way to create a copy of table structure+data in HBase, in the same cluster? Obviously the destination table would have a different name. What I've found so far:
The CopyTable job, which has been described as a tool for copying data between different HBase clusters. I think it would support intra-cluster operation, but have no knowledge on whether it has been designed to handle that scenario efficiently.
Use the export+import jobs. Doing that sounds like a hack but since I'm new to HBase maybe that might be a real solution?
Some of you might be asking why I'm trying to do this. My scenario is that I have millions of objects I need access to, in a "snapshot" state if you will. There is a batch process that runs daily which updates many of these objects. If any step in that batch process fails, I need to be able to "roll back" to the original state. Not only that, during the batch process I need to be able to serve requests to the original state.
Therefore the current flow is that I duplicate the original table to a working copy, continue to serve requests using the original table while I update the working copy. If the batch process completes successfully I notify all my services to use the new table, otherwise I just discard the new table.
This has worked fine using BDB but I'm in a whole new world of really large data now so I might be taking the wrong approach. If anyone has any suggestions of patterns I should be using instead, they are more than welcome. :-)
All data in HBase has a certain timestamp. You can do reads (Gets and Scans) with a parameter indicating that you want the latest version of the data as of a given timestamp. One thing you could do is serve your requests with reads that use this parameter, pointing to a time before the batch process begins. Once the batch completes, bump your read timestamp up to the current state.
A couple of things to be careful of if you take this approach:
HBase tables are configured to store the most recent N versions of a given cell. If you overwrite the data in the cell with N newer values, then you will lose the older value during the next compaction. (You can also configure them with a TTL to expire cells, but that doesn't quite sound like it matches your case.)
Similarly, if you delete the data as part of your process, then you won't be able to read it after the next compaction.
So, if you don't issue deletes as part of your batch process, and you don't write more versions of the same data that already exists in your table than you've configured it to save, you can keep serving old requests out of the same table that you're updating. This effectively gives you a snapshot.
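The read-as-of-timestamp behaviour and the version limit can be illustrated with a small in-memory model. This is not HBase client code: VersionedCell is a hypothetical class, and it drops surplus versions immediately on write, whereas HBase removes them at compaction.

```java
import java.util.Map;
import java.util.TreeMap;

final class VersionedCell {
    private final int maxVersions;
    // Versions of the cell value, keyed by write timestamp.
    private final TreeMap<Long, String> versions = new TreeMap<>();

    VersionedCell(int maxVersions) {
        if (maxVersions <= 0) {
            throw new IllegalArgumentException("maxVersions must be positive");
        }
        this.maxVersions = maxVersions;
    }

    /** Writes a new version and drops the oldest ones beyond the limit. */
    void put(long timestamp, String value) {
        versions.put(timestamp, value);
        while (versions.size() > maxVersions) {
            versions.pollFirstEntry();
        }
    }

    /** Latest value written at or before the given timestamp, or null. */
    String readAsOf(long timestamp) {
        Map.Entry<Long, String> entry = versions.floorEntry(timestamp);
        return entry == null ? null : entry.getValue();
    }
}
```

With maxVersions set to 3, a value written at time 100 is still readable as of time 150 after one more write, but disappears once three newer versions exist. That is the failure mode to watch for if the batch rewrites the same cell several times. In the HBase Java client, the read-side restriction is normally expressed as a time range on the Get or Scan; check the client documentation for the exact bounds.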
