How to prepare a large txt File to batch insert using Hibernate? - java

I'm trying to insert over 200k rows into an SQL database from a large TXT file; each row represents a card record (70+ string fields).
I'm a new Java developer and I'm having quite a hard time with this. My approach:
Read the file:
File file = ReadFile.loadCardFile(pathName);
Convert the file to a stream of lines:
Stream<String> cardsStream = new BufferedReader(new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8)).lines();
Split each line into a String array (the card fields are delimited by '|' and may or may not be padded with spaces):
cardsStream.forEach(s -> {
    String[] card = Arrays.stream(s.split("\\|")).map(String::trim).toArray(String[]::new);
Insert each line (card data):
    numberOfRows = insertCardService.setCard(card, numberOfRows);
setCard maps the row data to its columns, and then I save each card:
CardService.save(Card);
With this approach it takes up to 2 hours, which is a really, really long time.
Is there any advice for a better approach, or could you provide me with links to read so I can code this better?
Oh, by the way, I want to use batch inserts to shorten the time significantly, but I think my way of reading the file is wrong!
Thanks in advance!!

JPA is the wrong tool for this kind of operation.
While it is probably possible to make it fast with JPA, it is unnecessarily difficult.
JPA works best in a workflow where you load some entities, edit some attributes and let JPA figure out which exact updates are necessary.
For this, JPA does a lot of caching, which can cost considerable resources.
But here it seems you just want to pump a large amount of data into the database.
You don't need JPA to figure out what to do, it's all inserts.
You don't need JPA's cache.
I recommend Spring's JdbcTemplate or NamedParameterJdbcTemplate.
This alone will probably already speed things up considerably.
Once that works, consider the following:
Batch inserts, i.e. sending many rows to the database in a single statement/round trip (a sketch follows after this list). See https://mkyong.com/spring/spring-jdbctemplate-batchupdate-example/ Note that some databases need a special driver argument to properly handle batch updates.
Doing intermittent commits. In general commits cost performance, because they force the database to actually write data, but overly long transactions can cause trouble as well, especially when the database is doing other work and in case of errors/rollbacks.
If you need more control over your batches, take a look at Spring Batch.
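For illustration, here is a minimal sketch of a batched insert with NamedParameterJdbcTemplate. The table name, the three columns and the chunking are assumptions standing in for the real 70+ card fields:

// Sketch only: the card table and its columns are placeholders, not taken from the question.
import java.util.List;
import org.springframework.jdbc.core.namedparam.MapSqlParameterSource;
import org.springframework.jdbc.core.namedparam.NamedParameterJdbcTemplate;
import org.springframework.jdbc.core.namedparam.SqlParameterSource;

public class CardBatchInserter {

    private static final String INSERT_SQL =
            "INSERT INTO card (card_number, holder_name, expiry) " +
            "VALUES (:cardNumber, :holderName, :expiry)";

    private final NamedParameterJdbcTemplate jdbcTemplate;

    public CardBatchInserter(NamedParameterJdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // Inserts one chunk of parsed lines (each line already split on '|' and trimmed).
    public void insertChunk(List<String[]> chunk) {
        SqlParameterSource[] batch = chunk.stream()
                .map(fields -> new MapSqlParameterSource()
                        .addValue("cardNumber", fields[0])
                        .addValue("holderName", fields[1])
                        .addValue("expiry", fields[2]))
                .toArray(SqlParameterSource[]::new);
        // One JDBC round trip per chunk instead of one per row.
        jdbcTemplate.batchUpdate(INSERT_SQL, batch);
    }
}

Chunks of a few hundred to a few thousand rows are a common starting point; for MySQL, rewriteBatchedStatements=true on the JDBC URL is the kind of driver argument mentioned above.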

By default, Hibernate will not save data in batches. You can enable batching by setting the parameters below.
spring.jpa.properties.hibernate.jdbc.batch_size=50
spring.jpa.properties.hibernate.order_inserts=true
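These settings only pay off if the persistence context is flushed and cleared periodically while inserting. A minimal sketch, assuming a hypothetical Card entity and a chunk size that matches the configured batch_size:

// Sketch only: the Card entity stands in for the real mapping from the question.
import java.util.List;
import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;
import org.springframework.transaction.annotation.Transactional;

public class CardImporter {

    private static final int BATCH_SIZE = 50; // keep in sync with hibernate.jdbc.batch_size

    @PersistenceContext
    private EntityManager entityManager;

    @Transactional
    public void importCards(List<Card> cards) {
        for (int i = 0; i < cards.size(); i++) {
            entityManager.persist(cards.get(i));
            if ((i + 1) % BATCH_SIZE == 0) {
                entityManager.flush();  // send the batched INSERTs to the database
                entityManager.clear();  // keep the persistence context small
            }
        }
    }
}

Note that IDENTITY-generated ids disable JDBC insert batching in Hibernate, so a sequence-based generator works better here.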

Related

Efficient way of processing large CSV file using java

Let's consider a scenario
Accounts.csv
Transaction.csv
We have a mapping of each account number to transaction details, so one account number can have multiple transactions. Using these details we have to generate a PDF for each account.
If the transaction CSV file is very large (>1 GB), then loading and parsing all the details could cause memory issues. So what would be the best approach for parsing the transaction file? Reading it chunk by chunk also leads to memory consumption. Please advise.
As others have said, a database would be a good solution.
Alternatively, you could sort the two files on the account number. Most operating systems provide efficient file-sorting programs, e.g. for Linux (sorting on the 5th column):
LC_ALL=C sort -t, -k5 file.csv > sorted.csv
taken from Sorting csv file by 5th column using bash
You can then read the two files sequentially.
Your programming logic is:
if (Accounts.accountNumber < Transaction.accountNumber) {
    read Accounts file
} else if (Accounts.accountNumber == Transaction.accountNumber) {
    process transaction
    read Transaction file
} else {
    read Transaction file
}
The memory requirements will be tiny, you only need to hold one record from each file in memory.
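A minimal Java sketch of this merge-join, assuming both files are already sorted by account number and that the account number is the first comma-separated column in each file (the file names and column positions are assumptions):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class AccountTransactionMerge {

    public static void main(String[] args) throws IOException {
        try (BufferedReader accounts = Files.newBufferedReader(Paths.get("accounts_sorted.csv"));
             BufferedReader transactions = Files.newBufferedReader(Paths.get("transactions_sorted.csv"))) {

            String account = accounts.readLine();
            String transaction = transactions.readLine();

            while (account != null && transaction != null) {
                String accountNo = account.split(",")[0];
                String txAccountNo = transaction.split(",")[0];
                int cmp = accountNo.compareTo(txAccountNo);

                if (cmp < 0) {
                    account = accounts.readLine();           // no more transactions for this account
                } else if (cmp == 0) {
                    process(account, transaction);           // matching pair
                    transaction = transactions.readLine();
                } else {
                    transaction = transactions.readLine();   // transaction without a matching account
                }
            }
        }
    }

    private static void process(String accountLine, String transactionLine) {
        // Accumulate the transaction for this account's PDF here.
        System.out.println(accountLine + " <- " + transactionLine);
    }
}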
Let's say you are using Oracle as the database.
You could load the data into its corresponding tables using the Oracle SQL Loader tool.
Once the data is loaded, you could use simple SQL queries to join and query the data from the loaded tables.
This will work with all types of databases, but you will need to find the appropriate loading tool.
Of course, importing the data into a database first would be the most elegant way.
Besides that, your question leaves the impression that this isn't an option.
So I recommend you read transactions.csv line by line (for instance with a BufferedReader). Because in CSV format each line is a record, you can then (while reading) filter out and discard every record that is not for your current account.
After one file traversal you have all transactions for one account, and that should usually fit into memory.
A downside of this approach is that you end up reading the transactions multiple times, once for each account's PDF generation. But if your application needed to be highly optimized, I suspect you would already have used a database.
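A minimal sketch of that filtering pass, assuming the account number is the first comma-separated column of transactions.csv (the file name and column position are assumptions):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class TransactionFilter {

    // Streams transactions.csv and keeps only the lines belonging to one account.
    public static List<String> transactionsFor(String accountNumber) throws IOException {
        try (Stream<String> lines = Files.lines(Paths.get("transactions.csv"))) {
            return lines
                    .filter(line -> line.startsWith(accountNumber + ","))
                    .collect(Collectors.toList());
        }
    }
}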

Hibernate bulk operations migrate databases

I wrote a small executable JAR using Spring & Spring Data JPA to migrate data from a database, converting objects from the original database (spread across several tables) into valid objects for the new database, and then inserting the new objects into the new database.
The problem is: I process a large amount of data (200,000 records) and doing my inserts one by one is really time consuming (1 hour; all of it is spent on the INSERT operations, which happen after validating/transforming the incoming data, not on the retrieval from the original database or on the validation/conversion).
I have already had suggestions about it:
[Edit because I didn't explain it well] As I am doing an extract-validate-transform-insert, do my inserts (which are valid because they are verified first) X objects by X objects (instead of inserting them one by one). That is the suggestion from the first answer: I tried it, but it is not that efficient and still time consuming.
Instead of saving directly to the database, write the inserts into a .sql file and then import the file directly into the database. But how do I translate myDao.save() into the final SQL output and then write it to a file?
Use Talend: known as probably the best way, but too long to redo everything. I'd like to find a way using Java and refactor my JAR.
Other ideas?
Note: one important point is that if one validation fails I want to continue processing the other data; I only log an error.
Thanks
You should pause and think for a minute: what could cause an error when inserting your data into the database? Short of "your database is hosed", there are two possibilities:
There is a bug in your code;
The data coming in is bad.
If you have a bug in your code, you would be better off if your entire data load were reverted. You will get another chance to transfer the data after you fix your code.
If the data coming in is bad, or is suspected bad, you should add a step for validating your data. So, your process workflow might look like this: Extract --> Validate --> Transform --> Load. If the incoming data is invalid, write it into the log or load into a separate table for erroneous data.
You should keep your whole process running in the same transaction, using the same Hibernate session. Keeping all 200K records in memory would be pushing it, though. I suggest using batching (see http://docs.jboss.org/hibernate/orm/3.3/reference/en-US/html/batch.html). In two words: after a predetermined number of records, say 1000, flush and clear your Hibernate session.
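A minimal sketch of that flush/clear pattern with a plain Hibernate Session, along the lines of the linked batch-processing chapter (the NewEntity type stands in for the real migrated objects):

import java.util.List;
import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;

public class MigrationBatchWriter {

    private static final int BATCH_SIZE = 1000;

    public void write(SessionFactory sessionFactory, List<NewEntity> converted) {
        Session session = sessionFactory.openSession();
        Transaction tx = session.beginTransaction();
        try {
            for (int i = 0; i < converted.size(); i++) {
                session.save(converted.get(i));
                if ((i + 1) % BATCH_SIZE == 0) {
                    session.flush();  // execute the batched INSERTs
                    session.clear();  // detach the entities so the session stays small
                }
            }
            tx.commit();
        } catch (RuntimeException e) {
            tx.rollback();
            throw e;
        } finally {
            session.close();
        }
    }
}

This only turns into real JDBC batching if hibernate.jdbc.batch_size is configured, as described in the linked chapter.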

Processing large number of data

The question goes like this.
From one application I am getting approximately 200,000 encrypted values.
Task:
Read all encrypted values into one VO/list.
Reformat them and add headers/trailers.
Dump these records to the DB in one shot, with the header and trailer in separate, defined columns.
I don't want to use any file in between the processes.
What would be the best way to store the 200,000 records, a list or something else?
How do I dump these records into the DB in one shot? Is it better to divide them into chunks and use separate threads to work on them?
Please suggest a less time-consuming solution for this.
I am using Spring Batch for this, and this process will be one job.
Spring Batch is made for this type of operation. You will want a chunk tasklet. This type of tasklet uses a reader, an item processor, and a writer. Also, this type of tasklet uses streaming, so you will never have all items in memory at one time.
I'm not sure of the incoming format of your data, but there are existing readers for pretty much any use case, and if you can't find the one you need, you can create your own. You will then want to implement ItemProcessor to handle any modifications you need to make.
For writing, you can just use JdbcBatchItemWriter.
As for the headers/footers, I would need more details. If they are an aggregation of all the records, you will need to process them beforehand. You can put the end results into the ExecutionContext.
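For illustration, a minimal sketch of such a chunk-oriented step using the Spring Batch 4 builders, with a ListItemReader over the incoming values (so no intermediate file) and a JdbcBatchItemWriter. The Record class, the table, its columns and the header/trailer values are assumptions:

// Sketch only: table, columns and the HDR/TRL values are placeholders.
import java.util.List;
import javax.sql.DataSource;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.database.builder.JdbcBatchItemWriterBuilder;
import org.springframework.batch.item.support.ListItemReader;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class LoadJobConfig {

    private final List<String> encryptedValues; // received from the other application

    public LoadJobConfig(List<String> encryptedValues) {
        this.encryptedValues = encryptedValues;
    }

    @Bean
    public Step loadStep(StepBuilderFactory steps, DataSource dataSource) {
        return steps.get("loadStep")
                .<String, Record>chunk(1000)   // 1000 items per read-process-write cycle
                .reader(new ListItemReader<>(encryptedValues))
                .processor(processor())
                .writer(writer(dataSource))
                .build();
    }

    private ItemProcessor<String, Record> processor() {
        // Reformat each value and attach the header/trailer fields.
        return value -> new Record("HDR", value, "TRL");
    }

    private JdbcBatchItemWriter<Record> writer(DataSource dataSource) {
        return new JdbcBatchItemWriterBuilder<Record>()
                .dataSource(dataSource)
                .sql("INSERT INTO records (header, value, trailer) VALUES (:header, :value, :trailer)")
                .beanMapped()   // maps :header/:value/:trailer to the getters below
                .build();
    }

    public static class Record {
        private final String header;
        private final String value;
        private final String trailer;

        public Record(String header, String value, String trailer) {
            this.header = header;
            this.value = value;
            this.trailer = trailer;
        }

        public String getHeader()  { return header; }
        public String getValue()   { return value; }
        public String getTrailer() { return trailer; }
    }
}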
There are a couple of generic tricks to make bulk insertion go faster:
Consider using the database's native bulk insert.
Sort the records into ascending order on the primary key before you insert them.
If you are inserting into an empty table, drop the secondary indexes first and then recreate them.
Don't do it all in one database transaction.
I don't know how well these tricks translate to spring-batch ... but if they don't you could consider bypassing spring-batch and going directly to the database.

Which way would be better to import spreadsheet data?

I am trying to import data from a spreadsheet into a database using Java. There are two ways I could do this: 1) read and extract the data from the spreadsheets and organize it into data structures, such as ArrayLists, Vectors or maps of different objects, so that I can get rid of redundant entries etc., and then write the data structures into the database; or 2) extract the data and put it into the database directly as the cells are read and extracted. I think the first way is probably better, but would the second way be faster? Any other considerations I should think of?
Thanks.
You would want to do an executeBatch() here, which is similar to approach #1. So basically you read data from the spreadsheet up to a batch size (e.g. 1000 records) and then commit the transactions a batch at a time to the DB; after that you move on to the next batch, and so on. With this approach you utilize the database efficiently, save yourself network round trips, and also do not end up hoarding a lot of data in memory, which could lead to out-of-memory exceptions. You should also re-use the same connection and prepared statement objects (a sketch follows below).
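For illustration, a minimal sketch of that executeBatch() approach, assuming a hypothetical target table and that each spreadsheet row has already been read into a String array (for example via Apache POI):

// Sketch only: the table and column names are placeholders.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class SpreadsheetLoader {

    private static final int BATCH_SIZE = 1000;

    public void load(Connection connection, List<String[]> rows) throws SQLException {
        String sql = "INSERT INTO imported_rows (col_a, col_b, col_c) VALUES (?, ?, ?)";
        connection.setAutoCommit(false);
        try (PreparedStatement ps = connection.prepareStatement(sql)) {
            int count = 0;
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.setString(3, row[2]);
                ps.addBatch();
                if (++count % BATCH_SIZE == 0) {
                    ps.executeBatch();      // send 1000 inserts in one round trip
                    connection.commit();    // commit a batch at a time
                }
            }
            ps.executeBatch();              // flush the final partial batch
            connection.commit();
        }
    }
}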
Regarding the data clean-up process, you should definitely sanitize your data before putting it into persistent storage such as a table. You may need to generate reports or use the data in other applications in the future, so having clean and well-structured tables will help you in the long run. For batch applications, the performance requirements are usually not as high as for transactional systems.
You should also use a helper library like Apache POI for reading Excel documents. As far as the data structure is concerned, it will depend on your data, but generally an ArrayList should suffice here.
Another point you might consider is that typically most ETL tools offer these kinds of data-loading tasks out of the box. If your situation allows for it, I highly recommend looking at an ETL tool like Kettle to load the data. You may be able to save yourself some time and learn a new tool.
Hope this helps!
You can consider using an ETL tool (Extraction, Transformation and Loading) for the kind of task you are referring to.

Merging a large table with a large text file using JPA?

We have a large table of approximately 1 million rows, and a data file with millions of rows. We need to regularly merge a subset of the data in the text file into a database table.
The main reason it is slow is that the data in the file has references to other JPA objects, meaning the other JPA objects need to be read back for each row in the file. I.e. imagine we have 100,000 people and 1,000,000 asset objects:
Person object --> Asset list
Our application currently uses pure JPA for all of its data manipulation requirements. Is there an efficient way to do this using JPA/ORM methodologies, or am I going to need to revert to pure SQL and vendor-specific commands?
Why not use the age-old technique of divide and conquer? Split the file into small chunks and then have parallel processes work on these small files concurrently.
And use the batch inserts/updates offered by JPA and Hibernate. More details here.
The ideal way in my opinion, though, is to use the batch support provided by plain JDBC and then commit at regular intervals.
You might also want to look at Spring Batch, as it provides splitting/parallelization/iterating through files etc. out of the box. I have used all of these successfully for an application of considerable size.
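A minimal sketch of the divide-and-conquer idea combined with plain JDBC batching, assuming the file has already been split into chunks of parsed lines and that rows can be inserted independently (the asset table, its columns and the thread count are all placeholders):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import javax.sql.DataSource;

public class ParallelChunkLoader {

    private static final int THREADS = 4;

    public void load(DataSource dataSource, List<List<String[]>> chunks) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        for (List<String[]> chunk : chunks) {
            pool.submit(() -> insertChunk(dataSource, chunk));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    private void insertChunk(DataSource dataSource, List<String[]> chunk) {
        String sql = "INSERT INTO asset (person_id, name) VALUES (?, ?)";
        // Each worker uses its own connection and commits once per chunk.
        try (Connection con = dataSource.getConnection();
             PreparedStatement ps = con.prepareStatement(sql)) {
            con.setAutoCommit(false);
            for (String[] row : chunk) {
                ps.setLong(1, Long.parseLong(row[0]));
                ps.setString(2, row[1]);
                ps.addBatch();
            }
            ps.executeBatch();
            con.commit();
        } catch (Exception e) {
            // In a real migration you would log the failed chunk and decide whether to retry it.
            throw new RuntimeException(e);
        }
    }
}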
One possible answer, which is painfully slow, is to do the following.
For each line in the file:
    read the data line
    fetch the reference object
    check whether the data is already attached to the reference object
    if not, add the data to the reference object and persist
So slow it is not worth considering.
