My question goes like this.
From one application I am getting approximately 200,000 encrypted values.
The task:
Read all encrypted values into one VO/list.
Reformat them and add headers/trailers.
Dump these records to the DB in one shot, with the header and trailer in separate, defined columns.
I don't want to use any intermediate file between the processes.
What would be the best way to hold the 200,000 records, a list or something else?
How do I dump these records into the DB in one shot? Is it better to divide them into chunks and use separate threads to work on them?
Please suggest a less time-consuming solution for this.
I am using Spring Batch for this, and the process will be one job.
Spring Batch is made for this type of operation. You will want a chunk-oriented tasklet. This type of tasklet uses a reader, an item processor, and a writer. It also uses streaming, so you will never have all items in memory at one time.
I'm not sure of the incoming format of your data, but there are existing readers for pretty much any use-case. And if you can't find the type you need, you can create your own. You will then want to implement ItemProcessor to handle any modifications you need to do.
For writing, you can just use JdbcBatchItemWriter.
As for these headers/footers, I would need more details on this. If they are an aggregation of all the records, you will need to process them beforehand. You can put the end results into the ExecutionContext.
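As a rough illustration, a chunk-oriented step wired up with a JdbcBatchItemWriter could look like the sketch below. The EncryptedValue and RecordRow types, the reader/processor beans, and the table/column names are placeholders for whatever your job actually uses; RecordRow is assumed to be a simple bean with header/payload/trailer properties.

import javax.sql.DataSource;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.database.BeanPropertyItemSqlParameterSourceProvider;
import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class LoadJobConfig {

    @Bean
    public JdbcBatchItemWriter<RecordRow> recordWriter(DataSource dataSource) {
        JdbcBatchItemWriter<RecordRow> writer = new JdbcBatchItemWriter<RecordRow>();
        writer.setDataSource(dataSource);
        // header and trailer land in their own columns, as the question asks
        writer.setSql("INSERT INTO records (header, payload, trailer) VALUES (:header, :payload, :trailer)");
        writer.setItemSqlParameterSourceProvider(new BeanPropertyItemSqlParameterSourceProvider<RecordRow>());
        return writer;
    }

    @Bean
    public Step loadStep(StepBuilderFactory steps,
                         ItemReader<EncryptedValue> reader,
                         ItemProcessor<EncryptedValue, RecordRow> processor,
                         JdbcBatchItemWriter<RecordRow> writer) {
        return steps.get("loadStep")
                .<EncryptedValue, RecordRow>chunk(1000) // stream in chunks; never all 200,000 items in memory
                .reader(reader)
                .processor(processor)
                .writer(writer)
                .build();
    }
}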
There are a couple of generic tricks to make bulk insertion go faster:
Consider using the database's native bulk insert.
Sort the records into ascending order on the primary key before you insert them.
If you are inserting into an empty table, drop the secondary indexes first and then recreate them.
Don't do it all in one database transaction.
I don't know how well these tricks translate to spring-batch ... but if they don't you could consider bypassing spring-batch and going directly to the database.
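If you do go straight to JDBC, a rough sketch of the batching and intermittent-commit ideas might look like this. The Record type, table, and columns are made up for illustration, and the input is assumed to be already sorted by primary key:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;
import javax.sql.DataSource;

public class BulkInserter {

    public void insert(DataSource dataSource, List<Record> sortedRecords) throws Exception {
        try (Connection conn = dataSource.getConnection();
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO records (id, payload) VALUES (?, ?)")) {
            conn.setAutoCommit(false);
            int count = 0;
            for (Record r : sortedRecords) {
                ps.setLong(1, r.getId());
                ps.setString(2, r.getPayload());
                ps.addBatch();
                if (++count % 1000 == 0) {  // send a batch and commit every 1000 rows,
                    ps.executeBatch();      // rather than one giant transaction
                    conn.commit();
                }
            }
            ps.executeBatch();              // flush the remainder
            conn.commit();
        }
    }
}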
Related
I'm trying to insert over 200k rows into a SQL database; each row represents a card's info (70+ string fields) read from a large TXT file.
I'm a new Java developer and I'm having quite a hard time with this. My approach:
Read the file:
File file = ReadFile.loadCardFile(pathName);
Convert the file to a stream:
Stream<String> cardsStream = new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF-8")).lines();
Get each line into a string array (the card info is split by '|', and the fields may or may not be padded with spaces):
cardsStream.forEach(s -> {
String[] card = Arrays.stream(s.split("\\|")).map(String::trim).toArray(String[]::new);
Insert each line (card data):
numberOfRows = insertCardService.setCard(card, numberOfRows);
setCard maps the row data to its columns, then I save each card:
cardService.save(card);
With this approach it takes up to 2 hours, which is a really long time.
Is there any advice for a better approach, or could you provide me with links to read so I can code it better?
Oh, by the way, I want to use batch inserts to shorten the time significantly, but I think my way of reading the file is wrong!
Thanks in advance!!
JPA is the wrong tool for this kind of operation.
While it is probably possible to make this fast with JPA, it is unnecessarily difficult to do.
JPA works best in a workflow where you load some entities, edit some attributes and let JPA figure out which exact updates are necessary.
For this JPA does a lot of caching which might cost considerable resources.
But here it seems you just want to pump a large amount of data into the database.
You don't need JPA to figure out what to do; it's all inserts.
You don't need JPA's cache.
I recommend Spring's JdbcTemplate or NamedParameterJdbcTemplate.
This alone probably speeds things up considerably.
Once that works consider the following:
Batch inserts, i.e. sending many rows to the database in a single statement/round trip. See https://mkyong.com/spring/spring-jdbctemplate-batchupdate-example/ Note that some databases need a special driver argument to properly handle batch updates; there is a sketch after this list.
Doing intermittent commits. In general commits cost performance, because they force the database to actually write data. But overly long transactions can cause trouble as well, especially when the database is also doing other work, and in case of errors/rollbacks.
If you need more control over your batches, take a look at Spring Batch.
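For reference, a minimal sketch of a batch insert with NamedParameterJdbcTemplate. The Card POJO, table, and columns are assumptions; the named parameters must match the POJO's getters:

import java.util.List;
import org.springframework.jdbc.core.namedparam.NamedParameterJdbcTemplate;
import org.springframework.jdbc.core.namedparam.SqlParameterSource;
import org.springframework.jdbc.core.namedparam.SqlParameterSourceUtils;

public class CardBatchDao {

    private final NamedParameterJdbcTemplate jdbcTemplate;

    public CardBatchDao(NamedParameterJdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    public void insertAll(List<Card> cards) {
        String sql = "INSERT INTO card (card_number, holder_name) VALUES (:cardNumber, :holderName)";
        // build one parameter source per Card bean
        SqlParameterSource[] batch = SqlParameterSourceUtils.createBatch(cards.toArray());
        // one JDBC batch for the whole list instead of one statement per row
        jdbcTemplate.batchUpdate(sql, batch);
    }
}

For MySQL, for example, the Connector/J property rewriteBatchedStatements=true is the kind of "special driver argument" mentioned above.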
By default Hibernate will not save data in batches. You can enable that by setting the parameters below.
spring.jpa.properties.hibernate.jdbc.batch_size=50
spring.jpa.properties.hibernate.order_inserts=true
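With those settings in place, it also helps to flush and clear the persistence context every batch_size entities, so Hibernate actually sends JDBC batches and the session does not grow without bound. A sketch, assuming a hypothetical Card entity and an injected EntityManager:

import java.util.List;
import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;
import org.springframework.transaction.annotation.Transactional;

public class CardBatchSaver {

    @PersistenceContext
    private EntityManager em;

    private static final int BATCH_SIZE = 50; // keep in sync with hibernate.jdbc.batch_size

    @Transactional
    public void saveAll(List<Card> cards) {
        for (int i = 0; i < cards.size(); i++) {
            em.persist(cards.get(i));
            if ((i + 1) % BATCH_SIZE == 0) {
                em.flush();  // push the current batch of inserts to the database
                em.clear();  // detach managed entities to keep the first-level cache small
            }
        }
    }
}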
I'm working on a Spring Batch job that creates strings which become SQL insert, delete, or update statements. It reads in a flat file where the first three characters of each line are either ADD, CHG, or DEL.
Example:
ADD123456001SOUTHLAND PAPER INCORPORATED ... //more info
CHG123456002GUERNSEY BIG DEAL FAIRFAX ...//more info
DEL123456002GUERNSEY BIG DEAL FAIRFAX ...//more info
From the above lines my ItemReader will generate three strings: insert into ..., update ..., and delete .... The reader reads in the entire flat file, returns an ArrayList of these strings to my writer, and my writer takes these strings and writes them to my database.
Here's my problem. What happens if there's a chg requested before an add is requested? What if I try changing something that's already deleted?
I read up on ItemProcessor in the Spring docs, and the description of the filtering process is exactly what I'm trying to do:
For example, consider a batch job that reads a file containing three
different types of records: records to insert, records to update, and
records to delete. If record deletion is not supported by the system,
then we would not want to send any "delete" records to the ItemWriter.
But, since these records are not actually bad records, we would want
to filter them out, rather than skip. As a result, the ItemWriter
would receive only "insert" and "update" records.
But the examples of ItemProcessor listed on the docs don't really make sense to me. Can someone make sense of the process to me? Or show me some examples of good ItemProcessing?
Edit: the 6 characters following the command are the id associated in the SQL database.
In the case described in the question you're not filtering out records, you just want to change the order they come through in. You'd be better off here sorting the file in an earlier step (to do your inserts first, then your updates, then your deletes). ItemProcessor is more for filtering out the occasional bad or irrelevant input line.
You could use the ItemProcessor to validate that the row updated or deleted exists, or that the row to be added isn't already present. Here I would wonder if the amount of querying you'd have to do in the ItemProcessor (one query per row in the input file) wouldn't be a lot of overhead for a condition that might only happen occasionally. Your choice would be between
using the ItemProcessor to filter (doing a query up front for each row, as in the sketch after this list), or
not doing any up-front queries, but having the ItemWriter skip these rows instead if referential integrity is violated (rolling back the chunk and retrying one line at a time); see Spring Batch skip exception for ItemWriter.
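If you go with the first option, filtering in Spring Batch just means returning null from the processor; the item then never reaches the ItemWriter. A sketch, assuming a hypothetical RecordLine item (with the command prefix and id already parsed out) and a hypothetical RecordDao for the existence check:

import org.springframework.batch.item.ItemProcessor;

public class ExistingRowFilterProcessor implements ItemProcessor<RecordLine, RecordLine> {

    private final RecordDao recordDao; // hypothetical lookup DAO

    public ExistingRowFilterProcessor(RecordDao recordDao) {
        this.recordDao = recordDao;
    }

    @Override
    public RecordLine process(RecordLine line) {
        boolean exists = recordDao.exists(line.getId());
        if ("CHG".equals(line.getCommand()) && !exists) {
            return null; // filter: cannot change a row that is not there
        }
        if ("DEL".equals(line.getCommand()) && !exists) {
            return null; // filter: nothing to delete
        }
        if ("ADD".equals(line.getCommand()) && exists) {
            return null; // filter: row already present
        }
        return line;     // pass everything else through unchanged
    }
}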
I’m struggling with how to design a Spring Batch job. The overall goal is to retrieve ~20 million records and save them to a SQL database.
I’m doing it in two parts. First I retrieve the 20 million ids of the records I want to retrieve and save those to a file (or DB). This is a relatively fast operation. Second, I loop through my file of Ids, taking batches of 2,000, and retrieve their related records from an external service. I then repeat this, 2,000 Ids at a time, until I’ve retrieved all of the records. For each batch of 2,000 records I retrieve, I save them to a database.
Some may be asking why I’m doing this in two steps. I eventually plan to make the second step run in parallel so that I can retrieve batches of 2,000 records in parallel and hopefully greatly speed up the download. Having the Ids allows me to partition the job into batches. For now, let’s not worry about parallelism and just focus on how to design a simpler sequential job.
Imagine I already have solved the first problem of saving all of the Ids locally. They are in a file, one Id per line. How do I design the steps for the second part?
Here’s what I’m thinking…
Read 2,000 Ids using a flat file reader. I'll need an aggregator since I only want to do one query to my external service for each batch of 2,000 Ids. This is where I'm struggling. Do I nest a series of readers? Or can I do 'reading' in the processor or writer?
Essentially, my problem is that I want to read lines from a file, aggregate those lines, and then immediately do another ‘read’ to retrieve the respective records. I almost want to chain readers together.
Finally, once I’ve retrieved the records from the external service, I’ll have a List of records. Which means when they arrive at the Writer, I’ll have a list of lists. I want a list of objects so that I can use the JdbcItemWriter out of the box.
Thoughts? Hopefully that makes sense.
Andrew
This is a matter of design and is subjective, but based on the Spring Batch example I found (from SpringSource) and my personal experience, the pattern of doing additional reading in the processor step is a good solution to this problem. You can also chain together multiple processors/readers in the 'processor' step. So, while the names don't exactly match, I find myself doing more and more 'reading' in my processors.
http://docs.spring.io/spring-batch/trunk/reference/html/patterns.html#drivingQueryBasedItemReaders
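In that style, the reader emits only the driving keys (the Ids) and the processor does the lookup. A minimal sketch, assuming a hypothetical ExternalRecordService and Record type:

import org.springframework.batch.item.ItemProcessor;

public class RecordLookupProcessor implements ItemProcessor<Long, Record> {

    private final ExternalRecordService service; // hypothetical remote client

    public RecordLookupProcessor(ExternalRecordService service) {
        this.service = service;
    }

    @Override
    public Record process(Long id) {
        return service.fetchById(id); // one lookup per driving id
    }
}

Note that this does one service call per id; the next answer shows how to batch the call at chunk level instead.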
Given that you want to call your external service just once per chunk of 2,000 records, you'll actually want to do this service call in an ItemWriter. That is the standard recommended way to do chunk-level processing.
You can create a custom ItemWriter<Long> implementation. It will receive the list of 2,000 IDs as input and call the external service. The result from the external service should allow you to create a List<Item>. Your writer can then simply forward this List<Item> to your JdbcBatchItemWriter<Item> delegate.
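A possible shape for that writer (ExternalRecordService, Item, and fetchByIds are assumptions; the delegate is a normally configured JdbcBatchItemWriter<Item>):

import java.util.ArrayList;
import java.util.List;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.database.JdbcBatchItemWriter;

public class IdResolvingItemWriter implements ItemWriter<Long> {

    private final ExternalRecordService service;       // hypothetical remote client
    private final JdbcBatchItemWriter<Item> delegate;   // writes the resolved items to the DB

    public IdResolvingItemWriter(ExternalRecordService service,
                                 JdbcBatchItemWriter<Item> delegate) {
        this.service = service;
        this.delegate = delegate;
    }

    @Override
    public void write(List<? extends Long> ids) throws Exception {
        List<Long> idList = new ArrayList<Long>(ids);
        List<Item> items = service.fetchByIds(idList);  // one external call per chunk of 2,000 ids
        delegate.write(items);                          // bulk insert via the JDBC writer
    }
}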
I have an HBase table with a couple of million records. Each record has a couple of properties describing it, each stored in a column qualifier (mostly int or string values).
I have a requirement that I should be able to see the records paginated and sorted by a column qualifier (or even more than one, in the future). What would be the best approach to do this? I have looked into secondary indexes using coprocessors (mostly hindex from Huawei), but it doesn't seem to match my use case exactly. I've also thought about replicating all the data into multiple tables, one for each sort property, which would be included in the rowkey, and then redirecting queries to those tables. But this seems very tedious, as I already have quite a few such properties.
Thanks for any suggestions.
It sounds like you need your NoSQL database to work just like an RDBMS, and given the size of your data your life would be a lot simpler if you stuck with an RDBMS, unless you expect exponential growth :) Also, you don't mention whether your data gets updated, which is very important for making a good decision.
Having said that, you have a lot of options, here are some:
If you can wait for the results: write a MapReduce task to do the scan, sort it, and retrieve the top X rows. Do you really need more than 1,000 pages (20-50k rows) for each sort type? Another option would be using something like Hive.
If you can aggregate the data and "reduce" the dataset: write a MapReduce task to periodically export the newest aggregated data to a SQL table (which will handle the queries). I've done this a few times too, and it works like a charm, but it depends on your requirements.
If you have plenty of storage: write a MapReduce task to periodically regenerate (or append data to) a new table for each property (sorted by it in the rowkey). You don't need multiple tables; just use a prefix in your rowkeys for each case. Or, if you do not want tables and you won't have a lot of queries, simply write the sorted data to CSV files and store them in HDFS; they can easily be read by your frontend app.
Manually maintain a secondary index: this would not be very tolerant of schema updates and new properties, but would work great for near-real-time results. To do it, you have to update your code to also write to the secondary table, with a good buffer to help with performance while avoiding hot regions. Think about rowkeys of this form: [4B SORT FIELD ID (4 chars)][8B SORT FIELD VALUE][8B timestamp], with just one column storing the rowkey of the main table. To retrieve the data sorted by any of the fields, just perform a SCAN using the SORT FIELD ID as the start row, plus the starting sort field value as the pivot for pagination (omit it to get the first page, then set it to the last value retrieved). That way you'll have the rowkeys of the main table, and you can just perform a multiget against it to retrieve the full data; there is a sketch of this after the last option below. Keep in mind that you'll need a small script to scan the main table and write data to the index table for the existing rows.
Rely on automatic secondary indexing through coprocessors, like you mentioned, although I do not like this option at all.
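For the manual-index option, reading one page could look roughly like the sketch below (older HTable client API; the table names, the "d:key" column, and the paging scheme are made up for illustration):

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class IndexPager {

    public Result[] readPage(Configuration conf, byte[] startRow, int pageSize) throws Exception {
        HTable index = new HTable(conf, "records_idx");
        HTable main  = new HTable(conf, "records");
        try {
            Scan scan = new Scan(startRow);            // resume from the last index rowkey of the previous page
            scan.setCaching(pageSize);
            List<Get> gets = new ArrayList<Get>();
            ResultScanner scanner = index.getScanner(scan);
            for (Result r : scanner) {
                // the index row stores the main-table rowkey in a single column
                byte[] mainKey = r.getValue(Bytes.toBytes("d"), Bytes.toBytes("key"));
                gets.add(new Get(mainKey));
                if (gets.size() >= pageSize) {
                    break;
                }
            }
            scanner.close();
            return main.get(gets);                     // multiget against the main table for the full rows
        } finally {
            index.close();
            main.close();
        }
    }
}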
You have mostly enumerated the options. HBase natively does not support secondary indexes, as you are aware. In addition to hindex you may consider Phoenix
https://github.com/forcedotcom/phoenix
(from Salesforce), which in addition to secondary indexes has a JDBC driver and SQL support.
Is there an efficient way to create a copy of table structure+data in HBase, in the same cluster? Obviously the destination table would have a different name. What I've found so far:
The CopyTable job, which has been described as a tool for copying data between different HBase clusters. I think it would support intra-cluster operation, but I have no knowledge of whether it has been designed to handle that scenario efficiently.
Use the export+import jobs. Doing that sounds like a hack, but since I'm new to HBase maybe it is a real solution?
Some of you might be asking why I'm trying to do this. My scenario is that I have millions of objects I need access to, in a "snapshot" state if you will. There is a batch process that runs daily which updates many of these objects. If any step in that batch process fails, I need to be able to "roll back" to the original state. Not only that, during the batch process I need to be able to serve requests to the original state.
Therefore the current flow is that I duplicate the original table to a working copy, continue to serve requests using the original table while I update the working copy. If the batch process completes successfully I notify all my services to use the new table, otherwise I just discard the new table.
This has worked fine using BDB but I'm in a whole new world of really large data now so I might be taking the wrong approach. If anyone has any suggestions of patterns I should be using instead, they are more than welcome. :-)
All data in HBase has a certain timestamp. You can do reads (Gets and Scans) with a parameter indicating that you want the latest version of the data as of a given timestamp. One thing you could do is serve your requests using this parameter, pointing to a time before the batch process begins. Once the batch completes, bump your read timestamp up to the current state.
A couple things to be careful of, if you take this approach:
HBase tables are configured to store the most recent N versions of a given cell. If you overwrite the data in the cell with N newer values, then you will lose the older value during the next compaction. (You can also configure them with a TTL to expire cells, but that doesn't quite sound like it matches your case.)
Similarly, if you delete the data as part of your process, then you won't be able to read it after the next compaction.
So, if you don't issue deletes as part of your batch process, and you don't write more versions of the same data that already exists in your table than you've configured it to save, you can keep serving old requests out of the same table that you're updating. This effectively gives you a snapshot.
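To make that concrete, a small sketch of a snapshot read by time range (the row key and table are just examples; note that setTimeRange's upper bound is exclusive, so snapshotTs + 1 includes cells written exactly at snapshotTs):

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;

public class SnapshotReads {

    public Result readAsOf(HTable table, byte[] rowKey, long snapshotTs) throws Exception {
        Get get = new Get(rowKey);
        get.setTimeRange(0L, snapshotTs + 1); // only versions written at or before the snapshot
        get.setMaxVersions(1);                // newest version within that range
        return table.get(get);
    }
}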