Reading a huge data set from a database through Java using a multi-threaded program

In my project I am generating a report. This involves a huge data transfer from the DB.
The logic is: the user gives certain criteria, based on which we first fetch the parent items from the database. There may be 100,000 parent items. After getting these items we also gather the child items of each parent, along with their details. All of this parent and child information is put together into one response XML.
This is fine for small record sets, but for huge ones it takes too long. We are using a third-party tool as the back-end system which stores the records; it has its own query set, so query optimization did not work. Everything has to be done in Java.
Can anyone give some ideas on how to optimize this?

Not really a true answer, but too long for a comment
You must benchmark the different steps:
database - time a select extracting all the records (parents + children) directly on the database (assuming it is a simple database)
network - time a transfer of the approximate size of the whole record set
processing - store the result in a local file and time the processing reading from that local file (you must also time a copy of the file, to know the time spent reading from disk)
Multithreading will only help if the bottleneck is processing.
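If it helps, a rough way to take the fetch and processing timings from the Java side could look like the sketch below; the connection string, query and buildResponseXml(...) are placeholders, not code from the question.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class StepTimings {
    public static void main(String[] args) throws Exception {
        List<String[]> rows = new ArrayList<>();

        // 1. database + network: time how long it takes just to pull the rows across
        long t0 = System.nanoTime();
        try (Connection con = DriverManager.getConnection("jdbc:...", "user", "pass");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT ...")) {
            while (rs.next()) {
                rows.add(new String[] { rs.getString(1), rs.getString(2) });
            }
        }
        System.out.println("fetch (ms): " + (System.nanoTime() - t0) / 1_000_000);

        // 2. processing: time building the response XML from the already-fetched rows
        long t1 = System.nanoTime();
        buildResponseXml(rows);   // hypothetical stand-in for the real XML assembly
        System.out.println("processing (ms): " + (System.nanoTime() - t1) / 1_000_000);
    }

    private static void buildResponseXml(List<String[]> rows) { /* placeholder */ }
}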

Related

How to avoid OutOfMemory issues in particular scenario

We have a background job which fetches records from a particular table in batches of 1000 records at a time.
This job runs at an interval of 30 minutes.
Now, these records have an email (key) and a reason (value).
The issue is that we have to look these records up against a data warehouse (a kind of filtering - it fetches the last 180 days of data from the warehouse),
and a call to the data warehouse is very costly in terms of time (approximately 45 minutes).
So, the existing scenario is like this:
We check for a flat file on disk. If it does not exist, we make a call to the data warehouse, fetch the records (the size ranges from 0.2 million to 0.25 million)
and write these records to disk.
On subsequent calls we do the lookup from the flat file only
- loading the entire file into memory and doing the search and filtering in memory.
This caused OutOfMemory issues many times.
So, we modified the logic like this:
we split the data warehouse call into 4 equal intervals and stored each result in a file again with ObjectOutputStream.
Now, on subsequent calls we load the data into memory one interval at a time, i.e. 0-30 days, 31-60 days and so on.
But this is not helping either. Can experts please suggest what would be the ideal way to tackle this issue? Someone on the senior team suggested using CouchDB for storing and querying the data, but at first glance I would prefer a good solution that works with the existing infrastructure; if there is none, we can think about using other tools.
The filtering code as of now:
private void filterRecords(List<Record> filterRecords) {
    long start = System.currentTimeMillis();
    logger.error("In filter Records - start (ms) : " + start);
    Map<String, String> invalidDataSet = new HashMap<String, String>(5000);
    try {
        boolean isValidTempResultFile = isValidTempResultFile();
        // 1. fetch invalid leads data from DWHS only if the existing file is more than 7 days old.
        String[] intervals = {"0", "45", "90", "135"};
        if (!isValidTempResultFile) {
            logger.error("#CCBLDM isValidTempResultFile false");
            for (String interval : intervals) {
                invalidDataSet.clear();
                // This call takes approx. 45 minutes to fetch the data
                getInvalidLeadsFromDWHS(invalidDataSet, interval, start);
                filterDataSet(invalidDataSet, filterRecords);
            }
        } else {
            // Set data from the temporary file one interval at a time to avoid heap space issues
            logger.error("#CCBLDM isValidTempResultFile true");
            intervals = new String[]{"1", "2", "3", "4"};
            for (String interval : intervals) {
                // GC won't necessarily run here immediately, which causes the OOM issue.
                invalidDataSet.clear();
                // Keeps 45 days of data in memory at a time
                setInvalidDataSetFromTempFile(invalidDataSet, interval);
                // 2. mark the current record as incorrect if it belongs to the invalid email list
                filterDataSet(invalidDataSet, filterRecords);
            }
        }
    } catch (Exception filterExc) {
        Scheduler.log.error("Exception occurred while Filtering Records", filterExc);
    } finally {
        invalidDataSet.clear();
        invalidDataSet = null;
    }
    long end = System.currentTimeMillis();
    logger.error("Total time taken to filter all records ::: [" + (end - start) + "] ms.");
}
I'd strongly suggest slightly changing your infrastructure. IIUC you're looking something up in a file and something in a map. Working with a file is a pain, and loading everything into memory causes OOMEs. You can do better using a memory-mapped file, which allows fast and simple access.
There's Chronicle-Map, which offers a Map interface to data stored off-heap. Actually, the data reside on disk and take main memory only as needed. You need to make your keys and values Serializable (which AFAIK you did already) or use an alternative way (which may be faster).
It's no database, it's just a ConcurrentMap, which makes working with it very simple. The whole installation is just adding a line like compile "net.openhft:chronicle-map:3.14.5" to build.gradle (or a few Maven lines).
There are alternatives, but Chronicle-Map is what I've tried (I've just started with it, but so far everything works perfectly).
There's also Chronicle-Queue, in case you need batch processing, but I strongly doubt you'll need it as you're limited by your disk rather than main memory.
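For illustration, a minimal sketch of such a persisted Chronicle-Map for the email -> reason pairs; the entry count, sample key/value and file name are assumptions, not taken from your code.

import java.io.File;
import net.openhft.chronicle.map.ChronicleMap;

public class InvalidLeadStore {
    public static void main(String[] args) throws Exception {
        ChronicleMap<String, String> invalidLeads = ChronicleMap
                .of(String.class, String.class)
                .name("invalid-leads")
                .entries(250_000)                      // upper bound from the question (0.2-0.25 million)
                .averageKey("someone@example.com")     // sample sizes used for key/value serialization
                .averageValue("bounced: mailbox full")
                .createPersistedTo(new File("invalid-leads.dat"));

        // Populate once after the expensive DWHS call ...
        invalidLeads.put("someone@example.com", "bounced: mailbox full");

        // ... then on subsequent runs just look up; data is paged in from disk as needed.
        String reason = invalidLeads.get("someone@example.com");
        System.out.println(reason);

        invalidLeads.close();
    }
}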
This is a typical use case for batch-processing code. You could introduce a new column, say 'isProcessed', with a 'false' value. Read, say, rows 1-100 (where id >= 1 && id <= 100), process them and mark the flag as true. Then read the next 100, and so on. At the end of the job, reset all flags to false. But over time it might become difficult to maintain and develop features on such a custom framework, and there are open-source alternatives.
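A rough JDBC sketch of that flag-based loop; the table and column names are invented for illustration only.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class FlagBasedBatchJob {

    private static final int WINDOW = 100;

    public void run(Connection con, long maxId) throws Exception {
        for (long from = 1; from <= maxId; from += WINDOW) {
            long to = from + WINDOW - 1;
            // read one id window of unprocessed rows
            try (PreparedStatement ps = con.prepareStatement(
                    "SELECT id, payload FROM records WHERE id BETWEEN ? AND ? AND isProcessed = false")) {
                ps.setLong(1, from);
                ps.setLong(2, to);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        process(rs.getString("payload"));   // hypothetical per-row work
                    }
                }
            }
            // mark the window as processed before moving on
            try (PreparedStatement mark = con.prepareStatement(
                    "UPDATE records SET isProcessed = true WHERE id BETWEEN ? AND ?")) {
                mark.setLong(1, from);
                mark.setLong(2, to);
                mark.executeUpdate();
            }
        }
        // At the end of the job the flags can be reset in one statement:
        // UPDATE records SET isProcessed = false
    }

    private void process(String payload) { /* placeholder */ }
}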
Spring batch is a very good framework for these use cases and could be considered: https://docs.spring.io/spring-batch/trunk/reference/html/spring-batch-intro.html.
There is also Easy batch: https://github.com/j-easy/easy-batch which doesn't require 'Spring framework' knowledge
But if it is a huge data set that will keep on growing in the future, consider moving to a 'big data' tech stack.

Maintaining preprocessed data from a large, continuous data feed in MySQL

I'm currently working on an analytics tool that every night (with a Java program) parses huge event logs (approx. 1 GB each) into a MySQL database - there are about 40 attributes for each event. The event logs are parsed "raw" into the database.
The user of the application needs to see different graphs and charts based on complicated calculations on the log data. So that the user doesn't have to wait several minutes for a chart request to be fulfilled, we need to store the preprocessed data somehow, ready to display (the user is able to filter by dates, units etc., but the largest part of the calculations can be done beforehand). My question is about how to maintain such preprocessed data - currently all calculations are expressed in SQL, as we assume that is the most efficient way (is this a correct assumption?). We need to be able to easily add new calculations for new charts, customer-specific wishes etc.
Some kind of materialized view comes to mind, but MySQL doesn't seem to support this feature. Similarly, we could execute the SQL calculations each night after the event logs have been imported, but then each calculation/preprocessed data table needs to know which events it has already processed and which it hasn't. The table will contain up to a year's worth of data (i.e. events), so simply truncating the table and doing all the calculations again doesn't seem to be the solution. Using triggers doesn't seem right either, as some calculations need to consider, for example, the time difference between two specific kinds of events.
I'm having a hard time weighing the pros and cons of possible solutions.
"Materialized Views" are not directly supported by MySQL. "Summary Tables" is another name for them in this context. Yes, that is the technique to use. You must create and maintain the summary table(s) yourself. They would be updated either as you insert data into the 'Fact' table, or periodically through a cron job, or simply after uploading the nightly dump.
The details of such are far more than can be laid out in this forum, and the specific techniques that would work best for you involve many questions. I have covered much of it in three blogs: DW, Summary Tables, and High speed ingestion. If you have further, more specific, questions, please open a new Question and I will dig into more details as needed.
I have done such in several projects; usually the performance is 10x better than reading the Fact table; in one extreme case, it was 1000x. I always end up with UI-friendly "reports" coming from the Summary Table(s).
In some situations, you are actually better off building the Summary Tables and not saving the Fact rows in a table. Optionally, you could simply keep the source file in case of a need to reprocess it. Not building the Fact table will get the summary info to the end-user even faster.
If you are gathering data for a year, and then purging the 'old' data, see my blog on partitioning. I often use that on the Fact table, but rarely feel the need on a Summary Table, since the Summary table is much smaller (that is, not filling up disk).
One use case had a 1 GB dump every hour. A Perl script moved the data to a Fact table, plus augmented 7 Summary Tables, in less than 10 minutes. The system was also replicated, which added some extra challenges. So I can safely say that 1 GB a day is not a problem.

Spring Batch Job Design -Multiple Readers

I’m struggling with how to design a Spring Batch job. The overall goal is to retrieve ~20 million records and save them to a sql database.
I’m doing it in two parts. First I retrieve the 20 million ids of the records I want to retrieve and save those to a file (or DB). This is a relatively fast operation. Second, I loop through my file of Ids, taking batches of 2,000, and retrieve their related records from an external service. I then repeat this, 2,000 Ids at a time, until I’ve retrieved all of the records. For each batch of 2,000 records I retrieve, I save them to a database.
Some may be asking why I'm doing this in two steps. I eventually plan to make the second step run in parallel so that I can retrieve batches of 2,000 records in parallel and hopefully greatly speed up the download. Having the Ids allows me to partition the job into batches. For now, let's not worry about parallelism and just focus on how to design a simpler sequential job.
Imagine I already have solved the first problem of saving all of the Ids locally. They are in a file, one Id per line. How do I design the steps for the second part?
Here’s what I’m thinking…
Read 2,000 Ids using a flat file reader. I'll need an aggregator since I only want to do one query to my external service for each batch of 2,000 Ids. This is where I'm struggling. Do I nest a series of readers? Or can I do 'reading' in the processor or writer?
Essentially, my problem is that I want to read lines from a file, aggregate those lines, and then immediately do another ‘read’ to retrieve the respective records. I almost want to chain readers together.
Finally, once I’ve retrieved the records from the external service, I’ll have a List of records. Which means when they arrive at the Writer, I’ll have a list of lists. I want a list of objects so that I can use the JdbcItemWriter out of the box.
Thoughts? Hopefully that makes sense.
Andrew
This is a matter of design and is subjective, but based on the Spring Batch examples I found (from SpringSource) and my personal experience, the pattern of doing additional reading in the processor step is a good solution to this problem. You can also chain together multiple processors/readers in the 'processor' step. So, while the names don't exactly match, I find myself doing more and more 'reading' in my processors.
http://docs.spring.io/spring-batch/trunk/reference/html/patterns.html#drivingQueryBasedItemReaders
Given that you want to call your external service just once per chunk of 2,000 records, you'll actually want to do this service call in an ItemWriter. That is the standard recommended way to do chunk-level processing.
You can create a custom ItemWriter<Long> implementation. It will receive the list of 2,000 IDs as input and call the external service. The result from the external service should allow you to create a List<Item>. Your writer can then simply forward this List<Item> to your JdbcItemWriter<Item> delegate.
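A minimal sketch of such a delegating writer, assuming the pre-Spring-Batch-5 List-based ItemWriter signature and the standard JdbcBatchItemWriter as the delegate; ExternalLeadService and Lead are hypothetical names for your service client and record type.

import java.util.List;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.database.JdbcBatchItemWriter;

public class ExternalServiceItemWriter implements ItemWriter<Long> {

    private final ExternalLeadService externalService;     // hypothetical client for the external call
    private final JdbcBatchItemWriter<Lead> delegate;       // standard Spring Batch JDBC writer

    public ExternalServiceItemWriter(ExternalLeadService externalService,
                                     JdbcBatchItemWriter<Lead> delegate) {
        this.externalService = externalService;
        this.delegate = delegate;
    }

    @Override
    public void write(List<? extends Long> ids) throws Exception {
        // One call to the external service per chunk of ids (e.g. 2,000 at a time).
        List<Lead> records = externalService.fetchByIds(ids);
        // Forward the resolved records to the JDBC writer, which batches the inserts.
        delegate.write(records);
    }
}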

Hibernate bulk operations migrate databases

I wrote a small executable jar using Spring & Spring Data JPA to migrate data from a database, converting objects from the original database (spread across several tables) into valid objects for the new database and then inserting the new objects into the new database.
The problem is: I process a large amount of data (200,000 records) and doing my inserts one by one is really time consuming (1 hr; all of the time is spent on the INSERT operations, which happen after validating/transforming the incoming data, not on retrieval from the original database or on validation/conversion).
I already had some suggestions about it:
- [Edit, because I didn't explain it well] As I am doing an extract-validate-transform-insert, do my inserts (which are valid because they are verified first) X objects at a time instead of inserting them one by one. That is the suggestion from the first answer: I tried it, but it is not that efficient and still time consuming.
- Instead of saving directly into the database, save the inserts into a .sql file and then import the file directly into the database. But how do I translate myDao.save() into the final SQL output and then write it to a file?
- Use Talend: known as probably the best way, but it would take too long to redo everything. I'd like to find a way using Java and refactor my jar.
Other ideas?
Note: one important point is that if one validation fails I want to continue processing the other data; I only log an error.
Thanks
You should pause and think for a minute: what could cause an error when inserting your data into the database? Short of "your database is hosed", there are two possibilities:
There is a bug in your code;
The data coming in is bad.
If you have a bug in your code, you would be better off if your entire data load were reverted. You will get another chance to transfer the data after you fix your code.
If the data coming in is bad, or is suspected bad, you should add a step for validating your data. So your process workflow might look like this: Extract --> Validate --> Transform --> Load. If the incoming data is invalid, write it to the log or load it into a separate table for erroneous data.
You should keep the whole process run in the same transaction, using the same Hibernate session. Keeping all 200K records in memory would be pushing it, though, so I suggest using batching (see http://docs.jboss.org/hibernate/orm/3.3/reference/en-US/html/batch.html). In short, after a predetermined number of records, say 1000, flush and clear your Hibernate session.
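A minimal sketch of that flush/clear pattern, assuming a plain Hibernate Session; SourceRecord, TargetEntity and convert(...) are hypothetical names for your source rows, target entities and transformation step.

import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;

public class BatchMigrator {

    private static final int BATCH_SIZE = 1000;

    public void migrate(SessionFactory sessionFactory, Iterable<SourceRecord> sourceRecords) {
        Session session = sessionFactory.openSession();
        Transaction tx = session.beginTransaction();
        int count = 0;
        for (SourceRecord source : sourceRecords) {
            TargetEntity target = convert(source);   // validate + transform (hypothetical step)
            session.save(target);
            if (++count % BATCH_SIZE == 0) {
                // push the pending INSERTs and detach the entities
                // so the session-level cache does not grow without bound
                session.flush();
                session.clear();
            }
        }
        tx.commit();
        session.close();
    }

    private TargetEntity convert(SourceRecord source) { /* placeholder */ return null; }
}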

Best Design for the scenario

I have a requirement where I have to select around 60 million plus records from a database. Once I have all the records in the ResultSet I have to format some columns as per the client's requirements (date format and number format) and then write all the records to a file (secondary storage).
Currently I am selecting records on a day basis (7 selects for 7 days) from the DB and putting them in a HashMap, then reading from the HashMap, formatting some columns and finally writing to a file (a separate file for each of the 7 days).
Finally I merge all 7 files into a single file.
But this whole process takes 6 hrs to complete. To improve it I created 7 threads for the 7 days, with each thread writing a separate file.
Finally I merge all 7 files into a single file. This process takes 2 hours, but my program runs into OutOfMemory after an hour or so.
Please suggest the best design for this scenario. Should I use some caching mechanism, and if so, which one and how?
Note: the client doesn't want to change anything at the database, like creating indexes or stored procedures; they don't want to touch the database.
Thanks in advance.
Do you need to have all the records in memory to format them? You could try to stream the records through a process and write them straight to the file. If you're able to break the query up even further, you might be able to start processing the results while you're still retrieving them.
Depending on your DB back end, there might be tools to help with this, such as SSIS for SQL Server 2005+.
Edit
I'm a .NET developer, so let me suggest what I would do in .NET and hopefully you can convert it into comparable technologies on the Java side.
ADO.NET has a DataReader, which is a forward-only, read-only ("firehose") cursor over a result set. It returns data as the query executes. This is very important. Essentially, my logic would be:
IDataReader reader = GetTheDataReader(dayOfWeek);
while (reader.Read())
{
    file.Write(formatRow(reader));
}
Since this executes while we are returning rows, you're not going to block on the network access, which I am guessing is a huge bottleneck for you. The key here is that we don't store any of this in memory for long: as we cycle, the reader discards the results, and the file writes the row to disk.
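On the Java side, a rough equivalent (a sketch, not a drop-in solution) would be a forward-only, read-only JDBC statement with streaming enabled, writing each row to the file as it arrives. The connection string, query and formatRow(...) below are placeholders, and the Integer.MIN_VALUE fetch size is a MySQL Connector/J-specific streaming hint; other drivers use a plain positive fetch size.

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class StreamingExport {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:mysql://host/db", "user", "pass");
             Statement st = con.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
             BufferedWriter out = new BufferedWriter(new FileWriter("day1.csv"))) {

            st.setFetchSize(Integer.MIN_VALUE);   // MySQL-specific hint: stream rows instead of buffering them
            try (ResultSet rs = st.executeQuery("SELECT ... FROM events WHERE day = 1")) {
                while (rs.next()) {
                    out.write(formatRow(rs));     // format dates/numbers as required (hypothetical helper)
                    out.newLine();
                }
            }
        }
    }

    private static String formatRow(ResultSet rs) throws Exception {
        // placeholder formatting
        return rs.getString(1);
    }
}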
I think what Josh is suggesting is this:
You have loops where you currently go through all the result records of your query (just using pseudocode here):
while (rec = getNextRec())
{
    put in hash ...
}
for each rec in (hash)
{
    format and save back in hash ...
}
for each rec in (hash)
{
    write to a file ...
}
Instead, do it like this:
while (rec = getNextRec())
{
    format fields ...
    write to the file ...
}
Then you never have more than 1 record in memory at a time ... and you can process an unlimited number of records.
Obviously reading 60 million records at once uses up all your memory - so you can't do that (i.e. your 7-thread model). Reading 60 million records one at a time uses up all your time - so you can't do that either (i.e. your initial read-to-file model).
So.... you're going to have to compromise and do a bit of both.
Josh has it right - open a cursor to your DB that simply reads the next record, one after the other, in the simplest, most feature-light way. A "firehose" cursor (otherwise known as a read-only, forward-only cursor) is what you want here, as it imposes the least load on the database. The DB isn't going to let you update the records or go backwards in the recordset, which you don't want anyway, so it won't need to hold onto memory for the records.
Now that you have this cursor, the DB gives you 1 record at a time - read it and write it to a file (or several files); this should finish quite quickly. Your task then is to merge the files into 1 in the correct order, which is relatively easy.
Given the quantity of records you have to process, I think this is the optimum solution for you.
But... seeing as you're doing quite well so far anyway, why not just reduce the number of threads until you are within your memory limits? Batch processing is run overnight in many companies; this just seems to be another one of those processes.
It depends on the database you are using, but if it were SQL Server, I would recommend using something like SSIS to do this rather than writing a program.
