Load data from a file using multithreading - Java

I want to load data into a PostgreSQL table using a Java program.
In the program I want to use multithreading, and I use the COPY command API:
CopyManager copyManager = ((PGConnection) conn).getCopyAPI();
FileReader fileReader = new FileReader(filename);
copyManager.copyIn("COPY " + tblname + " FROM STDIN DELIMITER ','", fileReader);
I have divided the file into 'n' parts, and each thread loads its part into the PostgreSQL table.
Example:
file: test.csv
threads = 3
Then test1.csv, test2.csv, and test3.csv are created, and each thread loads one file.
This works correctly.
Now I want to measure the loading performance when multiple threads share a single file (without splitting it into n files).
Example:
file: test.csv (contains 30000 records)
threads = 3 (the 1st thread loads rows 1-10000, the 2nd thread rows 10001-20000, and the 3rd thread rows 20001-30000);
the three threads divide the content of the file among themselves and load the data into Postgres using COPY.
Is it possible to divide the records of the file between threads and load them into Postgres using Java multithreading?

Writing to storage prevents all other threads from writing to the same storage at the same time; you cannot simply make everything multithreaded.
In this case each part needs to wait for the previous one to finish.
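For anyone who wants to try measuring this anyway, below is a minimal sketch of one way to structure the single-file experiment: each thread opens its own connection (a CopyManager is tied to one connection), skips to its assigned line range, and streams only those lines to COPY. The JDBC URL, table name, and file name are placeholders.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.StringReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

public class ParallelCopy {
    // Reads lines [startLine, startLine + count) of the file and streams
    // them to COPY over this worker's own connection.
    static void copyRange(String url, String file, String table,
                          long startLine, long count) throws Exception {
        StringBuilder chunk = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
            for (long i = 0; i < startLine; i++) reader.readLine(); // skip to the range
            String line;
            for (long i = 0; i < count && (line = reader.readLine()) != null; i++) {
                chunk.append(line).append('\n');
            }
        }
        try (Connection conn = DriverManager.getConnection(url)) {
            CopyManager copyManager = ((PGConnection) conn).getCopyAPI();
            copyManager.copyIn("COPY " + table + " FROM STDIN DELIMITER ','",
                    new StringReader(chunk.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://localhost:5432/testdb?user=test"; // placeholder
        int threads = 3;
        long rowsPerThread = 10000;
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            final long start = t * rowsPerThread;
            pool.submit(() -> {
                try {
                    copyRange(url, "test.csv", "tblname", start, rowsPerThread);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
    }
}

Note that each thread still scans the file from the beginning to find its starting line, and each slice is buffered in memory for simplicity, so this parallelizes the database side rather than the file I/O.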

Related

How do the Spark executor and driver share data back and forth? Not the result of the task, but the data itself

So I have a Spark application that reads DB records (let's say 1000 records), processes them, and writes a CSV file (with 1000 lines) out to cloud object storage. So three questions here:
Is the DB read request sent to the executors? If so, in the case of 1000 DB records, would each executor read partial DB data (for example, 500 records each) and send the records back to the driver? Or does it write to a central cache that the driver reads from?
The next step, processing the DB records (a fold job), is sent to 2 executors. Let's say each executor gets 500 records or so. Once an executor finishes processing its partition, does it send all 500 processed (formatted) rows back to the driver? Or does it write to some central cache that the driver reads from? How does the data exchange between driver and executor happen?
The last step is the .save CSV-file call in my main() function. In this code I am doing a repartition(1) with the idea that I will only save this file from one executor. If so, how is the data collected into this one executor? Remember, earlier we had two executors process 500 records each. How is a total of 1000 records sent to one executor and saved into the object storage by that one executor? How is the data collected from all executors and shared into the one executor executing the .save?
dataset.repartition(1)
       .write()
       .format("csv")
       .option("header", "true")
       .save(filepath);
If I don't do repartition(1), will the save happen from multiple executors, and would they overwrite each other? I don't think there is a way to specify a unique filename using Spark. Do I have to save the file in a temp location, rename it later, and all that?
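For reference, here is the same write without repartition(1); the comments describe standard Spark behavior (the output path names a directory, and each partition becomes its own part file), not anything specific to this application:

// Without repartition(1), each partition is written by its executor as a
// separate part file under the output directory, e.g.
//   filepath/part-00000-*.csv
//   filepath/part-00001-*.csv
// so the executors do not overwrite each other; "filepath" names a directory,
// and the individual file names are generated by Spark, not chosen by you.
dataset.write()
       .format("csv")
       .option("header", "true")
       .save(filepath);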
Are there any articles or YouTube videos that explain how data is distributed and collected or shared across executors? I can understand how .count() works, but how does .save work, and how are large results, like millions of DB records or rows, shared across executors? I have been looking for resources to read but can't seem to find one that answers my questions. I am very new to Spark, like 3 weeks new.

Reading a file using multiple threads

I have a large file. Each line in that file maps to a database record, so I need to read the file line by line and persist each record into the database. Suppose I use multiple threads to read that file.
Is there a way in Java whereby one thread reads lines 1...50 while another thread reads lines 51...100? In the same way I would have multiple threads, all reading from that single file.
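For what it's worth, here is a minimal sketch of the line-range idea using the java.nio streams API; note that skip(n) still has to scan past the first n lines, because line boundaries can only be found by reading the bytes:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Reads lines [start, start + count) of the file (0-based line numbers).
// Each thread can call this with its own range, but the skip() still scans
// sequentially past the earlier lines, so the file I/O itself is not parallel.
static List<String> readRange(String file, long start, long count) throws IOException {
    try (Stream<String> lines = Files.lines(Paths.get(file))) {
        return lines.skip(start).limit(count).collect(Collectors.toList());
    }
}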

Spring Batch Job reader is running continuously although the scheduled time is 5 minutes

I have configured a Spring Batch job with
triggerBean.setCronExpression(task.getCronExpression());
triggerBean.setStartTime(task.getStartTime());
LOG.info("Scheduling task {} to {}", task.getTaskName(), task.getCronExpression());
scheduler.scheduleJob(jobBean, triggerBean);
Now I have a reader which reads, let's say, about 100 DB rows and then hands the list to the writer; the writer processes some of the entries in the list provided by the reader, say 50 of them. The job is scheduled to run after 10 minutes, but I am observing that the reader runs continuously, infinitely.
Another example:
Let's say my reader reads 1 DB row with status column = 1 and count = 4 and gives it to the writer, and I have implemented the writer to delete row entries with column = 1 and count = 5. In this scenario my reader runs continuously, bringing the same data back again and again for the writer, but the writer cannot process it, so this causes an infinite loop in the reader.
Please suggest what the problem is here and what the solution may be.
Note: another ticket was raised, but I am not sure what the issue may be.
Have a look at it:
Spring Batch Job Running in Infinite loop
You need to return null in the reader once you complete reading all the records from the DB, if you are using your own reader. Please check your custom reader.
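For illustration, this is the shape of a custom reader that terminates correctly; the item type and the query method are placeholders for whatever the real reader loads:

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.springframework.batch.item.ItemReader;

public class StatusRowReader implements ItemReader<String> {
    private Iterator<String> rows; // filled once from the DB query

    @Override
    public String read() {
        if (rows == null) {
            rows = loadRowsFromDb().iterator(); // run the query exactly once
        }
        // Returning null tells Spring Batch that the input is exhausted;
        // a reader that keeps re-querying and never returns null loops forever.
        return rows.hasNext() ? rows.next() : null;
    }

    private List<String> loadRowsFromDb() {
        return new ArrayList<>(); // placeholder for the real query (e.g. rows with status = 1)
    }
}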

Multi-threaded file processing and database batch insertions

I have a Java main application which reads a file line by line. Each line represents subscriber data.
name, email, mobile, ...
A subscriber object is created for each line being processed, and this object is then persisted in the database using JDBC.
PS: The input file has around 15 million subscriber records, and the application takes around 10-12 hours to process it. I need to reduce this to around 2-3 hours, as this task is a migration activity and the downtime we get is around 4-5 hours.
I know I need to use multiple threads / a thread pool, maybe Java's native ExecutorService. But I am asked to do batch updates as well, say a thread pool of 50 or 100 worker threads and batch updates of 500-1000 subscribers.
I am familiar with using ExecutorService, but I can't see an approach where I can have the batch-update logic in it too.
My overall application code looks like:
while (null != (line = getNextLine())) {
    Subscriber sub = getSub(line); // creates subscriber object by parsing the line
    persistSub(sub); // JDBC - PreparedStatement insert query executed
}
I need an approach where I can process this faster with multiple threads and batch updates, or any existing frameworks or Java APIs which can be used for such cases.
persistSub(sub) should not immediately access the database. Instead, it should store sub in an array of length 500-1000, and only when the array is full, or the input file is exhausted, wrap it in a Runnable and submit it to a thread pool. The Runnable then accesses the database via JDBC as described in JDBC Batching with PrepareStatement Object.
UPDATE
If writing into the database is slow and reading the input file is fast, many data arrays can pile up waiting to be written to the database, and the system can run out of memory. So persistSub(sub) should keep track of the number of allocated arrays. The easiest way is to use a Semaphore initialized with the allowed number of arrays. Before a new array is allocated, persistSub(sub) calls Semaphore.acquire(). Each Runnable task, before it ends, calls Semaphore.release().
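A sketch of the approach described above, assuming a Subscriber shaped like the question's object; the pool size, batch size, SQL, and connection details are all placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

public class BatchLoader {
    private static final int BATCH_SIZE = 1000;
    private static final int MAX_PENDING_BATCHES = 20; // bounds memory use

    private final ExecutorService pool = Executors.newFixedThreadPool(8);
    private final Semaphore permits = new Semaphore(MAX_PENDING_BATCHES);
    private List<Subscriber> batch = new ArrayList<>(BATCH_SIZE);

    // Called by the single file-reading thread for every parsed line.
    void persistSub(Subscriber sub) throws InterruptedException {
        batch.add(sub);
        if (batch.size() == BATCH_SIZE) {
            submitBatch();
        }
    }

    // Must be called once after the input file is exhausted.
    void finish() throws InterruptedException {
        if (!batch.isEmpty()) {
            submitBatch();
        }
        pool.shutdown();
    }

    private void submitBatch() throws InterruptedException {
        permits.acquire(); // block the reader if too many batches are waiting on the DB
        final List<Subscriber> toWrite = batch;
        batch = new ArrayList<>(BATCH_SIZE);
        pool.submit(() -> {
            try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/db"); // placeholder
                 PreparedStatement ps = conn.prepareStatement(
                         "INSERT INTO subscriber (name, email, mobile) VALUES (?, ?, ?)")) {
                for (Subscriber s : toWrite) {
                    ps.setString(1, s.name);
                    ps.setString(2, s.email);
                    ps.setString(3, s.mobile);
                    ps.addBatch();
                }
                ps.executeBatch(); // one round trip for the whole batch
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                permits.release(); // free the slot for the next batch
            }
        });
    }

    static class Subscriber {
        String name, email, mobile;
    }
}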

using ThreadPoolTaskExecutor while transferring files over the network

I am currently writing a Spring Batch job which is supposed to transfer all the files from my application to a shared location. The batch consists of a single step: a reader which reads byte[], a processor that converts it to PDF, and a writer that creates the new file at the shared location.
1) Since it's an IO-bound operation, should I use ThreadPoolTaskExecutor in my batch? Will using it cause data to be lost?
2) Also, in my ItemWriter I am writing using a FileOutputStream. My server is in Paris and the shared location is in New York. So while writing the file in such a scenario, is there any better or more efficient way to achieve this with the least delay?
Thanks in advance
1) If you can separate the IO-bound operation into its own thread and other parts into their own threads, you could go for a multithreaded approach. Data won't be lost if you write it correctly.
One approach could be as follows (a sketch follows the list):
Have one reader thread read the data into a buffer.
Have a second thread perform the transformation into PDF.
Have a third thread perform the write-out (if it's not to the same disk as the reading).
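A sketch of that three-thread pipeline using bounded blocking queues; the PDF conversion and the output path are placeholders:

import java.io.FileOutputStream;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class TransferPipeline {
    private static final byte[] POISON = new byte[0]; // end-of-input marker

    public static void main(String[] args) throws Exception {
        // Bounded queues keep the fast stages from outrunning the slow ones.
        BlockingQueue<byte[]> toConvert = new ArrayBlockingQueue<>(100);
        BlockingQueue<byte[]> toWrite = new ArrayBlockingQueue<>(100);

        Thread reader = new Thread(() -> {
            try {
                for (int i = 0; i < 10; i++) {
                    toConvert.put(("document " + i).getBytes()); // stand-in for the real reads
                }
                toConvert.put(POISON);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        Thread converter = new Thread(() -> {
            try {
                byte[] item;
                while ((item = toConvert.take()) != POISON) {
                    toWrite.put(convertToPdf(item));
                }
                toWrite.put(POISON);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        Thread writer = new Thread(() -> {
            try {
                byte[] pdf;
                int n = 0;
                while ((pdf = toWrite.take()) != POISON) {
                    try (FileOutputStream out = new FileOutputStream("/shared/out" + n++ + ".pdf")) {
                        out.write(pdf); // stand-in for the network-share write
                    } catch (Exception e) { e.printStackTrace(); }
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        reader.start(); converter.start(); writer.start();
        reader.join(); converter.join(); writer.join();
    }

    private static byte[] convertToPdf(byte[] raw) {
        return raw; // placeholder for the real PDF conversion
    }
}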
2) So it's a mapped network share? There's probably not much you can do from Java then. But before you get worried about it, you should make sure it's an actual issue and not premature optimization.
I guess you can achieve the above task by partitioning. Create a master step which returns the file paths, and multithreaded slave tasks, with two approaches:
1) A simple tasklet which reads the file, converts it to PDF, and writes it to the shared drive. Rather than using FileOutputStream, use FileChannel with buffered reads/writes; it has much better performance (a minimal example follows this answer).
2) Use a chunk reader, specifically ItemStream, and pass it to a custom ItemWriter to write the file (I never got the chance to write to PDF, but I believe at the end of the day it will be a stream with a different encoding).
I would recommend the second one; it's always better to use a library rather than custom code.
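As a minimal illustration of the FileChannel suggestion in the first approach, here is a copy using transferTo, which can let the OS move the bytes instead of looping over a user-space buffer (paths are placeholders):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.channels.FileChannel;

// Copies one file to the target path using FileChannel.transferTo.
static void copyFile(String source, String target) throws IOException {
    try (FileChannel in = new FileInputStream(source).getChannel();
         FileChannel out = new FileOutputStream(target).getChannel()) {
        long position = 0;
        long size = in.size();
        while (position < size) {
            position += in.transferTo(position, size - position, out);
        }
    }
}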
