Multi-threaded file processing and database batch insertions - java

I have a Java main application which reads a file line by line. Each line represents subscriber data.
name, email, mobile, ...
A subscriber object is created for each line processed, and this object is then persisted in the database using JDBC.
PS: The input file has around 15 million subscriber records, and the application takes around 10-12 hours to process it. I need to reduce this to around 2-3 hours, as this task is a migration activity and the downtime we get is around 4-5 hours.
I know I need to use multiple threads / a thread pool, maybe Java's native ExecutorService. But I am asked to do a batch update as well: say, a thread pool of 50 or 100 worker threads and batch updates of 500-1000 subscribers.
I am familiar with ExecutorService, but I am not finding an approach that lets me include the batch update logic in it.
My overall application code looks like:
while (null != (line = getNextLine())) {
    Subscriber sub = getSub(line); // creates subscriber object by parsing the line
    persistSub(sub);               // JDBC - PreparedStatement insert query executed
}
I need an approach that lets me process this faster with multiple threads and batch updates, or any existing frameworks or Java APIs that can be used for such cases.

persistSub(sub) should not immediately access the database. Instead, it should store sub in an array of length 500-1000, and only when the array is full, or the input file has ended, wrap it in a Runnable and submit it to a thread pool. The Runnable then accesses the database via JDBC batching with a PreparedStatement object.
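A minimal sketch of that batch-writing task, assuming a subscriber table with (name, email, mobile) columns, a DataSource handing out pooled connections, and getters on Subscriber (all illustrative, not from the question):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;
import javax.sql.DataSource;

// Writes one batch of subscribers with a single PreparedStatement.
class BatchInsertTask implements Runnable {
    private final List<Subscriber> batch;
    private final DataSource dataSource;

    BatchInsertTask(List<Subscriber> batch, DataSource dataSource) {
        this.batch = batch;
        this.dataSource = dataSource;
    }

    @Override
    public void run() {
        String sql = "INSERT INTO subscriber (name, email, mobile) VALUES (?, ?, ?)";
        try (Connection con = dataSource.getConnection();
             PreparedStatement ps = con.prepareStatement(sql)) {
            con.setAutoCommit(false);
            for (Subscriber s : batch) {
                ps.setString(1, s.getName());
                ps.setString(2, s.getEmail());
                ps.setString(3, s.getMobile());
                ps.addBatch();           // queue the row instead of executing it
            }
            ps.executeBatch();           // one round trip for the whole batch
            con.commit();
        } catch (SQLException e) {
            throw new RuntimeException("Batch insert failed", e);
        }
    }
}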
UPDATE
If writing into the database is slow and reading the input file is fast, many arrays of data waiting to be written to the database can pile up and the system can run out of memory. So persistSub(sub) should keep track of the number of allocated arrays. The easiest way is to use a Semaphore initialized with the allowed number of arrays. Before a new array is allocated, persistSub(sub) calls Semaphore.acquire(). Each Runnable task, just before it finishes, calls Semaphore.release().
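A sketch of the producer side with that backpressure, reusing the BatchInsertTask above; the batch size, pool size, and permit count are illustrative numbers, not requirements:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import javax.sql.DataSource;

class SubscriberLoader {
    private static final int BATCH_SIZE = 1000;
    private final ExecutorService pool = Executors.newFixedThreadPool(50);
    private final Semaphore permits = new Semaphore(100);   // at most 100 batches in flight
    private final DataSource dataSource;
    private List<Subscriber> currentBatch = new ArrayList<>(BATCH_SIZE);

    SubscriberLoader(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    void persistSub(Subscriber sub) throws InterruptedException {
        currentBatch.add(sub);
        if (currentBatch.size() == BATCH_SIZE) {
            submitCurrentBatch();
        }
    }

    void finish() throws InterruptedException {
        if (!currentBatch.isEmpty()) {
            submitCurrentBatch();                 // flush the last, partially filled batch
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    private void submitCurrentBatch() throws InterruptedException {
        permits.acquire();                        // block if too many batches are already queued
        List<Subscriber> batch = currentBatch;
        currentBatch = new ArrayList<>(BATCH_SIZE);
        pool.execute(() -> {
            try {
                new BatchInsertTask(batch, dataSource).run();
            } finally {
                permits.release();                // free a slot whether the insert succeeded or not
            }
        });
    }
}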

Related

How do the Spark executor and driver share data back and forth? Not the result of the task, but the data itself

So I have a Spark application that reads DB records (let's say 1000 records), processes them, and writes a CSV file (with 1000 lines) out to cloud object storage. So, three questions here:
Is the DB read request sent to the executors? If so, in the case of 1000 DB records, would each executor read part of the DB data (for example, 500 records each) and send the records back to the driver? Or does it write to a central cache from which the driver reads?
The next step, processing the DB records (a fold job), is sent to 2 executors. Let's say each executor gets 500 records or so. Once an executor finishes processing its partition, does it send all 500 processed (formatted) rows back to the driver? Or does it write to some central cache that the driver reads from? How does the data exchange between driver and executor happen?
The last step is the .save CSV-file call in my main() function. In this code I am doing a repartition(1) with the idea that I will only save this file from one executor. If so, how is the data collected into this one executor? Remember, earlier we had two executors processing 500 records each. How does a total of 1000 records get sent to one executor and saved into object storage by that one executor? How is the data collected from all executors and shared with the one executor executing the .save?
dataset.repartition(1)
       .write()
       .format("csv")
       .option("header", "true")
       .save(filepath);
If I don't do repartition(1), will the save happen from multiple executors, and would they overwrite each other? I don't think there is a way to specify a unique filename using Spark. Do I have to save the file in a temp location and rename it later, and all that?
Are there any articles or YouTube videos that explain how data is distributed and collected or shared across executors? I can understand how .count() works, but how does .save work, and how are large results like millions of DB records or rows shared across executors? I have been looking for resources to read but can't seem to find one that answers my questions. I am very new to Spark, like 3 weeks new.

Load data into threads of Java ExecutorService

I am writing a server to respond to queries of the same type. As I will have several clients, I want to perform tasks in parallel. Prior to performing a task I need to load static data from a file - this process takes most of the time. When this is done I can answer any number of queries using this loaded data.
I want to use Executors.newFixedThreadPool(n) as my ExecutorService, so it manages all the multithreading stuff for me. If I understand correctly, threads are created once and then all my tasks are run using these threads. So it would fit my problem ideally if it were possible to load the data into every thread when it is created and use it for all tasks run on that thread.
Is this possible?
Another idea is to create an array of several copies of the same data, each with a boolean isInUse. As there will be a fixed number of tasks running in parallel, just select a data entity that is free at the moment, mark it as taken, and mark it as free at the end of the task.
But I think I would somehow need to synchronise this boolean flag between threads.
(I need several copies of the data as it can be modified while a task runs, but it is returned to its initial state after the task completes.)
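One way to get the per-thread data described above is a ThreadLocal, which loads a copy lazily the first time each pool thread touches it. This is only a sketch; StaticData and loadFromFile are illustrative placeholders for whatever the file parsing produces:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class QueryServer {
    // Each pool thread gets its own copy, loaded the first time that thread runs a task.
    private static final ThreadLocal<StaticData> DATA =
            ThreadLocal.withInitial(() -> StaticData.loadFromFile("data.bin"));

    private final ExecutorService pool = Executors.newFixedThreadPool(8);

    void handleQuery(String query) {
        pool.submit(() -> {
            StaticData data = DATA.get();   // this thread's private copy
            // ... answer the query using 'data'; it can be modified and reset freely,
            // because no other thread ever sees this instance
        });
    }

    // Illustrative placeholder for the data loaded from the file.
    static class StaticData {
        static StaticData loadFromFile(String path) { /* parse the file here */ return new StaticData(); }
    }
}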

Reading huge file in Java

I read a huge file (almost 5 million lines). Each line contains a date and a request; I must parse the requests between specific dates. I use a BufferedReader to read the file until the start date and then start parsing lines. Can I use threads for parsing the lines, because it takes a lot of time?
It isn't entirely clear from your question, but it sounds like you are reparsing your 5 million-line file every time a client requests data. You certainly can solve the problem by throwing more threads and more CPU cores at it, but a better solution would be to improve the efficiency of your application by eliminating duplicate work.
If this is the case, you should redesign your application to avoid reparsing the entire file on every request. Ideally you should store data in a database or in-memory instead of processing a flat text file on every request. Then on a request, look up the information in the database or in-memory data structure.
If you cannot eliminate the 5 million-line file entirely, you can periodically recheck the large file for changes, skip/seek to the end of the last record that was parsed, then parse only new records and update the database or in-memory data structure. This can all optionally be done in a separate thread.
Firstly, 5 million lines of 1000 characters each is only 5 GB, which is not necessarily prohibitive for a JVM. If this is actually a critical use case with lots of hits, then buying more memory is almost certainly the right thing to do.
Secondly, if that is not possible, most likely the right thing to do is to build an ordered map keyed by date. Every date is a key in the map and points to a list of the line numbers that contain the requests for that date. You can then go directly to the relevant line numbers.
Something of the form
TreeMap<Date, ArrayList<Integer>>()
would do nicely. That should have a memory usage on the order of 5,000,000 * 32 / 8 bytes = 20 MB, which should be fine.
You could also use the FileChannel class to keep the I/O handle open as you jump from one line to another. This also allows memory mapping.
See http://docs.oracle.com/javase/7/docs/api/java/nio/channels/FileChannel.html
And http://en.wikipedia.org/wiki/Memory-mapped_file
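A rough sketch of that index plus memory mapping, assuming each line starts with an ISO date ("yyyy-MM-dd") and lines are newline-terminated UTF-8; the file name and layout are illustrative:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class RequestIndex {
    // date -> byte offsets of the lines for that date, in file order
    private final TreeMap<String, List<Long>> offsetsByDate = new TreeMap<>();

    RequestIndex(Path file) throws IOException {
        long offset = 0;
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String date = line.substring(0, 10);               // "yyyy-MM-dd"
                offsetsByDate.computeIfAbsent(date, d -> new ArrayList<>()).add(offset);
                offset += line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for '\n'
            }
        }
    }

    // Memory-map only the region from the first line of 'from' to the first line after 'to'.
    MappedByteBuffer mapRange(Path file, String from, String to) throws IOException {
        long start = offsetsByDate.ceilingEntry(from).getValue().get(0);
        String after = offsetsByDate.higherKey(to);
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long end = (after == null) ? ch.size() : offsetsByDate.get(after).get(0);
            return ch.map(FileChannel.MapMode.READ_ONLY, start, end - start);
        }
    }
}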
A good way to parallelize a lot of small tasks is to wrap the processing of each task with a FutureTask and then pass each task to a ThreadPoolExecutor to run them. The executor should be initialized with the number of CPU cores your system has available.
When you call executor.execute(future), the future is queued for background processing. To avoid creating and destroying too many threads, the ThreadPoolExecutor only creates as many threads as you specified and executes the futures one after another.
To retrieve the result of a future, call future.get(). When the future hasn't completed yet (or hasn't even started), this method blocks until it completes. But other futures keep executing in the background while you wait.
Remember to call executor.shutdown() when you don't need it anymore, to make sure it terminates the background threads it would otherwise keep around until the keep-alive time has expired or it is garbage-collected.
tl;dr pseudocode:
create executor
for each line in file
    create new FutureTask which parses that line
    pass future task to executor
    add future task to a list
for each entry in task list
    call entry.get() to retrieve result
executor.shutdown()
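The same idea in plain Java, using ExecutorService.submit, which wraps each task in a Future for you; ParsedLine, parse, and the file name are illustrative placeholders:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelParser {
    public static void main(String[] args) throws IOException, InterruptedException, ExecutionException {
        ExecutorService executor =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        List<Future<ParsedLine>> futures = new ArrayList<>();

        try (BufferedReader reader = Files.newBufferedReader(Paths.get("requests.log"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                final String current = line;
                futures.add(executor.submit(() -> parse(current)));  // queued for a pool thread
            }
        }

        for (Future<ParsedLine> f : futures) {
            ParsedLine result = f.get();   // blocks until this particular line has been parsed
            // ... use the result
        }
        executor.shutdown();
    }

    private static ParsedLine parse(String line) {
        return new ParsedLine();           // parse the fields from 'line' here
    }

    // Illustrative placeholder for the parsed record.
    static class ParsedLine { }
}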

Using multithreading for reading information

I have the following scenario:
The server sends a lot of information over a Socket, so I need to read this information and validate it. The idea is to use 20 threads and batches; each time the batch size reaches 20, a thread must send the information to the database and keep reading from the socket, waiting for more.
I don't know what the best way to do this would be. I was thinking:
Create a Socket that will read the information
Create an Executor (Executors.newFixedThreadPool(20)), validate the information, add each line to a list, and when the size reaches 20, execute the Runnable class that will send the information to the database.
Thanks in advance for your help.
You don't want to do this with a whole bunch of threads. You're better off using a producer-consumer model with just two threads.
The producer thread reads records from the socket and places them on a queue. That's all it does: read record, add to queue, read next record. Lather, rinse, repeat.
The consumer thread reads a record from the queue, validates it, and writes it to the database. If you want to batch the items so that you write 20 at a time to the database, then you can have the consumer add the record to a list and when the list gets to 20, do the database update.
You probably want to look up information on using the Java BlockingQueue in producer-consumer programs.
You said that you might get a million records a day from the socket. That's only 12 records per second. Unless your validation is hugely processor intensive, a single thread could probably handle 1,200 records per second with no problem.
In any case, your major bottleneck is going to be the database updates, which probably won't benefit from multiple threads.
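A minimal sketch of that producer-consumer setup with a batch size of 20; the Record type and the readRecord, validate, and writeBatch methods are illustrative placeholders for the pieces the question already has:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class SocketToDatabase {
    private static final int BATCH_SIZE = 20;
    private final BlockingQueue<Record> queue = new ArrayBlockingQueue<>(1000);

    // Producer: read from the socket and hand records to the queue.
    void produce() throws InterruptedException {
        Record record;
        while ((record = readRecord()) != null) {
            queue.put(record);              // blocks if the consumer falls behind
        }
    }

    // Consumer: validate, collect into batches of 20, write each batch to the database.
    void consume() throws InterruptedException {
        List<Record> batch = new ArrayList<>(BATCH_SIZE);
        while (true) {
            Record record = queue.take();   // blocks until a record is available
            if (validate(record)) {
                batch.add(record);
            }
            if (batch.size() == BATCH_SIZE) {
                writeBatch(batch);          // single batched database update
                batch.clear();
            }
        }
    }

    // Placeholders for socket reading, validation, and the JDBC batch write.
    private Record readRecord() { return null; }
    private boolean validate(Record r) { return true; }
    private void writeBatch(List<Record> batch) { }
    static class Record { }
}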

Help with Java threads or executors: Executing several MySQL selects, inserts and updates simultaneously

I'm writing an application to analyse a MySQL database, and I need to execute several DMLs simultaneously; for example:
// In ResultSet rsA: Select * from A;
rsA.beforeFirst();
while (rsA.next()) {
    id = rsA.getInt("id");
    // Retrieve data from table B: Select * from B where B.Id=" + id;
    // Crunch some numbers using the data from B
    // Close resultset B
}
I'm declaring an array of data objects, each with its own Connection to the database, which in turn calls several methods for the data analysis. The problem is that all threads use the same connection, so all tasks throw exceptions: "Lock wait timeout exceeded; try restarting transaction"
I believe there is a way to write the code in such a way that any given object has its own connection and executes the required tasks independently of any other object. For example:
DataObject[] dataObject = new DataObject[N + 1];
dataObject[0] = new DataObject(id[0]);
dataObject[1] = new DataObject(id[1]);
dataObject[2] = new DataObject(id[2]);
...
dataObject[N] = new DataObject(id[N]);
// The 'DataObject' class has its own connection to the database,
// so each instance of the object should use its own connection.
// It also has a "run" method, which contains all the tasks required.
Executor ex = Executors.newFixedThreadPool(10);
for (int i = 0; i <= N; i++) {
    ex.execute(dataObject[i]);
}
// Here is where the problem is: each instance creates a new connection,
// but every DML from any of the objects is funneled through just one connection
// (in the MySQL command line, "SHOW PROCESSLIST;" shows every connection, and all but
// one are idle).
Can you point me in the right direction?
Thanks
I think the problem is that you've conflated a lot of middle-tier, transactional, and persistence logic into one class.
If you're dealing directly with ResultSet, you're not thinking about things in a very object-oriented fashion.
You're smart if you can figure out how to get the database to do some of your calculations.
If not, I'd recommend keeping Connections open for the minimum time possible. Open a Connection, get the ResultSet, map it into an object or data structure, close the ResultSet and Connection in local scope, and return the mapped object/data structure for processing.
You keep persistence and processing logic separate this way. You save yourself a lot of grief by keeping connections short-lived.
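A short sketch of that pattern using try-with-resources, so the ResultSet and Connection never escape the method; the table, column, and type names are illustrative:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import javax.sql.DataSource;

class RecordDao {
    private final DataSource dataSource;

    RecordDao(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    // Open, map, close - the connection lives only for the duration of this call.
    List<BRecord> findByParentId(int id) throws SQLException {
        String sql = "SELECT value FROM B WHERE id = ?";
        List<BRecord> result = new ArrayList<>();
        try (Connection con = dataSource.getConnection();
             PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setInt(1, id);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    result.add(new BRecord(rs.getLong("value")));  // map row -> object
                }
            }
        }
        return result;   // processing happens on the mapped objects, not on the ResultSet
    }

    // Illustrative mapped type.
    static class BRecord {
        final long value;
        BRecord(long value) { this.value = value; }
    }
}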
If a stored procedure solution is slow it could be due to poor indexing. Another solution will perform equally poorly if not worse. Try running EXPLAIN PLAN and see if any of your queries are using TABLE SCAN. If yes, you have some indexes to add. It could also be due to large rollback logs if your transactions are long-running. There's a lot you could and should do to ensure you've done everything possible with the solution you have before switching. You could go to a great deal of effort and still not address the root cause.
After some time of head-scratching, I figured out my own mistakes... I want to share this new knowledge, so... here I go.
I made a very big mistake by declaring the Connection object as a static field in my code... so obviously, even though I created a new Connection for each new data object, every transaction went through a single, static connection.
With that first issue corrected, I went back to the design table, and realized that my process was:
Read an Id from an input table
Take a block of data related to the Id read in step 1, stored in other input tables
Crunch numbers: Read the related input tables and process the data stored in them
Save the results in one or more output tables
Repeat the process while I have pending Ids in the input table
Just by using a dedicated connection for input reading and a dedicated connection for output writing, the performance of my program increased... but I needed a lot more!
My original approach for steps 3 and 4 was to save each result to the output as soon as I had it... But I found a better approach:
Read the input data
Crunch the numbers, and put the results in a bunch of queues (one for each output table)
A separate thread checks every second whether there's data in any of the queues. If there is, it writes it to the tables.
So, by separating input and output tasks onto different connections, redirecting the core process output to queues, and using a dedicated thread for output storage, I finally achieved what I wanted: multithreaded DML execution!
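A rough sketch of that output side, assuming one BlockingQueue per output table and a writer thread that drains it every second; OutputRow, the table name, and the SQL are illustrative:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import javax.sql.DataSource;

class OutputWriter implements Runnable {
    private final BlockingQueue<OutputRow> resultsQueue = new LinkedBlockingQueue<>();
    private final DataSource outputDataSource;   // dedicated connection source for writes

    OutputWriter(DataSource outputDataSource) {
        this.outputDataSource = outputDataSource;
    }

    // Called by the number-crunching threads.
    void enqueue(OutputRow row) throws InterruptedException {
        resultsQueue.put(row);
    }

    // Runs on its own thread: wake up every second and flush whatever has accumulated.
    @Override
    public void run() {
        List<OutputRow> pending = new ArrayList<>();
        try {
            while (!Thread.currentThread().isInterrupted()) {
                Thread.sleep(1000);
                resultsQueue.drainTo(pending);
                if (!pending.isEmpty()) {
                    writeBatch(pending);
                    pending.clear();
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void writeBatch(List<OutputRow> rows) {
        String sql = "INSERT INTO output_table (result) VALUES (?)";
        try (Connection con = outputDataSource.getConnection();
             PreparedStatement ps = con.prepareStatement(sql)) {
            for (OutputRow row : rows) {
                ps.setLong(1, row.result);
                ps.addBatch();
            }
            ps.executeBatch();
        } catch (SQLException e) {
            throw new RuntimeException("Output write failed", e);
        }
    }

    // Illustrative result row.
    static class OutputRow {
        final long result;
        OutputRow(long result) { this.result = result; }
    }
}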
I know there are better approaches to this particular problem, but this one works quite fine.
So... if anyone is stuck with a problem like this... I hope this helps.
