I am trying to parallelize some source code with an ExecutorService and a LinkedList in Java.
In the following code, some processing is done for each line of a text file.
When the lines are relatively short, the code works well (all lines are read and processed).
However, when the lines are relatively long, the program stops reading lines without any error.
Does LinkedList have a capacity limit? Can I increase it? Is there a more appropriate way to parallelize this?
ExecutorService threadPool = Executors.newFixedThreadPool(4);
Collection<Callable<Void>> processes = new LinkedList<Callable<Void>>();
int c = 0;
while ((string = br.readLine()) != null) {
    final String str = string;
    processes.add(new Callable<Void>() {
        @Override
        public Void call() {
            // some process
            return null;
        }
    });
}
try {
    threadPool.invokeAll(processes);
} catch (InterruptedException e) {
    throw new RuntimeException(e);
} finally {
    System.out.println("parallel finishes");
    threadPool.shutdown();
}
I am trying to search for a list of words and find the total count of all the words across multiple files.
My logic is to have a separate thread for each file and get the count, and finally aggregate the counts obtained from the threads.
Say I have 50 files, each of 1 MB. The performance does not improve when I use multiple threads: the total execution time does not improve with FILE_THREAD_COUNT, and I get almost the same execution time whether my thread count is 1 or 50.
Am I doing something wrong in using the executor service?
Here is my code.
public void searchText(List<File> filesInPath, Set<String> searchWords) {
    try {
        BlockingQueue<File> filesBlockingQueue = new ArrayBlockingQueue<>(filesInPath.size());
        filesBlockingQueue.addAll(filesInPath);
        ExecutorService executorService = Executors.newFixedThreadPool(FILE_THREAD_COUNT);
        int totalWordCount = 0;
        while (!filesBlockingQueue.isEmpty()) {
            Callable<Integer> task = () -> {
                int wordCount = 0;
                try {
                    File file = filesBlockingQueue.take();
                    try (BufferedReader bufferedReader = new BufferedReader(new FileReader(file))) {
                        String currentLine;
                        while ((currentLine = bufferedReader.readLine()) != null) {
                            String[] words = currentLine.split("\\s+");
                            for (String word : words) {
                                for (String searchWord : searchWords) {
                                    if (word.contains(searchWord)) {
                                        wordCount++;
                                    }
                                }
                            }
                        }
                    } catch (Exception e) {
                        // Handle error
                    }
                } catch (Exception e) {
                    // Handle error
                }
                return wordCount;
            };
            totalWordCount += executorService.submit(task).get();
        }
        System.out.println("Final word count=" + totalWordCount);
        executorService.shutdown();
    } catch (Exception e) {
        // Handle error
    }
}
Yes, you're doing something wrong.
The problem is here:
executorService.submit(task).get()
Your code submits a task and then immediately waits for it to finish, so nothing runs in parallel; the tasks execute sequentially. And your BlockingQueue adds no value whatsoever.
The way to run tasks in parallel is to submit all of the tasks first, collect the Futures they return, and then call get() on all of them. Like this:
List<Future<Integer>> futures = filesInPath.stream()
        .map(<create your Callable>)
        .map(executorService::submit)
        .collect(toList());

for (Future<Integer> future : futures) {
    totalWordCount += future.get();
}
You can actually do it in one expression, by going through the intermediate list (as above) and then immediately streaming it again, but you have to wrap the call to Future#get in some code that catches the checked exceptions - I leave that as an exercise for the reader.
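For reference, here is one possible shape of that one-expression version (a sketch; createCallable is a hypothetical factory that builds the per-file Callable from the question, and the try/catch is the exception wrapping alluded to above):

int totalWordCount = filesInPath.stream()
        .map(file -> createCallable(file, searchWords)) // hypothetical factory for the Callable
        .map(executorService::submit)
        .collect(toList())   // materialize the list, so every task is submitted...
        .stream()            // ...before we start blocking on get()
        .mapToInt(future -> {
            try {
                return future.get();
            } catch (InterruptedException | ExecutionException e) {
                throw new RuntimeException(e);
            }
        })
        .sum();

Note that the intermediate collect(toList()) is essential: without it, the lazy stream would submit and get() one task at a time, which is exactly the sequential behaviour the question started with.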
I am new to multithreading and synchronization in Java. I am trying to complete a task in which I am given 5 files, and each file is read by one particular thread. Each thread should read one line from its file and then hand execution to the next thread, and so on. When all 5 threads have read their first line, thread 1 should continue with line 2 of file 1, and so on.
Thread ReadThread1 = new Thread(new ReadFile(0));
Thread ReadThread2 = new Thread(new ReadFile(1));
Thread ReadThread3 = new Thread(new ReadFile(2));
Thread ReadThread4 = new Thread(new ReadFile(3));
Thread ReadThread5 = new Thread(new ReadFile(4));
// starting all the threads
ReadThread1.start();
ReadThread2.start();
ReadThread3.start();
ReadThread4.start();
ReadThread5.start();
In ReadFile (which implements Runnable), in the run method, I am trying to synchronize on the BufferedReader object:
BufferedReader br = null;
String sCurrentLine;
String filename = "Source/" + files[fileno];
br = new BufferedReader(new FileReader(filename));
synchronized (br) {
    while ((sCurrentLine = br.readLine()) != null) {
        int f = fileno + 1;
        System.out.print("File No." + f);
        System.out.println("-->" + sCurrentLine);
        br.notifyAll();
        // something needs to be done here, I guess
    }
}
Need Help
Though this is not an ideal scenario for multithreading, since this is an assignment I am posting one solution that works. The threads execute sequentially, and there are a few points to note:
A thread cannot move ahead to read its line until its immediately preceding thread is done, as the threads are supposed to read in round-robin fashion.
After the current thread has read its line, it must notify the next thread, or that thread will wait forever.
I have tested this code with some files in a temp package and it was able to read the lines in round-robin fashion. I believe a Phaser could also be used to solve this problem.
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;

public class FileReaderRoundRobinNew {
    public Object[] locks;

    private static class LinePrinterJob implements Runnable {
        private final Object currentLock;
        private final Object nextLock;
        BufferedReader bufferedReader = null;

        public LinePrinterJob(String fileToRead, Object currentLock, Object nextLock) {
            this.currentLock = currentLock;
            this.nextLock = nextLock;
            try {
                this.bufferedReader = new BufferedReader(new FileReader(fileToRead));
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            }
        }

        @Override
        public void run() {
            /*
             * A few points to note:
             * 1. A thread cannot move ahead to read its line until its immediately preceding thread is done, as the threads read in round-robin fashion.
             * 2. After the current thread has read its line, it must notify the next thread, or that thread will wait forever.
             */
            String currentLine;
            synchronized (currentLock) {
                try {
                    while ((currentLine = bufferedReader.readLine()) != null) {
                        try {
                            currentLock.wait();
                            System.out.println(currentLine);
                        } catch (InterruptedException e) {}
                        synchronized (nextLock) {
                            nextLock.notify();
                        }
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            synchronized (nextLock) {
                nextLock.notify(); // ensures all threads exit at the end
            }
        }
    }

    public FileReaderRoundRobinNew(int numberOfFilesToRead) {
        locks = new Object[numberOfFilesToRead];
        int i;
        String fileLocation = "src/temp/";
        // Initialize lock instances in the array.
        for (i = 0; i < numberOfFilesToRead; ++i) locks[i] = new Object();
        // Create the threads.
        int j;
        for (j = 0; j < (numberOfFilesToRead - 1); j++) {
            Thread linePrinterThread = new Thread(new LinePrinterJob(fileLocation + "Temp" + j, locks[j], locks[j + 1]));
            linePrinterThread.start();
        }
        Thread lastLinePrinterThread = new Thread(new LinePrinterJob(fileLocation + "Temp" + j, locks[numberOfFilesToRead - 1], locks[0]));
        lastLinePrinterThread.start();
    }

    public void startPrinting() {
        synchronized (locks[0]) {
            locks[0].notify();
        }
    }

    public static void main(String[] args) {
        FileReaderRoundRobinNew fileReaderRoundRobin = new FileReaderRoundRobinNew(4);
        fileReaderRoundRobin.startPrinting();
    }
}
If the only objective is to read the files in round-robin fashion, and not strictly in the same order, then we can also use a Phaser. In that case the order in which the files are read is not always the same; for example, with four files (F1, F2, F3, and F4) the first phase may read them as F1-F2-F3-F4 but the next one as F2-F1-F4-F3. I am still including this solution for the sake of completeness.
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Phaser;

public class FileReaderRoundRobinUsingPhaser {
    final List<Runnable> tasks = new ArrayList<>();
    final int numberOfLinesToRead;

    private static class LinePrinterJob implements Runnable {
        private BufferedReader bufferedReader;

        public LinePrinterJob(BufferedReader bufferedReader) {
            this.bufferedReader = bufferedReader;
        }

        @Override
        public void run() {
            String currentLine;
            try {
                currentLine = bufferedReader.readLine();
                System.out.println(currentLine);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    public FileReaderRoundRobinUsingPhaser(int numberOfFilesToRead, int numberOfLinesToRead) {
        this.numberOfLinesToRead = numberOfLinesToRead;
        String fileLocation = "src/temp/";
        for (int j = 0; j < numberOfFilesToRead; j++) { // one task per file
            try {
                tasks.add(new LinePrinterJob(new BufferedReader(new FileReader(fileLocation + "Temp" + j))));
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            }
        }
    }

    public void startPrinting() {
        final Phaser phaser = new Phaser(1) {
            @Override
            protected boolean onAdvance(int phase, int registeredParties) {
                System.out.println("Phase Number: " + phase + " Registered parties: " + getRegisteredParties() + " Arrived: " + getArrivedParties());
                return (phase >= numberOfLinesToRead || registeredParties == 0);
            }
        };
        for (Runnable task : tasks) {
            phaser.register();
            new Thread(() -> {
                do {
                    phaser.arriveAndAwaitAdvance();
                    task.run();
                } while (!phaser.isTerminated());
            }).start();
        }
        phaser.arriveAndDeregister();
    }

    public static void main(String[] args) {
        FileReaderRoundRobinUsingPhaser fileReaderRoundRobin = new FileReaderRoundRobinUsingPhaser(4, 4);
        fileReaderRoundRobin.startPrinting();
        // Files will be accessed in round-robin fashion, but not always in exactly the same order. For example it can read 4 files as 1234, then 1342 or 1243, etc.
    }
}
The above example can be modified to the exact requirement. Here the constructor of FileReaderRoundRobinUsingPhaser takes the number of files and the number of lines to read from each file. The boundary conditions also need to be taken into consideration.
You are missing many parts of the puzzle:
you attempt to synchronize on an object local to each thread. This can have no effect and the JVM may even remove the whole locking operation;
you execute notifyAll without a matching wait;
the missing wait must be at the top of the run method, not at the bottom as you indicate.
Altogether, I'm afraid that fixing your code at this point is beyond the scope of one StackOverflow answer. My suggestion is to first familiarize yourself with the core concepts: the semantics of locks in Java, how they interoperate with wait and notify, and the precise semantics of those methods. An Oracle tutorial on the subject would be a nice start.
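To make those three points concrete, here is a minimal sketch of the standard guarded-wait idiom (my illustration, not the asker's code): both sides synchronize on the same shared lock, wait() is called in a loop guarded by a condition, and notifyAll() has a matching waiter.

class SharedFlag {
    private final Object lock = new Object();
    private boolean ready = false;

    void awaitReady() throws InterruptedException {
        synchronized (lock) {
            while (!ready) {      // always wait in a loop (spurious wakeups)
                lock.wait();
            }
        }
    }

    void setReady() {
        synchronized (lock) {
            ready = true;
            lock.notifyAll();     // wake all waiters; they recheck the condition
        }
    }
}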
I am trying to read a very big file using Java. That big file will have data like this, meaning each line will have a user id:
149905320
1165665384
66969324
886633368
1145241312
286585320
1008665352
And in that big file there will be around 30 million user ids. Now I am trying to read all the user ids from that big file, each exactly once. Meaning each user id should be selected only once: if I have 30 million user ids, then the multithreaded code should print all 30 million user ids, each printed exactly once.
Below is the code I have, which is multithreaded code running with 10 threads, but with the program below I am not able to make sure that each user id is selected only once.
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ReadingFile {
    public static void main(String[] args) {
        // create thread pool with given size
        ExecutorService service = Executors.newFixedThreadPool(10);
        for (int i = 0; i < 10; i++) {
            service.submit(new FileTask());
        }
    }
}

class FileTask implements Runnable {
    @Override
    public void run() {
        BufferedReader br = null;
        try {
            br = new BufferedReader(new FileReader("D:/abc.txt"));
            String line;
            while ((line = br.readLine()) != null) {
                System.out.println(line);
                // do things with line
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                br.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
Can anybody help me with this? What am I doing wrong? And what is the fastest way to do this?
You really can't improve on having one thread reading the file sequentially, assuming that you haven't done anything like stripe the file across multiple disks. With one thread, you do one seek and then one long sequential read; with multiple threads you're going to have the threads causing multiple seeks as each gains control of the disk head.
Edit: This is a way to parallelize the line processing while still using serial I/O to read the lines. It uses a BlockingQueue to communicate between threads; the FileTask adds lines to the queue, and the CPUTask reads them and processes them. This is a thread-safe data structure, so no need to add any synchronization to it. You're using put(E e) to add strings to the queue, so if the queue is full (it can hold up to 200 strings, as defined in the declaration in ReadingFile) the FileTask blocks until space frees up; likewise you're using take() to remove items from the queue, so the CPUTask will block until an item is available.
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ReadingFile {
    public static void main(String[] args) throws Exception {
        final int threadCount = 10;

        // BlockingQueue with a capacity of 200
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(200);

        // create thread pool with given size
        ExecutorService service = Executors.newFixedThreadPool(threadCount);
        for (int i = 0; i < (threadCount - 1); i++) {
            service.submit(new CPUTask(queue));
        }

        // wait until the FileTask completes
        service.submit(new FileTask(queue)).get();

        service.shutdownNow(); // interrupt CPUTasks

        // wait until the CPUTasks terminate
        service.awaitTermination(365, TimeUnit.DAYS);
    }
}

class FileTask implements Runnable {

    private final BlockingQueue<String> queue;

    public FileTask(BlockingQueue<String> queue) {
        this.queue = queue;
    }

    @Override
    public void run() {
        BufferedReader br = null;
        try {
            br = new BufferedReader(new FileReader("D:/abc.txt"));
            String line;
            while ((line = br.readLine()) != null) {
                // block if the queue is full
                queue.put(line);
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            try {
                br.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

class CPUTask implements Runnable {

    private final BlockingQueue<String> queue;

    public CPUTask(BlockingQueue<String> queue) {
        this.queue = queue;
    }

    @Override
    public void run() {
        String line;
        while (true) {
            try {
                // block if the queue is empty
                line = queue.take();
                // do things with line
            } catch (InterruptedException ex) {
                break; // FileTask has completed
            }
        }
        // poll() returns null if the queue is empty
        while ((line = queue.poll()) != null) {
            // do things with line
        }
    }
}
We are talking about roughly a 315 MB file with lines separated by newlines. I presume this easily fits into memory. It is implied that no particular order of the user ids has to be preserved. So I would recommend the following algorithm:
Get the file length.
Copy each tenth of the file into a byte buffer (a binary copy should be fast).
Start a thread for processing each of these buffers.
Each thread processes all lines in its area except the first and the last.
When done, each thread must return the first and last partial line in its data;
the "last" of each thread must be recombined with the "first" of the thread working on the next file block, because you may have cut through a line. These tokens must then be processed afterwards.
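A minimal sketch of this chunking idea (my illustration; it sidesteps the recombination step by letting each worker skip the partial first line of its range and read through the end of its last line instead, which gives the same coverage). Note that RandomAccessFile.readLine is unbuffered and slow; a real implementation would copy each chunk into a buffer first, as described above:

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ChunkedFileProcessor {

    // Each worker owns the lines that *start* at offsets in (start, end],
    // except the first worker, which also owns the line starting at offset 0.
    static long processChunk(String path, long start, long end) throws IOException {
        long count = 0;
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            raf.seek(start);
            if (start != 0) raf.readLine(); // skip the partial line cut by the boundary
            while (raf.getFilePointer() <= end && raf.readLine() != null) {
                count++; // replace with real per-line processing
            }
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        String path = "D:/abc.txt"; // input file from the question
        int chunks = 10;
        long length = new File(path).length();

        ExecutorService pool = Executors.newFixedThreadPool(chunks);
        List<Future<Long>> futures = new ArrayList<>();
        for (int i = 0; i < chunks; i++) {
            final long start = i * length / chunks;
            final long end = (i + 1) * length / chunks;
            futures.add(pool.submit(() -> processChunk(path, start, end)));
        }

        long total = 0;
        for (Future<Long> f : futures) total += f.get();
        pool.shutdown();
        System.out.println("Lines processed: " + total);
    }
}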
The Fork/Join API introduced in Java 7 is a great fit for this use case. Check out http://docs.oracle.com/javase/tutorial/essential/concurrency/forkjoin.html. If you search, you will find lots of examples out there.
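For a flavour of what that looks like here, a sketch of a RecursiveTask that splits the list of lines in half until the pieces are small, then does the per-line work at the leaves (the file name, threshold, and summing leaf work are all hypothetical; adapt the leaf to whatever per-id processing you need):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class LineSumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 10_000; // tune per workload
    private final List<String> lines;

    LineSumTask(List<String> lines) { this.lines = lines; }

    @Override
    protected Long compute() {
        if (lines.size() <= THRESHOLD) {
            long sum = 0;
            for (String line : lines) sum += Long.parseLong(line.trim());
            return sum;
        }
        int mid = lines.size() / 2;
        LineSumTask left = new LineSumTask(lines.subList(0, mid));
        LineSumTask right = new LineSumTask(lines.subList(mid, lines.size()));
        left.fork();                          // run the left half asynchronously
        return right.compute() + left.join(); // compute the right half, then join
    }

    public static void main(String[] args) throws Exception {
        List<String> lines = Files.readAllLines(Paths.get("D:/abc.txt"));
        long total = new ForkJoinPool().invoke(new LineSumTask(lines));
        System.out.println("Total: " + total);
    }
}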
I want to distribute work among threads: load parts of a log file, then distribute the work to process those parts.
In my simple example, I wrote 800,000 lines of data with a number in each line, and then I sum the numbers.
When I run this example, I get totals that are slightly off. Do you see anywhere in this threading code where threads might not complete properly, and hence would not total the numbers correctly?
public void process() {
    final String d = FILE;
    FileInputStream stream = null;
    try {
        stream = new FileInputStream(d);
        final BufferedReader reader = new BufferedReader(new InputStreamReader(stream));
        String data = "";
        do {
            final Stack<List<String>> allWork = new Stack<List<String>>();
            final Stack<ParserWorkerAtLineThread> threadPool = new Stack<ParserWorkerAtLineThread>();
            do {
                if (data != null) {
                    final List<String> currentWorkToDo = new ArrayList<String>();
                    do {
                        data = reader.readLine();
                        if (data != null) {
                            currentWorkToDo.add(data);
                        } // End of the if //
                    } while (data != null && (currentWorkToDo.size() < thresholdLinesToAdd));
                    // Hand out future work
                    allWork.push(currentWorkToDo);
                } // End of the if //
            } while (data != null && (allWork.size() < numberOfThreadsAllowedInPool));

            // Process the lines from the work to do //
            // Hand out the work
            for (final List<String> theCurrentTaskWork : allWork) {
                final ParserWorkerAtLineThread t = new ParserWorkerAtLineThread();
                t.data = theCurrentTaskWork;
                threadPool.push(t);
            }
            for (final Thread workerAboutToDoWork : threadPool) {
                workerAboutToDoWork.start();
                System.out.println(" -> Starting my work... My name is : " + workerAboutToDoWork.getName());
            } // End of the for //

            // Waiting on threads to finish //
            System.out.println("Waiting for all work to complete ... ");
            for (final Thread waiting : threadPool) {
                waiting.join();
            } // End of the for //
            System.out.println("Done waiting ... ");
        } while (data != null); // End of outer parse file loop //
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        if (stream != null) {
            try {
                stream.close();
            } catch (final IOException e) {
                e.printStackTrace();
            }
        } // End of the stream //
    } // End of the try - catch finally //
}
While you're at it, why not use a bounded BlockingQueue (an ArrayBlockingQueue) of size thresholdLinesToAdd? Your producer code would read the lines and call put on that queue, which blocks until space is available.
As Chris mentioned before, use Executors.newFixedThreadPool() and submit your work items to it. Your consumers would call take() to block until an element is available.
This is not map/reduce. If you wanted map/reduce, you would need another queue in the mix to which you would publish keys. As an example, if you were to count the number of INFO and DEBUG occurrences in your logs, your mapper would queue the extracted word every time it encounters one, and the reducer would dequeue the mapper's output and increment the counter for each word. The result of the reducer would be the word counts for DEBUG and INFO.
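A minimal, self-contained sketch of that mapper/reducer wiring (my illustration; the sample lines and the poison-pill end marker are made up for the demo):

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class LogLevelCount {
    private static final String POISON = "__EOF__"; // marks end of input

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> words = new ArrayBlockingQueue<>(1000);

        // Mapper: extracts INFO/DEBUG tokens and publishes them to the queue.
        Thread mapper = new Thread(() -> {
            String[] sampleLines = { "INFO start", "DEBUG x", "INFO done" };
            try {
                for (String line : sampleLines) {
                    for (String token : line.split("\\s+")) {
                        if (token.equals("INFO") || token.equals("DEBUG")) {
                            words.put(token); // blocks if the queue is full
                        }
                    }
                }
                words.put(POISON);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Reducer: dequeues tokens and increments a counter per word.
        Thread reducer = new Thread(() -> {
            Map<String, Integer> counts = new HashMap<>();
            try {
                String token;
                while (!(token = words.take()).equals(POISON)) {
                    counts.merge(token, 1, Integer::sum);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            System.out.println(counts);
        });

        mapper.start();
        reducer.start();
        mapper.join();
        reducer.join();
    }
}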
I am writing a video application in Java that executes ffmpeg and captures its output on standard output. I decided to use Apache Commons Exec instead of Java's Runtime because it seems better. However, I am having a difficult time capturing all of the output.
I thought pipes would be the way to go, because they are a standard means of inter-process communication. However, my setup using PipedInputStream and PipedOutputStream is wrong. It seems to work, but only for the first 1024 bytes of the stream, which curiously happens to be the value of PipedInputStream.PIPE_SIZE.
I have no love affair with pipes, but I want to avoid disk I/O if possible, because of the speed and volume of the data (a 1m 20s video at 512x384 resolution produces 690 MB of piped data).
Any thoughts on the best way to handle large amounts of data coming from a pipe? The code for my two classes is below. (Yes, the sleep is bad. Thoughts on that? wait() and notifyAll()?)
WriteFrames.java
public class WriteFrames {
    public static void main(String[] args) {
        String commandName = "ffmpeg";
        CommandLine commandLine = new CommandLine(commandName);
        File filename = new File(args[0]);
        String[] options = new String[] {
                "-i",
                filename.getAbsolutePath(),
                "-an",
                "-f",
                "yuv4mpegpipe",
                "-"};
        for (String s : options) {
            commandLine.addArgument(s);
        }

        PipedOutputStream output = new PipedOutputStream();
        PumpStreamHandler streamHandler = new PumpStreamHandler(output, System.err);
        DefaultExecutor executor = new DefaultExecutor();

        try {
            DataInputStream is = new DataInputStream(new PipedInputStream(output));
            YUV4MPEGPipeParser p = new YUV4MPEGPipeParser(is);
            p.start();
            executor.setStreamHandler(streamHandler);
            executor.execute(commandLine);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
YUV4MPEGPipeParser.java
public class YUV4MPEGPipeParser extends Thread {
    private InputStream is;
    int width, height;

    public YUV4MPEGPipeParser(InputStream is) {
        this.is = is;
    }

    public void run() {
        try {
            while (is.available() == 0) {
                Thread.sleep(100);
            }
            while (is.available() != 0) {
                // do stuff.... like write out YUV frames
            }
        } catch (IOException e) {
            e.printStackTrace();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}
The problem is in the run method of the YUV4MPEGPipeParser class. There are two successive loops. The second loop terminates immediately if there is no data currently available on the stream (e.g. all input so far has been processed by the parser, and ffmpeg or the stream pump was not fast enough to serve new data for it -> available() == 0 -> loop terminates -> pump thread finishes).
Just get rid of these two loops and the sleep, and perform a simple blocking read() instead of checking whether any data is available for processing. There is also probably no need for wait()/notify() or even sleep(), because the parser code already runs on a separate thread.
You can rewrite the run() method like this:
public class YUV4MPEGPipeParser extends Thread {
    ...

    // optimal size of buffer for reading from pipe stream :-)
    private static final int BUFSIZE = PipedInputStream.PIPE_SIZE;

    public void run() {
        try {
            byte[] buffer = new byte[BUFSIZE];
            int len = 0;
            while ((len = is.read(buffer, 0, BUFSIZE)) != -1) {
                // we have valid data available
                // in the first 'len' bytes of the 'buffer' array.
                // do stuff.... like write out YUV frames
            }
        } catch ...
    }
}