I am implementing a class that should receive a large text file. I want to split it in chunks and each chunk to be hold by a different thread that will count the frequency of each character in this chunk. I expect with starting more threads to get better performance but it turns out performance is getting poorer. Here`s my code:
public class Main {
public static void main(String[] args)
throws IOException, InterruptedException, ExecutionException, ParseException
{
// save the current run's start time
long startTime = System.currentTimeMillis();
// create options
Options options = new Options();
options.addOption("t", true, "number of threads to be start");
// variables to hold options
int numberOfThreads = 1;
// parse options
CommandLineParser parser = new DefaultParser();
CommandLine cmd;
cmd = parser.parse(options, args);
String threadsNumber = cmd.getOptionValue("t");
numberOfThreads = Integer.parseInt(threadsNumber);
// read file
RandomAccessFile raf = new RandomAccessFile(args[0], "r");
MappedByteBuffer mbb
= raf.getChannel().map(FileChannel.MapMode.READ_ONLY, 0, raf.length());
ExecutorService pool = Executors.newFixedThreadPool(numberOfThreads);
Set<Future<int[]>> set = new HashSet<Future<int[]>>();
long chunkSize = raf.length() / numberOfThreads;
byte[] buffer = new byte[(int) chunkSize];
while(mbb.hasRemaining())
{
int remaining = buffer.length;
if(mbb.remaining() < remaining)
{
remaining = mbb.remaining();
}
mbb.get(buffer, 0, remaining);
String content = new String(buffer, "ISO-8859-1");
#SuppressWarnings("unchecked")
Callable<int[]> callable = new FrequenciesCounter(content);
Future<int[]> future = pool.submit(callable);
set.add(future);
}
raf.close();
// let`s assume we will use extended ASCII characters only
int alphabet = 256;
// hold how many times each character is contained in the input file
int[] frequencies = new int[alphabet];
// sum the frequencies from each thread
for(Future<int[]> future: set)
{
for(int i = 0; i < alphabet; i++)
{
frequencies[i] += future.get()[i];
}
}
}
}
//help class for multithreaded frequencies` counting
class FrequenciesCounter implements Callable
{
private int[] frequencies = new int[256];
private char[] content;
public FrequenciesCounter(String input)
{
content = input.toCharArray();
}
public int[] call()
{
System.out.println("Thread " + Thread.currentThread().getName() + "start");
for(int i = 0; i < content.length; i++)
{
frequencies[(int)content[i]]++;
}
System.out.println("Thread " + Thread.currentThread().getName() + "finished");
return frequencies;
}
}
As suggested in comments, you will (usually) do not get better performance when reading from multiple threads. Rather you should process the chunks you have read on multiple threads. Usually processing does some blocking, I/O operations (saving to another file? saving to database? HTTP call?) and your performance will get better if you process on multiple threads.
For processing, you may have ExecutorService (with sensible number of threads). use java.util.concurrent.Executors to obtain instance of java.util.concurrent.ExecutorService
Having ExecutorService instance, you may submit your chunks for processing. Submitting chunks would not block. ExecutorService will start to process each chunk at separate thread (details depends of configuration of ExecutorService ). You may submit instances of Runnable or Callable.
Finally, after you submit all items you should call awaitTermination at your ExecutorService. It will wait until processing of all submited items is finished. After awaitTermination returns you should call shutdownNow() to abort processing (otherwise it may hang indefinitely, procesing some rogue task).
Your program is almost certainly limited by the speed of reading from disk. Using multiple threads does not help with this since the limit is a hardware limit on how fast the information can be transferred from disk.
In addition, the use of both RandomAccessFile and a subsequent buffer likely results in a small slowdown, since you are moving the data in memory after reading it in but before processing, rather than just processing it in place. You would be better off not using an intermediate buffer.
You might get a slight speedup by reading from the file directly into the final buffers and dispatching those buffers to be processed by threads as they are filled, rather than waiting for the entire file to be read before processing. However, most of the time would still be used by the disk read, so any speedup would likely be minimal.
Related
I have a list of files and a list of analyzers that analyze those files. Number of files can be large (200,000) and number of analyzers (1000). So total number of operations can be really large (200,000,000). Now, I need to apply multithreading to speed things up. I followed this approach:
ExecutorService executor = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
for (File file : listOfFiles) {
for (Analyzer analyzer : listOfAnalyzers){
executor.execute(() -> {
boolean exists = file.exists();
if(exists){
analyzer.analyze(file);
}
});
}
}
executor.shutdown();
executor.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
But the problem of this approach is that it's taking too much from memory and I guess there is better way to do it. I'm still beginner at java and multithreading.
Where are 200M tasks going to reside? Not in memory, I hope, unless you plan to implement your solution in a distributed fashion. In meantime, you need to instantiate an ExecutorService that does not accumulate a massive queue. Use with the "caller runs policy" (see here) when you create the service. If you try to put another task in the queue when it's already full, you'll end up executing it yourself, which is probably what you want.
OTOH, now that I look at your question more conscientiously, why not analyze a single file concurrently? Then the queue is never larger than the number of analyzers. That's what I'd do, frankly, since I'd like a readable log that has a message for each file as I load it, in the correct order.
I apologize for not being more helpful:
analysts.stream().map(analyst -> executor.submit(() -> analyst.analyze(file))).map(Future::get);
Basically, create bunch of futures for a single file, then wait for all of them before you move on.
One idea is to employ fork/join algorithm and group items (files) into batches in order to process them individually.
My suggestion is the following:
Firstly, filter out all files that do not exist - they occupy resources unnecessarily.
The following pseudo-code demonstrates the algorithm that might help you out:
public static class CustomRecursiveTask extends RecursiveTask<Integer {
private final Analyzer[] analyzers;
private final int threshold;
private final File[] files;
private final int start;
private final int end;
public CustomRecursiveTask(Analyzer[] analyzers,
final int threshold,
File[] files,
int start,
int end) {
this.analyzers = analyzers;
this.threshold = threshold;
this.files = files;
this.start = start;
this.end = end;
}
#Override
protected Integer compute() {
final int filesProcessed = end - start;
if (filesProcessed < threshold) {
return processSequentially();
} else {
final int middle = (start + end) / 2;
final int analyzersCount = analyzers.length;
final ForkJoinTask<Integer> left =
new CustomRecursiveTask(analyzers, threshold, files, start, middle);
final ForkJoinTask<Integer> right =
new CustomRecursiveTask(analyzers, threshold, files, middle + 1, end);
left.fork();
right.fork();
return left.join() + right.join();
}
}
private Integer processSequentially() {
for (int i = start; i < end; i++) {
File file = files[i];
for(Analyzer analyzer : analyzers) { analyzer.analyze(file) };
}
return 1;
}
}
And the usage looks the following way:
public static void main(String[] args) {
final Analyzer[] analyzers = new Analyzer[]{};
final File[] files = new File[] {};
final int threshold = files.length / 5;
ForkJoinPool.commonPool().execute(
new CustomRecursiveTask(
analyzers,
threshold,
files,
0,
files.length
)
);
}
Notice that depending on constraints you can manipulate the task's constructor arguments so that the algorithm will adjust to the amount of files.
You could specify different thresholds let's say depending on the amount of files.
final int threshold;
if(files.length > 100_000) {
threshold = files.length / 4;
} else {
threshold = files.length / 8;
}
You could also specify the amount of worker threads in ForkJoinPool depending on the input amount.
Measure, adjust, modify, you will solve the problem eventually.
Hope that helps.
UPDATE:
If the result analysis is of no interest, you could replace the RecursiveTask with RecursiveAction. The pseudo-code adds auto-boxing overhead in between.
I have a client-server application where the server sends some binary data to the client and the client has to deserialize objects from that byte stream according to a custom binary format. The data is sent via an HTTPS connection and the client uses HttpsURLConnection.getInputStream() to read it.
I implemented a DataDeserializer that takes an InputStream and deserializes it completely. It works in a way that it performs multiple inputStream.read(buffer) calls with small buffers (usually less than 100 bytes). On my way of achieving better overall performance I also tried different implementations here. One change did improve this class' performance significantly (I'm using a ByteBuffer now to read primitive types rather than doing it manually with byte shifting), but in combination with the network stream no differences show up. See the section below for more details.
Quick summary of my issue
Deserializing from the network stream takes way too long even though I proved that the network and the deserializer themselves are fast. Are there any common performance tricks that I could try? I am already wrapping the network stream with a BufferedInputStream. Also, I tried double buffering with some success (see code below). Any solution to achieve better performance is welcome.
The performance test scenario
In my test scenario server and client are located on the same machine and the server sends ~174 MB of data. The code snippets can be found at the end of this post. All numbers you see here are averages of 5 test runs.
First I wanted to know, how fast that InputStream of the HttpsURLConnection can be read. Wrapped into a BufferedInputStream, it took 26.250s to write the entire data into a ByteArrayOutputStream.1
Then I tested the performance of my deserializer passing it all that 174 MB as a ByteArrayInputStream. Before I improved the deserializer's implementation, it took 38.151s. After the improvement it took only 23.466s.2
So this is going to be it, I thought... but no.
What I actually want to do, somehow, is passing connection.getInputStream() to the deserializer. And here comes the strange thing: Before the deserializer improvement deserializing took 61.413s and after improving it was 60.100s!3
How can that happen? Almost no improvement here despite the deserializer improved significantly. Also, unrelated to that improvement, I was surprised that this takes longer than the separate performances summed up (60.100 > 26.250 + 23.466). Why? Don't get me wrong, I didn't expect this to be the best solution, but I didn't expect it to be that bad either.
So, three things to notice:
The overall speed is bound by the network which takes at least 26.250s. Maybe there are some http-settings that I could tweak or I could further optimize the server, but for now this is likely not what I should focus on.
My deserializer implementation is very likely still not perfect, but on its own it is faster than the network, so I don't think there is need to further improve it.
Based on 1. and 2. I'm assuming that it should be somehow possible to do the entire job in a combined way (reading from the network + deserializing) which should take not much more than 26.250s. Any suggestions on how to achieve this are welcome.
I was looking for some kind of double buffer allowing two threads to read from it and write to it in parallel.
Is there something like that in standard Java? Preferably some class inheriting from InputStream that allows to write to it in parallel? If there is something similar, but not inheriting from InputStream, I may be able to change my DataDeserializer to consume from that one as well.
As I haven't found any such DoubleBufferInputStream, I implemented it myself.
The code is quite long and likely not perfect and I don't want to bother you to do a code review for me. It has two 16kB buffers. Using it I was able to improve the overall performance to 39.885s.4
That is much better than 60.100s but still much worse than 26.250s. Choosing different buffer sizes didn't change much. So, I hope someone can lead me to some good double buffer implementation.
The test code
1 (26.250s)
InputStream inputStream = new BufferedInputStream(connection.getInputStream());
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
byte[] buffer = new byte[16 * 1024];
int count = 0;
long start = System.nanoTime();
while ((count = inputStream.read(buffer)) >= 0) {
outputStream .write(buffer, 0, count);
}
long end = System.nanoTime();
2 (23.466s)
InputStream inputStream = new ByteArrayInputStream(entire174MBbuffer);
DataDeserializer deserializer = new DataDeserializer(inputStream);
long start = System.nanoTime();
deserializer.Deserialize();
long end = System.nanoTime();
3 (60.100s)
InputStream inputStream = new BufferedInputStream(connection.getInputStream());
DataDeserializer deserializer = new DataDeserializer(inputStream);
long start = System.nanoTime();
deserializer.Deserialize();
long end = System.nanoTime();
4 (39.885s)
MyDoubleBufferInputStream doubleBufferInputStream = new MyDoubleBufferInputStream();
new Thread(new Runnable() {
#Override
public void run() {
try (InputStream inputStream = new BufferedInputStream(connection.getInputStream())) {
byte[] buffer = new byte[16 * 1024];
int count = 0;
while ((count = inputStream.read(buffer)) >= 0) {
doubleBufferInputStream.write(buffer, 0, count);
}
} catch (IOException e) {
} finally {
doubleBufferInputStream.closeWriting(); // read() may return -1 now
}
}
}).start();
DataDeserializer deserializer = new DataDeserializer(doubleBufferInputStream);
long start = System.nanoTime();
deserializer.deserialize();
long end = System.nanoTime();
Update
As requested, here is the core of my deserializer. I think the most important method is prepareForRead() which performs the actual reading of the stream.
class DataDeserializer {
private InputStream _stream;
private ByteBuffer _buffer;
public DataDeserializer(InputStream stream) {
_stream = stream;
_buffer = ByteBuffer.allocate(256 * 1024);
_buffer.order(ByteOrder.LITTLE_ENDIAN);
_buffer.flip();
}
private int readInt() throws IOException {
prepareForRead(4);
return _buffer.getInt();
}
private long readLong() throws IOException {
prepareForRead(8);
return _buffer.getLong();
}
private CustomObject readCustomObject() throws IOException {
prepareForRead(/*size of CustomObject*/);
int customMember1 = _buffer.getInt();
long customMember2 = _buffer.getLong();
// ...
return new CustomObject(customMember1, customMember2, ...);
}
// several other built-in and custom object read methods
private void prepareForRead(int count) throws IOException {
while (_buffer.remaining() < count) {
if (_buffer.capacity() - _buffer.limit() < count) {
_buffer.compact();
_buffer.flip();
}
int read = _stream.read(_buffer.array(), _buffer.limit(), _buffer.capacity() - _buffer.limit());
if (read < 0)
throw new EOFException("Unexpected end of stream.");
_buffer.limit(_buffer.limit() + read);
}
}
public HugeCustomObject Deserialize() throws IOException {
while (...) {
// call several of the above methods
}
return new HugeCustomObject(/* deserialized members */);
}
}
Update 2
I modified my code snippet #1 a little bit to see more precisely where time is being spent:
InputStream inputStream = new BufferedInputStream(connection.getInputStream());
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
byte[] buffer = new byte[16 * 1024];
long read = 0;
long write = 0;
while (true) {
long t1 = System.nanoTime();
int count = istream.read(buffer);
long t2 = System.nanoTime();
read += t2 - t1;
if (count < 0)
break;
t1 = System.nanoTime();
ostream.write(buffer, 0, count);
t2 = System.nanoTime();
write += t2 - t1;
}
System.out.println(read + " " + write);
This tells me that reading from the network stream takes 25.756s while writing to the ByteArrayOutputStream only takes 0.817s. This makes sense as these two numbers almost perfectly sum up to the previously measured 26.250s (plus some additional measuring overhead).
In the very same way I modified code snippet #4:
MyDoubleBufferInputStream doubleBufferInputStream = new MyDoubleBufferInputStream();
new Thread(new Runnable() {
#Override
public void run() {
try (InputStream inputStream = new BufferedInputStream(httpChannelOutputStream.getConnection().getInputStream(), 256 * 1024)) {
byte[] buffer = new byte[16 * 1024];
long read = 0;
long write = 0;
while (true) {
long t1 = System.nanoTime();
int count = inputStream.read(buffer);
long t2 = System.nanoTime();
read += t2 - t1;
if (count < 0)
break;
t1 = System.nanoTime();
doubleBufferInputStream.write(buffer, 0, count);
t2 = System.nanoTime();
write += t2 - t1;
}
System.out.println(read + " " + write);
} catch (IOException e) {
} finally {
doubleBufferInputStream.closeWriting();
}
}
}).start();
DataDeserializer deserializer = new DataDeserializer(doubleBufferInputStream);
deserializer.deserialize();
Now I would expect that the measured reading time is exactly the same as in the previous example. But instead, the read variable holds a value of 39.294s (How is that possible?? It's the exact same code being measured as in the previous example with 25.756s!)* while writing to my double buffer only takes 0.096s. Again, these numbers almost perfectly sum up to the measured time of code snippet #4.
Additionally, I profiled this very same code using Java VisualVM. That tells me that 40s were spent in this thread's run() method and 100% of these 40s are CPU time. On the other hand, it also spends 40s inside of the deserializer, but here only 26s are CPU time and 14s are spent waiting. This perfectly matches the time of reading from network into ByteBufferOutputStream. So I guess I have to improve my double buffer's "buffer-switching-algorithm".
*) Is there any explanation for this strange observation? I could only imagine that this way of measuring is very inaccurate. However, the read- and write-times of the latest measurements perfectly sum up to the original measurement, so it cannot be that inaccurate... Could someone please shed some light on this?
I was not able to find these read and write performances in the profiler... I will try to find some settings that allow me to observe the profiling results for these two methods.
Apparently, my "mistake" was to use a 32-bit JVM (jre1.8.0_172 being precise).
Running the very same code snippets on a 64-bit version JVM, and tadaaa... it is fast and makes all sense there.
In particular see these new numbers for the corresponding code snippets:
snippet #1: 4.667s (vs. 26.250s)
snippet #2: 11.568s (vs. 23.466s)
snippet #3: 17.185s (vs. 60.100s)
snippet #4: 12.336s (vs. 39.885s)
So apparently, the answers given to Does Java 64 bit perform better than the 32-bit version? are simply not true anymore. Or, there is a serious bug in this particular 32-bit JRE version. I didn't test any others yet.
As you can see, #4 is only slightly slower than #2 which perfectly matches my original assumption that
Based on 1. and 2. I'm assuming that it should be somehow possible to
do the entire job in a combined way (reading from the network +
deserializing) which should take not much more than 26.250s.
Also the very weird results of my profiling approach described in Update 2 of my question do not occur anymore. I didn't repeat every single test in 64 bit yet, but all profiling results that I did do are plausible now, i.e. the same code takes the same time no matter in which code snippet. So maybe it's really a bug, or does anybody have a reasonable explanation?
The most certain way to improve any of these is to change
connection.getInputStream()
to
new BufferedInputStream(connection.getInputStream())
If that doesn't help, the input stream isn't your problem.
I implemented a wordcount program with Java. Basically, the program takes a large file (in my tests, I used a 10 gb data file that contained numbers only), and counts the number of times each 'word' appears - in this case, a number (23723 for example might appear 243 times in the file).
Below is my implementation. I seek to improve it, with mainly performance in mind, but a few other things as well, and I am looking for some guidance. Here are a few of the issues I wish to correct:
Currently, the program is threaded and works properly. However, what I do is pass a chunk of memory (500MB/NUM_THREADS) to each thread, and each thread proceeds to wordcount. The problem here is that I have the main thread wait for ALL the threads to complete before passing more data to each thread. It isn't too much of a problem, but there is a period of time where a few threads will wait and do nothing for a while. I believe some sort of worker pool or executor service could solve this problem (I have not learned the syntax for this yet).
The program will only work for a file that contains integers. That's a problem. I struggled with this a lot, as I didn't know how to iterate through the data without creating loads of unused variables (using a String or even StringBuilder had awful performance). Currently, I use the fact that I know the input is an integer, and just store the temporary variables as an int, so no memory problems there. I want to be able to use some sort of delimiter, whether that delimiter be a space, or several characters.
I am using a global ConcurrentHashMap to story key value pairs. For example, if a thread finds a number "24624", it searches for that number in the map. If it exists, it will increase the value of that key by one. The value of the keys at the end represent the number of occurrences of that key. So is this the proper design? Would I gain in performance by giving each thread it's own hashmap, and then merging them all at the end?
Is there any other way of seeking through a file with an offset without using the class RandomAccessMemory? This class will only read into a byte array, which I then have to convert. I haven't timed this conversion, but maybe it could be faster to use something else.
I am open to other possibilities as well, this is just what comes to mind.
Note: Splitting the file is not an option I want to explore, as I might be deploying this on a server in which I should not be creating my own files, but if it would really be a performance boost, I might listen.
Other Note: I am new to java threading, as well as new to StackOverflow. Be gentle.
public class BigCount2 {
public static void main(String[] args) throws IOException, InterruptedException {
int num, counter;
long i, j;
String delimiterString = " ";
ArrayList<Character> delim = new ArrayList<Character>();
for (char c : delimiterString.toCharArray()) {
delim.add(c);
}
int counter2 = 0;
num = Integer.parseInt(args[0]);
int bytesToRead = 1024 * 1024 * 1024 / 2; //500 MB, size of loop
int remainder = bytesToRead % num;
int k = 0;
bytesToRead = bytesToRead - remainder;
int byr = bytesToRead / num;
String filepath = "C:/Users/Daniel/Desktop/int-dataset-10g.dat";
RandomAccessFile file = new RandomAccessFile(filepath, "r");
Thread[] t = new Thread [num];//array of threads
ConcurrentMap<Integer, Integer> wordCountMap = new ConcurrentHashMap<Integer, Integer>(25000);
byte [] byteArray = new byte [byr]; //allocates 500mb to a 2D byte array
char[] newbyte;
for (i = 0; i < file.length(); i += bytesToRead) {
counter = 0;
for (j = 0; j < bytesToRead; j += byr) {
file.seek(i + j);
file.read(byteArray, 0, byr);
newbyte = new String(byteArray).toCharArray();
t[counter] = new Thread(
new BigCountThread2(counter,
newbyte,
delim,
wordCountMap));//giving each thread t[i] different file fileReader[i]
t[counter].start();
counter++;
newbyte = null;
}
for (k = 0; k < num; k++){
t[k].join(); //main thread continues after ALL threads have finished.
}
counter2++;
System.gc();
}
file.close();
System.exit(0);
}
}
class BigCountThread2 implements Runnable {
private final ConcurrentMap<Integer, Integer> wordCountMap;
char [] newbyte;
private ArrayList<Character> delim;
private int threadId; //use for later
BigCountThread2(int tid,
char[] newbyte,
ArrayList<Character> delim,
ConcurrentMap<Integer, Integer> wordCountMap) {
this.delim = delim;
threadId = tid;
this.wordCountMap = wordCountMap;
this.newbyte = newbyte;
}
public void run() {
int intCheck = 0;
int counter = 0; int i = 0; Integer check; int j =0; int temp = 0; int intbuilder = 0;
for (i = 0; i < newbyte.length; i++) {
intCheck = Character.getNumericValue(newbyte[i]);
if (newbyte[i] == ' ' || intCheck == -1) { //once a delimiter is found, the current tempArray needs to be added to the MAP
check = wordCountMap.putIfAbsent(intbuilder, 1);
if (check != null) { //if returns null, then it is the first instance
wordCountMap.put(intbuilder, wordCountMap.get(intbuilder) + 1);
}
intbuilder = 0;
}
else {
intbuilder = (intbuilder * 10) + intCheck;
counter++;
}
}
}
}
Some thoughts on a little of most ..
.. I believe some sort of worker pool or executor service could solve this problem (I have not learned the syntax for this yet).
If all the threads take about the same time to process the same amount of data, then there really isn't that much of a "problem" here.
However, one nice thing about a Thread Pool is it allows one to rather trivially adjust some basic parameters such as number of concurrent workers. Furthermore, using an executor service and Futures can provide an additional level of abstraction; in this case it could be especially handy if each thread returned a map as the result.
The program will only work for a file that contains integers. That's a problem. I struggled with this a lot, as I didn't know how to iterate through the data without creating loads of unused variables (using a String or even StringBuilder had awful performance) ..
This sounds like an implementation issue. While I would first try a StreamTokenizer (because it's already written), if doing it manually, I would check out the source - a good bit of that can be omitted when simplifying the notion of a "token". (It uses a temporary array to build the token.)
I am using a global ConcurrentHashMap to story key value pairs. .. So is this the proper design? Would I gain in performance by giving each thread it's own hashmap, and then merging them all at the end?
It would reduce locking and may increase performance to use a separate map per thread and merge strategy. Furthermore, the current implementation is broken as wordCountMap.put(intbuilder, wordCountMap.get(intbuilder) + 1) is not atomic and thus the operation might under count. I would use a separate map simply because reducing mutable shared state makes a threaded program much easier to reason about.
Is there any other way of seeking through a file with an offset without using the class RandomAccessMemory? This class will only read into a byte array, which I then have to convert. I haven't timed this conversion, but maybe it could be faster to use something else.
Consider using a FileReader (and BufferedReader) per thread on the same file. This will avoid having to first copy the file into the array and slice it out for individual threads which, while the same amount of total reading, avoids having to soak up so much memory. The reading done is actually not random access, but merely sequential (with a "skip") starting from different offsets - each thread still works on a mutually exclusive range.
Also, the original code with the slicing is broken if an integer value was "cut" in half as each of the threads would read half the word. One work-about is have each thread skip the first word if it was a continuation from the previous block (i.e. scan one byte sooner) and then read-past the end of it's range as required to complete the last word.
I'm still in the process of wrapping my brain around how concurrency works in Java. I understand that (if you're subscribing to the OO Java 5 concurrency model) you implement a Task or Callable with a run() or call() method (respectively), and it behooves you to parallelize as much of that implemented method as possible.
But I'm still not understanding something inherent about concurrent programming in Java:
How is a Task's run() method assigned the right amount of concurrent work to be performed?
As a concrete example, what if I have an I/O-bound readMobyDick() method that reads the entire contents of Herman Melville's Moby Dick into memory from a file on the local system. And let's just say I want this readMobyDick() method to be concurrent and handled by 3 threads, where:
Thread #1 reads the first 1/3rd of the book into memory
Thread #2 reads the second 1/3rd of the book into memory
Thread #3 reads the last 1/3rd of the book into memory
Do I need to chunk Moby Dick up into three files and pass them each to their own task, or do I I just call readMobyDick() from inside the implemented run() method and (somehow) the Executor knows how to break the work up amongst the threads.
I am a very visual learner, so any code examples of the right way to approach this are greatly appreciated! Thanks!
You have probably chosen by accident the absolute worst example of parallel activities!
Reading in parallel from a single mechanical disk is actually slower than reading with a single thread, because you are in fact bouncing the mechanical head to different sections of the disk as each thread gets its turn to run. This is best left as a single threaded activity.
Let's take another example, which is similar to yours but can actually offer some benefit: assume I want to search for the occurrences of a certain word in a huge list of words (this list could even have come from a disk file, but like I said, read by a single thread). Assume I can use 3 threads like in your example, each searching on 1/3rd of the huge word list and keeping a local counter of how many times the searched word appeared.
In this case you'd want to partition the list in 3 parts, pass each part to a different object whose type implements Runnable and have the search implemented in the run method.
The runtime itself has no idea how to do the partitioning or anything like that, you have to specify it yourself. There are many other partitioning strategies, each with its own strengths and weaknesses, but we can stick to the static partitioning for now.
Let's see some code:
class SearchTask implements Runnable {
private int localCounter = 0;
private int start; // start index of search
private int end;
private List<String> words;
private String token;
public SearchTask(int start, int end, List<String> words, String token) {
this.start = start;
this.end = end;
this.words = words;
this.token = token;
}
public void run() {
for(int i = start; i < end; i++) {
if(words.get(i).equals(token)) localCounter++;
}
}
public int getCounter() { return localCounter; }
}
// meanwhile in main :)
List<String> words = new ArrayList<String>();
// populate words
// let's assume you have 30000 words
// create tasks
SearchTask task1 = new SearchTask(0, 10000, words, "John");
SearchTask task2 = new SearchTask(10000, 20000, words, "John");
SearchTask task3 = new SearchTask(20000, 30000, words, "John");
// create threads for each task
Thread t1 = new Thread(task1);
Thread t2 = new Thread(task2);
Thread t3 = new Thread(task3);
// start threads
t1.start();
t2.start();
t3.start();
// wait for threads to finish
t1.join();
t2.join();
t3.join();
// collect results
int counter = 0;
counter += task1.getCounter();
counter += task2.getCounter();
counter += task3.getCounter();
This should work nicely. Note that in practical cases you would build a more generic partitioning scheme. You could alternatively use an ExecutorService and implement Callable instead of Runnable if you wish to return a result.
So an alternative example using more advanced constructs:
class SearchTask implements Callable<Integer> {
private int localCounter = 0;
private int start; // start index of search
private int end;
private List<String> words;
private String token;
public SearchTask(int start, int end, List<String> words, String token) {
this.start = start;
this.end = end;
this.words = words;
this.token = token;
}
public Integer call() {
for(int i = start; i < end; i++) {
if(words.get(i).equals(token)) localCounter++;
}
return localCounter;
}
}
// meanwhile in main :)
List<String> words = new ArrayList<String>();
// populate words
// let's assume you have 30000 words
// create tasks
List<Callable> tasks = new ArrayList<Callable>();
tasks.add(new SearchTask(0, 10000, words, "John"));
tasks.add(new SearchTask(10000, 20000, words, "John"));
tasks.add(new SearchTask(20000, 30000, words, "John"));
// create thread pool and start tasks
ExecutorService exec = Executors.newFixedThreadPool(3);
List<Future> results = exec.invokeAll(tasks);
// wait for tasks to finish and collect results
int counter = 0;
for(Future f: results) {
counter += f.get();
}
You picked a bad example, as Tudor was so kind to point out. Spinning disk hardware is subject to physical constraints of moving platters and heads, and the most efficient read implementation is to read each block in order, which reduces the need to move the head or wait for the disk to align.
That said, some operating systems don't always store things continuously on disks, and for those who remember, defragmentation could provide a disk performance boost if you OS / filesystem didn't do the job for you.
As you mentioned wanting a program that would benefit, let me suggest a simple one, matrix addition.
Assuming you made one thread per core, you can trivially divide any two matrices to be added into N (one for each thread) rows. Matrix addition (if you recall) works as such:
A + B = C
or
[ a11, a12, a13 ] [ b11, b12, b13] = [ (a11+b11), (a12+b12), (a13+c13) ]
[ a21, a22, a23 ] + [ b21, b22, b23] = [ (a21+b21), (a22+b22), (a23+c23) ]
[ a31, a32, a33 ] [ b31, b32, b33] = [ (a31+b31), (a32+b32), (a33+c33) ]
So to distribute this across N threads, we simply need to take the row count and modulus divide by the number of threads to get the "thread id" it will be added with.
matrix with 20 rows across 3 threads
row % 3 == 0 (for rows 0, 3, 6, 9, 12, 15, and 18)
row % 3 == 1 (for rows 1, 4, 7, 10, 13, 16, and 19)
row % 3 == 2 (for rows 2, 5, 8, 11, 14, and 17)
// row 20 doesn't exist, because we number rows from 0
Now each thread "knows" which rows it should handle, and the results "per row" can be computed trivially because the results do not cross into other thread's domain of computation.
All that is needed now is a "result" data structure which tracks when the values have been computed, and when last value is set, then the computation is complete. In this "fake" example of a matrix addition result with two threads, computing the answer with two threads takes approximately half the time.
// the following assumes that threads don't get rescheduled to different cores for
// illustrative purposes only. Real Threads are scheduled across cores due to
// availability and attempts to prevent unnecessary core migration of a running thread.
[ done, done, done ] // filled in at about the same time as row 2 (runs on core 3)
[ done, done, done ] // filled in at about the same time as row 1 (runs on core 1)
[ done, done, .... ] // filled in at about the same time as row 4 (runs on core 3)
[ done, ...., .... ] // filled in at about the same time as row 3 (runs on core 1)
More complex problems can be solved by multithreading, and different problems are solved with different techniques. I purposefully picked one of the simplest examples.
you implement a Task or Callable with a run() or call() method
(respectively), and it behooves you to parallelize as much of that
implemented method as possible.
A Task represents a discrete unit of work
Loading a file into memory is a discrete unit of work and can therefore this activity can be delegated to a background thread. I.e. a background thread runs this task of loading the file.
It is a discrete unit of work since it has no other dependencies needed in order to do its job (load the file) and has discrete boundaries.
What you are asking is to further divide this into task. I.e. a thread loads 1/3 of the file while another thread the 2/3 etc.
If you were able to divide the task into further subtasks then it would not be a task in the first place by definition. So loading a file is a single task by itself.
To give you an example:
Let's say that you have a GUI and you need to present to the user data from 5 different files. To present them you need also to prepare some data structures to process the actual data.
All these are separate tasks.
E.g. the loading of files is 5 different tasks so could be done by 5 different threads.
The preparation of the data structures could be done a different thread.
The GUI runs of course in another thread.
All these can happen concurrently
If you system supported high-throughput I/O , here is how you can do it:
How to read a file using multiple threads in Java when a high throughput(3GB/s) file system is available
Here is the solution to read a single file with multiple threads.
Divide the file into N chunks, read each chunk in a thread, then merge them in order. Beware of lines that cross chunk boundaries. It is the basic idea as suggested by user
slaks
Bench-marking below implementation of multiple-threads for a single 20 GB file:
1 Thread : 50 seconds : 400 MB/s
2 Threads: 30 seconds : 666 MB/s
4 Threads: 20 seconds : 1GB/s
8 Threads: 60 seconds : 333 MB/s
Equivalent Java7 readAllLines() : 400 seconds : 50 MB/s
Note: This may only work on systems that are designed to support high-throughput I/O , and not on usual personal computers
Here is the essential nits of the code, for complete details , follow the link
public class FileRead implements Runnable
{
private FileChannel _channel;
private long _startLocation;
private int _size;
int _sequence_number;
public FileRead(long loc, int size, FileChannel chnl, int sequence)
{
_startLocation = loc;
_size = size;
_channel = chnl;
_sequence_number = sequence;
}
#Override
public void run()
{
System.out.println("Reading the channel: " + _startLocation + ":" + _size);
//allocate memory
ByteBuffer buff = ByteBuffer.allocate(_size);
//Read file chunk to RAM
_channel.read(buff, _startLocation);
//chunk to String
String string_chunk = new String(buff.array(), Charset.forName("UTF-8"));
System.out.println("Done Reading the channel: " + _startLocation + ":" + _size);
}
//args[0] is path to read file
//args[1] is the size of thread pool; Need to try different values to fing sweet spot
public static void main(String[] args) throws Exception
{
FileInputStream fileInputStream = new FileInputStream(args[0]);
FileChannel channel = fileInputStream.getChannel();
long remaining_size = channel.size(); //get the total number of bytes in the file
long chunk_size = remaining_size / Integer.parseInt(args[1]); //file_size/threads
//thread pool
ExecutorService executor = Executors.newFixedThreadPool(Integer.parseInt(args[1]));
long start_loc = 0;//file pointer
int i = 0; //loop counter
while (remaining_size >= chunk_size)
{
//launches a new thread
executor.execute(new FileRead(start_loc, toIntExact(chunk_size), channel, i));
remaining_size = remaining_size - chunk_size;
start_loc = start_loc + chunk_size;
i++;
}
//load the last remaining piece
executor.execute(new FileRead(start_loc, toIntExact(remaining_size), channel, i));
//Tear Down
}
}
Ok. I am supposed to write a program to take a 20 GB file as input with 1,000,000,000 records and create some kind of an index for faster access. I have basically decided to split the 1 bil records into 10 buckets and 10 sub-buckets within those. I am calculating two hash values for the record to locate its appropriate bucket. Now, i create 10*10 files, one for each sub-bucket. And as i hash the record from the input file, i decide which of the 100 files it goes to; then append the record's offset to that particular file.
I have tested this with a sample file with 10,000 records. I have repeated the process 10 times. Effectively emulating a 100,000 record file. For this it takes me around 18 seconds. This means its gonna take me forever to do the same for a 1 bil record file.
Is there anyway i can speed up/ optimize my writing.
And i am going through all this because i can't store all the records in main memory.
import java.io.*;
// PROGRAM DOES THE FOLLOWING
// 1. READS RECORDS FROM A FILE.
// 2. CALCULATES TWO SETS OF HASH VALUES N, M
// 3. APPENDING THE OFFSET OF THAT RECORD IN THE ORIGINAL FILE TO ANOTHER FILE "NM.TXT" i.e REPLACE THE VALUES OF N AND M.
// 4.
class storage
{
public static int siz=10;
public static FileWriter[][] f;
}
class proxy
{
static String[][] virtual_buffer;
public static void main(String[] args) throws Exception
{
virtual_buffer = new String[storage.siz][storage.siz]; // TEMPORARY STRING BUFFER TO REDUCE WRITES
String s,tes;
for(int y=0;y<storage.siz;y++)
{
for(int z=0;z<storage.siz;z++)
{
virtual_buffer[y][z]=""; // INITIALISING ALL ELEMENTS TO ZERO
}
}
int offset_in_file = 0;
long start = System.currentTimeMillis();
// READING FROM THE SAME IP FILE 20 TIMES TO EMULATE A SINGLE BIGGER FILE OF SIZE 20*IP FILE
for(int h=0;h<20;h++){
BufferedReader in = new BufferedReader(new FileReader("outTest.txt"));
while((s = in.readLine() )!= null)
{
tes = (s.split(";"))[0];
int n = calcHash(tes); // FINDING FIRST HASH VALUE
int m = calcHash2(tes); // SECOND HASH
index_up(n,m,offset_in_file); // METHOD TO WRITE TO THE APPROPRIATE FILE I.E NM.TXT
offset_in_file++;
}
in.close();
}
System.out.println(offset_in_file);
long end = System.currentTimeMillis();
System.out.println((end-start));
}
static int calcHash(String s) throws Exception
{
char[] charr = s.toCharArray();;
int i,tot=0;
for(i=0;i<charr.length;i++)
{
if(i%2==0)tot+= (int)charr[i];
}
tot = tot % storage.siz;
return tot;
}
static int calcHash2(String s) throws Exception
{
char[] charr = s.toCharArray();
int i,tot=1;
for(i=0;i<charr.length;i++)
{
if(i%2==1)tot+= (int)charr[i];
}
tot = tot % storage.siz;
if (tot<0)
tot=tot*-1;
return tot;
}
static void index_up(int a,int b,int off) throws Exception
{
virtual_buffer[a][b]+=Integer.toString(off)+"'"; // THIS BUFFER STORES THE DATA TO BE WRITTEN
if(virtual_buffer[a][b].length()>2000) // TO A FILE BEFORE WRITING TO IT, TO REDUCE NO. OF WRITES
{ .
String file = "c:\\adsproj\\"+a+b+".txt";
new writethreader(file,virtual_buffer[a][b]); // DOING THE ACTUAL WRITE PART IN A THREAD.
virtual_buffer[a][b]="";
}
}
}
class writethreader implements Runnable
{
Thread t;
String name, data;
writethreader(String name, String data)
{
this.name = name;
this.data = data;
t = new Thread(this);
t.start();
}
public void run()
{
try{
File f = new File(name);
if(!f.exists())f.createNewFile();
FileWriter fstream = new FileWriter(name,true); //APPEND MODE
fstream.write(data);
fstream.flush(); fstream.close();
}
catch(Exception e){}
}
}
Consider using VisualVM to pinpoint the bottlenecks. Everything else below is based on guesswork - and performance guesswork is often really, really wrong.
I think you have two issues with your write strategy.
The first is that you're starting a new thread on each write; the second is that you're re-opening the file on each write.
The thread problem is especially bad, I think, because I don't see anything preventing one thread writing on a file from overlapping with another. What happens then? Frankly, I don't know - but I doubt it's good.
Consider, instead, creating an array of open files for all 100. Your OS may have a problem with this - but I think probably not. Then create a queue of work for each file. Create a set of worker threads (100 is too many - think 10 or so) where each "owns" a set of files that it loops through, outputting and emptying the queue for each file. Pay attention to the interthread interaction between queue reader and writer - use an appropriate queue class.
I would throw away the entire requirement and use a database.