Use multithreading to read a file faster - Java

I want to read a 500 MB file with the help of two threads, so that reading the file will be much faster. Could someone please give me some code for the task, using core Java concepts?

Multi-threading is not likely to make the code faster at all. This is because reading a file is an I/O-bound process: you will be limited by the speed of the disk rather than your processor.

Instead of trying to multi-thread the reading, you may benefit from multi-threading the processing of the data: the processing often takes longer than the read and is CPU-bound. This can make it look as if multiple reader threads would help, but in reality one thread reading and multiple threads processing is usually the better split (see the sketch below). Using multiple threads to read only tends to help when you have multiple files on different physical disks, which is a rare situation.
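A minimal sketch of that split, assuming a line-oriented text file; the file name, WORKER_COUNT, and processLine are illustrative placeholders rather than anything from the question:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ReadThenProcess {
    static final int WORKER_COUNT = Runtime.getRuntime().availableProcessors();

    public static void main(String[] args) throws IOException, InterruptedException {
        ExecutorService workers = Executors.newFixedThreadPool(WORKER_COUNT);
        try (BufferedReader reader = new BufferedReader(new FileReader("big-file.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                final String current = line;
                // hand the CPU-bound work to the pool; the single reader keeps streaming
                workers.submit(() -> processLine(current));
            }
        }
        workers.shutdown();
        workers.awaitTermination(1, TimeUnit.HOURS);
    }

    static void processLine(String line) {
        // placeholder for the real per-line work
    }
}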

While you might not be able to speed up the read from disk by using multiple threads, you can speed up the overall process by not doing the processing in the same thread as the read.
How much this helps will depend on the contents of the file.

Related

Is it useful to use a Thread for prefetching from a file?

Using multiple threads to speed up I/O may work, but I need to process a huge file (or directory tree) sequentially with a single thread. Still, I can imagine two possible ways to speed up reading from a file:
Feeder
The main thread gets all its data from a PipedInputStream (or similar) fed by an auxiliary thread, which is the only one accessing the file. The synchronization overhead is higher, but there is less communication with (the underlying library communicating with) the OS. This is straightforward for a single file, but very complicated for a directory tree.
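A minimal sketch of the feeder idea for a single file; the file name and buffer sizes are illustrative, not from the question:

import java.io.*;

public class Feeder {
    public static void main(String[] args) throws IOException {
        PipedInputStream forMain = new PipedInputStream(64 * 1024);
        PipedOutputStream fromFeeder = new PipedOutputStream(forMain);

        Thread feeder = new Thread(() -> {
            try (InputStream file = new FileInputStream("data.bin")) {
                byte[] buf = new byte[8192];
                int n;
                while ((n = file.read(buf)) != -1) {
                    fromFeeder.write(buf, 0, n); // blocks when the pipe is full
                }
            } catch (IOException e) {
                e.printStackTrace();
            } finally {
                try { fromFeeder.close(); } catch (IOException ignored) {}
            }
        });
        feeder.start();

        // The main thread reads from the pipe as if it were the file itself.
        try (InputStream in = forMain) {
            byte[] buf = new byte[8192];
            while (in.read(buf) != -1) {
                // process the data here
            }
        }
    }
}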
Prefetcher
The main thread opens new FileInputStream(file) and reads it as if it were alone. The auxiliary thread opens its own stream over the same file and reads ahead. The main thread doesn't need to wait for the disk, since it gets all its data from the OS cache. There should be some trivial synchronization ensuring that the auxiliary thread doesn't run too far ahead. This could work for directory trees without much additional effort.
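A minimal sketch of the prefetcher idea; the file name is illustrative, and the throttling that keeps the auxiliary thread from running too far ahead is omitted for brevity:

import java.io.*;

public class Prefetcher {
    public static void main(String[] args) throws IOException {
        final File file = new File("data.bin");

        Thread prefetcher = new Thread(() -> {
            try (InputStream ahead = new FileInputStream(file)) {
                byte[] sink = new byte[64 * 1024];
                while (ahead.read(sink) != -1) {
                    // discard the bytes: reading is enough to populate the OS cache
                }
            } catch (IOException ignored) {
            }
        });
        prefetcher.setDaemon(true);
        prefetcher.start();

        // Meanwhile the main thread reads normally and mostly hits the cache.
        try (InputStream in = new FileInputStream(file)) {
            byte[] buf = new byte[8192];
            while (in.read(buf) != -1) {
                // process sequentially
            }
        }
    }
}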
The questions
Which idea (if any) would you recommend to try out?
Have you used something like this?
Any other idea?
I had an app that read multiple files, created XML from them, and sent it to a server.
In that situation, having a dedicated "feeder" (which reads files and puts them in a queue) and a few "senders" (which create the XML and send it to the server) helped.
If you are doing moderately to intensively CPU-consuming work (like XML parsing), then having two threads (one reading, one processing) is likely to help even on a single-core machine. I wouldn't be too concerned about synchronization overhead: when there is little contention, the gain from doing work while waiting for I/O is much bigger. If your threads wait for I/O from time to time, there will be even more benefit.
I'd recommend reading this chapter from JCiP (Java Concurrency in Practice); it addresses this topic.
It depends! ... on your access patterns, on your hardware...
"Using multiple threads for speeding IO may work" - IF your IO subsystem (such as a large disk array) is capable of handling multiple IO requests at once.
On a single desktop drive, your gains will be limited; if you have several threads performing largely independent work (i.e. there are few synchronisation points), you can benefit from one thread reading data while the others process data previously read.

Synchronizing several threads writing to the same file in java

Of course there is the obvious way of using "synchronized". But I'm creating a system designed to run on several cores and to write to that file several times within the same millisecond, so I believe that using synchronized will hurt performance badly.
I was thinking of using Java's Pipe class (though I'm not sure it will help), or having each thread write to a different file, with an additional thread collecting those writes and creating the final result.
I should mention that the order of the writes isn't important; each write is timestamped in nanotime anyway.
Which of those two ideas is better? Do you have any other suggestions?
Thanks.
Using some sort of synchronization (e.g. a single mutex) is quite easy to implement.
If I had enough RAM, I would create a queue of some sort for each log-producer thread, and a log-consumer thread that reads from all the queues in round-robin fashion (or something like that); a sketch of that layout follows.
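A minimal sketch of that layout, assuming hypothetical producer threads that call enqueue(...); the class and method names are made up for illustration:

import java.io.*;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class RoundRobinLogWriter {
    private final List<BlockingQueue<String>> queues = new ArrayList<>();
    private final Writer out;

    RoundRobinLogWriter(int producers, Writer out) {
        this.out = out;
        for (int i = 0; i < producers; i++) {
            queues.add(new LinkedBlockingQueue<>());
        }
    }

    // Each producer thread i calls this; producers never share a lock.
    void enqueue(int producerId, String line) {
        queues.get(producerId).add(line);
    }

    // A single consumer thread drains all queues in round-robin order.
    void drainLoop() throws IOException, InterruptedException {
        while (!Thread.currentThread().isInterrupted()) {
            for (BlockingQueue<String> q : queues) {
                String line;
                while ((line = q.poll()) != null) {
                    out.write(line);
                    out.write('\n');
                }
            }
            out.flush();
            Thread.sleep(1); // simple back-off; tune or replace as needed
        }
    }
}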
Not a direct answer to your question, but the logback project has synchronization facilities built in for writing to the same file from different threads, so you might use it if it suits your needs, or at least take a look at its source code. Since it's built for speed, I'm pretty sure the implementation is not a naive one.
You are right to be concerned: you are not going to be able to have all the threads write to the same file without a performance problem.
When I had this problem (writing my own logging, way back before Log4j) I created two fixed-size buffers in memory and had all the producer threads write to one buffer while a dedicated consumer thread read from the other buffer and wrote to a file. That way the writer threads had to synchronize only on getting and incrementing the index to the buffer and when the buffers were being swapped, and it only blocked when the current buffer was full. It was memory-intensive but fast.
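A minimal sketch of that double-buffering scheme; buffer sizes and names are illustrative, and the dedicated consumer thread that calls flushToDisk() periodically is assumed to exist elsewhere:

import java.io.*;

public class DoubleBufferedWriter {
    private StringBuilder active = new StringBuilder(1 << 20);
    private StringBuilder draining = new StringBuilder(1 << 20);
    private final Object lock = new Object();
    private final Writer out;

    DoubleBufferedWriter(Writer out) { this.out = out; }

    // Called by many producer threads; the critical section is tiny.
    void log(String line) {
        synchronized (lock) {
            active.append(line).append('\n');
        }
    }

    // Called periodically by the single consumer thread.
    void flushToDisk() throws IOException {
        StringBuilder full;
        synchronized (lock) {
            full = active;          // swap the buffers under the lock
            active = draining;
            draining = full;
        }
        out.write(full.toString()); // the slow disk write happens outside the lock
        out.flush();
        full.setLength(0);
    }
}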
For other ideas, check out how loggers like Log4j and Logback work; they had to solve this same problem.
Try JMS. All your processes, even running on different machines, can send JMS messages instead of writing to the file. Create a single queue receiver that receives the messages and writes them to the file. Log4j already has this functionality: see JMSAppender.

Using Multithreading in Java to read data

I'm trying to figure out how I should use threads in my program.
Right now I have a single-threaded program that reads a single huge file. It's a very simple program that just reads line by line and collects some statistics about the words.
Now I would like to use multiple threads to make it faster, but I'm not sure how to approach this.
One solution is to split the data into X pieces in advance, then have X threads, each running over one piece simultaneously, with one synchronized method to write the stats to memory. Is there a better approach? Specifically, I would like to avoid splitting the data in advance.
Thanks!
First of all, do some profiling to make sure that your process is actually compute-bound and not I/O-bound, that is, that your statistics collection is slower than reading the file. Otherwise multi-threading will slow your program down rather than speed it up, particularly if you are running on a single-core CPU (or an ancient JVM).
Also consider: if your file resides on a hard drive, how will you schedule the reads? Otherwise you risk adding hard-drive seek delays, stalling all the threads that have finished their chunk of work while one thread asks the drive to seek to position 0x03457000...
You could have a look at the producer-consumer approach. It is a classic threading pattern where one thread produces data (in your case, the one that reads the file) and writes it to a shared buffer, from which another thread, the consumer, reads that data; that would be your calculation thread (some Java examples). A sketch follows below.
Also have a look at Java's non-blocking I/O (NIO).
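A minimal sketch of that producer-consumer split for the word-statistics case; the file name and the poison-pill sentinel are illustrative:

import java.io.*;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class WordStats {
    private static final String EOF = new String("<EOF>"); // identity-distinct poison pill

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> lines = new ArrayBlockingQueue<>(1024);
        Map<String, Integer> counts = new HashMap<>();

        Thread reader = new Thread(() -> {
            try (BufferedReader in = new BufferedReader(new FileReader("huge.txt"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    lines.put(line); // blocks if the consumer falls behind
                }
                lines.put(EOF);
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });
        reader.start();

        // Consumer: the statistics work happens here, off the reading thread.
        String line;
        while ((line = lines.take()) != EOF) {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) counts.merge(word, 1, Integer::sum);
            }
        }
        reader.join();
        System.out.println(counts.size() + " distinct words");
    }
}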
The assumption that multithreaded disk access is faster might be wrong, as discussed here: Directory walker on modern operating systems slower when it's multi-threaded?
Performance improvement can be achieved by splitting the reading and the processing of the data into separate threads.
But wait, reading files line by line? That doesn't sound optimal. Better to read the file as a stream of characters (using a FileReader).
See this Sun tutorial.
If your problem is I/O-bound, maybe you can consider splitting your data into multiple files, putting them into a distributed filesystem such as the Hadoop Distributed File System (HDFS), and then running a Map/Reduce operation on it?

How do I get Java to use my multi-core processor with GZIPInputStream?

I'm using a GZIPInputStream in my program, and I know the performance would improve if I could get Java to run my program in parallel.
In general, is there a command-line option for the standard VM to run on many cores? It's running on just one as it is.
Thanks!
Edit
I'm running plain ol' Java SE 6 update 17 on Windows XP.
Would putting the GZIPInputStream on a separate thread explicitly help? No! Do not put the GZIPInputStream on a separate thread! Do NOT multithread I/O!
Edit 2
I suppose I/O is the bottleneck, as I'm reading and writing to the same disk...
In general, though, is there a way to make GZIPInputStream faster? Or a replacement for GZIPInputStream that runs in parallel?
Edit 3
Code snippet I used:
GZIPInputStream gzip = new GZIPInputStream(new FileInputStream(INPUT_FILENAME));
DataInputStream in = new DataInputStream(new BufferedInputStream(gzip));
AFAIK the action of reading from this stream is single-threaded, so multiple CPUs won't help you if you're reading one file.
You could, however, have multiple threads, each unzipping a different file (see the sketch below).
That being said, unzipping is not particularly computation-intensive these days; you're more likely to be blocked by the cost of I/O (e.g. if you are reading two very large files from two different areas of the disk).
More generally (assuming this is a question from someone new to Java): Java doesn't do things in parallel for you. You have to use threads to tell it what units of work you want done and how to synchronize between them. Java (with the help of the OS) will generally use as many cores as are available to it, and will swap threads on the same core if there are more threads than cores (which is typically the case).
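A minimal sketch of the one-thread-per-file approach mentioned above; the file names and pool size are illustrative:

import java.io.*;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.zip.GZIPInputStream;

public class ParallelGunzip {
    public static void main(String[] args) throws InterruptedException {
        String[] files = { "a.gz", "b.gz", "c.gz" };
        ExecutorService pool = Executors.newFixedThreadPool(files.length);
        for (final String name : files) {
            pool.submit(() -> {
                try (InputStream in = new BufferedInputStream(
                        new GZIPInputStream(new FileInputStream(name)))) {
                    byte[] buf = new byte[8192];
                    while (in.read(buf) != -1) {
                        // consume or write the decompressed bytes here
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}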
PIGZ (Parallel Implementation of GZip) is a fully functional replacement for gzip that exploits multiple processors and multiple cores to the hilt when compressing data. http://www.zlib.net/pigz/ It's not available in Java yet... any takers? Of course the world needs it in Java.
Sometimes the compression or decompression is a big CPU consumer, though it does help keep the I/O from being the bottleneck.
See also DataSeries (C++) from HP Labs. PIGZ only parallelizes the compression, while DataSeries breaks the output into large compressed blocks that are decompressible in parallel. It also has a number of other features.
Wrap your GZIP streams in buffered streams; this should give you a significant performance increase.
OutputStream out = new BufferedOutputStream(
    new GZIPOutputStream(
        new FileOutputStream(myFile)
    )
);
And likewise for the input stream. Using the buffered input/output streams reduces the number of disk reads.
I'm not seeing any answer addressing the rest of the processing in your program.
If you're just unzipping a file, you'd be better off simply using the command line gunzip tool; but likely there's some processing happening with the files you're pulling out of that stream.
If you're extracting something that comes in reasonably sized chunks, then your processing of those chunks should be happening in a separate thread from the unzipping.
You could manually start a Thread for each large String or other block of data, but since Java 1.6 or so you'd be better off with one of the classes in java.util.concurrent, such as a ThreadPoolExecutor (see the sketch below).
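A minimal sketch of handing decompressed chunks to a pool; the file name is a placeholder and processChunk stands in for whatever your real per-chunk work is:

import java.io.*;
import java.util.Arrays;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.zip.GZIPInputStream;

public class UnzipAndDispatch {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        try (InputStream in = new BufferedInputStream(
                new GZIPInputStream(new FileInputStream("input.gz")))) {
            byte[] buf = new byte[1 << 16];
            int n;
            while ((n = in.read(buf)) != -1) {
                final byte[] chunk = Arrays.copyOf(buf, n);
                pool.submit(() -> processChunk(chunk)); // the unzip thread keeps reading
            }
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    static void processChunk(byte[] chunk) {
        // placeholder for the actual per-chunk processing
    }
}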
Update
It's not clear to me from the question and other comments whether you really ARE just extracting files using Java. If you really, really think you should try to compete with gunzip, then you can probably gain some performance by using large buffers; i.e. work with a buffer of, say, 10 MiB (binary, not decimal: 10 x 1048576 = 10485760 bytes), fill it in a single gulp and write it to disk likewise. That will give your OS a chance to do some medium-scale planning for disk usage, and you'll need fewer system-level calls too.
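A minimal sketch of that large-buffer approach; the file names are placeholders, and of course read() may return fewer bytes than a full 10 MiB per call:

import java.io.*;
import java.util.zip.GZIPInputStream;

public class BigBufferGunzip {
    public static void main(String[] args) throws IOException {
        byte[] buf = new byte[10 * 1048576]; // 10 MiB working buffer
        try (InputStream in = new GZIPInputStream(new FileInputStream("input.gz"), 1 << 16);
             OutputStream out = new FileOutputStream("output.dat")) {
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n); // write each gulp in a single call
            }
        }
    }
}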
Compression seems like a hard case for parallelization, because the bytes emitted by the compressor are a non-trivial function of the previous W bytes of input, where W is the window size. You can obviously break a file into pieces and create independent compression streams for each piece, running in their own threads. You may need to retain some compression metadata so the decompressor knows how to put the file back together.
Compression and decompression using gzip is a serialized process. To use multiple threads you would have to write a custom program to break the input file into many streams, and then a custom program to decompress and join them back together. Either way, I/O is going to be a bottleneck WAY before CPU usage is.
Run multiple VMs. Each VM is a process, and you should be able to run at least three processes per core without suffering any drop in performance. Of course, your application would have to be able to leverage multiprocessing in order to benefit. There is no magic bullet, which is why you see articles in the press moaning about how we don't yet know how to use multicore machines.
However, there are lots of people out there who have structured their applications into a master which manages a pool of worker processes and parcels out work packages to them. Not all problems are amenable to being solved this way.
I think it is a mistake to assume that multithreaded I/O is always evil. You probably need to profile your particular case to be sure, because:
Recent operating systems use currently free memory for the page cache, so your files may not actually be read from the hard drive at all.
Recent drives such as SSDs have much faster access times, so changing the read location is much less of an issue.
The question is too general to assume we are reading from a single hard drive.
You may need to tune your read buffer to make it large enough to reduce the switching costs. In the extreme case, one can read all the files into memory and decompress them there in parallel: faster, with no penalty from multithreaded I/O. However, something less extreme may also work well.
You also do not need to do anything special to use the multiple available cores on the JRE: different threads will normally be scheduled onto different cores by the operating system.
You can't parallelize the standard GZIPInputStream; it is single-threaded. But you can pipeline the decompression and the processing of the decompressed stream into different threads: set up the GZIPInputStream as a producer and whatever processes its output as a consumer, and connect them with a bounded blocking queue (see the sketch below).
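A minimal sketch of that pipeline, with the decompressing producer and the consumer connected by a bounded ArrayBlockingQueue; the queue capacity and file name are illustrative:

import java.io.*;
import java.util.Arrays;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.zip.GZIPInputStream;

public class PipelinedGunzip {
    private static final byte[] EOF = new byte[0]; // identity-distinct poison pill

    public static void main(String[] args) throws Exception {
        BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(64);

        Thread producer = new Thread(() -> {
            try (InputStream in = new BufferedInputStream(
                    new GZIPInputStream(new FileInputStream("input.gz")))) {
                byte[] buf = new byte[1 << 16];
                int n;
                while ((n = in.read(buf)) != -1) {
                    queue.put(Arrays.copyOf(buf, n)); // blocks if the consumer lags
                }
                queue.put(EOF);
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });
        producer.start();

        // Consumer: process decompressed blocks while the producer keeps inflating.
        byte[] block;
        while ((block = queue.take()) != EOF) {
            // process the decompressed block here
        }
        producer.join();
    }
}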

What concurrency works with the java file system?

Using Java I/O, it seems that forking a new process gives process B a better ability to read data written to a file by process A than you would get if thread A wrote to a file that thread B is trying to read (within the same process).
It seems the rules are not comparable to the memory model. So what file-based concurrency works? References would be appreciated.
Any observations like this are bound to be operating-system specific, and may be specific to different versions of the operating system (kernel). What you are hitting here is probably related to the way the OS implements threads and thread scheduling. The Java platform provides little in the way of tuning for this kind of thing.
IMO, if you need better performance, you probably should not be using a file as a data-transfer channel between two threads in the same JVM. Code your application to detect that the threads are colocated in the same JVM and use (say) Java's piped streams instead.
Maybe it has to do with thread and process blocking.
When a process wants a resource (to read or write a file), it blocks until the OS fulfills the request and returns something to the process.
If you are not using hyper-threading, a process with two threads will block both threads while fulfilling each of the tasks. But if you separate them into two processes, maybe the OS can optimize access and parallelize the reads and writes better.
(just guessing :)
