How to read a huge file in Java, in chunks, without being blocked?

Say you have a file bigger than the memory you have available to handle it. You'd like to read the file n bytes at a time and not get blocked in the process:
read a block
pass it to a thread
read another block
pass it to a thread
I tried different things with varying success, but blocking always seems to be the issue.
Please provide an example of a non-blocking way to gain access to the data, say as a byte[].

You can't.
You will always block while waiting for the disk to provide you with data. If you have a lot of work to do with each chunk of data, then using a second thread may help: that thread can perform CPU-intensive work on the data while the first thread is blocked waiting for the next read to complete.
But that doesn't sound like your situation.
Your best bet is to read data in as large a block as you possibly can (say, 1MB or more). This minimizes the time blocked in the kernel, and may result in less time waiting for the disk (if the blocks being read happen to be contiguous).
Here's the code:
import java.io.RandomAccessFile;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

ExecutorService exec = Executors.newFixedThreadPool(1);

// use RandomAccessFile because it supports readFully()
try (RandomAccessFile in = new RandomAccessFile("myfile.dat", "r")) {
    in.seek(0L);
    while (in.getFilePointer() < in.length()) {
        // read up to 1 MB per chunk; the final chunk may be shorter
        int readSize = (int) Math.min(1_000_000, in.length() - in.getFilePointer());
        final byte[] data = new byte[readSize];
        in.readFully(data);
        exec.execute(new Runnable() {
            public void run() {
                // do something with data
            }
        });
    }
}
exec.shutdown();

It sounds like you are looking for Streams, buffering, or some combination of the two (BufferedInputStream anyone?).
Check this out:
http://docs.oracle.com/javase/tutorial/essential/io/buffers.html
This is the standard way to deal with very large files. I apologize if this isn't what you were looking for, but hopefully it'll help get the juices flowing anyway.
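For instance, a minimal buffered, chunked read might look like this (the file name and chunk size are just placeholders):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class ChunkedRead {
    public static void main(String[] args) throws IOException {
        byte[] chunk = new byte[64 * 1024]; // read 64 KB at a time
        try (BufferedInputStream in = new BufferedInputStream(new FileInputStream("big.dat"))) {
            int n;
            while ((n = in.read(chunk)) != -1) {
                // hand the first n bytes of chunk to a worker here
            }
        }
    }
}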
Good luck!

If you have a program that does I/O and CPU computations, blocking is inevitable (somewhere in your program) if on average the amount of CPU time it takes to process a byte is less than the time to read a byte.
If you try to read a file and that requires a disk seek, the data might not arrive for 10 ms. A 2 GHz CPU could have done 20 M clock cycles of work in that time.

Related

Splitting a text file into chunks in Java using multiple threads

I have split a text file (50 GB) based on the formula (total size of file / split size). Right now the splitting is done sequentially in a single thread; how can I change this code to perform the splitting in multiple threads (i.e., the threads should split the file in parallel and store the pieces in a folder)? I don't want to read the file, as it would use more CPU. My main goal is to reduce CPU utilization and complete the splitting of the file quickly, in less time. I have 8 CPU cores.
Any suggestions? Thanks in advance.
public class ExecMap {
    public static void main(String[] args) throws InterruptedException, ExecutionException, TimeoutException {
        String FilePath = "/home/xm/Downloads/wikipedia_50GB/wikipedia_50GB/file21";
        File file = new File(FilePath);
        long splitFileSize = 64 * 1024 * 1024;
        long fileSize = file.length();
        System.out.println(fileSize);
        int mappers = (int) (fileSize / splitFileSize);
        System.out.println(mappers);
        ExecMap exec = new ExecMap();
        exec.mapSplit(FilePath, splitFileSize, mappers, fileSize);
    }

    private static void mapSplit(String FilePath, long splitlen, int mappers, long fileSize) {
        ExecutorService executor = Executors.newFixedThreadPool(1);
        executor.submit(() -> {
            long len = fileSize;
            long leninfile = 0, leng = 0;
            int count = 1, data;
            try {
                long startTime = System.currentTimeMillis(); // get the start time
                long endTime = 0;
                System.out.println(startTime);
                File filename = new File(FilePath);
                InputStream infile = new BufferedInputStream(new FileInputStream(filename));
                data = infile.read();
                while (data != -1) {
                    String name = Thread.currentThread().getName();
                    System.out.println("task started: " + name + " ====Time " + System.currentTimeMillis());
                    filename = new File("/home/xm/Desktop/split/" + "Mapper " + count + ".txt");
                    OutputStream outfile = new BufferedOutputStream(new FileOutputStream(filename));
                    while (data != -1 && leng < splitlen) {
                        outfile.write(data);
                        leng++;
                        data = infile.read();
                    }
                    leninfile += leng;
                    leng = 0;
                    outfile.close();
                    count++;
                    System.out.println("task finished: " + name);
                }
                endTime = System.currentTimeMillis();
                System.out.println(endTime);
                long msec = endTime - startTime;
                System.out.println("Difference in milliseconds: " + msec);
                System.out.println("Difference in seconds: " + msec / 1000);
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
            executor.shutdownNow();
        });
    }
}
The basic multi-threaded approach is to take a task, divide it into sub-tasks that can be done as individual units of work, and create a thread for each sub-task. This works best when the threads can be independent of each other, do not require any kind of communication, and do not share any resources.
House building as an analogy
So if we are building a house, some sub-tasks need to be done in a particular order. A foundation must exist before the house can be built. The walls need to be in place before the roof can be put on.
However some sub-tasks can be done independently. The roof can be shingled while the plumbers are installing the plumbing and the brick layers are bricking up the outside of the house.
Basic thoughts on the problem to be solved
In the case of your file splitting, the basic approach would be to take the task, splitting the file, and divide it into several sub-tasks, assigning a portion of the file to each thread.
However this particular task has a common piece of work that is going to create a bottleneck and may require some kind of synchronization between the threads when they read from the original file. Having several threads access the same file requires that each thread be able to read its assigned portion of the file independently.
The good news is that since the only thing being shared is the original file and it is only being read from, you do not need to worry about synchronizing the file reads at the Java level.
A first approach
The approach I would consider first is to divide the number of output files by the number of threads. Each thread would then open the file with its own file reader, so that its file I/O is independent of the other threads. This way, though the original file is shared, each thread keeps its own file read position, so each thread reads from the file independently.
Each thread would then create its own set of output files, reading from the original file and writing to the output files. The threads would create their output files one at a time, beginning at their assigned offset within the original file, reading from the original file and writing to the output file.
Doing it this way maintains the independence of each thread's work: each thread has its own file access data, its own assigned region of the original file, and its own set of output files.
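A rough sketch of this first approach, with hypothetical file names and sizes (note that it splits on byte offsets, not line boundaries, and keeps error handling minimal):

import java.io.File;
import java.io.FileOutputStream;
import java.io.RandomAccessFile;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class RegionSplitter {
    // each worker opens its own reader and splits only its assigned region
    static Runnable splitRegion(String inPath, long start, long end, long splitLen, int firstFileNo) {
        return () -> {
            try (RandomAccessFile in = new RandomAccessFile(inPath, "r")) {
                in.seek(start);
                int fileNo = firstFileNo;
                long remaining = end - start;
                while (remaining > 0) {
                    int len = (int) Math.min(splitLen, remaining);
                    byte[] buf = new byte[len];
                    in.readFully(buf);
                    try (FileOutputStream out = new FileOutputStream("split-" + fileNo++ + ".txt")) {
                        out.write(buf);
                    }
                    remaining -= len;
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        };
    }

    public static void main(String[] args) throws Exception {
        String inPath = "big.dat";          // hypothetical input file
        long splitLen = 64L * 1024 * 1024;  // 64 MB per split file
        int threads = 8;
        long size = new File(inPath).length();
        long region = size / threads;       // bytes per thread; the last thread takes the remainder
        int filesPerRegion = (int) (region / splitLen) + 1;

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            long start = t * region;
            long end = (t == threads - 1) ? size : start + region;
            pool.execute(splitRegion(inPath, start, end, splitLen, t * filesPerRegion));
        }
        pool.shutdown();
    }
}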
Other considerations
At the operating system level the file system is shared. So the operating system will need to interleave and multiplex the file system access. For an application such as this where data is being read from a disk file and then immediately written back to another disk file, most of the time the application is waiting for the operating system to perform the I/O operation requested.
For a disk file there are several lower level operations that need to be performed such as: (1) finding the location of the file on the disk, (2) seeking to that location, and (3) reading or writing the requested amount of data. The operating system does all of these things for the application and while the operating system is doing these actions the application waits.
So in your multi-threading application, each thread is asking the operating system for disk I/O so each thread will be spending most of its time waiting for the disk I/O request to be done by the operating system, whether that disk I/O is reading from the original file or writing to a new split file.
Since this disk I/O will probably be the bounding action that will require the most time, the question is whether the time taken can be reduced.
A second approach
So an alternative architecture would be to have a single thread which only reads from the original file and reads in large chunks which are several times the size of the split file size. Then one or more other threads are used to take each chunk and create the output split file.
The single thread reading from the original file reads a split file chunk and gives that chunk to another thread. The other thread then creates the split file and writes the chunk out. While the other thread is performing that sub-task, the single thread reads the next chunk from the original file and uses a second thread to write that chunk out to a split file.
This approach should allow the disk I/O to be more efficient in that large chunks of the original file are being read into memory. The disk I/O with the original large file is done sequentially allowing the operating system to do disk seeks and disk I/O more efficiently.
In the first approach, access to the original file is random, which requires the disk heads that read the data to be repositioned more often as each thread makes a read request.
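A sketch of this second approach, again with hypothetical names: one thread reads large sequential chunks, and a small pool writes the split files. (Note that the pool's unbounded queue can buffer many chunks in memory; a bounded queue would add backpressure.)

import java.io.FileOutputStream;
import java.io.RandomAccessFile;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SequentialReadParallelWrite {
    public static void main(String[] args) throws Exception {
        String inPath = "big.dat";          // hypothetical input file
        long splitLen = 64L * 1024 * 1024;  // one 64 MB split file per chunk
        ExecutorService writers = Executors.newFixedThreadPool(2);

        try (RandomAccessFile in = new RandomAccessFile(inPath, "r")) {
            int count = 0;
            while (in.getFilePointer() < in.length()) {
                int len = (int) Math.min(splitLen, in.length() - in.getFilePointer());
                byte[] chunk = new byte[len];
                in.readFully(chunk);        // sequential read, single thread
                final int fileNo = count++;
                writers.execute(() -> {     // writing happens on the pool
                    try (FileOutputStream out = new FileOutputStream("split-" + fileNo + ".txt")) {
                        out.write(chunk);
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
            }
        }
        writers.shutdown();
    }
}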
Final thoughts: test predictions by measuring
In order to determine which of these approaches is actually more efficient would require trying both. While you can make a prediction based on a model of how the operating system and the disk hardware works, until you have actually tried it and measured the two approaches, you will not know whether one is superior over another.
And in the end, the most efficient method may be to just have a single thread which reads large chunks of the original file and then writes out the smaller split files.
Possible benefits from multiple threads
On the other hand, if you have several threads which are handed large chunks of the file being split, some of the operating system overhead involved in creating, opening, and closing of files may be scheduled more efficiently with the multiple threads. Using multiple threads may allow the operating system's file management subsystem and the disk I/O routines to schedule the disk I/O more efficiently by choosing among multiple pending disk I/O requests.
Due to the overhead of creating and destroying threads, you would probably want to create a set of working threads at application startup and then assign a particular split file I/O task to the threads. When the thread finishes with that assignment it then waits for another.
You can use RandomAccessFile and use seek() to skip to a certain position.
This way you can give your executors a start position and an end position, so each executor will work on a small chunk of the file.
But, as mentioned, your problem will be disk I/O.
You will not see any advantage from launching multiple threads (as noted by many in comments to the original question) to "split a file in parallel".
Having multiple threads working on parts of a large task in parallel can only speed things up if they are acting independently of each other. Since, in this case, the time-consuming part is reading 50 Gb of file and writing it out as smaller files, and this is not done by Java but by the OS (and ultimately, by the disk driver having to read and later write all those bytes), having multiple threads will only add a small overhead (for thread creation & scheduling), making everything a bit slower.
Additionally, sequential reads and writes in rotating disks (SSDs are exempted from this rule) are much faster than random reads and writes - if many threads are reading and writing from different parts of a disk, throughput will be considerably worse than if a single thread does everything.
Think about it this way: you have a truck driver (the OS + disk) and have to split a big heap of bricks at place A into smaller heaps of bricks at places C, D and E; and bricks can only travel by truck. There is only that truck driver and you, the supervisor giving the orders. Would hiring more supervisors (threads) to give orders in parallel speed things up? No; you would just get in each other's way, and the truck driver, trying to please all of you, would need many more journeys carrying smaller loads of bricks to do the same job.

How to store a big amount of data

I have a program that at the start generates a big amount of data (several GB, possibly more than 10 GB) and then several times processes all the data: process all data, do something, process all data, do something... That much data doesn't fit into my RAM, and when it starts paging it's really painful. What is the optimal way to store my data and, in general, how do I solve this problem?
Should I use a DB even though I don't need to save the data after my program ends?
Should I split my data somehow and just save it into files and load them when I need them? Or just keep using RAM and get over the paging?
With a DB and files there is a problem: I have to process the data in pieces. So I load a chunk of data (let's say 500 MB), calculate, load the next chunk, and after I have loaded and calculated everything, I can do something and repeat the cycle. That means I would read from the HDD the same chunks of data I read in the previous cycle.
Try to reduce the amount of data.
Try to modify the algorithm to extract the relevant data at an early stage.
Try to divide and/or parallelize the problem, and execute it over several clients in a cluster of computing nodes.
Plain files will be enough for your task; a couple of samples:
Use the BufferedReader skip() method
RandomAccessFile
Read these two, and the problem with duplicated chunks should go away.
You should definitely try to reduce the amount of data and have multiple threads to handle your data.
FutureTask could help you:
ExecutorService exec = Executors.newFixedThreadPool(5);

FutureTask<BigDecimal> task1 = new FutureTask<>(new Callable<BigDecimal>() {
    @Override
    public BigDecimal call() throws Exception {
        return doBigProcessing(); // your long-running computation
    }
});

// start future task asynchronously
exec.execute(task1);

// do other stuff

// blocking till processing is over
BigDecimal result = task1.get();
In the same way, you could consider caching the future task to speed up your application if possible.
If not enough, you could use Apache Spark framework to process large datasets.
Before you think about performance, you must consider the points below:
find a good data structure for the data.
find good algorithms to process the data.
If you do not have enough memory space, use a memory-mapped file to work on the data.
If you can process the data without loading all of it at once, divide and conquer.
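For illustration, a minimal memory-mapped read sketch (the file name and window size are placeholders):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedRead {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("big.dat", "r");
             FileChannel ch = raf.getChannel()) {
            long window = 512L * 1024 * 1024; // map 512 MB at a time
            for (long pos = 0; pos < ch.size(); pos += window) {
                long len = Math.min(window, ch.size() - pos);
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, pos, len);
                while (buf.hasRemaining()) {
                    byte b = buf.get(); // process each byte; the OS pages data in as needed
                }
            }
        }
    }
}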
And please give us more details.

What causes this performance drop?

I'm using the Disruptor framework for performing fast Reed-Solomon error correction on some data. This is my setup:
            RS Decoder 1
           /            \
Producer -      ...      - Consumer
           \            /
            RS Decoder 8
The producer reads blocks of 2064 bytes from disk into a byte buffer.
The 8 RS decoder consumers perform Reed-Solomon error correction in parallel.
The consumer writes files to disk.
In the disruptor DSL terms, the setup looks like this:
RsFrameEventHandler[] rsWorkers = new RsFrameEventHandler[numRsWorkers];
for (int i = 0; i < numRsWorkers; i++) {
    rsWorkers[i] = new RsFrameEventHandler(numRsWorkers, i);
}
disruptor.handleEventsWith(rsWorkers)
         .then(writerHandler);
When I don't have a disk output consumer (no .then(writerHandler) part), the measured throughput is 80 M/s. As soon as I add a consumer, even if it writes to /dev/null or doesn't write at all but is declared as a dependent consumer, performance drops to 50-65 M/s.
I've profiled it with Oracle Mission Control, and this is what the CPU usage graph shows:
Without an additional consumer: [CPU usage graph]
With an additional consumer: [CPU usage graph]
What is this gray part in the graph and where is it coming from? I suppose it has to do with thread synchronisation, but I can't find any other statistic in Mission Control that would indicate any such latency or contention.
Your hypothesis is correct, it is a thread synchronization issue.
From the API Documentation for EventHandlerGroup<T>.then (Emphasis mine)
Set up batch handlers to consume events from the ring buffer. These handlers will only process events after every EventProcessor in this group has processed the event.
This method is generally used as part of a chain, for example when handler A must process events before handler B.
This should necessarily decrease throughput. Think about it like a funnel:
The consumer has to wait for every EventProcessor to be finished, before it can proceed through the bottleneck.
I can see two possibilities here, based on what you've shown. You might be affected by one or both, I'd recommend testing both.
1) IO processing bottleneck.
2) Contention on multiple threads writing to buffer.
IO processing
From the data shown, you have stated that as soon as you enable the IO component, your throughput decreases and kernel time increases. This could quite easily be the IO wait time while your consumer thread is writing: a context switch to perform a write() call is significantly more expensive than doing nothing. Your Decoders are now capped at the maximum speed of the consumer. To test this hypothesis, you could remove the write() call; in other words, open the output file, prepare the string for output, but just not issue the write call.
Suggestions
Try removing the write() call in the Consumer, see if it reduces kernel time.
Are you writing to a single flat file sequentially - if not, try this
Are you using smart batching (i.e., buffering until the endOfBatch flag and then writing in a single batch) to ensure that the IO is bundled up as efficiently as possible?
Contention on multiple writers
Based on your description I suspect your Decoders are reading from the disruptor and then writing back to the very same buffer. This is going to cause issues with multiple writers aka contention on the CPUs writing to memory. One thing I would suggest is to have two disruptor rings:
Producer writes to #1
Decoder reads from #1, performs RS decode and writes the result to #2
Consumer reads from #2, and writes to disk
Assuming your RBs are sufficiently large, this should result in good clean walking through memory.
The key here is not having the Decoder threads (which may be running on a different core) write to the same memory that was just owned by the Producer. With only 2 cores doing this, you will probably see improved throughput unless the disk speed is the bottleneck.
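A minimal sketch of the two-ring idea, assuming the Disruptor 3.x DSL (the event class, ring sizes, and handler bodies are made up for illustration):

import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.util.DaemonThreadFactory;

public class TwoRingPipeline {
    // hypothetical event carrying one frame of bytes
    static class FrameEvent {
        byte[] payload;
    }

    public static void main(String[] args) {
        Disruptor<FrameEvent> decodeRing = new Disruptor<>(FrameEvent::new, 1024, DaemonThreadFactory.INSTANCE);
        Disruptor<FrameEvent> writeRing = new Disruptor<>(FrameEvent::new, 1024, DaemonThreadFactory.INSTANCE);

        // ring #2 consumer: writes decoded frames to disk
        writeRing.handleEventsWith((event, sequence, endOfBatch) -> {
            // write event.payload to the output file here
        });
        RingBuffer<FrameEvent> out = writeRing.start();

        // ring #1 consumers: decode, then publish the result to ring #2
        decodeRing.handleEventsWith((event, sequence, endOfBatch) -> {
            byte[] decoded = event.payload; // RS-decode here
            out.publishEvent((e, seq) -> e.payload = decoded);
        });
        RingBuffer<FrameEvent> in = decodeRing.start();

        // producer: publish raw frames into ring #1
        byte[] raw = new byte[2064];
        in.publishEvent((e, seq) -> e.payload = raw);
    }
}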
I have a blog article here which describes in more detail how to achieve this including sample code. http://fasterjava.blogspot.com.au/2013/04/disruptor-example-udp-echo-service-with.html
Other thoughts
It would also be helpful to know what WaitStrategy you are using, how many physical CPUs are in the machine, etc.
You should be able to significantly reduce CPU utilisation by moving to a different WaitStrategy given that your biggest latency will be IO writes.
Assuming you are using reasonably new hardware, you should be able to saturate the IO devices with only this setup.
You will also need to make sure the files are on different physical devices to achieve reasonable performance.

How to write a Java thread pool program to read the content of a file?

I want to define a thread pool with 10 threads and read the content of a file, but different threads must not read the same content (i.e., divide the content into 10 pieces and have each piece read by one thread).
Well what you would do would be roughly this:
get the length of the file;
divide by N;
create N threads;
have each one skip to (file_size / N) * thread_no and read (file_size / N) bytes into a buffer;
wait for all threads to complete;
stitch the buffers together.
(If you were slightly clever about it, you could avoid the last step ...)
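A minimal sketch of that recipe (the file name is a placeholder, and error handling is kept simple):

import java.io.File;
import java.io.RandomAccessFile;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelRead {
    public static void main(String[] args) throws Exception {
        final String path = "myfile.dat";  // hypothetical input file
        final int n = 10;
        final long size = new File(path).length();
        final long part = size / n;
        final byte[][] buffers = new byte[n][];

        ExecutorService pool = Executors.newFixedThreadPool(n);
        for (int i = 0; i < n; i++) {
            final int threadNo = i;
            final long offset = part * threadNo;
            // the last thread also picks up the remainder of size % n
            final int length = (int) (threadNo == n - 1 ? size - offset : part);
            pool.execute(() -> {
                try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
                    byte[] buf = new byte[length];
                    raf.seek(offset);    // skip to this thread's region
                    raf.readFully(buf);  // read exactly this thread's share
                    buffers[threadNo] = buf;
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES); // wait for all threads to complete
        // buffers[0..n-1] can now be stitched together
    }
}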
HOWEVER, it is doubtful that you would get much speed-up by doing this. Indeed, I wouldn't be surprised if you got a slow down in many cases. With a typical OS, I would expect that you would get as good, if not better performance by reading the file using one big read(...) call from one thread.
The OS can fetch the data faster from the disc if you read it sequentially. Indeed, a lot of OSes optimize for this use-case, and use read-ahead and in-memory buffering (using OS-level buffers) to give high effective file read rates.
Reading a file with multiple threads means that each thread will typically be reading from a different position in the file. Naively, that would require the OS to seek the disk heads backwards and forwards between the different positions, which would slow down I/O considerably. In practice, the OS does various things to mitigate that, but even so, simultaneously reading data from different positions on a disk is still bad for I/O throughput.

I want to read a big text file

I want to read a big text file, so I decided to create four threads that each read 25% of the file
and then join them.
But it's not noticeably faster.
Can anyone tell me whether I can use concurrent programming for this?
My file structure has data like:
name contact company policyname policynumber uniqueno
and I want to put all the data into a HashMap at the end.
Thanks
Reading a large file is typically limited by I/O performance, not by CPU time. You can't speed up the reading by dividing into multiple threads (it will rather decrease performance, since it's still the same file, on the same drive). You can use concurrent programming to process the data, but that can only improve performance after reading the file.
You may, however, have some luck by dedicating one single thread to reading the file, and delegate the actual processing from this thread to worker threads, whenever a data unit has been read.
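A hedged sketch of that pattern, assuming the field layout from the question: one thread reads lines, a small pool parses them into a ConcurrentHashMap. In practice you would batch lines rather than submit one task per line.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ReadAndParse {
    public static void main(String[] args) throws Exception {
        Map<String, String> records = new ConcurrentHashMap<>();
        ExecutorService workers = Executors.newFixedThreadPool(4);

        // one thread reads; the workers parse and fill the map
        try (BufferedReader reader = new BufferedReader(new FileReader("big.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                final String l = line;
                workers.execute(() -> {
                    // assumed layout: name contact company policyname policynumber uniqueno
                    String[] f = l.split("\\s+");
                    if (f.length >= 6) {
                        records.put(f[5], l); // key by uniqueno, keep the whole record
                    }
                });
            }
        }
        workers.shutdown();
    }
}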
If it is a big file, chances are that it is written to disk as one contiguous piece, and "streaming" the data would be faster than parallel reads, as those would start moving the heads back and forth. To know what is fastest you need intimate knowledge of your target production environment, because on high-end storage the data will likely be distributed over multiple disks and parallel reads might be faster.
The best approach, I think, is to read the file in large chunks into memory and make it available as a ByteArrayInputStream for parsing.
Quite likely you will peg the CPU during parsing and handling of the data. Maybe a parallel map-reduce could help spread the load over all the cores.
You might want to use Memory-mapped file buffers (NIO) instead of plain java.io.
Well, you might flush the disk cache and put high contention on the synchronization of the hashmap if you do it like that. I would suggest that you simply make sure you have buffered the stream properly (possibly with a large buffer size). Use the BufferedReader(Reader in, int sz) constructor to specify the buffer size.
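For example, a large buffer can be specified like this (the file name and size are placeholders):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class BigBufferRead {
    public static void main(String[] args) throws IOException {
        // 1 MB buffer instead of the default 8 KB
        try (BufferedReader in = new BufferedReader(new FileReader("big.txt"), 1 << 20)) {
            String line;
            while ((line = in.readLine()) != null) {
                // parse the line here
            }
        }
    }
}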
If the bottleneck is not parsing the lines (that is, the bottleneck is not CPU usage), you should not parallelize the task in the way described.
You could also look into memory-mapped files (available through the nio package), but that's probably only useful if you want to read and write files efficiently. A tutorial with source code is available here: http://www.linuxtopia.org/online_books/programming_books/thinking_in_java/TIJ314_029.htm
Well, you can get help from the link below:
http://java.sun.com/developer/technicalArticles/Programming/PerfTuning/
or by using a large buffer,
or by using this:
import java.io.*;

public class line1 {
    public static void main(String args[]) {
        if (args.length != 1) {
            System.err.println("missing filename");
            System.exit(1);
        }
        try {
            FileInputStream fis = new FileInputStream(args[0]);
            BufferedInputStream bis = new BufferedInputStream(fis);
            // note: DataInputStream.readLine() is deprecated;
            // new BufferedReader(new InputStreamReader(bis)) is the usual replacement
            DataInputStream dis = new DataInputStream(bis);
            int cnt = 0;
            while (dis.readLine() != null)
                cnt++;
            dis.close();
            System.out.println(cnt); // print the line count
        } catch (IOException e) {
            System.err.println(e);
        }
    }
}
