I have split a text file (50 GB) into pieces based on the formula (total file size / split size). The splitting is currently done sequentially in a single thread. How can I change this code to perform the splitting in multiple threads (i.e. the threads split the file in parallel and store the pieces in a folder)? I don't want to read the whole file into memory, as that would drive up CPU usage. My main goal is to reduce CPU utilization and finish splitting the file as quickly as possible. I have 8 CPU cores.
Any suggestions? Thanks in advance.
import java.io.*;
import java.util.concurrent.*;

public class ExecMap {

    public static void main(String[] args) {
        String filePath = "/home/xm/Downloads/wikipedia_50GB/wikipedia_50GB/file21";
        File file = new File(filePath);
        long splitFileSize = 64 * 1024 * 1024; // 64 MB per split
        long fileSize = file.length();
        System.out.println(fileSize);
        int mappers = (int) (fileSize / splitFileSize);
        System.out.println(mappers);
        mapSplit(filePath, splitFileSize, mappers, fileSize);
    }

    private static void mapSplit(String filePath, long splitlen, int mappers, long fileSize) {
        ExecutorService executor = Executors.newFixedThreadPool(1);
        executor.submit(() -> {
            long leng = 0;
            int count = 1, data;
            try {
                long startTime = System.currentTimeMillis(); // get the start time
                System.out.println(startTime);
                InputStream infile = new BufferedInputStream(new FileInputStream(filePath));
                data = infile.read();
                while (data != -1) {
                    String name = Thread.currentThread().getName();
                    System.out.println("task started: " + name + " ====Time " + System.currentTimeMillis());
                    File outName = new File("/home/xm/Desktop/split/" + "Mapper " + count + ".txt");
                    OutputStream outfile = new BufferedOutputStream(new FileOutputStream(outName));
                    while (data != -1 && leng < splitlen) {
                        outfile.write(data);
                        leng++;
                        data = infile.read();
                    }
                    leng = 0;
                    outfile.close();
                    count++;
                    System.out.println("task finished: " + name);
                }
                infile.close();
                long endTime = System.currentTimeMillis();
                System.out.println(endTime);
                long msec = endTime - startTime;
                System.out.println("Difference in milliseconds: " + msec);
                System.out.println("Difference in seconds: " + msec / 1000);
            } catch (IOException e) {
                e.printStackTrace();
            }
            executor.shutdownNow();
        });
    }
}
The basic multi-threaded approach is to take a task, divide it into sub-tasks that can each be done as an individual unit of work, and create a thread for each sub-task. This works best when the threads are independent of each other, require no communication, and share no resources.
House building as an analogy
So if we are building a house, some sub-tasks need to be done in a particular order. A foundation must exist before the house can be built. The walls need to be in place before the roof can be put on.
However some sub-tasks can be done independently. The roof can be shingled while the plumbers are installing the plumbing and the brick layers are bricking up the outside of the house.
Basic thoughts on the problem to be solved
In the case of your file splitting, the basic approach would be to take the task, splitting the file, and divide this into several sub-tasks, assign a portion of the file to be split to each thread.
However this particular task, splitting the file, has a common piece of work, reading the original file, that creates a bottleneck and may require some synchronization between the threads. With several threads accessing the same file, the access must be arranged so that each thread can reach its assigned portion of the file.
The good news is that since the only thing being shared is the original file and it is only being read from, you do not need to worry about synchronizing the file reads at the Java level.
A first approach
The approach I would consider first is to divide the output files among the threads. Each thread would open the original file with its own file reader, so each thread is independent of the others in its file I/O. Although the original file is shared, each thread keeps its own read position and therefore reads from the file independently.
Each thread would then create its own set of output files, reading from the original file and writing to an output file. The threads would create their output files one at a time, beginning at their assigned offset within the original file.
Doing it this way maintains the independence of each thread's work: each thread has its own file access data, its own assigned region of the original file, and its own set of output files.
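A minimal sketch of this first approach (the class and method names, and the 64 KB copy buffer, are my own choices): each worker opens its own RandomAccessFile, seeks to its assigned offset, and produces its share of the split files.

```java
import java.io.*;

public class RegionSplitter {
    // Copies `fileCount` split files of `splitLen` bytes each, starting at
    // `startOffset` in the source. Each worker thread calls this with its own
    // region, so every thread has an independent RandomAccessFile and position.
    static void splitRegion(String srcPath, String outDir, long startOffset,
                            int fileCount, long splitLen, int firstIndex) throws IOException {
        try (RandomAccessFile in = new RandomAccessFile(srcPath, "r")) {
            in.seek(startOffset);                          // jump to this thread's region
            byte[] buf = new byte[64 * 1024];
            for (int i = 0; i < fileCount; i++) {
                long remaining = splitLen;
                try (OutputStream out = new BufferedOutputStream(new FileOutputStream(
                        outDir + "/Mapper" + (firstIndex + i) + ".txt"))) {
                    while (remaining > 0) {
                        int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
                        if (n == -1) return;               // ran off the end of the source
                        out.write(buf, 0, n);
                        remaining -= n;
                    }
                }
            }
        }
    }
}
```

Each thread would then be submitted to an executor with its own (startOffset, fileCount, firstIndex) triple, e.g. thread t handling output files t * filesPerThread through (t + 1) * filesPerThread - 1.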
Other considerations
At the operating system level the file system is shared. So the operating system will need to interleave and multiplex the file system access. For an application such as this where data is being read from a disk file and then immediately written back to another disk file, most of the time the application is waiting for the operating system to perform the I/O operation requested.
For a disk file there are several lower level operations that need to be performed such as: (1) finding the location of the file on the disk, (2) seeking to that location, and (3) reading or writing the requested amount of data. The operating system does all of these things for the application and while the operating system is doing these actions the application waits.
So in your multi-threading application, each thread is asking the operating system for disk I/O so each thread will be spending most of its time waiting for the disk I/O request to be done by the operating system, whether that disk I/O is reading from the original file or writing to a new split file.
Since this disk I/O will probably be the bounding action that will require the most time, the question is whether the time taken can be reduced.
A second approach
So an alternative architecture would be to have a single thread which only reads from the original file and reads in large chunks which are several times the size of the split file size. Then one or more other threads are used to take each chunk and create the output split file.
The single thread reading from the original file reads a split file chunk and gives that chunk to another thread. The other thread then creates the split file and writes the chunk out. While the other thread is performing that sub-task, the single thread reads the next chunk from the original file and uses a second thread to write that chunk out to a split file.
This approach should allow the disk I/O to be more efficient in that large chunks of the original file are being read into memory. The disk I/O with the original large file is done sequentially allowing the operating system to do disk seeks and disk I/O more efficiently.
In the first approach accessing the original file is done randomly which requires that the disk heads which read the data from the disk must be repositioned more often as each thread makes a disk read request.
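A sketch of this second approach, assuming each split file fits comfortably in memory (class and method names are mine): a single loop reads sequential chunks, and a small pool of writer threads turns each chunk into a split file.

```java
import java.io.*;
import java.util.concurrent.*;

public class ChunkedSplitter {
    // One thread reads sequential chunks; a small pool of writers creates the split files.
    static void split(String srcPath, String outDir, int chunkSize, int writerThreads)
            throws IOException, InterruptedException {
        ExecutorService writers = Executors.newFixedThreadPool(writerThreads);
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(srcPath)))) {
            long remaining = new File(srcPath).length();
            int index = 0;
            while (remaining > 0) {
                int size = (int) Math.min(chunkSize, remaining);
                byte[] chunk = new byte[size];
                in.readFully(chunk);              // sequential read on the single reader thread
                remaining -= size;
                final int fileNo = index++;
                writers.execute(() -> {           // writing happens off the reader thread
                    try (OutputStream out = new FileOutputStream(
                            outDir + "/Mapper" + fileNo + ".txt")) {
                        out.write(chunk);
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });
            }
        }
        writers.shutdown();
        writers.awaitTermination(1, TimeUnit.HOURS);
    }
}
```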
Final thoughts: test predictions by measuring
In order to determine which of these approaches is actually more efficient, you would have to try both. While you can make a prediction based on a model of how the operating system and the disk hardware work, until you have actually tried and measured the two approaches, you will not know whether one is superior to the other.
And in the end, the most efficient method may be to just have a single thread which reads large chunks of the original file and then writes out the smaller split files.
Possible benefits from multiple threads
On the other hand, if you have several threads which are handed large chunks of the file being split, some of the operating system overhead involved in creating, opening, and closing of files may be scheduled more efficiently with the multiple threads. Using multiple threads may allow the operating system's file management subsystem and the disk I/O routines to schedule the disk I/O more efficiently by choosing among multiple pending disk I/O requests.
Due to the overhead of creating and destroying threads, you would probably want to create a set of working threads at application startup and then assign a particular split file I/O task to the threads. When the thread finishes with that assignment it then waits for another.
You can use RandomAccessFile and use seek to skip to a certain position.
This way you can give your executors a start position and an end position so each executor will work on a small chunk of the file
But as it was mentioned your problem will be Disk I/O
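For instance, a worker given a [start, end) byte range might look something like this (the SplitTask name and the 8 KB buffer are illustrative):

```java
import java.io.*;

// Hypothetical worker: copies the byte range [start, end) of the source file
// into one split file. Each executor task opens its own RandomAccessFile.
class SplitTask implements Runnable {
    private final String src, dest;
    private final long start, end;

    SplitTask(String src, String dest, long start, long end) {
        this.src = src; this.dest = dest; this.start = start; this.end = end;
    }

    public void run() {
        try (RandomAccessFile in = new RandomAccessFile(src, "r");
             OutputStream out = new BufferedOutputStream(new FileOutputStream(dest))) {
            in.seek(start);                                // skip straight to this chunk
            byte[] buf = new byte[8192];
            long left = end - start;
            while (left > 0) {
                int n = in.read(buf, 0, (int) Math.min(buf.length, left));
                if (n == -1) break;
                out.write(buf, 0, n);
                left -= n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```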
You will not see any advantage from launching multiple threads (as noted by many in comments to the original question) to "split a file in parallel".
Having multiple threads working on parts of a large task in parallel can only speed things up if they are acting independently of each other. Since, in this case, the time-consuming part is reading 50 Gb of file and writing it out as smaller files, and this is not done by Java but by the OS (and ultimately, by the disk driver having to read and later write all those bytes), having multiple threads will only add a small overhead (for thread creation & scheduling), making everything a bit slower.
Additionally, sequential reads and writes in rotating disks (SSDs are exempted from this rule) are much faster than random reads and writes - if many threads are reading and writing from different parts of a disk, throughput will be considerably worse than if a single thread does everything.
Think about it this way: you have a truck driver (the OS + disk) and have to split a big heap of bricks at place A into smaller heaps of bricks at places C, D and E; bricks can only travel by truck. There is only that one truck driver and you, the supervisor giving the orders. Would hiring more supervisors (threads) to give orders in parallel speed things up? No. You would just get in each other's way, and the truck driver, trying to please all of you, would need many more journeys carrying smaller loads of bricks to do the same job.
Related
I want to define a thread pool with 10 threads and read the content of a file, but different threads must not read the same content (i.e. divide the content into 10 pieces and have each piece read by one thread).
Well what you would do would be roughly this:
get the length of the file,
divide by N.
create N threads
have each one skip to (file_size / N) * thread_no and read (file_size / N) bytes into a buffer
wait for all threads to complete.
stitch the buffers together.
(If you were slightly clever about it, you could avoid the last step ...)
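The steps above can be sketched roughly as follows, assuming the file fits in a single array (class and method names are mine). Each thread writes its slice directly into one shared buffer, which is also the trick that avoids the final stitching step:

```java
import java.io.*;

public class ParallelRead {
    // Reads a file with nThreads, each thread filling its own slice of one
    // shared byte array, so no stitching is needed afterwards.
    static byte[] readAll(String path, int nThreads) throws Exception {
        long size = new File(path).length();
        byte[] result = new byte[(int) size];              // assumes the file fits in one array
        Thread[] threads = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            final int from = (int) (size * t / nThreads);
            final int to = (int) (size * (t + 1) / nThreads); // last slice absorbs the remainder
            threads[t] = new Thread(() -> {
                try (RandomAccessFile in = new RandomAccessFile(path, "r")) {
                    in.seek(from);                          // skip to this thread's slice
                    in.readFully(result, from, to - from);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            threads[t].start();
        }
        for (Thread t : threads) t.join();                  // wait for all threads to complete
        return result;
    }
}
```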
HOWEVER, it is doubtful that you would get much speed-up by doing this. Indeed, I wouldn't be surprised if you got a slow down in many cases. With a typical OS, I would expect that you would get as good, if not better performance by reading the file using one big read(...) call from one thread.
The OS can fetch the data faster from the disc if you read it sequentially. Indeed, a lot of OSes optimize for this use-case, and use read-ahead and in-memory buffering (using OS-level buffers) to give high effective file read rates.
Reading a file with multiple threads means that each thread will typically be reading from a different position in the file. Naively, that would require the OS to seek the disk heads backwards and forwards between the different positions, which slows down I/O considerably. In practice, the OS will do various things to mitigate this, but even so, simultaneously reading data from different positions on a disk is still bad for I/O throughput.
I have a java application which is started and stopped multiple times per second over hundreds of millions of items (called from an external script).
Input: String key
Output: int value
The purpose of this application is to look for a certain key in a never ever ever changing Map (~30k keys) and to return the value. Very easy.
Question: what is more efficient when used multiple times per second:
hard-coded dictionary in a Map
Read an external file with a BufferedReader
...amaze me with your other ideas
I know hard-coding is evil but sometimes, you need to be evil to be efficient :-)
Read in the dictionary from file. Store it in a Map. Set up your Java application as a service that runs continuously (since you said it gets called many times per second). Then your Map will be cached in RAM.
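A minimal sketch of that idea, assuming a tab-separated key/value format (the LookupTable class and the format are my own illustration): load once at service startup, and every subsequent lookup is a plain in-memory HashMap hit.

```java
import java.io.*;
import java.util.*;

public class LookupTable {
    private final Map<String, Integer> map = new HashMap<>();

    // Assumes one "key<TAB>value" pair per line; adjust the separator to your format.
    LookupTable(Reader source) throws IOException {
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                map.put(parts[0], Integer.parseInt(parts[1].trim()));
            }
        }
    }

    // Every call after startup is a plain in-memory HashMap lookup.
    int lookup(String key) {
        return map.getOrDefault(key, -1);   // -1 is an arbitrary "not found" marker
    }
}
```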
The fastest is a hard coded map in memory.
If you have a huge file you can use a memory-mapped file:
MappedByteBuffer in = new FileInputStream("map.txt").getChannel()
        .map(FileChannel.MapMode.READ_ONLY, 0, LENGTH);
StringBuilder bs = new StringBuilder();
int i = 0;
// read 1/4 of the file
while (i < LENGTH / 4)
    bs.append((char) in.get(i++));
This approach is a bit problematic, though; in practice you will want to partition the file on line breaks, i.e. read until the 100th line, clear the buffer, and read some more.
I would load the file into a Map at startup of the application, and then use it as you describe.
I would store the data in a database for faster load times.
Definitely do not have the application startup and shutdown every time it is called; have it as a service that waits for IO, using Asynchronous I/O, such as netty
Say you have a file bigger than the memory you have available to handle it. You'd like to read the file n bytes at a time and not get blocked in the process:
read a block
pass it to a thread
read another block
pass it to a thread
I tried different things with varying success, however blocking always seem to be the issue.
Please provide an example of a non-blocking way to gain access to, say byte[]
You can't.
You will always block while waiting for the disk to provide you with data. If you have a lot of work to do with each chunk of data, then using a second thread may help: that thread can perform CPU-intensive work on the data while the first thread is blocked waiting for the next read to complete.
But that doesn't sound like your situation.
Your best bet is to read data in as large a block as you possibly can (say, 1MB or more). This minimizes the time blocked in the kernel, and may result in less time waiting for the disk (if the blocks being read happen to be contiguous).
Here's the code:
ExecutorService exec = Executors.newFixedThreadPool(1);
// use RandomAccessFile because it supports readFully()
RandomAccessFile in = new RandomAccessFile("myfile.dat", "r");
in.seek(0L);
while (in.getFilePointer() < in.length())
{
    int readSize = (int) Math.min(1000000, in.length() - in.getFilePointer());
    final byte[] data = new byte[readSize];
    in.readFully(data);
    exec.execute(new Runnable()
    {
        public void run()
        {
            // do something with data
        }
    });
}
in.close();
exec.shutdown();
It sounds like you are looking for Streams, buffering, or some combination of the two (BufferedInputStream anyone?).
Check this out:
http://docs.oracle.com/javase/tutorial/essential/io/buffers.html
This is the standard way to deal with very large files. I apologize if this isn't what you were looking for, but hopefully it'll help get the juices flowing anyway.
Good luck!
If you have a program that does I/O and CPU computations, blocking is inevitable (somewhere in your program) if on average the amount of CPU time it takes to process a byte is less than the time to read a byte.
If you try to read a file and that requires a disk seek, the data might not arrive for 10 ms. A 2 GHz CPU could have done 20 M clock cycles of work in that time.
I have a use-case where multiple threads write data to the same (pooled) file channel. Each thread is given an offset in the file from which it can start writing, plus the length of the data to be written. When I ask the pool for the file channel, it opens the channel in "rw" mode if it is not already open (the file may be fresh, i.e. size = 0) and returns it; otherwise it returns the cached channel. The problem is that the threads may write in no particular order: on a freshly opened empty file, a thread with offset 1,000,000 (using the write(buffer, position) API) might start writing before the thread with offset 0 does.
My first question: is this allowed at all, or will I get some exception?
Secondly, if it is allowed: what is the guarantee that my data is correctly written?
Third: when my offset = 1,000,000 thread is done writing to the file, what will be the content of the empty space (0-999,999)? How does the operating system allocate this intermediate space?
Without actually trying what you're describing, here's an educated guess:
First question: FileChannel is thread safe, and is documented to expand the file size as needed ("The size of the file increases when bytes are written beyond its current size"), so I would think this would be allowed.
Second question: There is no guarantee that your data is correctly written; that's entirely dependent on your skill as a programmer. :)
Third question: I'd expect the byte content of the "empty space" would be OS dependent, but you could write a simple program to test this easily enough.
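Regarding the third question, a quick test along these lines (temp file name and offsets are illustrative) shows that FileChannel does write beyond the current end of file without error; note that the javadoc leaves the bytes in the gap unspecified, though on common filesystems the hole reads back as zeros:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.*;

public class GapTest {
    // Writes 4 bytes at offset 1,000,000 of a fresh, empty file, then reads
    // the whole file back so the gap can be inspected.
    static byte[] writeBeyondEof() throws IOException {
        Path p = Files.createTempFile("gap", ".dat");
        try (FileChannel ch = FileChannel.open(p, StandardOpenOption.WRITE)) {
            // No exception here: the file simply grows to 1,000,004 bytes.
            ch.write(ByteBuffer.wrap("tail".getBytes()), 1_000_000);
        }
        byte[] all = Files.readAllBytes(p);
        Files.delete(p);
        return all;
    }

    public static void main(String[] args) throws IOException {
        byte[] all = writeBeyondEof();
        System.out.println(all.length);   // 1000004
        // The FileChannel javadoc leaves the gap's contents unspecified;
        // on common filesystems the hole reads back as zero bytes.
        System.out.println(all[0]);
    }
}
```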
I want to read a big text file. What I decided is to create four threads, have each one read 25% of the file, and then join them.
But it's not any faster.
Can anyone tell me whether I can use concurrent programming for this?
My file has records structured as
name contact company policyname policynumber uniqueno
and I want to put all the data in a HashMap at the end.
Thanks
Reading a large file is typically limited by I/O performance, not by CPU time. You can't speed up the reading by dividing into multiple threads (it will rather decrease performance, since it's still the same file, on the same drive). You can use concurrent programming to process the data, but that can only improve performance after reading the file.
You may, however, have some luck by dedicating one single thread to reading the file, and delegate the actual processing from this thread to worker threads, whenever a data unit has been read.
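A sketch of that reader/worker split, keyed on the uniqueno column of the record format in the question (the queue capacity, sentinel value, and class name are my own choices): one thread reads lines into a bounded queue, and worker threads parse them into a concurrent map.

```java
import java.io.*;
import java.util.concurrent.*;

public class LineProcessor {
    // One reader thread feeds lines into a bounded queue; worker threads parse
    // each record and insert it into a concurrent map, keyed on the last field.
    static ConcurrentHashMap<String, String[]> process(BufferedReader in, int workers)
            throws IOException, InterruptedException {
        ConcurrentHashMap<String, String[]> result = new ConcurrentHashMap<>();
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(10_000);
        final String POISON = "\u0000EOF";          // sentinel telling a worker to stop
        Thread[] pool = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            pool[i] = new Thread(() -> {
                try {
                    String line;
                    while (!(line = queue.take()).equals(POISON)) {
                        String[] fields = line.split("\\s+");
                        result.put(fields[fields.length - 1], fields); // key on uniqueno
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            pool[i].start();
        }
        String line;
        while ((line = in.readLine()) != null)
            queue.put(line);
        for (int i = 0; i < workers; i++)
            queue.put(POISON);                      // one sentinel per worker
        for (Thread t : pool) t.join();
        return result;
    }
}
```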
If it is a big file chances are that it is written to disk as a contiguous part and "streaming" the data would be faster than parallel reads as this would start moving the heads back and forth. To know what is fastest you need intimate knowledge of your target production environment, because on high end storage the data will likely be distributed over multiple disks and parallel reads might be faster.
The best approach, I think, is to read it into memory in large chunks, making each chunk available as a ByteArrayInputStream for parsing.
Quite likely you will peg the CPU during parsing and handling of the data. Maybe a parallel map-reduce could help here, spreading the load over all cores.
You might want to use Memory-mapped file buffers (NIO) instead of plain java.io.
Well, you might thrash the disk cache and put high contention on the synchronization of the HashMap if you do it like that. I would suggest that you simply make sure you have buffered the stream properly (possibly with a large buffer size). Use the BufferedReader(Reader in, int sz) constructor to specify the buffer size.
If the bottleneck is not parsing the lines (that is, the bottleneck is not CPU usage), you should not parallelize the task in the way described.
You could also look into memory mapped files (available through the nio package), but thats probably only useful if you want to read and write files efficiently. A tutorial with source code is available here: http://www.linuxtopia.org/online_books/programming_books/thinking_in_java/TIJ314_029.htm
Well, you can get some help from the link below:
http://java.sun.com/developer/technicalArticles/Programming/PerfTuning/
Or use a large buffer, or something like this:
import java.io.*;

public class line1 {
    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("missing filename");
            System.exit(1);
        }
        try {
            // BufferedReader.readLine() replaces the deprecated DataInputStream.readLine()
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(new FileInputStream(args[0])));
            int cnt = 0;
            while (reader.readLine() != null)
                cnt++;
            reader.close();
            System.out.println(cnt);
        }
        catch (IOException e) {
            System.err.println(e);
        }
    }
}