Behavior of FileChannel vs. RandomAccessFile - Java

I have a use case where multiple threads write data to the same (pooled) file channel, and each thread has an offset in the file from which it can start writing, up to the length of the data to be written. When I ask the pool for the file channel, it opens the channel in "rw" mode if it is not already open and returns it (the opened file might be fresh, i.e. size = 0); otherwise it returns the cached channel. The problem is that the threads might write data in no particular order, meaning a thread with offset 1,000,000 might start writing before the thread with offset 0. Suppose I open a fresh file (size = 0), and the thread with offset = 1,000,000 starts writing data (using the write(buffer, position) API) before the thread with offset = 0.
My first question: is this allowed at all, or will I get some exception?
Second, if it is allowed: what guarantee is there that my data is correctly written?
Third: when my thread (offset = 1,000,000) is done writing to the file, what will be the content of the empty space (0-999,999)? How will the operating system allocate this intermediate space?

Without actually trying what you're describing, here's an educated guess:
First question: FileChannel is thread safe, and is documented to expand the file size as needed ("The size of the file increases when bytes are written beyond its current size"), so I would think this would be allowed.
Second question: There is no guarantee that your data is correctly written; that's entirely dependent on your skill as a programmer. :)
Third question: I'd expect the byte content of the "empty space" would be OS dependent, but you could write a simple program to test this easily enough.
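To check the third point empirically, here is a minimal sketch along those lines (file name and offsets are made up for illustration) that writes beyond the end of a fresh file and reads a byte back out of the gap. The FileChannel Javadoc leaves the content of that gap unspecified, though on common filesystems it reads back as zeros (the OS often allocates it lazily as a sparse file):
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class SparseWriteTest {
    public static void main(String[] args) throws Exception {
        try (FileChannel ch = new RandomAccessFile("test.bin", "rw").getChannel()) {
            // Positional write far beyond the end of a fresh, empty file.
            ch.write(ByteBuffer.wrap(new byte[] {1, 2, 3, 4}), 1_000_000);
            System.out.println("size = " + ch.size()); // expect 1,000,004

            // Read a byte from the gap (0..999,999). The Javadoc leaves these
            // bytes unspecified; on common filesystems they read back as 0.
            ByteBuffer bb = ByteBuffer.allocate(1);
            ch.read(bb, 500_000);
            System.out.println("gap byte = " + bb.get(0));
        }
    }
}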


Java: many threads writing random bytes to a file simultaneously [just need advice]

I am writing a simple benchmark of sorts in java to test parallelization. The program generates 1000 random bytes in total and writes them to a binary file. It uses different amounts of threads to parallelize the byte generation and writing to disk, and measures the execution time of the overall process for each thread count.
The program splits the entirety of the execution among a specified number of threads - both the generation of byte arrays and the writing of these bytes to a file.
My problem is, I need to have a single binary file at the end. I need advice on how best to make each thread write its random bytes to the same file. Keep in mind I do not care at all what order they end up in. I have three ideas so far:
1) Should I have each thread create an instance of RandomAccessFile each referencing the same empty file on the disk, and have each thread write to the file starting from a different location in a way that they do not overlap? This seems like the best way to truly parallelize the disk writing.
2) Can I pass each thread a reference to some kind of buffered stream object and have each thread send its byte array into this stream? Is there a way to create an object which will just listen for bytes and immediately write them to a file in whatever order it receives them? I am worried that having a single object collect all of the bytes would not truly represent parallelized disk writing.
3) Should I have each thread write its bytes to its own file, and then "merge" its file into the main file?
Thanks for your time! I don't need detailed code examples, just want to get some advice as I work on this to point me in the right direction.
Create a FileOutputStream, get the corresponding FileChannel, and write the data using ByteBuffers.
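A minimal sketch of that suggestion, combined with the non-overlapping-offsets idea from option 1 (thread count, chunk size, and file name are all illustrative): each thread uses the positional write(ByteBuffer, long), so no shared file pointer is involved.
import java.io.FileOutputStream;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.Random;

public class ParallelRandomBytes {
    public static void main(String[] args) throws Exception {
        final int threads = 4;
        final int chunk = 250; // 1000 bytes total, split across 4 threads
        try (FileOutputStream fos = new FileOutputStream("random.bin");
             FileChannel ch = fos.getChannel()) {
            Thread[] workers = new Thread[threads];
            for (int i = 0; i < threads; i++) {
                final long offset = (long) i * chunk;
                workers[i] = new Thread(() -> {
                    byte[] bytes = new byte[chunk];
                    new Random().nextBytes(bytes);
                    try {
                        // Positional write: no shared file pointer, so the
                        // threads' regions cannot interfere with each other.
                        ch.write(ByteBuffer.wrap(bytes), offset);
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
                workers[i].start();
            }
            for (Thread t : workers) {
                t.join();
            }
        }
    }
}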

Splitting a text file into chunks in Java using multiple threads

I have split a text file (50 GB) based on the formula (total file size / split size). Currently the splitting is done sequentially in a single thread. How can I change this code to perform the splitting with multiple threads (i.e., the threads should split the file in parallel and store the chunks in a folder)? I don't want to read the file byte by byte, as it utilizes more CPU. My main goal is to reduce CPU utilization and complete the splitting of the file quickly. I have 8 CPU cores.
Any suggestions? Thanks in advance.
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ExecMap {
    public static void main(String[] args) {
        String filePath = "/home/xm/Downloads/wikipedia_50GB/wikipedia_50GB/file21";
        File file = new File(filePath);
        long splitFileSize = 64 * 1024 * 1024; // 64 MB per split file
        long fileSize = file.length();
        System.out.println(fileSize);
        int mappers = (int) (fileSize / splitFileSize);
        System.out.println(mappers);
        mapSplit(filePath, splitFileSize, mappers, fileSize);
    }

    private static void mapSplit(String filePath, long splitlen, int mappers, long fileSize) {
        ExecutorService executor = Executors.newFixedThreadPool(1);
        executor.submit(() -> {
            long leng = 0;
            int count = 1, data;
            try {
                long startTime = System.currentTimeMillis(); // get the start time
                System.out.println(startTime);
                File filename = new File(filePath);
                InputStream infile = new BufferedInputStream(new FileInputStream(filename));
                data = infile.read();
                while (data != -1) {
                    String name = Thread.currentThread().getName();
                    System.out.println("task started: " + name + " ====Time " + System.currentTimeMillis());
                    filename = new File("/home/xm/Desktop/split/" + "Mapper " + count + ".txt");
                    OutputStream outfile = new BufferedOutputStream(new FileOutputStream(filename));
                    // Copy one split's worth of bytes, one byte at a time.
                    while (data != -1 && leng < splitlen) {
                        outfile.write(data);
                        leng++;
                        data = infile.read();
                    }
                    leng = 0;
                    outfile.close();
                    count++;
                    System.out.println("task finished: " + name);
                }
                infile.close();
                long endTime = System.currentTimeMillis();
                System.out.println(endTime);
                long msec = endTime - startTime;
                System.out.println("Difference in milliseconds: " + msec);
                System.out.println("Difference in seconds: " + msec / 1000);
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
            executor.shutdownNow(); // shut the pool down from within its last task
        });
    }
}
The basic multi-threaded approach is to take a task, divide it into sub-tasks which can be done as individual units of work, and create a thread for each sub-task. This works best when the threads can be independent of each other, do not require any kind of communication, and are not sharing any resources.
House building as an analogy
So if we are building a house, some sub-tasks need to be done in a particular order. A foundation must exist before the house can be built. The walls need to be in place before the roof can be put on.
However some sub-tasks can be done independently. The roof can be shingled while the plumbers are installing the plumbing and the brick layers are bricking up the outside of the house.
Basic thoughts on the problem to be solved
In the case of your file splitting, the basic approach would be to take the task, splitting the file, and divide this into several sub-tasks, assign a portion of the file to be split to each thread.
However this particular task, splitting the file, has a common piece of work that is going to create a bottleneck and may require some kind of synchronization between the threads when they are reading from the original file to be split. Having several threads accessing the same file requires that the file access be done in a manner that the threads can access their assigned portion of the file.
The good news is that since the only thing being shared is the original file and it is only being read from, you do not need to worry about synchronizing the file reads at the Java level.
A first approach
The approach I would consider at first is to divide the number of output files by the number of threads. Each thread would then open the file with its own file reader so that each thread is independent of the other threads with its file I/O. This way though the original file is shared, each thread has its own data about file read position so each thread is reading from the file independently.
Each thread would then create its own set of output files and read from the original file and write to the output file. The threads would create their output files one at a time beginning with their assigned offset within the original file, reading from the original file and writing to the output file.
Doing it this way you maintain the independence of each thread's work. Each thread has its own original-file access data. Each thread has its own assigned region of the original file. Each thread generates its own set of output files. A sketch of this approach appears after the "second approach" section below.
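Here is a rough sketch of such a worker, under the assumption of fixed-size regions; the class name, file names, and buffer size are all hypothetical:
import java.io.FileOutputStream;
import java.io.RandomAccessFile;

// Hypothetical worker for the first approach: each thread opens its own
// reader on the shared input and produces its own set of split files.
class SplitWorker implements Runnable {
    private final String inPath;
    private final long start;      // this thread's region start in the input
    private final long length;     // bytes this thread is responsible for
    private final long splitLen;   // bytes per output file
    private final int firstIndex;  // index of this thread's first split file

    SplitWorker(String inPath, long start, long length, long splitLen, int firstIndex) {
        this.inPath = inPath;
        this.start = start;
        this.length = length;
        this.splitLen = splitLen;
        this.firstIndex = firstIndex;
    }

    @Override
    public void run() {
        try (RandomAccessFile in = new RandomAccessFile(inPath, "r")) {
            in.seek(start); // independent read position per thread
            byte[] buf = new byte[64 * 1024];
            long remaining = length;
            int index = firstIndex;
            while (remaining > 0) {
                long thisSplit = Math.min(splitLen, remaining);
                try (FileOutputStream out = new FileOutputStream("split-" + index++ + ".txt")) {
                    long written = 0;
                    while (written < thisSplit) {
                        int n = in.read(buf, 0, (int) Math.min(buf.length, thisSplit - written));
                        if (n < 0) return; // unexpected end of input
                        out.write(buf, 0, n);
                        written += n;
                    }
                }
                remaining -= thisSplit;
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}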
Other considerations
At the operating system level the file system is shared. So the operating system will need to interleave and multiplex the file system access. For an application such as this where data is being read from a disk file and then immediately written back to another disk file, most of the time the application is waiting for the operating system to perform the I/O operation requested.
For a disk file there are several lower level operations that need to be performed such as: (1) finding the location of the file on the disk, (2) seeking to that location, and (3) reading or writing the requested amount of data. The operating system does all of these things for the application and while the operating system is doing these actions the application waits.
So in your multi-threading application, each thread is asking the operating system for disk I/O so each thread will be spending most of its time waiting for the disk I/O request to be done by the operating system, whether that disk I/O is reading from the original file or writing to a new split file.
Since this disk I/O will probably be the bounding action that will require the most time, the question is whether the time taken can be reduced.
A second approach
So an alternative architecture would be to have a single thread which only reads from the original file and reads in large chunks which are several times the size of the split file size. Then one or more other threads are used to take each chunk and create the output split file.
The single thread reading from the original file reads a split file chunk and gives that chunk to another thread. The other thread then creates the split file and writes the chunk out. While the other thread is performing that sub-task, the single thread reads the next chunk from the original file and uses a second thread to write that chunk out to a split file.
This approach should allow the disk I/O to be more efficient in that large chunks of the original file are being read into memory. The disk I/O with the original large file is done sequentially allowing the operating system to do disk seeks and disk I/O more efficiently.
In the first approach accessing the original file is done randomly which requires that the disk heads which read the data from the disk must be repositioned more often as each thread makes a disk read request.
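Here is a rough sketch of the second approach, with a single reader thread handing chunks to a small writer pool (pool size, chunk size, and file names are illustrative; note that each in-flight chunk holds an entire split's worth of bytes in memory):
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SingleReaderSplit {
    public static void main(String[] args) throws Exception {
        final long splitLen = 64L * 1024 * 1024; // one output file per chunk
        ExecutorService writers = Executors.newFixedThreadPool(2);
        try (FileInputStream in = new FileInputStream("input.txt")) {
            int index = 0;
            while (true) {
                // Single reader: large, sequential reads from the original file.
                byte[] chunk = in.readNBytes((int) splitLen);
                if (chunk.length == 0) break;
                final int i = index++;
                writers.submit(() -> {
                    try (FileOutputStream out = new FileOutputStream("split-" + i + ".txt")) {
                        out.write(chunk);
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
            }
        }
        writers.shutdown();
    }
}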
Final thoughts: test predictions by measuring
In order to determine which of these approaches is actually more efficient, you would have to try both. While you can make a prediction based on a model of how the operating system and the disk hardware work, until you have actually tried and measured the two approaches, you will not know whether one is superior to the other.
And in the end, the most efficient method may be to just have a single thread which reads large chunks of the original file and then writes out the smaller split files.
Possible benefits from multiple threads
On the other hand, if you have several threads which are handed large chunks of the file being split, some of the operating system overhead involved in creating, opening, and closing of files may be scheduled more efficiently with the multiple threads. Using multiple threads may allow the operating system's file management subsystem and the disk I/O routines to schedule the disk I/O more efficiently by choosing among multiple pending disk I/O requests.
Due to the overhead of creating and destroying threads, you would probably want to create a set of working threads at application startup and then assign a particular split file I/O task to the threads. When the thread finishes with that assignment it then waits for another.
You can use RandomAccessFile and seek to skip to a certain position.
This way you can give your executors a start position and an end position, so each executor will work on a small chunk of the file.
But as mentioned, your problem will be disk I/O.
You will not see any advantage from launching multiple threads (as noted by many in comments to the original question) to "split a file in parallel".
Having multiple threads working on parts of a large task in parallel can only speed things up if they are acting independently of each other. Since, in this case, the time-consuming part is reading 50 Gb of file and writing it out as smaller files, and this is not done by Java but by the OS (and ultimately, by the disk driver having to read and later write all those bytes), having multiple threads will only add a small overhead (for thread creation & scheduling), making everything a bit slower.
Additionally, sequential reads and writes in rotating disks (SSDs are exempted from this rule) are much faster than random reads and writes - if many threads are reading and writing from different parts of a disk, throughput will be considerably worse than if a single thread does everything.
Think about it this way - you have a truck driver (the OS+disk) and have to split a big heap of bricks at place A into smaller heaps of bricks at places C, D and E; and bricks can only travel by truck. There is only that truck driver and you, the supervisor giving the orders. Would hiring more supervisors (threads) to give orders in parallel speed things up? No - you would just get in each other's way, and the truck driver, trying to please all of you, would need many more journeys driving smaller loads of bricks to do the same job.

Memory mapped file java NIO

I understand how to create a memory-mapped file, but my question is: let's say that in the following lines:
FileChannel roChannel = new RandomAccessFile(file, "r").getChannel();
ByteBuffer roBuf = roChannel.map(FileChannel.MapMode.READ_ONLY, 0, SIZE);
Where I set SIZE to be 2MB, for example, does this mean that it will only load 2MB of the file, or will it read further in the file and update the buffer as I consume bytes from it?
Where I set SIZE to be 2MB, for example, does this mean that it will only load 2MB of the file, or will it read further in the file and update the buffer as I consume bytes from it?
It will only load the portion of the file specified in your buffer initialization. If you want it to read further you'll need to have some sort of read loop. While I would not go as far as saying this is tricky, if one isn't 100% familiar with the java.io and java.nio APIs involved then the chances of stuffing it up are high. (E.g.: not flipping the buffer; buffer/file edge case mistakes).
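For illustration, a typical fill/flip/drain read loop over a FileChannel might look like this (buffer size is arbitrary):
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

final class ReadLoop {
    // Refill and drain a fixed-size buffer until end-of-file.
    static void readAll(FileChannel ch) throws Exception {
        ByteBuffer buf = ByteBuffer.allocate(2 * 1024 * 1024); // 2 MB window
        while (ch.read(buf) != -1) {
            buf.flip();             // switch from filling to draining
            while (buf.hasRemaining()) {
                byte b = buf.get(); // consume the data here
            }
            buf.clear();            // reset for the next fill
        }
    }
}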
If you are looking for an easy approach to accessing this file in a ByteBuffer, consider using a MappedByteBuffer.
RandomAccessFile raf = new RandomAccessFile(file, "r");
FileChannel fc = raf.getChannel();
MappedByteBuffer buffer = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
The nice thing about using an MBB in this context is that it won't necessarily load the entire buffer into memory, but rather only the parts you are accessing.
The size of the buffer is the size you pass in. It will not grow or shrink.
The javadoc says:
Maps a region of this channel's file directly into memory.
...
size - The size of the region to be mapped; must be non-negative and no greater than Integer.MAX_VALUE
EDIT:
Depending on what you mean by "updated with new data", the answer is yes.
The view of a file provided by an instance of this class is guaranteed to be consistent with other views of the same file provided by other instances in the same program. The view provided by an instance of this class may or may not, however, be consistent with the views seen by other concurrently-running programs due to caching performed by the underlying operating system and delays induced by network-filesystem protocols. This is true regardless of the language in which these other programs are written, and whether they are running on the same machine or on some other machine. The exact nature of any such inconsistencies are system-dependent and are therefore unspecified.
So, other systems may do caching, but when those caches are flushed or otherwise up-to-date, they will agree with the view presented by the FileChannel.
You can also use explicit calls to the position method and other methods to change what is presented by the view.
Changing the channel's position, whether explicitly or by reading or writing bytes, will change the file position of the originating object, and vice versa. Changing the file's length via the file channel will change the length seen via the originating object, and vice versa. Changing the file's content by writing bytes will change the content seen by the originating object, and vice versa.

Would FileChannel.read read fewer bytes than specified if there's enough data?

For example I have a file whose content is:
abcdefg
then I use the following code to read 'defg'.
ByteBuffer bb = ByteBuffer.allocate(4);
int read = channel.read(bb, 3);
assert(read == 4);
Since there's adequate data in the file, can I assume so? Can I assume that the method returns a number less than the limit of the given buffer only when there aren't enough bytes in the file?
Can I assume that the method returns a number less than the limit of the given buffer only when there aren't enough bytes in the file?
The Javadoc says:
a read might not fill the buffer
and gives some examples, and
returns the number of bytes read, possibly zero, or -1 if the channel has reached end-of-stream.
This is NOT sufficient to allow you to make that assumption.
In practice, you are likely to always get a full buffer when reading from a file, modulo the end of file scenario. And that makes sense from an OS implementation perspective, given the overheads of making a system call.
But, I can also imagine situations where returning a half-empty buffer might make sense. For example, when reading from a locally-mounted remote file system over a slow network link, there is some advantage in returning a partially filled buffer so that the application can start processing the data. Some future OS may implement the read system call to do that in this scenario. If you assume that you will always get a full buffer, you may get a surprise when your application is run on the (hypothetical) new platform.
Another issue is that there are some kinds of stream where you will definitely get partially filled buffers. Socket streams, pipes and console streams are obvious examples. If you code your application assuming file stream behavior, you could get a nasty surprise when someone runs it against another kind of stream ... and fails.
No, in general you cannot assume that the number of bytes read will be equal to the number of bytes requested, even if there are bytes left to be read in the file.
If you are reading from a local file, chances are that the number of bytes requested will actually be read, but this is by no means guaranteed (and won't likely be the case if you're reading a file over the network).
See the documentation for the ReadableByteChannel.read(ByteBuffer) method (which applies for FileChannel.read(ByteBuffer) as well). Assuming that the channel is in blocking mode, the only guarantee is that at least one byte will be read.
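If you need a full buffer, the usual defensive pattern is to loop until the buffer is filled or end-of-stream is reached. A sketch (the helper name is made up):
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

final class ChannelUtil {
    // Keep issuing reads until the buffer is full or end-of-stream,
    // instead of assuming a single read() fills the buffer.
    static int readFully(FileChannel ch, ByteBuffer bb, long position) throws IOException {
        int total = 0;
        while (bb.hasRemaining()) {
            int n = ch.read(bb, position + total);
            if (n < 0) {
                break; // end-of-stream before the buffer was filled
            }
            total += n;
        }
        return total;
    }
}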
