Merge big file parts faster in java

Merge big file parts faster in java - java

I'm writing a java rest service to support parallel upload of parts of a large file. I am writing these parts in separate files and merging them using file channel. I have a sample implemented in Golang, it does the same but when it merges the parts, it takes no time. When I use file channel or read from one stream and write to the final file, it takes long time. The difference I think is, Golang has ability to keep the data on the disk as it is and just merge them by not actually moving the data. Is there any way I can do the same in java?
Here is my code that merges parts, I loop through this method for all parts:
private void mergeFileUsingChannel(String destinationPath, String sourcePath, long partSize, long offset) throws Exception{
FileChannel outputChannel = null;
FileChannel inputChannel = null;
try{
outputChannel = new FileOutputStream(new File(destinationPath)).getChannel();
outputChannel.position(offset);
inputChannel = new FileInputStream(new File(sourcePath)).getChannel();
inputChannel.transferTo(0, partSize, outputChannel);
}catch(Exception e){
e.printStackTrace();
}
finally{
if(inputChannel != null)
inputChannel.close();
if(outputChannel != null){
outputChannel.close();
}
}
}

The documentation of FileChannel transferTo states:
"Many operating systems can transfer bytes directly from the filesystem cache to the target channel without actually copying them."
So the code you have written is correct, and the inefficiency you are seeing is probably related to the underlying file-system type.
One small optimization I could suggest would be to open the file in append mode.
"Whether the advancement of the position and the writing of the data are done in a single atomic operation is system-dependent"
Beyond that, you may have to think of a way to work around the problem. For example, by creating a large enough contiguous file as a first step.
EDIT: I also noticed that you are not explicitly closing your FileOutputStream. It would be best to hang on to that and close it, so that all the File Descriptors are closed.

Related

Java IO: Reading a file that is still being written

I am creating a program which needs to read from a file that is still being written.
The main question is this: If the read and write will be performed using InputStream and OutputStream classes running on a separate thread, what are the catches and edge cases that I will need to be aware of in order to prevent data corruption?
In case anyone is wondering if I have considered other, non-InputStream based approach, the answer is yes, I have but unfortunately it's not possible in this project since the program uses libraries that only works with InputStream and OutputStream.
Also, several readers have asked why this complications is necessary. Why not perform reading after the file has been written completely?
The reason is efficiency. The program will perform the following
Download a series of byte chunks of 1.5MB each. The program will receive thousands of such chunks that can total up to 30GB. Also, chunks are downloaded concurrently in order to maximize bandwidth, so they may arrive out of order.
The program will send each chunk for processing as soon as they have arrived. Please note that they will be sent for processing in order. If chunk m arrives before chunk m-1 does, they will be buffered on disk until chunk m-1 arrives and is sent for processing.
perform processing of these chunks starting from chunk 0 up to chunk n until every chunks has been processed
Resend the processed result back.
If we are to wait for the whole file to be transferred, it will introduce a huge delay on what is supposed to be a real-time system.

Use a RandomAccessFile. Via a getChannel or such one could use a ByteBuffer.
You will not be able to "insert" or "delete" middle parts of the file. For such a purpose your original approach would be fine, but using two files.
For concurrency: to keep in synch you could maintain one single object model of the file, do changes there. Only the pending changes need to be kept in memory, other hierarchical data could be reread and reparsed as needed.

So your problem (as you've cleared it up now) is that you can't start processing until chunk#1 has arrived, and you need to buffer every chunk#N (N > 1) until you can process them.
I would write each chunk to their own file and create a custom InputStream that will read every chunk in order. While downloading the chunkfile would be named something like chunk.1.downloading and when the whole chunk is loaded it will be renamed to chunk.1.
The custom InputStream will check to see if file chunk.N exists (where N = 1...X). If not, it will block. Each time a chunk has been downloaded completely, the InputStream is notified, it will check if the downloaded chunk was the next one to be processed. If yes, read as normally, otherwise block again.

You should use PipedInputStream and PipedOutputStream:
static Thread newCopyThread(InputStream is, OutputStream os) {
Thread t = new Thread() {
#Override
public void run() {
byte[] buffer = new byte[2048];
try {
while (true) {
int size = is.read(buffer);
if (size < 0) break;
os.write(buffer, 0, size);
}
is.close();
os.close();
} catch (IOException e) {
e.printStackTrace();
} finally {
}
}
};
return t;
}
public void main(String[] args) throws IOException, InterruptedException {
ByteArrayInputStream bi = new ByteArrayInputStream("abcdefg".getBytes());
PipedInputStream is = new PipedInputStream();
PipedOutputStream os = new PipedOutputStream(is);
Thread p = newCopyThread(bi, os);
Thread c = newCopyThread(is, System.out);
p.start();
c.start();
p.join();
c.join();
}

JAI create seems to leave file descriptors open

I have some old code that was working until recently, but seems to barf now that it runs on a new server using OpenJDK 6 rather than Java SE 6.
The problem seems to revolve around JAI.create. I have jpeg files which I scale and convert to png files. This code used to work with no leaks, but now that the move has been made to a box running OpenJDK, the file descriptors seem to never close, and I see more and more tmp files accumulate in the tmp directory on the server. These are not files I create, so I assume it is JAI that does it.
Another reason might be the larger heap size on the new server. If JAI cleans up on finalize, but GC happens less frequently, then maybe the files pile up because of that. Reducing the heap size is not an option, and we seem to be having unrelated issues with increasing ulimit.
Here's an example of a file that leaks when I run this:
/tmp/imageio7201901174018490724.tmp
Some code:
// Processor is an internal class that aggregates operations
// performed on the image, like resizing
private byte[] processImage(Processor processor, InputStream stream) {
byte[] bytes = null;
SeekableStream s = null;
try {
// Read the file from the stream
s = SeekableStream.wrapInputStream(stream, true);
RenderedImage image = JAI.create("stream", s);
BufferedImage img = PlanarImage.wrapRenderedImage(image).getAsBufferedImage();
// Process image
if (processor != null) {
image = processor.process(img);
}
// Convert to bytes
bytes = convertToPngBytes(image);
} catch (Exception e){
// error handling
} finally {
// Clean up streams
IOUtils.closeQuietly(stream);
IOUtils.closeQuietly(s);
}
return bytes;
}
private static byte[] convertToPngBytes(RenderedImage image) throws IOException {
ByteArrayOutputStream out = null;
byte[] bytes = null;
try {
out = new ByteArrayOutputStream();
ImageIO.write(image, "png", out);
bytes = out.toByteArray();
} finally {
IOUtils.closeQuietly(out);
}
return bytes;
}
My questions are:
Has anyone run into this and solved it? Since the tmp files created are not mine, I don't know what their names are and thus can't really do anything about them.
What're some of the libraries of choice for resizing and reformatting images? I heard of Scalr - anything else I should look into?
I would rather not rewite the old code at this time, but if there is no other choice...
Thanks!

Just a comment on the temp files/finalizer issue, now that you seem to have solved the root of the problem (too long for a comment, so I'll post it as an answer... :-P):
The temp files are created by ImageIO's FileCacheImageInputStream. These instances are created whenever you call ImageIO.createImageInputStream(stream) and the useCache flag is true (the default). You can set it to false to disable the disk caching, at the expense of in-memory caching. This might make sense as you have a large heap, but probably not if you are processing very large images.
I also think you are (almost) correct about the finalizer issue. You'll find the following ´finalize´ method on FileCacheImageInputStream (Sun JDK 6/1.6.0_26):
protected void finalize() throws Throwable {
// Empty finalizer: for performance reasons we instead use the
// Disposer mechanism for ensuring that the underlying
// RandomAccessFile is closed/deleted prior to garbage collection
}
There's some quite "interesting" code in the class' constructor, that sets up automatic stream closing and disposing when the instance is finalized (should client code forget to do so). This might be different in the OpenJDK implentation, at least it seems kind of hacky. It's also unclear to me at the moment exactly what "performance reasons" we are talking about...
In any case, it seems calling close on the ImageInputStream instance, as you now do, will properly close the file descriptor and delete the temp file.

Found it!
So a stream gets wrapped by another stream in a different area in the code:
iis = ImageIO.createImageInputStream(stream);
And further down, stream is closed.
This doesn't seem to leak any resources when running with Sun Java, but does seem to cause a leak when running with Open JDK.
I'm not sure why that is (I have not looked at source code to verify, though I have my guesses), but that's what seems to be happening. Once I explicitly closed the wrapping stream, all was well.

How to open a file without saving it to disk

My Question: How do I open a file (in the system default [external] program for the file) without saving the file to disk?
My Situation: I have files in my resources and I want to display those without saving them to disk first. For example, I have an xml file and I want to open it on the user's machine in the default program for reading xml file without saving it to the disk first.
What I have been doing: So far I have just saved the file to a temporary location, but I have no way of knowing when they no longer need the file so I don't know when/if to delete it. Here's my SSCCE code for that (well, it's mostly sscce, except for the resource... You'll have to create that on your own):
package main;
import java.io.*;
public class SOQuestion {
public static void main(String[] args) throws IOException {
new SOQuestion().showTemplate();
}
/** Opens the temporary file */
private void showTemplate() throws IOException {
String tempDir = System.getProperty("java.io.tmpdir") + "\\BONotifier\\";
File parentFile = new File(tempDir);
if (!parentFile.exists()) {
parentFile.mkdirs();
}
File outputFile = new File(parentFile, "template.xml");
InputStream inputStream = getClass().getResourceAsStream("/resources/template.xml");
int size = 4096;
try (OutputStream out = new FileOutputStream(outputFile)) {
byte[] buffer = new byte[size];
int length;
while ((length = inputStream.read(buffer)) > 0) {
out.write(buffer, 0, length);
}
inputStream.close();
}
java.awt.Desktop.getDesktop().open(outputFile);
}
}

Because of this line:
String tempDir = System.getProperty("java.io.tmpdir") + "\\BONotifier\\";
I deduce that you're working on Windows. You can easily make this code multiplatform, you know.
The answer to your question is: no. The Desktop class needs to know where the file is in order to invoke the correct program with a parameter. Note that there is no method in that class accepting an InputStream, which could be a solution.
Anyway, I don't see where the problem is: you create a temporary file, then open it in an editor or whatever. That's fine. In Linux, when the application is exited (normally) all its temporary files are deleted. In Windows, the user will need to trigger the temporary files deletion. However, provided you don't have security constraints, I can't understand where the problem is. After all, temporary files are the operating system's concern.

Depending on how portable your application needs to be, there might be no "one fits all" solution to your problem. However, you can help yourself a bit:
At least under Linux, you can use a pipe (|) to direct the output of one program to the input of another. A simple example for that (using the gedit text editor) might be:
echo "hello world" | gedit
This will (for gedit) open up a new editor window and show the contents "hello world" in a new, unsaved document.
The problem with the above is, that this might not be a platform-independent solution. It will work for Linux and probably OS X, but I don't have a Windows installation here to test it.
Also, you'd need to find out the default editor by yourself. This older question and it's linked article give some ideas on how this might work.

I don't understand your question very well. I can see only two possibilities to your question.
Open an existing file, and you wish to operate on its stream but do not want to save any modifications.
Create a file, so that you could use file i/o to operate on the file stream, but you don't wish to save the stream to file.
In either case, your main motivation is to exploit file i/o existingly available to your discretion and programming pleasure, am I correct?
I have feeling that the question is not that simple and this my answer is probably not the answer you seek. However, if my understanding of the question does coincide with your question ...
If you wish to use Stream io, instead of using FileOutputStream or FileInputStream which are consequent to your opening a File object, why not use non-File InputStream or OutputStream? Your file i/o utilities will finally boil down to manipulating i/o streams anyway.
http://docs.oracle.com/javase/7/docs/api/java/io/OutputStream.html
http://docs.oracle.com/javase/7/docs/api/java/io/InputStream.html
No need to involve temp files.

write file in memory with java.nio?

With nio it is possible to map an existing file in memory. But is it possible to create it only in memory without file on the hard drive ?
I want to mimic the CreateFileMapping windows functions which allow you to write in memory.
Is there an equivalent system in Java ?
The goal is to write in memory in order for another program ( c ) to read it.

Have a look at the following. A file is created but this might be as close as your going to get.
MappedByteBuffer
MappedByteBuffer.load()
FileChannel
FileChannel.map()
Here is a snippet to try and get you started.
filePipe = new File(tempDirectory, namedPipe.getName() + ".pipe");
try {
int pipeSize = 4096;
randomAccessFile = new RandomAccessFile(filePipe, "rw");
fileChannel = randomAccessFile.getChannel();
mappedByteBuffer = fileChannel.map(FileChannel.MapMode.READ_WRITE, 0, pipeSize);
mappedByteBuffer.load();
} catch (Exception e) {
...

Most libraries in Java deal with input and output streams as opposed to java.io.File objects.
Examples: image reading, XML, audio, zip
Where possible, when dealing with I/O, use streams.
This may not be what you want, however, if you need random access to the data.
When using memory mapped files, and you get a MappedByteBuffer from a FileChannel using FileChannel.map(), if you don't need a file just use a ByteBuffer instead, which exists totally in memory. Create one of these using ByteBuffer.allocate() or ByteBuffer.allocateDirect().

Can multiple threads write data into a file at the same time?

If you have ever used a p2p downloading software, they can download a file with multi-threading, and they created only one file, So I wonder how the threads write data into that file. Sequentially or in parallel?
Imagine that you want to dump a big database table to a file, and how to make this job faster?

You can use multiple threads writing a to a file e.g. a log file. but you have to co-ordinate your threads as #Thilo points out. Either you need to synchronize file access and only write whole record/lines, or you need to have a strategy for allocating regions of the file to different threads e.g. re-building a file with known offsets and sizes.
This is rarely done for performance reasons as most disk subsystems perform best when being written to sequentially and disk IO is the bottleneck. If CPU to create the record or line of text (or network IO) is the bottleneck it can help.
Image that you want to dump a big database table to a file, and how to make this job faster?
Writing it sequentially is likely to be the fastest.

Java nio package was designed to allow this. Take a look for example at http://docs.oracle.com/javase/1.5.0/docs/api/java/nio/channels/FileChannel.html .
You can map several regions of one file to different buffers, each buffer can be filled separately by a separate thread.

The synchronized declaration enables doing this. Try the below code which I use in a similar context.
package hrblib;
import java.io.*;
public class FileOp {
static int nStatsCount = 0;
static public String getContents(String sFileName) {
try {
BufferedReader oReader = new BufferedReader(new FileReader(sFileName));
String sLine, sContent = "";
while ((sLine=oReader.readLine()) != null) {
sContent += (sContent=="")?sLine: ("\r\n"+sLine);
}
oReader.close();
return sContent;
}
catch (IOException oException) {
throw new IllegalArgumentException("Invalid file path/File cannot be read: \n" + sFileName);
}
}
static public void setContents(String sFileName, String sContent) {
try {
File oFile = new File(sFileName);
if (!oFile.exists()) {
oFile.createNewFile();
}
if (oFile.canWrite()) {
BufferedWriter oWriter = new BufferedWriter(new FileWriter(sFileName));
oWriter.write (sContent);
oWriter.close();
}
}
catch (IOException oException) {
throw new IllegalArgumentException("Invalid folder path/File cannot be written: \n" + sFileName);
}
}
public static synchronized void appendContents(String sFileName, String sContent) {
try {
File oFile = new File(sFileName);
if (!oFile.exists()) {
oFile.createNewFile();
}
if (oFile.canWrite()) {
BufferedWriter oWriter = new BufferedWriter(new FileWriter(sFileName, true));
oWriter.write (sContent);
oWriter.close();
}
}
catch (IOException oException) {
throw new IllegalArgumentException("Error appending/File cannot be written: \n" + sFileName);
}
}
}

You can have multiple threads write to the same file - but one at a time. All threads will need to enter a synchronized block before writing to the file.
In the P2P example - one way to implement it is to find the size of the file and create a empty file of that size. Each thread is downloading different sections of the file - when they need to write they will enter a synchronized block - move the file pointer using seek and write the contents of the buffer.

What kind of file is this? Why do you need to feed it with more threads? It depends on the characteristics (I don't know better word for it) of the file usage.
Transferring a file from several places over network (short: Torrent-like)
If you are transferring an existing file, the program should
as soon, as it gets know the size of the file, create it with empty content: this prevents later out-of-disk error (if there's not enough space, it will turns out at the creation, before downloading anything of it), also it helps the the performance;
if you organize the transfer well (and why not), each thread will responsible for a distinct portion of the file, thus file writes will be distinct,
even if somehow two threads pick the same portion of the file, it will cause no error, because they write the same data for the same file positions.
Appending data blocks to a file (short: logging)
If the threads just appends fixed or various-lenght info to a file, you should use a common thread. It should use a relatively large write buffer, so it can serve client threads quick (just taking the strings), and flush it out optimal scheduling and block size. It should use dedicated disk or even computer.
Also, there can be several performance issues, that's why are there logging servers around, even expensive commercial ones.
Reading and writing random time, random position (short: database)
It requires complex design, with mutexes etc., I never done this kinda stuff, but I can imagine. Ask Oracle for some tricks :)

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.