Can multiple threads write data into a file at the same time? - java

If you have ever used P2P download software, you know it can download a single file with multiple threads, yet it creates only one file. So I wonder how the threads write data into that file: sequentially, or in parallel?
Imagine that you want to dump a big database table to a file. How would you make this job faster?

You can use multiple threads writing to a file, e.g. a log file, but you have to coordinate your threads as @Thilo points out. Either you need to synchronize file access and write only whole records/lines, or you need a strategy for allocating regions of the file to different threads, e.g. re-building a file with known offsets and sizes.
This is rarely done for performance reasons, as most disk subsystems perform best when written to sequentially and disk IO is the bottleneck. If the CPU work to create each record or line of text (or the network IO) is the bottleneck, it can help.
As for dumping a big database table to a file and making that job faster:
Writing it sequentially is likely to be the fastest.

The Java NIO package was designed to allow this. Take a look, for example, at FileChannel: http://docs.oracle.com/javase/1.5.0/docs/api/java/nio/channels/FileChannel.html
You can map several regions of one file to different buffers, and each buffer can be filled separately by a separate thread.
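For illustration, here is a minimal sketch of that idea (the file name, region sizes, and class name are arbitrary choices, not from the original answer): two halves of one file are mapped and filled by two threads, each touching only its own region.

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedRegions {
    public static void main(String[] args) throws IOException, InterruptedException {
        final long half = 1024; // arbitrary region size for the sketch
        try (FileChannel channel = FileChannel.open(Paths.get("out.bin"),
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer first = channel.map(FileChannel.MapMode.READ_WRITE, 0, half);
            MappedByteBuffer second = channel.map(FileChannel.MapMode.READ_WRITE, half, half);
            Thread t1 = new Thread(() -> fill(first, (byte) 'A'));
            Thread t2 = new Thread(() -> fill(second, (byte) 'B'));
            t1.start();
            t2.start();
            t1.join();
            t2.join();
        }
    }

    private static void fill(MappedByteBuffer buffer, byte value) {
        // Each thread writes only through its own mapped buffer,
        // so no synchronization between the threads is needed.
        while (buffer.hasRemaining()) {
            buffer.put(value);
        }
    }
}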

The synchronized keyword makes this possible. Try the code below, which I use in a similar context.
package hrblib;

import java.io.*;

public class FileOp {

    static int nStatsCount = 0;

    public static String getContents(String sFileName) {
        try {
            BufferedReader oReader = new BufferedReader(new FileReader(sFileName));
            String sLine, sContent = "";
            while ((sLine = oReader.readLine()) != null) {
                // Note: compare strings with isEmpty()/equals(), not ==
                sContent += sContent.isEmpty() ? sLine : ("\r\n" + sLine);
            }
            oReader.close();
            return sContent;
        } catch (IOException oException) {
            throw new IllegalArgumentException("Invalid file path/File cannot be read: \n" + sFileName);
        }
    }

    public static void setContents(String sFileName, String sContent) {
        try {
            File oFile = new File(sFileName);
            if (!oFile.exists()) {
                oFile.createNewFile();
            }
            if (oFile.canWrite()) {
                BufferedWriter oWriter = new BufferedWriter(new FileWriter(sFileName));
                oWriter.write(sContent);
                oWriter.close();
            }
        } catch (IOException oException) {
            throw new IllegalArgumentException("Invalid folder path/File cannot be written: \n" + sFileName);
        }
    }

    public static synchronized void appendContents(String sFileName, String sContent) {
        try {
            File oFile = new File(sFileName);
            if (!oFile.exists()) {
                oFile.createNewFile();
            }
            if (oFile.canWrite()) {
                // append = true; synchronized keeps concurrent appends from interleaving
                BufferedWriter oWriter = new BufferedWriter(new FileWriter(sFileName, true));
                oWriter.write(sContent);
                oWriter.close();
            }
        } catch (IOException oException) {
            throw new IllegalArgumentException("Error appending/File cannot be written: \n" + sFileName);
        }
    }
}
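For instance, a hypothetical harness (not part of the original class) exercising the synchronized append from several threads could look like this; because appendContents is synchronized, the appended lines never interleave.

import hrblib.FileOp;

public class FileOpDemo {
    public static void main(String[] args) throws InterruptedException {
        Thread[] writers = new Thread[10];
        for (int i = 0; i < writers.length; i++) {
            final int id = i;
            writers[i] = new Thread(() ->
                    FileOp.appendContents("shared.log", "line from thread " + id + "\r\n"));
            writers[i].start();
        }
        for (Thread t : writers) {
            t.join(); // wait for all appends to finish
        }
    }
}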

You can have multiple threads write to the same file, but only one at a time. All threads need to enter a synchronized block before writing to the file.
In the P2P example, one way to implement it is to find the size of the file and create an empty file of that size. Each thread downloads a different section of the file; when it needs to write, it enters a synchronized block, moves the file pointer using seek, and writes the contents of its buffer. A sketch of this scheme follows below.
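Here is a minimal sketch of that scheme (class and method names are made up for illustration), assuming the total file size is known up front:

import java.io.IOException;
import java.io.RandomAccessFile;

public class SegmentWriter {
    private final RandomAccessFile file;
    private final Object lock = new Object();

    public SegmentWriter(String path, long totalSize) throws IOException {
        file = new RandomAccessFile(path, "rw");
        file.setLength(totalSize); // create the empty file at its full size up front
    }

    public void writeSegment(long offset, byte[] data, int length) throws IOException {
        synchronized (lock) {
            file.seek(offset);           // move the file pointer to this segment
            file.write(data, 0, length); // write this thread's buffer
        }
    }

    public void close() throws IOException {
        file.close();
    }
}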

What kind of file is this? Why do you need to feed it with multiple threads? It depends on the usage characteristics of the file.
Transferring a file from several places over a network (in short: torrent-like)
If you are transferring an existing file, the program should:
as soon as it knows the size of the file, create it with empty content: this prevents a later out-of-disk error (if there's not enough space, it turns out at creation time, before downloading any of it), and it also helps performance;
if you organize the transfer well (and why not), each thread will be responsible for a distinct portion of the file, so the file writes will be distinct;
even if two threads somehow pick the same portion of the file, it causes no error, because they write the same data to the same file positions.
Appending data blocks to a file (short: logging)
If the threads just append fixed- or variable-length info to a file, you should use a common writer thread. It should use a relatively large write buffer, so it can serve client threads quickly (just taking their strings) and flush the buffer out with optimal scheduling and block size; a sketch follows below. It should use a dedicated disk or even a dedicated machine.
Also, there can be several performance issues; that's why there are logging servers around, even expensive commercial ones.
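A minimal sketch of such a common writer thread (the class name, queue choice, and 64 KB buffer size are illustrative assumptions):

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class AsyncLogWriter implements Runnable {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final String fileName;

    public AsyncLogWriter(String fileName) {
        this.fileName = fileName;
    }

    public void log(String line) {
        queue.offer(line); // cheap for client threads: just hand over the string
    }

    @Override
    public void run() {
        // One dedicated thread owns the file and a large write buffer.
        try (BufferedWriter out = new BufferedWriter(new FileWriter(fileName, true), 1 << 16)) {
            while (!Thread.currentThread().isInterrupted()) {
                out.write(queue.take()); // blocks until a client supplies a line
                out.newLine();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // shut down quietly
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}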
Reading and writing random time, random position (short: database)
It requires a complex design, with mutexes etc.; I have never done this kind of thing, but I can imagine. Ask Oracle for some tricks :)

Related

Java File is disappearing from the path /tmp/hsperfdata_*username*/

This is a very confusing problem.
We have a Java application (Java 8, running on JBoss 6.4) that loops over a certain number of objects and writes some rows to a File on each round.
On each round we check whether we received the File object as a parameter, and if we did not, we create a new object and create the physical file:
if (file == null) {
    file = new File(filename);
    try {
        file.createNewFile();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
So the idea is that the file gets created only once; after that, the step is skipped and we proceed straight to writing. The variable filename is not a path, just a file name with no path, so the file gets created under the path jboss_root/tmp/hsperfdata_username/.
edit1. I'll also add the methods used for writing here, in case they are relevant:
fw = new FileWriter(indeksiFile, true); // append = true
bw = new BufferedWriter(fw);
out = new PrintWriter(bw);
...
out.println(..)
...
out.flush();
out.close(); // this flushes as well -> line above is useless
So now the problem is that occasionally, though quite rarely, the physical file disappears from the path in the middle of the process. The Java object reference is never lost, but it seems that the file itself disappears, because the code automatically creates the file again at the same path and keeps on writing to it. This would not happen if the condition file == null did not evaluate to true. The effect is obviously that we lose the rows which were written to the previous file. The Java application does not notice any errors and keeps on working.
So, I have three strongly related questions for which I was not able to find an answer on Google.
If we call the method File.createNewFile(), is the resulting file a permanent file in the filesystem or some JVM proxy file?
If it's a permanent file, do you have any idea why it's disappearing? The default behavior in our case is that at some point the file is always deleted from the path. My guess is that the same mechanism is deleting the file too early; I just don't know how to control that mechanism.
My best guess is that this is related to the path jboss_root/tmp/hsperfdata_username/, which is some temp-data folder created by the JVM, and probably there is some default behavior that cleans that path. Am I even close?
Help appreciated! Thanks!
I have never used File.createNewFile in my code: it is not needed. When you afterwards actually write to the file, it probably creates it anew, or appends.
In every case there is a race on the file system, and as these are not atomic actions, you might end up with something unstable.
So you want to write to a file, either appending to an existing file or creating it.
For UTF-8 text:
Path path = Paths.get(filename);
try (PrintWriter out = new PrintWriter(
        Files.newBufferedWriter(path, StandardOpenOption.CREATE, StandardOpenOption.APPEND),
        false)) {
    out.println("Killroy was here");
}
After comment
Honestly, as you are interested in the cause, it is hard to say; an application restart or I/O exceptions would show up in the logs. Add logging to a specific log for the appends to these files, plus a (logged) periodic check for those files' existence.
Safe-guard
Here we are doing repeated physical access to the file system.
To prevent appending to a file twice at the same time (which I would expect to raise an exception), one can make a critical section in some form.
// For 16 semaphores:
final int semaphoreCount = 16;
final int semaphoreMask = 0xF;
Semaphore[] semaphores = new Semaphore[semaphoreCount];
for (int i = 0; i < semaphores.length; ++i) {
    semaphores[i] = new Semaphore(1, true); // FIFO
}

int hash = filename.hashCode() & semaphoreMask; // toLowerCase on Windows
Semaphore semaphore = semaphores[hash];
semaphore.acquire(); // acquire before try, so a failed acquire never triggers release
try {
    // ... append
} finally {
    semaphore.release();
}
File locks would have been a more technical solution, which I would not like to propose.
The best solution, which you perhaps already have, would be to queue messages per file.

Merge big file parts faster in java

I'm writing a Java REST service to support parallel upload of parts of a large file. I write these parts to separate files and merge them using a file channel. I have a sample implemented in Golang that does the same, but when it merges the parts, it takes no time at all. When I use a file channel, or read from one stream and write to the final file, it takes a long time. The difference, I think, is that Golang has the ability to keep the data on the disk as it is and just merge the parts without actually moving the data. Is there any way I can do the same in Java?
Here is my code that merges parts, I loop through this method for all parts:
private void mergeFileUsingChannel(String destinationPath, String sourcePath, long partSize, long offset) throws Exception {
    FileChannel outputChannel = null;
    FileChannel inputChannel = null;
    try {
        outputChannel = new FileOutputStream(new File(destinationPath)).getChannel();
        outputChannel.position(offset);
        inputChannel = new FileInputStream(new File(sourcePath)).getChannel();
        inputChannel.transferTo(0, partSize, outputChannel);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        if (inputChannel != null) {
            inputChannel.close();
        }
        if (outputChannel != null) {
            outputChannel.close();
        }
    }
}
The documentation of FileChannel transferTo states:
"Many operating systems can transfer bytes directly from the filesystem cache to the target channel without actually copying them."
So the code you have written is correct, and the inefficiency you are seeing is probably related to the underlying file-system type.
One small optimization I could suggest would be to open the file in append mode.
"Whether the advancement of the position and the writing of the data are done in a single atomic operation is system-dependent"
Beyond that, you may have to think of a way to work around the problem. For example, by creating a large enough contiguous file as a first step.
EDIT: I also noticed that you are not explicitly closing your FileOutputStream. It would be best to hang on to it and close it, so that all file descriptors are released.
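If you do pre-create a large enough contiguous destination file, a sketch of the per-part copy might look like this (paths and the class name are placeholders; note that FileChannel.open, unlike new FileOutputStream(...), does not truncate an existing file):

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class PartMerger {
    public static void mergePart(String destinationPath, String sourcePath,
                                 long partSize, long offset) throws IOException {
        try (FileChannel out = FileChannel.open(Paths.get(destinationPath),
                     StandardOpenOption.CREATE, StandardOpenOption.WRITE);
             FileChannel in = FileChannel.open(Paths.get(sourcePath),
                     StandardOpenOption.READ)) {
            long transferred = 0;
            // transferTo may copy fewer bytes than requested, so loop until done
            while (transferred < partSize) {
                transferred += in.transferTo(transferred, partSize - transferred,
                        out.position(offset + transferred));
            }
        }
    }
}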

Java IO: Reading a file that is still being written

I am creating a program which needs to read from a file that is still being written.
The main question is this: If the read and write will be performed using InputStream and OutputStream classes running on a separate thread, what are the catches and edge cases that I will need to be aware of in order to prevent data corruption?
In case anyone is wondering if I have considered other, non-InputStream based approach, the answer is yes, I have but unfortunately it's not possible in this project since the program uses libraries that only works with InputStream and OutputStream.
Also, several readers have asked why this complication is necessary. Why not perform the reading after the file has been written completely?
The reason is efficiency. The program performs the following:
Download a series of byte chunks of 1.5MB each. The program will receive thousands of such chunks, which can total up to 30GB. Also, chunks are downloaded concurrently in order to maximize bandwidth, so they may arrive out of order.
The program will send each chunk for processing as soon as they have arrived. Please note that they will be sent for processing in order. If chunk m arrives before chunk m-1 does, they will be buffered on disk until chunk m-1 arrives and is sent for processing.
Perform processing of these chunks, starting from chunk 0 up to chunk n, until every chunk has been processed.
Resend the processed result back.
If we are to wait for the whole file to be transferred, it will introduce a huge delay on what is supposed to be a real-time system.
Use a RandomAccessFile. Via getChannel (or similar) you could use a ByteBuffer.
You will not be able to "insert" or "delete" middle parts of the file. For such a purpose your original approach would be fine, but using two files.
For concurrency: to keep in sync you could maintain one single object model of the file and make changes there. Only the pending changes need to be kept in memory; other hierarchical data can be reread and reparsed as needed.
So your problem (as you've now clarified it) is that you can't start processing until chunk #1 has arrived, and you need to buffer every chunk #N (N > 1) until you can process it.
I would write each chunk to its own file and create a custom InputStream that reads every chunk in order. While downloading, the chunk file would be named something like chunk.1.downloading, and when the whole chunk is loaded it would be renamed to chunk.1.
The custom InputStream checks whether the file chunk.N exists (where N = 1...X). If not, it blocks. Each time a chunk has been downloaded completely, the InputStream is notified; it checks whether the downloaded chunk is the next one to be processed. If yes, it reads as normal; otherwise it blocks again.
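A rough sketch of such a stream (totalChunks and the 100 ms polling wait are illustrative simplifications; the notification described above could replace the polling):

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ChunkSequenceInputStream extends InputStream {
    private final File directory;
    private final int totalChunks;
    private int nextChunk = 1;
    private InputStream current;

    public ChunkSequenceInputStream(File directory, int totalChunks) {
        this.directory = directory;
        this.totalChunks = totalChunks;
    }

    @Override
    public int read() throws IOException {
        while (true) {
            if (current == null) {
                if (nextChunk > totalChunks) {
                    return -1; // every chunk has been consumed
                }
                File chunk = new File(directory, "chunk." + nextChunk);
                while (!chunk.exists()) { // block until chunk.N.downloading is renamed
                    try {
                        Thread.sleep(100);
                    } catch (InterruptedException e) {
                        throw new IOException("interrupted waiting for " + chunk, e);
                    }
                }
                current = new FileInputStream(chunk);
            }
            int b = current.read();
            if (b >= 0) {
                return b;
            }
            current.close(); // this chunk is exhausted, move on to the next
            current = null;
            nextChunk++;
        }
    }
}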
You should use PipedInputStream and PipedOutputStream:
static Thread newCopyThread(InputStream is, OutputStream os) {
    Thread t = new Thread() {
        @Override
        public void run() {
            byte[] buffer = new byte[2048];
            try {
                while (true) {
                    int size = is.read(buffer);
                    if (size < 0) break;
                    os.write(buffer, 0, size);
                }
                is.close();
                os.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    };
    return t;
}

public static void main(String[] args) throws IOException, InterruptedException {
    ByteArrayInputStream bi = new ByteArrayInputStream("abcdefg".getBytes());
    PipedInputStream is = new PipedInputStream();
    PipedOutputStream os = new PipedOutputStream(is);
    Thread p = newCopyThread(bi, os);
    Thread c = newCopyThread(is, System.out);
    p.start();
    c.start();
    p.join();
    c.join();
}

How does other applications handle large text files without having a large memory foot print?

I need to know how applications like Bairtail or Baregrep can handle such large text files without having a huge memory footprint.
I am trying to do something similar in Java as in question:
Viewing large log files in JavaFX in a ListView
But when I handle large text log files (900 MB up to 2.5 GB of text) I run into issues. The JVM memory size increases dramatically when I read the text files.
One other way is to retrieve only the lines that I am interested in, but I am not aware of any technique for doing this in Java. I have to start reading line by line until I get to the required line (let's say line 1000) and then grab hold of that text. But in doing so I have 999 lines in memory waiting to be GC'd.
Bairgrep, for instance, scans multiple files in a folder and looks for a pattern. If I open the task manager, I can hardly see the memory footprint growing. What type of technology or way of scanning are these programs using?
Is there a technology out there that I can use in my application to handle large text files?
I might add that my log files are generated by a Java application and the length of each line is not the same.
One correction... by memory footprint I mean that I cannot read a 6 GB file into memory, even if I specify the VM size with -Xmx to be small: the application runs out of memory when reading the 6 GB file.
Here are 2 ways I tried to get the text from the 758 MB log file.
Method 1
@FXML
private void handleButtonAction(ActionEvent event) {
    final String fileName = "D:/Development/Logs/File1.log";
    try {
        BufferedReader in = new BufferedReader(new FileReader(fileName));
        while (in.ready()) {
            String s = in.readLine();
        }
        in.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
Method 2
@FXML
private void handleButtonAction(ActionEvent event) {
    final String fileName = "D:/Development/Logs/File1.log";
    Scanner scan = null;
    try {
        File file = new File(fileName);
        if (!file.exists()) {
            return;
        }
        scan = new Scanner(file);
        long start = System.nanoTime();
        while (scan.hasNextLine()) {
            final String line = scan.nextLine();
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        if (scan != null) {
            scan.close(); // guard against NPE when the Scanner was never created
        }
    }
}
I think "MemoryMappedFile" is what you are looking for.
I found some links to help you:
http://www.linuxtopia.org/online_books/programming_books/thinking_in_java/TIJ314_029.htm
http://javarevisited.blogspot.de/2012/01/memorymapped-file-and-io-in-java.html
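As a rough illustration of the technique (the file name and the single-window simplification are assumptions of this sketch), counting lines through a memory mapping keeps the Java heap small, because the mapped pages live outside the heap:

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedScan {
    public static long countLines(String fileName) throws IOException {
        try (FileChannel channel = FileChannel.open(Paths.get(fileName), StandardOpenOption.READ)) {
            // A single mapping is limited to Integer.MAX_VALUE bytes, so a
            // multi-GB file would need several mapped windows; one is shown here.
            long size = Math.min(channel.size(), Integer.MAX_VALUE);
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, size);
            long lines = 0;
            while (buffer.hasRemaining()) {
                if (buffer.get() == '\n') {
                    lines++;
                }
            }
            return lines;
        }
    }
}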
Both of the applications you mentioned can "handle" large files, but they don't actually need to load entire files into memory. The first one sounds like it might seek directly to the end of the file, while the second operates on a line-by-line basis.
It is possible they are using native code via JNI to achieve the low memory use.
Edit: In fact, they look to be purely C or C++ applications; they don't need to wait for GC like Java applications do.

How to atomically rename a file in Java, even if the dest file already exists?

I have a cluster of machines, each running a Java app.
These Java apps need to access a unique resource.txt file concurrently.
I need to atomically rename a temp.txt file to resource.txt in Java, even if resource.txt already exist.
Deleting resource.txt and renaming temp.txt doesn't work, as it's not atomic (it creates a small timeframe where resource.txt doesn't exist).
And it should be cross-platform...
For Java 1.7+, use java.nio.file.Files.move(Path source, Path target, CopyOption... options) with CopyOptions "REPLACE_EXISTING" and "ATOMIC_MOVE".
See API documentation for more information.
For example:
Files.move(src, dst, StandardCopyOption.ATOMIC_MOVE);
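Combining both options, with a fallback for file systems where the atomic replace is not supported (the wrapper class and method name here are illustrative):

import java.io.IOException;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class AtomicRename {
    public static void replace(Path src, Path dst) throws IOException {
        try {
            Files.move(src, dst, StandardCopyOption.ATOMIC_MOVE,
                    StandardCopyOption.REPLACE_EXISTING);
        } catch (AtomicMoveNotSupportedException e) {
            // Last resort: the replace still happens, just not atomically.
            Files.move(src, dst, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}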
On Linux (and I believe Solaris and other UNIX operating systems), Java's File.renameTo() method will overwrite the destination file if it exists, but this is not the case under Windows.
To be cross platform, I think you'd have to use file locking on resource.txt and then overwrite the data.
The behavior of the file lock is platform-dependent. On some platforms, the file lock is advisory, which means that unless an application checks for a file lock, it will not be prevented from accessing the file. On other platforms, the file lock is mandatory, which means that a file lock prevents any application from accessing the file.
try {
    // Get a file channel for the file
    File file = new File("filename");
    FileChannel channel = new RandomAccessFile(file, "rw").getChannel();

    // Use the file channel to create a lock on the file.
    // This method blocks until it can retrieve the lock.
    FileLock lock = channel.lock();

    // Try acquiring the lock without blocking. This method returns
    // null or throws an exception if the file is already locked.
    try {
        lock = channel.tryLock();
    } catch (OverlappingFileLockException e) {
        // File is already locked in this thread or virtual machine
    }

    // Release the lock
    lock.release();

    // Close the file
    channel.close();
} catch (Exception e) {
}
Linux, by default, uses advisory locking, while Windows enforces mandatory locking. Maybe you could detect the OS, and use renameTo() under UNIX with some locking code for Windows?
There's also a way to turn on mandatory locking under Linux for specific files, but it's kind of obscure. You have to set the mode bits just right.
Linux, following System V (see System V Interface Definition (SVID) Version 3), lets the sgid bit for files without group execute permission mark the file for mandatory locking.
Here is a discussion that relates: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4017593
As stated here, it looks like older versions of Windows don't even support atomic file rename. It's very likely you have to use some manual locking mechanism or some kind of transaction. For that, you might want to take a look at the Apache Commons Transaction package.
If this should be cross-platform, I suggest 2 options:
Implement an intermediate service that is responsible for all the file accesses. Here you can use several mechanisms for synchronizing the requests. Each client Java app accesses the file only through this service.
Create a control file each time you need to perform synchronized operations. Each Java app that accesses the file is responsible for checking for the control file and waiting while it exists (almost like a semaphore). The process doing the delete/rename operation is responsible for creating/deleting the control file; a sketch of this follows below.
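A sketch of that control-file idea (the lock-file name and the 100 ms polling interval are illustrative choices); Files.createFile fails atomically if the file already exists, which makes the control file behave like a crude cross-process semaphore:

import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ControlFileLock {
    private final Path controlFile;

    public ControlFileLock(Path controlFile) {
        this.controlFile = controlFile;
    }

    public void acquire() throws IOException, InterruptedException {
        while (true) {
            try {
                Files.createFile(controlFile); // atomic: only one process succeeds
                return;
            } catch (FileAlreadyExistsException locked) {
                Thread.sleep(100); // wait while another process holds the lock
            }
        }
    }

    public void release() throws IOException {
        Files.deleteIfExists(controlFile);
    }
}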
If the purpose of the rename is to replace resource.txt on the fly and you have control over all the programs involved, and the frequency of replacement is not high, you could do the following.
To open/read the file:
Open "resource.txt", if that fails
Open "resource.old.txt", if that fails
Open "resource.txt" again, if that fails
You have an error condition.
To replace the file:
Rename "resource.txt" to "resource.old.txt", then
Rename "resource.new.txt" to "resource.txt", then
Delete "resource.old.txt".
Which will ensure all your readers always find a valid file.
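A sketch of that replace sequence using NIO renames (assuming a single replacer at a time, and that a leftover resource.old.txt from a crashed run has been cleaned up first):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class SwapResource {
    public static void replace() throws IOException {
        Path current = Paths.get("resource.txt");
        Path old = Paths.get("resource.old.txt");
        Path fresh = Paths.get("resource.new.txt");
        Files.move(current, old);   // step 1: current version becomes the fallback
        Files.move(fresh, current); // step 2: new version takes its place
        Files.deleteIfExists(old);  // step 3: clean up the fallback
    }
}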
But, easier, would be to simply try your opening in a loop, like:
InputStream inp=null;
StopWatch tmr=new StopWatch(); // made up class, not std Java
IOException err=null;
while(inp==null && tmr.elapsed()<5000) { // or some approp. length of time
try { inp=new FileInputStream("resource.txt"); }
catch(IOException thr) { err=thr; sleep(100); } // or some approp. length of time
}
if(inp==null) {
// handle error here - file did not turn up after required elapsed time
throw new IOException("Could not obtain data from resource.txt file");
}
... carry on
You might get some traction by establishing a FileChannel lock on the file before renaming it (and deleting the file you're going to overwrite once you have the lock).
I solved this with a simple rename function.
Calling:
File newPath = new File("...");
newPath = checkName(newPath);
Files.copy(file.toPath(), newPath.toPath(), StandardCopyOption.REPLACE_EXISTING);
The checkName function checks whether the file exists.
If it exists, it appends a number in brackets, e.g. (1), to the end of the filename.
Functions:
private static File checkName(File newPath) {
    if (Files.exists(newPath.toPath())) {
        String extractRegExSubStr = extractRegExSubStr(newPath.getName(), "\\([0-9]+\\)");
        if (extractRegExSubStr != null) {
            extractRegExSubStr = extractRegExSubStr.replaceAll("\\(|\\)", "");
            int parseInt = Integer.parseInt(extractRegExSubStr);
            int parseIntPLus = parseInt + 1;
            newPath = new File(newPath.getAbsolutePath().replace("(" + parseInt + ")", "(" + parseIntPLus + ")"));
            return checkName(newPath);
        } else {
            newPath = new File(newPath.getAbsolutePath().replace(".pdf", " (" + 1 + ").pdf"));
            return checkName(newPath);
        }
    }
    return newPath;
}

private static String extractRegExSubStr(String row, String patternStr) {
    Pattern pattern = Pattern.compile(patternStr);
    Matcher matcher = pattern.matcher(row);
    if (matcher.find()) {
        return matcher.group(0);
    }
    return null;
}
EDIT: This only works for PDF files. For other types, replace the .pdf extension or add an extension parameter.
NOTE: If the file name contains additional numbers in brackets, it may mess up your file names.
