I want to write a Stream to a file. However, the Stream is big (a few GB when written to a file), so I want to process it in parallel. At the end of the process, I write it to a file (I am using FileWriter).
I would like to ask whether this could potentially cause any problems with the file.
Here is some code. First, the function that writes the stream to a file:
public static void writeStreamToFile(Stream<String> ss, String fileURI) {
    try (FileWriter wr = new FileWriter(fileURI)) {
        ss.forEach(line -> {
            try {
                if (line != null) {
                    wr.write(line + "\n");
                }
            } catch (Exception ex) {
                System.err.println("error when write file");
            }
        });
    } catch (IOException ex) {
        Logger.getLogger(OaStreamer.class.getName()).log(Level.SEVERE, null, ex);
    }
}
And here is how I use my stream:
Stream<String> ss = Files.lines(path).parallel()
    .map(x -> dosomething(x))
    .map(x -> dosomethingagain(x));
writeStreamToFile(ss, "path/to/output.csv");
As others have mentioned, this approach should work; however, you should question whether it is the best method. Writing to a file is a shared operation between threads, meaning you are introducing thread contention.
While it is easy to think that having multiple threads will speed up performance, in the case of I/O operations the opposite is true. Remember that I/O bandwidth is finite, so more threads will not increase performance. In fact, the contention will slow down access to the shared resource because of the constant locking and unlocking needed to write to it.
The bottom line is that only one thread can write to a file at a time, so parallelizing write operations is counterproductive.
Consider using multiple threads to handle your CPU-intensive tasks and having all threads post to a queue/buffer. A single thread can then pull from the queue and write to your file (see the sketch below). This solution (and more detail) was suggested in this answer.
Check out this article for more info on thread contention and locks.
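For illustration, here is a minimal sketch of that single-writer pattern using a BlockingQueue. The queue capacity, the poison-pill sentinel, and the dosomething placeholder are assumptions, not code from the question:

import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class SingleWriterExample {

    // Sentinel ("poison pill") telling the writer thread to stop.
    private static final String POISON_PILL = "\u0000EOF\u0000";

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);

        // The single thread that touches the file.
        Thread writer = new Thread(() -> {
            try (FileWriter wr = new FileWriter("path/to/output.csv")) {
                String line;
                while (!(line = queue.take()).equals(POISON_PILL)) {
                    wr.write(line + "\n");
                }
            } catch (IOException | InterruptedException ex) {
                ex.printStackTrace();
            }
        });
        writer.start();

        // Parallel threads do the CPU-intensive work and post results to the queue.
        Files.lines(Paths.get("path/to/input.csv")).parallel()
                .map(SingleWriterExample::dosomething)
                .forEach(line -> {
                    try {
                        queue.put(line);
                    } catch (InterruptedException ex) {
                        Thread.currentThread().interrupt();
                    }
                });

        queue.put(POISON_PILL); // signal that no more lines are coming
        writer.join();
    }

    private static String dosomething(String x) {
        return x; // placeholder for the real transformation
    }
}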
Yes, it is OK to use FileWriter as you are using it. I also have some other approaches that may be helpful to you.
As you are dealing with large files, FileChannel can be faster than standard IO. The following code writes a String to a file using FileChannel:
@Test
public void givenWritingToFile_whenUsingFileChannel_thenCorrect()
        throws IOException {
    RandomAccessFile stream = new RandomAccessFile(fileName, "rw");
    FileChannel channel = stream.getChannel();

    String value = "Hello";
    byte[] strBytes = value.getBytes();
    ByteBuffer buffer = ByteBuffer.allocate(strBytes.length);
    buffer.put(strBytes);
    buffer.flip();
    channel.write(buffer);
    stream.close();
    channel.close();

    // verify
    RandomAccessFile reader = new RandomAccessFile(fileName, "r");
    assertEquals(value, reader.readLine());
    reader.close();
}
Reference : https://www.baeldung.com/java-write-to-file
You can use Files.write with stream operations as below, which converts the Stream to an Iterable:
Files.write(Paths.get(filepath), (Iterable<String>)yourstream::iterator);
For example:
Files.write(Paths.get("/dir1/dir2/file.txt"),
    (Iterable<String>) IntStream.range(0, 1000).mapToObj(String::valueOf)::iterator);
If you have a stream of some custom objects, you can always add a .map(Object::toString) step to apply the toString() method.
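For instance, with a hypothetical Person class (the class and its toString() are made up for illustration):

Stream<Person> people = ...; // your stream of custom objects
Files.write(Paths.get("/dir1/dir2/people.csv"),
    (Iterable<String>) people.map(Object::toString)::iterator);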
It is not a problem as long as it is okay for the file to have the lines in an arbitrary order. You are processing the content in parallel, not in sequence, so you have no guarantee at which point any given line comes in for processing.
That is the only thing to keep in mind here.
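If the order does matter, one option is to replace ss.forEach with ss.forEachOrdered inside the question's writeStreamToFile; this is just a sketch, and it trades away some of the parallel speed-up to preserve encounter order:

ss.forEachOrdered(line -> {
    try {
        if (line != null) {
            wr.write(line + "\n");
        }
    } catch (Exception ex) {
        System.err.println("error when write file");
    }
});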
The problem with my code is an infinite loop of reading and writing.
I can't find a solution or a concept for this problem.
FileInputStream in = new FileInputStream("toto.txt");
FileOutputStream out = new FileOutputStream("toto.txt", false);

int m;
while ((m = in.read()) != 0) {
    System.out.print(m);
    out.write(m);
}
in.close();
out.close();
Alter the loop condition to:
while ((m = in.read()) != -1)
The problem with my code is an infinite loop of reading and writing. I can't find a solution or a concept for this problem.
There are a number of problems with your code:
The file will be treated as empty as soon as the FileOutputStream gets instantiated, because you've set the append flag to false. Method read() will therefore always return -1, since there's no content to read.
The loop condition is incorrect: read() signals the end of the stream with -1, and only because -1 != 0 does control enter the loop, so EOF (-1) is repeatedly written into the file. If you fixed the condition to (m = in.read()) != -1, the loop body would never execute because the file is blank from the start.
If you did both, fixed the condition and changed the append flag to true, you would get another flavor of infinite loop: all the contents of the file would be successfully read and then repeatedly appended to the file.
So under any of these conditions, reading from and writing to the same file simultaneously isn't a good idea.
One important note in regard to exception handling.
Because there's no catch block in your code snippet, I assume that you've added a throws clause to main(); that is not a nice idea. The close() methods in your code will be invoked only in the case of successful execution, but if an exception occurs the resources will never get released.
Instead, I suggest you make use of try-with-resources. It provides an implicit finally block that takes care of closing the resources regardless of whether an exception occurred (as your code stands, the invocations of close() will not get executed in case of an exception). Another option is to declare a finally block explicitly and close the resources inside it.
Try-with-resources is the more concise and cleaner way to ensure that resources get released.
Also consider wrapping both streams with buffered high-level streams to improve performance. It will significantly reduce the number of times your application needs to access the file system.
try (var in = new BufferedInputStream(new FileInputStream("source.txt"));
     var out = new BufferedOutputStream(new FileOutputStream("destination.txt", false))) {
    int next; // the next byte read from the source
    while ((next = in.read()) != -1) {
        out.write(next);
    }
} catch (IOException e) {
    e.printStackTrace();
}
It goes into an infinite loop because reads will see the results of past writes.
Reading and writing the same file using FileInputStream and FileOutputStream is not possible. Use RandomAccessFile if you want to read from and write to the same file. You can specify the position as well if you want to write at a specific place in your file.
If you want to write to the end of the file and then read all the lines on the file then here is a sample for that:
RandomAccessFile file = new RandomAccessFile("toto.txt", "rw");
file.seek(file.length());
file.writeBytes("This is a temp file");
file.seek(0); // sets the pointer to the first byte

String line;
while ((line = file.readLine()) != null) {
    System.out.println(line);
}
file.close();
As stated in the title, should I close the stream when reusing a FileOutputStream variable? For example, in the following code, should I call outfile.close() before I assign it a new file, and why?
Thanks:)
FileOutputStream outfile = null;
int index = 1;
while (true) {
    // check whether we should create a new file
    boolean createNewFile = shouldCreateNewFile();
    // write to a new file if the pattern is identified
    if (createNewFile) {
        /* Should I close the outfile each time I create a new file?
        if (outfile != null) {
            outfile.close();
        }
        */
        outfile = new FileOutputStream(String.valueOf(index++) + ".txt");
    }
    if (outfile != null) {
        outfile.write(getNewFileContent());
    }
    if (shouldEnd()) {
        break;
    }
}
try {
    if (outfile != null) {
        outfile.close();
    }
} catch (IOException e) {
    System.err.println("Something wrong happens...");
}
YES. Once you are done with one file (stream), you should always close it, so that the resources allocated to it (file descriptors, buffers, etc.) are released back to the operating system.
From the Java documentation for FileOutputStream.close():
Closes this file output stream and releases any system resources associated with this stream. This file output stream may no longer be used for writing bytes.
Unclosed file descriptors can even lead to resource leaks in a Java program. Reference
I think the confusion here revolves around the concept of “re-using” the FileOutputStream. What you are doing is simply re-using an identifier (the name outfile of your variable) by associating a new value with it. But this only has syntactic meaning to the Java compiler. The object referred to by the name – the FileOutputStream – is simply dropped on the floor and will eventually be garbage collected at an unspecified later point in time. It doesn't matter what you do with the variable that once referred to it: whether you re-assign it another FileOutputStream, set it to null, or let it go out of scope is all the same.
Calling close explicitly flushes all buffered data to the file and releases the associated resources. (The garbage collector would release them too, but you don't know when that might happen.) Note that close may also throw an IOException, so it really matters that you know the point at which the close is attempted, which you only know if you call the method explicitly.
Even without automatic resource management, or try-with-resources (see below), your code can be made much more readable and reliable:
for (int index = 1; shouldCreateNewFile(); ++index) {
FileOutputStream outfile = new FileOutputStream(index + ".txt");
try {
outfile.write(getNewFileContent());
}
finally {
outfile.close();
}
}
However, Java 7 introduced try-with-resources, a syntax for closing resources that is more reliable and more informative in the case of errors. Using it, your code would look like this:
for (int index = 1; shouldCreateNewFile(); ++index) {
try (FileOutputStream outfile = new FileOutputStream(index + ".txt")) {
outfile.write(getNewFileContent());
}
}
The output stream will still be closed, but if there is an exception inside the try block, and another while closing the stream, the exception will be suppressed (linked to the main exception), rather than causing the main exception to be discarded like the previous example.
You should always use automatic resource management in Java 7 or above.
We have a situation wherein two different main() programs access the same file for read/write operations.
One of the program tries to serialize a HashMap into the file whereas the other program tries to read the same HashMap.
The aim is to prevent the read operation while the write operation is ongoing.
I am able to get a lock for the file using java.nio.channels.FileChannel.lock(). But now I am unable to write to the file using ObjectOutputStream, since the lock is acquired by the FileChannel.
The main method for the write looks as given below:
public static void main(String args[]) {
    try {
        HashMap<String, Double> h = new HashMap<>();
        File f = new File("C:\\Sayan\\test.txt");
        FileChannel channel = new RandomAccessFile(f, "rw").getChannel();
        FileLock lock = channel.lock();
        System.out.println("Created File object");
        ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream(f));
        System.out.println("Created output stream");
        oos.writeObject(h);
        lock.release();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
The read operation is similar; it just reads the data from the file modified by the above code.
Constraints:
We cannot create any other class or use threading for attaining synchronization.
One more doubt: Can threads created by two different main programs communicate with each other?
Get the channel you're locking with from the FileOutputStream instead of from a separate RandomAccessFile.
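A minimal sketch of that change, applied to the question's main method (the try-with-resources structure and the flush are my additions, not part of the original code):

try (FileOutputStream fos = new FileOutputStream("C:\\Sayan\\test.txt");
     FileLock lock = fos.getChannel().lock()) {
    ObjectOutputStream oos = new ObjectOutputStream(fos);
    oos.writeObject(h);
    oos.flush(); // flush while the lock is still held
} catch (IOException e) {
    e.printStackTrace();
}
// closing the stream closes the channel, which also releases the lock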
I have this ArrayList of files:
for (File file : files) {
    InputStream in = new FileInputStream(file);
    // process each file and save it to file
    OutputStream out = new FileOutputStream(file);
    try {
    } finally {
        in.close();
        out.close();
    }
}
The performance is really slow since on every loop iteration there is an in/out close(). Is there a better way to do this? I tried to put the output stream outside of the loop, but it doesn't work.
Using buffered streams makes a huge difference.
Try this:
for(final File file : files) {
final InputStream in = new BufferedInputStream(new FileInputStream(file));
final OutputStream out = new BufferedOutputStream(new FileOutputStream(new File(...)));
try {
// Process each file and save it to file
}
finally {
try {
in.close();
}
catch (IOException ignored) {}
try {
out.close();
}
catch (IOException ignored) {}
}
}
Note that the IOExceptions that can be thrown when closing the streams must be ignored, or you will lose the potential initial exception.
Another problem is that both streams are on the same file, which doesn't work. So I suppose you're using two different files.
A close() can take up to 20 ms. I doubt this is your problem unless you have 1000s of files.
I suspect your performance problem is a lack of buffering the input and output. Can you show your buffering wrappers as well?
You can of course build a queue of OutputStreams and offload it to a background thread that handles the closing of these output streams. The same goes for InputStreams.
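Something like the following sketch, using a java.util.concurrent single-thread executor (the executor itself is my assumption; in and out are the streams from the loop above):

ExecutorService closer = Executors.newSingleThreadExecutor();
// inside the loop, instead of closing inline:
closer.submit(() -> {
    try {
        in.close();
        out.close();
    } catch (IOException ignored) {}
});
// after the loop:
closer.shutdown();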
Alternatively, you can leave it to the JVM: simply don't close the files and leave it to the GC to do that when the objects are finalized.
byte[] bytes = value.getBytes();
Process q = new ProcessBuilder("process","arg1", "arg2").start();
q.getOutputStream().write(bytes);
q.getOutputStream().flush();
System.out.println(q.getInputStream().available());
I'm trying to stream file contents to an executable and capture the output, but the output (InputStream) is always empty. I can capture the output if I specify the file location, but not with streamed input.
How might I overcome this?
Try wrapping your streams with BufferedInputStream() and BufferedOutputStream():
http://download.oracle.com/javase/6/docs/api/java/lang/Process.html#getOutputStream%28%29
Implementation note: It is a good idea for the output stream to be buffered.
Implementation note: It is a good idea for the input stream to be buffered.
Even with buffered streams, it is still possible for the buffer to fill up if you're dealing with large amounts of data. You can deal with this by starting a separate thread to read from q.getInputStream(), so you can still be reading from the process while writing to it.
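For example, a sketch of such a reader thread (everything except the ProcessBuilder line and bytes is assumed, and the enclosing method is assumed to declare throws IOException, InterruptedException):

Process q = new ProcessBuilder("process", "arg1", "arg2").start();

// Drain stdout on a separate thread so the process never blocks on a full pipe.
Thread reader = new Thread(() -> {
    try (BufferedReader br = new BufferedReader(new InputStreamReader(q.getInputStream()))) {
        String line;
        while ((line = br.readLine()) != null) {
            System.out.println(line);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
});
reader.start();

try (OutputStream os = new BufferedOutputStream(q.getOutputStream())) {
    os.write(bytes); // closing the stream also sends EOF to the process
}
reader.join();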
Perhaps the program you execute only starts its work when it detects the end of its input data. This is normally done by waiting for an EOF (end-of-file) symbol. You can send this by closing the output stream to the process:
q.getOutputStream().write(bytes);
q.getOutputStream().close();
Try this together with waiting for the process.
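For instance (a sketch; waitFor() blocks until the process exits):

q.getOutputStream().write(bytes);
q.getOutputStream().close(); // sends EOF to the process
int exitCode = q.waitFor();  // wait for the process to finish
System.out.println(q.getInputStream().available());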
I don't know if something else may also be wrong here, but the other process ("process") does not even have time to respond; you are not waiting for it (the method available() does not block). To try this out, you can first insert a sleep(2000) after the flush(), and if that works, you should switch to querying q.getInputStream().available() multiple times with short pauses in between.
I think you have to wait until the process has finished.
I implemented something like this in the following way:
public class ProcessReader {

    private static final int PROCESS_LOOP_SLEEP_MILLIS = 100;

    private String result;

    public ProcessReader(Process process) {
        BufferedReader resultReader = new BufferedReader(new InputStreamReader(process.getInputStream()));
        StringBuilder resultOutput = new StringBuilder();
        try {
            while (!checkProcessTerminated(process, resultReader, resultOutput)) {
            }
        } catch (Exception ex1) {
            throw new RuntimeException(ex1);
        }
        result = resultOutput.toString();
    }

    public String getResult() {
        return result;
    }

    private boolean checkProcessTerminated(Process process, BufferedReader resultReader, StringBuilder resultOutput) throws Exception {
        try {
            int exit = process.exitValue(); // throws if the process is still running
            return true;
        } catch (IllegalThreadStateException ex) {
            Thread.sleep(PROCESS_LOOP_SLEEP_MILLIS);
        } finally {
            // drain whatever output is currently available
            while (resultReader.ready()) {
                String out = resultReader.readLine();
                resultOutput.append(out).append("\n");
            }
        }
        return false;
    }
}
I just removed some specific code that you don't need, but it should work. Try it.
Regards