I am running a long-running operation, say 100k jobs, and I want to write its progress to a file once every 100 completed jobs.
I open the file using a BufferedWriter with append mode set to false, write to it, and then close it. This happens once every 100 jobs, so the file is opened and closed 1000 times. Can I optimise this further by opening and closing the file only once?
public static void writeMetaData(String writeDir, JSONObject jsonObject) throws Exception {
String filePath = writeDir.concat("/").concat("metadata.txt");
BufferedWriter metaDataWriter = Files.newBufferedWriter(Paths.get(filePath), StandardCharsets.UTF_8, StandardOpenOption.TRUNCATE_EXISTING);
metaDataWriter.write(jsonObject.toString());
IOUtils.closeQuietly(metaDataWriter);
}
for(int i =0 ; i < 100000; i++) {
// do Something;
if(i % 100 == 0) {
writeMetaData(writeDir, jsonObject);
}
}
File should only have a single line.
Expected file content after 100 jobs:
progress: 100
Expected file content after 200 jobs:
progress: 200
Can this be optimised further?
First of all, an expression like writeDir.concat("/").concat("metadata.txt") reduces both readability and performance. A straightforward writeDir + "/" + "metadata.txt" will perform better. But since you’re constructing a string merely to construct a Path, it’s even more straightforward not to do the Path’s job in your code but rather to use Paths.get(writeDir, "metadata.txt").
You cannot rewind a BufferedWriter, but you can rewind a FileChannel. Therefore, to keep the channel open and rewind it when needed, you have to construct a new writer after rewinding:
public static void writeMetaData(FileChannel ch, JSONObject jsonObj) throws IOException {
ch.position(0);
if(ch.size() > 0) ch.truncate(0);
Writer w = Channels.newWriter(ch, StandardCharsets.UTF_8.newEncoder(), 8192);
w.write(jsonObj.toString());
w.flush();
}
try(FileChannel ch = FileChannel.open(Paths.get(writeDir, "metadata.txt"),
StandardOpenOption.WRITE, StandardOpenOption.CREATE)) {
for(int i = 0; i < 100000; i++) {
// do Something;
if(i % 100 == 0) {
writeMetaData(ch, jsonObject);
}
}
}
It’s important that the use of the Writer ends with flush() to force the write of all buffered data, but not close() as that would also close the underlying channel. Note that this code does not wrap the writer into a BufferedWriter; encoding text as UTF-8 is already a buffered operation and by requesting a larger buffer for the encoder, matching BufferedWriter’s default buffer size, we get the same effect of buffering without the copying overhead.
Since writing is not an end in itself, there’s a question left regarding your reading side. If the reader is trying to read the data in some intervals, there’s the risk of overlapping with the write, getting incomplete data.
You could use
public static void writeMetaData(FileChannel ch, JSONObject jsonObj) throws IOException {
try(FileLock lock = ch.lock()) {
ch.position(0);
if(ch.size() > 0) ch.truncate(0);
Writer w = Channels.newWriter(ch, StandardCharsets.UTF_8.newEncoder(), 8192);
w.write(jsonObj.toString());
w.flush();
}
}
to lock the file during the write. But depending on the system, file locking might not be mandatory but only affect readers also trying to get a read lock.
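For the reading side, a minimal sketch of a cooperating reader that takes a shared lock before reading could look like the following. The class and method names (MetaDataReader, readLocked) are made up for illustration. It assumes the reader runs in a different process than the writer, since file locks are held on behalf of the whole JVM and overlapping locks within one JVM raise an OverlappingFileLockException.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of the original answer.
final class MetaDataReader {

    /** Reads the whole file as UTF-8 while holding a shared (read) lock. */
    static String readLocked(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ);
             FileLock lock = ch.lock(0, Long.MAX_VALUE, true)) { // true = shared lock
            ByteBuffer buf = ByteBuffer.allocate((int) ch.size());
            // Keep reading until the buffer is full or the end of the file is reached.
            while (buf.hasRemaining() && ch.read(buf) != -1) { }
            buf.flip();
            return StandardCharsets.UTF_8.decode(buf).toString();
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("metadata", ".txt");
        Files.write(file, "progress: 100".getBytes(StandardCharsets.UTF_8));
        System.out.println(readLocked(file));
        Files.delete(file);
    }
}
```

The shared lock blocks while a writer holds the exclusive lock from ch.lock(), so on systems that honour the locks the reader does not see a half-written file.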
When you use JDK 11 or newer, you may consider using
for(int i = 0; i < 100000; i++) {
// do Something;
if(i % 100 == 0) {
Files.writeString(Paths.get(writeDir, "metadata.txt"), jsonObject.toString());
}
}
which clearly wins on simplicity (yes, that’s the complete code, no additional method required). The default options do already include the desired StandardCharsets.UTF_8 and StandardOpenOption.TRUNCATE_EXISTING.
While it does open and close the file internally, it has some other performance tweaks which may compensate. Especially in the likely case that the string consists of ASCII characters only, as the implementation will simply write the string’s internal array directly to the file then.
A stream does not let you go back and rewrite content. One way to achieve what you want is to use a RandomAccessFile.
Its setLength() method will truncate the file if you pass 0.
Here is a simple example:
import java.io.*;
public class Test
{
public static void updateFile(RandomAccessFile raf, String content) throws IOException
{
raf.setLength(0);
raf.write(content.getBytes("UTF-8"));
}
public static void main(String[] args) throws IOException
{
try(RandomAccessFile raf = new RandomAccessFile("metadata.txt", "rw"))
{
updateFile(raf, "progress: 100");
updateFile(raf, "progress: 200");
}
}
}
File operations are typically buffered by the underlying kernel, and so you're unlikely to see much of a performance benefit by keeping an open file descriptor for this kind of low throughput application.
Keeping your code as a single operation that leaves no state behind when it finishes, rather than designing it as a continuous rewindable stream, makes for an elegant and simple implementation. Unless you have specifically requested synchronous IO, it is also sufficiently performant, because it benefits from the optimizations of all the layers that sit beneath it.
If you ever do measure a performance penalty from this, which I suspect you never will, you could use the RandomAccessFile API, or go unnecessarily lower level by using FileChannel as others have already described.
I think you shouldn't compromise the simplicity/elegance of your design for this kind of micro-optimization, which in the grand scheme of things, is guaranteed to be insignificant (one tiny write operation per 100 jobs processed).
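As a rough sketch of that stateless approach, assuming a plain "progress: N" line is all that needs to be stored (ProgressFile and writeProgress are illustrative names, and the temp file stands in for the real metadata path):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper illustrating the "open, write, close" style.
final class ProgressFile {

    /** Replaces the file content with a single progress line. */
    static void writeProgress(Path file, int completed) throws IOException {
        byte[] bytes = ("progress: " + completed).getBytes(StandardCharsets.UTF_8);
        // With no options given, Files.write uses CREATE, TRUNCATE_EXISTING and WRITE.
        Files.write(file, bytes);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("metadata", ".txt");
        for (int i = 1; i <= 1000; i++) {
            // do something
            if (i % 100 == 0) {
                writeProgress(file, i);
            }
        }
        System.out.println(new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
        Files.delete(file);
    }
}
```

Each call opens, truncates, writes and closes the file, so nothing has to be kept open or cleaned up between progress updates.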
try(FileReader reader = new FileReader("input.txt")) {
int c;
while ((c = reader.read()) != -1)
System.out.print((char)c);
} catch (Exception ignored) { }
In this code, I read char by char. Is it more efficient in some way to read into an array of chars at once? In other words, is there any kind of optimization that happens when reading into arrays?
For example, in this code I have an array of char called arr, and I read into it until there is nothing left to read. Is it more efficient?
try(FileReader reader = new FileReader("input.txt")) {
int size;
char[] arr = new char[100];
while ((size = reader.read(arr)) != -1)
for (int i = 0; i < size; i++)
System.out.print(arr[i]);
} catch (Exception ignored) { }
The question applies to both reading and writing, and to both chars and bytes.
Depends on the reader. The answer can be yes, though. Whatever Reader or InputStream is the actual 'raw' driver (the one that isn't just wrapping another reader or inputstream, but the one that is actually talking to the OS to get the data) - it may well implement the single-character read() method by asking the OS to read a single character.
In the end, you have a disk, and disks return data in blocks. So if you ask for 1 byte, you have 2 options as a computer:
1. Ask the disk for the block that contains the byte that is to be read. Store the block in memory someplace for a while. Return one byte; for the next few moments, if more requests for bytes come in from the same block, return from the stored data in memory and don't bother asking the disk at all. NOTE: This requires memory! Who allocates it? How much memory is okay? Tricky questions. OSes tend to give low level tools and don't like just picking values for any of these questions.
2. Ask the disk for the block that contains the byte that is to be read. Find the 1 byte needed from within this block. Ignore the rest of the data, return just that one byte. If in a few moments another byte from that block is asked for... ask the disk, again, for the whole block, and repeat this routine.
Which of the two models you get depends on many factors: For example: What kind of disk is it, what OS do you have, what underlying java reader are you using. But it is plausible you end up in this second mode and that is, as you can probably tell, usually incredibly slow, because you end up reading the same block 4000+ times instead of only once.
So, how to fix this? Well, java doesn't really know what the OS is doing either, so the safest bet is to let java do the caching. Then you have no dependencies on whatever the OS is doing.
You could write it yourself, so instead of:
for (int i = in.read(); i != -1; i = in.read()) {
processOneChar((char) i);
}
you could do:
char[] buffer = new char[4096];
while (true) {
int r = in.read(buffer);
if (r == -1) break;
for (int i = 0; i < r; i++) processOneChar(buffer[i]);
}
This is more code, but now the second scenario (the same block is read off the disk a ton of times) can no longer occur; you have given the OS the freedom to return to you up to 4096 chars worth of data.
Or, use a java builtin: BufferedX:
BufferedReader br = new BufferedReader(in);
for (int i = br.read(); i != -1; i = br.read()) {
processOneChar((char) i);
}
The implementation of BufferedReader guarantees that java will take care of making some reasonably sized buffer to avoid rereads of the same block off of disk.
NB: Note that the FileReader constructor you are using should not be used. It uses platform default encoding (anytime you convert bytes to characters, encoding is involved), and platform default is a recipe for untestable bugs, which are very bad. Use new FileReader(file, StandardCharsets.UTF_8) instead, or better yet, use the new API:
Path p = Paths.get("C:/file.txt");
try (BufferedReader br = Files.newBufferedReader(p)) {
for (int i = br.read(); i != -1; i = br.read()) {
processOneChar((char) i);
}
}
Note that this:
Defaults to UTF-8, because the Files API defaults to UTF-8 unlike most places in the VM.
Makes a bufferedreader immediately, no need to make it yourself.
Properly manages the resource (ensures it is closed regardless of how this code exits, be it normally or by exception), by using a try-with-resources (ARM) block.
Because a BufferedX is involved, no risk of the 'read the same block a lot' performance hole.
NB: The same logic applies when writing; disks such as SSDs can only write a whole block at a time. So unbuffered writing is not just slow as molasses, you're also ruining your disk, as SSDs only get a limited number of writes.
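A minimal sketch of the writing counterpart, assuming text output in UTF-8 (the class and method names are invented for this example): the BufferedWriter collects the single-char writes in memory and hands them to the underlying stream in larger chunks.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical example class.
final class BufferedWriteDemo {

    /** Writes text one char at a time; the BufferedWriter batches the actual output. */
    static void writeCharByChar(Path file, String text) throws IOException {
        try (BufferedWriter bw = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            for (int i = 0; i < text.length(); i++) {
                bw.write(text.charAt(i)); // goes into the in-memory buffer first
            }
        } // close() flushes whatever is still buffered
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("out", ".txt");
        writeCharByChar(file, "hello, buffered world");
        System.out.println(new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
        Files.delete(file);
    }
}
```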
I have a client-server application where the server sends some binary data to the client and the client has to deserialize objects from that byte stream according to a custom binary format. The data is sent via an HTTPS connection and the client uses HttpsURLConnection.getInputStream() to read it.
I implemented a DataDeserializer that takes an InputStream and deserializes it completely. It works in a way that it performs multiple inputStream.read(buffer) calls with small buffers (usually less than 100 bytes). On my way of achieving better overall performance I also tried different implementations here. One change did improve this class' performance significantly (I'm using a ByteBuffer now to read primitive types rather than doing it manually with byte shifting), but in combination with the network stream no differences show up. See the section below for more details.
Quick summary of my issue
Deserializing from the network stream takes way too long even though I proved that the network and the deserializer themselves are fast. Are there any common performance tricks that I could try? I am already wrapping the network stream with a BufferedInputStream. Also, I tried double buffering with some success (see code below). Any solution to achieve better performance is welcome.
The performance test scenario
In my test scenario server and client are located on the same machine and the server sends ~174 MB of data. The code snippets can be found at the end of this post. All numbers you see here are averages of 5 test runs.
First I wanted to know how fast that InputStream of the HttpsURLConnection can be read. Wrapped into a BufferedInputStream, it took 26.250s to write the entire data into a ByteArrayOutputStream (snippet #1).
Then I tested the performance of my deserializer passing it all that 174 MB as a ByteArrayInputStream. Before I improved the deserializer's implementation, it took 38.151s. After the improvement it took only 23.466s (snippet #2).
So this is going to be it, I thought... but no.
What I actually want to do, somehow, is passing connection.getInputStream() to the deserializer. And here comes the strange thing: Before the deserializer improvement deserializing took 61.413s and after improving it was 60.100s (snippet #3)!
How can that happen? Almost no improvement here despite the deserializer improved significantly. Also, unrelated to that improvement, I was surprised that this takes longer than the separate performances summed up (60.100 > 26.250 + 23.466). Why? Don't get me wrong, I didn't expect this to be the best solution, but I didn't expect it to be that bad either.
So, three things to notice:
The overall speed is bound by the network which takes at least 26.250s. Maybe there are some http-settings that I could tweak or I could further optimize the server, but for now this is likely not what I should focus on.
My deserializer implementation is very likely still not perfect, but on its own it is faster than the network, so I don't think there is need to further improve it.
Based on 1. and 2. I'm assuming that it should be somehow possible to do the entire job in a combined way (reading from the network + deserializing) which should take not much more than 26.250s. Any suggestions on how to achieve this are welcome.
I was looking for some kind of double buffer allowing two threads to read from it and write to it in parallel.
Is there something like that in standard Java? Preferably some class inheriting from InputStream that allows to write to it in parallel? If there is something similar, but not inheriting from InputStream, I may be able to change my DataDeserializer to consume from that one as well.
As I haven't found any such DoubleBufferInputStream, I implemented it myself.
The code is quite long and likely not perfect and I don't want to bother you to do a code review for me. It has two 16kB buffers. Using it I was able to improve the overall performance to 39.885s (snippet #4).
That is much better than 60.100s but still much worse than 26.250s. Choosing different buffer sizes didn't change much. So, I hope someone can lead me to some good double buffer implementation.
The test code
1 (26.250s)
InputStream inputStream = new BufferedInputStream(connection.getInputStream());
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
byte[] buffer = new byte[16 * 1024];
int count = 0;
long start = System.nanoTime();
while ((count = inputStream.read(buffer)) >= 0) {
outputStream.write(buffer, 0, count);
}
long end = System.nanoTime();
2 (23.466s)
InputStream inputStream = new ByteArrayInputStream(entire174MBbuffer);
DataDeserializer deserializer = new DataDeserializer(inputStream);
long start = System.nanoTime();
deserializer.Deserialize();
long end = System.nanoTime();
3 (60.100s)
InputStream inputStream = new BufferedInputStream(connection.getInputStream());
DataDeserializer deserializer = new DataDeserializer(inputStream);
long start = System.nanoTime();
deserializer.Deserialize();
long end = System.nanoTime();
4 (39.885s)
MyDoubleBufferInputStream doubleBufferInputStream = new MyDoubleBufferInputStream();
new Thread(new Runnable() {
@Override
public void run() {
try (InputStream inputStream = new BufferedInputStream(connection.getInputStream())) {
byte[] buffer = new byte[16 * 1024];
int count = 0;
while ((count = inputStream.read(buffer)) >= 0) {
doubleBufferInputStream.write(buffer, 0, count);
}
} catch (IOException e) {
} finally {
doubleBufferInputStream.closeWriting(); // read() may return -1 now
}
}
}).start();
DataDeserializer deserializer = new DataDeserializer(doubleBufferInputStream);
long start = System.nanoTime();
deserializer.Deserialize();
long end = System.nanoTime();
Update
As requested, here is the core of my deserializer. I think the most important method is prepareForRead() which performs the actual reading of the stream.
class DataDeserializer {
private InputStream _stream;
private ByteBuffer _buffer;
public DataDeserializer(InputStream stream) {
_stream = stream;
_buffer = ByteBuffer.allocate(256 * 1024);
_buffer.order(ByteOrder.LITTLE_ENDIAN);
_buffer.flip();
}
private int readInt() throws IOException {
prepareForRead(4);
return _buffer.getInt();
}
private long readLong() throws IOException {
prepareForRead(8);
return _buffer.getLong();
}
private CustomObject readCustomObject() throws IOException {
prepareForRead(/*size of CustomObject*/);
int customMember1 = _buffer.getInt();
long customMember2 = _buffer.getLong();
// ...
return new CustomObject(customMember1, customMember2, ...);
}
// several other built-in and custom object read methods
private void prepareForRead(int count) throws IOException {
while (_buffer.remaining() < count) {
if (_buffer.capacity() - _buffer.limit() < count) {
_buffer.compact();
_buffer.flip();
}
int read = _stream.read(_buffer.array(), _buffer.limit(), _buffer.capacity() - _buffer.limit());
if (read < 0)
throw new EOFException("Unexpected end of stream.");
_buffer.limit(_buffer.limit() + read);
}
}
public HugeCustomObject Deserialize() throws IOException {
while (...) {
// call several of the above methods
}
return new HugeCustomObject(/* deserialized members */);
}
}
Update 2
I modified my code snippet #1 a little bit to see more precisely where time is being spent:
InputStream inputStream = new BufferedInputStream(connection.getInputStream());
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
byte[] buffer = new byte[16 * 1024];
long read = 0;
long write = 0;
while (true) {
long t1 = System.nanoTime();
int count = inputStream.read(buffer);
long t2 = System.nanoTime();
read += t2 - t1;
if (count < 0)
break;
t1 = System.nanoTime();
outputStream.write(buffer, 0, count);
t2 = System.nanoTime();
write += t2 - t1;
}
System.out.println(read + " " + write);
This tells me that reading from the network stream takes 25.756s while writing to the ByteArrayOutputStream only takes 0.817s. This makes sense as these two numbers almost perfectly sum up to the previously measured 26.250s (plus some additional measuring overhead).
In the very same way I modified code snippet #4:
MyDoubleBufferInputStream doubleBufferInputStream = new MyDoubleBufferInputStream();
new Thread(new Runnable() {
@Override
public void run() {
try (InputStream inputStream = new BufferedInputStream(httpChannelOutputStream.getConnection().getInputStream(), 256 * 1024)) {
byte[] buffer = new byte[16 * 1024];
long read = 0;
long write = 0;
while (true) {
long t1 = System.nanoTime();
int count = inputStream.read(buffer);
long t2 = System.nanoTime();
read += t2 - t1;
if (count < 0)
break;
t1 = System.nanoTime();
doubleBufferInputStream.write(buffer, 0, count);
t2 = System.nanoTime();
write += t2 - t1;
}
System.out.println(read + " " + write);
} catch (IOException e) {
} finally {
doubleBufferInputStream.closeWriting();
}
}
}).start();
DataDeserializer deserializer = new DataDeserializer(doubleBufferInputStream);
deserializer.Deserialize();
Now I would expect that the measured reading time is exactly the same as in the previous example. But instead, the read variable holds a value of 39.294s (How is that possible?? It's the exact same code being measured as in the previous example with 25.756s!)* while writing to my double buffer only takes 0.096s. Again, these numbers almost perfectly sum up to the measured time of code snippet #4.
Additionally, I profiled this very same code using Java VisualVM. That tells me that 40s were spent in this thread's run() method and 100% of these 40s are CPU time. On the other hand, it also spends 40s inside of the deserializer, but here only 26s are CPU time and 14s are spent waiting. This perfectly matches the time of reading from network into ByteArrayOutputStream. So I guess I have to improve my double buffer's "buffer-switching-algorithm".
*) Is there any explanation for this strange observation? I could only imagine that this way of measuring is very inaccurate. However, the read- and write-times of the latest measurements perfectly sum up to the original measurement, so it cannot be that inaccurate... Could someone please shed some light on this?
I was not able to find these read and write performances in the profiler... I will try to find some settings that allow me to observe the profiling results for these two methods.
Apparently, my "mistake" was to use a 32-bit JVM (jre1.8.0_172, to be precise).
Running the very same code snippets on a 64-bit version JVM, and tadaaa... it is fast and makes all sense there.
In particular see these new numbers for the corresponding code snippets:
snippet #1: 4.667s (vs. 26.250s)
snippet #2: 11.568s (vs. 23.466s)
snippet #3: 17.185s (vs. 60.100s)
snippet #4: 12.336s (vs. 39.885s)
So apparently, the answers given to "Does Java 64 bit perform better than the 32-bit version?" are simply not true anymore. Or, there is a serious bug in this particular 32-bit JRE version. I didn't test any others yet.
As you can see, #4 is only slightly slower than #2 which perfectly matches my original assumption that
Based on 1. and 2. I'm assuming that it should be somehow possible to
do the entire job in a combined way (reading from the network +
deserializing) which should take not much more than 26.250s.
Also the very weird results of my profiling approach described in Update 2 of my question do not occur anymore. I didn't repeat every single test in 64 bit yet, but all profiling results that I did do are plausible now, i.e. the same code takes the same time no matter in which code snippet. So maybe it's really a bug, or does anybody have a reasonable explanation?
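To check which kind of JVM a test actually runs on, a small sketch like the following can help. Note the assumption: sun.arch.data.model is a HotSpot-specific system property and may be missing on other JVMs, which is why a default value is supplied; os.arch and java.vm.name are standard properties. The class name is made up.

```java
// Hypothetical helper for printing the JVM's data model and architecture.
final class JvmBitness {

    /** Returns a short description of the running JVM. */
    static String describe() {
        // HotSpot-specific; typically "32" or "64". Falls back to "unknown" elsewhere.
        String model = System.getProperty("sun.arch.data.model", "unknown");
        String arch = System.getProperty("os.arch", "unknown");
        String vm = System.getProperty("java.vm.name", "unknown");
        return "data model: " + model + ", os.arch: " + arch + ", vm: " + vm;
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```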
The most certain way to improve any of these is to change
connection.getInputStream()
to
new BufferedInputStream(connection.getInputStream())
If that doesn't help, the input stream isn't your problem.
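Regarding the question whether standard Java has something like a double buffer that one thread fills while another reads: the closest built-in candidates are probably PipedInputStream and PipedOutputStream. The following is only a sketch under that assumption (PipedHandOff and readAhead are invented names), and whether it beats a hand-written double buffer would have to be measured, since the piped streams use coarse-grained waiting internally.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;

// Hypothetical helper: a background thread drains the source into a pipe.
final class PipedHandOff {

    /** Starts a producer thread that copies source into a pipe and returns the pipe's reading end. */
    static InputStream readAhead(InputStream source, int pipeSize) throws IOException {
        PipedInputStream readEnd = new PipedInputStream(pipeSize);
        PipedOutputStream writeEnd = new PipedOutputStream(readEnd);
        Thread producer = new Thread(() -> {
            byte[] buffer = new byte[16 * 1024];
            // Closing writeEnd at the end signals end-of-stream to the reader.
            try (InputStream in = source; OutputStream out = writeEnd) {
                int count;
                while ((count = in.read(buffer)) >= 0) {
                    out.write(buffer, 0, count);
                }
            } catch (IOException e) {
                // The reader sees end of stream; real code should report the error.
            }
        }, "read-ahead");
        producer.setDaemon(true);
        producer.start();
        return readEnd;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[100_000];
        try (InputStream in = readAhead(new ByteArrayInputStream(data), 64 * 1024)) {
            long total = 0;
            byte[] buf = new byte[4096];
            for (int n; (n = in.read(buf)) >= 0; ) {
                total += n;
            }
            System.out.println("bytes read: " + total);
        }
    }
}
```

The returned stream could be passed to a consumer such as the DataDeserializer from the question in place of the network stream.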
I'm reading about buffered streams. I searched for the topic and found many answers that cleared up the concepts, but I still have a few more questions.
After searching, I have come to know that a buffer is temporary memory (RAM) that lets a program read data more quickly than from the hard disk, and that the native input API is called when the buffer is empty.
After reading a little more, I got this answer from here:
Reading data from disk byte-by-byte is very inefficient. One way to
speed it up is to use a buffer: instead of reading one byte at a time,
you read a few thousand bytes at once, and put them in a buffer, in
memory. Then you can look at the bytes in the buffer one by one.
I have two points of confusion:
1: How, and by whom, is the data loaded into the buffer (how does the native API do it)? In the quote above, who reads the few thousand bytes at once? Surely that takes the same time. Suppose I have 5 MB of data, and it is loaded into the buffer in 5 seconds, and then the program uses the data from the buffer for 5 seconds: 10 seconds in total. But if I skip buffering, the program gets the data directly from the hard disk at 1 MB per 2 seconds, which is also 10 seconds in total. Please clear up this confusion.
2: The second one is how this line works:
BufferedReader inputStream = new BufferedReader(new FileReader("xanadu.txt"));
As I'm thinking FileReader write data to buffer, then BufferedReader read data from buffer memory? Also explain this.
Thanks.
As for the performance of using buffering during read/write, it's probably minimal in impact since the OS will cache too, however buffering will reduce the number of calls to the OS, which will have an impact.
When you add other operations on top, such as character encoding/decoding or compression/decompression, the impact is greater as those operations are more efficient when done in blocks.
Your second question said:
As I'm thinking FileReader write data to buffer, then BufferedReader read data from buffer memory? Also explain this.
I believe your thinking is wrong. Yes, technically the FileReader will write data to a buffer, but the buffer is not defined by the FileReader, it's defined by the caller of the FileReader.read(buffer) method.
The operation is initiated from outside, when some code calls BufferedReader.read() (any of the overloads). BufferedReader will then check its buffer, and if enough data is available in the buffer, it will return the data without involving the FileReader. If more data is needed, the BufferedReader will call the FileReader.read(buffer) method to get the next chunk of data.
It's a pull operation, not a push, meaning the data is pulled out of the readers by the caller.
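The pull behaviour can be made visible with a small sketch that counts how often the BufferedReader asks the underlying reader for data. The names PullDemo and CountingReader are invented for this example, and a StringReader stands in for the FileReader.

```java
import java.io.BufferedReader;
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical demo class.
final class PullDemo {

    /** Wraps a Reader and counts calls to the bulk read method. */
    static final class CountingReader extends FilterReader {
        int bulkReads;

        CountingReader(Reader in) {
            super(in);
        }

        @Override
        public int read(char[] cbuf, int off, int len) throws IOException {
            bulkReads++; // one more request pulled from the underlying reader
            return super.read(cbuf, off, len);
        }
    }

    /** Reads text char by char through a BufferedReader; returns the number of underlying bulk reads. */
    static int countUnderlyingReads(String text) throws IOException {
        CountingReader source = new CountingReader(new StringReader(text));
        try (BufferedReader br = new BufferedReader(source)) {
            while (br.read() != -1) {
                // consume one char at a time
            }
        }
        return source.bulkReads;
    }

    public static void main(String[] args) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 20_000; i++) {
            sb.append('x');
        }
        System.out.println("single-char reads: 20000, underlying bulk reads: "
                + countUnderlyingReads(sb.toString()));
    }
}
```

The caller issues one read() per char, yet the underlying reader is only asked for data a handful of times, each time for a whole chunk.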
All of this is done by a private method named fill(), which I reproduce here for educational purposes; any Java IDE lets you see the source code yourself:
private void fill() throws IOException {
int dst;
if (markedChar <= UNMARKED) {
/* No mark */
dst = 0;
} else {
/* Marked */
int delta = nextChar - markedChar;
if (delta >= readAheadLimit) {
/* Gone past read-ahead limit: Invalidate mark */
markedChar = INVALIDATED;
readAheadLimit = 0;
dst = 0;
} else {
if (readAheadLimit <= cb.length) {
/* Shuffle in the current buffer */
// here copy the read chars in a memory buffer named cb
System.arraycopy(cb, markedChar, cb, 0, delta);
markedChar = 0;
dst = delta;
} else {
/* Reallocate buffer to accommodate read-ahead limit */
char ncb[] = new char[readAheadLimit];
System.arraycopy(cb, markedChar, ncb, 0, delta);
cb = ncb;
markedChar = 0;
dst = delta;
}
nextChar = nChars = delta;
}
}
int n;
do {
n = in.read(cb, dst, cb.length - dst);
} while (n == 0);
if (n > 0) {
nChars = dst + n;
nextChar = dst;
}
}
I'm pretty new to NIO and wanted to implement some feature with it, instead of typical Streams (which can do all sort of things).
What I'm not sure I can get is reading from a file into a buffer and limiting the content that I will transfer. Let's say from position 100 to 200 (even if file length is 1000). It also would be nice to do on network sockets.
I know that NIO keeps things basic to leverage OS capabilities that's why I'm not sure it can be done.
I was thinking that a tricky way to do it would be a 'LimitedReadChannel' that, when it should return less than the available buffer size, reads into another byte buffer and then transfers to the original one (1). But that seems more tricky than necessary. I also don't want to use anything related to streams because it would defeat the purpose of using NIO.
(1) So far....
LimitedChannel.read(buffer) {
if (buffer.available?? > contentLeft) {
delegateChannel.read(smallerBuffer);
// transfer from smallerBuffer to buffer
} else {
delegateChannel.read(buffer);
}
}
I've found that buffers let you query the current limit or set a new one. So that wrapper channel (the one that limits the effective number of bytes read) could modify the buffer limit to avoid reading more...
Something like:
// LimitedChannel.java
// private int bytesLeft; // remaining amount of bytes to read
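public int read(ByteBuffer buffer) throws IOException {
    if (bytesLeft <= 0) {
        return -1;
    }
    int oldLimit = buffer.limit();
    if (bytesLeft < buffer.remaining()) {
        // ensure I'm not reading more than allowed
        buffer.limit(buffer.position() + bytesLeft);
    }
    int bytesRead;
    try {
        bytesRead = delegateChannel.read(buffer);
    } finally {
        // always restore the caller's limit, even if the read fails
        buffer.limit(oldLimit);
    }
    if (bytesRead > 0) {
        // do not subtract -1 when the delegate reports end of stream
        bytesLeft -= bytesRead;
    }
    return bytesRead;
}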
Anyway not sure if this already exists somewhere. It's difficult to find documentation about this use case...
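For the file case specifically (not sockets, which have no notion of position), a simpler sketch is possible with FileChannel's positional read: size the buffer to the wanted range and read from an absolute offset. RangeRead and readRange are names made up for this illustration.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper for reading a byte range from a file.
final class RangeRead {

    /** Reads the bytes in [start, end) from the file; returns fewer if the file is shorter. */
    static ByteBuffer readRange(Path file, long start, long end) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate((int) (end - start));
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long pos = start;
            while (buf.hasRemaining()) {
                // Positional read: does not change the channel's own position.
                int n = ch.read(buf, pos);
                if (n < 0) {
                    break; // end of file reached before the range was complete
                }
                pos += n;
            }
        }
        buf.flip();
        return buf;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("range", ".bin");
        Files.write(file, new byte[1000]);
        System.out.println("bytes read: " + readRange(file, 100, 200).remaining());
        Files.delete(file);
    }
}
```

The buffer's capacity acts as the limit here, so no wrapper channel is needed when the source is a file.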
I need advice from someone who knows Java and its memory issues very well.
I have a large file (something like 1.5 GB) and I need to cut it into many smaller files (100, for example).
I know generally how to do it (using a BufferedReader), but I would like to know if you have any advice regarding memory, or tips on how to do it faster.
My file contains text, not binary data, and has about 20 characters per line.
To save memory, do not unnecessarily store/duplicate the data in memory (i.e. do not assign them to variables outside the loop). Just process the output immediately as soon as the input comes in.
It really doesn't matter whether you're using BufferedReader or not. It will not cost significantly more memory, as some implicitly seem to suggest. At worst it will take only a few percent off performance. The same applies to using NIO. It will only improve scalability, not memory use. It will only become interesting when you have hundreds of threads running on the same file.
Just loop through the file, write every line immediately to other file as you read in, count the lines and if it reaches 100, then switch to next file, etcetera.
Kickoff example:
String encoding = "UTF-8";
int maxlines = 100;
BufferedReader reader = null;
BufferedWriter writer = null;
try {
reader = new BufferedReader(new InputStreamReader(new FileInputStream("/bigfile.txt"), encoding));
int count = 0;
for (String line; (line = reader.readLine()) != null;) {
if (count++ % maxlines == 0) {
close(writer);
writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("/smallfile" + (count / maxlines) + ".txt"), encoding));
}
writer.write(line);
writer.newLine();
}
} finally {
close(writer);
close(reader);
}
First, if your file contains binary data, then using BufferedReader would be a big mistake (because you would be converting the data to String, which is unnecessary and could easily corrupt the data); you should use a BufferedInputStream instead. If it's text data and you need to split it along linebreaks, then using BufferedReader is OK (assuming the file contains lines of a sensible length).
Regarding memory, there shouldn't be any problem if you use a decently sized buffer (I'd use at least 1MB to make sure the HD is doing mostly sequential reading and writing).
If speed turns out to be a problem, you could have a look at the java.nio packages - those are supposedly faster than java.io.
You can consider using memory-mapped files, via FileChannels.
Generally a lot faster for large files. There are performance trade-offs that could make it slower, so YMMV.
Related answer: Java NIO FileChannel versus FileOutputstream performance / usefulness
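As a rough illustration of the memory-mapped approach, the sketch below scans a file through mapped buffers and merely counts line breaks, which is the step a splitter would need to find its cut points. The class name is invented, and mapping in chunks of at most Integer.MAX_VALUE bytes is an assumption made here because a single mapping cannot be larger than that. Whether this is faster than a BufferedReader would have to be measured.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical example class.
final class MappedLineCount {

    /** Counts '\n' bytes in the file using read-only memory mappings. */
    static long countLines(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            long lines = 0;
            long chunk = Integer.MAX_VALUE; // upper bound for one mapping
            for (long pos = 0; pos < size; pos += chunk) {
                long len = Math.min(chunk, size - pos);
                MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, pos, len);
                while (map.hasRemaining()) {
                    if (map.get() == '\n') {
                        lines++;
                    }
                }
            }
            return lines;
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("mapped", ".txt");
        file.toFile().deleteOnExit(); // mapped files cannot always be deleted right away
        Files.write(file, "one\ntwo\nthree\n".getBytes(StandardCharsets.UTF_8));
        System.out.println("lines: " + countLines(file));
    }
}
```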
This is a very good article:
http://java.sun.com/developer/technicalArticles/Programming/PerfTuning/
In summary, for great performance, you should:
Avoid accessing the disk.
Avoid accessing the underlying operating system.
Avoid method calls.
Avoid processing bytes and characters individually.
For example, to reduce the access to disk, you can use a large buffer. The article describes various approaches.
Does it have to be done in Java? I.e. does it need to be platform independent? If not, I'd suggest using the 'split' command in *nix. If you really wanted, you could execute this command via your Java program. While I haven't tested it, I imagine it would perform faster than whatever Java IO implementation you could come up with.
You can use java.nio which is faster than classical Input/Output stream:
http://java.sun.com/javase/6/docs/technotes/guides/io/index.html
Yes.
I also think that using read() with arguments, like read(char[] cbuf, int off, int len), is a better way to read such a large file
(e.g. read(buffer, 0, buffer.length)).
I have also run into missing values when using a BufferedReader instead of a BufferedInputStream for a binary input stream, so a BufferedInputStream is the much better choice in that case.
package all.is.well;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import junit.framework.TestCase;
/**
* @author Naresh Bhabat
*
* The following implementation helps to deal with extra-large files in Java.
* This program was tested with a 2 GB input file.
* There are some points where extra logic can be added in future.
* Please note: to deal with a binary input file, read bytes from the file object instead of reading lines.
* It uses a random access file, which is almost like a streaming API.
* ****************************************
Notes regarding executor framework and its readings.
Please note :ExecutorService executor = Executors.newFixedThreadPool(10);
* for 10 threads:Total time required for reading and writing the text in
* :seconds 349.317
*
* For 100:Total time required for reading the text and writing : seconds 464.042
*
* For 1000 : Total time required for reading and writing text :466.538
* For 10000 Total time required for reading and writing in seconds 479.701
*
*
*/
public class DealWithHugeRecordsinFile extends TestCase {
static final String FILEPATH = "C:\\springbatch\\bigfile1.txt.txt";
static final String FILEPATH_WRITE = "C:\\springbatch\\writinghere.txt";
static volatile RandomAccessFile fileToWrite;
static volatile RandomAccessFile file;
static volatile String fileContentsIter;
static volatile int position = 0;
public static void main(String[] args) throws IOException, InterruptedException {
long currentTimeMillis = System.currentTimeMillis();
try {
fileToWrite = new RandomAccessFile(FILEPATH_WRITE, "rw");//for random write,independent of thread obstacles
file = new RandomAccessFile(FILEPATH, "r");//for random read,independent of thread obstacles
seriouslyReadProcessAndWriteAsynch();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
Thread currentThread = Thread.currentThread();
System.out.println(currentThread.getName());
long currentTimeMillis2 = System.currentTimeMillis();
double time_seconds = (currentTimeMillis2 - currentTimeMillis) / 1000.0;
System.out.println("Total time required for reading the text in seconds " + time_seconds);
}
/**
* @throws IOException
* Something asynchronously serious
*/
public static void seriouslyReadProcessAndWriteAsynch() throws IOException {
ExecutorService executor = Executors.newFixedThreadPool(10);//pls see for explanation in comments section of the class
while (true) {
String readLine = file.readLine();
if (readLine == null) {
break;
}
Runnable genuineWorker = new Runnable() {
@Override
public void run() {
// do hard processing here in this thread,i have consumed
// some time and ignore some exception in write method.
writeToFile(FILEPATH_WRITE, readLine);
// System.out.println(" :" +
// Thread.currentThread().getName());
}
};
executor.execute(genuineWorker);
}
executor.shutdown();
while (!executor.isTerminated()) {
}
System.out.println("Finished all threads");
file.close();
fileToWrite.close();
}
/**
* @param filePath
* @param data
*/
private static void writeToFile(String filePath, String data) {
try {
// fileToWrite.seek(position);
data = "\n" + data;
if (!data.contains("Randomization")) {
return;
}
System.out.println("Let us do something time consuming to make this thread busy"+(position++) + " :" + data);
System.out.println("Lets consume through this loop");
int i=1000;
while(i>0){
i--;
}
fileToWrite.write(data.getBytes());
throw new Exception();
} catch (Exception exception) {
System.out.println("exception was thrown but still we are able to proceed further"
+ " \n This can be used for marking failure of the records");
//exception.printStackTrace();
}
}
}
Don't use read() without arguments.
It's very slow.
It is better to read into a buffer and write that to the file right away.
Use BufferedInputStream because it supports binary reading.
That's all.
Unless you accidentally read in the whole input file instead of reading it line by line, your primary limitation will be disk speed. You may want to try starting with a file containing 100 lines and write it to 100 different files, one line in each, and make the triggering mechanism work on the number of lines written to the current file. That program will be easily scalable to your situation.
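A minimal sketch of that suggestion, where the switch to the next output file is triggered by the number of lines written to the current one. LineSplitter, split and the part-N.txt naming are assumptions made for this example, and UTF-8 is assumed as the encoding.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical splitter; output files are named part-0.txt, part-1.txt, ...
final class LineSplitter {

    /** Copies every linesPerFile lines of input into a new file in outDir; returns the number of files written. */
    static int split(Path input, Path outDir, int linesPerFile) throws IOException {
        int files = 0;
        int linesInCurrent = 0;
        BufferedWriter writer = null;
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            for (String line; (line = reader.readLine()) != null; ) {
                // Trigger: no file open yet, or the current file is full.
                if (writer == null || linesInCurrent == linesPerFile) {
                    if (writer != null) {
                        writer.close();
                    }
                    writer = Files.newBufferedWriter(
                            outDir.resolve("part-" + files + ".txt"), StandardCharsets.UTF_8);
                    files++;
                    linesInCurrent = 0;
                }
                writer.write(line);
                writer.newLine();
                linesInCurrent++;
            }
        } finally {
            if (writer != null) {
                writer.close();
            }
        }
        return files;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("split");
        Path input = dir.resolve("input.txt");
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            lines.add("line " + i);
        }
        Files.write(input, lines, StandardCharsets.UTF_8);
        System.out.println("files written: " + split(input, dir, 1));
    }
}
```

Starting with linesPerFile set to 1 reproduces the 100-lines-into-100-files experiment; raising it scales the same code to the large file.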