Fastest way to read/write an array from/to a file?


I know there were several similar threads here and on the net but I seem to be doing something wrong, I guess. My task is easy - write (and later read) a big array of integers (int [] or ArrayList or what you think is best) to a file. The faster the better. My concrete array has about 4.5M integers in it and currently the times are for example (in ms):
Generating trie: 14851.13071
Generating array: 2237.4661619999997
Saving array: 89250.167617
Loading array: 114908.08185799999
This is unacceptable and I guess the times should be much lower. What am I doing wrong? I don't need the fastest method on earth but getting these times to about 5 - 15 seconds (less is welcome but not mandatory) is my goal.
My current code:
long start = System.nanoTime();
Node trie = dawg.generateTrie("dict.txt");
long afterGeneratingTrie = System.nanoTime();
ArrayList<Integer> array = dawg.generateArray(trie);
long afterGeneratingArray = System.nanoTime();
try
{
new ObjectOutputStream(new FileOutputStream("test.txt")).writeObject(array);
}
catch (Exception e)
{
Logger.getLogger(DawgTester.class.getName()).log(Level.SEVERE, null, e);
}
long afterSavingArray = System.nanoTime();
ArrayList<Integer> read = new ArrayList<Integer>();
try
{
read = (ArrayList)new ObjectInputStream(new FileInputStream("test.txt")).readObject();
}
catch (Exception e)
{
Logger.getLogger(DawgTester.class.getName()).log(Level.SEVERE, null, e);
}
long afterLoadingArray = System.nanoTime();
System.out.println("Generating trie: " + 0.000001 * (afterGeneratingTrie - start));
System.out.println("Generating array: " + 0.000001 * (afterGeneratingArray - afterGeneratingTrie));
System.out.println("Saving array: " + 0.000001 * (afterSavingArray - afterGeneratingArray));
System.out.println("Loading array: " + 0.000001 * (afterLoadingArray - afterSavingArray));

Don't use Java serialization. It is very powerful and robust, but not particularly speedy (or compact). Use a simple DataOutputStream and call writeInt() for each element, and make sure you put a BufferedOutputStream between the DataOutputStream and the FileOutputStream.
If you want to pre-size your array on read, write the array length as your first int.
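A minimal sketch of that suggestion, assuming the data is held in a plain int[]. The class name IntArrayFile and the temporary file used in main are made up for illustration; the stream classes are the standard java.io ones.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class IntArrayFile {
    // Writes the array length first, then every element as a 4-byte big-endian int.
    public static void write(Path file, int[] values) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            out.writeInt(values.length);
            for (int v : values) {
                out.writeInt(v);
            }
        }
    }

    // Reads the length prefix, pre-sizes the array, then reads the elements.
    public static int[] read(Path file) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            int[] values = new int[in.readInt()];
            for (int i = 0; i < values.length; i++) {
                values[i] = in.readInt();
            }
            return values;
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("ints", ".bin"); // hypothetical scratch file
        int[] original = {1, -2, 3, Integer.MAX_VALUE};
        write(file, original);
        System.out.println(Arrays.equals(original, read(file))); // round-trip check
        Files.delete(file);
    }
}
```

The try-with-resources blocks matter here: closing the DataOutputStream flushes the buffer, which the code in the question never does.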

Something like the following is probably a fairly fast option. You should also use an actual int[] array rather than an ArrayList<Integer> if your concern is reducing overhead.
final Path path = Paths.get("dict.txt");
...
final int[] rsl = dawg.generateArray(trie);
final ByteBuffer buf = ByteBuffer.allocateDirect(rsl.length << 2);
final IntBuffer buf_i = buf.asIntBuffer();
buf_i.put(rsl); // the int view has its own position; buf itself stays at position 0
try (final WritableByteChannel out = Files.newByteChannel(path,
StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
do {
out.write(buf);
} while (buf.hasRemaining());
}
buf.clear();
try (final ReadableByteChannel in = Files.newByteChannel(path,
StandardOpenOption.READ)) {
// keep reading until the buffer is full or the file ends
while (buf.hasRemaining() && in.read(buf) != -1) {
}
}
buf_i.clear();
buf_i.get(rsl);

Related

Read and compare two large Files

I would like to read and compare all the lines of two files. To explain: for each password hash in my test.txt file, I would like to find the identical hash in the password.txt file. The problem is that it should be fast enough (I would say 45 minutes at most for 10M lines in password.txt and 1M lines in test.txt).
I have for the moment this code
private static void bufferedReaderFilePasswordFirst() {
Path path = Paths.get("C:\\Users\\basil\\OneDrive - Haute Ecole Bruxelles Brabant (HE2B)\\Documents\\NetBeansProjects\\sha256\\passwords.txt");
Path pathUser = Paths.get("C:\\Users\\basil\\OneDrive - Haute Ecole Bruxelles Brabant (HE2B)\\Documents\\NetBeansProjects\\sha256\\test.txt");
int nbOfLine = 0;
StringBuffer oui = new StringBuffer();
try (BufferedReader readerPasswordGenerate = Files.newBufferedReader(path, Charset.forName("UTF-8"));) {
String currentLineUser = null;
String currentLinePassword = null;
long start = System.nanoTime();
while (((currentLinePassword = readerPasswordGenerate.readLine()) != null)) {
BufferedReader readerPasswordUser = Files.newBufferedReader(pathUser, Charset.forName("UTF-8"));
while ((currentLineUser = readerPasswordUser.readLine()) != null) {
String firstWord = currentLinePassword.substring(0, currentLinePassword.indexOf(":"));
if ((firstWord.charAt(0) == currentLineUser.charAt(0))
&& (firstWord.charAt(14) == currentLineUser.charAt(14))
&& (firstWord.charAt(31) == currentLineUser.charAt(31))
&& (firstWord.charAt(63) == currentLineUser.charAt(63))
) {
if (firstWord.equals(currentLineUser)) {
String secondWord = currentLinePassword.substring(currentLinePassword.lastIndexOf(":") + 1);
oui.append(secondWord).append(System.lineSeparator());
}
}
}
if (nbOfLine % 300 == 0) {
System.out.println("We are at the " + nbOfLine);
final long consumed = System.nanoTime() - start;
final long totConsumed = TimeUnit.NANOSECONDS.toMillis(consumed);
final double tot = (double) totConsumed;
System.out.printf("Not done. Took %s seconds", (tot / 1000));
System.out.println(oui + " oui");
}
nbOfLine++;
}
System.out.println(oui);
final long consumed = System.nanoTime() - start;
final long totConsumed = TimeUnit.NANOSECONDS.toMillis(consumed);
final double tot = (double) totConsumed;
System.out.printf("Done. Took %s seconds", (tot / 1000));
} catch (IOException ex) {
ex.printStackTrace(); //handle an exception here
}
}
In this code, I just compare for each element in my test.txt if the corresponding element in the password hash is same.
The password.txt contains for all elements: hash:password
and test.txt contains only: hash
Thanks

In this code, I just compare for each element in my test.txt if the corresponding element in the password hash is same.
If you are familiar with Big-O notation, you might recognize that this means your algorithm runs in O(n^2) time. In your specific case, for each of the 1,000,000 lines in test.txt you are doing 10,000,000 comparisons for a total of 10,000,000,000,000 total comparisons. To achieve your goal of running it within 45 minutes you would need to do 3.7 billion comparisons per second. For comparison, the i7 in my laptop runs at a max of 3.9GHz (billion cycles per second) and it will take much more than a single cpu cycle to execute one of these comparisons.
You can reduce the time complexity down to O(n) by first reading the password.txt into a HashMap (10,000,000 operations). From there, any individual check from test.txt only takes a single operation (1,000,000 total), resulting in 11,000,000 operations total. That means you only have to do ~4,000 operations a second (a 99.99989% reduction) to finish in 45 minutes which is much more doable.
Here's some pseudo-code to illustrate what that could look like:
// I like Scanner over BufferedReader for reading files. Use whatever you like.
// Note: new Scanner(File) throws FileNotFoundException, so the enclosing method must handle or declare it.
// Load all password/hash pairings from password.txt into a HashMap for quick lookups
Map<String, List<String>> passwords = new HashMap<>();
try (Scanner readPassword = new Scanner(new File("password.txt"))) {
while (readPassword.hasNextLine()) {
String line = readPassword.nextLine();
// Split on the first ':' only, so passwords that contain ':' stay intact
String[] lineParts = line.split(":", 2);
String hash = lineParts[0];
String password = lineParts[1];
// If we haven't seen the hash before, create a new list to store its associated passwords,
// then add the password to the list of all passwords that have this hash
passwords.computeIfAbsent(hash, k -> new ArrayList<>()).add(password);
}
}
// Perform all the lookups from test.txt
try (Scanner readTest = new Scanner(new File("test.txt"))) {
while (readTest.hasNextLine()) {
String testHash = readTest.nextLine();
// null when no entry in password.txt has this hash
List<String> matchingPasswords = passwords.get(testHash);
// Now do whatever you want with the list of associated passwords...
}
}
Side Notes:
Looking at your code, it looks like you have a few extra requirements (e.g. timing) that I didn't consider in this code snippet. I trust you can figure out how to integrate those additional requirements.
Some of the more academic people on here might take issue with a few parts of my Big-O description/analysis. I'm sure their comments on this post will expound that topic in greater detail if that interests you.

Understanding java ByteBuffer

I have been trying to understand how Java's ByteBuffer works. My aim is to write a string to a ByteBuffer and read it back. I want to understand how ByteBuffer properties like limit, capacity, remaining and position are affected by read/write operations.
Below is my test program (removed import statements for brevity).
public class TestBuffer {
private ByteBuffer bytes;
private String testStr = "Stackoverflow is a great place to discuss tech stuff!";
public TestBuffer() {
bytes = ByteBuffer.allocate(1000);
System.out.println("init: " + printBuffer());
}
public static void main(String a[]) {
TestBuffer buf = new TestBuffer();
try {
buf.writeBuffer();
} catch (IOException e) {
e.printStackTrace();
}
buf.readBuffer();
}
// write testStr to buffer
private void writeBuffer() throws IOException {
byte[] b = testStr.getBytes();
BufferedInputStream in = new BufferedInputStream(new ByteArrayInputStream(b));
in.read(bytes.array());
in.close();
System.out.println("write: " + printBuffer());
}
// read buffer data back to byte array and print
private void readBuffer() {
bytes.flip();
byte[] b = new byte[bytes.position()];
bytes.position(0);
bytes.get(b);
System.out.println("data read: " + new String(b));
System.out.println("read: " + printBuffer());
}
public String printBuffer() {
return "ByteBuffer [limit=" + bytes.limit() + ", capacity=" + bytes.capacity() + ", position="
+ bytes.position() + ", remaining=" + bytes.remaining() + "]";
}
}
Output
init: ByteBuffer [limit=1000, capacity=1000, position=0, remaining=1000]
write: ByteBuffer [limit=1000, capacity=1000, position=0, remaining=1000]
data read:
read: ByteBuffer [limit=0, capacity=1000, position=0, remaining=0]
As you can see, there is no data after calling readBuffer(), and the values of the various fields do not change after the write and read operations.
Update
Below is the working piece of code from Android Screen Library which I was originally trying to understand
// retrieve the screenshot
// (this method - via ByteBuffer - seems to be the fastest)
ByteBuffer bytes = ByteBuffer.allocate (ss.width * ss.height * ss.bpp / 8);
is = new BufferedInputStream(is); // buffering is very important apparently
is.read(bytes.array()); // reading all at once for speed
bytes.position(0); // reset position to the beginning of ByteBuffer
Please help me to understand this.
Thanks

Your buffer is never filled. bytes.array() simply retrieves the backing byte array. If you write anything to this then the ByteBuffer fields - except the array itself of course - are unaffected. So the position stays at zero.
What you are doing in in.read(bytes.array()) is identical to byte[] tmp = bytes.array() followed by in.read(tmp). Writing through the tmp variable cannot update the state of the bytes instance. The backing array is changed, which means that the contents of the ByteBuffer change as well. But the offsets into the backing byte array, including the position and limit, do not.
You should only fill the ByteBuffer using any of the put methods (that do not take an index) such as put(byte[]).
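To make the difference concrete, here is a small sketch that compares the position after writing into the backing array with the position after a relative put. The class and method names are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class PutVersusArray {
    // Copies the data straight into the backing array, bypassing the buffer's bookkeeping.
    public static int positionAfterArrayWrite(byte[] data) {
        ByteBuffer buffer = ByteBuffer.allocate(16);
        System.arraycopy(data, 0, buffer.array(), 0, data.length);
        return buffer.position();
    }

    // Uses the relative bulk put, which advances the position by data.length.
    public static int positionAfterPut(byte[] data) {
        ByteBuffer buffer = ByteBuffer.allocate(16);
        buffer.put(data);
        return buffer.position();
    }

    public static void main(String[] args) {
        byte[] data = "hello".getBytes(StandardCharsets.US_ASCII);
        System.out.println(positionAfterArrayWrite(data)); // prints 0
        System.out.println(positionAfterPut(data));        // prints 5
    }
}
```

In both cases the bytes end up in the backing array, but only the put variant leaves the buffer in a state where flip() followed by get() will return them.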
I'll provide a code fragment that may get you thinking on how to handle strings, encodings and character and byte buffers:
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;
public class TestBuffer {
private static final String testStr = "Stackoverflow is a great place to discuss tech stuff!";
private static final boolean END_OF_INPUT = true;
private ByteBuffer bytes = ByteBuffer.allocate(1000);
public TestBuffer() {
System.out.println("init : " + bytes.toString());
}
public static void main(String a[]) {
TestBuffer buf = new TestBuffer();
buf.writeBuffer();
buf.readBuffer();
}
// write testStr to buffer
private void writeBuffer() {
CharBuffer testBuffer = CharBuffer.wrap(testStr);
CharsetEncoder utf8Encoder = StandardCharsets.UTF_8.newEncoder();
CoderResult result = utf8Encoder.encode(testBuffer, bytes, END_OF_INPUT);
if (result.isError()) {
bytes.clear();
throw new IllegalArgumentException("That didn't go right because " + result.toString());
}
if (result.isOverflow()) {
bytes.clear();
throw new IllegalArgumentException("Well, too little buffer space.");
}
System.out.println("written: " + bytes.toString());
bytes.flip();
}
// read buffer data back to byte array and print
private void readBuffer() {
byte[] b = new byte[bytes.remaining()];
bytes.get(b);
System.out.println("data : " + new String(b, StandardCharsets.UTF_8));
System.out.println("read : " + bytes.toString());
bytes.clear();
}
}
Note that buffers and streams are really two separate ways of handling sequential data. If you are trying to use both of them at the same time you may be trying to be too clever.
You could also solve this without CharBuffer and ByteBuffer using a byte[] buffer and a StringReader wrapped by a ReaderInputStream.
That Android piece of code completely abuses the ByteBuffer. It should just have created a byte[] and wrapped that, setting the limit to the capacity. Whatever you do, do not use it as an example on ByteBuffer handling. It made my eyes water in disgust. Code like that is a bug waiting to happen.
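A rough sketch of the "create a byte[] and wrap it" approach described above. The names WrapExample and readIntoBuffer are hypothetical, and the sketch assumes the expected size is known up front, as it is in the screenshot code.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

public class WrapExample {
    // Reads exactly `size` bytes from the stream and exposes them as a ByteBuffer.
    public static ByteBuffer readIntoBuffer(InputStream in, int size) throws IOException {
        byte[] data = new byte[size];
        // readFully keeps reading until the array is full and throws
        // EOFException if the stream ends early, unlike a single read() call.
        new DataInputStream(in).readFully(data);
        // wrap() yields position 0 and limit == capacity == size, ready for get().
        return ByteBuffer.wrap(data);
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[] {10, 20, 30, 40});
        ByteBuffer buffer = readIntoBuffer(in, 4);
        System.out.println(buffer.remaining()); // prints 4
    }
}
```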

You are not writing anything in the writeBuffer() method.
You may use something like bytes.put(b).

Although this question was answered a long time ago, let me add some supplementary information here.
There are two problems, one in the writeBuffer() method and one in the readBuffer() method, which together keep you from getting the expected result.
1) writeBuffer() method
As explained by Maarten Bodewes above regarding the nature of the ByteBuffer's backing array, you cannot use byteBuffer.array() directly for reading from a stream in your writeBuffer() method, because doing so does not update the buffer's position.
Alternatively, if you want to keep testing the relationship between InputStream and ByteBuffer as in your sample (which is also common practice in server-side applications for handling incoming messages), you need an additional byte array.
2) readBuffer() method
The original code is right to use an extra byte array for retrieving the content of the ByteBuffer in order to print it.
However, the problem here is the improper use of the flip() and position() methods.
The flip() method should only be called right before you want to switch the ByteBuffer from storing content to exporting content. So here it should appear right before the line bytes.get(b); instead. In the provided sample it was called too early, before the line byte[] b = new byte[bytes.position()];, because flip() resets the position of the ByteBuffer to 0 while setting the limit to the current position.
There is no point in setting the ByteBuffer's position to 0 explicitly in the sample code. In case you want to keep storing content in the ByteBuffer at some later time, starting from the current position (i.e. without overwriting the existing content), you should follow this workflow:
2.1 store bytebuffer's current position: i.e. int pos = bytebuffer.position();
2.2 process with the bytebuffer which may affect the position flag of it: e.g. bytebuffer.get(byte[] dst) etc.
2.3 restore bytebuffer's position flag to original value: i.e. bytebuffer.position(pos);
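The three steps above can be sketched as follows. The names PeekBuffer and peekWritten are made up for illustration. Note that flip() also moves the limit, so this sketch resets the limit to the capacity before restoring the position; that extra step is my addition and is not part of the workflow as listed, but without it the buffer would have no room left for further writes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class PeekBuffer {
    // Returns a copy of the bytes written so far and leaves the buffer ready for more writes.
    public static byte[] peekWritten(ByteBuffer buffer) {
        int pos = buffer.position();       // 2.1 remember the current write position
        buffer.flip();                     // limit = pos, position = 0
        byte[] copy = new byte[buffer.remaining()];
        buffer.get(copy);                  // 2.2 read the content, which moves the position
        buffer.limit(buffer.capacity());   // undo the limit change made by flip()
        buffer.position(pos);              // 2.3 restore the write position
        return copy;
    }

    public static void main(String[] args) {
        ByteBuffer buffer = ByteBuffer.allocate(16);
        buffer.put("abc".getBytes(StandardCharsets.US_ASCII));
        System.out.println(new String(peekWritten(buffer), StandardCharsets.US_ASCII)); // prints abc
        buffer.put("def".getBytes(StandardCharsets.US_ASCII)); // appends after the existing bytes
        System.out.println(new String(peekWritten(buffer), StandardCharsets.US_ASCII)); // prints abcdef
    }
}
```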
Here I have slightly modified your sample code for achieving what you want to do:
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
public class TestBuffer {
private ByteBuffer bytes;
private String testStr = "Stackoverflow is a great place to discuss tech stuff!";
public TestBuffer() {
bytes = ByteBuffer.allocate(1000);
System.out.println("init: " + printBuffer());
}
public static void main(String a[]) {
TestBuffer buf = new TestBuffer();
try {
buf.writeBuffer();
} catch (IOException e) {
e.printStackTrace();
}
buf.readBuffer();
}
// write testStr to buffer
private void writeBuffer() throws IOException {
byte[] b = testStr.getBytes();
BufferedInputStream in = new BufferedInputStream(new ByteArrayInputStream(b));
// in.read(bytes.array());
byte[] dst = new byte[b.length];
in.read(dst);
bytes.put(dst);
in.close();
System.out.println("write: " + printBuffer());
}
// read buffer data back to byte array and print
private void readBuffer() {
//bytes.flip();
byte[] b = new byte[bytes.position()];
//bytes.position(0);
int pos = bytes.position();
bytes.flip(); // bytes.rewind(); could achieve the same result here, use which one depends on whether:
// (1) reading to this bytebuffer is finished and fine to overwrite the current context in this bytebuffer afterwards: can use flip(); or
// (2) just want to tentatively traverse this bytebuffer from the begining to current position,
// and keep writing to this bytebuffer again later from current position.
bytes.get(b);
bytes.position(pos);
System.out.println("data read: " + new String(b));
System.out.println("read: " + printBuffer());
}
public String printBuffer() {
return "ByteBuffer [limit=" + bytes.limit() + ", capacity=" + bytes.capacity() + ", position="
+ bytes.position() + ", remaining=" + bytes.remaining() + "]";
}
}

Reading a block of bytes from one file and writing to other until all blocks are read?

I am working on a project in which I have to do some file reading and writing tasks. I have to read 8 bytes from a file at a time, perform some operations on that block, and then write the block to a second file, repeating the cycle until the first file has been completely read in chunks of 8 bytes, with the manipulated data appended to the second file each time. However, in doing so, I am facing some problems. Following is what I am trying:
private File readFromFile1(File file1) {
int offset = 0;
long message= 0;
try {
FileInputStream fis = new FileInputStream(file1);
byte[] data = new byte[8];
file2 = new File("file2.txt");
FileOutputStream fos = new FileOutputStream(file2.getAbsolutePath(), true);
DataOutputStream dos = new DataOutputStream(fos);
while(fis.read(data, offset, 8) != -1)
{
message = someOperation(data); // operation according to business logic
dos.writeLong(message);
}
fos.close();
dos.close();
fis.close();
} catch (IOException e) {
System.out.println("Some error occurred while reading from File:" + e);
}
return file2;
}
I am not getting the desired output this way. Any help is appreciated.

Consider the following code:
private File readFromFile1(File file1) {
int offset = 0;
long message = 0;
File file2 = null;
try {
FileInputStream fis = new FileInputStream(file1);
byte[] data = new byte[8]; //Read buffer
byte[] tmpbuf = new byte[8]; //Temporary chunk buffer
file2 = new File("file2.txt");
FileOutputStream fos = new FileOutputStream(file2.getAbsolutePath(), true);
DataOutputStream dos = new DataOutputStream(fos);
int readcnt; //Read count
int chunk; //Chunk size to write to tmpbuf
while ((readcnt = fis.read(data, 0, 8)) != -1) {
//// POINT A ////
//Skip chunking system if an 8 byte octet is read directly.
if(readcnt == 8 && offset == 0){
message = someOperation(data); // operation according to business logic (the full block is in the read buffer here, not in tmpbuf)
dos.writeLong(message);
continue;
}
//// POINT B ////
chunk = Math.min(tmpbuf.length - offset, readcnt); //Determine how much to add to the temp buf.
System.arraycopy(data, 0, tmpbuf, offset, chunk); //Copy bytes to temp buf
offset = offset + chunk; //Sets the offset to temp buf
if (offset == 8) {
message = someOperation(tmpbuf); // operation according to business logic
dos.writeLong(message);
if (chunk < readcnt) {
System.arraycopy(data, chunk, tmpbuf, 0, readcnt - chunk);
offset = readcnt - chunk;
} else {
offset = 0;
}
}
}
//// POINT C ////
//Process remaining bytes here...
//message = foo(tmpbuf);
//dos.writeLong(message);
fos.close();
dos.close();
fis.close();
} catch (IOException e) {
System.out.println("Some error occurred while reading from File:" + e);
}
return file2;
}
In this excerpt of code, what I did was:
Modified your reading code to capture the number of bytes actually read by the read() method (named readcnt).
Added a byte chunking system (the processing does not happen until there are at least 8 bytes in the chunking buffer).
Allowed for separate processing of the final bytes (those that do not make up a full 8-byte block).
As you can see from the code, the data being read is first stored in a chunking buffer (named tmpbuf) until at least 8 bytes are available. This happens only when 8 bytes are not available at once (if 8 bytes are read directly and nothing is pending in the chunking buffer, they are processed straight away; see "Point A" in the code). This is an optimization to avoid unnecessary array copies.
The chunking system uses offsets which increment every time bytes are written to tmpbuf until it reaches a value of 8 (it will not go over as the Math.min() method used in the assignment of 'chunk' will limit the value). Upon offset == 8, proceed to execute the processing code.
If that particular read produced more bytes than actually processed, continue writing them to tmpbuf, from the beginning again, whilst setting offset appropriately, otherwise set offset to 0.
Repeat cycle.
The code will leave the last few bytes of data that do not fill an 8-byte block in the array tmpbuf, with the offset variable indicating how many bytes have actually been written. This data can then be processed separately at point C.
This seems a lot more complicated than it should be, and there probably is a better solution (possibly using existing Java library methods), but off the top of my head, this is what I got. Hope this is clear enough for you to understand.
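For comparison, here is a shorter sketch of the same idea that loops on read() until a whole block is filled, so no separate chunking buffer is needed. It is only an illustration under assumptions: the names BlockCopy, readBlock and transform are made up, and someOperation is a placeholder that just interprets the 8 bytes as a big-endian long.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.ByteBuffer;

public class BlockCopy {
    // Fills the block as far as possible; returns fewer than block.length bytes only at end of stream.
    static int readBlock(InputStream in, byte[] block) throws IOException {
        int filled = 0;
        while (filled < block.length) {
            int n = in.read(block, filled, block.length - filled);
            if (n == -1) {
                break;
            }
            filled += n;
        }
        return filled;
    }

    // Placeholder for the business logic.
    static long someOperation(byte[] block) {
        return ByteBuffer.wrap(block).getLong();
    }

    // Processes every full 8-byte block and returns the number of trailing bytes left unprocessed.
    public static int transform(InputStream in, OutputStream out) throws IOException {
        DataOutputStream dos = new DataOutputStream(new BufferedOutputStream(out));
        byte[] block = new byte[8];
        int filled;
        while ((filled = readBlock(in, block)) == block.length) {
            dos.writeLong(someOperation(block));
        }
        dos.flush();
        return filled; // 0 when the input length is a multiple of 8
    }

    public static void main(String[] args) throws IOException {
        byte[] input = new byte[19];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int leftover = transform(new ByteArrayInputStream(input), out);
        System.out.println(out.size() + " bytes written, " + leftover + " left over"); // 16 bytes written, 3 left over
    }
}
```

With real files you would pass a BufferedInputStream around the FileInputStream, so the many small reads are served from memory.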

You could use the following, which uses NIO, and especially the ByteBuffer class, for the long handling. You can of course implement it the standard Java way, but since I am an NIO fan, here is a possible solution.
The major problem in your code is that while(fis.read(data, offset, 8) != -1) reads up to 8 bytes, not always exactly 8 bytes; in addition, reading in such small portions is not very efficient.
I have put some comments in my code, if something is unclear please leave a comment. My someOperation(...) function just copies the next long value from the buffer.
Update:
added finally block to close the files.
import java.io.File;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.StandardOpenOption;
public class TestFile {
static final int IN_BUFFER_SIZE = 1024 * 8;
static final int OUT_BUFFER_SIZE = 1024 *9; // make the out-buffer > in-buffer, i am lazy and don't want to check for overruns
static final int MIN_READ_BYTES = 8;
static final int MIN_WRITE_BYTES = 8;
private File readFromFile1(File inFile) {
final File outFile = new File("file2.txt");
final ByteBuffer inBuffer = ByteBuffer.allocate(IN_BUFFER_SIZE);
final ByteBuffer outBuffer = ByteBuffer.allocate(OUT_BUFFER_SIZE);
FileChannel readChannel = null;
FileChannel writeChannel = null;
try {
// open a file channel for reading and writing
readChannel = FileChannel.open(inFile.toPath(), StandardOpenOption.READ);
writeChannel = FileChannel.open(outFile.toPath(), StandardOpenOption.CREATE, StandardOpenOption.WRITE);
long totalReadByteCount = 0L;
long totalWriteByteCount = 0L;
boolean readMore = true;
while (readMore) {
// read some bytes into the in-buffer
int readOp = 0;
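while (inBuffer.hasRemaining() && (readOp = readChannel.read(inBuffer)) != -1) { // stop once the buffer is full; read() returns 0 on a full buffer and would loop forever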
totalReadByteCount += readOp;
} // while
// prepare the in-buffer to be consumed
inBuffer.flip();
// check whether the end of the file was reached
if (readOp == -1) {
// end of file reached, read no more
readMore = false;
} // if
// now consume the in-buffer until there are at least MIN_READ_BYTES in the buffer
while (inBuffer.remaining() >= MIN_READ_BYTES) {
// add data to the write buffer
outBuffer.putLong(someOperation(inBuffer));
} // while
// compact the in-buffer and prepare for the next read, if we need to read more.
// that way the possible remaining bytes of the in-buffer can be consumed after leaving the loop
if (readMore) inBuffer.compact();
// prepare the out-buffer to be consumed
outBuffer.flip();
// write the out-buffer until the buffer is empty
while (outBuffer.hasRemaining())
totalWriteByteCount += writeChannel.write(outBuffer);
// prepare the out-buffer for writing again (clear() resets position to 0 and limit to capacity)
outBuffer.clear();
} // while
// error handling
if (inBuffer.hasRemaining()) {
System.err.println("Truncated data! Not a long value! bytes remaining: " + inBuffer.remaining());
} // if
System.out.println("read total: " + totalReadByteCount + " bytes.");
System.out.println("write total: " + totalWriteByteCount + " bytes.");
} catch (IOException e) {
System.out.println("Some error occurred while reading from File: " + e);
} finally {
if (readChannel != null) {
try {
readChannel.close();
} catch (IOException e) {
System.out.println("Could not close read channel: " + e);
} // catch
} // if
if (writeChannel != null) {
try {
writeChannel.close();
} catch (IOException e) {
System.out.println("Could not close write channel: " + e);
} // catch
} // if
} // finally
return outFile;
}
private long someOperation(ByteBuffer bb) {
// consume the buffer, do whatever you want with the buffer.
return bb.getLong(); // consumes 8 bytes of the buffer.
}
public static void main(String[] args) {
TestFile testFile = new TestFile();
File source = new File("input.txt");
testFile.readFromFile1(source);
}
}

Java match/exceed performance of readline

For my application, I had to write a custom "readline" method since I wanted to detect and preserve the newline endings in an ASCII text file. The Java readLine() method does not tell which newline sequence (\r, \n, \r\n) or EOF was encountered, so I cannot put the exact same newline sequence when writing to the modified file.
Here is the SSCCE of my test example.
public class TestLineIO {
public static java.util.ArrayList<String> readLineArrayFromFile1(java.io.File file) {
java.util.ArrayList<String> lineArray = new java.util.ArrayList<String>();
try {
java.io.BufferedReader br = new java.io.BufferedReader(new java.io.FileReader(file));
String strLine;
while ((strLine = br.readLine()) != null) {
lineArray.add(strLine);
}
br.close();
} catch (java.io.IOException e) {
System.err.println("Could not read file");
System.err.println(e);
}
lineArray.trimToSize();
return lineArray;
}
public static boolean writeLineArrayToFile1(java.util.ArrayList<String> lineArray, java.io.File file) {
try {
java.io.BufferedWriter out = new java.io.BufferedWriter(new java.io.FileWriter(file));
int size = lineArray.size();
for (int i = 0; i < size; i++) {
out.write(lineArray.get(i));
out.newLine();
}
out.close();
} catch (java.io.IOException e) {
System.err.println("Could not write file");
System.err.println(e);
return false;
}
return true;
}
public static java.util.ArrayList<String> readLineArrayFromFile2(java.io.File file) {
java.util.ArrayList<String> lineArray = new java.util.ArrayList<String>();
try {
java.io.FileInputStream stream = new java.io.FileInputStream(file);
try {
java.nio.channels.FileChannel fc = stream.getChannel();
java.nio.MappedByteBuffer bb = fc.map(java.nio.channels.FileChannel.MapMode.READ_ONLY, 0, fc.size());
char[] fileArray = java.nio.charset.Charset.defaultCharset().decode(bb).array();
if (fileArray == null || fileArray.length == 0) {
return lineArray;
}
int length = fileArray.length;
int start = 0;
int index = 0;
while (index < length) {
if (fileArray[index] == '\n') {
lineArray.add(new String(fileArray, start, index - start + 1));
start = index + 1;
} else if (fileArray[index] == '\r') {
if (index == length - 1) { //last character in the file
lineArray.add(new String(fileArray, start, length - start));
start = length;
break;
} else {
if (fileArray[index + 1] == '\n') {
lineArray.add(new String(fileArray, start, index - start + 2));
start = index + 2;
index++;
} else {
lineArray.add(new String(fileArray, start, index - start + 1));
start = index + 1;
}
}
}
index++;
}
if (start < length) {
lineArray.add(new String(fileArray, start, length - start));
}
} finally {
stream.close();
}
} catch (java.io.IOException e) {
System.err.println("Could not read file");
System.err.println(e);
e.printStackTrace();
return lineArray;
}
lineArray.trimToSize();
return lineArray;
}
public static boolean writeLineArrayToFile2(java.util.ArrayList<String> lineArray, java.io.File file) {
try {
java.io.BufferedWriter out = new java.io.BufferedWriter(new java.io.FileWriter(file));
int size = lineArray.size();
for (int i = 0; i < size; i++) {
out.write(lineArray.get(i));
}
out.close();
} catch (java.io.IOException e) {
System.err.println("Could not write file");
System.err.println(e);
return false;
}
return true;
}
public static void main(String[] args) {
System.out.println("Begin");
String fileName = "test.txt";
long start = 0;
long stop = 0;
start = java.util.Calendar.getInstance().getTimeInMillis();
java.io.File f = new java.io.File(fileName);
java.util.ArrayList<String> javaLineArray = readLineArrayFromFile1(f);
stop = java.util.Calendar.getInstance().getTimeInMillis();
System.out.println("Total time = " + (stop - start) + " ms");
java.io.File oj = new java.io.File(fileName + "_readline.txt");
writeLineArrayToFile1(javaLineArray, oj);
start = java.util.Calendar.getInstance().getTimeInMillis();
java.util.ArrayList<String> myLineArray = readLineArrayFromFile2(f);
stop = java.util.Calendar.getInstance().getTimeInMillis();
System.out.println("Total time = " + (stop - start) + " ms");
java.io.File om = new java.io.File(fileName + "_custom.txt");
writeLineArrayToFile2(myLineArray, om);
System.out.println("End");
}
}
Version 1 uses readLine(), whereas version 2 is my version, which preserves newline characters.
On a text file with about 500K lines, version1 takes about 380 ms, whereas version2 takes 1074 ms.
How can I speed-up the performance of version2?
I checked Google guava and apache-commons libraries but cannot find a suitable replacement for "readLine()" that will tell which newline character was encountered when reading a text file.

Whenever the issue regards a program's speed, the main thing you should keep in mind is that, for any continuous process within that program, the speed is nearly always limited by one of two things: CPU (processing power) or IO (memory allocation and transfer speed).
Usually either your CPU is faster than your IO, or the contrary. Because of this, your program's speed-limit is almost always dictated by one of them, and it's usually easy to know which:
A program that does a lot of calculations but makes only a few, small operations with files, is almost certainly CPU-bound.
A program that reads a lot of data from files, or writes a lot of data to them, but is not very demanding towards processing, is almost certainly IO-bound.
Improving the speed of a CPU-bound program is conceptually straightforward: it mostly comes down to achieving the same goal or effect with fewer operations.
That does not make the process easy, however. In fact, it is usually much harder to optimize CPU-bound programs than IO-bound ones, because each CPU-related operation is usually unique and has to be revised individually.
Optimizing IO-bound programs is generally easier once you have the experience, but it is less straightforward: there is a lot more to consider when dealing with IO-bound processes.
I'll be using Hard-Disk Drives (HDDs) as the basis, since the characteristics I'll mention affect HDDs the strongest (because they are mechanical), but you should keep in mind that many of the same concepts apply, to some extent, to almost every memory-storage hardware, including Solid-State Drives (SSDs) and even RAM!
These are the main performance characteristics of most memory-storage hardware:
Access time: Also known as response time, it is the time it takes before the hardware can actually transfer data.
For mechanical hardware such as HDDs, this is mostly related to the mechanical nature of the drive, in other words, its rotating disks and moving "heads". As such, the access time of mechanical drives can vary significantly from one drive to another.
For circuital hardware such as SSDs and RAM, this time is not dependent on moving parts, but rather on electrical connections, so the access time is very quick and consistent, and you shouldn't worry about it.
Seek time: The time it takes for the hardware to seek (reach) the correct position within its internal subdivisions, in order to read from or write to addresses in that section.
For mechanical drives, mainly rotary ones, the seek time measures the time it takes the head assembly on the actuator arm to travel to the track of the disk where the data will be read from or written to.
Average seek time ranges from 3 ms (~) for high-end server drives, to 15 ms (~) for mobile drives, with the most common desktop drives typically having a seek time around 9 ms (~).
With RAM and SSDs, there are no moving parts, so a measurement of the seek time is only testing the electronic circuits, and preparing a particular location on the memory in the device for the operation.
Typical SSDs will have a seek time between 0.08 and 0.16 ms (~), with RAM being even faster.
Command-Processing time: Also known as command overhead, it is the time it takes for the drive's electronics to set up the necessary communication between the various internal components, so it can read or write the data.
This is in the range of 0.003 ms (~) for both mechanical and circuital devices, and is usually ignored in benchmarks.
Settle time: It is the time it takes for the heads to settle on the target track and stop vibrating, so that they do not read or write off-track.
This amount is usually very small (typically less than 0.1 ms), and typically included in benchmarks as part of the seek time.
Data-Transfer rate: Also called throughput, it covers both the internal rate, which is how fast data moves between the disk surface and the controller on the drive, and the external rate, which is how fast data moves between the controller on the drive and an external component in the host system. It has a few sub-factors:
Media rate: Speed at which the drive can read bits from the media. In other words, the actual read/write speed.
Sector overhead: Additional time (bytes) needed for control structures and other information necessary to manage the drive, locate and validate data and perform other support functions.
Allocation speed: Similar to sector overhead, it's the time taken for the drive to determine the slots that will be written to, and to register them in its address dictionary. Only needed for write operations.
Head-Switch time: Time required to electrically switch from one head to another; Only applies to multi-head drives and is about 1 to 2 ms.
Cylinder-switch time: Time required to move to an adjacent track; The name cylinder is used because typically all the tracks of a drive with more than one head or data surface are read before moving the actuator, implying the image of a circle or cylinder rather than a track. This time is exclusive to rotary mechanical drives, and is typically about 2 to 3 ms.
This means that the main performance issues regarding IO are caused by going back-and-forth between IO and processing. This issue can be enormously diminished by using buffers, and by processing and reading/writing bigger chunks of data rather than single bytes.
As you can also see, although many of the speed characteristics are still present, RAM and SSDs do not have the same internal limits of HDDs, so their internal and external transfer rates often reach the maximum capabilities of the drive-to-host interface.
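To put a rough number on the cost of going back-and-forth between IO and processing, here is a minimal sketch that reads the same file twice: once with one read() call per byte on an unbuffered stream, and once in chunks. The class name PerByteVsChunked, the method names and the 64 KiB chunk size are my own choices for illustration, and whatever timings it prints will depend heavily on your machine and on OS caching.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class PerByteVsChunked {

    // One read() call per byte: every call goes to the underlying stream.
    static long sumPerByte(Path file) throws IOException {
        long sum = 0;
        try (InputStream in = new FileInputStream(file.toFile())) {
            int b;
            while ((b = in.read()) != -1) {
                sum += b; // read() already returns the byte as 0..255.
            }
        }
        return sum;
    }

    // One read() call per chunk; the bytes are then processed in memory.
    static long sumChunked(Path file, int chunkSize) throws IOException {
        byte[] chunk = new byte[chunkSize];
        long sum = 0;
        try (InputStream in = new FileInputStream(file.toFile())) {
            int n;
            // read() may fill only part of the array, so only use n bytes.
            while ((n = in.read(chunk)) != -1) {
                for (int i = 0; i < n; i++) {
                    sum += chunk[i] & 0xff;
                }
            }
        }
        return sum;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("io-sketch", ".bin");
        Files.write(tmp, new byte[1 << 20]); // 1 MiB of zero bytes.
        long t0 = System.nanoTime();
        long a = sumPerByte(tmp);
        long t1 = System.nanoTime();
        long b = sumChunked(tmp, 64 * 1024);
        long t2 = System.nanoTime();
        System.out.printf("per byte: %d ms, chunked: %d ms, same result: %b%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, a == b);
        Files.delete(tmp);
    }
}
```

Both methods do the same processing; only the number of trips to the stream differs.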
Chunk approach example:
This example will create a Test folder on the desktop, and generate a Test.txt file within.
The file is generated with a specified number of lines, each line containing the word "test" repeated a specified number of times (for file-size purposes). Each line is ended by "\r", "\n" or "\r\n", in rotation.
It's meaningless to save the results of each chunk in-memory cumulatively, as doing so would lead to the whole file ending up in-memory eventually, which is nearly the same problem as not using chunks to begin with.
As such, an output file is created in the same Test folder, and the result of every chunk is stored in it once that chunk is finished.
The base file is read using buffers, and those buffers are additionally used as the chunks.
The process here is simply printing a textual version of the line-separator ("\\r", "\\n" or "\\r\\n"), followed by ": ", followed by the line contents; But for the last line, "EOF" is used instead.
To actually operate with chunks, it's probably easier to manage with a class-based approach, rather than a purely function-based one.
Anyways, here goes the code:
public static void main(String[] args) throws FileNotFoundException, IOException {
File file = new File(TEST_FOLDER, "Test.txt");
//These settings create a 122 MB file.
generateTestFile(file, 500000, 50);
long clock = System.nanoTime();
processChunks(file, 8 * (int) Math.pow(1024, 2));
clock = System.nanoTime() - clock;
float millis = clock / 1000000f;
float seconds = millis / 1000f;
System.out.printf(""
+ "%12d nanos\n"
+ "%12.3f millis\n"
+ "%12.3f seconds\n",
clock, millis, seconds);
}
public static File prepareResultFile(File source) {
String ofn = source.getName(); //Original File Name.
int extPos = ofn.lastIndexOf('.'); //Extension index.
String ext = ofn.substring(extPos); //Get extension.
ofn = ofn.substring(0, extPos); //Get name without extension reusing 'ofn'.
return new File(source.getParentFile(), ofn + "_Result" + ext);
}
public static void processChunks(File file, int buffSize)
        throws FileNotFoundException, IOException {
    //No need for buffers bigger than the file itself (but never smaller than 1).
    if (file.length() < buffSize) {
        buffSize = Math.max(1, (int) file.length());
    }
    byte[] buffer = new byte[buffSize];
    StringBuilder sb = new StringBuilder();
    //True when the previous byte was '\r', even if it ended the previous chunk.
    boolean pendingCR = false;
    try (BufferedInputStream bis = new BufferedInputStream(new FileInputStream(file), buffSize);
         BufferedOutputStream bos = new BufferedOutputStream(
                 new FileOutputStream(prepareResultFile(file)), buffSize)) {
        int read;
        while ((read = bis.read(buffer)) > (-1)) {
            //Only the first 'read' bytes are valid; the rest is stale data.
            for (int i = 0; i < read; i++) {
                byte b = buffer[i];
                if (pendingCR) {
                    //This byte decides between "\r" and "\r\n".
                    pendingCR = false;
                    if (b == '\n') {
                        bos.write(("\\r\\n: " + sb + System.lineSeparator()).getBytes());
                        sb.setLength(0); //Reset accumulator.
                        continue;
                    }
                    bos.write(("\\r: " + sb + System.lineSeparator()).getBytes());
                    sb.setLength(0); //Reset accumulator.
                }
                if (b == '\r') {
                    pendingCR = true; //Wait for the next byte before deciding.
                } else if (b == '\n') {
                    bos.write(("\\n: " + sb + System.lineSeparator()).getBytes());
                    sb.setLength(0); //Reset accumulator.
                } else {
                    sb.append((char) b);
                }
            }
        }
        //A '\r' as the very last byte of the file still ends a line.
        if (pendingCR) {
            bos.write(("\\r: " + sb + System.lineSeparator()).getBytes());
            sb.setLength(0);
        }
        bos.write(("EOF: " + sb).getBytes());
    }
    System.out.println("Finished!");
}
public static boolean generateTestFile(File file, int lines, int elements)
        throws IOException {
    String[] lineBreakers = {"\r", "\n", "\r\n"};
    //try-with-resources closes the stream even when a write fails, and avoids
    //the NullPointerException a manual close() would hit if opening failed.
    try (BufferedOutputStream bos = new BufferedOutputStream(new FileOutputStream(file))) {
        for (int i = 0; i < lines; i++) {
            for (int ii = 1; ii < elements; ii++) {
                bos.write("test ".getBytes());
            }
            bos.write("test".getBytes());
            bos.write(lineBreakers[i % 3].getBytes());
        }
        bos.flush();
        System.out.printf("LOG: Test file \"%s\" created.\n", file.getName());
        return true;
    } catch (IOException ex) {
        System.err.println("ERR: Could not write file.");
        throw ex;
    }
}
I don't know what IDE you are using, but if it's NetBeans, make a memory-profile of your code and compare to a profile of this one. You should notice a big difference in the amount of memory needed during processing.
Here, the chunk approach's memory usage, which includes not only the chunk itself but also the program's own variables and structures, does not go over 40 MB even though we are dealing with a file bigger than 100 MB.
It also spends very little time in GC, mostly less than 5% at any given point.

The second version doesn't seem to use BufferedReader or another form of buffer. That might be the cause of the slowdown.
Since you seem to read the whole file into memory, you can perhaps read it as one big string (with a buffer) and then parse it in memory to analyze the line endings.
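A minimal sketch of that idea, assuming the file fits comfortably in memory and is UTF-8 (both are assumptions on my part): read everything at once, then walk the string and tally each kind of line ending. LineEndingTally and its methods are hypothetical names.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class LineEndingTally {

    // Returns {count of "\r\n", count of lone "\r", count of lone "\n"}.
    static int[] tally(String text) {
        int crlf = 0;
        int cr = 0;
        int lf = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\r') {
                if (i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    crlf++;
                    i++; // Consume the '\n' that belongs to this "\r\n".
                } else {
                    cr++;
                }
            } else if (c == '\n') {
                lf++;
            }
        }
        return new int[] {crlf, cr, lf};
    }

    // Reads the whole file in one go; the charset is an assumption.
    static int[] tally(Path file) throws IOException {
        byte[] bytes = Files.readAllBytes(file);
        return tally(new String(bytes, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        int[] t = tally("one\rtwo\nthree\r\nfour");
        System.out.printf("CRLF=%d CR=%d LF=%d%n", t[0], t[1], t[2]);
    }
}
```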

You are doubling the out statements (one for the line and one for the newline).
Can you try the following instead (use System.lineSeparator() to get the line separator and append it before writing):
out.write(lineArray.get(i)+System.lineSeparator());

Don't reinvent the wheel.
Check the BufferedReader#readLine() code
Copy, paste, and make the changes you need to keep the line separator inside the line
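For what it's worth, here is roughly what such a variant could look like. This is not the JDK's readLine() code, just a small sketch written from scratch: it wraps a BufferedReader and returns each line with its terminator still attached, so the caller can check whether it ends in "\r", "\n" or "\r\n". The class and method names are made up for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class SeparatorKeepingReader {
    private final BufferedReader in;

    SeparatorKeepingReader(Reader source) {
        this.in = new BufferedReader(source);
    }

    // Returns the next line with its terminator ("\r", "\n" or "\r\n") still
    // attached, or null at end of input. The last line may have no terminator.
    String readLineKeepingSeparator() throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            sb.append((char) c);
            if (c == '\n') {
                break;
            }
            if (c == '\r') {
                // Peek one char ahead to see whether this is "\r\n".
                in.mark(1);
                int next = in.read();
                if (next == '\n') {
                    sb.append('\n');
                } else if (next != -1) {
                    in.reset(); // Not part of the terminator; put it back.
                }
                break;
            }
        }
        return sb.length() == 0 ? null : sb.toString();
    }

    public static void main(String[] args) throws IOException {
        SeparatorKeepingReader r =
                new SeparatorKeepingReader(new StringReader("one\rtwo\nthree\r\nfour"));
        String line;
        while ((line = r.readLineKeepingSeparator()) != null) {
            // Escape the terminator so it is visible when printed.
            System.out.println(line.replace("\r", "\\r").replace("\n", "\\n"));
        }
    }
}
```

Reading one char at a time from a BufferedReader is still buffered underneath, so this should not fall back to per-character disk access.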

Fastest way to write an array of integers to a file in Java?

As the title says, I'm looking for the fastest possible way to write integer arrays to files. The arrays will vary in size, and will realistically contain anywhere between 2500 and 25 000 000 ints.
Here's the code I'm presently using:
DataOutputStream writer = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(filename)));
for (int d : data)
writer.writeInt(d);
Given that DataOutputStream has a method for writing arrays of bytes, I've tried converting the int array to a byte array like this:
private static byte[] integersToBytes(int[] values) throws IOException {
ByteArrayOutputStream baos = new ByteArrayOutputStream();
DataOutputStream dos = new DataOutputStream(baos);
for (int i = 0; i < values.length; ++i) {
dos.writeInt(values[i]);
}
return baos.toByteArray();
}
and like this:
private static byte[] integersToBytes2(int[] src) {
int srcLength = src.length;
byte[] dst = new byte[srcLength << 2];
for (int i = 0; i < srcLength; i++) {
int x = src[i];
int j = i << 2;
dst[j++] = (byte) ((x >>> 0) & 0xff);
dst[j++] = (byte) ((x >>> 8) & 0xff);
dst[j++] = (byte) ((x >>> 16) & 0xff);
dst[j++] = (byte) ((x >>> 24) & 0xff);
}
return dst;
}
Both seem to give a minor speed increase, about 5%. I've not tested them rigorously enough to confirm that.
Are there any techniques that will speed up this file write operation, or relevant guides to best practice for Java IO write performance?

I had a look at three options:
Using DataOutputStream;
Using ObjectOutputStream (for Serializable objects, which int[] is); and
Using FileChannel.
The results are
DataOutputStream wrote 1,000,000 ints in 3,159.716 ms
ObjectOutputStream wrote 1,000,000 ints in 295.602 ms
FileChannel wrote 1,000,000 ints in 110.094 ms
So the NIO version is the fastest. It also has the advantage of allowing edits, meaning you can easily change one int whereas the ObjectOutputStream would require reading the entire array, modifying it and writing it out to file.
Code follows:
private static final int NUM_INTS = 1000000;
interface IntWriter {
void write(int[] ints);
}
public static void main(String[] args) {
int[] ints = new int[NUM_INTS];
Random r = new Random();
for (int i=0; i<NUM_INTS; i++) {
ints[i] = r.nextInt();
}
time("DataOutputStream", new IntWriter() {
public void write(int[] ints) {
storeDO(ints);
}
}, ints);
time("ObjectOutputStream", new IntWriter() {
public void write(int[] ints) {
storeOO(ints);
}
}, ints);
time("FileChannel", new IntWriter() {
public void write(int[] ints) {
storeFC(ints);
}
}, ints);
}
private static void time(String name, IntWriter writer, int[] ints) {
long start = System.nanoTime();
writer.write(ints);
long end = System.nanoTime();
double ms = (end - start) / 1000000d;
System.out.printf("%s wrote %,d ints in %,.3f ms%n", name, ints.length, ms);
}
private static void storeOO(int[] ints) {
ObjectOutputStream out = null;
try {
out = new ObjectOutputStream(new FileOutputStream("object.out"));
out.writeObject(ints);
} catch (IOException e) {
throw new RuntimeException(e);
} finally {
safeClose(out);
}
}
private static void storeDO(int[] ints) {
    DataOutputStream out = null;
    try {
        out = new DataOutputStream(new FileOutputStream("data.out"));
        for (int anInt : ints) {
            //writeInt() emits all four bytes; write(int) would emit only the low byte.
            out.writeInt(anInt);
        }
    } catch (IOException e) {
        throw new RuntimeException(e);
    } finally {
        safeClose(out);
    }
}
private static void storeFC(int[] ints) {
    //A channel obtained from a FileOutputStream is write-only, and mapping it
    //READ_WRITE throws NonReadableChannelException, so open the file as "rw".
    try (RandomAccessFile raf = new RandomAccessFile("fc.out", "rw");
         FileChannel file = raf.getChannel()) {
        ByteBuffer buf = file.map(FileChannel.MapMode.READ_WRITE, 0, 4L * ints.length);
        for (int i : ints) {
            buf.putInt(i);
        }
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}
private static void safeClose(OutputStream out) {
try {
if (out != null) {
out.close();
}
} catch (IOException e) {
// do nothing
}
}
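The point about edits can be sketched like this: because every int occupies exactly four bytes at a known offset, a single value can be overwritten in place without touching the rest of the file. This is only an illustration under the assumption that the file holds big-endian ints back to back (the default ByteBuffer order, which is what the FileChannel code above produces); InPlaceIntEdit, setInt and getInt are hypothetical names.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class InPlaceIntEdit {

    // Overwrites the int stored at the given index, leaving the rest untouched.
    static void setInt(Path file, long index, int value) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
            ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES);
            buf.putInt(value);
            buf.flip();
            long pos = index * Integer.BYTES;
            // A positional write may be partial, so loop until all bytes are out.
            while (buf.hasRemaining()) {
                pos += ch.write(buf, pos);
            }
        }
    }

    // Reads back the int stored at the given index.
    static int getInt(Path file, long index) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES);
            long pos = index * Integer.BYTES;
            while (buf.hasRemaining()) {
                int n = ch.read(buf, pos);
                if (n < 0) {
                    throw new EOFException("index " + index + " is past the end of the file");
                }
                pos += n;
            }
            buf.flip();
            return buf.getInt();
        }
    }
}
```

With the ObjectOutputStream format the same change means deserializing the whole array, modifying it and writing it all out again.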

I would use FileChannel from the nio package and ByteBuffer. This approach seems (on my computer) to give 2 to 4 times better write performance:
Output from program:
normal time: 2555
faster time: 765
This is the program:
public class Test {
public static void main(String[] args) throws IOException {
// create a test buffer
ByteBuffer buffer = createBuffer();
long start = System.currentTimeMillis();
{
// do the first test (the normal way of writing files)
normalToFile(new File("first"), buffer.asIntBuffer());
}
long middle = System.currentTimeMillis();
{
// use the faster nio stuff
fasterToFile(new File("second"), buffer);
}
long done = System.currentTimeMillis();
// print the result
System.out.println("normal time: " + (middle - start));
System.out.println("faster time: " + (done - middle));
}
private static void fasterToFile(File file, ByteBuffer buffer)
throws IOException {
FileChannel fc = null;
try {
fc = new FileOutputStream(file).getChannel();
fc.write(buffer);
} finally {
if (fc != null)
fc.close();
buffer.rewind();
}
}
private static void normalToFile(File file, IntBuffer buffer)
throws IOException {
DataOutputStream writer = null;
try {
writer =
new DataOutputStream(new BufferedOutputStream(
new FileOutputStream(file)));
while (buffer.hasRemaining())
writer.writeInt(buffer.get());
} finally {
if (writer != null)
writer.close();
buffer.rewind();
}
}
private static ByteBuffer createBuffer() {
ByteBuffer buffer = ByteBuffer.allocate(4 * 25000000);
Random r = new Random(1);
while (buffer.hasRemaining())
buffer.putInt(r.nextInt());
buffer.rewind();
return buffer;
}
}
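Since the original question was about reading as well, here is a possible read-side counterpart, again only as a sketch: it pulls the whole file into a heap ByteBuffer through a FileChannel and copies it into an int[] via an IntBuffer view. It assumes big-endian ints (the ByteBuffer default, matching both write methods above) and a file smaller than 2 GB; ChannelIntReader is a made-up name.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class ChannelIntReader {

    static int[] readAll(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            if (size % Integer.BYTES != 0 || size > Integer.MAX_VALUE) {
                throw new IOException("unexpected file size: " + size);
            }
            ByteBuffer bytes = ByteBuffer.allocate((int) size);
            // A single read() is not guaranteed to fill the buffer.
            while (bytes.hasRemaining()) {
                if (ch.read(bytes) < 0) {
                    throw new EOFException("file shrank while reading");
                }
            }
            bytes.flip();
            int[] out = new int[bytes.remaining() / Integer.BYTES];
            // The int view decodes four bytes per element in bulk.
            bytes.asIntBuffer().get(out);
            return out;
        }
    }
}
```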

Benchmarks should be repeated every once in a while, shouldn't they?
:) After fixing some bugs and adding my own writing variant, here are
the results I get when running the benchmark on an ASUS ZenBook UX305
running Windows 10 (times given in seconds):
Running tests...             0      1      2
Buffered DataOutputStream    8,14   8,46   8,30
FileChannel alt2             1,55   1,18   1,12
ObjectOutputStream           9,60  10,41  11,68
FileChannel                  1,49   1,20   1,21
FileChannel alt              5,49   4,58   4,66
And here are the results running on the same computer but with Arch
Linux and the order of the write methods switched:
Running tests...             0      1      2
Buffered DataOutputStream   31,16   6,29   7,26
FileChannel                  1,07   0,83   0,82
FileChannel alt2             1,25   1,71   1,42
ObjectOutputStream           3,47   5,39   4,40
FileChannel alt              2,70   3,27   3,46
Each test wrote an 800 MB file. The unbuffered DataOutputStream took
way too long so I excluded it from the benchmark.
As seen, writing using a file channel still beats the crap out of all
other methods, but it matters a lot whether the byte buffer is
memory-mapped or not. Without memory-mapping the file channel write
took 3-5 seconds:
var bb = ByteBuffer.allocate(4 * ints.length);
for (int i : ints)
bb.putInt(i);
bb.flip();
try (var fc = new FileOutputStream("fcalt.out").getChannel()) {
fc.write(bb);
}
With memory-mapping, the time was reduced to between 0.8 and 1.5
seconds:
try (var fc = new RandomAccessFile("fcalt2.out", "rw").getChannel()) {
var bb = fc.map(READ_WRITE, 0, 4 * ints.length);
bb.asIntBuffer().put(ints);
}
But note that the results are order-dependent, especially so on
Linux. It appears that the memory-mapped methods don't write the
data in full but rather offload the job request to the OS and return
before it is completed. Whether that behaviour is desirable or not
depends on the situation.
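If you do need the data to be on its way to the storage device before the method returns, MappedByteBuffer has a force() method for that. The sketch below is my own addition and not part of the benchmark above, so I can't say how it would change the timings; MappedWriteWithForce is a hypothetical name, and the single mapping limits it to files under 2 GB.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

final class MappedWriteWithForce {

    static void write(String fileName, int[] ints) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(fileName, "rw");
             FileChannel fc = raf.getChannel()) {
            long size = (long) Integer.BYTES * ints.length;
            // Set the length explicitly so an older, longer file is truncated.
            raf.setLength(size);
            MappedByteBuffer bb = fc.map(FileChannel.MapMode.READ_WRITE, 0, size);
            bb.asIntBuffer().put(ints);
            // Ask the OS to write the mapped region out instead of
            // leaving the dirty pages to be flushed at some later point.
            bb.force();
        }
    }

    public static void main(String[] args) throws IOException {
        write("forced.out", new int[] {1, 2, 3});
    }
}
```

For a fair comparison with the stream-based methods, the time spent in force() probably belongs inside the measured interval.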
Memory-mapping can also lead to OutOfMemory problems, so it is not
always the right tool to use. See: Prevent OutOfMemory when using
java.nio.MappedByteBuffer.
Here is my version of the benchmark code:
https://gist.github.com/bjourne/53b7eabc6edea27ffb042e7816b7830b

I think you should consider using file channels (the java.nio library) instead of plain streams (java.io). A good starting point is this interesting discussion: Java NIO FileChannel versus FileOutputstream performance / usefulness
and the relevant comments below.
Cheers!

The main improvements you can make when writing an int[] are to either:
increase the buffer size. The default size is right for most streams, but file access can be faster with a larger buffer. This could yield a 10-20% improvement.
use NIO and a direct buffer. This allows you to write 32-bit values without converting to bytes. This may yield a 5% improvement.
BTW: You should be able to write at least 10 million int values per second. With disk caching you can increase this to 200 million per second.
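A sketch of the second option, under my own assumptions: one reusable direct buffer, filled through an IntBuffer view so no per-int byte shuffling happens in Java code, and drained to a FileChannel chunk by chunk. DirectBufferIntWriter and the intsPerChunk parameter are illustrative names; the chunk size is the knob to experiment with, and the percentages above are the answer's estimates, not something this code measures.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.IntBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class DirectBufferIntWriter {

    static void write(Path file, int[] data, int intsPerChunk) throws IOException {
        // Direct buffers live outside the Java heap, so the channel can
        // hand them to the OS without an extra copy.
        ByteBuffer bytes = ByteBuffer.allocateDirect(intsPerChunk * Integer.BYTES);
        IntBuffer ints = bytes.asIntBuffer(); // Big-endian by default.
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            for (int off = 0; off < data.length; off += intsPerChunk) {
                int n = Math.min(intsPerChunk, data.length - off);
                ints.clear();
                ints.put(data, off, n);
                // The int view has its own position, so the byte window
                // for this chunk has to be set by hand.
                bytes.clear();
                bytes.limit(n * Integer.BYTES);
                while (bytes.hasRemaining()) {
                    ch.write(bytes);
                }
            }
        }
    }
}
```

Calling bytes.order(ByteOrder.nativeOrder()) before creating the view may be a little faster on little-endian machines, but then the file can no longer be read back with DataInputStream.readInt().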

An int[] is Serializable - can't you just wrap the stream in an ObjectOutputStream and call writeObject(data)? (DataOutputStream itself has no writeObject method.) That's definitely going to be faster than individual writeInt calls.
If you have other requirements on the output data format than retrieval into int[], that's a different question.
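For completeness, a minimal round trip with Java serialization might look like the sketch below. The class name and the buffered wrappers are my own choices; whether this beats a buffered DataOutputStream is something to measure rather than assume.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class IntArraySerializationSketch {

    // Writes the whole array as one serialized object.
    static void save(Path file, int[] data) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            out.writeObject(data);
        }
    }

    // Reads the array back; the cast fails if the file holds something else.
    static int[] load(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            return (int[]) in.readObject();
        }
    }
}
```

The resulting file is in Java's serialization format, so it is only convenient to read back from Java.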
