How can I make this piece of code extremely quick?
It reads a raw image using a RandomAccessFile (in) and writes the values to a file using a DataOutputStream (out):
final int WORD_SIZE = 4;
byte[] singleValue = new byte[WORD_SIZE];
long position = 0;
for (int i = 1; i <= 100000; i++)
{
    out.writeBytes(i + " ");
    for (int j = 1; j <= 17; j++)
    {
        in.seek(position);
        in.read(singleValue);
        String str = Integer.toString(ByteBuffer.wrap(singleValue).order(ByteOrder.LITTLE_ENDIAN).getInt());
        out.writeBytes(str + " ");
        position += WORD_SIZE;
    }
    out.writeBytes("\n");
}
The inner loop writes 17 values per line of the output file.
Thanks
I assume you are asking because this code runs really slowly. If so, one reason is that each seek and read call makes a system call of its own: a RandomAccessFile has no buffering.
So the way to make this go faster is to step back and think about what the code is actually doing. If I understand it correctly, it is reading every 4th byte of the file, converting the values to decimal and writing them out as text, 17 to a line. You could do that easily with a BufferedInputStream, like this:
int b = bis.read(); // read a byte
bis.skip(3); // skip 3 bytes.
(with a bit of error checking ....). If you use a BufferedInputStream like this, most of the read and skip calls will operate on data that has already been buffered, and the number of syscalls will reduce to 1 for every N bytes, where N is the buffer size.
UPDATE - my guess was wrong. You are actually reading alternate words, so ...
bis.read(singleValue);
bis.skip(4);
Every 100000 offsets I have to jump 200000 and then do it again till the end of the file.
Use bis.skip(800000) to do that. It should do a big skip by moving the file position without actually reading any data. One syscall at most. (For a FileInputStream, at least.)
You can also speed up the output side by a roughly equivalent amount by wrapping the DataOutputStream around a BufferedOutputStream.
But System.out is already buffered.
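Putting those pieces together (and note that out here writes to a file, not to System.out), a buffered version of the whole loop might look roughly like this. It is only a sketch: rawFile and outFile are assumed names, the "read one word, skip one word, 17 values per line" layout is taken from the comments above, and it needs the java.io.*, java.nio.ByteBuffer and java.nio.ByteOrder imports.
try (DataInputStream in = new DataInputStream(
         new BufferedInputStream(new FileInputStream(rawFile)));
     DataOutputStream out = new DataOutputStream(
         new BufferedOutputStream(new FileOutputStream(outFile)))) {

    byte[] word = new byte[4];
    for (int i = 1; i <= 100000; i++) {
        out.writeBytes(i + " ");
        for (int j = 1; j <= 17; j++) {
            in.readFully(word); // the word we want
            int value = ByteBuffer.wrap(word).order(ByteOrder.LITTLE_ENDIAN).getInt();
            out.writeBytes(value + " ");
            in.skipBytes(4);    // the word we skip
        }
        out.writeBytes("\n");
    }
}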
I am currently looking to write to multiple files simultaneously. The files will hold about 17 million lines of integers.
Currently, I am opening 5 files that can be written to (some will remain empty), and then I perform shifting calculations to get a multiplier for each integer and to decide which file to write to.
My code looks like:
//Make files directory
File tempDir = new File("temp/test/txtFiles");
tempDir.mkdirs();
List<File> files = new ArrayList<>(); //Will hold the Files
List<FileWriter> writers = new ArrayList<>(); //Will hold fileWriter objects for all of the files
File currTxtFile; //Used to create the files
//Create the files
//Unused files will be blank
for(int f = 0; f < 5; f++)
{
    currTxtFile = new File(tempDir, "file" + f + ".txt");
    currTxtFile.createNewFile();
    files.add(currTxtFile);
    FileWriter fw = new FileWriter(currTxtFile);
    writers.add(fw);
}
int[] multipliers = new int[5]; //will be used to calculate what to write to file
int[] fileNums = new int[5]; //will be used to know which file to write to
int start = 0;
/**
An example of fileNums output would be {0,4,0,1,4}
(i.e. write to file 0, then 4, then 0, then 1, then 4)
An example of multipliers output would be {100,10,5,1,2000}
(i.e. the value uses 100 for file 0, then 10 for file 4, then 5 for file 0, then 1 for file 1, then 2000 for file 4)
*/
for(long c = 0; c < 16980000; c++)
{
    //Gets values for the multipliers and fileNums
    int numOfMultipliers = getMultiplier(start, multipliers, fileNums);
    for(int j = 0; j < numOfMultipliers; j++) // numOfMultipliers can range from 0-4
    {
        int val = 30000000 * multipliers[j] + 20000000;
        writers.get(fileNums[j]).append(val + "\n");
    }
    start++;
}
for(FileWriter f : writers)
{
    f.close();
}
The code is currently taking quite a while (over 5 hours) to write the files. This code was translated from C++, where the files were written in about 10 minutes.
How could I improve upon the code to get the output to write quicker?
Most likely a buffering/flushing issue. In general, writing to multiple files is slower than writing to a single file, not faster. Think about it: a spinning disk doesn't have 5 separate write heads inside it. There's just the one, so writing to a spinning disk is fundamentally 'single threaded' - trying to write to multiple files simultaneously is in fact orders of magnitude slower, because the write head has to bounce around.
With modern SSDs it matters far less, but there is still only one bottleneck. There's nothing inherent in SSD design (it doesn't have multiple pipelines or a bunch of CPUs dedicated to incoming writes) that would make writing to several files simultaneously any faster.
If the files exist each on a different volume, that's a different story, but from your code that's clearly not the case.
Thus, let's first get rid of this whole 'multiple files' thing. That either doesn't do anything, or makes things (significantly) slower.
So why is it slow in Java?
Because of how disks process writes in blocks. You first need to know how disks work.
SSDs
The flash memory in an SSD can't be overwritten in place. Instead, an entire block has to be wiped clean, and only then can it be written to. That's the only way an SSD can store data: obliterate an entire block, then write data into it.
If a single block is 64k and your code writes one integer at a time, each integer is about 10 bytes a pop. Your SSD will obliterate a block, write one integer and a newline to it, plus a lot of pointless padding (it writes in blocks; it can't write anything smaller, that's just how it works), and then do the exact same thing roughly 6,400 more times.
Instead, you'd want the SSD to wipe that block once and write 6,400 integers into it in one go. The reason it doesn't work that way out of the box is that people trip over power cables. If you pull some bills out of an ATM and the machine crashes while the last couple of transactions are still sitting in memory, waiting for a full block's worth of data before anything actually hits the disk - oh dear. The bank is not going to stand for that. So if you WANT to flush that stuff to disk, the system will dutifully execute, however small the write.
Spinning disks
The write head needs to move to the right track and wait for the right sector to spin around before it can write. CPUs are really fast, but the disk keeps spinning and can't stop on a dime, so in the very short time it takes the Java code to supply the next integer, the disk has spun past the write point and has to wait one full revolution. It is much better to send a much larger chunk of data to the disk controller so it can write it all in 'one spin', so to speak.
So how do I do that?
Simple. Use a BufferedWriter. It does exactly what you want: it buffers data for quite a while and only actually writes when it's convenient, when you explicitly ask for it (call .flush() on it), or when you close the writer. The downside is that if someone trips over a power cable, your buffered data is gone - but presumably you don't mind: half of such a file is useless no matter how much of it made it to disk. Incomplete = useless.
Can it be faster?
Certainly. You're storing a number like '123456789' in at least 10 bytes of text, and the CPU has to convert it into the character sequence [0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, 0x38, 0x39, 0x0A]. It is much more efficient to store the bytes of the int exactly as they are in memory - that only takes 4 bytes, and no conversion is needed, or at least a much simpler one. The downside is that you won't be able to make any sense of the file unless you use a hex editor.
Example code - write integers in text form
Let's not use obsolete APIs.
Let's properly close resources.
Let's ditch this pointless 'multiple files' thing.
Path tgt = Paths.get("temp/test/txtFiles/out.txt");
try (var out = Files.newBufferedWriter(tgt)) {
    for (long c = 0; c < 16980000; c++) {
        // Gets values for the multipliers and fileNums
        int numOfMultipliers = getMultiplier(start, multipliers, fileNums);
        for (int j = 0; j < numOfMultipliers; j++) { // numOfMultipliers can range from 0-4
            int val = 30000000 * multipliers[j] + 20000000;
            out.write(val + "\n");
        }
        start++;
    }
}
Example code - write ints directly
Path tgt = Paths.get("temp/test/txtFiles/out.txt");
try (var out = new DataOutputStream(
        new BufferedOutputStream(
            Files.newOutputStream(tgt)))) {
    for (long c = 0; c < 16980000; c++) {
        // Gets values for the multipliers and fileNums
        int numOfMultipliers = getMultiplier(start, multipliers, fileNums);
        for (int j = 0; j < numOfMultipliers; j++) { // numOfMultipliers can range from 0-4
            int val = 30000000 * multipliers[j] + 20000000;
            out.writeInt(val);
        }
        start++;
    }
}
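To read such a file back later (just a sketch, reusing the same out.txt path as above), DataInputStream.readInt consumes the 4 big-endian bytes that writeInt produced:
Path src = Paths.get("temp/test/txtFiles/out.txt");
try (var in = new DataInputStream(
        new BufferedInputStream(
            Files.newInputStream(src)))) {
    long count = Files.size(src) / Integer.BYTES; // number of ints in the file
    for (long i = 0; i < count; i++) {
        int val = in.readInt(); // 4 bytes, big-endian, exactly as writeInt wrote them
        // ... use val ...
    }
}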
If I have a given .dat file which I'm trying to read, how can I count the number of 32-bit integers in it? I'm getting 2 different answers using 2 different methods.
First method:
int size = 0;
try (DataInputStream in = new DataInputStream(
        new BufferedInputStream(new FileInputStream(file)))) {
    while (in.skipBytes(4) == 4) { // stop once fewer than 4 bytes remain
        size += 1;
    }
} catch (Exception ex) {
    System.out.println(ex);
}
System.out.println(size);
Second method:
File fileRead = new File(file);
long ret = fileRead.length() / 4;
The first method seems like it should be the most accurate, since I'm skipping 4 bytes at a time to count the integers packed sequentially in the file. However, the second method just gives me the raw file size divided by 4, which is not the same. I think it might be including extra file-related data that isn't part of the content.
The first method is good but it is very inefficient for large files. Any idea how I can speed things up and get the number of integers efficiently?
If you want to know how many times you can read a 32-bit integer from a plain binary file, Method 2 gives the right answer: the file length divided by 4.
Note that DataInputStream.readInt simply reads 4 raw bytes, so it adds no per-value overhead. What you must not do is read the file through an ObjectInputStream unless you are certain it was written by an ObjectOutputStream: a serialized object stream is not a plain binary file, and it does contain extra header data alongside every object written.
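A minimal sketch of Method 2 using NIO (the values.dat file name is just a placeholder):
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CountInts {
    public static void main(String[] args) throws IOException {
        Path data = Paths.get("values.dat"); // placeholder file name

        // One metadata lookup instead of scanning the whole file.
        long count = Files.size(data) / Integer.BYTES; // Integer.BYTES == 4

        System.out.println(count + " 32-bit integers");
    }
}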
As a homework assignment we are supposed to read in a .pgm file, draw a square in it by changing the pixel values, and then output the new image.
After I go through and change the pixels, I print them all to a .txt file as a way to check that they actually got changed. The part I'm having trouble with is writing the new file. I know it's supposed to be binary, so after googling I think I should be using DataOutputStream, but I could be wrong. After I write the file its size is 1.9MB, whereas the original is only 480KB, so right off the bat I suspect something must be wrong. Any advice or tips for writing .pgm files would be great!
public static void writeImage(String fileName) throws IOException{
    DataOutputStream writeFile = new DataOutputStream(new FileOutputStream(fileName));
    // Write the .pgm header (P5, 800 600, 250)
    writeFile.writeUTF(type + "\n");
    writeFile.writeUTF(width + " " + height + "\n");
    writeFile.writeUTF(max + "\n");
    for(int i = 0; i < height; i++){
        for(int j = 0; j < width; j++){
            writeFile.writeByte(img[i][j]); //Write the number
            writeFile.writeUTF(" "); //Add white space
        }
        writeFile.writeUTF(" \n"); //finished one line so drop to next
    }
    writeFile.close();
}
When I try to open the new file I get an error message saying "illegal image format", while the original file opens properly.
Check carefully whether writeByte is doing what you expect for each pixel: writeByte writes a single byte even if the argument is of type int (only the eight low-order bits of the argument are written).
You need to read the file format specification carefully and make sure that you are writing out the right number of bytes. A hex editor can help a lot.
I think you are somehow confusing the binary (P5) and ASCII (P2) modes of the PGM format.
The ASCII version has spaces between each pixel and (optional) line feeds, like you have in your code.
For the binary format, you should just write the pixel values (as bytes, as you have max value 250). No spaces or line feeds.
(I won't write the code for you, since this is an assignment, but you are almost there, so I'm sure you'll make it! :-)
PS: Also carefully read the documentation on DataOutputStream.writeUTF(...):
First, two bytes are written to the output stream as if by the writeShort method giving the number of bytes to follow.
Are you sure this is what you want? Keep in mind that the PGM format headers are all ASCII, so there's really no need to use UTF here.
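To make that concrete (this only covers the header, not the pixel loop; type, width, height and max are the fields from your method, and it needs an import of java.nio.charset.StandardCharsets), writing the header as plain ASCII bytes avoids writeUTF's 2-byte length prefix entirely:
String header = type + "\n" + width + " " + height + "\n" + max + "\n";
writeFile.write(header.getBytes(StandardCharsets.US_ASCII)); // plain ASCII bytes, no length prefix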
First, some background.
It's not needed to answer the actual question, but maybe it'll help put things in perspective.
I have written an mp3 library in Java which reads the information stored in the ID3 tag of an .mp3 file. Information about the song, such as the name of the song, the CD the song was released on, the track number, etc., is stored in this ID3 tag right at the beginning of the .mp3 file.
I have tested the library on 12,579 mp3 files which are located on my local hard drive, and it works flawlessly. Not a single IO error.
When I perform the same thing where the mp3 files are located on a web server, I get an IO error. Well, not actually an error - it's really a difference in the behavior of InputStream's read(byte[]) method.
The example below will illustrate the problem, which occurs when I'm trying to read an image file (.jpg, .gif, .png, etc) from the mp3 file.
// read bytes from an .mp3 file on your local hard drive
// reading from an input stream created this way works flawlessly
InputStream inputStream = new FileInputStream("song.mp3");
// read bytes from an .mp3 file given by a url
// reading from an input stream created this way fails every time.
URL url = new URL("http://localhost/song.mp3");
HttpURLConnection httpConnection = (HttpURLConnection) url.openConnection();
httpConnection.connect();
InputStream inputStream = httpConnection.getInputStream();
int size = 25000; // size of the image file
byte[] buffer = new byte[size];
int numBytesRead = inputStream.read(buffer);
if (numBytesRead != buffer.length)
throw new IOException("Error reading the bytes into the buffer. Expected " + buffer.length + " bytes but got " + numBytesRead + " bytes");
So, my observation is:
Calling inputStream.read(buffer); always reads the entire number of bytes when the input stream is a FileInputStream. But it only reads a partial amount when I am using an input stream obtained from an http connection.
And hence my question is:
In general, can I not assume that the InputStream's read(byte[]) method will block until the entire number of bytes has been read (or EOF is reached)?
That is, have I assumed behavior that is not true of the read(byte[]) method, and I've just gotten lucky working with FileInputStream?
Is the correct, and general behavior of InputStream.read(byte[]) that I need to put the call in a loop and keep reading bytes until the desired number of bytes have been read, or EOF has been reached? Something like the code below:
int size = 25000;
byte[] buffer = new byte[size];
int numBytesRead = 0;
int totalBytesRead = 0;
while (totalBytesRead != size && numBytesRead != -1)
{
    numBytesRead = inputStream.read(buffer);
    totalBytesRead += numBytesRead;
}
Your conclusions are sound; take a look at the documentation for InputStream.read(byte[]):
Reads some number of bytes from the input stream and stores them into the buffer array b. The number of bytes actually read is returned as an integer. This method blocks until input data is available, end of file is detected, or an exception is thrown.
There is no guarantee that read(byte[]) will fill the array you have provided, only that it will either read at least 1 byte (provided your array's length is > 0), or it will return -1 to signal the EOS. This means that if you want to read bytes from an InputStream correctly, you must use a loop.
The loop you currently have has one bug in it. On the first iteration of the loop, you will read a certain number of bytes into your buffer, but on the second iteration you will overwrite some, or all, of those bytes. Take a look at InputStream.read(byte[], int, int).
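For illustration, here is a corrected version of that loop using the three-argument read, so each pass appends after the bytes already read (inputStream and the 25000-byte size are the ones from the question):
int size = 25000;
byte[] buffer = new byte[size];
int totalBytesRead = 0;
while (totalBytesRead < size) {
    int numBytesRead = inputStream.read(buffer, totalBytesRead, size - totalBytesRead);
    if (numBytesRead == -1) {
        break; // EOF reached before the buffer was filled
    }
    totalBytesRead += numBytesRead;
}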
And hence my question is: In general, can I not assume that the InputStream's read(byte[]) method will block until the entire number of bytes has been read (or EOF is reached)?
No. That's why the documentation says "The number of bytes actually read" and "there is an attempt to read at least one byte."
I need to put the call in a loop and keep reading bytes until the desired number of bytes have been read
Rather than reinvent the wheel, you can get an already-tested wheel at Jakarta Commons IO.
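For example (assuming Commons IO is on the classpath), IOUtils.readFully does this read-until-full loop for you and throws an EOFException if the stream ends early:
import org.apache.commons.io.IOUtils;

byte[] buffer = new byte[25000];
IOUtils.readFully(inputStream, buffer); // loops internally until the buffer is full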
How can I read a certain number of elements (characters, specifically) at a time in Java? It's a little difficult to explain, but my idea is this:
If I have a text file that contains:
This is a text file named text.txt
I want to be able to have a String or a character array of a certain length that slides through the file. So if I specified the length to be 3, on the first iteration the char array would contain [T,h,i]; after iterating once more it would become [h,i,s], then [i,s, ], and so on.
I have tried using the BufferedReader.read(char[], off, len) method, which reads a certain number of characters at a time from the file, but performance is important for what I'm trying to do.
Is there any method to achieve this in Java? I've tried using BufferedReader but I'm not too familiar with it to fully utilize it.
You'll actually get the best I/O performance by buffering both the input stream and the reader. (Buffering just one gives most of the improvement; double buffering is only a bit better, but it is better.) Here's sample code to read a file a chunk at a time:
final int CHUNK_SIZE = 3;
final int BUFFER_SIZE = 8192; // explicit buffer size is better
File file = ...
InputStream is = new BufferedInputStream(new FileInputStream(file), BUFFER_SIZE);
Reader rdr = new BufferedReader(new InputStreamReader(is), BUFFER_SIZE);
char[] buff = new char[CHUNK_SIZE];
int len;
while ((len = rdr.read(buff)) != -1) {
    // buff[0] through buff[len-1] are valid
}
rdr.close();
This, of course, is missing all sorts of error checking, exception handling, etc., but it shows the basic idea of buffering streams and readers. You may also want to specify a character encoding when constructing the InputStreamReader. (You could bypass dealing with input streams by using a FileReader to start with, but then you cannot specify a character set encoding and lose the slight performance boost that comes from double buffering.)
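For instance, specifying the encoding explicitly might look like this (UTF-8 is just an example; use whatever encoding the file was actually written in, and import java.nio.charset.StandardCharsets):
Reader rdr = new BufferedReader(
        new InputStreamReader(is, StandardCharsets.UTF_8), BUFFER_SIZE);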