I am trying to write a lot of data into a binary file. Because it is a lot of data, it is important that this is done fast and I want to be able to write the data as ints one by one. I have tried RandomAccessFile, BufferedWriter, DataOutputStream etc. but all of those are either too slow or cannot write ints. Any ideas that might help me?
Every stream can 'write ints' if you write the correct code to convert ints to bytes.
The two 'fast' IO options in Java are BufferedOutputStream on top of FileOutputStream and the use of a FileChannel with NIO buffers.
If all you are writing is many, many int values, you can fill a ByteBuffer through its IntBuffer view (asIntBuffer()) and pass that ByteBuffer to a FileChannel.
Further, 'one at a time' is generally incompatible with 'fast'. Sooner or later, data has to travel to the disk in blocks. If you force data to disk in small quantities, you will find that the process is very slow. You could, for example, add integer values to a buffer and write the buffer when it fills, and then repeat.
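A minimal sketch of the "fill a buffer, write it when full, repeat" idea, using a FileChannel and a ByteBuffer. The class name IntFileWriter and its methods are hypothetical names chosen for illustration; the buffer size is whatever the caller picks, and the ints are written big-endian (the ByteBuffer default).

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper: accepts ints one by one but writes to the file in blocks. */
final class IntFileWriter implements AutoCloseable {
    private final FileChannel channel;
    private final ByteBuffer buffer;

    IntFileWriter(Path path, int bufferBytes) throws IOException {
        if (bufferBytes < Integer.BYTES) {
            throw new IllegalArgumentException("buffer must hold at least one int");
        }
        channel = FileChannel.open(path, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING);
        buffer = ByteBuffer.allocateDirect(bufferBytes); // big-endian by default
    }

    /** Adds one int; the file is only touched when the buffer is full. */
    void writeInt(int value) throws IOException {
        if (buffer.remaining() < Integer.BYTES) {
            flush();
        }
        buffer.putInt(value);
    }

    /** Writes everything collected so far and empties the buffer. */
    private void flush() throws IOException {
        buffer.flip();
        while (buffer.hasRemaining()) {
            channel.write(buffer);
        }
        buffer.clear();
    }

    @Override
    public void close() throws IOException {
        try {
            flush(); // write the last, partially filled block
        } finally {
            channel.close();
        }
    }
}
```

Used in a try-with-resources block, the caller still writes "one int at a time", but the disk only sees whole blocks.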
Take a look at the java.nio package. You will find classes that you can use for your needed purposes.
Well, writing to a file one int at a time isn't an inherently fast operation. Even with a BufferedWriter you're potentially making a lot of function calls (and may still be doing a lot of file writes if you haven't set the buffer to be large enough).
Have you tried putting the integers into an array, using ByteBuffer to convert it to a byte array, and then writing the byte array to a file?
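Something along these lines might work, assuming the whole int array fits in memory four times over as bytes. IntArrayDump and its method names are made up for this sketch.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: converts an int array to bytes in one go and writes it. */
final class IntArrayDump {

    /** Returns the ints as big-endian bytes, 4 bytes per int. */
    static byte[] toBytes(int[] values) {
        // Note: values.length * 4 must stay below Integer.MAX_VALUE.
        ByteBuffer buffer = ByteBuffer.allocate(values.length * Integer.BYTES);
        buffer.asIntBuffer().put(values); // bulk copy through an int view
        return buffer.array();            // backing array of the heap buffer
    }

    /** Writes the whole array with a single call. */
    static void write(Path path, int[] values) throws IOException {
        Files.write(path, toBytes(values));
    }
}
```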
Related
I read binary data from a file in chunks into a ByteBuffer and then process the data in the ByteBuffer. I like using ByteBuffer because I can make use of the order() method to correctly read out Shorts and Integers.
At a later time I have to write new/modified data to the file. The idea was to use a ByteBuffer again and make use of its put methods. However, I don't easily know the size required in advance. I could work it out, but it would mean parsing the data twice. Alternatively I could use ByteArrayOutputStream, but then I have to write methods to deal with writing BigEndian/LittleEndian integers etc.
What do you recommend? Is there a third option?
I can think of two (other) approaches:
If the problem is that you can't predict the file size in advance, then don't. Instead use File.length() or Files.size(Path) to find the file size.
If you can process the file in chunks, then read a chunk, process it, and write it to a new file. Depending on your processing, you may even be able to update the ByteBuffer in-place. If not, then (presumably) the size of an output chunk is proportional to the size of an input chunk.
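A rough sketch of the chunked approach with an in-place update. The "processing" here (adding a constant to every little-endian int) is purely a stand-in, since the real processing isn't described; ChunkedRewrite and addToEveryInt are hypothetical names.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical example: read a chunk, update it in place, write it to a new file. */
final class ChunkedRewrite {

    static void addToEveryInt(Path in, Path out, int delta, int chunkBytes) throws IOException {
        if (chunkBytes <= 0 || chunkBytes % Integer.BYTES != 0) {
            throw new IllegalArgumentException("chunk size must be a positive multiple of 4");
        }
        try (FileChannel src = FileChannel.open(in, StandardOpenOption.READ);
             FileChannel dst = FileChannel.open(out, StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer chunk = ByteBuffer.allocate(chunkBytes).order(ByteOrder.LITTLE_ENDIAN);
            boolean endOfFile = false;
            while (!endOfFile) {
                chunk.clear();
                // Fill the chunk completely, or up to the end of the file.
                while (chunk.hasRemaining()) {
                    if (src.read(chunk) < 0) {
                        endOfFile = true;
                        break;
                    }
                }
                chunk.flip();
                // Update whole ints in place; any trailing bytes are copied unchanged.
                for (int pos = 0; pos + Integer.BYTES <= chunk.limit(); pos += Integer.BYTES) {
                    chunk.putInt(pos, chunk.getInt(pos) + delta);
                }
                while (chunk.hasRemaining()) {
                    dst.write(chunk);
                }
            }
        }
    }
}
```

Because the output chunk has the same size as the input chunk in this stand-in, no size needs to be known in advance; processing that grows or shrinks the data would need a separate output buffer per chunk.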
What do I recommend?
I'm not prepared to recommend anything without more details of the problem you are trying to solve; i.e. the nature of the processing.
If your use case guarantees that the file won't change during the processing, then it might be a good idea to use MappedByteBuffer.
I am writing my own image compression program in Java, I have entropy encoded data stored in multiple arrays which I need to write to file. I am aware of different ways to write to file but I would like to know what needs to be taken into account when trying to use the least possible amount of storage. For example, what character set should I use (I just need to write positive and negative numbers), would I be able to write less than 1 byte to a file, should I be using Scanners/BufferedWriters etc. Thanks in advance, I can provide more information if needed.
Read the Java tutorial about IO.
You should
not use Writers and character sets, since you want to write binary data
use a buffered stream to avoid too many native calls and make the write fast
not use Scanners, as they're used to read data, and not write data
And no, you won't be able to write less than a byte in a file. The byte is the smallest amount of information that can be stored in a file.
Compression is almost always more expensive than file IO. You shouldn't worry about the speed of your writes unless you know it's a bottleneck.
I am writing my own image compression program in Java, I have entropy encoded data stored in multiple arrays which I need to write to file. I am aware of different ways to write to file but I would like to know what needs to be taken into account when trying to use the least possible amount of storage.
Write the data in a binary format and it will be the smallest. This is why almost all image formats use binary.
For example, what character set should I use (I just need to write positive and negative numbers),
Character encoding is for encoding characters i.e. text. You don't use these in binary formats generally (unless they contain some text which you are unlikely to do initially).
would I be able to write less than 1 byte to a file,
Technically you can write less than the block size on disk (e.g. 512 bytes or 4 KB). You can write any amount less than this, but the file still occupies a whole block, so it doesn't use less space. Nor would it really matter if it did, because the amount of disk involved is too small to worry about.
should I be using Scanners/BufferedWriters etc.
No, these are for text.
Instead use DataOutputStream and DataInputStream as these are for binary.
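For example, a small round trip might look like this. The file layout (a count followed by the values) and the names BinaryIo, writeInts and readInts are assumptions made for the sketch, not a required format. The buffered streams are there to avoid one native call per int.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

/** Hypothetical helper: stores signed ints as raw 4-byte binary values. */
final class BinaryIo {

    /** Writes the array length, then each value, as big-endian ints. */
    static void writeInts(File file, int[] values) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            out.writeInt(values.length);
            for (int value : values) {
                out.writeInt(value);
            }
        }
    }

    /** Reads back what writeInts produced. */
    static int[] readInts(File file) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            int[] values = new int[in.readInt()];
            for (int i = 0; i < values.length; i++) {
                values[i] = in.readInt();
            }
            return values;
        }
    }
}
```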
what character set should I use
You would need to write your data as bytes, not chars, so forget about character set.
would I be able to write less than 1 byte to a file
No, this would not be possible. But to produce the bit stream your decoder expects, you might need to assemble a byte from smaller pieces, for example a 5-bit value and a 3-bit value, before writing that byte to the file.
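Assembling bytes from smaller pieces could be sketched like this. BitWriter is a hypothetical class, and the choices made here (most significant bit first, zero padding in the last byte) are assumptions; a real decoder defines its own bit order and padding.

```java
import java.io.IOException;
import java.io.OutputStream;

/** Hypothetical bit packer: collects bits and emits a byte whenever 8 are available. */
final class BitWriter implements AutoCloseable {
    private final OutputStream out;
    private int current; // bits collected so far, held in the low bits
    private int count;   // how many bits are in current (0..7)

    BitWriter(OutputStream out) {
        this.out = out;
    }

    /** Appends the lowest bitCount bits of value, most significant bit first. */
    void writeBits(int value, int bitCount) throws IOException {
        for (int i = bitCount - 1; i >= 0; i--) {
            current = (current << 1) | ((value >>> i) & 1);
            if (++count == 8) {
                out.write(current); // a full byte is ready
                current = 0;
                count = 0;
            }
        }
    }

    /** Pads the last partial byte with zero bits, then closes the stream. */
    @Override
    public void close() throws IOException {
        if (count > 0) {
            out.write(current << (8 - count));
        }
        out.close();
    }
}
```

Writing a 5-bit value followed by a 3-bit value produces exactly one byte in the output; a leftover partial byte is only written on close.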
I want to read a 1.5 GB file into an array. As this takes a long time, I want to switch to some other option. Can anybody help me?
If I preprocess the byte file into some database (or in some other way), can I make it faster?
Is there any other way to make it faster?
Actually, I have to process more than 50 files of 1.5 GB each, so such an operation is quite expensive for me.
It depends on what you want to do.
If you only wanted to access a few random bytes, then reading into an array isn't good - a MappedByteBuffer would be better.
If you want to read all the data and sequentially process it a small portion at a time then you could stream it.
If you need to do computations that do random access of the whole dataset, particularly if you need to repeatedly read elements, then loading into an array might be sensible (but a ByteBuffer is still a candidate).
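To illustrate the streaming case, here is a minimal sketch that processes a file one block at a time, so memory use stays constant regardless of file size. The task (counting occurrences of a byte), the 64 KiB block size, and the names StreamingScan and countByte are all made up for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical example: sequential processing with only one block in memory. */
final class StreamingScan {

    /** Counts how many bytes in the file equal target. */
    static long countByte(Path path, byte target) throws IOException {
        byte[] block = new byte[64 * 1024];
        long count = 0;
        try (InputStream in = Files.newInputStream(path)) {
            int n;
            // read() returns the number of bytes placed in block, or -1 at end of file
            while ((n = in.read(block)) != -1) {
                for (int i = 0; i < n; i++) {
                    if (block[i] == target) {
                        count++;
                    }
                }
            }
        }
        return count;
    }
}
```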
Can you show some example code or explain further?
How fast is your disk subsystem?
If you can read 40 MB per second, reading 1500 MB should take about 40 seconds. If you want to go faster than this, you need a faster disk subsystem. If you are reading from a local drive and it's taking minutes, you have a tuning problem, and there is not much you can do in Java to fix this because Java is not the problem.
You can use a memory mapped file instead, but this will only speed up the access if you don't need all the data. If you need it all, you are limited by the speed of your hardware.
Using BufferedInputStream or InputStream is probably as fast as you can get (faster than RandomAccessFile). The largest int value is 2,147,483,647, which is also the upper bound on the length of an array, so your array of 1,610,612,736 bytes is getting somewhat close to that limit.
I'd recommend you just access the file using BufferedInputStream for best speed, skip() and read() to get the data you want. Maybe have a class that implements those, is aware of its position, and takes care of the seeking for you when you send it an offset to read from. I believe you close and reopen the input stream to put it back at the beginning.
And... you may not want to save them in an array at all, and instead just access them from the file as needed. That might help if loading time is your killer.
I'm creating a compression algorithm in Java;
to use my algorithm I require a lot of information about the structure of the target file.
After collecting the data, I need to reread the file. <- But I don't want to.
While rereading the file, I make it a good target for compression by 'converting' the data of the file to a rather peculiar format. Then I compress it.
The problems now are:
I don't want to open a new FileInputStream for rereading the file.
I don't want to save the converted file which is usually 150% the size of the target file to the disk.
Are there any ways to 'reset' a FileInputStream for moving to the start of the file, and how would I store the huge amount 'converted' data efficiently without writing to the disk?
You can use one or more RandomAccessFiles. You can memory map them to ByteBuffers, which consume neither heap (actually they use about 128 bytes) nor direct memory, but can be accessed randomly.
Your temporary data can be stored in direct ByteBuffer(s) or in more memory mapped files. Since you have random access to the original data, you may not need to duplicate as much data in memory as you think.
This way you can access the whole data with just a few KB of heap.
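As a rough illustration, two passes over a memory-mapped file might look like this. The analysis itself (find the largest byte, then count it) is only a placeholder for the real "collect information, then convert" steps, and TwoPassScan is a hypothetical name. Note that a single mapping cannot exceed Integer.MAX_VALUE bytes, so larger files would need several mappings.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

/** Hypothetical example: read a file twice without reopening it. */
final class TwoPassScan {

    /** Returns {largest unsigned byte value, number of times it occurs}. */
    static int[] maxByteAndCount(String fileName) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(fileName, "r");
             FileChannel channel = file.getChannel()) {
            MappedByteBuffer data =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());

            // First pass: collect information about the file.
            int max = 0;
            while (data.hasRemaining()) {
                max = Math.max(max, data.get() & 0xFF);
            }

            data.rewind(); // back to the start, no new stream needed

            // Second pass: use that information.
            int count = 0;
            while (data.hasRemaining()) {
                if ((data.get() & 0xFF) == max) {
                    count++;
                }
            }
            return new int[] {max, count};
        }
    }
}
```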
There's the reset method, but you need to wrap the FileInputStream in a BufferedInputStream.
You could use RandomAccessFile, or perhaps a java.nio ByteBuffer is what you are looking for (I am not sure).
Resources might be saved by pipes/streams: immediately writing to a compressed stream.
To answer your questions on reset: not possible; the base class InputStream has provisions for mark and reset-to-mark, but FileInputStream was made optimal for several operating systems and does purely sequential input. Closing and opening is best.
What is the best way to change a single byte in a file using Java? I've implemented this in several ways. One uses all byte array manipulation, but this is highly sensitive to the amount of memory available and doesn't scale past 50 MB or so (i.e. I can't allocate 100MB worth of byte[] without getting OutOfMemory errors). I also implemented it another way which works, and scales, but it feels quite hacky.
If you're a java io guru, and you had to contend with very large files (200-500MB), how might you approach this?
Thanks!
I'd use RandomAccessFile, seek to the position I wanted to change and write the change.
If all I wanted to do was change a single byte, I wouldn't bother reading the entire file into memory. I'd use a RandomAccessFile, seek to the byte in question, write it, and close the file.
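In code, that could be as short as the sketch below. BytePatcher and setByte are hypothetical names; the bounds check is an addition so that the call cannot silently extend the file.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

/** Hypothetical helper: overwrites one byte without loading the file into memory. */
final class BytePatcher {

    static void setByte(String fileName, long position, int newValue) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(fileName, "rw")) {
            if (position < 0 || position >= file.length()) {
                throw new IllegalArgumentException("position outside file: " + position);
            }
            file.seek(position);  // jump straight to the byte in question
            file.write(newValue); // writes the low 8 bits of newValue
        }
    }
}
```

Memory use does not depend on the file size, so this should behave the same for a 5 MB file and a 500 MB file.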