I'm working through the problems in Programming Pearls, 2nd edition, Column 1. One of them involves writing a program that uses only around 1 megabyte of memory to store the contents of a file as a bit array, with each bit representing whether a given 7-digit number is present in the file. Since Java is the language I'm most familiar with, I've decided to use it, even though the author seems to have had C and C++ in mind.
Since I'm pretending memory is limited for the purpose of the problem I'm working on, I'd like to make sure the process of reading the file has no buffering at all.
I thought InputStreamReader would be a good solution, until I read this in the Java documentation:
To enable the efficient conversion of bytes to characters, more bytes may be read ahead from the underlying stream than are necessary to satisfy the current read operation.
Ideally, only the bytes that are necessary would be read from the stream -- in other words, I don't want any buffering.
One of the problems involves writing a program that uses only around 1 megabyte of memory to store the contents of a file as a bit array with each bit representing whether or not a 7 digit number is present in the file.
This implies that you need to read the file as bytes (not characters).
Assuming that you do have a genuine requirement to read from a file without buffering, then you should use the FileInputStream class. It does no buffering. It reads (or attempts to read) precisely the number of bytes that you asked for.
If you then need to convert those bytes to characters, you could do this by applying the appropriate String constructor to a byte[]. Note that for multibyte character encodings such as UTF-8, you would need to read sufficient bytes to complete each character. Doing that without the possibility of read-ahead is a bit tricky ... and entails "knowledge" of the character encoding you are reading.
(You could avoid that knowledge by using a CharsetDecoder directly. But then you'd need to use the decode method that operates on Buffer objects, and that is a bit complicated too.)
For what it is worth, Java makes a clear distinction between stream-of-byte and stream-of-character I/O. The former is supported by InputStream and OutputStream, the latter by Reader and Writer. The InputStreamReader class is a Reader that adapts an InputStream. You should not be considering it for an application that wants to read stuff byte-wise.
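To make this concrete, here is a minimal sketch of the unbuffered approach (class, method, and field names are my own). It reads bytes one at a time from a stream and sets a bit for each 7-digit number found; in the real program the ByteArrayInputStream in main would be a FileInputStream, which does no buffering of its own:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BitmapSort {
    // Reads decimal numbers separated by non-digit bytes, one byte per
    // read() call (no buffering), and marks each number's bit in the array.
    static void markNumbers(InputStream in, byte[] bits) throws IOException {
        int b, current = 0;
        boolean inNumber = false;
        while ((b = in.read()) != -1) {
            if (b >= '0' && b <= '9') {
                current = current * 10 + (b - '0');
                inNumber = true;
            } else if (inNumber) {
                bits[current >> 3] |= 1 << (current & 7);
                current = 0;
                inNumber = false;
            }
        }
        if (inNumber) bits[current >> 3] |= 1 << (current & 7);
    }

    static boolean isSet(byte[] bits, int n) {
        return (bits[n >> 3] & (1 << (n & 7))) != 0;
    }

    public static void main(String[] args) throws IOException {
        byte[] bits = new byte[1_250_000]; // 10^7 bits for 7-digit numbers
        markNumbers(new ByteArrayInputStream("1234567\n42\n".getBytes()), bits);
        System.out.println(isSet(bits, 1234567)); // true
        System.out.println(isSet(bits, 42));      // true
        System.out.println(isSet(bits, 7));       // false
    }
}
```

The 1,250,000-byte array is exactly 10^7 bits, which covers every 7-digit value in about 1.25 MB.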
I am working on a project in which I have to handle many files. The problem comes when I have to process files laid out in different ways, such as:

A file containing a few characters in each line, e.g.:
1st line : A B 4
2nd line : 6 C A 6 & U #
etc.

A file containing a number of strings in each line, e.g.:
1st line : Lion Panther Jaguar
etc.
I have read about handling files efficiently, but I am confused about when to use buffered streams and when to use unbuffered ones. And if I do use buffering, should it be BufferedInputStream or BufferedReader / BufferedWriter?
Similarly, I am confused by the choice between plain I/O streams, file I/O streams, and byte-array I/O streams. There are so many options. Can anyone suggest when to use which one, and why? What would efficient handling look like in the different scenarios?
Well, there might not be a direct answer to this, but you don't have to worry if you feel confused. Discussions about buffered and unbuffered I/O have come up many times before.
For example, this link: buffered vs non-buffered gives a good hint (check the answer marked as correct). When you use buffered streams, data passes through a small area of memory called (unsurprisingly) a buffer. The same happens with written data: it goes into the buffer before being stored on disk. This improves performance because it lowers the overhead of I/O operations (which are OS dependent). Check the Java Doc: Buffered Streams.
So, to make it clear: use buffered streams when you need to improve the performance of your I/O operations. Use unbuffered streams when you want to ensure that the output has actually been written before continuing (an error can always occur while data is still sitting in the buffer; an example is writing a log that must be durable as soon as each entry is written).
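To illustrate the distinction, here is a small sketch (class and method names are my own): it writes through a BufferedWriter, which batches writes in memory until flush/close, and then reads the file back with a plain, unbuffered FileInputStream, where every read() is a separate call to the OS:

```java
import java.io.*;

public class StreamChoices {
    // Writes text with a BufferedWriter (writes are batched in a buffer),
    // then reads it back with an unbuffered FileInputStream (one OS call
    // per read()). Returns what was read.
    static String roundTrip(String text) throws IOException {
        File f = File.createTempFile("demo", ".txt");
        f.deleteOnExit();
        try (BufferedWriter out = new BufferedWriter(new FileWriter(f))) {
            out.write(text); // only hits the disk when flushed/closed
        }
        StringBuilder sb = new StringBuilder();
        try (FileInputStream in = new FileInputStream(f)) {
            int b;
            while ((b = in.read()) != -1) sb.append((char) b); // ASCII only
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip("Lion Panther Jaguar"));
    }
}
```

Wrapping the FileInputStream in a BufferedInputStream would keep the same code working while cutting the number of OS calls dramatically.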
I have a question regarding reading files, for instance in Java or in C/C++. You can usually get an offset value for the current position in the file.
How robust is this offset? Assuming the file is not changed, will I read the same line via Java as I would using C/C++ if I position the stream at this offset?
I would guess yes, but am I missing something? What I want to do is build some kind of index that returns this offset value for a specific file. Can that work, or is the offset bound to a certain API or even a particular architecture?
The offset of a given byte in a given file is going to be 100% reliable for (at least) any system with a POSIX / POSIX-like model of files. It follows that the same offset will give you the same byte in Java and C++. However, this does depend on you using the respective languages' I/O APIs correctly; i.e. understanding them.
One thing that can get a bit tricky is when you use some "binary I/O" scheme in C++ that involves treating objects (or structs) as arrays of bytes and reading / writing those bytes. If you do that, you have the problem that the byte-level representations of C / C++ objects are platform dependent. For instance, you can run into the big-endian vs little-endian problem. This doesn't alter offsets ... but it can mean that "stuff" gets mangled due to representation mismatches.
The best thing to do is to use a file representation that is not dependent on the platform where the file is read or written; i.e. don't do it that way.
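In Java, one portable way to do this kind of offset-based indexing is RandomAccessFile, which seeks by absolute byte offset. A small sketch (class and method names are my own):

```java
import java.io.*;

public class OffsetIndex {
    // Seeks to an absolute byte offset and reads the line starting there.
    static String readLineAt(File f, long offset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
            raf.seek(offset);      // position by byte offset, not by line
            return raf.readLine(); // reads bytes up to the next '\n'
        }
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("index", ".txt");
        f.deleteOnExit();
        try (FileOutputStream out = new FileOutputStream(f)) {
            out.write("first line\nsecond line\n".getBytes("US-ASCII"));
        }
        // "first line\n" is 11 bytes in ASCII, so the second line starts at 11.
        System.out.println(readLineAt(f, 11)); // second line
    }
}
```

Because the offset is just a byte count, an index of such offsets written by a C/C++ program would work equally well here, as long as both sides agree on the file's byte-level layout.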
I am writing my own image compression program in Java. I have entropy-encoded data stored in multiple arrays which I need to write to a file. I am aware of different ways to write to a file, but I would like to know what needs to be taken into account when trying to use the least possible amount of storage. For example: what character set should I use (I just need to write positive and negative numbers)? Would I be able to write less than 1 byte to a file? Should I be using Scanners/BufferedWriters, etc.? Thanks in advance; I can provide more information if needed.
Read the Java tutorial about IO.
You should
not use Writers and character sets, since you want to write binary data
use a buffered stream to avoid too many native calls and make the write fast
not use Scanners, as they're used to read data, and not write data
And no, you won't be able to write less than a byte in a file. The byte is the smallest amount of information that can be stored in a file.
Compression is almost always more expensive than file IO. You shouldn't worry about the speed of your writes unless you know it's a bottleneck.
I am writing my own image compression program in Java, I have entropy encoded data stored in multiple arrays which I need to write to file. I am aware of different ways to write to file but I would like to know what needs to be taken into account when trying to use the least possible amount of storage.
Write the data in a binary format and it will be the smallest. This is why almost all image formats use binary.
For example, what character set should I use (I just need to write positive and negative numbers),
Character encodings are for encoding characters, i.e. text. You generally don't use them in binary formats (unless the format contains some text, which yours is unlikely to do initially).
would I be able to write less than 1 byte to a file,
Technically you can use less than the block size on disk, e.g. 512 bytes or 4 KB. You can write any amount smaller than this, but it doesn't use less disk space; nor would it really matter if it did, because the amount of space involved is too small to worry about.
should I be using Scanners/BufferedWriters etc.
No. Those are for text.
Instead use DataOutputStream and DataInputStream, as these are for binary data.
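A short sketch of the round trip with those two classes (class name and sample values are my own). Each write method emits a fixed number of big-endian bytes, so the three values below take 7 bytes total, versus 13 characters as text:

```java
import java.io.*;

public class BinaryIO {
    // Writes an int, a short, and a byte in big-endian binary form.
    static byte[] encode(int i, short s, byte b) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(i);    // 4 bytes
            out.writeShort(s);  // 2 bytes
            out.writeByte(b);   // 1 byte
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = encode(-12345, (short) 300, (byte) -7);
        System.out.println(data.length); // 7

        // DataInputStream reads the values back in the same order.
        try (DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(data))) {
            System.out.println(in.readInt());   // -12345
            System.out.println(in.readShort()); // 300
            System.out.println(in.readByte());  // -7
        }
    }
}
```

In the real program the ByteArrayOutputStream would be a FileOutputStream (ideally wrapped in a BufferedOutputStream).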
what character set should I use
You would need to write your data as bytes, not chars, so forget about character sets.
would I be able to write less than 1 byte to a file
No, that is not possible. But to produce the bit stream the decoder expects, you might need to assemble a byte from smaller pieces, for example 5 bits and 3 bits, before writing that byte to the file.
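That byte-assembly step might be sketched like this (class and method names are my own): a 5-bit field is shifted into the high bits and a 3-bit field fills the low bits of a single byte:

```java
public class BitPack {
    // Packs a 5-bit value and a 3-bit value into one byte:
    // hi5 occupies the top 5 bits, lo3 the bottom 3.
    static int pack(int hi5, int lo3) {
        return ((hi5 & 0x1F) << 3) | (lo3 & 0x07);
    }

    public static void main(String[] args) {
        int b = pack(0b10110, 0b101); // bits: 10110 followed by 101
        System.out.println(Integer.toBinaryString(b)); // 10110101
        System.out.println(b);                         // 181
    }
}
```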
I am reading from an InputStream.
and writing what I read into an outputStream.
I also check a few things.
For example, if I read an & (ampersand), I need to write "&amp;" instead.
My code works. But now I wonder if I have written the most efficient way (which I doubt).
I read byte by byte. (but this is because I need to do odd modifications)
Can somebody who's done this suggest the fastest way ?
Thanks
If you are using BufferedInputStream and BufferedOutputStream then it is hard to make it faster.
BTW if you are processing the input as characters as opposed to bytes, you should use readers/writers with BufferedReader and BufferedWriter.
The code should be reading/writing characters with Readers and Writers. For example, if it's in the middle of a UTF-8 sequence, or it gets the second half of a UCS-2 character, and it happens to read the equivalent byte value of an ampersand, then it's going to damage the data that it's attempting to copy. Code usually lives longer than you would expect it to, and somebody might try to pick it up later and use it in a situation where this could really matter.
As far as being faster or slower, using a BufferedReader will probably help the most. If you're writing to the file system, a BufferedWriter won't make much of a difference, because the operating system will buffer writes for you and it does a good job. If you're writing to a StringWriter, then buffering will make no difference (may even make it slower), but otherwise buffering your writes ought to help.
You could rewrite it to process arrays of characters rather than single characters, and that might make it faster; you can still make your odd modifications that way. You will have to write more complicated code to handle boundary conditions, though, and that also needs to be a factor in the decision.
Measure, don't guess, and be wary of opinions from people who aren't informed of all the details. Ultimately, it's up to you to figure out whether it's fast enough for this situation. There is no single answer, because all situations are different.
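For reference, a character-based version of such a copy might look like the following sketch (class and method names are my own). It buffers both sides and performs the ampersand substitution on chars rather than bytes, which keeps multi-byte encodings intact:

```java
import java.io.*;

public class EscapeCopy {
    // Copies characters from in to out, replacing '&' with "&amp;".
    static void copyEscaping(Reader in, Writer out) throws IOException {
        BufferedReader r = new BufferedReader(in);
        BufferedWriter w = new BufferedWriter(out);
        int c;
        while ((c = r.read()) != -1) {
            if (c == '&') w.write("&amp;");
            else w.write(c);
        }
        w.flush(); // push any buffered output through to the target
    }

    public static void main(String[] args) throws IOException {
        StringWriter out = new StringWriter();
        copyEscaping(new StringReader("fish & chips"), out);
        System.out.println(out); // fish &amp; chips
    }
}
```

To copy an InputStream/OutputStream pair this way, wrap them in an InputStreamReader and OutputStreamWriter with an explicit charset.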
I would prefer to use BufferedReader for reading input and BufferedWriter for output. Using regular expressions to match your input can also make your code shorter and easier to read.
I want to encode every file with a Huffman code.
I have found the length in bits of each symbol (its Huffman code).
Is it possible to write such an encoding to a file in Java: are there any existing classes that read and write to a file bit by bit, rather than with a minimum granularity of a char?
You could create a BitSet to store your encoding as you are creating it and simply write the String representation to a file when you are done.
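A small sketch of that idea (class and method names are my own). It appends variable-length codes into a BitSet; note that BitSet also offers toByteArray(), which packs the bits into a compact binary form ready for an OutputStream, as an alternative to the String representation:

```java
import java.util.BitSet;

public class CodeBits {
    // Appends the '0'/'1' characters of a code string into the BitSet,
    // starting at position pos; returns the next free position.
    static int append(BitSet bits, int pos, String code) {
        for (char c : code.toCharArray()) {
            if (c == '1') bits.set(pos);
            pos++;
        }
        return pos;
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet();
        int pos = 0;
        // Hypothetical Huffman codes for three symbols.
        pos = append(bits, pos, "101");
        pos = append(bits, pos, "0");
        pos = append(bits, pos, "11");
        // toByteArray() packs bit n into byte n/8 (bit 0 = LSB),
        // so the 6 bits "101011" become one byte.
        byte[] packed = bits.toByteArray();
        System.out.println(packed.length);    // 1
        System.out.println(packed[0] & 0xFF); // 53 (binary 00110101)
    }
}
```

One caveat: toByteArray() drops trailing zero bits, so the total bit count has to be stored separately for the decoder.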
You really don't want to write single bits to a file one at a time, believe me. Usually we define a byte buffer, build the "file" in memory and, after all the work is done, write out the complete buffer. Otherwise it would take (nearly) forever.
If you need a fast bit vector, then have a look at the colt library. It's pretty convenient if you want to write single bits and don't want to do all the bit-shifting operations on your own.
I'm sure there are Huffman classes out there, but I'm not immediately aware of where they are. If you want to roll your own, two ways to do this spring to mind immediately.
The first is to assemble the bit strings in memory by using mask and shift operators, accumulate the bits into larger data objects (e.g. ints or longs), and then write those out to file with standard streaming.
The second, more ambitious and self-contained idea would be to write an implementation of OutputStream that has a method for writing a single bit and then this OutputStream class would do the aforementioned buffering/shifting/accumulating itself and probably pass the results down to a second, wrapped OutputStream.
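That second idea might be sketched like this (the class is my own, not a standard one): an OutputStream wrapper that accumulates individual bits and passes each completed byte down to the wrapped stream, padding the final partial byte with zeros on close:

```java
import java.io.*;

public class BitOutputStream extends OutputStream {
    private final OutputStream out;
    private int buffer = 0; // bits accumulated so far, MSB first
    private int count = 0;  // number of bits currently in the buffer

    public BitOutputStream(OutputStream out) { this.out = out; }

    // Writes a single bit (0 or 1); a byte is emitted every 8 bits.
    public void writeBit(int bit) throws IOException {
        buffer = (buffer << 1) | (bit & 1);
        if (++count == 8) {
            out.write(buffer);
            buffer = 0;
            count = 0;
        }
    }

    @Override public void write(int b) throws IOException {
        for (int i = 7; i >= 0; i--) writeBit((b >> i) & 1);
    }

    // Pads the final partial byte with zero bits, then closes the target.
    @Override public void close() throws IOException {
        while (count != 0) writeBit(0);
        out.close();
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (BitOutputStream bits = new BitOutputStream(bytes)) {
            for (char c : "10110101".toCharArray()) bits.writeBit(c - '0');
        }
        System.out.println(bytes.toByteArray()[0] & 0xFF); // 181
    }
}
```

Because the zero-padding on close is indistinguishable from real code bits, a decoder needs the total bit count (or symbol count) stored alongside the data.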
Try writing a bit vector in Java to handle the bit representation: it should allow you to set/reset the individual bits in a bit stream.
The bit stream can thus hold your Huffman encoding. This is the best approach, and lightning fast too.
Huffman sample analysis here
You can find a working (and fast) implementation here: http://code.google.com/p/kanzi/source/browse/src/kanzi/entropy/HuffmanTree.java