I am using BufferedWriter to write text to files in Java, and I am providing a custom buffer size in the constructor. The writer does flush to the file in chunks of the size I give (for example, with a buffer size of 8KB, the file is written once per 8KB). But when I look at the memory occupied by the BufferedWriter object (using the YourKit profiler), it is actually twice the given buffer size (16KB in this case).
I looked at the internal implementation to see why this is happening, and I see that it creates a char array of the given size. Since each char occupies 2 bytes, it makes sense that the array occupies twice the buffer size.
My question is: how does BufferedWriter manage to write only 8KB in this case, when it is storing 16KB in the buffer? And is this technically correct? Because each time, it is flushing only 8KB (half) even though it has 16KB in the buffer.
But I expected all the chars stored in the char array to be written to the file when the buffer fills (which would be 16KB in my example).
8K of chars occupies 16 KB of memory. Correct.
Now let's assume that the chars are actually all in the ASCII subset.
When you write a character stream to an output file in Java, the characters are encoded as a byte stream according to some encoding scheme. (This encoding is performed by stuff inside the OutputStreamWriter class, for example.)
When you encode those 8K of characters using an 8-bit character set / encoding scheme such as ASCII or Latin-1 ... or UTF-8 (!!) ... each character is encoded as 1 byte. Therefore flushing a buffer containing those 8K characters generates an 8K-byte write.
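For example, here is a minimal sketch (the fill character and charset are my own choices, not from the question) showing that 8K chars take 16 KB in memory but encode to 8 KB of output:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

char[] chars = new char[8 * 1024];            // 8192 chars = ~16 KB on the heap
Arrays.fill(chars, 'a');                      // all ASCII
byte[] bytes = new String(chars).getBytes(StandardCharsets.ISO_8859_1);
System.out.println(bytes.length);             // 8192 -- one byte per char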
The size of a BufferedWriter's buffer is the size of its char array:
public BufferedWriter(Writer out, int sz) {
    super(out);
    if (sz <= 0)
        throw new IllegalArgumentException("Buffer size <= 0");
    this.out = out;
    cb = new char[sz];
    nChars = sz;
    nextChar = 0;
    lineSeparator = java.security.AccessController.doPrivileged(
        new sun.security.action.GetPropertyAction("line.separator"));
}
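So the sz argument counts chars, not bytes. A one-line sketch (the file name is my own):

// sz = 8192 chars, which occupy roughly 16 KB of heap inside the writer
Writer w = new BufferedWriter(new FileWriter("out.txt"), 8 * 1024);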
A single char is not equal to a single byte; how many bytes a char becomes is defined by your character encoding.
Therefore, to do exactly what you described, you have to switch to another class: BufferedOutputStream, whose internal buffer is counted in bytes:
public BufferedOutputStream(OutputStream out, int size) {
    super(out);
    if (size <= 0) {
        throw new IllegalArgumentException("Buffer size <= 0");
    }
    buf = new byte[size];
}
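For illustration, a sketch of the byte-counted alternative (file name assumed):

// size = 8192 bytes, so the internal buffer really is 8 KB in memory
OutputStream out = new BufferedOutputStream(new FileOutputStream("out.bin"), 8 * 1024);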
It depends on the encoding used to write to the file: ISO-8859-1 stores each character as a single byte, and UTF-8 encodes all ASCII characters as single bytes.
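A small sketch of that difference (my own example, not from the question):

import java.nio.charset.StandardCharsets;

String s = "é";   // a non-ASCII character
System.out.println(s.getBytes(StandardCharsets.ISO_8859_1).length); // 1 byte
System.out.println(s.getBytes(StandardCharsets.UTF_8).length);      // 2 bytes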
Related
I have a sample .txt file that I want to compress using Huffman encoding. My problem is that if one character has a size of one byte and the smallest size you can write is a byte, how do I reduce the size of the sample file?
I converted the sample file into Huffman codes and wrote them to a new, empty .txt file, which consists of just 0s and 1s in one huge line of characters. Then I took the new file and used the BitSet class in Java to write to a binary file bit by bit: if the character in the new file was 0 or 1, I wrote 0 or 1 respectively to the binary file. This process was very slow and it crashed my computer multiple times, so I was hoping that someone had a more efficient solution. I have written all my code in Java.
Do not write "0" and "1" characters to the file. Write 0 and 1 bits to the file.
You do this by accumulating eight bits into a byte buffer using the shift (<<) and or (|) operators, and then writing that byte to the file. Repeat. At the end you may have less than eight bits in the byte buffer. If so, write that byte to the file, which will have the remaining bits filled with zeros.
E.g.:

int buf = 0, count = 0;

// for each bit:
buf |= bit << count++;

// check for eight:
if (count == 8) {
    out.writeByte(buf);
    buf = count = 0;
}

// at the end:
if (count > 0) {
    out.writeByte(buf);
}
When decoding the Huffman codes, you may run into a problem with those filler zero bits in the last byte. They could be decoded as an extraneous symbol or symbols. In order to deal with this you will need for the decoder to know when to stop, by either sending the number of symbols before the Huffman codes, or by adding a symbol for end-of-stream.
One way is to use BitSet to set the bits that represent the code as you compute it. Then you can do either BitSet.toByteArray() or BitSet.toLongArray() and write out the information. Both of these store the bits in little endian encoding.
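A sketch of that approach (the code string here is a made-up example):

import java.util.BitSet;

BitSet bits = new BitSet();
String code = "1011";                     // hypothetical concatenated Huffman codes
for (int i = 0; i < code.length(); i++) {
    if (code.charAt(i) == '1') {
        bits.set(i);                      // bit i of the output stream
    }
}
byte[] packed = bits.toByteArray();       // little endian: "1011" -> 0x0D

Note that toByteArray() only returns bytes up to the highest set bit, so trailing zero bits are dropped; the decoder needs some way to know the true bit count.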
I am sending a byte over a TCP connection. When I send a single negative number (like -30 in this example), I get three bytes:
Client Side:
PrintWriter out = new PrintWriter(new BufferedWriter(new OutputStreamWriter(socket.getOutputStream())));
out.write((byte)-30);
out.flush();
out.close();
Server Side:
DataInputStream is = new DataInputStream(clientSocket.getInputStream());
byte[] bbb = new byte[3];   // assumed buffer; three bytes arrive, as the output below shows
is.readFully(bbb);
for (int i = 0; i < bbb.length; i++)
    System.out.println(i + ":" + bbb[i]);
What I get is:
0:-17
1:-65
2:-94
But I sent just -30.
You're using a writer, and you're calling Writer.write(int):
Writes a single character. The character to be written is contained in the 16 low-order bits of the given integer value; the 16 high-order bits are ignored.
So you've got a conversion to int, then the bottom 16 bits of that int are taken. So you're actually writing Unicode character 65506 (U+FFE2) in your platform default encoding (which appears to be UTF-8). That's not what you want to write, but that's what you are writing.
If you only want to write binary data, you shouldn't be using a Writer at all. Just use OutputStream - wrap it in DataOutputStream if you want, but don't use a Writer. The Writer classes are for text.
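For example, a sketch of the stream-based fix (same socket as in the question):

OutputStream out = new BufferedOutputStream(socket.getOutputStream());
out.write(-30);   // writes the single byte 0xE2; only the low 8 bits are used
out.flush();
out.close();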
I was reading through this article. It has the following snippet:
OutputStream output = new FileOutputStream("c:\\data\\output-text.txt");

while (moreData) {
    int data = getMoreData();
    output.write(data);
}

output.close();
It is mentioned:
OutputStreams are used for writing byte based data, one byte at a time. The write() method of an OutputStream takes an int which contains the byte value of the byte to write.
Let's say I am writing the string Hello World to the file, so each character in the string gets converted to an int by the getMoreData() method. How does it get written: as a character or as a byte in output-text.txt? If it gets written as bytes, what is the advantage of writing in bytes if I have to reconvert the bytes to characters?
Each character (and almost anything stored in a file) is one or more bytes. For example:
Lowercase 'a' is written as one byte with decimal value 97.
Number '1' is written as one byte with decimal value 49.
There is no concept of data types once the information is written into a file; everything is just a stream of bytes. What's important is the encoding used to store the information in the file.
Have a look at an ASCII table, which is very useful for beginners learning about information encoding.
To illustrate this, create a file containing the text 'hello world':
$ echo 'hello world' > hello.txt
Then output the bytes written to the file using the od command:
$ od -td1 hello.txt
0000000 104 101 108 108 111 32 119 111 114 108 100 10
0000014
The above means: at address 0000000 from the start of the file, I see one byte with decimal value 104 (which is character 'h'), then one byte with decimal value 101 (which is character 'e'), and so on...
The article is incomplete, because an OutputStream has overloaded methods for write that take a byte[], a byte[] along with offset and length arguments, or a single int.
In the case of writing a String to a stream when the only interface you have is OutputStream (say you don't know what the underlying implementation is), it would be much better to use output.write(string.getBytes()). Iteratively peeling off a single int at a time and writing it to the file is going to perform horribly compared to a single call to write that passes an array of bytes.
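A sketch of that difference (the charset is made explicit here as my own choice):

String s = "Hello World";
// one bulk call instead of eleven single-int writes
output.write(s.getBytes(java.nio.charset.StandardCharsets.UTF_8));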
Streams operate on bytes and simply read/write raw data.
Readers and writers interpret the underlying data as text using character sets such as UTF-8 or US-ASCII. This means they may take 8-bit characters (ASCII) and decode the data into Java's UTF-16 strings.
Streams use bytes, readers/writers use strings (or other complex types).
The java.io.OutputStream class is the superclass of all classes representing an output stream of bytes. When bytes are written to an OutputStream, it may not write them immediately; instead, the write method may put the bytes into a buffer.
It provides the following write methods:
void write(byte[] b)
This method writes b.length bytes from the specified byte array to this output stream.
void write(byte[] b, int position, int length)
This method writes length bytes from the specified byte array starting at offset position to this output stream.
void write(int b)
This method writes the specified byte to this output stream.
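To illustrate the three overloads together (the file name is hypothetical):

OutputStream out = new FileOutputStream("out.bin");
byte[] data = {72, 105, 33};   // the bytes for "Hi!"
out.write(data);               // whole array
out.write(data, 0, 2);         // "Hi": offset 0, length 2
out.write(33);                 // single byte: the low 8 bits of the int
out.close();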
I was surprised to find that the following code
System.out.println("Character size:"+Character.SIZE/8);
System.out.println("String size:"+"a".getBytes().length);
outputs this:
Character size:2
String size:1
I would assume that a single-character string should take up at least as many bytes as a single char.
In particular, I am wondering:
If I have a Java bean with several fields in it, how will its size increase depending on the nature of the fields (Character, String, Boolean, Vector, etc.)? I'm assuming that all Java objects have some (probably minimal) footprint, and that one of the smallest of these footprints would be a single character. To test that basic assumption I started with the above code, and the results of the print statements seem counterintuitive.
Any insights into the way java stores/serializes characters vs strings by default would be very helpful.
getBytes() encodes the String with the platform default encoding (for example ISO-8859-1), while a char internally is always 2 bytes. Java always uses char arrays with 2-byte chars internally; if you want to know more about encoding, read the link posted by Oded in the question comments.
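A quick sketch of how the charset changes the byte count (my own illustration):

import java.nio.charset.StandardCharsets;

System.out.println("a".getBytes(StandardCharsets.ISO_8859_1).length); // 1
System.out.println("a".getBytes(StandardCharsets.UTF_16).length);     // 4 (2-byte BOM + 2-byte 'a')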
I would like to say what I think; correct me if I am wrong. You are finding the length of the string, which is correctly shown as 1 because you have only one character in the string. length gives the length, not the size; length and size are two different things.
Check this link: you are measuring the number of bytes occupied in the wrong way.
Well, what this shows is that one char in a char array has a size of 2 bytes, and that your String is 1 character long; not that the String has a size of 1 byte.
The String object in Java consists of:
private final char value[];
private final int offset;
private final int count;
private int hash;
This alone should assure you that a String object is bigger than a bare char array.
If you want to learn more about how object size is computed, you can also read about object headers and the multiplicity factor for char arrays.
I want to add some code first and then a bit of explanation:
import java.nio.charset.Charset;

public class Main {
    public static void main(String[] args) {
        System.out.println("Character size: " + Character.SIZE / 8);
        final byte[] bytes = "a".getBytes(Charset.forName("UTF-16"));
        System.out.println("String size: " + bytes.length);
        sprintByteAsHex(bytes[0]);
        sprintByteAsHex(bytes[1]);
        sprintByteAsHex(bytes[2]);
        sprintByteAsHex(bytes[3]);
    }

    static void sprintByteAsHex(byte b) {
        System.out.print(Integer.toHexString(b & 0xFF));
    }
}
And the output will be:
Character size: 2
String size: 4
feff061
So what you are actually missing is that you are not providing any parameter to the getBytes method; you are probably getting the bytes of the platform default (likely UTF-8) representation of the character 'a'.
Well, but why did we get 4 bytes when we asked for UTF-16? Java uses UTF-16 internally, so we should have gotten 2 bytes, right?
If you examine the output:
feff061
Java actually returned us a BOM: https://en.wikipedia.org/wiki/Byte_order_mark.
So the first 2 bytes, feff, signal that the following bytes are UTF-16 big endian. Please see the Wikipedia page for further information.
The remaining 2 bytes, 0061, are the 2-byte representation of the character 'a'. This can be verified at: http://www.fileformat.info/info/unicode/char/0061/index.htm
So yes, a character in Java is 2 bytes, but when you ask for bytes without a specific encoding, you will not always get 2 bytes, since different encodings require different numbers of bytes for various characters.
The SIZE of a Character is the storage needed for a char, which is 16 bits. The length of a string (also the length of the underlying char array or byte array) is the number of characters (or bytes), not a size in bits.
That's why you had to divide by 8 for the size, but not for the length; the length needs to be multiplied by two to get the size in bytes.
Also note that you will get other lengths for the byte array if you specify a different encoding. In that case, a transformation to a single-byte or variable-size encoding was performed by getBytes().
See: http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#getBytes(java.nio.charset.Charset)
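A short sketch of length versus size (my own illustration):

String s = "a";
System.out.println(s.length());                      // 1 -- length in chars
System.out.println(s.length() * Character.SIZE / 8); // 2 -- UTF-16 storage in bytes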
I have implemented the Huffman encoding algorithm in Java using priority queues, traversing the tree from root to leaf to get encodings such as #=000011, based on the number of times each symbol appears in the input. Everything is fine: the tree is built correctly and the encoding is just as expected. But the output file I am getting is bigger than the original file. I am currently appending '0' and '1' to a String while traversing the left and right nodes of the tree, so I probably end up using all 8 bits of a character for each encoded bit, which does not help compression. I am guessing some conversion of these bits into character values is required, so that the codes use fewer than 8 bits each and I get a compressed version of the original file. Could you please let me know how to achieve compression by manipulating characters and reducing bits in Java? Thanks
You're probably using a StringBuilder and appending "0" or "1", or simply the + operator to concatenate "0" or "1" to the end of your string. Or you're using some sort of OutputStream and writing to it.
What you want to do is to write the actual bits. I'd suggest making a whole byte first before writing. A byte looks like this:
0x05
Which would represent the binary string 0000 0101.
You can make these by declaring a byte variable, then adding and shifting:
public void writeToFile(String binaryString, OutputStream os) throws IOException {
    int pos = 0;
    while (pos < binaryString.length()) {
        byte nextByte = 0x00;
        for (int i = 0; i < 8 && pos + i < binaryString.length(); i++) {
            nextByte <<= 1;   // shift left to make room for the next bit
            nextByte += binaryString.charAt(pos + i) == '0' ? 0x0 : 0x1;
        }
        os.write(nextByte);   // write the packed byte
        pos += 8;
    }
}
Of course, it's inefficient to write one byte at a time. You'd be better off storing the bytes in an array (or, even easier, a List) and then writing them out in bigger chunks via the write(byte[]) overload.
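For instance, a sketch that reuses the method above with an in-memory buffer (the final stream name is assumed):

// Collect all packed bytes first, then write them out in one call.
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
writeToFile(binaryString, buffer);        // ByteArrayOutputStream is an OutputStream
fileStream.write(buffer.toByteArray());   // single bulk write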
If you are not allowed to use byte writes (why the heck not? ObjectOutputStream supports writing byte arrays!), then you can use Base64 to encode your binary string. But remember that Base64 inflates your data usage by 33%.
An easy way to convert a byte array to Base64 is to use an existing encoder. The standard java.util.Base64 class (Java 8+) is the one to use; the internal sun.misc.BASE64Encoder class is not part of the public API and has been removed from recent JDKs. After adding the following import:

import java.util.Base64;

you can turn your byte array into a string:

byte[] bytes = getBytesFromHuffmanEncoding();
String encodedString = Base64.getEncoder().encodeToString(bytes);
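Decoding back is symmetric (a one-line sketch):

byte[] decoded = Base64.getDecoder().decode(encodedString);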