How to know number of bytes of a Binary File? - java

How to count number of bytes of this binary file (t.dat) without running this code
(as a theoretical question) ?
Assuming that you run the following program on Windows using the default ASCII encoding.
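import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class Bin {
    public static void main(String[] args) throws IOException {
        DataOutputStream output = new DataOutputStream(
                new FileOutputStream("t.dat"));
        output.writeInt(12345);
        output.writeUTF("5678");
        output.close();
    }
}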

Instead of trying to compute the bytes output by each write operation, you could simply check the length of the file after it's closed using new File("t.dat").length().
If you wanted to figure it out without checking the length directly: an int takes up 4 bytes, and something written with writeUTF takes up 2 bytes to represent the encoded length of the string, plus the space the string itself takes, which in this case is another 4 bytes. In the modified UTF-8 encoding that writeUTF uses, each of the characters in "5678" requires 1 byte.
So that's 4 + 2 + 4, or 10 bytes.
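The arithmetic above can be checked without touching the file system. The following is a minimal sketch that writes the same two values into an in-memory buffer; the class name ByteCountCheck and the helper encode() are made up for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class ByteCountCheck {
    // Writes the same two values as the question's program, but into memory.
    static byte[] encode() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(12345);   // 4 bytes, big-endian
            out.writeUTF("5678");  // 2-byte length prefix + 4 bytes of character data
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(encode().length); // prints 10
    }
}
```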

Related

How to write Huffman code to a binary file?

I have a sample .txt file that I want to compress using Huffman encoding. My problem is that if one character has a size of one byte and the smallest size you can write is a byte, how do I reduce the size of the sample file?
I converted the sample file into Huffman codes and wrote them to a new empty .txt file, which just consists of 0s and 1s as one huge line of characters. Then I took the new file and used the BitSet class in Java to write to a binary file bit by bit. If the character was 0 or 1 in the new file, I wrote 0 or 1 respectively to the binary file. This process was very slow and it crashed my computer multiple times. I was hoping that someone had a more efficient solution. I have written all my code in Java.
Do not write "0" and "1" characters to the file. Write 0 and 1 bits to the file.
You do this by accumulating eight bits into a byte buffer using the shift (<<) and or (|) operators, and then writing that byte to the file. Repeat. At the end you may have less than eight bits in the byte buffer. If so, write that byte to the file, which will have the remaining bits filled with zeros.
E.g. int buf = 0, count = 0;, for each bit: buf |= bit << count++;, check for eight: if (count == 8) { out.writeByte(buf); buf = count = 0; }. At the end, if (count > 0) out.writeByte(buf);.
When decoding the Huffman codes, you may run into a problem with those filler zero bits in the last byte. They could be decoded as an extraneous symbol or symbols. In order to deal with this you will need for the decoder to know when to stop, by either sending the number of symbols before the Huffman codes, or by adding a symbol for end-of-stream.
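As a rough sketch of the accumulate-and-flush approach described above: the class below packs bits into bytes in the same order as the one-liner (first bit in the lowest position) and pads the final byte with zeros. The names BitPacker, writeBit, finish and pack are invented for this example, and it writes to an in-memory buffer rather than a file to keep it short.

```java
import java.io.ByteArrayOutputStream;

class BitPacker {
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();
    private int buf = 0;    // bits accumulated so far
    private int count = 0;  // how many bits of buf are in use

    // Adds one bit (0 or 1); emits a byte once eight bits have been collected.
    void writeBit(int bit) {
        buf |= (bit & 1) << count++;
        if (count == 8) {
            out.write(buf);
            buf = 0;
            count = 0;
        }
    }

    // Flushes a partially filled last byte; the unused high bits stay zero.
    byte[] finish() {
        if (count > 0) {
            out.write(buf);
            buf = 0;
            count = 0;
        }
        return out.toByteArray();
    }

    // Convenience helper: packs a string of '0' and '1' characters.
    static byte[] pack(String bits) {
        BitPacker packer = new BitPacker();
        for (int i = 0; i < bits.length(); i++) {
            packer.writeBit(bits.charAt(i) - '0');
        }
        return packer.finish();
    }
}
```

Nine input bits produce two output bytes here, so the decoder still needs a symbol count or an end-of-stream symbol to ignore the seven filler bits.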
One way is to use BitSet to set the bits that represent the code as you compute it. Then you can do either BitSet.toByteArray() or BitSet.toLongArray() and write out the information. Both of these store the bits in little endian encoding.

Java: How to output a very large number (a billion or so digits) to a file?

So for no particular reason I wanted to know what the largest number you can store in a gigabyte of memory is. I used an arbitrary-precision library to calculate it, but the trouble is outputting this number to a file, since a String can hold at most Integer.MAX_VALUE characters.
Apint a = new Apint(2);
a = ApintMath.pow(a, 8589934591l);
a = a.subtract(new Apint(1));
File file = new File("theNumber.txt");
PrintWriter pls = new PrintWriter(file);
a.writeTo(pls, true);
pls.close();
You should convert an int to 4 bytes in little-endian or big-endian order, and then save those 4 bytes to the file.
The same method works for much bigger numbers, e.g. 8 bytes, 16 bytes, and so on.
Update:
Try using the BigInteger class and its toByteArray() method when writing the bytes to a file.
(Untested method; may not work)
Use the mod operator % with a power of 10 to select the rightmost digits. Write those digits to a file on a line. Then divide by the same power of 10. Now your number is N digits shorter. Repeat, writing each group of digits to the file on a separate line.
Now copy the lines in reverse order into another file, either using Java or tac if you are on Linux.
You could join the lines together, though I would discourage that: many programs will hang if you try to load one very long line of text into them, but can handle many lines of text.
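The chunking idea can be sketched as follows. This example substitutes java.math.BigInteger for Apint and collects the groups in a list instead of a file, so the reversal is a single call; the class and method names are invented. It assumes a non-negative number. One detail the description leaves implicit is that every group except the most significant one has to be left-padded with zeros, otherwise digits are lost when the lines are joined.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class DigitChunks {
    // Splits a non-negative number into decimal groups, most significant first.
    static List<String> chunks(BigInteger n, int digitsPerLine) {
        BigInteger base = BigInteger.TEN.pow(digitsPerLine);
        List<String> lines = new ArrayList<>();
        do {
            BigInteger[] qr = n.divideAndRemainder(base);
            n = qr[0];                        // the number, now shorter
            String group = qr[1].toString();  // the rightmost digits
            if (n.signum() != 0) {
                // Not the leading group: pad with zeros to the full width.
                StringBuilder padded = new StringBuilder();
                for (int i = group.length(); i < digitsPerLine; i++) {
                    padded.append('0');
                }
                group = padded.append(group).toString();
            }
            lines.add(group);
        } while (n.signum() != 0);
        Collections.reverse(lines);           // least significant group was added first
        return lines;
    }

    public static void main(String[] args) {
        // prints [10, 0020, 0003]
        System.out.println(chunks(new BigInteger("1000200003"), 4));
    }
}
```

For a number with billions of digits, repeated division is likely to be slow; the sketch only illustrates the bookkeeping.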

clarification on using OutputStream in Java?

I was reading through this article. It has this following snippet
OutputStream output = new FileOutputStream("c:\\data\\output-text.txt");
while(moreData) {
int data = getMoreData();
output.write(data);
}
output.close();
It is mentioned:
OutputStreams are used for writing byte based data, one byte at a time. The write() method of an OutputStream takes an int which contains the byte value of the byte to write.
Let's say I am writing the string Hello World to the file, so each character in the string gets converted to an int by the getMoreData() method. How does it get written: as a character or as a byte in output-text.txt? If it gets written as a byte, what is the advantage of writing bytes if I have to "reconvert" each byte to a character?
Each character (and almost anything stored on a file) is a byte / bytes. For example:
Lowercase 'a' is written as one byte with decimal value 97.
Number '1' is written as one byte with decimal value 49
There's no more concept of data types once the information is written into a file; everything is just a stream of bytes. What's important is the encoding used to store the information in the file.
Have a look at ascii table, which is very useful for beginners learning information encoding.
To illustrate this, create a file containing the text 'hello world'
$ echo 'hello world' > hello.txt
Then output the bytes written to the file using od command:
$ od -td1 hello.txt
0000000 104 101 108 108 111 32 119 111 114 108 100 10
0000014
The above means: at address 0000000 from the start of the file, I see one byte with decimal value 104 (which is the character 'h'), then one byte with decimal value 101 (which is the character 'e'), and so on.
The article is incomplete, because an OutputStream has overloaded methods for write that take a byte[], a byte[] along with offset and length arguments, or a single int.
In the case of writing a String to a stream when the only interface you have is OutputStream (say you don't know what the underlying implementation is), it would be much better to use output.write(string.getBytes()). Iteratively peeling off a single int at a time and writing it to the file is going to perform horribly compared to a single call to write that passes an array of bytes.
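A minimal sketch of the single-call approach, writing "Hello World" to an arbitrary OutputStream. The class and method names are invented for illustration, and the charset is passed explicitly (UTF-8 here, as an assumption) so the result does not depend on the platform default.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class WriteStringBytes {
    // Encodes the text once and hands the whole array to a single write call.
    static void writeText(OutputStream out, String text) {
        try {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Uses an in-memory stream as a stand-in for a FileOutputStream.
    static byte[] demo() {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        writeText(sink, "Hello World");
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        byte[] written = demo();
        System.out.println(written.length); // prints 11
        System.out.println(written[0]);     // prints 72, the byte value of 'H'
    }
}
```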
Streams operate on bytes and simply read/write raw data.
Readers and writers interpret the underlying data as text using character sets such as UTF-8 or US-ASCII. This means they may take single-byte characters (for example ASCII) and convert the data into UTF-16 strings.
Streams use bytes, readers/writers use strings (or other complex types).
The java.io.OutputStream class is the superclass of all classes representing an output stream of bytes. When bytes are written to an OutputStream, it may not write them immediately; instead, the write method may put the bytes into a buffer.
There are methods to write as mentioned below:
void write(byte[] b)
This method writes b.length bytes from the specified byte array to this output stream.
void write(byte[] b, int position, int length)
This method writes length bytes from the specified byte array starting at offset position to this output stream.
void write(int b)
This method writes the specified byte to this output stream.

Java : Char vs String byte size

I was surprised to find that the following code
System.out.println("Character size:"+Character.SIZE/8);
System.out.println("String size:"+"a".getBytes().length);
outputs this:
Character size:2
String size:1
I would assume that a single character string should take up the same (or more ) bytes than a single char.
In particular I am wondering.
If I have a java bean with several fields in it, how its size will increase depending on the nature of the fields (Character, String, Boolean, Vector, etc...) I'm assuming that all java objects have some (probably minimal) footprint, and that one of the smallest of these footprints would be a single character. To test that basic assumption I started with the above code - and the results of the print statements seem counterintuitive.
Any insights into the way java stores/serializes characters vs strings by default would be very helpful.
getBytes() encodes the String using the default encoding (most likely ISO-8859-1), while the internal char type always takes 2 bytes. Internally, Java always uses char arrays with 2-byte chars. If you want to know more about encoding, read the link posted by Oded in the question comments.
Here is what I think (correct me if I am wrong): you are finding the length of the string, which is correctly shown as 1 because the string contains only one character. length gives the length, not the size; the two are different things.
Check this link: you are determining the number of bytes occupied in the wrong way.
Well, what you have shown is that one char in a char array has a size of 2 bytes and that your String is 1 character long, not that it has a size of 1 byte.
The String object in Java consists of:
private final char value[];
private final int offset;
private final int count;
private int hash;
This alone should assure you that the String object is bigger than the char array.
If you want to learn more about object sizes, you can also read about object headers and the multiplicity factor for char arrays. For example here or here.
I want to add some code first and then a bit of explanation:
import java.nio.charset.Charset;
public class Main {
public static void main(String[] args) {
System.out.println("Character size: " + Character.SIZE / 8);
final byte[] bytes = "a".getBytes(Charset.forName("UTF-16"));
System.out.println("String size: " + bytes.length);
sprintByteAsHex(bytes[0]);
sprintByteAsHex(bytes[1]);
sprintByteAsHex(bytes[2]);
sprintByteAsHex(bytes[3]);
}
static void sprintByteAsHex(byte b) {
System.out.print((Integer.toHexString((b & 0xFF))));
}
}
And the output will be:
Character size: 2
String size: 4
feff061
So what you are actually missing is, you are not providing any parameter to the getBytes method. Probably, you are getting the bytes for UTF-8 representation of the character 'a'.
Well, but why did we get 4 bytes, when we asked for UTF-16? Ok, Java uses UTF-16 internally, then we should have gotten 2 bytes right?
If you examine the output:
feff061
Java actually returned us a BOM: https://en.wikipedia.org/wiki/Byte_order_mark.
So the first 2 bytes: feff is required for signalling that following bytes will be UTF-16 Big Endian. Please see the Wikipedia page for further information.
The remaining 2 bytes: 0061 is the 2 byte representation of the character "a" you have. Can be verified from: http://www.fileformat.info/info/unicode/char/0061/index.htm
So yes, a character in Java is 2 bytes, but when you ask for bytes without a specific encoding, you may not always get 2 bytes since different encodings will require different amount of bytes for various characters.
The SIZE of a Character is the storage needed for a char, which is 16 bits. The length of a string (also the length of the underlying char array or byte array) is the number of characters (or bytes), not a size in bits.
That's why you had to do the division by 8 for the size, but not for the length. To get the size of the char data in bytes, the length needs to be multiplied by two.
Also note that you will get other lengths for the byte-array if you specify a different encoding. In this case a transformation to a single- or varying-size encoding was performed when doing getBytes().
See: http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#getBytes(java.nio.charset.Charset)
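To make the dependence on the encoding concrete, here is a small sketch that encodes the same one-character string with several standard charsets. The class name EncodedLength and its helper are invented for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodedLength {
    // Number of bytes the string occupies once encoded with the given charset.
    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        System.out.println(encodedLength("a", StandardCharsets.ISO_8859_1)); // 1
        System.out.println(encodedLength("a", StandardCharsets.UTF_8));      // 1
        System.out.println(encodedLength("a", StandardCharsets.UTF_16BE));   // 2, no byte order mark
        System.out.println(encodedLength("a", StandardCharsets.UTF_16));     // 4, byte order mark + 2 bytes
        System.out.println(encodedLength("\u00e9", StandardCharsets.UTF_8)); // 2, non-ASCII character
    }
}
```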

How can I read a file as unsigned bytes in Java?

How can I read a file to bytes in Java?
It is important to note that all the bytes need to be positive, i.e. the negative range cannot be used.
Can this be done in Java, and if yes, how?
I need to be able to multiply the contents of a file by a constant. I was assuming that I can read the bytes into a BigInteger and then multiply, however since some of the bytes are negative I am ending up with 12 13 15 -12 etc and get stuck.
Well, Java doesn't have the concept of unsigned bytes: the byte type is always signed, with values from -128 to 127 inclusive. However, this will interoperate just fine with other systems that work with unsigned values. For example, C# code writing a byte of "255" will produce a file where the same value is read as "-1" in Java. Just be careful, and you'll be okay.
EDIT: You can convert the signed byte to an int with the unsigned value very easily using a bitmask. For example:
byte b = -1; // Imagine this was read from the file
int i = b & 0xff;
System.out.println(i); // 255
Do all your arithmetic using int, and then cast back to byte when you need to write it out again.
You generally read binary data from files using FileInputStream or possibly FileChannel.
It's hard to know what else you're looking for at the moment... if you can give more details in your question, we may be able to help you more.
With the unsigned API in Java 8 you have Byte.toUnsignedInt. That'll be a lot cleaner than manually casting and masking out.
To convert the int back to a byte after working with it, you just need a cast: (byte) value.
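The two conversions mentioned above give the same result; a short sketch for comparison (the class and method names are invented for this example):

```java
class UnsignedByteDemo {
    // Classic approach: mask off the sign extension.
    static int viaMask(byte b) {
        return b & 0xff;
    }

    // Java 8+ approach: the same conversion through the library method.
    static int viaApi(byte b) {
        return Byte.toUnsignedInt(b);
    }

    // Back to a byte for writing; only the low 8 bits are kept.
    static byte back(int value) {
        return (byte) value;
    }

    public static void main(String[] args) {
        byte b = -12;                     // imagine this was read from the file
        System.out.println(viaMask(b));   // prints 244
        System.out.println(viaApi(b));    // prints 244
        System.out.println(back(244));    // prints -12
    }
}
```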
You wrote in a comment (please put such information in the question; there is an edit link for this):
I need to be able to multiply the contents of a file by a constant.
I was assuming that I can read the bytes into a BigInteger and then
multiply, however since some of the bytes are negative I am ending
up with 12 13 15 -12 etc and gets stuck.
If you want to use the whole file as a BigInteger, read it in a byte[], and give this array (as a whole) to the BigInteger-constructor.
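import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.math.BigInteger;

/**
 * Reads a file and converts the content to a BigInteger.
 * @param file the file to read. The content is interpreted as a
 *     big-endian base-256 number.
 * @param signed if true, interpret the file's content as the two's complement
 *     representation of a signed number.
 *     If false, interpret the file's content as an unsigned
 *     (nonnegative) number.
 */
public static BigInteger fileToBigInteger(File file, boolean signed)
        throws IOException
{
    // file.length() returns a long; this only works for files below 2 GiB.
    byte[] array = new byte[(int) file.length()];
    InputStream in = new FileInputStream(file);
    int i = 0; int r;
    // read() may deliver fewer bytes than requested, so loop until done.
    while ((r = in.read(array, i, array.length - i)) > 0) {
        i = i + r;
    }
    in.close();
    if (signed) {
        return new BigInteger(array);
    }
    else {
        // signum 1: treat the bytes as a nonnegative magnitude.
        return new BigInteger(1, array);
    }
}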
Then you can multiply your BigInteger and save the result in a new file (using the toByteArray() method).
Of course, this depends heavily on the format of your file: my method assumes the file contains the result of the toByteArray() method, not some other format. If you have some other format, please add information about this to your question.
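The multiplication step itself could look like the sketch below, which works on a byte array in memory and assumes the unsigned interpretation and a non-negative factor. The class and method names are invented. Note that toByteArray() returns a two's complement representation, so a leading zero byte appears whenever the top bit of the magnitude is set.

```java
import java.math.BigInteger;

class MultiplyBytes {
    // Treats content as an unsigned big-endian number and multiplies it.
    static byte[] multiply(byte[] content, long factor) {
        BigInteger value = new BigInteger(1, content);
        return value.multiply(BigInteger.valueOf(factor)).toByteArray();
    }

    public static void main(String[] args) {
        byte[] result = multiply(new byte[] {(byte) 0xFF}, 2); // 255 * 2 = 510 = 0x01FE
        System.out.println(result.length);     // prints 2
        System.out.println(result[0]);         // prints 1
        System.out.println(result[1] & 0xff);  // prints 254
    }
}
```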
"I need to be able to multiply the contents of a file by a constant." seems quite a dubious goal - what do you really want to do?
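If using a larger integer type internally is not a problem, just go with the easy solution, and add 128 to all integers before multiplying them. Instead of -128 to 127, you get 0 to 255. Addition is not difficult ;) Keep in mind that this shifts every value by 128 (so -1 becomes 127, not 255); it does not give the same numbers as masking with 0xff.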
Also, remember that the arithmetic and bitwise operators in Java promote byte operands to int and return an int, so:
byte a = 0;
byte b = 1;
byte c = a | b;
would give a compile-time error, since a | b returns an int. You would have to do
byte c = (byte) (a | b);
So I would suggest just adding 128 to all your numbers before you multiply them.
Some testing revealed that this returns the unsigned byte values in [0…255] range one by one from the file:
Reader bytestream = new BufferedReader(new InputStreamReader(
new FileInputStream(inputFileName), "ISO-8859-1"));
int unsignedByte;
while((unsignedByte = bytestream.read()) != -1){
// do work
}
It seems to work for all bytes in the range, including those for which no characters are defined in ISO 8859-1.
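For comparison, a plain InputStream gives the same values without going through a character decoder: read() returns each byte as an int between 0 and 255, or -1 at the end of the stream. The sketch below uses an in-memory stream in place of a FileInputStream; the class and method names are invented for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class ReadUnsigned {
    // Collects every byte of the stream as an unsigned value in [0, 255].
    static List<Integer> readAll(InputStream in) {
        List<Integer> values = new ArrayList<>();
        try {
            int unsignedByte;
            while ((unsignedByte = in.read()) != -1) {
                values.add(unsignedByte);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return values;
    }

    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream(new byte[] {12, 13, 15, -12});
        System.out.println(readAll(in)); // prints [12, 13, 15, 244]
    }
}
```

For real files, wrapping the FileInputStream in a BufferedInputStream avoids one system call per byte.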
