FileInputStream and Huffman Tree

I am creating a Huffman tree to compress a text file, but I am having some issues. The method below is supposed to take a FileInputStream that supplies the text data and return a Map of the characters and their counts. To do that, however, I need to choose the size of the byte[] that stores the data. The byte[] needs to be exactly the right length, or else the Map will also contain some unneeded data. Is there a way to make the byte[] just the right size?
Here is my code:
// provides a count of characters in an input file and place in map
public static Map<Character, Integer> getCounts(FileInputStream input)
throws IOException {
Map<Character, Integer> output = new TreeMap<Character, Integer>(); // treemap keeps keys in sorted order (chars alphabetized)
byte[] fileContent = new byte[100]; // creates a byte[]
//ArrayList<Byte> test = new ArrayList<Byte>();
input.read(fileContent); // reads the input into fileContent
String test = new String(fileContent); // contains entire file into this string to process
// goes through each character of the String, storing chars as keys and occurrence counts as values
for (int i = 0; i < test.length(); i++) {
char temp = test.charAt(i);
if (output.containsKey(temp)) { // seen this character before; increase count
int count = output.get(temp);
System.out.println("repeat; char is: " + temp + "count is: " + count);
output.put(temp, count + 1);
} else { // Haven't seen this character before; create count of 1
System.out.println("new; char is: " + temp + "count is: 1");
output.put(temp, 1);
}
}
return output;
}

The return value of FileInputStream.read(byte[]) is the number of bytes actually read, or -1 at end of file. You can use this value instead of test.length() as the bound of the for loop.
Note that read() is not guaranteed to fill the whole buffer, even if the end of the file has not been reached, so it is usually called in a loop:
int bytesRead;
//Read until there are no more bytes to read.
while((bytesRead = input.read(buf))!=-1)
{
//You have next bytesRead bytes in a buffer here
}
Finally, if your text contains multi-byte characters (for example, non-ASCII text encoded as UTF-8), this approach will not work, since read() can stop in the middle of a character. Consider wrapping the FileInputStream in an InputStreamReader:
Reader fileReader = new InputStreamReader(input, "UTF-8");
int charsRead;
char buf[] = new char[256];
while ((charsRead = fileReader.read(buf)) > 0) {
//You have charsRead characters in a buffer here
}
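Putting the two loops above together, a minimal sketch of the counting method might look like the following. It assumes the file is UTF-8 text. The class name CharCounts is illustrative, and the parameter is widened to InputStream so the sample in main can run from memory.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

public class CharCounts {
    // Counts how often each char occurs in the stream, decoding it as UTF-8 (an assumption).
    public static Map<Character, Integer> getCounts(InputStream input) throws IOException {
        Map<Character, Integer> counts = new TreeMap<>();
        Reader reader = new InputStreamReader(input, StandardCharsets.UTF_8);
        char[] buf = new char[256];
        int charsRead;
        while ((charsRead = reader.read(buf)) != -1) {
            // Only the first charsRead entries of buf hold data from this call.
            for (int i = 0; i < charsRead; i++) {
                counts.merge(buf[i], 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = "abracadabra".getBytes(StandardCharsets.UTF_8);
        System.out.println(getCounts(new ByteArrayInputStream(sample)));
        // prints {a=5, b=2, c=1, d=1, r=2}
    }
}
```

Because the loop only looks at the first charsRead entries of the buffer, the buffer size no longer has to match the file size.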

Related

Error in reading GZip in java but not in python

I write some data to a GZIP binary file using the Java code below:
public static void WriteDictAndIndex(HashMap<String, Term> terms, int index){
try{
GZIPOutputStream postingListOutput = new GZIPOutputStream(new FileOutputStream(String.format("./generated/posting_list_%d", index)));
GZIPOutputStream dictionaryOutput = new GZIPOutputStream(new FileOutputStream(String.format("./generated/dictionary_%d", index)));
Integer START=0, SIZE=0, VOCAB=0;
for(String s : terms.keySet()){
ArrayList<Pair<Integer, Byte>> postingList = terms.get(s).postingList;
SIZE = postingList.size()*5;
// Write one posting list to the file system
ByteBuffer list_buffer = ByteBuffer.allocate(SIZE);
int totalCount = 0;
for(Pair<Integer, Byte> p : postingList) {
// Write the docID (4 bytes)
list_buffer.putInt(p.getValue0());
// Write the term frequency (1 byte)
byte termFrequency = p.getValue1();
list_buffer.put(termFrequency);
// Counter for the total occurrences of words
totalCount += (int)termFrequency;
}
if(index == 0 && totalCount == 1)
continue;
postingListOutput.write(list_buffer.array());
// Write one dictionary entry to the file system
byte[] token = s.getBytes();
ByteBuffer dict_buffer = ByteBuffer.allocate(16+token.length);
dict_buffer.putInt(token.length);
dict_buffer.put(token);
dict_buffer.putInt(terms.get(s).documentFrequency);
dict_buffer.putInt(START);
dict_buffer.putInt(SIZE);
dictionaryOutput.write(dict_buffer.array());
START += SIZE;
VOCAB += 1;
}
//INFO
System.out.println(String.format("Vocabulary Size: %d", VOCAB));
postingListOutput.close();
dictionaryOutput.close();
}catch(IOException e){
System.err.println(e);
}
}
When I read the first 695 bytes of this file using Python, they come out as expected. But when I read the file using Java's GZIPInputStream, there are discrepancies (the last 10 of the first 695 bytes are different).
I am trying to read it using the following code:
try{
GZIPInputStream postingList = new GZIPInputStream(new FileInputStream(new File(args[1])));
GZIPInputStream dictionary = new GZIPInputStream(new FileInputStream(new File(args[2])));
byte[] buf = new byte[4];
while(true){
// Get the size of the token from the dictionary
dictionary.read(buf);
int tokenSize = ByteBuffer.wrap(buf).getInt();
// Read the token
byte[] tokenBuffer = new byte[tokenSize];
dictionary.read(tokenBuffer);
String token = new String(tokenBuffer, StandardCharsets.UTF_8);
// Read the document frequency
dictionary.read(buf);
int documentFrequency = ByteBuffer.wrap(buf).getInt();
// Read the starting index of the posting list
dictionary.read(buf);
int START = ByteBuffer.wrap(buf).getInt();
// Read the size of the posting list
dictionary.read(buf);
int SIZE = ByteBuffer.wrap(buf).getInt();
// Read the posting list
for(int i=0; i<documentFrequency; i++){
byte[] ID = new byte[4];
postingList.read(ID);
int docID = ByteBuffer.wrap(ID).getInt();
byte[] frequency = new byte[1];
postingList.read(frequency);
System.out.println(String.format("%d: %d: %d",i, docID, frequency[0]));
}
break;
}
postingList.close();
dictionary.close();
}
catch(IOException e){
System.err.println(e);
}
The print statement above prints one line per posting, after reading an integer (4 bytes) and a single byte for each.
The last 2 printed lines should have the following form (which is what Python reads):
137: 81257: 1
138: 81737: 1
But I am getting this (using the Java code above):
137: 65536: 61
138: 1761673217: 63
Any pointers on what could be the mistake?
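One thing worth checking, assuming the file itself is fine (Python reads it correctly): InputStream.read(byte[]) may return fewer bytes than the array holds, and the reading code above ignores its return value. A GZIPInputStream can deliver such short reads in the middle of the data. Below is a minimal sketch that wraps the stream in a DataInputStream, whose readInt and readByte either consume exactly the bytes they need or throw EOFException. The class name PostingReader and the in-memory sample data are illustrative only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class PostingReader {
    // Reads `count` postings of 5 bytes each: a 4-byte big-endian docID and a 1-byte frequency.
    public static int[][] readPostings(DataInputStream in, int count) throws IOException {
        int[][] postings = new int[count][2];
        for (int i = 0; i < count; i++) {
            postings[i][0] = in.readInt();  // consumes exactly 4 bytes or throws EOFException
            postings[i][1] = in.readByte(); // term frequency
        }
        return postings;
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny compressed posting list in memory (sample data, not the real file).
        ByteBuffer raw = ByteBuffer.allocate(10);
        raw.putInt(81257).put((byte) 1).putInt(81737).put((byte) 1);
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(compressed)) {
            gz.write(raw.array());
        }
        try (DataInputStream in = new DataInputStream(
                new GZIPInputStream(new ByteArrayInputStream(compressed.toByteArray())))) {
            for (int[] p : readPostings(in, 2)) {
                System.out.println(p[0] + ": " + p[1]);
            }
        }
    }
}
```

For the variable-length token bytes in the dictionary file, DataInputStream.readFully(byte[]) plays the same role: it keeps reading until the array is full.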

Stream of short[]

Hi, I need to calculate the order-m entropy of a file, where m is the number of bits per symbol (m <= 16).
So:
H_m(X) = -\sum_{i=0}^{2^m-1} p_{i,m} \log_2 p_{i,m}
So I thought I would create an input stream to read the file and then calculate the probability of each m-bit sequence.
For m = 8 it's easy, because each symbol is one byte.
Since m <= 16, I thought of using short as the primitive type: save each short of the file in a short[] array and then manipulate the bits with bitwise operators to obtain all the m-bit sequences in the file.
Is this a good idea?
Anyway, I'm not able to create a stream of shorts. This is what I've done:
public static void main(String[] args) {
readFile(FILE_NAME_INPUT);
}
public static void readFile(String filename) {
short[] buffer = null;
File a_file = new File(filename);
try {
File file = new File(filename);
FileInputStream fis = new FileInputStream(filename);
DataInputStream dis = new DataInputStream(fis);
int length = (int)file.length() / 2;
buffer = new short[length];
int count = 0;
while(dis.available() > 0 && count < length) {
buffer[count] = dis.readShort();
count++;
}
System.out.println("length=" + length);
System.out.println("count=" + count);
for(int i = 0; i < buffer.length; i++) {
System.out.println("buffer[" + i + "]: " + buffer[i]);
}
fis.close();
}
catch(EOFException eof) {
System.out.println("EOFException: " + eof);
}
catch(FileNotFoundException fe) {
System.out.println("FileNotFoundException: " + fe);
}
catch(IOException ioe) {
System.out.println("IOException: " + ioe);
}
}
But I lose a byte, and I don't think this is the best way to proceed.
This is what I plan to do using bitwise operators:
int[] list = new int[l];
foreach n in buffer {
for(int i = 16 - m; i > 0; i-m) {
list.add( (n >> i) & 2^m-1 );
}
}
In this case I'm assuming shorts.
If I use bytes, how can I write a loop like that for m > 8?
That loop doesn't work there, because I have to concatenate multiple bytes, and the number of bits to join varies each time.
Any ideas?
Thanks

I think you just need to have a byte array:
public static void readFile(String filename) {
ByteArrayOutputStream outputStream=new ByteArrayOutputStream();
try {
FileInputStream fis = new FileInputStream(filename);
int b; // read() returns an int so that -1 can signal end of file
while((b=fis.read())!=-1) {
outputStream.write(b);
}
byte[] byteData=outputStream.toByteArray();
fis.close();
}
catch(IOException ioe) {
System.out.println("IOException: " + ioe);
}
}
Then you can manipulate byteData as per your bitwise operations.
--
If you want to work with shorts, you can combine the bytes in pairs like this:
short[] buffer = new short[(byteData.length + 1) / 2]; // one extra slot when the length is odd
int j = 0;
for (int i = 0; i < byteData.length - 1; i += 2) {
// mask the low byte with 0xFF; otherwise a negative byte sign-extends and overwrites the high byte
buffer[j] = (short) ((byteData[i] << 8) | (byteData[i + 1] & 0xFF));
j++;
}
To handle a trailing odd byte, do this:
if ((byteData.length % 2) == 1) buffer[j] = (short) (byteData[byteData.length - 1] & 0xFF);
With the array sized as above, the last position in buffer is free exactly when the byte count is odd: in that case j equals buffer.length-1 after the loop, so the trailing byte goes there with a zero high byte.
Then manipulate buffer.
The second approach, working directly with bytes, is more involved. It's a question of its own, so try the above first.
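For reference, here is a minimal sketch of that byte-based approach for any m from 1 to 16. It feeds the bytes through a bit accumulator and counts consecutive m-bit symbols. The class name BitEntropy is illustrative. Reading symbols back to back and dropping trailing bits that do not fill a whole symbol are assumptions of this sketch, not requirements taken from the question.

```java
public class BitEntropy {
    // Order-m entropy of data, treated as consecutive m-bit symbols (1 <= m <= 16).
    // Trailing bits that do not fill a whole symbol are ignored (assumption).
    public static double entropy(byte[] data, int m) {
        int[] counts = new int[1 << m];
        long total = 0;
        int acc = 0;   // bit accumulator; never holds more than m - 1 + 8 bits
        int bits = 0;  // number of valid bits currently in acc
        int mask = (1 << m) - 1;
        for (byte b : data) {
            acc = (acc << 8) | (b & 0xFF);
            bits += 8;
            while (bits >= m) {
                // take the m oldest unread bits as the next symbol
                counts[(acc >>> (bits - m)) & mask]++;
                bits -= m;
                total++;
            }
            acc &= (1 << bits) - 1; // keep only the bits not yet consumed
        }
        double h = 0.0;
        for (int c : counts) {
            if (c > 0) {
                double p = (double) c / total;
                h -= p * (Math.log(p) / Math.log(2));
            }
        }
        return h;
    }

    public static void main(String[] args) {
        byte[] data = {0x00, (byte) 0xFF, 0x00, (byte) 0xFF};
        // For m = 8 there are two equally likely symbols, so about 1 bit per symbol.
        System.out.println(entropy(data, 8));
    }
}
```

The same loop handles m > 8 without special cases, because the accumulator simply waits until enough bytes have arrived.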

Reading binary file byte by byte

I've been researching a Java problem with no success. I've read a whole bunch of similar questions here on StackOverflow, but the solutions just don't seem to work as expected.
I'm trying to read a binary file byte by byte.
I've used:
while ((data = inputStream.read()) != -1)
loops...
for (int i = 0; i < bFile.length; i++) {
loops...
But I only get empty or blank output. The actual content of the file I'm trying to read is as follows:
¬í sr assignment6.PetI¿Z8kyQŸ I ageD weightL namet Ljava/lang/String;xp > #4 t andysq ~ #bÀ t simbasq ~ #I t wolletjiesq ~
#$ t rakker
I'm merely trying to read it byte for byte and feed it to a character array with the following line:
char[] charArray = Character.toChars(byteValue);
Here byteValue is an int holding the byte that was just read.
What is going wrong where?

Since Java 7 there is no need to read byte by byte; there are two utility functions in Files:
Path path = Paths.get("C:/temp/test.txt");
// Load as binary:
byte[] bytes = Files.readAllBytes(path);
String asText = new String(bytes, StandardCharsets.ISO_8859_1);
// Load as text, with some Charset:
List<String> lines = Files.readAllLines(path, StandardCharsets.ISO_8859_1);
As you want to read binary data, one would use readAllBytes.
String and char are for text. As opposed to many other programming languages, in Java this means Unicode, so all the scripts of the world may be combined. A char is 16 bits, as opposed to the 8-bit byte.
For pure ASCII, the 7-bit subset of Unicode / UTF-8, byte and char values are identical.
Then you might have done the following (low-quality code):
int fileLength = (int) Files.size(path);
char[] chars = new char[fileLength];
int i = 0;
int data;
while ((data = inputStream.read()) != -1) {
chars[i] = (char) data; // data actually being a byte
++i;
}
inputStream.close();
String text = new String(chars);
System.out.println(Arrays.toString(chars));
The problem you had probably concerned the unwieldy fixed-size arrays in Java, and the fact that a char[] is still not a String.
For binary usage, as you seem to be reading serialized data, you might like to dump the file:
int i = 0;
int data;
while ((data = inputStream.read()) != -1) {
char ch = 32 <= data && data < 127 ? (char) data : ' ';
System.out.printf("[%06d] %02x %c%n", i, data, ch);
++i;
}
Dumping file position, hex value and char value.

Here is a simple example:
public class CopyBytes {
public static void main(String[] args) throws IOException {
FileInputStream in = null;
FileOutputStream out = null;
try {
in = new FileInputStream("xanadu.txt");
out = new FileOutputStream("outagain.txt");
int c;
while ((c = in.read()) != -1) {
out.write(c);
}
} finally {
if (in != null) {
in.close();
}
if (out != null) {
out.close();
}
}
}
}
If you want to read text (characters), use Readers; if you want to read bytes, use Streams.

Why not use Apache Commons IO:
byte[] bytes = IOUtils.toByteArray(inputStream);
Then you can convert it to char:
String str = new String(bytes);
char[] chars = str.toCharArray();
Or, as you did, convert one byte value at a time:
char[] charArray = Character.toChars(bytes[0] & 0xFF);
To deserialize objects:
List<Object> results = new ArrayList<Object>();
FileInputStream fis = new FileInputStream("your_file.dat");
ObjectInputStream ois = new ObjectInputStream(fis);
try {
while (true) {
results.add(ois.readObject());
}
} catch (OptionalDataException e) {
if (!e.eof) throw e;
} finally {
ois.close();
}

Edit:
Use file.length() for the array size and make a byte array of that size. Then call inputStream.read(b).
Edit again: if you want characters, use new InputStreamReader(new FileInputStream(file), charset); it even lets you specify the charset.

Android: how to have random access from an InputStream?

I have an InputStream, and the relative file name and size.
I need to access/read some random (increasing) positions in the InputStream. These positions are stored in an integer array (named offsets).
InputStream inputStream = ...
String fileName = ...
int fileSize = (int) ...
int[] offsets = new int[]{...}; // the random (increasing) offsets array
Now, given an InputStream, I've found only two possible solutions to jump to random (increasing) positions of the file.
The first one is to use the skip() method of the InputStream (note that I actually use BufferedInputStream, since I will need to mark() and reset() the file pointer).
//Open a BufferInputStream:
BufferedInputStream bufferedInputStream = new BufferedInputStream(inputStream);
byte[] bytes = new byte[1];
int curFilePointer = 0;
long numBytesSkipped = 0;
long numBytesToSkip = 0;
int numBytesRead = 0;
//Check the file size:
if ( fileSize < offsets[offsets.length-1] ) { // the last (largest) offset is bigger than the file size...
//Debug:
Log.d(TAG, "The file is too small!\n");
return;
}
for (int i=0, k=0; i < offsets.length; i++, k=0) { // for each offset I have to jump...
try {
//Jump to the offset [i]:
while( (curFilePointer < offsets[i]) && (k < 10) ) { // until the correct offset is reached (at most 10 tries)
numBytesToSkip = offsets[i] - curFilePointer;
numBytesSkipped = bufferedInputStream.skip(numBytesToSkip);
curFilePointer += numBytesSkipped; // move the file pointer forward
//Debug:
Log.d(TAG, "FP: " + curFilePointer + "\n");
k++;
}
if ( curFilePointer != offsets[i] ) { // it did NOT jump properly... (what's going on?!)
//Debug:
Log.d(TAG, "InputStream.skip() DID NOT JUMP PROPERLY!!!\n");
break;
}
//Read the content of the file at the offset [i]:
numBytesRead = bufferedInputStream.read(bytes, 0, bytes.length);
curFilePointer += numBytesRead; // move the file pointer forward
//Debug:
Log.d(TAG, "READ [" + curFilePointer + "]: " + bytes[0] + "\n");
}
catch ( IOException e ) {
e.printStackTrace();
break;
}
catch ( IndexOutOfBoundsException e ) {
e.printStackTrace();
break;
}
}
//Close the BufferInputStream:
bufferedInputStream.close();
The problem is that, during my tests, for some (usually big) offsets, it looped 5 or more times before skipping the correct number of bytes. Is that normal? And, above all, can/should I trust skip()? (That is: are 10 iterations enough to be SURE it will ALWAYS arrive at the correct offset?)
The only alternative I've found is to create a RandomAccessFile from the InputStream, through File.createTempFile(prefix, suffix, directory) and the following function.
public static RandomAccessFile toRandomAccessFile(InputStream inputStream, File tempFile, int fileSize) throws IOException {
RandomAccessFile randomAccessFile = new RandomAccessFile(tempFile, "rw");
byte[] buffer = new byte[fileSize];
int numBytesRead = 0;
while ( (numBytesRead = inputStream.read(buffer)) != -1 ) {
randomAccessFile.write(buffer, 0, numBytesRead);
}
randomAccessFile.seek(0);
return randomAccessFile;
}
Having a RandomAccessFile is actually a much better solution, but the performance is far worse (above all because I will have more than one file).
EDIT: Using byte[] buffer = new byte[fileSize] speeds up (and a lot) the RandomAccessFile creation!
//Create a temporary RandomAccessFile:
File tempFile = File.createTempFile(fileName, null, context.getCacheDir());
RandomAccessFile randomAccessFile = toRandomAccessFile(inputStream, tempFile, fileSize);
byte[] bytes = new byte[1];
int numBytesRead = 0;
//Check the file size:
if ( fileSize < offsets[offsets.length-1] ) { // the last (largest) offset is bigger than the file size...
//Debug:
Log.d(TAG, "The file is too small!\n");
return;
}
for (int i=0, k=0; i < offsets.length; i++, k=0) { // for each offset I have to jump...
try {
//Jump to the offset [i]:
randomAccessFile.seek(offsets[i]);
//Read the content of the file at the offset [i]:
numBytesRead = randomAccessFile.read(bytes, 0, bytes.length);
//Debug:
Log.d(TAG, "READ [" + (randomAccessFile.getFilePointer()-4) + "]: " + bytes[0] + "\n");
}
catch ( IOException e ) {
e.printStackTrace();
break;
}
catch ( IndexOutOfBoundsException e ) {
e.printStackTrace();
break;
}
}
//Delete the temporary RandomAccessFile:
randomAccessFile.close();
tempFile.delete();
Now, is there a better (or more elegant) solution to have a "random" access from an InputStream?

It's a bit unfortunate that you have an InputStream to begin with, but in this situation buffering the stream in a file is of no use if you are always skipping forward. You also don't have to count the number of times you have called skip; that's not really of interest.
What you do have to check is whether the stream has already ended, to prevent an infinite loop. Judging by the source of the default skip implementation, I'd say you'll have to keep calling skip until it returns 0, which will indicate that the end of the stream has been reached. The JavaDoc is a bit unclear about this for my taste; it also allows skip to return 0 for other reasons, so a following read() that returns -1 is the more reliable end-of-stream check.
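A minimal sketch of such a loop, under the assumption that a read() returning -1 is used as the end-of-stream test whenever skip makes no progress. The names StreamSkipper and skipFully are illustrative, not part of any library mentioned here.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class StreamSkipper {
    // Skips exactly n bytes, or throws EOFException if the stream ends first.
    public static void skipFully(InputStream in, long n) throws IOException {
        long remaining = n;
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped > 0) {
                remaining -= skipped;
            } else if (in.read() == -1) {
                // skip() made no progress and read() confirms the end of the stream
                throw new EOFException("stream ended with " + remaining + " bytes left to skip");
            } else {
                remaining--; // read() consumed one byte, which counts as progress
            }
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[] {10, 20, 30, 40, 50});
        skipFully(in, 3);
        System.out.println(in.read()); // prints 40
    }
}
```

With a helper like this there is no need for a retry counter: the loop ends either at the requested offset or with an exception.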

You can't. An InputStream is a stream, that is to say a sequential construct. Your question embodies a contradiction in terms.

How can I write a sequence of strings and then a byte array to a file?

I want to write first a sequence of strings and then a sequence of bytes into a file, using Java. I started by using FileOutputStream because of the array of bytes. After searching the API, I realised that FileOutputStream cannot write Strings, only ints and bytes, so I switched to DataOutputStream. When I run the program, I get an exception. Why?
Here's a portion of my code:
try {
// Create the file
FileOutputStream fos;
DataOutputStream dos; // = new DataOutputStream("compressedfile.ecs_h");
File file= new File("C:\\MyFile.txt");
fos = new FileOutputStream(file);
dos=new DataOutputStream(fos);
/* saves the characters as a dictionary into the file before the binary seq*/
for (int i = 0; i < al.size(); i++) {
String name= al.get(i).name; //gets the string from a global arraylist, don't pay attention to this!
dos.writeChars(name); //saving the name in the file
}
System.out.println("\nIS SUCCESFULLY WRITTEN INTO FILE! ");
dos.writeChars("><");
String strseq;
/*write all elements from the arraylist into a string variable*/
strseq= seq.toString();
System.out.println("sTringSeq: " + strseq);
/*transpose the sequence string into a byte array*/
byte[] data = new byte[strseq.length() / 8];
for (int i = 0; i < data.length; i++) {
data[i] = (byte) Integer.parseInt(strseq.substring(i * 8, (i + 1) * 8), 2);
dos.write(data[i]);
}
dos.flush();
//Close the output stream
dos.close();
} catch(Exception e){}

The problem with your code is that the last for loop was counting over the wrong number of bytes. The code below fixes your problem writing your test data to a file. This works on my machine.
public static void main(String[] args) {
ArrayList<String> al = new ArrayList<String>();
al.add("String1");
al.add("String2");
try {
// Create the file
FileOutputStream fos = new FileOutputStream("MyFile.txt");
DataOutputStream dos = new DataOutputStream(fos);
/* saves the characters as a dictionary into the file before the binary seq */
for (String str : al) {
dos.writeChars(str);
}
System.out.println("\nIS SUCCESFULLY WRITTEN INTO FILE! ");
dos.writeChars("><");
String strseq = "001100111100101000101010111010100100111000000000";
// Ensure that you have a string of the correct size
if (strseq.length() % 8 != 0) {
throw new IllegalStateException(
"Input string cannot be converted to bytes - wrong size: "
+ strseq.length());
}
int numBytes = strseq.length() / 8;
for (int i = 0; i < numBytes; i++) {
int start = i * 8;
int end = (i + 1) * 8;
byte output = (byte) Integer.parseInt(strseq.substring(start, end), 2);
dos.write(output);
}
dos.writeChars("> End of File");
dos.flush();
// Close the output stream
dos.close();
} catch (Exception e) {
e.printStackTrace();
}
}
The approach of writing bytes directly to a test file does have a few problems (I assume that it's a text file in that your test file name ends with .txt), the most obvious one being that some text editors don't handle/display null characters very well (your last test byte was: 00000000 or null). If you want to see the bytes as readable bytes then you could investigate encoding them using Base64 encoding.
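As a rough illustration of the Base64 suggestion, the sketch below packs the same test bit string into bytes and encodes them with java.util.Base64 (available since Java 8). The class name BitsToBase64 and the helper pack are illustrative names only.

```java
import java.util.Arrays;
import java.util.Base64;

public class BitsToBase64 {
    // Packs a string of '0'/'1' characters into bytes, 8 characters per byte.
    public static byte[] pack(String bits) {
        if (bits.length() % 8 != 0) {
            throw new IllegalArgumentException("bit string length must be a multiple of 8");
        }
        byte[] out = new byte[bits.length() / 8];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(bits.substring(i * 8, (i + 1) * 8), 2);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] data = pack("001100111100101000101010111010100100111000000000");
        // Base64 text contains only printable ASCII, so it is safe in a text file.
        String text = Base64.getEncoder().encodeToString(data);
        System.out.println(text);
        // Decoding gives back the original bytes, including the trailing zero byte.
        byte[] back = Base64.getDecoder().decode(text);
        System.out.println(Arrays.equals(data, back)); // prints true
    }
}
```

The encoded text is about a third larger than the raw bytes, which matters if the goal of the file is compression.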

The line:
data[i] = (byte) Integer.parseInt(strseq.substring(i * 8, (i + 1) * 8), 2);
looks very suspicious.
Can you provide more details about strseq and its value?

What about this code?
This code:
byte[] data = new byte[strseq.length() / 8];
for (int i = 0; i < data.length; i++) {
data[i] = (byte) Integer.parseInt(strseq.substring(i * 8, (i + 1) * 8), 2);
dos.write(data[i]);
}
becomes
byte[] data = strseq.getBytes();

With the FileWriter class you have a nice abstraction for writing text to a file.
Maybe this class can help you write your file.
Note that FileWriter has write methods for strings and char arrays only; it has no method that takes a byte array, so the binary part of the file still needs an OutputStream.
