Error reading GZIP in Java but not in Python

I write some data to a GZIP-compressed binary file using the Java code below:
public static void WriteDictAndIndex(HashMap<String, Term> terms, int index){
try{
GZIPOutputStream postingListOutput = new GZIPOutputStream(new FileOutputStream(String.format("./generated/posting_list_%d", index)));
GZIPOutputStream dictionaryOutput = new GZIPOutputStream(new FileOutputStream(String.format("./generated/dictionary_%d", index)));
Integer START=0, SIZE=0, VOCAB=0;
for(String s : terms.keySet()){
ArrayList<Pair<Integer, Byte>> postingList = terms.get(s).postingList;
SIZE = postingList.size()*5;
// Write one posting list to the file system
ByteBuffer list_buffer = ByteBuffer.allocate(SIZE);
int totalCount = 0;
for(Pair<Integer, Byte> p : postingList) {
// Write the docID (4 bytes)
list_buffer.putInt(p.getValue0());
// Write the term frequency (1 byte)
byte termFrequency = p.getValue1();
list_buffer.put(termFrequency);
// Counter for the total occurrences of words
totalCount += (int)termFrequency;
}
if(index == 0 && totalCount == 1)
continue;
postingListOutput.write(list_buffer.array());
// Write one dictionary entry to the file system
byte[] token = s.getBytes();
ByteBuffer dict_buffer = ByteBuffer.allocate(16+token.length);
dict_buffer.putInt(token.length);
dict_buffer.put(token);
dict_buffer.putInt(terms.get(s).documentFrequency);
dict_buffer.putInt(START);
dict_buffer.putInt(SIZE);
dictionaryOutput.write(dict_buffer.array());
START += SIZE;
VOCAB += 1;
}
//INFO
System.out.println(String.format("Vocabulary Size: %d", VOCAB));
postingListOutput.close();
dictionaryOutput.close();
}catch(IOException e){
System.err.println(e);
}
}
When I read the first 695 bytes of this file using Python, they come out as expected. But when I read the file using Java's GZIPInputStream, there are discrepancies: the last 10 of those 695 bytes are different.
I am trying to read the file using the following code:
try{
GZIPInputStream postingList = new GZIPInputStream(new FileInputStream(new File(args[1])));
GZIPInputStream dictionary = new GZIPInputStream(new FileInputStream(new File(args[2])));
byte[] buf = new byte[4];
while(true){
// Get the size of the token from the dictionary
dictionary.read(buf);
int tokenSize = ByteBuffer.wrap(buf).getInt();
// Read the token
byte[] tokenBuffer = new byte[tokenSize];
dictionary.read(tokenBuffer);
String token = new String(tokenBuffer, StandardCharsets.UTF_8);
// Read the document frequency
dictionary.read(buf);
int documentFrequency = ByteBuffer.wrap(buf).getInt();
// Read the starting index of the posting list
dictionary.read(buf);
int START = ByteBuffer.wrap(buf).getInt();
// Read the size of the posting list
dictionary.read(buf);
int SIZE = ByteBuffer.wrap(buf).getInt();
// Read the posting list
for(int i=0; i<documentFrequency; i++){
byte[] ID = new byte[4];
postingList.read(ID);
int docID = ByteBuffer.wrap(ID).getInt();
byte[] frequency = new byte[1];
postingList.read(frequency);
System.out.println(String.format("%d: %d: %d",i, docID, frequency[0]));
}
break;
}
postingList.close();
dictionary.close();
}
catch(IOException e){
System.err.println(e);
}
The print statement above prints one line per posting, after reading an integer (4 bytes) and a single byte for that line.
The last 2 printed lines should be of this form (which is what Python reads):
137: 81257: 1
138: 81737: 1
But I am getting (using the Java code above):
137: 65536: 61
138: 1761673217: 63
Any pointers on what could be the mistake?
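
One thing worth checking, offered as a hedged sketch rather than a confirmed diagnosis: InputStream.read(byte[]) may return fewer bytes than the array holds, and a GZIPInputStream can do so whenever its internal buffer runs dry, which would shift every value read afterwards. The example below wraps the GZIP stream in a DataInputStream, whose readInt() and readByte() block until the requested bytes have arrived (readInt() is big-endian, matching ByteBuffer's default order). The class name PostingReadSketch, the helper names, and the in-memory test data are illustrative assumptions and not part of the original code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class PostingReadSketch {
    // Compress raw bytes in memory so the example needs no files.
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(sink)) {
            out.write(raw);
        }
        return sink.toByteArray();
    }

    // Read `count` postings (4-byte docID followed by a 1-byte frequency).
    static int[][] readPostings(byte[] gzipped, int count) throws IOException {
        int[][] postings = new int[count][2];
        try (DataInputStream in = new DataInputStream(
                new GZIPInputStream(new ByteArrayInputStream(gzipped)))) {
            for (int i = 0; i < count; i++) {
                postings[i][0] = in.readInt();  // always consumes exactly 4 bytes
                postings[i][1] = in.readByte(); // always consumes exactly 1 byte
            }
        }
        return postings;
    }

    public static void main(String[] args) throws IOException {
        // Made-up postings, laid out the same way the writer lays them out.
        int count = 1000;
        ByteBuffer buf = ByteBuffer.allocate(count * 5);
        for (int i = 0; i < count; i++) {
            buf.putInt(80000 + i * 480).put((byte) 1);
        }
        int[][] postings = readPostings(gzip(buf.array()), count);
        System.out.println(postings[138][0] + ": " + postings[138][1]);
    }
}
```

If the file-based reader is changed the same way (DataInputStream around each GZIPInputStream, then readInt(), readByte() and readFully(tokenBuffer)), the return value of read() no longer needs checking by hand.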

Related

My DeflaterOutputStream/InputStream code corrupting data

I've got a simple test case that fails to compress a stream of data. I generate a byte[] of some random bytes, compress it via DeflaterOutputStream, flush() the stream, then reverse those operations to retrieve the original array. At byte 505 the reconstructed stream starts to consist entirely of 0x00 bytes, and I don't understand why:
//
// create some random bytes
//
Random rng = new Random();
int len = 5000;
byte[] data = new byte[len];
for (int i = 0; i < len; ++i)
data[i] = (byte) rng.nextInt(0xff);
//
// write to byte[] via a deflater stream
//
ByteArrayOutputStream baos = new ByteArrayOutputStream();
DeflaterOutputStream os = new DeflaterOutputStream(baos, true);
os.write(data);
os.flush();
//
// read back into byte[] via an inflater stream
//
ByteArrayInputStream bais = new ByteArrayInputStream(baos.toByteArray());
InflaterInputStream is = new InflaterInputStream(bais);
byte[] readbytes = new byte[len];
is.read(readbytes);
//
// check they match (they don't, at byte 505)
//
for (int i = 0; i < len; ++i)
if (data[i] != readbytes[i])
throw new RuntimeException("Mismatch at position " + i);
It doesn't seem to matter what's in the source array, it's always at position 505 it fails.
Here's what the two byte[] arrays look like around the region they differ:
?\m·g··gWNLErZ···,··-··=·;n=··F?···13·{·rw·······\`3···f····{/····t·1·WK$·······WZ······x
?\m·g··gWNLErZ···,··-····································································
^byte 505
All those unprintable chars are 0x00 from that point on. Why is this happening? I feel like I must be misunderstanding something fundamental about how the Deflate/Inflate streams work. The real-world use case here is a stream over a network that I thought I could easily improve the performance of by inserting Deflate/Inflate streams into

When I test this, is.read(readbytes) returns 505, the number of bytes actually read. Other stream methods that take a single array argument, such as OutputStream.write(byte[]) and DataInputStream.readFully(byte[]), return void and guarantee that the entire array is written or read, but InputStream.read(byte[]) has a different contract: you must check how many bytes were actually read.
ByteArrayInputStream bais = new ByteArrayInputStream(baos.toByteArray());
System.err.println( "bais size = " + bais.available() );
InflaterInputStream is = new InflaterInputStream(bais);
byte[] readbytes = new byte[len];
System.err.println( "read = " + is.read(readbytes) ); // 505
This runs without throwing an error for me:
ByteArrayInputStream bais = new ByteArrayInputStream(baos.toByteArray());
System.err.println( "bais size = " + bais.available() );
InflaterInputStream is = new InflaterInputStream(bais);
byte[] readbytes = new byte[len];
for( int total = 0, result = 0; (result = is.read(readbytes, total, len-total )) != -1; )
{
total += result;
System.err.println( "reading : " + total );
if( total == len ) break;
}
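
The same loop can be written more compactly on Java 9 or later, where InputStream.readNBytes(byte[], int, int) keeps reading until the requested length is reached or the stream ends. This is a minimal sketch that assumes Java 9+ is available; the class and method names are made up for illustration, and unlike the question's code the compressing stream is closed rather than only flushed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.Random;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

class InflateRoundTrip {
    // Compress a byte array in memory.
    static byte[] deflate(byte[] data) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(sink)) {
            out.write(data);
        } // close() finishes the compressed stream
        return sink.toByteArray();
    }

    // Decompress exactly expectedLength bytes, failing loudly on a short stream.
    static byte[] inflate(byte[] compressed, int expectedLength) throws IOException {
        byte[] result = new byte[expectedLength];
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            int got = in.readNBytes(result, 0, expectedLength); // loops internally
            if (got != expectedLength) {
                throw new IOException("expected " + expectedLength + " bytes, got " + got);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[5000];
        new Random().nextBytes(data);
        byte[] back = inflate(deflate(data), data.length);
        System.out.println("match = " + Arrays.equals(data, back));
    }
}
```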

FileInputStream and Huffman Tree

I am creating a Huffman tree to compress a text file but I am having some issues. This method I am making is supposed to take a FileInputStream which inputs the text data and returns a Map of the characters and the counts. However, to do that, I need to define the size of byte[] to store the data. The problem is that the byte[] array size needs to be just the right length or else the Map will also have some unneeded data. Is there a way to make the byte[] just the right size?
Here is my code:
// provides a count of characters in an input file and place in map
public static Map<Character, Integer> getCounts(FileInputStream input)
throws IOException {
Map<Character, Integer> output = new TreeMap<Character, Integer>(); // treemap keeps keys in sorted order (chars alphabetized)
byte[] fileContent = new byte[100]; // creates a byte[]
//ArrayList<Byte> test = new ArrayList<Byte>();
input.read(fileContent); // reads the input into fileContent
String test = new String(fileContent); // contains entire file into this string to process
// goes through each character of String to put chars as keys and occurrences as keys
for (int i = 0; i < test.length(); i++) {
char temp = test.charAt(i);
if (output.containsKey(temp)) { // seen this character before; increase count
int count = output.get(temp);
System.out.println("repeat; char is: " + temp + "count is: " + count);
output.put(temp, count + 1);
} else { // Haven't seen this character before; create count of 1
System.out.println("new; char is: " + temp + "count is: 1");
output.put(temp, 1);
}
}
return output;
}

The return value of FileInputStream.read(byte[]) is the number of bytes actually read, or -1 at end of file. You can use this value instead of test.length() in the for loop.
Note that read() is not guaranteed to fill the buffer, even if the end of the file has not been reached, so it is usually called in a loop:
int bytesRead;
//Read until there are no more bytes to read.
while((bytesRead = input.read(buf))!=-1)
{
//You have next bytesRead bytes in a buffer here
}
Finally, if your text uses a multi-byte encoding such as UTF-8, this approach will not work, since read() can stop in the middle of a character. Consider wrapping the FileInputStream in an InputStreamReader:
Reader fileReader = new InputStreamReader(input, "UTF-8");
int charsRead;
char buf[] = new char[256];
while ((charsRead = fileReader.read(buf)) > 0) {
//You have charsRead characters in a buffer here
}
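
Putting the pieces together, the counting method could look roughly like the sketch below. It assumes the file is UTF-8 text and takes a plain InputStream (a FileInputStream can be passed as-is); the class name CharCounts and the sample input are illustrative only.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

class CharCounts {
    // Count how often each char occurs, reading in chunks of whatever size arrives.
    static Map<Character, Integer> getCounts(InputStream input) throws IOException {
        Map<Character, Integer> counts = new TreeMap<>(); // keeps keys sorted
        Reader reader = new InputStreamReader(input, StandardCharsets.UTF_8);
        char[] buf = new char[256];
        int charsRead;
        while ((charsRead = reader.read(buf)) != -1) {
            // Only the first charsRead entries of buf are valid in this round.
            for (int i = 0; i < charsRead; i++) {
                counts.merge(buf[i], 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(getCounts(in)); // prints {e=1, h=1, l=2, o=1}
    }
}
```

Because the loop only looks at the characters each read() call actually delivered, the buffer size no longer has to match the file size.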

Java read into byte array

I have a .txt file consisting of 1's and 0's like so;
11111100000001010000000101110010
11111100000001100000000101110010
00000000101001100010000000100000
I would like to be able to read 8 (1's and 0's) and put each 'byte' into a byte array. So a line would be 4 bytes;
11111100 00000101 00000001 01110010 --> 4 bytes, line 1
11111100 00000110 00000001 01110010 --> 8 bytes, line 2
00000000 10100110 00100000 00100000 --> total 12 bytes, line 3
...
and so on.
I believe I need to store the data in a binary file but I'm not sure how to do this. Any help is greatly appreciated.
Edit 1:
I would like to put 8 1's and 0's (11111100, 00000101) into a byte and store in a byte array so 11111100 would be the first byte in the array, 00000101 the second and so on. I hope this is clearer.
Edit 2:
fileopen = new JFileChooser(System.getProperty("user.dir") + "/Example programs"); // open file from current directory
filter = new FileNameExtensionFilter(".txt", "txt");
fileopen.addChoosableFileFilter(filter);
if (fileopen.showOpenDialog(null)== JFileChooser.APPROVE_OPTION)
{
try
{
file = fileopen.getSelectedFile();
//create FileInputStream object
FileInputStream fin = new FileInputStream(file);
byte[] fileContent = new byte[(int)file.length()];
fin.read(fileContent);
for(int i = 0; i < fileContent.length; i++)
{
System.out.println("bit " + i + "= " + fileContent[i]);
}
//create string from byte array
String strFileContent = new String(fileContent);
System.out.println("File content : ");
System.out.println(strFileContent);
}
catch(FileNotFoundException e){}
catch(IOException e){}
}

Here's one way, with comments in the code:
import java.lang.*;
import java.io.*;
import java.util.*;
public class Mkt {
public static void main(String[] args) throws Exception {
BufferedReader br = new BufferedReader(new FileReader("in.txt"));
List<Byte> bytesList = new ArrayList<Byte>();
// Read line by line
for(String line = br.readLine(); line != null; line = br.readLine()) {
// 4 byte representations per line
for(int i = 0; i < 4; i++) {
// Get each of the 4 bytes (i.e. 8 characters representing the byte)
String part = line.substring(i * 8, (i + 1) * 8);
// Parse that into the binary representation
// Integer.parseInt is used as byte in Java is signed (-128 to 127)
byte currByte = (byte)Integer.parseInt(part, 2);
bytesList.add(currByte);
}
}
Byte[] byteArray = bytesList.toArray(new Byte[]{});
// Just print for test
for(byte currByte: byteArray) {
System.out.println(currByte);
}
}
}
Input is read from file named in.txt. Here's a sample run:
$ javac Mkt.java && java Mkt
-4
5
1
114
-4
6
1
114
0
-90
32
32
Hope this helps to get you started; you can tweak it to your needs.

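Use a BufferedReader to read the text file one character at a time, and emit a byte for every eight bits collected. Generic collections cannot hold primitives, so the lists use Byte and Character; convertToByte is a helper you would write yourself to parse the eight collected characters as a binary number.
BufferedReader in = new BufferedReader(...);
List<Byte> bytes = new ArrayList<Byte>();
List<Character> buffer = new ArrayList<Character>();
int c = 0;
while((c = in.read()) >= 0) {
// Skip line breaks and anything else that is not a bit
if(c == '1' || c == '0') buffer.add((char)c);
if(buffer.size() == 8) {
bytes.add(convertToByte(buffer));
buffer.clear();
}
}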
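
A complete, runnable take on this character-by-character approach might look like the sketch below. It swaps the lists for a StringBuilder and a ByteArrayOutputStream so that the result is a primitive byte[]; the class name BitTextReader, the body of convertToByte, and the use of a StringReader in place of a file are all assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class BitTextReader {
    // Hypothetical helper: parse exactly eight '0'/'1' characters into one byte.
    static byte convertToByte(CharSequence bits) {
        return (byte) Integer.parseInt(bits.toString(), 2);
    }

    // Collect bits from the reader, ignoring everything that is not '0' or '1'.
    static byte[] readBits(Reader source) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        StringBuilder buffer = new StringBuilder(8);
        try (BufferedReader in = new BufferedReader(source)) {
            int c;
            while ((c = in.read()) >= 0) {
                if (c == '0' || c == '1') {
                    buffer.append((char) c);
                }
                if (buffer.length() == 8) {
                    bytes.write(convertToByte(buffer)); // stores the low 8 bits
                    buffer.setLength(0);
                }
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String text = "11111100000001010000000101110010\n";
        for (byte b : readBits(new StringReader(text))) {
            System.out.println(b); // prints -4, 5, 1, 114 on separate lines
        }
    }
}
```

To read from a real file, pass a FileReader to readBits instead of the StringReader.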

How do I read an NSInputStream while writing it to an NSOutputStream in iOS?

I am porting an Android app to iPhone (more like improving the iPhone app based on the Android version) and I need to split and combine large uncompressed audio files.
Currently, I load all the files into memory and split them and combine them in separate functions. It crashes with 100MB+ files.
This is the new process needed to do it:
I have two recordings (file1 and file2) and a split position where I want file2 to be inserted inside file1.
-create the input streams for file1 and file2 and the output stream for the output file.
-rewrite the new CAF header
-read the data from inputStream1 until it reaches the split point and write all that data to the output file.
-read all data from inputStream2 and write it to the output file.
-read the remaining data from inputStream1 and write it to the output file.
Here is my Android code for the process:
File file1File = new File(file1);
File file2File = new File(file2);
long file1Length = file1File.length();
long file2Length = file2File.length();
FileInputStream file1ByteStream = new FileInputStream(file1);
FileInputStream file2ByteStream = new FileInputStream(file2);
FileOutputStream outputFileByteStream = new FileOutputStream(outputFile);
// time = fileLength / (Sample Rate * Channels * Bits per sample / 8)
// convert position to number of bytes for this function
long sampleRate = eRecorder.RECORDER_SAMPLERATE;
int channels = 1;
long bitsPerSample = eRecorder.RECORDER_BPP;
long bytePositionLength = (position * (sampleRate * channels * bitsPerSample / 8)) / 1000;
//calculate total data size
int dataSize = 0;
dataSize = (int)(file1Length + file2Length);
WriteWaveFileHeaderForMerge(outputFileByteStream, dataSize,
dataSize + 36,
eRecorder.RECORDER_SAMPLERATE, 1,
2 * eRecorder.RECORDER_SAMPLERATE);
long bytesWritten = 0;
int length = 0;
//set limit for bytes read, and write file1 bytes to outputfile until split position reached
int limit = (int)bytePositionLength;
//read bytes to limit
writeBytesToLimit(file1ByteStream, outputFileByteStream, limit);
file1ByteStream.close();
file2ByteStream.skip(44);//skip wav file header
writeBytesToLimit(file2ByteStream, outputFileByteStream, (int)file2Length);
file2ByteStream.close();
//calculate length of remaining file1 bytes to be written
long file1offset = bytePositionLength;
//reinitialize file1 input stream
file1ByteStream = new FileInputStream(file1);
file1ByteStream.skip(file1offset);
writeBytesToLimit(file1ByteStream, outputFileByteStream, (int)file1Length);
file1ByteStream.close();
outputFileByteStream.close();
And this is my writeBytesToLimit function:
private void writeBytesToLimit(FileInputStream inputStream, FileOutputStream outputStream, int byteLimit) throws IOException
{
int bytesRead = 0;
int chunkSize = 65536;
int length = 0;
byte[] buffer = new byte[chunkSize];
while((length = inputStream.read(buffer)) != -1)
{
bytesRead += length;
if(bytesRead >= byteLimit)
{
int leftoverBytes = byteLimit % chunkSize;
byte[] smallBuffer = new byte[leftoverBytes];
System.arraycopy(buffer, 0, smallBuffer, 0, leftoverBytes);
outputStream.write(smallBuffer);
break;
}
if(length == chunkSize)
outputStream.write(buffer);
else
{
byte[] smallBuffer = new byte[length];
System.arraycopy(buffer, 0, smallBuffer, 0, length);
outputStream.write(smallBuffer);
}
}
}
How do I do this in iOS? Using the same delegate for two NSInputStreams and an NSOutputStream looks like it will get very messy.
Has anyone seen an example of how to do this cleanly?
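
As an aside on the Java helper in the question: its modulo arithmetic appears to assume that every read() fills the whole buffer, and it writes nothing for the final chunk when the limit is an exact multiple of the chunk size. A simpler way to copy "at most N bytes" is to cap each read at the number of bytes still wanted. The sketch below is one possible shape, not the original author's code; the names LimitedCopy and copyUpTo are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

class LimitedCopy {
    // Copy at most byteLimit bytes from in to out; returns the number copied.
    static long copyUpTo(InputStream in, OutputStream out, long byteLimit) throws IOException {
        byte[] buffer = new byte[65536];
        long copied = 0;
        while (copied < byteLimit) {
            // Never ask for more than is still wanted, so no trimming is needed.
            int want = (int) Math.min(buffer.length, byteLimit - copied);
            int got = in.read(buffer, 0, want);
            if (got == -1) {
                break; // source ended before the limit was reached
            }
            out.write(buffer, 0, got); // write only the bytes actually read
            copied += got;
        }
        return copied;
    }

    public static void main(String[] args) throws IOException {
        byte[] source = new byte[200000];
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long n = copyUpTo(new ByteArrayInputStream(source), sink, 131072);
        System.out.println("copied " + n + " bytes");
    }
}
```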

I ended up using NSFileHandle. For example, this is the first part of what I am doing.
NSData *readData = [[NSData alloc] init];
NSFileHandle *reader1 = [NSFileHandle fileHandleForReadingAtPath:file1Path];
NSFileHandle *writer = [NSFileHandle fileHandleForWritingAtPath:outputFilePath];
//start reading data from file1 to split point and writing it to file
long bytesRead = 0;
while(bytesRead < splitPointInBytes)
{
//read a chunk of data
readData = [reader1 readDataOfLength:chunkSize];
if(readData.length == 0)break;
//trim data if too much was read
if(bytesRead + readData.length > splitPointInBytes)
{
//get difference of read bytes and byte limit
long difference = bytesRead + readData.length - splitPointInBytes;
//trim data
NSMutableData *readDataMutable = [NSMutableData dataWithData:readData];
[readDataMutable setLength:readDataMutable.length - difference];
readData = [NSData dataWithData:readDataMutable];
NSLog(@"Too much data read, trimming");
}
//write data to output file
[writer writeData:readData];
//update byte counter
bytesRead += readData.length;
}
long file1BytesWritten = bytesRead;

How can I write a sequence of strings and then a byte array to a file?

I want to write first a sequence of strings and then a sequence of bytes into a file, using Java. I started by using FileOutputStream because of the array of bytes. After searching the API, I realised that FileOutputStream cannot write Strings, only ints and bytes, so I switched to DataOutputStream. When I run the program, I get an exception. Why?
Here's a portion of my code:
try {
// Create the file
FileOutputStream fos;
DataOutputStream dos; // = new DataOutputStream("compressedfile.ecs_h");
File file= new File("C:\\MyFile.txt");
fos = new FileOutputStream(file);
dos=new DataOutputStream(fos);
/* saves the characters as a dictionary into the file before the binary seq*/
for (int i = 0; i < al.size(); i++) {
String name= al.get(i).name; //gets the string from a global arraylist, don't pay attention to this!
dos.writeChars(name); //saving the name in the file
}
System.out.println("\nIS SUCCESFULLY WRITTEN INTO FILE! ");
dos.writeChars("><");
String strseq;
/*write all elements from the arraylist into a string variable*/
strseq= seq.toString();
System.out.println("sTringSeq: " + strseq);
/*transpose the sequence string into a byte array*/
byte[] data = new byte[strseq.length() / 8];
for (int i = 0; i < data.length; i++) {
data[i] = (byte) Integer.parseInt(strseq.substring(i * 8, (i + 1) * 8), 2);
dos.write(data[i]);
}
dos.flush();
//Close the output stream
dos.close();
} catch(Exception e){}

The problem with your code is that the last for loop was counting over the wrong number of bytes. The code below fixes your problem writing your test data to a file. This works on my machine.
public static void main(String[] args) {
ArrayList<String> al = new ArrayList<String>();
al.add("String1");
al.add("String2");
try {
// Create the file
FileOutputStream fos = new FileOutputStream("MyFile.txt");
DataOutputStream dos = new DataOutputStream(fos);
/* saves the characters as a dictionary into the file before the binary seq */
for (String str : al) {
dos.writeChars(str);
}
System.out.println("\nIS SUCCESFULLY WRITTEN INTO FILE! ");
dos.writeChars("><");
String strseq = "001100111100101000101010111010100100111000000000";
// Ensure that you have a string of the correct size
if (strseq.length() % 8 != 0) {
throw new IllegalStateException(
"Input string cannot be converted to bytes - wrong size: "
+ strseq.length());
}
int numBytes = strseq.length() / 8;
for (int i = 0; i < numBytes; i++) {
int start = i * 8;
int end = (i + 1) * 8;
byte output = (byte) Integer.parseInt(strseq.substring(start, end), 2);
dos.write(output);
}
dos.writeChars("> End of File");
dos.flush();
// Close the output stream
dos.close();
} catch (Exception e) {
e.printStackTrace();
}
}
Writing raw bytes directly into a text file does have a few problems (I assume it's a text file because your test file name ends with .txt). The most obvious one is that some text editors don't handle or display null characters very well (your last test byte was 00000000, i.e. null). If you want to see the bytes in a readable form, you could look into encoding them with Base64.
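
If the file only needs to be read back by another program, a length-prefixed layout avoids separators such as "><" altogether. The sketch below is an alternative technique, not what the answer above does: it writes a count, the strings via writeUTF, a payload length, and then the raw bytes, and reads them back with DataInputStream. The layout and all names (StringsThenBytes, encode, decodeNames, decodePayload) are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class StringsThenBytes {
    // Turn a string of '0'/'1' characters into bytes, 8 characters per byte.
    static byte[] bitsToBytes(String bits) {
        if (bits.length() % 8 != 0) {
            throw new IllegalArgumentException("length must be a multiple of 8: " + bits.length());
        }
        byte[] out = new byte[bits.length() / 8];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(bits.substring(i * 8, (i + 1) * 8), 2);
        }
        return out;
    }

    // Assumed layout: name count, names, payload length, payload bytes.
    static byte[] encode(List<String> names, byte[] payload) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DataOutputStream dos = new DataOutputStream(sink)) {
            dos.writeInt(names.size());
            for (String name : names) {
                dos.writeUTF(name); // writes its own length prefix
            }
            dos.writeInt(payload.length);
            dos.write(payload);
        }
        return sink.toByteArray();
    }

    static List<String> decodeNames(DataInputStream in) throws IOException {
        int count = in.readInt();
        List<String> names = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            names.add(in.readUTF());
        }
        return names;
    }

    static byte[] decodePayload(DataInputStream in) throws IOException {
        byte[] payload = new byte[in.readInt()];
        in.readFully(payload); // blocks until the whole array is filled
        return payload;
    }

    public static void main(String[] args) throws IOException {
        byte[] file = encode(Arrays.asList("String1", "String2"), bitsToBytes("0011001111001010"));
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(file));
        System.out.println(decodeNames(in));
        System.out.println(Arrays.toString(decodePayload(in)));
    }
}
```

Swapping the ByteArrayOutputStream for a FileOutputStream (and the ByteArrayInputStream for a FileInputStream) gives the on-disk version.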

The line:
data[i] = (byte) Integer.parseInt(strseq.substring(i * 8, (i + 1) * 8), 2);
looks very suspicious.
Can you provide more details about strseq and its value?

What about this?
This code:
byte[] data = new byte[strseq.length() / 8];
for (int i = 0; i < data.length; i++) {
data[i] = (byte) Integer.parseInt(strseq.substring(i * 8, (i + 1) * 8), 2);
dos.write(data[i]);
}
becomes
byte[] data = strseq.getBytes();

The FileWriter class gives you a nice abstraction for writing to a file.
Maybe this class can help you write your file.
You can replace the other OutputStreams with just this class. It has methods for writing strings and character arrays to a file (for raw bytes you would still need an OutputStream).
