I am creating a zip file with one directory and a single compressed text file inside it.
Code to create the zip file
try(ZipOutputStream zos=new ZipOutputStream(new FileOutputStream("E:/TestFile.zip")))
{
    //comment, level & method for all entries
    zos.setComment("Test Zip File");
    zos.setLevel(Deflater.BEST_COMPRESSION);
    zos.setMethod(Deflater.DEFLATED);
    //Creating directories [name ends with a forward slash]
    {
        ZipEntry dir1=new ZipEntry("Directory/");
        //Give it a comment
        dir1.setComment("Directory");
        //Some extra data
        dir1.setExtra("Hello".getBytes());
        //Set creation, access & modification time
        FileTime time=FileTime.fromMillis(System.currentTimeMillis());
        dir1.setCreationTime(time);
        dir1.setLastAccessTime(time);
        dir1.setLastModifiedTime(time);
        //put the entry & close it
        zos.putNextEntry(dir1);
        zos.closeEntry();
    }
    //Creating a fully compressed file inside the directory with all information
    {
        ZipEntry file=new ZipEntry("Directory/Test.txt");
        //Metadata
        {
            //Give it a comment
            file.setComment("A File");
            //Some extra data
            file.setExtra("World".getBytes());
            //Set creation, access & modification time
            FileTime time=FileTime.fromMillis(System.currentTimeMillis());
            file.setCreationTime(time);
            file.setLastAccessTime(time);
            file.setLastModifiedTime(time);
        }
        //Byte data
        {
            //put entry for writing
            zos.putNextEntry(file);
            byte[] data="Hello World Hello World".getBytes();
            //Compress the data
            Deflater deflater=new Deflater(9);
            deflater.setDictionary("Hello World ".getBytes());
            deflater.setInput(data);
            deflater.finish();
            byte[] output=new byte[100];
            int compressed=deflater.deflate(output);
            //Write the data
            CRC32 check=new CRC32();
            check.update(data);
            file.setSize(deflater.getBytesRead());
            file.setCrc(check.getValue());
            file.setCompressedSize(compressed);
            zos.write(output,0,compressed);
            //end data
            System.out.println(deflater.getBytesRead()+"/"+compressed);
            deflater.end();
        }
        //close the entry
        zos.closeEntry();
    }
}
Upon writing the file, the uncompressed byte data is 23 bytes and the compressed data is 15 bytes. I am using every method inside ZipEntry just to test whether I can retrieve all the values correctly upon reading it.
Upon reading it, I use the ZipFile class and not ZipInputStream (whose getSize() always returns -1 due to a bug), with this code:
//reading zip file using ZipFile
public static void main(String[] args)throws Exception
{
    try(ZipFile zis=new ZipFile("E:/TestFile.zip"))
    {
        Enumeration<? extends ZipEntry> entries=zis.entries();
        while(entries.hasMoreElements())
        {
            ZipEntry entry=entries.nextElement();
            System.out.println("Name="+entry.getName());
            System.out.println("Is Directory="+entry.isDirectory());
            System.out.println("Comment="+entry.getComment());
            System.out.println("Creation Time="+entry.getCreationTime());
            System.out.println("Access Time="+entry.getLastAccessTime());
            System.out.println("Modification Time="+entry.getLastModifiedTime());
            System.out.println("CRC="+entry.getCrc());
            System.out.println("Real Size="+entry.getSize());
            System.out.println("Compressed Size="+entry.getCompressedSize());
            System.out.println("Optional Data="+new String(entry.getExtra()));
            System.out.println("Method="+entry.getMethod());
            if(!entry.isDirectory())
            {
                Inflater inflater=new Inflater();
                try(InputStream is=zis.getInputStream(entry))
                {
                    byte[] originalData=new byte[(int)entry.getSize()];
                    inflater.setInput(is.readAllBytes());
                    int realLength=inflater.inflate(originalData);
                    if(inflater.needsDictionary())
                    {
                        inflater.setDictionary("Hello World ".getBytes());
                        realLength=inflater.inflate(originalData);
                    }
                    inflater.end();
                    System.out.println("Data="+new String(originalData,0,realLength));
                }
            }
            System.out.println("=====================================================");
        }
    }
}
I get this output
Name=Directory/
Is Directory=true
Comment=Directory
Creation Time=null
Access Time=null
Modification Time=2022-01-24T17:00:25Z
CRC=0
Real Size=0
Compressed Size=2
Optional Data=UTaHello
Method=8
=====================================================
Name=Directory/Test.txt
Is Directory=false
Comment=A File
Creation Time=null
Access Time=null
Modification Time=2022-01-24T17:00:25Z
CRC=2483042136
Real Size=15
Compressed Size=17
Optional Data=UT��aWorld
Method=8
Data=Hello World Hel
=====================================================
There is a lot of wrong output in this code:
For the directory
1) Creation Time & Access Time are null [even though I specified them in the write method]
2) Extra Data [Optional Data] has the wrong encoding
For the file
1) Creation Time & Access Time are null [even though I specified them in the write method]
2) getSize() & getCompressedSize() return the wrong values. I set these values manually with setSize() & setCompressedSize() when creating the file; the values were 23 and 15, but it returns 15 and 17
3) Extra Data [Optional Data] has the wrong encoding
4) Since getSize() returns the incorrect size, it doesn't display the whole data [Hello World Hel]
With so many things going wrong, I thought to post this as one question rather than multiple small ones, as they all seem related. I am a complete beginner at writing zip files, so any direction on where to go from here would be greatly appreciated.
I can read the data of a zip entry into a buffer using a while loop if the size is unknown or incorrect, which is not a problem, but why would they even create set and get size methods if they knew we would be doing this most of the time anyway? What's the point?
After much research I was able to solve 70% of the problems. The others can't be solved given the nature of how ZipOutputStream & ZipFile read the data.
Problem 1: Incorrect values returned by getSize() & getCompressedSize()
1) During Writing
I was blind not to have seen this earlier, but ZipOutputStream already does compression for us, and I was double compressing by using my own Deflater, so I removed that code. I also realized that you must specify these values manually only when you are using the STORED method; otherwise they are computed for you from the data. So, refactoring my zip writing code, this is how it looks:
try(ZipOutputStream zos=new ZipOutputStream(new FileOutputStream("E:/TestFile2.zip")))
{
    //comment, level & method for all entries
    zos.setComment("Test Zip File");
    //Auto Compression
    zos.setMethod(ZipOutputStream.DEFLATED);
    zos.setLevel(9);
    //Creating directories [name ends with a forward slash]
    {
        ZipEntry dir1=new ZipEntry("Directory/");
        //Give it a comment
        dir1.setComment("Directory");
        //Some extra data
        dir1.setExtra("Hello".getBytes());
        //Set creation, access & modification time
        FileTime time=FileTime.fromMillis(System.currentTimeMillis());
        dir1.setCreationTime(time);
        dir1.setLastAccessTime(time);
        dir1.setLastModifiedTime(time);
        //put the entry & close it
        zos.putNextEntry(dir1);
        zos.closeEntry();
    }
    //Creating a fully compressed file inside the directory with all information
    {
        ZipEntry file=new ZipEntry("Directory/Test.txt");
        //Metadata
        {
            //Give it a comment
            file.setComment("A File");
            //Some extra data
            file.setExtra("World".getBytes());
            //Set creation, access & modification time
            FileTime time=FileTime.fromMillis(System.currentTimeMillis());
            file.setCreationTime(time);
            file.setLastAccessTime(time);
            file.setLastModifiedTime(time);
        }
        //Byte data
        {
            byte[] data="Hello World Hello World".getBytes();
            //Data
            zos.putNextEntry(file);
            zos.write(data);
            zos.flush();
        }
        //close the entry
        zos.closeEntry();
    }
    //finish writing the zip file without closing the stream
    zos.finish();
}
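For contrast, here is a minimal sketch of what a STORED (uncompressed) entry would need: the size, compressed size and CRC must be supplied by hand before putNextEntry. The entry name here is hypothetical.
byte[] data = "Hello World Hello World".getBytes();
CRC32 crc = new CRC32();
crc.update(data);
ZipEntry stored = new ZipEntry("Directory/Stored.txt"); //hypothetical extra entry
stored.setMethod(ZipEntry.STORED);
stored.setSize(data.length);
stored.setCompressedSize(data.length); //equal to the size, since nothing is compressed
stored.setCrc(crc.getValue());
zos.putNextEntry(stored);
zos.write(data);
zos.closeEntry();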
2)During Reading
To get the correct size & compressed size values there are 2 approaches:
-> If you read the file using the ZipFile class, the values come out correctly.
-> If you use ZipInputStream, these values are computed only after you have read all the bytes from the entry. More info here:
if(!entry.isDirectory())
{
    try(ByteArrayOutputStream baos=new ByteArrayOutputStream())
    {
        int read;
        byte[] data=new byte[10];
        while((read=zipInputStream.read(data))>0){baos.write(data,0,read);}
        System.out.println("Data="+new String(baos.toByteArray()));
    }
}
//Now these values are correct
System.out.println("CRC="+entry.getCrc());
System.out.println("Real Size="+entry.getSize());
System.out.println("Compressed Size="+entry.getCompressedSize());
Problem 2: Incorrect Extra data
This post pretty much explains everything.
Here is the code:
ByteBuffer extraData = ByteBuffer.wrap(entry.getExtra()).order(ByteOrder.LITTLE_ENDIAN);
while(extraData.hasRemaining())
{
    int id = extraData.getShort() & 0xffff;
    int length = extraData.getShort() & 0xffff;
    if(id == 0x756e)
    {
        int crc32 = extraData.getInt();
        short permissions = extraData.getShort();
        int linkLengthOrDeviceNumbers = extraData.getInt();
        int userID = extraData.getChar();
        int groupID = extraData.getChar();
        ByteBuffer linkDestBuffer = extraData.slice().limit(length - 14);
        String linkDestination = StandardCharsets.UTF_8.decode(linkDestBuffer).toString();
    }
    else
    {
        extraData.position(extraData.position() + length);
        byte[] ourData = new byte[extraData.remaining()];
        extraData.get(ourData);
        //do stuff
    }
}
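The same (header id, data size, data) block structure applies when writing: raw bytes passed to setExtra get rearranged around the timestamp block the stream adds, which is why "Hello" and "World" came back mangled. A sketch that wraps a payload in its own block, with 0xCAFE as an arbitrary illustrative header id:
byte[] payload = "Hello".getBytes();
ByteBuffer block = ByteBuffer.allocate(4 + payload.length).order(ByteOrder.LITTLE_ENDIAN);
block.putShort((short)0xCAFE);          //header id (2 bytes, little-endian)
block.putShort((short)payload.length);  //data size (2 bytes, little-endian)
block.put(payload);
dir1.setExtra(block.array());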
Unsolved Problems
There are still 3 values which return different results depending on which method you use to read the file. I made a table of my observations per entry:
                      ZipFile           ZipInputStream
getCreationTime()     null              <correct value>
getLastAccessTime()   null              <correct value>
getComment()          <correct value>   null
Apparently, from the bug report, this is expected behavior: a zip file is random access and a zip input stream is sequential, so they access the data differently.
From my observations, using ZipInputStream returns the best results, so I will continue to use that.
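If you need both sets of fields in one pass, one workaround (a sketch, not necessarily the only way) is to open the archive twice and merge the per-entry metadata:
try(ZipFile zf = new ZipFile("E:/TestFile2.zip");
    ZipInputStream zin = new ZipInputStream(new FileInputStream("E:/TestFile2.zip")))
{
    ZipEntry streamed;
    while((streamed = zin.getNextEntry()) != null)
    {
        //same entry, located by name through the random-access view
        ZipEntry random = zf.getEntry(streamed.getName());
        System.out.println("Creation Time=" + streamed.getCreationTime()); //from ZipInputStream
        System.out.println("Comment=" + random.getComment());              //from ZipFile
    }
}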
Related
I have a column "Content" (BLOB data) in an IBM DB2 database, and the data of one record looks like this (https://drive.google.com/file/d/12d1g5jtomJS-ingCn_n0GKMsM4RkdYzB/view?usp=sharing).
I opened it in an editor and I think it contains more than one image (https://i.stack.imgur.com/2biLN.png, https://i.stack.imgur.com/ZwBOs.png).
I can export a single image from a byte array (using C#) to my disk, but with multiple images I don't know how to do it.
Please help me! Thanks!
Edit 1:
I have tried exporting it as a single image with this code:
private void readBLOB(DB2Connection conn, DB2Transaction trans)
{
    try
    {
        string SavePath = @"D:\MyBLOB";
        long CurrentIndex = 0;
        //the number of bytes to store in the array
        int BufferSize = 413454;
        //the number of bytes returned from the GetBytes() method
        long BytesReturned;
        //a byte array to hold the buffer
        byte[] Blob = new byte[BufferSize];

        DB2Command cmd = conn.CreateCommand();
        cmd.CommandText = "SELECT ATTR0102500126 " +
                          " FROM JCR.ICMUT01278001 " +
                          " WHERE COMPKEY = 'N21E26B04900FC6B1F00000'";
        cmd.Transaction = trans;

        DB2DataReader reader;
        reader = cmd.ExecuteReader(CommandBehavior.SequentialAccess);
        if (reader.Read())
        {
            FileStream fs = new FileStream(SavePath + "\\" + "quang canh.jpg", FileMode.OpenOrCreate, FileAccess.Write);
            BinaryWriter writer = new BinaryWriter(fs);
            //reset the index to the beginning of the file
            CurrentIndex = 0;
            BytesReturned = reader.GetBytes(
                0,            //the BlobsTable column index
                CurrentIndex, //the current index of the field from which to begin the read operation
                Blob,         //array name to write the buffer to
                0,            //the start index of the array
                BufferSize    //the maximum length to copy into the buffer
            );
            while (BytesReturned == BufferSize)
            {
                writer.Write(Blob);
                writer.Flush();
                CurrentIndex += BufferSize;
                BytesReturned = reader.GetBytes(0, CurrentIndex, Blob, 0, BufferSize);
            }
            writer.Write(Blob, 0, (int)BytesReturned);
            writer.Flush();
            writer.Close();
            fs.Close();
        }
        reader.Close();
    }
    catch (Exception e)
    {
        Console.WriteLine(e.Message);
    }
}
But I cannot view the image; it shows a format error => https://i.stack.imgur.com/PNS9Q.png
You are currently assuming all BLOBs in that DB are JPEG images. But that is clearly not the case.
Option 1: This is faulty data
Programs that save to databases can fail.
Databases themselves might fail, especially if transactions are turned off. Transactions are most likely turned off for BLOBs.
The physical disk the data was stored on might have degraded. And again, you will not get a lot of redundancy and error correction with BLOBs (plus making use of the error correction requires going through the proper DBMS in the first place).
Option 2: This is not a JPG
There is a well-known article about Unicode that says "[...] the problem comes down to one naive programmer who didn’t understand the simple fact that if you don’t tell me whether a particular string is encoded using UTF-8 or ASCII or ISO 8859-1 (Latin 1) or Windows 1252 (Western European), you simply cannot display it correctly or even figure out where it ends."
This applies doubly, triply and quadruply to images:
this could be any number of formats that use interlacing.
this could be a professional graphics program's image/project file, like TIFF, which can absolutely contain multiple images - up to one per layer you are working with.
this could even be an .SVG file (XML text that contains drawing instructions) that was run through .ZIP compression, or a Word document.
this could even be a PDF, where the images are usually appended at the back (allowing you to read the text with a partial file, similar to interleaving).
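Given that, a sensible first step is to sniff the leading "magic" bytes of the BLOB to find out what it really is before picking a file extension. A minimal sketch (in Java, to match the bulk of this page; the signatures are well-known constants and the method names are made up):
static String sniff(byte[] blob)
{
    if(startsWith(blob, new byte[]{(byte)0xFF, (byte)0xD8, (byte)0xFF})) return "JPEG";
    if(startsWith(blob, new byte[]{(byte)0x89, 'P', 'N', 'G'}))          return "PNG";
    if(startsWith(blob, new byte[]{'%', 'P', 'D', 'F'}))                 return "PDF";
    if(startsWith(blob, new byte[]{'P', 'K', 3, 4}))                     return "ZIP container (also DOCX/ODT/JAR...)";
    if(startsWith(blob, new byte[]{'I', 'I', 42, 0}))                    return "TIFF (little-endian)";
    if(startsWith(blob, new byte[]{'M', 'M', 0, 42}))                    return "TIFF (big-endian)";
    return "unknown";
}

static boolean startsWith(byte[] data, byte[] prefix)
{
    if(data.length < prefix.length) return false;
    for(int i = 0; i < prefix.length; i++)
        if(data[i] != prefix[i]) return false;
    return true;
}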
I have written a Java program to write a byte array to a file, and that resulting byte array is the concatenation of three parts:
The first 2 bytes are my schemaId, which I have represented using a short.
The next 8 bytes are my Last Modified Date, which I have represented using a long.
The remaining bytes can be of variable size and are the actual value of my attributes.
So I now have a file whose first line contains the resulting byte array with all of the above bytes. Now I need to read that file from a C++ program, read the first line containing the byte array, and then split it accordingly, as described above, so that I can extract my schemaId, Last Modified Date and actual attribute value from it.
I have always done all my coding in Java and I am new to C++... I am able to write a program in C++ to read the file, but I am not sure how I should read that byte array in such a way that I can split it as mentioned above.
Below is my C++ program, which reads the file and prints it out to the console:
#include <iostream>
#include <fstream>
#include <string>
using namespace std;

int main () {
    string line;
    //the variable of type ifstream:
    ifstream myfile ("bytearrayfile");
    //check to see if the file is opened:
    if (myfile.is_open())
    {
        //while there are still lines in the
        //file, keep reading:
        while (! myfile.eof() )
        {
            //place the line from myfile into the
            //line variable:
            getline (myfile,line);
            //display the line we gathered:
            // and here split the byte array accordingly..
            cout << line << endl;
        }
        //close the stream:
        myfile.close();
    }
    else cout << "Unable to open file";
    return 0;
}
Can anyone help me with that? Thanks.
Update
Below is my Java code, which writes the resulting byte array into a file; that same file now needs to be read back from C++:
public static void main(String[] args) throws Exception {
    String os = "whatever os is";
    byte[] avroBinaryValue = os.getBytes();
    long lastModifiedDate = 1379811105109L;
    short schemaId = 32767;

    ByteArrayOutputStream byteOsTest = new ByteArrayOutputStream();
    DataOutputStream outTest = new DataOutputStream(byteOsTest);
    outTest.writeShort(schemaId);
    outTest.writeLong(lastModifiedDate);
    outTest.writeInt(avroBinaryValue.length);
    outTest.write(avroBinaryValue);
    byte[] allWrittenBytesTest = byteOsTest.toByteArray();

    DataInputStream inTest = new DataInputStream(new ByteArrayInputStream(allWrittenBytesTest));
    short schemaIdTest = inTest.readShort();
    long lastModifiedDateTest = inTest.readLong();
    int sizeAvroTest = inTest.readInt();
    byte[] avroBinaryValue1 = new byte[sizeAvroTest];
    inTest.read(avroBinaryValue1, 0, sizeAvroTest);

    System.out.println(schemaIdTest);
    System.out.println(lastModifiedDateTest);
    System.out.println(new String(avroBinaryValue1));

    writeFile(allWrittenBytesTest);
}

/**
 * Write the file in Java
 * @param byteArray
 */
public static void writeFile(byte[] byteArray) {
    try (FileOutputStream output = new FileOutputStream(new File("bytearrayfile"))) {
        IOUtils.write(byteArray, output);
    } catch (Exception ex) {
        ex.printStackTrace();
    }
}
It doesn't look like you want to use std::getline to read this data. Your file isn't written as text data on a line-by-line basis - it basically has a binary format.
You can use the read method of std::ifstream to read arbitrary chunks of data from an input stream. You probably want to open the file in binary mode:
std::ifstream myfile("bytearrayfile", std::ios::binary);
Fundamentally the method you would use to read each record from the file is:
uint16_t schemaId;
uint64_t lastModifiedDate;
uint32_t binaryLength;
myfile.read(reinterpret_cast<char*>(&schemaId), sizeof(schemaId));
myfile.read(reinterpret_cast<char*>(&lastModifiedDate), sizeof(lastModifiedDate));
myfile.read(reinterpret_cast<char*>(&binaryLength), sizeof(binaryLength));
This will read the three static members of your data structure from the file. Because your data is variable size, you probably need to allocate a buffer to read it into, for example:
std::unique_ptr<char[]> binaryBuf(new char[binaryLength]);
myfile.read(binaryBuf.get(), binaryLength);
The above are examples only to illustrate how you would approach this in C++. You will need to be aware of the following things:
There's no error checking in the above examples. You'll need to check that the calls to ifstream::read are successful and return the correct amount of data.
Endianness may be an issue, depending on the platform the data originates from and is being read on.
Interpreting the lastModifiedDate field may require you to write a function to convert it from whatever format Java uses (I have no idea about Java).
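On the endianness point: Java's DataOutputStream always writes multi-byte values in big-endian (network) order, so on a little-endian x86 machine the C++ reader above must byte-swap each field after reading it. (And for the last point: a Java long timestamp like the one in the question is milliseconds since the Unix epoch.) A quick Java check of the byte order:
ByteArrayOutputStream bos = new ByteArrayOutputStream();
new DataOutputStream(bos).writeShort(0x1234);
byte[] b = bos.toByteArray();
System.out.printf("%02x %02x%n", b[0], b[1]); //prints "12 34": high byte first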
So I have implemented a working program which searches a file using the binary search method:
public int BSearch(int x1, int x2) throws IOException {
    int current_key;
    middle=(x1+x2)/2;
    if(x1>x2) {
        middle=-1; //middle==-1 is the 'key not found' condition
        return middle;
    }
    MyFile.seek(middle*4);
    current_key=MyFile.readInt();
    da++;
    if(current_key==key) {
        return middle;
    }
    else if(key<current_key) {
        x2=middle-1;
        return BSearch(x1,x2);
    }
    else {
        x1=middle+1;
        return BSearch(x1,x2);
    }
}
Now I want to transform it so it reads the file piece by piece (say 1 KB at a time) into a buffer, and then binary searches that buffer. If the key is not found in that buffer, I read further into the file, and so on. I want to clarify that the buffer is a handmade buffer like this (correct me if this is wrong):
byte[] buf = new byte[1024];
MyFile.read(buf);
ByteArrayInputStream bis = new ByteArrayInputStream(buf);
DataInputStream ois = new DataInputStream(bis);
current_key = ois.readInt();
A big problem (among others) is that I don't know how to read from a certain position of the buffer.
OK, I think I managed to do it by copying the buffer to a new int[] array element by element. I want to believe it is still faster than accessing the disk every time I want to load a buffer.
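For reference, java.nio.ByteBuffer supports absolute indexed reads, which avoids both the element-by-element copy and the stream wrappers. A sketch, reusing MyFile and key from the code above (assuming MyFile is a RandomAccessFile, whose readInt is big-endian, matching ByteBuffer's default byte order):
byte[] buf = new byte[1024];
int bytesRead = MyFile.read(buf);
ByteBuffer chunk = ByteBuffer.wrap(buf, 0, bytesRead);
int count = bytesRead / 4;        //number of complete ints in this chunk
int lo = 0, hi = count - 1;
while(lo <= hi) {
    int mid = (lo + hi) >>> 1;
    int current_key = chunk.getInt(mid * 4);  //absolute read at any offset
    if(current_key == key) { /* found at index mid of this chunk */ break; }
    else if(key < current_key) hi = mid - 1;
    else lo = mid + 1;
}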
My basic Java problem is this: I need to read in a file by chunks, then reverse the order of the chunks, then write that out to a new file. My first (naive) attempt followed this approach:
read a chunk from the file.
reverse the bytes of the chunk
push the bytes one at a time to the front of a results list
repeat for all chunks
write result list to new file.
So this is basically a very stupid and slow way to solve the problem, but it generates the correct output that I am looking for. To try to improve the situation, I changed to this algorithm:
read a chunk from the file
push that chunk onto the front of a list of arrays
repeat for all chunks
foreach chunk, write to new file
And to my mind, that should produce the same output. Except it doesn't, and I am quite confused. The first chunk in the result file matches with both methods, but the rest of the file is completely different.
Here is the meat of the Java code I am using:
FileInputStream in;
FileOutputStream out, out2;
Byte[] t = new Byte[0];
LinkedList<Byte> reversed_data = new LinkedList<Byte>();
byte[] data = new byte[bufferSize];
LinkedList<byte[]> revd2 = new LinkedList<byte[]>();
try {
    in = new FileInputStream(infile);
    out = new FileOutputStream(outfile1);
    out2 = new FileOutputStream(outfile2);
} catch (FileNotFoundException e) {
    e.printStackTrace();
    return;
}
while(in.read(data) != -1)
{
    revd2.addFirst(data);
    byte[] revd = reverse(data);
    for (byte b : revd)
    {
        reversed_data.addFirst(b);
    }
}
for (Byte b : reversed_data)
{
    out.write(b);
}
for (byte[] b : revd2)
{
    out2.write(b);
}
At http://pastie.org/3113665 you can see a complete example program (along with my debugging attempts). For simplicity I am using a bufferSize that evenly divides the size of the file, so all chunks will be the same size, but this won't hold in the real world. My question is, WHY don't these two methods generate the same output? It's driving me crazy because I can't grok it.
You're constantly overwriting the data you've read previously.
while(in.read(data) != -1)
{
    revd2.addFirst(data);
    // ignore byte-wise stuff
}
You're adding the same object repeatedly to the list revd2, so each list node will finally contain a reference to data filled with the result of the last read. I suggest replacing that with revd2.addFirst(data.clone()).
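A sketch of the read loop with that fix applied (java.util.Arrays.copyOf also trims the final chunk when the last read is short, which the original ignores):
int read;
while((read = in.read(data)) != -1)
{
    //each iteration stores its own copy, so later reads can't overwrite it
    revd2.addFirst(Arrays.copyOf(data, read));
}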
My guess is you want to change
revd2.addFirst(data);
byte[] revd = reverse(data);
to the following, so the reversed copy is added to the start of the list:
byte[] revd = reverse(data);
revd2.addFirst(revd);
So I have large (around 4 gigs each) txt files in pairs and I need to create a 3rd file which would consist of the 2 files in shuffle mode. The following equation presents it best:
3rdfile = (4 lines from file 1) + (4 lines from file 2), and this is repeated until I hit the end of file 1 (both input files will have the same length - this is by definition). Here is the code I'm using now, but it doesn't scale very well to large files. I was wondering if there is a more efficient way to do this - would working with a memory mapped file help? All ideas are welcome.
public static void mergeFastq(String forwardFile, String reverseFile, String outputFile) {
    try {
        BufferedReader inputReaderForward = new BufferedReader(new FileReader(forwardFile));
        BufferedReader inputReaderReverse = new BufferedReader(new FileReader(reverseFile));
        PrintWriter outputWriter = new PrintWriter(new FileWriter(outputFile, true));
        String forwardLine = null;
        System.out.println("Begin merging Fastq files");
        int readsMerge = 0;
        while ((forwardLine = inputReaderForward.readLine()) != null) {
            //append the forward file
            outputWriter.println(forwardLine);
            outputWriter.println(inputReaderForward.readLine());
            outputWriter.println(inputReaderForward.readLine());
            outputWriter.println(inputReaderForward.readLine());
            //append the reverse file
            outputWriter.println(inputReaderReverse.readLine());
            outputWriter.println(inputReaderReverse.readLine());
            outputWriter.println(inputReaderReverse.readLine());
            outputWriter.println(inputReaderReverse.readLine());
            readsMerge++;
            if(readsMerge % 10000 == 0) {
                System.out.println("[" + now() + "] Merged 10000");
                readsMerge = 0;
            }
        }
        inputReaderForward.close();
        inputReaderReverse.close();
        outputWriter.close();
    } catch (IOException ex) {
        Logger.getLogger(Utilities.class.getName()).log(Level.SEVERE, "Error while merging FastQ files", ex);
    }
}
Maybe you also want to try using a BufferedWriter to cut down on your file IO operations.
http://download.oracle.com/javase/6/docs/api/java/io/BufferedWriter.html
A simple answer is to use a bigger buffer, which helps to reduce the total number of I/O calls being made.
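For instance, the reader and writer from the question could be constructed with explicit, larger buffers. A sketch, with the 1 MB size being an arbitrary starting point to tune:
BufferedReader inputReaderForward = new BufferedReader(new FileReader(forwardFile), 1 << 20);
BufferedReader inputReaderReverse = new BufferedReader(new FileReader(reverseFile), 1 << 20);
PrintWriter outputWriter = new PrintWriter(new BufferedWriter(new FileWriter(outputFile, true), 1 << 20));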
Usually, memory mapped IO with FileChannel (see Java NIO) is used for handling large data file IO. Here, however, it is not the right fit, as you need to inspect the file content in order to determine the boundary of every 4 lines.
If performance was the main requirement, then I would code this function in C or C++ instead of Java.
But regardless of the language used, what I would do is try to manage memory myself. I would create two large buffers, say 128MB or more each, and fill them with data from the two text files. Then you need a 3rd buffer that is twice as big as the previous two. The algorithm starts moving characters one by one from input buffer #1 to the destination buffer, counting EOLs as it goes. Once you reach the 4th line you store the current position in that buffer and repeat the same process with the 2nd input buffer. You continue alternating between the two input buffers, replenishing them whenever you consume all their data. Each time you have to refill the input buffers you can also write out the destination buffer and empty it, as in the sketch below.
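A sketch of that alternating-buffer idea, written in Java to match the question rather than C/C++; it assumes '\n' line endings and equally sized inputs, and omits error handling:
public static void mergeAlternating(Reader a, Reader b, Writer out) throws IOException {
    Reader[] in = {a, b};
    char[][] buf = {new char[1 << 20], new char[1 << 20]}; //two large input buffers
    int[] pos = {0, 0}, len = {0, 0};
    int src = 0, newlines = 0;
    while(true) {
        if(pos[src] >= len[src]) {          //current buffer consumed: replenish it
            len[src] = in[src].read(buf[src]);
            pos[src] = 0;
            if(len[src] == -1) break;       //this source is exhausted: done
        }
        char c = buf[src][pos[src]++];
        out.write(c);
        if(c == '\n' && ++newlines == 4) {  //copied 4 lines: switch to the other source
            newlines = 0;
            src = 1 - src;
        }
    }
    out.flush();
}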
Buffer your read and write operations. The buffer needs to be large enough to minimize the read/write operations and still be memory efficient. This is really simple and it works.
void write(InputStream is, OutputStream os) throws IOException {
    byte[] buf = new byte[102400]; //optimize the size of the buffer to your needs
    int num;
    while((num = is.read(buf)) != -1){
        os.write(buf, 0, num);
    }
}
EDIT:
I just realized that you need to shuffle the lines, so this code will not work for you as is, but the concept still remains the same.