reverse chunks in a file - java

My basic Java problem is this: I need to read in a file by chunks, then reverse the order of the chunks, then write that out to a new file. My first (naive) attempt followed this approach:
read a chunk from the file.
reverse the bytes of the chunk
push the bytes one at a time to the front of a results list
repeat for all chunks
write result list to new file.
So this is basically a very stupid and slow way to solve the problem, but it generates the correct output that I am looking for. To try to improve the situation, I changed to this algorithm:
read a chunk from the file
push that chunk onto the front of a list of arrays
repeat for all chunks
foreach chunk, write to new file
And to my mind, that should produce the same output. Except it doesn't, and I am quite confused. The first chunk in the result file matches with both methods, but the rest of the file is completely different.
Here is the meat of the Java code I am using:
FileInputStream in;
FileOutputStream out, out2;
Byte[] t = new Byte[0];
LinkedList<Byte> reversed_data = new LinkedList<Byte>();
byte[] data = new byte[bufferSize];
LinkedList<byte[]> revd2 = new LinkedList<byte[]>();
try {
    in = new FileInputStream(infile);
    out = new FileOutputStream(outfile1);
    out2 = new FileOutputStream(outfile2);
} catch (FileNotFoundException e) {
    e.printStackTrace();
    return;
}
while (in.read(data) != -1)
{
    revd2.addFirst(data);
    byte[] revd = reverse(data);
    for (byte b : revd)
    {
        reversed_data.addFirst(b);
    }
}
for (Byte b : reversed_data)
{
    out.write(b);
}
for (byte[] b : revd2)
{
    out2.write(b);
}
At http://pastie.org/3113665 you can see a complete example program (along with my debugging attempts). For simplicity I am using a bufferSize that evenly divides the size of the file, so all chunks will be the same size, but this won't hold in the real world. My question is, WHY don't these two methods generate the same output? It's driving me crazy because I can't grok it.

You're constantly overwriting the data you've read previously.
while(in.read(data) != -1)
{
    revd2.addFirst(data);
    // ignore byte-wise stuff
}
You're adding the same object over and over to the list revd2, so each list node ends up holding a reference to data, which is filled with the result of the last read. I suggest replacing that with revd2.addFirst(data.clone()).
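For illustration, here is a minimal sketch of the read loop with that fix. It also copies only the bytes actually returned by read(), and it never lets reverse() touch the stored chunk (reverse() is the helper from the pastie; whether it modifies its argument in place is an assumption here):
int n;
while ((n = in.read(data)) != -1)
{
    //give each list node its own copy of the bytes that were actually read
    revd2.addFirst(java.util.Arrays.copyOf(data, n));
    byte[] revd = reverse(java.util.Arrays.copyOf(data, n));
    for (byte b : revd)
    {
        reversed_data.addFirst(b);
    }
}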

My guess is you want to change
revd2.addFirst(data);
byte[] revd = reverse(data);
to the following, so that the reversed copy is added to the front of the list:
byte[] revd = reverse(data);
revd2.addFirst(revd);

Related

ZipFile: Wrong values when reading

I am creating a zip file with one directory and a single compressed text file inside it.
Code to create the zip file
try(ZipOutputStream zos=new ZipOutputStream(new FileOutputStream("E:/TestFile.zip")))
{
//comment,level,method for all entries
zos.setComment("Test Zip File");
zos.setLevel(Deflater.BEST_COMPRESSION);
zos.setMethod(Deflater.DEFLATED);
//Creating Directories[ends with a forward slash]
{
ZipEntry dir1=new ZipEntry("Directory/");
//Give it a comment
dir1.setComment("Directory");
//Some extra data
dir1.setExtra("Hello".getBytes());
//Set Creation,Access,Modification Time
FileTime time=FileTime.fromMillis(System.currentTimeMillis());
dir1.setCreationTime(time);
dir1.setLastAccessTime(time);
dir1.setLastModifiedTime(time);
//put the entry & close it
zos.putNextEntry(dir1);
zos.closeEntry();
}
//Creating a fully compressed file inside the directory with all information
{
ZipEntry file=new ZipEntry("Directory/Test.txt");
//Meta Data
{
//Give it a comment
file.setComment("A File");
//Some extra data
file.setExtra("World".getBytes());
//Set Creation,Access,Modification Time
FileTime time=FileTime.fromMillis(System.currentTimeMillis());
file.setCreationTime(time);
file.setLastAccessTime(time);
file.setLastModifiedTime(time);
}
//Byte Data
{
//put entry for writing
zos.putNextEntry(file);
byte[] data="Hello World Hello World".getBytes();
//Compress Data
Deflater deflater=new Deflater(9);
deflater.setDictionary("Hello World ".getBytes());
deflater.setInput(data);
deflater.finish();
byte[] output=new byte[100];
int compressed=deflater.deflate(output);
//Write Data
CRC32 check=new CRC32();
check.update(data);
file.setSize(deflater.getBytesRead());
file.setCrc(check.getValue());
file.setCompressedSize(compressed);
zos.write(output,0,compressed);
//end data
System.out.println(deflater.getBytesRead()+"/"+compressed);
deflater.end();
}
//close the entry
zos.closeEntry();
}
}
}
Upon writing the file, the size of the byte data uncompressed is 23 bytes and the size of the data compressed is 15. I am using every method inside ZipEntry just to test whether I can retrieve all the values correctly upon reading it.
Upon reading it using the ZipFile class and not ZipInputStream (bug: getSize() always returns -1 with ZipInputStream), using this code:
//reading zip file using ZipFile
public static void main(String[] args)throws Exception
{
try(ZipFile zis=new ZipFile("E:/TestFile.zip"))
{
Enumeration<? extends ZipEntry> entries=zis.entries();
while(entries.hasMoreElements())
{
ZipEntry entry=entries.nextElement();
System.out.println("Name="+entry.getName());
System.out.println("Is Directory="+entry.isDirectory());
System.out.println("Comment="+entry.getComment());
System.out.println("Creation Time="+entry.getCreationTime());
System.out.println("Access Time="+entry.getLastAccessTime());
System.out.println("Modification Time="+entry.getLastModifiedTime());
System.out.println("CRC="+entry.getCrc());
System.out.println("Real Size="+entry.getSize());
System.out.println("Compressed Size="+entry.getCompressedSize());
System.out.println("Optional Data="+new String(entry.getExtra()));
System.out.println("Method="+entry.getMethod());
if(!entry.isDirectory())
{
Inflater inflater=new Inflater();
try(InputStream is=zis.getInputStream(entry))
{
byte[] originalData=new byte[(int)entry.getSize()];
inflater.setInput(is.readAllBytes());
int realLength=inflater.inflate(originalData);
if(inflater.needsDictionary())
{
inflater.setDictionary("Hello World ".getBytes());
realLength=inflater.inflate(originalData);
}
inflater.end();
System.out.println("Data="+new String(originalData,0,realLength));
}
}
System.out.println("=====================================================");
}
}
}
I get this output
Name=Directory/
Is Directory=true
Comment=Directory
Creation Time=null
Access Time=null
Modification Time=2022-01-24T17:00:25Z
CRC=0
Real Size=0
Compressed Size=2
Optional Data=UTaHello
Method=8
=====================================================
Name=Directory/Test.txt
Is Directory=false
Comment=A File
Creation Time=null
Access Time=null
Modification Time=2022-01-24T17:00:25Z
CRC=2483042136
Real Size=15
Compressed Size=17
Optional Data=UT��aWorld
Method=8
Data=Hello World Hel
==================================================
There is a lot of wrong output in this code.
For the directory:
1) Creation Time & Access Time are null [even though I have specified them in the write method]
2) Extra Data [Optional Data] has the wrong encoding
For the file:
1) Creation Time & Access Time are null [even though I have specified them in the write method]
2) getSize() & getCompressedSize() return the wrong values. I specified these values manually with setSize() & setCompressedSize() when creating the file; the values were 23 and 15, but it returns 15 and 17
3) Extra Data [Optional Data] has the wrong encoding
4) Since getSize() returns an incorrect size, it doesn't display the whole data [Hello World Hel]
With so many things going wrong I thought to post this as one question rather than multiple small ones, as they all seem related. I am a complete beginner in writing zip files, so any direction on where to go from here would be greatly appreciated.
I can read the data of a zip entry into a buffer using a while loop if the size is not known or incorrect, which is not a problem, but why would they even create set and get size methods if they knew we would be doing this most of the time anyway? What's the point?
After much research I was able to solve 70% of the problems. The others can't be solved given the nature of how ZipOutputStream & ZipFile read the data.
Problem 1: Incorrect values returned by getSize() & getCompressedSize()
1) During Writing
I was blind not to have seen this earlier, but ZipOutputStream already does the compression for us and I was double compressing by using my own deflater, so I removed that code. I also realized that you must specify these values yourself only when you use the STORED method (a short sketch of that case follows the refactored code below); otherwise they are computed for you from the data. So, refactoring my zip writing code, this is how it looks:
try(ZipOutputStream zos=new ZipOutputStream(new FileOutputStream("E:/TestFile2.zip")))
{
//comment,level,method for all entries
zos.setComment("Test Zip File");
//Auto Compression
zos.setMethod(ZipOutputStream.DEFLATED);
zos.setLevel(9);
//Creating Directories[ends with a forward slash]
{
ZipEntry dir1=new ZipEntry("Directory/");
//Give it a comment
dir1.setComment("Directory");
//Some extra data
dir1.setExtra("Hello".getBytes());
//Set Creation,Access,Modification Time
FileTime time=FileTime.fromMillis(System.currentTimeMillis());
dir1.setCreationTime(time);
dir1.setLastAccessTime(time);
dir1.setLastModifiedTime(time);
//put the entry & close it
zos.putNextEntry(dir1);
zos.closeEntry();
}
//Creating a fully compressed file inside the directory with all information
{
ZipEntry file=new ZipEntry("Directory/Test.txt");
//Meta Data
{
//Give it a comment
file.setComment("A File");
//Some extra data
file.setExtra("World".getBytes());
//Set Creation,Access,Modification Time
FileTime time=FileTime.fromMillis(System.currentTimeMillis());
file.setCreationTime(time);
file.setLastAccessTime(time);
file.setLastModifiedTime(time);
}
//Byte Data
{
byte[] data="Hello World Hello World".getBytes();
//Data
zos.putNextEntry(file);
zos.write(data);
zos.flush();
}
//close the entry
zos.closeEntry();
}
//finish writing the zip file without closing stream
zos.finish();
}
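For contrast, here is a minimal sketch of a STORED entry, the one case where you do have to supply the size, compressed size and CRC yourself before putNextEntry(). It would sit inside the same try block as the code above; the entry name and content are made up for illustration:
ZipEntry storedFile = new ZipEntry("Directory/Stored.txt"); //hypothetical entry name
storedFile.setMethod(ZipEntry.STORED);                      //no compression
byte[] storedData = "stored as-is".getBytes();
CRC32 crc = new CRC32();
crc.update(storedData);
storedFile.setSize(storedData.length);                      //required for STORED
storedFile.setCompressedSize(storedData.length);            //must equal the size for STORED
storedFile.setCrc(crc.getValue());                          //required for STORED
zos.putNextEntry(storedFile);
zos.write(storedData);
zos.closeEntry();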
2) During Reading
To get the correct size & compressed size values there are 2 approaches:
-> If you read the file using the ZipFile class, the values come out correctly.
-> If you use ZipInputStream, then these values are computed only after you have read all the bytes from the entry (more info here):
if(!entry.isDirectory())
{
try(ByteArrayOutputStream baos=new ByteArrayOutputStream())
{
int read;
byte[] data=new byte[10];
while((read=zipInputStream.read(data))>0){baos.write(data,0,read);}
System.out.println("Data="+new String(baos.toByteArray()));
}
}
//Now these values are correct
System.out.println("CRC="+entry.getCrc());
System.out.println("Real Size="+entry.getSize());
System.out.println("Compressed Size="+entry.getCompressedSize());
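For completeness, here is a minimal sketch of the surrounding ZipInputStream loop that the fragment above assumes. The file name matches the refactored writing code; the class name is made up, and each entry is drained before the metadata getters are called:
import java.io.*;
import java.util.zip.*;

public class ReadWithZipInputStream {
    public static void main(String[] args) throws Exception {
        try (ZipInputStream zipInputStream = new ZipInputStream(new FileInputStream("E:/TestFile2.zip"))) {
            ZipEntry entry;
            while ((entry = zipInputStream.getNextEntry()) != null) {
                System.out.println("Name=" + entry.getName());
                if (!entry.isDirectory()) {
                    //drain the entry first; size, compressed size and CRC are only known afterwards
                    try (ByteArrayOutputStream baos = new ByteArrayOutputStream()) {
                        int read;
                        byte[] data = new byte[10];
                        while ((read = zipInputStream.read(data)) > 0) { baos.write(data, 0, read); }
                        System.out.println("Data=" + new String(baos.toByteArray()));
                    }
                }
                //now these values are correct
                System.out.println("CRC=" + entry.getCrc());
                System.out.println("Real Size=" + entry.getSize());
                System.out.println("Compressed Size=" + entry.getCompressedSize());
                zipInputStream.closeEntry();
            }
        }
    }
}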
Problem 2: Incorrect Extra data
This post pretty much explains everything
Here is the code
ByteBuffer extraData = ByteBuffer.wrap(entry.getExtra()).order(ByteOrder.LITTLE_ENDIAN);
while (extraData.hasRemaining())
{
    int id = extraData.getShort() & 0xffff;
    int length = extraData.getShort() & 0xffff;
    if (id == 0x756e)
    {
        int crc32 = extraData.getInt();
        short permissions = extraData.getShort();
        int linkLengthOrDeviceNumbers = extraData.getInt(),
            userID = extraData.getChar(),
            groupID = extraData.getChar();
        ByteBuffer linkDestBuffer = extraData.slice().limit(length - 14);
        String linkDestination = StandardCharsets.UTF_8.decode(linkDestBuffer).toString();
    }
    else
    {
        //unrecognised block: its payload is the next `length` bytes (e.g. our own extra data)
        byte[] ourData = new byte[length];
        extraData.get(ourData);
        //do stuff
    }
}
Unsolved Problems
There are still 3 values which return different results based on which method you use to read the file. I made a table of my observations per entry
                      ZipFile            ZipInputStream
getCreationTime()     null               <correct value>
getLastAccessTime()   null               <correct value>
getComment()          <correct value>    null
Apparently, from the bug report, this is expected behavior, since a zip file is random access and a zip input stream is sequential, so they access the data differently.
From my observations, using ZipInputStream returns the best results, so I will continue to use that.

Splitting binary file on tags?

I have a ModSecurity log file which contains parts that hold either text or binary data. I need to split this file according to the tags noted at the start of each part, so I can filter the data for permanent storage.
So for example I have:
--tag1--
<text>
--tag2--
<binary data>
--tag3--
<text>
At first I thought it was all text, so I made a parser to parse the different pieces by reading each line and using a pattern to check whether it was a new part. But now I need to read the file in binary. So what would be the best way to achieve this?
So far I've made a test to get a specific part by keeping the last several characters in a String buffer to check for the tag, and then starting to print once the buffer contains that string. The same is done to stop. However, since the buffer needs to fill up before it can detect the end tag, the end tag will already have been added to the byte array, so once the part is complete I remove the final bytes from the array to get the part I need.
public byte[] binaryDataReader(String startTag, String endTag) throws IOException{
File file = new File("20160926-161148-V#ksog7ZjVRfyQUPtAdOmgAAAAM");
try (FileInputStream fis = new FileInputStream(file);ByteArrayOutputStream buffer = new ByteArrayOutputStream()) {
System.out.println("Total file size to read (in bytes) : "+ fis.available());
int content;
String lastChars = "";
String status = "nok";
while ((content = fis.read()) != -1) {
if (lastChars.length() > 14) {
lastChars = lastChars.substring(lastChars.length() - 14, lastChars.length()) + (char) content;
} else {
lastChars += (char) content;
}
if(status.equals("ok")){
buffer.write(content);
}
if (lastChars.equals(startTag)) {
status = "ok";
}else if(lastChars.equals(endTag)){
status = "nok";
}
}
buffer.flush();
byte[] data = buffer.toByteArray();
data = Arrays.copyOf(data, data.length-15);
return data;
} catch (IOException e) {
//log
throw e;
}
}
Now I need to make this a general solution for many more tags by including patterns. But I was wondering: is this a decent way of splitting a binary file, or is there a better/easier way to achieve this?

How to split the ByteArray by reading from the file in C++?

I have written a Java program to write a ByteArray to a file. That resulting ByteArray is the concatenation of these three ByteArrays:
The first 2 bytes are my schemaId, which I have represented using the short data type.
The next 8 bytes are my Last Modified Date, which I have represented using the long data type.
And the remaining bytes are of variable size and hold the actual value of my attributes.
So I now have a file whose first line contains the resulting ByteArray with all the bytes described above. Now I need to read that file from a C++ program, read the first line containing the ByteArray, and then split it accordingly so that I can extract my schemaId, Last Modified Date and the actual attribute value from it.
I have always done all my coding in Java and I am new to C++. I am able to write a program in C++ to read the file, but I am not sure how to read that ByteArray in such a way that I can split it as described above.
Below is my C++ program which reads the file and prints it to the console:
int main () {
string line;
//the variable of type ifstream:
ifstream myfile ("bytearrayfile");
//check to see if the file is opened:
if (myfile.is_open())
{
//while there are still lines in the
//file, keep reading:
while (! myfile.eof() )
{
//place the line from myfile into the
//line variable:
getline (myfile,line);
//display the line we gathered:
// and here split the byte array accordingly..
cout << line << endl;
}
//close the stream:
myfile.close();
}
else cout << "Unable to open file";
return 0;
}
Can anyone help me with that? Thanks.
Update
Below is my Java code which writes the resulting ByteArray into a file; it is this same file that I now need to read back from C++:
public static void main(String[] args) throws Exception {
String os = "whatever os is";
byte[] avroBinaryValue = os.getBytes();
long lastModifiedDate = 1379811105109L;
short schemaId = 32767;
ByteArrayOutputStream byteOsTest = new ByteArrayOutputStream();
DataOutputStream outTest = new DataOutputStream(byteOsTest);
outTest.writeShort(schemaId);
outTest.writeLong(lastModifiedDate);
outTest.writeInt(avroBinaryValue.length);
outTest.write(avroBinaryValue);
byte[] allWrittenBytesTest = byteOsTest.toByteArray();
DataInputStream inTest = new DataInputStream(new ByteArrayInputStream(allWrittenBytesTest));
short schemaIdTest = inTest.readShort();
long lastModifiedDateTest = inTest.readLong();
int sizeAvroTest = inTest.readInt();
byte[] avroBinaryValue1 = new byte[sizeAvroTest];
inTest.read(avroBinaryValue1, 0, sizeAvroTest);
System.out.println(schemaIdTest);
System.out.println(lastModifiedDateTest);
System.out.println(new String(avroBinaryValue1));
writeFile(allWrittenBytesTest);
}
/**
* Write the file in Java
* @param byteArray
*/
public static void writeFile(byte[] byteArray) {
try{
File file = new File("bytearrayfile");
FileOutputStream output = new FileOutputStream(file);
IOUtils.write(byteArray, output);
} catch (Exception ex) {
ex.printStackTrace();
}
}
It doesn't look like you want to use std::getline to read this data. Your file isn't written as text data on a line-by-line basis - it basically has a binary format.
You can use the read method of std::ifstream to read arbitrary chunks of data from an input stream. You probably want to open the file in binary mode:
std::ifstream myfile("bytearrayfile", std::ios::binary);
Fundamentally the method you would use to read each record from the file is:
uint16_t schemaId;
uint64_t lastModifiedDate;
uint32_t binaryLength;
myfile.read(reinterpret_cast<char*>(&schemaId), sizeof(schemaId));
myfile.read(reinterpret_cast<char*>(&lastModifiedDate), sizeof(lastModifiedDate));
myfile.read(reinterpret_cast<char*>(&binaryLength), sizeof(binaryLength));
This will read the three static members of your data structure from the file. Because your data is variable size, you probably need to allocate a buffer to read it into, for example:
std::unique_ptr<char[]> binaryBuf(new char[binaryLength]);
myfile.read(binaryBuf.get(), binaryLength);
The above are examples only to illustrate how you would approach this in C++. You will need to be aware of the following things:
There's no error checking in the above examples. You'll need to check that the calls to ifstream::read are successful and return the correct amount of data.
Endianness may be an issue, depending on the platform the data originates from and is being read on. In this case Java's DataOutputStream writes multi-byte values in big-endian order, so on a little-endian machine you will need to byte-swap the values after reading them.
Interpreting the lastModifiedDate field may require you to write a function to convert it from whatever format Java uses (I have no idea about Java).

Binary search inside a buffer

So I have implemented a working program which searches a file using the binary search method:
public int BSearch(int x1, int x2) throws IOException {
    int current_key;
    middle = (x1 + x2) / 2;
    if (x1 > x2) {
        middle = -1; //middle==-1 is condition of 'key not found'
        return middle;
    }
    MyFile.seek(middle * 4);
    current_key = MyFile.readInt();
    da++;
    if (current_key == key) {
        return middle;
    }
    else if (key < current_key) {
        x2 = middle - 1;
        return BSearch(x1, x2);
    }
    else {
        x1 = middle + 1;
        return BSearch(x1, x2);
    }
}
Now I want to transform it so it reads the file piece by piece (say 1 KB at a time) into a buffer, and then binary searches that buffer. If the key is not found in that buffer, I read further into the file, and so on. I want to clarify though that the buffer is a handmade buffer like this (correct me):
byte[] buf = new byte[1024];
MyFile.read(buf);
ByteArrayInputStream bis = new ByteArrayInputStream(buf);
DataInputStream ois = new DataInputStream(bis);
current_key = ois.readInt();
A big problem (among others) is that I don't know how I'll read from a certain position of the buffer.
OK, I think I managed to do it by copying the buffer to a new int[] array element by element. I want to believe it is still faster than accessing the disk every time I want to load a buffer.
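If the element-by-element copy ever becomes a bottleneck, a ByteBuffer view can read an int at any position of the byte array without copying. A small sketch, assuming MyFile is the RandomAccessFile from the question (needs java.nio.ByteBuffer and java.nio.IntBuffer):
byte[] buf = new byte[1024];
int bytesRead = MyFile.read(buf);
//wrap only the bytes actually read; ByteBuffer is big-endian by default,
//which matches what readInt()/writeInt() produce
IntBuffer keys = ByteBuffer.wrap(buf, 0, bytesRead).asIntBuffer();
int keysInBuffer = keys.remaining();
int current_key = keys.get(keysInBuffer / 2); //read the key at any index inside the buffer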

Most efficient merging of 2 text files.

So I have large (around 4 gigs each) txt files in pairs and I need to create a 3rd file which would consist of the 2 files in shuffle mode. The following equation presents it best:
3rdfile = (4 lines from file 1) + (4 lines from file 2), and this is repeated until I hit the end of file 1 (both input files will have the same length - this is by definition). Here is the code I'm using now, but it doesn't scale very well on large files. I was wondering if there is a more efficient way to do this - would working with a memory mapped file help? All ideas are welcome.
public static void mergeFastq(String forwardFile, String reverseFile, String outputFile) {
try {
BufferedReader inputReaderForward = new BufferedReader(new FileReader(forwardFile));
BufferedReader inputReaderReverse = new BufferedReader(new FileReader(reverseFile));
PrintWriter outputWriter = new PrintWriter(new FileWriter(outputFile, true));
String forwardLine = null;
System.out.println("Begin merging Fastq files");
int readsMerge = 0;
while ((forwardLine = inputReaderForward.readLine()) != null) {
//append the forward file
outputWriter.println(forwardLine);
outputWriter.println(inputReaderForward.readLine());
outputWriter.println(inputReaderForward.readLine());
outputWriter.println(inputReaderForward.readLine());
//append the reverse file
outputWriter.println(inputReaderReverse.readLine());
outputWriter.println(inputReaderReverse.readLine());
outputWriter.println(inputReaderReverse.readLine());
outputWriter.println(inputReaderReverse.readLine());
readsMerge++;
if(readsMerge % 10000 == 0) {
System.out.println("[" + now() + "] Merged 10000");
readsMerge = 0;
}
}
inputReaderForward.close();
inputReaderReverse.close();
outputWriter.close();
} catch (IOException ex) {
Logger.getLogger(Utilities.class.getName()).log(Level.SEVERE, "Error while merging FastQ files", ex);
}
}
Maybe you also want to try to use a BufferedWriter to cut down your file IO operations.
http://download.oracle.com/javase/6/docs/api/java/io/BufferedWriter.html
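For example, the PrintWriter from the question could sit on top of a BufferedWriter with an explicit buffer size; the 1 MB figure below is only a guess to tune:
PrintWriter outputWriter = new PrintWriter(
        new BufferedWriter(new FileWriter(outputFile, true), 1 << 20)); //~1 MB write buffer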
A simple answer is to use a bigger buffer, which helps reduce the total number of I/O calls being made.
Usually, memory mapped IO with FileChannel (see Java NIO) would be used for handling large data file IO. In this case, however, it is not appropriate, as you need to inspect the file content in order to determine the boundary of every 4 lines.
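Likewise, the two readers can be given buffers larger than the 8192-character default; the sizes here are again just guesses:
BufferedReader inputReaderForward = new BufferedReader(new FileReader(forwardFile), 1 << 20);
BufferedReader inputReaderReverse = new BufferedReader(new FileReader(reverseFile), 1 << 20);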
If performance was the main requirement, then I would code this function in C or C++ instead of Java.
But regardless of the language used, what I would do is try to manage memory myself. I would create two large buffers, say 128 MB or more each, and fill them with data from the two text files. Then you need a 3rd buffer that is twice as big as the previous two. The algorithm starts moving characters one by one from input buffer #1 to the destination buffer, counting EOLs as it goes. Once you reach the 4th line, you store the current position in that buffer and repeat the same process with the 2nd input buffer. You continue alternating between the two input buffers, replenishing each buffer when you have consumed all the data in it. Each time you refill the input buffers you can also write out the destination buffer and empty it.
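Staying in Java, here is a minimal sketch in the same spirit: large buffered streams and byte-level copying of 4-line blocks, so no String is built per line. It is not a literal translation of the two-buffer scheme above; the method name is made up, \n line endings are assumed, java.io.* imports are needed, and error handling is left out:
//Copies the next `lines` lines (including their newline bytes) from in to out.
//Returns false once the input was already exhausted and nothing was copied.
static boolean copyLines(InputStream in, OutputStream out, int lines) throws IOException {
    int b, copied = 0, newlines = 0;
    while (newlines < lines && (b = in.read()) != -1) {
        out.write(b);
        copied++;
        if (b == '\n') newlines++;
    }
    return copied > 0;
}

static void mergeFastqBuffered(String forwardFile, String reverseFile, String outputFile) throws IOException {
    try (InputStream fwd = new BufferedInputStream(new FileInputStream(forwardFile), 1 << 20);
         InputStream rev = new BufferedInputStream(new FileInputStream(reverseFile), 1 << 20);
         OutputStream out = new BufferedOutputStream(new FileOutputStream(outputFile), 1 << 21)) {
        while (copyLines(fwd, out, 4)) { //4 lines from file 1 ...
            copyLines(rev, out, 4);      //... then 4 lines from file 2
        }
    }
}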
Buffer your read and write operations. The buffer needs to be large enough to minimize the read/write operations and still be memory efficient. This is really simple and it works.
void write(InputStream is, OutputStream os) throws IOException {
    byte[] buf = new byte[102400]; //optimize the size of buffer to your needs
    int num;
    while ((num = is.read(buf)) != -1) {
        os.write(buf, 0, num);
    }
}
EDIT:
I just realized that you need to shuffle the lines, so this code will not work for you as is, but the concept still remains the same.
