Binary search inside a buffer - Java

So I have implemented a working program which searches a file using the binary search method:
public int BSearch(int x1, int x2) throws IOException {
    int current_key;
    middle = (x1 + x2) / 2;
    if (x1 > x2) {
        middle = -1; //middle==-1 is condition of 'key not found'
        return middle;
    }
    MyFile.seek(middle * 4);
    current_key = MyFile.readInt();
    da++;
    if (current_key == key) {
        return middle;
    } else if (key < current_key) {
        x2 = middle - 1;
        return BSearch(x1, x2);
    } else {
        x1 = middle + 1;
        return BSearch(x1, x2);
    }
}
Now I want to transform it so it reads the file piece-by-piece (say 1KB each time) into a buffer, and then binary searches that buffer. If the key is not found in that buffer, I read further into the file, and so on. To clarify, the buffer is a handmade buffer like this (correct me):
byte[] buf = new byte[1024];
MyFile.read(buf);
ByteArrayInputStream bis = new ByteArrayInputStream(buf);
DataInputStream ois = new DataInputStream(bis);
current_key = ois.readInt();
A big problem (among others) is that I don't know how to read from a certain position of the buffer.

OK, I think I managed to do it by copying the buffer to a new int[] array element-by-element. I want to believe it is still faster than accessing the disk every time I want to load a buffer.
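A simpler route than the element-by-element copy is to wrap the byte array in a java.nio.ByteBuffer, which can read an int at any offset without any copying. A minimal sketch (mine, not tested against your file format), assuming the file stores big-endian ints (the format readInt() expects) and reusing the MyFile and key names from above; this would be the body of a buffered search method:
byte[] buf = new byte[1024];
int bytesRead = MyFile.read(buf);        // may be less than 1024 at end of file
ByteBuffer bb = ByteBuffer.wrap(buf, 0, bytesRead);
int count = bytesRead / 4;               // number of complete ints in the buffer
int lo = 0, hi = count - 1;
while (lo <= hi) {
    int mid = (lo + hi) >>> 1;           // unsigned shift avoids (lo + hi) overflow
    int currentKey = bb.getInt(mid * 4); // random access into the buffer
    if (currentKey == key) {
        return mid;                      // index within this buffer
    } else if (key < currentKey) {
        hi = mid - 1;
    } else {
        lo = mid + 1;
    }
}
return -1; // key not in this buffer; load the next 1KB chunk and repeat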

Related

ZipFile: Wrong values when reading

I am creating a zip file with one directory and a single compressed text file inside it.
Code to create the zip file:
try(ZipOutputStream zos = new ZipOutputStream(new FileOutputStream("E:/TestFile.zip")))
{
    //comment, level, method for all entries
    zos.setComment("Test Zip File");
    zos.setLevel(Deflater.BEST_COMPRESSION);
    zos.setMethod(Deflater.DEFLATED);
    //Creating directories [name ends with a forward slash]
    {
        ZipEntry dir1 = new ZipEntry("Directory/");
        //Give it a comment
        dir1.setComment("Directory");
        //Some extra data
        dir1.setExtra("Hello".getBytes());
        //Set creation, access, modification time
        FileTime time = FileTime.fromMillis(System.currentTimeMillis());
        dir1.setCreationTime(time);
        dir1.setLastAccessTime(time);
        dir1.setLastModifiedTime(time);
        //put the entry & close it
        zos.putNextEntry(dir1);
        zos.closeEntry();
    }
    //Creating a fully compressed file inside the directory with all information
    {
        ZipEntry file = new ZipEntry("Directory/Test.txt");
        //Meta data
        {
            //Give it a comment
            file.setComment("A File");
            //Some extra data
            file.setExtra("World".getBytes());
            //Set creation, access, modification time
            FileTime time = FileTime.fromMillis(System.currentTimeMillis());
            file.setCreationTime(time);
            file.setLastAccessTime(time);
            file.setLastModifiedTime(time);
        }
        //Byte data
        {
            //put entry for writing
            zos.putNextEntry(file);
            byte[] data = "Hello World Hello World".getBytes();
            //Compress data
            Deflater deflater = new Deflater(9);
            deflater.setDictionary("Hello World ".getBytes());
            deflater.setInput(data);
            deflater.finish();
            byte[] output = new byte[100];
            int compressed = deflater.deflate(output);
            //Write data
            CRC32 check = new CRC32();
            check.update(data);
            file.setSize(deflater.getBytesRead());
            file.setCrc(check.getValue());
            file.setCompressedSize(compressed);
            zos.write(output, 0, compressed);
            //end data
            System.out.println(deflater.getBytesRead() + "/" + compressed);
            deflater.end();
        }
        //close the entry
        zos.closeEntry();
    }
}
Upon writing the file, the uncompressed byte data is 23 bytes and the compressed data is 15. I am using every method in ZipEntry just to test whether I can retrieve all the values correctly upon reading it back.
Upon reading it using the ZipFile class & not ZipInputStream (bug: getSize() always returns -1), using this code:
//reading zip file using ZipFile
public static void main(String[] args) throws Exception
{
    try(ZipFile zis = new ZipFile("E:/TestFile.zip"))
    {
        Enumeration<? extends ZipEntry> entries = zis.entries();
        while(entries.hasMoreElements())
        {
            ZipEntry entry = entries.nextElement();
            System.out.println("Name=" + entry.getName());
            System.out.println("Is Directory=" + entry.isDirectory());
            System.out.println("Comment=" + entry.getComment());
            System.out.println("Creation Time=" + entry.getCreationTime());
            System.out.println("Access Time=" + entry.getLastAccessTime());
            System.out.println("Modification Time=" + entry.getLastModifiedTime());
            System.out.println("CRC=" + entry.getCrc());
            System.out.println("Real Size=" + entry.getSize());
            System.out.println("Compressed Size=" + entry.getCompressedSize());
            System.out.println("Optional Data=" + new String(entry.getExtra()));
            System.out.println("Method=" + entry.getMethod());
            if(!entry.isDirectory())
            {
                Inflater inflater = new Inflater();
                try(InputStream is = zis.getInputStream(entry))
                {
                    byte[] originalData = new byte[(int)entry.getSize()];
                    inflater.setInput(is.readAllBytes());
                    int realLength = inflater.inflate(originalData);
                    if(inflater.needsDictionary())
                    {
                        inflater.setDictionary("Hello World ".getBytes());
                        realLength = inflater.inflate(originalData);
                    }
                    inflater.end();
                    System.out.println("Data=" + new String(originalData, 0, realLength));
                }
            }
            System.out.println("=====================================================");
        }
    }
}
I get this output
Name=Directory/
Is Directory=true
Comment=Directory
Creation Time=null
Access Time=null
Modification Time=2022-01-24T17:00:25Z
CRC=0
Real Size=0
Compressed Size=2
Optional Data=UTaHello
Method=8
=====================================================
Name=Directory/Test.txt
Is Directory=false
Comment=A File
Creation Time=null
Access Time=null
Modification Time=2022-01-24T17:00:25Z
CRC=2483042136
Real Size=15
Compressed Size=17
Optional Data=UT��aWorld
Method=8
Data=Hello World Hel
=====================================================
There is a lot of wrong output in this code.
For the directory:
1) Creation Time & Access Time are null [even though I specified them in the write method]
2) Extra Data [Optional Data] has the wrong encoding
For the file:
1) Creation Time & Access Time are null [even though I specified them in the write method]
2) getSize() & getCompressedSize() return the wrong values. I specified these values manually with setSize() & setCompressedSize() when creating the file; the values were 23 and 15, but it returns 15 and 17
3) Extra Data [Optional Data] has the wrong encoding
4) Since getSize() returns an incorrect size, it doesn't display the whole data [Hello World Hel]
With so many things going wrong, I thought to post this as one question rather than multiple small ones, as they all seem related. I am a complete beginner at writing zip files, so any direction on where to go from here would be greatly appreciated.
I can read the data of a zip entry using a while loop into a buffer if the size is unknown or incorrect, which is not a problem, but why would they even create set and get size methods if they knew we would be doing this most of the time anyway? What's the point?
After much research I was able to solve 70% of the problems. The others can't be solved given the nature of how a ZipOutputStream & a ZipFile read the data.
Problem 1: Incorrect values returned by getSize() & getCompressedSize()
1) During writing
I was blind not to have seen this earlier, but ZipOutputStream already does the compression for us, and I was double-compressing by using my own deflater, so I removed that code. I also realized that you must specify these values only when you are using the STORED method; otherwise they are computed for you from the data. After refactoring, my zip-writing code looks like this:
try(ZipOutputStream zos = new ZipOutputStream(new FileOutputStream("E:/TestFile2.zip")))
{
    //comment, level, method for all entries
    zos.setComment("Test Zip File");
    //Auto compression
    zos.setMethod(ZipOutputStream.DEFLATED);
    zos.setLevel(9);
    //Creating directories [name ends with a forward slash]
    {
        ZipEntry dir1 = new ZipEntry("Directory/");
        //Give it a comment
        dir1.setComment("Directory");
        //Some extra data
        dir1.setExtra("Hello".getBytes());
        //Set creation, access, modification time
        FileTime time = FileTime.fromMillis(System.currentTimeMillis());
        dir1.setCreationTime(time);
        dir1.setLastAccessTime(time);
        dir1.setLastModifiedTime(time);
        //put the entry & close it
        zos.putNextEntry(dir1);
        zos.closeEntry();
    }
    //Creating a fully compressed file inside the directory with all information
    {
        ZipEntry file = new ZipEntry("Directory/Test.txt");
        //Meta data
        {
            //Give it a comment
            file.setComment("A File");
            //Some extra data
            file.setExtra("World".getBytes());
            //Set creation, access, modification time
            FileTime time = FileTime.fromMillis(System.currentTimeMillis());
            file.setCreationTime(time);
            file.setLastAccessTime(time);
            file.setLastModifiedTime(time);
        }
        //Byte data
        {
            byte[] data = "Hello World Hello World".getBytes();
            //Data
            zos.putNextEntry(file);
            zos.write(data);
            zos.flush();
        }
        //close the entry
        zos.closeEntry();
    }
    //finish writing the zip file without closing the stream
    zos.finish();
}
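For contrast, here is a minimal sketch (my own addition, written to slot into the try block above) of the one case where the size setters are mandatory: a STORED (uncompressed) entry. ZipOutputStream rejects a STORED entry unless its size, compressed size, and CRC are set before putNextEntry():
//A STORED entry must carry its size, compressed size, and CRC up front,
//because ZipOutputStream does not compute them for uncompressed entries.
byte[] data = "Hello World Hello World".getBytes();
ZipEntry stored = new ZipEntry("Directory/Stored.txt");
stored.setMethod(ZipEntry.STORED);
CRC32 crc = new CRC32();
crc.update(data);
stored.setCrc(crc.getValue());
stored.setSize(data.length);           //for STORED, size == compressed size
stored.setCompressedSize(data.length);
zos.putNextEntry(stored);
zos.write(data);
zos.closeEntry();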
2) During reading
To get the correct size & compressed-size values there are 2 approaches:
-> If you read the file using the ZipFile class, the values come out correctly.
-> If you use ZipInputStream, these values are computed only after you have read all the bytes from the entry (more info here):
if(!entry.isDirectory())
{
    try(ByteArrayOutputStream baos = new ByteArrayOutputStream())
    {
        int read;
        byte[] data = new byte[10];
        while((read = zipInputStream.read(data)) > 0) { baos.write(data, 0, read); }
        System.out.println("Data=" + new String(baos.toByteArray()));
    }
}
//Now these values are correct
System.out.println("CRC=" + entry.getCrc());
System.out.println("Real Size=" + entry.getSize());
System.out.println("Compressed Size=" + entry.getCompressedSize());
Problem 2: Incorrect extra data
This post pretty much explains everything.
Here is the code:
ByteBuffer extraData = ByteBuffer.wrap(entry.getExtra()).order(ByteOrder.LITTLE_ENDIAN);
while(extraData.hasRemaining())
{
    int id = extraData.getShort() & 0xffff;
    int length = extraData.getShort() & 0xffff;
    if(id == 0x756e)
    {
        int crc32 = extraData.getInt();
        short permissions = extraData.getShort();
        int linkLengthOrDeviceNumbers = extraData.getInt();
        int userID = extraData.getChar();
        int groupID = extraData.getChar();
        ByteBuffer linkDestBuffer = extraData.slice().limit(length - 14);
        String linkDestination = StandardCharsets.UTF_8.decode(linkDestBuffer).toString();
    }
    else
    {
        extraData.position(extraData.position() + length);
        byte[] ourData = new byte[extraData.remaining()];
        extraData.get(ourData);
        //do stuff
    }
}
Unsolved problems
There are still 3 values which return different results depending on which method you use to read the file. I made a table of my observations per entry:

                      ZipFile            ZipInputStream
getCreationTime()     null               <correct value>
getLastAccessTime()   null               <correct value>
getComment()          <correct value>    null

Apparently, from the bug report, this is expected behavior: a zip file is random access and a zip input stream is sequential, so they access the data differently.
From my observations, ZipInputStream returns the best results, so I will continue to use that.
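If you need all three values for each entry, a workaround sketch (my own, not from the bug report) is to read the archive twice and merge the metadata, taking comments from ZipFile and timestamps from ZipInputStream, keyed by entry name (uses java.util.Map/HashMap and java.nio.file.attribute.FileTime):
Map<String, FileTime> created = new HashMap<>();
try (ZipInputStream zin = new ZipInputStream(new FileInputStream("E:/TestFile2.zip")))
{
    ZipEntry e;
    while ((e = zin.getNextEntry()) != null) {
        created.put(e.getName(), e.getCreationTime()); //times are reported here, per the table
    }
}
try (ZipFile zf = new ZipFile("E:/TestFile2.zip"))
{
    Enumeration<? extends ZipEntry> entries = zf.entries();
    while (entries.hasMoreElements()) {
        ZipEntry e = entries.nextElement();
        //comment comes from ZipFile, creation time from the ZipInputStream pass
        System.out.println(e.getName() + " comment=" + e.getComment()
                + " created=" + created.get(e.getName()));
    }
}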

Serializing an object that includes BufferedImages

As the title suggests, I'm trying to save to file an object that contains (among other variables, Strings, etc) a few BufferedImages.
I found this:
How to serialize an object that includes BufferedImages
And it works like a charm, but with a small setback: it works well if your object contains only ONE image.
I've been struggling to get his solution to work with more than one image (which in theory should work) but each time I read the file in, I get my object back, I get the correct number of images, but only the first image actually gets read in; the others are just null images that have no data in them.
This is what my object looks like:
class Obj implements Serializable
{
    transient List<BufferedImage> imageSelection = new ArrayList<BufferedImage>();
    // ... other vars and functions
    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();
        out.writeInt(imageSelection.size()); // how many images are serialized?
        for (BufferedImage eachImage : imageSelection) {
            ImageIO.write(eachImage, "jpg", out); // png is lossless
        }
    }
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        final int imageCount = in.readInt();
        imageSelection = new ArrayList<BufferedImage>(imageCount);
        for (int i = 0; i < imageCount; i++) {
            imageSelection.add(ImageIO.read(in));
        }
    }
}
This is how I'm writing and reading the object to and from a file:
// writing
// writing
try (
    FileOutputStream file = new FileOutputStream(objName + ".ser");
    ObjectOutputStream output = new ObjectOutputStream(file);
) {
    output.writeObject(myObjs);
} catch (IOException ex) {
    ex.printStackTrace();
}
// reading
try (
    FileInputStream inputStr = new FileInputStream(file.getAbsolutePath());
    ObjectInputStream input = new ObjectInputStream(inputStr);
) {
    myObjs = (List<Obj>) input.readObject();
} catch (Exception ex) {
    ex.printStackTrace();
}
Even though I have a list of objects, they get read in correctly and each element of the list is populated accordingly, except for the BufferedImages.
Does anyone have any means of fixing this?
The problem is likely that ImageIO.read(...) incorrectly positions the stream after the first image read.
I see two options to fix this:
Rewrite the serialization of the BufferedImages to write the backing array(s) of the image, along with the height, width, color model/color space identifier, and other data required to recreate the BufferedImage. This requires a bit of code to correctly handle all kinds of images, so I'll skip the full details for now; a minimal sketch follows the code below. Might be faster and more accurate (but might send more data).
Continue to serialize using ImageIO, but buffer each write using a ByteArrayOutputStream, and prepend each image with its byte count. When reading back, start by reading the byte count, and make sure you fully read each image. This is trivial to implement, but some images might get converted or lose details (i.e. JPEG compression) due to file format constraints. Something like:
private void writeObject(ObjectOutputStream out) throws IOException {
    out.defaultWriteObject();
    out.writeInt(imageSelection.size()); // how many images are serialized?
    for (BufferedImage eachImage : imageSelection) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        ImageIO.write(eachImage, "jpg", buffer);
        out.writeInt(buffer.size()); // Prepend image with byte count
        buffer.writeTo(out);         // Write image
    }
}
private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
    in.defaultReadObject();
    int imageCount = in.readInt();
    imageSelection = new ArrayList<BufferedImage>(imageCount);
    for (int i = 0; i < imageCount; i++) {
        int size = in.readInt(); // Read byte count
        byte[] buffer = new byte[size];
        in.readFully(buffer); // Make sure you read all bytes of the image
        imageSelection.add(ImageIO.read(new ByteArrayInputStream(buffer)));
    }
}
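As for the first option, here is a minimal sketch of the idea (a hypothetical pair of helpers, mine rather than the original answer's, assuming every image can be represented as TYPE_INT_ARGB; custom color models and rasters need more care):
// Hypothetical helpers sketching option 1: serialize the pixel data directly.
private static void writeImage(ObjectOutputStream out, BufferedImage image) throws IOException {
    out.writeInt(image.getWidth());
    out.writeInt(image.getHeight());
    // Packed ARGB pixels; an int[] serializes natively and losslessly.
    int[] pixels = image.getRGB(0, 0, image.getWidth(), image.getHeight(), null, 0, image.getWidth());
    out.writeObject(pixels);
}
private static BufferedImage readImage(ObjectInputStream in) throws IOException, ClassNotFoundException {
    int width = in.readInt();
    int height = in.readInt();
    int[] pixels = (int[]) in.readObject();
    BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_ARGB);
    image.setRGB(0, 0, width, height, pixels, 0, width);
    return image;
}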

Reverse chunks in a file

My basic Java problem is this: I need to read in a file by chunks, then reverse the order of the chunks, then write that out to a new file. My first (naive) attempt followed this approach:
read a chunk from the file.
reverse the bytes of the chunk
push the bytes one at a time to the front of a results list
repeat for all chunks
write result list to new file.
So this is basically a very stupid and slow way to solve the problem, but it generates the correct output that I am looking for. To try to improve the situation, I changed to this algorithm:
read a chunk from the file
push that chunk onto the front of a list of arrays
repeat for all chunks
foreach chunk, write to new file
And to my mind, that should produce the same output, except it doesn't, and I am quite confused. The first chunk in the result file matches with both methods, but the rest of the file is completely different.
Here is the meat of the Java code I am using:
FileInputStream in;
FileOutputStream out, out2;
Byte[] t = new Byte[0];
LinkedList<Byte> reversed_data = new LinkedList<Byte>();
byte[] data = new byte[bufferSize];
LinkedList<byte[]> revd2 = new LinkedList<byte[]>();
try {
    in = new FileInputStream(infile);
    out = new FileOutputStream(outfile1);
    out2 = new FileOutputStream(outfile2);
} catch (FileNotFoundException e) {
    e.printStackTrace();
    return;
}
while (in.read(data) != -1)
{
    revd2.addFirst(data);
    byte[] revd = reverse(data);
    for (byte b : revd)
    {
        reversed_data.addFirst(b);
    }
}
for (Byte b : reversed_data)
{
    out.write(b);
}
for (byte[] b : revd2)
{
    out2.write(b);
}
At http://pastie.org/3113665 you can see a complete example program (along with my debugging attempts). For simplicity I am using a bufferSize that evenly divides the size of the file, so all chunks will be the same size, but this won't hold in the real world. My question is, WHY don't these two methods generate the same output? It's driving me crazy because I can't grok it.
You're constantly overwriting the data you've read previously.
while(in.read(data) != -1)
{
    revd2.addFirst(data);
    // ignore byte-wise stuff
}
You're adding the same object repeatedly to the list revd2, so each list node will finally contain a reference to data filled with the result of the last read. I suggest replacing that with revd2.addFirst(data.clone()).
My guess is you want to change
revd2.addFirst(data);
byte[] revd = reverse(data);
to the following, so the reversed copy is added to the start of the list:
byte[] revd = reverse(data);
revd2.addFirst(revd);
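Putting the copy fix together with handling of a short final read (in.read(data) may return fewer than bufferSize bytes, which the original loop ignores), a minimal sketch of the corrected chunk loop, using java.util.Arrays:
int bytesRead;
while ((bytesRead = in.read(data)) != -1)
{
    // copy only the bytes actually read, into a fresh array per chunk
    revd2.addFirst(Arrays.copyOf(data, bytesRead));
}
for (byte[] chunk : revd2)
{
    out2.write(chunk);
}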

Most efficient merging of 2 text files

So I have large (around 4 gigs each) txt files in pairs, and I need to create a 3rd file which consists of the 2 files in shuffle mode. The following equation presents it best:
3rdfile = (4 lines from file 1) + (4 lines from file 2), and this is repeated until I hit the end of file 1 (both input files have the same length - this is by definition). Here is the code I'm using now, but it doesn't scale very well on large files. I was wondering if there is a more efficient way to do this - would working with memory-mapped files help? All ideas are welcome.
public static void mergeFastq(String forwardFile, String reverseFile, String outputFile) {
    try {
        BufferedReader inputReaderForward = new BufferedReader(new FileReader(forwardFile));
        BufferedReader inputReaderReverse = new BufferedReader(new FileReader(reverseFile));
        PrintWriter outputWriter = new PrintWriter(new FileWriter(outputFile, true));
        String forwardLine = null;
        System.out.println("Begin merging Fastq files");
        int readsMerge = 0;
        while ((forwardLine = inputReaderForward.readLine()) != null) {
            //append the forward file
            outputWriter.println(forwardLine);
            outputWriter.println(inputReaderForward.readLine());
            outputWriter.println(inputReaderForward.readLine());
            outputWriter.println(inputReaderForward.readLine());
            //append the reverse file
            outputWriter.println(inputReaderReverse.readLine());
            outputWriter.println(inputReaderReverse.readLine());
            outputWriter.println(inputReaderReverse.readLine());
            outputWriter.println(inputReaderReverse.readLine());
            readsMerge++;
            if (readsMerge % 10000 == 0) {
                System.out.println("[" + now() + "] Merged 10000");
                readsMerge = 0;
            }
        }
        inputReaderForward.close();
        inputReaderReverse.close();
        outputWriter.close();
    } catch (IOException ex) {
        Logger.getLogger(Utilities.class.getName()).log(Level.SEVERE, "Error while merging FastQ files", ex);
    }
}
Maybe you also want to try using a BufferedWriter to cut down on your file I/O operations.
http://download.oracle.com/javase/6/docs/api/java/io/BufferedWriter.html
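As a hedged sketch of that idea, here is the same 4+4-line merge with explicitly sized buffers on both the readers and the writer (reusing the parameter names from the question; the 1 MB buffer sizes are illustrative, not tuned, and the inputs are assumed to be well-formed 4-line records):
try (BufferedReader fwd = new BufferedReader(new FileReader(forwardFile), 1 << 20);
     BufferedReader rev = new BufferedReader(new FileReader(reverseFile), 1 << 20);
     BufferedWriter out = new BufferedWriter(new FileWriter(outputFile, true), 1 << 20)) {
    String line;
    while ((line = fwd.readLine()) != null) {
        // 4 lines from the forward file (the first is already in hand)
        out.write(line); out.newLine();
        for (int i = 0; i < 3; i++) { out.write(fwd.readLine()); out.newLine(); }
        // 4 lines from the reverse file
        for (int i = 0; i < 4; i++) { out.write(rev.readLine()); out.newLine(); }
    }
}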
A simple answer is to use a bigger buffer, which helps to reduce the total number of I/O calls being made.
Usually, memory-mapped I/O with FileChannel (see Java NIO) would be used for handling large data-file I/O. In this case, however, it does not apply, as you need to inspect the file content in order to determine the boundary of every 4 lines.
If performance was the main requirement, then I would code this function in C or C++ instead of Java.
But regardless of the language used, what I would do is try to manage memory myself. I would create two large buffers, say 128MB or more each, and fill them with data from the two text files. Then you need a 3rd buffer that is twice as big as the previous two. The algorithm starts moving characters one by one from input buffer #1 to the destination buffer, counting EOLs as it goes. Once you reach the 4th line, you note the current position in that buffer and repeat the same process with the 2nd input buffer. You continue alternating between the two input buffers, replenishing them when you consume all their data. Each time you have to refill the input buffers, you can also write out the destination buffer and empty it.
Buffer your read and write operations. Buffer needs to be large enough to minimize the read/write operations and still be memory efficient. This is really simple and it works.
void write(InputStream is, OutputStream os) throws IOException {
    byte[] buf = new byte[102400]; //optimize the size of buffer to your needs
    int num;
    while ((num = is.read(buf)) != -1) {
        os.write(buf, 0, num);
    }
}
EDIT:
I just realized that you need to shuffle the lines, so this code will not work for you as-is, but the concept still remains the same.

Search in a InputStream

Is there a way to do an efficient search for 2 fixed bytes in an InputStream?
Background
I have to deal with multipart HTTP traffic on Android (Motion JPEG from an IP webcam).
I already found some classes on anddev.org to deal with it. Now I am doing some performance improvements. To find the start of a JPEG, I need to find the magic number for JPEGs (SOI = FFD8) in the InputStream.
Since you've no idea where in the stream those 2 bytes are, you'll have to look at the entire input. That means your performance will be at least linear. Finding two bytes linearly is straightforward:
static long search(InputStream inputStream) throws IOException {
    BufferedInputStream is = new BufferedInputStream(inputStream);
    int previous = is.read();
    long pos = 0;
    int current;
    while ((current = is.read()) != -1) {
        pos++;
        if (previous == 0xff && current == 0xd8) {
            return pos;
        }
        previous = current;
    }
    throw new RuntimeException("There ain't no pic in here.");
}
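An alternative sketch (my own, same linear algorithm) scans in bulk reads instead of byte-at-a-time calls, which can shave off per-call overhead even on top of BufferedInputStream:
static long search(InputStream in) throws IOException {
    byte[] buf = new byte[8192];
    long base = 0;  // stream offset of buf[0]
    int prev = -1;  // previous byte, or -1 before the first read
    int n;
    while ((n = in.read(buf)) != -1) {
        for (int i = 0; i < n; i++) {
            int cur = buf[i] & 0xff;
            if (prev == 0xff && cur == 0xd8) {
                return base + i; // offset of the 0xd8 byte
            }
            prev = cur;
        }
        base += n;
    }
    throw new RuntimeException("There ain't no pic in here.");
}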
