Currently I am downloading a jar file from a website; the classes are further processed, but the resources are not. This means I should not need to decompress the resources when reading from the URL and recompress them when writing to a file.
However, given a ZipInputStream, there is no method I am aware of to read a zip entry's compressed data and write it directly to a file with NIO. Normally, with files, I can use Files#copy to do this, but since I am downloading these files from the network I do not have that luxury.
Essentially, I have a ZipInputStream and an NIO FileSystem for a zip file. How do I copy some (not all) of the entries from this input stream to the file without decompressing and recompressing each entry?
It's not clear what you are asking here.
zip to zip
Do you mean: You want to stream a zip file across a network, saving it to the local machine on disk, but only some of the files. You want to do this without actually doing any (de)compression. For example, if the stream contains a zip with 18 files in it, you want to save the 8 files whose name doesn't end in .class, but in a fashion that streams the compressed bytes straight from the network into a zipped file without any de- or recompression.
In that sense it is equivalent to saving the zip file from network to disk and then attempting to efficiently wipe out some of the entries. Except in one go.
This is a bad idea. There are no easy answers here. It is technically possible, with so many caveats that I'm pretty sure you wouldn't want this.
If you need more context as to why that is, scroll down to the end of this answer.
zip to files
If you just mean: "I want to stream a zip from the network, skipping some of its entries without decompressing them or saving them to disk at all (compressed or not), and writing the entries I want to keep straight from network to disk, decompressing them on the fly" - that's simple.
Use .getNextEntry() to skip entries. Treat the ZipInputStream as the stream for the single current entry: it reports EOF at the end of that entry until you move to the next one, which is what makes this 'work'.
Here is an example that reads all entries from a zip file, skips all entries that end in .class, and writes all the other ones to disk, uncompressing on the fly:
// needs: java.io.*, java.nio.file.*, java.util.zip.*
public void unpackResources(Path zipFile, Path tgt) throws IOException {
    try (InputStream raw = Files.newInputStream(zipFile);
            ZipInputStream zip = new ZipInputStream(raw)) {
        for (ZipEntry entry = zip.getNextEntry(); entry != null; entry = zip.getNextEntry()) {
            // Skip directory entries and compiled classes; everything else is a resource we keep.
            if (entry.isDirectory() || entry.getName().endsWith(".class")) continue;
            Path to = tgt.resolve(entry.getName());
            if (to.getParent() != null) Files.createDirectories(to.getParent());
            try (OutputStream out = Files.newOutputStream(to)) {
                zip.transferTo(out); // decompress the current entry and write it out
            }
        }
    }
}
in.transferTo(out) is the InputStream/OutputStream equivalent of Files.copy: it reads bytes from in and tosses them straight into out until in says there are no more bytes to give.
Context: Why is zip-to-stripped-zip not feasible?
Compression is extremely inefficient at times if you treat each file in a batch entirely on its own: After all, then you cannot take advantage of duplicated patterns between files. Imagine compressing a database of common baby names, where the input data consists of 1 file per name, and they just contain the text Name: Joanna, over and over again. You really need to take advantage of those repeated Name: entries to get good compression rates.
If a compression format does it right, then what you want doesn't really work: You'd have a single table (I'm oversimplifying how compression works here) that maps shorter sequences onto longer ones, but it is designed for efficient storage of the entire deal. If you strip out half the files, that table is probably not at all efficient anymore. If you don't copy over the table, the compressed bytes don't mean anything.
Some compression formats do it wrong and treat each file entirely on its own, scoring rather badly on the 'name files' test. ZIP is, unfortunately, such a format. That does mean that, technically, streaming the compressed data straight into a file while stripping out some entries can be done without de/recompressing, assuming the zip uses the usual algorithm (ZIP is not so much an algorithm as a container format; however, 99% of the zips out there use one specific algorithm, DEFLATE, and many zip readers fail on anything else). Encryption is probably also going to cause issues here.
Given that it's a bit of an odd thing to want, compression libraries generally just don't offer this feature; it can't be done except, specifically, for common zip files.
You'd have to write it yourself, and I'm not sure that is worth doing. De- and recompressing is quite fast: zip was doable 30 years ago; sprinkle some Moore's law over that and you get a sense of how trivial it is today. Your disk will be the bottleneck, not the CPU, even with fast SSDs.
Related
I have an object which I want to load into memory when the program starts.
My question is:
Is it better to put the object inside the JAR package or in a folder shipped with the program?
Which way is faster for reading the object?
EDIT:
public MapStandard loadFromFileMS(String nameOfFile) {
    MapStandard hm = null;
    /*
    InputStream inputStream
            = getClass().getClassLoader()
                    .getResourceAsStream("data/" + nameOfFile + ".data");
    */
    try (FileInputStream inputStream = new FileInputStream("C:\\" + nameOfFile + ".data");
            ObjectInputStream is = new ObjectInputStream(inputStream)) {
        hm = (MapStandard) is.readObject();
    } catch (IOException | ClassNotFoundException e) {
        System.out.println("Error: " + e);
    }
    return hm;
}
In theory it is faster to read a file from a directory than from a JAR file. A JAR file is basically a zip file with some metadata (MANIFEST.MF), so reading from a JAR includes unzipping the content.
I don't think that there is a clear answer. Of course, reading a compressed archive requires time to un-compress. But: CPU cycles are VERY cheap. The time it takes to read a smaller archive and extract its content might still be quicker than reading "much more" content directly from the file system. You can do A LOT of computations while waiting for your IO to come in.
On the other hand: do you really think that the loading of this file is a performance bottleneck?
There is an old saying that the root of all evil is premature optimization.
If you or your users complain about bad performance - only then you start analyzing your application; for example using a profiler. And then you can start to fix those performance problems that REALLY cause problems; not those that you "assume" to be problematic.
And finally: if we are talking about such huge dimensions - then you SHOULD not ask for Stack Overflow opinions, but start to measure exact times yourself! We can only assume - but you have all the data in front of you - you just have to collect it!
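A minimal measurement sketch along those lines, assuming the same serialized object is available both as a classpath resource under data/ and as an external file (both names here are hypothetical); run it several times, since the first run also pays class-loading and disk-cache costs:
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class LoadTiming {
    public static void main(String[] args) throws Exception {
        long t0 = System.nanoTime();
        try (InputStream in = LoadTiming.class.getResourceAsStream("/data/map.data")) {
            in.readAllBytes(); // read the object bytes from inside the JAR
        }
        long t1 = System.nanoTime();
        byte[] fromDisk = Files.readAllBytes(Path.of("C:\\map.data")); // external file
        long t2 = System.nanoTime();
        System.out.printf("classpath: %d ms, file: %d ms (%d bytes)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, fromDisk.length);
    }
}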
A qualified guess would be that when the program starts, the jar file entry will load a lot faster than the external file, but repeated uses will be much more alike.
The reason is that the limiting factor on modern computers is "how fast can the bytes be retrieved from disk", and for jar files the zip file is already being read by the JVM, so many of the bytes needed are already loaded and do not have to be read again. An external file needs a completely separate "open-read" dance with the operating system. Later, both will be in the disk read cache maintained by the operating system, so the difference is negligible.
Considerations about CPU usage are not really necessary. A modern CPU can do a lot of uncompressing in the time it takes to read extra data from disk.
Note that reading through the jar file makes its contents effectively write-protected. If you need to update the contents you need an external file.
I have to read a sequential file which has over a million records. I have to read each line/record, delete that record/line from the file, and keep on reading.
I am not finding any example of how to do that without using a temporary file or creating/recreating a new file of the same name.
These are text files. Each file is about 0.5 GB and has over a million lines/records.
Currently we are copying all the records to memory, as we do not want to re-process any record if anything happens in the middle of processing a file.
Assuming that the file in question is a simple sequential file - you can't. In the Java file model, deleting part of a file implies deleting all of it after the deletion point.
Some alternative approaches are:
In your process copy the file, omitting the parts you want deleted. This is the normal way of doing this.
Overwrite the parts of the file you want deleted with some value that you know never occurs in the file, and then at a later date copy the file, removing the marked parts.
Store the entire file in memory, edit it as required, and write it again. Just because you have a million records doesn't make that impossible. If your files are 0.5GB, as you say, then this approach is almost certainly viable.
Each time you delete some record, copy all of the contents of the file after the deletion to its new position. This will be incredibly inefficient and error-prone.
Unless you can store the file in memory, using a temporary file is the most efficient. That's why everyone does it.
If this is some kind of database, then that's an entirely different question.
EDIT: Since I answered this, comments have indicated that what the user wants to do is use deletion to keep track of which records have already been processed. If that is the case, there are much simpler ways of doing this. One good way is to write a separate file which just contains a count of how many bytes (or records) of the input file have been processed. If the processor crashes, you can use that count to skip the records that have already been processed and start again.
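A minimal sketch of that checkpoint idea, assuming one record per line and a hypothetical processRecord method; the progress file just stores how many lines have already been handled, so a restart can skip them:
import java.io.BufferedReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class CheckpointedProcessor {
    public static void main(String[] args) throws Exception {
        Path data = Path.of("records.txt");          // hypothetical input file
        Path progress = Path.of("records.progress"); // holds the count of processed lines

        long done = Files.exists(progress)
                ? Long.parseLong(Files.readString(progress).trim())
                : 0;

        try (BufferedReader reader = Files.newBufferedReader(data, StandardCharsets.UTF_8)) {
            String line;
            long lineNo = 0;
            while ((line = reader.readLine()) != null) {
                lineNo++;
                if (lineNo <= done) continue;         // already handled before the crash
                processRecord(line);                  // hypothetical per-record work
                // Checkpoint after every record (or every N records, for speed).
                Files.writeString(progress, Long.toString(lineNo));
            }
        }
    }

    private static void processRecord(String record) {
        // placeholder for the real processing
    }
}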
Files are unstructured streams of bytes; there is no record structure. You can not "delete" a "line" from an unstructured stream of bytes.
The basic algorithm you need to use is this:
1. Create a temporary file.
2. Open the input file.
3. If at the end of the input file, go to step 7.
4. Read a line from the input file.
5. If the line is not to be deleted, write it to the temporary file.
6. Go to step 3.
7. Close the input file.
8. Close the temporary file.
9. Delete (or just rename) the input file.
10. Rename (or move) the temporary file to have the original name of the input file.
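A sketch of that algorithm in Java, assuming a hypothetical shouldDelete predicate decides which lines go; it streams line by line, so the whole file never has to fit in memory:
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class DeleteLines {
    static boolean shouldDelete(String line) {
        return line.isBlank(); // hypothetical criterion; replace with your own
    }

    public static void main(String[] args) throws Exception {
        Path input = Path.of("records.txt"); // hypothetical file name
        Path temp = Files.createTempFile(input.toAbsolutePath().getParent(), "records", ".tmp");

        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8);
                BufferedWriter out = Files.newBufferedWriter(temp, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (!shouldDelete(line)) { // keep everything we do not want to delete
                    out.write(line);
                    out.newLine();
                }
            }
        }
        // Replace the original with the filtered copy (steps 9 and 10 above).
        Files.move(temp, input, StandardCopyOption.REPLACE_EXISTING);
    }
}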
There is a similar question asked, "Java - Find a line in a file and remove".
Basically they all use a temp file, and there is no harm in doing so. So why not just do it? It will not affect your performance much and it avoids some errors.
Why not a simple sed -si '/line I want to delete/d' big_file?
I can see there are a number of posts about reusing an InputStream. I understand an InputStream is a one-time thing and cannot be reused.
However, I have a use case like this:
I have downloaded the file from DropBox by obtaining a DropBoxInputStream using DropBox's Java SDK. I then need to upload the file to another system by passing the InputStream. However, as part of the upload, I have to provide the MD5 of the file. So I have to read the file from the stream before uploading it. Because the DropBoxInputStream I received can only be used once, I have to get another DropBoxInputStream after I have calculated the MD5 and before uploading the file. The procedure is like:
1. Get the first DropBoxInputStream.
2. Read from the DropBoxInputStream and calculate the MD5.
3. Get a second DropBoxInputStream.
4. Upload the file using the MD5 and the second DropBoxInputStream.
I am wondering if there is any way for me to "cache" or "back up" the InputStream before I calculate the MD5, so that I can skip step 3 of obtaining the same DropBoxInputStream again.
Many thanks
EDIT:
Sorry, I missed some information.
What I am currently doing is using an MD5DigestOutputStream to calculate the MD5. I stream the data through the MD5DigestOutputStream and save it locally as a temp file; once the data has gone through the stream, the MD5 has been calculated.
I then call a third-party library to upload the file, using the calculated MD5 and a FileInputStream which reads from the temp file.
However, this sometimes requires a huge amount of disk space, and I want to remove the need for a temp file. The library I use only accepts an MD5 and an InputStream, which means I have to calculate the MD5 on my end. My plan is to use my MD5DigestOutputStream to write the data to /dev/null (not keeping the file) so that I can calculate the MD5, then get the InputStream from DropBox again and pass that to the library. I assume the library will be able to get the file directly from DropBox without the need for me to cache the file either in memory or on disk. Will it work?
Input streams aren't really designed for copying or re-use; they're specifically for situations where you don't want to read everything into a byte array and use array operations on it (this is especially useful when the whole array isn't available yet, e.g. in socket communication). You could buffer up into a byte array, which is the process of reading sections from the stream into a byte array buffer until you have enough information.
But that's unnecessary for calculating an MD5. Notice that InputStream is abstract, so it needs to be implemented by a subclass. It has many implementations: GZIPInputStream, FileInputStream, etc. Some of these are, in design-pattern speak, decorators of the IO stream: they add extra functionality to the base IO classes. For example, GZIPInputStream decompresses the gzipped stream it wraps.
So what you need is a stream that does this for MD5. There is, joyfully, a well-documented class for exactly that: DigestInputStream (see this answer). You should just be able to wrap your DropBox input stream (it is itself an InputStream) in a new DigestInputStream, and then you can both take the MD5 and continue to read as before.
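A minimal sketch of that, assuming a hypothetical getDropboxStream() stand-in for however the SDK hands you the stream; the digest is updated as a side effect of every read, so you can compute the MD5 while consuming the stream once:
import java.io.InputStream;
import java.io.OutputStream;
import java.math.BigInteger;
import java.security.DigestInputStream;
import java.security.MessageDigest;

public class Md5WhileReading {
    public static void main(String[] args) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");

        try (InputStream raw = getDropboxStream();               // hypothetical SDK call
                DigestInputStream in = new DigestInputStream(raw, md5)) {
            // Read (or upload) from 'in' exactly as you would from the original stream;
            // here we simply drain it to demonstrate the digest side effect.
            in.transferTo(OutputStream.nullOutputStream());
        }
        System.out.println("MD5: " + String.format("%032x", new BigInteger(1, md5.digest())));
    }

    private static InputStream getDropboxStream() {
        return InputStream.nullInputStream(); // placeholder for the real DropBox stream
    }
}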
Worried about type casting? The idea with decorators in Java is that, since the InputStream base class declares all the methods and 'beef' you need to do your IO, there's no harm in passing any object inheriting from InputStream to the constructor of another stream implementation, and you can still do the same core IO.
Finally, I should probably answer your actual question - say you still want to "cache" or "back up" the stream anyway? Well, you could just write it to a byte array. This is well documented, but can become a faff when your streams get more complicated. Alternatively, have a look at PushbackInputStream. There, you can easily write a function to read off n bytes, perform an operation on them, and then push them back onto the stream. It is generally good to avoid these kinds of stream implementations in Java, as they are bad for memory use, but no worse than buffering everything up, which you'd otherwise have to do.
Or, of course, I would have a go with DigestInputStream.
Hope this helps,
Best.
You don't need to open a new InputStream from DropBox.
Once you have read the file from DropBox, you have it locally. So it is either in memory (in a byte array) or you stored it in a local file. Now you can create an InputStream that reads the data from memory (ByteArrayInputStream) or disk (FileInputStream) in order to upload the file.
So instead of caching the InputStream (which you can't) you cache the contents (which you can).
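A sketch of the in-memory variant, assuming the file fits comfortably in the heap, with a hypothetical getDropboxStream() and a hypothetical upload(md5, stream) standing in for the SDK and the third-party library:
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.math.BigInteger;
import java.security.MessageDigest;

public class CacheThenUpload {
    public static void main(String[] args) throws Exception {
        // Read the DropBox stream once and keep the bytes.
        byte[] content;
        try (InputStream in = getDropboxStream()) {
            content = in.readAllBytes();
        }

        // Compute the MD5 from the cached bytes.
        byte[] digest = MessageDigest.getInstance("MD5").digest(content);
        String md5 = String.format("%032x", new BigInteger(1, digest));

        // Upload from a fresh in-memory stream; no second download, no temp file.
        upload(md5, new ByteArrayInputStream(content));
    }

    private static InputStream getDropboxStream() {
        return InputStream.nullInputStream(); // placeholder for the real DropBox stream
    }

    private static void upload(String md5, InputStream data) {
        // placeholder for the third-party upload call
    }
}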
I searched and looked at multiple questions like this, but my question is really different from anything I found. I've also looked at the Java docs.
How do I get the equivalent of this C file open:
stream1 = fopen (out_file, "r+b");
Once I've done a partial read from the file, the first write makes the next read return EOF no matter how many bytes were in the file.
Essentially I want file I/O that doesn't do that. The whole purpose of what I'm trying to do is to replace bytes in place in the existing file. I don't want to do it in a copy, or make a copy before I do the read-then-write.
You can use a RandomAccessFile.
As Perception mentions, you can use a RandomAccessFile. Also, in some situations, a FileChannel may work better. I've used these to handle binary file data with great success.
EDIT: you can get a FileChannel from the RandomAccessFile object using getChannel.
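A minimal sketch of the "r+b" equivalent with RandomAccessFile, assuming a hypothetical file name: open in "rw" mode (read/write without truncating), read some bytes, seek back, and overwrite them in place:
import java.io.RandomAccessFile;

public class InPlaceEdit {
    public static void main(String[] args) throws Exception {
        // "rw" opens the existing file for reading and writing without truncating it,
        // which is the closest match to fopen(out_file, "r+b").
        try (RandomAccessFile file = new RandomAccessFile("out_file.bin", "rw")) {
            byte[] buffer = new byte[16];
            int read = file.read(buffer);   // partial read from the start of the file

            for (int i = 0; i < read; i++) {
                buffer[i] ^= 0xFF;          // example transformation of those bytes
            }

            file.seek(0);                   // go back to where the bytes came from
            file.write(buffer, 0, read);    // overwrite them in place

            // Further reads and writes continue normally; the rest of the file is untouched.
        }
    }
}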
EDIT
This is my file reader; can I make it read from bottom to top, seeing how difficult it is to make it write from bottom to top?
BufferedReader mainChat = new BufferedReader(new FileReader("./messages/messages.txt"));
String str;
while ((str = mainChat.readLine()) != null) {
    System.out.println(str);
}
mainChat.close();
OR (old question)
How can I make it put the next String at the beginning of the file and then insert a new line (to shift the other lines down)?
FileWriter chatBuffer = new FileWriter("./messages/messages.txt",true);
BufferedWriter mainChat = new BufferedWriter(chatBuffer);
mainChat.write(message);
mainChat.newLine();
mainChat.flush();
mainChat.close();
Someone could correct me, but I'm pretty sure in most operating systems, there is no option but to read the whole file in, then write it back again.
I suppose the main reason is that, in most modern OSs, all files on the disk start at a block boundary. The problem is, you cannot tell the file allocation table that your file starts earlier than that point.
Therefore, all the later bytes in the file have to be rewritten. I don't know of any OS routines that do this in one step.
So, I would use a BufferedReader to store the whole file in a Vector or StringBuffer, then write it all back with the prepended string first.
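A sketch of that read-everything-then-rewrite approach using plain NIO instead, assuming the chat file is small enough to hold in memory and is UTF-8 encoded:
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class PrependLine {
    public static void prepend(Path file, String message) throws Exception {
        // Read the existing lines, put the new message first, write everything back.
        List<String> lines = new ArrayList<>();
        lines.add(message);
        if (Files.exists(file)) {
            lines.addAll(Files.readAllLines(file, StandardCharsets.UTF_8));
        }
        Files.write(file, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        prepend(Path.of("./messages/messages.txt"), "newest message at the top");
    }
}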
--
Edit
A way that would save memory for larger files, building on Saury's RandomAccessFile suggestion, would be:
the file has N bytes to start with
we want to prepend "hello world" (11 bytes)
open the file for reading and writing (without truncating it)
extend it by 11 bytes at the end
i = N - 1
loop {
    go to byte i
    read that byte
    move to byte i + 11
    write the byte there
    i--
} while i >= 0
then move to byte 0
write "hello world"
voila
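A sketch of that shifting approach with RandomAccessFile; moving one byte at a time is slow, so a real version would shift larger blocks, but the idea is the same (file name and prefix here are just examples):
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class PrependByShifting {
    public static void prepend(String fileName, String prefix) throws Exception {
        byte[] prefixBytes = prefix.getBytes(StandardCharsets.UTF_8);
        try (RandomAccessFile file = new RandomAccessFile(fileName, "rw")) {
            long n = file.length();
            file.setLength(n + prefixBytes.length); // grow the file by the prefix size

            // Shift every original byte up by prefixBytes.length, working backwards
            // so nothing is overwritten before it has been copied.
            for (long i = n - 1; i >= 0; i--) {
                file.seek(i);
                int b = file.read();
                file.seek(i + prefixBytes.length);
                file.write(b);
            }

            // Finally write the new content into the hole at the front.
            file.seek(0);
            file.write(prefixBytes);
        }
    }

    public static void main(String[] args) throws Exception {
        prepend("./messages/messages.txt", "hello world\n");
    }
}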
Use FileUtils from Apache Commons IO to simplify this if you can. However, it still needs to read the whole file in, so it will be slow for large files.
// Wrap in an ArrayList: Arrays.asList returns a fixed-size list, so addAll would fail on it.
List<String> newList = new ArrayList<>(Arrays.asList("3"));
File file = new File("./messages/messages.txt");
newList.addAll(FileUtils.readLines(file));
FileUtils.writeLines(file, newList);
FileUtils also has read/write methods that take care of encoding.
Use RandomAccessFile to read/write the file in reverse order. See the following links for more details.
http://www.java2s.com/Code/Java/File-Input-Output/UseRandomAccessFiletoreverseafile.htm
http://download.oracle.com/javase/1.5.0/docs/api/java/io/RandomAccessFile.html
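A sketch of reading a text file line by line from the end with RandomAccessFile, assuming a single-byte encoding such as ASCII and '\n' line endings (optionally preceded by '\r'); it seeks backwards one byte at a time, which is simple but slow, so a real version would buffer:
import java.io.RandomAccessFile;

public class ReverseLineReader {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile file = new RandomAccessFile("./messages/messages.txt", "r")) {
            StringBuilder line = new StringBuilder();
            for (long pos = file.length() - 1; pos >= 0; pos--) {
                file.seek(pos);
                int c = file.read();
                if (c == '\n') {
                    if (line.length() > 0) {          // blank lines are skipped for brevity
                        System.out.println(line.reverse());
                        line.setLength(0);
                    }
                } else if (c != '\r') {
                    line.append((char) c);            // characters collected in reverse order
                }
            }
            if (line.length() > 0) {
                System.out.println(line.reverse());   // the first line of the file, printed last
            }
        }
    }
}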
As was suggested here, prepending to a file is rather difficult and is indeed linked to how files are stored on the hard drive. The operation is not naturally available from the OS, so you will have to build it yourself, and the most obvious answers all involve reading the whole file and writing it again. This may be fine for you, but it incurs significant cost and could become a bottleneck for your application's performance.
Appending would be the natural choice but this would, as far as I understand, make reading the file unnatural.
There are many ways you could tackle this depending on the specificities of your situation.
If writing this file is not time critical in your application and the file does not grow too big, you could bite the bullet and read the whole file, prepend the information, and write it again. Apache commons-io's FileUtils will be of help here, simplifying the operation: you can read the file as a list of strings, prepend the new lines to the list, and write the list again.
If writing is time critical but you have control over how the file is read (that is, if the file is to be read by another of your programs), you could simply append new lines, then load the file as a list of lines and reverse the list when reading, as sketched below. Again, FileUtils from the commons-io library and helper functions in the Collections class in the standard JDK should do the trick nicely.
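A sketch of that append-then-reverse-on-read idea with commons-io and Collections, assuming the writer simply appends new messages to the end of the chat file:
import java.io.File;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.List;
import org.apache.commons.io.FileUtils;

public class NewestFirstReader {
    public static List<String> readNewestFirst(File chatFile) throws Exception {
        // The writer just appends (cheap); the reader flips the order (also cheap).
        List<String> lines = FileUtils.readLines(chatFile, StandardCharsets.UTF_8);
        Collections.reverse(lines);
        return lines;
    }

    public static void main(String[] args) throws Exception {
        readNewestFirst(new File("./messages/messages.txt")).forEach(System.out::println);
    }
}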
If writing is time critical but the file is intended to be read through a normal text editor, you could create a small class or program that reads the file and writes it to another file in the preferred order.