XMLEventReader from ZipInputStream closes stream - java

I want to stream a ZIP file containing several very large (~1 GB) XML files. I could read the data from each zip entry into a buffer and create an XML stream from that, but to save memory I would prefer to process the data on the fly.
@Test
public void zipStreamTest() throws IOException, XMLStreamException {
FileInputStream fis = new FileInputStream("archive.zip");
ZipInputStream zis = new ZipInputStream(fis);
ZipEntry ei;
while ((ei = zis.getNextEntry()) != null){
XMLEventReader xr = XMLInputFactory.newInstance().createXMLEventReader(zis);
while (xr.hasNext()) {
XMLEvent xe = xr.nextEvent();
// do some xml event processing..
}
zis.closeEntry();
}
zis.close();
}
The problem: I'm getting a java.io.IOException: Stream closed when executing zis.closeEntry();. When I remove that line, the same error is thrown at zis.getNextEntry() which closes previous entries if they're still open automatically.
It seems that my XML stream reader is breaking the stream at the end of the XML file so that the rest of the zip can't be processed.
Do I have an implementation error or is my conception of how streams work incorrect?
Note: To make this a minimal reproducible example, all you need is a zip file "archive.zip" which contains any valid XML file (no subdirectories inside the zip!). You can then run the snippet using JUnit.

You could try to open a separate InputStream for each entry using java.util.zip.ZipFile:
@Test
public void zipStreamTest() throws Exception {
ZipFile zipFile = new ZipFile("archive.zip");
Iterator<? extends ZipEntry> iterator = zipFile.entries().asIterator();
while (iterator.hasNext()) {
ZipEntry ze = iterator.next();
try (InputStream zis = zipFile.getInputStream(ze)) {
XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(zis);
while (reader.hasNext()) {
XMLEvent xe = reader.nextEvent();
// do some xml event processing
}
reader.close();
}
}
}

I would recommend using ZipFile instead of ZipInputStream, as suggested in the answer by Alexandra Dudkina.
However, if you're processing a data stream while e.g. downloading, and therefore want to keep using ZipInputStream, you should wrap it in a CloseShieldInputStream from Apache Commons IO (1) inside the getNextEntry() loop:
while ((ei = zis.getNextEntry()) != null) {
XMLEventReader xr = XMLInputFactory.newInstance().createXMLEventReader(new CloseShieldInputStream(zis));
// Process XML here
zis.closeEntry();
}
1) Or another similar helper class from a third-party library of your choice.
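If adding Commons IO is not an option, the same idea can be sketched with the JDK alone: a small FilterInputStream subclass whose close() does nothing, so the XML reader cannot close the zip stream underneath it. NoCloseInputStream, countXmlEntries and sampleZip are hypothetical names made up for this sketch, and it assumes the "Stream closed" error really comes from the XML reader closing the underlying stream at the end of each document.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;

public class ShieldedZipXml {

    /** Hypothetical helper: forwards all reads but ignores close(). */
    static final class NoCloseInputStream extends FilterInputStream {
        NoCloseInputStream(InputStream in) {
            super(in);
        }

        @Override
        public void close() {
            // deliberately empty: the owner of the wrapped stream closes it
        }
    }

    /** Parses every zip entry as XML and returns the number of entries processed. */
    public static int countXmlEntries(InputStream zipData) throws IOException, XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        int entries = 0;
        try (ZipInputStream zis = new ZipInputStream(zipData)) {
            while (zis.getNextEntry() != null) {
                XMLEventReader xr = factory.createXMLEventReader(new NoCloseInputStream(zis));
                while (xr.hasNext()) {
                    xr.nextEvent(); // XML event processing would go here
                }
                xr.close();
                zis.closeEntry();
                entries++;
            }
        }
        return entries;
    }

    /** Builds a small in-memory zip with two XML entries, for demonstration only. */
    public static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            for (String name : new String[] {"a.xml", "b.xml"}) {
                zos.putNextEntry(new ZipEntry(name));
                zos.write("<root><item/></root>".getBytes(StandardCharsets.UTF_8));
                zos.closeEntry();
            }
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countXmlEntries(new ByteArrayInputStream(sampleZip())));
    }
}
```

The shield is created once per entry, so each XMLEventReader gets its own wrapper while the single ZipInputStream stays open until the try-with-resources block ends.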

Related

Getting specific file from ZipInputStream

I can go through ZipInputStream, but before starting the iteration I want to get a specific file that I need during the iteration. How can I do that?
ZipInputStream zin = new ZipInputStream(myInputStream)
while ((entry = zin.getNextEntry()) != null)
{
println entry.getName()
}
If the myInputStream you're working with comes from a real file on disk then you can simply use java.util.zip.ZipFile instead, which is backed by a RandomAccessFile and provides direct access to the zip entries by name. But if all you have is an InputStream (e.g. if you're processing the stream directly on receipt from a network socket or similar) then you'll have to do your own buffering.
You could copy the stream to a temporary file, then open that file using ZipFile, or if you know the maximum size of the data in advance (e.g. for an HTTP request that declares its Content-Length up front) you could use a BufferedInputStream to buffer it in memory until you've found the required entry.
BufferedInputStream bufIn = new BufferedInputStream(myInputStream);
bufIn.mark(contentLength);
ZipInputStream zipIn = new ZipInputStream(bufIn);
boolean foundSpecial = false;
while ((entry = zipIn.getNextEntry()) != null) {
if("special.txt".equals(entry.getName())) {
// do whatever you need with the special entry
foundSpecial = true;
break;
}
}
if(foundSpecial) {
// rewind
bufIn.reset();
zipIn = new ZipInputStream(bufIn);
// ....
}
(I haven't tested this code myself; you may find it's necessary to use something like the commons-io CloseShieldInputStream between bufIn and the first zipIn, to allow the first zip stream to close without closing the underlying bufIn before you've rewound it.)
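The temporary-file alternative mentioned in this answer could look roughly like the sketch below. It spools the incoming stream to disk, opens the result with ZipFile, and looks the entry up by name. The class name, readEntry and sampleZip are hypothetical, and readAllBytes assumes Java 9 or later and an entry small enough to hold in memory.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class ZipViaTempFile {

    /** Spools the stream to a temp file and returns the named entry's bytes, or null if absent. */
    public static byte[] readEntry(InputStream zipData, String entryName) throws IOException {
        Path tmp = Files.createTempFile("spooled-", ".zip");
        try {
            // createTempFile already created the file, so it has to be replaced
            Files.copy(zipData, tmp, StandardCopyOption.REPLACE_EXISTING);
            try (ZipFile zf = new ZipFile(tmp.toFile())) {
                ZipEntry entry = zf.getEntry(entryName); // direct lookup by name
                if (entry == null) {
                    return null;
                }
                try (InputStream in = zf.getInputStream(entry)) {
                    return in.readAllBytes(); // Java 9+
                }
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }

    /** Builds a small in-memory zip with two text entries, for demonstration only. */
    public static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry("other.txt"));
            zos.write("other".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
            zos.putNextEntry(new ZipEntry("special.txt"));
            zos.write("hello".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
        return bos.toByteArray();
    }
}
```

Once the data is on disk, the special entry can be read first and the remaining entries iterated afterwards, at the cost of one extra copy of the archive.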
Use the getName() method on ZipEntry to find the file you want.
ZipInputStream zin = new ZipInputStream(myInputStream);
String myFileName = "foo.txt";
ZipEntry entry;
while ((entry = zin.getNextEntry()) != null)
{
if (entry.getName().equals(myFileName)) {
// process your file
// stop looking for your file - you've already found it
break;
}
}
From Java 7 onwards, you are better off using ZipFile instead of ZipInputStream if you only want one file and you have a file to read from:
ZipFile zfile = new ZipFile(aFile);
String myFile = "foo.txt";
ZipEntry entry = zfile.getEntry(myFile);
if (entry != null) {
// process your file
}
Look at Finding a file in zip entry
ZipFile file = new ZipFile("file.zip");
InputStream is = searchImage("foo.png", file);
public InputStream searchImage(String name, ZipFile file) throws IOException
{
for (ZipEntry e : Collections.list(file.entries())) {
if (e.getName().endsWith(name)){
return file.getInputStream(e);
}
}
return null;
}
I'm late to the party, but none of the above "answers" actually answers the question, and the accepted "answer" suggests creating a temp file, which is inefficient.
Let's create a sample zip file:
seq 10000 | sed "s/^.*$/a/"> /tmp/a
seq 10000 20000 | sed "s/^.*$/b/"> /tmp/b
seq 20000 30000 | sed "s/^.*$/c/"> /tmp/c
zip /tmp/out.zip /tmp/a /tmp/b /tmp/c
So now we have the file /tmp/out.zip, which contains 3 files, each of them full of the character a, b or c.
Now let's read it:
public static void main(String[] args) throws IOException {
ZipInputStream zipStream = new ZipInputStream(new FileInputStream("/tmp/out.zip"));
ZipEntry zipEntry;
while ((zipEntry = zipStream.getNextEntry()) != null) {
String name = zipEntry.getName();
System.out.println("Entry: "+name);
if (name.equals("tmp/c")) {
byte[] bytes = zipStream.readAllBytes();
String s = new String(bytes);
System.out.println(s);
}
}
}
Calling readAllBytes seems odd while we're in the middle of processing the stream, but it works: I also tested it on some images, where there is a higher chance of failure. So it's probably just an unintuitive API.
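The reason this works is that ZipInputStream reports end-of-stream (read returns -1) at the end of the current entry, not at the end of the whole archive, so readAllBytes only drains the entry selected by the last getNextEntry() call. A small in-memory sketch of that behaviour follows. The class and method names are hypothetical, and readAllBytes needs Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class EntryBoundaryDemo {

    /** Reads every entry fully and returns entry name -> content, in archive order. */
    public static Map<String, String> readAll(byte[] zipBytes) throws IOException {
        Map<String, String> result = new LinkedHashMap<>();
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                // readAllBytes stops at the end of the current entry
                byte[] content = zis.readAllBytes();
                result.put(entry.getName(), new String(content, StandardCharsets.UTF_8));
            }
        }
        return result;
    }

    /** Builds an in-memory zip with entries "a" and "b", for demonstration only. */
    public static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry("a"));
            zos.write("aaa".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
            zos.putNextEntry(new ZipEntry("b"));
            zos.write("bbb".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
        return bos.toByteArray();
    }
}
```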

How do I parse some files in a tar.bz2 archive with Java

I have written a parser for an individual file, but can I read each file within the archive without actually extracting the archive to disk?
Following the examples in http://commons.apache.org/proper/commons-compress/examples.html you have to wrap one InputStream with another:
// 1st InputStream from your compressed file
FileInputStream in = new FileInputStream(tarbz2File);
// wrap in a 2nd InputStream that deals with compression
BZip2CompressorInputStream bzIn = new BZip2CompressorInputStream(in);
// wrap in a 3rd InputStream that deals with tar
TarArchiveInputStream tarIn = new TarArchiveInputStream(bzIn);
ArchiveEntry entry = null;
while (null != (entry = tarIn.getNextEntry())){
if (entry.getSize() < 1){
continue;
}
// use your parser here, the tar inputStream deals with the size of the current entry
parser.parse(tarIn);
}
tarIn.close();

How to clone xml file (totally identical copy)

I am using the StAX XMLEventReader and XMLEventWriter.
I need to make a modified temporary copy of the original XML file, saved in a byte array. If I do it like this (for debugging, I am writing to a file):
public boolean isCrcCorrect(Path path) throws IOException, XPathExpressionException {
ByteArrayOutputStream output = new ByteArrayOutputStream();
XMLEventFactory eventFactory = XMLEventFactory.newInstance();
XMLEventReader reader = null;
XMLEventWriter writer = null;
StreamResult result;
String tagContent;
if (!fileData.currentFilePath.equals(path.toString())) {
parseFile(path);
}
try {
System.out.println(path.toString());
reader = XMLInputFactory.newInstance().createXMLEventReader(new FileReader(path.toString()));
//writer = XMLOutputFactory.newInstance().createXMLEventWriter(output);
writer = XMLOutputFactory.newInstance().createXMLEventWriter(new FileWriter("f:\\Projects\\iqpdct\\iqpdct-domain\\src\\main\\java\\de\\iq2dev\\domain\\util\\debug.xml"));
writer.add(reader);
writer.close();
} catch(XMLStreamException strEx) {
System.out.println(strEx.getMessage());
}
crc.reset();
crc.update(output.toByteArray());
System.out.println(crc.getValue());
//return fileData.file_crc == crc.getValue();
return false;
}
the clone differs from the original.
Source:
<VendorText textId="T_VendorText" />
Clone:
<VendorText textId="T_VendorText"></VendorText>
Why is it adding the end tag? There is none in the source.
If you want a precise copy of a byte stream that happens to be an XML document, you must copy it as a byte stream. You can't copy it by providing a back-end to an XML parser, because the purpose of the parser front-end is to isolate your code from features that can vary but which are semantically equivalent, such as, in your case, the two ways of writing an empty element.
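A minimal sketch of the byte-level approach, for the part of the task that needs an identical copy and its checksum: read the file as raw bytes and feed those bytes to CRC32, with no XML parser involved. The class and method names are hypothetical. This covers only the exact copy, not any later modification of it.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;

public class ExactXmlCopy {

    /** Returns the file content byte for byte; empty-element syntax is left untouched. */
    public static byte[] exactCopy(Path source) throws IOException {
        return Files.readAllBytes(source);
    }

    /** Computes the CRC-32 checksum of the given bytes. */
    public static long crcOf(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("origin-", ".xml");
        try {
            Files.write(tmp, "<VendorText textId=\"T_VendorText\" />".getBytes(StandardCharsets.UTF_8));
            byte[] copy = exactCopy(tmp);
            // The self-closing form survives because no parser rewrote it
            System.out.println(new String(copy, StandardCharsets.UTF_8));
            System.out.println(crcOf(copy));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```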

What is the idiomatic way to copy a ZipEntry into a new ZipFile?

I'm writing a tool to do some minor text replacement in a DOCX file, which is a zipped format. My method is to copy ZipEntry contents from entries in the original file into the modified file using a ZipOutputStream. For most DOCX files this works well, but occasionally I will encounter ZipExceptions regarding discrepancies between the contents I've written and the meta-information contained in the ZipEntry (usually a difference in compressed size).
Here's the code I'm using to copy over contents. I've stripped out error handling and document processing for brevity; I haven't had issues with the document entry so far.
ZipFile original = new ZipFile(INPUT_FILENAME);
ZipOutputStream outputStream = new ZipOutputStream(new FileOutputStream(OUTPUT_FILE));
Enumeration entries = original.entries();
byte[] buffer = new byte[512];
while (entries.hasMoreElements()) {
ZipEntry entry = (ZipEntry)entries.nextElement();
if ("word/document.xml".equalsIgnoreCase(entry.getName())) {
//perform special processing
}
else{
outputStream.putNextEntry(entry);
InputStream in = original.getInputStream(entry);
while (0 < in.available()){
int read = in.read(buffer);
outputStream.write(buffer,0,read);
}
in.close();
}
outputStream.closeEntry();
}
outputStream.close();
What is the proper or idiomatic way to directly copy ZipEntry objects from one ZipFile to another?
I have found a workaround that avoids the error. By creating a new ZipEntry with only the name field set I'm able to copy over contents without issue.
ZipFile original = new ZipFile(INPUT_FILENAME);
ZipOutputStream outputStream = new ZipOutputStream(new FileOutputStream(OUTPUT_FILE));
Enumeration entries = original.entries();
byte[] buffer = new byte[512];
while (entries.hasMoreElements()) {
ZipEntry entry = (ZipEntry)entries.nextElement();
if ("word/document.xml".equalsIgnoreCase(entry.getName())) {
//perform special processing
}
else{
// create a new empty ZipEntry
ZipEntry newEntry = new ZipEntry(entry.getName());
// outputStream.putNextEntry(entry);
outputStream.putNextEntry(newEntry);
InputStream in = original.getInputStream(entry);
while (0 < in.available()){
int read = in.read(buffer);
if (read > 0) {
outputStream.write(buffer,0,read);
}
}
in.close();
}
outputStream.closeEntry();
}
outputStream.close();
However, this method loses any meta-information stored in the fields of the original ZipEntry (e.g. comment, extra). The API docs aren't clear on whether this is important.
To keep your meta-data for the zip entry, create it using ZipEntry's "copy constructor":
ZipEntry newEntry = new ZipEntry(entry);
You can then modify just the name or the comments etc. and everything else will be copied from the given entry.
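One caveat: the copy constructor also copies the recorded compressed size, which is the field the question's ZipException complains about. A sketch that keeps the metadata but resets that one field to -1 (unknown), so that ZipOutputStream computes it while writing, might look like this. It assumes the mismatch comes from recompressing with different deflater settings. copyAll and roundTrip are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipEntryCopy {

    /** Copies every entry of the original zip to target; target is closed when done. */
    public static void copyAll(ZipFile original, OutputStream target) throws IOException {
        try (ZipOutputStream out = new ZipOutputStream(target)) {
            Enumeration<? extends ZipEntry> entries = original.entries();
            byte[] buffer = new byte[8192];
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                ZipEntry newEntry = new ZipEntry(entry); // keeps name, time, comment, extra
                newEntry.setCompressedSize(-1);          // unknown: recomputed on write
                out.putNextEntry(newEntry);
                try (InputStream in = original.getInputStream(entry)) {
                    int read;
                    while ((read = in.read(buffer)) != -1) {
                        out.write(buffer, 0, read);
                    }
                }
                out.closeEntry();
            }
        }
    }

    /** Demo: writes a one-entry zip to a temp file, copies it, returns the copied text. */
    public static String roundTrip(String text) throws IOException {
        Path source = Files.createTempFile("source-", ".zip");
        try {
            try (ZipOutputStream zos = new ZipOutputStream(Files.newOutputStream(source))) {
                ZipEntry e = new ZipEntry("word/styles.xml");
                e.setComment("kept");
                zos.putNextEntry(e);
                zos.write(text.getBytes(StandardCharsets.UTF_8));
                zos.closeEntry();
            }
            ByteArrayOutputStream copied = new ByteArrayOutputStream();
            try (ZipFile zf = new ZipFile(source.toFile())) {
                copyAll(zf, copied);
            }
            try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(copied.toByteArray()))) {
                zis.getNextEntry();
                return new String(zis.readAllBytes(), StandardCharsets.UTF_8); // Java 9+
            }
        } finally {
            Files.deleteIfExists(source);
        }
    }
}
```

The read loop tests the return value of read() against -1 rather than relying on available(), which only estimates how many bytes can be read without blocking.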
You could also look at Docmosis which can populate DocX files from Java.

Unzipping a file from InputStream and returning another InputStream

I am trying to write a function which will accept an InputStream with zipped file data and would return another InputStream with unzipped data.
The zipped file will only contain a single file and thus there is no requirement of creating directories, etc...
I tried looking at ZipInputStream and others but I am confused by so many different types of streams in Java.
Concepts
GZIPInputStream is for streams (or files) compressed as gzip (".gz" extension). A gzip stream has no directory of entries; it simply wraps the compressed data of one file.
This class implements a stream filter for reading compressed data in the GZIP file format
If you have a real zip file, you have to use ZipFile to open the file, ask for the list of files (one in your example) and ask for the decompressed input stream.
Your method, if you have the file, would be something like:
// Sketch only: the caller is responsible for closing the returned stream
private InputStream extractOnlyFile(String path) throws IOException {
ZipFile zf = new ZipFile(path);
Enumeration<? extends ZipEntry> e = zf.entries();
ZipEntry entry = e.nextElement(); // your only file
return zf.getInputStream(entry);
}
Reading an InputStream with the content of a .zip file
OK, if you have an InputStream you can use ZipInputStream (as @cletus says). It reads a stream that includes the header data.
ZipInputStream is for a stream with [header information + zipped data].
Important: if you have the file on your PC, you can use the ZipFile class to access it randomly.
This is a sample of reading a zip-file through an InputStream:
import java.io.FileInputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
public class Main {
public static void main(String[] args) throws Exception
{
FileInputStream fis = new FileInputStream("c:/inas400.zip");
// this is where you start, with an InputStream containing the bytes from the zip file
ZipInputStream zis = new ZipInputStream(fis);
ZipEntry entry;
// while there are entries I process them
while ((entry = zis.getNextEntry()) != null)
{
System.out.println("entry: " + entry.getName() + ", " + entry.getSize());
// consume all the data from this entry
while (zis.available() > 0)
zis.read();
// I could close the entry, but getNextEntry does it automatically
// zis.closeEntry()
}
}
}
If you can change the input data, I would suggest using GZIPInputStream.
GZIPInputStream differs from ZipInputStream in that it holds only one piece of data, so the whole input stream represents the whole file. With ZipInputStream, the stream also contains the structure of the file(s) inside it, of which there can be many.
In Scala syntax:
def unzipByteArray(input: Array[Byte]): String = {
val zipInputStream = new ZipInputStream(new ByteArrayInputStream(input))
val entry = zipInputStream.getNextEntry
IOUtils.toString(zipInputStream, StandardCharsets.UTF_8)
}
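A Java counterpart that returns an InputStream instead of a String, which is what the question asks for, could be sketched as follows. It positions a ZipInputStream on the first entry and hands it back. Reads on the returned stream then yield the decompressed bytes and end where that entry ends. SingleEntryUnzip, unzip and sampleZip are hypothetical names, and the sketch assumes the archive's single file is its first entry.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class SingleEntryUnzip {

    /** Returns a stream over the decompressed bytes of the first entry in the zip data. */
    public static InputStream unzip(InputStream zipped) throws IOException {
        ZipInputStream zis = new ZipInputStream(zipped);
        if (zis.getNextEntry() == null) {
            zis.close();
            throw new IOException("zip stream contains no entries");
        }
        // Closing the returned stream also closes the underlying input
        return zis;
    }

    /** Builds an in-memory zip with a single text entry, for demonstration only. */
    public static byte[] sampleZip(String text) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry("only.txt"));
            zos.write(text.getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = unzip(new ByteArrayInputStream(sampleZip("hello")))) {
            System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8)); // Java 9+
        }
    }
}
```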
Unless I'm missing something, you should absolutely try and get ZipInputStream to work and there's no reason it shouldn't (I've certainly used it on several occasions).
What you should do is try and get ZipInputStream to work and if you can't, post the code and we'll help you with whatever problems you're having.
Whatever you do though, don't try and reinvent its functionality.
