Apache Tika and File access instead of Java InputStream - java

I want to be able to create a new Tika parser to extract metadata from a file. We're already using Tika, so this way the metadata extraction will be done consistently.
I think that I've run into this problem/enhancement request for Tika:
Allow passing of files or memory buffers to parsers
I have a console C++ executable that accepts the path to a file as input and then outputs the metadata that it finds, one name/value pair per line.
The C++ code relies on libraries that expect a file path when accessing the data.
It's not going to be possible to rewrite this executable in Java.
I thought that it would be fairly easy to plug this into Tika. But the Tika parser needs to be in Java and the Tika parser method that needs to be overridden takes an open input stream:
void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
So I guess my only solution is to take the input stream, write it to a temporary file, process the file that gets written, and then finally clean up the file. I hate messing with a temporary file and then potentially having to worry about cleanup of temp files should something go wrong and one doesn't get deleted.
Does anyone have a clever idea about how to cleanly deal with something like this?

There's TikaInputStream which should help. It handles wrapping a File or an InputStream, and converting between them internally as parsers require. It does all the temp file bits as needed for you.
Several Java parsers already make use of it because they need a File rather than an Input Stream. What's more, users who have a file can pass it to the Parser wrapped as an InputStream, and the parser can read it as either a File or an InputStream as their needs suit.
So, I'd suggest you just turn the InputStream into a TikaInputStream (which is just a cast if it's already one), then get the file and pass that to your C++ executable.
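A minimal sketch of that approach, assuming this runs inside your parse() implementation (runNativeExtractor is a hypothetical stand-in for launching your C++ tool):

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.io.TikaInputStream;

void parseWithNativeTool(InputStream stream) throws IOException {
    // get() returns the stream itself if it's already a TikaInputStream,
    // otherwise it wraps it
    TikaInputStream tis = TikaInputStream.get(stream);
    // getFile() spools to a temp file only if no File already backs the
    // stream; Tika deletes that temp file when the stream is closed
    File file = tis.getFile();
    runNativeExtractor(file.getAbsolutePath()); // hypothetical C++ launcher
}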

If I understand correctly, and assuming you're launching the C++ program using Runtime.exec, you could parse the Process's standard output stream as the InputStream that Tika wants. Would that work?
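A rough sketch of that, assuming the executable prints one colon-separated name/value pair per line (the tool path and the separator are made-up placeholders):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.tika.metadata.Metadata;

void runAndCollect(String filePath, Metadata metadata) throws IOException {
    Process p = new ProcessBuilder("metadata-tool", filePath).start();
    try (BufferedReader r = new BufferedReader(
            new InputStreamReader(p.getInputStream()))) {
        String line;
        while ((line = r.readLine()) != null) {
            int sep = line.indexOf(':');
            if (sep > 0) {
                metadata.add(line.substring(0, sep).trim(),
                        line.substring(sep + 1).trim());
            }
        }
    }
}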

Related

Java - possible to modify and parse gzipped xml files without unzipping?

I have an ArrayList of gzipped XML files. Is it possible to view and manipulate the contents of these XML files without unzipping them and taking up disk space? If so, what would be the correct class(es) to use for this task?
I know I can create a GZIPInputStream from a FileInputStream of the zipped file, but from there I'm not sure what to do. I have only this written:
GZIPInputStream in = new GZIPInputStream(new FileInputStream(zippedFiles.get(i)));
I need some way to parse text within the XML files and modify the XML itself, but again, extracting all of them would take up too much disk space.
What exactly are you trying to achieve? You can extract the file into memory using a ByteArrayOutputStream and convert it into a byte array that you forward to your XML parser library. (Converting it to a String and passing that is not recommended, since the encoding is specified inside the XML file itself, so the conversion to String must be done by the XML parser internally.) Most XML parsers also support reading directly from any InputStream, so you could pass yours to the parser directly, which will probably further reduce your memory consumption. Disk space is only occupied when you write the data back, by simply reversing the described procedure; and since you replace the source file by overwriting it, no extra disk space is wasted.
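A minimal sketch of that round trip using the JDK's DOM and Transformer APIs (this assumes the parsed document fits comfortably in memory):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

static void editGzippedXml(String path) throws Exception {
    Document doc;
    // parse straight from the compressed stream; nothing hits the disk
    try (GZIPInputStream in = new GZIPInputStream(new FileInputStream(path))) {
        doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
    }
    // ... modify doc in memory here ...
    // write it back, recompressing over the original file
    try (GZIPOutputStream out = new GZIPOutputStream(new FileOutputStream(path))) {
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(out));
    }
}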
The fact that they're in a list doesn't change much, but no.
Ignoring compression, files are stored linearly on disks. You can append to them cheaply, you can replace bytes cheaply, but you can't replace sequences of different lengths (like replace("Testing Procedure Specification", "TPS")) without rewriting the file after the modified substring.
Gzipping the file complicates things, but the same rule applies: in general, making arbitrary modifications to a file requires rewriting the file.
Your code for reading the files is on the right track, though. You can easily read through gzipped files as streams, without having to decompress the entire file.
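A sketch of streaming through one of the gzipped files (this assumes UTF-8 text; keep in mind the real encoding is declared inside the XML itself):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

static void scanGzippedXml(String path) throws IOException {
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(
            new GZIPInputStream(new FileInputStream(path)), StandardCharsets.UTF_8))) {
        String line;
        while ((line = reader.readLine()) != null) {
            // inspect each line; to "modify", write a transformed copy to a
            // GZIPOutputStream over a new file and swap it in afterwards
        }
    }
}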

Java: Telling whether a byte array is a zip file

I have a server written in Java that, in a single request, gets a whole file from the client. The file is passed to the server as a list of bytes and is finally represented in the Java server as a byte array.
Is there some standard way / standard library that could tell whether a file represented by a byte array is a valid zip file?
Files are typically identified using magic numbers in the beginning of the file.
To make an educated guess about a given file, Java has a built-in method for detecting some file types: Files.probeContentType. There are also various third-party libraries: simplemagic, or Apache Tika (which supports more than just magic numbers).
But content detection alone won't tell you whether the file is valid. For that, you'd need something that actually knows how to read Zip files, such as Java's ZipFile.
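Since java.util.zip.ZipFile needs an actual file on disk, a byte array can be walked with ZipInputStream instead. A rough sketch (ZipInputStream is fairly lenient, so treat this as a sanity check rather than a full validation):

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

static boolean looksLikeValidZip(byte[] data) {
    try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(data))) {
        ZipEntry entry = zis.getNextEntry();
        if (entry == null) {
            return false; // no readable entries at all
        }
        while (entry != null) { // walk every entry; corrupt data throws
            entry = zis.getNextEntry();
        }
        return true;
    } catch (IOException e) {
        return false;
    }
}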
If you want a standard way to implement this, you can use the serialization API. Here are some articles on the topic that I found while searching:
Article 1 - javaworld
Article 2 - developer.com
Check out the Zip4j library. It is really easy to use, and its ZipFile class has an isValidZipFile() method.
The easiest way is to check the "PK" magic at the beginning of the byte array.
Something like this:
"PK".equals(new String(array, 0,2))

Create a copy of xml file in memory in java

I need to create a copy of an XML file in memory using Java, and I need to edit this copy without affecting the original file. After making changes to the XML in memory, I need to send it as input to a function. What is the appropriate option? Please help me.
You can use Java's native API for XML parsing:
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
File file = new File("xml_file_name");
Document doc = builder.parse(file);
and then edit the Document in memory before sending it to your designated function.
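For example (the <title> element and yourFunction are hypothetical placeholders):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

static void editAndSend(Document doc) throws Exception {
    // example edit: change the text of the first <title> element
    doc.getElementsByTagName("title").item(0).setTextContent("new value");
    // serialize the edited DOM to memory; the original file is untouched
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    TransformerFactory.newInstance().newTransformer()
            .transform(new DOMSource(doc), new StreamResult(out));
    yourFunction(new ByteArrayInputStream(out.toByteArray())); // hypothetical target
}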
Do what you wrote:
Read the file.
Write it to another file.
Edit that other file.
Pass it to the function. Here you have to decide whether it's better to pass a file or a path.
What you are looking for is ByteArrayOutputStream: http://docs.oracle.com/javase/7/docs/api/java/io/ByteArrayOutputStream.html
This will allow you to write a byte array into memory, and most XML libraries will accept implementations of OutputStream.
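For instance, a sketch that copies the file into memory once, so each later parse can work from a fresh in-memory stream:

import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

static byte[] readIntoMemory(String path) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (InputStream in = new FileInputStream(path)) {
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n); // accumulate the file in memory
        }
    }
    return buffer.toByteArray(); // wrap in a ByteArrayInputStream per use
}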
Given that the file is XML, you should consider loading it into the Document Object Model (DOM): https://docs.oracle.com/javase/tutorial/jaxp/dom/readingXML.html
That will make it easier for you to modify it and write it back out as a valid XML document.
I would only suggest loading it as bytes/characters if you're operating on it at a byte level. An example of when that might be appropriate is if you're making some character encoding translation (say UTF-16 -> UTF-8) or removing 'illegal' characters.
Code that tries to parse and modify XML in place usually becomes dreadfully bloated if it has to cover all valid XML files.
Unless you're a domain expert for XML, pick a parser off the shelf; the ecosystem is full of good libraries.
If the files may be large and your logic is amenable, I would prefer an XML streaming model such as SAX: https://docs.oracle.com/javase/tutorial/jaxp/sax/parsing.html
However, I get the impression you're not experienced with this, and non-experts tend to struggle with the event-driven parsing model of SAX.
Try DOM first time out.

Using POI or Tika to extract text, stream-to-stream without loading the entire file in memory

I'm trying to use either Apache POI or PDFBox by themselves, or within the context of Apache Tika, to extract and process plain text from MASSIVE Microsoft Office and PDF files (i.e. hundreds of megs in some cases). Also, my application is multi-threaded, so I will be parsing many of these large files concurrently.
At that scale, I MUST work with the files in a streaming manner. It's not an option to hold an entire file in main memory at any step along the way.
I have seen many source code examples for loading files into Tika / POI / PDFBox via input streams. I have seen many examples for extracting plain text via output streams. However, I've performed some basic memory profiling experiments... and I haven't yet found a way with any of these libraries (Tika, POI, or PDFBox) to avoid loading an entire document into main memory.
In between reading from a stream and writing to a stream, there is obviously a conversion step in the middle... which I have not yet found a way to perform on a streaming basis. Am I missing something, or is this a known issue with extracting text from MS Office or PDF files using Tika / POI / PDFBox? Can I have true end-to-end streaming, without a file being fully loaded into main memory at any point along the way?
The first thing to make sure of, if you care about the memory footprint, is that you're using a TikaInputStream backed by a File, e.g. change from something like
InputStream input = new FileInputStream("foo.xls");
To something like
InputStream input = TikaInputStream.get(new File("foo.xls"));
If you really only have an InputStream, not a file, and you want the lower memory option if possible, force Tika to buffer it to a temp file with something like
InputStream origInput = getAnInputStream();
TikaInputStream input = TikaInputStream.get(origInput);
input.getFile();
Many, but not all, parsers will be able to take advantage of the backing File and read only the bits they need into memory, rather than buffering the whole thing, which will help.
Next up, make sure your ContentHandler doesn't buffer the whole contents into memory before outputting. Anything which does XPath lookups on the resulting document is probably out, as is anything which has an internal StringBuffer or similar. Pick a simpler one, and make sure you're set up to write the resulting HTML / text SAX events somewhere as they come in.
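For example, BodyContentHandler can be pointed straight at a Writer, which forwards the text as SAX events arrive instead of accumulating it internally (a sketch; the output destination is arbitrary):

import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

static void extractToFile(TikaInputStream input) throws Exception {
    try (Writer out = new OutputStreamWriter(
            new FileOutputStream("extracted.txt"), StandardCharsets.UTF_8)) {
        // a Writer-backed handler streams text out as it is parsed, rather
        // than buffering everything in an internal StringBuffer
        BodyContentHandler handler = new BodyContentHandler(out);
        new AutoDetectParser().parse(input, handler, new Metadata(), new ParseContext());
    }
}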
Finally, not all of the Tika parsers support streaming processing. Some only work by parsing the whole file's structure, then wandering through that finding the interesting bits to output. With those, using a File backed TikaInputStream will probably help, but won't stop a fair bit of memory being used.
IIRC, the low memory parsers include:
.xls
.xlsx
All ODF-based formats
XML
Some of the common document parsers which load + parse most/all of the file before being able to output anything include:
.doc / .docx / .ppt / .pptx
.pdf
Images
Videos

Java 6 File Input Output Stream (same file)

I searched and looked at multiple questions like this, but my question is really different than anything I found. I've looked at Java Docs.
How do I get the equivalent of this C fopen call:
stream1 = fopen (out_file, "r+b");
Once I've done a partial read from the file, the first write makes the next read return EOF no matter how many bytes were in the file.
Essentially I want a file I/O stream that doesn't do that. The whole purpose of what I'm trying to do is to replace bytes in an existing file, in place. I don't want to do it in a copy, or make a copy before I do the read-then-write.
You can use a RandomAccessFile.
As Perception mentions, you can use a RandomAccessFile. Also, in some situations, a FileChannel may work better. I've used these to handle binary file data with great success.
EDIT: you can get a FileChannel from the RandomAccessFile object using getChannel.
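A sketch of the equivalent in Java (the offset and bytes here are arbitrary). Mode "rw" gives you fopen's "r+b" behavior, and a write does not disturb subsequent reads:

import java.io.IOException;
import java.io.RandomAccessFile;

static void patchInPlace(String path) throws IOException {
    try (RandomAccessFile file = new RandomAccessFile(path, "rw")) {
        byte[] header = new byte[4];
        file.readFully(header);              // partial read from the front
        file.seek(100);                      // jump to the region to replace
        file.write(new byte[] {1, 2, 3, 4}); // overwrite bytes in place
    }
}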
