How to parse a large text file with Apache Tika 1.5? - java

Problem:
For my test, I want to extract text data with Apache Tika from a 335 MB text file, Wikipedia's "pagecounts-20140701-060000.txt".
My solution:
I tried to use TikaInputStream since it provides buffering, then I tried to use BufferedInputStream, but that didn't solve my problem. Here is my test class:
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class Printer {

    public void readMyFile(String fname) throws IOException, SAXException, TikaException {
        System.out.println("Working...");
        File f = new File(fname);
        // InputStream stream = TikaInputStream.get(new File(fname));
        InputStream stream = new BufferedInputStream(new FileInputStream(fname));
        Metadata meta = new Metadata();
        ContentHandler content = new BodyContentHandler(Integer.MAX_VALUE);
        AutoDetectParser parser = new AutoDetectParser();
        String mime = new Tika().detect(f);
        meta.set(Metadata.CONTENT_TYPE, mime);
        System.out.println("trying to parse...");
        try {
            parser.parse(stream, content, meta, new ParseContext());
        } finally {
            stream.close();
        }
    }

    public static void main(String[] args) {
        Printer p = new Printer();
        try {
            p.readMyFile("test/pagecounts-20140701-060000.txt");
        } catch (IOException | SAXException | TikaException e) {
            e.printStackTrace();
        }
    }
}
Problem:
Upon invoking the parse method of the parser I am getting:
Working...
trying to parse...
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2367)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:535)
at java.lang.StringBuffer.append(StringBuffer.java:322)
at java.io.StringWriter.write(StringWriter.java:94)
at org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:92)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:135)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287)
at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:278)
at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:88)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at com.tastyminerals.cli.Printer.readMyFile(Printer.java:37)
at com.tastyminerals.cli.Printer.main(Printer.java:46)
I tried increasing the JVM heap up to -Xms512M -Xmx1024M; that didn't work, and I don't want to use any larger values.
Questions:
What is wrong with my code?
How should I modify my class to make it extract text from a test file >300 MB with Apache Tika?

You can construct the handler like this to avoid the size limit:
BodyContentHandler bodyHandler = new BodyContentHandler(-1);
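For completeness, here is a minimal sketch of how that fits into the asker's code (same file and Tika classes as above). Note that -1 only removes the handler's internal 100,000-character write limit; the extracted body is still buffered in memory, so the heap must still be large enough to hold it:
BodyContentHandler bodyHandler = new BodyContentHandler(-1);
AutoDetectParser parser = new AutoDetectParser();
try (InputStream stream = TikaInputStream.get(new File("test/pagecounts-20140701-060000.txt"))) {
    parser.parse(stream, bodyHandler, new Metadata(), new ParseContext());
}
String text = bodyHandler.toString();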

Pass BodyContentHandler a Writer or OutputStream instead of int
As Gagravarr mentioned, the BodyContentHandler you've used builds an internal string buffer of the file's content. Because Tika tries to store the entire content in memory at once, this approach will hit an OutOfMemoryError for large files.
If your goal is to write out the Tika parse results to another file for later processing, you can construct BodyContentHandler with a Writer (or OutputStream directly) instead of passing an int:
Path outputFile = Path.of("output.txt"); // Paths.get() if not using Java 11
PrintWriter printWriter = new PrintWriter(Files.newOutputStream(outputFile));
BodyContentHandler content = new BodyContentHandler(printWriter);
And then call Tika parse:
Path inputFile = Path.of("input.txt");
TikaInputStream inputStream = TikaInputStream.get(inputFile);
AutoDetectParser parser = new AutoDetectParser();
Metadata meta = new Metadata();
ParseContext context = new ParseContext();
parser.parse(inputStream, content, meta, context);
By doing this, Tika will automatically write the content to the outputFile as it parses, instead of trying to keep it all in memory. Using a PrintWriter will buffer the output, reducing the number of writes to disk.
Note that Tika will not automatically close your input or output streams for you.
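So, for example, you might close both ends with try-with-resources yourself; a sketch reusing the hypothetical input.txt/output.txt names and the same classes as above:
Path inputFile = Path.of("input.txt");
Path outputFile = Path.of("output.txt");
try (TikaInputStream inputStream = TikaInputStream.get(inputFile);
     PrintWriter printWriter = new PrintWriter(Files.newOutputStream(outputFile))) {
    BodyContentHandler content = new BodyContentHandler(printWriter);
    new AutoDetectParser().parse(inputStream, content, new Metadata(), new ParseContext());
}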

You can use incremental parsing
Tika tika = new Tika();
Reader fulltext = null;
String contentStr = null;
try {
    // tika.parse() returns a Reader that is parsed lazily as you read from it
    fulltext = tika.parse(response.getEntityInputStream());
    contentStr = IOUtils.toString(fulltext);
} finally {
    if (fulltext != null) {
        fulltext.close();
    }
}
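Note that IOUtils.toString() still builds the whole document as a single String, so a very large file can hit the same heap limit. For truly incremental processing you can read from the returned Reader in chunks instead; a minimal sketch, assuming the asker's file path (what you do with each chunk is up to you):
Tika tika = new Tika();
char[] buf = new char[8192];
try (Reader fulltext = tika.parse(new File("test/pagecounts-20140701-060000.txt"))) {
    int n;
    while ((n = fulltext.read(buf)) != -1) {
        // process buf[0..n) here, e.g. write it out or feed it to an indexer
    }
}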

Solution with ByteArrayInputStream
I had a similar problem with CSV files. If they were read in Java with the wrong charset, only part of the records could be imported. The following method from my library determines the file's correct encoding and prevents reading errors.
public static String lib_getCharset( String fullFile ) {
    // Initialize variables.
    String returnValue = "";
    BodyContentHandler handler = new BodyContentHandler( -1 );
    Metadata meta = new Metadata();
    // Convert the BufferedInputStream to a ByteArrayInputStream.
    try( final InputStream is = new BufferedInputStream( new FileInputStream( fullFile ) ) ) {
        InputStream bais = new ByteArrayInputStream( is.readAllBytes() );
        ParseContext context = new ParseContext();
        TXTParser parser = new TXTParser();
        // Run the Tika TXTParser and read the metadata.
        try {
            parser.parse( bais, handler, meta, context );
            // Fill the metadata's names in an array ...
            String[] metaNames = meta.names();
            // ... and iterate over it.
            for( String metaName : metaNames ) {
                // Check if a charset is described.
                if( metaName.equals( "Content-Encoding" ) ) {
                    returnValue = meta.get( metaName );
                }
            }
        } catch( SAXException | TikaException se_te ) {
            se_te.printStackTrace();
        }
    } catch( IOException e ) {
        e.printStackTrace();
    }
    return returnValue;
}
Using a Scanner, the file can then be imported as follows.
Scanner scanner = null;
String charsetChar = TrnsLib.lib_getCharset( fullFileName );
try {
    // Scan the file, e.g. with UTF-8 or
    // ISO8859-1 or windows-1252 for ANSI.
    scanner = new Scanner( new File( fullFileName ), charsetChar );
} catch( FileNotFoundException e ) {
    e.printStackTrace();
}
Don't forget to declare the two dependencies in the pom.xml:
https://repo1.maven.org/maven2/org/apache/tika/tika-core/2.4.1/
https://repo1.maven.org/maven2/org/apache/tika/tika-parser-text-module/2.4.1/
and the definition of the requires in module-info.java:
module org.wnt.wnt94lib {
    requires transitive org.apache.tika.core;
    requires transitive org.apache.tika.parser.txt;
}
My solution works fine with small files (up to about 100 lines of 300 characters). Larger files need more attention. The Babylonian confusion around CR and LF leads to inconsistencies under Apache Tika: with the parameter set to -1, the whole text file is read into the BodyContentHandler, but only the above-mentioned 100 lines are used to find the correct charset. And especially in CSV files, exotic characters like ä, ö or ü are rare. So, unluckily, Tika finds the combined CR and LF characters and concludes that the file must be ANSI instead of UTF-8.
So, what can you do? A quick and dirty fix is to add the letters ÄÖÜ to the file's first line. A better solution is to clean up the file with Notepad++: show all characters under View, Show Symbol; then under Search, Replace..., set Search Mode to Extended, enter \r\n under "Find what" and \n under "Replace with", set the cursor on the file's first line and press Replace All. This frees the file from the burden of remembering the good old typewriter and converts it into a proper Unix file with UTF-8.
Afterwards, however, do not edit the CSV file with Excel. That program, which I otherwise really appreciate, converts your file back into one with CR ballast. For correct saving without CR you have to use VBA; Ekkehard Horner describes how at: VBA : save a file with UTF-8 without BOM

Related

Spark Reading .7z files

I am trying to read .7z files in Spark using Scala or Java, but I can't find any appropriate methods or functionality.
For zip files I am able to read them, since the ZipInputStream class takes an InputStream, but for 7z files the SevenZFile class doesn't take any input stream.
https://commons.apache.org/proper/commons-compress/javadocs/api-1.16/org/apache/commons/compress/archivers/sevenz/SevenZFile.html
Zip file code
spark.sparkContext.binaryFiles("fileName").flatMap{case (name: String, content: PortableDataStream) =>
val zis = new ZipInputStream(content.open)
Stream.continually(zis.getNextEntry)
.takeWhile(_ != null)
.flatMap { _ =>
val br = new BufferedReader(new InputStreamReader(zis))
Stream.continually(br.readLine()).takeWhile(_ != null)
}}
I am trying similar code for the 7z files, something like:
spark.sparkContext.binaryFiles("filename").flatMap { case (name: String, content: PortableDataStream) =>
  val zis = new SevenZFile(content.open)
  Stream.continually(zis.getNextEntry)
    .takeWhile(_ != null)
    .flatMap { _ =>
      val br = new BufferedReader(new InputStreamReader(zis))
      Stream.continually(br.readLine()).takeWhile(_ != null)
    }
}
But SevenZFile doesn't accept these arguments. Looking for ideas.
If the file is in the local filesystem, the following solution works, but my file is in HDFS.
Local filesystem code
public static void decompress(String in, File destination) throws IOException {
    SevenZFile sevenZFile = new SevenZFile(new File(in));
    SevenZArchiveEntry entry;
    while ((entry = sevenZFile.getNextEntry()) != null) {
        if (entry.isDirectory()) {
            continue;
        }
        File curfile = new File(destination, entry.getName());
        File parent = curfile.getParentFile();
        if (!parent.exists()) {
            parent.mkdirs();
        }
        FileOutputStream out = new FileOutputStream(curfile);
        byte[] content = new byte[(int) entry.getSize()];
        sevenZFile.read(content, 0, content.length);
        out.write(content);
        out.close();
    }
}
After all these years of Spark evolution there should be an easy way to do this.
Instead of using the java.io.File-based approach, you could try the SeekableByteChannel method as shown in this alternative constructor.
You can use a SeekableInMemoryByteChannel to read a byte array. So as long as you can pick up the 7zip files from S3 or whatever and hand them off as byte arrays you should be alright.
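For example, here is a minimal sketch of that idea (assuming commons-compress 1.13+, where SevenZFile accepts a SeekableByteChannel; how you obtain the byte[] from HDFS/S3, e.g. via PortableDataStream.toArray(), is up to you):
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import org.apache.commons.compress.archivers.sevenz.SevenZArchiveEntry;
import org.apache.commons.compress.archivers.sevenz.SevenZFile;
import org.apache.commons.compress.utils.SeekableInMemoryByteChannel;

public class SevenZBytes {
    // Reads every entry of an in-memory .7z archive and returns each entry's text content.
    public static List<String> readEntries(byte[] archiveBytes) throws IOException {
        List<String> texts = new ArrayList<>();
        try (SevenZFile sevenZFile = new SevenZFile(new SeekableInMemoryByteChannel(archiveBytes))) {
            SevenZArchiveEntry entry;
            while ((entry = sevenZFile.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                byte[] content = new byte[(int) entry.getSize()];
                sevenZFile.read(content, 0, content.length);
                texts.add(new String(content, StandardCharsets.UTF_8));
            }
        }
        return texts;
    }
}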
With all of that said, Spark is really not well-suited for processing things like zip and 7zip files. I can tell you from personal experience I've seen it fail badly once the files are too large for Spark's executors to handle.
Something like Apache NiFi will work much better for expanding archives and processing them. FWIW, I'm currently handling a large data dump that has me frequently dealing with 50GB tarballs that have several million files in them, and NiFi handles them very gracefully.

How to access every entry of CompressedSource in Google Cloud Dataflow and get the byte[] of each subfile?

I have a compressed file on Google Storage, a gzipped tar composed of multiple text files. I need to access each subfile and do some operations on it, like applying a regular expression.
I can do the same thing on my local computer like this.
public static void untarFile( String filepath ) throws IOException {
    FileInputStream fin = new FileInputStream(filepath);
    BufferedInputStream in = new BufferedInputStream(fin);
    GzipCompressorInputStream gzIn = new GzipCompressorInputStream(in);
    TarArchiveInputStream tarInput = new TarArchiveInputStream(gzIn);
    try {
        TarArchiveEntry entry = null;
        while ((entry = (TarArchiveEntry) tarInput.getNextTarEntry()) != null) {
            byte[] fileContent = new byte[(int) entry.getSize()];
            tarInput.read(fileContent, 0, fileContent.length);
        }
    } finally {
        tarInput.close();
    }
}
This gives me fileContent, a byte[], that I can run further operations on. So I used CompressedSource on Google Cloud Dataflow and referred to its test code. It seems that I can only get individual bytes from the file rather than the whole byte[] of each subfile, so I am wondering if there is any solution for doing this on Google Cloud Dataflow.
TextIO does not support this directly, but you can create a new subclass of FileBasedSource to do this. You'll want to override isSplittable() to always return false, and then have readNextRecord() just read the entire file.

Java POI - Error: Unable to read entire header

I'm trying to read a .doc file with Java through the POI library. Here is my code:
FileInputStream fis = new FileInputStream(file.getAbsolutePath());
HWPFDocument document = new HWPFDocument(fis);
WordExtractor extractor = new WordExtractor(document);
String [] fileData = extractor.getParagraphText();
And I have this exception:
java.io.IOException: Unable to read entire header; 162 bytes read; expected 512 bytes
at org.apache.poi.poifs.storage.HeaderBlock.alertShortRead(HeaderBlock.java:226)
at org.apache.poi.poifs.storage.HeaderBlock.readFirst512(HeaderBlock.java:207)
at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:104)
at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:138)
at MicrosoftWordParser.getDocString(MicrosoftWordParser.java:277)
at MicrosoftWordParser.main(MicrosoftWordParser.java:86)
My file is not corrupted; I can open it with Microsoft Word.
I'm using POI 3.9 (the latest stable version).
Do you have an idea how to solve the problem?
Thank you.
readFirst512() will read the first 512 bytes of your InputStream and throw an exception if there are not enough bytes to read. I think your file is not big enough to be read by POI.
It is probably not a correct Word file. Is it really only 162 bytes long? Check in your filesystem.
I'd recommend creating a new Word file using Word or LibreOffice, and then try to read it using your program.
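For example, a quick way to check the size from Java first (hypothetical path):
File f = new File("myfile.doc");
System.out.println(f.getName() + " is " + f.length() + " bytes long");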
You should try this program:
package file_opration;

import java.io.*;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;

public class ReadDocFile {
    public static void main(String[] args) {
        File file = null;
        WordExtractor extractor = null;
        try {
            file = new File("filepath location");
            FileInputStream fis = new FileInputStream(file.getAbsolutePath());
            HWPFDocument document = new HWPFDocument(fis);
            extractor = new WordExtractor(document);
            String[] fileData = extractor.getParagraphText();
            for (int i = 0; i < fileData.length; i++) {
                if (fileData[i] != null) {
                    System.out.println(fileData[i]);
                }
            }
        } catch (Exception exep) {
            exep.printStackTrace();
        }
    }
}
Ahh, you've got a file, then you're spending loads of memory buffering the whole thing into memory by hiding your file behind an InputStream... Don't! If you have a File, give that to POI. Only give POI an InputStream if that's all you have.
Your code should be something like:
NPOIFSFileSystem fs = new NPOIFSFileSystem(new File("myfile.doc"));
HWPFDocument document = new HWPFDocument(fs.getRoot());
That'll be quicker and use less memory than reading it into an InputStream, and if there are problems with the file you should normally get slightly more helpful error messages too.
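For completeness, extraction then proceeds the same way as in the question; a sketch using the same classes:
NPOIFSFileSystem fs = new NPOIFSFileSystem(new File("myfile.doc"));
HWPFDocument document = new HWPFDocument(fs.getRoot());
WordExtractor extractor = new WordExtractor(document);
String[] fileData = extractor.getParagraphText();
fs.close();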
A 162-byte MS Word .doc is probably an "owner file", a temporary file that Word creates to signify that the document is locked/owned.
These files have a .doc extension but they are not real Word documents.

File Delete and Rename in Java

I have the following Java code which searches an XML file for a specific tag, adds some text to it, and saves the result to a temporary file. I couldn't find a way to rename the temporary file to the original file. Please suggest.
import java.io.*;

class ModifyXML {
    public void readMyFile(String inputLine) throws Exception {
        String record = "";
        File outFile = new File("tempFile.tmp");
        FileInputStream fis = new FileInputStream("InfectiousDisease.xml");
        BufferedReader br = new BufferedReader(new InputStreamReader(fis));
        FileOutputStream fos = new FileOutputStream(outFile);
        PrintWriter out = new PrintWriter(fos);
        while ((record = br.readLine()) != null) {
            if (record.endsWith("<add-info>")) {
                out.println("    " + "<add-info>");
                out.println("    " + inputLine);
            } else {
                out.println(record);
            }
        }
        out.flush();
        out.close();
        br.close();
        // Also we need to delete the original file
        // outFile.renameTo(InfectiousDisease.xml); // Not working
    }

    public static void main(String[] args) {
        try {
            ModifyXML f = new ModifyXML();
            f.readMyFile("This is infectious disease data");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Thanks
First delete the original file and then rename the new file:
File inputFile = new File("InfectiousDisease.xml");
File outFile = new File("tempFile.tmp");
if (inputFile.delete()) {
    outFile.renameTo(inputFile);
}
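On Java 7+ you can also do the replace in a single step with java.nio.file.Files.move; a sketch (not from the original answer), where REPLACE_EXISTING overwrites the old file:
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

Path temp = Paths.get("tempFile.tmp");
Path original = Paths.get("InfectiousDisease.xml");
// Replace the original file with the freshly written temp file
Files.move(temp, original, StandardCopyOption.REPLACE_EXISTING);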
A good method to rename files is:
File file = new File("path-here");
file.renameTo(new File("new path here"));
In your code there are several issues.
First, your description mentions renaming the original file and adding some text to it. Your code doesn't do that; it opens two files, one for reading and one for writing (with the additional text). That is the right way to do things, as adding text in-place is not really feasible using the techniques you are using.
The second issue is that you are opening a temporary file. Temporary files remove themselves upon closing, so all the work you did adding your text disappears as soon as you close the file.
The third issue is that you are modifying XML files as plain text. This sometimes works as XML files are a subset of plain text files, but there is no indication that you attempted to ensure that the output file was an XML file. Perhaps you know more about your input files than is mentioned, but if you want this to work correctly for 100% of the input cases, you probably want to create a SAX writer that writes out all a SAX reader reads, with the additional information in the correct tag location.
You can use
outFile.renameTo(new File(newFileName));
You have to ensure these files are not open at the time.

How do I get the filename of a file inside a gzip in Java?

int BUFFER_SIZE = 4096;
byte[] buffer = new byte[BUFFER_SIZE];
try {
    InputStream input = new GZIPInputStream(new FileInputStream("a_gunzipped_file.gz"));
    OutputStream output = new FileOutputStream("current_output_name");
    int n = input.read(buffer, 0, BUFFER_SIZE);
    while (n >= 0) {
        output.write(buffer, 0, n);
        n = input.read(buffer, 0, BUFFER_SIZE);
    }
    input.close();
    output.close();
} catch (IOException e) {
    System.out.println("error: \n\t" + e.getMessage());
}
Using the above code I can successfully extract a gzip's contents, although the extracted file's name will, as expected, always be current_output_name (I know that's because I declared it that way in the code). My problem is that I don't know how to get the file's original filename while it is still inside the archive.
Though java.util.zip provides a ZipEntry, I couldn't use it on gzip files.
Any alternatives?
I kind of agree with Michael Borgwardt's reply, but it is not entirely true: the gzip file specification includes an optional file name stored in the header of the .gz file. Sadly there is no way (as far as I know) of getting that name in current Java (1.6). As seen in the OpenJDK implementation of GZIPInputStream, in the method getHeader, they skip reading the file name:
// Skip optional file name
if ((flg & FNAME) == FNAME) {
    while (readUByte(in) != 0) ;
}
I have modified the GZIPInputStream class to get the optional filename out of the gzip archive (I'm not sure if I am allowed to do that; download the original version from here). You only need to add a member String filename; to the class and modify the above code to be:
// Read the optional file name instead of skipping it
if ((flg & FNAME) == FNAME) {
    filename = "";
    int _byte = 0;
    while ((_byte = readUByte(in)) != 0) {
        filename += (char) _byte;
    }
}
It worked for me.
Apache Commons Compress offers two options for obtaining the filename:
With metadata (Java 7+ sample code)
try (GzipCompressorInputStream gcis =
         new GzipCompressorInputStream(new FileInputStream("a_gunzipped_file.gz"))) {
    String filename = gcis.getMetaData().getFilename();
}
With "the convention"
String filename = GzipUtils.getUnCompressedFilename("a_gunzipped_file.gz");
References
Apache Commons Compress
GzipCompressorInputStream
See also: GzipUtils#getUnCompressedFilename
Actually, the GZIP file format does allow the original filename to be specified: a member can carry the name when the FNAME flag is set. I do not see a way to do this in the Java standard libraries though.
http://www.gzip.org/zlib/rfc-gzip.html#specification
Following the answers above, here is an example that creates a file "myTest.csv.gz" containing a file "myTest.csv". Notice that with java.util.zip you can't change the internal file name, and you can't add more files into the .gz file.
@Test
public void gzipFileName() throws Exception {
    File workingFile = new File("target", "myTest.csv.gz");
    GZIPOutputStream gzipOutputStream = new GZIPOutputStream(new FileOutputStream(workingFile));
    PrintWriter writer = new PrintWriter(gzipOutputStream);
    writer.println("hello,line,1");
    writer.println("hello,line,2");
    writer.close();
}
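If you do need a filename stored in the gzip header when writing, java.util.zip won't let you set it, but Apache Commons Compress should; a minimal sketch, assuming commons-compress is on the classpath and that I recall its GzipParameters API correctly:
import java.io.FileOutputStream;
import java.io.PrintWriter;
import org.apache.commons.compress.compressors.gzip.GzipCompressorOutputStream;
import org.apache.commons.compress.compressors.gzip.GzipParameters;

public void gzipWithFilename() throws Exception {
    GzipParameters params = new GzipParameters();
    params.setFilename("myTest.csv"); // stored in the gzip FNAME header field
    try (GzipCompressorOutputStream out =
                 new GzipCompressorOutputStream(new FileOutputStream("myTest.csv.gz"), params);
         PrintWriter writer = new PrintWriter(out)) {
        writer.println("hello,line,1");
        writer.println("hello,line,2");
    }
}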
Gzip is purely compression. There is no archive, it's just the file's data, compressed.
The convention is for gzip to append .gz to the filename, and for gunzip to remove that extension. So, logfile.txt becomes logfile.txt.gz when compressed, and again logfile.txt when it's decompressed. If you rename the file, the name information is lost.
