Java POI - Error: Unable to read entire header

I'm trying to read a .doc file with Java through the POI library. Here is my code:
FileInputStream fis = new FileInputStream(file.getAbsolutePath());
HWPFDocument document = new HWPFDocument(fis);
WordExtractor extractor = new WordExtractor(document);
String [] fileData = extractor.getParagraphText();
And I get this exception:
java.io.IOException: Unable to read entire header; 162 bytes read; expected 512 bytes
at org.apache.poi.poifs.storage.HeaderBlock.alertShortRead(HeaderBlock.java:226)
at org.apache.poi.poifs.storage.HeaderBlock.readFirst512(HeaderBlock.java:207)
at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:104)
at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:138)
at MicrosoftWordParser.getDocString(MicrosoftWordParser.java:277)
at MicrosoftWordParser.main(MicrosoftWordParser.java:86)
My file is not corrupted; I can open it in Microsoft Word.
I'm using POI 3.9 (the latest stable version).
Do you have an idea how to solve the problem?
Thank you.

readFirst512() reads the first 512 bytes of your InputStream and throws an exception if there are not enough bytes to read. I think your file is simply not big enough to be a valid document for POI.
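If you want a clearer failure mode yourself, you can check the size up front; a minimal sketch (the 512-byte minimum is the size of the OLE2 header block, and the file path is a placeholder):
File doc = new File("myfile.doc");
if (doc.length() < 512) {
    throw new IOException("Too small to be an OLE2 .doc file: " + doc.length() + " bytes");
}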

It is probably not a correct Word file. Is it really only 162 bytes long? Check in your filesystem.
I'd recommend creating a new Word file using Word or LibreOffice, and then try to read it using your program.

You could try this program:
package file_opration;

import java.io.*;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;

public class ReadDocFile {
    public static void main(String[] args) {
        WordExtractor extractor = null;
        try {
            File file = new File("filepath location");
            FileInputStream fis = new FileInputStream(file.getAbsolutePath());
            HWPFDocument document = new HWPFDocument(fis);
            extractor = new WordExtractor(document);
            String[] fileData = extractor.getParagraphText();
            for (int i = 0; i < fileData.length; i++) {
                if (fileData[i] != null) {
                    System.out.println(fileData[i]);
                }
            }
        } catch (Exception e) {
            // Don't swallow the exception silently; at least print it.
            e.printStackTrace();
        }
    }
}

Ahh, you've got a file, but then you're wasting loads of memory by buffering the whole thing into memory behind an InputStream... Don't! If you have a File, give that to POI. Only give POI an InputStream if that's all you have.
Your code should be something like:
NPOIFSFileSystem fs = new NPOIFSFileSystem(new File("myfile.doc"));
HWPFDocument document = new HWPFDocument(fs.getRoot());
That'll be quicker and use less memory than reading it into an InputStream, and if there are problems with the file you should normally get slightly more helpful error messages too.
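For completeness, a sketch of the full extraction along those lines (the file name is a placeholder; it needs the usual POI imports, and the NPOIFSFileSystem should be closed when you're done):
NPOIFSFileSystem fs = new NPOIFSFileSystem(new File("myfile.doc"));
try {
    HWPFDocument document = new HWPFDocument(fs.getRoot());
    WordExtractor extractor = new WordExtractor(document);
    for (String paragraph : extractor.getParagraphText()) {
        System.out.println(paragraph);
    }
} finally {
    fs.close();
}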

A 162-byte MS Word .doc is probably an "owner file": a temporary file that Word creates to mark the document as locked/owned.
Owner files have a .doc extension, but they are not real Word documents.
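If you are batch-processing a directory, you can skip these before handing them to POI; a small sketch (the "~$" name prefix and the 512-byte minimum are heuristics, not guarantees):
File candidate = new File("report.doc");
boolean looksLikeOwnerFile =
        candidate.getName().startsWith("~$") || candidate.length() < 512;
if (looksLikeOwnerFile) {
    System.out.println("Skipping Word owner/lock file: " + candidate.getName());
}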

Related

Combining compressed Gzipped Text Files using Java

My question might not be entirely Java-related, but I'm currently looking for a way to combine several compressed (gzipped) text files without having to recompress them manually. Let's say I have 4 gzip-compressed text files and want to combine them into one single *.gz file without decompressing and recompressing them. My current method is to open an InputStream, parse each file line by line and store it in a GZIPOutputStream, which works but isn't very fast. I could of course also call
zcat file1 file2 file3 | gzip -c > output_all_four.gz
This would work too, but isn't really fast either.
My idea is to copy the input stream and write it to the output stream directly, without "parsing" the stream, as I don't need to manipulate anything. Is something like this possible?
Below is a simple solution in Java (it does the same as the zcat example). Concatenating the raw streams works because the gzip format allows a file to consist of multiple gzip members. Extra buffering of the input/output has been omitted to keep the code slim.
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.SequenceInputStream;
import java.util.zip.GZIPInputStream;

public class ConcatFiles {
    public static void main(String[] args) throws IOException {
        // Concatenate the single gzip files into one gzip file.
        try (InputStream isOne = new FileInputStream("file1.gz");
             InputStream isTwo = new FileInputStream("file2.gz");
             InputStream isThree = new FileInputStream("file3.gz");
             SequenceInputStream sis = new SequenceInputStream(new SequenceInputStream(isOne, isTwo), isThree);
             OutputStream bos = new FileOutputStream("output_all_three.gz")) {
            byte[] buffer = new byte[8192];
            int bytesRead;
            while ((bytesRead = sis.read(buffer)) != -1) {
                bos.write(buffer, 0, bytesRead);
            }
            bos.flush();
        }
        // Un-gzip the combined file; the output contains the
        // concatenated content of the single uncompressed files.
        try (GZIPInputStream gzipis = new GZIPInputStream(new FileInputStream("output_all_three.gz"));
             OutputStream bos = new FileOutputStream("output_all_three")) {
            byte[] buffer = new byte[8192];
            int bytesRead;
            while ((bytesRead = gzipis.read(buffer)) != -1) {
                bos.write(buffer, 0, bytesRead);
            }
            bos.flush();
        }
    }
}
The above method works if you just need to concatenate several gzipped files. In my case I had written a web servlet and my responses were 20-30 KB, so I was sending the response gzipped.
I tried to gzip all my individual JS files once at server start and then append the dynamically generated code at runtime using the above method. I could print the entire response in my log file, but Chrome was only able to unzip the first member; the rest of the output arrived as raw bytes.
After some research I found out that this is not possible with Chrome; they closed the bug without fixing it:
https://bugs.chromium.org/p/chromium/issues/detail?id=20884

How to parse large text file with Apache Tika 1.5?

Problem:
For my test, I want to extract text from a 335 MB text file, Wikipedia's "pagecounts-20140701-060000.txt", with Apache Tika.
My solution:
I tried to use TikaInputStream since it provides buffering, then I tried BufferedInputStream, but that didn't solve my problem. Here is my test class:
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class Printer {

    public void readMyFile(String fname) throws IOException, SAXException, TikaException {
        System.out.println("Working...");
        File f = new File(fname);
        // InputStream stream = TikaInputStream.get(new File(fname));
        InputStream stream = new BufferedInputStream(new FileInputStream(fname));
        Metadata meta = new Metadata();
        ContentHandler content = new BodyContentHandler(Integer.MAX_VALUE);
        AutoDetectParser parser = new AutoDetectParser();
        String mime = new Tika().detect(f);
        meta.set(Metadata.CONTENT_TYPE, mime);
        System.out.println("trying to parse...");
        try {
            parser.parse(stream, content, meta, new ParseContext());
        } finally {
            stream.close();
        }
    }

    public static void main(String[] args) {
        Printer p = new Printer();
        try {
            p.readMyFile("test/pagecounts-20140701-060000.txt");
        } catch (IOException | SAXException | TikaException e) {
            e.printStackTrace();
        }
    }
}
Problem:
Upon invoking the parse method of the parser I am getting:
Working...
trying to parse...
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2367)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:535)
at java.lang.StringBuffer.append(StringBuffer.java:322)
at java.io.StringWriter.write(StringWriter.java:94)
at org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:92)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:135)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287)
at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:278)
at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:88)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at com.tastyminerals.cli.Printer.readMyFile(Printer.java:37)
at com.tastyminerals.cli.Printer.main(Printer.java:46)
I tried increasing the JVM heap up to -Xms512M -Xmx1024M; that didn't work, and I don't want to use bigger values.
Questions:
What is wrong with my code?
How should I modify my class to make it extract text from a test file >300 MB with Apache Tika?
You can construct the handler like this to avoid the size limit:
BodyContentHandler bodyHandler = new BodyContentHandler(-1);
Pass BodyContentHandler a Writer or OutputStream instead of an int
As Gagravarr mentioned, the BodyContentHandler you've used builds up an internal string buffer with the file's content. Because Tika tries to store the entire content in memory at once, this approach will hit an OutOfMemoryError for large files.
If your goal is to write out the Tika parse results to another file for later processing, you can construct BodyContentHandler with a Writer (or OutputStream directly) instead of passing an int:
Path outputFile = Path.of("output.txt"); // Paths.get() if not using Java 11
PrintWriter printWriter = new PrintWriter(Files.newOutputStream(outputFile));
BodyContentHandler content = new BodyContentHandler(printWriter);
And then call Tika parse:
Path inputFile = Path.of("input.txt");
TikaInputStream inputStream = TikaInputStream.get(inputFile);
AutoDetectParser parser = new AutoDetectParser();
Metadata meta = new Metadata();
ParseContext context = new ParseContext();
parser.parse(inputStream, content, meta, context);
By doing this, Tika will automatically write the content to the outputFile as it parses, instead of trying to keep it all in memory. Using a PrintWriter will buffer the output, reducing the number of writes to disk.
Note that Tika will not automatically close your input or output streams for you.
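Since Tika leaves the closing to you, try-with-resources keeps this tidy; a sketch combining the snippets above (inside a method that declares the checked exceptions; the file names are placeholders):
Path inputFile = Path.of("input.txt");
Path outputFile = Path.of("output.txt");
try (TikaInputStream inputStream = TikaInputStream.get(inputFile);
     PrintWriter printWriter = new PrintWriter(Files.newOutputStream(outputFile))) {
    BodyContentHandler content = new BodyContentHandler(printWriter);
    new AutoDetectParser().parse(inputStream, content, new Metadata(), new ParseContext());
}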
You can use incremental parsing:
Tika tika = new Tika();
Reader fulltext = null;
String contentStr = null;
try {
    fulltext = tika.parse(response.getEntityInputStream());
    contentStr = IOUtils.toString(fulltext);
} finally {
    if (fulltext != null) {
        fulltext.close();
    }
}
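Bear in mind that IOUtils.toString still pulls the whole text into memory, so for very large inputs it defeats the incremental idea. A sketch that drains the Reader in chunks and writes straight to a file instead (imports from java.io; the file names are placeholders):
Tika tika = new Tika();
try (Reader fulltext = tika.parse(new File("input.txt"));
     Writer out = new FileWriter("output.txt")) {
    char[] chunk = new char[8192];
    int charsRead;
    while ((charsRead = fulltext.read(chunk)) != -1) {
        out.write(chunk, 0, charsRead);
    }
}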
Solution with ByteArrayInputStream
I had a similar problem with CSV files: if they were read in Java with the wrong charset, only part of the records could be imported. The following method from my library detects the file's correct encoding and thus prevents reading errors.
public static String lib_getCharset( String fullFile ) {
    // Initialize variables.
    String returnValue = "";
    BodyContentHandler handler = new BodyContentHandler( -1 );
    Metadata meta = new Metadata();
    // Convert the BufferedInputStream to a ByteArrayInputStream.
    try( final InputStream is = new BufferedInputStream( new FileInputStream( fullFile ) ) ) {
        InputStream bais = new ByteArrayInputStream( is.readAllBytes() );
        ParseContext context = new ParseContext();
        TXTParser parser = new TXTParser();
        // Run the Tika TXTParser and read the metadata.
        try {
            parser.parse( bais, handler, meta, context );
            // Fill the metadata's names in an array ...
            String[] metaNames = meta.names();
            // ... and iterate over it.
            for( String metaName : metaNames ) {
                // Check if a charset is described.
                if( metaName.equals( "Content-Encoding" ) ) {
                    returnValue = meta.get( metaName );
                }
            }
        } catch( SAXException | TikaException se_te ) {
            se_te.printStackTrace();
        }
    } catch( IOException e ) {
        e.printStackTrace();
    }
    return returnValue;
}
Using a Scanner, the file can then be imported as follows:
Scanner scanner = null;
String charsetChar = TrnsLib.lib_getCharset( fullFileName );
try {
    // Scan the file, e.g. with UTF-8 or
    // ISO8859-1 or windows-1252 for ANSI.
    scanner = new Scanner( new File( fullFileName ), charsetChar );
} catch( FileNotFoundException e ) {
    e.printStackTrace();
}
Don't forget to declare the two dependencies in your pom.xml:
https://repo1.maven.org/maven2/org/apache/tika/tika-core/2.4.1/
https://repo1.maven.org/maven2/org/apache/tika/tika-parser-text-module/2.4.1/
and the requires directives in module-info.java:
module org.wnt.wnt94lib {
    requires transitive org.apache.tika.core;
    requires transitive org.apache.tika.parser.txt;
}
My solution works fine with small files (up to roughly 100 lines of 300 characters); larger files need more attention. The Babylonian confusion around CR and LF led to inconsistencies under Apache Tika: with the parameter set to -1, the whole text file is read into the BodyContentHandler, but only the roughly 100 lines mentioned above are used to detect the charset. In CSV files in particular, exotic characters like ä, ö or ü are rare, so, as bad luck would have it, Tika finds the combined CR and LF characters and concludes that the file must be ANSI rather than UTF-8.
So, what can you do? Quick and dirty, you can add the letters ÄÖÜ to the file's first line. A better solution, however, is to normalise the line endings: load the file with Notepad++, show all characters (View, Show Symbol), and under Search, Replace... delete all CRs. To do this, activate Extended under Search Mode, enter \r\n under Find what and \n under Replace with, set the cursor on the file's first line and press Replace All. This frees the file from the burden of remembering the good old typewriter and turns it into a proper Unix file with UTF-8.
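If you prefer to do that CR/LF normalisation in code rather than in Notepad++, a minimal sketch (assumes Java 11+, a UTF-8 file, and a file small enough to read into memory; the path is a placeholder):
Path csv = Path.of( "input.csv" );
String text = Files.readString( csv, StandardCharsets.UTF_8 );
Files.writeString( csv, text.replace( "\r\n", "\n" ), StandardCharsets.UTF_8 );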
Afterwards, however, do not edit the CSV file with Excel. That program, which I otherwise really appreciate, converts your file back into one with CR ballast. To save correctly, without CR, you have to use VBA; Ekkehard Horner describes how at: VBA : save a file with UTF-8 without BOM

What's up with AssetFileDescriptor.getFileDescriptor()?

I have an uncompressed binary file in res/raw that I was reading this way:
public byte[] file2Bytes (int rid) {
byte[] buffer = null;
try {
AssetFileDescriptor afd = res.openRawResourceFd(rid);
FileInputStream in = new FileInputStream(afd.getFileDescriptor());
int len = (int)afd.getLength();
buffer = new byte[len];
in.read(buffer, 0, len);
in.close();
} catch (Exception ex) {
Log.w(ACTNAME, "file2Bytes() fail\n"+ex.toString());
return null;
}
return buffer;
}
However, buffer did not contain what it was supposed to. The source file is 1024 essentially random bytes (a binary key), but buffer, when written out and examined, was not the same. Amongst unprintable bytes at the beginning appeared "res/layout/main.xml" (the literal path), and further down, part of the text content of another file from res/raw. O_O?
Exasperated after a while, I tried:
AssetFileDescriptor afd = res.openRawResourceFd(rid);
//FileInputStream in = new FileInputStream(afd.getFileDescriptor());
FileInputStream in = afd.createInputStream();
Presto, I got the content correctly -- this is easily reproducible.
So the relevant API docs read:
public FileDescriptor getFileDescriptor ()
Returns the FileDescriptor that can be used to read the data in the
file.
public FileInputStream createInputStream ()
Create and return a new auto-close input stream for this asset. This
will either return a full asset
AssetFileDescriptor.AutoCloseInputStream, or an underlying
ParcelFileDescriptor.AutoCloseInputStream depending on whether the
object represents a complete file or sub-section of a file. You should
only call this once for a particular asset.
Why would a FileInputStream() constructed from getFileDescriptor() end up with garbage whereas createInputStream() gives proper access?
As per pskink's comment, the FileDescriptor returned by getFileDescriptor() is apparently not an fd that refers to just this one file; it refers to whatever bundle/parcel/conglomeration aapt has made of the resources.
AssetFileDescriptor afd = res.openRawResourceFd(rid);
FileInputStream in = new FileInputStream(afd.getFileDescriptor());
in.skip(afd.getStartOffset());
Turns out to be the equivalent of the FileInputStream in = afd.createInputStream() version.
I suppose there is a hint in the difference between "create" (something new) and "get" (something existing). :/
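If you do stick with getFileDescriptor(), you also have to respect the declared length, not just the start offset, or you will read past the end of your entry into the next resource; a sketch of the idea (error handling omitted):
AssetFileDescriptor afd = res.openRawResourceFd(rid);
FileInputStream in = new FileInputStream(afd.getFileDescriptor());
in.skip(afd.getStartOffset());
byte[] buffer = new byte[(int) afd.getLength()];
int off = 0;
while (off < buffer.length) {
    int n = in.read(buffer, off, buffer.length - off);
    if (n == -1) break;   // unexpected end of stream
    off += n;
}
in.close();
afd.close();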
AssetFileDescriptor can be thought of as the entry point to the entire package's assets data.
I have run into the same issue and finally solved it.
If you want to manually create a stream from an AssetFileDescriptor, you have to skip n bytes to reach the requested resource. It is as if you were paging through all the available files packed into one big file.
Thanks to pskink! I had a look at the hex content of the jpg image I want to acquire; it starts with -1 (0xFF as a signed byte). The thing is, there are two jpg images. I did not know that, so I arbitrarily skipped 76 bytes and got the first image!

Apache POI fails to save (HWPFDocument.write) large word doc files

I want to remove Word metadata from .doc files. My .docx files work fine with XWPFDocument, but the following code for removing metadata fails for large (> 1 MB) .doc files. For example, with a 6 MB .doc file containing images, it outputs a 4.5 MB file in which some images are missing.
public static InputStream removeMetaData(InputStream inputStream) throws IOException {
    POIFSFileSystem fss = new POIFSFileSystem(inputStream);
    HWPFDocument doc = new HWPFDocument(fss);
    // **it even fails on large files if you remove from here to 'until' below**
    SummaryInformation si = doc.getSummaryInformation();
    si.removeAuthor();
    si.removeComments();
    si.removeLastAuthor();
    si.removeKeywords();
    si.removeSubject();
    si.removeTitle();
    doc.getDocumentSummaryInformation().removeCategory();
    doc.getDocumentSummaryInformation().removeCompany();
    doc.getDocumentSummaryInformation().removeManager();
    try {
        doc.getDocumentSummaryInformation().removeCustomProperties();
    } catch (Exception e) {
        // can not remove above
    }
    // until
    ByteArrayOutputStream os = new ByteArrayOutputStream();
    doc.write(os);
    os.flush();
    os.close();
    return new ByteArrayInputStream(os.toByteArray());
}
Related posts:
How to save the Word Document using POI API?
https://stackoverflow.com/questions/9758955/saving-poi-document-correctly
Which version of Apache POI are you using?
This seems to be Bug 46220 - Regression: Some embedded images being lost.
Please upgrade to the latest release of POI (3.8) and try again.
Hope that helps.

File Delete and Rename in Java

I have the following Java code which searches an XML file for a specific tag, adds some text to it and saves the result. I couldn't find a way to rename the temporary file to the original file. Please suggest a fix.
import java.io.*;

class ModifyXML {

    public void readMyFile(String inputLine) throws Exception
    {
        String record = "";
        File outFile = new File("tempFile.tmp");
        FileInputStream fis = new FileInputStream("InfectiousDisease.xml");
        BufferedReader br = new BufferedReader(new InputStreamReader(fis));
        FileOutputStream fos = new FileOutputStream(outFile);
        PrintWriter out = new PrintWriter(fos);
        while ((record = br.readLine()) != null)
        {
            if (record.endsWith("<add-info>"))
            {
                out.println(" " + "<add-info>");
                out.println(" " + inputLine);
            }
            else
            {
                out.println(record);
            }
        }
        out.flush();
        out.close();
        br.close();
        // Also we need to delete the original file
        //outFile.renameTo(InfectiousDisease.xml); // Not working
    }

    public static void main(String[] args) {
        try
        {
            ModifyXML f = new ModifyXML();
            f.readMyFile("This is infectious disease data");
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }
}
Thanks
First delete the original file and then rename the new file:
File inputFile = new File("InfectiousDisease.xml");
File outFile = new File("tempFile.tmp");
if (inputFile.delete()) {
    outFile.renameTo(inputFile);
}
A good way to rename files is:
File file = new File("path-here");
file.renameTo(new File("new path here"));
In your code there are several issues.
First, your description mentions renaming the original file and adding some text to it. Your code doesn't do that; it opens two files, one for reading and one for writing (with the additional text). That is the right way to do things, as adding text in place is not really feasible with the techniques you are using.
The second issue is that you are writing to a temporary file. Temporary files remove themselves upon closing, so all the work you did adding your text disappears as soon as you close the file.
The third issue is that you are modifying XML files as plain text. This sometimes works, since XML files are a subset of plain text files, but there is no indication that you attempted to ensure that the output file is still well-formed XML. Perhaps you know more about your input files than is mentioned, but if you want this to work correctly for 100% of the input cases, you probably want a SAX writer that writes out everything a SAX reader reads, inserting the additional information at the correct tag location; see the sketch below.
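A sketch of that streaming approach, here using StAX (a close cousin of the SAX filter the answer suggests) rather than SAX itself; the tag name and the idea of inserting extra text come from the question, everything else is illustrative:
import java.io.FileInputStream;
import java.io.FileOutputStream;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;

class AddInfoStreaming {
    static void addInfo(String inFile, String outFile, String extraText) throws Exception {
        XMLEventFactory events = XMLEventFactory.newInstance();
        try (FileInputStream in = new FileInputStream(inFile);
             FileOutputStream out = new FileOutputStream(outFile)) {
            XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
            XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                writer.add(event);
                // Right after the opening <add-info> tag, emit the extra text as a text node.
                if (event.isStartElement()
                        && "add-info".equals(event.asStartElement().getName().getLocalPart())) {
                    writer.add(events.createCharacters(extraText));
                }
            }
            writer.close();
            reader.close();
        }
    }
}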
You can use
outFile.renameTo(new File(newFileName));
You have to ensure these files are not open at the time.
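A more robust alternative to renameTo(), which just returns false on failure without saying why, is java.nio.file.Files.move (Java 7+), which throws a descriptive exception and can replace the target in one step; a minimal sketch using the names from the question (imports from java.nio.file):
Files.move(Paths.get("tempFile.tmp"), Paths.get("InfectiousDisease.xml"),
        StandardCopyOption.REPLACE_EXISTING);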
