How to read doc file using Poi?

How to read doc file using Poi? - java

I am trying to view word file in my editor pane
I tried these lines
import java.awt.Dimension;
import java.awt.GridLayout;
import java.io.File;
import java.io.FileInputStream;
import javax.swing.JEditorPane;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
public class editorpane extends JEditorPane
{
public editorpane(File file)
{
try
{
FileInputStream fis = new FileInputStream(file.getAbsolutePath());
HWPFDocument hwpfd = new HWPFDocument(fis);
WordExtractor we = new WordExtractor(hwpfd);
String[] array = we.getParagraphText();
for (int i = 0; i < array.length; i++)
{
this.setPage(array[i]);
}
} catch (Exception e)
{
e.printStackTrace();
}
but gives me
org.apache.poi.poifs.filesystem.OfficeXmlFileException: The supplied data appears to be in the Office 2007+ XML. You are calling the part of POI that deals with OLE2 Office Documents. You need to call a different part of POI to process this data (eg XSSF instead of HSSF)
at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:131)
at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:104)
at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:138)
at org.apache.poi.hwpf.HWPFDocumentCore.verifyAndBuildPOIFS(HWPFDocumentCore.java:106)
at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:174)
at frame1.editorpane.<init>(editorpane.java:24)
in this line
HWPFDocument hwpfd = new HWPFDocument(fis);
how can I solve that ??
beside I am not sure about these lines
for (int i = 0; i < array.length; i++)
{
this.setPage(array[i]);
}
can I get them confirmed ??

You are trying to open a .docx file (XWPF) with code for .doc (HWPF) files. You can use XWPFWordExtractor for .docx files.
There is an ExtractorFactory which you can use to let POI decide which of these applies and uses the correct class to open the file, however you can then not iterate by page as only a generic getText() method is available then.
Use it like this
POITextExtractor extractor = ExtractorFactory.createExtractor(file);
extractor.getText();

Related

how to judge if the file is doc or docx in POI

The title may be a little confusing. The simplest method must be judging by extension name just like:
// is represents the InputStream
if (filePath.endsWith("doc")) {
WordExtractor ex = new WordExtractor(is);
text = ex.getText();
ex.close();
} else if(filePath.endsWith("docx")) {
XWPFDocument doc = new XWPFDocument(is);
XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
text = extractor.getText();
extractor.close();
}
This works in most cases. But I have found that for certain file whose extension is doc (a docx file essentially) if you open using winrar, you will find xml files. As it is known that a docx file is a zip file consists of xml files.
I believe this problem must not be rare. But I have not found any information about this. Obviously, judging by extension name to read a doc or docx is not appropriate.
In my case, I have to read a lot of files. And I will even read the doc or docx inside a compressed file, zip, 7z or even rar. Hence, I have to read content by inputStream instead of a File or something else. So how to know whether a file is .docx or .doc format from Apache POI is totally not suitable for my case with ZipInputStream.
What is the best way to judge a file is a doc or docx? I want a solution to read the content from a file which may be doc or docx. But not only just simply judge if it is a doc or docx. Apparently, ZipInpuStream is not a good method for my case. And I believe it is not a appropriate method for others either. Why do I have to judge if the file is doc or docx by an exception?

Using the current stable apache poi version 3.17 you may use FileMagic. But internally this will of course also have a look into the files.
Example:
import java.io.InputStream;
import java.io.FileInputStream;
import java.io.BufferedInputStream;
import org.apache.poi.poifs.filesystem.FileMagic;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
public class ReadWord {
static String read(InputStream is) throws Exception {
System.out.println(FileMagic.valueOf(is));
String text = "";
if (FileMagic.valueOf(is) == FileMagic.OLE2) {
WordExtractor ex = new WordExtractor(is);
text = ex.getText();
ex.close();
} else if(FileMagic.valueOf(is) == FileMagic.OOXML) {
XWPFDocument doc = new XWPFDocument(is);
XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
text = extractor.getText();
extractor.close();
}
return text;
}
public static void main(String[] args) throws Exception {
InputStream is = new BufferedInputStream(new FileInputStream("ExampleOLE.doc")); //really a binary OLE2 Word file
System.out.println(read(is));
is.close();
is = new BufferedInputStream(new FileInputStream("ExampleOOXML.doc")); //a OOXML Word file named *.doc
System.out.println(read(is));
is.close();
is = new BufferedInputStream(new FileInputStream("ExampleOOXML.docx")); //really a OOXML Word file
System.out.println(read(is));
is.close();
}
}

try {
new ZipFile(new File("/Users/giang/Documents/a.doc"));
System.out.println("this file is .docx");
} catch (ZipException e) {
System.out.println("this file is not .docx");
e.printStackTrace();
}

PDFBox IO Exception: COSStream has been closed and cannot be read

I am having an issue with some code I'm writing in Java using PDFBox. I am attempting to populate a PDF with particular forms based on values read from an excel spreadsheet. Below is my class file.
import java.io.FileInputStream;
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.PDPageContentStream.AppendMode;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDType1Font;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
import org.apache.poi.xssf.usermodel.XSSFSheet;
import org.apache.poi.hssf.usermodel.*;
/**
* This is a test file for reading and populating a PDF with specific forms
*/
public class JU_TestFile {
PDPage Stick_Form;
PDPage IKE_Form;
PDPage BO_Form;
/**
* Constructor.
*/
public JU_TestFile() throws IOException
{
this.BO_Form = (PDPage) PDDocument.load(new File("C:\\Users\\saf\\Desktop\\JavaTest\\BO Pole Form.pdf")).getPage(0);
this.IKE_Form = (PDPage) PDDocument.load(new File("C:\\Users\\saf\\Desktop\\JavaTest\\IKE Form.pdf")).getPage(0);
this.Stick_Form = (PDPage) PDDocument.load(new File("C:\\Users\\saf\\Desktop\\JavaTest\\Sticking Form.pdf")).getPage(0);
}
public void buildFile(String fileName, String excelSheet) throws IOException {
// Create a Blank PDF Document and load in JU Excel Spreadsheet
PDDocument workingDocument = new PDDocument();
FileInputStream fis = new FileInputStream(new File(excelSheet));
// Load in the workbook
HSSFWorkbook JU_XML = new HSSFWorkbook(fis);
int sheetNumber = 0;
int rowNumber = 0;
String cellValue = "Starting Value";
HSSFSheet currentSheet = JU_XML.getSheetAt(sheetNumber);
// While we have not reached the 25th row in our current sheet
while (rowNumber <= 24) {
// Get the value in the current row, on the 8th column in the xls file
cellValue = currentSheet.getRow(rowNumber + 6).getCell(7).getStringCellValue();
// If it has stuff in it,
if (cellValue != "") {
// Check if it has the letters "IKE" and append the IKE form to our PDF
if (cellValue != "IKE") {
workingDocument.importPage(IKE_Form);
// If it is anything else (other than empty), append the Stick Form to our PDF
} else {
workingDocument.importPage(Stick_Form);
}
// Let's move on to the next row
rowNumber++;
// If the next row number is the "26th" row, we know we need to move on to the
// next sheet, and also reset the rows to the first row of that next sheet
if (rowNumber == 25) {
rowNumber = 0;
currentSheet = JU_XML.getSheetAt(++sheetNumber);
}
// if the 9th row is empty, we should break out of the loop and save/close our PDF, we are done
} else {
break;
}
}
workingDocument.save(fileName);
workingDocument.close();
}
}
I am getting the following error:
Exception in thread "main" java.io.IOException: COSStream has been closed and cannot be read. Perhaps its enclosing PDDocument has been closed?
I've done research and it seems like a PDDocument is closing before I run the workingDocument.save(fileName) command. I'm not quite sure how to fix this, and I'm also a bit lost on how to find a workaround. I'm a bit rusty on my programming, so any help would be super appreciated! Also any feedback on how to make future posts more informative would be great.
Thanks in advance

Please try it
PDFMergerUtility merger = new PDFMergerUtility();
PDDocument combine = PDDocument.load(file);
merger.appendDocument(getDocument(), combine);
merger.mergeDocuments();
combine.close();
Update:
Since merger.mergeDocuments(); is deprecated in recent APIs, try to make use of the same method using following overloaded methods...
merger.mergeDocuments(MemoryUsageSetting.setupMainMemoryOnly());
or
merger.mergeDocuments(MemoryUsageSetting.setupTempFileOnly());
Depends on your memory usage, you can further fine tune this method by passing MemoryUsageSetting object.

How to read raw data, say only text, from a file(word document, excel) without format? [duplicate]

The strings I'm (programmatically) getting from MS Word files when using Apache POI are not the same text I can look at when I open the files with MS Word.
When using the following code:
File someFile = new File("some\\path\\MSWFile.doc");
InputStream inputStrm = new FileInputStream(someFile);
HWPFDocument wordDoc = new HWPFDocument(inputStrm);
System.out.println(wordDoc.getText());
the output is a single line with many 'invalid' characters (yes, the 'boxes'), and many unwanted strings, like "FORMTEXT", "HYPERLINK \l "_Toc##########"" ('#' being numeric digits), "PAGEREF _Toc########## \h 4", etc.
The following code "fixes" the single-line problem, but maintains all the invalid characters and unwanted text:
File someFile = new File("some\\path\\MSWFile.doc");
InputStream inputStrm = new FileInputStream(someFile);
WordExtractor wordExtractor = new WordExtractor(inputStrm);
for(String paragraph:wordExtractor.getParagraphText()){
System.out.println(paragraph);
}
I don't know if I'm using the wrong method for extracting the text, but that's what I've come up with when looking at POI's quick-guide. If I am, what is the correct approach?
If that output is correct, is there a standard way for getting rid of the unwanted text, or will I have to write a filter of my own?

This class can read both .doc and .docx files in Java. For this I'm using tika-app-1.2.jar:
/*
* This class is used to read .doc and .docx files
*
* #author Developer
*
*/
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.net.URL;
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
class TextExtractor {
private OutputStream outputstream;
private ParseContext context;
private Detector detector;
private Parser parser;
private Metadata metadata;
private String extractedText;
public TextExtractor() {
context = new ParseContext();
detector = new DefaultDetector();
parser = new AutoDetectParser(detector);
context.set(Parser.class, parser);
outputstream = new ByteArrayOutputStream();
metadata = new Metadata();
}
public void process(String filename) throws Exception {
URL url;
File file = new File(filename);
if (file.isFile()) {
url = file.toURI().toURL();
} else {
url = new URL(filename);
}
InputStream input = TikaInputStream.get(url, metadata);
ContentHandler handler = new BodyContentHandler(outputstream);
parser.parse(input, handler, metadata, context);
input.close();
}
public void getString() {
//Get the text into a String object
extractedText = outputstream.toString();
//Do whatever you want with this String object.
System.out.println(extractedText);
}
public static void main(String args[]) throws Exception {
if (args.length == 1) {
TextExtractor textExtractor = new TextExtractor();
textExtractor.process(args[0]);
textExtractor.getString();
} else {
throw new Exception();
}
}
}
To compile:
javac -cp ".:tika-app-1.2.jar" TextExtractor.java
To run:
java -cp ".:tika-app-1.2.jar" TextExtractor SomeWordDocument.doc

There are two options, one provided directly in Apache POI, the other via Apache Tika (which uses Apache POI internally).
The first option is to use WordExtractor, but wrap it in a call to stripFields(String) when calling it. That will remove the text based fields included in the text, things like HYPERLINK that you've seen. Your code would become:
NPOIFSFileSystem fs = new NPOIFSFileSytem(file);
WordExtractor extractor = new WordExtractor(fs.getRoot());
for(String rawText : extractor.getParagraphText()) {
String text = extractor.stripFields(rawText);
System.out.println(text);
}
The other option is to use Apache Tika. Tika provides text extraction, and metadata, for a wide variety of files, so the same code will work for .doc, .docx, .pdf and many others too. To get clean, plain text of your word document (you can also get XHTML if you'd rather), you'd do something like:
TikaConfig tika = TikaConfig.getDefaultConfig();
TikaInputStream stream = TikaInputStream.get(file);
ContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
tika.getParser().parse(input, handler, metadata, new ParseContext());
String text = handler.toString();

Try this, works for me and is purely a POI solution. You will have to look for the HWPFDocument counterpart though. Make sure the document you are reading predates Word 97, else use XWPFDocument like I do.
InputStream inputstream = new FileInputStream(m_filepath);
//read the file
XWPFDocument adoc= new XWPFDocument(inputstream);
//and place it in a xwpf format
aString = new XWPFWordExtractor(adoc).getText();
//gets the full text
Now if you want certain parts you can use the getparagraphtext but dont use the text extractor, use it directly on the paragraph like this
for (XWPFParagraph p : adoc.getParagraphs())
{
System.out.println(p.getParagraphText());
}

Can't read Excel 2010 file with Apache POI. First Row number is -1

I am trying the this testfile with the Apache POI API (current version 3-10-FINAL). The following test code
import java.io.FileInputStream;
import org.apache.poi.xssf.usermodel.XSSFSheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
public class ExcelTest {
public static void main(String[] args) throws Exception {
String filename = "testfile.xlsx";
XSSFWorkbook wb = new XSSFWorkbook(new FileInputStream(filename));
XSSFSheet sheet = wb.getSheetAt(0);
System.out.println(sheet.getFirstRowNum());
}
}
results in the first row number to be -1 (and existing rows come back as null). The test file was created by Excel 2010 (I have no control over that part) and can be read with Excel without warnings or problems. If I open and save the file with my version of Excel (2013) it can be read perfectly as expected.
Any hints into why I can't read the original file or how I can is highly appreciated.

The testfile.xlsx is created with "SpreadsheetGear 7.1.1.120". Open the XLSX file with a software which can deal with ZIP archives and look into /xl/workbook.xml to see that. In the worksheets/sheet?.xml files is to notice that all row elements are without row numbers. If I put a row number in the first row-tag like <row r="1"> then apache POI can read this row.
If it comes to the question, who is to blame for this, then the answer is definitely both Apache Poi and SpreadsheetGear ;-). Apache POI because the attribute r in the row element is optional. But SpreadsheetGear also because there is no reason not to use the r attribute if Excel itself does it ever.
If you cannot get the testfile.xlsx in a format which can Apache POI read directly, then you must work with the underlying objects. The following works with your testfile.xlsx:
import org.apache.poi.xssf.usermodel.*;
import org.apache.poi.ss.usermodel.*;
import org.apache.poi.ss.util.*;
import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.FileInputStream;
import java.io.InputStream;
import org.openxmlformats.schemas.spreadsheetml.x2006.main.CTWorksheet;
import org.openxmlformats.schemas.spreadsheetml.x2006.main.CTSheetData;
import org.openxmlformats.schemas.spreadsheetml.x2006.main.CTRow;
import java.util.List;
class Testfile {
public static void main(String[] args) {
try {
InputStream inp = new FileInputStream("testfile.xlsx");
Workbook wb = WorkbookFactory.create(inp);
Sheet sheet = wb.getSheetAt(0);
System.out.println(sheet.getFirstRowNum());
CTWorksheet ctWorksheet = ((XSSFSheet)sheet).getCTWorksheet();
CTSheetData ctSheetData = ctWorksheet.getSheetData();
List<CTRow> ctRowList = ctSheetData.getRowList();
Row row = null;
Cell[] cell = new Cell[2];
for (CTRow ctRow : ctRowList) {
row = new MyRow(ctRow, (XSSFSheet)sheet);
cell[0] = row.getCell(0);
cell[1] = row.getCell(1);
if (cell[0] != null && cell[1] != null && cell[0].toString() != "" && cell[1].toString() != "")
System.out.println(cell[0].toString()+"\t"+cell[1].toString());
}
} catch (InvalidFormatException ifex) {
} catch (FileNotFoundException fnfex) {
} catch (IOException ioex) {
}
}
}
class MyRow extends XSSFRow {
MyRow(org.openxmlformats.schemas.spreadsheetml.x2006.main.CTRow row, XSSFSheet sheet) {
super(row, sheet);
}
}
I have used:
org.openxmlformats.schemas.spreadsheetml.x2006.main.CTWorksheet
org.openxmlformats.schemas.spreadsheetml.x2006.main.CTSheetData
org.openxmlformats.schemas.spreadsheetml.x2006.main.CTRow
Which are part of the Apache POI Binary Distribution poi-bin-3.10.1-20140818 and there are within poi-ooxml-schemas-3.10.1-20140818.jar
For a documentation see http://grepcode.com/snapshot/repo1.maven.org/maven2/org.apache.poi/ooxml-schemas/1.1/
And I have extend XSSFRow, because we can't use the XSSFRow constructor directly since it has protected access.

How to edit docx using Java

I need replace cerain words or phrases in docx-file and save it with another name. I know that my problem is not unik and I tried find solution in the web. But I still can't get a result that I need.
I found two ways to solwe my task but came to the deadlock in each case.
1. Unfold docx like a zip-file, change xml with main content and pack into archive again. But after that manipulations I can't open new changed docx in MS Word. It is odd because I can do the similar steps by hands (without Java, using WinRar) and get a correct result file.
So can you explain me how to archive docx content to get a correct file using Java?
Using external API. I get an advice to use docx4j Java library. But all tat I can with it is just replace a label (like ${label}) in template with any words (I used VariableReplace sample). But I want change words that I want without using a template with labels.
I hope for a help.

I had this code. I hope that it helps you to resolve your problem. With it, you can read from a .docx find the word that you would change. Change this word and save the new paragraphs in new document.
//WriteDocx.java
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.*;
import java.io.*;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;
public class WriteDocx
{
public static void main(String[] args) throws Exception {
int count = 0;
XWPFDocument document = new XWPFDocument();
XWPFDocument docx = new XWPFDocument(new FileInputStream("Bonjour1.docx"));
XWPFWordExtractor we = new XWPFWordExtractor(docx);
String text = we.getText() ;
if(text.contains("SMS")){
text = text.replace("SMS", "sms");
System.out.println(text);
}
char[] c = text.toCharArray();
for(int i= 0; i < c.length;i++){
if(c[i] == '\n'){
count ++;
}
}
System.out.println(c[0]);
StringTokenizer st = new StringTokenizer(text,"\n");
XWPFParagraph para = document.createParagraph();
para.setAlignment(ParagraphAlignment.CENTER);
XWPFRun run = para.createRun();
run.setBold(true);
run.setFontSize(36);
run.setText("Apache POI works well!");
List<XWPFParagraph>paragraphs = new ArrayList<XWPFParagraph>();
List<XWPFRun>runs = new ArrayList<XWPFRun>();
int k = 0;
for(k=0;k<count+1;k++){
paragraphs.add(document.createParagraph());
}
k=0;
while(st.hasMoreElements()){
paragraphs.get(k).setAlignment(ParagraphAlignment.LEFT);
paragraphs.get(k).setSpacingAfter(0);
paragraphs.get(k).setSpacingBefore(0);
run = paragraphs.get(k).createRun();
run.setText(st.nextElement().toString());
k++;
}
document.write(new FileOutputStream("test2.docx"));
}
}
PS: XWPFDocument docx = new XWPFDocument(new FileInputStream("Bonjour1.docx"))
You must change "Bonjour1.docx" with the name of file from where you would replace certain words or phrases.
I use APACHE POI library
And I take some code from this site HANDLING MS WORD DOCUMENTS USING APACHE POI
UPDATE

If you want to change arbitrary words, you can do that easily enough with docx4j.
But first you need to find them.
You can find your words using an XPath query, or by traversing the document tree in Java.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

How to read doc file using Poi? - java

Related

how to judge if the file is doc or docx in POI

PDFBox IO Exception: COSStream has been closed and cannot be read

How to read raw data, say only text, from a file(word document, excel) without format? [duplicate]

Can't read Excel 2010 file with Apache POI. First Row number is -1

How to edit docx using Java

Categories

Resources