Read .doc file content and write into a PDF file in Java

I'm writing Java code that uses Apache POI to read an MS Office .doc file and the iText API to create and write a PDF file. Reading the text and tables in the .doc file already works. Now I'm looking for a way to read the images embedded in the document. I have written the following code to read the images. Why is it not working?
public static void main(String[] args) {
POIFSFileSystem fs = null;
Document document = new Document();
WordExtractor extractor = null ;
try {
fs = new POIFSFileSystem(new FileInputStream("C:\\DATASTORE\\tableandImage.doc"));
HWPFDocument hdocument=new HWPFDocument(fs);
extractor = new WordExtractor(hdocument);
OutputStream fileOutput = new FileOutputStream(new File("C:/DATASTORE/tableandImage.pdf"));
PdfWriter.getInstance(document, fileOutput);
document.open();
Range range=hdocument.getRange();
String readText=null;
PdfPTable createTable;
CharacterRun run;
PicturesTable picture;
for(int i=0;i<range.numParagraphs();i++) {
Paragraph par = range.getParagraph(i);
readText=par.text();
if(!par.isInTable()) {
if(readText.endsWith("\n")) {
readText=readText+"\n";
document.add(new com.itextpdf.text.Paragraph(readText));
} if(readText.endsWith("\r")) {
readText += "\n";
document.add(new com.itextpdf.text.Paragraph(readText));
}
run =range.getCharacterRun(i);
picture=hdocument.getPicturesTable();
if(picture.hasPicture(run)) {
//if(run.isSpecialCharacter()) {
Picture pic=picture.extractPicture(run, true);
byte[] picturearray=pic.getContent();
com.itextpdf.text.Image image=com.itextpdf.text.Image.getInstance(picturearray);
document.add(image);
}
} else if (par.isInTable()) {
Table table = range.getTable(par);
TableRow tRow1= table.getRow(0);
int numColumns=tRow1.numCells();
createTable=new PdfPTable(numColumns);
for (int rowId=0;rowId<table.numRows();rowId++) {
TableRow tRow = table.getRow(rowId);
for (int cellId=0;cellId<tRow.numCells();cellId++) {
TableCell tCell = tRow.getCell(cellId);
PdfPCell c1 = new PdfPCell(new Phrase(tCell.text()));
createTable.addCell(c1);
}
}
document.add(createTable);
}
}
}catch(IOException e) {
System.out.println("IO Exception");
e.printStackTrace();
}
catch(Exception exep) {
exep.printStackTrace();
}finally {
document.close();
}
}
The problems are:
1. The condition if(picture.hasPicture(run)) is never satisfied, although the document contains a JPEG image.
2. I'm getting the following exception while reading the table:
java.lang.IllegalArgumentException: This paragraph is not the first one in the table
at org.apache.poi.hwpf.usermodel.Range.getTable(Range.java:876)
at pagecode.ReadDocxOrDocFile.main(ReadDocxOrDocFile.java:113)
Can anybody help me solve the problem?
Thank you.

Regarding your exception:
Your code iterates over all paragraphs and calls isInTable() for each one of them. Since tables are commonly composed of several such paragraphs, your call to getTable() also gets executed several times for a single table, and it throws the IllegalArgumentException above for every paragraph that is not the first one of its table.
However, what your code should do instead is to find the first paragraph of a table, then process all paragraphs therein (via getRow(m).getCell(n)) and ultimately continue with the outer loop in the first paragraph after the table. Codewise this may look roughly like the following (assuming no merged cells, no nested tables and no other funny edge cases):
if (par.isInTable()) {
    Table table = range.getTable(par);
    for (int rn = 0; rn < table.numRows(); rn++) {
        TableRow row = table.getRow(rn);
        for (int cn = 0; cn < row.numCells(); cn++) {
            TableCell cell = row.getCell(cn);
            for (int pn = 0; pn < cell.numParagraphs(); pn++) {
                Paragraph cellParagraph = cell.getParagraph(pn);
                // your PDF conversion code goes here
            }
        }
    }
    i += table.numParagraphs() - 1; // skip the already processed (table-)paragraphs in the outer loop
}
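The skip-ahead step can be tried out without POI. The following is a library-free model, not POI code: the list of booleans stands in for par.isInTable(), a run of consecutive table paragraphs is treated as one table (the real code would use table.numParagraphs() instead), and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TableSkipSketch {

    /**
     * Walks the paragraph flags once and returns one label per visited block:
     * "P<i>" for a normal paragraph, "T<first>-<last>" for a whole table.
     */
    static List<String> visit(List<Boolean> inTable) {
        List<String> visited = new ArrayList<>();
        for (int i = 0; i < inTable.size(); i++) {
            if (!inTable.get(i)) {
                visited.add("P" + i);
                continue;
            }
            // i is the first paragraph of a table: count the paragraphs that belong to it
            int count = 0;
            while (i + count < inTable.size() && inTable.get(i + count)) {
                count++;
            }
            visited.add("T" + i + "-" + (i + count - 1));
            i += count - 1; // same skip as "i += table.numParagraphs() - 1"
        }
        return visited;
    }

    public static void main(String[] args) {
        // one normal paragraph, a three-paragraph table, one normal paragraph
        List<Boolean> flags = Arrays.asList(false, true, true, true, false);
        System.out.println(visit(flags)); // prints [P0, T1-3, P4]
    }
}
```

Each table is visited exactly once, at its first paragraph, which is the only position where getTable() is allowed to be called.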
Regarding the pictures issue:
Am I guessing right that you are trying to obtain the picture which is anchored within a given paragraph? Unfortunately, the predefined methods of POI only work if the picture is not embedded within a field (which is rather rare, actually). For field-based images (i.e. preview images of embedded OLEs) you should do something like the following (untested!):
PictureStore pictureStore = new PictureStore(hdocument);
// bla bla ...
for (int cr=0; cr < par.numCharacterRuns(); cr++) {
CharacterRun characterRun = par.getCharacterRun(cr);
Field field = hdocument.getFields().getFieldByStartOffset(FieldsDocumentPart.MAIN, characterRun.getStartOffset());
if (field != null && field.getType() == 0x3A) { // 0x3A is type "EMBED"
Picture pic = pictureStore.getPicture(field.secondSubrange(characterRun));
}
}
For a list of possible values of Field.getType() see here.

Related

How to delete first character after table using POI

I am attempting to format a Word document that has multiple tables. I need to delete the line breaks that occur after each table. How do I achieve this programmatically in Java?
I am currently trying the following code, and it does not work:
org.apache.xmlbeans.XmlCursor cursor = xwpfTable.getCTTbl().newCursor();
cursor.toEndToken();
cursor.toNextToken();
cursor.removeChars(2);
Further clarification: We receive unformatted Word files from an external source. We need to eliminate the paragraphs (extra lines between tables) when the table has only one row. Currently we use a macro and achieve this with the following code:
For Each t In doc.Tables
Set myrange = doc.Characters(t.Range.End + 1)
If myrange.Text = Chr(13) Then
myrange.Delete
End If
Thanks in advance
What I am trying to remove:
According to your screenshot, you want to remove the empty paragraphs that are placed immediately after tables.
This is possible, although I am wondering why those paragraphs are there. After removing them, the tables are no longer editable in Word as separate tables, only as rows within one table. Is this what you want?
Anyway, as said, removing the empty paragraphs after the tables is possible. To do so, you can traverse the body elements of the document. If an XWPFTable is immediately followed by an XWPFParagraph and this XWPFParagraph does not have any text runs in it, remove that XWPFParagraph from the document.
Example:
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.poi.xwpf.usermodel.*;
public class WordRemoveEmptyParagraphs {
public static void main(String[] args) throws Exception {
XWPFDocument document = new XWPFDocument(new FileInputStream("./WordTables.docx"));
int thisBodyElementPos = 0;
int nextBodyElementPos = 1;
IBodyElement thisBodyElement = null;
IBodyElement nextBodyElement = null;
if (document.getBodyElements().size() > 1) { // document must have at least two body elements
do {
thisBodyElement = document.getBodyElements().get(thisBodyElementPos);
nextBodyElement = document.getBodyElements().get(nextBodyElementPos);
if (thisBodyElement instanceof XWPFTable && nextBodyElement instanceof XWPFParagraph) {
XWPFParagraph paragraph = (XWPFParagraph)nextBodyElement;
if (paragraph.getRuns().size() == 0) { // if paragraph does not have any text runs in it
document.removeBodyElement(nextBodyElementPos);
}
}
thisBodyElementPos++;
nextBodyElementPos = thisBodyElementPos + 1;
} while (nextBodyElementPos < document.getBodyElements().size());
}
FileOutputStream out = new FileOutputStream("./WordTablesChanged.docx");
document.write(out);
out.close();
document.close();
}
}
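The index handling of that loop can be checked in isolation. The sketch below is a plain-Java model, not POI code: strings stand in for body elements, the constant TABLE stands in for an XWPFTable, and an empty string stands in for a paragraph without runs. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RemoveAfterTableSketch {

    static final String TABLE = "<table>";

    /** Removes each empty entry that directly follows a TABLE entry. */
    static void removeEmptyAfterTables(List<String> body) {
        int i = 0;
        // the size is re-read on every pass because the list shrinks on removal
        while (i + 1 < body.size()) {
            if (TABLE.equals(body.get(i)) && body.get(i + 1).isEmpty()) {
                body.remove(i + 1);
            }
            i++;
        }
    }

    public static void main(String[] args) {
        List<String> body = new ArrayList<>(Arrays.asList(TABLE, "", TABLE, "", "text"));
        removeEmptyAfterTables(body);
        System.out.println(body); // prints [<table>, <table>, text]
    }
}
```

As in the example above, only the one empty entry directly after a table is removed; a second empty entry in a row would stay.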

iText Fill Form / Copy Page to new Document

I'm using iText to fill a template PDF which contains an AcroForm.
Now I want to use this template to create a new PDF with a dynamic number of pages.
My idea is to fill the template PDF, copy the page with the filled-in fields and add it to a new file. The main problem is that our customer wants to design the template themselves, so I'm not sure whether this is the right way to solve the problem.
I've written the code below, but it doesn't work yet: I get the error com.itextpdf.io.IOException: PDF header not found.
My code:
x = 1;
try (PdfDocument finalDoc = new PdfDocument(new PdfWriter("C:\\Users\\...Final.pdf"))) {
for (HashMap<String, String> map : testValues) {
String path1 = "C:\\Users\\.....Temp.pdf";
InputStream template = templateValues.get("Template");
PdfWriter writer = new PdfWriter(path1);
try (PdfDocument pdfDoc = new PdfDocument(new PdfReader(template), writer)) {
PdfAcroForm form = PdfAcroForm.getAcroForm(pdfDoc, true);
for (HashMap.Entry<String, String> map2 : map.entrySet()) {
if (form.getField(map2.getKey()) != null) {
Map<String, PdfFormField> fields = form.getFormFields();
fields.get(map2.getKey()).setValue(map2.getValue());
}
}
} catch (IOException | PdfException ex) {
System.err.println("Ex2: " + ex.getMessage());
}
if (x != 0 && (x % 5) == 0) {
try (PdfDocument tempDoc = new PdfDocument(new PdfReader(path1))) {
PdfPage page = tempDoc.getFirstPage();
finalDoc.addPage(page.copyTo(finalDoc));
} catch (IOException | PdfException ex) {
System.err.println("Ex3: " + ex.getMessage());
}
}
x++;
}
} catch (IOException | PdfException ex) {
System.err.println("Ex: " + ex.getMessage());
}
Part 1 - PDF Header is Missing
This appears to be caused by attempting to re-read, within a loop, an InputStream that has already been read (and, depending on the configuration of the PdfReader, closed). The fix depends on the specific type of InputStream being used. If you want to keep it as a plain InputStream (rather than a more specific and more capable InputStream type), you'll need to first read all bytes from the stream into memory (e.g. into a ByteArrayOutputStream) and then create your PdfReaders from those bytes.
i.e.
ByteArrayOutputStream templateBuffer = new ByteArrayOutputStream();
int c;
while ((c = template.read()) != -1) templateBuffer.write(c); // read() returns -1 at end of stream; 0 is a valid byte
for (/* your loop */) {
...
PdfDocument filledInAcroFormTemplate = new PdfDocument(new PdfReader(new ByteArrayInputStream(templateBuffer.toByteArray())), new PdfWriter(tmp));
...
}
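The effect of buffering can be shown with the standard library alone. This is a minimal sketch under assumptions: the class name and the readFully helper are made up, and a short ASCII string stands in for the real template bytes. It shows that the original stream is exhausted after one pass, while every stream created from the buffered bytes starts again at the first byte.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class TemplateBufferSketch {

    /** Reads the stream to its end and returns everything as a byte array. */
    static byte[] readFully(InputStream in) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        try {
            int n;
            while ((n = in.read(chunk)) != -1) { // -1 signals end of stream
                buffer.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // Stand-in for the template stream; a real PDF also starts with "%PDF-"
        ByteArrayInputStream template =
                new ByteArrayInputStream("%PDF-1.7 demo".getBytes(StandardCharsets.US_ASCII));

        byte[] templateBytes = readFully(template);
        // The original stream is now at its end, so a second reader finds no header
        System.out.println("second read of original stream: " + template.read());

        for (int i = 0; i < 3; i++) {
            // A fresh stream per iteration always starts at the first byte
            ByteArrayInputStream fresh = new ByteArrayInputStream(templateBytes);
            System.out.println("iteration " + i + " starts with: " + (char) fresh.read());
        }
    }
}
```

Reading in chunks rather than byte by byte is only a performance choice; the important part is comparing against -1 and creating a new ByteArrayInputStream for every PdfReader.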
Part 2 - other problems
A couple of things:
Make sure to grab the recently released 7.0.1 version of iText, since it includes a couple of fixes to AcroForm handling.
You can probably get away with using ByteArrayOutputStreams for your temporary PDFs (instead of writing them out to files). I'll use this approach in the example below.
PdfDocument/PdfPage is in the "kernel" module, yet AcroForms are in the "form" module (meaning PdfPage is intentionally unaware of AcroForms). IPdfPageExtraCopier is the bridge between the modules. To copy AcroForms properly, you need to use the two-argument version of copyTo(), passing an instance of PdfPageFormCopier.
Field names must be unique in the document (the "absolute" field name, that is; I'll skip field hierarchies for now). Since we're looping through and adding the fields from the template multiple times, we need a strategy for renaming the fields to ensure uniqueness (the current API is actually a little bit clunky in this area).
File acroFormTemplate = new File("someTemplate.pdf");
Map<String, String> someMapOfFieldToValues = new HashMap<>();
try (
PdfDocument finalOutput = new PdfDocument(new PdfWriter(new FileOutputStream(new File("finalOutput.pdf"))));
) {
for (/* some looping condition */int x = 0; x < 5; x++) {
// for each iteration of the loop, create a temporary in-memory
// PDF to handle form field edits.
ByteArrayOutputStream tmp = new ByteArrayOutputStream();
try (
PdfDocument filledInAcroFormTemplate = new PdfDocument(new PdfReader(new FileInputStream(acroFormTemplate)), new PdfWriter(tmp));
) {
PdfAcroForm acroForm = PdfAcroForm.getAcroForm(filledInAcroFormTemplate, true);
for (PdfFormField field : acroForm.getFormFields().values()) {
if (someMapOfFieldToValues.containsKey(field.getFieldName())) {
field.setValue(someMapOfFieldToValues.get(field.getFieldName()));
}
}
// NOTE that because we're adding the template multiple times
// we need to adopt a field renaming strategy to ensure field
// uniqueness in the final document. For demonstration's sake
// we'll just rename them prefixed w/ our loop counter
List<String> fieldNames = new ArrayList<>();
fieldNames.addAll(acroForm.getFormFields().keySet()); // copy the names first to avoid a ConcurrentModificationException
for (String fieldName : fieldNames) {
acroForm.renameField(fieldName, x+"_"+fieldName);
}
}
// the temp PDF needs to be "closed" for all the PDF finalization
// magic to happen...so open up new read-only version to act as
// the source for the merging from our in-memory bucket-o-bytes
try (
PdfDocument readOnlyFilledInAcroFormTemplate = new PdfDocument(new PdfReader(new ByteArrayInputStream(tmp.toByteArray())));
) {
// although PdfPage.copyTo will probably work for simple pages, PdfDocument.copyPagesTo
// is a more comprehensive copy (wider support for copying Outlines and Tagged content)
// so it's more suitable for general page-copy use. Also, since we're copying AcroForm
// content, we need to use the PdfPageFormCopier
readOnlyFilledInAcroFormTemplate.copyPagesTo(1, 1, finalOutput, new PdfPageFormCopier());
}
}
}
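The renaming strategy used in the example can be modelled without iText. In this sketch a plain Map stands in for the form fields, and prefixAll is a hypothetical helper, not an iText call. It illustrates the two points made in the comments: snapshot the names before renaming, and prefix them with the loop counter so that several copies of the same template do not clash.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FieldRenameSketch {

    /** Renames every key to "<copyIndex>_<name>", modifying the map in place. */
    static void prefixAll(Map<String, String> fields, int copyIndex) {
        // Snapshot the names first: changing the map while iterating over its
        // own keySet() can throw ConcurrentModificationException
        List<String> names = new ArrayList<>(fields.keySet());
        for (String name : names) {
            String value = fields.remove(name);
            fields.put(copyIndex + "_" + name, value);
        }
    }

    public static void main(String[] args) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (int x = 0; x < 3; x++) {
            // every iteration starts from the same template field names
            Map<String, String> copy = new LinkedHashMap<>();
            copy.put("name", "value " + x);
            copy.put("city", "value " + x);
            prefixAll(copy, x);
            merged.putAll(copy); // no clashes: every key carries its copy index
        }
        System.out.println(merged.keySet());
    }
}
```

Any scheme that yields unique names would do; the loop counter is simply the shortest one to demonstrate.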
Close your PdfDocuments when you are done with adding content to them.

Apache POI Table of contents not updating

I am using Apache POI XWPF components and java, to extract data from a .xml file into a word document. So far so good, but I am struggling to create a table of contents.
I have to create a table of contents at the start of the method and then I update it at the end to get all the new headers. Currently I use doc.createTOC(), where doc is a variable created from XWPFDocument, to create the table at the start and then I use doc.enforceUpdateFields() to update everything at the end of the document. But when I open the document after I ran the program, the table of contents is empty, but the navigation panel does include some of the headers I specified.
A comment recommended that I include some code. I start by creating the document from a template:
XWPFDocument doc = new XWPFDocument(new FileInputStream("D://Template.docx"));
I then create a table of contents:
doc.createTOC();
Then throughout the method I add headers to the document:
XWPFParagraph documentControlHeading = doc.createParagraph();
documentControlHeading.setPageBreak(true);
documentControlHeading.setAlignment(ParagraphAlignment.LEFT);
documentControlHeading.setStyle("Tier1Header");
After all the headers are added, I want to update the document so that all the new headers appear in the table of contents. I do this by using the following command:
doc.enforceUpdateFields();
Hmmm... I am looking at the createTOC() method code, and it appears that it looks for styles that look like Heading #. So Tier1Header would not be found. Try creating your text first, and use styles like Heading 1 for your headings. Then add the TOC using createTOC(). It should find all the headings when the TOC is created. I do not know if enforceUpdateFields() affects the TOC.
//Your docx template should contain the following or something similar text //which will be searched for and replaced with a WORD TOC.
//${TOC}
public static void main(String[] args) throws IOException, OpenXML4JException {
XWPFDocument docTemplate = null;
try {
File file = new File(PATH_TO_FILE); //"C:\\Reports\\Template.docx";
FileInputStream fis = new FileInputStream(file);
docTemplate = new XWPFDocument(fis);
generateTOC(docTemplate);
saveDocument(docTemplate);
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
} finally {
if (docTemplate != null) {
docTemplate.close();
}
}
}
private static void saveDocument(XWPFDocument docTemplate) throws FileNotFoundException, IOException {
FileOutputStream outputFile = null;
try {
outputFile = new FileOutputStream(OUTFILENAME);
docTemplate.write(outputFile);
} finally {
if (outputFile != null) {
outputFile.close();
}
}
}
public static void generateTOC(XWPFDocument document) throws InvalidFormatException, FileNotFoundException, IOException {
String findText = "${TOC}";
String replaceText = "";
for (XWPFParagraph p : document.getParagraphs()) {
for (XWPFRun r : p.getRuns()) {
int pos = r.getTextPosition();
String text = r.getText(pos);
if (text != null && text.contains(findText)) {
text = text.replace(findText, replaceText);
r.setText(text, 0);
addField(p, "TOC \\o \"1-3\" \\h \\z \\u");
break;
}
}
}
}
private static void addField(XWPFParagraph paragraph, String fieldName) {
CTSimpleField ctSimpleField = paragraph.getCTP().addNewFldSimple();
// ctSimpleField.setInstr(fieldName + " \\* MERGEFORMAT ");
ctSimpleField.setInstr(fieldName);
ctSimpleField.addNewR().addNewT().setStringValue("<<fieldName>>");
}
This is the code of createTOC(), obtained by inspecting XWPFDocument.class:
public void createTOC() {
CTSdtBlock block = getDocument().getBody().addNewSdt();
TOC toc = new TOC(block);
for (XWPFParagraph par : this.paragraphs) {
String parStyle = par.getStyle();
if ((parStyle != null) && (parStyle.startsWith("Heading"))) try {
int level = Integer.valueOf(parStyle.substring("Heading".length())).intValue();
toc.addRow(level, par.getText(), 1, "112723803");
} catch (NumberFormatException e) {
e.printStackTrace();
}
}
}
As you can see, it adds to the TOC all paragraphs whose style is named "HeadingX", with X being a number. Unfortunately, that's not sufficient: the method is buggy and incomplete in its implementation.
The page number passed to addRow() is always 1; it is not calculated at all.
So, in the end, you will have a TOC with all your paragraphs and the trailing dots giving the proper indentation, but the page numbers will always be "1".
EDIT
...but, there's a solution here.
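To see which style names pass the check in the createTOC() listing above, the parsing can be replayed with plain Java. This sketch only mirrors the startsWith("Heading") test and the number parsing from that listing; the class name, the method name and the -1 return value are assumptions made for the demonstration.

```java
public class HeadingLevelSketch {

    /**
     * Returns the heading level encoded in a style name, or -1 when the
     * style would not produce a TOC row under the check shown above.
     */
    static int headingLevel(String parStyle) {
        if (parStyle == null || !parStyle.startsWith("Heading")) {
            return -1;
        }
        try {
            return Integer.parseInt(parStyle.substring("Heading".length()));
        } catch (NumberFormatException e) {
            return -1; // e.g. "Heading" without digits, or "Heading 1" with a space
        }
    }

    public static void main(String[] args) {
        String[] styles = {"Heading1", "Heading2", "Tier1Header", "Heading 1"};
        for (String style : styles) {
            System.out.println(style + " -> " + headingLevel(style));
        }
    }
}
```

A custom style such as Tier1Header is skipped by the prefix test, and a name with a space before the digit fails the number parsing, so the value returned by getStyle() has to have exactly the form Heading followed by digits.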

Remove rectangles from PDF file

I'd like to have a program that removes all rectangles from a PDF file. One use case for this is to unblacken a given PDF file to see if there is any hidden information behind the rectangles. The rest of the PDF file should be kept as-is.
Which PDF library is suitable to this task? In Java, I would like the code to look like this:
PdfDocument doc = PdfDocument.load(new File("original.pdf"));
PdfDocument unblackened = doc.transform(new CopyingPdfVisitor() {
public void visitRectangle(PdfRect rect) {
if (rect.getFillColor().getBrightness() >= 0.1) {
super.visitRectangle(rect);
}
}
});
unblackened.save(new File("unblackened.pdf"));
The CopyingPdfVisitor would copy a PDF document exactly as-is, and my custom code would leave out all the dark rectangles.
The iText PDF library has ways to modify PDF content.
The iText content parser example may give you an idea. The "qname" (qualified name) parameter may be used to detect rectangle elements.
http://itextpdf.com/book/chapter.php?id=15
Another option, if you want to obtain the text of the document, is to use PdfReaderContentParser to extract the text content:
public void parsePdf(String pdf, String txt) throws IOException {
PdfReader reader = new PdfReader(pdf);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
PrintWriter out = new PrintWriter(new FileOutputStream(txt));
TextExtractionStrategy strategy;
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
out.println(strategy.getResultantText());
}
out.flush();
out.close();
reader.close();
}
example at http://itextpdf.com/examples/iia.php?id=277

Apache POI HWPF - problem in convert doc file to pdf

I am currently working on a Java project that uses Apache POI.
In this project I want to convert a .doc file to a PDF file. The conversion succeeds, but the PDF contains only plain text, without any text styles or colours.
The PDF looks black and white, while my .doc file is coloured and uses different text styles.
This is my code:
POIFSFileSystem fs = null;
Document document = new Document();
try {
System.out.println("Starting the test");
fs = new POIFSFileSystem(new FileInputStream("/document/test2.doc"));
HWPFDocument doc = new HWPFDocument(fs);
WordExtractor we = new WordExtractor(doc);
OutputStream file = new FileOutputStream(new File("/document/test.pdf"));
PdfWriter writer = PdfWriter.getInstance(document, file);
Range range = doc.getRange();
document.open();
writer.setPageEmpty(true);
document.newPage();
writer.setPageEmpty(true);
String[] paragraphs = we.getParagraphText();
for (int i = 0; i < paragraphs.length; i++) {
org.apache.poi.hwpf.usermodel.Paragraph pr = range.getParagraph(i);
// CharacterRun run = pr.getCharacterRun(i);
// run.setBold(true);
// run.setCapitalized(true);
// run.setItalic(true);
paragraphs[i] = paragraphs[i].replaceAll("\\cM?\r?\n", "");
System.out.println("Length:" + paragraphs[i].length());
System.out.println("Paragraph" + i + ": " + paragraphs[i].toString());
// add the paragraph to the document
document.add(new Paragraph(paragraphs[i]));
}
System.out.println("Document testing completed");
} catch (Exception e) {
System.out.println("Exception during test");
e.printStackTrace();
} finally {
// close the document
document.close();
}
}
Please help me.
Thanks in advance.
If you look at Apache Tika, there's a good example of reading some style information from a HWPF document. The code in Tika generates HTML based on the HWPF contents, but you should find that something very similar works for your case.
The Tika class is
https://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
One thing to note about word documents is that everything in any one Character Run has the same formatting applied to it. A Paragraph is therefore made up of one or more Character Runs. Some styling is applied to a Paragraph, and other parts are done on the runs. Depending on what formatting interests you, it may therefore be on the paragraph or the run.
If you use WordExtractor, you will get text only. Try using the CharacterRun class instead; it gives you the style along with the text. Please refer to the following sample code.
Range range = doc.getRange();
for (int i = 0; i < range.numParagraphs(); i++) {
    org.apache.poi.hwpf.usermodel.Paragraph poiPara = range.getParagraph(i);
    // each character run holds text that shares one set of formatting
    for (int j = 0; j < poiPara.numCharacterRuns(); j++) {
        CharacterRun run = poiPara.getCharacterRun(j);
        System.out.println("Color " + run.getColor());
        System.out.println("Font size " + run.getFontSize());
        System.out.println("Font Name " + run.getFontName());
        System.out.println(run.isBold() + " " + run.isItalic() + " " + run.getUnderlineCode());
        System.out.println("Text is " + run.text());
    }
}
