Apache POI HWPF - problem in convert doc file to pdf

I am currently working on a Java project that uses Apache POI.
In this project I want to convert a .doc file to a PDF file. The conversion succeeds, but the PDF contains only the text, without any of the text styles or colours.
The PDF looks black and white, while the .doc file is coloured and uses several text styles.
This is my code:
POIFSFileSystem fs = null;
Document document = new Document();
try {
    System.out.println("Starting the test");
    fs = new POIFSFileSystem(new FileInputStream("/document/test2.doc"));
    HWPFDocument doc = new HWPFDocument(fs);
    WordExtractor we = new WordExtractor(doc);
    OutputStream file = new FileOutputStream(new File("/document/test.pdf"));
    PdfWriter writer = PdfWriter.getInstance(document, file);
    Range range = doc.getRange();
    document.open();
    writer.setPageEmpty(true);
    document.newPage();
    writer.setPageEmpty(true);
    String[] paragraphs = we.getParagraphText();
    for (int i = 0; i < paragraphs.length; i++) {
        org.apache.poi.hwpf.usermodel.Paragraph pr = range.getParagraph(i);
        // CharacterRun run = pr.getCharacterRun(i);
        // run.setBold(true);
        // run.setCapitalized(true);
        // run.setItalic(true);
        paragraphs[i] = paragraphs[i].replaceAll("\\cM?\r?\n", "");
        System.out.println("Length:" + paragraphs[i].length());
        System.out.println("Paragraph" + i + ": " + paragraphs[i].toString());
        // add the paragraph to the document
        document.add(new Paragraph(paragraphs[i]));
    }
    System.out.println("Document testing completed");
} catch (Exception e) {
    System.out.println("Exception during test");
    e.printStackTrace();
} finally {
    // close the document
    document.close();
}
Please help me.
Thanks in advance.

If you look at Apache Tika, there's a good example of reading some style information from a HWPF document. The code in Tika generates HTML based on the HWPF contents, but you should find that something very similar works for your case.
The Tika class is
https://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
One thing to note about Word documents is that everything in any one Character Run has the same formatting applied to it. A Paragraph is therefore made up of one or more Character Runs. Some styling is applied to the Paragraph as a whole, and other styling is applied to the individual runs. Depending on which formatting interests you, it may therefore be on the paragraph or on the run.

If you use WordExtractor, you will get text only. Try the CharacterRun class instead, which gives you the style along with the text. See the following sample code:
Range range = doc.getRange();
for (int i = 0; i < range.numParagraphs(); i++) {
    org.apache.poi.hwpf.usermodel.Paragraph poiPara = range.getParagraph(i);
    // Walk every character run in the paragraph; each run has uniform formatting
    for (int j = 0; j < poiPara.numCharacterRuns(); j++) {
        CharacterRun run = poiPara.getCharacterRun(j);
        System.out.println("Color " + run.getColor());
        System.out.println("Font size " + run.getFontSize());
        System.out.println("Font Name " + run.getFontName());
        System.out.println(run.isBold() + " " + run.isItalic() + " " + run.getUnderlineCode());
        System.out.println("Text is " + run.text());
    }
}
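Two of these values need converting before they can be handed to a PDF library. As far as I know, HWPF's getFontSize() reports half-points, and getColor() returns an index into Word's fixed 16-colour palette rather than an RGB value. The sketch below is plain Java with no POI dependency. The names WordRunStyle, halfPointsToPoints and icoToRgb are made up for illustration, and the palette is transcribed from my reading of the Word binary format's colour index table, so check it against your own documents.

```java
public class WordRunStyle {
    // Palette for the Word 97-2003 colour index ("ico"), stored as 0xRRGGBB.
    // Index 0 means "automatic"; mapping it to black is an assumption.
    private static final int[] ICO_RGB = {
        0x000000, // 0  automatic (assumed black)
        0x000000, // 1  black
        0x0000FF, // 2  blue
        0x00FFFF, // 3  cyan
        0x00FF00, // 4  green
        0xFF00FF, // 5  magenta
        0xFF0000, // 6  red
        0xFFFF00, // 7  yellow
        0xFFFFFF, // 8  white
        0x000080, // 9  dark blue
        0x008080, // 10 dark cyan
        0x008000, // 11 dark green
        0x800080, // 12 dark magenta
        0x800000, // 13 dark red
        0x808000, // 14 dark yellow
        0x808080, // 15 dark gray
        0xC0C0C0  // 16 light gray
    };

    /** Converts a size given in half-points to typographic points. */
    public static float halfPointsToPoints(int halfPoints) {
        return halfPoints / 2.0f;
    }

    /** Looks up the RGB value for a colour index; unknown indexes fall back to black. */
    public static int icoToRgb(int ico) {
        if (ico < 0 || ico >= ICO_RGB.length) {
            return 0x000000;
        }
        return ICO_RGB[ico];
    }

    public static void main(String[] args) {
        System.out.println(halfPointsToPoints(24));
        System.out.printf("%06X%n", icoToRgb(6));
    }
}
```

The RGB value can then be split into red, green and blue components for whatever colour class the PDF library expects, and the point size passed to its font constructor.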

Related

making more than one pdf document in java

This is my code:
try {
    dozen = magazijn.getFfd().vraagDozenOp();
    for (int i = 0; i < dozen.size(); i++) {
        PdfWriter.getInstance(doc, new FileOutputStream("Order" + x + ".pdf"));
        System.out.println("Writer instance created");
        doc.open();
        System.out.println("doc open");
        Paragraph ordernummer = new Paragraph(order.getOrdernummer());
        doc.add(ordernummer);
        doc.add(Chunk.NEWLINE);
        for (String t : text) {
            Paragraph klant = new Paragraph(t);
            doc.add(klant);
        }
        doc.add(Chunk.NEWLINE);
        Paragraph datum = new Paragraph(order.getDatum());
        doc.add(datum);
        doc.add(Chunk.NEWLINE);
        artikelen = magazijn.getFfd().vraagArtikelenOp(i);
        for (Artikel a : artikelen) {
            artikelnr.add(a.getArtikelNaam());
        }
        for (String nr : artikelnr) {
            Paragraph Artikelnr = new Paragraph(nr);
            doc.add(Artikelnr);
        }
        doc.close();
        artikelnr.clear();
        x++;
        System.out.println("doc closed");
    }
} catch (Exception e) {
    System.out.println(e);
}
I get this exception: com.itextpdf.text.DocumentException: The document has been closed. You can't add any Elements.
Can someone help me fix this so that the other PDFs can be created and the paragraphs added?

Alright, your intent is not very clear from your code and question, so I'm going to operate under the following assumptions:
1. You are creating a report for each box you're processing
2. Each report needs to be a separate PDF file
You're getting a DocumentException on the second iteration of the loop because you're trying to add content to a Document that was closed in the previous iteration via doc.close();. doc.close() finalizes the Document and writes everything still pending to any linked PdfWriter.
If you wish to create a separate PDF for each box, you need to create a separate Document inside your loop as well, since creating a new PdfWriter via PdfWriter.getInstance(doc, new FileOutputStream("Order" + x + ".pdf")); will not create a new Document on its own.
If I'm wrong with assumption 2 and you wish to add everything to a single PDF, move doc.close(); outside of the loop and create only a single PdfWriter.
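The lifecycle problem can be reproduced without iText. In the sketch below, OneShotDocument is a hypothetical stand-in (not an iText class) that refuses additions after close(), in the way the exception message describes; buildAll then shows the fix of creating a fresh instance for each output file. With iText itself, the equivalent change would be to construct a new Document, and a new PdfWriter bound to it, at the top of each loop iteration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FreshDocumentPerFile {
    /** Hypothetical stand-in for a write-once document; not part of iText. */
    public static class OneShotDocument {
        private final List<String> elements = new ArrayList<>();
        private boolean closed = false;

        public void add(String element) {
            if (closed) {
                throw new IllegalStateException(
                        "The document has been closed. You can't add any Elements.");
            }
            elements.add(element);
        }

        public void close() {
            closed = true;
        }

        public int size() {
            return elements.size();
        }
    }

    /** Builds one document per order number, each from a fresh instance. */
    public static List<OneShotDocument> buildAll(List<String> orderNumbers) {
        List<OneShotDocument> result = new ArrayList<>();
        for (String orderNumber : orderNumbers) {
            OneShotDocument doc = new OneShotDocument(); // new instance inside the loop
            doc.add(orderNumber);
            doc.close();
            result.add(doc);
        }
        return result;
    }

    public static void main(String[] args) {
        // Reusing one instance fails on the second pass, as in the question.
        OneShotDocument shared = new OneShotDocument();
        shared.add("Order1");
        shared.close();
        try {
            shared.add("Order2");
        } catch (IllegalStateException e) {
            System.out.println("Reuse failed: " + e.getMessage());
        }
        // A fresh instance per iteration works.
        List<OneShotDocument> docs = buildAll(Arrays.asList("Order1", "Order2"));
        System.out.println(docs.size() + " documents built");
    }
}
```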

You can try something like this using Apache PDFBox:
File outputFile = new File(path);
outputFile.createNewFile();
PDDocument newDoc = new PDDocument();
Then create a PDPage and write whatever you want on that page. Once the page is ready, add it to newDoc, then save and close the document:
newDoc.save(outputFile);
newDoc.close();
Repeat this dozen.size() times, changing the file name in path for every new document.

iText PdfAConformanceException: PDF array is out of bounds

I'm using iText 5.5.5 with Java 5.
I'm trying to merge some PDF/A files, and I get a "PdfAConformanceException: PDF array is out of bounds".
While tracking down the error I found the "bad PDF" that causes it; when I try to copy just that file, the exception is thrown again. The error does not always appear, only when this PDF/A is in the "job chain"; I tried some other files and they all work fine. I can't share the source PDF because it's restricted.
This is my code:
_log.info("Start Document Merge");
// Output pdf
ByteArrayOutputStream bos = new ByteArrayOutputStream();
com.itextpdf.text.Document document = new com.itextpdf.text.Document();
PdfCopy copy = new PdfACopy(document, bos, PdfAConformanceLevel.PDF_A_1B);
PageStamp stamp = null;
PdfReader reader = null;
PdfContentByte content = null;
int outPdfPageCount = 0;
BaseFont baseFont = BaseFont.createFont("Arial", BaseFont.WINANSI, BaseFont.EMBEDDED);
copyOutputIntents(reader, copy);
// Loop over the pages in that document
try {
    int numberOfPages = reader.getNumberOfPages();
    for (int i = 1; i <= numberOfPages; i++) {
        PdfImportedPage pagecontent = copy.getImportedPage(reader, i);
        _log.debug("Handling page numbering [" + i + "]");
        stamp = copy.createPageStamp(pagecontent);
        content = stamp.getUnderContent();
        content.beginText();
        content.setFontAndSize(baseFont, Configuration.NumPagSize);
        content.showTextAligned(PdfContentByte.ALIGN_CENTER, String.format("%s %s ", Configuration.NumPagPrefix, i), Configuration.NumPagX, Configuration.NumPagY, 0);
        content.endText();
        stamp.alterContents();
        copy.addPage(pagecontent);
        outPdfPageCount++;
        if (outPdfPageCount > Configuration.MaxPages) {
            _log.error("Pdf Page Count > MaxPages");
            throw new PackageException(Constants.ERROR_104_TEXT, Constants.ERROR_104);
        }
    }
    copy.freeReader(reader);
    reader.close();
    copy.createXmpMetadata();
    document.close();
} catch (Exception e) {
    _log.error("Error during mergin Document, skip");
    _log.debug(MiscUtil.stackToString(e));
}
return bos.toByteArray();
That's the full stacktrace:
com.itextpdf.text.pdf.PdfAConformanceException: PDF array is out of bounds.
at com.itextpdf.text.pdf.internal.PdfA1Checker.checkPdfObject(PdfA1Checker.java:269)
at com.itextpdf.text.pdf.internal.PdfAChecker.checkPdfAConformance(PdfAChecker.java:208)
at com.itextpdf.text.pdf.internal.PdfAConformanceImp.checkPdfIsoConformance(PdfAConformanceImp.java:71)
at com.itextpdf.text.pdf.PdfWriter.checkPdfIsoConformance(PdfWriter.java:3480)
at com.itextpdf.text.pdf.PdfWriter.checkPdfIsoConformance(PdfWriter.java:3476)
at com.itextpdf.text.pdf.PdfArray.toPdf(PdfArray.java:165)
at com.itextpdf.text.pdf.PdfDictionary.toPdf(PdfDictionary.java:149)
at com.itextpdf.text.pdf.PdfArray.toPdf(PdfArray.java:175)
at com.itextpdf.text.pdf.PdfDictionary.toPdf(PdfDictionary.java:149)
at com.itextpdf.text.pdf.PdfIndirectObject.writeTo(PdfIndirectObject.java:158)
at com.itextpdf.text.pdf.PdfWriter$PdfBody.write(PdfWriter.java:420)
at com.itextpdf.text.pdf.PdfWriter$PdfBody.add(PdfWriter.java:398)
at com.itextpdf.text.pdf.PdfWriter$PdfBody.add(PdfWriter.java:373)
at com.itextpdf.text.pdf.PdfWriter$PdfBody.add(PdfWriter.java:369)
at com.itextpdf.text.pdf.PdfWriter.addToBody(PdfWriter.java:843)
at com.itextpdf.text.pdf.PdfCopy.addToBody(PdfCopy.java:839)
at com.itextpdf.text.pdf.PdfCopy.addToBody(PdfCopy.java:821)
at com.itextpdf.text.pdf.PdfCopy.copyIndirect(PdfCopy.java:426)
at com.itextpdf.text.pdf.PdfCopy.copyIndirect(PdfCopy.java:446)
at com.itextpdf.text.pdf.PdfCopy.copyObject(PdfCopy.java:577)
at com.itextpdf.text.pdf.PdfCopy.copyDictionary(PdfCopy.java:503)
at com.itextpdf.text.pdf.PdfCopy.copyObject(PdfCopy.java:573)
at com.itextpdf.text.pdf.PdfCopy.copyDictionary(PdfCopy.java:503)
at com.itextpdf.text.pdf.PdfCopy.copyObject(PdfCopy.java:573)
at com.itextpdf.text.pdf.PdfCopy.copyDictionary(PdfCopy.java:493)
at com.itextpdf.text.pdf.PdfCopy.copyDictionary(PdfCopy.java:519)
at com.itextpdf.text.pdf.PdfCopy.addPage(PdfCopy.java:663)
at com.itextpdf.text.pdf.PdfACopy.addPage(PdfACopy.java:115)
at it.m2sc.engageone.documentpackage.generator.PackageGenerator.mergePDF(PackageGenerator.java:256)

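In this specific case, the problem was caused by one font (Gulim) that is too large to be embedded in a PDF/A-1 file. PDF/A-1 inherits the PDF 1.4 implementation limits, including a maximum of 8,191 elements per array, which is presumably what the conformance check trips over with such a large font. Once that font was removed, everything ran fine.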

Java writing PDF - Font not supported

Below is my code for writing a PDF using Java.
Code
public class PDFTest {
    public static void main(String args[]) {
        Document document = new Document(PageSize.A4, 50, 50, 50, 50);
        try {
            File file = new File("C://test//itext-test.pdf");
            FileOutputStream fileout = new FileOutputStream(file);
            PdfWriter.getInstance(document, fileout);
            document.addAuthor("Me");
            document.addTitle("My iText Test");
            document.open();
            Chunk chunk = new Chunk("iText Test");
            Paragraph paragraph = new Paragraph();
            String test = "și";
            String test1 = "şi";
            if (test.equalsIgnoreCase(test1)) {
                // System.out.println("equal ignore case true");
                paragraph.add(test + " New Font equal with Old Font");
            } else {
                // System.out.println("equal ignore case X true");
                paragraph.add(test1 + " New Font Not equal with Old Font");
            }
            paragraph.setAlignment(Element.ALIGN_CENTER);
            document.add(paragraph);
            document.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
When I test with Romanian text, the "ș" is missing from the generated PDF.
Any advice or reference links regarding this issue would be highly appreciated.
**EDITED**
I've used a Unicode example like the one below and the output is still the same: "ș" is still missing.
Code
static String RESULT = "C://test/itext-unicode4.pdf";
static String FONT = "C://Users//PenangIT//Desktop//Arial Unicode.ttf";
public static void main(String args[])
{
    try
    {
        Document doc = new Document();
        PdfWriter.getInstance(doc, new FileOutputStream(RESULT));
        doc.open();
        BaseFont bf;
        bf = BaseFont.createFont(FONT, BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
        doc.add(new Paragraph("Font : " + bf.getPostscriptFontName() + " with encoding: " + bf.getEncoding()));
        doc.add(new Paragraph(" TESTING "));
        doc.add(new Paragraph(" TESTING 1 și "));
        doc.add(new Paragraph(" TESTING 2 şi "));
        doc.add(Chunk.NEWLINE);
        doc.close();
    }
    catch(Exception ex)
    {
    }
}
It is the same for the encoding example as well: the "ș" is still missing.

Please take a look at this PDF: encoding_example.pdf (*)
It contains all kinds of characters that aren't present in the default font Helvetica (which is the default font you're using as you're not defining any other font).
In the EncodingExample source, we use arialbd.ttf with a specific encoding, resulting in the use of a simple font in the PDF. In the UnicodeExample source, we use IDENTITY_H as encoding, resulting in the use of a composite font in the PDF.
I've adapted your code, because I see that you didn't understand my answer:
BaseFont bf = BaseFont.createFont(FONT,BaseFont.IDENTITY_H,BaseFont.EMBEDDED);
doc.add(new Paragraph(" TESTING 1 și ", new Font(bf, 12)));
doc.add(new Paragraph(" TESTING 2 \u015Fi ", new Font(bf, 12)));
Do you see the difference? In your code, you create bf, but you aren't using that object anywhere.
(*) Note: pdf.js can't interpret some glyphs because pdf.js doesn't support simple fonts with a special encoding; these glyphs show up correctly in Adobe Reader and the Chrome PDF viewer. If you want to be safe, use composite fonts, because pdf.js can render those glyphs correctly: unicode_example.pdf
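It may also help to confirm which character is actually in the string. The question's two test strings look alike but, as far as I can tell, hold different code points: one uses U+0219 (s with comma below) and the other U+015F (s with cedilla). That would explain why equalsIgnoreCase is false, and it means the chosen font has to contain the specific glyph being used. The class and method names below are made up for illustration, and the strings are written with Unicode escapes so the source file's encoding cannot alter them.

```java
public class RomanianCodePoints {
    public static final String COMMA_BELOW = "\u0219i"; // s with comma below, then i
    public static final String CEDILLA = "\u015Fi";     // s with cedilla, then i

    /** Lists the code points of a string in U+XXXX notation, separated by spaces. */
    public static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(describe(COMMA_BELOW));
        System.out.println(describe(CEDILLA));
        System.out.println(COMMA_BELOW.equalsIgnoreCase(CEDILLA));
    }
}
```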

Read .doc file content and write into pdf file in java

I'm writing Java code that uses Apache POI to read an MS Office .doc file and the iText API to create and write a PDF file. Reading the text and tables in the .doc file already works. Now I'm looking for a way to read the images in the document. I have written the following code to read them. Why is this code not working?
public static void main(String[] args) {
    POIFSFileSystem fs = null;
    Document document = new Document();
    WordExtractor extractor = null;
    try {
        fs = new POIFSFileSystem(new FileInputStream("C:\\DATASTORE\\tableandImage.doc"));
        HWPFDocument hdocument = new HWPFDocument(fs);
        extractor = new WordExtractor(hdocument);
        OutputStream fileOutput = new FileOutputStream(new File("C:/DATASTORE/tableandImage.pdf"));
        PdfWriter.getInstance(document, fileOutput);
        document.open();
        Range range = hdocument.getRange();
        String readText = null;
        PdfPTable createTable;
        CharacterRun run;
        PicturesTable picture;
        for (int i = 0; i < range.numParagraphs(); i++) {
            Paragraph par = range.getParagraph(i);
            readText = par.text();
            if (!par.isInTable()) {
                if (readText.endsWith("\n")) {
                    readText = readText + "\n";
                    document.add(new com.itextpdf.text.Paragraph(readText));
                }
                if (readText.endsWith("\r")) {
                    readText += "\n";
                    document.add(new com.itextpdf.text.Paragraph(readText));
                }
                run = range.getCharacterRun(i);
                picture = hdocument.getPicturesTable();
                if (picture.hasPicture(run)) {
                    //if(run.isSpecialCharacter()) {
                    Picture pic = picture.extractPicture(run, true);
                    byte[] picturearray = pic.getContent();
                    com.itextpdf.text.Image image = com.itextpdf.text.Image.getInstance(picturearray);
                    document.add(image);
                }
            } else if (par.isInTable()) {
                Table table = range.getTable(par);
                TableRow tRow1 = table.getRow(0);
                int numColumns = tRow1.numCells();
                createTable = new PdfPTable(numColumns);
                for (int rowId = 0; rowId < table.numRows(); rowId++) {
                    TableRow tRow = table.getRow(rowId);
                    for (int cellId = 0; cellId < tRow.numCells(); cellId++) {
                        TableCell tCell = tRow.getCell(cellId);
                        PdfPCell c1 = new PdfPCell(new Phrase(tCell.text()));
                        createTable.addCell(c1);
                    }
                }
                document.add(createTable);
            }
        }
    } catch (IOException e) {
        System.out.println("IO Exception");
        e.printStackTrace();
    } catch (Exception exep) {
        exep.printStackTrace();
    } finally {
        document.close();
    }
}
The problems are:
1. The condition if(picture.hasPicture(run)) is never satisfied, although the document has a JPEG image.
2. I'm getting the following exception while reading the table:
java.lang.IllegalArgumentException: This paragraph is not the first one in the table
at org.apache.poi.hwpf.usermodel.Range.getTable(Range.java:876)
at pagecode.ReadDocxOrDocFile.main(ReadDocxOrDocFile.java:113)
Can anybody help me solve the problem?
Thank you.

Regarding your exception:
Your code iterates over all paragraphs and calls isInTable() for each one of them. Since tables are commonly composed of several such paragraphs, your call to getTable() also gets executed several times for a single table.
However, what your code should do instead is to find the first paragraph of a table, then process all paragraphs therein (via getRow(m).getCell(n)) and ultimately continue with the outer loop in the first paragraph after the table. Codewise this may look roughly like the following (assuming no merged cells, no nested tables and no other funny edge cases):
if (par.isInTable()) {
    Table table = range.getTable(par);
    for (int rn = 0; rn < table.numRows(); rn++) {
        TableRow row = table.getRow(rn);
        for (int cn = 0; cn < row.numCells(); cn++) {
            TableCell cell = row.getCell(cn);
            for (int pn = 0; pn < cell.numParagraphs(); pn++) {
                Paragraph cellParagraph = cell.getParagraph(pn);
                // your PDF conversion code goes here
            }
        }
    }
    i += table.numParagraphs() - 1; // skip the already processed (table-)paragraphs in the outer loop
}
Regarding the pictures issue:
Am I guessing right that you are trying to obtain the picture which is anchored within a given paragraph? Unfortunately, the predefined methods of POI only work if the picture is not embedded within a field (which is rather rare, actually). For field-based images (i.e. preview images of embedded OLEs) you should do something like the following (untested!):
PictureStore pictureStore = new PictureStore(hdocument);
// bla bla ...
for (int cr = 0; cr < par.numCharacterRuns(); cr++) {
    CharacterRun characterRun = par.getCharacterRun(cr);
    Field field = hdocument.getFields().getFieldByStartOffset(FieldsDocumentPart.MAIN, characterRun.getStartOffset());
    if (field != null && field.getType() == 0x3A) { // 0x3A is type "EMBED"
        Picture pic = pictureStore.getPicture(field.secondSubrange(characterRun));
    }
}
For a list of possible values of Field.getType() see here.

Identify hidden text Word 2003/2007 using Apache POI

I am converting a Word (2003 and 2007) document to HTML format. I have managed to read the text, formatting, etc. from the Word document. But the document contains some hidden text, such as 'Header Change History', which should not be displayed on the page. Is there any way to identify hidden text in a Word document?
Any help would be much appreciated.

I am not sure if this is a complete (or even accurate) solution, but for the files in the DOCX format, it seems that you can check if a character run is hidden by
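XWPFRun cr; // a run obtained from XWPFParagraph.getRuns()
// getRPr() returns null when the run has no run properties, so check it first
if (cr.getCTR().getRPr() != null && cr.getCTR().getRPr().getVanish() != null) {
    // it is hidden
}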
Got this from reverse-engineering the XML, and at least in my usage it seems to work. Would be very glad for additional (more informed) input, and a way to do the same thing in the old binary file format.

The following code snippet helps in identifying whether the text is hidden:
POIFSFileSystem fs = null;
boolean isHidden = false;
try {
    fs = new POIFSFileSystem(new FileInputStream(filesname));
    HWPFDocument doc = new HWPFDocument(fs);
    WordExtractor we = new WordExtractor(doc);
    String[] paragraphs = we.getParagraphText();
    System.out.println("Word Document has " + paragraphs.length + " paragraphs");
    Range range = doc.getRange();
    for (int k = 0; k < range.numParagraphs(); k++) {
        org.apache.poi.hwpf.usermodel.Paragraph paragraph = range.getParagraph(k);
        // Strings are immutable, so keep the cleaned-up text in a variable
        String text = paragraph.text().replaceAll("\\cM?\r?\n", "").trim();
        for (int j = 0; j < paragraph.numCharacterRuns(); j++) {
            org.apache.poi.hwpf.usermodel.CharacterRun cr = paragraph.getCharacterRun(j);
            if (cr.isVanished()) {
                // it is hidden
                System.out.println("text is hidden ");
                isHidden = true;
                break;
            }
        }
    }
} catch (IOException e) {
    e.printStackTrace();
}
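When the paragraph text is needed as well as the hidden flag, keep in mind that Java strings are immutable: trim() and replaceAll() return a new string, so the result has to be stored. A small helper for removing the trailing paragraph mark could look like the following; the class and method names are made up, and it assumes the mark is made of carriage returns and line feeds only.

```java
public class ParagraphText {
    /** Removes trailing carriage returns and line feeds, returning a new string. */
    public static String stripParagraphMark(String text) {
        return text.replaceAll("[\\r\\n]+$", "");
    }

    public static void main(String[] args) {
        String raw = "Hello\r";
        raw.trim(); // result discarded, so raw is unchanged
        String cleaned = stripParagraphMark(raw);
        System.out.println(raw.length() + " -> " + cleaned.length());
    }
}
```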
