PDFBox PDFImageWrite.writeImage is not handling all characters properly

PDFBox PDFImageWrite.writeImage is not handling all characters properly - java

I am using PDFBox 1.8.10 to load PDFs and to overlay images on each page.
PDDocument doc = PDDocument.load(url);
PDFImageWriter imageWriter = new PDFImageWriter();
imageWriter.writeImage(doc, imageFormat, password, 1,
doc.getNumberOfPages(), filePrefix, imageType, resolution);
I have tried saving the doc as a PDF and this looks fine. When the images are saved they can contain incorrect text. This is especially true for eastern European documents - eg Hungary, Poland, Czech etc
The PDF shows
H-4432 NYÍREGYHÁZA-NYÍRSZŐLŐS
The image shows
Is there a solution for this? Do I need to define a codepage? Could it be a problem with the available fonts?

The solution for me was to switch over to a 2.0 SNAPSHOT (Aug15). All the documents I've tested look fine. The API has changed but, in my case, it took 5 minutes to make the changes.
Thanks to #mkl for the info.

Related

How to clone a generated PDDocument in pdfbox 2.0.x?

I'm using pdfbox 2.0.12 to generate reports.
I want to create 2 versions in one go, with partly similar content.
(ie: generate 1-3 pages, clone, add more pages to each version, save)
What is the correct way to copy a PDDocument to a new PDDocument?
My files are fairly simple, just text and an image per page.
The existing StackO questions[1] use code from pdfbox 1.8, or whatever doesn't work today.
The multipdf.PDCloneUtility is marked deprecated for public use and also not for use for generated PDF:s.
I could not find an example in PDFbox tree that does this.
I'm using function importPage. This almost works, except there is some mixup with fonts.
The copied pages are correct in layout (some lines and an image), but the text is just dots because it cannot find the fonts used.
The ADDED pages in the copied doc are using copies of the same fonts, the text is fine.
When looking at font resources in Adobe Reader, in the copied doc, the used fonts are listed 2 times:
Roboto-Regular (Embedded Subset)
Type: TrueType (CID)
Encoding: Identity-H
Roboto-Regular
Type: TrueType (CID)
Encoding: Identity-H
Actual Font: Unknown
(etc)
When opening the copied doc, there's a warning
"Cannot find or create the font Roboto-Bold. Some characters may not display or print correctly"
In the source document, the fonts are listed once, exactly like the first entry above.
My code:
// Close content stream before copying
myContentStream.endText();
myContentStream.close();
// Copy pages
PDDocument result = new PDDocument();
result.setDocumentInformation(doc.getDocumentInformation());
int pageCount = doc.getNumberOfPages();
for (int i = 0; i < pageCount; ++i) {
PDPage page = doc.getPage(i);
PDPage importedPage = result.importPage(page);
// This is mentioned in importPage docs, bizarrely it's said to copy resources
importedPage.setRotation(page.getRotation());
// while this seems intuitive
importedPage.setResources(page.getResources());
}
// Fonts are recreated for copy by reloading from file
copy_plainfont = PDType0Font.load(result, new java.io.ByteArrayInputStream(plainfont_bytes));
//....etc
I have tried all combinations with and without importedPage.setRotation/setResources.
I've also tried using doc.getDocumentCatalog().getPages() and rolling through that. Same result.
[1]
I looked at
pdfbox: how to clone a page
Can duplicating a pdf with PDFBox be small like with iText?
and half a dozen more of varying irrelevance.
Grateful for any tips
/rasmus

PDFBox text extraction - empty output

I'm trying to extract some infos from a set of PDFs. This works so far, but one PDF gives me grievances.
I'm using PDFBox 1.8.8, with Java 7.
PDDocument document = PDDocument.load(pdfFile);
PDFTextStripper stripper = new PDFTextStripper();
System.out.println("File: "+pdfFile.getAbsolutePath()+" readable: "+pdfFile.canRead()+" size: "+pdfFile.length());
System.out.println(stripper.getText(document));
It just prints
File: /foo/bar/mypdf.pdf readable: true size: 1267743
Then it terminates. Usually I use the writeText method and funnel the text through a stream, but above code was used for simplification. I've tried converting the PDF with pdftotext - it works just like the others.
I get no exception, no nothing. Any ideas?
EDIT:
Additional Info: Created with Acrobat Distiller 9.0.0 (Windows), Format PDF-1.6; The other PDFs are Version 1.4 and 1.5
Doesn't seem to contain exotic characters. I can mark/copy text in Evince PDF-viewer
EDIT2:
Dang it. File property dialog (Nautilus) said "Security: No", but pdfinfo gives me:
Encrypted: yes (print:yes copy:no change:no addNotes:no algorithm:AES)
Anyway to circumvent that? After all, pdftotext could get the text out.

The document was "encrypted" (write protected), but with no user password set. This Stackoverflow answer shows how you can remove the encryption and simply read the file: remove encryption from pdf with pdfbox, like qpdf

PDFbox runs into error (how to calculate the position for non-simple fonts)

I am using pdfbox to fill up a form in my pdf file, application is able to show number of available fields on the form but it returns the following error
Messages:
Error: Don't know how to calculate the position for non-simple fonts
File: org/apache/pdfbox/pdmodel/interactive/form/PDAppearance.java
Line number: 616
Code
.....
while (fieldsIter.hasNext()) {
PDField field = (PDField) fieldsIter.next();
setField(pdf, field.getPartialName(), "My input");
//setField(pdf, field.getFullyQualifiedName(), "My input");
}
.....
public void setField(PDDocument pdfDocument, String name, String value) throws
IOException {
PDDocumentCatalog docCatalog = pdfDocument.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
PDField field = acroForm.getField(name);
if (field != null) {
field.setValue(value);
} else {
System.err.println("No field found with name:" + name);
}
}
Please let me know if you need any other part of the code.

There seems to be a bug in pdfbox that occurs in various circumstances where it can't find the relevant font. The only workaround I've been able to find is just skipping the update appearance code that's run in PDTextbox.setValue. You can "force" the value to be updated by just doing:
COSString fieldValue = new COSString("Awesome field value");
textbox.getDictionary().setItem(COSName.V, fieldValue);
Presumably, in all but corner cases, the PDF viewer can handle rendering the font, and just setting the V item for the field should suffice. Subjectively the documents I've generated open fine in Acrobat and OS X preview.
Related issue:
https://issues.apache.org/jira/browse/PDFBOX-1550
Edit to add: by default acrobat seems to create fields with a visible text area size of 0. For documents with this problem, you can get around this by adding textbox.getDictionary().setItem(COSName.AP, null); and hoping that the reader can handle rendering the appearance correctly.

Use another PDF if that works means PDFBox is not compatible with the first pdf.

I had the same issue with a PDF and I ended up solving it by editing the PDF form field (with Abobe Acrobat Pro) and setting an specific font.
The problem was that the problematic field didn't specify any font at all.
Hope it helps!

I also had this issue, the problem was that the PDF did not embed the font used.
Moreover the Preflight tools from Acrobat pro seemed unable to fix the PDF. I ended up recreating the PDF and now it is working fine.

This happend to me by using Adobe Pro and merging to PDFs into one. After this the resulting PDF was not able to be used with PDFBox also the original PDFs worked. After a little research i found out, that the merge process destryos the font information. Just reset the fonts and it should work!
Best regards!

Unable to read a PDF file using PDFBOX

I am trying to fill in a PDF form using JAVA, but when I tried to get the fields using the below code the list is empty.
PDDocument pdDoc = PDDocument.load(filename);
PDAcroForm pdform = pdDoc.getDocumentCatalog().getAcroForm();
List<PDField> field = pdform.getFields();
Then I tried to read the file using PDFStripper
PDFTextStripper stripper = new PDFTextStripper();
System.out.println(stripper.getText(pdDoc));
and the ouput was as follows
"Please wait...
If this message is not eventually replaced by the proper contents of the document, your PDF
viewer may not be able to display this type of document.
You can upgrade to the latest version of Adobe Reader for Windows®, Mac, or Linux® by
visiting http://www.adobe.com/go/reader_download.
For more assistance with Adobe Reader visit http://www.adobe.com/go/acrreader.
Windows is either a registered trademark or a trademark of Microsoft Corporation in the United States and/or other countries. Mac is a trademark
of Apple Inc., registered in the United States and other countries. Linux is the registered trademark of Linus Torvalds in the U.S. and other
countries."
But I'm able to open the file manually and fill the fields as well. I've tried other tools like iText also. But again I wasn't able to get the fields.
How can I resolve this issue?

May be it is too late to answer but anyway why not. You can get empty list if your pdf file has XFA structure.
PDDocument pdDoc = PDDocument.load(filename);
PDAcroForm pdform = pdDoc.getDocumentCatalog().getAcroForm();
List<PDField> field = pdform.getFields();
Use these code lines to start working with pdf:
PDXFA xfa = pdform.getXFA();
Document xfaDocument = xfa.getDocument();
NodeList elements = xfaDocument.getElementsByTagName( "SomeElement" );

While struggling with Alfresco's content search abilities, I've had some trouble with pdfbox (used by Alfresco to extract text and metadata) reading PDF files written by old applications (like QuarkXPress) that use old Acrobat 4.0 format. This old format pdfbox seems to be unable to extract metadata or text from it, although the files were perfectly viewable with any PDF reader application.
The solution was having all old PFD files re-printed (saved as...) using a more modern PDF format (like 10.0 for instance). This can be done in a row using some bash scripting.
I directly didn't try intermediate Acrobat versions among 4.0 and 10.0.

Add image using WordML

I am trying to add an image into a document using WordML. I have used the xml as a basis from the jpg example from here http://www.codeproject.com/KB/office/WordML.aspx. I have managed to write Java which creates this exact xml(wordML) in the document, however when I try and open the generated file in MS Word 2007 it says the file in invalid or corrupt.
The xml for the document that won't open is here:
http://pastebin.com/RNEkbvYG (Raw xml)
Sorry for the long paste, this is the shortest example I could create, there's load of gumph at the top and bottom, but you can clearly see the data image in the middle.
http://pastebin.com/download.php?i=RNEkbvYG (download, rename from txt to xml and open with word)
I would greatly appreciate if anybody could look at the xml at the link above and see if they can see why it won't open in word.

<w:pict>
<w:binData w:name="wordml://02000001.jpg">/9j/4AA..Xof/9k=</w:binData>
<v:shape id="_x0000_i1025" style="width:100%;height:auto" type="#_x0000_t75">
<v:imagedata o:title="network" src="wordml://02000001.jpg"/>
</v:shape>
</w:pict>
is 2003 WordML. There is no w:binData element in the 2007 docx format / ECMA standard.
You might try docx4j instead :-)
See http://dev.plutext.org/svn/docx4j/trunk/docx4j/src/main/java/org/docx4j/samples/AddImage.java

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

PDFBox PDFImageWrite.writeImage is not handling all characters properly - java

The solution for me was to switch over to a 2.0 SNAPSHOT (Aug15). All the documents I've tested look fine. The API has changed but, in my case, it took 5 minutes to make the changes. Thanks to #mkl for the info.

Related

How to clone a generated PDDocument in pdfbox 2.0.x?

PDFBox text extraction - empty output

PDFbox runs into error (how to calculate the position for non-simple fonts)

Unable to read a PDF file using PDFBOX

Add image using WordML

Categories

Resources