I'm using pdfbox 2.0.12 to generate reports.
I want to create 2 versions in one go, with partly similar content.
(ie: generate 1-3 pages, clone, add more pages to each version, save)
What is the correct way to copy a PDDocument to a new PDDocument?
My files are fairly simple, just text and an image per page.
The existing StackO questions[1] use code from pdfbox 1.8, or whatever doesn't work today.
The multipdf.PDCloneUtility is marked deprecated for public use and also not for use for generated PDF:s.
I could not find an example in PDFbox tree that does this.
I'm using function importPage. This almost works, except there is some mixup with fonts.
The copied pages are correct in layout (some lines and an image), but the text is just dots because it cannot find the fonts used.
The ADDED pages in the copied doc are using copies of the same fonts, the text is fine.
When looking at font resources in Adobe Reader, in the copied doc, the used fonts are listed 2 times:
Roboto-Regular (Embedded Subset)
Type: TrueType (CID)
Encoding: Identity-H
Roboto-Regular
Type: TrueType (CID)
Encoding: Identity-H
Actual Font: Unknown
(etc)
When opening the copied doc, there's a warning
"Cannot find or create the font Roboto-Bold. Some characters may not display or print correctly"
In the source document, the fonts are listed once, exactly like the first entry above.
My code:
// Close content stream before copying
myContentStream.endText();
myContentStream.close();
// Copy pages
PDDocument result = new PDDocument();
result.setDocumentInformation(doc.getDocumentInformation());
int pageCount = doc.getNumberOfPages();
for (int i = 0; i < pageCount; ++i) {
PDPage page = doc.getPage(i);
PDPage importedPage = result.importPage(page);
// This is mentioned in importPage docs, bizarrely it's said to copy resources
importedPage.setRotation(page.getRotation());
// while this seems intuitive
importedPage.setResources(page.getResources());
}
// Fonts are recreated for copy by reloading from file
copy_plainfont = PDType0Font.load(result, new java.io.ByteArrayInputStream(plainfont_bytes));
//....etc
I have tried all combinations with and without importedPage.setRotation/setResources.
I've also tried using doc.getDocumentCatalog().getPages() and rolling through that. Same result.
[1]
I looked at
pdfbox: how to clone a page
Can duplicating a pdf with PDFBox be small like with iText?
and half a dozen more of varying irrelevance.
Grateful for any tips
/rasmus
Related
I wanted to make a simple program to get text content from a pdf file through Java. Here is the code:
PDFTextStripper ts = new PDFTextStripper();
File file = new File("C:\\Meeting IDs.pdf");
PDDocument doc1 = PDDocument.load(file);
String allText = ts.getText(doc1);
String gradeText = allText.substring(allText.indexOf("GRADE 10B"), allText.indexOf("GRADE 10C"));
System.out.println("Meeting ID for English: "
+ gradeText.substring(gradeText.indexOf("English") + 7, gradeText.indexOf("English") + 20));
This is just part of the code, but this is the part with the problem.
The error is: The method load(File) is undefined for the type PDDocument
I have learnt using PDFBox from JavaTPoint. I have followed the correct instructions for installing the PDFBox libraries and adding them to the Build Path.
My PDFBox version is 3.0.0
I have also searched the source files and their methods, and I am unable to find the load method there.
Thank you in advance.
As per the 3.0 migration guide the PDDocument.load method has been replaced with the Loader method:
For loading a PDF PDDocument.load has been replaced with the Loader
methods. The same is true for loading a FDF document.
When saving a PDF this will now be done in compressed mode per
default. To override that use PDDocument.save with
CompressParameters.NO_COMPRESSION.
PDFBox now loads a PDF Document incrementally reducing the initial
memory footprint. This will also reduce the memory needed to consume a
PDF if only certain parts of the PDF are accessed. Note that, due to
the nature of PDF, uses such as iterating over all pages, accessing
annotations, signing a PDF etc. might still load all parts of the PDF
overtime leading to a similar memory consumption as with PDFBox 2.0.
The input file must not be used as output for saving operations. It
will corrupt the file and throw an exception as parts of the file are
read the first time when saving it.
So you can either swap to an earlier 2.x version of PDFBox, or you need to use the new Loader method. I believe this should work:
File file = new File("C:\\Meeting IDs.pdf");
PDDocument doc1 = Loader.loadPDF(file);
I am using PDFBox 1.8.10 to load PDFs and to overlay images on each page.
PDDocument doc = PDDocument.load(url);
PDFImageWriter imageWriter = new PDFImageWriter();
imageWriter.writeImage(doc, imageFormat, password, 1,
doc.getNumberOfPages(), filePrefix, imageType, resolution);
I have tried saving the doc as a PDF and this looks fine. When the images are saved they can contain incorrect text. This is especially true for eastern European documents - eg Hungary, Poland, Czech etc
The PDF shows
H-4432 NYÍREGYHÁZA-NYÍRSZŐLŐS
The image shows
Is there a solution for this? Do I need to define a codepage? Could it be a problem with the available fonts?
The solution for me was to switch over to a 2.0 SNAPSHOT (Aug15). All the documents I've tested look fine. The API has changed but, in my case, it took 5 minutes to make the changes.
Thanks to #mkl for the info.
I am using pdfbox to fill up a form in my pdf file, application is able to show number of available fields on the form but it returns the following error
Messages:
Error: Don't know how to calculate the position for non-simple fonts
File: org/apache/pdfbox/pdmodel/interactive/form/PDAppearance.java
Line number: 616
Code
.....
while (fieldsIter.hasNext()) {
PDField field = (PDField) fieldsIter.next();
setField(pdf, field.getPartialName(), "My input");
//setField(pdf, field.getFullyQualifiedName(), "My input");
}
.....
public void setField(PDDocument pdfDocument, String name, String value) throws
IOException {
PDDocumentCatalog docCatalog = pdfDocument.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
PDField field = acroForm.getField(name);
if (field != null) {
field.setValue(value);
} else {
System.err.println("No field found with name:" + name);
}
}
Please let me know if you need any other part of the code.
There seems to be a bug in pdfbox that occurs in various circumstances where it can't find the relevant font. The only workaround I've been able to find is just skipping the update appearance code that's run in PDTextbox.setValue. You can "force" the value to be updated by just doing:
COSString fieldValue = new COSString("Awesome field value");
textbox.getDictionary().setItem(COSName.V, fieldValue);
Presumably, in all but corner cases, the PDF viewer can handle rendering the font, and just setting the V item for the field should suffice. Subjectively the documents I've generated open fine in Acrobat and OS X preview.
Related issue:
https://issues.apache.org/jira/browse/PDFBOX-1550
Edit to add: by default acrobat seems to create fields with a visible text area size of 0. For documents with this problem, you can get around this by adding textbox.getDictionary().setItem(COSName.AP, null); and hoping that the reader can handle rendering the appearance correctly.
Use another PDF if that works means PDFBox is not compatible with the first pdf.
I had the same issue with a PDF and I ended up solving it by editing the PDF form field (with Abobe Acrobat Pro) and setting an specific font.
The problem was that the problematic field didn't specify any font at all.
Hope it helps!
I also had this issue, the problem was that the PDF did not embed the font used.
Moreover the Preflight tools from Acrobat pro seemed unable to fix the PDF. I ended up recreating the PDF and now it is working fine.
This happend to me by using Adobe Pro and merging to PDFs into one. After this the resulting PDF was not able to be used with PDFBox also the original PDFs worked. After a little research i found out, that the merge process destryos the font information. Just reset the fonts and it should work!
Best regards!
I am trying to fill in a PDF form using JAVA, but when I tried to get the fields using the below code the list is empty.
PDDocument pdDoc = PDDocument.load(filename);
PDAcroForm pdform = pdDoc.getDocumentCatalog().getAcroForm();
List<PDField> field = pdform.getFields();
Then I tried to read the file using PDFStripper
PDFTextStripper stripper = new PDFTextStripper();
System.out.println(stripper.getText(pdDoc));
and the ouput was as follows
"Please wait...
If this message is not eventually replaced by the proper contents of the document, your PDF
viewer may not be able to display this type of document.
You can upgrade to the latest version of Adobe Reader for Windows®, Mac, or Linux® by
visiting http://www.adobe.com/go/reader_download.
For more assistance with Adobe Reader visit http://www.adobe.com/go/acrreader.
Windows is either a registered trademark or a trademark of Microsoft Corporation in the United States and/or other countries. Mac is a trademark
of Apple Inc., registered in the United States and other countries. Linux is the registered trademark of Linus Torvalds in the U.S. and other
countries."
But I'm able to open the file manually and fill the fields as well. I've tried other tools like iText also. But again I wasn't able to get the fields.
How can I resolve this issue?
May be it is too late to answer but anyway why not. You can get empty list if your pdf file has XFA structure.
PDDocument pdDoc = PDDocument.load(filename);
PDAcroForm pdform = pdDoc.getDocumentCatalog().getAcroForm();
List<PDField> field = pdform.getFields();
Use these code lines to start working with pdf:
PDXFA xfa = pdform.getXFA();
Document xfaDocument = xfa.getDocument();
NodeList elements = xfaDocument.getElementsByTagName( "SomeElement" );
While struggling with Alfresco's content search abilities, I've had some trouble with pdfbox (used by Alfresco to extract text and metadata) reading PDF files written by old applications (like QuarkXPress) that use old Acrobat 4.0 format. This old format pdfbox seems to be unable to extract metadata or text from it, although the files were perfectly viewable with any PDF reader application.
The solution was having all old PFD files re-printed (saved as...) using a more modern PDF format (like 10.0 for instance). This can be done in a row using some bash scripting.
I directly didn't try intermediate Acrobat versions among 4.0 and 10.0.
I have just started working with PDFBox, extracting text and so on. One thing I am interested in is the colour of the text itself that I am extracting. However I cannot seem to find any way of getting that information.
Is it possible at all to use PDFBox to get the colour information of a document and if so, how would I go about doing so?
Many thanks.
All color informations should be stored in the class PDGraphicsState and the used color (stroking/nonstroking etc.) depends on the used text rendering mode (via pdfbox mailing list).
Here is a small sample I tried:
After creating a pdf with just one line ("Sample" written in RGB=[146,208,80]), the following program will output:
DeviceRGB
146.115
208.08
80.07
Here's the code:
PDDocument doc = null;
try {
doc = PDDocument.load("C:/Path/To/Pdf/Sample.pdf");
PDFStreamEngine engine = new PDFStreamEngine(ResourceLoader.loadProperties("org/apache/pdfbox/resources/PageDrawer.properties"));
PDPage page = (PDPage)doc.getDocumentCatalog().getAllPages().get(0);
engine.processStream(page, page.findResources(), page.getContents().getStream());
PDGraphicsState graphicState = engine.getGraphicsState();
System.out.println(graphicState.getStrokingColor().getColorSpace().getName());
float colorSpaceValues[] = graphicState.getStrokingColor().getColorSpaceValue();
for (float c : colorSpaceValues) {
System.out.println(c * 255);
}
}
finally {
if (doc != null) {
doc.close();
}
Take a look at PageDrawer.properties to see how PDF operators are mapped to Java classes.
As I understand it, as PDFStreamEngine processes a page stream, it sets various variable states depending on what operators it is processing at the moment. So when it hits green text, it will change the PDGraphicsState because it will encounter appropriate operators. So for CS it calls org.apache.pdfbox.util.operator.SetStrokingColorSpace as defined by mapping CS=org.apache.pdfbox.util.operator.SetStrokingColorSpace in the .properties file. RG is mapped to org.apache.pdfbox.util.operator.SetStrokingRGBColor and so on.
In this case, the PDGraphicsState hasn't changed because the document has just text and the text it has is in just one style. For something more advanced, you would need to extend PDFStreamEngine (just like PageDrawer, PDFTextStripper and other classes do) to do something when color changes. You could also write your own mappings in your own .properties file.