Reading pdf with Apache PDF Box


I have the following code, which I have been trying to use to read a PDF file with Apache PDFBox.
private void readPdf() {
    try {
        File PDF_Path = new File("/home/olyjosh/Downloads/my project.pdf");
        PDDocument inputPDF = PDDocument.load(PDF_Path);
        List<PDPage> allPages = inputPDF.getDocumentCatalog().getAllPages();
        PDPage testPage = (PDPage) allPages.get(5);
        System.out.println("Number of pages " + allPages.size());
        PDFPagePanel pdfPanel = new PDFPagePanel();
        jPanel1.add(pdfPanel);
        pdfPanel.setPage(testPage);
        // this.revalidate();
        inputPDF.close();
    } catch (IOException ex) {
        Logger.getLogger(NewJFrame.class.getName()).log(Level.SEVERE, null, ex);
    }
}
I want this PDF to be displayed on a Swing component such as a JPanel, but this only displays the panel without the expected content of the PDF file. However, I was able to display the PDF as an image using
convertToImage = testPage.convertToImage();
How do I work around this, or what am I doing wrong?

Apache PDFBox has a mailing list where I asked the same question, and this was the response I got:
This was removed in 2.0 because it made trouble. Obviously, it doesn't work for 1.8 either, at least for you, so why bother?
There are two ways to display, either get a BufferedImage (renderImage / renderImageWithDPI) and display that somehow (see in PDFDebugger how to do it), or renderPageToGraphics which renders to a graphics device object.
If you really want to get the source code of the deleted PDFReader application (which includes PDFPagePanel), use svn to get revision 1702125 or earlier, that should have it. But if it didn't work for you in 1.8, it won't work for you now.
The point is that swing display of PDF pages isn't part of the API, it's part of some tool (now: in PDFDebugger, previously: in PDFReader)
You need to have some understanding of awt / swing. If you don't, learn it, or hire somebody. (That's what we did, and the best is: google paid it, as part of the google summer of code)
Tilman
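The first option in that reply (get a BufferedImage and display it) could be sketched roughly as follows. This is only a sketch under assumptions: PageImagePanel and setPageImage are hypothetical names, not part of PDFBox, and the image is assumed to come from something like PDFRenderer.renderImageWithDPI(pageIndex, dpi) in PDFBox 2.x or the convertToImage() call from the question in 1.8.

```java
import java.awt.Dimension;
import java.awt.Graphics;
import java.awt.image.BufferedImage;
import javax.swing.JPanel;

// Hypothetical helper (not part of PDFBox): a panel that paints one rendered page.
class PageImagePanel extends JPanel {
    private BufferedImage pageImage;

    // Swap in a new page image, e.g. one rendered by PDFBox beforehand.
    void setPageImage(BufferedImage image) {
        this.pageImage = image;
        revalidate(); // the preferred size may have changed
        repaint();
    }

    // Report the image size so layout managers and scroll panes can size the panel.
    @Override
    public Dimension getPreferredSize() {
        return pageImage == null
                ? super.getPreferredSize()
                : new Dimension(pageImage.getWidth(), pageImage.getHeight());
    }

    // Draw the page image at the top-left corner of the panel.
    @Override
    protected void paintComponent(Graphics g) {
        super.paintComponent(g);
        if (pageImage != null) {
            g.drawImage(pageImage, 0, 0, this);
        }
    }
}
```

Because the image is fully rendered before it is handed to the panel, the document can be closed right afterwards. The panel would typically be placed inside a JScrollPane and added to the existing jPanel1.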

Related

PDF Box getXObjectNames() does not recognize bar code on my PDF, however it does recognize it on a PDF file I got off the internet

I am trying to fetch and read bar codes from my PDF using getXObjectNames() of PdResources.
My code is very similar to this link: https://issues.apache.org/jira/browse/PDFBOX-2124
If you see the above JIRA item, you will see a PDF file attached to it.
When I run the code on that PDF file I get the desired output (i.e. the bar code type is printed.)
However when I run it on my PDF, it does not recognize the bar code in it. (I have checked that the bar code is in fact an image and not text.)
Also it may sound weird, but it did work on my PDF once and I haven't made any changes since then, but it definitely does not work now. (I cannot share the PDF for some reason.)
Has anyone faced a similar issue?
Also this is my first question on Stack Overflow. Please tell me if I am wrong anywhere.
Here is a link to that pdf:
https://drive.google.com/file/d/1PzVApIePg4U9XL399BpAd2oeY6Q2tLEB/view?usp=drivesdk

In General
As you don't show your code but only describe it as very similar to that in PDFBOX-2124, and as you say you cannot share the PDF for some reason, I only have that code to analyze. Thus, I cannot tell what the issue really is but can merely enumerate some possible problems.
First of all, that code only inspects the immediate resources of the given page for bitmap images:
PDResources pdResources = pdPage.getResources();
Map<String, PDXObject> xobjects = (Map<String, PDXObject>) pdResources.getXObjects();
if (xobjects != null)
{
    for (String key : xobjects.keySet())
    {
        PDXObject xobject = xobjects.get(key);
        if (xobject instanceof PDImageXObject)
        {
            PDImageXObject imageObject = (PDImageXObject) xobject;
            String suffix = imageObject.getSuffix();
            if (suffix != null)
            {
                BufferedImage image = imageObject.getImage();
                extractBarcodeArrayByAreas(image, this.maximumBlankPixelDelimiterCount);
            }
        }
    }
}
(PDFBOX-2124 PdPageBarcodeScanner method scan)
Bitmap images can also be stored elsewhere, e.g.
in the separate resources of form xobjects, patterns, or Type 3 fonts used on the page; to find them one has to inspect other page resources, too, even recursively as the image might be a resource of a pattern used in a form xobject used on the page;
in the separate resources of annotations of the page; thus, one has to recurse into annotation resources, too;
inlined in some content stream; thus, one also has to search the content streams of the page itself, of page resources (recursively), and page annotations and their resources (recursively).
Furthermore, the bitmap might be given in some format (in particular with some colorspace) which PDFBox does not know how to export as BufferedImage.
Also the bar code may be constructed using some mask applied to a purely black bitmap in which case your code probably only tries to scan that purely black image.
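One quick way to spot that masked case is to test whether an extracted image is a single flat color before handing it to the scanner. The following is a minimal sketch using only the JDK; ImageChecks and isSingleColor are hypothetical names and are not part of PDFBox or the PDFBOX-2124 code.

```java
import java.awt.image.BufferedImage;

// Hypothetical diagnostic helper, not part of PDFBox.
class ImageChecks {
    // Returns true when every pixel has the same RGB value (alpha is ignored).
    static boolean isSingleColor(BufferedImage image) {
        int first = image.getRGB(0, 0) & 0xFFFFFF;
        for (int y = 0; y < image.getHeight(); y++) {
            for (int x = 0; x < image.getWidth(); x++) {
                if ((image.getRGB(x, y) & 0xFFFFFF) != first) {
                    return false;
                }
            }
        }
        return true;
    }
}
```

If this returns true for the image taken from the xobject, the visible bar code is probably produced by a mask or by some other drawing step, and scanning the raw bitmap alone cannot succeed.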
Furthermore, you say
I have checked that the bar code is in fact an image and not text.
If you only checked that the bar code is not text, it may not only be a bitmap image but it can also be drawn by vector graphics instructions. Thus, you also have to check all content streams for vector graphics instructions drawing a bar code.
Also there may be combinations, e.g. a soft mask of vector graphics may be active when drawing a purely black inlined bitmap image etc.
And I'm sure I've missed a number of options here.
As next step you may want to analyze the PDF you cannot share to find out how exactly that barcode is drawn.
Alternatively, you render the page as bitmap image and search that large bitmap for bar codes using zxing.
Sample PDF.pdf
You provided a link to a sample PDF. So I tried to extract the bar code using code very similar to that from PDFBOX-2124. Apparently the code there was for some PDFBox 2.0.0-SNAPSHOT, so it had to be corrected a bit. In particular the method getXObjectNames() you mention in the question title finally is used:
PDResources pdResources = pdPage.getResources();
int index = 0;
for (COSName name : pdResources.getXObjectNames()) {
    PDXObject xobject = pdResources.getXObject(name);
    if (xobject instanceof PDImageXObject)
    {
        PDImageXObject imageObject = (PDImageXObject) xobject;
        String suffix = imageObject.getSuffix();
        if (suffix != null)
        {
            BufferedImage image = imageObject.getImage();
            File file = new File(RESULT_FOLDER, String.format("Sample PDF-1-%s.%s", index, imageObject.getSuffix()));
            ImageIO.write(image, imageObject.getSuffix(), file);
            index++;
            System.out.println(file);
        }
    }
}
(ExtractImages test testExtractSamplePDFJayshreeAtak)
The output: One bitmap image is exported as "Sample PDF-1-0.tiff".
Thus, I cannot reproduce your issue:
PDF Box getXObjectNames() does not recognize bar code on my PDF, however it does recognize it on a PDF file I got off the internet
Obviously getXObjectNames() does return the name of the bitmap image xobject resource and PDFBox exports it just fine.
Please check with your code whether as claimed the image is not extracted or whether some later step simply cannot deal with it.
If in your case indeed the image is not extracted,
update your PDFBox version (I used the current development head but the newest released version should return the same),
update your Java,
check whether you have extra JAI jars that might cause trouble.
If in your case the image is extracted but not analyzed as expected by later code,
debug more thoroughly to find out where the analysis fails,
create a new question here focusing on the QR code image analysis,
and provide enough code and the tiff file to allow people to actually reproduce the issue.
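When checking whether the image is really not extracted, it may help to rule out a silent export failure first. ImageIO.write returns false instead of throwing when no writer is registered for the requested format, and a TIFF writer only ships with the JDK from Java 9 onwards (older setups depend on an extra plugin such as JAI Image I/O). The sketch below uses only the JDK; ImageExportCheck and its methods are hypothetical names introduced for illustration.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper for diagnosing ImageIO export problems.
class ImageExportCheck {
    // True if some registered ImageIO writer handles files with this suffix.
    static boolean hasWriterFor(String suffix) {
        return ImageIO.getImageWritersBySuffix(suffix).hasNext();
    }

    // Writes the image and turns the easily overlooked "false" result into an exception.
    static void writeOrFail(BufferedImage image, String suffix, File target) throws IOException {
        if (!ImageIO.write(image, suffix, target)) {
            throw new IOException("No ImageIO writer for format: " + suffix);
        }
    }
}
```

Calling hasWriterFor("tiff") before the extraction loop shows whether the running Java installation can write the suffix that getSuffix() reports for the image.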

PDFBox PDFImageWrite.writeImage is not handling all characters properly

I am using PDFBox 1.8.10 to load PDFs and to overlay images on each page.
PDDocument doc = PDDocument.load(url);
PDFImageWriter imageWriter = new PDFImageWriter();
imageWriter.writeImage(doc, imageFormat, password, 1,
doc.getNumberOfPages(), filePrefix, imageType, resolution);
I have tried saving the doc as a PDF and this looks fine. When the images are saved they can contain incorrect text. This is especially true for eastern European documents, e.g. from Hungary, Poland, the Czech Republic, etc.
The PDF shows
H-4432 NYÍREGYHÁZA-NYÍRSZŐLŐS
The image shows
Is there a solution for this? Do I need to define a codepage? Could it be a problem with the available fonts?

The solution for me was to switch over to a 2.0 SNAPSHOT (Aug15). All the documents I've tested look fine. The API has changed but, in my case, it took 5 minutes to make the changes.
Thanks to @mkl for the info.

PDFBox: Convert PDF to Image is slower than expected

I am currently using this code to convert a pdf to an image:
@SuppressWarnings("unchecked")
public static Image convertPDFtoImage(ByteArrayInputStream bais) {
    Image convertedImage = null;
    try {
        PDDocument document = PDDocument.load(bais);
        List<PDPage> list = document.getDocumentCatalog().getAllPages();
        PDPage page = list.get(0);
        BufferedImage image = page.convertToImage(BufferedImage.TYPE_INT_RGB, 64);
        convertedImage = SwingFXUtils.toFXImage(image, null);
        document.close();
    }
    catch (Exception e) {
        e.printStackTrace();
    }
    return convertedImage;
}
Then, I show the convertedImage in a JavaFX ImageView.
Further, I need to import these two classes, although I am not using them:
import org.apache.commons.logging.LogFactory;
import org.apache.fontbox.afm.AFMParser;
Two questions:
Does it normally take two to three seconds to convert a simple one-page PDF to an image when the DPI is set to 64 (which is not that high in my opinion)? It seems to be a bit slow.
Why do I need those two imports while I am not using them? If I don't import them, I get a lot of errors and the conversion does not work.
I would like to show a PDF quickly in JavaFX, and two to three seconds is just too long. Any other ways of showing a PDF in JavaFX (other than convert it to an image) are very welcome.
Any help is greatly appreciated!

I had a similar problem while converting PDFs into images, and I solved it by upgrading PDFBox from 1.8 to 2.0.
This improved performance by 50%. Previously my app took around 10 seconds to convert a PDF into images; now it takes 5 seconds.
Please use the following link as a reference while upgrading PDFBox:
https://pdfbox.apache.org/2.0/migration.html
Additional imports are not required for PDFBox.
Regards,
Yogesh

PDFbox runs into error (how to calculate the position for non-simple fonts)

I am using PDFBox to fill in a form in my PDF file. The application is able to show the number of available fields on the form, but it returns the following error:
Messages:
Error: Don't know how to calculate the position for non-simple fonts
File: org/apache/pdfbox/pdmodel/interactive/form/PDAppearance.java
Line number: 616
Code
.....
while (fieldsIter.hasNext()) {
    PDField field = (PDField) fieldsIter.next();
    setField(pdf, field.getPartialName(), "My input");
    //setField(pdf, field.getFullyQualifiedName(), "My input");
}
.....
public void setField(PDDocument pdfDocument, String name, String value) throws
        IOException {
    PDDocumentCatalog docCatalog = pdfDocument.getDocumentCatalog();
    PDAcroForm acroForm = docCatalog.getAcroForm();
    PDField field = acroForm.getField(name);
    if (field != null) {
        field.setValue(value);
    } else {
        System.err.println("No field found with name:" + name);
    }
}
Please let me know if you need any other part of the code.

There seems to be a bug in pdfbox that occurs in various circumstances where it can't find the relevant font. The only workaround I've been able to find is just skipping the update appearance code that's run in PDTextbox.setValue. You can "force" the value to be updated by just doing:
COSString fieldValue = new COSString("Awesome field value");
textbox.getDictionary().setItem(COSName.V, fieldValue);
Presumably, in all but corner cases, the PDF viewer can handle rendering the font, and just setting the V item for the field should suffice. Subjectively the documents I've generated open fine in Acrobat and OS X preview.
Related issue:
https://issues.apache.org/jira/browse/PDFBOX-1550
Edit to add: by default acrobat seems to create fields with a visible text area size of 0. For documents with this problem, you can get around this by adding textbox.getDictionary().setItem(COSName.AP, null); and hoping that the reader can handle rendering the appearance correctly.

Try another PDF. If that one works, PDFBox is not compatible with the first PDF.

I had the same issue with a PDF and I ended up solving it by editing the PDF form field (with Adobe Acrobat Pro) and setting a specific font.
The problem was that the problematic field didn't specify any font at all.
Hope it helps!

I also had this issue, the problem was that the PDF did not embed the font used.
Moreover the Preflight tools from Acrobat pro seemed unable to fix the PDF. I ended up recreating the PDF and now it is working fine.

This happened to me when using Adobe Pro to merge two PDFs into one. Afterwards the resulting PDF could not be used with PDFBox, although the original PDFs worked. After a little research I found out that the merge process destroys the font information. Just reset the fonts and it should work!
Best regards!

Unable to print PNG files using Java Print Services (Everything else works fine)

I am using the Java print services to print a PNG file, however it is sending erroneous output to the printer. What actually gets printed (when I use a PNG) is some text saying:
ERROR: /syntaxerror in --%ztokenexec_continue--
Operand stack:
--nostringval-
There seems to be some more text, but that is lost beyond the page margins. I am setting the DocFlavor to DocFlavor.INPUT_STREAM.PNG and the specified file is actually an InputStream (just changing the DocFlavor to DocFlavor.INPUT_STREAM.PDF and using a PDF file works).
I have also tried it with different PNG files, but the problem persists. For what it's worth, even PostScript seems to be working.
The errors that are being printed look quite similar to the gd (or ImageMagick?) errors. So, my best assumption right now is that the conversion from PNG -> PS is failing.
The code is as follows:
PrintService printService = this.getPrintService("My printer name");
final Doc doc = new SimpleDoc(document, DocFlavor.INPUT_STREAM.PNG, null);
final DocPrintJob printJob = printService.createPrintJob();
Here, getPrintService fetches a print service and is fetching a valid one. As for the document, here is how I get it:
File pngFile = new File("/home/rprabhu/temp/myprintfile.png");
FileInputStream document = new FileInputStream(pngFile);
I have no clue why it is going wrong, and I don't see any errors being output to the console either.
Any help is greatly appreciated. Thanks.

Printing is always a messy business – inevitably so, because you have to worry about tedious details such as the size of a page, the margin sizes, and how many pages you're going to need for your output. As you might expect, the process for printing an image is different from printing text and you may also have the added complication of several printers with different capabilities being available, so with certain types of documents you need to select an appropriate printer.
Please see below links :
http://vineetreynolds.wordpress.com/2005/12/12/silent-print-a-pdf-print-pdf-programmatically/
http://hillert.blogspot.com/2011/12/java-print-service-frustrations.html
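Since selecting an appropriate printer depends on what each print service accepts, a first diagnostic step for the PNG problem could be to ask the services whether they claim to support DocFlavor.INPUT_STREAM.PNG at all. The following is a minimal sketch using only the javax.print API; FlavorCheck and servicesSupporting are hypothetical names introduced here. A service reporting support does not guarantee that the conversion behind it works.

```java
import java.util.ArrayList;
import java.util.List;
import javax.print.DocFlavor;
import javax.print.PrintService;
import javax.print.PrintServiceLookup;

// Hypothetical diagnostic helper built on the standard javax.print API.
class FlavorCheck {
    // Names of all known print services that claim to accept the given flavor.
    static List<String> servicesSupporting(DocFlavor flavor) {
        List<String> names = new ArrayList<>();
        // Passing null for both arguments lists every print service the JVM can find.
        for (PrintService service : PrintServiceLookup.lookupPrintServices(null, null)) {
            if (service.isDocFlavorSupported(flavor)) {
                names.add(service.getName());
            }
        }
        return names;
    }
}
```

If the printer from the question does not appear in servicesSupporting(DocFlavor.INPUT_STREAM.PNG), one alternative would be to decode the PNG with ImageIO and print it through a Printable instead of sending the raw stream.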
