PDFBox: Convert PDF to Image is slower than expected

PDFBox: Convert PDF to Image is slower than expected - java

I am currently using this code to convert a pdf to an image:
#SuppressWarnings("unchecked")
public static Image convertPDFtoImage(ByteArrayInputStream bais) {
Image convertedImage = null;
try {
PDDocument document = PDDocument.load(bais);
List<PDPage> list = document.getDocumentCatalog().getAllPages();
PDPage page = list.get(0);
BufferedImage image = page.convertToImage(BufferedImage.TYPE_INT_RGB, 64);
convertedImage = SwingFXUtils.toFXImage(image, null);
document.close();
}
catch (Exception e) {
e.printStackTrace();
}
return convertedImage;
}
Then, I show the convertedImage in an JavaFX ImageView.
Further, I need to import these two packages, while I am not using them:
import org.apache.commons.logging.LogFactory;
import org.apache.fontbox.afm.AFMParser;
Two questions:
Does it normally take two to three seconds to convert a simple one page PDF to Image where the DPI is set on 64 (which is not that high in my opinion)? It seems to be a bit slow.
Why do I need those two imports while I am not using them? If I don't import them, I get a lot of errors and the conversion does not work.
I would like to show a PDF quickly in JavaFX, and two to three seconds is just too long. Any other ways of showing a PDF in JavaFX (other than convert it to an image) are very welcome.
Any help is greatly appreciated!

I also had similar problem while converting pdf into images, I solved this by upgrading PDFBox from 1.8 to 2.0.
This improved my performance by 50%. Previously my app is taking around 10 seconds to convert pdf into images and now it is taking 5 seconds.
Please use following link as reference while upgrading PDFBox -
https://pdfbox.apache.org/2.0/migration.html
Additional imports are not required for PDFBox.
Regards,
Yogesh

Related

PDF Box getXObjectNames() does not recognize bar code on my PDF, however it does recognize it on a PDF file I got off the internet

I am trying to fetch and read bar codes from my PDF using getXObjectNames() of PdResources.
My code is very similar to this link: https://issues.apache.org/jira/browse/PDFBOX-2124
If you see the above JIRA item, you will see a PDF file attached to it.
When I run the code on that PDF file I get the desired output (i.e. the bar code type is printed.)
However when I run it on my PDF, it does not recognize the bar code in it. (I have checked that the bar code is in fact an image and not text.)
Also it may sound weird, but it did work on my PDF once and I haven't made any changes since then, but it definitely does not work now. (I cannot share the PDF for some reason.)
Has anyone faced a similar issue?
Also this is my first question on Stack Overflow. Please tell me if I am wrong anywhere.
Here is a link to that pdf:
https://drive.google.com/file/d/1PzVApIePg4U9XL399BpAd2oeY6Q2tLEB/view?usp=drivesdk

In General
As you don't show your code but only describe it as very similar to that in PDFBOX-2124, and as you say you cannot share the PDF for some reason, I only have that code to analyze. Thus, I cannot tell what really is the issue but merely enumerate some possible problems
First of all, that code only inspects the immediate resources of the given page for bitmap images:
PDResources pdResources = pdPage.getResources();
Map<String, PDXObject> xobjects = (Map<String, PDXObject>) pdResources.getXObjects();
if (xobjects != null)
{
for (String key : xobjects.keySet())
{
PDXObject xobject = xobjects.get(key);
if (xobject instanceof PDImageXObject)
{
PDImageXObject imageObject = (PDImageXObject) xobject;
String suffix = imageObject.getSuffix();
if (suffix != null)
{
BufferedImage image = imageObject.getImage();
extractBarcodeArrayByAreas(image, this.maximumBlankPixelDelimiterCount);
}
}
}
}
(PDFBOX-2124 PdPageBarcodeScanner method scsan)
Bitmap images can also be stored elsewhere, e.g.
in the separate resources of form xobjects, patterns, or Type 3 fonts used on the page; to find them one has to inspect other page resources, too, even recursively as the image might be a resource of a pattern used in a form xobject used on the page;
in the separate resources of annotations of the page; thus, one has to recurse into annotation resources, too;
inlined in some content stream; thus, one also has to search the content streams of the page itself, of page resources (recursively), and page annotations and their resources (recursively).
Furthermore, the bitmap might be given in some format (in particular with some colorspace) which PDFBox does not know how to export as BufferedImage.
Also the bar code may be constructed using some mask applied to a purely black bitmap in which case your code probably only tries to scan that purely black image.
Furthermore, you say
I have checked that the bar code is in fact an image and not text.
If you only checked that the bar code is not text, it may not only be a bitmap image but it can also be drawn by vector graphics instructions. Thus, you also have to check all content streams for vector graphics instructions drawing a bar code.
Also there may be combinations, e.g. a soft mask of vector graphics may be active when drawing a purely black inlined bitmap image etc.
And I'm sure I've missed a number of options here.
As next step you may want to analyze the PDF you cannot share to find out how exactly that barcode is drawn.
Alternatively, you render the page as bitmap image and search that large bitmap for bar codes using zxing.
Sample PDF.pdf
You provided a link to a sample PDF. So I tried to extract the bar code using code very similar to that from PDFBOX-2124. Apparently the code there was for some PDFBox 2.0.0-SNAPSHOT, so it had to be corrected a bit. In particular the method getXObjectNames() you mention in the question title finally is used:
PDResources pdResources = pdPage.getResources();
int index = 0;
for (COSName name : pdResources.getXObjectNames()) {
PDXObject xobject = pdResources.getXObject(name);
if (xobject instanceof PDImageXObject)
{
PDImageXObject imageObject = (PDImageXObject) xobject;
String suffix = imageObject.getSuffix();
if (suffix != null)
{
BufferedImage image = imageObject.getImage();
File file = new File(RESULT_FOLDER, String.format("Sample PDF-1-%s.%s", index, imageObject.getSuffix()));
ImageIO.write(image, imageObject.getSuffix(), file);
index++;
System.out.println(file);
}
}
}
(ExtractImages test testExtractSamplePDFJayshreeAtak)
The output: One bitmap image is exported as "Sample PDF-1-0.tiff" which looks like this:
Thus, I cannot reproduce your issue
PDF Box getXObjectNames() does not recognize bar code on my PDF, however it does recognize it on a PDF file I got off the internet
Obviously getXObjectNames() does return the name of the bitmap image xobject resource and PDFBox exports it just fine.
Please check with your code whether as claimed the image is not extracted or whether some later step simply cannot deal with it.
If in your case indeed the image is not extracted,
update your PDFBox version (I used the current development head but the newest released version should return the same),
update your Java,
check whether you have extra JAI jars that might cause trouble.
If in your case the image is extracted but not analyzed as expected by later code,
debug more thoroughly to find out where the analysis fails,
create a new question here focusing on the QR code image analysis,
and provide enough code and the tiff file to allow people to actually reproduce the issue.

Reading pdf with Apache PDF Box

I have these lines of codes that I have being trying to use to read pdf file with Apache pdfBox.
private void readPdf(){
try {
File PDF_Path = new File("/home/olyjosh/Downloads/my project.pdf");
PDDocument inputPDF = PDDocument.load(PDF_Path);
List<PDPage> allPages = inputPDF.getDocumentCatalog().getAllPages();
PDPage testPage = (PDPage) allPages.get(5);
System.out.println("Number of pages "+allPages.size());
PDFPagePanel pdfPanel = new PDFPagePanel();
jPanel1.add(pdfPanel);
pdfPanel.setPage(testPage);
// this.revalidate();
inputPDF.close();
} catch (IOException ex) {
Logger.getLogger(NewJFrame.class.getName()).log(Level.SEVERE, null, ex);
}
}
I want this pdf to be displayed on swing component like jPanel but this will only display the panel with the expected content of the pdf file. However, I was able to display the pdf as image using
convertToImage = testPage.convertToImage();
Please, how do I work around this or what am I doing wrong.

Apache PDF-Box has a mailing list where I was able to ask the same question and this was the response I got
This was removed in 2.0 because it made trouble. Obviously, it doesn't work for 1.8 either, at least for you, so why bother?
There are two ways to display, either get a BufferedImage (renderImage / renderImageWithDPI) and display that somehow (see in PDFDebugger how to do it), or renderPageToGraphics which renders to a graphics device object.
If you really want to get the source code of the deleted PDFReader application (which includes PDFPagePanel), use svn to get revision 1702125 or earlier, that should have it. But if it didn't work for you in 1.8, it won't work for you now.
The point is that swing display of PDF pages isn't part of the API, it's part of some tool (now: in PDFDebugger, previously: in PDFReader)
You need to have some understanding of awt / swing. If you don't, learn it, or hire somebody. (That's what we did, and the best is: google paid it, as part of the google summer of code)
Tilman

PDFBox PDFImageWrite.writeImage is not handling all characters properly

I am using PDFBox 1.8.10 to load PDFs and to overlay images on each page.
PDDocument doc = PDDocument.load(url);
PDFImageWriter imageWriter = new PDFImageWriter();
imageWriter.writeImage(doc, imageFormat, password, 1,
doc.getNumberOfPages(), filePrefix, imageType, resolution);
I have tried saving the doc as a PDF and this looks fine. When the images are saved they can contain incorrect text. This is especially true for eastern European documents - eg Hungary, Poland, Czech etc
The PDF shows
H-4432 NYÍREGYHÁZA-NYÍRSZŐLŐS
The image shows
Is there a solution for this? Do I need to define a codepage? Could it be a problem with the available fonts?

The solution for me was to switch over to a 2.0 SNAPSHOT (Aug15). All the documents I've tested look fine. The API has changed but, in my case, it took 5 minutes to make the changes.
Thanks to #mkl for the info.

In Java converting an image to sRGB makes the image too bright

I have multiple images with a custom profile embedded in them and want to convert the image to sRGB in order to serve it up to a browser. I have seen code like the following:
BufferedImage image = ImageIO.read(fileIn);
ColorSpace ics = ColorSpace.getInstance(ColorSpace.CS_sRGB);
ColorConvertOp cco = new ColorConvertOp(ics, null);
BufferedImage result = cco.filter(image, null);
ImageIO.write(result, "PNG", fileOut);
where fileIn and fileOut are File objects representing the input file and output file respectively. This works to an extent. The problem is that the resulting image is lighter than the original. If I was to convert the color space in photoshop the colors would appear the same. In fact if I pull up both images with photoshop and take a screen shot and sample the colors, they are the same. What is photoshop doing that the code above isn't and what can I do to correct the problem?
There are various types of images being converted, including JPEG, PNG, and TIFF. I have tried using TwelveMonkeys to read in JPEG and TIFF images and I still get the same effect, where the image is too light. The conversion process seems worst when applied to an image that didn't have an embedded profile in the first place.
Edit:
I've added some sample images to help explain the problem.
This image is the one with the color profile embedded in it. Viewed on some browsers there won't be a noticeable difference between this one and the next but viewed in Chrome on Mac OSX and Windows it currently appears darker than it should. This is where my problem originates in the first place. I need to convert the image to something that will show up correctly in Chrome.
This is an image converted with ImageMagick to the Adobe RGB 1998 color profile, which Chrome appears to be able to display correctly.
This is the image that I converted using the code above and it appears lighter than it should.
(Note that the images above are on imgur so to make them larger, simply remove the "t" from the end of the filename, before the file extension.)

This was my initial solution which worked but I didn't like having to use ImageMagick. I have created another answer based off of the solution I ended up sticking with.
I gave in and ended up using im4java, which is a wrapper around the command line tool of image magick. When I use the following code to get a BufferedImage, it works really well.
IMOperation op = new IMOperation();
op.addImage(fileIn.getAbsolutePath());
op.profile(colorFileIn.getAbsolutePath());
op.addImage("png:-");
ConvertCmd cmd = new ConvertCmd();
Stream2BufferedImage s2b = new Stream2BufferedImage();
cmd.setOutputConsumer(s2b);
cmd.run(op);
BufferedImage image = s2b.getImage();
I can also use the library to apply a CMYK profile for print when needed. It would be nice if ColorConvertOp did the conversion correctly but for now, at least, this is my solution. In order to reach parity with my question the im4java code to achieve the same effect in the question is:
ConvertCmd cmd = new ConvertCmd();
IMOperation op = new IMOperation();
op.addImage(fileIn.getAbsolutePath());
op.profile(colorFileIn.getAbsolutePath());
op.addImage(fileOut.getAbsolutePath());
cmd.run(op);
where colorFileIn.getAboslutePath() is the location of the sRGB color profile on the machine. Since im4java is using the command line it's not as straight forward how to perform operations but the library is explained in detail here. I originally had issues with image magick not working on my Mac as explained in the question. I installed it using brew but it turns out on a Mac you have to install it like brew install imagemagick --with-little-cms. After that image magick worked fine for me.

I found a solution that doesn't require ImageMagick. Basically Java doesn't respect the profile when loading the image so if there is one it needs to get loaded. Here is a code snippet of what I did to accomplish this:
private BufferedImage loadBufferedImage(InputStream inputStream) throws IOException, BadElementException {
byte[] imageBytes = IOUtils.toByteArray(inputStream);
BufferedImage incorrectImage = ImageIO.read(new ByteArrayInputStream(imageBytes));
if (incorrectImage.getColorModel() instanceof ComponentColorModel) {
// Java does not respect the color profile embedded in a component based image, so if there is a color
// profile, detected using iText, then create a buffered image with the correct profile.
Image iTextImage = Image.getInstance(imageBytes);
com.itextpdf.text.pdf.ICC_Profile iTextProfile = iTextImage.getICCProfile();
if (iTextProfile == null) {
// If no profile is present than the image should be processed as is.
return incorrectImage;
} else {
// If there is a profile present then create a buffered image with the profile embedded.
byte[] profileData = iTextProfile.getData();
ICC_Profile profile = ICC_Profile.getInstance(profileData);
ICC_ColorSpace ics = new ICC_ColorSpace(profile);
boolean hasAlpha = incorrectImage.getColorModel().hasAlpha();
boolean isAlphaPremultiplied = incorrectImage.isAlphaPremultiplied();
int transparency = incorrectImage.getTransparency();
int transferType = DataBuffer.TYPE_BYTE;
ComponentColorModel ccm = new ComponentColorModel(ics, hasAlpha, isAlphaPremultiplied, transparency, transferType);
return new BufferedImage(ccm, incorrectImage.copyData(null), isAlphaPremultiplied, null);
}
}
else if (incorrectImage.getColorModel() instanceof IndexColorModel) {
return incorrectImage;
}
else {
throw new UnsupportedEncodingException("Unsupported color model type.");
}
}
This answer does use iText, which is generally used for PDF creation and manipulation, but it happens to process the ICC profiles correctly and I'm already depending on it for my project so it happens to be a much better choice than ImageMagick.
The code in the question then ends up as follows:
BufferedImage image = loadBufferedImage(new FileInputStream(fileIn));
ColorSpace ics = ColorSpace.getInstance(ColorSpace.CS_sRGB);
ColorConvertOp cco = new ColorConvertOp(ics, null);
BufferedImage result = cco.filter(image, null);
ImageIO.write(result, "PNG", fileOut);
which works great.

ZXing library can't decode Datamatrix barcode

I'm trying to use ZXing library to decode Datamatrix barcode. Here are my code sample:
BufferedImage bi = img.getBufferedImage();
Hashtable<DecodeHintType, Object> hints = new Hashtable<DecodeHintType, Object>();
hints.put(DecodeHintType.TRY_HARDER, Boolean.TRUE);
LuminanceSource source = new BufferedImageLuminanceSource(bi);
BinaryBitmap bitmap = new BinaryBitmap(new HybridBinarizer(source));
DataMatrixReader dataMatrixReader = new DataMatrixReader();
try {
Result res = dataMatrixReader.decode(bitmap,hints);
System.out.println("resultText = "+res.getText());
} catch (Exception e) {
System.out.println("failed to get resultText");
e.printStackTrace();
}
I've seen almost the same samples many times accross https://stackoverflow.com/ and other sites, but this approach does not working for me in this form.
As a source I'm using images grabbed from IR-camera. Here are example image:
As you see, the barcode is almost exactly at the center of an image, as Sean Owen recommended here and here. If I programmatically convert this image to black&white and crop image to bound barcode with some white space around it only, then ZXing works perfectly with images like this. But the problem is that barcode in real could have little deformations, so my simple algorythm can't help me to crop image properly. More over barcode could be placed not exactly in the center of an image and cold have a little bit different brightness. I saw threads mentioning OpenCV capabilities to find out placement of speciects objects on the image, like this one, but they are quite old. Is something changed since then? And what should i yet certainly consider to write 100% reliable datamatrix decoder (and detector) in my specific situation?
I decided to supply LuminanceSource and BinaryBitmap images made of .toString() text output of correcponding objects for reference:
http://s28.postimg.org/l53sykhx9/Binary_Bitmap.png
and /65z0vlbpl/Luminance_Source.png (at the same domain). They are looking good and ready for decoding, but what is wrong with decoding then.
After all this image and similar ones recognized and decoded very well with smartphone software and i'm just wanted achieve same results.

you need to enable it from settings programmatically or manually.
in class DecodeThread.java you can see the line that enables data matrix encoding
decodeFormats.addAll(DecodeFormatManager.DATA_MATRIX_FORMATS);

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.