PDFRenderer - Export to image, exported inaccurately

PDFRenderer - Export to image, exported inaccurately - java

I have a program written to export PDF file to a series of images, it is shown as follow:
//Load pdf from path(file)
File file = new File("C:\\TEMP\\office\\a.pdf");
RandomAccessFile raf = new RandomAccessFile(file, "r");
byte[] b = new byte[(int) raf.length()];
raf.readFully(b);
ByteBuffer buf = ByteBuffer.wrap(b);
PDFFile pdffile = new PDFFile(buf);
//Get number of pages
int numOfPages = pdffile.getNumPages();
System.out.println(numOfPages);
//iterate through the number of pages
for (int i = 1; i <= numOfPages; i++) {
PDFPage page = pdffile.getPage(i);
//Create new image
Rectangle rect = new Rectangle(0, 0, (int) page.getBBox().getWidth(), (int) page.getBBox().getHeight());
Image img = page.getImage(rect.width, rect.height, rect, null, true, true);
BufferedImage bufferedImage = new BufferedImage(rect.width, rect.height, BufferedImage.TYPE_INT_RGB);
Graphics g = bufferedImage.createGraphics();
g.drawImage(img, 0, 0, null);
g.dispose();
File asd = new File("C:\\TEMP\\office\\img\\Testingx" + i + ".jpg");
if (asd.exists()) {
asd.delete();
}
//Export the image to jpg format using the path C:\TEMP\office\img\filename
ImageIO.write(bufferedImage, "jpg", asd);
}
//Close the buf and other stuffs, which does not affect the image exported
This program works fine in lots of PDF files, however, while I was testing my program using various pdf found on the internet, there is a pdf that cannot be exported to image accurately like the others, the resources I used are listed below.
Original PDF Link:
2007_OReilly_EssentialActionScript3.0.pdf
I will use the page 7 of the PDF given above.
The expected image to be export : Click here for expected result image
After the program finished the operation, the resulting image is quite different.
Click here for Resulting image
As you can see, the resulting image shifts upward and some of the content disappeared, the result image lost the formatting in the pdf, it is not centered, it indents itself to the right.
PDFrenderer itself does not have problem, if we run the .jar file of PDFrenderer , the top side and the formatting is consistent with the original PDF file.
PDF opened with PDFRenderer in page 7
Known possibly issue: ImageIO does not support CMYK format, thus, page 1 and other pages involves the use of CMYK format will be unable to be exported correctly. Not sure if I am right.
Another issue: PDFRenderer seems to be failed at reading page 1 which is possibly due to something used in the PDF formatting, I don't know much about it
Used library : PDFRenderer
You may download the PDF from the link aforementioned and use the program I provided to reproduce the problem.
My question: How can I fix this problem? Is there somethings wrong with my program?

I found the problem myself and I am be able to have it fixed.
The explanation as follow
My JAVA program does not follow the "X" coordinates and "Y" coordinates in pdf file, to be simple, my program hardcoded the X,Y coordinates. In most case, most pdf will be work like the following image
Most PDF http://img266.imageshack.us/img266/7618/4cl5.png
HOWEVER, the pdf I provided is not that case, the X coordinate of upper left corner is not 0 , so as the Y. that's why the image has been cut off.
To be short, my program will capture the PDF screen with a shape of rectangle, however since the PDF i provided above does not find the coordinate of upper left corner, so it will capture the screen like the image below. The Y Coordinates is not written in the picture, my mistake.
Exception PDF http://img12.imageshack.us/img12/9672/plhb.png
With the following modification to the program, it will work like the most case and it is even more better.
Rectangle rect = new Rectangle((int)page.getPageBox().getX(), (int)page.getPageBox().getY(), (int) page.getBBox().getWidth(), (int) page.getBBox().getHeight());
This allows the program "capture" the entire pdf provided by PDFRenderer starting from upper left corner which is just like the first image I have provided, It will works the same even in different page size from A4 to A7, I didn't test further, but it works

Related

Converting PDF to GRAYSCALE using PDFBox without images?

im using Apache PDFBox,
I want to convert a RGB PDF file to another GRAYSCALE file WITHOUT using images method because its making huge file size -_- !!
so this my steps:
Export a (A4) First.pdf from Adobe InDesign, contain images, texts, vector-objects.
I read the First.pdf file. Done!
using LayerUtility, copy pages from First.pdf rotate them and put them to NEW PDF file (A4) Second.pdf. Done!
this method preferred because i need vector-objects to reduce the size.
then, i want to save this as GRAY-SCALE PDF file (Second-grayscale.pdf)
and this my code (not all):
PDDocument documentFirst = PDDocument.load("First.pdf"));
// Second.pdf its empty always
PDDocument documentSecond = PDDocument.load("Second.pdf"));
for (int page = 0; page < documentSecond.getNumberOfPages(); page++) {
// get current page from documentSecond
PDPage tempPage = documentSecond.getPage(page);
// create content contentStream
PDPageContentStream contentStream = new PDPageContentStream(documentSecond, tempPage);
// create layerUtility
LayerUtility layerUtility = new LayerUtility(documentSecond);
// importPageAsForm from documentFirst
PDFormXObject form = layerUtility.importPageAsForm(documentFirst, page);
// saveGraphicsState
contentStream.saveGraphicsState();
// rotate the page
Matrix matrix;
matrix.rotate(Math.toRadians(90));
contentStream.transform(matrix);
// draw the rotated page from documentFirst to documentSecond
contentStream.drawForm(form);
contentStream.close();
}
// save the new document
documentSecond.save("Second.pdf");
documentSecond.close();
documentFirst.close();
// now convert it to GRAYSCALE or do it in the Loop above!
well, i just start using Apache Box this week, i have followed some
example, but most are old and not working, until now i did what i
need, just need the Grayscale :)!!
if there are other solutions in java using open-source library
or a free tools. (i found with Ghost Script and Python)
i read this example but i didn't understand it and there are a functions deprecated!:
https://github.com/lencinhaus/pervads/blob/master/libs/pdfbox/src/java/org/apache/pdfbox/ConvertColorspace.java
its about PDF Specs, and changing Color Space...

You mentioned you would be interested in a Ghostscript based solution as far as I understood.
If you are able to call GS from your command line you can do color to grayscale conversion with this command line
gs -sDEVICE=pdfwrite -sProcessColorModel=DeviceGray -sColorConversionStrategy=Gray -dOverrideICC -o out.pdf -f input.pdf
my answer is taken from How to convert a PDF to grayscale from command line avoiding to be rasterized?

Small rendered images from PDF slides

I'm using Ghost4J library (http://ghost4j.sourceforge.net) to split a PDF file of slides into several images. The problem I have is that I get images where the slides are in the corner and very small. I want my images to get the format of the page from the PDF, but I don't know how to do it. Here is the code I'm using.
PDFDocument examplePDF = new PDFDocument();
String filePath="input.pdf";
File file=new File(filePath);
examplePDF.load(file);
List<org.ghost4j.document.Document> docs=examplePDF.explode();
SimpleRenderer renderer = new SimpleRenderer();
renderer.setResolution(300);
int counter=0;
for ( org.ghost4j.document.Document d : docs){
List<Image> img=renderer.render(d);
ImageIO.write((RenderedImage) img.get(0), "png", new File(
(counter+ 1) + ".png"));
counter++;
}
I think the problem is in the explode method that doesn't take into account that my original pdf didn't have standard pdf page size.
PD. I first tried the code from the second answer of this question but that gave me a heap space error when the document have a lot of pages.

Would you consider using ImageMagick instead?
convert -density 300 input.pdf output.png
would give you output-1.png, output-2.png, etc.

How to extend the page size of a PDF to add a watermark?

My web application signs PDF documents. I would like to let users download the original PDF document (not signed) but adding an image and the signers in the left margin of the pdf document.
I've seen this idea in another web application, and I would like to do the same. Of course I would like to do it using itext library.
I have attached two images, the original PDF document (not signed) and the modified PDF document.

First this: it is important to change the document before you digitally sign it. Once digitally signed, these changes will break the signature.
I will break up the question in two parts and I'll skip the part about the actual watermarking as this is already explained here: How to watermark PDFs using text or images?
This question is not a duplicate of that question, because of the extra requirement to add an extra margin to the right.
Take a look at the primes.pdf document. This is the source file we are going to use in the AddExtraMargin example with the following result: primes_extra_margin.pdf. As you can see, a half an inch margin was added to the left of each page.
This is how it's done:
public void manipulatePdf(String src, String dest) throws IOException, DocumentException {
PdfReader reader = new PdfReader(src);
int n = reader.getNumberOfPages();
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
// properties
PdfContentByte over;
PdfDictionary pageDict;
PdfArray mediabox;
float llx, lly, ury;
// loop over every page
for (int i = 1; i <= n; i++) {
pageDict = reader.getPageN(i);
mediabox = pageDict.getAsArray(PdfName.MEDIABOX);
llx = mediabox.getAsNumber(0).floatValue();
lly = mediabox.getAsNumber(1).floatValue();
ury = mediabox.getAsNumber(3).floatValue();
mediabox.set(0, new PdfNumber(llx - 36));
over = stamper.getOverContent(i);
over.saveState();
over.setColorFill(new GrayColor(0.5f));
over.rectangle(llx - 36, lly, 36, ury - llx);
over.fill();
over.restoreState();
}
stamper.close();
reader.close();
}
The PdfDictionary we get with the getPageN() method is called the page dictionary. It has plenty of information about a specific page in the PDF. We are only looking at one entry: the /MediaBox. This is only a proof of concept. If you want to write a more robust application, you should also look at the /CropBox and the /Rotate entry. Incidentally, I know that these entries don't exist in primes.pdf, so I am omitting them here.
The media box of a page is an array with four values that represent a rectangle defined by the coordinates of its lower-left and upper-right corner (usually, I refer to them as llx, lly, urx and ury).
In my code sample, I change the value of llx by subtracting 36 user units. If you compare the page size of both PDFs, you'll see that we've added half an inch.
We also use these coordinates to draw a rectangle that covers the extra half inch. Now switch to the other watermark examples to find out how to add text or other content to each page.
Update:
if you need to scale down the existing pages, please read Fix the orientation of a PDF in order to scale it

In Java converting an image to sRGB makes the image too bright

I have multiple images with a custom profile embedded in them and want to convert the image to sRGB in order to serve it up to a browser. I have seen code like the following:
BufferedImage image = ImageIO.read(fileIn);
ColorSpace ics = ColorSpace.getInstance(ColorSpace.CS_sRGB);
ColorConvertOp cco = new ColorConvertOp(ics, null);
BufferedImage result = cco.filter(image, null);
ImageIO.write(result, "PNG", fileOut);
where fileIn and fileOut are File objects representing the input file and output file respectively. This works to an extent. The problem is that the resulting image is lighter than the original. If I was to convert the color space in photoshop the colors would appear the same. In fact if I pull up both images with photoshop and take a screen shot and sample the colors, they are the same. What is photoshop doing that the code above isn't and what can I do to correct the problem?
There are various types of images being converted, including JPEG, PNG, and TIFF. I have tried using TwelveMonkeys to read in JPEG and TIFF images and I still get the same effect, where the image is too light. The conversion process seems worst when applied to an image that didn't have an embedded profile in the first place.
Edit:
I've added some sample images to help explain the problem.
This image is the one with the color profile embedded in it. Viewed on some browsers there won't be a noticeable difference between this one and the next but viewed in Chrome on Mac OSX and Windows it currently appears darker than it should. This is where my problem originates in the first place. I need to convert the image to something that will show up correctly in Chrome.
This is an image converted with ImageMagick to the Adobe RGB 1998 color profile, which Chrome appears to be able to display correctly.
This is the image that I converted using the code above and it appears lighter than it should.
(Note that the images above are on imgur so to make them larger, simply remove the "t" from the end of the filename, before the file extension.)

This was my initial solution which worked but I didn't like having to use ImageMagick. I have created another answer based off of the solution I ended up sticking with.
I gave in and ended up using im4java, which is a wrapper around the command line tool of image magick. When I use the following code to get a BufferedImage, it works really well.
IMOperation op = new IMOperation();
op.addImage(fileIn.getAbsolutePath());
op.profile(colorFileIn.getAbsolutePath());
op.addImage("png:-");
ConvertCmd cmd = new ConvertCmd();
Stream2BufferedImage s2b = new Stream2BufferedImage();
cmd.setOutputConsumer(s2b);
cmd.run(op);
BufferedImage image = s2b.getImage();
I can also use the library to apply a CMYK profile for print when needed. It would be nice if ColorConvertOp did the conversion correctly but for now, at least, this is my solution. In order to reach parity with my question the im4java code to achieve the same effect in the question is:
ConvertCmd cmd = new ConvertCmd();
IMOperation op = new IMOperation();
op.addImage(fileIn.getAbsolutePath());
op.profile(colorFileIn.getAbsolutePath());
op.addImage(fileOut.getAbsolutePath());
cmd.run(op);
where colorFileIn.getAboslutePath() is the location of the sRGB color profile on the machine. Since im4java is using the command line it's not as straight forward how to perform operations but the library is explained in detail here. I originally had issues with image magick not working on my Mac as explained in the question. I installed it using brew but it turns out on a Mac you have to install it like brew install imagemagick --with-little-cms. After that image magick worked fine for me.

I found a solution that doesn't require ImageMagick. Basically Java doesn't respect the profile when loading the image so if there is one it needs to get loaded. Here is a code snippet of what I did to accomplish this:
private BufferedImage loadBufferedImage(InputStream inputStream) throws IOException, BadElementException {
byte[] imageBytes = IOUtils.toByteArray(inputStream);
BufferedImage incorrectImage = ImageIO.read(new ByteArrayInputStream(imageBytes));
if (incorrectImage.getColorModel() instanceof ComponentColorModel) {
// Java does not respect the color profile embedded in a component based image, so if there is a color
// profile, detected using iText, then create a buffered image with the correct profile.
Image iTextImage = Image.getInstance(imageBytes);
com.itextpdf.text.pdf.ICC_Profile iTextProfile = iTextImage.getICCProfile();
if (iTextProfile == null) {
// If no profile is present than the image should be processed as is.
return incorrectImage;
} else {
// If there is a profile present then create a buffered image with the profile embedded.
byte[] profileData = iTextProfile.getData();
ICC_Profile profile = ICC_Profile.getInstance(profileData);
ICC_ColorSpace ics = new ICC_ColorSpace(profile);
boolean hasAlpha = incorrectImage.getColorModel().hasAlpha();
boolean isAlphaPremultiplied = incorrectImage.isAlphaPremultiplied();
int transparency = incorrectImage.getTransparency();
int transferType = DataBuffer.TYPE_BYTE;
ComponentColorModel ccm = new ComponentColorModel(ics, hasAlpha, isAlphaPremultiplied, transparency, transferType);
return new BufferedImage(ccm, incorrectImage.copyData(null), isAlphaPremultiplied, null);
}
}
else if (incorrectImage.getColorModel() instanceof IndexColorModel) {
return incorrectImage;
}
else {
throw new UnsupportedEncodingException("Unsupported color model type.");
}
}
This answer does use iText, which is generally used for PDF creation and manipulation, but it happens to process the ICC profiles correctly and I'm already depending on it for my project so it happens to be a much better choice than ImageMagick.
The code in the question then ends up as follows:
BufferedImage image = loadBufferedImage(new FileInputStream(fileIn));
ColorSpace ics = ColorSpace.getInstance(ColorSpace.CS_sRGB);
ColorConvertOp cco = new ColorConvertOp(ics, null);
BufferedImage result = cco.filter(image, null);
ImageIO.write(result, "PNG", fileOut);
which works great.

PDFBox LayerUtility - Importing layers into existing PDF

I am using pdfbox to manipulate PDF content. I have a big PDF file (say 500 pages). I also have a few other single page PDF files containing only a single image which are around 8-15kb per file at the max. What I need to do is to import these single page pdf's like an overlay onto certain pages of the big PDF file.
I have tried the LayerUtility of pdfbox where I've succeeded but it creates a very large sized file as the output. The source pdf is about 1MB before processing and when added with the smaller pdf files, the size goes upto 64MB. And sometimes I need to include two smaller PDF's onto the bigger one.
Is there a better way to do this or am I just doing this wrong? Posting code below trying to add two layers onto a single page:
...
...
..
overlayDoc[pCounter] = PDDocument.load("data\\" + overlay + ".pdf");
outputPage[pCounter] = (PDPage) overlayDoc[pCounter].getDocumentCatalog().getAllPages().get(0);
LayerUtility lu = new LayerUtility( overlayDoc[pCounter] );
form[pCounter] = lu.importPageAsForm( bigPDFDoc, Integer.parseInt(pageNo)-1);
lu.appendFormAsLayer( outputPage[pCounter], form[pCounter], aTrans, "OVERLAY_"+pCounter );
outputDoc.addPage(outputPage[pCounter]);
mOverlayDoc[pCounter] = PDDocument.load("data\\" + overlay2 + ".pdf");
mOutputPage[pCounter] = (PDPage) mOverlayDoc[pCounter].getDocumentCatalog().getAllPages().get(0);
LayerUtility lu2 = new LayerUtility( mOverlayDoc[pCounter] );
mForm[pCounter] = lu2.importPageAsForm(outputDoc, outputDoc.getNumberOfPages()-1);
lu.appendFormAsLayer( mOutputPage[pCounter], mForm[pCounter], aTrans, "OVERLAY_2"+pCounter );
outputDoc.removePage(outputPage[pCounter]);
outputDoc.addPage(mOutputPage[pCounter]);
...
...

With code like the following I don't see any unepected growth of size:
PDDocument bigDocument = PDDocument.load(BIG_SOURCE_FILE);
LayerUtility layerUtility = new LayerUtility(bigDocument);
List bigPages = bigDocument.getDocumentCatalog().getAllPages();
// import each page to superimpose only once
PDDocument firstSuperDocument = PDDocument.load(FIRST_SUPER_FILE);
PDXObjectForm firstForm = layerUtility.importPageAsForm(firstSuperDocument, 0);
PDDocument secondSuperDocument = PDDocument.load(SECOND_SUPER_FILE);
PDXObjectForm secondForm = layerUtility.importPageAsForm(secondSuperDocument, 0);
// These things can easily be done in a loop, too
AffineTransform affineTransform = new AffineTransform(); // Identity... your requirements may differ
layerUtility.appendFormAsLayer((PDPage) bigPages.get(0), firstForm, affineTransform, "Superimposed0");
layerUtility.appendFormAsLayer((PDPage) bigPages.get(1), secondForm, affineTransform, "Superimposed1");
layerUtility.appendFormAsLayer((PDPage) bigPages.get(2), firstForm, affineTransform, "Superimposed2");
bigDocument.save(BIG_TARGET_FILE);
As you see I superimposed the first page of FIRST_SUPER_FILE on two pages of the target file but I only imported the page once. Thus, also the resources of that imported page are imported only once.
This approach is open for loops, too, but don't import the same page multiple times! Instead import all required template pages once up front as forms and in the later loop reference those forms again and again.
(I hope this solves your issue. If not, supply more code and the sample PDFs to reproduce your issue.)

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.