Converting PDF to GRAYSCALE using PDFBox without images? - java

I'm using Apache PDFBox.
I want to convert an RGB PDF file to a grayscale PDF without rasterizing the pages to images, because that approach produces a huge file.
These are my steps so far:
Export an A4 First.pdf from Adobe InDesign; it contains images, text, and vector objects.
Read the First.pdf file. Done!
Using LayerUtility, copy the pages from First.pdf, rotate them, and place them in a new A4 PDF file, Second.pdf. Done!
I prefer this method because I need to keep the vector objects to reduce the file size.
Then I want to save the result as a grayscale PDF file (Second-grayscale.pdf).
This is my code (not all of it):
PDDocument documentFirst = PDDocument.load(new File("First.pdf"));
// Second.pdf is always empty
PDDocument documentSecond = PDDocument.load(new File("Second.pdf"));
for (int page = 0; page < documentSecond.getNumberOfPages(); page++) {
// get current page from documentSecond
PDPage tempPage = documentSecond.getPage(page);
// create content contentStream
PDPageContentStream contentStream = new PDPageContentStream(documentSecond, tempPage);
// create layerUtility
LayerUtility layerUtility = new LayerUtility(documentSecond);
// importPageAsForm from documentFirst
PDFormXObject form = layerUtility.importPageAsForm(documentFirst, page);
// saveGraphicsState
contentStream.saveGraphicsState();
// rotate the page
Matrix matrix = Matrix.getRotateInstance(Math.toRadians(90), 0, 0);
contentStream.transform(matrix);
// draw the rotated page from documentFirst to documentSecond
contentStream.drawForm(form);
contentStream.close();
}
// save the new document
documentSecond.save("Second.pdf");
documentSecond.close();
documentFirst.close();
// now convert it to GRAYSCALE or do it in the Loop above!
I only started using Apache PDFBox this week. I have followed some examples, but most are old and no longer work. So far I have everything I need except the grayscale conversion.
I am also open to other solutions in Java using an open-source library or free tools. (I found solutions using Ghostscript and Python.)
I read this example, but I didn't understand it and it uses deprecated functions:
https://github.com/lencinhaus/pervads/blob/master/libs/pdfbox/src/java/org/apache/pdfbox/ConvertColorspace.java
It deals with the PDF specification and changing the color space.

As far as I understand, you mentioned you would be interested in a Ghostscript-based solution.
If you can call Ghostscript from your command line, you can convert color to grayscale with this command:
gs -sDEVICE=pdfwrite -sProcessColorModel=DeviceGray -sColorConversionStrategy=Gray -dOverrideICC -o out.pdf -f input.pdf
This answer is taken from "How to convert a PDF to grayscale from command line avoiding to be rasterized?"
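Since the question asks for something usable from Java, here is a minimal sketch of launching that same command with ProcessBuilder. It assumes a Ghostscript executable named gs is on the PATH (on Windows the name is typically different), and the class and method names are only illustrative, not part of any library.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: wraps the Ghostscript command line from the answer above.
final class GhostscriptGrayscale {

    // Builds the argument list; the flags are copied from the command above.
    static List<String> buildCommand(String gsExecutable, String input, String output) {
        return Arrays.asList(
                gsExecutable,
                "-sDEVICE=pdfwrite",
                "-sProcessColorModel=DeviceGray",
                "-sColorConversionStrategy=Gray",
                "-dOverrideICC",
                "-o", output,
                "-f", input);
    }

    // Runs Ghostscript and returns its exit code (0 normally means success).
    static int convert(String gsExecutable, String input, String output)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(gsExecutable, input, output))
                .inheritIO() // show Ghostscript's own messages on our console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: GhostscriptGrayscale <input.pdf> <output.pdf>");
            return;
        }
        // Assumption: the executable is called "gs" and is on the PATH.
        int exitCode = convert("gs", args[0], args[1]);
        System.out.println("Ghostscript exit code: " + exitCode);
    }
}
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.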

Related

How to get image coordinates from pdf using pdfbox in java

I want to get an image field from an existing PDF and fill it with another image to create a new PDF file using the PDFBox library in Java.
If you're trying to get image coordinates from a PDF, you could try PDF Mantis, which has a function to extract this information. See the example below:
// Load your PDF
PdfMantis pdf = new PdfMantis("/home/example.pdf");
// Get the image index
List<ImageIndex> imageIndex = pdf.getImageIndex().buildIndex();
// Iterate over each entry in the index
for (ImageIndex image : imageIndex) {
// And then we can get coordinates like so...
image.getX();
image.getY();
image.getHeight();
image.getWidth();
}
Disclaimer: I am the developer of PDF Mantis.

Small rendered images from PDF slides

I'm using Ghost4J library (http://ghost4j.sourceforge.net) to split a PDF file of slides into several images. The problem I have is that I get images where the slides are in the corner and very small. I want my images to get the format of the page from the PDF, but I don't know how to do it. Here is the code I'm using.
PDFDocument examplePDF = new PDFDocument();
String filePath="input.pdf";
File file=new File(filePath);
examplePDF.load(file);
List<org.ghost4j.document.Document> docs=examplePDF.explode();
SimpleRenderer renderer = new SimpleRenderer();
renderer.setResolution(300);
int counter=0;
for ( org.ghost4j.document.Document d : docs){
List<Image> img=renderer.render(d);
ImageIO.write((RenderedImage) img.get(0), "png", new File(
(counter+ 1) + ".png"));
counter++;
}
I think the problem is in the explode method, which doesn't take into account that my original PDF doesn't have a standard page size.
P.S. I first tried the code from the second answer to this question, but that gave me a heap space error when the document has a lot of pages.
Would you consider using ImageMagick instead?
convert -density 300 input.pdf output.png
would give you output-0.png, output-1.png, etc.

Create a pdf with dimensions 1700pixels*2200pixels in java using pdfBox

I have an application that opens a PDF file with dimensions 1700 pixels * 2200 pixels, and from it I get the dimensions of a rectangle drawn over the PDF.
I am trying to draw the same rectangle on a PDF with PDFBox,
but PDFBox creates a PDF page with different dimensions.
System.out.println(page.getMediaBox().getHeight());
System.out.println(page.getMediaBox().getWidth());
results in :
612
792
How can I convert the PDF coordinates from 1700*2200 to 612*792?
Your output of 612 and 792 from
System.out.println(page.getMediaBox().getHeight()); System.out.println(page.getMediaBox().getWidth());
seems to indicate that you created the PDPage using the default constructor, i.e. new PDPage(), which sets the page size to the US Letter format.
If you want pages in a different format, use the constructor PDPage(PDRectangle), e.g.:
PDRectangle rec = new PDRectangle(1700, 2200);
PDDocument document = new PDDocument();
PDPage page = new PDPage(rec);
document.addPage(page);
This creates a PDF with a page whose size is 1700x2200 user space units, i.e. about 23.6"x30.6".
BTW, you talk about a pdf file in the dimensions 1700pixels*2200pixels - PDFs don't know the unit 'pixel'. They know the default user space unit which defaults to 1/72" and, therefore, more or less corresponds to the unit point. This especially does not imply a resolution.
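If you would rather keep the 612x792 page and map the rectangle onto it, as the question literally asks, the following is a rough sketch of the arithmetic. It assumes the 1700x2200 image covers exactly the whole 612x792 page (a scale of 0.36, which matches a 200 dpi rendering of US Letter) and that the pixel coordinates have their origin in the top-left corner, whereas PDF user space has its origin in the bottom-left. Neither assumption is confirmed by the question, and the class name is made up for illustration.

```java
// Hypothetical helper: maps a rectangle from image pixels to PDF user space.
final class PixelToUserSpace {

    // Assumed sizes, taken from the question.
    static final float PAGE_WIDTH = 612f;     // user space units
    static final float PAGE_HEIGHT = 792f;    // user space units
    static final float IMAGE_WIDTH = 1700f;   // pixels
    static final float IMAGE_HEIGHT = 2200f;  // pixels

    // Returns {x, y, width, height} in user space, with (x, y) being the
    // lower-left corner of the rectangle as PDF expects it.
    static float[] convert(float xPx, float yPx, float widthPx, float heightPx) {
        float scaleX = PAGE_WIDTH / IMAGE_WIDTH;
        float scaleY = PAGE_HEIGHT / IMAGE_HEIGHT;
        float width = widthPx * scaleX;
        float height = heightPx * scaleY;
        float x = xPx * scaleX;
        // Assumption: pixel y grows downward from the top edge, so flip it.
        float y = PAGE_HEIGHT - yPx * scaleY - height;
        return new float[] {x, y, width, height};
    }

    public static void main(String[] args) {
        // A 500x250 pixel rectangle whose top-left corner is at pixel (100, 200).
        float[] r = convert(100f, 200f, 500f, 250f);
        System.out.printf("x=%.2f y=%.2f width=%.2f height=%.2f%n", r[0], r[1], r[2], r[3]);
    }
}
```

The resulting values could then be passed to whatever drawing call you use on the page's content stream.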

PDFRenderer - Export to image, exported inaccurately

I have written a program that exports a PDF file to a series of images, shown below:
//Load pdf from path(file)
File file = new File("C:\\TEMP\\office\\a.pdf");
RandomAccessFile raf = new RandomAccessFile(file, "r");
byte[] b = new byte[(int) raf.length()];
raf.readFully(b);
ByteBuffer buf = ByteBuffer.wrap(b);
PDFFile pdffile = new PDFFile(buf);
//Get number of pages
int numOfPages = pdffile.getNumPages();
System.out.println(numOfPages);
//iterate through the number of pages
for (int i = 1; i <= numOfPages; i++) {
PDFPage page = pdffile.getPage(i);
//Create new image
Rectangle rect = new Rectangle(0, 0, (int) page.getBBox().getWidth(), (int) page.getBBox().getHeight());
Image img = page.getImage(rect.width, rect.height, rect, null, true, true);
BufferedImage bufferedImage = new BufferedImage(rect.width, rect.height, BufferedImage.TYPE_INT_RGB);
Graphics g = bufferedImage.createGraphics();
g.drawImage(img, 0, 0, null);
g.dispose();
File asd = new File("C:\\TEMP\\office\\img\\Testingx" + i + ".jpg");
if (asd.exists()) {
asd.delete();
}
//Export the image to jpg format using the path C:\TEMP\office\img\filename
ImageIO.write(bufferedImage, "jpg", asd);
}
//Close the buf and other stuffs, which does not affect the image exported
This program works fine for many PDF files. However, while testing it with various PDFs found on the internet, I came across one that is not exported accurately. The resources I used are listed below.
Original PDF Link:
2007_OReilly_EssentialActionScript3.0.pdf
I will use the page 7 of the PDF given above.
The expected image to be export : Click here for expected result image
After the program finished the operation, the resulting image is quite different.
Click here for Resulting image
As you can see, the resulting image is shifted upward and some of the content has disappeared. It also lost the layout of the PDF: it is not centered and is shifted to the right.
PDFRenderer itself does not have this problem: if we run the PDFRenderer .jar file, the top of the page and the layout are consistent with the original PDF file.
PDF opened with PDFRenderer in page 7
Known possible issue: ImageIO does not support the CMYK format, so page 1 and other pages that use CMYK cannot be exported correctly. I am not sure whether I am right about this.
Another issue: PDFRenderer seems to fail at reading page 1, possibly because of something used in the PDF's formatting; I don't know much about it.
Library used: PDFRenderer
You can download the PDF from the link above and use the program I provided to reproduce the problem.
My question: How can I fix this problem? Is there something wrong with my program?
I found the problem myself and was able to fix it.
The explanation is as follows.
My Java program does not honor the X and Y coordinates of the page in the PDF file; put simply, it hardcodes them to 0. For most PDFs that works, as in the following image:
Most PDF http://img266.imageshack.us/img266/7618/4cl5.png
However, the PDF I provided is not such a case: the X coordinate of the upper-left corner is not 0, and neither is the Y coordinate. That is why the image was cut off.
In short, my program captures a rectangular region of the page, but because it does not look up the coordinates of the upper-left corner for this PDF, it captures the region shown in the image below. (The Y coordinate is not marked in the picture; my mistake.)
Exception PDF http://img12.imageshack.us/img12/9672/plhb.png
With the following modification, the program handles this PDF like the common case, and the result is even better:
Rectangle rect = new Rectangle((int)page.getPageBox().getX(), (int)page.getPageBox().getY(), (int) page.getBBox().getWidth(), (int) page.getBBox().getHeight());
This lets the program capture the entire page provided by PDFRenderer, starting from the upper-left corner, just as in the first image above. It works the same for different page sizes from A4 to A7; I didn't test further.
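To make the difference concrete, here is a small standalone sketch using only java.awt, with no PDFRenderer involved. The box values are made up to stand in for what a page with a shifted origin might report, and for simplicity it takes the origin and the size from the same box, whereas the fix above reads them from getPageBox() and getBBox() respectively.

```java
import java.awt.Rectangle;
import java.awt.geom.Rectangle2D;

// Hypothetical illustration of building the clip rectangle from the page box.
final class ClipRectangleDemo {

    // Original approach: the origin is hardcoded to (0, 0).
    static Rectangle hardcodedOrigin(Rectangle2D box) {
        return new Rectangle(0, 0, (int) box.getWidth(), (int) box.getHeight());
    }

    // Fixed approach: the origin is taken from the box itself.
    static Rectangle boxOrigin(Rectangle2D box) {
        return new Rectangle((int) box.getX(), (int) box.getY(),
                (int) box.getWidth(), (int) box.getHeight());
    }

    public static void main(String[] args) {
        // Made-up page box whose corner is not at (0, 0).
        Rectangle2D box = new Rectangle2D.Double(18, 36, 504, 661);
        Rectangle wrong = hardcodedOrigin(box);
        Rectangle right = boxOrigin(box);
        // The hardcoded rectangle misses the far edges of the shifted page.
        System.out.println("hardcoded clip covers the box: " + wrong.contains(box));
        System.out.println("box-based clip covers the box: " + right.contains(box));
    }
}
```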

PDFBox LayerUtility - Importing layers into existing PDF

I am using PDFBox to manipulate PDF content. I have a big PDF file (say 500 pages). I also have a few other single-page PDF files, each containing only a single image and at most around 8-15 KB in size. What I need to do is import these single-page PDFs as overlays onto certain pages of the big PDF file.
I have tried the LayerUtility of PDFBox and succeeded, but it creates a very large output file. The source PDF is about 1MB before processing, and after adding the smaller PDF files the size goes up to 64MB. And sometimes I need to put two of the smaller PDFs onto one page of the bigger one.
Is there a better way to do this or am I just doing this wrong? Posting code below trying to add two layers onto a single page:
...
...
..
overlayDoc[pCounter] = PDDocument.load("data\\" + overlay + ".pdf");
outputPage[pCounter] = (PDPage) overlayDoc[pCounter].getDocumentCatalog().getAllPages().get(0);
LayerUtility lu = new LayerUtility( overlayDoc[pCounter] );
form[pCounter] = lu.importPageAsForm( bigPDFDoc, Integer.parseInt(pageNo)-1);
lu.appendFormAsLayer( outputPage[pCounter], form[pCounter], aTrans, "OVERLAY_"+pCounter );
outputDoc.addPage(outputPage[pCounter]);
mOverlayDoc[pCounter] = PDDocument.load("data\\" + overlay2 + ".pdf");
mOutputPage[pCounter] = (PDPage) mOverlayDoc[pCounter].getDocumentCatalog().getAllPages().get(0);
LayerUtility lu2 = new LayerUtility( mOverlayDoc[pCounter] );
mForm[pCounter] = lu2.importPageAsForm(outputDoc, outputDoc.getNumberOfPages()-1);
lu.appendFormAsLayer( mOutputPage[pCounter], mForm[pCounter], aTrans, "OVERLAY_2"+pCounter );
outputDoc.removePage(outputPage[pCounter]);
outputDoc.addPage(mOutputPage[pCounter]);
...
...
With code like the following I don't see any unexpected growth in size:
PDDocument bigDocument = PDDocument.load(BIG_SOURCE_FILE);
LayerUtility layerUtility = new LayerUtility(bigDocument);
List bigPages = bigDocument.getDocumentCatalog().getAllPages();
// import each page to superimpose only once
PDDocument firstSuperDocument = PDDocument.load(FIRST_SUPER_FILE);
PDXObjectForm firstForm = layerUtility.importPageAsForm(firstSuperDocument, 0);
PDDocument secondSuperDocument = PDDocument.load(SECOND_SUPER_FILE);
PDXObjectForm secondForm = layerUtility.importPageAsForm(secondSuperDocument, 0);
// These things can easily be done in a loop, too
AffineTransform affineTransform = new AffineTransform(); // Identity... your requirements may differ
layerUtility.appendFormAsLayer((PDPage) bigPages.get(0), firstForm, affineTransform, "Superimposed0");
layerUtility.appendFormAsLayer((PDPage) bigPages.get(1), secondForm, affineTransform, "Superimposed1");
layerUtility.appendFormAsLayer((PDPage) bigPages.get(2), firstForm, affineTransform, "Superimposed2");
bigDocument.save(BIG_TARGET_FILE);
As you see I superimposed the first page of FIRST_SUPER_FILE on two pages of the target file but I only imported the page once. Thus, also the resources of that imported page are imported only once.
This approach is open for loops, too, but don't import the same page multiple times! Instead import all required template pages once up front as forms and in the later loop reference those forms again and again.
(I hope this solves your issue. If not, supply more code and the sample PDFs to reproduce your issue.)
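The "import once, reference many times" advice can be sketched as a small cache that sits in front of the import call. This is a generic, library-free illustration: the importer function is a stand-in for something like layerUtility.importPageAsForm(...), and the class name and file names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical cache: each template is imported at most once, then reused.
final class ImportOnceCache<K, V> {
    private final Map<K, V> imported = new HashMap<>();
    private final Function<K, V> importer;
    private int importCount = 0;

    ImportOnceCache(Function<K, V> importer) {
        this.importer = importer;
    }

    // Returns the already imported form, importing it only on first use.
    V get(K key) {
        V form = imported.get(key);
        if (form == null) {
            form = importer.apply(key); // the expensive import happens here
            importCount++;
            imported.put(key, form);
        }
        return form;
    }

    int importCount() {
        return importCount;
    }

    public static void main(String[] args) {
        // Stand-in importer: real code would load the overlay PDF and import its page.
        ImportOnceCache<String, String> forms = new ImportOnceCache<>(path -> "form:" + path);
        String[] overlayForPage = {"first.pdf", "second.pdf", "first.pdf"};
        for (int page = 0; page < overlayForPage.length; page++) {
            String form = forms.get(overlayForPage[page]);
            System.out.println("page " + page + " uses " + form);
        }
        System.out.println("imports performed: " + forms.importCount());
    }
}
```

In the loop, three pages are processed but only two imports happen, which is the behavior that keeps the shared resources from being duplicated in the output.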
