Small rendered images from PDF slides

I'm using the Ghost4J library (http://ghost4j.sourceforge.net) to split a PDF file of slides into several images. The problem is that each slide ends up very small and in a corner of its image. I want each image to take its dimensions from the PDF page, but I don't know how to do that. Here is the code I'm using:
PDFDocument examplePDF = new PDFDocument();
String filePath="input.pdf";
File file=new File(filePath);
examplePDF.load(file);
List<org.ghost4j.document.Document> docs=examplePDF.explode();
SimpleRenderer renderer = new SimpleRenderer();
renderer.setResolution(300);
int counter=0;
for (org.ghost4j.document.Document d : docs) {
    List<Image> img = renderer.render(d);
    ImageIO.write((RenderedImage) img.get(0), "png", new File((counter + 1) + ".png"));
    counter++;
}
I think the problem is in the explode method, which doesn't take into account that my original PDF doesn't use a standard PDF page size.
P.S. I first tried the code from the second answer to this question, but it gave me a heap space error when the document has a lot of pages.

Would you consider using ImageMagick instead?
convert -density 300 input.pdf output.png
This would give you one file per page: output-0.png, output-1.png, etc.
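As a rough check on what a 300 dpi rendering should produce, the expected pixel size follows from the page size in points, since PDF default user space is 72 units per inch. The sketch below is only arithmetic; the class name and the 720 x 540 point slide size are assumptions for illustration and are not taken from the question. If the rendered PNG is much larger than this and the slide sits in a corner, the renderer has probably used a default page size instead of the slide's own.

```java
final class RenderSize {
    /** PDF default user space: 72 units (points) per inch. */
    static final double POINTS_PER_INCH = 72.0;

    /** Pixels needed to render a length given in points at the given resolution. */
    static int toPixels(double points, int dpi) {
        return (int) Math.round(points / POINTS_PER_INCH * dpi);
    }

    public static void main(String[] args) {
        // Hypothetical 10 x 7.5 inch slide (720 x 540 points) rendered at 300 dpi.
        System.out.println(toPixels(720, 300) + " x " + toPixels(540, 300));
    }
}
```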

Related

Converting PDF to GRAYSCALE using PDFBox without images?

I'm using Apache PDFBox.
I want to convert an RGB PDF file to a grayscale PDF file WITHOUT rendering the pages to images, because that makes the file size huge.
These are my steps:
Export an A4 First.pdf from Adobe InDesign, containing images, text and vector objects.
Read the First.pdf file. Done!
Using LayerUtility, copy the pages from First.pdf, rotate them and put them into a new A4 PDF file, Second.pdf. Done!
I prefer this method because I need to keep the vector objects to reduce the size.
Then I want to save the result as a grayscale PDF file (Second-grayscale.pdf).
This is my code (not all of it):
// PDFBox 2.x: PDDocument.load takes a File
PDDocument documentFirst = PDDocument.load(new File("First.pdf"));
// Second.pdf is always empty
PDDocument documentSecond = PDDocument.load(new File("Second.pdf"));
for (int page = 0; page < documentSecond.getNumberOfPages(); page++) {
    // get current page from documentSecond
    PDPage tempPage = documentSecond.getPage(page);
    // create the content stream
    PDPageContentStream contentStream = new PDPageContentStream(documentSecond, tempPage);
    // create layerUtility
    LayerUtility layerUtility = new LayerUtility(documentSecond);
    // importPageAsForm from documentFirst
    PDFormXObject form = layerUtility.importPageAsForm(documentFirst, page);
    // saveGraphicsState
    contentStream.saveGraphicsState();
    // rotate the page (the matrix must be initialized before use)
    Matrix matrix = new Matrix();
    matrix.rotate(Math.toRadians(90));
    contentStream.transform(matrix);
    // draw the rotated page from documentFirst to documentSecond
    contentStream.drawForm(form);
    contentStream.close();
}
// save the new document
documentSecond.save("Second.pdf");
documentSecond.close();
documentFirst.close();
// now convert it to GRAYSCALE or do it in the loop above!
I only started using Apache PDFBox this week. I have followed some examples, but most are old and no longer work. So far I have everything I need except the grayscale conversion.
I am also open to other solutions in Java using an open-source library or free tools (I found some with Ghostscript and Python).
I read this example, but I didn't understand it, and it uses deprecated functions:
https://github.com/lencinhaus/pervads/blob/master/libs/pdfbox/src/java/org/apache/pdfbox/ConvertColorspace.java
It is about the PDF specification and changing the color space.

As far as I understood, you would also be interested in a Ghostscript-based solution.
If you can call gs from your command line, you can convert color to grayscale with this command:
gs -sDEVICE=pdfwrite -sProcessColorModel=DeviceGray -sColorConversionStrategy=Gray -dOverrideICC -o out.pdf -f input.pdf
My answer is taken from "How to convert a PDF to grayscale from command line avoiding to be rasterized?"
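Since the question asks for something usable from Java, one option is to launch that same command with ProcessBuilder. This is a minimal sketch, assuming the gs executable is on the PATH; the class and method names are made up for illustration, and the flags are exactly the ones from the command above.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

final class GrayscaleCommand {
    /** Builds the Ghostscript argument list shown in the answer above. */
    static List<String> build(String gsExecutable, String input, String output) {
        return Arrays.asList(
                gsExecutable,
                "-sDEVICE=pdfwrite",
                "-sProcessColorModel=DeviceGray",
                "-sColorConversionStrategy=Gray",
                "-dOverrideICC",
                "-o", output,
                "-f", input);
    }

    /** Runs the command, forwards its output to the console, returns the exit code. */
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        // Assumes "gs" is installed; on Windows the executable name differs.
        int exit = run(build("gs", "Second.pdf", "Second-grayscale.pdf"));
        System.out.println("Ghostscript exited with " + exit);
    }
}
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.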

How to extract data from a .docx file including image, table, formula etc?

I am working on a task in which I have to extract data from a Word document, mainly images, tables and special text (formulas etc.).
I can save the images from a Word file downloaded from the web, but when I apply the same code to my own .docx file, it gives an error.
The code is:
//create file inputstream to read from a binary file
FileInputStream fs=new FileInputStream(filename);
//create office word 2007+ document object to wrap the word file
XWPFDocument docx=new XWPFDocument(fs);
//get all images from the document and store them in the list piclist
List<XWPFPictureData> piclist=docx.getAllPictures();
//traverse through the list and write each image to a file
Iterator<XWPFPictureData> iterator=piclist.iterator();
System.out.println(piclist.size());
while(iterator.hasNext()){
    XWPFPictureData pic=iterator.next();
    byte[] bytepic=pic.getData();
    int i=0;
    BufferedImage imag=ImageIO.read(new ByteArrayInputStream(bytepic));
    //captureimage(imag,i,flag,j);
    if(imag != null)
    {
        ImageIO.write(imag, "jpg", new File("D:/imagefromword"+i+".jpg"));
    }else{
        System.out.println("imag is empty");
    }
}
It gives an incorrect format error, but I cannot change the doc file.
Secondly, with the code above, if the document contains more than one image, every save produces the same image: with 3 images it saves 3 times, but all three results are the last image.
Any help will be appreciated.

Without the actual error one can only guess.
But there are two POI implementations, HWPF and XWPF, and which one you need depends on the kind of Word document you read: the old binary .doc or the newer XML-based .docx. Typically the format error comes when you try to open the document with the wrong one.
Also, you need the full poi-ooxml-schemas jar to read more complicated documents.
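Two things can be checked without POI at all. First, the file extension can lie, so it may help to look at the first bytes of the file: a .docx is a ZIP container starting with "PK\x03\x04", while an old binary .doc is an OLE2 file starting with D0 CF 11 E0 A1 B1 1A E1. Second, the question's code declares int i=0 inside the loop, so every picture is written to the same file name. The sketch below illustrates both points with the standard library only; the class and method names are hypothetical and not part of POI.

```java
import java.util.ArrayList;
import java.util.List;

final class WordFileHelpers {
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};
    private static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};

    /** Suggests which POI implementation matches the first bytes of a file. */
    static String suggestReader(byte[] header) {
        if (startsWith(header, ZIP_MAGIC)) return "XWPF";   // ZIP container: .docx
        if (startsWith(header, OLE2_MAGIC)) return "HWPF";  // OLE2 container: .doc
        return "UNKNOWN";
    }

    private static boolean startsWith(byte[] data, int[] magic) {
        if (data.length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) return false;
        }
        return true;
    }

    /** One distinct output name per picture: the counter advances with the loop. */
    static List<String> outputNames(int pictureCount, String extension) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < pictureCount; i++) {
            names.add("imagefromword" + i + "." + extension);
        }
        return names;
    }
}
```

In the original loop the equivalent change would be to move the counter declaration above the while statement and increment it after each write.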

PDFRenderer - Export to image, exported inaccurately

I have written a program that exports a PDF file to a series of images. It is shown below:
//Load pdf from path(file)
File file = new File("C:\\TEMP\\office\\a.pdf");
RandomAccessFile raf = new RandomAccessFile(file, "r");
byte[] b = new byte[(int) raf.length()];
raf.readFully(b);
ByteBuffer buf = ByteBuffer.wrap(b);
PDFFile pdffile = new PDFFile(buf);
//Get number of pages
int numOfPages = pdffile.getNumPages();
System.out.println(numOfPages);
//iterate through the number of pages
for (int i = 1; i <= numOfPages; i++) {
    PDFPage page = pdffile.getPage(i);
    //Create new image
    Rectangle rect = new Rectangle(0, 0, (int) page.getBBox().getWidth(), (int) page.getBBox().getHeight());
    Image img = page.getImage(rect.width, rect.height, rect, null, true, true);
    BufferedImage bufferedImage = new BufferedImage(rect.width, rect.height, BufferedImage.TYPE_INT_RGB);
    Graphics g = bufferedImage.createGraphics();
    g.drawImage(img, 0, 0, null);
    g.dispose();
    File asd = new File("C:\\TEMP\\office\\img\\Testingx" + i + ".jpg");
    if (asd.exists()) {
        asd.delete();
    }
    //Export the image to jpg format using the path C:\TEMP\office\img\filename
    ImageIO.write(bufferedImage, "jpg", asd);
}
//Close the buf and other stuffs, which does not affect the image exported
This program works fine with lots of PDF files. However, while testing it with various PDFs found on the internet, I came across one that is not exported to images as accurately as the others. The resources I used are listed below.
Original PDF link:
2007_OReilly_EssentialActionScript3.0.pdf
I will use page 7 of the PDF given above.
The expected exported image: Click here for expected result image
After the program finishes, the resulting image is quite different.
Click here for Resulting image
As you can see, the resulting image is shifted upward and some of the content has disappeared. It has lost the layout of the PDF: it is not centered and is indented to the right.
PDFRenderer itself does not have this problem: if we run the PDFRenderer .jar file, the top of the page and the layout are consistent with the original PDF file.
PDF opened with PDFRenderer in page 7
Known possible issue: ImageIO does not support the CMYK format, so page 1 and other pages that use CMYK cannot be exported correctly. I am not sure if I am right.
Another issue: PDFRenderer seems to fail at reading page 1, possibly because of something used in the PDF's formatting. I don't know much about it.
Library used: PDFRenderer
You may download the PDF from the link mentioned above and use the program I provided to reproduce the problem.
My question: how can I fix this problem? Is there something wrong with my program?

I found the problem myself and was able to fix it.
The explanation is as follows.
My Java program did not respect the X and Y coordinates of the page box in the PDF file; put simply, it hardcoded the origin as (0, 0). Most PDFs look like the following image:
Most PDF http://img266.imageshack.us/img266/7618/4cl5.png
However, the PDF I provided is not such a case: the X coordinate of its upper left corner is not 0, and neither is the Y coordinate. That is why the image was cut off.
In short, my program captures a rectangular region of the page, but because it did not look up the coordinates of the upper left corner for this PDF, it captured the region shown in the image below. (The Y coordinate is not labeled in the picture, my mistake.)
Exception PDF http://img12.imageshack.us/img12/9672/plhb.png
With the following modification, the program handles this PDF like the common case:
Rectangle rect = new Rectangle((int)page.getPageBox().getX(), (int)page.getPageBox().getY(), (int) page.getBBox().getWidth(), (int) page.getBBox().getHeight());
This lets the program capture the entire page rendered by PDFRenderer, starting from the upper left corner, just like in the first image above. It works the same for page sizes from A4 to A7; I didn't test further, but it works.
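The idea behind the fix can be isolated from PDFRenderer: build the clip rectangle from the box's own origin instead of assuming (0, 0). The following is a minimal sketch using only java.awt types; the helper name and the sample box values are assumptions for illustration, and taking all four values from one box is a simplification of the line above.

```java
import java.awt.Rectangle;
import java.awt.geom.Rectangle2D;

final class ClipRects {
    /** Clip rectangle that starts at the box's own origin rather than at (0, 0). */
    static Rectangle clipFor(Rectangle2D pageBox) {
        return new Rectangle(
                (int) pageBox.getX(),
                (int) pageBox.getY(),
                (int) pageBox.getWidth(),
                (int) pageBox.getHeight());
    }

    public static void main(String[] args) {
        // Hypothetical page whose box does not start at the origin.
        Rectangle2D box = new Rectangle2D.Double(36, 48, 500, 700);
        System.out.println(clipFor(box));
    }
}
```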

PDFBox LayerUtility - Importing layers into existing PDF

I am using pdfbox to manipulate PDF content. I have a big PDF file (say 500 pages). I also have a few other single page PDF files containing only a single image which are around 8-15kb per file at the max. What I need to do is to import these single page pdf's like an overlay onto certain pages of the big PDF file.
I have tried the LayerUtility of pdfbox and succeeded, but it creates a very large output file. The source PDF is about 1MB before processing, and once the smaller PDF files are added, the size goes up to 64MB. And sometimes I need to overlay two smaller PDFs onto the bigger one.
Is there a better way to do this or am I just doing this wrong? Posting code below trying to add two layers onto a single page:
...
...
..
overlayDoc[pCounter] = PDDocument.load("data\\" + overlay + ".pdf");
outputPage[pCounter] = (PDPage) overlayDoc[pCounter].getDocumentCatalog().getAllPages().get(0);
LayerUtility lu = new LayerUtility( overlayDoc[pCounter] );
form[pCounter] = lu.importPageAsForm( bigPDFDoc, Integer.parseInt(pageNo)-1);
lu.appendFormAsLayer( outputPage[pCounter], form[pCounter], aTrans, "OVERLAY_"+pCounter );
outputDoc.addPage(outputPage[pCounter]);
mOverlayDoc[pCounter] = PDDocument.load("data\\" + overlay2 + ".pdf");
mOutputPage[pCounter] = (PDPage) mOverlayDoc[pCounter].getDocumentCatalog().getAllPages().get(0);
LayerUtility lu2 = new LayerUtility( mOverlayDoc[pCounter] );
mForm[pCounter] = lu2.importPageAsForm(outputDoc, outputDoc.getNumberOfPages()-1);
lu.appendFormAsLayer( mOutputPage[pCounter], mForm[pCounter], aTrans, "OVERLAY_2"+pCounter );
outputDoc.removePage(outputPage[pCounter]);
outputDoc.addPage(mOutputPage[pCounter]);
...
...

With code like the following I don't see any unexpected growth in size:
PDDocument bigDocument = PDDocument.load(BIG_SOURCE_FILE);
LayerUtility layerUtility = new LayerUtility(bigDocument);
List bigPages = bigDocument.getDocumentCatalog().getAllPages();
// import each page to superimpose only once
PDDocument firstSuperDocument = PDDocument.load(FIRST_SUPER_FILE);
PDXObjectForm firstForm = layerUtility.importPageAsForm(firstSuperDocument, 0);
PDDocument secondSuperDocument = PDDocument.load(SECOND_SUPER_FILE);
PDXObjectForm secondForm = layerUtility.importPageAsForm(secondSuperDocument, 0);
// These things can easily be done in a loop, too
AffineTransform affineTransform = new AffineTransform(); // Identity... your requirements may differ
layerUtility.appendFormAsLayer((PDPage) bigPages.get(0), firstForm, affineTransform, "Superimposed0");
layerUtility.appendFormAsLayer((PDPage) bigPages.get(1), secondForm, affineTransform, "Superimposed1");
layerUtility.appendFormAsLayer((PDPage) bigPages.get(2), firstForm, affineTransform, "Superimposed2");
bigDocument.save(BIG_TARGET_FILE);
As you see I superimposed the first page of FIRST_SUPER_FILE on two pages of the target file but I only imported the page once. Thus, also the resources of that imported page are imported only once.
This approach is open for loops, too, but don't import the same page multiple times! Instead import all required template pages once up front as forms and in the later loop reference those forms again and again.
(I hope this solves your issue. If not, supply more code and the sample PDFs to reproduce your issue.)
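The "import once, reference many times" advice can be sketched as a small cache keyed by the overlay file. This is not PDFBox code: FormCache is a hypothetical helper, and the importer function stands in for whatever performs the actual import (for example a call to importPageAsForm). The import counter exists only to make the effect visible.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class FormCache<K, F> {
    private final Map<K, F> forms = new HashMap<>();
    private final Function<K, F> importer;
    private int importCount = 0;

    FormCache(Function<K, F> importer) {
        this.importer = importer;
    }

    /** Returns the cached form for the key, importing it only on first use. */
    F get(K key) {
        return forms.computeIfAbsent(key, k -> {
            importCount++;
            return importer.apply(k);
        });
    }

    /** Number of real imports performed so far. */
    int importCount() {
        return importCount;
    }
}
```

Inside the page loop one would then call cache.get(overlayName) for each page and pass the returned form to appendFormAsLayer, so that a 500-page document with two overlays triggers two imports rather than hundreds.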

jasperreports: can see background image in pdf export but not in docx export

Report generation:
The following code resides in a servlet and generates both a "letter.docx" word document to download and a "pika.pdf" file in C:
I can see the background image I defined in "pika", but not in "letter".
InputStream is = request.getServletContext().getResourceAsStream("/resources/reports/" +name);
JasperReport jr = JasperCompileManager.compileReport(is);
JasperPrint jp = JasperFillManager.fillReport(jr, params, ds);
JRExporter exp = new JRDocxExporter();
exp.setParameter(JRExporterParameter.JASPER_PRINT, jp);
ByteArrayOutputStream bos = new ByteArrayOutputStream();
exp.setParameter(JRExporterParameter.OUTPUT_STREAM, bos);
exp.exportReport();
JasperExportManager.exportReportToPdfFile(jp, "C:\\pika.pdf");
byte[] bytes = bos.toByteArray();
response.reset();
response.setContentType("application/octet-stream");
response.setHeader("Content-disposition", "attachment; filename=\"letter.docx\"");
response.getOutputStream().write(bytes);
response.getOutputStream().flush();
response.getOutputStream().close();

Looking for an answer in the Jasper community, I can see you are not the first to ask this.
Here is another question like yours; the answers all say that you can't set an image as a background in doc reports.
The last things I found in my search are three alternatives:
JOD Reports: the most radical option. If you can change your report engine, check it out.
Another tutorial that shows how to embed images, but I'm not sure it works in the specific case of Word documents.
A last tutorial, here on SO, which gives a little taste of putting text as a background.
Hope this helps, cheers.

I don't have enough information on your case, but I once had a very nasty problem with Excel export: a cell wasn't shown in XLS, while in PDF it was shown fine. What I found was a single-pixel misalignment between the header band and the value band for the same column. This introduced an extra cell into each of the value rows, and JR couldn't populate it correctly.
So my advice, based on that experience, is to check for misalignments in the JRXML. Since MS Office formats are not as well standardized as PDF or HTML, exporting to them tends to be more "glitched".

JRDocxExporter is a grid exporter: it generates a table and then populates each cell of this table with the elements in the Jasper template.
If an element in the template overlaps another element, one of the two is not displayed, because in a table a cell cannot overlap another cell.
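A quick way to find such conflicts is to take the x, y, width and height of every element in a band and test the rectangles pairwise. The sketch below does that with java.awt.Rectangle; the class name and the sample coordinates are assumptions for illustration, and reading the values out of the JRXML is left out.

```java
import java.awt.Rectangle;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class OverlapCheck {
    /** Returns index pairs {i, j} of elements whose areas intersect. */
    static List<int[]> overlappingPairs(List<Rectangle> elements) {
        List<int[]> pairs = new ArrayList<>();
        for (int i = 0; i < elements.size(); i++) {
            for (int j = i + 1; j < elements.size(); j++) {
                // Elements that merely share an edge do not count as overlapping.
                if (elements.get(i).intersects(elements.get(j))) {
                    pairs.add(new int[] {i, j});
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<Rectangle> band = Arrays.asList(
                new Rectangle(0, 0, 595, 842),    // hypothetical full-page background image
                new Rectangle(50, 100, 200, 20),  // hypothetical text field on top of it
                new Rectangle(0, 842, 595, 20));  // element directly below, touching only
        for (int[] p : overlappingPairs(band)) {
            System.out.println("elements " + p[0] + " and " + p[1] + " overlap");
        }
    }
}
```

A background image under other elements is exactly this kind of overlap, which would be consistent with it showing in the PDF but not in the DOCX export.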
