Convert a XFA PDF to an image with PDFBox

Convert a XFA PDF to an image with PDFBox - java

Is there a way to convert a XFA PDF into a set of images (png or jpeg) with Apache's PDFBox?
I use version (1.8.6) which is supposed to have XFA support.
The PDF file to convert is a dynamic PDF form (XFA). Converting a static PDF form does not make any problem.
I was successful with PDFBox V 1.8.6 by calling the Page.convertToImage() method.
My attempt with an XFA PDF resulted in this image:
Here is the code I used to test the PDF conversion:
public void convertToImages(File sourceFile, File destinationDir){
if (!destinationDir.exists()) {
destinationDir.mkdir();
System.out.println("Folder Created -> "+ destinationDir.getAbsolutePath());
}
if (sourceFile.exists()) {
System.out.println("Images copied to Folder: "+ destinationDir.getName());
PDDocument document = null;
try {
//The classical way to lod the PDF document doesn't work here
//document = PDDocument.load(sourceFile);
File scratch = new File(destinationDir, "scratch");
if(scratch.exists())scratch.delete();
document = PDDocument.loadNonSeq(sourceFile, new RandomAccessFile(scratch, "rw"));
//Doesn't seem to have an effect in my case but I keep it ;-)
document.setAllSecurityToBeRemoved(true);
#SuppressWarnings("unchecked")
List<PDPage> list = document.getDocumentCatalog().getAllPages();
System.out.println("Total files to be converted -> "+ list.size());
String fileName = sourceFile.getName();
int pos = fileName.lastIndexOf('.');
fileName = fileName.substring(0, pos);
int pageNumber = 1;
for (PDPage page : list) {
File outputfile = new File(destinationDir, fileName +"_"+ pageNumber +".png");
try {
BufferedImage image = page.convertToImage();
ImageIO.write(image, "png", outputfile);
pageNumber++;
if(outputfile.exists()){
System.out.println("Image Created -> "+ outputfile.getName());
} else {
System.out.println("Image NOT Created -> "+ outputfile.getName());
}
} catch (Exception e) {
System.out.println("Error while creating image file "+ outputfile.getName());
e.printStackTrace();
}
}
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} finally {
if(document != null){
try {
document.close();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
System.out.println("Converted Images are saved at -> "+ destinationDir.getAbsolutePath());
} else {
System.err.println(sourceFile.getName() +" File does not exists");
}
}
Is there anything special to do in that case?
I tried the GIMP despite the fact that it should be used on a server but it doesn't work with dynamic PDF either.
I also have tried ImageMagick but it didn't work at all. Since I would be very surprised that it could solve anything I gave up further investigations.

Related

Java Apache POI: insert an image "infront the text"

I have a placeholder image in my docx file and I want to replace it with new image. The problem is - the placeholder image has an attribute "in front of text", but the new image has not. As a result the alignment breaks. Here is my code snippet and the docx with placeholder and the resulting docx.
.......
replaceImage(doc, "Рисунок 1", qr, 50, 50);
ByteArrayOutputStream out = new ByteArrayOutputStream();
doc.write(out);
out.close();
return out.toByteArray();
}
}
public XWPFDocument replaceImage(XWPFDocument document, String imageOldName, byte[] newImage, int newImageWidth, int newImageHeight) throws Exception {
try {
int imageParagraphPos = -1;
XWPFParagraph imageParagraph = null;
List<IBodyElement> documentElements = document.getBodyElements();
for (IBodyElement documentElement : documentElements) {
imageParagraphPos++;
if (documentElement instanceof XWPFParagraph) {
imageParagraph = (XWPFParagraph) documentElement;
if (imageParagraph.getCTP() != null && imageParagraph.getCTP().toString().trim().contains(imageOldName)) {
break;
}
}
}
if (imageParagraph == null) {
throw new Exception("Unable to replace image data due to the exception:\n"
+ "'" + imageOldName + "' not found in in document.");
}
ParagraphAlignment oldImageAlignment = imageParagraph.getAlignment();
// remove old image
boolean isDeleted = document.removeBodyElement(imageParagraphPos);
// now add new image
XWPFParagraph newImageParagraph = document.createParagraph();
XWPFRun newImageRun = newImageParagraph.createRun();
newImageParagraph.setAlignment(oldImageAlignment);
try (InputStream is = new ByteArrayInputStream(newImage)) {
newImageRun.addPicture(is, XWPFDocument.PICTURE_TYPE_JPEG, "qr",
Units.toEMU(newImageWidth), Units.toEMU(newImageHeight));
}
// set new image at the old image position
document.setParagraph(newImageParagraph, imageParagraphPos);
// NOW REMOVE REDUNDANT IMAGE FORM THE END OF DOCUMENT
document.removeBodyElement(document.getBodyElements().size() - 1);
return document;
} catch (Exception e) {
throw new Exception("Unable to replace image '" + imageOldName + "' due to the exception:\n" + e);
}
}
The image with placeholder:
enter image description here
The resulting image:
enter image description here

To replace picture templates in Microsoft Word there is no need to delete them.
The storage is as so:
The embedded media is stored as binary file. This is the picture data (XWPFPictureData). In the document a picture element (XWPFPicture) links to that picture data.
The XWPFPicture has settings for position, size and text flow. These dont need to be changed.
The changing is needed in XWPFPictureData. There one can replace the old binary content with the new.
So the need is to find the XWPFPicture in the document. There is a non visual picture name stored while inserting the picture in the document. So if one knows that name, then this could be a criteriea to find the picture.
If found one can get the XWPFPictureData from found XWPFPicture. There is method XWPFPicture.getPictureDatato do so. Then one can replace the old binary content of XWPFPictureData with the new. XWPFPictureData is a package part. So it has PackagePart.getOutputStream to get an output stream to write to.
Following complete example shows that all.
The source.docx needs to have an embedded picture named "QRTemplate.jpg". This is the name of the source file used while inserting the picture into Word document using Word GUI. And there needs to be a file QR.jpg which contains the new content.
The result.docx then has all pictures named "QRTemplate.jpg" replaced with the content of the given file QR.jpg.
import java.io.FileInputStream;
import java.io.OutputStream;
import java.io.FileOutputStream;
import org.apache.poi.xwpf.usermodel.*;
public class WordReplacePictureData {
static XWPFPicture getPictureByName(XWPFRun run, String pictureName) {
if (pictureName == null) return null;
for (XWPFPicture picture : run.getEmbeddedPictures()) {
String nonVisualPictureName = picture.getCTPicture().getNvPicPr().getCNvPr().getName();
if (pictureName.equals(nonVisualPictureName)) {
return picture;
}
}
return null;
}
static void replacePictureData(XWPFPictureData source, String pictureResultPath) {
try ( FileInputStream in = new FileInputStream(pictureResultPath);
OutputStream out = source.getPackagePart().getOutputStream();
) {
byte[] buffer = new byte[2048];
int length;
while ((length = in.read(buffer)) > 0) {
out.write(buffer, 0, length);
}
} catch (Exception ex) {
ex.printStackTrace();
}
}
static void replacePicture(XWPFRun run, String pictureName, String pictureResultPath) {
XWPFPicture picture = getPictureByName(run, pictureName);
if (picture != null) {
XWPFPictureData source = picture.getPictureData();
replacePictureData(source, pictureResultPath);
}
}
public static void main(String[] args) throws Exception {
String templatePath = "./source.docx";
String resultPath = "./result.docx";
String pictureTemplateName = "QRTemplate.jpg";
String pictureResultPath = "./QR.jpg";
try ( XWPFDocument document = new XWPFDocument(new FileInputStream(templatePath));
FileOutputStream out = new FileOutputStream(resultPath);
) {
for (IBodyElement bodyElement : document.getBodyElements()) {
if (bodyElement instanceof XWPFParagraph) {
XWPFParagraph paragraph = (XWPFParagraph)bodyElement;
for (XWPFRun run : paragraph.getRuns()) {
replacePicture(run, pictureTemplateName, pictureResultPath);
}
}
}
document.write(out);
}
}
}

I have a dirty workaround. Since the text block on the right side of the image is static, I replaced the text with screen-shot on the original docx. And now, when the placeholder image been substituted by the new image, everything is rendered as expected.

IText html to pdf wrapping line

Hello I'm creating javafx app with iText. I have html editor to write text and I want to create pdf from it. Everything works but when I have a really long line that is wrapped in html editor, in pdf it isn't wrapped, its out of page, how can I set wrapping page? here is my code:
PdfWriter writer = null;
try {
writer = new PdfWriter("doc.pdf");
} catch (FileNotFoundException e) {
e.printStackTrace();
}
//Initialize PDF document
PdfDocument pdf = new PdfDocument(writer);
// Initialize document
Document document = new Document(pdf, PageSize.A4);
List<IElement> list = null;
try {
list = HtmlConverter.convertToElements(editor.getHtmlText());
} catch (IOException e) {
e.printStackTrace();
}
// add elements to document
for (IElement p : list) {
document.add((IBlockElement) p);
}
// close document
document.close();
I also want to set line spacing for this text
Thank you for help

I don't get any errors for the following code:
public class stack_overflow_0008 extends AbstractSupportTicket{
private static String LONG_PIECE_OF_TEXT =
"Once upon a midnight dreary, while I pondered, weak and weary," +
"Over many a quaint and curious volume of forgotten lore—" +
"While I nodded, nearly napping, suddenly there came a tapping," +
"As of some one gently rapping, rapping at my chamber door." +
"Tis some visitor,” I muttered, “tapping at my chamber door—" +
"Only this and nothing more.";
public static void main(String[] args)
{
PdfWriter writer = null;
try {
writer = new PdfWriter(getOutputFile());
} catch (FileNotFoundException e) {
e.printStackTrace();
}
//Initialize PDF document
PdfDocument pdf = new PdfDocument(writer);
// Initialize document
Document document = new Document(pdf, PageSize.A4);
List<IElement> list = null;
try {
list = HtmlConverter.convertToElements("<p>" + LONG_PIECE_OF_TEXT + "</p>");
} catch (IOException e) {
e.printStackTrace();
}
for (IElement p : list) {
document.add((IBlockElement) p);
}
document.close();
}
}
The document is a single (A4) page PDF with one string neatly wrapped.
I think perhaps the content of your string is to blame?
Could you post the HTML you get from this editor object?
Update:
Using the code from this answer on the HTML shared in a new comment to the question, I get the following result:
As you can see, the content is distributed over two lines. No content "falls off the page."

Internal links are not working on merged PDF files

I am currently modifying the logic for generating a PDF report, using itext package (itext-2.1.7.jar). The report is generated first, then - table of contents (TOC) and then TOC is inserted into the report file, using PdfStamper and it's method - replacePage:
PdfStamper stamper = new PdfStamper(new PdfReader(finalFile.getAbsolutePath()),
new FileOutputStream(reportFile));
stamper.replacePage(new PdfReader(tocFile.getAbsolutePath()), 1, 2);
However, I wanted TOC to also have internal links to point to the right chapter. Using replacePage() method, unfortunately, removes all links and annotations, so I tried another way. I've used a method I've found to merge the report file and the TOC file. It did the trick, meaning that the links were now preserved, but they don't seem to work, the user can click them, but nothing happens. I've placed internal links on other places in the report file, for testing purposes, and they seem to work, but they don't work when placed on the actual TOC. TOC is generated separately, here's how I've created the links (addLeadingDots is a local method, not pertinent to the problem):
Chunk anchor = new Chunk(idx + ". " + chapterName + getLeadingDots(chapterName, pageNumber) + " " + pageNumber, MainReport.FONT12);
String anchorLocation = idx+chapterName.toUpperCase().replaceAll("\\s","");
anchor.setLocalGoto(anchorLocation);
line = new Paragraph();
line.add(anchor);
line.setLeading(6);
table.addCell(line);
And this is how I've created anchor destinations on the chapters in the report file:
Chunk target = new Chunk(title.toUpperCase(), font);
String anchorDestination = title.toUpperCase().replace(".", "").replaceAll("\\s","");
target.setLocalDestination(anchorDestination);
Paragraph p = new Paragraph(target);
p.setAlignment(Paragraph.ALIGN_CENTER);
doc.add(p);
Finally, here's the method that supposed to merge report file and the TOC:
public static void mergePDFs(String originalFilePath, String fileToInsertPath, String outputFile, int location) {
PdfReader originalFileReader = null;
try {
originalFileReader = new PdfReader(originalFilePath);
} catch (IOException ex) {
System.out.println("ITextHelper.addPDFToPDF(): can't read original file: " + ex);
}
PdfReader fileToAddReader = null;
try {
fileToAddReader = new PdfReader(fileToInsertPath);
} catch (IOException ex) {
System.out.println("ITextHelper.addPDFToPDF(): can't read fileToInsert: " + ex);
}
if (originalFileReader != null && fileToAddReader != null) {
int numberOfOriginalPages = originalFileReader.getNumberOfPages();
Document document = new Document();
PdfCopy copy = null;
try {
copy = new PdfCopy(document, new FileOutputStream(outputFile));
document.open();
for (int i = 1; i <= numberOfOriginalPages; i++) {
if (i == location) {
for (int j = 1; j <= fileToAddReader.getNumberOfPages(); j++) {
copy.addPage(copy.getImportedPage(fileToAddReader, j));
}
}
copy.addPage(copy.getImportedPage(originalFileReader, i));
}
document.close();
} catch (DocumentException | FileNotFoundException ex) {
m_logger.error("ITextHelper.addPDFToPDF(): can't read output location: " + ex);
} catch (IOException ex) {
m_logger.error(Level.SEVERE, ex);
}
}
}
So, the main question is, what happens to the links, after both pdf documents are merged and how to make them work? I'm not too experienced with itext, so any help is appreciated.
Thanks for your help

PDFBox locks JPEG input file until application exits

I'm using PDFBox RC2 in a Windows 7 environment, Java 1.8_66. I'm using it to create a PDF from a collection of 200dpi page-sized image files, both JPEG and PNG.
It turns out that when adding JPEG files to a PDF, the PDImageXObject.createFromFile() routine fails to close an internal file handle, thus locking the image file for the lifetime of the application. When adding PNG files to a PDF, there is no problem.
Here's some sample code that reproduces the issue. Using process explorer (from sysinternals), view the open file handles for the java.exe process and run this code. My test uses about 20 full sized JPEG files. Note that after the method exits, several locked files still remain behind.
public Boolean CreateFromImages_Broken(String pdfFilename, String[] imageFilenames) {
PDDocument doc = new PDDocument();
for (String imageFilename : imageFilenames) {
try {
PDPage page = new PDPage();
doc.addPage(page);
PDImageXObject pdImage = PDImageXObject.createFromFile(imageFilename, doc);
// at this point, if the imageFilename is a jpeg, pdImage holds onto a handle for
// the given imageFilename and that file remains locked until the application is closed
try (PDPageContentStream contentStream = new PDPageContentStream(doc, page)) {
float scale = (float)72.0 / 200;
page.setMediaBox(new PDRectangle((int)(pdImage.getWidth() * scale), (int)(pdImage.getHeight() * scale)));
contentStream.drawImage(pdImage, 0, 0, pdImage.getWidth()*scale, pdImage.getHeight()*scale);
}
} catch (IOException ioe) {
return false;
}
}
try {
doc.save(pdfFilename);
doc.close();
} catch (IOException ex) {
return false;
}
return true;
}

As a workaround, I reviewed the source code for PNG and JPEG handling, and I've had success by implementing this, which seems to work for both file types:
public Boolean CreateFromImages_FIXED(String pdfFilename, String[] imageFilenames) {
PDDocument doc = new PDDocument();
for (String imageFilename : imageFilenames) {
FileInputStream fis = null;
try {
PDPage page = new PDPage();
doc.addPage(page);
PDImageXObject pdImage = null;
// work around JPEG issue by opening up our own stream, with which
// we can close ourselves instead of PDFBOX leaking it. For PNG
// images, the createFromFile seems to be OK
if (imageFilename.toLowerCase().endsWith(".jpg")) {
fis = new FileInputStream(new File(imageFilename));
pdImage = JPEGFactory.createFromStream(doc, fis);
} else {
pdImage = PDImageXObject.createFromFile(imageFilename, doc);
}
try (PDPageContentStream contentStream = new PDPageContentStream(doc, page)) {
float scale = (float)72.0 / 200;
page.setMediaBox(new PDRectangle((int)(pdImage.getWidth() * scale), (int)(pdImage.getHeight() * scale)));
contentStream.drawImage(pdImage, 0, 0, pdImage.getWidth()*scale, pdImage.getHeight()*scale);
if (fis != null) {
fis.close();
fis = null;
}
}
} catch (IOException ioe) {
return false;
}
}
try {
doc.save(pdfFilename);
doc.close();
} catch (IOException ex) {
return false;
}
return true;
}

Converting PDF Pages to JPG on Java-GAE

I am searching for a open-source java-library that enables me to render single pages of PDFs as JPG or PNG on server-side.
Unfortunately it mustn't use any other java.awt.* classes then
java.awt.datatransfer.DataFlavor
java.awt.datatransfer.MimeType
java.awt.datatransfer.Transferable
If there is any way, a little code-snippet would be fantastic.

i believe icepdf might have what you are looking for.
I've used this open source project a while back to turn uploaded pdfs into images for use in an online catalog.
import org.icepdf.core.exceptions.PDFException;
import org.icepdf.core.exceptions.PDFSecurityException;
import org.icepdf.core.pobjects.Document;
import org.icepdf.core.pobjects.Page;
import org.icepdf.core.util.GraphicsRenderingHints;
public byte[][] convert(byte[] pdf, String format) {
Document document = new Document();
try {
document.setByteArray(pdf, 0, pdf.length, null);
} catch (PDFException ex) {
System.out.println("Error parsing PDF document " + ex);
} catch (PDFSecurityException ex) {
System.out.println("Error encryption not supported " + ex);
} catch (FileNotFoundException ex) {
System.out.println("Error file not found " + ex);
} catch (IOException ex) {
System.out.println("Error handling PDF document " + ex);
}
byte[][] imageArray = new byte[document.getNumberOfPages()][];
// save page captures to bytearray.
float scale = 1.75f;
float rotation = 0f;
// Paint each pages content to an image and write the image to file
for (int i = 0; i < document.getNumberOfPages(); i++) {
BufferedImage image = (BufferedImage)
document.getPageImage(i,
GraphicsRenderingHints.SCREEN,
Page.BOUNDARY_CROPBOX, rotation, scale);
try {
//get the picture util object
PictureUtilLocal pum = (PictureUtilLocal) Component
.getInstance("pictureUtil");
//load image into util
pum.loadBuffered(image);
//write image in desired format
imageArray[i] = pum.imageToByteArray(format, 1f);
System.out.println("\t capturing page " + i);
} catch (IOException e) {
e.printStackTrace();
}
image.flush();
}
// clean up resources
document.dispose();
return imageArray;
}
Word of caution though, I have had trouble with this library throwing a SegFault on open-jdk. worked fine on Sun's. Not sure what it would do on GAE. I can't remember what version it was that had the problem so just be aware.

You can apache PDF box APi for this purpose and use following to code to convert two pdfs into JPG page by page .
public void convertPDFToJPG(String src,String FolderPath){
try{
File folder1 = new File(FolderPath+"\\");
comparePDF cmp=new comparePDF();
cmp.rmdir(folder1);
//load pdf file in the document object
PDDocument doc=PDDocument.load(new FileInputStream(src));
//Get all pages from document and store them in a list
List<PDPage> pages=doc.getDocumentCatalog().getAllPages();
//create iterator object so it is easy to access each page from the list
Iterator<PDPage> i= pages.iterator();
int count=1; //count variable used to separate each image file
//Convert every page of the pdf document to a unique image file
System.out.println("Please wait...");
while(i.hasNext()){
PDPage page=i.next();
BufferedImage bi=page.convertToImage();
ImageIO.write(bi, "jpg", new File(FolderPath+"\\Page"+count+".jpg"));
count++;
}
System.out.println("Conversion complete");
}catch(IOException ie){ie.printStackTrace();}
}

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Convert a XFA PDF to an image with PDFBox - java

Related

Java Apache POI: insert an image "infront the text"

IText html to pdf wrapping line

Internal links are not working on merged PDF files

PDFBox locks JPEG input file until application exits

Converting PDF Pages to JPG on Java-GAE

Categories

Resources