What I want is:
I have PDF document in english and an other languages and always pairs with the same content. I want to merge exactly the pairs, but I got tons of them (but the pairs have quiet similiar names), so I don't want to do it manually.
What I do have:
some experience with Java and VBA and yet installed Microsoft Office Suite XP and Adobe Acrobat Professional 8. Any other software has to be for free... if needed.
Can you help me to find a solution. Anything would help I only found solutions for newer versions of Excel and Acrobat Professional on the web.
This is easily accomplished using IText, which is an open source package that processes Pdfs. There are Java and C# versions.
Here is some example java code that I wrote to merge several pdfs
/**
* merge multiple pdfs into a single pdf
* #param fileName output pdf full path name
* #parrm childPdfs full path names of pdfs to merge
*/
public static void mergePdfs(String fileName, String [] childPdfs) {
try {
Document doc = new Document();
PdfCopy copyDoc = new PdfCopy(doc, new FileOutputStream(fileName));
doc.open();
for (int i = 0; i < childPdfs.length; i++) {
PdfReader reader = new PdfReader(childPdfs [i]);
int pageCnt = reader.getNumberOfPages();
for (int j = 1; j <= pageCnt; j++) {
copyDoc.addPage(copyDoc.getImportedPage(reader, j));
}
reader.close();
}
doc.close();
} catch (Exception e) {
throw new RuntimeException(e);
}
}
Related
I am evaluating to replace our pdf processing from itext to pdfbox. I did some tests with 200 pdfs with a single page (94KB, 469KB, 937KB) and merged them to one pdf in our application. PDFBox version: 2.0.23.
itext version: 2.1.7. Here are the test results:
Here is the itext implementation:
byte[] l_PDFPage = null;
PdfReader l_PDFReader = null;
PdfCopy l_Copier = null;
Document l_PDFDocument = null;
OutputStream l_Stream = new FileOutputStream(m_File);
// do it for all pages in the editor
for( int i = 0; i < m_Editor.getCountOfElements(); i++ ) {
l_Page = m_Editor.getPageAt(i);
l_PDFPage = l_Page.getAsPdf();
l_PDFReader = new PdfReader(l_PDFPage);
l_PDFReader.getPageN(1).put(PdfName.ROTATE, new PdfNumber(l_PDFReader.getPageRotation(1) + l_Page.getRotation() % 360));
l_PDFReader.consolidateNamedDestinations();
if( i == 0 ) {
l_PDFDocument = new Document(l_PDFReader.getPageSizeWithRotation(1));
l_Copier = new PdfCopy(l_PDFDocument, l_Stream);
l_PDFDocument.open();
}
l_Copier.addPage(l_Copier.getImportedPage(l_PDFReader, 1));
if( l_PDFReader.getAcroForm() != null )
l_Copier.copyAcroForm(l_PDFReader);
l_Copier.flush();
l_Copier.freeReader(l_PDFReader);
}
l_PDFDocument.close();
l_Stream.close();
Here is the pdfbox implementation:
byte[] l_PDFPage = null;
List<PDDocument> pageDocuments = new ArrayList<>();
PDDocument saveDocument = new PDDocument();
try {
// do it for all pages in the editor
for( int i = 0; i < m_Editor.getCountOfElements(); i++ ) {
// our wrapper object for a page
l_Page = m_Editor.getPageAt(i);
// page as byte[]
l_PDFPage = l_Page.getAsPdf();
PDDocument document = PDDocument.load(l_PDFPage);
// save page document to close it later
pageDocuments.add(document);
PDPage page = document.getPage(0);
saveDocument.addPage(saveDocument.importPage(page));
}
saveDocument.save(l_Stream);
}
finally {
// close every page document
for(PDDocument doc : pageDocuments) {
doc.close();
}
saveDocument.close();
}
I have also tried using pdfmerger of pdfbox. The performance was nearly the same as the other pdfbox implementation. But with the 937KB files I run in an outofmemory exception with this implementation:
byte[] l_PDFPage = null;
OutputStream l_Stream = new FileOutputStream(m_File);
PDFMergerUtility merger = new PDFMergerUtility();
// do it for all pages in the editor
for( int i = 0; i < m_Editor.getCountOfElements(); i++ ) {
l_Page = m_Editor.getPageAt(i);
// page as byte[]
l_PDFPage = l_Page.getAsPdf();
merger.addSource(new ByteArrayInputStream(l_PDFPage));
}
merger.setDestinationStream(l_Stream);
merger.mergeDocuments(null);
So my questions:
Why is the performance (needed time AND memory usage) of pdfbox so bad in comparison to itext?
Am I missing something in our pdfbox implementation?
Why I can't close the "page document" after I added the page in "saveDocument"? If i close it there I'd get an error while saving so I have to store the "page documents" and close them at the end.
PDFBox and iText are architecturally different and, therefore, perform differently well for different tasks.
In particular iText attempts to write out new contents early, in your case much of the page is written to the output already during
l_Copier.addPage(l_Copier.getImportedPage(l_PDFReader, 1));
and
l_PDFDocument.close();
eventually only finalizes the PDF and writes last remaining objects and the file trailer.
PDFBox on the other hand saves everything in the end at once:
saveDocument.save(l_Stream);
The approach of iText has the advantage of a smaller memory footprint (as you observed) and the disadvantage that you cannot change data of a page once it is written.
(As an aside: the iText architecture has changed from iText 5 to iText 7, in iText 7 you have the choice and can keep everything in memory, but the price here also is a big memory footprint.)
Thus,
Why is the performance (needed time AND memory usage) of pdfbox so bad in comparison to itext?
The difference in memory use can partially be explained by the above. Also in iText after
l_Copier.freeReader(l_PDFReader);
the PdfReader can be closed (which you leave to the garbage collection to do for you) to free its resources while in your PDFBox code you keep all the source documents open, holding the resources up to the end. (Actually I would have assumed that when you're using importPage, you needn't keep them.)
Concerning the time I'm not sure now. You should do some finer clocking and determine where exactly the extra time is used in PDFBox; thus, I second #Tilman's request for profiling data. I assume it's during the final save but that's only a hunch. Also such time differences might depend on structural details of the PDF in question and may be less extreme for other documents.
I am maintaining the code which is using older version iText 4.2. Now, I am trying to merging multiple tagged PDF files into one using following codes:
Document document = new Document();
PdfCopy copy = new PdfCopy(document, new FileOutputStream(RESULT));
document.open();
PdfReader reader;
int n;
for (int i = 0; i < files.length; i++) {
reader = new PdfReader(files[i]);
n = reader.getNumberOfPages();
for (int page = 0; page < n; ) {
copy.addPage(copy.getImportedPage(reader, ++page));
}
copy.freeReader(reader);
reader.close();
}
document.close();
However, the tags are not copied over at all. since this is old version, I cannot find the function getImportedPage(reader, ++page, true) with third boolean parameter.
My question is whether the version of itext is not possible to achieve this? I also want to let you know I cannot upgrade my itext version to newer ones.
Thanks for any help!
I'm manipulating a PDF available from https://www.census.gov/content/dam/Census/library/publications/2015/econ/g13-aspef.pdf. Part of the manipulation is to copy the pages from the original PDF to a new PDF, and also to copy the named destinations. The iText Java API method addNamedDestinations isn't inserting the destinations into the new PDF.
Below is my code segment which is based on the example in the book iText in Action, 2nd edition.
try {
PdfReader reader1 = new PdfReader("C:\\Temp\\g13-aspef.pdf");
Document doc = new Document();
PdfCopy copy2 = new PdfCopy(doc, fileout);
doc.open();
reader1.consolidateNamedDestinations();
int n = reader1.getNumberOfPages();
for (int i = 0; i < n;) {
copy2.addPage(copy2.getImportedPage(reader1, ++i));
}
/* myDests indeed includes all 23 destinations appearing in the original PDF. */
HashMap<String,String> myDests = SimpleNamedDestination.getNamedDestination(reader1, false);
/* Use addNamedDestinations to insert the original destinations into the new PDF. */
copy2.addNamedDestinations(myDests, 0);
doc.close();
} catch (IOException e) {
System.out.println("Could not copy");
}
However, when I open the created PDF, the pages are there, but not the destinations. Why don't I see the destinations in the new PDF?
Thank you in advance!
I'd like to have a program that removes all rectangles from a PDF file. One use case for this is to unblacken a given PDF file to see if there is any hidden information behind the rectangles. The rest of the PDF file should be kept as-is.
Which PDF library is suitable to this task? In Java, I would like the code to look like this:
PdfDocument doc = PdfDocument.load(new File("original.pdf"));
PdfDocument unblackened = doc.transform(new CopyingPdfVisitor() {
public void visitRectangle(PdfRect rect) {
if (rect.getFillColor().getBrightness() >= 0.1) {
super.visitRectangle(rect);
}
}
});
unblackened.save(new File("unblackened.pdf"));
The CopyingPdfVisitor would copy a PDF document exactly as-is, and my custom code would leave out all the dark rectangles.
Itext pdf library have ways to modify pdf content.
The *ITEXT CONTENTPARSER Example * may give you any idea. "qname" parameter (qualified name) may be used to detected rectangle element.
http://itextpdf.com/book/chapter.php?id=15
Other option, if you want obtain the text on the document use the PdfReaderContentParser to extract text content
public void parsePdf(String pdf, String txt) throws IOException {
PdfReader reader = new PdfReader(pdf);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
PrintWriter out = new PrintWriter(new FileOutputStream(txt));
TextExtractionStrategy strategy;
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
out.println(strategy.getResultantText());
}
out.flush();
out.close();
reader.close();
}
example at http://itextpdf.com/examples/iia.php?id=277
I've successfully converted JPEG to Pdf using Java, but don't know how to convert Pdf to Word using Java, the code for converting JPEG to Pdf is given below.
Can anyone tell me how to convert Pdf to Word (.doc/ .docx) using Java?
import java.io.FileOutputStream;
import com.itextpdf.text.Image;
import com.itextpdf.text.pdf.PdfWriter;
import com.itextpdf.text.Document;
public class JpegToPDF {
public static void main(String[] args) {
try {
Document convertJpgToPdf = new Document();
PdfWriter.getInstance(convertJpgToPdf, new FileOutputStream(
"c:\\java\\ConvertImagetoPDF.pdf"));
convertJpgToPdf.open();
Image convertJpg = Image.getInstance("c:\\java\\test.jpg");
convertJpgToPdf.add(convertJpg);
convertJpgToPdf.close();
System.out.println("Successfully Converted JPG to PDF in iText");
} catch (Exception i1) {
i1.printStackTrace();
}
}
}
In fact, you need two libraries. Both libraries are open source. The first one is iText, it is used to extract the text from a PDF file. The second one is POI, it is ued to create the word document.
The code is quite simple:
//Create the word document
XWPFDocument doc = new XWPFDocument();
// Open the pdf file
String pdf = "myfile.pdf";
PdfReader reader = new PdfReader(pdf);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
// Read the PDF page by page
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
TextExtractionStrategy strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
// Extract the text
String text=strategy.getResultantText();
// Create a new paragraph in the word document, adding the extracted text
XWPFParagraph p = doc.createParagraph();
XWPFRun run = p.createRun();
run.setText(text);
// Adding a page break
run.addBreak(BreakType.PAGE);
}
// Write the word document
FileOutputStream out = new FileOutputStream("myfile.docx");
doc.write(out);
// Close all open files
out.close();
reader.close();
Beware: With the used extraction strategy, you will lose all formatting. But you can fix this, by inserting your own, more complex extraction strategy.
You can use 7-pdf library
have a look at this it may help :
http://www.7-pdf.de/sites/default/files/guide/manuals/library/index.html
PS: itext has some issues when given file is non RGB image, try this out!!
Although it's far from being a pure Java solution OpenOffice/LibreOfffice allows one to connect to it through a TCP port; it's possible to use that to convert documents. If this looks like an acceptable solution, JODConverter can help you.