I am trying to copy annotations from one PDF to another, but copying even one annotation doubles the size of the output PDF file.
Here is a simple code sample:
PDDocument pdf = PDDocument.load(new File("test1.pdf"));
PDDocument pdf2 = PDDocument.load(new File("test/test1.pdf"));
List<PDAnnotation> pdfAnnotations1 = pdf.getPage(0).getAnnotations();
List<PDAnnotation> pdfAnnotations2 = pdf2.getPage(0).getAnnotations();
pdfAnnotations1.add(pdfAnnotations2.get(0));
pdf.save("test1.pdf");
If I open this output file in Adobe Reader and save it again, the size goes back to normal. Any thoughts?
Thank you very much in advance for any help.
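Each annotation points back to the page it is on. So you need to correct that as well by calling setPage on the copied annotation, for example pdfAnnotations1.get(0).setPage(pdf.getPage(0)) if the copied annotation is the first entry in the target list.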
The size increase is because without the call I described, the annotation will point back to the old page, which points back to its parent, etc.
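To see why a stale back-reference inflates the file, here is a toy model in plain Java. It is not PDFBox code: `Node`, `newDocument`, and `reachable` are hypothetical stand-ins for PDF objects, and the assumption is that a writer saves every object reachable from the document root, which is roughly what the explanation above describes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

final class BackReferenceDemo {
    /** Toy stand-in for a PDF object: a name plus outgoing references. Not a PDFBox class. */
    static final class Node {
        final String name;
        final List<Node> refs = new ArrayList<>();

        Node(String name) {
            this.name = name;
        }
    }

    /** Builds a tiny document and returns {root, page}; the page also references a content object. */
    static Node[] newDocument(String prefix) {
        Node root = new Node(prefix + "-root");
        Node page = new Node(prefix + "-page");
        Node content = new Node(prefix + "-content");
        root.refs.add(page);
        page.refs.add(root); // page points back to its parent
        page.refs.add(content);
        return new Node[] {root, page};
    }

    /** Counts every object reachable from root, i.e. what a writer would have to save. */
    static int reachable(Node root) {
        Set<Node> seen = Collections.newSetFromMap(new IdentityHashMap<Node, Boolean>());
        Deque<Node> todo = new ArrayDeque<>();
        todo.push(root);
        while (!todo.isEmpty()) {
            Node current = todo.pop();
            if (seen.add(current)) {
                current.refs.forEach(todo::push);
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        Node[] target = newDocument("target");
        Node[] source = newDocument("source");

        Node annotation = new Node("annotation");
        annotation.refs.add(source[1]);  // back-reference still points to the OLD page
        target[1].refs.add(annotation);  // copy the annotation onto the target page
        System.out.println("stale back-reference: " + reachable(target[0]));

        annotation.refs.set(0, target[1]); // toy equivalent of setPage(targetPage)
        System.out.println("fixed back-reference: " + reachable(target[0]));
    }
}
```

In this toy graph the stale reference makes 7 objects reachable from the target root (the whole source document comes along), while re-pointing the annotation brings the count back to 4.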
Related
I wanted to make a simple program to get text content from a pdf file through Java. Here is the code:
PDFTextStripper ts = new PDFTextStripper();
File file = new File("C:\\Meeting IDs.pdf");
PDDocument doc1 = PDDocument.load(file);
String allText = ts.getText(doc1);
String gradeText = allText.substring(allText.indexOf("GRADE 10B"), allText.indexOf("GRADE 10C"));
System.out.println("Meeting ID for English: "
+ gradeText.substring(gradeText.indexOf("English") + 7, gradeText.indexOf("English") + 20));
This is just part of the code, but this is the part with the problem.
The error is: The method load(File) is undefined for the type PDDocument
I learned to use PDFBox from JavaTPoint. I followed the instructions for installing the PDFBox libraries and added them to the Build Path.
My PDFBox version is 3.0.0
I have also searched the source files and their methods, and I am unable to find the load method there.
Thank you in advance.
As per the 3.0 migration guide, the PDDocument.load method has been replaced with the Loader methods:
For loading a PDF PDDocument.load has been replaced with the Loader
methods. The same is true for loading a FDF document.
When saving a PDF this will now be done in compressed mode per
default. To override that use PDDocument.save with
CompressParameters.NO_COMPRESSION.
PDFBox now loads a PDF Document incrementally reducing the initial
memory footprint. This will also reduce the memory needed to consume a
PDF if only certain parts of the PDF are accessed. Note that, due to
the nature of PDF, uses such as iterating over all pages, accessing
annotations, signing a PDF etc. might still load all parts of the PDF
overtime leading to a similar memory consumption as with PDFBox 2.0.
The input file must not be used as output for saving operations. It
will corrupt the file and throw an exception as parts of the file are
read the first time when saving it.
So you can either swap to an earlier 2.x version of PDFBox, or you need to use the new Loader method. I believe this should work:
File file = new File("C:\\Meeting IDs.pdf");
PDDocument doc1 = Loader.loadPDF(file);
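Since the migration guide quoted above says the input file must not be used as the output, one way to still end up with the result under the original name is to write to a temporary file and move it into place afterwards. The sketch below uses only the JDK; `SafeOverwrite`, `replaceVia`, and the `Transform` callback are hypothetical names. The PDFBox-specific part (load from `in`, modify, save to `out`, close the document) would go inside the callback, and the demo uses a plain text file as a stand-in.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class SafeOverwrite {
    /** Reads from 'in' and writes the result to 'out'; the two are never the same file. */
    @FunctionalInterface
    interface Transform {
        void apply(Path in, Path out) throws IOException;
    }

    /**
     * Runs the transform into a temporary file next to 'original' and, only if
     * that succeeds, moves the temporary file over the original.
     */
    static void replaceVia(Path original, Transform transform) throws IOException {
        Path dir = original.toAbsolutePath().getParent();
        Path temp = Files.createTempFile(dir, "pdf-", ".tmp");
        try {
            transform.apply(original, temp);
            Files.move(temp, original, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            Files.deleteIfExists(temp); // does nothing after a successful move
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("demo", ".txt");
        Files.write(file, "before".getBytes(StandardCharsets.UTF_8));

        // Stand-in for: load the PDF from 'in', change it, save it to 'out', close it.
        replaceVia(file, (in, out) -> {
            String text = new String(Files.readAllBytes(in), StandardCharsets.UTF_8);
            Files.write(out, (text + "-after").getBytes(StandardCharsets.UTF_8));
        });

        System.out.println(new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
        Files.delete(file);
    }
}
```

The callback receives both paths so that the document can be fully closed before the move happens, which matters on platforms that refuse to replace a file that is still open.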
I am using PDFBox in Java to attempt to extract text from the pdf file. This is how I load the file:
PDDocument document = PDDocument.load(new File(path1));
As you can see, it opens the file and loads its contents. This causes a problem when the file is huge, say 10 million words: loading it throws an OutOfMemoryError: Java heap space.
I actually tested this and it does throw an error. And the culprit was the line above.
Is there a way to open the file without loading its content in PDFBox?
I appreciate any suggestion.
Use:
PDDocument doc = PDDocument.load(file, MemoryUsageSetting.setupTempFileOnly());
This sets up buffering so that only temporary files are used instead of main memory, with no size limit.
I'm using pdfbox 2.0.12 to generate reports.
I want to create 2 versions in one go, with partly similar content.
(i.e., generate 1-3 pages, clone, add more pages to each version, save)
What is the correct way to copy a PDDocument to a new PDDocument?
My files are fairly simple, just text and an image per page.
The existing Stack Overflow questions [1] use code from PDFBox 1.8, or otherwise don't work today.
The multipdf.PDFCloneUtility is marked deprecated for public use and is also documented as not for use with generated PDFs.
I could not find an example in PDFbox tree that does this.
I'm using function importPage. This almost works, except there is some mixup with fonts.
The copied pages are correct in layout (some lines and an image), but the text is just dots because it cannot find the fonts used.
The ADDED pages in the copied doc are using copies of the same fonts, the text is fine.
When looking at font resources in Adobe Reader, in the copied doc, the used fonts are listed 2 times:
Roboto-Regular (Embedded Subset)
Type: TrueType (CID)
Encoding: Identity-H
Roboto-Regular
Type: TrueType (CID)
Encoding: Identity-H
Actual Font: Unknown
(etc)
When opening the copied doc, there's a warning
"Cannot find or create the font Roboto-Bold. Some characters may not display or print correctly"
In the source document, the fonts are listed once, exactly like the first entry above.
My code:
// Close content stream before copying
myContentStream.endText();
myContentStream.close();
// Copy pages
PDDocument result = new PDDocument();
result.setDocumentInformation(doc.getDocumentInformation());
int pageCount = doc.getNumberOfPages();
for (int i = 0; i < pageCount; ++i) {
    PDPage page = doc.getPage(i);
    PDPage importedPage = result.importPage(page);
    // This is mentioned in importPage docs, bizarrely it's said to copy resources
    importedPage.setRotation(page.getRotation());
    // while this seems intuitive
    importedPage.setResources(page.getResources());
}
// Fonts are recreated for copy by reloading from file
copy_plainfont = PDType0Font.load(result, new java.io.ByteArrayInputStream(plainfont_bytes));
//....etc
I have tried all combinations with and without importedPage.setRotation/setResources.
I've also tried using doc.getDocumentCatalog().getPages() and rolling through that. Same result.
[1]
I looked at
pdfbox: how to clone a page
Can duplicating a pdf with PDFBox be small like with iText?
and half a dozen more of varying irrelevance.
Grateful for any tips
/rasmus
I am developing a module that is supposed to print documents from the server. The requirements are as follows:
The module should be able to print a PDF from a URL, with and without saving it.
The module should be able to accept page numbers as parameters and print/save only those pages.
The module should be able to accept the printer name as a parameter and use only that printer.
Is there a library available for this? How should I go about implementing it?
The answer was Apache PDFBox. I was able to load the PDF into a PDDocument object like this:
PDDocument pdf = PDDocument.load(new URL(download_pdf_from).openStream());
Splitting the document was as easy as:
Splitter splitter = new Splitter();
List<PDDocument> splittedDocuments = splitter.split(pdf);
Now, to get a reference to any particular page:
splittedDocuments.get(pageNo);
Saving the entire document or even a given page number:
pdf.save(path); //saving the entire document to device
splittedDocuments.get(pageNo).save(path); //saving a particular page number to device
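Because the list returned by the splitter is indexed from 0 while users usually pass page numbers starting at 1, the page-number parameter needs translating and validating before it is used with splittedDocuments.get(...). A minimal JDK-only sketch follows, assuming a hypothetical parameter format such as "1,3-5" (the requirements above do not specify one); `PageSelection` and `toIndices` are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

final class PageSelection {
    /**
     * Converts a spec like "1,3-5" (1-based pages, inclusive ranges) into sorted,
     * de-duplicated 0-based list indices. Pages outside 1..pageCount are rejected.
     */
    static List<Integer> toIndices(String spec, int pageCount) {
        TreeSet<Integer> indices = new TreeSet<>();
        for (String part : spec.split(",")) {
            String[] bounds = part.trim().split("-", 2);
            int first = Integer.parseInt(bounds[0].trim());
            int last = bounds.length == 2 ? Integer.parseInt(bounds[1].trim()) : first;
            if (first < 1 || last < first || last > pageCount) {
                throw new IllegalArgumentException("Invalid page range: " + part.trim());
            }
            for (int page = first; page <= last; page++) {
                indices.add(page - 1); // list indices start at 0
            }
        }
        return new ArrayList<>(indices);
    }

    public static void main(String[] args) {
        // Pages 1, 3, 4 and 5 of a 10-page document
        System.out.println(toIndices("1,3-5", 10));
    }
}
```

Each returned index could then be passed to splittedDocuments.get(index) to save or print just that page.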
For the printing part, this answer helped me.
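For the printer-name requirement, the JDK's `javax.print` API can list the installed print services so that one can be picked by name. The sketch below covers only the lookup; `PrinterLookup` and `findByName` are hypothetical names, and matching case-insensitively on `PrintService.getName()` is an assumption about how the parameter should be interpreted.

```java
import java.util.Arrays;
import java.util.Optional;
import javax.print.PrintService;
import javax.print.PrintServiceLookup;

final class PrinterLookup {
    /** Returns the first service whose name equals 'printerName', ignoring case. */
    static Optional<PrintService> findByName(PrintService[] services, String printerName) {
        return Arrays.stream(services)
                .filter(service -> service.getName().equalsIgnoreCase(printerName))
                .findFirst();
    }

    public static void main(String[] args) {
        // null, null = no restriction on document flavor or print attributes
        PrintService[] installed = PrintServiceLookup.lookupPrintServices(null, null);
        String wanted = args.length > 0 ? args[0] : "";
        System.out.println(findByName(installed, wanted)
                .map(PrintService::getName)
                .orElse("No printer named '" + wanted + "'"));
    }
}
```

A service found this way can be handed to `java.awt.print.PrinterJob.setPrintService` before printing; failing early when the name is unknown avoids silently falling back to the default printer.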
I have some questions about parsing PDFs and how to do it:
What is the purpose of using the PDDocument.loadNonSeq method that includes a scratch/temporary file?
I have a big PDF and I need to parse it and get its text content. I use PDDocument.load() and then PDFTextStripper to extract the data page by page (PDFTextStripper has setStartPage(n) and setEndPage(n), where n is incremented by one on every loop iteration). Is loadNonSeq more memory-efficient than load?
For example
File pdfFile = new File("mypdf.pdf");
File tmp_file = new File("result.tmp");
// READ_WRITE is a constant defined elsewhere (presumably the "rw" file mode)
PDDocument doc = PDDocument.loadNonSeq(pdfFile, new RandomAccessFile(tmp_file, READ_WRITE));
int numpages = doc.getNumberOfPages();
// Page numbers in PDFTextStripper are 1-based
for (int index = 1; index <= numpages; index++) {
    PDFTextStripper stripper = new PDFTextStripper();
    Writer destination = new StringWriter();
    String xml = "";
    stripper.setStartPage(index);
    stripper.setEndPage(index);
    stripper.writeText(doc, destination);
    .... // filtering text and then converting it to XML
}
Is the code above a correct use of loadNonSeq, and is it good practice to read a PDF page by page to avoid wasting memory?
I read page by page because I need to write the text to XML using an in-memory DOM (with this stripping technique, I produce one XML document per page).
What is the purpose of using the PDDocument.loadNonSeq method that includes a scratch/temporary file?
PDFBox implements two ways to read a PDF file.
loadNonSeq is the way documents should be loaded
load is the way documents should not be loaded, but one might try to repair files with broken cross references this way
In the 2.0.0 development branch, the algorithm formerly used for loadNonSeq is now used for load and the algorithm formerly used for load is not used anymore.
I have a big PDF and I need to parse it and get its text content. I use PDDocument.load() and then PDFTextStripper to extract the data page by page (PDFTextStripper has setStartPage(n) and setEndPage(n), where n is incremented by one on every loop iteration). Is loadNonSeq more memory-efficient than load?
Using loadNonSeq instead of load may improve memory usage for multi-revision PDFs because it only reads objects still referenced from the reference table while load can keep more in memory.
I don't know, though, whether using a scratch file makes a big difference.
Is it good practice to read a PDF page by page to avoid wasting memory?
Internally PDFBox parses the given range page after page, too. Thus, if you process the stripper output page-by-page, it certainly is ok to parse it page by page.