PDDocument.load(file) isn't a method (PDFBox) - java

I wanted to make a simple program to get text content from a pdf file through Java. Here is the code:
PDFTextStripper ts = new PDFTextStripper();
File file = new File("C:\\Meeting IDs.pdf");
PDDocument doc1 = PDDocument.load(file);
String allText = ts.getText(doc1);
String gradeText = allText.substring(allText.indexOf("GRADE 10B"), allText.indexOf("GRADE 10C"));
System.out.println("Meeting ID for English: "
+ gradeText.substring(gradeText.indexOf("English") + 7, gradeText.indexOf("English") + 20));
This is just part of the code, but this is the part with the problem.
The error is: The method load(File) is undefined for the type PDDocument
I have learnt using PDFBox from JavaTPoint. I have followed the correct instructions for installing the PDFBox libraries and adding them to the Build Path.
My PDFBox version is 3.0.0
I have also searched the source files and their methods, and I am unable to find the load method there.
Thank you in advance.

As per the 3.0 migration guide, PDDocument.load has been replaced by the Loader methods:
For loading a PDF PDDocument.load has been replaced with the Loader
methods. The same is true for loading a FDF document.
When saving a PDF this will now be done in compressed mode per
default. To override that use PDDocument.save with
CompressParameters.NO_COMPRESSION.
PDFBox now loads a PDF Document incrementally reducing the initial
memory footprint. This will also reduce the memory needed to consume a
PDF if only certain parts of the PDF are accessed. Note that, due to
the nature of PDF, uses such as iterating over all pages, accessing
annotations, signing a PDF etc. might still load all parts of the PDF
overtime leading to a similar memory consumption as with PDFBox 2.0.
The input file must not be used as output for saving operations. It
will corrupt the file and throw an exception as parts of the file are
read the first time when saving it.
So you can either switch to an earlier 2.x version of PDFBox or use the new Loader class (org.apache.pdfbox.Loader). I believe this should work:
File file = new File("C:\\Meeting IDs.pdf");
PDDocument doc1 = Loader.loadPDF(file);
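Separately from the load problem, the substring calls in the question throw a StringIndexOutOfBoundsException whenever a marker such as "GRADE 10B" is missing from the extracted text, because indexOf then returns -1. A minimal sketch of a safer extraction, using only the JDK, might look like this. The class name MarkerText and the method between are hypothetical names introduced here, and unlike the question's code this version does not include the start marker in the result.

```java
import java.util.Optional;

// Hypothetical helper, not part of PDFBox.
final class MarkerText {

    private MarkerText() {
    }

    // Returns the text between the first occurrence of start and the next
    // occurrence of end after it, or an empty Optional if either is missing.
    static Optional<String> between(String text, String start, String end) {
        int from = text.indexOf(start);
        if (from < 0) {
            return Optional.empty();
        }
        int contentStart = from + start.length();
        int to = text.indexOf(end, contentStart);
        if (to < 0) {
            return Optional.empty();
        }
        return Optional.of(text.substring(contentStart, to));
    }
}
```

With the text returned by ts.getText(doc1), a call such as MarkerText.between(allText, "GRADE 10B", "GRADE 10C") would then yield an empty Optional instead of an exception when the PDF layout changes.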

Related

When and where do you close a PDDocument?

I get byte arrays of several PDFs from a backend source.
I load all these byte arrays into PDDocuments and add them to a list, like this:
List<PDDocument> pdfs = new ArrayList<>();
for (...the amount of bytearrays...) {
PDDocument pdf = PDDocument.load(bytearray);
pdfs.add(pdf);
}
I then merge these pdfs into one PDDocument:
PDDocument mergedPdf = new PDDocument();
PDFMergerUtility PDFmerger = new PDFMergerUtility();
for(...all pdfs in list...) {
PDFmerger.appendDocument(mergedPdf, pdf);
}
And then I save the mergedPdf to a file:
mergedPdf.save("c:\\temp\\mergeddoc.pdf");
My question is now: where do I call the close() method on these PDDocuments?
Is it right after loading them? But then I can't work with them any further, because I have closed them.
Or is it only needed at the end, after I do the save?
You're on the safest side if you call close() on the source documents after saving the destination document. There have been bugs in older PDFBox 2.0.* versions where the destination PDF still kept references on the source PDFs - usually these were tagged PDFs. The soon to be released (likely in March) version 2.0.14 has all of these bugs fixed, hopefully, and you can close the source PDF after calling appendDocument(). Obviously you can't call close() directly after loading because the document is needed for appendDocument().
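The order described above (save the merged document first, then close every source) can be sketched with a small helper. This is a minimal sketch that relies only on the fact that PDDocument implements java.io.Closeable; the class name DocumentCloser and the method closeAll are hypothetical names introduced here, not part of PDFBox.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.List;

// Hypothetical helper, not part of PDFBox: closes every document in a list,
// even if closing one of them fails.
final class DocumentCloser {

    private DocumentCloser() {
    }

    // Intended to be called only after the merged document has been saved.
    static void closeAll(List<? extends Closeable> documents) throws IOException {
        IOException first = null;
        for (Closeable document : documents) {
            try {
                document.close();
            } catch (IOException e) {
                if (first == null) {
                    first = e;              // remember the first failure
                } else {
                    first.addSuppressed(e); // attach later failures to it
                }
            }
        }
        if (first != null) {
            throw first;
        }
    }
}
```

With the list from the question, the call order would be mergedPdf.save(...), then DocumentCloser.closeAll(pdfs), then mergedPdf.close().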

Java- Does pdfBox have an option to open file instead of loading it?

I am using PDFBox in Java to attempt to extract text from the pdf file. This is how I load the file:
PDDocument document = PDDocument.load(new File(path1));
As you can see, it opens the file and loads everything inside it. This causes a problem when I try to load a huge file, say one with 10 million words: it throws an OutOfMemoryError: Java heap space.
I actually tested this and it does throw the error. The culprit was the line above.
Is there a way to open the file without loading its content in PDFBox?
I appreciate any suggestion.
Use:
PDDocument doc = PDDocument.load(file, MemoryUsageSetting.setupTempFileOnly());
This sets up buffering so that PDFBox uses only temporary files, with no size limit, instead of main memory.

How to clone a generated PDDocument in pdfbox 2.0.x?

I'm using pdfbox 2.0.12 to generate reports.
I want to create 2 versions in one go, with partly similar content.
(ie: generate 1-3 pages, clone, add more pages to each version, save)
What is the correct way to copy a PDDocument to a new PDDocument?
My files are fairly simple, just text and an image per page.
The existing Stack Overflow questions[1] use code from pdfbox 1.8 or otherwise no longer work today.
The multipdf.PDCloneUtility is marked deprecated for public use, and it is also not meant for generated PDFs.
I could not find an example in PDFbox tree that does this.
I'm using function importPage. This almost works, except there is some mixup with fonts.
The copied pages are correct in layout (some lines and an image), but the text is just dots because it cannot find the fonts used.
The ADDED pages in the copied doc are using copies of the same fonts, the text is fine.
When looking at font resources in Adobe Reader, in the copied doc, the used fonts are listed 2 times:
Roboto-Regular (Embedded Subset)
Type: TrueType (CID)
Encoding: Identity-H
Roboto-Regular
Type: TrueType (CID)
Encoding: Identity-H
Actual Font: Unknown
(etc)
When opening the copied doc, there's a warning
"Cannot find or create the font Roboto-Bold. Some characters may not display or print correctly"
In the source document, the fonts are listed once, exactly like the first entry above.
My code:
// Close content stream before copying
myContentStream.endText();
myContentStream.close();
// Copy pages
PDDocument result = new PDDocument();
result.setDocumentInformation(doc.getDocumentInformation());
int pageCount = doc.getNumberOfPages();
for (int i = 0; i < pageCount; ++i) {
PDPage page = doc.getPage(i);
PDPage importedPage = result.importPage(page);
// This is mentioned in importPage docs, bizarrely it's said to copy resources
importedPage.setRotation(page.getRotation());
// while this seems intuitive
importedPage.setResources(page.getResources());
}
// Fonts are recreated for copy by reloading from file
copy_plainfont = PDType0Font.load(result, new java.io.ByteArrayInputStream(plainfont_bytes));
//....etc
I have tried all combinations with and without importedPage.setRotation/setResources.
I've also tried using doc.getDocumentCatalog().getPages() and rolling through that. Same result.
[1]
I looked at
pdfbox: how to clone a page
Can duplicating a pdf with PDFBox be small like with iText?
and half a dozen more of varying irrelevance.
Grateful for any tips
/rasmus

how to use PDDocument.loadNonSeq, large pdf stripper/parsing text technique

I have some questions about parsing PDFs:
What is the purpose of using the
PDDocument.loadNonSeq method that takes a scratch/temporary file?
I have a big PDF and I need to parse it and get its text content. I use PDDocument.load() and then PDFTextStripper to extract the data page by page (PDFTextStripper has setStartPage(n) and setEndPage(n),
where n is incremented on every loop iteration). Is loadNonSeq more memory-efficient than load?
For example
File pdfFile = new File("mypdf.pdf");
File tmp_file = new File("result.tmp");
PDDocument doc = PDDocument.loadNonSeq(pdfFile, new RandomAccessFile(tmp_file, READ_WRITE));
int numpages = doc.getNumberOfPages();
// index is declared by the loop itself; pages are numbered from 1
for (int index = 1; index <= numpages; index++){
PDFTextStripper stripper = new PDFTextStripper();
Writer destination = new StringWriter();
String xml="";
stripper.setStartPage(index);
stripper.setEndPage(index);
stripper.writeText(doc, destination);
.... //filtering text and then convert it in xml
}
Is the code above a correct use of loadNonSeq, and is it good practice to read a PDF page by page to avoid wasting memory?
I read page by page because I need to write the text to XML using an in-memory DOM (with the stripping technique, I decided to produce one XML document per page).
what is the purpose of using PDDocument.loadNonSeq method that include a scratch/temporary file?
PDFBox implements two ways to read a PDF file:
loadNonSeq is the way documents should be loaded;
load is the way documents should not be loaded, although one might try to repair files with broken cross references this way.
In the 2.0.0 development branch, the algorithm formerly used for loadNonSeq is now used for load, and the algorithm formerly used for load is no longer used.
I have big pdf and i need to parse it and get text contents. I use PDDocument.load() and then PDFTextStripper to extract data page by page (pdfstripper have got setStartPage(n) and setEndPage(n) where n=n+1 every page loop ). Is more efficient for memory using loadNonSeq instead load?
Using loadNonSeq instead of load may improve memory usage for multi-revision PDFs because it only reads objects still referenced from the reference table while load can keep more in memory.
I don't know, though, whether using a scratch file makes a big difference.
is it a good practice to read PDF page per page without waste in memory?
Internally PDFBox parses the given range page after page, too. Thus, if you process the stripper output page-by-page, it certainly is ok to parse it page by page.

Unable to read a PDF file using PDFBOX

I am trying to fill in a PDF form using Java, but when I try to get the fields using the code below, the list is empty.
PDDocument pdDoc = PDDocument.load(filename);
PDAcroForm pdform = pdDoc.getDocumentCatalog().getAcroForm();
List<PDField> field = pdform.getFields();
Then I tried to read the file using PDFTextStripper:
PDFTextStripper stripper = new PDFTextStripper();
System.out.println(stripper.getText(pdDoc));
and the output was as follows:
"Please wait...
If this message is not eventually replaced by the proper contents of the document, your PDF
viewer may not be able to display this type of document.
You can upgrade to the latest version of Adobe Reader for Windows®, Mac, or Linux® by
visiting http://www.adobe.com/go/reader_download.
For more assistance with Adobe Reader visit http://www.adobe.com/go/acrreader.
Windows is either a registered trademark or a trademark of Microsoft Corporation in the United States and/or other countries. Mac is a trademark
of Apple Inc., registered in the United States and other countries. Linux is the registered trademark of Linus Torvalds in the U.S. and other
countries."
But I'm able to open the file manually and fill the fields as well. I've tried other tools like iText also. But again I wasn't able to get the fields.
How can I resolve this issue?
It may be too late to answer, but anyway: you can get an empty list if your PDF file uses an XFA structure.
PDDocument pdDoc = PDDocument.load(filename);
PDAcroForm pdform = pdDoc.getDocumentCatalog().getAcroForm();
List<PDField> field = pdform.getFields();
If the form is XFA-based, use these lines to start working with it instead:
PDXFA xfa = pdform.getXFA();
Document xfaDocument = xfa.getDocument();
NodeList elements = xfaDocument.getElementsByTagName( "SomeElement" );
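Once xfa.getDocument() has returned an org.w3c.dom.Document, the rest is plain JDK DOM code. The sketch below parses a small hand-written XML string in place of a real XFA packet. The class name XfaDomSketch, the element names (field, value) and the sample content are assumptions made for illustration; a real form will use whatever names its XFA data defines.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch: walks a DOM document the way one would walk the
// Document obtained from an XFA form.
final class XfaDomSketch {

    private XfaDomSketch() {
    }

    // Stand-in for xfa.getDocument(): parses XML text into a DOM Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Maps the "name" attribute of every element with the given tag name
    // to that element's trimmed text content.
    static Map<String, String> textByName(Document doc, String tagName) {
        Map<String, String> result = new LinkedHashMap<>();
        NodeList elements = doc.getElementsByTagName(tagName);
        for (int i = 0; i < elements.getLength(); i++) {
            Element element = (Element) elements.item(i);
            result.put(element.getAttribute("name"), element.getTextContent().trim());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample data, not taken from a real XFA form.
        String sample = "<form><field name=\"firstName\"><value>Ada</value></field>"
                + "<field name=\"city\"><value>London</value></field></form>";
        System.out.println(textByName(parse(sample), "field"));
    }
}
```

In the answer's snippet, the Document passed to textByName would be xfaDocument, and "SomeElement" would take the place of the tag name.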
While struggling with Alfresco's content search abilities, I had some trouble with pdfbox (used by Alfresco to extract text and metadata) reading PDF files written by old applications (like QuarkXPress) that use the old Acrobat 4.0 format. pdfbox seems unable to extract metadata or text from this old format, although the files were perfectly viewable in any PDF reader application.
The solution was to have all the old PDF files re-printed (saved as...) in a more modern PDF format (like 10.0, for instance). This can be done in batch with some bash scripting.
I did not try the intermediate Acrobat versions between 4.0 and 10.0.
