Unable to read a PDF file using PDFBOX


I am trying to fill in a PDF form using Java, but when I try to get the fields with the code below, the list is empty.
PDDocument pdDoc = PDDocument.load(filename);
PDAcroForm pdform = pdDoc.getDocumentCatalog().getAcroForm();
List<PDField> field = pdform.getFields();
Then I tried to read the file using PDFTextStripper:
PDFTextStripper stripper = new PDFTextStripper();
System.out.println(stripper.getText(pdDoc));
and the output was as follows:
"Please wait...
If this message is not eventually replaced by the proper contents of the document, your PDF
viewer may not be able to display this type of document.
You can upgrade to the latest version of Adobe Reader for Windows®, Mac, or Linux® by
visiting http://www.adobe.com/go/reader_download.
For more assistance with Adobe Reader visit http://www.adobe.com/go/acrreader.
Windows is either a registered trademark or a trademark of Microsoft Corporation in the United States and/or other countries. Mac is a trademark
of Apple Inc., registered in the United States and other countries. Linux is the registered trademark of Linus Torvalds in the U.S. and other
countries."
But I'm able to open the file manually and fill in the fields as well. I've also tried other tools such as iText, but again I wasn't able to get the fields.
How can I resolve this issue?

It may be too late to answer, but why not. You can get an empty list from the following code if your PDF file has an XFA structure:
PDDocument pdDoc = PDDocument.load(filename);
PDAcroForm pdform = pdDoc.getDocumentCatalog().getAcroForm();
List<PDField> field = pdform.getFields();
Use these lines to start working with the XFA content of the PDF instead:
PDXFA xfa = pdform.getXFA();
Document xfaDocument = xfa.getDocument();
NodeList elements = xfaDocument.getElementsByTagName( "SomeElement" );
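Since xfa.getDocument() returns a plain org.w3c.dom.Document, the rest is ordinary DOM work. Below is a minimal sketch of reading values out of such a document using only the Java standard library. The XML string stands in for the document returned by xfa.getDocument(), and the element names (form1, FirstName, City) as well as the class and method names are made up for illustration; a real XFA form will use its own names.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class XfaDomSketch {

    // Parses an XML string into a DOM Document; stands in for xfa.getDocument().
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    // Maps each direct child element of the first <parentTag> element to its trimmed text.
    static Map<String, String> childValues(Document doc, String parentTag) {
        Map<String, String> values = new LinkedHashMap<>();
        NodeList parents = doc.getElementsByTagName(parentTag);
        if (parents.getLength() == 0) {
            return values; // no such element: return an empty map
        }
        NodeList children = parents.item(0).getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                values.put(child.getNodeName(), child.getTextContent().trim());
            }
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical data fragment, only for demonstration.
        String xml = "<xfa:datasets xmlns:xfa=\"http://www.xfa.org/schema/xfa-data/1.0/\">"
                + "<xfa:data><form1><FirstName>Ada</FirstName><City>London</City></form1></xfa:data>"
                + "</xfa:datasets>";
        System.out.println(childValues(parse(xml), "form1"));
    }
}
```

With a real form you would pass the Document from xfa.getDocument() to childValues instead of parsing a string.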

While struggling with Alfresco's content search abilities, I had some trouble with pdfbox (used by Alfresco to extract text and metadata) reading PDF files written by old applications (like QuarkXPress) that use the old Acrobat 4.0 format. pdfbox seems unable to extract metadata or text from this old format, although the files were perfectly viewable with any PDF reader application.
The solution was to have all the old PDF files re-printed (saved as...) using a more modern PDF format (like 10.0, for instance). This can be done in a batch with some bash scripting.
I didn't try the intermediate Acrobat versions between 4.0 and 10.0.

Related

Convert PDF file version in java (not only header version!)

I'm trying to convert PDF version 1.3 to PDF version 1.5 or above. The challenge is that I don't want to change only the header version, which is what almost all forum posts describe. I want to convert the whole file.
I read this topic: convert PDF to an older version from a servlet?, but the new iText 7 does not support PdfStamper, so I skipped it.
I think I need to create a TMP file, write the PDF to TMP, replace the original with TMP, and delete TMP. But how can I do that in Java?
This code only converts the header version! I use iText 7.
WriterProperties wp = new WriterProperties();
wp.setPdfVersion(PdfVersion.PDF_1_7);
PdfDocument pdfDoc = new PdfDocument(new PdfReader("source"), new PdfWriter("destination", wp));
pdfDoc.close();
Any suggestions?
picture from Firefox (PDF v 1.3)
picture: no text available
Here you can download a sample PDF: https://wetransfer.com/downloads/ce2d2f41ac29c36baa2ac895ebc0473c20210922065257/5889b2

There is no need to convert PDF version 1.3 to PDF version 1.5 or above, as PDF is designed to be backwards compatible. Thus, every PDF 1.3 document is already also a PDF 1.4 document. And a PDF 1.5 document. And a PDF 1.6 document. ...
In a comment you explained why you want the version change nonetheless:
But if you want to open PDF 1.3 in Firefox, you cannot open it! So we have some clients which use firefox for opening PDF formats. Firefox only support 1.5 and above
In the light of the compatibility discussed above that does not make sense. But sometimes programs behave in a nonsensical way. Thus, I tested this.
The result: The Firefox 87.0 I have installed here accepts PDF 1.3 and PDF 1.4 files I found among my documents without any issue!
Unfortunately I don't have any PDF 1.2 (or earlier) files around here, so I cannot check support of such files.
Thus, I'm afraid you'll have to go back to analyzing the issue your customers have; it is not as simple as "Firefox only support 1.5 and above".
(Some ideas: Maybe your PDF 1.3 files actually are broken and Firefox fails to open them because of that; they may be broken already on your side or they may get broken during transfer to your clients. Or maybe your clients have some older Firefox version with some bugs in its PDF viewer.)
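To check what a file actually declares before blaming the version, the header can be read directly. Here is a minimal sketch using only the Java standard library (Java 9 or later, because of readNBytes); the class and method names are made up for illustration. Note that it only looks at the %PDF-x.y header. A PDF may also carry a /Version entry in its catalog, which this sketch does not inspect.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PdfHeaderVersion {

    private static final Pattern HEADER = Pattern.compile("%PDF-(\\d\\.\\d)");

    // Returns the version from the first %PDF-x.y marker in the given bytes, or null if none is found.
    static String versionOf(byte[] leadingBytes) {
        // ISO-8859-1 maps every byte to one char, so binary data cannot break the decoding.
        String head = new String(leadingBytes, StandardCharsets.ISO_8859_1);
        Matcher matcher = HEADER.matcher(head);
        return matcher.find() ? matcher.group(1) : null;
    }

    // Reads at most the first 1024 bytes of the file and extracts the header version from them.
    static String versionOf(Path pdf) throws IOException {
        byte[] buffer = new byte[1024];
        try (InputStream in = Files.newInputStream(pdf)) {
            int read = in.readNBytes(buffer, 0, buffer.length);
            return versionOf(Arrays.copyOf(buffer, read));
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) {
            // Usage: java PdfHeaderVersion.java some-file.pdf
            System.out.println(versionOf(Paths.get(args[0])));
        } else {
            // Without an argument, demonstrate on an in-memory header.
            byte[] sample = "%PDF-1.3\n".getBytes(StandardCharsets.ISO_8859_1);
            System.out.println(versionOf(sample));
        }
    }
}
```

Running this against the files the clients receive (not only the ones on your side) would also show whether they arrive with an intact header at all.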
A Fix For The Actual Problem
In comments here the OP provided example files. Analyzing them, it turned out that the actual problem is that Firefox cannot properly determine the built-in encoding of the embedded fonts.
To help Firefox in this regard, we can provide an explicit base encoding, so Firefox does not need the built-in encoding.
As you used iText 7 in your question, here is a proof of concept that works with your example PDF:
// Imports this proof of concept relies on (iText 7 kernel)
import com.itextpdf.kernel.pdf.*;
import java.util.Map.Entry;
try (   PdfReader pdfReader = new PdfReader("1100-SD-9000455596.pdf");
        PdfWriter pdfWriter = new PdfWriter("1100-SD-9000455596-Fixed.pdf");
        PdfDocument pdfDocument = new PdfDocument(pdfReader, pdfWriter) ) {
    for (int page = 1; page <= pdfDocument.getNumberOfPages(); page++) {
        PdfPage pdfPage = pdfDocument.getPage(page);
        PdfResources pdfResources = pdfPage.getResources();
        PdfDictionary fonts = pdfResources.getResource(PdfName.Font);
        if (fonts == null) {
            continue; // this page declares no font resources
        }
        for (Entry<PdfName, PdfObject> fontEntry : fonts.entrySet()) {
            PdfObject fontObject = fontEntry.getValue();
            // Resolve indirect references to the actual font dictionary
            if (fontObject != null && fontObject.getType() == PdfObject.INDIRECT_REFERENCE) {
                fontObject = ((PdfIndirectReference)fontObject).getRefersTo(true);
            }
            if (fontObject instanceof PdfDictionary) {
                PdfDictionary fontDictionary = (PdfDictionary) fontObject;
                PdfDictionary encodingDictionary = fontDictionary.getAsDictionary(PdfName.Encoding);
                if (encodingDictionary != null) {
                    // Only add a base encoding where Differences exist but no BaseEncoding is set
                    if (encodingDictionary.getAsName(PdfName.BaseEncoding) == null &&
                            encodingDictionary.getAsArray(PdfName.Differences) != null) {
                        encodingDictionary.put(PdfName.BaseEncoding, PdfName.WinAnsiEncoding);
                    }
                }
            }
        }
    }
}
(FixForFirefox test testFix1100_SD_9000455596)

PDDocument.load(file) isnt a method (PDFBox)

I wanted to make a simple program to get text content from a pdf file through Java. Here is the code:
PDFTextStripper ts = new PDFTextStripper();
File file = new File("C:\\Meeting IDs.pdf");
PDDocument doc1 = PDDocument.load(file);
String allText = ts.getText(doc1);
String gradeText = allText.substring(allText.indexOf("GRADE 10B"), allText.indexOf("GRADE 10C"));
System.out.println("Meeting ID for English: "
+ gradeText.substring(gradeText.indexOf("English") + 7, gradeText.indexOf("English") + 20));
This is just part of the code, but this is the part with the problem.
The error is: The method load(File) is undefined for the type PDDocument
I learned to use PDFBox from JavaTPoint. I followed the instructions for installing the PDFBox libraries and added them to the Build Path.
My PDFBox version is 3.0.0
I have also searched the source files and their methods, and I am unable to find the load method there.
Thank you in advance.

As per the 3.0 migration guide, the PDDocument.load method has been replaced with the Loader methods:
For loading a PDF PDDocument.load has been replaced with the Loader
methods. The same is true for loading a FDF document.
When saving a PDF this will now be done in compressed mode per
default. To override that use PDDocument.save with
CompressParameters.NO_COMPRESSION.
PDFBox now loads a PDF Document incrementally reducing the initial
memory footprint. This will also reduce the memory needed to consume a
PDF if only certain parts of the PDF are accessed. Note that, due to
the nature of PDF, uses such as iterating over all pages, accessing
annotations, signing a PDF etc. might still load all parts of the PDF
overtime leading to a similar memory consumption as with PDFBox 2.0.
The input file must not be used as output for saving operations. It
will corrupt the file and throw an exception as parts of the file are
read the first time when saving it.
So you can either swap to an earlier 2.x version of PDFBox, or you need to use the new Loader method. I believe this should work:
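// Loader lives in the org.apache.pdfbox package (PDFBox 3.0 and later)
import org.apache.pdfbox.Loader;
File file = new File("C:\\Meeting IDs.pdf");
PDDocument doc1 = Loader.loadPDF(file);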
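Regarding the last point of the quoted guide (the input file must not be used as output): if you still want the result to end up under the original name, one way is to save to a temporary file and swap it in after the document has been closed. Below is a minimal sketch using only java.nio.file. SafeReplace, Saver, writeToTemp and replaceOriginal are hypothetical names, and the Saver callback stands in for a call such as doc1.save(outputStream).

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class SafeReplace {

    // Hypothetical callback; with PDFBox this could be doc1::save.
    interface Saver {
        void saveTo(OutputStream out) throws IOException;
    }

    // Writes the output to a new temporary file next to the target and returns its path.
    static Path writeToTemp(Path target, Saver saver) throws IOException {
        Path dir = target.toAbsolutePath().getParent();
        Path temp = Files.createTempFile(dir, "pdf-", ".tmp");
        try (OutputStream out = Files.newOutputStream(temp)) {
            saver.saveTo(out);
        } catch (IOException e) {
            Files.deleteIfExists(temp); // do not leave a half-written file behind
            throw e;
        }
        return temp;
    }

    // Moves the temporary file over the original; call this after the source document is closed.
    static void replaceOriginal(Path temp, Path target) throws IOException {
        Files.move(temp, target, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        // Demonstration with plain bytes instead of a real PDF.
        Path target = Files.createTempFile("original-", ".pdf");
        Files.write(target, "old content".getBytes(StandardCharsets.UTF_8));

        Path temp = writeToTemp(target, out -> out.write("new content".getBytes(StandardCharsets.UTF_8)));
        replaceOriginal(temp, target);

        System.out.println(new String(Files.readAllBytes(target), StandardCharsets.UTF_8));
        Files.deleteIfExists(target);
    }
}
```

With PDFBox the order would be: load the document, call writeToTemp(path, doc1::save), close the document, and only then call replaceOriginal, so the original file is no longer being read when it is overwritten.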

How to clone a generated PDDocument in pdfbox 2.0.x?

I'm using pdfbox 2.0.12 to generate reports.
I want to create 2 versions in one go, with partly similar content.
(ie: generate 1-3 pages, clone, add more pages to each version, save)
What is the correct way to copy a PDDocument to a new PDDocument?
My files are fairly simple, just text and an image per page.
The existing Stack Overflow questions [1] use code from pdfbox 1.8 or otherwise don't work today.
The multipdf.PDCloneUtility is marked deprecated for public use and is also not meant for generated PDFs.
I could not find an example in PDFbox tree that does this.
I'm using function importPage. This almost works, except there is some mixup with fonts.
The copied pages are correct in layout (some lines and an image), but the text is just dots because it cannot find the fonts used.
The ADDED pages in the copied doc are using copies of the same fonts, the text is fine.
When looking at font resources in Adobe Reader, in the copied doc, the used fonts are listed 2 times:
Roboto-Regular (Embedded Subset)
Type: TrueType (CID)
Encoding: Identity-H
Roboto-Regular
Type: TrueType (CID)
Encoding: Identity-H
Actual Font: Unknown
(etc)
When opening the copied doc, there's a warning
"Cannot find or create the font Roboto-Bold. Some characters may not display or print correctly"
In the source document, the fonts are listed once, exactly like the first entry above.
My code:
// Close content stream before copying
myContentStream.endText();
myContentStream.close();
// Copy pages
PDDocument result = new PDDocument();
result.setDocumentInformation(doc.getDocumentInformation());
int pageCount = doc.getNumberOfPages();
for (int i = 0; i < pageCount; ++i) {
    PDPage page = doc.getPage(i);
    PDPage importedPage = result.importPage(page);
    // This is mentioned in importPage docs, bizarrely it's said to copy resources
    importedPage.setRotation(page.getRotation());
    // while this seems intuitive
    importedPage.setResources(page.getResources());
}
// Fonts are recreated for copy by reloading from file
copy_plainfont = PDType0Font.load(result, new java.io.ByteArrayInputStream(plainfont_bytes));
//....etc
I have tried all combinations with and without importedPage.setRotation/setResources.
I've also tried using doc.getDocumentCatalog().getPages() and rolling through that. Same result.
[1]
I looked at
pdfbox: how to clone a page
Can duplicating a pdf with PDFBox be small like with iText?
and half a dozen more of varying irrelevance.
Grateful for any tips
/rasmus

PDFBox text extraction - empty output

I'm trying to extract some information from a set of PDFs. This works so far, but one PDF is giving me grief.
I'm using PDFBox 1.8.8, with Java 7.
PDDocument document = PDDocument.load(pdfFile);
PDFTextStripper stripper = new PDFTextStripper();
System.out.println("File: "+pdfFile.getAbsolutePath()+" readable: "+pdfFile.canRead()+" size: "+pdfFile.length());
System.out.println(stripper.getText(document));
It just prints
File: /foo/bar/mypdf.pdf readable: true size: 1267743
Then it terminates. Usually I use the writeText method and funnel the text through a stream, but the code above is simplified. I've tried converting the PDF with pdftotext, and it works just like it does for the others.
I get no exception, no nothing. Any ideas?
EDIT:
Additional Info: Created with Acrobat Distiller 9.0.0 (Windows), Format PDF-1.6; The other PDFs are Version 1.4 and 1.5
Doesn't seem to contain exotic characters. I can mark/copy text in Evince PDF-viewer
EDIT2:
Dang it. File property dialog (Nautilus) said "Security: No", but pdfinfo gives me:
Encrypted: yes (print:yes copy:no change:no addNotes:no algorithm:AES)
Any way to circumvent that? After all, pdftotext could get the text out.

The document was "encrypted" (write protected), but with no user password set. This Stack Overflow answer shows how you can remove the encryption and simply read the file: remove encryption from pdf with pdfbox, like qpdf

Creating PDF from Word (DOC) using Apache POI and iText in JAVA

I am trying to generate a PDF document from a *.doc document.
So far, thanks to Stack Overflow, I have managed to generate it, but with some problems.
My sample code below generates the PDF without formatting or images, just the text.
The document includes blank spaces and images which are not included in the PDF.
Here is the code:
in = new FileInputStream(sourceFile.getAbsolutePath());
out = new FileOutputStream(outputFile);
WordExtractor wd = new WordExtractor(in);
String text = wd.getText();
Document pdf= new Document(PageSize.A4);
PdfWriter.getInstance(pdf, out);
pdf.open();
pdf.add(new Paragraph(text));

docx4j includes code for creating a PDF from a docx using iText. It can also use POI to convert a doc to a docx.
There was a time when we supported both methods equally (as well as PDF via XHTML), but we decided to focus on XSL-FO.
If it's an option, you'd be much better off using docx4j to convert a docx to PDF via XSL-FO and FOP.
Use it like so:
wordMLPackage = WordprocessingMLPackage.load(new java.io.File(inputfilepath));
// Set up font mapper
Mapper fontMapper = new IdentityPlusMapper();
wordMLPackage.setFontMapper(fontMapper);
// Example of mapping missing font Algerian to installed font Comic Sans MS
PhysicalFont font
        = PhysicalFonts.getPhysicalFonts().get("Comic Sans MS");
fontMapper.getFontMappings().put("Algerian", font);
org.docx4j.convert.out.pdf.PdfConversion c
        = new org.docx4j.convert.out.pdf.viaXSLFO.Conversion(wordMLPackage);
        // = new org.docx4j.convert.out.pdf.viaIText.Conversion(wordMLPackage);
OutputStream os = new java.io.FileOutputStream(inputfilepath + ".pdf");
c.output(os);
Update July 2016
As of docx4j 3.3.0, Plutext's commercial PDF renderer is docx4j's default option for docx to PDF conversion. You can try an online demo at converter-eval.plutext.com
If you want to use the existing docx to XSL-FO to PDF (or other target supported by Apache FOP) approach, then just add the docx4j-export-FO jar to your classpath.
Either way, to convert docx to PDF, you can use the Docx4J facade's toPDF method.
The old docx to PDF via iText code can be found at https://github.com/plutext/docx4j-export-FO/.../docx4j-extras/PdfViaIText/

WordExtractor just grabs the plain text, nothing else. That's why all you're seeing is the plain text.
What you'll need to do is get each paragraph individually, then grab each run, fetch the formatting, and generate the equivalent in PDF.
One option may be to find some code that turns XHTML into a PDF. Then, use Apache Tika to turn your word document into XHTML (it uses POI under the hood, and handles all the formatting stuff for you), and from the XHTML on to PDF.
Otherwise, if you're going to do it yourself, take a look at the code in Apache Tika for parsing word files. It's a really great example of how to get at the images, the formatting, the styles etc.

I have successfully used Apache FOP to convert a 'WordML' document to PDF. WordML is the Office 2003 way of saving a Word document as XML. XSLT stylesheets can be found on the web to transform this XML to XSL-FO, which in turn can be rendered by FOP into PDF (among other outputs).
It's not so different from the solution plutext offered, except that it doesn't read a .doc document, whereas docx4j apparently does. If your requirements are flexible enough to have WordML style documents as input, this might be worth looking into.
Good luck with your project!
Wim

Use OpenOffice/LibreOffice and JODConverter
This also mostly works for .doc to .docx. There are problems with graphics that I have not yet worked out, though.
private static void transformDocXToPDFUsingJOD(File in, File out)
{
    OfficeDocumentConverter converter = new OfficeDocumentConverter(officeManager);
    DocumentFormat pdf = converter.getFormatRegistry().getFormatByExtension("pdf");
    converter.convert(in, out, pdf);
}
private static OfficeManager officeManager;
@BeforeClass
public static void setupStatic() throws IOException {
    /*officeManager = new DefaultOfficeManagerConfiguration()
        .setOfficeHome("C:/Program Files/LibreOffice 3.6")
        .buildOfficeManager();
    */
    officeManager = new ExternalOfficeManagerConfiguration().setConnectOnStart(true).setPortNumber(8100).buildOfficeManager();
    officeManager.start();
}
@AfterClass
public static void shutdownStatic() throws IOException {
    officeManager.stop();
}
You need to be running LibreOffice as a server to make this work.
From the command line you can do this using:
"C:\Program Files\LibreOffice 3.6\program\soffice.exe" -accept="socket,host=0.0.0.0,port=8100;urp;LibreOffice.ServiceManager" -headless -nodefault -nofirststartwizard -nolockcheck -nologo -norestore

Another option I came across recently is using the OpenOffice (or LibreOffice) API (see here). I have not been able to get into this but it should be able to open documents in various formats and output them in a pdf format. If you look into this, let me know how it worked!
