PdfBox - Unable to extract some text from pdf

I am trying to extract text from a PDF using PDFBox. However, I am unable to extract all the text from a table. See the image below (snipped from the PDF).
(Some confidential text has been highlighted.)
I am able to get the text out of the 1st table (in orange) and the 3rd table (the General Information one), but I am unable to extract anything from the 2nd table.
In the output I just see a couple of blank lines between the output of the 1st and 3rd tables.
Here is my code:
PDDocument doc = PDDocument.load(new File("...."));
PDFTextStripper pdfStripper = new PDFTextStripper();
String text = pdfStripper.getText(doc);
System.out.println(text);
doc.close();
Any inputs or suggestions?

I found the issue. The content was being extracted, but in a different order from how it appears on the page.
The PDF had several tables placed one after the other, and the content of this table was being output after the content of a few tables that follow it on the page. For example, if there were 6 tables and this was the 2nd from the top, its content appeared in the 5th position of the output instead of the 2nd.
As suggested by Tilman in the comments, calling pdfStripper.setSortByPosition(true) puts the content in the expected places.
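In the question's code, that means calling pdfStripper.setSortByPosition(true) before pdfStripper.getText(doc). Without it, the stripper emits text in the order it is stored in the file, which need not match the visual layout. The sketch below is a simplified, library-independent model of that difference, not PDFBox code: the Chunk type and readingOrder method are hypothetical names used only for illustration, and real position sorting is more involved than a plain top-to-bottom, left-to-right sort.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Simplified model of position-based text ordering; this is not PDFBox code. */
final class SortByPositionSketch {

    /** Hypothetical type: a piece of text plus the page coordinates where it is drawn. */
    static final class Chunk {
        final String text;
        final double x;        // distance from the left edge of the page
        final double yFromTop; // distance from the top edge of the page

        Chunk(String text, double x, double yFromTop) {
            this.text = text;
            this.x = x;
            this.yFromTop = yFromTop;
        }
    }

    /** Returns the chunk texts ordered top to bottom, then left to right, one per line. */
    static String readingOrder(List<Chunk> storedOrder) {
        List<Chunk> sorted = new ArrayList<>(storedOrder);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.yFromTop)
                .thenComparingDouble(c -> c.x));
        StringBuilder out = new StringBuilder();
        for (Chunk c : sorted) {
            out.append(c.text).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Stored order: "table 2" was written to the file last,
        // even though it sits between tables 1 and 3 on the page.
        List<Chunk> stored = new ArrayList<>();
        stored.add(new Chunk("table 1", 50, 100));
        stored.add(new Chunk("table 3", 50, 300));
        stored.add(new Chunk("table 2", 50, 200));
        System.out.print(readingOrder(stored));
    }
}
```

Reading the stored list as-is reproduces the symptom from the question (table 2 after table 3); sorting by position restores the visual order.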

Related

iTextPDF 7 - Table of content with clickable page numbers

I'm trying to create a table of content with iTextPDF 7 in Java from existing PDFs.
I've tried to use Link link = new Link() and PdfAction.createGoTo(PdfExplicitDestination.createFit(pageNum)) but to no avail.
I can't link the page number/row in TOC to the respective page in the end PDF (after merging).
The documentation on their website only gets me so far and I'm out of ideas.
What is the correct way to create a TOC for existing PDFs and how to implement it correctly?

PDF Box flatten PDF causes weird spacing

I'm having an issue with PDFBox flattening a PDF generated by Adobe Acrobat DC.
The Adobe Acrobat text field I created is absolutely the default text field.
In my example below, I have a PatientName field with the text value "Douglas McDouggelman".
When I flatten the PDF, here's what it looks like:
Anyone know what's up with this bizarre spacing?
It appears that the space + next character are combined. This is what it looks like when you try to select that character.
Code:
try (PDDocument document = PDDocument.load(pdfFormInputStream)) {
    PDDocumentCatalog catalog = document.getDocumentCatalog();
    PDAcroForm acroForm = catalog.getAcroForm();
    acroForm.getField("PatientName").setValue("Douglas McDouggelman");
    ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
    if (flattenPdfs) {
        acroForm.flatten();
    }
    document.save(byteArrayOutputStream);
}

I realized this PDF had been made by another group, and who knows what they did. So I found the source Word document, recreated the form in Adobe Acrobat DC, and added the fields back to the document. After that it was totally fine.
PDFBox was not the problem. It was some unknown incorrect step taken by the person who originally prepared the PDF.

How to update table of content (TOC) for docx file by apache poi

How can I update the table of contents (TOC) and then convert the docx file to PDF?
I want to create and update a table of contents in a docx file and then convert it to PDF, so that the updated TOC appears in the PDF file.
The following code only updates the TOC when the user opens the docx file, which is not what I need:
docx.enforceUpdateFields();
I want the TOC to be updated automatically by my program. Thanks.

Based on a comment in this link, I found out that there is no way to update the TOC automatically. Ben said in that comment: "It then relies on a user to open the document in Word to generate the actual table of contents. Unfortunately, Word can't be made to update the TOC automatically and without user interaction".

add an invisible paragraph in a word document using Apache POI

I have a Word document (.docx) that contains some information. I want to edit this document and add some text to it. The text should stay invisible when I open the document, but I also want to be able to access it easily from my code. Do you have any idea how I can proceed?

I was in fact looking for a solution with Apache POI to make my text invisible in the generated Word document. After some research, I found this solution:
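for (XWPFParagraph paragraph : doc.getParagraphs()) {
    for (XWPFRun run : paragraph.getRuns()) {
        // getRPr() returns null for a run without run properties, so create them if missing
        CTRPr rPr = run.getCTR().isSetRPr() ? run.getCTR().getRPr() : run.getCTR().addNewRPr();
        // An empty CTOnOff element counts as "on", which marks the run as hidden text
        rPr.setVanish(CTOnOff.Factory.newInstance());
    }
}
This code makes all paragraphs of a Word document invisible to the user.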

Generating Table of Contents using XMLWorker

I am generating a PDF using iText and XMLWorker. The problem is that we need to generate a TOC for the PDF with page numbers. I have my section headings in a list. With this list I can generate the TOC without page numbers, but our requirement is that page numbers are included. Below is my list containing the section details.
List<String> sectionList=new ArrayList<String>();
sectionList.add("Section1");
sectionList.add("Section2");
sectionList.add("Section3");
sectionList.add("Section4");
sectionList.add("Section5");
My CLOB object is
String pdfString="<h1>Section1</h1><p>Some content for section1</p>" +
"<h1>Section2</h1><p>Some content for section2</p>" +
"<h1>Section3</h1><p>Some content for section3</p>" +
"<h1>Section4</h1><p>Some content for section4</p>" +
"<h1>Section5</h1><p>Some content for section5</p>";
Section contents will be longer than one page, so we need the page numbers in the TOC. Is there any way to achieve this?
NOTE: This is a sample; we have many sections and subsections.

As of the XML Worker 5.5.4 source, it doesn't seem to create "Chapters" anywhere, which are required for creating the table of contents. You can create your own tag and build into XML Worker how to process it. Some browsers may ignore an unknown tag and not display it, so be careful.
How to generate a Table of Contents “TOC” with iText?
JavaDoc method for telling XML Worker how to process a new Tag
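
One common way to get page numbers into a TOC is to note the current page at the moment each heading is written and to build the TOC from those records afterwards. The following is a library-independent sketch of that bookkeeping only. The TocSketch class and its onHeading and render methods are hypothetical names, not iText or XML Worker API, and hooking the onHeading call into a custom tag processor is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Library-independent sketch of TOC bookkeeping; none of this is iText or XML Worker API. */
final class TocSketch {

    /** One TOC line: a section title and the page on which the section starts. */
    static final class Entry {
        final String title;
        final int page;

        Entry(String title, int page) {
            this.title = title;
            this.page = page;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    /** Call this when a heading is written, passing the page number at that moment. */
    void onHeading(String title, int currentPage) {
        entries.add(new Entry(title, currentPage));
    }

    /** Renders the recorded entries as "title ..... page" lines padded to the given width. */
    String render(int width) {
        StringBuilder out = new StringBuilder();
        for (Entry e : entries) {
            String page = Integer.toString(e.page);
            // Two single spaces surround the dot leader; always keep at least one dot.
            int dots = Math.max(1, width - e.title.length() - page.length() - 2);
            out.append(e.title).append(' ');
            for (int i = 0; i < dots; i++) {
                out.append('.');
            }
            out.append(' ').append(page).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        TocSketch toc = new TocSketch();
        // Page numbers here are made up; in practice they come from the PDF writer.
        toc.onHeading("Section1", 1);
        toc.onHeading("Section2", 4);
        toc.onHeading("Section3", 9);
        System.out.print(toc.render(30));
    }
}
```

Because the page numbers are only known after the content has been laid out, the TOC has to be written after the body (and moved to the front) or the document has to be generated in two passes.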
