PDF Parsing tables in java with Pdfbox

PDF Parsing tables in java with Pdfbox - java

i've been looking for quite long time for answer, but i haven't found anything.
My problem is in parsing pdf, i have page made with some kind of tables.
I've already written some code via which i can extract iformation from specified rectangle, but i am declaring those values in code and it is not dynamic as it should. I want to find information about cells and with this information i will be able to get those string which i will need. In PDFbox api i haven't found anything what could be useful.
I would be graceful for any tips.

Related

How to read values from a PDF file in java using PDFTable or PDFTableExtractor class?

I have tried with PDFTextStripperByArea and PDPageContentStream classes to extract the number values from my pdf file. They work fine!
But my requirement is to use PDFTable or PDFTableExtractor class to read the pdf contents. Can you tell me what is the maven dependency and jar file I need to use to access the above said classes?
Also mention the required methods to get the values from a particular position.
I have another doubt. Can we extract the table formatted data from PDF file as it is? I meant the data with rows and columns with table lines. If a page contains some text and a table, can we just read only the table headers and the rows? I have uploaded my page in GitHub. Click here! From that image, I only need the values of Gross premium, GST and Total Payable. Please let me know whether it's possible

First, don't use classes from packages com.lowagie
That code is old, obsolete and no longer supported. Furthermore, this code belonged to the very early version of iText.
Afterwards a thorough investigation was done into the intellectual property rights of all the code (since iText has had a lot of contributors). When you use the old code, you may (unknowingly) be using code for which you do not have the copyright.
Second, if you just want to solve the problem of extracting numbers and tables from a PDF document, have a look at pdf2Data. It's an iText add-on that makes things a lot easier.
It gives you a nice UI, where you can build templates for data extraction. Then you can call a single method to match an existing (XML) template against an input PDF document, and you'd get a datastructure that contains all the information about the match.
http://pdf2data.online/

PDFTable
I have found two PDFTable classes:
com.lowagie.text.pdf.PdfPTable
com.itextpdf.text.pdf.PdfPTable
Documentation of both of this class (this may help you to learn the methods you need):
https://www.coderanch.com/how-to/javadoc/itext-2.1.7/com/lowagie/text/pdf/PdfPTable.html
http://itextsupport.com/apidocs/itext5/5.5.9/com/itextpdf/text/pdf/PdfPTable.html
If you want to use this classes, you can copy the dependency to your pom.file from:
https://mvnrepository.com/artifact/com.itextpdf/itextpdf
https://mvnrepository.com/artifact/com.lowagie/itext - As mentioned in this link, This artifact was moved to com.itextpdf
Examples of how to use this classes you may found here:
https://developers.itextpdf.com/examples/itext-action-second-edition/chapter-4
https://www.programcreek.com/java-api-examples/index.php?api=com.lowagie.text.pdf.PdfPTable

How do I extract data from pptx file using Apache POI?

I am using XSLFPowerPointExtractor to extract text from a pptx file. However all the text in the pptx file is returned to me in a single string. Is there anyway i can get the text on each slide separately? I am completely new to this concept, so please give detailed answers..

I looked up the API documentation and it seems that it's either all or nothing. The API documentation has a method called getText() which returns the entire text for all the slides which is exactly the behavior you are observing.
A bit more googling showed me that the way to do it is to use another API namely XMLSlideShow. That gives you a slide-by-slide access to the presentation.
From there, you can access the different shapes including the text areas from which you can read the text. As a matter of fact, this is explained in this other SO question which I believe will help you resolve your issue: How to get pptx slide notes text using apache poi?

XDocReport generate report : loop thru collection in table (java)

I have been struggling with trying to follow a code sample by XDocReport(open source project).
I followed this tutorial from the website:
https://code.google.com/p/xdocreport/wiki/DocxReportingJavaMainListFieldInTable
I used the Freemarker template style.
I would not iterate and create the table, I just get back: $variable as text in the output doc.
Then I dug further, and discovered that this tutorial on the website was probably not updated for the newer version. I found some more examples in this url, which contains a zip file.
https://code.google.com/p/xdocreport/downloads/detail?name=docxandfreemarker-1.0.4-sample.zip
I still could not get it to work.
I was hoping someone would have a working code sample that takes a java collection and populates a table in a Word document.
I hope one of the developers of XDocReport, angelo.zerr, would give some input on this.
Sincerely,
P

I was hoping someone would have a working code sample that takes a java collection and populates a table in a Word document.
What is the problem with https://code.google.com/p/xdocreport/wiki/DocxReportingJavaMainListFieldInTable?
I suggets you that you create an issue on XDocReport forum with a very simple case (simple Java main + docx)

It seems that the issue was the template. If one sets up a mail merge field in a Word template and don't use it in the Java program, the program then complains it can't find the variable, or something to that effect. And if you just delete the mail merge text in the document, it may still exist as a mail merge field variable in the word document.
So one needs to be very careful it seems with how to set things in the template.
I think the API should be able to ignore if there is a field setup in the template, and we are not referencing it in the code though. But that solved the problem.

Create a pdf with text at given coordinates (PDFBox?)

My Situation:
I'm programming in java
Using a library from a person from my university I'm able to read pdfs and create a XML document out of it
This XML document contains additional informations e.g. the coordinates of the text in the original document
My Problem
I would like to create the read PDF again with the content set at its original coordinates (Again: I have the coordinates)
My Question:
-> Do you know a way to create a pdf and set the text of the pdf at given coordinates? <-
I'm doing a lot of research these days about, but maybe I tried the wrong google search terms since I cant find much helpful results. So i thought I might be able to ask here, in the forum where I found the most help so far in my young "programmers life" :)
Most of the results I get, even here, are about people trying to get the coordinates, but I already have them.
I heard during a discussion that PDFBox might be able to do this, but I'm also happy to work with any other framework or library that is capable for my problem.
Thanks for every help and thought you're sharing with me.

Thanks a lot for your comments. In the end I've decided for iText, which allowed me to do all my tasks (placing text at absolute coordinates, give it a background color by certain criterias) in a quite easy and efficient way.
If someone here is searching for inspiration and has a similar task, check my related post here on stackoverflow for some code snippets How can I add a background color to my (pdf-) text using iText to create it with Java

What technologies are there for formatted, structured data input and output?

I am working on a project here that ingests internal resumes from people at my company, strips out the skills and relevant content from them and stores it in a database. This was all done using docx4j and Grails. This required the resumes to first be submitted via a template that formatted everything just right so that the ingest tool knew what to look for to strip the data.
The 2nd portion of this, is what if we want to get out a "reduced" resume from the database. In other words, I want to search the uploaded content I now have, and only print out new resumes for people who have Java programming experience lets say. So I can go into my database, find the people who originally had java as a skill, and output a new set of resumes that are also still in a nice templated format, and only have the relevant info in them, instead of ALL the content.
I have been writing some software to do this in Java that will basically use a docx template, overwriting the items in customXML which are bound to the content controls in the doc, so the new data shows up and can eb saved as a new docx with that custom data.
This seems really cumbersome to me, and has some limitations. For one, lets say my template has a place for 3 Skills, and the particular person has 8 skills. There seems to be no good way to add those 5 additional skills to the docx other than painstakingly inserting the data with all of the formatting XML tags and such. This is a real pain, because if the template changes, I dont want to have to go back into my software and edit source code to change that additional data input XML tag to bold instead of italic.
I was doing some reading up on using Infopath to create a form that I could use to get the input, connecting to some sharepoint data source or something to store the stripped out data. However, I can't seem to find out if it is possible using sharepoint to get the data back out, in a nice formatted way. What would the general steps for this be? It seems like I couldnt find very much about this topic with any quick googling.
Thanks

You could set up the skills:
<skills>
<skill>..</skill>
<skill>..</skill>
and use a "repeat" content control pointing to the container. This would handle any number of <skill> entries.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.