PDFBox extract labels of form pdf - java

I have a form PDF file, as shown in the image (FORM_PDF).
Using PDFBox in Java I have retrieved text of the form fields.
My Code:
File file = new File("example.pdf");
PDDocument doc = PDDocument.load(file);
PDDocumentCatalog catalog = doc.getDocumentCatalog();
PDAcroForm form = catalog.getAcroForm();
PDFieldTree fields = form.getFieldTree();
for (PDField field : fields) {
    Object value = field.getValueAsString();
    String name = field.getPartialName();
    System.out.print(name);
    System.out.print(" = ");
    System.out.print(value);
    System.out.println();
}
Output :
Given Name Text Box = Jignesh
Family Name Text Box = Jignesh
House nr Text Box = xyz
Address 2 Text Box = pqr
I also want to retrieve the labels below:
Given Name:
Family Name:
Address 1:
mapped like this:
Given Name Text = Given Name:
Family Name Text = Family Name:
House nr Text = Address 1:
Address 2 Text = Address 2:
Since the values above belong to form fields, they were all retrieved easily. I also want to extract the labels of the form so that I can map each label to its field.
Please help with this.
Thanks a lot.

Related

PDFBox fill template doesn't fill in auto calculated fields

I'm having an interactive PDF with a couple of fields. When some of the fields are filled in the other ones are calculated. In Adobe Acrobat Reader this works fine.
Now when I fill in the document as follows:
public static void setField(PDDocument pdfDocument, String name, String value ) throws IOException {
    PDDocumentCatalog docCatalog = pdfDocument.getDocumentCatalog();
    PDAcroForm acroForm = docCatalog.getAcroForm();
    PDField field = acroForm.getField( name );
    if( field != null ) {
        field.setValue(value);
    } else {
        System.err.println( "No field found with name:" + name );
    }
}
The fields are filled in but I have two problems:
For every field I get:
May 04, 2021 11:57:04 AM org.apache.pdfbox.pdmodel.interactive.form.AppearanceGeneratorHelper getFormattedValue
INFO: Field contains a formatting action but no ScriptingHandler has been supplied - formatted value might be incorrect
The fields that are normally auto calculated are not filled in. Do I need to trigger some actions or is it because the field is not formatted like a string or a number?

Set items in table java web scraping

My problem is the following: I use web scraping to take some data from a web page (IMDB in this case). The data is a list of movie titles, and printing them to the console works fine. My problem is that I cannot save those titles in my table columns. I have included all the code relevant to this problem, but I can't find where I made a mistake or why the titles won't show in the table columns. In the end, the user needs to pick one title, and that title needs to be stored in a text field. Does anyone have an idea?
I have table:
TableColumn izborAuta = new TableColumn("Izbor auta");
TableColumn lokacijaPreuzimanja = new TableColumn("Lokacija preuzimanja");
TableColumn lokacijaVracanja = new TableColumn("Lokacija Vracanja");
TableColumn cena = new TableColumn("Cena");
I have this code to setup columns:
izborAuta.setCellValueFactory(new PropertyValueFactory<Vozila, String>("izborAuta"));
lokacijaPreuzimanja.setCellValueFactory(new PropertyValueFactory<Vozila, String>("lokacijaPreuzimanja"));
lokacijaVracanja.setCellValueFactory(new PropertyValueFactory<Vozila, String>("lokacijaVracanja"));
cena.setCellValueFactory(new PropertyValueFactory<Vozila, String>("cena"));
tableView.setItems(Baza.baza.prikazBaze());
tableView.getColumns().addAll(izborAuta, lokacijaPreuzimanja, lokacijaVracanja, cena);
I have this code so that when I pick an item from the table, it is stored in the text fields:
tableView.setOnMouseClicked((e) -> {
    Vozila v = (Vozila) tableView.getSelectionModel().getSelectedItem();
    txIzborAuta.setText(v.getIzborAuta());
    txLokacijaPreuzimanja.setText(v.getLokacijaPreuzimanja());
    txLokacijaVracanja.setText(v.getLokacijaVracanja());
    txCena.setText(v.getCena());
});
Finally, I use web scraping to save the items in the table:
Document doc = Jsoup.connect("https://www.imdb.com/chart/top?ref_=nv_mv_250").get();
Elements elems = doc.select("table.chart.full-width");
for (Element e : elems) {
    String izborAuta = e.select(".titleColumn").text();
    String lokacijaPreuzimanja = e.select(".titleColumn").text();
    String lokacijaVracanja = e.select(".titleColumn").text();
    String cena = e.select(".titleColumn").text();
    Vozila v = new Vozila();
    v.setIzborAuta(izborAuta);
    v.setLokacijaPreuzimanja(lokacijaPreuzimanja);
    v.setLokacijaVracanja(lokacijaVracanja);
    v.setCena(cena + " " + "RSD");
    Baza.insertVozila(v);
}

pdfbox textbox value not setting

I am using PDFBox to fill in PDF forms that we've been given by a third party.
I'm having a problem with only 1 of the forms, this code works for 21 others.
I know the valueToSet has value and is correct, and within the setField method, the getField method does return a value, so I know the field name is correct too. Plus, this code works fine with many other forms. None of the fields are populating (this particular template only has text boxes anyway).
What am I missing? Is there something on this specific form I should be looking for?
setField(formFieldName, valueToSet);
public static void setField(String name, String value ) throws IOException {
    PDDocumentCatalog docCatalog = document.getDocumentCatalog();
    PDAcroForm acroForm = docCatalog.getAcroForm();
    PDField field = acroForm.getField( name );
    if (field instanceof PDCheckBox){
        String onValue = ((PDCheckBox) field).getOnValue();
        String offValue = "Off";
        if(value.equals("Yes")){
            field.setValue(onValue);
        }
        else{
            field.setValue(offValue);
        }
    }
    else{
        field.setValue(value);
    }
}

how to know if a field is on a particular page?

The PDFBox content stream is created per page, but the fields come from the form, which comes from the catalog, which comes from the PDF document itself. So I'm not sure which fields are on which pages, and this causes text to be written out to incorrect locations/pages.
I.e. I'm processing fields per page, but I'm not sure which fields are on which pages.
Is there a way to tell which field is on which page? Or, is there a way to get just the fields on the current page?
Thank you!
Mark
code snippet:
PDDocument pdfDoc = PDDocument.load(file);
PDDocumentCatalog docCatalog = pdfDoc.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
// Get field names
List<PDField> fieldList = acroForm.getFields();
List<PDPage> pages = pdfDoc.getDocumentCatalog().getAllPages();
for (PDPage page : pages) {
    PDPageContentStream contentStream = new PDPageContentStream(pdfDoc, page, true, true, true);
    processFields(acroForm, fieldList, contentStream, page);
    contentStream.close();
}
"The PDFbox content stream is done per page, but the fields come from the form which comes from the catalog, which comes from the pdf doc itself. So I'm not sure which fields are on which pages"
The reason for this is that PDFs contain a global object structure defining the form. A form field in this structure may have 0, 1, or more visualizations on 0, 1, or more actual PDF pages. Furthermore, in case of only 1 visualization, a merge of field object and visualization object is allowed.
PDFBox 1.8.x
Unfortunately PDFBox in its PDAcroForm and PDField objects represents only this object structure and does not provide easy access to the associated pages. By accessing the underlying structures, though, you can build the connection.
The following code should make clear how to do that:
@SuppressWarnings("unchecked")
public void printFormFields(PDDocument pdfDoc) throws IOException {
    PDDocumentCatalog docCatalog = pdfDoc.getDocumentCatalog();
    List<PDPage> pages = docCatalog.getAllPages();
    Map<COSDictionary, Integer> pageNrByAnnotDict = new HashMap<COSDictionary, Integer>();
    for (int i = 0; i < pages.size(); i++) {
        PDPage page = pages.get(i);
        for (PDAnnotation annotation : page.getAnnotations())
            pageNrByAnnotDict.put(annotation.getDictionary(), i + 1);
    }
    PDAcroForm acroForm = docCatalog.getAcroForm();
    for (PDField field : (List<PDField>)acroForm.getFields()) {
        COSDictionary fieldDict = field.getDictionary();
        List<Integer> annotationPages = new ArrayList<Integer>();
        List<COSObjectable> kids = field.getKids();
        if (kids != null) {
            for (COSObjectable kid : kids) {
                COSBase kidObject = kid.getCOSObject();
                if (kidObject instanceof COSDictionary)
                    annotationPages.add(pageNrByAnnotDict.get(kidObject));
            }
        }
        Integer mergedPage = pageNrByAnnotDict.get(fieldDict);
        if (mergedPage == null)
            if (annotationPages.isEmpty())
                System.out.printf("i Field '%s' not referenced (invisible).\n", field.getFullyQualifiedName());
            else
                System.out.printf("a Field '%s' referenced by separate annotation on %s.\n", field.getFullyQualifiedName(), annotationPages);
        else
            if (annotationPages.isEmpty())
                System.out.printf("m Field '%s' referenced as merged on %s.\n", field.getFullyQualifiedName(), mergedPage);
            else
                System.out.printf("x Field '%s' referenced as merged on %s and by separate annotation on %s. (Not allowed!)\n", field.getFullyQualifiedName(), mergedPage, annotationPages);
    }
}
Beware, there are two shortcomings in the PDFBox PDAcroForm form field handling:
The PDF specification allows the global object structure defining the form to be a deep tree, i.e. the actual fields do not have to be direct children of the root but may be organized by means of inner tree nodes. PDFBox ignores this and expects the fields to be direct children of the root.
Some PDFs in the wild, foremost older ones, do not contain the field tree but only reference the field objects from the pages via the visualizing widget annotations. PDFBox does not see these fields in its PDAcroForm.getFields list.
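To illustrate the first point: a deep field tree has to be walked recursively to reach all terminal fields. The following is only a minimal, library-independent sketch. Node is a hypothetical stand-in for a field dictionary, not a PDFBox class, and the sketch ignores that in real PDFs the kids of a terminal field can be widget annotations. The fully qualified name is built by joining the partial names with periods.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FieldTreeWalk {

    /** Hypothetical stand-in for a node of the form's field tree. */
    public static final class Node {
        final String partialName;
        final List<Node> kids = new ArrayList<>();

        public Node(String partialName) {
            this.partialName = partialName;
        }

        /** Adds a child node and returns this node to allow chaining. */
        public Node add(Node kid) {
            kids.add(kid);
            return this;
        }
    }

    /** Collects the fully qualified names of all leaf nodes below the given roots. */
    public static List<String> terminalFieldNames(List<Node> roots) {
        List<String> names = new ArrayList<>();
        for (Node root : roots) {
            collect(root, "", names);
        }
        return names;
    }

    private static void collect(Node node, String prefix, List<String> names) {
        String name = prefix.isEmpty() ? node.partialName : prefix + "." + node.partialName;
        if (node.kids.isEmpty()) {
            // a node without kids is treated as a terminal field in this model
            names.add(name);
        } else {
            for (Node kid : node.kids) {
                collect(kid, name, names);
            }
        }
    }

    public static void main(String[] args) {
        Node address = new Node("address").add(new Node("street")).add(new Node("city"));
        Node name = new Node("name");
        // prints [address.street, address.city, name]
        System.out.println(terminalFieldNames(Arrays.asList(address, name)));
    }
}
```

A lookup that only inspects the direct children of the root would see "address" and "name" here, but neither "address.street" nor "address.city".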
PS: @mikhailvs in his answer correctly shows that you can retrieve a page object from a field widget using PDField.getWidget().getPage() and determine its page number using catalog.getAllPages().indexOf. While being fast, this getPage() method has a drawback: it retrieves the page reference from an optional entry of the widget annotation dictionary. Thus, if the PDF you process has been created by software that fills that entry, all is well, but if the PDF creator has not filled that entry, all you get is a null page.
PDFBox 2.0.x
In 2.0.x some methods for accessing the elements in question have changed, but not the situation as a whole: to safely retrieve the page of a widget you still have to iterate through the pages and find a page that references the annotation.
The safe method:
int determineSafe(PDDocument document, PDAnnotationWidget widget) throws IOException
{
    COSDictionary widgetObject = widget.getCOSObject();
    PDPageTree pages = document.getPages();
    for (int i = 0; i < pages.getCount(); i++)
    {
        for (PDAnnotation annotation : pages.get(i).getAnnotations())
        {
            COSDictionary annotationObject = annotation.getCOSObject();
            if (annotationObject.equals(widgetObject))
                return i;
        }
    }
    return -1;
}
The fast method:
int determineFast(PDDocument document, PDAnnotationWidget widget)
{
    PDPage page = widget.getPage();
    return page != null ? document.getPages().indexOf(page) : -1;
}
Usage:
PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm();
if (acroForm != null)
{
    for (PDField field : acroForm.getFieldTree())
    {
        System.out.println(field.getFullyQualifiedName());
        for (PDAnnotationWidget widget : field.getWidgets())
        {
            System.out.print(widget.getAnnotationName() != null ? widget.getAnnotationName() : "(NN)");
            System.out.printf(" - fast: %s", determineFast(document, widget));
            System.out.printf(" - safe: %s\n", determineSafe(document, widget));
        }
    }
}
(DetermineWidgetPage.java)
(In contrast to the 1.8.x code the safe method here simply searches for the page of a single field. If in your code you have to determine the page of many widgets, you should create a lookup Map like in the 1.8.x case.)
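Such a lookup could be sketched as follows. This is a library-independent sketch under stated assumptions, not PDFBox code: the class WidgetPageIndex and its methods are hypothetical names, and the annotation objects are plain Java objects. With PDFBox 2.0.x one would presumably fill the per-page lists from page.getAnnotations(), taking getCOSObject() of each annotation, and look up widget.getCOSObject(). An IdentityHashMap is used so that two annotations with equal content are still treated as different objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

public class WidgetPageIndex {

    /**
     * Builds the lookup in a single pass over all pages.
     * annotationsPerPage.get(i) holds the annotation objects of page i (0-based).
     */
    public static Map<Object, Integer> build(List<? extends List<?>> annotationsPerPage) {
        Map<Object, Integer> pageByAnnotation = new IdentityHashMap<>();
        for (int i = 0; i < annotationsPerPage.size(); i++) {
            for (Object annotation : annotationsPerPage.get(i)) {
                // keep the first page that references the annotation
                pageByAnnotation.putIfAbsent(annotation, i);
            }
        }
        return pageByAnnotation;
    }

    /** Returns the 0-based page index, or -1 if no page references the widget. */
    public static int pageOf(Map<Object, Integer> index, Object widget) {
        return index.getOrDefault(widget, -1);
    }

    public static void main(String[] args) {
        Object widgetA = new Object();
        Object widgetB = new Object();
        Object orphan = new Object();
        List<List<Object>> pages = new ArrayList<>();
        pages.add(Collections.singletonList(widgetA));
        pages.add(Arrays.asList(widgetB));
        Map<Object, Integer> index = build(pages);
        System.out.println(pageOf(index, widgetA)); // prints 0
        System.out.println(pageOf(index, widgetB)); // prints 1
        System.out.println(pageOf(index, orphan));  // prints -1
    }
}
```

Building the map costs one pass over all annotations; each widget lookup afterwards is a single map access instead of a full scan over the pages.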
Example documents
A document for which the fast method fails: aFieldTwice.pdf
A document for which the fast method works: test_duplicate_field2.pdf
Granted this answer may not help the OP (a year later), but if someone else runs into it, here is the solution:
PDDocumentCatalog catalog = doc.getDocumentCatalog();
int pageNumber = catalog.getAllPages().indexOf(yourField.getWidget().getPage());
This example uses Lucee (cfml) https://lucee.org/
A big thank you to mkl as his answer above is invaluable and I couldn't have built this function without his help.
Call the function pageForSignature(doc, fieldName) and it will return the number of the page that the field resides on. It returns -1 if fieldName is not found.
<cfscript>
// Java classes are instantiated using CreateObject()
variables.File = CreateObject("java", "java.io.File");
//references lucee bundle directory - typically on tomcat: /usr/local/tomcat/lucee-server/bundles
variables.PDDocument = CreateObject("java", "org.apache.pdfbox.pdmodel.PDDocument", "org.apache.pdfbox.app", "2.0.18")
function determineSafe(doc, widget){
var i = '';
var widgetObject = widget.getCOSObject();
var pages = doc.getPages();
var annotation = '';
var annotationObject = '';
for (i = 0; i < pages.getCount(); i=i+1){
for (annotation in pages.get(i).getAnnotations()){
if(annotation.getSubtype() eq 'widget'){
annotationObject = annotation.getCOSObject();
if (annotationObject.equals(widgetObject)){
return i;
}
}
}
}
return -1;
}
function pageForSignature(doc, fieldName){
try{
var acroForm = doc.getDocumentCatalog().getAcroForm();
var field = '';
var widget = '';
var annotation = '';
var pageNo = '';
for(field in acroForm.getFields()){
if(field.getPartialName() == fieldName){
for(widget in field.getWidgets()){
for(annotation in widget.getPage().getAnnotations()){
if(annotation.getSubtype() == 'widget'){
pageNo = determineSafe(doc, widget);
doc.close();
return pageNo;
}
}
}
}
}
return -1;
}catch(e){
doc.close();
writeDump(label="catch error",var='#e#');
}
}
doc = PDDocument.init().load(File.init('/**********/myfile.pdf'));
//returns no, page numbers start at 0
pageNo = pageForSignature(doc, 'twtzceuxvx');
writeDump(label="pageForSignature(doc, fieldName)", var="#pageNo#");
</cfscript>
General solution for a field with a single widget or multiple widgets (i.e. the same fully qualified name used more than once, possibly on a single page):
List<PDAnnotationWidget> widget = field.getWidgets();
PDDocumentCatalog catalog = doc.getDocumentCatalog();
for (int i = 0; i < widget.size(); i++) {
    int pageNumber = 1 + catalog.getPages().indexOf(field.getWidgets().get(i).getPage());
    /* The field coordinates can also be obtained here; this works for both single and multiple widgets. */
    //PDRectangle r = widget.get(i).getRectangle();
}

How to create a table in a Word doc using docx4j at a specific bookmark without overwriting the Word doc

I need to create a table at the location of a particular bookmark, i.e. I need to find the bookmark and insert the table there. How can I do this using docx4j?
Thanks in advance.
Sorry Jason, I am new to Stack Overflow, so I couldn't describe my problem clearly. Here is my situation and problem.
I changed the code as you suggested and adapted it to my needs; here it is:
// loop through the bookmarks
for (CTBookmark bm : rt.getStarts()) {
    // do we have data for this one?
    String bmname = bm.getName();
    // find the right bookmark (in this case I have only one bookmark, so just check that it is not null)
    if (bmname != null) {
        String value = "some text for testing run";
        //if (value==null) continue;
        List<Object> theList = null;
        // create bm list
        theList = ((ContentAccessor)(bm.getParent())).getContent();
        // I set the range as 1 (I assume this start range says where to start creating the table)
        int rangeStart = 1;
        WordprocessingMLPackage wordPackage = WordprocessingMLPackage.createPackage();
        // create the table
        Tbl table = factory.createTbl();
        // add borders to the table
        addBorders(table);
        for (int rows = 0; rows < 1; rows++) {
            // create a row
            Tr row = factory.createTr();
            for (int colm = 0; colm < 1; colm++) {
                // create a cell
                Tc cell = factory.createTc();
                // add the content to cell
                cell.getContent().add(wordPackage.getMainDocumentPart()
                        .createParagraphOfText("cell" + colm));
                // add the cell to row
                row.getContent().add(cell);
            }
            // add the row to table
            table.getContent().add(row);
            // now add a run (to test whether run is working or not)
            org.docx4j.wml.R run = factory.createR();
            org.docx4j.wml.Text t = factory.createText();
            run.getContent().add(t);
            t.setValue(value);
            // add table to list
            theList.add(rangeStart, table);
            // add run to list
            //theList.add(rangeStart, run);
        }
    }
}
I don't need to delete the text in the bookmark, so I removed that part.
I don't know what the problem is. The program compiles, but I cannot open the Word doc; it says "unknown error". As a test I wrote the string "value" instead: it is written correctly at that bookmark and the document opens, but not in the case of the table. Please help me.
Thanks in advance.
You can adapt the sample code BookmarksReplaceWithText.java
In your case:
line 89: the parent won't be p, it'll be body or tc. You could remove the test.
line 128: instead of adding a run, you want to insert a table
You can use TblFactory to create your table, or the docx4j webapp to generate code from a sample docx.
For some reason bookmark replacement with a table didn't work out for me, so I relied on text replacement with a table instead. For my use case I created the tables from HTML using the XHTML importer:
MainDocumentPart documentPart = wordMLPackage.getMainDocumentPart();
String xhtml= <your table HTML>;
XHTMLImporterImpl XHTMLImporter = new XHTMLImporterImpl(wordMLPackage);
int ct = 0;
List<Integer> tableIndexes = new ArrayList<>();
List<Object> documentContents = documentPart.getContent();
for (Object o: documentContents) {
    if (o.toString().contains("PlaceholderForTable1")) {
        tableIndexes.add(ct);
    }
    ct++;
}
for (Integer i: tableIndexes) {
    documentPart.getContent().remove(i.intValue());
    documentPart.getContent().addAll(i.intValue(), XHTMLImporter.convert( xhtml, null));
}
In my input word doc, I defined text 'PlaceholderForTable1' where I want to insert my table.
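One caveat with collecting the indexes up front: if the document contains several placeholders and the imported HTML yields more than one element, the later indexes no longer point at their placeholders after the first replacement. Walking the content list from the end avoids that. The following is a minimal sketch of the idea on a plain Java list, not docx4j code. PlaceholderReplace and replaceAll are hypothetical names, and the Supplier stands in for a fresh XHTMLImporter.convert(xhtml, null) call per placeholder.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

public class PlaceholderReplace {

    /**
     * Replaces every element whose string form contains the marker with the
     * elements delivered by the supplier. Iterating backwards keeps the
     * indexes of the not yet visited elements valid, no matter how many
     * elements each replacement inserts.
     */
    public static void replaceAll(List<Object> content, String marker,
                                  Supplier<List<Object>> replacement) {
        for (int i = content.size() - 1; i >= 0; i--) {
            if (String.valueOf(content.get(i)).contains(marker)) {
                content.remove(i);
                content.addAll(i, replacement.get());
            }
        }
    }

    public static void main(String[] args) {
        List<Object> content = new ArrayList<>(Arrays.asList(
                "intro", "PlaceholderForTable1", "middle", "PlaceholderForTable1", "end"));
        replaceAll(content, "PlaceholderForTable1",
                () -> Arrays.asList("<table>", "<spacer>"));
        // prints [intro, <table>, <spacer>, middle, <table>, <spacer>, end]
        System.out.println(content);
    }
}
```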
