I have an interactive PDF with a couple of fields. When some of the fields are filled in, the others are calculated. In Adobe Acrobat Reader this works fine.
Now when I fill in the document as follows:
public static void setField(PDDocument pdfDocument, String name, String value ) throws IOException {
PDDocumentCatalog docCatalog = pdfDocument.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
PDField field = acroForm.getField( name );
if( field != null ) {
field.setValue(value);
} else {
System.err.println( "No field found with name:" + name );
}
}
The fields are filled in but I have two problems:
For every field I get:
May 04, 2021 11:57:04 AM org.apache.pdfbox.pdmodel.interactive.form.AppearanceGeneratorHelper getFormattedValue
INFO: Field contains a formatting action but no ScriptingHandler has been supplied - formatted value might be incorrect
The fields that are normally auto calculated are not filled in. Do I need to trigger some actions or is it because the field is not formatted like a string or a number?
I am using PDFBox to fill in PDF forms that we've been given by a third party.
I'm having a problem with only one of the forms; this code works for 21 others.
I know the valueToSet has value and is correct, and within the setField method, the getField method does return a value, so I know the field name is correct too. Plus, this code works fine with many other forms. None of the fields are populating (this particular template only has text boxes anyway).
What am I missing? Is there something on this specific form I should be looking for?
setField(formFieldName, valueToSet);
public static void setField(String name, String value ) throws IOException {
PDDocumentCatalog docCatalog = document.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
PDField field = acroForm.getField( name );
if (field instanceof PDCheckBox){
String onValue = ((PDCheckBox) field).getOnValue();
String offValue = "Off";
if(value.equals("Yes")){
field.setValue(onValue);
}
else{
field.setValue(offValue);
}
}
else{
field.setValue(value);
}
}
I have a form PDF file as shown in the image (FORM_PDF).
Using PDFBox in Java I have retrieved text of the form fields.
My Code:
File file = new File("example.pdf");
PDDocument doc = PDDocument.load(file);
PDDocumentCatalog catalog = doc.getDocumentCatalog();
PDAcroForm form = catalog.getAcroForm();
PDFieldTree fields = form.getFieldTree();
for (PDField field : fields) {
Object value = field.getValueAsString();
String name = field.getPartialName();
System.out.print(name);
System.out.print(" = ");
System.out.print(value);
System.out.println();
}
Output :
Given Name Text Box = Jignesh
Family Name Text Box = Jignesh
House nr Text Box = xyz
Address 2 Text Box = pqr
I want below also to be retrieved
Given Name:
Family Name:
Address 1:
as
Given Name Text = Given Name:
Family Name Text = Family Name:
House nr Text = Address 1:
Address 2 Text = Address 2:
Since the values above are form fields, they were all retrieved easily. I also want to extract the labels of the form, so that I can map each label to its field.
Please help with the same.
Thanks a lot.
As the Lucene migration guide mentions, to set a document-level boost we should multiply every field's boost by the boost value. Here is my code:
StringField nameField = new StringField("name", name, Field.Store.YES) ;
StringField linkField = new StringField("link", link, Field.Store.YES);
Field descField;
TextField reviewsField = new TextField("reviews", reviews_str, Field.Store.YES);
TextField authorsField = new TextField("authors", authors_str, Field.Store.YES);
FloatField scoreField = new FloatField("score", origScore,Field.Store.YES);
if (desc != null) {
descField = new TextField("desc", desc, Field.Store.YES);
} else {
descField = new TextField("desc", "", Field.Store.YES);
}
doc.add(nameField);
doc.add(linkField);
doc.add(descField);
doc.add(reviewsField);
doc.add(authorsField);
doc.add(scoreField);
nameField.setBoost(score);
linkField.setBoost(score);
descField.setBoost(score);
reviewsField.setBoost(score);
authorsField.setBoost(score);
scoreField.setBoost(score);
But I get this exception when running the code:
Exception in thread "main" java.lang.IllegalArgumentException: You cannot set an index-time boost on an unindexed field, or one that omits norms
I've searched Google but found no answers. Could you please help me?
Index-time boosts are stored in the field's norm, and both StringField and FloatField omit norms by default. So, you'll need to turn them on before you set the boosts.
To turn norms on, you'll need to define your own field types:
//Start with a copy of the standard field type
FieldType myStringType = new FieldType(StringField.TYPE_STORED);
myStringType.setOmitNorms(false);
//StringField doesn't do anything special except have a customized fieldtype, so just use Field.
Field nameField = new Field("name", name, myStringType);
FieldType myFloatType = new FieldType(FloatField.TYPE_STORED);
myFloatType.setOmitNorms(false);
//For FloatField, use the appropriate FloatField ctor, instead of Field (similar for other numerics)
Field scoreField = new FloatField("score", origScore, myFloatType);
The PDFBox content stream is done per page, but the fields come from the form, which comes from the catalog, which comes from the PDF document itself. So I'm not sure which fields are on which pages, and this causes text to be written to incorrect locations/pages.
I.e., I'm processing fields per page, but I'm not sure which fields are on which pages.
Is there a way to tell which field is on which page? Or, is there a way to get just the fields on the current page?
Thank you!
Mark
code snippet:
PDDocument pdfDoc = PDDocument.load(file);
PDDocumentCatalog docCatalog = pdfDoc.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
// Get field names
List<PDField> fieldList = acroForm.getFields();
List<PDPage> pages = pdfDoc.getDocumentCatalog().getAllPages();
for (PDPage page : pages) {
PDPageContentStream contentStream = new PDPageContentStream(pdfDoc, page, true, true, true);
processFields(acroForm, fieldList, contentStream, page);
contentStream.close();
}
The PDFbox content stream is done per page, but the fields come from the form which comes from the catalog, which comes from the pdf doc itself. So I'm not sure which fields are on which pages
The reason for this is that PDFs contain a global object structure defining the form. A form field in this structure may have 0, 1, or more visualizations on 0, 1, or more actual PDF pages. Furthermore, in case of only 1 visualization, a merge of field object and visualization object is allowed.
PDFBox 1.8.x
Unfortunately PDFBox in its PDAcroForm and PDField objects represents only this object structure and does not provide easy access to the associated pages. By accessing the underlying structures, though, you can build the connection.
The following code should make clear how to do that:
@SuppressWarnings("unchecked")
public void printFormFields(PDDocument pdfDoc) throws IOException {
PDDocumentCatalog docCatalog = pdfDoc.getDocumentCatalog();
List<PDPage> pages = docCatalog.getAllPages();
Map<COSDictionary, Integer> pageNrByAnnotDict = new HashMap<COSDictionary, Integer>();
for (int i = 0; i < pages.size(); i++) {
PDPage page = pages.get(i);
for (PDAnnotation annotation : page.getAnnotations())
pageNrByAnnotDict.put(annotation.getDictionary(), i + 1);
}
PDAcroForm acroForm = docCatalog.getAcroForm();
for (PDField field : (List<PDField>)acroForm.getFields()) {
COSDictionary fieldDict = field.getDictionary();
List<Integer> annotationPages = new ArrayList<Integer>();
List<COSObjectable> kids = field.getKids();
if (kids != null) {
for (COSObjectable kid : kids) {
COSBase kidObject = kid.getCOSObject();
if (kidObject instanceof COSDictionary)
annotationPages.add(pageNrByAnnotDict.get(kidObject));
}
}
Integer mergedPage = pageNrByAnnotDict.get(fieldDict);
if (mergedPage == null)
if (annotationPages.isEmpty())
System.out.printf("i Field '%s' not referenced (invisible).\n", field.getFullyQualifiedName());
else
System.out.printf("a Field '%s' referenced by separate annotation on %s.\n", field.getFullyQualifiedName(), annotationPages);
else
if (annotationPages.isEmpty())
System.out.printf("m Field '%s' referenced as merged on %s.\n", field.getFullyQualifiedName(), mergedPage);
else
System.out.printf("x Field '%s' referenced as merged on %s and by separate annotation on %s. (Not allowed!)\n", field.getFullyQualifiedName(), mergedPage, annotationPages);
}
}
Beware, there are two shortcomings in the PDFBox PDAcroForm form field handling:
The PDF specification allows the global object structure defining the form to be a deep tree, i.e. the actual fields do not have to be direct children of the root but may be organized by means of inner tree nodes. PDFBox ignores this and expects the fields to be direct children of the root.
Some PDFs in the wild, foremost older ones, do not contain the field tree but only reference the field objects from the pages via the visualizing widget annotations. PDFBox does not see these fields in its PDAcroForm.getFields list.
PS: @mikhailvs in his answer correctly shows that you can retrieve a page object from a field widget using PDField.getWidget().getPage() and determine its page number using catalog.getAllPages().indexOf. While fast, this getPage() method has a drawback: it retrieves the page reference from an optional entry of the widget annotation dictionary. Thus, if the PDF you process has been created by software that fills that entry, all is well, but if the PDF creator has not filled that entry, all you get is a null page.
PDFBox 2.0.x
In 2.0.x some methods for accessing the elements in question have changed, but not the situation as a whole: to safely retrieve the page of a widget you still have to iterate through the pages and find a page that references the annotation.
The safe method:
int determineSafe(PDDocument document, PDAnnotationWidget widget) throws IOException
{
COSDictionary widgetObject = widget.getCOSObject();
PDPageTree pages = document.getPages();
for (int i = 0; i < pages.getCount(); i++)
{
for (PDAnnotation annotation : pages.get(i).getAnnotations())
{
COSDictionary annotationObject = annotation.getCOSObject();
if (annotationObject.equals(widgetObject))
return i;
}
}
return -1;
}
The fast method:
int determineFast(PDDocument document, PDAnnotationWidget widget)
{
PDPage page = widget.getPage();
return page != null ? document.getPages().indexOf(page) : -1;
}
Usage:
PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm();
if (acroForm != null)
{
for (PDField field : acroForm.getFieldTree())
{
System.out.println(field.getFullyQualifiedName());
for (PDAnnotationWidget widget : field.getWidgets())
{
System.out.print(widget.getAnnotationName() != null ? widget.getAnnotationName() : "(NN)");
System.out.printf(" - fast: %s", determineFast(document, widget));
System.out.printf(" - safe: %s\n", determineSafe(document, widget));
}
}
}
(DetermineWidgetPage.java)
(In contrast to the 1.8.x code the safe method here simply searches for the page of a single field. If in your code you have to determine the page of many widgets, you should create a lookup Map like in the 1.8.x case.)
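The lookup map mentioned above could look roughly like the following. This is a minimal, library-free sketch: the class name WidgetPageIndex and its methods are hypothetical, and each inner list merely stands in for the annotations of one page. With PDFBox 2.0.x the map keys would presumably be the annotation.getCOSObject() values that determineSafe compares.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: models the page lookup with plain collections. */
final class WidgetPageIndex {

    /**
     * Walks all pages once and records annotation -> 0-based page index.
     * Keys are compared with equals(), as in determineSafe.
     */
    static <A> Map<A, Integer> build(List<List<A>> annotationsPerPage) {
        Map<A, Integer> pageByAnnotation = new HashMap<>();
        for (int i = 0; i < annotationsPerPage.size(); i++) {
            for (A annotation : annotationsPerPage.get(i)) {
                // keep the first page that references the annotation
                pageByAnnotation.putIfAbsent(annotation, i);
            }
        }
        return pageByAnnotation;
    }

    /** Returns the 0-based page index, or -1 if no page references the widget. */
    static <A> int pageOf(Map<A, Integer> pageByAnnotation, A widget) {
        Integer page = pageByAnnotation.get(widget);
        return page != null ? page : -1;
    }
}
```

Building the map costs one pass over all pages; every later lookup is a single map access instead of another scan of the whole document.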
Example documents
A document for which the fast method fails: aFieldTwice.pdf
A document for which the fast method works: test_duplicate_field2.pdf
Granted this answer may not help the OP (a year later), but if someone else runs into it, here is the solution:
PDDocumentCatalog catalog = doc.getDocumentCatalog();
int pageNumber = catalog.getAllPages().indexOf(yourField.getWidget().getPage());
This example uses Lucee (cfml) https://lucee.org/
A big thank you to mkl as his answer above is invaluable and I couldn't have built this function without his help.
Call the function pageForSignature(doc, fieldName) and it will return the number of the page that the field resides on. It returns -1 if fieldName is not found.
<cfscript>
try{
/*
java is used by using CreateObject()
*/
variables.File = CreateObject("java", "java.io.File");
//references lucee bundle directory - typically on tomcat: /usr/local/tomcat/lucee-server/bundles
variables.PDDocument = CreateObject("java", "org.apache.pdfbox.pdmodel.PDDocument", "org.apache.pdfbox.app", "2.0.18")
function determineSafe(doc, widget){
var i = '';
var widgetObject = widget.getCOSObject();
var pages = doc.getPages();
var annotation = '';
var annotationObject = '';
for (i = 0; i < pages.getCount(); i=i+1){
for (annotation in pages.get(i).getAnnotations()){
if(annotation.getSubtype() eq 'widget'){
annotationObject = annotation.getCOSObject();
if (annotationObject.equals(widgetObject)){
return i;
}
}
}
}
return -1;
}
function pageForSignature(doc, fieldName){
try{
var acroForm = doc.getDocumentCatalog().getAcroForm();
var field = '';
var widget = '';
var annotation = '';
var pageNo = '';
for(field in acroForm.getFields()){
if(field.getPartialName() == fieldName){
for(widget in field.getWidgets()){
for(annotation in widget.getPage().getAnnotations()){
if(annotation.getSubtype() == 'widget'){
pageNo = determineSafe(doc, widget);
doc.close();
return pageNo;
}
}
}
}
}
return -1;
}catch(e){
doc.close();
writeDump(label="catch error",var='#e#');
}
}
doc = PDDocument.init().load(File.init('/**********/myfile.pdf'));
//returns no, page numbers start at 0
pageNo = pageForSignature(doc, 'twtzceuxvx');
writeDump(label="pageForSignature(doc, fieldName)", var="#pageNo#");
}catch(e){
writeDump(label="catch error",var='#e#');
}
</cfscript>
A general solution that works whether the field has a single widget or multiple widgets (for example, the same qualified name used more than once):
List<PDAnnotationWidget> widget=field.getWidgets();
PDDocumentCatalog catalog = doc.getDocumentCatalog();
for(int i=0;i<widget.size();i++) {
int pageNumber = 1+ catalog.getPages().indexOf(field.getWidgets().get(i).getPage());
/* the field coordinates can also be read here; this works for single and multiple widgets */
//PDRectangle r= widget.get(i).getRectangle();
}
I am using POI's Event API to process a large volume of records without any memory footprint issues. Here is the reference for it.
When I process an XLSX sheet, I get the date value in a different format than the one specified in the Excel sheet. The date format for a column in the Excel sheet is 'dd-mm-yyyy', whereas I am getting the value in 'mm/dd/yy' format.
Can someone tell me how to get the actual format given in the Excel sheet? The relevant code snippet is given below.
ContentHandler handler = new XSSFSheetXMLHandler(styles, strings,
new SheetContentsHandler() {
public void startRow(int rowNum) {
}
public void endRow() {
}
public void cell(String cellReference, String formattedValue) {
// formattedValue is the cell content after POI has applied the cell's number format
System.out.println(formattedValue);
}
},
false); // formulasNotResults = false: report formula results, not formula text
The formattedValue passed to the cell method for a date column looks like 'mm/dd/yy', and hence I am not able to do the validations properly in my PL/SQL program.
Two points to keep in mind:
The original Excel cell may have a format that doesn't work for you or may be formatted as general text.
You may want to control exactly how dates, times or numeric values are formatted.
Another way to control the formatting of dates and other numeric values is to provide your own custom DataFormatter that extends org.apache.poi.ss.usermodel.DataFormatter.
You simply override the formatRawCellContents() method (or other methods depending on your needs):
Sample code constructing the parser / handler:
public void processSheet(Styles styles, SharedStrings strings,
SheetContentsHandler sheetHandler, InputStream sheetInputStream)
throws IOException, SAXException {
DataFormatter formatter = new CustomDataFormatter();
InputSource sheetSource = new InputSource(sheetInputStream);
try {
XMLReader sheetParser = SAXHelper.newXMLReader();
ContentHandler handler = new XSSFSheetXMLHandler(styles, null, strings, sheetHandler,
formatter, false);
sheetParser.setContentHandler(handler);
sheetParser.parse(sheetSource);
} catch (ParserConfigurationException e) {
throw new RuntimeException("SAX parser appears to be broken - " + e.getMessage());
}
}
private class CustomDataFormatter extends DataFormatter {
@Override
public String formatRawCellContents(double value, int formatIndex, String formatString,
boolean use1904Windowing) {
// Is it a date?
if (DateUtil.isADateFormat(formatIndex, formatString)) {
if (DateUtil.isValidExcelDate(value)) {
Date d = DateUtil.getJavaDate(value, use1904Windowing);
try {
return new SimpleDateFormat("yyyyMMdd").format(d);
} catch (Exception e) {
logger.log(Level.SEVERE, "Bad date value in Excel: " + d, e);
}
}
}
return new DecimalFormat("##0.#####").format(value);
}
}
I had the very same problem. After a few days googling and research, I came up with a solution. Unfortunately, it isn't nice, but it works:
Make a copy of org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler class in your project.
Find the interface SheetContentsHandler in the class.
Add a new method definition: String overriddenFormat(String cellRef, int formatIndex, String formatString);
Find this method in the class: public void endElement(String uri, String localName, String name) throws SAXException.
It has a long switch over the cell types.
In the case NUMBER there is an if statement like this: if (this.formatString != null) {...
Before that, paste this code:
String overriddenFormat = output.overriddenFormat(cellRef, formatIndex, formatString);
if (overriddenFormat != null) {
this.formatIndex = -1;
this.formatString = overriddenFormat;
}
Follow this article/answer: https://stackoverflow.com/a/11345859 but use your new class and interface.
Now you can use unique date formats if it is needed.
My use case was:
In a given sheet I have date values in G, H, and I columns, so my implementation of SheetContentsHandler.overriddenFormat is:
@Override
public String overriddenFormat(String cellRef, int formatIndex, String formatString) {
if (cellRef.matches("(G|H|I)\\d+")) { //matches all cells in G, H, and I columns
return "yyyy-mm-dd;@"; //this is the Hungarian date format in Excel
}
return null;
}
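To see which cell references the pattern above accepts, it can be tried in isolation. The sketch below uses only the JDK; the class and method names are hypothetical and just wrap the same regular expression.

```java
/** Hypothetical helper wrapping the column test from overriddenFormat. */
final class ColumnMatcher {

    /** True if cellRef (e.g. "G12") lies in column G, H or I. */
    static boolean isDateColumn(String cellRef) {
        // String.matches() has to match the whole string, so a reference
        // such as "AG12" is rejected although it contains "G12"
        return cellRef.matches("(G|H|I)\\d+");
    }
}
```

Because the whole string must match, two-letter columns ending in G, H or I are not picked up by accident.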
As you can see, in the endElement method I have overridden the formatIndex and formatString. The possible values of the formatIndex are described in org.apache.poi.ss.usermodel.DateUtil.isInternalDateFormat(int format). If the given value is not one of these (and -1 is not), the formatString is used to format the timestamp values. (The timestamp values are day counts starting from about 1900-01-01; a fractional part, if present, is the time of day.)
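As an illustration of how such a day count maps to a calendar date, here is a small JDK-only sketch. It is not POI code: the class ExcelSerialDates is hypothetical, it ignores the time of day, and in the 1900 date system it is only correct for serial values of 61 and above, because Excel counts a 29 February 1900 that never existed.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper: converts Excel day counts without any POI classes. */
final class ExcelSerialDates {

    /** Converts the whole-day part of an Excel serial number to a LocalDate. */
    static LocalDate toLocalDate(double serial, boolean use1904) {
        // 1900 system: serial 61 is 1900-03-01, which equals 1899-12-30 plus 61 days
        // 1904 system: serial 0 is 1904-01-01
        LocalDate epoch = use1904 ? LocalDate.of(1904, 1, 1) : LocalDate.of(1899, 12, 30);
        return epoch.plusDays((long) Math.floor(serial));
    }

    /** Formats the date with a java.time pattern such as "dd-MM-yyyy". */
    static String format(double serial, boolean use1904, String javaPattern) {
        return toLocalDate(serial, use1904).format(DateTimeFormatter.ofPattern(javaPattern));
    }
}
```

For example, ExcelSerialDates.format(44320, false, "dd-MM-yyyy") yields "04-05-2021". Note that java.time patterns use MM for months, whereas Excel format strings use mm.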
Excel stores some dates with regional settings. For example in the number format dialog in Excel you will see a warning like this:
Displays date and time serial numbers as date values, according to the type and locale (location) that you specify. Date formats that begin with an asterisk (*) respond to changes in regional date and time settings that are specified in Control Panel. Formats without an asterisk are not affected by Control Panel settings.
The Excel file that you are reading may be using one of those asterisk (*) date formats, in which case POI probably falls back to a US default format.
You will probably need to add some workaround code to map the date format strings to the format that you want.
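Such workaround code could be as simple as a lookup table from the format string that arrives in the handler to the pattern you actually want. The sketch below is JDK-only and makes assumptions: the class name is hypothetical, "m/d/yy" is assumed to be the string reported for the locale-dependent short date (it matches the output described in the question), and "dd-MM-yyyy" is assumed to be the desired result.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical lookup: Excel format string -> java.time / SimpleDateFormat pattern. */
final class DateFormatOverrides {

    private static final Map<String, String> JAVA_PATTERN_BY_EXCEL_FORMAT = new HashMap<>();
    static {
        // assumption: the regional short date shows up as the US default "m/d/yy"
        JAVA_PATTERN_BY_EXCEL_FORMAT.put("m/d/yy", "dd-MM-yyyy");
        // an explicit dd-mm-yyyy only needs its month letters upper-cased for Java
        JAVA_PATTERN_BY_EXCEL_FORMAT.put("dd-mm-yyyy", "dd-MM-yyyy");
    }

    /** Returns the mapped Java pattern, or fallback if the Excel format is unknown. */
    static String javaPatternFor(String excelFormat, String fallback) {
        return JAVA_PATTERN_BY_EXCEL_FORMAT.getOrDefault(excelFormat, fallback);
    }
}
```

The returned pattern could then be used inside a custom formatRawCellContents() or an overriddenFormat-style hook, depending on which of the approaches above you follow.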
See also the following for a discussion of regional date settings in Excel.