The PDFBox content stream is created per page, but the fields come from the form, which comes from the catalog, which comes from the PDF document itself. So I'm not sure which fields are on which pages, and this causes text to be written to the wrong locations/pages.
I.e., I'm processing fields per page, but I don't know which fields are on which pages.
Is there a way to tell which field is on which page? Or is there a way to get just the fields on the current page?
Thank you!
Mark
code snippet:
PDDocument pdfDoc = PDDocument.load(file);
PDDocumentCatalog docCatalog = pdfDoc.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
// Get field names
List<PDField> fieldList = acroForm.getFields();
List<PDPage> pages = pdfDoc.getDocumentCatalog().getAllPages();
for (PDPage page : pages) {
PDPageContentStream contentStream = new PDPageContentStream(pdfDoc, page, true, true, true);
processFields(acroForm, fieldList, contentStream, page);
contentStream.close();
}
The PDFbox content stream is done per page, but the fields come from the form which comes from the catalog, which comes from the pdf doc itself. So I'm not sure which fields are on which pages
The reason for this is that PDFs contain a global object structure defining the form. A form field in this structure may have 0, 1, or more visualizations on 0, 1, or more actual PDF pages. Furthermore, in case of only 1 visualization, a merge of field object and visualization object is allowed.
PDFBox 1.8.x
Unfortunately PDFBox in its PDAcroForm and PDField objects represents only this object structure and does not provide easy access to the associated pages. By accessing the underlying structures, though, you can build the connection.
The following code should make clear how to do that:
@SuppressWarnings("unchecked")
public void printFormFields(PDDocument pdfDoc) throws IOException {
PDDocumentCatalog docCatalog = pdfDoc.getDocumentCatalog();
List<PDPage> pages = docCatalog.getAllPages();
Map<COSDictionary, Integer> pageNrByAnnotDict = new HashMap<COSDictionary, Integer>();
for (int i = 0; i < pages.size(); i++) {
PDPage page = pages.get(i);
for (PDAnnotation annotation : page.getAnnotations())
pageNrByAnnotDict.put(annotation.getDictionary(), i + 1);
}
PDAcroForm acroForm = docCatalog.getAcroForm();
for (PDField field : (List<PDField>)acroForm.getFields()) {
COSDictionary fieldDict = field.getDictionary();
List<Integer> annotationPages = new ArrayList<Integer>();
List<COSObjectable> kids = field.getKids();
if (kids != null) {
for (COSObjectable kid : kids) {
COSBase kidObject = kid.getCOSObject();
if (kidObject instanceof COSDictionary)
annotationPages.add(pageNrByAnnotDict.get(kidObject));
}
}
Integer mergedPage = pageNrByAnnotDict.get(fieldDict);
if (mergedPage == null) {
if (annotationPages.isEmpty())
System.out.printf("i Field '%s' not referenced (invisible).\n", field.getFullyQualifiedName());
else
System.out.printf("a Field '%s' referenced by separate annotation on %s.\n", field.getFullyQualifiedName(), annotationPages);
} else {
if (annotationPages.isEmpty())
System.out.printf("m Field '%s' referenced as merged on %s.\n", field.getFullyQualifiedName(), mergedPage);
else
System.out.printf("x Field '%s' referenced as merged on %s and by separate annotation on %s. (Not allowed!)\n", field.getFullyQualifiedName(), mergedPage, annotationPages);
}
}
}
Beware, there are two shortcomings in the PDFBox PDAcroForm form field handling:
The PDF specification allows the global object structure defining the form to be a deep tree, i.e. the actual fields do not have to be direct children of the root but may be organized by means of inner tree nodes. PDFBox ignores this and expects the fields to be direct children of the root.
Some PDFs in the wild, foremost older ones, do not contain the field tree but only reference the field objects from the pages via the visualizing widget annotations. PDFBox does not see these fields in its PDAcroForm.getFields list.
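To illustrate the first point: walking a deep field tree is a plain recursion in which every node contributes its partial name to the fully qualified name, joined by periods. The following is only a sketch on a stand-in type. FieldNode and FieldTreeWalk are hypothetical names, not PDFBox classes, and the sketch ignores that in a real PDF the kids of a terminal field may be widget annotations rather than fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Stand-in for a node of the form field tree; NOT a PDFBox class. */
final class FieldNode {
    final String partialName;                       // the node's own name (the T entry in a PDF)
    final List<FieldNode> kids = new ArrayList<>(); // child nodes, empty for a terminal field

    FieldNode(String partialName, FieldNode... kids) {
        this.partialName = partialName;
        this.kids.addAll(Arrays.asList(kids));
    }
}

final class FieldTreeWalk {
    /** Collects the fully qualified names of all terminal (leaf) fields below the given roots. */
    static List<String> terminalFieldNames(List<FieldNode> roots) {
        List<String> names = new ArrayList<>();
        for (FieldNode root : roots) {
            collect(root, "", names);
        }
        return names;
    }

    private static void collect(FieldNode node, String parentName, List<String> out) {
        // fully qualified name = partial names from root to node, joined by '.'
        String fullName = parentName.isEmpty() ? node.partialName : parentName + "." + node.partialName;
        if (node.kids.isEmpty()) {
            out.add(fullName);          // terminal field
        } else {
            for (FieldNode kid : node.kids) {
                collect(kid, fullName, out);   // inner node: descend
            }
        }
    }

    public static void main(String[] args) {
        FieldNode address = new FieldNode("address", new FieldNode("street"), new FieldNode("city"));
        FieldNode form = new FieldNode("form", address, new FieldNode("name"));
        // prints [form.address.street, form.address.city, form.name]
        System.out.println(terminalFieldNames(Arrays.asList(form)));
    }
}
```

Code that only looks at the direct children of the root would see just "form" here and miss the three actual fields.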
PS: @mikhailvs in his answer correctly shows that you can retrieve a page object from a field widget using PDField.getWidget().getPage() and determine its page number using catalog.getAllPages().indexOf. While fast, this getPage() method has a drawback: it retrieves the page reference from an optional entry of the widget annotation dictionary. Thus, if the PDF you process was created by software that fills that entry, all is well; if the PDF creator did not fill that entry, all you get is a null page.
PDFBox 2.0.x
In 2.0.x some methods for accessing the elements in question have changed, but the situation as a whole has not: to safely retrieve the page of a widget, you still have to iterate through the pages and find the page that references the annotation.
The safe method:
int determineSafe(PDDocument document, PDAnnotationWidget widget) throws IOException
{
COSDictionary widgetObject = widget.getCOSObject();
PDPageTree pages = document.getPages();
for (int i = 0; i < pages.getCount(); i++)
{
for (PDAnnotation annotation : pages.get(i).getAnnotations())
{
COSDictionary annotationObject = annotation.getCOSObject();
if (annotationObject.equals(widgetObject))
return i;
}
}
return -1;
}
The fast method:
int determineFast(PDDocument document, PDAnnotationWidget widget)
{
PDPage page = widget.getPage();
return page != null ? document.getPages().indexOf(page) : -1;
}
Usage:
PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm();
if (acroForm != null)
{
for (PDField field : acroForm.getFieldTree())
{
System.out.println(field.getFullyQualifiedName());
for (PDAnnotationWidget widget : field.getWidgets())
{
System.out.print(widget.getAnnotationName() != null ? widget.getAnnotationName() : "(NN)");
System.out.printf(" - fast: %s", determineFast(document, widget));
System.out.printf(" - safe: %s\n", determineSafe(document, widget));
}
}
}
(DetermineWidgetPage.java)
(In contrast to the 1.8.x code, the safe method here simply searches for the page of a single widget. If your code has to determine the pages of many widgets, you should build a lookup Map as in the 1.8.x case.)
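A minimal sketch of such a lookup, written against plain Java stand-ins so that it runs without PDFBox: WidgetPageIndex is a hypothetical helper, each page is represented simply as the list of its annotation objects, and annotations are compared by identity, which is an assumption of this sketch. With PDFBox you would key the map by the annotation's COS dictionary, as in the 1.8.x code above.

```java
import java.util.Arrays;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: scans all pages once, then answers page lookups from a map. */
final class WidgetPageIndex {
    private final Map<Object, Integer> pageIndexByAnnotation = new IdentityHashMap<>();

    /** pages.get(i) holds the annotation objects of page i (stand-ins for annotation dictionaries). */
    WidgetPageIndex(List<? extends List<?>> pages) {
        for (int i = 0; i < pages.size(); i++) {
            for (Object annotation : pages.get(i)) {
                // keep the first page if an annotation is listed more than once
                pageIndexByAnnotation.putIfAbsent(annotation, i);
            }
        }
    }

    /** Returns the 0-based page index, or -1 if no page references the widget. */
    int pageIndexOf(Object widget) {
        Integer index = pageIndexByAnnotation.get(widget);
        return index != null ? index : -1;
    }

    public static void main(String[] args) {
        Object widgetA = new Object();
        Object widgetB = new Object();
        Object unreferenced = new Object();
        List<List<Object>> pages = Arrays.asList(
                Arrays.asList(widgetA),
                Arrays.asList(new Object(), widgetB));

        WidgetPageIndex index = new WidgetPageIndex(pages);
        System.out.println(index.pageIndexOf(widgetA));      // prints 0
        System.out.println(index.pageIndexOf(widgetB));      // prints 1
        System.out.println(index.pageIndexOf(unreferenced)); // prints -1
    }
}
```

The pages are iterated once when the index is built; every later lookup is a single map access instead of another pass over all pages.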
Example documents
A document for which the fast method fails: aFieldTwice.pdf
A document for which the fast method works: test_duplicate_field2.pdf
Granted this answer may not help the OP (a year later), but if someone else runs into it, here is the solution:
PDDocumentCatalog catalog = doc.getDocumentCatalog();
int pageNumber = catalog.getAllPages().indexOf(yourField.getWidget().getPage());
This example uses Lucee (CFML): https://lucee.org/
A big thank you to mkl as his answer above is invaluable and I couldn't have built this function without his help.
Call pageForSignature(doc, fieldName) to get the number of the page the field resides on (page numbers start at 0). It returns -1 if fieldName is not found.
<cfscript>
// Java classes are loaded via CreateObject()
variables.File = CreateObject("java", "java.io.File");
// references the Lucee bundle directory, typically on Tomcat: /usr/local/tomcat/lucee-server/bundles
variables.PDDocument = CreateObject("java", "org.apache.pdfbox.pdmodel.PDDocument", "org.apache.pdfbox.app", "2.0.18");
function determineSafe(doc, widget){
var i = '';
var widgetObject = widget.getCOSObject();
var pages = doc.getPages();
var annotation = '';
var annotationObject = '';
for (i = 0; i < pages.getCount(); i=i+1){
for (annotation in pages.get(i).getAnnotations()){
if(annotation.getSubtype() eq 'widget'){
annotationObject = annotation.getCOSObject();
if (annotationObject.equals(widgetObject)){
return i;
}
}
}
}
return -1;
}
function pageForSignature(doc, fieldName){
try{
var acroForm = doc.getDocumentCatalog().getAcroForm();
var field = '';
var widget = '';
var annotation = '';
var pageNo = '';
for(field in acroForm.getFields()){
if(field.getPartialName() == fieldName){
for(widget in field.getWidgets()){
for(annotation in widget.getPage().getAnnotations()){
if(annotation.getSubtype() == 'widget'){
pageNo = determineSafe(doc, widget);
doc.close();
return pageNo;
}
}
}
}
}
return -1;
}catch(e){
doc.close();
writeDump(label="catch error",var='#e#');
}
}
doc = PDDocument.init().load(File.init('/**********/myfile.pdf'));
// returns the page number; page numbers start at 0
pageNo = pageForSignature(doc, 'twtzceuxvx');
writeDump(label="pageForSignature(doc, fieldName)", var="#pageNo#");
</cfscript>
A general solution for fields with a single widget or with multiple widgets (for example, when the same qualified name is used more than once):
List<PDAnnotationWidget> widgets = field.getWidgets();
PDDocumentCatalog catalog = doc.getDocumentCatalog();
for (int i = 0; i < widgets.size(); i++) {
    // 1-based page number; note that getPage() can return null when the widget lacks a page reference
    int pageNumber = 1 + catalog.getPages().indexOf(widgets.get(i).getPage());
    // the field coordinates are available here too, for single and multiple widgets alike
    // PDRectangle r = widgets.get(i).getRectangle();
}
Related
I'm having an interactive PDF with a couple of fields. When some of the fields are filled in the other ones are calculated. In Adobe Acrobat Reader this works fine.
Now when I fill in the document as follows:
public static void setField(PDDocument pdfDocument, String name, String value ) throws IOException {
PDDocumentCatalog docCatalog = pdfDocument.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
PDField field = acroForm.getField( name );
if( field != null ) {
field.setValue(value);
} else {
System.err.println( "No field found with name:" + name );
}
}
The fields are filled in but I have two problems:
For every field I get:
May 04, 2021 11:57:04 AM org.apache.pdfbox.pdmodel.interactive.form.AppearanceGeneratorHelper getFormattedValue
INFO: Field contains a formatting action but no ScriptingHandler has been supplied - formatted value might be incorrect
The fields that are normally auto calculated are not filled in. Do I need to trigger some actions or is it because the field is not formatted like a string or a number?
I need to remove the text rise property (setTextRise) from a Text element if applying the rise, positive or negative, would move the text outside the page area.
PdfDocument pdfDoc = new PdfDocument(pdfWriter);
Document doc = new Document(pdfDoc, PageSize.A5);
doc.setMargins(0,0,0,36);
for (int i = 0; i <50 ; i++) {
Text t = new Text("hello " + i);
if(i ==0){
t.setTextRise(7);
}
if(i==31){
t.setTextRise(-35);
}
Paragraph p = new Paragraph(t);
p.setNextRenderer(new ParagraphRen(p,doc));
p.setFixedLeading(fixedLeading);
doc.add(p);
}
doc.close();
}
class ParagraphRen extends ParagraphRenderer{
private float heightDoc;
private float marginTop;
private float marginBot;
public ParagraphRen(Paragraph modelElement, Document doc) {
super(modelElement);
this.heightDoc =doc.getPdfDocument().getDefaultPageSize().getHeight();
this.marginTop = doc.getTopMargin();
this.marginBot = doc.getBottomMargin();
}
@Override
public void drawChildren(DrawContext drawContext) {
super.drawChildren(drawContext);
Rectangle rect = this.getOccupiedAreaBBox();
List<IRenderer> childRenderers = this.getChildRenderers();
//check first line
if(rect.getTop()<=heightDoc- marginTop) {
for (IRenderer iRenderer : childRenderers) {
if (iRenderer.getModelElement().hasProperty(72)) {
Object property = iRenderer.getModelElement().getProperty(72);
float v = (Float) property + rect.getTop();
//check text more AreaPage
if(v >heightDoc){
iRenderer.getModelElement().deleteOwnProperty(72);
}
}
}
}
//check last line
if(rect.getBottom()-marginBot-rect.getHeight()*2<0){
for (IRenderer iRenderer : childRenderers) {
if (iRenderer.getModelElement().hasProperty(72)) {
Object property = iRenderer.getModelElement().getProperty(72);
//if setRise(-..) more margin bottom setRise remove
if(rect.getBottom()-marginBot-rect.getHeight()+(Float) property<0)
iRenderer.getModelElement().deleteOwnProperty(72);
}
}
}
}
}
Here I check whether the first lines with a text rise would extend beyond the page area; if so, I remove the text rise property.
Likewise, if the last lines with setTextRise(-35) would extend below the bottom margin, I remove it.
But it doesn't work: the properties are not removed.
Your problem is as follows: drawChildren method gets called after rendering has been done. At this stage iText usually doesn't consider properties of any elements: it just places the element in its occupied area, which has been calculated before, at layout() stage.
You can overcome it with layout emulation.
Let's add all your paragraphs to a div rather than directly to the document. Then emulate adding this div to the document:
LayoutResult result = div.createRendererSubTree().setParent(doc.getRenderer()).layout(new LayoutContext(new LayoutArea(0, PageSize.A5)));
In the snippet above I've tried to lay out our div on an A5-sized document.
Now you can examine the layout result and change some elements, which will then be processed for real with Document#add. For example, to get the laid-out paragraph at index 30 one can use:
((DivRenderer)result.getSplitRenderer()).getChildRenderers().get(30);
Some more tips:
The split renderer represents the part of the content that iText can place on the area; the overflow renderer represents the content that does not fit.
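Once the emulated layout has given you the occupied area of each paragraph, deciding whether to keep a text rise is plain arithmetic that you can do before calling Document#add. A sketch of that check, assuming PDF user space coordinates with the origin at the bottom left of the page: TextRiseCheck and allowedRise are hypothetical names, not iText API, and the numbers in main are made up for illustration.

```java
/** Hypothetical helper, not part of iText: decides whether a text rise may be applied. */
final class TextRiseCheck {
    /**
     * Returns the requested rise if the shifted line stays between the bottom margin
     * and the top edge of the page, otherwise 0 (i.e. no rise).
     * All values are in points; y grows upwards from the bottom of the page.
     */
    static float allowedRise(float requestedRise, float lineBottom, float lineTop,
                             float pageHeight, float bottomMargin) {
        boolean aboveTop = lineTop + requestedRise > pageHeight;
        boolean belowBottom = lineBottom + requestedRise < bottomMargin;
        return (aboveTop || belowBottom) ? 0f : requestedRise;
    }

    public static void main(String[] args) {
        float pageHeight = 595f;   // height of an A5 page in points
        float bottomMargin = 36f;  // made-up margin for this example

        // first line near the top edge: a rise of 7 would leave the page
        System.out.println(allowedRise(7f, 578f, 590f, pageHeight, bottomMargin));
        // line in the middle of the page: the rise is kept
        System.out.println(allowedRise(7f, 300f, 312f, pageHeight, bottomMargin));
        // last line near the bottom: a rise of -35 would drop below the margin
        System.out.println(allowedRise(-35f, 50f, 62f, pageHeight, bottomMargin));
    }
}
```

With a result of 0 you would simply not call setTextRise on that Text element before adding the paragraph for real.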
I have a path to a page (/content/my-site/en/cars, for example) and I need a list of all 'parsys' and 'iparsys' components present on this page, in Java code. Is there any way to do it? Thanks for any help.
I assume you are using a Sling model or WCMUsePojo to read the inner nodes of a page. Here are the techniques:
If you don't know how many parsys nodes are present: this is not the ideal case, since the page rendering script dictates all included parsys and iparsys components. But just in case, you can run a query on sling:resourceType like this:
Iterator<Resource> parsysResources = resourceResolver.findResources("/jcr:root/content/my-site/en/cars//*[sling:resourceType='foundation/components/parsys']", Query.XPATH);
Iterator<Resource> iparsysResources = resourceResolver.findResources("/jcr:root/content/my-site/en/cars//*[sling:resourceType='foundation/components/iparsys']", Query.XPATH);
A similar query with the Query Builder (recommended): the Query Builder API is recommended for readability and future extensibility.
List<Resource> parsysIparsysResources = new ArrayList<>();
Map<String, String> predicateMap = new HashMap<>();
predicateMap.put("path", "/content/my-site/en/cars");
// one property predicate with several values: the values are OR-ed,
// whereas two separate property predicates would be AND-ed and match nothing
predicateMap.put("property", "sling:resourceType");
predicateMap.put("property.1_value", "foundation/components/parsys");
predicateMap.put("property.2_value", "foundation/components/iparsys");
predicateMap.put("p.limit", "-1");
QueryBuilder queryBuilder = resourceResolver.adaptTo(QueryBuilder.class);
Session session = resourceResolver.adaptTo(Session.class);
com.day.cq.search.Query query = queryBuilder.createQuery(PredicateGroup.create(predicateMap), session);
SearchResult result = query.getResult();
Iterator<Resource> resources = result.getResources();
while (resources.hasNext()) {
    parsysIparsysResources.add(resources.next());
}
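If more resource types need to be matched, the predicate map can be generated instead of written by hand. The sketch below uses only java.util and relies on the Query Builder convention that several values of one property predicate (property.1_value, property.2_value, ...) are combined with OR. ResourceTypePredicates and forResourceTypes are hypothetical names introduced for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that builds a Query Builder predicate map for several resource types. */
final class ResourceTypePredicates {
    static Map<String, String> forResourceTypes(String path, String... resourceTypes) {
        Map<String, String> predicates = new LinkedHashMap<>();
        predicates.put("path", path);
        predicates.put("property", "sling:resourceType");
        for (int i = 0; i < resourceTypes.length; i++) {
            // values of a single property predicate are numbered from 1
            predicates.put("property." + (i + 1) + "_value", resourceTypes[i]);
        }
        predicates.put("p.limit", "-1"); // return all hits
        return predicates;
    }

    public static void main(String[] args) {
        Map<String, String> predicates = forResourceTypes(
                "/content/my-site/en/cars",
                "foundation/components/parsys",
                "foundation/components/iparsys");
        // print one predicate per line, in insertion order
        for (Map.Entry<String, String> entry : predicates.entrySet()) {
            System.out.println(entry.getKey() + "=" + entry.getValue());
        }
    }
}
```

The resulting map can be passed to PredicateGroup.create(...) exactly like a hand-written predicateMap.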
If the parsys nodes are known to be immediate children of page content, listChildren will be cheaper compared to query.
Page pageContent = pageManager.getContainingPage("/content/my-site/en/cars");
Iterator<Resource> children = pageContent.getContentResource().listChildren();
while(children != null && children.hasNext()) {
Resource child = children.next();
if(child.isResourceType("foundation/components/parsys") || child.isResourceType("foundation/components/iparsys")) {
// do something
}
}
If the node name of the inner parsys is known, the JCR API can be leveraged:
Page pageContent = pageManager.getContainingPage("/content/my-site/en/cars");
// the parsys nodes live below the page's jcr:content node
Node pageContentNode = pageContent.getContentResource().adaptTo(Node.class);
try {
NodeIterator nodeIter = pageContentNode.getNodes("parsys*");
// iterate nodes
} catch (RepositoryException e) {
e.printStackTrace();
}
I am using PDFBox to fill in PDF forms that we've been given by a third party.
I'm having a problem with only 1 of the forms, this code works for 21 others.
I know the valueToSet has value and is correct, and within the setField method, the getField method does return a value, so I know the field name is correct too. Plus, this code works fine with many other forms. None of the fields are populating (this particular template only has text boxes anyway).
What am I missing? Is there something on this specific form I should be looking for?
setField(formFieldName, valueToSet);
public static void setField(String name, String value ) throws IOException {
PDDocumentCatalog docCatalog = document.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
PDField field = acroForm.getField( name );
if (field instanceof PDCheckBox){
String onValue = ((PDCheckBox) field).getOnValue();
String offValue = "Off";
if(value.equals("Yes")){
field.setValue(onValue);
}
else{
field.setValue(offValue);
}
}
else{
field.setValue(value);
}
}
So I am trying to get the data from this webpage using Jsoup...
I've tried looking up many different ways of doing it and I've gotten close but I don't know how to find tags for certain stats (Attack, Strength, Defence, etc.)
So let's say, for example's sake, I wanted to print out
'Attack', '15', '99', '200,000,000'
How should I go about doing this?
You can use CSS selectors in Jsoup to easily extract the column data.
// retrieve page source code
Document doc = Jsoup
.connect("http://services.runescape.com/m=hiscore_oldschool/hiscorepersonal.ws?user1=Lynx%A0Titan")
.get();
// find all of the table rows
Elements rows = doc.select("div#contentHiscores table tr");
ListIterator<Element> itr = rows.listIterator();
// loop over each row
while (itr.hasNext()) {
Element row = itr.next();
// does the second col contain the word attack?
if (row.select("td:nth-child(2) a:contains(attack)").first() != null) {
// if so, assign each sibling col to variable
String rank = row.select("td:nth-child(3)").text();
String level = row.select("td:nth-child(4)").text();
String xp = row.select("td:nth-child(5)").text();
System.out.printf("rank=%s level=%s xp=%s", rank, level, xp);
// stop looping rows, found attack
break;
}
}
A very rough implementation would be as below. I have only shown a snippet; optimizations and other conditionals still need to be added.
public static void main(String[] args) throws Exception {
Document doc = Jsoup
.connect("http://services.runescape.com/m=hiscore_oldschool/hiscorepersonal.ws?user1=Lynx%A0Titan")
.get();
Element contentHiscoresDiv = doc.getElementById("contentHiscores");
Element table = contentHiscoresDiv.child(0);
for (Element row : table.select("tr")) {
Elements tds = row.select("td");
for (Element column : tds) {
if (column.children() != null && column.children().size() > 0) {
Element anchorTag = column.getElementsByTag("a").first();
if (anchorTag != null && anchorTag.text().contains("Attack")) {
System.out.println(anchorTag.text());
Elements attributeSiblings = column.siblingElements();
for (Element attributeSibling : attributeSiblings) {
System.out.println(attributeSibling.text());
}
}
}
}
}
}
Attack
15
99
200,000,000