For a certain PDF file, when I use page.getMediaBox().getWidth() and page.getMediaBox().getHeight() in PDFBox to get the width and height of a page, it shows values that differ from the ones I see in the PDFBox PDFDebugger. What might be the reason? I am attaching a screenshot of the PDFDebugger. I am using PDFBox 2.0.9. The values I get from page.getMediaBox().getWidth() and page.getMediaBox().getHeight() are 531.36597 and 647.99603 respectively, which do not match the PDFDebugger values. (This only occurs for the first page of the PDF; for further pages it works fine.)
As Tilman already stated in a comment, the values to expect are
a width of 1282.2 - 750.834 = 531.366 and
a height of 849.593 - 201.597 = 647.996 (corrected value).
The observed values
531.36597 and 647.99603
correspond to the expected values well enough considering the accuracy of the float type.
I assume the OP misunderstands the values of the MediaBox array. They do not contain the width or height as explicit values but the coordinates of two opposite corners of the box.
The MediaBox value is specified to have the type rectangle, cf. ISO 32000-1 table 30 Entries in a page object. And a rectangle is specified as
a specific array object used to describe locations on a page and bounding boxes for a variety of objects and written as an array of four numbers giving the coordinates of a pair of diagonally opposite corners,
cf. ISO 32000-1 section 4.40 rectangle.
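The arithmetic above can be sketched in plain Java, independent of PDFBox. The class name MediaBoxSize and the order of the four numbers in the array are assumptions made for illustration; only the four coordinate values come from the subtraction shown above.

```java
// Minimal sketch: derive page width and height from the four MediaBox numbers.
// MediaBoxSize is a hypothetical helper, not part of PDFBox.
final class MediaBoxSize {

    // box = {x1, y1, x2, y2}: a pair of diagonally opposite corners, in either order.
    static double width(double[] box) {
        return Math.abs(box[2] - box[0]);
    }

    static double height(double[] box) {
        return Math.abs(box[3] - box[1]);
    }

    public static void main(String[] args) {
        // Corner coordinates taken from the answer; their order in the array is assumed.
        double[] mediaBox = {750.834, 201.597, 1282.2, 849.593};
        System.out.println("width  = " + width(mediaBox));
        System.out.println("height = " + height(mediaBox));
    }
}
```

Taking the absolute difference keeps the result positive even if a file lists the upper right corner first.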
As also already mentioned by Tilman you probably should be looking at the CropBox instead.
I have a requirement to add adobe form fields to an existing pdf.
The problem I encounter is that when I add fields to a rotated page, the text orientation of the resulting form field is incorrect.
For example, a page that is rotated 90 degrees clockwise results in a form field whose text is "vertical".
Is there a workaround to get form fields created with the correct orientation?
The appearance characteristics dictionary (/MK entry) of the widget has an /R entry where the rotation can be set. See e.g. this file.
PDAppearanceCharacteristicsDictionary fieldAppearance
= new PDAppearanceCharacteristicsDictionary(new COSDictionary());
fieldAppearance.setRotation(90);
widget.setAppearanceCharacteristics(fieldAppearance);
You may have to adjust your coordinates. To find the best coordinates, use PDFDebugger and hover at the place you want your field to be.
Update:
For checkmarks (and radio buttons) where the appearance stream is created by the user and not by PDFBox (as seen here or in the PDFBox example) you need to set the matrix yourself like this (for 90°):
yesAP.setMatrix(AffineTransform.getQuadrantRotateInstance(1, rect.getWidth(), 0));
The "1" here is for 90°. The translation needs to be adjusted for the other rotations.
I am trying to get the absolute position of a PDF field, and my code is as follows.
float[] _advisor = reader.getAcroFields().getFieldPositions("_advisor");
float[] _test = reader.getAcroFields().getFieldPositions("_test");
float[] _owner = reader.getAcroFields().getFieldPositions("_owner");
All the fields are vertically aligned at the same left position.
The problem is that the first two fields are on the same page of the PDF and their xLeft values are the same, but the last field, _owner, is on the second page and its xLeft value is off by a large amount. Do I need to subtract an offset or something similar for fields on a different page?
Some things to consider:
The default coordinate used by iText has its origin at the lower left corner of the page.
iText will return coordinates in points, rather than in pixels.
You can display a ruler and grid overlay using Adobe Reader, this enables you to easily gauge where each component is at. Check whether these readings are the same as the values iText provides you with.
If you still think iText is giving you the wrong values, please provide us access to your pdf, and provide us with the values you expect to receive (and why).
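To compare values given in points with on-screen readings in pixels, a conversion like the following sketch can help. It relies on the default PDF user space unit being 1/72 inch; the class name PointsToPixels and the DPI values in main are assumptions for illustration only.

```java
// Minimal sketch: convert PDF points (1/72 inch by default) to pixels at a given resolution.
// PointsToPixels is a hypothetical helper, not part of iText.
final class PointsToPixels {

    static final double POINTS_PER_INCH = 72.0;

    static double toPixels(double points, double dpi) {
        return points * dpi / POINTS_PER_INCH;
    }

    public static void main(String[] args) {
        // Example resolutions; pick the one your viewer or renderer actually uses.
        System.out.println(toPixels(72.0, 96.0));
        System.out.println(toPixels(36.0, 300.0));
    }
}
```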
One possible issue could be that your mediabox has a different positioning than 0,0. I needed that once so I "normalized" the values like this:
PdfDictionary pageDict = reader.getPageN(pageNumber);
PdfArray mediaBox = (PdfArray)PdfReader.getPdfObject(pageDict.get(PdfName.MEDIABOX));
//check whether the mediabox has a different positioning than 0,0
if(((PdfNumber)mediaBox.getPdfObject(0)).floatValue()!=0){
//normalize X coordinates
lowerLeftX = lowerLeftX-((PdfNumber)mediaBox.getPdfObject(0)).floatValue();
upperRightX = upperRightX-((PdfNumber)mediaBox.getPdfObject(0)).floatValue();
}
if(((PdfNumber)mediaBox.getPdfObject(1)).floatValue()!=0){
//normalize Y coordinates
lowerLeftY = lowerLeftY-((PdfNumber)mediaBox.getPdfObject(1)).floatValue();
upperRightY = upperRightY-((PdfNumber)mediaBox.getPdfObject(1)).floatValue();
}
I am setting a margin for a pdf and checking if the contents of the page are exceeding the margin.
I am easily able to do that if the contents of a page are just text.
Here's what I am doing:
I am using TextMarginFinder. I set the left margin value of the PDF based on the book size and compare it with finder.getLlx(), since finder.getLlx() gives me the leftmost position of any text on that page.
TextMarginFinder finder;
if(leftmar>=finder.getLlx())
{
errormargin=1; //left margin error
System.out.println("Page: "+i+"Margin Error:LeftMArginError ");
}
But this does not work if the page contains an image. Although the image extends beyond the margin, I do not get the error with the above code, since finder.getLlx() seems to work only for text.
Two Questions:
1) While looping through the pages in pdf, if there is an image in that page, how can I check if that particular page contains an image?
2) If it contains an image, how can I obtain its extreme positions?
Update after mkl suggestion
if(leftmar>=finder.getLlx())
{
errormargin=1; //left margin error
System.out.println("finder.getLlx() value ="+finder.getLlx()+", leftmar Value="+leftmar);
}
if(rightmar<= finder.getUrx()){
errormargin=1; //right margin error
System.out.println("finder.getUrx() value ="+finder.getUrx()+", rightmar Value="+rightmar);
}
if(margintop >= finder.getUry()){
errormargin=3; //top margin error
System.out.println("finder.getUry() value ="+finder.getUry()+", margintop Value="+margintop);
}
if(marginbottom >= finder.getLly()){
errormargin=3; //bottom margin error
System.out.println("finder.getLly() value ="+finder.getLly()+", marginbottom Value="+marginbottom);
}
This is more an answer to what the OP actually wanted, a way to retrieve the bounding box of all content on a page.
The OP already uses the iText TextMarginFinder render listener class to determine the bounding box of the text on a page. In the context of this answer an analogous class, MarginFinder, has been developed which considers not only text but also other kinds of content, e.g. bitmap images and vector graphics.
Thus, replacing TextMarginFinder with MarginFinder makes it possible to find the bounding box of any content on the page.
Please be aware:
Any content is considered, the margin finder does not check whether the content makes a difference. E.g. think about white text, white bitmap areas, or white rectangles, all are considered content and, therefore, the bounding box encompasses such invisible content, too. Especially the latter example, white rectangles, might be a problem here or there as some software first paints a white rectangle over the whole page area.
Clipping paths are not considered. Thus, even content that never is drawn (because it is clipped away) makes the bounding box expand.
Page borders are not considered, either. Thus, off-page content like printer marks may make the bounding box expand even more.
The code calculating the bounding box for vector graphics is not exact: it simply returns the bounding box of all control points, which in the case of Bezier curves may not be the tight bounding box of the curve. It also ignores line widths and wedge types, which results in somewhat-off coordinates.
Annotations are not considered. Thus, the resulting bounding box may be too small if annotations are expected to also be considered, e.g. for forms.
In spite of these shortcomings, the render listener usually returns correct results. If this is not enough, the class can be extended accordingly.
PS: Anyone who is interested in the original question may find answers in the MarginFinder render listener class and its use.
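The bounding-box idea behind such a margin check can be sketched in plain Java as follows. This is not the MarginFinder class itself: ContentBounds, its method names, and the sample rectangles are hypothetical, and the sketch assumes each piece of content has already been reduced to a rectangle {llx, lly, urx, ury}.

```java
// Minimal sketch: union of content rectangles and a check against an allowed area.
// ContentBounds is a hypothetical helper, not an iText class.
final class ContentBounds {

    // Returns {llx, lly, urx, ury} enclosing all rectangles, or null if there are none.
    static double[] union(double[]... rects) {
        if (rects.length == 0) {
            return null;
        }
        double llx = Double.POSITIVE_INFINITY, lly = Double.POSITIVE_INFINITY;
        double urx = Double.NEGATIVE_INFINITY, ury = Double.NEGATIVE_INFINITY;
        for (double[] r : rects) {
            // Tolerate rectangles whose corners are given in either order.
            llx = Math.min(llx, Math.min(r[0], r[2]));
            lly = Math.min(lly, Math.min(r[1], r[3]));
            urx = Math.max(urx, Math.max(r[0], r[2]));
            ury = Math.max(ury, Math.max(r[1], r[3]));
        }
        return new double[] {llx, lly, urx, ury};
    }

    // True when bounds lies entirely inside the allowed area {llx, lly, urx, ury}.
    static boolean insideMargins(double[] bounds, double[] allowed) {
        return bounds[0] >= allowed[0] && bounds[1] >= allowed[1]
            && bounds[2] <= allowed[2] && bounds[3] <= allowed[3];
    }

    public static void main(String[] args) {
        double[] text = {50, 60, 200, 300};   // made-up text block
        double[] image = {30, 100, 150, 700}; // made-up image sticking out on the left
        double[] bounds = union(text, image);
        System.out.println(insideMargins(bounds, new double[] {36, 36, 576, 756}));
    }
}
```

Checking one combined bounding box against one allowed rectangle covers all four margins at once, so text and images do not need separate comparisons.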
I'm using PDFBox's PDPage::convertToImage to display PDF pages in Java. I'm trying to create click-able areas on the PDF page's image based on COSObjects in the page (namely, AcroForm fields). The problem is the PDF seems to use a completely different coordinate system:
System.out.println(field.getDictionary().getItem(COSName.RECT));
yields
COSArray{[COSFloat{149.04}, COSFloat{678.24}, COSInt{252}, COSFloat{697.68}]}
If I were to estimate the actual dimensions of the field's rectangle on the image, it would be 40,40,50,10 (x,y,width,height). There's no obvious correlation between the two and I can't seem to find any information about this with Google.
How can I determine the pixel position of a PDPage's COSObjects?
The pdf coordinate system is not that different from the coordinate system used in images. The only differences are:
the y-axis points up, not down
the scale is most likely different.
You can convert from pdf coordinates to image coordinates using these formulae:
x_image = x_pdf * width_image / width_page
y_image = (height_page - y_pdf) * height_image / height_page
To get the page size, simply use the mediabox size of the page that contains the annotation:
PDRectangle pageBounds = page.getMediaBox();
You may have missed the correlation between the array from the pdf and your image coordinate estimates, since a rectangle in pdf is represented as array [x_left, y_bottom, x_right, y_top].
Fortunately PDFBox provides classes that operate on a higher level than the cos structure. Use this to your advantage and use e.g. PDRectangle you get from the PDAnnotation using getRectangle() instead of accessing the COSArray you extract from the field's dictionary.
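As a rough sketch, the conversion can be written in plain Java like this. The class PdfToImageRect is hypothetical, the page size of 612 x 792 in main is an assumed value (the question does not state it), and the sketch assumes an unrotated page whose MediaBox starts at 0,0.

```java
// Minimal sketch: map a PDF rectangle to image coordinates (origin top-left, y pointing down).
// PdfToImageRect is a hypothetical helper, not part of PDFBox.
final class PdfToImageRect {

    // pdfRect = {xLeft, yBottom, xRight, yTop}; returns {x, y, width, height} in image pixels.
    static double[] toImageRect(double[] pdfRect, double pageWidth, double pageHeight,
                                double imageWidth, double imageHeight) {
        double scaleX = imageWidth / pageWidth;
        double scaleY = imageHeight / pageHeight;
        double x = pdfRect[0] * scaleX;
        double y = (pageHeight - pdfRect[3]) * scaleY; // the top edge becomes the image y
        double width = (pdfRect[2] - pdfRect[0]) * scaleX;
        double height = (pdfRect[3] - pdfRect[1]) * scaleY;
        return new double[] {x, y, width, height};
    }

    public static void main(String[] args) {
        // Rectangle from the question; page and image sizes are assumptions.
        double[] rect = {149.04, 678.24, 252, 697.68};
        double[] r = toImageRect(rect, 612, 792, 1224, 1584);
        System.out.println(r[0] + ", " + r[1] + ", " + r[2] + ", " + r[3]);
    }
}
```

Note that the image y coordinate is computed from the top edge of the PDF rectangle, because the top-left corner is the anchor in image space.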
I want to remove the bottom part of each page in the PDF, but not change page size, what is the recommended way to do this in java in PDFBOX? How to remove the footer from each page in PDF?
Is there possibly a way to use PDRectangle to just delete all text/images within it?
Here is a snippet of what I tried. Using a rectangle with setCropBox seems to lose the page size; maybe the CropBox is not intended for this?
PDRectangle rectangle = new PDRectangle();
rectangle.setUpperRightY(mypage.findCropBox().getUpperRightY());
rectangle.setLowerLeftY(50);
rectangle.setUpperRightX(mypage.findCropBox().getUpperRightX());
rectangle.setLowerLeftX(mypage.findCropBox().getLowerLeftX());
mypage.setCropBox(rectangle);
croppedDoc.addPage(mypage);
croppedDoc.save(filename);
croppedDoc.close();
Closest example in pdfbox cookbook examples I could find is on how to remove entire page, however this is not what I'm looking for, I'd like to just delete few elements from the page:
http://pdfbox.apache.org/userguide/cookbook.html
I'm also a newbie, but take a look at this page, in particular, the description of TrimBox. If there's no TrimBox on the page, it defaults to CropBox, which would cause what you're seeing.
In general, don't expect the PDFBox docs to tell you much of anything about PDF itself - to use PDFBox well I think you need to go elsewhere - AFAIK, mostly just to the PDF specification. I haven't even skimmed it yet, though!
The CropBox is the way to go if you want to remove a portion of a page while keeping a rectangular region visible. If you want the page size to remain the same, you need the MediaBox to remain the same.
From the PDF Spec:
CropBox - rectangle (Optional; inheritable) A rectangle, expressed in default user space units, defining the visible region of default
user space. When the page is displayed or printed, its contents are to
be clipped (cropped) to this rectangle and then imposed on the output
medium in some implementation-defined manner (see Section 10.10.1,
“Page Boundaries”). Default value: the value of MediaBox.
MediaBox - rectangle (Required; inheritable) A rectangle (see Section 3.8.4, “Rectangles”), expressed in default user space units,
defining the boundaries of the physical medium on which the page is
intended to be displayed or printed (see Section 10.10.1, “Page
Boundaries”).
I have seen (faulty) applications and libraries that force the CropBox and the MediaBox to be the same; double-check that this is not what is happening in your case.
Also take into account that the coordinate origin (0,0) in PDF is the bottom-left corner. Some libraries translate this to top-left for you and others do not, so you may also want to double-check this for the library you are using.
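The relationship between the two boxes can be illustrated with plain arithmetic. The sketch below is not PDFBox code: FooterCrop and its method are hypothetical, rectangles are given as {llx, lly, urx, ury} with the origin at the bottom left, and the footer height of 50 units is taken from the question's snippet.

```java
// Minimal sketch: compute a crop rectangle that hides a footer strip at the bottom of the page.
// FooterCrop is a hypothetical helper; the MediaBox array itself is left unchanged.
final class FooterCrop {

    // mediaBox = {llx, lly, urx, ury}; footerHeight is measured up from the lower edge.
    static double[] cropAboveFooter(double[] mediaBox, double footerHeight) {
        double pageHeight = mediaBox[3] - mediaBox[1];
        if (footerHeight < 0 || footerHeight >= pageHeight) {
            throw new IllegalArgumentException("footerHeight must be in [0, page height)");
        }
        // Only the lower edge moves up, because y grows upwards in PDF user space.
        return new double[] {mediaBox[0], mediaBox[1] + footerHeight, mediaBox[2], mediaBox[3]};
    }

    public static void main(String[] args) {
        double[] mediaBox = {0, 0, 612, 792}; // assumed page size for illustration
        double[] cropBox = cropAboveFooter(mediaBox, 50);
        System.out.println(java.util.Arrays.toString(cropBox));
    }
}
```

Adding the footer height to the MediaBox's own lower edge, instead of hard-coding 50, keeps the result correct for pages whose MediaBox does not start at 0,0.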