PDFBox: Differentiating between transparent and non-transparent text - java

I have a task where I have to extract text that sits behind images and was OCR-ed from the images themselves. This text is transparent. The problem is that there is also an image with text behind it that is not OCR-ed: it is just normal, non-transparent text. How can I differentiate between the needed (transparent) and the not-needed (non-transparent) text?
Here is a representative pdf file: https://easyupload.io/rbo333
Image OCR text should be extracted on pages 2, 3 and 12, but text is also extracted on page 4. On page 4 there is no OCR text behind the images, but there is regular text under an image. I need to filter that out somehow, as I only need the OCR text.

So the images have transparent text in front of or behind them. I thought that meant the text has no color, but @mkl said that it might have colors while consisting of empty glyphs. The PDF specification also states that text can have a color even if it is transparent. To be truly transparent, the characters need to be rendered with neither the stroking nor the non-stroking color.
There is a RenderingMode enum in PDFBox (or FontBox) for exactly this purpose, and its NEITHER value denotes text that is neither filled nor stroked, i.e. invisible. I could extract it with the help of this answer.
The solution code looks like this.
@Override
protected void processTextPosition(TextPosition character) {
    // Remember the text rendering mode that is active while this character is drawn.
    characterRenderingModes.put(character, getGraphicsState().getTextState().getRenderingMode());
    super.processTextPosition(character);
}
This is an overridden method of the PDFTextStripper class. It is called for every character on the page(s) and records each character's RenderingMode. Later, when needed, I look up the RenderingMode in the map for the characters I have to examine.
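Once the map is filled, keeping only the OCR text comes down to filtering on the mode. The following is a JDK-only sketch of that filtering step, not PDFBox code: the Mode enum is a stand-in that mirrors the eight text rendering modes of the PDF specification (mode 3 paints neither fill nor stroke), and Glyph is a hypothetical substitute for TextPosition. In a real stripper you would presumably compare against RenderingMode.NEITHER instead.

```java
import java.util.Map;

final class InvisibleTextFilter {

    /** Stand-in for PDFBox's RenderingMode; the order follows PDF text rendering modes 0..7. */
    enum Mode {
        FILL, STROKE, FILL_STROKE, NEITHER,
        FILL_CLIP, STROKE_CLIP, FILL_STROKE_CLIP, NEITHER_CLIP
    }

    /** Hypothetical substitute for TextPosition: only the character's text. */
    static final class Glyph {
        final String text;

        Glyph(String text) {
            this.text = text;
        }
    }

    /**
     * Concatenates the text of all glyphs drawn with mode NEITHER,
     * in the iteration order of the given map.
     * NEITHER_CLIP is deliberately ignored here; whether it should count
     * as "transparent" depends on the documents at hand.
     */
    static String invisibleText(Map<Glyph, Mode> modes) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Glyph, Mode> entry : modes.entrySet()) {
            if (entry.getValue() == Mode.NEITHER) {
                sb.append(entry.getKey().text);
            }
        }
        return sb.toString();
    }
}
```

If the order of the characters matters, the map should preserve insertion order (for example a LinkedHashMap), since the characters arrive in drawing order.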

Related

PDFBox skipping text

I am using Apache pdfbox to read a pdf that was scanned. The order of text sometimes appears jumbled in some of the pdfs. For instance, in the image below, you can notice how a section is completely skipped while selecting text from Adobe Reader for a pdf. The same happens when the pdf is read programmatically using pdfbox. I understand that this is related to pdf structure. But, I was hoping to find answers to the following questions in SO:
Why exactly does this happen inside a pdf?
How do I detect this programmatically in Java? What would be potential approaches?
What is the fix for this problem? (Apart from setting readSorted to true in PDFStripper)
Part of the pdf file is here for download.
Why exactly does this happen inside a pdf?
The contents of a PDF page, which you see as a final, static image, are drawn by following a sequence of instructions in its content stream. These instructions mostly either set some property (color, font, ...) or actually draw something ("draw a line from A to B", "draw text string A starting at B", ...). The PDF standard does not require these instructions to be arranged in reading order, e.g. the string "Hello world" may be drawn by first drawing "world" and then drawing "Hello" before it.
The PDFBox text stripper by default extracts the text in the order it is drawn. E.g. assume on your page there are four text pieces A, B, C, and D visibly arranged in that order but drawn in the order A, C, D, and B. PDFBox by default will extract them in the latter order, A, C, D, and B. (If you ask it to sort, you'll get it in top-to-bottom, left-to-right order.)
Also Adobe Reader marks text in the order it is drawn. E.g. again assume on your page there are four text pieces A, B, C, and D visibly arranged in that order but drawn in the order A, C, D, and B. If you mark from A to C, B will not be marked, only A and C.
For example your PDF
Page 2 of your document is indeed drawn in a funny order:
The page content stream starts with an instruction to draw the form XObject named X0. You can consider such objects as something like macros, independent content streams that can be included in the drawing of other content streams. Thus, the content stream of that form XObject is drawn now:
The XObject X0 content stream starts with an instruction to draw an image XObject. Image XObjects contain bitmap graphics in a number of formats. The bitmap in question contains the scanned page except all letters, i.e. essentially dirt specks and a few lines:
Thereafter there are a lot of text drawing instructions drawing all paragraphs except paragraph 7 and paragraph 8.01:
Hereafter the content stream of the form XObject ends, so execution continues in the page content stream.
The page content stream continues with text drawing instructions drawing the two missing paragraphs:
This explains your observation:
The start and end of your marked text are drawn in the form XObject. Thus, only text in that form XObject is marked, not the two paragraphs drawn later in the page content stream.
By the way, if you wonder why the text does look like a bitmapped scanned image in spite of being drawn as text... The fonts used here have been constructed from the scanned page by cutting small pieces from it containing what the OCR mechanism considered a single glyph. This sometimes does not exactly correspond to individual characters, some glyphs in that font correspond to multiple characters:
As you see, for some characters there are multiple glyphs in the font (e.g. the lowercase 'o') and there are some glyphs containing multiple characters (e.g. the 'es' or 'mi').
What is the fix for this problem? (Apart from setting readSorted to true in PDFStripper)
Well, you have to decide what you want. Either you want the text in the order it is drawn or you want it in a sorted order.
If you want it in drawing order, there will every once in a while be documents with such erratic jumps in the order of text blocks.
If you want it in a different order, you'll have to sort. The sorting PDFBox offers is a simple top-to-bottom, left-to-right sorting. If you have a different sorting on your mind, you can retrieve the TextPosition objects from the text stripper in which it stores glyphs plus their positions, sizes and orientation on the page, and sort them yourself.
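To illustrate the difference between drawing order and a simple top-to-bottom, left-to-right order, here is a JDK-only sketch. Piece is a hypothetical stand-in for TextPosition, and the sketch assumes that y grows downward and that pieces on the same line share exactly the same y value; real data usually needs a tolerance or line bucketing first.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ReadingOrderSort {

    /** Hypothetical stand-in for TextPosition: text plus its position on the page. */
    static final class Piece {
        final String text;
        final float x;
        final float y; // assumed to grow downward

        Piece(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Returns a copy sorted top-to-bottom, then left-to-right; the input list is not modified. */
    static List<Piece> sorted(List<Piece> drawn) {
        List<Piece> copy = new ArrayList<>(drawn);
        copy.sort(Comparator.comparingDouble((Piece p) -> p.y)
                .thenComparingDouble(p -> p.x));
        return copy;
    }

    /** Joins the text of the pieces in list order. */
    static String joinText(List<Piece> pieces) {
        StringBuilder sb = new StringBuilder();
        for (Piece p : pieces) {
            sb.append(p.text);
        }
        return sb.toString();
    }
}
```

A different ordering (for example column by column) would only require a different Comparator.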
How do I detect this programmatically in Java? What would be potential approaches?
What exactly do you mean by "this"?
Do you mean that the text drawing instructions do not draw top-to-bottom, left-to-right?
For that you can simply override either of the PDFTextStripper methods processTextPosition(TextPosition) or writeString(String, List<TextPosition>) and analyze the positions in the TextPosition instances. If they suddenly jump upwards or (on the same line) left, you found such a situation.
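The position analysis itself could look roughly like the following JDK-only sketch. Pos is a hypothetical stand-in for the coordinates you would read from each TextPosition, y is assumed to grow downward, and lineTolerance is an assumed parameter that decides when two positions count as being on the same line.

```java
import java.util.ArrayList;
import java.util.List;

final class DrawingOrderJumps {

    /** Hypothetical stand-in for the coordinates of a TextPosition. */
    static final class Pos {
        final float x;
        final float y; // assumed to grow downward

        Pos(float x, float y) {
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Returns the indices (in drawing order) at which the position jumps
     * upward, or jumps left while staying on the same line.
     */
    static List<Integer> jumpIndices(List<Pos> drawn, float lineTolerance) {
        List<Integer> jumps = new ArrayList<>();
        for (int i = 1; i < drawn.size(); i++) {
            Pos prev = drawn.get(i - 1);
            Pos cur = drawn.get(i);
            boolean sameLine = Math.abs(cur.y - prev.y) <= lineTolerance;
            boolean jumpedUp = !sameLine && cur.y < prev.y;
            boolean jumpedLeft = sameLine && cur.x < prev.x;
            if (jumpedUp || jumpedLeft) {
                jumps.add(i);
            }
        }
        return jumps;
    }
}
```

An empty result means the pieces were drawn top-to-bottom, left-to-right; it says nothing about whether that is also the reading order.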
Or do you mean that the text drawing instructions do not draw in reading order?
This is very difficult: there are multiple situations in which the reading order does jump up again, e.g. in the case of multi-column text or inset text boxes. This is definitely beyond the scope of a Stack Overflow answer.

PDFBox 2.0: Get color information in TextStripper

I'm using PDFBox PDFTextStripper for text extraction. I also need to get color information for each character, ideally in writeString method.
What I found is this solution for PDFBox 1.8 (it can actually be converted to the 2.0 version easily). What else I'm looking for is the background color for each character (that answer only provides the character color).
I added handlers for all fill operators - CloseFillNonZeroAndStrokePath, CloseFillEvenOddAndStrokePath, FillNonZeroAndStrokePath, FillEvenOddAndStrokePath, LegacyFillNonZeroRule, FillNonZeroRule, FillEvenOddRule (as suggested in this topic) - and inside those operators I get the nonStrokingColor:
public final class FillEvenOddRule extends OperatorProcessor {
    @Override
    public void process(Operator operator, List<COSBase> operands) throws IOException {
        linePath.setWindingRule(GeneralPath.WIND_EVEN_ODD);
        deleteCharsInPath();
        linePath.reset();
        // Remember the fill color that was active for this fill operation.
        PDGraphicsState gs = getGraphicsState();
        PDColor nonStrokingColor = gs.getNonStrokingColor();
        fillColor = nonStrokingColor.toRGB();
    }

    @Override
    public String getName() {
        return "f*";
    }
}
Then in processTextPosition I tried to get this fillColor and put it into a map for each character, assuming the content stream works consecutively: after a fill operator completes, all characters subsequently arriving in processTextPosition should have this fillColor. However, this is not true and all characters get the wrong color. Here is the file I'm trying to process: every second row has a blue fill, and I would like to get that blue color for each character in such a row, and white for each character in a white row. Is it possible with PDFBox?
The problem in context with the sample document
Then in processTextPosition I tried to get this fillColor and put it into a map for each character, assuming the content stream works consecutively: after a fill operator completes, all characters subsequently arriving in processTextPosition should have this fillColor. However, this is not true and all characters get the wrong color.
As you found out, your assumption is wrong for the PDF at hand. The strategy in this document is to first draw all background material and then draw all text. Thus, your approach for this document should always return the color of the last bit of background material.
As mentioned in my comment to the second question here you referenced, you have to collect all rectangles (or more generically: paths) filled in parallel to the actual text extraction and check whether the font rendering color(s) (depending on the text rendering mode it may also be the StrokingColor!) of the currently inspected text coincide with that of the currently top filled path at the location of the text.
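The bookkeeping part of this idea can be sketched with JDK classes alone. In the sketch below, FilledPathLog is a hypothetical helper (not part of PDFBox) that records filled rectangles in paint order together with an RGB value, and returns the color of the topmost rectangle covering a given point. It assumes that all fills are plain rectangles with a solid color and that coordinates are already in one common space; the text color can then be compared with the returned value.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

final class FilledPathLog {

    private static final class Fill {
        final Rectangle2D area;
        final int rgb;

        Fill(Rectangle2D area, int rgb) {
            this.area = area;
            this.rgb = rgb;
        }
    }

    // Fills in the order they were painted; later entries lie on top of earlier ones.
    private final List<Fill> fills = new ArrayList<>();

    /** Call this whenever a fill operation paints a rectangle. */
    void recordFill(Rectangle2D area, int rgb) {
        fills.add(new Fill(area, rgb));
    }

    /**
     * Returns the RGB value of the most recently painted rectangle that
     * contains the point, or null if no recorded fill covers it.
     */
    Integer backgroundAt(double x, double y) {
        for (int i = fills.size() - 1; i >= 0; i--) {
            Fill fill = fills.get(i);
            if (fill.area.contains(x, y)) {
                return fill.rgb;
            }
        }
        return null;
    }
}
```

Because the lookup must happen at the moment a character is processed, only fills recorded before that character are considered, which matches what is visible underneath the text.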
In a comment you wonder
does this mean this approach will work for all documents?
Does this approach work for all documents
For many it does but not for all.
The following issues immediately come to mind:
Not all color spaces support the toRGB method you use. (I just checked, I'm positively surprised for how many PDFBox does have an implementation.)
In particular in case of pattern colors you have to do a lot of digging into the pattern and its usage in your case to find the actual background color(s).
There are other ways to paint a background form, too, in particular:
The approach only considers filled paths, but if you use a larger value for the graphics state line width or a stretching transformation matrix, a stroked line can also paint rectangular forms. Thus, for this case you also have to consider stroked paths.
The background might be a bitmap image. In this case you'll have to analyze the image to get the background color(s)
Another alternative to consider is a shading fill. This usually will also result in a range of colors in the background.
Forms drawn over the glyph afterwards instead of covering it may change foreground and background considerably. There e.g. are blend modes that take the hue from the backdrop and the saturation from the foreground...
Soft masks active when drawing background or foreground may also be of interest.
...

How to fill out horizontal PDF forms using the PDFBox

How does one fill out a horizontal pdf form with the PDFBox library?
I access my fields and fill them using the supplied example code and it works fine. But, if the pdf page is tilted horizontally the filled out text are still left in the vertical position.
I have tried rotating the page first and then filling the form but the fields seem to be independent. I have also tried formatting the field through the various set methods defined for PDField and PDTextbox but this has no effect either.
Finally, I know that some of the rotation properties are controlled through the PDAnnotation and PDAnnotationWidget but trying to set their PDAppearanceCharacteristics has no effect on the initial text rotation. Rather, a user is required to interact with the field in order for this to take effect.
Thanking in advance,
J3lly

Add text and an aligned image to a ColumnText Itext

I am writing text from right to left.
How can I add an image at the end of the text (aligned nicely)?
The question isn't entirely clear.
The order of objects added to a Document is always respected, except in the case of Image objects. If an Image object doesn't fit the page, it can be forwarded to the next page, and other content can be added first. If you want to avoid this, use writer.setStrictImageSequence(true);
However: you're writing from right to left (probably in Hebrew), so the above doesn't apply, and neither does the previous answer by Anshu. You can only use RTL in ColumnText and PdfPTable.
It's not clear what you want to do.
Do you want to add an Image at the bottom of the text? That's easy: just add the text first, then add the Image. Do you want to add an Image inline? In that case, you can wrap the Image in a Chunk as is done in this example: http://itextpdf.com/examples/iia.php?id=54
My interpretation is: you want to add the image at the bottom left, and you want the text to be added next to the image. That's more difficult to achieve. You'd need to add the Image and the text separately. Add the Image at an absolute position and add the text using 'irregular columns'. That is: ColumnText in text mode (as opposed to composite mode). For an example showing how to use irregular columns, see http://itextpdf.com/examples/iia.php?id=67

iText PDF colors are inconsistent in Acrobat

I'm generating a multipage PDF from Java using iText. Problem: the lines on my charts shift color between certain pages.
Here's a screenshot of the transition between pages:
This was taken from Adobe Reader. The lines are the correct color in OS X Preview.app.
In Reader the top is #73C352, the bottom is #35FF69. In Preview.app the line is #00FE7E.
Any thoughts on what could be causing this discrepancy? I saved the PDF from Preview.app and opened it in Adobe Reader, still has the colors off.
Here is the PDF that is having trouble. Open it in Adobe Reader and look at the transition between pages 11 & 12.
On checking this out further, it appears that the java.awt.print.PrinterJob is calling print() for each pageIndex twice. This might be a clue.
The problem with the pages with darker colors is that they include a pattern object with a transparent image. When transparency is involved, Adobe Acrobat switches automatically to a custom CMYK profile and this causes the darker colors. Only Acrobat does this, other viewers behave just fine. The solution is either to remove the pattern object with the transparent image (it seems to be a drawing artifact of the PDF generator engine, it is not used anywhere on the page) or you can make the page part of a transparency group and specify the transparency group to use RGB colorspace.
Several different possibilities, yes.
Different color matching. If you're using a "calibrated" color space on one page and a "device" color space on another, the same RGB/CMYK values can produce visually different values.
If the graph is inside a Form XObject, the same graph can appear differently depending on the current graphic state when the form is drawn.
If you could post a link to your PDF, I could probably give you a specific answer.
Ouch. That PDF is painful to schlep through. I'd like to have some words with whoever wrote their PDF converter. Harsh ones. Lots of unnecessary clipping ("text" is being clipped hither and yon, page 7 for example), poor use of patterns for images, but not using patterns when it would actually help, drawing text as paths, and on and on...
EDIT: Which is precisely the sort of stuff you see when rendering Java UI via a PdfGraphics2D object. You CAN keep the text as text though. It's just a matter of how you create the PdfGraphics2D instance.
Okay, so the color of the line itself is identical. 0 1 0.4 RG. HOWEVER, there is some "transparency stuff" going on.
On pages that have images with soft masks or extended graphic states that change the transparency, the green line appears darker. On pages without, it appears brighter.
I suspect that all those other PDF viewers that draw the lines consistently don't support transparency at all, or only poorly.
