PDFBox skipping text - java

I am using Apache pdfbox to read a pdf that was scanned. The order of text sometimes appears jumbled in some of the pdfs. For instance, in the image below, you can notice how a section is completely skipped while selecting text from Adobe Reader for a pdf. The same happens when the pdf is read programmatically using pdfbox. I understand that this is related to pdf structure. But, I was hoping to find answers to the following questions in SO:
Why exactly does this happen inside a pdf?
How do I detect this programmatically in Java? What would be potential approaches?
What is the fix for this problem? (Apart from setting readSorted to true in PDFStripper)
Part of the pdf file is here for download.

Why exactly does this happen inside a pdf?
The contents of a PDF page, which you see as a final, static image, are drawn by following a sequence of instructions in its content stream. These instructions mostly either set some property (color, font, ...) or actually draw something ("draw a line from A to B", "draw text string A starting at B", ...). The PDF standard does not require these instructions to be arranged in reading order, e.g. the string "Hello world" may be drawn by first drawing "world" and then drawing "Hello" before it.
The PDFBox text stripper by default extracts the text in the order it is drawn. E.g. assume on your page there are four text pieces A, B, C, and D visibly arranged in that order but drawn in the order A, C, D, and B. PDFBox by default will extract them in the latter order, A, C, D, and B. (If you ask it to sort, you'll get it in top-to-bottom, left-to-right order.)
Adobe Reader also marks text in the order it is drawn. E.g. again assume on your page there are four text pieces A, B, C, and D visibly arranged in that order but drawn in the order A, C, D, and B. If you mark from A to C, B will not be marked, only A and C.
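To make the A, C, D, B example concrete, here is a minimal sketch in plain Java. The Piece class and its coordinates are hypothetical stand-ins for illustration, not PDFBox types; y is assumed to grow downwards from the top of the page.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for a positioned text piece; not a PDFBox type. */
final class Piece {
    final String text;
    final double x; // distance from the left page edge
    final double y; // distance from the top page edge (assumed to grow downwards)

    Piece(String text, double x, double y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

final class ReadingOrderDemo {
    /** Concatenates the piece texts in the order of the given list. */
    static String join(List<Piece> pieces) {
        StringBuilder sb = new StringBuilder();
        for (Piece p : pieces) {
            sb.append(p.text);
        }
        return sb.toString();
    }

    /** Returns a copy sorted top-to-bottom, then left-to-right. */
    static List<Piece> sortedByPosition(List<Piece> drawn) {
        List<Piece> copy = new ArrayList<>(drawn);
        copy.sort(Comparator.comparingDouble((Piece p) -> p.y)
                .thenComparingDouble(p -> p.x));
        return copy;
    }

    /** Four pieces visibly arranged A, B, C, D but drawn in the order A, C, D, B. */
    static List<Piece> sample() {
        List<Piece> drawn = new ArrayList<>();
        drawn.add(new Piece("A", 50, 100));
        drawn.add(new Piece("C", 50, 300));
        drawn.add(new Piece("D", 50, 400));
        drawn.add(new Piece("B", 50, 200));
        return drawn;
    }
}
```

Joining sample() as is gives the drawing order, while joining sortedByPosition(sample()) gives the visual order.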
For example your PDF
Page 2 of your document is indeed drawn in a funny order:
The page content stream starts with an instruction to draw the form XObject named X0. You can think of such objects as something like macros: independent content streams that can be included in the drawing of other content streams. Thus, the content stream of that form XObject is drawn now:
The XObject X0 content stream starts with an instruction to draw an image XObject. Image XObjects contain bitmap graphics in a number of formats. The bitmap in question contains the scanned page except all letters, i.e. essentially dirt specks and a few lines:
Thereafter there are a lot of text drawing instructions drawing all paragraphs except paragraph 7 and paragraph 8.01:
Hereafter the content stream of the form XObject ends, so execution continues in the page content stream.
The page content stream continues with text drawing instructions drawing the two missing paragraphs:
This explains your observation:
The start and end of your marked text are drawn in the form XObject. Thus, only text in that form XObject is marked, not the two paragraphs drawn later in the page content stream.
By the way, if you wonder why the text looks like a bitmapped scanned image in spite of being drawn as text: the fonts used here have been constructed from the scanned page by cutting small pieces from it, each containing what the OCR mechanism considered a single glyph. This sometimes does not exactly correspond to individual characters; some glyphs in that font correspond to multiple characters:
As you see, for some characters there are multiple glyphs in the font (e.g. the lowercase 'o') and there are some glyphs containing multiple characters (e.g. the 'es' or 'mi').
What is the fix for this problem? (Apart from setting readSorted to true in PDFStripper)
Well, you have to decide what you want. Either you want the text in the order it is drawn or you want it in a sorted order.
If you want it in drawing order, there will every once in a while be documents with such erratic jumps in the order of text blocks.
If you want it in a different order, you'll have to sort. The sorting PDFBox offers is a simple top-to-bottom, left-to-right sorting. If you have a different sorting on your mind, you can retrieve the TextPosition objects from the text stripper in which it stores glyphs plus their positions, sizes and orientation on the page, and sort them yourself.
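As an illustration of "a different sorting", here is a sketch of a column-aware ordering in plain Java. The Run class, the fixed columnWidth parameter and the assumption that columns have equal width are all hypothetical; with PDFBox you would feed in the positions taken from the TextPosition objects instead.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ColumnSort {
    /** Hypothetical text run: text plus position (y assumed to grow downwards). */
    static final class Run {
        final String text;
        final double x;
        final double y;

        Run(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Orders runs column by column: column index first, then top-to-bottom,
     * then left-to-right. Assumes equally wide columns of the given width.
     */
    static List<String> sortByColumns(List<Run> runs, double columnWidth) {
        List<Run> copy = new ArrayList<>(runs);
        copy.sort(Comparator
                .comparingInt((Run r) -> (int) Math.floor(r.x / columnWidth))
                .thenComparingDouble(r -> r.y)
                .thenComparingDouble(r -> r.x));
        List<String> texts = new ArrayList<>();
        for (Run r : copy) {
            texts.add(r.text);
        }
        return texts;
    }
}
```

A plain top-to-bottom, left-to-right sort would interleave the two columns line by line; this comparator finishes the left column before starting the right one.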
How do I detect this programmatically in Java? What would be potential approaches?
What exactly do you mean by "this"?
Do you mean that the text drawing instructions do not draw top-to-bottom, left-to-right?
For that you can simply override either of the PDFTextStripper methods processTextPosition(TextPosition) or writeString(String, List<TextPosition>) and analyze the positions in the TextPosition instances. If they suddenly jump upwards or (on the same line) left, you found such a situation.
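The check itself can be sketched independently of PDFBox. In the following minimal example each position is a hypothetical {x, y} pair collected in drawing order, y is assumed to grow downwards, and lineTolerance is an assumed threshold for treating two positions as being on the same line.

```java
import java.util.List;

final class JumpDetector {
    /**
     * Counts positions that lie above the previous one, or to the left of it
     * on the same line. Each entry is {x, y} with y growing downwards.
     */
    static int countBackwardJumps(List<double[]> positions, double lineTolerance) {
        int jumps = 0;
        for (int i = 1; i < positions.size(); i++) {
            double[] prev = positions.get(i - 1);
            double[] cur = positions.get(i);
            boolean sameLine = Math.abs(cur[1] - prev[1]) <= lineTolerance;
            boolean jumpedUp = !sameLine && cur[1] < prev[1];
            boolean jumpedLeft = sameLine && cur[0] < prev[0];
            if (jumpedUp || jumpedLeft) {
                jumps++;
            }
        }
        return jumps;
    }
}
```

A result greater than zero means the drawing order is not strictly top-to-bottom, left-to-right; it says nothing yet about whether the reading order is violated.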
Or do you mean that the text drawing instructions do not draw in reading order?
This is very difficult: there are multiple situations in which the reading order does jump up again, e.g. in case of multi-column text or inset text boxes. This definitely is beyond the scope of a Stack Overflow answer.

Related

PDFBox: Differentiating between transparent and non-transparent text

I have a task where I have to extract text that sits behind images and has been OCR-ed from the image itself. This text is transparent. The problem is that there is an image with text behind it that is not OCR-ed; it is just normal text and it is not transparent. How can I differentiate between the needed (transparent) and the not-needed (non-transparent) text?
Here is a representative pdf file: https://easyupload.io/rbo333
Image OCR text should be extracted on page 2,3,12 but text is also extracted on page 4. On page 4 there is no OCR text behind images, but there is regular text under the image. I need to somehow filter that out as I only need OCR text.
So the images have transparent text in front of them or behind them. I thought that meant that the text has no color, but @mkl said that the characters might have colors but be empty glyphs. The PDF specification also states that they can have a color even if they are transparent. To be truly transparent, the characters need to be rendered with neither the stroking nor the non-stroking color.
There is a RenderingMode enum in PDFBox for exactly this purpose, and its NEITHER value denotes that something is transparent. I could extract it with the help of this answer.
The solution code looks like this.
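// characterRenderingModes is a field of the PDFTextStripper subclass,
// e.g. a Map<TextPosition, RenderingMode>
@Override
protected void processTextPosition(TextPosition character) {
    characterRenderingModes.put(character, getGraphicsState().getTextState().getRenderingMode());
    super.processTextPosition(character);
}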
This is an overridden method of the PDFTextStripper class. It goes through every character on the page(s) and records its RenderingMode. Afterwards, when needed, I look up the RenderingMode in the map for the characters I have to examine.
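For reference, the eight text rendering modes of the PDF specification (the operand of the Tr operator) can be classified as in the sketch below. This is a plain-Java stand-in working on the raw integer, not the PDFBox RenderingMode enum; mode 3 is the one that corresponds to NEITHER, and mode 7 only adds the text to the clipping path, so it paints nothing either.

```java
final class TextRenderingModes {
    /** True if the mode (0..7) fills the glyph outlines with the non-stroking color. */
    static boolean fills(int mode) {
        check(mode);
        return mode == 0 || mode == 2 || mode == 4 || mode == 6;
    }

    /** True if the mode (0..7) strokes the glyph outlines with the stroking color. */
    static boolean strokes(int mode) {
        check(mode);
        return mode == 1 || mode == 2 || mode == 5 || mode == 6;
    }

    /** True if the mode paints nothing: 3 (neither fill nor stroke) or 7 (clip only). */
    static boolean isInvisible(int mode) {
        return !fills(mode) && !strokes(mode);
    }

    private static void check(int mode) {
        if (mode < 0 || mode > 7) {
            throw new IllegalArgumentException("text rendering mode must be 0..7: " + mode);
        }
    }
}
```

Whether clip-only text should count as "transparent" for the filtering task above is a judgment call; the solution above only looks for NEITHER.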

PDFBox 2.0: Get color information in TextStripper

I'm using PDFBox PDFTextStripper for text extraction. I also need to get color information for each character, ideally in writeString method.
What I found is this solution for PDFBox 1.8 (which can easily be converted to the 2.0 version). What else I'm looking for is the background color of each character (that answer only covers the character color).
I added handlers for all the fill operators: CloseFillNonZeroAndStrokePath, CloseFillEvenOddAndStrokePath, FillNonZeroAndStrokePath, FillEvenOddAndStrokePath, LegacyFillNonZeroRule, FillNonZeroRule, FillEvenOddRule (as suggested in this topic), and inside those operators I get the nonStrokingColor:
public final class FillEvenOddRule extends OperatorProcessor {
    @Override
    public void process(Operator operator, List<COSBase> operands) throws IOException {
        linePath.setWindingRule(GeneralPath.WIND_EVEN_ODD);
        deleteCharsInPath();
        linePath.reset();
        // remember the fill color that was active when this path was filled
        PDGraphicsState gs = getGraphicsState();
        PDColor nonStrokingColor = gs.getNonStrokingColor();
        fillColor = nonStrokingColor.toRGB();
    }

    @Override
    public String getName() {
        return "f*";
    }
}
Then in processTextPosition I tried to get this fillColor and put it into a map for each character (assuming the content stream works consecutively: after a fill operator completes, all characters subsequently arriving in processTextPosition should have this fillColor). However, this is not true and all characters get the wrong color. Here is the file I'm trying to process: every second row has a blue fill, and I would like to get that blue color for each character in such a row, and white for each character in a white row. Is it possible with PDFBox?
The problem in context with the sample document
Then in processTextPosition I tried to get this fillColor and put it into a map for each character (assuming the content stream works consecutively: after a fill operator completes, all characters subsequently arriving in processTextPosition should have this fillColor). However, this is not true and all characters get the wrong color.
As you found out, your assumption is wrong for the PDF at hand. The strategy in this document is to first draw all background material and then draw all text. Thus, your approach for this document should always return the color of the last bit of background material.
As mentioned in my comment on the second question you referenced, you have to collect all filled rectangles (or more generically: paths) in parallel to the actual text extraction, and check whether the color(s) the currently inspected text is rendered with (depending on the text rendering mode this may also be the stroking color!) coincide with the color of the filled path that is currently topmost at the location of the text.
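The bookkeeping part of this can be sketched in plain Java as follows. The FilledAreas class is hypothetical and only handles axis-aligned rectangles with a single RGB fill color; the fill operator handlers would call addFill, and processTextPosition would call backgroundAt with the position of the character.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

final class FilledAreas {
    private final List<Rectangle2D> areas = new ArrayList<>();
    private final List<Integer> colors = new ArrayList<>();

    /** Records a filled rectangle in paint order, with its fill color as 0xRRGGBB. */
    void addFill(Rectangle2D area, int rgb) {
        areas.add(area);
        colors.add(rgb);
    }

    /**
     * Returns the color of the most recently painted rectangle containing (x, y),
     * or fallbackRgb if nothing has been painted there so far.
     */
    int backgroundAt(double x, double y, int fallbackRgb) {
        // walk backwards: later fills are painted on top of earlier ones
        for (int i = areas.size() - 1; i >= 0; i--) {
            if (areas.get(i).contains(x, y)) {
                return colors.get(i);
            }
        }
        return fallbackRgb;
    }
}
```

Because the lookup happens at the moment the text is processed, only fills painted before the text are considered, which matches a document that first draws all background material and then all text.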
In a comment you wonder
does this mean this approach will work for all documents?
Does this approach work for all documents
For many it does but not for all.
The following issues immediately come to mind:
Not all color spaces support the toRGB method you use. (I just checked, and I'm pleasantly surprised by how many of them PDFBox does have an implementation for.)
In particular in case of pattern colors you have to do a lot of digging into the pattern and its usage in your case to find the actual background color(s).
There are other ways to paint a background form, too, in particular:
The approach only considers filled paths, but if you use a larger value for the graphics state line width or a stretching transformation matrix, a stroked line can also paint rectangular forms. Thus, for this case you also have to consider stroked paths.
The background might be a bitmap image. In this case you'll have to analyze the image to get the background color(s)
Another alternative to consider is a shading fill. This usually will also result in a range of colors in the background.
Forms drawn over the glyph afterwards instead of covering it may change foreground and background considerably. There e.g. are blend modes that take the hue from the backdrop and the saturation from the foreground...
Soft masks active when drawing background or foreground may also be of interest.
...

How to count color pages in a PDF/Word doc using Java

I am looking to develop a desktop application using Java to count the number of colored pages in a PDF or Word file. This will be used as part of an overall system to help calculate the cost of printing a document in terms of how many pages there are (color/B&W).
Ideally, the user of the application would use a file dialog to select the desired PDF/Word file, the application could then count and output the number of colored pages, allowing the system to automatically calculate document cost accordingly.
i.e
if A4 colored pages cost 50c per page to print,
and B&W cost 10c per page,
calculate the total cost of the document per colored/B&W pages.
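Once the page counts are known, the pricing step itself is simple arithmetic. A minimal sketch, with the PrintCost class and its parameter names being hypothetical and prices given in cents:

```java
final class PrintCost {
    /** Returns the total cost in cents for the given page counts and per-page prices. */
    static long totalCents(int colorPages, int bwPages, long colorCents, long bwCents) {
        if (colorPages < 0 || bwPages < 0) {
            throw new IllegalArgumentException("page counts must not be negative");
        }
        return colorPages * colorCents + bwPages * bwCents;
    }
}
```

For a 100 page document with 12 colored pages at the prices above, totalCents(12, 88, 50, 10) yields 12 * 50 + 88 * 10 = 1480 cents.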
I am aware of the existing software Rapid PDF Count http://www.traction-software.co.uk/rapidpdfcount/, but it would be unsuitable for integration into a new system. I have also tried using GhostScript/Python as per this solution: http://root42.blogspot.de/2012/10/counting-color-pages-in-pdf-files.html, however this takes too long (5 minutes to count a 100 page PDF) and would be difficult to build into a desktop app.
Is there any method of counting the number of colored pages in a PDF or Word file using Java (or an alternative language)?
Thanks
Although it might sound easy, the task is rather complicated.
One option would be to use a program such as iText to walk every single token in the PDF, look for tokens that support color and compare that to your definition of "black". However, this will only get you basic text and drawing commands. Images are a completely different beast so you'll probably need to find an image parser or grab a copy of each spec and then walk each of those.
One of the downsides of token walking is you need to properly handle tokens that reference other things and further walk those tokens.
Another downside is that things can overlap each other, so you'd probably want to be aware of their coordinates, z-index, transparency and such.
There will be many more bumps in the road but that's a good start. What's most interesting is that if you accomplish this, you'll actually have found that you've partially built a PDF renderer!
Next, you'll need to define "black". Off the top of my head there's RGB black, CMYK black, Grey black and maybe Lab black along with some Pantones. That shouldn't be too hard, but if I were to build this I'd want to know "black ink usage", which could also be shades of grey. There's also "rich black" that you might need to deal with, too!
So, all that said, I think that the GhostScript option you found is really the best bet. It literally renders the PDF and calculates the ink coverage from an RGB standpoint. You still should handle greys, too, but that shouldn't be too hard, here's a good starting point.
Wanting to know what the click-charge is going to be is a pretty common problem, but it's not easy to solve at all, as the answer Chris Haas gave already indicates. I want to put another spin on it, though.
First of all, you have to wonder whether you really want to support both Word and PDF documents. Analysing Word files is less useful than you might think because that Word file is probably going to be converted into something else before it's going to be printed. And because of the fact that you're starting from Word, the chance that your nice RGB black text in Word gets converted to less-than-perfect 4 color black in PDF is very high. In other words, even though you might count a page of black text in Word as a 'cheap' page, it might turn into an expensive color page after conversion from Word to something that can be printed.
Let's consider the PDF case then. PDF supports a whole host of color spaces (gray, RGB, CMYK, the same with an ICC profile attached, spot color and a few multi-spot color variants, CalGray, CalRGB and Lab). Besides that, there is a whole range of very tricky features such as transparency, overprint, shades, images, masks... all of which you have to take into account. The only truly good way to calculate what you need is to do essentially the same work as your printer will do: convert the PDF into one image per page and examine the pixels.
Because of what you want to do, the best way to progress would be to:
1) Convert any Word files into PDF
2) Convert any PDF files into CMYK
3) Render each page of that CMYK file into an image.
Once you've done that you can examine the image and see whether you have any colors left. There are a number of potential technologies you can use for this. GhostScript is definitely one, but there are commercial solutions too that would certainly be more expensive but potentially faster.
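The "examine the image" step can be sketched with the standard Java imaging classes. Note the swapped-in simplification: this example assumes the page has already been rendered into an RGB BufferedImage by some renderer and treats a pixel as colored if its red, green and blue channels differ by more than a tolerance; it does not inspect CMYK channels. The ColorPageCheck class and the tolerance parameter are hypothetical.

```java
import java.awt.image.BufferedImage;

final class ColorPageCheck {
    /**
     * Returns true if any pixel's red, green and blue channels differ from
     * each other by more than the given tolerance (0..255).
     */
    static boolean hasColor(BufferedImage page, int tolerance) {
        for (int y = 0; y < page.getHeight(); y++) {
            for (int x = 0; x < page.getWidth(); x++) {
                int rgb = page.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                int max = Math.max(r, Math.max(g, b));
                int min = Math.min(r, Math.min(g, b));
                if (max - min > tolerance) {
                    return true; // found a pixel that is not a shade of grey
                }
            }
        }
        return false;
    }
}
```

A small non-zero tolerance avoids counting slightly tinted greys, for example from scanned pages, as color; the right value depends on the renderer and the print pricing rules.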

Building and printing complex table layouts with Java

A customer asked me to write a piece of software, and one of its requirements is to build a form and fill it with data collected from a database.
This form is currently being created in Excel. It uses cells to build the form: some cells have a blank background, others a blank background with a black bottom border (to look like a line where text is typed), others have a gray background with white text, and there's also a logo image. In Excel, some cells are merged to become bigger than other cells. They fill in the text in another spreadsheet, and the required cells in the form take that text and format it.
I've looked at many report frameworks in Java; some are very complex and some look like Excel's graph builders, but I saw none that can make a complex 2D form like this.
Data filled in it is simple, like name, quantity, some numbers, but they have different length requiring for example that name's cell to be merged to cover a full horizontal line, and some have smaller font size. There's no repeated data that would require sorting and I have no problem gathering the data.
In the end, the filled form must also be printed, so I can't use normal Swing table or grid. It will be used in Windows now, but it'd be nice to support Linux printing too.
Any suggestion of a Java component that builds a 2D layout like this and fills it with strings will be very much appreciated. I even thought of taking a screenshot of their current form and just use 2D Graphics to print the text, but I'd not be able to print it.
This is an example of the kind of form I must build, it's somewhat like that but some areas have gray background with white text:
No, it's not a duplicate, but it is a good example of the layout.

iText PDF colors are inconsistent in Acrobat

I'm generating a multipage PDF from Java using iText. Problem: the lines on my charts shift color between certain pages.
Here's a screenshot of the transition between pages:
This was taken from Adobe Reader. The lines are the correct color in OS X Preview.app.
In Reader the top is #73C352, the bottom is #35FF69. In Preview.app the line is #00FE7E.
Any thoughts on what could be causing this discrepancy? I saved the PDF from Preview.app and opened it in Adobe Reader, still has the colors off.
Here is the PDF that is having trouble. Open it in Adobe Reader and look at the transition between pages 11 & 12.
On checking this out further, it appears that the java.awt.print.PrinterJob is calling print() for each pageIndex twice. This might be a clue.
The problem with the pages with darker colors is that they include a pattern object with a transparent image. When transparency is involved, Adobe Acrobat switches automatically to a custom CMYK profile and this causes the darker colors. Only Acrobat does this, other viewers behave just fine. The solution is either to remove the pattern object with the transparent image (it seems to be a drawing artifact of the PDF generator engine, it is not used anywhere on the page) or you can make the page part of a transparency group and specify the transparency group to use RGB colorspace.
Several different possibilities, yes.
Different color matching. If you're using a "calibrated" color space on one page and a "device" color space on another, the same RGB/CMYK values can produce visually different colors.
If the graph is inside a Form XObject, the same graph can appear differently depending on the current graphic state when the form is drawn.
If you could post a link to your PDF, I could probably give you a specific answer.
Ouch. That PDF is painful to schlep through. I'd like to have some words with whoever wrote their PDF converter. Harsh ones. Lots of unnecessary clipping ("text" is being clipped hither and yon, page 7 for example), poor use of patterns for images, but not using patterns when it would actually help, drawing text as paths, and on and on...
EDIT: Which is precisely the sort of stuff you see when rendering Java UI via a PdfGraphics2D object. You CAN keep the text as text though. It's just a matter of how you create the PdfGraphics2D instance.
Okay, so the color of the line itself is identical. 0 1 0.4 RG. HOWEVER, there is some "transparency stuff" going on.
On pages that have images with soft masks or extended graphic states that change the transparency, the green line appears darker. On pages without, it appears brighter.
I suspect that all those other PDF viewers that draw the lines consistently don't support transparency at all, or only poorly.
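For comparison with the hex values quoted in the question, the operands of 0 1 0.4 RG can be converted naively to 8-bit RGB as in the sketch below. The PdfRgb class is hypothetical, and the conversion deliberately ignores color management and transparency, which is exactly what a viewer may apply on top.

```java
final class PdfRgb {
    /** Converts RGB operands in the range 0..1 to a #RRGGBB string, without color management. */
    static String toHex(double r, double g, double b) {
        return String.format("#%02X%02X%02X", to8Bit(r), to8Bit(g), to8Bit(b));
    }

    /** Clamps a component to 0..1 and scales it to 0..255. */
    private static int to8Bit(double component) {
        double clamped = Math.max(0.0, Math.min(1.0, component));
        return (int) Math.round(clamped * 255.0);
    }
}
```

PdfRgb.toHex(0, 1, 0.4) gives #00FF66, which matches none of the three colors measured in the screenshots exactly.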
