iText PDF colors are inconsistent in Acrobat

I'm generating a multipage PDF from Java using iText. Problem: the lines on my charts shift color between certain pages.
Here's a screenshot of the transition between pages:
This was taken from Adobe Reader. The lines are the correct color in OS X Preview.app.
In Reader the top is #73C352, the bottom is #35FF69. In Preview.app the line is #00FE7E.
Any thoughts on what could be causing this discrepancy? I saved the PDF from Preview.app and opened it in Adobe Reader, and the colors are still off.
Here is the PDF that is having trouble. Open it in Adobe Reader and look at the transition between pages 11 & 12.
On checking this out further, it appears that the java.awt.print.PrinterJob is calling print() for each pageIndex twice. This might be a clue.

The pages with darker colors include a pattern object with a transparent image. When transparency is involved, Adobe Acrobat automatically switches to a custom CMYK profile, and this causes the darker colors. Only Acrobat does this; other viewers behave just fine. There are two solutions: remove the pattern object with the transparent image (it seems to be a drawing artifact of the PDF generator engine and is not used anywhere on the page), or make the page part of a transparency group and specify that the transparency group uses an RGB color space.

Several different possibilities, yes.
Different color matching. If you're using a "calibrated" color space on one page and a "device" color space on another, the same RGB/CMYK values can produce visually different values.
If the graph is inside a Form XObject, the same graph can appear differently depending on the current graphic state when the form is drawn.
If you could post a link to your PDF, I could probably give you a specific answer.
Ouch. That PDF is painful to schlep through. I'd like to have some words with whoever wrote their PDF converter. Harsh ones. Lots of unnecessary clipping ("text" is being clipped hither and yon, page 7 for example), poor use of patterns for images, but not using patterns when it would actually help, drawing text as paths, and on and on...
EDIT: Which is precisely the sort of stuff you see when rendering Java UI via a PdfGraphics2D object. You CAN keep the text as text though. It's just a matter of how you create the PdfGraphics2D instance.
Okay, so the color of the line itself is identical. 0 1 0.4 RG. HOWEVER, there is some "transparency stuff" going on.
On pages that have images with soft masks or extended graphic states that change the transparency, the green line appears darker. On pages without, it appears brighter.
I suspect that all those other PDF viewers that draw the lines consistently don't support transparency at all, or only poorly.
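For reference, here is a minimal sketch of what the stroke color quoted above (0 1 0.4 RG) would be as an 8-bit hex value, assuming a naive mapping with no color management at all. The class and method names are illustrative only. Neither viewer's measured value matches this nominal number exactly, which is consistent with the viewers applying their own color conversion.

```java
final class PdfRgbToHex {
    /** Maps one PDF color component in [0, 1] to an 8-bit channel value. */
    static int toByte(double component) {
        double clamped = Math.max(0.0, Math.min(1.0, component));
        return (int) Math.round(clamped * 255.0);
    }

    /** Formats three PDF RG operands as #RRGGBB, ignoring any color profile. */
    static String toHex(double r, double g, double b) {
        return String.format("#%02X%02X%02X", toByte(r), toByte(g), toByte(b));
    }

    public static void main(String[] args) {
        // Operands taken from the content stream quoted in the answer.
        System.out.println(toHex(0, 1, 0.4));
    }
}
```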

Related

Determining text areas using opencv in crowded images

I am attempting (and failing at) locating areas containing text in a larger image. Specifically, I am looking to recognize titles of Magic cards. At the moment I have managed to cut the images down to blocks containing the title, such as
input image.
Despite this, and even with training the OCR library to work only with this font, accuracy is still low. As far as I can tell, the best thing I can do is crop the image to only the text. After research I have still been unable to do so. I attempted to implement the solutions presented in Extracting text OpenCV, but the text is too close to the border for this to work. attempt image. If possible, help in the form of Java would be greatly appreciated. (Sorry for the image links; I don't have the reputation to embed images.)

Posting answer as suggested.
This answer relies on the text always being close to the same distance/offset away from the border.
Find the bounds of the border using Canny/Hough etc., with whatever filtering techniques work best with your images (erosion, dilation, sharpening, grayscale, binary thresholding, etc.).
Then take a smaller interior submat() of this border bounding Rect to get an approximation of where the text should be and run the ocr on this submat.
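The cropping step can be sketched in plain Java as follows. This is a stand-in, not the OpenCV code itself: it uses `BufferedImage.getSubimage` where the answer uses `submat()`, and it assumes the border rectangle has already been found by some detection step. The `margin` parameter and the class name are hypothetical.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

final class TitleCrop {
    /** Shrinks the detected border rectangle by a fixed margin on every side. */
    static Rectangle inset(Rectangle border, int margin) {
        int w = border.width - 2 * margin;
        int h = border.height - 2 * margin;
        if (margin < 0 || w <= 0 || h <= 0) {
            throw new IllegalArgumentException("margin does not fit inside border");
        }
        return new Rectangle(border.x + margin, border.y + margin, w, h);
    }

    /** Returns the interior region that would be handed to the OCR engine. */
    static BufferedImage cropInterior(BufferedImage card, Rectangle border, int margin) {
        Rectangle r = inset(border, margin);
        return card.getSubimage(r.x, r.y, r.width, r.height);
    }

    public static void main(String[] args) {
        BufferedImage card = new BufferedImage(100, 40, BufferedImage.TYPE_INT_RGB);
        // Assumed border position; in practice this comes from edge detection.
        BufferedImage text = cropInterior(card, new Rectangle(10, 5, 80, 30), 4);
        System.out.println(text.getWidth() + "x" + text.getHeight());
    }
}
```

The margin would need tuning per card layout, since the approach relies on the text sitting at a roughly constant offset from the border.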

How to count color pages in a PDF/Word doc using Java

I am looking to develop a desktop application using Java to count the number of colored pages in a PDF or Word file. This will be used as part of an overall system to help calculate the cost of printing a document in terms of how many pages there are (color/B&W).
Ideally, the user of the application would use a file dialog to select the desired PDF/Word file; the application could then count and output the number of colored pages, allowing the system to automatically calculate document cost accordingly.
i.e
if A4 colored pages cost 50c per page to print,
and B&W cost 10c per page,
calculate the total cost of the document per colored/B&W pages.
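As a small sketch of that calculation, assuming the color and B&W page counts are already known (the class and method names are made up for illustration):

```java
final class PrintCost {
    /** Total cost in cents for the given page counts and per-page prices. */
    static long totalCents(int colorPages, int bwPages, int colorCents, int bwCents) {
        if (colorPages < 0 || bwPages < 0) {
            throw new IllegalArgumentException("page counts must be non-negative");
        }
        return (long) colorPages * colorCents + (long) bwPages * bwCents;
    }

    public static void main(String[] args) {
        // 12 color pages at 50c plus 88 B&W pages at 10c.
        System.out.println(totalCents(12, 88, 50, 10) + " cents");
    }
}
```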
I am aware of the existing software Rapid PDF Count http://www.traction-software.co.uk/rapidpdfcount/, but it would be unsuitable for integration into a new system. I have also tried using GhostScript/Python as per this solution: http://root42.blogspot.de/2012/10/counting-color-pages-in-pdf-files.html, however this takes too long (5 minutes to count a 100-page PDF) and would be difficult to build into a desktop app.
Is there any method of counting the number of colored pages in a PDF or Word file using Java (or an alternative language)?
Thanks

Although it might sound easy, the task is rather complicated.
One option would be to use a program such as iText to walk every single token in the PDF, look for tokens that support color and compare that to your definition of "black". However, this will only get you basic text and drawing commands. Images are a completely different beast so you'll probably need to find an image parser or grab a copy of each spec and then walk each of those.
One of the downsides of token walking is you need to properly handle tokens that reference other things and further walk those tokens.
Another downside is that things can overlap each other, so you'd probably want to be aware of their coordinates, z-index, transparency and such.
There will be many more bumps in the road but that's a good start. What's most interesting is that if you accomplish this, you'll actually have found that you've partially built a PDF renderer!
Next, you'll need to define "black". Off the top of my head there's RGB black, CMYK black, Grey black and maybe Lab black, along with some Pantones. That shouldn't be too hard, but if I were to build this I'd want to know "black ink usage", which could also be shades of grey. There's also "rich black" that you might need to deal with, too!
So, all that said, I think that the GhostScript option you found is really the best bet. It literally renders the PDF and calculates the ink coverage from an RGB standpoint. You should still handle greys, too, but that shouldn't be too hard; here's a good starting point.

Wanting to know what the click-charge is going to be is a pretty common problem, but it's not easy to solve at all, as the answer Chris Haas gave already indicates. I want to put another spin on it.
First of all, you have to wonder whether you really want to support both Word and PDF documents. Analysing Word files is less useful than you might think because that Word file is probably going to be converted into something else before it's going to be printed. And because of the fact that you're starting from Word, the chance that your nice RGB black text in Word gets converted to less-than-perfect 4 color black in PDF is very high. In other words, even though you might count a page of black text in Word as a 'cheap' page, it might turn into an expensive color page after conversion from Word to something that can be printed.
Let's consider the PDF case then. PDF supports a whole host of color spaces (gray, RGB, CMYK, the same with an ICC profile attached, spot color and a few multi-spot color variants, CalGray, CalRGB and Lab). Besides that, there is a whole range of very tricky features such as transparency, overprint, shades, images, masks... all of which you have to take into account. The only truly good way to calculate what you need is to do essentially the same work as your printer will do: convert the PDF into one image per page and examine the pixels.
Because of what you want to do, the best way to progress would be to:
1) Convert any Word files into PDF
2) Convert any PDF files into CMYK
3) Render each page of that CMYK file into an image.
Once you've done that you can examine the image and see whether you have any colors left. There are a number of potential technologies you can use for this. GhostScript is definitely one, but there are commercial solutions too that would certainly be more expensive but potentially faster.
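The pixel examination could look roughly like this. It is a sketch under stated assumptions: the pages have already been rendered by some external tool (GhostScript, for example) and loaded as RGB `BufferedImage`s rather than CMYK, and a pixel counts as colored when its channels differ by more than a tolerance. The tolerance value and all names are hypothetical.

```java
import java.awt.image.BufferedImage;
import java.util.List;

final class ColorPageCounter {
    /** True if any pixel's R, G and B channels differ by more than the tolerance. */
    static boolean isColorPage(BufferedImage page, int tolerance) {
        for (int y = 0; y < page.getHeight(); y++) {
            for (int x = 0; x < page.getWidth(); x++) {
                int rgb = page.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                int max = Math.max(r, Math.max(g, b));
                int min = Math.min(r, Math.min(g, b));
                if (max - min > tolerance) {
                    return true; // one colored pixel is enough
                }
            }
        }
        return false;
    }

    /** Counts the rendered pages that contain at least one colored pixel. */
    static int countColorPages(List<BufferedImage> pages, int tolerance) {
        int count = 0;
        for (BufferedImage page : pages) {
            if (isColorPage(page, tolerance)) {
                count++;
            }
        }
        return count;
    }
}
```

A non-zero tolerance leaves room for anti-aliasing and conversion noise on pages that are grey in practice. Rendering at a low resolution before checking is one way to trade accuracy for speed.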

Printing in Java - Printable.print() resizes images

I have a custom report which draws via Graphics2D, and uses a lot of tiny BufferedImage sprites. PrinterJob.print() seems to be calling Printable.print() roughly once for each sprite (the actual count can vary both ways), so some pages are re-rendered 150 times... This causes printing to be unacceptably slow, about 10 seconds for two pages.
I found this: Why does the java Printable's print method get called multiple times with the same page number?
But it doesn't appear to explain my particular problem (or only partially explains it). I created a test report which has only a few sprites, and there was a small number of resizes that went up and down as I added and removed images on either the vertical or horizontal axes.
When printing to a PDF using Bullzip, I noticed that after zooming in on the images, they are being scaled up using a bilinear or bicubic algorithm. One of these images, which is unique in having an indexed color palette, does not appear to be scaled. I confirmed that the scaling is a Java behavior and not being performed by Bullzip by printing to a real printer and observing the same images being scaled versus not.
So it strikes me as the print API trying to rescale images to whatever DPI it has in mind, but for some reason it's calling Printable.print() each time it encounters an image that it deems as needing this treatment.
How do I fix this behavior? I tried setting rendering hints on the Graphics2D that I get when Printable.print() is called, to no avail. I don't know what else to do short of try to find and examine the print API's source code.

I think I just figured it out by accident. A report I just modified now draws an image over some geometry, and I noticed that the part of the geometry that's behind the box of the image is being rasterized and looks blurry compared to outside of the box. The image in question (and all other than the one indexed color image) has an 8 bit alpha channel.
I noticed before that Java's print rasterizer doesn't like things with translucency (one report which used it was being completely rasterized at I think 300dpi...), but I forgot that these images also had alpha channels.
When I get a chance, I'm probably going to fix this by further increasing the images' resolution and using 1 bit alpha. When scaled down for screen viewing, it will have a few bits of alpha again and look okay.
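A minimal sketch of the 1-bit alpha idea: every pixel becomes either fully opaque or fully transparent, depending on a cutoff. This is only an illustration of the plan above, with a made-up class name and cutoff. The result is still stored in an ARGB image, and whether Java's print pipeline then skips rasterization is an assumption that has not been verified here.

```java
import java.awt.image.BufferedImage;

final class AlphaThreshold {
    /** Copies src, forcing each pixel's alpha to 0 or 255 based on the cutoff. */
    static BufferedImage toBinaryAlpha(BufferedImage src, int cutoff) {
        BufferedImage out = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_INT_ARGB);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                int argb = src.getRGB(x, y);
                int alpha = argb >>> 24;
                int newAlpha = alpha >= cutoff ? 0xFF : 0x00;
                // Keep the color channels, replace only the alpha byte.
                out.setRGB(x, y, (newAlpha << 24) | (argb & 0x00FFFFFF));
            }
        }
        return out;
    }
}
```

If the printing code inspects the image's color model rather than the pixel values, an image type with bitmask transparency may be needed instead.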

Finding bounding box of text within JPG image

My question is similar to this one, but is more specific in scope.
In my card game application, I would like for users to be able to click on words located in a scanned jpeg image. Please see this sample Pokemon trading card.
In this case, the user should be able to hover his mouse over the text "Scratch", upon which a pulsing rectangular border will appear around the text, indicating that it is clickable. The problem is how to detect the border of the text. There will be an array of words KNOWN BEFOREHAND that the user may click on (these will be retrieved from a database on a card-by-card basis). To continue our example, the array in this case will be ["Scratch", "Live Coal"]. Once the user clicks on "Scratch", the application must know via a call-back that "Scratch" was chosen instead of "Live Coal".
I was thinking of using optical character recognition libraries to solve this problem, but the open-source options for this are poor in quality (e.g. GOCR) and/or not well-tested on multiple platforms (e.g. Tesseract). I only care about Windows and Mac compatibility. Am I missing an obvious/simpler solution/algorithm that does not require OCR? I cannot simply hand-code in bounding boxes for each card, as there will be thousands of scanned cards in my database. The user may also upload his own custom card scans with an accompanying array of clickable text.
Text color is not always black. See this panorama of different card and text styles that will be permitted. The black cards have white text, and the third-to-last card (Zekrom) has black text with a white outline.
Solutions in any programming language are appreciated. However, please note that I am looking for open-source algorithms and/or libraries. If there is a solution in Ruby or Java, even better, as my code is primarily in these two languages.
EDIT: I forgot to mention that the order of the words/phrases in the array will be the same as on the card. Thus, the array will be ["Scratch", "Live Coal"] instead of ["Live Coal", "Scratch"]. I am mentioning this because it can potentially simplify the task. Thus, for this example, I can simply look for black pixels (though I have to watch out for the black star in the white circle). However, there will be more difficult cases where there is descriptive text under the attack name in a smaller font (again, see the panorama for examples).

I would just write a program that allows you to visually draw a bounding box around your text, for simplicity, but you could also do this by detecting differences in pixel color. Since the text is black, you could find the upper-left-most black pixel, without large indents, within the bottom half of the card.

When the cursor is stationary, check whether there is a black pixel either underneath the cursor or within 4 pixels of it. If there is, scan left, right, up and down from the cursor until you hit three consecutive non-black pixels in each direction (three, because there might still be a non-black pixel between the letters). Use the positions where the scans stop to draw a rectangle. You can use OpenCV.
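One possible reading of that scan, sketched in plain Java rather than OpenCV. It assumes the cursor is already on a dark pixel, scans only along the row and column through the cursor, and treats a pixel as "black" when its average channel value is below a threshold. The threshold, gap size and names are all illustrative choices.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

final class TextBoxFinder {
    static final int MAX_GAP = 3;         // consecutive non-dark pixels that end a scan
    static final int DARK_THRESHOLD = 64; // average channel value below this is "black"

    static boolean isDark(BufferedImage img, int x, int y) {
        int rgb = img.getRGB(x, y);
        int sum = ((rgb >> 16) & 0xFF) + ((rgb >> 8) & 0xFF) + (rgb & 0xFF);
        return sum / 3 < DARK_THRESHOLD;
    }

    /** Walks from (x, y) in direction (dx, dy); returns steps to the last dark pixel. */
    static int reach(BufferedImage img, int x, int y, int dx, int dy) {
        int last = 0, gap = 0, step = 1;
        x += dx;
        y += dy;
        while (x >= 0 && y >= 0 && x < img.getWidth() && y < img.getHeight() && gap < MAX_GAP) {
            if (isDark(img, x, y)) {
                last = step;
                gap = 0;
            } else {
                gap++;
            }
            x += dx;
            y += dy;
            step++;
        }
        return last;
    }

    /** Bounding rectangle around the dark run that passes through the cursor. */
    static Rectangle boundsAround(BufferedImage img, int cx, int cy) {
        int left = cx - reach(img, cx, cy, -1, 0);
        int right = cx + reach(img, cx, cy, 1, 0);
        int top = cy - reach(img, cx, cy, 0, -1);
        int bottom = cy + reach(img, cx, cy, 0, 1);
        return new Rectangle(left, top, right - left + 1, bottom - top + 1);
    }
}
```

Scanning a single row and column will miss letters that have no dark pixel on that line, so a real implementation would probably scan a band of rows and columns instead.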

Android fuzzy / faded fonts possible?

So I am developing a very simple app, mostly for personal use, and am looking for a simple solution to a simple problem.
In its simplest form I am looking for a way to have a line of text with just one or two words blurred out. Basically I am looking to blur text beyond readability but still hinting at what is hidden. Kind of a knowledge / memory app to help memorize some definitions by prompting with a few key words.
I am having issues finding a simple way to accomplish this. Am I just missing an attribute to blur text?
I have thought about:
overriding say the textview onDraw but that seems overkill and I am unsure if there are any methods available to easily blur text.
using the toHtml and trying out the new CSS3 blur effects but I don't think that that is a reasonable solution and I am not sure that the Android platform supports all the CSS3 format, if any.
the simplest and most desirable solution in my book was to find a font (ttf, otf, etc.) file, derived from a common font, that is already blurred as I described, and use that alternating with the non-blurred version of that font to achieve the desired effect.
make the described font but that just plain requires too much time on my part and the outcome is not necessarily good :)
I have thought about some alternative ways to simulate this effect, but they all result in fading the text, which is undesirable, since I want to have some visual prompts to indicate the obscured text's length.
Any ideas? It's been years since I have developed in Java and I am unsure what is available and what the Android OS supports.

I haven't looked into using these properties for only part of the text, but TextView has some possibly useful properties related to text shadows. Using something like the following XML attributes, you could hide the actual text and just show a blurred shadow.
android:textColor - #0000 (fully transparent so that the crisp text is not shown)
android:shadowColor - #FFFF (or whatever color you want to appear)
android:shadowDx - 0 (the shadow is in the same horizontal position as the text)
android:shadowDy - 0 (the shadow is in the same vertical position as the text)
android:shadowRadius - Depends on how much you want to blur. A very small non-zero value, such as 0.001, will be sharp. Larger values blur more, and at some point the shadow becomes illegible.
