Absolute positions in PDFBox

I'm writing a program that converts TeX-generated PDFs back to a TeX-like string of text. In order to achieve that I use Apache PDFBox.
I would like to detect subscripts and superscripts and then denote them with a TeX-like syntax. I have read this question: Superscript and subscript differentiation using pdf box, but it isn't really helpful: I cannot detect subscripts and superscripts from Y and EndY, probably because those values are relative. Is there any way to get the absolute position of text? The height of a glyph is easy to obtain as long as the document uses the old TeX fonts, so I can already detect font size changes.
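One possible approach, sketched below in plain Java so that it runs without PDFBox on the classpath: compare each run's baseline and font size against the body text of the same line. The Run class, the 0.9 size factor and the 0.1 shift threshold are hypothetical choices for illustration. In PDFBox 2.x the inputs could probably be filled from TextPosition.getYDirAdj() and TextPosition.getFontSizeInPt(), but that mapping is an assumption and not something confirmed in this thread.

```java
import java.util.ArrayList;
import java.util.List;

class ScriptSketch {
    /** Hypothetical glyph run: text, baseline y (growing downwards) and font size in points. */
    static final class Run {
        final String text;
        final double baselineY;
        final double fontSize;

        Run(String text, double baselineY, double fontSize) {
            this.text = text;
            this.baselineY = baselineY;
            this.fontSize = fontSize;
        }
    }

    /** Converts the runs of ONE line of text into a TeX-like string. */
    static String toTex(List<Run> runs) {
        if (runs.isEmpty()) {
            return "";
        }
        // Assumption: the largest font size on the line is the body size.
        double bodySize = 0;
        for (Run r : runs) {
            bodySize = Math.max(bodySize, r.fontSize);
        }
        // The first body-size run provides the reference baseline.
        double bodyBaseline = 0;
        for (Run r : runs) {
            if (r.fontSize == bodySize) {
                bodyBaseline = r.baselineY;
                break;
            }
        }
        double threshold = 0.1 * bodySize; // hypothetical tolerance
        StringBuilder out = new StringBuilder();
        for (Run r : runs) {
            boolean smaller = r.fontSize < 0.9 * bodySize;
            double shift = r.baselineY - bodyBaseline; // positive means lower on the page
            if (smaller && shift < -threshold) {
                out.append("^{").append(r.text).append('}');
            } else if (smaller && shift > threshold) {
                out.append("_{").append(r.text).append('}');
            } else {
                out.append(r.text);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Run> line = new ArrayList<>();
        line.add(new Run("x", 100.0, 10.0));
        line.add(new Run("2", 96.4, 7.0));   // raised and smaller
        line.add(new Run("+a", 100.0, 10.0));
        line.add(new Run("i", 101.5, 7.0));  // lowered and smaller
        System.out.println(toTex(line));
    }
}
```

The sketch only handles a single line and a single nesting level; nested scripts would need the comparison to be made against the enclosing script rather than the body text.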

Related

How to get (x,y width height )of any given word in pdf using java

I need to get the x, y, width and height of a given word in a PDF, so that later, while parsing the same type of file, I can fetch the value from the coordinates themselves. How should I get the position of a word from a PDF using Java?
Rectangle rect = new Rectangle(451, 125,100,1); // I need to get these coordinates for any particular word
stripper.addRegion("class1", rect);
stripper.extractRegions(pdDocument.getPage(0));
System.out.println("stripper "+stripper.getTextForRegion("class1").trim());

I think that you could make use of Apache's PDFBox API and follow the advice in this similar question which is specific to that API to write the code you need.

How to count color pages in a PDF/Word doc using Java

I am looking to develop a desktop application using Java to count the number of colored pages in a PDF or Word file. This will be used as part of an overall system to help calculate the cost of printing a document in terms of how many pages there are (color/B&W).
Ideally, the user of the application would use a file dialog to select the desired PDF/Word file, and the application would then count and output the number of colored pages, allowing the system to calculate the document cost automatically.
For example:
if A4 colored pages cost 50c per page to print,
and B&W cost 10c per page,
calculate the total cost of the document per colored/B&W pages.
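The arithmetic itself is simple once the page counts are known. A minimal sketch, using the prices from the example above; the class and method names and the 12/88 page split are made up for illustration:

```java
class PrintCost {
    /** Total cost in cents for the given page counts and per-page prices. */
    static int totalCents(int colorPages, int bwPages, int colorCents, int bwCents) {
        return colorPages * colorCents + bwPages * bwCents;
    }

    public static void main(String[] args) {
        // Hypothetical 100-page document with 12 colored pages.
        int cents = totalCents(12, 88, 50, 10);
        System.out.println("Total: " + cents + " cents");
    }
}
```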
I am aware of the existing software Rapid PDF Count http://www.traction-software.co.uk/rapidpdfcount/, but it would be unsuitable for integration into a new system. I have also tried using GhostScript/Python as per this solution: http://root42.blogspot.de/2012/10/counting-color-pages-in-pdf-files.html, however this takes too long (5 minutes to count a 100-page PDF) and would be difficult to build into a desktop app.
Is there any method of counting the number of colored pages in a PDF or Word file using Java (or an alternative language)?
Thanks

Although it might sound easy, the task is rather complicated.
One option would be to use a program such as iText to walk every single token in the PDF, look for tokens that support color and compare that to your definition of "black". However, this will only get you basic text and drawing commands. Images are a completely different beast so you'll probably need to find an image parser or grab a copy of each spec and then walk each of those.
One of the downsides of token walking is you need to properly handle tokens that reference other things and further walk those tokens.
Another downside is that things can overlap each other, so you'd probably want to be aware of their coordinates, z-index, transparency and such.
There will be many more bumps in the road but that's a good start. What's most interesting is that if you accomplish this, you'll actually have found that you've partially built a PDF renderer!
Next, you'll need to define "black". Off the top of my head there's RGB black, CMYK black, Grey black and maybe Lab black along with some Pantones. That shouldn't be too hard, but if I were to build this I'd want to know "black ink usage", which could also include shades of grey. There's also "rich black" that you might need to deal with, too!
So, all that said, I think that the GhostScript option you found is really the best bet. It literally renders the PDF and calculates the ink coverage from an RGB standpoint. You should still handle greys, too, but that shouldn't be too hard; here's a good starting point.
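To make the CMYK side of that concrete, here is a rough sketch of two of those definitions. It assumes CMYK components normalised to 0..1; the class name, method names and the eps tolerance are hypothetical, and it ignores spot colors and Lab entirely:

```java
class CmykBlack {
    /** True when only the K channel carries ink, i.e. plain black or a shade of grey. */
    static boolean isBlackOnly(double c, double m, double y, double eps) {
        return c <= eps && m <= eps && y <= eps;
    }

    /** True for "rich black": (almost) full K plus some C, M or Y ink on top. */
    static boolean isRichBlack(double c, double m, double y, double k, double eps) {
        return k >= 1.0 - eps && !isBlackOnly(c, m, y, eps);
    }

    public static void main(String[] args) {
        System.out.println(isBlackOnly(0.0, 0.0, 0.0, 0.01));      // K-only color
        System.out.println(isRichBlack(0.6, 0.4, 0.4, 1.0, 0.01)); // a typical rich-black mix
    }
}
```

Whether a rich-black page counts as a "color" page is a pricing decision, so the two cases are kept separate here.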

Wanting to know what the click-charge is going to be is a pretty common problem, but it's not easy to solve at all, as the answer Chris Haas gave already indicates. I want to put another spin on it, though.
First of all, you have to wonder whether you really want to support both Word and PDF documents. Analysing Word files is less useful than you might think because that Word file is probably going to be converted into something else before it's going to be printed. And because of the fact that you're starting from Word, the chance that your nice RGB black text in Word gets converted to less-than-perfect 4 color black in PDF is very high. In other words, even though you might count a page of black text in Word as a 'cheap' page, it might turn into an expensive color page after conversion from Word to something that can be printed.
Let's consider the PDF case then. PDF supports a whole host of color spaces (gray, RGB, CMYK, the same with an ICC profile attached, spot color and a few multi-spot color variants, CalGray, CalRGB and Lab). Besides that, there is a whole range of very tricky features such as transparency, overprint, shades, images, masks... all of which you have to take into account. The only truly good way to calculate what you need is to do essentially the same work as your printer will do: convert the PDF into one image per page and examine the pixels.
Because of what you want to do, the best way to progress would be to:
1) Convert any Word files into PDF
2) Convert any PDF files into CMYK
3) Render each page of that CMYK file into an image.
Once you've done that you can examine the image and see whether you have any colors left. There are a number of potential technologies you can use for this. GhostScript is definitely one, but there are commercial solutions too that would certainly be more expensive but potentially faster.
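The "examine the image" step could look roughly like the sketch below. It assumes each page has already been rendered to a BufferedImage by some other tool, and it works on RGB pixels (as the GhostScript approach does) rather than on a CMYK rendering. The class name and the tolerance parameter are hypothetical:

```java
import java.awt.image.BufferedImage;

class ColorPageCheck {
    /**
     * Returns true if any pixel's red, green and blue values differ by more
     * than the tolerance, i.e. the pixel is not a shade of grey.
     */
    static boolean hasColor(BufferedImage page, int tolerance) {
        for (int y = 0; y < page.getHeight(); y++) {
            for (int x = 0; x < page.getWidth(); x++) {
                int rgb = page.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                int max = Math.max(r, Math.max(g, b));
                int min = Math.min(r, Math.min(g, b));
                if (max - min > tolerance) {
                    return true; // one colored pixel is enough
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Stand-in for a rendered page: a small all-grey image.
        BufferedImage page = new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < 4; y++) {
            for (int x = 0; x < 4; x++) {
                page.setRGB(x, y, 0x808080);
            }
        }
        System.out.println(hasColor(page, 8));
        page.setRGB(2, 2, 0xFF0000); // add one red pixel
        System.out.println(hasColor(page, 8));
    }
}
```

A small non-zero tolerance helps with anti-aliasing and compression noise in the rendered image; a real implementation might also require a minimum number of colored pixels before calling a page "color".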

Find location of character index in a Swing text control

I want to do some custom drawing of brackets around certain text in a text control in Java Swing. But I need to know where to draw them. I do know the exact range of characters in the text content, so I just need to be able to translate those indexes into specific locations on the control so I can draw.
Is there some way to do that?

Incorrect / missing font metrics in Java?

Using a certain font, I use Java's TextLayout to determine its ascent, descent, and leading (see Java's text layout tutorial here).
In my specific case I'm using Arial Unicode MS, font size 8. Using the following code:
Font font = new Font("Arial Unicode MS", 0, 8);
TextLayout layout = new TextLayout("Pp", font,
new FontRenderContext(null, true, true));
System.out.println( "Ascent: "+layout.getAscent());
System.out.println( "Descent: "+layout.getDescent());
System.out.println( "Leading: "+layout.getLeading());
Java gives me the following values:
Ascent: 8.550781
Descent: 2.1679688
Leading: 0.0
So far so good. However if I use the sum of these values as my line spacing for various lines of text, this differs by quite a bit from the line spacing used in OpenOffice, Microsoft Word, etc.: it is smaller. When using default single line spacing Word and OO seem to have a line spacing of around 13.7pt (instead of 10.7pt like I computed using Java's font metrics above).
Any idea
why this is?
whether I can somehow access the font information Word and OpenOffice seem to be accessing which leads to this different line spacing?
Things I've tried so far:
adding all glyphs to a glyph vector with font.getNumGlyphs() etc. - still get the same font metrics values
using multiple lines as described here - each line I get has the same font metrics as outlined above.
using FontMetrics' methods such as getLeading()

Zarkonnen doesn't deserve his downvotes as he's on the right lines. Many Java fonts appear to return zero for their leading when perhaps they shouldn't. Maybe it is down to this bug: I don't know. It would appear to be down to you to put this whitespace back in.
Typographical line height is usually defined as ascent + descent + leading. Ascent and descent are measured upwards and downwards from the baseline that characters sit on, and the leading is the space between the descent of one line and the ascent of the line underneath.
But leading is not fixed. You can set the leading in most word-processing and typographical software. Word calls this the line spacing. The original question is probably asking how Microsoft Word calculates its single line spacing. Microsoft's recommendations for OpenType fonts seem to suggest that software on different platforms calculates it differently. (Maybe this is why Java now returns zero?)
A quick bit of Googling around seems to indicate that a rule of thumb for leading is 120% of ascent+descent for single-line spacing, or a fixed point spacing; say 2pts leading between all lines. In the absence of any hard or fast rule I can find, I would say it boils down to the legibility of the text you're presenting, and you should just go with what you think looks best.
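As a worked example, the sketch below applies both formulas to the metrics reported in the question. The class and method names are made up; the 1.2 factor is just the rule of thumb mentioned above, not anything Word or OpenOffice is known to use:

```java
class LineSpacing {
    /** Typographical line height: ascent + descent + leading. */
    static double lineHeight(double ascent, double descent, double leading) {
        return ascent + descent + leading;
    }

    /** Rule-of-thumb single-line spacing: 120% of ascent + descent. */
    static double ruleOfThumb(double ascent, double descent) {
        return 1.2 * (ascent + descent);
    }

    public static void main(String[] args) {
        // Values Java reported for Arial Unicode MS at size 8 in the question.
        double ascent = 8.550781;
        double descent = 2.1679688;
        double leading = 0.0;
        System.out.println("ascent + descent + leading: " + lineHeight(ascent, descent, leading));
        System.out.println("120% rule of thumb:         " + ruleOfThumb(ascent, descent));
    }
}
```

With these inputs the rule of thumb gives roughly 12.9pt, which is closer to, but still short of, the roughly 13.7pt observed in Word and OpenOffice.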

Are Word and OO including the white space between lines, while Java isn't?
So in Word / OO, your number is Ascent + Descent + Whitespace, while in Java you just have Ascent + Descent?

Find an Image within an Image

I am looking for the best way to detect an image within another image. I have a small image and would like to find the location that it appears within a larger image - which will actually be screen captures. Conceptually, it is like a 'Where's Waldo?' sort of search in the larger image.
Are there any efficient/quick ways to accomplish this? Speed is more important than memory.
Edit:
The 'inner' image may not always have the same scale but will have the same rotation.
It is not safe to assume that the image will be perfectly contained within the other, pixel for pixel.

Wikipedia has an article on Template Matching, with sample code.
(While that page doesn't handle changed scales, it has links to other styles of matching, for example Scale invariant feature transform)
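For a feel of what basic template matching does, here is a minimal brute-force sketch in Java using the sum of absolute differences (SAD) as the similarity measure. It is not the Wikipedia sample code; it assumes both images are greyscale int arrays at the same scale, and the class and method names are made up:

```java
class TemplateMatch {
    /**
     * Slides the template over the image and returns {row, col} of the
     * position with the lowest sum of absolute differences.
     * Assumes rectangular, non-empty arrays and a template no larger than the image.
     */
    static int[] bestMatch(int[][] image, int[][] template) {
        int ih = image.length, iw = image[0].length;
        int th = template.length, tw = template[0].length;
        long best = Long.MAX_VALUE;
        int[] bestPos = {-1, -1};
        for (int r = 0; r + th <= ih; r++) {
            for (int c = 0; c + tw <= iw; c++) {
                long sad = 0;
                // Stop early once this position is already worse than the best so far.
                for (int i = 0; i < th && sad < best; i++) {
                    for (int j = 0; j < tw; j++) {
                        sad += Math.abs(image[r + i][c + j] - template[i][j]);
                    }
                }
                if (sad < best) {
                    best = sad;
                    bestPos = new int[] {r, c};
                }
            }
        }
        return bestPos;
    }

    public static void main(String[] args) {
        int[][] image = {
            {0, 0, 0, 0, 0},
            {0, 0, 9, 8, 0},
            {0, 0, 7, 6, 0},
            {0, 0, 0, 0, 0}
        };
        int[][] template = {{9, 8}, {7, 6}};
        int[] pos = bestMatch(image, template);
        System.out.println("row " + pos[0] + ", col " + pos[1]);
    }
}
```

Because it picks the lowest difference rather than requiring an exact match, this tolerates small pixel-level deviations, but it is slow on full screen captures and does not handle a changed scale.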

If rotation also had to be catered for, the Generalised Hough Transform can be used.

You can treat this as a substring problem, where the characters in the alphabet are pixels and your string is the image. You would also need to use a special character, in a similar vein to a linebreak, to denote the image boundary.
The algorithm you want is on Wikipedia: http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
Update: If you cannot assume that the image is perfectly contained within the other, pixel for pixel, then this approach will not work.
There are other, more complicated algorithms based on the same dynamic programming concept as the above, but I won't go into them unless it's necessary.
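For the exact, pixel-for-pixel case, the idea could be sketched as follows. Note one deliberate simplification: instead of flattening the image into one string with a boundary character, this sketch runs Knuth-Morris-Pratt on each image row to find candidates for the first row of the small image, then verifies the remaining rows directly. All names are made up, and a non-empty small image is assumed:

```java
class KmpRowSearch {
    /** KMP failure table: length of the longest proper prefix that is also a suffix. */
    static int[] failure(int[] p) {
        int[] f = new int[p.length];
        int k = 0;
        for (int i = 1; i < p.length; i++) {
            while (k > 0 && p[i] != p[k]) k = f[k - 1];
            if (p[i] == p[k]) k++;
            f[i] = k;
        }
        return f;
    }

    /** First index >= from where pattern occurs in text, or -1. */
    static int indexOf(int[] text, int[] pattern, int[] f, int from) {
        int k = 0;
        for (int i = from; i < text.length; i++) {
            while (k > 0 && text[i] != pattern[k]) k = f[k - 1];
            if (text[i] == pattern[k]) k++;
            if (k == pattern.length) return i - k + 1;
        }
        return -1;
    }

    /** Checks rows 1..n-1 of the small image at position (r, c); row 0 is already matched. */
    static boolean restMatches(int[][] image, int[][] small, int r, int c) {
        for (int i = 1; i < small.length; i++) {
            for (int j = 0; j < small[i].length; j++) {
                if (image[r + i][c + j] != small[i][j]) return false;
            }
        }
        return true;
    }

    /** Returns {row, col} of the first exact occurrence, or null if there is none. */
    static int[] find(int[][] image, int[][] small) {
        int[] f = failure(small[0]);
        for (int r = 0; r + small.length <= image.length; r++) {
            int c = indexOf(image[r], small[0], f, 0);
            while (c >= 0) {
                if (restMatches(image, small, r, c)) return new int[] {r, c};
                c = indexOf(image[r], small[0], f, c + 1);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        int[][] image = {
            {9, 8, 0, 0},
            {1, 1, 9, 8},
            {0, 0, 7, 6}
        };
        int[][] small = {{9, 8}, {7, 6}};
        int[] pos = find(image, small);
        System.out.println(pos == null ? "not found" : "row " + pos[0] + ", col " + pos[1]);
    }
}
```

As the update above says, this only works when the small image appears unchanged, pixel for pixel, inside the larger one.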
