Detecting similar web pages - Java

I have a system which crawls the net and takes a screenshot of web pages. At the moment I simply hash the image file (stored as a PNG). However, this doesn't work well with pages that have, say, a comment count on a blog article, or a view count.
So my question is: what would be the best way to detect these changes? Which algorithm would work best?

A naive but very easily implemented method would be to strip all numeric characters from every page and compare just the remaining character content.
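For instance, a minimal Java sketch of that idea, assuming you already have the page text extracted as a String (that extraction step is not shown here):

    // A minimal sketch of the digit-stripping idea. Assumes the page text
    // has already been extracted as a String (not covered in the post).
    public class DigitFreeCompare {

        // Strip all digit runs so comment/view counters don't affect the comparison.
        static String normalize(String pageText) {
            return pageText.replaceAll("\\d+", "");
        }

        // Two snapshots count as "the same page" if their digit-free text matches.
        static boolean samePage(String a, String b) {
            return normalize(a).equals(normalize(b));
        }
    }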

So first we want to detect the areas with the change. A simple, effective way is to take the difference between the two images and then look for all the areas where the difference is greater than zero.
After that we would look at every group of points, go back to those points in the original image, and try to detect numbers using some OCR software.
General algorithm (a Java sketch of steps 1-4 follows below):
1) Diff = Im1 - Im2
2) Threshold Diff to get the binary image ThIm, i.e. if Diff(x, y) > 0 then ThIm(x, y) = 1, otherwise ThIm(x, y) = 0.
3) Find the connected components in ThIm.
4) For each connected component, find the bounding box around it.
5) Crop the original image using the bounding box.
6) Run OCR on the cropped area and check whether you found numbers.
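Here is a rough sketch of steps 1-4 using plain BufferedImage operations; the per-channel threshold and the flood-fill labelling are my own implementation choices. Step 5 is just im1.getSubimage(box.x, box.y, box.width, box.height), and step 6 would be handed to an OCR library such as Tess4J:

    import java.awt.Rectangle;
    import java.awt.image.BufferedImage;
    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;

    // Sketch of steps 1-4, assuming two screenshots of identical dimensions.
    public class DiffRegions {

        // Steps 1-2: per-pixel difference, thresholded into a boolean mask.
        static boolean[][] thresholdDiff(BufferedImage im1, BufferedImage im2, int threshold) {
            int w = im1.getWidth(), h = im1.getHeight();
            boolean[][] mask = new boolean[h][w];
            for (int y = 0; y < h; y++) {
                for (int x = 0; x < w; x++) {
                    int p1 = im1.getRGB(x, y), p2 = im2.getRGB(x, y);
                    int dr = Math.abs(((p1 >> 16) & 0xFF) - ((p2 >> 16) & 0xFF));
                    int dg = Math.abs(((p1 >> 8) & 0xFF) - ((p2 >> 8) & 0xFF));
                    int db = Math.abs((p1 & 0xFF) - (p2 & 0xFF));
                    mask[y][x] = dr + dg + db > threshold;
                }
            }
            return mask;
        }

        // Steps 3-4: 4-connected flood fill; one bounding box per component.
        static List<Rectangle> boundingBoxes(boolean[][] mask) {
            int h = mask.length, w = mask[0].length;
            boolean[][] seen = new boolean[h][w];
            List<Rectangle> boxes = new ArrayList<>();
            for (int y = 0; y < h; y++) {
                for (int x = 0; x < w; x++) {
                    if (!mask[y][x] || seen[y][x]) continue;
                    int minX = x, maxX = x, minY = y, maxY = y;
                    Deque<int[]> stack = new ArrayDeque<>();
                    stack.push(new int[]{x, y});
                    seen[y][x] = true;
                    while (!stack.isEmpty()) {
                        int[] p = stack.pop();
                        minX = Math.min(minX, p[0]); maxX = Math.max(maxX, p[0]);
                        minY = Math.min(minY, p[1]); maxY = Math.max(maxY, p[1]);
                        int[][] nbrs = {{p[0]+1,p[1]},{p[0]-1,p[1]},{p[0],p[1]+1},{p[0],p[1]-1}};
                        for (int[] n : nbrs) {
                            if (n[0] >= 0 && n[0] < w && n[1] >= 0 && n[1] < h
                                    && mask[n[1]][n[0]] && !seen[n[1]][n[0]]) {
                                seen[n[1]][n[0]] = true;
                                stack.push(n);
                            }
                        }
                    }
                    boxes.add(new Rectangle(minX, minY, maxX - minX + 1, maxY - minY + 1));
                }
            }
            return boxes;
        }
    }

In practice you would threshold above zero rather than at zero, so that JPEG/anti-aliasing noise doesn't flag the whole page as changed.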

Related

How to count color pages in a PDF/Word doc using Java

I am looking to develop a desktop application in Java to count the number of colored pages in a PDF or Word file. This will be used as part of an overall system to help calculate the cost of printing a document based on how many pages there are of each kind (color/B&W).
Ideally, the user of the application would use a file dialog to select the desired PDF/Word file; the application would then count and output the number of colored pages, allowing the system to calculate the document cost automatically.
For example:
if A4 colored pages cost 50c per page to print,
and B&W pages cost 10c per page,
calculate the total cost of the document from its colored/B&W page counts.
I am aware of the existing software Rapid PDF Count http://www.traction-software.co.uk/rapidpdfcount/, but it would be unsuitable for integration into a new system. I have also tried using GhostScript/Python as per this solution: http://root42.blogspot.de/2012/10/counting-color-pages-in-pdf-files.html, however this takes too long (5 minutes to count a 100-page PDF) and would be difficult to implement in a desktop app.
Is there any method of counting the number of colored pages in a PDF or Word file using Java (or an alternative language)?
Thanks
Although it might sound easy, the task is rather complicated.
One option would be to use a program such as iText to walk every single token in the PDF, look for tokens that support color and compare that to your definition of "black". However, this will only get you basic text and drawing commands. Images are a completely different beast so you'll probably need to find an image parser or grab a copy of each spec and then walk each of those.
One of the downsides of token walking is you need to properly handle tokens that reference other things and further walk those tokens.
Another downside is that things can overlap each other, so you'd probably want to be aware of their coordinates, z-index, transparency and such.
There will be many more bumps in the road but that's a good start. What's most interesting is that if you accomplish this, you'll actually have found that you've partially built a PDF renderer!
Next, you'll need to define "black". Off the top of my head there's RGB black, CMYK black, grey black and maybe Lab black, along with some Pantones. That shouldn't be too hard, but if I were to build this I'd want to know "black ink usage", which could also include shades of grey. There's also "rich black" that you might need to deal with, too!
So, all that said, I think that the GhostScript option you found is really the best bet. It literally renders the PDF and calculates the ink coverage from an RGB standpoint. You should still handle greys, too, but that shouldn't be too hard; here's a good starting point.
Wanting to know what the click-charge is going to be is a pretty common problem, but it's not easy to solve at all. Chris Haas's answer already indicates as much, but I want to put another spin on it.
First of all, you have to wonder whether you really want to support both Word and PDF documents. Analysing Word files is less useful than you might think because that Word file is probably going to be converted into something else before it's going to be printed. And because of the fact that you're starting from Word, the chance that your nice RGB black text in Word gets converted to less-than-perfect 4 color black in PDF is very high. In other words, even though you might count a page of black text in Word as a 'cheap' page, it might turn into an expensive color page after conversion from Word to something that can be printed.
Let's consider the PDF case then. PDF supports a whole host of color spaces (gray, RGB, CMYK, the same with an ICC profile attached, spot color and a few multi-spot color variants, CalGray, CalRGB, and Lab). Besides that, there is a whole range of very tricky features such as transparency, overprint, shades, images, masks... that you all have to take into account. The only truly good way to calculate what you need is to do essentially the same work your printer will do: convert the PDF into one image per page and examine the pixels.
Because of what you want to do, the best way to progress would be to:
1) Convert any Word files into PDF.
2) Convert any PDF files into CMYK.
3) Render each page of that CMYK file into an image.
Once you've done that you can examine the image and see whether you have any colors left. There are a number of potential technologies you can use for this. GhostScript is definitely one, but there are commercial solutions too that would certainly be more expensive but potentially faster.
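As a rough illustration of the render-and-inspect approach, here is a sketch that assumes Apache PDFBox 2.x as the rendering library, simplified to an RGB "is any pixel non-grey" check rather than a true CMYK separation; the tolerance value is an arbitrary choice to absorb anti-aliasing noise:

    import java.awt.image.BufferedImage;
    import java.io.File;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.rendering.PDFRenderer;

    // Sketch: render each page and flag it as colour as soon as any pixel's
    // channels differ by more than a small tolerance. Real ink-coverage
    // analysis (CMYK, rich black, etc.) is considerably more involved.
    public class ColorPageCounter {
        public static void main(String[] args) throws Exception {
            int tolerance = 8;   // absorbs JPEG/anti-aliasing noise; tune as needed
            int colorPages = 0;
            try (PDDocument doc = PDDocument.load(new File(args[0]))) {
                PDFRenderer renderer = new PDFRenderer(doc);
                for (int i = 0; i < doc.getNumberOfPages(); i++) {
                    BufferedImage img = renderer.renderImageWithDPI(i, 72);
                    if (hasColor(img, tolerance)) colorPages++;
                }
            }
            System.out.println("Colour pages: " + colorPages);
        }

        static boolean hasColor(BufferedImage img, int tol) {
            for (int y = 0; y < img.getHeight(); y++) {
                for (int x = 0; x < img.getWidth(); x++) {
                    int rgb = img.getRGB(x, y);
                    int r = (rgb >> 16) & 0xFF, g = (rgb >> 8) & 0xFF, b = rgb & 0xFF;
                    // a pixel counts as "grey" when all channels are nearly equal
                    if (Math.abs(r - g) > tol || Math.abs(g - b) > tol
                            || Math.abs(r - b) > tol) return true;
                }
            }
            return false;
        }
    }

Rendering at a low DPI keeps this fast; you can re-render suspect pages at a higher DPI if you need more confidence.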

Compare image to a reference image in Java

I need to write tests (in Java) for an image capturing application. To be more precise:
Some image is captured from a scanner.
The application returns a JPEG of this image.
The test shall compare the scanned image with a reference image. This reference image has been captured by the identical application and has been verified visually that it contains the same content.
My first idea was comparing the images pixel by pixel, but as the application applies JPEG compression the result would never be "100% identical", besides the fact that the scanner captures the image with slight differences each time.
The goal is not to compare two "random images" for similarity like "Do both images show a car?" but rather "How similar is the captured car to the reference image of this car?".
What other possibilities do you see?
Let me say first that image processing is not my best field, nor am I an expert in any way. However, here is my suggestion: take the absolute difference between each pair of corresponding pixels in the two pictures, in row-major order. Then average out the differences and see whether the average is within a certain threshold: if so, the images are similar in nature. Another way would be to go through the images pixel by pixel once more, but instead count the number of pixels that differ by at least x, where x is a number chosen to represent how close the matching needs to be. Then just see whether the differing pixels comprise 5% of all pixels, 10%, and so on. The more differing pixels, the more different the images.
I hope this helps you, and best of luck. P.S. To judge how far apart two pixels are, it's like finding the distance between two points, i.e. sqrt[(r2-r1)^2+(g2-g1)^2+(b2-b1)^2]. P.P.S. This all of course assumes that you listened to Hovercraft Full of Eels, and have somehow made sure that the images match up in size and position.
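A sketch of both suggested metrics in Java, assuming the two images already have the same dimensions and alignment:

    import java.awt.image.BufferedImage;

    public class ImageSimilarity {

        // Euclidean distance between two packed RGB pixels, per the P.S. above.
        static double pixelDistance(int p1, int p2) {
            int dr = ((p1 >> 16) & 0xFF) - ((p2 >> 16) & 0xFF);
            int dg = ((p1 >> 8) & 0xFF) - ((p2 >> 8) & 0xFF);
            int db = (p1 & 0xFF) - (p2 & 0xFF);
            return Math.sqrt(dr * dr + dg * dg + db * db);
        }

        // Metric 1: average per-pixel distance, compared to a threshold by the caller.
        static double meanDistance(BufferedImage a, BufferedImage b) {
            double sum = 0;
            for (int y = 0; y < a.getHeight(); y++)
                for (int x = 0; x < a.getWidth(); x++)
                    sum += pixelDistance(a.getRGB(x, y), b.getRGB(x, y));
            return sum / ((double) a.getWidth() * a.getHeight());
        }

        // Metric 2: fraction of pixels differing by at least minDistance (the "x" above).
        static double differingFraction(BufferedImage a, BufferedImage b, double minDistance) {
            long count = 0;
            for (int y = 0; y < a.getHeight(); y++)
                for (int x = 0; x < a.getWidth(); x++)
                    if (pixelDistance(a.getRGB(x, y), b.getRGB(x, y)) >= minDistance) count++;
            return (double) count / ((double) a.getWidth() * a.getHeight());
        }
    }

The thresholds for both metrics are empirical; you would calibrate them against a few scans you have already verified visually.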

Extract similar area of two images

I wonder whether it is possible to extract similar areas of two images. If it is possible, I am going to use this to find the insignia of the company that created them.
There are two images below. The red rectangles in the images mark the areas I am trying to find. The program should find similar areas by comparing the images. I tried to do it using OpenCV, but I couldn't.
First thing in mind:
Convert the images to grayscale.
Divide each image into small areas (patches).
Label each patch as 1 if its entropy is high and 0 if it is low, to discard patches without letters (see the entropy sketch after this list).
For two images, compare all patches across the images based on:
a histogram of the Sobel image (Bhattacharyya distance, normalized),
correlation (min-max normalization), or
advanced descriptors such as SIFT (L2 normalization).
The minimum distance wins.
You can narrow down the '1' patches with a text detector (Algorithm to detect presence of text on image).
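As a sketch of the entropy-labelling step in Java, the patch size and threshold here are assumptions you would tune by experiment:

    import java.awt.image.BufferedImage;

    // Label a patch 1 when the Shannon entropy of its grey-level histogram
    // is high (likely text/texture) and 0 when it is low (flat background).
    public class PatchEntropy {

        static double entropy(BufferedImage gray, int x0, int y0, int size) {
            int[] hist = new int[256];
            int n = 0;
            for (int y = y0; y < y0 + size; y++) {
                for (int x = x0; x < x0 + size; x++) {
                    hist[gray.getRGB(x, y) & 0xFF]++; // grayscale: channels are equal
                    n++;
                }
            }
            double h = 0;
            for (int count : hist) {
                if (count == 0) continue;
                double p = (double) count / n;
                h -= p * (Math.log(p) / Math.log(2)); // bits
            }
            return h;
        }

        static boolean[][] labelPatches(BufferedImage gray, int size, double threshold) {
            int cols = gray.getWidth() / size, rows = gray.getHeight() / size;
            boolean[][] labels = new boolean[rows][cols];
            for (int r = 0; r < rows; r++)
                for (int c = 0; c < cols; c++)
                    labels[r][c] = entropy(gray, c * size, r * size, size) > threshold;
            return labels;
        }
    }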

Java image library - turn grid image into array

If I have an image of a table of boxes, with some coloured in, is there an image processing library that can help me turn this into an array?
Thanks
You can use a thresholding function to binarize the image into dark/light pixels so dark pixels are 0 and light ones are 1.
Then you would want to remove image artifacts and noise using dilation and erosion functions (all of these are well defined on Wikipedia).
Finally, if you know where the boxes are, you can just read the value at the center of each box to determine the array value, or use an area near the center and take the prevailing value (i.e. more 0s means a filled-in square, more 1s means an empty square); a sketch follows below.
If you are scanning these boxes and there is a lot of variation in the position of the boxes, you will have to perform some level of image registration using known points, or fiducials.
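A minimal Java sketch of the threshold-and-sample approach, assuming the grid's origin and cell size are already known (i.e. no registration is needed) and that every cell centre lies well inside the image:

    import java.awt.image.BufferedImage;

    // Binarize with a fixed threshold, then majority-vote a small window
    // around each cell centre. Grid geometry is assumed known.
    public class GridReader {

        static boolean[][] binarize(BufferedImage img, int threshold) {
            int w = img.getWidth(), h = img.getHeight();
            boolean[][] dark = new boolean[h][w];
            for (int y = 0; y < h; y++) {
                for (int x = 0; x < w; x++) {
                    int rgb = img.getRGB(x, y);
                    int lum = (((rgb >> 16) & 0xFF) + ((rgb >> 8) & 0xFF) + (rgb & 0xFF)) / 3;
                    dark[y][x] = lum < threshold; // dark = 0, light = 1 in the text above
                }
            }
            return dark;
        }

        // True where a cell is filled in, using a (2r+1)^2 window per centre.
        static boolean[][] readGrid(boolean[][] dark, int originX, int originY,
                                    int cellSize, int rows, int cols, int r) {
            boolean[][] filled = new boolean[rows][cols];
            for (int row = 0; row < rows; row++) {
                for (int col = 0; col < cols; col++) {
                    int cx = originX + col * cellSize + cellSize / 2;
                    int cy = originY + row * cellSize + cellSize / 2;
                    int darkCount = 0, total = 0;
                    for (int dy = -r; dy <= r; dy++)
                        for (int dx = -r; dx <= r; dx++) {
                            darkCount += dark[cy + dy][cx + dx] ? 1 : 0;
                            total++;
                        }
                    filled[row][col] = darkCount * 2 > total; // majority vote
                }
            }
            return filled;
        }
    }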
As far as what tools to use, I'd recommend first trying this manually with a tool like ImageJ, which has a UI and can also be used programmatically, since it is written entirely in Java.
Other good libraries for this include OpenCV and the Java Advanced Imaging API.
Your results will definitely vary depending on the input images and how consistently they are lit and positioned.
The best way to see how the approach will do for your data is to apply these processing steps manually and see where your threshold value should be and how much dilating/eroding you need to get consistent results.

Auto scale and rotate images

Given:
two images of the same subject matter;
the images have the same resolution, colour depth, and file format;
the images differ in size and rotation; and
two lists of (x, y) co-ordinates that correlate the images.
I would like to know:
How do you transform the larger image so that it visually aligns to the second image?
(Optional.) What is the minimum number of points needed to get an accurate transformation?
(Optional.) How far apart do the points need to be to get an accurate transformation?
The transformation would need to rotate, scale, and possibly shear the larger image. Essentially, I want to create (or find) a program that does the following:
Input two images (e.g., TIFFs).
Click several anchor points on the small image.
Click the corresponding anchor points on the large image.
Transform the large image such that it maps to the small image by aligning the anchor points.
This would help align pictures of the same stellar object. (For example, a hand-drawn picture from 1855 mapped to a photograph taken by Hubble in 2000.)
Many thanks in advance for any algorithms (preferably Java or similar pseudo-code), ideas or links to related open-source software packages.
This is called Image Registration.
Mathworks discusses this, Matlab has this ability, and more information is in the Elastix Manual.
Consider:
Open source Matlab equivalents
IRTK
IRAF
Hugin
You can use the javax.imageio or Java Advanced Imaging APIs for rotating, shearing and scaling the images once you have found out what you want to do with them.
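As an illustration of the core transform step: an affine transform (rotation, scale, shear, translation) is fully determined by exactly three non-collinear point pairs, which also answers the "minimum number of points" question. Here is a sketch in plain Java that solves the two 3x3 systems with Cramer's rule and applies the result with AffineTransformOp; the point arrays are assumed to come from your click UI:

    import java.awt.geom.AffineTransform;
    import java.awt.image.AffineTransformOp;
    import java.awt.image.BufferedImage;

    public class AlignImages {

        // src[i] = {x, y} clicked in the large image; dst[i] = the matching
        // {x, y} in the small image. Exactly three pairs, not collinear.
        static AffineTransform solveAffine(double[][] src, double[][] dst) {
            // x' = a*x + b*y + c  and  y' = d*x + e*y + f
            double[] abc = solve3(src, new double[]{dst[0][0], dst[1][0], dst[2][0]});
            double[] def = solve3(src, new double[]{dst[0][1], dst[1][1], dst[2][1]});
            return new AffineTransform(abc[0], def[0], abc[1], def[1], abc[2], def[2]);
        }

        // Cramer's rule for a*x_i + b*y_i + c = rhs_i, i = 0..2.
        static double[] solve3(double[][] p, double[] rhs) {
            double d = det(new double[]{p[0][0], p[0][1], 1},
                           new double[]{p[1][0], p[1][1], 1},
                           new double[]{p[2][0], p[2][1], 1});
            double a = det(new double[]{rhs[0], p[0][1], 1}, new double[]{rhs[1], p[1][1], 1},
                           new double[]{rhs[2], p[2][1], 1}) / d;
            double b = det(new double[]{p[0][0], rhs[0], 1}, new double[]{p[1][0], rhs[1], 1},
                           new double[]{p[2][0], rhs[2], 1}) / d;
            double c = det(new double[]{p[0][0], p[0][1], rhs[0]}, new double[]{p[1][0], p[1][1], rhs[1]},
                           new double[]{p[2][0], p[2][1], rhs[2]}) / d;
            return new double[]{a, b, c};
        }

        static double det(double[] r0, double[] r1, double[] r2) {
            return r0[0] * (r1[1] * r2[2] - r1[2] * r2[1])
                 - r0[1] * (r1[0] * r2[2] - r1[2] * r2[0])
                 + r0[2] * (r1[0] * r2[1] - r1[1] * r2[0]);
        }

        // Warp the large image into the small image's coordinate frame;
        // outW/outH would normally be the small image's dimensions.
        static BufferedImage warp(BufferedImage large, AffineTransform t, int outW, int outH) {
            BufferedImage out = new BufferedImage(outW, outH, BufferedImage.TYPE_INT_RGB);
            new AffineTransformOp(t, AffineTransformOp.TYPE_BILINEAR).filter(large, out);
            return out;
        }
    }

With more than three point pairs you would fit the transform in a least-squares sense instead, which is more robust to clicking error; widely spaced points also give a more accurate fit than tightly clustered ones.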
For a C++ implementation (without GUI), try the old KLT (Kanade-Lucas-Tomasi) tracker.
http://www.ces.clemson.edu/~stb/klt/
