I want to find a library or an algorithm (so I can write the code myself) for identifying the nearest k neighbours of a webpage, where the webpage is defined as a set of keywords. I have already done the part where I extract the keywords.
It doesn't have to be very good, just good enough.
Can anyone suggest a solution, or where to start? I have looked through lectures by Yury Lifshits in the past, but I am hoping to find something ready-made if possible.
Java libraries preferred.
As you said, you already have the keywords extracted from a page. I am assuming that you represent each document/page by a vector of words. Something like a document term-frequency matrix.
I guess the nearest neighbour of a page is ideally a page with similar contents, so you'd like to find documents where the relative frequency of each word is similar to the one you are searching for. First, normalize each row of the doc-term matrix, i.e. replace each occurrence count with its relative (percentage) frequency within the document.
Next you have to assign some distance between two documents represented by these vectors. You can use the normal Euclidean distance or Manhattan distance, but for text documents the similarity measure that usually works best is cosine similarity. Use whatever distance or similarity function suits your problem (remember that for nearest neighbours you want to minimize distance but maximize similarity).
Once you have the vectors and your distance function in place, run the Nearest neighbour or the K-Nearest neighbour algorithm.
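To make that concrete, here is a minimal, self-contained sketch of the approach (the class and method names are my own, not from any particular library): each page is a sparse vector of relative term frequencies, and neighbours are ranked by cosine similarity with a brute-force scan.

```java
import java.util.*;
import java.util.stream.Collectors;

public class KeywordKnn {

    // Turn a list of extracted keywords into a relative-frequency vector
    // (this is the per-row normalization mentioned above).
    static Map<String, Double> toVector(List<String> keywords) {
        Map<String, Double> v = new HashMap<>();
        for (String k : keywords) {
            v.merge(k, 1.0, Double::sum);
        }
        double total = keywords.size();
        v.replaceAll((keyword, count) -> count / total);
        return v;
    }

    // Cosine similarity between two sparse keyword vectors.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double bv = b.get(e.getKey());
            if (bv != null) dot += e.getValue() * bv;
        }
        double normA = Math.sqrt(a.values().stream().mapToDouble(x -> x * x).sum());
        double normB = Math.sqrt(b.values().stream().mapToDouble(x -> x * x).sum());
        return (normA == 0 || normB == 0) ? 0.0 : dot / (normA * normB);
    }

    // Brute-force k-nearest neighbours: rank all pages by similarity (descending)
    // to the query page and keep the top k.
    static List<String> kNearest(Map<String, Map<String, Double>> pages,
                                 Map<String, Double> query, int k) {
        return pages.entrySet().stream()
                .sorted((p1, p2) -> Double.compare(cosine(query, p2.getValue()),
                                                   cosine(query, p1.getValue())))
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

For a few thousand pages the brute-force scan is usually "good enough"; only at much larger scales do you need an index structure or an approximate method.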
My goal is to find a similarity value between two documents (collections of words). I have already found several answers like this SO post or this SO post which provide Python libraries that achieve this, but I have trouble understanding the approach and making it work for my use case.
If I understand correctly, TF-IDF of a document is computed with respect to a given term, right? That's how I interpret it from the Wikipedia article on this: "tf-idf...is a numerical statistic that is intended to reflect how important a word is to a document".
In my case, I don't have a specific search term which I want to compare to the document, but I have two different documents. I assume I need to first compute vectors for the documents, and then take the cosine between these vectors. But all the answers I found with respect to constructing these vectors always assume a search term, which I don't have in my case.
I can't wrap my head around this; any conceptual help or links to Java libraries that achieve this would be highly appreciated.
I suggest running terminology extraction first and recording the term frequencies. Note that stemming can also be applied to the extracted terms to avoid noise during the subsequent cosine similarity calculation. See the "Java library for keywords extraction from input text" SO thread for more help and ideas on that.
Then, as you yourself mention, for each of those terms, you will have to compute the TF-IDF values, get the vectors and compute the cosine similarity.
When calculating TF-IDF, note that the 1 + log(N/n) formula (N being the total number of documents in the corpus and n the number of documents that contain the term) is preferable, since it avoids the case where TF is non-zero but IDF comes out as 0 (which happens when a term appears in every document).
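A minimal sketch of that pipeline, assuming you already have each document as a list of (possibly stemmed) terms; the class and method names are illustrative rather than from a specific library, and the IDF uses the 1 + log(N/n) variant mentioned above:

```java
import java.util.*;

public class TfIdfSimilarity {

    // idf(term) = 1 + log(N / n), where N = number of documents in the corpus
    // and n = number of documents containing the term.
    static Map<String, Double> idf(List<List<String>> corpus) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> doc : corpus) {
            for (String term : new HashSet<>(doc)) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        Map<String, Double> idfValues = new HashMap<>();
        double n = corpus.size();
        docFreq.forEach((term, df) -> idfValues.put(term, 1.0 + Math.log(n / df)));
        return idfValues;
    }

    // TF-IDF vector for one document: raw term count times IDF.
    static Map<String, Double> tfIdf(List<String> doc, Map<String, Double> idf) {
        Map<String, Double> vec = new HashMap<>();
        for (String term : doc) {
            vec.merge(term, 1.0, Double::sum);
        }
        vec.replaceAll((term, tf) -> tf * idf.getOrDefault(term, 0.0));
        return vec;
    }

    // Cosine similarity between two sparse TF-IDF vectors.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            na += e.getValue() * e.getValue();
            Double bv = b.get(e.getKey());
            if (bv != null) dot += e.getValue() * bv;
        }
        for (double v : b.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}
```

Note that IDF needs a corpus of more than one document: if you literally only have the two documents, you can either treat them as a two-document corpus or drop the IDF factor and compare plain term-frequency vectors.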
I have a set of CSV files that contain points of a 2D function; in other words, I have four CSV files, each the result of evaluating a function f(x, y) at a different y value. I need to interpolate between these data so that I can calculate f for an arbitrary x and y. The CSV files have varying lengths and x-values. Does anyone know of a library or algorithm in Java for this task? Linear interpolation is OK, as is spline interpolation.
Thanks,
taktoa
OK, first of all, the "CSV" bit is irrelevant: let's assume you have read those files into memory and merged them together (they're values of the same function, right?). Now you have a single set of f(x,y) values for different (x,y) pairs and would like to interpolate between those. Fine so far?
If you stick to linear interpolation, there's still the question of how many points to take into account, which will depend on the level of noise in the measurements. In the simplest case one would use just the three nearest points to identify the plane they lie in and use that to find the value for the point in question. This option requires neither libraries nor algorithms, apart from vector addition, subtraction, cross product and dot product.
More sophisticated solutions would generally require some sort of fitting, e.g. (weighted) least squares.
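As an illustration of the three-nearest-points variant described above, here is a rough sketch (the names are mine): it finds the three nearest samples, derives the plane through them via a cross product, and evaluates that plane at the query point.

```java
import java.util.*;

public class PlanarInterpolation {

    public record Sample(double x, double y, double f) {}

    // Interpolate f at (qx, qy) from the plane through the three nearest samples.
    // Assumes samples.size() >= 3.
    static double interpolate(List<Sample> samples, double qx, double qy) {
        // 1. Pick the three samples closest to the query point.
        List<Sample> nearest = samples.stream()
                .sorted(Comparator.comparingDouble((Sample s) ->
                        (s.x() - qx) * (s.x() - qx) + (s.y() - qy) * (s.y() - qy)))
                .limit(3)
                .toList();
        Sample p1 = nearest.get(0), p2 = nearest.get(1), p3 = nearest.get(2);

        // 2. Two edge vectors of the triangle in (x, y, f) space.
        double ux = p2.x() - p1.x(), uy = p2.y() - p1.y(), uf = p2.f() - p1.f();
        double vx = p3.x() - p1.x(), vy = p3.y() - p1.y(), vf = p3.f() - p1.f();

        // 3. Cross product gives the plane normal (nx, ny, nf).
        double nx = uy * vf - uf * vy;
        double ny = uf * vx - ux * vf;
        double nf = ux * vy - uy * vx;

        if (Math.abs(nf) < 1e-12) {
            // Degenerate (collinear) points: fall back to the nearest sample.
            return p1.f();
        }
        // 4. Solve nx*(x-x1) + ny*(y-y1) + nf*(f-f1) = 0 for f.
        return p1.f() - (nx * (qx - p1.x()) + ny * (qy - p1.y())) / nf;
    }
}
```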
The simplest approach is to find the closest points and use linear interpolation, e.g. choose the two or three closest points and interpolate between them.
Or you can take a weighted average based on distance. Or you can pick a close point and then find points on the "other side" of the closest point to improve the interpolation.
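A quick sketch of the distance-weighted average idea (inverse-distance weighting; the helper names are mine):

```java
import java.util.List;

public class InverseDistanceWeighting {

    // f(q) = sum(w_i * f_i) / sum(w_i), with w_i = 1 / distance^power.
    static double interpolate(List<double[]> samples, double qx, double qy, double power) {
        double weightedSum = 0.0, weightTotal = 0.0;
        for (double[] s : samples) {          // s = {x, y, f}
            double dx = s[0] - qx, dy = s[1] - qy;
            double dist = Math.sqrt(dx * dx + dy * dy);
            if (dist < 1e-12) {
                return s[2];                  // query coincides with a sample point
            }
            double w = 1.0 / Math.pow(dist, power);
            weightedSum += w * s[2];
            weightTotal += w;
        }
        return weightedSum / weightTotal;
    }
}
```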
Lagrange interpolation would be simple and accurate.
I am trying to build a KD-tree (an independent one per node) for image features. I have extracted the image features; each feature vector contains, say, 1000 float values.
I am using map-reduce to distribute the images among the nodes of the cluster according to classification (e.g. cat, dog, guns), i.e. each node will contain a bunch of similar images, and then to build a KD-tree of the images on each node. I am confused about how the tree can be built.
So how can I build the KD-tree using map-reduce? Each node will contain its own tree, right? What could be the logic for distributing the images? While building the KD-tree, on what basis should I add image-feature vectors to the tree (i.e. to the left or right child)?
Any help is appreciated. Thanks in advance.
I don't think that a k-d-tree is the right thing for your data. Here's what Wikipedia says about it:
k-d trees are not suitable for efficiently finding the nearest neighbour in high dimensional spaces. As a general rule, if the dimensionality is k, the number of points in the data, N, should be N >> 2^k. Otherwise, when k-d trees are used with high-dimensional data, most of the points in the tree will be evaluated and the efficiency is no better than exhaustive search, and approximate nearest-neighbour methods should be used instead.
Your feature vectors have dimensionality 1000, which means that you should have around 10^300 images, which is quite unlikely.
I suggest that you look at locality-sensitive hashing, one of the approximate nearest-neighbour methods (mentioned in the quote above) for high-dimensional data.
Since Wikipedia is not always the best place to learn something complicated, I suggest you take a look at the respective lecture slides of the Data Mining course of ETH Zurich instead. It just so happens that I am taking this course in the current semester.
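If it helps, here is a rough, self-contained sketch of the random-hyperplane flavour of LSH for cosine similarity; this is a generic illustration (all names are mine), not code from those slides or any particular library.

```java
import java.util.*;

public class RandomHyperplaneLsh {

    private final int numBits;          // bits per hash key (keep <= 31 for an int key)
    private final double[][] hyperplanes;
    private final Map<Integer, List<float[]>> buckets = new HashMap<>();

    public RandomHyperplaneLsh(int dimensions, int numBits, long seed) {
        this.numBits = numBits;
        this.hyperplanes = new double[numBits][dimensions];
        Random rnd = new Random(seed);
        for (double[] h : hyperplanes) {
            for (int d = 0; d < dimensions; d++) {
                h[d] = rnd.nextGaussian();  // random hyperplane normal
            }
        }
    }

    // Each bit records which side of a random hyperplane the vector falls on.
    private int hash(float[] v) {
        int key = 0;
        for (int b = 0; b < numBits; b++) {
            double dot = 0;
            for (int d = 0; d < v.length; d++) dot += hyperplanes[b][d] * v[d];
            if (dot >= 0) key |= (1 << b);
        }
        return key;
    }

    public void add(float[] featureVector) {
        buckets.computeIfAbsent(hash(featureVector), k -> new ArrayList<>()).add(featureVector);
    }

    // Approximate nearest neighbours: only vectors in the same bucket are candidates.
    public List<float[]> candidates(float[] query) {
        return buckets.getOrDefault(hash(query), Collections.emptyList());
    }
}
```

In practice you would build several such tables with independent seeds and merge their candidate lists, then rank the candidates by exact cosine (or other) similarity.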
I am sorry if my question sounds stupid :)
Can you please recommend some pseudocode or a good algorithm for an LSI implementation in Java?
I am not a math expert. I tried to read some articles about LSI (latent semantic indexing) on Wikipedia and other websites, but they were full of math. I know LSI is full of math, but if I see some source code or an algorithm, I understand things more easily. That's why I asked here, because so many gurus are here!
Thanks in advance
The idea of LSA is based on one assumption: the more often two words occur in the same documents, the more similar they are. Indeed, we can expect that the words "programming" and "algorithm" will occur in the same documents much more often than, say, "programming" and "dog-breeding".
The same goes for documents: the more common/similar words two documents have, the more similar the documents themselves are. So you can express the similarity of documents through the frequencies of their words, and vice versa.
Knowing this, we can construct a co-occurrence matrix, where the columns represent documents, the rows represent words, and each cell[i][j] holds the frequency of word words[i] in document documents[j]. Frequency may be computed in many ways; IIRC, the original LSA uses the tf-idf index.
Having such a matrix, you can find the similarity of two documents by comparing the corresponding columns. How to compare them? Again, there are several ways. The most popular is cosine distance. You may remember from school maths that a matrix can be treated as a bunch of vectors, so each column is just a vector in some multidimensional space. That's why this model is called the "Vector Space Model". More on VSM and cosine distance here.
But we have one problem with such a matrix: it is big. Very, very big. Working with it is too computationally expensive, so we have to reduce it somehow. LSA uses the SVD technique to keep only the most "important" vectors. After reduction, the matrix is ready to use.
So, the algorithm for LSA will look something like this:
Collect all documents and all unique words from them.
Extract frequency information and build co-occurrence matrix.
Reduce matrix with SVD.
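As a rough illustration of steps 1 and 2 in plain Java (no libraries; the names are mine, and the weighting here is raw counts rather than tf-idf):

```java
import java.util.*;

public class CooccurrenceMatrix {

    // Builds a words x documents count matrix from already-tokenized documents.
    static double[][] build(List<List<String>> documents, List<String> vocabularyOut) {
        // Step 1: collect all unique words across all documents.
        Set<String> vocab = new TreeSet<>();
        for (List<String> doc : documents) vocab.addAll(doc);
        vocabularyOut.addAll(vocab);

        // Step 2: fill the matrix with occurrence counts.
        double[][] matrix = new double[vocab.size()][documents.size()];
        Map<String, Integer> rowOf = new HashMap<>();
        int r = 0;
        for (String w : vocab) rowOf.put(w, r++);

        for (int col = 0; col < documents.size(); col++) {
            for (String word : documents.get(col)) {
                matrix[rowOf.get(word)][col] += 1.0;
            }
        }
        return matrix;   // step 3 (SVD reduction) is then applied to this matrix
    }
}
```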
If you're going to write an LSA library yourself, a good place to start is the Lucene search engine, which will make steps 1 and 2 much easier, together with some implementation of high-dimensional matrices with SVD capability, such as Parallel Colt or UJMP.
Also pay attention to other techniques that grew out of LSA, like Random Indexing. RI uses the same idea and shows approximately the same results, but doesn't build the full matrix and is completely incremental, which makes it much more computationally efficient.
This may be a bit late, but I always liked Sujit Pal's blog http://sujitpal.blogspot.com/2008/09/ir-math-with-java-tf-idf-and-lsi.html and I have written a bit on my site if you are interested.
The process is way less complicated than it is often written up as. Really, all you need is a library that can do singular value decomposition of a matrix.
If you are interested, I can explain it in a couple of short takeaway bits:
1) You create a matrix/dataset/etc. with the word counts of the various documents - the documents will be your columns and the distinct words your rows.
2) Once you've created the matrix, you use a library like Jama (for Java) or SmartMathLibrary (for C#) and run the singular value decomposition. All this does is take your original matrix and break it up into three different parts/matrices that essentially represent your documents, your words, and a kind of multiplier (sigma); these are called the vectors.
3) Once you have the word, document, and sigma vectors, you shrink them equally (to some k) by just copying smaller parts of each vector/matrix and then multiplying them back together. Shrinking them kind of normalizes your data, and this is LSI.
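Here is a rough sketch of steps 2 and 3 using Jama; the rank k and the matrix layout are my assumptions, and note that Jama's SVD expects at least as many rows as columns, which is usually the case for a terms x documents matrix.

```java
import Jama.Matrix;
import Jama.SingularValueDecomposition;

public class LsiReduction {

    // counts: terms x documents matrix (rows = words, columns = documents).
    static Matrix reduce(double[][] counts, int k) {
        Matrix a = new Matrix(counts);
        SingularValueDecomposition svd = a.svd();   // A = U * S * V^T

        int m = a.getRowDimension();
        int n = a.getColumnDimension();

        // Keep only the first k singular values/vectors ("shrinking" the three parts).
        Matrix uk = svd.getU().getMatrix(0, m - 1, 0, k - 1);   // m x k
        Matrix sk = svd.getS().getMatrix(0, k - 1, 0, k - 1);   // k x k
        Matrix vk = svd.getV().getMatrix(0, n - 1, 0, k - 1);   // n x k

        // Multiply back together: the rank-k approximation used for LSI comparisons.
        return uk.times(sk).times(vk.transpose());              // m x n
    }
}
```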
here are some fairly clear resources:
http://puffinwarellc.com/index.php/news-and-articles/articles/30-singular-value-decomposition-tutorial.html
http://lsa.colorado.edu/papers/JASIS.lsi.90.pdf
http://www.soe.ucsc.edu/classes/cmps290c/Spring07/proj/Flynn_talk.pdf
Hope this helps you out a bit.
Eric
Do you know where I can find a high-level explanation of the Lucene Similarity class algorithm? I would like to understand it without having to decipher all the math and terms involved in searching and indexing.
Lucene's built-in Similarity is a fairly standard "Inverse Document Frequency" scoring algorithm. The Wikipedia article is brief, but covers the basics. The book Lucene in Action breaks down the Lucene formula in more detail; it doesn't mirror the current Lucene formula perfectly, but all of the main concepts are explained.
Primarily, the score varies with the number of times a term occurs in the current document (the term frequency), and inversely with the number of times the term occurs across all documents (the document frequency). The other factors in the formula are secondary, adjusting the score in an attempt to make scores from different queries fairly comparable to each other.
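For a rough feel of how those pieces combine, here is a simplified sketch of the classic (pre-BM25) TF-IDF scoring in Java; it ignores boosts, query normalization and the lossy norm encoding, so treat it as an approximation of the idea rather than Lucene's actual code.

```java
import java.util.Map;

public class ClassicScoringSketch {

    // score(q, d) ~ coord(q, d) * sum over query terms of tf(t, d) * idf(t)^2 * lengthNorm(d)
    static double score(Map<String, Integer> docTermFreqs,   // term -> occurrences in this doc
                        int docLength,                       // number of terms in this doc
                        Map<String, Integer> docFreqs,       // term -> number of docs containing it
                        int numDocs,
                        String[] queryTerms) {
        int matching = 0;
        double sum = 0.0;
        for (String term : queryTerms) {
            int freq = docTermFreqs.getOrDefault(term, 0);
            if (freq == 0) continue;
            matching++;
            double tf = Math.sqrt(freq);                               // more occurrences -> higher score
            double idf = 1.0 + Math.log((double) numDocs
                    / (docFreqs.getOrDefault(term, 0) + 1));           // rarer terms -> higher score
            double lengthNorm = 1.0 / Math.sqrt(docLength);            // long documents are damped
            sum += tf * idf * idf * lengthNorm;
        }
        double coord = queryTerms.length == 0 ? 0 : (double) matching / queryTerms.length;
        return coord * sum;                                            // reward matching more query terms
    }
}
```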
Think of each document and search term as a vector whose coordinates represent some measure of how important each word in the entire corpus of documents is to that particular document or search term. Similarity tells you the distance between two different vectors.
Say your corpus is normalized to ignore some terms; then a document consisting only of those terms would be located at the origin of a graph of all of your documents in the vector space defined by your corpus. Each document that contains some other terms then represents a point in the space whose coordinates are defined by the importance of those terms in the document relative to their importance in the corpus. Two documents (or a document and a search) whose coordinates put their "points" closer together are more similar than those whose coordinates put their "points" further apart.
As erickson mentioned, Lucene uses cosine similarity over Term Frequency-Inverse Document Frequency (TF-IDF) weights. Imagine you have two bags of terms, one for the query and one for the document. This measure only matches terms exactly and then weights them by their importance in context. Terms that occur very frequently get a smaller weight (importance), because you can find them in lots of documents. But the serious problem I see is that cosine similarity over TF-IDF is not very robust on more inconsistent data, where you need to compute the similarity between the query and the document more robustly, e.g. in the presence of misspellings, typographical errors and phonetic errors, because the words must match exactly.