Need help in latent semantic indexing - java

I am sorry if my question sounds stupid :)
Can you please recommend any pseudocode or a good algorithm for an LSI implementation in Java?
I am not a math expert. I tried to read some articles on Wikipedia and other websites about
LSI (latent semantic indexing), but they were full of math.
I know LSI is full of math, but if I see some source code or an algorithm, I understand things more
easily. That's why I asked here, because so many gurus are here!
Thanks in advance

The idea of LSA is based on one assumption: the more often two words occur in the same documents, the more similar they are. Indeed, we can expect that the words "programming" and "algorithm" will occur in the same documents much more often than, say, "programming" and "dog-breeding".
The same holds for documents: the more common/similar words two documents share, the more similar the documents themselves are. So you can express the similarity of documents by the frequencies of their words, and vice versa.
Knowing this, we can construct a co-occurrence matrix, where columns represent documents, rows represent words, and each cell cells[i][j] holds the frequency of word words[i] in document documents[j]. Frequency may be computed in many ways; IIRC, the original LSA uses the tf-idf weighting.
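For concreteness, here is a minimal sketch of that construction in Java (my own illustration, assuming the documents are already tokenized; raw counts are used instead of tf-idf to keep it short):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class TermDocumentMatrix {

        // rows = words, columns = documents, cells[i][j] = count of words[i] in documents[j]
        public static double[][] build(List<List<String>> documents, List<String> words) {
            double[][] cells = new double[words.size()][documents.size()];
            Map<String, Integer> rowIndex = new HashMap<>();
            for (int i = 0; i < words.size(); i++) {
                rowIndex.put(words.get(i), i);
            }
            for (int j = 0; j < documents.size(); j++) {
                for (String word : documents.get(j)) {
                    Integer i = rowIndex.get(word);
                    if (i != null) {
                        cells[i][j] += 1.0;   // swap in a tf-idf weight for "real" LSA
                    }
                }
            }
            return cells;
        }
    }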
Having such a matrix, you can find the similarity of two documents by comparing the corresponding columns. How to compare them? Again, there are several ways. The most popular is the cosine distance. You may remember from school maths that a matrix may be treated as a bunch of vectors, so each column is just a vector in some multidimensional space. That's why this model is called the "Vector Space Model". More on VSM and the cosine distance here.
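For example, a cosine similarity helper (again just an illustration) that takes two document columns of the matrix above as plain arrays:

    public class Cosine {

        // Returns a value in [0, 1] for non-negative frequency vectors:
        // 1.0 = same direction (very similar), 0.0 = no terms in common.
        public static double similarity(double[] a, double[] b) {
            double dot = 0.0, normA = 0.0, normB = 0.0;
            for (int i = 0; i < a.length; i++) {
                dot   += a[i] * b[i];
                normA += a[i] * a[i];
                normB += b[i] * b[i];
            }
            if (normA == 0.0 || normB == 0.0) {
                return 0.0; // one of the documents contains no known terms
            }
            return dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }
    }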
But we have one problem with such a matrix: it is big. Very, very big. Working with it is too computationally expensive, so we have to reduce it somehow. LSA uses the SVD technique to keep only the most "important" vectors. After the reduction, the matrix is ready to use.
So, the algorithm for LSA will look something like this:
Collect all documents and all unique words from them.
Extract the frequency information and build the co-occurrence matrix.
Reduce the matrix with SVD.
If you're going to write an LSA library by yourself, a good place to start is the Lucene search engine, which will make steps 1 and 2 much easier, together with some implementation of high-dimensional matrices with SVD capability, such as Parallel Colt or UJMP.
Also pay attention to other techniques that grew out of LSA, like Random Indexing. RI uses the same idea and shows approximately the same results, but doesn't build the full matrix and is completely incremental, which makes it much more computationally efficient.

This may be a bit late, but I always liked Sujit Pal's blog http://sujitpal.blogspot.com/2008/09/ir-math-with-java-tf-idf-and-lsi.html and I have written a bit on my site if you are interested.
The process is way less complicated than it is often written up to be. Really, all you need is a library that can do singular value decomposition of a matrix.
If you are interested, I can explain it in a couple of short take-away bits:
1) You create a matrix/dataset/etc. with word counts of the various documents - the different documents will be your columns and the rows the distinct words.
2) Once you've created the matrix, you use a library like Jama (for Java) or SmartMathLibrary (for C#) and run the singular value decomposition. All this does is take your original matrix and break it up into three different parts/matrices that essentially represent your documents, your words, and a kind of multiplier (sigma); these are called the vectors.
3) Once you have your word, document, and sigma vectors, you shrink them equally (to some rank k) by just copying smaller parts of each vector/matrix and then multiplying them back together. Shrinking them kind of normalizes your data, and this is LSI.
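In Java, a rough sketch of steps 2 and 3 with Jama might look like the following (the class and method names around the Jama calls are my own; note that Jama's SVD expects at least as many rows as columns, which a words-by-documents matrix usually has):

    import Jama.Matrix;
    import Jama.SingularValueDecomposition;

    public class LsiSketch {

        // wordCounts: rows = distinct words, columns = documents (step 1 above)
        public static Matrix reduce(double[][] wordCounts, int k) {
            Matrix counts = new Matrix(wordCounts);
            SingularValueDecomposition svd = counts.svd();   // step 2

            int rows = counts.getRowDimension();
            int cols = counts.getColumnDimension();

            // Step 3: keep only the first k singular values/vectors ("shrink them equally").
            Matrix u = svd.getU().getMatrix(0, rows - 1, 0, k - 1);
            Matrix s = svd.getS().getMatrix(0, k - 1, 0, k - 1);
            Matrix v = svd.getV().getMatrix(0, cols - 1, 0, k - 1);

            // Multiply them back together: the rank-k approximation, i.e. the "LSI" matrix.
            return u.times(s).times(v.transpose());
        }
    }

The columns of the returned matrix are the smoothed document vectors, which you can again compare with cosine similarity.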
Here are some fairly clear resources:
http://puffinwarellc.com/index.php/news-and-articles/articles/30-singular-value-decomposition-tutorial.html
http://lsa.colorado.edu/papers/JASIS.lsi.90.pdf
http://www.soe.ucsc.edu/classes/cmps290c/Spring07/proj/Flynn_talk.pdf
Hope this helps you out a bit.
Eric

Related

Java: How to use TF-IDF to compute similarity of two documents?

My goal is to find a similarity value between two documents (collections of words). I have already found several answers, like this SO post or this SO post, which provide Python libraries that achieve this, but I have trouble understanding the approach and making it work for my use case.
If I understand correctly, TF-IDF of a document is computed with respect to a given term, right? That's how I interpret it from the Wikipedia article on this: "tf-idf...is a numerical statistic that is intended to reflect how important a word is to a document".
In my case, I don't have a specific search term which I want to compare to the document, but I have two different documents. I assume I need to first compute vectors for the documents, and then take the cosine between these vectors. But all the answers I found with respect to constructing these vectors always assume a search term, which I don't have in my case.
I can't wrap my head around this; any conceptual help or links to Java libraries that achieve this would be highly appreciated.
I suggest running terminology extraction first, together with term frequencies. Note that stemming can also be applied to the extracted terms to avoid noise during the subsequent cosine similarity calculation. See the "Java library for keywords extraction from input text" SO thread for more help and ideas on that.
Then, as you yourself mention, for each of those terms you will have to compute the TF-IDF values, get the vectors and compute the cosine similarity.
When calculating TF-IDF, mind that the idf = 1 + log(N/n) formula (N standing for the total number of documents in the corpus and n standing for the number of documents that include the term) is better, since it avoids the issue where TF is not 0 but IDF turns out to be 0 (which happens for a term that appears in every document).
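Putting those pieces together, here is a minimal illustration (my own naming, not any particular library) of building tf-idf vectors for two documents over their combined vocabulary with the 1 + log(N/n) variant and comparing them with cosine similarity:

    import java.util.Collections;
    import java.util.List;
    import java.util.Set;
    import java.util.TreeSet;

    public class TfIdfSimilarity {

        // docA, docB: the two documents as lists of (stemmed) terms.
        // corpus: all documents you have, used only for the IDF statistics.
        public static double similarity(List<String> docA, List<String> docB,
                                        List<List<String>> corpus) {
            Set<String> vocab = new TreeSet<>();
            vocab.addAll(docA);
            vocab.addAll(docB);

            double[] a = new double[vocab.size()];
            double[] b = new double[vocab.size()];
            int i = 0;
            for (String term : vocab) {
                double idf = 1.0 + Math.log((double) corpus.size() / docCount(term, corpus));
                a[i] = Collections.frequency(docA, term) * idf;  // tf * idf
                b[i] = Collections.frequency(docB, term) * idf;
                i++;
            }
            return cosine(a, b);
        }

        private static int docCount(String term, List<List<String>> corpus) {
            int n = 0;
            for (List<String> doc : corpus) {
                if (doc.contains(term)) n++;
            }
            return Math.max(n, 1); // every term of docA/docB occurs in at least one document
        }

        private static double cosine(double[] a, double[] b) {
            double dot = 0, na = 0, nb = 0;
            for (int i = 0; i < a.length; i++) {
                dot += a[i] * b[i];
                na  += a[i] * a[i];
                nb  += b[i] * b[i];
            }
            return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
        }
    }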

Comparing documents - document similarity

I am currently conducting a Java project in NLP/IR, and am fairly new to this.
The project consists of a collection of around 1000 documents, where each document has about 100 words, structured as a bag of words with term frequencies. I want to find similar documents based on a given document (from the collection).
My idea is to use TF-IDF: calculate tf-idf for the query (a given document) and every other document in the collection, then compare these values as vectors with cosine similarity. Could this give some insight into their similarity? Or would it not be reasonable because of the big query (a whole document)?
Are there any other similarity measures that could work better?
Thanks for the help
TF-IDF-based similarity, typically using a cosine to compare a vector representing the query terms to a set of vectors representing the TF-IDF values of the documents, is a common approach to calculating "similarity".
Mind that "similarity" is a very generic term. In the IR domain, you typically speak of "relevance" instead. Texts can be similar on many levels: being in the same language, using the same characters, using the same words, talking about the same people, using a similarly complex grammatical structure, and much more - consequently, there are many, many measures. Search the web for text similarity to find many publications, but also open-source frameworks and libraries that implement different measures.
Today, "semantic similarity" is attracting more interest than the traditional keyword-based IR models. If this is your area of interest, you might look into the results of the SemEval shared tasks years 2012-2015.
If all you want is to compare two documents using TF-IDF, you can do that. Since you mention that each doc contains 100 words, in the worst case there might be 1000*100 unique words. So I'm assuming your vectors are built over all unique words (since all documents should be represented in the same dimension). If the number of unique words is too high, you could try some dimensionality reduction technique (like PCA) to reduce the dimensions. But what you are trying to do is right; you can always compare documents like this to find the similarity between documents.
If you want similarity more in the sense of semantics, you should look at using LDA (topic modelling) type techniques.

What's the best way to classify a high dimensional int-vector with the weka API?

I have some high-dimensional (30000 dimensions) vectors of integer numbers. I have 2 classes: [YES, NO]. I have 6000 samples of the YES class and 50000 samples of the NO class. I would like to train a classifier to automatically classify new samples into one of these classes in the future.
I know how to use the Weka Java API, but I am not sure which algorithms in which order to use. Can anyone give me advice on the following questions:
Are the vectors too high dimensional or do I have too many samples to do this efficiently in Weka?
Should I reduce the dimensionality before I start? What algorithm can I use to identify significant elements of my feature vector?
What classifier would be best to classify this kind of data? I think a decision tree should work fine, but maybe naive Bayes is faster to train - is it?
Since every attribute must have a name in Weka, how can I assign a name to each of my 30000 features?
Any advice is appreciated. Thanks.
The number of dimensions in this problem is most certainly quite large, but I believe that Weka should be able to handle a large number of dimensions. The number of samples should not be a problem, but there are a lot more NO-class samples than YES-class samples, so balancing the two might help the classifier do better on the rarer YES-class cases.
If you believe that there are redundant dimensions, or that some of the dimensions may contain noise, then reducing the dimensionality beforehand would certainly help.
A decision tree shouldn't be too much of a problem. There are a number of algorithms available in Weka, but I wouldn't recommend Neural Networks given the dimensionality of the problem.
If you have saved the data in a CSV file, you can assign attribute names in the first row of the data. Given the number of dimensions, you would likely call these a1 to a30000, plus something like output for the class attribute.
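A rough sketch of that route with the Weka API (the file name is a placeholder; make sure the class column is nominal, e.g. YES/NO, rather than numeric):

    import java.io.File;
    import java.util.Random;

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.CSVLoader;

    public class TrainSketch {

        public static void main(String[] args) throws Exception {
            // First CSV row holds the attribute names: a1,a2,...,a30000,output
            CSVLoader loader = new CSVLoader();
            loader.setSource(new File("vectors.csv"));    // placeholder file name
            Instances data = loader.getDataSet();
            data.setClassIndex(data.numAttributes() - 1); // last column = "output"

            J48 tree = new J48();                         // decision tree
            tree.buildClassifier(data);

            // 10-fold cross-validation to get an idea of how well it generalizes.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }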
Hope this helps!

Understand if two different PDFs are the same research paper

I'm thinking to write a simple research paper manager.
The idea is to have a repository containing for each paper its metadata
paper_id -> [title, authors, journal, comments...]
Since it would be nice to have the possibility of importing a friend's paper dump, I'm thinking about how to generate the paper_id of a paper: IMHO it should be produced from the text of the PDF, to guarantee that two different collections have the same IDs only for the same papers.
At the moment, I extract the text of the first page using the iText library (removing any annotations), and I compute a simhash footprint from the text.
The main problem is that sometimes the text is slightly different (yes, it happens! for example this and this), so I would like to be tolerant.
With simhash I can compute how similar the original documents are, so in case the footprint is not in the repo, I'll have to iterate over the collection looking for 'near' footprints.
I'm not convinced by this method; could you suggest a better way to produce a signature (short, numerical or alphanumerical) for these kinds of documents?
UPDATE I had this idea: divide the first page into 8 (more or less) non-overlapping squares covering the whole page, then consider the text in each square and generate a simhash signature for it. In the end I'll have an 8x64 = 512-bit signature, and I can consider two papers the same if the sum of the differences between their simhash signature sets is under a certain threshold.
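For the comparison step, a small helper along those lines (assuming each signature is stored as eight 64-bit longs) could be:

    public class SimhashCompare {

        // Sum of per-block Hamming distances between two 8x64-bit signatures.
        public static int distance(long[] sigA, long[] sigB) {
            int total = 0;
            for (int i = 0; i < sigA.length; i++) {
                total += Long.bitCount(sigA[i] ^ sigB[i]);
            }
            return total;
        }

        // Consider two papers "the same" if the total difference is under the threshold.
        public static boolean samePaper(long[] sigA, long[] sigB, int threshold) {
            return distance(sigA, sigB) <= threshold;
        }
    }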
In case you actually have a function that takes two texts and returns a measure of their similarity, you do not have to iterate over the entire repository.
Given an article that is not in the repository, you can iterate over only the articles that have approximately the same length. For example, given an article that has 1000 characters, you would compare it to articles having between 950 and 1050 characters. For this you will need a data structure that maps ranges to articles, and you will have to fine-tune the size of the range. Range too large: too many items in each range. Range too small: higher potential of a miss.
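A sketch of such a structure in Java, using a TreeMap keyed by character count (the names and the +/-5% window are my own choices for illustration):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.TreeMap;

    public class LengthIndex {

        private final TreeMap<Integer, List<String>> byLength = new TreeMap<>();

        public void add(String paperId, int textLength) {
            byLength.computeIfAbsent(textLength, k -> new ArrayList<>()).add(paperId);
        }

        // Candidates within roughly +/-5% of the query length,
        // e.g. 950..1050 for an article of 1000 characters.
        public List<String> candidates(int textLength) {
            int delta = Math.max(1, textLength / 20);
            List<String> result = new ArrayList<>();
            for (List<String> bucket :
                    byLength.subMap(textLength - delta, true, textLength + delta, true).values()) {
                result.addAll(bucket);
            }
            return result;
        }
    }

Only the articles returned by candidates() need to be compared with the (expensive) similarity function.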
Of course this will fail on some edge cases. For example, if you have two documents where the second is simply the first copy-pasted twice, you would probably want them to be considered equal, but you will not even compare them since they are too far apart in length. There are methods to deal with that too, but you probably 'ain't gonna need it'.

Efficient 2D Radial Gravity Layout (Java)

I'm after an efficient 2D mapping algorithm, and I've tried a number of implementations, but they all seem lacking. I'm hoping the stackoverflow world can help out with some pointers to existing, tried-n-tested algorithms I could learn from.
My goal is to display articles based on the genre of writing; for the prototype, I am using Philosophy, Programming, Politics and Poetry, since those are the only four styles of writing I have.
Each article is weighted based on each category, and the home view will have each category as a header in each corner. The articles are then laid out in word-cloud-like format, with "artificial gravity" placing each item as-near-as-possible to its main category (or between its main categories), without overlapping.
Currently, I am using an inefficient algorithm which stores arrays of rectangles to perform hit-test-and-search every time an article is added to the view, (with A* search patterns to find empty space to fill). By approximating a single destination for all articles of the same weight, and by using a round-robin queue to pick off articles from each pool, I can achieve fresh results (arrays are sorted by weight, then timestamp), with positioning-by-relevance ("artificial gravity").
However, using A* to search blindly seems really wasteful, even with heuristics to make each article check the points closest to its target marks first. I need a more efficient way to iterate over a 2D space.
I'm wondering if a Linked-List approach might work better; rather than go searching blindly in all directions for empty space, I can just iterate through connected nodes to ask each one if it has either a) nearby free space, or b) other connected nodes to ask (and always ask the closest node first).
If there are any better algorithms available, or critiques on my methods, any and all help would surely be appreciated.
I am using gwt elemental + java in this gui, but any 2D mapping algorithm in any language will surely help.
[EDIT (request for more details)]: The main problem here is the amount of work each new addition performs; it produces noticeable glitches in the UI thread, especially when there is almost no space left, as I am searching many points in a given radius for enough free space to fit the article.
If I cut the algorithm off too soon, I get blank spots that could have been filled. If I let it run too long, the UI glitches pretty badly, and I'm sure users will hate it.
What is the fastest / most efficient way to store and modify collections of 2D space?
You haven't provided enough information to say what would make an algorithm "better." Faster? Produces layouts that are "nicer" by some metric for quality? Able to handle bigger data sources?
There is certainly nothing wrong with arrays, nor with A*. If they are giving acceptable results for the size of problem you are trying to solve, how can they be "wasteful"? Linked data structures are worthwhile only if they reduce the cost of frequently needed operations.
If you sharpen the problem, you're more likely to get a useful answer.
At any rate, there is an enormous literature on "graph layout" and "graph drawing." Try searching on these terms. If you can represent your desired layout as a collection of nodes and edges, these might apply. Many are based on simulated spring systems, which seems akin to what you are doing.
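As a very rough illustration of the spring-system idea (not your algorithm; the names and constants are made up, and a real implementation needs an overlap-resolution pass and tuning):

    import java.util.List;

    public class GravityStep {

        public static class Node {
            double x, y;              // current position of an article
            double targetX, targetY;  // weighted centroid of its categories' corners
        }

        // One relaxation step: pull each article toward its category target
        // ("artificial gravity") and push nearby articles apart instead of
        // hit-testing rectangles. Repeat until the layout settles.
        public static void step(List<Node> nodes, double attraction, double repulsion) {
            for (Node a : nodes) {
                a.x += (a.targetX - a.x) * attraction;
                a.y += (a.targetY - a.y) * attraction;

                for (Node b : nodes) {
                    if (a == b) continue;
                    double dx = a.x - b.x, dy = a.y - b.y;
                    double distSq = dx * dx + dy * dy + 0.0001; // avoid division by zero
                    a.x += repulsion * dx / distSq;
                    a.y += repulsion * dy / distSq;
                }
            }
        }
    }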
