Java - inverted index

I am trying to implement a program in Java that should be able to take in a list of documents, say 3 for example; then, using some single-term queries, I should be able to get a result of how many times the word appears in the documents.
The result should be returned as tuples, e.g. [doc 1, doc 2]. It should be implemented as an inverted index that runs in memory.
For example if i have:
Doc 1 : "the fish in the water"
Doc 2 : "The fish is named billy"
Doc 3 : "the fish is swimming"
Searching for "water" gives the result: [Doc 1]
Searching for "fish" should give: [Doc 1, Doc 2, Doc 3]
I am trying to split the problem into smaller segments so it's easier for me to focus on how to actually implement it. I was thinking something like this:
1) Start with indexing the documents somehow
2) Support single term searches
3) Return a list of matching documents sorted by TF-IDF
If we start with point 1, how should I start tackling this problem?

Create a Map<String, Long> for each document that contains all the words in the document and the number of occurrences (search on SO - this has been addressed many times). Using String::split can help extract individual words. You may want to store the words in lower case for easier searching (note that this doesn't work well in certain languages such as Turkish).
You can then use Map::get to find the number of occurrences of a word in each document.
Output the result.
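A minimal sketch of that approach, reusing the example documents from the question (the class and method names are just illustrative):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCounts {

    // Build a word -> occurrence-count map for a single document's text.
    static Map<String, Long> countWords(String text) {
        Map<String, Long> counts = new HashMap<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            counts.merge(word, 1L, Long::sum);  // insert 1 or add 1 to the existing count
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
                "the fish in the water",
                "The fish is named billy",
                "the fish is swimming");

        String query = "fish";
        for (int i = 0; i < docs.size(); i++) {
            Long count = countWords(docs.get(i)).get(query);  // Map::get returns null if the word is absent
            if (count != null) {
                System.out.println("doc " + (i + 1) + ": " + count);
            }
        }
    }
}

Searching for "fish" then prints one line per matching document, which covers points 1 and 2; TF-IDF ranking can be layered on top later.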

I think Assylias's solution is the best, but I would suggest using Lucene, which does exactly what you are trying to achieve.

What about something like this example:
String keyword = "fish";
List<Document> results = new ArrayList<>();
for (Document doc : documents) {
    if (doc.getTextContent().contains(keyword)) {
        results.add(doc);
    }
}
System.out.println(results);

Why do you need to compute TF-IDF weights?
If you're just returning docs that match a word, you're doing boolean retrieval, which doesn't require you to compute any TF-IDF. You would need TF-IDF if you were doing probabilistic retrieval and computing scores, etc.

Related

Inverted indexing

I'm working on inverted indexing and my question is: in the final step, should we return the total number of documents the word appeared in, or just each document number?
For example:
if the word "Hello" appeared in 3 documents (document A, document B and document C), should I return 3 or A, B, C?
An index implies it will give you a lookup to something, not just a number. A frequency count would give you a count of the number of occurrences of a word.
BTW, you can get the number from A, B, C, but not the other way around.
That's totally up to you !
If you just need to return the total number of documents a certain word appears in, then you won't even need an inverted index. All you would need is a mapping from words to counts. That would take much less computation and space than an inverted index.
If you're working on an exercise in Information Retrieval (or doing some proof of concept, etc.), it seems to me that you would also need to return the docs where a given word was found; that's boolean retrieval.
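A small sketch of the two options side by side (the document ids and contents are made up for illustration):

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CountsVsIndex {
    public static void main(String[] args) {
        Map<String, List<String>> docs = Map.of(
                "A", List.of("hello", "world"),
                "B", List.of("hello", "again"),
                "C", List.of("hello"));

        // Option 1: word -> number of documents containing it (no inverted index needed)
        Map<String, Integer> docCounts = new HashMap<>();
        // Option 2: word -> set of document ids, i.e. a real inverted index (boolean retrieval)
        Map<String, Set<String>> invertedIndex = new HashMap<>();

        for (Map.Entry<String, List<String>> doc : docs.entrySet()) {
            for (String word : new HashSet<>(doc.getValue())) {  // count each document at most once per word
                docCounts.merge(word, 1, Integer::sum);
                invertedIndex.computeIfAbsent(word, w -> new HashSet<>()).add(doc.getKey());
            }
        }

        System.out.println(docCounts.get("hello"));      // 3
        System.out.println(invertedIndex.get("hello"));  // A, B and C, in no particular order
    }
}

The count can always be derived from the inverted index (it is just the size of the set), but not the other way around, which matches the point made above.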

Building an inverted index in Java-logic

I have a collection of around 1500 documents. I parsed through each document and extracted tokens. These tokens are stored in a HashMap (as keys) and the total number of times they occur in the collection (i.e. frequency) is stored as the value.
I have to extend this to build an inverted index. That is, the term (key) | number of documents it occurs in --> DocNo | frequency in that document. For example:
Term      DocFreq   DocNum   TermFreq
data      3         1        12
                    23       31
                    100      17
customer  2         22       43
                    19       2
Currently, I have the following in Java:
HashMap<String, Integer> map;

for (each document) {
    extract line
    for (each line) {
        extract word
        for (each word) {
            perform some operations
            get value for word from hashmap and increment by one
        }
    }
}
I have to build on this code. I can't really think of a good way to implement an inverted index.
So far, I thought of making the value a 2D array, so the term would be the key and the value (i.e. a 2D array) would store the docId and termFreq.
Please let me know if my logic is correct.
I would do it by using a Map<String, TermFrequencies>. This map would maintain a TermFrequencies object for each term found. The TermFrequencies object would have the following methods:
void addOccurrence(String documentId);
int getTotalNumberOfOccurrences();
Set<String> getDocumentIds();
int getNumberOfOccurrencesInDocument(String documentId);
It would use a Map<String, Integer> internally to associate each document the term occurs in with the number of occurrences of the term in the document.
The algorithm would be extremely simple:
for (each document) {
    extract line
    for (each line) {
        extract word
        for (each word) {
            TermFrequencies termFrequencies = map.get(word);
            if (termFrequencies == null) {
                termFrequencies = new TermFrequencies(word);
                map.put(word, termFrequencies);
            }
            termFrequencies.addOccurrence(document);
        }
    }
}
The addOccurrence() method would simply increment a counter for the total number of occurrences, and would insert or update the number of occurrences in the internal map.
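A minimal sketch of what such a TermFrequencies class could look like (the method names come from the answer above; the internals are just one possible implementation):

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

class TermFrequencies {
    private final String term;
    private int totalOccurrences;
    // document id -> number of occurrences of the term in that document
    private final Map<String, Integer> occurrencesPerDocument = new HashMap<>();

    TermFrequencies(String term) {
        this.term = term;
    }

    void addOccurrence(String documentId) {
        totalOccurrences++;
        occurrencesPerDocument.merge(documentId, 1, Integer::sum);
    }

    int getTotalNumberOfOccurrences() {
        return totalOccurrences;
    }

    Set<String> getDocumentIds() {
        return occurrencesPerDocument.keySet();
    }

    int getNumberOfOccurrencesInDocument(String documentId) {
        return occurrencesPerDocument.getOrDefault(documentId, 0);
    }
}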
I think it is best to have two structures: a Map<docnum, Map<term,termFreq>> and a Map<term, Set<docnum>>. Your docFreqs can be read off as set.size in the values of the second map. This solution involves no custom classes and allows a quick retrieval of everything needed.
The first map contains all the information and the second one is a derivative that allows quick lookup by term. As you process a document, you fill the first map. You can derive the second map afterwards, but it is also easy to do it in one pass.
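A rough sketch of filling both maps in one pass (the names are illustrative):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class TwoMapIndex {
    // doc number -> (term -> frequency of that term in the document)
    static Map<Integer, Map<String, Integer>> termFreqsByDoc = new HashMap<>();
    // term -> set of doc numbers it occurs in; docFreq is simply the set's size
    static Map<String, Set<Integer>> docsByTerm = new HashMap<>();

    static void addTerm(int docNum, String term) {
        termFreqsByDoc.computeIfAbsent(docNum, d -> new HashMap<>())
                      .merge(term, 1, Integer::sum);
        docsByTerm.computeIfAbsent(term, t -> new HashSet<>()).add(docNum);
    }

    public static void main(String[] args) {
        addTerm(1, "data");
        addTerm(1, "data");
        addTerm(23, "data");

        System.out.println(termFreqsByDoc.get(1).get("data"));  // term frequency in doc 1: 2
        System.out.println(docsByTerm.get("data").size());      // document frequency: 2
    }
}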
I once implemented what you're asking for. The problem with your approach is that it is not abstract enough. You should model Terms, Documents and their relationships using objects. In a first run, create the term index and document objects and iterate over all terms in the documents while populating the term index. Afterwards, you have a representation in memory that you can easily transform into the desired output.
Do not start by thinking about 2D arrays in an object-oriented language. Unless you want to solve a mathematical problem or optimize something, it's not the right approach most of the time.
I don't know if this is still a hot question, but I would recommend doing it like this:
You run over all your documents and give them an id in increasing order. For each document you run over all the words.
Now you have a HashMap that maps Strings (your words) to an array of DocTermObjects. A DocTermObject contains a docId and a TermFrequency.
Now for each word in a document, you look it up in your HashMap; if it doesn't contain an array of DocTermObjects, you create one. Otherwise you look at its very LAST element only (this is important for runtime, think about it). If this element has the docId that you are treating at the moment, you increase the TermFrequency. Otherwise, or if the array is empty, you add a new DocTermObject with your current docId and set the TermFrequency to 1.
Later you can use this data structure to compute scores, for example. The scores you could also save in the DocTermObjects, of course.
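A sketch of that idea, assuming documents are processed in increasing docId order (DocTermObject and the other names here are just illustrative):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DocTermObject {
    int docId;
    int termFrequency;

    DocTermObject(int docId, int termFrequency) {
        this.docId = docId;
        this.termFrequency = termFrequency;
    }
}

public class PostingsIndex {
    static Map<String, List<DocTermObject>> index = new HashMap<>();

    // Called once per word occurrence; documents must be fed in increasing docId order.
    static void addOccurrence(String word, int docId) {
        List<DocTermObject> postings = index.computeIfAbsent(word, w -> new ArrayList<>());
        if (!postings.isEmpty() && postings.get(postings.size() - 1).docId == docId) {
            // same document as the last posting: just bump its term frequency
            postings.get(postings.size() - 1).termFrequency++;
        } else {
            // first time this word is seen in this document: append a new posting
            postings.add(new DocTermObject(docId, 1));
        }
    }
}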
Hope it helped :)

How to get synonyms ordered by their occurrence probability from Wordnet

I am searching in Wordnet for synonyms for a big list of words. The way I have done it, when a word has more than one synonym, the results are returned in alphabetical order. What I need is to have them ordered by their probability of occurrence, and I would take just the top 1 synonym.
I have used the prolog wordnet database and Syns2Index to convert it into Lucene type index for querying synonyms. Is there a way to get them ordered by their probabilities in this way, or I should use another approach?
Speed is not important; this synonym lookup will not be done online.
In case someone stumbles upon this thread, this was the way to go (at least for what I needed):
http://lyle.smu.edu/~tspell/jaws/doc/edu/smu/tspell/wordnet/impl/file/ReferenceSynset.html#getTagCount%28java.lang.String%29
The tagCount method gives the most likely synset group for every word. The problem again is that the synset with the highest probability can itself have several words. But I guess there's no way to avoid this.
I think that you should do another step (provided that speed is not important).
From the Lucene index, you should build another dictionary in which each word is mapped to a small object that contains the synonym whose meaning has the highest probability of appearance, that meaning, and the probability of appearance. I.e., given this code:
class Synonym {
    String name;
    double probability;
    String meaning;
}
Map<String, Synonym> m = new HashMap<String, Synonym>();
... you just have to fill it from the Lucene index.
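As a rough sketch of that filling step, assuming you can already iterate over candidate (word, synonym, meaning, probability) rows from your index -- the Candidate record below is purely hypothetical:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical shape of one row returned by the index lookup.
record Candidate(String word, String synonym, String meaning, double probability) {}

class SynonymDictionaryBuilder {
    // Keep, for each word, only the synonym with the highest probability
    // (uses the Synonym class shown above).
    static Map<String, Synonym> build(List<Candidate> candidates) {
        Map<String, Synonym> best = new HashMap<>();
        for (Candidate c : candidates) {
            Synonym current = best.get(c.word());
            if (current == null || c.probability() > current.probability) {
                Synonym s = new Synonym();
                s.name = c.synonym();
                s.meaning = c.meaning();
                s.probability = c.probability();
                best.put(c.word(), s);
            }
        }
        return best;
    }
}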

In Java - Grouping similar values

First of all, thanks for reading my question.
I used TF/IDF, and then on those values I calculated the cosine similarity to see how similar the documents are. You can see the resulting matrix below. The column names are doc1, doc2, doc3, ... and the row names are the same: doc1, doc2, doc3, etc. With the help of the matrix, I can see that doc1 and doc4 have 72% similarity (0.722711142). That is correct: when I look at both documents, they are indeed similar. I have 1000 documents, and I can check each value in the matrix to see how similar any two of them are.
I used different clustering algorithms like k-means and agnes (hierarchical) to combine them. They produced clusters. For example, Cluster1 has (doc4, doc5, doc3) because their values (0.722711142, 0.602301766, 0.69912109) are close, respectively. But when I check manually whether these 3 documents are really the same, they are NOT. :( What am I doing wrong, or should I use something else instead of clustering?
        doc1          doc2          doc3          doc4          doc5
doc1    1             0.067305859   -0.027552299  0.602301766   0.722711142
doc2    0.067305859   1             0.048492904   0.029151952   -0.034714695
doc3    -0.027552299  0.748492904   1             0.610617214   0.010912109
doc4    0.602301766   0.029151952   -0.061617214  1             0.034410392
doc5    0.722711142   -0.034714695  0.69912109    0.034410392   1
P.S.: The values can be wrong; they are just to give you an idea.
If you have any questions, please do ask.
Thanks
I'm not familiar with TF/IDF, but the process can go wrong in many stages generally:
1. Did you remove stopwords?
2. Did you apply stemming? The Porter stemmer, for example.
3. Did you normalize frequencies for document length? (Maybe the TF-IDF thing has a solution for that, I don't know.)
4. Clustering is a discovery method, not a holy grail. The documents it retrieves as a group may be more or less related, but that depends on the data, tuning, clustering algorithm, etc.
What do you want to achieve? What is your setup?
Good luck!
My approach would be not to use pre-calculated similarity values at all, because the similarity between docs should be found by the clustering algorithm itself. I would simply set up a feature space with one column per term in the corpus, so that the number of columns equals the size of the vocabulary (minus stop words, if you want). Each feature value contains the relative frequency of the respective term in that document. I guess you could use tf*idf values as well, although I wouldn't expect that to help too much. Depending on the clustering algorithm you use, the discriminating power of a particular term should be found automatically, i.e. if a term appears in all documents with a similar relative frequency, then that term does not discriminate well between the classes and the algorithm should detect that.
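A rough sketch of building such a feature matrix, assuming the documents are already tokenized (all names here are made up):

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FeatureSpace {
    // Turn tokenized documents into rows of relative term frequencies,
    // with one column per vocabulary term.
    static double[][] buildMatrix(List<List<String>> docs) {
        // Fix a column index for every distinct term in the corpus.
        Map<String, Integer> columnOf = new LinkedHashMap<>();
        for (List<String> doc : docs) {
            for (String term : doc) {
                columnOf.putIfAbsent(term, columnOf.size());
            }
        }

        double[][] matrix = new double[docs.size()][columnOf.size()];
        for (int row = 0; row < docs.size(); row++) {
            List<String> doc = docs.get(row);
            for (String term : doc) {
                matrix[row][columnOf.get(term)] += 1.0 / doc.size();  // relative frequency
            }
        }
        return matrix;
    }
}

Each row of the matrix can then be handed directly to k-means or a hierarchical clusterer.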

Any tutorial or code for TF-IDF in Java

I am looking for a simple Java class that can compute a tf-idf calculation. I want to do a similarity test on 2 documents. I found so many BIG APIs that use a tf-idf class. I do not want to use a big jar file just to do my simple test. Please help!
Or at least, if someone can tell me how to find TF and IDF, I will calculate the results :)
OR
If you can tell me some good Java tutorial for this.
Please do not tell me to go look on Google; I already did for 3 days and couldn't find anything :(
Please also do not refer me to Lucene :(
Term Frequency is the square root of the number of times a term occurs in a particular document.
Inverse Document Frequency is the log of (the total number of documents divided by the number of documents containing the term), plus one in case the term occurs in zero documents -- if it does, obviously don't try to divide by zero.
If it isn't clear from that answer, there is a TF per term per document, and an IDF per term.
And then TF-IDF(term, document) = TF(term, document) * IDF(term)
Finally, you use the vector space model to compare documents, where each term is a new dimension and the "length" of the part of the vector pointing in that dimension is the TF-IDF calculation. Each document is a vector, so compute the two vectors and then compute the distance between them.
So to do this in Java, read the file in one line at a time with a FileReader or something, and split on spaces or whatever other delimiters you want to use - each word is a term. Count the number of times each term appears in each file, and the number of files each term appears in. Then you have everything you need to do the above calculations.
And since I have nothing else to do, I looked up the vector distance formula. Here you go:
D=sqrt((x2-x1)^2+(y2-y1)^2+...+(n2-n1)^2)
For this purpose, x1 is the TF-IDF for term x in document 1.
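As a sketch, treating each document's TF-IDF weights as a map from term to value (the names are illustrative), the distance computation could look like this:

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class VectorDistance {
    // Euclidean distance between two documents' term -> TF-IDF maps;
    // a term missing from one document simply contributes a weight of 0.
    static double distance(Map<String, Double> doc1, Map<String, Double> doc2) {
        Set<String> allTerms = new HashSet<>(doc1.keySet());
        allTerms.addAll(doc2.keySet());

        double sum = 0.0;
        for (String term : allTerms) {
            double diff = doc1.getOrDefault(term, 0.0) - doc2.getOrDefault(term, 0.0);
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }
}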
Edit: in response to your question about how to count the words in a document:
Read the file in line by line with a reader, like new BufferedReader(new FileReader(filename)) - you can call BufferedReader.readLine() in a while loop, checking for null each time.
For each line, call line.split("\\s") - that will split your line on whitespace and give you an array of all of the words.
For each word, add 1 to the word's count for the current document. This could be done using a HashMap.
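Put together, those three steps might look like this (the file-reading details are assumptions):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class WordCounter {
    // Count how often each word appears in one file.
    static Map<String, Integer> countWords(String filename) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(filename))) {
            String line;
            while ((line = reader.readLine()) != null) {  // readLine() returns null at end of file
                for (String word : line.split("\\s")) {   // split on whitespace, as above
                    if (!word.isEmpty()) {
                        counts.merge(word, 1, Integer::sum);
                    }
                }
            }
        }
        return counts;
    }
}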
Now, after computing D for each document, you will have X values where X is the number of documents. To compare all documents against each other is to do only X^2 comparisons - this shouldn't take particularly long for 10,000. Remember that two documents are MORE similar if the absolute value of the difference between their D values is lower. So then you could compute the difference between the Ds of every pair of documents and store that in a priority queue or some other sorted structure such that the most similar documents bubble up to the top. Make sense?
agazerboy, Sujit Pal's blog post gives a thorough description of calculating TF and IDF.
WRT verifying results, I suggest you start with a small corpus (say 100 documents) so that you can see easily whether you are correct. For 10000 documents, using Lucene begins to look like a really rational choice.
While you specifically asked not to be referred to Lucene, please allow me to point you to the exact class. The class you are looking for is DefaultSimilarity. It has an extremely simple API to calculate TF and IDF. See the Java code here. Or you could just implement it yourself, as specified in the DefaultSimilarity documentation.
TF = sqrt(freq)
and
IDF = log(numDocs/(docFreq+1)) + 1.
The log and sqrt functions are used to damp the actual values. Using the raw values can skew results dramatically.
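A direct translation of those two formulas into plain Java could look like this:

public class SimpleTfIdf {
    // Term frequency, damped with a square root as in the formula above.
    static double tf(int freqInDoc) {
        return Math.sqrt(freqInDoc);
    }

    // Inverse document frequency, damped with a log; the +1 in the
    // denominator avoids division by zero for unseen terms.
    static double idf(int numDocs, int docFreq) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    // TF-IDF(term, document) = TF(term, document) * IDF(term), as described earlier.
    static double tfIdf(int freqInDoc, int numDocs, int docFreq) {
        return tf(freqInDoc) * idf(numDocs, docFreq);
    }
}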
