Counting sentences: Database (like h2) vs. Lucene vs.? - java

I am doing some linguistic research that depends on being able to query a corpus of 100 million sentences. The information I need from that corpus is along the lines of: how many sentences had "john" as the first word, "went" as the second word and "hospital" as the fifth word, etc. I just need the count and don't need to actually retrieve the sentences.
The idea I had was to split these sentences into words and store them in a database, where the columns would be the positions (word-1, word-2, word-3, etc.) and the sentences would be the rows. So it looks like:
Word1 Word2 Word3 Word4 Word5 ....
Congress approved a new bill
John went to school
.....
And my purpose would then be fulfilled by a query like SELECT COUNT(*) FROM sentences WHERE Word1 = 'John' AND Word4 = 'school'. But I am wondering: can this be better achieved using Lucene (or some other tool)?
The program I am writing (in Java) will be running tens of thousands of such queries against that 100 million sentence corpus, so look-up speed is important.
Thanks for any advice,
Anas

Assuming that the queries are as simple as you have indicated, a simple SQL db (Postgres, MySQL, possibly H2) would be perfect for this.
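For example, with H2 (or any JDBC-accessible database) the count could be obtained roughly like this. This is a minimal sketch: the table name sentences and the column names word1..wordN are assumptions, and the columns you filter on should be indexed.
import java.sql.*;

// Assumed schema: CREATE TABLE sentences (word1 VARCHAR, word2 VARCHAR, ..., wordN VARCHAR),
// with an index on each column that is queried frequently.
static long countMatches(Connection con) throws SQLException {
    String sql = "SELECT COUNT(*) FROM sentences WHERE word1 = ? AND word4 = ?";
    try (PreparedStatement ps = con.prepareStatement(sql)) {
        ps.setString(1, "john");
        ps.setString(2, "school");
        try (ResultSet rs = ps.executeQuery()) {
            rs.next();                 // COUNT(*) always returns exactly one row
            return rs.getLong(1);
        }
    }
}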

I suppose you already have the infrastructure to create tokens from a given sentence. You can create a Lucene document with one field for each word in the sentence, naming the fields field1, field2, and so on. Since Lucene doesn't have a fixed schema like a DB, you can define as many fields as you wish, on the fly. You can add an additional identifier field if you want to identify which sentences matched a query.
While searching, your typical Lucene query will be
+field1:John +field4:school
Since you don't need ranked retrieval, you can write a custom Collector that ignores scores. (That will also return results significantly faster.)
Since you don't plan to retrieve the matching sentences or words, you should only index these fields, not store them. That should push performance up another notch.
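A minimal sketch of that approach, assuming a reasonably recent Lucene version; the field names, index path and analyzer are arbitrary choices, and TotalHitCountCollector is one ready-made collector that counts hits without scoring them:
import java.nio.file.Paths;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Indexing: one Lucene document per sentence, one field per word position,
// indexed but not stored (Field.Store.NO).
Directory dir = FSDirectory.open(Paths.get("sentence-index"));
IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new WhitespaceAnalyzer()));
String[] words = "john went to school".split(" ");
Document doc = new Document();
for (int i = 0; i < words.length; i++) {
    doc.add(new StringField("field" + (i + 1), words[i], Field.Store.NO));
}
writer.addDocument(doc);
writer.close();

// Counting: collect only the number of matches, without scoring or retrieving documents.
IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
Query query = new BooleanQuery.Builder()
        .add(new TermQuery(new Term("field1", "john")), BooleanClause.Occur.MUST)
        .add(new TermQuery(new Term("field4", "school")), BooleanClause.Occur.MUST)
        .build();
TotalHitCountCollector collector = new TotalHitCountCollector();
searcher.search(query, collector);
int count = collector.getTotalHits();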

Lucene span queries can implement positional search. Use SpanFirst to find a word in the first N positions of a document, and combine it with SpanNot to rule out the first N-1.
Your example query would look like this:
<BooleanQuery: +(+spanFirst(john, 1) +spanFirst(went, 2)) +spanNot(spanFirst(hospital, 5), spanFirst(hospital, 4))>
Lucene also, of course, allows getting the total hit count of a search without iterating over all the matching documents.
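A sketch of how that query could be built programmatically, assuming each sentence is indexed with positions in a single field (hypothetically named "sentence") and a Lucene version that still provides the span queries under org.apache.lucene.search.spans:
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.spans.*;

// spanFirst(term, n) matches the term within the first n positions;
// spanNot(spanFirst(term, n), spanFirst(term, n - 1)) pins it to exactly position n.
SpanQuery john = new SpanFirstQuery(new SpanTermQuery(new Term("sentence", "john")), 1);
SpanQuery went = new SpanFirstQuery(new SpanTermQuery(new Term("sentence", "went")), 2);
SpanQuery hospitalFifth = new SpanNotQuery(
        new SpanFirstQuery(new SpanTermQuery(new Term("sentence", "hospital")), 5),
        new SpanFirstQuery(new SpanTermQuery(new Term("sentence", "hospital")), 4));

BooleanQuery query = new BooleanQuery.Builder()
        .add(john, BooleanClause.Occur.MUST)
        .add(went, BooleanClause.Occur.MUST)
        .add(hospitalFifth, BooleanClause.Occur.MUST)
        .build();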

I suggest you read Search Engine versus DBMS. From what I gather, you do need a database rather than a full text search library.
In any case, I suggest you preprocess your text and replace every word/token with a number using a dictionary. This turns each sentence into an array of word codes. I would then store each word position in a separate database column, which simplifies the counts and makes them quicker.
For example:
A boy and a girl drank milk
translates into:
120 530 14 120 619 447 253
(I chose arbitrary word codes), leading to the stored row
120 530 14 120 619 447 253 0 0 0 0 0 0 0 ....
(until the number of words you allocate per sentence is exhausted).
This is a somewhat sparse matrix, so maybe this question will help.
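A minimal sketch of that preprocessing step; the dictionary and method names are illustrative, and codes start at 1 so that 0 can mark empty positions:
import java.util.HashMap;
import java.util.Map;

Map<String, Integer> dictionary = new HashMap<>();

// Encode one sentence into a fixed-length array of word codes, padded with 0.
int[] encode(String sentence, int maxWords) {
    int[] codes = new int[maxWords];
    String[] tokens = sentence.toLowerCase().split("\\s+");
    for (int i = 0; i < tokens.length && i < maxWords; i++) {
        Integer code = dictionary.get(tokens[i]);
        if (code == null) {
            code = dictionary.size() + 1;   // next unused code
            dictionary.put(tokens[i], code);
        }
        codes[i] = code;
    }
    return codes;
}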

Look at Apache Hadoop and MapReduce. They were developed for things like this.

Or you can do it by hand, using only Java:
List<String> triple = new ArrayList<>(3);
for (String word : inputFileWords) {
    if (triple.size() == 3) {
        resultFile.println(StringUtils.join(triple, " "));  // write the current 3-word window
        triple.remove(0);                                   // slide the window forward by one word
    }
    triple.add(word);
}
Then sort this file and count the duplicate lines (manually or with a command-line utility such as sort and uniq -c); that will be about as fast as it gets.

Related

Which Pattern Matching Algorithm fits for my case?

I have a project that needs to compare two text documents and find the similarity rate between every pair of sentences, as well as the overall similarity of the texts.
I did some transformations on the texts, like lowercasing all words, deleting duplicate words, and deleting punctuation except full stops. After these operations I had 2 ArrayLists containing the sentences with their words separated. It looks like:
[["hello","world"],["welcome","here"]]
Then I sorted every sentence alphabetically. After all this, I compare the words one by one with a linear search, but if the word I'm searching for is greater than the one I'm looking at (by the ASCII value of the first character, e.g. world > burger), I skip the rest and jump to the next word. It seems complicated, but what I need to know is: "Is there any faster, more efficient common algorithm, like Boyer-Moore, hashing, or something else?" I'm not asking for a piece of code, just some theoretical advice. Thank you.
EDIT:
I should have explained the main purpose of the project. It is basically a plagiarism detector. There are two txt files, main.txt and sub.txt. The program compares them and gives an output something like this:
Output:
Similarity rate of two texts is: %X
{The most similar sentence}
{The most similar 2nd sentence}
{The most similar 3d sentence}
{The most similar 4th sentence}
{The most similar 5th sentence}
So I need to find out sub.txt's similarity rate to the main.txt file. I thought I needed to compare all the sentences in the two files with each other.
For instance, main.txt has 10 sentences and sub.txt has 5 sentences,
there will be 50 comparison and 50 similarity rate will be calculated
and stored.
Finally, I sort the similarity rates and print the 5 most similar sentences. I've actually finished the project, but it's not efficient: it has 4 nested for loops and compares all the words countless times, so the complexity is something like O(n^4) (maybe not quite that much), which is huge even in the worst case. I found the Levenshtein distance and cosine similarity algorithms, but I'm not sure about them. Thanks for any suggestion!
EDIT2:
In my case the similarity between 2 sentences is defined like this:
main_sentence:"Hello dude how are you doing?"
sub_sentence:"Hello i'm fine dude."
Since the intersection is 2 words, ["hello","dude"],
the similarity is: (number of intersecting words) * 100 / (number of words in the main sentence).
For this case it's: 2 * 100 / 6 = 33.3%
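For reference, the metric above could be computed like this; a minimal sketch, assuming tokenization and punctuation removal have already been done as described earlier:
import java.util.*;

// similarity = |words common to both sentences| * 100 / |words in the main sentence|
static double similarity(List<String> mainWords, List<String> subWords) {
    Set<String> common = new HashSet<>(mainWords);
    common.retainAll(new HashSet<>(subWords));   // intersection of the two word sets
    return common.size() * 100.0 / mainWords.size();
}

// similarity([hello, dude, how, are, you, doing], [hello, i'm, fine, dude]) == 2 * 100 / 6 ≈ 33.3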
As a suggestion, and even if this is not a complete answer to your issue: comparing Strings is usually a heavy operation (even if you first check their lengths, which is in fact one of the first things the equals() method already does when comparing Strings).
What I suggest is the following: create a dummy hashCode()-like method. It won't be a real hashCode(), but simply the number associated with the order in which each word was read by your code. Something like a cryptographic method, but much simpler.
Note that String.hashCode() won't work, as the word "Hello" from the first document wouldn't return the same hashcode as the word "Hello" from the second document.
Data "Warming" - PreConversion
Imagine you have a shared HashMap<String,Integer> (myMap), whose key is a String and whose value is an Integer. Note that HashMap hashing in Java with String keys shorter than 10 characters (which English words usually are) is incredibly fast. Without any check, just put each word with its counter value:
myMap.put(yourString, ++counter);
Let's say you have 2 documents:
1.txt- Welcome mate what are you doing here
2.txt- Mate I was here before are you dumb
I assume you have already lowercased all words and removed duplicates.
You start reading the first document and assigning each word a number. The map would look like:
KEY VALUE
welcome 1
mate 2
what 3
are 4
you 5
doing 6
here 7
Now with the second document. If a key is repeated, the put() method will update its value. So:
KEY VALUE
welcome 1
mate 8
what 3
are 13
you 14
doing 6
here 11
I 9
was 10
before 12
dumb 15
Once complete, you create another HashMap<Integer,String> (reverseMap), built the other way round:
KEY VALUE
1 welcome
8 mate
3 what
13 are
14 you
6 doing
11 here
9 I
10 was
12 before
15 dumb
You convert both documents into a List of Integers, so they look like:
1.txt- Welcome mate what are you doing here
2.txt- Mate I was here before are you dumb
to:
listOne - [1, 8, 3, 13, 14, 6, 11]
listTwo - [8, 9, 10, 11, 12, 13, 14, 15]
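A sketch of the conversion described so far, using the final code of each word (which is why listOne ends up as [1, 8, 3, 13, 14, 6, 11] rather than [1, 2, 3, 4, 5, 6, 7]); the method names are illustrative:
import java.util.*;

Map<String, Integer> myMap = new HashMap<>();
Map<Integer, String> reverseMap = new HashMap<>();
int counter = 0;

// First pass over each document: assign (or overwrite) a code per word.
void assignCodes(List<String> documentWords) {
    for (String word : documentWords) {
        myMap.put(word, ++counter);
    }
}

// Once both documents have been read: build the reverse map.
void buildReverseMap() {
    for (Map.Entry<String, Integer> e : myMap.entrySet()) {
        reverseMap.put(e.getValue(), e.getKey());
    }
}

// Second pass: convert a document to its list of word codes.
List<Integer> toCodes(List<String> documentWords) {
    List<Integer> codes = new ArrayList<>(documentWords.size());
    for (String word : documentWords) {
        codes.add(myMap.get(word));
    }
    return codes;
}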
Duplicate words, positions and sequences
To find the words duplicated in both documents:
First, create a deep clone of one of the lists, for example listTwo. A deep clone of a List of Integers is relatively easy to perform. Call it listDuplicates, as that will be its purpose.
List<Integer> listDuplicates = new ArrayList<>();
for (Integer i:listTwo)
listDuplicates.add(new Integer(i));
Call retainAll:
listDuplicates.retainAll(listOne);
The result would be:
listDuplicates- [8,11,13,14]
So, from a total of listOne.size() + listTwo.size() = 15 word occurrences across the 2 documents, there are 11 distinct words: 4 of them are duplicates (they appear in both documents) and 7 are unique to one document.
In order to get the converted values, just call:
for (Integer i : listDuplicates)
System.out.println(reverseMap.get(i)); // mate , here, are, you
Now that the duplicates are identified, listOne and listTwo can also be used to:
Identify the position of each duplicate in each list, so we can compute the difference between its position and the position of the previous duplicate in that document. The first duplicate gets a -1 value by convention, as it is the first one and has no previous duplicate to diff against; that doesn't necessarily mean it is adjacent to any other duplicate (it is just the first one).
If the next element also had a -1 value, that would mean [8] and [11] were consecutive as well:
doc1 doc2 difDoc1 difDoc2
[8] 2 1 -1 (0-1) -1 (0-1)
[11] 7 4 -5 (2-7) -3 (1-4)
[13] 4 6 3 (7-4) -2 (4-6)
[14] 5 7 -1 (4-5) -1 (6-7)
In this case, the distance shown for [14] relative to its previous duplicate (the diff between [13] and [14]) is the same in both documents: -1. That means they are not only duplicates, but are also placed consecutively in both documents.
Hence, we've found not only duplicate words, but also a duplicate sequence of two words between those lines:
[13][14]--are you
The same mechanism (identifying a diff of -1 for the same variable in both documents) would also help to find a complete duplicate sequence of 2 or more words. If all the duplicates show a diff of -1 in both documents, that means we've found a complete duplicate line:
This is shown more clearly in the following example:
doc1- "here i am" [4,5,6]
doc2- "here i am" [4,5,6]
listDuplicates - [4,5,6]
doc1 doc2 difDoc1 difDoc2
[4] 1 1 -1 (0-1) -1 (0-1)
[5] 2 2 -1 (1-2) -1 (1-2)
[6] 3 3 -1 (2-3) -1 (2-3)
All the diffs are -1 for the same variable in both documents -> all duplicates are next to each other in both documents --> The sentence is exactly the same in both documents. So, this time, we've found a complete duplicate line of 3 words.
[4][5][6] -- here i am
Apart from this duplicate-sequence search, the difference table is also helpful when calculating the variance, median, etc. of the duplicate words, in order to get some kind of "similarity" factor (a basic indicator of how alike the documents are; by no means definitive, but still helpful).
Unique values - helpful as a non-similarity indicator?
Similar mechanisms can be used to get the unique values, for example by removing the duplicates from the reverseMap:
for (Integer i: listDuplicates)
reverseMap.remove(i);
Now the reverseMap only contains the unique values: reverseMap.size() = 7
KEY VALUE
1 welcome
3 what
6 doing
9 I
10 was
12 before
15 dumb
In order to get the unique words:
reverseMap.values() = {welcome,what,doing,I,was,before,dumb}
If you need to know which unique words come from which document, you can use the reverseMap (since the Lists may have been altered after executing methods such as retainAll on them):
Count the number of words from the 1st document. This time, 7.
If the key of the reverseMap is <=7, that unique word comes from the 1st document. {welcome,what,doing}
If the key is >7, that unique word comes from the 2nd document. {I,was,before,dumb}
This uniqueness factor could serve as another indicator, in this case a negative one (since we are searching for similarities here). It could still be really helpful.
equals and hashCode - avoid
As the hashCode() method for Strings won't return the same value for two same words (only for two same String object references), it wouldn't work here. The String.equals() method works by comparing the chars (it also checks the length, as you do manually), which would be total overkill if used on big documents:
public boolean equals(Object anObject) {
if (this == anObject) {
return true;
}
if (anObject instanceof String) {
String anotherString = (String) anObject;
int n = value.length;
if (n == anotherString.value.length) {
char v1[] = value;
char v2[] = anotherString.value;
int i = 0;
while (n-- != 0) {
if (v1[i] != v2[i])
return false;
i++;
}
return true;
}
}
return false;
}
My opinion is to avoid this as much as possible; in particular, hashCode() should never be used, as:
String one = "hello";
String two = "hello";
one.hashCode() != two.hashCode()
There's an exception to this, but only when the compiler interns strings; once you load thousands of them, interning won't be applied by the compiler any more. In those rare cases where both String objects reference the same cached memory address, this will also be true:
one.hashCode() == two.hashCode() --> true
one == two --> true
But those are really unusual exceptions, and once string interning doesn't kick in, those hashCodes won't be equal and the == operator used to compare Strings will return false even if the Strings hold the same value (as usual, because it compares their memory addresses).
The essential technique is to see this as a multi-stage process. The key is that you're not trying to compare every document with every other document; rather, you have a first pass that identifies small clusters of likely matches in essentially a one-pass process:
(1) Index or cluster the documents in a way that will allow candidate matches to be identified;
(2) Identify candidate documents that may be a match based on those indexes/clusters;
(3) For each cluster or index match, have a scoring algorithm that scores the similarity of a given pair of documents.
There are a number of ways to solve (1) and (3), depending on the nature and number of the documents. Options to consider:
For certain datasets, (1) could be as simple as indexing on unusual words/combinations of words
For more complex documents and/or larger datasets, you will need to do something sometimes called 'dimension reduction': rather than clustering on shared combinations of words, you'll need to cluster on combinations of shared features, where each feature is identified by a set of words. Look at a feature extraction technique sometimes referred to as "latent semantic indexing" -- essentially, you model the documents mathematically as a matrix of "words per feature" multiplied by "feature per document", and then by factorising the matrix you arrive at an approximation of a set of features, along with a weighted list of which words make up each feature
Then, once you have a means of identifying a set of words/features to index on, you need some kind of indexing function that will mean that candidate document matches have identical/similar index keys. Look at cosine similarity and "locality-sensitive hashing" such as SimHash.
Then for (3), given a small set of candidate documents (or documents that cluster together in your hashing system), you need a similarity metric. Again, which method is appropriate depends on your data, but conceptually one way to see this is "for each sentence in document X, find the most similar sentence in document Y and score its similarity; obtain a 'plagiarism score' that is the sum of these values". There are various ways to define a 'similarity score' between two strings: e.g. longest common subsequence, edit distance, number of common word pairs/sequences...
As you can probably imagine from all of this, there's no single algorithm that will hand you exactly what you need on a plate. (That's why entire companies and research departments are dedicated to this problem...) But hopefully the above will give you some pointers.
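As one concrete example of a similarity score for step (3), here is a minimal sketch of cosine similarity over term-frequency maps; the method and parameter names are illustrative:
import java.util.Map;

// Cosine similarity between two bag-of-words vectors represented as word -> count maps.
static double cosine(Map<String, Integer> tfA, Map<String, Integer> tfB) {
    double dot = 0, normA = 0, normB = 0;
    for (Map.Entry<String, Integer> e : tfA.entrySet()) {
        Integer b = tfB.get(e.getKey());
        if (b != null) {
            dot += e.getValue() * (double) b;
        }
        normA += e.getValue() * (double) e.getValue();
    }
    for (int v : tfB.values()) {
        normB += (double) v * v;
    }
    return (normA == 0 || normB == 0) ? 0.0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
}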

Lucene BooleanQuery with multiple FuzzyQuery is too slow

A Document is an employee record with multiple fields like empName, empId, departmentId, etc.
Using a custom analyzer I have indexed around 4 million such records.
The search query has a list of employee names, and it is known that all employees in the list belong to the same department. There are multiple departments in the company.
So I want to do a fuzzy search on the employee names within a given department id.
For this I am using a boolean query which looks like:
Query termQuery = new TermQuery(new Term("departmentId","1234"));
BooleanQuery.Builder bld = new BooleanQuery.Builder();
for(String str:employeeNameList) {
bld.add(new FuzzyQuery(new Term("name",str)), BooleanClause.Occur.SHOULD);
}
BooleanQuery bq = bld.build();
BooleanQuery finalBooleanQuery = new BooleanQuery.Builder()
.add(termQuery, BooleanClause.Occur.MUST)
.add(bq, BooleanClause.Occur.MUST).build();
Then I pass finalBooleanQuery to the search method of IndexSearcher and get the results.
The problem is that it takes too much time: when the size of employeeNameList is more than 50, a search takes around 500 ms.
How can I reduce the time from 500 ms to 50 ms?
Is there any other solution to this problem?
If you take a look at the other constructors for FuzzyQuery, you'll see some easy ways to improve performance. Each additional argument is there for you to reduce the amount of work the FuzzyQuery is going to do, and so improve performance.
First, and most important:
Prefix length: I strongly recommend setting this to a non-zero value. This is how many characters at the beginning of the term will not be subject to fuzzy matching. So, if searching for "abc" with a prefix of 1, "abb" and "acc" would be matched, but not "bbc". This allows lucene to work with the index when attempting to find matching terms, instead of having to scan the whole term dictionary. It's likely you will see the largest performance improvement here. Many seem to find 2 to be a good balance point between performance and meeting search demands.
The rest of the available arguments can also help:
maxEdits - 2 is the default, and the maximum. Setting this to 1 will match less, and as such, work faster.
maxExpansions - Under the hood, this query finds terms that match the fuzzy parameters, then performs a search for those terms. If you are searching for short terms, especially, this list of matching terms could turn out to be very long. Setting maxExpansions will prevent these extremely long lists of matches from occurring. Default is 50.
transpositions - Whether swapping two characters counts as a single edit. Default is true. Basically, the difference between Levenshtein and Damerau-Levenshtein distance. false means less work and fewer matches, so it will perform better; I don't know if the difference will be that big, though. (A combined example is sketched below.)
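For example, the loop from the question could pass these parameters explicitly. This is a sketch with illustrative values (maxEdits = 1, prefixLength = 2, maxExpansions = 50, no transpositions), assuming a Lucene version whose FuzzyQuery exposes this constructor:
for (String str : employeeNameList) {
    // FuzzyQuery(term, maxEdits, prefixLength, maxExpansions, transpositions)
    bld.add(new FuzzyQuery(new Term("name", str), 1, 2, 50, false),
            BooleanClause.Occur.SHOULD);
}
A reasonable approach is to start with the prefix length, measure the latency, and only relax maxEdits or transpositions if recall suffers.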

Java - inverted index

I am trying to implement a program in Java that takes in a list of documents, say 3 for example; then, using single-term queries, I should be able to get a result showing how many times the word appears in the documents.
The result should be returned as tuples, e.g. [doc 1, doc 2]. It should be implemented as an inverted index that runs in memory.
For example if i have:
Doc 1 : "the fish in the water"
Doc 2 : "The fish is named billy"
doc 3 : "the fish is swimming"
searching for "water" gives the result : [doc 1]
searching for fish should give: [doc1, doc2, doc3]
I am trying to split the problem into smaller segments so it's easier for me to focus on how to actually implement it. I was thinking of something like this:
1) Start with indexing the documents somehow
2) Support single term searches
3) return a list of matching documents sorted by TF-IDF
If we start with point 1, how should I start tackling this problem?
Create a Map<String, Long> for each document that contains all the words in the document and their number of occurrences (search on SO - this has been addressed many times). String::split can help with extracting the individual words. You may want to store the words in lower case for easier searching (note that this doesn't work well in certain languages such as Turkish).
You can then use Map::get to find the number of occurrences of a word in each document
Output the result
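Putting those steps together, a minimal in-memory sketch; the class, field and method names are illustrative, and TF-IDF ranking could be layered on top later:
import java.util.*;
import java.util.stream.Collectors;

// One word-count map per document, keyed by document name.
Map<String, Map<String, Long>> wordCounts = new HashMap<>();

void indexDocument(String docName, String text) {
    Map<String, Long> counts = Arrays.stream(text.toLowerCase().split("\\W+"))
            .filter(w -> !w.isEmpty())
            .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    wordCounts.put(docName, counts);
}

// Single-term search: every document whose map contains the word.
List<String> search(String term) {
    String t = term.toLowerCase();
    return wordCounts.entrySet().stream()
            .filter(e -> e.getValue().containsKey(t))
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
}

// After indexing the three example documents, search("water") -> [doc 1]
// and search("fish") -> [doc 1, doc 2, doc 3].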
I think assylias' solution is the best, but I would suggest using Lucene, which does exactly what you are trying to achieve.
What about something like this example:
String keyword = "fish";
List<Document> results = new ArrayList<>();
for (Document doc : documents) {
    if (doc.getTextContent().contains(keyword)) {
        results.add(doc);
    }
}
System.out.println(results);
Why do you need to compute TF-IDF weights?
If you're just returning docs that match a word, you're doing boolean retrieval which doesn't require you to compute any tf-idf. You would need tf-idf if you were doing probabilistic retrieval and you're computing scores, etc.

Encog Neural Net - How to structure training data?

Every example I've seen for Encog neural nets has involved XOR or something very simple. I have around 10,000 sentences and each word in the sentence has some type of tag. The input layer needs to take 2 inputs, the previous word and the current word. If there is no previous word, then the 1st input is not activated at all. I need to go through each sentence like this. Each word is contingent on the previous word, so I can't just have an array that looks similar to the XOR example. Furthermore, I don't really want to load all the words from 10,000+ sentences into an array, I'd rather scan one sentence at a time and once I reach EOF, start back at the beginning.
How should I go about doing this? I'm not super comfortable with Encog because all the examples I've seen have either been XOR or extremely complicated.
There are 2 inputs... Each input consists of 30 neurons. The chance of the word being a certain tag is used as inputs. So, most of the neurons get 0, the others get probability inputs like .5, .3, and .2. When I say 'aren't activated' I just mean that all the neurons are set to 0. The output layer represents all the possible tags, so, its 30. Whatever one of the output neurons has the highest number is the tag that is chosen.
I'm not sure how to go through all 10,000 sentences, look up each word in each sentence and activate the corresponding inputs, based on the Encog demos I've seen.
It seems that the networks are trained with a single array holding all training data, and that is looped through until the network is trained. I would like to train the network with many different arrays (an array per sentence) and then look through them all again.
This format is clearly not going to work for what I'm doing:
do {
train.iteration();
System.out.println(
"Epoch #" + epoch + " Error:" + train.getError());
epoch++;
} while(train.getError() > 0.01);
So, I'm not sure how to tell you this, but that's not how a neural net works. You can't just use a word as an input, and you can't just "not activate" an input either. At a very basic level, this is what you need to run a neural network on a problem:
A fixed-length input vector (whatever you are feeding in, it must be represented numerically with a fixed length. Each entry in the vector is a single number)
A set of labels (each input vector must correspond to a single, fixed-length output vector)
Once you have those two, the neural net classifies an example, then edits itself to get as close as possible to the labels.
If you're looking to work with words and a deep learning framework, you should map your words to an existing vector representation (I would highly recommend GloVe, but word2vec is decent as well) and then learn on top of that representation.
After getting a deeper understanding of what you're attempting here, I think the issue is that you're dealing with 60 inputs, not one. These inputs are the concatenation of the existing tag predictions for both words (in the case with no previous word, the first 30 entries are 0). You should take care of the mapping yourself (it should be very straightforward), and then just treat it as trying to predict 30 numbers from 60 numbers.
I feel obliged to tell you that, the way you've framed the problem, you will see awful performance. When dealing with a sparse (mostly zeros) vector and such a small dataset, deep learning techniques will show VERY poor performance compared to other methods. You are better off using GloVe + an SVM or a random forest model on your existing data.
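A sketch of building that 60-element input, assuming each word already comes with its 30 tag probabilities; when there is no previous word the first 30 entries stay 0:
static final int TAGS = 30;

static double[] buildInput(double[] previousTagProbs, double[] currentTagProbs) {
    double[] input = new double[2 * TAGS];
    if (previousTagProbs != null) {
        System.arraycopy(previousTagProbs, 0, input, 0, TAGS);   // previous word's tag probabilities
    }
    System.arraycopy(currentTagProbs, 0, input, TAGS, TAGS);     // current word's tag probabilities
    return input;
}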
You can use other implementations of MLDataSet besides BasicMLDataSet.
I ran into a similar problem with windows of DNA sequences. Building an array of all the windows would not have been scalable.
Instead, I implemented my own VersatileDataSource, and wrapped it in a VersatileMLDataSet.
VersatileDataSource has just a few methods to implement:
public interface VersatileDataSource {
String[] readLine();
void rewind();
int columnIndex(String name);
}
For each readLine(), you could return the inputs for the previous/current word, and advance the position to the next word.
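A minimal sketch of such an implementation for the sentence/word case. The class name, the row layout (input columns followed by ideal columns, all as strings) and the convention of returning null at the end of the data are assumptions; the interface methods are the ones shown above. You could equally stream the rows from a file and reopen it in rewind().
public class SentenceWindowSource implements VersatileDataSource {

    private final List<String[]> rows;   // one row per word: its input columns plus its ideal columns
    private int position = 0;

    public SentenceWindowSource(List<String[]> rows) {
        this.rows = rows;
    }

    @Override
    public String[] readLine() {
        if (position >= rows.size()) {
            return null;               // signals the end of the data
        }
        return rows.get(position++);   // advance to the next word's row
    }

    @Override
    public void rewind() {
        position = 0;                  // called before each new pass over the data
    }

    @Override
    public int columnIndex(String name) {
        return -1;                     // this source has no named columns
    }
}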

Memory conscious string filtering

Let's say I have 500 words:
Martin
Hopa
Dunam
Golap
Hugnog
Foo
... + 494 more words
I have the following text, about 85KB in total:
Marting went and got him self stuff
from Hopa store and now he is looking
to put it into storage with his best
friend Dunam. They are planing on
using Golap lock that they found in
Hugnog shop in Foo town. >... text continues into several pages
I would like to produce following text:
------- went and got him self stuff
from ---- store and now he is looking
to put it into storage with his best
friend ----. They are planing on
using ---- lock that they found in
------ shop in --- town. >... text continues into several pages
Currently I'm using the Commons Lang method:
String[] words500 = // all 500 words ("500words" is not a valid Java identifier, so renamed)
String[] maskFor500words = // generated mask for each word
String filteredText = StringUtils.replaceEach(textToBeFiltered, words500, maskFor500words);
Is there a another way to do this that could be more efficient when it comes to memory and CPU usage?
What is the best storage for the 500 words? File, List, enum, array ...?
How would you get statistics, such as how many and which words were replaced, and, for each word, how many times it was replaced?
I wouldn't care much about CPU and memory usage. It should be relatively small for such a problem and such a volume of text.
What I would do is
have a Map containing all the strings as keys, with the number of times they have been found in the text (initially 0)
read the text word by word, by using a StringTokenizer, or the String.split() method
for each word, find if the map contains it (O(1) operation, very quick)
if it contains it, add "----" to a StringBuilder, and increment the value stored for the word in the map
else add the word itself (with a space before unless it's the first word of the text)
At the end of the process, the StringBuilder contains the result, and the map contains the number of times each word was used as a replacement.
Make sure to initialize the StringBuilder with the length of the original text, in order to avoid too many reallocations.
Should be simple and efficient.
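A minimal sketch of that approach; the mask and method names are illustrative, and in practice you would also return or expose the counts map for the statistics:
import java.util.*;

static String filter(String text, Collection<String> words) {
    Map<String, Integer> replacedCounts = new HashMap<>();
    for (String w : words) {
        replacedCounts.put(w, 0);
    }

    StringBuilder result = new StringBuilder(text.length());
    StringTokenizer tokenizer = new StringTokenizer(text);
    boolean first = true;
    while (tokenizer.hasMoreTokens()) {
        String word = tokenizer.nextToken();
        if (!first) {
            result.append(' ');
        }
        if (replacedCounts.containsKey(word)) {
            result.append("----");                                   // the mask
            replacedCounts.put(word, replacedCounts.get(word) + 1);  // replacement statistics
        } else {
            result.append(word);
        }
        first = false;
    }
    return result.toString();
}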
I wouldn't care about memory much, but in case you do: trie is your friend. It's memory efficient for large sets and it allows very efficient matching. You may want to implement it in a compressed fashion.
If I understand the problem correctly, you need to read the 85KB of text and parse out every word (use split or StringTokenizer). For every word, you need to know whether it is in the set of 500 words, and if so, switch it with the corresponding mask.
If you know you have about 500 words, I'd suggest storing the 500 words and their masks in a HashMap with an initial capacity of about 650 (the JDK docs say hashing is most efficient with a load factor of 0.75). Put the word-mask pairs into the HashMap with a for loop.
The biggest bang for the buck you get from the HashMap is that the get/put operations (searching for the key) are done in constant time, which is better than O(n) for an array and even O(log(n)) for binary search on a sorted array.
Armed with the HashMap, you can build up a StringBuffer while filtering those 85KB of text.
Return its toString() from your method and you are done! Regards, - M.S.
PS If you are building the map at a server and doing the filtering somewhere else (at a client) and need to transport the Dictionary, HashMap won't do - it cannot be serialized. Use a Hashtable in that case. If on the same machine, HashMap is more memory efficient. Later, - M.S.
