I need some sort of solution in Java for the following requirements:
Search a text for certain terms (each term can be 1-3 words), for example: {"hello world", "hello"}. The match needs to be exact.
There are about 500 term groups, each containing about 30 terms.
Each text might contain up to 4000 words.
Performance is an important issue.
Thanks,
Rod
I have done something similar for a bespoke spam filter.
A technique I found to be both simple and fast is:
Split the input file into words first.
Call intern() on each word, to simplify the comparisons in step 3.
Create a Term class, encapsulating an array of up to three strings. Its equals() method can do pointer comparison on the strings, rather than calling String.equals(). Create a Term instance for each group of 2 or 3 consecutive words in the input.
Use a Multimap (from Google Collections) to map each term to the set of files in which it appears.
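A rough, untested sketch of the steps above might look like this in Java (it assumes Google Collections / Guava on the classpath; the class and method names are just illustrative, not from my original filter):

// Untested sketch of the interned-word + Term approach described above.
import com.google.common.collect.HashMultimap;
import com.google.common.collect.Multimap;
import java.util.Arrays;

final class Term {
    private final String[] words; // each element is an interned String

    Term(String... words) {
        this.words = words;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Term)) return false;
        Term other = (Term) o;
        if (words.length != other.words.length) return false;
        for (int i = 0; i < words.length; i++) {
            if (words[i] != other.words[i]) return false; // pointer comparison works because the words are interned
        }
        return true;
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(words); // consistent with equals() for interned strings
    }
}

class TermIndex {
    private final Multimap<Term, String> filesByTerm = HashMultimap.create();

    void index(String fileName, String text) {
        String[] tokens = text.toLowerCase().split("\\W+");
        for (int i = 0; i < tokens.length; i++) {
            tokens[i] = tokens[i].intern();
        }
        // record every run of 1 to 3 consecutive words as a Term
        for (int i = 0; i < tokens.length; i++) {
            for (int len = 1; len <= 3 && i + len <= tokens.length; len++) {
                filesByTerm.put(new Term(Arrays.copyOfRange(tokens, i, i + len)), fileName);
            }
        }
    }
}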
Use regular expressions. See: http://java.sun.com/docs/books/tutorial/essential/regex/
There seem to be two parts to this: figuring out a decent algorithm, and implementing it in Java. (For the moment let's put aside the idea that surely someone "out there" has already implemented this, and you can probably find some ideas.)
Seems like we want to avoid repeating expensive work, but it's not clear where the costs would be. So I guess you'll need to be prepared to benchmark a few candidate approaches. Also have in mind what is "good enough".
Start with the simplest thing you can think of that works. Measure it. You might get the surprising result that it's good enough. Stop right there! For example, this is really dumb:
read text into String (4k, that's not too big)
for each term
use regexp to find matches in text
but it might well give a sub-second response time. Would your users really care if you took a 200ms response down to 100ms? How much would they pay for that?
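For concreteness, a dumb version like that could be as little as the following (sketch only; I'm treating each term as a literal matched on word boundaries, case-insensitively):

import java.util.List;
import java.util.regex.Pattern;

class NaiveTermSearch {
    // returns true if any of the terms occurs in the text as a whole word/phrase
    static boolean containsAnyTerm(String text, List<String> terms) {
        for (String term : terms) {
            Pattern p = Pattern.compile("\\b" + Pattern.quote(term) + "\\b", Pattern.CASE_INSENSITIVE);
            if (p.matcher(text).find()) {
                return true;
            }
        }
        return false;
    }
}

Compiling the patterns once up front instead of per call is the obvious first optimisation, if you need it at all.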
Another approach. I wonder if this is faster?
prepare a collection of terms keyed by first word
tokenize the text
for each token
find terms that match
check for match (using look ahead for multi-word terms)
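Something like this, as an untested sketch (the tokenisation is deliberately crude, and the names are just illustrative):

import java.util.*;

class KeyedTermMatcher {
    private final Map<String, List<String[]>> termsByFirstWord = new HashMap<>();

    void addTerm(String term) {
        String[] words = term.toLowerCase().split("\\s+");
        termsByFirstWord.computeIfAbsent(words[0], k -> new ArrayList<>()).add(words);
    }

    List<String> findMatches(String text) {
        String[] tokens = text.toLowerCase().split("\\W+");
        List<String> matches = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            List<String[]> candidates = termsByFirstWord.get(tokens[i]);
            if (candidates == null) continue;
            for (String[] term : candidates) {
                // look ahead for multi-word terms
                if (i + term.length <= tokens.length) {
                    boolean match = true;
                    for (int j = 1; j < term.length; j++) {
                        if (!tokens[i + j].equals(term[j])) { match = false; break; }
                    }
                    if (match) matches.add(String.join(" ", term));
                }
            }
        }
        return matches;
    }
}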
As for implementing it in Java: that's a separate problem; ask specific questions if you need to.
Let's say you have a dictionary with 5 strings in it, and you also have multiple files. I want to iterate through those files and see how many times the strings in my dictionary appear in them. How can I do this so it is most efficient?
I would like this to scale as well, so more than 5 strings and more than a few documents. I'm pretty open about what language I'm using. Preferably Java or C#, but once again, I can work in another language.
Most efficient is always a trade-off between the time you want to put into it and the results you want (or need).
One easy approach that is efficient is to use a regular expression. With five strings this will be fairly efficient. If that isn't good enough for you, well... you can certainly find a better approach.
This is a pattern matching problem. The best algorithm to solve this kind of problem is the Knuth-Morris-Pratt algorithm. It is a famous algorithm, so you will find its description in many places; it can also be found in the Introduction to Algorithms book.
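A minimal KMP sketch in Java (this handles one pattern at a time; for a dictionary you would run it once per string, and it assumes a non-empty pattern):

class KMP {
    // failure[i] = length of the longest proper prefix of pattern[0..i] that is also a suffix
    static int[] buildFailure(String pattern) {
        int[] fail = new int[pattern.length()];
        int k = 0;
        for (int i = 1; i < pattern.length(); i++) {
            while (k > 0 && pattern.charAt(i) != pattern.charAt(k)) {
                k = fail[k - 1];
            }
            if (pattern.charAt(i) == pattern.charAt(k)) {
                k++;
            }
            fail[i] = k;
        }
        return fail;
    }

    // counts occurrences of pattern in text in O(text + pattern) time
    static int countOccurrences(String text, String pattern) {
        int[] fail = buildFailure(pattern);
        int count = 0;
        int k = 0;
        for (int i = 0; i < text.length(); i++) {
            while (k > 0 && text.charAt(i) != pattern.charAt(k)) {
                k = fail[k - 1];
            }
            if (text.charAt(i) == pattern.charAt(k)) {
                k++;
            }
            if (k == pattern.length()) {
                count++;
                k = fail[k - 1];
            }
        }
        return count;
    }
}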
I was looking for a lightweight library that'd allow me to feed it a bunch of words, and then ask it whether a given word would have any close matches.
I'm not particularly concerned with the underlying algorithm (I reckon a simple Hamming distance algorithm would probably suffice, were I to undertake the task myself).
I'm in the middle of developing a small language, and I found it nifty to make suggestions to the user when an "Undefined class" error is detected (lots of times it's just a misspelled word). I don't want to lose much time on the issue though.
Thanks
Levenshtein distance is a common way of handling it. Just add all the words to a list, then brute-force iterate over it and return the smallest distance. Here's one library with a Levenshtein function: http://commons.apache.org/lang/api-2.4/org/apache/commons/lang/StringUtils.html
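For example (a sketch using that commons-lang method; fine for a modest word list):

import java.util.List;
import org.apache.commons.lang.StringUtils;

class ClosestWordFinder {
    // returns the dictionary word with the smallest Levenshtein distance to the input
    static String closest(String input, List<String> dictionary) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String word : dictionary) {
            int d = StringUtils.getLevenshteinDistance(input, word);
            if (d < bestDistance) {
                bestDistance = d;
                best = word;
            }
        }
        return best; // null if the dictionary is empty
    }
}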
If you have a large number of words and you want it to run fast, then you'd have to use ngrams. Split each word into bigrams and then add (bigram, word) to a map. Use the map to look up the bigrams in the target word, and then iterate through the candidates. That's probably more work than you want to do, though.
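A rough sketch of that bigram index (names are made up; words shorter than two characters would need special handling, and you'd still rank the candidates with an edit distance):

import java.util.*;

class BigramIndex {
    private final Map<String, Set<String>> wordsByBigram = new HashMap<>();

    void add(String word) {
        for (String bigram : bigrams(word)) {
            wordsByBigram.computeIfAbsent(bigram, k -> new HashSet<>()).add(word);
        }
    }

    // gather candidate words sharing at least one bigram with the target
    Set<String> candidates(String target) {
        Set<String> result = new HashSet<>();
        for (String bigram : bigrams(target)) {
            Set<String> words = wordsByBigram.get(bigram);
            if (words != null) result.addAll(words);
        }
        return result;
    }

    private static List<String> bigrams(String word) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + 2 <= word.length(); i++) {
            result.add(word.substring(i, i + 2));
        }
        return result;
    }
}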
Not necessarily a library, but I think this article may be really helpful. It mostly describes the general workings of how a spelling corrector works in Python, but it also has a link to a Java implementation which you may use if that is what you are looking for specifically (note that I haven't used the Java one myself).
I'm making a profanity filter (bad idea I know), and I'm trying to do it with regex in Java.
Right now here's my regex example string, this would filter 2 words, foo and bar.
(?i)f(?>[.,;:*`'^+~\\/#|]*+|.)o(?>[.,;:*`'^+~\\/#|]*+|.)o|b(?>[.,;:*`'^+~\\/#|]*+|.)a(?>[.,;:*`'^+~\\/#|]*+|.)r
Basically, I have it ignore case, then I put (?>[.,;:*`'^+~\\/#|]*+|.) in between each letter of a curse word, and | between each complete curse word regex.
It works, but it's sorta slow.
If I have 6 words in the filter, it will filter a fairly long string (500 characters) in 939,548 nanoseconds. When I have 12, it just about doubles.
So, about 1ms per 6 curse words with this. But my filter will have hundreds (400 or so).
Calculating this, it would take about 66ms to filter this long string.
This is a chat server I'm building, and if I have lots of users on (say, 5,000) and 1 out of 5 are chatting in 1 second (1,000 chat messages) I need to filter a message in about 1ms.
Am I asking too much of regexps? Would it be faster to make my own specialized type of filter by hand? Are there ways to optimize this?
I am precompiling the regex.
If you want to see the effect of this regex http://regexr.com?30454
Update: Another thing I could do is have chat messages filtered client side in ActionScript.
Update: I believe the only way to achieve such degree of performance would be a hand-coded solution without using regexps sadly, so I'll have to do a more basic filter.
To answer your question "am I asking too much of regexps?": yes.
I spent the better part of 2 years working on a profanity filter using regular expressions and finally gave up. During this time, I tried all of these things:
Pre-compiling
Character classes (punctuation, whitespace, etc)
Non-capturing groups (mentioned above and can greatly reduce memory and increase speed)
Combining similar regexps (also mentioned above)
Trimming whitespace (str.trim())
Case handling (str.toLowerCase())
Packing and unpacking whitespace (convert multiple adjacent whitespace to a single space and vice-versa)
Writing my own custom regexp engine (highly unrecommended as it is complex and not scalable)
Nothing worked well and as my blacklist grew my system slowed down. In the end I gave up and implemented a linear analysis filter, which is now the core part of CleanSpeak, my company's profanity filtering product.
We found that we were also able to do some great multi-threading and other optimizations once we stopped using regexps and went from handling 600-700 messages per second to 10,000+ messages per second.
Lastly, we also found that performing linear analysis made the filter more accurate and allowed us to solve the "Scunthorpe problem" and many of the other issues people have mentioned in the comments here.
You can definitely try all of the things I mention above and see if you can get your performance up, but it is a hard problem to solve because regexps weren't really designed for language analysis. They were designed for text analysis, which is a very different problem.
Can you make use of any of the built-in character classes, e.g.
\bf\W?o\W?o\W?\b
to detect "foo" with any non-letters between the letters, but not "food" or "snafoo" (sic)
However, the weakness of this is that "_" counts as a word character :-(
I think a more promising approach is to use a simple, fast filter with some false positives, then re-test the positives against a more rigorous filter. Unless your users are total potty-mouths, there shouldn't be all that many detailed checks needed.
Update: Thought of this after I went home, but Qtax got there first (see other answer) - try removing all the punctuation first, then run plain word patterns on the text. This should make the word patterns much simpler and faster, especially when you have a lot of words to test.
Finally, note that within [] you don't need to escape regex special characters, so:
[.,;:*`'^+~\\/#|]
is OK (backslash still needs escaping)
When you have many words, group them by their shared leading characters, and you should see a less-than-linear time increase as you add words.
I mean that if you have two words, "foobar" and "fook", make a regex of the form foo(?:bar|k).
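A toy sketch of that grouping (a real version would build a full trie; this just factors out a single shared prefix for a non-empty group of words):

import java.util.*;
import java.util.regex.Pattern;

class PrefixGroupedRegex {
    // e.g. ["foobar", "fook"] -> \Qfoo\E(?:\Qbar\E|\Qk\E)
    static String groupToRegex(List<String> words) {
        String prefix = commonPrefix(words);
        StringBuilder sb = new StringBuilder(Pattern.quote(prefix)).append("(?:");
        for (int i = 0; i < words.size(); i++) {
            if (i > 0) sb.append('|');
            sb.append(Pattern.quote(words.get(i).substring(prefix.length())));
        }
        return sb.append(')').toString();
    }

    private static String commonPrefix(List<String> words) {
        String prefix = words.get(0);
        for (String w : words) {
            while (!w.startsWith(prefix)) {
                prefix = prefix.substring(0, prefix.length() - 1);
            }
        }
        return prefix;
    }
}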
Using non-backtracking groups instead of non-capturing might increase performance. I.e. replace (?:...) with (?>...).
Another suggestion could be to just remove all the punctuation in the string first and then you could apply a simpler expression.
Also, if you can, try to apply the expression to longer strings, as that would probably be faster than doing one message at a time. Maybe combine several messages for a first check.
you can try to replace all
[.,;:*`'^+~\\/#|]+
with empty strings and then simply check for
\b(foo|bar)\b
UPDATE:
If you are more paranoid about the spaces: f( *+)o\1o|b( *+)a\2r
Or, more paranoid in general: f([^o]?)o\1o|b([^a]?)a\2r
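Put together, the replace-then-check idea could look roughly like this in Java (a sketch; the word list here is obviously just foo and bar):

import java.util.regex.Pattern;

class TwoPassFilter {
    // first strip the decoration characters, then look for plain words
    private static final Pattern PUNCTUATION = Pattern.compile("[.,;:*`'^+~\\\\/#|]+");
    private static final Pattern BAD_WORDS = Pattern.compile("\\b(foo|bar)\\b", Pattern.CASE_INSENSITIVE);

    static boolean containsBadWord(String message) {
        String cleaned = PUNCTUATION.matcher(message).replaceAll("");
        return BAD_WORDS.matcher(cleaned).find();
    }
}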
So, I realize that this covers a wide array of topics, and pieces of them have been covered before on StackOverflow, such as this question. Similarly, Partial String Matching and Approximate String Matching are popular algorithmic discussions, it seems. However, using these ideas separately on a problem that needs both seems highly inefficient. I'm looking for a way to combine the two problems into one solution, efficiently.
Right now, I'm using AppEngine with Java and the Persistent DataStore. This is somewhat annoying, since it doesn't seem to allow any arithmetic in queries to make things easier, so I'm currently considering doing some precalculation and storing the result as an extra field in the database. Essentially, what follows is an idea a friend and I had for implementing a matching system, and I'm hoping for suggestions on how to make it more efficient. If it needs to be scrapped in favor of something better that already exists, I can handle that as well.
Let's start off with a basic example of what I'd look to do a search for. Consider the following nonsense sentence:
The isolating layer rackets the principal beneath your hypocritical rubbish.
If a user does a search for
isalatig pri
I would think that this would be a fairly good starting match for the string, and the value should be returned. The current method that we are considering using basically assigns a value to test divisibility. Essentially, there is a table with the following data
A: 2 B: 3 C: 5
D: 7 E: 11 F: 13
...
with each character being mapped to a prime number (repeated characters make no difference; each distinct character is counted once). If the query string's product divides the database string's product, then that string is returned as a possible match.
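To make the scheme concrete, this is roughly the precalculation I have in mind (a sketch; each distinct letter contributes its prime once, and the product would be stored alongside the record):

import java.math.BigInteger;

class PrimeSignature {
    // one prime per letter a..z
    private static final int[] PRIMES = {
        2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
        43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101
    };

    // product of the primes for the distinct letters in s
    static BigInteger signature(String s) {
        BigInteger product = BigInteger.ONE;
        boolean[] seen = new boolean[26];
        for (char c : s.toLowerCase().toCharArray()) {
            if (c >= 'a' && c <= 'z' && !seen[c - 'a']) {
                seen[c - 'a'] = true;
                product = product.multiply(BigInteger.valueOf(PRIMES[c - 'a']));
            }
        }
        return product;
    }

    // the query is a possible match if its signature divides the stored signature
    static boolean isCandidate(BigInteger stored, BigInteger query) {
        return stored.mod(query).equals(BigInteger.ZERO);
    }
}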
After this, keywords from the search string that aren't listed as stopwords are checked to see whether they are starting substrings of words in the possible match, within a given edit-distance threshold (currently the Levenshtein distance).
distance("isalatig", "isolating") == 2
distance("pri", "principal") == 0 // since principal has a starting
// substring of pri it passes
The total distance for each query is then ranked in ascending order and the top n values are then returned back to the person doing the querying.
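In code, the per-keyword check is roughly this (a sketch; I'm reusing the commons-lang Levenshtein helper purely for illustration):

import org.apache.commons.lang.StringUtils;

class PrefixDistance {
    // a query token that is a starting substring of the candidate word scores 0,
    // otherwise fall back to plain Levenshtein distance
    static int distance(String queryToken, String candidateWord) {
        if (candidateWord.startsWith(queryToken)) {
            return 0; // e.g. distance("pri", "principal") == 0
        }
        return StringUtils.getLevenshteinDistance(queryToken, candidateWord); // e.g. ("isalatig", "isolating") == 2
    }
}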
This is the basic idea behind the algorithm, though since this is my first time dealing with such a scenario, I realize that I'm probably missing something very important (or my entire idea may be wrong). What is the best way to handle the situation that I'm trying to implement? Similarly, if there are any utilities that AppEngine currently offers to combat what I'm trying to do, please let me know.
First off, a clarification: App Engine doesn't allow arithmetic in queries because there's no efficient way to query on the result of an arbitrary arithmetic expression. When you do this in an SQL database, the planner is forced to select an inefficient query plan, which usually involves scanning all the candidate records one by one.
Your scheme will not work for the same reason: There's no way to index an integer such that you can efficiently query for all numbers that are divisible by your target number. Other potential issues include words that translate into numbers that are too large to store in a fixed length integer, and being unable to distinguish between 'rental', 'learnt' and 'antler'.
If we discard for the moment your requirement for matching arbitrary prefixes of strings, what you are searching for is full-text indexing, which is typically implemented using an inverted index and stemming. Support for fulltext search is on the App Engine roadmap but hasn't been released yet; in the meantime your best option appears to be SearchableModel, or using an external search engine such as Google Site Search.
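For intuition, an inverted index is conceptually nothing more than this (a toy in-memory sketch; real full-text search adds stemming, stop words and ranking on top):

import java.util.*;

class InvertedIndex {
    // map each token to the IDs of the entities whose text contains it
    private final Map<String, Set<Long>> postings = new HashMap<>();

    void index(long entityId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new HashSet<>()).add(entityId);
            }
        }
    }

    Set<Long> lookup(String token) {
        return postings.getOrDefault(token.toLowerCase(), Collections.emptySet());
    }
}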
I am new to Lucene and my project is to provide specialized search for a set of booklets. I am using Lucene Java 3.1.
The basic idea is to help people know where to look for information in the (rather large and dry) booklets by consulting the index to find out what booklet and page numbers match their query. Each Document in my index represents a particular page in one of the booklets.
So far I have been able to successfully scrape the raw text from the booklets, insert it into an index, and query it just fine using StandardAnalyzer on both ends.
So here's my general question:
Many queries on the index will involve searching for place names mentioned in the booklets. Some place names use notational variants. For instance, in the body text it will be called "Ship Creek" on one page, but in a map diagram elsewhere it might be listed as "Ship Cr." or even "Ship Ck.". What I need to know is how to approach treating the two consecutive words as a single term and adding the notational variants as synonyms.
My goal is of course to search with any of the variants and catch all occurrences. If I search for (Ship AND (Cr Ck Creek)) this does not give me what I want because other words may appear between [ship] and [cr]/[ck]/[creek] leading to false positives.
So, in a nutshell I probably still need the basic stuff provided by StandardAnalyzer, but with specific term grouping to emit place names as complete terms and possibly insert synonyms to cover the variants.
For instance, the text "...allowed from the mouth of Ship Creek upstream to ..." would result in tokens [allowed],[mouth],[ship creek],[upstream]. Perhaps via a TokenFilter along the way, the [ship creek] term would expand into [ship creek][ship ck][ship cr].
As a bonus it would be nice to treat the trickier text "..except in Ship, Bird, and Campbell creeks where the limit is..." as [except],[ship creek],[bird creek],[campbell creek],[where],[limit].
This seems like a pretty basic use case, but it's not clear to me how I might be able to use existing components from Lucene contrib or SOLR to accomplish this. Should the detection and merging be done in some kind of TokenFilter? Do I need a custom Analyzer implementation?
Some of the term grouping can probably be done heuristically (a word immediately followed by [creek] becomes a single [word creek] term), but I also have an exhaustive list of places mentioned in the text if that helps.
Thanks for any help you can provide.
You can use Solr's Synonym Filter. Just set up "creek" to have synonyms "ck", "cr" etc.
I'm not aware of any existing functionality to solve your "bonus" problem.
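If it helps, one query-time workaround for the adjacency problem (independent of index-time synonyms) is Lucene's MultiPhraseQuery, which accepts several alternative terms at a single phrase position. A sketch, where the field name "body" is just an assumption:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.MultiPhraseQuery;
import org.apache.lucene.search.Query;

class PlaceNameQueries {
    // match "ship" immediately followed by any of the creek variants
    static Query shipCreekQuery() {
        MultiPhraseQuery query = new MultiPhraseQuery();
        query.add(new Term("body", "ship"));
        query.add(new Term[] {
            new Term("body", "creek"),
            new Term("body", "cr"),
            new Term("body", "ck")
        });
        return query;
    }
}

This only helps at query time, of course; it doesn't address the index-time term grouping or the "bonus" case.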