I'm using Lucene and I want to use nGrams with stopwords.
I wrote my own Analyzer in Lucene, based on the German stopword analyzer.
public class GermanNGramAnalyzer extends StopwordAnalyzerBase {
@Override
protected TokenStreamComponents createComponents(String s) {
NGramTokenizer tokenizer = new NGramTokenizer(4,4); //Tokenizer for nGrams
TokenStream result = new StandardFilter(tokenizer);
result = new LowerCaseFilter(result);
result = new StopFilter(result, this.stopwords);
result = new SetKeywordMarkerFilter(result, this.exclusionSet);
result = new GermanNormalizationFilter(result);
result = new NumberFilter(result);
return new TokenStreamComponents(tokenizer, result);
}
(...)
}
This works, but not the way I want.
As you can see, we use 4-grams, so the output looks like this (blanks are masked as "_"):
Das Haus
das_
as_h
s_ha
_hau
haus
In German "das" is like "the" and should be removed. But of course it won't be removed then "das_", "as_h", "s_ha" doesn't contains "das" at all.
So I want to first a word tokenizer, use stopword and after that merge everything again and use ngram like normal.
Of course I can "manually" remove all stopwords from the string before I throw it into Lucene but I thought it should be possible to do this with Lucene.
Someone has an idea?
One possibility would be: instead of using NGramTokenizer as the tokenizer, first use StandardTokenizer (or any other suitable tokenizer), and then create the n-grams with an NGramTokenFilter, which can be applied right after the StopFilter.
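A minimal sketch of that idea, assuming the Lucene 5.x/6.x-style constructors already used in the question (note that NGramTokenFilter builds its grams within each surviving token, so the underscore-masked grams spanning word boundaries disappear):
@Override
protected TokenStreamComponents createComponents(String fieldName) {
    StandardTokenizer tokenizer = new StandardTokenizer();   // word-level tokenization first
    TokenStream result = new LowerCaseFilter(tokenizer);
    result = new StopFilter(result, this.stopwords);         // "das" is dropped here as a whole word
    result = new GermanNormalizationFilter(result);
    result = new NGramTokenFilter(result, 4, 4);             // 4-grams built from the remaining tokens
    return new TokenStreamComponents(tokenizer, result);
}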
I am doing an exercise in Java using Lucene. I want to remove "{", "}" and ";" using a CharFilter in a CustomAnalyzer, but I don't know how to call the "PatternReplaceCharFilterFactory". I have tried passing it "map", but it doesn't work and throws an exception. I have also tried with the pattern "p", but the result is the same.
public static ArrayList<String> analyzer_codigo(String texto)throws IOException{
Map<String, String> map = new HashMap<String, String>();
map.put("{", "");
map.put("}", "");
map.put(";", "");
Pattern p = Pattern.compile("([^a-z])");
boolean replaceAll = Boolean.TRUE;
Reader r = new Reader(texto);
Analyzer ana = CustomAnalyzer.builder(Paths.get("."))
.addCharFilter(PatternReplaceCharFilterFactory.class,p,"",r)
.withTokenizer(StandardTokenizerFactory.class)
.addTokenFilter(LowerCaseFilterFactory.class)
.build();
return muestraTexto(ana, texto);
}
You can pass a Map to the PatternReplaceCharFilterFactory - but the keys you use for the map are those defined in the JavaDoc for the factory class:
pattern="([^a-z])" replacement=""
The factory's JavaDoc uses Solr-style documentation to define the keys (pattern and replacement), together with their default values.
Using these keys, your map becomes:
Map<String, String> map = new HashMap<>();
map.put("pattern", "\\{|\\}|;");
map.put("replacement", "");
The regular expression \\{|\\}|; needs to escape the { and } characters because they have special meanings, and then the regex backslashes also need to be escaped in the Java string.
So, the above regular expression means { and } and ; will all be replaced by the empty string.
Your custom analyzer then becomes:
Analyzer analyzer = CustomAnalyzer.builder()
.withTokenizer(StandardTokenizerFactory.NAME)
.addCharFilter(PatternReplaceCharFilterFactory.NAME, map)
.addTokenFilter(LowerCaseFilterFactory.NAME)
.build();
If you use this to index the following input string:
foo{bar}baz;bat
Then the indexed value will be stored as:
foobarbazbat
Very minor point: I prefer to use PatternReplaceCharFilterFactory.NAME instead of PatternReplaceCharFilterFactory.class or even just "patternReplace" - but these all work.
Update
Just for completeness:
The CustomAnalyzer.Builder supports different ways to add a CharFilter. See its addCharFilter methods.
As well as the approach shown above, using a Map...
.addCharFilter(PatternReplaceCharFilterFactory.NAME, map)
...you can also use Java varargs:
"key1", "value1", "key2", "value2", ...
So, in our case, this would be:
.addCharFilter(PatternReplaceCharFilterFactory.NAME,
        "pattern", "\\{|\\}|;", "replacement", "")
public SearchResult search(String queryStr, SortBy sortBy, int maxCount)
throws ParseException, IOException {
String[] fields = {Indexer.TITLE_FIELD_NAME, Indexer.REVIEW_FIELD_NAME, "name"};
QueryParser parser = new MultiFieldQueryParser(fields, analyzer);
Query query = parser.parse(queryStr);
Sort sort = null;
if (sortBy != null) {
sort = sortBy.sort;
}
return searchAfter(null, query, sort, maxCount);
}
The above method gives me results, but only when I search for the whole word; if I search for a partial word, it doesn't work.
By default, MultiFieldQueryParser (and QueryParser, which it inherits from) looks for the whole words you are searching for; however, it can also generate wildcard queries. The word "elephant" can be matched by using elep*, elep?ant (i.e. ? matches a single letter) or ele*nt. You can also use fuzzy queries, like elechant~.
You can read the whole syntax specification here: http://lucene.apache.org/core/7_5_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html (below the class list).
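For example, a small sketch against the same fields and analyzer as in the question; the wildcard simply goes into the query string handed to the parser:
QueryParser parser = new MultiFieldQueryParser(fields, analyzer);
Query prefixQuery = parser.parse("elep*");     // prefix query, matches "elephant"
Query singleChar = parser.parse("elep?ant");   // ? matches exactly one character
Query fuzzyQuery = parser.parse("elechant~");  // fuzzy query, tolerates small typos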
I'm writing a software project to take as input a text in human language and determine what language it's written in.
My idea is that I'm going to store dictionaries in hashmaps, with the word as a key and a bool as a value.
If the document contains that word, I will flip the boolean to true.
Right now I'm trying to think of a good way to read in these dictionaries and put them into the hashmaps. The way I'm doing it now is very naive and looks clunky; is there a better way to populate these hashmaps?
Moreover, these dictionaries are huge. Maybe this isn't the best way to do this, i.e. populating them all in succession like this.
I was thinking that it might be better to just consider one dictionary at a time, then compute a score (how many words of the input text registered with that dictionary), save that, and then process the next dictionary. That would save on RAM, wouldn't it? Is that a good solution?
The code so far looks like this:
static HashMap<String, Boolean> de_map = new HashMap<String, Boolean>();
static HashMap<String, Boolean> fr_map = new HashMap<String, Boolean>();
static HashMap<String, Boolean> ru_map = new HashMap<String, Boolean>();
static HashMap<String, Boolean> eng_map = new HashMap<String, Boolean>();
public static void main(String[] args) throws IOException
{
ArrayList<File> sub_dirs = new ArrayList<File>();
final String filePath = "/home/matthias/Desktop/language_detective/word_lists_2";
listf( filePath, sub_dirs );
for(File dir : sub_dirs)
{
String word_holding_directory_path = dir.toString().toLowerCase();
BufferedReader br = new BufferedReader(new FileReader( dir ));
String line = null;
while ((line = br.readLine()) != null)
{
//System.out.println(line);
if(word_holding_directory_path.toLowerCase().contains("/de/") )
{
de_map.put(line, false);
}
if(word_holding_directory_path.toLowerCase().contains("/ru/") )
{
ru_map.put(line, false);
}
if(word_holding_directory_path.toLowerCase().contains("/fr/") )
{
fr_map.put(line, false);
}
if(word_holding_directory_path.toLowerCase().contains("/eng/") )
{
eng_map.put(line, false);
}
}
}
So I'm looking for advice on how I might populate them one at a time, an opinion as to whether that's a good methodology, or suggestions about possibly better methodologies for achieving this aim.
The full programme can be found here on my GitHub page.
The task of language identification is well researched and there are a lot of good libraries.
For Java, try TIKA, or the Language Detection Library for Java (they report over 99% precision for 53 languages), or TextCat, or LingPipe - I'd suggest starting with the first; it seems to have the most detailed tutorial.
If your task is too specific for existing libraries (although I doubt this is the case), refer to this survey paper and adapt the closest techniques.
If you do want to reinvent the wheel, e.g. for self-learning purposes, note that identification can be treated as a special case of text classification and read this basic tutorial for text classification.
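If you go the library route, a minimal sketch with Tika might look like this (assuming Tika 1.x, where org.apache.tika.language.LanguageIdentifier is available; later versions move this functionality into a separate language-detection module):
import org.apache.tika.language.LanguageIdentifier;

public class LanguageGuesser {
    public static void main(String[] args) {
        // Identify the language of a short sample text.
        LanguageIdentifier identifier =
                new LanguageIdentifier("Das ist ein kurzer Satz auf Deutsch.");
        System.out.println(identifier.getLanguage());          // e.g. "de"
        System.out.println(identifier.isReasonablyCertain());  // whether the guess is confident
    }
}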
So I'm using the Weka machine learning library's Java API and I have the following code:
String html = "repeat repeat repeat";
Attribute input = new Attribute("html",(FastVector) null);
FastVector inputVec = new FastVector();
inputVec.addElement(input);
Instances htmlInst = new Instances("html",inputVec,1);
htmlInst.add(new Instance(1));
htmlInst.instance(0).setValue(0, html);
StringToWordVector filter = new StringToWordVector();
filter.setUseStoplist(true);
filter.setInputFormat(htmlInst);
Instances dataFiltered = Filter.useFilter(htmlInst, filter);
Instance last = dataFiltered.lastInstance();
System.out.println(last);
Though StringToWordVector is supposed to count the word occurrences within the string, instead of the word 'repeat' being counted 3 times, the count only comes out as 1.
What am I doing wrong?
The default setting is only reporting presence/absence as 0/1. You must enable counting explicitly. Add:
filter.setOutputWordCounts(true);
and re-run.
Weka has its own mailing list; posting such questions there might get you faster responses.
Gee... all those lines of code. How about these few lines instead?
public static Map<String, Integer> countWords(String input) {
Map<String, Integer> map = new HashMap<String, Integer>();
Matcher matcher = Pattern.compile("\\b\\w+\\b").matcher(input);
while (matcher.find())
map.put(matcher.group(), map.containsKey(matcher.group()) ? map.get(matcher.group()) + 1 : 1);
return map;
}
Here's the code in action:
public static void main(String[] args) {
System.out.println(countWords("sample, repeat sample, of text"));
}
Output:
{of=1, text=1, repeat=1, sample=2}
I am trying to use Lucene Java 2.3.2 to implement search on a catalog of products. Apart from the regular fields for a product, there is field called 'Category'. A product can fall in multiple categories. Currently, I use FilteredQuery to search for the same search term with every Category to get the number of results per category.
This results in 20-30 internal search calls per query to display the results. This is slowing down the search considerably. Is there a faster way of achieving the same result using Lucene?
Here's what I did, though it's a bit heavy on memory:
What you need is to create, in advance, a bunch of BitSets, one for each category, containing the doc ids of all the documents in that category. Then, at search time, you use a HitCollector and check the doc ids against the BitSets.
Here's the code to create the bit sets:
public BitSet[] getBitSets(IndexSearcher indexSearcher,
Category[] categories) {
BitSet[] bitSets = new BitSet[categories.length];
for(int i=0; i<categories.length; i++)
{
Query query = categories[i].getQuery();
final BitSet bitSet = new BitSet();
indexSearcher.search(query, new HitCollector() {
public void collect(int doc, float score) {
bitSet.set(doc);
}
});
bitSets[i] = bitSet;
}
return bitSets;
}
This is just one way to do this. You could probably use TermDocs instead of running a full search if your categories are simple enough, but this should only run once when you load the index anyway.
Now, when it's time to count categories of search results you do this:
public int[] getCategoryCount(IndexSearcher indexSearcher,
Query query,
final BitSet[] bitSets) {
final int[] count = new int[bitSets.length];
indexSearcher.search(query, new HitCollector() {
public void collect(int doc, float score) {
for(int i=0; i<bitSets.length; i++) {
if(bitSets[i].get(doc)) count[i]++;
}
}
});
return count;
}
What you end up with is an array containing the count of every category within the search results. If you also need the search results, you should add a TopDocCollector to your hit collector (yo dawg...). Or, you could just run the search again. 2 searches are better than 30.
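A rough sketch of that combination, assuming the Lucene 2.3-era HitCollector and TopDocCollector API used above; the outer collector forwards each hit to a TopDocCollector for the ranked results while updating the per-category counts:
final TopDocCollector topDocCollector = new TopDocCollector(maxCount);
final int[] count = new int[bitSets.length];
indexSearcher.search(query, new HitCollector() {
    public void collect(int doc, float score) {
        topDocCollector.collect(doc, score);          // keep the ranked results
        for (int i = 0; i < bitSets.length; i++) {
            if (bitSets[i].get(doc)) count[i]++;      // update per-category tallies
        }
    }
});
TopDocs topDocs = topDocCollector.topDocs();          // the actual search results
// count[] now holds the number of hits per category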
I don't have enough reputation to comment (!) but in Matt Quail's answer I'm pretty sure you could replace this:
int numDocs = 0;
td.seek(terms);
while (td.next()) {
numDocs++;
}
with this:
int numDocs = terms.docFreq();
and then get rid of the td variable altogether. This should make it even faster.
You may want to consider looking through all the documents that match categories using a TermDocs iterator.
This example code goes through each "Category" term, and then counts the number of documents that match that term.
public static void countDocumentsInCategories(IndexReader reader) throws IOException {
TermEnum terms = null;
TermDocs td = null;
try {
terms = reader.terms(new Term("Category", ""));
td = reader.termDocs();
do {
Term currentTerm = terms.term();
if (!currentTerm.field().equals("Category")) {
break;
}
int numDocs = 0;
td.seek(terms);
while (td.next()) {
numDocs++;
}
System.out.println(currentTerm.field() + " : " + currentTerm.text() + " --> " + numDocs);
} while (terms.next());
} finally {
if (td != null) td.close();
if (terms != null) terms.close();
}
}
This code should run reasonably fast even for large indexes.
Here is some code that tests that method:
public static void main(String[] args) throws Exception {
RAMDirectory store = new RAMDirectory();
IndexWriter w = new IndexWriter(store, new StandardAnalyzer());
addDocument(w, 1, "Apple", "fruit", "computer");
addDocument(w, 2, "Orange", "fruit", "colour");
addDocument(w, 3, "Dell", "computer");
addDocument(w, 4, "Cumquat", "fruit");
w.close();
IndexReader r = IndexReader.open(store);
countDocumentsInCategories(r);
r.close();
}
private static void addDocument(IndexWriter w, int id, String name, String... categories) throws IOException {
Document d = new Document();
d.add(new Field("ID", String.valueOf(id), Field.Store.YES, Field.Index.UN_TOKENIZED));
d.add(new Field("Name", name, Field.Store.NO, Field.Index.UN_TOKENIZED));
for (String category : categories) {
d.add(new Field("Category", category, Field.Store.NO, Field.Index.UN_TOKENIZED));
}
w.addDocument(d);
}
Sachin, I believe you want faceted search. It does not come out of the box with Lucene. I suggest you try Solr, which has faceting as a major and convenient feature.
So let me see if I understand the question correctly: Given a query from the user, you want to show how many matches there are for the query in each category. Correct?
Think of it like this: your query is actually originalQuery AND (category1 OR category2 OR ...), except that in addition to an overall score you want a separate count for each of the categories. Unfortunately, the interface for collecting hits in Lucene is very narrow, only giving you an overall score for a query. But you could implement a custom Scorer/Collector.
Have a look at the source for org.apache.lucene.search.DisjunctionSumScorer. You could copy some of that to write a custom scorer that iterates through category matches while your main search is going on. And you could keep a Map<String,Long> to keep track of matches in each category.