Using multiple Leaves in Lucene Classifiers - java

I am trying to use the k-nearest-neighbour classifier in Lucene. The document classifier accepts a LeafReader in its constructor for training the classifier.
The problem is that the index I am using to train the classifier has multiple leaves, but the constructor only accepts one leaf, and I could not find a way to add the remaining LeafReaders to the classifier. I might be missing something. Could anyone please help me out with this?
Here is the code I am currently using:
FSDirectory index = FSDirectory.open(Paths.get(indexLoc));
IndexReader reader = DirectoryReader.open(index);
List<LeafReaderContext> leaves = reader.leaves();
LeafReaderContext leaf = leaves.get(0);
LeafReader atomicReader = leaf.reader();
// BM25 is a Similarity instance and field2analyzer a Map<String, Analyzer> defined elsewhere
KNearestNeighborDocumentClassifier knn = new KNearestNeighborDocumentClassifier(atomicReader, BM25, null, 10, 0, 0, "Topics", field2analyzer, "Text");

Leaves represent the individual segments of your index. In terms of performance and resource usage, you should iterate over the leaves, run the classification for each segment, and accumulate your results:
for (LeafReaderContext context : indexReader.getContext().leaves()) {
    LeafReader reader = context.reader();
    // run the classification for this leaf
}
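For example, a minimal sketch of this per-leaf approach, reusing the BM25 similarity and field2analyzer map from the question and assuming a docToClassify Document (the classifier's constructor signature varies slightly between Lucene versions):
Map<BytesRef, Double> accumulated = new HashMap<>();
for (LeafReaderContext context : indexReader.getContext().leaves()) {
    KNearestNeighborDocumentClassifier knn = new KNearestNeighborDocumentClassifier(
            context.reader(), BM25, null, 10, 0, 0, "Topics", field2analyzer, "Text");
    // sum the per-class scores over all segments
    for (ClassificationResult<BytesRef> result : knn.getClasses(docToClassify)) {
        accumulated.merge(result.getAssignedClass(), result.getScore(), Double::sum);
    }
}
// the class with the highest accumulated score is the overall result
BytesRef best = Collections.max(accumulated.entrySet(), Map.Entry.comparingByValue()).getKey();
Note that simply summing scores across segments is an approximation; scores from different leaves are not strictly comparable, so you may want a more careful merge strategy.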
If that is not possible, you can use the SlowCompositeReaderWrapper which, as the name suggests, might be very slow as it aggregates all the leaves on the fly.
LeafReader singleLeaf = SlowCompositeReaderWrapper.wrap(indexReader);
// run classifier on singleLeaf
Depending on your Lucene version, this sits in lucene-core or lucene-misc (since Lucene 6.0, I think). Also, this class is deprecated and scheduled for removal in Lucene 7.0.
The third option is to run forceMerge(1) so that you end up with only one segment, whose single leaf you can then use. However, forcing a merge down to a single segment has other issues and might not work for your use case. If your data is write-once and then only used for reading, a forceMerge is fine. If you have regular updates, you'll end up having to use the first option and aggregate the classification results yourself.
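For the write-once case, a minimal sketch of the merge step (assuming the same index directory and an analyzer; the IndexWriterConfig constructor shown here is the Lucene 6 style):
IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
try (IndexWriter writer = new IndexWriter(index, iwc)) {
    writer.forceMerge(1); // collapse the index down to a single segment
}
// reopen the DirectoryReader afterwards; reader.leaves() should then contain exactly one leaf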

Related

dl4j - what's the label mechanism in paragraph2vec?

I just read the paper Distributed Representations of Sentences and Documents. In the sentiment analysis experiment section, it says, "After learning the vector representations for training sentences and their subphrases, we feed them to a logistic regression to learn a predictor of the movie rating." So it uses the logistic regression algorithm as a classifier to determine what the label is.
Then I moved on to dl4j and read the example "ParagraphVectorsClassifierExample"; the code is shown below:
void makeParagraphVectors() throws Exception {
    ClassPathResource resource = new ClassPathResource("paravec/labeled");
    // build an iterator for our dataset
    iterator = new FileLabelAwareIterator.Builder()
            .addSourceFolder(resource.getFile())
            .build();
    tokenizerFactory = new DefaultTokenizerFactory();
    tokenizerFactory.setTokenPreProcessor(new CommonPreprocessor());
    // ParagraphVectors training configuration
    paragraphVectors = new ParagraphVectors.Builder()
            .learningRate(0.025)
            .minLearningRate(0.001)
            .batchSize(1000)
            .epochs(20)
            .iterate(iterator)
            .trainWordVectors(true)
            .tokenizerFactory(tokenizerFactory)
            .build();
    // Start model training
    paragraphVectors.fit();
}
void checkUnlabeledData() throws IOException {
    /*
    At this point we assume that we have the model built and we can check
    which categories our unlabeled documents fall into.
    So we'll start loading our unlabeled documents and checking them
    */
    ClassPathResource unClassifiedResource = new ClassPathResource("paravec/unlabeled");
    FileLabelAwareIterator unClassifiedIterator = new FileLabelAwareIterator.Builder()
            .addSourceFolder(unClassifiedResource.getFile())
            .build();
    /*
    Now we'll iterate over the unlabeled data and check which label each document could be assigned to.
    Please note: for many domains it's normal to have one document fall into several labels at once,
    with a different "weight" for each.
    */
    MeansBuilder meansBuilder = new MeansBuilder(
            (InMemoryLookupTable<VocabWord>) paragraphVectors.getLookupTable(),
            tokenizerFactory);
    LabelSeeker seeker = new LabelSeeker(iterator.getLabelsSource().getLabels(),
            (InMemoryLookupTable<VocabWord>) paragraphVectors.getLookupTable());
    while (unClassifiedIterator.hasNextDocument()) {
        LabelledDocument document = unClassifiedIterator.nextDocument();
        INDArray documentAsCentroid = meansBuilder.documentAsVector(document);
        List<Pair<String, Double>> scores = seeker.getScores(documentAsCentroid);
        /*
        Please note: document.getLabels() is used just to show which document we're looking at,
        as a substitute for printing out the whole document name.
        So the labels on these documents are used like titles,
        just to visualize that our classification is done properly.
        */
        log.info("Document '" + document.getLabels() + "' falls into the following categories: ");
        for (Pair<String, Double> score : scores) {
            log.info("    " + score.getFirst() + ": " + score.getSecond());
        }
    }
}
It demonstrates how doc2vec associates arbitrary documents with labels, but it hides the implementation behind the scenes. My question is: does it also do this with logistic regression? If not, what does it use? And how can I do it with logistic regression?
I'm not familiar with DL4J's approach, but at the core 'Paragraph Vector'/'Doc2Vec' level, documents typically have an identifier assigned by the user – most typically, a single unique ID. Sometimes, though, these (provided) IDs have been called "labels", and further, sometimes it can be useful to re-use known-labels as if they were per-document doc-tokens, which can lead to confusion. In the Python gensim library, we call those user-provided tokens "tags" to distinguish from "labels" that might be from a totally different, and downstream, vocabulary.
So in a follow-up paper like "Document Embedding with Paragraph Vectors", each document has a unique ID - its title or identifier within Wikipedia or arXiv. But then the resulting doc-vectors are evaluated by how well they place documents with the same category labels closer to each other than to third documents. So there's both a learned doc-tag space, and a downstream evaluation based on other labels (that weren't in any way provided to the unsupervised Paragraph Vector algorithm).
Similarly, you might give all training documents unique IDs, but then later train a separate classifier (of any algorithm) to use the doc-vectors as inputs and learn to predict other labels. That's my understanding of the IMDB experiment in the original 'Paragraph Vectors' paper: every review had a unique ID during training and thus got its own doc-vector, and then a downstream classifier was trained to predict positive/negative review sentiment based on those doc-vectors. So the assessment/prediction of the labels ("positive"/"negative") was a separate downstream step.
As mentioned, it's sometimes the case that re-using known category-labels as doc-ids – either as the only doc-ID, or as an extra ID in addition to a unique-per-document ID – can be useful. In a way, it creates synthetic combined documents for training, made up of all documents with the same label. This may tend to influence the final space/coordinates to be more discriminative with regard to the known labels, and thus make the resulting doc-vectors more helpful to downstream classifiers. But then you've replaced classic 'Paragraph Vector', with one ID per doc, with a similar semi-supervised approach where known labels influence training.
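To answer the "how can I do it with logistic regression" part concretely: once each training document has a doc-vector, you can train any off-the-shelf classifier on those vectors. A rough sketch in DL4J terms follows; class and builder names vary between DL4J versions, and buildOneHot is a hypothetical helper that turns a document's label list into a one-hot row, so treat this as an outline rather than exact API:
int vectorSize = paragraphVectors.getLookupTable().layerSize(); // dimensionality of each doc-vector
int numLabels = iterator.getLabelsSource().getLabels().size();
// 1) Build the feature matrix: one inferred vector per labeled training document,
//    plus a matching one-hot label row.
List<INDArray> featureRows = new ArrayList<>();
List<INDArray> labelRows = new ArrayList<>();
iterator.reset();
while (iterator.hasNextDocument()) {
    LabelledDocument doc = iterator.nextDocument();
    featureRows.add(paragraphVectors.inferVector(doc.getContent()));
    labelRows.add(buildOneHot(doc.getLabels(), numLabels)); // hypothetical helper
}
INDArray features = Nd4j.vstack(featureRows.toArray(new INDArray[0]));
INDArray labels = Nd4j.vstack(labelRows.toArray(new INDArray[0]));
// 2) A single softmax output layer trained with multi-class cross-entropy
//    is exactly multinomial logistic regression.
MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
        .list()
        .layer(0, new OutputLayer.Builder(LossFunctions.LossFunction.MCXENT)
                .nIn(vectorSize).nOut(numLabels)
                .activation(Activation.SOFTMAX)
                .build())
        .build();
MultiLayerNetwork logisticRegression = new MultiLayerNetwork(conf);
logisticRegression.init();
logisticRegression.fit(features, labels);
(For reference, the example's MeansBuilder/LabelSeeker step appears to score each unlabeled document by cosine similarity between its averaged vector and each label's vector, not by logistic regression.)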

Enhance the degree of parallelization of groupReduce transformation

In my Flink program I transform my data using a flatMap operation which divides several blocks of data into multiple smaller blocks. These blocks have a "position" attribute which describes their position in the respective original block. Now I use a groupReduce which needs to transform all small blocks that share the same "position" attribute, so it should be easy to distribute across multiple nodes. But when I run my program on multiple nodes, the groupReduce is executed with a degree of parallelism (dop) of 1.
I guess this is because I have only one DataSet, but it seems that a GroupedDataSet is not available in the Flink Java API. Is there another way to increase the dop of my groupReduce transformation?
Here is the code I am using (dummy code ignoring "details"):
DataSet<SlicedTile> slicedTiles = tiles.flatMap()
    .groupBy(position)
    .sortGroup(time)
    .getDataSet();
// Until here the dop is correct
DataSet<SlicedTile> processedSlicedTiles = slicedTiles.reduceGroup(...);
The problem with your code is the getDataSet() call. It returns the input of the grouping operation. Hence, the dataset represented by slicedTiles is neither grouped nor are its groups sorted; instead, it is simply the result of the flatMap transformation, and the groupBy and sortGroup calls are not considered in the program at all.
Applying a groupReduce (or reduce) operation on a non-grouped dataset is always a non-parallel operation because all elements of the input data set are processed as a single group.
Logically, the three transformations groupBy().sortGroup().reduceGroup() belong together and are translated into a single groupReduce operator (possibly with an additional combiner if the GroupReduceFunction is combinable).
If you change your implementation as follows, it should work as expected.
DataSet<SlicedTile> slicedTiles = tiles.flatMap()
    .groupBy(position)
    .sortGroup(time)
    .reduceGroup(yourFunction);
I will open a JIRA issue to add JavaDocs to the Grouping.getDataSet() method to document the behavior of this function.
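As a sketch of the combinable case mentioned above, a group-reduce function can implement both interfaces (SlicedTile is the placeholder type from the question; depending on your Flink version the combinable behaviour is declared slightly differently):
public static class ProcessSlices
        implements GroupReduceFunction<SlicedTile, SlicedTile>,
                   GroupCombineFunction<SlicedTile, SlicedTile> {

    @Override
    public void reduce(Iterable<SlicedTile> slices, Collector<SlicedTile> out) {
        // all slices sharing the same "position" arrive here, sorted by time
        for (SlicedTile slice : slices) {
            out.collect(slice); // replace with the real per-position processing
        }
    }

    @Override
    public void combine(Iterable<SlicedTile> slices, Collector<SlicedTile> out) {
        // optional local pre-aggregation executed before the data is shuffled
        reduce(slices, out);
    }
}
Passing an instance of this to reduceGroup(...) lets Flink run the combine step locally before the shuffle, which can reduce network traffic when your logic allows partial pre-aggregation; the parallelism of the groupReduce itself then follows from the grouping, as in the corrected snippet above.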

Lucene Solr using complex filters

I am currently having a problem with specifying filters for Lucene/Solr. Every solution I come up with breaks other solutions. Let me start with an example. Assume that we have the following 5 documents:
doc1 = [type:Car, sold:false, owner:John]
doc2 = [type:Bike, productID:1, owner:Brian]
doc3 = [type:Car, sold:true, owner:Mike]
doc4 = [type:Bike, productID:2, owner:Josh]
doc5 = [type:Car, sold:false, owner:John]
So I need to construct the following filter queries:
Give me all documents of type:Car that have sold:false, and if a document's type is different from Car, include it in the result as well. So basically I want docs 1, 2, 4, 5; the only document I don't want is doc3, because it has sold:true. To put it more precisely:
for each document d in solr/lucene
    if d.type == Car {
        if d.sold == false, then add to result
        else ignore
    }
    else {
        add to result
    }
return result
Filter in all documents that are of (type:Car and sold:false) or (type:Bike and productID:1). So for this I will get 1,2,5.
Get all documents that if the type:Car then get only with sold:false, otherwise get me documents from owners John, Brian, Josh. So for this query I should get 1, 2, 4, 5.
Note: You don't know all the types in the documents. Here it is obvious because of the small number of documents.
So my solutions were:
(-type:Car) OR ((type:Car) AND (sold:false)). This works fine and as expected.
((-type:Car) OR ((type:Car) AND (sold:false))) AND ((-type:Bike) OR ((type:Bike) AND (productID:1))). This solution does not work.
((owner:John) OR (owner:Brian) OR (owner:Josh)) AND ((-type:Car) OR ((type:Car) AND (sold:false))). This does not work; I can make it work if I do this: ((owner:John) OR (owner:Brian) OR (owner:Josh)) AND ((version:* OR (-type:Car)) OR ((type:Car) AND (sold:false))). I don't understand why the version:* is needed; logically the first form should already work, but Solr/Lucene somehow handles it differently.
Okay, to get anything but a sold car, you could use -(type:Car sold:true).
This can be incorporated into the other queries, but you'll need to be careful with lonely negative queries like this. Lucene doesn't handle them well, generally speaking, and Solr has some odd gotchas as well. In particular, A -B reads more like "get all A but forbid B" rather than "get all A and anything but B". A similar problem occurs with A OR -B; see this question for more.
To get around that, you'll need to surround the negative with an extra set of parentheses, to ensure it is understood by Solr to be a standalone negative query, like: (-(type:Car AND sold:true))
So:
-(type:Car AND sold:true) (This doesn't get the result you stated, but as per my comment, I don't really understand your stated results)
(type:Bike AND productID:1) (-(type:Car AND sold:true)) (You actually wrote this in the description of the problem!)
(-(type:Car AND sold:true)) owner:(John Brian Josh)
My advice is to use programmatic Lucene (that is, directly in Java using the Java Lucene API) rather than issuing text queries which will be interpreted. This will give you much more fine-grained control.
What you're going to want to do is construct a Lucene Filter Object using the QueryWrapperFilter API. A QueryWrapperFilter is a filter which takes a Lucene Query, and filters out any documents which do not match that query.
In order to use QueryWrapperFilter, you'll need to construct a Query which matches the terms you're interested in. The best way to do this is to use TermQuery:
TermQuery tq = new TermQuery(new Term("fieldname", "value"));
As you might have guessed, you'll want to replace "fieldname" with the name of a field, and "value" with a desired value. For example, from your example in the OP, you might want to do something like new Term("type", "Car").
This only matches a single term. You're going to need multiple TermQueries, and a way to combine them to create a single, larger query. The best way to do this is with BooleanQuery:
BooleanQuery bq = new BooleanQuery();
bq.add(tq, BooleanClause.Occur.MUST);
You can call bq.add as many times as you want - once for each TermQuery that you have. The second argument specifies how strict the clause is: it can specify that a sub-query MUST appear, SHOULD appear, or MUST_NOT appear (these are the three values of the BooleanClause.Occur enum).
After you've added each of the sub-queries, this BooleanQuery represents the full query which will match only the documents you ask for. However, it's still not a filter. We now need to feed it to QueryWrapperFilter, which will give us back a filter object:
QueryWrapperFilter qwf = new QueryWrapperFilter(bq);
That should do it. Then if you want to run queries over only the documents allowed through by that filter, you just take your new query (call it q) and your filter, and create a FilteredQuery:
FilteredQuery fq = new FilteredQuery(q, qwf);
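Putting the pieces together, here is a minimal sketch for the first example filter ("anything but a sold car") using the pre-5.x filter API described above. Field and value names are taken from the question; a purely negative BooleanQuery matches nothing on its own, hence the MatchAllDocsQuery clause:
// classes from org.apache.lucene.index and org.apache.lucene.search
// type:Car AND sold:true
BooleanQuery soldCar = new BooleanQuery();
soldCar.add(new TermQuery(new Term("type", "Car")), BooleanClause.Occur.MUST);
soldCar.add(new TermQuery(new Term("sold", "true")), BooleanClause.Occur.MUST);
// everything except (type:Car AND sold:true)
BooleanQuery notSoldCar = new BooleanQuery();
notSoldCar.add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST);
notSoldCar.add(soldCar, BooleanClause.Occur.MUST_NOT);
// wrap it as a filter and apply it to whatever query q you are actually running
QueryWrapperFilter filter = new QueryWrapperFilter(notSoldCar);
FilteredQuery filteredQuery = new FilteredQuery(q, filter);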

Solve terrible performance after upgrading from Lucene 4.0 to 4.1

After upgrading from Lucene 4.0 to 4.1 my solution's performance degraded by more than an order of magnitude. The immediate cause is the unconditional compression of stored fields. For now I'm reverting to 4.0, but this is clearly not the way forward; I'm hoping to find a different approach to my solution.
I use Lucene as a database index, meaning my stored fields are quite short: just a few words at most.
I use a CustomScoreQuery where in CustomScoreProvider#customScore I end up loading all candidate documents and perform detailed word-similarity scoring against the query. I employed two levels of heuristic to narrow down the candidate document set (based on Dice's coefficient), but in the last step I need to match up each query word against each document word (they could be in different order) and calculate the total score based on the sum of best word matches.
How could I approach this differently and do my calculation in a way that avoids the pitfall of loading compressed fields during query evaluation?
In the IndexWriterConfig, you can pass in a Codec, which defines the storage format used by the index. This only takes effect when the IndexWriter is constructed (that is, changing the config after construction will have no effect). You'll want to use Lucene40Codec.
Something like:
//You could also simply pass in Version.LUCENE_40 here, and not worry about the Codec
//(though that will likely affect other things as well)
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_41, analyzer);
config.setCodec(new Lucene40Codec());
IndexWriter writer = new IndexWriter(directory, config);
You could also use Lucene40StoredFieldsFormat directly to get the old, uncompressed stored field format, and pass it back from a custom Codec implementation. You could probably take most of the code from Lucene41Codec, and just replace the storedFieldFormat() method. Might be the more targeted approach, but a touch more complex, and I don't know for sure whether you might run into other issues.
A further note on creating a custom codec: the way the API indicates you should accomplish this is to extend FilterCodec. Modifying their example a bit to fit:
public final class CustomCodec extends FilterCodec {

    public CustomCodec() {
        super("CustomCodec", new Lucene41Codec());
    }

    @Override
    public StoredFieldsFormat storedFieldsFormat() {
        return new Lucene40StoredFieldsFormat();
    }
}
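To actually use the custom codec at write time, set it on the config the same way as above (a sketch; note that for the index to be readable later, the codec name also has to be resolvable through Lucene's SPI mechanism, e.g. via a META-INF/services entry):
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_41, analyzer);
config.setCodec(new CustomCodec());
IndexWriter writer = new IndexWriter(directory, config);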
Of course, the other implementation that springs to mind:
I think it's clear to you, as well, that the issue is right around "I end up loading all candidate documents". I won't editorialize too much on a scoring implementation I don't have complete details on, but it sounds like you're fighting against Lucene's architecture to make it do what you want. Stored fields generally shouldn't be used for scoring, and you can expect performance to suffer very noticeably with the 4.0 stored field format as well, though to a somewhat lesser extent. Might there be a better implementation, either in terms of the scoring algorithm or in terms of document structure, that would remove the requirement to score documents based on stored fields?
With Lucene 3.x I had this:
new CustomScoreQuery(bigramQuery, new FieldScoreQuery("bigram-count", Type.BYTE)) {
    protected CustomScoreProvider getCustomScoreProvider(IndexReader ir) {
        return new CustomScoreProvider(ir) {
            public double customScore(int docnum, float bigramFreq, float docBigramCount) {
                ... calculate Dice's coefficient using bigramFreq and docBigramCount ...
                if (diceCoeff >= threshold) {
                    String[] stems = ir.document(docnum).getValues("stems");
                    ... calculate document similarity score using stems ...
                }
            }
        };
    }
}
This approach allowed efficient retrieval of cached float values from stored fields, which I used to get the bigram count of a document; it didn't allow retrieving strings, so I needed to load the document to get what I need to calculate document similarity score. It worked okayish until the Lucene 4.1 change to compress stored fields.
The proper way to leverage the enhancements in Lucene 4 is to involve DocValues like this:
new CustomScoreQuery(bigramQuery) {
    protected CustomScoreProvider getCustomScoreProvider(ReaderContext rc) {
        final AtomicReader ir = ((AtomicReaderContext) rc).reader();
        final ValueSource
            bgCountSrc = ir.docValues("bigram-count").getSource(),
            stemSrc = ir.docValues("stems").getSource();
        return new CustomScoreProvider(rc) {
            public float customScore(int docnum, float bgFreq, float... fScores) {
                final long bgCount = bgCountSrc.getInt(docnum);
                ... calculate Dice's coefficient using bgFreq and bgCount ...
                if (diceCoeff >= threshold) {
                    final String stems =
                        stemSrc.getBytes(docnum, new BytesRef()).utf8ToString();
                    ... calculate document similarity score using stems ...
                }
            }
        };
    }
}
This resulted in a performance improvement from 16 ms (Lucene 3.x) to 10 ms (Lucene 4.x).
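For completeness, the DocValues approach only works if the fields are written as doc values at index time. A sketch of the indexing side, with field names from the snippets above (the exact field classes differ between early and later 4.x releases; the names below are the later NumericDocValuesField/BinaryDocValuesField variants, and stemsAsSingleString is an assumed concatenation of the stems):
Document doc = new Document();
// numeric doc value holding the per-document bigram count
doc.add(new NumericDocValuesField("bigram-count", bigramCount));
// binary doc value holding the stems as UTF-8 bytes
doc.add(new BinaryDocValuesField("stems", new BytesRef(stemsAsSingleString)));
writer.addDocument(doc);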

Fuzzy Queries in Lucene

I am using Lucene in Java and indexing a table in our database based on company name. After indexing, I wish to do a fuzzy match (Levenshtein distance) on a value we want to insert into the database. The reason is that we do not want to enter duplicates because of spelling errors.
For example if I have the company name "Widget Makers XYZ" I don't want to insert "Widget Maker XYZ".
From what I've read, Lucene's fuzzy match algorithm should give me a number between 0 and 1. I want to do some testing and then determine an adequate threshold for deciding what is valid or invalid.
The problem is I am stuck, and after searching what seems like everywhere on the internet, I need the Stack Overflow community's help.
Like I said, I have indexed the database on company name, and I have the following code:
IndexSearcher searcher = new IndexSearcher(directory);
new QueryParser(Version.LUCENE_30, "company", analyzer);
Query fuzzy_query = new FuzzyQuery(new Term("company", "Center"));
I encounter the problem afterwards: basically, I do not know how to get the fuzzy match value. I know the code must look something like the following, but no collector seems to fit my needs. (As you can see, right now I am only able to count the number of matches, which is useless to me.)
TopScoreDocCollector collector = TopScoreDocCollector.create(10, true);
searcher.search(fuzzy_query, collector);
System.out.println("\ncollector.getTotalHits() = " + collector.getTotalHits());
Also, I am unable to use the ComplexPhraseQueryParser class which is shown in the Lucene documentation. I am doing:
import org.apache.lucene.queryParser.*;
Does anybody have an idea as to why it's inaccessible, or what I am doing wrong? Apologies for the length of the question.
You do not need Lucene to get the score. Take a look at the Simmetrics library; it is exceedingly simple to use. Just add the jar and use it like this:
Levenshtein ld = new Levenshtein();
float sim = ld.getSimilarity(string1, string2);
Also do note, depending on the type of data (i.e. longer strings, # whitespaces etc.), you might want to look at other algorithms such as Jaro-Winkler, Smith-Waterman etc.
You could use the above to decide whether to collapse fuzzy-duplicate strings into one "master" string and then index that.
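For example, a hypothetical pre-insert check along those lines (existingCompanyNames, candidateName, and the threshold value are placeholders to adapt to your data):
Levenshtein metric = new Levenshtein();
float threshold = 0.9f; // tune experimentally against your own data
boolean isDuplicate = false;
for (String existing : existingCompanyNames) {
    // getSimilarity returns a value in [0, 1]; 1.0 means the strings are identical
    if (metric.getSimilarity(existing, candidateName) >= threshold) {
        isDuplicate = true; // e.g. "Widget Makers XYZ" vs "Widget Maker XYZ"
        break;
    }
}
if (!isDuplicate) {
    // safe to insert candidateName as a new company and index it
}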
You can get the match values with:
TopDocs topDocs = collector.topDocs();
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    System.out.println(scoreDoc.score);
}
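If you also want to see which company each score belongs to and compare it against a cut-off, a small extension of the same loop (the threshold is a made-up starting value; note that these are Lucene relevance scores rather than raw 0-1 Levenshtein similarities, so calibrate the cut-off experimentally):
float threshold = 1.0f; // made-up starting point, adjust after testing
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    Document match = searcher.doc(scoreDoc.doc);
    if (scoreDoc.score >= threshold) {
        System.out.println("Possible duplicate: " + match.get("company") + " (score = " + scoreDoc.score + ")");
    }
}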
