Would anyone please explain how (and what for) to use Lucene's TeeSinkTokenFilter? An example would be much appreciated too =P. I didn't find the official documentation very clear, and I've also looked at many sites without much progress.
Thanks.
Yeah, I didn't think the official documentation was very clear, either. I think part of what made it so confusing is that it demonstrated two different features together in a way that made it hard to tell them apart. Let me see if I can rewrite their example to just show the basic case.
// One tokenizer feeds two sinks; each sink then gets its own downstream
// filter chain, so the whitespace tokenization is done only once.
TeeSinkTokenFilter source1 = new TeeSinkTokenFilter(
        new WhitespaceTokenizer(version, reader1));
TeeSinkTokenFilter.SinkTokenStream sink1 = source1.newSinkTokenStream();
TeeSinkTokenFilter.SinkTokenStream sink2 = source1.newSinkTokenStream();
source1.consumeAllTokens(); // all tokens get cached in the sinks at this point
TokenStream final3 = new EntityDetect(sink1); // EntityDetect and URLDetect
TokenStream final4 = new URLDetect(sink2);    // stand in for your own custom filters
d.add(new TextField("f3", final3, Field.Store.NO)); // d is the Document being indexed
d.add(new TextField("f4", final4, Field.Store.NO));
This allows the final3 and final4 token streams to share the processing done by source1. As the official documentation says, the order in which the streams are consumed is important; what it doesn't say is that the order appears to be indeterminate (or maybe alphabetical by field name). I recommend calling the consumeAllTokens method up front, as I've done above.
I am learning Java through an introductory course using the textbook Java Programming 9th Edition by Joyce Farrel. The examples and exercises are written for the book's Java version; however, I am using Java SE 14.
I've successfully navigated the Java API and found updates, as well as useful explanations of the errors I've been encountering between the two versions and the best way to correct the examples and exercises to get them working.
In this case, however, I've been having a really hard time. I'm pretty sure it's due to my lack of experience, but I can't find anything in the Java API that I understand well enough to give me an idea of how to resolve this issue. Google and Stack Overflow searches have not been much more successful; I assume people are using a much more streamlined method or approach.
Code with the comment on the line in question:
...
Path rafTest = Paths.get("raf.txt");
String addIn = "abc";
byte[] data = addIn.getBytes();
ByteBuffer out = ByteBuffer.wrap(data);
FileChannel fc = null;
try {
fc = (FileChannel) Files.newByteChannel(rafTest, READ, WRITE); // Error: READ and WRITE is ambiguous?
...
} catch (Exception e){
System.out.println("Error message: " + e);
}
...
What is the best way to go about figuring out what exactly is going on here?
@Bradley: Found the answer by trying to rewrite my question. The compiler returned 3 specific errors dealing with StandardOpenOption. Using those and the Java API, I found the solution. Thank you.
@NomadMaker: My first thought was that I had not imported the package for newByteChannel correctly. My second thought was that the arguments needed a more specific reference.
Answer: newByteChannel(...) requires the open-option arguments to reference StandardOpenOption.READ and StandardOpenOption.WRITE explicitly, so that:
...newByteChannel(rafTest, StandardOpenOption.READ, StandardOpenOption.WRITE);
This change was implemented in Java SE 11. The program now works correctly.
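For reference, here's a minimal, self-contained sketch of the fixed call. The file name follows the question; the CREATE option and the try-with-resources wrapper are my own additions, not part of the original code:
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class RafTest {
    public static void main(String[] args) {
        Path rafTest = Paths.get("raf.txt");
        ByteBuffer out = ByteBuffer.wrap("abc".getBytes());
        // Qualifying the constants removes the "ambiguous" compiler error:
        try (FileChannel fc = (FileChannel) Files.newByteChannel(rafTest,
                StandardOpenOption.CREATE, StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            fc.write(out);
        } catch (Exception e) {
            System.out.println("Error message: " + e);
        }
    }
}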
I'm building an app that speaks a custom message that includes the current time. I've noticed that some voices seem to parse the time correctly; that is, when I pass in a string like "The time is now 7:07", they actually speak it as "seven-oh-seven."
However, other voices insist on saying "the time is now seven colon zero seven." It would be simple enough to write a function to parse it myself, but I'm trying to find something built-in so I won't have to worry about localization. Is there something I can do within TTS (attributed-string-related, maybe?) or even a Java library that will provide me with a localized "text" string of a time? I've dug through the TTS documentation and didn't find anything, and all the Java time-formatting patterns I've seen are numbers-only, no words.
Figured it out: the TtsSpan is what's needed. It took me a while to get the format right; there are basically zero examples out there on how to do this. Here are the basics of what worked for me, in case anyone else comes across this with the same need (in Kotlin):
var h: Int = Calendar.getInstance().get(Calendar.HOUR)
if (DateFormat.is24HourFormat(applicationContext))
    h = Calendar.getInstance().get(Calendar.HOUR_OF_DAY)
val m: Int = Calendar.getInstance().get(Calendar.MINUTE)
val whenStr = getFormattedTime() // internal function I'm using to format it visually
val wholeString = message.replace("|TIME|", whenStr) // my time doesn't always appear at the same place in the custom message
val spannedMsg: Spannable = SpannableString(wholeString)
val span: TtsSpan = TtsSpan.TimeBuilder(h, m).build()
spannedMsg.setSpan(span, wholeString.indexOf(whenStr), wholeString.indexOf(whenStr) + whenStr.length, Spannable.SPAN_EXCLUSIVE_EXCLUSIVE)
tts.speak(spannedMsg, TextToSpeech.QUEUE_FLUSH, params, "UTT_ID")
According to the TtsSpan documentation page, there are plenty of other options aside from TimeBuilder as well, so be sure to check it out if you're looking for different customizations; just be aware that there aren't any code examples.
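To give one more example of those other builders, here's a hedged sketch using CardinalBuilder, which marks a run of digits to be read as a whole number rather than digit by digit (shown in Java; it's the same android.text.style.TtsSpan API as the Kotlin above, and tts is assumed to be an initialized TextToSpeech instance):
// The message text here is just a placeholder
Spannable msg = new SpannableString("You have 1500 points");
TtsSpan cardinal = new TtsSpan.CardinalBuilder().setNumber(1500).build();
int start = msg.toString().indexOf("1500");
msg.setSpan(cardinal, start, start + "1500".length(), Spannable.SPAN_EXCLUSIVE_EXCLUSIVE);
tts.speak(msg, TextToSpeech.QUEUE_FLUSH, null, "UTT_ID");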
My requirement is to be able to match two strings that are similar but not an exact match.
For example, given the following strings
First Name
Last Name
LName
FName
The output should pair First Name with FName, and Last Name with LName, as they are logical matches. Are there any libraries that I could use to do this? I am using Java to achieve this functionality.
Thanks
Raam
You could use Apache Commons StringUtils...
http://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/StringUtils.html#getLevenshteinDistance(java.lang.CharSequence,%20java.lang.CharSequence)
But it's worth noting that this may not be the best algorithm for the specific use-case in the question - I recommend reading some of the other answers here for more ideas.
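For completeness, a minimal sketch of calling it (this assumes commons-lang3 on the classpath; the strings come from the question):
import org.apache.commons.lang3.StringUtils;

public class DistanceDemo {
    public static void main(String[] args) {
        // Smaller distance means more similar; calibrate a threshold on your own data
        System.out.println(StringUtils.getLevenshteinDistance("First Name", "FName")); // 5
        System.out.println(StringUtils.getLevenshteinDistance("First Name", "LName")); // 6
    }
}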
Given the example you provided, you should use a modified Levenshtein distance where the penalty for adding spaces is small and the penalty for mismatched characters is larger. This will handle matching abbreviations to the strings they abbreviate quite well. However, that assumes you're mainly aligning abbreviations with longer versions of the same strings. If you want a more detailed and pointed answer about which methods you can or should use, you should spell out exactly what kinds of matches you want to perform (e.g. with more examples, or some kind of high-level description).
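To make that concrete, here's a hedged sketch of such a modified distance. The specific weights are arbitrary assumptions for illustration, not values prescribed above:
public class WeightedLevenshtein {
    static final int SPACE_GAP = 1; // cheap to insert/delete a space (assumed weight)
    static final int OTHER_GAP = 2; // other insertions/deletions
    static final int MISMATCH = 3;  // substituting one character for another

    public static int distance(String a, String b) {
        // Standard dynamic-programming edit distance, with per-character gap costs
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++)
            d[i][0] = d[i - 1][0] + gap(a.charAt(i - 1));
        for (int j = 1; j <= b.length(); j++)
            d[0][j] = d[0][j - 1] + gap(b.charAt(j - 1));
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int sub = d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : MISMATCH);
                int del = d[i - 1][j] + gap(a.charAt(i - 1));
                int ins = d[i][j - 1] + gap(b.charAt(j - 1));
                d[i][j] = Math.min(sub, Math.min(del, ins));
            }
        }
        return d[a.length()][b.length()];
    }

    private static int gap(char c) {
        return c == ' ' ? SPACE_GAP : OTHER_GAP;
    }

    public static void main(String[] args) {
        // Abbreviation pairs come out much closer than unrelated pairs
        System.out.println(distance("First Name", "FName"));
        System.out.println(distance("Last Name", "LName"));
    }
}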
StringUtils is arguably the simplest option for this, as @CupawnTae already said. Below is one of the simple examples I came across on Stack Overflow:
import org.apache.commons.lang3.StringUtils;

// Returns the element of the collection whose string form has the smallest
// Levenshtein distance to the target.
public static Object getTheClosestMatch(Collection<?> collection, Object target) {
    int distance = Integer.MAX_VALUE;
    Object closest = null;
    for (Object compareObject : collection) {
        int currentDistance = StringUtils.getLevenshteinDistance(compareObject.toString(), target.toString());
        if (currentDistance < distance) {
            distance = currentDistance;
            closest = compareObject;
        }
    }
    return closest;
}
An answer to a really similar question to yours can be found here.
Also, Wikipedia has an article on Approximate String Matching that can be found here. If the first link isn't what you're looking for, I would suggest reading the Wikipedia article and digging through its sources to find what you need.
Sorry I can't personally be of more help to you, but I really hope that these resources can help you find what you're looking for!
Spell-check algorithms use a variant of this algorithm: http://en.wikipedia.org/wiki/Levenshtein_distance. I implemented it for a class project and it was fairly simple to do. If you don't want to implement it yourself, you can use the name to search for other libraries.
I am currently having a problem with specifying filters for Lucene/Solr. Every solution I come up with breaks other solutions. Let me start with an example. Assume that we have the following 5 documents:
doc1 = [type:Car, sold:false, owner:John]
doc2 = [type:Bike, productID:1, owner:Brian]
doc3 = [type:Car, sold:true, owner:Mike]
doc4 = [type:Bike, productID:2, owner:Josh]
doc5 = [type:Car, sold:false, owner:John]
So I need to construct the following filter queries:
Give me all documents of type:Car that have sold:false, and include every document whose type is different from Car. So basically I want docs 1, 2, 4, and 5; the only document I don't want is doc3, because it has sold:true. To put it more precisely:
for each document d in solr/lucene
if d.type == Car {
if d.sold == false, then add to result
else ignore
}
else {
add to result
}
return result
Filter in all documents that are (type:Car and sold:false) or (type:Bike and productID:1). So for this I will get 1, 2, 5.
Get all documents such that if type:Car, only those with sold:false are included; otherwise, get the documents from owners John, Brian, and Josh. So for this query I should get 1, 2, 4, 5.
Note: You don't know all the types in the documents. Here it is obvious because of the small number of documents.
So my solutions were:
(-type:Car) OR ((type:Car) AND (sold:false)). This works fine and as expected.
((-type:Car) OR ((type:Car) AND (sold:false))) AND ((-type:Bike) OR ((type:Bike) AND (productID:1))). This solution does not work.
((owner:John) OR (owner:Brian) OR (owner:Josh)) AND ((-type:Car) OR ((type:Car) AND (sold:false))). This does not work. I can make it work if I do this: ((owner:John) OR (owner:Brian) OR (owner:Josh)) AND ((version:* OR (-type:Car)) OR ((type:Car) AND (sold:false))). I don't understand why, because logically the first form should work, but Solr/Lucene somehow handles it differently.
Okay, to get anything but a sold car, you could use -(type:Car sold:true).
This can be incorporated into the other queries, but you'll need to be careful with lonely negative queries like this. Lucene doesn't handle them well, generally speaking, and Solr has some odd gotchas as well. In particular, A -B reads more like "get all A but forbid B" rather than "get all A and anything but B". There's a similar problem with A OR -B; see this question for more.
To get around that, you'll need to surround the negative with an extra set of parentheses, to ensure it is understood by Solr to be a standalone negative query, like: (-(type:Car AND sold:true))
So:
-(type:Car AND sold:true) (This doesn't get the result you stated, but as per my comment, I don't really understand your stated results)
(type:Bike AND productID:1) (-(type:Car AND sold:true)) (You actually wrote this in the description of the problem!)
(-(type:Car AND sold:true)) owner:(John Brian Josh)
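If you're issuing these through SolrJ rather than raw query strings, here's a hedged sketch of passing one as a filter query (assuming a recent SolrJ where the client type is SolrClient, and a configured client instance named solrClient):
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;

SolrQuery query = new SolrQuery("*:*");
query.addFilterQuery("(-(type:Car AND sold:true))"); // the standalone-negative form from above
QueryResponse rsp = solrClient.query(query);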
My advice is to use programmatic Lucene (that is, directly in Java using the Java Lucene API) rather than issuing text queries which will be interpreted. This will give you much more fine-grained control.
What you're going to want to do is construct a Lucene Filter Object using the QueryWrapperFilter API. A QueryWrapperFilter is a filter which takes a Lucene Query, and filters out any documents which do not match that query.
In order to use QueryWrapperFilter, you'll need to construct a Query which matches the terms you're interested in. The best way to do this is to use TermQuery:
TermQuery tq = new TermQuery(new Term("fieldname", "value"));
As you might have guessed, you'll want to replace "fieldname" with the name of a field, and "value" with a desired value. For example, from your example in the OP, you might want to do something like new Term("type", "Car").
This only matches a single term. You're going to need multiple TermQueries, and a way to combine them to create a single, larger query. The best way to do this is with BooleanQuery:
BooleanQuery bq = new BooleanQuery();
bq.add(tq, BooleanClause.Occur.MUST); // note: Occur lives on BooleanClause, not BooleanQuery
You can call bq.add as many times as you want - once for each TermQuery that you have. The second argument specifies how strict the query is. It can specify that a sub-query MUST appear, SHOULD appear, or MUST NOT appear (these are the three values of the BooleanClause.Occur enum: MUST, SHOULD, and MUST_NOT).
After you've added each of the sub-queries, this BooleanQuery represents the full query which will match only the documents you ask for. However, it's still not a filter. We now need to feed it to QueryWrapperFilter, which will give us back a filter object:
QueryWrapperFilter qwf = new QueryWrapperFilter(bq);
That should do it. Then if you want to run queries over only the documents allowed through by that filter, you just take your new query (call it q) and your filter, and create a FilteredQuery:
FilteredQuery fq = new FilteredQuery(q, qwf);
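Putting the pieces together, here's a hedged end-to-end sketch for query 1 ("anything but a sold car"). It assumes Lucene 3.x/4.x-era APIs; QueryWrapperFilter and FilteredQuery were removed in later versions:
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.FilteredQuery;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.TermQuery;

// type:Car AND sold:true -- the documents we want to exclude
BooleanQuery soldCar = new BooleanQuery();
soldCar.add(new TermQuery(new Term("type", "Car")), BooleanClause.Occur.MUST);
soldCar.add(new TermQuery(new Term("sold", "true")), BooleanClause.Occur.MUST);

// everything except sold cars (a pure negation needs a positive clause to subtract from)
BooleanQuery notSoldCar = new BooleanQuery();
notSoldCar.add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST);
notSoldCar.add(soldCar, BooleanClause.Occur.MUST_NOT);

QueryWrapperFilter qwf = new QueryWrapperFilter(notSoldCar);
FilteredQuery fq = new FilteredQuery(new MatchAllDocsQuery(), qwf);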
I am using Lucene in Java, indexing a table in our database by company name. After indexing, I wish to do a fuzzy match (Levenshtein distance) against a value we want to insert into the database. The reason is that we do not want to enter duplicates caused by spelling errors.
For example if I have the company name "Widget Makers XYZ" I don't want to insert "Widget Maker XYZ".
From what I've read, Lucene's fuzzy match algorithm should give me a number between 0 and 1. I want to do some testing and then determine an adequate threshold for deciding what is valid or invalid.
The problem is I am stuck, and after searching what seems like everywhere on the internet, need the StackOverflow community's help.
Like I said I have indexed the database on company name, and then have the following code:
IndexSearcher searcher = new IndexSearcher(directory);
new QueryParser(Version.LUCENE_30, "company", analyzer); // note: this parser is created but never assigned or used
Query fuzzy_query = new FuzzyQuery(new Term("company", "Center"));
This is where I get stuck: I do not know how to get the fuzzy match value. I know the code must look something like the following; however, no collector seems to fit my needs. (As you can see, right now I am only able to count the number of matches, which is useless to me.)
TopScoreDocCollector collector = TopScoreDocCollector.create(10, true);
searcher.search(fuzzy_query, collector);
System.out.println("\ncollector.getTotalHits() = " + collector.getTotalHits());
Also, I am unable to use the ComplexPhraseQueryParser class, which is shown in the Lucene documentation. I am doing:
import org.apache.lucene.queryParser.*;
Does anybody have an idea as to why it's inaccessible, or what I am doing wrong? Apologies for the length of the question.
You do not need Lucene to get the score. Take a look at the SimMetrics library; it is exceedingly simple to use. Just add the jar and use it like this:
Levenshtein ld = new Levenshtein(); // uk.ac.shef.wit.simmetrics.similaritymetrics.Levenshtein
float sim = ld.getSimilarity(string1, string2);
Also note that, depending on the type of data (e.g. longer strings, amount of whitespace, etc.), you might want to look at other algorithms such as Jaro-Winkler or Smith-Waterman.
You could use the above to decide whether to collapse fuzzy-duplicate strings into one "master" string, and then index that.
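For instance, here's a hedged sketch of swapping in another metric; the class and package names assume the old uk.ac.shef.wit.simmetrics layout:
import uk.ac.shef.wit.simmetrics.similaritymetrics.JaroWinkler;

public class SimilarityDemo {
    public static void main(String[] args) {
        // Near-duplicate company names should score close to 1.0
        JaroWinkler jw = new JaroWinkler();
        System.out.println(jw.getSimilarity("Widget Makers XYZ", "Widget Maker XYZ"));
    }
}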
You can get the match values with:
TopDocs topDocs = collector.topDocs();
for(ScoreDoc scoreDoc : topDocs.scoreDocs) {
System.out.println(scoreDoc.score);
}
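Tying that back to the question's snippet, here's a hedged sketch of the full flow (assuming Lucene 3.x-era APIs, to match Version.LUCENE_30 above). One caveat: hit scores are relative to the query, not the normalized 0-1 similarity value itself:
Query fuzzyQuery = new FuzzyQuery(new Term("company", "Widget Makers XYZ"));
TopScoreDocCollector collector = TopScoreDocCollector.create(10, true);
searcher.search(fuzzyQuery, collector);
for (ScoreDoc sd : collector.topDocs().scoreDocs) {
    Document hit = searcher.doc(sd.doc); // load the stored fields of this hit
    System.out.println(hit.get("company") + " => " + sd.score);
}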