Here's what I did:
String textField1 = fastVectorHighlighter.getBestFragment(fastVectorHighlighter.getFieldQuery(query), indexReader, docId, SearchItem.FIELD_TEXT_FIELD1, DEFAULT_FRAGMENT_LENGTH);
Here's the query:
((FIELD_TEXT_FIELD1:十五*)^4.0) (FIELD_TEXT_FIELD3:十五*)
The original text is correct (indexReader.document(docId).get(SearchItem.FIELD_TEXT_FIELD3) returns the right value), and it definitely contains the characters in the query.
Here's how I index textField1:
Field textField1 = new TextField(SearchItem.FIELD_TEXT_FIELD1, "", Field.Store.YES);
Problem solved!
It turns out I needed to change
fastVectorHighlighter.getFieldQuery(query)
to
fastVectorHighlighter.getFieldQuery(query, indexReader)
Following the code into FieldQuery#flatten, we find that Lucene doesn't handle PrefixQuery the usual way:
} else if (sourceQuery instanceof CustomScoreQuery) {
  final Query q = ((CustomScoreQuery) sourceQuery).getSubQuery();
  if (q != null) {
    flatten( applyParentBoost( q, sourceQuery ), reader, flatQueries);
  }
} else if (reader != null) { // <<====== Here it is!
  Query query = sourceQuery;
  if (sourceQuery instanceof MultiTermQuery) {
    MultiTermQuery copy = (MultiTermQuery) sourceQuery.clone();
    copy.setRewriteMethod(new MultiTermQuery.TopTermsScoringBooleanQueryRewrite(MAX_MTQ_TERMS));
    query = copy;
  }
  Query rewritten = query.rewrite(reader);
  if (rewritten != query) {
    // only rewrite once and then flatten again - the rewritten query could have a special treatment
    // if this method is overridden in a subclass.
    flatten(rewritten, reader, flatQueries);
  }
We can see it needs an IndexReader for PrefixQuery, FuzzyQuery, etc.
Related
Is there a best practice for handling optional sub-queries? So say my search service has
query = builder.bool().must(createQuery(field1, term1)).must(createQuery(field2, term2)).createQuery();
createQuery(field, term) {
if(term != null) {
return builder.keyword().onField(field).matching(term).createQuery();
}
return null;
}
With the default QueryBuilder, if I use a query like this and the term is null, the resulting query is "+term1 +null" or something along those lines, which causes a NullPointerException when the query is executed against the index. Is there a recommended way to avoid this issue? I was thinking about a custom QueryBuilder, but I'm not sure how to tell the fulltext session to use my implementation rather than its default. The only other way I can think of is something like
query;
query1 = createQuery(field1, term1);
query2 = createQuery(field2, term2);
if (query1 != null && query2 != null) {
    query = builder.bool().must(query1).must(query2).createQuery();
} else if (query1 != null && query2 == null) {
    query = query1;
} else if (query1 == null && query2 != null) {
    query = query2;
}

createQuery(field, term) {
    if (term != null) {
        return builder.keyword().onField(field).matching(term).createQuery();
    }
    return null;
}
But this gets really messy really fast if there are more than a handful of sub-queries.
What you might do is introduce a method whose sole purpose is to add a "must" in a null-safe way, i.e. something like this:
BooleanJunction junction = builder.bool();
must(junction, createQuery(field1, term1));
must(junction, createQuery(field2, term2));
query = junction.createQuery();

void must(BooleanJunction junction, Query query) {
    if (query != null) {
        junction.must(query);
    }
}

Query createQuery(String field, Object term) {
    if (term != null) {
        return builder.keyword().onField(field).matching(term).createQuery();
    }
    return null;
}
This would take out the "fluidity" of the BooleanJunction API, but since it's at the top-level only, I guess it's not so bad.
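The null-safe accumulation pattern the must() helper relies on can be sketched outside of Hibernate Search with a plain-Java stand-in (all names here are hypothetical, chosen only to illustrate the shape of the approach):

```java
import java.util.ArrayList;
import java.util.List;

public class NullSafeJunction {
    private final List<String> clauses = new ArrayList<>();

    // Adds a clause only if it is non-null, mirroring the null-safe must() helper above.
    public NullSafeJunction must(String clause) {
        if (clause != null) {
            clauses.add(clause);
        }
        return this;
    }

    // Joins the collected clauses the way a BooleanJunction combines its sub-queries.
    public String createQuery() {
        return "+" + String.join(" +", clauses);
    }

    public static void main(String[] args) {
        String query = new NullSafeJunction()
                .must("field1:term1")
                .must(null)            // a sub-query whose term was null: silently skipped
                .must("field2:term2")
                .createQuery();
        System.out.println(query);     // +field1:term1 +field2:term2
    }
}
```

The point is that the null check lives in exactly one place, so adding a tenth optional sub-query costs one more must(...) line instead of another branch in an if/else ladder.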
What about this:
org.json.JSONObject json = new org.json.JSONObject();
json.put(field1, term1);
json.put(field2, term2);
...
bool = builder.bool();
for (Iterator keys = json.keys(); keys.hasNext();) {
    String field = (String) keys.next();
    String term = (String) json.get(field);
    q = createQuery(field, term);
    if (q != null) {
        bool.must(q);
    }
}
query = bool.createQuery();
If you have duplicate fields with different terms, you must use append instead:
org.json.JSONObject json = new org.json.JSONObject();
json.append(field1, term1);
json.append(field2, term2);
...
bool = builder.bool();
for (Iterator keys = json.keys(); keys.hasNext();) {
    String field = (String) keys.next();
    JSONArray terms = (JSONArray) json.get(field);
    for (int i = 0; i < terms.length(); i++) {
        String term = (String) terms.get(i);
        q = createQuery(field, term);
        if (q != null) {
            bool.must(q);
        }
    }
}
query = bool.createQuery();
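If you'd rather not pull in org.json just to group terms by field, the same duplicate-field iteration works with a plain Map of lists. This is a hedged, self-contained sketch: createQuery here is a string-building stand-in for the real builder.keyword()...createQuery() call, returning null for null terms just like the helper in the question:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MultimapQueryBuilder {

    // field -> list of terms; the same field may appear with several terms.
    static Map<String, List<String>> multimap = new LinkedHashMap<>();

    static void append(String field, String term) {
        multimap.computeIfAbsent(field, k -> new ArrayList<>()).add(term);
    }

    // Stand-in for builder.keyword().onField(field).matching(term).createQuery();
    // returns null for null terms, matching the createQuery(field, term) helper above.
    static String createQuery(String field, String term) {
        return term == null ? null : field + ":" + term;
    }

    public static void main(String[] args) {
        append("name", "James");
        append("name", "John");
        append("age", null);

        List<String> mustClauses = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : multimap.entrySet()) {
            for (String term : entry.getValue()) {
                String q = createQuery(entry.getKey(), term);
                if (q != null) {
                    mustClauses.add(q);   // bool.must(q) in the real code
                }
            }
        }
        System.out.println(mustClauses);  // [name:James, name:John]
    }
}
```

A LinkedHashMap keeps insertion order, so the clauses come out in the order the fields were appended, which org.json does not guarantee.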
I have created a Twitter stream filtered by some keywords as follows:
TwitterStream twitterStream = getTwitterStreamInstance();
FilterQuery filtre = new FilterQuery();
String[] keywordsArray = { "iphone", "samsung" , "apple", "amazon"};
filtre.track(keywordsArray);
twitterStream.filter(filtre);
twitterStream.addListener(listener);
What is the best way to segregate tweets based on the keywords matched? E.g. all the tweets that match "iphone" should be stored in the "IPHONE" table, all the tweets that match "samsung" in the "SAMSUNG" table, and so on. NOTE: the number of filter keywords is about 500.
It seems that the only way to find out which keyword a tweet belongs to is to iterate over multiple properties of the Status object. The following code requires a database service with a method insertTweet(String tweetText, Date createdAt, String keyword), and every tweet is stored in the database multiple times if multiple keywords are found. If at least one keyword is found in the tweet text, the additional properties are not searched for more keywords.
// creates a map of the keywords with a compiled pattern, which matches the keyword
private Map<String, Pattern> keywordsMap = new HashMap<>();
private TwitterStream twitterStream;
private DatabaseService databaseService; // implement and add this service

public void start(List<String> keywords) {
    stop(); // stop the streaming first, if it is already running
    if (keywords.size() > 0) {
        for (String keyword : keywords) {
            keywordsMap.put(keyword, Pattern.compile(keyword, Pattern.CASE_INSENSITIVE));
        }
        twitterStream = new TwitterStreamFactory().getInstance();
        StatusListener listener = new StatusListener() {
            @Override
            public void onStatus(Status status) {
                insertTweetWithKeywordIntoDatabase(status);
            }
            /* add the unimplemented methods from the interface */
        };
        twitterStream.addListener(listener);
        FilterQuery filterQuery = new FilterQuery();
        filterQuery.track(keywordsMap.keySet().toArray(new String[keywordsMap.keySet().size()]));
        filterQuery.language(new String[]{"en"});
        twitterStream.filter(filterQuery);
    } else {
        System.err.println("Could not start querying because there are no keywords.");
    }
}

public void stop() {
    keywordsMap.clear();
    if (twitterStream != null) {
        twitterStream.shutdown();
    }
}
private void insertTweetWithKeywordIntoDatabase(Status status) {
    // search for keywords in tweet text
    List<String> keywords = getKeywordsFromTweet(status.getText());
    if (keywords.isEmpty()) {
        StringBuffer additionalDataFromTweets = new StringBuffer();
        // get extended urls
        if (status.getURLEntities() != null) {
            for (URLEntity url : status.getURLEntities()) {
                if (url != null && url.getExpandedURL() != null) {
                    additionalDataFromTweets.append(url.getExpandedURL());
                }
            }
        }
        // get retweeted status -> text
        if (status.getRetweetedStatus() != null && status.getRetweetedStatus().getText() != null) {
            additionalDataFromTweets.append(status.getRetweetedStatus().getText());
        }
        // get retweeted status -> quoted status -> text
        if (status.getRetweetedStatus() != null && status.getRetweetedStatus().getQuotedStatus() != null
                && status.getRetweetedStatus().getQuotedStatus().getText() != null) {
            additionalDataFromTweets.append(status.getRetweetedStatus().getQuotedStatus().getText());
        }
        // get retweeted status -> quoted status -> extended urls
        if (status.getRetweetedStatus() != null && status.getRetweetedStatus().getQuotedStatus() != null
                && status.getRetweetedStatus().getQuotedStatus().getURLEntities() != null) {
            for (URLEntity url : status.getRetweetedStatus().getQuotedStatus().getURLEntities()) {
                if (url != null && url.getExpandedURL() != null) {
                    additionalDataFromTweets.append(url.getExpandedURL());
                }
            }
        }
        // get quoted status -> text
        if (status.getQuotedStatus() != null && status.getQuotedStatus().getText() != null) {
            additionalDataFromTweets.append(status.getQuotedStatus().getText());
        }
        // get quoted status -> extended urls
        if (status.getQuotedStatus() != null && status.getQuotedStatus().getURLEntities() != null) {
            for (URLEntity url : status.getQuotedStatus().getURLEntities()) {
                if (url != null && url.getExpandedURL() != null) {
                    additionalDataFromTweets.append(url.getExpandedURL());
                }
            }
        }
        String additionalData = additionalDataFromTweets.toString();
        keywords = getKeywordsFromTweet(additionalData);
    }
    if (keywords.isEmpty()) {
        System.err.println("ERROR: No keyword found for: " + status.toString());
    } else {
        // insert into database
        for (String keyword : keywords) {
            databaseService.insertTweet(status.getText(), status.getCreatedAt(), keyword);
        }
    }
}

// returns a list of keywords which are found in a tweet
private List<String> getKeywordsFromTweet(String tweet) {
    List<String> result = new ArrayList<>();
    for (String keyword : keywordsMap.keySet()) {
        Pattern p = keywordsMap.get(keyword);
        if (p.matcher(tweet).find()) {
            result.add(keyword);
        }
    }
    return result;
}
Here's how you'd use a StatusListener to interrogate the received Status objects:
final Set<String> keywords = new HashSet<String>();
keywords.add("apple");
keywords.add("samsung");
// ...

final StatusListener listener = new StatusAdapter() {
    @Override
    public void onStatus(Status status) {
        final String statusText = status.getText();
        for (String keyword : keywords) {
            if (statusText.contains(keyword)) {
                dao.insert(keyword, statusText);
            }
        }
    }
};
final TwitterStream twitterStream = getTwitterStreamInstance();
final FilterQuery fq = new FilterQuery();
fq.track(keywords.toArray(new String[0]));
twitterStream.addListener(listener);
twitterStream.filter(fq);
I see the DAO being defined along the lines of:
public interface StatusDao {
    void insert(String tableSuffix, Status status);
}
You would then have a DB table corresponding with each keyword. The implementation would use the tableSuffix to store the Status in the correct table, the sql would roughly look like:
INSERT INTO status_$tableSuffix$ VALUES (...)
Notes:
This implementation would insert a Status into multiple tables if a Tweet contained 'apple' and 'samsung' for instance.
Additionally, this is quite a naive implementation, you might want to consider batching inserts into the tables... but it depends on the volume of Tweets you'll be receiving.
As noted in the comments, the API considers other attributes when matching e.g. URLs and an embedded Tweet (if present) so searching the status text for a keyword match may not be sufficient.
Well, you could create a class similar to an ArrayList, but make it so you can create an array of ArrayLists; call it TweetList. This class will need an insert function.
Then use two for loops to search through the tweets for matching keywords contained in a normal ArrayList, and add each tweet to the TweetList whose index matches the index of the keyword in the keywords ArrayList:
for (int i = 0; i < tweets.length; i++)
{
    String[] split = tweets[i].split(" "); // split the tweet up
    for (int j = 0; j < split.length; j++)
        if (keywords.contains(split[j])) // check each word against the keyword list
            list[keywords.indexOf(split[j])].insert(tweets[i]); // add the tweet to the list whose index matches the keyword's index
}
Hi, I have a question about Lucene search.
Is it possible to search for a 2-character string in a file using Lucene?
For example, if there is a name like "karthik test", is it possible to search for "ka" or "te" in Lucene? If so, kindly provide a code snippet.
Yes, this is possible using wildcards.
Feed your QueryParser with te*, and it will generate a query that matches any term starting with the prefix te.
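Conceptually, a wildcard query like te* matches any indexed token that starts with te; stripped of Lucene internals, the matching rule is just a prefix test (tokens here invented for illustration):

```java
import java.util.List;
import java.util.stream.Collectors;

public class PrefixDemo {

    // What a trailing-wildcard (prefix) query does, reduced to plain Java:
    // keep every token that starts with the given prefix.
    static List<String> matchPrefix(List<String> tokens, String prefix) {
        return tokens.stream()
                .filter(t -> t.startsWith(prefix))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("karthik", "test");
        System.out.println(matchPrefix(tokens, "ka")); // [karthik]
        System.out.println(matchPrefix(tokens, "te")); // [test]
    }
}
```

Note that "ka*" and "te*" work with QueryParser out of the box, but a leading wildcard like "*te" is disallowed by default; you would need QueryParser.setAllowLeadingWildcard(true), and such queries are considerably slower.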
Maybe this will help you:
private List<String> search(String word, IndexSearcher searcher, Date fromDate, Date toDate, int skip, int noOfRecords) throws Exception {
    StandardAnalyzer analyzer = new StandardAnalyzer();
    BooleanQuery.Builder finalQuery = new BooleanQuery.Builder();
    List<String> results = null;
    for (String key : keyUtil.getAllKeys()) {
        // skip date/time keys
        if (!key.contains("Date") && !key.contains("Time")) {
            QueryParser queryParser = new QueryParser(key, analyzer);
            Query query = queryParser.parse(word);
            finalQuery.add(query, Occur.SHOULD);
        }
    }
    if (fromDate != null && toDate != null) {
        Query query = NumericDocValuesField.newSlowRangeQuery("StartDate", fromDate.getTime(), toDate.getTime());
        finalQuery.add(query, Occur.MUST);
    }
    TopDocs hits = searcher.search(finalQuery.build(), skip + noOfRecords);
    results = new ArrayList<>();
    if (hits.totalHits.value > 0) {
        int count = 0;
        for (ScoreDoc sd : hits.scoreDocs) {
            if (count >= skip) {
                Document d = searcher.doc(sd.doc);
                results.add(d.get("storePath"));
            }
            count++;
        }
    }
    analyzer.close();
    return results;
}
You can always use a RegEx pattern with the attribute "word", like * someWord *.
I'm using org.hibernate.criterion.Example.create to create my query from my Entity object. Everything is fine, but with this method the SQL is created with only AND clauses between the restrictions.
Is it possible to use org.hibernate.criterion.Example.create but with OR clause?
The short answer is no, you cannot do it directly, but you can implement an OrExample; it's pretty easy. Just check the source code of Example and change the and to or (see source code line 329). Since the methods are protected, you can extend the class and override just the necessary one.
Something like this:
public class OrExample extends org.hibernate.criterion.Example {

    @Override
    protected void appendPropertyCondition(
            String propertyName,
            Object propertyValue,
            Criteria criteria,
            CriteriaQuery cq,
            StringBuffer buf)
            throws HibernateException {
        Criterion crit;
        if (propertyValue != null) {
            boolean isString = propertyValue instanceof String;
            if (isLikeEnabled && isString) {
                crit = new LikeExpression(
                        propertyName,
                        (String) propertyValue,
                        matchMode,
                        escapeCharacter,
                        isIgnoreCaseEnabled
                );
            } else {
                crit = new SimpleExpression(propertyName, propertyValue, "=", isIgnoreCaseEnabled && isString);
            }
        } else {
            crit = new NullExpression(propertyName);
        }
        String critCondition = crit.toSqlString(criteria, cq);
        if (buf.length() > 1 && critCondition.trim().length() > 0) buf.append(" or ");
        buf.append(critCondition);
    }
}
See the or instead of the original and.
Yes, you can:
session.createCriteria(Person.class)
    .add(Restrictions.disjunction()
        .add(Restrictions.eq("name", "James"))
        .add(Restrictions.eq("age", 20))
    );
In the example above, class Person has properties name and age, and you would be selecting those people with name = "James" or age = 20.
An old post from SO may be helpful: Hibernate Criteria Restrictions AND / OR combination
Criteria criteria = getSession().createCriteria(clazz);
Criterion rest1 = Restrictions.and(Restrictions.eq("A", "X"),
        Restrictions.in("B", Arrays.asList("X", "Y")));
Criterion rest2 = Restrictions.and(Restrictions.eq("A", "Y"),
        Restrictions.eq("B", "Z"));
criteria.add(Restrictions.or(rest1, rest2));
Given a Lucene search query like: +(letter:A letter:B letter:C) +(style:Capital), how can I tell which of the three letters actually matched any given document? I don't care where they match, or how many times they match, I just need to know whether they matched.
The intent is to take the initial query ("A B C"), remove the terms which successfully matched (A and B), and then do further processing on the remainder (C).
Although the sample is in C#, the Lucene APIs are very similar (some upper/lower case differences). I don't think it would be hard to translate to Java.
This is the usage:
List<Term> terms = new List<Term>();    // will be filled with non-matched terms
List<Term> hitTerms = new List<Term>(); // will be filled with matched terms
GetHitTerms(query, searcher, docId, hitTerms, terms);
And here is the method:
void GetHitTerms(Query query, IndexSearcher searcher, int docId, List<Term> hitTerms, List<Term> rest)
{
    if (query is TermQuery)
    {
        if (searcher.Explain(query, docId).IsMatch() == true)
            hitTerms.Add((query as TermQuery).GetTerm());
        else
            rest.Add((query as TermQuery).GetTerm());
        return;
    }

    if (query is BooleanQuery)
    {
        BooleanClause[] clauses = (query as BooleanQuery).GetClauses();
        if (clauses == null) return;
        foreach (BooleanClause bc in clauses)
        {
            GetHitTerms(bc.GetQuery(), searcher, docId, hitTerms, rest);
        }
        return;
    }

    if (query is MultiTermQuery)
    {
        if (!(query is FuzzyQuery)) // FuzzyQuery doesn't support SetRewriteMethod
            (query as MultiTermQuery).SetRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);
        GetHitTerms(query.Rewrite(searcher.GetIndexReader()), searcher, docId, hitTerms, rest);
    }
}
As per the answer given by @L.B, here is the converted Java code which works for me:
void GetHitTerms(Query query, IndexSearcher searcher, int docId, List<Term> hitTerms, List<Term> rest) throws IOException
{
    if (query instanceof TermQuery)
    {
        if (searcher.explain(query, docId).isMatch())
            hitTerms.add(((TermQuery) query).getTerm());
        else
            rest.add(((TermQuery) query).getTerm());
        return;
    }

    if (query instanceof BooleanQuery)
    {
        for (BooleanClause clause : (BooleanQuery) query) {
            GetHitTerms(clause.getQuery(), searcher, docId, hitTerms, rest);
        }
        return;
    }

    if (query instanceof MultiTermQuery)
    {
        if (!(query instanceof FuzzyQuery)) // FuzzyQuery doesn't support setRewriteMethod
            ((MultiTermQuery) query).setRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);
        GetHitTerms(query.rewrite(searcher.getIndexReader()), searcher, docId, hitTerms, rest);
    }
}
I basically used the same approach as @L.B, but updated it for the newest Lucene version, 7.4.0. Note: FuzzyQuery now supports setRewriteMethod (that's why I removed the if).
I also included handling for BoostQuery instances, and saved the words found by Lucene in a HashSet of Strings (instead of Terms) to avoid duplicates.
private void saveHitWordInList(Query query, IndexSearcher indexSearcher,
        int docId, HashSet<String> hitWords) throws IOException {
    if (query instanceof TermQuery)
        if (indexSearcher.explain(query, docId).isMatch())
            hitWords.add(((TermQuery) query).getTerm().toString().split(":")[1]);
    if (query instanceof BooleanQuery) {
        for (BooleanClause clause : (BooleanQuery) query) {
            saveHitWordInList(clause.getQuery(), indexSearcher, docId, hitWords);
        }
    }
    if (query instanceof MultiTermQuery) {
        ((MultiTermQuery) query)
                .setRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_REWRITE);
        saveHitWordInList(query.rewrite(indexSearcher.getIndexReader()),
                indexSearcher, docId, hitWords);
    }
    if (query instanceof BoostQuery)
        saveHitWordInList(((BoostQuery) query).getQuery(), indexSearcher, docId,
                hitWords);
}
Here is a simplified and non-recursive version with Lucene.NET 4.8.
Unverified, but this should also work on Lucene.NET 3.x
IEnumerable<Term> GetHitTermsForDoc(Query query, IndexSearcher searcher, int docId)
{
    // Rewrite query into simpler internal form, required for ExtractTerms
    var simplifiedQuery = query.Rewrite(searcher.IndexReader);

    HashSet<Term> queryTerms = new HashSet<Term>();
    simplifiedQuery.ExtractTerms(queryTerms);

    List<Term> hitTerms = new List<Term>();
    foreach (var term in queryTerms)
    {
        var termQuery = new TermQuery(term);
        var explanation = searcher.Explain(termQuery, docId);
        if (explanation.IsMatch)
        {
            hitTerms.Add(term);
        }
    }
    return hitTerms;
}
You could use a cached filter for each of the individual terms, and quickly check each doc id against their BitSets.
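The cached-filter idea can be sketched with java.util.BitSet (doc ids and per-term postings invented for illustration): precompute one bitset of matching doc ids per term, and checking whether a term matched a given document becomes a constant-time get:

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TermBitSets {

    // Builds one bitset per term: bit i is set if doc i contains the term.
    static Map<String, BitSet> buildFilters(Map<String, List<Integer>> postings, int maxDoc) {
        Map<String, BitSet> filters = new HashMap<>();
        for (Map.Entry<String, List<Integer>> e : postings.entrySet()) {
            BitSet bits = new BitSet(maxDoc);
            for (int docId : e.getValue()) {
                bits.set(docId);
            }
            filters.put(e.getKey(), bits);
        }
        return filters;
    }

    public static void main(String[] args) {
        // Hypothetical postings: term -> doc ids containing it.
        Map<String, BitSet> filters = buildFilters(Map.of(
                "A", List.of(0, 2),
                "B", List.of(2),
                "C", List.of(1)), 3);

        int docId = 2;
        for (String term : List.of("A", "B", "C")) {
            System.out.println(term + " matched doc " + docId + ": "
                    + filters.get(term).get(docId));
        }
        // A and B matched doc 2; C did not.
    }
}
```

In a real index the per-term bitsets would come from running each term as its own query once and caching the result, trading memory (one bit per document per term) for very fast membership checks afterwards.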