Get terms present in a document with a collection [duplicate]

Given a Lucene search query like: +(letter:A letter:B letter:C) +(style:Capital), how can I tell which of the three letters actually matched any given document? I don't care where they match, or how many times they match, I just need to know whether they matched.
The intent is to take the initial query ("A B C"), remove the terms which successfully matched (A and B), and then do further processing on the remainder (C).

Although the sample is in C#, the Lucene APIs are very similar (apart from some upper/lower-case differences), so it should not be hard to translate to Java.
This is the usage:
List<Term> terms = new List<Term>(); //will be filled with non-matched terms
List<Term> hitTerms = new List<Term>(); //will be filled with matched terms
GetHitTerms(query, searcher,docId, hitTerms,terms);
And here is the method
void GetHitTerms(Query query,IndexSearcher searcher,int docId,List<Term> hitTerms,List<Term>rest)
{
if (query is TermQuery)
{
if (searcher.Explain(query, docId).IsMatch() == true)
hitTerms.Add((query as TermQuery).GetTerm());
else
rest.Add((query as TermQuery).GetTerm());
return;
}
if (query is BooleanQuery)
{
BooleanClause[] clauses = (query as BooleanQuery).GetClauses();
if (clauses == null) return;
foreach (BooleanClause bc in clauses)
{
GetHitTerms(bc.GetQuery(), searcher, docId,hitTerms,rest);
}
return;
}
if (query is MultiTermQuery)
{
if (!(query is FuzzyQuery)) //FuzzyQuery doesn't support SetRewriteMethod
(query as MultiTermQuery).SetRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);
GetHitTerms(query.Rewrite(searcher.GetIndexReader()), searcher, docId,hitTerms,rest);
}
}

Following the answer given by @L.B, here is the code converted to Java, which works for me:
void GetHitTerms(Query query,IndexSearcher searcher,int docId,List<Term> hitTerms,List<Term>rest) throws IOException
{
if(query instanceof TermQuery )
{
if (searcher.explain(query, docId).isMatch())
hitTerms.add(((TermQuery) query).getTerm());
else
rest.add(((TermQuery) query).getTerm());
return;
}
if(query instanceof BooleanQuery )
{
for (BooleanClause clause : (BooleanQuery)query) {
GetHitTerms(clause.getQuery(), searcher, docId,hitTerms,rest);
}
return;
}
if (query instanceof MultiTermQuery)
{
if (!(query instanceof FuzzyQuery)) //FuzzyQuery doesn't support setRewriteMethod
((MultiTermQuery)query).setRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);
GetHitTerms(query.rewrite(searcher.getIndexReader()), searcher, docId,hitTerms,rest);
}
}

I basically used the same approach as @L.B, but updated it for Lucene 7.4.0, the newest version at the time. Note: FuzzyQuery now supports .setRewriteMethod (that's why I removed the if).
I also added handling for BoostQuery, and instead of the Terms I save the words that Lucene found in a HashSet to avoid duplicates.
private void saveHitWordInList(Query query, IndexSearcher indexSearcher,
int docId, HashSet<String> hitWords) throws IOException {
if (query instanceof TermQuery)
if (indexSearcher.explain(query, docId).isMatch())
hitWords.add(((TermQuery) query).getTerm().text()); // text() returns the term without the field name, even if the text contains ':'
if (query instanceof BooleanQuery) {
for (BooleanClause clause : (BooleanQuery) query) {
saveHitWordInList(clause.getQuery(), indexSearcher, docId, hitWords);
}
}
if (query instanceof MultiTermQuery) {
((MultiTermQuery) query)
.setRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_REWRITE);
saveHitWordInList(query.rewrite(indexSearcher.getIndexReader()),
indexSearcher, docId, hitWords);
}
if (query instanceof BoostQuery)
saveHitWordInList(((BoostQuery) query).getQuery(), indexSearcher, docId,
hitWords);
}

Here is a simplified and non-recursive version with Lucene.NET 4.8.
Unverified, but this should also work on Lucene.NET 3.x
IEnumerable<Term> GetHitTermsForDoc(Query query, IndexSearcher searcher, int docId)
{
//Rewrite query into simpler internal form, required for ExtractTerms
var simplifiedQuery = query.Rewrite(searcher.IndexReader);
HashSet<Term> queryTerms = new HashSet<Term>();
simplifiedQuery.ExtractTerms(queryTerms);
List<Term> hitTerms = new List<Term>();
foreach (var term in queryTerms)
{
var termQuery = new TermQuery(term);
var explanation = searcher.Explain(termQuery, docId);
if (explanation.IsMatch)
{
hitTerms.Add(term);
}
}
return hitTerms;
}

You could use a cached filter for each of the individual terms, and quickly check each doc id against their BitSets.
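
The idea can be sketched in plain Java, assuming a hypothetical per-term cache that maps each term to a java.util.BitSet of matching doc ids. The class TermBitSetCache and its methods are illustrative names, not Lucene API; in Lucene the bit sets would come from a cached filter per term.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical cache (not Lucene API): one BitSet of matching doc ids per term.
class TermBitSetCache {
    private final Map<String, BitSet> docsByTerm = new HashMap<>();

    // Record that the given term occurs in the given document.
    void add(String term, int docId) {
        docsByTerm.computeIfAbsent(term, t -> new BitSet()).set(docId);
    }

    // Query terms whose bit is set for this document.
    List<String> hitTerms(List<String> queryTerms, int docId) {
        List<String> hits = new ArrayList<>();
        for (String term : queryTerms) {
            BitSet docs = docsByTerm.get(term);
            if (docs != null && docs.get(docId)) {
                hits.add(term);
            }
        }
        return hits;
    }

    // Query terms that did not match, i.e. the remainder to process further.
    List<String> restTerms(List<String> queryTerms, int docId) {
        List<String> rest = new ArrayList<>(queryTerms);
        rest.removeAll(hitTerms(queryTerms, docId));
        return rest;
    }

    public static void main(String[] args) {
        TermBitSetCache cache = new TermBitSetCache();
        cache.add("letter:A", 7);
        cache.add("letter:B", 7);
        cache.add("letter:C", 9);
        List<String> query = Arrays.asList("letter:A", "letter:B", "letter:C");
        System.out.println("hits: " + cache.hitTerms(query, 7));
        System.out.println("rest: " + cache.restTerms(query, 7));
    }
}
```

Each lookup is a single bit test, so checking many documents against a fixed set of terms stays cheap once the bit sets are built.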

Related

Lucene: FastVectorHighlighter returns null

Here's what I did:
String textField1 = fastVectorHighlighter.getBestFragment(fastVectorHighlighter.getFieldQuery(query), indexReader, docId, SearchItem.FIELD_TEXT_FIELD1, DEFAULT_FRAGMENT_LENGTH);
Here's the query:
((FIELD_TEXT_FIELD1:十五*)^4.0) (FIELD_TEXT_FIELD3:十五*)
The original text is correct (indexReader.document(docId).get(SearchItem.FIELD_TEXT_FIELD3) returns the expected value) and definitely contains the characters in the query.
Here's how I index textField1:
Field textField1 = new TextField(SearchItem.FIELD_TEXT_FIELD1, "", Field.Store.YES);

Problem solved!
It turns out I needed to change
fastVectorHighlighter.getFieldQuery(query)
to
fastVectorHighlighter.getFieldQuery(query, indexReader)
Following the code into FieldQuery#flatten, we find that Lucene only rewrites a PrefixQuery when a reader is available:
} else if (sourceQuery instanceof CustomScoreQuery) {
final Query q = ((CustomScoreQuery) sourceQuery).getSubQuery();
if (q != null) {
flatten( applyParentBoost( q, sourceQuery ), reader, flatQueries);
}
} else if (reader != null) { // <<====== Here it is!
Query query = sourceQuery;
if (sourceQuery instanceof MultiTermQuery) {
MultiTermQuery copy = (MultiTermQuery) sourceQuery.clone();
copy.setRewriteMethod(new MultiTermQuery.TopTermsScoringBooleanQueryRewrite(MAX_MTQ_TERMS));
query = copy;
}
Query rewritten = query.rewrite(reader);
if (rewritten != query) {
// only rewrite once and then flatten again - the rewritten query could have a speacial treatment
// if this method is overwritten in a subclass.
flatten(rewritten, reader, flatQueries);
}
We can see it needs an IndexReader to rewrite multi-term queries such as PrefixQuery and FuzzyQuery.
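
Why the reader matters can be shown with a rough plain-Java analogy (this is not Lucene code): a prefix only becomes a list of concrete terms once it is looked up in the term dictionary, and the dictionary lives in the index. In this sketch a sorted TreeSet stands in for that dictionary; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Plain-Java analogy: a sorted set of terms stands in for the index's term dictionary.
class PrefixExpansionSketch {
    // Expand a prefix into the concrete terms present in the dictionary.
    static List<String> expand(TreeSet<String> termDictionary, String prefix) {
        List<String> terms = new ArrayList<>();
        // tailSet starts at the first term >= prefix; matching terms are contiguous.
        for (String term : termDictionary.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // sorted order: no later term can start with the prefix
            }
            terms.add(term);
        }
        return terms;
    }

    public static void main(String[] args) {
        TreeSet<String> dictionary = new TreeSet<>(
                Arrays.asList("fifteen", "fifteenth", "fifty", "five"));
        // Without the dictionary there is nothing to expand "fifteen*" into.
        System.out.println(expand(dictionary, "fifteen"));
    }
}
```

Without a dictionary to consult, a prefix such as 十五* cannot be turned into the terms the highlighter has to look for, which matches the null result seen when no reader is passed.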

Ormlite Query BuilderCondition

I have a user form in which the user specifies search criteria, and I must apply them to fetch the right data from the database using ORMLite:
boolean set = false;
QueryBuilder<Client, Integer> builder = clientsDao.queryBuilder();
Where<Client, Integer> builderWhere = builder.where();
if (!tfSearchName.getText().equals("")) {
builderWhere.like("name", tfSearchName.getText().trim());
builderWhere.and();
set = true;
}
if (!tfSearchBalanceMin.getText().equals("")) {
builderWhere.gt("balance", tfSearchBalanceMin);
builderWhere.and();
set = true;
}
if (!tfSearchBalanceMax.getText().equals("")) {
builderWhere.lt("balance", tfSearchBalanceMax);
set = true;
}
clientTable.setItems(FXCollections.observableArrayList(
set ? clientsDao.query(builderWhere.prepare())
: clientsDao.queryForAll()));
The problem with the query builder is that there is always an AND clause at the end, so it always throws an exception.
I would like to know a good way to build my SQL statement conditionally, as I do in my code.
PS: sorry for my bad English.

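You could try Where.and(int numClauses), declared as public Where<T, ID> and(int numClauses). It joins the previous numClauses clauses with AND, so you only need to count the clauses you add.
For example: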
int andClauses= 0; // number of clauses that should be connected with "and" operation
QueryBuilder<Client, Integer> builder = clientsDao.queryBuilder();
Where<Client, Integer> builderWhere = builder.where();
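// TextUtils is Android-only, so the emptiness checks use plain String methods here
if (!tfSearchName.getText().trim().isEmpty()) {
builderWhere.like("name", tfSearchName.getText().trim());
andClauses++;
}
if (!tfSearchBalanceMin.getText().trim().isEmpty()) {
// pass the field's text, not the text field itself; convert it to the column's type if needed
builderWhere.gt("balance", tfSearchBalanceMin.getText().trim());
andClauses++;
}
if (!tfSearchBalanceMax.getText().trim().isEmpty()) {
builderWhere.lt("balance", tfSearchBalanceMax.getText().trim());
andClauses++;
}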
clientTable.setItems(FXCollections.observableArrayList(
andClauses > 0 ? clientsDao.query(builderWhere.and(andClauses).prepare())
: clientsDao.queryForAll()));
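
The underlying idea, collecting only the conditions that are actually set and joining them afterwards, can be sketched without ORMLite. The class WhereSketch and its methods are hypothetical; they only build a SQL fragment with placeholders and are not part of any library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: collects optional conditions and joins them at the end.
class WhereSketch {
    private final List<String> clauses = new ArrayList<>();
    private final List<Object> args = new ArrayList<>();

    // Add "column op ?" only when the user supplied a non-blank value.
    WhereSketch addIfPresent(String column, String op, String rawValue) {
        if (rawValue != null && !rawValue.trim().isEmpty()) {
            clauses.add(column + " " + op + " ?");
            args.add(rawValue.trim());
        }
        return this;
    }

    // True when no condition was set, i.e. the caller should query for all rows.
    boolean isEmpty() {
        return clauses.isEmpty();
    }

    // Joining afterwards means there is never a dangling AND.
    String toSql() {
        return String.join(" AND ", clauses);
    }

    List<Object> args() {
        return args;
    }

    public static void main(String[] unused) {
        WhereSketch where = new WhereSketch()
                .addIfPresent("name", "LIKE", "bob")
                .addIfPresent("balance", ">", "")      // empty field: skipped
                .addIfPresent("balance", "<", "100");
        System.out.println(where.toSql() + " " + where.args());
    }
}
```

Because the separator is inserted between clauses rather than after each one, the last filled-in field never needs special treatment.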

Segregating filtered tweets based on matched keywords : Twitter4j API

I have created a Twitter stream filtered by some keywords as follows:
TwitterStream twitterStream = getTwitterStreamInstance();
FilterQuery filtre = new FilterQuery();
String[] keywordsArray = { "iphone", "samsung" , "apple", "amazon"};
filtre.track(keywordsArray);
twitterStream.filter(filtre);
twitterStream.addListener(listener);
What is the best way to segregate tweets based on the keywords they matched? For example, all tweets that match "iphone" should be stored in the "IPHONE" table, all tweets that match "samsung" in the "SAMSUNG" table, and so on. Note: the number of filter keywords is about 500.

It seems that the only way to find out which keyword a tweet matched is to iterate over multiple properties of the Status object. The following code requires a database service with a method insertTweet(String tweetText, Date createdAt, String keyword). A tweet is stored once per matching keyword, so it is stored multiple times if multiple keywords are found. If at least one keyword is found in the tweet text, the additional properties are not searched for further keywords.
// creates a map of the keywords with a compiled pattern, which matches the keyword
private Map<String, Pattern> keywordsMap = new HashMap<>();
private TwitterStream twitterStream;
private DatabaseService databaseService; // implement and add this service
public void start(List<String> keywords) {
stop(); // stop the streaming first, if it is already running
if(keywords.size() > 0) {
for(String keyword : keywords) {
// quote the keyword so that regex metacharacters in it are matched literally
keywordsMap.put(keyword, Pattern.compile(Pattern.quote(keyword), Pattern.CASE_INSENSITIVE));
}
twitterStream = new TwitterStreamFactory().getInstance();
StatusListener listener = new StatusListener() {
@Override
public void onStatus(Status status) {
insertTweetWithKeywordIntoDatabase(status);
}
/* add the unimplemented methods from the interface */
};
twitterStream.addListener(listener);
FilterQuery filterQuery = new FilterQuery();
filterQuery.track(keywordsMap.keySet().toArray(new String[keywordsMap.keySet().size()]));
filterQuery.language(new String[]{"en"});
twitterStream.filter(filterQuery);
}
else {
System.err.println("Could not start querying because there are no keywords.");
}
}
public void stop() {
keywordsMap.clear();
if(twitterStream != null) {
twitterStream.shutdown();
}
}
private void insertTweetWithKeywordIntoDatabase(Status status) {
// search for keywords in tweet text
List<String> keywords = getKeywordsFromTweet(status.getText());
if (keywords.isEmpty()) {
StringBuffer additionalDataFromTweets = new StringBuffer();
// get extended urls
if (status.getURLEntities() != null) {
for (URLEntity url : status.getURLEntities()) {
if (url != null && url.getExpandedURL() != null) {
additionalDataFromTweets.append(url.getExpandedURL());
}
}
}
// get retweeted status -> text
if (status.getRetweetedStatus() != null && status.getRetweetedStatus().getText() != null) {
additionalDataFromTweets.append(status.getRetweetedStatus().getText());
}
// get retweeted status -> quoted status -> text
if (status.getRetweetedStatus() != null && status.getRetweetedStatus().getQuotedStatus() != null
&& status.getRetweetedStatus().getQuotedStatus().getText() != null) {
additionalDataFromTweets.append(status.getRetweetedStatus().getQuotedStatus().getText());
}
// get retweeted status -> quoted status -> extended urls
if (status.getRetweetedStatus() != null && status.getRetweetedStatus().getQuotedStatus() != null
&& status.getRetweetedStatus().getQuotedStatus().getURLEntities() != null) {
for (URLEntity url : status.getRetweetedStatus().getQuotedStatus().getURLEntities()) {
if (url != null && url.getExpandedURL() != null) {
additionalDataFromTweets.append(url.getExpandedURL());
}
}
}
// get quoted status -> text
if (status.getQuotedStatus() != null && status.getQuotedStatus().getText() != null) {
additionalDataFromTweets.append(status.getQuotedStatus().getText());
}
// get quoted status -> extended urls
if (status.getQuotedStatus() != null && status.getQuotedStatus().getURLEntities() != null) {
for (URLEntity url : status.getQuotedStatus().getURLEntities()) {
if (url != null && url.getExpandedURL() != null) {
additionalDataFromTweets.append(url.getExpandedURL());
}
}
}
String additionalData = additionalDataFromTweets.toString();
keywords = getKeywordsFromTweet(additionalData);
}
if (keywords.isEmpty()) {
System.err.println("ERROR: No Keyword found for: " + status.toString());
} else {
// insert into database
for(String keyword : keywords) {
databaseService.insertTweet(status.getText(), status.getCreatedAt(), keyword);
}
}
}
// returns a list of keywords which are found in a tweet
private List<String> getKeywordsFromTweet(String tweet) {
List<String> result = new ArrayList<>();
for (String keyword : keywordsMap.keySet()) {
Pattern p = keywordsMap.get(keyword);
if (p.matcher(tweet).find()) {
result.add(keyword);
}
}
return result;
}

Here's how you'd use a StatusListener to interrogate the received Status objects:
final Set<String> keywords = new HashSet<String>();
keywords.add("apple");
keywords.add("samsung");
// ...
final StatusListener listener = new StatusAdapter() {
@Override
public void onStatus(Status status) {
final String statusText = status.getText();
for (String keyword : keywords) {
if (statusText.contains(keyword)) {
dao.insert(keyword, status); // matches StatusDao.insert(String tableSuffix, Status status)
}
}
}
};
final TwitterStream twitterStream = getTwitterStreamInstance();
final FilterQuery fq = new FilterQuery();
fq.track(keywords.toArray(new String[0]));
twitterStream.addListener(listener);
twitterStream.filter(fq);
I see the DAO being defined along the lines of:
public interface StatusDao {
void insert(String tableSuffix, Status status);
}
You would then have a DB table corresponding to each keyword. The implementation would use the tableSuffix to store the Status in the correct table; the SQL would look roughly like:
INSERT INTO status_$tableSuffix$ VALUES (...)
Notes:
This implementation would insert a Status into multiple tables if a Tweet contained 'apple' and 'samsung' for instance.
Additionally, this is quite a naive implementation, you might want to consider batching inserts into the tables... but it depends on the volume of Tweets you'll be receiving.
As noted in the comments, the API considers other attributes when matching e.g. URLs and an embedded Tweet (if present) so searching the status text for a keyword match may not be sufficient.
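
With around 500 keywords, one option is to compile them into a single case-insensitive pattern and scan each text once. The sketch below is an assumption-laden illustration, not Twitter4j code: KeywordMatcherSketch is a hypothetical class, it assumes each keyword begins and ends with a letter or digit (so that the \b word boundaries apply), and it returns the lower-cased match so it can be mapped to a table name.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper: finds which of many keywords occur in a text in one pass.
class KeywordMatcherSketch {
    private final Pattern pattern;

    KeywordMatcherSketch(List<String> keywords) {
        if (keywords.isEmpty()) {
            throw new IllegalArgumentException("at least one keyword is required");
        }
        // Quote each keyword so regex metacharacters are matched literally.
        String alternation = keywords.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        this.pattern = Pattern.compile("\\b(?:" + alternation + ")\\b",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    }

    // Returns the lower-cased keywords found in the text, in order of first appearance.
    Set<String> match(String text) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            found.add(m.group().toLowerCase(Locale.ROOT));
        }
        return found;
    }

    public static void main(String[] args) {
        KeywordMatcherSketch matcher =
                new KeywordMatcherSketch(Arrays.asList("iphone", "samsung", "apple"));
        System.out.println(matcher.match("New iPhone beats Samsung, says Apple"));
    }
}
```

The same matcher could be applied to the expanded URLs and the quoted or retweeted text when the status text alone yields no keyword.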

Well, you could create a class similar to ArrayList, call it TweetList, so that you can create an array of them. This class will need an insert function.
Then use two for loops to search through the tweets for matching keywords, which are held in a normal ArrayList, and add each matching tweet to the TweetList whose index matches the index of the keyword in the keywords ArrayList:
for (int i = 0; i < tweets.length; i++)
{
String[] split = tweets[i].split(" ");// split the tweet up
for (int j = 0; j < split.length; j++)
if (keywords.contains(split[j]))//check each word against the keyword list
list[keywords.indexOf(split[j])].insert(tweets[i]);//add the tweet to the list whose index matches the index of the keyword
}
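
A variation on the same loop, sketched under the assumption that the keywords are lower-case and that splitting on whitespace is good enough (punctuation attached to a word, as in "apple,", is not handled): a Map from keyword to list of tweets replaces the custom TweetList array. TweetBuckets is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: one bucket of tweets per keyword, keyed by the keyword itself.
class TweetBuckets {
    // Groups tweets by each keyword that appears as a whitespace-separated word.
    static Map<String, List<String>> group(List<String> tweets, Set<String> keywords) {
        Map<String, List<String>> buckets = new LinkedHashMap<>();
        for (String tweet : tweets) {
            // Track keywords already seen so a tweet is added once per bucket.
            Set<String> seen = new HashSet<>();
            for (String word : tweet.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (keywords.contains(word) && seen.add(word)) {
                    buckets.computeIfAbsent(word, k -> new ArrayList<>()).add(tweet);
                }
            }
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String> tweets = Arrays.asList("apple vs samsung", "I like Apple apple");
        Set<String> keywords = new HashSet<>(Arrays.asList("apple", "samsung"));
        System.out.println(group(tweets, keywords));
    }
}
```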

Hibernate org.hibernate.criterion.Example.create OR clause

I'm using org.hibernate.criterion.Example.create to create my query from my Entity object. Everything is fine, but using this method the SQL is only created with AND clause between the restrictions.
Is it possible to use org.hibernate.criterion.Example.create but with OR clause?

The short answer is no, you cannot do it directly, but you can implement an OrExample. It is fairly easy: look at the source code of Example and change the "and" to "or" (see source code line 329). Since the methods are protected, you can extend the class and override only what is necessary.
Something like this:
public class OrExample extends org.hibernate.criterion.Example {
@Override
protected void appendPropertyCondition(
String propertyName,
Object propertyValue,
Criteria criteria,
CriteriaQuery cq,
StringBuffer buf)
throws HibernateException {
Criterion crit;
if ( propertyValue!=null ) {
boolean isString = propertyValue instanceof String;
if ( isLikeEnabled && isString ) {
crit = new LikeExpression(
propertyName,
( String ) propertyValue,
matchMode,
escapeCharacter,
isIgnoreCaseEnabled
);
}
else {
crit = new SimpleExpression( propertyName, propertyValue, "=", isIgnoreCaseEnabled && isString );
}
}
else {
crit = new NullExpression(propertyName);
}
String critCondition = crit.toSqlString(criteria, cq);
if ( buf.length()>1 && critCondition.trim().length()>0 ) buf.append(" or ");
buf.append(critCondition);
}
}
See the or instead of the original and.

Yes, you can:
session.createCriteria(Person.class)
    .add(Restrictions.disjunction()
        .add(Restrictions.eq("name", "James"))
        .add(Restrictions.eq("age", 20)));
In the example above, class Person would have properties name and age and you would be selecting those people with name = "James" or age = 20.

An older Stack Overflow post may be helpful: Hibernate Criteria Restrictions AND / OR combination
Criteria criteria = getSession().createCriteria(clazz);
Criterion rest1= Restrictions.and(Restrictions.eq("A", "X"),
Restrictions.in("B", Arrays.asList("X","Y")));
Criterion rest2= Restrictions.and(Restrictions.eq("A", "Y"),
Restrictions.eq("B", "Z"));
criteria.add(Restrictions.or(rest1, rest2));

Ormlite Where issue with parenthesis, building meta-layer query library

I've read the other posts and the docs about how to use the "Where" clause to "create" parenthesis statements.
My requirement is simple:
... WHERE companyID=1 AND (director=true OR officer=true) ;
I'm writing a routine that takes an array of Objects, which are then parsed into an Ormlite Where call. A typical call looks like this:
.., "companyID", 1, Q.AND, Q.Bracket, "director", true, Q.OR, "officer", true, Q.Bracket)
The intent is to speed up simple queries. There is no desire to replace Ormlite's querying tools. This is a simple meta-layer on top.
Everything works fine for simple queries, since the parameters are processed sequentially and the where clause is built incrementally.
For parentheses, I postpone the processing until the bracket is closed.
This is where I am having a problem. The example from the docs I am using shows this:
-- From the OrmLite docs...
Where<Account, String> where = queryBuilder.where();
where.or(
where.and(
where.eq(Account.NAME_FIELD_NAME, "foo"),
where.eq(Account.PASSWORD_FIELD_NAME, "_secret")),
where.and(
where.eq(Account.NAME_FIELD_NAME, "bar"),
where.eq(Account.PASSWORD_FIELD_NAME, "qwerty")));
This produces the following approximate SQL:
SELECT * FROM account
WHERE ((name = 'foo' AND password = '_secret')
OR (name = 'bar' AND password = 'qwerty'))
The key thing I understand from the docs example, is that the same where instance is used in the nested and(...) call. This is precisely what I'm doing but I'm still getting a "Did you forget an AND or an OR" message.
The code implementing the "delayed" processing looks like this:
@SuppressWarnings("unchecked")
private void processWhere(Where<?, ?> where, Q q, List<QValue> list)
{
if (null == list || list.size() < 2)
{
System.err.println("Invalid where passed: " + list);
return;
}
if (q.equals(Q.AND))
where.and(getCondition(where, list.get(0)), getCondition(where, list.get(1)));
else
where.or(getCondition(where, list.get(0)), getCondition(where, list.get(1)));
}
The "QValue" item is just a "holder" for column, condition and value data.
The "getCondition" method is as follows:
@SuppressWarnings("rawtypes")
protected Where getCondition(Where<?, ?> where, QValue qv)
{
if (null != where && null != qv)
return getCondition(where, qv.getType(), qv.getColumn(), qv.getValue(), qv.getValue2());
else
return null;
}
@SuppressWarnings("rawtypes")
protected Where getCondition(Where<?, ?> where, Q cond, String key, Object val, Object val2)
{
if (null == where || null == cond || null == key || null == val)
return null;
SelectArg arg = new SelectArg();
arg.setValue(val);
try
{
switch (cond)
{
case NotNull:
where.isNotNull(key);
break;
case IsNull:
where.isNull(key);
break;
case Equals:
where.eq(key, arg);
break;
case NotEqual:
where.ne(key, arg);
break;
case GreaterThan:
where.gt(key, arg);
break;
case LessThan:
where.lt(key, arg);
break;
case Like:
arg.setValue("%" + val + "%");
where.like(key, arg);
break;
case LikeStart:
arg.setValue("" + val + "%");
where.like(key, arg);
break;
case LikeEnd:
arg.setValue("%" + val);
where.like(key, arg);
break;
case Between:
if (null != val && null != val2)
where.between(key, val, val2);
break;
default:
where.eq(key, arg);
break;
}
}
catch (SQLException e)
{
GlobalConfig.log(e, true);
return null;
}
return where;
}
As far as I can tell, I'm using the Where object correctly, but I am still getting a:
"Did you forget an AND or OR?" message.
I've tried creating "new" Where clauses with QueryBuilder:
Where w1 = qb.where() ;
//process w1 conditions...
return where.query() ;
Which also fails or generates incorrect SQL in the various combinations I've tried. Any suggestions on how to get the and(...) and or(...) methods working properly will be greatly appreciated.
BTW once the library is working properly, I'll put it up as Open Source or donate it to Gray, or both.
Thanks in advance.
Anthony

I faced the same issue and solved it like this:
where.eq("companyID", 1);
where.and(where, where.or(where.eq("director", true), where.eq("officer", true)));
or
where.and(where.eq("companyID", 1), where.or(where.eq("director", true), where.eq("officer", true)));
which in SQL gives us:
((`companyID` = 1 AND `director` = 1 ) OR (`companyID` = 1 AND `officer` = 1 ))
It's not identical to your example clause WHERE companyID=1 AND (director=true OR officer=true) but has the same meaning.
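
That the two forms have the same meaning follows from the distributive law, A AND (B OR C) = (A AND B) OR (A AND C). A small plain-Java check over all eight input combinations, with hypothetical names and no ORMLite involved, confirms it:

```java
// Checks that A AND (B OR C) equals (A AND B) OR (A AND C) for every input.
class DistributiveCheck {
    // A AND (B OR C): the shape the question asks for.
    static boolean nested(boolean company, boolean director, boolean officer) {
        return company && (director || officer);
    }

    // (A AND B) OR (A AND C): the shape of the generated SQL.
    static boolean expanded(boolean company, boolean director, boolean officer) {
        return (company && director) || (company && officer);
    }

    // Compares both forms across all eight combinations of the three flags.
    static boolean equivalentForAllInputs() {
        boolean[] values = {false, true};
        for (boolean company : values) {
            for (boolean director : values) {
                for (boolean officer : values) {
                    if (nested(company, director, officer)
                            != expanded(company, director, officer)) {
                        return false;
                    }
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("equivalent: " + equivalentForAllInputs());
    }
}
```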
