AND OR search syntax in Lucene - Java

I am using Lucene. I have three columns:
DocId - TermID - TermFrequency
1 - 004 - 667
2 - 005 - 558
If I use MySQL, the query for an AND operation is
SELECT * FROM table_name WHERE DocId='1' AND TermId='004'
How can I write the above query in Lucene using Java? For a single-column search I am using:
Query query = new QueryParser(Version.LUCENE_35, "TermID", analyzer).parse("004");
How can I use an AND operation with QueryParser?

Terms can be grouped with the AND keyword like so:
Query query = new QueryParser(Version.LUCENE_35, "TermID", analyzer).parse("004 AND DocId:1");
Note that you don't need to qualify the field for your "004" term because you've set "TermID" as the default field.
You should read the manual on the query syntax; it's pretty expressive.
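When the values come from program variables rather than user-typed search text, the same conjunction can also be built programmatically, which sidesteps query-syntax escaping entirely. A sketch against the Lucene 3.x API (the index and field names are assumed to match the question):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class AndQueryExample {
    public static void main(String[] args) {
        BooleanQuery query = new BooleanQuery();
        // MUST on both clauses is the programmatic equivalent of AND
        query.add(new TermQuery(new Term("TermID", "004")), Occur.MUST);
        query.add(new TermQuery(new Term("DocId", "1")), Occur.MUST);
        System.out.println(query); // prints +TermID:004 +DocId:1
    }
}
```

Note that TermQuery bypasses the analyzer, so the values must match the indexed tokens exactly.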

Hibernate Search Lucene query parser with Special Characters

FIRST QUESTION:
Can somebody explain to me how the Lucene query in Hibernate Search handles special characters? I read the Hibernate Search documentation and the Lucene query syntax, but somehow they don't add up with the generated queries and results.
Let's assume I have the following database entries:
name
Will
Will Smith
Will - Smith
Will-Smith
and I am using the following query:
Query query = queryBuilder
.keyword()
.onField("firstName")
.matching(input)
.createQuery();
Now I search with the following inputs:
Will -> returns all 4 entries, with the generated query FullTextQueryImpl(firstName:will)
Will Smith -> also returns all 4 entries, with the generated query FullTextQueryImpl(firstName:will firstName:smith)
Will - Smith -> also returns all 4 entries, with the generated query FullTextQueryImpl(firstName:will firstName:smith). Where is the "-"? Shouldn't it forbid everything after the "-" according to the Lucene query syntax?
Will-Smith -> same here
Will\-Smith -> here I tried to escape with a backslash, but got the same result
Will -Smith -> same here
SECOND QUESTION: Let's assume I have the following database entries, in which the entry without a numerical ending always exists and the ones with numerical endings may be in the database.
How would a Lucene query for this look?
name
Will
Will1
Will2
You can play around with Lucene analyzers and see what happens behind the scenes. Here is a tutorial: https://www.baeldung.com/lucene-analyzers
The tokenizer is pluggable, so you can change how special characters are treated.
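That pluggability explains the first question: the default analyzer (typically StandardAnalyzer) splits on punctuation and whitespace and lowercases, and the keyword() DSL runs the input through the analyzer rather than the query parser, so "-" never acts as the NOT operator; it is simply dropped. A rough plain-JDK approximation of that tokenization (not Lucene's actual implementation):

```java
import java.util.ArrayList;
import java.util.List;

public class TokenizeSketch {
    // Split on anything that is not a letter or digit,
    // lowercase each token, and drop empty tokens.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        for (String t : input.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) tokens.add(t.toLowerCase());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // All three inputs collapse to the same token stream, which is
        // why they all produce firstName:will firstName:smith
        System.out.println(tokenize("Will Smith"));   // [will, smith]
        System.out.println(tokenize("Will - Smith")); // [will, smith]
        System.out.println(tokenize("Will-Smith"));   // [will, smith]
    }
}
```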

Pagination using a MySQL database and Java ResultSet

To implement pagination on a list, I need to run two queries:
Get the element count from the selected table using SELECT COUNT(*)...
Get a subset of the list using LIMIT and OFFSET in a query.
Is there any way to avoid this? Is there any metadata where this count is stored?
The function resultSet.getRow() retrieves the current row index, but to use it for the total I would need a query that returns all rows and then take a sublist, which is too expensive.
I want to send a single query with limit and offset and retrieve both the selected rows and the total row count.
Is this possible?
Thanks in advance,
Juan
I read some things about this, and new doubts came to me.
When a query is launched with limits, we can add SQL_CALC_FOUND_ROWS to the SELECT clause as follows:
"SELECT SQL_CALC_FOUND_ROWS * FROM ... LIMIT 0,10"
Afterwards, I run the following:
"SELECT FOUND_ROWS()"
I understand that the first query stores the count in an internal variable whose value is returned by the second query. The second query isn't a "select count(*) ..." query, so "SELECT FOUND_ROWS()" should be inexpensive.
Am I right?
Some tests that I have made show the following:
-- first select count(*), then select with limits --
Test 1: 194 ms
out: {"total":94607,"list":["2 - 1397199600000","2 - 1397286000000","13 - 1398150000000","13 - 1398236400000","13 - 1398322800000","13 - 1398409200000","13 - 1398495600000","14 - 1398150000000","14 - 1398236400000","14 - 1398322800000"]}
-- the new way --
Test 2: 555 ms
out: {"total":94607,"list":["2 - 1397199600000","2 - 1397286000000","13 - 1398150000000","13 - 1398236400000","13 - 1398322800000","13 - 1398409200000","13 - 1398495600000","14 - 1398150000000","14 - 1398236400000","14 - 1398322800000"]}
Why don't the tests show the expected result?
Are my assumptions wrong?
thanks, regards
I have resolved the question.
The following link has the answer:
https://www.percona.com/blog/2007/08/28/to-sql_calc_found_rows-or-not-to-sql_calc_found_rows/
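The article's conclusion matches the timings above: COUNT(*) can often be answered from an index, while SQL_CALC_FOUND_ROWS forces the server to materialize the whole result, so the classic two-query approach is usually faster. A minimal JDBC sketch of that approach (the table and column names items/id are invented for illustration; connection setup is assumed to exist elsewhere):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

public class Paginator {
    // Offset of the first row of a 1-based page number
    static int offsetFor(int page, int pageSize) {
        return (page - 1) * pageSize;
    }

    // Query 1: total row count (can be served from an index)
    static long countRows(Connection conn) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement("SELECT COUNT(*) FROM items");
             ResultSet rs = ps.executeQuery()) {
            rs.next();
            return rs.getLong(1);
        }
    }

    // Query 2: the requested page only
    static List<Long> fetchPage(Connection conn, int page, int pageSize) throws Exception {
        List<Long> ids = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT id FROM items ORDER BY id LIMIT ? OFFSET ?")) {
            ps.setInt(1, pageSize);
            ps.setInt(2, offsetFor(page, pageSize));
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) ids.add(rs.getLong(1));
            }
        }
        return ids;
    }
}
```

Note the ORDER BY: without a stable ordering, LIMIT/OFFSET pages are not guaranteed to be consistent between calls.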

ORMLite groupByRaw and groupBy issue on android SQLite db

I have a SQLite table content with following columns:
-----------------------------------------------
|id|book_name|chapter_nr|verse_nr|word_nr|word|
-----------------------------------------------
the sql query
select count(*) from content where book_name = 'John'
group by book_name, chapter_nr
in DB Browser returns 21 rows (which is the count of chapters)
the equivalent with ORMLite on Android:
long count = getHelper().getWordDao().queryBuilder()
.groupByRaw("book_name, chapter_nr")
.where()
.eq("book_name", book_name)
.countOf();
returns 828 rows (which is the count of verse numbers)
as far as I know the above code is translated to:
select count(*) from content
where book_name = 'John'
group by book_name, chapter_nr
result of this in DB Browser:
   | count(*)
------------
1  | 828
2  | 430
3  | 653
...
21 | 542
---------
21 rows returned from: select count(*)...
so it seems to me that ORMLite returns the first row of the query as the result of countOf().
I've searched Stack Overflow and Google a lot. I found this question (and, more interestingly, the answer):
You can also count the number of rows in a custom query by calling the countOf() method on the Where or QueryBuilder object.
// count the number of lines in this custom query
int numRows = dao.queryBuilder().where().eq("name", "Joe Smith").countOf();
this is (correct me if I'm wrong) exactly what I'm doing, but somehow I just get the wrong number of rows.
So... either I'm doing something wrong here, or countOf() is not working the way it is supposed to.
Note: it's the same with groupBy instead of groupByRaw (according to the ORMLite documentation, chaining groupBy calls should work):
...
.groupBy("book_name")
.groupBy("chapter_nr")
.where(...)
.countOf()
EDIT: getWordDao returns a Dao for the class Word:
@DatabaseTable(tableName = "content")
public class Word { ... }
This seems to be a limitation of the QueryBuilder.countOf() mechanism. It expects a single value and does not understand the addition of GROUP BY to the count query; you can tell because that method returns a single long.
If you want to extract the counts for each of the groups, you will need to run a raw query; check out the docs.
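If the goal is the per-chapter breakdown, a raw query sidesteps the countOf() limitation; and if the goal is just the single number 21, counting distinct chapters avoids GROUP BY altogether. A sketch using the same dao as the question (queryRaw() and queryRawValue() are standard ORMLite Dao methods, but this snippet is untested):

```java
// Per-group counts: each GROUP BY row comes back as a String[]
String sql = "SELECT chapter_nr, COUNT(*) FROM content "
           + "WHERE book_name = ? GROUP BY book_name, chapter_nr";
GenericRawResults<String[]> rows =
        getHelper().getWordDao().queryRaw(sql, book_name);
for (String[] row : rows) {
    // row[0] = chapter_nr, row[1] = number of verses in that chapter
    Log.d("counts", "chapter " + row[0] + ": " + row[1] + " verses");
}
rows.close();

// Just the number of chapters (the 21 rows the DB Browser query returns):
long chapters = getHelper().getWordDao().queryRawValue(
        "SELECT COUNT(DISTINCT chapter_nr) FROM content WHERE book_name = ?",
        book_name);
```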

Load Social Network Data into Neo4J

I have a dataset similar to Twitter's graph. The data is in the following form:
<user-id1> <List of ids which he follows separated by spaces>
<user-id2> <List of ids which he follows separated by spaces>
...
I want to model this as a directed graph, expressed in Cypher syntax as:
(A:Follower)-[:FOLLOWS]->(B:Followee)
The same user can appear more than once in the dataset, as he might be in the friend list of more than one person, and his own friend list might also be part of the dataset. The challenge here is to make sure that there are no duplicate nodes for any user. And if a user appears as both a Follower and a Followee in the dataset, then the node should carry both labels, i.e., Follower:Followee. There are about 980k nodes in the graph and the dataset is 1.4 GB.
I am not sure if Cypher's LOAD CSV will work here, because each line of the dataset has a variable number of columns, making it impossible to write a query that generates nodes for each of the columns. So what would be the best way to import this data into Neo4j without creating any duplicates?
I did exactly the same thing for the Friendster dataset, which has almost the same format as yours.
There the separator for the many friends was ":".
These are the queries I used:
create index on :User(id);
USING PERIODIC COMMIT 1000
LOAD CSV FROM "file:///home/michael/import/friendster/friends-000______.txt" as line FIELDTERMINATOR ":"
MERGE (u1:User {id:line[0]})
;
USING PERIODIC COMMIT 1000
LOAD CSV FROM "file:///home/michael/import/friendster/friends-000______.txt" as line FIELDTERMINATOR ":"
WITH line[1] as id2
WHERE id2 <> '' AND id2 <> 'private' AND id2 <> 'notfound'
UNWIND split(id2,",") as id
WITH distinct id
MERGE (:User {id:id})
;
USING PERIODIC COMMIT 1000
LOAD CSV FROM "file:///home/michael/import/friendster/friends-000______.txt" as line FIELDTERMINATOR ":"
WITH line[0] as id1, line[1] as id2
WHERE id2 <> '' AND id2 <> 'private' AND id2 <> 'notfound'
MATCH (u1:User {id:id1})
UNWIND split(id2,",") as id
MATCH (u2:User {id:id})
CREATE (u1)-[:FRIEND_OF]->(u2)
;
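For the question's space-separated format the same three-pass pattern could be adapted as below, since with a space as FIELDTERMINATOR every token becomes one element of the line array, which handles the variable column count naturally. This is an untested adaptation (the file name is a placeholder), assuming the first token of each line is the follower and the rest are followees:

```cypher
CREATE INDEX ON :User(id);

USING PERIODIC COMMIT 1000
LOAD CSV FROM "file:///followers.txt" AS line FIELDTERMINATOR " "
MERGE (:User {id: line[0]});

USING PERIODIC COMMIT 1000
LOAD CSV FROM "file:///followers.txt" AS line FIELDTERMINATOR " "
UNWIND line[1..] AS followee
MERGE (:User {id: followee});

USING PERIODIC COMMIT 1000
LOAD CSV FROM "file:///followers.txt" AS line FIELDTERMINATOR " "
MATCH (u1:User {id: line[0]})
UNWIND line[1..] AS followee
MATCH (u2:User {id: followee})
MERGE (u1)-[:FOLLOWS]->(u2);
```

MERGE (instead of CREATE) on the relationship also guards against duplicate edges if a follower appears on more than one line.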

Lucene and Multifield query

I have an archive of university theses and publications indexed (with BM25 similarity) in Lucene (Java version). I have English documents and Italian documents; for this reason I have duplicated fields like pdf and pdf_en, or titolo and titolo_en. When I have an Italian document I fill the Italian fields; otherwise I fill the English fields.
Now I build a BooleanQuery from several MultiFieldQueryParsers; this is my code:
String[] fieldsGEN={"url","autori","lingua","settore","pdfurl"};
String[] fieldsITA={"titolo","tipologia","abstract","pdf"};
String[] fieldsENG={"titolo_en","tipologia_en", "abstract_en","pdf_en"};
MultiFieldQueryParser parserGEN = new MultiFieldQueryParser(version, fieldsGEN, analyzerIT);
MultiFieldQueryParser parserITA = new MultiFieldQueryParser(version, fieldsITA, analyzerIT);
MultiFieldQueryParser parserENG = new MultiFieldQueryParser(version, fieldsENG, analyzerENG);
parserGEN.setDefaultOperator(QueryParser.Operator.OR);
parserITA.setDefaultOperator(QueryParser.Operator.OR);
parserENG.setDefaultOperator(QueryParser.Operator.OR);
Query query4 =parserGEN.parse(ricerca.ricerca);
bq.add(query4, Occur.SHOULD);
Query query2 =parserITA.parse(ricerca.ricerca);
bq.add(query2, Occur.SHOULD);
Query query3 =parserENG.parse(ricerca.ricerca);
bq.add(query3, Occur.SHOULD);
If I search for "anna" (the name of an author), the 3 queries are:
Query: [titolo:anna tipologia:anna abstract:anna pdf:anna]
Query: [titolo_en:anna tipologia_en:anna abstract_en:anna pdf_en:anna]
Query: [url:anna autori:anna lingua:anna settore:anna pdfurl:anna]
and I also get authors without the name "anna", even if they appear in the last positions (about 3 documents out of the 21 returned from 1000 indexed); I suppose the query matches them in other fields.
Do you think the query is done well? Can the query be improved? How? How does a search engine like Google handle multi-field search?
Is there a better way to deal with multi-language fields?
Thanks,
Neptune.
Unless you have both translations for all documents, I would create 2 indexes -- 1 for each language, using the same field names in each index. You would then run your search queries through a MultiReader.
The problem with this approach is words that are spelled the same in each language but have different meanings in English and Italian. Apart from those words, I think this architecture will be easier to understand, and its results will be easier to interpret.
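A sketch of that setup on the Lucene 3.x API, assuming Directory objects dirIta and dirEng hold the two per-language indexes and both use the same field names (titolo, abstract, pdf, ...):

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;

// Open one reader per language index
IndexReader itaReader = IndexReader.open(dirIta);
IndexReader engReader = IndexReader.open(dirEng);

// MultiReader presents both indexes as one logical index, so a single
// IndexSearcher scores documents from both languages together
IndexSearcher searcher = new IndexSearcher(new MultiReader(itaReader, engReader));
TopDocs hits = searcher.search(query, 10);
```

With shared field names, a single MultiFieldQueryParser over titolo, tipologia, abstract, pdf replaces the three parsers in the question.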
