Lucene 6 - recommended way to store numeric fields with term vocabulary


In Lucene 6, LongField and IntField have been renamed to LegacyLongField and LegacyIntField and deprecated, with a JavaDoc suggestion to use the LongPoint and IntPoint classes instead.
However, it seems impossible to build a term vocabulary (i.e. enumerate all distinct values) of these XPoint fields. A Lucene mailing list entry confirms this:
"PointFields are different than conventional inverted fields, so they also don't show up in fields(). You cannot get a term dictionary from them."
As a third option, one can add a field of class NumericDocValuesField, which, as far as I know, also doesn't provide a way of building a term vocabulary.
Is there a non-deprecated way of indexing a numeric field in Lucene 6, given the requirement to build a term vocabulary?

In my case I just added the field several times under the same name: as a LongPoint, as a stored non-indexed field, and as doc values.
Roughly:
doc.add(new NumericDocValuesField("ts", timestamp.toEpochMilli()));
doc.add(new LongPoint("ts", timestamp.toEpochMilli()));
doc.add(new StoredField("ts", timestamp.toEpochMilli()));
It is a bit ugly, but think of it as adding an index to the stored field.
These field types can share the same name without interfering with each other.
The doc values are used for scoring based on document age, and the LongPoint for range queries.
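None of the three fields above ends up in the term dictionary. One possible workaround, which is my assumption and not something confirmed in this thread, is to index the value one more time as a non-tokenized keyword string (for example a StringField under a separate, hypothetical field name such as "ts_term"), so that the distinct values can be enumerated as terms. For the terms to sort in numeric order the string needs an order-preserving encoding. The helper below is a minimal JDK-only sketch of such an encoding; SortableLongKey is a made-up name, not a Lucene class.

```java
import java.util.List;
import java.util.TreeSet;

/** Hypothetical helper (not part of Lucene): encodes a long as a fixed-width, order-preserving string. */
final class SortableLongKey {

    /** Flips the sign bit so negatives sort before positives, then zero-pads to 16 hex digits. */
    static String encode(long value) {
        return String.format("%016x", value ^ Long.MIN_VALUE);
    }

    /** Inverse of encode: parses the hex digits as unsigned and flips the sign bit back. */
    static long decode(String key) {
        return Long.parseUnsignedLong(key, 16) ^ Long.MIN_VALUE;
    }

    public static void main(String[] args) {
        // A sorted set of encoded keys stands in for the term dictionary of the keyword field.
        TreeSet<String> vocabulary = new TreeSet<>();
        for (long ts : List.of(10003L, -5L, 10003L, 42L)) {
            vocabulary.add(encode(ts));
        }
        for (String key : vocabulary) {
            System.out.println(key + " -> " + decode(key));
        }
    }
}
```

The encoded string would be the value passed to the keyword field; decode turns an enumerated term back into the original number.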

I had the same issue and finally found a solution for my use case - I'm indexing, not storing, a LongPoint:
doc.add(new LongPoint("time",timeMsec));
My first idea was to create the query like this:
Query query = parser.parse("time:[10003 TO 10003]");
System.err.println( "Searching for: " + query + " (" + query.getClass() + ")" );
But this will not return ANY document, at least not with the StandardAnalyzer and the default QueryParser :-(
The printout is: "Searching for: time:[10003 TO 10003] (class org.apache.lucene.search.TermRangeQuery)"
What works, however, is creating the query with LongPoint.newRangeQuery():
Query query = LongPoint.newRangeQuery("time", 10003, 10003);
System.err.println( "Searching for: " + query + " (" + query.getClass() + ")" );
This prints: "Searching for: time:[10003 TO 10003] (class org.apache.lucene.document.LongPoint$1)". So the standard QueryParser creates a TermRangeQuery instead of a LongPoint range query. I'm new to Lucene, so I don't understand the details here, but it would be nice for the QueryParser to support LongPoint seamlessly...

Related

Lucene: Is there any way to know which subqueries have hit the document?

I have a MemoryIndex created like this.
```
Version version = Version.LUCENE_47;
Analyzer analyzer = new SimpleAnalyzer(version);
MemoryIndex index = new MemoryIndex();
index.addField("text", "Readings about Salmons and other select Alaska fishing Manuals", analyzer);
```
Then I have a query made up of a number of sub-queries, each generated from a concept (which has an id, a name, and a description). Right now I have to loop over every concept, generate its query, and check whether it matches. If it does, I append the concept to a string that collects the matches:
```
for (Concept concept : concepts) {
    Query query = queryGenerator.getQueryForConcept(concept);
    float score = query != null ? index.search(query) : 0.0f;
    if (score > 0) {
        matches.append(sep + concept.getId() + "|" + concept.getName());
        sep = "|";
    }
}
```
The problem is that the number of concepts keeps growing, which hurts performance. Is there any way to create one single query, run it against a document, and find out which concepts have hit the document?
I tried using a BooleanQuery as a whole and adding all the sub-queries derived from the concepts to it. It matches, but I don't know which sub-query hit, and even if I did, how would I attach details such as the "id" and "name" of a concept to it?
I would much appreciate any answers.

Lucene: Multiple words in a single term

Let's say I have a document like
stringfield:123456
textfield:name website stackoverflow
and I build a query in the following manner:
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_42);
QueryParser luceneQueryParser = new QueryParser(Version.LUCENE_42, "", analyzer);
Query luceneSearchQuery = luceneQueryParser.parse("textfield:\"name website\"");
It returns the document as expected. But if I build my query using the Lucene query API:
PhraseQuery firstNameQuery = new PhraseQuery();
firstNameQuery.add(new Term("textfield","name website"));
it does not give me any results; I have to tokenize "name website" myself and add each token to the PhraseQuery.
Is there a default way in the query API to tokenize the text the way the parser does for a query string?
Sure, I can do that myself, but I would rather not reinvent the wheel if it's already implemented.

You are adding the entire query as a single term to your PhraseQuery. You are on the right track, but when tokenized, that text will not be a single term, but rather two. That is, your index has the terms name, website, and stackoverflow, but your query has only one term, "name website", which matches none of them.
The correct way to use a PhraseQuery is to add each term to it separately.
PhraseQuery phrase = new PhraseQuery();
phrase.add(new Term("textfield", "name"));
phrase.add(new Term("textfield", "website"));

When you call:
luceneQueryParser.parse("textfield:\"name website\"");
Lucene tokenizes the string "name website" and gets two terms.
When you create:
new Term("textfield","name website")
Lucene does not tokenize the string "name website"; it uses the whole string as a single term.
For the phrase match you describe to work, the field textfield must be indexed and tokenized when you index the document.
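To make the difference concrete, here is a minimal JDK-only sketch of the tokenizing step. PhraseTerms.tokenize is a hypothetical stand-in for an analyzer (it only lowercases and splits on non-alphanumeric characters, which is an assumption and much simpler than what StandardAnalyzer really does); each resulting token is what would be passed to a separate phrase.add(new Term("textfield", token)) call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical stand-in for an analyzer: lowercases and splits on non-alphanumeric characters. */
final class PhraseTerms {

    /** Returns the individual terms of the text, in order, skipping empty tokens. */
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // Each printed token corresponds to one Term added to the PhraseQuery.
        for (String term : tokenize("name website")) {
            System.out.println("term: " + term);
        }
    }
}
```

In real code the tokens should come from the same analyzer that was used at index time, so that the query terms match the indexed terms exactly.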

Lucene 4.0 IndexWriter updateDocument for Numeric Term

I just wanted to know how it is possible to update (delete/insert) a document based on a numeric field.
So far I did this:
LuceneManager.updateDocument(writer, new Term("id", NumericUtils.intToPrefixCoded(sentenceId)), newDoc);
But now with Lucene 4.0 the NumericUtils class has changed in a way that I don't really understand.
Any help?

With Lucene 5.x, this can be solved with the code below:
int id = 1;
BytesRefBuilder brb = new BytesRefBuilder();
NumericUtils.intToPrefixCodedBytes(id, 0, brb);
Term term = new Term("id", brb.get());
indexWriter.updateDocument(term, doc); // or indexWriter.deleteDocuments(term);

You can use it this way:
First you must set the FieldType's numeric type:
FieldType TYPE_ID = new FieldType();
...
TYPE_ID.setNumericType(NumericType.INT);
TYPE_ID.freeze();
and then:
int id = 10;
BytesRef bytes = new BytesRef(NumericUtils.BUF_SIZE_INT);
NumericUtils.intToPrefixCoded(id, 0, bytes);
Term idTerm = new Term("id", bytes);
Now you'll be able to use idTerm to update the document.

If possible, I would recommend storing an ID as a keyword string rather than as a number. If it is simply a unique identifier, indexing it as a keyword makes much more sense, and it removes any need to mess with numeric formatting.
If it is actually being used as a number, then you might need to perform the update manually. That is, search for and fetch the document you wish to update, delete the old document with tryDeleteDocument, and then add the updated version with addDocument. This is basically what updateDocument does anyway, to my knowledge.
The first option would certainly be the better way, though. A non-numeric field to use as an update ID would make life easier.

With Lucene 4, you can now create IntField, LongField, FloatField or DoubleField like this:
document.add(new IntField("id", 6, Field.Store.NO));
To write the document once you modified it, it's still:
indexWriter.updateDocument(new Term("pk", "<pk value>"), document);
EDIT:
And here is a way to make a query including this numeric field:
// Query <=> id <= 7
Query query = NumericRangeQuery.newIntRange("id", Integer.MIN_VALUE, 7, true, true);
TopDocs topDocs = indexSearcher.search(query, 10);

According to the Lucene 4.0.0 documentation, an ID field should use the StringField class:
"A field that is indexed but not tokenized: the entire String value is indexed as a single token. For example this might be used for a 'country' field or an 'id' field, or any field that you intend to use for sorting or access through the field cache."
I had the same problem as you and I solved it by making this change. After that, my update and delete worked perfectly.

Mapping java.long to oracle.Number(14)

I have a DB column whose datatype is Number (15), and the corresponding field in my Java classes is a long. The question is how I should map it using java.sql.Types.
Would Types.BIGINT work?
Or should I use something else?
P.S.:
I can't afford to change the datatype in either the Java class or the DB.

According to this link, java.sql.Types.BIGINT should be used to map a Java long to a Number column in SQL (Oracle).
Attaching screenshot of the table in case the link ever dies.

A good place to find reliable size mappings between Java and Oracle types is the Hibernate ORM tool. As documented in the code here, Hibernate uses an Oracle NUMBER(19,0) to represent a java.sql.Types.BIGINT, which should map to a long primitive.
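As a quick sanity check on the sizes involved, the sketch below compares the largest value a NUMBER column of a given precision can hold with Long.MAX_VALUE. It uses only the JDK; NumberPrecisionCheck and its methods are hypothetical names for illustration, and it assumes a scale of 0 (whole numbers only).

```java
import java.math.BigInteger;

/** Hypothetical helper: checks whether every NUMBER(precision, 0) value fits in a Java long. */
final class NumberPrecisionCheck {

    /** Largest integer with the given number of decimal digits, e.g. 999 for precision 3. */
    static BigInteger maxForPrecision(int precision) {
        return BigInteger.TEN.pow(precision).subtract(BigInteger.ONE);
    }

    /** True if the largest value of that precision is no greater than Long.MAX_VALUE. */
    static boolean fitsInLong(int precision) {
        return maxForPrecision(precision).compareTo(BigInteger.valueOf(Long.MAX_VALUE)) <= 0;
    }

    public static void main(String[] args) {
        for (int precision : new int[] {14, 15, 18, 19}) {
            System.out.println("NUMBER(" + precision + ") fits in long: " + fitsInLong(precision));
        }
    }
}
```

Long.MAX_VALUE has 19 digits, so any column of up to 18 digits always fits in a long. A 19-digit column can hold every long value, but it can also hold values that are too large for one.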

I always use wrapper types, because wrapper types can express null values.
In this case I would use the Long wrapper type.

I had a similar problem where I couldn't modify the Java Type or the Database Type. In my situation I needed to execute a native SQL query (to be able to utilize Oracle's Recursive query abilities) and map the result set to a non-managed entity (essentially a simple pojo class).
I found a combination of addScalar and setResultTransformer worked wonders.
hibernateSes.createSQLQuery("SELECT \n"
+ " c.notify_state_id as \"notifyStateId\", \n"
+ " c.parent_id as \"parentId\",\n"
+ " c.source_table as \"sourceTbl\", \n"
+ " c.source_id as \"sourceId\", \n"
+ " c.msg_type as \"msgType\", \n"
+ " c.last_updt_dtm as \"lastUpdatedDateAndTime\"\n"
+ " FROM my_state c\n"
+ "LEFT JOIN my_state p ON p.notify_state_id = c.parent_id\n"
+ "START WITH c.notify_state_id = :stateId\n"
+ "CONNECT BY PRIOR c.notify_state_id = c.parent_id")
.addScalar("notifyStateId", Hibernate.LONG)
.addScalar("parentId", Hibernate.LONG)
.addScalar("sourceTbl",Hibernate.STRING)
.addScalar("sourceId",Hibernate.STRING)
.addScalar("msgType",Hibernate.STRING)
.addScalar("lastUpdatedDateAndTime", Hibernate.DATE)
.setParameter("stateId", notifyStateId)
.setResultTransformer(Transformers.aliasToBean(MyState.class))
.list();
Where notifyStateId, parentId, sourceTbl, sourceId, msgType, and lastUpdatedDateAndTime are all properties of MyState.
Without the addScalar calls, I would get a java.lang.IllegalArgumentException: argument type mismatch, because Hibernate was turning Oracle's Number type into a BigDecimal while notifyStateId and parentId are Long types on MyState.
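If the value does arrive as a BigDecimal, converting it by hand is another option. The following is only a JDK-level sketch of that conversion, not part of the Hibernate API; NumberToLong is a hypothetical helper name. It keeps SQL NULL as null and fails loudly rather than silently truncating.

```java
import java.math.BigDecimal;

/** Hypothetical helper: converts a NUMBER value that arrived as BigDecimal into a Long. */
final class NumberToLong {

    /**
     * Returns null for null input. Otherwise uses longValueExact, which throws
     * ArithmeticException if the value has a fractional part or does not fit in a long.
     */
    static Long toLong(BigDecimal value) {
        return value == null ? null : value.longValueExact();
    }

    public static void main(String[] args) {
        System.out.println(toLong(new BigDecimal("123456789012345")));
        System.out.println(toLong(null));
    }
}
```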

Problem with JDOQL to obtain results with a "contains" request

I am using Google App Engine for a project and I need to run some queries on the datastore, for which I use JDOQL. In my case I want to obtain the universities whose name contains the substring held in the variable array. I think my query has a mistake, because it returns university names in alphabetical order rather than the ones containing the substring.
Query query = pm.newQuery("SELECT FROM " + University.class.getName() + " WHERE name.contains("+array+") ORDER BY name RANGE 0, 5");
Could someone tell me what's wrong in my query?
Thank you for your help!
EDIT
I have a list of stored universities and a suggest box where the user can look up a university by its name. I want to autocomplete the requested name.

App Engine does not support full-text search; you should star issue 217. However, a partial workaround is possible, and I think it is a good fit in your case.
First, unless you want your queries to be case-sensitive, adjust your model so that it also stores a lower-case (or upper-case) version of the name. I will assume it is called lname.
Then you query like this:
Query query = pm.newQuery(University.class);
// Put both bounds in one filter; a second setFilter call would replace the first.
query.setFilter("lname >= startNameParam && lname < stopNameParam");
query.setOrdering("lname asc");
// Likewise, declare all parameters in a single call.
query.declareParameters("String startNameParam, String stopNameParam");
query.setRange(0, 5);
List<University> results = (List<University>) query.execute(search_value, search_value + "z");
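The idea behind the two bounds is that, in a sorted index, all names starting with a prefix form one contiguous range. The JDK-only sketch below shows the same range scan on a TreeSet. PrefixRange is a hypothetical name, and the upper bound here is the prefix plus the largest char value, a commonly used variant that I have swapped in for the "z" suffix used above.

```java
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;
import java.util.stream.Collectors;

/** Hypothetical illustration of a prefix search expressed as a range scan over sorted names. */
final class PrefixRange {

    /** Returns up to limit lower-case names that start with the prefix, in ascending order. */
    static List<String> suggest(TreeSet<String> lowerCaseNames, String prefix, int limit) {
        String start = prefix.toLowerCase(Locale.ROOT);
        // Exclusive upper bound: sorts after every name that starts with the prefix.
        String stop = start + Character.MAX_VALUE;
        return lowerCaseNames.subSet(start, true, stop, false)
                .stream()
                .limit(limit)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        TreeSet<String> names = new TreeSet<>(
                List.of("harvard", "stanford", "state university", "yale"));
        System.out.println(suggest(names, "Sta", 5));
    }
}
```

Note that this only finds names that begin with the search value; it is not a general substring search.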

The correct way to do this is:
Query query = pm.newQuery(University.class, ":p.contains(name)");
query.setOrdering("name asc");
query.setRange(0, 5);
List<University> univs = (List<University>) query.execute(Arrays.asList(array));
(Note: in this case :p is an implicit parameter name; you can replace it with any name.)
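As far as I understand it, a filter of the form :p.contains(name) keeps the entities whose name is equal to one of the values in the collection parameter, i.e. an IN-style match rather than a substring search. The sketch below mimics that behaviour on plain Java collections, including the ordering and range; MembershipFilter is a hypothetical name and nothing here calls the JDO API.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical illustration of IN-style matching: keep stored names equal to one of the wanted values. */
final class MembershipFilter {

    /** Returns up to limit stored names that appear in wanted, sorted ascending. */
    static List<String> namesIn(List<String> stored, Collection<String> wanted, int limit) {
        Set<String> lookup = new HashSet<>(wanted);
        return stored.stream()
                .filter(lookup::contains) // exact equality, not substring matching
                .sorted()
                .limit(limit)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> stored = List.of("Yale", "Harvard", "MIT");
        System.out.println(namesIn(stored, List.of("MIT", "Yale", "Oxford"), 5));
    }
}
```

If that reading is right, passing a partial name such as "Harv" in the collection would not match "Harvard".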
