lucene error in document field setBoost - java

As the Lucene migration guide mentions, to set a document-level boost we should multiply every field's boost by the boosting value. Here is my code:
StringField nameField = new StringField("name", name, Field.Store.YES) ;
StringField linkField = new StringField("link", link, Field.Store.YES);
Field descField;
TextField reviewsField = new TextField("reviews", reviews_str, Field.Store.YES);
TextField authorsField = new TextField("authors", authors_str, Field.Store.YES);
FloatField scoreField = new FloatField("score", origScore,Field.Store.YES);
if (desc != null) {
descField = new TextField("desc", desc, Field.Store.YES);
} else {
descField = new TextField("desc", "", Field.Store.YES);
}
doc.add(nameField);
doc.add(linkField);
doc.add(descField);
doc.add(reviewsField);
doc.add(authorsField);
doc.add(scoreField);
nameField.setBoost(score);
linkField.setBoost(score);
descField.setBoost(score);
reviewsField.setBoost(score);
authorsField.setBoost(score);
scoreField.setBoost(score);
But I get this exception when I run the code:
Exception in thread "main" java.lang.IllegalArgumentException: You cannot set an index-time boost on an unindexed field, or one that omits norms
I've searched Google but found no answers. Could you please help me?

Index-time boosts are stored in the field's norm, and both StringField and FloatField omit norms by default. So, you'll need to turn them on before you set the boosts.
To turn norms on, you'll need to define your own field types:
//Start with a copy of the standard field type
FieldType myStringType = new FieldType(StringField.TYPE_STORED);
myStringType.setOmitNorms(false);
//StringField doesn't do anything special except have a customized fieldtype, so just use Field.
Field nameField = new Field("name", name, myStringType);
FieldType myFloatType = new FieldType(FloatField.TYPE_STORED);
myFloatType.setOmitNorms(false);
//For FloatField, use the appropriate FloatField ctor, instead of Field (similar for other numerics)
Field scoreField = new FloatField("score", origScore, myFloatType);

Related

Add weights to documents Lucene 8

I am currently working on a small search engine for college using Lucene 8. I already built it before, but without applying any weights to documents.
I am now required to add the PageRanks of documents as a weight for each document, and I already computed the PageRank values. How can I add a weight to a Document object (not query terms) in Lucene 8? I looked up many solutions online, but they only work for older versions of Lucene. Example source
Here is my (updated) code that generates a Document object from a File object:
public static Document getDocument(File f) throws FileNotFoundException, IOException {
Document d = new Document();
//adding a field
FieldType contentType = new FieldType();
contentType.setStored(true);
contentType.setTokenized(true);
contentType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
contentType.setStoreTermVectors(true);
String fileContents = String.join(" ", Files.readAllLines(f.toPath(), StandardCharsets.UTF_8));
d.add(new Field("content", fileContents, contentType));
//adding other fields, then...
//the boost coefficient (updated):
double coef = 1.0 + ranks.get(path);
d.add(new DoubleDocValuesField("boost", coef));
return d;
}
The issue with my current approach is that I would need a CustomScoreQuery object to search the documents, but this is not available in Lucene 8. Also, I don't want to downgrade now to Lucene 7 after all the code I wrote in Lucene 8.
Edit:
After some (lengthy) research, I added a DoubleDocValuesField to each document holding the boost (see updated code above), and used a FunctionScoreQuery for searching as advised by @EricLavault. However, now every document's score is exactly its boost, regardless of the query! How do I fix that? Here is my search function:
public static TopDocs search(String query, IndexSearcher searcher, String outputFile) {
try {
Query q_temp = buildQuery(query); //the original query, was working fine alone
Query q = new FunctionScoreQuery(q_temp, DoubleValuesSource.fromDoubleField("boost")); //the new query
q = q.rewrite(DirectoryReader.open(bm25IndexDir));
TopDocs results = searcher.search(q, 10);
ScoreDoc[] filterScoreDosArray = results.scoreDocs;
for (int i = 0; i < filterScoreDosArray.length; ++i) {
int docId = filterScoreDosArray[i].doc;
Document d = searcher.doc(docId);
//here, when printing, I see that the document's score is the same as its "boost" value. WHY??
System.out.println((i + 1) + ". " + d.get("path")+" Score: "+ filterScoreDosArray[i].score);
}
return results;
}
catch(Exception e) {
e.printStackTrace();
return null;
}
}
//function that builds the query, working fine
public static Query buildQuery(String query) {
try {
PhraseQuery.Builder builder = new PhraseQuery.Builder();
TokenStream tokenStream = new EnglishAnalyzer().tokenStream("content", query);
tokenStream.reset();
while (tokenStream.incrementToken()) {
CharTermAttribute charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
builder.add(new Term("content", charTermAttribute.toString()));
}
tokenStream.end(); tokenStream.close();
builder.setSlop(1000);
PhraseQuery q = builder.build();
return q;
}
catch(Exception e) {
e.printStackTrace();
return null;
}
}
Starting from Lucene 6.5.0 :
Index-time boosts are deprecated. As a replacement,
index-time scoring factors should be indexed into a doc value field
and combined at query time using eg. FunctionScoreQuery. (Adrien
Grand)
The recommendation, instead of using index-time boosts, is to encode scoring factors (i.e. length normalization factors) into doc values fields. (cf. LUCENE-6819)
Regarding my edited problem (boost value completely replacing search score instead of boosting it), here is what the documentation says about FunctionScoreQuery (emphasis mine):
A query that wraps another query, and uses a DoubleValuesSource to replace or modify the wrapped query's score.
So, when does it replace, and when does it modify?
It turns out that the code I was using replaces the score entirely with the boost value:
Query q = new FunctionScoreQuery(q_temp, DoubleValuesSource.fromDoubleField("boost")); //the new query
What I needed instead was the boostByValue function, which modifies the search score by multiplying it by the boost value:
Query q = FunctionScoreQuery.boostByValue(q_temp, DoubleValuesSource.fromDoubleField("boost"));
And now it works! Thanks @EricLavault for the help!
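To make the distinction concrete, here is a minimal plain-Java sketch of the arithmetic only. It uses no Lucene classes: ScoreCombination, replaceScore and boostScore are hypothetical names chosen for illustration, and the scores are made-up numbers.

```java
/** Plain-Java illustration of the two scoring modes; none of these names are Lucene API. */
final class ScoreCombination {

    /** Mimics wrapping the query with a bare value source: the query score is discarded. */
    static double replaceScore(double queryScore, double boost) {
        return boost;
    }

    /** Mimics boostByValue: the query score is multiplied by the boost. */
    static double boostScore(double queryScore, double boost) {
        return queryScore * boost;
    }

    public static void main(String[] args) {
        double[] queryScores = {2.0, 0.5}; // made-up relevance scores
        double boost = 1.5;                // made-up per-document boost
        for (double s : queryScores) {
            System.out.printf("query=%.2f replaced=%.2f boosted=%.2f%n",
                    s, replaceScore(s, boost), boostScore(s, boost));
        }
    }
}
```

In the "replaced" column every document ends up with the same value as its boost, which is the symptom described in the question; in the "boosted" column the query still influences the ranking.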

How to add a field to a FeatureLayer?

I'm trying to add a field to a FeatureLayer, this is the code I'm using to do this:
IMxDocument mxd = (IMxDocument) app.getDocument();
FeatureLayer flayer = ((FeatureLayer)mxd.getSelectedLayer());
IField newField = null;
newField = new Field();
IFieldEdit newFieldEdit = (IFieldEdit) newField;
newFieldEdit.setAliasName("Id2");
newFieldEdit.setName("Id2");
newFieldEdit.setType(esriFieldType.esriFieldTypeString);
newFieldEdit.setLength(100);
flayer.addField(newFieldEdit);
However, it raises an exception. According to the documentation I should obtain an ISchemaLock, but I have no idea how to get a schema lock from a FeatureLayer. Does anyone have any idea?

how to know if a field is on a particular page?

The PDFBox content stream is done per page, but the fields come from the form, which comes from the catalog, which comes from the PDF doc itself. So I'm not sure which fields are on which pages, and it's causing text to be written to the wrong locations/pages.
ie. I'm processing fields per page, but not sure which fields are on which pages.
Is there a way to tell which field is on which page? Or, is there a way to get just the fields on the current page?
Thank you!
Mark
code snippet:
PDDocument pdfDoc = PDDocument.load(file);
PDDocumentCatalog docCatalog = pdfDoc.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
// Get field names
List<PDField> fieldList = acroForm.getFields();
List<PDPage> pages = pdfDoc.getDocumentCatalog().getAllPages();
for (PDPage page : pages) {
PDPageContentStream contentStream = new PDPageContentStream(pdfDoc, page, true, true, true);
processFields(acroForm, fieldList, contentStream, page);
contentStream.close();
}
The PDFbox content stream is done per page, but the fields come from the form which comes from the catalog, which comes from the pdf doc itself. So I'm not sure which fields are on which pages
The reason for this is that PDFs contain a global object structure defining the form. A form field in this structure may have 0, 1, or more visualizations on 0, 1, or more actual PDF pages. Furthermore, in case of only 1 visualization, a merge of field object and visualization object is allowed.
PDFBox 1.8.x
Unfortunately PDFBox in its PDAcroForm and PDField objects represents only this object structure and does not provide easy access to the associated pages. By accessing the underlying structures, though, you can build the connection.
The following code should make clear how to do that:
@SuppressWarnings("unchecked")
public void printFormFields(PDDocument pdfDoc) throws IOException {
PDDocumentCatalog docCatalog = pdfDoc.getDocumentCatalog();
List<PDPage> pages = docCatalog.getAllPages();
Map<COSDictionary, Integer> pageNrByAnnotDict = new HashMap<COSDictionary, Integer>();
for (int i = 0; i < pages.size(); i++) {
PDPage page = pages.get(i);
for (PDAnnotation annotation : page.getAnnotations())
pageNrByAnnotDict.put(annotation.getDictionary(), i + 1);
}
PDAcroForm acroForm = docCatalog.getAcroForm();
for (PDField field : (List<PDField>)acroForm.getFields()) {
COSDictionary fieldDict = field.getDictionary();
List<Integer> annotationPages = new ArrayList<Integer>();
List<COSObjectable> kids = field.getKids();
if (kids != null) {
for (COSObjectable kid : kids) {
COSBase kidObject = kid.getCOSObject();
if (kidObject instanceof COSDictionary)
annotationPages.add(pageNrByAnnotDict.get(kidObject));
}
}
Integer mergedPage = pageNrByAnnotDict.get(fieldDict);
if (mergedPage == null)
if (annotationPages.isEmpty())
System.out.printf("i Field '%s' not referenced (invisible).\n", field.getFullyQualifiedName());
else
System.out.printf("a Field '%s' referenced by separate annotation on %s.\n", field.getFullyQualifiedName(), annotationPages);
else
if (annotationPages.isEmpty())
System.out.printf("m Field '%s' referenced as merged on %s.\n", field.getFullyQualifiedName(), mergedPage);
else
System.out.printf("x Field '%s' referenced as merged on %s and by separate annotation on %s. (Not allowed!)\n", field.getFullyQualifiedName(), mergedPage, annotationPages);
}
}
Beware, there are two shortcomings in the PDFBox PDAcroForm form field handling:
The PDF specification allows the global object structure defining the form to be a deep tree, i.e. the actual fields do not have to be direct children of the root but may be organized by means of inner tree nodes. PDFBox ignores this and expects the fields to be direct children of the root.
Some PDFs in the wild, foremost older ones, do not contain the field tree but only reference the field objects from the pages via the visualizing widget annotations. PDFBox does not see these fields in its PDAcroForm.getFields list.
PS: @mikhailvs in his answer correctly shows that you can retrieve a page object from a field widget using PDField.getWidget().getPage() and determine its page number using catalog.getAllPages().indexOf. While fast, this getPage() method has a drawback: it retrieves the page reference from an optional entry of the widget annotation dictionary. Thus, if the PDF you process was created by software that fills that entry, all is well; but if the PDF creator did not fill that entry, all you get is a null page.
PDFBox 2.0.x
In 2.0.x some methods for accessing the elements in question have changed, but not the situation as a whole: to safely retrieve the page of a widget you still have to iterate through the pages and find a page that references the annotation.
The safe method:
int determineSafe(PDDocument document, PDAnnotationWidget widget) throws IOException
{
COSDictionary widgetObject = widget.getCOSObject();
PDPageTree pages = document.getPages();
for (int i = 0; i < pages.getCount(); i++)
{
for (PDAnnotation annotation : pages.get(i).getAnnotations())
{
COSDictionary annotationObject = annotation.getCOSObject();
if (annotationObject.equals(widgetObject))
return i;
}
}
return -1;
}
The fast method:
int determineFast(PDDocument document, PDAnnotationWidget widget)
{
PDPage page = widget.getPage();
return page != null ? document.getPages().indexOf(page) : -1;
}
Usage:
PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm();
if (acroForm != null)
{
for (PDField field : acroForm.getFieldTree())
{
System.out.println(field.getFullyQualifiedName());
for (PDAnnotationWidget widget : field.getWidgets())
{
System.out.print(widget.getAnnotationName() != null ? widget.getAnnotationName() : "(NN)");
System.out.printf(" - fast: %s", determineFast(document, widget));
System.out.printf(" - safe: %s\n", determineSafe(document, widget));
}
}
}
(DetermineWidgetPage.java)
(In contrast to the 1.8.x code the safe method here simply searches for the page of a single field. If in your code you have to determine the page of many widgets, you should create a lookup Map like in the 1.8.x case.)
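The lookup-map idea can be sketched in plain Java as follows. This is only an illustration of the strategy, not PDFBox code: WidgetPageIndex, build and pageOf are hypothetical names, and the type parameter K stands in for whatever identifies an annotation (with PDFBox 2.0.x that would presumably be the COSDictionary returned by getCOSObject(), as in the safe method above).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: one pass over all pages, then constant-time lookups per widget. */
final class WidgetPageIndex {

    /**
     * Builds a map from annotation key to zero-based page index.
     * annotationsPerPage.get(i) holds the annotation keys found on page i.
     * If an annotation is referenced from several pages, the first page wins.
     */
    static <K> Map<K, Integer> build(List<? extends List<? extends K>> annotationsPerPage) {
        Map<K, Integer> pageByAnnotation = new HashMap<>();
        for (int i = 0; i < annotationsPerPage.size(); i++) {
            for (K annotation : annotationsPerPage.get(i)) {
                pageByAnnotation.putIfAbsent(annotation, i);
            }
        }
        return pageByAnnotation;
    }

    /** Same contract as determineSafe: the page index, or -1 if no page references the widget. */
    static <K> int pageOf(Map<K, Integer> pageByAnnotation, K widget) {
        return pageByAnnotation.getOrDefault(widget, -1);
    }

    public static void main(String[] args) {
        // Strings stand in for annotation dictionaries; the middle page has no annotations.
        List<List<String>> pages = Arrays.asList(
                Arrays.asList("w1", "w2"),
                Collections.<String>emptyList(),
                Arrays.asList("w3"));
        Map<String, Integer> index = build(pages);
        System.out.println("w3 -> " + pageOf(index, "w3"));
        System.out.println("unknown -> " + pageOf(index, "unknown"));
    }
}
```

The map is built once per document, so determining the page of many widgets no longer requires re-scanning every page for each widget.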
Example documents
A document for which the fast method fails: aFieldTwice.pdf
A document for which the fast method works: test_duplicate_field2.pdf
Granted this answer may not help the OP (a year later), but if someone else runs into it, here is the solution:
PDDocumentCatalog catalog = doc.getDocumentCatalog();
int pageNumber = catalog.getAllPages().indexOf(yourField.getWidget().getPage());
This example uses Lucee (cfml) https://lucee.org/
A big thank you to mkl as his answer above is invaluable and I couldn't have built this function without his help.
Call pageForSignature(doc, fieldName) and it returns the number of the page that the field resides on, or -1 if fieldName is not found.
<cfscript>
try{
/*
java is used by using CreateObject()
*/
variables.File = CreateObject("java", "java.io.File");
//references lucee bundle directory - typically on tomcat: /usr/local/tomcat/lucee-server/bundles
variables.PDDocument = CreateObject("java", "org.apache.pdfbox.pdmodel.PDDocument", "org.apache.pdfbox.app", "2.0.18")
function determineSafe(doc, widget){
var i = '';
var widgetObject = widget.getCOSObject();
var pages = doc.getPages();
var annotation = '';
var annotationObject = '';
for (i = 0; i < pages.getCount(); i=i+1){
for (annotation in pages.get(i).getAnnotations()){
if(annotation.getSubtype() eq 'widget'){
annotationObject = annotation.getCOSObject();
if (annotationObject.equals(widgetObject)){
return i;
}
}
}
}
return -1;
}
function pageForSignature(doc, fieldName){
try{
var acroForm = doc.getDocumentCatalog().getAcroForm();
var field = '';
var widget = '';
var annotation = '';
var pageNo = '';
for(field in acroForm.getFields()){
if(field.getPartialName() == fieldName){
for(widget in field.getWidgets()){
for(annotation in widget.getPage().getAnnotations()){
if(annotation.getSubtype() == 'widget'){
pageNo = determineSafe(doc, widget);
doc.close();
return pageNo;
}
}
}
}
}
return -1;
}catch(e){
doc.close();
writeDump(label="catch error",var='#e#');
}
}
doc = PDDocument.init().load(File.init('/**********/myfile.pdf'));
//returns no, page numbers start at 0
pageNo = pageForSignature(doc, 'twtzceuxvx');
writeDump(label="pageForSignature(doc, fieldName)", var="#pageNo#");
</cfscript>
A general solution that works for fields with a single widget or with multiple widgets (e.g. a duplicated qualified name):
List<PDAnnotationWidget> widget = field.getWidgets();
PDDocumentCatalog catalog = doc.getDocumentCatalog();
for (int i = 0; i < widget.size(); i++) {
    // 1-based page number of the i-th widget
    int pageNumber = 1 + catalog.getPages().indexOf(widget.get(i).getPage());
    // The widget's coordinates can also be read here; this works for single and multiple widgets alike:
    // PDRectangle r = widget.get(i).getRectangle();
}

Sorting lucene documents by date

How can I achieve scoring and sorting in Lucene by start date?
The event with the latest start date should be shown first in the search results. I am using Lucene Version.LUCENE_44.
I have retrieved data from the DB and stored it in a Lucene Document as follows:
public static Document createDoc(Event e) {
Document d = new Document();
//event id
d.add(new StoredField("id", e.getId()));
//event name
d.add(new StoredField("eventname", e.getEName()));
TextField field = new TextField("enameSrch", e.getEName(), Store.NO);
field.setBoost(10.0f);
d.add(field);
//event owner
d.add(new StoredField("eventowner", e.getEOwner()));
//event start date
d.add(new LongField("edateSort", Long.MAX_VALUE-e.getEStartTime(), Store.YES));
//event tags
if (e.eventTags()!=null) {
field = new TextField("eTagSrch", e.getTags(), Store.NO);
field.setBoost(5.0f);
d.add(field);
d.add(new StoredField("eTags", e.getTags()));
}
And this is how I search:
public List search(String srchTxt){
PhraseQuery enameQuery = new PhraseQuery();
Term term = new Term("enameSrch", srchTxt.toLowerCase());
enameQuery .add(term);
PhraseQuery etagQuery = new PhraseQuery();
term = new Term("eTagSrch", srchTxt.toLowerCase());
etagQuery.add(term);
BooleanQuery b= new BooleanQuery();
b.add(enameQuery , Occur.SHOULD);
b.add(etagQuery , Occur.SHOULD);
SortField startField = new SortField("edateSort", Type.LONG);
SortField scoreField = SortField.FIELD_SCORE;
Sort sort = new Sort(scoreField, startField);
TopFieldDocs tfd = searcher.search(b, 10, sort);
ScoreDoc[] myscore= tfd.scoreDocs;
To rephrase: I want to sort Documents by date, which is stored as a Long field in my Document (see code above)
Your code sorts by score first and then by date. Since the scores coming back are unlikely to be identical, the date only breaks exact ties, so the results will almost always be ordered by score anyway.
This is what I would do:
Sort sorter = new Sort();        // new sort object
String field = "fieldName";      // enter the field to sort by
Type type = Type.LONG;           // SortField.Type.LONG, since your field is a long
boolean descending = false;      // ascending by default
SortField sortField = new SortField(field, type, descending);
sorter.setSort(sortField);       // now set the sort field
This will just sort by the field you specified. You can also do:
sorter.setSort(sortField, SortField.FIELD_SCORE); // this will sort by field, then by score
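The effect of the sort-key order can be seen in a small plain-Java sketch. It is only an analogy built on java.util.Comparator, not Lucene code: SortOrderDemo, Hit and the sample values are hypothetical, and "newest first" is expressed directly rather than through the Long.MAX_VALUE subtraction used in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical demo of primary vs. secondary sort keys; not Lucene API. */
final class SortOrderDemo {

    /** Stand-in for a search hit: an id, a relevance score and a start time. */
    static final class Hit {
        final String id;
        final float score;
        final long startTime;

        Hit(String id, float score, long startTime) {
            this.id = id;
            this.score = score;
            this.startTime = startTime;
        }
    }

    static List<Hit> sampleHits() {
        return Arrays.asList(
                new Hit("A", 1.2f, 100L),
                new Hit("B", 0.9f, 300L),
                new Hit("C", 1.2f, 200L));
    }

    /** Highest score first; the start time only breaks exact score ties. */
    static List<String> byScoreThenDate(List<Hit> hits) {
        List<Hit> copy = new ArrayList<>(hits);
        copy.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed()
                .thenComparing(Comparator.comparingLong((Hit h) -> h.startTime).reversed()));
        return ids(copy);
    }

    /** Newest start time first; the score only breaks exact date ties. */
    static List<String> byDateThenScore(List<Hit> hits) {
        List<Hit> copy = new ArrayList<>(hits);
        copy.sort(Comparator.comparingLong((Hit h) -> h.startTime).reversed()
                .thenComparing(Comparator.comparingDouble((Hit h) -> h.score).reversed()));
        return ids(copy);
    }

    private static List<String> ids(List<Hit> hits) {
        List<String> out = new ArrayList<>();
        for (Hit h : hits) {
            out.add(h.id);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println("score, then date: " + byScoreThenDate(sampleHits()));
        System.out.println("date, then score: " + byDateThenScore(sampleHits()));
    }
}
```

With these sample hits, the score-first ordering is C, A, B (the date matters only for the A/C tie), while the date-first ordering is B, C, A.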

How to Update an Existing Record in a Dataset

I can't seem to update an existing record in my table using a strongly-typed dataset. I can add a new record, but if I make changes to an existing record it doesn't work.
Here is my code:
private void AddEmplMaster()
{
dsEmplMast dsEmpMst = new dsEmplMast();
SqlConnection cn = new SqlConnection();
cn.ConnectionString = System.Configuration.ConfigurationSettings.AppSettings["cn.ConnectionString"];
SqlDataAdapter da1 = new SqlDataAdapter("SELECT * FROM UPR00100", cn);
SqlCommandBuilder cb1 = new SqlCommandBuilder(da1);
da1.Fill(dsEmpMst.UPR00100);
DataTable dtMst = UpdateEmpMst(dsEmpMst);
da1.Update(dsEmpMst.UPR00100);
}
This procedure is called from above to assign the changed fields to a record:
private DataTable UpdateEmpMst(dsEmplMast dsEmpMst)
{
DataTable dtMst = new DataTable();
try
{
dsEmplMast.UPR00100Row empRow = dsEmpMst.UPR00100.NewUPR00100Row();
empRow.EMPLOYID = txtEmplId.Text.Trim();
empRow.LASTNAME = txtLastName.Text.Trim();
empRow.FRSTNAME = txtFirstName.Text.Trim();
empRow.MIDLNAME = txtMidName.Text.Trim();
empRow.ADRSCODE = "PRIMARY";
empRow.SOCSCNUM = txtSSN.Text.Trim();
empRow.DEPRTMNT = ddlDept.SelectedValue.Trim();
empRow.JOBTITLE = txtJobTitle.Text.Trim();
empRow.STRTDATE = DateTime.Today;
empRow.EMPLOYMENTTYPE = "1";
dsEmpMst.UPR00100.Rows.Add(empRow);
}
catch { }
return dtMst;
}
Thank you
UPDATE:
OK, I figured it out. In my UpdateEmpMst() procedure I had to check whether the record already exists and, if so, retrieve it first; if not, create a new record and add it. Here is what I added:
try
{
dsEmplMast.UPR00100Row empRow;
empRow = dsEmpMst.UPR00100.FindByEMPLOYID(txtEmplId.Text.Trim());
if (empRow == null)
{
empRow = dsEmpMst.UPR00100.NewUPR00100Row();
dsEmpMst.UPR00100.Rows.Add(empRow);
}
Then I assign my data to empRow (whether retrieved or newly created), and the update works fine.
In order to edit an existing record in a dataset, you need to access a particular column of data in a particular row. The data in both typed and untyped datasets can be accessed via the following:
With the indices of the tables, rows, and columns collections.
By passing the table and column names as strings to their respective collections.
Although typed datasets can use the same syntax as untyped datasets, there are additional advantages to using typed datasets. For more information, see the "To update existing records using typed datasets" section below.
To update existing records in either typed or untyped datasets
Assign a value to a specific column within a DataRow object.
The table and column names of untyped datasets are not available at design time and must be accessed through their respective indices.
