We are indexing resume documents using the Elasticsearch Java API. It works fine: when we search for a keyword, it returns the correct response (the documents containing that keyword).
But we want to index the documents in more depth. For example, a resume has 'Skills' and, for each skill, a 'Skill Months' value, which might be 13 months in a given document. We want to search for that skill and constrain the skill months to a range, say between 10 and 15 months, in the Elasticsearch query, and get back only the matching documents.
How can we do this?
Here is the code used for indexing:
IndexResponse response = client
        .prepareIndex(userName, document.getType(), document.getId())
        .setSource(extractDocument(document))
        .execute()
        .actionGet();
public XContentBuilder extractDocument(Document document) throws IOException, NoSuchAlgorithmException {
    // Extracting content with Tika
    int indexedChars = 100000;
    Metadata metadata = new Metadata();
    String parsedContent;
    try {
        // Set the maximum length of strings returned by the parseToString method, -1 sets no limit
        parsedContent = tika().parseToString(new BytesStreamInput(
                Base64.decode(document.getContent().getBytes()), false), metadata, indexedChars);
    } catch (Throwable e) {
        logger.debug("Failed to extract [" + indexedChars + "] characters of text for [" + document.getName() + "]", e);
        System.out.println("Failed to extract [" + indexedChars + "] characters of text for [" + document.getName() + "]" + e);
        parsedContent = "";
    }

    XContentBuilder source = jsonBuilder().startObject();
    if (logger.isTraceEnabled()) {
        source.prettyPrint();
    }

    // File
    source
        .startObject(FsRiverUtil.Doc.FILE)
        .field(FsRiverUtil.Doc.File.FILENAME, document.getName())
        .field(FsRiverUtil.Doc.File.LAST_MODIFIED, new Date())
        .field(FsRiverUtil.Doc.File.INDEXING_DATE, new Date())
        .field(FsRiverUtil.Doc.File.CONTENT_TYPE, document.getContentType() != null ? document.getContentType() : metadata.get(Metadata.CONTENT_TYPE))
        .field(FsRiverUtil.Doc.File.URL, "file://" + (new File(".", document.getName())).toString());

    if (metadata.get(Metadata.CONTENT_LENGTH) != null) {
        // We try to get CONTENT_LENGTH from Tika first
        source.field(FsRiverUtil.Doc.File.FILESIZE, metadata.get(Metadata.CONTENT_LENGTH));
    } else {
        // Otherwise, we use our byte[] length
        source.field(FsRiverUtil.Doc.File.FILESIZE, Base64.decode(document.getContent().getBytes()).length);
    }
    source.endObject(); // File

    // Path
    source
        .startObject(FsRiverUtil.Doc.PATH)
        .field(FsRiverUtil.Doc.Path.ENCODED, SignTool.sign("."))
        .field(FsRiverUtil.Doc.Path.ROOT, ".")
        .field(FsRiverUtil.Doc.Path.VIRTUAL, ".")
        .field(FsRiverUtil.Doc.Path.REAL, (new File(".", document.getName())).toString())
        .endObject(); // Path

    // Meta
    source
        .startObject(FsRiverUtil.Doc.META)
        .field(FsRiverUtil.Doc.Meta.AUTHOR, metadata.get(Metadata.AUTHOR))
        .field(FsRiverUtil.Doc.Meta.TITLE, metadata.get(Metadata.TITLE) != null ? metadata.get(Metadata.TITLE) : document.getName())
        .field(FsRiverUtil.Doc.Meta.DATE, metadata.get(Metadata.DATE))
        .array(FsRiverUtil.Doc.Meta.KEYWORDS, Strings.commaDelimitedListToStringArray(metadata.get(Metadata.KEYWORDS)))
        .endObject(); // Meta

    // Doc content
    source.field(FsRiverUtil.Doc.CONTENT, parsedContent);
    // Doc as binary attachment
    source.field(FsRiverUtil.Doc.ATTACHMENT, document.getContent());

    // End of our document
    source.endObject();
    return source;
}
The following code is used for getting the search response:
QueryBuilder qb;
if (query == null || query.trim().length() <= 0) {
    qb = QueryBuilders.matchAllQuery();
} else {
    qb = QueryBuilders.queryString(query); // query is a name or string
}

org.elasticsearch.action.search.SearchResponse searchHits = node.client()
        .prepareSearch()
        .setIndices("ankur")
        .setQuery(qb)
        .setFrom(0).setSize(1000)
        .addHighlightedField("file.filename")
        .addHighlightedField("content")
        .addHighlightedField("meta.title")
        .setHighlighterPreTags("<span class='badge badge-info'>")
        .setHighlighterPostTags("</span>")
        .addFields("*", "_source")
        .execute().actionGet();
Elasticsearch analyzes and indexes all fields by default to provide better search capabilities. Before you put your JSON documents under some type, it would be good to define your mappings (refer: https://www.elastic.co/guide/en/elasticsearch/guide/current/mapping-analysis.html).
When you want to search data by an exact keyword, you may need to keep a particular field from being analyzed. While indexing a document, field values are analyzed and then indexed; you can tell Elasticsearch to leave a field alone by marking it "not_analyzed", and then the value is indexed exactly as it is. This way you can get better search results.
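For illustration only, here is a rough sketch (not the poster's code) of how such a mapping could be created with the same transport-client style API used above, so that a skill keyword plus a numeric months range like the 10-15 case from the question can be queried. The index name "ankur" is taken from the search code; the type "resume" and the field names skills.name / skills.months are assumptions:
import org.elasticsearch.common.xcontent.XContentFactory;
import org.elasticsearch.index.query.QueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

// Assumed type and field names, for illustration only.
String mapping = XContentFactory.jsonBuilder()
        .startObject()
            .startObject("resume")
                .startObject("properties")
                    .startObject("skills")
                        .startObject("properties")
                            .startObject("name")
                                .field("type", "string")
                                .field("index", "not_analyzed") // keep the exact keyword
                            .endObject()
                            .startObject("months")
                                .field("type", "integer")       // numeric, so range queries work
                            .endObject()
                        .endObject()
                    .endObject()
                .endObject()
            .endObject()
        .endObject().string();

client.admin().indices().prepareCreate("ankur")
        .addMapping("resume", mapping)
        .execute().actionGet();

// Later, search for a skill whose months lie between 10 and 15:
QueryBuilder qb = QueryBuilders.boolQuery()
        .must(QueryBuilders.termQuery("skills.name", "Java"))
        .must(QueryBuilders.rangeQuery("skills.months").gte(10).lte(15));
Note that this assumes the skill name and months are extracted from the resume into their own fields at index time, rather than only into the flat content field. Also, if a single resume holds several skills and the months value must belong to the same skill that matched, the skills object would additionally need to be mapped as a nested type and queried with a nested query; that detail is beyond this sketch.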
For the other part, defining your JSON document, it would be good to use a library to build the JSON. I prefer the Jackson library for working with JSON documents; it will reduce the lines of code in your project.
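As a small, hedged illustration of that suggestion (the ResumeDoc class and its fields are invented for the example; only client, userName and document come from the code above), a POJO serialized with Jackson could replace part of the hand-built XContentBuilder:
import com.fasterxml.jackson.databind.ObjectMapper;

// Hypothetical POJO mirroring whatever structure you want to index.
public class ResumeDoc {
    public String filename;
    public String content;
}

// Serialize the POJO and hand the JSON string to the index request.
ObjectMapper mapper = new ObjectMapper();
ResumeDoc resume = new ResumeDoc();
resume.filename = document.getName();
resume.content = parsedContent;

String json = mapper.writeValueAsString(resume);
IndexResponse response = client
        .prepareIndex(userName, document.getType(), document.getId())
        .setSource(json)
        .execute()
        .actionGet();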
I am currently working on a small search engine for college using Lucene 8. I already built it before, but without applying any weights to documents.
I am now required to add the PageRanks of documents as a weight for each document, and I already computed the PageRank values. How can I add a weight to a Document object (not query terms) in Lucene 8? I looked up many solutions online, but they only work for older versions of Lucene (example source).
Here is my (updated) code that generates a Document object from a File object:
public static Document getDocument(File f) throws FileNotFoundException, IOException {
    Document d = new Document();

    // adding a field
    FieldType contentType = new FieldType();
    contentType.setStored(true);
    contentType.setTokenized(true);
    contentType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
    contentType.setStoreTermVectors(true);

    String fileContents = String.join(" ", Files.readAllLines(f.toPath(), StandardCharsets.UTF_8));
    d.add(new Field("content", fileContents, contentType));

    // adding other fields, then...

    // the boost coefficient (updated):
    double coef = 1.0 + ranks.get(path);
    d.add(new DoubleDocValuesField("boost", coef));

    return d;
}
The issue with my current approach is that I would need a CustomScoreQuery object to search the documents, but this is not available in Lucene 8. Also, I don't want to downgrade now to Lucene 7 after all the code I wrote in Lucene 8.
Edit:
After some (lengthy) research, I added a DoubleDocValuesField to each document holding the boost (see updated code above), and used a FunctionScoreQuery for searching as advised by @EricLavault. However, now all my documents have a score of exactly their boost, regardless of the query! How do I fix that? Here is my searching function:
public static TopDocs search(String query, IndexSearcher searcher, String outputFile) {
    try {
        Query q_temp = buildQuery(query); // the original query, was working fine alone
        Query q = new FunctionScoreQuery(q_temp, DoubleValuesSource.fromDoubleField("boost")); // the new query
        q = q.rewrite(DirectoryReader.open(bm25IndexDir));

        TopDocs results = searcher.search(q, 10);
        ScoreDoc[] filterScoreDosArray = results.scoreDocs;
        for (int i = 0; i < filterScoreDosArray.length; ++i) {
            int docId = filterScoreDosArray[i].doc;
            Document d = searcher.doc(docId);
            // here, when printing, I see that the document's score is the same as its "boost" value. WHY??
            System.out.println((i + 1) + ". " + d.get("path") + " Score: " + filterScoreDosArray[i].score);
        }
        return results;
    } catch (Exception e) {
        e.printStackTrace();
        return null;
    }
}
// function that builds the query, working fine
public static Query buildQuery(String query) {
    try {
        PhraseQuery.Builder builder = new PhraseQuery.Builder();
        TokenStream tokenStream = new EnglishAnalyzer().tokenStream("content", query);
        tokenStream.reset();
        while (tokenStream.incrementToken()) {
            CharTermAttribute charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
            builder.add(new Term("content", charTermAttribute.toString()));
        }
        tokenStream.end();
        tokenStream.close();
        builder.setSlop(1000);
        PhraseQuery q = builder.build();
        return q;
    } catch (Exception e) {
        e.printStackTrace();
        return null;
    }
}
Starting from Lucene 6.5.0:
Index-time boosts are deprecated. As a replacement, index-time scoring factors should be indexed into a doc value field and combined at query time using eg. FunctionScoreQuery. (Adrien Grand)
The recommendation, instead of using index-time boosts, is therefore to encode scoring factors (e.g. length normalization factors) into doc values fields and combine them at query time (cf. LUCENE-6819).
Regarding my edited problem (boost value completely replacing search score instead of boosting it), here is what the documentation says about FunctionScoreQuery (emphasis mine):
A query that wraps another query, and uses a DoubleValuesSource to replace or modify the wrapped query's score.
So, when does it replace, and when does it modify?
It turns out that the code I was using replaces the score entirely with the boost value:
Query q = new FunctionScoreQuery(q_temp, DoubleValuesSource.fromDoubleField("boost")); //the new query
What I needed to do instead was to use the factory method boostByValue, which modifies the search score (by multiplying the score by the boost value):
Query q = FunctionScoreQuery.boostByValue(q_temp, DoubleValuesSource.fromDoubleField("boost"));
And now it works! Thanks @EricLavault for the help!
I am trying to retrieve data from a Lotus Notes database view using a Java program. Below is my code:
int resultsCount = view.getEntryCount();
print("Results found in view = " + resultsCount);

Document doc = view.getFirstDocument();
int count = 1;
if (doc != null) {
    while (count <= resultsCount) {
        count++;
        try {
            doc = view.getNextDocument(doc);
            if (doc == null) {
                print("Record " + count + " error. Null object.");
            }
        } catch (NotesException e) {
            print("Record " + count + " error. Exception.");
        }
    }
} else {
    print("Record " + count + " error. Null object.");
}
I get the following results:
Results found in view = 1567
Record 866 error. Null object.
Why is a null document found when there are actually 1567 records present in the database view?
And how can I resume fetching the rest of the records? view.getNextDocument(doc) fails with a NotesException after this happens.
Fixed by using
int resultsCount = view.getAllEntries().getCount();
instead of
int resultsCount = view.getEntryCount();
Using view.getAllEntries().getCount() returns the actual entry count, which is 866. I am not sure what view.getEntryCount() returns, but it is definitely not the actual document count.
Edit:
As mentioned in XPages getEntryCount vs getAllEntries().getCount(), view.getEntryCount() includes replication and save conflicts. Therefore, to get the actual record count, you need to use view.getAllEntries().getCount().
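A minimal sketch of an alternative, assuming the standard lotus.domino API and the same view and print helper as in the question: iterate until getNextDocument() returns null instead of trusting a pre-computed count, so conflict entries cannot throw the loop off.
import lotus.domino.Document;
import lotus.domino.NotesException;

try {
    // Walk the view document by document; stop when there is no next document.
    Document doc = view.getFirstDocument();
    int processed = 0;
    while (doc != null) {
        processed++;
        // ... process the current document here ...
        Document next = view.getNextDocument(doc);
        doc.recycle(); // release the back-end Domino object
        doc = next;
    }
    print("Documents processed = " + processed);
} catch (NotesException e) {
    print("Notes error: " + e.text);
}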
I'm writing a Solr plugin (a SearchComponent) and want to iterate over all documents that are found for the query. This is the relevant part of my code in the process method:
// Searcher to search a document
SolrIndexSearcher searcher = rb.req.getSearcher();

// Getting the list of documents found for the query
DocList docs = rb.getResults().docList;

// Return if no results are found for the query
if (docs == null || docs.size() == 0) {
    return;
}

// Get the iterator for the documents that will be returned
DocIterator iterator = docs.iterator();

// Iterate over all documents and count occurrences
for (int i = 0; i < docs.size(); i++) {
    try {
        // Getting the current document ID
        int docid = iterator.nextDoc();
        // Fetch the document from the searcher
        Document doc = searcher.doc(docid);
        // do stuff
    } catch (Exception e) {
        LOGGER.error(e.getMessage());
    }
}
With this method I only iterate over the documents that will actually be returned, i.e. if 1300 documents are found for the query but I only return 20, I iterate over just those 20. Is there a possibility to get the full set of documents (all 1300)?
There is a possibility to do that. You are using DocList, which contains only 'rows' docs starting from 'start'. If you want to iterate over all 'numFound' docs, use DocSet via
rb.getResults().docSet
For an explanation of this mechanism, see http://wiki.apache.org/solr/FAQ#How_can_I_get_ALL_the_matching_documents_back.3F_..._How_can_I_return_an_unlimited_number_of_rows.3F
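For illustration, a short sketch of what that looks like inside the same process method (reusing rb and the searcher from the question's code):
import org.apache.lucene.document.Document;
import org.apache.solr.search.DocIterator;
import org.apache.solr.search.DocSet;
import org.apache.solr.search.SolrIndexSearcher;

SolrIndexSearcher searcher = rb.req.getSearcher();

// DocSet holds every matching document (numFound), not just the returned page.
DocSet matches = rb.getResults().docSet;

DocIterator it = matches.iterator();
while (it.hasNext()) {
    int docid = it.nextDoc();
    Document doc = searcher.doc(docid); // fetch stored fields for this match
    // do stuff with each of the matching documents, e.g. all 1300
}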
I'm developing an analysis program for Twitter data.
I'm using MongoDB, and at the moment I am trying to write a Java program that gets tweets from the Twitter API and puts them in the database.
Getting the tweets already works very well, but I have a problem when I want to put them in the database. The Twitter API often returns the same tweets again, so I have to place some kind of index in the database.
First of all, I connect to the database and get the collection related to the search term, or create this collection if it doesn't exist.
public void connectdb(String keyword) {
    try {
        // on constructor load initialize MongoDB and load collection
        initMongoDB();
        items = db.getCollection(keyword);
        BasicDBObject index = new BasicDBObject("tweet_ID", 1);
        items.ensureIndex(index);
    } catch (MongoException ex) {
        System.out.println("MongoException :" + ex.getMessage());
    }
}
Then I get the tweets and put them in the database:
public void getTweetByQuery(boolean loadRecords, String keyword) {
    if (cb != null) {
        TwitterFactory tf = new TwitterFactory(cb.build());
        Twitter twitter = tf.getInstance();
        try {
            Query query = new Query(keyword);
            query.setCount(50);
            QueryResult result;
            result = twitter.search(query);
            System.out.println("Getting Tweets...");

            List<Status> tweets = result.getTweets();
            for (Status tweet : tweets) {
                BasicDBObject basicObj = new BasicDBObject();
                basicObj.put("user_name", tweet.getUser().getScreenName());
                basicObj.put("retweet_count", tweet.getRetweetCount());
                basicObj.put("tweet_followers_count", tweet.getUser().getFollowersCount());
                UserMentionEntity[] mentioned = tweet.getUserMentionEntities();
                basicObj.put("tweet_mentioned_count", mentioned.length);
                basicObj.put("tweet_ID", tweet.getId());
                basicObj.put("tweet_text", tweet.getText());
                if (mentioned.length > 0) {
                    // System.out.println("Mentioned length " + mentioned.length + " Mentioned: " + mentioned[0].getName());
                }
                try {
                    items.insert(basicObj);
                } catch (Exception e) {
                    System.out.println("MongoDB Connection Error : " + e.getMessage());
                    loadMenu();
                }
            }

            // Printing fetched records from DB.
            if (loadRecords) {
                getTweetsRecords();
            }
        } catch (TwitterException te) {
            System.out.println("te.getErrorCode() " + te.getErrorCode());
            System.out.println("te.getExceptionCode() " + te.getExceptionCode());
            System.out.println("te.getStatusCode() " + te.getStatusCode());
            if (te.getStatusCode() == 401) {
                System.out.println("Twitter Error : \nAuthentication credentials (https://dev.twitter.com/pages/auth) were missing or incorrect.\nEnsure that you have set valid consumer key/secret, access token/secret, and the system clock is in sync.");
            } else {
                System.out.println("Twitter Error : " + te.getMessage());
            }
            loadMenu();
        }
    } else {
        System.out.println("MongoDB is not Connected! Please check mongoDB intance running..");
    }
}
But as I mentioned before, the same tweets often come back, so there are duplicates in the database.
I think the tweet_ID field is a good candidate for an index and should be unique in the collection.
Set the unique option on your index to have MongoDb enforce uniqueness:
items.ensureIndex(index, new BasicDBObject("unique", true));
Note that you'll need to manually drop the existing index and remove all duplicates or you won't be able to create the unique index.
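As a rough sketch of that sequence with the 2.x driver style used in the question (the collection and field names match the code above; how you remove the duplicates is left out):
// Drop the old non-unique index on tweet_ID (check getIndexInfo() for its exact name).
items.dropIndex(new BasicDBObject("tweet_ID", 1));

// ... remove existing duplicate tweets here ...

// Recreate the index with the unique option set.
items.ensureIndex(new BasicDBObject("tweet_ID", 1),
        new BasicDBObject("unique", true));

// From now on, inserting a tweet_ID that already exists throws a duplicate key error.
try {
    items.insert(basicObj);
} catch (com.mongodb.MongoException.DuplicateKey e) {
    // Tweet already stored; safe to skip. (Newer drivers throw DuplicateKeyException.)
}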
This question is already answered, but I would like to contribute a bit, since the MongoDB API 2.11 offers an ensureIndex method which receives the unique option as a parameter:
public void ensureIndex(DBObject keys, String name, boolean unique)
A minor reminder for anyone who would like to store JSON documents in MongoDB: note that uniqueness must be applied to a BasicDBObject key, not to values inside it. For example:
BasicDBObject basicObj = new BasicDBObject();
basicObj.put("user_name", tweet.getUser().getScreenName());
basicObj.put("retweet_count", tweet.getRetweetCount());
basicObj.put("tweet_ID", tweet.getId());
basicObj.put("tweet_text", tweet.getText());
basicObj.put("a_json_text", "{"info_details":{"info_id":"1234"},"info_date":{"year":"2012"}, {"month":"12"}, {"day":"10"}}");
In this case, you can create a unique index only on the BasicDBObject keys:
BasicDBObject index = new BasicDBObject();
int directionOrder = 1;
index.put("tweet_ID", directionOrder);
boolean isUnique = true;
items.ensureIndex(index, "unique_tweet_ID", isUnique);
Any index on a value inside the JSON string, like "info_id", would not work, since it's not a BasicDBObject key.
Using indexes on MongoDB is not as easy as it sounds. You may also check the MongoDB docs for more details: Mongo Indexing Tutorials and Mongo Index Concepts. Direction order can be quite important to understand once you need a compound index, which is well explained in Why Direction order matters.
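For instance, a small hedged sketch of such a compound index on the same collection, using the three-argument ensureIndex mentioned above; the index name and the choice of fields are just for illustration:
// Compound index: ascending on tweet_ID, descending on retweet_count.
BasicDBObject compound = new BasicDBObject();
compound.put("tweet_ID", 1);        //  1 = ascending
compound.put("retweet_count", -1);  // -1 = descending
items.ensureIndex(compound, "tweetId_retweetCount_idx", false);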
How to extract term frequency of each word from a Lucene 5.2.1 index using java?
I have code that used to work for a previous Lucene version but does not work anymore. I think most of the code on the Internet is for previous versions of Lucene.
You can get the term frequency of a given term from IndexReader.totalTermFreq, such as:
Term myTerm = new Term("contentfield", "myterm");
long totaltf = myReader.totalTermFreq(myTerm);
If you want to iterate over all the terms in the index and get the frequency of each, you can use MultiFields for that:
Fields fields = MultiFields.getFields(reader);
Iterator<String> fieldsIter = fields.iterator();
while (fieldsIter.hasNext()) {
    String fieldname = fieldsIter.next();
    TermsEnum terms = fields.terms(fieldname).iterator();
    BytesRef term;
    while ((term = terms.next()) != null) {
        System.out.println(fieldname + ":" + term.utf8ToString() + " ttf:" + terms.totalTermFreq());
        // Or whatever else you want to do with it...
    }
}
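If what is needed is the frequency of a term inside each individual document rather than the total across the index, a hedged sketch using the postings of the current term (Lucene 5.x API, continuing the terms loop above) would be:
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.search.DocIdSetIterator;

// For the term currently positioned in the TermsEnum 'terms':
PostingsEnum postings = terms.postings(null, PostingsEnum.FREQS);
int docId;
while ((docId = postings.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
    int freqInDoc = postings.freq(); // occurrences of this term in this document
    System.out.println("doc " + docId + " freq:" + freqInDoc);
}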