Lucene indexing strategy for documents that change often - java

I'm integrating search functionality into a desktop application and I'm using vanilla Lucene to do so. The application handles potentially thousands of POJOs, each with its own set of key/value(s) properties. When mapping models between my application and Lucene I originally thought of assigning each POJO a Document and adding the properties as Fields. This approach works well as far as indexing and searching go, but the main downside is that whenever a POJO changes one of its properties I have to reindex ALL the properties again, even the ones that didn't change, in order to update the index.

I have been thinking of changing my approach and instead creating a Document per property, assigning the same id to all the Documents from the same POJO. This way, when a POJO property changes, I only update its corresponding Document without reindexing all the other unchanged properties. I think the graph DB Neo4j follows a similar approach when it comes to indexing, but I'm not completely sure.

Could anyone comment on the possible impact on performance, querying, etc.?

It depends fundamentally on what you want to return as a Document in a search result.
But indexing is pretty cheap. Does a changed POJO really have so many properties that reindexing them all is a major problem?

If you only search one field in every search request, splitting one POJO into several documents will speed up reindexing. But it causes another problem if you search across multiple fields: the same POJO may appear many times in the results.
Actually, I agree with EJP; building an index is very fast on a small dataset.
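For what it's worth, here is a minimal sketch of the per-property approach in plain Lucene, assuming a recent Lucene version (8+; on older versions RAMDirectory plays the role of ByteBuffersDirectory). Field names such as pojoId, key and propName are made up; the point is that updateDocument keyed on a per-property term replaces only the one small Document that changed:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class PerPropertyIndexer {

        // One Document per (pojoId, property) pair; "key" uniquely identifies it.
        static Document propertyDoc(String pojoId, String propName, String propValue) {
            Document doc = new Document();
            doc.add(new StringField("pojoId", pojoId, Field.Store.YES));
            doc.add(new StringField("key", pojoId + "#" + propName, Field.Store.NO));
            doc.add(new StringField("propName", propName, Field.Store.YES));
            doc.add(new TextField("propValue", propValue, Field.Store.YES));
            return doc;
        }

        public static void main(String[] args) throws Exception {
            Directory dir = new ByteBuffersDirectory(); // in-memory, for the sketch only
            try (IndexWriter writer = new IndexWriter(dir,
                    new IndexWriterConfig(new StandardAnalyzer()))) {
                // Initial indexing of one POJO with two properties.
                writer.addDocument(propertyDoc("pojo-42", "title", "My first note"));
                writer.addDocument(propertyDoc("pojo-42", "body", "Some longer text"));
                writer.commit();

                // Later, only the "title" property changed: replace just that Document.
                writer.updateDocument(new Term("key", "pojo-42#title"),
                        propertyDoc("pojo-42", "title", "My renamed note"));
                writer.commit();
            }
        }
    }

The trade-off mentioned above still applies: a query that matches several properties of the same POJO will return several of these small Documents, so you have to deduplicate on pojoId when presenting results.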

Related

Hierarchical Data Model with JPA

Recently I came across a schema model like this.
The structure looks exactly the same; I have just renamed the entities to generic names like Table (*).
Starting from Table C, all the tables have close to 200 columns, from C to L.
The reason for posting this is that I have never come across a structure like this before. If anyone has already worked with something similar, or more complex than this, please share your thoughts:
Is having a structure like this good or bad, and why?
Assume we need an API to save data for a table structure like this;
how do we design the API?
How are we going to manage transactions across all these tables?
In the service code, there are a few cases where we might need to read data from these tables and transfer it to an external system.
The catch is that the external system accepts requests in a flattened structure, not in the hierarchy mentioned above. If this data needs to be transferred to the external system, how can we manage marshalling and unmarshalling?
Last but not least, the API that manages data like this will be called at least 2K times a day.
What are your thoughts on this? I don't know exactly why we need it; it needs a detailed discussion and we need to break things up.
If I consider Spring Data JPA and Hibernate, what are all the things I need to consider?
More importantly, all these tables' row values will be limited based on the ownerId/tenantId, so the data needs to be consistent across all the tables.
I cannot comment on the general aspect of the structure, as that is pretty domain specific; one would need to know why this structure was chosen to be able to say whether it's good or not. Either way, you probably can't change it anyway, so why bother asking if it's good or not?
Having said that, with such a model there are a few aspects that you should consider:
When updating data, it is pretty important to update only the columns that really changed, to avoid index thrashing and to allow the DB to use spare storage in pages. This is a performance concern that usually comes up when using Hibernate with such models, as Hibernate by default updates all "updatable" columns, not just the dirty ones. There is an option to enable dynamic updates though (a short sketch follows below). Without dynamic updates, you might produce a few more IOs per update and thus hold locks for a longer time, which affects overall scalability.
When reading data, it is very important not to use join fetching by default as that might result in a result set size explosion.
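To the first point, a minimal sketch of what the dynamic-update option looks like with Hibernate annotations. The entity and column names here are invented; with @DynamicUpdate the generated UPDATE statement contains only the dirty columns:

    import javax.persistence.Column;   // jakarta.persistence.* on newer stacks
    import javax.persistence.Entity;
    import javax.persistence.Id;
    import org.hibernate.annotations.DynamicUpdate;

    // With @DynamicUpdate, Hibernate builds the UPDATE at runtime and includes
    // only the columns that are actually dirty, instead of all "updatable"
    // columns of this (very wide) table.
    @Entity
    @DynamicUpdate
    public class TableC {

        @Id
        private Long id;

        @Column(name = "owner_id")
        private Long ownerId;

        // ... imagine ~200 more columns here; only the changed ones end up
        // in the generated UPDATE statement.
        private String someWideColumn;

        // getters/setters omitted for brevity
    }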

Search application to handle up to 40 million full-text documents using Lucene

My current search application uses Lucene for the indexing process. If any documents change, I believe we have to start re-indexing from the beginning. Is this correct?
If so, then all documents have to be re-indexed each time new ones are added, which is not appropriate for a very large amount of content, about 40 million full-text documents.
That's why I am specifically asking: using Lucene, is there any way to index only the documents that have changed, so as to avoid full re-indexing?
Any suggestions would be appreciated.
You only need to reindex the changed documents; there is no need to reindex everything. IndexWriter has deleteDocuments, which can remove documents by query or term. Then you can reinsert the changed document with addDocument and commit to make this appear atomic.
Also bear in mind that Lucene is just a library and has no idea what kind of external entities are passed for indexing and how/when they change - you, as a developer, are responsible for this.
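A rough sketch of that pattern, assuming each document carries a unique "id" field (the field names and index path are made up; updateDocument is a convenience that performs the delete and the add as one operation):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.FSDirectory;
    import java.nio.file.Paths;

    public class IncrementalReindex {

        public static void reindexChanged(String docId, String newContent) throws Exception {
            try (IndexWriter writer = new IndexWriter(
                    FSDirectory.open(Paths.get("index-dir")),
                    new IndexWriterConfig(new StandardAnalyzer()))) {

                Document doc = new Document();
                doc.add(new StringField("id", docId, Field.Store.YES));
                doc.add(new TextField("content", newContent, Field.Store.NO));

                // Explicit delete-then-add, as described in the answer...
                writer.deleteDocuments(new Term("id", docId));
                writer.addDocument(doc);

                // ...or equivalently in one call:
                // writer.updateDocument(new Term("id", docId), doc);

                writer.commit(); // only the changed document was touched
            }
        }
    }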

ElasticSearch: Secondary indices on field values using Java-API

I'm considering using ElasticSearch as a search engine for large objects. There are about 500 million objects on a single machine. So far Elasticsearch looks like a good solution for executing advanced queries. But I have the problem that I didn't find any technique to create a secondary index on the document fields. Is there a possibility in Elasticsearch to create secondary indices, like on columns in MySQL? Or are there any other techniques implemented to accelerate searches on field values? I'm using a single-server environment and I have to store about 300 fields per row/object. At the moment there are about 500 million objects in my database.
I apologize in advance if I don't understand the question. Elasticsearch is itself an index-based technology (it's built on top of Lucene, which is built for index-based search). You put documents into Elasticsearch and the individual fields on those documents are indexed and searchable. You should not have to worry about creating secondary indices; the fields will be indexed by default (in most cases).
One of the differences between Elasticsearch and Solr is that in Solr, you have to specify a schema defining what the fields are on the documents and whether each field will be indexed (available to search against), stored (available as the result of a search), or both. Elasticsearch does not require an upfront schema, and in lieu of specific mappings for fields, reasonable defaults are used instead. I believe that the core field types (string, number, etc.) are indexed by default, meaning they are available to search against.
Now in your case, you have a document with a lot of fields on it. You will probably need to tweak the mappings a bit to only index the fields that you know you might search against. If you index too much, the size of the index itself will balloon and will not be as fast as if you had a trim index of only the fields you know you will search against. Also, Lucene loads parts of the index into memory to really enable fast searches. With a bloated index, you won't be able to keep as much stuff in memory and your searches will suffer as a result. You should look at the Mappings API and the Core Types section for more info on how to do this.
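As a hedged illustration of "index only what you search against", the sketch below creates an index whose mapping marks one field as searchable and another as stored-only. The index and field names are invented, and the exact mapping syntax depends on your Elasticsearch version; this assumes a version where "index": false disables indexing for a field, and it uses plain HTTP rather than the Java client API:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class CreateTrimmedIndex {

        public static void main(String[] args) throws Exception {
            // Only "name" is searchable; "payload" is kept in _source but not indexed.
            String body = "{"
                    + "\"mappings\": {"
                    + "  \"properties\": {"
                    + "    \"name\":    { \"type\": \"text\" },"
                    + "    \"payload\": { \"type\": \"keyword\", \"index\": false }"
                    + "  }"
                    + "}}";

            HttpURLConnection conn = (HttpURLConnection)
                    new URL("http://localhost:9200/objects").openConnection();
            conn.setRequestMethod("PUT");
            conn.setRequestProperty("Content-Type", "application/json");
            conn.setDoOutput(true);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body.getBytes(StandardCharsets.UTF_8));
            }
            System.out.println("HTTP " + conn.getResponseCode());
        }
    }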

MongoDB Java insert vs. update and compare changes

I have a large collection of roughly 3.2 million records. This collection is updated monthly, but the source data is fetched as-is, meaning I don't get just the updated records but everything.
In terms of performance, is it better to simply remove the collection and insert everything or do an update for each record?
Also is there a good way to compare existing record with the one being read from the source to check if there's any change?
Thanks.
Also is there a good way to compare existing record with the one being read from the source to check if there's any change?
You're looking for a change detection system: it's a problem commonly described for ETL systems. I suggest you read up on the ETL process (Kimball's The Data Warehouse ETL Toolkit is a good source). In general, detecting changes is a hard problem and involves the use of snapshots in order to calculate differences. If you're sure that your collection will always remain in Mongo storage, you can see whether it's possible to work with the Mongo log.
Furthermore, consider that change detection is tightly coupled to the structure and the meaning of your data: e.g. if you have an insertion-only collection, you can detect changed data using _id.
The problem is too complex to give answers like "do this and that and you'll get it"; you have to analyze your data and understand which method is better: refer to the literature to find known solutions and avoid reinventing the wheel.
In terms of performance, is it better to simply remove the collection and insert everything or do an update for each record?
Once again, you have to know how your data is structured. If your collection has more changed parts than constant parts, you are better off reloading the entire collection rather than tracking changes. If the changeset is considerably smaller than the whole collection, updating the existing documents leads to better performance.
Hope this helps.
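One possible sketch of both points with the MongoDB Java driver (3.7+ style API): hash each incoming record, compare against a hash stored on the existing document, and replace only when it differs. The field names, the SHA-256 choice and the collection names are assumptions, not something from the question:

    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Filters;
    import com.mongodb.client.model.ReplaceOptions;
    import org.bson.Document;

    import java.math.BigInteger;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.List;

    public class MonthlyLoad {

        // Hash the record's canonical JSON so changes can be detected cheaply later.
        static String contentHash(Document record) throws Exception {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(record.toJson().getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest).toString(16);
        }

        public static void load(List<Document> sourceRecords) throws Exception {
            MongoCollection<Document> coll = MongoClients.create("mongodb://localhost:27017")
                    .getDatabase("mydb").getCollection("records");

            for (Document incoming : sourceRecords) {
                Object id = incoming.get("_id");
                String newHash = contentHash(incoming);

                Document existing = coll.find(Filters.eq("_id", id))
                        .projection(new Document("contentHash", 1)).first();

                // Skip untouched records; replace (or insert) only when the hash differs.
                if (existing == null || !newHash.equals(existing.getString("contentHash"))) {
                    incoming.put("contentHash", newHash);
                    coll.replaceOne(Filters.eq("_id", id), incoming,
                            new ReplaceOptions().upsert(true));
                }
            }
        }
    }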

Keeping query statistics using lucene

I am developing a search component of a web application using Lucene. I would like to save the user queries to an index and use them to suggest alternate queries to users, and to keep query statistics (most often used queries, top scoring queries, ...).
To use this data for alternate query suggestions, I would analyze the queries to see which terms are most often used with one another and use that to create a suggestion to the user.
But I can't figure out in which form to index the data. I was thinking of simply adding the queries into the index, but in that way there could be a lot of redundant data since many documents in the index would have the same content. Does anyone have any ideas about the way this can be accomplished?
Thanks for the help.
"I was thinking of simply adding the queries into the index, but in that way there could be a lot of redundant data since many documents in the index would have the same content"
You can tell Lucene not to store document content, which means that the principal overhead will be the unique Terms, and the index itself. So, it might not be a large overhead to store each query as a unique Document...this way you will not be throwing away any information.
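In field terms, that might look like the following sketch (the field names are made up): the query text is indexed for searching but not stored, so the per-query cost is essentially just its terms in the postings lists:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;

    public class QueryLogDoc {

        static Document forQuery(String userQuery, long timestampMillis) {
            Document doc = new Document();
            // Indexed but NOT stored: searchable and analyzable, yet the raw
            // text is not duplicated in the stored fields of every document.
            doc.add(new TextField("queryTerms", userQuery, Field.Store.NO));
            // A small stored field so statistics post-processing can group by day.
            doc.add(new StringField("day",
                    String.valueOf(timestampMillis / 86_400_000L), Field.Store.YES));
            return doc;
        }
    }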
First, I believe that you should store the queries separately from the existing index. The problem is not redundant data but rather "watering down" your index - storing the queries in the same index may harm the relevance of your searches. Some options for this are:
Use a separate Lucene index.
Use Solr, with two separate cores, one for the documents and the other for the queries.
Use a query log. Store scores with the queries. Build query statistics using post-processing. As this is a web application, you can probably use your servlet container's logs (e.g. Tomcat's) for this.
Second, Auto-Suggest From Popular Queries Using EdgeNGrams suggests an alternative implementation of query suggestion using Solr.
