I am working on a project where I need to use Hibernate Search, and I am going to index just one entity. It is mapped to a huge table with almost 20 million records, and more records are added to it daily, but not through the application and entity manager I am working on, so Hibernate Search cannot index the new changes automatically. The problem is that rebuilding the whole index for the entity every day will take a long time.
Is there any way to keep my current index and partially rebuild the index documents for just the new changes?
If, at the end of the day, you are able to list all the entities that have been modified during the last 24 hours based on information from the database (a date/time of last change for each entity, for example), then yes, there are ways to do that.
First, you can do it "manually" by running your own Hibernate ORM query and calling FullTextSession.index on each element you want to see reindexed. You will have to do this in batches, preferably opening a transaction for each batch, if you have a large number of elements to reindex.
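As a minimal sketch of that batching structure (the actual Hibernate calls for opening a transaction, calling FullTextSession.index and clearing the session are only indicated in comments, and the per-batch action is passed in as a plain Consumer):

```java
import java.util.List;
import java.util.function.Consumer;

public class BatchReindexer {

    // Walk the id list in fixed-size batches. In the real code each batch
    // would run in its own transaction and call fullTextSession.index(entity)
    // for every loaded element, then clear the session to free memory.
    public static <T> int processInBatches(List<T> ids, int batchSize,
                                           Consumer<List<T>> batchAction) {
        int batches = 0;
        for (int i = 0; i < ids.size(); i += batchSize) {
            List<T> batch = ids.subList(i, Math.min(i + batchSize, ids.size()));
            // tx = session.beginTransaction();   // hypothetical: one tx per batch
            batchAction.accept(batch);
            // tx.commit(); fullTextSession.clear();
            batches++;
        }
        return batches;
    }
}
```

A batch size of a few dozen entities per transaction is a common starting point; too large and you hold memory and locks longer than necessary.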
Another, better option is to use the JSR-352 integration, which will however require you to use a JSR-352-compatible runtime (Spring Batch is not very standard-compliant and won't work; JBeret is known to work). By targeting your single entity and calling restrictedBy(Criterion) when building the parameters, you can restrict the list of entities to reindex.
For example:
Properties jobProperties = MassIndexingJob.parameters()
        .forEntity( MyClass.class )
        // 26 hours rather than 24, to account for DST switches and other slight delays
        .restrictedBy( Restrictions.ge( "lastChangeDateTime",
                LocalDateTime.now().minus( 26, ChronoUnit.HOURS ) ) )
        .build();
long executionId = BatchRuntime.getJobOperator()
        .start( MassIndexingJob.NAME, jobProperties );
The MassIndexer unfortunately doesn't offer such a feature yet. You can vote for it on ticket HSEARCH-499 and explain your use case in a comment: we'll try to prioritize features that benefit many users. And of course, you can always reach out to us to discuss how to implement this and contribute a patch :)
I want to search for inputs in a list. That list resides in a database. I see two options for doing that:
Hit the database for each search and return the result.
Keep a copy in memory, synced with the table, search in memory, and return the result.
I like the second option as it will be faster. However, I am confused about how to keep the list in sync with the table.
Example: I have a list L = [12, 11, 14, 42, 56] and I receive an input: 14.
I need to return whether or not the input exists in the list. The list can be updated by other applications, and I need to keep it in sync with the table.
What would be the most optimized approach here, and how do I keep the list in sync with the database?
Is there any way my application can be informed of changes in the table, so that I can reload the list on demand?
Instead of recreating your own implementation of something that already exists, I would leverage Hibernate's Second Level Cache (2LC) with an implementation such as EhCache.
By using a 2LC, you can specify the time-to-live expiration time for your entities and once they expire, any query would reload them from the database. If the entity cache has not yet expired, Hibernate will hydrate them from the 2LC application cache rather than the database.
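To illustrate the time-to-live idea itself (this is just the concept, not EhCache or Hibernate's actual 2LC implementation), a tiny sketch in plain Java, with the clock passed in explicitly and any loader function standing in for the database:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Minimal TTL cache: entries are served from memory until they expire, then
// the next lookup reloads them from the "database" (the loader function).
public class TtlCache<K, V> {

    private static class Entry<V> {
        final V value;
        final long loadedAt;
        Entry(V value, long loadedAt) { this.value = value; this.loadedAt = loadedAt; }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final Function<K, V> loader;

    public TtlCache(long ttlMillis, Function<K, V> loader) {
        this.ttlMillis = ttlMillis;
        this.loader = loader;
    }

    public V get(K key, long now) {
        Entry<V> e = entries.get(key);
        if (e == null || now - e.loadedAt >= ttlMillis) {
            e = new Entry<>(loader.apply(key), now);  // missing or expired: reload
            entries.put(key, e);
        }
        return e.value;
    }
}
```

The important property, which a real 2LC gives you for free, is that staleness is bounded by the TTL: a reader can see data at most that old, and nothing else in the application needs to know when the table changed.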
If you are using Spring, you might also want to take a look at @Cacheable. This operates at the component/bean layer, allowing Spring to cache a result set in a named region. See their documentation for more details.
To satisfy this requirement, you should control reads and writes in one place; otherwise, there will always be cases where the data is out of sync.
We are migrating a whole application, originally developed in Oracle Forms a few years back, to a Java (7) web-based application with Hibernate (4.2.7.Final) and Hibernate Search (4.1.1.Final).
One of the requirements is: while users are using the new migrated version, they will still be able to use the Oracle Forms version, so the Hibernate Search indexes will be out of sync. Is it feasible to implement a servlet so that some PL/SQL code can call a link that updates the local indexes on the application server (AS)?
I thought of implementing some sort of clustering mechanism for Hibernate, but as I read through the documentation I realised that while clustering may be a good option for scalability and performance, it may be a bit overkill just for keeping legacy data in sync.
Does anyone have any idea how to implement a service, accessible via a servlet, that updates the local AS indexes for a given model entity with a given ID?
I don't know what exactly you mean by the clustering part, but anyways:
It seems like you are facing a similar problem to mine. I am currently working on a Hibernate Search adaptation for JPA providers other than Hibernate ORM (meaning EclipseLink, TopLink, etc.), and at the moment I am building an automatic reindexing feature. Since JPA doesn't have an event system suitable for reindexing with Hibernate Search, I came up with the idea of using triggers at the database level to keep track of everything.
For a basic OneToOne relationship it's pretty straightforward; for other things like relation tables, or anything that is not stored in the main table of an entity, it gets a bit trickier. But once you have a system for OneToOne relationships, it's not that hard to take the next step. Let's start:
Imagine two Entities: Place and Sorcerer in the Lord of the rings universe. In order to keep things simple let's just say they are in a (quite restrictive :D) 1:1 relationship with each other. Normally you end up with 2 tables named SORCERER and PLACE.
Now you have to create 3 triggers (one for CREATE, one for DELETE and one for UPDATE) on each table (SORCERER and PLACE) that store information about which entity has changed (only the id; for mapping tables there are always multiple ids) and how (CREATE, UPDATE, DELETE) into special UPDATE tables. Let's call these PLACE_UPDATES and SORCERER_UPDATES.
In addition to the ID of the original object that has changed and the event type, these tables need an ID field that must be UNIQUE across all UPDATE tables. This is needed because if you want to feed information from the UPDATE tables into the Hibernate Search index, you have to make sure the events are applied in the right order, or you will break your index. How such a UNIQUE ID can be generated on your database should be easy to find on the internet/Stack Overflow.
Now that you have set up the triggers correctly, you just have to find a way to access all the UPDATE tables in a feasible fashion (I do this by querying multiple tables at once, sorting each query by the UNIQUE id field, and then comparing the first result of each query with the others) and then update the index.
This can be a bit tricky, and you have to find the correct way of dealing with each specific update event, but it can be done (that's what I am currently working on).
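That "compare the first result of each query" step is essentially a k-way merge on the UNIQUE id. A plain-Java sketch, where each input list stands in for one SELECT ... ORDER BY unique_id result, and the Event fields are illustrative names, not the actual project's schema:

```java
import java.util.ArrayList;
import java.util.List;

public class UpdateEventMerger {

    // One row from a *_UPDATES table: which entity changed, how, and the
    // globally UNIQUE, increasing event id. Field names are illustrative.
    public static final class Event {
        public final long uniqueId;
        public final String table;
        public final String type;   // CREATE, UPDATE or DELETE

        public Event(long uniqueId, String table, String type) {
            this.uniqueId = uniqueId;
            this.table = table;
            this.type = type;
        }
    }

    // Repeatedly compare the head of every stream and emit the smallest id,
    // so events are replayed against the index in the order they happened.
    public static List<Event> mergeByUniqueId(List<List<Event>> streams) {
        int[] pos = new int[streams.size()];
        List<Event> merged = new ArrayList<>();
        while (true) {
            int best = -1;
            for (int i = 0; i < streams.size(); i++) {
                if (pos[i] >= streams.get(i).size()) continue;  // exhausted
                if (best < 0 || streams.get(i).get(pos[i]).uniqueId
                                < streams.get(best).get(pos[best]).uniqueId) {
                    best = i;
                }
            }
            if (best < 0) break;    // all streams exhausted
            merged.add(streams.get(best).get(pos[best]++));
        }
        return merged;
    }
}
```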
If you're interested in that part, you can find it here:
https://github.com/Hotware/Hibernate-Search-JPA/blob/master/hibernate-search-db/src/main/java/com/github/hotware/hsearch/db/events/IndexUpdater.java
The link to the whole project is:
https://github.com/Hotware/Hibernate-Search-JPA/
This uses Hibernate-Search 5.0.0.
I hope this was of help (at least a little bit).
And about your remote indexing problem:
The update tables can easily be used as some kind of dump for events until you send them to the remote machine that is to be updated.
Does anybody know of a library or a good code sample that could be used to re-index all/some entities in all/some namespaces?
If I implement this on my own, is MapReduce what I should consider?
"I need to re-index" feels like a problem many developers have run into, but the closest I could find is this, which may be a good start?
The other option is a home-brewed solution using task queues that iterate over the datastore namespaces and entities, but I'd prefer not to re-invent the wheel and to go for a robust, proven solution.
What are the options?
I'm afraid I don't know of any pre-built system. I think you basically need to create a cursor to iterate through all your entities and then do a get and a put on all of them (or optionally check if they're in the index before doing the put - if you have some that won't need updating, that would save you a write at the cost of a read and/or a small operation).
Follow the example here:
https://code.google.com/p/objectify-appengine/wiki/IntroductionToObjectify#Cursors
Create a java.util.concurrent.SynchronousQueue to hold batches of datastore keys.
Create 10 new consumer threads (the current limit) using ThreadManager:
https://developers.google.com/appengine/docs/java/javadoc/com/google/appengine/api/ThreadManager
Those threads should do the following:
Create a new objectify instance and turn off the session cache and memcache for objectify.
Get a batch of keys from the SynchronousQueue.
Fetch all of those entities using a batch get.
Optionally do a keys-only query for all those entities using the relevant property.
Put all those entities (or exclude the ones that were returned above).
Repeat from step 2.
In a loop, fetch the next 30 keys using a keys-only cursor query and put them into the SynchronousQueue.
Once you've put all of the items into the SynchronousQueue, set a property to stop all the consumer threads once they've done their work.
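The steps above can be sketched in plain Java. On App Engine you would create the threads via ThreadManager rather than new Thread, and each consumer would do the Objectify batch get and put; here that work is stubbed out (the batch is just recorded) so the producer/consumer structure itself is runnable anywhere. The stop signal is a poison-pill batch, one per consumer:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.SynchronousQueue;

public class ReindexPipeline {

    private static final List<Long> POISON = new ArrayList<>();  // stop signal

    // Producer hands key batches to consumers over a SynchronousQueue, then
    // sends one poison pill per consumer and waits for them to finish.
    public static List<Long> run(List<Long> keys, int batchSize, int consumers) {
        SynchronousQueue<List<Long>> queue = new SynchronousQueue<>();
        List<Long> processed = Collections.synchronizedList(new ArrayList<>());

        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < consumers; i++) {
            Thread t = new Thread(() -> {
                try {
                    while (true) {
                        List<Long> batch = queue.take();
                        if (batch == POISON) return;   // producer is done
                        processed.addAll(batch);       // real code: batch get + put
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            t.start();
            threads.add(t);
        }

        try {
            // A real producer would fetch the next 30 keys with a keys-only
            // cursor query here instead of slicing an in-memory list.
            for (int i = 0; i < keys.size(); i += batchSize) {
                queue.put(new ArrayList<>(
                        keys.subList(i, Math.min(i + batchSize, keys.size()))));
            }
            for (int i = 0; i < consumers; i++) queue.put(POISON);
            for (Thread t : threads) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return processed;
    }
}
```

The SynchronousQueue has no capacity, so the producer naturally blocks until a consumer is free, which keeps memory bounded no matter how many keys the cursor yields.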
I am using Objectify for my DAO layer on GAE, and I want to make most of my entities soft-deletable. Is it a good idea to make these entities extend a parent with an isActive boolean, should I use an embedded class, or should I just make an isSoftDeleteable interface?
The reason I am asking is that Objectify seems to store entities with the same parent class in the same entity kind (at least from what I see in _ah/admin), and it may slow down queries when everything is under the same entity kind, maybe?
Which is the best way, or is there a better way, to do soft-delete in GAE?
Please advise, and thanks in advance!
There is no single right answer to this question. The optimal solution chiefly depends on what percentage of your entities are likely going to be in deleted state at any given time.
One option is to store a field like @Index(IfTrue.class) boolean active; and add this filter to all queries:
ofy.load().type(Thing.class).filter("size >", 20).filter("active", true)
The downside of this is that it requires adding extra indexes - possibly several because you may now need multi-property indexes where single-property indexes would have sufficed.
Alternatively, you can store a 'deleted' flag and manually exclude deleted entities from query results. Fewer indexes to maintain, but it adds extra overhead to each query as you pull back records you don't want. If your deleted entries are sparse, this won't matter.
One last trick. You might find it best to store and index a deleted date, since it's probably most useful: @Index Date deleted; This lets you filter("deleted", null) to get the active items, and also lets you filter by datestamp to find really old entities that you may wish to purge. However, be aware that this will cause the deleted date to be included in any multi-property indexes, possibly significantly increasing index size if you have a high percentage of deleted entities. In that case, you may wish to use @Index(IfNull.class) Date deleted; and use map-reduce to purge sufficiently old entities.
I agree with StickFigure's answer. Take advantage of the difference between an "empty" index and a "null" index. The tradeoff is that each write will incur more datastore write operations: when you add an index, that's at least 2 additional write ops (ascending and descending indexes) every time you update that value, and when you delete the index entry, it's 2 more writes. Personally, I think this is worthwhile.
Query time should be fairly predictable whenever you do a query on a single property of an entity kind, because if you think about what's happening underneath the covers, you are browsing a list of items in sequential order before doing a parallel batch get of the entity data.
I am working on a solution to the problem described below, but could not find any best practice or tool for it.
For a batch of requests (say 5,000 unique ids and records) received by a web service, it has to fetch the rows for those unique ids from the database, keep them in a buffer (or cache), and compare them with the records received in the web service call. If a particular piece of data (say, a column) has changed, it is updated in the table for that unique id, and the child tables of that table are affected in turn. For example, if someone changes his laptop's model number and country, the model number is updated in one table and the country value in another. It goes on like that, accessing multiple tables in a short time. The number of records coming in might reach 70K in one web service call in an hour.
I don't have any option other than implementing this in Java. Is there any good practice for implementing this, or can it be achieved using any open-source Java tools? Please suggest. Thanks.
Hibernate is likely the first thing you should try. I tend to avoid it because it is overkill for most of my applications, but it is a standard tool for accessing databases that anyone who knows Java should at least have an understanding of. There are dozens of other solutions you could use, but Hibernate is the most often used.
JDBC is the API to use to access relational databases. Useful performance and security tips:
use prepared statements
use where ... in () queries to load many rows at once, but beware of the limit on the number of values in the in clause (1000 max in Oracle)
use batched statements to make your updates, rather than executing each update separately (see http://download.oracle.com/javase/1.3/docs/guide/jdbc/spec2/jdbc2.1.frame6.html)
See http://download.oracle.com/javase/tutorial/jdbc/ for a tutorial on JDBC.
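As a sketch of the second tip, a small helper (a hypothetical name, not part of JDBC) that splits an id list into chunks below the limit and builds one parameterised where id in (?, ...) string per chunk, each intended for a PreparedStatement:

```java
import java.util.ArrayList;
import java.util.List;

public class InClauseChunker {

    // Build one "baseSql where id in (?,...)" string per chunk of at most
    // maxInSize ids (e.g. 1000 for Oracle), using placeholders rather than
    // concatenated values so the statements stay injection-safe.
    public static List<String> chunkedInQueries(String baseSql, int idCount,
                                                int maxInSize) {
        List<String> queries = new ArrayList<>();
        for (int start = 0; start < idCount; start += maxInSize) {
            int n = Math.min(maxInSize, idCount - start);
            StringBuilder sb = new StringBuilder(baseSql).append(" where id in (");
            for (int i = 0; i < n; i++) sb.append(i == 0 ? "?" : ",?");
            queries.add(sb.append(')').toString());
        }
        return queries;
    }
}
```

The caller then binds each chunk's ids to the placeholders with setLong/setString and executes the statements one by one (or adds them to a batch).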
This doesn't sound that complicated. Of course, you must know (or learn):
SQL
JDBC
Then you can go through the web service data record by record and for each record do the following:
fetch corresponding database record
for each field in record
if updated
execute corresponding update SQL statement
commit // every so many records
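The per-field comparison step above could be sketched like this, with rows represented as field-name-to-value maps (a simplification; real code would read the database record from a ResultSet):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;

public class RecordDiff {

    // Given the database row and the incoming web-service record as field
    // maps, return the names of the fields whose values changed, so only
    // those columns need a corresponding UPDATE statement.
    public static List<String> changedFields(Map<String, Object> dbRow,
                                             Map<String, Object> incoming) {
        List<String> changed = new ArrayList<>();
        for (Map.Entry<String, Object> e : incoming.entrySet()) {
            if (!Objects.equals(dbRow.get(e.getKey()), e.getValue())) {
                changed.add(e.getKey());   // this column needs an UPDATE
            }
        }
        return changed;
    }
}
```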
70K records per hour should not be the slightest problem for a decent RDBMS.