I'm trying to find the best indexing solution for implementing a search engine in my clustered webapp, and I cannot find a clear answer to my questions in the official documentation.
My Java/Java EE backend will be deployed across several load-balanced instances. The search engine will require near-real-time availability of indexed data (i.e. less than 5 seconds between indexing and retrievability).
Hibernate Search can work in a clustered environment with JGroups, but the documentation also says that near-real-time indexing requires, as a tradeoff, a non-clustered and non-shared index.
Does that mean that NRTIndexManager cannot be used in a JGroups slave/master setup, i.e. can only be used with one single node?
Does that mean that with such a setup, the availability of indexed data depends only on the refresh period (the period at which the index is copied to the slave nodes)?
With the standard IndexManager, you only see the latest changes once they have been written to disk and you have reopened your IndexSearcher.
By default, Hibernate Search writes to disk and opens a new IndexSearcher for each query, so you can be sure your searches are always in sync with your database.
The NRTIndexManager differs from the standard one in that it lets you search the latest indexed changes without an explicit write to disk. It's typically used when you need high throughput and can't write everything to disk right away. So it's not really about whether you see your changes right away: it's an optimization for when you can tolerate some index data loss - the latest changes might be lost.
As mentioned in the documentation here http://docs.jboss.org/hibernate/search/5.5/reference/en-US/html_single/#jgroups-backend , you can configure the JGroups backend as synchronous, so that Hibernate Search blocks until all the indexes are in sync. So it can work for your case.
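For reference, here is a minimal sketch of selecting that synchronous JGroups backend when bootstrapping JPA (the property names come from the 5.5 documentation; the persistence-unit name is made up):

import java.util.HashMap;
import java.util.Map;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Persistence;

public class SearchBootstrap {
    public static EntityManagerFactory create() {
        Map<String, String> props = new HashMap<String, String>();
        // Route index work through JGroups; the master/slave roles are negotiated.
        props.put("hibernate.search.default.worker.backend", "jgroups");
        // Block until the backend has acknowledged the index update.
        props.put("hibernate.search.default.worker.execution", "sync");
        // Note: setting hibernate.search.default.indexmanager to "near-real-time"
        // would conflict with a clustered/shared index, per the tradeoff above.
        return Persistence.createEntityManagerFactory("search-demo", props);
    }
}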
Note that for 5.6 we are currently working on an Elasticsearch backend, which might be of some interest to you as it's designed precisely for your case. It's still in beta but it's already in pretty good shape. You might want to take a look at it: http://docs.jboss.org/hibernate/search/5.6/reference/en-US/html/ch11.html .
I am evaluating Apache Ignite to check if it fits our company's needs. So far, so good. Now I am trying to understand how the near cache feature works in terms of consistency.
We currently have several micro-services, each with one Ignite node configured in client mode. All these instances are connected to several Ignite servers in a cluster. For some use cases (reads >>> writes) it seems reasonable to use a near cache in front of the cache servers. I have checked, and it seems to automatically invalidate stale data in all instances in case of a write, which is good.
My question: is there any documentation besides this one that explains how it works? In particular, I would like to understand whether any subsequent read request (after the write) to any other instance will get the updated data (i.e. no eventual consistency).
Thanks!
With FULL_SYNC mode all copies are always consistent; there is no eventual consistency. A near cache functions as a sort of additional backup copy.
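To make that concrete, here is a minimal sketch of such a configuration (cache and key names are made up):

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheWriteSynchronizationMode;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.NearCacheConfiguration;

public class NearCacheExample {
    public static void main(String[] args) {
        Ignition.setClientMode(true); // this JVM is one of the micro-service clients
        try (Ignite ignite = Ignition.start()) {
            CacheConfiguration<Integer, String> cfg = new CacheConfiguration<>("products");
            // FULL_SYNC: a write returns only after all copies have been updated.
            cfg.setWriteSynchronizationMode(CacheWriteSynchronizationMode.FULL_SYNC);
            // Near cache on this client; its entries are invalidated on remote writes.
            IgniteCache<Integer, String> cache =
                    ignite.getOrCreateCache(cfg, new NearCacheConfiguration<Integer, String>());
            cache.put(1, "widget");
            System.out.println(cache.get(1));
        }
    }
}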
I don't think there is a design document on how it works though.
Within our company it's kind of a standard to create repositories for data which is originally stored in the database, as described for example in https://thinkinginobjects.com/2012/08/26/dont-use-dao-use-repository/.
Our web infrastructure consists of a few independent web applications within Tomcat 7 for printing, product description, product order (this one is not persisted in the database!), category description etc.
They are all built on the Servlet 2 API.
So each instance/implementation of a repository holds a specialised kind of data represented by serializable classes, and the instances of these serializable classes are set up/filled by a periodically executed database query (for every result row the setters of the fields are called; it reminds me of domain-oriented entity beans with CMP).
The repositories are initialized in the servlets' init sequences (so every servlet keeps its own set of instances).
Each context has its own connection to the Oracle database (set up by a resource description file on deployment).
All the data is read only, we never need to write back to the database.
Because we need some of these data types for more than one web application (context), and some even for more than one servlet within the same web context, repositories with an identical data type are instantiated more than once - e.g. four times, twice within the same application.
In the end some of the data is duplicated, and I'm not sure if this is as clever and efficient as it should be. It should be possible to share the same repository object across more than one application (JNDI?), but at least it must be possible to share it between several servlets within the same application context - see the sketch below.
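For the within-one-context case, here is a minimal sketch of what I have in mind (the attribute name and the loader are placeholders); the listener would be registered in web.xml:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;

// Builds the repository once at context start-up and shares it with all
// servlets of this web application via a ServletContext attribute.
public class RepositoryBootstrapListener implements ServletContextListener {

    public void contextInitialized(ServletContextEvent sce) {
        sce.getServletContext().setAttribute("productRepository",
                Collections.unmodifiableMap(loadFromDatabase()));
    }

    public void contextDestroyed(ServletContextEvent sce) {
        // nothing to release; the map is collected together with the context
    }

    private Map<Long, String> loadFromDatabase() {
        // Placeholder for the periodically executed database query.
        return new HashMap<Long, String>();
    }
}

Every servlet in the same context could then call getServletContext().getAttribute("productRepository") instead of building its own copy.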
Besides, I'm troubled by the idea of using a "self-built" repository instead of something like a well-tested, openly developed cache (Ehcache, JCS, ...), because some of these caches also provide options for distributed caching (so it should also work within the same container).
If certain entries are searched for, the search algorithm iterates over all entries in the repository (see the link above). For every search pattern there are specialised functions which are called directly from within the business logic classes using the "entity beans"; there's no specification object or interface.
In the end the application server as a whole does not perform that well, and it uses an awful lot of RAM (at least for approximately 10,000 DB entries); in my opinion this is most probably related to the use of serializable XSD-to-JAXB-generated classes.
Additionally, every time an application is deployed for tests you have to wait at least two minutes until all entries of the database have been loaded into the repositories - and when deploying to live there's a clearly noticeable out-of-service phase during context/servlet start-up.
I tend to think all of this is closely related to the solutions I described above.
Because I haven't got any experience in this field and I'm new to the company, I don't want to be too obtrusive.
Maybe you can help me to evaluate ideas for a better setup:
Is it better for performance and memory to unify all the repositories into one "repository servlet" and request objects from there via HTTP (I don't think so, though it seems quite modular/distributed-system friendly), or should I try to go with JNDI (never did that before) and connect to the repository similar to a JDBC database?
Wouldn't it be even more sensible, faster and more efficient to use at least one single connection pool for the whole Tomcat instance (and reference this connection pool from within the web apps' deployment descriptors)? Or might that slow down connections or limit them in any other aspect?
I was told that the cache system (Ehcache) didn't work well (at least not with the performance of the self-written solution - though I can't believe that). I imagine that repositories backed by a distributed cache (as in: shared across all contexts) and used in all web applications should not only reduce the memory footprint significantly, but should also not be significantly slower - I believe it would be faster, have shorter start-up times, and wouldn't need to be redeployed as often.
I'm very grateful for every tip or hint and your thoughts. Would be marvellous to get a peer review of my ideas based on practical experiences.
So thank you very much in advance!
Is it better to hold a repository for every web application (context), or is it better to share a common instance via JNDI or a similar technique?
Unless someone proves me otherwise, I would say there is no way to do it in a standard way, meaning as defined in the Servlet Spec or in the rest of the Java EE spec canon.
There are technical ways to do it which probably depend on a specific application server implementation, but those cannot be "better" in the universal sense.
If you have two applications that operate on the same data, I wonder whether the partitioning of the applications is useful. Maybe all functionality operating on some kind of data needs to be in the same application?
Within our company it's kind of a standard to create repositories for data which is originally stored in the database, as described for example in https://thinkinginobjects.com/2012/08/26/dont-use-dao-use-repository/.
I looked up Evans on our bookshelf. The blog post is quite weird: a repository and a DAO are basically the same thing - each provides CRUD operations for an object or for a tree of objects (Evans says only for the aggregate roots).
The repositories are initialized in the servlets' init sequences (so every servlet keeps its own set of instances). Each context has its own connection to the Oracle database (set up by a resource description file on deployment). [...]
In the end the application server as a whole does not perform that well and it uses an awful lot of RAM
When something performs badly, it's best to do profiling, e.g. with YourKit, or with perf and FlameGraphs if you are on Linux. If your applications need a lot of RAM, analyze the heap, e.g. with Eclipse MAT. There is no way somebody can give you a recommendation or a best-practice hint without seeing a single line of code.
A general answer would have to include everything about performance tuning for Oracle DBs, JDBC, Java collections and concurrent programming, networking and operating systems.
I was told that the cache system (Ehcache) didn't work well (at least not with the performance of the self-written solution - though I can't believe that)
I can. Ehcache is between 10 and 20 times slower than a simple HashMap. See: cache benchmarks. You only need a map when you do a complete preload and don't have any mutations.
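For illustration, a minimal generic sketch of such a preloaded, never-mutated map (all names are made up):

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Minimal sketch: preload once, never mutate; lookups are then plain
// HashMap reads with no cache-library overhead.
public final class PreloadedRepository<K, V> {
    private final Map<K, V> entries;

    public PreloadedRepository(Map<K, V> preloaded) {
        // Defensive copy, then an unmodifiable view: safe to share between
        // servlets without any synchronization, because it never changes.
        this.entries = Collections.unmodifiableMap(new HashMap<K, V>(preloaded));
    }

    public V find(K key) {
        return entries.get(key);
    }
}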
I imagine that repositories backed by a distributed cache (as in: shared across all contexts) and used in all web applications should not only reduce the memory footprint significantly, but should also not be significantly slower
Distributed caches need to go over the network and add serialization/deserialization overhead. That's probably another factor of 30 slower. And when is the distributed cache updated?
I'm very grateful for every tip or hint and your thoughts.
Wrap-up:
- Do the normal software engineering homework: profile, analyze, and spend the tuning effort in the right places.
- Ask specific questions on one topic on Stack Overflow, and share your code and performance data. Ask about one thing at a time and read https://stackoverflow.com/help/on-topic
- You may also come to the conclusion that there is nothing to tune. There are applications out there that need a day to build up an in-memory data structure from persistent data. Maybe it's just a lot of data? If you don't like the downtime, use blue-green deployment. Also use smaller data sets for development and testing.
I have a problem with a product that I am currently working on. Essentially, there is some very commonly used (and very seldom updated) information that is retrieved from the database on server start-up. We do not want to query the database every time this information is needed, because that happens very frequently. There is a way to update this information through the application (only by an admin). When this method is used, the data in the database is updated and the cached data on that single server (1 of 4) is updated. Unfortunately, if a user hits any of the other servers they will not see the updated information. Restarting the cluster remedies the problem; however, that is not a feasible solution for our production environment. Now that I have explained the situation, I am open to suggestions. Thank you for your time.
For a simple solution, you can go to the cluster in the admin console and ripple-start it. That stops/starts the nodes gracefully, one at a time. The only impact is a 25% reduction in capacity while it is working.
IBM WebSphere Application Server has a Dynamic Cache that you can use to store Java objects. The cache can be set up to use replication over a replication domain so it can be shared across a cluster.
Your code would use the DistributedMap interface to interact with the cache. All settings for the dynamic cache can be included with your application or it can be pre-configured. Examples are included in the javadoc link.
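A minimal sketch of that usage (the JNDI name below is the default for the baseCache instance; a custom object cache instance would be bound under its own name):

import javax.naming.InitialContext;
import javax.naming.NamingException;
import com.ibm.websphere.cache.DistributedMap;

public class ConfigCache {
    // Default JNDI name of the dynamic cache's base instance.
    private static final String JNDI_NAME = "services/cache/distributedmap";

    public static void store(String key, Object value) throws NamingException {
        DistributedMap cache = (DistributedMap) new InitialContext().lookup(JNDI_NAME);
        cache.put(key, value); // replicated across the cluster per the replication policy
    }

    public static Object load(String key) throws NamingException {
        DistributedMap cache = (DistributedMap) new InitialContext().lookup(JNDI_NAME);
        return cache.get(key);
    }
}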
(Similar to Java EE Application-scoped variables in a clustered environment (Websphere)?)
That is, I think the standard answer would be a "distributed object store". But a crude alternative (that we use) would be to configure a list of server:port combinations to contact, informing each cluster member to update its own copy of the data.
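A minimal sketch of that crude approach (the member list and the /admin/refresh-cache endpoint are made up):

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;

// After an admin update, ping every configured cluster member so that each
// one reloads its local copy of the cached data.
public class CacheRefreshNotifier {
    private final List<String> members; // e.g. "app1:9080", "app2:9080", ...

    public CacheRefreshNotifier(List<String> members) {
        this.members = members;
    }

    public void broadcastRefresh() {
        for (String member : members) {
            try {
                URL url = new URL("http://" + member + "/admin/refresh-cache");
                HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                conn.setRequestMethod("POST");
                conn.setConnectTimeout(2000);
                conn.getResponseCode(); // a member that fails here keeps stale data
                conn.disconnect();
            } catch (IOException e) {
                // Log and continue: one unreachable member must not block the rest.
                System.err.println("Failed to notify " + member + ": " + e.getMessage());
            }
        }
    }
}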
According to this thread, Jedis is the best thing to use if I want to use Redis from Java.
However, I was wondering if there are any libraries/packages providing similarly efficient set operations to those that already exist in Redis, but which can be embedded directly in a Java application without the need to set up separate servers (in the same way that Jetty can be embedded as a web server).
To be more precise, I would like to be able to do the following efficiently:
There is a large set of M users (M not known in advance).
There is a large set of N items.
We want users to examine items, one user/item at a time, which produces a stored result (in a normal database.)
Each time a user arrives, we want to assign to that user the item with the least number of existing results that the user has not already seen. This produces an approximately round-robin assignment of the items over all arriving users, when we just care about getting all items looked at approximately the same number of times.
The above happens in a parallelized fashion. When M and N are large, Redis accomplishes the above much more efficiently than SQL queries. Is there some way to do this using an embeddable Java library that is a bit more lightweight than starting a Redis server?
I recognize that it's possible to write a pile of code using Java's concurrency libraries that would roughly approximate this (and to some extent, I have done that), but that's not exactly what I'm looking for here.
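To make the access pattern concrete, this is roughly what it looks like today against a real Redis with Jedis (key names are illustrative):

import redis.clients.jedis.Jedis;

public class ItemAssigner {
    // One sorted set scoring items by how many results they already have,
    // plus one set per user of items that user has already seen.
    public String assignItem(Jedis jedis, String userId) {
        // Walk items from least-examined to most-examined.
        for (String item : jedis.zrange("items:by-result-count", 0, -1)) {
            if (!jedis.sismember("seen:" + userId, item)) {
                jedis.sadd("seen:" + userId, item);
                jedis.zincrby("items:by-result-count", 1, item); // one more assignment
                return item;
            }
        }
        return null; // this user has seen every item
    }
}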
Have a look at project Voldemort. It's a distributed key-value store created by LinkedIn, and it supports being embedded.
In the quick start guide there is a small example of running the server embedded vs. stand-alone.
// Load the server configuration from the VOLDEMORT_HOME environment variable
VoldemortConfig config = VoldemortConfig.loadFromEnvironmentVariable();
// Run the server inside the current JVM instead of as a separate process
VoldemortServer server = new VoldemortServer(config);
server.start();
I don't know much about Redis, so I can't compare them feature for feature. In the project where we used Voldemort, we used its read-only backing store with great results. It allowed us to "precompile" a bi-daily database in our processing data center and ship it out to the edge data centers. That way each edge data center had a local copy of its dataset.
EDIT: After rereading your question, I wanted to add Guava's Table - this Table data structure may also be something you're looking for, and it's similar to what you get with many NoSQL databases.
Hazelcast provides a number of distributed data structure implementations which can be used as a pure-Java alternative to Redis' services. You could then ship a single jar with all required dependencies to run your application. You may have to adjust your application for the slightly different primitives relative to Redis.
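A minimal embedded sketch (key names are made up; imports as in Hazelcast 3.x):

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.ISet;

public class EmbeddedHazelcast {
    public static void main(String[] args) {
        // Starts a cluster member inside this JVM - no separate server process.
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        // A distributed set, roughly comparable to a Redis set.
        ISet<String> seen = hz.getSet("seen:user-1");
        seen.add("item-42");
        System.out.println(seen.contains("item-42")); // true

        hz.shutdown();
    }
}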
Commercial solutions in this space include Teracotta's Enterprise Ehcache and Oracle Coherence.
Take a look at LMDB (Lightning Memory-Mapped Database), because I needed exactly the same thing. I deploy a Dropwizard application into a container, and adding Redis or another external dependency is painful. This seems to perform well and the project shows good activity. FYI, though, I have not yet used it in production.
https://github.com/lmdbjava/lmdbjava
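A minimal open/put/get sketch with lmdbjava, following its tutorial (the path and names are made up):

import static java.nio.charset.StandardCharsets.UTF_8;

import java.io.File;
import java.nio.ByteBuffer;
import org.lmdbjava.Dbi;
import org.lmdbjava.DbiFlags;
import org.lmdbjava.Env;
import org.lmdbjava.Txn;

public class LmdbExample {
    public static void main(String[] args) {
        File dir = new File("/tmp/lmdb-demo"); // directory must exist
        dir.mkdirs();
        Env<ByteBuffer> env = Env.create()
                .setMapSize(10_485_760) // 10 MiB memory map
                .setMaxDbs(1)
                .open(dir);
        Dbi<ByteBuffer> db = env.openDbi("demo", DbiFlags.MDB_CREATE);

        ByteBuffer key = ByteBuffer.allocateDirect(env.getMaxKeySize());
        ByteBuffer val = ByteBuffer.allocateDirect(128);
        key.put("greeting".getBytes(UTF_8)).flip();
        val.put("hello".getBytes(UTF_8)).flip();
        db.put(key, val); // uses an implicit write transaction

        try (Txn<ByteBuffer> txn = env.txnRead()) {
            System.out.println(UTF_8.decode(db.get(txn, key)));
        }
        env.close();
    }
}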
Google's Guava library provides friendly versions of the same (and more) set operators that Redis provides.
https://code.google.com/p/guava-libraries/wiki/CollectionUtilitiesExplained
e.g.
Guava                      Redis
Sets.intersection(a, b)    SINTER a b
a.size()                   SCARD a
Sets.difference(a, b)      SDIFF a b
Sets.union(a, b)           SUNION a b
Multisets are a reasonably straightforward proxy for Redis sorted sets as well.
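A quick sketch of those operators in use; note that Guava returns live, unmodifiable views rather than copies, so they stay cheap even for large sets:

import java.util.Set;
import com.google.common.collect.Sets;

public class GuavaSetOps {
    public static void main(String[] args) {
        Set<String> a = Sets.newHashSet("x", "y", "z");
        Set<String> b = Sets.newHashSet("y", "z", "w");

        Set<String> inter = Sets.intersection(a, b); // like SINTER: [y, z]
        Set<String> diff  = Sets.difference(a, b);   // like SDIFF:  [x]
        Set<String> union = Sets.union(a, b);        // like SUNION: [x, y, z, w]

        System.out.println(inter + " " + diff + " " + union + " size=" + a.size()); // SCARD
    }
}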
What are the possibilities to distribute data selectively?
I explain my question with an example.
Consider a central database that holds all the data. This database is located in a certain geographical location.
Application A needs a subset of the information present in the central database. Also, application A may be located in a geographical location different (and maybe far) from the one where the central database is located.
So, I thought about creating a new database at the same location of application A that would contain a subset of information of the central database.
Which technology/product allows me to deploy such a configuration?
Thanks
Look for database replication. SQL Server can definitely do this; others (Oracle, MySQL, ...) should have it, too.
The idea is that the other location maintains a (subset) copy. Updates are exchanged incrementally. The way to treat conflicts depends on your application.
Most major database software, such as MySQL and SQL Server, can do the job, but it is not a good model. With the growth of the application (traffic and users), not only will you create load on the central database server (which might be serving other applications), but you will also be abusing your network bandwidth to transfer data between the faraway database and the application server.

A better model is to keep your data close to the application server, and use the faraway database for backup and recovery purposes only. You can use an FC/IP SAN (or any other storage network architecture) as your storage network model, based on your applications' needs.
One big question that you didn't address is if Application A needs read-only access to the data or if it needs to be read-write.
The immediate concept that comes to mind when reading your requirements is sharding. In MySQL, this can be accomplished with partitioning. That being said, before you jump into partitions, make sure you read up on their pros and cons. There are instances where partitioning can slow things down if your indexes are not well chosen, or your partitioning scheme is not well thought out.
If your needs are read-only, then this should be a fairly simple solution. You can use MySQL in a Master-Slave context, and use App A off a slave. If you need read-write, then this becomes much more complex.
Depending on your write needs, you can split your reads to your slave and your writes to the master, but that adds significant complexity to your code structure (you need to deal with multiple connections to multiple DBs). The advantage of this kind of layout is that you don't need a complex DB infrastructure.
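A minimal sketch of that read/write split (the two DataSources would be configured against the respective MySQL hosts; names here are made up):

import java.sql.Connection;
import java.sql.SQLException;
import javax.sql.DataSource;

// Reads go to the slave, writes go to the master. Every call site now has
// to decide which pool it needs - that is the extra complexity mentioned above.
public class RoutingConnectionProvider {
    private final DataSource master;
    private final DataSource slave;

    public RoutingConnectionProvider(DataSource master, DataSource slave) {
        this.master = master;
        this.slave = slave;
    }

    public Connection getConnection(boolean readOnly) throws SQLException {
        return readOnly ? slave.getConnection() : master.getConnection();
    }
}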
On the flip side, you can keep your code as is and use master-master replication in MySQL. Although not officially supported by Oracle, a lot of people have had success with it. A quick Google search will find you a huge list of blogs, howtos, etc. Just keep in mind that your code has to be written properly to support this (e.g. you cannot use auto-increment fields for PKs, etc.).
If you have cash to spend, then you can look at some of the more commercial offerings. Oracle DB and SQL Server both support this.
You can also use block-based data replication, such as DRBD (and MySQL on DRBD), to handle the replication between your nodes, but the problem you will always encounter is what happens if the link between the two nodes fails.
The biggest issue you will encounter is how to handle conflicting updates in 2 separate DB nodes. If your data is geographically dependent, then this may not be an issue for you.
Long story short, this is not an easy (or inexpensive) problem to resolve.
It's important to address the possibility of conflicts at the design phase anytime you are talking about replicating databases.
Moving on from that, SAP's Sybase Replication Server will allow you to do just that, with either Sybase databases or 3rd-party databases.
In Sybase's world this is frequently called a corporate roll-up environment. There may be multiple geographically separated databases, each with a subset of data which they have primary control over. At the HQ, there is a server that contains all the various subsets in one repository. You can choose to replicate whole tables, or replicate based on values in individual rows/columns.
This keeps the databases in a loosely consistent state. Transaction rates, geographic separation, and the latency inherent to the network will affect how quickly updates move from one database to another. If a network connection is temporarily down, Sybase Replication Server will queue up transactions and send them as soon as the link comes back up, but the reliability and stability of the replication system will be affected by the stability of the network connection.
Again, as others have stated, it's not cheap, but it's relatively straightforward to implement and maintain.
Disclaimer: I have worked for Sybase, and am still part of the SAP family of companies.