Full text search on Google App Engine (Java) - java

There are a few threads floating around on the topic, but I think my use-case is somewhat different.
What I want to do:
Full text search component for my GAE/J app
The index size is small: 25-50MB or so
I do not need live updates to the index, a periodic re-indexing is fine
This is for auto-complete and the like, so it needs to be extremely fast (I get the impression that implementing an inverted index in Datastore introduces considerable latency)
My strategy so far (just planning, haven't tried implementing anything yet):
Use Lucene with RAMDirectory
A periodic cron job creates the index, serializes it to the Datastore, stores an update id (or timestamp)
Search servlet loads the index on startup and creates the RAMDirectory
On each request the servlet checks the current update id and reloads the index as necessary
The main thing I'm fuzzy on is how to synchronize in-memory data between instances - will this work, or am I missing something?
Also, how far can I push it before I start having problems with memory use? I couldn't find anything on RAM quotas for GAE. (This index is small, but I can think of more stuff I'd like to add)
And, of course, any thoughts on better approaches?
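A minimal sketch of the "check the update id on each request and reload as necessary" step from the strategy above. It assumes the index is treated as an immutable snapshot; `VersionedIndexHolder`, `currentUpdateId` and `loader` are hypothetical names, and the two suppliers stand in for whatever reads the id and deserializes the index. Each instance keeps its own copy, so instances converge on the new index the first time they serve a request after the id changes.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Holds one index snapshot per instance and swaps it when the stored update id changes. */
final class VersionedIndexHolder<T> {

    /** An index snapshot paired with the update id that was current when it was loaded. */
    private static final class Snapshot<T> {
        final long updateId;
        final T index;

        Snapshot(long updateId, T index) {
            this.updateId = updateId;
            this.index = index;
        }
    }

    private final LongSupplier currentUpdateId; // hypothetical: reads the id stored by the cron job
    private final Supplier<T> loader;           // hypothetical: loads and deserializes the index
    private final AtomicReference<Snapshot<T>> snapshot = new AtomicReference<>();

    VersionedIndexHolder(LongSupplier currentUpdateId, Supplier<T> loader) {
        this.currentUpdateId = currentUpdateId;
        this.loader = loader;
    }

    /** Returns the cached index, reloading it only when the update id has changed. */
    T get() {
        long latest = currentUpdateId.getAsLong();
        Snapshot<T> current = snapshot.get();
        if (current == null || current.updateId != latest) {
            synchronized (this) {
                // Re-check under the lock so concurrent requests trigger a single reload.
                current = snapshot.get();
                if (current == null || current.updateId != latest) {
                    current = new Snapshot<>(latest, loader.get());
                    snapshot.set(current);
                }
            }
        }
        return current.index;
    }
}
```

Requests that arrive during a reload wait on the lock; if that latency matters, the reload could instead run in the background while the old snapshot keeps serving.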

If you're okay with periodic rebuilds, and your index is small, your current approach sounds mostly okay. Instead of building the index online and serializing it to the datastore, though, why not build it offline, and upload it with the app? Then, you can instantiate it directly from the disk store, and to push an update, you deploy a new version of your app.

GAE recently added a "text search" service. Take a look at GAE Java Text Search.

For autocomplete, perhaps you could store the top N matches for each prefix (basically what you'd put in the drop-down menu) in memcache? The memcache entities could be backed by entities in the datastore and reloaded if needed.
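A rough sketch of the "top N matches for each prefix" idea, using a plain in-memory map in place of memcache. `PrefixSuggestions` is a hypothetical name, and the input list is assumed to be sorted best-first already; in the setup described above each map entry would be a memcache value backed by a datastore entity.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Precomputes, for every prefix of every term, the first N terms that start with it. */
final class PrefixSuggestions {
    private final Map<String, List<String>> byPrefix = new HashMap<>();

    /** termsByRank must be ordered best first; at most maxPerPrefix terms are kept per prefix. */
    PrefixSuggestions(List<String> termsByRank, int maxPerPrefix) {
        for (String term : termsByRank) {
            String key = term.toLowerCase(Locale.ROOT);
            for (int end = 1; end <= key.length(); end++) {
                List<String> bucket =
                        byPrefix.computeIfAbsent(key.substring(0, end), p -> new ArrayList<>());
                if (bucket.size() < maxPerPrefix) {
                    bucket.add(term);
                }
            }
        }
    }

    /** Returns the precomputed suggestions for what the user has typed so far. */
    List<String> suggest(String typed) {
        List<String> hits = byPrefix.get(typed.toLowerCase(Locale.ROOT));
        return hits == null ? Collections.<String>emptyList() : Collections.unmodifiableList(hits);
    }
}
```

A lookup is then a single key fetch per keystroke, which is what keeps the latency low; the cost is paid when the table is rebuilt.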

Well, as of GAE 1.5.0, it looks like resident Backends can be used to create a search service.
Of course, there's no free quota for these.

App Engine now includes a full-text search API (Experimental): https://developers.google.com/appengine/docs/java/search/

Related

Hibernate Search cluster and near-real-time search

I'm trying to find the best indexing solution for implementing a search-engine in my clustered webapp, and I cannot find a clear answer to my questions in official documentations.
My Java/Java EE backend will be deployed among several load-balanced instances. The search engine will require near-real-time availability of indexed data (i.e. less than 5 seconds between indexing and the data being retrievable).
Hibernate Search can work in a clustered environment with JGroups, but the documentation also says that near-real-time, as a tradeoff, requires a non-clustered and non-shared index.
Does that mean that NRTIndexManager cannot be used in a JGroups master/slave setup, i.e. that it can only be used with a single node?
Does that mean that with such a setup, the availability of indexed data depends only on the refresh period (the period of index copy to slave nodes)?
With the standard IndexManager, you only see the latest changes when they are written to the disk and you reopen your IndexSearcher.
By default, Hibernate Search writes to disk and opens a new IndexSearcher for each query so you're sure your searches are always in sync with your database.
The NRTIndexManager is different from the standard one because it allows you to search on the latest changes indexed without an explicit write on disk. It's typically used when you need a high throughput and you can't write everything on the disk right away. So it's not really correlated to the fact that you will see your changes right away or not: it's an optimization when you can allow some index data loss - the latest changes might be lost.
As mentioned in the documentation here http://docs.jboss.org/hibernate/search/5.5/reference/en-US/html_single/#jgroups-backend , you can have a sync JGroups with Hibernate Search blocking until all the indexes are in sync. So it can work for your case.
Note that for 5.6 we are currently working on an Elasticsearch backend, which might be of some interest to you as it's typically designed for your case. It's still in beta but it's already in pretty good shape. You might want to take a look at it: http://docs.jboss.org/hibernate/search/5.6/reference/en-US/html/ch11.html .

Apache Lucene - Optimizing Searching

I am developing a web application in Java (using Spring) that uses a SQL Server database. I use Apache Lucene to implement a search feature for my web application. With Apache Lucene, before I perform a search I create an index of titles. I do this by first obtaining a list of all titles from the database. Then I loop through the list of titles and add each one of them to the index. This happens every time a user searches for something.
I would like to know if there is a better, more efficient way of creating the index? I know my way is very inefficient, and will take a long time to complete when the list of titles is very long.
Any suggestions would be highly appreciated.
Thanks
Before you optimize Lucene: SQL Server already has a full-text search feature. If this covers your use case then use it. It's the easiest way since SQL Server takes care of keeping the search index in sync with the database.
If the SQL Server full-text search does not fit your use case then your application has to create its own search index and keep it in sync with the database. To do this you should:
create / update the search index when your application starts
update the search index when the application inserts, updates or deletes a title
Lucene is flexible about where it stores the search index. You can store it in a directory in your file system or in the database (or write your own storage provider). I recommend storing it in the file system, as the performance is much better than when you store it in the database.
If you don't have too many titles to index you could also use an in-memory search index which you recreate every time your application starts.
You should:
build the Lucene index before you start the application
update the index when you add, remove or update a title in your database
Benefits of this approach:
one full indexing run while the application is offline
incremental indexing each time relevant information changes
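To illustrate the "build once, then update incrementally" pattern both answers describe, here is a small sketch that uses a hand-rolled in-memory inverted index instead of Lucene, so it runs with the JDK alone. `TitleIndex` and its methods are hypothetical names; with Lucene the same three operations would go through its index writer and searcher rather than these maps.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Long-lived index: filled once at startup, then kept in sync on insert, update and delete. */
final class TitleIndex {
    private final Map<String, Set<Long>> postings = new HashMap<>(); // token -> title ids
    private final Map<Long, String> titles = new HashMap<>();        // id -> indexed title

    /** Adds a title, or re-indexes it if the id is already present. */
    synchronized void upsert(long id, String title) {
        remove(id);
        titles.put(id, title);
        for (String token : tokenize(title)) {
            postings.computeIfAbsent(token, t -> new HashSet<>()).add(id);
        }
    }

    /** Removes a title from the index; does nothing if the id is unknown. */
    synchronized void remove(long id) {
        String old = titles.remove(id);
        if (old == null) {
            return;
        }
        for (String token : tokenize(old)) {
            Set<Long> ids = postings.get(token);
            if (ids != null) {
                ids.remove(id);
                if (ids.isEmpty()) {
                    postings.remove(token);
                }
            }
        }
    }

    /** Returns the ids of titles that contain every word of the query. */
    synchronized Set<Long> search(String query) {
        Set<Long> result = null;
        for (String token : tokenize(query)) {
            Set<Long> ids = postings.getOrDefault(token, Collections.<Long>emptySet());
            if (result == null) {
                result = new TreeSet<>(ids);
            } else {
                result.retainAll(ids);
            }
        }
        return result == null ? new TreeSet<Long>() : result;
    }

    /** Lower-cases the text and splits it on anything that is not a letter or digit. */
    private static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

The point is that a search only reads the existing index; the per-title indexing work happens once at startup and again only for the titles that change.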

What is the best way to make a copy of a blobstore entity on app engine in Java?

Here is our simple use case: user2 wants to copy user1's document into his or her own repository within our application. Should be simple, right? All we need to do is create a second identical blob in the blobstore with the key returned that we can associate with user2. We must be missing something here. It appears that app engine blob store's primary function is to handle blobs uploaded from and downloaded to a browser, and a simple copy operation initiated server-side is not so simple.
The obvious solution seemed to be using the experimental file api in java, but no love. It works, until you get up in file size beyond a MB or so, then it fails, somewhat unpredictably. Reading it all into the server layer also seems silly, when we just need to make a copy in the storage layer. In addition, the odds of us getting an experimental feature through into our production environment are slim, although non-zero.
Some information about our environment: the app is written in Java and we're using the blobstore, not cloud storage and are committed to it for now. We're a small departmental team and would like to make the case that app engine is a great platform to use, but this one has us stumped. S3 makes this blindingly simple, are we missing something really stupid here?
We ended up scrapping the idea of making a programmatic copy of the blob with the file api and going with a reference as Kalle suggested in his comment, and created a new xref entity that stores information about the copy and the original. When an image or file is deleted, we check the xref entity for references and delete the ones that point to that image/file (i.e. the ones created if the deleted image/file was copied from another one). If we don't find any xrefs at all, we delete the blob itself. We didn't like the privacy/compliance implications of leaving orphaned blobs lying around, and even though storage is cheap every $$$ helps. We also liked the idea of keeping a clean house so to speak.
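A sketch of that bookkeeping, assuming the xref data can be modelled as "which owners point at which blob key". `BlobReferences` is a hypothetical name and the map stands in for the xref entities in the datastore; the real version would also need the check and the delete to happen in one transaction.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Tracks which owners reference a blob so the blob is deleted only with its last reference. */
final class BlobReferences {
    private final Map<String, Set<String>> ownersByBlob = new HashMap<>();

    /** Records that ownerId's document points at blobKey (an upload or a "copy"). */
    synchronized void addReference(String blobKey, String ownerId) {
        ownersByBlob.computeIfAbsent(blobKey, k -> new HashSet<>()).add(ownerId);
    }

    /**
     * Drops one owner's reference.
     * Returns true when that was the last reference, i.e. the caller should now delete the blob.
     */
    synchronized boolean removeReference(String blobKey, String ownerId) {
        Set<String> owners = ownersByBlob.get(blobKey);
        if (owners == null) {
            return false; // unknown blob: nothing to delete
        }
        owners.remove(ownerId);
        if (owners.isEmpty()) {
            ownersByBlob.remove(blobKey);
            return true;
        }
        return false;
    }
}
```

A "copy" is then just `addReference(blobKey, "user2")`, and no bytes move in the storage layer.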
As already mentioned in the comment, keep one blob and pass the key around. But you really never need to delete. It is good practice to keep the blob for archive purposes. So how would delete actually work? In your datastore model, have a boolean delete field. You don't remove the blob key from an entity upon deletion. But rather, you mark the boolean field as true. This way, your product has a record of every user who has ever owned a file. But the user does not need to ever know.
Solution 1: I would launch a Google Compute Engine instance and use the gsutil command to do the copy,
then shut down the instance when it's done. This is the fastest way to do the copy, to my knowledge.
gsutil documentation
Solution 2: Personally, though, I would use a counter as suggested in the comments, because the point that you said is scary would be just as much of a problem with the copy.
So just use counters, with strong unit tests on them for example, and that will be less scary.
One idea to make it less scary: when your counter reaches 0, don't delete the blob right away, but schedule a job to do it later, using scheduled tasks in Google App Engine. Then delete the file and your actual record of it a month later, for example.

Is there an embeddable Java alternative to Redis?

According to this thread, Jedis is the best thing to use if I want to use Redis from Java.
However, I was wondering if there are any libraries/packages providing similarly efficient set operations to those that already exist in Redis, but can be directly embedded in a Java application without the need to set up separate servers. (i.e., using Jetty for web server).
To be more precise, I would like to be able to do the following efficiently:
There is a large set of M users (M not known in advance).
There is a large set of N items.
We want users to examine items, one user/item at a time, which produces a stored result (in a normal database.)
Each time a user arrives, we want to assign to that user the item with the least number of existing results that the user has not already seen before. This produces an approximate round-robin assignment of the items over all arriving users, when we just care about getting all items looked at approximately the same number of times.
The above happens in a parallelized fashion. When M and N are large, Redis accomplishes the above much more efficiently than SQL queries. Is there some way to do this using an embeddable Java library that is a bit more lightweight than starting a Redis server?
I recognize that it's possible to write a pile of code using Java's concurrency libraries that would roughly approximate this (and to some extent, I have done that), but that's not exactly what I'm looking for here.
Have a look at Project Voldemort. It's a distributed key-value store created by LinkedIn, and it supports being embedded.
In the quick start guide is a small example of running the server embedded vs. stand-alone.
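import voldemort.server.VoldemortConfig;
import voldemort.server.VoldemortServer;
// Load the server configuration from the environment, then start the server in-process.
VoldemortConfig config = VoldemortConfig.loadFromEnvironmentVariable();
VoldemortServer server = new VoldemortServer(config);
server.start();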
I don't know much about Redis, so I can't compare them feature for feature. In the project where we used Voldemort, we used its read-only backing store with great results. It allowed us to "precompile" a bi-daily database in our processing data center and "ship it" out to edge data centers. That way each edge data center had a local copy of its dataset.
EDIT: After rereading your question, I wanted to add Guava's Table. This Table data structure may also be something you're looking for, and it is similar to what you get with many NoSQL databases.
Hazelcast provides a number of distributed data structure implementations which can be used as a pure Java alternative to Redis' services. You could then ship a single "jar" with all required dependencies to run your application. You may have to adjust for the slightly different primitives relative to Redis in your own application.
Commercial solutions in this space include Terracotta's Enterprise Ehcache and Oracle Coherence.
Take a look at LMDB (Lightning Memory-Mapped Database); I needed exactly the same thing. I deploy a Dropwizard application into a container, and adding Redis or another external dependency is painful. LMDB seems to perform well and has good activity. FYI, though, I have not yet used this in production.
https://github.com/lmdbjava/lmdbjava
Google's Guava library provides friendly versions of the same (and more) set operations that Redis provides.
https://code.google.com/p/guava-libraries/wiki/CollectionUtilitiesExplained
e.g.
Guava                     Redis
Sets.intersection(a, b)   sinter a b
a.size()                  scard a
Sets.difference(a, b)     sdiff a b
Sets.union(a, b)          sunion a b
Multisets are a reasonably straightforward proxy for redis sorted-sets as well.
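For comparison, the same operations can be sketched with nothing but `java.util`. Guava's `Sets` methods return unmodifiable views over the two inputs, whereas the hypothetical `SetOps` helpers below copy into a new `HashSet`, which is simpler but costs an allocation per call.

```java
import java.util.HashSet;
import java.util.Set;

/** JDK-only counterparts of sinter, sdiff and sunion; each returns a new set. */
final class SetOps {
    private SetOps() {
    }

    /** Elements present in both a and b (Redis: sinter a b). */
    static <T> Set<T> intersection(Set<T> a, Set<T> b) {
        Set<T> result = new HashSet<>(a);
        result.retainAll(b);
        return result;
    }

    /** Elements of a that are not in b (Redis: sdiff a b). */
    static <T> Set<T> difference(Set<T> a, Set<T> b) {
        Set<T> result = new HashSet<>(a);
        result.removeAll(b);
        return result;
    }

    /** Elements present in a, in b, or in both (Redis: sunion a b). */
    static <T> Set<T> union(Set<T> a, Set<T> b) {
        Set<T> result = new HashSet<>(a);
        result.addAll(b);
        return result;
    }
}
```

None of this is thread-safe on its own; with concurrent writers the inputs would need to be concurrent sets or guarded by a lock.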

Hibernate Search, Lucene or any other alternative?

I have a query that runs ILIKE on some 11 string or text fields of a table which is not big (500,000 rows), but obviously too big for ILIKE: the search query takes around 20 seconds. The database is Postgres 8.4.
I need to implement this search to be much faster.
What came to my mind:
I made an additional tsvector column assembled from all the columns that need to be searched, and created a full-text index on it. The full-text search was quite fast. But... I cannot map this tsvector type in my .hbm.xml files. So this idea fell through (in any case I thought of it more as a temporary solution).
Hibernate Search. (I heard about it for the first time today.) It seems promising, but I need an experienced opinion on it, since I don't want to get into a new API, possibly not the simplest one, for something which could be done more simply.
Lucene
In any case, this has happened now with this table, but I would like the solution to be more generic and applicable to future cases related to full-text search.
All advice appreciated!
Thanks
I would strongly recommend Hibernate Search, which provides a very easy-to-use bridge between Hibernate and Lucene. Remember you will be using both here. You simply annotate the properties on your domain classes which you wish to be able to search over. Then, when you update/insert/delete an entity which is enabled for searching, Hibernate Search simply updates the relevant indexes. This will only happen if the transaction in which the database changes occur was committed, i.e. if it's rolled back the indexes will not be broken.
So to answer your questions:
Yes you can index specific columns on specific tables. You also have the ability to Tokenize the contents of the field so that you can match on parts of the field.
It's not hard to use at all: you simply work out which properties you wish to search on and tell Hibernate where to keep its indexes. Then you can use the EntityManager/Session interfaces to load the entities you have searched for.
Since you're already using Hibernate and Lucene, Hibernate Search is an excellent choice.
What Hibernate Search will primarily provide is a mechanism to have your Lucene indexes updated when data is changed, and the ability to maximize what you already know about Hibernate to simplify your searches against the Lucene indexes.
You'll be able to specify which fields in each entity you want to be indexed, as well as adding multiple types of indexes as needed (e.g., stemmed and full text). You'll also be able to manage the index graph for associations so you can make fairly complex queries through Search/Lucene.
I have found that it's best to rely on Hibernate Search for the text heavy searches, but revert to plain old Hibernate for more traditional searching and for hydrating complex object graphs for result display.
I recommend Compass. It's an open source project built on top of Lucene that provides a simpler API (than Lucene). It integrates nicely with many common Java libraries and frameworks such as Spring and Hibernate.
I have used Lucene in the past to index database tables. The solution works great, but remember that you need to maintain the index. Either you update the index every time your objects are persisted, or you have a daemon indexer that dumps the database tables into your Lucene index.
Have you considered Solr? It's built on top of Lucene and offers automatic indexing from a DB and a REST API.
A year ago I would have recommended Compass. It was good at what it does, and technically still happily runs along in the application I developed and maintain.
However, there's no more development on Compass, with efforts having switched to ElasticSearch. From that project's website I cannot quite determine if it's ready for the Big Time yet or even actually alive.
So I'm switching to Hibernate Search which doesn't give me that good a feeling but that migration is still in its initial stages, so I'll reserve judgement for a while longer.
All these projects are based on Lucene. If you want to implement very advanced features, I advise you to use Lucene directly. If not, you may use Solr, which is a powerful API on top of Lucene that can help you index and search from a DB.
