Apache Lucene - Optimizing Searching

Apache Lucene - Optimizing Searching - java

I am developing a web application in Java (using Spring) that uses a SQL Server database. I use Apache Lucene to implement a search feature for my web application. With Apache Lucene, before I perform a search I create an index of titles. I do this by first obtaining a list of all titles from the database. Then I loop through the list of titles and add each one of them to the index. This happens every time a user searches for something.
I would like to know if there is a better, more efficient way of creating the index? I know my way is very inefficient, and will take a long time to complete when the list of titles is very long.
Any suggestions would be highly appreciated.
Thanks

Before you optimize Lucene: SQL Server already has a full-text search feature. If this covers your use case then use it. It's the easiest way since SQL Server takes care of keeping the search index in sync with the database.
If the SQL Server full-text search does not fit your use case then your application has to create its own search index and keep it in sync with the database. To do this you should:
create / update the search index when your application starts
update the search index when the application inserts, updates or deletes a title
Lucene is flexible where it stores the search index. You can store it in a directory in your file system or in the database (or write you own storage provider). I recommend to store it in the file system as the performance is much better than when you store it in the database.
If you don't have too many titles to index you could also use an in-memory search index which you recreate every time your application starts.

You should:
make Lucene index before you start application
update index when you add/remove/update title in your database
Benefits of this approach:
One full index when application is offline
incremental indexing, each time relevant information is changed

Related

Apache Solr for single database column suggestions

I have a relational database with few tables. Some of them have columns that I want to enable autocompletion / autocorrection on (e.g. titles, tags, categories).
I have seen that Apache Solr, which builds upon Lucene indexing can offer such functionality. Also data can be fed in to Solr from relational database.
My question is: is this the best way I can get autocomplete and autocorrect services for my entities? Or am I killing a mosquito with a bazooka here?
Solr requires a lot of resources, memory and stuff and I wonder if something far simpler can do the trick for me.

How many unique values do you have in title, tags , categories? A few thousand? Then I think you can get away with using a Trie Data structure. A few million records in those columns? Then Solr / Elasticsearch might be good option.
I have used Trie for autosuggestion. Building a Trie is expensive. But you can store the trie in Memcached or even SQL and update it periodically when new data is added to your columns.

Hibernate Search, Lucene or any other alternative?

I have a query which is doing ILIKE on some 11 string or text fields of table which is not big (500 000), but for ILIKE obviously too big, search query takes round 20 seconds. Database is postgres 8.4
I need to implement this search to be much faster.
What came to my mind:
I made additional TVECTOR column assembled from all columns that need to be searched, and created the full text index on it. The fulltext search was quite fast. But...I can not map this TVECTOR type in my .hbms. So this idea fell off (in any case i thaught it more as a temporary solution).
Hibernate search. (Heard about it first time today) It seems promissing, but I need experienced opinion on it, since I dont wanna get into the new API, possibly not the simplest one, for something which could be done simpler.
Lucene
In any case, this has happened now with this table, but i would like to solution to be more generic and applied for future cases related to full text searches.
All advices appreciated!
Thanx

I would strongly recommend Hibernate Search which provides a very easy to use bridge between Hibernate and Lucene. Rememeber you will be using both here. You simply annotate properties on your domain classes which you wish to be able to search over. Then when you update/insert/delete an entity which is enabled for searching Hibernate Search simply updates the relevant indexes. This will only happen if the transaction in which the database changes occurs was committed i.e. if it's rolled back the indexes will not be broken.
So to answer your questions:
Yes you can index specific columns on specific tables. You also have the ability to Tokenize the contents of the field so that you can match on parts of the field.
It's not hard to use at all, you simply work out which properties you wish to search on. Tell Hibernate where to keep its indexes. And then can use the EntityManager/Session interfaces to load the entities you have searched for.

Since you're already using Hibernate and Lucene, Hibernate Search is an excellent choice.
What Hibernate Search will primarily provide is a mechanism to have your Lucene indexes updated when data is changed, and the ability to maximize what you already know about Hibernate to simplify your searches against the Lucene indexes.
You'll be able to specify what specific fields in each entity you want to be indexed, as well as adding multiple types of indexes as needed (e.g., stemmed and full text). You'll also be able to manage to index graph for associations so you can make fairly complex queries through Search/Lucene.
I have found that it's best to rely on Hibernate Search for the text heavy searches, but revert to plain old Hibernate for more traditional searching and for hydrating complex object graphs for result display.

I recommend Compass. It's an open source project built on top of Lucene that provider a simpler API (than Lucene). It integrates nicely with many common Java libraries and frameworks such as Spring and Hibernate.

I have used Lucene in the past to index database tables. The solution works great, but remeber that you need to maintain the index. Either, you update the index every time your objects are persisted or you have a daemon indexer that dump the database tables in your Lucene index.
Have you considered Solr? It's built on top of Lucene and offers automatic indexing from a DB and a Rest API.

A year ago I would have recommended Compass. It was good at what it does, and technically still happily runs along in the application I developed and maintain.
However, there's no more development on Compass, with efforts having switched to ElasticSearch. From that project's website I cannot quite determine if it's ready for the Big Time yet or even actually alive.
So I'm switching to Hibernate Search which doesn't give me that good a feeling but that migration is still in its initial stages, so I'll reserve judgement for a while longer.

All the projects are based on Lucene. If you want to implement a very advanced features I advice you to use Lucene directly. If not, you may use Solr which is a powerful API on top of lucene that can help you index and search from DB.

Full text search on Google App Engine (Java)

There are a few threads floating around on the topic, but I think my use-case is somewhat different.
What I want to do:
Full text search component for my GAE/J app
The index size is small: 25-50MB or so
I do not need live updates to the index, a periodic re-indexing is fine
This is for auto-complete and the like, so it needs to be extremely fast (I get the impression that implementing an inverted index in Datastore introduces considerable latency)
My strategy so far (just planning, haven't tried implementing anything yet):
Use Lucene with RAMDirectory
A periodic cron job creates the index, serializes it to the Datastore, stores an update id (or timestamp)
Search servlet loads the index on startup and creates the RAMDirectory
On each request the servlet checks the current update id and reloads the index as necessary
The main thing I'm fuzzy on is how to synchronize in-memory data between instances - will this work, or am I missing something?
Also, how far can I push it before I start having problems with memory use? I couldn't find anything on RAM quotas for GAE. (This index is small, but I can think of more stuff I'd like to add)
And, of course, any thoughts on better approaches?

If you're okay with periodic rebuilds, and your index is small, your current approach sounds mostly okay. Instead of building the index online and serializing it to the datastore, though, why not build it offline, and upload it with the app? Then, you can instantiate it directly from the disk store, and to push an update, you deploy a new version of your app.

Recently GAE added "text search" service. Take a look at GAE Java Text Search

For autocomplete, perhaps you could store the top N matches for each prefix (basically what you'd put in the drop-down menu) in memcache? The memcache entities could be backed by entities in the datastore and reloaded if needed.

Well, as of GAE 1.5.0 looks like resident Backends can be used to create a search service.
Of course, there's no free quota for these.

App Engine now includes a full-text search API (Experimental): https://developers.google.com/appengine/docs/java/search/

"Should I use multiple indices in Solr?", and some other quick Q

Imagine a classifieds website, a very simple one where users don't have login details.
I have this currently with MySql as a db. The db has several tables, because of the categories, but one main table for the classified itself. Total of 7 tables in my case.
I want to use only Solr as a "db" because some people on SO thinks it would be better, and I agree, if it works that is.
Now, I have some quick questions about doing this:
Should I have multiple scheema.xml files or config.xml files?
How do I query multiple indices?
How would this (having multiple indices) affect performance and do I need a more powerful machine (memory, cpu etc...) for managing this?
Would you eventually go with only Solr instead of what I planned to do, which is to use Solr to search and return ID numbers which I use to query and find the classifieds in MySql?
I have some 300,000 records today, and they probably won't be increasing.
I have not tested how the records would affect performance when using Solr with MySql, because I am still creating the website, but when using only MySql it is quite slow.
I am hoping it will be better with Solr + MySql, but as I said, if it is possible I will go with only Solr.
Thanks

4 : If an item has status fields that get updated much more frequently than the rest of the record then it's better to store that information in a database and retrieve it when you access the item. For example, if you stored your library book holdings in a solr index, you would store the 'borrowed' status in a database. Updating Solr can take a fair bit of resources and if you don't need to search on a field it doesn't really need to be in Solr.

Questions about SOLR documents and some more

Website: Classifieds website (users may put ads, search ads etc)
I plan to use SOLR for searching and then return results as ID nr:s only, and then use those ID nr:s and query mysql, and then lastly display the results with those ID:s.
Currently I have around 30 tables in MySQL, one for each category.
1- Do you think I should do it differently than above?
2- Should I use only one SOLR document, or multiple documents? Also, is document the same as a SOLR index?
3- Would it be better to Only use SOLR and skip MySQL knowing that I have alot of columns in each table? Personally I am much better at using MySQL than SOLR.
4- Say the user wants to search for cars in a specific region, how is this type of querying performed/done in SOLR? Ex: q=cars&region=washington possible?
You may think there is alot of info about SOLR out there, but there isn't, and especially not about using PHP with SOLR and a SOLR php client... Maybe I will write something when I have learned all this... Or maybe one of you could write something up!
Thanks again for all help...

First, the definitions: a Solr/Lucene document is roughly the equivalent of a database row. An index is roughly the same as a database table.
I recommend trying to store all the classified-related information in Solr. Querying Solr and then the database is inefficient and very likely unnecessary.
Querying in a specific region would be something like q=cars+region:washington assuming you have a region field in Solr.
The Solr wiki has tons of good information and a pretty good basic tutorial. Of course this can always be improved, so if you find anything that isn't clear please let the Solr team know about it.
I can't comment on the PHP client since I don't use PHP.

Solr is going to return it's results in a syntax easily parsible using SimpleXml. You could also use the SolPHP client library: http://wiki.apache.org/solr/SolPHP.
Solr is really quite efficient. I suggest putting as much data into your Solr index as necessary to retrieve everything in one hit from Solr. This could mean much less database traffic for you.
If you've installed the example Solr application (comes with Jetty), then you can develop Solr queries using the admin interface. The URI of the result is pretty much what you'd be constructing in PHP.
The most difficult part when beginning with Solr is getting the solrconfig.xml and the schema.xml files correct. I suggest starting with a very basic config, and restart your web app each time you add a field. Starting off with the whole schema.xml can be confusing.

2- Should I use only one SOLR document, or multiple documents? Also, is document the
same as a SOLR index?
3- Would it be better to Only use SOLR and skip MySQL knowing that I have alot of
columns in each table? Personally I am much better at using MySQL than SOLR.
A document is "an instance" of solr index. Take into account that you can build only one solr index per solr Core. A core acts as an independent solr Server into the same solr insallation.
http://wiki.apache.org/solr/CoreAdmin
Yo can build one index merging some table contents and some other indexes to perform second level searches...
would you give more details about your architecture and data??

As suggested by others you can store and index your mysql data and can run query in solr index, thus making mysql unnecessary to use.
You don't need to just store and index ids and query and get ids and then run mysql query to get additional data against that id. You can just store other data corresponding to ids in solr itself.
Regarding solr PHP client, then you don't need to use and it is recommended to directly use REST like Solr Web API. You can use PHP function like file_get_contents("http://IP:port/solr/#/core/select?q=query&start=0&rows=100&wt=json") or use curl with PHP if you need to. Both ways are almost same and efficient. This will return data in json as wt=json. Then use PHP function json_decode($returned_data) to get that data in object.
If you need to ask anything just reply.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Apache Lucene - Optimizing Searching - java

You should: make Lucene index before you start application update index when you add/remove/update title in your database Benefits of this approach: One full index when application is offline incremental indexing, each time relevant information is changed

Related

Apache Solr for single database column suggestions

Hibernate Search, Lucene or any other alternative?

Full text search on Google App Engine (Java)

"Should I use multiple indices in Solr?", and some other quick Q

Questions about SOLR documents and some more

Categories

Resources