Questions about SOLR documents and some more - java

Website: a classifieds website (users may post ads, search ads, etc.)
I plan to use SOLR for searching and then return results as IDs only, then use those IDs to query MySQL, and finally display the results.
Currently I have around 30 tables in MySQL, one for each category.
1- Do you think I should do it differently than above?
2- Should I use only one SOLR document, or multiple documents? Also, is document the same as a SOLR index?
3- Would it be better to only use SOLR and skip MySQL, knowing that I have a lot of columns in each table? Personally I am much better at using MySQL than SOLR.
4- Say the user wants to search for cars in a specific region; how is this type of query performed in SOLR? E.g., is q=cars&region=washington possible?
You may think there is a lot of info about SOLR out there, but there isn't, and especially not about using PHP with SOLR and a SOLR PHP client... Maybe I will write something when I have learned all this... Or maybe one of you could write something up!
Thanks again for all help...

First, the definitions: a Solr/Lucene document is roughly the equivalent of a database row. An index is roughly the same as a database table.
I recommend trying to store all the classified-related information in Solr. Querying Solr and then the database is inefficient and very likely unnecessary.
Querying in a specific region would be something like q=cars+region:washington assuming you have a region field in Solr.
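As a sketch of the same query from Java with a recent SolrJ client (the core name classifieds and the region field are assumptions, not from the original post):

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class RegionSearch {
    public static void main(String[] args) throws Exception {
        // Core name, host, and field names are illustrative assumptions.
        HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/classifieds").build();

        SolrQuery query = new SolrQuery("cars");    // q=cars
        query.addFilterQuery("region:washington");  // fq=region:washington
        query.setRows(20);

        QueryResponse response = solr.query(query);
        for (SolrDocument doc : response.getResults()) {
            System.out.println(doc.getFieldValue("id") + " " + doc.getFieldValue("title"));
        }
        solr.close();
    }
}
```

Using a filter query (fq) for the region rather than folding it into q lets Solr cache the region filter independently of the search terms.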
The Solr wiki has tons of good information and a pretty good basic tutorial. Of course this can always be improved, so if you find anything that isn't clear please let the Solr team know about it.
I can't comment on the PHP client since I don't use PHP.

Solr is going to return its results in a syntax easily parsable with SimpleXML. You could also use the SolPHP client library: http://wiki.apache.org/solr/SolPHP.
Solr is really quite efficient. I suggest putting as much data into your Solr index as necessary to retrieve everything in one hit from Solr. This could mean much less database traffic for you.
If you've installed the example Solr application (comes with Jetty), then you can develop Solr queries using the admin interface. The URI of the result is pretty much what you'd be constructing in PHP.
The most difficult part when beginning with Solr is getting the solrconfig.xml and the schema.xml files correct. I suggest starting with a very basic config, and restart your web app each time you add a field. Starting off with the whole schema.xml can be confusing.
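Purely as a sketch, since the original post doesn't show a schema: a stripped-down schema.xml for a classifieds core might start out like this in recent Solr versions (field names are illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Minimal illustrative schema: grow it one field at a time. -->
<schema name="classifieds" version="1.5">
  <field name="id"     type="string"       indexed="true" stored="true" required="true"/>
  <field name="title"  type="text_general" indexed="true" stored="true"/>
  <field name="region" type="string"       indexed="true" stored="true"/>

  <uniqueKey>id</uniqueKey>

  <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
</schema>
```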

2- Should I use only one SOLR document, or multiple documents? Also, is document the same as a SOLR index?
3- Would it be better to only use SOLR and skip MySQL knowing that I have a lot of columns in each table? Personally I am much better at using MySQL than SOLR.
A document is an entry ("a row") in a Solr index. Take into account that you can build only one Solr index per Solr core. A core acts as an independent Solr server within the same Solr installation.
http://wiki.apache.org/solr/CoreAdmin
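For illustration, an older-style multicore solr.xml declaring two independent cores might look like this sketch (core names are invented):

```xml
<!-- Legacy multicore solr.xml: each core is an independent index
     with its own conf/ directory. Core names are illustrative. -->
<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="classifieds" instanceDir="classifieds"/>
    <core name="archive"     instanceDir="archive"/>
  </cores>
</solr>
```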
You can build one index by merging the contents of several tables, and other indexes to perform second-level searches...
Could you give more details about your architecture and data?

As others have suggested, you can store and index your MySQL data in Solr and run queries against the Solr index, making MySQL unnecessary.
You don't need to store and index only IDs, query Solr to get IDs back, and then run a MySQL query to fetch the additional data for each ID. You can store the other data corresponding to each ID in Solr itself.
Regarding the Solr PHP client: you don't need it; it is simpler to use Solr's REST-like web API directly. You can use a PHP function like file_get_contents("http://IP:port/solr/#/core/select?q=query&start=0&rows=100&wt=json") or use curl with PHP if you need to. Both ways are roughly equally efficient. Because of wt=json the response comes back as JSON; then use the PHP function json_decode($returned_data) to get that data as an object.
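Concretely, that approach might look like this minimal PHP sketch (host, port, core name, and query are placeholders):

```php
<?php
// Query Solr's REST-like API directly; wt=json asks for a JSON response.
// Host, port, core name, and the query string are illustrative.
$url = "http://localhost:8983/solr/classifieds/select"
     . "?q=" . urlencode("cars") . "&start=0&rows=100&wt=json";

$returned_data = file_get_contents($url);
$result = json_decode($returned_data);

// Walk the standard Solr JSON response structure.
foreach ($result->response->docs as $doc) {
    echo $doc->id, "\n";
}
```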
If you need to ask anything just reply.

Related

Apache Solr for single database column suggestions

I have a relational database with a few tables. Some of them have columns that I want to enable autocompletion / autocorrection on (e.g. titles, tags, categories).
I have seen that Apache Solr, which builds upon Lucene indexing, can offer such functionality. Data can also be fed into Solr from a relational database.
My question is: is this the best way I can get autocomplete and autocorrect services for my entities? Or am I killing a mosquito with a bazooka here?
Solr requires a lot of resources and memory, and I wonder if something far simpler could do the trick for me.
How many unique values do you have in titles, tags, and categories? A few thousand? Then I think you can get away with using a trie data structure. A few million records in those columns? Then Solr / Elasticsearch might be a good option.
I have used a trie for autosuggestion. Building a trie is expensive, but you can store the trie in Memcached or even SQL and update it periodically when new data is added to your columns.
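A minimal Java sketch of such a prefix trie (persistence to Memcached or SQL, as suggested above, is intentionally left out):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal prefix trie for autosuggestion.
public class SuggestTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    public void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    /** Collect up to 'limit' completions for the given prefix. */
    public List<String> suggest(String prefix, int limit) {
        List<String> out = new ArrayList<>();
        Node node = root;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return out; // prefix not present: no completions
        }
        collect(node, new StringBuilder(prefix), out, limit);
        return out;
    }

    private void collect(Node node, StringBuilder sb, List<String> out, int limit) {
        if (out.size() >= limit) return;
        if (node.isWord) out.add(sb.toString());
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            sb.append(e.getKey());
            collect(e.getValue(), sb, out, limit);
            sb.deleteCharAt(sb.length() - 1); // backtrack
        }
    }
}
```

Usage would be along the lines of trie.insert("cars") at load time and trie.suggest("ca", 10) per keystroke.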

How to read Lucene indexes from Solr

I have an existing web application which uses Lucene to create indexes. Now, as per the requirements, I have to set up Solr, which will serve as a search engine for many other web applications including my web app. I do not want to create indexes within Solr; hence, I need to tell Solr to read the existing Lucene indexes instead of creating and reading its own.
As a beginner with Solr, I first used Nutch to create indexes and then used those indexes within Solr. But I'm unaware how to make Solr read indexes from Lucene. I did not find any documentation around this. Kindly advise how to achieve this.
It is not possible in any reliable way.
It's like saying you built an application in Ruby and now want to use Rails on top of the existing database structure. Solr (like Rails) has its own expectations about naming and workflows around the Lucene core, and there is no migration path.
You can try using Luke to confirm the internal data-structure differences between Lucene and Solr for yourself.
I have never done this before, but as Solr is built on Lucene, you can try these steps; dataDir is the main point here.
I am assuming you are deploying it in /usr/local (change accordingly) and have basic knowledge of Solr configuration.
Download Solr and copy dist/apache-solr-x.x.x.war to tomcat/webapps
Copy example/solr/conf to /usr/local/solr/
Set solr.home to /usr/local/solr
In solrconfig.xml, change dataDir to /usr/local/solr/data (Solr looks for the index directory inside)
Change schema.xml accordingly, i.e. adjust the fields to match your existing index.
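The dataDir change is a one-line edit in solrconfig.xml; Solr then expects the Lucene index files inside an index/ subdirectory of that path:

```xml
<!-- In solrconfig.xml: point Solr at the existing data directory.
     Solr looks for the Lucene index under <dataDir>/index. -->
<dataDir>/usr/local/solr/data</dataDir>
```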

Apache Lucene - Optimizing Searching

I am developing a web application in Java (using Spring) that uses a SQL Server database. I use Apache Lucene to implement a search feature for my web application. With Apache Lucene, before I perform a search I create an index of titles. I do this by first obtaining a list of all titles from the database. Then I loop through the list of titles and add each one of them to the index. This happens every time a user searches for something.
I would like to know if there is a better, more efficient way of creating the index. I know my way is very inefficient and will take a long time to complete when the list of titles is very long.
Any suggestions would be highly appreciated.
Thanks
Before you optimize Lucene: SQL Server already has a full-text search feature. If this covers your use case then use it. It's the easiest way since SQL Server takes care of keeping the search index in sync with the database.
If the SQL Server full-text search does not fit your use case then your application has to create its own search index and keep it in sync with the database. To do this you should:
create / update the search index when your application starts
update the search index when the application inserts, updates or deletes a title
Lucene is flexible about where it stores the search index. You can store it in a directory in your file system or in the database (or write your own storage provider). I recommend storing it in the file system, as the performance is much better than when you store it in the database.
If you don't have too many titles to index you could also use an in-memory search index which you recreate every time your application starts.
You should:
build the Lucene index before you start the application
update the index when you add/remove/update a title in your database
Benefits of this approach:
one full index build while the application is offline
incremental indexing each time the relevant information changes
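A hedged sketch of the incremental approach both answers describe, using the Lucene Java API (the index path and field names are illustrative, and exact class signatures vary a little between Lucene versions):

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class TitleIndexer {
    private final IndexWriter writer;

    public TitleIndexer(String indexPath) throws Exception {
        // File-system index, built once at startup and kept open.
        Directory dir = FSDirectory.open(Paths.get(indexPath));
        writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
    }

    /** Add or replace the document for one title (call on insert/update). */
    public void upsertTitle(String id, String title) throws Exception {
        Document doc = new Document();
        doc.add(new StringField("id", id, Field.Store.YES));
        doc.add(new TextField("title", title, Field.Store.YES));
        // updateDocument first deletes any existing doc with the same id term.
        writer.updateDocument(new Term("id", id), doc);
        writer.commit();
    }

    /** Remove the document for a deleted title. */
    public void deleteTitle(String id) throws Exception {
        writer.deleteDocuments(new Term("id", id));
        writer.commit();
    }
}
```

The key point is that the index lives on disk between requests, and each database write triggers one small index update instead of a full rebuild per search.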

Nutch + Solr on top level page only

I've been trying to use Nutch to crawl over the first page of each domain in my urls file and then use Solr to make keywords in the crawled data searchable. So far I haven't been able to get this working, unless the pages are linked together.
I realize this is probably an issue of the pages having no incoming links, and therefore the PageRank algorithm discards the page content. I tried adjusting the parameters so that the default score is higher for urls not in the graph, but I'm still getting the same results.
Is there anything people know of that can build an index over pages with no incoming links?
Thanks!
Try a nutch inject command to insert the "no-incoming-link" URLs into the Nutch DB.
I guess that if you don't see anything in your Solr indexes, it is because no data for those URLs is stored in the Nutch DB (Nutch takes care of syncing its DB with the indexes). Having no data in the DB may be explained by the fact that the URLs are isolated, hence you can try the inject command to include those sites.
I would inspect the internal DB to verify Nutch's behavior, since Nutch stores data inside its DBs before inserting values into the indexes.
Assigning a higher score has no effect, since Lucene will give you a result as long as the data is in the index.
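For reference, in Nutch 1.x the inject step mentioned above looks roughly like this (paths are illustrative):

```
# Inject seed URLs from the urls/ directory into the crawl DB.
bin/nutch inject crawl/crawldb urls
```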
Solr now reads HTML files using Tika by default, so that's not a problem.
http://wiki.apache.org/solr/TikaEntityProcessor
If all you want is listed pages, is there a specific reason to use the Nutch crawler? Or could you just feed URLs to Solr and go from there?

"Should I use multiple indices in Solr?", and some other quick Q

Imagine a classifieds website, a very simple one where users don't have login details.
I currently have this with MySQL as the db. The db has several tables because of the categories, but one main table for the classified itself. A total of 7 tables in my case.
I want to use only Solr as a "db" because some people on SO think it would be better, and I agree, if it works that is.
Now, I have some quick questions about doing this:
Should I have multiple schema.xml files or config.xml files?
How do I query multiple indices?
How would this (having multiple indices) affect performance and do I need a more powerful machine (memory, cpu etc...) for managing this?
Would you eventually go with only Solr instead of what I planned to do, which is to use Solr to search and return ID numbers which I use to query and find the classifieds in MySql?
I have some 300,000 records today, and they probably won't be increasing.
I have not tested how the records would affect performance when using Solr with MySql, because I am still creating the website, but when using only MySql it is quite slow.
I am hoping it will be better with Solr + MySql, but as I said, if it is possible I will go with only Solr.
Thanks
4: If an item has status fields that get updated much more frequently than the rest of the record, then it's better to store that information in a database and retrieve it when you access the item. For example, if you stored your library book holdings in a Solr index, you would store the 'borrowed' status in a database. Updating Solr can take a fair bit of resources, and if you don't need to search on a field it doesn't really need to be in Solr.
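A hedged Java sketch of that split, using SolrJ for search and JDBC for the volatile status field (the books core, the books table, and the borrowed column are invented for illustration):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;

public class HoldingsSearch {
    public static void main(String[] args) throws Exception {
        // Slow-changing bibliographic data lives in Solr.
        HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/books").build();
        // The frequently updated 'borrowed' flag lives in the database.
        Connection db = DriverManager.getConnection(
                "jdbc:mysql://localhost/library", "user", "pass");

        for (SolrDocument doc : solr.query(new SolrQuery("title:lucene")).getResults()) {
            String id = (String) doc.getFieldValue("id");
            try (PreparedStatement ps =
                     db.prepareStatement("SELECT borrowed FROM books WHERE id = ?")) {
                ps.setString(1, id);
                try (ResultSet rs = ps.executeQuery()) {
                    boolean borrowed = rs.next() && rs.getBoolean("borrowed");
                    System.out.println(id + " borrowed=" + borrowed);
                }
            }
        }
        solr.close();
        db.close();
    }
}
```

This way a checkout only touches one database row, and Solr is reindexed only when the searchable bibliographic data actually changes.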
