How to read Lucene indexes from Solr - java

I have an existing web application which uses Lucene to create indexes. Now, as per the requirement, I have to set up Solr, which will serve as a search engine for many other web applications including my web app. I do not want to create indexes within Solr. Hence, I need to tell Solr to read the existing Lucene indexes instead of creating and reading its own.
As a Solr beginner, I first used Nutch to create indexes and then used those indexes within Solr. But I am unaware how to make Solr read indexes from Lucene. I did not find any documentation around this. Kindly advise how to achieve this.

It is not possible in any reliable way.
It's like saying you built an application in Ruby and now want to use Rails on top of the existing database structure. Solr (like Rails) has its own expectations about naming and workflows around the Lucene core, and there is no migration path.
You can try using Luke to confirm the differences in internal data structures between Lucene and Solr for yourself.
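If you would rather inspect an index programmatically than through Luke's UI, a rough sketch with plain Lucene (the Lucene 4.x API is assumed here, and the index path is a placeholder):
import java.io.File;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;

public class IndexPeek {
    public static void main(String[] args) throws Exception {
        // Open the existing Lucene index directory
        IndexReader reader = DirectoryReader.open(FSDirectory.open(new File("/path/to/lucene/index")));
        System.out.println("Documents in index: " + reader.numDocs());
        // Dump the stored fields of the first document to see what the index actually contains
        if (reader.maxDoc() > 0) {
            System.out.println(reader.document(0).getFields());
        }
        reader.close();
    }
}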

I have never done this before, but since Solr is built on Lucene, you can try these steps; dataDir is the main point here.
I am assuming you are deploying it in /usr/local (change accordingly) and that you have basic knowledge of Solr configuration.
Download Solr and copy dist/apache-solr-x.x.x.war to tomcat/webapps
Copy example/solr/conf to /usr/local/solr/
Set solr.home to /usr/local/solr
In solrconfig.xml, change dataDir to /usr/local/solr/data (Solr looks for the index directory inside it; see the sketch after these steps)
Change schema.xml accordingly, i.e. you need to define fields that match the ones in your existing Lucene index
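A minimal sketch of that solrconfig.xml change, assuming the paths above (note that Solr will look for the actual Lucene segment files in an index subdirectory under dataDir, i.e. /usr/local/solr/data/index):
<dataDir>/usr/local/solr/data</dataDir>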

Related

How can a Liferay server cluster use one shared Lucene index on a mounted volume?

I have a Liferay cluster (2 servers), and each Liferay bundle has its own Lucene files. I want to move these Lucene files onto a mounted volume, like EFS. Is there any way I can do this? I tried, but failed; the main reason is that a server locks the Lucene files while indexing, so the other server cannot access them.
When using a clustered environment, it is recommended not to use a plain file-based Lucene search index. Liferay instead recommends (see the Liferay Clustering documentation) using a pluggable enterprise search engine such as Solr or Elasticsearch. That page also has some advice on setting up such an environment.
As Liferay says:
Sharing a Search Index (not recommended unless you have a file locking-aware SAN)
That's why the best options are:
Use a pluggable engine like Solr or Elasticsearch (Elasticray or others).
Configure the Liferay cluster with one writer node and one reader node via the property index.read.only (false on the writer node, true on the reader node).
IMHO, I would try Elasticsearch for the indexes, because it is the engine used in the latest Liferay versions (7+), and plain Lucene is not as powerful as Elasticsearch, for example in terms of performance.

Importing nodes into Neo4j using batch importer with automatic indexing

I have imported nodes using jdbc importer but am unable to figure out auto_index support. How do I get auto indexing?
The tool you link to does give instructions for indexing, but I've never used it and it doesn't seem to be up to date. I would recommend you use one of the importing tools listed here. You can convert your comma separated file to tab separated and use this batch importer or one of the neo4j-shell tools, both of which support automatic indexing.
If you want to use a JDBC driver, for instance with some data transfer tool like Pentaho Kettle, there are instructions and links on the Neo4j import page, first link above.
I know from another question that you use regular expressions heavily, and it is possible that 'automatic index', which is a Lucene index, may be very good for that, since you can query the index with a regexp directly. But if you want to index your nodes by their labels (the new type of index in 2.0), then you don't need to set up indexing before importing. You can create an index at any time and it is populated in the background. If that's what you want, you can read the documentation about working with indices from the Java API and Cypher.
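For the 2.0-style label index mentioned above, a rough sketch from the embedded Java API (the store path, label and property name are placeholders; the Cypher equivalent would be CREATE INDEX ON :Person(name)):
import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class CreateLabelIndex {
    public static void main(String[] args) {
        // Open (or create) the embedded database; the path is a placeholder
        GraphDatabaseService db = new GraphDatabaseFactory().newEmbeddedDatabase("/path/to/graph.db");
        // Schema operations must run inside a transaction
        try (Transaction tx = db.beginTx()) {
            // Index the "name" property of nodes labelled "Person";
            // the index is populated in the background after creation
            db.schema().indexFor(DynamicLabel.label("Person")).on("name").create();
            tx.success();
        }
        db.shutdown();
    }
}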

Nutch + Solr on top level page only

I've been trying to use Nutch to crawl over just the first page of the domains in my urls file and then use Solr to make keywords in the crawled data searchable. So far I haven't been able to get this working unless the pages are linked together.
I realize this is probably an issue of the pages having no incoming links, and therefore the PageRank algorithm discards the page content. I tried adjusting the parameters so that the default score is higher for urls not in the graph, but I'm still getting the same results.
Is there anything people know of that can build an index over pages with no incoming links?
Thanks!
Try the nutch inject command (something like bin/nutch inject crawl/crawldb urls in Nutch 1.x) to insert the "no-incoming-link" URLs into the Nutch DB.
I guess that if you don't see anything in your Solr indexes, it is because no data for those URLs is stored in the Nutch DB (since Nutch takes care of syncing its DB with the indexes). Not having data in the DB may be explained by the fact that the URLs are isolated; hence you can try the inject command to include those sites.
I would actually look at the internal DB to verify Nutch's behavior, since Nutch stores data inside its DBs before inserting values into the indexes.
Assigning a higher score has no effect, since Lucene will return a result as long as the data is in the index.
Solr now reads HTML files using Tika by default, so that's not a problem.
http://wiki.apache.org/solr/TikaEntityProcessor
If all you want is listed pages, is there a specific reason to use the Nutch crawler? Or could you just feed URLs to Solr and go from there?
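If you do go the direct route, a rough SolrJ sketch of posting a page straight to Solr might look like this (the core URL and the id/url/content fields are assumptions and would have to match your schema):
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class FeedUrlToSolr {
    public static void main(String[] args) throws Exception {
        // URL of the Solr core; adjust to your installation
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        // One document per page; "id", "url" and "content" are assumed schema fields
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "http://example.com/");
        doc.addField("url", "http://example.com/");
        doc.addField("content", "text extracted from the page");

        solr.add(doc);
        solr.commit();   // make the new document searchable
        solr.shutdown();
    }
}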

Relation of SOLR to DB to App in a Text Search Engine

I recently overheard a few coworkers talking about an article one of them had read involving the use of SOLR in conjunction with a database and an app to provide a "super-charged" text search engine for the app itself. From what I could make out, SOLR is a web service that exposes Lucene's text searching capabilities to a web-enabled app.
I wasn't able to find the article they were talking about, but a few relevant Google searches turn up several very abstract articles on text search engines using SOLR.
What I'm wondering is: what's the relationship between all 3 components here?
Who calls who? Does Lucene somehow regularly extract and cache text data from the DB, and then the app queries SOLR for Lucene's text content? What's a typical software stack/setup for a Java-based, SOLR-powered text search engine? Thanks in advance!
You're right in your basic outline here: SOLR is a webservice and syntax helper that sits on top of Lucene.
Essentially, SOLR is configured to index specific data based on a number of configuration options (weighting, string manipulation, etc.). SOLR can either be pointed at a DB as its source of data to index, or individual documents (e.g., XML files) can be submitted via the web API for indexing.
A web application would typically make an HTTP(s) request to the SOLR API, and SOLR would return indexed data that matches the query. For all intents and purposes, the web app sees SOLR as an HTTP API; it doesn't need to be aware of Lucene in any way. So essentially, the data flow looks like:
Website --> SOLR API --> indexed datasource (DB or document collection)
In terms of "when" SOLR looks at the DB to index new or updated data, this can be configured in a number of ways, but is most typically triggered by calling a specific function of the SOLR API that causes a reindex. This could occur manually, via a scheduled job, programmatically from the web app, etc.
This is what I understood when I started implementing it for my project:
SOLR can be termed as a middleman between your application server and the DB. SOLR runs as its own server (Jetty) which is up and listening for any request coming from your app server.
Your application server calls SOLR, giving it the module name and the search pattern.
SOLR is fed some XML config files which tell it which table of your schema has to be cached (or indexed) for the given module name.
SOLR uses Lucene's text search capabilities to understand the "search pattern" and get the desired result from the already cached/indexed data.
SOLR indexing (full or partial) can be done manually (by executing commands through GET URLs) or at regular intervals using the SOLR config files.
You can refer to the Apache SOLR site for more information.
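For example, if the DataImportHandler is configured to pull from your DB, the "commands through GET URLs" mentioned above are typically plain HTTP calls along these lines (host, port and core name are assumptions):
http://localhost:8983/solr/collection1/dataimport?command=full-import (full reindex of the configured data source)
http://localhost:8983/solr/collection1/dataimport?command=delta-import (incremental import of changed rows only)
http://localhost:8983/solr/collection1/dataimport?command=status (check the progress of a running import)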

Questions about SOLR documents and some more

Website: Classifieds website (users may put ads, search ads etc)
I plan to use SOLR for searching and have it return results as ID numbers only, then use those ID numbers to query MySQL, and lastly display the results for those IDs.
Currently I have around 30 tables in MySQL, one for each category.
1- Do you think I should do it differently than above?
2- Should I use only one SOLR document, or multiple documents? Also, is document the same as a SOLR index?
3- Would it be better to only use SOLR and skip MySQL, knowing that I have a lot of columns in each table? Personally I am much better at using MySQL than SOLR.
4- Say the user wants to search for cars in a specific region, how is this type of querying performed/done in SOLR? Ex: q=cars&region=washington possible?
You may think there is a lot of info about SOLR out there, but there isn't, and especially not about using PHP with SOLR and a SOLR PHP client... Maybe I will write something when I have learned all this... Or maybe one of you could write something up!
Thanks again for all help...
First, the definitions: a Solr/Lucene document is roughly the equivalent of a database row. An index is roughly the same as a database table.
I recommend trying to store all the classified-related information in Solr. Querying Solr and then the database is inefficient and very likely unnecessary.
Querying in a specific region would be something like q=cars+region:washington assuming you have a region field in Solr.
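For illustration, the raw request could look roughly like one of these (host, port and the region field name are assumptions):
http://localhost:8983/solr/select?q=cars+AND+region:washington
http://localhost:8983/solr/select?q=cars&fq=region:washington
The second form uses fq, a filter query, which restricts the results without affecting relevance scoring.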
The Solr wiki has tons of good information and a pretty good basic tutorial. Of course this can always be improved, so if you find anything that isn't clear please let the Solr team know about it.
I can't comment on the PHP client since I don't use PHP.
Solr is going to return its results in a syntax easily parsable using SimpleXML. You could also use the SolPHP client library: http://wiki.apache.org/solr/SolPHP.
Solr is really quite efficient. I suggest putting as much data into your Solr index as necessary to retrieve everything in one hit from Solr. This could mean much less database traffic for you.
If you've installed the example Solr application (comes with Jetty), then you can develop Solr queries using the admin interface. The URI of the result is pretty much what you'd be constructing in PHP.
The most difficult part when beginning with Solr is getting the solrconfig.xml and the schema.xml files correct. I suggest starting with a very basic config and restarting your web app each time you add a field. Starting off with the whole schema.xml can be confusing.
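As a rough sketch, a very basic classifieds schema.xml might start with fields along these lines (the names and types here are assumptions; Solr's example schema already defines common types such as string and text):
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="category" type="string" indexed="true" stored="true"/>
<field name="region" type="string" indexed="true" stored="true"/>
<field name="title" type="text" indexed="true" stored="true"/>
<field name="description" type="text" indexed="true" stored="true"/>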
2- Should I use only one SOLR document, or multiple documents? Also, is document the same as a SOLR index?
3- Would it be better to Only use SOLR and skip MySQL knowing that I have alot of columns in each table? Personally I am much better at using MySQL than SOLR.
A document is "an instance" of solr index. Take into account that you can build only one solr index per solr Core. A core acts as an independent solr Server into the same solr insallation.
http://wiki.apache.org/solr/CoreAdmin
Yo can build one index merging some table contents and some other indexes to perform second level searches...
would you give more details about your architecture and data??
As suggested by others, you can store and index your MySQL data and run queries against the Solr index, thus making MySQL unnecessary to use.
You don't need to store and index only the IDs, query Solr for those IDs, and then run a MySQL query to get the additional data for each ID. You can store the other data corresponding to the IDs in Solr itself.
Regarding the Solr PHP client: you don't need to use it; it is recommended to directly use the REST-like Solr web API. You can use a PHP function like file_get_contents("http://IP:port/solr/#/core/select?q=query&start=0&rows=100&wt=json") or use curl with PHP if you need to. Both ways are almost the same and efficient. This will return the data in JSON because of wt=json. Then use the PHP function json_decode($returned_data) to get that data as an object.
If you need to ask anything just reply.
