I recently overheard a few coworkers talking about an article one of them had read involving the use of SOLR in conjunction with a database and an app to provide a "super-charged" text search engine for the app itself. From what I could make out, SOLR is a web service that exposes Lucene's text searching capabilities to a web-enabled app.
I wasn't able to find the article they were talking about, but a few relevant Google searches turn up several very abstract articles on text search engines using SOLR.
What I'm wondering is: what's the relationship between all 3 components here?
Who calls who? Does Lucene somehow regularly extract and cache text data from the DB, and then the app queries SOLR for Lucene's text content? What's a typical software stack/setup for a Java-based, SOLR-powered text search engine? Thanks in advance!
You're right in your basic outline here: SOLR is a webservice and syntax helper that sits on top of Lucene.
Essentially, SOLR is configured to index specific data based on a number of configuration options (including weighting, string manipulation, etc.). SOLR can either be pointed at a DB as its source of data to index, or individual documents (e.g., XML files) can be submitted via the web API for indexing.
A web application would typically make an HTTP(s) request to the SOLR API, and SOLR would return indexed data that matches the query. For all intents and purposes, the web app sees SOLR as an HTTP API; it doesn't need to be aware of Lucene in any way. So essentially, the data flow looks like:
Website --> SOLR API --> indexed datasource (DB or document collection)
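For a Java-based stack like the one asked about, that HTTP call is usually made through the SolrJ client rather than by hand. A minimal sketch (the Solr URL, core name, and queried field are assumptions, and the builder API shown is from newer SolrJ versions):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SolrSearchExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical core URL; point this at your own SOLR core.
        HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();

        SolrQuery query = new SolrQuery("title:lucene"); // hypothetical indexed field
        query.setRows(10);

        QueryResponse response = solr.query(query);
        for (SolrDocument doc : response.getResults()) {
            System.out.println(doc.getFieldValue("id"));
        }
        solr.close();
    }
}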
In terms of "when" SOLR looks at the DB to index new or updated data, this can be configured in a number of ways, but is most typically triggered by calling a specific function of the SOLR API that causes a reindex. This could occur manually, via a scheduled job, programmatically from the web app, etc.
This is what I understood when I started implementing it for my project:

SOLR can be seen as a middleman between your application server and the DB. SOLR runs in its own server (Jetty by default), which is up and listening for any request coming from your app server.

Your application server calls SOLR, giving it the module name and the search pattern.

SOLR is fed some XML config files which tell it which table of your schema has to be cached (or indexed) for the given module name (see the data-config.xml sketch below).

SOLR uses Lucene's text search capabilities to interpret the "search pattern" and get the desired result from the already cached/indexed data.

SOLR indexing (full or partial) can be done manually (by executing commands through GET URLs) or at regular intervals using the SOLR config files.

You can refer to the Apache SOLR site for more information.
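As a rough illustration of those XML config files: when the DataImportHandler is used, a data-config.xml maps a DB table to index fields, something like the sketch below (the driver, connection details, table, and field names are all hypothetical).

<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/mydb" user="dbuser" password="dbpass"/>
  <document>
    <entity name="item" query="SELECT id, title, description FROM items">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
      <field column="description" name="description"/>
    </entity>
  </document>
</dataConfig>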
I want to port a social networking application from SQL to JanusGraph. I'll be building the backend using Java because JanusGraph has amazing documentation on its official website. I have some beginner questions.
JanusGraph graph = JanusGraphFactory.open("my_setup.properties");
Is the .properties file the only identifier to access a graph, or is it the file path? (In SQL we have a name for a database. Is there anything like a graph name?)

If I have a copy of the properties file with the same preferences and rename it to my_setup_2.properties, will it access the same graph or will it create a new graph?
Is there any way I can identify, from my storage backend or search backend, which vertices belong to this graph?
For what kind of queries is the storage backend used, and for what kind of queries is the search backend used?
Is there any way to dump my database (for porting the graph from one server to another, just like an SQL dump)?
I have only found hosting service providers for JanusGraph 0.1.1, which is outdated (the latest is 0.2.1, which supports the latest Elasticsearch). If I go to production with JanusGraph 0.1.1, how badly will it affect me if I use Elasticsearch as the search backend?
Is the .properties file the only identifier to access a graph, or is it the file path? (In SQL we have a name for a database. Is there anything like a graph name?)
JanusGraph has pluggable storage and index backends. The .properties file just tells JanusGraph which backends to use and how they are configured. Different graph instances simply point to different storage folders, indexes, etc. Looking at the documentation for the config file, it seems you can specify a graph name which can be used with the ConfiguredGraphFactory to open a graph like this: ConfiguredGraphFactory.open("graphName")
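A minimal sketch of the two ways to open a graph (the property values and graph name below are assumptions, and ConfiguredGraphFactory is normally used via Gremlin Server with a ConfigurationManagementGraph set up):

import org.janusgraph.core.ConfiguredGraphFactory;
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;

// my_setup.properties identifies the graph only through the backend settings it
// contains, e.g. (hypothetical values):
//   storage.backend=cql
//   storage.hostname=127.0.0.1
//   graph.graphname=socialgraph
JanusGraph byFile = JanusGraphFactory.open("my_setup.properties");

// With ConfiguredGraphFactory (JanusGraph 0.2+), the logical graph name can be used instead:
JanusGraph byName = ConfiguredGraphFactory.open("socialgraph");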
If I have a copy of the properties file with the same preferences and rename it to my_setup_2.properties, will it access the same graph or will it create a new graph?
Yes, it will access the same data and hence the same graph.
Is there any way I can identify, from my storage backend or search backend, which vertices belong to this graph?
I don't know exactly for every storage backend but in the case of Elasticsearch, indexes created by JanusGraph are prefixed with janusgraph. I think there are similar mechanisms for other backends.
For what kind of queries is the storage backend used, and for what kind of queries is the search backend used?
The index backend is used whenever you add a has() step on a property indexed with a mixed index. I think all other queries, including a has() step on a property configured with a composite index, will use the storage backend. For OLAP workloads you can even plug Spark or Giraph into your storage backend to do the heavy lifting.
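A short Gremlin/Java sketch of the difference (the property names are hypothetical; it assumes "name" is covered by a composite index and "description" by a mixed index):

import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.janusgraph.core.attribute.Text;

GraphTraversalSource g = graph.traversal();

// Exact-match has() on a composite-indexed property: answered by the storage backend.
g.V().has("name", "alice").toList();

// Full-text predicate on a mixed-indexed property: pushed down to the index (search) backend.
g.V().has("description", Text.textContains("hiking")).toList();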
Is there any way to dump my database (for porting the graph from one server to another, just like an SQL dump)?
Graphs can be exported and imported to graph file formats like GraphML. It allows you to interface with other graph tools like Gephi for example. You won't be able to sql dump from your SQL database and directly import that to JanusGraph though. If you consider loading a lot of nodes and edges at once, please go through the documentation about bulk loading.
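For example, with the TinkerPop IO API (the file path is hypothetical), an export/import looks roughly like this:

import org.apache.tinkerpop.gremlin.structure.io.IoCore;

// On the source server:
graph.io(IoCore.graphml()).writeGraph("/tmp/my-graph.graphml");

// On the target server, into a freshly created graph:
graph.io(IoCore.graphml()).readGraph("/tmp/my-graph.graphml");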
I have only found hosting service providers for JanusGraph 0.1.1, which is outdated (the latest is 0.2.1, which supports the latest Elasticsearch). If I go to production with JanusGraph 0.1.1, how badly will it affect me if I use Elasticsearch as the search backend?
I don't know of any hosting providers for JanusGraph 0.2.x, but you will easily find hosted services for the pluggable storage backends compatible with JanusGraph 0.2.x.
I have a Liferay cluster (2 servers), and each Liferay bundle has its own Lucene files. I want to move these Lucene files onto a mounted volume, like EFS. Is there any way I can do this? I tried, but failed; the main reason is that the server locks the Lucene files while indexing, so the other server cannot access them.
When using a clustered environment, it is recommended not to use a plain file-based Lucene search index. Liferay instead recommends (Liferay Clustering) using a pluggable enterprise search engine such as SOLR or Elasticsearch. There is also some helpful advice on that page for setting up such an environment.
As Liferay says:
Sharing a Search Index (not recommended unless you have a file locking-aware SAN)
That's why the best options are:

Use a pluggable engine like SOLR or Elasticsearch (Elasticray or others).

Configure the Liferay cluster with one writer node and one reader node via the index.read.only property: leave index.read.only=false (the default) on the writer node and set index.read.only=true on the reader nodes.
IMHO, I would try to use Elasticsearch for the indexes, because it is the engine used in the latest Liferay versions (7+), and the embedded Lucene setup is not as powerful as Elasticsearch, for example in terms of performance and scaling across nodes.
I have an existing web application which uses Lucene to create indexes. Now, as per the requirement, I have to set up Solr, which will serve as a search engine for many other web applications including my web app. I do not want to create indexes within Solr. Hence, I need to tell Solr to read the indexes from Lucene instead of creating its own indexes and reading from those.
As a beginner with Solr, I first used Nutch to create indexes and then used those indexes within Solr. But I'm unaware how to make Solr read indexes from Lucene. I did not find any documentation around this. Kindly advise how to achieve this.
It is not possible in any reliable way.
It's like saying you built an application in Ruby and now want to use Rails on top of the existing database structure. Solr (like Rails) has its own expectations about naming and workflows around the Lucene core, and there is no migration path.
You can try using Luke to confirm the internal data structures differences between Lucene and Solr for yourself.
I have never done this before, but since Solr is built on Lucene, you can try these steps. dataDir is the main point here.

I am assuming you are deploying it in /usr/local (so change accordingly) and have basic knowledge of Solr configuration.

Download Solr and copy dist/apache-solr-x.x.x.war to tomcat/webapps.

Copy example/solr/conf to /usr/local/solr/.

Set solr.home to /usr/local/solr.

In solrconfig.xml, change dataDir to /usr/local/solr/data (Solr looks for the index directory inside); see the snippet after these steps.

Change schema.xml accordingly, i.e. you need to change the fields to match the ones in your existing index.
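For reference, the relevant solrconfig.xml element (using the hypothetical path from the steps above) is:

<dataDir>/usr/local/solr/data</dataDir>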
I've been trying to use Nutch to crawl over the first page of the domains in my urls file and then use Solr to make keywords in the crawled data searchable. So far I haven't been able to get anything working this way, unless the two pages are linked together.
I realize this is probably an issue of the pages having no incoming links, and therefore the PageRank algorithm discards the page content. I tried adjusting the parameters so that the default score is higher for urls not in the graph, but I'm still getting the same results.
Is there anything people know of that can build an index over pages with no incoming links?
Thanks!
Try a nutch inject command to insert the "no-incoming-link" URLs into the Nutch DB.
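Assuming a Nutch 1.x layout where urls/ holds your seed list and crawl/crawldb is your crawl DB (both paths hypothetical), that looks something like:

bin/nutch inject crawl/crawldb urls/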
I guess that if you don't see anything in your solr indexes, it is because no data for those URLs is stored in the nutch DB (since nutch will take care to sync its DB with the indexes). Not having data in the DB may be explained by the fact that the URLs are isolated, hence you can try the inject command to include those sites.
I would try to actually see the internal DB to verify the nutch behavior, since before inserting values in the indexes, nutch stores data inside its DBs.
Assigning a higher score has no effect, since lucene will give you a result as long as the data is in the index.
Solr now reads HTML files using Tika by default, so that's not a problem.
http://wiki.apache.org/solr/TikaEntityProcessor
If all you want is listed pages, is there a specific reason to use the Nutch crawler? Or could you just feed URLs to Solr and go from there?
Website: Classifieds website (users may put ads, search ads etc)
I plan to use SOLR for searching and then return results as IDs only, then use those IDs to query MySQL, and finally display the results for those IDs.
Currently I have around 30 tables in MySQL, one for each category.
1- Do you think I should do it differently than above?
2- Should I use only one SOLR document, or multiple documents? Also, is document the same as a SOLR index?
3- Would it be better to only use SOLR and skip MySQL, knowing that I have a lot of columns in each table? Personally I am much better at using MySQL than SOLR.
4- Say the user wants to search for cars in a specific region, how is this type of querying performed/done in SOLR? Ex: q=cars&region=washington possible?
You may think there is a lot of info about SOLR out there, but there isn't, and especially not about using PHP with SOLR and a SOLR PHP client... Maybe I will write something when I have learned all this... Or maybe one of you could write something up!
Thanks again for all help...
First, the definitions: a Solr/Lucene document is roughly the equivalent of a database row. An index is roughly the same as a database table.
I recommend trying to store all the classified-related information in Solr. Querying Solr and then the database is inefficient and very likely unnecessary.
Querying in a specific region would be something like q=cars+region:washington assuming you have a region field in Solr.
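For instance (host, port, and field names are placeholders), the raw request could look like either of the following; the fq form keeps the region restriction as a separately cached filter query:

http://localhost:8983/solr/select?q=cars+AND+region:washington&wt=json
http://localhost:8983/solr/select?q=cars&fq=region:washington&wt=json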
The Solr wiki has tons of good information and a pretty good basic tutorial. Of course this can always be improved, so if you find anything that isn't clear please let the Solr team know about it.
I can't comment on the PHP client since I don't use PHP.
Solr is going to return its results in a syntax easily parsable using SimpleXML. You could also use the SolPHP client library: http://wiki.apache.org/solr/SolPHP.
Solr is really quite efficient. I suggest putting as much data into your Solr index as necessary to retrieve everything in one hit from Solr. This could mean much less database traffic for you.
If you've installed the example Solr application (comes with Jetty), then you can develop Solr queries using the admin interface. The URI of the result is pretty much what you'd be constructing in PHP.
The most difficult part when beginning with Solr is getting the solrconfig.xml and the schema.xml files correct. I suggest starting with a very basic config, and restart your web app each time you add a field. Starting off with the whole schema.xml can be confusing.
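As a sketch of such a minimal start (field names and types here are assumptions; the exact types available depend on your schema version), the fields section of schema.xml might begin with just:

<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="title" type="text" indexed="true" stored="true"/>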
2- Should I use only one SOLR document, or multiple documents? Also, is a document the same as a SOLR index?

3- Would it be better to only use SOLR and skip MySQL, knowing that I have a lot of columns in each table? Personally I am much better at using MySQL than SOLR.
A document is "an instance" of solr index. Take into account that you can build only one solr index per solr Core. A core acts as an independent solr Server into the same solr insallation.
http://wiki.apache.org/solr/CoreAdmin
You can build one index merging several tables' contents, and other indexes to perform second-level searches...

Could you give more details about your architecture and data?
As suggested by others, you can store and index your MySQL data and run queries against the Solr index, making MySQL unnecessary for search.
You don't need to store and index only IDs, get IDs back from the query, and then run a MySQL query to fetch additional data for those IDs. You can store the other data corresponding to the IDs in Solr itself.
Regarding the Solr PHP client: you don't need to use one; it is simpler to directly use the REST-like Solr web API. You can use a PHP function like file_get_contents("http://IP:port/solr/core/select?q=query&start=0&rows=100&wt=json") or use curl with PHP if you need to. Both ways are almost the same and efficient. This will return data as JSON because of wt=json. Then use the PHP function json_decode($returned_data) to get that data as an object.
If you need to ask anything just reply.