Nutch + Solr on top level page only - java

I've been trying to use Nutch to crawl over just the first page of the domains in my urls file and then use Solr to make keywords in the crawled data searchable. So far I haven't been able to get anything working this way, unless the two pages are linked together.
I realize this is probably an issue of the pages having no incoming links, so the PageRank algorithm discards the page content. I tried adjusting the parameters so that the default score is higher for URLs not in the graph, but I'm still getting the same results.
Is there anything people know of that can build an index over pages with no incoming links?
Thanks!

Try a nutch inject command to insert the "no-incoming-link" URLs into the Nutch DB.
I guess that if you don't see anything in your Solr indexes, it is because no data for those URLs is stored in the Nutch DB (since Nutch takes care of syncing its DB with the indexes). Not having data in the DB may be explained by the fact that the URLs are isolated, so you can try the inject command to include those sites.
I would try to actually look at the internal DB to verify Nutch's behavior, since before inserting values into the indexes, Nutch stores data inside its DBs.
Assigning a higher score has no effect, since Lucene will give you a result as long as the data is in the index.
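For reference, injecting seeds in Nutch 1.x is a single command; the crawldb path and seed directory below are just the usual example names, so substitute your own:
bin/nutch inject crawl/crawldb urls
After the inject, the usual generate/fetch/parse/updatedb cycle plus a solrindex run is still needed before anything shows up in Solr.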

Solr now reads HTML files using Tika by default, so that's not a problem.
http://wiki.apache.org/solr/TikaEntityProcessor
If all you want is the pages listed in your urls file, is there a specific reason to use the Nutch crawler? Or could you just feed the URLs to Solr and go from there?
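If you do go the route of feeding pages to Solr yourself, a minimal SolrJ sketch could look like the one below. It assumes the ExtractingRequestHandler (Solr Cell) is mapped to /update/extract in solrconfig.xml; the core name, document id and file name are made up for illustration.
import java.io.File;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class HtmlToSolr {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File("landing-page.html"), "text/html"); // Tika extracts the text
        req.setParam("literal.id", "landing-page-1");            // supply our own unique id
        req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
        solr.request(req);
        solr.close();
    }
}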

Related

Batch indexing to solr

I have a Java class that sends HTTP POST requests to a Solr instance to index JSON files. It is implemented in a multithreaded manner. However, I have realized that sending so many HTTP requests (close to 20,000) is causing the network to be a bottleneck. I read online that I can do batch indexing, but I can't find any clear examples. Is there any advice on how to batch index in Solr?
Thank you.
For generic JSON, you must have a configuration somewhere in solrconfig.xml that defines how it is treated.
One of the parameters is split. You might be able to use it to combine your JSON documents into one bigger one that Solr would split and process separately. Notice that the specific format may be a little different for different Solr versions. Get the correct version of the downloadable reference guide PDF if something is not working.
Or, if you can generate it, use the JSON format Solr understands directly, which has full support for multiple documents.
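If switching from raw HTTP posts to SolrJ is an option, batching is built in: collect documents in memory and send them in groups, committing once at the end. A rough sketch, with the core name, field names and batch size chosen arbitrarily:
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();
        List<SolrInputDocument> batch = new ArrayList<>();
        for (int i = 0; i < 20000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            doc.addField("title_s", "Document " + i);
            batch.add(doc);
            if (batch.size() == 1000) {   // one HTTP request per 1000 documents
                solr.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            solr.add(batch);              // send the remainder
        }
        solr.commit();                    // a single commit at the end, not per document
        solr.close();
    }
}
This turns roughly 20,000 requests into about 20, which usually removes the network as the bottleneck.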

How to read Lucene indexes from Solr

I have an existing web application which uses Lucene to create indexes. Now, as per the requirement, I have to set up Solr, which will serve as a search engine for many other web applications including my web app. I do not want to create indexes within Solr. Hence, I need to tell Solr to read the indexes from Lucene instead of creating its own and reading from those.
As a beginner with Solr, I first used Nutch to create indexes and then used those indexes within Solr. But I'm unaware how to make Solr read existing Lucene indexes. I did not find any documentation around this. Kindly advise how to achieve this.
It is not possible in any reliable way.
It's like saying you built an application in Ruby and now want to use Rails on top of the existing database structure. Solr (like Rails) has its own expectations about naming and workflows around the Lucene core, and there is no migration path.
You can try using Luke to confirm the internal data structure differences between Lucene and Solr for yourself.
I have never done that before, but as Solr is built on Lucene, you can try these steps. dataDir is the main point here.
I am assuming you are deploying it in /usr/local (change accordingly) and have basic knowledge of Solr configuration.
Download Solr and copy dist/apache-solr-x.x.x.war to tomcat/webapps
Copy example/solr/conf to /usr/local/solr/
Set solr.home to /usr/local/solr
In solrconfig.xml, change dataDir to /usr/local/solr/data (Solr looks for the index directory inside)
Change schema.xml accordingly, i.e. you need to define the fields to match the existing index (see the sketch below)
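As a rough illustration of the last two steps (the exact field names and types depend on your Solr version and on how the existing Lucene application built the index):
<!-- solrconfig.xml -->
<dataDir>/usr/local/solr/data</dataDir>

<!-- schema.xml: fields must line up with what the Lucene application wrote -->
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="title" type="text_general" indexed="true" stored="true"/>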

Relation of SOLR to DB to App in a Text Search Engine

I recently overheard a few coworkers talking about an article one of them had read involving the use of SOLR in conjunction with a database and an app to provide a "super-charged" text search engine for the app itself. From what I could make out, SOLR is a web service that exposes Lucene's text searching capabilities to a web-enabled app.
I wasn't able to find the article they were talking about, but a few relevant Google searches turn up several very abstract articles on text search engines using SOLR.
What I'm wondering is: what's the relationship between all 3 components here?
Who calls who? Does Lucene somehow regularly extract and cache text data from the DB, and then the app queries SOLR for Lucene's text content? What's a typical software stack/setup for a Java-based, SOLR-powered text search engine? Thanks in advance!
You're right in your basic outline here: SOLR is a webservice and syntax helper that sits on top of Lucene.
Essentially, SOLR is configured to index specific data based on a number of configuration options (that include weighting, string manipulation, etc.). SOLR can either be pointed at a DB as its source of data to index, or individual documents (e.g., XML files) can be submitted via the web API for indexing.
A web application would typically make an HTTP(s) request to the SOLR API, and SOLR would return indexed data that matches the query. For all intents and purposes, the web app sees SOLR as an HTTP API; it doesn't need to be aware of Lucene in any way. So essentially, the data flow looks like:
Website --> SOLR API --> indexed datasource (DB or document collection)
In terms of "when" SOLR looks at the DB to index new or updated data, this can be configured in a number of ways, but is most typically triggered by calling a specific function of the SOLR API that causes a reindex. This could occur manually, via a scheduled job, programmatically from the web app, etc.
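To make the Website --> SOLR API hop concrete, here is a minimal SolrJ sketch of a web app querying SOLR over HTTP; the base URL, core name and field names are invented for illustration:
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SearchExample {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/products").build();
        SolrQuery query = new SolrQuery("description:widget");  // Lucene query syntax
        query.setRows(10);
        QueryResponse response = solr.query(query);
        for (SolrDocument doc : response.getResults()) {
            System.out.println(doc.getFieldValue("id") + " -> " + doc.getFieldValue("name"));
        }
        solr.close();
    }
}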
This is what I understood when I started implementing it for my project:
SOLR can be thought of as a middleman between your application server and the DB. SOLR runs in its own server (Jetty), which is up and listening for any request coming from your app server.
Your application server calls SOLR, giving it the module name and the search pattern.
SOLR is fed some XML config files which tell it which table of your schema has to be cached (or indexed) for the given module name.
SOLR uses Lucene's text search capabilities to interpret the "search pattern" and get the desired result from the already cached/indexed data.
SOLR indexing (full or partial) can be done manually (by executing commands through GET URLs, see the example URLs below) or at regular intervals using the SOLR config files.
You can refer to the Apache SOLR site for more information.
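For example, if the DataImportHandler is registered at /dataimport in solrconfig.xml, the GET URLs that trigger a full or an incremental (delta) reindex look roughly like this (host, port and core name are placeholders):
http://localhost:8983/solr/mycore/dataimport?command=full-import
http://localhost:8983/solr/mycore/dataimport?command=delta-import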

Ideal place to store Binary data that can be rendered by calling a url

I am looking for an ideal (performant and maintainable) place to store binary data. In my case these are images. I have to do some image processing, scale the images, and store them in a suitable place which can be accessed via a RESTful service.
From my research so far I have a few options, like:
A NoSQL solution like MongoDB's GridFS
Storing the files in a file system in a directory hierarchy and then using a web server to access the images by URL
Apache Jackrabbit document repository
Storing in a cache, something like Memcached or a Squid proxy
Any thoughts on which one you would pick and why would be useful, or is there a better way to do it?
Just started using GridFS to do exactly what you described.
From my experience thus far, the main advantage to GridFS is that it obviates the need for a separate file storage system. Our entire persistence layer is already put into Mongo, and so the next logical step would be to store our filesystem there as well. The flat namespacing just rocks and allows you a rich query language to fetch your files based off whatever metadata you want to attach to them. In our app we used an 'appdata' object that embedded all the ownership information.
Another thing to consider with NoSQL file storage, and especially GridFS, is that it will shard and expand along with your other data. If you've got your entire DB key-value store inside the mongo server, then eventually if you ever have to expand your server cluster with more machines, your filesystem will grow along with it.
It can feel a little 'black box' since the binary data itself is split into chunks, a prospect that frightens those used to a classic directory based filesystem. This is alleviated with the help of admin programs like RockMongo.
All in all, storing images in GridFS is about as easy as inserting the documents themselves; most of the drivers for the major languages handle everything for you. In our environment we took image uploads at one endpoint and used PIL to perform resizing. The images were then fetched from Mongo at another endpoint that just output the data with a JPEG mimetype.
Best of luck!
EDIT:
To give you an example of a trivial file upload with GridFS, here's the simplest approach in PyMongo, the python library.
from pymongo import Connection
import gridfs
binary_data = 'Hello, world!'
db = Connection().test_db
fs = gridfs.GridFS(db)
#the filename kwarg sets the filename in the mongo doc, but you can pass anything in
#and make custom key-values too.
file_id = fs.put(binary_data, filename='helloworld.txt',anykey="foo")
output = fs.get(file_id).read()
print output
>>>Hello, world!
You can also query against your custom values if you like, which can be REALLY useful if you want your queries to be based off custom information relative to your application.
try:
    # query by the custom key we attached when the file was inserted
    stored = fs.get_last_version(anykey='foo')
    data = stored.read()
except gridfs.errors.NoFile:
    data = None
These are just some simple examples, and the drivers for a lot of the other languages (PHP, Ruby, etc.) all have equivalents.
I would go for Jackrabbit in combination with its REST framework Sling: http://sling.apache.org
Sling allows you to upload/download files via REST calls or WebDAV, while the underlying Jackrabbit repository gives you performant storage with the possibility to store your files in a tree structure (or flat, if you like).
Both Jackrabbit and Sling support an event mechanism where you can asynchronously process the image after upload, e.g. to create thumbnails.
The manual at http://sling.apache.org/site/manipulating-content-the-slingpostservlet-servletspost.html describes how to manipulate data using the REST interface provided by Sling.
Storing the images as BLOBs in an RDBMS is another option: you immediately get some guarantees about integrity, security, etc. (if the database is set up properly), can store extra metadata, and can manage the collection with SQL.

Questions about SOLR documents and some more

Website: Classifieds website (users may put ads, search ads etc)
I plan to use SOLR for searching and then return results as IDs only, then use those IDs to query MySQL, and lastly display the results for those IDs.
Currently I have around 30 tables in MySQL, one for each category.
1- Do you think I should do it differently than above?
2- Should I use only one SOLR document, or multiple documents? Also, is a document the same as a SOLR index?
3- Would it be better to only use SOLR and skip MySQL, knowing that I have a lot of columns in each table? Personally I am much better at using MySQL than SOLR.
4- Say the user wants to search for cars in a specific region, how is this type of querying performed/done in SOLR? Ex: q=cars&region=washington possible?
You may think there is a lot of info about SOLR out there, but there isn't, and especially not about using PHP with SOLR and a SOLR PHP client... Maybe I will write something when I have learned all this... Or maybe one of you could write something up!
Thanks again for all help...
First, the definitions: a Solr/Lucene document is roughly the equivalent of a database row. An index is roughly the same as a database table.
I recommend trying to store all the classified-related information in Solr. Querying Solr and then the database is inefficient and very likely unnecessary.
Querying in a specific region would be something like q=cars+region:washington assuming you have a region field in Solr.
The Solr wiki has tons of good information and a pretty good basic tutorial. Of course this can always be improved, so if you find anything that isn't clear please let the Solr team know about it.
I can't comment on the PHP client since I don't use PHP.
Solr is going to return its results in a syntax easily parsable using SimpleXML. You could also use the SolPHP client library: http://wiki.apache.org/solr/SolPHP.
Solr is really quite efficient. I suggest putting as much data into your Solr index as necessary to retrieve everything in one hit from Solr. This could mean much less database traffic for you.
If you've installed the example Solr application (comes with Jetty), then you can develop Solr queries using the admin interface. The URI of the result is pretty much what you'd be constructing in PHP.
The most difficult part when beginning with Solr is getting the solrconfig.xml and the schema.xml files correct. I suggest starting with a very basic config, and restart your web app each time you add a field. Starting off with the whole schema.xml can be confusing.
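For instance, the kind of select URL you would work out in the admin interface and then reproduce from PHP might look like the following; the core name and fields are made up, and wt=json asks Solr for JSON output:
http://localhost:8983/solr/classifieds/select?q=cars+AND+region:washington&fl=id&rows=20&wt=json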
2- Should I use only one SOLR document, or multiple documents? Also, is a document the same as a SOLR index?
3- Would it be better to only use SOLR and skip MySQL, knowing that I have a lot of columns in each table? Personally I am much better at using MySQL than SOLR.
A document is "an instance" of the Solr index. Take into account that you can build only one Solr index per Solr core. A core acts as an independent Solr server inside the same Solr installation.
http://wiki.apache.org/solr/CoreAdmin
You can build one index merging some table contents, and some other indexes to perform second-level searches...
Would you give more details about your architecture and data?
As suggested by others, you can store and index your MySQL data in Solr and run your queries against the Solr index, making MySQL unnecessary for search.
You don't need to store and index just IDs, query Solr for those IDs, and then run a MySQL query to get the additional data for each ID. You can store the other data corresponding to the IDs in Solr itself.
Regarding the Solr PHP client: you don't need to use it, and it is recommended to use the REST-like Solr web API directly. You can use a PHP function like file_get_contents("http://IP:port/solr/corename/select?q=query&start=0&rows=100&wt=json") or use curl with PHP if you need to. Both ways are almost the same and efficient. This will return the data as JSON because of wt=json. Then use the PHP function json_decode($returned_data) to get that data as an object.
If you need to ask anything just reply.
