I have a Java project for which I'd like to use a pre-built web crawler that gives me enough flexibility to control which URLs are crawled, and then, once the crawler has the output, to control where to put it (Cassandra, with my own schema).
The big picture is that I want to feed in a list of URLs (Google and Bing searches) and then filter the URLs that are returned. I want it to then crawl the filtered URLs (I may possibly want to change the URL query string, but that's not a hard requirement). I want to take the resulting HTML, parse it using Tika, then pull the data out and store it.
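The Tika part of that pipeline is fairly compact; here is a minimal sketch of extracting text from fetched HTML (the URL is illustrative, and the metadata keys vary by Tika version):

import java.io.InputStream;
import java.net.URL;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaSketch {
    public static void main(String[] args) throws Exception {
        // Fetch a page and hand the stream to Tika's HTML parser
        InputStream stream = new URL("http://example.com/").openStream();
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        new HtmlParser().parse(stream, handler, metadata, new ParseContext());

        // Plain-text body plus whatever metadata Tika detected
        System.out.println(metadata.get("title"));
        System.out.println(handler.toString());
    }
}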
I'm looking at Apache Droids; it seems like a good fit since it appears to do everything I've mentioned, but there isn't any real documentation. I'd consider Nutch or Heritrix, but those seem to be aimed more at full solutions, and after skimming I don't see anything that talks about how to do what I want.
Does anyone have any experience with this type of thing? I mostly need some recommendations, but if you know of examples doing this sort of thing, that would be nice as well, since I'm still pretty new to Java.
I wouldn't say Droids is a well-established framework yet. Compared to Nutch, which has a lot of history behind it, I would expect it to be less stable and less documented. I have no experience with Droids, though.
As far as storing data in Cassandra goes, I would recommend either Astyanax (https://github.com/Netflix/astyanax) or Hector (https://github.com/hector-client/hector).
I have used Hector extensively over the last year and have found it extremely simple and easy to use. Development is faster with Hector than with its predecessors (pure Thrift, Pelops), but Hector is flexible enough to let you do the nitty-gritty things you would expect from Thrift.
Recently I have also been eyeing Astyanax, as it is developed/supported by a larger team and tested at a larger scale, which is important for my current field of work. However, Hector is usually faster at supporting new features in new Cassandra releases, so both libraries have their benefits.
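For what it's worth, a minimal write with Hector looks roughly like this (the cluster, keyspace, and column family names are illustrative, and the keyspace is assumed to already exist):

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class HectorSketch {
    public static void main(String[] args) {
        Cluster cluster = HFactory.getOrCreateCluster("crawler-cluster", "localhost:9160");
        Keyspace keyspace = HFactory.createKeyspace("CrawlData", cluster);

        // One row per URL; store the extracted text as a column
        Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
        mutator.insert("http://example.com/", "Pages",
                HFactory.createStringColumn("body", "extracted text goes here"));
    }
}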
Related
I would like to know if it would be recommended to use Lucene as data storage. I am saying 'recommended' because I already know that it's possible.
I am asking this question because the only Q&A I could find on SO was this one: Lucene as data store, which is kind of outdated (from 2010) even though it is almost exactly the same question.
My main concern about having data exclusively in Lucene is the storage reliability. I have been using Lucene since 2011 and at that time (version 2.4) it was not improbable to encounter a CorruptIndexException, basically meaning that the data would be lost if you didn't have it somewhere else.
However, in the newest versions (from 4.x onward), I've never experienced any problem with Lucene indices.
The answer should not dwell too much on performance, as I already have a pretty good idea of what to expect in that area.
I am also open to hearing about reliability experiences with Solr and Elasticsearch (how often do shards fail, what options do we have when that occurs, etc.).
This sounds like a good match for SolrCloud, as it is able to handle the load and also takes care of backups. My only concern would be that it is not a datastore; it "only" handles the indexing of those documents.
We are using SolrCloud for data storage and reliability has been pretty good so far.
However, make sure that you configure and tune it well, or else you may find nodes failing and ZooKeeper unable to detect some of them after a while.
I have recently come across DBpedia Spotlight, and I want to do information retrieval: I have a set of queries and DBpedia, and using information retrieval I need to get the output. I was not able to understand the documentation, so can you give me some sample code to start working from?
I have tried Terrier, but that was equally difficult.
Terrier is more popular as a research tool, where you can try out various standard IR models against standard test collections (e.g. TREC, ClueWeb).
If you want to quickly develop a reasonably functional search system, Lucene is the best thing to try. Go through the "Lucene in 5 minutes" tutorial; it should be fairly simple to follow.
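Along the lines of that tutorial, a minimal index-and-search sketch might look like this (Lucene 4.x-era API; class names move around between major versions):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class LuceneSketch {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_47);
        RAMDirectory index = new RAMDirectory();

        // Index a single document with one text field
        IndexWriter writer = new IndexWriter(index,
                new IndexWriterConfig(Version.LUCENE_47, analyzer));
        Document doc = new Document();
        doc.add(new TextField("title", "Lucene in Action", Field.Store.YES));
        writer.addDocument(doc);
        writer.close();

        // Search it back
        Query q = new QueryParser(Version.LUCENE_47, "title", analyzer).parse("lucene");
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(index));
        for (ScoreDoc hit : searcher.search(q, 10).scoreDocs) {
            System.out.println(searcher.doc(hit.doc).get("title"));
        }
    }
}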
I plan to start a mid-sized web project; what language + framework would you recommend?
I know Java and Python. I am looking for something simple.
Is App Engine a good option? I like the overall simplicity and the free hosting, but I am worried about the datastore (how difficult is it to make it as fast as a standard SQL solution? Plus, I need full-text search and I need to filter objects by several parameters).
What about Java with Stripes? Should I use another framework in addition to Stripes (e.g. for the database)?
UPDATE:
Thanks for the advice, I finally decided to use Django with Eclipse/PyDev as an IDE.
Python/Django is simple and elegant, it's widely used, and there is great documentation. A small disadvantage is that I may have to buy a VPS, but it shouldn't be very hard to port the project to App Engine, which is free to some extent.
Since you mentioned python, I would suggest looking into Django. You may need to look harder for hosting options, however...
Is App Engine a good option? I like the overall simplicity and the free hosting, but I am worried about the datastore (how difficult is it to make it as fast as a standard SQL solution? Plus, I need full-text search and I need to filter objects by several parameters).
App Engine is nice. It supports Python and Java (with some limitations), and it provides free hosting for small needs (rare, at least for Java). But I wouldn't expect exactly the same performance as with dedicated servers; the cloud is about scalability, not raw speed (you won't always get the fastest response time for a single hit, but GAE will handle gazillions of concurrent hits without any problem while your own servers would be on fire). That scalability is not without cost, though; if you don't need it, the development trade-offs may be too much trouble. Also note that it does not support full-text search out of the box (what an irony); you will have to use extra tooling.
What about Java with Stripes? Should I use another framework in addition to Stripes (e.g. for the database)?
I like Stripes very much. I love its convention-over-configuration approach; it's a very elegant and simple framework, but still powerful. Definitely not a bad choice. For persistence, if you go for GAE, you will have to use JPA or JDO. If you don't, it's at your discretion (although I would go for JPA).
See also
Google AppEngine - A Second Look
As with many things in life, it depends on what your goals are. If you intend to learn a web framework that is used in corporate environments, then choose a Java solution. If not, don't. Python is certainly more elegant and generally more fun in pretty much every way.
As to which framework to use, Django has the most mindshare, as evidenced by the number of questions asked about it here. My understanding is that it's also pretty good. It's best suited to CMS-like web sites, though; at least, that's what it came from and what it's optimized for. You might also have a look at one of the simpler, nimbler ones, such as the relatively new Flask. All of these are enjoyable, though they may not all have all of their features available on App Engine.
Kay and Tipfy are excellent Python framework choices when you target specifically GAE. Kay is modelled after and similar to Django, but is better suited to GAE.
I've been kicking App Engine around a little bit, and so far the DataStore is pretty quick... there is a bit of a learning curve compared to SQL, but I've had no real issues. I'm not sure about full-text search; filtering, however, is simple: you just chain the filters one at a time.
from google.appengine.ext import db

class DBModel(db.Model):
    field1 = db.StringProperty()
    field2 = db.StringProperty()
    field3 = db.IntegerProperty()

# Each filter() call narrows the query further
GQLObj = DBModel.all().filter('field1 =', 'Foo')
GQLObj = GQLObj.filter('field2 =', 'Bar')
As far as hosting goes, with GAE I'm not sure you even get a choice; I do know you can register your own domain with Google, though.
I don't think the datastore is a problem. Many people will reject it out of hand because they want a standard relational database; if you are willing to consider a datastore in general then I doubt you will have any problems with the GAE datastore. Personally, I quite like it.
The thing that might trip you up is the operational limitations. For example, did you know that an HTTP request must complete within 10 seconds?
What if you get 50% of the way through a project and then find that a web service you are using sometimes takes 15 seconds to respond? Now you are toast. You can't pay extra to get the limit raised or anything like that.
So, my point is that you must approach GAE with great care. Learn about the limitations and make sure that they will not be a problem before you start using it.
It depends on your personality. There's no right answer to this question any more than there's a right answer to "what kind of car should I drive?"
If you're artistic and believe code should be beautiful, use Rails.
If you're a real hacker type, I think you'll find a full-stack framework such as Rails or Django to be unsatisfying. These frameworks are "opinionated" software, which means you have to really embrace the author's vision to be most productive.
The wonderful thing about web development in the Python world is that there are several great minimal frameworks. I've used a few, including web.py, GAE's webapp, and CherryPy. These frameworks are basically "here's a request, give me a string to serve up." It's raw. Don't think you'll be stuck concatenating strings in Python, though, God no; there are also several excellent templating libraries for Python. I can personally recommend Cheetah, but Mako also looks good.
Google App Engine + GWT and you have a pretty powerful combination for developing web applications. The datastore is quite fast, and it has so far done the job quite nicely for me.
In my project I had to do a lot of redesigning of my database model, because it was made for a traditional relational database, and some things were not (directly) possible with the datastore.
GWT has a fairly moderate learning curve, but it gets the job done very well. The GUI code is really easy to get started with; it's the asynchronous way of thinking that's the hardest part.
As for full-text search, I don't think it's supported by the framework. Filtering on parameters is possible, though.
There are some limitations to GAE, and you should consider them before putting all your eggs in that basket. The fact that GAE uses J2EE distribution standards makes an application very easy to move to a dedicated server, should the limitations of GAE become a problem. In fact, I think you would only have to refactor the part of your code that makes the queries and stores the data (which shouldn't be much more than 100 lines).
I've built several apps on GAE (with Python) over the last year. It's hard to beat the ease with which you can get an app up and running quickly. Don't discount the value in that alone.
While you may not understand the datastore yet, it is extremely well documented and there are great resources - including this one - to help you get past any problem you might have.
There are several advantages to using Solr 1.4 (out-of-the-box faceted search, grouping, replication, HTTP administration vs. Luke, ...).
Even if I embed search functionality in my Java application, I could use SolrJ to avoid the HTTP overhead of using Solr. Is SolrJ recommended at all?
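For context, a SolrJ query looks roughly like this (Solr 1.4-era class names; an EmbeddedSolrServer would skip the HTTP hop entirely, and the URL and field names are illustrative):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SolrJSketch {
    public static void main(String[] args) throws Exception {
        // Talks to Solr over HTTP; an EmbeddedSolrServer shares the JVM instead
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrQuery query = new SolrQuery("title:lucene");
        query.setFacet(true).addFacetField("category"); // out-of-the-box faceting

        QueryResponse response = server.query(query);
        for (SolrDocument doc : response.getResults()) {
            System.out.println(doc.getFieldValue("title"));
        }
    }
}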
So, when would you recommend using "pure Lucene"? Does it have better performance or require less RAM? Is it easier to unit-test?
PS: I am aware of this question.
If you have a web application, use Solr - I've tried integrating both, and Solr is easier. Otherwise, if you don't need Solr's features (the one that comes to mind as being most important is faceted search), then use Lucene.
If you want to completely embed your search functionality within your application and do not want to maintain a separate process like Solr, using Lucene is probably preferable. For example, a desktop application might need some search functionality (like the Eclipse IDE, which uses Lucene for searching its documentation). You probably don't want this kind of application to launch a heavy process like Solr.
Here is one situation where I have to use Lucene.
Given a set of documents, find out the most common terms in them.
Here, I need to access the term vectors of each document (using low-level APIs such as TermVectorMapper). With Lucene it's quite easy.
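As a rough illustration of this use case, here is a sketch that totals term frequencies from stored term vectors (Lucene 3.x-era API; it assumes the index was built with term vectors enabled on a "contents" field):

import java.io.File;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.store.FSDirectory;

public class CommonTermsSketch {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.open(new File("index")));
        Map<String, Integer> counts = new HashMap<String, Integer>();

        for (int docId = 0; docId < reader.maxDoc(); docId++) {
            TermFreqVector tfv = reader.getTermFreqVector(docId, "contents");
            if (tfv == null) continue; // field was indexed without term vectors
            String[] terms = tfv.getTerms();
            int[] freqs = tfv.getTermFrequencies();
            for (int i = 0; i < terms.length; i++) {
                Integer old = counts.get(terms[i]);
                counts.put(terms[i], (old == null ? 0 : old) + freqs[i]);
            }
        }
        reader.close();
        // counts now maps each term to its total frequency; sort it to find the most common
    }
}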
Another use case is very specialized ordering of search results. For example, I want a search for an author name (one who has written multiple books) to return one book from each store within the first 10 results. In this case, I find results from each book store and build the final result set by picking one result from each store. You are essentially doing multiple searches to generate the final results, and having access to Lucene's low-level APIs definitely helps.
One more reason to go for Lucene used to be getting new goodies ASAP. This is no longer true, as the two projects have been merged and there are now synchronized releases.
I'm surprised nobody mentioned NRT - Near Real Time search, available with Lucene, but not with Solr (yet).
Use Solr if you are more concerned about scalability than performance and use Lucene if you are more concerned about performance than scalability.
I've been working on a site idea. The general concept is a full-text search over documents that also allows user ratings; based on these ratings, I want to boost an item's value in the Lucene index. I'm trying to decide whether I should extend Jackrabbit or just build from the Lucene base. Is there any good way to extend Jackrabbit in this fashion and affect the index, or would it be best to work directly with Lucene?
Either way I go, I am strongly leaning toward using Groovy on Grails with either the Searchable plugin or Jackrabbit directly. Are there any major reasons I should just stick to Java?
Clarification:
I would like to boost an item based on its average user rating. Is Jackrabbit open or extensible enough that I can capture user ratings and have them affect the index within Jackrabbit, or is this so far outside Jackrabbit's core that I should just build up from Lucene?
I recommend using JCR, with Jackrabbit as the implementation behind it. JCR allows you to separate what you store from how you store it.
By staying within a JCR framework, you should be able to switch easily among JCR implementations. (There are several, not just Apache's.) Even within Jackrabbit there are many persistence managers, not just Lucene. This flexibility is useful when you want to trade off between storage space and performance.
JCR already includes full text searches and the ability to maintain user ratings. It should be a good fit for your project.
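To give a feel for the API, here is a minimal sketch of storing a document node with a rating property through JCR (node and property names are illustrative; TransientRepository is Jackrabbit's throwaway local repository):

import javax.jcr.Node;
import javax.jcr.Repository;
import javax.jcr.Session;
import javax.jcr.SimpleCredentials;

import org.apache.jackrabbit.core.TransientRepository;

public class JcrSketch {
    public static void main(String[] args) throws Exception {
        Repository repository = new TransientRepository();
        Session session = repository.login(
                new SimpleCredentials("admin", "admin".toCharArray()));
        try {
            // A document node with metadata stored as properties
            Node doc = session.getRootNode().addNode("documents").addNode("doc1");
            doc.setProperty("title", "My document");
            doc.setProperty("avgRating", 4.5);
            session.save();
        } finally {
            session.logout();
        }
    }
}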
Are there any major reasons I should just stick to Java?
Not really. As you probably already know, you can use any Java library from Groovy/Grails, so there's nothing you can do in Java that you can't do in Groovy. The contrary is also true, but in my experience it takes a lot more (boilerplate) code to get things done in Java.
Although Java is considerably faster than Groovy, that doesn't necessarily mean your app will be faster if written in Java, as the bottleneck is more likely to be the database than code execution.
As for whether you should use Lucene/Searchable or Jackrabbit, it's very difficult to say without knowing more about what you want to achieve. All you've told us so far is that you want to index documents and boost certain items in the index. You can certainly do both of those with Lucene.
I would recommend using JCR/Jackrabbit on top of Lucene for a couple of reasons:
1) Your repository structure can readily support document nodes with child nodes that store all of your metadata, including owner, ratings, flagging, comments, etc.
2) JCR is ideal for document/node-based app development, providing a lot of the heavy lifting at the framework level while not getting in your way at the app level.
I would recommend using Apache Sling; it comes with Jackrabbit/Lucene built in.
Most of the committers are also involved with Jackrabbit, so it's designed to work well with it -- even better, it's designed to run on top of it.
One of the nice features of Sling is that it mounts the entire JCR repository in the URL space and exposes it via REST endpoints.
So you can access your documents and metadata very easily with a simple HTTP request. It also allows you to write your own servlets and expose them as REST endpoints (this is extremely easy: no fiddling about with applicationContext.xml files, just one annotation).
It also allows you to write JSP, ESP, Groovy, ...
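To illustrate the REST access described above, here is a sketch of reading a node's JSON rendering from a locally running Sling instance (the path and port are illustrative; appending .json requests Sling's default JSON rendering):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class SlingReadSketch {
    public static void main(String[] args) throws Exception {
        // Sling renders a JCR node as JSON when you append .json to its path
        URL url = new URL("http://localhost:8080/content/documents/doc1.json");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
        }
        in.close();
    }
}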