Situations to prefer Apache Lucene over Solr? - java

There are several advantages to using Solr 1.4 (out-of-the-box faceted search, grouping, replication, HTTP administration vs. Luke, ...).
Even if I embed search functionality in my Java application, I could use SolrJ to avoid the HTTP overhead of using Solr. Is SolrJ recommended at all?
So, when would you recommend using "pure Lucene"? Does it perform better or require less RAM? Is it easier to unit-test?
PS: I am aware of this question.

If you have a web application, use Solr - I've tried integrating both, and Solr is easier. Otherwise, if you don't need Solr's features (the one that comes to mind as being most important is faceted search), then use Lucene.

If you want to completely embed your search functionality within your application and do not want to maintain a separate process like Solr, using Lucene is probably preferable. For example, a desktop application might need some search functionality (like the Eclipse IDE, which uses Lucene for searching its documentation). You probably don't want this kind of application to launch a heavy process like Solr.
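To make this concrete, here is a rough sketch of embedding Lucene in-process (assuming a Lucene 5.x-style API; the field name and query text are made up for illustration):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class EmbeddedSearch {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory dir = new RAMDirectory();   // in-memory index, no external process

        // Index one document.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("content", "Lucene can be embedded directly", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Search it from the same process.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            ScoreDoc[] hits = searcher.search(
                    new QueryParser("content", analyzer).parse("embedded"), 10).scoreDocs;
            for (ScoreDoc hit : hits) {
                System.out.println(searcher.doc(hit.doc).get("content"));
            }
        }
    }
}
```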

Here is one situation where I have to use Lucene.
Given a set of documents, find out the most common terms in them.
Here I need to access the term vectors of each document (using low-level APIs such as TermVectorMapper). With Lucene this is quite easy.
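For illustration, a minimal sketch of that idea using the newer term-vector API (assuming Lucene 5.x and a field indexed with term vectors enabled; the original answer used the older TermVectorMapper, but the approach is the same):

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

public class TermCounter {
    /** Sums term frequencies across all documents' term vectors for one field. */
    public static Map<String, Long> countTerms(IndexReader reader, String field) throws IOException {
        Map<String, Long> counts = new HashMap<>();
        for (int docId = 0; docId < reader.maxDoc(); docId++) {
            Terms vector = reader.getTermVector(docId, field);  // null if no vector was stored
            if (vector == null) continue;
            TermsEnum termsEnum = vector.iterator();
            BytesRef term;
            while ((term = termsEnum.next()) != null) {
                counts.merge(term.utf8ToString(), termsEnum.totalTermFreq(), Long::sum);
            }
        }
        return counts;  // sort by value descending to get the most common terms
    }
}
```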
Another use case is very specialized ordering of search results. For example, I want a search for an author name (who has written multiple books) to return one book from each store in the first 10 results. In this case, I run a search against each book store and pick one result from each to build the final result list. You are essentially doing multiple searches to generate the final results, and having access to Lucene's low-level APIs definitely helps.
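A hypothetical sketch of that per-store reordering, assuming a Lucene 5.x-style API and made-up field names ("author", "store"):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class PerStoreResults {
    /** Runs one search per store and keeps only the top hit of each store. */
    public static List<ScoreDoc> topHitPerStore(IndexSearcher searcher, String author,
                                                List<String> stores) throws IOException {
        List<ScoreDoc> results = new ArrayList<>();
        for (String store : stores) {
            BooleanQuery.Builder q = new BooleanQuery.Builder();
            q.add(new TermQuery(new Term("author", author)), Occur.MUST);
            q.add(new TermQuery(new Term("store", store)), Occur.FILTER);
            TopDocs top = searcher.search(q.build(), 1);  // best matching book from this store
            if (top.scoreDocs.length > 0) {
                results.add(top.scoreDocs[0]);            // load stored fields via searcher.doc()
            }
        }
        return results;
    }
}
```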
One more reason to go with Lucene used to be getting new goodies ASAP. This is no longer true, as the two projects have been merged and now have synchronized releases.

I'm surprised nobody mentioned NRT - Near Real Time search, available with Lucene, but not with Solr (yet).
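For reference, a minimal NRT sketch (assuming a Lucene 5.x-era API): the reader is opened against the IndexWriter so recently added documents become searchable before a commit.

```java
import java.io.IOException;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;

public class NrtExample {
    /** Opens an NRT reader that sees documents added to the writer but not yet committed. */
    public static DirectoryReader openNrt(IndexWriter writer) throws IOException {
        return DirectoryReader.open(writer, true);
    }

    /** Cheaply refreshes an NRT reader; returns the same reader if nothing has changed. */
    public static DirectoryReader refresh(DirectoryReader reader, IndexWriter writer)
            throws IOException {
        DirectoryReader newer = DirectoryReader.openIfChanged(reader, writer, true);
        if (newer == null) {
            return reader;
        }
        reader.close();
        return newer;
    }
}
```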

Use Solr if you are more concerned about scalability than performance and use Lucene if you are more concerned about performance than scalability.

Related

using hadoop on the current application

We have an application written in Java that uses Solr, Elasticsearch, Neo4j, MySQL and a few more.
We need to increase our data size dramatically (from millions to billions of records).
Here are the options I have to make this work:
cluster the individual components, notably Solr, Elasticsearch, Neo4j and MySQL
use what everyone talks about nowadays: Hadoop
The problem with the first option is that it is hard to manage;
the second option sounds too good to be true. So my questions are:
Can I actually assume that Hadoop can do that before digging in?
What other criteria do I need to consider?
Is there an alternative solution for such a task?
Solr is for data searching. If you want to process big data (meeting the criteria of volume, velocity and variety), e.g. for ETL and reporting, you will need Hadoop.
Hadoop consists of several ecosystem components. You can refer to the documentation here:
https://hadoop.apache.org

Java: crawl the web and store in Cassandra

I have a Java project for which I'd like to use a pre-built web crawler that gives me enough flexibility to control which URLs are crawled, and then, once the crawler has output, to control where to put it (Cassandra with my own schema).
The big picture is that I want to feed in a list of URLs (from Google and Bing searches) and then filter the URLs that are returned. I want it to then crawl the filtered URLs (I may possibly want to change the URL query string, but that's not a hard requirement). I want to take the resulting HTML, parse it using Tika, then pull the data out and store it.
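For the Tika step, a minimal sketch using Tika's facade API (the URL is a placeholder):

```java
import java.net.URL;
import org.apache.tika.Tika;

public class PageTextExtractor {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        // Fetches the page and extracts its plain text (format detection is automatic).
        String text = tika.parseToString(new URL("http://example.com/"));
        System.out.println(text);
        // From here the text (or fields pulled out of it) can be written to
        // Cassandra with whatever schema you choose.
    }
}
```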
I'm looking at Apache Droids; it seems like a good fit since it appears to do everything I've mentioned, but there isn't any real documentation. I'd consider Nutch or Heritrix, but their use cases seem to be more of a full solution, and after skimming I don't see anything that explains how to do what I want.
Does anyone have any experience with this type of thing? I mostly need some recommendations, but if you know of examples doing this sort of thing, that'd be nice as well, since I'm still pretty new to Java.
I wouldn't say Droids is a well-established framework yet. If you compare it to Nutch, which has a lot of history behind it, I would expect it to be less stable and less documented. I have no experience with Droids, though.
As far as storing data in Cassandra goes, I would recommend either Astyanax (https://github.com/Netflix/astyanax)
or Hector (https://github.com/hector-client/hector).
I have used Hector extensively over the last year and have found it extremely simple and easy to use. It is faster to develop with Hector than with its predecessors (pure Thrift, Pelops), yet Hector is flexible enough to let you do the nitty-gritty things you would expect from Thrift.
Recently I have also been eyeing Astyanax, as it is developed/supported by a larger team and tested on a larger scale, which is important for my current field of work. However, Hector is usually faster at implementing new features for new Cassandra releases, so both libraries have their benefits.
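For reference, a rough Hector sketch for the Cassandra side (cluster, keyspace and column-family names are placeholders, not something from your schema):

```java
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class CrawlStore {
    public static void main(String[] args) {
        // Connect to a local Cassandra node (placeholder cluster name and host).
        Cluster cluster = HFactory.getOrCreateCluster("TestCluster", "localhost:9160");
        Keyspace keyspace = HFactory.createKeyspace("crawler", cluster);

        // Store one crawled page: row key = URL, one column per extracted field.
        Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
        mutator.insert("http://example.com/", "pages",
                HFactory.createStringColumn("title", "Example"));
        mutator.insert("http://example.com/", "pages",
                HFactory.createStringColumn("body", "...extracted text..."));
    }
}
```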

Add faceting over multivalued fields to an application using Hibernate Search

We use Hibernate Search in our application, including its faceting. Recently we found a big limitation: faceting over fields that can have multiple values doesn't work properly with Hibernate Search - if a document has multiple values for a faceted field (e.g. multiple categories), only one of the values is taken into account.
I can currently think of two solutions:
use bobo-browse (http://code.google.com/p/bobo-browse/)
solr (http://lucene.apache.org/solr/)
In both solutions we would continue to maintain the index using Hibernate Search and make queries as we did before (using Hibernate Search), and run an additional bobo-browse or Solr query for faceting where required (bobo-browse or Solr would use the index in a kind of "read-only" manner). The problem is that we update the index quite often and would like to get really fresh data in faceting queries. Bobo-browse doesn't integrate automatically with Hibernate Search, and keeping the search up to date might get me into some problems (e.g. https://groups.google.com/forum/?fromgroups=#!topic/bobo-browse/sn_Efc-YClU). The documentation looks a bit untidy and not yet complete. Solr, on the other hand, seems like a really big thing to add just to get faceting to work properly. And I'm still afraid I might run into problems with updating/refreshing the index.
Do you have any experience in that matter? Any suggestions?
As a Hibernate Search developer, I'd suggest joining us and helping implement what you need.
None of us has actually needed multivalued faceting, so we're not really sure which solution to pick either; it seems you have a real need, which is perfect for exploring the alternatives and trying them out.
Hibernate Search already depends on many Solr modules especially because of the large collection of excellent analysers. I'm confident we could find a way to embed the faceting logic of Solr and package it nicely in our consistent API, without the need to actually start Solr in server mode.
I guess we could do the same with Bobo-browse; I'd prefer Solr so as not to add other dependencies, but if bobo-browse proves to be the superior solution, why not? You can help us make this choice.
What would you get in exchange?
we'll maintain it: compatibility will be kept with any future version (hopefully you'll help a bit)
eternal gratitude from other users ;)
rock solid testing from thousands of other users
bugfixes and improvements from ..
a rock star badge on your CV
What is required?
unit tests
documentation updates
sensible code
https://community.jboss.org/wiki/ContributingToHibernateSearch
I also use Bobo Browse in combination with Hibernate Search. I also have the problem with regular updates and the read-only issue. Bobo is not the easiest library out there and I've looked several times at ways to integrate with Hibernate Search and just gave up because of the complexity.
I use timed reloads of the index to ensure freshness, but that creates a lot of garbage to be collected. Lucene has over time optimized the process of reopening IndexReaders, but the Bobo team is not really focused on supporting that. https://linkedin.jira.com/browse/BOBO-31 describes this issue.
The Hibernate Search infrastructure should provide enough flexibility to integrate. Zoie is a real-time indexing system, like Hibernate Search, that is integrated with Bobo (https://linkedin.jira.com/wiki/display/BOBO/Realtime+Faceting+with+Zoie). Perhaps it can inspire your efforts.
This is something of a solution to the multi-value facet-count problem for hibernate-search.
Blog: http://outbottle.com/hibernate-search-multivalue-facet-counts/
The blog is complete with a Java Class that can be reused to generate facet-counts for single-value and multi-value fields.
The solution provided is based on the BitSet solution provided here: http://sujitpal.blogspot.ie/2007/04/lucene-search-within-search-with.html
The blog has a Maven project which demonstrates the solution quite comprehensively. The project demonstrates using the hibernate-search faceting API to filter on a date range AND a 1-to-many (single-value) facet group AND a many-to-many (multi-value) facet group, combined.
The solution is then invoked to correctly derive facet-counts for each facet-group.
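To illustrate the general "search within search" idea that approach builds on (this is not the blog's actual code; the field and value names are made up): for each facet value, intersect the base query with a term query on that value and count the hits, so a document carrying several categories is counted once per category. A sketch assuming a Lucene 5.x-style API:

```java
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TotalHitCountCollector;

public class MultiValueFacetCounter {
    public static Map<String, Integer> count(IndexSearcher searcher, Query baseQuery,
                                              String facetField, List<String> facetValues)
            throws IOException {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String value : facetValues) {
            // Base query AND this facet value; the hit count of the combined query
            // is the facet count for this value.
            BooleanQuery.Builder combined = new BooleanQuery.Builder();
            combined.add(baseQuery, Occur.MUST);
            combined.add(new TermQuery(new Term(facetField, value)), Occur.FILTER);
            TotalHitCountCollector collector = new TotalHitCountCollector();
            searcher.search(combined.build(), collector);
            counts.put(value, collector.getTotalHits());
        }
        return counts;
    }
}
```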
The solution facilitates results similar to this jsFiddle emulation: http://goo.gl/y5C9UO (except that the emulation does not demo the range faceting).
The jsFiddle is part of a larger blog which explores the concept of facet searching in general: http://outbottle.com/understanding-faceted-searching/. If you’re like me and are finding the whole notion of facet-searching quite confusing then this will help.
It may not be the best solution in the world, so feel free to give feedback.

Implementing simple document management

My question is: how would you go about implementing a simple DMS (document management system) based on the following requirements?
The DMS should be a distributed web application.
Support for document versioning.
Support for document locking.
Document search.
I'm already clear on what technologies I want to use: Spring MVC, Hibernate and a relational database (most likely MySQL).
One thing I'm not very clear on is whether I need to use WebDAV, since I could just upload or download documents. I think I have to, because I need to accomplish point 2, and especially point 3, somehow. Is this the right way to go?
Any examples or experience with this would come in very handy :). Maybe Milton is not the best library to pick for WebDAV?
@Eduard, regarding dependencies on third parties - are you doing this as a college/university exercise or as something that will affect real users in a production environment?
At the risk of sounding very pretentious: don't reinvent the wheel! I'd definitely second the call to use JCR; this way you are depending on a standard and not on a third-party implementation.
JCR is a well-defined standard (which means a lot of people have invested commercial effort - i.e. cash and expertise in huge amounts - into it). I would seriously consider looking into JCR - think of it as an API where third parties provide the implementation (no vendor lock-in).
Have a look at the features you'll get out of the box; I believe 99 - 110% of the functionality you require is available through a JCR implementation. Plus, you'll benefit from the fact that the code you'll be using has been tested by hundreds of people in real-world situations.
Where I'd differ from bmscomp is in suggesting JackRabbit http://jackrabbit.apache.org/
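To give a flavour of how requirements 2 and 3 map onto the JCR 2.0 API, here is a rough sketch using Jackrabbit's TransientRepository (node names, credentials and the lock owner are placeholders):

```java
import javax.jcr.Node;
import javax.jcr.Repository;
import javax.jcr.Session;
import javax.jcr.SimpleCredentials;
import javax.jcr.version.VersionManager;
import org.apache.jackrabbit.core.TransientRepository;

public class JcrDmsSketch {
    public static void main(String[] args) throws Exception {
        Repository repository = new TransientRepository();
        Session session = repository.login(new SimpleCredentials("admin", "admin".toCharArray()));
        try {
            // Create a document node and make it versionable and lockable.
            Node doc = session.getRootNode().addNode("contract", "nt:unstructured");
            doc.addMixin("mix:versionable");
            doc.addMixin("mix:lockable");
            doc.setProperty("content", "first draft");
            session.save();

            // Versioning: check in the current state, check out to edit again.
            VersionManager vm = session.getWorkspace().getVersionManager();
            vm.checkin(doc.getPath());
            vm.checkout(doc.getPath());

            // Locking: prevent concurrent edits on this node.
            session.getWorkspace().getLockManager()
                   .lock(doc.getPath(), false, true, Long.MAX_VALUE, "me");
        } finally {
            session.logout();
        }
    }
}
```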
Option 1:
I am not sure about WebDAV, as I have no real experience with it. But I would highly recommend using a document database like MongoDB.
With MongoDB, you can:
1. Handle document versions.
2. Use its atomic operations to implement your document-locking logic (a sketch follows below).
This will also give you the added benefit of being able to search your document store.
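A rough sketch of the locking part with an atomic update (using the current MongoDB Java driver; database, collection and field names are made up):

```java
import static com.mongodb.client.model.Filters.and;
import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Filters.exists;
import static com.mongodb.client.model.Updates.set;

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class DocumentLock {
    public static boolean tryLock(MongoCollection<Document> docs, String docId, String user) {
        // Atomically claim the document only if nobody holds the lock yet;
        // findOneAndUpdate returns null when the filter matched no document.
        Document locked = docs.findOneAndUpdate(
                and(eq("_id", docId), exists("lockedBy", false)),
                set("lockedBy", user));
        return locked != null;
    }

    public static void main(String[] args) {
        MongoCollection<Document> docs = MongoClients.create("mongodb://localhost")
                .getDatabase("dms").getCollection("documents");
        System.out.println(tryLock(docs, "doc-42", "alice"));
    }
}
```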
Option 2:
Apache Jackrabbit: A Content repository
A content repository is a hierarchical content store with support for structured and unstructured content, full text search, versioning, transactions, observation, and more.
Think about using JCR, the Java Content Repository
(http://en.wikipedia.org/wiki/Content_repository_API_for_Java), or have a look at the work done on Alfresco or the eXo framework; they did a good job.
You can use these open source projects to meet your requirements:
http://sourceforge.net/projects/logicaldoc/ -
LogicalDOC is a modern document management system with a nice interface, easy to use and very fast. It uses open source Java technologies such as GWT, Spring and Lucene to provide a flexible and scalable DMS platform. http://www.logicaldoc.com
http://sourceforge.net/projects/openkm/ -
OpenKM Document Management - DMS Updated 2011-05-25
OpenKM is a powerful, scalable Document Management System (DMS). OpenKM uses JBoss + J2EE + Ajax web (GWT) + Jackrabbit (Lucene) open source technologies. http://www.openkm.com/
Spring MVC is a good choice. If you want to use a relational database, then you can also check out DataNucleus. At least the JDO layer (plus maybe the JPA layer) provides versioning support. For search I recommend Apache Solr, based on Lucene, which has excellent and powerful full-text search capabilities.
Although WebDAV seems like the natural choice as a simple, cross-platform file transfer protocol, I have never had good experiences with it. Either the client or the server didn't work well (Konqueror, Internet Explorer, Zope 2, ...). So abstract away from the protocol and provide multiple ways to access the files.

Extend JackRabbit or build up from Lucene?

I've been working on a site idea. The general concept is a full-text search of documents that also allows user ratings; based on these ratings I want to boost an item's value in the Lucene index. But I'm trying to figure out whether I should extend Jackrabbit or just build from the Lucene base. Is there any good way to extend Jackrabbit in this way and affect the index, or would it be best to work directly off Lucene?
Whichever way I go, I am strongly leaning towards using Groovy on Grails with either the Searchable plugin or direct use of Jackrabbit. Are there any major reasons I should just stick to Java?
Clarification:
I would like to boost an item based on its average user rating. Is Jackrabbit open or extensible enough that I can capture user ratings and have them affect the index within Jackrabbit, or is this so far outside Jackrabbit's core that I should just build up from Lucene?
I recommend using JCR, with the implementation of Jackrabbit behind it. JCR allows you to separate between what you store and how you store it.
By staying within a JCR framework, you should be able to easily switch among JCR implementations. (There are several, not just Apache's.) Even within Jackrabbit there are many persistence managers, not just Lucene. This flexibility is useful when you want to trade off between storage space and performance.
JCR already includes full text searches and the ability to maintain user ratings. It should be a good fit for your project.
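For example, full-text search and ratings can both be handled through plain JCR 2.0 calls (a sketch; the node type, property names and query are illustrative):

```java
import javax.jcr.Node;
import javax.jcr.NodeIterator;
import javax.jcr.Session;
import javax.jcr.query.Query;
import javax.jcr.query.QueryManager;

public class JcrSearchSketch {
    /** Full-text search over document nodes using an illustrative JCR-SQL2 query. */
    public static NodeIterator search(Session session, String text) throws Exception {
        QueryManager qm = session.getWorkspace().getQueryManager();
        Query query = qm.createQuery(
                "SELECT * FROM [nt:unstructured] AS doc WHERE CONTAINS(doc.*, '" + text + "')",
                Query.JCR_SQL2);
        return query.execute().getNodes();
    }

    /** Store a user rating as an ordinary property on the document node. */
    public static void rate(Node document, double averageRating) throws Exception {
        document.setProperty("rating", averageRating);
        document.getSession().save();
    }
}
```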
Are there any major reasons I should just stick to Java?
Not really. As you probably already know, you can use any Java library with Groovy/Grails, so there's nothing you can do in Java that you can't do in Groovy. And although the converse is also true, in my experience it takes a lot more (boilerplate) code to get things done in Java.
Although Java is considerably faster than Groovy, that doesn't necessarily mean your app will be faster if written in Java, as the bottleneck is more likely to be the database than code execution.
As for whether you should use Lucene/Searchable or Jackrabbit, it's very difficult to say without knowing more about what you want to achieve. All you've told us so far is that you want to index documents and boost certain items in the index. You can certainly do both of those with Lucene.
I would recommend using JCR/Jackrabbit on top of Lucene for a couple of reasons:
1) Your repository structure could readily support document nodes with child nodes that store all of your meta-data including owner, ratings, flagging, comments, etc.
2) JCR is ideal for document/node based app development, providing a lot of the heavy lifting at the framework level while not getting in your way at the app level.
I would recommend you use Apache Sling; it comes with Jackrabbit/Lucene built in.
Most of the committers are also involved with Jackrabbit, so it's designed to work well with it -- even better, it's designed to run on top of it.
One of the nice features of Sling is that it mounts the entire JCR repository in the URL space and exposes it via REST endpoints.
So you can access your documents/metadata very easily by doing a simple HTTP request to it. It also allows you to write your own servlets and expose them as REST endpoints. (This is extremely easy -- no fiddling about with applicationContext.xml files, just 1 annotation)
It also allows you to write JSP, ESP, Groovy, ...
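For example, a custom REST endpoint can look roughly like this (a sketch using the Felix SCR @SlingServlet annotation; the servlet path, resource path and property name are placeholders):

```java
import java.io.IOException;
import javax.servlet.ServletException;
import org.apache.felix.scr.annotations.sling.SlingServlet;
import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.api.SlingHttpServletResponse;
import org.apache.sling.api.servlets.SlingSafeMethodsServlet;

// The single annotation registers the servlet at a fixed path within Sling.
@SlingServlet(paths = "/bin/ratings")
public class RatingServlet extends SlingSafeMethodsServlet {

    @Override
    protected void doGet(SlingHttpServletRequest request, SlingHttpServletResponse response)
            throws ServletException, IOException {
        // Read a property straight from the underlying JCR-backed resource and return it.
        String rating = request.getResourceResolver()
                .getResource("/content/docs/doc-1")
                .getValueMap().get("rating", "not rated");
        response.setContentType("text/plain");
        response.getWriter().write(rating);
    }
}
```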
