Using Lucene as storage - java

I would like to know whether it is recommended to use Lucene as a data store. I say 'recommended' because I already know it's possible.
I am asking because the only Q&A I could find on SO was this one: Lucene as data store, which is rather outdated (from 2010) even though it asks almost exactly the same question.
My main concern about keeping data exclusively in Lucene is storage reliability. I have been using Lucene since 2011, and at that time (version 2.4) it was not uncommon to encounter a CorruptIndexException, which basically meant the data was lost unless you had it somewhere else.
However, in the newest versions (from 4.x onward), I've never experienced any problem with Lucene indices.
Answers should not dwell on performance, as I already have a pretty good idea of what to expect there.
I am also open to hearing about Solr and Elasticsearch reliability experiences (how often do shards fail, what options are there when that happens, etc.).

This sounds like a good match for SolrCloud, as it can handle the load and also takes care of backups. My only concern would be that it is not a data store; it "only" handles the indexing of those documents.

We are using SolrCloud for data storage and reliability has been pretty good so far.
However, make sure you configure and tune it well, or you may find nodes failing and ZooKeeper being unable to detect some of them after a while.

Related

Couchbase 1 -> 2 migration: obtaining backups and checking consistency

Given: an old Java project using Couchbase 1. It has a big problem with runtime backups. Neither simply copying the files nor using cbbackup works: the resulting backups were corrupted, and Couchbase would not start from them. The only way to obtain a data snapshot was a relatively long application shutdown.
Now we're migrating to Couchbase 2+. cbbackup fails with something like this (a message that makes no sense to me, as there were no design docs in Couchbase 1):
/pools/default/buckets/default/ddocs; reason: provide_design done
But if we use the resulting files, Couchbase seems to wake up and work properly.
Question 1: Any insight into the whole corrupted-backups situation?
Question 2: How, at a minimum, can we verify the consistency of the new database backups in our case?
(Writing a huge check suite for all docs and fields through the client is very expensive and a last resort.)
I appreciate any help; this is poorly understood legacy infrastructure for the team, and googling and the Couchbase documentation haven't helped us much.
Question 1: Any insight into the whole corrupted-backups situation?
Couchbase 1.x used SQLite as the on-disk format (sharded into 4 files per bucket, IIRC), which has a number of issues at scale.
One of the major changes in Couchbase 2 was the move to a custom append-only file format (couchstore), which is much less susceptible to corruption (once written, a block is never modified) until a new, compacted file is later created by an automated job.
How, at a minimum, can we verify the consistency of the new database backups in our case? (Writing a huge check suite for all docs and fields through the client is very expensive and a last resort.)
If you want to check the consistency of the backup, you need to do something along the lines of what you mention.
Note, however, that if you're backing up a live system (as most people are), the live system is likely to have changed between taking the backup and the time you compare against it.
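If a full comparison is too expensive, a pragmatic middle ground is a spot check: restore the backup into a separate bucket or cluster and compare a sample of documents against the live system via the client. The following is only a rough sketch assuming the Couchbase Java client 1.x API; the hostnames, bucket name and sample keys are hypothetical.

    import com.couchbase.client.CouchbaseClient;

    import java.net.URI;
    import java.util.Arrays;
    import java.util.List;

    public class BackupSpotCheck {
        public static void main(String[] args) throws Exception {
            // Hypothetical endpoints: the live cluster and a cluster restored from the backup
            CouchbaseClient live = new CouchbaseClient(
                    Arrays.asList(URI.create("http://live-host:8091/pools")), "default", "");
            CouchbaseClient restored = new CouchbaseClient(
                    Arrays.asList(URI.create("http://restored-host:8091/pools")), "default", "");

            // Hypothetical sample of keys; in practice you would pull these from a view
            // or from your application's own key registry
            List<String> sampleKeys = Arrays.asList("doc::1", "doc::2", "doc::3");

            for (String key : sampleKeys) {
                Object a = live.get(key);
                Object b = restored.get(key);
                if (a == null ? b != null : !a.equals(b)) {
                    System.out.println("Mismatch for key " + key);
                }
            }
            live.shutdown();
            restored.shutdown();
        }
    }

Bear in mind the caveat above: on a live system some mismatches are expected simply because data changed after the backup was taken.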
As a final aside, I would suggest looking at the 1.8.x to 2.0 Upgrade Guide on the Couchbase website. Note that 2.x is now pretty old (3.x is current as of writing, and 4.0 is in beta), so the 2.0 documentation is in an archived section of the website.

Java: crawl the web and store in Cassandra

I have a Java project for which I'd like to use a pre-built web crawler that gives me enough flexibility to control which URLs are crawled and then, once the crawler has the output, to control where it goes (Cassandra with my own schema).
The big picture: I want to feed in a list of URLs (Google and Bing searches) and then filter the URLs that come back. I then want it to crawl the filtered URLs (I may want to change the URL query string, but that's not a hard requirement). I want to take the resulting HTML, parse it using Tika, then pull the data out and store it; a rough sketch of the Tika step is below.
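For illustration only, the Tika extraction step could look roughly like this, using Tika's facade API to keep it short (the URL is a placeholder):

    import org.apache.tika.Tika;

    import java.net.URL;

    public class ExtractText {
        public static void main(String[] args) throws Exception {
            Tika tika = new Tika();
            // Placeholder URL; in the real pipeline this would come from the filtered crawl list
            String text = tika.parseToString(new URL("http://example.com/some-page.html"));
            System.out.println(text);
            // The extracted text (plus any metadata) would then be written to Cassandra
        }
    }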
I'm looking at Apache Droids; it seems like a good fit since it appears to do everything I've mentioned, but there isn't any real documentation. I'd consider Nutch or Heritrix, but their use cases seem to be geared towards a full solution, and after skimming I don't see anything that explains how to do what I want.
Does anyone have any experience with this type of thing? I mostly need some recommendations, but examples of doing this sort of thing would be nice as well, since I'm still pretty new to Java.
I wouldn't say Droids is a well-established framework yet. Compared to Nutch, which has a lot of history behind it, I would expect it to be less stable and less documented. I have no experience with Droids, though.
As far as storing data in Cassandra goes, I would recommend either Astyanax (https://github.com/Netflix/astyanax) or Hector (https://github.com/hector-client/hector).
I have used Hector extensively over the last year and have found it extremely simple and easy to use. Development is faster in Hector than with its predecessors (raw Thrift or Pelops), yet Hector is flexible enough to let you do the nitty-gritty things you would expect from Thrift.
Recently I have also been eyeing Astyanax, as it is developed and supported by a larger team and tested at a larger scale, which is important for my current field of work. However, Hector is usually quicker to support features of new Cassandra releases, so both libraries have their benefits.
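To give a feel for Hector, here is a minimal sketch of writing one crawled page into Cassandra. The cluster name, host, keyspace ("crawler") and column family ("pages") are hypothetical and assumed to already exist.

    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.Cluster;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.mutation.Mutator;

    public class StorePage {
        public static void main(String[] args) {
            // Hypothetical cluster name and host; keyspace and column family must already exist
            Cluster cluster = HFactory.getOrCreateCluster("CrawlCluster", "localhost:9160");
            Keyspace keyspace = HFactory.createKeyspace("crawler", cluster);

            Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());

            // Row key = the crawled URL; columns hold the extracted text and fetch time
            String url = "http://example.com/some-page.html";
            mutator.addInsertion(url, "pages", HFactory.createStringColumn("body", "...extracted text..."));
            mutator.addInsertion(url, "pages", HFactory.createStringColumn("fetchedAt", "2012-01-01T00:00:00Z"));
            mutator.execute();

            HFactory.shutdownCluster(cluster);
        }
    }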

Add faceting over multi-valued fields to an application using Hibernate Search

We use Hibernate Search in our application, including its faceting. Recently we found a big limitation: faceting over fields that can have multiple values doesn't work properly with Hibernate Search - if a document has multiple values for a faceted field (e.g. multiple categories), only one of the values is taken into account.
I can currently think of two solutions:
use Bobo-Browse (http://code.google.com/p/bobo-browse/)
use Solr (http://lucene.apache.org/solr/)
In both solutions we would continue to maintain the index using Hibernate Search and run queries as we did before (using Hibernate Search), and run an additional Bobo-Browse or Solr query for faceting where required (Bobo-Browse or Solr would use the index in a kind of "read-only" manner). The problem is that we update the index quite often and would like really fresh data in the faceting queries. Bobo-Browse doesn't automatically integrate with Hibernate Search, and keeping the search up to date might lead to some problems (e.g. https://groups.google.com/forum/?fromgroups=#!topic/bobo-browse/sn_Efc-YClU). Its documentation also looks a bit untidy and not yet complete. Solr, on the other hand, seems like a really big thing to add just to get faceting working properly, and I'm still afraid I might run into problems with updating/refreshing the index.
Do you have any experience in that matter? Any suggestions?
As a Hibernate Search developer, I'd suggest joining us and helping implement what you need.
None of us has actually needed multi-valued faceting, so we're not really sure which solution to pick either; it seems you have a real need, which is perfect for exploring the alternatives and trying them out.
Hibernate Search already depends on many Solr modules, especially because of its large collection of excellent analyzers. I'm confident we could find a way to embed Solr's faceting logic and package it nicely in our consistent API, without needing to actually start Solr in server mode.
I guess we could do the same with Bobo-Browse; I'd prefer Solr so as not to add other dependencies, but if Bobo-Browse proves to be the superior solution, why not? You can help us make that choice.
What would you get in exchange?
we'll maintain it: compatibility will be preserved in future versions (hopefully you'll help a bit)
eternal gratitude from other users ;)
rock solid testing from thousands of other users
bugfixes and improvements from ..
a rock star badge on your CV
What is required?
unit tests
documentation updates
sensible code
https://community.jboss.org/wiki/ContributingToHibernateSearch
I also use Bobo-Browse in combination with Hibernate Search, and I also have the problem with regular updates and the read-only issue. Bobo is not the easiest library out there; I've looked several times at ways to integrate it with Hibernate Search and just gave up because of the complexity.
I use timed reloads of the index to ensure freshness, but that creates a lot of garbage to be collected. Lucene has over time optimized the process of reopening IndexReaders, but the Bobo team is not really focused on supporting that; https://linkedin.jira.com/browse/BOBO-31 describes this issue.
The Hibernate Search infrastructure should provide enough flexibility to integrate. Zoie is a real-time indexing system, like Hibernate Search, that is integrated with Bobo (https://linkedin.jira.com/wiki/display/BOBO/Realtime+Faceting+with+Zoie). Perhaps it can inspire your efforts.
This is something of a solution to the multi-value facet-count problem for hibernate-search.
Blog: http://outbottle.com/hibernate-search-multivalue-facet-counts/
The blog is complete with a Java Class that can be reused to generate facet-counts for single-value and multi-value fields.
The solution provided is based on the BitSet approach described here: http://sujitpal.blogspot.ie/2007/04/lucene-search-within-search-with.html
The blog includes a Maven project which demonstrates the solution quite comprehensively. The project shows how to use the hibernate-search faceting API to filter on a date range AND a one-to-many (single-value) facet group AND a many-to-many (multi-value) facet group, combined.
The solution is then invoked to correctly derive facet counts for each facet group.
The solution produces results similar to this jsFiddle emulation: http://goo.gl/y5C9UO (except that the emulation does not demonstrate the range faceting).
The jsFiddle is part of a larger blog post which explores the concept of facet searching in general: http://outbottle.com/understanding-faceted-searching/. If you're like me and find the whole notion of facet searching quite confusing, then this will help.
It may not be the best solution in the world, so feel free to give feedback.
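For anyone who just wants the gist of the counting idea without reading the blog: the count for a facet value is simply the number of documents matching both the main query and that value, which stays correct even when the field is multi-valued. Below is a crude but self-contained sketch against a plain Lucene 3.x/4.x IndexSearcher, with a hypothetical multi-valued "category" field; the BitSet approach in the linked posts is essentially an optimization of this per-value intersection.

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TotalHitCountCollector;

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class MultiValueFacetCounts {

        // For each candidate value of the multi-valued "category" field, count how many
        // documents match both the main query and that value.
        public static Map<String, Integer> count(IndexSearcher searcher, Query mainQuery,
                                                 String[] candidateCategories) throws Exception {
            Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
            for (String category : candidateCategories) {
                BooleanQuery combined = new BooleanQuery();
                combined.add(mainQuery, BooleanClause.Occur.MUST);
                combined.add(new TermQuery(new Term("category", category)), BooleanClause.Occur.MUST);

                TotalHitCountCollector collector = new TotalHitCountCollector();
                searcher.search(combined, collector);
                counts.put(category, collector.getTotalHits());
            }
            return counts;
        }
    }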

DSpace and Physical Separation of Documents

I'm looking for a document repository that supports physical data separation. I was looking into Open Text's LiveLink, but that price range is out of my league. I just started looking into DSpace. It's free and open source and works with PostgreSQL.
I'm really trying to figure out whether DSpace supports physically separating the data. I guess I would create multiple DSpace Communities, and I want the documents for each Community stored on different mounts. Is this possible? Or is there a better way?
Also, I won't be using DSpace's front end. The user will not even know DSpace is storing the docs. I will build my own front end and use Java to talk to DSpace. Does anyone know of good Java API wrappers for DSpace?
This answer is a pretty long time after you asked the question, so not sure if it's of any use to you.
DSpace currently does not support storing different parts of the data on different partitions, although your asset store can be located on its own partition, separate from the database and the application. You may get what you need from the new DuraSpace application, which deals with synchronising your data to the cloud.
In terms of Java APIs, DSpace supports SWORD 1.3 which is a deposit-only protocol, and a lightweight implementation of WebDAV with SOAP. Other than that, it's somewhat lacking. If you are looking for a real back-end repository with good web services you might look at Fedora which is a pure back end, with no native UI, and probably more suited to your needs. DSpace and Fedora are both part of the DuraSpace organisation so you can probably benefit from that also. I'm not knowledgeable enough about the way that Fedora stores data to say whether you can physically separate the storage, though.
Hope that helps.

Situations to prefer Apache Lucene over Solr?

There are several advantages to using Solr 1.4 (out-of-the-box faceted search, grouping, replication, HTTP administration vs. Luke, ...).
Even if I embed search functionality in my Java application, I could use SolrJ to avoid the HTTP overhead of using Solr. Is SolrJ recommended at all?
So, when would you recommend using "pure Lucene"? Does it have better performance or require less RAM? Is it easier to unit-test?
PS: I am aware of this question.
If you have a web application, use Solr - I've tried integrating both, and Solr is easier. Otherwise, if you don't need Solr's features (the one that comes to mind as being most important is faceted search), then use Lucene.
If you want to completely embed your search functionality within your application and do not want to maintain a separate process like Solr, using Lucene is probably preferable. For example, a desktop application might need some search functionality (like the Eclipse IDE, which uses Lucene for searching its documentation). You probably don't want that kind of application to launch a heavy process like Solr; a minimal embedded setup is sketched below.
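Here is a minimal sketch of what such an embedded setup looks like, written against the Lucene 4.x API (the index directory and field names are only illustrative):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    import java.io.File;

    public class EmbeddedSearch {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(new File("search-index")); // illustrative path
            StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_47);

            // Index a single document, all in-process
            IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_47, analyzer));
            Document doc = new Document();
            doc.add(new TextField("content", "How to embed Lucene in a desktop application", Field.Store.YES));
            writer.addDocument(doc);
            writer.close();

            // Search it, also in-process: no server, no HTTP
            DirectoryReader reader = DirectoryReader.open(dir);
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser(Version.LUCENE_47, "content", analyzer).parse("embed");
            TopDocs hits = searcher.search(query, 10);
            System.out.println("Hits: " + hits.totalHits);
            reader.close();
        }
    }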
Here is one situation where I have to use Lucene.
Given a set of documents, find out the most common terms in them.
Here, I need to access the term vectors of each document (using low-level APIs such as TermVectorMapper). With Lucene it's quite easy; a rough sketch follows below.
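As a rough illustration (written against the Lucene 3.x API, where TermFreqVector/TermVectorMapper live; they were replaced in 4.0, and the field is assumed to have been indexed with term vectors enabled), summing term frequencies across all documents might look like this:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermFreqVector;
    import org.apache.lucene.store.Directory;

    import java.util.HashMap;
    import java.util.Map;

    public class CommonTerms {

        // Sums term frequencies of the "content" field across all live documents
        public static Map<String, Integer> termTotals(Directory dir) throws Exception {
            IndexReader reader = IndexReader.open(dir);
            Map<String, Integer> totals = new HashMap<String, Integer>();
            try {
                for (int docId = 0; docId < reader.maxDoc(); docId++) {
                    if (reader.isDeleted(docId)) {
                        continue;
                    }
                    TermFreqVector tfv = reader.getTermFreqVector(docId, "content");
                    if (tfv == null) {
                        continue; // this doc has no term vector for the field
                    }
                    String[] terms = tfv.getTerms();
                    int[] freqs = tfv.getTermFrequencies();
                    for (int i = 0; i < terms.length; i++) {
                        Integer prev = totals.get(terms[i]);
                        totals.put(terms[i], (prev == null ? 0 : prev) + freqs[i]);
                    }
                }
            } finally {
                reader.close();
            }
            return totals;
        }
    }

Sorting the resulting map by value then gives the most common terms.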
Another use case is very specialized ordering of search results. For example, I want a search for an author name (who has written multiple books) to return one book from each store in the first 10 results. In this case, I will find results from each book store and, to build the final results, pick one result from each book store. Here you are essentially doing multiple searches to generate the final result set. Having access to the low-level APIs of Lucene definitely helps.
One more reason to go for Lucene used to be getting new goodies ASAP. This is no longer true, as the two projects have been merged and releases are now synchronized.
I'm surprised nobody mentioned NRT - Near Real Time search, available with Lucene, but not with Solr (yet).
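To make the NRT point concrete, here is a minimal sketch against the Lucene 4.x API: a reader opened directly from the IndexWriter sees recently added documents without a commit.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class NrtExample {
        public static void main(String[] args) throws Exception {
            RAMDirectory dir = new RAMDirectory();
            IndexWriter writer = new IndexWriter(dir,
                    new IndexWriterConfig(Version.LUCENE_47, new StandardAnalyzer(Version.LUCENE_47)));

            Document doc = new Document();
            doc.add(new TextField("content", "near real time search", Field.Store.NO));
            writer.addDocument(doc); // not committed yet

            // Open an NRT reader straight from the writer: the uncommitted document is already visible
            DirectoryReader reader = DirectoryReader.open(writer, true);
            IndexSearcher searcher = new IndexSearcher(reader);
            int hits = searcher.search(new TermQuery(new Term("content", "near")), 10).totalHits;
            System.out.println("NRT hits: " + hits); // prints 1 even though nothing was committed

            reader.close();
            writer.close();
        }
    }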
Use Solr if you are more concerned about scalability than performance and use Lucene if you are more concerned about performance than scalability.
