Building document store with search capability - java

I need to create a document store with search capabilities. Sounds simple...
That means I have documents which I need to store in a database. I thought about CouchDB and a few other document-oriented databases, but I'm still not sure which would be the best solution.
On the other hand, I thought about integrating Solr into some kind of web application which I would use to upload, index, search, update, and delete documents.
And, of course, the main problem is that most of these documents are written using Cyrillic characters.
Maybe I'm trying to combine things that don't fit together.
Could someone give me advice on the best way to implement a solution like this?
Best,
Joksimovic

Brother Serb/Montenegrin :)
I suggest you use MongoDB as your database and Solr for indexing/search capability.
I used Solr in my previous (government tender) project and it's great:
no bugs, easy to use once you get into it, and blindingly fast.
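To give an idea of the Solr side from Java, here is a minimal SolrJ sketch that indexes a document containing Cyrillic text. The server URL and field names are assumptions, and it assumes a SolrJ version where HttpSolrServer is available (older releases call it CommonsHttpSolrServer); Solr itself is UTF-8 throughout, so Cyrillic content is not a problem.

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexExample {
    public static void main(String[] args) throws Exception {
        // URL of a running Solr instance (assumed; adjust to your setup)
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");                    // hypothetical field names
        doc.addField("title", "Документ на ћирилици"); // Cyrillic text indexes fine
        doc.addField("body", "Садржај документа...");

        server.add(doc);
        server.commit();
    }
}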

It looks like Thinking Sphinx could help for your needs. You could store documents in any database (SQL-oriented or not) and search them with Sphinx.
Sphinx supports Cyrillic characters out of the box, and it's also possible to use stemming, faceted search, fuzzy search, etc. Maybe that helps you.
Read more about Sphinx here

I am also working on such a content management system. So far my plan is to use a database to store the metadata,
and to store the documents themselves on the file system.
Don't store the documents in a database like SQL Server, since that has size limitations and licensing costs. For search you can use Solr (better than Sphinx in terms of support and acceptance in the open-source world):
Choosing a stand-alone full-text search server: Sphinx or SOLR?
Either way, you need to populate the indexes and then call API methods to search.
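As an illustration of that split, here's a minimal sketch that writes the document bytes to disk and only the metadata to a relational table; the root directory, table, and column names are all made up:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.UUID;

public class DocumentStore {
    private final Path root = Paths.get("/var/docs"); // hypothetical storage root
    private final Connection db;

    public DocumentStore(Connection db) {
        this.db = db;
    }

    public String save(String fileName, InputStream content) throws Exception {
        String id = UUID.randomUUID().toString();
        Path target = root.resolve(id);
        Files.copy(content, target); // document body goes to the file system

        // only metadata goes to the database (hypothetical table/columns)
        PreparedStatement ps = db.prepareStatement(
                "INSERT INTO documents (id, file_name, path) VALUES (?, ?, ?)");
        try {
            ps.setString(1, id);
            ps.setString(2, fileName);
            ps.setString(3, target.toString());
            ps.executeUpdate();
        } finally {
            ps.close();
        }
        return id;
    }
}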

Java crawl web and store in cassandra

I have a Java project for which I'd like to use a pre-built web crawler that gives me enough flexibility to control which URLs are crawled, and then, once the crawler produces output, to control where to put it (Cassandra with my own schema).
The big picture is that I want to feed in a list of URLs (Google and Bing searches) and then filter the URLs that are returned. I want it to then crawl the filtered URLs (I may possibly want to change the URL query string, but that's not a hard requirement). I want to take the resulting HTML, parse it using Tika, then pull the data out and store it.
I'm looking at Apache Droids; it seems like a good fit since it appears to do everything I've mentioned, but there isn't any real documentation. I'd consider Nutch or Heritrix, but their use cases seem to be more of a full solution, and after skimming I don't see anything that talks about how to do what I want.
Does anyone have any experience with this type of thing? I mostly need some recommendations, but if you know of examples doing this sort of thing, that'd be nice as well, since I'm still pretty new to Java.
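As background on the Tika step, here's a minimal sketch of extracting text and metadata from fetched HTML; the URL is a placeholder, and AutoDetectParser handles HTML among many other formats:

import java.io.InputStream;
import java.net.URL;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class TikaExample {
    public static void main(String[] args) throws Exception {
        // placeholder URL; in the real pipeline this would be the crawler's output
        InputStream html = new URL("http://example.com/").openStream();
        BodyContentHandler handler = new BodyContentHandler(); // note: default 100KB write limit
        Metadata metadata = new Metadata();
        new AutoDetectParser().parse(html, handler, metadata, new ParseContext());
        System.out.println(metadata.get("title")); // extracted metadata
        System.out.println(handler.toString());    // plain-text body, ready to store
    }
}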
I wouldn't say Droids is a well-established framework yet. If you compare it to Nutch, which has a lot of history behind it, I would expect it to be less stable and less documented. I have no experience with Droids, though.
As far as storing data in Cassandra, I would recommend either Astyanax (https://github.com/Netflix/astyanax) or Hector (https://github.com/hector-client/hector).
I have used Hector extensively in the last year and have found it extremely simple and easy to use. Development is faster in Hector than in its predecessors (pure Thrift, Pelops), but Hector is still flexible enough to let you do the nitty-gritty things you'd expect from raw Thrift.
Recently I have also been eyeing Astyanax, as it is developed/supported by a larger team and tested on a larger scale, which is important for my current field of work. However, Hector is usually faster at implementing new features for new Cassandra releases, so both libraries have their benefits.
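For a flavor of the Hector API, here's a minimal sketch that writes a crawled page into Cassandra; the cluster address, keyspace, and column family names are all assumptions:

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class PageStore {
    public static void main(String[] args) {
        // connect to a local Cassandra node (host/port assumed)
        Cluster cluster = HFactory.getOrCreateCluster("test-cluster", "localhost:9160");
        Keyspace keyspace = HFactory.createKeyspace("crawler", cluster); // hypothetical keyspace

        StringSerializer ss = StringSerializer.get();
        Mutator<String> mutator = HFactory.createMutator(keyspace, ss);

        // row key = URL; columns = extracted fields ("pages" column family is assumed)
        mutator.addInsertion("http://example.com/", "pages",
                HFactory.createStringColumn("title", "Example Domain"));
        mutator.addInsertion("http://example.com/", "pages",
                HFactory.createStringColumn("body", "...parsed text from Tika..."));
        mutator.execute();
    }
}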

Querying in Solr

I want to know what query classes Solr uses for querying, and what the difference is between querying with Lucene and with Solr.
I am not sure what you are asking, but SOLR is basically a search/indexing server. It has an external HTTP-based API for sending documents to be indexed and for searching them.
One of the core pieces of SOLR is Lucene. This is the library that actually indexes/searches stuff.
If you need the API/query info for SOLR (which should mirror that of Lucene very closely), look on lucene.apache.org
Solr gives you a distributed search engine that is exposed as a web service to your client application. If you are asking how to use it on the client side, just look at the SolrJ API. If you are asking about internal SOLR APIs and classes, then you could start from the QueryComponent class, e.g. http://lucene.apache.org/solr/api/org/apache/solr/handler/component/QueryComponent.html.
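To show the client-side (SolrJ) route, here's a minimal query sketch; the server URL and field names are assumptions, and the query syntax is the same field:value form shown in the URL examples further below:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class QueryExample {
    public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr"); // assumed URL
        SolrQuery query = new SolrQuery("name:test"); // field:value query syntax
        query.setRows(10);
        QueryResponse response = server.query(query);
        for (SolrDocument doc : response.getResults()) {
            System.out.println(doc.getFieldValue("id")); // hypothetical field
        }
    }
}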
Lucene is the technology used by Solr to perform searches.
I'm not 100% sure what you are asking, but if it's "how do I query Solr", then you simply visit or curl a URL; the URL contains the Solr query, e.g.
price:[0 TO 1000]
or
name:test
The first part (before the colon) is the field, and the second part is the search, which can be text, a numeric range, etc.
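For instance, a complete query URL (host, port, and field names assumed) might be:
http://localhost:8983/solr/select?q=price:[0+TO+1000]&wt=json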
There is plenty of documentation regarding this on Solr's wiki.
Let me know what your actual problem is and I'll gladly help.

What's the best way to implement a simple document management system?

I am planning to build a simple document management system, preferably built around the Java platform. Are there any best practices around this? The requirements are:
Ability to upload documents
Ability to Tag documents
Version the documents
Comment on documents
There are a couple of options that I am currently considering. The first would be a simple API on top of SVN or CVS, with a DB backend to track tags, uploader, comments, etc.
Another option is to use the filesystem: version the documents as copies in a versions folder and work with filenames.
Or, if there is an open, non-GPL'ed document management system, we could customize it to our needs and package it in our application. Does anybody have any experience building something like this?
You may want to take a look at the Content Repository API for Java (JCR) and its several implementations (some of them free).
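To illustrate, here's a minimal versioning sketch against the JCR API using Apache Jackrabbit's TransientRepository (JCR 1.x-style calls; node names, property names, and credentials are made up):

import javax.jcr.Node;
import javax.jcr.Repository;
import javax.jcr.Session;
import javax.jcr.SimpleCredentials;
import org.apache.jackrabbit.core.TransientRepository;

public class JcrExample {
    public static void main(String[] args) throws Exception {
        Repository repository = new TransientRepository(); // local, throwaway repository
        Session session = repository.login(
                new SimpleCredentials("admin", "admin".toCharArray()));
        try {
            // create a versionable node for a document
            Node doc = session.getRootNode().addNode("mydoc");
            doc.addMixin("mix:versionable");
            doc.setProperty("title", "First draft");
            session.save();
            doc.checkin();   // freeze version 1.0

            doc.checkout();  // start a new version
            doc.setProperty("title", "Second draft");
            session.save();
            doc.checkin();   // freeze version 1.1
        } finally {
            session.logout();
        }
    }
}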
Take a look at the many document-oriented database systems out there. I can't speak to MongoDB or any of the others, but my experience with CouchDB has been fantastic.
http://couchdb.apache.org/
The best part is that you communicate with it via a REST protocol.
The best way is to reuse the efforts of others. This particular wheel has been reinvented quite a few times.
Who will use this, and for what purpose?

Situations to prefer Apache Lucene over Solr?

There are several advantages to using Solr 1.4 (out-of-the-box faceted search, grouping, replication, HTTP administration vs. Luke, ...).
Even if I embed search functionality in my Java application, I could use SolrJ to avoid the HTTP overhead of using Solr. Is SolrJ recommended at all?
So, when would you recommend using "pure Lucene"? Does it have better performance or require less RAM? Is it more unit-testable?
PS: I am aware of this question.
If you have a web application, use Solr - I've tried integrating both, and Solr is easier. Otherwise, if you don't need Solr's features (the one that comes to mind as being most important is faceted search), then use Lucene.
If you want to completely embed your search functionality within your application and do not want to maintain a separate process like Solr, using Lucene is probably preferable. For example, a desktop application might need some search functionality (like the Eclipse IDE, which uses Lucene for searching its documentation). You probably don't want that kind of application to have to launch a heavyweight process like Solr.
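To make the embedded case concrete, here's a minimal index-and-search sketch against the Lucene 3.x API (an in-memory index, and the field name is made up):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class EmbeddedSearch {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory(); // in-memory; use FSDirectory in a real app
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);

        // index a single document
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_36, analyzer));
        Document doc = new Document();
        doc.add(new Field("body", "embedded search inside a desktop app",
                Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();

        // search it, all in-process, no server involved
        IndexSearcher searcher = new IndexSearcher(IndexReader.open(dir));
        TopDocs hits = searcher.search(
                new QueryParser(Version.LUCENE_36, "body", analyzer).parse("desktop"), 10);
        System.out.println("hits: " + hits.totalHits);
        searcher.close();
    }
}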
Here is one situation where I have to use Lucene.
Given a set of documents, find out the most common terms in them.
Here, I need to access the term vectors of each document (using the low-level TermVectorMapper API). With Lucene it's quite easy.
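As a rough illustration of this kind of low-level access, here's a sketch using IndexReader's term enumeration (Lucene 3.x) to find the term that appears in the most documents; a TermVectorMapper-based version would work per document instead, and the index path is a placeholder:

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.FSDirectory;

public class CommonTerms {
    public static void main(String[] args) throws Exception {
        // "index" is a hypothetical path to an existing Lucene index
        IndexReader reader = IndexReader.open(FSDirectory.open(new File("index")));
        TermEnum terms = reader.terms();
        Term best = null;
        int bestFreq = 0;
        while (terms.next()) {
            if (terms.docFreq() > bestFreq) { // docFreq = number of docs containing the term
                bestFreq = terms.docFreq();
                best = terms.term();
            }
        }
        System.out.println(best + " appears in " + bestFreq + " documents");
        terms.close();
        reader.close();
    }
}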
Another use case is very specialized ordering of search results. For example, I want a search for an author name (who has written multiple books) to return one book from each store in the first 10 results. In this case, I find results from each book store and, to build the final results, pick one result from each store. Here you are essentially doing multiple searches to generate the final results; having access to Lucene's low-level APIs definitely helps.
One more reason to go for Lucene used to be getting new goodies ASAP. This is no longer true, as the two projects have been merged and there are synchronized releases.
I'm surprised nobody mentioned NRT - Near Real Time search, available with Lucene, but not with Solr (yet).
Use Solr if you are more concerned about scalability than performance and use Lucene if you are more concerned about performance than scalability.

SOLR and PHP help needed

I have understood how to add XML files to SOLR and how to search them via the SOLR admin interface...
However, I need to know how to make SOLR work with PHP and index MySQL records...
This is what I want to do:
I have a MySQL table which I would like to add to SOLR (index it), so that instead of searching the MySQL table directly via PHP, I first take the query string and send it to SOLR; SOLR then sends back results in the form of IDs, and I use those IDs to query MySQL and fetch the proper records...
I have no clue how to communicate with SOLR using PHP; any help is appreciated!
Thanks
There's a good article here that will help you with the integration of PHP and SOLR:
http://www.ibm.com/developerworks/opensource/library/os-php-apachesolr/
There are a number of PHP interfaces to SOLR; that article references the PHP SOLR client:
http://code.google.com/p/solr-php-client/
but there's also this:
http://pecl.php.net/package/solr
I'd suggest that you start with DataImportHandler (http://wiki.apache.org/solr/DataImportHandler) for indexing the database and use one of the many Solr PHP clients (see the SolrPHP wiki page). Note that Solr also emits JSON responses, so if you are familiar with JSON, that may be the easiest way to get started.
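As a sketch, a DataImportHandler configuration for pulling rows out of a MySQL table might look like the following; the table, columns, and connection details are all made up:

<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb" user="dbuser" password="dbpass"/>
  <document>
    <!-- hypothetical "articles" table; map each column to a Solr field -->
    <entity name="article" query="SELECT id, title, body FROM articles">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
      <field column="body" name="body"/>
    </entity>
  </document>
</dataConfig>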
I've been there too, and it was the first time I found the Internet to be annoying! Maybe that was because I was in such a hurry to learn it in under a minute. Here's what I suggest:
1. Don't panic. Understanding how it works, or even just the implementation, takes more than a few seconds, so set some time aside for this.
2. Learn how to use JSON. You can use it to communicate across languages.
3. Check the Apache site.
