I originally tried posting a similar post to the elasticsearch mailing list (https://groups.google.com/forum/?fromgroups=#!topic/elasticsearch/BZLFJSEpl78) but didn't get any helpful responses so I though I'd give Stack Overflow a try. This is my first post on SO so apologies if it doesn't quite fit into the mould it is meant to.
I'm currently working with a university helping them to implement a test suite to further refine some research they have been conducting. Their research is based around dynamic schema searching. After spending some time evaluating the various open source search solutions I settled on elasticsearch as the base platform and I am wondering what the best way to proceed would be. I have spent about a week looking into the elasticsearch documentation and the code itself and also reading the documentation of Lucene but I am struggling to see a clear way forward.
The goal of the project is to provide the researches with a piece of software they can use to plugin revisions of the searching algorithm to test and refine. They would like to be able to write the pluggable algorithm in languages other then Java that is supported by the JVM like Groovy, Python or Closure but that isn't a hard requirement. Part of that will be to provide them with a front end to run queries and see output and an admin interface to add documents to an index. I am comfortable with all of that thanks to the very powerful and complete REST API. What I am not so sure about is how to proceed with implementing the pluggable search algorithm.
The researcher's algorithm requires 4 inputs to function:
The query terms(s).
A Word (term) x Document matrix across a index.
A Document x Word (term) matrix across a index.
A Word (term) frequency list across a index. That is how many times each word appears across the entire index.
For their purposes, a document doesn't correspond to an actual real-world document (they actually call them text events). Rather, for now, it corresponds to one sentence (having that configurable might also be useful). I figure the best way to handle this is to break down documents into their sentences (using Apache Tika or something similar), putting each sentence in as its own document in the index. I am confident I can do this in the Admin UI I provide using the mapper-attachement plugin as a starting point. The downside is that breaking up the document before giving it to elasticsearch isn't a very configurable way of doing it. If they want to change the resolution to their algorithm, they would need to re-add all documents to the index again. If the index stored that full documents as is and the search algorithm could chose what resolution to work at per query then that would be perfect. I'm not sure it is possible or not though.
The next problem is how to get the three inputs they require and pass it into their pluggable search algorithm. I'm really struggling where to start with this one. It seems from looking at Luecene that I need to provide my own search/query implementation, but I'm not sure if this is right or not. There also doesn't seem to be any search plugins listed on the elasticsearch site, so I'm not even sure if it is possible. The important things here are that the algorithm needs to operate at the index level with the query terms available to generate its schema before using the schema to score each document in the index. From what I can tell, this means that the scripting interface provided by elasticsearch won't be of any use. The description of the scripting interface in the elasticsearch guide makes it sound like a script operates at the document level and not the index level. Other concerns/considerations are the ability to program this algorithm in a range of languages (just like the scripting interface) and the ability to augment what is returned by the REST API for a search to include the schema the algorithm generated (which I assume means I will need to define my own REST endpoint(s)).
Can anybody give me some advice on where to get started here? It seems like I am going to have to write my own search plugin that can accept scripts as it's core algorithm. The plugin will be responsible for organising the 4 inputs that I outlined earlier before passing control to the script. It will also be responsible for getting the output from the script and returning it via it's own REST API. Does this seem logical? If so, how do I get started with doing this? What parts of the code do I need to look it?
You should store 1 sentence per document if that's how their algorithm works. You can always reindex if they change their model.
Lucene is pretty good at finding matches, so I suspect your co-workers' algorithm will be dealing with scoring. ElasticSearch supports custom scoring script. You can pass params to a given scoring script. You can use groovy for scripting in ES.
http://www.elasticsearch.org/guide/reference/modules/scripting.html
To use larger datastructures in your search algorithm, it does not make sense to pass those datastructures as params, you might find it useful to use other datasources in scoring script.
For example Redis: http://java.dzone.com/articles/connecting-redis-elasticsearch .
There are several advantages to use Solr 1.4 (out-of-the-box facetting search, grouping, replication, http administration vs. luke, ...).
Even if I embed a search-functionality in my Java application I could use SolrJ to avoid the HTTP trade-off when using Solr. Is SolrJ recommended at all?
So, when would you recommend to use "pure-Lucene"? Does it have a better performance or requires less RAM? Is it better unit-testable?
PS: I am aware of this question.
If you have a web application, use Solr - I've tried integrating both, and Solr is easier. Otherwise, if you don't need Solr's features (the one that comes to mind as being most important is faceted search), then use Lucene.
If you want to completely embed your search functionality within your application and do not want to maintain a separate process like Solr, using Lucene is probably preferable. Per example, a desktop application might need some search functionality (like the Eclipse IDE that uses Lucene for searching its documentation). You probably don't want this kind of application to launch a heavy process like Solr.
Here is one situation where I have to use Lucene.
Given a set of documents, find out the most common terms in them.
Here, I need to access term vectors of each document (using low-level APIs of TermVectorMapper). With Lucene it's quite easy.
Another use case is for very specialized ordering of search results. For exmaple, I want a search for an author name (who has writen multiple books) to result into one book from each store in the first 10 results. In this case, I will find results from each book store and to show final results I will pick one result from each book store. Here you are essentially doing multiple searches to generate final results. Having access to low-level APIs of lucene definitely helps.
One more reason to go for Lucene was to get new goodies ASAP. This no longer is true as both of them have been merged and there will be synchronous releases.
I'm surprised nobody mentioned NRT - Near Real Time search, available with Lucene, but not with Solr (yet).
Use Solr if you are more concerned about scalability than performance and use Lucene if you are more concerned about performance than scalability.
I've been working on a site idea the general concept is a full text search of documents that also allows user ratings based on these rating I wanted to boost the item's value in the Lucene index. But I'm trying to find if I should extend JackRabbit or just build from the Lucene base. Is there any good way to extend JackRabbit in this way and effect the index or would it be best to work directly off Lucene?
Either way I go I am strongly leaning to using groovy on grails with either the searchable plugin or work directly with JackRabbit is there any major reasons I should just stick to Java?
Clarification:
I would like to boost an item based on the average user rating of an item, is JackRabbit open enough or expandable enough where I can capture user ratings then have those effect the index within JackRabbit or is it so far out of the core of JackRabbit I should just build up from Lucene?
I recommend using JCR, with the implementation of Jackrabbit behind it. JCR allows you to separate between what you store and how you store it.
By staying within a JCR framework, you should be able to easily switch among JCR implementations. (There are several, not just Apache's.) Even within Jackrabbit are many persistence managers, not just Lucene. This flexibility is useful when you want to trade off between storage space and performance.
JCR already includes full text searches and the ability to maintain user ratings. It should be a good fit for your project.
is there any major reasons I should just stick to Java?
Not really. As you probably already know, you can use any Java library with Groovy/Grails, so there's nothing you can do in Java that you can't do in Groovy. Although the contrary is also true, in my experience, it takes a lot more (boilerplate) code to get things done in Java.
Although Java is considerable faster than Groovy, this doesn't necessarily mean your app will be faster if written in Java, as the bottleneck could likely be the database rather than code execution.
As for whether you should use Lucene/Searchable or JackRabbit, it's very difficult to say without knowing much about what you can achieve. All you've told us so far is that you want to index documents and boost certain items in the index. You can certainly do both of those with Lucene.
I would recommend using JCR/Jackrabbit on top of Lucene for a couple of reasons:
1) Your repository structure could readily support document nodes with child nodes that store all of your meta-data including owner, ratings, flagging, comments, etc.
2) JCR is ideal for document/node based app development, providing a lot of the heavy lifting at the framework level while not getting in your way at the app level.
I would recommend you to use Apache Sling, it comes with Jackrabbit/Lucene built-in.
Most of the committers are also involved with Jackrabbit, so it's designed to work well with it -- even better, it's designed to run on top of it.
One of the nice features of Sling is that it mounts the entire JCR repository in the URL space and exposes it via REST endpoints.
So you can access your documents/metadata very easily by doing a simple HTTP request to it. It also allows you to write your own servlets and expose them as REST endpoints. (This is extremely easy -- no fiddling about with applicationContext.xml files, just 1 annotation)
It also allows you to write jsp, esp, groovy, ...
I can't seem to find any recent talk on the choice. Back in '06 there was criticism on Hibernate Search as being incomplete and not being ready to compete with Compass, is it now? Has anyone used both and have some perspective on making the decision.
I am developing a web app in Java in my free time, its just me so I'm looking to cut corners everywhere possible while minimizing the effect on the end product. Having said that the searching capabilities of my project are priority one! I have spent a good deal of time making the database model to back the system. The ability to get the user what they're looking for is what will set my app apart. So, speed is expendable ...obviously to a reasonable degree.
Here are my current thoughts on the technologies for this app, and if you see any glaring newb mistakes be gentle ...I'm an expert at nothing.
DB: PostgreSQL
Platform: Java
Frameworks: Spring, Hibernate, Seam
Obviously, I've chosen all free (as in beer) technologies and ones that as far as I can tell play nice together. So what do you guys think, Compass or Hibernate Search to round things out?
-Nomad311
<Careful. Biased person here: I am the project lead of Hibernate Search and author of Hibernate Search in Action by Manning>
If you are targeting Hibernate as your persistence provider, I think you are better off using Hibernate Search as the integration is very smooth (configuration, entity discovery down the the same APIs and programamtic model).
If you want to index a lot of "stuffs" that are not in your database, then Compass is a better fit.
We are working on Hibernate Search 3.2 at the moment: our road map is here
Hibernate Search is a complete product, and it's based on Lucene, which is one of the fastest open source search engine out here.
As an example, some benchmarks:
http://developers.slashdot.org/story/09/07/06/131243/Open-Source-Search-Engine-Benchmarks
Plus, it's fully integrated with Seam and Hibernate (look at the example in the Seam dist).
I suggest you to be more specific about:
Hibernate Search as being incomplete
I'd like to know in which part it is incomplete.
Compass is no more as of elasticsearch is new one after compass. So I think better to use some stable one. (Possibly Hibernate Search)
My developers are waging a civil war. In one camp, they've embraced Hibernate and Spring. In the other camp, they've denounced frameworks - they're considering Hibernate though.
The question is: Are there any nasty surprises, weaknesses or pit-falls that newbie Hibernate-Spring converts are likely to stumble on?
PS: We've a DAO library that's not very sophisticated. I doubt that it has Hibernate's richness, but it's reaching some sort of maturity (i.e. it's not been changed in the last few projects it's included).
They've denounced frameworks?
That's nuts. If you don't use an off-the-shelf framework, then you create your own. It's still a framework.
I've used Hibernate a number of times in the past. Each time I've run into edge cases where determining the syntax devolved into a scavenger hunt through the documentation, Google, and old versions. It is a powerful tool but poorly documented (last I looked).
As for Spring, just about every job I've interviewed for or looked at in the past few years involved Spring, it's really become the de-facto standard for Java/web. Using it will help your developers be more marketable in the future, and it'll help you as you'll have a large pool of people who'll understand your application.
Writing your own framework is tempting, educational, and fun. Not so great on results.
Hibernate has quirks to be sure but that is because the problem it is trying to solve is complex. Every time someone complains about Hibernate I remind them of all of the boring DAO code that they would have to maintain if they weren't using it.
A few tips:
Hibernate is no substitute for a good database design. Hibernate schemas are OK but you will have to tweak them occasionally
Eventually you are going to have to understand how Hibernate lazy loads classes and how that affects things. Hibernate modifies the Java bytecode and you will need to delve into the depths sooner or later if only to explain why object links are null.
Use annotations if you can.
Take the time to learn the Hibernate performance tuning techniques, it will save you in the long run.
If you have a fairly complex database, Hibernate may not be for you. At work we have a fairly complex database with lots of data, and Hibernate doesn't really work for us. We've started using iBATIS instead. However, I know a lot of development shops who use Hibernate successfully - and it does do a lot of grunt work for you - so it's worth considering.
Spring is a good tool if you know how to use it properly.
I would say that frameworks are definitely a good thing - like others have pointed out, you don't want to reinvent the wheel. Spring contains a lot of modules which will mean you won't have to write so much code. Don't succumb to the "Not Invented Here" syndrome!
Lazy loading is the big gotcha in MVC applications that use Hibernate for their persistence framework. You load the object in the controller and pass it to the JSP view. Some or all of the members of the class are proxied and everything blows up because you Hibernate session was closed when the controller completed.
You will need to read the Open Session in View article to understand the problem and get a solution. If you are using Spring the this blog article describes the Spring solution to the open session in view issue.
This is one thing (I could remember) that I fell into when I was in my Hibernate days.
When you delete (several) child objects from a collection (in a parent entity) and then add new entities to the same collection in one transaction without flushing in the middle, Hibernate will do "insert" before "delete". If the child table has a unique constraint in one of its columns, and you are expecting that you would not violate it since you have already deleted some data before (just like I was), then get ready to be frustrated.
Hibernate forum suggests:
It was a DB design flaw, redesign;
flush (or commit if you will) in between the deletes and inserts;
I couldn't do both, and end up tweaking the Hibernate source and recompiling. It was only 1 line of code. But the effort to find that one line was equal to approximately 27 cups of coffee and 3 sleepless nights.
This is just one example of problems and quirks you might end up when using Hibernate with no real expert on your team (expert: someone with adequate knowledge about the philosophy and internal working of Hibernate). Your problem, solution, litre of coffee, and sleepless night count may vary. But you get the idea.
I haven't worked much with Java but I did work in large groups of Java developers. The impression I've got was that Spring is OK. But everybody was upset at Hibernate. Half the team if asked "If you could change one thing, what would you change?" and they'd say "Get rid of Hibernate.". When I started to learn Hibernate it struck me at amazingly complex, but I didn't learn enough (thankfully I've moved along) to know if the complexity was justified or not (maybe it was require to solve some complex problems).
The team got rid of Spring in favor of Guice, but that was more like a political change, at least from my point of view and other developers I've talked to.
I have always found Hibernate to be a bit complex and hard to learn. But as JPA (Java Persistence API) and EJB (Enterprise Java Beans) 3.0 has existed for a while things have gotten a lot easier, I much prefer annotating my classes to create mappings via JavaDoc or XML. Check out the support in Hibernate. The added bonus is that it is possible (but not effortless) to change the database framework later on if needed. I have used OpenJPA with great results.
Lately I have been using JCR (Java Content Repository) more and more. I love the way that my modules can share a single data storage and that I can let the structure and properties evolve. I find it a lot easier working with nodes and properties rather that mapping my objects to a database. A good implementation is Jackrabbit.
As for Spring, it has a lot of features I like, but the amount of XML needed to configure means I will never use it. Instead I utilize Guice and absolutely love it.
To roundup, I would show your doubting developers how Hibernate will make their life easier. As for Spring I would seriously check if Guice is a viable alternative and then try to show how Spring/Guice makes development better and easier.
I've done a lot of Spring/Hibernate development. Over time the way people used both in combination has changed a bit. The original HibernateTemplate approach has proved to be difficult to debug since it swallows and wraps otherwise useful exceptions; talk to the Hiberante API directly!
Please keep looking at the generated SQL (configure your development logging to show SQL). Having an abstraction layer to the database doesn't mean you don't have to think in SQL anymore; you won't get good performance if you otherwise.
Consider the project. I've choosen iBatis over Hibernate on several occasions where we had stringent performance requirements, complex legacy schemas or good DBa's capable of writing excellent SQL.
As for Hibernate: a very good tool for application which deals with a rapidly changing database schema, a large amount of tables, do lots of simple CRUD operations. Reports with complex queries involved are rather less well handled. But in these case I prefer mixing in JDBC or native queries. So, for a short answer: I do think time spent learning Hibernate is a good investment (they say it is compliant with EJB3.0 and JPA standards, also, but that didn't come into the equation when I evaluated it for my personal use).
As for Spring... see The Bile Blog :)
Remember: frameworks are not silver bullets, but you should not reinvent the wheel either.
I find it really helps to use well-known frameworks such as Hibernate because it fits your code into a specific mold, or a way of thinking. Meaning, since you're using Hibernate, you write code a certain way, and most if not all developers who know Hibernate will be able to follow your line of thinking quite easily.
There's a downside to this, of course. Before you become a hot shot Hibernate developer, you're going to find that you're trying to fit a square into a circular hole. You KNOW what you want to do, and how you were supposed to do it before Hibernate came into the picture, but finding the Hibernate way of doing it may take... quite a bit of time.
Still, for companies that frequently hire consultants (who need to understand a lot of source code in a short amount of time) or where the developers sign on and quit frequently, or where you just don't want to bet that your key developers will stay forever and never change jobs -- Hibernate and other standard frameworks are a pretty good idea I think.
/Ace
Spring and Hibernate are frameworks that are tricky to master. It may not be a good idea to use them in projects with tight deadlines while you're still trying to figure out the frameworks.
The benefits of the frameworks is basically to try to provide a platform to allow for consistent codes to be products. From experience, you'd be well advised to have developers experienced with the frameworks setting in place best practices.
Depending on the design of your application and/or database, there are also quirks that you'll need to circumvent to ensure that the frameworks do not hinder performance.
In my opinion, the biggest advantage of Spring is that it encourages and enables better development practices, in particular loose coupling, testing, and more interfaces. Hibernate without Spring can be really painful, but the two together are very useful.
Retrofitting an existing project to any framework is going to be painful, but the refactoring process often has serious benefits for long-term maintainability.
I have to agree with many posts on this one. I've used both, extensively, in a variety of settings. If I could undo a design decision it would be to have used Hibernate. We actually budgeted a release in one of our products to swap Hibernate for iBatis and Spring-JDBC for a best-of-all-worlds approach. I can have a new developer get up to speed using Spring-JDBC, Spring-MVC, Spring-Ioc, and iBatis faster than if I just tasked them with Hibernate.
Hibernate is just too complicated for this KISS developer. And heaven help you with hibernate if your DBA sees the generated SQL the database sees and sends you back with optimized versions.
The top answer mentions that Hibernate is poorly documented. I agree that the online reference manual could be more complete. However, a book written by Hibernate's authors, 'Java persistence with Hibernate' is a must-read for every Hibernate user and very complete.
#slim - I am with you again this morning.
It sounds like a classic case of Not Invented Here Syndrome. If they aren't keen on spring, they should consider other options rather than rolling their own framework (whether they acknowledge doing it or not). Guice comes to mind as an possibility. Also picocontainer. There are others out there, depending on what you need.
Spring and Hibernate definitely make life easier.
Getting started with them might be a little time-consuming at the beginning, but you'll certainly benefit from it later. Now the XML is being replaced by annotations, you don't need to type hundreds of lines of XML either.
You may want to consider AppFuse to reduce your learning-curve: generate an application, study and adapt it, and off you go.
Frameworks are not evil. even the Java SDK is a framework.
What they probably fight is framework proliferation. You shouldn't bring a framework to a project just for the kick of it, it should bring consistent value in a reasonable time. Every framework requires a learning curve, but should reward you with increased productivity and features later on.
If you struggle with code that is hard to debug because of inconsistent database usage, complicated cache mechanisms, or a myriad of other reasons. Hibernate will add great value.
apart from the learning curve (which took about 1 month of practical work for me) there weren't any pitfalls, provided you have someone around to explain the basics for you.