Implementing simple document management

My question is: how would you go about implementing a simple DMS (document management system) based on the following requirements?
1. The DMS should be a distributed web application.
2. Support for document versioning.
3. Support for document locking.
4. Document search.
I'm already clear on which technologies I want to use: Spring MVC, Hibernate, and a relational database (most likely MySQL).
One thing I'm not very clear on is whether I need to use WebDAV, since I could just upload and download documents. I think I have to, because I need to accomplish point 2 and especially point 3 somehow. Is this the right way to go?
Any examples or experience with this would come in very handy :). Maybe Milton is not the best library to pick for WebDAV?

@Eduard, regarding dependencies on third parties: are you doing this as a college/university exercise, or is it something that will affect real users in a production environment?
At the risk of sounding very pretentious: don't reinvent the wheel! I'd definitely second the call to use JCR; this way you are depending on a standard and not on a third-party implementation.
JCR is a well-defined standard, which means a lot of people invested commercial effort (cash and expertise, in huge amounts) into it. I would seriously reconsider looking into JCR; think of it as an API for which third parties provide the implementation (no vendor lock-in).
Have a look at the features you'll get out of the box; I believe 99 - 110% of the functionality you require is available through a JCR implementation. Plus, you'll benefit from the fact that the code you'll be using has been tested by hundreds of people in real-world situations.
Where I'd differ from bmscomp is in suggesting JackRabbit: http://jackrabbit.apache.org/

Option 1:
I am not sure about WebDAV; I have no real experience with it. But I would highly recommend using a document database like MongoDB.
With MongoDB, you can:
1. Handle document versions.
2. Use MongoDB's atomic operations to build your own document-locking logic.
This also gives you the added benefit of being able to search your document store.
Option 2:
Apache Jackrabbit: a content repository
A content repository is a hierarchical content store with support for structured and unstructured content, full text search, versioning, transactions, observation, and more.
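To make the locking idea from Option 1 a bit more concrete, here is a minimal sketch of "lock via one atomic operation". It is only an illustration: it uses an in-memory ConcurrentHashMap as a stand-in for the database, and the class and method names (DocumentLocks, tryLock, unlock) are made up for this example. With a real store, the same check-and-set would have to be a single atomic update on the server side, which is not shown here.

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** In-memory stand-in for a lock table (hypothetical; a real DMS would persist this). */
final class DocumentLocks {
    // documentId -> user currently holding the lock
    private final ConcurrentMap<String, String> owners = new ConcurrentHashMap<>();

    /** Atomically takes the lock; returns true if the user holds it afterwards. */
    boolean tryLock(String documentId, String user) {
        // putIfAbsent is one atomic step, so two users cannot both win the race.
        String current = owners.putIfAbsent(documentId, user);
        return current == null || current.equals(user);
    }

    /** Releases the lock only if this user is the one holding it. */
    boolean unlock(String documentId, String user) {
        return owners.remove(documentId, user);
    }

    /** Returns the current lock holder, if any. */
    Optional<String> owner(String documentId) {
        return Optional.ofNullable(owners.get(documentId));
    }
}
```

An upload handler would call tryLock before accepting a new version and reject the request when it returns false.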

Think about using JCR (Content Repository API for Java):
http://en.wikipedia.org/wiki/Content_repository_API_for_Java
You can also have a look at the work done by Alfresco and the Exo framework; they did a good job.

You can use these open source projects to meet your requirements:
http://sourceforge.net/projects/logicaldoc/ - LogicalDOC is a modern document management system with a nice interface, easy to use and very fast. It uses open source Java technologies such as GWT, Spring, Lucene in order to provide a flexible and scalable DMS platform. http://www.logicaldoc.com
http://sourceforge.net/projects/openkm/ - OpenKM Document Management (DMS), updated 2011-05-25. OpenKM is a powerful, scalable Document Management System (DMS). OpenKM uses JBoss + J2EE + Ajax web (GWT) + Jackrabbit (Lucene) open source technologies. http://www.openkm.com/

Spring MVC is a good choice. If you want to use a relational database, you can also check out DataNucleus. At least the JDO layer (plus maybe the JPA layer) provides versioning support. For search I recommend Apache Solr, based on Lucene, which has excellent and powerful full-text search capabilities.
Although WebDAV seems like the natural choice as a simple, cross-platform file transfer protocol, I never had good experiences with it. Either the client or the server didn't work well (Konqueror, Internet Explorer, Zope 2, ...). So abstract away from the protocol and provide multiple ways to access the files.

Related

Necessary and essential features that a Java framework must include for testing, development, and production

Maybe this question is too broad and hard to answer at the moment.
But when I went through the different frameworks emerging one after another, like
Hadoop Distributed File System, HBase, Hive, Cassandra, Hypertable, Amazon S3, BigTable, DynamoDB, MongoDB, Redis, Riak, Neo4J, Stripes, Wicket, Compojure, Conjure, Grails, JRoR, JSF, Lift, Netty, Noir, Play, Scalatra, Seam, Sitemesh, Spark, Spring MVC, Struts, Tapestry, VRaptor, Vert.x, OpenXava,
this question keeps nagging at me.
Each framework has some unique features. Each one promises to solve a particular testing, development, or production need with respect to a growing number of users, data expansion, distributed computing, security, performance, and more.
But much of the functionality is common to all of them. In pursuit of one unique piece of functionality, we have to shift from one framework to another. As a Java developer, I would like to have the following features included in one framework:
Out-of-the-box support for unit and integration testing
Fast prototyping
Distributed multithreading, caching, logging, session management, modularization
Security extension
Framework extension
Easy integration with big data
Distributed data computation
Asynchronous operation
High performance
I would like to know what other features developers really want to have in one framework. What are the necessary and essential features that every framework must include? Please share your ideas.

The reason they differ is because there is no consensus on these matters. Depending on your background and expertise the answer will be different. Every project was started knowing full well what the alternatives were and not being content with them.
This question is useless I'm afraid.

I think XKCD describes it well (replace standard with framework):
Given a sufficiently complex problem, there is simply no way to solve it for all use cases and users.

The key to understanding this is not how much all those frameworks have in common, but that most of them cover only part of your needs. For instance, Hadoop and Stripes have nothing in common. Only a few frameworks claim to cover everything (Java EE and Spring, in fact), but in reality they just try to collect several unrelated technologies under one brand name.
The real domains are: presentation layer, data access layer and (arguably) something else.

What is JCR in Java and Spring?

I have been learning Java, Spring, Hibernate, and MVC for 3 months and have a pretty good idea of them. But I have not understood what JCR means.
For example, in my simple website built with Spring MVC, which part could be done with JCR?

JCR would be an alternative persistence mechanism used in place of JPA (Hibernate), which hides JDBC from your application. In theory, the Java classes you have in your model might remain the same as you have now. However, if any classes in your model came about only because you needed to model some lower-level data structures for JPA, then these classes might not be needed with JCR.
You'd need a good reason to replace an existing use of JPA with JCR. For example, you may have discovered that using JPA requires jumping through a lot of extra hoops and doing things you'd not really need to do.
Having said that, JCR certainly has some advantages and capabilities that are not otherwise found in JPA:
JCR supports structured data, unstructured data, and everything in between. JCR allows a flexible schema and can be very NoSQL-ish. JPA is very structured, with a fixed schema.
JCR is hierarchical - some use cases are extremely hierarchical, and doing that with a relational model can be very difficult/expensive
JCR has built-in events
Most JCR implementations can store content in a variety of systems. Some can even access and federate existing content in other systems.
No length limitation of string values
JCR has full-text search support
JCR has multiple query languages, including JCR-SQL2 (very SQL-like)
There are some libraries that map Java classes to your node structures, and thus are very similar to JPA/Hibernate
It all depends on whether these features are beneficial for your application.

Java Content Repository (JCR) tries to address these problems (and many others) in an implementation-independent way; that is, the API will be the same regardless of the underlying resource (e.g. a database, or a local or virtual file system). Sitting on top of the data storage, JCR offers content services like granular access control, versioning, content events, full-text search, and filtering, among others. With an impressive expert group behind JSR-170 led by Day Software, including Content Management Systems (CMS) vendors like Vignette, Hummingbird Ltd., and Stellent, and the usual Java-driven solution providers like BEA Systems, IBM, and Oracle, the specification is likely to become the de facto standard for content management and document storage.

Regarding the decision on when to use JCR vs. a relational data model, have a look at David's Model. It was an eye-opener for me.

What's the best way to implement a simple document management system?

I am planning to build a simple document management system, preferably built around the Java platform. Are there any best practices around this? The requirements are:
Ability to upload documents
Ability to Tag documents
Version the documents
Comment on documents
There are a couple of options that I am currently considering. The first option would be a simple API on top of SVN or CVS, with a DB backend to track tags, uploader, comments, etc.
Another option is to use the filesystem: version the documents as copies in a versions folder and work with filenames.
Or, if there is an open, non-GPL'ed document management system, we could customize it to our needs and package it in our application. Does anybody have any experience building something like this?
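For the filesystem option, a rough sketch of "versions as copies in a versions folder" could look like the following. Everything here is an assumption made for illustration: the layout `<root>/versions/<name>/v<N>`, the class name FileVersionStore, and the premise that `name` is a plain file name and that nothing else writes into these folders. Tags, comments, and uploader would still need to live somewhere else, such as the DB backend mentioned above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

/** Keeps every revision of a document as a numbered copy (hypothetical layout). */
final class FileVersionStore {
    private final Path root;

    FileVersionStore(Path root) {
        this.root = root;
    }

    /** Stores the content as the next version and returns its number (starting at 1). */
    synchronized int save(String name, byte[] content) {
        try {
            Path dir = Files.createDirectories(versionsDir(name));
            int next = latestVersion(name) + 1;
            Files.write(dir.resolve("v" + next), content);
            return next;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the bytes of one stored version. */
    byte[] read(String name, int version) {
        try {
            return Files.readAllBytes(versionsDir(name).resolve("v" + version));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Highest stored version number, or 0 if the document has no versions yet. */
    int latestVersion(String name) {
        Path dir = versionsDir(name);
        if (!Files.isDirectory(dir)) {
            return 0;
        }
        try (Stream<Path> files = Files.list(dir)) {
            return files.map(p -> p.getFileName().toString())
                    .mapToInt(f -> Integer.parseInt(f.substring(1))) // strip the leading "v"
                    .max()
                    .orElse(0);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private Path versionsDir(String name) {
        return root.resolve("versions").resolve(name);
    }
}
```

The synchronized save only protects against races inside one JVM; several application servers sharing one folder would need an external lock.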

You may want to take a look at Content repository API for Java and the several implementations (some of them free).

Take a look at the many document-oriented database systems out there. I can't speak about MongoDB or any of the others, but my experience with CouchDB has been fantastic.
http://couchdb.apache.org/
The best part of it is that you communicate with it via a REST protocol.
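As a small illustration of what "via a REST protocol" means in practice, the sketch below builds the HTTP request that would store one document in CouchDB (a PUT to `/{database}/{docId}` with a JSON body). The helper name, the database name, and the document ID are hypothetical, the ID is assumed to need no URL encoding, and the code requires Java 11 or later for java.net.http. It only builds the request and does not send it.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

final class CouchDbRequests {
    /** Builds (but does not send) a PUT that creates or updates one document. */
    static HttpRequest putDocument(String baseUrl, String database, String docId, String json) {
        // Assumes database and docId contain no characters that need URL encoding.
        URI uri = URI.create(baseUrl + "/" + database + "/" + docId);
        return HttpRequest.newBuilder(uri)
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(json, StandardCharsets.UTF_8))
                .build();
    }
}
```

Against a local server on CouchDB's default port, the request would be built with `putDocument("http://localhost:5984", "documents", "spec-1", "{\"title\":\"Spec\"}")` and sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`.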

The best way is to reuse the efforts of others. This particular wheel has been invented quite a few times.
Who will use this and for what purpose?

Situations to prefer Apache Lucene over Solr?

There are several advantages to using Solr 1.4 (out-of-the-box faceted search, grouping, replication, HTTP administration vs. Luke, ...).
Even if I embed search functionality in my Java application, I could use SolrJ to avoid the HTTP trade-off when using Solr. Is SolrJ recommended at all?
So, when would you recommend using "pure Lucene"? Does it perform better or require less RAM? Is it easier to unit test?
PS: I am aware of this question.

If you have a web application, use Solr - I've tried integrating both, and Solr is easier. Otherwise, if you don't need Solr's features (the one that comes to mind as being most important is faceted search), then use Lucene.

If you want to completely embed your search functionality within your application and do not want to maintain a separate process like Solr, using Lucene is probably preferable. For example, a desktop application might need some search functionality (like the Eclipse IDE, which uses Lucene for searching its documentation). You probably don't want this kind of application to launch a heavy process like Solr.

Here is one situation where I have to use Lucene.
Given a set of documents, find out the most common terms in them.
Here, I need to access term vectors of each document (using low-level APIs of TermVectorMapper). With Lucene it's quite easy.
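To show what the "most common terms" task computes, here is a plain-Java sketch that counts terms directly from raw text. It deliberately does not use Lucene: with Lucene you would read the counts from the stored term vectors instead of re-tokenizing, and the crude split on non-word characters below is only a stand-in for a real analyzer. The class name CommonTerms is made up for this example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

final class CommonTerms {
    /** Returns the n most frequent terms across all documents, most frequent first. */
    static List<String> top(List<String> documents, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : documents) {
            // Crude tokenizer: lowercase, then split on runs of non-word characters.
            for (String term : doc.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!term.isEmpty()) {
                    counts.merge(term, 1, Integer::sum);
                }
            }
        }
        return counts.entrySet().stream()
                .sorted((a, b) -> {
                    // Higher count first; ties are broken alphabetically.
                    int byCount = Integer.compare(b.getValue(), a.getValue());
                    return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
                })
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```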
Another use case is very specialized ordering of search results. For example, I want a search for an author name (an author who has written multiple books) to return one book from each store in the first 10 results. In this case, I run the search against each book store and then pick one result from each store to build the final list. Here you are essentially doing multiple searches to generate the final results. Having access to the low-level APIs of Lucene definitely helps.
One more reason to go for Lucene was to get new goodies ASAP. This no longer is true as both of them have been merged and there will be synchronous releases.

I'm surprised nobody mentioned NRT - Near Real Time search, available with Lucene, but not with Solr (yet).

Use Solr if you are more concerned about scalability than performance and use Lucene if you are more concerned about performance than scalability.

Extend JackRabbit or build up from Lucene?

I've been working on a site idea. The general concept is a full-text search of documents that also allows user ratings; based on these ratings, I want to boost the item's value in the Lucene index. But I'm trying to decide whether I should extend JackRabbit or just build from the Lucene base. Is there any good way to extend JackRabbit in this way and affect the index, or would it be best to work directly off Lucene?
Either way, I am strongly leaning toward using Groovy on Grails, with either the Searchable plugin or working directly with JackRabbit. Are there any major reasons I should just stick to Java?
Clarification:
I would like to boost an item based on its average user rating. Is JackRabbit open or extensible enough that I can capture user ratings and have them affect the index within JackRabbit, or is this so far outside the core of JackRabbit that I should just build up from Lucene?
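One way to picture the boost being asked about, independent of whether it ends up in JackRabbit or in Lucene, is a multiplier applied to the text relevance score. The formula below is only an assumption for illustration (a linear scale between 1 and 1 + weight); the names RatingBoost, boostedScore, and weight are hypothetical, and how the factor is wired into the index or the query is left open.

```java
final class RatingBoost {
    /** Average of the ratings, or 0 when nobody has rated the item yet. */
    static double average(int[] ratings) {
        if (ratings.length == 0) {
            return 0.0;
        }
        long sum = 0;
        for (int rating : ratings) {
            sum += rating;
        }
        return (double) sum / ratings.length;
    }

    /**
     * Scales the text relevance by a factor between 1 and (1 + weight),
     * depending on how close the average rating is to the maximum rating.
     */
    static double boostedScore(double relevance, double averageRating,
                               double maxRating, double weight) {
        return relevance * (1.0 + weight * (averageRating / maxRating));
    }
}
```

With a weight of 0.5 on a five-star scale, an unrated item keeps its plain relevance score, while a five-star item gets its score multiplied by 1.5.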

I recommend using JCR, with the implementation of Jackrabbit behind it. JCR allows you to separate between what you store and how you store it.
By staying within a JCR framework, you should be able to easily switch among JCR implementations. (There are several, not just Apache's.) Even within Jackrabbit are many persistence managers, not just Lucene. This flexibility is useful when you want to trade off between storage space and performance.
JCR already includes full text searches and the ability to maintain user ratings. It should be a good fit for your project.

is there any major reasons I should just stick to Java?
Not really. As you probably already know, you can use any Java library with Groovy/Grails, so there's nothing you can do in Java that you can't do in Groovy. Although the contrary is also true, in my experience, it takes a lot more (boilerplate) code to get things done in Java.
Although Java is considerably faster than Groovy, this doesn't necessarily mean your app will be faster if written in Java, as the bottleneck could well be the database rather than code execution.
As for whether you should use Lucene/Searchable or JackRabbit, it's very difficult to say without knowing more about what you want to achieve. All you've told us so far is that you want to index documents and boost certain items in the index. You can certainly do both of those with Lucene.

I would recommend using JCR/Jackrabbit on top of Lucene for a couple of reasons:
1) Your repository structure could readily support document nodes with child nodes that store all of your meta-data including owner, ratings, flagging, comments, etc.
2) JCR is ideal for document/node based app development, providing a lot of the heavy lifting at the framework level while not getting in your way at the app level.
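The "document node with child nodes for metadata" layout from point 1 can be pictured with a tiny plain-Java tree. This is a stand-in written for illustration, not the javax.jcr API: the class ContentNode, its methods, and the node names used in the example ("ratings", "comments") are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Plain-Java stand-in for a repository node; this is not the javax.jcr API. */
final class ContentNode {
    private final String name;
    private final Map<String, Object> properties = new LinkedHashMap<>();
    private final Map<String, ContentNode> children = new LinkedHashMap<>();

    ContentNode(String name) {
        this.name = name;
    }

    String name() {
        return name;
    }

    /** Returns the child with this name, creating it on first use. */
    ContentNode child(String childName) {
        return children.computeIfAbsent(childName, ContentNode::new);
    }

    /** Sets a property and returns this node so calls can be chained. */
    ContentNode set(String key, Object value) {
        properties.put(key, value);
        return this;
    }

    Object get(String key) {
        return properties.get(key);
    }

    /** Resolves a relative path such as "ratings/bob"; null if any segment is missing. */
    ContentNode resolve(String path) {
        ContentNode current = this;
        for (String segment : path.split("/")) {
            current = current.children.get(segment);
            if (current == null) {
                return null;
            }
        }
        return current;
    }
}
```

A document could then be modelled as `new ContentNode("spec.pdf").set("owner", "alice")`, with each rating stored under `doc.child("ratings").child("bob").set("stars", 4)` and each comment under `doc.child("comments")`.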

I would recommend you to use Apache Sling, it comes with Jackrabbit/Lucene built-in.
Most of the committers are also involved with Jackrabbit, so it's designed to work well with it -- even better, it's designed to run on top of it.
One of the nice features of Sling is that it mounts the entire JCR repository in the URL space and exposes it via REST endpoints.
So you can access your documents/metadata very easily by doing a simple HTTP request to it. It also allows you to write your own servlets and expose them as REST endpoints. (This is extremely easy -- no fiddling about with applicationContext.xml files, just 1 annotation)
It also allows you to write jsp, esp, groovy, ...
