I'm looking for a document repository that supports physical data separation. I was looking into Open Text's LiveLink, but that price range is out of my league. I just started looking into DSpace! It's free and open source and works with PostgreSQL.
I'm really trying to figure out whether DSpace supports physically separating the data. I guess I would create multiple DSpace Communities, and I want the documents for each Community stored on different mounts. Is this possible? Or is there a better way?
Also, I won't be using DSpace's front end. The user will not even know DSpace is storing the docs. I will build my own front end and use Java to talk to DSpace. Does anyone know of a good Java API wrapper for DSpace?
This answer comes a pretty long time after you asked the question, so I'm not sure if it's of any use to you.
DSpace currently does not support storing different parts of the data on different partitions, although your asset store can be located on its own partition, separate from the database and the application. You may get what you need from the new DuraSpace application, which deals with synchronising your data to the cloud.
In terms of Java APIs, DSpace supports SWORD 1.3, which is a deposit-only protocol, and a lightweight implementation of WebDAV with SOAP. Other than that, it's somewhat lacking. If you are looking for a real back-end repository with good web services, you might look at Fedora, which is a pure back end with no native UI, and probably more suited to your needs. DSpace and Fedora are both part of the DuraSpace organisation, so you can probably benefit from that as well. I'm not knowledgeable enough about the way Fedora stores data to say whether you can physically separate the storage, though.
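If you do go the SWORD route, a deposit is just an HTTP POST of a packaged item. Here's a minimal sketch in plain Java; the endpoint URL, credentials and packaging header are assumptions you'd need to check against your server's SWORD service document:

```java
import java.io.*;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Base64;

// Minimal sketch of a SWORD 1.3 deposit into DSpace. The endpoint URL,
// credentials and packaging value are assumptions -- fetch your server's
// SWORD service document to get the real collection deposit URLs.
public class SwordDeposit {
    public static void main(String[] args) throws IOException {
        URL deposit = new URL("http://localhost:8080/sword/deposit/123456789/2");
        HttpURLConnection conn = (HttpURLConnection) deposit.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/zip");
        conn.setRequestProperty("X-Packaging", "http://purl.org/net/sword-types/METSDSpaceSIP");
        conn.setRequestProperty("Content-Disposition", "filename=item.zip");
        String auth = Base64.getEncoder().encodeToString("user:pass".getBytes("UTF-8"));
        conn.setRequestProperty("Authorization", "Basic " + auth);
        OutputStream out = conn.getOutputStream();
        InputStream in = new FileInputStream("item.zip");
        byte[] buf = new byte[8192];
        for (int n; (n = in.read(buf)) > 0; ) out.write(buf, 0, n);
        in.close();
        out.close();
        System.out.println("HTTP " + conn.getResponseCode()); // expect 201 Created
    }
}
```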
Hope that helps.
I have a Java project for which I'd like to use a pre-built web crawler that gives me enough flexibility to control which URLs are crawled, and then, once the crawler has the output, to control where to put it (Cassandra with my own schema).
The big picture is that I want to feed in a list of URLs (Google and Bing searches) and then filter the URLs that are returned. I want it to then crawl the filtered URLs (I may possibly want to change the URL query string, but that's not a hard requirement). I want to take the resulting HTML, parse it using Tika, then pull the data out and store it.
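Conceptually, the parse step would be something like this (just a sketch using Tika's AutoDetectParser; the crawl plumbing around it is what I'm hoping not to write myself):

```java
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

// The parse step: the crawler hands over the fetched page and Tika
// extracts plain text plus metadata, ready to be stored.
public class PageParser {
    public static String extractText(InputStream html) throws Exception {
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
        Metadata metadata = new Metadata();
        new AutoDetectParser().parse(html, handler, metadata, new ParseContext());
        return handler.toString();
    }
}
```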
I'm looking at Apache Droids; it seems like a good fit, since it appears to do everything I've mentioned, but there isn't any real documentation. I'd consider Nutch or Heritrix, but their use cases seem to be more of a full solution, and after skimming I don't see anything that talks about how to do what I want.
Does anyone have any experience with this type of thing? I mostly need some recommendations, but if you know of examples doing this sort of thing, that'd be nice as well, since I'm still pretty new to Java.
I wouldn't say Droids is a well-established framework yet. If you compare it to Nutch, which has a lot of history behind it, I would expect it to be less stable and less documented. I have no experience with Droids, though.
As far as storing data in Cassandra, I would recommend either Astyanax (https://github.com/Netflix/astyanax) or Hector (https://github.com/hector-client/hector).
I have used Hector extensively in the last year and have found it to be extremely simple and easy to use. It is faster to develop in Hector than in its predecessors (pure Thrift, Pelops), but Hector is flexible enough to let you do the nitty-gritty things you'd expect from Thrift.
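To give you a taste, a minimal Hector write looks roughly like this (the cluster, keyspace and column family names are placeholders for whatever your schema uses):

```java
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

// Sketch of a single-column write with Hector; names are illustrative.
public class HectorExample {
    public static void main(String[] args) {
        Cluster cluster = HFactory.getOrCreateCluster("test-cluster", "localhost:9160");
        Keyspace keyspace = HFactory.createKeyspace("crawler", cluster);
        Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
        // One column write: row key = the URL, column = the extracted text.
        mutator.insert("http://example.com/page", "pages",
                HFactory.createStringColumn("body", "...parsed text from Tika..."));
        cluster.getConnectionManager().shutdown();
    }
}
```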
Recently I have also been eyeing Astyanax, as it is developed/supported by a larger team and tested at a larger scale, which is important for my current field of work. However, Hector is usually quicker to implement features from new Cassandra releases, so both libraries have their benefits.
I have a Document Management System that stores documents in a database. I'm looking for a simple way (not too big and complicated a protocol to implement) to expose the database as a drive in Windows (so it can be browsed and manipulated using any Windows program, like Explorer or Office).
I have something in mind where I provide some kind of network share that can be mounted as a drive in Windows. Unfortunately, all candidate network protocols for file sharing seem to require substantial effort to implement.
I first considered CIFS, but after reading up on it I quickly decided that it's BY FAR too complicated for me to implement. My next thought was NFS, but it's not supported natively by Windows (XP) and also seems quite complicated to implement.
FTP might be an option, but implementing an FTP server is again much more complicated than I naively expected.
There might be a simpler protocol to use I haven't thought of.
Is there anything I can (ab)use easily for this purpose?
Ideally I want some kind of (pure Java) premade server where I could easily strip out the part that accesses the local file system and replace it with my own code accessing the database, OR a protocol simple enough that I can implement it myself reasonably quickly and that is, more importantly, compatible and reliable.
First, you need to make the correct bindings between your DMS and the database, and define/write an API in your program that describes easy access to the needed resources. Writing an API will allow you to maximize the interoperability of your solution with other services or plugins you might want to add later.
After that, you should serve these services to clients through a permissive and robust protocol such as WebDAV, which stands for Web-based Distributed Authoring and Versioning. It is natively supported by Windows, so you can interact with every service implementing WebDAV (Windows, most web browsers, ...), and you can of course also mount this kind of service as a virtual drive. It is also supported on Linux and Mac OS X, natively I believe, but I'm not sure. In fact, WebDAV is an extension to HTTP, and is described in RFC 4918.
Basically, every HTTP library for Java that handles server-side response management could be used to implement WebDAV, if you have some time and want to do it yourself.
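To give a flavour of what that involves, here's a very rough sketch of a servlet that intercepts PROPFIND. Bear in mind this hard-codes a single response; locking, MKCOL, MOVE, etc. are what make a real implementation hard:

```java
import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Toy illustration only: WebDAV is ordinary HTTP plus extra methods, so a
// servlet can dispatch on them. This answers every PROPFIND with a single
// hard-coded collection entry.
public class MiniDavServlet extends HttpServlet {
    @Override
    protected void service(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        if ("PROPFIND".equals(req.getMethod())) {
            resp.setStatus(207); // 207 Multi-Status carries the property listing
            resp.setContentType("text/xml; charset=UTF-8");
            resp.getWriter().write("<?xml version=\"1.0\"?>"
                + "<D:multistatus xmlns:D=\"DAV:\"><D:response>"
                + "<D:href>" + req.getRequestURI() + "</D:href>"
                + "<D:propstat><D:prop>"
                + "<D:resourcetype><D:collection/></D:resourcetype>"
                + "</D:prop><D:status>HTTP/1.1 200 OK</D:status></D:propstat>"
                + "</D:response></D:multistatus>");
        } else {
            super.service(req, resp); // normal GET/PUT/etc. dispatch
        }
    }
}
```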
To implement WebDAV with an acceptable effort, I searched for some Java libraries on the web and found these; it's now up to you to decide which one really fits your needs:
Milton: http://milton.ettrema.com/index.html
A list of some existing implementations of the WebDAV protocol in open source projects on WebDAV.org (you can find some pretty impressive projects there, such as the Jakarta Slide project, although I think it is no longer maintained, and other projects/libraries that show the importance of WebDAV today): http://www.webdav.org/projects/
I'm trying to implement part of the Facebook Ads API, specifically the autocomplete function ads.getAutoCompleteData.
Basically, Facebook supplies this 39MB file, which is updated weekly and contains targeting data for ads, including colleges, college majors, workplaces, locales, countries, regions and cities.
Our application needs to access all of those objects and supply auto completion using this file's data.
I'm thinking about the preferred way to solve this. I'm considering one of the following options:
Loading it into memory using a trie (Patricia trie); the disadvantage, of course, is that it will take too much memory on the server. (A minimal sketch of this option appears below.)
Using a dedicated search platform such as Solr on a different machine; the disadvantage is that it is perhaps over-engineering (though the file size will probably grow considerably in the future).
(Fill in your cool, easy, speed-of-light option here)?
Well, what do you think?
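To make option 1 concrete, here's roughly the kind of structure I have in mind (just a sketch, not tuned for memory; a real Patricia trie would compress single-child chains, but the interface is the same):

```java
import java.util.*;

// A minimal (uncompressed) prefix trie for in-memory autocomplete.
class Trie {
    private final Map<Character, Trie> children = new HashMap<Character, Trie>();
    private boolean terminal;

    void insert(String word) {
        Trie node = this;
        for (char c : word.toCharArray()) {
            Trie child = node.children.get(c);
            if (child == null) {
                child = new Trie();
                node.children.put(c, child);
            }
            node = child;
        }
        node.terminal = true;
    }

    // Returns up to 'limit' completions of the given prefix.
    List<String> complete(String prefix, int limit) {
        Trie node = this;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return Collections.emptyList();
        }
        List<String> out = new ArrayList<String>();
        collect(node, new StringBuilder(prefix), out, limit);
        return out;
    }

    private static void collect(Trie node, StringBuilder path, List<String> out, int limit) {
        if (out.size() >= limit) return;
        if (node.terminal) out.add(path.toString());
        for (Map.Entry<Character, Trie> e : node.children.entrySet()) {
            path.append(e.getKey().charValue());
            collect(e.getValue(), path, out, limit);
            path.setLength(path.length() - 1); // backtrack
        }
    }
}
```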
I would stick with a service-oriented architecture (especially if the product is supposed to handle high volumes) and go with Solr. That being said, 39MB is not a lot to hold in memory if it's going to be a singleton. With indexes and all, what will this get up to? 400MB? This of course depends on what your product does and what kind of hardware you wish to run it on.
I would either go with Solr or write your own service that reads the file into a fast DB like a MySQL MyISAM table (or even an in-memory table) and uses MySQL's text search feature to serve up results. Barring that, I would try to use Solr as a service.
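For the MySQL route, the lookup itself is small. Something like this hypothetical JDBC snippet, assuming a MyISAM table created as `CREATE TABLE target (name VARCHAR(255), FULLTEXT (name)) ENGINE=MyISAM;`:

```java
import java.sql.*;
import java.util.ArrayList;
import java.util.List;

// Hypothetical autocomplete lookup against a MyISAM full-text index.
// Note MyISAM's default minimum indexed word length (ft_min_word_len)
// is 4, which matters for short autocomplete terms.
public class MysqlAutocomplete {
    public static List<String> complete(Connection conn, String term) throws SQLException {
        String sql = "SELECT name FROM target "
                   + "WHERE MATCH(name) AGAINST (? IN BOOLEAN MODE) LIMIT 10";
        PreparedStatement ps = conn.prepareStatement(sql);
        ps.setString(1, term + "*"); // trailing * gives prefix matching in boolean mode
        ResultSet rs = ps.executeQuery();
        List<String> out = new ArrayList<String>();
        while (rs.next()) {
            out.add(rs.getString(1));
        }
        rs.close();
        ps.close();
        return out;
    }
}
```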
The benefit of writing my own service is that I know what is going on; the downside is that it'll be nowhere near as powerful as Solr. However, I suspect writing my own service would take less time to implement.
Consider writing your own service that serves up requests in an async manner (if your product is a website, then using Ajax). The trouble with Solr or Lucene is that if you get stuck, there is not a lot of help out there.
Just my 2 cents.
I am planning to build a simple document management system, preferably built on the Java platform. Are there any best practices around this? The requirements are:
Ability to upload documents
Ability to tag documents
Version the documents
Comment on documents
There are a couple of options that I am currently considering. The first would be a simple API on top of SVN or CVS, using a DB backend to track tags, uploader, comments, etc.
Another option is to use the filesystem: version the documents as copies in a versions folder and work with filenames.
Or, if there is an open, non-GPL'ed document management system, we could customize it to our needs and package it with our application. Does anybody have any experience building something like this?
You may want to take a look at the Content Repository API for Java (JCR) and its several implementations (some of them free).
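To give an idea of the API, here's a rough sketch against JCR 2.0 using Jackrabbit's TransientRepository; the node and property names are just illustrative, not a prescribed schema:

```java
import javax.jcr.Node;
import javax.jcr.Repository;
import javax.jcr.Session;
import javax.jcr.SimpleCredentials;
import org.apache.jackrabbit.core.TransientRepository;

// Sketch: upload, tag, comment on and version a document node via JCR.
public class JcrDocStore {
    public static void main(String[] args) throws Exception {
        Repository repo = new TransientRepository();
        Session session = repo.login(new SimpleCredentials("admin", "admin".toCharArray()));
        try {
            Node doc = session.getRootNode().addNode("docs").addNode("report");
            doc.addMixin("mix:versionable");                      // turn on versioning
            doc.setProperty("tags", new String[] {"finance", "q1"});
            doc.setProperty("comment", "initial upload");
            session.save();
            // Each checkin freezes the current state as a new version.
            session.getWorkspace().getVersionManager().checkin(doc.getPath());
        } finally {
            session.logout();
        }
    }
}
```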
Take a look at the many document-oriented database systems out there. I can't speak for MongoDB or any of the others, but my experience with CouchDB has been fantastic.
http://couchdb.apache.org/
The best part is that you communicate with it via a REST protocol.
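For example, creating a document is just an HTTP PUT of JSON (the database and document names below are made up):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Tiny illustration of CouchDB's REST interface: PUT a JSON document.
public class CouchPut {
    public static void main(String[] args) throws IOException {
        URL url = new URL("http://localhost:5984/docs/report-1");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        OutputStream out = conn.getOutputStream();
        out.write("{\"title\":\"Report\",\"tags\":[\"finance\"]}".getBytes("UTF-8"));
        out.close();
        System.out.println("HTTP " + conn.getResponseCode()); // 201 Created on success
    }
}
```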
The best way is to reuse the efforts of others. This particular wheel has been invented quite a few times.
Who will use this and for what purpose?
I've been working on a site idea. The general concept is a full-text search of documents that also allows user ratings; based on these ratings, I want to boost an item's value in the Lucene index. But I'm trying to figure out whether I should extend Jackrabbit or just build on the Lucene base. Is there any good way to extend Jackrabbit in this way and affect the index, or would it be best to work directly with Lucene?
Whichever way I go, I am strongly leaning toward using Groovy on Grails with either the Searchable plugin or working directly with Jackrabbit. Are there any major reasons I should just stick to Java?
Clarification:
I would like to boost an item based on its average user rating. Is Jackrabbit open or extensible enough that I can capture user ratings and then have those affect the index within Jackrabbit, or is that so far outside the core of Jackrabbit that I should just build up from Lucene?
I recommend using JCR, with the Jackrabbit implementation behind it. JCR lets you separate what you store from how you store it.
By staying within a JCR framework, you should be able to switch easily among JCR implementations. (There are several, not just Apache's.) Even within Jackrabbit there are many persistence managers, not just Lucene. This flexibility is useful when you want to trade off between storage space and performance.
JCR already includes full text searches and the ability to maintain user ratings. It should be a good fit for your project.
Are there any major reasons I should just stick to Java?
Not really. As you probably already know, you can use any Java library with Groovy/Grails, so there's nothing you can do in Java that you can't do in Groovy. Although the contrary is also true, in my experience, it takes a lot more (boilerplate) code to get things done in Java.
Although Java is considerably faster than Groovy, this doesn't necessarily mean your app will be faster if written in Java, as the bottleneck is likely to be the database rather than code execution.
As for whether you should use Lucene/Searchable or Jackrabbit, it's very difficult to say without knowing more about what you want to achieve. All you've told us so far is that you want to index documents and boost certain items in the index. You can certainly do both of those with Lucene.
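For illustration, index-time boosting from a rating could look like this, written with the Lucene 3.x API in mind (Document.setBoost was removed in later versions, and the field names are mine):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

// Sketch of index-time boosting from an average user rating (Lucene 3.x).
public class RatingIndexer {

    // Call this whenever a document's average rating changes.
    static void reindex(IndexWriter writer, String id, String text, float avgRating)
            throws Exception {
        Document doc = new Document();
        doc.add(new Field("id", id, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("content", text, Field.Store.YES, Field.Index.ANALYZED));
        doc.setBoost(1.0f + avgRating);                 // higher rating => higher score
        writer.updateDocument(new Term("id", id), doc); // replaces the old version
    }
}
```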
I would recommend using JCR/Jackrabbit on top of Lucene for a couple of reasons:
1) Your repository structure could readily support document nodes with child nodes that store all of your metadata, including owner, ratings, flagging, comments, etc. (see the sketch after this list).
2) JCR is ideal for document/node based app development, providing a lot of the heavy lifting at the framework level while not getting in your way at the app level.
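For example (all node and property names here are invented), with one child node per user rating under each document node, the average rating could be computed straight off the tree:

```java
import javax.jcr.Node;
import javax.jcr.NodeIterator;
import javax.jcr.RepositoryException;

// Hypothetical layout: each document node has a "ratings" child with one
// node per user, e.g. /docs/report/ratings/alice with a numeric "score".
public class Ratings {
    static double averageRating(Node doc) throws RepositoryException {
        NodeIterator it = doc.getNode("ratings").getNodes();
        double sum = 0;
        long count = 0;
        while (it.hasNext()) {
            sum += it.nextNode().getProperty("score").getLong();
            count++;
        }
        return count == 0 ? 0 : sum / count;
    }
}
```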
I would recommend using Apache Sling, which comes with Jackrabbit/Lucene built in.
Most of the committers are also involved with Jackrabbit, so it's designed to work well with it -- even better, it's designed to run on top of it.
One of the nice features of Sling is that it mounts the entire JCR repository in the URL space and exposes it via REST endpoints.
So you can access your documents/metadata very easily by making a simple HTTP request. It also allows you to write your own servlets and expose them as REST endpoints. (This is extremely easy -- no fiddling about with applicationContext.xml files, just one annotation.)
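For example, reading a node's metadata is a single GET; a tiny sketch, with the host and node path assumed:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

// Hypothetical read of a JCR node through Sling's REST mapping:
// appending .json to a node's path returns its properties as JSON.
public class SlingRead {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:8080/content/docs/report.json");
        BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), "UTF-8"));
        for (String line; (line = in.readLine()) != null; ) {
            System.out.println(line); // JSON rendering of the node's properties
        }
        in.close();
    }
}
```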
And you can write your server-side scripts in JSP, ESP, Groovy, ...