How to write custom storage plugin for apache drill

How to write custom storage plugin for apache drill - java

I have my data in a propriety format, None of the ones supported by Apache drill.
Are there any tutorial on how to write my own storage plugin to handle such data.

This is something that really should be in the docs but currently is not. The interface isn't too complicated, but it can be a bit much to look at one of the existing plugins and understand everything that is going on.
There are 2 major components to writing a storage plugin, exposing information to the query planner and schema management system and then actually implementing the translation from the datasource API to the drill record representation.
The Kudu plugin was added recently and is a reasonable model for a storage system with a lot of the elements Drill can take advantage of. One thing I would note is that if your storage system is not distributed and you just plan on making all remote reads you don't have to do as much work around affinities/work lists/assignments in the group scan. If I have some time soon I'll try to write up a doc on the different parts of the interface and maybe write a tutorial about one of the existing plugins.
https://github.com/apache/drill/tree/master/contrib/storage-kudu/src/main/java/org/apache/drill/exec/store/kudu

Related

Process XML data with Java

I am software written in Java which read an external XML file (let's call it "datasource.xml").
This file contains different information and this information are extracted using XPath queries.
The fact is that, according to what kind of information is extracted from that file (datasource.xml) a different work flow is needed. At the moment workflows are "hard coded" in my Java classes but I want to make my software indipedent so that it can work with any datasource.xml, no matter of its structure. But of course I have to specify somewhere how to deal with the extracted data. I was thinking to use (again) JAXB and specify inside the XML file (and from its XSD I will create JAXB classes) the kind of workflow is needed.
Could it be a good solution??
Thanks

have you checked out Drools (a project from JBoss) very easy to learn & is an excellent workflow tool.
building your own workflow engine is quite complex & there are a lot of considerations to be taken into account.

You can think of using activiti, Another workflow solution. It has APIs available and can be used as a workflow service layer in your application.

Like others, I think you will be better off using a higher-level tool for this rather than hand-coding the logic in Java. Take a look at XProc (for example the Calabash implementation), or Orbeon, or Cocoon. They all have a learning curve associated with them, but once mastered, you will have a much more flexible architecture than with hard-coded Java logic.

Advanced Java File I/O tutorials? Tips? Advice?

I'm working on a project right now that will make use of Java File I/O that goes beyond the simple "write this string to a file" documentation and tutorials that I find on the net. This project will essentially provide a database mechanism, similar to the popular "NoSQL" databases that are gaining a lot of press these days. However, I'm unable to find a ton of documentation that provides detailed information on which APIs to use, how to use them, etc. I've also been looking for any generally accepted design patterns around Java File I/O, but without any luck.
If I had to list a couple of requirements, I'd say:
Pseudo-transactional support (not a hard requirement, as it can be implemented higher up in the API stack)
Ability to write data of an arbitrary length in a structure that can be read back later on
Indexing
Ability to remove an object from the "database" efficiently
Fast searching
Possible multi-threaded access (multiple read threads, single write, most likely)
Can anyone point me to any tutorials, documentation, design patterns, etc. that may be helpful? Are there any open source frameworks that revolve around Java File I/O? I know of a lot of frameworks that provide wrappers around NIO for the purposes of Network I/O, but nothing File-related.
Thanks for any help you can provide!

Take a look at Apache Commons Transaction. It supports transactional file access, by performing the work in temporary files, and committing the work by moving them to the actual files.
You might also be interested in the XADisk project, although I haven't pored through it's sources.
As far as searching is concerned, the Apache Solr and Lucene projects would be of help.

Whats the best way to implement a simple document management system?

I am planning to build a simple document management system. Preferably built around the java platform. Are there are best practices around this? The requirements are :
Ability to upload documents
Ability to Tag documents
Version the documents
Comment on documents
There are a couple of options that I am currently considering. The first option would be a simple API on top of SVN or CVS and use a DB backend to track tags, uploader, comments etc
Another option is to use the filesystem. Version the documents as copies in a versions folder and work with filenames.
Or, if there is an Open non GPL'ed doc management system, we could customize it to our needs and package it in our application. Does anybody have any experience building something like this?

You may want to take a look at Content repository API for Java and the several implementations (some of them free).

Take a look at the many Document Oriented Database systems out there. I can't speak about MongoDB or any of the others, but my experience with Couchdb has been fantastic.
http://couchdb.apache.org/
best part of it is that you communicate with it via a REST protocol.

The best way is to reuse the efforts of others. This particular wheel has been invented quite a bit of times.
Who will use this and for what purpose?

Using a version control system as a data backend

I'm involved in a project that, among other things, involves storing edits and changes to a large hierarchical document (HTML-formatted text). We want to include versioning of textual changes and of structural changes.
Currently we're maintaining the tree of document sections in a relational database, but as we start working on how to manage versioning of structural changes, it's clear that we're in danger of having to write a lot of the functionality that a version control system provides.
We don't want to reinvent the wheel. Is it possible that we could use an existing version control system as the data store, at least for the document itself? Presumably we could do so by writing out new versions to the filesystem, and keeping that directory under version control (and programmatically doing commits and so forth) but it would be better if we could directly interact with the repository via code.
The VCS that we are most familiar with is Subversion, but I'm not thrilled with how Subversion represents changes to the directory structure -- it would be nice if we could see that a particular revision included moving a section from Chapter 2 to Chapter 6, rather than just seeing a new version of the tree. This sounds more like the way a system like Mercurial handles changes to the structure.
Any advice? Do VCS's have public APIs and so forth? The project is in Java (with Spring) if it matters.

Maybe you could use a JCR (JSR-170) compliant repository like Jackrabbit instead. To me, what you're describing is exactly what JCR is for. Have a look at this article.

You can certainly program SCMs via APIs. Check out SVNKit for Java and Subversion, or JGit for Java and Git. Mercurial doesn't appear to offer such an API.
Whatever you do, wrap up your implementation in a suitable API, so you can swap one SCM for another, or maybe bin the concept of an SCM at some stage in the future. It may well be a pragmatic solution to your problem, however, and worthy of more investigation.

Try http://svnkit.com/ for Subversion.

Here you have a pure Java SVN lib SVNkit it can be used by Eclipse SVN integration so it should be fairly stable.

Extend JackRabbit or build up from Lucene?

I've been working on a site idea the general concept is a full text search of documents that also allows user ratings based on these rating I wanted to boost the item's value in the Lucene index. But I'm trying to find if I should extend JackRabbit or just build from the Lucene base. Is there any good way to extend JackRabbit in this way and effect the index or would it be best to work directly off Lucene?
Either way I go I am strongly leaning to using groovy on grails with either the searchable plugin or work directly with JackRabbit is there any major reasons I should just stick to Java?
Clarification:
I would like to boost an item based on the average user rating of an item, is JackRabbit open enough or expandable enough where I can capture user ratings then have those effect the index within JackRabbit or is it so far out of the core of JackRabbit I should just build up from Lucene?

I recommend using JCR, with the implementation of Jackrabbit behind it. JCR allows you to separate between what you store and how you store it.
By staying within a JCR framework, you should be able to easily switch among JCR implementations. (There are several, not just Apache's.) Even within Jackrabbit are many persistence managers, not just Lucene. This flexibility is useful when you want to trade off between storage space and performance.
JCR already includes full text searches and the ability to maintain user ratings. It should be a good fit for your project.

is there any major reasons I should just stick to Java?
Not really. As you probably already know, you can use any Java library with Groovy/Grails, so there's nothing you can do in Java that you can't do in Groovy. Although the contrary is also true, in my experience, it takes a lot more (boilerplate) code to get things done in Java.
Although Java is considerable faster than Groovy, this doesn't necessarily mean your app will be faster if written in Java, as the bottleneck could likely be the database rather than code execution.
As for whether you should use Lucene/Searchable or JackRabbit, it's very difficult to say without knowing much about what you can achieve. All you've told us so far is that you want to index documents and boost certain items in the index. You can certainly do both of those with Lucene.

I would recommend using JCR/Jackrabbit on top of Lucene for a couple of reasons:
1) Your repository structure could readily support document nodes with child nodes that store all of your meta-data including owner, ratings, flagging, comments, etc.
2) JCR is ideal for document/node based app development, providing a lot of the heavy lifting at the framework level while not getting in your way at the app level.

I would recommend you to use Apache Sling, it comes with Jackrabbit/Lucene built-in.
Most of the committers are also involved with Jackrabbit, so it's designed to work well with it -- even better, it's designed to run on top of it.
One of the nice features of Sling is that it mounts the entire JCR repository in the URL space and exposes it via REST endpoints.
So you can access your documents/metadata very easily by doing a simple HTTP request to it. It also allows you to write your own servlets and expose them as REST endpoints. (This is extremely easy -- no fiddling about with applicationContext.xml files, just 1 annotation)
It also allows you to write jsp, esp, groovy, ...

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

How to write custom storage plugin for apache drill - java

I have my data in a propriety format, None of the ones supported by Apache drill. Are there any tutorial on how to write my own storage plugin to handle such data.

Related

Process XML data with Java

Advanced Java File I/O tutorials? Tips? Advice?

Whats the best way to implement a simple document management system?

Using a version control system as a data backend

Extend JackRabbit or build up from Lucene?

Categories

Resources