There are several versions of the Hadoop APIs available as part of the Cloudera and Yahoo distributions. Furthermore, for Cloudera there are versions cdh3u1 through cdh3u4.
I noticed that the API methods also change in how they are named and in the parameters they accept.
Which version of Hadoop API, and from where, can I use that is latest and stable?
The first thing to note is that "latest" and "stable" don't go together. It takes some time for the latest API to become rock solid, with the bugs found and fixed.
If you are interested in packaged software, go to Cloudera and download a stable or an alpha version and try it out. From Hortonworks you can download HDP 1.0, which is the only version available. Cloudera has been releasing CDH on a regular basis for close to 4 years, so it is more mature than HDP from Hortonworks. CDH includes the next-generation MapReduce, while HDP ships the legacy MapReduce architecture.
The above-mentioned packages (CDH and HDP) ship a set of frameworks that are well integrated and tested, so it's a matter of learning how to use them. There is no need to worry about interoperability issues across the different frameworks.
If you really want to learn Hadoop, I would suggest downloading the software from Apache Hadoop and doing the installation and configuration yourself. The same applies to Pig, Hive, and other software. You might run into some compatibility issues, which have to be resolved as you go.
In the Apache Hadoop space there is the 1.x track, which has the stable legacy MapReduce architecture, and the 2.x track, which has the next-generation MapReduce architecture.
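The renaming the question mentions is largely the split between the legacy org.apache.hadoop.mapred API and the newer org.apache.hadoop.mapreduce API that both Apache tracks carry. A minimal sketch of the difference, using only stock Apache classes (the key/value types and the mapper bodies are just illustrative placeholders):

// Legacy ("old") API from org.apache.hadoop.mapred: map() receives an
// OutputCollector and a Reporter.
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

class OldApiMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, LongWritable> out, Reporter reporter)
            throws java.io.IOException {
        out.collect(value, key);   // trivial placeholder logic
    }
}

// Newer API from org.apache.hadoop.mapreduce: the same hook is a protected
// method that takes a single Context object instead.
class NewApiMapper
        extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws java.io.IOException, InterruptedException {
        context.write(value, key); // trivial placeholder logic
    }
}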
Why is it that I can't keep the latest version of Java and have to downgrade to Java 8 for almost all Apache projects?
Is Java not backward compatible?
A program compiled to bytecode on an older JDK should run perfectly well on the JVM of a newer JDK.
Why is it that I have to go through the pains of building from source?
I thought this was one of the things that Java was supposed to overcome!
This problem is not exclusive to Apache projects. With the newer Java versions it's not so much about "understanding the older code" but more about "am I allowed to use these features the old way" (modularization).
In some cases the older code also uses features that are simply no longer part of the latest JDK (e.g. the removal of the Java EE modules). I'd recommend reading Oracle's Migration Guide for more on this topic.
For (bigger) projects, the migration beyond Java 8 is something that needs to be planned and organized, and it takes a lot of time.
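As a concrete illustration of the removed Java EE modules (the class name and snippet here are mine, not from the original answer): the following compiles and runs on JDK 8, but on JDK 11 and later it fails to compile because the java.xml.bind (JAXB) module was dropped from the JDK and has to be added back as an explicit dependency.

import javax.xml.bind.DatatypeConverter;

public class RemovedModuleExample {
    public static void main(String[] args) {
        // Fine on JDK 8; on JDK 11+ this fails with
        // "package javax.xml.bind does not exist" unless a JAXB
        // artifact is added to the build explicitly.
        System.out.println(DatatypeConverter.printHexBinary("hadoop".getBytes()));
    }
}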
The problem
I'm experimenting with Akka's cluster support. I got stuck with ClusterSingleton support, which appears to require JDK 8... which I can't use.
As per the documentation here, I need to include the following library:
<dependency>
  <groupId>com.typesafe.akka</groupId>
  <artifactId>akka-cluster-tools_2.11</artifactId>
  <version>2.4-SNAPSHOT</version>
</dependency>
As it appears in my tests, the entire akka-*_2.11 line (compiled with Scala 2.11) requires JDK 8, including akka-cluster-tools. I'm not a Scala guy, but it seems quite strange - the Scala 2.11.1 release notes suggest that JDK 7 is more than enough:
The Scala 2.11.x series targets Java 6, with (evolving) experimental support for Java 8
Options
What are my options? I see the following:
Drop the idea of using Akka, since new releases will apparently require JDK 8. JDK 8 is sadly not an option.
Hope there will be akka-cluster-tools_2.10 and my problems disappear. Will there be akka-cluster-tools_2.10?
Forget about akka-cluster-tools_2.10 and use akka-contrib_2.10 instead.
There's a chance it would work, although:
It's going to be more difficult, as the current documentation refers to akka-cluster-tools
I'm just starting with Akka and already need to use deprecated libraries...
Thanks
As Ryan said in the comment, Akka 2.4 (which isn't out yet) requires/will require Java 8.
You can still use ClusterSingleton, Sharding, and DistributedPubSub in Akka 2.3; they just live in the akka-contrib package. You can find the docs for them under http://doc.akka.io/docs/akka/2.3.12/contrib/index.html, so no problems there.
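For reference, the corresponding Maven dependency for the Scala 2.10 build of the 2.3 contrib module would look something like this (2.3.12 matches the docs linked above; use whatever 2.3.x release you standardize on):

<dependency>
  <groupId>com.typesafe.akka</groupId>
  <artifactId>akka-contrib_2.10</artifactId>
  <version>2.3.12</version>
</dependency>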
Additionally, the differences in the APIs between 2.3 and 2.4 for the cluster stuff aren't huge, so it is very much possible to make that move in the future without too big an effort.
The downside might be that improvements to the cluster tools in 2.4 won't necessarily be backported to 2.3, and that the main development effort will be focused on 2.4 going forward.
At the moment, we have both our Hadoop cluster (CDH 4.1.2) and the services that communicate with it via hadoop-client running on Java 6. We're planning to move those client components to Java 8 while leaving the Hadoop servers on Java 6 as they are now, because Cloudera has declared support for JDK 8 only since version 5.3.0 and we're not planning to upgrade Hadoop - details here:
https://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_rn_new_in_530.html
Therefore, our concern is whether different Java versions in the cluster (6) and in the client components (8) may lead to problems of any kind. The Internet hasn't been of much help, since Hadoop compatibility with Java is mainly discussed in the context of migrating the server components, so please share any relevant experience if you have it.
I am creating an application for Hadoop which should run on all distributions of Hadoop provided by different vendors, like Cloudera, MapR, Hortonworks, Pivotal, etc. My application would be deployed on application servers like WebLogic or JBoss, or it could also be deployed on Tomcat.
So my question is: suppose some version from each of these vendors uses the same underlying Hadoop version, say Hadoop 2.0; should I use the JAR files shipped by these vendors or the JAR files from Apache Hadoop?
I mean the JAR files that contain the same classes as Apache Hadoop but carry the vendor's name, like blablaCDH5.2blabla.jar. Should I use those or the ones from Apache, so that I can build a single version for Hadoop 2.0 and use it for all vendors? Can that be done, or do I have to build different flavours of my app for each vendor distribution?
Thanks in advance
One approach, which may vary slightly based on your version control and build systems, would be to have separate build scripts using the dependencies from the different distributions.
Where test cases fail for a given distribution you could have a branch/fork for that distribution or, probably less desirable, have a specific build which does some pre-build magic for that distribution.
This way you should be able to maintain a consistent trunk while still tracking and handling issues that come up in a vendor- or version-specific distribution going forward. This is definitely possible with Git and most build systems (e.g. Gradle, Maven, or Ant).
You can create a shim layer that allows your application to run with any Hadoop distribution. As most of the distributions ship different Hadoop versions, it is very difficult to deal with this problem otherwise, so many vendors are now creating shim layers that can work with any Hadoop distribution. Shim layers have been implemented in many applications like Pentaho, Hive, Gora, etc.
It depends on how deep into the Hadoop API you are treading.
If your application only submits jobs to the cluster, you are probably OK with the vanilla libraries as long as you stick to one specific version. If you are doing advanced stuff and using Hadoop internals, it may be necessary to include the vendor-specific ones.
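For the "only submits jobs" case, a minimal sketch against the plain Apache hadoop-client API might look like the following; the paths are hypothetical, the mapper and reducer are stock Hadoop library classes, and the cluster addresses are assumed to come from the usual *-site.xml files on the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class SubmitWordCount {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS, resource manager addresses, etc. from
        // core-site.xml / yarn-site.xml if they are on the classpath.
        Configuration conf = new Configuration();

        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(SubmitWordCount.class);
        job.setMapperClass(TokenCounterMapper.class);  // stock library mapper
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);      // stock library reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));     // hypothetical path
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));  // hypothetical path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}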
Dennis, you can build your application using the JARs provided by Apache Hadoop, because all of these distributions are modified forms of Apache Hadoop. They all share the same baseline structure, so using the JARs provided by Apache Hadoop won't create any problem.
In fact, here is a link for Cloudera in which they are using the JARs provided by Apache Hadoop itself. This is the required link.
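In practice that usually means compiling against the Apache coordinates rather than the vendor ones. A hedged Maven sketch (2.2.0 is just an example Hadoop 2.x release, not a recommendation; you can switch the scope to provided if the target environment already supplies the Hadoop client JARs):

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.2.0</version>
</dependency>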
I am trying to remotely access HDFS with a program written in Java. WebHDFS works well with the most recent versions of Hadoop, but which protocol(s) should I choose to cover the largest number of Hadoop versions?
If possible, I would like to use a single protocol that will work on all versions of Hadoop as long as it won't run much slower than using different protocols for different versions of Hadoop.
LibHDFS is present in both the older (1.x) and newer (2.x) releases of Hadoop. It is pure Java and has a pretty stable API.
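For what it's worth, the same org.apache.hadoop.fs.FileSystem client code can talk to a remote cluster either over the native RPC protocol (hdfs://) or over WebHDFS (webhdfs://) in Hadoop 2.x. A minimal sketch with hypothetical host names, ports, and paths:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RemoteHdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Native RPC endpoint (hypothetical); for WebHDFS it would be
        // something like "webhdfs://namenode.example.com:50070" instead.
        URI uri = URI.create("hdfs://namenode.example.com:8020");

        try (FileSystem fs = FileSystem.get(uri, conf);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/user/demo/input.txt"))))) {
            System.out.println(reader.readLine());
        }
    }
}

Keep in mind that the native RPC client tends to be pickier about matching the cluster's major version than WebHDFS, which is one reason WebHDFS is often the more version-tolerant choice.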