How to configure SparkContext for a HA enabled Cluster - java

When I try to run my Spark application in YARN mode against HDFS, it works fine as long as I provide the properties below.
sparkConf.set("spark.hadoop.yarn.resourcemanager.hostname",resourcemanagerHostname);
sparkConf.set("spark.hadoop.yarn.resourcemanager.address",resourcemanagerAddress);
sparkConf.set("spark.yarn.stagingDir",stagingDirectory );
But the problems with this are:
Since my HDFS is NameNode HA enabled, it doesn't work when I give spark.yarn.stagingDir the common (nameservice) HDFS URL.
E.g. hdfs://hdcluster/user/tmp/ gives an error that says:
has unknown host hdcluster
It works fine when I give the URL as hdfs://<ActiveNameNode>/user/tmp/, but we don't know in advance which NameNode will be active, so how do I resolve this?
A few things I have noticed: SparkContext takes a Hadoop configuration, but the SparkConf class has no method to accept one.
How do I provide the ResourceManager address when the ResourceManagers are running in HA?

You need to use the configuration parameters that are already present in the Hadoop config files such as yarn-site.xml and hdfs-site.xml.
Initialize the Configuration object using:
val conf = new org.apache.hadoop.conf.Configuration()
To check the current HDFS URI, use:
val currentFS = conf.get("fs.defaultFS")
You will get an output with the URI of your namenode, something like:
res0: String = hdfs://namenode1
To check the address of current resource manager in use, try:
val currentRMaddr = conf.get("yarn.resourcemanager.address")
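If you are doing this from Java rather than the shell, a minimal sketch of the same check might look like this (assuming the cluster's core-site.xml, hdfs-site.xml and yarn-site.xml are on the classpath, e.g. via HADOOP_CONF_DIR; class and variable names are just illustrative):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class HadoopConfCheck {
    public static void main(String[] args) {
        // YarnConfiguration extends Configuration and also loads yarn-site.xml;
        // fs.defaultFS comes from core-site.xml, which the base class loads.
        Configuration conf = new YarnConfiguration();

        // Should print the HA nameservice URI, e.g. hdfs://hdcluster
        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));

        // With ResourceManager HA, the client works from the RM ids and per-RM addresses
        System.out.println("rm ids = " + conf.get("yarn.resourcemanager.ha.rm-ids"));
        System.out.println("rm address = " + conf.get("yarn.resourcemanager.address"));
    }
}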

I have had the exact same issue. Here is the solution (finally):
You have to configure the internal Spark Context Hadoop Configuration for HDFS HA. When instantiating the Spark Context or Spark Session, it will find all configuration entries whose keys start with spark.hadoop. and use them when instantiating the Hadoop Configuration.
So, in order to be able to use hdfs://namespace/path/to/file and not get an invalid host exception, add the following configuration options:
spark.hadoop.fs.defaultFS = "hdfs://my-namespace-name"
spark.hadoop.ha.zookeeper.quorum = "real.hdfs.host.1.com:2181,real.hdfs.host.2.com:2181"
spark.hadoop.dfs.nameservices = "my-namespace-name"
spark.hadoop.dfs.client.failover.proxy.provider.my-namespace-name = org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
spark.hadoop.dfs.ha.automatic-failover.enabled.my-namespace-name = true
spark.hadoop.dfs.ha.namenodes.my-namespace-name = "realhost1,realhost2"
spark.hadoop.dfs.namenode.rpc-address.my-namespace-name.realhost1 = "real.hdfs.host.1.com:8020"
spark.hadoop.dfs.namenode.servicerpc-address.my-namespace-name.realhost1 = "real.hdfs.host.1.com:8022"
spark.hadoop.dfs.namenode.http-address.my-namespace-name.realhost1 = "real.hdfs.host.1.com:50070"
spark.hadoop.dfs.namenode.https-address.my-namespace-name.realhost1 = "real.hdfs.host.1.com:50470"
spark.hadoop.dfs.namenode.rpc-address.my-namespace-name.realhost2 = "real.hdfs.host.2.com:8020"
spark.hadoop.dfs.namenode.servicerpc-address.my-namespace-name.realhost2 = "real.hdfs.host.2.com:8022"
spark.hadoop.dfs.namenode.http-address.my-namespace-name.realhost2 = "real.hdfs.host.2.com:50070"
spark.hadoop.dfs.namenode.https-address.my-namespace-name.realhost2 = "real.hdfs.host.2.com:50470"
spark.hadoop.dfs.replication = 3
spark.hadoop.dfs.blocksize = 134217728
spark.hadoop.dfs.client.use.datanode.hostname = false
spark.hadoop.dfs.datanode.hdfs-blocks-metadata.enabled = true
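If you would rather set these in code than in spark-defaults.conf, the same keys can go on the SparkConf before the context is created. A rough Java sketch, reusing the placeholder nameservice and host names from the list above (only the core HA keys are shown):
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class HaSparkConfExample {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf()
            .setAppName("hdfs-ha-example")
            // nameservice URI instead of a concrete NameNode host
            .set("spark.hadoop.fs.defaultFS", "hdfs://my-namespace-name")
            .set("spark.hadoop.dfs.nameservices", "my-namespace-name")
            .set("spark.hadoop.dfs.ha.namenodes.my-namespace-name", "realhost1,realhost2")
            .set("spark.hadoop.dfs.namenode.rpc-address.my-namespace-name.realhost1", "real.hdfs.host.1.com:8020")
            .set("spark.hadoop.dfs.namenode.rpc-address.my-namespace-name.realhost2", "real.hdfs.host.2.com:8020")
            .set("spark.hadoop.dfs.client.failover.proxy.provider.my-namespace-name",
                 "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")
            // the staging dir can now refer to the nameservice as well
            .set("spark.yarn.stagingDir", "hdfs://my-namespace-name/user/tmp/");

        JavaSparkContext sc = new JavaSparkContext(sparkConf);
        // ... run the job, then sc.stop()
    }
}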

You are probably looking for the HADOOP_CONF_DIR setting in spark-env.sh. This environment variable should point to the directory where hdfs-site.xml and core-site.xml exist (the same ones used to start the Hadoop HA cluster). You should then be able to use hdfs://namespace/path/to/file without issues.
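For example, in conf/spark-env.sh (the path below is just an assumption; point it at wherever your cluster's client configs actually live):
export HADOOP_CONF_DIR=/etc/hadoop/conf   # directory containing core-site.xml, hdfs-site.xml, yarn-site.xml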

Related

Where does SparkSession get AWS credentials from? SparkSession or HadoopConfiguration? [duplicate]

I'm trying to make my Spark Streaming application read its input from an S3 directory, but I keep getting this exception after launching it with the spark-submit script:
Exception in thread "main" java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:66)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at org.apache.hadoop.fs.s3native.$Proxy6.initialize(Unknown Source)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:216)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
at org.apache.spark.streaming.StreamingContext.checkpoint(StreamingContext.scala:195)
at MainClass$.main(MainClass.scala:1190)
at MainClass.main(MainClass.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I'm setting those variables through this block of code as suggested here http://spark.apache.org/docs/latest/ec2-scripts.html (bottom of the page):
val ssc = new org.apache.spark.streaming.StreamingContext(
conf,
Seconds(60))
ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId",args(2))
ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey",args(3))
args(2) and args(3) are my AWS Access Key ID and Secret Access Key, of course.
Why does it keep saying they are not set?
EDIT: I also tried this way but I get the same exception:
val lines = ssc.textFileStream("s3n://"+ args(2) +":"+ args(3) + "@<mybucket>/path/")
Odd. Try also doing a .set on the sparkContext. Try also exporting env variables before you start the application:
export AWS_ACCESS_KEY_ID=<your access>
export AWS_SECRET_ACCESS_KEY=<your secret>
^^this is how we do it.
UPDATE: According to @tribbloid the above broke in 1.3.0; now you have to faff around for ages and ages with hdfs-site.xml, or you can do this (and it works in a spark-shell):
val hadoopConf = sc.hadoopConfiguration;
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", myAccessKey)
hadoopConf.set("fs.s3.awsSecretAccessKey", mySecretKey)
The following configuration works for me, make sure you also set "fs.s3.impl":
val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
val sc = new SparkContext(conf)
val hadoopConf=sc.hadoopConfiguration;
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId",myAccessKey)
hadoopConf.set("fs.s3.awsSecretAccessKey",mySecretKey)
On AWS EMR the above suggestions did not work. Instead I updated the following properties in the conf/core-site.xml:
fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey with your S3 credentials.
For those using EMR, use the Spark build as described at https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark and just reference S3 with the s3:// URI. No need to set S3 implementation or additional configuration as credentials are set by IAM or role.
I wanted to put the credentials more securely in a config file on one of my encrypted partitions. So I did export HADOOP_CONF_DIR=~/Private/.aws/hadoop_conf before running my spark application, and put a file in that directory (encrypted via ecryptfs) called core-site.xml containing the credentials like this:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>my_aws_access_key_id_here</value>
  </property>
  <property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>my_aws_secret_access_key_here</value>
  </property>
</configuration>
HADOOP_CONF_DIR can also be set in conf/spark-env.sh.
Latest EMR releases (tested on 4.6.0) require the following configuration:
val sc = new SparkContext(conf)
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3.impl", "com.amazon.ws.emr.hadoop.fs.EmrFileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", myAccessKey)
hadoopConf.set("fs.s3.awsSecretAccessKey", mySecretKey)
In most cases, though, the out-of-the-box config should work; this is only needed if you have S3 credentials different from the ones you launched the cluster with.
In Java, the following are the lines of code. You have to add the AWS credentials to the SparkContext only, not the SparkSession.
JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());
sc.hadoopConfiguration().set("fs.s3a.access.key", AWS_KEY);
sc.hadoopConfiguration().set("fs.s3a.secret.key", AWS_SECRET_KEY);
this works for me in 1.4.1 shell:
val conf = sc.getConf
conf.set("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
conf.set("spark.hadoop.fs.s3.awsAccessKeyId", <your access key>)
conf.set("spark.hadoop.fs.s3.awsSecretAccessKey", <your secret key>)
SparkHadoopUtil.get.conf.addResource(SparkHadoopUtil.get.newConfiguration(conf))
...
sqlContext.read.parquet("s3://...")
Augmenting @nealmcb's answer, the most straightforward way to do this is to define
HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
in conf/spark-env.sh or export that env variable in ~/.bashrc or ~/.bash_profile.
That will work as long as you can access s3 through hadoop. For instance, if you can run
hadoop fs -ls s3n://path/
then hadoop can see the s3 path.
If hadoop can't see the path, follow the advice contained in How can I access S3/S3n from a local Hadoop 2.6 installation?

Authenticating users via LDAP with Shiro

Total newbie to java/groovy/grails/shiro/you-name-it, so bear with me. I have exhausted tutorials and all the "Shiro LDAP" searches available and still cannot get my project working.
I am running all of this on GGTS with jdk1.7.0_80, Grails 2.3.0, and Shiro 1.2.1.
I have a working project and have successfully run quick-start-shiro, which built the domains ShiroRole and ShiroUser, the controller authController, the view login.gsp, and the realm ShiroDbRealm. I created a faux user in BootStrap with
def user = new ShiroUser(username: "user123", passwordHash: new Sha256Hash("password").toHex())
user.addToPermissions("*:*")
user.save()
and can successfully log into my homepage, and for all intents and purposes, that is as far as I have gotten. I cannot find a top-down tutorial on how to now log in with my username and password (authenticated through an LDAP server that I have available). From what I understand, I need to create a shiro.ini file and include something along the lines of
[main]
ldapRealm = org.apache.shiro.realm.activedirectory.ActiveDirectoryRealm
ldapRealm.url = ldap://MYURLHERE/
However, I don't even know where to put this shiro.ini file. I've seen /src/main/resources mentioned, but there is no such directory. Do I create this manually, or is it generated by some script?
The next step seems to be creating the SecurityManager which reads the shiro.ini somehow with code along the lines of
Factory<org.apache.shiro.mgt.SecurityManager> factory = new IniSecurityManagerFactory("actived.ini");
// Setting up the SecurityManager...
org.apache.shiro.mgt.SecurityManager securityManager = factory.getInstance();
SecurityUtils.setSecurityManager(securityManager);
However this always appears in some Java file in tutorials, but my project is a Groovy project inside of GGTS. Do I need to create a Java file and put it in src/java or something like that?
I've recently found that I may need a ShiroLdapRealm file (similar to ShiroDbRealm) with information like
def appConfig = grailsApplication.config
def ldapUrls = appConfig.ldap.server.url ?: [ "ldap://MYURLHERE/" ]
def searchBase = appConfig.ldap.search.base ?: ""
def searchUser = appConfig.ldap.search.user ?: ""
def searchPass = appConfig.ldap.search.pass ?: ""
def usernameAttribute = appConfig.ldap.username.attribute ?: "uid"
def skipAuthc = appConfig.ldap.skip.authentication ?: false
def skipCredChk = appConfig.ldap.skip.credentialsCheck ?: false
def allowEmptyPass = appConfig.ldap.allowEmptyPasswords != [:] ? appConfig.ldap.allowEmptyPasswords : true
and the corresponding info in Config along the lines of
ldap.server.url = ["ldap://MYRULHERE/"]
ldap.search.base = 'dc=COMPANYNAME,dc=com'
ldap.search.user = '' // if empty or null --> anonymous user lookup
ldap.search.pass = 'password' // only used with non-anonymous lookup
ldap.username.attribute = 'AccountName'
ldap.referral = "follow"
ldap.skip.credentialsCheck = false
ldap.allowEmptyPasswords = false
ldap.skip.authentication = false
But putting all these pieces together hasn't gotten me anywhere! Am I at least on the right track? Any help would be greatly appreciated!
The /src/main/resources directory is created automatically if you use Maven for your project. You can also create that directory manually.
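If it helps, once shiro.ini sits under src/main/resources it ends up on the runtime classpath, so the SecurityManager bootstrap from your question can load it via a classpath: resource path. A minimal sketch (assuming the ldapRealm settings from your question are in that file; the class name is just illustrative):
import org.apache.shiro.SecurityUtils;
import org.apache.shiro.config.IniSecurityManagerFactory;
import org.apache.shiro.util.Factory;

public class ShiroBootstrap {
    public static void main(String[] args) {
        // "classpath:" tells Shiro to resolve the file from the classpath,
        // which is where the contents of src/main/resources end up after a build.
        Factory<org.apache.shiro.mgt.SecurityManager> factory =
                new IniSecurityManagerFactory("classpath:shiro.ini");
        SecurityUtils.setSecurityManager(factory.getInstance());
    }
}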

Java - Apache Spark communication

I'm quite new to Spark and was looking for some guidance :-)
What's the typical way in which a Java MVC application communicates with Spark? To simplify things, let's say I want to count the words in a certain file whose name is provided via GET request to my server.
My initial approach was to open the context and implement the transformations/computations in a class inside my MVC application. That means that at runtime I would have to come up with an uber jar of spark-core. The problems are:
The uber jar weighs 80 MB
I am facing the same problem (akka.version) as in: apache spark: akka version error by build jar with all dependencies
I can have a go with shade to solve it, but have the feeling this is not the way to go.
Maybe the "provided" scope in Maven would help me, but I'm using Ant.
Should my application, as suggested in the page, already have one jar with the implementation (devoid of any Spark libraries) and use spark-submit every time I receive a request? I guess it would leave the results somewhere.
Am I missing any middle-of-the-road approach?
Using spark-submit each time is kind of heavyweight; I'd recommend using a long-running Spark Context of some sort. I think the "middle of the road" option that you might be looking for is to have your job use something like the IBM Spark Kernel, Zeppelin, or the Spark Job Server from Ooyala.
It is good practice to use a middleware service deployed on top of Spark which manages its contexts, job failures, Spark versions and a lot of other things to consider.
I would recommend Mist. It implements Spark as a Service and creates a unified API layer for building enterprise solutions and services on top of a Big Data lake.
Mist supports Scala and Python jobs execution.
The quick start is as follows:
Add the Mist wrapper to your Spark job:
Scala example:
object SimpleContext extends MistJob {
  override def doStuff(context: SparkContext, parameters: Map[String, Any]): Map[String, Any] = {
    val numbers: List[BigInt] = parameters("digits").asInstanceOf[List[BigInt]]
    val rdd = context.parallelize(numbers)
    Map("result" -> rdd.map(x => x * 2).collect())
  }
}
Python example:
import mist

class MyJob:
    def __init__(self, job):
        job.sendResult(self.doStuff(job))

    def doStuff(self, job):
        val = job.parameters.values()
        list = val.head()
        size = list.size()
        pylist = []
        count = 0
        while count < size:
            pylist.append(list.head())
            count = count + 1
            list = list.tail()
        rdd = job.sc.parallelize(pylist)
        result = rdd.map(lambda s: 2 * s).collect()
        return result

if __name__ == "__main__":
    job = MyJob(mist.Job())
Run Mist service:
Build the Mist
git clone https://github.com/hydrospheredata/mist.git
cd mist
./sbt/sbt -DsparkVersion=1.5.2 assembly # change version according to your installed spark
Create configuration file
mist.spark.master = "local[*]"
mist.settings.threadNumber = 16
mist.http.on = true
mist.http.host = "0.0.0.0"
mist.http.port = 2003
mist.mqtt.on = false
mist.recovery.on = false
mist.contextDefaults.timeout = 100 days
mist.contextDefaults.disposable = false
mist.contextDefaults.sparkConf = {
spark.default.parallelism = 128
spark.driver.memory = "10g"
spark.scheduler.mode = "FAIR"
}
Run
spark-submit --class io.hydrosphere.mist.Mist \
  --driver-java-options "-Dconfig.file=/path/to/application.conf" \
  target/scala-2.10/mist-assembly-0.2.0.jar
Try curl from terminal:
curl --header "Content-Type: application/json" -X POST http://192.168.10.33:2003/jobs --data '{"jarPath":"/vagrant/examples/target/scala-2.10/mist_examples_2.10-0.2.0.jar", "className":"SimpleContext$","parameters":{"digits":[1,2,3,4,5,6,7,8,9,0]}, "external_id":"12345678","name":"foo"}'

Getting InvalidConfigurationException in JGit while pulling remote branch

I'm trying to pull the remote master branch in my currently checked out local branch. Here's the code for it
checkout.setName(branchName).call();
PullCommand pullCommand = git.pull();
System.out.println("Pulling master into " + branchName + "...");
StoredConfig config = git.getRepository().getConfig();
config.setString("branch", "master", "merge", "refs/heads/master");
pullCommand.setRemote("https://github.com/blackblood/TattooShop.git");
pullCommand.setRemoteBranchName("master");
pullResult = pullCommand.setCredentialsProvider(credentialsProvider).call();
When I run the code I get the following error on this line pullCommand.setRemote("https://github.com/blackblood/TattooShop.git");
Error :
org.eclipse.jgit.api.errors.InvalidConfigurationException:
No value for key remote.https://github.com/blackblood/TattooShop.git.url found in configuration
Couldn't pull from remote. Terminating...
at org.eclipse.jgit.api.PullCommand.call(PullCommand.java:247)
at upload_gen.Launcher.updateFromRemote(Launcher.java:179)
at upload_gen.Launcher.main(Launcher.java:62)
Following are the contents of my .git/config file
[core]
repositoryformatversion = 0
filemode = false
bare = false
logallrefupdates = true
symlinks = false
ignorecase = true
hideDotFiles = dotGitOnly
[remote "origin"]
url = https://github.com/blackblood/TattooShop.git
fetch = +refs/heads/*:refs/remotes/origin/*
[branch "master"]
remote = origin
merge = refs/heads/master
[remote "heroku"]
url = git@heroku.com:tattooshop.git
fetch = +refs/heads/*:refs/remotes/heroku/*
This seems to be a bug in JGit. According to the JavaDoc of setRemote(), it sets the remote (uri or name) to be used for the pull operation but apparently only the remote name works.
Given your configuration you can work around the issue by using the remote name like this:
pullCommand.setRemote( "origin" );
I recommend opening a bug report in the JGit Bugzilla so that this gets fixed in future versions of JGit.
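If you would rather not hard-code "origin", a rough sketch of resolving the remote name from the URL via the StoredConfig you already load might look like this (reusing the git, config and pullCommand objects from the question's snippet):
String wantedUrl = "https://github.com/blackblood/TattooShop.git";
String remoteName = "origin"; // fallback if no match is found
for (String name : config.getSubsections("remote")) {
    // Compare each configured remote's URL against the one we want to pull from
    if (wantedUrl.equals(config.getString("remote", name, "url"))) {
        remoteName = name;
        break;
    }
}
pullCommand.setRemote(remoteName);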

How can I submit a Cascading job to a remote YARN cluster from Java?

I know that I can submit a Cascading job by packaging it into a JAR, as detailed in the Cascading user guide. That job will then run on my cluster if I manually submit it using the hadoop jar CLI command.
However, in the original Hadoop 1 Cascading version, it was possible to submit a job to the cluster by setting certain properties on the Hadoop JobConf. Setting fs.defaultFS and mapred.job.tracker caused the local Hadoop library to automatically attempt to submit the job to the Hadoop1 JobTracker. However, setting these properties does not seem to work in the newer version. Submitting to a CDH5 5.2.1 Hadoop cluster using Cascading version 2.5.3 (which lists CDH5 as a supported platform) leads to an IPC exception when negotiating with the server, as detailed below.
I believe that this platform combination -- Cascading 2.5.6, Hadoop 2, CDH 5, YARN, and the MR1 API for submission -- is a supported combination based on the compatibility table (see under "Prior Releases" heading). And submitting the job using hadoop jar works fine on this same cluster. Port 8031 is open between the submitting host and the ResourceManager. An error with the same message is found in the ResourceManager logs on the server side.
I am using the cascading-hadoop2-mr1 library.
Exception in thread "main" cascading.flow.FlowException: unhandled exception
at cascading.flow.BaseFlow.complete(BaseFlow.java:894)
at WordCount.main(WordCount.java:91)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RpcServerException): Unknown rpc kind in rpc headerRPC_WRITABLE
at org.apache.hadoop.ipc.Client.call(Client.java:1411)
at org.apache.hadoop.ipc.Client.call(Client.java:1364)
at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:231)
at org.apache.hadoop.mapred.$Proxy11.getStagingAreaDir(Unknown Source)
at org.apache.hadoop.mapred.JobClient.getStagingAreaDir(JobClient.java:1368)
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:102)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:982)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:976)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:976)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:950)
at cascading.flow.hadoop.planner.HadoopFlowStepJob.internalNonBlockingStart(HadoopFlowStepJob.java:105)
at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:196)
at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:149)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:124)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:43)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Demo code is below, which is basically identical to the WordCount sample from the Cascading user guide.
public class WordCount {
    public static void main(String[] args) {
        String inputPath = "/user/vagrant/wordcount/input";
        String outputPath = "/user/vagrant/wordcount/output";

        Scheme sourceScheme = new TextLine( new Fields( "line" ) );
        Tap source = new Hfs( sourceScheme, inputPath );
        Scheme sinkScheme = new TextDelimited( new Fields( "word", "count" ) );
        Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );

        Pipe assembly = new Pipe( "wordcount" );
        String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
        Function function = new RegexGenerator( new Fields( "word" ), regex );
        assembly = new Each( assembly, new Fields( "line" ), function );
        assembly = new GroupBy( assembly, new Fields( "word" ) );
        Aggregator count = new Count( new Fields( "count" ) );
        assembly = new Every( assembly, count );

        Properties properties = AppProps.appProps()
            .setName( "word-count-application" )
            .setJarClass( WordCount.class )
            .buildProperties();
        properties.put("fs.defaultFS", "hdfs://192.168.30.101");
        properties.put("mapred.job.tracker", "192.168.30.101:8032");

        FlowConnector flowConnector = new HadoopFlowConnector( properties );
        Flow flow = flowConnector.connect( "word-count", source, sink, assembly );
        flow.complete();
    }
}
I've also tried setting a bunch of other properties to try to get it working:
mapreduce.jobtracker.address
mapreduce.framework.name
yarn.resourcemanager.address
yarn.resourcemanager.host
yarn.resourcemanager.hostname
yarn.resourcemanager.resourcetracker.address
None of these worked; they just cause the job to run in local mode (unless mapred.job.tracker is also set).
I've now resolved this problem. It comes from trying to use the older Hadoop classes that Cloudera distributes, particularly JobClient. This will happen if you use hadoop-core with the provided 2.5.0-mr1-cdh5.2.1 version, or the hadoop-client dependency with this same version number. Although this claims to be the MR1 version, and we are using the MR1 API to submit, this version actually ONLY supports submission to the Hadoop1 JobTracker, and it does not support YARN.
In order to allow submitting to YARN, you must use the hadoop-client dependency with the non-MR1 2.5.0-cdh5.2.1 version, which still supports submission of MR1 jobs to YARN.
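With Maven that boils down to a dependency along these lines (assuming the Cloudera repository is configured in your build; adjust the version to match your CDH release):
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.5.0-cdh5.2.1</version>
</dependency>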
