I'm trying to make my Spark Streaming application read its input from an S3 directory, but I keep getting this exception after launching it with the spark-submit script:
Exception in thread "main" java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:66)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at org.apache.hadoop.fs.s3native.$Proxy6.initialize(Unknown Source)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:216)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
at org.apache.spark.streaming.StreamingContext.checkpoint(StreamingContext.scala:195)
at MainClass$.main(MainClass.scala:1190)
at MainClass.main(MainClass.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I'm setting those properties through this block of code, as suggested at http://spark.apache.org/docs/latest/ec2-scripts.html (bottom of the page):
val ssc = new org.apache.spark.streaming.StreamingContext(
conf,
Seconds(60))
ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId",args(2))
ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey",args(3))
args(2) and args(3) are, of course, my AWS Access Key ID and Secret Access Key.
Why does it keep saying they are not set?
EDIT: I also tried this way, but I get the same exception:
val lines = ssc.textFileStream("s3n://"+ args(2) +":"+ args(3) + "#<mybucket>/path/")
Odd. Try also doing a .set on the SparkContext, and try exporting the environment variables before you start the application:
export AWS_ACCESS_KEY_ID=<your access>
export AWS_SECRET_ACCESS_KEY=<your secret>
^^this is how we do it.
UPDATE: According to #tribbloid the above broke in 1.3.0; now you have to faff around for ages with hdfs-site.xml, or you can do this (and it works in a spark-shell):
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", myAccessKey)
hadoopConf.set("fs.s3.awsSecretAccessKey", mySecretKey)
The following configuration works for me; make sure you also set "fs.s3.impl":
val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
val sc = new SparkContext(conf)
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", myAccessKey)
hadoopConf.set("fs.s3.awsSecretAccessKey", mySecretKey)
On AWS EMR the above suggestions did not work. Instead I updated the following properties in conf/core-site.xml:
fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey, set to your S3 credentials.
For those using EMR, use the Spark build as described at https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark and just reference S3 with the s3:// URI. There is no need to set the S3 implementation or any additional configuration, since credentials are supplied by the IAM role.
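To illustrate, a minimal sketch of that setup, assuming the cluster's IAM role grants S3 access (the bucket and path are placeholders, and the master URL comes from spark-submit on the EMR cluster):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class EmrS3Read {
    public static void main(String[] args) {
        // No access key / secret key settings here: on EMR the s3:// filesystem
        // (EMRFS) resolves credentials from the cluster's IAM role.
        SparkConf conf = new SparkConf().setAppName("emr-s3-read");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("s3://my-bucket/some/path/"); // placeholder bucket/path
        System.out.println("line count: " + lines.count());

        sc.stop();
    }
}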
I wanted to put the credentials more securely in a config file on one of my encrypted partitions. So I did export HADOOP_CONF_DIR=~/Private/.aws/hadoop_conf before running my spark application, and put a file in that directory (encrypted via ecryptfs) called core-site.xml containing the credentials like this:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>my_aws_access_key_id_here</value>
</property>
<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>my_aws_secret_access_key_here</value>
</property>
</configuration>
HADOOP_CONF_DIR can also be set in conf/spark-env.sh.
Latest EMR releases (tested on 4.6.0) require the following configuration:
val sc = new SparkContext(conf)
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3.impl", "com.amazon.ws.emr.hadoop.fs.EmrFileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", myAccessKey)
hadoopConf.set("fs.s3.awsSecretAccessKey", mySecretKey)
Although in most cases the out-of-the-box config should work, this is needed if you have different S3 credentials from the ones you launched the cluster with.
In Java, these are the lines of code. You have to add the AWS credentials to the SparkContext only, not the SparkSession.
JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());
sc.hadoopConfiguration().set("fs.s3a.access.key", AWS_KEY);
sc.hadoopConfiguration().set("fs.s3a.secret.key", AWS_SECRET_KEY);
This works for me in the 1.4.1 shell:
import org.apache.spark.deploy.SparkHadoopUtil
val conf = sc.getConf
conf.set("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
conf.set("spark.hadoop.fs.s3.awsAccessKeyId", <your access key>)
conf.set("spark.hadoop.fs.s3.awsSecretAccessKey", <your secret key>)
SparkHadoopUtil.get.conf.addResource(SparkHadoopUtil.get.newConfiguration(conf))
...
sqlContext.read.parquet("s3://...")
Augmenting #nealmcb's answer, the most straightforward way to do this is to define
HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
in conf/spark-env.sh or export that env variable in ~/.bashrc or ~/.bash_profile.
That will work as long as you can access s3 through hadoop. For instance, if you can run
hadoop fs -ls s3n://path/
then hadoop can see the s3 path.
If hadoop can't see the path, follow the advice contained in How can I access S3/S3n from a local Hadoop 2.6 installation?
Currently I am running into an issue but do not understand why it is happening. I have implemented a Java function which uses the Databricks Autoloader to readStream all Parquet files from an Azure blob storage and "write" them into a dataframe (a Dataset, since it is written in Java). The code is executed from a Jar which I build in Java and run as a Job on a shared cluster.
Code:
Dataset<Row> newdata= spark.readStream().format("cloudFiles")
.option("cloudFiles.subscriptionId", storagesubscriptionid)
.option("cloudFiles.format", "parquet")
.option("cloudFiles.tenantId", sptenantid)
.option("cloudFiles.clientId", spappid)
.option("cloudFiles.clientSecret", spsecret)
.option("cloudFiles.resourceGroup", storageresourcegroup)
.option("cloudFiles.connectionString", storagesasconnectionstring)
// .option("cloudFiles.useNotifications", "true")
.schema(dfsample.schema()).option("cloudFiles.includeExistingFiles", "true").load(filePath);
newdata.show();
But unfortunately I get the following exception:
WARN SQLExecution: Error executing delta metering
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();
cloudFiles
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.throwError(UnsupportedOperationChecker.scala:447)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.$anonfun$checkForBatch$1(UnsupportedOperationChecker.scala:38)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.$anonfun$checkForBatch$1$adapted(UnsupportedOperationChecker.scala:36)
What makes me wonder is that the exact same code runs fine inside a Databricks notebook written in Scala:
val df1 = spark.readStream.format("cloudFiles").option("cloudFiles.useNotifications", "true").option("cloudFiles.subscriptionId", storagesubscriptionid)
.option("cloudFiles.format", "parquet")
.option("cloudFiles.tenantId", sptenantid)
.option("cloudFiles.clientId", spappid)
.option("cloudFiles.clientSecret", spsecret)
.option("cloudFiles.resourceGroup", storageresourcegroup)
.option("cloudFiles.connectionString", storagesasconnectionstring)
.option("cloudFiles.useNotifications", "true")
.option("cloudFiles.subscriptionId", storagesubscriptionid).schema(df_schema).option("cloudFiles.includeExistingFiles", "false").load(filePath);
display(df1);
I expect a Dataset object containing all the new data from the blob storage Parquet files, with schema: id1:int, id2:int, content:binary.
So finally, I have found a way to get the Autoloader working inside my Java Jar.
As Vincent already commented, you have to combine readStream with a writeStream.
So I am simply writing the files which have been detected by the Autoloader to an Azure Data Lake.
spark.readStream().format("cloudFiles")
.option("cloudFiles.subscriptionId", STORAGE_SUBSCRIPTION_ID)
.option("cloudFiles.format", "parquet")
.option("cloudFiles.tenantId", SP_TENANT_ID)
.option("cloudFiles.clientId", SP_APPLICATION_ID)
.option("cloudFiles.clientSecret", SP_CLIENT_SECRET)
.option("cloudFiles.resourceGroup", STORAGE_RESOURCE_GROUP)
.option("cloudFiles.connectionString", STORAGE_SAS_CONNECTION_STRING)
.option("cloudFiles.includeExistingFiles", "true")
.option("cloudFiles.useNotifications", "true")
.schema(DF_SCHEMA)
.load(BLOB_STORAGE_LANDING_ZONE_PATH)
.writeStream()
.format("delta")
.option("checkpointLocation", DELTA_TABLE_RAW_DATA_CHECKPOINT_PATH)
.option("mergeSchema", "true")
.trigger(Trigger.Once())
.outputMode("append")
.start(DELTA_TABLE_RAW_DATA_PATH).awaitTermination();
This works fine in Java when you need to run a Jar as a Databricks Job.
But to be honest I am still wondering why, from inside a notebook, I don't have to use writeStream in Scala to receive new files from the Autoloader.
I am trying to read a PKCS#8 private key which looks like the following:
key.k8 --> (Sample key. Passphrase - 123456):
-----BEGIN ENCRYPTED PRIVATE KEY-----
MIIFLTBXBgkqhkiG9w0BBQ0wSjApBgkqhkiG9w0BBQwwHAQILbKY9hPxYSoCAggA
MAwGCCqGSIb3DQIJBQAwHQYJYIZIAWUDBAEqBBCvaGt2Hmm2NpHpxbLvHKyOBIIE
0IQ7dVrAGXLZl0exYIvyxLAu6zO00jL6b3sb/agTcCFOz8JU6fBanxY0d5aYO4Dn
mynQG7BoljU470s0zIwW/wk0MmdUFl4nXWBX/4qnG0sZqZ9KZ7I8R/WrBkmpX8C/
4pjdVhu8Ht8dfOYbkbjMBTohDJz8vJ0QwDIXi9yFjjef+QjwrFOl6kAeDJFVMGqc
s7K/wOnhsL1XxfW9uTulPiZh5YTZKcatMkeGDR7c+cg5I+Mutim92diWuCekhNoa
uvhUy1M3cbs7Azp1Mhz+V0CDKklI95EvN4u23WhiJPCjAofC/e45/heOP3Dwm7WZ
zHEY1C/X8PsTl6MEEIF3ZJP+4Vr0corAs1L2FqE6oOng8dFFYmF5eRyBx6bxFd05
iYbfOH24/b3qtFKPC689kGEd0gWp1dwES35SNNK+cJqVRTjgI0oKhOai3rhbGnmp
tx4+JqploQgTorj4w9asbtZ/qZA2mYSSR/Q64SHv7LfoUCI9bgx73MqRQBgvI5yS
b4BoFBnuEgOduZLaGKGjKVW3m5/q8oiDAaspcSLCJMIrdOTYWJB+7mfxX4Xy0vEe
5m2jXpSLQmrfjgpSTpHDKi/3b6OzKOcHjSFBf8IoiHuLc5DVvLECzDUxxaMrTZ71
0YXvEPwl2R9BzEANwwR9ghJvFg1Be/d5W/WA1Efe6cNQNBlmErxD6l+4KDUgGjTr
Aaksp9SZAv8uQAsg7C57NFHpTA5Hznr5JctL+WlO+Gk0cAV6i4Py3kA6EcfatsnS
PqP2KbxT+rb2ATMUZqgWc20QvDt6j0CTA1BuVD1PNhnAUFvb2ocyEEXOra22DPPS
UPu6jirSIyFcjqFjJ9A1FD9L4/UuX2UkDSLqblFlYB1+G55KZp+EKz8SZoN5qXy1
LyMtnacEP5OtRDrOjopzVNiuV1Uv63M9QVi1hZlVLJEomgjWuvuyEuIwDaY2uryW
vx+jJEZyySFkb1JwAbrm+p6sCTFnbQ/URKC2cit/FJyKqNim6VQvGL8Sez34qV3z
D13QJgTZfsy+BaZoaQ6cJTXtJ8cN0IcQciOiDNBKMW66zO6ujS8G+KNviNQypDm6
h4sOgjMqLaZ4ezPEdNj/gaxV7Y15nVRu0re8dVkaa5t9ft/sh6A+yeTD5tS5hHkf
NI7uJPTaTXVoz7xq2PAJUTWujMLMZKtmNOzNqYvxWRy3tCOFobBQkMxqEBEwHd+x
SA+gFcJKJ+aNfCGZJ5fFr8rNlhtOF6uMwOAlfiUlP/pCUDUCKPjZVj4K95yNc8Io
jSZSPb5tGPe0HqXgc6IAfQarlUZt90oVtzL0OfOfTxe1bEzS2ccNadbx/6vjLBc4
q5UuUBppl3rXpbuZ7J1Rp3n2byF4APxFdT2LHKq+MYMfWUToau/TCMT4lFIM9tM8
7TuuyUT2PKzf/xlsl4iScw96z9xxGPQrXn7IA2W5iL+0eCLztJdjNRX1FisdfIBL
PraOVlmF8jHKbFdRZ8Yi8pApbQjvHi24g7dX7u/cq1FH/VE+nJ0O8YVCYVDw13CW
h0p7yD7BuB0R+0WnR0yvkp30vK4/rtCB+Ob8bH/+HvAZrAU5X8jq/wsQbLkrLHZV
6A6GGfX8+hy5AoaXsH1BHnMyXkaF6Mv29z8JcslDJxX/
-----END ENCRYPTED PRIVATE KEY-----
The following code is used to parse the private key:
InputStream privateKeyInputStream = getPrivateKeyInputStream(); // reads the key file from classpath and share as DataStream
logger.info("InputStreamExists --> {} ", privateKeyInputStream.available());
PEMParser pemParser = new PEMParser(new InputStreamReader(privateKeyInputStream));
Object pemObject = pemParser.readObject();
if (pemObject instanceof PKCS8EncryptedPrivateKeyInfo) {
// Handle the case where the private key is encrypted.
PKCS8EncryptedPrivateKeyInfo encryptedPrivateKeyInfo = (PKCS8EncryptedPrivateKeyInfo) pemObject;
InputDecryptorProvider pkcs8Prov =
new JceOpenSSLPKCS8DecryptorProviderBuilder().build(passphrase.toCharArray());
privateKeyInfo = encryptedPrivateKeyInfo.decryptPrivateKeyInfo(pkcs8Prov); // fails here
}
InputStream resourceAsStream = null;
if ("local".equals(privateKeyMode)) {
resourceAsStream = this.getClass().getResourceAsStream(privateKeyPath);
} else {
File keyFile = new File(privateKeyPath);
logger.info(
"Key file found in {} mode. FileName : {}, Exists : {}",
privateKeyMode,
keyFile.getName(),
keyFile.exists());
try {
resourceAsStream = new DataInputStream(new FileInputStream(keyFile));
} catch (FileNotFoundException e) {
e.printStackTrace();
}
When I run this code through IntelliJ on Windows, it works fine, but when I run it in a Docker container I get the following exception:
org.bouncycastle.pkcs.PKCSException: unable to read encrypted data: failed to construct sequence from byte[]: Extra data detected in stream
snowflake-report-sync | at org.bouncycastle.pkcs.PKCS8EncryptedPrivateKeyInfo.decryptPrivateKeyInfo(Unknown Source) ~[bcpkix-jdk15on-1.64.jar!/:1.64.00.0]
snowflake-report-sync | at com.optum.snowflakereportsync.configuration.SnowFlakeConfig.getPrivateKey(SnowFlakeConfig.java:103) ~[classes!/:na]
snowflake-report-sync | at com.optum.snowflakereportsync.configuration.SnowFlakeConfig.getConnectionProperties(SnowFlakeConfig.java:67) ~[classes!/:na]
Following is the Dockerfile used:
FROM adoptopenjdk/openjdk11-openj9:latest
COPY build/libs/snowflake-report-sync-*.jar snowflake-report-sync.jar
RUN mkdir /encryption-keys
COPY encryption-keys/ /encryption-keys/ #keys are picked from docker filesystem when running in container
EXPOSE 8080
CMD java -Dcom.sun.management.jmxremote -noverify ${JAVA_OPTS} -jar snowflake-report-sync.jar
Options tried:
Ensured that the key file is being read while running in the container. The logger "InputStreamExists --> {}" gives the number of bytes.
Ran dos2unix on key.k8 just to make sure there are no Windows "^M" characters which could be causing the issue, since the container is a Linux one: FROM adoptopenjdk/openjdk11-openj9:latest
Not sure what I am doing wrong, but any help or pointers would be appreciated.
Like #Bragolgirith suspected, BouncyCastle seems to have problems with OpenJ9. I guess it is not a Docker issue, because I can reproduce it on GitHub Actions, too. It is also not limited to BouncyCastle 1.64 or 1.70; it happens in both versions. It also happens on OpenJ9 JDK 11, 14, 17 on Windows, macOS and Linux, but for the same matrix of Java and OS versions it works on Adopt-Hotspot and Zulu.
Here is an example Maven project and a failed matrix build. So if you select another JVM type, you should be fine. I know that #Bragolgirith already suggested that, but I wanted to make the problem reproducible for everyone and also provide an MCVE, in case someone wants to open a BC or OpenJ9 issue.
P.S.: It is also not a character set issue with the InputStreamReader. This build fails exactly the same as before after I changed the constructor call.
Update: I have created BC-Java issue #1099. Let's see what the maintainers can say about this.
Update 2: The solution to your problem is to explicitly set the security provider to BC for your input decryptor provider. Thanks to David Hook for his helpful comment in #1099.
BouncyCastleProvider securityProvider = new BouncyCastleProvider();
Security.addProvider(securityProvider);
// (...)
InputDecryptorProvider pkcs8Prov = new JceOpenSSLPKCS8DecryptorProviderBuilder()
// Explicitly setting security provider helps to avoid ambiguities
// which otherwise can cause problems, e.g. on OpenJ9 JVMs
.setProvider(securityProvider)
.build(passphrase.toCharArray());
See this commit and the corresponding build, now passing on all platforms, Java versions and JVM types (including OpenJ9).
Because #Bragolgirith mentioned it in his answer: if you want to avoid the explicit new JceOpenSSLPKCS8DecryptorProviderBuilder().setProvider(securityProvider), calling Security.insertProviderAt(securityProvider, 1) instead of simply Security.addProvider(securityProvider) would also solve the problem in this case. But this holds true only as long as no other part of your code or any third-party library sets another provider to position 1 afterwards, as explained in the Javadoc. So maybe it is not a good idea to rely on that.
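For reference, a minimal sketch of that alternative, reusing the names from the snippet above (passphrase, privateKeyInfo and encryptedPrivateKeyInfo come from the question's code):

// Alternative: register BC as the most-preferred provider so that
// JceOpenSSLPKCS8DecryptorProviderBuilder picks it up without an explicit setProvider(...).
// Caveat, as noted above: anything that later inserts another provider at position 1 breaks this again.
BouncyCastleProvider securityProvider = new BouncyCastleProvider();
Security.insertProviderAt(securityProvider, 1);

InputDecryptorProvider pkcs8Prov = new JceOpenSSLPKCS8DecryptorProviderBuilder()
        .build(passphrase.toCharArray());
privateKeyInfo = encryptedPrivateKeyInfo.decryptPrivateKeyInfo(pkcs8Prov);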
Edit:
On second thought, when creating the JceOpenSSLPKCS8DecryptorProviderBuilder, you're not explicitly specifying the provider:
new JceOpenSSLPKCS8DecryptorProviderBuilder()
.setProvider(BouncyCastleProvider.PROVIDER_NAME) // add this line
.build(passphrase.toCharArray());
It seems OpenJ9 uses a different provider/algo selection mechanism and selects the SunJCE's AESCipher class as CipherSpi by default, while Hotspot selects BouncyCastleProvider's AES class.
Explicitly specifying the provider should work in all cases.
Alternatively, when adding the BouncyCastleProvider you could insert it at the first preferred position (i.e. Security.insertProviderAt(new BouncyCastleProvider(), 1) instead of Security.addProvider(new BouncyCastleProvider())) so that it gets selected.
(It's still unclear to me why the provider selection mechanism differs between the different JVMs.)
Original post:
I've managed to reproduce the issue and at this point I'd say it's an incompatibility issue with the OpenJ9 JVM.
Starting from a Hotspot base image instead, e.g.
FROM adoptopenjdk:11-jre-hotspot
makes the code work.
(Not yet entirely sure whether the fault lies with the Docker image itself, the OpenJ9 JVM or BouncyCastle)
When I try to run the Spark application in YARN mode using the HDFS file system, it works fine as long as I provide the below properties.
sparkConf.set("spark.hadoop.yarn.resourcemanager.hostname",resourcemanagerHostname);
sparkConf.set("spark.hadoop.yarn.resourcemanager.address",resourcemanagerAddress);
sparkConf.set("spark.yarn.stagingDir",stagingDirectory );
But the problems with this are:
Since my HDFS has NameNode HA enabled, it won't work when I provide spark.yarn.stagingDir with the common (nameservice) URL of HDFS.
E.g. hdfs://hdcluster/user/tmp/ gives an error that says:
has unknown host hdcluster
It works fine when I give the URL as hdfs://<ActiveNameNode>/user/tmp/, but we don't know in advance which NameNode will be active, so how do I resolve this?
A few things I have noticed: SparkContext accepts a Hadoop configuration, but the SparkConf class has no method to accept one.
How do I provide the ResourceManager address when the ResourceManagers are running in HA?
You need to use the configuration parameters that are already present in the Hadoop config files, such as yarn-site.xml and hdfs-site.xml.
Initialize the Configuration object using:
val conf = new org.apache.hadoop.conf.Configuration()
To check the current HDFS URI, use:
val currentFS = conf.get("fs.defaultFS");
You will get an output with the URI of your namenode, something like:
res0: String = hdfs://namenode1
To check the address of current resource manager in use, try:
val currentRMaddr = conf.get("yarn.resourcemanager.address")
I have had the exact same issue. Here is the solution (finally):
You have to configure the Spark Context's internal Hadoop Configuration for HDFS HA. When instantiating the Spark Context or Spark Session, it will find all configuration entries whose keys start with spark.hadoop. and use them when instantiating the Hadoop Configuration.
So, in order to be able to use hdfs://namespace/path/to/file and not get an invalid host exception, add the following configuration options:
spark.hadoop.fs.defaultFS = "hdfs://my-namespace-name"
spark.hadoop.ha.zookeeper.quorum = "real.hdfs.host.1.com:2181,real.hdfs.host.2.com:2181"
spark.hadoop.dfs.nameservices = "my-namespace-name"
spark.hadoop.dfs.client.failover.proxy.provider.my-namespace-name = org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
spark.hadoop.dfs.ha.automatic-failover.enabled.my-namespace-name = true
spark.hadoop.dfs.ha.namenodes.my-namespace-name = "realhost1,realhost2"
spark.hadoop.dfs.namenode.rpc-address.my-namespace-name.realhost1 = "real.hdfs.host.1.com:8020"
spark.hadoop.dfs.namenode.servicerpc-address.my-namespace-name.realhost1 = "real.hdfs.host.1.com:8022"
spark.hadoop.dfs.namenode.http-address.my-namespace-name.realhost1 = "real.hdfs.host.1.com:50070"
spark.hadoop.dfs.namenode.https-address.my-namespace-name.realhost1 = "real.hdfs.host.1.com:50470"
spark.hadoop.dfs.namenode.rpc-address.my-namespace-name.realhost2 = "real.hdfs.host.2.com:8020"
spark.hadoop.dfs.namenode.servicerpc-address.my-namespace-name.realhost2 = "real.hdfs.host.2.com:8022"
spark.hadoop.dfs.namenode.http-address.my-namespace-name.realhost2 = "real.hdfs.host.2.com:50070"
spark.hadoop.dfs.namenode.https-address.my-namespace-name.realhost2 = "real.hdfs.host.2.com:50470"
spark.hadoop.dfs.replication = 3
spark.hadoop.dfs.blocksize = 134217728
spark.hadoop.dfs.client.use.datanode.hostname = false
spark.hadoop.dfs.datanode.hdfs-blocks-metadata.enabled = true
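If you would rather set these programmatically than in spark-defaults.conf, the same keys can go on the SparkConf before the context is created, matching the sparkConf.set(...) style from the question. A minimal hedged sketch, reusing the placeholder namespace and host names from the list above (only a subset of the keys is shown; the master URL comes from spark-submit):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class HdfsHaConfExample {
    public static void main(String[] args) {
        // Every spark.hadoop.* key is copied into the Hadoop Configuration
        // that the SparkContext builds internally.
        SparkConf sparkConf = new SparkConf()
                .setAppName("hdfs-ha-example")
                .set("spark.hadoop.fs.defaultFS", "hdfs://my-namespace-name")
                .set("spark.hadoop.dfs.nameservices", "my-namespace-name")
                .set("spark.hadoop.dfs.ha.namenodes.my-namespace-name", "realhost1,realhost2")
                .set("spark.hadoop.dfs.namenode.rpc-address.my-namespace-name.realhost1",
                        "real.hdfs.host.1.com:8020")
                .set("spark.hadoop.dfs.namenode.rpc-address.my-namespace-name.realhost2",
                        "real.hdfs.host.2.com:8020")
                .set("spark.hadoop.dfs.client.failover.proxy.provider.my-namespace-name",
                        "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        JavaSparkContext sc = new JavaSparkContext(sparkConf);
        // hdfs://my-namespace-name/user/tmp/ now resolves to whichever NameNode is active.
        sc.stop();
    }
}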
You are probably looking for the HADOOP_CONF_DIR setting in spark-env.sh (for example HADOOP_CONF_DIR=/path/to/hadoop/conf). That environment variable should point to the directory where hdfs-site.xml and core-site.xml exist (the same ones used when starting the Hadoop HA cluster). You should then be able to use hdfs://namespace/path/to/file without issues.
I know that I can submit a Cascading job by packaging it into a JAR, as detailed in the Cascading user guide. That job will then run on my cluster if I manually submit it using the hadoop jar CLI command.
However, in the original Hadoop 1 Cascading version, it was possible to submit a job to the cluster by setting certain properties on the Hadoop JobConf. Setting fs.defaultFS and mapred.job.tracker caused the local Hadoop library to automatically attempt to submit the job to the Hadoop1 JobTracker. However, setting these properties does not seem to work in the newer version. Submitting to a CDH5 5.2.1 Hadoop cluster using Cascading version 2.5.3 (which lists CDH5 as a supported platform) leads to an IPC exception when negotiating with the server, as detailed below.
I believe that this platform combination -- Cascading 2.5.6, Hadoop 2, CDH 5, YARN, and the MR1 API for submission -- is a supported combination based on the compatibility table (see under "Prior Releases" heading). And submitting the job using hadoop jar works fine on this same cluster. Port 8031 is open between the submitting host and the ResourceManager. An error with the same message is found in the ResourceManager logs on the server side.
I am using the cascading-hadoop2-mr1 library.
Exception in thread "main" cascading.flow.FlowException: unhandled exception
at cascading.flow.BaseFlow.complete(BaseFlow.java:894)
at WordCount.main(WordCount.java:91)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RpcServerException): Unknown rpc kind in rpc headerRPC_WRITABLE
at org.apache.hadoop.ipc.Client.call(Client.java:1411)
at org.apache.hadoop.ipc.Client.call(Client.java:1364)
at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:231)
at org.apache.hadoop.mapred.$Proxy11.getStagingAreaDir(Unknown Source)
at org.apache.hadoop.mapred.JobClient.getStagingAreaDir(JobClient.java:1368)
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:102)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:982)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:976)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:976)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:950)
at cascading.flow.hadoop.planner.HadoopFlowStepJob.internalNonBlockingStart(HadoopFlowStepJob.java:105)
at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:196)
at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:149)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:124)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:43)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Demo code is below, which is basically identical to the WordCount sample from the Cascading user guide.
public class WordCount {
public static void main(String[] args) {
String inputPath = "/user/vagrant/wordcount/input";
String outputPath = "/user/vagrant/wordcount/output";
Scheme sourceScheme = new TextLine( new Fields( "line" ) );
Tap source = new Hfs( sourceScheme, inputPath );
Scheme sinkScheme = new TextDelimited( new Fields( "word", "count" ) );
Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );
Pipe assembly = new Pipe( "wordcount" );
String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
Function function = new RegexGenerator( new Fields( "word" ), regex );
assembly = new Each( assembly, new Fields( "line" ), function );
assembly = new GroupBy( assembly, new Fields( "word" ) );
Aggregator count = new Count( new Fields( "count" ) );
assembly = new Every( assembly, count );
Properties properties = AppProps.appProps()
.setName( "word-count-application" )
.setJarClass( WordCount.class )
.buildProperties();
properties.put("fs.defaultFS", "hdfs://192.168.30.101");
properties.put("mapred.job.tracker", "192.168.30.101:8032");
FlowConnector flowConnector = new HadoopFlowConnector( properties );
Flow flow = flowConnector.connect( "word-count", source, sink, assembly );
flow.complete();
}
}
I've also tried setting a bunch of other properties to try to get it working:
mapreduce.jobtracker.address
mapreduce.framework.name
yarn.resourcemanager.address
yarn.resourcemanager.host
yarn.resourcemanager.hostname
yarn.resourcemanager.resourcetracker.address
None of these worked; they just caused the job to run in local mode (unless mapred.job.tracker was also set).
I've now resolved this problem. It comes from trying to use the older Hadoop classes that Cloudera distributes, particularly JobClient. This will happen if you use hadoop-core with the provided 2.5.0-mr1-cdh5.2.1 version, or the hadoop-client dependency with this same version number. Although this claims to be the MR1 version, and we are using the MR1 API to submit, this version actually ONLY supports submission to the Hadoop1 JobTracker, and it does not support YARN.
In order to allow submitting to YARN, you must use the hadoop-client dependency with the non-MR1 2.5.0-cdh5.2.1 version, which still supports submission of MR1 jobs to YARN.
I am trying to run a MapReduce job: I pull from Mongo and then write to HDFS, but I cannot seem to get the job to run. I could not find an example, but the issue I am having is that if I set a Mongo input path, it looks for a Mongo output path as well. And now I am getting an authentication error even though my MongoDB instance does not have authentication enabled.
final Configuration conf = getConf();
final Job job = new Job(conf, "sort");
MongoConfig config = new MongoConfig(conf);
MongoConfigUtil.setInputFormat(getConf(), MongoInputFormat.class);
FileOutputFormat.setOutputPath(job, new Path("/trythisdir"));
MongoConfigUtil.setInputURI(conf,"mongodb://localhost:27017/fake_data.file");
//conf.set("mongo.output.uri", "mongodb://localhost:27017/fake_data.file");
job.setJarByClass(imageExtractor.class);
job.setMapperClass(imageExtractorMapper.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setInputFormatClass( MongoInputFormat.class );
// Execute job and return status
return job.waitForCompletion(true) ? 0 : 1;
Edit: This is the current error I am having:
Exception in thread "main" java.lang.IllegalArgumentException: Couldn't connect and authenticate to get collection
at com.mongodb.hadoop.util.MongoConfigUtil.getCollection(MongoConfigUtil.java:353)
at com.mongodb.hadoop.splitter.MongoSplitterFactory.getSplitterByStats(MongoSplitterFactory.java:71)
at com.mongodb.hadoop.splitter.MongoSplitterFactory.getSplitter(MongoSplitterFactory.java:107)
at com.mongodb.hadoop.MongoInputFormat.getSplits(MongoInputFormat.java:56)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:1079)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1096)
at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:177)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:995)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:948)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:948)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:566)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:596)
at com.orbis.image.extractor.mongo.imageExtractor.run(imageExtractor.java:103)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at com.orbis.image.extractor.mongo.imageExtractor.main(imageExtractor.java:78)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Caused by: java.lang.NullPointerException
at com.mongodb.MongoURI.<init>(MongoURI.java:148)
at com.mongodb.MongoClient.<init>(MongoClient.java:268)
at com.mongodb.hadoop.util.MongoConfigUtil.getCollection(MongoConfigUtil.java:351)
... 22 more
Late answer, but it may be helpful for people. I encountered the same problem while playing with Apache Spark.
I think you should correctly set mongo.input.uri and mongo.output.uri, which will be used by Hadoop, and also the input and output formats.
/*Correct input and output uri setting on spark(hadoop)*/
conf.set("mongo.input.uri", "mongodb://localhost:27017/dbName.inputColName");
conf.set("mongo.output.uri", "mongodb://localhost:27017/dbName.outputColName");
/*Set input and output formats*/
job.setInputFormatClass( MongoInputFormat.class );
job.setOutputFormatClass( MongoOutputFormat.class );
Btw, if "mongo.input.uri" or "mongo.output.uri" strings have typos it causes same error.
Replace:
MongoConfigUtil.setInputURI(conf, "mongodb://localhost:27017/fake_data.file");
by:
MongoConfigUtil.setInputURI(job.getConfiguration(), "mongodb://localhost:27017/fake_data.file");
The conf object is already 'consumed' by your job, so you need to set it directly on the configuration of the job.
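Putting the two suggestions together, a rough sketch of how the job setup from the question might look with the input URI set on the job's own configuration (class names, the URI and the output path are the question's):

final Configuration conf = getConf();
final Job job = new Job(conf, "sort");

// Set the Mongo input URI on the job's own configuration; the conf object
// passed to the Job constructor has already been copied at this point.
MongoConfigUtil.setInputURI(job.getConfiguration(), "mongodb://localhost:27017/fake_data.file");

job.setJarByClass(imageExtractor.class);
job.setMapperClass(imageExtractorMapper.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);

// Read from MongoDB, write plain files to HDFS.
job.setInputFormatClass(MongoInputFormat.class);
FileOutputFormat.setOutputPath(job, new Path("/trythisdir"));

// Execute job and return status
return job.waitForCompletion(true) ? 0 : 1;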
You haven't shared the complete code so it's hard to tell, but what you've got there does not look consistent with typical usage of the MongoDB Connector for Hadoop.
I would suggest that you start with the examples on GitHub.