I am trying to execute a Map-Reduce task in an Oozie workflow using a <java> action.
O'Reilly's Apache Oozie (Islam and Srinivasan 2015) notes that:
While it’s not recommended, Java action can be used to run Hadoop MapReduce jobs because MapReduce jobs are nothing but Java programs after all. The main class invoked can be a Hadoop MapReduce driver and can call Hadoop APIs to run a MapReduce job. In that mode, Hadoop spawns more mappers and reducers as required and runs them on the cluster.
However, I'm not having success using this approach.
The action definition in the workflow looks like this:
<java>
    <!-- Namenode etc. in global configuration -->
    <prepare>
        <delete path="${transformOut}" />
    </prepare>
    <configuration>
        <property>
            <name>mapreduce.job.queuename</name>
            <value>default</value>
        </property>
    </configuration>
    <main-class>package.containing.TransformTool</main-class>
    <arg>${transformIn}</arg>
    <arg>${transformOut}</arg>
    <file>${avroJar}</file>
    <file>${avroMapReduceJar}</file>
</java>
The Tool implementation's main() method looks like this:
public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new TransformTool(), args);
    if (res != 0) {
        throw new Exception("Error running MapReduce.");
    }
}
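For context, the run() method follows the standard new-API driver pattern, along the lines of the simplified sketch below (the Avro-specific mapper/reducer wiring is omitted):

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;

public class TransformTool extends Configured implements Tool {
    // main() is as shown above; this is a simplified sketch of run().
    @Override
    public int run(String[] args) throws Exception {
        // getConf() carries the configuration injected by ToolRunner
        Job job = Job.getInstance(getConf(), "transform");
        job.setJarByClass(TransformTool.class);
        // job.setMapperClass(...); job.setReducerClass(...); // Avro wiring omitted
        FileInputFormat.addInputPath(job, new Path(args[0]));    // ${transformIn}
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // ${transformOut}
        // Block until the MapReduce job finishes and propagate its status
        return job.waitForCompletion(true) ? 0 : 1;
    }
}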
The workflow will crash with the "Error running MapReduce" exception above every time; how do I get the output of the MapReduce to diagnose the problem? Is there a problem with using this Tool to run a MapReduce application? Am I using the wrong API calls?
I am extremely disinclined to use the Oozie <map-reduce> action, as each action in the workflow relies on several separately versioned AVRO schemas.
What's the issue here? I am using the 'new' mapreduce API for the task.
Thanks for any help.
> how do I get the output of the MapReduce...
Back to the basics.
Since you don't care to mention which version of Hadoop and which version of Oozie you are using, I will assume a "recent" setup (e.g. Hadoop 2.7 w/ TimelineServer and Oozie 4.2). And since you don't mention which kind of interface you use (command-line? native Oozie/YARN UI? Hue?) I will give a few examples using the good old CLI.
> oozie jobs -localtime -len 10 -filter name=CrazyExperiment
Shows the last 10 executions of "CrazyExperiment" workflow, so that you can inject the appropriate "Job ID" in next commands.
> oozie job -info 0000005-151217173344062-oozie-oozi-W
Shows the status of that execution, from Oozie point of view. If your Java action is stuck in PREP mode, then Oozie failed to submit it to YARN; otherwise you will find something like job_1449681681381_5858 under "External ID". But beware! The job prefix is a legacy thing; the actual YARN ID is application_1449681681381_5858.
> oozie job -log 0000005-151217173344062-oozie-oozi-W
Shows the Oozie log, as could be expected.
> yarn logs -applicationId application_1449681681381_5858
Shows the consolidated logs for the AppMaster (container #1) and the Java action Launcher (container #2) -- after execution is over. The stdout log for the Launcher contains a lot of Oozie debug output; the real stdout is at the very bottom.
In case your Java action successfully spawned another YARN job, and you were careful to display the child "application ID", you should be able to retrieve it there and run another yarn logs command against it.
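And if the driver code is under your control, one simple way to make that child ID easy to spot is to print it from the Launcher before waiting for completion -- a minimal sketch, assuming a new-API org.apache.hadoop.mapreduce.Job object named job:

// Submit first so the ID is assigned, print it to the Launcher's stdout
// (it ends up at the bottom of the container #2 log), then wait as usual.
job.submit();
System.out.println("Submitted child MapReduce job: " + job.getJobID());
boolean ok = job.waitForCompletion(true);   // monitors the already-submitted job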
Enjoy your next 5 days of debugging ;-)
I am trying to run a simple Java Spark job using Oozie on an EMR cluster. The job just takes files from an input path, performs a few basic operations on them, and places the result in a different output path.
When I try to run it from command line using spark-submit as shown below, it works fine:
spark-submit --class com.someClassName --master yarn --deploy-mode cluster /home/hadoop/some-local-path/my-jar-file.jar yarn s3n://input-path s3n://output-path
Then I set up the same thing in an Oozie workflow. However, when run from there the job always fails. The stdout log contains this line:
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], main() threw exception, Attempt to add (hdfs://[emr-cluster]:8020/user/oozie/workflows/[WF-Name]/lib/[my-jar-file].jar) multiple times to the distributed cache.
java.lang.IllegalArgumentException: Attempt to add (hdfs://[emr-cluster]:8020/user/oozie/workflows/[WF-Name]/lib/[my-jar-file].jar) multiple times to the distributed cache.
I found a KB note and another question here on StackOverflow that deal with a similar error. But for them, the job was failing due to an internal JAR file - not the one the user is passing to run. Nonetheless, I tried its resolution steps to remove JAR files common between Spark and Oozie in the sharelib, and ended up removing a few files from "/user/oozie/share/lib/lib_*/spark". Unfortunately, that did not solve the problem either.
Any ideas on how to debug this issue?
So we finally figured out the issue - at least in our case.
While creating the workflow using Hue, when a Spark action is added, it by default prompts for "File" and "Jar/py name". We provided the path to the JAR file that we wanted to run and the name of that JAR file, respectively, in those fields. The final XML that it created was as follows:
<action name="spark-210e">
<spark xmlns="uri:oozie:spark-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<master>yarn</master>
<mode>cluster</mode>
<name>CleanseData</name>
<class>com.data.CleanseData</class>
<jar>JCleanseData.jar</jar>
<spark-opts>--driver-memory 2G --executor-memory 2G --num-executors 10 --files hive-site.xml</spark-opts>
<arg>yarn</arg>
<arg>[someArg1]</arg>
<arg>[someArg2]</arg>
<file>lib/JCleanseData.jar#JCleanseData.jar</file>
</spark>
<ok to="[nextAction]"/>
<error to="Kill"/>
</action>
The default <file> tag in it was what caused the issue in our case.
So we removed it and edited the definition as shown below, and that worked. Note the change to the <jar> tag as well.
<action name="spark-210e">
<spark xmlns="uri:oozie:spark-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<master>yarn</master>
<mode>cluster</mode>
<name>CleanseData</name>
<class>com.data.CleanseData</class>
<jar>hdfs://path/to/JCleanseData.jar</jar>
<spark-opts>--driver-memory 2G --executor-memory 2G --num-executors 10 --files hive-site.xml</spark-opts>
<arg>yarn</arg>
<arg>[someArg1]</arg>
<arg>[someArg1]</arg>
</spark>
<ok to="[nextAction]"/>
<error to="Kill"/>
</action>
PS: We had a similar issue with Hive actions too. The hive-site.xml file we were supposed to pass with the Hive action - which created a <job-xml> tag - was also causing issues. So we removed it and it worked as expected.
Trying to run a Spark job with masterURL=yarn-client, using SparkLauncher 2.10. The Java code is wrapped in a NiFi processor, and NiFi is currently running as root. When I do yarn application -list, I see the Spark job started with USER = root. I want to run it with USER = hive.
Following is my SparkLauncher code.
Process spark = new SparkLauncher()
        .setSparkHome(cp.fetchProperty(GlobalConstant.spark_submit_work_dir).toString())
        .setAppResource(cp.fetchProperty(GlobalConstant.spark_app_resource))
        .setMainClass(cp.fetchProperty(GlobalConstant.spark_main_class))
        .addAppArgs(ps.getName())
        // .setConf(SparkLauncher.DRIVER_EXTRA_JAVA_OPTIONS, "-Duser.name=hive")
        .setConf(SparkLauncher.DRIVER_EXTRA_JAVA_OPTIONS, "-Dlog4j.configuration=file:///opt/eim/log4j_submitgnrfromhdfs.properties")
        .setVerbose(true)
        .launch();
Do I need to pass the user as a driver extra option? The environment is non-Kerberos.
I read somewhere that I need to pass the user name as a driver extra Java option, but I cannot find that post now!
export HADOOP_USER_NAME=hive worked. SparkLauncher has an overload that accepts a Map of environment variables. As for spark.yarn.principal, the environment is non-Kerberos; as per my reading, that setting only works with Kerberos. I did the following:
Process spark = new SparkLauncher(getEnvironmentVar(ps.getRunAs()))
        .setSparkHome(cp.fetchProperty(GlobalConstant.spark_submit_work_dir).toString())
        .setAppResource(cp.fetchProperty(GlobalConstant.spark_app_resource))
        .setMainClass(cp.fetchProperty(GlobalConstant.spark_main_class))
        .addAppArgs(ps.getName())
        // .setConf(SparkLauncher.DRIVER_EXTRA_JAVA_OPTIONS, "-Duser.name=hive")
        .setConf(SparkLauncher.DRIVER_EXTRA_JAVA_OPTIONS, "-Dlog4j.configuration=file:///opt/eim/log4j_submitgnrfromhdfs.properties")
        .setVerbose(true)
        .launch();
Instead of new SparkLauncher(), I used SparkLauncher(java.util.Map<String,String> env) and added or replaced HADOOP_USER_NAME=hive.
Checked with yarn application -list; it launches as intended with USER=hive.
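For anyone doing the same thing, a minimal sketch of the environment-map variant (the app resource and main class below are placeholders, not my real values; requires org.apache.spark.launcher.SparkLauncher and java.util.*):

// Build the environment passed to SparkLauncher so the spark-submit child
// process runs as the desired Hadoop user on a non-kerberized cluster.
Map<String, String> env = new HashMap<>();
env.put("HADOOP_USER_NAME", "hive");   // user the YARN application is submitted as
Process spark = new SparkLauncher(env)
        .setAppResource("/path/to/app.jar")     // placeholder
        .setMainClass("com.example.Main")       // placeholder
        .setVerbose(true)
        .launch();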
I am executing MapReduce code by calling the driver class from an Oozie java action. The MapReduce job runs successfully and I get the output as expected. However, the log statements in my driver class are not shown in the Oozie job logs. I am using log4j for logging in my driver class.
Do I need to make some configuration changes to see the logs? Here is a snippet of my workflow.xml:
<action name="MyAppDriver">
<java>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="${nameNode}/home/hadoop/work/surjan/outpath/20160430" />
</prepare>
<main-class>com.surjan.driver.MyAppMainDriver</main-class>
<arg>/home/hadoop/work/surjan/PoC/wf-app-dir/MyApp.xml</arg>
<job-xml>/home/hadoop/work/surjan/PoC/wf-app-dir/AppSegmenter.xml</job-xml>
</java>
<ok to="sendEmailSuccess"/>
<error to="sendEmailKill"/>
</action>
The logs are going into YARN.
In my case I have a custom java action. If you look in the YARN UI you have to dig into the mapper task that the java action runs in. In my case the Oozie workflow item was 0070083-200420161025476-oozie-xxx-W, and oozie job -log ${wf_id} shows that the java action 0070083-200420161025476-oozie-xxx-W#java1 failed with a Java exception, but I cannot see any context for it. Looking at the Oozie web UI, only the "Job Error Log" is populated, matching what is shown on the command line; the actual logging isn't shown. The oozie job -info ${wf_id} status shows it failed:
Actions
------------------------------------------------------------------------------------------------------------------------------------
ID                                             Status    Ext ID                     Ext Status      Err Code
------------------------------------------------------------------------------------------------------------------------------------
0070083-200420161025476-oozie-xxx-W#:start:    OK        -                          OK              -
0070083-200420161025476-oozie-xxx-W#java1      ERROR     job_1587370152078_1090     FAILED/KILLED   JA018
0070083-200420161025476-oozie-xxx-W#fail       OK        -                          OK              E0729
------------------------------------------------------------------------------------------------------------------------------------
You can search for the actual YARN application in the YARN Resource Manager web UI (not in the "yarn logs" web console, which shows YARN's own logs, not the logs of what it is hosting). You can easily grep for the correct ID on the command line by looking for the Oozie workflow job ID:
user#host:~/apidemo$ yarn application --list -appStates FINISHED | grep 0070083-200420161025476
20/04/22 20:42:12 INFO client.AHSProxy: Connecting to Application History server at your.host.com/130.178.58.221:10200
20/04/22 20:42:12 INFO client.RequestHedgingRMFailoverProxyProvider: Looking for the active RM in [rm1, rm2]...
20/04/22 20:42:12 INFO client.RequestHedgingRMFailoverProxyProvider: Found active RM [rm2]
application_1587370152078_1090 oozie:launcher:T=java:W=java-whatever-sql-task-wf:A=java1:ID=0070083-200420161025476-oozie-xxx-W MAPREDUCE kerberos-id default FINISHED SUCCEEDED 100% https://your.host.com:8090/jobhistory/job/job_1587370152078_1090
user#host:~/apidemo$
Note that Oozie says the action failed, yet the YARN application state is "FINISHED" and its final status is "SUCCEEDED", which seems strange: the launcher application itself completed normally even though the code it launched failed.
Helpfully, that command-line output also shows the URL to the job history, which opens the web page for the parent application that ran your Java code. If you click the small "logs" link on that page you see some logs. Looking closely, the page says it ran one task of type "map". Clicking the link in that row takes you to the actual task, in my case task_1587370152078_1090_m_000000; click into that to see the first attempt, attempt_1587370152078_1090_m_000000_0, and on the right-hand side there is a tiny "logs" link which shows some more specific logging.
You can also ask yarn for the logs once you know the application id:
yarn logs -applicationId application_1587370152078_1090
That showed me very detailed logs, including the custom Java logging that I couldn't easily see in the console, and there I could see what was really going on.
Note that if you are writing custom code, you want to let YARN set the log4j properties file rather than supplying your own version, so that the YARN tools can find your logs. The code will be run with the flag:
-Dlog4j.configuration=container-log4j.properties
The detailed logs show all the jars that are added to the classpath. You should ensure that your custom code uses the same jars and log4j version.
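For illustration (a sketch, not the actual driver from the question; only the class name is borrowed from it), a driver that just uses a plain log4j Logger and does not override -Dlog4j.configuration will have its output routed by container-log4j.properties into the YARN container logs:

import org.apache.log4j.Logger;

public class MyAppMainDriver {
    private static final Logger LOG = Logger.getLogger(MyAppMainDriver.class);

    public static void main(String[] args) throws Exception {
        // No custom -Dlog4j.configuration here: the launcher JVM is started with
        // container-log4j.properties, so these messages land in the YARN container
        // logs and show up under "yarn logs -applicationId <launcher app id>".
        LOG.info("Driver starting with " + args.length + " argument(s)");
        // ... build and submit the MapReduce job as usual ...
    }
}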
Has anyone out in the community successfully created a Selenium build in Jenkins using Browserstack as their cloud provider, while requiring a local testing connection behind a firewall?
I can say for sure that Sauce Labs makes it surprisingly easy to execute builds with the Sauce Jenkins plugin in a continuous deployment environment, as I have done it. I cannot, however, say the same for BrowserStack. The organization I work with currently uses BrowserStack, and although their service does support automated testing using a binary application, I find it troublesome with Jenkins. Before switching, I need to make absolutely sure BrowserStack is not a viable solution. I love Sauce Labs and what their organization provides, but if BrowserStack works I don't want to switch if I don't need to.
The Browserstack documentation instructs you to run a command, with some available options, in order to create a local connection before execution.
nohup ./[binary file] -localIdentifier [id] [auth key] localhost,3000,0 &
I have added the above statement as a pre-build shell command step. I also have to add 'nohup' because once the binary creates a successful connection it does not exit, as displayed in the output below, and without it the build never actually starts.
BrowserStackLocal v3.5
You can now access your local server(s) in our remote browser.
Press Ctrl-C to exit
Normally I can successfully execute the first build without a problem. Subsequent build configurations using the same command never connect. The above message displays, but during test execution Browserstack reports no local testing connection was established. This confuses me.
To give you a better idea of what's being executed: I have 15 build configurations for various project suites and browser combinations. Two Jenkins executors exist and I have more than 5 BrowserStack VMs available at any given time. Five of the builds will automatically begin execution when the associated project code is pushed to the staging server, filling up both executors. One of them will begin and end fine. None of the others will, as BrowserStack reports local testing is not available.
Saucelabs obviously has this figured out with their plugin, which is great. If Browserstack requires shell commands to create local testing connections, I must be doing something wrong, out of order, etc.
Environment:
Java 7
Selenium 2.45
JUnit 4.11
Maven 3.1.1
Allure 1.4.10
Jenkins 1.5
Can someone who uses BrowserStack in a continuous testing environment with multiple parallel test executions post some information and tell me how each build is configured?
Thanks,
I've recently looked into BrowserStack with Selenium and the BrowserStack Plugin has made this task much easier.
Features:
- Manage your BrowserStack credentials globally or per build job.
- Set up and tear down BrowserStack Local for testing internal, dev or staging environments.
- Embed BrowserStack Automate reports in your Jenkins job results.

Much easier integration all round.
This is Umang replying on behalf of BrowserStack.
To start with, you are using the correct command for setting up the Local Testing connection, although you do not need to specify the 'localhost,3000,0' details. We would also suggest using the "-forcelocal" parameter while initiating the connection. The command should be as follows:
nohup ./[binary file] [auth key] -localIdentifier [id] -forcelocal &
The parameter "-forcelocal" will route all traffic via your IP address. Also, the process of initiating the connection before running your tests is correct.
However, I'd like to confirm the "id" you've specified while creating the connection. As you shared, there are 15 build configurations, and I understand that each build has a different "id" specified. Please make sure that the "id" specified while setting up the Local Testing connection and the one in the tests ("browserstack.localIdentifier" = "id") are the same. Otherwise, you will receive the error "[browserstack.local] is set to true but local testing through BrowserStack is not connected".
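For example, with the Selenium Java bindings the capability must carry exactly the same identifier that was passed to the binary via -localIdentifier; a minimal sketch (the hub URL, credentials, id value and browser choice below are placeholders for your own settings):

DesiredCapabilities caps = new DesiredCapabilities();
caps.setCapability("browser", "Chrome");
caps.setCapability("browserstack.local", "true");
// Must match the -localIdentifier value used when starting the binary
caps.setCapability("browserstack.localIdentifier", "build-1");
WebDriver driver = new RemoteWebDriver(
        new URL("https://USERNAME:ACCESS_KEY@hub.browserstack.com/wd/hub"), caps);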
Integrating BrowserStack with Jenkins is a little bit tricky, but don't worry, it's perfectly doable :-)
The BrowserStackLocal client needs to be started as a background process, as per Umang's suggestion, and that's pretty much how the SauceLabs plugin works as well.
The trouble is that when Jenkins sees that you start daemon processes all by yourself and not via a plugin, it kills them. That's why you need to convince it otherwise.
I've described how to go about it step by step in this article, but if you're using the Pipeline Plugin, you can use the below script as a starting point:
node {
    with_browser_stack 'linux-x64', {
        // Execute tests: here's where a step like
        // sh 'mvn clean verify'
        // would go
    }
}

// ----------------------------------------------
def with_browser_stack(type, actions) {
    // Prepare the BrowserStackLocal client
    if (! fileExists("/var/tmp/BrowserStackLocal")) {
        sh "curl -sS https://www.browserstack.com/browserstack-local/BrowserStackLocal-${type}.zip > /var/tmp/BrowserStackLocal.zip"
        sh "unzip -o /var/tmp/BrowserStackLocal.zip -d /var/tmp"
        sh "chmod +x /var/tmp/BrowserStackLocal"
    }
    // Start the connection
    sh "BUILD_ID=dontKillMe nohup /var/tmp/BrowserStackLocal 42MyAcc3sK3yV4lu3 -onlyAutomate > /var/tmp/browserstack.log 2>&1 & echo \$! > /var/tmp/browserstack.pid"
    try {
        // Execute tests
        actions()
    }
    finally {
        // Stop the connection
        sh "kill `cat /var/tmp/browserstack.pid`"
    }
}
You'd of course need to replace the fake access key (42MyAcc3sK3yV4lu3) with yours, or provide it via an environment variable.
The important part here is the BUILD_ID, because that's what the Jenkins ProcessTreeKiller looks for when it decides whether to kill your daemon process or not.
Hope this helps!
Jan
Paul Whelan's answer to use the Jenkins BrowserStack plugin is currently the simplest way to integrate Jenkins with BrowserStack. The plugin supports all Jenkins versions >1.580.1.
To ensure that you get BrowserStack test reports you will need to configure your project's pom.xml as documented on the plugin wiki.
Just in case anyone else was having problems with this:
For BrowserStackLocal v4.8 I found that -localidentifier has been removed from the binary options. (This is probably old news!)
When I removed the capabilities['browserstack.localIdentifier'] property from our automated tests the connection started working.
local binary
browserstack <key> -v -forcelocal
selenium setup
Capybara.register_driver :browserstack do |app|
  capabilities = Selenium::WebDriver::Remote::Capabilities.new
  # If we're running the BrowserStackLocal binary, we need to
  # tell the driver as well
  capabilities['browserstack.local'] = true
  # other useful options
  capabilities['browserstack.debug'] = true
  capabilities['browserstack.javascriptEnabled'] = true
  capabilities['javascriptEnabled'] = true
  # etc ...
  # Point a remote driver at the BrowserStack hub (hub URL and env var
  # names are placeholders; adjust for your own credentials setup)
  Capybara::Selenium::Driver.new(app,
    browser: :remote,
    url: "https://#{ENV['BROWSERSTACK_USERNAME']}:#{ENV['BROWSERSTACK_ACCESS_KEY']}@hub.browserstack.com/wd/hub",
    desired_capabilities: capabilities)
end
I am using Hadoop 2.4.1. When I use DFS in Hadoop 2.4.1, everything works fine. I always use the start-dfs.sh script to start it, so the following services are up and running on the system:
datanode, namenode and secondary namenode - which is exactly fine
Yesterday, I tried to configure mapred-site.xml in etc/hadoop/mapred-site.xml as follows:
conf/mapred-site.xml:
<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
    </property>
</configuration>
and I did the following:
1. I formatted the namenode
2. I started start-all.sh
When I look into the logs, only the following logs are available:
1. hadoop-datanode.log + out
2. hadoop-namenode.log + out
3. hadoop-secondarynamenode.log + out
4. yarn-nodemanager.log + out
5. yarn-resourcemanager.log + out
When I ran jps, only the following services were running:
1. secondarynamenode
2. namenode
3. datanode
4. nodemanager
5. resourcemanager
I don't find the JobTracker there; moreover, the MapReduce logs are also not available. Do we need to specify something for MapReduce in Hadoop 2.4.1?
Additional info: I checked the web console on port 50030 (JobTracker), which is not available.
I also checked whether anything is listening on port 9001; nothing is running there.
Any help is appreciated.
From Hadoop 2.0 onwards, the default MapReduce processing framework has changed from classic MapReduce (MRv1) to YARN; there is no separate JobTracker anymore, its responsibilities being split between the YARN ResourceManager and per-job ApplicationMasters, which is why jps shows resourcemanager and nodemanager instead. When you use start-all.sh to start Hadoop, it internally invokes start-yarn.sh and start-dfs.sh.
If you want to use classic MapReduce (MRv1) instead of YARN, you should start the DFS and MapReduce services separately using start-dfs.sh and start-mapred.sh (in distributions that ship MRv1, the mapreduce1 binaries are located in the directory ${HADOOP_HOME}/bin-mapreduce1 and all configuration files are under ${HADOOP_HOME}/etc/hadoop-mapreduce1).
All YARN and HDFS startup scripts are located in the sbin directory under your Hadoop home, where you will not find a start-mapred.sh script; start-mapred.sh lives in the bin-mapreduce1 directory of MRv1-enabled distributions.
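If the goal is simply to run MapReduce jobs on Hadoop 2.4.1, the usual approach is to let them run on YARN rather than bringing back the JobTracker: set mapreduce.framework.name in mapred-site.xml and keep using start-dfs.sh and start-yarn.sh. A minimal example:

<!-- etc/hadoop/mapred-site.xml -->
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

Jobs then show up in the ResourceManager web UI on port 8088 (and in the JobHistoryServer UI on port 19888 if you start it with mr-jobhistory-daemon.sh start historyserver), instead of the old JobTracker UI on port 50030.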