Apache Beam job hangs when submitted via spark-submit - java

I am just trying to execute the Apache Beam example code in a local Spark setup. I generated the source and built the package as described on this page, and submitted the jar using spark-submit as below:
$ ~/spark/bin/spark-submit --class org.apache.beam.examples.WordCount --master local target/word-count-beam-0.1.jar --runner=SparkRunner --inputFile=pom.xml --output=counts
The job gets submitted and starts to execute, but gets stuck at the step Evaluating ParMultiDo(ExtractWords). Below is the log after submitting the job.
I am not able to find any error message. Can someone please help me find what's wrong?
Edit: I also tried the below command:
~/spark/bin/spark-submit --class org.apache.beam.examples.WordCount --master spark://Quartics-MacBook-Pro.local:7077 target/word-count-beam-0.1.jar --runner=SparkRunner --inputFile=pom.xml --output=counts
The job is now stuck at INFO BlockManagerMasterEndpoint: Registering block manager 192.168.0.2:59049 with 366.3 MB RAM, BlockManagerId(0, 192.168.0.2, 59049, None). I attached screenshots of the Spark History and Dashboard below. The dashboard shows the job is running, but there is no progress at all.

This was just a version issue. I was able to run the job on Spark 1.6.3. Thanks to all the people who downvoted this question without an explanation.
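For anyone hitting the same hang: before blaming the Spark setup, it can help to confirm the pipeline itself runs, for example on the DirectRunner. Below is a minimal sketch of roughly what the generated WordCount archetype reduces to (based on Beam's MinimalWordCount; the hard-coded paths stand in for the --inputFile/--output flags, and the runner is forced in code instead of via --runner):

import java.util.Arrays;

import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class MinimalWordCount {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    // Comment this out to fall back to the DirectRunner for a quick local sanity check.
    options.setRunner(SparkRunner.class);

    Pipeline p = Pipeline.create(options);
    p.apply(TextIO.read().from("pom.xml"))                                       // --inputFile
        .apply(FlatMapElements.into(TypeDescriptors.strings())
            .via((String line) -> Arrays.asList(line.split("[^\\p{L}]+"))))      // ExtractWords
        .apply(Filter.by((String word) -> !word.isEmpty()))
        .apply(Count.perElement())
        .apply(MapElements.into(TypeDescriptors.strings())
            .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
        .apply(TextIO.write().to("counts"));                                     // --output
    p.run().waitUntilFinish();
  }
}

If this runs cleanly on the DirectRunner but hangs on the SparkRunner, the problem is the Beam/Spark version combination rather than the pipeline, which is consistent with the answer above (older Beam Spark runner releases were built against Spark 1.6.x rather than Spark 2.x).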

Related

Getting InvocationTargetException when running my Glue job

I am trying to understand why the following error occurs.
"Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.SparkSession. java.lang.reflect.InvocationTargetException"
Basically, I am trying to use the Delta module to perform an "upsert" on my table in a Glue job.
When I run the following code, I get the error mentioned above.
from delta import *
from pyspark.sql.session import SparkSession
spark = SparkSession \
.builder \
.config("spark.jars.packages", "io.delta:delta-core_2.11:0.5.0")\
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
.getOrCreate()
This is the only piece of code I run, and I get the error. Do you have any idea why this is happening?
Most probably you are using the wrong version, likely Glue 3.0. There were some workarounds to use Delta with Glue 2.0, but those can give this kind of error when you try them with Glue 3.0. Also, setting the Spark session config inside the script does not work for some parameters; it depends on the version.
But no worries: AWS has announced the 4th version of Glue, here is the official announcement.
Here is the official guide on using Delta with Glue, and below I will state the key points to make it work.
The first and trickiest part is providing the Delta configuration. With Glue 4.0 this is now straightforward; in the older versions, you did this by nesting the Spark settings inside a single --conf key in the Glue job parameters :)
--conf = spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
You also have to set the --datalake-formats parameter in the job parameters to delta.
After that, make sure you have selected Glue 4.0. Also, make sure to handle symlink manifest files in your scripts or by using crawlers.
If you want more flexibility, you can also choose to use AWS EMR; here is a walkthrough on using Delta there.

Apache Livy: Could not find or load main class org.apache.livy.server.LivyServer

I am trying to start the Apache Livy 0.8.0 server on my Windows 10 machine for Spark 3.1.2 and Hadoop 3.2.1. I am taking help from here. I have successfully built Apache Livy using Maven (I have attached a screenshot of it), but I am not able to run the Livy server. When I run it I get the following error:
> starting C:/AmazonJDK/jdk1.8.0_332/bin/java -cp /d/ApacheLivy/incubator-livy-master/incubator-livy-master/server/target/jars/*:/d/ApacheLivy/incubator-livy-master/incubator-livy-master/conf:D:/Program_files/spark/conf:D:/ApacheHadoop/hadoop-3.2.1/etc/hadoop: org.apache.livy.server.LivyServer, logging to D:/ApacheLivy/incubator-livy-master/incubator-livy-master/logs/livy--server.out
ps: unknown option -- o
Try `ps --help' for more information.
failed to launch C:/AmazonJDK/jdk1.8.0_332/bin/java -cp /d/ApacheLivy/incubator-livy-master/incubator-livy-master/server/target/jars/*:/d/ApacheLivy/incubator-livy-master/incubator-livy-master/conf:D:/Program_files/spark/conf:D:/ApacheHadoop/hadoop-3.2.1/etc/hadoop: org.apache.livy.server.LivyServer:
Error: Could not find or load main class org.apache.livy.server.LivyServer
full log in D:/ApacheLivy/incubator-livy-master/incubator-livy-master/logs/livy--server.out
I am using Git Bash. If you need more information, I will provide it.
The error got resolved when I used Windows Subsystem for Linux (WSL).

How to control Spark logging in IntelliJ

I'm usually running stuff from JUnit but I've also tried running from main and it makes no difference.
I have read nearly two dozen SO questions, blog posts, and articles and tried almost everything to get Spark to stop logging so much.
Things I've tried:
log4j.properties in resources folder (in src and test)
Using spark-submit to add a log4j.properties which failed with "error: missing application resources"
Logger.getLogger("com").setLevel(Level.WARN);
Logger.getLogger("org").setLevel(Level.WARN);
Logger.getLogger("akka").setLevel(Level.WARN);Logger.getRootLogger().setLevel(Level.WARN);spark.sparkContext().setLogLevel("WARN");
In another project I got the logging to be quiet with:
Logger.getLogger("org").setLevel(Level.WARN);
Logger.getLogger("akka").setLevel(Level.WARN);
but it is not working here.
How I'm creating my SparkSession:
SparkSession spark = SparkSession
.builder()
.appName("RS-LDA")
.master("local")
.getOrCreate();
Let me know if you'd like to see more of my code.
Thanks
I'm using IntelliJ and Spark, and this works for me:
Logger.getRootLogger().setLevel(Level.ERROR);
You could change the Spark log configuration too:
$ cd SPARK_HOME/conf
$ gedit log4j.properties.template
# find these lines in the file
# Set everything to be logged to the console
log4j.rootCategory=INFO, console
and change it to ERROR:
log4j.rootCategory=ERROR, console
In this file you have other options to change too:
# Set the default spark-shell log level to WARN. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=WARN
# Settings to quiet third party logs that are too verbose
.....
And finally rename the log4j.properties.template file
$ mv log4j.properties.template log4j.properties
You can follow this link for further configuration:
Logging in Spark with Log4j
or this one too:
Logging in Spark with Log4j. How to customize the driver and executors for YARN cluster mode.
It might be an old question, but I just ran into the same problem.
To fix it what I did was:
Add private static Logger log = LoggerFactory.getLogger(Spark.class); as a field of the class.
Call spark.sparkContext().setLogLevel("WARN"); after creating the Spark session.
Step 2 will work only after step 1.
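Putting the two steps together, a minimal sketch (the class name RSLDA is a placeholder; the appName and master are taken from the question):

import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.sql.SparkSession;
import org.slf4j.LoggerFactory;

public class RSLDA {
    // Step 1: declare an SLF4J logger as a field of the class.
    private static final org.slf4j.Logger log = LoggerFactory.getLogger(RSLDA.class);

    public static void main(String[] args) {
        // Optionally quiet log4j before the session starts up.
        Logger.getLogger("org").setLevel(Level.WARN);
        Logger.getLogger("akka").setLevel(Level.WARN);

        SparkSession spark = SparkSession
                .builder()
                .appName("RS-LDA")
                .master("local")
                .getOrCreate();

        // Step 2: lower Spark's own log level once the session exists.
        spark.sparkContext().setLogLevel("WARN");

        log.warn("Spark logging is now limited to WARN and above");
        spark.stop();
    }
}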

Submitting Spark application on YARN from Eclipse IDE

I am facing an issue when I try to submit my Spark application on YARN from Eclipse. I try to submit a simple SVM program, but it gives the error below. I have a MacBook, and I would be thankful if somebody gave me a detailed answer.
16/09/17 10:04:19 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalStateException: Library directory '.../MyProject/assembly/target/scala-2.11/jars' does not exist; make sure Spark is built.
at org.apache.spark.launcher.CommandBuilderUtils.checkState(CommandBuilderUtils.java:248)
at org.apache.spark.launcher.CommandBuilderUtils.findJarsDir(CommandBuilderUtils.java:368)
at org.apache.spark.launcher.YarnCommandBuilderUtils$.findJarsDir(YarnCommandBuilderUtils.scala:38)
at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:500)
at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:834)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:167)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:500)
at SVM.main(SVM.java:21)
Go to
Run Configurations --> Environment
in Eclipse and add the environment variable SPARK_HOME.
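If setting SPARK_HOME alone does not help, one hedged alternative (not what the answer above does) is to point the YARN client at the Spark jars explicitly with spark.yarn.jars. The HDFS path below is an assumption; you would first copy $SPARK_HOME/jars/* there, and HADOOP_CONF_DIR/YARN_CONF_DIR still has to be visible to the run configuration so the yarn master can be resolved:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SVM {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("SVM")
                .setMaster("yarn")
                // Assumed location: Spark jars copied to HDFS beforehand, e.g.
                //   hdfs dfs -put $SPARK_HOME/jars/* /spark/jars/
                .set("spark.yarn.jars", "hdfs:///spark/jars/*.jar");

        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... build and train the SVM model here ...
        sc.stop();
    }
}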

Oozie java action logger logs are not shown on the Oozie console

I am executing map-reduce code by calling the driver class from an Oozie java action. The map-reduce job runs successfully and I get the output as expected. However, the log statements in my driver class are not shown in the Oozie job logs. I am using log4j for logging in my driver class.
Do I need to make some configuration changes to see the logs? A snippet of my workflow.xml:
<action name="MyAppDriver">
<java>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="${nameNode}/home/hadoop/work/surjan/outpath/20160430" />
</prepare>
<main-class>com.surjan.driver.MyAppMainDriver</main-class>
<arg>/home/hadoop/work/surjan/PoC/wf-app-dir/MyApp.xml</arg>
<job-xml>/home/hadoop/work/surjan/PoC/wf-app-dir/AppSegmenter.xml</job-xml>
</java>
<ok to="sendEmailSuccess"/>
<error to="sendEmailKill"/>
</action>
The logs are going into YARN, not the Oozie console.
In my case I have a custom java action. If you look in the YARN UI, you have to dig into the mapper task that the java action runs in. In my case the Oozie workflow item was 0070083-200420161025476-oozie-xxx-W, and oozie job -log ${wf_id} shows that the java action 0070083-200420161025476-oozie-xxx-W#java1 failed with a Java exception, but with no context around it. In the Oozie web UI only the "Job Error Log" is populated, matching what is shown on the command line; the actual logging isn't shown. The oozie job -info ${wf_id} status shows it failed:
Actions
------------------------------------------------------------------------------------------------------------------------------------
ID Status Ext ID Ext Status Err Code
------------------------------------------------------------------------------------------------------------------------------------
0070083-200420161025476-oozie-xxx-W#:start: OK - OK -
------------------------------------------------------------------------------------------------------------------------------------
0070083-200420161025476-oozie-xxx-W#java1 ERROR job_1587370152078_1090 FAILED/KILLED JA018
------------------------------------------------------------------------------------------------------------------------------------
0070083-200420161025476-oozie-xxx-W#fail OK - OK E0729
------------------------------------------------------------------------------------------------------------------------------------
You can find the actual YARN application in the YARN Resource Manager web UI (not in the "yarn logs" web console, which shows YARN's own logs, not the logs of what it is hosting). You can easily grep for the correct id on the command line by looking for the Oozie workflow job id:
user#host:~/apidemo$ yarn application --list -appStates FINISHED | grep 0070083-200420161025476
20/04/22 20:42:12 INFO client.AHSProxy: Connecting to Application History server at your.host.com/130.178.58.221:10200
20/04/22 20:42:12 INFO client.RequestHedgingRMFailoverProxyProvider: Looking for the active RM in [rm1, rm2]...
20/04/22 20:42:12 INFO client.RequestHedgingRMFailoverProxyProvider: Found active RM [rm2]
application_1587370152078_1090 oozie:launcher:T=java:W=java-whatever-sql-task-wf:A=java1:ID=0070083-200420161025476-oozie-xxx-W MAPREDUCE kerberos-id default FINISHED SUCCEEDED 100% https://your.host.com:8090/jobhistory/job/job_1587370152078_1090
user#host:~/apidemo$
Note that Oozie says things failed, yet the application state is FINISHED and the YARN final status is SUCCEEDED, which seems strange.
Helpfully, that command-line output also shows the URL to the job history. That opens the web page of the parent application that ran your java action. If you click the little "logs" link on that page you see some logs. If you look closely, the page says it ran one task of type map; clicking the link in that row takes you to the actual task, which in my case is task_1587370152078_1090_m_000000. You have to click into that to see the first attempt, attempt_1587370152078_1090_m_000000_0, and then on the right-hand side there is a tiny "logs" link that shows more specific logging.
You can also ask yarn for the logs once you know the application id:
yarn logs -applicationId application_1587370152078_1090
That showed me very detailed logs, including the custom Java logging that I could not easily see on the console, so I could see what was really going on.
Note that if you are writing custom code, you want to let YARN set the log4j properties file rather than supplying your own version, so that the YARN tools can find your logs. The code will be run with the flag:
-Dlog4j.configuration=container-log4j.properties
The detailed logs show all the jars that are added to the classpath. You should ensure that your custom code uses the same jars and log4j version.
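To make that concrete, here is a minimal sketch of a main class for the <java> action (the class name matches the workflow.xml in the question; the point is to log through log4j, which YARN configures via container-log4j.properties, rather than through System.out or a bundled log4j config):

package com.surjan.driver;

import org.apache.log4j.Logger;

public class MyAppMainDriver {
    private static final Logger LOG = Logger.getLogger(MyAppMainDriver.class);

    public static void main(String[] args) throws Exception {
        // These messages end up in the YARN container logs, retrievable with
        //   yarn logs -applicationId <application id of the oozie launcher>
        LOG.info("Driver started with " + args.length + " argument(s)");
        // ... configure and submit the map-reduce job here ...
        LOG.info("Driver finished");
    }
}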
