I am using Spark version 2.1.0, and I use
./bin/spark-submit --master spark://some_master_host_name:7077 \
--class com.foo.bar.SomeApp \
--deploy-mode cluster \
--conf spark.jars=libraries/avro-1.8.2.jar \
--jars libraries/avro-1.8.2.jar \
--conf spark.executor.extraClassPath=avro-1.8.2.jar \
--conf spark.driver.extraClassPath=avro-1.8.2.jar \
--conf spark.files.fetchTimeout=600s \
--conf spark.driver.extraJavaOptions="-XX:+UnlockDiagnosticVMOptions -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/some_file_name.hprof" \
--conf spark.executor.extraJavaOptions="-XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/some_file_name.hprof" \
/some/where/on/disk/FOO-assembly.jar
to submit the Spark application to the cluster. On the application side I have the following debug logging enabled to inspect the classpath:
private static void printSystemClasspath() {
    System.err.println("-------System Classpath--------------");
    String[] entries = System.getProperty("java.class.path").split("\\:");
    for (String entry : entries) {
        System.err.println(entry + " - " + new File(entry).exists());
    }
    System.err.println("-------------------------------------");
}
private static void printCurrentThreadClassloaderClasspath() {
    System.err.println("-------Current Thread Classpath--------------");
    ClassLoader cl = Thread.currentThread().getContextClassLoader();
    URL[] urls = ((URLClassLoader) cl).getURLs();
    for (URL url : urls) {
        System.err.println(url.getFile());
    }
    System.err.println("-------------------------------------");
}
Problem: I am trying to override Spark's bundled Avro version and supply 1.8.2 at runtime.
I have the proper version of the Avro classes packaged inside FOO-assembly.jar, but when I use spark.driver.userClassPathFirst or spark.executor.userClassPathFirst I run into other conflicts. I just want to be able to override the Avro library at runtime with the upgraded version.
The first entry in the system classpath is reported as avro-1.8.2.jar, but that file isn't present on the file system (the exists() check above evaluates to false).
How can I override the classpath in this situation?
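For what it's worth, I also added a check that asks the classloader where it actually loaded an Avro class from, instead of trusting java.class.path. This is only a sketch of the idea, not verbatim from my app:

import java.net.URL;
import java.security.CodeSource;

public class AvroVersionCheck {

    // Ask the JVM which jar actually supplied org.apache.avro.Schema
    public static void printAvroSource() {
        try {
            Class<?> schemaClass = Class.forName("org.apache.avro.Schema");
            CodeSource cs = schemaClass.getProtectionDomain().getCodeSource();
            URL location = (cs != null) ? cs.getLocation() : null;
            System.err.println("Avro loaded from: " + location);
            System.err.println("Loaded by: " + schemaClass.getClassLoader());
        } catch (ClassNotFoundException e) {
            System.err.println("Avro is not on the classpath at all: " + e);
        }
    }
}

If this prints Spark's bundled Avro jar rather than avro-1.8.2.jar, the extraClassPath entry never took effect, which matches the exists() check returning false: a relative classpath entry like avro-1.8.2.jar is resolved against the JVM's working directory on each node, and the file is apparently not there.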
Related
I'm having a problem running a Java Spark application on AWS EMR.
Locally, everything runs fine. When I submit a job to EMR, I always get "Completed" within 20 seconds even though the job should take minutes. No output is produced and no log messages are printed.
I'm still confused as to whether it should be run as a Spark application or as a CUSTOM_JAR type.
Here is my main method:
public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession
        .builder()
        .appName("RandomName")
        .getOrCreate();

    // process stuff
    String from_path = args[0];
    String to_path = args[1];

    Dataset<String> dataInput = spark.read().json(from_path).toJSON();
    JavaRDD<ResultingClass> map = dataInput.toJavaRDD().map(row -> convertData(row)); // convertData not included here

    Dataset<Row> dataFrame = spark.createDataFrame(map, ResultingClass.class);

    dataFrame
        .repartition(1)
        .write()
        .mode(SaveMode.Append)
        .partitionBy("year", "month", "day", "hour")
        .parquet(to_path);

    spark.stop();
}
I've tried these:
1)
aws emr add-steps --cluster-id j-XXXXXXXXX --steps \
Type=Spark,Name=MyApp,Args=[--deploy-mode,cluster,--master,yarn, \
--conf,spark.yarn.submit.waitAppCompletion=false, \
--class,com.my.class.with.main.Foo,s3://mybucket/script.jar, \
s3://partitioned-input-data/*/*/*/*/*.txt, \
s3://output-bucket/table-name], \
ActionOnFailure=CONTINUE --region us-west-2 --profile default
Completes in 15 seconds without error, but produces no output and none of the log messages I've added.
2)
aws emr add-steps --cluster-id j-XXXXXXXXX --steps \
Type=CUSTOM_JAR, \
Jar=s3://mybucket/script.jar, \
MainClass=com.my.class.with.main.Foo, \
Name=MyApp, \
Args=[--deploy-mode,cluster, \
--conf,spark.yarn.submit.waitAppCompletion=true, \
s3://partitioned-input-data/*/*/*/*/*.txt, \
s3://output-bucket/table-name], \
ActionOnFailure=CONTINUE \
--region us-west-2 --profile default
This reads the parameters wrongly: it treats --deploy-mode as the first application argument and cluster as the second, instead of the bucket paths.
3)
aws emr add-steps --cluster-id j-XXXXXXXXX --steps \
Type=CUSTOM_JAR, \
Jar=s3://mybucket/script.jar, \
MainClass=com.my.class.with.main.Foo, \
Name=MyApp, \
Args=[s3://partitioned-input-data/*/*/*/*/*.txt, \
s3://output-bucket/table-name], \
ActionOnFailure=CONTINUE \
--region us-west-2 --profile default
I get this: Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.SparkSession
When I bundle all the dependencies into the jar (which I do not need to do locally),
I get: Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration
I do not want to hardcode "yarn" into the app.
I find the AWS documentation very confusing as to the proper way to run this.
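What I'd like is something along the lines of the sketch below, where nothing cluster-specific is baked in and spark-submit (or the EMR Spark step, which submits on YARN) supplies the master. The LOCAL_RUN flag is just an illustration of mine, not something that exists in my setup:

import org.apache.spark.sql.SparkSession;

public class Foo {
    public static void main(String[] args) {
        SparkSession.Builder builder = SparkSession.builder().appName("RandomName");

        // Only force a master for local development runs; on EMR the Spark step
        // passes --master yarn, so nothing cluster-specific is hardcoded here.
        // LOCAL_RUN is a hypothetical flag, not part of the original setup.
        if ("true".equals(System.getenv("LOCAL_RUN"))) {
            builder = builder.master("local[*]");
        }

        SparkSession spark = builder.getOrCreate();
        // ... the rest of the job as in the main method above ...
        spark.stop();
    }
}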
Update:
Running the command directly on the server does work, so the problem must be in the way I'm defining the CLI command.
spark-submit --class com.my.class.with.main.Foo \
s3://mybucket/script.jar \
"s3://partitioned-input-data/*/*/*/*/*.txt" \
"s3://output-bucket/table-name"
Approach 1) was actually working.
The step overview in the AWS console said the task finished within 15 seconds, but in reality it was still running on the cluster. It took about an hour to do the work, and I can see the result.
I do not know why the step misreports the result; most likely spark.yarn.submit.waitAppCompletion=false makes the step complete as soon as the application is submitted rather than when it finishes. I'm using emr-5.9.0 with Ganglia 3.7.2, Spark 2.2.0 and Zeppelin 0.7.2.
I have a Spark application packaged with Maven. At run time I have to pass 3 arguments (the paths of 3 files used to create RDDs), so I used the spark-submit command as the official Spark website indicates:
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
.. # other options
<application-jar> \
[application-arguments]
My submit-command looks like:
\bin\spark-submit --class myapp.Main --master local[*] file:///C:\Users\pc\Desktop\eclipse\myapp\target\myapp-0.0.1-SNAPSHOT.jar ["C:\Users\pc\Desktop\pathToFile1.csv", "C:\Users\pc\Desktop\pathToFile2.csv", "C:\Users\pc\Desktop\pathToFile3.csv"]
I modified my Main class as follows to get the paths at runtime:
String pathToFile1 = args[0];
String pathToFile2 = args[1];
String pathToFile3 = args[2];
But I get an error message saying that the specified path does not exist. What am I doing wrong here?
@bradimus you were right, I don't have to use []; I have to write it as:
\bin\spark-submit --class myapp.Main --master local[*] file:///C:\Users\pc\Desktop\eclipse\myapp\target\myapp-0.0.1-SNAPSHOT.jar C:\Users\pc\Desktop\pathToFile1.csv C:\Users\pc\Desktop\pathToFile2.csv C:\Users\pc\Desktop\pathToFile3.csv
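For future readers: a small guard at the top of main makes this kind of mistake obvious right away, because a mis-split argument shows up as a path that does not exist. The validation below is only a sketch I'd add, not part of my original Main:

import java.io.File;
import java.util.Arrays;

public class Main {
    public static void main(String[] args) {
        // Fail fast if the arguments did not arrive the way we expect
        if (args.length < 3) {
            System.err.println("Expected 3 file paths, got " + args.length
                    + ": " + Arrays.toString(args));
            System.exit(1);
        }
        for (String path : args) {
            if (!new File(path).exists()) {
                System.err.println("Input file not found: " + path);
                System.exit(1);
            }
        }

        String pathToFile1 = args[0];
        String pathToFile2 = args[1];
        String pathToFile3 = args[2];
        // ... build the SparkContext and RDDs as before ...
    }
}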
I'm trying to build a machine learning program with Spark 1.6.
I have started the Spark shell with the following settings:
spark-shell --driver-class-path sqljdbc_6.0/enu/sqljdbc42.jar \
  --driver-memory 25G --executor-memory 30G --num-executors 180 \
  --conf spark.driver.maxResultSize=0 --conf spark.ui.port=4042 \
  --conf spark.default.parallelism=100 --conf spark.sql.shuffle.partitions=1000
My code works until I try to predict/use the model.
After executing this code:
scala> val predictionAndLabel = test.map(p => (model.predict(p.features), p.label))
I get this error message:
/usr/bin/spark-shell: line 41: 33686 Killed
"$FWDIR"/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell" "$#"
I hope somebody can help me because I don't have any idea how I could make this code run smoothly!
Here is the link to the full stack trace of the error:
https://app.box.com/s/w247yaoaiuogqot2zr76qjbwr9rzeb7b
I am trying to connect jconsole to a specific port of a local process. I can connect to the local process using its PID, but not using the remote option.
I am using Ubuntu 14.04 and JDK 1.7.
This is how I run my app:
grails \
-Dcom.sun.management.jmxremote=true \
-Dcom.sun.management.jmxremote.port=9999 \
-Dcom.sun.management.jmxremote.authenticate=false \
-Dcom.sun.management.jmxremote.ssl=false \
-Djava.rmi.server.hostname=xxx.xxx.xxx.xxx \
-Dserver.port=8090 \
run-app
hostname -i also gives me xxx.xxx.xxx.xxx
Grails 2.3 and later uses "forked mode" by default, where the JVM running run-app spawns a separate process to run the target application. Therefore, rather than passing the -D options to grails, you should configure them in BuildConfig.groovy. Find the grails.project.fork option and add jvmArgs:
grails.project.fork = [
    run: [...., jvmArgs: ['-Dcom.sun.management.jmxremote=true',
                          '-Dcom.sun.management.jmxremote.port=9999',
                          // etc.
                         ]]
]
Using the -D options on the command line as you are currently doing will set up the JMX connector in the grails process, not in your application.
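To confirm whether the -D options actually reached the forked application JVM, you can print the JVM's input arguments from inside the running app (for example from BootStrap.groovy). A minimal Java-style sketch:

import java.lang.management.ManagementFactory;
import java.util.List;

public class JvmArgsCheck {

    // Print the jmxremote flags the *running* JVM was actually started with
    public static void printJmxArgs() {
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        for (String arg : jvmArgs) {
            if (arg.startsWith("-Dcom.sun.management.jmxremote")) {
                System.out.println(arg);
            }
        }
    }
}

If the forked process prints nothing, the flags only reached the launcher JVM, which is exactly the forked-mode behaviour described above.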
Adding the code below to resources.groovy resolved the issue for me:
String serverURL = grailsApplication.config.grails.serverURL
URL url = new URL(serverURL)
System.setProperty("java.rmi.server.hostname", "${url.host}")
rmiRegistry(org.springframework.remoting.rmi.RmiRegistryFactoryBean) {
    port = 9999
    alwaysCreate = true
}
serverConnector(org.springframework.jmx.support.ConnectorServerFactoryBean) { bean ->
    bean.dependsOn = ['rmiRegistry']
    objectName = "connector:name=rmi"
    serviceUrl = "service:jmx:rmi://${url.host}/jndi/rmi://${url.host}:9999/jmxrmi"
    environment = ['java.rmi.server.hostname'                 : "${url.host}",
                   'jmx.remote.x.password.file'               : "${grailsApplication.parentContext.getResource('/WEB-INF/jmx/jmxremote.password').file.absolutePath}",
                   'jmx.remote.x.access.file'                 : "${grailsApplication.parentContext.getResource('/WEB-INF/jmx/jmxremote.access').file.absolutePath}",
                   'com.sun.management.jmxremote.authenticate': true,
                   'com.sun.management.jmxremote.local.only'  : false,
                   'com.sun.management.jmxremote'             : true]
}
I am trying to attach a custom (java) partitioner to my MapReduce streaming job. I am using this command:
../bin/hadoop jar ../contrib/streaming/hadoop-streaming-1.2.1.jar \
-libjars ./NumericPartitioner.jar -D mapred.map.tasks=12 -D mapred.reduce.tasks=36 \
-input /input -output /output/keys -mapper "map_threeJoin.py" -reducer "keycount.py" \
-partitioner newjoin.NumericPartitioner -file "map_threeJoin.py" \
-cmdenv b_size=6 -cmdenv c_size=6
The important bit of that is the file NumericPartitioner.jar, which resides in the same folder the command is being run in (a level down from the Hadoop root installation). Here is its code:
package newjoin;

import java.util.*;
import java.lang.*;

import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.io.*;

public class NumericPartitioner extends Partitioner<Text, Text>
{
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks)
    {
        return Integer.parseInt(key.toString().split("\\s")[0]) % numReduceTasks;
    }
}
And yet, when I try to run the above command, I get:
-partitioner : class not found : newjoin.NumericPartitioner
Streaming Command Failed!
What's going on here, and how can I get MapReduce to find my partitioner?
The -libjars option makes your third-party JARs available to the remote map and reduce task JVMs.
But to make these same third-party JARs available to the client JVM (the JVM created when you run the hadoop jar command), you need to add them to the HADOOP_CLASSPATH variable:
$ export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:./NumericPartitioner.jar
../bin/hadoop jar ../contrib/streaming/hadoop-streaming-1.2.1.jar \
-libjars ${HADOOP_CLASSPATH} -D mapred.map.tasks=12 -D mapred.reduce.tasks=36 \
-input /input -output /output/keys -mapper "map_threeJoin.py" -reducer "keycount.py" \
-partitioner newjoin.NumericPartitioner -file "map_threeJoin.py" \
-cmdenv b_size=6 -cmdenv c_size=6