I'm trying to build a Machine Learning program with Spark 1.6
I have started the Spark shell with the following settings:
spark-shell --driver-class-path sqljdbc_6.0/enu/sqljdbc42.jar --driver-memory 25G --executor-memory 30G --num-executors 180 --conf spark.driver.maxResultSize=0 --conf spark.ui.port=4042 --conf spark.default.parallelism=100 --conf spark.sql.shuffle.partitions=1000
My code works until I try to predict/use the model.
After executing this code:
scala> val predictionAndLabel = test.map(p => (model.predict(p.features), p.label))
I get this error message:
/usr/bin/spark-shell: line 41: 33686 Killed
"$FWDIR"/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell" "$#"
I hope somebody can help me, because I have no idea how to make this code run smoothly.
Here is the link to the complete stack trace of the error:
https://app.box.com/s/w247yaoaiuogqot2zr76qjbwr9rzeb7b
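For context, this line follows the standard RDD-based MLlib evaluation pattern; a minimal sketch of that pattern is shown below (model and test are the names from the snippet above, and the accuracy computation is an assumption about the intended next step, not code from the question):
// prediction plus a simple accuracy check over the test set
val predictionAndLabel = test.map(p => (model.predict(p.features), p.label))
val accuracy = predictionAndLabel.filter { case (pred, label) => pred == label }.count().toDouble / test.count()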
I am getting the error below even though I generated partner.jar correctly.
I generated partner.jar on my Windows machine using the following files:
antlr-runtime-3.5.3.jar,
force-wsc-56.1.0.jar,
js-1.7R2.jar,
partner.wsdl,
ST4-4.3.4.jar,
tools.jar,
java -classpath tools.jar;force-wsc-56.1.0.jar;ST4-4.3.4.jar;js-1.7R2.jar;antlr-runtime-3.5.3.jar com.sforce.ws.tools.wsdlc partner.wsdl partner.jar
I set the classpath correctly before generating partner.jar.
I then copied this partner.jar to an AWS EC2 machine and tried executing the code below, but it still fails with the connection error.
df = spark.read.format("com.springml.spark.salesforce") \
    .option("username", "dinesh123@force.com") \
    .option("password", "passwordtoken") \
    .option("login", "https://dev-yh.develop.my.salesforce.com/") \
    .option("soql", soql) \
    .option("inferSchema", True) \
    .load()
Error:
py4j.protocol.Py4JJavaError: An error occurred while calling o31.load.
: java.lang.NoClassDefFoundError: com/sforce/ws/ConnectionException
at com.springml.salesforce.wave.api.APIFactory.forceAPI(APIFactory.java:49)
at com.springml.spark.salesforce.DefaultSource.createRelation(DefaultSource.scala:102)
at com.springml.spark.salesforce.DefaultSource.createRelation(DefaultSource.scala:50)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
at scala.Option.getOrElse(Option.scala:189)
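For what it's worth, the missing class com.sforce.ws.ConnectionException appears to come from the force-wsc jar rather than partner.jar, so one way to make it visible at runtime is to pass both jars explicitly when launching the job. A rough sketch, where the script name and jar locations are assumptions:
spark-submit --jars /path/to/partner.jar,/path/to/force-wsc-56.1.0.jar your_salesforce_job.py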
We have a Spark Java application which reads from a database and publishes messages to Kafka. When we execute the job locally on the Windows command line with the following arguments, it works as expected:
bin/spark-submit --class com.data.ingestion.DataIngestion --jars local:///opt/spark/jars/spark-sql-kafka-0-10_2.11-2.3.0.jar local:///opt/spark/jars/data-ingestion-1.0-SNAPSHOT.jar
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 --class com.data.ingestion.DataIngestion data-ingestion-1.0-SNAPSHOT.jar
Similarly, when we try to run the command using the k8s master:
bin/spark-submit --master k8s://https://172.16.3.105:8443 --deploy-mode cluster --conf spark.kubernetes.container.image=localhost:5000/spark-example:0.2 --class com.data.ingestion.DataIngestion --jars local:///opt/spark/jars/spark-sql-kafka-0-10_2.11-2.3.0.jar local:///opt/spark/jars/data-ingestion-1.0-SNAPSHOT.jar
It gives the following error:
Exception in thread "main" java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.kafka010.KafkaSourceProvider could not be instantiated
Based on the error, it would indicate that at least one node in the cluster does not have /opt/spark/jars/spark-sql-kafka-0-10_2.11-2.3.0.jar.
I suggest you create an uber jar that includes this Kafka Structured Streaming package, or use --packages rather than local files; in addition, set up a solution like Rook or MinIO to provide a shared filesystem within k8s/Spark.
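As a rough sketch of the --packages variant (same image, class, and coordinates as above; whether the driver pod can reach a Maven repository to resolve the package is an assumption):
bin/spark-submit --master k8s://https://172.16.3.105:8443 --deploy-mode cluster --conf spark.kubernetes.container.image=localhost:5000/spark-example:0.2 --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 --class com.data.ingestion.DataIngestion local:///opt/spark/jars/data-ingestion-1.0-SNAPSHOT.jar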
It seems the Scala version and the Spark Kafka version were not aligned.
I am trying to set up Apache Druid on a single machine following the quickstart guide here. When I start the historical server, it shows an io.druid.java.util.common.IOE: No known server exception on screen.
Command:
java `cat conf-quickstart/druid/historical/jvm.config | xargs` \
-cp "conf-quickstart/druid/_common:conf-quickstart/druid/historical:lib/*" \
io.druid.cli.Main server historical
Full stack trace:
2018-04-07T18:23:40,234 WARN [main] io.druid.java.util.common.RetryUtils - Failed on try 1, retrying in 1,246ms.
io.druid.java.util.common.IOE: No known server
    at io.druid.discovery.DruidLeaderClient.getCurrentKnownLeader(DruidLeaderClient.java:276) ~[druid-server-0.12.0.jar:0.12.0]
    at io.druid.discovery.DruidLeaderClient.makeRequest(DruidLeaderClient.java:128) ~[druid-server-0.12.0.jar:0.12.0]
    at io.druid.query.lookup.LookupReferencesManager.fetchLookupsForTier(LookupReferencesManager.java:569) ~[druid-server-0.12.0.jar:0.12.0]
    at io.druid.query.lookup.LookupReferencesManager.tryGetLookupListFromCoordinator(LookupReferencesManager.java:420) ~[druid-server-0.12.0.jar:0.12.0]
    at io.druid.query.lookup.LookupReferencesManager.lambda$getLookupListFromCoordinator$4(LookupReferencesManager.java:398) ~[druid-server-0.12.0.jar:0.12.0]
    at io.druid.java.util.common.RetryUtils.retry(RetryUtils.java:63) [java-util-0.12.0.jar:0.12.0]
    at io.druid.java.util.common.RetryUtils.retry(RetryUtils.java:81) [java-util-0.12.0.jar:0.12.0]
    at io.druid.query.lookup.LookupReferencesManager.getLookupListFromCoordinator(LookupReferencesManager.java:388) [druid-server-0.12.0.jar:0.12.0]
I have tried setting up from scratch many times, following exactly the steps in the quickstart guide, but I am unable to resolve this error. How can I resolve it?
If you already tried to start druid, then delete the druid-X.Y.Z/log and druid-X.Y.Z/var folders.
Start ZooKeeper: ./zookeeper-X.Y.Z/bin/zkServer.sh start
Recreate the folders you erased with druid-X.Y.Z/bin/init
Run each of the following commands in a new tab, in this order:
java `cat conf-quickstart/druid/coordinator/jvm.config | xargs` -cp "conf-quickstart/druid/_common:conf-quickstart/druid/coordinator:lib/*" io.druid.cli.Main server coordinator
java `cat conf-quickstart/druid/overlord/jvm.config | xargs` -cp "conf-quickstart/druid/_common:conf-quickstart/druid/overlord:lib/*" io.druid.cli.Main server overlord
java `cat conf-quickstart/druid/broker/jvm.config | xargs` -cp "conf-quickstart/druid/_common:conf-quickstart/druid/broker:lib/*" io.druid.cli.Main server broker
java `cat conf-quickstart/druid/historical/jvm.config | xargs` -cp "conf-quickstart/druid/_common:conf-quickstart/druid/historical:lib/*" io.druid.cli.Main server historical
java `cat conf-quickstart/druid/middleManager/jvm.config | xargs` -cp "conf-quickstart/druid/_common:conf-quickstart/druid/middleManager:lib/*" io.druid.cli.Main server middleManager
You should now have 1 tab open for each of those commands (so 5).
Insert the data: curl -X 'POST' -H 'Content-Type:application/json' -d @quickstart/wikiticker-index.json localhost:8090/druid/indexer/v1/task
You will then see {"task":"index_hadoop_wikiticker_2018-06-06T19:17:51.900Z"}
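If you want to confirm the ingestion finished, the overlord also exposes a task status endpoint; a sketch (the task id below is the example from the response above, yours will differ):
curl localhost:8090/druid/indexer/v1/task/index_hadoop_wikiticker_2018-06-06T19:17:51.900Z/status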
I'm having a problem running a Java Spark application on AWS EMR.
Locally, everything runs fine. When I submit a job to EMR, I always get "Completed" within 20 seconds, even though the job should take minutes. No output is produced and no log messages are printed.
I'm still confused as to whether it should be run as a Spark application or as a CUSTOM_JAR step type.
Here is my main method:
public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession
            .builder()
            .appName("RandomName")
            .getOrCreate();

    // process stuff
    String from_path = args[0];
    String to_path = args[1];

    Dataset<String> dataInput = spark.read().json(from_path).toJSON();
    JavaRDD<ResultingClass> map = dataInput.toJavaRDD().map(row -> convertData(row)); // conversion function not included here
    Dataset<Row> dataFrame = spark.createDataFrame(map, ResultingClass.class);

    dataFrame
            .repartition(1)
            .write()
            .mode(SaveMode.Append)
            .partitionBy("year", "month", "day", "hour")
            .parquet(to_path);

    spark.stop();
}
I've tried these:
1)
aws emr add-steps --cluster-id j-XXXXXXXXX --steps \
Type=Spark,Name=MyApp,Args=[--deploy-mode,cluster,--master,yarn, \
--conf,spark.yarn.submit.waitAppCompletion=false, \
--class,com.my.class.with.main.Foo,s3://mybucket/script.jar, \
s3://partitioned-input-data/*/*/*/*/*.txt, \
s3://output-bucket/table-name], \
ActionOnFailure=CONTINUE --region us-west-2 --profile default
Completes in 15 seconds without error, but without the output result or the log messages I've added.
2)
aws emr add-steps --cluster-id j-XXXXXXXXX --steps \
Type=CUSTOM_JAR, \
Jar=s3://mybucket/script.jar, \
MainClass=com.my.class.with.main.Foo, \
Name=MyApp, \
Args=[--deploy-mode,cluster, \
--conf,spark.yarn.submit.waitAppCompletion=true, \
s3://partitioned-input-data/*/*/*/*/*.txt, \
s3://output-bucket/table-name], \
ActionOnFailure=CONTINUE \
--region us-west-2 --profile default
Reads the parameters wrongly: it sees --deploy-mode as the first argument and cluster as the second, instead of the buckets.
3)
aws emr add-steps --cluster-id j-XXXXXXXXX --steps \
Type=CUSTOM_JAR, \
Jar=s3://mybucket/script.jar, \
MainClass=com.my.class.with.main.Foo, \
Name=MyApp, \
Args=[s3://partitioned-input-data/*/*/*/*/*.txt, \
s3://output-bucket/table-name], \
ActionOnFailure=CONTINUE \
--region us-west-2 --profile default
I get this: Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.SparkSession
When I include all dependencies (which I do not need to locally), I get: Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration
I do not want to hardcode "yarn" into the app.
I find the AWS documentation very confusing as to the proper way to run this.
Update:
Running the command directly on the server does work, so the problem must be in the way I'm defining the CLI command.
spark-submit --class com.my.class.with.main.Foo \
s3://mybucket/script.jar \
"s3://partitioned-input-data/*/*/*/*/*.txt" \
"s3://output-bucket/table-name"
Approach 1) was working.
The step overview on the AWS console said the task finished within 15 seconds, but in reality it was still running on the cluster. It took an hour to do the work, and I can see the result.
I do not know why the step misreports the status. I'm using emr-5.9.0 with Ganglia 3.7.2, Spark 2.2.0, Zeppelin 0.7.2.
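If the step status matters, a variant worth trying (an assumption on my part, based on the spark.yarn.submit.waitAppCompletion flag already used above) is to submit with that flag set to true, so the step only completes when the YARN application does:
aws emr add-steps --cluster-id j-XXXXXXXXX --steps \
Type=Spark,Name=MyApp,Args=[--deploy-mode,cluster,--master,yarn, \
--conf,spark.yarn.submit.waitAppCompletion=true, \
--class,com.my.class.with.main.Foo,s3://mybucket/script.jar, \
s3://partitioned-input-data/*/*/*/*/*.txt, \
s3://output-bucket/table-name], \
ActionOnFailure=CONTINUE --region us-west-2 --profile default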
I have a Spark application packaged with Maven. At runtime, I have to pass 3 arguments (the paths of 3 files used to create RDDs), so I used the spark-submit command as the official Spark website indicates:
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
.. # other options
<application-jar> \
[application-arguments]
My submit command looks like this:
\bin\spark-submit --class myapp.Main --master local[*] file:///C:\Users\pc\Desktop\eclipse\myapp\target\myapp-0.0.1-SNAPSHOT.jar ["C:\Users\pc\Desktop\pathToFile1.csv", "C:\Users\pc\Desktop\pathToFile2.csv", "C:\Users\pc\Desktop\pathToFile3.csv"]
I modified my Main class as follows to get the paths at runtime:
String pathToFile1=args[0];
String pathToFile2=args[1];
String pathToFile3=args[2];
But I get an error message saying that the specified path does not exist. What am I doing wrong here?
@bradimus, you were right: I don't have to use []; I have to write it as:
\bin\spark-submit --class myapp.Main --master local[*] file:///C:\Users\pc\Desktop\eclipse\myapp\target\myapp-0.0.1-SNAPSHOT.jar C:\Users\pc\Desktop\pathToFile1.csv C:\Users\pc\Desktop\pathToFile2.csv C:\Users\pc\Desktop\pathToFile3.csv