I am trying to understand why the following error occurs.
"Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.SparkSession. java.lang.reflect.InvocationTargetException"
Basically, I am trying to use the Delta module to perform an "upsert" on my table in a Glue job.
When I run the following code, I get the error mentioned above.
from delta import *
from pyspark.sql.session import SparkSession
spark = SparkSession \
    .builder \
    .config("spark.jars.packages", "io.delta:delta-core_2.11:0.5.0") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()
This is the only piece of code I run, and I get the error. Do you have any idea why this is happening?
Most probably you are using the wrong Glue version, likely Glue 3.0. There were some workarounds to use Delta with Glue 2.0, but those can give this kind of error when you try them on Glue 3.0. Also, setting the Spark session config inside the script does not work for some parameters, depending on the version.
The good news is that AWS has announced Glue 4.0; here is the official announcement.
Here is the official guide on using Delta with Glue; below are the key points to make it work.
The first and trickiest part is providing the configuration for Delta. In Glue 4.0 you can now set it the way you do in your script. In the older versions, you did this by nesting a second --conf inside the value of the --conf key in Glue's job parameters :)
--conf = spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
You have to set the --datalake-formats parameter in the job params to delta.
After that, make sure you have selected Glue 4.0. Also, make sure to handle the symlink manifest files in your scripts or with crawlers; a minimal sketch of both steps is shown below.
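For illustration only, here is a minimal sketch of what the upsert could look like in a Glue 4.0 job once --datalake-formats is set to delta. The table path s3://my-bucket/my-delta-table/ and the key column id are placeholders I made up, not values from the question:

from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Glue 4.0 bundles Delta Lake, so no spark.jars.packages entry is needed;
# the extension and catalog settings can go straight into the builder.
spark = SparkSession.builder \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# New or changed rows to merge into the target table.
updates_df = spark.createDataFrame([(1, "new_value")], ["id", "value"])

# Upsert: update matching rows by id, insert the rest.
target = DeltaTable.forPath(spark, "s3://my-bucket/my-delta-table/")
(target.alias("t")
    .merge(updates_df.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Regenerate the symlink manifest so Athena and crawlers see the new files.
target.generate("symlink_format_manifest")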
If you want more flexibility, you can also choose to use AWS EMR; here is a walkthrough on using Delta there.
In JMeter I want to use a client certificate without all the overhead of converting the certificate and having to remember to click the SSL Manager menu after every JMeter restart.
I want the flexibility to use different certificates wherever needed.
The Java solution here looks very promising. I tried to use a JSR223 PreProcessor with Groovy, but it fails on the first line: it is unable to import a standard Java class.
2017-11-08 16:02:39,139 ERROR o.a.j.m.JSR223PreProcessor: Problem in JSR223 script, JSR223 PreProcessor
javax.script.ScriptException: org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:
Script37.groovy: 1: unable to resolve class java.security.Keystore
@ line 1, column 1.
import java.security.Keystore;
What do I have to do to use standard Java classes?
The whole idea is based on a solution used in SoapUI.
import com.eviware.soapui.settings.SSLSettings
import com.eviware.soapui.model.settings.Settings
import com.eviware.soapui.SoapUI
Settings settings = SoapUI.getSettings()
settings.setString(SSLSettings.KEYSTORE, "../certificates/foo.p12")
settings.setString(SSLSettings.KEYSTORE_PASSWORD , "bar")
settings.reloadSettings()
Will something like this work in JMeter? Which client is used to send the HTTP samplers?
These are not "standard Java classes"; they look like something from SoapUI.
You need to have these com.eviware.soapui.* classes in the JMeter classpath in order to make it work. Once you add the necessary .jars, a JMeter restart will be required to pick them up. However, I doubt you will be able to use this com.eviware.soapui.model.settings.Settings class instance in a JMeter test.
There is an easier way to configure JMeter to use client-side certificates: just add the following lines to the system.properties file:
javax.net.ssl.keyStoreType=pkcs12
javax.net.ssl.keyStore=../certificates/foo.p12
javax.net.ssl.keyStorePassword=bar
or pass them via -D command-line argument to JMeter startup script like:
jmeter -Djavax.net.ssl.keyStoreType=pkcs12 -Djavax.net.ssl.keyStore=../certificates/foo.p12 -Djavax.net.ssl.keyStorePassword=bar -n -t test.jmx -l result.jtl
See the How to Set Your JMeter Load Test to Use Client Side Certificates article for more details on the approach.
I am trying to execute the Apache Beam example code in a local Spark setup. I generated the source and built the package as described on this page, then submitted the jar using spark-submit as below:
$ ~/spark/bin/spark-submit --class org.apache.beam.examples.WordCount --master local target/word-count-beam-0.1.jar --runner=SparkRunner --inputFile=pom.xml --output=counts
The code gets submitted and starts to execute, but gets stuck at the step Evaluating ParMultiDo(ExtractWords). Below is the log after submitting the job.
I am not able to find any error message. Can someone please help me find what's wrong?
Edit: I also tried the command below:
~/spark/bin/spark-submit --class org.apache.beam.examples.WordCount --master spark://Quartics-MacBook-Pro.local:7077 target/word-count-beam-0.1.jar --runner=SparkRunner --inputFile=pom.xml --output=counts
The job now gets stuck at INFO BlockManagerMasterEndpoint: Registering block manager 192.168.0.2:59049 with 366.3 MB RAM, BlockManagerId(0, 192.168.0.2, 59049, None). I have attached screenshots of the Spark History server and the dashboard below. The dashboard shows the job is running, but there is no progress at all.
This was just a version issue. I was able to run the job on Spark 1.6.3. Thanks to all the people who downvoted this question without an explanation.
Background & Problem
I am having a bit of trouble running the examples from Spark's MLlib on a machine running Fedora 23. I have built Spark 1.6.2 with the following options, per the Spark documentation:
build/mvn -Pnetlib-lgpl -Pyarn -Phadoop-2.4 \
-Dhadoop.version=2.4.0 -DskipTests clean package
and upon running the binary classification example:
bin/spark-submit --class org.apache.spark.examples.mllib.BinaryClassification \
examples/target/scala-*/spark-examples-*.jar \
--algorithm LR --regType L2 --regParam 1.0 \
data/mllib/sample_binary_classification_data.txt
I receive the following error:
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.92-1.b14.fc23.x86_64/jre/bin/java: symbol lookup error: /tmp/jniloader5830472710956533873netlib-native_system-linux-x86_64.so: undefined symbol: cblas_dscal
Errors of this form (symbol lookup error with netlib) are not limited to this particular example. On the other hand, the Elastic Net example (./bin/run-example ml.LinearRegressionWithElasticNetExample) runs without a problem.
Attempted Solutions
I have tried a number of solutions to no avail. For example, I went through some of the advice here https://datasciencemadesimpler.wordpress.com/tag/blas/, and while I can successfully import from com.github.fommil.netlib.BLAS and LAPACK, the aforementioned symbol lookup error persists.
I have read through the netlib-java documentation at fommil/netlib-java, and have ensured my system has the libblas and liblapack shared object files:
$ ls /usr/lib64 | grep libblas
libblas.so
libblas.so.3
libblas.so.3.5
libblas.so.3.5.0
$ ls /usr/lib64 | grep liblapack
liblapacke.so
liblapacke.so.3
liblapacke.so.3.5
liblapacke.so.3.5.0
liblapack.so
liblapack.so.3
liblapack.so.3.5
liblapack.so.3.5.0
The most promising advice I found was here http://fossdev.blogspot.com/2015/12/scala-breeze-blas-lapack-on-linux.html, which suggests including
JAVA_OPTS="- Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.NativeRefBLAS"
in the sbt script. So, I appended those options to _COMPILE_JVM_OPTS="..." in the build/mvn script, which also did not resolve the problem.
Finally, a last bit of advice I found online suggested passing the following flags to sbt:
sbt -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS \
-Dcom.github.fommil.netlib.LAPACK=com.github.fommil.netlib.F2jLAPACK \
-Dcom.github.fommil.netlib.ARPACK=com.github.fommil.netlib.F2jARPACK
and again the issue persists. I am limited to two links in my post, but the advice can be found in the README.md of lildata's 'scaladatascience' repo on GitHub.
Has anybody suffered this issue and successfully resolved it? Any and all help or advice is deeply appreciated.
It's been a couple of months, but I got back to this problem and was able to get a functioning workaround (posting here in case anybody else has the same issue).
It came down to library load precedence; by calling:
$ export LD_PRELOAD=/path/to/libopenblas.so
prior to launching Spark, everything works as expected.
I figured out the solution after reading:
https://github.com/fommil/netlib-java/issues/88 (directly addresses this issue)
JNI "symbol lookup error" in shared library on Linux (similar linking issue, doesn't have to do with Spark but answers are informative with regards to linking)
Our application is a RoR app that currently uses JRuby 1.7.22 and JRE 8u65. It is an on-prem solution, so we use JRuby to host the application on the JVM on the target system, Windows Server 2012 R2. We compile our Ruby code using
jruby -S jrubyc
This takes the .rb file and compiles it to a .class file. The original .rb then loads the class file, like so:
load __FILE__.sub(/\.rb$/, ".class")
This all works with JRuby 1.7.22.
Now we want to update JRuby to 9.0.5.0, but we are experiencing some problems when it comes to deploying our application. Basically, the line of code above inside the .rb file no longer works, and we get the following error when trying to run rake db:setup:
rake aborted!
LoadError: C:/appname/app/models/app_attribute.class is not compiled Ruby; use java_import to load normal classes
C:/appname/app/models/app_attribute.rb:1:in `<top>'
C:/appname/db/seeds.rb:10:in `<top>'
C:/appname/db/seeds.rb:9:in `block in (root)'
Tasks: TOP => db:setup => db:seed
(See full trace by running task with --trace)
Great. So I replaced load with java_import and got this instead:
rake aborted!
ArgumentError: not a valid Java identifier: C:/appname/app/models/app_attribute.class
uri:classloader:/jruby/java/core_ext/object.rb:43:in `block in java_import'
uri:classloader:/jruby/java/core_ext/object.rb:34:in `java_import'
C:/appname/app/models/app_attribute.rb:1:in `<top>'
C:/appname/db/seeds.rb:10:in `<top>'
C:/appname/db/seeds.rb:9:in `block in (root)'
Tasks: TOP => db:setup => db:seed
(See full trace by running task with --trace)
Still not working, no matter what I try. I looked at this post: https://github.com/jruby/jruby/issues/3018
I tried to pass the parameter
jruby -Xaot.loadClasses=true
But I get a warning saying that aot.LoadClasses is not recognized. EVEN THOUGH I see it in the properties when I type
jruby -Xproperties
I have done A LOT of research on this and have probably looked at everything on the internet regarding it. Any input will be greatly appreciated. Is there something I am missing? I am not fully adept in Java.
Thank you.
might be the same issue as https://github.com/jruby/jruby/issues/3651
which means you'll need to wait for 9.1 or use a snapshot from http://ci.jruby.org/
Since the error is slightly different, you should look into reproducing it with a snapshot, and if it still fails (it might be Windows related), a step-by-step reproduction might speed up getting the issue resolved.
jruby -Xaot.loadClasses=true
this is not needed with Warbler
But I get a warning saying that aot.LoadClasses is not recognized. EVEN THOUGH I see it in the properties when I type
hmm, could you reproduce this with an empty script and no JRUBY_OPTS ?
I have done A LOT of research on this, and have probably have looked at everything on the internet regarding this. Any input will be greatly appreciated.
you might want to try looking into the issue next time :) or consider getting some support
Is there something I missing? I am not fully adept in Java.
you shouldn't be missing anything - it's not a Java issue ...
I have been using PySpark with IPython lately on my server with 24 CPUs and 32 GB of RAM. It runs only on one machine. In my process, I want to collect a huge amount of data, as shown in the code below:
train_dataRDD = (train.map(lambda x: getTagsAndText(x))
                 .filter(lambda x: x[-1] != [])
                 .flatMap(lambda (x, text, tags): [(tag, (x, text)) for tag in tags])
                 .groupByKey()
                 .mapValues(list))
When I do
training_data = train_dataRDD.collectAsMap()
It gives me an OutOfMemoryError: Java heap space. Also, I cannot perform any operations on Spark after this error, as it loses the connection with Java. It gives Py4JNetworkError: Cannot connect to the java server.
It looks like the heap space is too small. How can I set a bigger limit?
EDIT:
Things that I tried before running:
sc._conf.set('spark.executor.memory','32g').set('spark.driver.memory','32g').set('spark.driver.maxResultsSize','0')
I changed the Spark options as per the documentation here (if you do Ctrl-F and search for spark.executor.extraJavaOptions): http://spark.apache.org/docs/1.2.1/configuration.html
It says that I can avoid OOMs by setting the spark.executor.memory option. I did that, but it does not seem to be working.
After trying out loads of configuration parameters, I found that only one needs to be changed to enable more heap space: spark.driver.memory.
sudo vim $SPARK_HOME/conf/spark-defaults.conf
#uncomment the spark.driver.memory and change it according to your use. I changed it to below
spark.driver.memory 15g
# press : and then wq! to exit vim editor
Close your existing Spark application and rerun it. You will not encounter this error again. :)
If you're looking for a way to set this from within a script or a Jupyter notebook, you can do:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master('local[*]') \
    .config("spark.driver.memory", "15g") \
    .appName('my-cool-app') \
    .getOrCreate()
I had the same problem with pyspark (installed with brew). In my case it was installed on the path /usr/local/Cellar/apache-spark.
The only configuration file I had was in apache-spark/2.4.0/libexec/python//test_coverage/conf/spark-defaults.conf.
As suggested here, I created the file spark-defaults.conf at /usr/local/Cellar/apache-spark/2.4.0/libexec/conf/spark-defaults.conf and appended the line spark.driver.memory 12g to it.
I got the same error, and I just assigned more memory to Spark while creating the session:
spark = SparkSession.builder.master("local[10]").config("spark.driver.memory", "10g").getOrCreate()
or
SparkSession.builder.appName('test').config("spark.driver.memory", "10g").getOrCreate()
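Note that spark.driver.memory only takes effect if it is set before the driver JVM starts, which is why changing it on an already running SparkContext (as in the question's edit) has no effect. As a small sanity check, not part of the original answers, here is a sketch that confirms the setting was actually picked up once the session is created:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.driver.memory", "10g") \
    .getOrCreate()

# Read back the driver memory the running session is actually using.
print(spark.sparkContext.getConf().get("spark.driver.memory"))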