I have the following code:
val conf = new SparkConf().setAppName("Spark Test")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val data = sqlContext.read.format("jdbc").options(
  Map(
    "url" -> "jdbc:sqlite:/nv/pricing/ix_tri_pi.sqlite3",
    "dbtable" -> "SELECT security_id FROM ix_tri_pi")).load()
data.foreach {
  row => println(row.getInt(1))
}
And I try to submit it with:
spark-submit \
--class "com.novus.analytics.spark.SparkTest" \
--master "local[4]" \
/Users/smabie/workspace/analytics/analytics-spark/target/scala-2.10/analytics-spark.jar \
--conf spark.executer.extraClassPath=sqlite-jdbc-3.8.7.jar \
--conf spark.driver.extraClassPath=sqlite-jdbc-3.8.7.jar \
--driver-class-path sqlite-jdbc-3.8.7.jar \
--jars sqlite-jdbc-3.8.7.jar
But I get the following exception:
Exception in thread "main" java.sql.SQLException: No suitable driver
I am using Spark version 1.6.1, if that helps.
Thanks!
Try passing your application jar as the last parameter of spark-submit. Everything that comes after the application jar is treated as an argument to your program, so the --conf, --driver-class-path and --jars options you placed after the jar never reach spark-submit itself.
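For example, something like this (same flags, just moved in front of the jar; note that spark.executer.extraClassPath should also be spark.executor.extraClassPath):
spark-submit \
  --class "com.novus.analytics.spark.SparkTest" \
  --master "local[4]" \
  --conf spark.executor.extraClassPath=sqlite-jdbc-3.8.7.jar \
  --conf spark.driver.extraClassPath=sqlite-jdbc-3.8.7.jar \
  --driver-class-path sqlite-jdbc-3.8.7.jar \
  --jars sqlite-jdbc-3.8.7.jar \
  /Users/smabie/workspace/analytics/analytics-spark/target/scala-2.10/analytics-spark.jar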
Did you try specifying the driver class explicitly in the options?
options(
  Map(
    "url" -> "jdbc:sqlite:/nv/pricing/ix_tri_pi.sqlite3",
    "driver" -> "org.sqlite.JDBC",
    "dbtable" -> "SELECT security_id FROM ix_tri_pi"))
I had a similar issue trying to load a PostgreSQL table.
Also, the cause may be class loading:
The JDBC driver class must be visible to the primordial class loader
on the client session and on all executors. This is because Java’s
DriverManager class does a security check that results in it ignoring
all drivers not visible to the primordial class loader when one goes
to open a connection. One convenient way to do this is to modify
compute_classpath.sh on all worker nodes to include your driver JARs.
http://spark.apache.org/docs/latest/sql-programming-guide.html#troubleshooting
Related
I have a Java Spark application, in which I need to read all the row keys from an HBase table.
Until now I worked with Spark 2.4.7, and we have migrated to Spark 3.2.3. I used newAPIHadoopRDD, but HBase returns an empty result after the Spark version upgrade, with no errors or warnings.
My function looks as follows:
Configuration config = dataContext.getConfig();
Scan scan = new Scan();
FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL, new FirstKeyOnlyFilter(), new KeyOnlyFilter());
scan.setFilter(filters);
config.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan));
StructField field = DataTypes.createStructField("id", DataTypes.StringType, true);
StructType schema = DataTypes.createStructType(Arrays.asList(field));
JavaRDD<Tuple2<ImmutableBytesWritable, Result>> keyToEmptyValues = sparkSession.sparkContext().newAPIHadoopRDD(config,
TableInputFormat.class, ImmutableBytesWritable.class, Result.class).toJavaRDD();
JavaRDD<ImmutableBytesWritable> keys = JavaPairRDD.fromJavaRDD(keyToEmptyValues).keys();
JavaRDD<Row> nonPrefixKeys = removeKeyPrefixFromRDD(keys);
Dataset<Row> keysDataset = sparkSession.createDataFrame(nonPrefixKeys, schema);
I read the following post: Spark 3.2.1 fetch HBase data not working with NewAPIHadoopRDD
I added the flag mentioned there (--conf "spark.hadoopRDD.ignoreEmptySplits=false"), but the results are still empty.
I'm using Spark 3.2.3 and HBase 2.2.4.
My spark-submit command looks as follows:
spark-submit --name "enrich" --master local[*] --class ingest.main.Main --jars /opt/bitnami/spark/externalJar/enrich-1.0-SNAPSHOT.jar --conf "spark.driver.extraClassPath=/opt/bitnami/spark/externalJar/enrich-1.0-SNAPSHOT.jar" --conf "spark.hadoopRDD.ignoreEmptySplits=false" --conf "spark.executor.extraClassPath=/opt/bitnami/spark/externalJar/enrich-1.0-SNAPSHOT.jar" --driver-java-options "-Dlog4j.configuration=file:/opt/bitnami/spark/log4j.properties" --jars /opt/bitnami/spark/externalJar/enrich-1.0-SNAPSHOT.jar --conf 'spark.driver.extraJavaOptions=-Dexecute=Enrich -Denrich.files.path="/hfiles/10/enrichments" -Dhfiles.path="/hfiles/10/enrichments"' ingest-1.jar
Why do I get nothing from HBase (the table is not empty; I ran a scan command in the HBase shell)?
Is there any better way to retrieve the row keys from an HBase table with Spark 3.2.3?
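For reference, a stripped-down version of this row-key read in spark-shell style Scala (the table name, ZooKeeper quorum and the pre-existing sc are placeholders, not taken from the code above):
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Result, Scan}
import org.apache.hadoop.hbase.filter.{FilterList, FirstKeyOnlyFilter, KeyOnlyFilter}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableMapReduceUtil}
import org.apache.hadoop.hbase.util.Bytes

val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", "zk-host")       // placeholder quorum
conf.set(TableInputFormat.INPUT_TABLE, "my_table")  // placeholder table name

// Return only the first cell of each row, with the value stripped, so we effectively scan row keys.
val scan = new Scan()
scan.setFilter(new FilterList(FilterList.Operator.MUST_PASS_ALL,
  new FirstKeyOnlyFilter(), new KeyOnlyFilter()))
conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan))

val rowKeys = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
    classOf[ImmutableBytesWritable], classOf[Result])
  .map { case (key, _) => Bytes.toString(key.get()) }

println(rowKeys.count())  // quick check that the scan actually returns rows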
I am going to submit a PySpark task, and submit an environment along with the task.
I need --archives to submit the zip package containing the full environment.
The working spark-submit command is this:
/my/spark/home/spark-submit
--master yarn
--deploy-mode cluster
--driver-memory 10G
--executor-memory 8G
--executor-cores 4
--queue rnd
--num-executors 8
--archives /data/me/ld_env.zip#prediction_env
--conf spark.pyspark.python=./prediction_env/ld_env/bin/python
--conf spark.pyspark.driver.python=./prediction_env/ld_env/bin/python
--conf spark.executor.memoryOverhead=4096
--py-files dist/mylib-0.1.0-py3-none-any.whl my_task.py
I am trying to start the Spark app programmatically with SparkLauncher:
String pyPath = "my_task.py";
String archives = "/data/me/ld_env.zip#prediction_env";
SparkAppHandle handle = new SparkLauncher()
.setSparkHome(sparkHome)
.setAppResource(jarPath)
.setMaster("yarn")
.setDeployMode("cluster")
.setConf(SparkLauncher.EXECUTOR_MEMORY, "8G")
.setConf(SparkLauncher.EXECUTOR_CORES, "2")
.setConf("spark.executor.instances", "8")
.setConf("spark.yarn.queue", "rnd")
.setConf("spark.pyspark.python", "./prediction_env/ld_env/bin/python")
.setConf("spark.pyspark.driver.python", "./prediction_env/ld_env/bin/python")
.setConf("spark.executor.memoryOverhead", "4096")
.addPyFile(pyPath)
// .addPyFile(archives)
// .addFile(archives)
.addAppArgs("--inputPath",
inputPath,
"--outputPath",
outputPath,
"--option",
option)
.startApplication(taskListener);
I need some way to ship my zip file so that it gets unpacked on YARN, but I don't see any archives-related method.
Use the spark.yarn.dist.archives config, as documented in Running on YARN and the tutorial:
String pyPath = "my_task.py";
String archives = "/data/me/ld_env.zip#prediction_env";
SparkAppHandle handle = new SparkLauncher()
.setSparkHome(sparkHome)
.setAppResource(jarPath)
.setMaster("yarn")
.setDeployMode("cluster")
.setConf(SparkLauncher.EXECUTOR_MEMORY, "8G")
.setConf(SparkLauncher.EXECUTOR_CORES, "2")
.setConf("spark.executor.instances", "8")
.setConf("spark.yarn.queue", "rnd")
.setConf("spark.pyspark.python", "./prediction_env/ld_env/bin/python")
.setConf("spark.pyspark.driver.python", "./prediction_env/ld_env/bin/python")
.setConf("spark.executor.memoryOverhead", "4096")
.setConf("spark.yarn.dist.archives", archives)
.addPyFile(pyPath)
.addAppArgs("--inputPath",
inputPath,
"--outputPath",
outputPath,
"--option",
option)
.startApplication(taskListener);
So, adding .setConf("spark.yarn.dist.archives", archives) fixes the problem.
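If more than one archive is ever needed, spark.yarn.dist.archives takes a comma-separated list, for example (the second path here is just a hypothetical illustration):
.setConf("spark.yarn.dist.archives", "/data/me/ld_env.zip#prediction_env,/data/me/other_env.zip#other_env")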
The error occurs while executing
airflow#41166b660d82:~$ spark-submit --master yarn --deploy-mode cluster --keytab keytab_name.keytab --principal keytab_name#REALM --jars /path/to/spark-hive_2.11-2.3.0.jar sranje.py
from an Airflow Docker container that is not in the CDH environment (not managed by CDH CM). sranje.py is a simple select * from a Hive table.
The app is accepted on CDH YARN and executed twice, with this error:
...
2020-12-31 10:11:43 INFO StateStoreCoordinatorRef:54 - Registered StateStoreCoordinator endpoint
Traceback (most recent call last):
File "sranje.py", line 21, in <module>
source_df = hiveContext.table(hive_source).na.fill("")
File "/dfs/dn4/yarn/nm/usercache/etladmin/appcache/application_1608187067076_0150/container_e29_1608187067076_0150_02_000001/pyspark.zip/pyspark/sql/context.py", line 366, in table
File "/dfs/dn4/yarn/nm/usercache/etladmin/appcache/application_1608187067076_0150/container_e29_1608187067076_0150_02_000001/pyspark.zip/pyspark/sql/session.py", line 721, in table
File "/dfs/dn4/yarn/nm/usercache/etladmin/appcache/application_1608187067076_0150/container_e29_1608187067076_0150_02_000001/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
File "/dfs/dn4/yarn/nm/usercache/etladmin/appcache/application_1608187067076_0150/container_e29_1608187067076_0150_02_000001/pyspark.zip/pyspark/sql/utils.py", line 79, in deco
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog':"
2020-12-31 10:11:43 ERROR ApplicationMaster:70 - User application exited with status 1
2020-12-31 10:11:43 INFO ApplicationMaster:54 - Final app status: FAILED, exitCode: 1, (reason: User application exited with status 1)
...
We assume that some .jar files and Java dependencies are missing. Any ideas?
Details
there is a valid Kerberos ticket before executing the spark-submit command
if we omit --jars /path/to/spark-hive_2.11-2.3.0.jar, the Python error is different:
...
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':"
...
versions of Spark (2.3.0), Hadoop (2.6.0) and Java are the same as in CDH
hive-site.xml, yarn-site.xml etc. are also provided and valid
the same spark-submit app executes OK from a node inside the CDH cluster
we tried adding additional jars: --jars spark-hive_2.11-2.3.0.jar,spark-core_2.11-2.3.0.jar,spark-sql_2.11-2.3.0.jar,hive-hcatalog-core-2.3.0.jar,spark-hive-thriftserver_2.11-2.3.0.jar
the developers use this code as an example:
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
from pyspark.context import SparkContext
from pyspark.sql import SparkSession, SQLContext, HiveContext, functions as F
from pyspark.sql.utils import AnalysisException
from datetime import datetime
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
sqlContext = SQLContext(sc)
hiveContext = HiveContext(sc)
current_date = str(datetime.now().strftime('%Y-%m-%d'))
hive_source = "lnz_ch.lnz_cfg_codebook"
source_df = hiveContext.table(hive_source).na.fill("")
print("Number of records: {}".format(source_df.count()))
print("First 20 rows of the table:")
source_df.show(20)
a different script produces the same error:
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql import SparkSession
if __name__ == "__main__":
    spark = SparkSession.builder.appName("ZekoTest").enableHiveSupport().getOrCreate()
    data = spark.sql("SELECT * FROM lnz_ch.lnz_cfg_codebook")
    data.show(20)
    spark.stop()
Thank you.
Hive dependencies were resolved by:
downloading hive.tar.gz with the exact version of CDH Hive
creating symlinks from hive/ to spark/
ln -s apache-hive-1.1.0-bin/lib/*.jar spark-2.3.0-bin-without-hadoop/jars/
additional jars downloaded from the Maven repo to spark/jars/:
hive-hcatalog-core-2.3.0.jar
slf4j-api-1.7.26.jar
spark-hive_2.11-2.3.0.jar
spark-hive-thriftserver_2.11-2.3.0.jar
refreshing the env vars:
HADOOP_CLASSPATH=$(find $HADOOP_HOME -name '*.jar' | xargs echo | tr ' ' ':')
SPARK_DIST_CLASSPATH=$(hadoop classpath)
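For a "Hadoop free" Spark build these are typically exported in conf/spark-env.sh, as in the Spark docs; shown here only as a sketch, the exact paths depend on your environment:
# conf/spark-env.sh
export HADOOP_CLASSPATH=$(find $HADOOP_HOME -name '*.jar' | xargs echo | tr ' ' ':')
export SPARK_DIST_CLASSPATH=$(hadoop classpath)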
beeline works, but pyspark throws an error:
2021-01-07 15:02:20 INFO StateStoreCoordinatorRef:54 - Registered StateStoreCoordinator endpoint
Traceback (most recent call last):
File "sranje.py", line 21, in <module>
source_df = hiveContext.table(hive_source).na.fill("")
File "/dfs/dn12/yarn/nm/usercache/etladmin/appcache/application_1608187067076_0207/container_e29_1608187067076_0207_01_000001/pyspark.zip/pyspark/sql/context.py", line 366, in table
File "/dfs/dn12/yarn/nm/usercache/etladmin/appcache/application_1608187067076_0207/container_e29_1608187067076_0207_01_000001/pyspark.zip/pyspark/sql/session.py", line 721, in table
File "/dfs/dn12/yarn/nm/usercache/etladmin/appcache/application_1608187067076_0207/container_e29_1608187067076_0207_01_000001/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
File "/dfs/dn12/yarn/nm/usercache/etladmin/appcache/application_1608187067076_0207/container_e29_1608187067076_0207_01_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/dfs/dn12/yarn/nm/usercache/etladmin/appcache/application_1608187067076_0207/container_e29_1608187067076_0207_01_000001/py4j-0.10.6-src.zip/py4j/protocol.py", line 320, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o31.table.
: java.lang.NoSuchFieldError: METASTORE_CLIENT_SOCKET_LIFETIME
But, that's another question. Thank you all.
I'm facing a very strange issue with pyspark on macOS Sierra. My goal is to parse dates in ddMMMyyyy format (e.g. 31Dec1989), but I get errors. I run Spark 2.0.1, Python 2.7.10 and Java 1.8.0_101. I also tried using Anaconda 4.2.0 (it ships with Python 2.7.12), but I get errors there too.
The same code on Ubuntu Server 15.04 with same Java version and Python 2.7.9 works without any error.
The official documentation about spark.read.load() states:
dateFormat – sets the string that indicates a date format. Custom date
formats follow the formats at java.text.SimpleDateFormat. This applies
to date type. If None is set, it uses the default value,
yyyy-MM-dd.
The official Java documentation describes MMM as the right pattern for parsing month names like Jan, Dec, etc., but here it throws a lot of errors starting with java.lang.IllegalArgumentException.
The documentation states that LLL can be used too, but pyspark doesn't recognize it and throws pyspark.sql.utils.IllegalArgumentException: u'Illegal pattern component: LLL'.
I know of another solution to dateFormat, but this is the fastest way to parse data and the simplest to code. What am I missing here?
In order to run the following examples you simply have to place test.csv and test.py in the same directory, then run <spark-bin-directory>/spark-submit <working-directory>/test.py.
My test case using ddMMMyyyy format
I have a plain-text file named test.csv containing the following two lines:
col1
31Dec1989
and the code is the following:
from pyspark.sql import SparkSession
from pyspark.sql.types import *
spark = SparkSession \
    .builder \
    .appName("My app") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
struct = StructType([StructField("column", DateType())])
df = spark.read.load("test.csv", \
    schema=struct, \
    format="csv", \
    sep=",", \
    header="true", \
    dateFormat="ddMMMyyyy", \
    mode="FAILFAST")
df.show()
I get errors. I also tried moving the month name before or after the day and year (e.g. 1989Dec31 with yyyyMMMdd) without success.
A working example using ddMMyyyy format
This example is identical to the previous one except from the date format. test.csv now contains:
col1
31121989
The following code prints the content of test.csv:
from pyspark.sql import SparkSession
from pyspark.sql.types import *
spark = SparkSession \
    .builder \
    .appName("My app") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
struct = StructType([StructField("column", DateType())])
df = spark.read.load("test.csv", \
    schema=struct, \
    format="csv", \
    sep=",", \
    header="true", \
    dateFormat="ddMMyyyy", \
    mode="FAILFAST")
df.show()
The output is the following (I omit the various verbose lines):
+----------+
| column|
+----------+
|1989-12-31|
+----------+
UPDATE1
I made a simple Java class that uses java.text.SimpleDateFormat:
import java.text.*;
import java.util.Date;
class testSimpleDateFormat
{
    public static void main(String[] args)
    {
        SimpleDateFormat format = new SimpleDateFormat("yyyyMMMdd");
        String dateString = "1989Dec31";
        try {
            Date parsed = format.parse(dateString);
            System.out.println(parsed.toString());
        }
        catch(ParseException pe) {
            System.out.println("ERROR: Cannot parse \"" + dateString + "\"");
        }
    }
}
This code doesn't work on my environment and throws this error:
java.text.ParseException: Unparseable date: "1989Dec31"
but it works perfectly on another system (Ubuntu 15.04). This seems to be a Java issue, but I don't know how to solve it. I installed the latest available version of Java and all of my software is up to date.
Any ideas?
UPDATE2
I've found how to make it work under pure Java by specifying Locale.US:
import java.text.*;
import java.util.Date;
import java.util.*;
class HelloWorldApp
{
    public static void main(String[] args)
    {
        SimpleDateFormat format = new SimpleDateFormat("yyyyMMMdd", Locale.US);
        String dateString = "1989Dec31";
        try {
            Date parsed = format.parse(dateString);
            System.out.println(parsed.toString());
        }
        catch(ParseException pe) {
            System.out.println(pe);
            System.out.println("ERROR: Cannot parse \"" + dateString + "\"");
        }
    }
}
Now, the question becomes: how to specify Java's Locale in pyspark?
Probably worth noting that this was resolved on the Spark mailing list on 24 Oct 2016. Per the original poster:
This worked without setting other options: spark/bin/spark-submit --conf "spark.driver.extraJavaOptions=-Duser.language=en" test.py
and was reported as SPARK-18076 (Fix default Locale used in DateFormat, NumberFormat to Locale.US) against Spark 2.0.1 and was resolved in Spark 2.1.0.
Additionally, while the above workaround (passing in --conf "spark.driver.extraJavaOptions=-Duser.language=en") for the specific issue the submitter raised is no longer needed if using Spark 2.1.0, a notable side-effect is that for Spark 2.1.0 users, you can no longer pass in something like --conf "spark.driver.extraJavaOptions=-Duser.language=fr" if you wanted to parse a non-English date, e.g. "31mai1989".
In fact, as of Spark 2.1.0, when using spark.read() to load a csv, I think it's no longer possible to use the dateFormat option to parse a date such as "31mai1989", even if your default locale is French. I went as far as changing the default region and language in my OS to French and passed in just about every locale setting permutation I could think of, i.e.
JAVA_OPTS="-Duser.language=fr -Duser.country=FR -Duser.region=FR" \
JAVA_ARGS="-Duser.language=fr -Duser.country=FR -Duser.region=FR" \
LC_ALL=fr_FR.UTF-8 \
spark-submit \
--conf "spark.driver.extraJavaOptions=-Duser.country=FR -Duser.language=fr -Duser.region=FR" \
--conf "spark.executor.extraJavaOptions=-Duser.country=FR -Duser.language=fr -Duser.region=FR" \
test.py
to no avail, resulting in
java.lang.IllegalArgumentException
at java.sql.Date.valueOf(Date.java:143)
at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137)
But again, this only affects parsing non-English dates in Spark 2.1.0.
You have already identified the issue as one of locale in Spark's JVM. You can check the default country and language settings used by Spark's JVM by going to http://localhost:4040/environment/ after launching the spark shell. Search for "user.language" and "user.country" under the System Properties section. They should be US and en.
You can change them like this, if needed.
Option 1: Edit the spark-defaults.conf file in {SPARK_HOME}/conf folder. Add the following settings:
spark.executor.extraJavaOptions -Duser.country=US -Duser.language=en
spark.driver.extraJavaOptions -Duser.country=US -Duser.language=en
Option 2: Pass the options to pyspark as a command line option
$ pyspark --conf spark.driver.extraJavaOptions="-Duser.country=US -Duser.language=en" --conf spark.executor.extraJavaOptions="-Duser.country=US -Duser.language=en"
Option 3: Change the language and region in your Mac OS. For example - What settings in Mac OS X affect the `Locale` and `Calendar` inside Java?
P.S. - I have only verified that Option 1 works. I have not tried out the other 2. More details about Spark configuration are here - http://spark.apache.org/docs/latest/configuration.html#runtime-environment
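As a quick sanity check you can also print the JVM's default locale from a spark-shell launched with the same configuration; if the settings took effect it should show something like en_US:
scala> java.util.Locale.getDefault
res0: java.util.Locale = en_US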
I haven't tested this but I'd give the following a try:
--conf spark.executor.extraJavaOptions="-Duser.timezone=America/Los_Angeles"
--conf spark.driver.extraJavaOptions="-Duser.timezone=America/Los_Angeles"
I'm quite new to Spark and was looking for some guidance :-)
What's the typical way in which a Java MVC application communicates with Spark? To simplify things, let's say I want to count the words in a certain file whose name is provided via GET request to my server.
My initial approach was to open the context and implement the transformations/computations in a class inside my MVC application. That means that at runtime I would have to produce an uber jar including spark-core. The problem is that:
The uber jar weighs 80 MB
I am facing the same problem (akka.version) as in: apache spark: akka version error by build jar with all dependencies
I could have a go with shade to solve it, but I have the feeling this is not the way to go.
Maybe the "provided" scope in Maven would help me, but I'm using Ant.
Should my application instead - as suggested in that page - have a single jar with the implementation (devoid of any Spark libraries) and use spark-submit every time I receive a request? I guess it would have to leave the results somewhere.
Am I missing any middle-of-the-road approach?
Using spark-submit each time is kind of heavyweight; I'd recommend using a long-running Spark context of some sort. I think the "middle of the road" option that you might be looking for is to have your job use something like the IBM Spark Kernel, Zeppelin, or the Spark Job Server from Ooyala.
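In its simplest embedded form, a long-running context is just a SparkContext that is created once at startup and reused by your request handlers. A minimal Scala sketch (the object names, master URL and word-splitting logic are illustrative, not from any of the frameworks above):

import org.apache.spark.{SparkConf, SparkContext}

// Created once when the web application starts and shared by all requests.
object SparkHolder {
  lazy val sc: SparkContext =
    new SparkContext(new SparkConf().setAppName("mvc-word-count").setMaster("local[*]"))
}

object WordCountService {
  // Called from the GET handler with the file name taken from the request.
  def countWords(path: String): Long =
    SparkHolder.sc.textFile(path)
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .count()
}

The job servers mentioned above essentially give you this same shared context, but managed outside your MVC process.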
It is good practice to use a middleware service deployed on top of Spark which manages its contexts, job failures, Spark versions and a lot of other things to consider.
I would recommend Mist. It implements Spark as a Service and creates a unified API layer for building enterprise solutions and services on top of a Big Data lake.
Mist supports Scala and Python job execution.
The quick start is as follows:
Add the Mist wrapper to your Spark job:
Scala example:
object SimpleContext extends MistJob {
  override def doStuff(context: SparkContext, parameters: Map[String, Any]): Map[String, Any] = {
    val numbers: List[BigInt] = parameters("digits").asInstanceOf[List[BigInt]]
    val rdd = context.parallelize(numbers)
    Map("result" -> rdd.map(x => x * 2).collect())
  }
}
Python example:
import mist

class MyJob:
    def __init__(self, job):
        job.sendResult(self.doStuff(job))

    def doStuff(self, job):
        val = job.parameters.values()
        list = val.head()
        size = list.size()
        pylist = []
        count = 0
        while count < size:
            pylist.append(list.head())
            count = count + 1
            list = list.tail()
        rdd = job.sc.parallelize(pylist)
        result = rdd.map(lambda s: 2 * s).collect()
        return result

if __name__ == "__main__":
    job = MyJob(mist.Job())
Run the Mist service:
Build Mist:
git clone https://github.com/hydrospheredata/mist.git
cd mist
./sbt/sbt -DsparkVersion=1.5.2 assembly # change version according to your installed spark
Create a configuration file:
mist.spark.master = "local[*]"
mist.settings.threadNumber = 16
mist.http.on = true
mist.http.host = "0.0.0.0"
mist.http.port = 2003
mist.mqtt.on = false
mist.recovery.on = false
mist.contextDefaults.timeout = 100 days
mist.contextDefaults.disposable = false
mist.contextDefaults.sparkConf = {
spark.default.parallelism = 128
spark.driver.memory = "10g"
spark.scheduler.mode = "FAIR"
}
Run
spark-submit --class io.hydrosphere.mist.Mist \
  --driver-java-options "-Dconfig.file=/path/to/application.conf" \
  target/scala-2.10/mist-assembly-0.2.0.jar
Try curl from terminal:
curl --header "Content-Type: application/json" -X POST http://192.168.10.33:2003/jobs --data '{"jarPath":"/vagrant/examples/target/scala-2.10/mist_examples_2.10-0.2.0.jar", "className":"SimpleContext$","parameters":{"digits":[1,2,3,4,5,6,7,8,9,0]}, "external_id":"12345678","name":"foo"}'