My eventual goal is to query my MongoDB collection through Spark SQL using Scala code in a standalone application. I have successfully installed Spark on my local machine, which runs Windows 10. I can run spark-shell and the Spark master and worker nodes, so from the looks of it Apache Spark is working fine on my PC.
I can also query my MongoDB collection by running the scala code in Spark Shell.
Problem:
When I try to use the same code from my Scala project with the MongoDB Spark Connector for Scala, I run into an error that I am unable to figure out. It seems like an environment issue; I looked it up and many people suggested that it happens if you use Java 9 or a higher version. I am using Java 8, so that's not the issue in my case, which is why I have also posted my java -version snapshot in the post.
But when I run the code, I get the following error. It would be a great help if somebody could advise me in any direction.
Scala Code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.{SparkConf, SparkContext}
import com.mongodb.spark.config._
import com.mongodb.spark._

object SparkSQLMongoDBConnector {
  def main(args: Array[String]): Unit = {
    var sc: SparkContext = null
    var conf = new SparkConf()
    conf.setAppName("MongoSparkConnectorIntro")
      .setMaster("local")
      .set("spark.hadoop.validateOutputSpecs", "false")
      .set("spark.mongodb.input.uri", "mongodb://127.0.0.1/metadatastore.metadata_collection?readPreference=primaryPreferred")
      .set("spark.mongodb.output.uri", "mongodb://127.0.0.1/metadatastore.metadata_collection?readPreference=primaryPreferred")

    sc = new SparkContext(conf)

    val spark = SparkSession.builder()
      .master("spark://192.168.137.221:7077")
      .appName("MongoSparkConnectorIntro")
      .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/metadatastore.metadata_collection?readPreference=primaryPreferred")
      .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/metadatastore.metadata_collection?readPreference=primaryPreferred")
      .getOrCreate()

    val readConfig = ReadConfig(Map("collection" -> "spark", "readPreference.name" -> "secondaryPreferred"), Some(ReadConfig(sc)))
    val customRdd = MongoSpark.load(sc, readConfig)

    println(customRdd.count)
    println(customRdd.first.toString())
  }
}
SBT:
scalaVersion := "2.12.8"
libraryDependencies += "org.mongodb.spark" %% "mongo-spark-connector" % "2.4.0"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0"
Java Version:
Error:
This is the error I face when I run the Scala code in IntelliJ.
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:116)
at org.apache.hadoop.security.Groups.<init>(Groups.java:93)
at org.apache.hadoop.security.Groups.<init>(Groups.java:73)
at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:293)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:283)
at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:260)
at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java)
at org.apache.spark.util.Utils$.$anonfun$getCurrentUserName$1(Utils.scala:2422)
at scala.Option.getOrElse(Option.scala:138)
at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2422)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:293)
at SparkSQLMongoDBConnector$.main(SparkSQLMongoDBConnector.scala:35)
at SparkSQLMongoDBConnector.main(SparkSQLMongoDBConnector.scala)
Caused by: java.lang.StringIndexOutOfBoundsException: begin 0, end 3, length 2
at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3410)
at java.base/java.lang.String.substring(String.java:1883)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:50)
... 16 more
Any help would be much appreciated.
Hadoop's Shell class checks your Java version via the java.version system property:
private static boolean IS_JAVA7_OR_ABOVE =
    System.getProperty("java.version").substring(0, 3).compareTo("1.7") >= 0;
Make sure it is defined and long enough. Your Caused by says "begin 0, end 3, length 2", which means the JVM that actually runs the application reports a two-character java.version such as "10" or "11", so substring(0, 3) fails. In other words, IntelliJ is most likely launching the project with a newer JDK (its Project SDK) than the Java 8 you see on the command line.
This line was changed in Hadoop 2.7+, but by default Spark pulls in Hadoop 2.6.5.
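Two quick checks, both sketches rather than confirmed fixes. First, print the properties of the JVM IntelliJ actually launches at the top of main:
println(System.getProperty("java.version")) // "1.8.0_..." is fine; "10" or "11" reproduces this crash
println(System.getProperty("java.home"))    // compare with File > Project Structure > Project SDK
Second, if the Java version really is 8 and the culprit is the Hadoop 2.6.5 pulled in transitively by spark-core, you can try forcing a newer hadoop-client in build.sbt (2.7.3 here is an assumption; any 2.7+ release has the fixed Shell class):
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.7.3"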
Related
I am currently working on converting the code below into a JAR to register a permanent UDF on a Databricks cluster. I am facing a NoClassDefFoundError, even though I added the required library dependencies while building the JAR with SBT. Source code: https://databricks.com/notebooks/enforcing-column-level-encryption.html
I used the following in build.sbt:
scalaVersion := "2.13.4"
libraryDependencies += "org.apache.hive" % "hive-exec" % "0.13.1"
libraryDependencies += "com.macasaet.fernet" % "fernet-java8" % "1.5.0"
Please guide me to the right libraries if anything above is wrong. Kindly help me with this.
import com.macasaet.fernet.{Key, StringValidator, Token}
import java.time.{Duration, Instant}
import org.apache.hadoop.hive.ql.exec.UDF

class Validator extends StringValidator {
  override def getTimeToLive(): java.time.temporal.TemporalAmount = {
    Duration.ofSeconds(Instant.MAX.getEpochSecond())
  }
}

class udfDecrypt extends UDF {
  def evaluate(inputVal: String, sparkKey: String): String = {
    if (inputVal != null && inputVal != "") {
      val keys: Key = new Key(sparkKey)
      val token = Token.fromString(inputVal)
      val validator = new Validator() {}
      val payload = token.validateAndDecrypt(keys, validator)
      payload
    } else inputVal
  }
}
Make sure the fernet-java library is installed in your cluster.
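If installing it through Maven coordinates keeps failing (see the comment below), another option is to bundle the dependency into the UDF JAR itself with sbt-assembly. This is only a sketch under assumptions: the plugin version is arbitrary and the Scala version must match your cluster's runtime (Databricks runtimes on Spark 3 ship Scala 2.12, not 2.13).
// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")
// build.sbt
scalaVersion := "2.12.10" // assumption: match the Scala version of your Databricks runtime
libraryDependencies += "org.apache.hive" % "hive-exec" % "0.13.1" % "provided" // Hive classes are already on the cluster
libraryDependencies += "com.macasaet.fernet" % "fernet-java8" % "1.5.0" // gets bundled into the assembly JAR
Then attach the JAR produced by sbt assembly to the cluster instead of the thin JAR.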
This topic is related to
Databricks SCALA UDF cannot load class when registering function
I tried to install the JAR file on the cluster via the Libraries tab in the cluster config, instead of dropping it directly into DBFS as the user guide suggests; then I hit the issue that the Validator class was not found, and that question routed me here.
I also added the Maven repo to the Libraries config, but then the cluster failed to install it, with the error
Library resolution failed because unresolved dependency: com.macasaet.fernet:fernet-java8:1.5.0: not found
(screenshot: Databricks cluster Libraries page)
Have you run into this?
On Windows Server 2016, we are trying to connect over JDBC with a Jython script, but it gives the following error:
java.lang.ClassNotFoundException: java.lang.ClassNotFoundException:
com.microsoft.sqlserver.jdbc.SQLServerDriver
RazorSQL, on the same machine, connects without error using these settings:
Driver Class: com.microsoft.sqlserver.jdbc.SQLServerDriver
Driver Location: \Program Files (x86)\RazorSQL\drivers\sqlserver\sqljdbc.jar
As a result, we set the CLASSPATH to same location with this command:
set CLASSPATH=C:\Program Files (x86)\RazorSQL\drivers\sqlserver\sqljdbc.jar
...but when running the code below - we still get the same ClassNotFound error.
This is our Python code:
import jaydebeapi

jclassname = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
database = "our_database_name"
db_elem = ";databaseName={}".format(database) if database else ""
host = "###.##.###.###"  # ip address
port = "1433"
user = "user_name"
password = "password"
url = (
    "jdbc:sqlserver://{host}:{port}{db_elem}"
    ";user={user};password={password}".format(
        host=host, port=port, db_elem=db_elem,
        user=user, password=password)
)
print url
driver_args = [url]
jars = None
libs = None
db = jaydebeapi.connect(jclassname, driver_args, jars=jars,
                        libs=libs)
This is how we are running our Python script:
C:\jython2.7.0\bin\jython.exe C:\path_to_our_script.py
How is it that RazorSQL connects fine, but somehow Python cannot? How do we get rid of this CLASSPATH error?
You have to load the JARs at runtime using the system Classloader.
Please refer to this answer.
The following code snippet has been taken from this Gist.
def loadJar(jarFile):
    '''load a jar at runtime using the system Classloader (needed for JDBC)

    adapted from http://forum.java.sun.com/thread.jspa?threadID=300557
    Author: Steve (SG) Langer Jan 2007 translated the above Java to Jython
    Reference: https://wiki.python.org/jython/JythonMonthly/Articles/January2007/3
    Author: seansummers#gmail.com simplified and updated for jython-2.5.3b3+

    >>> loadJar('jtds-1.3.1.jar')
    >>> from java import lang, sql
    >>> lang.Class.forName('net.sourceforge.jtds.jdbc.Driver')
    <type 'net.sourceforge.jtds.jdbc.Driver'>
    >>> sql.DriverManager.getDriver('jdbc:jtds://server')
    jTDS 1.3.1
    '''
    from java import io, net, lang
    u = io.File(jarFile).toURL() if type(jarFile) <> net.URL else jarFile
    m = net.URLClassLoader.getDeclaredMethod('addURL', [net.URL])
    m.accessible = 1
    m.invoke(lang.ClassLoader.getSystemClassLoader(), [u])


if __name__ == '__main__':
    import doctest
    doctest.testmod()
Also look at - https://wiki.python.org/jython/JythonMonthly/Articles/January2007/3
So I am new to Spark. My versions are: Spark 2.1.2, Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131). I am using IntelliJ IDEA 2018 Community on Windows 10 (x64). And whenever I try to run a simple word count example I get the following error:
18/10/22 01:43:14 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: System memory 259522560 must be at least 471859200. Please increase heap size using the --driver-memory option or spark.driver.memory in Spark configuration.
    at org.apache.spark.memory.UnifiedMemoryManager$.getMaxMemory(UnifiedMemoryManager.scala:216)
    at org.apache.spark.memory.UnifiedMemoryManager$.apply(UnifiedMemoryManager.scala:198)
    at org.apache.spark.SparkEnv$.create(SparkEnv.scala:330)
    at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:174)
    at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:257)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:432)
    at WordCount$.main(WordCount.scala:5)
    at WordCount.main(WordCount.scala)
PS: this is the code of the word counter that I use as an example:
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("mySpark").setMaster("local")
    val sc = new SparkContext(conf)
    val rdd = sc.textFile(args(0))
    val wordcount = rdd.flatMap(_.split("\t")).map((_, 1))
      .reduceByKey(_ + _)
    for (arg <- wordcount.collect())
      print(arg + " ")
    println()
    // wordcount.saveAsTextFile(args(1))
    // wordcount.saveAsTextFile("myFile")
    sc.stop()
  }
}
So my question is how to get rid of this error. I have searched for a solution and tried installing different versions of Spark, the JDK and Hadoop, but it didn't help. I don't know where the problem may be.
If you are in IntelliJ you may struggle a lot with this. What I did, and it worked, was to initialize the SparkContext before the SparkSession:
val conf:SparkConf = new SparkConf().setAppName("name").setMaster("local")
.set("spark.testing.memory", "2147480000")
val sc:SparkContext = new SparkContext(conf)
There may be a better solution, because here I don't actually need to initialize the SparkContext; it is done implicitly when the SparkSession is initialized.
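For reference, a minimal sketch of the SparkSession-only variant (app name and master are placeholders). spark.testing.memory is the property UnifiedMemoryManager reads before falling back to the JVM's max heap, so setting it through the builder has the same effect as above:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("name")                              // placeholder
  .master("local[*]")
  .config("spark.testing.memory", "2147480000") // satisfies the 471859200-byte minimum from the error
  .getOrCreate()
val sc = spark.sparkContext                     // the context is created implicitly by the session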
Go to Settings -> Run/Debug Configurations and for VM options put
-Xms128m -Xmx512m -XX:MaxPermSize=300m -ea
The problem is that every job fails with the following exception:
Exception in thread "main" java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)[Ljava/lang/Object;
at ps.sparkapp.Classification$.main(Classification.scala:35)
at ps.sparkapp.Classification.main(Classification.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:743)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
This exception means that the task cannot find the method. I develop using the IntelliJ Community Edition. I have no problems compiling the package; all dependencies are packaged correctly. Here is my build.sbt:
name := "SparkApp"
version := "1.0"
scalaVersion := "2.11.6"
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.1.1"
libraryDependencies += "org.apache.spark" % "spark-mllib_2.11" % "2.1.1"
scala -version
Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL
I found out that this error somehow has to do with Scala, because it only happens when I use functionality that is native to Scala, e.g. a Scala for loop, .map, or .drop(2).
The class and everything else is still written in Scala, but if I avoid functionality like .map or .drop(2) then everything works fine.
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.linalg.Vector

object Classification {
  def main(args: Array[String]) {
    ...
    //df.printSchema()
    var dataset = df.groupBy("user_id", "measurement_date").pivot("rank").min()
    val col = dataset.schema.fieldNames.drop(2) // <- here the error happens

    // take all features and put them into one vector
    val assembler = new VectorAssembler()
      .setInputCols(col)
      .setOutputCol("features")
    val data = assembler.transform(dataset)
    data.printSchema()
    data.show()
    sc.stop()
  }
}
As said, if I do not use .drop(2) everything works perfectly, but avoiding these methods is not an option, since that would be very painful.
I could not find any solution on the web. Any ideas?
BTW: I can use these methods within the spark-shell, which I find strange.
Thanks in advance.
NOTE 1)
I use:
SPARK version 2.1.1
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_131)
Try adding the actual Scala libraries etc as a project dependency. E.g.:
libraryDependencies += "org.scala-lang" % "scala-library" % "2.11.6"
I'm quite new to Spark and was looking for some guidance :-)
What's the typical way in which a Java MVC application communicates with Spark? To simplify things, let's say I want to count the words in a certain file whose name is provided via GET request to my server.
My initial approach was to open the context and implement the transformations/ computations in a class inside my MVC application. That means that at runtime I would have to come up with an uber jar of spark-core. The problem is that:
The uber jar weighs 80 MB
I am facing the same problem (akka.version) as in: apache spark: akka version error by build jar with all dependencies
I can have a go with shade to solve it, but I have the feeling this is not the way to go.
Maybe the "provided" scope in Maven would help me, but I'm using Ant.
Should my application, as suggested in that page, already have one jar with the implementation (devoid of any Spark libraries) and use spark-submit every time I receive a request? I guess it would leave the results somewhere.
Am I missing any middle-of-the-road approach?
Using spark-submit each time is kind of heavyweight; I'd recommend using a long-running Spark context of some sort. I think the "middle of the road" option that you might be looking for is to have your job use something like the IBM Spark Kernel, Zeppelin, or the Spark Job Server from Ooyala.
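To make the "long-running context" idea concrete, here is a minimal sketch (all names are hypothetical, and it assumes local mode rather than one of the job servers above): the web application holds a single SparkContext for its lifetime, and each GET handler reuses it instead of shelling out to spark-submit.
import org.apache.spark.{SparkConf, SparkContext}

object SharedSpark {
  // created lazily on first use and shared by every request
  lazy val sc: SparkContext = new SparkContext(
    new SparkConf().setAppName("mvc-word-count").setMaster("local[*]"))
}

object WordCountService {
  // called from the controller with the file name taken from the GET request
  def countWords(path: String): Long =
    SharedSpark.sc.textFile(path).flatMap(_.split("\\s+")).count()
}
Roughly speaking, the job servers mentioned above productionize this pattern: a shared context plus a REST interface for submitting jobs and fetching results.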
It is good practice to use a middleware service deployed on top of Spark which manages its contexts, job failures, Spark versions and a lot of other things you would otherwise have to consider.
I would recommend Mist. It implements Spark as a Service and creates a unified API layer for building enterprise solutions and services on top of a Big Data lake.
Mist supports Scala and Python jobs execution.
The quick start is as follows:
Add Mist wrapper into your Spark job:
Scala example:
object SimpleContext extends MistJob {
  override def doStuff(context: SparkContext, parameters: Map[String, Any]): Map[String, Any] = {
    val numbers: List[BigInt] = parameters("digits").asInstanceOf[List[BigInt]]
    val rdd = context.parallelize(numbers)
    Map("result" -> rdd.map(x => x * 2).collect())
  }
}
Python example:
import mist


class MyJob:
    def __init__(self, job):
        job.sendResult(self.doStuff(job))

    def doStuff(self, job):
        val = job.parameters.values()
        list = val.head()
        size = list.size()
        pylist = []
        count = 0
        while count < size:
            pylist.append(list.head())
            count = count + 1
            list = list.tail()
        rdd = job.sc.parallelize(pylist)
        result = rdd.map(lambda s: 2 * s).collect()
        return result


if __name__ == "__main__":
    job = MyJob(mist.Job())
Run Mist service:
Build the Mist
git clone https://github.com/hydrospheredata/mist.git
cd mist
./sbt/sbt -DsparkVersion=1.5.2 assembly # change version according to your installed spark
Create configuration file
mist.spark.master = "local[*]"
mist.settings.threadNumber = 16
mist.http.on = true
mist.http.host = "0.0.0.0"
mist.http.port = 2003
mist.mqtt.on = false
mist.recovery.on = false
mist.contextDefaults.timeout = 100 days
mist.contextDefaults.disposable = false
mist.contextDefaults.sparkConf = {
spark.default.parallelism = 128
spark.driver.memory = "10g"
spark.scheduler.mode = "FAIR"
}
Run
spark-submit --class io.hydrosphere.mist.Mist \
  --driver-java-options "-Dconfig.file=/path/to/application.conf" \
  target/scala-2.10/mist-assembly-0.2.0.jar
Try curl from terminal:
curl --header "Content-Type: application/json" -X POST http://192.168.10.33:2003/jobs --data '{"jarPath":"/vagrant/examples/target/scala-2.10/mist_examples_2.10-0.2.0.jar", "className":"SimpleContext$","parameters":{"digits":[1,2,3,4,5,6,7,8,9,0]}, "external_id":"12345678","name":"foo"}'