I am creating a Spark jar file with the following Scala code in it:
import com.typesafe.config.ConfigFactory
object GetRequest {
  def main(args: Array[String]): Unit = {
    val api_credentials = ConfigFactory.load("application.conf")
    val username = api_credentials.getString("pi.api.username")
    val password = api_credentials.getString("pi.api.password")
  }
}
While submitting the jar, it cannot find the application.conf file, which is at C:\Users\abc\Desktop\ApiSparkJob\resource. How do I pass it in the spark-submit command on the CLI?
A resource file bundled inside the jar is not available to each Spark worker, so you need to ship the file with the --files argument:
--files application.conf
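For example, a spark-submit call could look like this (a sketch: the jar name, master and deploy mode are placeholders, the config path is the one from your question):

spark-submit --class GetRequest --master yarn --deploy-mode cluster --files C:\Users\abc\Desktop\ApiSparkJob\resource\application.conf ApiSparkJob.jar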
If your resource manager is YARN, refer to the code below.
import org.apache.hadoop.fs.{FileSystem, Path}
import java.io.{BufferedReader, File, InputStreamReader}
import com.typesafe.config.{Config, ConfigFactory}
import org.apache.spark.sql.SparkSession
object GetRequest {
  def main(args: Array[String]): Unit = {
    val sparkSession: SparkSession = SparkSession.builder.enableHiveSupport().getOrCreate()
    val yarnStagingDir: String = System.getenv("SPARK_YARN_STAGING_DIR")
    val confFile: Path = new Path(yarnStagingDir.concat("/application.conf"))
    val fs: FileSystem = FileSystem.get(sparkSession.sparkContext.hadoopConfiguration)
    val br: BufferedReader = new BufferedReader(new InputStreamReader(fs.open(confFile)))
    val api_credentials: Config = ConfigFactory.parseReader(br).resolve()
    val username: String = api_credentials.getString("pi.api.username")
    val password: String = api_credentials.getString("pi.api.password")
    br.close()
  }
}
// Don't call fs.close(): the same cached FileSystem instance is used to access the Hive warehouse directory, so closing it would end your job.
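Alternatively, a file shipped with --files is registered through SparkContext.addFile, so it can usually also be located with SparkFiles.get and parsed directly. This is only a sketch to verify for your deploy mode, since where the file lands differs between client and cluster mode:

import java.io.File
import org.apache.spark.SparkFiles
import com.typesafe.config.ConfigFactory

// Resolve the local copy of the file distributed with --files and parse it.
val confPath = SparkFiles.get("application.conf")
val api_credentials = ConfigFactory.parseFile(new File(confPath)).resolve()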
I am attempting to read Avro data from Kafka using Spark Streaming, but I receive the following error message:
Streaming Query Exception caught!: org.apache.spark.sql.streaming.StreamingQueryException: Job aborted.
=== Streaming Query ===
Identifier: [id = 8b54c92d-6bbc-4dbc-84d0-55b762c21ba2, runId = 4bc92b3c-343e-4886-b0bc-0777b89f9ec8]
Current Committed Offsets: {KafkaV2[Subscribe[customer-avro4]]: {"customer-avro":{"0":17}}}
Current Available Offsets: {KafkaV2[Subscribe[customer-avro4]]: {"customer-avro":{"0":20}}}
Current State: ACTIVE
Thread State: RUNNABLE
Any idea what the issue might be and how to resolve it? The code is below (inspired by the xebia-france spark-structured-streaming-blog). I think it ran earlier, but now there is a problem.
import com.databricks.spark.avro.SchemaConverters
import io.confluent.kafka.schemaregistry.client.{CachedSchemaRegistryClient, SchemaRegistryClient}
import io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryException
object AvroConsumer {
private val topic = "customer-avro4"
private val kafkaUrl = "http://localhost:9092"
private val schemaRegistryUrl = "http://localhost:8081"
private val schemaRegistryClient = new CachedSchemaRegistryClient(schemaRegistryUrl, 128)
private val kafkaAvroDeserializer = new AvroDeserializer(schemaRegistryClient)
private val avroSchema = schemaRegistryClient.getLatestSchemaMetadata(topic + "-value").getSchema
private val sparkSchema = SchemaConverters.toSqlType(new Schema.Parser().parse(avroSchema))
def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder
.appName("ConfluentConsumer")
.master("local[*]")
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
spark.udf.register("deserialize", (bytes: Array[Byte]) =>
DeserializerWrapper.deserializer.deserialize(bytes)
)
val kafkaDataFrame = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", kafkaUrl)
.option("subscribe", topic)
.load()
val valueDataFrame = kafkaDataFrame.selectExpr("""deserialize(value) AS message""")
import org.apache.spark.sql.functions._
val formattedDataFrame = valueDataFrame.select(
from_json(col("message"), sparkSchema.dataType).alias("parsed_value"))
.select("parsed_value.*")
val writer = formattedDataFrame
.writeStream
.format("parquet")
.option("checkpointLocation", "hdfs://localhost:9000/data/spark/parquet/checkpoint")
while (true) {
val query = writer.start("hdfs://localhost:9000/data/spark/parquet/total")
try {
query.awaitTermination()
}
catch {
case e: StreamingQueryException => println("Streaming Query Exception caught!: " + e);
}
}
}
object DeserializerWrapper {
val deserializer: AvroDeserializer = kafkaAvroDeserializer
}
class AvroDeserializer extends AbstractKafkaAvroDeserializer {
def this(client: SchemaRegistryClient) {
this()
this.schemaRegistry = client
}
override def deserialize(bytes: Array[Byte]): String = {
val genericRecord = super.deserialize(bytes).asInstanceOf[GenericRecord]
genericRecord.toString
}
}
}
Figured it out: the problem was not with the Spark-Kafka integration itself, as I had first thought, but with the checkpoint information in the HDFS filesystem. Deleting and recreating the checkpoint folder in HDFS solved it for me.
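For reference, one way to do that is with the standard HDFS shell commands (the path is the checkpointLocation from the code above). Note that this discards the stored offsets, so the query restarts from its configured starting offsets:

hdfs dfs -rm -r hdfs://localhost:9000/data/spark/parquet/checkpoint
hdfs dfs -mkdir -p hdfs://localhost:9000/data/spark/parquet/checkpoint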
I am trying to use the org.apache.hadoop.tools.DistCp class to copy some files over to an S3 bucket. However, the overwrite functionality is not working even though the overwrite flag is explicitly set to true.
Copying works fine, but it does not overwrite existing files; the copy mapper skips them. I have explicitly set the "overwrite" option to true.
import com.typesafe.scalalogging.LazyLogging
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.tools.{DistCp, DistCpOptions}
import org.apache.hadoop.util.ToolRunner
import scala.collection.JavaConverters._
object distcptest extends App with LazyLogging {

  def copytoS3(hdfsSrcFilePathStr: String, s3DestPathStr: String) = {
    val hdfsSrcPathList = List(new Path(hdfsSrcFilePathStr))
    val s3DestPath = new Path(s3DestPathStr)

    val distcpOpt = new DistCpOptions(hdfsSrcPathList.asJava, s3DestPath)
    // Overwriting is not working in spite of explicitly setting it to true.
    distcpOpt.setOverwrite(true)

    val conf: Configuration = new Configuration()
    conf.set("fs.s3n.awsSecretAccessKey", "secret key")
    conf.set("fs.s3n.awsAccessKeyId", "access key")
    conf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")

    val distCp: DistCp = new DistCp(conf, distcpOpt)
    val filepaths: Array[String] = Array(hdfsSrcFilePathStr, s3DestPathStr)

    try {
      val distCp_result = ToolRunner.run(distCp, filepaths)
      if (distCp_result != 0) {
        logger.error(s"DistCP has failed with - error code = $distCp_result")
      }
    }
    catch {
      case e: Exception => {
        e.printStackTrace()
      }
    }
  }

  copytoS3("hdfs://abc/pqr", "s3n://xyz/wst")
}
I think the problem is that you called ToolRunner.run(distCp, filepaths).
If you check the source code of DistCp, its run method overwrites inputOptions, so the DistCpOptions you passed to the constructor take no effect.
@Override
public int run(String[] argv) {
  ...
  try {
    inputOptions = (OptionsParser.parse(argv));
    ...
  } catch (Throwable e) {
    ...
  }
  ...
}
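Two possible workarounds, sketched here under the assumption that the rest of your code stays the same: either put the -overwrite switch into the argument array that run re-parses, or skip the argv parsing and call DistCp.execute(), which uses the DistCpOptions given to the constructor.

// Option 1: let OptionsParser see the flag by passing it in argv.
val argvWithOverwrite: Array[String] = Array("-overwrite", hdfsSrcFilePathStr, s3DestPathStr)
val exitCode = ToolRunner.run(distCp, argvWithOverwrite)

// Option 2: bypass run()/OptionsParser entirely; execute() submits the job
// built from distcpOpt and (with the default blocking behaviour) waits for it.
val job = distCp.execute()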
I am running a Scala project where I need to execute some rules. The rules are dynamically added to or removed from a Scala class file at runtime.
So whenever the rules class is modified, I want it to be reloaded and the changes picked up without stopping the running process.
I used Runtime.getRuntime.exec() to compile it and a URLClassLoader to load the modified code from the classes directory.
exec() runs fine and the classes in the target folder do get modified; the URLClassLoader does not throw any error either.
But it keeps giving me the same result I had at the start of the project, not the modified code.
Below is the code I am using.
package RuleEngine
import akka.actor._
import akka.http.scaladsl.Http
import akka.http.scaladsl.server.Directives._
import akka.stream.ActorMaterializer
import akka.util.Timeout
import scala.io.StdIn
import Executor.Compute
import scala.concurrent.{Await, ExecutionContextExecutor}
import scala.concurrent.duration._
object StatsEngine {
  def main(args: Array[String]) {
    implicit val system: ActorSystem = ActorSystem("StatsEngine")
    implicit val materializer: ActorMaterializer = ActorMaterializer()
    implicit val executionContext: ExecutionContextExecutor = system.dispatcher
    implicit val timeout = Timeout(10 seconds)

    val computeDataActor = system.actorOf(Props[Compute], "ComputeData")

    val route = {
      post {
        path("computedata") {
          computeDataActor ! "Execute"
          complete("done")
        }
      }
    }

    val bindingFuture = Http().bindAndHandle(route, "localhost", 9000)
    println(s"Server online at http://localhost:9000/\nPress RETURN to stop...")
  }
}
This is the main object, where I use Akka HTTP to expose the API. It calls computeDataActor, whose code is below.
package Executor
import java.io.File
import java.net.URLClassLoader
import CompiledRules.RulesList
import akka.actor.Actor
class Compute extends Actor {

  def executeRule(): Unit = {
    val rlObj = new RulesList
    rlObj.getClass.getDeclaredMethods.map(name => name).foreach(println)

    val prcs = Runtime.getRuntime().exec("scalac /home/hduser/MStatsEngine/Test/RuleListCollection/src/main/scala/CompiledRules/RuleList.scala -d /home/hduser/MStatsEngine/Test/RuleListCollection/target/scala-2.11/classes/")
    prcs.waitFor()

    val fk = new File("/home/hduser/MStatsEngine/Test/RuleListCollection/target/scala-2.11/classes/").toURI.toURL
    val classLoaderUrls = Array(fk)
    val urlClassLoader = new URLClassLoader(classLoaderUrls)

    val beanClass = urlClassLoader.loadClass("CompiledRules.RulesList")
    val constructor = beanClass.getConstructor()
    val beanObj = constructor.newInstance()

    beanClass.getDeclaredMethods.map(x => x.getName).foreach(println)
  }

  override def receive: Receive = {
    case key: String => {
      executeRule()
    }
  }
}
The rules class it imports is shown below.
package CompiledRules
class RulesList {
  def R1: Any = {
    return "executing R1"
  }

  def R2: Any = { return "executing R2" }

  // def R3 : Any = {return "executing R3"}
  // def R4 : Any = {return "executing R4"}

  def R5: Any = { return "executing R5" }
} //Replace
So when I execute the code and call the API, I get the output:
R1
R2
R5
Now, without stopping the project, I uncomment R3 and R4 and call the API again.
Since the code runs Runtime.getRuntime.exec() again, it recompiles the file and updates the classes in the target folder.
So I used a URLClassLoader to get a new object from the modified classes.
But unfortunately I always get the same result I had at the start of the project:
R1
R2
R5
Below is the link to the complete project:
Source Code
val beanClass = urlClassLoader.loadClass("CompiledRules.RulesList")
val constructor = beanClass.getConstructor()
val beanObj = constructor.newInstance()
is just creating a new instance of the already loaded class.
Java's built-in class loaders always check whether a class has already been loaded before loading it.
loadClass
protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException
Loads the class with the specified binary name. The default implementation of this method searches for classes in the following order:
1. Invoke findLoadedClass(String) to check if the class has already been loaded.
2. Invoke the loadClass method on the parent class loader. If the parent is null, the class loader built in to the virtual machine is used instead.
3. Invoke the findClass(String) method to find the class.
To reload a class, you will have to implement your own ClassLoader subclass, as in this link.
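For example, a minimal child-first loader could look like the sketch below: anything under the CompiledRules prefix is resolved from the given URLs instead of being delegated to the parent loader, so the copy of RulesList already on the application classpath is never returned. Create a fresh instance of it on every reload.

import java.net.{URL, URLClassLoader}

class ChildFirstClassLoader(urls: Array[URL], parent: ClassLoader, reloadPrefix: String)
    extends URLClassLoader(urls, parent) {

  override def loadClass(name: String, resolve: Boolean): Class[_] = synchronized {
    if (name.startsWith(reloadPrefix)) {
      // Look only in this loader's URLs so a new Class object is defined
      // from the freshly compiled .class file instead of the parent's copy.
      val loaded = findLoadedClass(name)
      val c = if (loaded != null) loaded else findClass(name)
      if (resolve) resolveClass(c)
      c
    } else {
      super.loadClass(name, resolve)
    }
  }
}

// Usage sketch inside executeRule(), reusing the fk URL built above:
// val loader = new ChildFirstClassLoader(Array(fk), getClass.getClassLoader, "CompiledRules")
// val ruleClass = loader.loadClass("CompiledRules.RulesList")
// val ruleObj = ruleClass.getConstructor().newInstance()
// ruleClass.getDeclaredMethods.foreach(m => println(m.invoke(ruleObj)))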
//package com.examples
/**
* Created by kalit_000 on 27/09/2015.
*/
import org.apache.spark.SparkConf
import org.apache.log4j.Logger
import org.apache.log4j.Level
import org.apache.spark._
import java.sql.{ResultSet, DriverManager, Connection}
import kafka.producer.KeyedMessage
import kafka.producer.Producer
import kafka.producer.ProducerConfig
import java.util.Properties
import org.apache.spark.streaming.{Seconds,StreamingContext}
import org.apache.spark._
object SqlServerKafkaProducer {

  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.WARN)
    Logger.getLogger("akka").setLevel(Level.WARN)

    val conf = new SparkConf().setMaster("local[2]").setAppName("MSSQL_KAFKA_PRODUCER")
    val sc = new SparkContext(conf)

    val driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    val url = "jdbc:sqlserver://localhost;user=admin;password=oracle;database=AdventureWorks2014"
    val username = "admin"
    val password = "oracle"
    var connection: Connection = null
    Class.forName(driver)

    /* Create a connection and statement to run against SQL Server and execute */
    connection = DriverManager.getConnection(url, username, password)
    val statement = connection.createStatement()
    val resultSet = statement.executeQuery("select top 10 CustomerID,StoreID,TerritoryID,AccountNumber from AdventureWorks2014.dbo.Customer")
    resultSet.setFetchSize(10)

    val columnnumber = resultSet.getMetaData().getColumnCount.toInt

    /* Output column names */
    var i = 0.toInt
    for (i <- 1 to columnnumber.toInt) {
      val columnname = resultSet.getMetaData().getColumnName(i)
      println("Column Names are:- %s".format(columnname))
    }

    /* Output data */
    while (resultSet.next()) {
      var list = new java.util.ArrayList[String]()
      for (i <- 1 to columnnumber.toInt) {
        list.add(resultSet.getObject(i).toString())
        //println("Column Names are:- %s".format(columnname))
      }
      println(list)

      /* Build Kafka properties */
      val props: Properties = new Properties()
      props.put("metadata.broker.list", "localhost:9092")
      props.put("serializer.class", "kafka.serializer.StringEncoder")

      /* Send message using producer.send to topic trade */
      val config = new ProducerConfig(props)
      val producer = new Producer[String, String](config)
      //val x=list.collect().mkString("\n").replace("[","").replace("]","").replace(",","~")
      producer.send(new KeyedMessage[String, String]("trade", list.toString().replace("[", "").replace("]", "").replace(",", "~")))
    }

    /* Close SQL Server database connection */
    connection.close()
  }
}
I built the jar with Maven in IntelliJ IDEA. This is a Scala Spark project, and the jar file is created under C:\Users\kalit_000\IdeaProjects\SparkCookBook\target\SparkCookBook-0.0.1-SNAPSHOT-jar-with-dependencies.jar. When I try to run the jar with the command
scala -classpath "C:\Users\kalit_000\IdeaProjects\SparkCookBook\target\SparkCookBook-1.0-SNAPSHOT-jar-with-dependencies.jar" SqlServerKafkaProducer
I get the following error:
No such file or class on classpath: SqlServerKafkaProducer.class
I can see my class inside the jar file (I opened the jar with a Java decompiler).
Can anyone help? The project compiles successfully in IntelliJ IDEA.
I'm trying to run a Groovy (2.4.3) script on Windows that calls a Groovy class xxxxx.groovy. I've tried a number of variations using the classpath and various scripts, some examples below, and I always get MultipleCompilationErrorsException ... unable to resolve class.
The class file is firstclass.groovy:
import org.apache.commons.io.FilenameUtils
class firstclassstart {

    def wluid, wlpwd, wlserver, port
    private wlconnection, connectString, jmxConnector, Filpath, Filpass, Filname, OSRPDpath, Passphrase

    // object constructor
    firstclassstart(wluid, wlpwd, wlserver, port) {
        this.wluid = wluid
        this.wlpwd = wlpwd
        this.wlserver = wlserver
        this.port = port
    }

    def isFile(Filpath) {
        // Create a File object representing the folder 'A/B'
        def folder = new File(Filpath)
        if (!org.apache.commons.io.FilenameUtils.isExtension(Filpath, "txt")) {
            println "bad extension"
            return false
        } else if (!folder.exists()) {
            // Create all folders up-to and including B
            println " path is wrong"
            return false
        } else
            println "file found"
        return true
    }
}
Command-line script test.groovy:
import firstclass
def sample = new firstclass.firstclassstart("weblogic", "Admin123", "x.com", "7002")
//def sample = new firstclassstart("weblogic", "Admin123", "x.com", "7002")
sample.isFile("./firstclass.groovy")
..\groovy -cp "firstclass.groovy;commons-io-1.3.2.jar" testfc.groovy
script test.groovy
GroovyShell shell = new GroovyShell()
def script = shell.parse(new File('mylib/firstclass.groovy'))
firstclass sample = new script.firstclass("uid", "pwd", "url", "port")
sample.getstatus()
c:>groovy test.groovy
Script test.groovy v2 (firstclass.groovy placed in a directory test below the script):
import test.firstclass
firstclass sample = new script.firstclass("uid", "pwd", "url", "port")
sample.getstatus()
c:>groovy test.groovy
I am just looking for a bulletproof, portable way to organize my Java classes, .groovy classes, and scripts.
Thanks
I think you can do it using, for example, your first approach:
groovy -cp mylib/firstclass.groovy mylib/test.groovy
However, I see some problems in your code which are probably causing the MultipleCompilationErrorsException.
Since you're including firstclass.groovy in your classpath, you have to add import firstclass in test.groovy.
Why are you using script.firstclass in test.groovy? Your class is simply called firstclass.
In firstclass.groovy you're using import org.apache.commons.io.FilenameUtils, and probably other libraries, but you're not including them in the classpath.
So finally, I think you have to change your test.groovy to something like:
import firstclass
firstclass sample = new firstclass("uid", "pwd", "url", "port")
sample.getstatus()
And in your command, add the remaining include for Apache Commons IO to the classpath:
groovy -cp "mylib/firstclass.groovy;commons-io-2.4.jar;" mylib/testexe.groovy
Hope this helps.
UPDATE BASED ON OP CHANGES:
After the changes, there are still a few things wrong; let me enumerate them:
If your file is called firstclass.groovy, your class must be class firstclass, not class firstclassstart.
In your test.groovy, use new firstclass, not new firstclass.firstclassstart.
So your code must be:
class file firstclass.groovy:
import org.apache.commons.io.FilenameUtils
class firstclass {

    def wluid, wlpwd, wlserver, port
    private wlconnection, connectString, jmxConnector, Filpath, Filpass, Filname, OSRPDpath, Passphrase

    // object constructor
    firstclass(wluid, wlpwd, wlserver, port) {
        this.wluid = wluid
        this.wlpwd = wlpwd
        this.wlserver = wlserver
        this.port = port
    }

    def isFile(Filpath) {
        // Create a File object representing the folder 'A/B'
        def folder = new File(Filpath)
        if (!org.apache.commons.io.FilenameUtils.isExtension(Filpath, "txt")) {
            println "bad extension"
            return false
        } else if (!folder.exists()) {
            // Create all folders up-to and including B
            println " path is wrong"
            return false
        } else
            println "file found"
        return true
    }
}
script test.groovy:
import firstclass
def sample = new firstclass("weblogic", "Admin123", "x.com", "7002")
sample.isFile("./firstclass.groovy")
Finally the command to execute it:
groovy -cp "firstclass.groovy;commons-io-1.3.2.jar" test.groovy
With these changes your code should work; I tried it and it works as expected.