I'm building a Spark application to load two JSON files, compare them, and print the differences. I also try to validate these files using the Amazon Deequ library, but I'm getting the below exception:
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
20/08/07 11:56:33 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Error: Failed to load com.deeq.CompareDataFrames: com/amazon/deequ/checks/Check
log4j:WARN No appenders could be found for logger (org.apache.spark.util.ShutdownHookManager).
log4j:WARN Please
This happens when I submit the job to Spark:
./spark-submit --class com.deeq.CompareDataFrames --master
spark://saif-VirtualBox:7077 ~/Downloads/deeq-trial-1.0-SNAPSHOT.jar
I'm using Ubuntu to host Spark; everything was working without any issues before I added Deequ to run some validation. I wonder if I'm missing something in the deployment process. This error doesn't seem to be a well-known one on the internet.
Code:
import com.amazon.deequ.VerificationResult;
import com.amazon.deequ.VerificationSuite;
import com.amazon.deequ.checks.Check;
import com.amazon.deequ.checks.CheckLevel;
import com.amazon.deequ.checks.CheckStatus;
import com.amazon.deequ.constraints.Constraint;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import scala.Option;
import scala.Tuple2;
import scala.collection.mutable.ArraySeq;
import scala.collection.mutable.Seq;
public class CompareDataFrames {

    public static void main(String[] args) {
        SparkSession session = SparkSession.builder().appName("CompareDataFrames").getOrCreate();
        session.sparkContext().setLogLevel("ALL");

        StructType schema = DataTypes.createStructType(new StructField[]{
                DataTypes.createStructField("CUST_ID", DataTypes.StringType, true),
                DataTypes.createStructField("RECORD_LOCATOR_ID", DataTypes.StringType, true),
                DataTypes.createStructField("EVNT_ID", DataTypes.StringType, true)
        });

        Dataset<Row> first = session.read().option("multiline", "true").schema(schema).json("/home/saif/Downloads/FILE_DEV1.json");
        System.out.println("======= DataSet 1 =======");
        first.printSchema();
        first.show(false);

        Dataset<Row> second = session.read().option("multiline", "true").schema(schema).json("/home/saif/Downloads/FILE_DEV2.json");
        System.out.println("======= DataSet 2 =======");
        second.printSchema();
        second.show(false);

        // This will show all the rows which are present in the first dataset
        // but not present in the second dataset. But the comparison is at row
        // level and not at column level.
        System.out.println("======= Expect =======");
        first.except(second).show();

        StructType one = first.schema();

        JavaPairRDD<String, Row> pair1 = first.toJavaRDD().mapToPair((PairFunction<Row, String, Row>)
                row -> new Tuple2<>(row.getString(1), row));
        JavaPairRDD<String, Row> pair2 = second.toJavaRDD().mapToPair((PairFunction<Row, String, Row>)
                row -> new Tuple2<>(row.getString(1), row));
        System.out.println("======= Pair1 & Pair2 were created =======");

        JavaPairRDD<String, Row> subs = pair1.subtractByKey(pair2);
        JavaRDD<Row> rdd = subs.values();
        Dataset<Row> diff = session.createDataFrame(rdd, one);
        System.out.println("======= Diff Show =======");
        diff.show();

        Seq<Constraint> cons = new ArraySeq<>(0);
        VerificationResult vr = new VerificationSuite().onData(first)
                .addCheck(new Check(CheckLevel.Error(), "unit test", cons)
                        .isComplete("EVNT_ID", Option.empty())
                )
                .run();

        Seq<Check> checkSeq = new ArraySeq<>(0);
        if (vr.status() != CheckStatus.Success()) {
            Dataset<Row> vrr = vr.checkResultsAsDataFrame(session, vr, checkSeq);
            vrr.show(false);
        }
    }
}
Maven:
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>3.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.12</artifactId>
<version>3.0.0</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.12</artifactId>
<version>3.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-catalyst_2.12</artifactId>
<version>3.0.0</version>
</dependency>
<dependency>
<groupId>com.amazon.deequ</groupId>
<artifactId>deequ</artifactId>
<version>1.0.4</version>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-core</artifactId>
<version>2.13.3</version>
</dependency>
<dependency>
<groupId>org.scala-lang.modules</groupId>
<artifactId>scala-java8-compat_2.13</artifactId>
<version>0.9.1</version>
</dependency>
Please follow the approaches below to resolve your problem.
Approach 1.
spark-submit with the --jars option:
Download the jar from the following Maven repository to your machine, e.g. to ~/Downloads/deequ-1.0.4.jar: https://mvnrepository.com/artifact/com.amazon.deequ/deequ/1.0.4
./spark-submit --class com.deeq.CompareDataFrames --master
spark://saif-VirtualBox:7077 --jars ~/Downloads/deequ-1.0.4.jar ~/Downloads/deeq-trial-1.0-SNAPSHOT.jar
Approach 2.
spark-submit with the --packages option:
./spark-submit --class com.deeq.CompareDataFrames --master
spark://saif-VirtualBox:7077 --packages com.amazon.deequ:deequ:1.0.4 ~/Downloads/deeq-trial-1.0-SNAPSHOT.jar
Notes:
The --repositories option is required only if some custom repository has to be referenced.
By default, the Maven Central repository is used if the --repositories option is not provided.
When the --packages option is specified, the submit operation looks for the packages and their dependencies in the ~/.ivy2/cache, ~/.ivy2/jars, and ~/.m2/repository directories.
If they are not found, they are downloaded from Maven Central using Ivy and stored under the ~/.ivy2 directory.
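For example, if the artifact had to be resolved from a custom repository rather than Maven Central, the submit command could look roughly like this (the repository URL below is only a placeholder):
./spark-submit --class com.deeq.CompareDataFrames --master spark://saif-VirtualBox:7077 \
  --repositories https://your.internal.repo/maven2 \
  --packages com.amazon.deequ:deequ:1.0.4 \
  ~/Downloads/deeq-trial-1.0-SNAPSHOT.jar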
Edit 1:
Approach 3:
If approaches 1 and 2 above are not working, then use the maven-shade-plugin to build an uber jar and proceed with spark-submit.
Use the pom.xml below for building the uber jar with the maven-shade-plugin: add it to your project, rebuild your jar, and deploy it with spark-submit.
spark-submit --class com.deeq.CompareDataFrames --master
spark://saif-VirtualBox:7077 ~/Downloads/deeq-trial-1.0-SNAPSHOT.jar
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.deeq</groupId>
<artifactId>deeq-trial-1.0-SNAPSHOT</artifactId>
<version>1.0</version>
<name>Spark-3.0 Spark Application</name>
<url>https://maven.apache.org</url>
<repositories>
<repository>
<id>codelds</id>
<url>https://code.lds.org/nexus/content/groups/main-repo</url>
</repository>
<repository>
<id>central</id>
<name>Maven Repository Switchboard</name>
<layout>default</layout>
<url>https://repo1.maven.org/maven2</url>
<snapshots>
<enabled>false</enabled>
</snapshots>
</repository>
</repositories>
<properties>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
<encoding>UTF-8</encoding>
<scala.version>2.12.8</scala.version>
<java.version>1.8</java.version>
<CodeCacheSize>512m</CodeCacheSize>
<es.version>2.4.6</es.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>3.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.12</artifactId>
<version>3.0.0</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.12</artifactId>
<version>3.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-catalyst_2.12</artifactId>
<version>3.0.0</version>
</dependency>
<dependency>
<groupId>com.amazon.deequ</groupId>
<artifactId>deequ</artifactId>
<version>1.0.4</version>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-core</artifactId>
<version>2.13.3</version>
</dependency>
<dependency>
<groupId>org.scala-lang.modules</groupId>
<artifactId>scala-java8-compat_2.13</artifactId>
<version>0.9.1</version>
</dependency>
</dependencies>
<build>
<resources>
<resource>
<directory>src/main/resources</directory>
</resource>
</resources>
<plugins>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.2.2</version>
<executions>
<execution>
<id>eclipse-add-source</id>
<goals>
<goal>add-source</goal>
</goals>
</execution>
<execution>
<id>scala-compile-first</id>
<phase>process-resources</phase>
<goals>
<goal>compile</goal>
</goals>
</execution>
<execution>
<id>scala-test-compile-first</id>
<phase>process-test-resources</phase>
<goals>
<goal>testCompile</goal>
</goals>
</execution>
<execution>
<id>attach-scaladocs</id>
<phase>verify</phase>
<goals>
<goal>doc-jar</goal>
</goals>
</execution>
</executions>
<configuration>
<scalaVersion>${scala.version}</scalaVersion>
<recompileMode>incremental</recompileMode>
<useZincServer>true</useZincServer>
<args>
<arg>-unchecked</arg>
<arg>-deprecation</arg>
<arg>-feature</arg>
</args>
<jvmArgs>
<jvmArg>-Xms1024m</jvmArg>
<jvmArg>-Xmx1024m</jvmArg>
<jvmArg>-XX:ReservedCodeCacheSize=${CodeCacheSize}</jvmArg>
</jvmArgs>
<javacArgs>
<javacArg>-source</javacArg>
<javacArg>${java.version}</javacArg>
<javacArg>-target</javacArg>
<javacArg>${java.version}</javacArg>
<javacArg>-Xlint:all,-serial,-path</javacArg>
</javacArgs>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<artifactSet>
<excludes>
<exclude>org.xerial.snappy</exclude>
<exclude>org.scala-lang.modules</exclude>
<exclude>org.scala-lang</exclude>
</excludes>
</artifactSet>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
<relocations>
<relocation>
<pattern>com.google.common</pattern>
<shadedPattern>shaded.com.google.common</shadedPattern>
</relocation>
</relocations>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
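Before submitting, it can be worth verifying that the shaded jar really contains the Deequ classes. A quick way to check (the same technique used further down in this thread) is:
jar tf ~/Downloads/deeq-trial-1.0-SNAPSHOT.jar | grep com/amazon/deequ/checks/Check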
Related
I got this error when trying to run Spark Streaming to read data from Kafka. I searched it on Google, and the answers didn't fix my error.
I fixed an earlier bug, Exception in thread "main" java.lang.NoClassDefFoundError: scala/Product$class (Java), with the answer from https://stackoverflow.com/users/9023547/chandan, but then got this error again.
This is the terminal output when I run the project:
Exception in thread "JobGenerator" java.lang.NoSuchMethodError: 'scala.collection.mutable.ArrayOps scala.Predef$.refArrayOps(java.lang.Object[])'
at org.apache.spark.streaming.kafka010.KafkaRDD.count(KafkaRDD.scala:89)
at org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:216)
at org.apache.spark.streaming.dstream.DStream.$anonfun$getOrCompute$3(DStream.scala:343)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at org.apache.spark.streaming.dstream.DStream.$anonfun$getOrCompute$2(DStream.scala:343)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:417)
at org.apache.spark.streaming.dstream.DStream.$anonfun$getOrCompute$1(DStream.scala:342)
at scala.Option.orElse(Option.scala:447)
at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:335)
at org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:36)
at org.apache.spark.streaming.dstream.DStream.$anonfun$getOrCompute$3(DStream.scala:343)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at org.apache.spark.streaming.dstream.DStream.$anonfun$getOrCompute$2(DStream.scala:343)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:417)
at org.apache.spark.streaming.dstream.DStream.$anonfun$getOrCompute$1(DStream.scala:342)
at scala.Option.orElse(Option.scala:447)
at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:335)
at org.apache.spark.streaming.dstream.FlatMappedDStream.compute(FlatMappedDStream.scala:36)
at org.apache.spark.streaming.dstream.DStream.$anonfun$getOrCompute$3(DStream.scala:343)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at org.apache.spark.streaming.dstream.DStream.$anonfun$getOrCompute$2(DStream.scala:343)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:417)
at org.apache.spark.streaming.dstream.DStream.$anonfun$getOrCompute$1(DStream.scala:342)
at scala.Option.orElse(Option.scala:447)
at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:335)
at org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:36)
at org.apache.spark.streaming.dstream.DStream.$anonfun$getOrCompute$3(DStream.scala:343)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at org.apache.spark.streaming.dstream.DStream.$anonfun$getOrCompute$2(DStream.scala:343)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:417)
at org.apache.spark.streaming.dstream.DStream.$anonfun$getOrCompute$1(DStream.scala:342)
at scala.Option.orElse(Option.scala:447)
at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:335)
at org.apache.spark.streaming.dstream.ShuffledDStream.compute(ShuffledDStream.scala:41)
at org.apache.spark.streaming.dstream.DStream.$anonfun$getOrCompute$3(DStream.scala:343)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at org.apache.spark.streaming.dstream.DStream.$anonfun$getOrCompute$2(DStream.scala:343)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:417)
at org.apache.spark.streaming.dstream.DStream.$anonfun$getOrCompute$1(DStream.scala:342)
at scala.Option.orElse(Option.scala:447)
at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:335)
at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
at org.apache.spark.streaming.DStreamGraph.$anonfun$generateJobs$2(DStreamGraph.scala:123)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:75)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:122)
at org.apache.spark.streaming.scheduler.JobGenerator.$anonfun$generateJobs$1(JobGenerator.scala:252)
at scala.util.Try$.apply(Try.scala:213)
at org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:250)
at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:186)
at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:91)
at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:90)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
21/05/31 22:00:40 ERROR JobScheduler: Error in job generator
java.lang.IllegalStateException: JobGenerator has already been stopped accidentally.
at org.apache.spark.util.EventLoop.post(EventLoop.scala:107)
at org.apache.spark.streaming.scheduler.JobGenerator.$anonfun$timer$1(JobGenerator.scala:63)
at org.apache.spark.streaming.util.RecurringTimer.triggerActionForNextInterval(RecurringTimer.scala:94)
at org.apache.spark.streaming.util.RecurringTimer.org$apache$spark$streaming$util$RecurringTimer$$loop(RecurringTimer.scala:106)
at org.apache.spark.streaming.util.RecurringTimer$$anon$1.run(RecurringTimer.scala:29)
Exception in thread "main" java.lang.IllegalStateException: JobGenerator has already been stopped accidentally.
at org.apache.spark.util.EventLoop.post(EventLoop.scala:107)
at org.apache.spark.streaming.scheduler.JobGenerator.$anonfun$timer$1(JobGenerator.scala:63)
at org.apache.spark.streaming.util.RecurringTimer.triggerActionForNextInterval(RecurringTimer.scala:94)
at org.apache.spark.streaming.util.RecurringTimer.org$apache$spark$streaming$util$RecurringTimer$$loop(RecurringTimer.scala:106)
at org.apache.spark.streaming.util.RecurringTimer$$anon$1.run(RecurringTimer.scala:29)
This is the pom.xml file of the project:
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>TikiData</groupId>
<artifactId>TikiData</artifactId>
<version>V1</version>
<dependencies>
<!-- https://mvnrepository.com/artifact/com.google.code.gson/gson -->
<dependency>
<groupId>com.google.code.gson</groupId>
<artifactId>gson</artifactId>
<version>2.8.6</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>3.3.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.4.2</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.4.2</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>2.0.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.3.3</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.11.8</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients -->
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>2.8.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka -->
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka_2.13</artifactId>
<version>2.8.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.scalatest/scalatest -->
<dependency>
<groupId>org.scalatest</groupId>
<artifactId>scalatest_2.11</artifactId>
<version>2.2.6</version>
<scope>test</scope>
</dependency>
<!-- https://mvnrepository.com/artifact/org.scalatest/scalatest -->
<dependency>
<groupId>org.scalatest</groupId>
<artifactId>scalatest_2.11</artifactId>
<version>2.2.6</version>
<scope>test</scope>
</dependency>
</dependencies>
<build>
<sourceDirectory>src</sourceDirectory>
<plugins>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.8.1</version>
<configuration>
<release>11</release>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
<configuration>
<archive>
<manifest>
<mainClass>
demo.KafkaDemo
</mainClass>
</manifest>
</archive>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
This is the main file of the project:
package demo;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;
import scala.Tuple2;
public class KafkaDemo {
    public static void main(String[] args) throws InterruptedException {
        // Create a local StreamingContext with a batch interval of 10 seconds
        SparkConf conf = new SparkConf().setMaster("local").setAppName("Kafka Spark Integration");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // Define Kafka parameters
        Map<String, Object> kafkaParams = new HashMap<String, Object>();
        kafkaParams.put("bootstrap.servers", "localhost:9092");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "0");
        // // Automatically reset the offset to the earliest offset
        // kafkaParams.put("auto.offset.reset", "earliest");
        // kafkaParams.put("enable.auto.commit", false);

        // Define a list of Kafka topics to subscribe to
        Collection<String> topics = Arrays.asList("hello-kafka");

        // Create an input DStream which consumes messages from the Kafka topics
        JavaInputDStream<ConsumerRecord<String, String>> stream;
        stream = KafkaUtils.createDirectStream(jssc, LocationStrategies.PreferConsistent(), ConsumerStrategies.Subscribe(topics, kafkaParams));

        // Read the value of each message from Kafka
        JavaDStream<String> lines = stream.map((Function<ConsumerRecord<String, String>, String>) kafkaRecord -> kafkaRecord.value());

        // Split each message into words
        JavaDStream<String> words = lines.flatMap((FlatMapFunction<String, String>) line -> Arrays.asList(line.split(" ")).iterator());

        // Take every word and return a Tuple of (word, 1)
        JavaPairDStream<String, Integer> wordMap = words.mapToPair((PairFunction<String, String, Integer>) word -> new Tuple2<>(word, 1));

        // Count occurrences of each word
        JavaPairDStream<String, Integer> wordCount = wordMap.reduceByKey((Function2<Integer, Integer, Integer>) (first, second) -> first + second);

        // Print the word counts
        wordCount.print();

        // Start the computation
        jssc.start();
        jssc.awaitTermination();
    }
}
The answer is the same as before: make all Spark and Scala versions exactly the same. What's happening is that kafka_2.13 depends on Scala 2.13, while the rest of your dependencies are 2.11, and Spark 2.4 doesn't support Scala 2.13.
You can do this more easily with Maven properties:
<properties>
<scala.minor.version>2.11</scala.minor.version>
<spark.version>2.4.2</spark.version>
</properties>
You also should not include Kafka itself as a dependency, and I'd suggest Scala 2.12, but that's up to you, since you're not using Scala directly anyway.
You should only need Spark core and these three dependencies to run that code:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_${scala.minor.version}</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_${scala.minor.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.minor.version}.8</version>
</dependency>
Also worth pointing out: Spark 2.4 doesn't use Hadoop 3 clients, and the DStream Kafka API is effectively deprecated in favor of Structured Streaming (the spark-sql-kafka-0-10 dependency).
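For reference, a rough sketch of the same consumer written against Structured Streaming is shown below. It assumes the matching spark-sql-kafka-0-10 package for your Spark/Scala version is supplied at submit time (for example via --packages); the broker address and topic name are taken from the question, everything else is illustrative:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class KafkaStructuredDemo {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .master("local[*]")
                .appName("Kafka Structured Streaming Integration")
                .getOrCreate();

        // Subscribe to the topic and read the message values as strings
        Dataset<Row> lines = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "hello-kafka")
                .load()
                .selectExpr("CAST(value AS STRING) AS value");

        // Print the incoming values to the console in micro-batches
        StreamingQuery query = lines.writeStream()
                .format("console")
                .outputMode("append")
                .start();

        query.awaitTermination();
    }
}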
The Spark version on my machine is 3.1.1, so I changed it back to 3.1.1 in my pom.xml file and aligned all Scala and Spark versions to one consistent version, like this:
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>TikiData</groupId>
<artifactId>TikiData</artifactId>
<version>V1</version>
<dependencies>
<!-- https://mvnrepository.com/artifact/com.google.code.gson/gson -->
<dependency>
<groupId>com.google.code.gson</groupId>
<artifactId>gson</artifactId>
<version>2.8.6</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>3.3.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>3.1.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.12</artifactId>
<version>3.1.1</version>
<scope>provided</scope>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
<version>3.1.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.12</artifactId>
<version>3.1.1</version>
<scope>provided</scope>
</dependency>
<!-- https://mvnrepository.com/artifact/org.scala-lang/scala-library -->
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.12.2</version>
</dependency>
</dependencies>
<build>
<sourceDirectory>src</sourceDirectory>
<plugins>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.8.1</version>
<configuration>
<release>11</release>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
<configuration>
<archive>
<manifest>
<mainClass>
demo.KafkaDemo
</mainClass>
</manifest>
</archive>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
I am running a Spark Streaming job to consume from Kafka using the direct approach (for Kafka 0.10.0 or greater). I built the jar using the maven-assembly-plugin and checked its contents using jar tf <jar file> | grep ConsumerRecord. I get the following output:
org/apache/kafka/clients/consumer/ConsumerRecord.class
org/apache/kafka/clients/consumer/ConsumerRecords$ConcatenatedIterable$1.class
org/apache/kafka/clients/consumer/ConsumerRecords$ConcatenatedIterable.class
org/apache/kafka/clients/consumer/ConsumerRecords.class
But when I run the spark-submit job on my cluster (with master as both local and yarn), I get the following exception:
java.lang.ClassNotFoundException:
org.apache.kafka.clients.consumer.ConsumerRecord
Another option that I tried is building a shaded jar using the maven-shade-plugin. Same result there as well.
Please find below my POM file:
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.myCompany</groupId>
<artifactId>spark-streaming-test</artifactId>
<version>1</version>
<dependencies>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.4.5</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.4.5</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>2.4.5</version>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>2.2.1</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.6.0</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
</execution>
</executions>
<configuration>
<finalName>shade-${artifactId}-${version}</finalName>
</configuration>
</plugin>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<archive>
<manifest>
<mainClass>com.myCompany.ReadFromKafka</mainClass>
</manifest>
</archive>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<id>make-assembly</id> <!-- this is used for inheritance merges -->
<phase>package</phase> <!-- bind to the packaging phase -->
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
And here is my Spark Streaming code (taken from https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html):
package com.myCompany;
import java.util.*;
import org.apache.spark.SparkConf;
import org.apache.spark.TaskContext;
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.*;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.StreamingContext;
import org.apache.spark.streaming.api.java.*;
import org.apache.spark.streaming.kafka010.*;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import scala.Tuple2;
public class ReadFromKafka {
    public static void main(String args[]) throws InterruptedException {
        SparkConf conf = new SparkConf();// .setAppName("Decryption-spark-streaming").setMaster("yarn");
        JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(5));

        Map<String, Object> kafkaParams = new HashMap<String, Object>();
        kafkaParams.put("bootstrap.servers", "server1:9093");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "my_cg");
        kafkaParams.put("auto.offset.reset", "earliest");
        kafkaParams.put("enable.auto.commit", false);
        kafkaParams.put("security.protocol", "SSL");
        kafkaParams.put("ssl.truststore.location", "abc.jks");
        kafkaParams.put("ssl.truststore.password", "changeit");
        kafkaParams.put("ssl.keystore.location", "abc.jks");
        kafkaParams.put("ssl.keystore.password", "changeme");
        kafkaParams.put("ssl.key.password", "changeme");

        Collection<String> topics = Arrays.asList("myTopic");

        JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(jsc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));

        stream.mapToPair(record -> new Tuple2<>(record.key(), record.value()));

        stream.foreachRDD(rdd -> {
            OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
            rdd.foreachPartition(consumerRecords -> {
                OffsetRange o = offsetRanges[TaskContext.get().partitionId()];
                System.out.println(o.topic() + " " + o.partition() + " " + o.fromOffset() + " " + o.untilOffset());
            });
        });

        stream.foreachRDD(rdd -> {
            OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
            // some time later, after outputs have completed
            ((CanCommitOffsets) stream.inputDStream()).commitAsync(offsetRanges);
        });

        // Start the computation
        jsc.start();
        jsc.awaitTermination();
    }
}
Adding the dependent jar file (spark-streaming-kafka-0-10_2.11.jar) to the spark-submit command helped in resolving this issue:
spark-submit --master yarn --deploy-mode cluster --name spark-streaming-test\
--executor-memory 1g --num-executors 4 --driver-memory 1g --jars\
/home/spark/jars/spark-streaming-kafka-0-10_2.11.jar --class\
com.mycompany.ReadFromKafka spark-streaming-test-1-jar-with-dependencies.jar
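Equivalently, instead of pointing at a local copy of the jar, the same dependency can usually be resolved at submit time with --packages, matching the Spark version from the pom above:
spark-submit --master yarn --deploy-mode cluster --name spark-streaming-test \
  --executor-memory 1g --num-executors 4 --driver-memory 1g \
  --packages org.apache.spark:spark-streaming-kafka-0-10_2.11:2.4.5 --class \
  com.mycompany.ReadFromKafka spark-streaming-test-1-jar-with-dependencies.jar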
Last week I started a training course in Kafka and Storm at OpenClassrooms. During the practical work, I run into an error when I try to execute a JAR containing my Java code for Storm.
There is no problem when compiling the project in Java and no problem when packaging with Maven; the problem only occurs when running the JAR:
theirman@vm-debian:/data/eclipse-workspace/velos$ storm jar target/velos-1.0-SNAPSHOT.jar velos.App remote
Running: /usr/lib/jvm/java/bin/java -client -Ddaemon.name= -Dstorm.options= -Dstorm.home=/apps/storm -Dstorm.log.dir=/apps/storm/logs -Djava.library.path=/usr/local/lib:/opt/local/lib:/usr/lib:/usr/lib64 -Dstorm.conf.file= -cp /apps/storm/*:/apps/storm/lib/*:/apps/storm/extlib/*:target/velos-1.0-SNAPSHOT.jar:/apps/storm/conf:/apps/storm/bin: -Dstorm.jar=target/velos-1.0-SNAPSHOT.jar -Dstorm.dependency.jars= -Dstorm.dependency.artifacts={} velos.App remote
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/storm/kafka/spout/KafkaSpoutConfig
at velos.App.main(App.java:22)
Caused by: java.lang.ClassNotFoundException: org.apache.storm.kafka.spout.KafkaSpoutConfig
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 1 more
App.java
package velos;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.StormSubmitter;
import org.apache.storm.generated.AlreadyAliveException;
import org.apache.storm.generated.AuthorizationException;
import org.apache.storm.generated.InvalidTopologyException;
import org.apache.storm.generated.StormTopology;
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseWindowedBolt;
import org.apache.storm.tuple.Fields;
public class App
{
    public static void main( String[] args ) throws AlreadyAliveException, InvalidTopologyException, AuthorizationException
    {
        TopologyBuilder builder = new TopologyBuilder();

        KafkaSpoutConfig.Builder<String, String> spoutConfigBuilder = KafkaSpoutConfig.builder("localhost:9092", "velib-stations");
        spoutConfigBuilder.setGroupId("city-stats");
        KafkaSpoutConfig<String, String> spoutConfig = spoutConfigBuilder.build();

        builder.setSpout("stations", new KafkaSpout<String, String>(spoutConfig));
        builder.setBolt("station-parsing", new StationParsingBolt()).shuffleGrouping("stations");
        builder.setBolt("city-stats", new CityStatsBolt().withTumblingWindow(BaseWindowedBolt.Duration.of(1000 * 60 * 5))).fieldsGrouping("station-parsing", new Fields("city"));
        builder.setBolt("save-results", new SaveResultsBolt()).fieldsGrouping("city-stats", new Fields("city"));

        StormTopology topology = builder.createTopology();

        Config config = new Config();
        config.setMessageTimeoutSecs(60 * 30);

        String topologyName = "velos";

        if (args.length > 0 && args[0].equals("remote")) {
            StormSubmitter.submitTopology(topologyName, config, topology);
        }
        else {
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology(topologyName, config, topology);
        }
    }
}
pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>velos</groupId>
<artifactId>velos</artifactId>
<version>1.0-SNAPSHOT</version>
<name>velos</name>
<!-- FIXME change it to the project's website -->
<url>http://www.example.com</url>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<maven.compiler.source>1.7</maven.compiler.source>
<maven.compiler.target>1.7</maven.compiler.target>
</properties>
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.11</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.storm</groupId>
<artifactId>storm-core</artifactId>
<version>1.1.0</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka_2.12</artifactId>
<version>0.10.2.0</version>
<scope>provided</scope>
<exclusions>
<exclusion>
<groupId>org.apache.zookeeper</groupId>
<artifactId>zookeeper</artifactId>
</exclusion>
<exclusion>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.storm</groupId>
<artifactId>storm-kafka</artifactId>
<version>1.0.2</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.storm</groupId>
<artifactId>storm-kafka-client</artifactId>
<version>1.1.0</version>
<scope>provided</scope>
</dependency>
</dependencies>
<build>
<pluginManagement><!-- lock down plugins versions to avoid using Maven defaults (may be moved to parent pom) -->
<plugins>
<!-- clean lifecycle, see https://maven.apache.org/ref/current/maven-core/lifecycles.html#clean_Lifecycle -->
<plugin>
<artifactId>maven-clean-plugin</artifactId>
<version>3.1.0</version>
</plugin>
<!-- default lifecycle, jar packaging: see https://maven.apache.org/ref/current/maven-core/default-bindings.html#Plugin_bindings_for_jar_packaging -->
<plugin>
<artifactId>maven-resources-plugin</artifactId>
<version>3.0.2</version>
</plugin>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.8.0</version>
</plugin>
<plugin>
<artifactId>maven-surefire-plugin</artifactId>
<version>2.22.1</version>
</plugin>
<plugin>
<artifactId>maven-jar-plugin</artifactId>
<version>3.0.2</version>
</plugin>
<plugin>
<artifactId>maven-install-plugin</artifactId>
<version>2.5.2</version>
</plugin>
<plugin>
<artifactId>maven-deploy-plugin</artifactId>
<version>2.8.2</version>
</plugin>
<!-- site lifecycle, see https://maven.apache.org/ref/current/maven-core/lifecycles.html#site_Lifecycle -->
<plugin>
<artifactId>maven-site-plugin</artifactId>
<version>3.7.1</version>
</plugin>
<plugin>
<artifactId>maven-project-info-reports-plugin</artifactId>
<version>3.0.0</version>
</plugin>
</plugins>
</pluginManagement>
</build>
</project>
storm-kafka and storm-kafka-client are likely not provided on the Storm classpath, so you would need to remove the provided scope from those.
Then you will also want to try shading your JAR so that all dependencies are available at runtime.
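A minimal sketch of what that could look like in the pom, keeping storm-core as provided (the cluster supplies it) while packaging the Kafka spout into the topology jar:
<dependency>
    <groupId>org.apache.storm</groupId>
    <artifactId>storm-kafka-client</artifactId>
    <version>1.1.0</version>
    <!-- no provided scope here: these classes must end up inside the submitted jar -->
</dependency>
...
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>3.2.4</version>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
        </execution>
    </executions>
</plugin>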
First of all, I'm trying to deploy a Spark Java application on a YARN cluster using the following command:
spark-submit --master yarn --class com.batchjob.BatchJob D:\batchjob-0.0.1-SNAPSHOT-shaded.jar
My Java code:
public class BatchJob {
    public static void main(String[] args) throws IOException {
        // get spark configuration
        SparkConf sparkConf = new SparkConf().setAppName("Example Spark App");//.setMaster("local");

        // setup spark session to be able to work with Dataset
        SparkSession spark = SparkSession.builder().config(sparkConf).getOrCreate();

        // import data
        Dataset<Row> input = spark.read().csv("hdfs://localhost:9000/input_dir/data.csv");
        input.show();

        // map to Dataset of Activity
        Dataset<Activity> activityDataset = input.map((row) -> {
            if (row.size() != 8)
                throw new RuntimeException("Row must have size of 8!");
            return new Activity(Long.parseLong(row.getString(0)), row.getString(1), row.getString(2), row.getString(3),
                    row.getString(4), row.getString(5), row.getString(6), row.getString(7));
        }, Encoders.bean(Activity.class));

        /*
         * Actions & Transformations
         */
        activityDataset.createOrReplaceTempView("activity");
        Dataset<Row> sqlResult = spark.sql("SELECT " + "product, timestamp, referrer, "
                + "SUM( CASE WHEN action = 'page_view' THEN 1 ELSE 0 END) AS page_view_count, "
                + "SUM( CASE WHEN action = 'add_to_cart' THEN 1 ELSE 0 END) AS add_to_cart_count, "
                + "SUM( CASE WHEN action = 'purchase' THEN 1 ELSE 0 END) AS purchase_count " + "FROM activity "
                + "GROUP BY product, timestamp, referrer").cache();

        sqlResult.write().partitionBy("referrer").mode(SaveMode.Append).parquet("hdfs://localhost:9000/lambda/batch1");

        spark.close();
    }
}
and my pom.xml
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com</groupId>
<artifactId>batchjob</artifactId>
<version>0.0.1-SNAPSHOT</version>
<packaging>jar</packaging>
<name>batchjob</name>
<url>http://maven.apache.org</url>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<dependencies>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.3.1</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>3.1.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.3.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.3.1</version>
<scope>provided</scope>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.6.0</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.3</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<shadedArtifactAttached>true</shadedArtifactAttached>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
<artifactSet>
<includes>
<include>*:*</include>
</includes>
</artifactSet>
<transformers>
<transformer
implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
<resource>reference.conf</resource>
</transformer>
</transformers>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
<resources>
<resource>
<directory>.</directory>
<includes>
<include>src/main/resources/*.*</include>
</includes>
</resource>
</resources>
</build>
</project>
The YARN cluster is started using the command .\HADOOP_HOME\sbin\start-yarn.cmd and the node with the command .\HADOOP_HOME\sbin\start-dfs.cmd. Note: I am on Windows 10!
For testing purposes, if I run the job locally everything is fine and I am able to see the result of the code at http://localhost:9870/explorer.html#/.
The problem appears when I try to let YARN decide how the Java Spark application is managed, changing --master to yarn instead of local, and I'm facing the following error:
2018-08-31 16:32:00 INFO Client:54 - Deleted staging directory file:/C:/Users/razvan.parautiu/.sparkStaging/application_1535721878844_0003
2018-08-31 16:32:00 ERROR SparkContext:91 - Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:89)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:63)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:500)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2493)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:933)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:924)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:924)
at com.batchjob.BatchJob.main(BatchJob.java:33)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I've checked other posts with the same error, but unfortunately their solutions don't work...
I'm using Maven and Vert.x to create an application.
Everything is working fine while I run it in my IDE (IntelliJ), but I can't make it work using the command line.
I have a Launcher class that deploy some verticles but the problem is the same with all my verticles.
So far here is what I've tried:
vertx run Launcher.java
vertx run com.packagename.Launcher.java
// user-content-service-0.1.jar created via: mvn clean package
run com.packagename.Launcher.java -cp target\user-content-service-0.1.jar
As an error I get this:
.../path/Launcher.java:8: error: cannot find symbol
private static final Logger logger = logManager.getLogger(Launcher.class);
symbol: variable LogManager
location: class com.packagename.Launcher
java.lang.RuntimeException: Compilation failed
...
The issue seems to come from the fact that the compiler is not able to find the dependencies.
Here is what my pom.xml looks like:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.packagename</groupId>
<artifactId>user-content-service</artifactId>
<packaging>jar</packaging>
<version>0.1</version>
<name>Project - user-content-service</name>
<url>http://maven.apache.org</url>
<properties>
<vertx.version>[3.5.0,3.6)</vertx.version>
<java.version>1.8</java.version>
<maven-compiler-plugin.version>3.3</maven-compiler-plugin.version>
<log4j.version>[2.10.0,2.11)</log4j.version>
<junit.version>4.12</junit.version>
</properties>
<dependencies>
<dependency>
<groupId>io.vertx</groupId>
<artifactId>vertx-core</artifactId>
<version>${vertx.version}</version>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-core</artifactId>
<version>${log4j.version}</version>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-slf4j-impl</artifactId>
<version>${log4j.version}</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>${junit.version}</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>io.vertx</groupId>
<artifactId>vertx-unit</artifactId>
<version>${vertx.version}</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>com.amazonaws</groupId>
<artifactId>aws-java-sdk-dynamodb</artifactId>
<version>LATEST</version>
</dependency>
<dependency>
<groupId>io.vertx</groupId>
<artifactId>vertx-config</artifactId>
<version>${vertx.version}</version>
</dependency>
<dependency>
<groupId>io.vertx</groupId>
<artifactId>vertx-web</artifactId>
<version>${vertx.version}</version>
</dependency>
<dependency>
<groupId>io.vertx</groupId>
<artifactId>vertx-web-client</artifactId>
<version>${vertx.version}</version>
</dependency>
<dependency>
<groupId>guru.nidi.raml</groupId>
<artifactId>raml-tester</artifactId>
<version>0.9.1</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<version>${maven-compiler-plugin.version}</version>
<configuration>
<source>${java.version}</source>
<target>${java.version}</target>
</configuration>
</plugin>
</plugins>
</build>
</project>
And that's my Launcher.java file:
package com.packagename;
import io.vertx.core.AbstractVerticle;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class Launcher extends AbstractVerticle {

    private static final Logger logger = LogManager.getLogger(Launcher.class);

    @Override
    public void start() {
        vertx.deployVerticle("com.packagename.DynamoDBVerticle", logRes -> {
            if (logRes.succeeded()) {
                vertx.deployVerticle("com.packagename.UploaderVerticle", uploaderRes -> {
                    if (uploaderRes.succeeded()) {
                        vertx.deployVerticle("com.packagename.ServerVerticle", serverRes -> {
                            if (!serverRes.succeeded()) {
                                logger.error("Could not start server");
                            }
                        });
                    }
                    else {
                        logger.error("Could not start uploader");
                    }
                });
            }
            else {
                logger.error("Could not start Dynamo");
            }
        });
    }
}
Any clue what I could be doing wrong here?
Thanks!
You are having a classpath issue: because you package your jar file with the maven-jar-plugin in the package phase, you only get your own classes in the build directory. If you want to run on the command line with java -jar, you should create a fat jar (aka uber jar, a jar file specifically packaged so that it contains your classes plus all dependencies declared in the pom file).
You can add the maven-shade-plugin to do that in the package phase, i.e.:
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.3</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<transformers>
<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<manifestEntries>
<Main-Class>io.vertx.core.Launcher</Main-Class>
<Main-Verticle>com.packagename.Launcher</Main-Verticle>
</manifestEntries>
</transformer>
<transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
<resource>META-INF/services/io.vertx.core.spi.VerticleFactory</resource>
</transformer>
</transformers>
<artifactSet/>
<outputFile>${project.build.directory}/${project.artifactId}-${project.version}-fat.jar</outputFile>
</configuration>
</execution>
</executions>
</plugin>
Note: you have a class name clash between the Vert.x Launcher (used as Main-Class in the fat jar) and your own Launcher (which is actually a Verticle); I'd suggest renaming yours.
After mvn package, you will then be able to do:
java -jar target/user-content-service-0.1-fat.jar
Check the Vert.x samples for more information.
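If needed, a specific verticle can also be passed to the Vert.x Launcher explicitly at run time instead of relying on the Main-Verticle manifest entry, for example:
java -jar target/user-content-service-0.1-fat.jar run com.packagename.Launcher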