I'm trying to deploy a Spark Java application on a YARN cluster using the following command:
spark-submit --master yarn --class com.batchjob.BatchJob D:\batchjob-0.0.1-SNAPSHOT-shaded.jar
My Java code:
public class BatchJob {
public static void main(String[] args) throws IOException {
// get spark configuration
SparkConf sparkConf = new SparkConf().setAppName("Example Spark App");//.setMaster("local");
// setup spark session to be able to work with Dataset
SparkSession spark = SparkSession.builder().config(sparkConf).getOrCreate();
// import data
Dataset<Row> input = spark.read().csv("hdfs://localhost:9000/input_dir/data.csv");
input.show();
// map to Dataset of Activity
Dataset<Activity> activityDataset = input.map((row) -> {
if (row.size() != 8)
throw new RuntimeException("Row must have size of 8!");
return new Activity(Long.parseLong(row.getString(0)), row.getString(1), row.getString(2), row.getString(3),
row.getString(4), row.getString(5), row.getString(6), row.getString(7));
}, Encoders.bean(Activity.class));
/*
* Actions & Transformations
*/
activityDataset.createOrReplaceTempView("activity");
Dataset<Row> sqlResult = spark.sql("SELECT " + "product, timestamp, referrer, "
+ "SUM( CASE WHEN action = 'page_view' THEN 1 ELSE 0 END) AS page_view_count, "
+ "SUM( CASE WHEN action = 'add_to_cart' THEN 1 ELSE 0 END) AS add_to_cart_count, "
+ "SUM( CASE WHEN action = 'purchase' THEN 1 ELSE 0 END) AS purchase_count " + "FROM activity "
+ "GROUP BY product, timestamp, referrer").cache();
sqlResult.write().partitionBy("referrer").mode(SaveMode.Append).parquet("hdfs://localhost:9000/lambda/batch1");
spark.close();
}
}
and my pom.xml
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com</groupId>
<artifactId>batchjob</artifactId>
<version>0.0.1-SNAPSHOT</version>
<packaging>jar</packaging>
<name>batchjob</name>
<url>http://maven.apache.org</url>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<dependencies>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.3.1</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>3.1.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.3.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.3.1</version>
<scope>provided</scope>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.6.0</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.3</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<shadedArtifactAttached>true</shadedArtifactAttached>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
<artifactSet>
<includes>
<include>*:*</include>
</includes>
</artifactSet>
<transformers>
<transformer
implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
<resource>reference.conf</resource>
</transformer>
</transformers>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
<resources>
<resource>
<directory>.</directory>
<includes>
<include>src/main/resources/*.*</include>
</includes>
</resource>
</resources>
</build>
</project>
The YARN cluster is started with .\HADOOP_HOME\sbin\start-yarn.cmd and HDFS (namenode/datanodes) with .\HADOOP_HOME\sbin\start-dfs.cmd. Note: I am on Windows 10!
For testing purposes, if I run the application locally everything is fine and I am able to see the result of the code at http://localhost:9870/explorer.html#/.
The problem appears when I let YARN decide how the Java Spark application is managed, i.e. when I change --master from local to yarn; then I get the following error:
2018-08-31 16:32:00 INFO Client:54 - Deleted staging directory file:/C:/Users/razvan.parautiu/.sparkStaging/application_1535721878844_0003
2018-08-31 16:32:00 ERROR SparkContext:91 - Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:89)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:63)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:500)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2493)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:933)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:924)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:924)
at com.batchjob.BatchJob.main(BatchJob.java:33)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I've checked other posts with the same error, but unfortunately their solutions don't work for me...
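For reference, the client-side stack trace above only says that the ApplicationMaster could not be launched; the actual failure reason normally shows up only on the YARN side. Assuming default ports and that log aggregation is enabled, the ResourceManager UI at http://localhost:8088 and the application logs for the failed attempt shown above are the first things to check, e.g.:
yarn logs -applicationId application_1535721878844_0003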
Is there a way to obfuscate exploded/packaged output of jib-maven-plugin with yGuard (or some other obfuscator)?
I can think of a way using other tools such as exec-maven-plugin + jib cli.
Another possible way could be to devise a third-party jib-extension, or even fork/hack jib-maven-plugin altogether.
Maybe someone can share their experience with that.
For context, I am trying to ship a Spring Boot application built with Maven, using AntRun to run yGuard.
I managed to figure it out myself.
Here is my exec-maven-plugin + jib cli solution for anyone out there.
To test it, paste it, adapt it to your environment, and run mvn clean package -P local.
This pom.xml is from a multi-module setup, so you may have to refactor it or omit the <parent></parent> tag to suit your needs.
What it does:
Cleans target directory
Compiles your source files into the target directory
Obfuscates classes in the target directory
Repackages classes, libs, resources in exploded form (for better layer caching) correctly into Spring Boot's non-standard BOOT-INF output directory
Performs docker build, docker push under the hood through jib cli
Performs docker pull at the end (for debug purposes)
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<parent>
<artifactId>redacted</artifactId>
<groupId>com.redacted</groupId>
<version>0.0.1</version>
</parent>
<artifactId>sm-test</artifactId>
<version>0.0.1</version>
<name>test</name>
<description>test</description>
<packaging>jar</packaging>
<properties>
<profile.name>default</profile.name>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
<mainClass>com.redacted.smtest.SmTestApplication</mainClass>
<java.server.user>0:0</java.server.user>
<java.to.image.tag>ghcr.io/redacted/${project.artifactId}:${project.version}-${profile.name}</java.to.image.tag>
<java.from.image.tag>openjdk:11.0.14-jre#sha256:e2e90ec68d3eee5a526603a3160de353a178c80b05926f83d2f77db1d3440826</java.from.image.tag>
<java.from.classpath>../target/${project.name}/${profile.name}/${project.build.finalName}.jar</java.from.classpath>
</properties>
<profiles>
<!-- local -->
<profile>
<id>local</id>
<properties>
<profile.name>local</profile.name>
</properties>
<activation>
<property>
<name>noTest</name>
<value>true</value>
</property>
</activation>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
<configuration>
<skipTests>true</skipTests>
</configuration>
</plugin>
<plugin>
<groupId>org.codehaus.mojo</groupId>
<artifactId>exec-maven-plugin</artifactId>
<executions>
<execution>
<id>jib jar</id>
<phase>package</phase>
<goals>
<goal>exec</goal>
</goals>
<configuration>
<executable>jib</executable>
<workingDirectory>.</workingDirectory>
<skip>false</skip>
<arguments>
<argument>jar</argument>
<argument>--mode=exploded</argument>
<argument>--target=${java.to.image.tag}</argument>
<argument>--from=${java.from.image.tag}</argument>
<argument>--user=${java.server.user}</argument>
<argument>--creation-time=${maven.build.timestamp}</argument>
<argument>--jvm-flags=-Xms32m,-Xmx128m,-Dspring.profiles.active=default</argument>
<argument>${java.from.classpath}</argument>
</arguments>
</configuration>
</execution>
<execution>
<id>docker pull</id>
<phase>install</phase>
<goals>
<goal>exec</goal>
</goals>
<configuration>
<executable>docker</executable>
<workingDirectory>.</workingDirectory>
<skip>false</skip>
<arguments>
<argument>pull</argument>
<argument>${java.to.image.tag}</argument>
</arguments>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</profile>
</profiles>
<dependencies>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-configuration-processor</artifactId>
<optional>true</optional>
</dependency>
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<optional>true</optional>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-test</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>com.yworks</groupId>
<artifactId>yguard</artifactId>
<scope>compile</scope>
</dependency>
</dependencies>
<build>
<finalName>${project.name}</finalName>
<directory>../target/${project.name}/${profile.name}/</directory>
<outputDirectory>../target/${project.name}/${profile.name}/classes</outputDirectory>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-antrun-plugin</artifactId>
<executions>
<execution>
<id>obfuscate</id>
<phase>compile</phase>
<goals>
<goal>run</goal>
</goals>
<configuration>
<skip>false</skip>
<tasks>
<property name="runtime_classpath" refid="maven.runtime.classpath"/>
<!--suppress UnresolvedMavenProperty -->
<taskdef name="yguard" classname="com.yworks.yguard.YGuardTask" classpath="${runtime_classpath}"/>
<mkdir dir="${project.build.directory}/obfuscated"/>
<yguard>
<inoutpair in="${project.build.directory}/classes"
out="${project.build.directory}/obfuscated"/>
<externalclasses>
<!--suppress UnresolvedMavenProperty -->
<pathelement path="${runtime_classpath}"/>
</externalclasses>
<rename mainclass="${mainClass}"
logfile="${project.build.directory}/rename.log.xml"
scramble="true"
replaceClassNameStrings="true">
<property name="error-checking" value="pedantic"/>
<property name="naming-scheme" value="best"/>
<property name="language-conformity" value="compatible"/>
<property name="overload-enabled" value="true"/>
<!-- Generated by sm-test -->
<map>
<class map="d1e6064d$5a15$449b$a632$b2d967a61021" name="com.redacted.smtest.YGuardMappingRunner"/>
</map>
</rename>
</yguard>
<delete dir="${project.build.directory}/classes/com"/>
<copy todir="${project.build.directory}/classes/com/" overwrite="true">
<fileset dir="${project.build.directory}/obfuscated/com/" includes="**"/>
</copy>
<delete dir="${project.build.directory}/obfuscated"/>
</tasks>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-maven-plugin</artifactId>
<configuration>
<mainClass>${mainClass}</mainClass>
<excludes>
<exclude>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
</exclude>
</excludes>
</configuration>
</plugin>
</plugins>
</build>
</project>
Here is the associated jib.yaml file needed for the jib cli to work:
apiVersion: jib/v1alpha1
kind: BuildFile
workingDirectory: "/app"
entrypoint: ["java","-cp","/app/resources:/app/classes:/app/libs/*"]
layers:
entries:
- name: classes
files:
- properties:
filePermissions: 755
src: /classes
dest: /app/classes
I also had to write a small throw-away class that generates a custom mapping of unique class and package names for yGuard (apparently it cannot manage to do this on its own):
package com.redacted.smtest;
import lombok.extern.slf4j.Slf4j;
import org.springframework.boot.CommandLineRunner;
import org.springframework.stereotype.Component;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;
@Slf4j
@Component
public class YGuardMappingRunner implements CommandLineRunner {
@Override
public void run(String... args) throws Exception {
if (args == null || args.length == 0)
return;
generateYGuardMapping(Path.of(args[0]));
}
void generateYGuardMapping(Path path) throws IOException {
var packageSb = new StringBuilder();
var classSb = new StringBuilder();
mapPackages(path, packageSb);
mapClasses(path, classSb);
if (packageSb.length() > 0 && classSb.length() > 0)
log.info(
"\n<!-- Generated by sm-test -->\n<map>\n{}\n{}</map>",
packageSb.toString(),
classSb.toString());
else if (packageSb.length() > 0 && classSb.length() == 0)
log.info("\n<!-- Generated by sm-test -->\n<map>\n{}</map>", packageSb.toString());
else if (packageSb.length() == 0 && classSb.length() > 0)
log.info("\n<!-- Generated by sm-test -->\n<map>\n{}</map>", classSb.toString());
}
private void mapClasses(Path path, StringBuilder classSb) throws IOException {
try (var stream = Files.walk(path, Integer.MAX_VALUE)) {
stream
.distinct()
.filter(o -> o.getNameCount() >= 12)
.filter(Files::isRegularFile)
.map(o -> o.subpath(8, o.getNameCount()))
.map(o -> o.toString().replace("\\", ".").replace(".java", ""))
.filter(o -> !o.contains("Sm"))
.sorted()
.forEach(
o ->
classSb.append(
String.format("%2s<class map=\"%s\" name=\"%s\"/>%n", "", getRandStr(), o)));
}
}
private void mapPackages(Path path, StringBuilder packageSb) throws IOException {
try (var stream = Files.walk(path, Integer.MAX_VALUE)) {
stream
.map(Path::getParent)
.distinct()
.filter(o -> o.getNameCount() >= 12)
.map(o -> o.subpath(8, o.getNameCount()))
.map(o -> o.toString().replace("\\", "."))
.sorted()
.forEach(
o ->
packageSb.append(
String.format(
"%2s<package map=\"%s\" name=\"%s\"/>%n", "", getRandStr(), o)));
}
}
private String getRandStr() {
return UUID.randomUUID().toString().replaceAll("[-]+", "\\$");
}
}
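For reference, the runner above expects the source root to scan as its first program argument and only logs a <map> block, which can then be pasted into the <rename> element of the antrun configuration shown earlier. A hypothetical invocation (the path is an assumption, and the hard-coded getNameCount() >= 12 / subpath(8, ...) filters assume a similar directory depth) might look like:
mvn spring-boot:run -Dspring-boot.run.arguments=/path/to/sm-test/src/main/java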
I'm building a Spark application to load two JSON files, compare them, and print the differences. I also try to validate these files using Amazon's Deequ library, but I'm getting the exception below:
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
20/08/07 11:56:33 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Error: Failed to load com.deeq.CompareDataFrames: com/amazon/deequ/checks/Check
log4j:WARN No appenders could be found for logger (org.apache.spark.util.ShutdownHookManager).
log4j:WARN Please
when I submit the job to spark:
./spark-submit --class com.deeq.CompareDataFrames--master
spark://saif-VirtualBox:7077 ~/Downloads/deeq-trial-1.0-SNAPSHOT.jar
I'm using Ubuntu to host Spark, and everything was working without any issues before I added Deequ to run some validations. I wonder if I'm missing something in the deployment process. This error doesn't seem to be a well-known one on the internet.
Code :
import com.amazon.deequ.VerificationResult;
import com.amazon.deequ.VerificationSuite;
import com.amazon.deequ.checks.Check;
import com.amazon.deequ.checks.CheckLevel;
import com.amazon.deequ.checks.CheckStatus;
import com.amazon.deequ.constraints.Constraint;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import scala.Option;
import scala.Tuple2;
import scala.collection.mutable.ArraySeq;
import scala.collection.mutable.Seq;
public class CompareDataFrames {
public static void main(String[] args) {
SparkSession session = SparkSession.builder().appName("CompareDataFrames").getOrCreate();
session.sparkContext().setLogLevel("ALL");
StructType schema = DataTypes.createStructType(new StructField[]{
DataTypes.createStructField("CUST_ID", DataTypes.StringType, true),
DataTypes.createStructField("RECORD_LOCATOR_ID", DataTypes.StringType, true),
DataTypes.createStructField("EVNT_ID", DataTypes.StringType, true)
});
Dataset<Row> first = session.read().option("multiline", "true").schema(schema).json("/home/saif/Downloads/FILE_DEV1.json");
System.out.println("======= DataSet 1 =======");
first.printSchema();
first.show(false);
Dataset<Row> second = session.read().option("multiline", "true").schema(schema).json("/home/saif/Downloads/FILE_DEV2.json");
System.out.println("======= DataSet 2 =======");
second.printSchema();
second.show(false);
// This will show all the rows which are present in the first dataset
// but not present in the second dataset. But the comparison is at row
// level and not at column level.
System.out.println("======= Expect =======");
first.except(second).show();
StructType one = first.schema();
JavaPairRDD<String, Row> pair1 = first.toJavaRDD().mapToPair((PairFunction<Row, String, Row>)
row -> new Tuple2<>(row.getString(1), row));
JavaPairRDD<String, Row> pair2 = second.toJavaRDD().mapToPair((PairFunction<Row, String, Row>)
row -> new Tuple2<>(row.getString(1), row));
System.out.println("======= Pair1 & Pair2 were created =======");
JavaPairRDD<String, Row> subs = pair1.subtractByKey(pair2);
JavaRDD<Row> rdd = subs.values();
Dataset<Row> diff = session.createDataFrame(rdd, one);
System.out.println("======= Diff Show =======");
diff.show();
Seq<Constraint> cons = new ArraySeq<>(0);
VerificationResult vr = new VerificationSuite().onData(first)
.addCheck(new Check(CheckLevel.Error(), "unit test", cons)
.isComplete("EVNT_ID", Option.empty())
)
.run();
Seq<Check> checkSeq = new ArraySeq<>(0);
if (vr.status() != CheckStatus.Success()) {
Dataset<Row> vrr = vr.checkResultsAsDataFrame(session, vr, checkSeq);
vrr.show(false);
}
}
}
Maven:
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>3.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.12</artifactId>
<version>3.0.0</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.12</artifactId>
<version>3.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-catalyst_2.12</artifactId>
<version>3.0.0</version>
</dependency>
<dependency>
<groupId>com.amazon.deequ</groupId>
<artifactId>deequ</artifactId>
<version>1.0.4</version>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-core</artifactId>
<version>2.13.3</version>
</dependency>
<dependency>
<groupId>org.scala-lang.modules</groupId>
<artifactId>scala-java8-compat_2.13</artifactId>
<version>0.9.1</version>
</dependency>
Please try one of the approaches below to resolve your problem.
Approach 1.
spark-submit with the --jars option:
Download the jar from the following Maven repo (https://mvnrepository.com/artifact/com.amazon.deequ/deequ/1.0.4) to your machine, e.g. to ~/Downloads/deequ-1.0.4.jar
./spark-submit --class com.deeq.CompareDataFrames --master
spark://saif-VirtualBox:7077 --jars ~/Downloads/deequ-1.0.4.jar ~/Downloads/deeq-trial-1.0-SNAPSHOT.jar
Approach 2.
spark-submit with the --packages option:
./spark-submit --class com.deeq.CompareDataFrames --master
spark://saif-VirtualBox:7077 --packages com.amazon.deequ:deequ:1.0.4 ~/Downloads/deeq-trial-1.0-SNAPSHOT.jar
Notes:
The --repositories option is required only if a custom repository has to be referenced.
By default the Maven Central repository is used if the --repositories option is not provided.
When the --packages option is specified, the submit operation first looks for the packages and their dependencies in the ~/.ivy2/cache, ~/.ivy2/jars, and ~/.m2/repository directories.
If they are not found there, they are downloaded from Maven Central using Ivy and stored under the ~/.ivy2 directory.
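For illustration, a submit that names a repository explicitly would look like the sketch below; the URL here is just Maven Central, which is the default anyway, so --repositories is only really needed for a private or custom repository:
./spark-submit --class com.deeq.CompareDataFrames --master spark://saif-VirtualBox:7077 --repositories https://repo1.maven.org/maven2 --packages com.amazon.deequ:deequ:1.0.4 ~/Downloads/deeq-trial-1.0-SNAPSHOT.jar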
Edit 1:
Approach 3:
If approaches 1 and 2 above do not work, use the maven-shade-plugin to build an uber jar and proceed with spark-submit.
Use the pom.xml file below, which builds the uber jar with the maven-shade-plugin: adopt it, rebuild your jar, and deploy it with spark-submit.
spark-submit --class com.deeq.CompareDataFrames --master
spark://saif-VirtualBox:7077 ~/Downloads/deeq-trial-1.0-SNAPSHOT.jar
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.deeq</groupId>
<artifactId>deeq-trial-1.0-SNAPSHOT</artifactId>
<version>1.0</version>
<name>Spark-3.0 Spark Application</name>
<url>https://maven.apache.org</url>
<repositories>
<repository>
<id>codelds</id>
<url>https://code.lds.org/nexus/content/groups/main-repo</url>
</repository>
<repository>
<id>central</id>
<name>Maven Repository Switchboard</name>
<layout>default</layout>
<url>https://repo1.maven.org/maven2</url>
<snapshots>
<enabled>false</enabled>
</snapshots>
</repository>
</repositories>
<properties>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
<encoding>UTF-8</encoding>
<scala.version>2.12.8</scala.version>
<java.version>1.8</java.version>
<CodeCacheSize>512m</CodeCacheSize>
<es.version>2.4.6</es.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>3.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.12</artifactId>
<version>3.0.0</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.12</artifactId>
<version>3.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-catalyst_2.12</artifactId>
<version>3.0.0</version>
</dependency>
<dependency>
<groupId>com.amazon.deequ</groupId>
<artifactId>deequ</artifactId>
<version>1.0.4</version>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-core</artifactId>
<version>2.13.3</version>
</dependency>
<dependency>
<groupId>org.scala-lang.modules</groupId>
<artifactId>scala-java8-compat_2.13</artifactId>
<version>0.9.1</version>
</dependency>
</dependencies>
<build>
<resources>
<resource>
<directory>src/main/resources</directory>
</resource>
</resources>
<plugins>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.2.2</version>
<executions>
<execution>
<id>eclipse-add-source</id>
<goals>
<goal>add-source</goal>
</goals>
</execution>
<execution>
<id>scala-compile-first</id>
<phase>process-resources</phase>
<goals>
<goal>compile</goal>
</goals>
</execution>
<execution>
<id>scala-test-compile-first</id>
<phase>process-test-resources</phase>
<goals>
<goal>testCompile</goal>
</goals>
</execution>
<execution>
<id>attach-scaladocs</id>
<phase>verify</phase>
<goals>
<goal>doc-jar</goal>
</goals>
</execution>
</executions>
<configuration>
<scalaVersion>${scala.version}</scalaVersion>
<recompileMode>incremental</recompileMode>
<useZincServer>true</useZincServer>
<args>
<arg>-unchecked</arg>
<arg>-deprecation</arg>
<arg>-feature</arg>
</args>
<jvmArgs>
<jvmArg>-Xms1024m</jvmArg>
<jvmArg>-Xmx1024m</jvmArg>
<jvmArg>-XX:ReservedCodeCacheSize=${CodeCacheSize}</jvmArg>
</jvmArgs>
<javacArgs>
<javacArg>-source</javacArg>
<javacArg>${java.version}</javacArg>
<javacArg>-target</javacArg>
<javacArg>${java.version}</javacArg>
<javacArg>-Xlint:all,-serial,-path</javacArg>
</javacArgs>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<artifactSet>
<excludes>
<exclude>org.xerial.snappy</exclude>
<exclude>org.scala-lang.modules</exclude>
<exclude>org.scala-lang</exclude>
</excludes>
</artifactSet>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
<relocations>
<relocation>
<pattern>com.google.common</pattern>
<shadedPattern>shaded.com.google.common</shadedPattern>
</relocation>
</relocations>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
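After rebuilding with this pom (mvn clean package), it is worth confirming that the Deequ classes actually ended up inside the shaded jar before re-submitting. Assuming the default artifactId-version naming from the pom above, the check would be roughly:
jar tf target/deeq-trial-1.0-SNAPSHOT-1.0.jar | grep com/amazon/deequ/checks/Check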
I am running a Spark Streaming job to consume from Kafka using the direct approach (for Kafka 0.10.0 or higher). I built the jar via the maven-assembly-plugin configured in my POM and checked its contents using jar tf <jar file> | grep ConsumerRecord. I get the following output:
org/apache/kafka/clients/consumer/ConsumerRecord.class
org/apache/kafka/clients/consumer/ConsumerRecords$ConcatenatedIterable$1.class
org/apache/kafka/clients/consumer/ConsumerRecords$ConcatenatedIterable.class
org/apache/kafka/clients/consumer/ConsumerRecords.class
But when I run the spark-submit job on my cluster (with master set to both local and yarn), I get the following exception:
java.lang.ClassNotFoundException:
org.apache.kafka.clients.consumer.ConsumerRecord
The other option I tried was building a shaded jar using the maven-shade-plugin. Same result there as well.
Please find below my POM file:
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.myCompany</groupId>
<artifactId>spark-streaming-test</artifactId>
<version>1</version>
<dependencies>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.4.5</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.4.5</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>2.4.5</version>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>2.2.1</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.6.0</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
</execution>
</executions>
<configuration>
<finalName>shade-${artifactId}-${version}</finalName>
</configuration>
</plugin>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<archive>
<manifest>
<mainClass>com.myCompany.ReadFromKafka</mainClass>
</manifest>
</archive>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<id>make-assembly</id> <!-- this is used for inheritance merges -->
<phase>package</phase> <!-- bind to the packaging phase -->
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
And here is my Spark Streaming code (taken from https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html):
package com.myCompany;
import java.util.*;
import org.apache.spark.SparkConf;
import org.apache.spark.TaskContext;
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.*;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.StreamingContext;
import org.apache.spark.streaming.api.java.*;
import org.apache.spark.streaming.kafka010.*;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import scala.Tuple2;
public class ReadFromKafka {
public static void main(String args[]) throws InterruptedException {
SparkConf conf = new SparkConf();// .setAppName("Decryption-spark-streaming").setMaster("yarn");
JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(5));
Map<String, Object> kafkaParams = new HashMap<String, Object>();
kafkaParams.put("bootstrap.servers", "server1:9093");
kafkaParams.put("key.deserializer", StringDeserializer.class);
kafkaParams.put("value.deserializer", StringDeserializer.class);
kafkaParams.put("group.id", "my_cg");
kafkaParams.put("auto.offset.reset", "earliest");
kafkaParams.put("enable.auto.commit", false);
kafkaParams.put("security.protocol", "SSL");
kafkaParams.put("ssl.truststore.location", "abc.jks");
kafkaParams.put("ssl.truststore.password", "changeit");
kafkaParams.put("ssl.keystore.location", "abc.jks");
kafkaParams.put("ssl.keystore.password", "changeme");
kafkaParams.put("ssl.key.password", "changeme");
Collection<String> topics = Arrays.asList("myTopic");
JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(jsc,
LocationStrategies.PreferConsistent(),
ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));
stream.mapToPair(record -> new Tuple2<>(record.key(), record.value()));
stream.foreachRDD(rdd -> {
OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
rdd.foreachPartition(consumerRecords -> {
OffsetRange o = offsetRanges[TaskContext.get().partitionId()];
System.out.println(o.topic() + " " + o.partition() + " " + o.fromOffset() + " " + o.untilOffset());
});
});
stream.foreachRDD(rdd -> {
OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
// some time later, after outputs have completed
((CanCommitOffsets) stream.inputDStream()).commitAsync(offsetRanges);
});
// Start the computation
jsc.start();
jsc.awaitTermination();
}}
Adding the dependent jar file (spark-streaming-kafka-0-10_2.11.jar) to the spark-submit command helped resolve this issue:
spark-submit --master yarn --deploy-mode cluster --name spark-streaming-test\
--executor-memory 1g --num-executors 4 --driver-memory 1g --jars\
/home/spark/jars/spark-streaming-kafka-0-10_2.11.jar --class\
com.mycompany.ReadFromKafka spark-streaming-test-1-jar-with-dependencies.jar
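An alternative, assuming the cluster nodes can reach Maven Central (or a mirror), is to let spark-submit resolve the Kafka integration itself via --packages instead of pointing at a local jar:
spark-submit --master yarn --deploy-mode cluster --name spark-streaming-test --executor-memory 1g --num-executors 4 --driver-memory 1g --packages org.apache.spark:spark-streaming-kafka-0-10_2.11:2.4.5 --class com.mycompany.ReadFromKafka spark-streaming-test-1-jar-with-dependencies.jar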
I want to load a Cassandra table into a DataFrame in Spark. I have followed the sample programs below (found in this answer), but I am getting the exception mentioned below.
I have also tried to load the table into an RDD first and then convert it to a DataFrame; loading the RDD is successful, but when I try to convert it to a DataFrame I get the same exception as in the first approach. Any suggestions? I am using Spark 2.0.0, Cassandra 3.7, and Java 8.
public class SparkCassandraDatasetApplication {
public static void main(String[] args) {
SparkSession spark = SparkSession
.builder()
.appName("SparkCassandraDatasetApplication")
.config("spark.sql.warehouse.dir", "/file:C:/temp")
.config("spark.cassandra.connection.host", "127.0.0.1")
.config("spark.cassandra.connection.port", "9042")
.master("local[2]")
.getOrCreate();
//Read data to dataframe
// this is throwing an exception
Dataset<Row> dataset = spark.read().format("org.apache.spark.sql.cassandra")
.options(new HashMap<String, String>() {
{
put("keyspace", "mykeyspace");
put("table", "mytable");
}
}).load();
//Print data
dataset.show();
spark.stop();
}
}
When submitted I am getting this exception:
Exception in thread "main" java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.fs.s3.S3FileSystem not found
at java.util.ServiceLoader.fail(ServiceLoader.java:239)
at java.util.ServiceLoader.access$300(ServiceLoader.java:185)
at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:372)
at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2623)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2634)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2651)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeQualifiedPath(SessionCatalog.scala:115)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:145)
Using the RDD method to read from Cassandra is successful (I have tested it with a count() call), but converting the RDD to a DataFrame throws the same exception as in the first method.
public class SparkCassandraRDDApplication {
public static void main(String[] args) {
SparkSession spark = SparkSession
.builder()
.appName("App")
.config("spark.sql.warehouse.dir", "/file:/opt/spark/temp")
.config("spark.cassandra.connection.host", "127.0.0.1")
.config("spark.cassandra.connection.port", "9042")
.master("local[2]")
.getOrCreate();
SparkContext sc = spark.sparkContext();
//Read
JavaRDD<UserData> resultsRDD = javaFunctions(sc).cassandraTable("mykeyspace", "mytable",CassandraJavaUtil.mapRowTo(UserData.class));
//This is again throwing an exception
Dataset<Row> usersDF = spark.createDataFrame(resultsRDD, UserData.class);
//Print
resultsRDD.foreach(data -> {
System.out.println(data.id);
System.out.println(data.username);
});
sc.stop();
}
}
Please check whether "hadoop-common-2.2.0.jar" is available on the classpath. You can test your application by creating a jar that includes all the dependencies. Use the pom.xml below, in which the maven-shade-plugin is used to bundle all the dependencies into an uber jar.
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.abaghel.examples.spark</groupId>
<artifactId>spark-cassandra</artifactId>
<version>1.0.0-SNAPSHOT</version>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.0.0</version>
</dependency>
<dependency>
<groupId>com.datastax.spark</groupId>
<artifactId>spark-cassandra-connector_2.11</artifactId>
<version>2.0.0-M3</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.1</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.4.3</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
<transformers>
<transformer
implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
<resource>reference.conf</resource>
</transformer>
<transformer
implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<mainClass>com.abaghel.examples.spark.cassandra.SparkCassandraDatasetApplication</mainClass>
</transformer>
</transformers>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
You can run the jar as shown below:
spark-submit --class com.abaghel.examples.spark.cassandra.SparkCassandraDatasetApplication spark-cassandra-1.0.0-SNAPSHOT.jar
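If the provider error persists after shading, a quick sanity check is to confirm that the Hadoop filesystem classes and their service entries were actually bundled into the jar, for example:
jar tf spark-cassandra-1.0.0-SNAPSHOT.jar | grep FileSystem
which should list both org/apache/hadoop/fs/FileSystem.class and the META-INF/services/org.apache.hadoop.fs.FileSystem service file.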
I am creating a simple hello-world Hadoop project. I really do not know what to include to get around this error. It seems the Hadoop libraries need some resource I am not including.
I have tried adding the following argument to the run configuration, but it does not help:
-Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl
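For what it's worth, that -D flag is just the standard JAXP factory override; a minimal sketch of the programmatic equivalent, which would have to run before the Configuration object is created, looks like this:
// Same effect as -Djavax.xml.parsers.DocumentBuilderFactory=... on the command line;
// must be set before new Configuration() parses the Hadoop XML config files.
System.setProperty("javax.xml.parsers.DocumentBuilderFactory",
        "com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl");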
Here is my code:
/**
* Writes a static string to a file using the Hadoop Libraries
*/
public class WriteToFile {
public static void main(String[] args) {
//String to print to file
final String HELLOWORLD = "Hello World! This is Chris writing to the file.";
try {
//Instantiating the configuration
Configuration conf = new Configuration();
//Creating the file system
FileSystem fs = FileSystem.get(conf);
//Instantiating the path
Path path = new Path("/user/c4511/homework1.txt");
//Checking for the existence of the file
if(fs.exists(path)){
//delete if it already exists
fs.delete(path, true);
}
//Creating an output stream
FSDataOutputStream fsdos = fs.create(path);
//Writing helloworld static string to the file
fsdos.writeUTF(HELLOWORLD);
//Closing all connection
fsdos.close();
fs.close();
}
catch (IOException e) {
e.printStackTrace();
}
}
}
What is causing this issue?
And here is the error I am getting
Nov 17, 2014 9:30:30 AM org.apache.hadoop.conf.Configuration loadResource
SEVERE: error parsing conf file: javax.xml.parsers.ParserConfigurationException: Feature 'http://apache.org/xml/features/xinclude' is not recognized.
Exception in thread "main" java.lang.RuntimeException: javax.xml.parsers.ParserConfigurationException: Feature 'http://apache.org/xml/features/xinclude' is not recognized.
at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1833)
at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:1689)
at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:1635)
at org.apache.hadoop.conf.Configuration.get(Configuration.java:790)
at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:166)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:158)
at WriteToFile.main(WriteToFile.java:24)
Caused by: javax.xml.parsers.ParserConfigurationException: Feature 'http://apache.org/xml/features/xinclude' is not recognized.
at org.apache.xerces.jaxp.DocumentBuilderFactoryImpl.newDocumentBuilder(Unknown Source)
at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1720)
... 6 more
I had the same exception in my project when I moved it from Hadoop 2.5.1 to 2.6.0. I had to solve it in the Maven POM file, by including xerces:* in the shaded jar:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>emc.lab.hadoop</groupId>
<artifactId>DartAnalytics</artifactId>
<version>0.0.1-SNAPSHOT</version>
<name>DartAnalytics</name>
<description>Examples for usage of Dart simulated data</description>
<properties>
<main.class>OffsetRTMain</main.class>
<hadoop.version>2.6.0</hadoop.version>
<minimize.jar>true</minimize.jar>
</properties>
<!-- <repositories> <repository> <id>mvn.twitter</id> <url>http://maven.twttr.com</url>
</repository> </repositories> -->
<build>
<plugins>
<plugin>
<!-- The shade plugin allows us to compile the dependencies into the
jar file -->
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.3</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
</execution>
</executions>
<configuration>
<!-- minimize the jar removes all files that are not addressed in the
file. but the filters include stuff we must include -->
<minimizeJar>${minimize.jar}</minimizeJar>
<filters>
<filter>
<artifact>com.hadoop.gplcompression:hadoop-lzo</artifact>
<includes>
<include>**</include>
</includes>
</filter>
<filter>
<!-- This solves the hadoop 2.6.0 problem with ClassNotFound of "org.apache.xerces.jaxp.DocumentBuilderFactoryImpl" -->
<artifact>xerces:*</artifact>
<includes>
<include>**</include>
</includes>
</filter>
<filter>
<artifact>org.apache.hadoop:*</artifact>
<excludes>
<exclude>**</exclude>
</excludes>
</filter>
</filters>
<finalName>uber-${project.artifactId}-${project.version}</finalName>
<transformers>
<transformer
implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<mainClass>${main.class}</mainClass>
</transformer>
</transformers>
</configuration>
</plugin>
</plugins>
</build>
<dependencies>
<!-- you can add this to the local repo by running mvn install:install-file
-Dfile=libs/hadoop-lzo-0.4.20-SNAPSHOT.jar -DgroupId=com.hadoop.gplcompression
-DartifactId=hadoop-lzo -Dversion=0.4.20 -Dpackaging=jar from the main project
directory -->
<!-- Another option is to build from outside the EMC network and get access
to the twitter maven repository by changing the version to a version in the
repository and un-commenting the repository addition -->
<dependency>
<groupId>com.hadoop.gplcompression</groupId>
<artifactId>hadoop-lzo</artifactId>
<version>0.4.20</version>
</dependency>
<dependency>
<groupId>net.sf.trove4j</groupId>
<artifactId>trove4j</artifactId>
<version>3.0.3</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-client-core</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>com.google.protobuf</groupId>
<artifactId>protobuf-java</artifactId>
<version>2.5.0</version>
</dependency>
<dependency>
<groupId>com.twitter.elephantbird</groupId>
<artifactId>elephant-bird-core</artifactId>
<version>4.5</version>
</dependency>
<!-- <dependency> <groupId>com.google.guava</groupId> <artifactId>guava</artifactId>
<version>18.0</version> </dependency> -->
</dependencies>