SparkStreaming: class cast exception for SerializedOffset - java

I'm writing a custom Spark structured streaming source (using v2 interfaces and Spark 2.3.0) in Java/Scala.
When testing the integration with the Spark offsets/checkpoint, I get the following error:
18/06/20 11:58:49 ERROR MicroBatchExecution: Query [id = 58ec2604-3b04-4912-9ba8-c757d930ac05, runId = 5458caee-6ef7-4864-9968-9cb843075458] terminated with error
java.lang.ClassCastException: org.apache.spark.sql.execution.streaming.SerializedOffset cannot be cast to org.apache.spark.sql.sources.v2.reader.streaming.Offset
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1$$anonfun$apply$9.apply(MicroBatchExecution.scala:405)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1$$anonfun$apply$9.apply(MicroBatchExecution.scala:390)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at org.apache.spark.sql.execution.streaming.StreamProgress.flatMap(StreamProgress.scala:25)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:390)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:390)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:389)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:121)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:117)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
This is my Offset implementation (a simplified version; I removed the JSON (de)serialization):
package mypackage

import org.apache.spark.sql.execution.streaming.SerializedOffset
import org.apache.spark.sql.sources.v2.reader.streaming.Offset

case class MyOffset(offset: Long) extends Offset {
  override val json = "{\"offset\":" + offset + "}"
}

private object MyOffset {
  def apply(offset: SerializedOffset): MyOffset = new MyOffset(0L)
}
Do you have any advice about how to solve this problem?

Check that the Spark version of your client application is exactly the same as the Spark version of your cluster. I used Spark 2.4.0 dependencies in my Spark job application, but the cluster had Spark 2.3.0. When I downgraded the dependencies to 2.3.0, the error went away.
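For reference, here is a minimal build.sbt sketch of that fix, assuming an sbt build and the usual Spark modules (adjust the artifact list to whatever your source actually uses); the point is that the pinned version matches the cluster, and the Spark artifacts are marked provided so they are not bundled over the cluster's own jars:
// Sketch only: pin the Spark dependencies to the cluster's version
// (2.3.0 is the cluster version from this question; use whatever your cluster reports).
val sparkVersion = "2.3.0"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql"  % sparkVersion % "provided"
)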

Related

Error in SQL statement: NoClassDefFoundError: com/macasaet/fernet/Validator

I am currently working on converting the code below into a JAR to register a permanent UDF on a Databricks cluster. I am facing a NoClassDefFoundError, even though I added the required library dependencies while building the JAR with SBT. Source code: https://databricks.com/notebooks/enforcing-column-level-encryption.html
I used the following in build.sbt:
scalaVersion := "2.13.4"
libraryDependencies += "org.apache.hive" % "hive-exec" % "0.13.1"
libraryDependencies += "com.macasaet.fernet" % "fernet-java8" % "1.5.0"
Please guide me to the right libraries if anything above is wrong. Kindly help me with this:
import com.macasaet.fernet.{Key, StringValidator, Token}
import org.apache.hadoop.hive.ql.exec.UDF

import java.time.{Duration, Instant}

class Validator extends StringValidator {
  override def getTimeToLive(): java.time.temporal.TemporalAmount = {
    Duration.ofSeconds(Instant.MAX.getEpochSecond())
  }
}

class udfDecrypt extends UDF {
  def evaluate(inputVal: String, sparkKey: String): String = {
    if (inputVal != null && inputVal != "") {
      val keys: Key = new Key(sparkKey)
      val token = Token.fromString(inputVal)
      val validator = new Validator() {}
      val payload = token.validateAndDecrypt(keys, validator)
      payload
    } else inputVal
  }
}
Make sure the fernet-java library is installed in your cluster.
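One way to avoid depending on a separate cluster-side install is to bundle fernet-java8 into the UDF JAR itself. A rough sbt-assembly sketch, assuming the build.sbt from the question (the plugin version and the "provided" scoping are illustrative assumptions, not a tested recipe):
// project/plugins.sbt: add the sbt-assembly plugin (version is an example)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")

// build.sbt: keep fernet-java8 as a normal dependency so it lands inside the fat JAR,
// and mark hive-exec as "provided" since Hive classes already ship with the Databricks runtime
libraryDependencies += "com.macasaet.fernet" % "fernet-java8" % "1.5.0"
libraryDependencies += "org.apache.hive" % "hive-exec" % "0.13.1" % "provided"
Then build with sbt assembly and upload the resulting *-assembly-*.jar through the cluster's Libraries UI.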
This topic is related to:
Databricks SCALA UDF cannot load class when registering function
I tried installing the JAR on the cluster via the Libraries config instead of dropping it directly into DBFS as the user guide describes; then I hit the "validator not found" issue and that question routed me here.
I added the Maven repo to the Libraries config, but then the cluster failed to install it, with the error:
Library resolution failed because unresolved dependency: com.macasaet.fernet:fernet-java8:1.5.0: not found
(screenshot: Databricks cluster Libraries page)
Have you experienced this?

String Index Out Of Bounds Exception When Initializing Spark Context

I've been working with Spark for more than 5 years. Recently, I encountered a basic error I have never seen before, and it has stopped development cold. When I do a routine call to create a Spark Context, I get an ExceptionInInitializerError caused by a StringIndexOutOfBoundsException. Here is a simple sample of my code:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

import java.util.Arrays;
import java.util.List;

public class SparkTest {
    public static final SparkConf SPARK_CONFIGURATION = new SparkConf().setAppName("MOSDEX").setMaster("local[*]");
    public static final JavaSparkContext SPARK_CONTEXT = new JavaSparkContext(SPARK_CONFIGURATION);
    public static final SparkSession SPARK_SESSION = SparkSession.builder()
            .config(SPARK_CONFIGURATION)
            .getOrCreate();

    public static void main(String[] args) {
        setupTest();
    }

    public static void setupTest() {
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
        JavaRDD<Integer> distData = SPARK_CONTEXT.parallelize(data);
        int sum = distData.reduce((a, b) -> a + b);
        System.out.println("Sum of " + data.toString() + " = " + sum);
        System.out.println();
    }//setupTest

    public SparkTest() {
        super();
    }
}//class SparkTest
Here is the error message chain:
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/C:/Users/Owner/.m2/repository/org/apache/spark/spark-unsafe_2.11/2.4.5/spark-unsafe_2.11-2.4.5.jar) to method java.nio.Bits.unaligned()
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/04/05 13:55:21 INFO SparkContext: Running Spark version 2.4.5
20/04/05 13:55:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:116)
at org.apache.hadoop.security.Groups.<init>(Groups.java:93)
at org.apache.hadoop.security.Groups.<init>(Groups.java:73)
at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:293)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:283)
at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:260)
at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:789)
at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:774)
at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:647)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2422)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2422)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2422)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:293)
at io.github.JeremyBloom.mosdex.SparkTest.<clinit>(SparkTest.java:28)
Caused by: java.lang.StringIndexOutOfBoundsException: begin 0, end 3, length 2
at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3720)
at java.base/java.lang.String.substring(String.java:1909)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:50)
... 16 more
I also get the same error when I use SparkContext instead of JavaSparkContext. I've done an extensive search for this error and have not seen anyone else who has it, so I don't think it's a bug in Spark. I've used this code in other applications previously (with earlier versions of Spark) without a problem.
I'm using the latest version of Spark (2.4.5). Why isn't this working?
I am using Spark 2.4.5 and JDK 1.8.0_181, and it works fine for me:
package examples;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

import java.util.Arrays;
import java.util.List;

public class SparkTest {
    public static final SparkConf SPARK_CONFIGURATION = new SparkConf().setAppName("MOSDEX").setMaster("local[*]");
    public static final JavaSparkContext SPARK_CONTEXT = new JavaSparkContext(SPARK_CONFIGURATION);
    public static final SparkSession SPARK_SESSION = SparkSession.builder()
            .config(SPARK_CONFIGURATION)
            .getOrCreate();

    public static void main(String[] args) {
        setupTest();
    }

    public static void setupTest() {
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
        JavaRDD<Integer> distData = SPARK_CONTEXT.parallelize(data);
        int sum = distData.reduce((a, b) -> a + b);
        System.out.println("Sum of " + data.toString() + " = " + sum);
        System.out.println();
    }//setupTest

    public SparkTest() {
        super();
    }
}//class SparkTest
Result:
[2020-04-05 18:14:42,184] INFO Running Spark version 2.4.5 (org.apache.spark.SparkContext:54)
...
[2020-04-05 18:14:44,060] WARN Using an existing SparkContext; some configuration may not take effect. (org.apache.spark.SparkContext:66)
Sum of [1, 2, 3, 4, 5] = 15
AFAIK you are facing an issue with your Java version, as described in HADOOP-14586: StringIndexOutOfBoundsException breaks org.apache.hadoop.util.Shell on 2.7.x with Java 9.
Change to a Java version that is compatible with your Hadoop version.
See here:
Latest Release (Spark 2.4.5) - Apache Spark docs
Spark runs on Java 8, Python 2.7+/3.4+ and R 3.1+. For the Scala API, Spark 2.4.5 uses Scala 2.12. You will need to use a compatible Scala version (2.12.x).
NOTE: As per the comments, Java 13 is not supported by Spark. You need to downgrade to Java 8.
It turns out that if you use Hadoop > 2.8, you can use Java 13 (I'm now using Hadoop 2.8.5). This is tricky if you are using Spark 2.4.5 because, on Maven, it comes prebuilt with Hadoop 2.6. You have to declare a separate dependency on Hadoop 2.8.5 that overrides the prebuilt components. It took me quite a bit of experimenting to make that work. Plus, I'm working on Windows, so I also needed to link Hadoop with winutils, which is another complication. None of this is very well documented, so I had to read a lot of posts on Stack Overflow to get it working.
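The asker did this with Maven; as a rough sketch of the same idea in sbt (the exclusion and module names here are assumptions, not a tested recipe), you exclude the Hadoop client that spark-core pulls in transitively and declare the newer one explicitly:
// Sketch: force Hadoop 2.8.5 under Spark 2.4.5, which is prebuilt against Hadoop 2.6
libraryDependencies ++= Seq(
  ("org.apache.spark" %% "spark-core" % "2.4.5")
    .exclude("org.apache.hadoop", "hadoop-client"),
  "org.apache.hadoop" % "hadoop-client" % "2.8.5"
)
// Alternatively, dependencyOverrides can bump the transitive version in place:
dependencyOverrides += "org.apache.hadoop" % "hadoop-client" % "2.8.5"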
I had the same issue with the following versions:
1) Spark 2.4.4 (Scala 2.12)
2) Hadoop 2.6.5
3) JDK 16, in Spring STS
Solution: in Spring STS I changed the JDK version to JDK 1.8 and the issue was resolved.

ZetaSQL Sample Using Apache beam

I am facing issues while using ZetaSQL in the Apache Beam framework (2.17.0-SNAPSHOT). After going through the Apache Beam documentation, I was not able to find any sample for ZetaSQL.
I tried to add the planner:
options.setPlannerName("org.apache.beam.sdk.extensions.sql.zetasql.ZetaSQLQueryPlanner");
But I am still facing an issue. The snippet is added below for reference:
String sql =
    "SELECT CAST (1243 as INT64), "
        + "CAST ('2018-09-15 12:59:59.000000+00' as TIMESTAMP), "
        + "CAST ('string' as STRING);";
ZetaSQLQueryPlanner zetaSQLQueryPlanner = new ZetaSQLQueryPlanner();
BeamRelNode beamRelNode = zetaSQLQueryPlanner.convertToBeamRel(sql);
PCollection<Row> stream = BeamSqlRelUtils.toPCollection(p, beamRelNode);
p.run();
I understand we need the snippet below, but I failed to create the config:
Frameworks.newConfigBuilder()
While running the code I got the following exception:
Exception in thread "main" java.util.ServiceConfigurationError: com.google.zetasql.ClientChannelProvider: Provider com.google.zetasql.JniChannelProvider could not be instantiated
at java.util.ServiceLoader.fail(Unknown Source)
at java.util.ServiceLoader.access$100(Unknown Source)
at java.util.ServiceLoader$LazyIterator.nextService(Unknown Source)
Update: as of 06/23/2020, Beam ZetaSQL is supported on macOS as well (not all versions, but at least the most recent ones)!
====
I think it is related to your OS. Beam is a unified framework, but your exception appears to come from one of its dependencies: the ZetaSQL parser. If you switch to a newer version of Linux, I think your code snippet should work.

Method showString([class java.lang.Integer, class java.lang.Integer, class java.lang.Boolean]) does not exist in PySpark

This is the snippet:
from pyspark import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext()
spark = SparkSession(sc)
d = spark.read.format("csv").option("header", True).option("inferSchema", True).load('file.csv')
d.show()
After this, it runs into the error:
An error occurred while calling o163.showString. Trace:
py4j.Py4JException: Method showString([class java.lang.Integer, class java.lang.Integer, class java.lang.Boolean]) does not exist
All the other methods work well. I tried researching a lot, but in vain. Any lead will be highly appreciated.
This is an indicator of a Spark version mismatch. Before Spark 2.3, the show method took only two arguments:
def show(self, n=20, truncate=True):
since 2.3 it takes three arguments:
def show(self, n=20, truncate=True, vertical=False):
In your case the Python client seems to invoke the latter, while the JVM backend uses the older version.
Since SparkContext initialization underwent significant changes in 2.4, which would cause a failure in SparkContext.__init__, you're likely using:
a 2.3.x Python library, and
2.2.x JARs.
You can confirm that by checking versions directly from your session, Python:
sc.version
vs. JVM:
sc._jsc.version()
Problems like this are usually a result of a misconfigured PYTHONPATH (either directly, or by using pip-installed PySpark on top of pre-existing Spark binaries) or SPARK_HOME.
In the spark-shell console, enter the variable name to see its data type.
As an alternative, you can press Tab twice after the variable name and a dot, and it will show the functions that can be applied.
Example for a DataFrame object:
res23: org.apache.spark.sql.DataFrame = [order_id: string, book_name: string ... 1 more field]

Error using Spark's Kryo serializer with java protocol buffers that have arrays of strings

I am hitting a bug when using Java protocol buffer classes as the object model for RDDs in Spark jobs.
For my application, my .proto file has properties that are repeated string. For example:
message OntologyHumanName
{
repeated string family = 1;
}
From this, the 2.5.0 protoc compiler generates Java code like
private com.google.protobuf.LazyStringList family_ = com.google.protobuf.LazyStringArrayList.EMPTY;
If I run a Scala Spark job that uses the Kryo serializer, I get the following error:
Caused by: java.lang.NullPointerException
at com.google.protobuf.UnmodifiableLazyStringList.size(UnmodifiableLazyStringList.java:61)
at java.util.AbstractList.add(AbstractList.java:108)
at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:134)
at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:40)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:708)
at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
... 40 more
The same code works fine with spark.serializer=org.apache.spark.serializer.JavaSerializer.
My environment is CDH QuickStart 5.5 with JDK 1.8.0_60
Try to register the Lazy class with:
Kryo kryo = new Kryo();
kryo.register(com.google.protobuf.LazyStringArrayList.class);
Also, for custom Protobuf messages, take a look at the solution in this answer for registering custom/nested classes generated by protoc.
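To wire that suggestion into a Spark job rather than a bare Kryo instance, a KryoRegistrator is the usual hook. A minimal sketch (the registrator class name is made up, and registering these classes may or may not be sufficient for the protobuf list types):
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical registrator that registers the protobuf-internal string-list classes
class ProtobufRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[com.google.protobuf.LazyStringArrayList])
    kryo.register(classOf[com.google.protobuf.UnmodifiableLazyStringList])
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", classOf[ProtobufRegistrator].getName)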
I think your RDD's element type contains the class OntologyHumanName, e.g. RDD[(String, OntologyHumanName)], and an RDD of this type happens to go through a shuffle stage. See this: https://github.com/EsotericSoftware/kryo#kryoserializable. Kryo can't serialize an abstract class.
Read the spark doc: http://spark.apache.org/docs/latest/tuning.html#data-serialization
val conf = new SparkConf().setMaster(...).setAppName(...)
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
val sc = new SparkContext(conf)
From the Kryo doc:
public class SomeClass implements KryoSerializable {
    // ...
    public void write (Kryo kryo, Output output) {
        // ...
    }
    public void read (Kryo kryo, Input input) {
        // ...
    }
}
But the class OntologyHumanName is generated automatically by protobuf, so I don't think this is a good way to go.
Try using a case class that wraps OntologyHumanName to avoid serializing OntologyHumanName directly. I didn't try this approach; it possibly doesn't work.
case class OntologyHumanNameScalaCaseClass(val humanNames: OntologyHumanName)
An ugly way: I just converted the protobuf class to Scala types. This way can't fail. Like:
import scala.collection.JavaConverters._
val humanNameObj: OntologyHumanName = ...
val families: List[String] = humanNameObj.getFamilyList.asScala.toList // use this instead of humanNameObj
Hope this resolves your problem.
