I've been working with Spark for more than 5 years. Recently, I encountered a basic error I have never seen before, and it has stopped development cold. When I make a routine call to create a SparkContext, I get an ExceptionInInitializerError caused by a StringIndexOutOfBoundsException. Here is a simple sample of my code:
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

public class SparkTest {
    public static final SparkConf SPARK_CONFIGURATION = new SparkConf().setAppName("MOSDEX").setMaster("local[*]");
    public static final JavaSparkContext SPARK_CONTEXT = new JavaSparkContext(SPARK_CONFIGURATION);
    public static final SparkSession SPARK_SESSION = SparkSession.builder()
        .config(SPARK_CONFIGURATION)
        .getOrCreate();

    public static void main(String[] args) {
        setupTest();
    }

    public static void setupTest() {
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
        JavaRDD<Integer> distData = SPARK_CONTEXT.parallelize(data);
        int sum = distData.reduce((a, b) -> a + b);
        System.out.println("Sum of " + data.toString() + " = " + sum);
        System.out.println();
    }//setupTest

    public SparkTest() {
        super();
    }
}//class SparkTest
Here is the error message chain:
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/C:/Users/Owner/.m2/repository/org/apache/spark/spark-unsafe_2.11/2.4.5/spark-unsafe_2.11-2.4.5.jar) to method java.nio.Bits.unaligned()
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/04/05 13:55:21 INFO SparkContext: Running Spark version 2.4.5
20/04/05 13:55:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:116)
at org.apache.hadoop.security.Groups.<init>(Groups.java:93)
at org.apache.hadoop.security.Groups.<init>(Groups.java:73)
at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:293)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:283)
at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:260)
at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:789)
at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:774)
at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:647)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2422)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2422)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2422)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:293)
at io.github.JeremyBloom.mosdex.SparkTest.<clinit>(SparkTest.java:28)
Caused by: java.lang.StringIndexOutOfBoundsException: begin 0, end 3, length 2
at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3720)
at java.base/java.lang.String.substring(String.java:1909)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:50)
... 16 more
I also get the same error when I use SparkContext instead of JavaSparkContext. I've searched extensively for this error and haven't seen anyone else report it, so I don't think it's a bug in Spark. I've used this code in other applications previously (with earlier versions of Spark) without a problem.
I'm using the latest version of Spark (2.4.5). Why isn't this working?
I am using Spark 2.4.5 with JDK 1.8.0_181 and it's working fine for me:
package examples;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

import java.util.Arrays;
import java.util.List;

public class SparkTest {
    public static final SparkConf SPARK_CONFIGURATION = new SparkConf().setAppName("MOSDEX").setMaster("local[*]");
    public static final JavaSparkContext SPARK_CONTEXT = new JavaSparkContext(SPARK_CONFIGURATION);
    public static final SparkSession SPARK_SESSION = SparkSession.builder()
        .config(SPARK_CONFIGURATION)
        .getOrCreate();

    public static void main(String[] args) {
        setupTest();
    }

    public static void setupTest() {
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
        JavaRDD<Integer> distData = SPARK_CONTEXT.parallelize(data);
        int sum = distData.reduce((a, b) -> a + b);
        System.out.println("Sum of " + data.toString() + " = " + sum);
        System.out.println();
    }//setupTest

    public SparkTest() {
        super();
    }
}//class SparkTest
Result:
[2020-04-05 18:14:42,184] INFO Running Spark version 2.4.5 (org.apache.spark.SparkContext:54)
...
[2020-04-05 18:14:44,060] WARN Using an existing SparkContext; some configuration may not take effect. (org.apache.spark.SparkContext:66)
Sum of [1, 2, 3, 4, 5] = 15
AFAIK you are facing an issue with your Java version, as mentioned in HADOOP-14586: StringIndexOutOfBoundsException breaks org.apache.hadoop.util.Shell on 2.7.x with Java 9.
Change to a Java version that is compatible with your Hadoop version.
See the Apache Spark docs for the latest release (Spark 2.4.5):
Spark runs on Java 8, Python 2.7+/3.4+ and R 3.1+. For the Scala API, Spark 2.4.5 uses Scala 2.12. You will need to use a compatible Scala version (2.12.x).
NOTE: As per the comments, Java 13 is not supported by Spark. You need to downgrade to Java 8.
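If it helps, here is a minimal sketch (plain Java; the class name is just for illustration) to print which JVM the driver actually runs on, so you can confirm you really are on Java 8 before the SparkContext is created:

public class VersionCheck {
    public static void main(String[] args) {
        // Print the runtime JVM version and home directory. If this does not report
        // a 1.8.x version, the driver is running on a JVM that this Hadoop version
        // does not support (see HADOOP-14586 above).
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("java.home    = " + System.getProperty("java.home"));
    }
}

Running this with the same launch configuration as the failing application (IDE run configuration or Maven settings) shows which JDK is actually being picked up.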
It turns out that if you use Hadoop > 2.8, you can use Java 13 (I'm now using Hadoop 2.8.5). This is tricky if you are using Spark 2.4.5 because, on Maven, it comes prebuilt with Hadoop 2.6. You have to declare a separate dependency on Hadoop 2.8.5 that overrides the prebuilt components. It took me quite a bit of experimenting to make that work. Plus, I'm working on Windows, so I also needed to link Hadoop with winutils, which is another complication. None of this is very well documented, so I had to read a lot of posts on Stack Overflow to get it working.
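For anyone trying the same override, here is a minimal sketch (class name purely illustrative; org.apache.hadoop.util.VersionInfo ships with the Hadoop client jars) to confirm which Hadoop version actually ends up on the runtime classpath:

import org.apache.hadoop.util.VersionInfo;

public class HadoopVersionCheck {
    public static void main(String[] args) {
        // Prints the Hadoop version found on the classpath, e.g. "2.8.5".
        // If this still reports 2.6.x, the prebuilt Hadoop pulled in by Spark
        // is shadowing the explicit Hadoop 2.8.5 dependency.
        System.out.println("Hadoop version: " + VersionInfo.getVersion());
    }
}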
Had the same issue with the versions below:
1) Spark 2.4.4 (Scala 2.12)
2) Hadoop 2.6.5
3) JDK 16 in Spring STS
Solution: In Spring STS, I changed the JDK version to JDK 1.8 and the issue was resolved.
Related
This is the snippet:
from pyspark import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext()
spark = SparkSession(sc)
d = spark.read.format("csv").option("header", True).option("inferSchema", True).load('file.csv')
d.show()
After this, it runs into the error:
An error occurred while calling o163.showString. Trace:
py4j.Py4JException: Method showString([class java.lang.Integer, class java.lang.Integer, class java.lang.Boolean]) does not exist
All the other methods work well. I've tried researching a lot, but in vain. Any lead will be highly appreciated.
This is an indicator of a Spark version mismatch. Before Spark 2.3, the show method took only two arguments:
def show(self, n=20, truncate=True):
Since 2.3, it takes three arguments:
def show(self, n=20, truncate=True, vertical=False):
In your case, the Python client seems to invoke the latter, while the JVM backend uses the older version.
Since SparkContext initialization underwent significant changes in 2.4 (which would cause a failure in SparkContext.__init__), you're likely using:
a 2.3.x Python library.
2.2.x JARs.
You can confirm that by checking the versions directly from your session. Python:
sc.version
vs. JVM:
sc._jsc.version()
Problems like this are usually a result of a misconfigured PYTHONPATH (either directly, or by using pip-installed PySpark on top of pre-existing Spark binaries) or SPARK_HOME.
On the spark-shell console, enter the variable name to see its data type.
As an alternative, you can press Tab twice after the variable name and a dot, and it will show the functions that can be applied.
Example of a DataFrame object:
res23: org.apache.spark.sql.DataFrame = [order_id: string, book_name: string ... 1 more field]
So I am new to Spark. My versions are: Spark 2.1.2, Scala 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131). I am using IntelliJ IDEA 2018 Community on Windows 10 (x64). Whenever I try to run a simple word count example, I get the following error:
18/10/22 01:43:14 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: System memory 259522560 must be at least 471859200. Please increase heap size using the --driver-memory option or spark.driver.memory in Spark configuration.
    at org.apache.spark.memory.UnifiedMemoryManager$.getMaxMemory(UnifiedMemoryManager.scala:216)
    at org.apache.spark.memory.UnifiedMemoryManager$.apply(UnifiedMemoryManager.scala:198)
    at org.apache.spark.SparkEnv$.create(SparkEnv.scala:330)
    at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:174)
    at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:257)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:432)
    at WordCount$.main(WordCount.scala:5)
    at WordCount.main(WordCount.scala)
PS: this is the code of the word counter that I use as the example:
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("mySpark").setMaster("local")
    val sc = new SparkContext(conf)
    val rdd = sc.textFile(args(0))
    val wordcount = rdd.flatMap(_.split("\t")).map((_, 1))
      .reduceByKey(_ + _)
    for (arg <- wordcount.collect())
      print(arg + " ")
    println()
    // wordcount.saveAsTextFile(args(1))
    // wordcount.saveAsTextFile("myFile")
    sc.stop()
  }
}
So my question is how to get rid of this error. I have searched for a solution and tried installing different versions of Spark, the JDK, and Hadoop, but it didn't help. I don't know where the problem may be.
If you are in IntelliJ you may struggle a lot. What I did, and it worked, was to initialize the SparkContext before the SparkSession with spark.testing.memory set:
val conf: SparkConf = new SparkConf().setAppName("name").setMaster("local")
  .set("spark.testing.memory", "2147480000")
val sc: SparkContext = new SparkContext(conf)
There may be a better solution, because here I don't actually need to initialize the SparkContext explicitly; it is done implicitly by initializing the SparkSession.
Go to Settings -> Run/Debug Configurations and for VM options put:
-Xms128m -Xmx512m -XX:MaxPermSize=300m -ea
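For reference, the threshold in the error message is 471859200 bytes = 450 * 1024 * 1024, i.e. 450 MB, while the reported system memory of 259522560 bytes is only 247.5 MB. Any driver heap comfortably above 450 MB (for example -Xmx512m, as above) therefore satisfies the check.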
I am getting started with Spark and ran into an issue trying to implement the simple example for the map function. The issue is with the definition of 'parallelize' in the new version of Spark. Can someone share an example of how to use it, since the following way gives an error for insufficient arguments?
Spark Version : 2.3.2
Java : 1.8
SparkSession session = SparkSession.builder().appName("Compute Square of Numbers").config("spark.master","local").getOrCreate();
SparkContext context = session.sparkContext();
List<Integer> seqNumList = IntStream.rangeClosed(10, 20).boxed().collect(Collectors.toList());
JavaRDD<Integer> numRDD = context.parallelize(seqNumList, 2);
Compile-time error message: The method expects 3 arguments
I do not get what the 3rd argument should look like. As per the documentation, it's supposed to be
scala.reflect.ClassTag<T>
But how do I even define or use it?
Please do not suggest using JavaSparkContext, as I want to know how to get this approach to work using the generic SparkContext.
Ref : https://spark.apache.org/docs/2.2.1/api/java/org/apache/spark/SparkContext.html#parallelize-scala.collection.Seq-int-scala.reflect.ClassTag-
Here is the code that finally worked for me. It is not the best way to achieve the result, but it was a way for me to explore the API:
SparkSession session = SparkSession.builder().appName("Compute Square of Numbers")
.config("spark.master", "local").getOrCreate();
SparkContext context = session.sparkContext();
List<Integer> seqNumList = IntStream.rangeClosed(10, 20).boxed().collect(Collectors.toList());
RDD<Integer> numRDD = context
.parallelize(JavaConverters.asScalaIteratorConverter(seqNumList.iterator()).asScala()
.toSeq(), 2, scala.reflect.ClassTag$.MODULE$.apply(Integer.class));
numRDD.toJavaRDD().foreach(x -> System.out.println(x));
session.stop();
If you don't want to deal with providing the extra two parameters when using SparkContext, you can also use JavaSparkContext.parallelize(), which only needs an input list:
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.rdd.RDD;
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
final RDD<Integer> rdd = jsc.parallelize(seqNumList).map(num -> {
    // your implementation; it must return an Integer to match RDD<Integer>
    return num;
}).rdd();
I'm writing a custom Spark structured streaming source (using v2 interfaces and Spark 2.3.0) in Java/Scala.
When testing the integration with the Spark offsets/checkpoint, I get the following error:
18/06/20 11:58:49 ERROR MicroBatchExecution: Query [id = 58ec2604-3b04-4912-9ba8-c757d930ac05, runId = 5458caee-6ef7-4864-9968-9cb843075458] terminated with error
java.lang.ClassCastException: org.apache.spark.sql.execution.streaming.SerializedOffset cannot be cast to org.apache.spark.sql.sources.v2.reader.streaming.Offset
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1$$anonfun$apply$9.apply(MicroBatchExecution.scala:405)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1$$anonfun$apply$9.apply(MicroBatchExecution.scala:390)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at org.apache.spark.sql.execution.streaming.StreamProgress.flatMap(StreamProgress.scala:25)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:390)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:390)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:389)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:121)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:117)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
This is my Offset implementation (simplified version; I removed the JSON (de)serialization):
package mypackage
import org.apache.spark.sql.execution.streaming.SerializedOffset
import org.apache.spark.sql.sources.v2.reader.streaming.Offset
case class MyOffset(offset: Long) extends Offset {
  override val json = "{\"offset\":" + offset + "}"
}

private object MyOffset {
  def apply(offset: SerializedOffset): MyOffset = new MyOffset(0L)
}
Do you have any advice about how to solve this problem?
Check that the Spark version of your client app is exactly the same as the Spark version of your cluster. I used Spark v2.4.0 dependencies in my Spark job application, but the cluster had Spark engine v2.3.0. When I downgraded the dependencies to v2.3.0, the error was gone.
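As a quick check, here is a minimal sketch (it spins up a local SparkSession purely for the check; in a real job you would log the version of the session you already have) that prints the Spark version provided by the client/driver classpath, which you can compare against the "Running Spark version x.y.z" line in the cluster logs:

import org.apache.spark.sql.SparkSession;

public class SparkVersionCheck {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("version-check")
                .master("local[*]")
                .getOrCreate();
        // Version of the Spark jars on the client/driver classpath; if it differs
        // from the version the cluster reports, align the dependencies first.
        System.out.println("Client Spark version: " + spark.version());
        spark.stop();
    }
}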
While trying to use the following code snippet from the Groovy in-operator explanation, a VerifyError occurred. Do you have any idea why?
The code and console output are below.
class Hello extends ArrayList {
    boolean isCase(Object val) {
        return val == 66
    }

    static void main(args) {
        def myList = new Hello()
        myList << 55
        assert 66 in myList
        assert !myList.contains(66)
    }
}
The error log:
Exception in thread "main" java.lang.VerifyError: (class: Hello, method: super$1$stream signature: ()Ljava/util/stream/Stream;) Illegal use of nonvirtual function call
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:259)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:116)
The code originates from the topic How does the Groovy in operator work?.
Update:
Groovy Version: 1.8.6 JVM: 1.6.0_45 Vendor: Sun Microsystems Inc. OS: Linux
Check this out. It's for Java, but the general problem is that you are using the wrong library versions: the class is there, but in a different version than expected.
http://craftingjava.blogspot.co.uk/2012/08/3-reasons-for-javalangverfiyerror.html
You have probably messed up your Groovy or Java SDK installations.