Apache Spark - Parquet / Snappy compression error - java

I have a DataFrame from an Oracle table which I am attempting to write locally in Parquet format with Snappy compression.
It works fine if I save as CSV, but I hit this error when attempting to save as Parquet:
java.lang.UnsatisfiedLinkError: org.xerial.snappy.SnappyNative.maxCompressedLength(I)I
The Snappy libraries are already on my classpath, and this has worked for other source types (flat files).
What can I do to resolve this?
Stack trace below:
2017-05-19 08:10:37.398 INFO 7740 --- [rker for task 0] org.apache.hadoop.io.compress.CodecPool : Got brand-new compressor [.snappy]
2017-05-19 08:11:45.482 ERROR 7740 --- [rker for task 0] org.apache.spark.util.Utils : Aborting task
java.lang.UnsatisfiedLinkError: org.xerial.snappy.SnappyNative.maxCompressedLength(I)I
at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method) ~[snappy-java-1.1.2.6.jar:na]
at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:376) ~[snappy-java-1.1.2.6.jar:na]
at org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:67) ~[parquet-hadoop-1.8.1.jar:1.8.1]
at org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81) ~[hadoop-common-2.2.0.jar:na]
at org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:92) ~[hadoop-common-2.2.0.jar:na]
at org.apache.parquet.hadoop.CodecFactory$BytesCompressor.compress(CodecFactory.java:112) ~[parquet-hadoop-1.8.1.jar:1.8.1]
at org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:89) ~[parquet-hadoop-1.8.1.jar:1.8.1]
at org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:152) ~[parquet-column-1.8.1.jar:1.8.1]
at org.apache.parquet.column.impl.ColumnWriterV1.accountForValueWritten(ColumnWriterV1.java:113) ~[parquet-column-1.8.1.jar:1.8.1]
at org.apache.parquet.column.impl.ColumnWriterV1.write(ColumnWriterV1.java:205) ~[parquet-column-1.8.1.jar:1.8.1]
at org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.addBinary(MessageColumnIO.java:347) ~[parquet-column-1.8.1.jar:1.8.1]
at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$org$apache$spark$sql$execution$datasources$parquet$ParquetWriteSupport$$makeWriter$9.apply(ParquetWriteSupport.scala:169) ~[spark-sql_2.11-2.1.1.jar:2.1.1]
at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$org$apache$spark$sql$execution$datasources$parquet$ParquetWriteSupport$$makeWriter$9.apply(ParquetWriteSupport.scala:157) ~[spark-sql_2.11-2.1.1.jar:2.1.1]
at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$org$apache$spark$sql$execution$datasources$parquet$ParquetWriteSupport$$writeFields$1.apply$mcV$sp(ParquetWriteSupport.scala:114) ~[spark-sql_2.11-2.1.1.jar:2.1.1]
at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.org$apache$spark$sql$execution$datasources$parquet$ParquetWriteSupport$$consumeField(ParquetWriteSupport.scala:422) ~[spark-sql_2.11-2.1.1.jar:2.1.1]
at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.org$apache$spark$sql$execution$datasources$parquet$ParquetWriteSupport$$writeFields(ParquetWriteSupport.scala:113) ~[spark-sql_2.11-2.1.1.jar:2.1.1]
at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$write$1.apply$mcV$sp(ParquetWriteSupport.scala:104) ~[spark-sql_2.11-2.1.1.jar:2.1.1]
at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.consumeMessage(ParquetWriteSupport.scala:410) ~[spark-sql_2.11-2.1.1.jar:2.1.1]
at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:103) ~[spark-sql_2.11-2.1.1.jar:2.1.1]
at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:51) ~[spark-sql_2.11-2.1.1.jar:2.1.1]
at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:121) ~[parquet-hadoop-1.8.1.jar:1.8.1]
at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:123) ~[parquet-hadoop-1.8.1.jar:1.8.1]
at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:42) ~[parquet-hadoop-1.8.1.jar:1.8.1]
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.writeInternal(ParquetOutputWriter.scala:42) ~[spark-sql_2.11-2.1.1.jar:2.1.1]
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:245) ~[spark-sql_2.11-2.1.1.jar:2.1.1]
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190) ~[spark-sql_2.11-2.1.1.jar:2.1.1]
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188) ~[spark-sql_2.11-2.1.1.jar:2.1.1]
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341) ~[spark-core_2.11-2.1.1.jar:2.1.1]
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193) [spark-sql_2.11-2.1.1.jar:2.1.1]
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129) [spark-sql_2.11-2.1.1.jar:2.1.1]
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128) [spark-sql_2.11-2.1.1.jar:2.1.1]
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) [spark-core_2.11-2.1.1.jar:2.1.1]
at org.apache.spark.scheduler.Task.run(Task.scala:99) [spark-core_2.11-2.1.1.jar:2.1.1]
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) [spark-core_2.11-2.1.1.jar:2.1.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_75]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_75]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_75]
2017-05-19 08:11:45.484 INFO 7740 --- [rker for task 0] o.a.p.h.InternalParquetRecordWriter : Flushing mem columnStore to file. allocated memory: 13,812,677
2017-05-19 08:11:45.499 WARN 7740 --- [rker for task 0] org.apache.hadoop.fs.FileUtil : Failed to delete file or dir [C:\Dev\edi_parquet\GMS_TEST\_temporary\0\_temporary\attempt_20170519081036_0000_m_000000_0\.part-00000-193f8835-6505-4dac-8cb6-0e8c5f3cff1b.snappy.parquet.crc]: it still exists.
2017-05-19 08:11:45.501 WARN 7740 --- [rker for task 0] org.apache.hadoop.fs.FileUtil : Failed to delete file or dir [C:\Dev\edi_parquet\GMS_TEST\_temporary\0\_temporary\attempt_20170519081036_0000_m_000000_0\part-00000-193f8835-6505-4dac-8cb6-0e8c5f3cff1b.snappy.parquet]: it still exists.
2017-05-19 08:11:45.501 WARN 7740 --- [rker for task 0] o.a.h.m.lib.output.FileOutputCommitter : Could not delete file:/C:/Dev/edi_parquet/GMS_TEST/_temporary/0/_temporary/attempt_20170519081036_0000_m_000000_0
2017-05-19 08:11:45.504 ERROR 7740 --- [rker for task 0] o.a.s.s.e.datasources.FileFormatWriter : Job job_20170519081036_0000 aborted.

This issue is caused by an incompatibility between the snappy-java version required by Parquet and the one shipped with Spark/Hadoop.
We faced the same issue with Spark 2.3 on Cloudera.
The solution that worked for us was downloading snappy-java-1.1.2.6.jar and placing it in Spark's jars folder.
This includes replacing the snappy-java jar on every node where Spark is installed.
You can find Spark's jars folder at the following locations:
Cloudera : /opt/cloudera/parcels/SPARK2-{spark-cloudera-version}/lib/spark2/jars
HDP : /usr/hdp/{hdp version}/spark2/jars
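Before swapping the jar, it can help to confirm which snappy-java copies are actually present in a jars folder. A small stdlib-only sketch (the demo directory below is a throwaway stand-in for your real Spark jars path):

```python
import glob
import os
import re
import tempfile

def find_snappy_jars(jars_dir):
    """Return {jar_filename: version} for each snappy-java jar in jars_dir."""
    found = {}
    for path in glob.glob(os.path.join(jars_dir, "snappy-java-*.jar")):
        name = os.path.basename(path)
        m = re.match(r"snappy-java-(.+)\.jar", name)
        if m:
            found[name] = m.group(1)
    return found

# Demo on a throwaway directory; on a real node, point jars_dir at
# e.g. the Cloudera or HDP Spark jars path listed above.
demo_dir = tempfile.mkdtemp()
for jar in ("snappy-java-1.1.2.6.jar", "snappy-java-1.0.4.1.jar", "parquet-hadoop-1.8.1.jar"):
    open(os.path.join(demo_dir, jar), "w").close()
versions = find_snappy_jars(demo_dir)
# More than one distinct version signals exactly the conflict described above.
```
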

Related

How to avoid java.io.StreamCorruptedException: invalid stream header: 204356EC when using toPandas() with PySpark?

Whenever I try to read a Spark dataset using PySpark and convert it to a Pandas df for modeling, I get the error java.io.StreamCorruptedException: invalid stream header: 204356EC on the toPandas() step.
I am not a Java coder (hence PySpark), so these errors can be pretty cryptic to me. I tried the following things, but I still have this issue:
Made sure my Spark and PySpark versions matched as suggested here: java.io.StreamCorruptedException when importing a CSV to a Spark DataFrame
Reinstalled Spark using the methods suggested here: Complete Guide to Installing PySpark on MacOS
The logging in the test script below verifies the Spark and PySpark versions are aligned.
test.py:
import logging
from pyspark.sql import SparkSession
from pyspark import SparkContext
import findspark
findspark.init()
logging.basicConfig(
    format='%(asctime)s %(levelname)-8s %(message)s',
    level=logging.INFO,
    datefmt='%Y-%m-%d %H:%M:%S')
sc = SparkContext('local[*]', 'test')
spark = SparkSession(sc)
logging.info('Spark location: {}'.format(findspark.find()))
logging.info('PySpark version: {}'.format(spark.sparkContext.version))
logging.info('Reading spark input dataframe')
test_df = spark.read.csv('./data', header=True, sep='|', inferSchema=True)
logging.info('Converting spark DF to pandas DF')
pandas_df = test_df.toPandas()
logging.info('DF record count: {}'.format(len(pandas_df)))
sc.stop()
Output:
$ python ./test.py
21/05/13 11:54:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2021-05-13 11:54:34 INFO Spark location: /Users/username/server/spark-3.1.1-bin-hadoop2.7
2021-05-13 11:54:34 INFO PySpark version: 3.1.1
2021-05-13 11:54:34 INFO Reading spark input dataframe
2021-05-13 11:54:42 INFO Converting spark DF to pandas DF
21/05/13 11:54:42 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
21/05/13 11:54:45 ERROR TaskResultGetter: Exception while getting task result
java.io.StreamCorruptedException: invalid stream header: 204356EC
at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:936)
at java.io.ObjectInputStream.<init>(ObjectInputStream.java:394)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.<init>(JavaSerializer.scala:64)
at org.apache.spark.serializer.JavaDeserializationStream.<init>(JavaSerializer.scala:64)
at org.apache.spark.serializer.JavaSerializerInstance.deserializeStream(JavaSerializer.scala:123)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:108)
at org.apache.spark.scheduler.TaskResultGetter$$anon$3.$anonfun$run$1(TaskResultGetter.scala:97)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1996)
at org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:63)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Traceback (most recent call last):
File "./test.py", line 23, in <module>
pandas_df = test_df.toPandas()
File "/Users/username/server/spark-3.1.1-bin-hadoop2.7/python/pyspark/sql/pandas/conversion.py", line 141, in toPandas
pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
File "/Users/username/server/spark-3.1.1-bin-hadoop2.7/python/pyspark/sql/dataframe.py", line 677, in collect
sock_info = self._jdf.collectToPython()
File "/Users/username/server/spark-3.1.1-bin-hadoop2.7/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
File "/Users/username/server/spark-3.1.1-bin-hadoop2.7/python/pyspark/sql/utils.py", line 111, in deco
return f(*a, **kw)
File "/Users/username/server/spark-3.1.1-bin-hadoop2.7/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o31.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: java.io.StreamCorruptedException: invalid stream header: 204356EC
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2253)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2202)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2201)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2201)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1078)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1078)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1078)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2440)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2382)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2371)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2202)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2223)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2242)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2267)
at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:390)
at org.apache.spark.sql.Dataset.$anonfun$collectToPython$1(Dataset.scala:3519)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3687)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3685)
at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3516)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
The issue was resolved for me by ensuring that the serialisation option (registered in the configuration under spark.serializer) was compatible with pyarrow (typically used during the conversion between pandas and PySpark if you have it enabled).
The fix was to remove the often-recommended spark.serializer: org.apache.spark.serializer.KryoSerializer from the configuration and rely instead on the potentially slower default.
For context, our set-up was with a ML version of the databricks spark cluster (v7.3).
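In practice that fix amounts to dropping the one key from whatever configuration mapping is fed to the session builder. A minimal sketch with a plain dict (the config values here are illustrative, not from the original post):

```python
def drop_kryo(conf):
    """Return a copy of a Spark conf mapping without the Kryo serializer
    entry, so Spark falls back to the default JavaSerializer."""
    return {k: v for k, v in conf.items() if k != "spark.serializer"}

# Illustrative conf dict; the Arrow key is just an example of a setting to keep.
conf = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.sql.execution.arrow.pyspark.enabled": "true",
}
safe_conf = drop_kryo(conf)
```
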
I had this exception with Spark Thrift Server.
The driver version and the cluster version were different.
In my case I deleted the following setting, so that the driver's version was used across the whole cluster:
spark.yarn.archive=hdfs:///spark/3.1.1.zip

What is causing "java.net.URISyntaxException: Relative path in absolute URI" when submit spark job?

I have compiled the latest version of Apache Griffin (0.6.0) and it is all set up. It creates a Spark job and submits it via Apache Livy. When the job is submitted and starts, it shows the following trace. I am unable to pinpoint any issue from the trace. Can anyone suggest a starting point?
From my digging, this happens when the configuration is not right.
My configurations are as stated in the guides available on the GitHub page.
Application application_1593428020619_0001 failed 2 times due to AM Container for appattempt_1593428020619_0001_000002 exited with exitCode: 254
Failing this attempt.Diagnostics: [2020-06-29 16:16:12.221]Exception from container-launch.
Container id: container_1593428020619_0001_02_000001
Exit code: 254
[2020-06-29 16:16:12.224]Container exited with a non-zero exit code 254. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
0%20%22griffin.checkpoint%22%20:%20%5B%20%5D%0A%7D
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: {
"spark" :%20%7B%0A%20%20%20%20%22log.level%22%20:%20%22WARN%22%0A%20%20%7D,%0A%20%20%22sinks%22%20:%20%5B%20%7B%0A%20%20%20%20%22type%22%20:%20%22CONSOLE%22,%0A%20%20%20%20%22config%22%20:%20%7B%0A%20%20%20%20%20%20%22max.log.lines%22%20:%2010%0A%20%20%20%20%7D%0A%20%20%7D,%20%7B%0A%20%20%20%20%22type%22%20:%20%22HDFS%22,%0A%20%20%20%20%22config%22%20:%20%7B%0A%20%20%20%20%20%20%22path%22%20:%20%22hdfs://griffin/persist%22,%0A%20%20%20%20%20%20%22max.persist.lines%22%20:%2010000,%0A%20%20%20%20%20%20%22max.lines.per.file%22%20:%2010000%0A%20%20%20%20%7D%0A%20%20%7D,%20%7B%0A%20%20%20%20%22type%22%20:%20%22ELASTICSEARCH%22,%0A%20%20%20%20%22config%22%20:%20%7B%0A%20%20%20%20%20%20%22method%22%20:%20%22post%22,%0A%20%20%20%20%20%20%22api%22%20:%20%22http:/es:9200/griffin/accuracy%22,%0A%20%20%20%20%20%20%22connection.timeout%22%20:%20%221m%22,%0A%20%20%20%20%20%20%22retry%22%20:%2010%0A%20%20%20%20%7D%0A%20%20%7D%20%5D,%0A%20%20%22griffin.checkpoint%22%20:%20%5B%20%5D%0A%7D
at org.apache.hadoop.fs.Path.initialize(Path.java:205)
at org.apache.hadoop.fs.Path.<init>(Path.java:171)
at org.apache.griffin.measure.utils.HdfsUtil$.openFile(HdfsUtil.scala:58)
at org.apache.griffin.measure.configuration.dqdefinition.reader.ParamFileReader$$anonfun$readConfig$1.apply(ParamFileReader.scala:37)
at org.apache.griffin.measure.configuration.dqdefinition.reader.ParamFileReader$$anonfun$readConfig$1.apply(ParamFileReader.scala:36)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.griffin.measure.configuration.dqdefinition.reader.ParamFileReader.readConfig(ParamFileReader.scala:36)
at org.apache.griffin.measure.Application$.readParamFile(Application.scala:127)
at org.apache.griffin.measure.Application$.main(Application.scala:55)
at org.apache.griffin.measure.Application.main(Application.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:635)
Caused by: java.net.URISyntaxException: Relative path in absolute URI: {
"spark" :%20%7B%0A%20%20%20%20%22log.level%22%20:%20%22WARN%22%0A%20%20%7D,%0A%20%20%22sinks%22%20:%20%5B%20%7B%0A%20%20%20%20%22type%22%20:%20%22CONSOLE%22,%0A%20%20%20%20%22config%22%20:%20%7B%0A%20%20%20%20%20%20%22max.log.lines%22%20:%2010%0A%20%20%20%20%7D%0A%20%20%7D,%20%7B%0A%20%20%20%20%22type%22%20:%20%22HDFS%22,%0A%20%20%20%20%22config%22%20:%20%7B%0A%20%20%20%20%20%20%22path%22%20:%20%22hdfs://griffin/persist%22,%0A%20%20%20%20%20%20%22max.persist.lines%22%20:%2010000,%0A%20%20%20%20%20%20%22max.lines.per.file%22%20:%2010000%0A%20%20%20%20%7D%0A%20%20%7D,%20%7B%0A%20%20%20%20%22type%22%20:%20%22ELASTICSEARCH%22,%0A%20%20%20%20%22config%22%20:%20%7B%0A%20%20%20%20%20%20%22method%22%20:%20%22post%22,%0A%20%20%20%20%20%20%22api%22%20:%20%22http:/es:9200/griffin/accuracy%22,%0A%20%20%20%20%20%20%22connection.timeout%22%20:%20%221m%22,%0A%20%20%20%20%20%20%22retry%22%20:%2010%0A%20%20%20%20%7D%0A%20%20%7D%20%5D,%0A%20%20%22griffin.checkpoint%22%20:%20%5B%20%5D%0A%7D
at java.net.URI.checkPath(URI.java:1823)
at java.net.URI.<init>(URI.java:745)
at org.apache.hadoop.fs.Path.initialize(Path.java:202)
... 14 more
20/06/29 16:16:11 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 16, (reason: Shutdown hook called before final status was reported.)
20/06/29 16:16:11 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: Shutdown hook called before final status was reported.)
20/06/29 16:16:11 INFO yarn.ApplicationMaster: Deleting staging directory hdfs://localhost:9000/user/geek/.sparkStaging/application_1593428020619_0001
20/06/29 16:16:11 INFO util.ShutdownHookManager: Shutdown hook called
For more detailed output, check the application tracking page: http://progeek:8088/cluster/app/application_1593428020619_0001 Then click on links to logs of each attempt.
. Failing the application.
URISyntaxException: Relative path in absolute URI: { "spark" :%20%7B%0A%20%20%20%20%22log.level%22%20:%20%22WARN%22%0A%20%20%7D,%0A%20%20%22sinks%22%20:%20%5B%20%7B%0A%20%20%20%20%22type%22%20:%20%22CONSOLE%22,%0A%20%20%20%20%22config%22%20:%20%7B%0A%20%20%20%20%20%20%22max.log.lines%22%20:%2010%0A%20%20%20%20%7D%0A%20%20%7D,%20%7B%0A%20%20%20%20%22type%22%20:%20%22HDFS%22,%0A%20%20%20%20%22config%22%20:%20%7B%0A%20%20%20%20%20%20%22path%22%20:%20%22hdfs://griffin/persist%22,%0A%20%20%20%20%20%20%22max.persist.lines%22%20:%2010000,%0A%20%20%20%20%20%20%22max.lines.per.file%22%20:%2010000%0A%20%20%20%20%7D%0A%20%20%7D,%20%7B%0A%20%20%20%20%22type%22%20:%20%22ELASTICSEARCH%22,%0A%20%20%20%20%22config%22%20:%20%7B%0A%20%20%20%20%20%20%22method%22%20:%20%22post%22,%0A%20%20%20%20%20%20%22api%22%20:%20%22http:/es:9200/griffin/accuracy%22,%0A%20%20%20%20%20%20%22connection.timeout%22%20:%20%221m%22,%0A%20%20%20%20%20%20%22retry%22%20:%2010%0A%20%20%20%20%7D%0A%20%20%7D%20%5D,%0A%20%20%22griffin.checkpoint%22%20:%20%5B%20%5D%0A%7D
That long string, starting at the {, is not a URL, but is being treated as such, and so is rejected.
If you URL-decode that string, you get the following, and it becomes even more obvious that it is not a URL, since it is actually JSON:
{
  "spark" : {
    "log.level" : "WARN"
  },
  "sinks" : [ {
    "type" : "CONSOLE",
    "config" : {
      "max.log.lines" : 10
    }
  }, {
    "type" : "HDFS",
    "config" : {
      "path" : "hdfs://griffin/persist",
      "max.persist.lines" : 10000,
      "max.lines.per.file" : 10000
    }
  }, {
    "type" : "ELASTICSEARCH",
    "config" : {
      "method" : "post",
      "api" : "http:/es:9200/griffin/accuracy",
      "connection.timeout" : "1m",
      "retry" : 10
    }
  } ],
  "griffin.checkpoint" : [ ]
}
You need to figure out where that JSON text comes from, and why the code believes it is a URL. It is likely the response to some web service call, but that's just a guess.
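The decoding step the answer describes can be reproduced with Python's standard library; here a short fragment of the percent-encoded value stands in for the full string:

```python
import json
from urllib.parse import unquote

# A short fragment of the percent-encoded value from the stack trace,
# trimmed and re-balanced here so it decodes to a complete JSON object.
encoded = '{ "spark" :%20%7B%0A%20%20%20%20%22log.level%22%20:%20%22WARN%22%0A%20%20%7D%20%7D'
decoded = unquote(encoded)
parsed = json.loads(decoded)  # parses cleanly: it was JSON all along, not a URI
```
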
This is a Livy bug, fixed in the latest version of Griffin; you may find the answer in
https://issues.apache.org/jira/browse/GRIFFIN-248?jql=project%20%3D%20GRIFFIN%20AND%20issuetype%20%3D%20Bug%20AND%20text%20~%20%22%2525%22
The change I made was just to replace the "\" before trying to parse the JSON.
Class: Application.scala
Lines: 44-45, at the moment of parsing the arguments.
However, this comes from SparkSubmitJob.java in the function setLivyArgs(), where there is a workaround for a Livy bug.
The Livy version in use when I encountered "java.net.URISyntaxException: Relative path in absolute URI:" was 0.6.0-incubating.
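The actual fix lives in Griffin's Scala code, but the idea of stripping the escape characters before parsing can be sketched in a few lines of Python (the escaped input string below is invented for illustration):

```python
import json

def parse_escaped_config(raw):
    """Strip stray backslash escapes before JSON parsing, mirroring the
    workaround described above for the Livy-mangled argument string."""
    return json.loads(raw.replace("\\", ""))

# Invented example input: a JSON string whose quotes arrived backslash-escaped.
raw = '{ \\"spark\\" : { \\"log.level\\" : \\"WARN\\" } }'
config = parse_escaped_config(raw)
```
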

'JavaPackage' object is not callable

My use case is as follows: I need to be able to call Java methods from within Python code.
From PySpark this seems to be very easy.
I start PySpark like this:
./pyspark --driver-class-path /path/to/app.jar
and from the PySpark shell do this:
x=sc._jvm.com.abc.def.App
x.getMessage()
u'Hello'
This works fine.
When working with spark job server though:
I use the WordCountSparkJob.py example shipped
from sparkjobserver.api import SparkJob, build_problems
from py4j.java_gateway import JavaGateway, java_import
class WordCountSparkJob(SparkJob):
    def validate(self, context, runtime, config):
        if config.get('input.strings', None):
            return config.get('input.strings')
        else:
            return build_problems(['config input.strings not found'])

    def run_job(self, context, runtime, data):
        x = context._jvm.com.abc.def.App
        return x.getMessage()
My python.conf looks like this:
spark {
  jobserver {
    jobdao = spark.jobserver.io.JobSqlDAO
  }
  context-settings {
    python {
      paths = [
        "/home/xxx/SPARK/spark-1.6.0-bin-hadoop2.6/python",
        "/home/xxx/.local/lib/python2.7/site-packages/pyhocon",
        "/home/xxx/SPARK/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip",
        "/home/xxx/SPARK/spark-1.6.0-bin-hadoop2.6/python/lib/py4j-0.9-src.zip",
        "/home/xxx/gitrepos/spark-jobserver/job-server-python/src/python/dist/spark_jobserver_python-NO_ENV-py2.7.egg"
      ]
    }
    dependent-jar-uris = ["file:///path/to/app.jar"]
  }
  home = /home/path/to/spark
}
I get the following error
[2016-10-08 23:03:46,214] ERROR jobserver.python.PythonJob [] [akka://JobServer/user/context-supervisor/py-context] - From Python: Error while calling 'run_job'TypeError("'JavaPackage' object is not callable",)
[2016-10-08 23:03:46,226] ERROR jobserver.python.PythonJob [] [akka://JobServer/user/context-supervisor/py-context] - Python job failed with error code 4
[2016-10-08 23:03:46,228] ERROR .jobserver.JobManagerActor [] [akka://JobServer/user/context-supervisor/py-context] - Got Throwable
java.lang.Exception: Python job failed with error code 4
at spark.jobserver.python.PythonJob$$anonfun$1.apply(PythonJob.scala:85)
at scala.util.Try$.apply(Try.scala:161)
at spark.jobserver.python.PythonJob.runJob(PythonJob.scala:62)
at spark.jobserver.python.PythonJob.runJob(PythonJob.scala:13)
at spark.jobserver.JobManagerActor$$anonfun$getJobFuture$4.apply(JobManagerActor.scala:288)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
[2016-10-08 23:03:46,232] ERROR .jobserver.JobManagerActor [] [akka://JobServer/user/context-supervisor/py-context] - Exception from job 942727f0-dd81-445d-bc64-bd18880eb4bc:
java.lang.Exception: Python job failed with error code 4
at spark.jobserver.python.PythonJob$$anonfun$1.apply(PythonJob.scala:85)
at scala.util.Try$.apply(Try.scala:161)
at spark.jobserver.python.PythonJob.runJob(PythonJob.scala:62)
at spark.jobserver.python.PythonJob.runJob(PythonJob.scala:13)
at spark.jobserver.JobManagerActor$$anonfun$getJobFuture$4.apply(JobManagerActor.scala:288)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
[2016-10-08 23:03:46,232] INFO k.jobserver.JobStatusActor [] [akka://JobServer/user/context-supervisor/py-context/$a] - Job 942727f0-dd81-445d-bc64-bd18880eb4bc finished with an error
[2016-10-08 23:03:46,233] INFO r$RemoteDeadLetterActorRef [] [akka://JobServer/deadLetters] - Message [spark.jobserver.CommonMessages$JobErroredOut] from Actor[akka://JobServer/user/context-supervisor/py-context/$a#1919442151] to Actor[akka://JobServer/deadLetters] was not delivered. [2] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
In the python.conf file I have the app.jar as an entry in dependent-jar-uris.
Am I missing something here?
The error "'JavaPackage' object is not callable" usually means that PySpark cannot see your jar, or cannot find your class inside it.
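As a quick sanity check from the Python side (a sketch; `com.example.MyJob` below is a placeholder, not a name from your jar), you can walk `sc._jvm` yourself. Py4J hands back a `JavaPackage` stub for any dotted name it cannot resolve to a real class, and calling that stub is exactly what raises this TypeError:

```python
def jvm_can_see(jvm_root, fqcn):
    """Walk a dotted class name down from the Py4J JVM view (sc._jvm).

    Py4J returns a JavaPackage stub for any segment it cannot resolve,
    so ending up on a JavaPackage means the class is not on the JVM
    classpath -- the cause of "'JavaPackage' object is not callable".
    """
    obj = jvm_root
    for part in fqcn.split("."):
        obj = getattr(obj, part)
    return type(obj).__name__ != "JavaPackage"

# Usage inside the job (hypothetical class name):
#   if not jvm_can_see(sc._jvm, "com.example.MyJob"):
#       raise RuntimeError("app.jar is not visible to the context JVM")
```

If this returns False for your class even after adding app.jar to dependent-jar-uris, the jar is not reaching the context JVM's classpath.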

Storm WordCount error: Pipe to subprocess seems to be broken, no output read

Storm 0.10.0
This follows my previous question (Apache storm : Could not load main class org.apache.storm.starter.ExclamationTopology), which was solved.
I have a single-node cluster up and running on my machine; the Storm config file (storm.yaml) is as follows:
storm.zookeeper.servers:
# - "server1"
# - "server2"
- "localhost"
storm.zookeeper.port: 2181
nimbus.host: "localhost"
storm.local.dir: "/var/stormtmp"
java.library.path: "/usr/local"
supervisor.slots.ports:
- 6700
- 6701
- 6702
- 6703
worker.childopts: "-Xmx768m"
nimbus.childopts: "-Xmx512m"
supervisor.childopts: "-Xmx256m"
I ran this WordCount topology (which is Python) on the cluster; I found it here and simply ran it: https://dl.dropboxusercontent.com/s/kc933u6vz2crqkb/storm-starter-0.0.1-SNAPSHOT-jar-with-dependencies.jar
But one of the bolts is throwing the following error on localhost port 6703:
java.lang.RuntimeException: backtype.storm.multilang.NoOutputException: Pipe to subprocess seems to be broken! No output read. Serializer Exception: at backtype.storm.utils.ShellProcess.readShellMs
So I figured something was wrong in the topology, checked my WordCount-3-1457017776-worker-6701.log file, and found this:
b.s.d.executor [INFO] TRANSFERING tuple TASK: 8 TUPLE: source: split:18, stream: default, id: {}, ["moon"]
b.s.d.executor [INFO] BOLT ack TASK: 18 TIME: TUPLE: source: spout:25, stream: default, id: {}, [the cow jumped over the moon]
b.s.t.ShellBolt [INFO] ShellLog pid:1714, name:split Traceback (most recent call last):
File "/var/stormtmp/supervisor/stormdist/WordCount-3-1457017776/resources/storm.py", line 172, in run
self.process(tup)
File "splitsentence.py", line 5, in process
words = tup.values[0].split(" ")
IndexError: list index out of range
b.s.t.ShellBolt [ERROR] Halting process: ShellBolt died. Command: [python, splitsentence.py], ProcessInfo pid:1714, name:split exitCode:0, errorString:
java.lang.RuntimeException: backtype.storm.multilang.NoOutputException: Pipe to subprocess seems to be broken! No output read.
Serializer Exception:
at backtype.storm.utils.ShellProcess.readShellMsg(ShellProcess.java:101) ~[storm-core-0.10.0.jar:0.10.0]
at backtype.storm.task.ShellBolt$BoltReaderRunnable.run(ShellBolt.java:321) [storm-core-0.10.0.jar:0.10.0]
at java.lang.Thread.run(Thread.java:745) [?:1.7.0_95]
b.s.d.executor [ERROR]
java.lang.RuntimeException: backtype.storm.multilang.NoOutputException: Pipe to subprocess seems to be broken! No output read.
Serializer Exception:
at backtype.storm.utils.ShellProcess.readShellMsg(ShellProcess.java:101) ~[storm-core-0.10.0.jar:0.10.0]
at backtype.storm.task.ShellBolt$BoltReaderRunnable.run(ShellBolt.java:321) [storm-core-0.10.0.jar:0.10.0]
at java.lang.Thread.run(Thread.java:745) [?:1.7.0_95]
So I believe the index out of range (caused by line 5, when the tuple becomes empty) is making the bolt die and breaking the pipe to it, so I am not able to do further processing of the data. Is my understanding of the issue correct?
And is there a solution to this? Or maybe a different topology I can test on? Please help me out, this is my first topology running on Storm.
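For what it's worth, the split step could be guarded so an empty tuple no longer raises the IndexError that kills the subprocess (a sketch of only the splitting logic; the storm.py BasicBolt wiring from the starter jar is omitted):

```python
def split_sentence(values):
    """Split the first tuple field into words, tolerating empty tuples.

    In the original splitsentence.py, tup.values[0] raises IndexError
    when values is empty; the uncaught exception kills the shell bolt,
    which surfaces as "Pipe to subprocess seems to be broken".
    """
    if not values:
        return []  # nothing to split -> emit no words
    return values[0].split(" ")
```

In splitsentence.py's process(), emitting each word from `split_sentence(tup.values)` instead of indexing `tup.values[0]` directly would let the bolt survive empty tuples.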

OpenIDE's Lookup fails with Gephi controller objects

I was able to run the demos just fine and build up a graph builder in my unit tests, but now when I deploy this and run it on my local server, I get NullPointerExceptions on some, but not all, of the lookups I call.
ProjectController pc = Lookup.getDefault().lookup(ProjectController.class);
pc.newProject();
workspace = pc.getCurrentWorkspace();
GraphController gc = Lookup.getDefault().lookup(GraphController.class);
GraphModel model = gc.getModel();
Stack trace below:
Caused by: java.lang.NullPointerException
at com.network.manager.impl.NetworkLayoutManagerImpl.initGraphModel(NetworkLayoutManagerImpl.java:167)
at com.network.manager.impl.NetworkLayoutManagerImpl.convertNetworkToGraph(NetworkLayoutManagerImpl.java:49)
at com.network.manager.impl.NetworkChartManagerImpl.buildNetworkGEXF(NetworkChartManagerImpl.java:61)
at com.network.controller.LoadNetworkControllerImpl.loadNodesAndEdges(LoadNetworkControllerImpl.java:43)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.jboss.el.util.ReflectionUtil.invokeMethod(ReflectionUtil.java:328)
at org.jboss.el.util.ReflectionUtil.invokeMethod(ReflectionUtil.java:341)
at org.jboss.el.parser.AstPropertySuffix.invoke(AstPropertySuffix.java:58)
at org.jboss.el.parser.AstValue.invoke(AstValue.java:96)
at org.jboss.el.MethodExpressionImpl.invoke(MethodExpressionImpl.java:276)
at com.sun.faces.facelets.el.TagMethodExpression.invoke(TagMethodExpression.java:105)
at javax.faces.component.MethodBindingMethodExpressionAdapter.invoke(MethodBindingMethodExpressionAdapter.java:87)
... 139 more
My GraphController "gc" is what is null in this case, though I'm able to look up a ProjectController without issue. Out of curiosity, I added the other controllers I needed (AttributeController and ExportController) and printed them out.
(ProjectController --- GraphController --- AttributeController --- ExportController)
System.err.println(pc + " --- " + gc + " --- " + ac + " --- " + ec);
Gives me the following:
org.gephi.project.impl.ProjectControllerImpl#1b819521 --- null --- null --- org.gephi.io.exporter.impl.ExportControllerImpl#3412470a
I'm not too familiar with the Lookup API, so this is a complete mystery. I'm running this on a Tomcat server. Let me know if any more info is needed.
There is a similar question posted here and on the Gephi forums, with no response:
Streaming Graph to Gephi using toolkit : NullPointerException
https://forum.gephi.org/viewtopic.php?t=1599
Lookup depends on files found in META-INF/services. It appears that Tomcat isn't finding the ones for GraphController or AttributeController. There should be files named after the fully qualified interface names, matching how GraphController and AttributeController are imported in your source file.
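One way to verify that is to list the service registration files actually present in the jar Tomcat loads (a sketch; pass the path of whichever Gephi jar is on your classpath):

```python
import zipfile

def list_service_registrations(jar_path):
    """Return the META-INF/services entries inside a jar.

    Lookup (like java.util.ServiceLoader) discovers implementations
    through these registration files, so a missing entry explains why
    lookup() returns null for that interface.
    """
    with zipfile.ZipFile(jar_path) as jar:
        return [name for name in jar.namelist()
                if name.startswith("META-INF/services/")
                and not name.endswith("/")]
```

Each entry under META-INF/services should be named after the fully qualified interface (e.g. org.gephi.graph.api.GraphController) and contain the implementation class name; if the entries for GraphController and AttributeController are missing from what Tomcat actually deploys, those lookups will return null.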
