Apache Spark to transfer data - java

I set up Apache Spark on a server; it is now operational and waiting for data to crunch.
Here is my Java code:
SparkConf conf = new SparkConf().setAppName("myFirstJob").setMaster("spark://10.0.100.120:7077");
JavaSparkContext javaSparkContext = new JavaSparkContext(conf);
javaSparkContext.setLogLevel("WARN");
SQLContext sqlContext = new SQLContext(javaSparkContext);

System.out.println("Hello, Remote Spark v." + javaSparkContext.version());

DataFrame df = sqlContext.read().option("dateFormat", "yyyy-MM-dd") // MM is month; lowercase mm would mean minutes
        .json("./src/main/resources/north-carolina-school-performance-data.json"); // this is line #31
df = df.withColumn("district", df.col("fields.district"));
df = df.groupBy("district").count().orderBy(df.col("district"));
df.show(150);
Spark complains that the ./src/main/resources/north-carolina-school-performance-data.json file is not on the server:
16/07/12 15:08:31 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, micha): java.io.FileNotFoundException: File file:/Users/jgp/git/net.jgp.labs.spark/src/main/resources/north-carolina-school-performance-data.json does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
...
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:244)
at net.jgp.labs.spark.FirstJob.main(FirstJob.java:31)
Caused by: java.io.FileNotFoundException: File file:/Users/jgp/git/net.jgp.labs.spark/src/main/resources/north-carolina-school-performance-data.json does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
Fair enough; the file is not on the server. I was hoping the read would pick the file up locally, where the driver is running, and send it over. Is there a way to do that, or is it outside the scope of Apache Spark? If it is outside the scope, any recommendation on doing it properly? (I could set up a CIFS server and the like, but that feels a little ugly.)
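(One commonly suggested direction, sketched here under the assumption that the file fits in driver memory: read it on the driver, where it exists, and ship the lines to the cluster as an RDD, so no worker needs the local path. The DataFrameReader.json(JavaRDD<String>) overload used below exists in Spark 1.x.)
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;

// Read the JSON file on the driver, where the path is valid...
List<String> lines = Files.readAllLines(
        Paths.get("./src/main/resources/north-carolina-school-performance-data.json"));
// ...distribute its lines to the cluster...
JavaRDD<String> jsonLines = javaSparkContext.parallelize(lines);
// ...and let Spark parse them there, instead of resolving the path on the workers.
DataFrame df2 = sqlContext.read().json(jsonLines);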

Related

How to avoid java.io.StreamCorruptedException: invalid stream header: 204356EC when using toPandas() with PySpark?

Whenever I try to read a Spark dataset using PySpark and convert it to a pandas DataFrame for modeling, I get the error java.io.StreamCorruptedException: invalid stream header: 204356EC at the toPandas() step.
I am not a Java coder (hence PySpark) and so these errors can be pretty cryptic to me. I tried the following things, but I still have this issue:
Made sure my Spark and PySpark versions matched as suggested here: java.io.StreamCorruptedException when importing a CSV to a Spark DataFrame
Reinstalled Spark using the methods suggested here: Complete Guide to Installing PySpark on MacOS
The logging in the test script below verifies the Spark and PySpark versions are aligned.
test.py:
import logging

import findspark
findspark.init()  # must run before the pyspark imports so pyspark is on sys.path

from pyspark import SparkContext
from pyspark.sql import SparkSession

logging.basicConfig(
    format='%(asctime)s %(levelname)-8s %(message)s',
    level=logging.INFO,
    datefmt='%Y-%m-%d %H:%M:%S')

sc = SparkContext('local[*]', 'test')
spark = SparkSession(sc)

logging.info('Spark location: {}'.format(findspark.find()))
logging.info('PySpark version: {}'.format(spark.sparkContext.version))

logging.info('Reading spark input dataframe')
test_df = spark.read.csv('./data', header=True, sep='|', inferSchema=True)

logging.info('Converting spark DF to pandas DF')
pandas_df = test_df.toPandas()
logging.info('DF record count: {}'.format(len(pandas_df)))

sc.stop()
Output:
$ python ./test.py
21/05/13 11:54:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2021-05-13 11:54:34 INFO Spark location: /Users/username/server/spark-3.1.1-bin-hadoop2.7
2021-05-13 11:54:34 INFO PySpark version: 3.1.1
2021-05-13 11:54:34 INFO Reading spark input dataframe
2021-05-13 11:54:42 INFO Converting spark DF to pandas DF
21/05/13 11:54:42 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
21/05/13 11:54:45 ERROR TaskResultGetter: Exception while getting task result
java.io.StreamCorruptedException: invalid stream header: 204356EC
at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:936)
at java.io.ObjectInputStream.<init>(ObjectInputStream.java:394)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.<init>(JavaSerializer.scala:64)
at org.apache.spark.serializer.JavaDeserializationStream.<init>(JavaSerializer.scala:64)
at org.apache.spark.serializer.JavaSerializerInstance.deserializeStream(JavaSerializer.scala:123)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:108)
at org.apache.spark.scheduler.TaskResultGetter$$anon$3.$anonfun$run$1(TaskResultGetter.scala:97)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1996)
at org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:63)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Traceback (most recent call last):
File "./test.py", line 23, in <module>
pandas_df = test_df.toPandas()
File "/Users/username/server/spark-3.1.1-bin-hadoop2.7/python/pyspark/sql/pandas/conversion.py", line 141, in toPandas
pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
File "/Users/username/server/spark-3.1.1-bin-hadoop2.7/python/pyspark/sql/dataframe.py", line 677, in collect
sock_info = self._jdf.collectToPython()
File "/Users/username/server/spark-3.1.1-bin-hadoop2.7/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
File "/Users/username/server/spark-3.1.1-bin-hadoop2.7/python/pyspark/sql/utils.py", line 111, in deco
return f(*a, **kw)
File "/Users/username/server/spark-3.1.1-bin-hadoop2.7/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o31.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: java.io.StreamCorruptedException: invalid stream header: 204356EC
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2253)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2202)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2201)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2201)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1078)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1078)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1078)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2440)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2382)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2371)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2202)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2223)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2242)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2267)
at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:390)
at org.apache.spark.sql.Dataset.$anonfun$collectToPython$1(Dataset.scala:3519)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3687)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3685)
at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3516)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
The issue was resolved for me by ensuring that the serialization option (registered in the configuration under spark.serializer) was not incompatible with pyarrow (typically used during the conversion between pandas and PySpark, if you have it enabled).
The fix was to remove the often-recommended spark.serializer: org.apache.spark.serializer.KryoSerializer from the configuration and rely instead on the potentially slower default.
For context, our setup was an ML version of a Databricks Spark cluster (v7.3).
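(A minimal sketch of what that looks like when building the session yourself; Java here, matching the rest of this page, and the app name and master are placeholders. The point is simply that spark.serializer is left at its default.)
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

SparkConf conf = new SparkConf()
        .setAppName("toPandasRepro") // placeholder
        .setMaster("local[*]");
// Deliberately NOT set, per the fix above:
// conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
SparkSession spark = SparkSession.builder().config(conf).getOrCreate();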
I had this exception with the Spark Thrift Server. The driver version and the cluster version were different. In my case, I deleted the following setting so that the driver's version is used across the whole cluster:
spark.yarn.archive=hdfs:///spark/3.1.1.zip

Old Kafka offsets consumed by Spark Structured Streaming after clearing the checkpointing location

I have built an application using Apache Kafka and Apache Spark Structured Streaming, and I am facing the issue below.
Scenario:
I set up a Spark structured stream with a Kafka topic as the source and a Kafka topic as the sink.
We run the stream and produce a number of messages on the Kafka topic.
We stop the stream and restart it after clearing the stream's checkpointing location. After running for 5 to 6 hours, the stream randomly consumes old Kafka messages.
After clearing the checkpointing location, I expected only new messages on the stream.
Spark version: 2.4.0,
Kafka-client version: 2.0.0,
Kafka version: 2.0.0,
Cluster Manager: Kubernetes.
I have tried this scenario with a different checkpointing location, but the issue persists.
SparkConf sparkConf = new SparkConf().setAppName("SparkKafkaConsumer");
SparkSession spark = SparkSession.builder().config(sparkConf).getOrCreate();

Dataset<Row> stream = spark
        .readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option(subscribeType, "REQUEST_TOPIC") // subscribeType is typically "subscribe"
        .option("failOnDataLoss", false)
        .option("maxOffsetsPerTrigger", "50")
        .option("startingOffsets", "latest")
        .load()
        .selectExpr(
                "CAST(value AS STRING) as payload",
                "CAST(key AS STRING)",
                "CAST(topic AS STRING)",
                "CAST(partition AS STRING)",
                "CAST(offset AS STRING)",
                "CAST(timestamp AS STRING)",
                "CAST(timestampType AS STRING)");

// Dataset<Row>.writeStream() yields a DataStreamWriter<Row>, not <String>
DataStreamWriter<Row> dataWriterStream = stream
        .writeStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("kafka.max.request.size", "35000000")
        .option("kafka.retries", "5")
        .option("kafka.batch.size", "35000000")
        .option("kafka.receive.buffer.bytes", "200000000")
        .option("kafka.acks", "0")
        .option("kafka.compression.type", "snappy")
        .option("kafka.linger.ms", "0")
        .option("kafka.buffer.memory", "50000000")
        .option("topic", "RESPONSE_TOPIC")
        .outputMode("append")
        .option("checkpointLocation", checkPointDirectory);
dataWriterStream.start(); // the query must actually be started
spark.streams().awaitAnyTermination();
Check the link below:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-checkpointing.html
You call SparkContext.setCheckpointDir(directory: String) to set the checkpoint directory, the directory where RDDs are checkpointed. The directory must be an HDFS path if running on a cluster. The reason is that the driver may attempt to reconstruct the checkpointed RDD from its own local file system, which is incorrect because the checkpoint files are actually on the executor machines.
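(A minimal sketch of that advice with placeholder paths; the point is that the checkpoint location lives on storage every node can reach, such as HDFS, not on the driver's local disk.)
JavaSparkContext jsc = new JavaSparkContext(sparkConf);
jsc.setCheckpointDir("hdfs://namenode:8020/spark/checkpoints"); // RDD checkpointing

// For a Structured Streaming query like the one above, the equivalent knob is the
// sink option, e.g.:
// .option("checkpointLocation", "hdfs://namenode:8020/spark/stream-checkpoints")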

How to use the HDFS sink correctly in Flink?

I am playing around with the Twitter connector in Apache Flink and now want to save some streamed data to my local HDFS instance. The Flink documentation contains a small BucketingSink example, but my program always quits with the following error:
Exception in thread "main" org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply$mcV$sp(JobManager.scala:933)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:876)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:876)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.ExceptionInInitializerError
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:92)
at org.apache.hadoop.security.Groups.<init>(Groups.java:76)
at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:239)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:232)
at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:718)
at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:703)
at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:605)
at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2473)
at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2465)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2331)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:369)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink.initFileSystem(BucketingSink.java:418)
at org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink.initializeState(BucketingSink.java:352)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.tryRestoreFunction(StreamingFunctionUtils.java:177)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.restoreFunctionState(StreamingFunctionUtils.java:159)
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.initializeState(AbstractUdfStreamOperator.java:105)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:251)
at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeOperators(StreamTask.java:678)
at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:666)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:252)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
at java.base/java.lang.Thread.run(Thread.java:844)
Caused by: java.lang.StringIndexOutOfBoundsException: begin 0, end 3, length 1
at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3116)
at java.base/java.lang.String.substring(String.java:1885)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:49)
... 25 more
Do you have any idea what went wrong with my code? I use the initial Twitter connector example for testing, and my environment is built with a Docker container for HDFS. The ports are correctly mapped from Docker to my local machine, and I can also check the status of HDFS in the web UI.
Here is my code approach:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment().setParallelism(1);

Properties props = new Properties();
props.setProperty(TwitterSource.CONSUMER_KEY, "KEY");
props.setProperty(TwitterSource.CONSUMER_SECRET, "SECRET");
props.setProperty(TwitterSource.TOKEN, "TOKEN");
props.setProperty(TwitterSource.TOKEN_SECRET, "TOKENSECRET");
DataStream<String> streamSource = env.addSource(new TwitterSource(props));

DataStream<Tuple2<String, Integer>> tweets = streamSource
        // selecting German tweets and splitting them into (word, 1) pairs
        .flatMap(new SelectGermanAndTokenizeFlatMap())
        // group by word and sum the occurrences
        .keyBy(0)
        .timeWindow(Time.minutes(1), Time.seconds(30))
        .sum(1);

BucketingSink<Tuple2<String, Integer>> sink =
        new BucketingSink<>("hdfs://localhost:8020/flink/twitter-test");
sink.setBucketer(new DateTimeBucketer<Tuple2<String, Integer>>("yyyy-MM-dd--HHmm"));
sink.setBatchSize(1024 * 1024 * 400); // roll a new part file every 400 MB
tweets.addSink(sink);
//tweets.print();

env.execute("Twitter Streaming Example");

Spark Streaming with Elasticsearch connector throws JVM_Bind error

I am using Spark 2.1.1 in Java with elasticsearch-spark-20_2.11 (version 5.3.2) to write data to Elasticsearch. I create a JavaStreamingContext, which I then set to await termination, so the application should always retrieve new data.
After I read the stream, I split it into RDDs and for each one I apply SQL aggregations and then write it to Elasticsearch as follows:
recordStream.foreachRDD(rdd -> {
    if (rdd.count() > 0) {
        /*
         * Create a DataFrame from the JSON records
         */
        Dataset<Row> df = spark.read().json(rdd.rdd());
        df.createOrReplaceTempView("data");
        df.cache();
        /*
         * Apply the aggregations
         */
        Dataset aggregators = spark.sql(ORDER_TYPE_DB);
        JavaEsSparkSQL.saveToEs(aggregators.toDF(), "order_analytics/record");
        aggregators = spark.sql(ORDER_CUSTOMER_DB);
        JavaEsSparkSQL.saveToEs(aggregators.toDF(), "customer_analytics/record");
    }
});
This works fine the first time data is read and inserted into Elasticsearch, but when more data is retrieved by the stream, I get the following error:
org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:250)
at org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:546)
at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:58)
at org.elasticsearch.spark.sql.EsSparkSQL$$anonfun$saveToEs$1.apply(EsSparkSQL.scala:94)
at org.elasticsearch.spark.sql.EsSparkSQL$$anonfun$saveToEs$1.apply(EsSparkSQL.scala:94)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.hadoop.rest.EsHadoopTransportException: java.net.BindException: Address already in use: JVM_Bind
at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:129)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:461)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:425)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:429)
at org.elasticsearch.hadoop.rest.RestClient.get(RestClient.java:155)
at org.elasticsearch.hadoop.rest.RestClient.remoteEsVersion(RestClient.java:627)
at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:243)
... 10 more
Caused by: java.net.BindException: Address already in use: JVM_Bind
at java.net.DualStackPlainSocketImpl.bind0(Native Method)
at java.net.DualStackPlainSocketImpl.socketBind(DualStackPlainSocketImpl.java:106)
at java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:387)
at java.net.PlainSocketImpl.bind(PlainSocketImpl.java:190)
at java.net.Socket.bind(Socket.java:644)
at java.net.Socket.<init>(Socket.java:433)
at java.net.Socket.<init>(Socket.java:286)
at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80)
at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122)
at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707)
at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387)
at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
at org.elasticsearch.hadoop.rest.commonshttp.CommonsHttpTransport.execute(CommonsHttpTransport.java:478)
at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:112)
... 16 more
Any ideas what the problem could be?
Spark uses the default configuration and is instantiated in Java as:
SparkConf conf = new SparkConf().setAppName(topic).setMaster("local");
JavaStreamingContext streamingContext = new JavaStreamingContext(conf, Durations.seconds(2));
Elasticsearch is configured via Docker Compose with the following environment parameters:
- cluster.name=cp-es-cluster
- node.name=cloud1
- http.cors.enabled=true
- http.cors.allow-origin="*"
- network.host=0.0.0.0
- discovery.zen.ping.unicast.hosts=${ENV_IP}
- network.publish_host=${ENV_IP}
- discovery.zen.minimum_master_nodes=1
- xpack.security.enabled=false
- xpack.monitoring.enabled=false
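(For reference, a sketch of how the setting named in the error message, es.nodes.wan.only, would be supplied from the Spark side; the values are placeholders, and whether this resolves the JVM_Bind error in this particular setup is untested.)
SparkConf conf = new SparkConf()
        .setAppName(topic)
        .setMaster("local")
        .set("es.nodes", "localhost")       // where the Elasticsearch REST endpoint lives
        .set("es.port", "9200")
        .set("es.nodes.wan.only", "true");  // only connect to the declared nodes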

StreamWriter with Thrift Server

I'm trying to process some data via a Spark 2 stream and save it to HDFS. While the stream is running, I want to read the stored data via the Thrift Server with a simple select:
SELECT COUNT(*) FROM stream_table UNION ALL SELECT COUNT(*) FROM thisistable;
But I'm getting this exception:
Error: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 5.0 failed 1 times, most recent failure:
Lost task 0.0 in stage 5.0 (TID 6, localhost):
java.lang.RuntimeException:
hdfs://5b6b8bf723a2:9000/archiveData/parquets/efc44dd4-1792-4b6d-b0f2-120818047b1b is not a Parquet file (too small)
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:412)
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:371)
at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:252)
at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:99)
at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:85)
at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:72)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:246)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:102)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
My assumption is that Spark creates an empty Parquet file at the start of a batch and fills it at the end of the batch, and that my select runs across the archived files while one of them is still empty because the current batch has not finished yet.
A simple Spark stream example (Thread.sleep simulates a transformation delay):
spark
    .readStream()
    .schema(schema)
    .json("/tmp")
    .filter(x -> {
        Thread.sleep(1000); // simulate a slow transformation
        return true;
    })
    .writeStream()
    .format("parquet")
    .queryName("thisistable")
    .start()
    .awaitTermination();
Is there a way for me to avoid this exception and have the Thrift Server see only finished files?
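(One direction worth noting, sketched under the assumption that the directory is written by Spark's streaming file sink: that sink records every committed file in a _spark_metadata log inside the output directory, and a batch read of the same directory through Spark consults that log and skips in-progress files. Whether this can be wired into the Thrift Server setup above is untested; the path below is the one from the exception, used as a placeholder.)
// Batch-read the streaming output through Spark; only files listed in
// _spark_metadata (i.e., from completed batches) are picked up.
Dataset<Row> finished = spark.read()
        .parquet("hdfs://5b6b8bf723a2:9000/archiveData/parquets");
finished.createOrReplaceTempView("thisistable"); // expose to SQL queries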
