I'm trying to process some data with a Spark 2 stream and save it to HDFS. While the stream is running, I want to read the stored data via the Thrift server with a simple select:
SELECT COUNT(*) FROM stream_table UNION ALL SELECT COUNT(*) FROM thisistable;
But I'm getting this exception:
Error: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 5.0 failed 1 times, most recent failure:
Lost task 0.0 in stage 5.0 (TID 6, localhost):
java.lang.RuntimeException:
hdfs://5b6b8bf723a2:9000/archiveData/parquets/efc44dd4-1792-4b6d-b0f2-120818047b1b is not a Parquet file (too small)
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:412)
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:371)
at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:252)
at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:99)
at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:85)
at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:72)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:246)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:102)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
My assumption is that Spark creates an empty Parquet file at the start of each batch and only fills it at the end of the batch, and my SELECT runs over the archived files while one of them is still empty because the current batch has not finished yet.
A simple Spark streaming example (the Thread.sleep simulates a transformation delay):
spark
    .readStream()
    .schema(schema)
    .json("/tmp")
    .filter(x -> {
        Thread.sleep(1000);
        return true;
    })
    .writeStream()
    .format("parquet")
    .queryName("thisistable")
    .start()
    .awaitTermination();
Is there a way to avoid this exception and have the Thrift server see only the finished files?
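One idea I want to try (I have not verified it against the Thrift server yet): read the sink directory through Spark itself instead of letting Hive list the raw files, since the file sink keeps a _spark_metadata commit log and a Spark batch read of that directory should only pick up files from completed batches. A minimal sketch, with the output path assumed from the error message:

// Batch read of the streaming sink's output directory. Spark notices the
// sink's _spark_metadata log and lists only files from committed batches,
// so half-written Parquet files from the in-flight batch are skipped.
Dataset<Row> archived = spark.read()
    .parquet("hdfs://5b6b8bf723a2:9000/archiveData/parquets");

archived.createOrReplaceTempView("stream_table");
spark.sql("SELECT COUNT(*) FROM stream_table").show();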
I've set up PySpark on Google Colab using this tutorial from towardsdatascience. It runs well until it fails when trying to use IDF:
from pyspark.ml.feature import IDF
idf = IDF(inputCol='hash', outputCol='features')
model_idf = idf.fit(df_hash)  # <-- fails here
with the following error:
Py4JJavaError: An error occurred while calling o401.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 56.0 failed 1 times, most recent failure: Lost task 0.0 in stage 56.0 (TID 697) (fbea1ac0124f executor driver): org.apache.spark.SparkException: Failed to execute user defined function(Tokenizer$$Lambda$3093/1092846241: (string) => array<string>)
The Java version is 8:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
Has anyone faced a similar problem?
The problem was NA values in the column I was trying to tokenize and then apply IDF to; the Tokenizer's UDF fails on null strings, which is why the error points at the string => array<string> lambda. Dropping the rows with NA in that column fixed it for me:
df_without_na = df.na.drop(subset='my_column_name')
df_without_na.show()
I have a Neo4j 3.2.1 multi-label, multi-property graph database with 4M nodes, 15M edges, and 4.8M distinct labels, taking about 6GB on disk.
I imported the dataset using the "neo4j-import" tool on a Linux machine.
I can open the dataset and traverse the nodes, edges, and their properties just fine using the Java API. However, when I shut it down, it takes a long time and finally gives me the following error in the log file:
2017-08-04 07:07:38.189+0000 INFO [o.n.k.i.f.GraphDatabaseFacadeFactory] Shutdown started
2017-08-04 07:07:38.190+0000 INFO [o.n.k.i.f.GraphDatabaseFacadeFactory] Database is now unavailable
2017-08-04 07:07:38.198+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Check Pointing triggered by database shutdown [5399]: Starting check pointing...
2017-08-04 07:07:38.198+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Check Pointing triggered by database shutdown [5399]: Starting store flush...
2017-08-04 07:23:35.022+0000 ERROR [o.n.k.i.t.l.c.CheckPointerImpl] Error performing check point Failed to rotate logs. Expected version: 5399, actual version: 5274, wait timeout (ms): 956815
org.neo4j.kernel.impl.store.kvstore.RotationTimeoutException: Failed to rotate logs. Expected version: 5399, actual version: 5274, wait timeout (ms): 956815
at org.neo4j.kernel.impl.store.kvstore.RotationState$Rotation.rotate(RotationState.java:79)
at org.neo4j.kernel.impl.store.kvstore.RotationState$Rotation.rotate(RotationState.java:52)
at org.neo4j.kernel.impl.store.kvstore.AbstractKeyValueStore$RotationTask.rotate(AbstractKeyValueStore.java:311)
at org.neo4j.kernel.impl.store.kvstore.AbstractKeyValueStore$RotationTask.rotate(AbstractKeyValueStore.java:288)
at org.neo4j.kernel.impl.store.counts.CountsTracker.rotate(CountsTracker.java:154)
at org.neo4j.kernel.impl.store.NeoStores.flush(NeoStores.java:242)
at org.neo4j.kernel.impl.storageengine.impl.recordstorage.RecordStorageEngine.flushAndForce(RecordStorageEngine.java:480)
at org.neo4j.kernel.impl.transaction.log.checkpoint.CheckPointerImpl.doCheckPoint(CheckPointerImpl.java:160)
at org.neo4j.kernel.impl.transaction.log.checkpoint.CheckPointerImpl.forceCheckPoint(CheckPointerImpl.java:88)
at org.neo4j.kernel.NeoStoreDataSource$3.shutdown(NeoStoreDataSource.java:794)
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.shutdown(LifeSupport.java:489)
at org.neo4j.kernel.lifecycle.LifeSupport.shutdown(LifeSupport.java:206)
at org.neo4j.kernel.NeoStoreDataSource.stop(NeoStoreDataSource.java:766)
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.stop(LifeSupport.java:458)
at org.neo4j.kernel.lifecycle.LifeSupport.stopInstances(LifeSupport.java:161)
at org.neo4j.kernel.lifecycle.LifeSupport.stop(LifeSupport.java:143)
at org.neo4j.kernel.impl.transaction.state.DataSourceManager.stop(DataSourceManager.java:120)
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.stop(LifeSupport.java:458)
at org.neo4j.kernel.lifecycle.LifeSupport.stopInstances(LifeSupport.java:161)
at org.neo4j.kernel.lifecycle.LifeSupport.stop(LifeSupport.java:143)
at org.neo4j.kernel.lifecycle.LifeSupport.shutdown(LifeSupport.java:191)
at org.neo4j.kernel.impl.factory.ClassicCoreSPI.shutdown(ClassicCoreSPI.java:159)
at org.neo4j.kernel.impl.factory.GraphDatabaseFacade.shutdown(GraphDatabaseFacade.java:366)
at experiment.caseStudy.TestDatasetHealth.run(TestDatasetHealth.java:70)
at experiment.caseStudy.TestDatasetHealth.main(TestDatasetHealth.java:29)
2017-08-04 07:23:35.665+0000 INFO [o.n.k.i.DiagnosticsManager] --- STOPPING diagnostics START ---
2017-08-04 07:23:35.666+0000 INFO [o.n.k.i.DiagnosticsManager] --- STOPPING diagnostics END ---
In the Java program itself, I get the following exception:
Exception in thread "main" org.neo4j.kernel.lifecycle.LifecycleException: Component 'org.neo4j.kernel.NeoStoreDataSource$3#3101ffd3' failed to transition from stopped to shutting_down. Please see the attached cause exception "Failed to rotate logs. Expected version: 5399, actual version: 5274, wait timeout (ms): 956815".
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.shutdown(LifeSupport.java:497)
at org.neo4j.kernel.lifecycle.LifeSupport.shutdown(LifeSupport.java:206)
at org.neo4j.kernel.NeoStoreDataSource.stop(NeoStoreDataSource.java:766)
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.stop(LifeSupport.java:458)
at org.neo4j.kernel.lifecycle.LifeSupport.stopInstances(LifeSupport.java:161)
at org.neo4j.kernel.lifecycle.LifeSupport.stop(LifeSupport.java:143)
at org.neo4j.kernel.impl.transaction.state.DataSourceManager.stop(DataSourceManager.java:120)
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.stop(LifeSupport.java:458)
at org.neo4j.kernel.lifecycle.LifeSupport.stopInstances(LifeSupport.java:161)
at org.neo4j.kernel.lifecycle.LifeSupport.stop(LifeSupport.java:143)
at org.neo4j.kernel.lifecycle.LifeSupport.shutdown(LifeSupport.java:191)
at org.neo4j.kernel.impl.factory.ClassicCoreSPI.shutdown(ClassicCoreSPI.java:159)
at org.neo4j.kernel.impl.factory.GraphDatabaseFacade.shutdown(GraphDatabaseFacade.java:366)
at experiment.caseStudy.TestDatasetHealth.run(TestDatasetHealth.java:70)
at experiment.caseStudy.TestDatasetHealth.main(TestDatasetHealth.java:29)
Caused by: org.neo4j.kernel.impl.store.kvstore.RotationTimeoutException: Failed to rotate logs. Expected version: 5399, actual version: 5274, wait timeout (ms): 956815
at org.neo4j.kernel.impl.store.kvstore.RotationState$Rotation.rotate(RotationState.java:79)
at org.neo4j.kernel.impl.store.kvstore.RotationState$Rotation.rotate(RotationState.java:52)
at org.neo4j.kernel.impl.store.kvstore.AbstractKeyValueStore$RotationTask.rotate(AbstractKeyValueStore.java:311)
at org.neo4j.kernel.impl.store.kvstore.AbstractKeyValueStore$RotationTask.rotate(AbstractKeyValueStore.java:288)
at org.neo4j.kernel.impl.store.counts.CountsTracker.rotate(CountsTracker.java:154)
at org.neo4j.kernel.impl.store.NeoStores.flush(NeoStores.java:242)
at org.neo4j.kernel.impl.storageengine.impl.recordstorage.RecordStorageEngine.flushAndForce(RecordStorageEngine.java:480)
at org.neo4j.kernel.impl.transaction.log.checkpoint.CheckPointerImpl.doCheckPoint(CheckPointerImpl.java:160)
at org.neo4j.kernel.impl.transaction.log.checkpoint.CheckPointerImpl.forceCheckPoint(CheckPointerImpl.java:88)
at org.neo4j.kernel.NeoStoreDataSource$3.shutdown(NeoStoreDataSource.java:794)
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.shutdown(LifeSupport.java:489)
In fact, the Java program only reads the data and does not write anything to the dataset.
Furthermore, opening the database with the following line of code takes 80 seconds on a 3.1GHz Core i7 MacBook with 16GB of RAM and 10GB given to the JVM:
GraphDatabaseService dataGraph = new GraphDatabaseFactory().newEmbeddedDatabase(storeDir);
Is it normal for a dataset of this size to take that long to open?
Could you please guide me on how I can repair the dataset so that it shuts down cleanly?
I have 5GB worth of data in DSE 4.8.9, and I am trying to load the same data into DSE 5.0.2. The command I use is the following:
root#dse:/mnt/cassandra/data$ sstableloader -d 10.0.2.91 /mnt/cassandra/data/my-keyspace/my-table-0b168ba1637111e6b40131c603254a9b/
This gives me the following exception:
DEBUG 15:27:12,850 Using framed transport.
DEBUG 15:27:12,850 Opening framed transport to: 10.0.2.91:9160
DEBUG 15:27:12,850 Using thriftFramedTransportSize size of 16777216
DEBUG 15:27:12,851 Framed transport opened successfully to: 10.0.2.91:9160
Could not retrieve endpoint ranges:
InvalidRequestException(why:unconfigured table schema_columnfamilies)
java.lang.RuntimeException: Could not retrieve endpoint ranges:
at org.apache.cassandra.tools.BulkLoader$ExternalClient.init(BulkLoader.java:342)
at org.apache.cassandra.io.sstable.SSTableLoader.stream(SSTableLoader.java:156)
at org.apache.cassandra.tools.BulkLoader.main(BulkLoader.java:109)
Caused by: InvalidRequestException(why:unconfigured table schema_columnfamilies)
at org.apache.cassandra.thrift.Cassandra$execute_cql3_query_result$execute_cql3_query_resultStandardScheme.read(Cassandra.java:50297)
at org.apache.cassandra.thrift.Cassandra$execute_cql3_query_result$execute_cql3_query_resultStandardScheme.read(Cassandra.java:50274)
at org.apache.cassandra.thrift.Cassandra$execute_cql3_query_result.read(Cassandra.java:50189)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:86)
at org.apache.cassandra.thrift.Cassandra$Client.recv_execute_cql3_query(Cassandra.java:1734)
at org.apache.cassandra.thrift.Cassandra$Client.execute_cql3_query(Cassandra.java:1719)
at org.apache.cassandra.tools.BulkLoader$ExternalClient.init(BulkLoader.java:321)
... 2 more
Thoughts?
For scenarios where you have few nodes and not a lot of data, you can follow these steps for a cluster migration (ensure the clusters are at most one major release apart); example commands are sketched after the list:
1) create the schema in the new cluster
2) move both nodes' data to each new node (into the new cfid tables)
3) run nodetool refresh to pick up the data
4) run nodetool cleanup to clear out the extra data
5) if the old cluster was on a previous major version, run an sstable upgrade on the new cluster
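A rough sketch of what steps 3-5 could look like on one of the new nodes (keyspace and table names are placeholders modelled on the sstableloader path above; adjust them to your schema):

# step 3: pick up the sstables that were copied into the table directory
nodetool refresh my_keyspace my_table

# step 4: discard data that does not belong to this node's token ranges
nodetool cleanup my_keyspace

# step 5: only needed if the data came from an older major version
nodetool upgradesstables my_keyspace my_table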
I am trying to read data from Cassandra using Spark.
DataFrame rdf = sqlContext.read().option("keyspace", "readypulse")
.option("table", "ig_posts")
.format("org.apache.spark.sql.cassandra").load();
rdf.registerTempTable("cassandra_table");
System.out.println(sqlContext.sql("select count(external_id) from cassandra_table").collect()[0].getLong(0));
The task fails with the following error. I am not able to understand why the ShuffleMapTask is being called, or why it is a problem to cast it to Task.
16/03/30 02:27:15 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, ip-10-165-180-22.ec2.internal):
java.lang.ClassCastException: org.apache.spark.scheduler.ShuffleMapTask cannot be cast to org.apache.spark.scheduler.Task
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
16/03/30 02:27:15 INFO TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) on executor ip-10-165-180-22.ec2.internal:
java.lang.ClassCastException (org.apache.spark.scheduler.ShuffleMapTask cannot be cast to org.apache.spark.scheduler.Task) [duplicate 1]
16/03/30 02:27:15 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
I am using EMR 4.4, Spark 1.6, Cassandra 2.2 (DataStax Community), and spark-cassandra-connector-java_2.10 1.6.0-M1 (I also tried 1.5.0).
I also tried the same thing with the following code, but got the same error:
CassandraJavaRDD<CassandraRow> cjrdd = functions.cassandraTable(
        KEYSPACE, tableName).select(columns);
logger.info("Got rows from cassandra " + cjrdd.count());

JavaRDD<Double> jrdd2 = cjrdd.map(new Function<CassandraRow, Double>() {
    @Override
    public Double call(CassandraRow trainingRow) throws Exception {
        Object fCount = trainingRow.getRaw("follower_count");
        double count = 0;
        if (fCount != null) {
            count = (Long) fCount;
        }
        return count;
    }
});
logger.info("Mapper done : " + jrdd2.count());
logger.info("Mapper done values : " + jrdd2.collect());
I've been encountering a similar problem recently, due to
--conf spark.executor.userClassPathFirst=true.
Quoting Spark's official documentation:
spark.executor.userClassPathFirst (Experimental) Same functionality as spark.driver.userClassPathFirst, but applied to executor instances.
I think those exceptions were due to a jar version conflict; as the Spark documentation puts it, "The user's jar should never include Hadoop or Spark libraries, however, these will be added at runtime."
I was also struggling with the same error, but setting userClassPathFirst=true didn't help me. I set "spark.driver.userClassPathFirst=false" and "spark.executor.userClassPathFirst=false" instead, and that solved my ClassCastException problem. Here is my command:
spark-submit --conf "spark.driver.userClassPathFirst=false" --conf "spark.executor.userClassPathFirst=false" --deploy-mode client --class org.sdrc.kspmis.dashboardservice.KspDashboardServiceApplication --master spark://192.168.1.95:7077 dashboard-0.0.1-SNAPSHOT-shaded.jar
I have a Spark job that runs on YARN. It works with about 150GB of data, does multiple shuffle operations, and finally stores the data into HBase. It keeps failing at saveAsHadoopDataset: multiple executors fail at this stage after reporting high GC activity. However, none of the executor logs, driver logs, or node manager logs show any OutOfMemory, GC overhead limit exceeded, or memory-limit-exceeded errors, and I don't see any other reason for the executor failures in the Spark UI either.
val hConf = HBaseConfiguration.create
hConf.setInt("hbase.client.scanner.caching", 10000)
hConf.setBoolean("hbase.cluster.distributed", true)
new PairRDDFunctions(hbaseRdd).saveAsHadoopDataset(jobConfig)
Driver Logs:
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], main() threw exception, Job aborted due to stage failure: Task 388 in stage 22.0 failed 4 times, most recent failure: Lost task 388.3 in stage 22.0 (TID 32141, maprnode5): ExecutorLostFailure (executor 5 lost)
Driver stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 388 in stage 22.0 failed 4 times, most recent failure: Lost task 388.3 in stage 22.0 (TID 32141, maprnode5): ExecutorLostFailure (executor 5 lost)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1914)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1124)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1065)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1065)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1065)
Executor Logs:
16/02/24 11:09:47 INFO executor.Executor: Finished task 224.0 in stage 8.0 (TID 15318). 2099 bytes result sent to driver
16/02/24 11:09:47 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 15333
16/02/24 11:09:47 INFO executor.Executor: Running task 239.0 in stage 8.0 (TID 15333)
16/02/24 11:09:47 INFO storage.ShuffleBlockFetcherIterator: Getting 125 non-empty blocks out of 3007 blocks
16/02/24 11:09:47 INFO storage.ShuffleBlockFetcherIterator: Started 14 remote fetches in 10 ms
16/02/24 11:11:47 ERROR server.TransportChannelHandler: Connection to maprnode5 has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.network.timeout if this is wrong.
16/02/24 11:11:47 ERROR client.TransportResponseHandler: Still have 1 requests outstanding when connection from maprnode5 is closed
16/02/24 11:11:47 ERROR shuffle.OneForOneBlockFetcher: Failed while starting block fetches
java.io.IOException: Connection from maprnode5 closed
at org.apache.spark.network.client.TransportResponseHandler.channelUnregistered(TransportResponseHandler.java:104)
at org.apache.spark.network.server.TransportChannelHandler.channelUnregistered(TransportChannelHandler.java:91)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
at io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
at io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
at io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
at io.netty.channel.DefaultChannelPipeline.fireChannelUnregistered(DefaultChannelPipeline.java:739)
at io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:659)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:744)
16/02/24 11:11:47 INFO shuffle.RetryingBlockFetcher: Retrying fetch (1/3) for 6 outstanding blocks after 5000 ms
16/02/24 11:11:52 INFO client.TransportClientFactory: Found inactive connection to maprnode5, creating a new one.
16/02/24 11:12:16 WARN server.TransportChannelHandler: Exception in connection from maprnode5
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:744)
16/02/24 11:12:16 ERROR client.TransportResponseHandler: Still have 1 requests outstanding when connection from maprnode5 is closed
16/02/24 11:12:16 ERROR shuffle.OneForOneBlockFetcher: Failed while starting block fetches
So it turns out that although the Spark UI says the job failed at saveAsHadoopDataset, it was in fact failing at the first step of the stage in which saveAsHadoopDataset was the last step. To elaborate: Spark draws stage boundaries at wide (shuffle) dependencies, so a stage is a chain of narrow transformations that starts right after a wide one. In my particular case the sequence was groupByKey (wide dependency) -> mapValues (narrow) -> map (narrow), where the last map is what actually performs saveAsHadoopDataset. The executors were in fact reporting high GC activity and memory usage during the shuffle for groupByKey. I changed my application logic to use reduceByKey instead of groupByKey. It is now very slow, but at least it no longer fails.
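Roughly, the shape of the change was the following (a simplified Java sketch; pairs and the summing logic are placeholders, since the real combine logic depends on how the grouped values were used):

// groupByKey (before): every value for a key is shuffled to one executor and
// held in memory as a single Iterable, which is what drove the GC pressure.
JavaPairRDD<String, Long> grouped = pairs
    .groupByKey()
    .mapValues(values -> {
        long sum = 0L;
        for (Long v : values) {
            sum += v;
        }
        return sum;
    });

// reduceByKey (after): values are combined map-side before the shuffle, so far
// less data crosses the network or sits in executor memory at once.
JavaPairRDD<String, Long> reduced = pairs.reduceByKey((a, b) -> a + b);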