I have a Spark job that processes several folders on S3 per run and stores its state on DynamoDB. In other words, we're running the job once per day, it looks for new folders added by another job, transforms them one-by-one and writes state to DynamoDB. Here's rough pseudocode:
object App {
val allFolders = S3Folders.list()
val foldersToProcess = DynamoDBState.getFoldersToProcess(allFolders)
Transformer.run(foldersToProcess)
}
object Transformer {
def run(folders: List[String]): Unit = {
val sc = new SparkContext()
folders.foreach(process(sc, _))
}
def process(sc: SparkContext, folder: String): Unit = ??? // transform and write to S3
}
This approach works well if S3Folders.list() returns relatively small amount of folders (up to few thousands), if it returns more (4-8K) very often we see following error (that in first glance has nothing to do with Spark):
17/10/31 08:38:20 ERROR ApplicationMaster: User class threw exception: shadeaws.SdkClientException: Failed to sanitize XML document destined for handler class shadeaws.services.s3.model.transform.XmlResponses
SaxParser$ListObjectsV2Handler
shadeaws.SdkClientException: Failed to sanitize XML document destined for handler class shadeaws.services.s3.model.transform.XmlResponsesSaxParser$ListObjectsV2Handler
at shadeaws.services.s3.model.transform.XmlResponsesSaxParser.sanitizeXmlDocument(XmlResponsesSaxParser.java:214)
at shadeaws.services.s3.model.transform.XmlResponsesSaxParser.parseListObjectsV2Response(XmlResponsesSaxParser.java:315)
at shadeaws.services.s3.model.transform.Unmarshallers$ListObjectsV2Unmarshaller.unmarshall(Unmarshallers.java:88)
at shadeaws.services.s3.model.transform.Unmarshallers$ListObjectsV2Unmarshaller.unmarshall(Unmarshallers.java:77)
at shadeaws.services.s3.internal.S3XmlResponseHandler.handle(S3XmlResponseHandler.java:62)
at shadeaws.services.s3.internal.S3XmlResponseHandler.handle(S3XmlResponseHandler.java:31)
at shadeaws.http.response.AwsResponseHandlerAdapter.handle(AwsResponseHandlerAdapter.java:70)
at shadeaws.http.AmazonHttpClient$RequestExecutor.handleResponse(AmazonHttpClient.java:1553)
at shadeaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1271)
at shadeaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1055)
at shadeaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:743)
at shadeaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:717)
at shadeaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
at shadeaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
at shadeaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
at shadeaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
at shadeaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4247)
at shadeaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4194)
at shadeaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4188)
at shadeaws.services.s3.AmazonS3Client.listObjectsV2(AmazonS3Client.java:865)
at me.chuwy.transform.S3Folders$.com$chuwy$transform$S3Folders$$isGlacierified(S3Folders.scala:136)
at scala.collection.TraversableLike$$anonfun$filterImpl$1.apply(TraversableLike.scala:248)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247)
at scala.collection.TraversableLike$class.filterNot(TraversableLike.scala:267)
at scala.collection.AbstractTraversable.filterNot(Traversable.scala:104)
at me.chuwy.transform.S3Folders$.list(S3Folders.scala:112)
at me.chuwy.transform.Main$.main(Main.scala:22)
at me.chuwy.transform.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637)
Caused by: shadeaws.AbortedException:
at shadeaws.internal.SdkFilterInputStream.abortIfNeeded(SdkFilterInputStream.java:53)
at shadeaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:81)
at shadeaws.event.ProgressInputStream.read(ProgressInputStream.java:180)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.read1(BufferedReader.java:210)
at java.io.BufferedReader.read(BufferedReader.java:286)
at java.io.Reader.read(Reader.java:140)
at shadeaws.services.s3.model.transform.XmlResponsesSaxParser.sanitizeXmlDocument(XmlResponsesSaxParser.java:186)
... 36 more
For big amount of folders (~20K) this happens all the time and job cannot start.
Previously we had very similar, but much more frequent error when getFoldersToProcess did GetItem for every folder from allFolders and therefore took much longer:
17/09/30 14:46:07 ERROR ApplicationMaster: User class threw exception: shadeaws.AbortedException:
shadeaws.AbortedException:
at shadeaws.internal.SdkFilterInputStream.abortIfNeeded(SdkFilterInputStream.java:51)
at shadeaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:71)
at shadeaws.event.ProgressInputStream.read(ProgressInputStream.java:180)
at com.fasterxml.jackson.core.json.ByteSourceJsonBootstrapper.ensureLoaded(ByteSourceJsonBootstrapper.java:489)
at com.fasterxml.jackson.core.json.ByteSourceJsonBootstrapper.detectEncoding(ByteSourceJsonBootstrapper.java:126)
at com.fasterxml.jackson.core.json.ByteSourceJsonBootstrapper.constructParser(ByteSourceJsonBootstrapper.java:215)
at com.fasterxml.jackson.core.JsonFactory._createParser(JsonFactory.java:1240)
at com.fasterxml.jackson.core.JsonFactory.createParser(JsonFactory.java:802)
at shadeaws.http.JsonResponseHandler.handle(JsonResponseHandler.java:109)
at shadeaws.http.JsonResponseHandler.handle(JsonResponseHandler.java:43)
at shadeaws.http.response.AwsResponseHandlerAdapter.handle(AwsResponseHandlerAdapter.java:70)
at shadeaws.http.AmazonHttpClient$RequestExecutor.handleResponse(AmazonHttpClient.java:1503)
at shadeaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1226)
at shadeaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1030)
at shadeaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:742)
at shadeaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:716)
at shadeaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
at shadeaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
at shadeaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
at shadeaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
at shadeaws.services.dynamodbv2.AmazonDynamoDBClient.doInvoke(AmazonDynamoDBClient.java:2089)
at shadeaws.services.dynamodbv2.AmazonDynamoDBClient.invoke(AmazonDynamoDBClient.java:2065)
at shadeaws.services.dynamodbv2.AmazonDynamoDBClient.executeGetItem(AmazonDynamoDBClient.java:1173)
at shadeaws.services.dynamodbv2.AmazonDynamoDBClient.getItem(AmazonDynamoDBClient.java:1149)
at me.chuwy.tranform.sdk.Manifest$.contains(Manifest.scala:179)
at me.chuwy.tranform.DynamoDBState$$anonfun$getUnprocessed$1.apply(ProcessManifest.scala:44)
at scala.collection.TraversableLike$$anonfun$filterImpl$1.apply(TraversableLike.scala:248)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247)
at scala.collection.TraversableLike$class.filterNot(TraversableLike.scala:267)
at scala.collection.AbstractTraversable.filterNot(Traversable.scala:104)
at me.chuwy.transform.DynamoDBState$.getFoldersToProcess(DynamoDBState.scala:44)
at me.chuwy.transform.Main$.main(Main.scala:19)
at me.chuwy.transform.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637)
I believe that current error has nothing to do with XML parsing or invalid response, but originate from some race condition inside Spark, because:
There's clear connection between amount of time "state-fetching" takes and chance of failure
Tracebacks have underlying AbortedException, which AFAIK caused by swallowed InterruptedException, which can mean something inside JVM (spark-submit or even YARN) calls Thread.sleep for main thread.
Right now I'm using EMR AMI 5.5.0, Spark 2.1.0 and shaded AWS SDK 1.11.208, but had similar error with AWS SDK 1.10.75.
I'm deploying this job on EMR via command-runner.jar spark-submit --deploy-mode cluster --class ....
Does anyone have any idea where does this exception originate from and how to fix it?
foreach does not guarantee orderly computations and it applies the operation(s) to each element of an RDD, meaning that it will instantiate for every element which, in turn, may overwhelm the executor.
The problem was that getFoldersToProcess is a blocking (and very long) operation, which prevents SparkContext from being instantiated. SpackContext itself should signal about own instantiation to YARN and if it doesn't help in a certain amount of time - YARN assumes that driver node has fallen off and kills the whole cluster.
Related
I am using Flink on the cluster. As I submitted the task, I got the following exception:
Caused by: java.util.concurrent.CompletionException: java.lang.IllegalStateException: Trying to work with offloaded serialized shuffle descriptors.
at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:925)
at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:913)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
at org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.lambda$invokeRpc$0(AkkaInvocationHandler.java:234)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:1079)
at akka.dispatch.OnComplete.internal(Future.scala:263)
at akka.dispatch.OnComplete.internal(Future.scala:261)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:191)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:73)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:572)
at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:101)
at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:999)
at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:458)
... 9 more
Caused by: java. lang.IllegalStateException: Trying to work with offloaded serialized shuffle descriptors.
at org.apache.flink.runtime.deployment.InputGateDeploymentDescriptor.getShuffleDescriptors(InputGateDeploymentDescriptor.java:150)
at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGateFactory.create(SingleInputGateFactory.java:125)
at org.apache.flink.runtime.io.network.NettyShuffleEnvironment.createInputGates(NettyShuffleEnvironment.java:261)
at org.apache.flink.runtime.taskmanager.Task.<init>(Task.java:420)
at org.apache.flink.runtime.taskexecutor.TaskExecutor.submitTask(TaskExecutor.java:737)
at sun.reflect.GeneratedMethodAccessor32.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$handleRpcInvocation$1(AkkaRpcActor.java:316)
at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:83)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:314)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:217)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:163)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20)
at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
at akka.actor.Actor.aroundReceive(Actor.scala:537)
at akka.actor.Actor.aroundReceive$(Actor.scala:535)
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:220)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580)
at akka.actor.ActorCell.invoke(ActorCell.scala:548)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270)
at akka.dispatch.Mailbox.run(Mailbox.scala:231)
at akka.dispatch.Mailbox.exec(Mailbox.scala:243)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
Flink version: 1.13.6;
Scala version: 2.11
Kafka version: 2.2.2
Part of my code:
object batchProcess {
def main(args:Array[String]): Unit = {
val host = "localhost"
val port = 6379
val env = StreamExecutionEnvironment.getExecutionEnvironment
// read from kafka
val source = KafkaSource.builder[String].setBootstrapServers("localhost:9092")
.setTopics("movie_rating_records").setGroupId("my-group").setStartingOffsets(OffsetsInitializer.earliest)
.setValueOnlyDeserializer(new SimpleStringSchema())
.setBounded(OffsetsInitializer.latest).build()
// val inputDataStream = env.readTextFile("a.txt")
val inputDataStream = env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source")
val dataStream = inputDataStream
.map( data =>{
val arr = data.split(",")
( arr(0),arr(1).toInt,arr(2).toInt,arr(3).toFloat,arr(4).toLong)
})
val (counterUserIdPos,counterUserIdNeg,counterMovieIdPos,counterMovieIdNeg,counterUserId2MovieId) = commonProcess(dataStream)
counterUserIdPos.map(x =>{
val jedisIns = new Jedis(host,port,100000)
jedisIns.set("batch2feature_userId_rating1_"+x._1.toString, x._2.toString)
jedisIns.close()
})
env.execute("test")
}
}
The input stream from Kafka is a string split by a comma, for example: 1542295208rating,556,112852,1.0,1542295208. The above code process the string and puts them into another datastream process function. And finally, it writes the result into Redis.
Any help or hints on resolving the issue would be greatly appreciated!
Here aer a few pointers I can think of
Netty is the internal serialization mechanism of Flink => from the stack trace we know the error is likely occurring in one of the .map or so, not when interacting with Kafka nor Redis.
Serialization issues are sometimes happening in Flink when using Scala. Maybe the second .map is somehow causing connection pools or some other context instance to be serialized into the lambda, so replacing it with a Flink SinkFunction might help (in addition to improving performance since you'd only create one Jedis instance per partition).
Investigate also what serialization is going on in the commonProcess.
Essentially, you should be hunting for a place where the code somehow needs to serialize some instance whose type would confuse the Flink serialization mechanism.
The execution was ok locally in unit test, but fails when the Spark Streaming execution is propagated to the real cluster executors, like they silently crash and no longer available for the context:
stream execution thread for kafkaDataGeneratorInactiveESP_02/Distance [id = 438f45a0-acd6-4729-953f-5a18ae208f1f, runId = a98c6d39-fe14-4ed5-b7fe-7e4009de51b2]] impl.BlockReaderFactory (BlockReaderFactory.java:getRemoteBlockReaderFromTcp(765)) - I/O error constructing remote block reader.
java.nio.channels.ClosedByInterruptException
at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:656)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:533)
at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2940)
at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:822)
at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:747)
at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:380)
at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:644)
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:575)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:757)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:829)
at java.io.DataInputStream.read(DataInputStream.java:149)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:161)
at java.io.BufferedReader.readLine(BufferedReader.java:324)
at java.io.BufferedReader.readLine(BufferedReader.java:389)
at scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:74)
at org.apache.spark.sql.execution.streaming.CommitLog.deserialize(CommitLog.scala:56)
at org.apache.spark.sql.execution.streaming.CommitLog.deserialize(CommitLog.scala:48)
at org.apache.spark.sql.execution.streaming.HDFSMetadataLog.get(HDFSMetadataLog.scala:153)
at org.apache.spark.sql.execution.streaming.HDFSMetadataLog.$anonfun$getLatest$2(HDFSMetadataLog.scala:190)
at scala.runtime.java8.JFunction1$mcVJ$sp.apply(JFunction1$mcVJ$sp.java:23)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofLong.foreach(ArrayOps.scala:258)
at org.apache.spark.sql.execution.streaming.HDFSMetadataLog.getLatest(HDFSMetadataLog.scala:189)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.populateStartOffsets(MicroBatchExecution.scala:300)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:194)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:352)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:350)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:69)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:191)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:57)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:185)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:334)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:245)
First thing I have tried was changing the query name, having slash and space: kafkaDataGeneratorInactiveESP_02/Distance
After repacing it to the correct one in
.queryName("kafkaDataGeneratorInactive" + currentIter.metadata.getString("label"))
100% proven from having / or space in the string, the error hasn't gone.
The reason of the failure was actually in using the same name for the query name and checkpoint location path (but not in the part that was attempteed to be improved for the first time). Later I have fount one more error log:
2021-12-01 15:05:46,906 WARN [main] streaming.StreamingQueryManager (Logging.scala:logWarning(69)) - Stopping existing streaming query [id=b13a69d7-5a2f-461e-91a7-a9138c4aa716, runId=9cb31852-d276-42d8-ade6-9839fa97f85c], as a new run is being started.
Why the query was stopped? That's because in Scala I was creating streaming queries in a loop, iterating the collection, while keeping all the query names and all the checkpoint names the same. After making them unique (i just used the string values from the collection), the failure problem has gone.
How do deal with h2 database inability to deal with interrupts, I was occasionally seeing that my embedded h2 database appeared to get corrupted, in particular I had amended an ExecutorService so that if a task took too long it would cancel the task. The task would be cancelled okay but then subsequent database access failed with exceptions such as
23/07/2019 14.23.31:BST:DeleteDuplicatesController:start:SEVERE: commit failed
org.hibernate.TransactionException: commit failed
at org.hibernate.engine.transaction.spi.AbstractTransactionImpl.commit(AbstractTransactionImpl.java:187)
at com.jthink.songkong.db.ReportCache.save(ReportCache.java:46)
at com.jthink.songkong.reports.AbstractReport.setReportDatabaseObject(AbstractReport.java:365)
at com.jthink.songkong.reports.DeleteDuplicatesReport.setReportDatabaseObject(DeleteDuplicatesReport.java:333)
at com.jthink.songkong.reports.DeleteDuplicatesReport.closeReport(DeleteDuplicatesReport.java:377)
at com.jthink.songkong.analyse.toplevelanalyzer.DeleteDuplicatesController.deleteAnyDups(DeleteDuplicatesController.java:606)
at com.jthink.songkong.analyse.toplevelanalyzer.DeleteDuplicatesController.start(DeleteDuplicatesController.java:665)
at com.jthink.songkong.ui.swingworker.DeleteDuplicates.doInBackground(DeleteDuplicates.java:43)
at com.jthink.songkong.ui.swingworker.DeleteDuplicates.doInBackground(DeleteDuplicates.java:20)
at javax.swing.SwingWorker$1.call(SwingWorker.java:295)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at javax.swing.SwingWorker.run(SwingWorker.java:334)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.hibernate.TransactionException: unable to commit against JDBC connection
at org.hibernate.engine.transaction.internal.jdbc.JdbcTransaction.doCommit(JdbcTransaction.java:116)
at org.hibernate.engine.transaction.spi.AbstractTransactionImpl.commit(AbstractTransactionImpl.java:180)
... 14 more
Caused by: org.h2.jdbc.JdbcSQLNonTransientException: General error: "java.lang.IllegalStateException: Reading from nio:C:/Users/Paul/AppData/Roaming/SongKong/Database/Database.mv.db failed; file length -1 read length 4096 at 1541494 [1.4.199/1]"; SQL statement:
COMMIT [50000-199]
at org.h2.message.DbException.getJdbcSQLException(DbException.java:502)
at org.h2.message.DbException.getJdbcSQLException(DbException.java:427)
at org.h2.message.DbException.get(DbException.java:194)
at org.h2.message.DbException.convert(DbException.java:347)
at org.h2.command.Command.executeUpdate(Command.java:280)
at org.h2.jdbc.JdbcConnection.commit(JdbcConnection.java:542)
at com.mchange.v2.c3p0.impl.NewProxyConnection.commit(NewProxyConnection.java:1284)
at org.hibernate.engine.transaction.internal.jdbc.JdbcTransaction.doCommit(JdbcTransaction.java:112)
... 15 more
Caused by: java.lang.IllegalStateException: Reading from nio:C:/Users/Paul/AppData/Roaming/SongKong/Database/Database.mv.db failed; file length -1 read length 4096 at 1541494 [1.4.199/1]
at org.h2.mvstore.DataUtils.newIllegalStateException(DataUtils.java:883)
at org.h2.mvstore.DataUtils.readFully(DataUtils.java:420)
at org.h2.mvstore.FileStore.readFully(FileStore.java:98)
at org.h2.mvstore.MVStore.readBufferForPage(MVStore.java:1048)
at org.h2.mvstore.MVStore.readPage(MVStore.java:2186)
at org.h2.mvstore.MVMap.readPage(MVMap.java:554)
at org.h2.mvstore.Page$NonLeaf.getChildPage(Page.java:1086)
at org.h2.mvstore.Page.get(Page.java:221)
at org.h2.mvstore.MVMap.get(MVMap.java:402)
at org.h2.mvstore.MVMap.get(MVMap.java:389)
at org.h2.mvstore.MVStore.getMapName(MVStore.java:2737)
at org.h2.mvstore.MVStore.renameMap(MVStore.java:2650)
at org.h2.mvstore.tx.TransactionStore.commit(TransactionStore.java:453)
at org.h2.mvstore.tx.Transaction.commit(Transaction.java:389)
at org.h2.engine.Session.commit(Session.java:691)
at org.h2.command.dml.TransactionCommand.update(TransactionCommand.java:46)
at org.h2.command.CommandContainer.update(CommandContainer.java:133)
at org.h2.command.Command.executeUpdate(Command.java:267)
... 18 more
Caused by: java.nio.channels.ClosedChannelException
at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:110)
at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:721)
at org.h2.store.fs.FileNio.read(FilePathNio.java:74)
at org.h2.mvstore.DataUtils.readFully(DataUtils.java:406)
... 34 more
23/07/2019 14.23.31:BST:Errors:addError:SEVERE: Adding Error:commit failed
I have since found this issue
Basically if using H2 in embedded mode, and it receives an interrupt then all subsequent access fails until the thread pool is close and reopened. In the example I give of a process having to be cancelled because it appears to be stuck there is no solution except for interrupting
I also have another case whereby usually the controller thread that doesn't directly do a database work itself so I was struggling to see why when an interrupt occurred why this would cause database errors since this is handled by controller thread. I have now worked out the issue is that Im using an ExecutorService with a fixed size BlockingQueue (so that we dont have a big queue build up in memory), but if the queue gets full then new task is actually executed by the controller thread (because of CallerRunsPolicy), so the controller thread can be making calls to database after all.
Im using H2 with hibernate and in both cases calling the following immediately after the interrupt
HibernateUtil.closeFactory();
seems to solve the issue, however I guess this means that any other threads with hibernate sessions will be broken, but at least newly opened sessins will be okay. So im not particularly happy with this workaround, any other ideas ?
Using H2 as a server is not a solution since the whole point of H2 was an embedded db self contained within application.
Although not properly documented using the async protocol allows a connection to be interrupted without breaking all other connections.
I am running a YARN job on CDH 5.3 cluster. I have default configurations.
No of nodes=3
yarn.nodemanager.resource.cpu-vcores=8
yarn.nodemanager.resource.memory-mb=10GB
mapreduce.[map/reduce].cpu.vcores=1
mapreduce.[map/reduce].memory.mb=1GB
mapreduce.[map | reduce].java.opts.max.heap=756MB
While doing a run on 4.5GB csv data spread over 11 files ,I get following error:
2015-10-12 05:21:04,507 FATAL [IPC Server handler 18 on 50388] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1444634391081_0005_r_000000_0 - exited : org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#9
at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:56)
at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:46)
at org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput.<init>(InMemoryMapOutput.java:63)
at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.unconditionalReserve(MergeManagerImpl.java:303)
at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.reserve(MergeManagerImpl.java:293)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:511)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:329)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)
Then I tuned mapreduce.reduce.memory.mb=1GB to mapreduce.reduce.memory.mb=3GB and job runned fine.
So how to decide on how much data maximum can be handled by 1 reducer assuming that all the input to mapper have to be processed by 1 reducer only?
Generally there is no limitation on the data that can be processed by a single reducer. The memory allocation can slow down the process but must not restrict or fail to process the data. I believe after allocating minimum memory to reducer the data processing should not be an issue. Can u pls share some code snippet to check for any memory leak issues.
We used to process 6+Gb of file in a single reducer withou any issues. I believe you might be having memory leak issues.
When I try to lease tasks from a pull queue in my application, I get the error below. This didn't happen previously with my code, so something has changed. I suspect that it's a threading issue - these calls are made in a separate thread (created using ThreadManager.createThreadForCurrentRequest) - has that recently been disallowed?
uk.org.jaggard.myapp.BlahBlahBlahRunnable run: Exception leasing tasks. Already tried 0 times.
com.google.apphosting.api.ApiProxy$CancelledException: The API call taskqueue.QueryAndOwnTasks() was explicitly cancelled.
at com.google.apphosting.runtime.ApiProxyImpl.doSyncCall(ApiProxyImpl.java:218)
at com.google.apphosting.runtime.ApiProxyImpl.access$000(ApiProxyImpl.java:68)
at com.google.apphosting.runtime.ApiProxyImpl$1.run(ApiProxyImpl.java:182)
at com.google.apphosting.runtime.ApiProxyImpl$1.run(ApiProxyImpl.java:180)
at java.security.AccessController.doPrivileged(Native Method)
at com.google.apphosting.runtime.ApiProxyImpl.makeSyncCall(ApiProxyImpl.java:180)
at com.google.apphosting.runtime.ApiProxyImpl.makeSyncCall(ApiProxyImpl.java:68)
at com.google.appengine.tools.appstats.Recorder.makeSyncCall(Recorder.java:323)
at com.googlecode.objectify.cache.TriggerFutureHook.makeSyncCall(TriggerFutureHook.java:154)
at com.google.apphosting.api.ApiProxy.makeSyncCall(ApiProxy.java:105)
at com.google.appengine.api.taskqueue.QueueApiHelper.makeSyncCall(QueueApiHelper.java:44)
at com.google.appengine.api.taskqueue.QueueImpl.leaseTasksInternal(QueueImpl.java:709)
at com.google.appengine.api.taskqueue.QueueImpl.leaseTasks(QueueImpl.java:731)
at uk.org.jaggard.myapp.BlahBlahBlahRunnable.run(BlahBlahBlahRunnable.java:53)
at com.google.apphosting.runtime.ApiProxyImpl$CurrentRequestThreadFactory$1$1.run(ApiProxyImpl.java:997)
at java.security.AccessController.doPrivileged(Native Method)
at com.google.apphosting.runtime.ApiProxyImpl$CurrentRequestThreadFactory$1.run(ApiProxyImpl.java:994)
at java.lang.Thread.run(Thread.java:679)
at com.google.apphosting.runtime.ApiProxyImpl$CurrentRequestThreadFactory$2$1.run(ApiProxyImpl.java:1031)
I suspect the warnings in your application's logs will contain
"Thread was interrupted, throwing CancelledException."
Java's InterruptedException is discussed in
http://www.ibm.com/developerworks/java/library/j-jtp05236/index.html
and the book "Java Concurrency in Practice".