StorageLevel MEMORY_AND_DISK_2() on a Spark RDD throws an exception - java

Can anyone explain how the storage level of an RDD works?
I get a heap memory error when I use the persist method with the storage level StorageLevel.MEMORY_AND_DISK_2().
However, my code works fine when I use the cache method.
According to the Spark documentation, cache persists the RDD with the default storage level (MEMORY_ONLY).
My code where I get the heap error:
JavaRDD<String> rawData = sparkContext
        .textFile(inputFile.getAbsolutePath())
        .setName("Input File").persist(SparkToolConstant.rdd_stroage_level);
// cache()
String[] headers = new String[0];
String headerStr = null;
if (headerPresent) {
    headerStr = rawData.first();
    headers = headerStr.split(delim);
    List<String> headersList = new ArrayList<String>();
    headersList.add(headerStr);
    JavaRDD<String> headerRDD = sparkContext
            .parallelize(headersList);
    JavaRDD<String> filteredRDD = rawData.subtract(headerRDD)
            .setName("Raw data without header").persist(StorageLevel.MEMORY_AND_DISK_2());
    rawData = filteredRDD;
}
Stack trace
Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 10, localhost): java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2271)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
at org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:110)
at org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:1176)
at org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:1185)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:846)
at org.apache.spark.storage.BlockManager.putArray(BlockManager.scala:668)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:176)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:79)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
Spark version : 1.3.0

Since this has gone unanswered for so long, I'm posting this for general information and for those like me whose searches lead here.
This type of question is hard to answer without more specifics about your application. In general, it does seem upside down that you'd get a memory error when serializing to disk. I suggest you try Kryo serialization. If you have a lot of spare memory somewhere, you could also use Alluxio (the software formerly known as Tachyon) for "disk serialization"; this will speed things up.
More from Spark docs on Tuning Data Storage, Serialized RDD Storage and (maybe helpful) GC Tuning:
When your objects are still too large to efficiently store despite
this tuning, a much simpler way to reduce memory usage is to store
them in serialized form, using the serialized StorageLevels in the
RDD persistence API, such as MEMORY_ONLY_SER. Spark will then
store each RDD partition as one large byte array. The only downside of
storing data in serialized form is slower access times, due to having
to deserialize each object on the fly. We highly recommend using Kryo
if you want to cache data in serialized form, as it leads to much
smaller sizes than Java serialization (and certainly than raw Java
objects).
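To make the difference between the two calls in the question concrete, here is a minimal sketch that models a storage level as the five flags carried by Spark's JVM StorageLevel (disk, memory, off-heap, deserialized, replication). The Python names below are stand-ins chosen for illustration and are not the PySpark API; the flag values follow the documented meaning of MEMORY_ONLY and MEMORY_AND_DISK_2 on the JVM side.

```python
from collections import namedtuple

# Illustrative stand-in for the flags carried by Spark's JVM StorageLevel;
# this is not the PySpark class itself.
StorageLevel = namedtuple(
    "StorageLevel",
    ["use_disk", "use_memory", "use_off_heap", "deserialized", "replication"],
)

MEMORY_ONLY = StorageLevel(False, True, False, True, 1)       # what cache() uses
MEMORY_AND_DISK_2 = StorageLevel(True, True, False, True, 2)  # disk fallback, two replicas


def describe(level):
    """Summarise where a partition may live under the given level."""
    parts = []
    if level.use_memory:
        form = "deserialized objects" if level.deserialized else "serialized bytes"
        parts.append("memory (%s)" % form)
    if level.use_disk:
        parts.append("disk")
    return "%s, replicas=%d" % (" + ".join(parts), level.replication)


print(describe(MEMORY_ONLY))
print(describe(MEMORY_AND_DISK_2))
```

The only differences from cache() are the disk fallback and the second replica. The doPut, dataSerialize and ByteArrayOutputStream frames in the stack trace suggest the block was being serialized into a single in-memory byte array, which would be consistent with preparing a copy for replication; that reading is an inference from the trace, not something confirmed in this thread.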

Related

Apache Flink got exception: java.lang.IllegalStateException: Trying to work with offloaded serialized shuffle descriptors

I am running Flink on a cluster. When I submitted the job, I got the following exception:
Caused by: java.util.concurrent.CompletionException: java.lang.IllegalStateException: Trying to work with offloaded serialized shuffle descriptors.
at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:925)
at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:913)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
at org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.lambda$invokeRpc$0(AkkaInvocationHandler.java:234)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:1079)
at akka.dispatch.OnComplete.internal(Future.scala:263)
at akka.dispatch.OnComplete.internal(Future.scala:261)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:191)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:73)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:572)
at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:101)
at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:999)
at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:458)
... 9 more
Caused by: java.lang.IllegalStateException: Trying to work with offloaded serialized shuffle descriptors.
at org.apache.flink.runtime.deployment.InputGateDeploymentDescriptor.getShuffleDescriptors(InputGateDeploymentDescriptor.java:150)
at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGateFactory.create(SingleInputGateFactory.java:125)
at org.apache.flink.runtime.io.network.NettyShuffleEnvironment.createInputGates(NettyShuffleEnvironment.java:261)
at org.apache.flink.runtime.taskmanager.Task.<init>(Task.java:420)
at org.apache.flink.runtime.taskexecutor.TaskExecutor.submitTask(TaskExecutor.java:737)
at sun.reflect.GeneratedMethodAccessor32.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$handleRpcInvocation$1(AkkaRpcActor.java:316)
at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:83)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:314)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:217)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:163)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20)
at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
at akka.actor.Actor.aroundReceive(Actor.scala:537)
at akka.actor.Actor.aroundReceive$(Actor.scala:535)
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:220)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580)
at akka.actor.ActorCell.invoke(ActorCell.scala:548)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270)
at akka.dispatch.Mailbox.run(Mailbox.scala:231)
at akka.dispatch.Mailbox.exec(Mailbox.scala:243)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
Flink version: 1.13.6
Scala version: 2.11
Kafka version: 2.2.2
Part of my code:
object batchProcess {
  def main(args: Array[String]): Unit = {
    val host = "localhost"
    val port = 6379
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // read from kafka
    val source = KafkaSource.builder[String].setBootstrapServers("localhost:9092")
      .setTopics("movie_rating_records").setGroupId("my-group").setStartingOffsets(OffsetsInitializer.earliest)
      .setValueOnlyDeserializer(new SimpleStringSchema())
      .setBounded(OffsetsInitializer.latest).build()
    // val inputDataStream = env.readTextFile("a.txt")
    val inputDataStream = env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source")
    val dataStream = inputDataStream
      .map(data => {
        val arr = data.split(",")
        (arr(0), arr(1).toInt, arr(2).toInt, arr(3).toFloat, arr(4).toLong)
      })
    val (counterUserIdPos, counterUserIdNeg, counterMovieIdPos, counterMovieIdNeg, counterUserId2MovieId) = commonProcess(dataStream)
    counterUserIdPos.map(x => {
      val jedisIns = new Jedis(host, port, 100000)
      jedisIns.set("batch2feature_userId_rating1_" + x._1.toString, x._2.toString)
      jedisIns.close()
    })
    env.execute("test")
  }
}
Each record in the Kafka input stream is a comma-separated string, for example: 1542295208rating,556,112852,1.0,1542295208. The code above parses the string, passes the result to another DataStream processing function, and finally writes the result to Redis.
Any help or hints on resolving the issue would be greatly appreciated!
Here are a few pointers I can think of:
Netty is the network layer Flink uses internally to exchange data between tasks, so from the stack trace the error is likely occurring around one of the .map operators, not when interacting with Kafka or Redis.
Serialization issues sometimes occur in Flink when using Scala. Maybe the second .map is somehow causing connection pools or some other context instance to be serialized into the lambda, so replacing it with a Flink SinkFunction might help (in addition to improving performance, since you'd only create one Jedis instance per partition).
Also investigate what serialization is going on in commonProcess.
Essentially, you should be hunting for a place where the code somehow needs to serialize an instance whose type would confuse the Flink serialization mechanism.
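The suggestion to create one Jedis instance per partition rather than one per record can be illustrated without Flink or Redis. The sketch below uses a hypothetical FakeClient class that only counts how often it is constructed; in a real job the same shape would presumably live in a sink's open/invoke/close lifecycle, which is an assumption about how you would port it rather than code from the question.

```python
class FakeClient:
    """Hypothetical stand-in for a Redis client; counts connections opened."""
    opened = 0

    def __init__(self):
        FakeClient.opened += 1
        self.store = {}

    def set(self, key, value):
        self.store[key] = value

    def close(self):
        pass


def write_per_record(records):
    """Pattern from the question: a new client for every record."""
    for key, value in records:
        client = FakeClient()
        client.set(key, value)
        client.close()


def write_per_partition(records):
    """Suggested pattern: open once, reuse for all records, close once."""
    client = FakeClient()
    try:
        for key, value in records:
            client.set(key, value)
    finally:
        client.close()
    return client.store


records = [("user_%d" % i, str(i)) for i in range(1000)]

FakeClient.opened = 0
write_per_record(records)
per_record = FakeClient.opened

FakeClient.opened = 0
stored = write_per_partition(records)
per_partition = FakeClient.opened

print(per_record, per_partition)
```

Besides saving connections, keeping the client inside a dedicated sink object means it is created on the worker instead of being captured by a lambda, which is the serialization concern raised above.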

How to handle RecordTooLargeException without the Flink job restarting

Is there any way to ignore oversized messages without Flink job restarting?
If I try to produce (using KafkaSink) a message that is too large (greater than max.message.bytes), a RecordTooLargeException occurs and the Flink job restarts; this "exception and restart" cycle then repeats endlessly!
I don't need to increase the message size limits such as max.message.bytes (Kafka Topic Config) and max.request.size (Flink Producer Config); they are already large enough. I just want to handle the situation where an unrealistically large message is about to be produced. In that case the message should be ignored and an error should be logged, no runtime exception should be thrown, and the endless restart loop should not start.
I tried to use ProducerInterceptor, but it cannot intercept/reject a message; it can only modify it.
I tried to ignore oversized messages in SerializationSchema (by implementing a custom wrapper around SerializationSchema), but it cannot discard a message either.
I am trying to override the KafkaWriter and KafkaSink classes, but that seems challenging.
I will be grateful for any advice!
A few quick environment details:
Kafka version is 2.8.1
Flink code is Java code based on the newer KafkaSource/KafkaSink API, not the older KafkaConsumer/KafkaProducer API.
The flink-clients and flink-connector-kafka version is 1.15.0
Code sample which throws the RecordTooLargeException:
int numberOfRows = 1;
int rowsPerSecond = 1;
DataStream<String> stream = environment.addSource(
                new DataGeneratorSource<>(
                        RandomGenerator.stringGenerator(1050000), // max.message.bytes=1048588
                        rowsPerSecond,
                        (long) numberOfRows),
                TypeInformation.of(String.class))
        .setParallelism(1)
        .name("string-generator");
KafkaSinkBuilder<String> builder = KafkaSink.<String>builder()
        .setBootstrapServers("localhost:9092")
        .setDeliverGuarantee(DeliveryGuarantee.AT_LEAST_ONCE)
        .setRecordSerializer(
                KafkaRecordSerializationSchema.builder().setTopic("test.output")
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build());
KafkaSink<String> sink = builder.build();
stream.sinkTo(sink).setParallelism(1).name("output-producer");
Exception Stack Trace:
2022-06-02/14:01:45.066/PDT [flink-akka.actor.default-dispatcher-4] INFO output-producer: Writer -> output-producer: Committer (1/1) (a66beca5a05c1c27691f7b94ca6ac025) switched from RUNNING to FAILED on 271b1b90-7d6b-4a34-8116-3de6faa8a9bf # 127.0.0.1 (dataPort=-1).
org.apache.flink.util.FlinkRuntimeException: Failed to send data to Kafka null with FlinkKafkaInternalProducer{transactionalId='null', inTransaction=false, closed=false}
at org.apache.flink.connector.kafka.sink.KafkaWriter$WriterCallback.throwException(KafkaWriter.java:440) ~[flink-connector-kafka-1.15.0.jar:1.15.0]
at org.apache.flink.connector.kafka.sink.KafkaWriter$WriterCallback.lambda$onCompletion$0(KafkaWriter.java:421) ~[flink-connector-kafka-1.15.0.jar:1.15.0]
at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50) ~[flink-streaming-java-1.15.0.jar:1.15.0]
at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90) ~[flink-streaming-java-1.15.0.jar:1.15.0]
at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsNonBlocking(MailboxProcessor.java:353) ~[flink-streaming-java-1.15.0.jar:1.15.0]
at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:317) ~[flink-streaming-java-1.15.0.jar:1.15.0]
at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:201) ~[flink-streaming-java-1.15.0.jar:1.15.0]
at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:804) ~[flink-streaming-java-1.15.0.jar:1.15.0]
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:753) ~[flink-streaming-java-1.15.0.jar:1.15.0]
at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:948) ~[flink-runtime-1.15.0.jar:1.15.0]
at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:927) ~[flink-runtime-1.15.0.jar:1.15.0]
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:741) ~[flink-runtime-1.15.0.jar:1.15.0]
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:563) ~[flink-runtime-1.15.0.jar:1.15.0]
at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_292]
Caused by: org.apache.kafka.common.errors.RecordTooLargeException: The message is 1050088 bytes when serialized which is larger than 1048576, which is the value of the max.request.size configuration.
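One way to get the behaviour described in the question (skip the oversized record, log it, keep the job running) is to measure the serialized size before the record ever reaches the sink, for example in a filter step placed in front of it. The sketch below shows only that size check, in plain Python rather than the Flink Java API. The limit of 1048576 bytes is the max.request.size value from the exception message; the function names and the 1024-byte overhead margin are assumptions chosen for illustration, as is the use of UTF-8 for serialization.

```python
import logging

logger = logging.getLogger("oversize-filter")

MAX_REQUEST_SIZE = 1048576  # value reported in the exception message
RECORD_OVERHEAD = 1024      # assumed safety margin for record headers and key


def fits(value, limit=MAX_REQUEST_SIZE, overhead=RECORD_OVERHEAD):
    """Return True if the UTF-8 encoded value is expected to fit in one request."""
    return len(value.encode("utf-8")) + overhead <= limit


def drop_oversized(values, limit=MAX_REQUEST_SIZE, overhead=RECORD_OVERHEAD):
    """Yield only the values that fit; log and skip the rest."""
    for value in values:
        if fits(value, limit, overhead):
            yield value
        else:
            logger.error("Dropping record of %d bytes (limit %d)",
                         len(value.encode("utf-8")), limit)


small = "a" * 1000
large = "b" * 1050000  # same length as the generated string in the question
kept = list(drop_oversized([small, large, small]))
print(len(kept))
```

In a Flink job the same predicate would go into a filter applied to the stream before sinkTo, so the oversized record is never handed to the producer. The margin matters because the serialized record in the stack trace (1050088 bytes) is slightly larger than the 1050000-character payload.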

Spark possible race condition in driver

I have a Spark job that processes several folders on S3 per run and stores its state on DynamoDB. In other words, we're running the job once per day, it looks for new folders added by another job, transforms them one-by-one and writes state to DynamoDB. Here's rough pseudocode:
object App {
  val allFolders = S3Folders.list()
  val foldersToProcess = DynamoDBState.getFoldersToProcess(allFolders)
  Transformer.run(foldersToProcess)
}
object Transformer {
  def run(folders: List[String]): Unit = {
    val sc = new SparkContext()
    folders.foreach(process(sc, _))
  }
  def process(sc: SparkContext, folder: String): Unit = ??? // transform and write to S3
}
This approach works well if S3Folders.list() returns a relatively small number of folders (up to a few thousand). If it returns more (4-8K), we very often see the following error (which at first glance has nothing to do with Spark):
17/10/31 08:38:20 ERROR ApplicationMaster: User class threw exception: shadeaws.SdkClientException: Failed to sanitize XML document destined for handler class shadeaws.services.s3.model.transform.XmlResponsesSaxParser$ListObjectsV2Handler
shadeaws.SdkClientException: Failed to sanitize XML document destined for handler class shadeaws.services.s3.model.transform.XmlResponsesSaxParser$ListObjectsV2Handler
at shadeaws.services.s3.model.transform.XmlResponsesSaxParser.sanitizeXmlDocument(XmlResponsesSaxParser.java:214)
at shadeaws.services.s3.model.transform.XmlResponsesSaxParser.parseListObjectsV2Response(XmlResponsesSaxParser.java:315)
at shadeaws.services.s3.model.transform.Unmarshallers$ListObjectsV2Unmarshaller.unmarshall(Unmarshallers.java:88)
at shadeaws.services.s3.model.transform.Unmarshallers$ListObjectsV2Unmarshaller.unmarshall(Unmarshallers.java:77)
at shadeaws.services.s3.internal.S3XmlResponseHandler.handle(S3XmlResponseHandler.java:62)
at shadeaws.services.s3.internal.S3XmlResponseHandler.handle(S3XmlResponseHandler.java:31)
at shadeaws.http.response.AwsResponseHandlerAdapter.handle(AwsResponseHandlerAdapter.java:70)
at shadeaws.http.AmazonHttpClient$RequestExecutor.handleResponse(AmazonHttpClient.java:1553)
at shadeaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1271)
at shadeaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1055)
at shadeaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:743)
at shadeaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:717)
at shadeaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
at shadeaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
at shadeaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
at shadeaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
at shadeaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4247)
at shadeaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4194)
at shadeaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4188)
at shadeaws.services.s3.AmazonS3Client.listObjectsV2(AmazonS3Client.java:865)
at me.chuwy.transform.S3Folders$.com$chuwy$transform$S3Folders$$isGlacierified(S3Folders.scala:136)
at scala.collection.TraversableLike$$anonfun$filterImpl$1.apply(TraversableLike.scala:248)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247)
at scala.collection.TraversableLike$class.filterNot(TraversableLike.scala:267)
at scala.collection.AbstractTraversable.filterNot(Traversable.scala:104)
at me.chuwy.transform.S3Folders$.list(S3Folders.scala:112)
at me.chuwy.transform.Main$.main(Main.scala:22)
at me.chuwy.transform.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637)
Caused by: shadeaws.AbortedException:
at shadeaws.internal.SdkFilterInputStream.abortIfNeeded(SdkFilterInputStream.java:53)
at shadeaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:81)
at shadeaws.event.ProgressInputStream.read(ProgressInputStream.java:180)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.read1(BufferedReader.java:210)
at java.io.BufferedReader.read(BufferedReader.java:286)
at java.io.Reader.read(Reader.java:140)
at shadeaws.services.s3.model.transform.XmlResponsesSaxParser.sanitizeXmlDocument(XmlResponsesSaxParser.java:186)
... 36 more
For a large number of folders (~20K) this happens every time and the job cannot start.
Previously we had a very similar but much more frequent error, when getFoldersToProcess did a GetItem for every folder in allFolders and therefore took much longer:
17/09/30 14:46:07 ERROR ApplicationMaster: User class threw exception: shadeaws.AbortedException:
shadeaws.AbortedException:
at shadeaws.internal.SdkFilterInputStream.abortIfNeeded(SdkFilterInputStream.java:51)
at shadeaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:71)
at shadeaws.event.ProgressInputStream.read(ProgressInputStream.java:180)
at com.fasterxml.jackson.core.json.ByteSourceJsonBootstrapper.ensureLoaded(ByteSourceJsonBootstrapper.java:489)
at com.fasterxml.jackson.core.json.ByteSourceJsonBootstrapper.detectEncoding(ByteSourceJsonBootstrapper.java:126)
at com.fasterxml.jackson.core.json.ByteSourceJsonBootstrapper.constructParser(ByteSourceJsonBootstrapper.java:215)
at com.fasterxml.jackson.core.JsonFactory._createParser(JsonFactory.java:1240)
at com.fasterxml.jackson.core.JsonFactory.createParser(JsonFactory.java:802)
at shadeaws.http.JsonResponseHandler.handle(JsonResponseHandler.java:109)
at shadeaws.http.JsonResponseHandler.handle(JsonResponseHandler.java:43)
at shadeaws.http.response.AwsResponseHandlerAdapter.handle(AwsResponseHandlerAdapter.java:70)
at shadeaws.http.AmazonHttpClient$RequestExecutor.handleResponse(AmazonHttpClient.java:1503)
at shadeaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1226)
at shadeaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1030)
at shadeaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:742)
at shadeaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:716)
at shadeaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
at shadeaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
at shadeaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
at shadeaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
at shadeaws.services.dynamodbv2.AmazonDynamoDBClient.doInvoke(AmazonDynamoDBClient.java:2089)
at shadeaws.services.dynamodbv2.AmazonDynamoDBClient.invoke(AmazonDynamoDBClient.java:2065)
at shadeaws.services.dynamodbv2.AmazonDynamoDBClient.executeGetItem(AmazonDynamoDBClient.java:1173)
at shadeaws.services.dynamodbv2.AmazonDynamoDBClient.getItem(AmazonDynamoDBClient.java:1149)
at me.chuwy.tranform.sdk.Manifest$.contains(Manifest.scala:179)
at me.chuwy.tranform.DynamoDBState$$anonfun$getUnprocessed$1.apply(ProcessManifest.scala:44)
at scala.collection.TraversableLike$$anonfun$filterImpl$1.apply(TraversableLike.scala:248)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247)
at scala.collection.TraversableLike$class.filterNot(TraversableLike.scala:267)
at scala.collection.AbstractTraversable.filterNot(Traversable.scala:104)
at me.chuwy.transform.DynamoDBState$.getFoldersToProcess(DynamoDBState.scala:44)
at me.chuwy.transform.Main$.main(Main.scala:19)
at me.chuwy.transform.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637)
I believe that the current error has nothing to do with XML parsing or an invalid response, but originates from some race condition inside Spark, because:
There's a clear connection between the amount of time "state-fetching" takes and the chance of failure.
The tracebacks have an underlying AbortedException, which AFAIK is caused by a swallowed InterruptedException, which can mean that something inside the JVM (spark-submit or even YARN) interrupts the main thread.
Right now I'm using EMR AMI 5.5.0, Spark 2.1.0 and shaded AWS SDK 1.11.208, but I had a similar error with AWS SDK 1.10.75.
I'm deploying this job on EMR via command-runner.jar spark-submit --deploy-mode cluster --class ....
Does anyone have any idea where this exception originates from and how to fix it?
foreach does not guarantee ordered computation, and it applies the operation(s) to each element of an RDD, meaning that it will instantiate for every element, which in turn may overwhelm the executor.
The problem was that getFoldersToProcess is a blocking (and very long) operation, which prevents SparkContext from being instantiated. SparkContext itself should signal its own instantiation to YARN, and if that doesn't happen within a certain amount of time, YARN assumes that the driver node has fallen off and kills the whole cluster.
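The fix implied by this answer is a reordering: create the SparkContext first, then do the slow listing and state lookup. The sketch below shows only that ordering, using hypothetical stub functions (make_context, list_folders, filter_unprocessed, process) that stand in for the question's SparkContext, S3Folders.list, DynamoDBState.getFoldersToProcess and Transformer.process; none of them touch Spark or AWS.

```python
def run(make_context, list_folders, filter_unprocessed, process):
    """Create the context before any slow driver-side work, then process folders."""
    # In the real job this is where SparkContext would be created.
    context = make_context()
    # The slow, blocking calls happen only after the context exists.
    folders = filter_unprocessed(list_folders())
    for folder in folders:
        process(context, folder)
    return folders


calls = []  # records the order in which the stubs are invoked


def make_context():
    calls.append("context")
    return object()


def list_folders():
    calls.append("list")
    return ["a", "b", "c"]


def filter_unprocessed(folders):
    calls.append("filter")
    return [f for f in folders if f != "b"]


def process(context, folder):
    calls.append("process:" + folder)


processed = run(make_context, list_folders, filter_unprocessed, process)
print(calls)
```

In YARN cluster mode the wait for SparkContext initialization is governed by spark.yarn.am.waitTime, so raising that setting may be an alternative; reordering has the advantage of not depending on a timeout at all.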

Why does a small data size cause a GC overhead limit exception?

I use Spark to do some calculations.
Basically I do two things:
New files arrive in a folder periodically.
I turn each new file into a data frame and then append it to the previous data frame.
(You may ask why I read the files in a loop. There are two reasons:
1. The files do not arrive all at once; they arrive periodically, so I cannot read them in one go.
2. Streaming could do this, but I do not want to use it, because I would need to set up a long window, which is inconvenient to debug and test.
)
The code is below:
from hdfs import InsecureClient
from pyspark.sql import Row
from pyspark.sql import functions as func

# sc and sqlContext are assumed to be provided by the PySpark shell

# Get the file list in the HDFS directory
client = InsecureClient('http://10.79.148.184:50070')
file_list = client.list('/test')
df_total = None
counter = 0
for file in file_list:
    counter += 1
    # turn each file (CSV format) into data frame
    lines = sc.textFile("/test/%s" % file)
    parts = lines.map(lambda l: l.split(","))
    rows = parts.map(lambda p: Row(router=p[0], interface=int(p[1]), protocol=p[7], bit=int(p[10])))
    df = sqlContext.createDataFrame(rows)
    # do some transform on the data frame
    df_protocol = df.groupBy(['protocol']).agg(func.sum('bit').alias('bit'))
    # add the current data frame to previous data frame set
    if df_total is None:
        df_total = df_protocol
    else:
        df_total = df_total.unionAll(df_protocol)
    # cache the df_total
    df_total.cache()
    if counter % 5 == 0:
        df_total.rdd.checkpoint()
    # get the df_total information
    df_total.show()
I know that as time goes on, df_total could become big. But the code above raises an exception long before that happens.
After about 30 iterations, the code throws a GC overhead limit exceeded exception. The files are very small, so even after 300 iterations the data size would only be a few MB. I do not know why it throws a GC error.
The exception details are below:
Exception in thread "dispatcher-event-loop-0" java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.Integer.toString(Integer.java:331)
at java.lang.Integer.toString(Integer.java:739)
at java.lang.String.valueOf(String.java:2854)
at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:197)
at org.apache.spark.storage.RDDBlockId.name(BlockId.scala:53)
at org.apache.spark.storage.BlockId.equals(BlockId.scala:46)
at java.util.HashMap.getEntry(HashMap.java:471)
at java.util.HashMap.get(HashMap.java:421)
at org.apache.spark.storage.BlockManagerMasterEndpoint.org$apache$spark$storage$BlockManagerMasterEndpoint$$getLocations(BlockManagerMasterEndpoint.scala:371)
at org.apache.spark.storage.BlockManagerMasterEndpoint$$anonfun$org$apache$spark$storage$BlockManagerMasterEndpoint$$getLocationsMultipleBlockIds$1.apply(BlockManagerMasterEndpoint.scala:376)
at org.apache.spark.storage.BlockManagerMasterEndpoint$$anonfun$org$apache$spark$storage$BlockManagerMasterEndpoint$$getLocationsMultipleBlockIds$1.apply(BlockManagerMasterEndpoint.scala:376)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at org.apache.spark.storage.BlockManagerMasterEndpoint.org$apache$spark$storage$BlockManagerMasterEndpoint$$getLocationsMultipleBlockIds(BlockManagerMasterEndpoint.scala:376)
at org.apache.spark.storage.BlockManagerMasterEndpoint$$anonfun$receiveAndReply$1.applyOrElse(BlockManagerMasterEndpoint.scala:72)
at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:104)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
16/04/20 09:52:00 ERROR TaskSchedulerImpl: Lost executor 0 on ES01: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
16/04/20 09:52:12 ERROR TransportRequestHandler: Error sending result RpcResponse{requestId=4721950849479578179, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=47 cap=47]}} to ES01/10.79.148.184:53059; closing connection
io.netty.handler.codec.EncoderException: java.lang.OutOfMemoryError: Java heap space
at io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:107)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:633)
at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:691)
at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:626)
at io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:284)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:633)
at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:32)
at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:908)
at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:960)
at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:893)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.OutOfMemoryError: Java heap space
at io.netty.buffer.PoolArena$HeapArena.newChunk(PoolArena.java:602)
at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:204)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:132)
at io.netty.buffer.PooledByteBufAllocator.newHeapBuffer(PooledByteBufAllocator.java:256)
at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:136)
at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:127)
at org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:77)
at org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:33)
at io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:89)
... 13 more
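A possible contributor, offered as a hypothesis rather than a diagnosis: the data is tiny, but the number of partitions (and therefore the number of cached blocks the driver has to track) is not. Assuming the default spark.sql.shuffle.partitions of 200, each groupBy result has 200 partitions, unionAll concatenates the partitions of both sides, and every iteration caches a new, larger df_total without unpersisting the previous one. The arithmetic below sketches that growth; the function names are illustrative.

```python
SHUFFLE_PARTITIONS = 200  # assumed default of spark.sql.shuffle.partitions


def partitions_after(iterations, per_file=SHUFFLE_PARTITIONS):
    """Partitions in df_total after the given number of loop iterations."""
    return per_file * iterations


def cached_blocks_upper_bound(iterations, per_file=SHUFFLE_PARTITIONS):
    """Upper bound on cached blocks if every intermediate df_total stays cached."""
    return sum(partitions_after(i, per_file) for i in range(1, iterations + 1))


for n in (1, 5, 30):
    print(n, partitions_after(n), cached_blocks_upper_bound(n))
```

Under these assumptions, 30 iterations give a df_total with 6000 partitions and up to 93000 cached blocks, which would fit the BlockManagerMasterEndpoint frames in the stack trace. If this is the cause, coalescing df_protocol to a handful of partitions before the union and calling unpersist() on the previous df_total should keep both numbers small.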

How to determine maximum amount of data that can be handled by 1 run of MR2 job?

I am running a YARN job on a CDH 5.3 cluster with the default configuration:
Number of nodes=3
yarn.nodemanager.resource.cpu-vcores=8
yarn.nodemanager.resource.memory-mb=10GB
mapreduce.[map/reduce].cpu.vcores=1
mapreduce.[map/reduce].memory.mb=1GB
mapreduce.[map | reduce].java.opts.max.heap=756MB
While running on 4.5GB of CSV data spread over 11 files, I get the following error:
2015-10-12 05:21:04,507 FATAL [IPC Server handler 18 on 50388] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1444634391081_0005_r_000000_0 - exited : org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#9
at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:56)
at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:46)
at org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput.<init>(InMemoryMapOutput.java:63)
at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.unconditionalReserve(MergeManagerImpl.java:303)
at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.reserve(MergeManagerImpl.java:293)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:511)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:329)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)
Then I changed mapreduce.reduce.memory.mb from 1GB to 3GB and the job ran fine.
So how do I decide the maximum amount of data that one reducer can handle, assuming that all the mapper input has to be processed by a single reducer?
Generally there is no limit on the amount of data that a single reducer can process. The memory allocation can slow the process down, but it should not restrict or prevent the data from being processed. I believe that once the minimum memory has been allocated to the reducer, data processing should not be an issue. Could you please share a code snippet so we can check for memory leaks?
We used to process files of 6+ GB in a single reducer without any issues. I believe you might have a memory leak.
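The failing frame in the question is InMemoryMapOutput, i.e. the reducer holding fetched map output in the heap during the shuffle. How much heap the shuffle may claim is controlled by two fractions, mapreduce.reduce.shuffle.input.buffer.percent and mapreduce.reduce.shuffle.memory.limit.percent. The sketch below works through the arithmetic for the 756 MB heap from the question, assuming the stock defaults of 0.70 and 0.25; those defaults are an assumption here, so check mapred-default.xml for your distribution.

```python
def shuffle_budget_mb(heap_mb, input_buffer_percent=0.70, memory_limit_percent=0.25):
    """Return (total in-memory shuffle budget, largest single map output kept in memory)."""
    total = heap_mb * input_buffer_percent
    single = total * memory_limit_percent
    return total, single


total, single = shuffle_budget_mb(756)
print("shuffle buffer: %.1f MB, per map output: %.1f MB" % (total, single))
```

With roughly 70% of a 756 MB heap reserved for fetched map output, little is left for everything else in the reducer, which would fit with the job succeeding once more memory was given. Lowering the input buffer fraction is another knob to try; this is a reading of the stack trace, not a confirmed root cause.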
