BigQuery - How to set read timeout in the Java client library

I am using Spark to load some data into BigQuery. The idea is to read data from S3 and use Spark and the BigQuery client API to load it. Below is the code that does the insert into BigQuery:
val bq = createAuthorizedClientWithDefaultCredentialsFromStream(appName, credentialStream)
val bqjob = bq.jobs().insert(pid, job, data).execute() // data is an InputStreamContent
With this approach, I am seeing a lot of SocketTimeoutExceptions:
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:170)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
at sun.security.ssl.InputRecord.read(InputRecord.java:503)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:954)
at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:911)
at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:703)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:647)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1534)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1439)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:338)
at com.google.api.client.http.javanet.NetHttpResponse.<init>(NetHttpResponse.java:37)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:94)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:972)
at com.google.api.client.googleapis.media.MediaHttpUploader.executeCurrentRequestWithoutGZip(MediaHttpUploader.java:545)
at com.google.api.client.googleapis.media.MediaHttpUploader.executeCurrentRequest(MediaHttpUploader.java:562)
at com.google.api.client.googleapis.media.MediaHttpUploader.resumableUpload(MediaHttpUploader.java:419)
at com.google.api.client.googleapis.media.MediaHttpUploader.upload(MediaHttpUploader.java:336)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:427)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
It looks like the delay in reading from S3 causes the Google HTTP client to time out. I wanted to increase the timeout and tried the following:
val req = bq.jobs().insert(pid, job, data).buildHttpRequest()
req.setReadTimeout(3 * 60 * 1000)
val res = req.execute()
But this causes a precondition failure in the client: it expects the mediaUploader to be null, though I'm not sure why.
Exception in thread "main" java.lang.IllegalArgumentException
at com.google.api.client.repackaged.com.google.common.base.Preconditions.checkArgument(Preconditions.java:76)
at com.google.api.client.util.Preconditions.checkArgument(Preconditions.java:37)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.buildHttpRequest(AbstractGoogleClientRequest.java:297)
This led me to try the other insert API on BigQuery:
val req = bq.jobs().insert(pid, job).buildHttpRequest().setReadTimeout(3 * 60 * 1000).setContent(data)
val res = req.execute()
And this time it failed with a different error.
Exception in thread "main" com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad Request
{
"code" : 400,
"errors" : [ {
"domain" : "global",
"message" : "Job configuration must contain exactly one job-specific configuration object (e.g., query, load, extract, spreadsheetExtract), but there were 0: ",
"reason" : "invalid"
} ],
"message" : "Job configuration must contain exactly one job-specific configuration object (e.g., query, load, extract, spreadsheetExtract), but there were 0: "
}
Please suggest how I can set the timeout, and point out if I am doing something wrong.

I'll answer the main question from the title: how to set timeouts using the Java client library.
To set timeouts, you need a custom HttpRequestInitializer configured in your client. For example:
Bigquery.Builder builder =
    new Bigquery.Builder(new UrlFetchTransport(), new JacksonFactory(), credential);
final HttpRequestInitializer existing = builder.getHttpRequestInitializer();
builder.setHttpRequestInitializer(new HttpRequestInitializer() {
    @Override
    public void initialize(HttpRequest request) throws IOException {
        existing.initialize(request);
        request
            .setReadTimeout(READ_TIMEOUT)
            .setConnectTimeout(CONNECTION_TIMEOUT);
    }
});
Bigquery client = builder.build();
I don't think this will solve all the issues you are facing. Here are a few ideas that might be helpful; I don't fully understand the scenario, so these may be off track:
If you are moving large files: consider staging them on GCS before loading them into BigQuery (see the sketch after this list).
If you are using media upload to send the data with your request: these can't be too large or you risk timeouts or network connection failures.
If you are running an embarrassingly parallel data migration, and the data chunks are relatively small, bigquery.tabledata.insertAll may be more appropriate for large fan-in scenarios like this. See https://cloud.google.com/bigquery/streaming-data-into-bigquery for more details.
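To illustrate the GCS-staging idea, here is a minimal, hedged sketch of submitting a load job that points BigQuery at a gs:// URI instead of streaming the bytes through a media upload. The bucket, dataset, and table names are hypothetical placeholders; bq and pid are the authorized client and project id from the question. It uses the no-media jobs().insert(projectId, job) overload, so the Job must carry a load configuration, which is exactly what the 400 "must contain exactly one job-specific configuration object" error was complaining about.
import java.util.Collections;
import com.google.api.services.bigquery.model.Job;
import com.google.api.services.bigquery.model.JobConfiguration;
import com.google.api.services.bigquery.model.JobConfigurationLoad;
import com.google.api.services.bigquery.model.TableReference;

// Hypothetical names; replace with your own bucket/dataset/table.
JobConfigurationLoad load = new JobConfigurationLoad()
    .setSourceUris(Collections.singletonList("gs://my-staging-bucket/exported-from-s3.csv"))
    .setSourceFormat("CSV")
    .setDestinationTable(new TableReference()
        .setProjectId(pid)
        .setDatasetId("my_dataset")
        .setTableId("my_table"));

Job job = new Job().setConfiguration(new JobConfiguration().setLoad(load));

// No media upload is involved, so the read-timeout pressure from the slow S3 stream goes away.
Job submitted = bq.jobs().insert(pid, job).execute();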
Thanks for the question!

Related

AWS SDK is throwing SdkClientException

I am attempting to retrieve a secret from AWS Secrets Manager. My application can retrieve 3 of 4 secrets. When attempting to retrieve the 2nd secret, I get an interrupt exception. The code below is what I am using. I am unsure where to look or why this is happening. It consistently happens when attempting to retrieve the 2nd secret. I have verified that it exists in AWS and that it contains the value I expect.
I am using the Reactiverse library [https://github.com/reactiverse/aws-sdk] in tandem with Vert.x + AWS SDK 2. Any help or pointers on where to look would be greatly appreciated. The exception I see and the code I have are written below.
java.util.concurrent.ExecutionException: software.amazon.awssdk.core.exception.SdkClientException
at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395)
at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1999)
at io.vertx.core.impl.ContextImpl.lambda$null$0(ContextImpl.java:159)
at io.vertx.core.impl.AbstractContext.dispatch(AbstractContext.java:100)
at io.vertx.core.impl.ContextImpl.lambda$executeBlocking$1(ContextImpl.java:157)
at io.vertx.core.impl.TaskQueue.run(TaskQueue.java:76)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: software.amazon.awssdk.core.exception.SdkClientException
at software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:98)
at software.amazon.awssdk.core.internal.http.AmazonAsyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonAsyncHttpClient.java:189)
at software.amazon.awssdk.core.internal.handler.BaseAsyncClientHandler.invoke(BaseAsyncClientHandler.java:271)
at software.amazon.awssdk.core.internal.handler.BaseAsyncClientHandler.doExecute(BaseAsyncClientHandler.java:213)
at software.amazon.awssdk.core.internal.handler.BaseAsyncClientHandler.lambda$execute$1(BaseAsyncClientHandler.java:91)
at software.amazon.awssdk.core.internal.handler.BaseAsyncClientHandler.measureApiCallSuccess(BaseAsyncClientHandler.java:277)
at software.amazon.awssdk.core.internal.handler.BaseAsyncClientHandler.execute(BaseAsyncClientHandler.java:74)
at software.amazon.awssdk.awscore.client.handler.AwsAsyncClientHandler.execute(AwsAsyncClientHandler.java:52)
at software.amazon.awssdk.services.secretsmanager.DefaultSecretsManagerAsyncClient.getSecretValue(DefaultSecretsManagerAsyncClient.java:1142)
... 8 more
Caused by: software.amazon.awssdk.core.exception.SdkInterruptedException
at software.amazon.awssdk.core.internal.http.InterruptMonitor.checkInterrupted(InterruptMonitor.java:40)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApplyTransactionIdStage.execute(ApplyTransactionIdStage.java:43)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApplyTransactionIdStage.execute(ApplyTransactionIdStage.java:29)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.AmazonAsyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonAsyncHttpClient.java:185)
... 16 more
private final SecretsManagerAsyncClient client = VertxSdkClient.withVertx(SecretsManagerAsyncClient.builder(), ctx).build();

public <T> Promise<T> getSecret(String secretName) {
    Promise<T> completableFuture = Promise.promise();
    Context ctx = Vertx.vertx().getOrCreateContext();
    ctx.owner().executeBlocking(blockingFuture -> {
        try {
            GetSecretValueRequest valueRequest = GetSecretValueRequest.builder().secretId(secretName).build();
            CompletableFuture<GetSecretValueResponse> valueResponse = this.client.getSecretValue(valueRequest);
            GetSecretValueResponse getResponse = valueResponse.get();
            String response = getResponse.secretString();
            completableFuture.complete((T) Json.decodeValue(response, new <T>Object().getClass()));
            blockingFuture.complete();
        } catch (Exception e) {
            // log exception.
        }
    });
    ...
}
See SdkInterruptedException: it has one method, getResponse():
try {
    // provoke the exception in here
} catch (SdkInterruptedException e) {
    System.err.println(e.getResponse());
}
This should (hopefully) at least tell you why the request was interrupted.
// log exception. alone is not the correct approach to handling any Exception.
The SdkClientException might be more relevant than the SdkInterruptedException:
try {
    ...
} catch (SdkClientException e) {
    // log that exception, instead of ignoring it.
    System.err.println(e.getMessage());
}
This is basically how an Authorization header (AWS Signature Version 4) should look.
It may well be that HMAC-SHA256 headers are being enforced (they're different on each single request). I could even provide a known-to-be-working example... which would show how to configure the SecretsManagerClient. I'd suspect the root cause is related to that client's configuration.
When I read "when attempting to retrieve the 2nd secret", this most likely means that you need to refresh the HMAC-SHA256 signature before the second request is performed. Destroying the client and constructing it again may be one sloppy workaround. If you see duplicate values for the AWS4-HMAC-SHA256 header on both requests, this is definitely the cause.
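As a hedged illustration of the client-configuration point (not the answer's known-to-be-working example), a SecretsManagerAsyncClient with an explicit region and credentials provider might be built like this; the region and the use of the default provider chain are assumptions:
import software.amazon.awssdk.auth.credentials.DefaultCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.secretsmanager.SecretsManagerAsyncClient;

// Region and credentials provider are assumptions; adjust them to your environment.
SecretsManagerAsyncClient client = SecretsManagerAsyncClient.builder()
        .region(Region.AP_SOUTHEAST_1)                              // be explicit about the region
        .credentialsProvider(DefaultCredentialsProvider.create())   // env vars, profile, or instance role
        .build();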

Apache Flink got exception: java.lang.IllegalStateException: Trying to work with offloaded serialized shuffle descriptors

I am using Flink on a cluster. When I submitted the job, I got the following exception:
Caused by: java.util.concurrent.CompletionException: java.lang.IllegalStateException: Trying to work with offloaded serialized shuffle descriptors.
at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:925)
at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:913)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
at org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.lambda$invokeRpc$0(AkkaInvocationHandler.java:234)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:1079)
at akka.dispatch.OnComplete.internal(Future.scala:263)
at akka.dispatch.OnComplete.internal(Future.scala:261)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:191)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:73)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:572)
at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:101)
at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:999)
at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:458)
... 9 more
Caused by: java.lang.IllegalStateException: Trying to work with offloaded serialized shuffle descriptors.
at org.apache.flink.runtime.deployment.InputGateDeploymentDescriptor.getShuffleDescriptors(InputGateDeploymentDescriptor.java:150)
at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGateFactory.create(SingleInputGateFactory.java:125)
at org.apache.flink.runtime.io.network.NettyShuffleEnvironment.createInputGates(NettyShuffleEnvironment.java:261)
at org.apache.flink.runtime.taskmanager.Task.<init>(Task.java:420)
at org.apache.flink.runtime.taskexecutor.TaskExecutor.submitTask(TaskExecutor.java:737)
at sun.reflect.GeneratedMethodAccessor32.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$handleRpcInvocation$1(AkkaRpcActor.java:316)
at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:83)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:314)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:217)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:163)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20)
at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
at akka.actor.Actor.aroundReceive(Actor.scala:537)
at akka.actor.Actor.aroundReceive$(Actor.scala:535)
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:220)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580)
at akka.actor.ActorCell.invoke(ActorCell.scala:548)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270)
at akka.dispatch.Mailbox.run(Mailbox.scala:231)
at akka.dispatch.Mailbox.exec(Mailbox.scala:243)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
Flink version: 1.13.6;
Scala version: 2.11
Kafka version: 2.2.2
Part of my code:
object batchProcess {
  def main(args: Array[String]): Unit = {
    val host = "localhost"
    val port = 6379
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // read from kafka
    val source = KafkaSource.builder[String].setBootstrapServers("localhost:9092")
      .setTopics("movie_rating_records").setGroupId("my-group").setStartingOffsets(OffsetsInitializer.earliest)
      .setValueOnlyDeserializer(new SimpleStringSchema())
      .setBounded(OffsetsInitializer.latest).build()

    // val inputDataStream = env.readTextFile("a.txt")
    val inputDataStream = env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source")

    val dataStream = inputDataStream
      .map(data => {
        val arr = data.split(",")
        (arr(0), arr(1).toInt, arr(2).toInt, arr(3).toFloat, arr(4).toLong)
      })

    val (counterUserIdPos, counterUserIdNeg, counterMovieIdPos, counterMovieIdNeg, counterUserId2MovieId) = commonProcess(dataStream)

    counterUserIdPos.map(x => {
      val jedisIns = new Jedis(host, port, 100000)
      jedisIns.set("batch2feature_userId_rating1_" + x._1.toString, x._2.toString)
      jedisIns.close()
    })

    env.execute("test")
  }
}
The input stream from Kafka is a comma-separated string, for example: rating,556,112852,1.0,1542295208. The above code processes the string and passes the resulting tuples into another datastream processing function (commonProcess). Finally, it writes the result into Redis.
Any help or hints on resolving the issue would be greatly appreciated!
Here are a few pointers I can think of:
Netty is part of Flink's internal data-exchange (shuffle) machinery => from the stack trace, the error is likely occurring around one of the .map operations, not when interacting with Kafka or Redis.
Serialization issues sometimes happen in Flink when using Scala. Maybe the second .map is somehow causing connection pools or some other context instance to be serialized into the lambda, so replacing it with a Flink SinkFunction might help (in addition to improving performance, since you'd only create one Jedis instance per partition); see the sketch after these pointers.
Also investigate what serialization is going on in commonProcess.
Essentially, you should be hunting for a place where the code somehow needs to serialize some instance whose type would confuse the Flink serialization mechanism.
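A minimal sketch of that SinkFunction idea, in Java (assuming the counters have been mapped to key/value string pairs first; the Redis key prefix is taken from the question, everything else is a placeholder):
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import redis.clients.jedis.Jedis;

// Opens one Jedis connection per parallel subtask instead of one per element,
// and keeps the non-serializable Jedis instance out of the serialized closure.
public class RedisCounterSink extends RichSinkFunction<Tuple2<String, String>> {
    private transient Jedis jedis;

    @Override
    public void open(Configuration parameters) {
        jedis = new Jedis("localhost", 6379, 100000);
    }

    @Override
    public void invoke(Tuple2<String, String> value, Context context) {
        jedis.set("batch2feature_userId_rating1_" + value.f0, value.f1);
    }

    @Override
    public void close() {
        if (jedis != null) {
            jedis.close();
        }
    }
}
It would be attached via addSink(new RedisCounterSink()) after mapping counterUserIdPos to (key, value) string pairs, instead of constructing a Jedis inside a .map.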

Flink FileSync with AWS S3 plugin throw error when accessing S3 access point - "null uri host"

After following this instruction, I am able to access the S3 bucket via an access point + VPC endpoint perfectly fine from the AWS CLI.
Basically I use
s3://arn:aws:s3:ap-southeast-1:<account number>:accesspoint/<bucket name>
the same way as I use
s3://<bucket name>
All aws s3 ... commands works great.
However, that's not the case for my Java-based Flink project code. The code works great with s3://<bucket name>, but it seems that it does not recognize the new S3 URI.
Here is how the sink is defined in my code:
final FileSink<ConsumerRecordPOJO<CacheInfo>> sink = FileSink //
    .<ConsumerRecordPOJO<CacheInfo>>forRowFormat(new Path(s3Url),
        new Encoder<ConsumerRecordPOJO<CacheInfo>>() {
            @Override
            public void encode(ConsumerRecordPOJO<CacheInfo> record, OutputStream stream)
                    throws IOException {
                GzipParameters params = new GzipParameters();
                params.setCompressionLevel(Deflater.BEST_COMPRESSION);
                GzipCompressorOutputStream out = new GzipCompressorOutputStream(stream, params);
                OBJECT_MAPPER.writeValue(out, record);
                out.finish();
            }
        }) //
    // (some extra configuration omitted here)
    .build();
After passing s3://arn:aws:s3:ap-southeast-1:<account number>:accesspoint/<bucket name> to the s3Url param, the job execution failed with
2021-11-26 22:14:34,085 WARN org.apache.flink.runtime.taskmanager.Task [] - Source: kafka -> Filter -> Map -> Sink file (1/1)#3 (c654160d3fab026c4544ca8a64644796) switched from INITIALIZING to FAILED with failure cause: org.apache.flink.util.FlinkRuntimeException: Could not create writer state serializer.
at org.apache.flink.connector.file.sink.FileSink.getWriterStateSerializer(FileSink.java:135)
at org.apache.flink.streaming.runtime.operators.sink.SinkOperatorFactory.createStreamOperator(SinkOperatorFactory.java:63)
at org.apache.flink.streaming.api.operators.StreamOperatorFactoryUtil.createOperator(StreamOperatorFactoryUtil.java:81)
at org.apache.flink.streaming.runtime.tasks.OperatorChain.createOperator(OperatorChain.java:712)
at org.apache.flink.streaming.runtime.tasks.OperatorChain.createOperatorChain(OperatorChain.java:686)
at org.apache.flink.streaming.runtime.tasks.OperatorChain.createOutputCollector(OperatorChain.java:626)
at org.apache.flink.streaming.runtime.tasks.OperatorChain.createOperatorChain(OperatorChain.java:676)
at org.apache.flink.streaming.runtime.tasks.OperatorChain.createOutputCollector(OperatorChain.java:626)
at org.apache.flink.streaming.runtime.tasks.OperatorChain.createOperatorChain(OperatorChain.java:676)
at org.apache.flink.streaming.runtime.tasks.OperatorChain.createOutputCollector(OperatorChain.java:626)
at org.apache.flink.streaming.runtime.tasks.OperatorChain.<init>(OperatorChain.java:187)
at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.<init>(RegularOperatorChain.java:63)
at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreInternal(StreamTask.java:666)
at org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:654)
at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:958)
at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:927)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:766)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:575)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: null uri host.
at org.apache.flink.fs.s3.common.AbstractS3FileSystemFactory.create(AbstractS3FileSystemFactory.java:162)
at org.apache.flink.core.fs.PluginFileSystemFactory.create(PluginFileSystemFactory.java:62)
at org.apache.flink.core.fs.FileSystem.getUnguardedFileSystem(FileSystem.java:508)
at org.apache.flink.core.fs.FileSystem.get(FileSystem.java:409)
at org.apache.flink.connector.file.sink.FileSink$RowFormatBuilder.createBucketWriter(FileSink.java:326)
at org.apache.flink.connector.file.sink.FileSink$RowFormatBuilder.getWriterStateSerializer(FileSink.java:307)
at org.apache.flink.connector.file.sink.FileSink.getWriterStateSerializer(FileSink.java:130)
... 18 more
Caused by: java.lang.NullPointerException: null uri host.
at java.util.Objects.requireNonNull(Objects.java:228)
at org.apache.hadoop.fs.s3native.S3xLoginHelper.buildFSURI(S3xLoginHelper.java:71)
at org.apache.hadoop.fs.s3a.S3AFileSystem.setUri(S3AFileSystem.java:486)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:246)
at org.apache.flink.fs.s3.common.AbstractS3FileSystemFactory.create(AbstractS3FileSystemFactory.java:123)
... 24 more
It turns out I could use the S3 access point alias instead, which works perfectly with Flink.
See https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-points-alias.html
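A hedged sketch of what that looks like for the sink path (the alias below is a made-up placeholder; the real one is shown on the access point's detail page in the S3 console, always ends in -s3alias, and can be used anywhere a plain bucket name is accepted, so the s3a URI parser gets a valid host):
import org.apache.flink.core.fs.Path;

// Hypothetical alias; copy the real one from the S3 console.
String s3Url = "s3://myaccesspoint-abcd1234efgh5678ijkl-s3alias/flink-output";

// Same FileSink construction as above, just with the alias-based URI.
Path outputPath = new Path(s3Url);
// FileSink.<ConsumerRecordPOJO<CacheInfo>>forRowFormat(outputPath, encoder)...build();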

Vertx WebClient Using Too Many File Descriptors

I am trying to run a Vert.x WebClient within AWS Lambda. AWS Lambdas have a maximum of 1024 file descriptors, and this limit is non-adjustable. I am having trouble finding out what exactly is using up all of my file descriptors. I am only using the WebClient from Vert.x and not running any verticles. Here is my shared WebClient:
var webClientOptions = new WebClientOptions()
.setSsl(true)
.setUseAlpn(true)
.setReusePort(true)
.setDefaultPort(443)
.setHttp2MaxPoolSize(30)
.setProtocolVersion(HttpVersion.HTTP_2)
.setSslHandshakeTimeout(30)
.setIdleTimeout(5)
.setIdleTimeoutUnit(TimeUnit.SECONDS);
return WebClient.create(vertx, webClientOptions);
This client is reused between lambda invocations, and never closed.
How I am using it:
var requestFutures = new ArrayList<CompletableFuture<Foo>>();
for (var request : requests) {
CompletableFuture<Foo> future = new CompletableFuture();
requestFutures.add(future);
Buffer buffer = Buffer.buffer(request.getPayload());
httpRequest.sendBuffer(buffer, (res) -> {
future.complete(res);
});
}
for (var requestFuture : requestFutures) {
requestFuture.get();
}
I have 45k-65k requests to send. What I get is:
io.netty.channel.ChannelException: Failed to open a socket.
at io.netty.channel.socket.nio.NioSocketChannel.newSocket(NioSocketChannel.java:70)
at io.netty.channel.socket.nio.NioSocketChannel.<init>(NioSocketChannel.java:87)
at io.netty.channel.socket.nio.NioSocketChannel.<init>(NioSocketChannel.java:80)
at io.netty.bootstrap.AbstractBootstrap.initAndRegister(AbstractBootstrap.java:320)
at io.netty.bootstrap.Bootstrap.doResolveAndConnect(Bootstrap.java:163)
at io.netty.bootstrap.Bootstrap.connect(Bootstrap.java:145)
at io.vertx.core.net.impl.ChannelProvider.handleConnect(ChannelProvider.java:140)
at io.vertx.core.net.impl.ChannelProvider.connect(ChannelProvider.java:93)
at io.vertx.core.http.impl.HttpChannelConnector.doConnect(HttpChannelConnector.java:171)
at io.vertx.core.http.impl.HttpChannelConnector.connect(HttpChannelConnector.java:105)
at io.vertx.core.http.impl.pool.Pool$Holder.connect(Pool.java:125)
at io.vertx.core.http.impl.pool.Pool.checkPendingTasks(Pool.java:250)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:474)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.net.SocketException: Too many open files
at java.base/sun.nio.ch.Net.socket0(Native Method)
at java.base/sun.nio.ch.Net.socket(Unknown Source)
at java.base/sun.nio.ch.Net.socket(Unknown Source)
at java.base/sun.nio.ch.SocketChannelImpl.<init>(Unknown Source)
at java.base/sun.nio.ch.SelectorProviderImpl.openSocketChannel(Unknown Source)
at io.netty.channel.socket.nio.NioSocketChannel.newSocket(NioSocketChannel.java:68)
... 17 more
When I run this locally and watch the open files/sockets in the macOS Activity Monitor, the count shoots up to thousands. I don't understand why it's doing this, based on how I've configured the WebClientOptions.
Can someone tell from this information why I would be using so many file descriptors?
setHttp2MaxPoolSize(30) is a lot; in theory you only need 1, because HTTP/2 multiplexes many concurrent requests over a single connection.
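A hedged sketch of a leaner configuration (the multiplexing limit is an assumption added to illustrate the knob, not a recommended value; the server's own concurrent-stream limit still applies):
import io.vertx.core.Vertx;
import io.vertx.core.http.HttpVersion;
import io.vertx.ext.web.client.WebClient;
import io.vertx.ext.web.client.WebClientOptions;

// One HTTP/2 connection per origin; requests are multiplexed as streams on it,
// so the number of open sockets (and thus file descriptors) stays small.
WebClientOptions options = new WebClientOptions();
options.setSsl(true);
options.setUseAlpn(true);
options.setDefaultPort(443);
options.setProtocolVersion(HttpVersion.HTTP_2);
options.setHttp2MaxPoolSize(1);          // a single connection per origin is usually enough
options.setHttp2MultiplexingLimit(100);  // assumed cap on concurrent streams per connection

WebClient client = WebClient.create(Vertx.vertx(), options);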

Spark possible race condition in driver

I have a Spark job that processes several folders on S3 per run and stores its state in DynamoDB. In other words, we run the job once per day; it looks for new folders added by another job, transforms them one by one, and writes state to DynamoDB. Here's rough pseudocode:
object App {
  val allFolders = S3Folders.list()
  val foldersToProcess = DynamoDBState.getFoldersToProcess(allFolders)
  Transformer.run(foldersToProcess)
}

object Transformer {
  def run(folders: List[String]): Unit = {
    val sc = new SparkContext()
    folders.foreach(process(sc, _))
  }

  def process(sc: SparkContext, folder: String): Unit = ??? // transform and write to S3
}
This approach works well if S3Folders.list() returns a relatively small number of folders (up to a few thousand). If it returns more (4-8K), we very often see the following error (which at first glance has nothing to do with Spark):
17/10/31 08:38:20 ERROR ApplicationMaster: User class threw exception: shadeaws.SdkClientException: Failed to sanitize XML document destined for handler class shadeaws.services.s3.model.transform.XmlResponsesSaxParser$ListObjectsV2Handler
shadeaws.SdkClientException: Failed to sanitize XML document destined for handler class shadeaws.services.s3.model.transform.XmlResponsesSaxParser$ListObjectsV2Handler
at shadeaws.services.s3.model.transform.XmlResponsesSaxParser.sanitizeXmlDocument(XmlResponsesSaxParser.java:214)
at shadeaws.services.s3.model.transform.XmlResponsesSaxParser.parseListObjectsV2Response(XmlResponsesSaxParser.java:315)
at shadeaws.services.s3.model.transform.Unmarshallers$ListObjectsV2Unmarshaller.unmarshall(Unmarshallers.java:88)
at shadeaws.services.s3.model.transform.Unmarshallers$ListObjectsV2Unmarshaller.unmarshall(Unmarshallers.java:77)
at shadeaws.services.s3.internal.S3XmlResponseHandler.handle(S3XmlResponseHandler.java:62)
at shadeaws.services.s3.internal.S3XmlResponseHandler.handle(S3XmlResponseHandler.java:31)
at shadeaws.http.response.AwsResponseHandlerAdapter.handle(AwsResponseHandlerAdapter.java:70)
at shadeaws.http.AmazonHttpClient$RequestExecutor.handleResponse(AmazonHttpClient.java:1553)
at shadeaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1271)
at shadeaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1055)
at shadeaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:743)
at shadeaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:717)
at shadeaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
at shadeaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
at shadeaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
at shadeaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
at shadeaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4247)
at shadeaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4194)
at shadeaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4188)
at shadeaws.services.s3.AmazonS3Client.listObjectsV2(AmazonS3Client.java:865)
at me.chuwy.transform.S3Folders$.com$chuwy$transform$S3Folders$$isGlacierified(S3Folders.scala:136)
at scala.collection.TraversableLike$$anonfun$filterImpl$1.apply(TraversableLike.scala:248)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247)
at scala.collection.TraversableLike$class.filterNot(TraversableLike.scala:267)
at scala.collection.AbstractTraversable.filterNot(Traversable.scala:104)
at me.chuwy.transform.S3Folders$.list(S3Folders.scala:112)
at me.chuwy.transform.Main$.main(Main.scala:22)
at me.chuwy.transform.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637)
Caused by: shadeaws.AbortedException:
at shadeaws.internal.SdkFilterInputStream.abortIfNeeded(SdkFilterInputStream.java:53)
at shadeaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:81)
at shadeaws.event.ProgressInputStream.read(ProgressInputStream.java:180)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.read1(BufferedReader.java:210)
at java.io.BufferedReader.read(BufferedReader.java:286)
at java.io.Reader.read(Reader.java:140)
at shadeaws.services.s3.model.transform.XmlResponsesSaxParser.sanitizeXmlDocument(XmlResponsesSaxParser.java:186)
... 36 more
For a big number of folders (~20K) this happens all the time, and the job cannot start.
Previously we had a very similar, but much more frequent, error when getFoldersToProcess did a GetItem for every folder from allFolders and therefore took much longer:
17/09/30 14:46:07 ERROR ApplicationMaster: User class threw exception: shadeaws.AbortedException:
shadeaws.AbortedException:
at shadeaws.internal.SdkFilterInputStream.abortIfNeeded(SdkFilterInputStream.java:51)
at shadeaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:71)
at shadeaws.event.ProgressInputStream.read(ProgressInputStream.java:180)
at com.fasterxml.jackson.core.json.ByteSourceJsonBootstrapper.ensureLoaded(ByteSourceJsonBootstrapper.java:489)
at com.fasterxml.jackson.core.json.ByteSourceJsonBootstrapper.detectEncoding(ByteSourceJsonBootstrapper.java:126)
at com.fasterxml.jackson.core.json.ByteSourceJsonBootstrapper.constructParser(ByteSourceJsonBootstrapper.java:215)
at com.fasterxml.jackson.core.JsonFactory._createParser(JsonFactory.java:1240)
at com.fasterxml.jackson.core.JsonFactory.createParser(JsonFactory.java:802)
at shadeaws.http.JsonResponseHandler.handle(JsonResponseHandler.java:109)
at shadeaws.http.JsonResponseHandler.handle(JsonResponseHandler.java:43)
at shadeaws.http.response.AwsResponseHandlerAdapter.handle(AwsResponseHandlerAdapter.java:70)
at shadeaws.http.AmazonHttpClient$RequestExecutor.handleResponse(AmazonHttpClient.java:1503)
at shadeaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1226)
at shadeaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1030)
at shadeaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:742)
at shadeaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:716)
at shadeaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
at shadeaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
at shadeaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
at shadeaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
at shadeaws.services.dynamodbv2.AmazonDynamoDBClient.doInvoke(AmazonDynamoDBClient.java:2089)
at shadeaws.services.dynamodbv2.AmazonDynamoDBClient.invoke(AmazonDynamoDBClient.java:2065)
at shadeaws.services.dynamodbv2.AmazonDynamoDBClient.executeGetItem(AmazonDynamoDBClient.java:1173)
at shadeaws.services.dynamodbv2.AmazonDynamoDBClient.getItem(AmazonDynamoDBClient.java:1149)
at me.chuwy.tranform.sdk.Manifest$.contains(Manifest.scala:179)
at me.chuwy.tranform.DynamoDBState$$anonfun$getUnprocessed$1.apply(ProcessManifest.scala:44)
at scala.collection.TraversableLike$$anonfun$filterImpl$1.apply(TraversableLike.scala:248)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247)
at scala.collection.TraversableLike$class.filterNot(TraversableLike.scala:267)
at scala.collection.AbstractTraversable.filterNot(Traversable.scala:104)
at me.chuwy.transform.DynamoDBState$.getFoldersToProcess(DynamoDBState.scala:44)
at me.chuwy.transform.Main$.main(Main.scala:19)
at me.chuwy.transform.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637)
I believe that the current error has nothing to do with XML parsing or an invalid response, but originates from some race condition inside Spark, because:
There's a clear connection between the amount of time the "state-fetching" takes and the chance of failure.
The tracebacks have an underlying AbortedException, which AFAIK is caused by a swallowed InterruptedException, which can mean something inside the JVM (spark-submit or even YARN) calls Thread.sleep on the main thread.
Right now I'm using EMR AMI 5.5.0, Spark 2.1.0 and shaded AWS SDK 1.11.208, but had similar error with AWS SDK 1.10.75.
I'm deploying this job on EMR via command-runner.jar spark-submit --deploy-mode cluster --class ....
Does anyone have any idea where this exception originates from and how to fix it?
foreach does not guarantee ordered computation, and it applies the operation(s) to each element of an RDD, meaning that it will instantiate the operation for every element, which, in turn, may overwhelm the executor.
The problem was that getFoldersToProcess is a blocking (and very long) operation, which prevents the SparkContext from being instantiated. The SparkContext should signal its own instantiation to YARN, and if that doesn't happen within a certain amount of time, YARN assumes the driver node has fallen off and kills the whole cluster. A sketch of one possible reordering follows.
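This is not the poster's actual fix, just a hedged illustration in Java that mirrors the Scala pseudocode above with hypothetical stand-in helpers: create the SparkContext before the long blocking state fetch, so the ApplicationMaster registers with YARN immediately.
import java.util.Collections;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class App {
    public static void main(String[] args) {
        // Register with YARN first; the blocking S3/DynamoDB scans happen afterwards,
        // so the ApplicationMaster is not presumed dead while the state fetch runs.
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("transform"));

        List<String> allFolders = listS3Folders();                        // hypothetical stand-in for S3Folders.list()
        List<String> foldersToProcess = getFoldersToProcess(allFolders);  // stand-in for DynamoDBState.getFoldersToProcess
        for (String folder : foldersToProcess) {
            process(sc, folder);                                          // transform and write, as in the pseudocode
        }
    }

    private static List<String> listS3Folders() { return Collections.emptyList(); }    // placeholder
    private static List<String> getFoldersToProcess(List<String> all) { return all; }  // placeholder
    private static void process(JavaSparkContext sc, String folder) { /* transform & write to S3 */ }
}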
