How to use the HDFS sink correctly in Flink? - java

I am playing around with the Twitter connector in Apache Flink and now want to save some streamed data in my local HDFS instance. The Flink documentation has a small BucketingSink example, but my program always quits with the following error:
Exception in thread "main" org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply$mcV$sp(JobManager.scala:933)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:876)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:876)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.ExceptionInInitializerError
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:92)
at org.apache.hadoop.security.Groups.<init>(Groups.java:76)
at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:239)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:232)
at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:718)
at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:703)
at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:605)
at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2473)
at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2465)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2331)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:369)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink.initFileSystem(BucketingSink.java:418)
at org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink.initializeState(BucketingSink.java:352)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.tryRestoreFunction(StreamingFunctionUtils.java:177)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.restoreFunctionState(StreamingFunctionUtils.java:159)
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.initializeState(AbstractUdfStreamOperator.java:105)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:251)
at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeOperators(StreamTask.java:678)
at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:666)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:252)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
at java.base/java.lang.Thread.run(Thread.java:844)
Caused by: java.lang.StringIndexOutOfBoundsException: begin 0, end 3, length 1
at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3116)
at java.base/java.lang.String.substring(String.java:1885)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:49)
... 25 more
Do you have any ideas what went wrong with my code? I use the initial Twitter connector example for testing, and my environment is built with a Docker container for HDFS. The ports are correctly mapped from Docker to my local machine, and I can also check the status of HDFS on the web UI.
Here is my code approach:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment().setParallelism(1);
Properties props = new Properties();
props.setProperty(TwitterSource.CONSUMER_KEY, "KEY");
props.setProperty(TwitterSource.CONSUMER_SECRET, "SECRET");
props.setProperty(TwitterSource.TOKEN, "TOKEN");
props.setProperty(TwitterSource.TOKEN_SECRET, "TOKENSECRET");
DataStream<String> streamSource = env.addSource(new TwitterSource(props));
DataStream<Tuple2<String, Integer>> tweets = streamSource
// selecting German tweets and splitting to (word, 1)
.flatMap(new SelectGermanAndTokenizeFlatMap())
// group by words and sum their occurrences
.keyBy(0)
.timeWindow(Time.minutes(1), Time.seconds(30))
.sum(1);
BucketingSink<Tuple2<String, Integer>> sink = new BucketingSink<>("hdfs://localhost:8020/flink/twitter-test");
sink.setBucketer(new DateTimeBucketer<Tuple2<String, Integer>>("yyyy-MM-dd--HHmm"));
sink.setBatchSize(1024 * 1024 * 400);
tweets.addSink(sink);
//tweets.print();
env.execute("Twitter Streaming Example");
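The innermost cause in the trace above is a substring call in the static initializer of Hadoop's Shell class, failing with "begin 0, end 3, length 1". The sketch below reproduces only that failure mode in isolation. It assumes, which the trace does not confirm, that the string being parsed is the java.version system property, which newer JDKs can report as a short value such as "9". The class and method names are hypothetical, and this is not Hadoop's code.

```java
final class VersionPrefixSketch {

    // Fragile check: blindly take the first three characters (e.g. "1.7", "1.8").
    // Throws StringIndexOutOfBoundsException when the input is shorter than 3 chars.
    static String fragilePrefix(String version) {
        return version.substring(0, 3);
    }

    // More tolerant variant: never read past the end of the string.
    static String safePrefix(String version) {
        return version.substring(0, Math.min(3, version.length()));
    }

    public static void main(String[] args) {
        // An old-style version string works with both variants.
        System.out.println(fragilePrefix("1.8.0_151"));
        System.out.println(safePrefix("9"));

        // A one-character version string breaks the fragile variant.
        try {
            fragilePrefix("9");
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("fragile check failed: " + e.getMessage());
        }
    }
}
```

If that assumption holds for your setup, running the job on a JVM whose version string has the older 1.x form, or using a Hadoop version that handles the newer format, would be the first things to try.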

Related

Apache Flink got exception: java.lang.IllegalStateException: Trying to work with offloaded serialized shuffle descriptors

I am using Flink on the cluster. As I submitted the task, I got the following exception:
Caused by: java.util.concurrent.CompletionException: java.lang.IllegalStateException: Trying to work with offloaded serialized shuffle descriptors.
at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:925)
at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:913)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
at org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.lambda$invokeRpc$0(AkkaInvocationHandler.java:234)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:1079)
at akka.dispatch.OnComplete.internal(Future.scala:263)
at akka.dispatch.OnComplete.internal(Future.scala:261)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:191)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:73)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:572)
at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:101)
at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:999)
at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:458)
... 9 more
Caused by: java.lang.IllegalStateException: Trying to work with offloaded serialized shuffle descriptors.
at org.apache.flink.runtime.deployment.InputGateDeploymentDescriptor.getShuffleDescriptors(InputGateDeploymentDescriptor.java:150)
at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGateFactory.create(SingleInputGateFactory.java:125)
at org.apache.flink.runtime.io.network.NettyShuffleEnvironment.createInputGates(NettyShuffleEnvironment.java:261)
at org.apache.flink.runtime.taskmanager.Task.<init>(Task.java:420)
at org.apache.flink.runtime.taskexecutor.TaskExecutor.submitTask(TaskExecutor.java:737)
at sun.reflect.GeneratedMethodAccessor32.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$handleRpcInvocation$1(AkkaRpcActor.java:316)
at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:83)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:314)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:217)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:163)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20)
at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
at akka.actor.Actor.aroundReceive(Actor.scala:537)
at akka.actor.Actor.aroundReceive$(Actor.scala:535)
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:220)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580)
at akka.actor.ActorCell.invoke(ActorCell.scala:548)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270)
at akka.dispatch.Mailbox.run(Mailbox.scala:231)
at akka.dispatch.Mailbox.exec(Mailbox.scala:243)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
Flink version: 1.13.6;
Scala version: 2.11
Kafka version: 2.2.2
Part of my code:
object batchProcess {
def main(args:Array[String]): Unit = {
val host = "localhost"
val port = 6379
val env = StreamExecutionEnvironment.getExecutionEnvironment
// read from kafka
val source = KafkaSource.builder[String].setBootstrapServers("localhost:9092")
.setTopics("movie_rating_records").setGroupId("my-group").setStartingOffsets(OffsetsInitializer.earliest)
.setValueOnlyDeserializer(new SimpleStringSchema())
.setBounded(OffsetsInitializer.latest).build()
// val inputDataStream = env.readTextFile("a.txt")
val inputDataStream = env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source")
val dataStream = inputDataStream
.map( data =>{
val arr = data.split(",")
( arr(0),arr(1).toInt,arr(2).toInt,arr(3).toFloat,arr(4).toLong)
})
val (counterUserIdPos,counterUserIdNeg,counterMovieIdPos,counterMovieIdNeg,counterUserId2MovieId) = commonProcess(dataStream)
counterUserIdPos.map(x =>{
val jedisIns = new Jedis(host,port,100000)
jedisIns.set("batch2feature_userId_rating1_"+x._1.toString, x._2.toString)
jedisIns.close()
})
env.execute("test")
}
}
The input stream from Kafka consists of comma-separated strings, for example: 1542295208rating,556,112852,1.0,1542295208. The code above parses each string and passes the result to another DataStream processing function. Finally, it writes the result to Redis.
Any help or hints on resolving the issue would be greatly appreciated!
Here are a few pointers I can think of:
The innermost frames of the stack trace are in Flink's Netty-based network layer (NettyShuffleEnvironment.createInputGates), reached while the TaskExecutor sets up a task => the error is likely occurring around one of the .map operators, not when interacting with Kafka or Redis.
Serialization issues sometimes happen in Flink when using Scala. Maybe the second .map is somehow causing a connection pool or some other context instance to be serialized into the lambda, so replacing it with a Flink SinkFunction might help (and would also improve performance, since you'd create only one Jedis instance per partition).
Also investigate what serialization is going on in commonProcess.
Essentially, you should be hunting for a place where the code needs to serialize some instance whose type would confuse Flink's serialization mechanism.
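To illustrate the pointer about lambdas: the sketch below uses only plain Java serialization, not Flink, to show how a lambda that captures a non-serializable object stops being serializable, while one that creates the object inside its body does not have that problem. FakeConnection is a hypothetical stand-in for a client such as Jedis, and all names are made up for illustration; whether this is what happens in the job above is an assumption.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.function.Function;

final class ClosureSerializationSketch {

    // A function type that must be serializable, like user functions shipped to workers.
    interface SerFn<A, B> extends Function<A, B>, Serializable {}

    // Hypothetical client object; deliberately NOT Serializable.
    static class FakeConnection {
        String send(String payload) {
            return "sent:" + payload;
        }
    }

    // Tries to serialize the object with plain Java serialization.
    static boolean isSerializable(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // The connection is created outside the lambda and captured by it,
    // so serializing the lambda also tries to serialize the connection.
    static SerFn<String, String> capturing() {
        FakeConnection conn = new FakeConnection();
        return s -> conn.send(s);
    }

    // The connection is created inside the lambda body; nothing is captured.
    static SerFn<String, String> selfContained() {
        return s -> new FakeConnection().send(s);
    }

    public static void main(String[] args) {
        System.out.println("capturing serializable?      " + isSerializable(capturing()));
        System.out.println("self-contained serializable? " + isSerializable(selfContained()));
    }
}
```

Creating the client inside the function keeps it out of the serialized closure but opens a connection per record; a sink with an explicit open/close lifecycle, as suggested above, avoids both problems.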

Not able to run flink application which deserializes avro data coming from a kafka topic

I am trying to read Avro data from a Kafka topic using a Flink application, and I get the error below when running the app. This is my first time working with Flink/Kafka, and I haven't been able to fix this for days.
org.apache.flink.runtime.client.JobInitializationException: Could not start the JobMaster.
at org.apache.flink.runtime.jobmaster.DefaultJobMasterServiceProcess.lambda$new$0(DefaultJobMasterServiceProcess.java:97)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1609)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.CompletionException: java.lang.IllegalStateException: Failed to rollback to checkpoint/savepoint s3://rinc-ingestion-service/flink-savepoints/namespaces/default/deployments/e8f1afd0-9236-43c2-93cb-db16604da594/1a6716af-6357-4ae3-95da-dad0bdb1f7cc/savepoint-f101ee-feb4e8023b09. Cannot map checkpoint/savepoint state for operator 7df19f87deec5680128845fd9a6ca18d to the new program, because the operator is not available in the new program. If you want to allow to skip this, you can set the --allowNonRestoredState option on the CLI.
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1606)
... 3 more
Caused by: java.lang.IllegalStateException: Failed to rollback to checkpoint/savepoint s3://rinc-ingestion-service/flink-savepoints/namespaces/default/deployments/e8f1afd0-9236-43c2-93cb-db16604da594/1a6716af-6357-4ae3-95da-dad0bdb1f7cc/savepoint-f101ee-feb4e8023b09. Cannot map checkpoint/savepoint state for operator 7df19f87deec5680128845fd9a6ca18d to the new program, because the operator is not available in the new program. If you want to allow to skip this, you can set the --allowNonRestoredState option on the CLI.
at org.apache.flink.runtime.checkpoint.Checkpoints.throwNonRestoredStateException(Checkpoints.java:230)
at org.apache.flink.runtime.checkpoint.Checkpoints.loadAndValidateCheckpoint(Checkpoints.java:194)
at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1648)
at org.apache.flink.runtime.scheduler.DefaultExecutionGraphFactory.tryRestoreExecutionGraphFromSavepoint(DefaultExecutionGraphFactory.java:163)
at org.apache.flink.runtime.scheduler.DefaultExecutionGraphFactory.createAndRestoreExecutionGraph(DefaultExecutionGraphFactory.java:138)
at org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:335)
at org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:191)
at org.apache.flink.runtime.scheduler.DefaultScheduler.<init>(DefaultScheduler.java:140)
at org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:134)
at org.apache.flink.runtime.jobmaster.DefaultSlotPoolServiceSchedulerFactory.createScheduler(DefaultSlotPoolServiceSchedulerFactory.java:110)
at org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:346)
at org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:323)
at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.internalCreateJobMasterService(DefaultJobMasterServiceFactory.java:106)
at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.lambda$createJobMasterService$0(DefaultJobMasterServiceFactory.java:94)
at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedSupplier$4(FunctionUtils.java:112)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
... 3 more
The Flink app code is below.
import io.confluent.kafka.serializers.KafkaAvroDeserializerConfig;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import testRecord.DataRecordAvro;
public class KafkaAvroDeserialize {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
KafkaSource<DataRecordAvro> source = KafkaSource.<DataRecordAvro>builder()
.setBootstrapServers("pkc-2396y.us-east-1.aws.confluent.cloud:9092")
.setTopics("test")
.setGroupId("demo-consumer-avro-1")
.setProperty(KafkaAvroDeserializerConfig.SCHEMA_REGISTRY_URL_CONFIG, "https://psrc-1wydj.us-east-2.aws.confluent.cloud")
.setProperty("basic.auth.credentials.source", "USER_INFO")
.setProperty("basic.auth.user.info", "API_KEY:API_SECRET") // credentials redacted
.setProperty("advertised.host.name", "pkc-2396y.us-east-1.aws.confluent.cloud:9092")
.setValueOnlyDeserializer(new AvroDeserializer<>(DataRecordAvro.class))
.setStartingOffsets(OffsetsInitializer.earliest())
.build();
DataStreamSource<DataRecordAvro> input = env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source");
input.map(record -> "DESERIALIZED: " + record.getMsgType() + "-" + record.get(0) + "-" + record.get(1) + "-" + record.get(2)).print();
env.execute("Printing the payload");
}
}
The clue is this entry from the logs:
Caused by: java.lang.IllegalStateException: Failed to rollback to checkpoint/savepoint s3://rinc-ingestion-service/flink-savepoints/namespaces/default/deployments/e8f1afd0-9236-43c2-93cb-db16604da594/1a6716af-6357-4ae3-95da-dad0bdb1f7cc/savepoint-f101ee-feb4e8023b09. Cannot map checkpoint/savepoint state for operator 7df19f87deec5680128845fd9a6ca18d to the new program, because the operator is not available in the new program. If you want to allow to skip this, you can set the --allowNonRestoredState option on the CLI.
You've made changes to the job since it was last run, and as a result there's state in a checkpoint or savepoint that cannot be restored.
You'll want to read this section from the documentation -- https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/ops/state/savepoints/#assigning-operator-ids -- which explains the role played by operator IDs, and the importance of setting explicit IDs so that you can evolve how your job uses state without running into problems like this.
If you don't mind starting over from scratch (and abandoning whatever state your job has), then the easiest way forward is to resubmit the job without having it try to restart from a checkpoint or savepoint. In your case the only state is in the KafkaSource, so hopefully dropping the state won't be painful.
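The check behind that error message can be pictured with a simplified model. The sketch below is not Flink's implementation: the class and method names are hypothetical, and operator IDs are plain strings. It only shows the rule stated in the log, namely that every operator ID with state in the savepoint must exist in the new program unless skipping non-restored state is allowed.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

final class SavepointMappingSketch {

    // Returns the operator IDs whose state can be mapped to the new job.
    // Throws if the savepoint holds state for an operator the job no longer has
    // and skipping such state is not allowed.
    static Set<String> restorableOperators(Set<String> savepointOperatorIds,
                                           Set<String> jobOperatorIds,
                                           boolean allowNonRestoredState) {
        Set<String> restored = new LinkedHashSet<>();
        for (String id : savepointOperatorIds) {
            if (jobOperatorIds.contains(id)) {
                restored.add(id);
            } else if (!allowNonRestoredState) {
                throw new IllegalStateException(
                        "Cannot map savepoint state for operator " + id + " to the new program");
            }
            // Otherwise the state for this operator is dropped.
        }
        return restored;
    }

    public static void main(String[] args) {
        // Hypothetical IDs: the job was changed, so one operator ID no longer exists.
        Set<String> savepoint = new LinkedHashSet<>(Arrays.asList("kafka-source", "old-map"));
        Set<String> newJob = new LinkedHashSet<>(Arrays.asList("kafka-source", "new-map"));

        System.out.println(restorableOperators(savepoint, newJob, true));
        try {
            restorableOperators(savepoint, newJob, false);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With explicit, stable IDs the same ID appears on both sides after a code change, which is why the documentation linked above recommends assigning them.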

Spark Streaming with Elasticsearch connector throws JVM_Bind error

I am using Spark 2.1.1 in Java and elasticsearch-spark-20_2.11 (version 5.3.2) to write data to Elasticsearch. I create a JavaStreamingContext, which I then set to await termination, so the application should always retrieve new data.
After I read the stream, I split it into RDDs, and for each one I apply SQL aggregations and then write the result to Elasticsearch as follows:
recordStream.foreachRDD(rdd -> {
if (rdd.count() > 0) {
/*
* Create RDD from JSON
*/
Dataset<Row> df = spark.read().json(rdd.rdd());
df.createOrReplaceTempView("data");
df.cache();
/*
* Apply the aggregations
*/
Dataset<Row> aggregators = spark.sql(ORDER_TYPE_DB);
JavaEsSparkSQL.saveToEs(aggregators.toDF(), "order_analytics/record");
aggregators = spark.sql(ORDER_CUSTOMER_DB);
JavaEsSparkSQL.saveToEs(aggregators.toDF(), "customer_analytics/record");
}
});
This works fine the first time data is read and inserted into Elasticsearch, but when more data is retrieved by the stream, I get the following error:
org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:250)
at org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:546)
at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:58)
at org.elasticsearch.spark.sql.EsSparkSQL$$anonfun$saveToEs$1.apply(EsSparkSQL.scala:94)
at org.elasticsearch.spark.sql.EsSparkSQL$$anonfun$saveToEs$1.apply(EsSparkSQL.scala:94)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.hadoop.rest.EsHadoopTransportException: java.net.BindException: Address already in use: JVM_Bind
at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:129)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:461)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:425)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:429)
at org.elasticsearch.hadoop.rest.RestClient.get(RestClient.java:155)
at org.elasticsearch.hadoop.rest.RestClient.remoteEsVersion(RestClient.java:627)
at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:243)
... 10 more
Caused by: java.net.BindException: Address already in use: JVM_Bind
at java.net.DualStackPlainSocketImpl.bind0(Native Method)
at java.net.DualStackPlainSocketImpl.socketBind(DualStackPlainSocketImpl.java:106)
at java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:387)
at java.net.PlainSocketImpl.bind(PlainSocketImpl.java:190)
at java.net.Socket.bind(Socket.java:644)
at java.net.Socket.<init>(Socket.java:433)
at java.net.Socket.<init>(Socket.java:286)
at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80)
at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122)
at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707)
at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387)
at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
at org.elasticsearch.hadoop.rest.commonshttp.CommonsHttpTransport.execute(CommonsHttpTransport.java:478)
at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:112)
... 16 more
Any ideas what the problem could be?
Spark uses default configuration and is instantiated in Java as
SparkConf conf = new SparkConf().setAppName(topic).setMaster("local");
JavaStreamingContext streamingContext = new JavaStreamingContext(conf, Durations.seconds(2));
Elasticsearch is configured via Docker compose with the following environment parameters:
- cluster.name=cp-es-cluster
- node.name=cloud1
- http.cors.enabled=true
- http.cors.allow-origin="*"
- network.host=0.0.0.0
- discovery.zen.ping.unicast.hosts=${ENV_IP}
- network.publish_host=${ENV_IP}
- discovery.zen.minimum_master_nodes=1
- xpack.security.enabled=false
- xpack.monitoring.enabled=false

TApplicationException exception when running a mapreduce job on an Accumulo Table

I am running a map reduce job taking data from a table in Accumulo as the input and storing the result in another table in Accumulo. To do this, I am using the AccumuloInputFormat and AccumuloOutputFormat classes. Here is the code
public int run(String[] args) throws Exception {
Opts opts = new Opts();
opts.parseArgs(PivotTable.class.getName(), args);
Configuration conf = getConf();
conf.set("formula", opts.formula);
Job job = Job.getInstance(conf);
job.setJobName("Pivot Table Generation");
job.setJarByClass(PivotTable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setMapperClass(PivotTableMapper.class);
job.setCombinerClass(PivotTableCombiber.class);
job.setReducerClass(PivotTableReducer.class);
job.setInputFormatClass(AccumuloInputFormat.class);
ClientConfiguration zkConfig = new ClientConfiguration().withInstance(opts.getInstance().getInstanceName()).withZkHosts(opts.getInstance().getZooKeepers());
AccumuloInputFormat.setInputTableName(job, opts.dataTable);
AccumuloInputFormat.setZooKeeperInstance(job, zkConfig);
AccumuloInputFormat.setConnectorInfo(job, opts.getPrincipal(), new PasswordToken(opts.getPassword().value));
job.setOutputFormatClass(AccumuloOutputFormat.class);
BatchWriterConfig bwConfig = new BatchWriterConfig();
AccumuloOutputFormat.setBatchWriterOptions(job, bwConfig);
AccumuloOutputFormat.setZooKeeperInstance(job, zkConfig);
AccumuloOutputFormat.setConnectorInfo(job, opts.getPrincipal(), new PasswordToken(opts.getPassword().value));
AccumuloOutputFormat.setDefaultTableName(job, opts.pivotTable);
AccumuloOutputFormat.setCreateTables(job, true);
return job.waitForCompletion(true) ? 0 : 1;
}
PivotTable is the name of the class that contains the main method (and this run method too). I have written the mapper, combiner and reducer classes as well. But when I try to execute this job, I get an error:
Exception in thread "main" java.io.IOException: org.apache.accumulo.core.client.AccumuloException: org.apache.thrift.TApplicationException: Internal error processing hasTablePermission
at org.apache.accumulo.core.client.mapreduce.lib.impl.InputConfigurator.validatePermissions(InputConfigurator.java:707)
at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.validateOptions(AbstractInputFormat.java:397)
at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.getSplits(AbstractInputFormat.java:668)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
at com.latize.ulysses.accumulo.postprocess.PivotTable.run(PivotTable.java:247)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at com.latize.ulysses.accumulo.postprocess.PivotTable.main(PivotTable.java:251)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: org.apache.accumulo.core.client.AccumuloException: org.apache.thrift.TApplicationException: Internal error processing hasTablePermission
at org.apache.accumulo.core.client.impl.SecurityOperationsImpl.execute(SecurityOperationsImpl.java:87)
at org.apache.accumulo.core.client.impl.SecurityOperationsImpl.hasTablePermission(SecurityOperationsImpl.java:220)
at org.apache.accumulo.core.client.mapreduce.lib.impl.InputConfigurator.validatePermissions(InputConfigurator.java:692)
... 21 more
Caused by: org.apache.thrift.TApplicationException: Internal error processing hasTablePermission
at org.apache.thrift.TApplicationException.read(TApplicationException.java:111)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:71)
at org.apache.accumulo.core.client.impl.thrift.ClientService$Client.recv_hasTablePermission(ClientService.java:641)
at org.apache.accumulo.core.client.impl.thrift.ClientService$Client.hasTablePermission(ClientService.java:624)
at org.apache.accumulo.core.client.impl.SecurityOperationsImpl$8.execute(SecurityOperationsImpl.java:223)
at org.apache.accumulo.core.client.impl.SecurityOperationsImpl$8.execute(SecurityOperationsImpl.java:220)
at org.apache.accumulo.core.client.impl.ServerClient.executeRaw(ServerClient.java:79)
at org.apache.accumulo.core.client.impl.SecurityOperationsImpl.execute(SecurityOperationsImpl.java:73)
Can someone tell me what I am doing wrong here? Any help would be appreciated.
EDIT : I am running Accumulo 1.7.0
A TApplicationException indicates the error occurred on the Accumulo tablet server, rather than in your client (MapReduce) code. You'll need to examine your tablet server logs to get more information about the particular error wherever you see TApplicationException.
Table permissions are usually retrieved from ZooKeeper, so it may indicate a problem with the tserver connecting to ZooKeeper.
Unfortunately, I don't see the hostname or the IP in the stack trace, so you may have to check all the tserver logs to find it.
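When such an exception is caught in client code, it is usually wrapped several layers deep, as in the trace above (IOException, then AccumuloException, then TApplicationException). Below is a small sketch of a helper that walks the cause chain and reports whether any cause has a given class name. It uses only the JDK and matches by simple class name so that it needs no Thrift dependency; the class and method names are hypothetical.

```java
import java.io.IOException;

final class CauseChainSketch {

    // Walks t and its causes, returning true if any of them has the given simple class name.
    // The depth limit guards against pathological cyclic cause chains.
    static boolean hasCauseNamed(Throwable t, String simpleName) {
        int depth = 0;
        for (Throwable c = t; c != null && depth < 64; c = c.getCause(), depth++) {
            if (c.getClass().getSimpleName().equals(simpleName)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Stand-in for a wrapped server-side error; the real types would come from Accumulo/Thrift.
        Exception wrapped = new IOException(new RuntimeException(new IllegalStateException("server side")));

        System.out.println(hasCauseNamed(wrapped, "IllegalStateException"));
        System.out.println(hasCauseNamed(wrapped, "TApplicationException"));
    }
}
```

A check like hasCauseNamed(e, "TApplicationException") could then decide whether to point the operator at the tablet server logs instead of the client logs.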

BigQuery - How to set read timeout in the Java client library

I am using Spark to load some data into BigQuery. The idea is to read data from S3 and use Spark and the BigQuery client API to load it. Below is the code that does the insert into BigQuery.
val bq = createAuthorizedClientWithDefaultCredentialsFromStream(appName, credentialStream)
val bqjob = bq.jobs().insert(pid, job, data).execute() // data is a InputStream content
With this approach, I am seeing a lot of SocketTimeoutExceptions.
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:170)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
at sun.security.ssl.InputRecord.read(InputRecord.java:503)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:954)
at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:911)
at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:703)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:647)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1534)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1439)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:338)
at com.google.api.client.http.javanet.NetHttpResponse.<init>(NetHttpResponse.java:37)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:94)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:972)
at com.google.api.client.googleapis.media.MediaHttpUploader.executeCurrentRequestWithoutGZip(MediaHttpUploader.java:545)
at com.google.api.client.googleapis.media.MediaHttpUploader.executeCurrentRequest(MediaHttpUploader.java:562)
at com.google.api.client.googleapis.media.MediaHttpUploader.resumableUpload(MediaHttpUploader.java:419)
at com.google.api.client.googleapis.media.MediaHttpUploader.upload(MediaHttpUploader.java:336)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:427)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
It looks like the delay in reading from S3 causes the Google HTTP client to time out. I wanted to increase the timeout and tried the options below.
val req = bq.jobs().insert(pid, job, data).buildHttpRequest()
req.setReadTimeout(3 * 60 * 1000)
val res = req.execute()
But this causes a precondition failure in the BigQuery client: buildHttpRequest() expects the mediaUploader to be null, though I am not sure why.
Exception in thread "main" java.lang.IllegalArgumentException
at com.google.api.client.repackaged.com.google.common.base.Preconditions.checkArgument(Preconditions.java:76)
at com.google.api.client.util.Preconditions.checkArgument(Preconditions.java:37)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.buildHttpRequest(AbstractGoogleClientRequest.java:297)
This led me to try the second insert API on BigQuery:
val req = bq.jobs().insert(pid, job).buildHttpRequest().setReadTimeout(3 * 60 * 1000).setContent(data)
val res = req.execute()
And this time it failed with a different error.
Exception in thread "main" com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad Request
{
"code" : 400,
"errors" : [ {
"domain" : "global",
"message" : "Job configuration must contain exactly one job-specific configuration object (e.g., query, load, extract, spreadsheetExtract), but there were 0: ",
"reason" : "invalid"
} ],
"message" : "Job configuration must contain exactly one job-specific configuration object (e.g., query, load, extract, spreadsheetExtract), but there were 0: "
}
Please suggest how I can set the timeout, and point out anything I am doing wrong.
I'll answer the main question from the title: how to set timeouts using the Java client library.
To set timeouts, you need a custom HttpRequestInitializer configured in your client. For example:
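// UrlFetchTransport is the App Engine transport; outside App Engine, use NetHttpTransport instead.
Bigquery.Builder builder =
    new Bigquery.Builder(new UrlFetchTransport(), new JacksonFactory(), credential);
// Keep the initializer the builder already has (the credential) so requests stay authorized.
final HttpRequestInitializer existing = builder.getHttpRequestInitializer();
builder.setHttpRequestInitializer(new HttpRequestInitializer() {
  @Override
  public void initialize(HttpRequest request) throws IOException {
    existing.initialize(request);
    // READ_TIMEOUT and CONNECTION_TIMEOUT are int constants you define, in milliseconds.
    request
        .setReadTimeout(READ_TIMEOUT)
        .setConnectTimeout(CONNECTION_TIMEOUT);
  }
});
Bigquery client = builder.build();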
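For intuition about what the two setters above control: the stack trace in the question goes through NetHttpRequest and java.net.HttpURLConnection, and as far as I know the library passes its timeouts on to that connection, which uses the same convention (milliseconds, with 0 meaning "wait forever"). The sketch below uses only the JDK to show those two knobs in isolation. The class name `TimeoutSketch` and the timeout values are hypothetical, picked for illustration.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public final class TimeoutSketch {
    // Hypothetical values, in milliseconds; 0 would mean "no timeout".
    static final int CONNECT_TIMEOUT_MS = 60 * 1000;
    static final int READ_TIMEOUT_MS = 3 * 60 * 1000;

    private TimeoutSketch() {}

    /** Creates (but does not open) a connection with both timeouts applied. */
    public static HttpURLConnection configure(URL url) throws IOException {
        // openConnection() only builds the object; no network I/O happens
        // until connect() is called or the response is read.
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        // Upper bound on establishing the TCP connection.
        connection.setConnectTimeout(CONNECT_TIMEOUT_MS);
        // Upper bound on each blocking read of the response; exceeding it
        // raises the SocketTimeoutException("Read timed out") from the question.
        connection.setReadTimeout(READ_TIMEOUT_MS);
        return connection;
    }
}
```

The read timeout applies to each wait for response bytes, not to the whole transfer, so it bounds how long a single response read may stall rather than the total duration of the upload.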
I don't think this will solve all the issues you are facing. A few ideas that might be helpful, but I don't fully understand the scenario so these may be off track:
If you are moving large files: consider staging them on GCS before loading them into BigQuery.
If you are using media upload to send the data with your request: these can't be too large or you risk timeouts or network connection failures.
If you are running an embarrassingly parallel data migration, and the data chunks are relatively small, bigquery.tabledata.insertAll may be more appropriate for large fan-in scenarios like this. See https://cloud.google.com/bigquery/streaming-data-into-bigquery for more details.
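In the same spirit as staging on GCS: if the source stream (S3 in the question) really is what stalls the upload, which I can't confirm from the stack trace alone, one way to decouple the two is to copy the stream to a local temporary file first and upload from there. This is a minimal JDK-only sketch. The names `StreamSpooler` and `spoolToTempFile` are hypothetical, and it assumes the worker has enough local disk for the data.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public final class StreamSpooler {
    private StreamSpooler() {}

    /**
     * Drains the given stream into a new temporary file and returns its path.
     * The source stream is closed once the copy finishes or fails.
     */
    public static Path spoolToTempFile(InputStream source) throws IOException {
        Path tmp = Files.createTempFile("bq-upload-", ".tmp");
        try (InputStream in = source) {
            // createTempFile already created the file, so REPLACE_EXISTING is
            // required for Files.copy to overwrite it.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        }
        return tmp;
    }
}
```

The returned path can then be reopened with Files.newInputStream(path) and passed wherever the original stream was used. Delete the file once the load job has been submitted.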
Thanks for the question!
