How to set AvroCoder with KafkaIO and Apache Beam with Java - java

I'm trying to create a pipeline that streams data from a Kafka topic to Google BigQuery. The data in the topic is in Avro format.
I call the apply function three times: once to read from Kafka, once to extract the record, and once to write to BigQuery. Here is the main part of the code:
pipeline
.apply("Read from Kafka",
KafkaIO
.<byte[], GenericRecord>read()
.withBootstrapServers(options.getKafkaBrokers().get())
.withTopics(Utils.getListFromString(options.getKafkaTopics()))
.withKeyDeserializer(
ConfluentSchemaRegistryDeserializerProvider.of(
options.getSchemaRegistryUrl().get(),
options.getSubject().get())
)
.withValueDeserializer(
ConfluentSchemaRegistryDeserializerProvider.of(
options.getSchemaRegistryUrl().get(),
options.getSubject().get()))
.withoutMetadata()
)
.apply("Extract GenericRecord",
MapElements.into(TypeDescriptor.of(GenericRecord.class)).via(KV::getValue)
)
.apply(
"Write data to BQ",
BigQueryIO
.<GenericRecord>write()
.optimizedWrites()
.useBeamSchema()
.useAvroLogicalTypes()
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
.withSchemaUpdateOptions(ImmutableSet.of(BigQueryIO.Write.SchemaUpdateOption.ALLOW_FIELD_ADDITION))
//Temporary location to save files in GCS before loading to BQ
.withCustomGcsTempLocation(options.getGcsTempLocation())
.withNumFileShards(options.getNumShards().get())
.withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
.withMethod(FILE_LOADS)
.withTriggeringFrequency(Utils.parseDuration(options.getWindowDuration().get()))
.to(new TableReference()
.setProjectId(options.getGcpProjectId().get())
.setDatasetId(options.getGcpDatasetId().get())
.setTableId(options.getGcpTableId().get()))
);
When I run it, I get the following error:
Exception in thread "main" java.lang.IllegalStateException: Unable to return a default Coder for Extract GenericRecord/Map/ParMultiDo(Anonymous).output [PCollection]. Correct one of the following root causes: No Coder has been manually specified; you may do so using .setCoder().
Inferring a Coder from the CoderRegistry failed: Unable to provide a Coder for org.apache.avro.generic.GenericRecord.
Building a Coder using a registered CoderProvider failed.
How do I set the coder to properly read Avro?

There are at least three approaches to this:
Set the coder inline:
pipeline.apply("Read from Kafka", ....)
.apply("Dropping key", Values.create())
.setCoder(AvroCoder.of(schemaOfGenericRecord))
.apply("Write data to BQ", ....);
Here schemaOfGenericRecord is the Avro Schema of your records. Note that the key is dropped because it is unused; with this you won't need MapElements any more.
Register the coder in the pipeline's instance of CoderRegistry:
pipeline.getCoderRegistry().registerCoderForClass(GenericRecord.class, AvroCoder.of(genericSchema)); // genericSchema is the Avro Schema of the records
Get the coder from the schema registry via:
ConfluentSchemaRegistryDeserializerProvider.getCoder(CoderRegistry registry)
https://beam.apache.org/releases/javadoc/2.22.0/org/apache/beam/sdk/io/kafka/ConfluentSchemaRegistryDeserializerProvider.html#getCoder-org.apache.beam.sdk.coders.CoderRegistry-

Related

Azure Databricks running Autoloader Implementation from java Jar throws org.apache.spark.sql.AnalysisException

I am currently running into an issue and do not understand why it is happening. I have implemented a Java function that uses the Databricks Autoloader to readStream all parquet files from an Azure blob storage and "write" them into a DataFrame (a Dataset<Row>, since it is written in Java). The code is executed from a Jar that I build in Java and run as a Job on a Shared Cluster.
Code:
Dataset<Row> newdata= spark.readStream().format("cloudFiles")
.option("cloudFiles.subscriptionId", storagesubscriptionid)
.option("cloudFiles.format", "parquet")
.option("cloudFiles.tenantId", sptenantid)
.option("cloudFiles.clientId", spappid)
.option("cloudFiles.clientSecret", spsecret)
.option("cloudFiles.resourceGroup", storageresourcegroup)
.option("cloudFiles.connectionString", storagesasconnectionstring)
// .option("cloudFiles.useNotifications", "true")
.schema(dfsample.schema()).option("cloudFiles.includeExistingFiles", "true").load(filePath);
newdata.show();
But unfortunately I get the following exception:
WARN SQLExecution: Error executing delta metering
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();
cloudFiles
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.throwError(UnsupportedOperationChecker.scala:447)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.$anonfun$checkForBatch$1(UnsupportedOperationChecker.scala:38)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.$anonfun$checkForBatch$1$adapted(UnsupportedOperationChecker.scala:36)
What puzzles me is that exactly the same code runs fine inside a Databricks notebook written in Scala:
val df1 = spark.readStream.format("cloudFiles").option("cloudFiles.useNotifications", "true").option("cloudFiles.subscriptionId", storagesubscriptionid)
.option("cloudFiles.format", "parquet")
.option("cloudFiles.tenantId", sptenantid)
.option("cloudFiles.clientId", spappid)
.option("cloudFiles.clientSecret", spsecret)
.option("cloudFiles.resourceGroup", storageresourcegroup)
.option("cloudFiles.connectionString", storagesasconnectionstring)
.option("cloudFiles.useNotifications", "true")
.option("cloudFiles.subscriptionId", storagesubscriptionid).schema(df_schema).option("cloudFiles.includeExistingFiles", "false").load(filePath);
display(df1);
I expect a Dataset object containing all the new data from the blobstorage parquet files in schema: id1:int, id2:int, content:binary
So finally, I have found a way to get Autoloader working inside my Java Jar.
As Vincent already commented, you have to combine readStream with a writeStream.
So I am simply writing the files that have been detected by the Autoloader to an Azure Data Lake.
spark.readStream().format("cloudFiles")
.option("cloudFiles.subscriptionId", STORAGE_SUBSCRIPTION_ID)
.option("cloudFiles.format", "parquet")
.option("cloudFiles.tenantId", SP_TENANT_ID)
.option("cloudFiles.clientId", SP_APPLICATION_ID)
.option("cloudFiles.clientSecret", SP_CLIENT_SECRET)
.option("cloudFiles.resourceGroup", STORAGE_RESOURCE_GROUP)
.option("cloudFiles.connectionString", STORAGE_SAS_CONNECTION_STRING)
.option("cloudFiles.includeExistingFiles", "true")
.option("cloudFiles.useNotifications", "true")
.schema(DF_SCHEMA)
.load(BLOB_STORAGE_LANDING_ZONE_PATH)
.writeStream()
.format("delta")
.option("checkpointLocation", DELTA_TABLE_RAW_DATA_CHECKPOINT_PATH)
.option("mergeSchema", "true")
.trigger(Trigger.Once())
.outputMode("append")
.start(DELTA_TABLE_RAW_DATA_PATH).awaitTermination();
This works fine with Java when you need to run a Jar as a Databricks Job.
But to be honest, I am still wondering why, inside a Scala notebook, I don't have to use writeStream to receive new files from the Autoloader.

Unrecognized pipeline stage name: '$setOnInsert'

Using a standalone MongoDB instance in version 4.4.1 with a Java client that connects using the latest driver (org.mongodb:mongodb-driver-sync:4.1.1), I am getting an error when calling findOneAndUpdate with the $setOnInsert operator.
Here is the query used:
final List<Bson> updates = new ArrayList<>();
updates.add(Updates.set("data", "test"));
updates.add(Updates.setOnInsert("firstSeenTime", new Date()));
final Document updatedDocument =
this.visitorsCollection.findOneAndUpdate(
eq("userId", "u1"), updates, new FindOneAndUpdateOptions().returnDocument(ReturnDocument.AFTER).upsert(true));
The error:
Exception in thread "main" com.mongodb.MongoCommandException: Command failed with error 40324 (Location40324): 'Unrecognized pipeline stage name: '$setOnInsert'' on server A.B.C.D:XXXXX. The full response is {"ok": 0.0, "errmsg": "Unrecognized pipeline stage name: '$setOnInsert'", "code": 40324, "codeName": "Location40324"}
at com.mongodb.internal.connection.ProtocolHelper.getCommandFailureException(ProtocolHelper.java:175)
at com.mongodb.internal.connection.InternalStreamConnection.receiveCommandMessageResponse(InternalStreamConnection.java:359)
at com.mongodb.internal.connection.InternalStreamConnection.sendAndReceive(InternalStreamConnection.java:280)
at com.mongodb.internal.connection.UsageTrackingInternalConnection.sendAndReceive(UsageTrackingInternalConnection.java:100)
at com.mongodb.internal.connection.DefaultConnectionPool$PooledConnection.sendAndReceive(DefaultConnectionPool.java:490)
at com.mongodb.internal.connection.CommandProtocolImpl.execute(CommandProtocolImpl.java:71)
at com.mongodb.internal.connection.DefaultServer$DefaultServerProtocolExecutor.execute(DefaultServer.java:255)
at com.mongodb.internal.connection.DefaultServerConnection.executeProtocol(DefaultServerConnection.java:202)
at com.mongodb.internal.connection.DefaultServerConnection.command(DefaultServerConnection.java:118)
at com.mongodb.internal.connection.DefaultServerConnection.command(DefaultServerConnection.java:110)
at com.mongodb.internal.operation.CommandOperationHelper$13.call(CommandOperationHelper.java:712)
at com.mongodb.internal.operation.OperationHelper.withReleasableConnection(OperationHelper.java:620)
at com.mongodb.internal.operation.CommandOperationHelper.executeRetryableCommand(CommandOperationHelper.java:705)
at com.mongodb.internal.operation.CommandOperationHelper.executeRetryableCommand(CommandOperationHelper.java:697)
at com.mongodb.internal.operation.BaseFindAndModifyOperation.execute(BaseFindAndModifyOperation.java:69)
at com.mongodb.client.internal.MongoClientDelegate$DelegateOperationExecutor.execute(MongoClientDelegate.java:195)
at com.mongodb.client.internal.MongoCollectionImpl.executeFindOneAndUpdate(MongoCollectionImpl.java:785)
at com.mongodb.client.internal.MongoCollectionImpl.findOneAndUpdate(MongoCollectionImpl.java:765)
If I get rid of the Updates.setOnInsert(...) call, then the update works but not as I would like. My purpose is to set some fields based on whether the document to update exists or not. Looking at the documentation, $setOnInsert should be supported:
https://docs.mongodb.com/manual/reference/operator/update/#id1
Any idea about what is wrong?
The problem here is that there are two forms of findOneAndUpdate. The second argument can be either:
a document containing update operator expressions, or
an array containing $set, $unset, and $replaceRoot aggregation stages (an aggregation pipeline).
Since you are creating updates as an ArrayList, findOneAndUpdate tries to process it as an aggregation pipeline, which does not recognize a $setOnInsert stage.
You need to build the update as a single document for the update operators to be recognized. Following your example, you can simply wrap the list with Updates.combine(updates) and pass the result to findOneAndUpdate as the second parameter.
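To make the intended behaviour concrete, here is a minimal in-memory sketch of what an upsert with $set plus $setOnInsert is meant to do. It does not use the MongoDB driver; the class UpsertModel and its upsert method are hypothetical names made up for illustration, and the map stands in for the collection.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class UpsertModel {
    // In-memory stand-in for a collection keyed by userId; NOT the MongoDB driver.
    private final Map<String, Map<String, Object>> docs = new HashMap<>();

    // Models an upsert: setOnInsert fields are written only when the document
    // is created, set fields are written on every call.
    public Map<String, Object> upsert(String userId, Map<String, ?> set, Map<String, ?> setOnInsert) {
        Map<String, Object> doc = docs.get(userId);
        if (doc == null) {
            doc = new HashMap<>();
            doc.put("userId", userId);   // the filter field ends up in the new document
            doc.putAll(setOnInsert);     // applied on insert only
            docs.put(userId, doc);
        }
        doc.putAll(set);                 // applied on insert and on update
        return new HashMap<>(doc);       // copy of the document after the update
    }

    public static void main(String[] args) {
        UpsertModel visitors = new UpsertModel();
        System.out.println(visitors.upsert("u1",
                Collections.singletonMap("data", "test"),
                Collections.singletonMap("firstSeenTime", 1000L)));
        System.out.println(visitors.upsert("u1",
                Collections.singletonMap("data", "test2"),
                Collections.singletonMap("firstSeenTime", 2000L)));
    }
}
```

In the second call the document already exists, so firstSeenTime keeps the value from the first call while data is overwritten. That is the behaviour the combined update document should give you on the server.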

Partition not working in mongodb spark read in java connector

I am trying to read data using the MongoDB Spark connector from a standalone mongod instance, and I want to partition the dataset on a key. The MongoDB Spark documentation mentions various partitioner classes. I tried MongoSamplePartitioner, but it reads everything into just 1 partition. MongoPaginateByCountPartitioner always produces a fixed 66 partitions. This happens even though I configure "samplesPerPartition" and "numberOfPartitions" respectively in the two cases. I need to use a ReadConfig created from a map. My code:
SparkSession sparkSession = SparkSession.builder().appName("sampleRecords")
.config("spark.driver.memory", "2g")
.config("spark.driver.host", "127.0.0.1")
.master("local[4]").getOrCreate();
Map<String, String> readOverrides = new HashMap<>();
readOverrides.put("uri", "mongodb://mongo-root:password@127.0.0.1:27017/importedDb.myNewCollection?authSource=admin");
readOverrides.put("numberOfPartitions", "16");
readOverrides.put("partitioner", "MongoPaginateByCountPartitioner");
ReadConfig readConfig = ReadConfig.create(readOverrides);
JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sparkSession.sparkContext());
Dataset<Row> dataset = MongoSpark.load(jsc, readConfig).toDF();
System.out.println(dataset.count()); //24576
System.out.println(dataset.rdd().getNumPartitions()); //66
Using the sample partitioner returns 1 partition every time.
Am I missing something here? Please help.
PS: I am reading 24576 records; mongod version v4.0.10, Mongo Spark connector 2.3.1, Java 8.
Edit:
I got it to work: the properties need to be given in the map in the form partitionerOptions.samplesPerPartition. But I am still facing an issue: with partitionerOptions.samplesPerPartition set to "1000", MongoSamplePartitioner still returns only 1 partition. Any suggestions?
The number of partitions can be configured for MongoPaginateByCountPartitioner.
Suppose we need to set the target number of partitions to 16.
Add partitionerOptions.numberOfPartitions -> 16 to the properties rather than only numberOfPartitions -> 16.
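As a small sketch of the map this describes, using only java.util: the helper name paginateByCount and the class name are made up for illustration, and the URI is a placeholder without credentials. The resulting map is what you would pass to ReadConfig.create(...) as in the question.

```java
import java.util.HashMap;
import java.util.Map;

public class PartitionerReadOverrides {
    // Builds the override map for MongoPaginateByCountPartitioner.
    // Hypothetical helper, shown only to illustrate the key names.
    static Map<String, String> paginateByCount(String uri, int numberOfPartitions) {
        Map<String, String> overrides = new HashMap<>();
        overrides.put("uri", uri);
        overrides.put("partitioner", "MongoPaginateByCountPartitioner");
        // Partitioner settings carry the "partitionerOptions." prefix.
        overrides.put("partitionerOptions.numberOfPartitions", Integer.toString(numberOfPartitions));
        return overrides;
    }

    public static void main(String[] args) {
        Map<String, String> readOverrides =
                paginateByCount("mongodb://127.0.0.1:27017/importedDb.myNewCollection", 16);
        readOverrides.forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```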

Compress map output result exception in hadoop program

In a Hadoop program, I tried to compress the map output with the following code:
conf.setBoolean("mapred.compress.map.output",true);
conf.setClass("mapred.map.output.compression.codec",GzipCodec.class,CompressionCodec.class);
When I ran it, I got the exception below. Does anybody know the reason?
WARN mapred.LocalJobRunner: job_local1149103367_0001
java.io.IOException: not a gzip file
at org.apache.hadoop.io.compress.zlib.BuiltInGzipDecompressor.processBasicHeader(BuiltInGzipDecompressor.java:495)
at org.apache.hadoop.io.compress.zlib.BuiltInGzipDecompressor.executeHeaderState(BuiltInGzipDecompressor.java:256)
at org.apache.hadoop.io.compress.zlib.BuiltInGzipDecompressor.decompress(BuiltInGzipDecompressor.java:185)
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:91)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:72)
at java.io.DataInputStream.readByte(DataInputStream.java:265)
at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:329)
at org.apache.hadoop.mapred.IFile$Reader.positionToNextRecord(IFile.java:400)
at org.apache.hadoop.mapred.IFile$Reader.nextRawKey(IFile.java:425)
at org.apache.hadoop.mapred.Merger$Segment.nextRawKey(Merger.java:323)
at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:613)
at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:558)
at org.apache.hadoop.mapred.Merger.merge(Merger.java:70)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:385)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:445)
Today I tested it again and found that if I put these two lines before the job object is created,
Job job = new Job(conf, "MyCounter");
the error happens; if I put them after it, no error occurs. Why does this happen?
Are you using MRv1 or MRv2? If you are using MRv2, use the following job config, which compresses the job's final output:
config.setBoolean("mapreduce.output.fileoutputformat.compress", true);
config.setClass("mapreduce.output.fileoutputformat.compress.codec",GzipCodec.class,CompressionCodec.class);
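Additionally, you can set the compression type:
config.set("mapreduce.output.fileoutputformat.compress.type",CompressionType.NONE.toString());
NONE, RECORD and BLOCK are the three compression types. For the intermediate map output under MRv2, the corresponding properties are mapreduce.map.output.compress and mapreduce.map.output.compress.codec.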
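On the follow-up about ordering: Hadoop's Job is documented to work on its own copy of the Configuration it is given. If that is what happens here, properties set on conf after new Job(conf, ...) never reach the job, so compression is simply not enabled and the error disappears. Below is a minimal sketch of that copy-on-construction behaviour using plain java.util.Properties. SnapshotJob is a hypothetical stand-in, not a Hadoop class.

```java
import java.util.Properties;

public class ConfigSnapshotDemo {
    static final String COMPRESS_KEY = "mapred.compress.map.output";

    // Hypothetical stand-in for a job that copies its configuration when constructed.
    static final class SnapshotJob {
        private final Properties conf = new Properties();

        SnapshotJob(Properties source) {
            conf.putAll(source); // snapshot: later changes to source are not seen
        }

        boolean compressesMapOutput() {
            return Boolean.parseBoolean(conf.getProperty(COMPRESS_KEY, "false"));
        }
    }

    // Property set first, job created afterwards: the job sees it.
    static boolean setBeforeCreatingJob() {
        Properties conf = new Properties();
        conf.setProperty(COMPRESS_KEY, "true");
        return new SnapshotJob(conf).compressesMapOutput();
    }

    // Job created first, property set afterwards: the job does not see it.
    static boolean setAfterCreatingJob() {
        Properties conf = new Properties();
        SnapshotJob job = new SnapshotJob(conf);
        conf.setProperty(COMPRESS_KEY, "true");
        return job.compressesMapOutput();
    }

    public static void main(String[] args) {
        System.out.println("set before job creation: " + setBeforeCreatingJob());
        System.out.println("set after job creation:  " + setAfterCreatingJob());
    }
}
```

If you need to change settings after the job object exists, the usual route is to set them on the configuration returned by job.getConfiguration() rather than on the original conf.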

Reuse results of first computation in second computation

I'm trying to write a computation in Flink which requires two phases.
In the first phase I start from a text file, and perform some parameter estimation, obtaining as a result a Java object representing a statistical model of the data.
In the second phase, I'd like to use this object to generate data for a simulation.
I'm unsure how to do this. I tried with a LocalCollectionOutputFormat, and it works locally, but when I deploy the job on a cluster, I get a NullPointerException - which is not really surprising.
What is the Flink way of doing this?
Here is my code:
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
GlobalConfiguration.includeConfiguration(configuration);
// Phase 1: read file and estimate model
DataSource<Tuple4<String, String, String, String>> source = env
.readCsvFile(args[0])
.types(String.class, String.class, String.class, String.class);
List<Tuple4<Bayes, Bayes, Bayes, Bayes>> bayesResult = new ArrayList<>();
// Processing here...
....output(new LocalCollectionOutputFormat<>(bayesResult));
env.execute("Bayes");
DataSet<BTP> btp = env
.createInput(new BayesInputFormat(bayesResult.get(0)))
// Phase 2: BayesInputFormat generates data for further calculations
// ....
This is the exception I get:
Error: The program execution failed: java.lang.NullPointerException
at org.apache.flink.api.java.io.LocalCollectionOutputFormat.close(LocalCollectionOutputFormat.java:86)
at org.apache.flink.runtime.operators.DataSinkTask.invoke(DataSinkTask.java:176)
at org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:257)
at java.lang.Thread.run(Thread.java:745)
org.apache.flink.client.program.ProgramInvocationException: The program execution failed: java.lang.NullPointerException
at org.apache.flink.api.java.io.LocalCollectionOutputFormat.close(LocalCollectionOutputFormat.java:86)
at org.apache.flink.runtime.operators.DataSinkTask.invoke(DataSinkTask.java:176)
at org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:257)
at java.lang.Thread.run(Thread.java:745)
at org.apache.flink.client.program.Client.run(Client.java:328)
at org.apache.flink.client.program.Client.run(Client.java:294)
at org.apache.flink.client.program.Client.run(Client.java:288)
at org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:55)
at it.list.flink.test.Test01.main(Test01.java:62)
...
With the latest release (0.9-milestone-1) a collect() method was added to Flink
public List<T> collect()
which fetches a DataSet<T> as a List<T> to the driver program. collect() also triggers an immediate execution of the program (you don't need to call ExecutionEnvironment.execute()). Right now, there is a size limitation for data sets of about 10 MB.
If you do not evaluate the models in the driver program, you can also chain both programs together and emit the model to the side by attaching a data sink. This will be more efficient, because the data won't make the round trip over the client machine.
If you're using a Flink version prior to 0.9, you can use the following snippet (in Scala) to collect your data set into a local collection:
val dataJavaList = new ArrayList[K]
val outputFormat = new LocalCollectionOutputFormat[K](dataJavaList)
dataset.output(outputFormat)
env.execute("collect()")
Where K is the type of object you want to collect
