Handling empty PCollections with BigQuery in Apache Beam - java

Using the following code, I get the error below when trying to write to BigQuery.
I am using Apache Beam 2.0.0.
Exception in thread "main" org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.NullPointerException
If I change text.startsWith("X") to "D", everything works fine (i.e. something is output).
Is there some way to catch or watch for empty PCollections?
Based on the stack trace, it looks like the error is actually in BigQueryIO: the file left in my bucket has 0 bytes, and maybe this is causing BigQueryIO a problem.
My use case is that I am using side outputs for dead letters, and I hit this error when my job produced no dead-letter output, so handling this robustly would be useful.
The job should really be able to run in batch or streaming mode. My best guess is to write any output to GCS / TextIO in batch mode and to BigQuery when streaming, if that sounds sensible?
Any help gratefully received.
public class EmptyPCollection {

    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.create();
        options.setTempLocation("gs://<your-bucket-here>/temp");
        Pipeline pipeline = Pipeline.create(options);
        String schema = "{\"fields\": [{\"name\": \"pet\", \"type\": \"string\", \"mode\": \"required\"}]}";
        String table = "<your-dataset>.<your-table>";
        List<String> pets = Arrays.asList("Dog", "Cat", "Goldfish");
        PCollection<String> inputText = pipeline.apply(Create.of(pets)).setCoder(StringUtf8Coder.of());
        PCollection<TableRow> rows = inputText.apply(ParDo.of(new DoFn<String, TableRow>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                String text = c.element();
                if (text.startsWith("X")) { // change to (D)og and works fine
                    TableRow row = new TableRow();
                    row.set("pet", text);
                    c.output(row);
                }
            }
        }));
        rows.apply(BigQueryIO.writeTableRows().to(table).withJsonSchema(schema)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
        pipeline.run().waitUntilFinish();
    }
}
[direct-runner-worker] INFO org.apache.beam.sdk.io.gcp.bigquery.TableRowWriter - Opening TableRowWriter to gs://<your-bucket>/temp/BigQueryWriteTemp/05c7a7c0786a4656abad97f11ef23d8e/2675e1c7-f4d7-4f78-a85f-a38095b57e6b.
Exception in thread "main" org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.NullPointerException
at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:322)
at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:292)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:200)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:63)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:295)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:281)
at EmptyPCollection.main(EmptyPCollection.java:54)
Caused by: java.lang.NullPointerException
at org.apache.beam.sdk.io.gcp.bigquery.WriteTables.processElement(WriteTables.java:97)

This looks like a bug in the BigQuery sink implementation within Apache Beam. The Apache Beam JIRA is the appropriate place to report it.
I have filed https://issues.apache.org/jira/browse/BEAM-2406 to track this issue.
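Until that is fixed, one way to at least watch for the empty case is to count the rows and log a warning alongside the write. This is a rough sketch, not part of the original answer; LOG is an assumed SLF4J logger (e.g. private static final Logger LOG = LoggerFactory.getLogger(EmptyPCollection.class);).
// Count the rows destined for BigQuery and warn when nothing was produced.
PCollection<Long> rowCount = rows.apply("CountRows", Count.<TableRow>globally());
rowCount.apply("WarnIfEmpty", ParDo.of(new DoFn<Long, Void>() {
    @ProcessElement
    public void processElement(ProcessContext c) {
        if (c.element() == 0L) {
            // Count.globally() emits a single 0 when the input PCollection is empty.
            LOG.warn("No TableRows were produced; BigQueryIO will receive an empty PCollection.");
        }
    }
}));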

Related

Apache Beam Row Coder issue org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.IllegalStateException

Can you please help me with this issue? Is it not possible to convert a PCollection of Strings into a PCollection of Rows?
Is it not possible to convert a PCollection of String arrays into a PCollection of Beam Rows?
I tried the String data type for all the fields in the Beam schema, but it gives me the same error.
I am using Java 11, Maven 3.8.5, and Apache Beam Java SDK 2.41.0.
I tried the same code with Java 1.8 and Beam 2.40.0 and get the same error.
public class beamRowPractise {

    public static void main(String[] args) {
        PipelineOptions opts = PipelineOptionsFactory.create();
        opts.setRunner(DirectRunner.class);
        Pipeline p = Pipeline.create(opts);
        PCollection<String> pc1 = p.apply(TextIO.read().from("data/indata.csv"));
        PCollection<Row> pc2 = pc1.apply(MapElements.via(new mapString())).setRowSchema(getSchema());
        System.out.println(pc2.getSchema().toString());
        p.run();
    }

    public static class mapString extends SimpleFunction<String, Row> {
        @Override
        public Row apply(String record) {
            String arr[] = record.split(",");
            Row.Builder row = Row.withSchema(getSchema());
            row.withFieldValue("name", arr[0]);
            row.withFieldValue("id1", arr[1]);
            row.withFieldValue("id2", arr[2]);
            row.withFieldValue("id3", arr[3]);
            row.withFieldValue("id4", arr[4]);
            return row.build();
        }
    }

    public static Schema getSchema() {
        org.apache.beam.sdk.schemas.Schema.Builder typed_schema_builder = org.apache.beam.sdk.schemas.Schema.builder();
        typed_schema_builder.addField("name", org.apache.beam.sdk.schemas.Schema.FieldType.STRING);
        typed_schema_builder.addField("id1", Schema.FieldType.INT64);
        typed_schema_builder.addField("id2", org.apache.beam.sdk.schemas.Schema.FieldType.INT64);
        typed_schema_builder.addField("id3", org.apache.beam.sdk.schemas.Schema.FieldType.INT64);
        typed_schema_builder.addField("id4", org.apache.beam.sdk.schemas.Schema.FieldType.INT64);
        org.apache.beam.sdk.schemas.Schema typed_beam_schema = typed_schema_builder.build();
        org.apache.beam.sdk.schemas.Schema schema = typed_beam_schema;
        return schema;
    }
}
Error:
Exception in thread "main" org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.IllegalStateException
at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:374)
at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:342)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:218)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:67)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:323)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:309)
at com.bhargav.beamFirst.beamRowPractise.main(beamRowPractise.java:25)
Caused by: java.lang.IllegalStateException
at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkState(Preconditions.java:491)
at org.apache.beam.sdk.coders.RowCoderGenerator$EncodeInstruction.encodeDelegate(RowCoderGenerator.java:313)
at org.apache.beam.sdk.coders.Coder$ByteBuddy$hZNCN9ub.encode(Unknown Source)
at org.apache.beam.sdk.coders.Coder$ByteBuddy$hZNCN9ub.encode(Unknown Source)
at org.apache.beam.sdk.schemas.SchemaCoder.encode(SchemaCoder.java:124)
at org.apache.beam.sdk.coders.Coder.encode(Coder.java:136)
at org.apache.beam.sdk.util.CoderUtils.encodeToSafeStream(CoderUtils.java:86)
at org.apache.beam.sdk.util.CoderUtils.encodeToByteArray(CoderUtils.java:70)
at org.apache.beam.sdk.util.CoderUtils.encodeToByteArray(CoderUtils.java:55)
at org.apache.beam.sdk.util.CoderUtils.clone(CoderUtils.java:168)
at org.apache.beam.sdk.util.MutationDetectors$CodedValueMutationDetector.<init>(MutationDetectors.java:118)
at org.apache.beam.sdk.util.MutationDetectors.forValueWithCoder(MutationDetectors.java:49)
at org.apache.beam.runners.direct.ImmutabilityCheckingBundleFactory$ImmutabilityEnforcingBundle.add(ImmutabilityCheckingBundleFactory.java:115)
at org.apache.beam.runners.direct.ParDoEvaluator$BundleOutputManager.output(ParDoEvaluator.java:305)
at org.apache.beam.repackaged.direct_java.runners.core.SimpleDoFnRunner.outputWindowedValue(SimpleDoFnRunner.java:275)
at org.apache.beam.repackaged.direct_java.runners.core.SimpleDoFnRunner.access$900(SimpleDoFnRunner.java:85)
at org.apache.beam.repackaged.direct_java.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:423)
at org.apache.beam.sdk.transforms.DoFnOutputReceivers$WindowedContextOutputReceiver.output(DoFnOutputReceivers.java:76)
at org.apache.beam.sdk.transforms.MapElements$1.processElement(MapElements.java:142)
Process finished with exit code 1
UPD: You need to chain your builder calls since .withFieldValue() returns Row.FieldValueBuilder, like this:
public static class mapString extends SimpleFunction<String, Row> {
    @Override
    public Row apply(String record) {
        String[] arr = record.split(",");
        return Row.withSchema(getSchema())
                .withFieldValue("name", arr[0])
                .withFieldValue("id1", Long.valueOf(arr[1]))
                .withFieldValue("id2", Long.valueOf(arr[2]))
                .withFieldValue("id3", Long.valueOf(arr[3]))
                .withFieldValue("id4", Long.valueOf(arr[4]))
                .build();
    }
}
As a workaround, you may try to use row.addValue(...) instead and add values in the order defined in your schema, like this:
row.addValue(arr[0]);
row.addValue(Long.valueOf(arr[1]));
row.addValue(Long.valueOf(arr[2]));
row.addValue(Long.valueOf(arr[3]));
row.addValue(Long.valueOf(arr[4]));
I also noticed that you declare the id1..id4 fields as Schema.FieldType.INT64 (Java long), but you are actually setting String values when building the Rows (for example, row.withFieldValue("id1", arr[1])). Can you try setting values of the correct type?

Query Avro Schema using Beam SQL

I'm trying to read Avro files with Apache Beam and use Beam SQL to transform the data.
I'm still new to Beam and Java. Here's my simple code:
public class BeamSQLReadAvro {

    @SuppressWarnings("serial")
    public static void main(String[] args) throws IOException {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
        Pipeline p = Pipeline.create(options);

        /* Schema definition */
        Schema schema = new Schema.Parser().parse(new File("data/RATE_CODE/RATE_CODE.avsc"));

        /* Create record/row */
        PCollection<GenericRecord> records = p.apply(AvroIO.readGenericRecords(schema).from("data/RATE_CODE/*.avro"));

        /* SQL Transform */
        records.apply("SQL Transform 01", SqlTransform.query("SELECT RCODE,RNAME,RDESC FROM PCOLLECTION LIMIT 10"))
               /* Print output */
               .apply("Output",
                       MapElements.via(
                               new SimpleFunction<Row, Row>() {
                                   @Override
                                   public Row apply(Row input) {
                                       System.out.println("PCOLLECTION: " + input.getValues());
                                       return input;
                                   }
                               }
                       )
               );

        p.run().waitUntilFinish();
    }
}
It gives me the error:
Exception in thread "main" java.lang.IllegalStateException: Cannot call getSchema when there is no schema
I don't understand; I have defined a variable called schema. Any pointers here?
Actually, there are two types of schemas in your pipeline: Avro and Beam schemas. The Avro schema is used to parse your Avro input records, but the SQL transform expects rows with a Beam schema. To get those, AvroIO provides the option withBeamSchemas(boolean), which should be set to true in your case, like:
AvroIO.readGenericRecords(schema).withBeamSchemas(true).from("data/RATE_CODE/*.avro")
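For context, here is how that read would look in the question's pipeline (a minimal sketch; the rest of the code stays the same):
// withBeamSchemas(true) attaches a Beam schema to the output PCollection,
// so SqlTransform can be applied to it directly.
PCollection<GenericRecord> records =
        p.apply(AvroIO.readGenericRecords(schema)
                .withBeamSchemas(true)
                .from("data/RATE_CODE/*.avro"));
records.apply("SQL Transform 01",
        SqlTransform.query("SELECT RCODE,RNAME,RDESC FROM PCOLLECTION LIMIT 10"));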

GCP Dataflow Streaming Template : Not able to customize google provided java based PubSubToBQ template

Problem statement: We are customizing the Google-provided PubSubToBQ Dataflow streaming Java template so that it reads from multiple subscriptions/topics and pushes the data into multiple BigQuery tables. This needs to run as a single Dataflow pipeline that reads all streams from the source and writes them into BigQuery tables. When we execute the template from Eclipse, we have to pass the subscription/topic and BigQuery details, and the template is staged on a GCS bucket. But when we then run the template with the gcloud command using different subscription and BigQuery details, the Dataflow job is not overridden with the new subscriptions or BigQuery tables.
Objective: My objective is to use the Google-provided PubSubTOBQ.java template, pass a list of subscriptions with their corresponding BigQuery tables, and create one pipeline per subscription/table pair, i.e. n-to-n, n pipelines in a single job.
I am using the Google-provided PubSubTOBQ.java template, which takes as input a single subscription or single topic and the corresponding BigQuery table details.
Now I need to customize this to take a comma-separated list of topics or subscriptions as input. I am able to do this using ValueProvider<List<String>>, and inside the main/run method I iterate through the String array and pass the subscription/topic and BigQuery table as strings. See the code below for more information.
What I read in the GCP docs is that we cannot use ValueProvider variables outside a DoFn if we want to override or use their values at runtime to create a dynamic pipeline. I am not sure whether we can read messages inside a DoFn:
PubsubIO.readMessagesWithAttributes().fromSubscription(providedSubscriptionArray[i])
If yes, please let me know, so that my objective can be achieved.
Code:
public static void main(String[] args) {
    StreamingDataflowOptions options = PipelineOptionsFactory.fromArgs(args).withValidation()
            .as(StreamingDataflowOptions.class);

    List<String> listOfSubStr = new ArrayList<String>();
    List<String> listOfTopicStr = new ArrayList<String>();
    List<String> listOfTableStr = new ArrayList<String>();

    String[] providedSubscriptionArray = null;
    String[] providedTopicArray = null;
    String[] providedTableArray = null;

    if (options.getInputSubscription().isAccessible()) {
        listOfSubStr = options.getInputSubscription().get();
        providedSubscriptionArray = new String[listOfSubStr.size()];
        providedSubscriptionArray = createListOfProvidedStringArray(listOfSubStr);
    }
    if (options.getInputTopic().isAccessible()) {
        listOfTopicStr = options.getInputTopic().get();
        providedTopicArray = new String[listOfSubStr.size()];
        providedTopicArray = createListOfProvidedStringArray(listOfTopicStr);
    }
    if (options.getOutputTableSpec().isAccessible()) {
        listOfTableStr = options.getOutputTableSpec().get();
        providedTableArray = new String[listOfSubStr.size()];
        providedTableArray = createListOfProvidedStringArray(listOfTableStr);
    }

    Pipeline pipeline = Pipeline.create(options);
    PCollection<PubsubMessage> readPubSubMessage = null;

    for (int i = 0; i < providedSubscriptionArray.length; i++) {
        if (options.getUseSubscription()) {
            readPubSubMessage = pipeline
                    .apply(PubsubIO.readMessagesWithAttributes().fromSubscription(providedSubscriptionArray[i]));
        } else {
            readPubSubMessage = pipeline.apply(PubsubIO.readMessagesWithAttributes().fromTopic(providedTopicArray[i]));
        }

        readPubSubMessage
                /*
                 * Step #2: Transform the PubsubMessages into TableRows
                 */
                .apply("Convert Message To TableRow", ParDo.of(new PubsubMessageToTableRow()))
                .apply("Insert Data To BigQuery",
                        BigQueryIO.writeTableRows().to(providedTableArray[i])
                                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
    }

    pipeline.run().waitUntilFinish();
}
I should be able to use a single Dataflow PubSubTOBQ template to build multiple pipelines, one per subscription and its corresponding BigQuery table, within a single Dataflow streaming job.
The problem is that Dataflow templates, as of now, need to know the pipeline graph at staging/creation time, so it can't be different at runtime. If you are willing to do it with a non-templated pipeline, passing a comma-separated Pub/Sub topic list as the --topicList option parameter, then you can do something like:
String[] listOfTopicStr = options.getTopicList().split(",");

PCollection[] p = new PCollection[listOfTopicStr.length];

for (int i = 0; i < listOfTopicStr.length; i++) {
    p[i] = pipeline
            .apply(PubsubIO.readStrings().fromTopic(listOfTopicStr[i]))
            .apply(ParDo.of(new DoFn<String, Void>() {
                @ProcessElement
                public void processElement(ProcessContext c) throws Exception {
                    Log.info(String.format("Message=%s", c.element()));
                }
            }));
}
Full code here.
If we test it with 3 topics such as:
mvn -Pdataflow-runner compile -e exec:java \
-Dexec.mainClass=com.dataflow.samples.MultipleTopics \
-Dexec.args="--project=$PROJECT \
--topicList=projects/$PROJECT/topics/topic1,projects/$PROJECT/topics/topic2,projects/$PROJECT/topics/topic3 \
--stagingLocation=gs://$BUCKET/staging/ \
--runner=DataflowRunner"
gcloud pubsub topics publish topic1 --message="message 1"
gcloud pubsub topics publish topic2 --message="message 2"
gcloud pubsub topics publish topic3 --message="message 3"
The output and the Dataflow graph will be as expected.
A possible workaround to force this approach into templates would be to have a large enough number of topics N for the worst-case scenario. When we execute the template with n topics (satisfying n <= N) we would need to specify N - n unused/dummy topics to fill in.
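To make that concrete, here is a rough sketch of padding the supplied topic list up to a fixed N (not from the original answer; the dummy-topic value is hypothetical):
// Pads the user-supplied list with an unused/dummy topic so the template graph
// always contains exactly maxTopics read branches.
static List<String> padTopics(List<String> topics, int maxTopics, String dummyTopic) {
    List<String> padded = new ArrayList<>(topics);
    while (padded.size() < maxTopics) {
        padded.add(dummyTopic); // e.g. "projects/<project>/topics/unused-dummy"
    }
    return padded;
}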

How to run apache flink streaming job continuously on Flink server

Hello,
I have written code for a streaming job where both the source and the target are a PostgreSQL database. I used JDBCInputFormat/JDBCOutputFormat to read and write the records (referenced example).
Code:
StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
environment.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

JDBCInputFormatBuilder inputBuilder = JDBCInputFormat.buildJDBCInputFormat()
        .setDrivername(JDBCConfig.DRIVER_CLASS)
        .setDBUrl(JDBCConfig.DB_URL)
        .setQuery(JDBCConfig.SELECT_FROM_SOURCE)
        .setRowTypeInfo(JDBCConfig.ROW_TYPE_INFO);

SingleOutputStreamOperator<Row> source = environment.createInput(inputBuilder.finish())
        .assignTimestampsAndWatermarks(new AscendingTimestampExtractor<Row>() {
            @Override
            public long extractAscendingTimestamp(Row row) {
                Date dt = (Date) row.getField(2);
                return dt.getTime();
            }
        })
        .keyBy(0).window(TumblingEventTimeWindows.of(Time.seconds(5)))
        .fold(null, new FoldFunction<Row, Row>() {
            @Override
            public Row fold(Row row1, Row row) throws Exception {
                return row;
            }
        });

source.writeUsingOutputFormat(JDBCOutputFormat.buildJDBCOutputFormat()
        .setDrivername(JDBCConfig.DRIVER_CLASS)
        .setDBUrl(JDBCConfig.DB_URL)
        .setQuery("insert into tablename(id, name) values (?,?)")
        .setSqlTypes(new int[]{Types.BIGINT, Types.VARCHAR})
        .finish());
This code executes correctly but does not keep running on the Flink server (the select query is executed only once).
I expect it to run continuously on the Flink server.
You probably have to define your own Flink source or JDBCInputFormat, since the one you use here stops the source task once it has fetched all results from the database. One way to solve this is to create your own JDBC input format based on JDBCInputFormat, re-executing the SQL query in nextRecord once the last row has been read from the database.
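As an illustration of the idea only, here is a rough sketch that is not the answerer's code: it swaps JDBCInputFormat for a plain-JDBC RichSourceFunction, and it assumes the JDBCConfig constants from the question, a three-column result set, and a hypothetical 5-second polling interval.
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;
import org.apache.flink.types.Row;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ContinuousJdbcSource extends RichSourceFunction<Row> {

    private volatile boolean running = true;
    private transient Connection connection;

    @Override
    public void open(Configuration parameters) throws Exception {
        Class.forName(JDBCConfig.DRIVER_CLASS);
        connection = DriverManager.getConnection(JDBCConfig.DB_URL);
    }

    @Override
    public void run(SourceContext<Row> ctx) throws Exception {
        // Re-execute the select query on every iteration instead of finishing
        // after the first result set is exhausted.
        while (running) {
            try (PreparedStatement stmt = connection.prepareStatement(JDBCConfig.SELECT_FROM_SOURCE);
                 ResultSet rs = stmt.executeQuery()) {
                while (running && rs.next()) {
                    Row row = new Row(3);
                    row.setField(0, rs.getObject(1));
                    row.setField(1, rs.getObject(2));
                    row.setField(2, rs.getObject(3));
                    ctx.collect(row);
                }
            }
            Thread.sleep(5_000L); // hypothetical polling interval
        }
    }

    @Override
    public void cancel() {
        running = false;
    }

    @Override
    public void close() throws Exception {
        if (connection != null) {
            connection.close();
        }
    }
}
You would then use environment.addSource(new ContinuousJdbcSource()) in place of environment.createInput(...); deduplicating already-seen rows is left to the surrounding pipeline.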

Google Dataflow: PCollection<String> to PCollection<TableRow> for BigQuery insertion

I'm fairly new to the Google Cloud Platform and I'm trying Google Dataflow for the first time for a project for my postgraduate programme. What I want to do is write an automated load job that loads files from a certain bucket on my Cloud Storage and inserts the data from it into a BigQuery table.
I get the data as a PCollection<String>, but for insertion into BigQuery I apparently need to transform it into a PCollection<TableRow>. So far I haven't found a solid answer on how to do this.
Here's my code:
public static void main(String[] args) {
    //Defining the schema of the BigQuery table
    List<TableFieldSchema> fields = new ArrayList<>();
    fields.add(new TableFieldSchema().setName("Datetime").setType("TIMESTAMP"));
    fields.add(new TableFieldSchema().setName("Consumption").setType("FLOAT"));
    fields.add(new TableFieldSchema().setName("MeterID").setType("STRING"));
    TableSchema schema = new TableSchema().setFields(fields);

    //Creating the pipeline
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline p = Pipeline.create(options);

    //Getting the data from cloud storage
    PCollection<String> lines = p.apply(TextIO.Read.named("ReadCSVFromCloudStorage").from("gs://mybucket/myfolder/certainCSVfile.csv"));

    //Probably need to do some transform here ...

    //Inserting data into BigQuery
    lines.apply(BigQueryIO.Write
            .named("WriteToBigQuery")
            .to("projectID:datasetID:tableID")
            .withSchema(schema)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
}
I'm probably just forgetting something basic, so I hope you guys can help me with this ...
BigQueryIO.Write operates on PCollection<TableRow>, as outlined in Writing to BigQuery. You'll need to apply a transform to convert your PCollection<String> into a PCollection<TableRow>. For an example, take a look at StringToRowConverter:
static class StringToRowConverter extends DoFn<String, TableRow> {
    /**
     * In this example, put the whole string into single BigQuery field.
     */
    @Override
    public void processElement(ProcessContext c) {
        c.output(new TableRow().set("string_field", c.element()));
    }
    ...
}
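To match the schema defined in the question, a converter could look roughly like the sketch below. This is an assumption-laden example, not the definitive answer: it assumes the CSV columns are comma-separated and ordered Datetime, Consumption, MeterID, so adjust the parsing to your actual file.
static class CsvLineToRowConverter extends DoFn<String, TableRow> {
    @Override
    public void processElement(ProcessContext c) {
        String[] parts = c.element().split(",");
        c.output(new TableRow()
                .set("Datetime", parts[0])                    // TIMESTAMP column
                .set("Consumption", Double.parseDouble(parts[1])) // FLOAT column
                .set("MeterID", parts[2]));                   // STRING column
    }
}
It would be wired in between TextIO and BigQueryIO, replacing the "Probably need to do some transform here" placeholder:
PCollection<TableRow> rows = lines.apply(ParDo.of(new CsvLineToRowConverter()));
rows.apply(BigQueryIO.Write
        .named("WriteToBigQuery")
        .to("projectID:datasetID:tableID")
        .withSchema(schema)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));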
