I'm trying to read Avro files with Apache Beam and use Beam SQL to transform the data.
I'm still new to Beam and Java. Here's my simple code:
public class BeamSQLReadAvro {
    @SuppressWarnings("serial")
    public static void main(String[] args) throws IOException {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
        Pipeline p = Pipeline.create(options);

        /* Schema definition */
        Schema schema = new Schema.Parser().parse(new File("data/RATE_CODE/RATE_CODE.avsc"));

        /* Create record/row */
        PCollection<GenericRecord> records = p.apply(AvroIO.readGenericRecords(schema).from("data/RATE_CODE/*.avro"));

        /* SQL Transform */
        records.apply("SQL Transform 01", SqlTransform.query("SELECT RCODE,RNAME,RDESC FROM PCOLLECTION LIMIT 10"))
               /* Print output */
               .apply("Output",
                   MapElements.via(
                       new SimpleFunction<Row, Row>() {
                           @Override
                           public Row apply(Row input) {
                               System.out.println("PCOLLECTION: " + input.getValues());
                               return input;
                           }
                       }
                   )
               );

        p.run().waitUntilFinish();
    }
}
It gives me this error:
Exception in thread "main" java.lang.IllegalStateException: Cannot call getSchema when there is no schema
I don't understand; I have defined a variable called schema. Any pointers here?
Actually, there are two types of schemas in your pipeline: Avro schemas and Beam schemas. The Avro schema is used to parse your Avro input records, but the SQL transform expects rows with a Beam schema attached. To handle this, AvroIO provides the option withBeamSchemas(boolean), which should be set to true in your case, like:
AvroIO.readGenericRecords(schema).withBeamSchemas(true).from("data/RATE_CODE/*.avro")
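With that option enabled, the rest of your original pipeline stays the same. A minimal sketch based on your code:

PCollection<GenericRecord> records = p.apply(
    AvroIO.readGenericRecords(schema)
        .withBeamSchemas(true)            // attaches a Beam schema to the output PCollection
        .from("data/RATE_CODE/*.avro"));

records.apply("SQL Transform 01",
    SqlTransform.query("SELECT RCODE, RNAME, RDESC FROM PCOLLECTION LIMIT 10"));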
I need to publish BigQuery table rows to Kafka in Avro format.
PCollection<TableRow> rows =
    pipeline.apply(
        "Read from BigQuery query",
        BigQueryIO.readTableRows().from(String.format("%s:%s.%s", project, dataset, table)));

// How to convert rows to Avro format?
rows.apply(KafkaIO.<Long, ???>write()
    .withBootstrapServers("kafka:29092")
    .withTopic("test")
    .withValueSerializer(KafkaAvorSerializer.class));
How to convert TableRow to Avro format?
Use MapElements:
rows.apply(MapElements.via(new SimpleFunction<TableRow, GenericRecord>() {
    @Override
    public GenericRecord apply(TableRow input) {
        log.info("Parsing {} to Avro", input);
        return null; // TODO: Replace with Avro object
    }
}));
If TableRow is a collection type that you want to convert into many records, you can use FlatMapElements instead.
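For the TODO above, a minimal sketch of the conversion using Avro's GenericRecordBuilder could look like this; the schema literal and the id/name field names are hypothetical and would need to match your actual BigQuery columns:

// Hypothetical Avro schema whose fields mirror the BigQuery columns (org.apache.avro.Schema).
Schema avroSchema = new Schema.Parser().parse(
    "{\"type\":\"record\",\"name\":\"TableRecord\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"string\"},"
        + "{\"name\":\"name\",\"type\":\"string\"}]}");

// Inside apply(TableRow input): copy each column into the matching Avro field.
GenericRecord record = new GenericRecordBuilder(avroSchema)
    .set("id", String.valueOf(input.get("id")))
    .set("name", String.valueOf(input.get("name")))
    .build();
return record;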
As for writing to Kafka, I wrote a simple example
Can you please help me with this issue? Is it not possible to convert a PCollection of strings into a PCollection of Row?
Is it not possible to convert a PCollection of String arrays into a PCollection of Beam Rows?
I tried the String data type for all the fields in the Beam schema, but that gives me the same error.
I am using Java 11, Maven 3.8.5, and Apache Beam Java SDK 2.41.0.
I tried the same code with Java 1.8 and Beam 2.40.0 and get the same error.
public class beamRowPractise {
    public static void main(String[] args) {
        PipelineOptions opts = PipelineOptionsFactory.create();
        opts.setRunner(DirectRunner.class);
        Pipeline p = Pipeline.create(opts);

        PCollection<String> pc1 = p.apply(TextIO.read().from("data/indata.csv"));
        PCollection<Row> pc2 = pc1.apply(MapElements.via(new mapString())).setRowSchema(getSchema());
        System.out.println(pc2.getSchema().toString());
        p.run();
    }

    public static class mapString extends SimpleFunction<String, Row> {
        @Override
        public Row apply(String record) {
            String[] arr = record.split(",");
            Row.Builder row = Row.withSchema(getSchema());
            row.withFieldValue("name", arr[0]);
            row.withFieldValue("id1", arr[1]);
            row.withFieldValue("id2", arr[2]);
            row.withFieldValue("id3", arr[3]);
            row.withFieldValue("id4", arr[4]);
            return row.build();
        }
    }

    public static Schema getSchema() {
        org.apache.beam.sdk.schemas.Schema.Builder typed_schema_builder = org.apache.beam.sdk.schemas.Schema.builder();
        typed_schema_builder.addField("name", org.apache.beam.sdk.schemas.Schema.FieldType.STRING);
        typed_schema_builder.addField("id1", Schema.FieldType.INT64);
        typed_schema_builder.addField("id2", org.apache.beam.sdk.schemas.Schema.FieldType.INT64);
        typed_schema_builder.addField("id3", org.apache.beam.sdk.schemas.Schema.FieldType.INT64);
        typed_schema_builder.addField("id4", org.apache.beam.sdk.schemas.Schema.FieldType.INT64);
        org.apache.beam.sdk.schemas.Schema typed_beam_schema = typed_schema_builder.build();
        org.apache.beam.sdk.schemas.Schema schema = typed_beam_schema;
        return schema;
    }
}
Error:
Exception in thread "main" org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.IllegalStateException
at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:374)
at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:342)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:218)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:67)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:323)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:309)
at com.bhargav.beamFirst.beamRowPractise.main(beamRowPractise.java:25)
Caused by: java.lang.IllegalStateException
at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkState(Preconditions.java:491)
at org.apache.beam.sdk.coders.RowCoderGenerator$EncodeInstruction.encodeDelegate(RowCoderGenerator.java:313)
at org.apache.beam.sdk.coders.Coder$ByteBuddy$hZNCN9ub.encode(Unknown Source)
at org.apache.beam.sdk.coders.Coder$ByteBuddy$hZNCN9ub.encode(Unknown Source)
at org.apache.beam.sdk.schemas.SchemaCoder.encode(SchemaCoder.java:124)
at org.apache.beam.sdk.coders.Coder.encode(Coder.java:136)
at org.apache.beam.sdk.util.CoderUtils.encodeToSafeStream(CoderUtils.java:86)
at org.apache.beam.sdk.util.CoderUtils.encodeToByteArray(CoderUtils.java:70)
at org.apache.beam.sdk.util.CoderUtils.encodeToByteArray(CoderUtils.java:55)
at org.apache.beam.sdk.util.CoderUtils.clone(CoderUtils.java:168)
at org.apache.beam.sdk.util.MutationDetectors$CodedValueMutationDetector.<init>(MutationDetectors.java:118)
at org.apache.beam.sdk.util.MutationDetectors.forValueWithCoder(MutationDetectors.java:49)
at org.apache.beam.runners.direct.ImmutabilityCheckingBundleFactory$ImmutabilityEnforcingBundle.add(ImmutabilityCheckingBundleFactory.java:115)
at org.apache.beam.runners.direct.ParDoEvaluator$BundleOutputManager.output(ParDoEvaluator.java:305)
at org.apache.beam.repackaged.direct_java.runners.core.SimpleDoFnRunner.outputWindowedValue(SimpleDoFnRunner.java:275)
at org.apache.beam.repackaged.direct_java.runners.core.SimpleDoFnRunner.access$900(SimpleDoFnRunner.java:85)
at org.apache.beam.repackaged.direct_java.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:423)
at org.apache.beam.sdk.transforms.DoFnOutputReceivers$WindowedContextOutputReceiver.output(DoFnOutputReceivers.java:76)
at org.apache.beam.sdk.transforms.MapElements$1.processElement(MapElements.java:142)
Process finished with exit code 1
UPD: You need to chain your builder calls, since .withFieldValue() returns a Row.FieldValueBuilder, like this:
public static class mapString extends SimpleFunction<String, Row> {
    @Override
    public Row apply(String record) {
        String[] arr = record.split(",");
        return Row.withSchema(getSchema())
            .withFieldValue("name", arr[0])
            .withFieldValue("id1", Long.valueOf(arr[1]))
            .withFieldValue("id2", Long.valueOf(arr[2]))
            .withFieldValue("id3", Long.valueOf(arr[3]))
            .withFieldValue("id4", Long.valueOf(arr[4]))
            .build();
    }
}
As a workaround, you may try to use row.addValue(...) instead and add values in the order defined in your schema, like this:
row.addValue(arr[0]);
row.addValue(Long.valueOf(arr[1]));
row.addValue(Long.valueOf(arr[2]));
row.addValue(Long.valueOf(arr[3]));
row.addValue(Long.valueOf(arr[4]));
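Put together, the workaround version of your mapping function would look roughly like this (addValue adds values positionally, so the order must match getSchema()):

public static class mapString extends SimpleFunction<String, Row> {
    @Override
    public Row apply(String record) {
        String[] arr = record.split(",");
        Row.Builder row = Row.withSchema(getSchema());
        row.addValue(arr[0]);                // name: STRING
        row.addValue(Long.valueOf(arr[1]));  // id1:  INT64
        row.addValue(Long.valueOf(arr[2]));  // id2:  INT64
        row.addValue(Long.valueOf(arr[3]));  // id3:  INT64
        row.addValue(Long.valueOf(arr[4]));  // id4:  INT64
        return row.build();
    }
}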
I noticed that you are setting the field types of the form idx to Schema.FieldType.INT64 (Java long), but you are actually setting strings when generating the Rows (for example, row.withFieldValue("id1", arr[1])). Can you try setting values of the correct type?
Using the following code, I am getting the error below when trying to write to BigQuery.
I am using Apache Beam 2.0.0.
Exception in thread "main" org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.NullPointerException
If I change the text.startsWith argument to "D", everything works fine (i.e. something is actually output).
Is there some way to catch or watch for empty PCollections?
Based on the stack trace, it looks like the error is actually in BigQueryIO: the file left in my bucket has 0 bytes, and maybe this is causing BigQueryIO a problem.
My use case is that I am using side outputs for dead letters, and I encountered this error when my job produced no dead-letter output, so handling this robustly would be useful.
The job should really be able to run in batch or streaming mode; my best guess is to write any output to GCS/TextIO in batch mode and to BigQuery when streaming, if that sounds sensible?
Any help gratefully received.
public class EmptyPCollection {
    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.create();
        options.setTempLocation("gs://<your-bucket-here>/temp");
        Pipeline pipeline = Pipeline.create(options);

        String schema = "{\"fields\": [{\"name\": \"pet\", \"type\": \"string\", \"mode\": \"required\"}]}";
        String table = "<your-dataset>.<your-table>";

        List<String> pets = Arrays.asList("Dog", "Cat", "Goldfish");
        PCollection<String> inputText = pipeline.apply(Create.of(pets)).setCoder(StringUtf8Coder.of());
        PCollection<TableRow> rows = inputText.apply(ParDo.of(new DoFn<String, TableRow>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                String text = c.element();
                if (text.startsWith("X")) { // change to (D)og and it works fine
                    TableRow row = new TableRow();
                    row.set("pet", text);
                    c.output(row);
                }
            }
        }));

        rows.apply(BigQueryIO.writeTableRows().to(table).withJsonSchema(schema)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));

        pipeline.run().waitUntilFinish();
    }
}
[direct-runner-worker] INFO org.apache.beam.sdk.io.gcp.bigquery.TableRowWriter - Opening TableRowWriter to gs://<your-bucket>/temp/BigQueryWriteTemp/05c7a7c0786a4656abad97f11ef23d8e/2675e1c7-f4d7-4f78-a85f-a38095b57e6b.
Exception in thread "main" org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.NullPointerException
at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:322)
at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:292)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:200)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:63)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:295)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:281)
at EmptyPCollection.main(EmptyPCollection.java:54)
Caused by: java.lang.NullPointerException
at org.apache.beam.sdk.io.gcp.bigquery.WriteTables.processElement(WriteTables.java:97)
This looks like a bug in the BigQuery sink implementation within Apache Beam. The Apache Beam Jira would be the appropriate place to report it.
I have filed https://issues.apache.org/jira/browse/BEAM-2406 to track this issue.
I have the code below as a Spark driver; when I execute my program, it works properly, saving the required data as a Parquet file.
String indexFile = "index.txt";
JavaRDD<String> indexData = sc.textFile(indexFile).cache();
JavaRDD<String> jsonStringRDD = indexData.map(new Function<String, String>() {
    @Override
    public String call(String patientId) throws Exception {
        return "json array as string";
    }
});

// 1. Read the JSON string array into a DataFrame (execution 1)
DataFrame dataSchemaDF = sqlContext.read().json(jsonStringRDD);

// 2. Save the DataFrame as a Parquet file (execution 2)
dataSchemaDF.write().parquet("md.parquet");
But I observed that my mapper function on the RDD indexData is getting executed twice:
first, when I read jsonStringRDD as a DataFrame using SQLContext;
second, when I write dataSchemaDF to the Parquet file.
Can you guide me on how to avoid this repeated execution? Is there a better way of converting a JSON string into a DataFrame?
I believe the reason is the lack of a schema for the JSON reader. When you execute:
sqlContext.read().json(jsonStringRDD);
Spark has to infer the schema for the newly created DataFrame. To do that, it has to scan the input RDD, and this step is performed eagerly.
If you want to avoid it, you have to create a StructType which describes the shape of the JSON documents:
StructType schema;
...
and use it when you create the DataFrame:
DataFrame dataSchemaDF = sqlContext.read().schema(schema).json(jsonStringRDD);
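For completeness, a minimal sketch of building such a schema with Spark's Java API; the patientId and name fields are placeholders for whatever your JSON documents actually contain:

// Uses org.apache.spark.sql.types.DataTypes, StructField and StructType.
// Hypothetical fields; replace with the real shape of your JSON documents.
StructType schema = DataTypes.createStructType(new StructField[] {
    DataTypes.createStructField("patientId", DataTypes.StringType, true),
    DataTypes.createStructField("name", DataTypes.StringType, true)
});

DataFrame dataSchemaDF = sqlContext.read().schema(schema).json(jsonStringRDD);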
I'm fairly new to the Google Cloud Platform and I'm trying Google Dataflow for the first time for a project for my postgraduate programme. What I want to do is write an automated load job that loads files from a certain bucket on my Cloud Storage and inserts the data from it into a BigQuery table.
I get the data as a PCollection<String> type, but for insertion into BigQuery I apparently need to transform it into a PCollection<TableRow> type. So far I haven't found a solid answer on how to do this.
Here's my code:
public static void main(String[] args) {
    // Defining the schema of the BigQuery table
    List<TableFieldSchema> fields = new ArrayList<>();
    fields.add(new TableFieldSchema().setName("Datetime").setType("TIMESTAMP"));
    fields.add(new TableFieldSchema().setName("Consumption").setType("FLOAT"));
    fields.add(new TableFieldSchema().setName("MeterID").setType("STRING"));
    TableSchema schema = new TableSchema().setFields(fields);

    // Creating the pipeline
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline p = Pipeline.create(options);

    // Getting the data from cloud storage
    PCollection<String> lines = p.apply(TextIO.Read.named("ReadCSVFromCloudStorage").from("gs://mybucket/myfolder/certainCSVfile.csv"));

    // Probably need to do some transform here ...

    // Inserting data into BigQuery
    lines.apply(BigQueryIO.Write
        .named("WriteToBigQuery")
        .to("projectID:datasetID:tableID")
        .withSchema(schema)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
}
I'm probably just forgetting something basic, so I hope you guys can help me with this ...
BigQueryIO.Write operates on PCollection<TableRow>, as outlined in Writing to BigQuery. You'll need to apply a transform to convert your PCollection<String> into a PCollection<TableRow>. For an example, take a look at StringToRowConverter:
static class StringToRowConverter extends DoFn<String, TableRow> {
    /**
     * In this example, put the whole string into a single BigQuery field.
     */
    @Override
    public void processElement(ProcessContext c) {
        c.output(new TableRow().set("string_field", c.element()));
    }
    ...
}
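Wired into your pipeline, that transform sits between the read and the write, roughly like this. Note this is a sketch: the example DoFn above puts everything into a single string_field column, so you would either add that field to your schema or extend the DoFn to parse each CSV line into your Datetime, Consumption and MeterID fields.

// Convert each line to a TableRow before handing it to BigQueryIO.
PCollection<TableRow> tableRows = lines.apply(ParDo.of(new StringToRowConverter()));

tableRows.apply(BigQueryIO.Write
    .named("WriteToBigQuery")
    .to("projectID:datasetID:tableID")
    .withSchema(schema)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));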