Google Dataflow: PCollection<String> to PCollection<TableRow> for BigQuery insertion - java

I'm fairly new to the Google Cloud Platform and I'm trying Google Dataflow for the first time for a project for my postgraduate programme. What I want to do is write an automated load job that loads files from a certain bucket in my Cloud Storage and inserts their data into a BigQuery table.
I get the data as a PCollection<String>, but for insertion in BigQuery I apparently need to transform it to a PCollection<TableRow>. So far I haven't found a solid answer on how to do this.
Here's my code:
public static void main(String[] args) {
    // Defining the schema of the BigQuery table
    List<TableFieldSchema> fields = new ArrayList<>();
    fields.add(new TableFieldSchema().setName("Datetime").setType("TIMESTAMP"));
    fields.add(new TableFieldSchema().setName("Consumption").setType("FLOAT"));
    fields.add(new TableFieldSchema().setName("MeterID").setType("STRING"));
    TableSchema schema = new TableSchema().setFields(fields);

    // Creating the pipeline
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline p = Pipeline.create(options);

    // Getting the data from Cloud Storage
    PCollection<String> lines = p.apply(TextIO.Read.named("ReadCSVFromCloudStorage")
            .from("gs://mybucket/myfolder/certainCSVfile.csv"));

    // Probably need to do some transform here ...

    // Inserting data into BigQuery
    lines.apply(BigQueryIO.Write
            .named("WriteToBigQuery")
            .to("projectID:datasetID.tableID")
            .withSchema(schema)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
}
I'm probably just forgetting something basic, so I hope you guys can help me with this ...

BigQueryIO.Write operates on PCollection<TableRow>, as outlined in Writing to BigQuery. You'll need to apply a transform to convert PCollection<String> into PCollection<TableRow>. For an example, take a look at StringToRowConverter:
static class StringToRowConverter extends DoFn<String, TableRow> {
    /**
     * In this example, put the whole string into single BigQuery field.
     */
    @Override
    public void processElement(ProcessContext c) {
        c.output(new TableRow().set("string_field", c.element()));
    }
    ...
}
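Applied to the CSV in the question, the converter would split each line into the three schema fields. This is only a rough sketch: the comma delimiter, the field order (Datetime, Consumption, MeterID) and the absence of a header row are assumptions, so adjust the parsing to the real file:

static class StringToTableRowConverter extends DoFn<String, TableRow> {
    @Override
    public void processElement(ProcessContext c) {
        // Assumed layout: Datetime,Consumption,MeterID (no header row)
        String[] parts = c.element().split(",");
        c.output(new TableRow()
                .set("Datetime", parts[0])
                .set("Consumption", Double.parseDouble(parts[1]))
                .set("MeterID", parts[2]));
    }
}

Then apply it in the pipeline between the read and the write:

PCollection<TableRow> rows = lines.apply(ParDo.of(new StringToTableRowConverter()));
rows.apply(BigQueryIO.Write
        .named("WriteToBigQuery")
        .to("projectID:datasetID.tableID")
        .withSchema(schema)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));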

Related

Does spring-data-mongodb support Atlas Search? Need an example of it

I am trying to find out how to use MongoDB Atlas Search indexes from a Java application that uses spring-data-mongodb to query the data. Can anyone share an example?
What I found is the code below, but that is for MongoDB text search; it is working, but I am not sure whether it uses the index defined for Atlas Search.
TextQuery textQuery = TextQuery.queryText(new TextCriteria().matchingAny(text)).sortByScore();
textQuery.fields().include("cast").include("title").include("id");
List<Movies> movies = mongoOperations.find(textQuery, Movies.class);
I want sample Java code using spring-data-mongodb for the query below:
[
  {
    $search: {
      index: 'cast-fullplot',
      text: {
        query: 'sandeep',
        path: {
          'wildcard': '*'
        }
      }
    }
  }
]
It will be helpful if anyone can explain how MongoDB text search is different from MongoDB Atlas Search, and the correct way of using Atlas Search with Java and spring-data-mongodb.
How to code the following with spring-data-mongodb:
Arrays.asList(new Document("$search",
        new Document("index", "cast-fullplot")
                .append("text",
                        new Document("query", "sandeep")
                                .append("path",
                                        new Document("wildcard", "*")))),
        new Document())
Yes, spring-data-mongodb supports the aggregation pipeline, which you'll use to execute your query.
You need to define a document list with the steps of your query, in the correct order; Atlas Search must be the first stage in the pipeline, as it stands. You can translate your query to the aggregation pipeline using the MongoDB Atlas interface, which has an option to export the pipeline array in the language of your choosing. Then you just need to execute the query and map the list of responses to your entity class.
You can see an example below:
public class SearchRepositoryImpl implements SearchRepositoryCustom {

    private final MongoClient mongoClient;

    public SearchRepositoryImpl(MongoClient mongoClient) {
        this.mongoClient = mongoClient;
    }

    @Override
    public List<SearchEntity> searchByFilter(String text) {
        // You can add codec configuration in your database object. This might be needed
        // to map your object to the MongoDB data.
        MongoDatabase database = mongoClient.getDatabase("aggregation");
        MongoCollection<Document> collection = database.getCollection("restaurants");

        List<Document> pipeline = List.of(new Document("$search", new Document("index", "default2")
                .append("text", new Document("query", "Many people").append("path", new Document("wildcard", "*")))));

        List<SearchEntity> searchEntityList = new ArrayList<>();
        collection.aggregate(pipeline, SearchEntity.class).forEach(searchEntityList::add);
        return searchEntityList;
    }
}
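If you would rather stay within spring-data-mongodb's own abstractions instead of going through the driver directly, a custom AggregationOperation can wrap the $search stage. The sketch below is only an illustration under that assumption; the collection name ("movies"), the index name ("cast-fullplot") and the Movies entity are placeholders taken from the question:

import org.bson.Document;
import org.springframework.data.mongodb.core.MongoOperations;
import org.springframework.data.mongodb.core.aggregation.Aggregation;
import org.springframework.data.mongodb.core.aggregation.AggregationOperation;
import java.util.List;

public class AtlasSearchExample {

    // Rough sketch: wrap the $search stage in a custom AggregationOperation so it can
    // be combined with the rest of spring-data-mongodb's Aggregation API.
    public List<Movies> searchMovies(MongoOperations mongoOperations, String text) {
        AggregationOperation searchStage = context -> new Document("$search",
                new Document("index", "cast-fullplot")
                        .append("text", new Document("query", text)
                                .append("path", new Document("wildcard", "*"))));

        Aggregation aggregation = Aggregation.newAggregation(searchStage);
        return mongoOperations.aggregate(aggregation, "movies", Movies.class).getMappedResults();
    }
}

The main reason to prefer this route over the raw MongoClient approach is that the results go through spring-data-mongodb's normal entity mapping.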

Query Avro Schema using Beam SQL

I'm trying to read Avro files with Apache Beam and use Beam SQL to transform the data.
I'm still new to Beam and Java. Here's my simple code:
public class BeamSQLReadAvro {

    @SuppressWarnings("serial")
    public static void main(String[] args) throws IOException {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
        Pipeline p = Pipeline.create(options);

        /* Schema definition */
        Schema schema = new Schema.Parser().parse(new File("data/RATE_CODE/RATE_CODE.avsc"));

        /* Create record/row */
        PCollection<GenericRecord> records = p.apply(AvroIO.readGenericRecords(schema).from("data/RATE_CODE/*.avro"));

        /* SQL Transform */
        records.apply("SQL Transform 01", SqlTransform.query("SELECT RCODE,RNAME,RDESC FROM PCOLLECTION LIMIT 10"))
               /* Print output */
               .apply("Output",
                       MapElements.via(
                               new SimpleFunction<Row, Row>() {
                                   @Override
                                   public Row apply(Row input) {
                                       System.out.println("PCOLLECTION: " + input.getValues());
                                       return input;
                                   }
                               }));

        p.run().waitUntilFinish();
    }
}
It gives me this error:
Exception in thread "main" java.lang.IllegalStateException: Cannot call getSchema when there is no schema
I don't understand; I have defined a variable called schema. Any pointers here?
Actually, there are two types of schemas in your pipeline - Avro and Beam schemas. The Avro schema is used to parse your Avro input records, but for the SQL transform you are supposed to use rows with a Beam schema. To do this, AvroIO provides an option withBeamSchemas(boolean), which should be set to true in your case, like:
AvroIO.readGenericRecords(schema).withBeamSchemas(true).from("data/RATE_CODE/*.avro")
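For completeness, this is roughly how the read step from the question would look with that option set (paths unchanged from the original code):

PCollection<GenericRecord> records = p.apply(
        AvroIO.readGenericRecords(schema)
              .withBeamSchemas(true)   // attach a Beam schema so SqlTransform can resolve RCODE, RNAME, RDESC
              .from("data/RATE_CODE/*.avro"));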

GCP Dataflow Streaming Template : Not able to customize google provided java based PubSubToBQ template

Problem statement: we are customizing the Google-provided PubSubToBQ Dataflow streaming Java template so that it reads from multiple subscriptions/topics and pushes data into multiple BigQuery tables, all executed as a single Dataflow pipeline. But when we execute the template from Eclipse we have to pass the subscription/topic and BQ details, and the template is staged on a GCS bucket; when we then run the template using the gcloud command with different subscription and BQ details, the Dataflow job is not overridden with the new subscription or BQ tables.
Objective: my objective is to use the Google-provided PubSubTOBQ.java template, pass a list of subscriptions with their corresponding BigQuery tables, and create one pipeline per subscription/table pair, i.e. n-to-n, n pipelines in a single job.
I am using the Google-provided PubSubTOBQ.java template, which takes as input a single subscription or a single topic and the corresponding BigQuery table details.
Now I need to customize this to take as input a comma-separated list of topics or subscriptions. I am able to do this using a ValueProvider, and inside the main/run method I iterate through the array of strings and pass each subscription/topic and BQ table as a string. Look at the code below for more information.
What I read in the GCP docs is that we cannot use ValueProvider variables outside a DoFn if we want to override or use the value at runtime to create a dynamic pipeline. I am not sure if we can read messages inside a DoFn:
PubsubIO.readMessagesWithAttributes().fromSubscription(providedSubscriptionArray[i])
If yes, please let me know, so that my objective can be achieved.
Code:
public static void main(String[] args) {
    StreamingDataflowOptions options = PipelineOptionsFactory.fromArgs(args).withValidation()
            .as(StreamingDataflowOptions.class);

    List<String> listOfSubStr = new ArrayList<String>();
    List<String> listOfTopicStr = new ArrayList<String>();
    List<String> listOfTableStr = new ArrayList<String>();

    String[] providedSubscriptionArray = null;
    String[] providedTopicArray = null;
    String[] providedTableArray = null;

    if (options.getInputSubscription().isAccessible()) {
        listOfSubStr = options.getInputSubscription().get();
        providedSubscriptionArray = new String[listOfSubStr.size()];
        providedSubscriptionArray = createListOfProvidedStringArray(listOfSubStr);
    }
    if (options.getInputTopic().isAccessible()) {
        listOfTopicStr = options.getInputTopic().get();
        providedTopicArray = new String[listOfSubStr.size()];
        providedTopicArray = createListOfProvidedStringArray(listOfTopicStr);
    }
    if (options.getOutputTableSpec().isAccessible()) {
        listOfTableStr = options.getOutputTableSpec().get();
        providedTableArray = new String[listOfSubStr.size()];
        providedTableArray = createListOfProvidedStringArray(listOfTableStr);
    }

    Pipeline pipeline = Pipeline.create(options);
    PCollection<PubsubMessage> readPubSubMessage = null;

    for (int i = 0; i < providedSubscriptionArray.length; i++) {
        if (options.getUseSubscription()) {
            readPubSubMessage = pipeline
                    .apply(PubsubIO.readMessagesWithAttributes().fromSubscription(providedSubscriptionArray[i]));
        } else {
            readPubSubMessage = pipeline.apply(PubsubIO.readMessagesWithAttributes().fromTopic(providedTopicArray[i]));
        }

        readPubSubMessage
                /*
                 * Step #2: Transform the PubsubMessages into TableRows
                 */
                .apply("Convert Message To TableRow", ParDo.of(new PubsubMessageToTableRow()))
                .apply("Insert Data To BigQuery",
                        BigQueryIO.writeTableRows().to(providedTableArray[i])
                                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
    }

    pipeline.run().waitUntilFinish();
}
It should be possible to use a single Dataflow PubSubTOBQ template for multiple pipelines, with the number of subscriptions matching the number of BigQuery tables, in a single Dataflow streaming job.
The problem is that Dataflow templates, as of now, need to know the pipeline graph at staging/creation time, so it can't be different at runtime. If you still want to do it with a non-templated pipeline, passing a comma-separated Pub/Sub topic list as the --topicList option parameter, then you can do something like:
String[] listOfTopicStr = options.getTopicList().split(",");

PCollection[] p = new PCollection[listOfTopicStr.length];

for (int i = 0; i < listOfTopicStr.length; i++) {
    p[i] = pipeline
            .apply(PubsubIO.readStrings().fromTopic(listOfTopicStr[i]))
            .apply(ParDo.of(new DoFn<String, Void>() {
                @ProcessElement
                public void processElement(ProcessContext c) throws Exception {
                    Log.info(String.format("Message=%s", c.element()));
                }
            }));
}
Full code here.
If we test it with 3 topics such as:
mvn -Pdataflow-runner compile -e exec:java \
-Dexec.mainClass=com.dataflow.samples.MultipleTopics \
-Dexec.args="--project=$PROJECT \
--topicList=projects/$PROJECT/topics/topic1,projects/$PROJECT/topics/topic2,projects/$PROJECT/topics/topic3 \
--stagingLocation=gs://$BUCKET/staging/ \
--runner=DataflowRunner"
gcloud pubsub topics publish topic1 --message="message 1"
gcloud pubsub topics publish topic2 --message="message 2"
gcloud pubsub topics publish topic3 --message="message 3"
The output and Dataflow graph will be as expected.
A possible workaround to force this approach into templates would be to have a large enough number of topics N for the worst-case scenario. When we execute the template with n topics (satisfying n <= N) we would need to specify N - n unused/dummy topics to fill in.
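As a rough illustration of that padding idea (the helper name and the dummy-topic naming are made up for this sketch, not part of the template):

import java.util.ArrayList;
import java.util.List;

class TopicPadding {
    // The template is always built against a fixed maximum of read transforms
    // (maxTopics); a launch that only needs n real topics pads the list with
    // placeholder topics that never receive messages.
    static List<String> padTopicList(List<String> realTopics, int maxTopics, String project) {
        List<String> padded = new ArrayList<>(realTopics);
        for (int i = padded.size(); i < maxTopics; i++) {
            padded.add(String.format("projects/%s/topics/unused-topic-%d", project, i));
        }
        return padded;
    }
}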

Spark: Running multiple queries on multiple files, optimization

I am using spark 1.5.0.
I have a set of files on S3 containing JSON data in sequence file format, worth around 60 GB. I have to fire around 40 queries on this dataset and store the results back to S3.
All queries are select statements with a condition on the same field, e.g. select a,b,c from t where event_type='alpha', select x,y,z from t where event_type='beta', etc.
I am using an AWS EMR 5 node cluster with 2 core nodes and 2 task nodes.
There could be some fields missing in the input, e.g. a could be missing, so the first query, which selects a, would fail. To avoid this I have defined schemas for each event_type. So, for event_type alpha, the schema would be like {"a": "", "b": "", "c": "", "event_type": ""}
Based on the schemas defined for each event, I'm creating a DataFrame from the input RDD for each event with the corresponding schema.
I'm using the following code:
JavaPairRDD<LongWritable, BytesWritable> inputRDD = jsc.sequenceFile(bucket, LongWritable.class, BytesWritable.class);

JavaRDD<String> events = inputRDD.map(
    new Function<Tuple2<LongWritable, BytesWritable>, String>() {
        public String call(Tuple2<LongWritable, BytesWritable> tuple) throws JSONException, UnsupportedEncodingException {
            String valueAsString = new String(tuple._2.getBytes(), "UTF-8");
            JSONObject data = new JSONObject(valueAsString);
            JSONObject payload = new JSONObject(data.getString("payload"));
            return payload.toString();
        }
    }
);

events.cache();

for (String event_type : events_list) {
    String query = // read query from another s3 file: event_type.query
    String jsonSchemaString = // read schema from another s3 file: event_type.json

    List<String> jsonSchema = Arrays.asList(jsonSchemaString);
    JavaRDD<String> jsonSchemaRDD = jsc.parallelize(jsonSchema);
    DataFrame df_schema = sqlContext.read().option("header", "true").json(jsonSchemaRDD);
    StructType schema = df_schema.schema();

    DataFrame df_query = sqlContext.read().schema(schema).option("header", "true").json(events);
    df_query.registerTempTable(tableName);

    DataFrame df_results = sqlContext.sql(query);
    df_results.write().format("com.databricks.spark.csv").save("s3n://some_location");
}
This code is very inefficient; it takes around 6-8 hours to run. How can I optimize my code?
Should I try using HiveContext?
I think the current code is taking multiple passes at the data, though I'm not sure since I have cached the RDD. How can I do it in a single pass, if that is so?

How to get name of tableid of an existing Dataset in Cloud Bigquery

Currently, I am implementing our project for inserting data from Cloud Storage into BigQuery. For the way I insert, please refer to the link below:
How to load data from Cloud Storage into BigQuery using Java
However, I want to check whether a table name already exists in the BigQuery dataset before inserting the data.
I will be very happy if you can share your idea for solving this case.
Thanks,
private static void listTables(Bigquery service, String projectNumber, String datasetId) throws IOException {
    Bigquery.Tables.List listTablesReply = service.tables().list(projectNumber, datasetId);
    TableList tableList = listTablesReply.execute();
    if (tableList.getTables() != null) {
        List<TableList.Tables> tables = tableList.getTables();
        System.out.println("Tables list:");
        for (TableList.Tables table : tables) {
            System.out.format("%s\n", table.getId());
        }
    }
}
I think it will help us in this case :)
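If the goal is to check one specific table rather than list them all, a direct get on the table is another option. A minimal sketch, assuming the same Bigquery client object as above, where a 404 from the API is treated as "table does not exist":

import com.google.api.client.googleapis.json.GoogleJsonResponseException;
import com.google.api.services.bigquery.Bigquery;
import java.io.IOException;

private static boolean tableExists(Bigquery service, String projectNumber, String datasetId, String tableId) throws IOException {
    try {
        // tables().get() returns the table metadata if the table exists
        service.tables().get(projectNumber, datasetId, tableId).execute();
        return true;
    } catch (GoogleJsonResponseException e) {
        if (e.getStatusCode() == 404) {
            // 404 means the table was not found in the dataset
            return false;
        }
        throw e; // any other error is unexpected, so rethrow it
    }
}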
