I need to publish BigQuery table rows to Kafka in Avro format.
PCollection<TableRow> rows =
    pipeline.apply(
        "Read from BigQuery query",
        BigQueryIO.readTableRows().from(String.format("%s:%s.%s", project, dataset, table)));

// How to convert rows to Avro format?
rows.apply(KafkaIO.<Long, ???>write()
    .withBootstrapServers("kafka:29092")
    .withTopic("test")
    .withValueSerializer(KafkaAvroSerializer.class)
);
How do I convert TableRow to Avro format?
Use MapElements
rows.apply(MapElements.via(new SimpleFunction<TableRow, GenericRecord>() {
    @Override
    public GenericRecord apply(TableRow input) {
        log.info("Parsing {} to Avro", input);
        return null; // TODO: Replace with Avro object
    }
}));
If TableRow is a collection type that you want to convert to many records, you can use FlatMapElements instead.
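For the MapElements case, here is a minimal sketch of what the conversion could look like, assuming a hypothetical two-field schema (id and name) rather than your real BigQuery table schema:

import com.google.api.services.bigquery.model.TableRow;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.PCollection;

public class TableRowToAvro {

    // Hypothetical Avro schema; in practice derive it from your BigQuery table schema.
    static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"MyRecord\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"},"
            + "{\"name\":\"name\",\"type\":\"string\"}]}";

    public static PCollection<GenericRecord> toGenericRecords(PCollection<TableRow> rows) {
        return rows
            .apply(MapElements.via(new SimpleFunction<TableRow, GenericRecord>() {
                @Override
                public GenericRecord apply(TableRow input) {
                    // Parse the schema here so only the JSON string travels with the function.
                    Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
                    GenericRecord record = new GenericData.Record(schema);
                    // TableRow values may come back as Strings or numbers depending on the BigQuery type.
                    record.put("id", Long.valueOf(String.valueOf(input.get("id"))));
                    record.put("name", String.valueOf(input.get("name")));
                    return record;
                }
            }))
            // GenericRecord has no default coder in Beam, so set one explicitly.
            .setCoder(AvroCoder.of(new Schema.Parser().parse(SCHEMA_JSON)));
    }
}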
As for writing to Kafka, I wrote a simple example; one possible approach is also sketched below.
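One possible approach (my own sketch, not the example referred to above) is to serialize each GenericRecord to Avro bytes in the pipeline and publish the bytes with the standard ByteArraySerializer, which avoids needing a schema registry; avroRecords below is the PCollection<GenericRecord> produced by the conversion sketch above:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.PCollection;
import org.apache.kafka.common.serialization.ByteArraySerializer;

// Encode each GenericRecord with the plain Avro binary encoder.
PCollection<byte[]> avroBytes = avroRecords.apply(
    MapElements.via(new SimpleFunction<GenericRecord, byte[]>() {
        @Override
        public byte[] apply(GenericRecord record) {
            try {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(record.getSchema());
                BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
                writer.write(record, encoder);
                encoder.flush();
                return out.toByteArray();
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    }));

// Write only the values; no key serializer is needed when using values().
avroBytes.apply(KafkaIO.<Void, byte[]>write()
    .withBootstrapServers("kafka:29092")
    .withTopic("test")
    .withValueSerializer(ByteArraySerializer.class)
    .values());

If you specifically want Confluent's KafkaAvroSerializer, that also works, but it additionally needs schema registry configuration on the producer side, which this sketch does not cover.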
I'm trying to read Avro files with Apache Beam and use Beam SQL to transform the data.
I'm still new to Beam and Java. Here's my simple code:
public class BeamSQLReadAvro {
@SuppressWarnings("serial")
public static void main(String[] args) throws IOException {
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
Pipeline p = Pipeline.create(options);
/* Schema definition */
Schema schema = new Schema.Parser().parse(new File("data/RATE_CODE/RATE_CODE.avsc"));
/* Create record/row */
PCollection<GenericRecord> records = p.apply(AvroIO.readGenericRecords(schema).from("data/RATE_CODE/*.avro"));
/* SQL Transform */
records.apply("SQL Transform 01",SqlTransform.query("SELECT RCODE,RNAME,RDESC FROM PCOLLECTION LIMIT 10"))
/* Print output */
.apply("Output",
MapElements.via(
new SimpleFunction<Row, Row>() {
@Override
public Row apply(Row input) {
System.out.println("PCOLLECTION: " + input.getValues());
return input;
}
}
)
);
p.run().waitUntilFinish();
}
}
It gives me this error:
Exception in thread "main" java.lang.IllegalStateException: Cannot call getSchema when there is no schema
I don't understand; I have defined a variable called schema. Any pointers here?
Actually, there are two types of schemas in your pipeline: Avro schemas and Beam schemas. The Avro schema is used to parse your Avro input records, but for the SQL transform you are supposed to use rows with a Beam schema. To do this, AvroIO provides the option withBeamSchemas(boolean), which should be set to true in your case, like:
AvroIO.readGenericRecords(schema).withBeamSchemas(true).from("data/RATE_CODE/*.avro")
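Applied to the pipeline in the question, the read becomes:

PCollection<GenericRecord> records =
    p.apply(AvroIO.readGenericRecords(schema)
        .withBeamSchemas(true)
        .from("data/RATE_CODE/*.avro"));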
I have a log file of 30K records, which I am publishing to Kafka and persisting into HBase through Spark. Out of the 30K records, I can see only 4K records in the HBase table.
I have tried saving the stream to MySQL, and it saves all the records properly.
But with HBase, if I publish a file of 100 records to the Kafka topic, it saves 36 records in the HBase table, whereas if I publish 30K records, HBase shows only 4K.
Also, the records (rows) in HBase are not in sequence, e.g. 1..3..10..17.
final Job newAPIJobConfiguration1 = Job.getInstance(config);
newAPIJobConfiguration1.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, "logs");
newAPIJobConfiguration1.setOutputFormatClass(org.apache.hadoop.hbase.mapreduce.TableOutputFormat.class);
HTable hTable = new HTable(config, "country");
lines.foreachRDD((rdd,time)->
{
// Get the singleton instance of SparkSession
SparkSession spark = SparkSession.builder().config(rdd.context().getConf()).getOrCreate();
// Convert RDD[String] to RDD[case class] to DataFrame
JavaRDD<Log> rowRDD = rdd.map(line -> {
String[] logLine = line.split(" +");
Log record = new Log();
record.setTime((logLine[0]));
record.setTime_taken((logLine[1]));
record.setIp(logLine[2]);
return record;
});
saveToHBase(rowRDD, newAPIJobConfiguration1.getConfiguration());
});
ssc.start();
ssc.awaitTermination();
}
//6. saveToHBase method - insert data into HBase
public static void saveToHBase(JavaRDD<Log> rowRDD, Configuration conf) throws IOException {
    // create key/value pairs to store in HBase
    JavaPairRDD<ImmutableBytesWritable, Put> hbasePuts = rowRDD.mapToPair(
        new PairFunction<Log, ImmutableBytesWritable, Put>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Tuple2<ImmutableBytesWritable, Put> call(Log row) throws Exception {
                Put put = new Put(Bytes.toBytes(System.currentTimeMillis()));
                //put.addColumn(Bytes.toBytes("sparkaf"), Bytes.toBytes("message"), Bytes.toBytes(row.getMessage()));
                put.addImmutable(Bytes.toBytes("time"), Bytes.toBytes("col1"), Bytes.toBytes(row.getTime()));
                put.addImmutable(Bytes.toBytes("time_taken"), Bytes.toBytes("col2"), Bytes.toBytes(row.getTime_taken()));
                put.addImmutable(Bytes.toBytes("ip"), Bytes.toBytes("col3"), Bytes.toBytes(row.getIp()));
                return new Tuple2<>(new ImmutableBytesWritable(), put);
            }
        });

    // save to HBase - Spark built-in API method
    hbasePuts.saveAsNewAPIHadoopDataset(conf);
}
Since HBase stores records uniquely by rowkey, it is very possible that you are overwriting records.
You are using the current time in milliseconds as the rowkey, and any record created with the same rowkey will overwrite the old one.
Put put = new Put(Bytes.toBytes(System.currentTimeMillis()));
So if 100 Puts are created within the same millisecond, only 1 row will show up in HBase, since that row was overwritten 99 times.
It's likely that the 4K rowkeys in HBase correspond to the 4K unique milliseconds (about 4 seconds) it took to load the data.
I would suggest using a different rowkey design. Also, as a side note, it is typically a bad idea to use monotonically increasing rowkeys in HBase.
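For example (a sketch only, not a prescription for your key layout), the Put built inside call(Log row) could be keyed on fields of the record plus a random suffix, so records arriving in the same millisecond no longer collide:

// Composite rowkey: record fields plus a random suffix instead of the insertion time.
// Choose the exact layout to match how you intend to read the data back.
String rowKey = row.getIp() + "|" + row.getTime() + "|" + java.util.UUID.randomUUID();
Put put = new Put(Bytes.toBytes(rowKey));
put.addImmutable(Bytes.toBytes("time"), Bytes.toBytes("col1"), Bytes.toBytes(row.getTime()));
put.addImmutable(Bytes.toBytes("time_taken"), Bytes.toBytes("col2"), Bytes.toBytes(row.getTime_taken()));
put.addImmutable(Bytes.toBytes("ip"), Bytes.toBytes("col3"), Bytes.toBytes(row.getIp()));

Leading with the IP (or a hash of it) also avoids the monotonically increasing key pattern mentioned above.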
Hello,
I have written code for a streaming job where both the source and the target are a PostgreSQL database. I used JDBCInputFormat/JDBCOutputFormat to read and write the records (referenced example).
Code:
StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
environment.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
JDBCInputFormatBuilder inputBuilder = JDBCInputFormat.buildJDBCInputFormat()
.setDrivername(JDBCConfig.DRIVER_CLASS)
.setDBUrl(JDBCConfig.DB_URL)
.setQuery(JDBCConfig.SELECT_FROM_SOURCE)
.setRowTypeInfo(JDBCConfig.ROW_TYPE_INFO);
SingleOutputStreamOperator<Row> source = environment.createInput(inputBuilder.finish())
.assignTimestampsAndWatermarks(new AscendingTimestampExtractor<Row>() {
@Override
public long extractAscendingTimestamp(Row row) {
Date dt = (Date) row.getField(2);
return dt.getTime();
}
})
.keyBy(0).window(TumblingEventTimeWindows.of(Time.seconds(5)))
.fold(null, new FoldFunction<Row, Row>(){
@Override
public Row fold(Row row1, Row row) throws Exception {
return row;
}
});
source.writeUsingOutputFormat(JDBCOutputFormat.buildJDBCOutputFormat()
.setDrivername(JDBCConfig.DRIVER_CLASS)
.setDBUrl(JDBCConfig.DB_URL)
.setQuery("insert into tablename(id, name) values (?,?)")
.setSqlTypes(new int[]{Types.BIGINT, Types.VARCHAR})
.finish());
This code executes correctly, but it does not run continuously on the Flink server (the select query is executed only once).
I expected it to run continuously on the Flink server.
Probably you have to define your own Flink source or JDBCInputFormat, since the one you use here will stop the SourceTask after fetching all results from the DB. One way to solve this is to create your own JDBC input format based on JDBCInputFormat, re-executing the SQL query in nextRecord once the last row has been read from the DB.
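As an illustration only (this is my sketch, not the original answer's code), a custom source that re-runs the query on an interval using plain JDBC could look roughly like this; the column layout (a numeric key, a name, and a timestamp in field 2, matching the extractAscendingTimestamp above), the poll interval, and the omitted credentials are all assumptions, while JDBCConfig is the constants class from the question:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;
import org.apache.flink.types.Row;

public class PollingJdbcSource extends RichSourceFunction<Row> {

    private volatile boolean running = true;
    private transient Connection connection;

    @Override
    public void open(Configuration parameters) throws Exception {
        // One connection per source subtask; add user/password if your DB requires them.
        connection = DriverManager.getConnection(JDBCConfig.DB_URL);
    }

    @Override
    public void run(SourceContext<Row> ctx) throws Exception {
        while (running) {
            // Re-execute the query on every iteration instead of stopping after a single pass.
            try (Statement stmt = connection.createStatement();
                 ResultSet rs = stmt.executeQuery(JDBCConfig.SELECT_FROM_SOURCE)) {
                while (rs.next()) {
                    Row row = new Row(3);                  // assumed arity of the source table
                    row.setField(0, rs.getLong(1));
                    row.setField(1, rs.getString(2));
                    row.setField(2, rs.getTimestamp(3));   // field 2 feeds the timestamp extractor
                    ctx.collect(row);
                }
            }
            Thread.sleep(5_000L);                          // assumed poll interval
        }
    }

    @Override
    public void cancel() {
        running = false;
    }

    @Override
    public void close() throws Exception {
        if (connection != null) {
            connection.close();
        }
    }
}

You would then replace environment.createInput(inputBuilder.finish()) with environment.addSource(new PollingJdbcSource()). Note that re-running a plain SELECT will re-emit rows it has already seen, so you would also need some form of offset or timestamp filter if duplicates matter.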
I have the following code as my Spark driver; when I execute my program it works properly, saving the required data as a Parquet file.
String indexFile = "index.txt";
JavaRDD<String> indexData = sc.textFile(indexFile).cache();
JavaRDD<String> jsonStringRDD = indexData.map(new Function<String, String>() {
    @Override
    public String call(String patientId) throws Exception {
        return "json array as string";
    }
});
//1. Read json string array into a Dataframe (execution 1)
DataFrame dataSchemaDF = sqlContext.read().json(jsonStringRDD);
//2. Save dataframe as parquet file (execution 2)
dataSchemaDF.write().parquet("md.parquet");
But I observed that my mapper function on the RDD indexData is executed twice:
first, when I read jsonStringRDD as a DataFrame using SQLContext;
second, when I write dataSchemaDF to the Parquet file.
Can you guide me on how to avoid this repeated execution? Is there a better way of converting a JSON string into a DataFrame?
I believe the reason is the lack of a schema for the JSON reader. When you execute:
sqlContext.read().json(jsonStringRDD);
Spark has to infer the schema for the newly created DataFrame. To do that it has to scan the input RDD, and this step is performed eagerly.
If you want to avoid it, you have to create a StructType which describes the shape of the JSON documents:
StructType schema;
...
and use it when you create DataFrame:
DataFrame dataSchemaDF = sqlContext.read().schema(schema).json(jsonStringRDD);
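For illustration (the field names and types below are assumptions, not taken from the original post), the schema can be built with the DataTypes factory methods:

import java.util.Arrays;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Hypothetical fields; replace with the actual shape of your JSON documents.
StructType schema = DataTypes.createStructType(Arrays.asList(
    DataTypes.createStructField("patientId", DataTypes.StringType, true),
    DataTypes.createStructField("name", DataTypes.StringType, true),
    DataTypes.createStructField("age", DataTypes.IntegerType, true)));

With an explicit schema, Spark can skip the eager inference pass over jsonStringRDD, so the mapper should only run when the Parquet write is executed.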
I am using Spark 1.5.0.
I have a set of files on S3 containing JSON data in sequence file format, around 60GB in total. I have to run around 40 queries on this dataset and store the results back to S3.
All queries are select statements with a condition on the same field, e.g. select a,b,c from t where event_type='alpha', select x,y,z from t where event_type='beta', etc.
I am using a 5-node AWS EMR cluster with 2 core nodes and 2 task nodes.
Some fields could be missing in the input, e.g. a could be missing, so the first query, which selects a, would fail. To avoid this I have defined a schema for each event_type; for event_type alpha, the schema would be like {"a": "", "b": "", "c": "", "event_type": ""}.
Based on the schema defined for each event, I'm creating a DataFrame from the input RDD for each event with the corresponding schema.
I'm using the following code:
JavaPairRDD<LongWritable,BytesWritable> inputRDD = jsc.sequenceFile(bucket, LongWritable.class, BytesWritable.class);
JavaRDD<String> events = inputRDD.map(
new Function<Tuple2<LongWritable,BytesWritable>, String>() {
public String call(Tuple2<LongWritable,BytesWritable> tuple) throws JSONException, UnsupportedEncodingException {
String valueAsString = new String(tuple._2.getBytes(), "UTF-8");
JSONObject data = new JSONObject(valueAsString);
JSONObject payload = new JSONObject(data.getString("payload"));
return payload.toString();
}
}
);
events.cache();
for (String event_type: events_list) {
String query = //read query from another s3 file event_type.query
String jsonSchemaString = //read schema from another s3 file event_type.json
List<String> jsonSchema = Arrays.asList(jsonSchemaString);
JavaRDD<String> jsonSchemaRDD = jsc.parallelize(jsonSchema);
DataFrame df_schema = sqlContext.read().option("header", "true").json(jsonSchemaRDD);
StructType schema = df_schema.schema();
DataFrame df_query = sqlContext.read().schema(schema).option("header", "true").json(events);
df_query.registerTempTable(tableName);
DataFrame df_results = sqlContext.sql(query);
df_results.write().format("com.databricks.spark.csv").save("s3n://some_location");
}
This code is very inefficient; it takes around 6-8 hours to run. How can I optimize my code?
Should I try using HiveContext?
I think the current code is making multiple passes over the data, though I'm not sure since I have cached the RDD. How can I do it in a single pass, if that is the case?