How to pass Table schema from Source to Sink using Apache Beam? - java

I have a use case where I need to load thousands of tables from Oracle to BigQuery using Apache Beam (Dataflow). I have written the code below, which works if I create the tables manually and use CreateDisposition.CREATE_NEVER, but creating all of those tables manually is not feasible. So I wrote code to fetch the schema from the source (JdbcIO) and pass it to BigQuery writeTableRows().
But the code throws the error below.
Exception in thread "main" java.lang.IllegalArgumentException: schema can not be null
at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkArgument(Preconditions.java:141)
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Write.withSchema(BigQueryIO.java:2256)
at org.example.Main.main(Main.java:109)
Code
package org.example;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
public class Main {

    private static final Logger LOG = LoggerFactory.getLogger(Main.class);

    public static TableSchema schema;

    public static void main(String[] args) {
        // Read from JDBC
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());
        String query2 = "select * from Test.emptable";

        PCollection<TableRow> rows = p.apply(JdbcIO.<TableRow>read()
                .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                        "oracle.jdbc.OracleDriver", "jdbc:oracle:thin:@//localhost:1521/ORCL")
                        .withUsername("root")
                        .withPassword("password"))
                .withQuery(query2)
                .withCoder(TableRowJsonCoder.of())
                .withRowMapper(new JdbcIO.RowMapper<TableRow>() {
                    @Override
                    public TableRow mapRow(ResultSet resultSet) throws Exception {
                        // schema is assigned here, inside the RowMapper, i.e. at execution time
                        schema = getSchemaFromResultSet(resultSet);
                        TableRow tableRow = new TableRow();
                        List<TableFieldSchema> columnNames = schema.getFields();
                        for (int i = 1; i <= resultSet.getMetaData().getColumnCount(); i++) {
                            tableRow.put(columnNames.get(i - 1).get("name").toString(),
                                    String.valueOf(resultSet.getObject(i)));
                        }
                        return tableRow;
                    }
                })
        );

        rows.apply(BigQueryIO.writeTableRows()
                .to("project:SampleDataset.emptable")
                .withSchema(schema)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
        );

        p.run().waitUntilFinish();
    }

    private static TableSchema getSchemaFromResultSet(ResultSet resultSet) {
        FieldSchemaListBuilder fieldSchemaListBuilder = new FieldSchemaListBuilder();
        try {
            ResultSetMetaData rsmd = resultSet.getMetaData();
            for (int i = 1; i <= rsmd.getColumnCount(); i++) {
                fieldSchemaListBuilder.stringField(rsmd.getColumnName(i));
            }
        } catch (SQLException ex) {
            LOG.error("Error getting metadata: " + ex.getMessage());
        }
        return fieldSchemaListBuilder.schema();
    }
}
I have tried initializing schema with a dummy schema to get past this error and then assigning the real schema to it later, but that creates the table with the dummy schema, not with the actual schema.
Can someone help me understand where the flow is going wrong and how I can get the schema from JdbcIO and pass it to the BigQuery sink?

To load a schema within the pipeline itself as you're suggesting here, you can use BigQueryIO.write() and specify withSchemaFromView. In that case, you'd need to fetch the schema from the source database and wrap that in a PCollectionView (see Side inputs in the Beam programming guide).
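A rough sketch of that approach, reusing the getSchemaFromResultSet helper and connection settings from the question; the metadata-only query, the View.asMap() wiring, and the table spec used as the map key are my assumptions, not a verified recipe:

// Extra imports assumed: java.util.Map, org.apache.beam.sdk.transforms.Create,
// org.apache.beam.sdk.transforms.View, org.apache.beam.sdk.values.KV,
// org.apache.beam.sdk.values.PCollectionView, org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers.

// Build a side input mapping the destination table spec to its JSON schema.
// The schema is fetched with plain JDBC at execution time, so it is available to the sink.
PCollectionView<Map<String, String>> schemaView = p
        .apply("SchemaSeed", Create.of("select * from Test.emptable where 1=0"))
        .apply("FetchSchema", ParDo.of(new DoFn<String, KV<String, String>>() {
            @ProcessElement
            public void processElement(ProcessContext c) throws Exception {
                // Assumes the Oracle driver is on the worker classpath.
                try (java.sql.Connection conn = java.sql.DriverManager.getConnection(
                        "jdbc:oracle:thin:@//localhost:1521/ORCL", "root", "password");
                     java.sql.PreparedStatement stmt = conn.prepareStatement(c.element());
                     ResultSet rs = stmt.executeQuery()) {
                    TableSchema tableSchema = getSchemaFromResultSet(rs);
                    c.output(KV.of("project:SampleDataset.emptable",
                            BigQueryHelpers.toJsonString(tableSchema)));
                }
            }
        }))
        .apply(View.asMap());

rows.apply(BigQueryIO.writeTableRows()
        .to("project:SampleDataset.emptable")
        .withSchemaFromView(schemaView)   // schema now comes from the side input
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));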
You're using the Storage Write API, which likely requires a schema to be specified. Note that the BigQuery API for file loads can allow inferring the schema from the file contents at load time, although I'm not completely sure whether Beam supports this. I would encourage you to try using file loads and setting withSchemaUpdateOptions(EnumSet.of(SchemaUpdateOption.ALLOW_FIELD_ADDITION)) to see if that leads to the table creation behavior you're looking for.
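If you go the file-loads route, the relevant knobs look roughly like this. This is only a sketch layered on a schema source such as the schemaView side input above, since Beam generally expects a schema source when CREATE_IF_NEEDED is set; java.util.EnumSet is assumed for the options set:

rows.apply(BigQueryIO.writeTableRows()
        .to("project:SampleDataset.emptable")
        .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
        .withSchemaFromView(schemaView)
        .withSchemaUpdateOptions(
                EnumSet.of(BigQueryIO.Write.SchemaUpdateOption.ALLOW_FIELD_ADDITION))
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));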

Related

How to retrieve the query string in the HiveMetastoreListener?

I am trying to create a listener on the Hive metastore where I need to retrieve the query submitted to the metastore. Is there any way to retrieve the queryString?
In the MetastoreListener we get events such as onCreate, onDelete etc.
It is possible to have a post-execution hook on Hive, but I need the listener on the metastore so that all DDL commands are caught regardless of where they are executed from.
Is there any way to capture the events in the metastore and apply the same events to another, parallel metastore setup?
Context: I am trying to upgrade Hive from version 1.x to 3.x,
where the idea is to have a stateless metastore-service setup in Kubernetes,
but I am not sure how compatible the query syntax is between the two versions. So I want to run a hot-hot setup in parallel and monitor the results of the queries. Is there any way, from the MetastoreListener, to transfer the DDL events from one metastore to another and execute them simultaneously?
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.metastore.events.AlterTableEvent;
import org.apache.hadoop.hive.metastore.MetaStoreEventListener;
import org.apache.hadoop.hive.metastore.events.CreateTableEvent;
import org.apache.hadoop.hive.metastore.api.MetaException;
import org.codehaus.jackson.map.ObjectMapper;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.IOException;
import java.time.LocalDateTime;
public class HiveMetastoreListener extends MetaStoreEventListener {
private static final Logger LOGGER = LoggerFactory.getLogger(HiveMetastoreListener.class);
private static final ObjectMapper objMapper = new ObjectMapper();
private final DataProducer dataProducer = DataProducer.getInstance();
public HiveMetastoreListener(Configuration config) {
super(config);
}
/**
* Handler for a CreateTable Event
*/
@Override
public void onCreateTable(CreateTableEvent tableEvent) throws MetaException{
super.onCreateTable(tableEvent);
try {
String data = null;
dataProducer.produceToKafka("metastore_topic", LocalDateTime.now().toString(), data);
}catch (Exception e) {
System.out.println("Error:- " + e);
}
}
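The handler above only ships a null payload; here is a hedged sketch of how it could forward the actual DDL so a consumer attached to the second metastore can replay it. The topic name is a placeholder, and serializing the Thrift Table object with Jackson may need tuning:

@Override
public void onCreateTable(CreateTableEvent tableEvent) throws MetaException {
    super.onCreateTable(tableEvent);
    try {
        // CreateTableEvent exposes the full Thrift Table definition of the new table.
        String data = objMapper.writeValueAsString(tableEvent.getTable());
        // Placeholder topic; a consumer subscribed to it can issue the same CREATE TABLE
        // against the parallel metastore.
        dataProducer.produceToKafka("metastore_ddl_events", LocalDateTime.now().toString(), data);
    } catch (Exception e) {
        LOGGER.error("Failed to forward CreateTableEvent", e);
    }
}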

Extracting Rally Defect Discussion using the Java Rally Rest API

I am attempting to write a simple Java program which will connect to Rally, fetch all of the defects, and return the defect details, including the discussion, as a Java object. The problem here is that the Discussion is returned as what I believe is a collection, because only a URL is given. I am stuck on how to return the discussion for the defect as an object within the JSON rather than as another query which would have to be run separately (thousands of times, I presume, since we have thousands of defects).
Here is my code:
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import com.google.gson.JsonArray;
import com.google.gson.JsonElement;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import com.rallydev.rest.RallyRestApi;
import com.rallydev.rest.request.GetRequest;
import com.rallydev.rest.request.QueryRequest;
import com.rallydev.rest.request.UpdateRequest;
import com.rallydev.rest.response.QueryResponse;
import com.rallydev.rest.util.Fetch;
import com.rallydev.rest.util.QueryFilter;
import com.rallydev.rest.util.Ref;
import org.json.simple.JSONArray;
public class ExtractData{
public static void main(String[] args) throws URISyntaxException, IOException, NumberFormatException
{
RallyRestApi restApi = new RallyRestApi(new URI("https://rally1.rallydev.com"), "apiKeyHere");
restApi.setProxy(URI.create("http://usernameHere:passwordHere0@proxyHere:8080"));
restApi.setApplicationName("QueryExample");
//Will store all of the parsed defect data
JSONArray defectData = new JSONArray();
try{
QueryRequest defects = new QueryRequest("defect");
defects.setFetch(new Fetch("FormattedID","Discussion","Resolution"));
defects.setQueryFilter(new QueryFilter("Resolution","=","Configuration Change"));
defects.setPageSize(5000);
defects.setLimit(5000);
QueryResponse queryResponse = restApi.query(defects);
if(queryResponse.wasSuccessful()){
System.out.println(String.format("\nTotal results: %d",queryResponse.getTotalResultCount()));
for(JsonElement result: queryResponse.getResults()){
JsonObject defect = result.getAsJsonObject();
System.out.println(defect);
}
}else{
System.err.print("The following errors occured: ");
for(String err: queryResponse.getErrors()){
System.err.println("\t+err");
}
}
}finally{
restApi.close();
}
}
}
Here is an example of what I am getting when I attempt this:
{"_rallyAPIMajor":"2","_rallyAPIMinor":"0","_ref":"https://rally1.rallydev.com/slm/webservice/v2.0/defect/30023232168","_refObjectUUID":"cea42323c2f-d276-4078-92cc-6fc32323ae","_objectVersion":"6","_refObjectName":"Example defect name","Discussion":{"_rallyAPIMajor":"2","_rallyAPIMinor":"0","_ref":"https://rally1.rallydev.com/slm/webservice/v2.0/Defect/32323912168/Discussion","_type":"ConversationPost","Count":0},"FormattedID":"DE332322","Resolution":"Configuration Change","Summary":{"Discussion":{"Count":0}},"_type":"Defect"}
As you can see, the discussion is returned as a URL rather than as the actual discussion content. Since this query will be used at runtime, I'd prefer to get the entire object.
Unfortunately there is no way to get all of that data in one request; you'll have to load the Discussion collection for each defect you read. Also of note, the max page size is 2000.
This isn't exactly the same as what you're trying to do, but this example shows loading child stories much like you'd load discussions...
https://github.com/RallyCommunity/rally-java-rest-apps/blob/master/GetChildStories.java#L37
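For what it's worth, here is a hedged sketch of that per-defect follow-up request, using the toolkit's collection-query form of QueryRequest; the fetched field names are illustrative, not verified:

for (JsonElement result : queryResponse.getResults()) {
    JsonObject defect = result.getAsJsonObject();

    // Query the Discussion collection referenced by this defect.
    QueryRequest discussionRequest = new QueryRequest(defect.getAsJsonObject("Discussion"));
    discussionRequest.setFetch(new Fetch("User", "Text", "CreationDate"));
    QueryResponse discussionResponse = restApi.query(discussionRequest);

    if (discussionResponse.wasSuccessful()) {
        // Attach the resolved posts to the defect JSON so downstream code sees real content.
        defect.add("DiscussionPosts", discussionResponse.getResults());
    }
}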

Adding values in HBase using Put.add() method

I'm writing simple Java client code to add values to an HBase table. I'm using put.add(byte[] columnFamily, byte[] columnQualifier, byte[] value), but this method is deprecated in the new HBase API. Can anyone please tell me the way to do it using the new Put API?
Using Maven, I have downloaded the jar for HBase version 1.2.0.
I'm using the following code :
package com.NoSQL;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
public class PopulatingData {
public static void main(String[] args) throws IOException{
String table = "Employee";
Logger.getRootLogger().setLevel(Level.WARN);
Configuration conf = HBaseConfiguration.create();
Connection con = ConnectionFactory.createConnection(conf);
Admin admin = con.getAdmin();
if(admin.tableExists(TableName.valueOf(table))) {
Table htable = con.getTable(TableName.valueOf(table));
/*********** adding a new row ***********/
// adding a row key
Put p = new Put(Bytes.toBytes("row1"));
p.add(Bytes.toBytes("ContactDetails"), Bytes.toBytes("Mobile"), Bytes.toBytes("9876543210"));
p.add(Bytes.toBytes("ContactDetails"), Bytes.toBytes("Email"), Bytes.toBytes("abhc@gmail.com"));
p.add(Bytes.toBytes("Personal"), Bytes.toBytes("Name"), Bytes.toBytes("Abhinav Rawat"));
p.add(Bytes.toBytes("Personal"), Bytes.toBytes("Age"), Bytes.toBytes("21"));
p.add(Bytes.toBytes("Personal"), Bytes.toBytes("Gender"), Bytes.toBytes("M"));
p.add(Bytes.toBytes("Employement"), Bytes.toBytes("Company"), Bytes.toBytes("UpGrad"));
p.add(Bytes.toBytes("Employement"), Bytes.toBytes("DOJ"), Bytes.toBytes("11:06:2018"));
p.add(Bytes.toBytes("Employement"), Bytes.toBytes("Designation"), Bytes.toBytes("ContentStrategist"));
htable.put(p);
/**********************/
System.out.print("Table is Populated");`enter code here`
}else {
System.out.println("The HBase Table named "+table+" doesn't exists.");
}
System.out.println("Returnning Main");
}
}
Use the addColumn() method:
Put put = new Put(Bytes.toBytes(rowKey));
put.addColumn(NAME_FAMILY, NAME_COL_QUALIFIER, name);
Please refer to the javadoc below for more details:
https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Put.html
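Applied to the code in the question, the Put block would look roughly like this (same data, just addColumn() in place of the deprecated add()):

Put p = new Put(Bytes.toBytes("row1"));
p.addColumn(Bytes.toBytes("ContactDetails"), Bytes.toBytes("Mobile"), Bytes.toBytes("9876543210"));
p.addColumn(Bytes.toBytes("ContactDetails"), Bytes.toBytes("Email"), Bytes.toBytes("abhc@gmail.com"));
p.addColumn(Bytes.toBytes("Personal"), Bytes.toBytes("Name"), Bytes.toBytes("Abhinav Rawat"));
p.addColumn(Bytes.toBytes("Personal"), Bytes.toBytes("Age"), Bytes.toBytes("21"));
p.addColumn(Bytes.toBytes("Personal"), Bytes.toBytes("Gender"), Bytes.toBytes("M"));
p.addColumn(Bytes.toBytes("Employement"), Bytes.toBytes("Company"), Bytes.toBytes("UpGrad"));
p.addColumn(Bytes.toBytes("Employement"), Bytes.toBytes("DOJ"), Bytes.toBytes("11:06:2018"));
p.addColumn(Bytes.toBytes("Employement"), Bytes.toBytes("Designation"), Bytes.toBytes("ContentStrategist"));
htable.put(p);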

Google Cloud Dataflow issue with writing the data (TextIO or DatastoreIO)

OK, everyone. Another Dataflow question from a Dataflow newbie. (Just started playing with it this week..)
I'm creating a data pipeline to take in a list of product names and generate autocomplete data. The data processing part all seems to be working fine, but I'm missing something obvious, because when I add my last ".apply" to use either DatastoreIO or TextIO to write the data out, I get a syntax error in my IDE that says the following:
"The method apply(DatastoreV1.Write) is undefined for the type ParDo.SingleOutput<KV<String, List<String>>, Entity>"
It gives me the option to add a cast to the method receiver, but that obviously isn't the answer. Do I need to do some other step before I try to write the data out? My last step before trying to write the data is a call to an Entity helper for Dataflow to change my pipeline structure from KV<String, List<String>> to Entity, which seems to me like what I'd need in order to write to Datastore.
I got so frustrated with this thing the last few days, I even decided to write the data to some AVRO files instead so I could just load it in Datastore by hand. Imagine how ticked I was when I got all that done and got the exact same error in the exact same place on my call to TextIO. That is why I think I must be missing something very obvious here.
Here is my code. I included it all for reference, but you probably just need to look at the main() at the bottom. Any input would be greatly appreciated! Thanks!
MrSimmonsSr
package com.client.autocomplete;
import com.client.autocomplete.AutocompleteOptions;
import com.google.datastore.v1.Entity;
import com.google.datastore.v1.Key;
import com.google.datastore.v1.Value;
import static com.google.datastore.v1.client.DatastoreHelper.makeKey;
import static com.google.datastore.v1.client.DatastoreHelper.makeValue;
import org.apache.beam.sdk.coders.DefaultCoder;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;
import com.google.api.services.bigquery.model.TableRow;
import com.google.common.base.MoreObjects;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.datastore.DatastoreIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.DoFn.ProcessContext;
import org.apache.beam.sdk.transforms.DoFn.ProcessElement;
import org.apache.beam.sdk.extensions.jackson.ParseJsons;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.options.Validation;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.List;
import java.util.ArrayList;
/*
* A simple Dataflow pipeline to create autocomplete data from a list of
* product names. It then loads that prefix data into Google Cloud Datastore for consumption by
* a Google Cloud Function. That function will take in a prefix and return a list of 10 product names
*
* Pseudo Code Steps
* 1. Load a list of product names from Cloud Storage
* 2. Generate prefixes for use with autocomplete, based on the product names
* 3. Merge the prefix data together with 10 products per prefix
* 4. Write that prefix data to the Cloud Datastore as a KV with a <String>, List<String> structure
*
*/
public class ClientAutocompletePipeline {
private static final Logger LOG = LoggerFactory.getLogger(ClientAutocompletePipeline.class);
/**
* A DoFn that keys each product name by all of its prefixes.
* This creates one row in the PCollection for each prefix<->product_name pair
*/
private static class AllPrefixes
extends DoFn<String, KV<String, String>> {
private final int minPrefix;
private final int maxPrefix;
public AllPrefixes(int minPrefix) {
this(minPrefix, 10);
}
public AllPrefixes(int minPrefix, int maxPrefix) {
this.minPrefix = minPrefix;
this.maxPrefix = maxPrefix;
}
@ProcessElement
public void processElement(ProcessContext c) {
String productName= c.element().toString();
for (int i = minPrefix; i <= Math.min(productName.length(), maxPrefix); i++) {
c.output(KV.of(productName.substring(0, i), c.element()));
}
}
}
/**
* Takes as input the top product names per prefix, and emits an entity
* suitable for writing to Cloud Datastore.
*
*/
static class FormatForDatastore extends DoFn<KV<String, List<String>>, Entity> {
private String kind;
private String ancestorKey;
public FormatForDatastore(String kind, String ancestorKey) {
this.kind = kind;
this.ancestorKey = ancestorKey;
}
@ProcessElement
public void processElement(ProcessContext c) {
// Initialize an EntityBuilder and get it a valid key
Entity.Builder entityBuilder = Entity.newBuilder();
Key key = makeKey(kind, ancestorKey).build();
entityBuilder.setKey(key);
// New HashMap to hold all the properties of the Entity
Map<String, Value> properties = new HashMap<>();
String prefix = c.element().getKey();
String productsString = "Products[";
// iterate through the product names and add each one to the productsString
for (String productName : c.element().getValue()) {
// products.add(productName);
productsString += productName + ", ";
}
productsString += "]";
properties.put("prefix", makeValue(prefix).build());
properties.put("products", makeValue(productsString).build());
entityBuilder.putAllProperties(properties);
c.output(entityBuilder.build());
}
}
/**
* Options supported by this class.
*
* <p>Inherits standard Beam example configuration options.
*/
public interface Options
extends AutocompleteOptions {
#Description("Input text file")
#Validation.Required
String getInputFile();
void setInputFile(String value);
#Description("Cloud Datastore entity kind")
#Default.String("prefix-product-map")
String getKind();
void setKind(String value);
#Description("Whether output to Cloud Datastore")
#Default.Boolean(true)
Boolean getOutputToDatastore();
void setOutputToDatastore(Boolean value);
#Description("Cloud Datastore ancestor key")
#Default.String("root")
String getDatastoreAncestorKey();
void setDatastoreAncestorKey(String value);
#Description("Cloud Datastore output project ID, defaults to project ID")
String getOutputProject();
void setOutputProject(String value);
}
public static void main(String[] args) throws IOException{
Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
// create the pipeline
Pipeline p = Pipeline.create(options);
PCollection<String> toWrite = p
// A step to read in the product names from a text file on GCS
.apply(TextIO.read().from("gs://sample-product-data/clean_product_names.txt"))
// Next expand the product names into KV pairs with prefix as key (<KV<String, String>>)
.apply("Explode Prefixes", ParDo.of(new AllPrefixes(2)))
// Apply a GroupByKey transform to the PCollection "flatCollection" to create "productsGroupedByPrefix".
.apply(GroupByKey.<String, String>create())
// Now format the PCollection for writing into the Google Datastore
.apply("FormatForDatastore", ParDo.of(new FormatForDatastore(options.getKind(),
options.getDatastoreAncestorKey()))
// Write the processed data to the Google Cloud Datastore
// NOTE: This is the line that I'm getting the error on!!
.apply(DatastoreIO.v1().write().withProjectId(MoreObjects.firstNonNull(
options.getOutputProject(), options.getOutputProject()))));
// Run the pipeline.
PipelineResult result = p.run();
}
}
I think you need another closing parenthesis. I've removed some of the extraneous bits and re-indented according to the parentheses:
PCollection<String> toWrite = p
.apply(TextIO.read().from("..."))
.apply("Explode Prefixes", ...)
.apply(GroupByKey.<String, String>create())
.apply("FormatForDatastore", ParDo.of(new FormatForDatastore(
options.getKind(), options.getDatastoreAncestorKey()))
.apply(...);
Specifically, you need another parenthesis to close the apply("FormatForDatastore", ...). Right now, it is trying to call ParDo.of(...).apply(...) which doesn't work.
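In other words, the tail of main() should look something like this once the parenthesis is added (sketch only; note that the Datastore write returns PDone, so the PCollection<String> toWrite assignment goes away):

p
    .apply(TextIO.read().from("gs://sample-product-data/clean_product_names.txt"))
    .apply("Explode Prefixes", ParDo.of(new AllPrefixes(2)))
    .apply(GroupByKey.<String, String>create())
    .apply("FormatForDatastore", ParDo.of(
        new FormatForDatastore(options.getKind(), options.getDatastoreAncestorKey())))  // now closed here
    .apply(DatastoreIO.v1().write().withProjectId(
        MoreObjects.firstNonNull(options.getOutputProject(), options.getOutputProject())));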

How to read multiple types of Avro data in single MapReduce

I have two different types of Avro data which have some common fields. I want to read those common fields in the mapper, and I want to do that by spawning a single job on the cluster.
Below is the sample avro schema
Schema 1:
{"type":"record","name":"Test","namespace":"com.abc.schema.SchemaOne","doc":"Avro storing with schema using MR.","fields":[{"name":"EE","type":"string","default":null},
{"name":"AA","type":["null","long"],"default":null},
{"name":"BB","type":["null","string"],"default":null},
{"name":"CC","type":["null","string"],"default":null}]}
Schema 2:
{"type":"record","name":"Test","namespace":"com.abc.schema.SchemaTwo","doc":"Avro storing with schema using MR.","fields":[{"name":"EE","type":"string","default":null},
{"name":"AA","type":["null","long"],"default":null},
{"name":"CC","type":["null","string"],"default":null},
{"name":"DD","type":["null","string"],"default":null}]}
Driver Class:
package com.mango.schema.aggrDaily;
import java.util.Date;
import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.Pair;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class AvroDriver extends Configured implements Tool {
@Override
public int run(String[] args) throws Exception {
JobConf conf = new JobConf(super.getConf(), getClass());
conf.setJobName("DF");
args[0] = "hdfs://localhost:9999/home/hadoop/work/alok/aggrDaily/data/avro512MB/part-m-00000.avro";
args[1] = "/home/hadoop/work/alok/tmp"; // temp location
args[2] = "hdfs://localhost:9999/home/hadoop/work/alok/tmp/10";
FileInputFormat.addInputPaths(conf, args[0]);
FileOutputFormat.setOutputPath(conf, new Path(args[2]));
AvroJob.setInputReflect(conf);
AvroJob.setMapperClass(conf, AvroMapper.class);
AvroJob.setOutputSchema(
conf,
Pair.getPairSchema(Schema.create(Schema.Type.STRING),
Schema.create(Schema.Type.INT)));
RunningJob job = JobClient.runJob(conf);
return 0;
}
public static void main(String[] args) throws Exception {
long startTime = new Date().getTime();
System.out.println("Start Time :::::" + startTime);
Configuration conf = new Configuration();
int exitCode = ToolRunner.run(conf, new AvroDriver(), args);
long endTime = new Date().getTime();
System.out.println("End Time :::::" + endTime);
System.out.println("Total Time Taken:::"
+ new Double((endTime - startTime) * 0.001) + "Sec.");
System.exit(exitCode);
}
}
Mapper class:
package com.mango.schema.aggrDaily;
import java.io.IOException;
import org.apache.avro.generic.GenericData;
import org.apache.avro.mapred.AvroCollector;
import org.apache.avro.mapred.AvroMapper;
import org.apache.avro.mapred.Pair;
import org.apache.hadoop.mapred.Reporter;
public class AvroMapper extends
AvroMapper<GenericData, Pair<CharSequence, Integer>> {
@Override
public void map(GenericData record,
AvroCollector<Pair<CharSequence, Integer>> collector, Reporter reporter) throws IOException {
System.out.println("record :: " + record);
}
}
I am able to read the Avro data with this code by setting the input schema:
AvroJob.setInputSchema(conf, new AggrDaily().getSchema());
Since the Avro data has the schema built into the data files, I don't want to pass a specific schema to the job explicitly. I achieved this in Pig, but now I want to achieve the same thing in MapReduce as well.
Can anybody help me achieve this through MR code, or let me know where I am going wrong?
Using the org.apache.hadoop.mapreduce.lib.input.MultipleInputs class we can read multiple Avro datasets through a single MR job.
We cannot use org.apache.hadoop.mapreduce.lib.input.MultipleInputs to read multiple Avro datasets, because each of the Avro inputs has a schema associated with it and currently the context can store the schema for only one of the inputs, so the other mappers will not be able to read their data.
The same is true of HCatInputFormat (because every input has a schema associated with it). However, from HCatalog 0.14 onwards there is a provision for this.
AvroMultipleInputs can be used to accomplish the same. It works only with Specific and Reflect mappings. It is available from version 1.7.7 onwards.
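A hedged sketch of what that wiring might look like in the old-API driver; the paths, the parsed schemaOne/schemaTwo Schema objects, and the two mapper classes are placeholders, and the addInputPath argument order should be checked against the AvroMultipleInputs javadoc for your Avro version:

// Assumes: Schema schemaOne, schemaTwo already parsed (e.g. via new Schema.Parser().parse(...)),
// and two AvroMapper subclasses, one per input type. All names here are placeholders.
JobConf conf = new JobConf(AvroDriver.class);
conf.setJobName("MultiSchemaAvro");

AvroMultipleInputs.addInputPath(conf, new Path("/data/schemaOne"), SchemaOneMapper.class, schemaOne);
AvroMultipleInputs.addInputPath(conf, new Path("/data/schemaTwo"), SchemaTwoMapper.class, schemaTwo);

AvroJob.setOutputSchema(conf,
        Pair.getPairSchema(Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.INT)));
FileOutputFormat.setOutputPath(conf, new Path("/data/out"));
JobClient.runJob(conf);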
