I have the code below as my Spark driver. When I execute my program, it works properly, saving the required data as a Parquet file.
String indexFile = "index.txt";
JavaRDD<String> indexData = sc.textFile(indexFile).cache();
JavaRDD<String> jsonStringRDD = indexData.map(new Function<String, String>() {
    @Override
    public String call(String patientId) throws Exception {
        return "json array as string";
    }
});
//1. Read json string array into a Dataframe (execution 1)
DataFrame dataSchemaDF = sqlContext.read().json(jsonStringRDD );
//2. Save dataframe as parquet file (execution 2)
dataSchemaDF.write().parquet("md.parquet");
But I observed that my mapper function on the RDD indexData is getting executed twice:
First, when I read jsonStringRDD as a DataFrame using SQLContext.
Second, when I write dataSchemaDF to the Parquet file.
Can you guide me on how to avoid this repeated execution? Is there a better way of converting a JSON string into a DataFrame?
I believe that the reason is the lack of a schema for the JSON reader. When you execute:
sqlContext.read().json(jsonStringRDD);
Spark has to infer the schema for the newly created DataFrame. To do that it has to scan the input RDD, and this step is performed eagerly.
If you want to avoid it, you have to create a StructType which describes the shape of the JSON documents:
StructType schema;
...
and use it when you create the DataFrame:
DataFrame dataSchemaDF = sqlContext.read().schema(schema).json(jsonStringRDD);
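As a concrete illustration, here is a minimal sketch of building such a schema by hand with the Spark 1.x Java API; the field names patientId and visits are hypothetical placeholders to be replaced with the actual fields of your JSON documents.
import java.util.Arrays;

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Hypothetical fields: a string id and an array of strings.
StructType schema = DataTypes.createStructType(Arrays.asList(
    DataTypes.createStructField("patientId", DataTypes.StringType, true),
    DataTypes.createStructField("visits",
        DataTypes.createArrayType(DataTypes.StringType), true)));

// With an explicit schema, the JSON reader no longer needs the eager
// inference pass over jsonStringRDD.
DataFrame dataSchemaDF = sqlContext.read().schema(schema).json(jsonStringRDD);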
I need to publish BigQuery table rows to Kafka in Avro format.
PCollection<TableRow> rows =
    pipeline
        .apply(
            "Read from BigQuery query",
            BigQueryIO.readTableRows().from(String.format("%s:%s.%s", project, dataset, table)));
// How to convert rows to Avro format?
rows.apply(KafkaIO.<Long, ???>write()
    .withBootstrapServers("kafka:29092")
    .withTopic("test")
    .withValueSerializer(KafkaAvroSerializer.class)
);
How to convert TableRow to Avro format?
Use MapElements
rows.apply(MapElements.via(new SimpleFunction<TableRow, GenericRecord>() {
    @Override
    public GenericRecord apply(TableRow input) {
        log.info("Parsing {} to Avro", input);
        return null; // TODO: Replace with Avro object
    }
}));
If TableRow is a collection type that you want to convert to many records, you can use FlatMapElements instead.
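For completeness, here is a minimal sketch of filling in that TODO, assuming a hypothetical Avro schema with fields "id" and "name" that mirror the BigQuery columns; adapt the schema and the field copying to your actual table.
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import com.google.api.services.bigquery.model.TableRow;

public class TableRowToAvro {
    // Hypothetical Avro schema; replace with the schema of your topic.
    private static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"TestRecord\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"long\"},"
        + "{\"name\":\"name\",\"type\":\"string\"}]}";
    private static final Schema SCHEMA = new Schema.Parser().parse(SCHEMA_JSON);

    public static GenericRecord convert(TableRow row) {
        // TableRow behaves like a Map<String, Object>, so copy each expected field.
        GenericRecord record = new GenericData.Record(SCHEMA);
        record.put("id", Long.parseLong(String.valueOf(row.get("id"))));
        record.put("name", String.valueOf(row.get("name")));
        return record;
    }
}
The body of apply in the SimpleFunction above would then simply return TableRowToAvro.convert(input).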
As for writing to Kafka, I wrote a simple example
Goal
I am working on an ETL from MongoDB to Hive using Spark (2.3.1) with Java.
Where I am right now
I can load the existing MongoDB data and show/query it.
Problem
But I have an issue saving it to a Hive table.
MongoDB data structure
The current MongoDB data is a complicated nested dict (struct type). Is there a way to transform it so it can be saved in Hive more easily?
public static void main(final String[] args) throws InterruptedException {
// spark session read mongodb
SparkSession mongo_spark = SparkSession.builder()
.master("local")
.appName("MongoSparkConnectorIntro")
.config("mongo_spark.master", "local")
.config("spark.mongodb.input.uri", "mongodb://localhost:27017/test_db.test_collection")
.config("spark.mongodb.output.uri", "mongodb://localhost:27017/test_db.test_collection")
.enableHiveSupport()
.getOrCreate();
// Create a JavaSparkContext using the SparkSession's SparkContext object
JavaSparkContext jsc = new JavaSparkContext(mongo_spark.sparkContext());
// Load data and infer schema, disregard toDF() name as it returns Dataset
Dataset<Row> implicitDS = MongoSpark.load(jsc).toDF();
implicitDS.printSchema();
implicitDS.show();
// createOrReplaceTempView
implicitDS.createOrReplaceTempView("my_table");
// mongo_spark.sql("DROP TABLE IF EXISTS my_table");
// cannot save table this step
// implicitDS.write().saveAsTable("my_table");
// can query the temp view
mongo_spark.sql("SELECT * FROM my_table limit 1").show();
// More application logic would go here...
JavaMongoRDD<Document> rdd = MongoSpark.load(jsc);
System.out.println(rdd.count());
System.out.println(rdd.first().toJson());
jsc.close();
}
Does anyone have experience doing this kind of ETL Spark job in Java? I would really appreciate it.
As I worked more on it, I realized this is a broad question. The precise answer to this question is:
public static void main(final String[] args) throws InterruptedException {
// spark session read mongodb
SparkSession mongo_spark = SparkSession.builder()
.master("local")
.appName("MongoSparkConnectorIntro")
.config("mongo_spark.master", "local")
.config("spark.mongodb.input.uri", "mongodb://localhost:27017/test_db.test_collection")
.config("spark.mongodb.output.uri", "mongodb://localhost:27017/test_db.test_collection")
.enableHiveSupport()
.getOrCreate();
// Create a JavaSparkContext using the SparkSession's SparkContext object
JavaSparkContext jsc = new JavaSparkContext(mongo_spark.sparkContext());
// Load data and infer schema, disregard toDF() name as it returns Dataset
Dataset<Row> implicitDS = MongoSpark.load(jsc).toDF();
implicitDS.printSchema();
implicitDS.show();
// createOrReplaceTempView
implicitDS.createOrReplaceTempView("my_table");
mongo_spark.sql("DROP TABLE IF EXISTS my_table");
implicitDS.write().saveAsTable("my_table");
jsc.close();
}
So the code above actually works, but what blocked me was something happening in my data:
Conflicting data types in a single field (com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast…) - this can be resolved by increasing the sample size while loading (a sketch follows below); for the Java syntax, see
How to config Java Spark sparksession samplesize
NullType in a nested structure - for this one I am still looking for a solution in Java.
Since most of what I found in my research was Scala code samples, I'll do my best to record what I found, and hopefully it can save you some time one day.
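For the first issue, here is a minimal sketch of raising the schema-inference sample size, based on the MongoDB Spark connector's spark.mongodb.input.sampleSize option; treat the value 100000 as an assumption to tune for your collection.
SparkSession mongo_spark = SparkSession.builder()
    .master("local")
    .appName("MongoSparkConnectorIntro")
    .config("spark.mongodb.input.uri", "mongodb://localhost:27017/test_db.test_collection")
    .config("spark.mongodb.output.uri", "mongodb://localhost:27017/test_db.test_collection")
    // Sample more documents when inferring the schema, so that rare
    // fields and conflicting types are more likely to be seen up front.
    .config("spark.mongodb.input.sampleSize", 100000)
    .enableHiveSupport()
    .getOrCreate();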
I'm trying to read avro files with Apache Beam and use Beam SQL to transform the data.
I'm still new to Beam and Java. Here's my simple code:
public class BeamSQLReadAvro {
@SuppressWarnings("serial")
public static void main(String[] args) throws IOException {
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
Pipeline p = Pipeline.create(options);
/* Schema definition */
Schema schema = new Schema.Parser().parse(new File("data/RATE_CODE/RATE_CODE.avsc"));
/* Create record/row */
PCollection<GenericRecord> records = p.apply(AvroIO.readGenericRecords(schema).from("data/RATE_CODE/*.avro"));
/* SQL Transform */
records.apply("SQL Transform 01",SqlTransform.query("SELECT RCODE,RNAME,RDESC FROM PCOLLECTION LIMIT 10"))
/* Print output */
.apply("Output",
MapElements.via(
new SimpleFunction<Row, Row>() {
@Override
public Row apply(Row input) {
System.out.println("PCOLLECTION: " + input.getValues());
return input;
}
}
)
);
p.run().waitUntilFinish();
}
}
It gives me this error:
Exception in thread "main" java.lang.IllegalStateException: Cannot call getSchema when there is no schema
I don't understand; I have defined a variable called schema. Any pointers here?
Actually, there are two types of schemas in your pipeline - Avro and Beam schemas. The Avro schema is used to parse your Avro input records, but for the SQL transform you are supposed to use rows with a Beam schema. To do this, AvroIO provides the option withBeamSchemas(boolean), which should be set to true in your case, like:
AvroIO.readGenericRecords(schema).withBeamSchemas(true).from("data/RATE_CODE/*.avro")
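Plugged back into the pipeline from the question, the read step would then look like this (a sketch; the file paths and the query are the placeholders from the original code):
/* Create record/row - now with a Beam row schema attached */
PCollection<GenericRecord> records = p.apply(
        AvroIO.readGenericRecords(schema)
              .withBeamSchemas(true)
              .from("data/RATE_CODE/*.avro"));

/* SQL Transform can now resolve the schema of PCOLLECTION */
records.apply("SQL Transform 01",
        SqlTransform.query("SELECT RCODE,RNAME,RDESC FROM PCOLLECTION LIMIT 10"));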
I am using Java.
I am receiving a file path in Kafka messages, and I need to load this file into a Spark RDD, process it, and dump it to HDFS.
I am able to retrieve the file path from the Kafka message, and I wish to create a Dataset/RDD over this file.
I cannot run a map function on the Kafka message Dataset. It errors out with an NPE, as the SparkContext is not available on the worker.
I cannot run a foreach on the Kafka message Dataset. It errors out with the message:
Queries with streaming sources must be executed with writeStream.start();
I cannot collect the data received from the Kafka message Dataset, as it errors out with the message:
Queries with streaming sources must be executed with writeStream.start();;
I guess this must be a very common use case and must be running in a lot of setups.
How can I load the file as an RDD from the paths that I receive in the Kafka messages?
SparkSession spark = SparkSession.builder()
.appName("MyKafkaStreamReader")
.master("local[4]")
.config("spark.executor.memory", "2g")
.getOrCreate();
// Create DataSet representing the stream of input lines from kafka
Dataset<String> kafkaValues = spark.readStream()
.format("kafka")
.option("spark.streaming.receiver.writeAheadLog.enable", true)
.option("kafka.bootstrap.servers", Configuration.KAFKA_BROKER)
.option("subscribe", Configuration.KAFKA_TOPIC)
.option("fetchOffset.retryIntervalMs", 100)
.option("checkpointLocation", "file:///tmp/checkpoint")
.load()
.selectExpr("CAST(value AS STRING)").as(Encoders.STRING());
Dataset<String> messages = kafkaValues.map(x -> {
ObjectMapper mapper = new ObjectMapper();
String m = mapper.readValue(x.getBytes(), String.class);
return m;
}, Encoders.STRING() );
// ====================
// TEST 1 : FAILS
// ====================
// CODE TRYING TO execute MAP on the received RDD
// This fails with a Null pointer exception because "spark" is not available on worker node
/*
Dataset<String> statusRDD = messages.map(message -> {
// BELOW STATEMENT FAILS
Dataset<Row> fileDataset = spark.read().option("header", "true").csv(message);
Dataset<Row> dedupedFileDataset = fileDataset.dropDuplicates();
dedupedFileDataset.rdd().saveAsTextFile(getHdfsLocation());
return getHdfsLocation();
}, Encoders.STRING());
StreamingQuery query2 = statusRDD.writeStream().outputMode("append").format("console").start();
*/
// ====================
// TEST 2 : FAILS
// ====================
// CODE BELOW FAILS WITH EXCEPTION
// "Queries with streaming sources must be executed with writeStream.start();;"
// Hence, processing the deduplication on the worker side using
/*
JavaRDD<String> messageRDD = messages.toJavaRDD();
messageRDD.foreach( message -> {
Dataset<Row> fileDataset = spark.read().option("header", "true").csv(message);
Dataset<Row> dedupedFileDataset = fileDataset.dropDuplicates();
dedupedFileDataset.rdd().saveAsTextFile(getHdfsLocation());
});
*/
// ====================
// TEST 3 : FAILS
// ====================
// CODE TRYING TO COLLECT ALSO FAILS WITH EXCEPTION
// "Queries with streaming sources must be executed with writeStream.start();;"
// List<String> mess = messages.collectAsList();
Any idea on how I can read the file paths and create RDDs over the files?
In Structured Streaming, I don't think that there's a way to reify the data in one stream to be used as a parameter for a Dataset operation.
In the Spark ecosystem, this is possible by combining Spark Streaming and Spark SQL (Datasets). We can use Spark Streaming to consume the Kafka topic and then, using Spark SQL, we can load the corresponding data and apply the desired process.
Such a job would look roughly like this (this is in Scala; Java code follows the same structure, only a bit more verbose; a Java sketch is included after the Scala example):
// configure and create spark Session
val spark = SparkSession
.builder
.config(...)
.getOrCreate()
// create streaming context with a 30-second interval - adjust as required
val streamingContext = new StreamingContext(spark.sparkContext, Seconds(30))
// this uses Kafka080 client. Kafka010 has some subscription differences
val kafkaParams = Map[String, String](
"metadata.broker.list" -> kafkaBootstrapServer,
"group.id" -> "job-group-id",
"auto.offset.reset" -> "largest",
"enable.auto.commit" -> (false: java.lang.Boolean).toString
)
// create a kafka direct stream
val topics = Set("topic")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
streamingContext, kafkaParams, topics)
// extract the values from the kafka message
val dataStream = stream.map{case (id, data) => data}
// process the data
dataStream.foreachRDD { dataRDD =>
// get all data received in the current interval
// We are assuming that this data fits in memory.
// We're not processing a million files per second, are we?
val files = dataRDD.collect()
files.foreach{ file =>
// this is the process proposed in the question --
// notice how we have access to the spark session in the context of the foreachRDD
val fileDataset = spark.read.option("header", "true").csv(file)
val dedupedFileDataset = fileDataset.dropDuplicates()
// this can probably be written in terms of the dataset api
//dedupedFileDataset.rdd().saveAsTextFile(getHdfsLocation());
dedupedFileDataset.write.format("text").mode("overwrite").save(getHdfsLocation())
}
}
// start the streaming process
streamingContext.start()
streamingContext.awaitTermination()
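For reference, here is a rough Java translation of the sketch above, assuming the spark-streaming-kafka-0-8 connector and placeholder broker, topic, and output locations; treat it as a starting point rather than a tested job.
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

import kafka.serializer.StringDecoder;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

public class KafkaFilePathReader {
    public static void main(String[] args) throws InterruptedException {
        // configure and create spark session
        SparkSession spark = SparkSession.builder()
            .appName("KafkaFilePathReader")
            .getOrCreate();

        // streaming context with a 30-second interval - adjust as required
        JavaStreamingContext jssc = new JavaStreamingContext(
            new JavaSparkContext(spark.sparkContext()), Durations.seconds(30));

        Map<String, String> kafkaParams = new HashMap<>();
        kafkaParams.put("metadata.broker.list", "kafka:9092"); // placeholder broker
        kafkaParams.put("group.id", "job-group-id");
        kafkaParams.put("auto.offset.reset", "largest");
        Set<String> topics = Collections.singleton("topic");   // placeholder topic

        // kafka direct stream (Kafka 0.8 client, as in the Scala sketch above)
        JavaPairInputDStream<String, String> stream = KafkaUtils.createDirectStream(
            jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
            kafkaParams, topics);

        // the message value carries the file path
        JavaDStream<String> paths = stream.map(tuple -> tuple._2());

        paths.foreachRDD(rdd -> {
            // collect the paths of the current interval on the driver,
            // where the SparkSession is available
            for (String file : rdd.collect()) {
                Dataset<Row> fileDataset = spark.read().option("header", "true").csv(file);
                Dataset<Row> dedupedFileDataset = fileDataset.dropDuplicates();
                // placeholder output location
                dedupedFileDataset.write().mode("overwrite").csv("hdfs:///tmp/deduped/" + file.hashCode());
            }
        });

        // start the streaming process
        jssc.start();
        jssc.awaitTermination();
    }
}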
I am using Spark 1.5.0.
I have a set of files on S3 containing JSON data in sequence file format, around 60 GB in total. I have to fire around 40 queries on this dataset and store the results back to S3.
All queries are select statements with a condition on the same field, e.g. select a,b,c from t where event_type='alpha', select x,y,z from t where event_type='beta', etc.
I am using a 5-node AWS EMR cluster with 2 core nodes and 2 task nodes.
Some fields could be missing in the input, e.g. a could be missing, so the first query, which selects a, would fail. To avoid this I have defined a schema for each event_type. So, for event_type alpha, the schema would be like {"a": "", "b": "", "c": "", "event_type": ""}.
Based on the schema defined for each event, I'm creating a DataFrame from the input RDD for each event with the corresponding schema.
I'm using the following code:
JavaPairRDD<LongWritable,BytesWritable> inputRDD = jsc.sequenceFile(bucket, LongWritable.class, BytesWritable.class);
JavaRDD<String> events = inputRDD.map(
new Function<Tuple2<LongWritable,BytesWritable>, String>() {
public String call(Tuple2<LongWritable,BytesWritable> tuple) throws JSONException, UnsupportedEncodingException {
String valueAsString = new String(tuple._2.getBytes(), "UTF-8");
JSONObject data = new JSONObject(valueAsString);
JSONObject payload = new JSONObject(data.getString("payload"));
return payload.toString();
}
}
);
events.cache();
for (String event_type: events_list) {
String query = //read query from another s3 file event_type.query
String jsonSchemaString = //read schema from another s3 file event_type.json
List<String> jsonSchema = Arrays.asList(jsonSchemaString);
JavaRDD<String> jsonSchemaRDD = jsc.parallelize(jsonSchema);
DataFrame df_schema = sqlContext.read().option("header", "true").json(jsonSchemaRDD);
StructType schema = df_schema.schema();
DataFrame df_query = sqlContext.read().schema(schema).option("header", "true").json(events);
df_query.registerTempTable(tableName);
DataFrame df_results = sqlContext.sql(query);
df_results.write().format("com.databricks.spark.csv").save("s3n://some_location");
}
This code is very inefficient; it takes around 6-8 hours to run. How can I optimize my code?
Should I try using HiveContext?
I think the current code is making multiple passes over the data, though I'm not sure, since I have cached the RDD. How can I do it in a single pass if that is so?