Goal
I am working on an ETL job from MongoDB to Hive using Spark (2.3.1) with Java.
Where I am right now
I can load an existing MongoDB collection and show/query the data.
Problem
I have an issue saving it to a Hive table.
Mongodb data structure
The current MongoDB data is a complicated nested dict (struct type). Is there a way to transform it so it can be saved to Hive more easily?
public static void main(final String[] args) throws InterruptedException {
// spark session read mongodb
SparkSession mongo_spark = SparkSession.builder()
.master("local")
.appName("MongoSparkConnectorIntro")
.config("mongo_spark.master", "local")
.config("spark.mongodb.input.uri", "mongodb://localhost:27017/test_db.test_collection")
.config("spark.mongodb.output.uri", "mongodb://localhost:27017/test_db.test_collection")
.enableHiveSupport()
.getOrCreate();
// Create a JavaSparkContext using the SparkSession's SparkContext object
JavaSparkContext jsc = new JavaSparkContext(mongo_spark.sparkContext());
// Load data and infer schema, disregard toDF() name as it returns Dataset
Dataset<Row> implicitDS = MongoSpark.load(jsc).toDF();
implicitDS.printSchema();
implicitDS.show();
// createOrReplaceTempView
implicitDS.createOrReplaceTempView("my_table");
// mongo_spark.sql("DROP TABLE IF EXISTS my_table");
// cannot save table this step
// implicitDS.write().saveAsTable("my_table");
// can query the temp view
mongo_spark.sql("SELECT * FROM my_table limit 1").show();
// More application logic would go here...
JavaMongoRDD<Document> rdd = MongoSpark.load(jsc);
System.out.println(rdd.count());
System.out.println(rdd.first().toJson());
jsc.close();
}
Does anyone have experience doing this kind of ETL Spark job in Java?
I would really appreciate it.
After working on it more, I realized this is a broad question. The precise answer to it is:
public static void main(final String[] args) throws InterruptedException {
// spark session read mongodb
SparkSession mongo_spark = SparkSession.builder()
.master("local")
.appName("MongoSparkConnectorIntro")
.config("mongo_spark.master", "local")
.config("spark.mongodb.input.uri", "mongodb://localhost:27017/test_db.test_collection")
.config("spark.mongodb.output.uri", "mongodb://localhost:27017/test_db.test_collection")
.enableHiveSupport()
.getOrCreate();
// Create a JavaSparkContext using the SparkSession's SparkContext object
JavaSparkContext jsc = new JavaSparkContext(mongo_spark.sparkContext());
// Load data and infer schema, disregard toDF() name as it returns Dataset
Dataset<Row> implicitDS = MongoSpark.load(jsc).toDF();
implicitDS.printSchema();
implicitDS.show();
// createOrReplaceTempView
implicitDS.createOrReplaceTempView("my_table");
mongo_spark.sql("DROP TABLE IF EXISTS my_table");
implicitDS.write().saveAsTable("my_table");
jsc.close();
}
So the code above actually works; what blocked me was some things happening in my data:
Conflicting data types in a single field (com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast…) - this can be resolved by increasing the sample size while loading; check the Java syntax in the sketch below.
How to config Java Spark sparksession samplesize
NullType in a nested structure - for this one I am still seeking a solution in Java.
Since most of the research I found only had Scala code samples, I'll do my best to record what I found, and hopefully it can save you some time one day.
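For the sample size part, here is a minimal sketch of what I mean in Java. It assumes the MongoDB Spark connector's spark.mongodb.input.sampleSize read option; the value 50000 is just an arbitrary illustrative choice.
// Raise the schema-inference sample size through the connector's read configuration
// (spark.mongodb.input.sampleSize is assumed here; 50000 is an arbitrary value)
SparkSession mongo_spark = SparkSession.builder()
        .master("local")
        .appName("MongoSparkConnectorIntro")
        .config("spark.mongodb.input.uri", "mongodb://localhost:27017/test_db.test_collection")
        .config("spark.mongodb.output.uri", "mongodb://localhost:27017/test_db.test_collection")
        .config("spark.mongodb.input.sampleSize", 50000)
        .enableHiveSupport()
        .getOrCreate();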
Related
I am using Java.
I am receiving a filepath over Kafka messages. And I need to load this file into a spark RDD, process it, and dump it into HDFS.
I am able to retrieve the filepath from the Kafka message, and I wish to create a dataset / RDD over this file.
I cannot run a map function on the Kafka message dataset. It errors out with an NPE because sparkContext is not available on the worker.
I cannot run a foreach on the Kafka messages dataset. It errors out with message:
Queries with streaming sources must be executed with writeStream.start();"
I cannot collect the data received from the Kafka message dataset, as it errors out with the message:
Queries with streaming sources must be executed with writeStream.start();;
I guess this must be a very general use case that must be running in a lot of setups.
How can I load the file as RDD from the paths that I receive in Kafka message?
SparkSession spark = SparkSession.builder()
.appName("MyKafkaStreamReader")
.master("local[4]")
.config("spark.executor.memory", "2g")
.getOrCreate();
// Create DataSet representing the stream of input lines from kafka
Dataset<String> kafkaValues = spark.readStream()
.format("kafka")
.option("spark.streaming.receiver.writeAheadLog.enable", true)
.option("kafka.bootstrap.servers", Configuration.KAFKA_BROKER)
.option("subscribe", Configuration.KAFKA_TOPIC)
.option("fetchOffset.retryIntervalMs", 100)
.option("checkpointLocation", "file:///tmp/checkpoint")
.load()
.selectExpr("CAST(value AS STRING)").as(Encoders.STRING());
Dataset<String> messages = kafkaValues.map(x -> {
ObjectMapper mapper = new ObjectMapper();
String m = mapper.readValue(x.getBytes(), String.class);
return m;
}, Encoders.STRING() );
// ====================
// TEST 1 : FAILS
// ====================
// CODE TRYING TO execute MAP on the received RDD
// This fails with a Null pointer exception because "spark" is not available on worker node
/*
Dataset<String> statusRDD = messages.map(message -> {
// BELOW STATEMENT FAILS
Dataset<Row> fileDataset = spark.read().option("header", "true").csv(message);
Dataset<Row> dedupedFileDataset = fileDataset.dropDuplicates();
dedupedFileDataset.rdd().saveAsTextFile(getHdfsLocation());
return getHdfsLocation();
}, Encoders.STRING());
StreamingQuery query2 = statusRDD.writeStream().outputMode("append").format("console").start();
*/
// ====================
// TEST 2 : FAILS
// ====================
// CODE BELOW FAILS WITH EXCEPTION
// "Queries with streaming sources must be executed with writeStream.start();;"
// Hence, tried processing the deduplication on the worker side using foreach
/*
JavaRDD<String> messageRDD = messages.toJavaRDD();
messageRDD.foreach( message -> {
Dataset<Row> fileDataset = spark.read().option("header", "true").csv(message);
Dataset<Row> dedupedFileDataset = fileDataset.dropDuplicates();
dedupedFileDataset.rdd().saveAsTextFile(getHdfsLocation());
});
*/
// ====================
// TEST 3 : FAILS
// ====================
// CODE TRYING TO COLLECT ALSO FAILS WITH EXCEPTION
// "Queries with streaming sources must be executed with writeStream.start();;"
// List<String> mess = messages.collectAsList();
Any idea how I can read the file paths and create RDDs over the files?
In Structured Streaming, I don't think that there's a way to reify the data in one stream to be used as a parameter for a Dataset operation.
In the Spark ecosystem, this is possible by combining Spark Streaming and Spark SQL (Datasets). We can use Spark Streaming to consume the Kafka topic and then, using Spark SQL, we can load the corresponding data and apply the desired process.
Such a job would look roughly like this (this is in Scala; Java code would follow the same structure, only a bit more verbose):
// configure and create spark Session
val spark = SparkSession
.builder
.config(...)
.getOrCreate()
// create streaming context with a 30-second interval - adjust as required
val streamingContext = new StreamingContext(spark.sparkContext, Seconds(30))
// this uses Kafka080 client. Kafka010 has some subscription differences
val kafkaParams = Map[String, String](
"metadata.broker.list" -> kafkaBootstrapServer,
"group.id" -> "job-group-id",
"auto.offset.reset" -> "largest",
"enable.auto.commit" -> (false: java.lang.Boolean).toString
)
// create a kafka direct stream
val topics = Set("topic")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
streamingContext, kafkaParams, topics)
// extract the values from the kafka message
val dataStream = stream.map{case (id, data) => data}
// process the data
dataStream.foreachRDD { dataRDD =>
// get all data received in the current interval
// We are assuming that this data fits in memory.
// We're not processing a million files per second, are we?
val files = dataRDD.collect()
files.foreach{ file =>
// this is the process proposed in the question --
// notice how we have access to the spark session in the context of the foreachRDD
val fileDataset = spark.read.option("header", "true").csv(file)
val dedupedFileDataset = fileDataset.dropDuplicates()
// this can probably be written in terms of the dataset api
//dedupedFileDataset.rdd().saveAsTextFile(getHdfsLocation());
dedupedFileDataset.write.format("text").mode("overwrite").save(getHdfsLocation())
}
}
// start the streaming process
streamingContext.start()
streamingContext.awaitTermination()
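For reference, here is a rough Java sketch of the same structure (my own untested translation). It assumes the spark-streaming-kafka-0-8 artifact to match the Kafka080 client above, uses placeholder broker, topic, and output values, and writes the deduplicated data with the csv writer instead of format("text"), since the deduplicated dataset may have more than one column.
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

import kafka.serializer.StringDecoder;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

public class KafkaFilePathJob {
    public static void main(String[] args) throws InterruptedException {
        SparkSession spark = SparkSession.builder()
                .appName("KafkaFilePathJob")
                .master("local[4]")
                .getOrCreate();

        // 30-second batch interval, as in the Scala example above
        JavaStreamingContext jssc = new JavaStreamingContext(
                new JavaSparkContext(spark.sparkContext()), Durations.seconds(30));

        Map<String, String> kafkaParams = new HashMap<>();
        kafkaParams.put("metadata.broker.list", "localhost:9092"); // placeholder broker
        kafkaParams.put("group.id", "job-group-id");
        kafkaParams.put("auto.offset.reset", "largest");
        kafkaParams.put("enable.auto.commit", "false");

        Set<String> topics = Collections.singleton("topic"); // placeholder topic

        JavaPairInputDStream<String, String> stream = KafkaUtils.createDirectStream(
                jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
                kafkaParams, topics);

        // keep only the message value, i.e. the file path
        JavaDStream<String> dataStream = stream.map(record -> record._2());

        dataStream.foreachRDD(dataRDD -> {
            // foreachRDD runs on the driver, so the SparkSession is usable here
            List<String> files = dataRDD.collect();
            for (String file : files) {
                Dataset<Row> fileDataset = spark.read().option("header", "true").csv(file);
                Dataset<Row> dedupedFileDataset = fileDataset.dropDuplicates();
                dedupedFileDataset.write().mode("overwrite").csv(getHdfsLocation());
            }
        });

        jssc.start();
        jssc.awaitTermination();
    }

    // placeholder for the helper referenced in the question
    private static String getHdfsLocation() {
        return "hdfs:///tmp/deduped-output";
    }
}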
So I'm new to Apache Spark and I have a file that looks like this:
Name Size Records
File1 1,000 104,370
File2 950 91,780
File3 1,500 109,123
File4 2,170 113,888
File5 2,000 111,974
File6 1,820 110,666
File7 1,200 106,771
File8 1,500 108,991
File9 1,000 104,007
File10 1,300 107,037
File11 1,900 111,109
File12 1,430 108,051
File13 1,780 110,006
File14 2,010 114,449
File15 2,017 114,889
This is my sample/test data. I'm working on an anomaly detection program, and I have to test other files with the same format but different values and detect which ones have anomalies in the size and records values (if the size/records of another file differ a lot from the standard, or if size and records are not proportional to each other). I decided to start trying different ML algorithms, and I wanted to start with the k-means approach. I tried putting this file into the following line:
KMeansModel model = kmeans.fit(file)
file is already parsed into a Dataset variable. However, I get an error, and I'm pretty sure it has to do with the structure/schema of the file. Is there a way to work with structured/labeled/organized data when trying to fit it to a model?
I get the following error: Exception in thread "main" java.lang.IllegalArgumentException: Field "features" does not exist.
And this is the code:
public class practice {
public static void main(String[] args) {
SparkConf conf = new SparkConf().setAppName("Anomaly Detection").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
SparkSession spark = SparkSession
.builder()
.appName("Anomaly Detection")
.getOrCreate();
String day1 = "C:\\Users\\ZK0GJXO\\Documents\\day1.txt";
Dataset<Row> df = spark.read().
option("header", "true").
option("delimiter", "\t").
csv(day1);
df.show();
KMeans kmeans = new KMeans().setK(2).setSeed(1L);
KMeansModel model = kmeans.fit(df);
}
}
Thanks
By default all Spark ML models train on a column called "features". One can specify a different input column name via the setFeaturesCol method http://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/clustering/KMeans.html#setFeaturesCol(java.lang.String)
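For instance, a one-line sketch (the column name "myFeatures" is just a placeholder for whatever column actually holds the feature vectors):
KMeansModel model = new KMeans().setK(2).setSeed(1L).setFeaturesCol("myFeatures").fit(df);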
update:
One can combine multiple columns into a single feature vector using VectorAssembler:
VectorAssembler assembler = new VectorAssembler()
.setInputCols(new String[]{"size", "records"})
.setOutputCol("features");
Dataset<Row> vectorized_df = assembler.transform(df);
KMeans kmeans = new KMeans().setK(2).setSeed(1L);
KMeansModel model = kmeans.fit(vectorized_df);
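One caveat from the sample data above: read with header=true and a tab delimiter, Size and Records come in as string columns with thousands separators (e.g. "1,000"), and VectorAssembler expects numeric input columns. A hedged sketch of cleaning them up first (column names taken from the sample file):
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.regexp_replace;

// strip the thousands separators and cast to double before assembling
Dataset<Row> numeric_df = df
    .withColumn("size", regexp_replace(col("Size"), ",", "").cast("double"))
    .withColumn("records", regexp_replace(col("Records"), ",", "").cast("double"));

Dataset<Row> vectorized_df = assembler.transform(numeric_df);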
One can further streamline and chain these feature transformations with the pipeline API https://spark.apache.org/docs/latest/ml-pipeline.html#example-pipeline
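A sketch of that chaining, reusing the assembler and kmeans objects from above (and the numeric_df from the previous sketch):
// org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage}
// chain the VectorAssembler and KMeans as stages of a single Pipeline
Pipeline pipeline = new Pipeline()
    .setStages(new PipelineStage[]{assembler, kmeans});
PipelineModel pipelineModel = pipeline.fit(numeric_df);
Dataset<Row> clustered = pipelineModel.transform(numeric_df);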
I have the code below as my Spark driver; when I execute the program it works properly, saving the required data as a Parquet file.
String indexFile = "index.txt";
JavaRDD<String> indexData = sc.textFile(indexFile).cache();
JavaRDD<String> jsonStringRDD = indexData.map(new Function<String, String>() {
@Override
public String call(String patientId) throws Exception {
return "json array as string"
}
});
//1. Read json string array into a Dataframe (execution 1)
DataFrame dataSchemaDF = sqlContext.read().json(jsonStringRDD );
//2. Save dataframe as parquet file (execution 2)
dataSchemaDF.write().parquet("md.parquet");
But I observed my mapper function on RDD indexData is getting executed twice.
First, when I read jsonStringRDD as a DataFrame using SQLContext.
Second, when I write dataSchemaDF to the Parquet file.
Can you guide me on this, how to avoid this repeated execution? Is there any other better way of converting JSON string into a Dataframe?
I believe that the reason is a lack of schema for JSON reader. When you execute:
sqlContext.read().json(jsonStringRDD);
Spark has to infer the schema for the newly created DataFrame. To do that it has to scan the input RDD, and this step is performed eagerly.
If you want to avoid it you have to create a StructType which describes the shape of the JSON documents:
StructType schema;
...
and use it when you create DataFrame:
DataFrame dataSchemaDF = sqlContext.read().schema(schema).json(jsonStringRDD);
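For illustration, a minimal sketch of such a schema in Java. The field names here are hypothetical placeholders, since the real shape of the JSON documents isn't shown above:
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// hypothetical fields standing in for the real JSON document shape
StructType schema = new StructType(new StructField[]{
    new StructField("patientId", DataTypes.StringType, true, Metadata.empty()),
    new StructField("events", DataTypes.createArrayType(DataTypes.StringType), true, Metadata.empty())
});

// the schema is supplied up front, so no eager scan of jsonStringRDD is needed
DataFrame dataSchemaDF = sqlContext.read().schema(schema).json(jsonStringRDD);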
I tried to create an empty DataFrame with Spark 2.0.0 and Java 1.8. I then want to append a schema to it. After this I would like to insert content with SQL insert statements.
SparkSession sparkSession = SparkSession
.builder()
.getOrCreate();
Dataset<Row> emptyDataset = null;
try {
//This part is still working:
emptyDataset = sparkSession.emptyDataFrame();
//This part has no effect at all:
emptyDataset.schema().add("id", DataTypes.IntType)
.add("date", DataTypes.DateType)
.add("type", DataTypes.StringType);
} catch(Exception e) {
System.out.println("Nope");
}
System.out.println("Schema:");
emptyDataset.printSchema();
And I get this, but no Schema:
Schema:
root
Any ideas what is wrong?
thx!
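For context, schema() just returns the dataset's StructType, and StructType.add returns a new StructType instead of modifying the existing one, so the chained add calls above are discarded. A minimal sketch of one way to get an empty Dataset<Row> that actually carries a schema (building the StructType first and passing it to createDataFrame with an empty row list):
import java.util.Collections;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// build the schema first, then create an empty Dataset<Row> that carries it
StructType schema = new StructType()
    .add("id", DataTypes.IntegerType)
    .add("date", DataTypes.DateType)
    .add("type", DataTypes.StringType);

Dataset<Row> emptyDataset = sparkSession.createDataFrame(Collections.<Row>emptyList(), schema);
emptyDataset.printSchema();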
I'm fairly new to the Google Cloud Platform and I'm trying Google Dataflow for the first time for a project for my postgraduate programme. What I want to do is write an automated load job that loads files from a certain bucket on my Cloud Storage and inserts the data from it into a BigQuery table.
I get the data as a PCollection<String> type, but for insertion into BigQuery I apparently need to transform it to a PCollection<TableRow> type. So far I haven't found a solid answer on how to do this.
Here's my code:
public static void main(String[] args) {
//Defining the schema of the BigQuery table
List<TableFieldSchema> fields = new ArrayList<>();
fields.add(new TableFieldSchema().setName("Datetime").setType("TIMESTAMP"));
fields.add(new TableFieldSchema().setName("Consumption").setType("FLOAT"));
fields.add(new TableFieldSchema().setName("MeterID").setType("STRING"));
TableSchema schema = new TableSchema().setFields(fields);
//Creating the pipeline
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
Pipeline p = Pipeline.create(options);
//Getting the data from cloud storage
PCollection<String> lines = p.apply(TextIO.Read.named("ReadCSVFromCloudStorage").from("gs://mybucket/myfolder/certainCSVfile.csv"));
//Probably need to do some transform here ...
//Inserting data into BigQuery
lines.apply(BigQueryIO.Write
.named("WriteToBigQuery")
.to("projectID:datasetID:tableID")
.withSchema(schema)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
}
I'm probably just forgetting something basic, so I hope you guys can help me with this ...
BigQueryIO.Write operates on PCollection<TableRow>, as outlined in Writing to BigQuery. You'll need to apply a transform to convert your PCollection<String> into a PCollection<TableRow>. For an example, take a look at StringToRowConverter:
static class StringToRowConverter extends DoFn<String, TableRow> {
/**
* In this example, put the whole string into a single BigQuery field.
*/
@Override
public void processElement(ProcessContext c) {
c.output(new TableRow().set("string_field", c.element()));
}
...
}
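Hooked into the pipeline from the question, it would look roughly like the sketch below. The CsvToRowConverter here is hypothetical: it assumes a plain comma-separated line layout of Datetime,Consumption,MeterID matching the schema defined earlier, and the Dataflow 1.x SDK already in use above (ParDo.of alongside TextIO.Read).
// Hypothetical converter from one CSV line to a TableRow matching the schema above.
static class CsvToRowConverter extends DoFn<String, TableRow> {
    @Override
    public void processElement(ProcessContext c) {
        String[] parts = c.element().split(",");
        c.output(new TableRow()
            .set("Datetime", parts[0])
            .set("Consumption", Double.parseDouble(parts[1]))
            .set("MeterID", parts[2]));
    }
}

// In main(), between the TextIO read and the BigQueryIO write:
PCollection<TableRow> rows = lines.apply(ParDo.of(new CsvToRowConverter()));
// ...then apply the existing BigQueryIO.Write to "rows" instead of "lines".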