Retain keys with null values while writing JSON in spark - java

I am trying to write a JSON file using spark. There are some keys that have null as value. These show up just fine in the DataSet, but when I write the file, the keys get dropped. How do I ensure they are retained?
code to write the file:
ddp.coalesce(20).write().mode("overwrite").json("hdfs://localhost:9000/user/dedupe_employee");
part of JSON data from source:
"event_header": {
"accept_language": null,
"app_id": "App_ID",
"app_name": null,
"client_ip_address": "IP",
"event_id": "ID",
"event_timestamp": null,
"offering_id": "Offering",
"server_ip_address": "IP",
"server_timestamp": 1492565987565,
"topic_name": "Topic",
"version": "1.0"
}
Output:
"event_header": {
"app_id": "App_ID",
"client_ip_address": "IP",
"event_id": "ID",
"offering_id": "Offering",
"server_ip_address": "IP",
"server_timestamp": 1492565987565,
"topic_name": "Topic",
"version": "1.0"
}
In the above example keys accept_language, app_name and event_timestamp have been dropped.

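Spark versions before 3.0 do not provide an option to keep null fields when writing JSON, so a custom solution such as the following is needed. It serializes each record with Jackson and writes the result as text.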
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper
import spark.implicits._ // needed for toDS() and the String encoder; already in scope in spark-shell
case class EventHeader(accept_language:String,app_id:String,app_name:String,client_ip_address:String,event_id: String,event_timestamp:String,offering_id:String,server_ip_address:String,server_timestamp:Long,topic_name:String,version:String)
val ds = Seq(EventHeader(null,"App_ID",null,"IP","ID",null,"Offering","IP",1492565987565L,"Topic","1.0")).toDS()
// Build the mapper once per partition, because ObjectMapper is not serializable.
val ds1 = ds.mapPartitions(records => {
  val mapper = new ObjectMapper with ScalaObjectMapper
  mapper.registerModule(DefaultScalaModule)
  // Jackson writes null fields by default, so the keys are retained.
  records.map(mapper.writeValueAsString(_))
})
ds1.coalesce(1).write.text("hdfs://localhost:9000/user/dedupe_employee")
This produces the following output:
{"accept_language":null,"app_id":"App_ID","app_name":null,"client_ip_address":"IP","event_id":"ID","event_timestamp":null,"offering_id":"Offering","server_ip_address":"IP","server_timestamp":1492565987565,"topic_name":"Topic","version":"1.0"}

If you are on Spark 3, you can add
spark.sql.jsonGenerator.ignoreNullFields false

ignoreNullFields is an option, available since Spark 3, that controls whether null fields are dropped when a DataFrame is written as JSON.
If you need Spark 2 (specifically PySpark 2.4.6), you can convert the DataFrame to an RDD of Python dicts and then call RDD.saveAsTextFile to write the output to HDFS. The following example may help.
cols = ddp.columns
ddp_ = ddp.rdd
ddp_ = ddp_.map(lambda row: dict([(c, row[c]) for c in cols]))
ddp_.repartition(1).saveAsTextFile(your_hdfs_file_path)
This should produce an output file like the following, which is the Python repr of each dict rather than valid JSON:
{'accept_language': None, 'app_id': '123', ...}
{'accept_language': None, 'app_id': '456', ...}
To get JSON null instead of Python None, dump every dict to a JSON string before calling saveAsTextFile (this requires import json):
ddp_ = ddp_.map(lambda row: json.dumps(row, ensure_ascii=False))
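The difference between saving the dicts directly and dumping them first can be checked without a cluster. The following is a minimal sketch using only Python's standard json module; the rows list and the to_json_line helper are hypothetical stand-ins for the RDD contents and for the lambda passed to map. It assumes that non-string elements are written using their str() form.

```python
import json

# Hypothetical stand-in for the dicts produced by the row-to-dict map step.
rows = [
    {"accept_language": None, "app_id": "123"},
    {"accept_language": None, "app_id": "456"},
]


def to_json_line(row):
    # json.dumps keeps keys whose value is None and writes them as JSON null.
    return json.dumps(row, ensure_ascii=False)


# str() of a dict is its Python repr, which is not valid JSON.
repr_lines = [str(row) for row in rows]
# One valid JSON object per line, with null keys retained.
json_lines = [to_json_line(row) for row in rows]

print(repr_lines[0])
print(json_lines[0])
```

The second printed line can be parsed by any JSON reader, while the first cannot.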

Since Spark 3, if you are using the DataFrameWriter class
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameWriter.html#json-java.lang.String-
(the same applies to PySpark)
https://spark.apache.org/docs/3.0.0-preview/api/python/_modules/pyspark/sql/readwriter.html
its json method has an option ignoreNullFields. In PySpark the parameter defaults to None,
which falls back to the default value, true (null fields are dropped).
So just set this option to false.
ddp.coalesce(20).write().mode("overwrite").option("ignoreNullFields", "false").json("hdfs://localhost:9000/user/dedupe_employee");

To retain null values when converting to JSON, set this config option when building the session.
spark = (
    SparkSession.builder.master("local[1]")
    .config("spark.sql.jsonGenerator.ignoreNullFields", "false")
    .getOrCreate()
)

Related

How to join two spark dataset to one with java objects?

I have a little problem joining two datasets in spark, I have this:
SparkConf conf = new SparkConf()
.setAppName("MyFunnyApp")
.setMaster("local[*]");
SparkSession spark = SparkSession
.builder()
.config(conf)
.config("spark.debug.maxToStringFields", 150)
.getOrCreate();
//...
//Do stuff
//...
Encoder<MyOwnObject1> encoderObject1 = Encoders.bean(MyOwnObject1.class);
Encoder<MyOwnObject2> encoderObject2 = Encoders.bean(MyOwnObject2.class);
Dataset<MyOwnObject1> object1DS = spark.read()
.option("header","true")
.option("delimiter",";")
.option("inferSchema","true")
.csv(pathToFile1)
.as(encoderObject1);
Dataset<MyOwnObject2> object2DS = spark.read()
.option("header","true")
.option("delimiter",";")
.option("inferSchema","true")
.csv(pathToFile2)
.as(encoderObject2);
I can print the schema and show it correctly.
// The problem starts here
Dataset<Tuple2<MyOwnObject1, MyOwnObject2>> joinObjectDS =
object1DS.join(object2DS, object1DS.col("column01")
.equalTo(object2DS.col("column01")))
.as(Encoders.tuple(encoderObject1, encoderObject2));
The last line cannot perform the join and gives me this error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Try to map struct<"LIST WITH ALL VARS FROM TWO OBJECT"> to Tuple2, but failed as the number of fields does not line up.;
That makes sense, because join returns a flat row containing all the columns of both objects, which does not match the two fields of Tuple2.
Then I tried this:
Dataset<Tuple2<MyOwnObject1, MyOwnObject2>> joinObjectDS = object1DS
.joinWith(object2DS, object1DS
.col("column01")
.equalTo(object2DS.col("column01")));
This works fine! But I need a new Dataset without tuples. I have an object3 that has some fields from object1 and object2, and then I get this problem:
Encoder<MyOwnObject3> encoderObject3 = Encoders.bean(MyOwnObject3.class);
Dataset<MyOwnObject3> object3DS = joinObjectDS.map(tupleObject1Object2 -> {
MyOwnObject1 myOwnObject1 = tupleObject1Object2._1();
MyOwnObject2 myOwnObject2 = tupleObject1Object2._2();
MyOwnObject3 myOwnObject3 = new MyOwnObject3(); //Sets all vars with start values
//...
//Sets data from object 1 and 2 to 3.
//...
return myOwnObject3;
}, encoderObject3);
It fails! Here is the error:
17/05/10 12:17:43 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 593, Column 72: A method named "toString" is not declared in any enclosing class nor any supertype, nor through a static import
followed by thousands of error lines...
What can I do? I have tried:
Making my objects only with String, int (or Integer) and double (or Double) fields (nothing else)
Using different encoders such as kryo or javaSerialization
Using JavaRDD (works, but very slowly) and using DataFrames with Rows (works, but I would need to change many objects)
Making all my Java objects serializable
Using Spark 2.1.0 and 2.1.1; I now have 2.1.1 in my pom.xml
I want to use Datasets to get the speed of DataFrames and the object syntax of JavaRDD...
Help?
Thanks
Finally I found a solution.
I had a problem with the inferSchema option when my code was creating a Dataset. I have a String column that inferSchema reads as an Integer column because all of its values are numeric, but I need to use them as Strings (like "0001", "0002"...). I needed to define a schema, but I have many fields, so I wrote this for each of my classes:
List<StructField> fieldsObject1 = new ArrayList<>();
for (Field field : MyOwnObject1.class.getDeclaredFields()) {
fieldsObject1.add(DataTypes.createStructField(
field.getName(),
CatalystSqlParser.parseDataType(field.getType().getSimpleName()),
true)
);
}
StructType schemaObject1 = DataTypes.createStructType(fieldsObject1);
Dataset<MyOwnObject1> object1DS = spark.read()
.option("header","true")
.option("delimiter",";")
.schema(schemaObject1)
.csv(pathToFile1)
.as(encoderObject1);
Works fine.
The "best" solution would be this:
Dataset<MyOwnObject1> object1DS = spark.read()
.option("header","true")
.option("delimiter",";")
.schema(encoderObject1.schema())
.csv(pathToFile1)
.as(encoderObject1);
but encoderObject1.schema() returns a schema with the fields in alphabetical order, not in their original order, so this option fails when I read a CSV. Maybe Encoders should return a schema with the fields in their original order rather than alphabetical order.

jackson unmarshalling problems

I am trying to deserialize a JSON String using Jackson 2 with RestAssured (a Java tool for integration tests).
I have a problem. The String I am trying to deserialize is:
{"Medium":{"uuid":"2","estimatedWaitTime":0,"status":"OPEN_AVAILABLE","name":"Chat","type":"CHAT"}}
The object type "Medium" appears at the beginning of the String. This causes Jackson to fail during deserialization:
com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized field "Medium"
I've set FAIL_ON_UNKNOWN_PROPERTIES to false and then I get no exception during deserialization. However, all of my properties are null in Java.
Response getAvailability -> {"Medium":{"uuid":"2","estimatedWaitTime":0,"status":"OPEN_AVAILABLE","name":"Chat","type":"CHAT"}}
### MEDIUM name -> null
### MEDIUM uuid -> null
### MEDIUM wait time -> null
### MEDIUM wait time -> null
### MEDIUM status -> null
Can anyone help me? (Note: I can't change my input JSON string.)
{
"Medium": {
"uuid": "2",
"estimatedWaitTime": 0,
"status": "OPEN_AVAILABLE",
"name": "Chat",
"type": "CHAT"
}
}
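As you can see, uuid and the other parameters are part of the Medium object, so the JSON can be deserialized into classes like these:
class Medium {
    public String uuid;
    public int estimatedWaitTime;
    public String status;
    public String name;
    public String type;
}
class BaseObject {
    // Maps the "Medium" wrapper key in the input JSON.
    @JsonProperty("Medium")
    public Medium medium;
}
and then use new ObjectMapper().readValue(json, BaseObject.class).
The code above is a sketch; json is the input string.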
You need to put the annotation
@JsonRootName("Medium")
on your bean class and configure the object mapper with
mapper.enable(DeserializationFeature.UNWRAP_ROOT_VALUE);
You need a way to remove the object name that is part of the input JSON. Since you cannot change the input string, use this code to read the input string into a tree and get the value of the "Medium" node.
ObjectMapper m = new ObjectMapper();
JsonNode root = m.readTree("{\"Medium\":{\"uuid\":\"2\",\"estimatedWaitTime\":0,\"status\":\"OPEN_AVAILABLE\",\"name\":\"Chat\",\"type\":\"CHAT\"}}");
JsonNode obj = root.get("Medium");
// asText() returns an empty string for object nodes, so bind the node directly.
Medium medium = m.treeToValue(obj, Medium.class);
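The unwrap-then-bind idea does not depend on Jackson. As a language-neutral illustration, here is a minimal sketch in Python using only the standard library; the Medium dataclass and the parse_medium helper are hypothetical and simply mirror the fields of the JSON in the question.

```python
import json
from dataclasses import dataclass


@dataclass
class Medium:
    # Field names mirror the keys of the inner JSON object.
    uuid: str
    estimatedWaitTime: int
    status: str
    name: str
    type: str


def parse_medium(payload: str) -> Medium:
    root = json.loads(payload)   # parse the whole document into a tree
    node = root["Medium"]        # step into the wrapper element
    return Medium(**node)        # bind the inner object's fields


payload = (
    '{"Medium":{"uuid":"2","estimatedWaitTime":0,'
    '"status":"OPEN_AVAILABLE","name":"Chat","type":"CHAT"}}'
)
medium = parse_medium(payload)
print(medium.name, medium.status)
```

Binding the outer document directly would fail for the same reason as in the question: the top level has a single key, "Medium", and none of the bean's fields.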

Converting Typesafe Config type to java.util.Properties

The title speaks for itself: I have a Config object (from https://github.com/typesafehub/config) and I want to pass it to a constructor which only supports java.util.Properties as an argument.
Is there an easy way to convert a Config to a Properties object ?
Here is a way to convert a typesafe Config object into a Properties java object. I have only tested it in a simple case for creating Kafka properties.
Given this configuration in application.conf
kafka-topics {
my-topic {
zookeeper.connect = "localhost:2181",
group.id = "testgroup",
zookeeper.session.timeout.ms = "500",
zookeeper.sync.time.ms = "250",
auto.commit.interval.ms = "1000"
}
}
You can create the corresponding Properties object like this:
import com.typesafe.config.{Config, ConfigFactory}
import java.util.Properties
import kafka.consumer.ConsumerConfig
object Application extends App {
def propsFromConfig(config: Config): Properties = {
import scala.collection.JavaConversions._
val props = new Properties()
val map: Map[String, Object] = config.entrySet().map({ entry =>
entry.getKey -> entry.getValue.unwrapped()
})(collection.breakOut)
props.putAll(map)
props
}
val config = ConfigFactory.load()
val consumerConfig = {
val topicConfig = config.getConfig("kafka-topics.my-topic")
val props = propsFromConfig(topicConfig)
new ConsumerConfig(props)
}
// ...
}
The function propsFromConfig is what you are mainly interested in. The key points are the use of entrySet to get a flattened list of properties, and the call to unwrapped on each entry value, which returns an Object whose type depends on the configuration value.
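To make the flattening concrete: entrySet yields one entry per leaf value, keyed by its full dotted path. The sketch below imitates that behaviour in plain Python on a nested dict. The flatten helper is hypothetical and not part of Typesafe Config, and it ignores HOCON details such as key quoting and lists.

```python
def flatten(config, prefix=""):
    """Collapse a nested dict into a flat dict with dotted-path keys."""
    flat = {}
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, extending the dotted path.
            flat.update(flatten(value, path))
        else:
            # Leaf value: store it under its full path.
            flat[path] = value
    return flat


# Nested form of the my-topic block: dotted keys become nested objects.
topic_config = {
    "zookeeper": {
        "connect": "localhost:2181",
        "session": {"timeout": {"ms": "500"}},
        "sync": {"time": {"ms": "250"}},
    },
    "group": {"id": "testgroup"},
    "auto": {"commit": {"interval": {"ms": "1000"}}},
}

props = flatten(topic_config)
for key in sorted(props):
    print(f"{key}={props[key]}")
```

Each printed line has the key=value shape that a Properties consumer such as the Kafka configuration expects.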
You can try my Scala wrapper https://github.com/andr83/scalaconfig. Using it, converting a Config object to Java Properties is simple:
val properties = config.as[Properties]
Because Typesafe Config/HOCON supports a much richer structure than java.util.Properties, it is hard to get a safe conversion.
In other words, since properties can express only a subset of HOCON, the conversion is not well defined and may lose information.
So if your configuration is rather flat and does not contain UTF-8, you could transform the HOCON to JSON and then extract the values.
A better solution would be to implement a config class, populate it with values from the HOCON, and pass it to the class you want to configure.
It is not possible directly through Typesafe Config. Even rendering the entire HOCON file into JSON does not produce truly valid JSON:
ex:
"play" : {
"filters" : {
"disabled" : ${?play.filters.disabled}[
"play.filters.hosts.AllowedHostsFilter"
],
"disabled" : ${?play.filters.disabled}[
"play.filters.csrf.CSRFFilter"
]
}
}
That format comes directly from Config.render.
As you can see, disabled is represented twice with HOCON-style syntax.
I have also had problems with rendering HOCON -> JSON -> HOCON.
Example hocon:
http {
port = "9000"
port = ${?HTTP_PORT}
}
Typesafe Config would render this as
{
"http": {
"port": "9000,${?HTTP_PORT}"
}
}
However, if you try to parse that as HOCON, it throws a syntax error: the , cannot be there.
The correct HOCON would be 9000${?HTTP_PORT}, with no comma between the values. I believe this is true for all array concatenation and substitution.

Spark: Running multiple queries on multiple files, optimization

I am using spark 1.5.0.
I have a set of files on s3 containing json data in sequence file format, worth around 60GB. I have to fire around 40 queries on this dataset and store results back to s3.
All queries are select statements with a condition on same field. Eg. select a,b,c from t where event_type='alpha', select x,y,z from t where event_type='beta' etc.
I am using an AWS EMR 5 node cluster with 2 core nodes and 2 task nodes.
Some fields could be missing in the input. E.g. a could be missing, so the first query, which selects a, would fail. To avoid this I have defined schemas for each event_type. So, for event_type alpha, the schema would be like {"a": "", "b": "", "c": "", "event_type": ""}
Based on the schemas defined for each event, I'm creating a dataframe from input RDD for each event with the corresponding schema.
I'm using the following code:
JavaPairRDD<LongWritable,BytesWritable> inputRDD = jsc.sequenceFile(bucket, LongWritable.class, BytesWritable.class);
JavaRDD<String> events = inputRDD.map(
new Function<Tuple2<LongWritable,BytesWritable>, String>() {
public String call(Tuple2<LongWritable,BytesWritable> tuple) throws JSONException, UnsupportedEncodingException {
String valueAsString = new String(tuple._2.getBytes(), "UTF-8");
JSONObject data = new JSONObject(valueAsString);
JSONObject payload = new JSONObject(data.getString("payload"));
return payload.toString();
}
}
);
events.cache();
for (String event_type: events_list) {
String query = //read query from another s3 file event_type.query
String jsonSchemaString = //read schema from another s3 file event_type.json
List<String> jsonSchema = Arrays.asList(jsonSchemaString);
JavaRDD<String> jsonSchemaRDD = jsc.parallelize(jsonSchema);
DataFrame df_schema = sqlContext.read().option("header", "true").json(jsonSchemaRDD);
StructType schema = df_schema.schema();
DataFrame df_query = sqlContext.read().schema(schema).option("header", "true").json(events);
df_query.registerTempTable(tableName);
DataFrame df_results = sqlContext.sql(query);
df_results.write().format("com.databricks.spark.csv").save("s3n://some_location");
}
This code is very inefficient; it takes around 6-8 hours to run. How can I optimize it?
Should I try using HiveContext?
I think the current code is taking multiple passes over the data, though I am not sure since I have cached the RDD. If so, how can I do it in a single pass?

Plain string template query for elasticsearch through java API?

I have a template foo.mustache saved in {{ES_HOME}}/config/scripts.
POST to http://localhost:9200/forward/_search/template with the following message body returns a valid response:
{
"template": {
"file": "foo"
},
"params": {
"q": "a",
"hasfilters": false
}
}
I want to translate this to using the java API now that I've validated all the different components work. The documentation here describes how to do it in java:
SearchResponse sr = client.prepareSearch("forward")
.setTemplateName("foo")
.setTemplateType(ScriptService.ScriptType.FILE)
.setTemplateParams(template_params)
.get();
However, I would instead like to just send a plain string query (i.e. the contents of the message body from above) rather than build up the request using the Java API. Is there a way to do this? I know with normal queries, I can construct it like so:
SearchRequestBuilder response = client.prepareSearch("forward")
.setQuery("""JSON_QUERY_HERE""")
I believe the setQuery() method wraps the contents into a query object, which is not what I want for my template query. If this is not possible, I will just have to go with the documented way and convert my json params to Map<String, Object>
I ended up just translating my template_params to a Map<String, Object> as the documentation requires. I utilized groovy's JsonSlurper to convert the text to an object with a pretty simple method.
import groovy.json.JsonSlurper
public static Map<String,Object> convertJsonToTemplateParam(String s) {
Object result = new JsonSlurper().parseText(s);
//Manipulate your result if you need to do any additional work here.
//I.e. Programmatically determine value of hasfilters if filters != null
return (Map<String,Object>) result;
}
And you could pass the following as a string to this method:
{
"q": "a",
"hasfilters": true,
"filters":[
{
"filter_name" : "foo.untouched",
"filters" : [ "FOO", "BAR"]
},
{
"filter_name" : "hello.untouched",
"list" : [ "WORLD"]
}
]
}
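The same conversion, including the optional step of deriving hasfilters from the presence of filters, can be sketched in Python with the standard library. The to_template_params name is hypothetical, and the derivation rule is an assumption based on the comment in the Groovy method above.

```python
import json


def to_template_params(text):
    """Parse a JSON object string into a dict of template parameters."""
    params = json.loads(text)
    if not isinstance(params, dict):
        raise ValueError("template params must be a JSON object")
    # Assumed rule: hasfilters is true exactly when a non-empty
    # "filters" list is present in the input.
    params["hasfilters"] = bool(params.get("filters"))
    return params


params = to_template_params(
    '{"q": "a", "filters": [{"filter_name": "foo.untouched", "filters": ["FOO", "BAR"]}]}'
)
print(params["hasfilters"])
```

Because json.loads raises an error on malformed input, such as a missing comma between members, invalid parameter strings are rejected before they reach the search request.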
