Is it possible in Kryo to serialize an object along with its data schema, or to get the schema from data serialized in the standard way? I need to make sure that the client side doesn't need the class on its classpath: either load the class definition from the serialized data and then use reflection to extract its fields, or deserialize everything into Maps, Lists, primitive types etc., the same as with JSON or XML.
Saves SampleBean as a JSON string:
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "true")
  // With registrationRequired=true, Spark's internal write-path classes must be registered too
  .registerKryoClasses(Array(
    classOf[SampleBean],
    classOf[InternalRow],
    classOf[Array[InternalRow]],
    classOf[WriteTaskResult],
    classOf[FileCommitProtocol.TaskCommitMessage],
    classOf[ExecutedWriteSummary],
    classOf[BasicWriteTaskStats]))

val spark = SparkSession.builder.master("local[*]")
  .config(conf)
  .getOrCreate

import spark.implicits._

val df = List(SampleBean("A", "B")).toDF()
df.write.mode(SaveMode.Overwrite).json("src/main/resources/kryoTest")
df.printSchema()
Reads the data as plain JSON:
val sparkNew = Constant.getSparkSess
val dfNew = sparkNew.read.json("src/main/resources/serialisedJavaObj.json").toDF()
dfNew.printSchema()
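Once the data is on disk as plain JSON (one object per line), a client that does not have SampleBean on its classpath can read it into generic structures. Below is a minimal sketch using Jackson; the part-file name is only a placeholder for whatever Spark actually wrote:
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;

ObjectMapper mapper = new ObjectMapper();
// Placeholder path: point this at one of the part files Spark wrote under kryoTest
for (String line : Files.readAllLines(Paths.get("src/main/resources/kryoTest/part-00000.json"))) {
    // Each JSON line becomes a Map; no SampleBean class is required on the classpath
    Map<String, Object> row = mapper.readValue(line, new TypeReference<Map<String, Object>>() {});
    System.out.println(row);
}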
I have a Spring Boot app which uses a MySQL db.
At the moment I'm trying to do the following:
- deserialize instances from the *.csv files;
- inject them into the MySQL db.
For the simple instances there are no issues. But if I have an object with ManyToMany or OneToMany relations, deserialization does not work correctly. Currently I'm using the Jackson dependency for *.csv deserialization:
CsvMapper csvMapper = new CsvMapper();
csvMapper.disable(MapperFeature.SORT_PROPERTIES_ALPHABETICALLY);
CsvSchema csvSchema = csvMapper.schemaFor(type).withHeader().withColumnSeparator(';').withLineSeparator("\n");
MappingIterator<Object> mappingIterator = csvMapper.readerFor(type).with(csvSchema).readValues(csv);
List<Object> objects = new ArrayList<>();
while (mappingIterator.hasNext()) {
    objects.add(mappingIterator.next());
}
Example of an entity with such a relation (the idea is that one app can have multiple versions):
public class Application {
    private Long id;
    private String name;

    @OneToMany(mappedBy = "application")
    private Set<Version> versions = new HashSet<>();
}
For insertion into the DB I'm using Spring Boot entities that are @Autowired.
My first question is: what should I put into the CSV file as input so that it deserializes correctly? Because if I have:
id;name;
1;testName;
(skipping versions), I run into trouble. The same happens even if I try to put some values into the versions column. So I don't know how to provide the input correctly for Jackson CSV deserialization in the case of a Set, and later, how can I persist this entity? Should I first put all the versions into the DB and then insert the applications?
Any thoughts? Thanks in advance!
Use Apache Commons CSV to parse the csv.
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;
import java.io.StringReader;
import java.util.List;

final byte[] sourceCsv; // assumed to hold the raw CSV bytes
String csvString = new String(sourceCsv);
CSVFormat csvFormat = CSVFormat.DEFAULT;
List<CSVRecord> csvRecord = csvFormat.parse(new StringReader(csvString)).getRecords();
It will help you deserialize the records and store them in the database.
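To address the relation and the ordering question concretely, here is a hedged sketch of one way to wire this together, assuming commons-csv 1.9+ (for the CSVFormat builder), hypothetical Spring Data repositories (ApplicationRepository, VersionRepository), standard setters on the entities, and an application_id column in the versions CSV. Since Version owns the foreign key (mappedBy = "application" sits on Application), applications are saved first and each version then references its parent:
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

@Service
public class CsvImportService {

    // Hypothetical Spring Data repositories; adjust to your project
    @Autowired private ApplicationRepository applicationRepository;
    @Autowired private VersionRepository versionRepository;

    public void importCsv(String applicationsCsv, String versionsCsv) throws Exception {
        CSVFormat format = CSVFormat.DEFAULT.builder()
                .setDelimiter(';')
                .setHeader()
                .setSkipHeaderRecord(true)
                .build();

        // 1. Save the parents first; Application itself carries no foreign-key column.
        //    Assumed columns: id;name;
        Map<Long, Application> appsByCsvId = new HashMap<>();
        try (Reader in = new StringReader(applicationsCsv)) {
            for (CSVRecord record : format.parse(in)) {
                Application app = new Application();
                app.setName(record.get("name"));
                appsByCsvId.put(Long.valueOf(record.get("id")), applicationRepository.save(app));
            }
        }

        // 2. Then save the versions, each pointing at its parent via an application_id column.
        //    Assumed columns: id;name;application_id;
        try (Reader in = new StringReader(versionsCsv)) {
            for (CSVRecord record : format.parse(in)) {
                Version version = new Version();
                version.setName(record.get("name"));
                version.setApplication(appsByCsvId.get(Long.valueOf(record.get("application_id"))));
                versionRepository.save(version);
            }
        }
    }
}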
I am trying to write a JSON file using Spark. There are some keys that have null as the value. These show up just fine in the Dataset, but when I write the file, the keys get dropped. How do I ensure they are retained?
Code to write the file:
ddp.coalesce(20).write().mode("overwrite").json("hdfs://localhost:9000/user/dedupe_employee");
Part of the JSON data from the source:
"event_header": {
"accept_language": null,
"app_id": "App_ID",
"app_name": null,
"client_ip_address": "IP",
"event_id": "ID",
"event_timestamp": null,
"offering_id": "Offering",
"server_ip_address": "IP",
"server_timestamp": 1492565987565,
"topic_name": "Topic",
"version": "1.0"
}
Output:
"event_header": {
"app_id": "App_ID",
"client_ip_address": "IP",
"event_id": "ID",
"offering_id": "Offering",
"server_ip_address": "IP",
"server_timestamp": 1492565987565,
"topic_name": "Topic",
"version": "1.0"
}
In the above example keys accept_language, app_name and event_timestamp have been dropped.
Apparently, Spark does not provide any option to handle nulls. So the following custom solution should work.
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper
import com.fasterxml.jackson.databind.ObjectMapper
case class EventHeader(accept_language:String,app_id:String,app_name:String,client_ip_address:String,event_id: String,event_timestamp:String,offering_id:String,server_ip_address:String,server_timestamp:Long,topic_name:String,version:String)
val ds = Seq(EventHeader(null,"App_ID",null,"IP","ID",null,"Offering","IP",1492565987565L,"Topic","1.0")).toDS()
val ds1 = ds.mapPartitions(records => {
  // Build one ObjectMapper per partition and serialize each record, keeping null fields
  val mapper = new ObjectMapper with ScalaObjectMapper
  mapper.registerModule(DefaultScalaModule)
  records.map(mapper.writeValueAsString(_))
})
ds1.coalesce(1).write.text("hdfs://localhost:9000/user/dedupe_employee")
This will produce output as:
{"accept_language":null,"app_id":"App_ID","app_name":null,"client_ip_address":"IP","event_id":"ID","event_timestamp":null,"offering_id":"Offering","server_ip_address":"IP","server_timestamp":1492565987565,"topic_name":"Topic","version":"1.0"}
If you are on Spark 3, you can add
spark.sql.jsonGenerator.ignoreNullFields false
ignoreNullFields is an option, available since Spark 3, to set when you want a DataFrame written out as a JSON file.
If you need Spark 2 (specifically PySpark 2.4.6), you can try converting the DataFrame to an RDD of Python dicts and then call RDD.saveAsTextFile to write the JSON file to HDFS. The following example may help.
cols = ddp.columns
ddp_ = ddp.rdd
ddp_ = ddp_.map(lambda row: dict([(c, row[c]) for c in cols]))
ddp_.repartition(1).saveAsTextFile(your_hdfs_file_path)
This should produce an output file like:
{"accept_language": None, "app_id":"123", ...}
{"accept_language": None, "app_id":"456", ...}
What's more, if you want to replace Python None with JSON null, you will need to dump every dict to JSON:
import json

ddp_ = ddp_.map(lambda row: json.dumps(row, ensure_ascii=False))
Since Spark 3, if you are using the DataFrameWriter class
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameWriter.html#json-java.lang.String-
(same applies for pyspark)
https://spark.apache.org/docs/3.0.0-preview/api/python/_modules/pyspark/sql/readwriter.html
its json method has an option ignoreNullFields=None, where None means True.
So just set this option to false:
ddp.coalesce(20).write().mode("overwrite").option("ignoreNullFields", "false").json("hdfs://localhost:9000/user/dedupe_employee")
To retain null values when converting to JSON, set this config option:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.master("local[1]")
    .config("spark.sql.jsonGenerator.ignoreNullFields", "false")
    .getOrCreate()
)
I have a JSON which is something like {"Header" : {"name" : "TestData", "contactNumber" : 8019071740}}
If I insert this into MongoDB, it will be something like
{"_id" : ObjectId("58b7e55097989619e4ddb0bb"), "Header" : {"name" : "TestData", "contactNumber" : NumberLong(8019071743)}}
When I read this data back and try to convert it to a Java object using Gson, it throws the exception com.google.gson.JsonSyntaxException: java.lang.IllegalStateException: Expected a long but was BEGIN_OBJECT at line 1 column 109 path $.Header.contactNumber
I have found this, but I was wondering: if I have a very complex JSON structure, I might need to manipulate many JSON nodes with that approach.
Does anyone have any better alternatives?
Edit 1:
I am querying and converting the JSON as below:
Document mongoDocument = mycollection.find(searchCondition).first();
String resultJson = mongoDocument.toJson();
Gson gson = new Gson();
Model model = gson.fromJson(resultJson, Model.class);
We can use the code below:
Document doc = documentCursor.next();
JsonWriterSettings relaxed = JsonWriterSettings.builder().outputMode(JsonMode.RELAXED).build();
CustomeObject obj = gson.fromJson(doc.toJson(relaxed), CustomeObject.class);
Take a look at: converting Document objects in MongoDB 3 to POJOS
I had the same problem. The workaround with com.mongodb.util.JSON.serialize(document) does the trick.
MongoDB uses the BSON format, which has its own types; it loosely follows the JSON standard, but it can't be parsed by a JSON library without writing a custom wrapper/codec.
You can use a third-party framework/plugin to make the library take care of converting between document and POJO.
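For example, recent versions of the MongoDB Java driver (3.7+) ship a POJO codec that can decode documents straight into your classes. A minimal sketch, assuming Model has a no-args constructor and conventional getters/setters, and using a placeholder database name:
import static org.bson.codecs.configuration.CodecRegistries.fromProviders;
import static org.bson.codecs.configuration.CodecRegistries.fromRegistries;

import com.mongodb.MongoClientSettings;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.codecs.configuration.CodecRegistry;
import org.bson.codecs.pojo.PojoCodecProvider;

CodecRegistry pojoCodecRegistry = fromRegistries(
        MongoClientSettings.getDefaultCodecRegistry(),
        fromProviders(PojoCodecProvider.builder().automatic(true).build()));

MongoCollection<Model> collection = MongoClients.create()
        .getDatabase("test")                         // placeholder database name
        .withCodecRegistry(pojoCodecRegistry)
        .getCollection("mycollection", Model.class);  // decodes documents directly into Model

Model model = collection.find(searchCondition).first(); // no Document/Gson round trip needed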
If that is not an option for you, you will have to do the mapping yourself:
Document mongoDocument = mycollection.find(searchCondition).first();
Model model = new Model();
model.setProperty(mongoDocument.get("property"));
Document.toJson() gives MongoDB Extended JSON, not plain JSON.
I.e. for a Long field myLong equal to xxx, the Document produces JSON like:
"myLong" : { "$numberLong" : "xxx"}
Evidently, parsing that with Gson will not give myLong=xxx.
To convert a Document to a POJO using Gson you may do the following:
Gson gson = new Gson();
MyPojo pojo = gson.fromJson(gson.toJson(document), MyPojo.class);
// Write int64 values as plain numbers so the output can be parsed as ordinary JSON
val mongoJsonWriterSettings: JsonWriterSettings = JsonWriterSettings.builder
  .int64Converter((value, writer) => writer.writeNumber(value.toString))
  .build

def bsonToJson(document: Document): String = document.toJson(mongoJsonWriterSettings)
I want to create a JSON Schema manually using Gson, but I don't find any JsonSchema element support in Gson. I don't want to convert a POJO to a schema; I want to create the schema programmatically. Is there any way in Gson? Maybe something like the following (C#-style example):
JsonSchema schema = new JsonSchema();
schema.Type = JsonSchemaType.Object;
schema.Properties = new Dictionary<string, JsonSchema>
{
    { "name", new JsonSchema { Type = JsonSchemaType.String } },
    {
        "hobbies", new JsonSchema
        {
            Type = JsonSchemaType.Array,
            Items = new List<JsonSchema> { new JsonSchema { Type = JsonSchemaType.String } }
        }
    },
};
You may consider using everit-org/json-schema for programmatically creating JSON Schemas. Although it is not properly documented, its builder classes form a fluent API which lets you do it. Example:
Schema schema = ObjectSchema.builder()
.addPropertySchema("name", StringSchema.builder().build())
.addPropertySchema("hobbies", ArraySchema.builder()
.allItemSchema(StringSchema.builder().build())
.build())
.build();
It is slightly different syntax than what you described, but it can be good for the same purpose.
(disclaimer: I'm the author of everit-org/json-schema)
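As a quick usage note, assuming the schema variable from the example above and the org.json artifact that everit-org/json-schema depends on: validate throws a ValidationException when the subject does not conform, so checking a document looks like this:
import org.everit.json.schema.ValidationException;
import org.json.JSONObject;

try {
    // Conforms to the schema built above: name is a string, hobbies is an array of strings
    schema.validate(new JSONObject("{\"name\":\"Alice\",\"hobbies\":[\"chess\"]}"));
    // Does not conform: 42 is not a string, so this call throws
    schema.validate(new JSONObject("{\"name\":\"Alice\",\"hobbies\":[42]}"));
} catch (ValidationException e) {
    System.out.println(e.getMessage()); // points at the failing location, e.g. "#/hobbies/0: ..."
}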
I tried to build a schema as suggested above, see Everit schema builder includes unset properties as null
The title speaks for itself: I have a Config object (from https://github.com/typesafehub/config) and I want to pass it to a constructor which only accepts java.util.Properties as an argument.
Is there an easy way to convert a Config to a Properties object?
Here is a way to convert a Typesafe Config object into a Java Properties object. I have only tested it in a simple case for creating Kafka properties.
Given this configuration in application.conf:
kafka-topics {
my-topic {
zookeeper.connect = "localhost:2181",
group.id = "testgroup",
zookeeper.session.timeout.ms = "500",
zookeeper.sync.time.ms = "250",
auto.commit.interval.ms = "1000"
}
}
You can create the corresponding Properties object like this:
import com.typesafe.config.{Config, ConfigFactory}
import java.util.Properties
import kafka.consumer.ConsumerConfig
object Application extends App {

  def propsFromConfig(config: Config): Properties = {
    import scala.collection.JavaConversions._

    val props = new Properties()
    val map: Map[String, Object] = config.entrySet().map({ entry =>
      entry.getKey -> entry.getValue.unwrapped()
    })(collection.breakOut)

    props.putAll(map)
    props
  }

  val config = ConfigFactory.load()

  val consumerConfig = {
    val topicConfig = config.getConfig("kafka-topics.my-topic")
    val props = propsFromConfig(topicConfig)
    new ConsumerConfig(props)
  }

  // ...
}
The function propsFromConfig is what you are mainly interested in. The key points are the use of entrySet to get a flattened list of properties, and unwrapped() on the entry value, which gives an Object whose type depends on the configuration value.
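For reference, a minimal sketch of the same flattening in plain Java, in case the code consuming the Properties is not Scala; the keys and unwrapped values come from the same entrySet call:
import com.typesafe.config.Config;
import com.typesafe.config.ConfigValue;

import java.util.Map;
import java.util.Properties;

public static Properties propsFromConfig(Config config) {
    Properties props = new Properties();
    for (Map.Entry<String, ConfigValue> entry : config.entrySet()) {
        // unwrapped() yields the plain Java value (String, Number, Boolean, List, ...)
        props.put(entry.getKey(), entry.getValue().unwrapped().toString());
    }
    return props;
}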
You can try my Scala wrapper https://github.com/andr83/scalaconfig. Using it, converting a Config object to java.util.Properties is simple:
val properties = config.as[Properties]
As Typesafe Config/HOCON supports a much richer structure than java.util.Properties, it will be hard to get a safe conversion.
Put differently: since properties can only express a subset of HOCON, the conversion is not well defined and may lose information.
So if your configuration is rather flat and does not contain UTF-8, you could transform the HOCON to JSON and then extract the values.
A better solution would be to implement a config class, populate its values from the HOCON, and pass that to the class you want to configure; a sketch follows below.
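A minimal sketch of that config-class approach in Java, reusing the Kafka example from the earlier answer's application.conf; KafkaSettings is a hypothetical name, and Config also offers typed getters such as getInt or getDuration if you prefer them over getString:
import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

// Hypothetical settings class populated straight from HOCON, bypassing Properties entirely
public final class KafkaSettings {
    public final String zookeeperConnect;
    public final String groupId;
    public final String sessionTimeoutMs;

    public KafkaSettings(Config config) {
        this.zookeeperConnect = config.getString("zookeeper.connect");
        this.groupId = config.getString("group.id");
        this.sessionTimeoutMs = config.getString("zookeeper.session.timeout.ms");
    }

    public static void main(String[] args) {
        Config config = ConfigFactory.load().getConfig("kafka-topics.my-topic");
        KafkaSettings settings = new KafkaSettings(config);
        System.out.println(settings.groupId); // "testgroup" with the earlier application.conf
    }
}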
It is not possible directly through Typesafe Config. Even rendering the entire HOCON file to JSON does not produce truly valid JSON.
For example:
"play" : {
"filters" : {
"disabled" : ${?play.filters.disabled}[
"play.filters.hosts.AllowedHostsFilter"
],
"disabled" : ${?play.filters.disabled}[
"play.filters.csrf.CSRFFilter"
]
}
}
That format comes directly from Config.render. As you can see, disabled is represented twice, with HOCON-style substitution syntax.
I have also had problems with rendering hocon -> json -> hocon.
Example hocon:
http {
port = "9000"
port = ${?HTTP_PORT}
}
Typesafe Config would parse this to:
{
"http": {
"port": "9000,${?HTTP_PORT}"
}
}
However, if you try to parse that as HOCON, it throws a syntax error: the , cannot be there.
The correct HOCON rendering would be 9000${?HTTP_PORT}, with no comma between the values. I believe this is true for all array concatenation and substitution.