I have been reading a lot about Apache Avro lately and I am inclined to use it instead of JSON. Currently, we serialize a JSON document using Jackson and write the serialized document into Cassandra for each row key/user id.
Then we have a REST service that reads the whole JSON document using the row key, deserializes it, and uses it further.
While reading on the web, it looks like Avro requires a schema beforehand, and I am not sure how to come up with an Avro schema for my JSON document.
Below is the JSON document that I write into Cassandra after serializing it with Jackson. How do I come up with an Avro schema for it?
{
"lv" : [ {
"v" : {
"site-id" : 0,
"categories" : {
"321" : {
"price_score" : "0.2",
"confidence_score" : "0.5"
},
"123" : {
"price_score" : "0.4",
"confidence_score" : "0.2"
}
},
"price-score" : 0.5,
"confidence-score" : 0.2
}
} ],
"lmd" : 1379231624261
}
Can anyone provide a simple example of how to come up with an Avro schema based on the JSON document above? Thanks for the help.
The simplest way to define an Avro schema like the one you have outlined is to start from what they call IDL. IDL is a higher-level language than the Avro schema (JSON) and makes writing an Avro schema much more straightforward.
See the Avro IDL documentation here: http://avro.apache.org/docs/current/idl.html
To define what you've got above in JSON, you're going to define a set of records in IDL that look like this:
@namespace("com.sample")
protocol sample {
record Category {
union {null, string} price_score = null;
union {null, string} confidence_score = null;
}
record vObject {
int site_id = 0;
union {null, map<Category>} categories = null;
union {null, float} price_score = null;
union {null, float} confidence_score = null;
}
record SampleObject {
union {null, array<vObject>} lv = null;
long lmd = -1;
}
}
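Note that the IDL above uses site_id, price_score and confidence_score where the original JSON has site-id, price-score and confidence-score. Avro field and record names may only contain letters, digits and underscores and must not start with a digit, so hyphenated JSON keys have to be renamed. A minimal sketch of that rule in plain Java follows; the class AvroNameCheck and the helper toAvroName are hypothetical names used only for illustration and are not part of the Avro library.

```java
import java.util.regex.Pattern;

final class AvroNameCheck {
    // Avro names start with a letter or underscore and continue with
    // letters, digits, or underscores.
    private static final Pattern AVRO_NAME = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    static boolean isValidAvroName(String name) {
        return AVRO_NAME.matcher(name).matches();
    }

    // Hypothetical helper: replaces every disallowed character with '_'
    // and prefixes '_' when the key is empty or starts with a digit.
    static String toAvroName(String jsonKey) {
        String cleaned = jsonKey.replaceAll("[^A-Za-z0-9_]", "_");
        if (cleaned.isEmpty() || Character.isDigit(cleaned.charAt(0))) {
            return "_" + cleaned;
        }
        return cleaned;
    }

    public static void main(String[] args) {
        for (String key : new String[] {"site-id", "price-score", "lmd"}) {
            System.out.println(key + " -> valid=" + isValidAvroName(key)
                    + ", suggested=" + toAvroName(key));
        }
    }
}
```

The category ids "321" and "123" are map keys rather than field names, so they are not subject to this rule.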
When you run the compiler tool (as described on the page linked above), you will get an Avro schema generated like so:
{
"protocol" : "sample",
"namespace" : "com.sample",
"types" : [ {
"type" : "record",
"name" : "Category",
"fields" : [ {
"name" : "price_score",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "confidence_score",
"type" : [ "null", "string" ],
"default" : null
} ]
}, {
"type" : "record",
"name" : "vObject",
"fields" : [ {
"name" : "site_id",
"type" : "int",
"default" : 0
}, {
"name" : "categories",
"type" : [ "null", {
"type" : "map",
"values" : "Category"
} ],
"default" : null
}, {
"name" : "price_score",
"type" : [ "null", "float" ],
"default" : null
}, {
"name" : "confidence_score",
"type" : [ "null", "float" ],
"default" : null
} ]
}, {
"type" : "record",
"name" : "SampleObject",
"fields" : [ {
"name" : "lv",
"type" : [ "null", {
"type" : "array",
"items" : "vObject"
} ],
"default" : null
}, {
"name" : "lmd",
"type" : "long",
"default" : -1
} ]
} ],
"messages" : {
}
}
Using whatever language you'd like, you can now generate a set of objects, and the default "toString" operation outputs them in JSON form as you have above. However, the true power of Avro comes with its compression capabilities. You should really write out in Avro binary format to see the real benefits of Avro.
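Part of the size benefit comes from the binary encoding itself: field names are not written into each record, and int/long values use a zig-zag, variable-length encoding. The sketch below reimplements only that long encoding in plain Java, as described in the Avro specification, to compare the size of the lmd value with its JSON text form. AvroLongSketch and encodeLong are illustrative names and not the Avro library API; in real code you would use the library's encoders.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class AvroLongSketch {
    // Encodes a long as zig-zag followed by variable-length base-128 bytes
    // (low 7 bits first, high bit set on every byte except the last).
    static byte[] encodeLong(long value) {
        long zigzag = (value << 1) ^ (value >> 63);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((zigzag & ~0x7FL) != 0) {
            out.write((int) ((zigzag & 0x7F) | 0x80));
            zigzag >>>= 7;
        }
        out.write((int) zigzag);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        long lmd = 1379231624261L; // value taken from the JSON document above
        int binarySize = encodeLong(lmd).length;
        int jsonSize = Long.toString(lmd).getBytes(StandardCharsets.UTF_8).length;
        System.out.println("binary bytes: " + binarySize + ", JSON text bytes: " + jsonSize);
    }
}
```

This only illustrates one primitive type; a full record written in Avro binary format also drops the repeated field names that JSON carries in every document.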
Hope this helps!
I am using kafka-streams to transform XML messages to Avro format. I would like to know whether it is possible to keep the names of my union records when using a union type for records in my Avro schema, as in the example below, so that instead of the name "main_record" I would have "record1" or "record2" in my Avro message, depending on the input data I receive:
{
"namespace": "proj.avro",
"protocol": "app_messages",
"doc" : "application messages",
"name": "myRecord",
"type" : "record",
"fields": [
{
"name": "main_record",
"type": [
{
"name": "record1",
"type" : "record",
"fields":
[
{
"name" : "request_id",
"type" : "int"
},
{
"name" : "message_type",
"type" : "int"
},
{
"name" : "users",
"type" : "string"
}
]
},
{
"name" : "record2",
"type" : "record",
"fields" :
[
{
"name" : "request_id",
"type" : "int"
},
{
"name" : "response_code",
"type" : "string"
},
{
"name" : "response_count",
"type" : "int"
},
{
"name" : "reason_code",
"type" : "string"
}
]
}
]
}
]
}
In my project, we use Flink to handle log data and then send the data to Elasticsearch. However, I find that ES does not recognize a JSON object; it only recognizes some basic data types. Therefore, I can only transform the JSON object into a string, but then, when I check the log data in Elasticsearch, the format is really hard to read.
"hits" : {
"total" : 10,
"max_score" : 1.0,
"hits" : [
{
"_index" : "wyh_dye_test",
"_type" : "nested",
"_id" : "gzlvM3EBRgA6CE7yDw8l",
"_score" : 1.0,
"_source" : {
"id" : "id",
"module" : "wyh_key",
"content" : """{"map":{"wyh_key":"wyh_value","user_key":"user_value","wqq_key":"wqq_value","hello_key":"hello_value"}}"""
}
}
This is my Kibana search result. As you can see, the content field is really hard to read.
You can update the index mapping, then put the data into the corresponding fields.
PUT xxx_index/_mapping/xxx_type
{
"properties": {
"wyh_key": {
"type": "keyword"
},
"user_key": {
"type": "keyword"
},
"wqq_key": {
"type": "keyword"
},
"hello_key": {
"type": "keyword"
}
}
}
I am developing an API that creates a list of Questions, and I would like to check whether STS has any native capability that supports bulk insert, or whether I have to create a custom query using the @Query annotation.
I have referred to "Spring Data MongoDB support bulk insert/save", and I would like to check whether a unique ObjectId is still generated for each document through bulk insert/save.
Below is the sample definition I am expecting, where each question is distinguished by a unique id.
"questions": [
{
"id" : "01-QuestionId",
"type" : "multiple",
"question" : "What is your Gender?",
"options" : [
{
"key" : "a",
"value" : "Male"
},
{
"key" : "b",
"value" : "Female"
}
],
"survey":{
"id": "123",
"name": "Test1",
"description": "First Survey"
}
},
{
"id" : "02-QuestionId",
"type" : "multiple",
"question" : "What is your income?",
"options" : [
{
"key" : "a",
"value" : "1000"
},
{
"key" : "b",
"value" : "2000"
}
],
"survey":{
"id": "123",
"name": "Test1",
"description": "First Survey"
}
}
]
Thanks all!
Robin
Found out after deeper research into Spring Data.
We can just use the save() or insert() methods of the MongoRepository interface.
For example:
final List savedQuestions = questionRepository.save(questions);
org.apache.avro.UnresolvedUnionException: Not in union ["null",{"type":"enum","name":"document_change_type","namespace":"document","symbols":["create","update","delete"]}]: create
I am passing in the string create for this field, and it is throwing the above exception.
create is one of the 3 acceptable values for the enum, what is causing the exception?
Suppose your Avro schema looks like this:
{
"type" : "record",
"namespace" : "document",
"name" : "document_details",
"fields" : [
{ "name" : "documentName" , "type" : "string" },
{"name" : "documentChange" ,
"type" : ["null",
{"type" : "enum",
"namespace" : "document",
"name" : "documentChangeType",
"symbols" :["create","update","delete"]
}]
}
]
}
You can create the record for this schema in your code as follows:
GenericRecord documentDetailsRecord = new GenericData.Record(schema);
GenericEnumSymbol enumSymbol = new GenericData.EnumSymbol(schema.getField("documentChange").schema().getTypes().get(1), "create");
documentDetailsRecord.put("documentName", "someDocumentName");
documentDetailsRecord.put("documentChange", enumSymbol);
You can get the list of schemas for all the branches of a union field as follows:
schema.getField(<unionFieldName>).schema().getTypes()
I am having trouble using Elasticsearch in my Java application.
To explain: I have a mapping that looks something like this:
{
"products": {
"properties": {
"id": {
"type": "long",
"ignore_malformed": false
},
"locations": {
"properties": {
"category": {
"type": "long",
"ignore_malformed": false
},
"subCategory": {
"type": "long",
"ignore_malformed": false
},
"order": {
"type": "long",
"ignore_malformed": false
}
}
},
...
So, as you can see, I receive a list of products, each of which has a list of locations. In my model, these locations are all the categories the product belongs to, which means that a product can be in one or more categories. In each of these categories, the product has an order, which is the position in which the client wants it to be shown.
For instance, a diamond product can be in first place in Jewelry but in third place in Woman (my examples are not very logical ^^).
So, when I click on Jewelry, I want to show these products ordered by the field locations.order for that specific category.
For the moment, when I search for all the products in a specific category, the response that I receive from Elasticsearch is something like:
{"id":5331880,"locations":[{"category":5322606,"order":1},
{"category":5883712,"subCategory":null,"order":3},
{"category":5322605,"subCategory":6032961,"order":2},.......
Is it possible to sort these products by the value of locations.order for the specific category I am searching for? For instance, if I am querying category 5322606, I want order 1 to be used for this product.
Thank you very much in advance!
Regards,
Olivier.
First a correction of terminology: in Elasticsearch, "parent/child" refers to completely separate docs, where the child doc points to the parent doc. Parent and children are stored on the same shard, but they can be updated independently.
With your example above, what you are trying to achieve can be done with nested docs.
Currently, your locations field is of type:"object". This means that the values in each location get flattened to look something like this:
{
"locations.category": [5322606, 5883712, 5322605],
"locations.subCategory": [6032961],
"locations.order": [1, 3, 2]
}
In other words, the "sub" fields get flattened into multi-value fields, which is of no use to you, because there is no correlation between category: 5322606 and order: 1.
However, if you change locations to be type:"nested" then internally it will index each location as a separate doc, meaning that each location can be queried independently, using the dedicated nested query and filter.
By default, the nested query will return a _score based upon how well each location matches, but in your case you want to return the highest value of the order field from any matching children. To do this, you'll need to use a custom_score query.
So let's start by creating the index with the appropriate mapping:
curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1' -d '
{
"mappings" : {
"products" : {
"properties" : {
"locations" : {
"type" : "nested",
"properties" : {
"order" : {
"type" : "long"
},
"subCategory" : {
"type" : "long"
},
"category" : {
"type" : "long"
}
}
},
"id" : {
"type" : "long"
}
}
}
}
}
'
Then we index your example doc:
curl -XPOST 'http://127.0.0.1:9200/test/products?pretty=1' -d '
{
"locations" : [
{
"order" : 1,
"category" : 5322606
},
{
"order" : 3,
"subCategory" : null,
"category" : 5883712
},
{
"order" : 2,
"subCategory" : 6032961,
"category" : 5322605
}
],
"id" : 5331880
}
'
And now we can search for it using the queries we discussed above:
curl -XGET 'http://127.0.0.1:9200/test/products/_search?pretty=1' -d '
{
"query" : {
"nested" : {
"query" : {
"custom_score" : {
"script" : "doc[\u0027locations.order\u0027].value",
"query" : {
"constant_score" : {
"filter" : {
"and" : [
{
"term" : {
"category" : 5322605
}
},
{
"term" : {
"subCategory" : 6032961
}
}
]
}
}
}
}
},
"score_mode" : "max",
"path" : "locations"
}
}
}
'
Note: the single quotes within the script have been escaped as \u0027 to get around shell quoting. The script actually looks like this: "doc['locations.order'].value"
If you look at the _score from the results, you can see that it has used the order value from the matching location:
{
"hits" : {
"hits" : [
{
"_source" : {
"locations" : [
{
"order" : 1,
"category" : 5322606
},
{
"order" : 3,
"subCategory" : null,
"category" : 5883712
},
{
"order" : 2,
"subCategory" : 6032961,
"category" : 5322605
}
],
"id" : 5331880
},
"_score" : 2,
"_index" : "test",
"_id" : "cXTFUHlGTKi0hKAgUJFcBw",
"_type" : "products"
}
],
"max_score" : 2,
"total" : 1
},
"timed_out" : false,
"_shards" : {
"failed" : 0,
"successful" : 5,
"total" : 5
},
"took" : 9
}
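The way the parent's _score is derived in this result can be mimicked in plain Java: filter the locations by category and subCategory, take each match's order as its score, and keep the maximum, as score_mode "max" does. This is only a sketch of the logic under those assumptions, not Elasticsearch code; NestedScoreSketch, Location and parentScore are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalDouble;

final class NestedScoreSketch {

    static final class Location {
        final long category;
        final Long subCategory; // null when the location has no subCategory
        final long order;

        Location(long category, Long subCategory, long order) {
            this.category = category;
            this.subCategory = subCategory;
            this.order = order;
        }
    }

    // The example product's locations from the indexed document above.
    static List<Location> exampleLocations() {
        return Arrays.asList(
                new Location(5322606L, null, 1L),
                new Location(5883712L, null, 3L),
                new Location(5322605L, 6032961L, 2L));
    }

    // Returns the highest order among matching locations, or empty when
    // no location matches (the product would then not be a hit at all).
    static OptionalDouble parentScore(List<Location> locations, long category, long subCategory) {
        return locations.stream()
                .filter(l -> l.category == category)
                .filter(l -> l.subCategory != null && l.subCategory == subCategory)
                .mapToDouble(l -> l.order)
                .max();
    }

    public static void main(String[] args) {
        System.out.println(parentScore(exampleLocations(), 5322605L, 6032961L));
    }
}
```

Because the score is the order value, sorting hits by _score sorts them by order; whether ascending or descending order is wanted depends on how the results are then presented.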
Just adding a more up-to-date version related to sorting a parent by a child field.
We can query the parent doc type sorted by a child field ('count', for example) similar to the following:
https://gist.github.com/robinloxley1/7ea7c4f37a3413b1ca16