use of "default" in avro schema - java

As per the definition of the "default" attribute in the Avro docs: "A default value for this field, used when reading instances that lack this field (optional)."
This means that if the corresponding field is missing, the default value is taken.
But this does not seem to be the case. Consider the following student schema:
{
  "type": "record",
  "namespace": "com.example",
  "name": "Student",
  "fields": [
    {
      "name": "age",
      "type": "int",
      "default": -1
    },
    {
      "name": "name",
      "type": "string",
      "default": "null"
    }
  ]
}
The schema says: if the "age" field is missing, consider its value to be -1. Likewise for the "name" field.
Now, if I try to construct a Student model from the following JSON:
{"age":70}
I get this exception:
org.apache.avro.AvroTypeException: Expected string. Got END_OBJECT
at org.apache.avro.io.JsonDecoder.error(JsonDecoder.java:698)
at org.apache.avro.io.JsonDecoder.readString(JsonDecoder.java:227)
It looks like the default is NOT working as expected. So what exactly is the role of "default" here?
This is the code used to generate Student model:
Decoder decoder = DecoderFactory.get().jsonDecoder(Student.SCHEMA$, studentJson);
SpecificDatumReader<Student> datumReader = new SpecificDatumReader<>(Student.class);
return datumReader.read(null, decoder);
(Student class is auto-generated by Avro compiler from student schema)

I think there is some misunderstanding around default values, so hopefully my explanation will help other people as well. The default value is used to fill in a field when the field is not present while you are instantiating an Avro object (in your case, when calling datumReader.read), but it does not by itself let you read data written with a different schema; this is why the concept of a "schema registry" is useful for this kind of situation.
The following code works and allows reading your data:
Decoder decoder = DecoderFactory.get().jsonDecoder(Student.SCHEMA$, "{\"age\":70}");
SpecificDatumReader<Student> datumReader = new SpecificDatumReader<>(Student.class);
Schema expected = new Schema.Parser().parse("{\n" +
        " \"type\": \"record\",\n" +
        " \"namespace\": \"com.example\",\n" +
        " \"name\": \"Student\",\n" +
        " \"fields\": [{\n" +
        " \"name\": \"age\",\n" +
        " \"type\": \"int\",\n" +
        " \"default\": -1\n" +
        " }\n" +
        " ]\n" +
        "}");
datumReader.setSchema(expected);
System.out.println(datumReader.read(null, decoder));
As you can see, I am specifying the schema used to "write" the JSON input, which does not contain the field "name". However, since your schema declares a default value for it, when you print the record you will see "name" populated with that default:
{"age": 70, "name": "null"}
Just in case you don't already know: that "null" is not really a null value; it is a string whose value is "null".
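Conceptually, what happens during schema resolution is that any reader field missing from the written data is filled with its declared default. This toy sketch (plain Java, not the Avro API) only mirrors that outcome, using the same field names and defaults as the student schema above:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DefaultsSketch {
    // Toy illustration of Avro-style schema resolution: every reader field
    // missing from the written record is filled with its declared default.
    static Map<String, Object> resolve(Map<String, Object> written,
                                       Map<String, Object> readerDefaults) {
        Map<String, Object> resolved = new LinkedHashMap<>(readerDefaults);
        resolved.putAll(written); // written values win over defaults
        return resolved;
    }

    public static void main(String[] args) {
        Map<String, Object> written = new LinkedHashMap<>();
        written.put("age", 70); // "name" is absent, as in the question's JSON

        Map<String, Object> defaults = new LinkedHashMap<>();
        defaults.put("age", -1);
        defaults.put("name", "null"); // the *string* "null", as in the schema

        System.out.println(resolve(written, defaults)); // {age=70, name=null}
    }
}
```

Avro performs this resolution per the rules in its specification; the sketch only illustrates the observable result.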

Just to add to what is already said in the above answer: for a field to be null when not present, union its type with "null"; otherwise what you get is just a string spelled "null". Example schema:
{
  "name": "name",
  "type": [
    "null",
    "string"
  ],
  "default": null
}
Then, if you decode {"age":70} and retrieve the record, you will get the following:
{"age":70,"name":null}

Related

How to extract a JSON data field with a REST API?

I make a POST to an API with REST Assured, and then I try to assert the expected data against the response data, but I get an error like this: "java.lang.IllegalArgumentException: The parameter "data" was used but not defined. Define parameters using the JsonPath.params(...) function"
My code:
String payload_data = "{" +
        "\"Time\":1638057600, " +
        "\"exampleType\":example, " +
        "\"Id\":[2]}";

RestAssured.defaultParser = Parser.JSON;

given().
        contentType(ContentType.JSON).
        body(payload_data).
when().
        post(api_url).
then().
        statusCode(200).
        body("data.examples.2.exampleData", equalTo("33"));
My JSON data:
{
  "success": true,
  "data": {
    "examples": {
      "2": {
        "ex_data": 0,
        "exampleData": 33,
        "data_ex": 0
      }
    }
  }
}
First, I tested the path "data.examples.2.exampleData" with your JSON, and it works fine. No problem.
You made some mistakes here.
Your payload is invalid JSON:
{
  "Time": 1638057600,
  "exampleType": example, // it must be a number, or a string in double quotes
  "Id": [
    2
  ]
}
You are comparing two things with different data types:
"data.examples.2.exampleData" --> int 33
equalTo("33") --> String "33"
Fix: body("data.examples.2.exampleData", equalTo(33));
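The mismatch is easy to reproduce with plain Java equality, which is essentially what the equalTo matcher ends up checking (a minimal sketch, not REST Assured itself):

```java
public class EqualitySketch {
    public static void main(String[] args) {
        Object actual = 33; // what the JSON path returns: an int

        System.out.println(actual.equals("33")); // false: an Integer never equals a String
        System.out.println(actual.equals(33));   // true
    }
}
```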

How to have special character in JSON

Here is a sample JSON object from my project:
{
  "objectType": "main",
  "description": "",
  "Column": "[{"displayName": "Account Name", "DataType" : "string"}, {"displayName" : "Billing City", "DataType" : "string" }, { "displayName" : "Billing State/Province" , "DataType" : "string" }, {"displayName" : "Billing Street", "DataType" : "textarea"}]"
}
It has some special characters like "/" (Billing State/Province) in the display name. I was not able to convert it to a JSONArray and put it in the RootJsonObject due to the special characters in it. I used the following code:
RootJsonObject = new JSONObject();
RootJsonObject.put("content",new JSONArray(JsonObject.getString("Column")));
But I need that special character. Is there any other way to keep the "/" in my JSONObject, or can I use some other JSON util?
I can see you've updated your question to include valid JSON for the Content property.
When embedding this string in JSON, you will need to escape any special characters.
Apache StringEscapeUtils contains an escapeJson method which can do this for you:
RootJsonObject = new JSONObject();
String jsonArrayString = StringEscapeUtils.escapeJson(JsonObject.getString("Column"));
RootJsonObject.put("content", new JSONArray(jsonArrayString));
I am not 100% clear where JsonObject comes from or is populated but if the above does not work, you may need to escape the content earlier in the process somewhere.
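If pulling in Commons Text is not an option, the characters that matter in this case can be escaped by hand. A simplified sketch (it covers only backslashes and double quotes, not the control characters a full escaper like escapeJson also handles):

```java
public class EscapeSketch {
    // Minimal JSON string escaping: backslashes first, then double quotes.
    // (A full escaper such as escapeJson also handles control characters.)
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        System.out.println(escape("Billing State/Province")); // unchanged: "/" needs no escaping
        System.out.println(escape("say \"hi\""));             // say \"hi\"
    }
}
```

Note that "/" needs no escaping at all in JSON; it was the embedded double quotes that made the original string invalid.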

Convert Avro file to JSON with reader schema

I would like to deserialize Avro data on the command line with a reader schema that is different from the writer schema. I can specify the writer schema on serialization, but not the reader schema during deserialization.
record.json (data file):
{"test1": 1, "test2": 2}
writer.avsc (writer schema):
{
  "type": "record",
  "name": "pouac",
  "fields": [
    {
      "name": "test1",
      "type": "int"
    },
    {
      "name": "test2",
      "type": "int"
    }
  ]
}
reader.avsc (reader schema):
{
  "type": "record",
  "name": "pouac",
  "fields": [{
    "name": "test2",
    "type": "int",
    "aliases": ["test1"]
  }]
}
Serializing data:
$ java -jar avro-tools-1.8.2.jar fromjson --schema-file writer.avsc record.json > record.avro
For deserializing data, I tried the following:
$ java -jar avro-tools-1.8.2.jar tojson --schema-file reader.avsc record.avro
Exception in thread "main" joptsimple.UnrecognizedOptionException: 'schema-file' is not a recognized option
...
I'm looking primarily for a command line instruction because I'm not so comfortable writing Java code, but I'd be happy with Java code to compile myself. Actually, what I'm interested in is the exact deserialization result. (The more fundamental issue at stake is described in this conversation on a fastavro PR that I opened to implement aliases.)
The avro-tools tojson target is only meant as a dump tool for translating a binary encoded Avro file to JSON. The schema always accompanies the records in the Avro file as outlined in the link below. As a result it cannot be overridden by avro-tools.
http://avro.apache.org/docs/1.8.2/#compare
I am not aware of a stand-alone tool that can be used to achieve what you want. I think you'll need to do some programming to achieve the desired results. Avro has many supported languages, including Python, but the capabilities across languages are not uniform. Java is, in my experience, the most advanced. As an example, Python lacks the ability to specify a reader schema on the DataFileReader, which would help achieve what you want:
https://github.com/apache/avro/blob/master/lang/py/src/avro/datafile.py#L224
The closest you can get in Python is the following:
import avro.schema as avsc
import avro.datafile as avdf
import avro.io as avio

reader_schema = avsc.parse(open("reader.avsc", "rb").read())

# need ability to inject reader schema as 3rd arg
with avdf.DataFileReader(open("record.avro", "rb"), avio.DatumReader()) as reader:
    for record in reader:
        print(record)
In terms of the schemas and the data you've outlined, the expected behaviour is undefined, and an error should therefore be emitted.
This behaviour can be verified with the following Java code:
package ca.junctionbox.soavro;

import org.apache.avro.Schema;
import org.apache.avro.SchemaValidationException;
import org.apache.avro.SchemaValidationStrategy;
import org.apache.avro.SchemaValidator;
import org.apache.avro.SchemaValidatorBuilder;

import java.util.ArrayList;

public class Main {
    public static final String V1 = "{\n" +
            " \"type\": \"record\",\n" +
            " \"name\": \"pouac\",\n" +
            " \"fields\": [\n" +
            " {\n" +
            " \"name\": \"test1\",\n" +
            " \"type\": \"int\"\n" +
            " },\n" +
            " {\n" +
            " \"name\": \"test2\",\n" +
            " \"type\": \"int\"\n" +
            " }\n" +
            " ]\n" +
            "}";

    public static final String V2 = "{\n" +
            " \"type\": \"record\",\n" +
            " \"name\": \"pouac\",\n" +
            " \"fields\": [{\n" +
            " \"name\": \"test2\",\n" +
            " \"type\": \"int\",\n" +
            " \"aliases\": [\"test1\"]\n" +
            " }]\n" +
            "}";

    public static void main(final String[] args) {
        final SchemaValidator sv = new SchemaValidatorBuilder()
                .canBeReadStrategy()
                .validateAll();
        final Schema sv1 = new Schema.Parser().parse(V1);
        final Schema sv2 = new Schema.Parser().parse(V2);
        final ArrayList<Schema> existing = new ArrayList<>();
        existing.add(sv1);

        try {
            sv.validate(sv2, existing);
            System.out.println("Good to go!");
        } catch (SchemaValidationException e) {
            e.printStackTrace();
        }
    }
}
This yields the following output:
org.apache.avro.SchemaValidationException: Unable to read schema:
{
  "type" : "record",
  "name" : "pouac",
  "fields" : [ {
    "name" : "test2",
    "type" : "int",
    "aliases" : [ "test1" ]
  } ]
}
using schema:
{
  "type" : "record",
  "name" : "pouac",
  "fields" : [ {
    "name" : "test1",
    "type" : "int"
  }, {
    "name" : "test2",
    "type" : "int"
  } ]
}
at org.apache.avro.ValidateMutualRead.canRead(ValidateMutualRead.java:70)
at org.apache.avro.ValidateCanBeRead.validate(ValidateCanBeRead.java:39)
at org.apache.avro.ValidateAll.validate(ValidateAll.java:51)
at ca.junctionbox.soavro.Main.main(Main.java:47)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:294)
at java.lang.Thread.run(Thread.java:748)
Aliases are typically used for backwards compatibility in schema evolution, allowing mappings from disparate/legacy keys to a common key name. Given your writer schema doesn't treat the test1 and test2 fields as "optional" through the use of unions, I can't see what scenario would call for this transformation. If you want to "drop" the test1 field, that can be achieved by excluding it from the v2 schema specification. Any reader that can apply a reader schema would then ignore test1 when using the v2 schema definition.
To illustrate what I mean by evolution:
v1 schema
{
  "type": "record",
  "name": "pouac",
  "fields": [
    {
      "name": "test1",
      "type": "int"
    }]
}
v2 schema
{
  "type": "record",
  "name": "pouac",
  "fields": [
    {
      "name": "test2",
      "type": "int",
      "aliases": ["test1"]
    }]
}
You could have terabytes of data in the v1 format and introduce the v2 format, which renames the test1 field to test2. The alias would allow you to perform map-reduce jobs, Hive queries, etc. on both v1 and v2 data without proactively rewriting all the old v1 data first. Note this assumes there is no change in the type or the semantic meaning of the fields.
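The alias lookup essentially renames writer-side keys to the reader's field names before the record is handed back. A toy plain-Java sketch of that renaming (not Avro's actual implementation), using the v1/v2 field names above:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AliasSketch {
    // Toy alias resolution: rename writer-side keys to the reader's field names.
    static Map<String, Object> applyAliases(Map<String, Object> written,
                                            Map<String, String> aliasToReaderName) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : written.entrySet()) {
            out.put(aliasToReaderName.getOrDefault(e.getKey(), e.getKey()), e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> v1Record = new LinkedHashMap<>();
        v1Record.put("test1", 1);

        // From the v2 schema: the "test2" field carries the alias "test1".
        Map<String, String> aliases = Map.of("test1", "test2");
        System.out.println(applyAliases(v1Record, aliases)); // {test2=1}
    }
}
```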
You can run java -jar avro-tools-1.8.2.jar tojson to see the help; it tells you that you can use the command like this:
java -jar avro-tools-1.8.2.jar tojson record.avro > tost.json
and this will output to the file:
{"test1":1,"test2":2}
You can also call it with the --pretty argument:
java -jar avro-tools-1.8.2.jar tojson --pretty record.avro > tost.json
and the output will be pretty:
{
"test1" : 1,
"test2" : 2
}

I want to add an additional property to a JSON string without transforming the String to any object format

I want to add an additional property to a JSON string without transforming the String into any object format. (The transformation I am doing right now takes 30 ms, and I want to avoid that cost.) Is there any way to add a property to the JSON string without transforming the payload into any object format?
Ex:
{
  "type": "some type",
  "data": [{
    "email": "email id",
    "content": {
      "some filed": "filed value"
    }
  }]
}
I need my payload to look like this after the new field is added:
{
  "type": "some type",
  "data": [{
    "email": "email id",
    "content": {
      "some filed": "filed value",
      "new field": "New value"
    }
  }]
}
}
You can use a regex to slip your new field in at the end of content:
String json = "{ \"type\": \"some type\", \"data\": [{ \"email\": \"email id\", \"content\": { \"some filed\": \"filed value\" } }] }";
String newField = "\"new field\": \"New value\"";
json = json.replaceAll("(\"content\".*?)\\}", "$1" + Matcher.quoteReplacement("," + newField + "}") + "");
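Be aware that this is fragile: the lazy .*?} stops at the first closing brace after "content", so if content ever holds a nested object, the new field lands inside that inner object instead. A runnable sketch showing both cases (the slipIn helper is just this answer's replaceAll wrapped in a method):

```java
import java.util.regex.Matcher;

public class RegexSlipIn {
    // Insert newField just before the first '}' that follows "content".
    static String slipIn(String json, String newField) {
        return json.replaceAll("(\"content\".*?)\\}",
                "$1," + Matcher.quoteReplacement(newField) + "}");
    }

    public static void main(String[] args) {
        String field = "\"new field\":\"New value\"";
        // Flat content object: the field lands where intended.
        System.out.println(slipIn("{\"content\":{\"x\":\"y\"}}", field));
        // Nested object inside content: the field lands in the inner object.
        System.out.println(slipIn("{\"content\":{\"a\":{\"b\":1}}}", field));
    }
}
```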

Elasticsearch - Deleting nested object using java api not working

I have Elasticsearch documents which contain nested objects within them, and I want to be able to remove those nested objects via the Java update API. Here is the code containing the script:
UpdateRequest updateRequest = new UpdateRequest(INDEX, "thread", String.valueOf(threadId));
updateRequest.script("for (int i = 0; i < ctx._source.messages.size(); i++){if(ctx._source.messages[i]._message_id == " + messageId + ")" +
"{ctx._source.messages.remove(i);i--;}}", ScriptService.ScriptType.INLINE);
client.update(updateRequest).actionGet();
This is the mapping of my document:
{
  "thread_and_messages": {
    "mappings": {
      "thread": {
        "properties": {
          "messages": {
            "type": "nested",
            "include_in_parent": true,
            "properties": {
              "message_id": {
                "type": "string"
              },
              "message_nick": {
                "type": "string"
              },
              "message_text": {
                "type": "string"
              }
            }
          },
          "thread_id": {
            "type": "long"
          }
        }
      }
    }
  }
}
I'm not receiving any error messages, but when I run a query on the index to find that nested document it hasn't been removed. Could someone let me know what I am doing wrong?
Since message_id is a string, your script needs to account for it and be modified like this (see the escaped double quotes around the message_id value). There is a second typo: your mapping declares a message_id field, but you name it _message_id in your script:
"for (int i = 0; i < ctx._source.messages.size(); i++){if(ctx._source.messages[i].message_id == \"" + messageId + "\")"
Finally, also make sure that you have dynamic scripting enabled in your ES config.
UPDATE
You can try a "Groovy-er" way of removing elements from lists, i.e. no more for loop and if; just use the Groovy power:
"ctx._source.messages.removeAll{ it.message_id == \"" + messageId + "\"}"
Normally, that will modify the messages array by removing all elements whose message_id field matches the messageId value.
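For comparison, the same removal idea in plain Java over an in-memory list (a sketch, not the Elasticsearch API): removeIf plays the role of the Groovy removeAll closure and avoids the manual index bookkeeping of the original loop. The id value here is hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class RemoveNestedSketch {
    public static void main(String[] args) {
        String messageId = "42"; // hypothetical id to delete
        List<Map<String, Object>> messages = new ArrayList<>();
        messages.add(Map.of("message_id", "42", "message_text", "hello"));
        messages.add(Map.of("message_id", "43", "message_text", "world"));

        // Equivalent of the Groovy removeAll closure: drop every message
        // whose message_id matches, with no manual index bookkeeping.
        messages.removeIf(m -> messageId.equals(m.get("message_id")));

        System.out.println(messages); // only message 43 remains
    }
}
```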
