Convert Avro file to JSON with reader schema - java

I would like to deserialize Avro data on the command line with a reader schema that is different from the writer schema. I can specify the writer schema for serialization, but not a reader schema for deserialization.
record.json (data file):
{"test1": 1, "test2": 2}
writer.avsc (writer schema):
{
  "type": "record",
  "name": "pouac",
  "fields": [
    {
      "name": "test1",
      "type": "int"
    },
    {
      "name": "test2",
      "type": "int"
    }
  ]
}
reader.avsc (reader schema):
{
  "type": "record",
  "name": "pouac",
  "fields": [{
    "name": "test2",
    "type": "int",
    "aliases": ["test1"]
  }]
}
Serializing data:
$ java -jar avro-tools-1.8.2.jar fromjson --schema-file writer.avsc record.json > record.avro
For deserializing data, I tried the following:
$ java -jar avro-tools-1.8.2.jar tojson --schema-file reader.avsc record.avro
Exception in thread "main" joptsimple.UnrecognizedOptionException: 'schema-file' is not a recognized option
...
I'm looking primarily for a command line instruction because I'm not so comfortable writing Java code, but I'd be happy with Java code to compile myself. Actually, what I'm interested in is the exact deserialization result. (The more fundamental issue at stake is described in this conversation on a fastavro PR that I opened to implement aliases.)

The avro-tools tojson target is only meant as a dump tool for translating a binary-encoded Avro file to JSON. The schema always accompanies the records in the Avro file, as outlined in the link below. As a result it cannot be overridden by avro-tools.
http://avro.apache.org/docs/1.8.2/#compare
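You can confirm this by dumping the embedded writer schema straight from the data file, for example:
$ java -jar avro-tools-1.8.2.jar getschema record.avro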
I am not aware of a stand-alone tool that can be used to achieve what you want. I think you'll need to do some programming to achieve the desired results. Avro has many supported languages, including Python, but the capabilities across languages are not uniform. Java is, in my experience, the most advanced. As an example, Python lacks the ability to specify a reader schema on the DataFileReader, which would help achieve what you want:
https://github.com/apache/avro/blob/master/lang/py/src/avro/datafile.py#L224
The closest you can get in Python is the following:
import avro.schema as avsc
import avro.datafile as avdf
import avro.io as avio

reader_schema = avsc.parse(open("reader.avsc", "rb").read())

# need ability to inject reader schema as 3rd arg
with avdf.DataFileReader(open("record.avro", "rb"), avio.DatumReader()) as reader:
    for record in reader:
        print record
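For what it's worth, in Java a reader schema can be supplied to GenericDatumReader (the writer schema is taken from the Avro file header), which is the closest equivalent to what you're after. A minimal sketch, assuming the record.avro and reader.avsc files from the question:

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class ReadWithReaderSchema {
    public static void main(String[] args) throws Exception {
        Schema readerSchema = new Schema.Parser().parse(new File("reader.avsc"));
        // writer schema comes from the file header; the reader schema is the "expected" schema
        GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<>(null, readerSchema);
        try (DataFileReader<GenericRecord> fileReader = new DataFileReader<>(new File("record.avro"), datumReader)) {
            for (GenericRecord record : fileReader) {
                System.out.println(record);
            }
        }
    }
}

Note this only shows the mechanics of supplying a reader schema; whether your particular alias mapping is valid for these schemas is a separate question, addressed below.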
Given the schemas and data you've outlined, the expected behaviour is undefined and should therefore emit an error. This behaviour can be verified with the following Java code:
package ca.junctionbox.soavro;

import org.apache.avro.Schema;
import org.apache.avro.SchemaValidationException;
import org.apache.avro.SchemaValidationStrategy;
import org.apache.avro.SchemaValidator;
import org.apache.avro.SchemaValidatorBuilder;

import java.util.ArrayList;

public class Main {
    public static final String V1 = "{\n" +
            " \"type\": \"record\",\n" +
            " \"name\": \"pouac\",\n" +
            " \"fields\": [\n" +
            " {\n" +
            " \"name\": \"test1\",\n" +
            " \"type\": \"int\"\n" +
            " },\n" +
            " {\n" +
            " \"name\": \"test2\",\n" +
            " \"type\": \"int\"\n" +
            " }\n" +
            " ]\n" +
            "}";

    public static final String V2 = "{\n" +
            " \"type\": \"record\",\n" +
            " \"name\": \"pouac\",\n" +
            " \"fields\": [{\n" +
            " \"name\": \"test2\",\n" +
            " \"type\": \"int\",\n" +
            " \"aliases\": [\"test1\"]\n" +
            " }]\n" +
            "}";

    public static void main(final String[] args) {
        final SchemaValidator sv = new SchemaValidatorBuilder()
                .canBeReadStrategy()
                .validateAll();
        final Schema sv1 = new Schema.Parser().parse(V1);
        final Schema sv2 = new Schema.Parser().parse(V2);
        final ArrayList<Schema> existing = new ArrayList<>();
        existing.add(sv1);
        try {
            sv.validate(sv2, existing);
            System.out.println("Good to go!");
        } catch (SchemaValidationException e) {
            e.printStackTrace();
        }
    }
}
This yields the following output:
org.apache.avro.SchemaValidationException: Unable to read schema:
{
  "type" : "record",
  "name" : "pouac",
  "fields" : [ {
    "name" : "test2",
    "type" : "int",
    "aliases" : [ "test1" ]
  } ]
}
using schema:
{
  "type" : "record",
  "name" : "pouac",
  "fields" : [ {
    "name" : "test1",
    "type" : "int"
  }, {
    "name" : "test2",
    "type" : "int"
  } ]
}
    at org.apache.avro.ValidateMutualRead.canRead(ValidateMutualRead.java:70)
    at org.apache.avro.ValidateCanBeRead.validate(ValidateCanBeRead.java:39)
    at org.apache.avro.ValidateAll.validate(ValidateAll.java:51)
    at ca.junctionbox.soavro.Main.main(Main.java:47)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:294)
    at java.lang.Thread.run(Thread.java:748)
Aliases are typically used for backwards compatibility in schema evolution, allowing mappings from disparate or legacy keys to a common key name. Given that your writer schema doesn't treat the test1 and test2 fields as "optional" through the use of unions, I can't see in what scenario you'd want this transformation. If you want to "drop" the test1 field, that can be achieved by excluding it from the v2 schema specification; any reader that can apply a reader schema would then ignore test1 when using the v2 schema definition.
To illustrate what I mean by evolution:
v1 schema
{
  "type": "record",
  "name": "pouac",
  "fields": [
    {
      "name": "test1",
      "type": "int"
    }]
}
v2 schema
{
  "type": "record",
  "name": "pouac",
  "fields": [
    {
      "name": "test2",
      "type": "int",
      "aliases": ["test1"]
    }]
}
You could have terabytes of data in the v1 format and introduce the v2 format, which renames the test1 field to test2. The alias would allow you to perform map-reduce jobs, Hive queries, etc. on both v1 and v2 data without proactively rewriting all the old v1 data first. Note that this assumes there is no change in the type or semantic meaning of the fields.
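As a rough illustration in code, here is a minimal sketch (not from the original question) that writes one record with the v1 schema and reads it back with the v2 schema, relying on the reader-side alias to map test1 to test2:

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AliasEvolution {
    public static void main(String[] args) throws Exception {
        Schema v1 = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"pouac\",\"fields\":[{\"name\":\"test1\",\"type\":\"int\"}]}");
        Schema v2 = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"pouac\",\"fields\":[{\"name\":\"test2\",\"type\":\"int\",\"aliases\":[\"test1\"]}]}");

        // write one record using the v1 (writer) schema
        GenericRecord record = new GenericData.Record(v1);
        record.put("test1", 42);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(v1).write(record, encoder);
        encoder.flush();

        // read the same bytes back with v2 as the reader ("expected") schema
        GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(v1, v2);
        GenericRecord result = reader.read(null, DecoderFactory.get().binaryDecoder(out.toByteArray(), null));
        System.out.println(result); // with the alias applied, this should print {"test2": 42}
    }
}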

You can run java -jar avro-tools-1.8.2.jar tojson to see the help; it shows that you can use the command like this:
java -jar avro-tools-1.8.2.jar tojson record.avro > tost.json
and this will output to the file:
{"test1":1,"test2":2}
You can also call it with the --pretty argument:
java -jar avro-tools-1.8.2.jar tojson --pretty record.avro > tost.json
and the output will be pretty-printed:
{
  "test1" : 1,
  "test2" : 2
}

Related

Passing json string as an input to one of the parameters of a POST request body in RESTAssured with Selenium

I am using the REST Assured Java library with Selenium for API test automation. I need to pass a JSON string as the value of one parameter of a POST request body. My request body looks like this:
{
  "parameter1": "abc",
  "parameter2": "def",
  "parameter3": {
    "id": "",
    "key1": "test123",
    "prod1": {
      "id": "",
      "key3": "test123",
      "key4": "12334",
      "key5": "3",
      "key6": "234334"
    },
    "prod2": {
      "id": "",
      "key7": "test234",
      "key8": "1",
      "key9": true
    }
  },
  "parameter4": false,
  "parameter5": "ghi"
}
For parameter3 I need to pass a string value in JSON format. The JSON file is located on my local system and is a huge file, so it would make sense if I could pass the path to the JSON file.
Is there any way using RestAssured to achieve this?
Use the org.json library.
Read the JSON file and get it as a String:
String content = "";
try {
    content = new String(Files.readAllBytes(Paths.get("absolute_path_to_file\\example.json")));
} catch (IOException e) {
    e.printStackTrace();
}
Convert the String to a JSONObject:
JSONObject jsonObject = new JSONObject(content);
Create the new JSON object that you need to put into the jsonObject:
String jsonString = "{\n" +
        " \"firstName\": \"John\",\n" +
        " \"lastName\" : \"doe\",\n" +
        " \"age\" : 26,\n" +
        " \"address\" : {\n" +
        " \"streetAddress\": \"naist street\",\n" +
        " \"city\" : \"Nara\",\n" +
        " \"postalCode\" : \"630-0192\"\n" +
        " }\n" +
        "}";
JSONObject updateObject = new JSONObject(jsonString);
Replace the value of parameter3 with the new updateObject:
jsonObject.put("parameter3", updateObject);
System.out.println(jsonObject.toString());
If you beautify the printed output:
{
  "parameter5": "ghi",
  "parameter4": false,
  "parameter3": {
    "firstName": "John",
    "lastName": "doe",
    "address": {
      "streetAddress": "naist street",
      "city": "Nara",
      "postalCode": "630-0192"
    },
    "age": 26
  },
  "parameter2": "def",
  "parameter1": "abc"
}
If you want to update a nested JSON object such as prod1 in parameter3:
JSONObject parameter3JsonObject = jsonObject.getJSONObject("parameter3");
parameter3JsonObject.put("prod1", updateObject);
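To actually send the assembled payload with REST Assured, a minimal sketch (the endpoint URL is a placeholder, not from the original question):

import io.restassured.http.ContentType;
import io.restassured.response.Response;
import static io.restassured.RestAssured.given;

// serialize the JSONObject built above back to a String and use it as the request body
Response response = given()
        .contentType(ContentType.JSON)
        .body(jsonObject.toString())
        .post("https://example.com/api/endpoint"); // placeholder endpoint
System.out.println(response.getStatusCode());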

How do I correctly query multiple nested arrays for a string match? - MongoDB

I'm trying to build a MongoDB query where I can search a list of objects, which all contain multiple nested arrays of objects, for one where a property on the lowest level contains part of a string.
This is the general structure of these objects:
{
  "name": "Foo",
  "structure": {
    "configs": [
      {
        "combinations": [
          {
            "title": "item1"
          },
          {
            "title": "item2"
          }
        ]
      },
      {
        "combinations": [
          {
            "title": "item3"
          },
          {
            "title": "item4"
          }
        ]
      }
    ]
  }
}
Now when I search for "item1" or just "1", I'd like that example object to be returned, because the first combinations array contains an object with the title item1.
Since I'm building the application in Spring Boot, such queries could usually be handled by the usual high-level findAllByPropertyMatching(String searchTerm) in the repository class. Due to its complexity this does not work in this case, and I'm really struggling with how to go about it.
I tried a custom query...
@Query(value = "{'structure.configs.$[].combination.$[].title': {$regex : ?0, $options: 'i'}}")
public List<Item> findAllByStructuregMatchesRegex(String query);
...but it has multiple issues, obviously.
Because the data is loaded from an external source, I'm not able to change the underlying data structure. I also can't cache all items and filter it with Java logic, because the data set is way too big.
Can anyone point me in the right direction? I'd really appreciate your help, thanks a lot!
After more trial and error, the solution seems to be pretty straightforward:
@Query(value = "{'structure.configs': {\n" +
        " '$elemMatch': {\n" +
        " 'combinations': {\n" +
        " '$elemMatch': {\n" +
        " 'title': {$regex : ?0, $options: 'i'}\n" +
        " }\n" +
        " }\n" +
        " }\n" +
        " }}")
public List<Item> findAllByStructureMatchesRegex(String query);
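For reference, the equivalent query in the mongo shell (assuming the collection is called items, which is not stated in the question) would look like this:

db.items.find({
  "structure.configs": {
    $elemMatch: {
      combinations: {
        $elemMatch: {
          title: { $regex: "item1", $options: "i" }
        }
      }
    }
  }
})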

use of "default" in avro schema

As per the definition of "default" attribute in Avro docs: "A default value for this field, used when reading instances that lack this field (optional)."
This means that if the corresponding field is missing, the default value is taken.
But this does not seem to be the case. Consider the following student schema:
{
  "type": "record",
  "namespace": "com.example",
  "name": "Student",
  "fields": [{
      "name": "age",
      "type": "int",
      "default": -1
    },
    {
      "name": "name",
      "type": "string",
      "default": "null"
    }
  ]
}
The schema says: if the "age" field is missing, then use -1 as its value; likewise for the "name" field.
Now, if I try to construct the Student model from the following JSON:
{"age":70}
I get this exception:
org.apache.avro.AvroTypeException: Expected string. Got END_OBJECT
at org.apache.avro.io.JsonDecoder.error(JsonDecoder.java:698)
at org.apache.avro.io.JsonDecoder.readString(JsonDecoder.java:227)
It looks like the default is NOT working as expected. So, what exactly is the role of default here?
This is the code used to generate Student model:
Decoder decoder = DecoderFactory.get().jsonDecoder(Student.SCHEMA$, studentJson);
SpecificDatumReader<Student> datumReader = new SpecificDatumReader<>(Student.class);
return datumReader.read(null, decoder);
(Student class is auto-generated by Avro compiler from student schema)
I think there is some misunderstanding around default values, so hopefully my explanation will help other people as well. The default value is used when the field is not present, but essentially only when you are instantiating an Avro object (in your case, when calling datumReader.read); it does not by itself allow reading data written with a different schema. This is why the concept of a "schema registry" is useful for this kind of situation.
The following code works and allows you to read your data:
Decoder decoder = DecoderFactory.get().jsonDecoder(Student.SCHEMA$, "{\"age\":70}");
SpecificDatumReader<Student> datumReader = new SpecificDatumReader<>(Student.class);
Schema expected = new Schema.Parser().parse("{\n" +
        " \"type\": \"record\",\n" +
        " \"namespace\": \"com.example\",\n" +
        " \"name\": \"Student\",\n" +
        " \"fields\": [{\n" +
        " \"name\": \"age\",\n" +
        " \"type\": \"int\",\n" +
        " \"default\": -1\n" +
        " }\n" +
        " ]\n" +
        "}");
datumReader.setSchema(expected);
System.out.println(datumReader.read(null, decoder));
As you can see, I am specifying the schema used to "write" the JSON input, which does not contain the field "name". However, because your schema contains a default value, when you print the record you will see the name with your default value:
{"age": 70, "name": "null"}
Just in case you might not already know: that "null" is not really a null value; it is a string with the value "null".
Just to add to what is already said in the above answer: in order for a field to be null when not present, union its type with null; otherwise it's just the string spelled "null" that gets in. Example schema:
{
  "name": "name",
  "type": [
    "null",
    "string"
  ],
  "default": null
}
and then if you add {"age":70} and retrieve the record, you will get the following:
{"age":70,"name":null}

How to find a key in a JSON file with Gson library?

Given a JSON file which looks like the code below, how can I find a specific key with its corresponding value without hard-coding the structure in a Java class or otherwise? Let's say I want to get "Signal_Settings" as a JSON object, but given the possibility of changes in the JSON file structure, I want to get this object regardless, for example by iterating through all keys in the file. I have tried to do this iteratively and failed, so I think a recursive function is the way, but I haven't found a solution for it yet. My iterative code looks like this:
while (true) {
    try {
        for (Map.Entry<String, JsonElement> entry : newJsonObj.entrySet()) {
            System.out.println(entry.getKey());
            System.out.println("Before NewJsonObj:" + newJsonObj);
            newJsonObj = newJsonObj.getAsJsonObject(entry.getKey());
            System.out.println("After NewJsonObj:" + newJsonObj + "\n");
            //tempEntrySet = newJsonObj.entrySet().iterator().next();
        }
        //System.out.println("Key:" + tempEntrySet.getKey());
        //System.out.println("TempEntry:" + tempEntrySet);
    } catch (Exception e) {
        break;
    }
}
and the JSON file:
{
  "config2": {
    "UDP_Settings": {
      "ListenAddress": "'127.0.0.1'",
      "ListenPort": "54523"
    },
    "Signal_Settings": {
      "Count": 1,
      "Signals": [
        {
          "Name": "time",
          "Type": "uint8",
          "Interval": "foo",
          "Description": "",
          "Unit": null
        },
        {
          "Name": "othersignal",
          "Type": "uint8",
          "Interval": "fff",
          "Description": "",
          "Unit": null
        }
      ]
    },
    "Tag_Settings": null,
    "Model_Settings": null
  }
}
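A minimal sketch (not from the original post) of the recursive lookup the question describes: walk every JsonObject and JsonArray with Gson and return the first value stored under the given key, e.g. "Signal_Settings":

import com.google.gson.JsonElement;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import java.util.Map;

public class JsonKeyFinder {
    static JsonElement findKey(JsonElement element, String key) {
        if (element.isJsonObject()) {
            JsonObject obj = element.getAsJsonObject();
            if (obj.has(key)) {
                return obj.get(key); // found at this level
            }
            for (Map.Entry<String, JsonElement> entry : obj.entrySet()) {
                JsonElement found = findKey(entry.getValue(), key);
                if (found != null) {
                    return found;
                }
            }
        } else if (element.isJsonArray()) {
            for (JsonElement item : element.getAsJsonArray()) {
                JsonElement found = findKey(item, key);
                if (found != null) {
                    return found;
                }
            }
        }
        return null; // primitives and nulls cannot contain the key
    }

    public static void main(String[] args) {
        String json = "{\"config2\":{\"Signal_Settings\":{\"Count\":1}}}";
        // Gson 2.8.6+; on older versions use new JsonParser().parse(json)
        JsonElement root = JsonParser.parseString(json);
        System.out.println(findKey(root, "Signal_Settings")); // {"Count":1}
    }
}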

How to add an additional property to a JSON string without transforming the String to any object format

I want to add an additional property to a JSON string without transforming the String into any object format (I am doing the transformation right now, which takes 30 ms, and I want to avoid that time). Is there any way to add a property to the JSON string without transforming the payload into any object format?
Ex:
{
  "type": "some type",
  "data": [{
    "email": "email id",
    "content": {
      "some filed": "filed value"
    }
  }]
}
I need my payload to look like this after the new field is added:
{
  "type": "some type",
  "data": [{
    "email": "email id",
    "content": {
      "some filed": "filed value",
      "new field": "New value"
    }
  }]
}
You can use regex to slip in your new field at the end of content:
// requires java.util.regex.Matcher
String json = "{ \"type\": \"some type\", \"data\": [{ \"email\": \"email id\", \"content\": { \"some filed\": \"filed value\" } }] }";
String newField = "\"new field\": \"New value\"";
// append the new field just before the first '}' that closes the "content" object
json = json.replaceAll("(\"content\".*?)\\}", "$1" + Matcher.quoteReplacement("," + newField + "}"));
