Deserialize Avro to Map - Java

Does anybody know a way to deserialize Avro without using any POJOs or schemas?
The problem:
I have a data stream of different Avro files.
The goal is to group that data depending on the presence of some attributes (e.g. user.role, another.really.deep.attribute.with.specific.value and so on).
Each Avro entry might contain any number of the matching attributes (from zero to all of those listed).
So there is no need to do anything with the data, just to peek at some elements.
The question is: is there any way to convert that data to a Map or Node, like I can do with JSON using Jackson or Gson?
I've tried to use GenericDatumReader, but it requires a Schema. So maybe all I need is to read the schema from the Avro data (how?).
Also, I've tried to use something like this, but this approach doesn't work.
public Map deserialize(byte[] data) {
    DatumReader<LinkedHashMap> reader =
        new SpecificDatumReader<>(LinkedHashMap.class);
    Decoder decoder = null;
    try {
        decoder = DecoderFactory.get().binaryDecoder(data, null);
        return reader.read(null, decoder);
    } catch (IOException e) {
        logger.error("Deserialization error:" + e.getMessage());
    }
    return null;
}
Since I have time to 'play' with the problem, I have created a utility class that generates schemas depending on the keys. It works, but it feels like a lot of overhead.

A reader schema is required to deserialize any message.
If you have the writer schema available, you can simply use that. Note that Avro container files include the schema they were written with, and you can use avro-tools.jar's getschema command to extract it.
Without either of those options, you'll need to figure out the schema on your own (perhaps from a hexdump and knowledge of how Avro data gets encoded).
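If the incoming data really is Avro container files (rather than raw, schemaless binary datums), a minimal sketch along these lines should work; the class and method names here are just illustrative. DataFileStream reads the writer schema from the file header, so no POJOs or hand-written schemas are needed, and each record's top-level fields can be copied into a Map for inspection:
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class AvroPeek {
    // Reads an Avro container file (which embeds its writer schema in the header)
    // and exposes each record's top-level fields as a Map for inspection.
    public static void peek(byte[] avroFileBytes) throws IOException {
        GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
        try (DataFileStream<GenericRecord> stream =
                 new DataFileStream<>(new ByteArrayInputStream(avroFileBytes), datumReader)) {
            Schema writerSchema = stream.getSchema(); // no external schema needed
            GenericRecord record = null;
            while (stream.hasNext()) {
                record = stream.next(record);
                Map<String, Object> asMap = new LinkedHashMap<>();
                for (Schema.Field field : writerSchema.getFields()) {
                    asMap.put(field.name(), record.get(field.name()));
                }
                // peek at asMap.get("user"), etc.
            }
        }
    }
}
Nested records come back as further GenericRecord instances, so the same field-by-field walk can be applied recursively to reach attributes like user.role.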

Related

How to convert JSON to AVRO GenericRecord in Java

I am building a tool in an Apache Beam pipeline which will ingest lots of different types of data (different schemas, different file types, etc.) and will output the results as Avro files. Because there are many different types of output schemas, I'm using GenericRecords to write the Avro data. These GenericRecords include schemas generated during ingestion for each unique file / schema layout. In general, I have been using the built-in Avro Schema class to handle these.
I tried using DecoderFactory to convert the JSON data to Avro:
DecoderFactory decoderFactory = new DecoderFactory();
Decoder decoder = decoderFactory.jsonDecoder(schema, content);
DatumReader<GenericData.Record> reader = new GenericDatumReader<>(schema);
return reader.read(null, decoder);
This works just fine for the most part, except when a schema has nullable fields. Because the data is read in from JSON, which does not include typed fields, the generated Schema knows whether a field is nullable or required, but the JSON values themselves carry no type information. This produces a problem when writing the data to Avro:
If I have a nullable record that looks like this:
{"someField": "someValue"}
Avro is expecting the JSON data to look like this:
{"someField": {"string": "someValue"}}. This presents a problem anytime this combination appears (which is very frequent).
One possible solution raised was to use an AvroMapper. I laid it out as shown on that page: created the Schema object as an AvroSchema, and packaged the data into a byte array with the schema using AvroMapper.writer():
static GenericRecord convertJsonToGenericRecord(String content, Schema schema)
        throws IOException {
    JsonNode node = ObjectMappers.defaultObjectMapper().readTree(content);
    AvroSchema avroSchema = new AvroSchema(schema);
    byte[] avroData =
        mapper
            .writer(avroSchema)
            .writeValueAsBytes(node);
    return mapper.readValue(avroData, GenericRecord.class);
}
This may hopefully get around the typing problem with nullable records, but it is still giving me issues: the read side does not recognize any AvroSchema in the byte array that I'm passing in (avroData). Here is the stack trace:
com.fasterxml.jackson.core.JsonParseException: No AvroSchema set, can not parse
at com.fasterxml.jackson.dataformat.avro.deser.MissingReader._checkSchemaSet(MissingReader.java:68)
at com.fasterxml.jackson.dataformat.avro.deser.MissingReader.nextToken(MissingReader.java:41)
at com.fasterxml.jackson.dataformat.avro.deser.AvroParserImpl.nextToken(AvroParserImpl.java:97)
at com.fasterxml.jackson.databind.ObjectMapper._initForReading(ObjectMapper.java:4762)
at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:4668)
at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3691)
When I checked the avroData byte array just to see what it looked like, it did not include anything other than the actual value I passed into it. It didn't include the schema, and it didn't even include the header or key. For the test, I'm using a single K/V pair as in the example above, and all I got back was the value.
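As an aside, the exception says no AvroSchema was set for parsing: the bytes produced by writer(avroSchema) are raw Avro binary, which does not embed the schema (consistent with the byte array containing only the value), so the same schema has to be attached on the read side as well. A minimal, untested sketch with jackson-dataformat-avro, reading back into a JsonNode rather than a GenericRecord:
import java.io.IOException;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.dataformat.avro.AvroMapper;
import com.fasterxml.jackson.dataformat.avro.AvroSchema;
import org.apache.avro.Schema;

public class AvroRoundTrip {
    // Writes a JSON tree as Avro binary and reads it back,
    // supplying the schema on both the write and the read path.
    static JsonNode roundTrip(JsonNode node, Schema schema) throws IOException {
        AvroMapper mapper = new AvroMapper();
        AvroSchema avroSchema = new AvroSchema(schema);
        byte[] avroData = mapper.writer(avroSchema).writeValueAsBytes(node);
        // Without .with(avroSchema) here, Jackson throws "No AvroSchema set, can not parse".
        return mapper.readerFor(JsonNode.class)
                     .with(avroSchema)
                     .readValue(avroData);
    }
}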
An alternative route that I may pursue if this doesn't work is to manually format the JSON data as it comes in, but this is messy, and will require lots of recursion. I'm 99% sure that I can get it working that way, but would love to avoid it if at all possible.
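For what it's worth, a rough sketch of that manual-formatting route (a hypothetical helper, not from the question, and simplified to top-level record fields; arrays, maps and nested unions would need the same treatment). It wraps values of union-typed fields under the branch name Avro's JsonDecoder expects:
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.node.JsonNodeFactory;
import com.fasterxml.jackson.databind.node.ObjectNode;
import org.apache.avro.Schema;

public class JsonUnionWrapper {
    // Rewrites plain JSON so that values of nullable (union) fields are wrapped
    // the way Avro's JsonDecoder expects, e.g.
    // {"someField": "someValue"} -> {"someField": {"string": "someValue"}}.
    static JsonNode wrapUnions(JsonNode node, Schema schema) {
        if (schema.getType() == Schema.Type.RECORD && node.isObject()) {
            ObjectNode obj = (ObjectNode) node;
            for (Schema.Field field : schema.getFields()) {
                JsonNode value = obj.get(field.name());
                if (value == null) {
                    continue;
                }
                Schema fieldSchema = field.schema();
                if (fieldSchema.getType() == Schema.Type.UNION && !value.isNull()) {
                    // Pick the first non-null branch and wrap the value under its type name.
                    for (Schema branch : fieldSchema.getTypes()) {
                        if (branch.getType() != Schema.Type.NULL) {
                            ObjectNode wrapped = JsonNodeFactory.instance.objectNode();
                            wrapped.set(branch.getFullName(), wrapUnions(value, branch));
                            obj.set(field.name(), wrapped);
                            break;
                        }
                    }
                } else {
                    obj.set(field.name(), wrapUnions(value, fieldSchema));
                }
            }
        }
        return node;
    }
}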
To reiterate, what I'm trying to do is package incoming JSON-formatted data (string, byte array, node, whatever) with an Avro Schema to create GenericRecords which will be output to .avro files. I need to find a way to ingest the data and Schema such that it will allow for nullable records to be untyped in the JSON-string.
Thank you for your time, and don't hesitate to ask clarifying questions.

How to convert Pojo to parquet? [duplicate]

Json object to Parquet format using Java without converting to AVRO(Without using Spark, Hive, Pig,Impala)

I have a scenario where I need to convert messages present as JSON objects to Apache Parquet format using Java. Any sample code or examples would be helpful. From what I have found, Hive, Pig, or Spark are typically used to convert the messages to Parquet. I need to convert to Parquet using only Java, without involving these.
To convert JSON data files to Parquet, you need some in-memory representation. Parquet doesn't have its own set of Java objects; instead, it reuses the objects from other formats, like Avro and Thrift. The idea is that Parquet works natively with the objects your applications probably already use.
To convert your JSON, you need to convert the records to Avro in-memory objects and pass those to Parquet, but you don't need to convert a file to Avro and then to Parquet.
Conversion to Avro objects is already done for you, see Kite's JsonUtil, and is ready to use as a file reader. The conversion method needs an Avro schema, but you can use that same library to infer an Avro schema from JSON data.
To write those records, you just need to use AvroParquetWriter. The whole setup looks like this:
Schema jsonSchema = JsonUtil.inferSchema(fs.open(source), "RecordName", 20);
try (JSONFileReader<Record> reader = new JSONFileReader<>(
        fs.open(source), jsonSchema, Record.class)) {
    reader.initialize();
    try (ParquetWriter<Record> writer = AvroParquetWriter
            .<Record>builder(outputPath)
            .withConf(new Configuration())
            .withCompressionCodec(CompressionCodecName.SNAPPY)
            .withSchema(jsonSchema)
            .build()) {
        for (Record record : reader) {
            writer.write(record);
        }
    }
}
I had the same problem, and what I understood is that there are not many samples available for writing Parquet without using Avro or other frameworks. In the end, I went with Avro. :)
Have a look at this; it may help you.

How to use Apache Avro to serialize the JSON document and then write it into Cassandra?

I have been reading a lot about Apache Avro these days, and I am more inclined towards using it instead of JSON. Currently, we serialize the JSON document using Jackson and then write that serialized JSON document into Cassandra for each row key/user id. Then we have a REST service that reads the whole JSON document using the row key, deserializes it, and uses it further.
We write into Cassandra like this:
user-id column-name serialized-json-document-value
Below is an example which shows the JSON document that we are writing into Cassandra. This JSON document is for particular row key/user id.
{
  "lv" : [ {
    "v" : {
      "site-id" : 0,
      "categories" : {
        "321" : {
          "price_score" : "0.2",
          "confidence_score" : "0.5"
        },
        "123" : {
          "price_score" : "0.4",
          "confidence_score" : "0.2"
        }
      },
      "price-score" : 0.5,
      "confidence-score" : 0.2
    }
  } ],
  "lmd" : 1379214255197
}
Now we are thinking of using Apache Avro so that we can compact this JSON document by serializing it with Apache Avro and then store it in Cassandra. I have a couple of questions on this:
First of all, is it possible to serialize the above JSON document using Apache Avro and then write it into Cassandra? If yes, how can I do that? Can anyone provide a simple example?
We also need to deserialize it when reading it back from Cassandra in our REST service. Is that possible as well?
Below is my simple code, which serializes the JSON document and prints it out on the console.
public static void main(String[] args) {
    final long lmd = System.currentTimeMillis();

    Map<String, Object> props = new HashMap<String, Object>();
    props.put("site-id", 0);
    props.put("price-score", 0.5);
    props.put("confidence-score", 0.2);

    Map<String, Category> categories = new HashMap<String, Category>();
    categories.put("123", new Category("0.4", "0.2"));
    categories.put("321", new Category("0.2", "0.5"));
    props.put("categories", categories);

    AttributeValue av = new AttributeValue();
    av.setProperties(props);

    Attribute attr = new Attribute();
    attr.instantiateNewListValue();
    attr.getListValue().add(av);
    attr.setLastModifiedDate(lmd);

    // serialize it
    try {
        String jsonStr = JsonMapperFactory.get().writeValueAsString(attr);
        // then write into Cassandra
        System.out.println(jsonStr);
    } catch (JsonGenerationException e) {
        e.printStackTrace();
    } catch (JsonMappingException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
The serialized JSON document will look something like this:
{"lv":[{"v":{"site-id":0,"categories":{"321":{"price_score":"0.2","confidence_score":"0.5"},"123":{"price_score":"0.4","confidence_score":"0.2"}},"price-score":0.5,"confidence-score":0.2}}],"lmd":1379214255197}
The AttributeValue and Attribute classes use Jackson annotations.
One important note: the properties inside the above JSON document change depending on the column name. We have different properties for different column names; some column names have two properties, some have five. So the JSON document will have the correct properties and values according to the metadata we keep.
I hope the question is clear enough. Can anyone provide a simple example of how I can achieve this using Apache Avro? I am just starting with Apache Avro, so I am having a lot of problems.
Since you already use Jackson, you could try the Jackson Avro dataformat module to support Avro-encoded data.
Avro requires a schema, so you MUST design it before using it; and usage differs a lot from free-form JSON.
But instead of Avro, you might want to consider Smile: a one-to-one binary serialization of JSON, designed for use cases where you may want to go back and forth between JSON and binary data; for example, to use JSON for debugging, or when serving JavaScript clients.
Jackson has a Smile backend (see https://github.com/FasterXML/jackson-dataformat-smile), and it is literally a one-line change to use Smile instead of (or in addition to) JSON.
Many projects use it (for example, Elasticsearch), and it is a mature and stable format; tooling support via Jackson is extensive for different datatypes.
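As a rough illustration of that one-line change (assuming the jackson-dataformat-smile dependency is on the classpath and reusing the Attribute class from the question):
import java.io.IOException;

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.dataformat.smile.SmileFactory;

public class SmileExample {
    // Same Jackson data binding as before; only the underlying factory changes.
    private static final ObjectMapper SMILE_MAPPER = new ObjectMapper(new SmileFactory());

    static byte[] toSmile(Attribute attr) throws IOException {
        return SMILE_MAPPER.writeValueAsBytes(attr);           // compact binary to store in Cassandra
    }

    static Attribute fromSmile(byte[] bytes) throws IOException {
        return SMILE_MAPPER.readValue(bytes, Attribute.class); // read back in the REST service
    }
}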

Avro JSON decoder: ignore namespace

I tried to use Apache Avro on a project and I've run into some difficulties.
Avro serialization/deserialization works like a charm, but I get decoder exceptions like "unknown union branch blah-blah-blah" in case the incoming JSON doesn't contain the namespace record,
e.g.
"user":{"demo.avro.User":{"age":1000... //that's ok
"user":{"age":1000... //org.apache.avro.AvroTypeException: Unknown union branch age
I cannot put the object in the default namespace, but it is important to parse the incoming JSON regardless of whether it contains the namespace node or not.
Could you help me fix it?
If you use JSON, why are you using Avro decoders? There are tons of JSON libraries which are designed to work with JSON: with Avro, the idea is to use Avro's own compact format, and JSON is mostly used for debugging (i.e. you can expose Avro data as JSON if necessary).
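On the "expose Avro data as JSON if necessary" point, here is a small sketch using Avro's own JsonEncoder; note that its output uses the branch-qualified union form (the "demo.avro.User" wrapping from the question above), since that is Avro's JSON encoding:
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.io.JsonEncoder;

public class AvroToJson {
    // Renders a GenericRecord as JSON text, e.g. for debugging or logging.
    static String toDebugJson(GenericRecord record, Schema schema) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        JsonEncoder encoder = EncoderFactory.get().jsonEncoder(schema, out);
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();
        return out.toString("UTF-8");
    }
}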
