I am building a tool in an Apache Beam pipeline that ingests many different types of data (different schemas, different file types, etc.) and outputs the results as Avro files. Because there are many different output schemas, I'm using GenericRecords to write the Avro data. These GenericRecords include schemas generated during ingestion for each unique file / schema layout. In general, I have been using the built-in Avro Schema class to handle these.
I tried using DecoderFactory to convert the JSON data to Avro:
// Decode the JSON payload against the generated schema and return a GenericRecord
DecoderFactory decoderFactory = new DecoderFactory();
Decoder decoder = decoderFactory.jsonDecoder(schema, content);
DatumReader<GenericData.Record> reader = new GenericDatumReader<>(schema);
return reader.read(null, decoder);
This works just fine for the most part, except when a schema has nullable fields. The data is read in from plain JSON that carries no type information, while the generated Schema knows which fields are nullable (a union with null) and which are required. That mismatch produces a problem when the data is written to Avro:
If I have a nullable record that looks like this:
{"someField": "someValue"}
Avro's JSON decoder expects the data to look like this:
{"someField": {"string": "someValue"}}
This mismatch appears any time a nullable field is present, which is very frequent.
One possible solution raised was to use an AvroMapper. I laid it out like it shows on that page: created the Schema object as an AvroSchema, then packaged the data into a byte array with the schema using AvroMapper.writer():
static GenericRecord convertJsonToGenericRecord(String content, Schema schema)
    throws IOException {
  JsonNode node = ObjectMappers.defaultObjectMapper().readTree(content);
  AvroSchema avroSchema = new AvroSchema(schema);
  AvroMapper mapper = new AvroMapper();
  byte[] avroData =
      mapper
          .writer(avroSchema)
          .writeValueAsBytes(node);
  // This is the line that throws the exception below:
  return mapper.readValue(avroData, GenericRecord.class);
}
This may get around the typing problem with nullable records, but it is still failing because the reader does not recognize any AvroSchema inside the byte array I'm passing in (avroData). Here is the stack trace:
com.fasterxml.jackson.core.JsonParseException: No AvroSchema set, can not parse
at com.fasterxml.jackson.dataformat.avro.deser.MissingReader._checkSchemaSet(MissingReader.java:68)
at com.fasterxml.jackson.dataformat.avro.deser.MissingReader.nextToken(MissingReader.java:41)
at com.fasterxml.jackson.dataformat.avro.deser.AvroParserImpl.nextToken(AvroParserImpl.java:97)
at com.fasterxml.jackson.databind.ObjectMapper._initForReading(ObjectMapper.java:4762)
at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:4668)
at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3691)
When I inspected the avroData byte array to see what it looked like, it contained nothing but the actual value I passed in: no schema, no header, no key. For the test I'm using a single K/V pair as in the example above, and all I got back was the value.
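From what I can tell, the reason is that Avro's binary encoding never embeds the schema (it only lives in the header of Avro container files), so the read side has to be handed the AvroSchema explicitly as well. A hedged sketch of what I think that looks like, reading back into a Map rather than a GenericRecord (I have not confirmed that Jackson can materialize an org.apache.avro.generic.GenericRecord this way):
// Sketch only: the bytes carry no schema, so the reader must be given one explicitly.
Map<String, Object> roundTripped = mapper
    .readerFor(Map.class)
    .with(avroSchema)   // the piece the failing readValue(...) call above was missing
    .readValue(avroData);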
An alternative route that I may pursue if this doesn't work is to manually format the JSON data as it comes in, but this is messy, and will require lots of recursion. I'm 99% sure that I can get it working that way, but would love to avoid it if at all possible.
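For completeness, the recursive rewrite I have in mind is roughly shaped like this; wrapUnions is just an illustrative name, and the sketch only covers records and simple nullable unions (arrays, maps, and multi-branch unions would need the same treatment):
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.node.JsonNodeFactory;
import com.fasterxml.jackson.databind.node.ObjectNode;
import org.apache.avro.Schema;

static JsonNode wrapUnions(JsonNode node, Schema schema) {
    if (node == null || node.isNull()) {
        return JsonNodeFactory.instance.nullNode();
    }
    switch (schema.getType()) {
        case UNION: {
            // Naive: assumes a ["null", X] union and wraps the value under the first
            // non-null branch, e.g. {"string": "someValue"}.
            for (Schema branch : schema.getTypes()) {
                if (branch.getType() != Schema.Type.NULL) {
                    ObjectNode wrapped = JsonNodeFactory.instance.objectNode();
                    wrapped.set(branch.getFullName(), wrapUnions(node, branch));
                    return wrapped;
                }
            }
            return node;
        }
        case RECORD: {
            // Rewrite each field against its own (possibly union) schema.
            ObjectNode rewritten = JsonNodeFactory.instance.objectNode();
            for (Schema.Field field : schema.getFields()) {
                rewritten.set(field.name(), wrapUnions(node.get(field.name()), field.schema()));
            }
            return rewritten;
        }
        default:
            // Primitives (and, in this sketch, arrays/maps) pass through unchanged.
            return node;
    }
}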
To reiterate, what I'm trying to do is package incoming JSON-formatted data (string, byte array, node, whatever) with an Avro Schema to create GenericRecords which will be output to .avro files. I need to find a way to ingest the data and Schema such that it will allow for nullable records to be untyped in the JSON-string.
Thank you for your time, and don't hesitate to ask clarifying questions.
Related
I am building an application that handles batch processing using Spring Batch. Since the ItemReaders can handle dynamic schemas (e.g. reading JSON files (JsonItemReader), XML files (StaxEventItemReader), getting data from MongoDB (MongoItemReader), and so on), I am wondering how I can get Spring Batch to use a FlatFileItemWriter dynamically as the last stage in the step and produce a CSV file.
Normally a fixed schema is required when I initialize the Writer (before I even start writing objects). Since the schema can differ between JSON objects, each product in each chunk can potentially have different headers. Is there any workaround that lets me use a FlatFileItemWriter as the output when the domain objects have various schemas that are unknown until runtime?
Here is the current code for initializing the FlatFileItemWriter, but it uses a static schema that needs to be provided before I create the Writer:
FlatFileItemWriter<Row> flatFileItemWriter = new FlatFileItemWriter<>();
Resource resource = new FileSystemResource(path);
flatFileItemWriter.setResource(resource);
CSVLineAggregator lineAggregator = CSVLineAggregator.builder()
.schema(schema)
.delimiter(delimiter)
.quoteCharacter(quoteCharacter)
.escapeCharacter(escapeCharacter)
.build();
flatFileItemWriter.setLineAggregator(lineAggregator);
flatFileItemWriter.setEncoding(encoding);
flatFileItemWriter.setLineSeparator(lineSeparator);
flatFileItemWriter.setShouldDeleteIfEmpty(shouldDeleteIfEmpty);
flatFileItemWriter.setHeaderCallback(new HeaderCallback(schema.getColumnNames(), flatFileItemWriter, lineSeparator));
** Row is my domain object: a Map-based structure that stores the data in cells and columns, along with the schema, which can differ between rows.
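For reference, the rough direction I have been considering is to defer building the real writer until the first chunk arrives and its schema is known. This is only a sketch against the Spring Batch 4 write(List) signature (Spring Batch 5 uses Chunk), and Row, Schema, getSchema(), and CSVLineAggregator are from my own domain code:
import java.util.List;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.item.ItemStream;
import org.springframework.batch.item.ItemStreamException;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.file.FlatFileItemWriter;
import org.springframework.core.io.FileSystemResource;

public class DeferredCsvItemWriter implements ItemWriter<Row>, ItemStream {

    private final String path;
    private FlatFileItemWriter<Row> delegate;

    public DeferredCsvItemWriter(String path) {
        this.path = path;
    }

    @Override
    public void write(List<? extends Row> items) throws Exception {
        if (delegate == null && !items.isEmpty()) {
            // The schema is only known once the first item of the first chunk shows up.
            delegate = buildWriter(items.get(0).getSchema());
            delegate.open(new ExecutionContext());
        }
        delegate.write(items);
    }

    private FlatFileItemWriter<Row> buildWriter(Schema schema) {
        // Same setup as the snippet above (delimiter, quoting, encoding, header callback
        // omitted here), just executed lazily once the schema is available.
        FlatFileItemWriter<Row> writer = new FlatFileItemWriter<>();
        writer.setName("deferredCsvWriter");
        writer.setResource(new FileSystemResource(path));
        writer.setLineAggregator(CSVLineAggregator.builder().schema(schema).build());
        return writer;
    }

    @Override
    public void open(ExecutionContext executionContext) throws ItemStreamException {
        // Nothing to open yet; the delegate is created on first write.
    }

    @Override
    public void update(ExecutionContext executionContext) throws ItemStreamException {
        if (delegate != null) {
            delegate.update(executionContext);
        }
    }

    @Override
    public void close() throws ItemStreamException {
        if (delegate != null) {
            delegate.close();
        }
    }
}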
Thanks in advance for any tips!
Does anybody know a way to deserialize Avro without using any POJOs or schemas?
The problem:
I have a data stream of different Avro files.
The goal is to group that data depending on the presence of some attributes (e.g. user.role, another.really.deep.attribute.with.specific.value and so on).
Each Avro entry might contain any number of the matching attributes (from zero to all of those listed).
So there is no need to do anything with the data, just to peek at some elements.
The question is, is there any way to convert that data to Map or Node? Like I can do it with JSON using Jackson or GSON.
I've tried to use GenericDatumReader, but it requires a Schema. So maybe all I need is to read the schema from the Avro data (how?).
Also, I've tried something like this, but the approach doesn't work:
public Map deserialize(byte[] data) {
    // Kept as in my attempt; SpecificDatumReader expects Avro-generated classes,
    // so handing it a LinkedHashMap does not work.
    DatumReader<LinkedHashMap> reader = new SpecificDatumReader<>(LinkedHashMap.class);
    try {
        Decoder decoder = DecoderFactory.get().binaryDecoder(data, null);
        return reader.read(null, decoder);
    } catch (IOException e) {
        logger.error("Deserialization error:" + e.getMessage());
        return null; // nothing sensible to return on failure
    }
}
Since I have time to 'play' with the problem, I have created a utility class that generates schemas depending on keys. It works, but looks like a big overhead.
A reader schema is required to deserialize any Avro message.
If you have the writer schema available, you can simply use that. Note that if you have Avro container files, they include the schema they were written with, and you can use avro-tools (its getschema command) to extract it.
Without either of those options, you'll need to figure out the schema on your own (perhaps from a hex dump, knowing how Avro data gets encoded).
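If it helps, here is a minimal sketch of the "schema is already in the file" route, using DataFileStream so the writer schema comes from the container header and no POJO or external schema is needed:
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

import java.io.ByteArrayInputStream;
import java.io.IOException;

public static void dumpRecords(byte[] avroFileBytes) throws IOException {
    // No schema passed in: the container header supplies the writer schema.
    GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
    try (DataFileStream<GenericRecord> stream =
             new DataFileStream<>(new ByteArrayInputStream(avroFileBytes), datumReader)) {
        Schema writerSchema = stream.getSchema(); // extracted from the file header
        for (GenericRecord record : stream) {
            // GenericRecord is map-like (record.get("user"), etc.), so attribute
            // presence can be checked against writerSchema.getField(...) without POJOs.
            System.out.println(record);
        }
    }
}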
I have a scenario where to convert the messages present as Json object to Apache Parquet format using Java. Any sample code or examples would be helpful. As far as what I have found to convert the messages to Parquet either Hive, Pig, Spark are being used. I need to convert to Parquet without involving these only by Java.
To convert JSON data files to Parquet, you need some in-memory representation. Parquet doesn't have its own set of Java objects; instead, it reuses the objects from other formats, like Avro and Thrift. The idea is that Parquet works natively with the objects your applications probably already use.
To convert your JSON, you need to convert the records to Avro in-memory objects and pass those to Parquet, but you don't need to convert a file to Avro and then to Parquet.
Conversion to Avro objects is already done for you, see Kite's JsonUtil, and is ready to use as a file reader. The conversion method needs an Avro schema, but you can use that same library to infer an Avro schema from JSON data.
To write those records, you just need to use AvroParquetWriter. The whole setup looks like this:
Schema jsonSchema = JsonUtil.inferSchema(fs.open(source), "RecordName", 20);
try (JSONFileReader<Record> reader = new JSONFileReader<>(
    fs.open(source), jsonSchema, Record.class)) {
  reader.initialize();
  try (ParquetWriter<Record> writer = AvroParquetWriter
      .<Record>builder(outputPath)
      .withConf(new Configuration())
      .withCompressionCodec(CompressionCodecName.SNAPPY)
      .withSchema(jsonSchema)
      .build()) {
    for (Record record : reader) {
      writer.write(record);
    }
  }
}
I had the same problem, and what I understood is that there are not many samples available for writing Parquet without using Avro or other frameworks. In the end I went with Avro. :)
Have a look at this; it may help you.
I tried to use Apache Avro on a project and ran into some difficulties.
Avro serialization/deserialization works like a charm, but I get decoder exceptions like "Unknown union branch ..." when the incoming JSON doesn't contain the namespace record.
e.g.
"user":{"demo.avro.User":{"age":1000... //that's ok
"user":{"age":1000... //org.apache.avro.AvroTypeException: Unknown union branch age
I cannot put the object in the default namespace, but it is important to parse the incoming JSON regardless of whether it contains the namespace node or not.
Could you help me fix this?
If you use JSON, why are you using Avro decoders? There are tons of JSON libraries designed to work with JSON. With Avro, the idea is to use Avro's own compact binary format; JSON is mostly used for debugging (i.e. you can expose Avro data as JSON if necessary).
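If you do need to look at Avro data as JSON for debugging, something along these lines works with the standard Avro library (the method name here is just illustrative):
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.io.JsonEncoder;

import java.io.ByteArrayOutputStream;
import java.io.IOException;

static String toDebugJson(GenericRecord record, Schema schema) throws IOException {
    // Serialize the record with a JsonEncoder instead of the binary encoder.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
    JsonEncoder encoder = EncoderFactory.get().jsonEncoder(schema, out);
    writer.write(record, encoder);
    encoder.flush();
    // Note: unions come out in the {"type": value} form discussed earlier in this thread.
    return out.toString("UTF-8");
}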