Spring Batch - a FlatFileItemWriter with dynamic schema

I am building an application that handles batch processing using Spring Batch. Since ItemReaders can handle dynamic schemas (e.g. reading JSON files with JsonItemReader, XML files with StaxEventItemReader, or data from MongoDB with MongoItemReader), I am wondering how I can use a FlatFileItemWriter as the last stage of a step to produce a CSV file when the schema is dynamic.
Normally, the writer requires a fixed schema when it is initialized, before any objects are written. Because the schema can differ between JSON objects, each item in each chunk can potentially have different headers. Is there a workaround that lets me use a FlatFileItemWriter as the output when the domain objects have varying schemas that are unknown until runtime?
This is my current code for initializing a FlatFileItemWriter. It uses a static schema that has to be provided before the writer is created.
FlatFileItemWriter<Row> flatFileItemWriter = new FlatFileItemWriter<>();
Resource resource = new FileSystemResource(path);
flatFileItemWriter.setResource(resource);
CSVLineAggregator lineAggregator = CSVLineAggregator.builder()
        .schema(schema)
        .delimiter(delimiter)
        .quoteCharacter(quoteCharacter)
        .escapeCharacter(escapeCharacter)
        .build();
flatFileItemWriter.setLineAggregator(lineAggregator);
flatFileItemWriter.setEncoding(encoding);
flatFileItemWriter.setLineSeparator(lineSeparator);
flatFileItemWriter.setShouldDeleteIfEmpty(shouldDeleteIfEmpty);
flatFileItemWriter.setHeaderCallback(new HeaderCallback(schema.getColumnNames(), flatFileItemWriter, lineSeparator));
** Row is my domain object: a Map-based structure that stores the data in cells and columns, along with a schema that can differ between rows.
Thanks in advance for any tips!
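One possible workaround, sketched here in plain Java rather than against the Spring Batch API, is a two-pass approach: first collect the union of all column names, then render every row against that full header list so that missing cells become empty. The sketch assumes each row can be viewed as a Map<String, Object>; the class and method names (DynamicCsvSketch, unionHeaders, aggregate) are hypothetical and not part of Spring Batch or of the code above.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper; assumes a row is a Map from column name to cell value.
final class DynamicCsvSketch {

    // Pass 1: collect every column name seen across all rows, in first-seen order.
    static List<String> unionHeaders(List<Map<String, Object>> rows) {
        Set<String> headers = new LinkedHashSet<>();
        for (Map<String, Object> row : rows) {
            headers.addAll(row.keySet());
        }
        return new ArrayList<>(headers);
    }

    // Pass 2: render one row against the full header list; absent cells become empty fields.
    static String aggregate(Map<String, Object> row, List<String> headers, char delimiter) {
        return headers.stream()
                .map(header -> quote(row.get(header), delimiter))
                .collect(Collectors.joining(String.valueOf(delimiter)));
    }

    // Quote a value only when it contains the delimiter, a double quote, or a line break.
    static String quote(Object value, char delimiter) {
        if (value == null) {
            return "";
        }
        String s = value.toString();
        boolean needsQuotes = s.indexOf(delimiter) >= 0 || s.indexOf('"') >= 0
                || s.indexOf('\n') >= 0 || s.indexOf('\r') >= 0;
        return needsQuotes ? "\"" + s.replace("\"", "\"\"") + "\"" : s;
    }
}
```

In a Spring Batch job the header list would have to be known before the writer opens the file, for example by collecting it in an earlier step. How best to wire that into the job is left open here.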

Related

How to convert JSON to AVRO GenericRecord in Java

I am building a tool in an Apache Beam pipeline which will ingest lots of different types of data (different Schemas, different filetypes, etc.) and will output the results as Avro files. Because there are many different types of output schemas, I'm using GenericRecords to write the Avro data. These GenericRecords include schemas generated during ingestion for each unique file / schema layout. In general, I have been using the built in Avro Schema class to handle these.
I tried using DecoderFactory to convert the Json data to Avro
DecoderFactory decoderFactory = new DecoderFactory();
Decoder decoder = decoderFactory.jsonDecoder(schema, content);
DatumReader<GenericData.Record> reader = new GenericDatumReader<>(schema);
return reader.read(null, decoder);
This works fine for the most part, except when the schema has nullable fields. The data is read from JSON, which does not include typed fields, whereas the generated Schema records whether each field is nullable or required. This causes a problem when the data is written to Avro:
If I have a nullable record that looks like this:
{"someField": "someValue"}
Avro is expecting the JSON data to look like this:
{"someField": {"string": "someValue"}}
This presents a problem any time this combination appears (which is very frequent).
One suggested solution was to use an AvroMapper. I followed the example on that page: I wrapped the Schema object in an AvroSchema and serialized the data into a byte array with that schema using AvroMapper.writer():
// mapper is an AvroMapper instance declared elsewhere (not shown)
static GenericRecord convertJsonToGenericRecord(String content, Schema schema)
        throws IOException {
    JsonNode node = ObjectMappers.defaultObjectMapper().readTree(content);
    AvroSchema avroSchema = new AvroSchema(schema);
    byte[] avroData =
            mapper
                    .writer(avroSchema)
                    .writeValueAsBytes(node);
    return mapper.readValue(avroData, GenericRecord.class);
}
This might get around the typing problem with nullable records, but it fails because the AvroSchema is not recognized as being part of the byte array that I pass in (avroData). Here is the stack trace:
com.fasterxml.jackson.core.JsonParseException: No AvroSchema set, can not parse
at com.fasterxml.jackson.dataformat.avro.deser.MissingReader._checkSchemaSet(MissingReader.java:68)
at com.fasterxml.jackson.dataformat.avro.deser.MissingReader.nextToken(MissingReader.java:41)
at com.fasterxml.jackson.dataformat.avro.deser.AvroParserImpl.nextToken(AvroParserImpl.java:97)
at com.fasterxml.jackson.databind.ObjectMapper._initForReading(ObjectMapper.java:4762)
at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:4668)
at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3691)
When I checked the avroData byte array just to see what it looked like, it did not include anything other than the actual value I passed into it. It didn't include the schema, and it didn't even include the header or key. For the test, I'm using a single K/V pair as in the example above, and all I got back was the value.
An alternative route that I may pursue if this doesn't work is to manually format the JSON data as it comes in, but this is messy, and will require lots of recursion. I'm 99% sure that I can get it working that way, but would love to avoid it if at all possible.
To reiterate, what I'm trying to do is package incoming JSON-formatted data (string, byte array, node, whatever) with an Avro Schema to create GenericRecords which will be output to .avro files. I need to find a way to ingest the data and Schema such that it will allow for nullable records to be untyped in the JSON-string.
Thank you for your time, and don't hesitate to ask clarifying questions.
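The manual reformatting route mentioned above could look roughly like the following for a flat record. This is only a sketch under simplifying assumptions: the record is already parsed into a Map, and the nullable fields are described by a hypothetical map from field name to Avro branch name (for example "string") instead of a real Avro Schema. It does not use the Avro or Jackson APIs, and nested records would still need the recursion described above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; a simplified stand-in for walking a real Avro Schema.
final class AvroUnionJsonSketch {

    // nullableTypes maps a field name to the non-null branch of its ["null", T] union.
    static Map<String, Object> wrapNullableFields(Map<String, Object> record,
                                                  Map<String, String> nullableTypes) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : record.entrySet()) {
            String branch = nullableTypes.get(entry.getKey());
            Object value = entry.getValue();
            if (branch == null || value == null) {
                // Non-union fields and null values are written unchanged.
                out.put(entry.getKey(), value);
            } else {
                // A non-null union value becomes {"<branch>": value}.
                Map<String, Object> wrapped = new LinkedHashMap<>();
                wrapped.put(branch, value);
                out.put(entry.getKey(), wrapped);
            }
        }
        return out;
    }
}
```

Serializing the returned map back to JSON would give the {"someField": {"string": "someValue"}} shape that the JSON decoder expects.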

Best approach to create a csv "file" and save it to database as bytearray/blob?

I am trying to write a small spring boot application, which needs to do the following:
--Create and Save:
-Query things from database (List of result sets, to be converted to pojo)
-Convert that pojo to csv (without the actual file)
-Save that csv to database as blob/bytea (postgres).
--Retrieve:
-Get the bytea from db, send that as part of json response from a GET endpoint
I have managed to create the CSV via OpenCSV using StatefulBeanToCsv, but since this uses a PrintWriter, I am worried about running into memory issues when processing large files.
Are there any alternatives that meet the requirements above?
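For reference, producing the CSV bytes without a file on disk can be sketched with only the standard library, by writing through a Writer into an in-memory buffer. The class name CsvBytesSketch and the List<String[]> row representation are assumptions for illustration, and the sketch does no quoting or escaping. Note that the whole CSV still lives in memory here, so this shows the "no actual file" part and does not by itself address the memory concern.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper; rows are assumed to be pre-formatted cell values.
final class CsvBytesSketch {

    // Writes each row as one comma-separated line into a byte array, encoded as UTF-8.
    static byte[] toCsvBytes(List<String[]> rows) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(buffer, StandardCharsets.UTF_8)) {
            for (String[] row : rows) {
                writer.write(String.join(",", row));
                writer.write("\n");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The writer is closed (and therefore flushed) before the bytes are read.
        return buffer.toByteArray();
    }
}
```

The resulting byte[] could then be bound to a bytea column. For very large exports, streaming into the database instead of building one array may be worth investigating.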

Create XML based on XSD from data in DB

This is the first time I have worked with XML, so maybe this is a very easy problem, but I would like to ask what the best way is to create XML filled with data from a DB when I know the schema.
Of course it is possible to do it manually, but I would like to do something like this:
create a configuration file that specifies the column name, XPath, and default value (used if the DB column is not populated), and generate the XML for the known schema based on this configuration file.
Is there a tool in Java that would allow me to do something like this?
MagicXMLTool tool = new MagicXMLTool(mySchema.xsd);
tool.set("some/xpath/",value);
tool.set("another/xpath",anotherValue);
String xml = tool.generateXML();
Thanks a lot!

How to access SparkContext on executors to save DataFrame to Cassandra?

How can I use SparkContext (to create SparkSession or Cassandra Sessions) on executors?
If I pass it as a parameter to the foreach or foreachPartition, then it will have a null value. Shall I create a new SparkContext in each executor?
What I'm trying to do is as follows:
Read a dump directory with millions of XML files:
dumpFiles = Directory.listFiles(dumpDirectory)
dumpFilesRDD = sparkContext.parallelize(dumpFiles, numOfSlices)
dumpFilesRDD.foreachPartition(dumpFilePath->parse(dumpFilePath))
In parse(), every XML file is validated, parsed and inserted into several tables using Spark SQL. Only valid XML files will present objects of same type that can be saved. Portion of the data needs to be replaced by other keys before being inserted into one of the tables.
In order to do that, SparkContext is needed in the function parse to use sparkContext.sql().

If I'd rephrase your question, what you want is to:
Read a directory with millions of XML files
Parse them
Insert them into a database
That's a typical Extract, Transform and Load (ETL) process, which is terribly easy in Spark SQL.
Loading XML files can be done using a separate package, spark-xml:
spark-xml A library for parsing and querying XML data with Apache Spark, for Spark SQL and DataFrames. The structure and test tools are mostly copied from CSV Data Source for Spark.
You can "install" the package using --packages command-line option:
$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-xml_2.11:0.4.1
Quoting spark-xml's Scala API (with some changes to use SparkSession instead):
// Step 1. Loading XML files
val path = "the/path/to/millions/files/*.xml"
val spark: SparkSession = ???
val files = spark.read
.format("com.databricks.spark.xml")
.option("rowTag", "book")
.load(path)
That makes the first requirement almost a no-brainer. Your million XML files are taken care of by Spark SQL.
Step 2 is about parsing the lines (from the XML files) and marking rows to be saved to appropriate tables.
// Step 2. Transform them (using parse)
def parse(line: String) = ???
val parseDF = files.map { line => parse(line) }
Your parse function could return something (as the main result) and the table that something should be saved to.
With the table markers, you split the parseDF into DataFrames per table.
val table1DF = parseDF.filter($"table" === "table1")
And so on (per table).
// Step 3. Insert into DB
table1DF.write.option(...).jdbc(...)
That's just a sketch of what you may really be after, but that's the general pattern to follow. Decompose your pipeline into digestible chunks and tackle one chunk at a time.

It is important to keep in mind that in Spark we are not supposed to program in terms of executors.
In the Spark programming model, your driver program is mostly a self-contained program in which certain sections are automatically converted to a physical execution plan, and ultimately into a bunch of tasks distributed across workers/executors.
When you need to execute something for each partition, you can use something like mapPartitions(). Refer to "Spark: DB connection per Spark RDD partition and do mapPartition" for further details. Pay attention to how the dbConnection object is enclosed in the function body.
It is not clear what you mean by a parameter. If it is just data (not a DB connection or similar), I think you need to use a broadcast variable.

How to validate values in a YAML configuration file while loading it?

Is there a way to validate values in a YAML file while loading it in code? The requirement is that some elements in the YAML file must have values. If the validation fails, the YAML should not be loaded.
I'm using the snakeyaml library and have heard there is a way to do this via a Representer.
This is the code I'm currently using to load the YAML:
Reader in = new InputStreamReader(Files.newInputStream(file), StandardCharsets.UTF_8);
Yaml yaml = new Yaml();
yaml.setBeanAccess(BeanAccess.FIELD);
return yaml.loadAs(in, School.class);

Since you can have any value in a YAML file, you should load the file in a function, test the values and raise an error if the values are not what you want. Return the loaded data if they are.
This may have side effects if your YAML has tags that create arbitrary objects, but checking during loading will not prevent that, as such objects might already have been created before you reach the value you want to check.
If you do have tags in your YAML and that is a real problem, then you would have to make a safe_load-er for the YAML file that can handle the tags (by creating normal mapping objects), then check the values and reload with full tag support.
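The load-then-check approach described above can be sketched as follows. The School fields (name, city) and the ValidatingLoader class are hypothetical, since the question does not show them, and the actual YAML loading is passed in as a Supplier so that the sketch does not depend on any particular snakeyaml call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical bean; the real School class from the question is not shown.
final class School {
    String name;
    String city;
}

// Hypothetical helper that loads first and validates afterwards.
final class ValidatingLoader {

    // Runs the loader, then rejects the result if any required value is missing.
    static School loadValidated(Supplier<School> loader) {
        School school = loader.get();
        if (school == null) {
            throw new IllegalStateException("YAML document is empty");
        }
        List<String> missing = new ArrayList<>();
        if (isBlank(school.name)) {
            missing.add("name");
        }
        if (isBlank(school.city)) {
            missing.add("city");
        }
        if (!missing.isEmpty()) {
            throw new IllegalStateException("Missing required values: " + missing);
        }
        return school;
    }

    private static boolean isBlank(String value) {
        return value == null || value.trim().isEmpty();
    }
}
```

With the code from the question, the call would presumably look like loadValidated(() -> yaml.loadAs(in, School.class)), so that callers only ever see a School that passed the checks.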
