How to write union when creating Avro file in Java

I'm trying to create an Avro file in Java (just testing code at the moment). Everything works fine; the code looks about like this:
GenericRecord record = new GenericData.Record(schema);
File file = new File("test.avro");
DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema);
DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(datumWriter);
dataFileWriter.create(schema, file);
dataFileWriter.append(record);
dataFileWriter.close();
The problem I'm facing now is: what kind of Java object do I instantiate when I want to write a union? Not necessarily at the top level; it could be a union attached to a record being written. There are a few objects prepared for complex types, like GenericData.Record, GenericData.Array etc. For those that are not, the right object is usually a standard Java object (classes implementing java.util.Map for the "map" Avro type, etc.).
But I cannot figure out what the right object to instantiate for writing a union is.
This question refers to writing an Avro file WITHOUT code generation. Any help is very much appreciated.

Here's what I did:
Suppose the schema is defined like this:
record MyStructure {
    ...
    record MySubtype {
        int p1;
    }
    union {null, MySubtype} myField = null;
    ...
}
And this is the Java code:
Schema schema; // the schema of the main structure
// ....
GenericRecord rec = new GenericData.Record(schema);
// the field's schema is the union; find the index of the MySubtype branch
int i = schema.getField("myField").schema().getIndexNamed("MySubtype");
GenericRecord myField = new GenericData.Record(schema.getField("myField").schema().getTypes().get(i));
myField.put("p1", 100);
rec.put("myField", myField);

Related

Spring Batch creating multiple files (Gradle-based project)

I need to create 3 separate files.
My Batch job should read from Mongo, then parse the information and find the "business" column (3 types of business: RETAIL, HPP, SAX), then create a file for each respective business. The file name should be RETAIL + formattedDate, HPP + formattedDate, or SAX + formattedDate, and the information found in the DB should go into a txt file. Also, I need to turn the .resource(new FileSystemResource("C:\\filewriter\\index.txt")) into something that will send the information to the right location; right now hard-coding works, but it only creates one .txt file.
example:
@Bean
public FlatFileItemWriter<PaymentAudit> writer() {
    LOG.debug("Mongo-writer");
    FlatFileItemWriter<PaymentAudit> flatFile = new FlatFileItemWriterBuilder<PaymentAudit>()
            .name("flatFileItemWriter")
            // trying to create a path instead of hard coding it
            .resource(new FileSystemResource("C:\\filewriter\\index.txt"))
            .lineAggregator(createPaymentPortalLineAggregator())
            .build();
    String exportFileHeader = "CREATE_DTTM";
    StringHeaderWriter headerWriter = new StringHeaderWriter(exportFileHeader);
    flatFile.setHeaderCallback(headerWriter);
    return flatFile;
}
My idea would be something like the following, but I'm not sure where to go from here:
public Map<String, List<PaymentAudit>> getPaymentPortalRecords() {
    List<PaymentAudit> recentlyCreated =
            PaymentPortalRepository.findByCreateDttmBetween(yesterdayMidnight, yesterdayEndOfDay);
    List<PaymentAudit> retailList = new ArrayList<>();
    List<PaymentAudit> saxList = new ArrayList<>();
    List<PaymentAudit> hppList = new ArrayList<>();
    // String exportFilePath = "C://filewriter/"; ??????
    recentlyCreated.parallelStream().forEach(paymentAudit -> {
        if (paymentAudit.getBusiness().equalsIgnoreCase(RETAIL)) {
            retailList.add(paymentAudit);
        } else if (paymentAudit.getBusiness().equalsIgnoreCase(SAX)) {
            saxList.add(paymentAudit);
        } else if (paymentAudit.getBusiness().equalsIgnoreCase(HPP)) {
            hppList.add(paymentAudit);
        }
    });
To create a file for each business object type, you can use a ClassifierCompositeItemWriter. In your case, you can create a writer for each type and add them as delegates in the composite item writer.
As for creating the filename dynamically, you need to use a step-scoped writer. There is an example in the Step Scope section of the reference documentation.
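A minimal sketch of that idea (the @StepScope bean, the formattedDate job parameter, and the createWriterFor helper are assumptions, not part of the original question):
import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.item.file.FlatFileItemWriter;
import org.springframework.batch.item.support.ClassifierCompositeItemWriter;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;

@Bean
@StepScope
public ClassifierCompositeItemWriter<PaymentAudit> classifierWriter(
        @Value("#{jobParameters['formattedDate']}") String formattedDate) {
    // one delegate writer per business type; createWriterFor is a hypothetical helper
    // that builds a FlatFileItemWriter pointing at e.g. "RETAIL" + formattedDate + ".txt"
    FlatFileItemWriter<PaymentAudit> retailWriter = createWriterFor("RETAIL", formattedDate);
    FlatFileItemWriter<PaymentAudit> hppWriter = createWriterFor("HPP", formattedDate);
    FlatFileItemWriter<PaymentAudit> saxWriter = createWriterFor("SAX", formattedDate);

    ClassifierCompositeItemWriter<PaymentAudit> writer = new ClassifierCompositeItemWriter<>();
    writer.setClassifier(paymentAudit -> {
        switch (paymentAudit.getBusiness().toUpperCase()) {
            case "RETAIL": return retailWriter;
            case "HPP":    return hppWriter;
            default:       return saxWriter;
        }
    });
    return writer;
}
Keep in mind that ClassifierCompositeItemWriter does not open or close its delegates, so each delegate writer also needs to be registered as a stream on the step (or opened and closed manually).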
Hope this helps.

Deserialize Avro messages into specific datum using KafkaAvroDecoder

I'm reading from a Kafka topic, which contains Avro messages serialized using the KafkaAvroEncoder (which automatically registers the schemas with the topics). I'm using the maven-avro-plugin to generate plain Java classes, which I'd like to use upon reading.
The KafkaAvroDecoder only supports deserializing into GenericData.Record types, which (in my opinion) misses the whole point of having a statically typed language. My deserialization code currently looks like this:
SpecificDatumReader<event> reader = new SpecificDatumReader<>(
        event.getClassSchema() // event is my class generated from the schema
);
byte[] in = ...; // my input bytes
ByteBuffer stuff = ByteBuffer.wrap(in);
// the KafkaAvroEncoder puts a magic byte and the ID of the schema (as stored
// in the schema-registry) before the serialized message
if (stuff.get() != 0x0) {
    return;
}
int id = stuff.getInt();
// let's just ignore those special bytes
int length = stuff.limit() - 4 - 1;
int start = stuff.position() + stuff.arrayOffset();
Decoder decoder = DecoderFactory.get().binaryDecoder(
        stuff.array(), start, length, null
);
try {
    event ev = reader.read(null, decoder);
} catch (IOException e) {
    e.printStackTrace();
}
I found my solution cumbersome, so I'd like to know if there is a simpler solution to do this.
Thanks to the comment, I was able to find the answer. The secret was to instantiate KafkaAvroDecoder with a Properties object specifying the use of the specific Avro reader, that is:
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "...");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
        io.confluent.kafka.serializers.KafkaAvroSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
        io.confluent.kafka.serializers.KafkaAvroSerializer.class);
props.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, "...");
props.put(KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG, true);
VerifiableProperties vProps = new VerifiableProperties(props);
KafkaAvroDecoder decoder = new KafkaAvroDecoder(vProps);
MyLittleData data = (MyLittleData) decoder.fromBytes(input);
The same configuration applies when using the KafkaConsumer<K, V> class directly. (I'm consuming from Kafka in Storm using the KafkaSpout from the storm-kafka project, which uses the SimpleConsumer, so I have to deserialize the messages manually. For the courageous, there is the storm-kafka-client project, which does this automatically by using the new-style consumer.)
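For reference, here is a minimal sketch of the KafkaConsumer<K, V> variant with the Confluent deserializer; the bootstrap servers, group id, topic name and String keys are placeholders, and MyLittleData again stands for the generated class:
import io.confluent.kafka.serializers.AbstractKafkaAvroSerDeConfig;
import io.confluent.kafka.serializers.KafkaAvroDeserializer;
import io.confluent.kafka.serializers.KafkaAvroDeserializerConfig;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.util.Collections;
import java.util.Properties;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, KafkaAvroDeserializer.class);
props.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, "http://localhost:8081");
// this flag makes the deserializer return the generated class instead of GenericData.Record
props.put(KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG, true);

KafkaConsumer<String, MyLittleData> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("my-topic"));
ConsumerRecords<String, MyLittleData> records = consumer.poll(1000);
for (ConsumerRecord<String, MyLittleData> record : records) {
    MyLittleData data = record.value();
    // process data
}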

java initialize object from file

I am currently writing a program that deals with graphs created by the jgrapht library. I have multiple graphs of the form:
UndirectedGraph<Integer, DefaultEdge> g_x = new SimpleGraph<Integer, DefaultEdge>(DefaultEdge.class);
g_x.addVertex(1);
g_x.addVertex(2);
g_x.addVertex(3);
g_x.addEdge(1, 2);
g_x.addEdge(2, 4);
...
which are constant graphs associated with street maps that I am given as files. Right now I have all of my graphs declared in my main method and just reference the graph I want when a map is loaded. What I would like to do is have another file paired with each map (i.e. map1.map and map1.graph) so that when I load the map from a file I can also load the graph like:
map = loadMap(mapName);
g_x = loadGraph(mapName);
where mapName is the file name prefix, without having to store the graph in my source code. Is it possible to do this in Java, and if so, how would I create the files and load them? Would it also be possible to do this with a generic Object?
One option is to serialize your objects to XML or JSON (you could change the .xml extension to .map if you really wanted). Then you can open the XML in your code for each object you wish to load.
Serializing:
File file = new File(**filename**);
FileOutputStream out = new FileOutputStream(file);
XStream xmlStream = new XStream(new DomDriver());
out.write(xmlStream.toXML(**ObjectToSave**).getBytes());
out.close();
Deserializing:
try {
    XStream xmlStream = new XStream(new DomDriver());
    state = (**ClassNameYouWishToSave**) xmlStream.fromXML(new FileInputStream(**filename**));
} catch (IOException e) {
    e.printStackTrace();
}
You will need these imports:
import com.thoughtworks.xstream.XStream;
import com.thoughtworks.xstream.io.xml.DomDriver;
It is a simplistic way to do it, but it works. Hope it helps.
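Applied to the loadGraph(mapName) call from the question, a minimal sketch could look like the following; the mapName + ".graph" naming convention and the helper itself are assumptions:
import com.thoughtworks.xstream.XStream;
import com.thoughtworks.xstream.io.xml.DomDriver;
import org.jgrapht.UndirectedGraph;
import org.jgrapht.graph.DefaultEdge;
import java.io.FileInputStream;
import java.io.FileNotFoundException;

// hypothetical helper matching loadGraph(mapName) from the question;
// assumes the graph was previously written to mapName + ".graph" with xmlStream.toXML(...)
@SuppressWarnings("unchecked")
static UndirectedGraph<Integer, DefaultEdge> loadGraph(String mapName) throws FileNotFoundException {
    XStream xmlStream = new XStream(new DomDriver());
    return (UndirectedGraph<Integer, DefaultEdge>) xmlStream.fromXML(new FileInputStream(mapName + ".graph"));
}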

Generate Avro Schema from certain Java Object

Apache Avro provides a compact, fast, binary data format and rich data structures for serialization. However, it requires the user to define a schema (in JSON) for the object that needs to be serialized.
In some cases this is not possible (e.g. the class of that Java object has some members whose types are external Java classes from external libraries). Hence, I wonder whether there is a tool that can read an object's .class file and generate the Avro schema for that object (like Gson uses a class's information to convert objects of that class to JSON strings).
Take a look at Avro's reflect API (org.apache.avro.reflect.ReflectData), which builds a schema via Java reflection.
Getting a schema looks like:
Schema schema = ReflectData.get().getSchema(T);
See the answer from Doug on another question for a working example.
Credits of this answer belong to Sean Busby.
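For concreteness, a minimal sketch of that call (MyPojo is a placeholder for whatever class you want a schema for):
import org.apache.avro.Schema;
import org.apache.avro.reflect.ReflectData;

Schema schema = ReflectData.get().getSchema(MyPojo.class);
System.out.println(schema.toString(true)); // pretty-printed JSON schema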
Here's how to generate an Avro schema from a POJO definition, using Jackson's jackson-dataformat-avro module:
// these classes come from the jackson-dataformat-avro module
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.dataformat.avro.AvroFactory;
import com.fasterxml.jackson.dataformat.avro.AvroSchema;
import com.fasterxml.jackson.dataformat.avro.schema.AvroSchemaGenerator;

ObjectMapper mapper = new ObjectMapper(new AvroFactory());
AvroSchemaGenerator gen = new AvroSchemaGenerator();
mapper.acceptJsonFormatVisitor(RootType.class, gen);
AvroSchema schemaWrapper = gen.getGeneratedSchema();
org.apache.avro.Schema avroSchema = schemaWrapper.getAvroSchema();
String asJson = avroSchema.toString(true);
Example
Pojo class
public class ExportData implements Serializable {
    private String body;
    // ... getters and setters
}
Serialize
File file = new File(fileName);
DatumWriter<ExportData> writer = new ReflectDatumWriter<>(ExportData.class);
DataFileWriter<ExportData> dataFileWriter = new DataFileWriter<>(writer);
Schema schema = ReflectData.get().getSchema(ExportData.class);
dataFileWriter.create(schema, file);
for (Row row : resultSet) {
    String rec = row.getString(0);
    dataFileWriter.append(new ExportData(rec));
}
dataFileWriter.close();
Deserialize
File file = new File(avroFilePath);
DatumReader<ExportData> datumReader = new ReflectDatumReader<>(ExportData.class);
DataFileReader<ExportData> dataFileReader = new DataFileReader<>(file, datumReader);
ExportData record = null;
while (dataFileReader.hasNext()) {
    record = dataFileReader.next(record);
    // process record
}

WEKA: Classify instances with a deserialized model

I used Weka Explorer:
Loaded the arff file
Applied StringToWordVector filter
Selected IBk as the best classifier
Generated/Saved my_model.model binary
In my Java code I deserialize the model:
URL curl = ClassUtility.findClasspathResource( "models/my_model.model" );
final Classifier cls = (Classifier) weka.core.SerializationHelper.read( curl.openConnection().getInputStream() );
Now I have the classifier, BUT I also need the information about the filter. Where I am stuck is: how do I prepare an instance to be classified by my deserialized model, i.e. how do I apply the filter before classification? (The raw instance that I have to classify has a text field with tokens in it; the filter was supposed to transform that into a set of new attributes.)
I even tried to use a FilteredClassifier where I set the classifier to the deserialized one and the filter to a manually created instance of StringToWordVector:
final StringToWordVector filter = new StringToWordVector();
filter.setOptions(new String[]{"-C", "-P x_", "-L"});
FilteredClassifier fcls = new FilteredClassifier();
fcls.setFilter(filter);
fcls.setClassifier(cls);
The above does not work either. It throws the exception:
Exception in thread "main" java.lang.NullPointerException: No output instance format defined
What I am trying to avoid is doing the training in the Java code. It can be very slow, I may well end up with multiple classifiers to train (different algorithms as well), and I want my app to start fast.
Your problem is that your model doesn't know anything about what the filter did to the data. The StringToWordVector filter changes the data, and it does so in a way that depends on the input (training) data. A model trained on this transformed data set will only work on data that underwent the exact same transformation. To guarantee this, the filter needs to be part of your model.
Using a FilteredClassifier is the correct idea, but you have to use it from the beginning:
Load the ARFF file
Select FilteredClassifier as classifier
Select StringToWordVector as filter for it
Select IBk as classifier for the FilteredClassifier
Generate/Save the model to my_model.binary
The trained and serialized model will then also contain the initialized filter, including the information on how to transform data.
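Once the model has been re-saved as a FilteredClassifier this way, classifying a raw instance from Java only requires rebuilding a dataset header that matches the training ARFF. A rough sketch, assuming a single string attribute plus a nominal class; the attribute names and class labels below are placeholders and must match your training header:
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;
import java.util.ArrayList;
import java.util.List;

// header matching the training data: one string attribute plus the nominal class
ArrayList<Attribute> attributes = new ArrayList<>();
attributes.add(new Attribute("text", (List<String>) null)); // null value list = string attribute
ArrayList<String> classLabels = new ArrayList<>();
classLabels.add("yes"); // placeholder labels, must match the training data
classLabels.add("no");
attributes.add(new Attribute("class", classLabels));

Instances unlabeled = new Instances("toClassify", attributes, 1);
unlabeled.setClassIndex(unlabeled.numAttributes() - 1);

DenseInstance instance = new DenseInstance(unlabeled.numAttributes());
instance.setDataset(unlabeled);
instance.setValue(unlabeled.attribute(0), "the raw text with tokens to classify");
unlabeled.add(instance);

// cls is the deserialized FilteredClassifier from above
double prediction = cls.classifyInstance(unlabeled.firstInstance());
String predictedLabel = unlabeled.classAttribute().value((int) prediction);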
Another way to do this is to apply the same filter to your testing data as the one used on the training data. I describe the procedure in detail below. In your case you just need to follow the steps after loading your serialized classifier.
Create your training file (e.g training.arff)
Create Instances from training file. Instances trainingData = ..
Use StringToWordVector to transform your string attributes into a numeric representation:
sample code:
StringToWordVector filter = new StringToWordVector();
filter.setWordsToKeep(1000000);
if (useIdf) {
    filter.setIDFTransform(true);
}
filter.setTFTransform(true);
filter.setLowerCaseTokens(true);
filter.setOutputWordCounts(true);
filter.setMinTermFreq(minTermFreq);
filter.setNormalizeDocLength(new SelectedTag(StringToWordVector.FILTER_NORMALIZE_ALL, StringToWordVector.TAGS_FILTER));
NGramTokenizer t = new NGramTokenizer();
t.setNGramMaxSize(maxGrams);
t.setNGramMinSize(minGrams);
filter.setTokenizer(t);
WordsFromFile stopwords = new WordsFromFile();
stopwords.setStopwords(new File("data/stopwords/stopwords.txt"));
filter.setStopwordsHandler(stopwords);
if (useStemmer) {
    Stemmer s = new /*Iterated*/LovinsStemmer();
    filter.setStemmer(s);
}
filter.setInputFormat(trainingData);
Apply the filter to trainingData: trainingData = Filter.useFilter(trainingData, filter);
Select a classifier to create your model
sample code for LibLinear classifier
Classifier cls = null;
LibLINEAR liblinear = new LibLINEAR();
liblinear.setSVMType(new SelectedTag(0, LibLINEAR.TAGS_SVMTYPE));
liblinear.setProbabilityEstimates(true);
// liblinear.setBias(1); // default value
cls = liblinear;
cls.buildClassifier(trainingData);
Save model
sample code
System.out.println("Saving the model...");
ObjectOutputStream oos;
oos = new ObjectOutputStream(new FileOutputStream(path+"mymodel.model"));
oos.writeObject(cls);
oos.flush();
oos.close();
Create a testing file (e.g. testing.arff)
Create Instances from the testing file: Instances testingData = ...
Load classifier
sample code
Classifier myCls = (Classifier) weka.core.SerializationHelper.read(path+"mymodel.model");
Use the same StringToWordVector filter as above or create a new one for testingData, but remember to use the trainingData for this command: filter.setInputFormat(trainingData); This will keep the format of the training set and will not add words that are not in the training set.
Apply the filter to testingData: testingData = Filter.useFilter(testingData, filter);
Classify!
sample code
for (int j = 0; j < testingData.numInstances(); j++) {
    double res = myCls.classifyInstance(testingData.get(j));
}
