Reading/writing an Avro file in Spark core using Java

I need to access Avro file data in a program written in Java on Spark core. I can use the MapReduce InputFormat class, but it gives me a tuple containing each line of the file as a key, which is very hard to parse since I am not using Scala.
JavaPairRDD<AvroKey<GenericRecord>, NullWritable> avroRDD = sc.newAPIHadoopFile("dataset/testfile.avro", AvroKeyInputFormat.class, AvroKey.class, NullWritable.class, new Configuration());
Is there any utility class or jar available that I can use to map Avro data directly to Java classes, the way the codehaus.jackson package maps JSON to Java classes?
Otherwise, is there any other method to easily parse the fields of an Avro file into Java classes or RDDs?

Consider that your Avro file contains serialized pairs, with the key being a String and the value being an Avro class. Then you could have a generic static function in some Utils class that looks like this:
public class Utils {
    public static <T> JavaPairRDD<String, T> loadAvroFile(JavaSparkContext sc, String avroPath) {
        JavaPairRDD<AvroKey, NullWritable> records = sc.newAPIHadoopFile(avroPath, AvroKeyInputFormat.class, AvroKey.class, NullWritable.class, sc.hadoopConfiguration());
        return records.keys()
                .map(x -> (GenericRecord) x.datum())
                .mapToPair(record -> new Tuple2<>((String) record.get("key"), (T) record.get("value")));
    }
}
And then you could use the method this way:
JavaPairRDD<String, YourAvroClassName> records = Utils.<YourAvroClassName>loadAvroFile(sc, inputDir);
You might also need to use KryoSerializer and register your custom KryoRegistrator:
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
sparkConf.set("spark.kryo.registrator", "com.test.avro.MyKryoRegistrator");
And the registrator class would look this way:
public class MyKryoRegistrator implements KryoRegistrator {

    public static class SpecificInstanceCollectionSerializer<T extends Collection> extends CollectionSerializer {
        Class<T> type;

        public SpecificInstanceCollectionSerializer(Class<T> type) {
            this.type = type;
        }

        @Override
        protected Collection create(Kryo kryo, Input input, Class<Collection> type) {
            return kryo.newInstance(this.type);
        }

        @Override
        protected Collection createCopy(Kryo kryo, Collection original) {
            return kryo.newInstance(this.type);
        }
    }

    Logger logger = LoggerFactory.getLogger(this.getClass());

    @Override
    public void registerClasses(Kryo kryo) {
        // Avro POJOs contain java.util.List fields which have GenericData.Array as their runtime type;
        // because Kryo is not able to serialize them properly, we use this serializer for them
        kryo.register(GenericData.Array.class, new SpecificInstanceCollectionSerializer<>(ArrayList.class));
        kryo.register(YourAvroClassName.class);
    }
}
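Putting the pieces together, a minimal driver sketch might look like the following (YourAvroClassName, MyKryoRegistrator and the app name are placeholders from the snippets above; the overall wiring is my assumption, not part of the original answer):
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class AvroLoaderJob {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("avro-loader");
        // Register Kryo so Avro-generated classes serialize correctly between stages.
        sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
        sparkConf.set("spark.kryo.registrator", "com.test.avro.MyKryoRegistrator");

        JavaSparkContext sc = new JavaSparkContext(sparkConf);
        String inputDir = args[0];
        JavaPairRDD<String, YourAvroClassName> records =
                Utils.<YourAvroClassName>loadAvroFile(sc, inputDir);
        System.out.println("Loaded " + records.count() + " records");
        sc.stop();
    }
}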

Related

Loading an abstract class based object by YAML file

I want to load an object that contains an ArrayList of objects based on an abstract class from a YAML file, and I get this error message:
Exception in thread "LWJGL Application" Cannot create property=arrayListOfAbstractObjects for JavaBean=com.myyaml.test.ImplementationOfExampleClass@7a358cc1
in 'reader', line 1, column 1:
dummyLong: 1
^
java.lang.InstantiationException
in 'reader', line 3, column 3:
- dummyFloat: 444
^
YAML file
dummyLong: 1
arrayListOfAbstractObjects:
- dummyFloat: 444
- dummyDouble: 123
Java classes:
public abstract class ExampleClass {
    protected ArrayList<AbstractClass> arrayListOfAbstractObjects;
    protected long dummyLong = 111;

    public ExampleClass() {
    }

    public void setArrayListOfAbstractObjects(ArrayList<AbstractClass> arrayListOfAbstractObjects) {
        this.arrayListOfAbstractObjects = arrayListOfAbstractObjects;
    }

    public void setDummyLong(long dummyLong) {
        this.dummyLong = dummyLong;
    }
}

public class ImplementationOfExampleClass extends ExampleClass {
    public ImplementationOfExampleClass() {
    }
}

public abstract class AbstractClass {
    private int dummyInt = 22;

    public AbstractClass() {
    }

    public void setDummyInt(int dummyInt) {
        this.dummyInt = dummyInt;
    }
}

public class FirstImplementationOfAbstractClass extends AbstractClass {
    float dummyFloat = 111f;

    public FirstImplementationOfAbstractClass() {
    }

    public void setDummyFloat(float dummyFloat) {
        this.dummyFloat = dummyFloat;
    }
}

public class SecondImplementationOfAbstractClass extends AbstractClass {
    double dummyDouble = 333f;

    public SecondImplementationOfAbstractClass() {
    }

    public void setDummyDouble(double dummyDouble) {
        this.dummyDouble = dummyDouble;
    }
}
My guess is that YAML doesn't know which implementation of the abstract class to use, FirstImplementationOfAbstractClass or SecondImplementationOfAbstractClass. Is it possible to load an object with such classes via YAML?
This is only possible if you tell the YAML processor which class you want to instantiate on the YAML side. You do this with tags:
dummyLong: 1
arrayListOfAbstractObjects:
  - !first
    dummyFloat: 444
  - !second
    dummyDouble: 123
Then, you can instruct your YAML processor to properly process the items based on their tags. E.g. with SnakeYAML, you would do
class MyConstructor extends Constructor {
    public MyConstructor() {
        this.yamlConstructors.put(new Tag("!first"), new ConstructFirst());
        this.yamlConstructors.put(new Tag("!second"), new ConstructSecond());
    }

    private class ConstructFirst extends AbstractConstruct {
        public Object construct(Node node) {
            // raw values, as if you had loaded the content into a generic map
            final Map<Object, Object> values = constructMapping(node);
            final FirstImplementationOfAbstractClass ret =
                    new FirstImplementationOfAbstractClass();
            ret.setDummyFloat(Float.parseFloat(values.get("dummyFloat").toString()));
            return ret;
        }
    }

    private class ConstructSecond extends AbstractConstruct {
        public Object construct(Node node) {
            final Map<Object, Object> values = constructMapping(node);
            final SecondImplementationOfAbstractClass ret =
                    new SecondImplementationOfAbstractClass();
            ret.setDummyDouble(Double.parseDouble(values.get("dummyDouble").toString()));
            return ret;
        }
    }
}
Note: You can be more intelligent when loading the content, avoiding toString and processing the node content directly; I use a dumb implementation for easy demonstration.
Then, you use this constructor:
Yaml yaml = new Yaml(new MyConstructor());
ExampleClass loaded = yaml.loadAs(input, ImplementationOfExampleClass.class);
The Node class is essentially the YAML file transformed into a Java data object. Under the debugger I found that it contains a field ArrayList<E> value, which holds NodeTuple entries with the YAML fields (e.g. dummyFloat). So in the constructMapping(node) call I have to convert each field on my own and then set it on the constructed object in e.g. ConstructFirst.construct(Node node).
EDIT:
A cast of the node parameter to MappingNode is needed, since the method is inherited from BaseConstructor.constructMapping(MappingNode node). Flyx didn't add that cast and I didn't know where to get it from. Thanks for the help; it works now. I'm still struggling with nested abstract classes, though. Maybe I'll need help, but I will try to handle it myself.
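For reference, a minimal sketch of ConstructFirst inside MyConstructor with that cast applied (this only adds the MappingNode cast to the answer's code above):
private class ConstructFirst extends AbstractConstruct {
    public Object construct(Node node) {
        // constructMapping is inherited from BaseConstructor as
        // constructMapping(MappingNode node), so the Node has to be cast.
        final Map<Object, Object> values = constructMapping((MappingNode) node);
        final FirstImplementationOfAbstractClass ret = new FirstImplementationOfAbstractClass();
        ret.setDummyFloat(Float.parseFloat(values.get("dummyFloat").toString()));
        return ret;
    }
}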
Also this link might be helpful:
Polymorphic collections in SnakeYaml
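As a side note (my own suggestion, not from the answer above), SnakeYAML can also map tags to classes directly via TypeDescription, so the bean properties are set automatically without hand-written constructs; a minimal sketch:
// Map the custom tags onto the concrete classes; SnakeYAML then builds the beans itself.
Constructor constructor = new Constructor(ImplementationOfExampleClass.class);
constructor.addTypeDescription(new TypeDescription(FirstImplementationOfAbstractClass.class, "!first"));
constructor.addTypeDescription(new TypeDescription(SecondImplementationOfAbstractClass.class, "!second"));
Yaml yaml = new Yaml(constructor);
ExampleClass loaded = yaml.loadAs(input, ImplementationOfExampleClass.class);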

Apache Flink - Using class with generic type parameter

How can I use a class with generic types in Flink? I run into this error:
The return type of function 'main(StreamingJob.java:63)' could not be determined automatically, due to type erasure. You can give type information hints by using the returns(...) method on the result of the transformation call, or by letting your function implement the 'ResultTypeQueryable' interface.
The class I use is of the form:
class MaybeProcessable<T> {
    private final T value;

    public MaybeProcessable(T value) {
        this.value = value;
    }

    public T get() {
        return value;
    }
}
And I am using an example Flink job like:
public static void main(String[] args) throws Exception {
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    env.addSource(new PubSubSource(PROJECT_ID, SUBSCRIPTION_NAME))
            .map(MaybeProcessable::new)
            .map(MaybeProcessable::get)
            .writeAsText("/tmp/flink-output", FileSystem.WriteMode.OVERWRITE);

    // execute program
    env.execute("Flink Streaming Java API Skeleton");
}
Now I can add a TypeInformation instance using the .returns() function:
.map(MaybeProcessable::new).returns(new MyCustomTypeInformationClass(String.class))
But this would require me writing my own serializer. Is there not an easier way to achieve this?
You can use
.returns(TypeInformation.of(new TypeHint<MaybeProcessable<CONCRETE_TYPE_HERE>>() {}))
for each re-use of a generically typed MapFunction to set the return type without creating any more custom classes of your own.
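Applied to the example job above, that might look like the following sketch (PubSubSource, PROJECT_ID, and SUBSCRIPTION_NAME come from the question; the concrete type String is my assumption about what the source emits):
env.addSource(new PubSubSource(PROJECT_ID, SUBSCRIPTION_NAME))
        .map(MaybeProcessable::new)
        // Tell Flink what the erased generic parameter actually is.
        .returns(TypeInformation.of(new TypeHint<MaybeProcessable<String>>() {}))
        .map(MaybeProcessable::get)
        // The getter also returns the erased type parameter, so it may need a hint too.
        .returns(TypeInformation.of(String.class))
        .writeAsText("/tmp/flink-output", FileSystem.WriteMode.OVERWRITE);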

org.apache.spark.SparkException: Task not serializable in java

I'm trying to use Spark GraphX. Before that, I wanted to arrange my vertex and edge RDDs using DataFrames, and for that purpose I used the JavaRDD map function, but I'm getting the above error. I tried various ways to fix this issue: I serialized the whole class, but it didn't work; I also implemented the Function and Serializable interfaces in one class and used it in the map function, but that also didn't work. Please help me. Thanks in advance.
//add long unique id for vertex dataframe and get javaRdd
JavaRDD<Row> ff = vertex_dataframe.javaRDD().zipWithIndex().map(new Function<Tuple2<Row, java.lang.Long>, Row>() {
    public Row call(Tuple2<Row, java.lang.Long> rowLongTuple2) throws Exception {
        return RowFactory.create(rowLongTuple2._1().getString(0), rowLongTuple2._2());
    }
});
I serialized the Function class like below:
public abstract class SerialiFunJRdd<T1,R> implements Function<T1, R> , java.io.Serializable{
}
I suggest you read up on serializing non-static inner classes in Java. You are creating a non-static inner class here in your map, which is not serializable even if you mark it Serializable. You have to make it static first:
JavaRDD<Row> ff = vertex_dataframe.javaRDD().zipWithIndex().map(mapFunc);

static SerialiFunJRdd<Tuple2<Row, java.lang.Long>, Row> mapFunc = new SerialiFunJRdd<Tuple2<Row, java.lang.Long>, Row>() {
    @Override
    public Row call(Tuple2<Row, java.lang.Long> rowLongTuple2) throws Exception {
        return RowFactory.create(rowLongTuple2._1().getString(0), rowLongTuple2._2());
    }
};
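As a side note (my own suggestion, not part of the answer), if your Spark version supports Java 8 lambdas, a lambda sidesteps the inner-class problem entirely, as long as it does not reference fields of the enclosing class; a minimal sketch:
JavaRDD<Row> ff = vertex_dataframe.javaRDD()
        .zipWithIndex()
        // The lambda captures no enclosing instance, so there is nothing non-serializable to drag along.
        .map(t -> RowFactory.create(t._1().getString(0), t._2()));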

De-serializing nested, generic class with gson

Using Gson, I'm trying to de-serialize a nested, generic class. The class structure looks like the following:
Wrapper object, simplified, but normally holds other properties such as statusMessage, which are returned along with the data-field from the server:
public class Response<T> {
    private List<T> data = null;

    public List<T> getData() { return this.data; }
}
Simple class, the expected output from data-field above (though as an array):
public class Language {
    public String alias;
    public String label;
}
Usage:
Type type = new TypeToken<Response<Language>>() {}.getType();
Response<Language> response = new Gson().fromJson(json, type);
List<Language> languages = response.getData();
Language l = languages.get(0);
System.out.println(l.alias); // Error occurs here
Where the json-variable is something like this.
However, when doing this, I receive the following exception (on line 3 of the last code example):
ClassCastException: com.google.gson.internal.StringMap cannot be cast to book.Language
The exception ONLY occurs when storing the data from getData() into a variable (or when used as one).
Any help would be highly appreciated.
The problem you're actually having is not directly due to Gson; it's because of how arrays and generics play together.
You'll find that you can't actually do new T[10] in a class like yours. See: How to create a generic array in Java?
You basically have two options:
Write a custom deserializer and construct the T[] array there as shown in the SO question I linked above
Use a List<T> instead, then it will simply work. If you really need to return an array, you can always just call List.toArray() in your method.
Edited from comments below:
This is a fully working example:
public class App {
    public static void main(String[] args) {
        String json = "{\"data\": [{\"alias\": \"be\",\"label\": \"vitryska\"},{\"alias\": \"vi\",\"label\": \"vietnamesiska\"},{\"alias\": \"hu\",\"label\": \"ungerska\"},{\"alias\": \"uk\",\"label\": \"ukrainska\"}]}";
        Type type = new TypeToken<Response<Language>>(){}.getType();
        Response<Language> resp = new Gson().fromJson(json, type);
        Language l = resp.getData().get(0);
        System.out.println(l.alias);
    }
}

class Response<T> {
    private List<T> data = null;

    public List<T> getData() { return this.data; }
}

class Language {
    public String alias;
    public String label;
}
Output:
be

Using Jackson ObjectMapper with Generics to POJO instead of LinkedHashMap

Using Jersey I'm defining a service like:
@Path("/studentIds")
public void writeList(JsonArray<Long> studentIds) {
    //iterate over studentIds and save them
}
Where JsonArray is:
public class JsonArray<T> extends ArrayList<T> {
    public JsonArray(String v) throws IOException {
        ObjectMapper objectMapper = new ObjectMapper(new MappingJsonFactory());
        TypeReference<ArrayList<T>> typeRef = new TypeReference<ArrayList<T>>() {};
        ArrayList<T> list = objectMapper.readValue(v, typeRef);
        for (T x : list) {
            this.add((T) x);
        }
    }
}
This works just fine, but when I do something more complicated:
@Path("/studentIds")
public void writeList(JsonArray<TypeIdentifier> studentIds) {
    //iterate over studentIds and save them by type
}
Where the Bean is a simple POJO such as
public class TypeIdentifier {
    private String type;
    private Long id;
    //getters/setters
}
The whole thing breaks horribly. It converts everything to LinkedHashMap instead of the actual object. I can get it to work if I manually create a class like:
public class JsonArrayTypeIdentifier extends ArrayList<TypeIdentifier> {
    public JsonArrayTypeIdentifier(String v) throws IOException {
        ObjectMapper objectMapper = new ObjectMapper(new MappingJsonFactory());
        TypeReference<ArrayList<TypeIdentifier>> typeRef = new TypeReference<ArrayList<TypeIdentifier>>() {};
        ArrayList<TypeIdentifier> list = objectMapper.readValue(v, typeRef);
        for (TypeIdentifier x : list) {
            this.add(x);
        }
    }
}
But I'm trying to keep this nice and generic without adding extra classes all over. Any leads on why this is happening with the generic version only?
First of all, it works with Long because that is sort of a native type, and as such the default binding for JSON integral numbers.
But as to why generic type information is not properly passed: this is most likely due to problems with the way the JAX-RS API passes the type to MessageBodyReaders and MessageBodyWriters -- passing a java.lang.reflect.Type is unfortunately not enough to convey actual generic declarations (for more info on this, read this blog entry).
One easy work-around is to create helper types like:
class MyTypeIdentifierArray extends JsonArray<TypeIdentifier> { }
and use that type -- things will "just work", since super-type generic information is always retained.
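To illustrate why that works, here is a small self-contained sketch (class names are hypothetical and Jackson 2 imports are assumed; this is not the answerer's code): the type argument of a concrete subclass survives erasure, so it can be read back and handed to Jackson explicitly instead of collapsing to LinkedHashMap.
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.ArrayList;
import java.util.List;

import com.fasterxml.jackson.databind.JavaType;
import com.fasterxml.jackson.databind.ObjectMapper;

public class SuperTypeTokenDemo {

    // Hypothetical helper type, analogous to MyTypeIdentifierArray above.
    static class TypeIdentifierList extends ArrayList<TypeIdentifier> { }

    public static void main(String[] args) throws Exception {
        // The generic superclass of the concrete subclass is not erased...
        Type superType = TypeIdentifierList.class.getGenericSuperclass();
        Type elementType = ((ParameterizedType) superType).getActualTypeArguments()[0];
        System.out.println(elementType); // prints the TypeIdentifier class

        // ...so Jackson can be told the full element type explicitly instead of
        // falling back to LinkedHashMap for an unresolved <T>.
        ObjectMapper mapper = new ObjectMapper();
        JavaType listType = mapper.getTypeFactory()
                .constructCollectionType(List.class, mapper.constructType(elementType));
        List<TypeIdentifier> ids = mapper.readValue("[{\"type\":\"a\",\"id\":1}]", listType);
        System.out.println(ids.size()); // 1
    }
}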
