Add Fields to CSV with Spark - Java

So, I have a CSV which contains spatial (latitude, longitude) and temporal (timestamp) data.
To make it useful for us, we converted the spatial information to a "geohash" and the temporal information to a "timehash".
The problem is: how do we add the geohash and timehash as fields to each row of the CSV with Spark (the data is about 200 GB)?
We tried to use JavaPairRDD and its mapToPair function, but the problem remains of how to convert back to a JavaRDD and then to CSV. So I think this was a bad solution; I'm asking for a simpler way.
Update of the question:
After @Alvaro's help, I created this Java class:
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.types.DataTypes;

public class Hash {

    public static SparkConf Spark_Config;
    public static JavaSparkContext Spark_Context;

    UDF2 geohashConverter = new UDF2<Long, Long, String>() {
        public String call(Long latitude, Long longitude) throws Exception {
            // convert here
            return "calculate_hash";
        }
    };

    UDF1 timehashConverter = new UDF1<Long, String>() {
        public String call(Long timestamp) throws Exception {
            // convert here
            return "calculate_hash";
        }
    };

    public Hash(String path) {
        SparkSession spark = SparkSession
                .builder()
                .appName("Java Spark SQL Example")
                .config("spark.master", "local")
                .getOrCreate();

        spark.udf().register("geohashConverter", geohashConverter, DataTypes.StringType);
        spark.udf().register("timehashConverter", timehashConverter, DataTypes.StringType);

        spark.read().csv(path)
                .withColumn("geohash", callUDF("geohashConverter", col("_c6"), col("_c7")))
                .withColumn("timehash", callUDF("timehashConverter", col("_c1")))
                .write().csv("C:/Users/Ahmed/Desktop/preprocess2");
    }

    public static void main(String[] args) {
        String path = "C:/Users/Ahmed/Desktop/cabs_trajectories/cabs_trajectories/green/2013";
        Hash h = new Hash(path);
    }
}
and then I get a serialization problem, which disappears when I delete write().csv().

One of the most efficient ways is to load the CSV using the Datasets API and use User Defined Functions to convert the columns you've specified. That way your data always remains structured, and you don't have to deal with tuples.
First of all, you create your User Defined Functions: geohashConverter, which takes two values (latitude and longitude), and timehashConverter, which only takes the timestamp.
UDF2 geohashConverter = new UDF2<Long, Long, String>() {
    @Override
    public String call(Long latitude, Long longitude) throws Exception {
        // convert here
        return "calculate_hash";
    }
};

UDF1 timehashConverter = new UDF1<Long, String>() {
    @Override
    public String call(Long timestamp) throws Exception {
        // convert here
        return "calculate_hash";
    }
};
Once created, you have to register them:
spark.udf().register("geohashConverter", geohashConverter, DataTypes.StringType);
spark.udf().register("timehashConverter", timehashConverter, DataTypes.StringType);
And finally, just read your CSV file and apply the User Defined Functions by calling withColumn. This will create a new column based on the User Defined Function you are calling with callUDF. callUDF always receives a String with the name of the registered UDF you want to call, and one or more Columns whose values will be passed to the UDF.
Then save your dataset by calling write().csv("path"):
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.callUDF;

spark.read().csv("/source/path")
        .withColumn("geohash", callUDF("geohashConverter", col("latitude"), col("longitude")))
        .withColumn("timehash", callUDF("timehashConverter", col("timestamp")))
        .write().csv("/path/to/save");
Hope it helped!
Update
It would be pretty helpful if you posted the code that is causing problems, because the exception says almost nothing about which part of the code is not serializable.
Anyway, from my personal experience with Spark, I think the problem is the object you are using to calculate the hashes. Bear in mind that this object has to be distributed through the cluster. If this object cannot be serialized, it will throw a Task not serializable exception. You have two options to work around it:
Implement the Serializable interface in the class you use to calculate the hash.
Create a static method that generates the hashes and call that method from the UDF (a sketch follows below).
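A minimal sketch of the second option, with a hypothetical GeoHashUtil helper (the class name and the hashing logic are placeholders, not part of your original code):

// Hypothetical helper: a static method keeps the hashing logic out of the UDF's
// closure, so nothing non-serializable has to be shipped to the executors.
public final class GeoHashUtil {
    public static String geohash(Long latitude, Long longitude) {
        // compute the real geohash here
        return "calculate_hash";
    }
}

UDF2<Long, Long, String> geohashConverter = new UDF2<Long, Long, String>() {
    @Override
    public String call(Long latitude, Long longitude) throws Exception {
        return GeoHashUtil.geohash(latitude, longitude);
    }
};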
Update 2
and then I get a serialization problem, which disappears when I delete write().csv()
It's expected behaviour. When you delete write().csv() you are executing nothing. You should know how Spark works: in this code, all the methods called before csv() are transformations. In Spark, transformations are not executed until an action such as csv(), show() or count() is called.
The problem is that you are creating and executing the Spark job in a non-serializable class (and, even worse, in a constructor!).
Creating the Spark job in a static method solves the problem. Bear in mind that your Spark code must be distributed through the cluster, and consequently it must be serializable. It worked for me and should work for you:
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.types.DataTypes;

public class Hash {

    public static void main(String[] args) {
        String path = "in/prueba.csv";

        UDF2 geohashConverter = new UDF2<Long, Long, String>() {
            public String call(Long latitude, Long longitude) throws Exception {
                // convert here
                return "calculate_hash";
            }
        };

        UDF1 timehashConverter = new UDF1<Long, String>() {
            public String call(Long timestamp) throws Exception {
                // convert here
                return "calculate_hash";
            }
        };

        SparkSession spark = SparkSession
                .builder()
                .appName("Java Spark SQL Example")
                .config("spark.master", "local")
                .getOrCreate();

        spark.udf().register("geohashConverter", geohashConverter, DataTypes.StringType);
        spark.udf().register("timehashConverter", timehashConverter, DataTypes.StringType);

        spark
                .read()
                .format("com.databricks.spark.csv")
                .option("header", "true")
                .load(path)
                .withColumn("geohash", callUDF("geohashConverter", col("_c6"), col("_c7")))
                .withColumn("timehash", callUDF("timehashConverter", col("_c1")))
                .write().csv("resultados");
    }
}

Related

Flink deduplication and processWindowFunction

I'm creating a pipeline where the inputs are JSON messages containing a timestamp field, used to set the event time. The problem is that some records could arrive late or duplicated at the system, and these situations need to be managed; to avoid duplicates I tried the following solution:
.assignTimestampsAndWatermarks(new RecordWatermark()
.withTimestampAssigner(new ExtractRecordTimestamp()))
.keyBy(new MetricGrouper())
.window(TumblingEventTimeWindows.of(Time.seconds(60)))
.trigger(ContinuousEventTimeTrigger.of(Time.seconds(3)))
.process(new WindowedFilter())
.keyBy(new MetricGrouper())
.window(TumblingEventTimeWindows.of(Time.seconds(180)))
.trigger(ContinuousEventTimeTrigger.of(Time.seconds(15)))
.process(new WindowedCountDistinct())
.map((value) -> value.toString());
where the first windowing operation filters the records based on timestamps saved in a set, as follows:
public class WindowedFilter extends ProcessWindowFunction<MetricObject, MetricObject, String, TimeWindow> {

    HashSet<Long> previousRecordTimestamps = new HashSet<>();

    @Override
    public void process(String s, Context context, Iterable<MetricObject> inputs, Collector<MetricObject> out) throws Exception {
        String windowStart = DateTimeFormatter.ISO_INSTANT.format(Instant.ofEpochMilli(context.window().getStart()));
        String windowEnd = DateTimeFormatter.ISO_INSTANT.format(Instant.ofEpochMilli(context.window().getEnd()));
        log.info("window start: '{}', window end: '{}'", windowStart, windowEnd);
        Long watermark = context.currentWatermark();
        log.info(inputs.toString());
        for (MetricObject in : inputs) {
            Long recordTimestamp = in.getTimestamp().toEpochMilli();
            if (!previousRecordTimestamps.contains(recordTimestamp)) {
                log.info("timestamp not contained");
                previousRecordTimestamps.add(recordTimestamp);
                out.collect(in);
            }
        }
    }
}
This solution works, but I have the feeling that I'm not considering something important, or that it could be done in a better way.
One potential problem with using windows for deduplication is that the windows implemented in Flink's DataStream API are always aligned to the epoch. This means that, for example, an event occurring at 11:59:59, and a duplicate occurring at 12:00:01, will be placed into different minute-long windows.
However, in your case it appears that the duplicates you are concerned about also carry the same timestamp. In that case, what you're doing will produce correct results, so long as you're not concerned about the watermarking producing late events.
The other issue with using windows for deduplication is the latency they impose on the pipeline, and the workarounds used to minimize that latency.
This is why I prefer to implement deduplication with a RichFlatMapFunction or a KeyedProcessFunction. Something like this will perform better than a window:
private static class Event {
    public final String key;

    public Event(String key) {
        this.key = key;
    }
}

public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    env.addSource(new EventSource())
            .keyBy(e -> e.key)
            .flatMap(new Deduplicate())
            .print();

    env.execute();
}

public static class Deduplicate extends RichFlatMapFunction<Event, Event> {

    ValueState<Boolean> seen;

    @Override
    public void open(Configuration conf) {
        StateTtlConfig ttlConfig = StateTtlConfig
                .newBuilder(Time.minutes(1))
                .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                .cleanupFullSnapshot()
                .build();
        ValueStateDescriptor<Boolean> desc = new ValueStateDescriptor<>("seen", Types.BOOLEAN);
        desc.enableTimeToLive(ttlConfig);
        seen = getRuntimeContext().getState(desc);
    }

    @Override
    public void flatMap(Event event, Collector<Event> out) throws Exception {
        if (seen.value() == null) {
            out.collect(event);
            seen.update(true);
        }
    }
}
Here the stream is being deduplicated by key, and the state involved is being automatically cleared after one minute.
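For reference, the KeyedProcessFunction variant mentioned above would look roughly like this (a sketch under the same Event assumptions, clearing the entry with an explicit event-time timer one minute after the first occurrence instead of relying on state TTL):

public static class DeduplicateWithTimer extends KeyedProcessFunction<String, Event, Event> {

    ValueState<Boolean> seen;

    @Override
    public void open(Configuration conf) {
        seen = getRuntimeContext().getState(new ValueStateDescriptor<>("seen", Types.BOOLEAN));
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<Event> out) throws Exception {
        if (seen.value() == null) {
            out.collect(event);
            seen.update(true);
            // forget this key one minute of event time after its first occurrence
            ctx.timerService().registerEventTimeTimer(ctx.timestamp() + 60_000L);
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Event> out) {
        seen.clear();
    }
}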

Apache Drill: Write general-purpose array_agg UDF

I would like to create an array_agg UDF for Apache Drill to be able to aggregate all values of a group to a list of values.
This should work with any major types (required, optional) and minor types (varchar, dict, map, int, etc.)
However, I get the impression that Apache Drill's UDF API does not really make use of inheritance and generics. Each type has its own writer and handler, and they cannot be abstracted to handle arbitrary types. E.g., the ValueHolder interface seems to be purely cosmetic and cannot be used for type-agnostic hooking of UDFs onto any type.
My current implementation
I tried to solve this by using Java reflection so I could use the ListHolder's write function independently of the holder of the original value.
However, I then ran into the limitations of the @FunctionTemplate annotation.
I cannot create a general UDF annotation for any value (I tried it with the interface ValueHolder: @Param ValueHolder input).
So to me it seems like the only way to support different types is to have separate classes for each type. But I can't even abstract much and work on any @Param input, because input is only visible in the class where it is defined (i.e., it is type specific).
I based my implementation on https://issues.apache.org/jira/browse/DRILL-6963
and created the following two classes for required and optional varchars (how can this be unified in the first place?)
@FunctionTemplate(
        name = "array_agg",
        scope = FunctionScope.POINT_AGGREGATE,
        nulls = NullHandling.INTERNAL
)
public static class VarChar_Agg implements DrillAggFunc {
    @Param org.apache.drill.exec.expr.holders.VarCharHolder input;
    @Workspace ObjectHolder agg;
    @Output org.apache.drill.exec.vector.complex.writer.BaseWriter.ComplexWriter out;

    @Override
    public void setup() {
        agg = new ObjectHolder();
    }

    @Override
    public void reset() {
        agg = new ObjectHolder();
    }

    @Override
    public void add() {
        if (agg.obj == null) {
            // Initialise list object for output
            agg.obj = out.rootAsList();
        }
        org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter listWriter =
                (org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter) agg.obj;
        listWriter.varChar().write(input);
    }

    @Override
    public void output() {
        ((org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter) agg.obj).endList();
    }
}

@FunctionTemplate(
        name = "array_agg",
        scope = FunctionScope.POINT_AGGREGATE,
        nulls = NullHandling.INTERNAL
)
public static class NullableVarChar_Agg implements DrillAggFunc {
    @Param NullableVarCharHolder input;
    @Workspace ObjectHolder agg;
    @Output org.apache.drill.exec.vector.complex.writer.BaseWriter.ComplexWriter out;

    @Override
    public void setup() {
        agg = new ObjectHolder();
    }

    @Override
    public void reset() {
        agg = new ObjectHolder();
    }

    @Override
    public void add() {
        if (agg.obj == null) {
            // Initialise list object for output
            agg.obj = out.rootAsList();
        }
        if (input.isSet != 1) {
            return;
        }
        org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter listWriter =
                (org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter) agg.obj;
        org.apache.drill.exec.expr.holders.VarCharHolder outHolder = new org.apache.drill.exec.expr.holders.VarCharHolder();
        outHolder.start = input.start;
        outHolder.end = input.end;
        outHolder.buffer = input.buffer;
        listWriter.varChar().write(outHolder);
    }

    @Override
    public void output() {
        ((org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter) agg.obj).endList();
    }
}
Interestingly, I can't import org.apache.drill.exec.vector.complex.writer.BaseWriter to make the whole thing easier, because then Apache Drill would not find it.
So I have to spell out the entire package path for everything in org.apache.drill.exec.vector.complex.writer in the code.
Furthermore, I'm using the deprecated ObjectHolder. Is there a better solution?
Anyway: These work so far, e.g. with this query:
SELECT
    MIN(tbl.`timestamp`) AS start_view,
    MAX(tbl.`timestamp`) AS end_view,
    array_agg(tbl.eventLabel) AS label_agg
FROM `dfs.root`.`/path/to/avro/folder` AS tbl
WHERE tbl.data.slug IS NOT NULL
GROUP BY tbl.data.slug
However, when I use ORDER BY, I get this:
org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: UnsupportedOperationException: NULL
Fragment 0:0
Additionally, I tried more complex types, namely maps/dicts.
Interestingly, when I call SELECT sqlTypeOf(tbl.data) FROM tbl, I get MAP.
But when I write UDFs, the query planner complains about having no UDF array_agg for type dict.
Anyway, I wrote a version for dicts:
@FunctionTemplate(
        name = "array_agg",
        scope = FunctionScope.POINT_AGGREGATE,
        nulls = NullHandling.INTERNAL
)
public static class Map_Agg implements DrillAggFunc {
    @Param MapHolder input;
    @Workspace ObjectHolder agg;
    @Output org.apache.drill.exec.vector.complex.writer.BaseWriter.ComplexWriter out;

    @Override
    public void setup() {
        agg = new ObjectHolder();
    }

    @Override
    public void reset() {
        agg = new ObjectHolder();
    }

    @Override
    public void add() {
        if (agg.obj == null) {
            // Initialise list object for output
            agg.obj = out.rootAsList();
        }
        org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter listWriter =
                (org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter) agg.obj;
        //listWriter.copyReader(input.reader);
        input.reader.copyAsValue(listWriter);
    }

    @Override
    public void output() {
        ((org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter) agg.obj).endList();
    }
}

@FunctionTemplate(
        name = "array_agg",
        scope = FunctionScope.POINT_AGGREGATE,
        nulls = NullHandling.INTERNAL
)
public static class Dict_agg implements DrillAggFunc {
    @Param DictHolder input;
    @Workspace ObjectHolder agg;
    @Output org.apache.drill.exec.vector.complex.writer.BaseWriter.ComplexWriter out;

    @Override
    public void setup() {
        agg = new ObjectHolder();
    }

    @Override
    public void reset() {
        agg = new ObjectHolder();
    }

    @Override
    public void add() {
        if (agg.obj == null) {
            // Initialise list object for output
            agg.obj = out.rootAsList();
        }
        org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter listWriter =
                (org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter) agg.obj;
        //listWriter.copyReader(input.reader);
        input.reader.copyAsValue(listWriter);
    }

    @Override
    public void output() {
        ((org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter) agg.obj).endList();
    }
}
But here, I get an empty list in the field data_agg for my query:
SELECT
    MIN(tbl.`timestamp`) AS start_view,
    MAX(tbl.`timestamp`) AS end_view,
    array_agg(tbl.data) AS data_agg
FROM `dfs.root`.`/path/to/avro/folder` AS tbl
GROUP BY tbl.data.viewSlag
Summary of questions
Most importantly: How do I create an array_agg UDF for Apache Drill?
How do I make UDFs type-agnostic/general-purpose? Do I really have to implement an entire class for each Nullable, Required and Repeated version of every type? That's a lot of work and quite tedious. Isn't there a way to handle values in a UDF independently of the underlying types?
I wish Apache Drill would just use what Java offers here: generic types, specialised function overloading and inheritance within its own type system. Am I missing something on how to do that?
How can I fix the NULL problem when I use ORDER BY on my varchar version of the aggregate?
How can I fix the problem where my aggregate of maps/dicts is an empty list?
Is there an alternative to using the deprecated ObjectHolder?
To answer your question: unfortunately, you've run into one of the limits of the Drill aggregate UDF API, which is that it can only return simple data types. It would be a great improvement to Drill to fix this, but that is the current status. If you're interested in discussing that further, please start a thread on the Drill user group and/or Slack channel. I don't think it is impossible, but it would require some modification to the Drill internals. IMHO it would be well worth it, because there are a few other UDFs that I'd like to implement that need this feature.
The second part of your question is how to make UDFs type-agnostic, and once again... you've found yet another bit of ugliness in the UDF API. :-) If you do some digging in the codebase, you'll see that most of the math functions have versions that accept FLOAT, INT, etc.
Regarding the aggregate of null or empty lists: I actually have some good news here. The current way of doing that is to provide two versions of the function, one which accepts regular holders and a second which accepts nullable holders and returns an empty list or map if the inputs are null. Yes, this sucks, but the additional good news is that I'm working on cleaning this up and hopefully will have a PR submitted soon that will eliminate the need to do this.
Regarding the ObjectHolder: I wrote a median function that uses a few Stacks to compute a streaming median, and I used the ObjectHolder for that. I think it will be with us for some time, as there is no alternative at the moment.
I hope this answers your questions.

Best practices with regard the read of vast amounts of files in apache spark

I want to deal with a vast number of .DAT files which contain per-second measurements (each line represents one second) from an electrical device.
I have more than 5000 files, each of which is about 160 KiB (not that much, actually), but I am finding it difficult to find an efficient or recommended way to deal with this kind of problem: creating an object that summarizes each file's content.
This is my file structure:
feeder/
CT40CA18_20190101_000000_60P_40000258.DAT
CT40CA18_20190101_010000_60P_40000258.DAT
CT40CA18_20190101_020000_60P_40000258.DAT
CT40CA18_20190101_030000_60P_40000258.DAT
CT40CA18_20190101_040000_60P_40000258.DAT
....
....
....
CT40CA18_20190812_010000_60P_40000258.DAT
My current code in Java Spark (version 2.1.1) is:
public class Playground {

    private static final SparkSession spark = new SparkSession
            .Builder()
            .master("local[*]")
            .getOrCreate();

    public static void main(String[] args) {
        Dataset<FeederFile> feederFileDataset = spark
                .read()
                .textFile("resources/CT40CA18/feeder/*.DAT")
                .map(new ParseFeederFile(), Encoders.bean(FeederFile.class));
    }
}
ParseFeederFile is:
package transformations.map;
import model.FeederFile;
import org.apache.spark.api.java.function.MapFunction;
public class ParseFeederFile implements MapFunction<String, FeederFile> {
private StringBuilder fileContent;
public ParseFeederFile() {
fileContent = new StringBuilder();
}
#Override
public FeederFile call(String s) throws Exception {
return new FeederFile().withContent(fileContent.append(s).append("\n").toString());
}
}
and FeederFile
package model;

import java.io.Serializable;

public class FeederFile implements Serializable {

    private String content;

    public FeederFile() {}

    public void setContent(String content) {
        this.content = content;
    }

    public String getContent() {
        return content;
    }

    public FeederFile withContent(final String content) {
        this.content = content;
        return this;
    }
}
The problem is that when map invokes call, the string that is passed represents a single line of a .DAT file, so there is a vast and unnecessary creation of FeederFile objects. Another problem is that textFile does not differentiate between files, so everything is appended to the same object (i.e., the content of all the files ends up in the content attribute of the FeederFile class).
I managed to retrieve all the content this naive way (I do not actually want all the content itself, but rather to create a sort of object that summarises information about each .DAT file, like the line count and some statistics based on the data).
Does anyone have an idea of how I can create one FeederFile per .DAT file?
Thank you in advance for any help you can provide.
You could use:
sparkContext.wholeTextFiles(...)
SparkContext's wholeTextFiles method (i.e., sc.wholeTextFiles in the Spark shell) creates a PairRDD whose key is the file name with its path. It's a full path like "hdfs://aa1/data/src_data/stage/test_files/collection_vk/current_snapshot/*". The value is the whole content of the file as a String.
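Applied to the layout from the question, a minimal sketch could look like the following; it reuses the FeederFile bean from above, and the per-file summary (here just the path and line count) is a placeholder for whatever statistics you actually need:

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// one (path, wholeFileContent) pair per .DAT file
JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());
JavaPairRDD<String, String> files = jsc.wholeTextFiles("resources/CT40CA18/feeder/*.DAT");

// build one summary object per file instead of one object per line
JavaRDD<FeederFile> summaries = files.map(pathAndContent -> {
    String path = pathAndContent._1();
    String[] lines = pathAndContent._2().split("\n");
    return new FeederFile().withContent(path + ": " + lines.length + " lines");
});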

Apache Flink - Using class with generic type parameter

How can I use a class with generic types in Flink? I run into the error:
The return type of function 'main(StreamingJob.java:63)' could not be determined automatically, due to type erasure. You can give type information hints by using the returns(...) method on the result of the transformation call, or by letting your function implement the 'ResultTypeQueryable' interface.
The class I use is of the form:
class MaybeProcessable<T> {
    private final T value;

    public MaybeProcessable(T value) {
        this.value = value;
    }

    public T get() {
        return value;
    }
}
And I am using an example Flink job like:
public static void main(String[] args) throws Exception {
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    env.addSource(new PubSubSource(PROJECT_ID, SUBSCRIPTION_NAME))
            .map(MaybeProcessable::new)
            .map(MaybeProcessable::get)
            .writeAsText("/tmp/flink-output", FileSystem.WriteMode.OVERWRITE);

    // execute program
    env.execute("Flink Streaming Java API Skeleton");
}
Now I can add a TypeInformation instance using the .returns() function:
.map(MaybeProcessable::new).returns(new MyCustomTypeInformationClass(String.class))
But this would require me to write my own serializer. Is there not an easier way to achieve this?
You can use
.returns(TypeInformation.of(new TypeHint<MaybeProcessable<CONCRETE_TYPE_HERE>>() {}))
for each re-use of a generically typed MapFunction to set the return type, without creating any more custom classes of your own.
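For example, plugged into the job from the question and assuming the source emits Strings (a sketch, not part of the original answer):

import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;

env.addSource(new PubSubSource(PROJECT_ID, SUBSCRIPTION_NAME))
        .map(MaybeProcessable::new)
        // tell Flink the concrete element type that erasure hides
        .returns(TypeInformation.of(new TypeHint<MaybeProcessable<String>>() {}))
        .map(MaybeProcessable::get)
        .writeAsText("/tmp/flink-output", FileSystem.WriteMode.OVERWRITE);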

org.apache.spark.SparkException: Task not serializable in java

I'm trying to use Spark GraphX. Before that, I wanted to arrange my vertex and edge RDDs using DataFrames, and for that purpose I used the JavaRDD map function, but I'm getting the above error. I tried various ways to fix this issue: I serialized the whole class, but it didn't work; I also implemented the Function and Serializable interfaces in one class and used it in the map function, but that didn't work either. Please help.
// add a long unique id for the vertex dataframe and get a JavaRDD
JavaRDD<Row> ff = vertex_dataframe.javaRDD().zipWithIndex().map(new Function<Tuple2<Row, java.lang.Long>, Row>() {
    public Row call(Tuple2<Row, java.lang.Long> rowLongTuple2) throws Exception {
        return RowFactory.create(rowLongTuple2._1().getString(0), rowLongTuple2._2());
    }
});
I serialized the Function class like below:
public abstract class SerialiFunJRdd<T1,R> implements Function<T1, R> , java.io.Serializable{
}
I suggest you read about serializing non-static inner classes in Java. You are creating a non-static (anonymous) inner class here in your map, which is not serializable even if you mark it Serializable. You have to make it static first:
JavaRDD<Row> ff = vertex_dataframe.javaRDD().zipWithIndex().map(mapFunc);

static SerialiFunJRdd<Tuple2<Row, java.lang.Long>, Row> mapFunc = new SerialiFunJRdd<Tuple2<Row, java.lang.Long>, Row>() {
    @Override
    public Row call(Tuple2<Row, java.lang.Long> rowLongTuple2) throws Exception {
        return RowFactory.create(rowLongTuple2._1().getString(0), rowLongTuple2._2());
    }
};
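If you are on Java 8 or later, another option worth noting (a sketch reusing the variables from the question) is a lambda defined in a static context: it captures no enclosing instance, so there is nothing non-serializable to drag along:

JavaRDD<Row> ff = vertex_dataframe.javaRDD().zipWithIndex()
        .map(t -> RowFactory.create(t._1().getString(0), t._2()));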
