I am very new to Spark.
I have a very basic question. I read a file into a Spark RDD in which each line is a JSON string. I want to apply groupBy-like transformations, so I want to transform each JSON line into a PairRDD. Is there a straightforward way to do this in Java?
My json is like this:
{
"tmpl": "p",
"bw": "874",
"aver": {"cnac": "US","t1": "2"},
}
Currently, the way I am trying is to split by , first and then by :. Is there any straightforward way to do this?
My current code:
val pairs = setECrecords.flatMap(x => (x.split(",")))
pairs.foreach(println)
val pairsastuple = pairs.map(x => if(x.split("=").length>1) (x.split("=")(0), x.split("=")(1)) else (x.split("=")(0), x))
You can try mapToPair(), but using the Spark SQL and DataFrames API will let you group things much more easily. The DataFrames API allows you to load JSON data directly.
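For example, here is a minimal sketch of the DataFrames approach in Java, assuming a Spark 2.x SparkSession and a placeholder input path; Spark infers the schema from the JSON lines, so you can group on a field directly instead of building a PairRDD by hand:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
    .appName("json-groupby")
    .master("local[*]")
    .getOrCreate();

// each line of the file is one JSON record; the path here is a placeholder
Dataset<Row> df = spark.read().json("path/to/records.json");

// group on a JSON field directly, e.g. the "tmpl" field from the sample above
df.groupBy("tmpl").count().show();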
I have a dataset that contains a latitude and longitude written like 20.55E and 30.11N. I want to replace these direction suffixes with an appropriate minus sign where required. So basically, I'll map based on the condition and change the value.
Currently, I have a Schema and I'm trying to sort out the TransformProcess
My Schema is like this:
new Schema.Builder()
.addColumnTime("dt", DateTimeZone.UTC)
.addColumnsDouble("AverageTemperature" , "AverageTemperatureUncertainty")
.addColumnsInteger("City" , "Country")
.addColumnsFloat("Latitude" , "Longitude")
.build();
And I'm stuck with my TransformProcess like this:
new TransformProcess.Builder(schema)
.filter(new FilterInvalidValues("AverageTemperature" , "AverageTemperatureUncertainty"))
.stringToTimeTransform("dt","yyyy-MM-dd", DateTimeZone.UTC)
. // map currentLatitude -> remove direction string and put sign
I am trying to follow this code from a tutorial and after the TransformProcess, I'll do the Spark stuff and save the data.
My question is:
How can I perform the mapping?
From the API docs of TransformProcess, I cannot find anything that will help me solve my problem.
I am using the DataVec library from Deeplearning4j.
I'm translating an old enterprise app that uses C# with LINQ queries to Java 8. There are some of those queries that I'm not able to reproduce using lambdas, as I don't know how C# handles them.
For example, this LINQ query:
(from register in registers
group register by register.muleID into groups
select new Petition
{
Data = new PetitionData
{
UUID = groups.Key
},
Registers = groups.ToList<AuditRegister>()
}).ToList<Petition>()
I understand this as a groupingBy in a Java 8 lambda, but what is the "select new PetitionData" inside the query? I don't know how to code it in Java.
I have this at this moment:
Map<String, List<AuditRegister>> groupByMuleId =
registers.stream().collect(Collectors.groupingBy(AuditRegister::getMuleID));
Thank you and regards!
The select LINQ operation is similar to the map method of Stream in Java. They both transform each element of the sequence into something else.
collect(Collectors.groupingBy(AuditRegister::getMuleID)) returns a Map<String, List<AuditRegister>> as you know. But the groups variable in the C# version is an IEnumerable<IGrouping<string, AuditRegister>>. They are quite different data structures.
What you need is the entrySet method of Map. It turns the map into a Set<Map.Entry<String, List<AuditRegister>>>. This data structure is much closer to IEnumerable<IGrouping<string, AuditRegister>>, which means you can create a stream from the return value of entrySet, call map, and transform each element into a Petition.
groups.Key is simply x.getKey(), groups.ToList() is simply x.getValue(). It should be easy.
I suggest you create a separate method to pass into the map method:
// you can probably come up with a more meaningful name
public static Petition mapEntryToPetition(Map.Entry<String, List<AuditRegister>> entry) {
Petition petition = new Petition();
PetitionData data = new PetitionData();
data.setUUID(entry.getKey());
petition.setData(data);
petition.setRegisters(entry.getValue());
return petition;
}
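Putting it together, a sketch of the full pipeline might look like this, assuming the Petition, PetitionData, and AuditRegister accessors used above and that the helper lives in a class named, say, PetitionMapper (a made-up name for illustration):

List<Petition> petitions = registers.stream()
    .collect(Collectors.groupingBy(AuditRegister::getMuleID))
    .entrySet()
    .stream()
    .map(PetitionMapper::mapEntryToPetition) // the helper method defined above
    .collect(Collectors.toList());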
I am trying to use map function on DataFrame in Spark using Java. I am following the documentation which says
map(scala.Function1 f, scala.reflect.ClassTag evidence$4)
Returns a new RDD by applying a function to all rows of this DataFrame.
While using Function1 in map, I need to implement all of its methods. I have seen some questions related to this, but the solutions provided convert the DataFrame into an RDD.
How can I use the map function on a DataFrame without converting it into an RDD? Also, what is the second parameter of map, i.e. scala.reflect.ClassTag<R> evidence$4?
I am using Java 7 and Spark 1.6.
I know your question is about Java 7 and Spark 1.6, but in Spark 2 (and obviously Java 8), you can have a map function as part of a class, so you do not need to manipulate Java lambdas.
The call would look like:
Dataset<String> dfMap = df.map(
new CountyFipsExtractorUsingMap(),
Encoders.STRING());
dfMap.show(5);
The class would look like:
/**
* Returns a substring of the values in the id2 column.
*
* @author jgp
*/
private final class CountyFipsExtractorUsingMap
implements MapFunction<Row, String> {
private static final long serialVersionUID = 26547L;
@Override
public String call(Row r) throws Exception {
String s = r.getAs("id2").toString().substring(2);
return s;
}
}
You can find more details in this example on GitHub.
I think map is not the right operation to use on a DataFrame. Maybe you should have a look at the examples in the API docs; there they show how to operate on DataFrames.
You can use the Dataset directly; there is no need to convert the data you read into an RDD, as that is an unnecessary consumption of resources.
dataset.map(mapFunction, encoder); this should suffice for your needs.
Since you don't give a specific problem, here are some common alternatives to map on a DataFrame: select, selectExpr, and withColumn. If the Spark SQL built-in functions can't handle your task, you can use a UDF.
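As an illustration, here is a minimal sketch of the withColumn-plus-UDF route using the Spark 2.x Java API; the UDF name extractFips and the column names are made up to mirror the id2 substring example above:

import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

// register a UDF that strips the first two characters, like the map example above
spark.udf().register(
    "extractFips",
    (UDF1<String, String>) id2 -> id2.substring(2),
    DataTypes.StringType);

// add the transformed value as a new column instead of mapping row by row
Dataset<Row> result = df.withColumn("fips", callUDF("extractFips", col("id2")));
result.show(5);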
In my case, the same JSON field might have different data types. Example:
"need_exp":1500
or
"need_exp":"-"
How can I process this case? I know it can be handled by parsing manually or by using a custom decoder, but this is a very complex JSON text. Is there any way to solve it without rewriting the whole decoder (for example, just "tell" the decoder to convert every Int to String in the need_exp field)?
It is called a disjunction, which can be encoded with the standard Scala Either class.
Simply map the JSON to the following class:
case class Foo(need_exp: Either[String, Int])
My solution is to use a custom decoder; rewriting a small part of the decoding logic is fine.
For example, there is a simple JSON:
{
/*many fields*/
"hotList":[/* ... many lists inside*/],
"list":[ {/*... many fields*/
"level_info":{
"current_exp":11463,
"current_level":5,
"current_min":10800,
"next_exp":28800 //there is the problem
},
"sex":"\u4fdd\u5bc6"},/*...many lists*/]
}
In this case, I don't need to rewrite the whole JSON decoder, just a custom decoder for level_info:
implicit val decodeUserLevel: Decoder[UserLevel] = (c: HCursor) => for
{
current_exp <- c.downField("current_exp").as[Int]
current_level <- c.downField("current_level").as[Int]
current_min <- c.downField("current_min").as[Int]
next_exp <- c.downField("next_exp").withFocus(_.mapString
{
case """-""" => "-1"
case default => default
}).as[Int]
} yield
{
UserLevel(current_exp, current_level, current_min, next_exp)
}
and it worked.
Recently upgraded to Spark 2.0 and I'm seeing some strange behavior when trying to create a simple Dataset from JSON strings. Here's a simple test case:
SparkSession spark = SparkSession.builder().appName("test").master("local[1]").getOrCreate();
JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());
JavaRDD<String> rdd = sc.parallelize(Arrays.asList(
"{\"name\":\"tom\",\"title\":\"engineer\",\"roles\":[\"designer\",\"developer\"]}",
"{\"name\":\"jack\",\"title\":\"cto\",\"roles\":[\"designer\",\"manager\"]}"
));
JavaRDD<String> mappedRdd = rdd.map(json -> {
System.out.println("mapping json: " + json);
return json;
});
Dataset<Row> data = spark.read().json(mappedRdd);
data.show();
And the output:
mapping json: {"name":"tom","title":"engineer","roles":["designer","developer"]}
mapping json: {"name":"jack","title":"cto","roles":["designer","manager"]}
mapping json: {"name":"tom","title":"engineer","roles":["designer","developer"]}
mapping json: {"name":"jack","title":"cto","roles":["designer","manager"]}
+----+--------------------+--------+
|name| roles| title|
+----+--------------------+--------+
| tom|[designer, develo...|engineer|
|jack| [designer, manager]| cto|
+----+--------------------+--------+
It seems that the "map" function is being executed twice even though I'm only performing one action. I thought that Spark would lazily build an execution plan, then execute it when needed, but this makes it seem that in order to read data as JSON and do anything with it, the plan will have to be executed at least twice.
In this simple case it doesn't matter, but when the map function is long running, this becomes a big problem. Is this right, or am I missing something?
It happens because you don't provide a schema for the DataFrameReader. As a result, Spark has to eagerly scan the data set to infer the output schema.
Since mappedRdd is not cached, it will be evaluated twice:
once for schema inference
once when you call data.show
If you want to prevent this, you should provide a schema for the reader (Scala syntax):
val schema: org.apache.spark.sql.types.StructType = ???
spark.read.schema(schema).json(mappedRdd)
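In Java, a sketch of the same idea might look like the following, with a schema hand-built to match the sample records above (name, roles, title):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// declaring the schema up front removes the extra pass Spark needs to infer it
StructType schema = DataTypes.createStructType(new StructField[] {
    DataTypes.createStructField("name", DataTypes.StringType, true),
    DataTypes.createStructField("roles",
        DataTypes.createArrayType(DataTypes.StringType), true),
    DataTypes.createStructField("title", DataTypes.StringType, true)
});

Dataset<Row> data = spark.read().schema(schema).json(mappedRdd);
data.show();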