Recently upgraded to Spark 2.0 and I'm seeing some strange behavior when trying to create a simple Dataset from JSON strings. Here's a simple test case:
SparkSession spark = SparkSession.builder().appName("test").master("local[1]").getOrCreate();
JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());
JavaRDD<String> rdd = sc.parallelize(Arrays.asList(
    "{\"name\":\"tom\",\"title\":\"engineer\",\"roles\":[\"designer\",\"developer\"]}",
    "{\"name\":\"jack\",\"title\":\"cto\",\"roles\":[\"designer\",\"manager\"]}"
));
JavaRDD<String> mappedRdd = rdd.map(json -> {
    System.out.println("mapping json: " + json);
    return json;
});
Dataset<Row> data = spark.read().json(mappedRdd);
data.show();
And the output:
mapping json: {"name":"tom","title":"engineer","roles":["designer","developer"]}
mapping json: {"name":"jack","title":"cto","roles":["designer","manager"]}
mapping json: {"name":"tom","title":"engineer","roles":["designer","developer"]}
mapping json: {"name":"jack","title":"cto","roles":["designer","manager"]}
+----+--------------------+--------+
|name|               roles|   title|
+----+--------------------+--------+
| tom|[designer, develo...|engineer|
|jack| [designer, manager]|     cto|
+----+--------------------+--------+
It seems that the "map" function is being executed twice even though I'm only performing one action. I thought that Spark would lazily build an execution plan, then execute it when needed, but this makes it seem that in order to read data as JSON and do anything with it, the plan will have to be executed at least twice.
In this simple case it doesn't matter, but when the map function is long running, this becomes a big problem. Is this right, or am I missing something?
This happens because you don't provide a schema to the DataFrameReader. As a result, Spark has to eagerly scan the data set to infer the output schema.
Since mappedRdd is not cached, it is evaluated twice:
once for schema inference
once when you call data.show
To prevent this, provide a schema to the reader (Scala syntax):
val schema: org.apache.spark.sql.types.StructType = ???
spark.read.schema(schema).json(mappedRdd)
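To make the double evaluation concrete, here is a minimal plain-Java analogy. It is not Spark code: the class DoubleEvaluationSketch and its methods are hypothetical, a Supplier of a Stream stands in for the uncached mappedRdd, and a counter records how often the map function runs. Without a schema the pipeline is consumed twice (once to inspect the records, once to display them); with the schema known up front it is consumed once.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import java.util.stream.Stream;

// Plain-Java analogy only; none of these names are Spark APIs.
final class DoubleEvaluationSketch {

    // Counts how many times the "expensive" map function runs.
    static final AtomicInteger MAP_CALLS = new AtomicInteger();

    // Like an uncached RDD: every get() rebuilds the pipeline, so the map reruns.
    static Supplier<Stream<String>> lazyPipeline(List<String> input) {
        return () -> input.stream().map(s -> {
            MAP_CALLS.incrementAndGet();
            return s;
        });
    }

    // No schema supplied: one pass to "infer" it, one pass to "show" the data.
    static int callsWithInference(List<String> input) {
        MAP_CALLS.set(0);
        Supplier<Stream<String>> pipeline = lazyPipeline(input);
        pipeline.get().forEach(s -> { /* inspect records to infer a schema */ });
        pipeline.get().forEach(s -> { /* display the records */ });
        return MAP_CALLS.get();
    }

    // Schema known up front: only the "show" pass consumes the pipeline.
    static int callsWithSchema(List<String> input) {
        MAP_CALLS.set(0);
        Supplier<Stream<String>> pipeline = lazyPipeline(input);
        pipeline.get().forEach(s -> { /* display the records */ });
        return MAP_CALLS.get();
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("{\"name\":\"tom\"}", "{\"name\":\"jack\"}");
        System.out.println(callsWithInference(input)); // prints 4
        System.out.println(callsWithSchema(input));    // prints 2
    }
}
```

The four calls in the first case mirror the four "mapping json" lines in the question's output. In Spark itself, caching mappedRdd would be the other way to avoid rerunning a long map function.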
I am trying to implement a data pipeline that joins multiple unbounded sources from Kafka topics. I am able to connect to a topic and get the data as PCollection<String>, and I need to convert it into PCollection<Row>. I split the comma-delimited string into an array and use a schema to convert it into a Row. But how can I build the schema and bind values to it dynamically?
Even if I create a separate class for schema building, is there a way to bind the string array directly to the schema?
Below is my current working code. It is static, has to be rewritten every time I build a pipeline, and grows with the number of fields.
final Schema sch1 =
Schema.builder().addStringField("name").addInt32Field("age").build();
PCollection<KafkaRecord<Long, String>> kafkaDataIn1 = pipeline
    .apply(
        KafkaIO.<Long, String>read()
            .withBootstrapServers("localhost:9092")
            .withTopic("testin1")
            .withKeyDeserializer(LongDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .updateConsumerProperties(
                ImmutableMap.of("group.id", (Object)"test1")));
PCollection<Row> Input1 = kafkaDataIn1.apply(
    ParDo.of(new DoFn<KafkaRecord<Long, String>, Row>() {
        @ProcessElement
        public void processElement(
                ProcessContext processContext,
                final OutputReceiver<Row> emitter) {
            KafkaRecord<Long, String> record = processContext.element();
            final String input = record.getKV().getValue();
            final String[] parts = input.split(",");
            emitter.output(
                Row.withSchema(sch1)
                    .addValues(
                        parts[0],
                        Integer.parseInt(parts[1])).build());
        }}))
    .apply("window",
        Window.<Row>into(FixedWindows.of(Duration.standardSeconds(50)))
            .triggering(AfterWatermark.pastEndOfWindow())
            .withAllowedLateness(Duration.ZERO)
            .accumulatingFiredPanes());
Input1.setRowSchema(sch1);
My expectation is to achieve the same thing as the code above in a dynamic, reusable way.
The schema is set on a PCollection, so it is not dynamic. If you want to build it lazily, you need to use a format/coder that supports it; Java serialization and JSON are examples.
That said, to benefit from the SQL feature you can also use a static schema containing the fields you query alongside the other fields. This way the static part lets you run your SQL and you don't lose the additional data.
Romain
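Even with a schema that is fixed per PCollection, the per-field conversion does not have to be hand-written for every pipeline. The sketch below is plain Java, not Beam: DelimitedBinder, its FieldType enum, and the "name:string,age:int32" spec format are all hypothetical names chosen for illustration. It builds an ordered field list once from a spec string and then converts a split record into typed values by walking that list. The same field list could presumably drive a schema builder in a loop, and the resulting values could be handed to a Row builder, but that wiring is an assumption and is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not part of Beam.
final class DelimitedBinder {

    // Only the two types used in the question are modelled here.
    enum FieldType { STRING, INT32 }

    static final class Field {
        final String name;
        final FieldType type;

        Field(String name, FieldType type) {
            this.name = name;
            this.type = type;
        }
    }

    // Parses a spec such as "name:string,age:int32" into an ordered field list.
    static List<Field> parseSpec(String spec) {
        List<Field> fields = new ArrayList<>();
        for (String entry : spec.split(",")) {
            String[] nameAndType = entry.trim().split(":");
            FieldType type = FieldType.valueOf(nameAndType[1].trim().toUpperCase(Locale.ROOT));
            fields.add(new Field(nameAndType[0].trim(), type));
        }
        return fields;
    }

    // Converts the split record into typed values, in field order.
    static List<Object> bind(List<Field> fields, String[] parts) {
        if (parts.length != fields.size()) {
            throw new IllegalArgumentException(
                "Expected " + fields.size() + " values but got " + parts.length);
        }
        List<Object> values = new ArrayList<>();
        for (int i = 0; i < fields.size(); i++) {
            String raw = parts[i].trim();
            switch (fields.get(i).type) {
                case INT32:
                    values.add(Integer.parseInt(raw));
                    break;
                default:
                    values.add(raw);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        List<Field> fields = parseSpec("name:string,age:int32");
        System.out.println(bind(fields, "tom,42".split(","))); // prints [tom, 42]
    }
}
```

The field list is built once, at pipeline-construction time, so the schema itself stays static for the PCollection; only the code that produces it becomes reusable.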
I am trying to use map function on DataFrame in Spark using Java. I am following the documentation which says
map(scala.Function1 f, scala.reflect.ClassTag evidence$4)
Returns a new RDD by applying a function to all rows of this DataFrame.
When I pass a Function1 to map, I have to implement all of its methods. I have seen some related questions, but the solutions provided convert the DataFrame into an RDD.
How can I use the map function on a DataFrame without converting it into an RDD? Also, what is the second parameter of map, i.e. scala.reflect.ClassTag<R> evidence$4?
I am using Java 7 and Spark 1.6.
I know your question is about Java 7 and Spark 1.6, but in Spark 2 (and obviously Java 8), you can have a map function as part of a class, so you do not need to manipulate Java lambdas.
The call would look like:
Dataset<String> dfMap = df.map(
new CountyFipsExtractorUsingMap(),
Encoders.STRING());
dfMap.show(5);
The class would look like:
/**
 * Returns a substring of the values in the id2 column.
 *
 * @author jgp
 */
private final class CountyFipsExtractorUsingMap
    implements MapFunction<Row, String> {
    private static final long serialVersionUID = 26547L;

    @Override
    public String call(Row r) throws Exception {
        String s = r.getAs("id2").toString().substring(2);
        return s;
    }
}
You can find more details in this example on GitHub.
I don't think map is the right operation to use on a DataFrame. Maybe you should have a look at the examples in the API documentation;
they show how to operate on DataFrames.
You can use the Dataset directly; there is no need to convert the data you read into an RDD, which is an unnecessary use of resources.
dataset.map(mapFunction{...}, encoder); should be sufficient for your needs.
Because you don't describe a specific problem, here are some common alternatives to map on a DataFrame: select, selectExpr, and withColumn. If the Spark SQL built-in functions don't fit your task, you can use a UDF.
I'm quite new to Spark and I would like to extract features (basically count of words) from a text file using the Dataset class. I have read the "Extracting, transforming and selecting features" tutorial on Spark but every example reported starts from a bag of words defined "on the fly". I have tried several times to generate the same kind of Dataset starting from a text file but I have always failed. Here is my code:
SparkSession spark = SparkSession
    .builder()
    .appName("Simple application")
    .config("spark.master", "local")
    .getOrCreate();
Dataset<String> textFile = spark.read()
    .textFile("myFile.txt")
    .as(Encoders.STRING());
Dataset<Row> words = textFile.flatMap(s -> {
    return Arrays.asList(s.toLowerCase().split("AG")).iterator();
}, Encoders.STRING()).filter(s -> !s.isEmpty()).toDF();
Word2Vec word2Vec = new Word2Vec()
    .setInputCol("value")
    .setOutputCol("result")
    .setVectorSize(16)
    .setMinCount(0);
Word2VecModel model = word2Vec.fit(words);
Dataset<Row> result = model.transform(words);
I get this error message: Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Column value must be of type equal to one of the following types: [ArrayType(StringType,true), ArrayType(StringType,false)] but was actually of type StringType.
I think I have to convert each line into a Row using something like:
RowFactory.create(0.0, line)
but I cannot figure out how to do that.
Basically, I was trying to train a classification system based on word counts of strings generated from a long sequence of characters. My text file contains one sequence per line so I need to split and count them for each row (the sub-strings are called k-mers and a general description can be found here). Depending on the length of the k-mers I could have more than 4^32 different strings, so I was looking for a scalable machine learning algorithm like Spark.
If you just want to count occurrences of words, you can do:
Dataset<String> words = textFile.flatMap(s -> {
return Arrays.asList(s.toLowerCase().split("AG")).iterator();
}, Encoders.STRING()).filter(s -> !s.isEmpty());
Dataset<Row> counts = words.toDF("word").groupBy(col("word")).count();
Word2Vec is a much more powerful ML algorithm; in your case it isn't necessary. Remember to add import static org.apache.spark.sql.functions.*; at the beginning of the file.
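One detail in the snippet is worth checking before scaling it up: the line is lower-cased before it is split on the upper-case delimiter "AG", so the delimiter can never match and each line comes back as a single token. The plain-Java sketch below (no Spark; SplitCountSketch and countTokens are hypothetical names) applies the same split-filter-count steps to one string, so the difference between the two delimiters is easy to see. Whether "ag" is the intended delimiter is an assumption about the question's data.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Plain-Java illustration of the split-filter-count steps; not Spark code.
final class SplitCountSketch {

    // Lower-cases the line, splits it on the delimiter (treated as a regex by
    // String.split), drops empty tokens, and counts each remaining token.
    static Map<String, Long> countTokens(String line, String delimiter) {
        return Arrays.stream(line.toLowerCase(Locale.ROOT).split(delimiter))
            .filter(s -> !s.isEmpty())
            .collect(Collectors.groupingBy(
                Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Upper-case delimiter never matches the lower-cased text.
        System.out.println(countTokens("TTAGCCAGTT", "AG")); // prints {ttagccagtt=1}
        // Lower-case delimiter splits as presumably intended.
        System.out.println(countTokens("TTAGCCAGTT", "ag")); // prints {cc=1, tt=2}
    }
}
```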
I am very new to Spark.
I have a very basic question. I read a file into a Spark RDD in which each line is a JSON document. I want to apply groupBy-like transformations, so I want to transform each JSON line into a PairRDD. Is there a straightforward way to do it in Java?
My json is like this:
{
"tmpl": "p",
"bw": "874",
"aver": {"cnac": "US","t1": "2"},
}
Currently, I am trying to split by , first and then by :. Is there a more straightforward way to do this?
My current code:
val pairs = setECrecords.flatMap(x => (x.split(",")))
pairs.foreach(println)
val pairsastuple = pairs.map(x => if(x.split("=").length>1) (x.split("=")(0), x.split("=")(1)) else (x.split("=")(0), x))
You can try mapToPair(), but using the Spark SQL & DataFrames API will enable you to group things much more easily. The data frames API allows you to load JSON data directly.
This is the main body of my really simple Spark job...
def hBaseRDD = sc.newAPIHadoopRDD(config, TableInputFormat.class, ImmutableBytesWritable.class, Result.class)
println "${hBaseRDD.count()} records counted"
def filteredRDD = hBaseRDD.filter({ scala.Tuple2 result ->
def val = result._2.getValue(family, qualifier)
val ? new String(val) == 'twitter' : false
} as Function<Result, Boolean>)
println "${filteredRDD.count()} counted from twitter."
println "Done!"
I noticed in the spark-submit output, that it appeared to go to HBase twice. The first time was when it called count on hBaseRDD and the second was when it called filter to create filteredRDD. Is there a way to get it to cache the results of the newAPIHadoopRDD call in hBaseRDD so that filter works on an in-memory only copy of the data?
Calling hBaseRDD.cache() before counting will do the trick.
The docs cover the options in detail: http://spark.apache.org/docs/1.2.0/programming-guide.html#rdd-persistence