I have written code to access a Hive table using Spark SQL. Here is the code:
SparkSession spark = SparkSession
        .builder()
        .appName("Java Spark Hive Example")
        .master("local[*]")
        .config("hive.metastore.uris", "thrift://localhost:9083")
        .enableHiveSupport()
        .getOrCreate();
Dataset<Row> df = spark.sql("select survey_response_value from health").toDF();
df.show();
I would like to know how I can convert the complete output to a String or a String array, because I am trying to work with another module to which I can only pass String or String[] values.
I have tried approaches such as .toString() and casting to String, but they did not work for me.
Could you let me know how I can convert the Dataset values to String?
Here is the sample code in Java.
import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("SparkSample")
                .master("local[*]")
                .getOrCreate();
        // create a single-column DataFrame
        List<String> myList = Arrays.asList("one", "two", "three", "four", "five");
        Dataset<Row> df = spark.createDataset(myList, Encoders.STRING()).toDF();
        df.show();
        // option 1: df.as with a String encoder
        List<String> listOne = df.as(Encoders.STRING()).collectAsList();
        System.out.println(listOne);
        // option 2: df.map, concatenating each row's columns with mkString
        // (the MapFunction cast resolves the overload ambiguity for Java lambdas)
        List<String> listTwo = df.map((MapFunction<Row, String>) row -> row.mkString(), Encoders.STRING())
                .collectAsList();
        System.out.println(listTwo);
    }
}
"row" is java 8 lambda parameter. Please check developer.com/java/start-using-java-lambda-expressions.html
You can use the map function to convert every row into a string, e.g.:
df.map(row => row.mkString())
Instead of just mkString you can of course do more sophisticated work.
The collect method then retrieves the whole thing into an array:
val strings = df.map(row => row.mkString()).collect
(This is the Scala syntax; the Java version is quite similar.)
If you are planning to read the dataset line by line, then you can use the iterator over the dataset:
Dataset<Row> csv = session.read()
        .format("csv")
        .option("sep", ",")
        .option("inferSchema", true)
        .option("escape", "\"")
        .option("header", true)
        .option("multiline", true)
        .load("users/abc/....");
for (Iterator<Row> iter = csv.toLocalIterator(); iter.hasNext(); ) {
    String item = iter.next().toString();
    System.out.println(item);
}
To get everything as a single string, from a SparkSession you can do:
sparkSession.read.textFile(filePath).collect.mkString
assuming your Dataset is of type String (Dataset[String]).
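For the Java side, a rough equivalent might look like the following sketch. The temp file is created only so the example is self-contained, and String.join stands in for Scala's mkString:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.SparkSession;

public class ReadAsSingleString {
    public static void main(String[] args) throws Exception {
        // create a small input file so the sketch runs end-to-end
        Path input = Files.createTempFile("lines", ".txt");
        Files.write(input, Arrays.asList("one", "two", "three"));

        SparkSession spark = SparkSession.builder()
                .appName("ReadAsSingleString")
                .master("local[*]")
                .getOrCreate();
        // textFile gives a Dataset<String>, one element per line
        List<String> lines = spark.read().textFile(input.toString()).collectAsList();
        // join on the driver, mirroring Scala's collect.mkString
        String all = String.join("", lines);
        System.out.println(all);
        spark.stop();
    }
}
```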
Related
I need to create a Dataset from a String; the key is the String:
Header h = new Header();
h.setName(Key);
SQLContext sqlC = spark.sqlContext();
Dataset<String> ds = sqlC.createDataset(Collections.singletonList(h), Encoders.STRING());
ds.show();
I need to write it to a txt file (is there a text format? I am using csv right now):
ds.write().format("com.databricks.spark.csv").mode("overwrite")
.save(SomeLocation);
From the documentation, there is df.write.text():
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/DataFrameWriter.html#text-java.lang.String-
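In Java, that writer might be used as in the sketch below. The output path is illustrative, and note that text() requires the Dataset to contain exactly one string column:

```java
import java.util.Collections;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class WriteTextSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("WriteTextSketch")
                .master("local[*]")
                .getOrCreate();
        Dataset<String> ds = spark.createDataset(
                Collections.singletonList("Key"), Encoders.STRING());
        // each element becomes one line in the part files under the output directory
        ds.write().mode("overwrite").text("/tmp/header-out");
        spark.stop();
    }
}
```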
I have a Dataset<Row> with a single column of JSON strings:
+--------------------+
| value|
+--------------------+
|{"Context":"00AA0...|
+--------------------+
JSON sample:
{"Context":"00AA00AA","MessageType":"1010","Module":"1200"}
How can I most efficiently get Dataset<Row> that looks like this:
+--------+-----------+------+
| Context|MessageType|Module|
+--------+-----------+------+
|00AA00AA| 1010| 1200|
+--------+-----------+------+
I'm processing this data in a stream. I know that Spark can do this by itself when I'm reading from a file:
spark
.readStream()
.schema(MyPojo.getSchema())
.json("src/myinput")
but now I'm reading data from Kafka, and it arrives in another form.
I know that I can use a parser like Gson, but I would like to let Spark do it for me.
Try this sample.
import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkJSONValueDataset {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("SparkJSONValueDataset")
                .config("spark.sql.warehouse.dir", "/file:C:/temp")
                .master("local")
                .getOrCreate();
        // prepare the single-column Dataset<Row> of JSON strings
        List<String> data = Arrays.asList("{\"Context\":\"00AA00AA\",\"MessageType\":\"1010\",\"Module\":\"1200\"}");
        Dataset<Row> df = spark.createDataset(data, Encoders.STRING()).toDF("value");
        df.show();
        // convert to Dataset<String> and let the JSON reader infer the schema
        Dataset<String> df1 = df.as(Encoders.STRING());
        Dataset<Row> df2 = spark.read().json(df1.javaRDD());
        df2.show();
        spark.stop();
    }
}
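Two smaller notes, hedged: since Spark 2.2 the JSON reader accepts a Dataset<String> directly (spark.read().json(df1), no javaRDD() hop), and for the Kafka streaming case the usual tool is from_json with an explicit schema, since a stream cannot infer the schema per batch. A sketch of the from_json route, with the column name "value" and the schema fields taken from the question:

```java
import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.from_json;

public class FromJsonSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("FromJsonSketch").master("local[*]").getOrCreate();
        Dataset<Row> df = spark.createDataset(
                Arrays.asList("{\"Context\":\"00AA00AA\",\"MessageType\":\"1010\",\"Module\":\"1200\"}"),
                Encoders.STRING()).toDF("value");
        // from_json needs the schema up front, which is what makes it usable in streaming
        StructType schema = new StructType()
                .add("Context", DataTypes.StringType)
                .add("MessageType", DataTypes.StringType)
                .add("Module", DataTypes.StringType);
        Dataset<Row> parsed = df
                .select(from_json(col("value"), schema).alias("j"))
                .select("j.*");
        parsed.show();
        spark.stop();
    }
}
```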
I have an input RDD (JavaRDD<List<String>>) and I want to convert it to a JavaRDD<String> as output.
Each element of the input RDD's lists should become an individual element in the output RDD.
How can I achieve this in Java?
JavaRDD<List<String>> input; //suppose rdd length is 2
input.saveAsTextFile(...)
output:
[a,b] [c,d]
what i want:
a b c d
Convert it into a DataFrame and use the explode function.
I did a workaround using the code snippet below:
concatenate each element of the list with the separator '\n', then save the RDD using the standard Spark API.
inputRdd.map(new Function<List<String>, String>() {
    @Override
    public String call(List<String> scores) throws Exception {
        int size = scores.size();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < size; i++) {
            sb.append(scores.get(i));
            if (i != size - 1) {
                sb.append("\n");
            }
        }
        return sb.toString();
    }
}).saveAsTextFile("/tmp/data");
If the rdd type is RDD[List[String]], you can just do this:
val newrdd = rdd.flatMap(line => line)
Each of the elements will be a new line in the new rdd.
The following will solve your problem:
val conf = new SparkConf().setAppName("test")
  .setMaster("local[1]")
  .setExecutorEnv("executor-cores", "2")
val sc = new SparkContext(conf)
val a = sc.parallelize(Array(List("a", "b"), List("c", "d")))
a.flatMap(x => x).foreach(println)
Output:
a
b
c
d
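Since the question asks for Java specifically: in the Spark 2.x Java API the same flatMap would be input.flatMap(list -> list.iterator()), because FlatMapFunction returns an Iterator. The flattening idea itself can be sketched with plain Java streams:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class FlattenDemo {
    public static void main(String[] args) {
        List<List<String>> input = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("c", "d"));
        // flatMap: every element of every inner list becomes its own element
        List<String> flat = input.stream()
                .flatMap(List::stream)
                .collect(Collectors.toList());
        System.out.println(flat); // [a, b, c, d]
    }
}
```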
I am building an application in Spark, and would like to use the SparkContext and/or SQLContext within methods in my classes, mostly to pull/generate data sets from files or SQL queries.
For example, I would like to create a T2P object which contains methods that gather data (and in this case need access to the SparkContext):
class T2P (mid: Int, sc: SparkContext, sqlContext: SQLContext) extends Serializable {
def getImps(): DataFrame = {
val imps = sc.textFile("file.txt").map(line => line.split("\t")).map(d => Data(d(0).toInt, d(1), d(2), d(3))).toDF()
return imps
}
def getX(): DataFrame = {
val x = sqlContext.sql("SELECT a,b,c FROM table")
return x
}
}
//creating the T2P object
class App {
val conf = new SparkConf().setAppName("T2P App").setMaster("local[2]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val t2p = new T2P(0, sc, sqlContext);
}
Passing the SparkContext as an argument to the T2P class doesn't work since the SparkContext is not serializable (getting a task not serializable error when creating T2P objects). What is the best way to use the SparkContext/SQLContext inside my classes? Or perhaps is this the wrong way to design a data pull type process in Spark?
UPDATE
Realized from the comments on this post that the SparkContext was not the problem; rather, I was using a method from within a 'map' function, causing Spark to try to serialize the entire class. This caused the error, since SparkContext is not serializable.
def startMetricTo(userData: ((Int, String), List[(Int, String)]), startMetric: String) : T2PUser = {
//do something
}
def buildUserRollup() = {
this.userRollup = this.userSorted.map(line=>startMetricTo(line, this.startMetric))
}
This results in a 'task not serializable' exception.
I fixed this problem (with the help of the commenters and other StackOverflow users) by creating a separate MetricCalc object to store my startMetricTo() method. Then I changed the buildUserRollup() method to use this new startMetricTo(). This allows the entire MetricCalc object to be serialized without issue.
//newly created object
object MetricCalc {
def startMetricTo(userData: ((Int, String), List[(Int, String)]), startMetric: String) : T2PUser = {
//do something
}
}
//using function in T2P
def buildUserRollup(startMetric: String) = {
this.userRollup = this.userSorted.map(line=>MetricCalc.startMetricTo(line, startMetric))
}
I tried several options; this is what eventually worked for me.
object SomeName extends App {
  val conf = new SparkConf()...
  val sc = new SparkContext(conf)
  implicit val sqlC = SQLContext.getOrCreate(sc)

  getDF1(sqlC)

  def getDF1(sqlCo: SQLContext): Unit = {
    val query1 = SomeQuery here
    val df1 = sqlCo.read.format("jdbc").options(Map("url" -> dbUrl, "dbtable" -> query1)).load.cache()
    // iterate through df1 and retrieve the 2nd DataFrame based on some values in the Row of the first DataFrame
    df1.foreach(x => {
      getDF2(x.getString(0), x.getDecimal(1).toString, x.getDecimal(3).doubleValue)(sqlCo)
    })
  }

  def getDF2(a: String, b: String, c: Double)(implicit sqlCont: SQLContext): Unit = {
    val query2 = Somequery
    val sqlcc = SQLContext.getOrCreate(sc)
    // val sqlcc = sqlCont // Did not work for me. Also, omitting (implicit sqlCont: SQLContext) altogether did not work
    val df2 = sqlcc.read.format("jdbc").options(Map("url" -> dbURL, "dbtable" -> query2)).load().cache()
    .
    .
    .
  }
}
Note: In the above code, if I omitted the (implicit sqlCont: SQLContext) parameter from the getDF2 method signature, it would not work. I tried several other ways of passing the sqlContext from one method to the other; it always gave me a NullPointerException or a Task not serializable exception. The good thing is that it eventually worked this way, and I could retrieve parameters from a row of DataFrame1 and use those values to load DataFrame2.
I have a JavaDStream which gets its data from an external source. I'm trying to integrate Spark Streaming and Spark SQL. It's known that a JavaDStream is made up of JavaRDDs, and I can only apply the applySchema() function when I have a JavaRDD. Please help me convert it to a JavaRDD. I know there are functions for this in Scala, where it's much easier, but help me out in Java.
You can't transform a DStream into an RDD. As you mention, a DStream contains RDDs. The way to get access to the RDDs is by applying a function to each RDD of the DStream using foreachRDD. See the docs: https://spark.apache.org/docs/1.1.0/api/java/org/apache/spark/streaming/api/java/JavaDStreamLike.html#foreachRDD(org.apache.spark.api.java.function.Function2)
You first have to access all the RDDs inside the DStream using foreachRDD:

javaDStream.foreachRDD(rdd => {
  rdd.collect.foreach({
    ...
  })
})

I hope this helps to convert a JavaDStream to JavaRDDs!
JavaDStream<String> lines = stream.map(ConsumerRecord::value);

// Create JavaRDD<Row>
lines.foreachRDD(new VoidFunction<JavaRDD<String>>() {
    @Override
    public void call(JavaRDD<String> rdd) {
        JavaRDD<Row> rowRDD = rdd.map(new Function<String, Row>() {
            @Override
            public Row call(String msg) {
                return RowFactory.create(msg);
            }
        });
        // Create schema
        StructType schema = DataTypes.createStructType(new StructField[] {
                DataTypes.createStructField("value", DataTypes.StringType, true) });
        // Get Spark 2.0 session
        SparkSession spark = JavaSparkSessionSingleton.getInstance(rdd.context().getConf());
        Dataset<Row> msgDataFrame = spark.createDataFrame(rowRDD, schema);
        msgDataFrame.show();
    }
});