I need to create a Dataset from a String. Key is that String:
Header h = new Header();
h.setName(Key);
SQLContext sqlC = spark.sqlContext();
Dataset<String> ds = sqlC.createDataset(Collections.singletonList(h), Encoders.STRING());
ds.show();
I need to write it to a text file (is there a writer for that? I am using CSV right now):
ds.write().format("com.databricks.spark.csv").mode("overwrite")
.save(SomeLocation);
From the documentation, you can use df.write().text():
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/DataFrameWriter.html#text-java.lang.String-
I'm using a data set with a lot of columns, and it contains ? values throughout. I would like Spark (Java) to change every ? to 0. So far I can only do this one column at a time, but I would like to do it everywhere:
Dataset<Row> csvData = spark.read()
.option("header", false)
.option("inferSchema", true)
.option("maxColumns", 50000)
.csv("src/main/resources/K9.data");
csvData = csvData.withColumn("_c5409", when(col("_c5409").isNull(),0).otherwise(col("_c5409")) )
.withColumn("_c0", when(col("_c0").equalTo("?"),0).otherwise(col("_c0")) );
Maybe this has an easy solution; I'm new to Java and Spark :)
If you need to handle more complex if/else cases, you can build the list of columns with when and pass that list to select:
List<org.apache.spark.sql.Column> list = new ArrayList<org.apache.spark.sql.Column>();
for( String col : csvData.columns()){
list.add(when(csvData.col(col).isNull(),0).otherwise(csvData.col(col)).alias(col));
}
csvData = csvData.select(list.toArray(new org.apache.spark.sql.Column[0]));
If you simply need to replace nulls, this is enough:
csvData = csvData.na().fill(0, csvData.columns());
I have written the code to access the Hive table using SparkSQL. Here is the code:
SparkSession spark = SparkSession
.builder()
.appName("Java Spark Hive Example")
.master("local[*]")
.config("hive.metastore.uris", "thrift://localhost:9083")
.enableHiveSupport()
.getOrCreate();
Dataset<Row> df = spark.sql("select survey_response_value from health").toDF();
df.show();
I would like to know how I can convert the complete output to a String or a String array, because I am trying to work with another module to which I can only pass String or String array values.
I have tried other approaches, such as .toString() or casting to String, but they did not work for me.
Kindly let me know how I can convert the Dataset values to String.
Here is the sample code in Java.
import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
public class SparkSample {
public static void main(String[] args) {
SparkSession spark = SparkSession
.builder()
.appName("SparkSample")
.master("local[*]")
.getOrCreate();
//create df
List<String> myList = Arrays.asList("one", "two", "three", "four", "five");
Dataset<Row> df = spark.createDataset(myList, Encoders.STRING()).toDF();
df.show();
//using df.as
List<String> listOne = df.as(Encoders.STRING()).collectAsList();
System.out.println(listOne);
//using df.map
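// the cast avoids an ambiguous-overload compile error on Spark builds for Scala 2.12
List<String> listTwo = df.map((MapFunction<Row, String>) row -> row.mkString(), Encoders.STRING()).collectAsList();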
System.out.println(listTwo);
}
}
"row" is a Java 8 lambda parameter. Please check developer.com/java/start-using-java-lambda-expressions.html
You can use the map function to convert every row into a string, e.g.:
df.map(row => row.mkString())
Instead of just mkString you can of course do more sophisticated work.
The collect method can then retrieve the whole thing into an array:
val strings = df.map(row => row.mkString()).collect
(This is Scala syntax; I think it is quite similar in Java.)
If you are planning to read the dataset line by line, then you can use the iterator over the dataset:
Dataset<Row> csv = session.read().format("csv")
.option("sep", ",")
.option("inferSchema", true)
.option("escape", "\"")
.option("header", true)
.option("multiline", true)
.load("users/abc/....");
for(Iterator<Row> iter = csv.toLocalIterator(); iter.hasNext();) {
String item = iter.next().toString();
System.out.println(item);
}
To get everything as a single string, you can do this from the SparkSession:
sparkSession.read.textFile(filePath).collect.mkString
This assumes your Dataset is of type String: Dataset[String]
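For the Java side, here is a minimal sketch of the last step only: turning an already collected List<String> into a single String or a String[]. It uses plain Java and no Spark calls. The class and method names are hypothetical, and the hard-coded list merely stands in for what collectAsList() would return on a Dataset<String>.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: converts already-collected rows into the shapes
// a String-only module might need.
class CollectedRowsToStrings {

    // Joins the collected rows into one String, one row per line.
    static String toSingleString(List<String> rows) {
        return String.join("\n", rows);
    }

    // Copies the collected rows into a String[] for APIs that only accept arrays.
    static String[] toStringArray(List<String> rows) {
        return rows.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Stand-in for the list returned by collectAsList() on a Dataset<String>.
        List<String> rows = Arrays.asList("one", "two", "three");
        System.out.println(toSingleString(rows));
        System.out.println(Arrays.toString(toStringArray(rows)));
    }
}
```

Keep in mind that collecting brings every row to the driver, so this is only reasonable when the result is small enough to fit in driver memory.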
I am trying to learn Spark and Spark Streaming using Java, and I am developing an IoT application.
I have a Kafka server in place that accepts JSON data, and I am able to parse it using SQLContext and the foreach function.
The data format is as follows:
[{"t":1481368346000,"sensors":[{"s":"s1","d":"+149.625"},{"s":"s2","d":"+23.062"},{"s":"s3","d":"+16.375"},{"s":"s4","d":"+235.937"},{"s":"s5","d":"+271.437"},{"s":"s6","d":"+265.937"},{"s":"s7","d":"+295.562"},{"s":"s8","d":"+301.687"}]}]
Here t is the timestamp of each data stream,
and sensors is an array of sensor data, with s as the name of each sensor and d containing its reading.
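Note that in the sample above each d value is a signed decimal string rather than a JSON number, so it has to be parsed before it can be used as a numeric feature. A minimal sketch in plain Java follows. The class and method names are hypothetical, and treating t as milliseconds since the Unix epoch is an assumption based on the magnitude of the sample value.

```java
import java.time.Instant;

// Hypothetical helper for the two scalar fields in the sample payload.
class SensorValueParsing {

    // Parses a reading such as "+149.625"; Double.parseDouble accepts a leading sign.
    static double parseReading(String d) {
        return Double.parseDouble(d.trim());
    }

    // Interprets t as milliseconds since the Unix epoch (an assumption about the feed).
    static Instant parseTimestamp(long t) {
        return Instant.ofEpochMilli(t);
    }

    public static void main(String[] args) {
        System.out.println(parseReading("+149.625"));
        System.out.println(parseTimestamp(1481368346000L));
    }
}
```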
What I have done so far is:
JavaPairInputDStream<String, String> directKafkaStream =
KafkaUtils.createDirectStream(ssc,
String.class,
String.class,
StringDecoder.class,
StringDecoder.class,
kafkaParams,
topics);
SQLContext sqlContext = spark.sqlContext();
StreamingLinearRegressionWithSGD model = new StreamingLinearRegressionWithSGD().setInitialWeights(Vectors.zeros(2));
JavaDStream<String> json = directKafkaStream.map(new Function<Tuple2<String,String>, String>() {
public String call(Tuple2<String,String> message) throws Exception {
return message._2();
};
});
json.print();
json.foreachRDD(new VoidFunction<JavaRDD<String>>() {
@Override
public void call(JavaRDD<String> jsonRecord) throws Exception {
System.out.println("JSON Record ---- "+jsonRecord);
if(!jsonRecord.isEmpty()){
Dataset<Row> timestamp = sqlContext.read().json(jsonRecord).select("t");
timestamp.printSchema();
timestamp.show(false);
Dataset<Row> data = sqlContext.read().json(jsonRecord).select("sensors");
data.printSchema();
data.show(false);
//DF in table
Dataset<Row> df = data.select(org.apache.spark.sql.functions.explode(org.apache.spark.sql.functions.col("sensors")))
.toDF("sensors").select("sensors.s","sensors.d").where("sensors.s = 's1'");
Row firstRow = df.head();
String valueOfFirstSensor = firstRow.getString(1);
System.out.println("---------valueOfFirstSensor --------"+ valueOfFirstSensor);
double[] values = new double[1];
values[0] = firstRow.getDouble(0);
new LabeledPoint(timestamp.head().getDouble(0), Vectors.dense(values));
df.show(false);
}
}
});
ssc.start();
ssc.awaitTermination();
What I want to do is convert json, which is a JavaDStream<String>, into a data structure that the StreamingLinearRegressionWithSGD model accepts.
When I try to use Spark's map function to map the json stream to a JavaDStream<LabeledPoint> as follows,
JavaDStream<LabeledPoint> forML = json.map(new Function<String, LabeledPoint>() {
@Override
public LabeledPoint call(String jsonRecord) throws Exception {
// TODO Auto-generated method stub
System.out.println("\n\n\n here is JSON in"+ jsonRecord);
LabeledPoint returnObj = null;
if(!jsonRecord.isEmpty()){
Dataset<Row> timestamp = sqlContext.read().json(jsonRecord).select("t");
timestamp.printSchema();
timestamp.show(false);
Dataset<Row> data = sqlContext.read().json(jsonRecord).select("sensors");
data.printSchema();
data.show(false);
//DF in table
Dataset<Row> df = data.select(org.apache.spark.sql.functions.explode(org.apache.spark.sql.functions.col("sensors")))
.toDF("sensors").select("sensors.s","sensors.d").where("sensors.s = 's1'");
Row firstRow = df.head();
String valueOfFirstSensor = firstRow.getString(1);
System.out.println("---------valueOfFirstSensor --------"+ valueOfFirstSensor);
double[] values = new double[1];
values[0] = firstRow.getDouble(0);
returnObj = new LabeledPoint(timestamp.head().getDouble(0), Vectors.dense(values));
df.show(false);
}
return returnObj;
}
}).cache();
model.trainOn(forML);
and then call model.trainOn, it fails with a NullPointerException at
Dataset<Row> timestamp = sqlContext.read().json(jsonRecord).select("t");
Now my questions are:
Am I doing this right?
How can I predict values, and why and how do I need to create a different stream to pass to the model's predictOn function?
I will be receiving multiple sensors but a single value for each sensor, and there can be thousands of such streams. How can I create a different model for each of those thousand sensors and predict efficiently for such a vast amount of data?
Are there any other good machine learning algorithms or approaches which can be utilized for this type of sensor data?
As part of my project, I have to create a SQL query interface for a very large Cassandra dataset. I have therefore been looking at different ways of executing SQL queries on Cassandra column families using Spark, and I have come up with three methods:
using Spark SQLContext with a statically defined schema
// statically defined in the application
public static class TableTuple implements Serializable {
private int id;
private String line;
TableTuple (int i, String l) {
id = i;
line = l;
}
// getters and setters
...
}
and I consume the definition as:
SparkConf conf = new SparkConf(true)
.set("spark.cassandra.connection.host", CASSANDRA_HOST)
.setJars(jars);
SparkContext sc = new SparkContext(HOST, APP_NAME, conf);
SQLContext sqlContext = new SQLContext(sc);
JavaRDD<CassandraRow> rowrdd = javaFunctions(sc).cassandraTable(CASSANDRA_KEYSPACE, CASSANDRA_COLUMN_FAMILY);
JavaRDD<TableTuple> rdd = rowrdd.map(row -> new TableTuple(row.getInt(0), row.getString(1)));
DataFrame dataFrame = sqlContext.createDataFrame(rdd, TableTuple.class);
dataFrame.registerTempTable("lines");
DataFrame resultsFrame = sqlContext.sql("Select line from lines where id=1");
System.out.println(Arrays.asList(resultsFrame.collect()));
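As far as I know, the bean-based createDataFrame(rdd, SomeClass.class) overload derives its columns from the class's public getters, so the accessors elided in the class above matter for this method. Below is a sketch of what a complete bean could look like. The class name TableTupleBean and the accessor names are assumptions following JavaBean conventions, not the original code. The class is package-private only so that the snippet compiles as a single file; in the application it would keep the public static modifiers shown earlier.

```java
import java.io.Serializable;

// Hypothetical complete version of the tuple class; accessor names are assumed.
class TableTupleBean implements Serializable {
    private static final long serialVersionUID = 1L;

    private int id;
    private String line;

    // No-argument constructor, conventional for JavaBeans.
    TableTupleBean() { }

    TableTupleBean(int id, String line) {
        this.id = id;
        this.line = line;
    }

    // Public getter/setter pairs define the bean properties "id" and "line".
    public int getId() { return id; }
    public void setId(int id) { this.id = id; }
    public String getLine() { return line; }
    public void setLine(String line) { this.line = line; }

    public static void main(String[] args) {
        TableTupleBean t = new TableTupleBean(1, "first line");
        System.out.println(t.getId() + " -> " + t.getLine());
    }
}
```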
using Spark SQLContext with a dynamically defined schema
SparkConf conf = new SparkConf(true)
.set("spark.cassandra.connection.host", CASSANDRA_HOST)
.setJars(jars);
SparkContext sc = new SparkContext(HOST, APP_NAME, conf);
SQLContext sqlContext = new SQLContext(sc);
JavaRDD<CassandraRow> cassandraRdd = javaFunctions(sc).cassandraTable(CASSANDRA_KEYSPACE, CASSANDRA_COLUMN_FAMILY);
JavaRDD<Row> rdd = cassandraRdd.map(row -> RowFactory.create(row.getInt(0), row.getString(1)));
List<StructField> fields = new ArrayList<>();
fields.add(DataTypes.createStructField("id", DataTypes.IntegerType, true));
fields.add(DataTypes.createStructField("line", DataTypes.StringType, true));
StructType schema = DataTypes.createStructType(fields);
DataFrame dataFrame = sqlContext.createDataFrame(rdd, schema);
dataFrame.registerTempTable("lines");
DataFrame resultDataFrame = sqlContext.sql("select line from lines where id = 1");
System.out.println(Arrays.asList(resultDataFrame.collect()));
using CassandraSQLContext from the spark-cassandra-connector
SparkConf conf = new SparkConf(true)
.set("spark.cassandra.connection.host", CASSANDRA_HOST)
.setJars(jars);
SparkContext sc = new SparkContext(HOST, APP_NAME, conf);
CassandraSQLContext sqlContext = new CassandraSQLContext(sc);
DataFrame resultsFrame = sqlContext.sql("Select line from " + CASSANDRA_KEYSPACE + "." + CASSANDRA_COLUMN_FAMILY + " where id = 1");
System.out.println(Arrays.asList(resultsFrame.collect()));
I would like to know the advantages and disadvantages of each method. Also, for the CassandraSQLContext method, are queries limited to CQL, or is it fully compatible with Spark SQL? I would also like an analysis for my specific use case: I have a Cassandra column family with ~17.6 million tuples and 62 columns. Which method is most suitable for querying such a large database?
I have a JavaDStream which gets its data from an external source. I'm trying to integrate Spark Streaming and Spark SQL. I know that a JavaDStream is made up of JavaRDDs, and I can only apply the function applySchema() when I have a JavaRDD. Please help me convert the JavaDStream to a JavaRDD. I know there are functions for this in Scala, where it's much easier, but I need a solution in Java.
You can't transform a DStream into an RDD. As you mention, a DStream contains RDDs. The way to get access to the RDDs is by applying a function to each RDD of the DStream using foreachRDD. See the docs: https://spark.apache.org/docs/1.1.0/api/java/org/apache/spark/streaming/api/java/JavaDStreamLike.html#foreachRDD(org.apache.spark.api.java.function.Function2)
You first have to access the RDDs inside the DStream using foreachRDD:
javaDStream.foreachRDD( rdd => {
rdd.collect.foreach({
...
})
})
I hope this helps to convert a JavaDStream to a JavaRDD!
JavaDStream<String> lines = stream.map(ConsumerRecord::value);
//Create JavaRDD<Row>
lines.foreachRDD(new VoidFunction<JavaRDD<String>>() {
@Override
public void call(JavaRDD<String> rdd) {
JavaRDD<Row> rowRDD = rdd.map(new Function<String, Row>() {
@Override
public Row call(String msg) {
Row row = RowFactory.create(msg);
return row;
}
});
//Create Schema
StructType schema = DataTypes.createStructType(new StructField[] {
DataTypes.createStructField("value", DataTypes.StringType, true)});
//Get Spark 2.0 session
SparkSession spark = JavaSparkSessionSingleton.getInstance(rdd.context().getConf());
Dataset<Row> msgDataFrame = spark.createDataFrame(rowRDD, schema);
msgDataFrame.show();