I'm using Spark Structured Streaming (2.3) and Kafka 2.4.
I want to know how to use the async and sync offset commit properties.
If I set enable.auto.commit to true, is the commit sync or async?
How can I define a commit callback in Spark Structured Streaming? Or how can I use sync or async commits in Spark Structured Streaming?
Thanks in advance.
My code:
package sparkProject;
import java.io.StringReader;
import java.util.*;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder;
import org.apache.spark.sql.catalyst.encoders.RowEncoder;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.StreamingQueryException;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
public class XMLSparkStreamEntry {
static StructType structType = new StructType();
static {
structType = structType.add("FirstName", DataTypes.StringType, false);
structType = structType.add("LastName", DataTypes.StringType, false);
structType = structType.add("Title", DataTypes.StringType, false);
structType = structType.add("ID", DataTypes.StringType, false);
structType = structType.add("Division", DataTypes.StringType, false);
structType = structType.add("Supervisor", DataTypes.StringType, false);
}
static ExpressionEncoder<Row> encoder = RowEncoder.apply(structType);
public static void main(String[] args) throws StreamingQueryException {
SparkConf conf = new SparkConf();
SparkSession spark = SparkSession.builder().config(conf).appName("Spark Program").master("local[*]")
.getOrCreate();
Dataset<Row> ds1 = spark.readStream().format("kafka").option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "Kafkademo").load();
Dataset<Row> ss = ds1.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)");
Dataset<Row> finalOP = ss.flatMap(new FlatMapFunction<Row, Row>() {
private static final long serialVersionUID = 1L;
@Override
public Iterator<Row> call(Row t) throws Exception {
JAXBContext jaxbContext = JAXBContext.newInstance(FileWrapper.class);
Unmarshaller unmarshaller = jaxbContext.createUnmarshaller();
StringReader reader = new StringReader(t.getAs("value"));
FileWrapper person = (FileWrapper) unmarshaller.unmarshal(reader);
List<Employee> emp = new ArrayList<Employee>(person.getEmployees());
List<Row> rows = new ArrayList<Row>();
for (Employee e : emp) {
rows.add(RowFactory.create(e.getFirstname(), e.getLastname(), e.getTitle(), e.getId(),
e.getDivision(), e.getSupervisor()));
}
return rows.iterator();
}
}, encoder);
Dataset<Row> wordCounts = finalOP.groupBy("firstname").count();
StreamingQuery query = wordCounts.writeStream().outputMode("complete").format("console").start();
System.out.println("SHOW SCHEMA");
query.awaitTermination();
}
}
Can anyone please check where and how I can implement async and sync offset commits in my code above?
Thanks in advance!
Please read https://www.waitingforcode.com/apache-spark-structured-streaming/apache-spark-structured-streaming-apache-kafka-offsets-management/read. It is an excellent source, although it takes a little reading between the lines.
In short:
Structured Streaming ignores the offset commits in Apache Kafka.
Instead, it relies on its own offsets management on the driver side
which is responsible for distributing offsets to executors and for
checkpointing them at the end of the processing round (epoch or
micro-batch).
Batch Spark Structured Streaming & Kafka integration works differently again.
Spark Structured Streaming doesn't support the Kafka commit offset feature. The option suggested in the official docs is to enable checkpointing.
https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
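Applied to the query in the question, enabling checkpointing would look roughly like this (a minimal sketch; the checkpoint path is a made-up example and should point to durable storage such as HDFS in production):
StreamingQuery query = wordCounts.writeStream()
    .outputMode("complete")
    .format("console")
    // Spark tracks the consumed Kafka offsets (and state) in this location
    .option("checkpointLocation", "/tmp/checkpoints/kafkademo")
    .start();
query.awaitTermination();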
Another suggestion is to switch to Spark Streaming (the DStream API), which supports Kafka's commitAsync API.
https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
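With the DStream API, the pattern from the integration guide looks roughly like this (a sketch; stream is assumed to be a JavaInputDStream<ConsumerRecord<String, String>> created with the kafka010 KafkaUtils.createDirectStream, and the processing step is a placeholder):
import org.apache.spark.streaming.kafka010.CanCommitOffsets;
import org.apache.spark.streaming.kafka010.HasOffsetRanges;
import org.apache.spark.streaming.kafka010.OffsetRange;
stream.foreachRDD(rdd -> {
    // capture the offset ranges of this micro-batch before any shuffle
    OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
    // ... process the micro-batch here ...
    // commit asynchronously once the output for this batch has completed
    ((CanCommitOffsets) stream.inputDStream()).commitAsync(offsetRanges);
});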
I want to load a fixed-length file based on the column names and lengths given in a separate file. I am able to load the data and append a new column, but I'm unable to retain the previously added columns; each column gets overwritten, and I want the complete list of columns. Below is the code I have implemented:
samplefile.txt:
00120181120xyz12341
00220180203abc56792
00320181203pqr25483
00120181120xyz12341
schema.json:
{"Column":"id","length":"3","flag":"0"}
{"Column":"date","length":"8","flag":"0"}
{"Column":"name","length":"3","flag":"1"}
{"Column":"salary","length":"5","flag":"2"}
Current Output:
+-------------------+------+
|                _c0|salary|
+-------------------+------+
|00120181120xyz12341| 12341|
|00220180203abc56792| 56792|
|00320181203pqr25483| 25483|
|00120181120xyz12341| 12341|
+-------------------+------+
Expected Output
+-------------------+------+-----+--------+---+
|                _c0|salary|name |date    |id |
+-------------------+------+-----+--------+---+
|00120181120xyz12341| 12341|xyz  |20181120|001|
|00220180203abc56792| 56792|abc  |20180203|002|
|00320181203pqr25483| 25483|pqr  |20181203|003|
|00120181120xyz12341| 12341|xyz  |20181120|001|
+-------------------+------+-----+--------+---+
Code:
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
public class App {
public static void main(String[] args) {
SparkSession spark = SparkSession.builder().appName("Develop")
.master("local").getOrCreate();
Dataset<Row> ds = spark
.read()
.format("csv")
.option("header", "false")
.load("C://Users//path//samplefile.txt");
ds.show();
Dataset<Row> SchemaFile = spark
.read()
.format("csv")
.option("header", "true")
.load("C://Users//path//schema.txt");
SchemaFile.show();
List<String> s = new ArrayList<String>();
int lens = 1;
List<Row> it = SchemaFile.select("Column", "length").collectAsList();
List<StructField> fields = new ArrayList<>();
Dataset<Row> ds1 = ds;
for (Row fieldName : it) {
System.out.println(fieldName.get(0));
System.out.println(Integer.parseInt(fieldName.get(1).toString()));
ds1 = ds.withColumn(
fieldName.get(0).toString(),
substrings(ds, "_c0", lens,
Integer.parseInt(fieldName.get(1).toString()),
fieldName.get(1).toString())); // selectExpr("substring("+"_c0"+","+lens+","+Integer.parseInt(fieldName.get(1).toString())+")");
s.add(fieldName.get(0).toString());
lens += Integer.parseInt((String) fieldName.get(1));
System.out.println("Lengths:" + lens);
ds1.show();
StructField field = DataTypes.createStructField(
fieldName.get(0).toString(), DataTypes.StringType, true);
fields.add(field);
}
StructType schema = DataTypes.createStructType(fields);
System.out.println(schema);
for (String s1 : s) {
System.out.println(s1);
}
}
private static Column substrings(Dataset<Row> ds, String string, int lens,
int i, String cols) {
return ds.col("_c0").substr(lens, i);
}
}
Any kind of help and advice is appreciated.
Thanks in Advance.
I know your question is quite old, but maybe others will come across it as well and hope for an answer. I think you have simply appended the new column to the wrong dataset and therefore dropped the previously added columns.
Possible solution:
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import java.util.List;
public class FlfReader {
public static void main(String[] args) {
SparkSession spark = SparkSession.builder().appName("FixedLengthFileReader")
.master("local[*]").getOrCreate();
Dataset<Row> ds = spark
.read()
.format("csv")
.option("header", "false")
.load(FlfReader.class.getClassLoader().getResource("samplefile.txt").getPath());
ds.show();
Dataset<Row> SchemaFile = spark
.read()
.format("json")
.option("header", "true")
.load(FlfReader.class.getClassLoader().getResource("schema.json").getPath());
SchemaFile.show();
int lengths = 1;
List<Row> schemaFields = SchemaFile.select("Column", "length").collectAsList();
for (Row fieldName : schemaFields) {
int fieldLength = Integer.parseInt(fieldName.get(1).toString());
ds = ds.withColumn(
fieldName.get(0).toString(),
colSubstring(ds,
lengths,
fieldLength));
lengths += fieldLength;
}
ds.show();
}
private static Column colSubstring(Dataset<Row> ds, int startPos, int length) {
return ds.col("_c0").substr(startPos, length);
}
}
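If the schema has many fields, the same result can also be built with a single select instead of repeated withColumn calls. A minimal sketch of that alternative, reusing the ds and schemaFields variables from the code above (substring is the static import from org.apache.spark.sql.functions):
import static org.apache.spark.sql.functions.substring;
import java.util.ArrayList;
import org.apache.spark.sql.Column;
// ... inside main(), after schemaFields has been collected ...
List<Column> columns = new ArrayList<>();
columns.add(ds.col("_c0"));
int start = 1;
for (Row field : schemaFields) {
    int fieldLength = Integer.parseInt(field.get(1).toString());
    // substring positions in Spark SQL are 1-based
    columns.add(substring(ds.col("_c0"), start, fieldLength).alias(field.get(0).toString()));
    start += fieldLength;
}
Dataset<Row> result = ds.select(columns.toArray(new Column[0]));
result.show();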
I am fetching Neo4j data into a Spark DataFrame using the neo4j-spark connector. I am able to fetch it successfully, since I am able to show the DataFrame. Then I register the DataFrame with the createOrReplaceTempView() method. Then I try running Spark SQL on it, but it gives an exception saying:
org.apache.spark.sql.AnalysisException: Table or view not found: neo4jtable;
This is how my whole code looks like:
import java.text.ParseException;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.AnalysisException;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.neo4j.spark.Neo4JavaSparkContext;
import org.neo4j.spark.Neo4j;
import scala.collection.immutable.HashMap;
public class Neo4jDF {
private static Neo4JavaSparkContext neo4jjsc;
private static SparkConf sConf;
private static JavaSparkContext jsc;
private static SparkContext sc;
private static SparkSession ss;
private static Dataset<Row> neo4jdf;
static String neo4jip = "ll.mm.nn.oo";
public static void main(String[] args) throws AnalysisException, ParseException
{
setSparkConf();
setJavaSparkContext();
setNeo4jJavaSparkContext();
setSparkContext();
setSparkSession();
neo4jdf = loadNeo4jDataframe();
neo4jdf.createOrReplaceTempView("neo4jtable");
neo4jdf.show(false); //this prints correctly
Dataset<Row> neo4jdfsqled = ss.sql("SELECT * from neo4jtable");
neo4jdfsqled.show(false); //this throws exception
}
private static Dataset<Row> loadNeo4jDataframe(String pAutosysBoxCaption)
{
Neo4j neo4j = new Neo4j(jsc.sc());
HashMap<String, Object> a = new HashMap<String, Object>();
Dataset<Row> rdd = neo4j.cypher("cypher query deleted for irrelevance", a).loadDataFrame();
return rdd;
}
private static void setSparkConf()
{
sConf = new SparkConf().setAppName("GetNeo4jToRddDemo");
sConf.set("spark.neo4j.bolt.url", "bolt://" + neo4jip + ":7687");
sConf.set("spark.neo4j.bolt.user", "neo4j");
sConf.set("spark.neo4j.bolt.password", "admin");
sConf.setMaster("local");
sConf.set("spark.testing.memory", "471859200");
sConf.set("spark.sql.warehouse.dir", "file:///D:/Mahesh/workspaces/spark-warehouse");
}
private static void setJavaSparkContext()
{
jsc = new JavaSparkContext(sConf);
}
private static void setSparkContext()
{
sc = JavaSparkContext.toSparkContext(jsc);
}
private static void setSparkSession()
{
ss = new SparkSession(sc);
}
private static void setNeo4jJavaSparkContext()
{
neo4jjsc = Neo4JavaSparkContext.neo4jContext(jsc);
}
}
I feel the issue might be with how all the Spark contexts and sessions are created.
I first created SparkConf sConf.
From sConf, I created JavaSparkContext jsc.
From jsc, I created SparkContext sc.
From sc, I created SparkSession ss.
From jsc, I created Neo4JavaSparkContext neo4jjsc.
So visually:
sConf -> jsc -> sc -> ss
         jsc -> neo4jjsc
Also note that:
Inside loadNeo4jDataframe(), I use the SparkContext (jsc.sc()) to instantiate the Neo4j instance neo4j, which is then used for fetching the Neo4j data.
The data is fetched using that Neo4j instance.
neo4jjsc is never used, but I kept it as a possible hint for the issue.
Given all these points and observations, please tell me why I get the "Table or view not found" exception? I must be missing something stupid. :\
Update
I tried setting ss (after the data is fetched using Neo4j's SparkContext) as follows:
private static void setSparkSession(SparkContext sc)
{
ss = new SparkSession(sc);
}
private static Dataset<Row> loadNeo4jDataframe(String pAutosysBoxCaption)
{
Neo4j neo4j = new Neo4j(sc);
Dataset<Row> rdd = neo4j.cypher("deleted cypher for irrelevance", a).loadDataFrame();
//initalizing ss after data is fetched using SparkContext of neo4j
setSparkSession(neo4j.sc());
return rdd;
}
Update 2
From the comments, I just realised that Neo4j creates its own SparkSession using the SparkContext sc instance provided to it. I don't have access to that SparkSession. So how am I supposed to add/register an arbitrary DataFrame (here, neo4jdf) created in some other SparkSession (here, the SparkSession created by neo4j.cypher) with my SparkSession ss?
Based on the symptoms, we can infer that the two pieces of code use different SparkSession / SQLContext instances. Assuming there is nothing unusual going on in the Neo4j connector, you should be able to fix this by changing:
private static void setSparkSession()
{
ss = SparkSession.builder().getOrCreate();
}
or by initializing SparkSession before calling setNeo4jJavaSparkContext.
If these don't work, you can switch to using createGlobalTempView.
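A minimal sketch of that approach, reusing the names from the question (a global temp view is shared across SparkSessions on the same SparkContext and is resolved through the global_temp database):
neo4jdf.createGlobalTempView("neo4jtable");
Dataset<Row> neo4jdfsqled = ss.sql("SELECT * FROM global_temp.neo4jtable");
neo4jdfsqled.show(false);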
Important:
In general, I would recommend initializing a single SparkSession using the builder pattern and deriving the other contexts (SparkContexts) from it when necessary.
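A sketch of that setup, using the configuration values from the question (only the Neo4j-related options are shown; add the rest as needed):
// build one SparkSession first and derive the Java context from it
SparkSession ss = SparkSession.builder()
    .appName("GetNeo4jToRddDemo")
    .master("local")
    .config("spark.neo4j.bolt.url", "bolt://" + neo4jip + ":7687")
    .config("spark.neo4j.bolt.user", "neo4j")
    .config("spark.neo4j.bolt.password", "admin")
    .getOrCreate();
JavaSparkContext jsc = new JavaSparkContext(ss.sparkContext());
Neo4JavaSparkContext neo4jjsc = Neo4JavaSparkContext.neo4jContext(jsc);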
I need to write a large Dataset to a CSV file. Below is my sample code:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import org.springframework.core.io.ClassPathResource;
import org.springframework.core.io.Resource;
import org.springframework.core.io.support.PropertiesLoaderUtils;
import org.apache.spark.sql.api.java.UDF3;
import org.apache.spark.sql.api.java.UDF2;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Properties;
import java.util.Set;
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lower;
import org.apache.spark.sql.Dataset;
public class TestUdf3{
public static void main(String[] args) {
System.setProperty("hadoop.home.dir", "F:\\JAVA\\winutils");
JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("SparkJdbcDs").setMaster("local[*]"));
SQLContext sqlContext = new SQLContext(sc);
List<Row> manufactuerSynonymData = new ArrayList<Row>();
try{
SparkSession spark = SparkSession.builder().appName("JavaTokenizerExample").getOrCreate();
HashMap<String, String> options = new HashMap<String, String>();
options.put("header", "true");
options.put("path", "D:\\xls\\Source25K.csv"); // load the source CSV file
Dataset<Row> SourcePropertSet = sqlContext.load("com.databricks.spark.csv", options);
Resource resource = new ClassPathResource("/ActaulManufacturerSynonym.properties");
Properties allProperties = PropertiesLoaderUtils.loadProperties(resource);
StructType schemaManufactuerSynonymDictionary = new StructType(new StructField[] {new StructField("ManufacturerSynonymSource", DataTypes.StringType, false, Metadata.empty()), new StructField("ManufacturerSynonymTarget", DataTypes.StringType, false, Metadata.empty()) });
Set<String> setuniqueManufacturerEntries=allProperties.stringPropertyNames();
Row individualRowEntry;
for (String individualManufacturerEntry : setuniqueManufacturerEntries) {
individualRowEntry=RowFactory.create(individualManufacturerEntry,allProperties.getProperty(individualManufacturerEntry));
manufactuerSynonymData.add(individualRowEntry);
}
Dataset<Row> SynonaymList = spark.createDataFrame(manufactuerSynonymData, schemaManufactuerSynonymDictionary).withColumn("ManufacturerSynonymSource", lower(col("ManufacturerSynonymSource")));
SynonaymList.show(90,false);
UDF2<String, String, Boolean> contains = new UDF2<String, String, Boolean>() {
private static final long serialVersionUID = -5239951370238629896L;
@Override
public Boolean call(String t1, String t2) throws Exception {
return t1.matches(t2);
}
};
spark.udf().register("contains", contains, DataTypes.BooleanType);
UDF3<String, String, String, String> replaceWithTerm = new UDF3<String, String, String, String>() {
private static final long serialVersionUID = -2882956931420910207L;
@Override
public String call(String t1, String t2, String t3) throws Exception {
return t1 .replaceAll(t2,t3);
}
};
spark.udf().register("replaceWithTerm", replaceWithTerm, DataTypes.StringType);
Dataset<Row> joined = SourcePropertSet.join(SynonaymList, callUDF("contains", SourcePropertSet.col("manufacturersource"), SynonaymList.col("ManufacturerSynonymSource"))).withColumn("ManufacturerSource", callUDF("replaceWithTerm",SourcePropertSet.col("manufacturersource"),SynonaymList.col("ManufacturerSynonymSource"), SynonaymList.col("ManufacturerSynonymTarget")));
joined.show(54000);
joined.repartition(1).select("*").write().format("com.databricks.spark.csv").option("delimiter", ",")
.option("header", "true")
.option("treatEmptyValuesAsNulls", "true")
.option("nullValue", "")
.save("D:\\xls\\synonym.csv");
}
catch(Exception e){
e.printStackTrace();
}
}
}
In the above code, rather than displaying the output on the console using the statement:
joined.show(54000,false);
I need to write it directly to a CSV file.
It gives me runtime exceptions:
1. save("D:\xls\synonym.csv");
org.apache.spark.SparkException: Job aborted.
Caused by:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0
in stage 3.0 (TID 3, localhost, executor driver):
org.apache.spark.SparkException: Failed to execute user defined
function($anonfun$apply$2: (string, string) => boolean)
2. return t1.matches(t2);
java.lang.NullPointerException
Caused by:
org.apache.spark.SparkException: Failed to execute user defined
function($anonfun$apply$2: (string, string) => boolean)
Can anybody suggest how to write a large Dataset to a CSV file?
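As a side note, the NullPointerException in point 2 suggests that t1.matches(t2) is being called with a null input. A hedged sketch of a null-safe variant of the contains UDF, assuming null cells in manufacturersource are the cause:
UDF2<String, String, Boolean> contains = new UDF2<String, String, Boolean>() {
    private static final long serialVersionUID = -5239951370238629896L;
    @Override
    public Boolean call(String t1, String t2) throws Exception {
        // guard against null cells so the regex match is never invoked on null
        if (t1 == null || t2 == null) {
            return false;
        }
        return t1.matches(t2);
    }
};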
I am trying to receive streaming data from Kafka. In this process I am able to receive and store the streaming data into a JavaPairInputDStream. Now I need to analyze this data without storing it into any database, so I want to convert this JavaPairInputDStream to a Dataset or DataFrame.
What I have tried so far:
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.catalog.Function;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.AbstractJavaDStreamLike;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;
import kafka.serializer.StringDecoder;
import scala.Tuple2;
//Streaming Working Code
public class KafkaToSparkStreaming
{
public static void main(String arr[]) throws InterruptedException
{
SparkConf conf = new SparkConf();
conf.set("spark.app.name", "SparkReceiver"); //The name of application. This will appear in the UI and in log data.
//conf.set("spark.ui.port", "7077"); //Port for application's dashboard, which shows memory and workload data.
conf.set("dynamicAllocation.enabled","false"); //Which scales the number of executors registered with this application up and down based on the workload
//conf.set("spark.cassandra.connection.host", "localhost"); //Cassandra Host Adddress/IP
conf.set("spark.serializer","org.apache.spark.serializer.KryoSerializer"); //For serializing objects that will be sent over the network or need to be cached in serialized form.
//conf.setMaster("local");
conf.set("spark.streaming.stopGracefullyOnShutdown", "true");
JavaSparkContext sc = new JavaSparkContext(conf);
// Create the context with 2 seconds batch size
JavaStreamingContext ssc = new JavaStreamingContext(sc, new Duration(2000));
Map<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("zookeeper.connect", "localhost:2181"); //Make all kafka data for this cluster appear under a particular path.
kafkaParams.put("group.id", "testgroup"); //String that uniquely identifies the group of consumer processes to which this consumer belongs
kafkaParams.put("metadata.broker.list", "localhost:9092"); //Producer can find a one or more Brokers to determine the Leader for each topic.
kafkaParams.put("serializer.class", "kafka.serializer.StringEncoder"); //Serializer to use when preparing the message for transmission to the Broker.
kafkaParams.put("request.required.acks", "1"); //Producer to require an acknowledgement from the Broker that the message was received.
Set<String> topics = Collections.singleton("ny-2008.csv");
//Create an input DStream for Receiving data from socket
JavaPairInputDStream<String, String> directKafkaStream = KafkaUtils.createDirectStream(ssc,
String.class,
String.class,
StringDecoder.class,
StringDecoder.class,
kafkaParams, topics);
//System.out.println(directKafkaStream);
directKafkaStream.print();
}
}
Here is the complete working code using Spark 2.0.
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;
import kafka.serializer.StringDecoder;
import scala.Tuple2;
public class KafkaToSparkStreaming {
public static void main(String arr[]) throws InterruptedException
{
SparkConf conf = new SparkConf();
conf.set("spark.app.name", "SparkReceiver"); //The name of application. This will appear in the UI and in log data.
//conf.set("spark.ui.port", "7077"); //Port for application's dashboard, which shows memory and workload data.
conf.set("dynamicAllocation.enabled","false"); //Which scales the number of executors registered with this application up and down based on the workload
//conf.set("spark.cassandra.connection.host", "localhost"); //Cassandra Host Adddress/IP
conf.set("spark.serializer","org.apache.spark.serializer.KryoSerializer"); //For serializing objects that will be sent over the network or need to be cached in serialized form.
conf.setMaster("local");
conf.set("spark.streaming.stopGracefullyOnShutdown", "true");
JavaSparkContext sc = new JavaSparkContext(conf);
// Create the context with 2 seconds batch size
JavaStreamingContext ssc = new JavaStreamingContext(sc, new Duration(2000));
Map<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("zookeeper.connect", "localhost:2181"); //Make all kafka data for this cluster appear under a particular path.
kafkaParams.put("group.id", "testgroup"); //String that uniquely identifies the group of consumer processes to which this consumer belongs
kafkaParams.put("metadata.broker.list", "localhost:9092"); //Producer can find a one or more Brokers to determine the Leader for each topic.
kafkaParams.put("serializer.class", "kafka.serializer.StringEncoder"); //Serializer to use when preparing the message for transmission to the Broker.
kafkaParams.put("request.required.acks", "1"); //Producer to require an acknowledgement from the Broker that the message was received.
Set<String> topics = Collections.singleton("ny-2008.csv");
//Create an input DStream for Receiving data from socket
JavaPairInputDStream<String, String> directKafkaStream = KafkaUtils.createDirectStream(ssc,
String.class,
String.class,
StringDecoder.class,
StringDecoder.class,
kafkaParams, topics);
//Create JavaDStream<String>
JavaDStream<String> msgDataStream = directKafkaStream.map(new Function<Tuple2<String, String>, String>() {
@Override
public String call(Tuple2<String, String> tuple2) {
return tuple2._2();
}
});
//Create JavaRDD<Row>
msgDataStream.foreachRDD(new VoidFunction<JavaRDD<String>>() {
@Override
public void call(JavaRDD<String> rdd) {
JavaRDD<Row> rowRDD = rdd.map(new Function<String, Row>() {
@Override
public Row call(String msg) {
Row row = RowFactory.create(msg);
return row;
}
});
//Create Schema
StructType schema = DataTypes.createStructType(new StructField[] {DataTypes.createStructField("Message", DataTypes.StringType, true)});
//Get Spark 2.0 session
SparkSession spark = JavaSparkSessionSingleton.getInstance(rdd.context().getConf());
Dataset<Row> msgDataFrame = spark.createDataFrame(rowRDD, schema);
msgDataFrame.show();
}
});
ssc.start();
ssc.awaitTermination();
}
}
class JavaSparkSessionSingleton {
private static transient SparkSession instance = null;
public static SparkSession getInstance(SparkConf sparkConf) {
if (instance == null) {
instance = SparkSession
.builder()
.config(sparkConf)
.getOrCreate();
}
return instance;
}
}
Technically, a DStream is a sequence of RDDs; you don't convert the DStream itself to a DataFrame, instead you convert each RDD to a DataFrame/Dataset as below (Scala code, please convert it to Java for your case):
stream.foreachRDD { rdd =>
  // requires `import spark.implicits._` for toDF on the tuple RDD
  val dataFrame = rdd.map { case (key, value) => (key, value) }.toDF("key", "value")
}
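A rough Java equivalent of the same idea, following the singleton pattern from the answer above (a sketch; the column names are illustrative):
directKafkaStream.foreachRDD(rdd -> {
    // map each (key, value) pair of the micro-batch to a Row
    JavaRDD<Row> rowRDD = rdd.map(tuple -> RowFactory.create(tuple._1(), tuple._2()));
    StructType schema = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField("key", DataTypes.StringType, true),
        DataTypes.createStructField("value", DataTypes.StringType, true)});
    SparkSession spark = JavaSparkSessionSingleton.getInstance(rdd.context().getConf());
    Dataset<Row> df = spark.createDataFrame(rowRDD, schema);
    df.show();
});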
I'm trying to filter DataFrame content using Spark 1.5's dropDuplicates() method.
Using it with fully populated tables (I mean no empty cells) gives the correct result, but when my CSV source contains empty cells (I'll provide you with the source file), Spark throws an ArrayIndexOutOfBoundsException.
What am I doing wrong? I've read the Spark SQL and DataFrames tutorial for version 1.6.2, but it does not describe DataFrame operations in detail. I am also reading the book "Learning Spark: Lightning-Fast Big Data Analysis", but it's written for Spark 1.5 and the operations I need are not described there. I'll be glad to get either an explanation or a link to a manual.
Thank you.
package data;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import java.util.Arrays;
public class TestDrop {
public static void main(String[] args) {
DropData dropData = new DropData("src/main/resources/distinct-test.csv");
dropData.execute();
}
}
class DropData{
private String csvPath;
private JavaSparkContext sparkContext;
private SQLContext sqlContext;
DropData(String csvPath) {
this.csvPath = csvPath;
}
void execute(){
initContext();
DataFrame dataFrame = loadDataFrame();
dataFrame.show();
dataFrame.dropDuplicates(new String[]{"surname"}).show();
//this one fails too: dataFrame.drop("surname")
}
private void initContext() {
sparkContext = new JavaSparkContext(new SparkConf().setMaster("local[4]").setAppName("Drop test"));
sqlContext = new SQLContext(sparkContext);
}
private DataFrame loadDataFrame() {
JavaRDD<String> strings = sparkContext.textFile(csvPath);
JavaRDD<Row> rows = strings.map(string -> {
String[] cols = string.split(",");
return RowFactory.create(cols);
});
StructType st = DataTypes.createStructType(Arrays.asList(DataTypes.createStructField("name", DataTypes.StringType, false),
DataTypes.createStructField("surname", DataTypes.StringType, true),
DataTypes.createStructField("age", DataTypes.StringType, true),
DataTypes.createStructField("sex", DataTypes.StringType, true),
DataTypes.createStructField("socialId", DataTypes.StringType, true)));
return sqlContext.createDataFrame(rows, st);
}
}
Passing a List instead of an Object[] results in rows being created with a single column that contains the list. That's what I was doing wrong.
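For completeness, a hedged sketch of the corrected mapping: split with limit -1 so trailing empty cells are kept, and pass the array as Object[] varargs rather than as a single argument:
JavaRDD<Row> rows = strings.map(string -> {
    // limit -1 keeps trailing empty strings, so every row has all five columns
    String[] cols = string.split(",", -1);
    // pass the array as varargs (Object[]), not as one list/array argument
    return RowFactory.create((Object[]) cols);
});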