I'm new to Spark. I want to transform two streams together, as in the JavaNetworkWordCount example. I receive two different streams:
JavaStreamingContext jssc = new JavaStreamingContext("local[2]", "JavaNetworkWordCount", new Duration(1000));
JavaReceiverInputDStream<String> lines1 = jssc.socketTextStream(ip1, port1);
JavaReceiverInputDStream<String> lines2 = jssc.socketTextStream(ip2, port2);

// can I union them like this in one driver program?
JavaDStream<String> lines = lines1.union(lines2);
JavaDStream<String> words = lines.flatMap(
    new FlatMapFunction<String, String>() {
        @Override
        public Iterable<String> call(String x) {
            return Arrays.asList(x.split(" "));
        }
    });
Then I do other transformations and an action. I tested it and it failed.
I have read the Spark documentation but can't find an example.
Here's an example from the new Kinesis WordCount example:
Java version:
https://github.com/apache/spark/blob/ae58aea2d1435b5bb011e68127e1bcddc2edf5b2/extras/kinesis-asl/src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java#L130
Scala version:
https://github.com/apache/spark/blob/ae58aea2d1435b5bb011e68127e1bcddc2edf5b2/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala#L116
The idea is to create a list of the streams and then call ssc.union(list). The Scala version is a bit cleaner, but the idea is the same for both.
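For reference, here is a minimal sketch of that pattern applied to the question's two socket streams (host names and ports are placeholders, and the Spark 1.x Java API from the question is assumed). Note that Spark needs more cores than receivers, so with two socket receivers a local[2] master leaves no core for processing; that alone can make the job appear to "fail" silently.

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class UnionStreamsSketch {
    public static void main(String[] args) throws Exception {
        JavaStreamingContext jssc =
            new JavaStreamingContext("local[3]", "UnionStreamsSketch", new Duration(1000));

        // Build one receiver stream per source and keep them in a list.
        List<JavaDStream<String>> streams = new ArrayList<>();
        streams.add(jssc.socketTextStream("host1", 9999));
        streams.add(jssc.socketTextStream("host2", 9999));

        // Union them the same way the Kinesis example does.
        JavaDStream<String> union;
        if (streams.size() > 1) {
            union = jssc.union(streams.get(0), streams.subList(1, streams.size()));
        } else {
            union = streams.get(0);
        }

        union.print(); // downstream transformations (flatMap, reduceByKey, ...) go here
        jssc.start();
        jssc.awaitTermination();
    }
}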
Related
I'm working on creating a framework that allows customers to write their own plugins for my software built on Apache Flink. I've outlined in the snippet below what I'm trying to get working (just as a proof of concept); however, I'm getting an org.apache.flink.client.program.ProgramInvocationException: The main method caused an error. when trying to upload it.
I want to be able to branch the input stream into x number of different pipelines and then combine those back into a single output. What I have below is just the simplified version I'm starting with.
public class ContentBase {
    public static void main(String[] args) throws Exception {
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "kf-service:9092");
        properties.setProperty("group.id", "varnost-content");

        // Set up the execution environment and get the stream from Kafka
        StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<ObjectNode> logs = see.addSource(new FlinkKafkaConsumer011<>("log-input",
                new JSONKeyValueDeserializationSchema(false), properties).setStartFromLatest())
                .map((MapFunction<ObjectNode, ObjectNode>) jsonNodes -> (ObjectNode) jsonNodes.get("value"));

        // Create a new List of Streams, one for each "rule" that is being executed
        // For now, I have a simple custom wrapper on flink's `.filter` function in `MyClass.filter`
        List<String> codes = Arrays.asList("404", "200", "500");
        List<DataStream<ObjectNode>> outputs = new ArrayList<>();
        for (String code : codes) {
            outputs.add(MyClass.filter(logs, "response", code));
        }

        // It seemed as though I needed a seed DataStream to union all others on
        ObjectMapper mapper = new ObjectMapper();
        ObjectNode seedObject = (ObjectNode) mapper.readTree("{\"start\":\"true\"}");
        DataStream<ObjectNode> alerts = see.fromElements(seedObject);

        // Union the output of each "rule" above with the seed object to then output
        for (DataStream<ObjectNode> output : outputs) {
            alerts.union(output);
        }

        // Convert to string and sink to Kafka
        alerts.map((MapFunction<ObjectNode, String>) ObjectNode::toString)
              .addSink(new FlinkKafkaProducer011<>("kf-service:9092", "log-output", new SimpleStringSchema()));

        see.execute();
    }
}
I can't figure out how to get the actual error out of the Flink web interface to add that information here.
There were a few errors I found:
1) A StreamExecutionEnvironment can apparently only have one input (I could be wrong), so adding the .fromElements input was not good.
2) I forgot that all DataStreams are immutable, so the .union operation returns a new DataStream rather than modifying the stream it is called on.
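In other words, the original loop silently discarded each union. Keeping the result would look roughly like this (a sketch only, reusing the outputs list from the code above and assuming it is non-empty):

DataStream<ObjectNode> alerts = outputs.get(0);
for (int i = 1; i < outputs.size(); i++) {
    // union returns a new DataStream; it has to be reassigned
    alerts = alerts.union(outputs.get(i));
}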
The final result ended up being much simpler:
public class ContentBase {
    public static void main(String[] args) throws Exception {
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "kf-service:9092");
        properties.setProperty("group.id", "varnost-content");

        // Set up the execution environment and get the stream from Kafka
        StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<ObjectNode> logs = see.addSource(new FlinkKafkaConsumer011<>("log-input",
                new JSONKeyValueDeserializationSchema(false), properties).setStartFromLatest())
                .map((MapFunction<ObjectNode, ObjectNode>) jsonNodes -> (ObjectNode) jsonNodes.get("value"));

        // Create a new List of Streams, one for each "rule" that is being executed
        // For now, I have a simple custom wrapper on flink's `.filter` function in `MyClass.filter`
        List<String> codes = Arrays.asList("404", "200", "500");
        List<DataStream<ObjectNode>> outputs = new ArrayList<>();
        for (String code : codes) {
            outputs.add(MyClass.filter(logs, "response", code));
        }

        Optional<DataStream<ObjectNode>> alerts = outputs.stream().reduce(DataStream::union);

        // Convert to string and sink to Kafka
        alerts.map((MapFunction<ObjectNode, String>) ObjectNode::toString)
              .addSink(new FlinkKafkaProducer011<>("kf-service:9092", "log-output", new SimpleStringSchema()));

        see.execute();
    }
}
The code you posted cannot be compiled because of the last part (converting to string): alerts is an Optional<DataStream<ObjectNode>>, so .map there resolves to the JDK's Optional.map rather than Flink's DataStream.map. Changing it to
alerts.get().map(ObjectNode::toString);
fixes it.
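If you would rather not call get() on a possibly empty Optional (for instance, if the rule list could be empty), a sketch of an alternative, reusing the names from the code above, is:

outputs.stream()
       .reduce(DataStream::union)
       .ifPresent(unioned -> unioned
           .map((MapFunction<ObjectNode, String>) ObjectNode::toString)
           .addSink(new FlinkKafkaProducer011<>("kf-service:9092", "log-output",
                                                new SimpleStringSchema())));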
Good luck.
I am writing a Spark 2.4 transformation for Spark benchmarking that reads JSON streams from a Kafka topic and needs to dump them to MongoDB. I can do it using the Java MongoClient, but the data can be huge, such as 1 million records coming through multiple threads from Kafka. Spark processes it very fast, but the Mongo write is very slow.
SparkConf sparkConf = new SparkConf().setMaster("local[*]")
        .setAppName("JavaDirectKafkaStreaming");
sparkConf.set("spark.streaming.backpressure.enabled", "true");
JavaStreamingContext streamingContext = new JavaStreamingContext(sparkConf, Durations.seconds(2));

Map<String, Object> kafkaParams = new HashMap<String, Object>();
kafkaParams.put("bootstrap.servers", "localhost:9092");
kafkaParams.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
kafkaParams.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
kafkaParams.put("group.id", "2");
kafkaParams.put("auto.offset.reset", "latest");
kafkaParams.put("enable.auto.commit", false);

Collection<String> topics = Arrays.asList("poc-topic");

final JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(streamingContext,
        LocationStrategies.PreferConsistent(),
        org.apache.spark.streaming.kafka010.ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));

@SuppressWarnings("serial")
JavaPairDStream<String, String> jPairDStream = stream
        .mapToPair(new PairFunction<ConsumerRecord<String, String>, String, String>() {
            public Tuple2<String, String> call(ConsumerRecord<String, String> record) throws Exception {
                return new Tuple2<>(record.key(), record.value());
            }
        });

jPairDStream.foreachRDD(jPairRDD -> {
    jPairRDD.foreach(record -> {
        System.out.println("value=" + record._2());
        if (record._2() != null) {
            System.out.println("inserting=" + record._2());
            Document doc = Document.parse(record._2());
            // List<Document> list = new ArrayList<>();
            // list.add(doc);
            db.getCollection("collection").insertOne(doc);
            System.out.println("Inserted Data Done");
        } else {
            System.out.println("Got no data in this window");
        }
    });
});

streamingContext.start();
streamingContext.awaitTermination();
Where
MongoClient mongo = new MongoClient("localhost", 27017);
MongoDatabase db = mongo.getDatabase("mongodb");
I expect to speed up the Mongo operation. How can I achieve multithreading for the Mongo writes? (Should I use MongoClientOptions with minConnectionsPerHost?)
Also, is the approach of using the Mongo driver correct, or should it be done with the MongoDB Spark connector or Spark's writeStream() APIs? If so, how do I write each RDD as a separate record in Mongo? Is there any example in Java?
I don't know about "efficiently" because there are a lot of factors at play here.
For example, Kafka partitions and the total number of Spark executors are just two values that need to be tuned to accommodate the throughput.
I do see you are using the foreach-writer pattern, which is a workable way to do it, but maybe not the best considering you're constantly calling insertOne, compared to using Spark Structured Streaming to begin with: read from Kafka, shape your data into a struct object, then use the SparkSQL Mongo connector to write directly to Mongo collections (which, I would guess, uses Mongo transactions and inserts multiple records at a time).
Also worth mentioning, Landoop offers a MongoDB Kafka Connect sink, which requires one config file and no Spark code to be written.
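If you stay on the plain Java driver, one common mitigation is to open a client per partition on the executors and batch the documents with insertMany instead of calling insertOne per record. A sketch only, reusing jPairDStream and the localhost/database/collection names from the question; a shared connection pool or the Spark connector would avoid re-creating the client each batch:

jPairDStream.foreachRDD(pairRdd -> {
    pairRdd.foreachPartition(records -> {
        // One client and one batch per partition, created on the executor side
        MongoClient client = new MongoClient("localhost", 27017);
        try {
            List<Document> batch = new ArrayList<>();
            while (records.hasNext()) {
                Tuple2<String, String> record = records.next();
                if (record._2() != null) {
                    batch.add(Document.parse(record._2()));
                }
            }
            if (!batch.isEmpty()) {
                // Single bulk insert per partition instead of one round trip per record
                client.getDatabase("mongodb").getCollection("collection").insertMany(batch);
            }
        } finally {
            client.close();
        }
    });
});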
How can I parse XML data in Storm and Spark Streaming? For example, in Spark Streaming:
// Define the Spark Streaming MAP function.
private static final Function<XML_DOCUMENT_TYPE, MY_JAVA_CLASS> parsingXMLFunc = (doc -> {
    // create my java object
    MY_JAVA_CLASS mjc = new MY_JAVA_CLASS();

    // classic xml parsing
    List<String> parsed_doc = doc.parse(); // etc
    mjc.temperature = parsed_doc.get(0);
    mjc.accelerometer = parsed_doc.get(1);

    return mjc;
});
In this example, can Spark parse the XML in parallel?
Or, a Storm example:
@Override
public void execute(Tuple tuple) {
    // create my java object
    MY_JAVA_CLASS mjc = new MY_JAVA_CLASS();

    // classic xml parsing
    Document doc = (Document) tuple.get(0);
    List<String> parsed_doc = doc.parse(); // etc
    mjc.temperature = parsed_doc.get(0);
    mjc.accelerometer = parsed_doc.get(1);

    _collector.emit(new Values(mjc));
}
In the above examples, is the XML parse operation done in parallel? Or do you have better approaches?
I haven't worked with Spark. Regarding Storm, you can create a function that does the XML parsing (using whichever common Java XML parser you prefer) and call it inside the execute method. It will run in parallel depending on the number of workers and executors you give your application.
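A rough sketch of that idea with the plain JDK DOM parser. The package names assume Storm 1.x-style org.apache.storm imports, and MyJavaClass, the output field, and the "temperature"/"accelerometer" element names are placeholders mirroring the question's MY_JAVA_CLASS:

import java.io.StringReader;
import java.util.Map;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XmlParseBolt extends BaseRichBolt {

    // Placeholder for the question's MY_JAVA_CLASS
    public static class MyJavaClass implements java.io.Serializable {
        public String temperature;
        public String accelerometer;
    }

    private OutputCollector collector;

    @Override
    @SuppressWarnings("rawtypes")
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        try {
            MyJavaClass parsed = parse(tuple.getString(0)); // the raw XML string
            collector.emit(tuple, new Values(parsed));
            collector.ack(tuple);
        } catch (Exception e) {
            collector.fail(tuple);
        }
    }

    // Each executor/task of this bolt runs its own copy, so parsing happens in
    // parallel across however many executors the topology assigns to it.
    private MyJavaClass parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        MyJavaClass result = new MyJavaClass();
        result.temperature = doc.getElementsByTagName("temperature").item(0).getTextContent();
        result.accelerometer = doc.getElementsByTagName("accelerometer").item(0).getTextContent();
        return result;
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("parsed"));
    }
}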
Currently I am using com.crealytics.spark.excel to read an Excel file, but with this library I can't write the Dataset back to an Excel file.
This link says that using the HadoopOffice library (org.zuinnote.spark.office.excel) we can both read and write Excel files.
Please help me write a Dataset object to an Excel file in Spark with Java.
You can use org.zuinnote.spark.office.excel to both read and write Excel files with a Dataset. Examples are given at https://github.com/ZuInnoTe/spark-hadoopoffice-ds/. However, there is one issue if you read the Excel file into a Dataset and try to write it to another Excel file. Please see the issue and the Scala workaround at https://github.com/ZuInnoTe/hadoopoffice/issues/12.
I have written a sample program in Java using org.zuinnote.spark.office.excel and the workaround given at that link. Please see if it helps you.
public class SparkExcel {
    public static void main(String[] args) {
        // Spark session
        SparkSession spark = SparkSession
                .builder()
                .appName("SparkExcel")
                .master("local[*]")
                .getOrCreate();

        // Read
        Dataset<Row> df = spark
                .read()
                .format("org.zuinnote.spark.office.excel")
                .option("read.locale.bcp47", "de")
                .load("c:\\temp\\test1.xlsx");

        // Print
        df.show();
        df.printSchema();

        // FlatMap function
        FlatMapFunction<Row, String[]> flatMapFunc = new FlatMapFunction<Row, String[]>() {
            @Override
            public Iterator<String[]> call(Row row) throws Exception {
                ArrayList<String[]> rowList = new ArrayList<String[]>();
                List<Row> spreadSheetRows = row.getList(0);
                for (Row srow : spreadSheetRows) {
                    ArrayList<String> arr = new ArrayList<String>();
                    arr.add(srow.getString(0));
                    arr.add(srow.getString(1));
                    arr.add(srow.getString(2));
                    arr.add(srow.getString(3));
                    arr.add(srow.getString(4));
                    rowList.add(arr.toArray(new String[] {}));
                }
                return rowList.iterator();
            }
        };

        // Apply the flatMap function
        Dataset<String[]> df2 = df.flatMap(flatMapFunc, spark.implicits().newStringArrayEncoder());

        // Write
        df2.write()
           .mode(SaveMode.Overwrite)
           .format("org.zuinnote.spark.office.excel")
           .option("write.locale.bcp47", "de")
           .save("c:\\temp\\test2.xlsx");
    }
}
I have tested this code with Java 8 and Spark 2.1.0. I am using Maven and added the dependency for org.zuinnote.spark.office.excel from https://mvnrepository.com/artifact/com.github.zuinnote/spark-hadoopoffice-ds_2.11/1.0.3
I am trying to write a Spark Streaming application in Java. My Spark application reads a continuous feed from a Hadoop directory using textFileStream() at an interval of 1 minute.
I need to perform a Spark aggregation (group by) operation on the incoming DStream. After the aggregation, I join the aggregated DStream<Key, Value1> with an RDD<Key, Value2> created from a static dataset read by textFile() from a Hadoop directory.
The problem comes when I enable checkpointing. With an empty checkpoint directory, it runs fine. After running 2-3 batches I close it using Ctrl+C and run it again.
On the second run it immediately throws a Spark exception, "SPARK-5063":
Exception in thread "main" org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063
Following is the block of code of the Spark application:
private void compute(JavaSparkContext sc, JavaStreamingContext ssc) {
    JavaRDD<String> distFile = sc.textFile(MasterFile);
    JavaDStream<String> file = ssc.textFileStream(inputDir);

    // Read the master file
    JavaRDD<MasterParseLog> masterLogLines = distFile.flatMap(EXTRACT_MASTER_LOGLINES);
    final JavaPairRDD<String, String> masterRDD = masterLogLines.mapToPair(MASTER_KEY_VALUE_MAPPER);

    // Continuously streamed file
    JavaDStream<ParseLog> logLines = file.flatMap(EXTRACT_CKT_LOGLINES);

    // Calculate the sum of the required field and generate the group-sum RDD
    JavaPairDStream<String, Summary> sumRDD = logLines.mapToPair(CKT_GRP_MAPPER);
    JavaPairDStream<String, Summary> grpSumRDD = sumRDD.reduceByKey(CKT_GRP_SUM);

    // GROUP BY operation
    JavaPairDStream<String, Summary> grpAvgRDD = grpSumRDD.mapToPair(CKT_GRP_AVG);

    // Join the master RDD with the DStream
    // This is the block causing the error (without it the code works fine)
    JavaPairDStream<String, Tuple2<String, String>> joinedStream = grpAvgRDD.transformToPair(
        new Function2<JavaPairRDD<String, String>, Time, JavaPairRDD<String, Tuple2<String, String>>>() {
            private static final long serialVersionUID = 1L;

            public JavaPairRDD<String, Tuple2<String, String>> call(
                    JavaPairRDD<String, String> rdd, Time v2) throws Exception {
                return masterRDD.value().join(rdd);
            }
        }
    );
    joinedStream.print(10);
}
public static void main(String[] args) {
    JavaStreamingContextFactory contextFactory = new JavaStreamingContextFactory() {
        public JavaStreamingContext create() {
            // Create the context with a 60-second batch size
            SparkConf sparkConf = new SparkConf();
            final JavaSparkContext sc = new JavaSparkContext(sparkConf);
            JavaStreamingContext ssc1 = new JavaStreamingContext(sc, Durations.seconds(duration));
            app.compute(sc, ssc1);
            ssc1.checkpoint(checkPointDir);
            return ssc1;
        }
    };

    JavaStreamingContext ssc = JavaStreamingContext.getOrCreate(checkPointDir, contextFactory);

    // Start the streaming server
    ssc.start();
    logger.info("Streaming server started...");

    // Wait for the computations to finish
    ssc.awaitTermination();
    logger.info("Streaming server stopped...");
}
I know that the block of code which joins the static dataset with the DStream is causing the error, but it is taken from the Spark Streaming page of the Apache Spark website (sub-heading "stream-dataset joins" under "Join Operations"). Please help me get it working, even if there is a different way of doing it. I need to enable checkpointing in my streaming application.
Environment details:
CentOS 6.5: 2-node cluster
Java: 1.8
Spark: 1.4.1
Hadoop: 2.7.1
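For reference, one workaround that avoids capturing an RDD from a pre-checkpoint context inside the checkpointed closure is to rebuild the master pairs from the streamed RDD's own SparkContext on every batch. This is a sketch only, reusing MasterFile, EXTRACT_MASTER_LOGLINES, MASTER_KEY_VALUE_MAPPER and grpAvgRDD from the code above; the trade-off is that the master file is re-read each batch interval:

JavaPairDStream<String, Tuple2<Summary, String>> joinedStream = grpAvgRDD.transformToPair(rdd -> {
    // Recreate the static dataset from the live context of the current batch,
    // so nothing created before checkpoint recovery is referenced here.
    JavaSparkContext ctx = JavaSparkContext.fromSparkContext(rdd.context());
    JavaPairRDD<String, String> master = ctx.textFile(MasterFile)
            .flatMap(EXTRACT_MASTER_LOGLINES)
            .mapToPair(MASTER_KEY_VALUE_MAPPER);
    return rdd.join(master);
});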