How to flatMap to database in Apache Flink?

I am using Apache Flink trying to get JSON records from Kafka to InfluxDB, splitting them from one JSON record into multiple InfluxDB points in the process.
I found the flatMap transform and it feels like it fits the purpose. Core code looks like this:
DataStream<InfluxDBPoint> dataStream = stream.flatMap(new FlatMapFunction<JsonConsumerRecord, InfluxDBPoint>() {
    @Override
    public void flatMap(JsonConsumerRecord record, Collector<InfluxDBPoint> out) throws Exception {
        Iterator<Entry<String, JsonNode>> iterator = //...
        while (iterator.hasNext()) {
            // extract point from input
            InfluxDBPoint point = //...
            out.collect(point);
        }
    }
});
For some reason, I only get one of those collected points streamed into the database.
Even when I print out all mapped entries, it seems to work just fine: dataStream.print() yields:
org.apache.flink.streaming.connectors.influxdb.InfluxDBPoint@144fd091
org.apache.flink.streaming.connectors.influxdb.InfluxDBPoint@57256d1
org.apache.flink.streaming.connectors.influxdb.InfluxDBPoint@28c38504
org.apache.flink.streaming.connectors.influxdb.InfluxDBPoint@2d3a66b3
Am I misunderstanding flatMap or might there be some bug in the Influx connector?

The problem was actually that a series in InfluxDB (defined by its measurement and tag set) can hold only one point per timestamp. Even though my fields differed, all points shared the same measurement, tags and time value, so the final point overwrote all previous ones.
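To make the overwrite behaviour concrete, here is a minimal in-memory sketch. It is not the InfluxDB client API: the class PointOverwriteSketch, its methods and the "channel" tag are hypothetical names used only for illustration. It assumes, as the answer describes, that a point is identified by measurement, tag set and timestamp.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

/** Simplified model (an assumption, not InfluxDB code) of a store keyed by measurement, tags and time. */
class PointOverwriteSketch {

    // Hypothetical stand-in for the database: one entry per (measurement, tag set, timestamp).
    private final Map<String, Map<String, Object>> stored = new LinkedHashMap<>();

    /** Builds the identity of a point; the TreeMap sorts the tags so their order does not matter. */
    static String pointKey(String measurement, Map<String, String> tags, long timestamp) {
        return measurement + "," + new TreeMap<>(tags) + " " + timestamp;
    }

    /** Writes a point; a second write with the same key replaces the values of equally named fields. */
    void write(String measurement, Map<String, String> tags, long timestamp, Map<String, Object> fields) {
        stored.computeIfAbsent(pointKey(measurement, tags, timestamp), k -> new LinkedHashMap<>())
              .putAll(fields);
    }

    int pointCount() {
        return stored.size();
    }

    public static void main(String[] args) {
        PointOverwriteSketch db = new PointOverwriteSketch();
        long ts = 1_600_000_000_000L;
        // Same measurement, same tags, same timestamp: the second write lands on the first point.
        db.write("sensor", Map.of("host", "a"), ts, Map.of("value", 1.0));
        db.write("sensor", Map.of("host", "a"), ts, Map.of("value", 2.0));
        System.out.println(db.pointCount());
        // A distinguishing tag (hypothetical "channel") puts the point into its own series.
        db.write("sensor", Map.of("host", "a", "channel", "x"), ts, Map.of("value", 3.0));
        System.out.println(db.pointCount());
    }
}
```

In the flatMap from the question, the equivalent fix would be to give each emitted point either a distinguishing tag or its own timestamp.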

Related

How does Flink's Collector.collect() handle data?

I'm trying to understand what Flink's Collector.collect() does and how it handles incoming/outgoing data:
Example taken from Flink DataSet API:
The following code transforms a DataSet of text lines into a DataSet of words:
DataSet<String> output = input.flatMap(new Tokenizer());
public class Tokenizer implements FlatMapFunction<String, String> {
    @Override
    public void flatMap(String value, Collector<String> out) {
        for (String token : value.split("\\W")) {
            out.collect(token);
        }
    }
}
So the text lines get split into tokens and each of them gets "collected". As intuitive as that sounds, I'm missing the actual dynamics behind Collector.collect(). Where is the collected data stored before it gets assigned to output, i.e. does Flink put it in some sort of buffer? And if so, how is the data transferred to the network?

From the official source code documentation:
Collects a record and forwards it. The collector is the "push"
counterpart of the {@link java.util.Iterator}, which "pulls" data in.
So collect() receives a value and pushes it on to the next operator; the function itself does not store the collected values. Whether and how records are buffered before they go over the network is a matter of Flink's network stack and its buffers.
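The "push" idea can be sketched in plain Java without Flink. The Collector interface below is a hypothetical, stripped-down stand-in for Flink's, and run() is an assumed driver loop rather than Flink's real runtime. It only illustrates that each collected record goes straight to whatever sits downstream.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class CollectorSketch {

    /** Hypothetical, minimal analogue of Flink's Collector: it only forwards records. */
    interface Collector<T> {
        void collect(T record);
    }

    /** Tokenizing logic similar to the question, written against the stand-in interface. */
    static void flatMap(String value, Collector<String> out) {
        for (String token : value.split("\\W+")) {
            if (!token.isEmpty()) {
                out.collect(token); // pushed downstream immediately
            }
        }
    }

    /** Assumed driver loop: runs flatMap per line and forwards every token to the consumer. */
    static void run(List<String> lines, Consumer<String> downstream) {
        Collector<String> out = downstream::accept; // no intermediate list is kept here
        for (String line : lines) {
            flatMap(line, out);
        }
    }

    public static void main(String[] args) {
        List<String> received = new ArrayList<>();
        run(List.of("to be or", "not to be"), received::add);
        System.out.println(received);
    }
}
```

In Flink itself the downstream side is either the next chained operator or a writer that serializes the record into network buffers; the sketch replaces both with a simple Consumer.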

Apache Flink : How to Call One Stream from Another Stream

My scenario is that I want to trigger one stream based on input from another stream. The two streams have different types. My sample code follows. I want to trigger one stream when a message is received from the Kafka stream.
When the application starts up, I can read the data from the DB. Later, I want to read the data from the DB again whenever I receive a Kafka message in the stream. This is my actual use case.
How can I achieve this? Is it possible?
public class DataStreamCassandraExample implements Serializable {
    private static final long serialVersionUID = 1L;
    static Logger LOG = LoggerFactory.getLogger(DataStreamCassandraExample.class);
    private transient static StreamExecutionEnvironment env;
    static DataStream<Tuple4<UUID,String,String,String>> inputRecords;
    public static void main(String[] args) throws Exception {
        env = StreamExecutionEnvironment.getExecutionEnvironment();
        ParameterTool argParameters = ParameterTool.fromArgs(args);
        env.getConfig().setGlobalJobParameters(argParameters);
        Properties kafkaProps = new Properties();
        kafkaProps.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG,"localhost:9092");
        kafkaProps.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "group1");
        FlinkKafkaConsumer<String> kafkaConsumer = new FlinkKafkaConsumer<>("testtopic", new SimpleStringSchema(), kafkaProps);
        ClusterBuilder cb = new ClusterBuilder() {
            private static final long serialVersionUID = 1L;
            @Override
            public Cluster buildCluster(Cluster.Builder builder) {
                return builder.addContactPoint("127.0.0.1")
                    .withPort(9042)
                    .withoutJMXReporting()
                    .build();
            }
        };
        CassandraInputFormat<Tuple4<UUID,String,String,String>> cassandraInputFormat =
            new CassandraInputFormat<> ("select * from employee_details", cb);
        // At application start-up, read the data from the table and send it as a stream
        inputRecords = getDBData(env,cassandraInputFormat);
        // Whenever a message arrives from Kafka, I want to read the data from the table again.
        // How do I trigger getDBData() from inside this stream?
        // The code below does not work
        DataStream<String> inputRecords1= env.addSource(kafkaConsumer)
            .map(new MapFunction<String,String>() {
                private static final long serialVersionUID = 1L;
                @Override
                public String map(String value) throws Exception {
                    inputRecords = getDBData(env,cassandraInputFormat);
                    return "OK";
                }
            });
        // This is not printed when I call getDBData() from inside the Kafka stream.
        inputRecords1.print();
        DataStream<Employee> empDataStream = inputRecords.map(new MapFunction<Tuple4<UUID,String,String,String>, Tuple2<String,Employee>>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Tuple2<String, Employee> map(Tuple4<UUID,String,String,String> value) throws Exception {
                Employee emp = new Employee();
                try{
                    emp.setEmpid(value.f0);
                    emp.setFirstname(value.f1);
                    emp.setLastname(value.f2);
                    emp.setAddress(value.f3);
                }
                catch(Exception e){
                }
                return new Tuple2<>(emp.getEmpid().toString(), emp);
            }
        }).keyBy(0).map(new MapFunction<Tuple2<String,Employee>,Employee>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Employee map(Tuple2<String, Employee> value)
                    throws Exception {
                return value.f1;
            }
        });
        empDataStream.print();
        env.execute();
    }
    private static DataStream<Tuple4<UUID,String,String,String>> getDBData(StreamExecutionEnvironment env,
            CassandraInputFormat<Tuple4<UUID,String,String,String>> cassandraInputFormat){
        DataStream<Tuple4<UUID,String,String,String>> inputRecords = env
            .createInput
            (cassandraInputFormat
            ,TupleTypeInfo.of(new TypeHint<Tuple4<UUID,String,String,String>>() {}));
        return inputRecords;
    }
}

This is going to be a fairly long answer.
To use Flink correctly as a developer, you need to understand its basic concepts. I suggest you start with the architecture overview (https://ci.apache.org/projects/flink/flink-docs-release-1.11/concepts/flink-architecture.html); it contains all you need to know to get into the world of Flink when you come from a programming background.
Now, looking at your code, it should not do what you expect because of how Flink will read it. You need to understand that Flink has at least two big steps when it executes your code: first it builds an execution graph which only describes what it needs to do. This happens at the job manager level. The second big step is to ask one or many workers to execute the graph. These two steps are sequential and anything you do regarding the graph description has to be done at the job manager level not inside your operations.
In your case, the graph has:
A Kafka source.
A map that will call getDBData() at the worker level (not good, because getDBData() alters the graph by adding a new input each time it is called).
The line inputRecords = getDBData(env,cassandraInputFormat); will create an orphan branch of the graph. And the line DataStream<Employee> empDataStream = inputRecords.map... will append a branch of a map->keyBy->map to that orphan branch. This will build a part of the graph that will read all the employee records from Cassandra and apply the map->keyBy->map transformations. This will not be linked with the Kafka source in any way.
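The two phases described above can be illustrated with a toy model in plain Java. Env, addStep and execute are hypothetical names and are not Flink's API. Unlike Flink, the sketch deliberately throws when a step tries to extend the graph at run time, purely to make the problem visible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class DeferredGraphSketch {

    /** Hypothetical stand-in for an execution environment: it records steps, then runs them later. */
    static class Env {
        private final List<Function<String, String>> steps = new ArrayList<>();
        private boolean executing = false;

        /** Graph-building phase: only registers the step, nothing runs yet. */
        void addStep(Function<String, String> step) {
            if (executing) {
                throw new IllegalStateException("graph is already fixed; cannot add steps while running");
            }
            steps.add(step);
        }

        /** Execution phase: pushes each input through the steps registered before this call. */
        List<String> execute(List<String> inputs) {
            executing = true;
            List<String> results = new ArrayList<>();
            for (String input : inputs) {
                String value = input;
                for (Function<String, String> step : steps) {
                    value = step.apply(value);
                }
                results.add(value);
            }
            return results;
        }
    }

    public static void main(String[] args) {
        Env env = new Env();
        env.addStep(String::toUpperCase); // fine: declared before execute()
        System.out.println(env.execute(List.of("a", "b")));

        Env broken = new Env();
        // Mirrors calling getDBData() inside map(): the step tries to grow the graph at run time.
        broken.addStep(v -> {
            broken.addStep(String::trim);
            return v;
        });
        try {
            broken.execute(List.of("a"));
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The point of the sketch is only that everything describing the graph has to be declared before execute() is called; whatever the job needs at run time has to arrive as data through an input that was declared up front.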
Now let's get back to your need. I understand you need to fetch data for an employee when his/her id comes from Kafka and do some operations.
The cleanest way to handle this is called Side Inputs. This is a data input that you declare when you build your graph; the job manager handles reading the data and transmitting it to the workers. The bad news is that Side Inputs do not yet work for streaming jobs in Flink (https://issues.apache.org/jira/browse/FLINK-2491 - this bug prevents streaming jobs from creating checkpoints, because side inputs finish quickly and this puts the job in a bizarre state).
With this being said you still have three more options. The right option depends on the size of your employee cassandra table.
The second option is to load all employees into a static final variable employees and use it inside your map functions. The downside of this approach is that the job manager will send a serialized copy of this variable to all workers, which may congest your network and overload the RAM. If the table is small and is not expected to grow much, this may be an acceptable workaround until Side Inputs work for streaming jobs. If the table is big or is expected to grow, consider the third option.
The third option is an improvement on the second one. It uses Flink's broadcast variables (see https://flink.apache.org/2019/06/26/broadcast-state.html and https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/state/broadcast_state.html). In short, it is the same as before with better transfer management: Flink will find the best way to store the variable and send it to the workers. This approach is, however, a little more complicated to implement correctly.
The last option is not advisable in normal circumstances. It simply consists of calling Cassandra inside your map operation. This is not good practice because it adds latency to every map execution (there will be as many calls as there are items passing through Kafka). Each call means creating a connection, sending the query, waiting for Cassandra to reply and freeing the connection. That can be a lot of work for a single step in your graph. It is a solution to consider only when you really cannot find an alternative.
For your case, I would advise the third option. I guess the employee table should not be very big and using Broadcast variables is a good choice.

Spark - Collect partitions using foreachpartition

We are using Spark for file processing. We are processing pretty big files, each around 30 GB with about 40-50 million lines. These files are formatted. We load them into a data frame. The initial requirement was to identify records matching some criteria and load them into MySQL. We were able to do that.
The requirement changed recently. Records not meeting the criteria are now to be stored in an alternate DB. This is causing issues because the collection is too big. We are trying to collect each partition independently and merge the results into a list, as suggested here
https://umbertogriffo.gitbooks.io/apache-spark-best-practices-and-tuning/content/dont_collect_large_rdds.html
We are not familiar with Scala, so we are having trouble converting this to Java. How can we iterate over partitions one by one and collect?
Thanks

Please use df.foreachPartition to process each partition independently on the executors; nothing is returned to the driver. You can save the matching results into the DB at the executor level. If you want to collect the results on the driver, use mapPartitions, which is not recommended in your case.
Please refer to the link below:
Spark - Java - foreachPartition
dataset.foreachPartition(new ForeachPartitionFunction<Row>() {
    @Override
    public void call(Iterator<Row> r) throws Exception {
        while (r.hasNext()) {
            Row row = r.next();
            System.out.println(row.getString(1));
        }
        // do your business logic and load into MySQL.
    }
});
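As a rough illustration of what the body of such a per-partition function could do for the requirement in the question (matching records to one store, the rest to another), here is a plain-Java sketch without Spark. routePartition, the two sinks and the batch size are assumptions made for the example; in a real job the sinks would wrap the MySQL and alternate-DB writes.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

class PartitionRoutingSketch {

    /**
     * Processes one partition: rows matching the criteria go to one sink, the rest to another.
     * Rows are flushed in batches, so the whole partition is never held in memory.
     */
    static <T> void routePartition(Iterator<T> rows, Predicate<T> matches, int batchSize,
                                   Consumer<List<T>> matchedSink, Consumer<List<T>> rejectedSink) {
        List<T> matched = new ArrayList<>();
        List<T> rejected = new ArrayList<>();
        while (rows.hasNext()) {
            T row = rows.next();
            List<T> target = matches.test(row) ? matched : rejected;
            target.add(row);
            if (target.size() >= batchSize) {
                flush(target, target == matched ? matchedSink : rejectedSink);
            }
        }
        // Flush whatever is left once the partition is exhausted.
        flush(matched, matchedSink);
        flush(rejected, rejectedSink);
    }

    private static <T> void flush(List<T> batch, Consumer<List<T>> sink) {
        if (!batch.isEmpty()) {
            sink.accept(new ArrayList<>(batch)); // copy, because the buffer is reused
            batch.clear();
        }
    }

    public static void main(String[] args) {
        Iterator<Integer> partition = List.of(1, 2, 3, 4, 5, 6, 7).iterator();
        routePartition(partition, n -> n % 2 == 0, 2,
                batch -> System.out.println("matched  -> " + batch),
                batch -> System.out.println("rejected -> " + batch));
    }
}
```

Because nothing is returned from the function, no data travels back to the driver, which is what makes this approach workable for files of this size.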
For mapPartitions:
// You can use Row directly, but for clarity I am defining a separate class.
public class ResultEntry implements Serializable {
    // define your df properties ..
}
Dataset<ResultEntry> mappedData = data.mapPartitions(new MapPartitionsFunction<Row, ResultEntry>() {
    @Override
    public Iterator<ResultEntry> call(Iterator<Row> it) {
        List<ResultEntry> filteredResult = new ArrayList<ResultEntry>();
        while (it.hasNext()) {
            Row row = it.next();
            if (somecondition)
                filteredResult.add(convertToResultEntry(row));
        }
        return filteredResult.iterator();
    }
}, Encoders.javaSerialization(ResultEntry.class));
Hope this helps.
Ravi

Java SPARK saveAsTextFile NULL

JavaRDD<Text> tx= counts2.map(new Function<Object, Text>() {
    @Override
    public Text call(Object o) throws Exception {
        // TODO Auto-generated method stub
        if (o.getClass() == Dict.class) {
            Dict rkd = (Dict) o;
            return new Text(rkd.getId());
        } else {
            return null ;
        }
    }
});
tx.saveAsTextFile("/rowkey/Rowkey_new");
I am new to Spark. I want to save this file, but I get a null pointer exception. I don't want to replace return null with return new Text(), because that would insert a blank line into my file. How can I solve this problem?

Instead of putting an if condition in your map, simply use that condition to build an RDD filter. The Spark Quick Start is a good place to start. There is also a nice overview of other transformations and actions.
Basically your code can look as follows (if you are using Java 8):
counts2
.filter((o)->o instanceof Dict)
.map(o->new Text(((Dict)o).getId()))
.saveAsTextFile("/rowkey/Rowkey_new");
Your intention was to map one incoming record to either zero or one outgoing record. This cannot be done with map, which always produces exactly one output per input. filter, however, maps each incoming record to zero or one outgoing records, where the outgoing record is the incoming record itself, and flatMap gives you more flexibility by mapping each record to zero or more outgoing records of any type.
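The difference can be tried out with java.util.stream, whose filter, map and flatMap follow the same zero-or-one and zero-or-more rules. This is an analogy, not Spark code, and the Dict class below is a hypothetical stand-in for the one in the question.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class ZeroOrOneSketch {

    /** Hypothetical stand-in for the Dict class from the question. */
    static class Dict {
        private final String id;

        Dict(String id) {
            this.id = id;
        }

        String getId() {
            return id;
        }
    }

    /** filter + map: drop non-Dict elements first, then convert the survivors one to one. */
    static List<String> withFilterThenMap(List<Object> input) {
        return input.stream()
                .filter(o -> o instanceof Dict)
                .map(o -> ((Dict) o).getId())
                .collect(Collectors.toList());
    }

    /** flatMap: each element yields a stream holding zero or one ids. */
    static List<String> withFlatMap(List<Object> input) {
        return input.stream()
                .flatMap(o -> o instanceof Dict ? Stream.of(((Dict) o).getId()) : Stream.<String>empty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Object> input = List.of(new Dict("a"), "not a dict", new Dict("b"));
        System.out.println(withFilterThenMap(input));
        System.out.println(withFlatMap(input));
    }
}
```

Both variants produce the same ids and never emit a null, which is the situation that broke saveAsTextFile in the question.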
It is strange, but not inconceivable, that you create non-Dict objects that are going to be filtered out further downstream anyhow. You could consider pushing your filter even further upstream to make sure you only create Dict instances. Without knowing the rest of your code this is only an assumption, of course, and it is not part of your original question anyhow.

How to deal with code that runs before foreach block in Apache Spark?

I'm trying to deal with some code that runs differently on Spark stand-alone mode and Spark running on a cluster. Basically, for each item in an RDD, I'm trying to add it to a list, and once this is done, I want to send this list to Solr.
This works perfectly fine when I run the following code in stand-alone mode of Spark, but does not work when the same code is run on a cluster. When I run the same code on a cluster, it is like "send to Solr" part of the code is executed before the list to be sent to Solr is filled with items. I try to force the execution by solrInputDocumentJavaRDD.collect(); after foreach, but it seems like it does not have any effect.
// For each RDD
solrInputDocumentJavaDStream.foreachRDD(
    new Function<JavaRDD<SolrInputDocument>, Void>() {
        @Override
        public Void call(JavaRDD<SolrInputDocument> solrInputDocumentJavaRDD) throws Exception {
            // For each item in a single RDD
            solrInputDocumentJavaRDD.foreach(
                new VoidFunction<SolrInputDocument>() {
                    @Override
                    public void call(SolrInputDocument solrInputDocument) {
                        // Add the solrInputDocument to the list of SolrInputDocuments
                        SolrIndexerDriver.solrInputDocumentList.add(solrInputDocument);
                    }
                });
            // Try to force execution
            solrInputDocumentJavaRDD.collect();
            // After having finished adding every SolrInputDocument to the list
            // add it to the solrServer, and commit, waiting for the commit to be flushed
            try {
                if (SolrIndexerDriver.solrInputDocumentList != null
                        && SolrIndexerDriver.solrInputDocumentList.size() > 0) {
                    SolrIndexerDriver.solrServer.add(SolrIndexerDriver.solrInputDocumentList);
                    SolrIndexerDriver.solrServer.commit(true, true);
                    SolrIndexerDriver.solrInputDocumentList.clear();
                }
            } catch (SolrServerException | IOException e) {
                e.printStackTrace();
            }
            return null;
        }
    }
);
What should I do, so that sending-to-Solr part executes after the list of SolrDocuments are added to solrInputDocumentList (and works also in cluster mode)?

As I mentioned on the Spark Mailing list:
I'm not familiar with the Solr API but provided that 'SolrIndexerDriver' is a singleton, I guess that what's going on when running on a cluster is that the call to:
SolrIndexerDriver.solrInputDocumentList.add(elem)
is happening on different singleton instances of the SolrIndexerDriver on different JVMs while
SolrIndexerDriver.solrServer.commit
is happening on the driver.
In practical terms, the lists on the executors are being filled-in but they are never committed and on the driver the opposite is happening.
The recommended way to handle this is to use foreachPartition like this:
rdd.foreachPartition{iter =>
// prepare connection
Stuff.connect(...)
// add elements
iter.foreach(elem => Stuff.add(elem))
// submit
Stuff.commit()
}
This way you can add the data of each partition and commit the results in the local context of each executor. Be aware that this add/commit must be thread safe in order to avoid data loss or corruption.
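For readers working in Java rather than Scala, the connect/add/commit pattern above might look roughly like this. It is a plain-Java sketch without Spark or Solr: Client, RecordingClient and writePartition are hypothetical names, and a real client would wrap an actual Solr or database connection.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

class PerPartitionCommitSketch {

    /** Hypothetical client interface; a real one would wrap e.g. a Solr or JDBC connection. */
    interface Client<T> extends AutoCloseable {
        void add(T element);

        void commit();

        @Override
        void close();
    }

    /** What would run inside foreachPartition: open a client locally, add everything, commit once. */
    static <T> void writePartition(Iterator<T> partition, Supplier<Client<T>> connect) {
        try (Client<T> client = connect.get()) {
            while (partition.hasNext()) {
                client.add(partition.next());
            }
            client.commit();
        }
    }

    /** In-memory client used only to make the sketch runnable. */
    static class RecordingClient<T> implements Client<T> {
        final List<T> pending = new ArrayList<>();
        final List<T> committed;

        RecordingClient(List<T> committed) {
            this.committed = committed;
        }

        @Override
        public void add(T element) {
            pending.add(element);
        }

        @Override
        public void commit() {
            committed.addAll(pending);
            pending.clear();
        }

        @Override
        public void close() {
            pending.clear(); // anything not committed is dropped
        }
    }

    public static void main(String[] args) {
        List<String> store = new ArrayList<>();
        for (List<String> partition : List.of(List.of("a", "b"), List.of("c"))) {
            writePartition(partition.iterator(), () -> new RecordingClient<>(store));
        }
        System.out.println(store);
    }
}
```

Creating the client inside the function, instead of sharing a static list or connection, keeps all state on the executor that processes the partition, which is the issue the answer identifies in the original code.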

Have you checked the Spark UI to see the execution plan of this job?
Check how it is split into stages and what their dependencies are. That should hopefully give you an idea.
