How does Flink's Collector.collect() handle data? - java

I'm trying to understand what Flink's Collector.collect() does and how it handles incoming/outgoing data:
Example taken from Flink DataSet API:
The following code transforms a DataSet of text lines into a DataSet of words:
DataSet<String> output = input.flatMap(new Tokenizer());
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.util.Collector;

public class Tokenizer implements FlatMapFunction<String, String> {
    @Override
    public void flatMap(String value, Collector<String> out) {
        // split the line into tokens and emit each one downstream
        for (String token : value.split("\\W")) {
            out.collect(token);
        }
    }
}
So the text lines get split into tokens and each of them gets "collected". As intuitive as that sounds, I'm missing the actual dynamics behind Collector.collect(). Where is the collected data stored before it gets assigned to output, i.e. does Flink put it in some sort of buffer? And if so, how is the data transferred over the network?

From the official source code documentation:
Collects a record and forwards it. The collector is the "push" counterpart of the {@link java.util.Iterator}, which "pulls" data in.
So the collector receives a value and forwards (pushes) one or more values to the next operator. How those records are buffered and shipped over the network is a matter of Flink's network stack and its buffers.
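To make the push model concrete, here is a minimal, purely illustrative Collector implementation that just appends records to an in-memory list. Flink's real runtime collectors instead hand the record to the next chained operator or serialize it into network buffers, but the calling pattern from a user function is the same.

import java.util.ArrayList;
import java.util.List;
import org.apache.flink.util.Collector;

// Illustrative only: a Collector that "pushes" records into an in-memory list.
// Flink's runtime collectors forward the record to the next chained operator
// or serialize it into an output buffer instead of keeping it in a list.
public class ListCollector<T> implements Collector<T> {
    private final List<T> buffer = new ArrayList<>();

    @Override
    public void collect(T record) {
        buffer.add(record);   // real collectors forward/serialize here
    }

    @Override
    public void close() {
        // nothing to release in this sketch
    }

    public List<T> getBuffer() {
        return buffer;
    }
}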

Related

Apache Flink : Add side inputs for DataStream API

In my Java application, I have three DataStreams. For example, one stream's data is consumed from Kafka and another stream's data is consumed from Apache NiFi. The object types of these two streams differ: Stream-1's object type is Person and Stream-2's object type is Address.
The third one is a broadcast stream (its data is consumed from Kafka).
Now I want to combine Stream-1 and Stream-2 in a job class and split them in the task's process element. How can I implement this?
Note:
Stream-1 is the main stream and Stream-2 is the side input. The main stream continuously fetches data from Kafka. For the side input, all table data is loaded from the DB while the application starts up, and new data is read whenever the table data is updated (not frequently).
Sample structure:
DataStream<Person> stream1 = env.addSource(...);                  // reads data from Kafka
DataStream<Address> stream2 = env.addSource(...);                 // reads data from NiFi
BroadcastStream<String> broadcastStream = stream3.broadcast(...); // stream-3 reads data from Kafka
I was referred to the following links:
FLIP-17 Side Inputs for DataStream API
jira/browse/FLINK-6131
My Use case is :
Join stream with slowly evolving data: The side input that we use for enriching is evolving over time (data is read from the DB). This can be done by waiting for some initial data to be available before processing the main input and then continuously ingesting new data into the internal side-input structure as it arrives.
Based on the latest response, the recommendation by @Arvid was in fact what was needed here.
Core of the answer:
You can easily join stream1 and stream2 even if they have different types. Then you can add the broadcast to the result.
Links to the documentation and example, and a relevant snippet from the docs (the example is too long to be included here):
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
...
DataStream<Integer> orangeStream = ...
DataStream<Integer> greenStream = ...

orangeStream.join(greenStream)
    .where(<KeySelector>)
    .equalTo(<KeySelector>)
    .window(TumblingEventTimeWindows.of(Time.milliseconds(2)))
    .apply(new JoinFunction<Integer, Integer, String>() {
        @Override
        public String join(Integer first, Integer second) {
            return first + "," + second;
        }
    });
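And for the second step of the answer, adding the broadcast to the joined result, a rough sketch could look like the following. The state descriptor, the String types, and the enrichment logic are assumptions for illustration only; "joined" stands for the DataStream produced by the join above and "controlStream" for the third Kafka stream.

import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;

// Hypothetical wiring: descriptor for the broadcast state, broadcast the control stream,
// then connect it to the joined stream and process both sides.
MapStateDescriptor<String, String> descriptor =
        new MapStateDescriptor<>("control", BasicTypeInfo.STRING_TYPE_INFO, BasicTypeInfo.STRING_TYPE_INFO);

BroadcastStream<String> broadcast = controlStream.broadcast(descriptor);

DataStream<String> enriched = joined
        .connect(broadcast)
        .process(new BroadcastProcessFunction<String, String, String>() {
            @Override
            public void processElement(String value, ReadOnlyContext ctx, Collector<String> out) throws Exception {
                // read the broadcast state via ctx.getBroadcastState(descriptor) and enrich the record
                out.collect(value);
            }

            @Override
            public void processBroadcastElement(String value, Context ctx, Collector<String> out) throws Exception {
                // update the broadcast state with the new control/reference data
                ctx.getBroadcastState(descriptor).put(value, value);
            }
        });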

How to flatMap to database in Apache Flink?

I am using Apache Flink to get JSON records from Kafka into InfluxDB, splitting each JSON record into multiple InfluxDB points in the process.
I found the flatMap transform and it feels like it fits the purpose. Core code looks like this:
DataStream<InfluxDBPoint> dataStream = stream.flatMap(new FlatMapFunction<JsonConsumerRecord, InfluxDBPoint>() {
    @Override
    public void flatMap(JsonConsumerRecord record, Collector<InfluxDBPoint> out) throws Exception {
        Iterator<Entry<String, JsonNode>> iterator = //...
        while (iterator.hasNext()) {
            // extract point from input
            InfluxDBPoint point = //...
            out.collect(point);
        }
    }
});
For some reason, I only get one of those collected points streamed into the database.
Even when I print out all mapped entries, it seems to work just fine: dataStream.print() yields:
org.apache.flink.streaming.connectors.influxdb.InfluxDBPoint#144fd091
org.apache.flink.streaming.connectors.influxdb.InfluxDBPoint#57256d1
org.apache.flink.streaming.connectors.influxdb.InfluxDBPoint#28c38504
org.apache.flink.streaming.connectors.influxdb.InfluxDBPoint#2d3a66b3
Am I misunderstanding flatMap or might there be some bug in the Influx connector?
The problem was actually related to the fact that a series in Influx (defined by its measurement and tag set, as seen here) can only hold one point per timestamp. So even though my fields differed, the final point overwrote all previous points that had the same time value.
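A sketch of one way to avoid that overwrite, assuming the Bahir connector's InfluxDBPoint(measurement, timestamp, tags, fields) constructor: give each extracted point a distinguishing tag (or a unique timestamp) so points sharing the same time no longer collapse into the same series. The tag and field names below, and the measurement/timestamp variables, are made up for illustration and would come from the input record.

import java.util.HashMap;
import java.util.Map;
import org.apache.flink.streaming.connectors.influxdb.InfluxDBPoint;

// Inside the flatMap loop: tag each point with the JSON entry's key so that
// points sharing the same timestamp land in different series (assumption).
Entry<String, JsonNode> entry = iterator.next();

Map<String, String> tags = new HashMap<>();
tags.put("entry", entry.getKey());               // distinguishing tag (hypothetical name)

Map<String, Object> fields = new HashMap<>();
fields.put("value", entry.getValue().asText());  // hypothetical field name

// measurement and timestamp are assumed to be derived from the input record
InfluxDBPoint point = new InfluxDBPoint(measurement, timestamp, tags, fields);
out.collect(point);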

In a Kafka Streams application, is there a way to define a topology with a wildcard list of output topics?

I have a multi-schema Kafka Streams application that enriches a record via a join to a KTable, and then passes the enriched record along.
The input topic naming format is currently well defined but I'm changing this to a wildcard. I want to determine the input topic of each record, derive the output topic via regex replacement, and send it on.
E.g., while listening to event.raw.*, a record comes in on event.raw.foo and I wish to pass it out on event.foo.
I realise I can get the input topics via the Processor API:
public class EnrichmentProcessor extends AbstractProcessor<String, GenericRecord> {
    @Override
    public void process(String key, GenericRecord value) {
        // Do join...
        // Determine output topic and forward
        String outputTopic = context().topic().replaceFirst(".raw.", ".");
        context().forward(key, value, To.child(outputTopic));
        context().commit();
    }
}
But this doesn't help me when I'm trying to define my Topology because I have no way of knowing up front what my output topic is going to be.
InternalTopologyBuilder topologyBuilder = new InternalTopologyBuilder();

topologyBuilder.addSource("SOURCE", stringDeserializer, genericRecordDeserializer, "event.raw.*")
    .addProcessor("ENRICHER", EnrichmentProcessor::new, "SOURCE")
    .addSink("OUTPUT", outputTopic, stringSerializer, genericRecordSerializer, "ENRICHER"); // How can I register all possible output topics here?
Has anyone solved a situation like this before?
I know that if I had a list of possible output-topic names up front I could define multiple sinks on the topology, but I won't have one.
Is there a way to define the topology with dynamically allocated output topic names when I don't have a hard-coded list of possible output topic names up front?
This should be possible: you can use Topology#addSink(..., new TopicNameExtractor(){...}, ...) to set the output topic name dynamically. TopicNameExtractor has access to the RecordContext, which allows you to get the input topic name via context.topic(). Hence, you should be able to compute the output topic name based on the input topic name.
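A minimal sketch of how that could look, using the public Topology API rather than InternalTopologyBuilder. The serializers/deserializers and the EnrichmentProcessor are the ones from the question; the topic pattern and the regex in the extractor mirror the processor above and are assumptions.

import java.util.regex.Pattern;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.processor.TopicNameExtractor;

// Compute the sink topic per record from the input topic in the RecordContext.
TopicNameExtractor<String, GenericRecord> outputTopicExtractor =
        (key, value, recordContext) -> recordContext.topic().replaceFirst("\\.raw\\.", ".");

Topology topology = new Topology();
topology.addSource("SOURCE", stringDeserializer, genericRecordDeserializer, Pattern.compile("event\\.raw\\..*"))
        .addProcessor("ENRICHER", EnrichmentProcessor::new, "SOURCE")
        .addSink("OUTPUT", outputTopicExtractor, stringSerializer, genericRecordSerializer, "ENRICHER");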

Java SPARK saveAsTextFile NULL

JavaRDD<Text> tx = counts2.map(new Function<Object, Text>() {
    @Override
    public Text call(Object o) throws Exception {
        // TODO Auto-generated method stub
        if (o.getClass() == Dict.class) {
            Dict rkd = (Dict) o;
            return new Text(rkd.getId());
        } else {
            return null;
        }
    }
});
tx.saveAsTextFile("/rowkey/Rowkey_new");
I am new to Spark. I want to save this file, but I got a null exception. I don't want to replace return null with return new Text(), because that would insert a blank line into my file. How can I solve this problem?
Instead of putting an if condition in your map, you can simply use that condition to build an RDD filter. The Spark Quick Start is a good place to start. There is also a nice overview of other transformations and actions.
Basically your code can look as follows (if you are using Java 8):
counts2
    .filter(o -> o instanceof Dict)
    .map(o -> new Text(((Dict) o).getId()))
    .saveAsTextFile("/rowkey/Rowkey_new");
You had the intention to map one incoming record to either zero or one outgoing record, which cannot be done with a map. However, filter maps each incoming record to zero or one outgoing record (where the outgoing record is identical to the incoming one), and flatMap gives you more flexibility by allowing you to map to zero or more outgoing records of any type; see the sketch below.
It is strange, but not inconceivable, that you create non-Dict objects that are going to be filtered out further downstream anyhow. Possibly you can push your filter even further upstream to make sure you only create Dict instances. Without knowing the rest of your code, this is only an assumption of course, and not part of your original question anyhow.
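For completeness, a rough sketch of the flatMap variant just mentioned, assuming Spark 2.x or later (where FlatMapFunction.call returns an Iterator). It emits zero or one Text per input record, so no nulls reach saveAsTextFile.

import java.util.Collections;
import org.apache.hadoop.io.Text;

// flatMap emits zero or one Text per input record: non-Dict objects produce nothing.
JavaRDD<Text> tx = counts2.flatMap(o -> o instanceof Dict
        ? Collections.singleton(new Text(((Dict) o).getId())).iterator()
        : Collections.<Text>emptyIterator());

tx.saveAsTextFile("/rowkey/Rowkey_new");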

Reading documents from the Couchbase bucket as batches

I have a Couchbase cluster which has around 25M documents. I am able to read them sequentially and also I have a function that can read a specific number of documents from the database. But my use case is slightly different since I cannot store all the 25M documents (each document is huge) in memory.
I need to process the documents in batches, say 1M/batch, push that batch to my memory, (do some operation on those documents) and push the next batch.
The function I have written to read a specific number of documents doesn't ensure that it returns a different set of documents when called again.
Is there a way I can achieve this? I also have a function which can create documents in batches; I am not sure if I can write a similar function that reads documents in batches. The function is given below.
public void createMultipleCustomerDocuments(String docId, Customer myCust, long numDocs) {
    Gson gson = new GsonBuilder().create();
    JsonObject content = JsonObject.fromJson(gson.toJson(myCust));
    JsonDocument document = JsonDocument.create(docId, content);
    jsonDocuments.add(document);
    documentCounter++;

    if (documentCounter == numDocs) {
        Observable.from(jsonDocuments).flatMap(new Func1<JsonDocument, Observable<JsonDocument>>() {
            public Observable<JsonDocument> call(final JsonDocument docToInsert) {
                return theBucket.async().upsert(docToInsert);
            }
        }).last().toBlocking().single();
        documentCounter = 0;
        //System.out.println("Batch counter: " + batchCounter++);
    }
}
Can someone please help me with this?
I would try to create a view which contains all of the documents, and then query the view with skip and limit. (You can use the .startKey() and .startKeyDocId() functions instead of skip() to avoid the overhead.)
But remember not to keep that view in a production environment; it will be a CPU hog.
Another option is to use the DCP protocol to replicate the database into your app, but that is more work.
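A rough sketch of the skip/limit pagination over such a view, assuming the Couchbase Java SDK 2.x ViewQuery API. The design document and view names ("docs", "all_docs") and the bucket reference (theBucket) are placeholders, not part of the original question.

import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.view.ViewQuery;
import com.couchbase.client.java.view.ViewResult;
import com.couchbase.client.java.view.ViewRow;

// Placeholder names: design document "docs", view "all_docs", 1M docs per batch.
int batchSize = 1_000_000;
int offset = 0;

while (true) {
    ViewResult result = theBucket.query(
            ViewQuery.from("docs", "all_docs").skip(offset).limit(batchSize));

    int rowsInBatch = 0;
    for (ViewRow row : result.allRows()) {
        JsonDocument doc = row.document();   // fetches the full document for this row
        // ... process the document ...
        rowsInBatch++;
    }

    if (rowsInBatch < batchSize) {
        break;                               // last (possibly partial) batch processed
    }
    offset += batchSize;                     // note: skip() gets expensive for deep offsets;
                                             // startKey()/startKeyDocId() avoids that overhead
}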
