I have an architecture question regarding the union of more than two streams in Apache Flink.
We have three, and sometimes more, streams that act as code books with which we have to enrich a main stream.
The code book streams are compacted Kafka topics. Code books are something that doesn't change very often, e.g. currencies. The main stream is a fast event stream.
Our goal is to enrich the main stream with the code books.
As I see it, there are three possible ways to do it:
Make a union of all code books, then join it with the main stream and store the enrichment data as managed, keyed state (so that when the compacted events from Kafka expire I still have the code books saved in state). This is the only approach I have tried so far.
I deserialized the Kafka topic messages, which are in JSON, to POJOs, e.g. Currency, OrganizationUnit, and so on.
I made one big wrapper class CodebookData with all code books, e.g.:
public class CodebookData {
    private Currency currency;
    private OrganizationUnit organizationUnit;
    ...
}
Next I mapped the incoming stream of every Kafka topic to this wrapper class and then made a union:
DataStream<CodebookData> enrichedStream = mappedCurrency.union(mappedOrgUnit).union(mappedCustomer);
When I print CodebookData it is populated like this:
CodebookData{
  Currency={populated with data},
  OrganizationUnit=null,
  Customer=null
}
CodebookData{
  Currency=null,
  OrganizationUnit={populated with data},
  Customer=null
}
...
Here I got stuck, because I have a problem with how to connect this codebook stream with the main stream and save the codebook data in value state. I do not have a single, unique foreign key in my codebook data, because every codebook has its own foreign key that connects to the main stream, e.g. Currency has currencyId, OrganizationUnit has orgId, and so on.
E.g. I want to do something like this:
SingleOutputStreamOperator<CanonicalMessage> enrichedMainStream = mainStream
.connect(enrichedStream)
.keyBy(?????)
.process(new MyKeyedCoProcessFunction());
and in MyKeyedCoProcessFunction I would create a ValueState of type CodebookData.
Is this totally wrong, or can I do something with this? If it is doable, what am I doing wrong?
The second approach is cascading a series of two-input CoProcessFunction operators, one for every Kafka event source, but I read somewhere that this is not an optimal approach.
The third approach is broadcast state, which I am not very familiar with. For now I see one problem: if I am using RocksDB for checkpointing and savepointing, I am not sure I can then use broadcast state.
Should I use some other approach than approach no. 1, which I am currently struggling with?
In many cases where you need to do several independent enrichment joins like this, a better pattern to follow is to use a fan-in / fan-out approach, and perform all of the joins in parallel.
Something like this: after making sure each event on the main stream has a unique ID, you create 3 or more copies of each event.
Then you can key each copy by whatever is appropriate -- the currency, the organization unit, and so on (or customer, IP address, and merchant in the example I took this figure from) -- then connect it to the appropriate codebook stream, and compute each of the 2-way joins independently.
Then union together these parallel join result streams, keyBy the unique ID (random nonce) you added to each of the original events, and glue the results together.
Now in the case of three streams, this may be overly complex. In that case I might just do a series of three 2-way joins, one after another, using keyBy and connect each time. But at some point, as they get longer, pipelines built that way tend to run into performance / checkpointing problems.
There's an example implementing this fan-in/fan-out pattern in https://gist.github.com/alpinegizmo/5d5f24397a6db7d8fabc1b12a15eeca6.
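For illustration only, here is a rough sketch of that fan-out / fan-in wiring in the DataStream API. All of the class names here (MainEvent, Currency, OrganizationUnit, PartialEnrichment, the per-codebook KeyedCoProcessFunctions, and the final glue function) are hypothetical placeholders, not taken from the gist:
// Fan out: join each copy of the main stream with one codebook, independently.
DataStream<PartialEnrichment> withCurrency = mainStream
    .keyBy(MainEvent::getCurrencyId)
    .connect(currencyStream.keyBy(Currency::getCurrencyId))
    .process(new EnrichWithCurrency());    // KeyedCoProcessFunction keeping the latest Currency in ValueState

DataStream<PartialEnrichment> withOrgUnit = mainStream
    .keyBy(MainEvent::getOrgId)
    .connect(orgUnitStream.keyBy(OrganizationUnit::getOrgId))
    .process(new EnrichWithOrgUnit());

// Fan in: bring the partial results for the same original event back together.
DataStream<CanonicalMessage> enriched = withCurrency
    .union(withOrgUnit)
    .keyBy(PartialEnrichment::getEventId)  // the unique ID added to each main-stream event
    .process(new GlueEnrichments());       // buffers partials in keyed state until all joins have arrived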
Related
I have joined two Kafka streams using a windowed join (JoinWindows), after which two changelog topics are generated:
{consumer-group}--KSTREAM-JOINOTHER-0000000005-store-changelog
{consumer-group}--KSTREAM-JOINTHIS-0000000004-store-changelog
1. What is the purpose of each of the two topics?
2. What is the data stored by them? Is it a key-value pair?
3. Is there a way to query these internal topics and get the number of events present in them?
I could pass the internal store to a processor and access it there; I tried using a window store, but it doesn't have a function to get the number of events processed.
Code:
KeyValueStore<Object, Object> stateStore =
    (KeyValueStore<Object, Object>) processorContext.getStateStore("KSTREAM-JOINTHIS-0000000005-store");
stateStore.approximateNumEntries();
This fails with:
org.apache.kafka.streams.processor.internals.AbstractReadWriteDecorator$WindowStoreReadWriteDecorator cannot be cast to class org.apache.kafka.streams.state.KeyValueStore
I want to get the total count of events entering the stream across the various partitions, and the number of joins that took place.
1. What is the purpose of each of the two topics?
Those are the changelog topics that back the state stores involved in the join. Each "side" of the join will look in the other side's state store for a match on a key, i.e., perform a join.
2. What is the data stored by them? Is it a key-value pair?
Yes, changelog topics are Kafka topics, and they store the key-value pairs from the state stores for durability.
3. Is there a way to query these internal topics and get the number of events present in them?
You could point a consumer at those topics to inspect the records. But what is the use case for doing so? Also, you'd want to ensure you never produce records to those topics.
The state stores for joins aren't accessible via interactive queries. If you want to get an idea of the number of joins, you could add an operator after the join that maintains a running counter and logs it, or something similar.
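As a hedged illustration of that last suggestion (not from the original answer), a peek() stage after the join can keep a rough counter of join results; the topic names and joiner here are made up, and the counter only covers the stream threads of a single instance:
import java.time.Duration;
import java.util.concurrent.atomic.AtomicLong;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> left = builder.stream("left-topic");
KStream<String, String> right = builder.stream("right-topic");

// Counts join results produced by this instance only.
AtomicLong joinCount = new AtomicLong();

left.join(right,
        (leftValue, rightValue) -> leftValue + "|" + rightValue,
        JoinWindows.of(Duration.ofMinutes(5)))
    .peek((key, value) -> {
        long n = joinCount.incrementAndGet();
        if (n % 1000 == 0) {
            System.out.printf("joins so far on this instance: %d%n", n);
        }
    })
    .to("joined-output-topic");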
I am trying to join two DataStreams by ID, and found there are two API sets that can do so:
https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/stream/operators/joining.html
https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/table/sql/queries.html#joins
It seems both of them can get the job done.
So my questions are:
What are the main differences? How do I choose?
If I join stream A and B, and both have a lot of records (e.g. A: 10000, B: 20000), are all records in the two streams compared to each other one by one? Is the total number of comparisons 10000 x 20000?
Moreover, is there any case (maybe a network issue) where stream B is delayed, and some records in stream B are not compared to stream A?
Thanks.
What are the main differences? How do I choose?
There are several different APIs that can be used to implement joins with Flink. You'll find a survey of the different approaches in the Apache Flink developer training materials shared by Ververica, at https://training.ververica.com/decks/joins/?mode=presenter (behind the registration form). Disclaimer: I wrote these training materials.
To summarize:
The low-level building block for implementing streaming joins is the KeyedCoProcessFunction. Using this directly makes sense in special cases where having complete control is valuable, but for most purposes you're better off using a higher-level API.
The DataSet API offers batch joins implemented as hash joins, sort-merge joins, and broadcast joins. This API has been soft deprecated, and will ultimately be replaced by a combination of bounded streaming and Flink's relational APIs (SQL/Table).
The DataStream API only offers time-windowed and interval joins. It doesn't support joins where unbounded state retention might be required.
The SQL/Table API supports a wide range of both batch and streaming joins:
STREAMING & BATCH
Time-Windowed and Interval INNER + OUTER JOIN
Non-windowed INNER + OUTER JOIN
STREAMING ONLY
Time-versioned INNER JOIN
External lookup INNER JOIN
The SQL optimizer is able to reason about state no longer needed because of temporal constraints. But some streaming joins do have the potential to require unbounded state to produce fully correct results; a state retention policy can be put in place to clear out stale entries that are unlikely to be needed.
Note that the Table API is fully interoperable with the DataStream API. I would use SQL/Table joins wherever possible, as they are much simpler to implement and are very well optimized.
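For a concrete flavor of the SQL/Table route, here is a small sketch I'm adding (not from the training materials); the view names A and B, the columns, and the 15-minute bound are all made up:
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

// Assumes the two DataStreams have already been registered as views, e.g. via
// tableEnv.createTemporaryView("A", streamA, ...) and tableEnv.createTemporaryView("B", streamB, ...).

// Interval join: only pairs whose rowtimes are within 15 minutes of each other are joined,
// which lets the planner reason about when state for old rows can be dropped.
Table joined = tableEnv.sqlQuery(
    "SELECT A.id, A.v AS a_value, B.v AS b_value " +
    "FROM A JOIN B ON A.id = B.id " +
    "AND A.rowtime BETWEEN B.rowtime - INTERVAL '15' MINUTE " +
    "              AND B.rowtime + INTERVAL '15' MINUTE");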
If I join stream A and B, and both have a lot of records (e.g. A: 10000, B: 20000), are all records in the two streams compared to each other one by one? Is the total number of comparisons 10000 x 20000?
Flink supports equi-key joins, where for some specific key, you want to join records from streams A and B having the same value for that key. If there are 10000 records from A and 20000 records from B all having the same key, then yes, an unconstrained join of A and B will produce 10000x20000 results.
But I don't believe that's what you meant. Flink will materialize distributed hash tables in its managed state, sharded across the cluster by key. For example, as a new record arrives from stream A, it will be hashed into the build-side hash table for A, and the corresponding hash table for B will be probed to find matching records -- and all suitable results will be emitted.
Note that this is done in parallel. But all events from both A and B for a specific key will be processed by the same instance.
Moreover, is there any case (maybe a network issue) where stream B is delayed, and some records in stream B are not compared to stream A?
If you are doing event time processing in combination with a time-windowed or interval join as provided by the SQL/Table API, then late events (as determined by the watermarking) won't be considered, and the results will be incomplete. With the DataStream API it is possible to implement special handling for late events, such as sending them to a side output, or retracting and updating the results.
For joins without temporal constraints, delayed events are processed normally whenever they arrive. The results are (eventually) complete.
I am using Apache Beam to read messages from PubSub and write them to BigQuery. What I'm trying to do is write to multiple tables according to the information in the input. To reduce the amount of writes, I am using windowing on the input from PubSub.
A small example:
messages
.apply(new PubsubMessageToTableRow(options))
.get(TRANSFORM_OUT)
.apply(ParDo.of(new CreateKVFromRow()))
.apply(Window.into(FixedWindows.of(Duration.standardMinutes(10L))))
// group by key
.apply(GroupByKey.create())
// Are these two rows what I want?
.apply(Values.create())
.apply(Flatten.iterables())
.apply(BigQueryIO.writeTableRows()
.withoutValidation()
.withCreateDisposition(CreateDisposition.CREATE_NEVER)
.withWriteDisposition(WriteDisposition.WRITE_APPEND)
.withExtendedErrorInfo()
.withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
.to((SerializableFunction<ValueInSingleWindow<TableRow>, TableDestination>) input -> {
// Simplified for readability
Integer destination = (Integer) input.getValue().get("key");
return new TableDestination(
new TableReference()
.setProjectId(options.getProjectID())
.setDatasetId(options.getDatasetID())
.setTableId(destination + "_Table"),
"Table Destination");
}));
I couldn't find anything in the documentation, but I was wondering how many writes are done per window. If these go to multiple tables, is it one write per table for all elements in the window? Or is it one write per element, as the table might be different for each element?
Since you're using Pub/Sub as a source, your job seems to be a streaming job. Therefore, the default insertion method is STREAMING_INSERTS (see docs). I don't see any benefit or reason to reduce writes with this method, as billing is based on the size of the data. By the way, your example is not really reducing writes effectively anyway.
Although it is a streaming job, the FILE_LOADS method has also been supported for a few versions now. If withMethod is set to FILE_LOADS, you can define withTriggeringFrequency on BigQueryIO. This setting defines the frequency at which the load jobs happen. Here the connector handles everything for you, and you don't need to group by key or window the data. A load job will be started for each table.
Since it seems to be totally fine for you if it takes some time until your data is in BigQuery, I'd suggest using FILE_LOADS, as loading is free as opposed to streaming inserts. Just mind the quotas when defining the triggering frequency.
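A hedged sketch of what that could look like for the pipeline above; tableDestinationFn stands in for the same dynamic-destination SerializableFunction from the question, and the shard count and frequency are only illustrative:
messages
    .apply(new PubsubMessageToTableRow(options))
    .get(TRANSFORM_OUT)
    // No CreateKVFromRow / GroupByKey / windowing needed: the connector batches the rows itself.
    .apply(BigQueryIO.writeTableRows()
        .withoutValidation()
        .withCreateDisposition(CreateDisposition.CREATE_NEVER)
        .withWriteDisposition(WriteDisposition.WRITE_APPEND)
        .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
        .withTriggeringFrequency(Duration.standardMinutes(10))  // one batch of load jobs every ~10 minutes
        .withNumFileShards(10)   // typically required together with a triggering frequency on a streaming pipeline
        .to(tableDestinationFn));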
I am supposed to join some huge SQL tables with the JSON output of some REST services by some common key (we are talking about multiple SQL tables and a few REST service calls). The thing is, this data is not a real-time/infinite stream, and I also don't think I could order the output of the REST services by the join columns. The naive way would be to bring in all the data and then match the rows, but that would imply storing everything in memory or in some storage like Cassandra or Redis.
But I was wondering if Flink could use some kind of stream window to join, say, X elements at a time (so it really only stores those elements in RAM at any point), while also storing the unmatched elements for a later match, maybe in some kind of hash map. This is what I mean by a smart join.
The devil is in the details, but yes, in principle this kind of data enrichment is quite doable with Flink. Your requirements aren't entirely clear, but I can provide some pointers.
For starters you will want to acquaint yourself with Flink's managed state interfaces. Using these interfaces will ensure your application is fault tolerant, upgradeable, rescalable, etc.
If you wanted to simply preload some data, then you might use a RichFlatMapFunction and load the data in the open() method. In your case a CoProcessFunction might be more appropriate. This is a streaming operator with two inputs that can hold state and also has access to timers (which can be used to expire state that is no longer needed, and to emit results after waiting for out-of-order data to arrive).
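A minimal sketch of that idea, using the keyed variant (KeyedCoProcessFunction) so the enrichment data can live in ValueState. The Event, Reference, and EnrichedEvent types and the getKey()-based keyBy are assumptions for illustration, and buffering/timers are omitted:
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;

// Wiring (types are hypothetical):
//   eventStream.keyBy(Event::getKey)
//       .connect(referenceStream.keyBy(Reference::getKey))
//       .process(new EnrichmentFunction());
public class EnrichmentFunction
        extends KeyedCoProcessFunction<String, Event, Reference, EnrichedEvent> {

    private transient ValueState<Reference> referenceState;

    @Override
    public void open(Configuration parameters) {
        referenceState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("reference", Reference.class));
    }

    @Override
    public void processElement1(Event event, Context ctx, Collector<EnrichedEvent> out)
            throws Exception {
        Reference ref = referenceState.value();
        if (ref != null) {
            out.collect(new EnrichedEvent(event, ref));
        }
        // else: a real job might buffer the event in state and use a timer to wait
        // for out-of-order reference data before emitting or dropping it.
    }

    @Override
    public void processElement2(Reference ref, Context ctx, Collector<EnrichedEvent> out)
            throws Exception {
        referenceState.update(ref);
    }
}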
Flink also has support for asynchronous i/o, which can make working with external services more efficient.
One could also consider approaching this with Flink's higher level SQL and Table APIs, by wrapping the REST service calls as user-defined functions.
I am looking for a way to do a SQL-like lead/lag function in Google Dataflow/Beam. In my case, if done in SQL, it would be something like:
lead(balance, 1) over(partition by orderId order by order_Date)
In Beam, we parse the input text file and create a class Client_Orders to hold the data. For simplicity, let's say we have orderId, order_Date, and balance members in this class. And we create partitions by orderId by constructing KVs in PCollections:
PCollection<KV<String, Iterable<Client_Orders>>> mainCollection = pipeline
    .apply(TextIO.Read.named("Reading input file").from(options.getInputFilePath()))
    .apply(ParDo.named("Extracting client order terms from file") /* ... to produce Client_Orders objects ... */)
    .apply("create KV...", GroupByKey.<String, Client_Orders>create());
In Beam, I know we can do windowing, but that generally requires setting a window size in terms of a duration, e.g. FixedWindows.of(Duration.standardDays(n)), which doesn't seem to help in this case. Should I iterate through the PCollection using order_Date?
If your data is too large per-key to sort in memory, you will want the Beam "sorter" extension.
I will explain:
In Beam (hence Dataflow) the elements of a PCollection are unordered. This supports the unified programming model whereby the same data yields the same output whether it arrives as a real-time stream or is read from stored files. It also supports isolated failure recovery, provides robustness to network delays, etc.
In many years of massive-scale data processing, almost all uses of global order have turned out to be non-useful, in part because anyone who needs scalability finds a different way to achieve their goals. And even if global ordering exists, processing does not occur in order (because it is parallel), so global ordering would be lost almost immediately. So global ordering is not on the roadmap.
The kind of ordering you need, though, is per key. This is common and useful and often known as "value sorting". When a GroupByKey operation yields the grouped values for a key (an element of type KV<K, Iterable<V>>) there is often a benefit to a user-defined order for the values. Since it is sorting within a single element the order is preserved as the element travels through your pipeline. And it is not necessarily prohibitively expensive to sort the values - often the very same operation that groups by key can be leveraged to also sort the values as they are being grouped. This is on the Beam roadmap, but not yet part of the Beam model.
So, for now, there is the above Java-based extension that can sort the values for you.
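For illustration, here is a hedged sketch of how the sorter extension's SortValues transform could be applied to this case; the secondary key (order_Date as a lexicographically sortable string) and the getOrderId()/getOrderDate() getters are my own assumptions, not from the question:
import org.apache.beam.sdk.extensions.sorter.BufferedExternalSorter;
import org.apache.beam.sdk.extensions.sorter.SortValues;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// clientOrders is assumed to be a PCollection<Client_Orders> (the output of the extraction step).
// order_Date must be encoded so that its byte-wise order matches chronological order (e.g. "yyyy-MM-dd").
PCollection<KV<String, Iterable<KV<String, Client_Orders>>>> sortedPerOrder = clientOrders
    .apply("Key by orderId, with order_Date as secondary key",
        ParDo.of(new DoFn<Client_Orders, KV<String, KV<String, Client_Orders>>>() {
            @ProcessElement
            public void process(ProcessContext c) {
                Client_Orders o = c.element();
                c.output(KV.of(o.getOrderId(), KV.of(o.getOrderDate(), o)));
            }
        }))
    .apply(GroupByKey.<String, KV<String, Client_Orders>>create())
    .apply(SortValues.<String, String, Client_Orders>create(BufferedExternalSorter.options()));

// A downstream DoFn can then walk each sorted Iterable and pair every element with the next
// element's balance -- effectively lead(balance, 1) over (partition by orderId order by order_Date).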