I'm trying to join two DataStreams by ID and found that there are two API sets that can do so:
https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/stream/operators/joining.html
https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/table/sql/queries.html#joins
It seems both of them can get the job done.
So my questions are:
What are the main differences? How should I choose?
If I join streams A and B, and both have a lot of records (e.g. A: 10000, B: 20000), is every record in one stream compared to every record in the other, one by one? Is the total number of comparisons 10000 x 20000?
Moreover, are there any cases (maybe a network issue) where stream B is delayed, so that some records in stream B are never compared against stream A?
Thanks.
What are the main differences? How to choose?
There are several different APIs that can be used to implement joins with Flink. You'll find a survey of the different approaches in the Apache Flink developer training materials shared by Ververica, at https://training.ververica.com/decks/joins/?mode=presenter (behind the registration form). Disclaimer: I wrote these training materials.
To summarize:
The low-level building block for implementing streaming joins is the KeyedCoProcessFunction. Using this directly makes sense in special cases where having complete control is valuable, but for most purposes you're better off using a higher-level API.
The DataSet API offers batch joins implemented as hash joins, sort-merge joins, and broadcast joins. This API has been soft deprecated, and will ultimately be replaced by a combination of bounded streaming and Flink's relational APIs (SQL/Table).
The DataStream API only offers some time windowed and interval joins. It doesn't support any joins where unbounded state retention might be required.
The SQL/Table API supports a wide range of both batch and streaming joins:
STREAMING & BATCH
Time-Windowed and Interval INNER + OUTER JOIN
Non-windowed INNER + OUTER JOIN
STREAMING ONLY
Time-versioned INNER JOIN
External lookup INNER JOIN
The SQL optimizer is able to reason about state no longer needed because of temporal constraints. But some streaming joins do have the potential to require unbounded state to produce fully correct results; a state retention policy can be put in place to clear out stale entries that are unlikely to be needed.
Note that the Table API is fully interoperable with the DataStream API. I would use SQL/Table joins wherever possible, as they are much simpler to implement and are very well optimized.
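For example, an interval join can be expressed directly in SQL through the Table API. Here is a minimal sketch, assuming tables A and B are already registered with event-time attributes named rowtime and share an id column (the table and column names are placeholders):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

// ... register tables A and B here (e.g. from Kafka connectors) ...

Table joined = tEnv.sqlQuery(
    "SELECT A.id, A.v AS a_value, B.v AS b_value " +
    "FROM A, B " +
    "WHERE A.id = B.id " +
    "AND B.rowtime BETWEEN A.rowtime - INTERVAL '1' HOUR AND A.rowtime");

Because the join is constrained to a one-hour interval, the state needed for it is bounded, and the planner can free it as the watermarks advance.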
If I join streams A and B, and both have a lot of records (e.g. A: 10000, B: 20000), is every record in one stream compared to every record in the other, one by one? Is the total number of comparisons 10000 x 20000?
Flink supports equi-key joins, where for some specific key, you want to join records from streams A and B having the same value for that key. If there are 10000 records from A and 20000 records from B all having the same key, then yes, an unconstrained join of A and B will produce 10000x20000 results.
But I don't believe that's what you meant. Flink will materialize distributed hash tables in its managed state, which will be sharded across the cluster (by key). For example, as a new record arrives from stream A it will be hashed into the build-side hash table for A, and the corresponding hash table for B will be probed to find matching records -- and all suitable results will be emitted.
Note that this is done in parallel. But all events from both A and B for a specific key will be processed by the same instance.
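If it helps to see those mechanics spelled out, here is a minimal hand-rolled sketch of the build-and-probe idea using the KeyedCoProcessFunction mentioned above. The record types A and B and their getKey() accessors are hypothetical, and it omits the state retention / cleanup a real job would need:

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;

public class NaiveEquiJoin extends KeyedCoProcessFunction<String, A, B, Tuple2<A, B>> {
    private transient ListState<A> aSide;   // records from A, sharded by key
    private transient ListState<B> bSide;   // records from B, sharded by key

    @Override
    public void open(Configuration parameters) {
        aSide = getRuntimeContext().getListState(new ListStateDescriptor<>("a-side", A.class));
        bSide = getRuntimeContext().getListState(new ListStateDescriptor<>("b-side", B.class));
    }

    @Override
    public void processElement1(A a, Context ctx, Collector<Tuple2<A, B>> out) throws Exception {
        aSide.add(a);                              // build side for A
        Iterable<B> matches = bSide.get();         // probe the other side
        if (matches != null) {
            for (B b : matches) {
                out.collect(Tuple2.of(a, b));
            }
        }
    }

    @Override
    public void processElement2(B b, Context ctx, Collector<Tuple2<A, B>> out) throws Exception {
        bSide.add(b);
        Iterable<A> matches = aSide.get();
        if (matches != null) {
            for (A a : matches) {
                out.collect(Tuple2.of(a, b));
            }
        }
    }
}

It would be wired up with something like streamA.keyBy(A::getKey).connect(streamB.keyBy(B::getKey)).process(new NaiveEquiJoin()).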
Moreover, are there any cases (maybe a network issue) where stream B is delayed, so that some records in stream B are never compared against stream A?
If you are doing event time processing in combination with a time-windowed or interval join as provided by the SQL/Table API, then late events (as determined by the watermarking) won't be considered, and the results will be incomplete. With the DataStream API it is possible to implement special handling for late events, such as sending them to a side output, or retracting and updating the results.
For joins without temporal constraints, delayed events are processed normally whenever they arrive. The results are (eventually) complete.
We have an app server, which processes a large volume of incoming objects.
One of its functions is to group those objects into groups, based on a bespoke collection of grouping keys that depends on the object type.
E.g., there is a grouping rule table that says:
Object type 1: grouping keys are col1, col2, col4, col5
Object type 2: grouping keys are col2, col3 ...
Originally, we had a singleton server, and the problem was solved by having an in-memory index mapping an object type + grouping key string to a group ID. Then we had synchronized code that would check whether the index contained an entry for the grouping keys of a given object. If so, the object got its group ID from the cache; otherwise, we assigned the group ID of the object to be its object ID and stored it in the cache.
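A minimal sketch of that single-server check-then-assign logic (the names here are hypothetical; the real code used synchronized access to an in-memory index):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class GroupIndex {
    private final ConcurrentMap<String, String> groupIdByGroupingKey = new ConcurrentHashMap<>();

    // groupingKey = object type + concatenated grouping-key values
    String resolveGroupId(String groupingKey, String objectId) {
        // The first object seen with this grouping key donates its object ID as the group ID;
        // every later object with the same grouping key reuses it.
        return groupIdByGroupingKey.computeIfAbsent(groupingKey, k -> objectId);
    }
}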
This worked well... until the server was re-designed from a singleton into several distributed server instances using an Ignite cache to store the data (including the grouping cache).
Due to the inherent slowness of the Ignite solution, race conditions were introduced: the synchronization mechanism that prevented them in the singleton could not cope with the slowness of Ignite (transactions are too slow).
What can be done to solve this problem in a distributed setting, avoiding either race conditions (which produce different group IDs for objects that should be in the same group), or, even worse, false-positive grouping (e.g. grouping 2 objects that should be in different groups)?
Constraints:
A pure hashing function cannot be used, due to the risk of hash key collisions. The grouping may not have false positives, ever (e.g. assigning the same group ID to objects that should not be grouped together). Imagine that this could lead to loss of PII, or other high risk - so no matter how good the hashing function is and how rare collisions are, they are still unacceptable.
The solution must be realtime, since the grouping data is used in other functionality of the server within seconds, or possibly fractions of a second, of processing an object. So if a post-processing sweep that re-groups things "correctly" is introduced with a latency of 30 seconds, that risks 30 seconds of group-level updates being applied to incorrect group memberships.
Maintaining individual lists of group IDs synchronized between server instances via a messaging system is not acceptable due to the high volume of data (e.g. 5 servers * 1 million objects would mean sending 4 million group ID updates). That was the whole point of having an Ignite cache.
Technical constraints: Java server instances, running on distinct Linux servers. They use a homegrown MQ-like messaging system to talk to each other in general, and an Ignite cache cluster to store shared data instead of a local in-memory cache (which is the source of the problem).
The performance of Ignite can't cause a race condition. It doesn't matter whether an update takes a microsecond or a minute; a race condition is a synchronisation issue.
In any case, reading and writing a bunch of records in one unit says "transaction." Ignite supports distributed transactions.
try (Transaction tx = ignite.transactions().txStart()) {
    Object existing = cache1.get(groupingKey);   // keys and values here are placeholders
    cache2.put(objectId, groupId);
    cache2.put(groupId, updatedMembers);
    tx.commit();                                 // all reads and writes commit as one atomic unit
} catch (Exception e) {
    // if commit() was never reached, closing the transaction rolls it back
}
If you can't use transactions you need to either handle locking "manually" (which is probably going to be slower) or make your group ID predictable. For example, your key for a group could be a concatenation of columns 1, 2, 4 and 5.
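A minimal sketch of that deterministic-key idea (the method name and separator are hypothetical, and it assumes the separator cannot occur in the data), so every server derives the same group ID with no coordination:

static String groupIdFor(String objectType, String... groupingValues) {
    // Concatenating the raw values (rather than hashing them) makes collisions impossible,
    // at the cost of longer keys.
    return objectType + "|" + String.join("|", groupingValues);
}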
But really this is a data modelling question that may not be a good fit for Stack Overflow.
I have an architecture question regarding the union of more than two streams in Apache Flink.
We have three, and sometimes more, streams that are a kind of code book with which we have to enrich the main stream.
The code book streams are compacted Kafka topics. Code books are something that doesn't change very often, e.g. currencies. The main stream is a fast event stream.
Our goal is to enrich the main stream with the code books.
There are three possible ways as I see it to do it:
Make a union of all the code books and then join it with the main stream, storing the enrichment data as managed, keyed state (so that when the compacted events from Kafka expire, I still have the code books saved in state). This is currently the only way that I have tried.
I deserialized the Kafka topic messages, which are in JSON, into POJOs, e.g. Currency, OrganizationUnit and so on.
I made one big wrapper class CodebookData with all the code books, e.g.:
public class CodebookData {
    private Currency currency;
    private OrganizationUnit organizationUnit;
    ...
}
Next I mapped the incoming stream of every Kafka topic to this wrapper class and then made a union:
DataStream<CodebookData> enrichedStream = mappedCurrency.union(mappedOrgUnit).union(mappedCustomer);
When I print CodebookData it is populated like this:
CodebookData{
Currency{populated with data},
OrganizationUnit=null,
Customer=null
}
CodebookData{
Currency=null,
OrganizationUnit={populated with data},
Customer=null
}
...
Here I stopped, because I have a problem with how to connect this code book stream with the main stream and save the code book data in value state. I do not have a unique foreign key in my CodebookData, because every code book has its own foreign key that connects with the main stream, e.g. Currency has currencyId, OrganizationUnit has orgId, and so on.
E.g. I want to do something like this:
SingleOutputStreamOperator<CanonicalMessage> enrichedMainStream = mainStream
.connect(enrichedStream)
.keyBy(?????)
.process(new MyKeyedCoProcessFunction());
and in MyKeyedCoProcessFunction I would create a ValueState of type CodebookData.
Is this totally wrong, or can I do something with this? And if it is doable, what am I doing wrong?
The second approach would be cascading a series of two-input CoProcessFunction operators, one per Kafka event source, but I read somewhere that this is not an optimal approach.
The third approach is broadcast state, which I am not very familiar with. For now I see one problem: if I am using RocksDB for checkpointing and savepointing, I am not sure that I can then use broadcast state.
Should I use some approach other than approach no. 1, which I am currently struggling with?
In many cases where you need to do several independent enrichment joins like this, a better pattern to follow is to use a fan-in / fan-out approach, and perform all of the joins in parallel.
Something like this: after making sure each event on the main stream has a unique ID, you create 3 or more copies of each event.
Then you can key each copy by whatever is appropriate -- the currency, the organization unit, and so on (or customer, IP address, and merchant, in the example this pattern was taken from) -- then connect it to the appropriate codebook stream, and compute each of the 2-way joins independently.
Then union together these parallel join result streams, keyBy the random nonce you added to each of the original events, and glue the results together.
Now in the case of three streams, this may be overly complex. In that case I might just do a series of three 2-way joins, one after another, using keyBy and connect each time. But at some point, as they get longer, pipelines built that way tend to run into performance / checkpointing problems.
There's an example implementing this fan-in/fan-out pattern in https://gist.github.com/alpinegizmo/5d5f24397a6db7d8fabc1b12a15eeca6.
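Purely as a structural sketch of that shape (every helper and accessor here is hypothetical, and the gist above is the worked version), the pipeline might look roughly like this:

// give each main-stream event a unique join ID for the later fan-in
DataStream<CanonicalMessage> withIds = mainStream
        .map(msg -> { msg.setJoinId(UUID.randomUUID().toString()); return msg; });

// fan out: one 2-way join per code book, each keyed by its own foreign key
DataStream<PartialResult> byCurrency =
        joinWith(withIds.keyBy(CanonicalMessage::getCurrencyId),
                 mappedCurrency.keyBy(Currency::getCurrencyId));
DataStream<PartialResult> byOrgUnit =
        joinWith(withIds.keyBy(CanonicalMessage::getOrgId),
                 mappedOrgUnit.keyBy(OrganizationUnit::getOrgId));
DataStream<PartialResult> byCustomer =
        joinWith(withIds.keyBy(CanonicalMessage::getCustomerId),
                 mappedCustomer.keyBy(Customer::getCustomerId));

// fan in: key the partial results by the join ID and glue them back together
DataStream<CanonicalMessage> enriched = byCurrency
        .union(byOrgUnit, byCustomer)
        .keyBy(PartialResult::getJoinId)
        .process(new GlueTogether());   // a KeyedProcessFunction that waits for all three parts

Here joinWith(...) stands for one keyBy/connect/KeyedCoProcessFunction 2-way join, and GlueTogether is a hypothetical function that buffers partial results in keyed state until all of them have arrived.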
I am supposed to join some huge SQL tables with the JSON output of some REST services by some common key (we are talking about multiple SQL tables with a few REST service calls). The thing is, this data is not a real-time/infinite stream, and I also don't think I could order the output of the REST services by the join columns. Now the silly way would be to bring in all the data and then match the rows, but that would imply storing everything in memory or in some storage like Cassandra or Redis.
But I was wondering if Flink could use some kind of stream window to join, say, X elements at a time (so really only keep those elements in RAM at any point), while also storing the unmatched elements for a later match in maybe some kind of hash map. This is what I mean by a smart join.
The devil is in the details, but yes, in principle this kind of data enrichment is quite doable with Flink. Your requirements aren't entirely clear, but I can provide some pointers.
For starters you will want to acquaint yourself with Flink's managed state interfaces. Using these interfaces will ensure your application is fault tolerant, upgradeable, rescalable, etc.
If you wanted to simply preload some data, then you might use a RichFlatMapFunction and load the data in its open() method. In your case a CoProcessFunction might be more appropriate. This is a streaming operator with two inputs that can hold state and also has access to timers (which can be used to expire state that is no longer needed, and to emit results after waiting for out-of-order data to arrive).
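As a minimal sketch of the simpler preloading variant (Order, EnrichedOrder, and Reference are hypothetical stand-ins for your SQL/REST record types, and loadReferenceData() is a placeholder for the actual fetch):

import java.util.HashMap;
import java.util.Map;
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class PreloadedEnricher extends RichFlatMapFunction<Order, EnrichedOrder> {
    private transient Map<String, Reference> referenceByKey;

    @Override
    public void open(Configuration parameters) throws Exception {
        // one REST/SQL fetch per parallel instance at startup
        referenceByKey = loadReferenceData();
    }

    @Override
    public void flatMap(Order order, Collector<EnrichedOrder> out) {
        Reference ref = referenceByKey.get(order.getJoinKey());
        if (ref != null) {
            out.collect(new EnrichedOrder(order, ref));   // unmatched records could go to a side output instead
        }
    }

    private Map<String, Reference> loadReferenceData() {
        // call the REST services / query the SQL tables here and build an in-memory index
        return new HashMap<>();   // placeholder
    }
}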
Flink also has support for asynchronous i/o, which can make working with external services more efficient.
One could also consider approaching this with Flink's higher level SQL and Table APIs, by wrapping the REST service calls as user-defined functions.
I am looking for a way to do a SQL-like lead/lag function in Google Dataflow/Beam. In my case, if done in SQL, it would be something like:
lead(balance, 1) over(partition by orderId order by order_Date)
In Beam, we parse the input text file and create a class Client_Orders to hold the data. For simplicity, let's say we have orderId, order_Date and balance members in this class. And we create partitions by orderId by constructing KVs in PCollections:
PCollection<KV<String, Iterable<Client_Orders>>> mainCollection = pipeline
    .apply(TextIO.Read.named("Reading input file")
        .from(options.getInputFilePath()))
    .apply(ParDo.named("Extracting client order terms from file")
        .of(new ExtractClientOrdersFn()))    // a DoFn<String, KV<String, Client_Orders>> keyed by orderId; body elided here
    .apply(GroupByKey.<String, Client_Orders>create());
In Beam, I know we can do windowing, but that generally requires setting a window size as a duration, e.g. Window.into(FixedWindows.of(Duration.standardDays(n))), and that doesn't seem to help in this case. Should I iterate through the PCollection using order_Date?
If your data is too large per-key to sort in memory, you will want the Beam "sorter" extension.
I will explain:
In Beam (hence Dataflow) the elements of a PCollection are unordered. This supports the unified programming model whereby the same data yields the same output whether it arrives as a real-time stream or is read from stored files. It also supports isolated failure recovery, provides robustness to network delays, etc.
In many years of massive-scale data processing, almost all uses of global order have turned out to be non-useful, in part because anyone who needs scalability finds a different way to achieve their goals. And even if a global ordering exists, processing does not occur in order (because it is parallel), so the global ordering would be lost almost immediately. So global ordering is not on the roadmap.
The kind of ordering you need, though, is per key. This is common and useful and often known as "value sorting". When a GroupByKey operation yields the grouped values for a key (an element of type KV<K, Iterable<V>>) there is often a benefit to a user-defined order for the values. Since it is sorting within a single element the order is preserved as the element travels through your pipeline. And it is not necessarily prohibitively expensive to sort the values - often the very same operation that groups by key can be leveraged to also sort the values as they are being grouped. This is on the Beam roadmap, but not yet part of the Beam model.
So, for now, there is the above Java-based extension that can sort the values for you.
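A rough sketch of how that sorter extension can be used (this assumes the current Beam SDK rather than the older Dataflow 1.x SDK shown in the question; keyedByOrderIdAndDate and the date encoding are hypothetical):

import org.apache.beam.sdk.extensions.sorter.BufferedExternalSorter;
import org.apache.beam.sdk.extensions.sorter.SortValues;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// keyedByOrderIdAndDate: PCollection<KV<String, KV<String, Client_Orders>>>, where the outer key
// is the orderId and the inner key is an order-preserving date string (e.g. "2017-03-01"),
// since the sorter compares secondary keys by their encoded bytes.
PCollection<KV<String, Iterable<KV<String, Client_Orders>>>> sortedByDate =
    keyedByOrderIdAndDate
        .apply(GroupByKey.<String, KV<String, Client_Orders>>create())
        .apply(SortValues.<String, String, Client_Orders>create(BufferedExternalSorter.options()));

A downstream DoFn can then walk each per-order Iterable in date order, remembering the previous element's balance, to produce the lag/lead value for each record.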
I want to fetch data from two different entities in JPA. I am using Google Datastore with App Engine to store my data in cloud storage. Now what I want is to fetch data from two different entities by making use of a join query. As I am new to App Engine and Datastore, I don't know how to do that. I referred to this link and it says that Datastore doesn't support joins properly. Is that true? Please guide me to solve this problem. Thank you.
There are ample places where it is stated clearly that GAE/Datastore does not do "join queries", such as https://developers.google.com/appengine/docs/java/datastore/jdo/overview-dn2
If instead you are using google-cloud-sql (why did you tag this question as SQL?), then I suggest you update your question to state that.
How to join records when your data store does not: write a join in the client application code. Warning - depending on the data, doing this might cost a lot of overhead. This is a straw man answer designed to justify the real answer in the final paragraph.
Conceptually, your application could implement a nested loop join as follows. Choose the entity whose expected record count is lowest for the outer loop. Create a query to iterate over those records. Within the iterator loop for each record, copy the fields used for joining into variables, then create an inner nested query that takes these variables as parameters. Iterate over the records produced by the inner query, and for each inner record, produce a record of output using data from both the inner and the outer current entities.
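A minimal sketch of that nested loop, using the low-level Datastore API (the kind names, property names, and the emit(...) helper are hypothetical):

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Query;

DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

// Outer loop: the Kind expected to have fewer entities.
Query outer = new Query("Order");
for (Entity order : ds.prepare(outer).asIterable()) {
    Object joinValue = order.getProperty("customerId");

    // Inner query, parameterized by the join value taken from the outer entity.
    Query inner = new Query("Customer")
        .setFilter(new Query.FilterPredicate("customerId", Query.FilterOperator.EQUAL, joinValue));
    for (Entity customer : ds.prepare(inner).asIterable()) {
        // produce one output record combining fields from both entities
        emit(order, customer);   // emit(...) is a placeholder for your own handling
    }
}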
Because an external nested loop join is such a bad idea, you should really consider redesigning your current schema to produce the results you are after without requiring a join at all. Start by just imagining the output that you want coming directly out of entities of just one Kind. That usually means letting go of relational normal forms. After you have designed appropriate NoSQL structures that can deliver the required outputs, you should then design appropriate NoSQL algorithms to write the data that way.