Kafka Streams, branched output to multiple topics - Java

In my DSL-based transformation I have a stream --> branch, where I want the branched output redirected to multiple topics. The current branch.to() method accepts only a single String.
Is there any simple option with stream.branch where I can route the result to multiple topics? With a consumer, I can subscribe to multiple topics by providing an array of strings as topics.
My problem requires me to take multiple actions when a particular predicate is satisfied.
I tried stream.branch[index].to(string), but this is not sufficient for my requirement. I am looking for something like stream.branch[index].to(array of topic strings).
Does a branch.to method that accepts multiple topics exist, or is there an alternate way to achieve the same with streams?
Adding sample code below (actual variable names removed).
My predicates:
Predicate<String, MyDomainObject> predicate1 = new Predicate<String, MyDomainObject>() {
    @Override
    public boolean test(String key, MyDomainObject domObj) {
        boolean result = false;
        // result is set by a condition on domObj
        return result;
    }
};
Predicate<String, MyDomainObject> predicate2 = new Predicate<String, MyDomainObject>() {
    @Override
    public boolean test(String key, MyDomainObject domObj) {
        boolean result = false;
        // result is set by a condition on domObj
        return result;
    }
};
KStream<String, MyDomainObject>[] branches = myStream.branch(
        predicate1, predicate2
);
// here I need your suggestions.
// this is my current implementation
branches[0].to(singleTopic,
        Produced.with(Serdes.String(), Serdes.serdeFrom(inSer, deSer)));
// I want to send notifications to multiple topics, something like below:
branches[0].to(topicList,
        Produced.with(Serdes.String(), Serdes.serdeFrom(inSer, deSer)));

If you know to which topics you want to send the data, you can do the following:
branches[0].to("first-topic");
branches[0].to("second-topic");
// etc.
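Since to() is a sink that can be called any number of times on the same KStream, you can also drive it from your topic list. A minimal sketch, reusing the Produced serdes from your snippet (the topic names here are placeholders):
List<String> topicList = Arrays.asList("first-topic", "second-topic");
for (String topic : topicList) {
    branches[0].to(topic, Produced.with(Serdes.String(), Serdes.serdeFrom(inSer, deSer)));
}
Each call adds another sink node to the topology, so every record in branches[0] is written to every topic in the list.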

Related

Can this code be reduced using Java 8 Streams?

I want to use Java 8 lambdas and streams to reduce the amount of code in the following method that produces an Optional. Is it possible?
My code:
protected Optional<String> getMediaName(Participant participant) {
    for (ParticipantDevice device : participant.getDevices()) {
        if (device.getMedia() != null && StringUtils.isNotEmpty(device.getMedia().getMediaType())) {
            String mediaType = device.getMedia().getMediaType().toUpperCase();
            Map<String, String> mediaToNameMap = config.getMediaMap();
            if (mediaToNameMap.containsKey(mediaType)) {
                return Optional.of(mediaToNameMap.get(mediaType));
            }
        }
    }
    return Optional.empty();
}
Yes. Assume the following class hierarchy (I used records here):
record Media(String getMediaType) {
}
record ParticipantDevice(Media getMedia) {
}
record Participant(List<ParticipantDevice> getDevices) {
}
It is pretty self-explanatory. Unless you have an empty string as a key, you don't need, imo, to check for it in your search. The main difference here is that once the map entry is found, Optional.map is used to return the value instead of the key.
I also checked this against your loop version and it behaves the same.
public static Optional<String> getMediaName(Participant participant) {
    Map<String, String> mediaToNameMap = config.getMediaMap();
    return participant.getDevices().stream()
            .map(ParticipantDevice::getMedia)
            .filter(Objects::nonNull)
            .map(media -> media.getMediaType().toUpperCase())
            .filter(mediaToNameMap::containsKey)
            .findFirst()
            .map(mediaToNameMap::get);
}
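A quick hypothetical check with the records above, assuming config.getMediaMap() is stubbed to return Map.of("AUDIO", "Audio"):
Participant p = new Participant(
        List.of(new ParticipantDevice(new Media("audio"))));
// getMediaName(p)                          => Optional[Audio]
// getMediaName(new Participant(List.of())) => Optional.empty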
Firstly, since the map of media types returned by config.getMediaMap() doesn't depend on a particular device, it makes sense to obtain it before processing the collection of devices. I.e., regardless of the approach (imperative or declarative), do it outside the loop, or before creating the stream, to avoid generating the same map multiple times.
To implement this method with streams, you need the filter() operation, which expects a Predicate, to apply the conditional logic, and map() to perform the transformation of stream elements.
To get the first element that matches the conditions, apply findFirst(), which produces an optional result, as the terminal operation.
protected Optional<String> getMediaName(Participant participant) {
    Map<String, String> mediaToNameMap = config.getMediaMap();
    return participant.getDevices().stream()
            .filter(device -> device.getMedia() != null
                    && StringUtils.isNotEmpty(device.getMedia().getMediaType()))
            .map(device -> device.getMedia().getMediaType().toUpperCase())
            .filter(mediaToNameMap::containsKey)
            .map(mediaToNameMap::get)
            .findFirst();
}

Apache Beam Split to Multiple Pipeline Output

I am subscribing to one topic that contains different event types, which arrive with different attributes.
After I read an element, I need to route it, based on its attributes, to different places. This is what the sample code looks like:
Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
Pipeline pipeline = Pipeline.create(options);
pipeline
.apply(
"ReadType1",
EventIO.<T>readJsons()
.of(T.class)
.withPubsubTimestampAttributeName(null)
.withOptions(options))
.apply(
Filter.by(
new SerializableFunction<T, Boolean>() {
@Override
public Boolean apply(T input) {
return input.attributes.get("type").equals("type1");
}
}))
.apply(
"WindowMetrics",
Window.into(FixedWindows.of(Duration.standardSeconds(options.getWindowDuration()))))
.apply("AsJsons", AsJsons.of(T.class))
.apply(
"Write File(s)",
TextIO.write()
.withWindowedWrites()
.withNumShards(options.getNumShards())
.to(
new WindowedFilenamePolicy(
options.getRunOutputDirectory(),
options.getUseCurrentDateForOutputDirectory(),
options.getOutputFilenamePrefix(),
options.getOutputShardTemplate(),
options.getOutputFilenameSuffix()))
.withTempDirectory(
NestedValueProvider.of(
options.getTempDirectory(),
(SerializableFunction<String, ResourceId>)
input -> FileBasedSink.convertToFileResourceIfPossible(input))));
pipeline.apply("ReadType2",
EventIO.<T>readJsons().of(T.class)
.withPubsubTimestampAttributeName(null)
.withOptions(options))
.apply(Filter.by(new SerializableFunction<T, Boolean>() {
@Override
public Boolean apply(T input) {
return input.attributes.get("type").equals("type2");
}
})).apply( "WindowMetrics",
Window.into(FixedWindows.of(Duration.standardSeconds(options.getWindowDuration()))))
.apply("AsJsons", AsJsons.of(T.class))
.apply(
"Write File(s)",
TextIO.write()
.withWindowedWrites()
.withNumShards(options.getNumShards())
.to(
new WindowedFilenamePolicy(
options.getBatchOutputDirectory(),
options.getUseCurrentDateForOutputDirectory(),
options.getOutputFilenamePrefix(),
options.getOutputShardTemplate(),
options.getOutputFilenameSuffix()))
.withTempDirectory(
NestedValueProvider.of(
options.getTempDirectory(),
(SerializableFunction<String, ResourceId>)
input -> FileBasedSink.convertToFileResourceIfPossible(input))));
pipeline.apply("ReadType3",
EventIO.<T>readJsons().of(T.class)
.withPubsubTimestampAttributeName(null)
.withOptions(options))
.apply(Filter.by(new SerializableFunction<T, Boolean>() {
@Override
public Boolean apply(T input) {
return input.attributes.get("type").equals("type3");
}
})).apply( "WindowMetrics",
Window.into(FixedWindows.of(Duration.standardSeconds(options.getWindowDuration()))))
.apply("AsJsons", AsJsons.of(T.class))
.apply(
"Write File(s)",
TextIO.write()
.withWindowedWrites()
.withNumShards(options.getNumShards())
.to(
new WindowedFilenamePolicy(
options.getCustomIntervalOutputDirectory(),
options.getUseCurrentDateForOutputDirectory(),
options.getOutputFilenamePrefix(),
options.getOutputShardTemplate(),
options.getOutputFilenameSuffix()))
.withTempDirectory(
NestedValueProvider.of(
options.getTempDirectory(),
(SerializableFunction<String, ResourceId>)
input -> FileBasedSink.convertToFileResourceIfPossible(input))));
pipeline.run();
Basically I read the events, filter them on their attributes, and write files. The job failed in Dataflow with: Workflow failed. Causes: The pubsub configuration contains errors: Subscription 'sub-name' is consumed by multiple stages, this will result in undefined behavior.
So what is the appropriate way to split the pipeline within the same job?
I tried Pipeline1, Pipeline2, Pipeline3, but that ends up needing multiple job names to run the pipelines, and I am not sure that is the right way to do it.
The multiple EventIO read transforms on the same subscription are the cause of the error. You need to eliminate all but one of those transforms in order for this to work. This can be done by consuming the subscription into a single PCollection and then applying the filtering branches to that collection individually. Here is a partial example:
// single PCollection of the events consumed from the subscription
PCollection<T> events = pipeline
        .apply("Read Events",
                EventIO.<T>readJsons()
                        .of(T.class)
                        .withPubsubTimestampAttributeName(null)
                        .withOptions(options));

// PCollection of type1 events
PCollection<T> typeOneEvents = events.apply(
        Filter.by(new SerializableFunction<T, Boolean>() {
            @Override
            public Boolean apply(T input) {
                return input.attributes.get("type").equals("type1");
            }
        }));
// TODO typeOneEvents.apply("WindowMetrics / AsJsons / Write File(s)")

// PCollection of type2 events
PCollection<T> typeTwoEvents = events.apply(
        Filter.by(new SerializableFunction<T, Boolean>() {
            @Override
            public Boolean apply(T input) {
                return input.attributes.get("type").equals("type2");
            }
        }));
// TODO typeTwoEvents.apply("WindowMetrics / AsJsons / Write File(s)")
Another possibility is to use some of the other transforms provided by Apache Beam; doing so might simplify your solution a little. One such transform is Partition. Partition allows the splitting of a single PCollection into a fixed number of PCollections based on a partitioning function. A partial example using Partition:
// list of PCollections of the consumed events, split by type
PCollectionList<T> eventsByType = pipeline
        .apply("Read Events",
                EventIO.<T>readJsons()
                        .of(T.class)
                        .withPubsubTimestampAttributeName(null)
                        .withOptions(options))
        .apply("Partition By Type",
                Partition.of(2, new PartitionFn<T>() {
                    @Override
                    public int partitionFor(T event, int numPartitions) {
                        return event.attributes.get("type").equals("type1") ? 0 : 1;
                    }
                }));

PCollection<T> typeOneEvents = eventsByType.get(0);
// TODO typeOneEvents.apply("WindowMetrics / AsJsons / Write File(s)")
PCollection<T> typeTwoEvents = eventsByType.get(1);
// TODO typeTwoEvents.apply("WindowMetrics / AsJsons / Write File(s)")
The answer is to use Partition in Beam:
https://beam.apache.org/documentation/transforms/java/elementwise/partition/
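Since the question actually has three event types, the same idea extends to a three-way split. A hypothetical sketch, assuming the events PCollection from the first example above:
PCollectionList<T> byType = events.apply("Partition By Type",
        Partition.of(3, new PartitionFn<T>() {
            @Override
            public int partitionFor(T event, int numPartitions) {
                String type = event.attributes.get("type");
                if (type.equals("type1")) return 0;
                if (type.equals("type2")) return 1;
                return 2; // type3 (and anything unexpected)
            }
        }));
PCollection<T> typeThreeEvents = byType.get(2);
// TODO typeThreeEvents.apply("WindowMetrics / AsJsons / Write File(s)")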

Processing multiple Flink CEP patterns in parallel on one data stream

I have the following use case.
One machine sends event streams to Kafka; they are received by a CEP engine, where warnings are generated when conditions on the stream data are satisfied.
FlinkKafkaConsumer011<Event> kafkaSource = new FlinkKafkaConsumer011<Event>(kafkaInputTopic, new EventDeserializationSchema(), properties);
DataStream<Event> eventStream = env.addSource(kafkaSource);
The Event POJO contains id, name, time, and ip.
The machine sends huge amounts of data to Kafka, and there are 35 unique event names (name1, name2, ..., name35). I want to detect patterns for each event-name combination (name1 co-occurring with name2, name1 co-occurring with name3, etc.). In total that is 1225 combinations.
The Rule POJO contains e1Name and e2Name.
List<Rule> ruleList contains the 1225 rules.
for (Rule rule : ruleList) {
    Pattern<Event, ?> warningPattern = Pattern.<Event>begin("start")
            .where(new SimpleCondition<Event>() {
                @Override
                public boolean filter(Event value) throws Exception {
                    return value.getName().equals(rule.getE1Name());
                }
            })
            .followedBy("next")
            .where(new SimpleCondition<Event>() {
                @Override
                public boolean filter(Event value) throws Exception {
                    return value.getName().equals(rule.getE2Name());
                }
            })
            .within(Time.seconds(30));
    PatternStream<Event> patternStream = CEP.pattern(eventStream, warningPattern);
}
Is this the correct way to execute multiple patterns on one data stream, or is there a more optimized way to achieve it? With the above approach we get PartitionNotFoundException, UnknownTaskExecutorException, and memory issues.
IMO you don't need patterns to achieve your goal. You can apply a stateful map function to the source, which maps event names into pairs (the latest two names). After that, window the source to 30 seconds and apply the simple WordCount example.
The stateful map function can be something like this (it accepts only the event name; you need to adapt it to your input, e.g. extract the event name first):
public class TupleMap implements MapFunction<String, Tuple2<String, Integer>> {
    Tuple2<String, String> latestTuple = new Tuple2<String, String>();

    public Tuple2<String, Integer> map(String value) throws Exception {
        this.latestTuple.f0 = this.latestTuple.f1;
        this.latestTuple.f1 = value;
        return new Tuple2<String, Integer>(this.latestTuple.f0 + this.latestTuple.f1, 1);
    }
}
The result, with event-name pairs and occurrence counts as tuples, can then be obtained like this (and written to a Kafka sink, maybe?):
DataStream<Tuple2<String, Integer>> source = stream.map(new TupleMap());
SingleOutputStreamOperator<Tuple2<String, Integer>> sum = source.keyBy(0).timeWindow(Time.seconds(30)).sum(1);
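A hypothetical wiring against the question's eventStream (class and field names assumed from the question). Note that TupleMap keeps its pairing state per parallel instance and it is not checkpointed, so this sketch assumes parallelism 1 for the map step:
DataStream<Tuple2<String, Integer>> source = eventStream
        .map(Event::getName)   // extract the event name, as noted above
        .map(new TupleMap());  // pair each name with the previous one
SingleOutputStreamOperator<Tuple2<String, Integer>> sum =
        source.keyBy(0).timeWindow(Time.seconds(30)).sum(1);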

Iterate through collection, perform action on each item and return as List

Is there any way to do this with the Java 8 Stream API?
I need to transform each item of a collection to another type (DTO mapping) and return them all as a list.
Something like
Collection<OriginObject> from = response.getContent();
DtoMapper dto = new DtoMapper();
List<DestObject> to = from.stream().forEach(item -> dto.map(item)).collect(Collectors.toList());

public class DtoMapper {
    public DestObject map(OriginObject object) {
        return //conversion;
    }
}
Thank you in advance
Update #1: the only stream object is response.getContent()
I think you're after the following:
List<DestObject> result = response.getContent()
        .stream()
        .map(dto::map)
        .collect(Collectors.toList());
// do something with result if you need.
Note that forEach is a terminal operation that returns nothing, so the chain cannot continue after it. Use it if you want to do something with each object (such as print it). If you want to continue the chain of calls, perhaps with further filtering or by collecting into a list, you should use map.
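To illustrate the distinction (assuming the result list from above):
// forEach is for side effects only; the pipeline ends here:
result.forEach(System.out::println);
// map transforms each element and keeps the pipeline alive for collect():
List<String> names = result.stream()
        .map(Object::toString)
        .collect(Collectors.toList());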

Remove Element From Map Using Filter

I have a java.util.Map inside an rx.Observable, and I want to filter the map (remove an element) based on a given key.
My current code is a mix of imperative and functional style; I want to accomplish this goal without the call to isItemInDataThenRemove.
public static Observable<Map<String, Object>> filter(Map<String, Object> data, String removeKey) {
    return Observable.from(data).filter((entry) -> isItemInDataThenRemove(entry, removeKey));
}

private static boolean isItemInDataThenRemove(Map<String, Object> data, String removeKey) {
    for (Map.Entry<String, Object> entry : data.entrySet()) {
        if (entry.getKey().equalsIgnoreCase(removeKey)) {
            System.out.printf("Found element %s, removing.", removeKey);
            data.remove(removeKey);
            return true;
        }
    }
    return false;
}
The code you have proposed has a general problem in that it modifies the underlying collection while operating on it. This conflicts with the requirement of non-interference for streams, and in practice it often means you will get a ConcurrentModificationException when a stream pipeline removes objects from the underlying container.
In any case (as I learned yesterday), there is a default method on the Collection interface that does pretty much exactly what you want:
private static boolean isItemInDataThenRemove(Map<String, Object> data, String removeKey) {
    return data.entrySet().removeIf(entry -> entry.getKey().equalsIgnoreCase(removeKey));
}
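For reference, removeIf returns whether any entries were removed, so the boolean contract of the original method is preserved. A quick hypothetical check:
Map<String, Object> data = new HashMap<>();
data.put("MyKey", new Object());
boolean removed = isItemInDataThenRemove(data, "mykey"); // true, and data is now empty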
WORKING CODE:
private static boolean isItemInDataThenRemove(Map<String, Object> data, String removeKey) {
    // Caution: this removes from the map while streaming over its own entry set,
    // which is exactly the interference described above and can throw
    // ConcurrentModificationException; the removeIf version is the safer choice.
    data.entrySet().stream()
            .filter(entry -> entry.getKey().equalsIgnoreCase(removeKey))
            .forEach(entry -> data.remove(entry.getKey()));
    return true;
}
