I am subscribing to one topic that contains different event types, and they arrive with different attributes. After I read an element, I need to route it to a different place based on its attributes. This is what the sample code looks like:
Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
Pipeline pipeline = Pipeline.create(options);
pipeline
.apply(
"ReadType1",
EventIO.<T>readJsons()
.of(T.class)
.withPubsubTimestampAttributeName(null)
.withOptions(options))
.apply(
Filter.by(
new SerializableFunction<T, Boolean>() {
@Override
public Boolean apply(T input) {
return input.attributes.get("type").equals("type1");
}
}))
.apply(
"WindowMetrics",
Window.into(FixedWindows.of(Duration.standardSeconds(options.getWindowDuration()))))
.apply("AsJsons", AsJsons.of(T.class))
.apply(
"Write File(s)",
TextIO.write()
.withWindowedWrites()
.withNumShards(options.getNumShards())
.to(
new WindowedFilenamePolicy(
options.getRunOutputDirectory(),
options.getUseCurrentDateForOutputDirectory(),
options.getOutputFilenamePrefix(),
options.getOutputShardTemplate(),
options.getOutputFilenameSuffix()))
.withTempDirectory(
NestedValueProvider.of(
options.getTempDirectory(),
(SerializableFunction<String, ResourceId>)
input -> FileBasedSink.convertToFileResourceIfPossible(input))));
pipeline.apply("ReadType2",
EventIO.<T>readJsons().of(T.class)
.withPubsubTimestampAttributeName(null)
.withOptions(options))
.apply(Filter.by(new SerializableFunction<T, Boolean>() {
@Override
public Boolean apply(T input) {
return input.attributes.get("type").equals("type2");
}
})).apply( "WindowMetrics",
Window.into(FixedWindows.of(Duration.standardSeconds(options.getWindowDuration()))))
.apply("AsJsons", AsJsons.of(T.class))
.apply(
"Write File(s)",
TextIO.write()
.withWindowedWrites()
.withNumShards(options.getNumShards())
.to(
new WindowedFilenamePolicy(
options.getBatchOutputDirectory(),
options.getUseCurrentDateForOutputDirectory(),
options.getOutputFilenamePrefix(),
options.getOutputShardTemplate(),
options.getOutputFilenameSuffix()))
.withTempDirectory(
NestedValueProvider.of(
options.getTempDirectory(),
(SerializableFunction<String, ResourceId>)
input -> FileBasedSink.convertToFileResourceIfPossible(input))));
pipeline.apply("ReadType3",
EventIO.<T>readJsons().of(T.class)
.withPubsubTimestampAttributeName(null)
.withOptions(options))
.apply(Filter.by(new SerializableFunction<T, Boolean>() {
@Override
public Boolean apply(T input) {
return input.attributes.get("type").equals("type3");
}
})).apply( "WindowMetrics",
Window.into(FixedWindows.of(Duration.standardSeconds(options.getWindowDuration()))))
.apply("AsJsons", AsJsons.of(T.class))
.apply(
"Write File(s)",
TextIO.write()
.withWindowedWrites()
.withNumShards(options.getNumShards())
.to(
new WindowedFilenamePolicy(
options.getCustomIntervalOutputDirectory(),
options.getUseCurrentDateForOutputDirectory(),
options.getOutputFilenamePrefix(),
options.getOutputShardTemplate(),
options.getOutputFilenameSuffix()))
.withTempDirectory(
NestedValueProvider.of(
options.getTempDirectory(),
(SerializableFunction<String, ResourceId>)
input -> FileBasedSink.convertToFileResourceIfPossible(input))));
pipeline.run();
Basically I read an event, filter it on its attribute, and write it to a file. The job failed in Dataflow with: Workflow failed. Causes: The pubsub configuration contains errors: Subscription 'sub-name' is consumed by multiple stages, this will result in undefined behavior.
So what is the appropriate way to split the pipeline within the same job?
I tried Pipeline1, Pipeline2, Pipeline3, but that ends up requiring a separate job name to run each pipeline, and I am not sure that is the right way to do it.
The multiple EventIO transforms on the same subscription are the cause of the error. You need to consolidate them into a single read for this to work. This can be done by consuming the subscription into a single PCollection and then applying a separate filtering branch to that collection for each type. Here is a partial example:
// single PCollection of the events consumed from the subscription
PCollection<T> events = pipeline
.apply("Read Events",
EventIO.<T>readJsons()
.of(T.class)
.withPubsubTimestampAttributeName(null)
.withOptions(options));
// PCollection of type1 events
PCollection<T> typeOneEvents = events.apply(
Filter.by(
new SerializableFunction<T, Boolean>() {
@Override
public Boolean apply(T input) {
return input.attributes.get("type").equals("type1");
}}));
// TODO typeOneEvents.apply("WindowMetrics / AsJsons / Write File(s)")
// PCollection of type2 events
PCollection<T> typeTwoEvents = events.apply(
Filter.by(
new SerializableFunction<T, Boolean>() {
@Override
public Boolean apply(T input) {
return input.attributes.get("type").equals("type2");
}}));
// TODO typeTwoEvents.apply("WindowMetrics / AsJsons / Write File(s)")
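A sketch of how one branch could then continue, reusing the windowing and write steps from the question (the option getters are the ones already defined in the question's Options):
typeOneEvents
    .apply("WindowMetrics",
        Window.into(FixedWindows.of(Duration.standardSeconds(options.getWindowDuration()))))
    .apply("AsJsons", AsJsons.of(T.class))
    .apply("Write File(s)",
        TextIO.write()
            .withWindowedWrites()
            .withNumShards(options.getNumShards())
            .to(new WindowedFilenamePolicy(
                options.getRunOutputDirectory(),
                options.getUseCurrentDateForOutputDirectory(),
                options.getOutputFilenamePrefix(),
                options.getOutputShardTemplate(),
                options.getOutputFilenameSuffix()))
            .withTempDirectory(NestedValueProvider.of(
                options.getTempDirectory(),
                (SerializableFunction<String, ResourceId>)
                    input -> FileBasedSink.convertToFileResourceIfPossible(input))));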
Another possibility is to use other transforms provided by Apache Beam. Doing so might simplify your solution a little. One such transform is Partition. Partition allows a single PCollection to be split into a fixed number of PCollections based on a partitioning function. A partial example using Partition is:
// single PCollection of the events consumed from the subscription
PCollectionList<T> eventsByType = pipeline
.apply("Read Events",
EventIO.<T>readJsons()
.of(T.class)
.withPubsubTimestampAttributeName(null)
.withOptions(options))
.apply("Partition By Type",
Partition.of(2, new PartitionFn<T>() {
public int partitionFor(T event, int numPartitions) {
return event.attributes.get("type").equals("type1") ? 0 : 1;
}}));
PCollection<T> typeOneEvents = eventsByType.get(0);
// TODO typeOneEvents.apply("WindowMetrics / AsJsons / Write File(s)")
PCollection<T> typeTwoEvents = eventsByType.get(1);
// TODO typeTwoEvents.apply("WindowMetrics / AsJsons / Write File(s)")
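Since the question actually has three event types, the Partition step above could be extended to three outputs. A sketch of that variant:
// type1 -> partition 0, type2 -> partition 1, everything else (e.g. type3) -> partition 2
PCollectionList<T> eventsByType = pipeline
    .apply("Read Events",
        EventIO.<T>readJsons()
            .of(T.class)
            .withPubsubTimestampAttributeName(null)
            .withOptions(options))
    .apply("Partition By Type",
        Partition.of(3, new PartitionFn<T>() {
            public int partitionFor(T event, int numPartitions) {
                String type = event.attributes.get("type");
                if (type.equals("type1")) {
                    return 0;
                }
                if (type.equals("type2")) {
                    return 1;
                }
                return 2;
            }
        }));
PCollection<T> typeThreeEvents = eventsByType.get(2);
// TODO typeThreeEvents.apply("WindowMetrics / AsJsons / Write File(s)")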
The answer should be to use Partition in Beam:
https://beam.apache.org/documentation/transforms/java/elementwise/partition/
Related
In my DSL-based transformation, I have a stream --> branch, where I want the branched output redirected to multiple topics.
The current branch.to() method accepts only a String.
Is there any simple option with stream.branch where I can route the result to multiple topics? With a consumer, I can subscribe to multiple topics by providing an array of strings as topics.
My problem requires me to take multiple actions if a particular predicate is satisfied.
I tried stream.branch[index].to(string), but this is not sufficient for my requirement. I am looking for something like stream.branch[index].to(string array of topics).
I expect the branch.to method to accept multiple topics, or is there an alternate way to achieve the same with streams?
Adding sample code below; the actual variable names have been removed.
My Predicates
Predicate<String, MyDomainObject> Predicate1 = new Predicate<String, MyDomainObject>() {
    @Override
    public boolean test(String key, MyDomainObject domObj) {
        boolean result = false;
        // result is set based on some condition on domObj
        return result;
    }
};
Predicate<String, MyDomainObject> Predicate2 = new Predicate<String, MyDomainObject>() {
    @Override
    public boolean test(String key, MyDomainObject domObj) {
        boolean result = false;
        // result is set based on some condition on domObj
        return result;
    }
};
KStream<String, MyDomainObject>[] branches = myStream.branch(
Predicate1, Predicate2
);
// here I need your suggestions.
// this is my current implementation
branches[0].to(singleTopic,
    Produced.with(Serdes.String(), Serdes.serdeFrom(inSer, deSer)));
// I want to send the notification to multiple topics, something like below
branches[0].to(topicList,
    Produced.with(Serdes.String(), Serdes.serdeFrom(inSer, deSer)));
If you know to which topics you want to send the data, you can do the following:
branches[0].to("first-topic");
branches[0].to("second-topic");
// etc.
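A slightly fuller sketch based on the code in the question (the topic names here are placeholders): simply call .to() once per target topic on the same branch, or loop over a list of topic names:
Produced<String, MyDomainObject> produced =
    Produced.with(Serdes.String(), Serdes.serdeFrom(inSer, deSer));
// One .to() call per target topic; the same branch can be sent to as many topics as needed.
branches[0].to("notification-topic-1", produced);
branches[0].to("notification-topic-2", produced);
// Or, if the targets are already collected in a list:
for (String topic : topicList) {
    branches[0].to(topic, produced);
}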
I have the following use case.
There is one machine sending event streams to Kafka, which are received by a CEP engine that generates warnings when conditions on the stream data are satisfied.
FlinkKafkaConsumer011<Event> kafkaSource = new FlinkKafkaConsumer011<Event>(kafkaInputTopic, new EventDeserializationSchema(), properties);
DataStream<Event> eventStream = env.addSource(kafkaSource);
The Event POJO contains id, name, time, and ip.
The machine sends a huge amount of data to Kafka, and there are 35 unique event names (like name1, name2, ..., name35). I want to detect patterns for each combination of event names (like name1 co-occurring with name2, name1 co-occurring with name3, etc.). In total there are 1225 combinations.
The Rule POJO contains e1Name and e2Name.
List<Rule> ruleList contains the 1225 rules.
for (Rule rule : ruleList) {
Pattern<Event, ?> warningPattern = Pattern.<Event>begin("start").where(new SimpleCondition<Event>() {
@Override
public boolean filter(Event value) throws Exception {
if(value.getName().equals(rule.getE1Name())) {
return true;
}
return false;
}
}).followedBy("next").where(new SimpleCondition<Event>() {
@Override
public boolean filter(Event value) throws Exception {
if(value.getName().equals(rule.getE2Name())) {
return true;
}
return false;
}
}).within(Time.seconds(30));
PatternStream patternStream = CEP.pattern(eventStream, warningPattern);
}
Is this the correct way to execute multiple patterns on one data stream, or is there a more optimized way to achieve this? With the above approach we are getting PartitionNotFoundException, UnknownTaskExecutorException, and memory issues.
IMO you don't need patterns to achieve your goal. You can apply a stateful map function to the source, which maps event names into pairs (the latest two names). After that, window the source to 30 seconds and apply the simple WordCount example to it.
The stateful map function can be something like this (it accepts only the event name; you need to adapt it to your input, e.g. extract the event name first):
public class TupleMap implements MapFunction<String, Tuple2<String, Integer>>{
Tuple2<String, String> latestTuple = new Tuple2<String, String>();
public Tuple2<String, Integer> map(String value) throws Exception {
this.latestTuple.f0 = this.latestTuple.f1;
this.latestTuple.f1 = value;
return new Tuple2<String, Integer>(this.latestTuple.f0 + this.latestTuple.f1, 1);
}
}
and the result, with event name pairs and their occurrence counts as tuples, can be obtained like this (and written to a Kafka sink, maybe?):
DataStream<Tuple2<String, Integer>> source = stream.map(new TupleMap());
SingleOutputStreamOperator<Tuple2<String, Integer>> sum = source.keyBy(0).timeWindow(Time.seconds(30)).sum(1);
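If the counts should indeed go back to Kafka, one possible sketch using the 0.11 connector from the question (the broker address and the output topic name are assumptions):
// Serialize each (pair, count) tuple as a simple CSV string and write it to Kafka.
sum.map(new MapFunction<Tuple2<String, Integer>, String>() {
        @Override
        public String map(Tuple2<String, Integer> value) {
            return value.f0 + "," + value.f1;
        }
    })
    .addSink(new FlinkKafkaProducer011<>("localhost:9092", "pair-counts", new SimpleStringSchema()));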
Let's say I have a huge web server log file that does not fit in memory. I need to stream this file through a map-reduce style method and save the results to a database. I do this using the Java 8 Stream API. For example, after the map-reduce process I get lists such as consumption by client, consumption by IP, and consumption by content. My actual needs are not exactly those of this example; since I cannot share the code, I just want to give a basic one.
With the Java 8 Stream API, I want to read the file exactly once and get 3 lists at the same time while I am streaming the file, in parallel or sequentially. Parallel would be better. Is there any way to do that?
Generally, collecting to anything other than what the standard API gives you is pretty easy via a custom Collector. In your case, collecting into 3 lists at a time (just a small example that compiles, since you can't share your code either):
private static <T> Collector<T, ?, List<List<T>>> to3Lists() {
class Acc {
List<T> left = new ArrayList<>();
List<T> middle = new ArrayList<>();
List<T> right = new ArrayList<>();
List<List<T>> list = Arrays.asList(left, middle, right);
void add(T elem) {
// obviously do whatever you want here
left.add(elem);
middle.add(elem);
right.add(elem);
}
Acc merge(Acc other) {
left.addAll(other.left);
middle.addAll(other.middle);
right.addAll(other.right);
return this;
}
public List<List<T>> finisher() {
return list;
}
}
return Collector.of(Acc::new, Acc::add, Acc::merge, Acc::finisher);
}
And using it via:
Stream.of(1, 2, 3)
.collect(to3Lists());
Obviously this custom collector does not do anything useful; it is just an example of how you could work with one.
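As a variation on the same idea, and only as an assumption based on the "consumption by client / IP / content" example in the question, the accumulator could build three different aggregations of the same elements in a single pass. LogLine and its getters are hypothetical placeholders:
private static Collector<LogLine, ?, List<Map<String, Long>>> toThreeAggregates() {
    class Acc {
        Map<String, Long> byClient = new HashMap<>();
        Map<String, Long> byIp = new HashMap<>();
        Map<String, Long> byContent = new HashMap<>();
        void add(LogLine line) {
            // each element contributes to all three aggregations
            byClient.merge(line.getClient(), line.getBytes(), Long::sum);
            byIp.merge(line.getIp(), line.getBytes(), Long::sum);
            byContent.merge(line.getContent(), line.getBytes(), Long::sum);
        }
        Acc merge(Acc other) {
            other.byClient.forEach((k, v) -> byClient.merge(k, v, Long::sum));
            other.byIp.forEach((k, v) -> byIp.merge(k, v, Long::sum));
            other.byContent.forEach((k, v) -> byContent.merge(k, v, Long::sum));
            return this;
        }
        List<Map<String, Long>> finish() {
            return Arrays.asList(byClient, byIp, byContent);
        }
    }
    return Collector.of(Acc::new, Acc::add, Acc::merge, Acc::finish);
}
Used, for example, as Files.lines(path).map(LogLine::parse).collect(toThreeAggregates()), where LogLine::parse is again a placeholder for your own parsing.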
I have adapted the answer to this question to your case. The custom Spliterator will "split" the stream into multiple streams that collect by different properties:
@SafeVarargs
public static <T> long streamForked(Stream<T> source, Consumer<Stream<T>>... consumers)
{
return StreamSupport.stream(new ForkingSpliterator<>(source, consumers), false).count();
}
public static class ForkingSpliterator<T>
extends AbstractSpliterator<T>
{
private Spliterator<T> sourceSpliterator;
private List<BlockingQueue<T>> queues = new ArrayList<>();
private boolean sourceDone;
@SafeVarargs
private ForkingSpliterator(Stream<T> source, Consumer<Stream<T>>... consumers)
{
super(Long.MAX_VALUE, 0);
sourceSpliterator = source.spliterator();
for (Consumer<Stream<T>> fork : consumers)
{
LinkedBlockingQueue<T> queue = new LinkedBlockingQueue<>();
queues.add(queue);
new Thread(() -> fork.accept(StreamSupport.stream(new ForkedConsumer(queue), false))).start();
}
}
@Override
public boolean tryAdvance(Consumer<? super T> action)
{
sourceDone = !sourceSpliterator.tryAdvance(t -> queues.forEach(queue -> queue.offer(t)));
return !sourceDone;
}
private class ForkedConsumer
extends AbstractSpliterator<T>
{
private BlockingQueue<T> queue;
private ForkedConsumer(BlockingQueue<T> queue)
{
super(Long.MAX_VALUE, 0);
this.queue = queue;
}
@Override
public boolean tryAdvance(Consumer<? super T> action)
{
while (queue.peek() == null)
{
if (sourceDone)
{
// element is null, and there won't be any more, so "terminate" this sub-stream
return false;
}
}
// push to consumer pipeline
action.accept(queue.poll());
return true;
}
}
}
You can use it as follows:
streamForked(Stream.of(new Row("content1", "client1", "location1", 1),
new Row("content2", "client1", "location1", 2),
new Row("content1", "client1", "location2", 3),
new Row("content2", "client2", "location2", 4),
new Row("content1", "client2", "location2", 5)),
rows -> System.out.println(rows.collect(Collectors.groupingBy(Row::getClient,
Collectors.groupingBy(Row::getContent,
Collectors.summingInt(Row::getConsumption))))),
rows -> System.out.println(rows.collect(Collectors.groupingBy(Row::getClient,
Collectors.groupingBy(Row::getLocation,
Collectors.summingInt(Row::getConsumption))))),
rows -> System.out.println(rows.collect(Collectors.groupingBy(Row::getContent,
Collectors.groupingBy(Row::getLocation,
Collectors.summingInt(Row::getConsumption))))));
// Output
// {client2={location2=9}, client1={location1=3, location2=3}}
// {client2={content2=4, content1=5}, client1={content2=2, content1=4}}
// {content2={location1=2, location2=4}, content1={location1=1, location2=8}}
Note that you can do pretty much anything you want with the copies of the stream. As per your example, I used a stacked groupingBy collector to group the rows by two properties and then sum up the int property. So the result will be a Map<String, Map<String, Integer>>. But you could also use it for other scenarios:
rows -> System.out.println(rows.count())
rows -> rows.forEach(row -> System.out.println(row))
rows -> System.out.println(rows.anyMatch(row -> row.getConsumption() > 3))
I would like to compute the cartesian product of two PCollections. Neither PCollection fits into memory, so using a side input is not feasible.
My goal is this: I have two datasets. One has many elements of small size; the other has few (~10) elements of very large size. I would like to take the product of these two collections and then produce key-value objects.
I think CoGroupByKey might work in your situation:
https://cloud.google.com/dataflow/model/group-by-key#join
That's what I did for a similar use case, though mine was probably not constrained by memory (have you tried a larger cluster with bigger machines?):
PCollection<KV<String, TableRow>> inputClassifiedKeyed = inputClassified
.apply(ParDo.named("Actuals : Keys").of(new ActualsRowToKeyedRow()));
PCollection<KV<String, Iterable<Map<String, String>>>> groupedCategories = p
[...]
.apply(GroupByKey.create());
So the collections are keyed by the same key.
Then I declared the Tags:
final TupleTag<Iterable<Map<String, String>>> categoryTag = new TupleTag<>();
final TupleTag<TableRow> actualsTag = new TupleTag<>();
Combined them:
PCollection<KV<String, CoGbkResult>> actualCategoriesCombined =
KeyedPCollectionTuple.of(actualsTag, inputClassifiedKeyed)
.and(categoryTag, groupedCategories)
.apply(CoGroupByKey.create());
And in my case the final step was reformatting the results from the tagged groups in the continuous flow:
actualCategoriesCombined.apply(ParDo.named("Actuals : Formatting").of(
new DoFn<KV<String, CoGbkResult>, TableRow>() {
@Override
public void processElement(ProcessContext c) throws Exception {
KV<String, CoGbkResult> e = c.element();
Iterable<TableRow> actualTableRows =
e.getValue().getAll(actualsTag);
Iterable<Iterable<Map<String, String>>> categoriesAll =
e.getValue().getAll(categoryTag);
for (TableRow row : actualTableRows) {
// Some of the actuals do not have categories
if (categoriesAll.iterator().hasNext()) {
row.put("advertiser", categoriesAll.iterator().next());
}
c.output(row);
}
}
}));
Hope this helps. Again, I'm not sure about the in-memory constraints. Please do share the results if you try this approach.
To create a cartesian product, use the Apache Beam Join extension:
import org.apache.beam.sdk.extensions.joinlibrary.Join;
...
// Use function Join.fullOuterJoin(final PCollection<KV<K, V1>> leftCollection, final PCollection<KV<K, V2>> rightCollection, final V1 leftNullValue, final V2 rightNullValue)
// and the same key for all rows to create cartesian product as it is shown below:
public static void process(Pipeline pipeline, DataInputOptions options) {
PCollection<KV<Integer, CpuItem>> cpuList = pipeline
.apply("ReadCPUs", TextIO.read().from(options.getInputCpuFile()))
.apply("Creating Cpu Objects", new CpuItem()).apply("Preprocess Cpu",
MapElements
.into(TypeDescriptors.kvs(TypeDescriptors.integers(), TypeDescriptor.of(CpuItem.class)))
.via((CpuItem e) -> KV.of(0, e)));
PCollection<KV<Integer, GpuItem>> gpuList = pipeline
.apply("ReadGPUs", TextIO.read().from(options.getInputGpuFile()))
.apply("Creating Gpu Objects", new GpuItem()).apply("Preprocess Gpu",
MapElements
.into(TypeDescriptors.kvs(TypeDescriptors.integers(), TypeDescriptor.of(GpuItem.class)))
.via((GpuItem e) -> KV.of(0, e)));
PCollection<KV<Integer,KV<CpuItem,GpuItem>>> cartesianProduct = Join.fullOuterJoin(cpuList, gpuList, new CpuItem(), new GpuItem());
PCollection<String> finalResultCollection = cartesianProduct.apply("Format results", MapElements.into(TypeDescriptors.strings())
.via((KV<Integer, KV<CpuItem,GpuItem>> e) -> e.getValue().toString()));
finalResultCollection.apply("Output the results",
TextIO.write().to("fps.batchproc\\parsed_cpus").withSuffix(".log"));
pipeline.run();
}
In the code above, in this line
...
.via((CpuItem e) -> KV.of(0, e)));
...
I create a KV with the key equal to 0 for every row in the input data. As a result, all rows are matched with each other. That is equivalent to a SQL JOIN without a WHERE clause.
I am writing a Kafka consumer for a high-volume, high-velocity distributed application. I have only one topic, but the rate of incoming messages is very high. Having multiple partitions served by more consumers would be appropriate for this use case, and the best way to consume is to have multiple stream readers. As per the documentation and the available samples, the number of KafkaStreams the ConsumerConnector gives out is based on the number of topics. I am wondering how to get more than one KafkaStream reader (based on the partitions), so that I can span one thread per stream, or whether reading from the same KafkaStream in multiple threads would do the concurrent read from multiple partitions.
Any insights are much appreciated.
I would like to share what I found on the mailing list:
The number that you pass in the topic map controls how many streams a topic is divided into. In your case, if you pass in 1, all 10 partitions' data will be fed into 1 stream. If you pass in 2, each of the 2 streams will get data from 5 partitions. If you pass in 11, 10 of them will each get data from 1 partition and 1 stream will get nothing.
Typically, you need to iterate each stream in its own thread. This is because each stream can block forever if there is no new event.
Sample snippet:
topicCount.put(msgTopic, new Integer(partitionCount));
Map<String, List<KafkaStream<byte[], byte[]>>> consumerStreams = connector.createMessageStreams(topicCount);
List<KafkaStream<byte[], byte[]>> streams = consumerStreams.get(msgTopic);
for (final KafkaStream stream : streams) {
ReadTask task = new ReadTask(stream, msgTopic);
task.addObserver(this.msgObserver);
tasks.add(task);
executor.submit(task);
}
Reference: http://mail-archives.apache.org/mod_mbox/incubator-kafka-users/201201.mbox/%3CCA+sHyy_Z903dOmnjp7_yYR_aE2sRW-x7XpAnqkmWaP66GOqf6w#mail.gmail.com%3E
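ReadTask itself is not shown in that snippet; a minimal sketch of what it might look like with the old high-level consumer API (the class and its process method are assumptions; the observer wiring from the snippet is omitted):
// Hypothetical ReadTask: iterates one KafkaStream on its own thread.
class ReadTask implements Runnable {
    private final KafkaStream<byte[], byte[]> stream;
    private final String topic;
    ReadTask(KafkaStream<byte[], byte[]> stream, String topic) {
        this.stream = stream;
        this.topic = topic;
    }
    public void run() {
        ConsumerIterator<byte[], byte[]> it = stream.iterator();
        // hasNext() blocks until a new message arrives (or the connector is shut down)
        while (it.hasNext()) {
            MessageAndMetadata<byte[], byte[]> msg = it.next();
            process(topic, msg.message());
        }
    }
    private void process(String topic, byte[] payload) {
        // application-specific handling goes here
    }
}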
The recommended way to do this is to use a thread pool, so that Java handles the organisation for you, and to consume each stream that the createMessageStreamsByFilter method gives you in its own Runnable. For example:
int NUMBER_OF_PARTITIONS = 6;
Properties consumerConfig = new Properties();
consumerConfig.put("zk.connect", "zookeeper.mydomain.com:2181" );
consumerConfig.put("backoff.increment.ms", "100");
consumerConfig.put("autooffset.reset", "largest");
consumerConfig.put("groupid", "java-consumer-example");
consumer = Consumer.createJavaConsumerConnector(new ConsumerConfig(consumerConfig));
TopicFilter sourceTopicFilter = new Whitelist("mytopic|myothertopic");
List<KafkaStream<Message>> streams = consumer.createMessageStreamsByFilter(sourceTopicFilter, NUMBER_OF_PARTITIONS);
ExecutorService executor = Executors.newFixedThreadPool(streams.size());
for(final KafkaStream<Message> stream: streams){
executor.submit(new Runnable() {
public void run() {
for (MessageAndMetadata<Message> msgAndMetadata: stream) {
ByteBuffer buffer = msgAndMetadata.message().payload();
byte [] bytes = new byte[buffer.remaining()];
buffer.get(bytes);
//Do something with the bytes you just got off Kafka.
}
}
});
}
In this example I asked for 6 threads, basically because I know that I have 3 partitions for each topic and I listed two topics in my whitelist. Once we have the handles on the incoming streams, we can iterate over their content, which consists of MessageAndMetadata objects. The metadata is really just the topic name and the offset. As you discovered, you can do it in a single thread if you ask for 1 stream instead of, in my example, 6; but if you require parallel processing, the nice way is to launch an executor with one thread for each returned stream.
/**
* @param source : source kStream to sink to output-topic
*/
private static void pipe(KStream<String, String> source) {
source.to(Serdes.String(), Serdes.String(), new StreamPartitioner<String, String>() {
@Override
public Integer partition(String arg0, String arg1, int arg2) {
return 0;
}
}, "output-topic");
}
The code above will write every record to partition 0 (the first partition) of the topic "output-topic".
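For comparison, a sketch (an assumption, not part of the original snippet) of a partitioner that spreads records across partitions by key hash instead of pinning everything to one partition:
private static void pipeByKeyHash(KStream<String, String> source) {
    source.to(Serdes.String(), Serdes.String(), new StreamPartitioner<String, String>() {
        @Override
        public Integer partition(String key, String value, int numPartitions) {
            // non-negative hash of the key, modulo the partition count of the target topic
            return key == null ? 0 : (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }, "output-topic");
}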