Kafka Streams DSL foreach vs process - java

I'm implementing logic to rebuild a file that has been paginated into several Kafka messages with the same key. Every time a page is received, its content is appended to the corresponding file on a shared volume and, once the last page has been appended, the topology has to run some extra processing steps.
Should this be done with foreach or with process?
Both foreach and process return void, so how can the final extra steps be added to the topology?

Both can accomplish the same goal, but foreach is a terminal action of the DSL, whereas the process method of the Processor API "returns" data to the next processors by forwarding it through the context (as answered in your last question), not via the process method's return value.
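As a rough sketch of what that forwarding looks like, assuming a processor that is wired into the topology with addProcessor() (or used through the DSL's transform() variants); the FilePage type, its isLast() method and the file-handling code are hypothetical:

```java
import org.apache.kafka.streams.processor.AbstractProcessor;

// Hypothetical sketch: append each page to the file on the shared volume and,
// once the last page has arrived, forward the file path downstream so the
// extra steps can run in a child processor. FilePage and isLast() are made up.
public class PageAssembler extends AbstractProcessor<String, FilePage> {

    @Override
    public void process(String fileKey, FilePage page) {
        appendToSharedVolume(fileKey, page);          // write the page content
        if (page.isLast()) {
            // nothing is "returned": the record only reaches downstream
            // processors because it is forwarded through the context
            context().forward(fileKey, "/shared/" + fileKey);
        }
    }

    private void appendToSharedVolume(String fileKey, FilePage page) {
        // file I/O omitted for brevity
    }
}
```

Wired in with something like topology.addProcessor("page-assembler", PageAssembler::new, "source") and the extra-steps processors added as its children, the forwarded record is what drives those final steps; with plain foreach there is no way to continue the chain.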

Related

Is it possible to use Kafka Streams windowing with no aggregate operator?

I would like to use the windowing mechanism that Kafka Streams provides in order to perform some actions on streaming data.
I've tried to simulate windowing with the Processor API, performing the action every n seconds via context.schedule(), but this way I cannot have hopping windows.
Is there a way to achieve this?
If I use the time interval in context.schedule() as the advance of the window, how can I control/set the window size?
I need to keep a StateStore holding a data structure across windows, and I modify that data structure based on the records that arrive. Since I need to do this for every incoming record, I think I can use the transform method.
Finally, at the end of each window, I need to forward some data from the data structure mentioned above.
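A very rough sketch of the plumbing described above (a Transformer backed by a state store, with a punctuation scheduled at the advance interval) might look as follows, assuming a Kafka Streams version that supports schedule(Duration, PunctuationType, Punctuator); it only illustrates the wiring, not full hopping-window semantics, and the store name and value types are made up:

```java
import java.time.Duration;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

// Illustrative only: update a state store on every record and forward results
// from a punctuation that fires once per "advance" interval. The window size
// would be enforced by deciding, inside the punctuation, which entries belong
// to a window that has just closed (not shown). Store name and types are made up.
public class WindowLikeTransformer
        implements Transformer<String, Long, KeyValue<String, Long>> {

    private static final Duration ADVANCE = Duration.ofSeconds(10);

    private KeyValueStore<String, Long> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.store = (KeyValueStore<String, Long>) context.getStateStore("window-store");

        // fires every ADVANCE; plays the role of the "end of window" processing
        context.schedule(ADVANCE, PunctuationType.WALL_CLOCK_TIME, timestamp -> {
            try (KeyValueIterator<String, Long> it = store.all()) {
                while (it.hasNext()) {
                    KeyValue<String, Long> entry = it.next();
                    context.forward(entry.key, entry.value);
                }
            }
        });
    }

    @Override
    public KeyValue<String, Long> transform(String key, Long value) {
        // modify the per-window data structure for every incoming record
        Long current = store.get(key);
        store.put(key, current == null ? value : current + value);
        return null;   // nothing emitted per record; the punctuation forwards results
    }

    @Override
    public void close() { }
}
```

The store would have to be registered on the topology and named in the transform() call, e.g. stream.transform(WindowLikeTransformer::new, "window-store").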

Dynamically connecting a Kafka input stream to multiple output streams

Is there functionality built into Kafka Streams that allows for dynamically connecting a single input stream into multiple output streams? KStream.branch allows branching based on true/false predicates, but this isn't quite what I want. I'd like each incoming log to determine the topic it will be streamed to at runtime, e.g., a log {"date": "2017-01-01"} will be streamed to the topic topic-2017-01-01 and a log {"date": "2017-01-02"} will be streamed to the topic topic-2017-01-02.
I could call forEach on the stream, then write to a Kafka producer, but that doesn't seem very elegant. Is there a better way to do this within the Streams framework?
If you want to create topics dynamically based on your data, you do not get any support within Kafka's Streams API at the moment (v0.10.2 and earlier). You will need to create a KafkaProducer and implement the dynamic "routing" yourself (for example using KStream#foreach() or KStream#process()). Note that you need to do synchronous writes to avoid data loss (which is unfortunately not very performant). There are plans to extend the Streams API with dynamic topic routing, but there is no concrete timeline for this feature right now.
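A minimal sketch of that foreach-plus-producer approach might look like this (newer StreamsBuilder API shown; the extractDate() helper, topic names and serde details are placeholders, and error handling is reduced to rethrowing):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;

// Illustrative sketch: route each record to a topic derived from its "date"
// field by writing it with a plain KafkaProducer inside foreach().
public class DynamicTopicRouting {

    public static void main(String[] args) {
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps);

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> logs = builder.stream("input-logs");

        logs.foreach((key, json) -> {
            String topic = "topic-" + extractDate(json);   // e.g. topic-2017-01-01
            try {
                // synchronous write: block until acknowledged to avoid losing
                // records if the application crashes after offsets are committed
                producer.send(new ProducerRecord<>(topic, key, json)).get();
            } catch (Exception e) {
                throw new RuntimeException("Write to " + topic + " failed", e);
            }
        });
        // ... configure and start the KafkaStreams instance as usual
    }

    // hypothetical helper: pull the "date" field out of the JSON payload
    private static String extractDate(String json) {
        return "2017-01-01";
    }
}
```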
There is one more consideration you should take into account. If you do not know your destination topic(s) ahead of time and just rely on the so-called "topic auto creation" feature, you should make sure that those topics are being created with the desired configuration settings (e.g., number of partitions or replication factor).
As an alternative to "topic auto creation" you can also use Admin Client (available since v0.10.1) to create topics with correct configuration. See https://cwiki.apache.org/confluence/display/KAFKA/KIP-4+-+Command+line+and+centralized+administrative+operations
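For the topic-creation part, a sketch using the AdminClient from more recent Kafka clients (not the v0.10.1 admin API referenced above) could look like this; the partition and replication values are placeholders:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

// Sketch: create a destination topic with explicit partition/replication
// settings before writing to it, instead of relying on topic auto-creation.
public class TopicCreator {

    public static void createIfNeeded(String topic) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic newTopic = new NewTopic(topic, 3, (short) 2); // partitions, replication factor
            admin.createTopics(Collections.singleton(newTopic)).all().get();
        }
    }
}
```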

Edit bolt process at run time in Apache Storm

My project streams object data through storm to a graphics application. The appearance of these objects depends upon variables assigned by a bolt in the storm topology.
My question is whether it is possible to update the bolt process by sending it a message that changes the variables it attaches to object data. For example, I might send a message to the bolt declaring that I want any object with parameter x above a certain number to appear as red rather than blue.
The bolt process would then append a red RGB variable to the object data rather than blue.
I was thinking this would be possible by having a displayConfig class that the bolt uses to apply the appearance and whose contents can be edited by messages with a certain header.
Is this possible?
It is possible, but you need to do it manually and prepare your topology accordingly before you start it.
There are two ways to do this:
Use a local config file for the bolt that you put on the worker machine (maybe via NFS). The bolts regularly check the file for updates and re-read the configuration whenever you change the file.
Use one more spout that produces a configuration stream. All bolts that should receive configuration updates at runtime need to consume from this configuration spout via "allGrouping". When processing an input tuple, you check whether it is a regular data tuple or a configuration tuple (and update your config accordingly), as in the sketch below.
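A minimal sketch of that second option, assuming Storm 2.x imports and hypothetical component/field names ("config-spout", "threshold", "color", "x", "object"):

```java
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Illustrative sketch: a bolt that consumes a regular data stream and a
// configuration stream (subscribed with allGrouping) and updates its local
// display config whenever a configuration tuple arrives.
public class ColorBolt extends BaseRichBolt {

    private OutputCollector collector;
    private String color = "blue";     // current display config
    private double threshold = 0.0;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        if ("config-spout".equals(tuple.getSourceComponent())) {
            // configuration tuple: update local state, emit nothing
            this.threshold = tuple.getDoubleByField("threshold");
            this.color = tuple.getStringByField("color");
        } else {
            // regular data tuple: append the colour based on the current config
            double x = tuple.getDoubleByField("x");
            String rgb = x > threshold ? color : "blue";
            collector.emit(tuple, new Values(tuple.getValueByField("object"), rgb));
        }
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("object", "rgb"));
    }
}
```

When building the topology you subscribe the bolt to both streams, e.g. builder.setBolt("color-bolt", new ColorBolt()).shuffleGrouping("data-spout").allGrouping("config-spout").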

Process Rows Separately in Oozie

I have a simple input file with 2 columns like
pkg1 date1
pkg2 date2
pkg3 date3
...
...
I want to create an Oozie workflow which will process each row separately. For each row, I want to run multiple actions one after another (Hive, Pig, ...) and then move on to the next row.
But it is more difficult than I expected. I think I have to create a loop somehow and iterate through it.
Can you give me architectural advice on how I can achieve this?
Oozie does not support loops/cycles, since a workflow is a Directed Acyclic Graph:
https://oozie.apache.org/docs/3.3.0/WorkflowFunctionalSpec.html#a2.1_Cycles_in_Workflow_Definitions
Also, there is no inbuilt way (that I'm aware of) to read data from Hive into an Oozie workflow and use it to control the flow of the Oozie workflow.
You could have a single Oozie workflow which launches some custom process (e.g. a Shell Action), and within that process read the data from Hive, and launch a new, separate, Oozie workflow for each entry.
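As a very rough sketch of that idea, using the Oozie Java client to submit one run of a parameterised workflow per row (hypothetical URLs, paths, table and column names; the Hive JDBC driver must be on the classpath, and reading a plain file instead of Hive would work just as well):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;
import org.apache.oozie.client.OozieClient;

// Rough sketch of the "driver process" idea: read the rows and launch one
// run of a parameterised workflow per row.
public class RowDriver {

    public static void main(String[] args) throws Exception {
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        try (Connection conn = DriverManager.getConnection("jdbc:hive2://hive-host:10000/default");
             Statement stmt = conn.createStatement();
             ResultSet rows = stmt.executeQuery("SELECT pkg, dt FROM input_table")) {

            while (rows.next()) {
                Properties conf = oozie.createConfiguration();
                conf.setProperty(OozieClient.APP_PATH, "hdfs:///apps/per-row-workflow");
                conf.setProperty("pkg", rows.getString("pkg"));
                conf.setProperty("date", rows.getString("dt"));
                // run() submits and starts the workflow; keep the id if you need
                // to poll for completion before launching the next row
                String jobId = oozie.run(conf);
                System.out.println("Launched workflow " + jobId);
            }
        }
    }
}
```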
I totally agree with #Mattinbits, you must use some procedural code (shell script, Python, etc) to run the loop and fire the appropriate Pig/Hive tasks.
But if your process must wait for the tasks to complete before launching the next batch, the coordination part might become a bit more complicated to implement. I can think of a very evil way to use Oozie for that coordination...
Write a generic Oozie workflow that runs the Pig/Hive actions for one set of parameters, passed as properties.
Write a "master template" Oozie workflow that just runs the workflow above as a sub-workflow with dummy values for the properties.
Cut the template into 3 parts: XML header, sub-workflow call (with placeholders for the actual property values) and XML footer.
Your loop then builds the actual "master" workflow dynamically, by concatenating the header, a call to the sub-workflow for the 1st set of values, another call for the 2nd set, and so on, then the footer; finally, it submits the workflow to the Oozie server (using REST or the command line interface).
Of course there are some other things to take care of: generating unique names for the sub-workflow actions, chaining them, handling errors. The usual stuff.
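A sketch of that concatenation step, with made-up template file names and @@PLACEHOLDER@@ tokens (the generated XML would then be uploaded to HDFS and submitted via REST or oozie job -run, as described above; error/kill transitions are omitted):

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Sketch of the "cut the template into three parts" trick: concatenate a
// header, one sub-workflow action per parameter set, and a footer into a
// dynamically generated master workflow.
public class MasterWorkflowBuilder {

    public static String build(List<String[]> rows) throws Exception {
        String header = read("wf-header.xml");        // <workflow-app ...> up to the first action
        String callTpl = read("wf-subwf-call.xml");   // one <action> wrapping a <sub-workflow>
        String footer = read("wf-footer.xml");        // <end/> and </workflow-app>

        StringBuilder wf = new StringBuilder(header);
        for (int i = 0; i < rows.size(); i++) {
            String[] row = rows.get(i);
            String next = (i == rows.size() - 1) ? "end" : "call-" + (i + 1);
            wf.append(callTpl
                    .replace("@@NAME@@", "call-" + i)   // unique action name
                    .replace("@@NEXT@@", next)          // chain the actions together
                    .replace("@@PKG@@", row[0])
                    .replace("@@DATE@@", row[1]));
        }
        wf.append(footer);
        return wf.toString();
    }

    private static String read(String path) throws Exception {
        return new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
    }
}
```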

Processing data based on the metadata in the file using apache camel

I have to setup camel to process data files where the first line of the file is the metadata and then it follows with millions of lines of actual data. The metadata dictates how the data is to be processed. What I am looking for is something like this:
1. Read the first line (metadata) and populate a bean with that metadata.
2. Then send the data 1000 lines at a time to the data processor, which will refer to the bean from step 1.
Is it possible in Apache Camel?
Yes.
An example architecture might look something like this:
You could setup a simple queue that could be populated with file names (or whatever identifier you are using to locate each individual file).
From the queue, you could route through a message translator bean, whose sole job is to translate a request for a filename into a POJO that contains the metadata from the first line of the file.
(You have a few options here)
Your approach to processing the 1000-line sets will depend on whether or not the output or resulting data created from those sets needs to be recomposed into a single message and processed again later. If so, you could implement a composed message processor made up of a message producer/consumer, a message aggregator and a router. The message producer/consumer would receive the POJO with the metadata created in step 2 and enqueue as many new requests as are necessary to process all of the lines in the file. The router would route from this queue through your processing pipeline and into the message aggregator. Once aggregated, a single unified message with all of your important data will be available for you to do with as you will.
If instead each 1000-line set can be processed independently and rejoining is not required, then it is not necessary to aggregate the messages. Instead, you can use a router to route from step 2 to a producer/consumer that will, like above, enqueue the necessary number of new requests for each file. Finally, the router will route from this final queue to a consumer that will do the processing.
Since you have a large quantity of data to deal with, it will likely be difficult to pass 1000-line groups of data around in messages, especially if they are being placed in a queue (you don't want to run out of memory). I recommend passing around some type of indicator that identifies which lines of the file a specific request was for, and then parsing the 1000 lines only when you need them. You could do this in a number of ways, for example by calculating how many bytes into the file a specific line starts and then using a file reader's skip() method to jump to that line when the request hits the bean that will be processing it.
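A rough sketch of such a route in Camel's Java DSL, assuming the file component and the splitter's line tokenizer with grouping; FileMetadata and DataChunkProcessor are hypothetical classes, and the metadata line will also show up at the top of the first 1000-line group unless you filter it out:

```java
import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.camel.builder.RouteBuilder;

// Illustrative sketch: capture the metadata from the first line into a header,
// then stream the file through the splitter in groups of 1000 lines so the
// whole file never has to sit in memory at once.
public class FileProcessingRoute extends RouteBuilder {

    @Override
    public void configure() {
        from("file:data/inbox?noop=true")
            // 1. read only the first line and keep the parsed metadata on the exchange
            .process(exchange -> {
                String path = exchange.getIn().getHeader("CamelFileAbsolutePath", String.class);
                try (BufferedReader reader = Files.newBufferedReader(Paths.get(path))) {
                    exchange.getIn().setHeader("fileMetadata", FileMetadata.parse(reader.readLine()));
                }
            })
            // 2. split the body into 1000-line groups, streamed to bound memory use
            .split().tokenize("\n", 1000).streaming()
                .process(exchange -> {
                    FileMetadata meta = exchange.getIn().getHeader("fileMetadata", FileMetadata.class);
                    String chunk = exchange.getIn().getBody(String.class);
                    new DataChunkProcessor().process(meta, chunk);   // hypothetical per-chunk processing
                })
            .end();
    }
}
```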
Here are some resources provided on the Apache Camel website that describe the enterprise integration patterns that I mentioned above:
http://camel.apache.org/message-translator.html
http://camel.apache.org/composed-message-processor.html
http://camel.apache.org/pipes-and-filters.html
http://camel.apache.org/eip.html
