My project streams object data through Storm to a graphics application. The appearance of these objects depends upon variables assigned by a bolt in the Storm topology.
My question is whether it is possible to update the bolt process by sending it a message that changes the variables it attaches to object data. For example, I might send a message to the bolt declaring that I want any object with parameter x above a certain threshold to appear as red rather than blue.
The bolt process would then append a red rgb variable to the object data rather than blue.
I was thinking this would be possible by having a displayConfig class that the bolt uses to apply appearance, and whose contents can be edited by messages with a certain header.
Is this possible?
It is possible, but you need to do it manually and prepare your topology accordingly before you start it.
There are two ways to do this:
Use a local config file for the bolt that you put onto the worker machines (maybe via NFS). The bolts regularly check the file for updates and read the updated configuration whenever you change the file.
Use one more spout that produces a configuration stream. All bolts that should receive configuration updates at runtime need to consume from this configuration spout via "allGrouping". When processing an input tuple, you check whether it is a regular data tuple or a configuration tuple (and update your config accordingly); see the sketch below.
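A rough sketch of the second approach (the component names and the DisplayConfig holder are made up for illustration; package names assume Storm 1.x or later):

import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Bolt that consumes a regular data stream plus a configuration stream.
public class DisplayBolt extends BaseRichBolt {

    private OutputCollector collector;
    private DisplayConfig displayConfig;   // hypothetical holder for the appearance rules

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.displayConfig = new DisplayConfig();   // start with defaults
    }

    @Override
    public void execute(Tuple tuple) {
        if ("config-spout".equals(tuple.getSourceComponent())) {
            // Configuration tuple: update the rules, nothing to emit.
            displayConfig.update(tuple);
        } else {
            // Regular data tuple: append the RGB value chosen by the current config.
            collector.emit(tuple, new Values(tuple.getValue(0), displayConfig.rgbFor(tuple)));
        }
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("object", "rgb"));
    }
}

// In the topology setup, every DisplayBolt instance must see every config tuple:
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("data-spout", new DataSpout());       // your existing data spout
builder.setSpout("config-spout", new ConfigSpout());   // hypothetical configuration spout
builder.setBolt("display-bolt", new DisplayBolt(), 4)
       .shuffleGrouping("data-spout")
       .allGrouping("config-spout");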
I am trying to save some data from the Mapper to the Job/Main so that I can use it in other jobs.
I tried to use a static variable in my main class (the one that contains the main function), but when the Mapper adds data to the static variable and I try to print the variable after the job is done, I find that there is no new data; it's as if the Mapper modified another instance of that static variable.
Now I'm trying to use the Configuration to set the data from the Mapper:
Mapper
context.getConfiguration().set("3", "somedata");
Main
boolean step1Completed = step1.waitForCompletion(true);
System.out.println(step1.getConfiguration().get("3"));
Unfortunately this prints null.
Is there another way to do this? I am trying to save some data so that I can use it in other jobs, and I find using a file just for that a bit extreme, since the data is only an (int, String) index mapping some titles that I will need in my last job.
It is not possible, as far as I know. Mappers and Reducers work independently in a distributed fashion, and each task has its own local Configuration instance. You have to persist the data to HDFS, since each job is independent.
You can also take advantage of the MapReduce chaining mechanism (example) to run a chain of jobs. In addition, you can design a workflow in Azkaban, Oozie, etc. to pass output to another job.
It is indeed not possible since the configuration goes from the job to the mapper/reducer and not the other way around.
I ended up just reading the file directly from the HDFS in my last job's setup.
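In case it helps anyone else, a rough sketch of that kind of setup-time read (the /tmp/title-index.txt path and the tab-separated id/title format are just placeholders):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TitleAwareMapper extends Mapper<LongWritable, Text, Text, Text> {

    // Small int -> title index loaded once per task attempt.
    private final Map<Integer, String> titles = new HashMap<Integer, String>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        FileSystem fs = FileSystem.get(conf);
        Path indexPath = new Path("/tmp/title-index.txt");   // written by the earlier job
        BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(indexPath)));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2);   // assumed "id<TAB>title" format
                titles.put(Integer.parseInt(parts[0]), parts[1]);
            }
        } finally {
            reader.close();
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... use titles.get(...) while processing each record ...
    }
}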
Thank you all for the input.
I'm working on creating a file uploader. I would like to load files into a temp folder first and then convert them to the needed format. For this I'll create a queue of tasks that will be executed by an executor. But in case of a server crash, this queue will be lost. Could anybody suggest a library that can make my queue persistent without using another server?
Instead of using an in-memory queue implementation, you can use persistent options like a DB or a JMS queue. This will avoid losing the data even if the server crashes.
You could use a DB and store the bytes in it. Invoke two threads: one will only feed the data to the DB, and another will poll it and convert the file. You can maintain a status flag indicating whether the file has been converted to the format you wanted, and also the format it needs to be converted to.
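A rough sketch of that DB-backed approach (table name, columns and SQL are illustrative; an embedded broker such as ActiveMQ with persistence enabled would be the JMS counterpart and also avoids running a separate server):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.DataSource;

// A task survives a crash because it lives in a table row, not in memory;
// after a restart the poller simply picks up the remaining PENDING rows.
public class ConversionTaskQueue {

    private final DataSource dataSource;

    public ConversionTaskQueue(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    // Called by the upload handler once the file is in the temp folder.
    public void enqueue(String tempPath, String targetFormat) throws SQLException {
        try (Connection c = dataSource.getConnection();
             PreparedStatement ps = c.prepareStatement(
                 "INSERT INTO conversion_tasks (temp_path, target_format, status) VALUES (?, ?, 'PENDING')")) {
            ps.setString(1, tempPath);
            ps.setString(2, targetFormat);
            ps.executeUpdate();
        }
    }

    // Called in a loop by the worker thread: fetch one pending task, convert it, mark it DONE.
    public void pollAndConvertOne() throws SQLException {
        try (Connection c = dataSource.getConnection();
             PreparedStatement ps = c.prepareStatement(
                 "SELECT id, temp_path, target_format FROM conversion_tasks WHERE status = 'PENDING'");
             ResultSet rs = ps.executeQuery()) {
            if (rs.next()) {
                long id = rs.getLong("id");
                convert(rs.getString("temp_path"), rs.getString("target_format"));
                try (PreparedStatement done = c.prepareStatement(
                         "UPDATE conversion_tasks SET status = 'DONE' WHERE id = ?")) {
                    done.setLong(1, id);
                    done.executeUpdate();
                }
            }
        }
    }

    private void convert(String tempPath, String targetFormat) {
        // ... your actual conversion logic ...
    }
}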
I have a small topology. It has a Kafka spout and a bolt (Bolt A) reading from the spout.
Bolt A emits to two bolts (Bolt B and Bolt C). I have used fields grouping.
Bolt A emits two different types of data: one is intended for Bolt B and the other for Bolt C.
My question is: can I configure Storm in such a way that data intended for Bolt B always goes to instances of Bolt B, and the same for Bolt C?
Currently I am checking the data received in the bolts and skipping unwanted data.
thanks
With standard Storm, the easiest way to do this would be to use "streams". You define a stream in declareOutputFields with the declareStream method on the OutputFieldsDeclarer and emit using one of the overloaded versions of emit that lets you specify a stream ID. You also need to use the version of shuffleGrouping that makes the bolt subscribe to a specific stream.
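For illustration, fragments of what Bolt A and the topology wiring might look like; the stream ids "stream-b" / "stream-c" and the isForBoltB check are made up:

// In Bolt A: declare one output stream per consumer.
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declareStream("stream-b", new Fields("id", "payload"));
    declarer.declareStream("stream-c", new Fields("id", "payload"));
}

// In Bolt A's execute(): emit on the stream that matches the data type.
@Override
public void execute(Tuple tuple) {
    if (isForBoltB(tuple)) {   // your check for which bolt the data is meant for
        collector.emit("stream-b", tuple, new Values(tuple.getValue(0), tuple.getValue(1)));
    } else {
        collector.emit("stream-c", tuple, new Values(tuple.getValue(0), tuple.getValue(1)));
    }
    collector.ack(tuple);
}

// When wiring the topology, each bolt subscribes only to its own stream.
// fieldsGrouping also has an overload that takes a stream id, so you can keep
// your existing fields grouping if you need it.
builder.setBolt("bolt-b", new BoltB(), 2).shuffleGrouping("bolt-a", "stream-b");
builder.setBolt("bolt-c", new BoltC(), 2).fieldsGrouping("bolt-a", "stream-c", new Fields("id"));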
I have to set up Camel to process data files where the first line of the file is metadata and it is then followed by millions of lines of actual data. The metadata dictates how the data is to be processed. What I am looking for is something like this:
1. Read the first line (metadata) and populate a bean with the metadata.
2. Then send the data 1000 lines at a time to the data processor, which will refer to the bean from step 1.
Is it possible in Apache Camel?
Yes.
An example architecture might look something like this:
You could set up a simple queue that could be populated with file names (or whatever identifier you are using to locate each individual file).
From the queue, you could route through a message translator bean whose sole job is to translate a request for a filename into a POJO that contains the metadata from the first line of the file.
(You have a few options here)
Your approach to processing the 1000-line sets will depend on whether or not the output or resulting data created from those sets needs to be recomposed into a single message and processed again later. If so, you could implement a composed message processor made up of a message producer/consumer, a message aggregator and a router. The message producer/consumer would receive the POJO with the metadata created in step 2 and enqueue as many new requests as are necessary to process all of the lines in the file. The router would route from this queue through your processing pipeline and into the message aggregator. Once aggregated, a single unified message with all of your important data will be available for you to do with what you will.
If instead each 1000-line set can be processed independently and rejoining is not required, then it is not necessary to aggregate the messages. Instead, you can use a router to route from step 2 to a producer/consumer that will, as above, enqueue the necessary number of new requests for each file. Finally, the router will route from this final queue to a consumer that will do the processing.
Since you have a large quantity of data to deal with, it will likely be difficult to pass around 1000-line groups of data in messages, especially if they are being placed in a queue (you don't want to run out of memory). I recommend passing around some type of indicator that identifies which lines of the file a specific request is for, and then parsing the 1000 lines only when you need them. You could do this in a number of ways, for example by calculating how many bytes into the file a specific line starts and then using a file reader's skip() method to jump to that point when the request hits the bean that will be processing it.
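As a rough sketch of the front of that flow in Camel's Java DSL (the endpoint names and the MetadataTranslator bean are illustrative, not Camel built-ins):

import org.apache.camel.builder.RouteBuilder;

// Pull file names off a queue, turn the first line into a metadata POJO,
// then hand off to the part of the flow that fans out the 1000-line requests.
public class FileIntakeRoute extends RouteBuilder {
    @Override
    public void configure() {
        from("activemq:queue:incoming-files")                 // queue of file names
            .bean(MetadataTranslator.class, "readFirstLine")  // hypothetical translator bean
            .to("direct:enqueueLineRequests");                // producer/consumer + router/aggregator
    }
}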
Here are some resources provided on the Apache Camel website that describe the enterprise integration patterns that I mentioned above:
http://camel.apache.org/message-translator.html
http://camel.apache.org/composed-message-processor.html
http://camel.apache.org/pipes-and-filters.html
http://camel.apache.org/eip.html
I have a Spring Integration app that puts incoming files onto a channel. From there I'd like to be able to send the same file to two different processing pipelines (one archiving to S3, another parsing the contents) and later have a downstream component that can recognise when both have been successfully processed, and thus delete the actual local file.
The semantics are as if I needed a Splitter/Aggregator, but instead of splitting the message I need to duplicate it.
Is there any way to achieve this with available components, or will it require some custom classes?
Yes, a <publish-subscribe-channel/> (with apply-sequence="true") will work similarly to a splitter - however both subscribers to the channel will get the SAME File object. By default the two branches will be executed serially but you can introduce an ExecutorChannel if you want to process in parallel.
If you want each subscriber to get a different File object, you could add a transformer...
<transformer ... expression="new java.io.File(payload.absolutePath)" />
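Putting that together, the wiring might look roughly like this (channel and bean names are illustrative):

<publish-subscribe-channel id="incomingFiles" apply-sequence="true"/>

<!-- Both subscribers receive the same File payload. -->
<service-activator input-channel="incomingFiles" output-channel="processed"
                   ref="s3Archiver" method="archive"/>
<service-activator input-channel="incomingFiles" output-channel="processed"
                   ref="contentParser" method="parse"/>

<!-- apply-sequence="true" sets the correlation headers, so the default aggregator
     release strategy can tell when both branches are done; the downstream handler
     can then delete the local file. -->
<aggregator input-channel="processed" output-channel="readyForCleanup"/>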