I have to set up Camel to process data files where the first line of the file is metadata, followed by millions of lines of actual data. The metadata dictates how the data is to be processed. What I am looking for is something like this:
1. Read the first line (metadata) and populate a bean with it --> 2. then send the data 1000 lines at a time to the data processor, which will refer to the bean from step 1.
Is it possible in Apache Camel?
Yes.
An example architecture might look something like this:
You could set up a simple queue populated with file names (or whatever identifier you are using to locate each individual file).
From the queue, you could route through a message translator bean whose sole job is to translate a request for a file name into a POJO containing the metadata from the first line of the file.
(You have a few options here)
Your approach to processing the 1000-line sets will depend on whether or not the output or resulting data created from those sets needs to be recomposed into a single message and processed again later. If so, you could implement a composed message processor made up of a message producer/consumer, a message aggregator, and a router. The message producer/consumer would receive the POJO with the metadata created in step 2 and enqueue as many new requests as are necessary to process all of the lines in the file. The router would route from this queue through your processing pipeline and into the message aggregator. Once aggregated, a single unified message with all of your important data will be available for you to do with as you will.
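To make that concrete, here is a rough Camel RouteBuilder sketch of the composed-message-processor option; the queue names, the MetadataTranslator and ChunkPlanner beans, and the trivial aggregation strategy are all illustrative placeholders rather than a prescribed API:

import java.util.ArrayList;
import java.util.List;
import org.apache.camel.builder.RouteBuilder;

public class FileChunkRoute extends RouteBuilder {

    // Message translator: turns a file name into a metadata POJO (stubbed here).
    public static class MetadataTranslator {
        public FileMetadata translate(String fileName) {
            return new FileMetadata(fileName); // would read the first line of the file here
        }
    }

    public static class FileMetadata {
        public final String fileName;
        public FileMetadata(String fileName) { this.fileName = fileName; }
    }

    // Emits one request per 1000-line chunk, e.g. "file.dat:0", "file.dat:1000", ...
    public static class ChunkPlanner {
        public List<String> chunkRequests(FileMetadata meta) {
            List<String> requests = new ArrayList<>();
            for (int line = 0; line < 1_000_000; line += 1000) { // real line count would come from the metadata
                requests.add(meta.fileName + ":" + line);
            }
            return requests;
        }
    }

    @Override
    public void configure() {
        from("seda:fileNames")
            .bean(MetadataTranslator.class)
            // composed message processor: split into chunk requests, process each,
            // and let the splitter's strategy recompose a single result message
            .split(method(ChunkPlanner.class, "chunkRequests"),
                   (oldEx, newEx) -> newEx) // trivial strategy; a real one would combine results
                .to("seda:chunkProcessor")
            .end()
            .to("direct:wholeFileResult");
    }
}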
If instead each 1000-line set can be processed independently and rejoining is not required, then it is not necessary to aggregate the messages. Instead, you can use a router to route from step 2 to a producer/consumer that will, like above, enqueue the necessary number of new requests for each file. Finally, the router will route from this final queue to a consumer that will do the processing.
Since you have a large quantity of data to deal with, it will likely be difficult to pass around 1000-line groups of data in messages, especially if they are being placed in a queue (you don't want to run out of memory). I recommend passing around some type of indicator that identifies which lines of the file a specific request is for, and then parsing the 1000 lines only when you need them. You could do this in a number of ways, for example by calculating how many bytes into the file a specific line starts and then using a file reader's skip() method to jump to that position when the request hits the bean that will process it.
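As a plain-Java sketch of that indicator idea (names are illustrative): scan the file once, recording the byte offset where each 1000-line chunk starts, and let each request seek directly to its offset instead of re-reading everything before it:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

public class ChunkIndex {

    // Scan once, remembering the byte offset where every chunkSize-th line starts.
    // Note: RandomAccessFile.readLine assumes a single-byte encoding.
    public static List<Long> indexChunks(String path, int chunkSize) throws IOException {
        List<Long> offsets = new ArrayList<>();
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long lineNumber = 0;
            long offset = file.getFilePointer();
            while (file.readLine() != null) {
                if (lineNumber % chunkSize == 0) {
                    offsets.add(offset);
                }
                lineNumber++;
                offset = file.getFilePointer();
            }
        }
        return offsets;
    }

    // Later, the processing bean jumps straight to its chunk.
    public static List<String> readChunk(String path, long offset, int chunkSize)
            throws IOException {
        List<String> lines = new ArrayList<>(chunkSize);
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            file.seek(offset);
            String line;
            while (lines.size() < chunkSize && (line = file.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}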
Here are some resources provided on the Apache Camel website that describe the enterprise integration patterns that I mentioned above:
http://camel.apache.org/message-translator.html
http://camel.apache.org/composed-message-processor.html
http://camel.apache.org/pipes-and-filters.html
http://camel.apache.org/eip.html
I have a use case where my messages are being split twice, and I want to aggregate all of these messages. How can this best be achieved: should I aggregate the messages twice by introducing different sequence headers, or is there a way to aggregate the messages in a single aggregating step by overriding how messages are grouped?
That's called "nested splitting", and there is a built-in algorithm that pushes sequence-detail headers onto a stack for each new splitting context. This allows an ascending aggregation at the end: the first aggregator handles the closest nested split, pops the sequence-detail headers, and lets the next aggregator deal with its own sequence context.
So, in short: it is better to have as many aggregators as you have splitters if you want to send a single message at the start and receive a single message at the end.
Of course, you can instead use a custom splitting algorithm with applySequence = false (as many splits as you need) and a single aggregator at the end, but then with custom correlation logic.
We have some explanation in the docs: https://docs.spring.io/spring-integration/docs/current/reference/html/message-routing.html#aggregatingmessagehandler
Starting with version 5.3, after processing message group, an AbstractCorrelatingMessageHandler performs a MessageBuilder.popSequenceDetails() message headers modification for the proper splitter-aggregator scenario with several nested levels.
We don't have a sample on the matter, but here is a configuration for test case: https://github.com/spring-projects/spring-integration/blob/main/spring-integration-core/src/test/java/org/springframework/integration/aggregator/scenarios/NestedAggregationTests-context.xml
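For illustration, here is a minimal Java DSL sketch of the split-twice/aggregate-twice shape (assuming Spring Integration 5.3+, where the second aggregator works because popSequenceDetails() has restored the outer sequence headers; the flow and per-item work are made up):

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.dsl.IntegrationFlow;

@Configuration
public class NestedSplitAggregateConfig {

    @Bean
    public IntegrationFlow nestedFlow() {
        // Messages enter on the implicit "nestedFlow.input" channel.
        return f -> f
                .split()     // outer split: pushes sequence details onto the stack
                .split()     // inner split: pushes a second level
                .handle(String.class, (item, headers) -> item.toUpperCase()) // per-item work
                .aggregate()  // inner aggregator: releases a group, pops sequence details
                .aggregate(); // outer aggregator: now sees the original sequence headers
    }
}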
I am using Spark Structured Streaming with Java 8 to process messages. I read the messages from Kafka, write them to a file, and save the file in HDFS.
I have a requirement to write a sequence number along with each message to the file.
For example, for the first message from Kafka the output file content would be "message, 1", for the second message "message, 2", and so on, as a running count.
If the message count reaches some threshold, say "message, 999999", then I need to reset the sequence to 1 again for the next message received.
If the Spark streaming job is restarted, it should continue the sequence where it left off, so I need to save this number in HDFS, similar to a checkpointLocation.
What is the best approach to implement this sequence? Can I use an Accumulator for it, or is there a better approach for distributed processing? Or is it not possible in distributed processing at all?
It won't be that hard. You can read each message using a map function and keep adding the count to the messages. The count can be maintained within your code logic.
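For what it's worth, here is a rough Java sketch of that idea. Note the limits: this numbers rows only within each partition of a micro-batch, so a single global, restart-safe sequence would additionally need the stream coalesced to one partition and the last value persisted (for example to HDFS), as the question suggests:

import java.util.Iterator;
import org.apache.spark.api.java.function.MapPartitionsFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;

public class SequenceNumbering {

    private static final long THRESHOLD = 999_999L;

    // Appends ",<seq>" to every message, wrapping back to 1 after THRESHOLD.
    public static Dataset<String> number(Dataset<String> messages, long startFrom) {
        return messages.mapPartitions(
            (MapPartitionsFunction<String, String>) input -> new Iterator<String>() {
                private long seq = startFrom;

                @Override
                public boolean hasNext() {
                    return input.hasNext();
                }

                @Override
                public String next() {
                    seq = seq >= THRESHOLD ? 1 : seq + 1;
                    return input.next() + "," + seq;
                }
            },
            Encoders.STRING());
    }
}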
My project streams object data through Storm to a graphics application. The appearance of these objects depends upon variables assigned by a bolt in the Storm topology.
My question is whether it is possible to update the bolt's processing by sending it a message that changes the variables it attaches to object data. For example, I might send a message to the bolt declaring that I want any object with parameter x above a certain number to appear red rather than blue.
The bolt process would then append a red rgb variable to the object data rather than blue.
I was thinking this would be possible by having a displayConfig class that the bolt uses to apply appearance and whose contents can be edited by messages with a certain header.
Is this possible?
It is possible, but you need to do it manually and prepare your topology before you start it.
There are two ways to do this:
1. Use a local config file for the bolt that you put on the worker machines (maybe via NFS). The bolts regularly check the file for updates and read the updated configuration whenever you change the file.
2. Use one more spout that produces a configuration stream. All bolts that should receive configuration updates at runtime need to consume from this configuration spout via "allGrouping". When processing an input tuple, you check whether it is a regular data tuple or a configuration tuple, and update your configuration accordingly (a sketch follows below).
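A rough sketch of the second approach against the Storm 2.x API (the spout names, field names, and threshold are made up): the bolt subscribes to the data spout normally and to the config spout via allGrouping, and branches on the tuple's source component:

import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class ConfigurableAppearanceBolt extends BaseRichBolt {

    private OutputCollector collector;
    private volatile int redThreshold = Integer.MAX_VALUE; // default: everything blue

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        if ("config-spout".equals(tuple.getSourceComponent())) {
            // configuration tuple: update the bolt's state, emit nothing
            redThreshold = tuple.getIntegerByField("redThreshold");
        } else {
            // regular data tuple: append an rgb value based on the current config
            int x = tuple.getIntegerByField("x");
            String rgb = x > redThreshold ? "255,0,0" : "0,0,255";
            collector.emit(tuple, new Values(tuple.getValueByField("object"), rgb));
        }
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("object", "rgb"));
    }
}

When building the topology, the wiring would look something like: builder.setBolt("appearance", new ConfigurableAppearanceBolt()).shuffleGrouping("data-spout").allGrouping("config-spout");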
I have a Spring Integration app that puts incoming files onto a channel. From there I'd like to be able to send the same file to two different processing pipelines (one archiving to S3, another parsing the contents) and later have a downstream component that can recognise when both have been successfully processed, and thus delete the actual local file.
The semantics are as if I needed a Splitter/Aggregator, but instead of splitting the message I need to duplicate it.
Is there any way to achieve this with available components, or will it require some custom classes?
Yes, a <publish-subscribe-channel/> (with apply-sequence="true") will work similarly to a splitter; however, both subscribers to the channel will get the SAME File object. By default the two branches will be executed serially, but you can introduce an ExecutorChannel if you want to process them in parallel.
If you want each subscriber to get a different File object, you could add a transformer...
<transformer ... expression="new java.io.File(payload.absolutePath)" />
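If it helps, here is an equivalent Java DSL sketch (the channel names and the archiveToS3/parseContents helpers are hypothetical): apply-sequence gives both copies splitter-style correlation headers, so a default aggregator can release once both branches have finished, and a final handler can then delete the file:

import java.io.File;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.dsl.IntegrationFlow;

@Configuration
public class FanOutConfig {

    @Bean
    public IntegrationFlow fanOut() {
        // Incoming files arrive on the implicit "fanOut.input" channel.
        return f -> f.publishSubscribeChannel(pubSub -> pubSub
                .applySequence(true) // adds correlation/sequence headers to both copies
                .subscribe(archive -> archive
                        .handle(File.class, (file, headers) -> archiveToS3(file))
                        .channel("cleanup.input"))
                .subscribe(parse -> parse
                        .handle(File.class, (file, headers) -> parseContents(file))
                        .channel("cleanup.input")));
    }

    @Bean
    public IntegrationFlow cleanup() {
        return f -> f
                .aggregate() // default correlation: the headers added by applySequence(true)
                .handle(message -> {
                    // both branches are done; safe to delete the local file here
                });
    }

    // Hypothetical processing steps; each returns the File so the aggregator sees it.
    private File archiveToS3(File file) { return file; }

    private File parseContents(File file) { return file; }
}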
I have a message carrying XML (an order) containing multiple homogeneous nodes (think a list of products) in addition to other info (think address, customer details, etc.). I have to enrich each 'product' with details provided by an external service and return the same complete XML 'order' message with enriched 'products'.
I came up with this sequence of steps:
Split the original XML with XPath into separate messages (also keeping the original message)
Enrich split messages with additional data
Put enriched parts back into the original message by replacing old elements.
I was trying to use multicast, sending the original message both to an endpoint where splitting and enriching is done and to an aggregation endpoint where the original message and the split-and-enriched messages are aggregated and then passed to a processor responsible for combining these parts back into a single XML file. But I couldn't get the desired effect...
What would be the correct and nice way to solve this problem?
The Splitter EIP in Camel can aggregate messages back (as a Composed Message Processor EIP).
http://camel.apache.org/splitter
See this video which demonstrates such a use-case
http://davsclaus.blogspot.com/2011/09/video-using-splitter-eip-and-aggregate.html
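For reference, a minimal sketch of that splitter-with-aggregation approach against the Camel 3.x API (the endpoint names and the productEnricher bean are placeholders; in Camel 2 the AggregationStrategy interface lives in org.apache.camel.processor.aggregate):

import org.apache.camel.AggregationStrategy;
import org.apache.camel.Exchange;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.language.xpath.XPathBuilder;

public class OrderEnrichmentRoute extends RouteBuilder {

    @Override
    public void configure() {
        from("direct:order")
            // split out each product, enrich it, and let the splitter's
            // AggregationStrategy recompose a single order message
            .split(XPathBuilder.xpath("/order/products/product"), new MergeProductsStrategy())
                .to("bean:productEnricher")
            .end()
            .to("direct:completeOrder");
    }

    // Collects the enriched fragments; a real implementation would splice them
    // back into the original order XML (e.g. kept in a header or exchange property).
    static class MergeProductsStrategy implements AggregationStrategy {
        @Override
        public Exchange aggregate(Exchange oldExchange, Exchange newExchange) {
            if (oldExchange == null) {
                return newExchange;
            }
            String merged = oldExchange.getIn().getBody(String.class)
                    + newExchange.getIn().getBody(String.class);
            oldExchange.getIn().setBody(merged);
            return oldExchange;
        }
    }
}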