I am using Spark Structured Streaming to process messages, with Java 8. I read messages from Kafka, write each message to a file, and save the file in HDFS.
I now have a requirement to write a sequence number along with each message to the file.
For example, for the first message from Kafka the output file content will be "message, 1", for the second message "message, 2", and so on, like a count.
If the count reaches some threshold, say "message, 999999", then I need to reset the sequence to 1 for the next message I receive.
If the Spark streaming job is restarted, it should continue the sequence where it left off, so I need to save this number somewhere in HDFS, similar to a checkpointLocation.
What is the best approach to implement this sequence? Can I use an Accumulator for it, is there a better approach for distributed processing, or is it not possible in distributed processing at all?
It won't be that hard. You can read each message with a map function and keep adding the count to the messages; the count can be maintained within your code logic.
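As an illustration of the wrap-and-persist part of that logic, here is a self-contained sketch. It is not Spark-specific: the class name and file-based persistence are made up for the example, and a real job would write the state into an HDFS checkpoint directory. Note also that for a globally ordered sequence across executors you would typically coalesce to a single partition or use an external atomic counter, since Spark Accumulators are write-only on executors and can only be read on the driver.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative sequence generator: wraps back to 1 after a threshold and
// can persist/restore its state from a file (HDFS in a real deployment).
public class SequenceGenerator {
    private final long threshold;
    private long current;

    public SequenceGenerator(long threshold, long start) {
        this.threshold = threshold;
        this.current = start;
    }

    // Returns the next sequence number, resetting to 1 after the threshold.
    public synchronized long next() {
        current = (current >= threshold) ? 1 : current + 1;
        return current;
    }

    // Persist the current value so a restarted job can resume the sequence.
    public synchronized void save(Path file) throws IOException {
        Files.write(file, Long.toString(current).getBytes());
    }

    // Rebuild the generator from the saved state, or start fresh if absent.
    public static SequenceGenerator restore(Path file, long threshold) throws IOException {
        long start = Files.exists(file)
                ? Long.parseLong(new String(Files.readAllBytes(file)).trim())
                : 0;
        return new SequenceGenerator(threshold, start);
    }
}
```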
I'm implementing the logic for rebuilding a file split across several Kafka messages with the same key. Every time a page is received, its content is appended to the corresponding file in a shared volume, and once the last page is appended, the topology has to perform some extra processing steps.
Should this be done with foreach or with process?
Both foreach and process have a void return type; how can the final extra steps then be added to the topology?
Both accomplish the same goal. foreach is a terminal action of the DSL; the process method of the Processor API, however, "returns" data to the next Processors by forwarding it to the ProcessorContext, as answered in your last question, not via the process method's return value.
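To make the forwarding contract concrete, here is a self-contained sketch that mimics it in plain Java. The Context class below is a stand-in for Kafka Streams' ProcessorContext, not the real API, and the "last-page" marker is invented for the example:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the Processor API's contract: process() returns
// nothing; records reach downstream processors via the context object.
public class ForwardingDemo {

    // Stand-in for Kafka Streams' ProcessorContext#forward.
    static class Context {
        final List<String> downstream = new ArrayList<>();
        void forward(String key, String value) {
            downstream.add(key + "=" + value);
        }
    }

    // A "process" step: void return type, output goes through the context.
    static void process(Context ctx, String key, String value) {
        if (value.endsWith("/last-page")) {
            ctx.forward(key, "file-complete");  // trigger the extra steps
        }
    }
}
```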
I'm working on creating a file uploader. I would like to load files into a temp folder first and then convert them to the needed format. For this I'll create a queue of tasks that will be executed by an executor. But in case of a server crash, this queue will be lost. Could anybody suggest a library, without using another server, that can make my queue persistent?
Instead of using an in-memory queue implementation, you can use persistent options such as a database or a JMS queue. This avoids losing data even if the server crashes.
You could use a database and store the bytes in it. Invoke two threads: one will feed the data into the DB, and the other will poll it to convert the files. You can track the status of whether a file has been converted to the format you wanted, and also which format it needs to be converted to.
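If an external DB or broker is not an option, even a simple append-only journal on disk can make the queue survive a crash. Here is a minimal sketch (the class name is hypothetical, and completed tasks are not compacted out of the journal, which a real implementation would also handle):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal file-backed queue: every offered task is appended to a journal
// so the queue can be rebuilt after a crash.
public class JournaledQueue {
    private final Path journal;
    private final Deque<String> tasks = new ArrayDeque<>();

    public JournaledQueue(Path journal) throws IOException {
        this.journal = journal;
        if (Files.exists(journal)) {            // replay after a restart
            for (String line : Files.readAllLines(journal, StandardCharsets.UTF_8)) {
                if (!line.isEmpty()) tasks.add(line);
            }
        }
    }

    // Durably record the task before exposing it to consumers.
    public void offer(String task) throws IOException {
        Files.write(journal, (task + "\n").getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        tasks.add(task);
    }

    public String poll() { return tasks.poll(); }

    public int size() { return tasks.size(); }
}
```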
Is there functionality built into Kafka Streams that allows for dynamically connecting a single input stream into multiple output streams? KStream.branch allows branching based on true/false predicates, but this isn't quite what I want. I'd like each incoming log to determine the topic it will be streamed to at runtime, e.g., a log {"date": "2017-01-01"} will be streamed to the topic topic-2017-01-01 and a log {"date": "2017-01-02"} will be streamed to the topic topic-2017-01-02.
I could call foreach on the stream and then write to a Kafka producer, but that doesn't seem very elegant. Is there a better way to do this within the Streams framework?
If you want to create topics dynamically based on your data, you do not get any support within Kafka's Streams API at the moment (v0.10.2 and earlier). You will need to create a KafkaProducer and implement the dynamic "routing" yourself (for example using KStream#foreach() or KStream#process()). Note that you need to do synchronous writes to avoid data loss (which are unfortunately not very performant). There are plans to extend the Streams API with dynamic topic routing, but there is no concrete timeline for this feature right now.
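As a sketch of the routing part only: the TopicRouter class and the regex-based date extraction below are illustrative (a real implementation would use a proper JSON parser), and you would pass the result to your own KafkaProducer.send() inside foreach():

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that derives the destination topic from a record's
// "date" field, e.g. {"date": "2017-01-01"} -> topic-2017-01-01.
public class TopicRouter {
    private static final Pattern DATE =
            Pattern.compile("\"date\"\\s*:\\s*\"([0-9-]+)\"");

    public static String topicFor(String json) {
        Matcher m = DATE.matcher(json);
        if (!m.find()) {
            throw new IllegalArgumentException("no date field: " + json);
        }
        return "topic-" + m.group(1);
    }
}
```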
There is one more consideration you should take into account. If you do not know your destination topic(s) ahead of time and just rely on the so-called "topic auto creation" feature, you should make sure that those topics are being created with the desired configuration settings (e.g., number of partitions or replication factor).
As an alternative to "topic auto creation" you can also use Admin Client (available since v0.10.1) to create topics with correct configuration. See https://cwiki.apache.org/confluence/display/KAFKA/KIP-4+-+Command+line+and+centralized+administrative+operations
I want to read from and write to a Java list at the same time.
For example, I want to fill my list from a data source (e.g., producers) and read the data with another program (Apache Storm with a spout).
To give a clearer idea, there is a broker that sends data to Apache Storm: Broker -> My API -> Spout -> Bolt. My API receives the data and fills my list, while the spout reads data from the list and removes it.
But when I try to do this, I get a ConcurrentModificationException.
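A common fix is to replace the shared list with a thread-safe queue from java.util.concurrent, which is designed for exactly this producer/consumer handoff. A minimal sketch (class and method names are illustrative):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// A plain ArrayList is not safe for concurrent add/remove; a BlockingQueue
// lets one thread (the API) fill while another (the spout) drains.
public class HandoffBuffer {
    private final BlockingQueue<String> buffer = new LinkedBlockingQueue<>();

    // Called by the broker-facing API to enqueue incoming data.
    public void put(String msg) throws InterruptedException {
        buffer.put(msg);
    }

    // Called by the spout; blocks until data is available.
    public String take() throws InterruptedException {
        return buffer.take();
    }

    // Non-blocking variant: returns null when the buffer is empty.
    public String pollNow() {
        return buffer.poll();
    }
}
```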
I have to set up Camel to process data files where the first line of the file is metadata, followed by millions of lines of actual data. The metadata dictates how the data is to be processed. What I am looking for is something like this:
1. Read the first line (metadata) and populate a bean with it.
2. Then send the data 1000 lines at a time to the data processor, which will refer to the bean from step 1.
Is it possible in Apache Camel?
Yes.
An example architecture might look something like this:
You could setup a simple queue that could be populated with file names (or whatever identifier you are using to locate each individual file).
From the queue, you could route through a message translator bean, whose sole job is to translate a request for a filename into a POJO containing the metadata from the first line of the file.
(You have a few options here)
Your approach to processing the 1000-line sets will depend on whether or not the output resulting from those sets needs to be recomposed into a single message and processed again later. If so, you could implement a composed message processor made up of a message producer/consumer, a message aggregator, and a router. The producer/consumer would receive the POJO with the metadata created in step 2 and enqueue as many new requests as are necessary to process all of the lines in the file. The router would route from this queue through your processing pipeline and into the message aggregator. Once aggregated, a single unified message with all of your important data will be available for you to do with what you will.
If instead each 1000-line set can be processed independently and rejoining is not required, then it is not necessary to aggregate the messages. Instead, you can use a router to route from step 2 to a producer/consumer that will, as above, enqueue the necessary number of new requests for each file. Finally, the router will route from this final queue to a consumer that will do the processing.
Since you have a large quantity of data to deal with, it will likely be difficult to pass around 1000-line groups of data in messages, especially if they are placed on a queue (you don't want to run out of memory). I recommend passing around some type of indicator that identifies which lines of the file a specific request covers, and then parsing the 1000 lines only when you need them. You could do this in a number of ways, for example by calculating how many bytes into the file a specific line starts and then using a file reader's skip() method to jump to that line when the request hits the bean that will process it.
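The offset idea can be sketched as follows. The class name is hypothetical; the index step here reads the whole file into memory for brevity (a real implementation would scan it as a stream), and it uses RandomAccessFile.seek() rather than a reader's skip():

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Sketch: record the byte offset at which each line starts, then jump
// straight to a line with seek() instead of re-reading the file.
public class LineOffsets {

    // Index the byte offset of the start of every line.
    public static List<Long> index(Path file) throws IOException {
        List<Long> offsets = new ArrayList<>();
        offsets.add(0L);
        long pos = 0;
        for (byte b : Files.readAllBytes(file)) {
            pos++;
            if (b == '\n') offsets.add(pos);
        }
        return offsets;
    }

    // Read a single line by seeking directly to its recorded offset.
    public static String readLineAt(Path file, long offset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(offset);
            return raf.readLine();
        }
    }
}
```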
Here are some resources provided on the Apache Camel website that describe the enterprise integration patterns that I mentioned above:
http://camel.apache.org/message-translator.html
http://camel.apache.org/composed-message-processor.html
http://camel.apache.org/pipes-and-filters.html
http://camel.apache.org/eip.html