Kafka is the source for our Beam pipeline. Apache Beam's Kafka IO connector supports advancing the watermark (at least on the Flink runner) even when a partition is idle. Applications that want to process packets based on a timestamp embedded in the payload are expected to use "CustomTimestampPolicyWithLimitedDelay". We use one-minute fixed windows for aggregation, which depends on this notion of time, so if the watermark does not advance properly the aggregation function is never called and data is missed.
This API has a functional issue at initialization. Suppose, for example, that topic a with three partitions is used as the source. These steps reproduce the issue:
1. Pump data to only one partition, at a frequency of once every x seconds. Observation: the aggregation function is not called even after several minutes.
2. Now pump data to all partitions. Observation: the aggregation function is called at the end of the minute, as expected.
3. Now pump data to only one partition again, stopping just before the end of the minute so that an idle-partition scenario is generated. Observation: it now works as expected.
In summary, there is an initialization issue with this API: it does not advance the watermark at first, but after step 2 it stabilizes and works as expected.
This is easily reproducible, and we would ask Apache Beam to fix it.
For now the temporary fix we have gone with is LogAppendTime, which works flawlessly, but we do not want to process packets on broker time due to various application needs.
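For reference, a minimal sketch of the configuration in question (the broker address and the extractEventTime helper are hypothetical; the policy and factory are the actual Beam APIs):

```java
import org.apache.beam.sdk.io.kafka.CustomTimestampPolicyWithLimitedDelay;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.io.kafka.KafkaRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.joda.time.Duration;
import org.joda.time.Instant;

public class PayloadTimeSource {

  // Hypothetical helper: parse the event time out of the payload.
  static Instant extractEventTime(String payload) {
    return Instant.parse(payload.substring(0, 24)); // placeholder parsing
  }

  // Read topic "a", taking event time from the payload and allowing the
  // watermark to lag the payload time by at most one minute.
  static KafkaIO.Read<String, String> source() {
    return KafkaIO.<String, String>read()
        .withBootstrapServers("broker:9092") // hypothetical broker
        .withTopic("a")
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .withTimestampPolicyFactory(
            (tp, previousWatermark) ->
                new CustomTimestampPolicyWithLimitedDelay<String, String>(
                    (KafkaRecord<String, String> rec) ->
                        extractEventTime(rec.getKV().getValue()),
                    Duration.standardMinutes(1),
                    previousWatermark));
  }
}
```

Downstream we window with Window.into(FixedWindows.of(Duration.standardMinutes(1))), which is where the stalled watermark prevents the aggregation from firing.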
I'm trying to black-box test a Spring Boot application that uses Spring Cloud Stream Kafka. The expected results (in the DB) may differ based on the message processing order. How can I reliably tell that one message has been processed so I can send in the next? One important factor is that one message from the test can generate multiple events (messages) within the application.
I have tried the following methods:
1. Wait a fixed amount of time: this usually works, but if someone's PC is hot and throttling, it can become flaky, and to be honest it is just ugly.
2. Create an aspect to count method invocations, serve the count through a controller, query it multiple times, and send the next message when we're "settled": the timing of the queries matters, so this is unreliable.
3. Periodically check the Kafka consumer lag, either from code or by querying the actuator, with multiple samples: this is a mixture of the two above, sometimes slower than the first but more reliable (a rough sketch follows this list).
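For what it's worth, the lag check in method 3 boils down to something like this (a sketch using the plain kafka-clients AdminClient; the group id is whatever your binding uses):

```java
import java.util.Map;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class LagCheck {
    // Total lag = sum over partitions of (end offset - committed offset).
    // Sample repeatedly; treat a few consecutive zero readings as "settled".
    static long totalLag(AdminClient admin, KafkaConsumer<?, ?> consumer, String groupId)
            throws Exception {
        Map<TopicPartition, OffsetAndMetadata> committed =
            admin.listConsumerGroupOffsets(groupId).partitionsToOffsetAndMetadata().get();
        Map<TopicPartition, Long> ends = consumer.endOffsets(committed.keySet());
        return committed.entrySet().stream()
            .mapToLong(e -> ends.get(e.getKey()) - e.getValue().offset())
            .sum();
    }
}
```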
Is there any official way of doing this?
Configure the container to emit ListenerContainerIdleEvents.
See https://docs.spring.io/spring-kafka/docs/current/reference/html/#idle-containers
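A rough sketch of what that looks like (the bean name and latch-based wait are arbitrary choices for a test):

```java
import java.util.concurrent.CountDownLatch;
import org.springframework.context.event.EventListener;
import org.springframework.kafka.event.ListenerContainerIdleEvent;
import org.springframework.stereotype.Component;

@Component
public class IdleWatcher {

    // Completed once the container reports no records for the configured interval.
    private final CountDownLatch idle = new CountDownLatch(1);

    @EventListener
    public void onIdle(ListenerContainerIdleEvent event) {
        idle.countDown();
    }

    public void awaitIdle() throws InterruptedException {
        idle.await(); // test code blocks here before sending the next message
    }
}
```

The events are only published if the container's idle interval is set, e.g. factory.getContainerProperties().setIdleEventInterval(5000L).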
Problem statement:
I have one server that continuously produces prices. The code is written in Java with Spring Boot.
I have multiple consumer servers / Java programs running on different systems, which will use the prices produced by the first server.
Question: I produce around 300-400 prices (data points) per millisecond. What is the best way to transfer this continuous data from the one producer server to multiple consumer servers?
Data size: hardly a few KB.
I researched on Google and found some options:
1. I can create one Kafka producer that publishes the prices to a topic, and the Kafka consumers will consume the same prices from that topic.
2. I can write the prices to a database table, and the consumers will read them from the same table.
3. I can use some cloud concept where I post the prices to a cloud platform, and the consumer programs will read them from there.
Which of these three is the better way, or if you have any other architecture, please help me with it.
Thanks in advance.
Go with Kafka!
With Kafka you can scale your producers and consumers independently by load. After a server failover or a deployment, consumption resumes from the committed offset. If you plan to scale horizontally, there is the nice concept of consumer groups.
If you plan on using a database, you will need to address the points below as a developer:
- Your consumers need to sync with the database at scheduled intervals.
- You need a mechanism to avoid consuming the same data twice in horizontally scaled systems.
Which cloud mechanism exactly? If it is used as storage, you will hit the same issues as option 2.
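To give a flavor of option 1, a minimal producer sketch (the broker address and topic are hypothetical; the linger/batch settings are just a starting point for many small messages):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class PriceProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092"); // hypothetical broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // Let the client batch the many small price messages per request.
        props.put("linger.ms", "5");
        props.put("batch.size", "65536");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keyed by instrument, so prices for one instrument stay ordered
            // within a single partition.
            producer.send(new ProducerRecord<>("prices", "EURUSD", "1.0842"));
        }
    }
}
```

Each consumer server then subscribes with its own group.id to receive the full price stream; consumers that share one group.id instead split the partitions between them, which is how you scale horizontally.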
I'm new to Kafka Streams and want to understand some of its processes.
An aggregation creates a Kafka changelog topic, but why? Only for backup?
As far as I understand, Kafka Streams keeps the state store locally, so when is the data from the topic used? Only if some operation fails, or to recreate the state store after an application restart?
Changelog topics are created for state stores.
Although the state store is kept locally, its contents might be needed on another instance.
If you perform an aggregation, several situations can happen:
- The application can crash.
- You can stop your application.
- A rebalance might happen.
- etc.
In those situations the intermediate results are needed to compute the final one. If there were no changelog topic holding the intermediate results, the computation would have to start from scratch.
I think the changelog topic is there for performance and fault tolerance.
Some interesting information is available on internal topics and on the duality of streams and tables (which is what the changelog topic embodies).
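For example, in the Java DSL even a simple count creates both the local store and its changelog (the internal topic is named after the application id and the store):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.state.KeyValueStore;

public class CountTopology {
    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("input-topic");

        // Backed by a local RocksDB store named "counts" and an internal
        // changelog topic "<application.id>-counts-changelog", which is used
        // to rebuild the store after a crash, a restart, or a rebalance.
        KTable<String, Long> counts =
            input.groupByKey()
                 .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("counts"));

        counts.toStream().to("output-topic", Produced.with(Serdes.String(), Serdes.Long()));
        return builder.build();
    }
}
```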
How do I create a Kafka stream that runs at a specific time every day, reads messages from a topic, does some transformations, and writes the messages back to a different topic?
For instance, a stream that runs at 9 pm every day, fetches all the messages pushed to a topic, and writes them to another topic.
I tried windowing, but all the examples pertained to aggregation only; I don't need to do aggregation.
I am using the Java DSL.
Write Java code to do what you want and configure a crontab entry to run it when you want.
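A rough sketch of such a one-shot job using the plain consumer and producer clients rather than Kafka Streams (topic names, broker address, and the transformation are placeholders), scheduled from cron with something like 0 21 * * *:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class NightlyCopyJob {
    public static void main(String[] args) {
        Properties c = new Properties();
        c.put("bootstrap.servers", "broker:9092");  // hypothetical broker
        c.put("group.id", "nightly-copy");          // committed offsets track progress between runs
        c.put("auto.offset.reset", "earliest");     // first run starts at the beginning
        c.put("enable.auto.commit", "false");
        c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Properties p = new Properties();
        p.put("bootstrap.servers", "broker:9092");
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c);
             KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
            consumer.subscribe(Collections.singletonList("source-topic"));
            ConsumerRecords<String, String> records;
            // Drain whatever has accumulated since the last run, then exit.
            while (!(records = consumer.poll(Duration.ofSeconds(10))).isEmpty()) {
                records.forEach(r -> producer.send(
                        new ProducerRecord<>("target-topic", r.key(), transform(r.value()))));
                consumer.commitSync();
            }
        }
    }

    private static String transform(String value) {
        return value.toUpperCase(); // placeholder transformation
    }
}
```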
We are running our calculations in a standalone Spark cluster, version 1.0.2 (the previous major release). We do not have any HA or recovery logic configured.
A piece of functionality on the driver side consumes incoming JMS messages and submits the corresponding jobs to Spark.
When we bring the single and only Spark master down (for testing), the driver program seems unable to figure out that the cluster is no longer usable. This results in two major problems:
1. The driver tries to reconnect to the master endlessly, or at least we couldn't wait long enough for it to give up.
2. Because of the previous point, submission of new jobs blocks (in org.apache.spark.scheduler.JobWaiter#awaitResult). I presume this is because the cluster is not reported unreachable/down, and the submission logic simply waits until the cluster comes back. For us this means we run out of JMS listener threads very quickly, since they all get blocked.
There are a couple of Akka failure-detection-related properties you can configure in Spark, but:
- The official documentation strongly recommends against enabling Akka's built-in failure detection.
- I would really like to understand how this is supposed to work by default anyway.
So, can anyone explain the designed behavior when the single Spark master in a standalone deployment fails/stops/shuts down? I wasn't able to find any proper documentation about this.
By default, Spark can handle worker failures, but not a failure of the master. If the master crashes, no new applications can be created. For that reason, two high-availability schemes are provided here: https://spark.apache.org/docs/1.4.0/spark-standalone.html#high-availability
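With the ZooKeeper-based scheme, for example, the driver can be pointed at all masters so that it registers with whichever one is currently the leader and fails over to the standby (hostnames here are hypothetical):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Assumes spark.deploy.recoveryMode=ZOOKEEPER is configured on the masters;
// listing every master in the URL lets the driver fail over between them.
SparkConf conf = new SparkConf()
    .setAppName("jms-driven-jobs")
    .setMaster("spark://master1:7077,master2:7077");
JavaSparkContext sc = new JavaSparkContext(conf);
```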
Hope this helps,
Le Quoc Do