Increase number of partitions for a topic in Java - java

I am using kafka_2.12, version 2.3.0. Based on the traffic/load, I want to change the number of partitions for a topic. Is it possible to make this kind of change once Kafka is up, and can it be done from code?

Yes, you can increase the partition count from code. Use the AdminClient.createPartitions method.
AdminClient.createPartitions method API documentation:
public abstract CreatePartitionsResult createPartitions(java.util.Map<java.lang.String,NewPartitions> newPartitions,CreatePartitionsOptions options)
Increase the number of partitions of the topics given as the keys of newPartitions according to the corresponding values. If partitions are increased for a topic that has a key, the partition logic or ordering of the messages will be affected.
This operation is not transactional so it may succeed for some topics while fail for others.
It may take several seconds after this method returns success for all the brokers to become aware that the partitions have been created. During this time, describeTopics(Collection) may not return information about the new partitions.
How to use:
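public static void createPartitions(String topicName, int numPartitions)
        throws InterruptedException, ExecutionException {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    // try-with-resources closes the client even if the request fails
    try (AdminClient adminClient = AdminClient.create(props)) {
        Map<String, NewPartitions> newPartitionSet = new HashMap<>();
        // increaseTo takes the new total partition count, not the number to add
        newPartitionSet.put(topicName, NewPartitions.increaseTo(numPartitions));
        // the call is asynchronous; all().get() blocks and surfaces any broker-side error
        adminClient.createPartitions(newPartitionSet).all().get();
    }
}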
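The API note above, that adding partitions to a keyed topic affects partitioning and ordering, can be illustrated with a small stdlib-only sketch. It assumes a simple "hash of key modulo partition count" scheme; Kafka's default partitioner hashes the serialized key with murmur2 rather than String.hashCode(), so the class and method names here are hypothetical and only the principle carries over.

```java
final class PartitionShiftDemo {
    // Stand-in for a key-hash partitioner (assumption: hash modulo partition count).
    // String.hashCode() is used only to keep the sketch free of Kafka dependencies.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        // Compare the partition each key maps to before and after growing from 3 to 4 partitions.
        for (String key : new String[] {"a", "b", "c"}) {
            System.out.println(key + ": " + partitionFor(key, 3) + " -> " + partitionFor(key, 4));
        }
    }
}
```

In this sketch the key "c" moves from partition 0 to partition 3 when the count grows from 3 to 4, so records for that key written after the change land in a different partition than the earlier ones, and per-key ordering across the change is no longer guaranteed.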

Related

Apache Flink : How to Call One Stream from Another Stream

My scenario: I want to trigger one stream based on input from another stream. The two streams have different types. My sample code is below. I want to trigger one stream when a message is received from the Kafka stream.
At application startup, I read data from the DB. Later, I want to read from the DB again whenever a Kafka message arrives in the stream. This is my actual use case.
How can I achieve this? Is it possible?
public class DataStreamCassandraExample implements Serializable{
private static final long serialVersionUID = 1L;
static Logger LOG = LoggerFactory.getLogger(DataStreamCassandraExample.class);
private transient static StreamExecutionEnvironment env;
static DataStream<Tuple4<UUID,String,String,String>> inputRecords;
public static void main(String[] args) throws Exception {
env = StreamExecutionEnvironment.getExecutionEnvironment();
ParameterTool argParameters = ParameterTool.fromArgs(args);
env.getConfig().setGlobalJobParameters(argParameters);
Properties kafkaProps = new Properties();
kafkaProps.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG,"localhost:9092");
kafkaProps.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "group1");
FlinkKafkaConsumer<String> kafkaConsumer = new FlinkKafkaConsumer<>("testtopic", new SimpleStringSchema(), kafkaProps);
ClusterBuilder cb = new ClusterBuilder() {
private static final long serialVersionUID = 1L;
@Override
public Cluster buildCluster(Cluster.Builder builder) {
return builder.addContactPoint("127.0.0.1")
.withPort(9042)
.withoutJMXReporting()
.build();
}
};
CassandraInputFormat<Tuple4<UUID,String,String,String>> cassandraInputFormat =
new CassandraInputFormat<> ("select * from employee_details", cb);
//While Application is start up , Read data from table and send as stream
inputRecords = getDBData(env,cassandraInputFormat);
// If any data comes from kafka means, again i want to get data from table.
//How to i trigger getDBData() method from inside this stream.
//The below code is not working
DataStream<String> inputRecords1= env.addSource(kafkaConsumer)
.map(new MapFunction<String,String>() {
private static final long serialVersionUID = 1L;
@Override
public String map(String value) throws Exception {
inputRecords = getDBData(env,cassandraInputFormat);
return "OK";
}
});
//This is not printed , when i call getDBData() stream from inside the kafka stream.
inputRecords1.print();
DataStream<Employee> empDataStream = inputRecords.map(new MapFunction<Tuple4<UUID,String,String,String>, Tuple2<String,Employee>>() {
private static final long serialVersionUID = 1L;
@Override
public Tuple2<String, Employee> map(Tuple4<UUID,String,String,String> value) throws Exception {
Employee emp = new Employee();
try{
emp.setEmpid(value.f0);
emp.setFirstname(value.f1);
emp.setLastname(value.f2);
emp.setAddress(value.f3);
}
catch(Exception e){
}
return new Tuple2<>(emp.getEmpid().toString(), emp);
}
}).keyBy(0).map(new MapFunction<Tuple2<String,Employee>,Employee>() {
private static final long serialVersionUID = 1L;
@Override
public Employee map(Tuple2<String, Employee> value)
throws Exception {
return value.f1;
}
});
empDataStream.print();
env.execute();
}
private static DataStream<Tuple4<UUID,String,String,String>> getDBData(StreamExecutionEnvironment env,
CassandraInputFormat<Tuple4<UUID,String,String,String>> cassandraInputFormat){
DataStream<Tuple4<UUID,String,String,String>> inputRecords = env
.createInput
(cassandraInputFormat
,TupleTypeInfo.of(new TypeHint<Tuple4<UUID,String,String,String>>() {}));
return inputRecords;
}
}
This is going to be a very verbose answer.
To use Flink correctly as a developer, you need to understand its basic concepts. I suggest you start with the architecture overview (https://ci.apache.org/projects/flink/flink-docs-release-1.11/concepts/flink-architecture.html); it contains all you need to know to get into the world of Flink when you come from programming.
Now, looking at your code, it should not do what you expect because of how Flink will read it. You need to understand that Flink has at least two big steps when it executes your code: first it builds an execution graph which only describes what it needs to do. This happens at the job manager level. The second big step is to ask one or many workers to execute the graph. These two steps are sequential and anything you do regarding the graph description has to be done at the job manager level not inside your operations.
In your case, the graph has:
A Kafka source.
A map that will call getDBData() at a worker level (not good because getDBData() alters the graph by adding a new Input each time it is called).
The line inputRecords = getDBData(env,cassandraInputFormat); will create an orphan branch of the graph. And the line DataStream<Employee> empDataStream = inputRecords.map... will append a branch of a map->keyBy->map to that orphan branch. This will build a part of the graph that will read all the employee records from Cassandra and apply the map->keyBy->map transformations. This will not be linked with the Kafka source in any way.
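The build-then-execute split described above can be sketched with a toy model in plain Java. This is not Flink code: ToyPipeline and its methods are hypothetical names, and real Flink behaves differently in the details. It only shows why declaring a new step from inside a running operation has no effect on the execution already in progress.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

final class ToyPipeline {
    private final List<UnaryOperator<String>> steps = new ArrayList<>();

    // "Graph building": only records what should happen later.
    ToyPipeline map(UnaryOperator<String> step) {
        steps.add(step);
        return this;
    }

    // "Execution": runs a frozen copy of the plan, much like workers run the
    // graph they were handed. Steps declared after this point are not in it.
    String execute(String input) {
        List<UnaryOperator<String>> plan = new ArrayList<>(steps);
        String value = input;
        for (UnaryOperator<String> step : plan) {
            value = step.apply(value);
        }
        return value;
    }

    int declaredSteps() {
        return steps.size();
    }

    public static void main(String[] args) {
        ToyPipeline pipeline = new ToyPipeline();
        // This step tries to extend the graph while it is being executed.
        pipeline.map(v -> {
            pipeline.map(String::toUpperCase);
            return v + "!";
        });
        System.out.println(pipeline.execute("ok"));   // the late step is not applied
        System.out.println(pipeline.declaredSteps()); // but it was recorded as an orphan
    }
}
```

Here the step added during execution is recorded but never applied to the value being processed, which mirrors the orphan branch created by calling getDBData() inside the map function.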
Now let's get back to your need. I understand you need to fetch data for an employee when his/her id comes from Kafka and do some operations.
The cleanest way to handle this is called Side Inputs. This is a data input that you declare when you build your graph; the job manager handles reading the data and transmitting it to the workers. The bad news is that Side Inputs do not yet work for streaming jobs in Flink (https://issues.apache.org/jira/browse/FLINK-2491 - this bug prevents streaming jobs from creating checkpoints, because side inputs finish quickly and this puts the job in a bizarre state).
With this being said you still have three more options. The right option depends on the size of your employee cassandra table.
The second option is to load all employees into a static final variable employees and use it inside your map functions. The downside of this approach is that the job manager will send a serialized copy of this variable to all workers, which may congest your network and overload the RAM. If the table is small and is not expected to grow, this may be an acceptable workaround until Side Inputs work for streaming jobs. If the table is big or is expected to grow, consider the third option.
The third option is an improvement on the second one. It uses Flink's broadcast state (see https://flink.apache.org/2019/06/26/broadcast-state.html and https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/state/broadcast_state.html). Short story: it is the same as before with better transfer management. Flink will find the best way to store the data and send it to the workers. This approach is, however, a little more complicated to implement correctly.
The last option is not advisable in normal circumstances. It consists of calling Cassandra inside your map operation. This is not good practice because it adds latency to every map execution (there will be as many calls as items passing through Kafka). Each call means creating a connection, sending the query, waiting for Cassandra to reply, and freeing the connection. That can be a lot of work for one step in your graph. It is a solution to consider only when you really cannot find an alternative.
For your case, I would advise the third option. I guess the employee table should not be very big and using Broadcast variables is a good choice.

In a Kafka Streams application, is there a way to define a topology with a wildcard list of output topics?

I have multi-schema Kafka Streams application that enriches a record via a join to a KTable, and then passes the enriched record along.
The input topic naming format is currently well defined but I'm changing this to a wildcard. I want to determine the input topic of each record, derive the output topic via regex replacement, and send it on.
E.g. While listening to event.raw.* a record comes in on event.raw.foo and I wish to pass it out on event.foo.
I realise I can get the input topics via the Processor API:
public class EnrichmentProcessor extends AbstractProcessor<String, GenericRecord> {
@Override
public void process(String key, GenericRecord value) {
//Do Join...
//Determine output topic and forward
String outputTopic = context().topic().replaceFirst(".raw.", ".");
context().forward(key, value, To.child(outputTopic));
context().commit();
}
}
But this doesn't help me when I'm trying to define my Topology because I have no way of knowing up front what my output topic is going to be.
InternalTopologyBuilder topologyBuilder = new InternalTopologyBuilder();
topologyBuilder.addSource("SOURCE", stringDeserializer, genericRecordDeserializer, "event.raw.*")
.addProcessor("ENRICHER", EnrichmentProcessor::new, "SOURCE")
.addSink("OUTPUT", outputTopic, stringSerializer, genericRecordSerializer, "ENRICHER"); // How can I register all possible output topics here?
Has anyone solved a situation like this before?
I know that if I had a list of possible output-topic names up front I could have multiple sinks defined on the topology but I'm not going to.
Is there a way I can define the topology to have dynamically allocated output topic names when I don't have a hard-coded list of possible output topic names up front?
This should be possible: you can use Topology#addSink(..., new TopicNameExtractor(){...}, ...) to set the output topic name dynamically. TopicNameExtractor has access to the RecordContext, which lets you get the input topic name via context.topic(). Hence, you should be able to compute the output topic name based on the input topic name.
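The name computation itself can be kept as a small pure function, which a TopicNameExtractor's extract(key, value, recordContext) method could call with recordContext.topic(). The sketch below is stdlib-only; the class name OutputTopics and the method derive are hypothetical, and it assumes the input topics follow the event.raw.* naming from the question.

```java
import java.util.regex.Pattern;

final class OutputTopics {
    // Dots are escaped so that only a literal ".raw." segment is matched.
    private static final Pattern RAW_SEGMENT = Pattern.compile("\\.raw\\.");

    // Derives the output topic from the input topic, e.g. event.raw.foo -> event.foo.
    // Topics without a ".raw." segment are returned unchanged.
    static String derive(String inputTopic) {
        return RAW_SEGMENT.matcher(inputTopic).replaceFirst(".");
    }

    public static void main(String[] args) {
        System.out.println(derive("event.raw.foo"));
    }
}
```

Escaping matters because replaceFirst treats its first argument as a regular expression: the unescaped pattern ".raw." would also match inside a hypothetical topic such as event.drawn.x and rewrite it to event...x.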

Apache Kafka (KStreams) : How to subscribe to multiple topics?

I have the following code
//Kafka Config setup
Properties props = ...; //setup
List<String> topicList = Arrays.asList("A", "B", "C");
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> source = builder.stream(topicList);
source
    .map((k, v) -> { /* busy code for mapping data */ })
    .transformValues(new MyGenericTransformer())
    .to((k, v, r) -> { /* busy code for topic routing */ });
new KafkaStreams(builder.build(), props).start();
My Problem : When I add more than one topic to subscribe to (ie A,B,C in above) the Kstream code stops receiving records.
References : https://kafka.apache.org/10/javadoc/org/apache/kafka/streams/StreamsBuilder.html
Relevant Documentation
public <K,V> KStream<K,V> stream(java.util.Collection<java.lang.String> topics)
"If multiple topics are specified there is no ordering guarantee for records from different topics."
What I'm trying to achieve : Have one Kstream (ie 'source' from above) consume/process from multiple topics.
Do the topics share the same key?
Note that the specified input topics must be partitioned by key. If
this is not the case it is the user's responsibility to repartition
the data before any key based operation (like aggregation or join) is
applied to the returned KStream.
This may be your blocker.
Another possible issue may be the consumer group used.

What are the best practices to improve kafka streams

I am producing data from one topic, A, to another, B, using Kafka Streams, but it is extremely slow. Topic A holds ~130M records.
We filter messages by a specific date range and produce the matches to topic B. Is there a way to speed this up?
Below are the configs I am using:
streamsConfiguration.put(StreamsConfig.APPLICATION_ID_CONFIG, "test");
// Where to find Kafka broker(s).
streamsConfiguration.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
// Where to find the schema registry instance(s)
streamsConfiguration.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, schemaRegistryUrl);
// streamsConfiguration.put(StreamsConfig.APPLICATION_SERVER_CONFIG, "localhost:" + port);
// streamsConfiguration.put(StreamsConfig.APPLICATION_SERVER_CONFIG, "localhost:8088");
streamsConfiguration.put(StreamsConfig.RETRIES_CONFIG, 10);
streamsConfiguration.put(StreamsConfig.RETRY_BACKOFF_MS_CONFIG, (10 * 1000L));
streamsConfiguration.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG, DefaultBugsnagExceptionHandler.getInstance().getClass());
// streamsConfiguration.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG, LogAndContinueExceptionHandler);
// Specify (de)serializers for record keys and for record values.
streamsConfiguration.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
streamsConfiguration.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, SpecificAvroSerde.class);
streamsConfiguration.put(StreamsConfig.STATE_DIR_CONFIG, stateDir);
streamsConfiguration.put(StreamsConfig.producerPrefix(ProducerConfig.ACKS_CONFIG), "all");
streamsConfiguration.put(StreamsConfig.producerPrefix(ProducerConfig.LINGER_MS_CONFIG), "10000");
streamsConfiguration.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
// Records should be flushed every 10 seconds. This is less than the default
// in order to keep this example interactive.
///Messages will be forwarded either when the cache is full or when the commit interval is reached
streamsConfiguration.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 500);
streamsConfiguration.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);
streamsConfiguration.put(KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG, true);
StreamsConfig config = new StreamsConfig(streamsConfiguration);
StreamsBuilder builder = new StreamsBuilder();
String start_date = "2018-05-10";
String end_date = "2018-05-16";
//DateFormat format = new SimpleDateFormat("yyyy-MM-dd");
//LocalDate dateTime;
// builder.stream("topicA").to("topicB");
KStream<String, avroschems> source = builder.stream("topicA");
source
.filter((k, value) -> LocalDate.parse(value.getDay()).isAfter(LocalDate.parse(start_date)) && LocalDate.parse (value.getDay()).isBefore(LocalDate.parse(end_date)))
.to("bugSnagIntegration_mobileCrashError_filtered");
System.out.println("Starting Kafka Stream");
return new KafkaStreams(builder.build(), config);
I am trying to copy the messages that fall within a date range to topicB. Not sure if that is causing the slowness?
How to achieve concurrency?
"Extremely slow" is not a very specific term. You should share some concrete throughput numbers.
About multi-threading: Increasing StreamsConfig.NUM_STREAM_THREADS_CONFIG is correct. However, this only helps if CPU is the bottleneck. If the network is the bottleneck, you need to start multiple application instances on different machines (i.e., deploy the exact same application multiple times); in this case, all instances will form a consumer group and share the load. I would recommend reading the docs for more details: https://docs.confluent.io/current/streams/architecture.html#parallelism-model
Additionally, you are able to configure the internally used consumer and producer clients. This might also help to increase throughput. Cf. https://docs.confluent.io/current/streams/developer-guide/config-streams.html#kafka-consumers-producer-and-admin-client-configuration-parameters
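As a rough sketch of the two suggestions above, the thread count and the embedded client overrides are ordinary properties that can be merged into the Streams configuration. The string keys below are the values behind StreamsConfig.NUM_STREAM_THREADS_CONFIG and the producer./consumer. prefixes; the class name ThroughputTuning and all numeric values are illustrative assumptions, not recommendations.

```java
import java.util.Properties;

final class ThroughputTuning {
    // Builds overrides to merge into the Streams configuration.
    static Properties overrides(int streamThreads) {
        Properties props = new Properties();
        // Same key as StreamsConfig.NUM_STREAM_THREADS_CONFIG
        props.setProperty("num.stream.threads", Integer.toString(streamThreads));
        // Same key as StreamsConfig.producerPrefix(ProducerConfig.BATCH_SIZE_CONFIG); value is illustrative
        props.setProperty("producer.batch.size", "65536");
        // Same key as StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_RECORDS_CONFIG); value is illustrative
        props.setProperty("consumer.max.poll.records", "1000");
        return props;
    }

    public static void main(String[] args) {
        overrides(4).forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Parallelism in Kafka Streams is bounded by the number of input partitions, so threads beyond that count stay idle; measuring throughput before and after each change is the only reliable way to tell which setting helps.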

Apache Kafka order windowed messages based on their value

I'm trying to find a way to re-order messages within a topic partition and send ordered messages to a new topic.
I have Kafka publisher that sends String messages of the following format:
{system_timestamp}-{event_name}?{parameters}
for example:
1494002667893-client.message?chatName=1c&messageBody=hello
1494002656558-chat.started?chatName=1c&chatPatricipants=3
Also, we add some message key for each message, to send them to the corresponding partition.
What I want to do is reorder events based on the {system_timestamp} part of the message within a 1-minute window, because our publishers don't guarantee that messages will be sent in {system_timestamp} order.
For example, we can deliver to the topic, a message with a bigger {system-timestamp} value first.
I've investigated Kafka Stream API and found some examples regarding messages windowing and aggregation:
Properties streamsConfiguration = new Properties();
streamsConfiguration.put(StreamsConfig.APPLICATION_ID_CONFIG, "stream-sorter");
streamsConfiguration.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
streamsConfiguration.put(StreamsConfig.ZOOKEEPER_CONNECT_CONFIG, "localhost:2181");
streamsConfiguration.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
streamsConfiguration.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
KStreamBuilder builder = new KStreamBuilder();
KStream<String, String> stream = builder.stream("events");
KGroupedStream<String, String> groupedStream = stream.groupByKey(); // grouped events within partition
/* commented since I think that I don't need any aggregation, but I guess without aggregation I can't use time windowing.
KTable<Windowed<String>, String> windowedEvents = stream.groupByKey().aggregate(
() -> "", // initial value
(aggKey, value, aggregate) -> aggregate + "", // aggregating value
TimeWindows.of(1000), // intervals in milliseconds
Serdes.String(), // serde for aggregated value
"test-store"
);*/
But what should I do next with this grouped stream? I don't see any 'sort() (e1,e2) -> e1.compareTo(e2)' methods available, also windows could be applied to methods like aggregation(), reduce() ,count() , but I think that I don't need any messages data manipulations.
How can I re-order message in the 1-minute window and send them to another topic?
Here's an outline:
Create a Processor implementation that:
  in the process() method, for each message:
    reads the timestamp from the message value
    inserts into a KeyValueStore using the (timestamp, message-key) pair as the key and the message-value as the value. NB this also provides de-duplication. You'll need to provide a custom Serde to serialize the key so that the timestamp comes first, byte-wise, so that ranged queries are ordered by timestamp first.
  in the punctuate() method:
    reads the store using a ranged fetch from 0 to timestamp - 60'000 (=1 minute)
    sends the fetched messages in order using context.forward() and deletes them from the store
The problem with this approach is that punctuate() is not triggered if no new msgs arrive to advance the "stream time". If this is a risk in your case, you can create an external scheduler that sends periodic "tick" messages to each(!) partition of your topic, that your processor should just ignore, but they'll cause punctuate to trigger in the absence of "real" msgs.
KIP-138 will address this limitation by adding explicit support for system-time punctuation:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-138%3A+Change+punctuate+semantics
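The outline above can be sketched without any Kafka dependency. In this minimal version a TreeMap stands in for the KeyValueStore, add() plays the role of process(), and flush() plays the role of punctuate(); ReorderBuffer, SortKey and the method names are hypothetical, and a real processor would forward the flushed values via context.forward().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class ReorderBuffer {
    /** Composite key ordered by timestamp first, then by message key. */
    private static final class SortKey implements Comparable<SortKey> {
        final long timestamp;
        final String messageKey;

        SortKey(long timestamp, String messageKey) {
            this.timestamp = timestamp;
            this.messageKey = messageKey;
        }

        @Override
        public int compareTo(SortKey other) {
            int byTime = Long.compare(timestamp, other.timestamp);
            return byTime != 0 ? byTime : messageKey.compareTo(other.messageKey);
        }
    }

    private final TreeMap<SortKey, String> store = new TreeMap<>();
    private final long windowMs;

    ReorderBuffer(long windowMs) {
        this.windowMs = windowMs;
    }

    // Parses the leading {system_timestamp} from "{system_timestamp}-{event_name}?{parameters}".
    static long timestampOf(String value) {
        return Long.parseLong(value.substring(0, value.indexOf('-')));
    }

    // process(): buffer the message; an identical (timestamp, key) pair overwrites, which de-duplicates.
    void add(String messageKey, String value) {
        store.put(new SortKey(timestampOf(value), messageKey), value);
    }

    // punctuate(): remove and return, in timestamp order, everything at least windowMs older than now.
    List<String> flush(long now) {
        long cutoff = now - windowMs;
        List<String> ready = new ArrayList<>();
        while (!store.isEmpty() && store.firstKey().timestamp <= cutoff) {
            ready.add(store.pollFirstEntry().getValue());
        }
        return ready;
    }
}
```

With a 60'000 ms window, the two sample messages from the question come out with chat.started first, even though client.message was added first.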
Here is how I ordered streams in my project.
I created a topology with a source, a processor and a sink.
In the Processor:
process(key, value) -> adds each record to a List (instance variable).
init() -> calls schedule(WINDOW_BUFFER_TIME, WALL_CLOCK_TIME); the scheduled punctuate(timestamp) sorts the records buffered in the List during the window buffer time, iterates over them and forwards each one, then clears the List.
This logic is working fine for me.
