Topic created on all Kafka ports - Java

server.properties setup:
listeners=PLAINTEXT://:29092,SSL://:29093
The SSL-related setup is also done, so that we can connect on 29092 for plaintext and on 29093 with SSL.
Here I am trying to produce data to port 29093 as below:
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, System.getProperty("kafkaPort", "localhost:29093"));
// SSL-related setup is also done in props
Producer<Long, String> producer = new KafkaProducer<>(props, new LongSerializer(), new KafkaSerializer());
final ProducerRecord<Long, String> record =
        new ProducerRecord<>(System.getProperty("kafkaTopic", "dqerror"), content);
RecordMetadata metadata = producer.send(record).get();
After publishing, the dqerror topic is created on both listeners and the data appears to be published on both.
Is there any way to restrict the data to a specific port?

Data is not published to "both" ports. There is only one Kafka cluster that is listening on two ports, and there is one set of disks on your one broker that the data is written to.
Also, from what I can tell, there is only one topic used in your code.
If you want to restrict TCP traffic on any port, that would be a firewall rule from the OS, rather than any Kafka settings or Java code.
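To see that both ports front the same cluster, you can list the topics through each listener with the AdminClient. This is only a sketch: the ports mirror the question, and the SSL properties for 29093 are assumed to be the same ones your producer already uses.
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

public class ListTopicsViaBothListeners {
    public static void main(String[] args) throws Exception {
        for (String bootstrap : new String[] {"localhost:29092", "localhost:29093"}) {
            Properties adminProps = new Properties();
            adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
            // For 29093, also add the same SSL properties used by the producer.
            try (AdminClient admin = AdminClient.create(adminProps)) {
                Set<String> topics = admin.listTopics().names().get();
                // The same topic set comes back either way: one cluster, two listeners.
                System.out.println(bootstrap + " -> " + topics);
            }
        }
    }
}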

Related

Appropriate Architecture for Akka WebSockets with Cluster Sharding

I am attempting to implement a way for users to connect to a specific websocket, which enables all connected clients to send and receive messages to all connected users. This can be thought of as a group chat where there is a dedicated websocket URL per chat room.
I have used a handleWebSocketMessages route and the following boilerplate code (source) for data distribution across connected users:
Pair<Sink<Message, NotUsed>, Source<Message, NotUsed>> sinkSourcePair =
        MergeHub.of(Message.class)
                .toMat(BroadcastHub.of(Message.class), Keep.both())
                .run(actorSystem);

Sink<Message, NotUsed> hubSink = sinkSourcePair.first();
Source<Message, NotUsed> hubSource = sinkSourcePair.second();

Flow<Message, Message, NotUsed> broadcastFlow = Flow.fromSinkAndSource(hubSink, hubSource);
When a message arrives via the websocket, I want it to be registered by the cluster sharded Actor (entity), which I complete using EntityRef.ask.
Flow<Message, Message, NotUsed> incomingMessageFlow = Flow.of(Message.class);
Flow<Message, Message, NotUsed> recordMessageFlow = ...
Flow<Message, Message, NotUsed> broadcastFlow = Flow.fromSinkAndSource(hubSink, hubSource);

return handleWebSocketMessages(incomingMessageFlow.via(recordMessageFlow).via(broadcastFlow));
The above works fine for clients connected to a single websocket, but I want my websockets to be associated with a sharded actor based on the websocket URL (e.g. ws://localhost/my-group-chat/ws).
Where should I define my broadcast flow? I've tried several approaches:
defining it within the Route for websocket handling (makes no sense, as it is created anew for every connection)
including it in a sharded actor (fails because of the serialization requirements between sharded actors)
storing a map of broadcast flows, so that when one exists for a specific URL it is reused and when it doesn't it is initialized <- this one worked, but I don't like it (a rough sketch is at the end of this question)
I believe the broadcast flow should belong to the sharded actor, and the current map-based approach breaks that pattern as far as Akka cluster sharding is concerned.
I'd appreciate any ideas.
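For reference, here is roughly what the map-based approach (the third option above) looks like; chatRooms and roomFlow are illustrative names, and the hub wiring simply mirrors the boilerplate above:
import java.util.concurrent.ConcurrentHashMap;
import akka.NotUsed;
import akka.http.javadsl.model.ws.Message;
import akka.japi.Pair;
import akka.stream.javadsl.BroadcastHub;
import akka.stream.javadsl.Flow;
import akka.stream.javadsl.Keep;
import akka.stream.javadsl.MergeHub;
import akka.stream.javadsl.Sink;
import akka.stream.javadsl.Source;

// One MergeHub/BroadcastHub pair per chat-room URL; computeIfAbsent keeps creation race-free.
private final ConcurrentHashMap<String, Flow<Message, Message, NotUsed>> chatRooms =
        new ConcurrentHashMap<>();

private Flow<Message, Message, NotUsed> roomFlow(String roomId) {
    return chatRooms.computeIfAbsent(roomId, id -> {
        Pair<Sink<Message, NotUsed>, Source<Message, NotUsed>> pair =
                MergeHub.of(Message.class)
                        .toMat(BroadcastHub.of(Message.class), Keep.both())
                        .run(actorSystem);   // the same actorSystem as in the boilerplate
        return Flow.fromSinkAndSource(pair.first(), pair.second());
    });
}
The websocket route then calls roomFlow with the room id extracted from the path, so each URL gets its own hub, but the hub still lives outside the sharded entity.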

How to produce messages to different Kafka topics and schema registries from the same producer

I'm trying to produce messages to different Kafka topics from the same producer in my Java application.
This is how I create my producer and send a message to the topics.
@Bean
public Producer producer() {
    Properties config = sdpProperties();
    config.setProperty("schema.registry.url", "");
    config.setProperty("client.id", "1");
    ...
    return new Producer(config);
}
producer.send(topic1, genericRecord, datasetId1);
producer.send(topic2, genericRecord, datasetId2);
However, these two topics have different schema.registry.urls. Through research I saw that you can set more than one registry URL in the config, but when I try that, it only validates against the second URL. Messages to topic2 are produced correctly but messages to topic1 are not: both topics are validated against url2 instead of topic1 being validated against url1 and topic2 against url2.
config.setProperty("schema.registry.url", "ur1,url2");
How can I use the same producer to send messages to these two different topics even though they have different schema.registry.urls? Am I setting this config incorrectly?
Ideally, you shouldn't have multiple registries.
But if you are aware of the architectural reasons behind that decision and still need both, then you must create two separate producers with different registry URLs. The comma separation is for load balancing against one "registry cluster", not for looping over multiple, distinct registries.
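A minimal sketch of the two-producer approach, assuming sdpProperties() (from the question) already sets the bootstrap servers and Avro serializers; the registry URLs and bean names are placeholders, and the plain Apache KafkaProducer is used instead of the question's Producer wrapper:
import java.util.Properties;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.springframework.context.annotation.Bean;

// One producer per registry: records sent through topic1Producer are validated
// against registry1, records sent through topic2Producer against registry2.
@Bean
public Producer<String, GenericRecord> topic1Producer() {
    Properties config = sdpProperties();
    config.setProperty("schema.registry.url", "http://registry1:8081");
    return new KafkaProducer<>(config);
}

@Bean
public Producer<String, GenericRecord> topic2Producer() {
    Properties config = sdpProperties();
    config.setProperty("schema.registry.url", "http://registry2:8081");
    return new KafkaProducer<>(config);
}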

RabbitMQ Listening to a queue from multiple servers

I need to listen to the same queue on two servers. The queue name is the same on both. The first server is the primary, the second is the backup.
When the main server is down, work with the backup server's queue should continue.
My class:
@RabbitListener(queues = "to_client")
public class ClientRabbitService {
Now I use RoutingConnectionFactory:
@Bean
@Primary
public ConnectionFactory routingConnectionFactory() {
    SimpleRoutingConnectionFactory rcf = new SimpleRoutingConnectionFactory();
    Map<Object, ConnectionFactory> map = new HashMap<>();
    map.put("[to_kernel]", mainConnectionFactory());
    map.put("[to_kernel_reserve]", reserveConnectionFactory());
    map.put("[to_client]", mainConnectionFactory());
    rcf.setTargetConnectionFactories(map);
    return rcf;
}
[to_kernel] and [to_kernel_reserve] are queues for sending messages only; [to_client] is for receiving them.
Any ideas please?
Is the queue on the backup server populated only when the primary server is down? If yes, you can always listen to both queues (the queue on the secondary server will be empty while the primary is up).
Note that your solution would be more reliable if you used RabbitMQ clustering.
Then you connect to the cluster (you specify the addresses of all machines in the cluster).
It is explained in the official documentation: https://docs.spring.io/spring-amqp/reference/htmlsingle/#connections
Alternatively, if running in a clustered environment, use the
addresses attribute.
<rabbit:connection-factory id="connectionFactory" addresses="host1:5672,host2:5672"/>
When using a cluster you will have a single queue (replicated across the cluster). Note that RabbitMQ suffers a significant performance hit when using replication; be sure to read the official documentation on how to configure clustering: https://www.rabbitmq.com/clustering.html
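If you go the clustering route, here is a minimal Java-config sketch of the same idea as the XML above; host names and credentials are placeholders:
import org.springframework.amqp.rabbit.connection.CachingConnectionFactory;
import org.springframework.amqp.rabbit.connection.ConnectionFactory;
import org.springframework.context.annotation.Bean;

// The client fails over between the listed cluster nodes.
@Bean
public ConnectionFactory clusterConnectionFactory() {
    CachingConnectionFactory cf = new CachingConnectionFactory();
    cf.setAddresses("host1:5672,host2:5672");
    cf.setUsername("guest");
    cf.setPassword("guest");
    return cf;
}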

Why is my Kafka KTable missing entries?

I have a single-instance Java application that uses a KTable from Kafka Streams. Until recently I could retrieve all the data using the KTable, when suddenly some of the messages seemed to vanish. There should be ~33k messages with unique keys there.
When I want to retrieve messages by key, I don't get some of them. I use ReadOnlyKeyValueStore to retrieve messages:
final ReadOnlyKeyValueStore<GenericRecord, GenericRecord> store =
        ((KafkaStreams) streams).store(storeName, QueryableStoreTypes.keyValueStore());
store.get(key);
These are the configuration settings I pass to KafkaStreams:
final Properties config = new Properties();
config.put(StreamsConfig.APPLICATION_SERVER_CONFIG, serverId);
config.put(StreamsConfig.APPLICATION_ID_CONFIG, applicationId);
config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
config.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
config.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, schemaRegistryUrl);
config.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, GenericAvroSerde.class);
config.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, GenericAvroSerde.class);
config.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);
Kafka: 0.10.2.0-cp1
Confluent: 3.2.0
My investigation brought me to some very worrying insights. Using the REST Proxy I manually read partitions and found out that some offsets return an error.
Request:
/topics/{topic}/partitions/{partition}/messages?offset={offset}
{
    "error_code": 50002,
    "message": "Kafka error: Fetch response contains an error code: 1"
}
However, no client, neither Java nor the command line, returns any error. They just skip over the faulty, missing messages, resulting in missing data in the KTable. Everything was fine, and then without notice some of the messages somehow got corrupted.
I have two brokers and all the topics have a replication factor of 2 and are fully replicated. Both brokers, queried separately, return the same result. Restarting the brokers makes no difference.
What could possibly be the cause?
How to detect this case in a client?
By default Kafka Broker config key cleanup.policy is set to delete. Set it to compact to keep the latest message for each key. See compaction.
Deletion of old messages does not change the minimum offset, so trying to retrieve a message below it causes an error. The error is very vague. The Kafka Streams client simply starts reading messages from the minimum offset, so there is no error; the only visible effect is missing data in the KTable.
While the application is running, thanks to the caches, all data might still appear to be available even after the messages have been deleted from Kafka itself. It will vanish after cleanup.
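If you decide to switch the topic to compaction, a sketch of doing it with the AdminClient is below; note that AdminClient ships with kafka-clients 0.11+, i.e. newer than the 0.10.2 client in the question, and the topic name and bootstrap address are placeholders (kafka-configs.sh can do the same from the command line).
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

// Keep the latest record per key instead of deleting old segments outright.
static void makeTopicCompacted(String bootstrapServers, String topicName) throws Exception {
    Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
    try (AdminClient admin = AdminClient.create(props)) {
        ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, topicName);
        Config compact = new Config(
                Collections.singletonList(new ConfigEntry("cleanup.policy", "compact")));
        admin.alterConfigs(Collections.singletonMap(topic, compact)).all().get();
    }
}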

Kafka consumer in Spark Streaming

Trying to write a Spark Streaming job that consumes messages from Kafka. Here’s what I’ve done so far:
Started Zookeeper
Started Kafka Server
Sent a few messages to the server. I can see them when I run the following:
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic mytopic --from-beginning
Now I'm trying to write a program to count the number of messages coming in within 5 minutes.
The code looks something like this:
Map<String, Integer> map = new HashMap<String, Integer>();
map.put("mytopic", new Integer(1));

JavaStreamingContext ssc = new JavaStreamingContext(
        sparkUrl, " Spark Streaming", new Duration(60 * 5 * 1000), sparkHome, new String[]{jarFile});
JavaPairReceiverInputDStream tweets = KafkaUtils.createStream(ssc, "localhost:2181", "1", map);
I'm not sure what value to use for the 3rd argument (the consumer group). When I run this I get "Unable to connect to zookeeper server". But ZooKeeper is running on port 2181; otherwise step #3 would not have worked.
Seems like I am not using KafkaUtils.createStream properly. Any ideas?
There is no such thing as a default consumer group; you can use an arbitrary non-empty string there. If you have only one consumer, its consumer group doesn't really matter. If there are two or more consumers, they can either be part of the same consumer group or belong to different consumer groups.
From http://kafka.apache.org/documentation.html:
Consumers
...
If all the consumer instances have the same consumer group, then this works just like a traditional queue balancing load over the consumers.
If all the consumer instances have different consumer groups, then this works like publish-subscribe and all messages are broadcast to all consumers.
I think the problem may be in the 'topics' parameter.
From Spark docs:
Map of (topic_name -> numPartitions) to consume. Each partition is consumed in its own thread
You only specified a single partition for your topic, namely '1'. Depending on the broker's setting (num.partitions), there may be more than one partition, and your messages may be sent to other partitions which aren't read by your program.
Besides, I believe the partition ids are 0-based, so if you have only one partition it will have id 0.
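For what it's worth, the Spark docs quoted above describe the map value as the number of consumer threads per topic, and the group id can be any non-empty string. A hedged sketch (group name and thread count are placeholders, ssc is the context from the question):
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream;
import org.apache.spark.streaming.kafka.KafkaUtils;

// topic -> number of consumer threads (what the docs call numPartitions)
Map<String, Integer> topicThreads = new HashMap<String, Integer>();
topicThreads.put("mytopic", 2);

JavaPairReceiverInputDStream<String, String> messages =
        KafkaUtils.createStream(ssc, "localhost:2181", "my-consumer-group", topicThreads);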
I think you should specify the IP of the ZooKeeper host instead of localhost. Also, the third argument is the consumer group name; it can be any name you like. It matters when you have more than one consumer tied to the same group: topic partitions are then distributed among them accordingly. Your tweets line should be:
JavaPairReceiverInputDStream tweets = KafkaUtils.createStream(ssc, "x.x.x.x", "dummy-group", map);
I was facing the same issue. Here is the solution that worked for me.
The number of cores allocated to the Spark Streaming application must be more than the number of receivers, otherwise the system will receive data but not be able to process it. So Spark Streaming requires a minimum of two cores, and in my spark-submit I had to specify at least two cores (a sketch follows below).
kafka-clients-version.jar should be included in the list of dependent jars in spark-submit.
If ZooKeeper is running on the same machine as your streaming application then "localhost:2181" will work. Otherwise you have to specify the address of the host where ZooKeeper is running and ensure that the machine running the streaming app can talk to the ZooKeeper host on port 2181.
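As a sketch, one way to guarantee at least two cores when testing locally (master URL, app name and batch duration are placeholders; with spark-submit the equivalent is any master/cores setting that provides at least two cores):
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

// "local[2]" gives the app two local threads: one for the Kafka receiver
// and one left over for processing the received batches.
SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("KafkaMessageCount");
JavaStreamingContext ssc = new JavaStreamingContext(conf, new Duration(60 * 5 * 1000));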
I think, in your code, the second argument to the KafkaUtils.createStream call should be the host:port of the Kafka server, not the ZooKeeper host and port. Please check that.
EDIT:
Kafka Utils API Documentation
As per the documentation above, it should be the ZooKeeper quorum, so the ZooKeeper hostname and port should be used.
zkQuorum
Zookeeper quorum (hostname:port,hostname:port,..).
