We have an Apache Flink POC application which works fine locally, but after we deploy it to Kinesis Data Analytics (KDA) it does not emit records into the sink.
Used technologies
Local
Source: Kafka 2.7
1 broker
1 topic with 1 partition and replication factor 1
Processing: Flink 1.12.1
Sink: Managed ElasticSearch Service 7.9.1 (the same instance as in case of AWS)
AWS
Source: Amazon MSK Kafka 2.8
3 brokers (but we are connecting to one)
1 topic with 1 partition, replication factor 3
Processing: Amazon KDA Flink 1.11.1
Parallelism: 2
Parallelism per KPU: 2
Sink: Managed ElasticSearch Service 7.9.1
Application logic
1. The FlinkKafkaConsumer reads messages in JSON format from the topic
2. The JSON messages are mapped to domain objects called Telemetry
private static DataStream<Telemetry> SetupKafkaSource(StreamExecutionEnvironment environment) {
    Properties kafkaProperties = new Properties();
    kafkaProperties.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "BROKER1_ADDRESS.amazonaws.com:9092");
    kafkaProperties.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "flink_consumer");
    FlinkKafkaConsumer<String> consumer = new FlinkKafkaConsumer<>("THE_TOPIC", new SimpleStringSchema(), kafkaProperties);
    consumer.setStartFromEarliest(); // Just for repeatable testing
    return environment
        .addSource(consumer)
        .map(new MapJsonToTelemetry());
}
3. The Telemetry’s timestamp is used as the event timestamp.
3.1. With forMonotonousTimestamps
4. Telemetry’s StateIso is used for keyBy.
4.1. The two-letter ISO code of the US state
5. A 5-second tumbling window strategy is applied
private static SingleOutputStreamOperator<StateAggregatedTelemetry> SetupProcessing(DataStream<Telemetry> telemetries) {
    WatermarkStrategy<Telemetry> wmStrategy =
        WatermarkStrategy
            .<Telemetry>forMonotonousTimestamps()
            .withTimestampAssigner((event, timestamp) -> event.TimeStamp);
    return telemetries
        .assignTimestampsAndWatermarks(wmStrategy)
        .keyBy(t -> t.StateIso)
        .window(TumblingEventTimeWindows.of(Time.seconds(5)))
        .process(new WindowCountFunction());
}
6. A custom ProcessWindowFunction is called to perform some basic aggregation.
6.1. We calculate a single StateAggregatedTelemetry
7. ElasticSearch is configured as the sink.
7.1. StateAggregatedTelemetry data are mapped into a HashMap and passed as the source of an index request.
7.2. All setBulkFlushXYZ methods are set to low values
private static void SetupElasticSearchSink(SingleOutputStreamOperator<StateAggregatedTelemetry> telemetries) {
    List<HttpHost> httpHosts = new ArrayList<>();
    httpHosts.add(HttpHost.create("https://ELKCLUSTER_ADDRESS.amazonaws.com:443"));
    ElasticsearchSink.Builder<StateAggregatedTelemetry> esSinkBuilder = new ElasticsearchSink.Builder<>(
        httpHosts,
        (ElasticsearchSinkFunction<StateAggregatedTelemetry>) (element, ctx, indexer) -> {
            Map<String, Object> record = new HashMap<>();
            record.put("stateIso", element.StateIso);
            record.put("healthy", element.Flawless);
            record.put("unhealthy", element.Faulty);
            ...
            LOG.info("Telemetry has been added to the buffer");
            indexer.add(Requests.indexRequest()
                .index("INDEXPREFIX-" + from.format(DateTimeFormatter.ofPattern("yyyy-MM-dd")))
                .source(record, XContentType.JSON));
        }
    );
    // Using low values to make sure that the flush will happen
    esSinkBuilder.setBulkFlushMaxActions(25);
    esSinkBuilder.setBulkFlushInterval(1000);
    esSinkBuilder.setBulkFlushMaxSizeMb(1);
    esSinkBuilder.setBulkFlushBackoff(true);
    esSinkBuilder.setRestClientFactory(restClientBuilder -> {});
    LOG.info("Sink has been attached to the DataStream");
    telemetries.addSink(esSinkBuilder.build());
}
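To make the 5-second tumbling event-time window in the processing step more concrete, here is a minimal, dependency-free sketch of the arithmetic involved. It is not Flink's code: the class and method names are made up for illustration, and it assumes a zero window offset and non-negative epoch-millisecond timestamps. It shows which window a timestamp falls into and that such a window can only fire once the watermark has reached the window's last timestamp.

```java
// Illustrative sketch only; TumblingWindowSketch is a hypothetical helper, not a Flink class.
final class TumblingWindowSketch {
    static final long WINDOW_SIZE_MS = 5_000L;

    // Inclusive start of the window containing the timestamp
    // (assumes zero offset and a non-negative timestamp).
    static long windowStart(long timestampMs) {
        return timestampMs - (timestampMs % WINDOW_SIZE_MS);
    }

    // Exclusive end of the window containing the timestamp.
    static long windowEnd(long timestampMs) {
        return windowStart(timestampMs) + WINDOW_SIZE_MS;
    }

    // An event-time window may fire once the watermark has reached
    // the last timestamp that still belongs to it (end - 1).
    static boolean canFire(long timestampMs, long watermarkMs) {
        return watermarkMs >= windowEnd(timestampMs) - 1;
    }

    public static void main(String[] args) {
        long ts = 1_617_809_473_305L; // "timestamp" field of the sample message in Update #2
        System.out.println(windowStart(ts));              // 1617809470000
        System.out.println(windowEnd(ts));                // 1617809475000
        System.out.println(canFire(ts, Long.MIN_VALUE));  // false: no watermark yet
    }
}
```

If the watermark never advances, canFire stays false for every window, which matches the symptom of records entering the window operator but never leaving it.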
Excluded things
We managed to put Kafka, KDA and ElasticSearch under the same VPC and same subnets to avoid the need to sign each request
From the logs we could see that Flink can reach the ES cluster.
Request
{
"locationInformation": "org.apache.flink.streaming.connectors.elasticsearch7.Elasticsearch7ApiCallBridge.verifyClientConnection(Elasticsearch7ApiCallBridge.java:135)",
"logger": "org.apache.flink.streaming.connectors.elasticsearch7.Elasticsearch7ApiCallBridge",
"message": "Pinging Elasticsearch cluster via hosts [https://...es.amazonaws.com:443] ...",
"threadName": "Window(TumblingEventTimeWindows(5000), EventTimeTrigger, WindowCountFunction) -> (Sink: Print to Std. Out, Sink: Unnamed, Sink: Print to Std. Out) (2/2)",
"applicationARN": "arn:aws:kinesisanalytics:...",
"applicationVersionId": "39",
"messageSchemaVersion": "1",
"messageType": "INFO"
}
Response
{
"locationInformation": "org.elasticsearch.client.RequestLogger.logResponse(RequestLogger.java:59)",
"logger": "org.elasticsearch.client.RestClient",
"message": "request [HEAD https://...es.amazonaws.com:443/] returned [HTTP/1.1 200 OK]",
"threadName": "Window(TumblingEventTimeWindows(5000), EventTimeTrigger, WindowCountFunction) -> (Sink: Print to Std. Out, Sink: Unnamed, Sink: Print to Std. Out) (2/2)",
"applicationARN": "arn:aws:kinesisanalytics:...",
"applicationVersionId": "39",
"messageSchemaVersion": "1",
"messageType": "DEBUG"
}
We could also verify that the messages had been read from the Kafka topic and sent for processing by looking at the Flink Dashboard
What we have tried without luck
We had implemented a RichParallelSourceFunction which emits 1_000_000 messages and then exits
This worked well in the Local environment
The job finished in the AWS environment, but there was no data on the sink side
We had implemented another RichParallelSourceFunction which emits 100 messages every second
Basically we had two loops: an outer while(true) and an inner for
After the inner loop we called Thread.sleep(1000)
This worked perfectly fine in the local environment
But in AWS we could see that the checkpoint size grew continuously and no message appeared in ELK
We have tried to run the KDA application with different parallelism settings
But there was no difference
We also tried to use different watermarking strategies (forBoundedOutOfOrderness, withIdle, noWatermarks)
But there was no difference
We have added logs to the ProcessWindowFunction and to the ElasticsearchSinkFunction
Whenever we ran the application from IDEA, these logs appeared on the console
Whenever we ran the application with KDA, there were no such logs in CloudWatch
The logs that were added to main do appear in the CloudWatch logs
We suppose that we don't see data on the sink side because the window processing logic is not triggered. That's why we don't see the processing logs in CloudWatch.
Any help would be more than welcome!
Update #1
We have tried to downgrade the Flink version from 1.12.1 to 1.11.1
There is no change
We have tried processing time window instead of event time
It did not even work on the local environment
Update #2
The average message size is around 4 KB. Here is an excerpt of a sample message:
{
"affiliateCode": "...",
"appVersion": "1.1.14229",
"clientId": "guid",
"clientIpAddr": "...",
"clientOriginated": true,
"connectionType": "Cable/DSL",
"countryCode": "US",
"design": "...",
"device": "...",
...
"deviceSerialNumber": "...",
"dma": "UNKNOWN",
"eventSource": "...",
"firstRunTimestamp": 1609091112818,
"friendlyDeviceName": "Comcast",
"fullDevice": "Comcast ...",
"geoInfo": {
"continent": {
"code": "NA",
"geoname_id": 120
},
"country": {
"geoname_id": 123,
"iso_code": "US"
},
"location": {
"accuracy_radius": 100,
"latitude": 37.751,
"longitude": -97.822,
"time_zone": "America/Chicago"
},
"registered_country": {
"geoname_id": 123,
"iso_code": "US"
}
},
"height": 720,
"httpUserAgent": "Mozilla/...",
"isLoggedIn": true,
"launchCount": 19,
"model": "...",
"os": "Comcast...",
"osVersion": "...",
...
"platformTenantCode": "...",
"productCode": "...",
"requestOrigin": "https://....com",
"serverTimeUtc": 1617809474787,
"serviceCode": "...",
"serviceOriginated": false,
"sessionId": "guid",
"sessionSequence": 2,
"subtype": "...",
"tEventId": "...",
...
"tRegion": "us-east-1",
"timeZoneOffset": 5,
"timestamp": 1617809473305,
"traits": {
"isp": "Comcast Cable",
"organization": "..."
},
"type": "...",
"userId": "guid",
"version": "v1",
"width": 1280,
"xb3traceId": "guid"
}
We are using ObjectMapper to parse only some of the fields of the JSON. Here is what the Telemetry class looks like:
public class Telemetry {
public String AppVersion;
public String CountryCode;
public String ClientId;
public String DeviceSerialNumber;
public String EventSource;
public String SessionId;
public TelemetrySubTypes SubType; //enum
public String TRegion;
public Long TimeStamp;
public TelemetryTypes Type; //enum
public String StateIso;
...
}
Update #3
Source
Subtasks tab
ID  Bytes received  Records received  Bytes sent  Records sent  Status
0   0 B             0                 0 B         0             RUNNING
1   0 B             0                 2.83 MB     15,000        RUNNING
Watermarks tab
No Data
Window
Subtasks tab
ID  Bytes received  Records received  Bytes sent  Records sent  Status
0   1.80 MB         9,501             0 B         0             RUNNING
1   1.04 MB         5,499             0 B         0             RUNNING
Watermarks
SubTask  Watermark
1        No Watermark
2        No Watermark
According to the comments and the additional information you have provided, the issue seems to be that two Flink consumers can't consume from the same partition. In your case only one parallel instance of the source operator consumes from the Kafka partition, and the other one is idle.
In general, a Flink operator takes the minimum of the watermarks of all its parallel inputs as its own watermark. In your case one Kafka consumer produces normal watermarks and the other never produces any (Flink assumes Long.MIN_VALUE in that case), so Flink selects the lower one, which is Long.MIN_VALUE. The window is therefore never fired: although data is flowing, one of the watermarks is never generated. It is good practice to use the same parallelism as the number of Kafka partitions when working with Kafka.
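The minimum rule described above can be illustrated with a small, dependency-free sketch. This is not Flink's implementation; the class and method names are hypothetical, and an input that has never emitted a watermark is modelled as Long.MIN_VALUE, as the answer describes.

```java
import java.util.Arrays;

// Illustrative sketch only; MinWatermarkSketch is a hypothetical helper, not a Flink class.
final class MinWatermarkSketch {
    // Watermark assumed for an input that has not emitted any watermark yet.
    static final long NO_WATERMARK = Long.MIN_VALUE;

    // The operator's watermark is the minimum over all of its parallel inputs.
    static long combinedWatermark(long... inputWatermarks) {
        return Arrays.stream(inputWatermarks).min().orElse(NO_WATERMARK);
    }

    public static void main(String[] args) {
        long activeSource = 1_617_809_474_999L; // the subtask that reads the single partition
        long idleSource = NO_WATERMARK;         // the subtask with no partition assigned

        // The idle input holds the combined watermark at Long.MIN_VALUE,
        // so no event-time window downstream can ever fire.
        System.out.println(combinedWatermark(activeSource, idleSource) == Long.MIN_VALUE); // true
    }
}
```

With a single input, or with every input advancing, the combined watermark advances too, which is why matching the source parallelism to the partition count avoids the stall.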
After a support session with the AWS folks, it turned out that we had forgotten to set the time characteristic on the streaming environment.
In 1.11.1 the default value of TimeCharacteristic was ProcessingTime.
Since 1.12 (see the related release notes) the default value is EventTime:
In Flink 1.12 the default stream time characteristic has been changed to EventTime, thus you don’t need to call this method for enabling event-time support anymore.
So, after we set EventTime explicitly, the job started to generate watermarks like a charm:
streamingEnv.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
I have a consumer which polls from multiple topics.
Until now I only produced records into these topics with Java and everything worked fine.
I use the Confluent tools with Avro.
Now I tried to manually produce a record to a topic via the terminal.
I create an Avro console producer with the same schema my other producer uses:
# Produce a record with one field
kafka-avro-console-producer \
--broker-list 127.0.0.1:9092 --topic order_created-in \
--property schema.registry.url=http://127.0.0.1:8081 \
--property value.schema='{"type":"record","name":"test","fields":[{"name":"name","type":"string"},{"name":"APropertie","type":{"type":"array","items":{"type":"record","name":"APropertie","fields":[{"name":"key","type":"string"},{"name":"name","type":"string"},{"name":"date","type":"string"}]}}}]}'
After that I produce a record which follows the specified schema:
{"name": "order_created", "APropertie": [{"key": "1", "name": "testname", "date": "testdate"}]}
The record gets correctly produced to the topic. But my AvroConsumer throws an exception:
Polling
Polling
Polling
Polling
Polling
Polling
Exception in thread "main" org.apache.kafka.common.errors.SerializationException: Error deserializing key/value for partition order_created-in-0 at offset 1. If needed, please seek past the record to continue consumption.
Caused by: org.apache.kafka.common.errors.SerializationException: Error deserializing Avro message for id 61
Caused by: org.apache.kafka.common.errors.SerializationException: Could not find class test specified in writer's schema whilst finding reader's schema for a SpecificRecord.
Process finished with exit code 1
Any hints?
Thanks!
It has something to do with the config of the producer / consumer.
Normal producers have a config like this:
// normal producer
properties.setProperty("bootstrap.servers", "127.0.0.1:9092");
properties.setProperty("acks", "all");
properties.setProperty("retries", "10");
An Avro producer normally adds the following properties:
// avro part
properties.setProperty("key.serializer", StringSerializer.class.getName());
properties.setProperty("value.serializer", KafkaAvroSerializer.class.getName());
properties.setProperty("schema.registry.url", "http://127.0.0.1:8081");
properties.setProperty("confluent.value.schema.validation", "true");
properties.setProperty("confluent.key.schema.validation", "true");
These have to be included in the console producer.
I produce records with the same Avro schema to one topic, using different Confluent Schema Registry instances. I get this error when I consume the topic:
org.apache.kafka.common.errors.SerializationException: Error deserializing key/value for partition XXXXX_XXXX_XXX-0 at offset 0. If needed, please seek past the record to continue consumption.
Caused by: org.apache.kafka.common.errors.SerializationException: Error deserializing Avro message for id 7
Caused by: org.apache.kafka.common.errors.SerializationException: Could not find class XXXXX_XXXX_XXX specified in writer's schema whilst finding reader's schema for a SpecificRecord.
How can I ignore the differing Avro message IDs?
Schema:
{
"type": "record",
"name": "XXXXX_XXXX_XXX",
"namespace": "aa.bb.cc.dd",
"fields": [
{
"name": "ACTION",
"type": [
"null",
"string"
],
"default":null,
"doc":"action"
},
{
"name": "EMAIL",
"type": [
"null",
"string"
],
"default":null,
"doc":"email address"
}
]
}
Produced message
{"Action": "A", "EMAIL": "xxxx#xxx.com"}
It's not possible to use different Registry URLs in producers and still be able to consume their records consistently.
The reason is that a different schema ID will be placed in the records written to the topic.
The schema ID lookup cannot be skipped.
If you had used the same registry, the same schema payload would always generate the same ID, which the consumer could then use consistently to read messages.
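To show where that ID lives, here is a minimal sketch of the Confluent wire format as I understand it: one magic byte (0), then the schema ID as a 4-byte big-endian integer, then the Avro-encoded payload. The class and method names are hypothetical, and the sketch uses only the JDK rather than the Confluent serializers.

```java
import java.nio.ByteBuffer;

// Illustrative sketch only; WireFormatSketch is a hypothetical helper, not a Confluent class.
final class WireFormatSketch {
    static final byte MAGIC_BYTE = 0x0;
    static final int HEADER_SIZE = 5; // 1 magic byte + 4-byte schema ID

    // Prepends the header a registry-aware serializer would write before the Avro payload.
    static byte[] frame(int schemaId, byte[] avroPayload) {
        return ByteBuffer.allocate(HEADER_SIZE + avroPayload.length)
                .put(MAGIC_BYTE)
                .putInt(schemaId) // ByteBuffer is big-endian by default
                .put(avroPayload)
                .array();
    }

    // Reads the schema ID that a consumer would look up in its own registry.
    static int schemaId(byte[] recordValue) {
        ByteBuffer buffer = ByteBuffer.wrap(recordValue);
        if (buffer.remaining() < HEADER_SIZE || buffer.get() != MAGIC_BYTE) {
            throw new IllegalArgumentException("Not a schema-registry framed record");
        }
        return buffer.getInt();
    }

    public static void main(String[] args) {
        byte[] value = frame(7, new byte[] {1, 2, 3}); // payload bytes are placeholders
        System.out.println(schemaId(value)); // 7
    }
}
```

Because the record carries only the numeric ID, a consumer pointed at a registry where ID 7 means something else (or nothing) has no way to recover the writer's schema.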
I am trying to create an ingest pipeline using the PUT request below:
{
"description": "ContentExtractor",
"processors": [
{
"extractor": {
"field": "contentData",
"target_field": "content"
}
}
]
}
But this results in the following error:
{
"error": {
"root_cause": [
{
"type": "not_x_content_exception",
"reason": "Compressor detection can only be called on some xcontent bytes or compressed xcontent bytes"
}
],
"type": "not_x_content_exception",
"reason": "Compressor detection can only be called on some xcontent bytes or compressed xcontent bytes"
},
"status": 500
}
I see the following exception in the ES logs:
org.elasticsearch.common.compress.NotXContentException: Compressor detection can only be called on some xcontent bytes or compressed xcontent bytes
at org.elasticsearch.common.compress.CompressorFactory.compressor(CompressorFactory.java:57) ~[elasticsearch-5.1.2.jar:5.1.2]
at org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:65) ~[elasticsearch-5.1.2.jar:5.1.2]
at org.elasticsearch.ingest.PipelineStore.validatePipeline(PipelineStore.java:154) ~[elasticsearch-5.1.2.jar:5.1.2]
at org.elasticsearch.ingest.PipelineStore.put(PipelineStore.java:133) ~[elasticsearch-5.1.2.jar:5.1.2]
This problem happens when Elasticsearch is running on Solaris; the same request works fine on Linux. What am I doing wrong? Can somebody help me fix this issue?
Thanks in advance.
I got exactly the same error message (on a different version of Elasticsearch) when sending a request in the wrong data format. I had misread the documentation at https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-create-index.html : the "request body" described there is expected as plain JSON; that section does not describe the raw HTTP request body.
The error also occurs when using the old syntax in the URL path (the part just after 'index' in the URL):
curl -XPUT -H "Content-Type: application/json" http://host:port/index/_mapping/_doc -d "mappings=#mymapping.json"
Just remove the "mappings=" prefix and the trailing path!
Case
I am using ResultSet's submit method in Java (provided by the org.apache.tinkerpop:gremlin-driver:3.0.1-incubating dependency) to query Gremlin Server. I need to know how to configure my client to receive the response in JSON format.
What I have done
I have tried using both the GraphSONMessageSerializerV1d0 and GraphSONMessageSerializerGremlinV1d0 serializers, but the response is not valid JSON. This is my gremlin-server.yaml file:
authentication: {className: org.apache.tinkerpop.gremlin.server.auth.AllowAllAuthenticator, config: null}
channelizer: org.apache.tinkerpop.gremlin.server.channel.WebSocketChannelizer
graphs: {graph: src/test/resources/titan-inmemory.properties}
gremlinPool: 8
host: localhost
maxAccumulationBufferComponents: 1024
maxChunkSize: 8192
maxContentLength: 65536
maxHeaderSize: 8192
maxInitialLineLength: 4096
metrics:
  consoleReporter: null
  csvReporter: null
  gangliaReporter: null
  graphiteReporter: null
  jmxReporter: null
  slf4jReporter: {enabled: true, interval: 180000, loggerName: org.apache.tinkerpop.gremlin.server.Settings$Slf4jReporterMetrics}
plugins: [aurelius.titan]
port: 8182
processors: []
resultIterationBatchSize: 64
scriptEngines:
  gremlin-groovy:
    config: null
    imports: [java.lang.Math]
    scripts: [src/test/resources/generate-asset-plus-locations.groovy]
    staticImports: [java.lang.Math.PI]
scriptEvaluationTimeout: 30000
serializedResponseTimeout: 30000
serializers:
- className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV1d0
  config: {useMapperFromGraph: graph}
- className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV1d0
  config: {serializeResultToString: true}
- className: org.apache.tinkerpop.gremlin.driver.ser.GraphSONMessageSerializerGremlinV1d0
  config: {useMapperFromGraph: graph}
- className: org.apache.tinkerpop.gremlin.driver.ser.GraphSONMessageSerializerV1d0
  config: {useMapperFromGraph: graph}
ssl: {enabled: false, keyCertChainFile: null, keyFile: null, keyPassword: null, trustCertChainFile: null}
threadPoolBoss: 1
threadPoolWorker: 1
writeBufferHighWaterMark: 65536
writeBufferLowWaterMark: 32768
So it would be great if someone could help me configure the client side to receive the result in JSON format!
To use GraphSON as the serialization format you just need to specify it to the Cluster builder:
Cluster cluster = Cluster.build().serializer(Serializers.GRAPHSON_V2D0).create();
But it's worth noting that this won't return you a string of JSON to work with. It tells the server to use JSON as the serialization format, but the driver deserializes the JSON into objects (Maps, Lists, etc.). If you want an actual JSON string then you should return one in the script that you send to the server. Your only other option is to write your own serializer which would always just preserve the string.
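To illustrate what receiving "objects (Maps, Lists, etc.)" means on the client side, here is a small, dependency-free sketch that turns such nested Maps and Lists back into a JSON string. It is not part of the Gremlin driver; JsonStringSketch is a hypothetical helper, the sample data is made up, and it only handles Maps, Iterables, Strings, finite Numbers, Booleans and null. In a real project a JSON library would normally do this job.

```java
import java.util.Map;

// Illustrative sketch only; JsonStringSketch is a hypothetical helper, not a TinkerPop class.
final class JsonStringSketch {
    static String toJson(Object value) {
        StringBuilder out = new StringBuilder();
        write(value, out);
        return out.toString();
    }

    private static void write(Object value, StringBuilder out) {
        if (value == null) {
            out.append("null");
        } else if (value instanceof Number || value instanceof Boolean) {
            out.append(value); // assumes finite numbers; NaN/Infinity are not valid JSON
        } else if (value instanceof Map) {
            out.append('{');
            boolean first = true;
            for (Map.Entry<?, ?> entry : ((Map<?, ?>) value).entrySet()) {
                if (!first) out.append(',');
                first = false;
                quote(String.valueOf(entry.getKey()), out);
                out.append(':');
                write(entry.getValue(), out);
            }
            out.append('}');
        } else if (value instanceof Iterable) {
            out.append('[');
            boolean first = true;
            for (Object item : (Iterable<?>) value) {
                if (!first) out.append(',');
                first = false;
                write(item, out);
            }
            out.append(']');
        } else {
            quote(String.valueOf(value), out); // anything else is written as a string
        }
    }

    // Writes a JSON string literal, escaping quotes, backslashes and control characters.
    private static void quote(String text, StringBuilder out) {
        out.append('"');
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '"' || c == '\\') {
                out.append('\\').append(c);
            } else if (c < 0x20) {
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        out.append('"');
    }
}
```

For example, passing a LinkedHashMap with the entries name=example and age=29 to toJson yields {"name":"example","age":29}.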