I'm reading from a Kafka topic, which contains Avro messages serialized using the KafkaAvroEncoder (which automatically registers the schemas with the topics). I'm using the maven-avro-plugin to generate plain Java classes, which I'd like to use upon reading.
The KafkaAvroDecoder only supports deserializing into GenericData.Record types, which (in my opinion) misses the whole point of having a statically typed language. My deserialization code currently looks like this:
SpecificDatumReader<event> reader = new SpecificDatumReader<>(
event.getClassSchema() // event is my class generated from the schema
);
byte[] in = ...; // my input bytes;
ByteBuffer stuff = ByteBuffer.wrap(in);
// the KafkaAvroEncoder puts a magic byte and the ID of the schema (as stored
// in the schema-registry) before the serialized message
if (stuff.get() != 0x0) {
return;
}
int id = stuff.getInt();
// lets just ignore those special bytes
int length = stuff.limit() - 4 - 1;
int start = stuff.position() + stuff.arrayOffset();
Decoder decoder = DecoderFactory.get().binaryDecoder(
stuff.array(), start, length, null
);
try {
event ev = reader.read(null, decoder);
} catch (IOException e) {
e.printStackTrace();
}
I found my solution cumbersome, so I'd like to know if there is a simpler solution to do this.
Thanks to the comment I was able to find the answer. The secret was to instantiate KafkaAvroDecoder with a Properties specifying the use of the specific Avro reader, that is:
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "...");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
io.confluent.kafka.serializers.KafkaAvroSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
io.confluent.kafka.serializers.KafkaAvroSerializer.class);
props.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_C‌ONFIG, "...");
props.put(KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG, true);
VerifiableProp vProps = new VerifiableProperties(props);
KafkaAvroDecoder decoder = new KafkaAvroDecoder(vProps);
MyLittleData data = (MyLittleData) decoder.fromBytes(input);
The same configuration applies for the case of using directly the KafkaConsumer<K, V> class (I'm consuming from Kafka in Storm using the KafkaSpout from the storm-kafka project, which uses the SimpleConsumer, so I have to manually deserialize the messages. For the courageous there is the storm-kafka-client project, which does this automatically by using the new style consumer).
Related
I have a relatively straightforward use case:
Read Avro data from a Kafka topic
Use KPL (v0.14.12) to send this data to Kinesis Data Streams
Use Kinesis Firehose to transform this data into Parquet and transfer it to S3.
The Kafka topic was written into by Kafka Streams using the following producer Configuration:
private void addAwsGlueSpecificProperties(Map<String, Object> props) {
props.put(AWSSchemaRegistryConstants.AWS_REGION, "eu-central-1");
props.put(AWSSchemaRegistryConstants.DATA_FORMAT, DataFormat.AVRO.name());
props.put(AWSSchemaRegistryConstants.SCHEMA_AUTO_REGISTRATION_SETTING, true);
props.put(AWSSchemaRegistryConstants.REGISTRY_NAME, "Kinesis_Schema_Registry");
props.put(AWSSchemaRegistryConstants.COMPRESSION_TYPE, AWSSchemaRegistryConstants.COMPRESSION.ZLIB.name());
props.put(DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
props.put(DEFAULT_VALUE_SERDE_CLASS_CONFIG, GlueSchemaRegistryKafkaStreamsSerde.class.getName());
}
Most notably, I've set SCHEMA_AUTO_REGISTRATION_SETTING to true to try and rule out problems with my schema definition. The auto-registration itself worked without any issues.
I have a very simple loop running for test purposes, which does step 1 and 2 of the above. It looks as follows:
KinesisProducer kinesisProducer = new KinesisProducer(getKinesisConfig());
try (final KafkaConsumer<String, AvroEvent> consumer = new KafkaConsumer<>(properties)) {
consumer.subscribe(Collections.singletonList(TOPIC));
while (true) {
log.info("Polling...");
final ConsumerRecords<String, AvroEvent> records = consumer.poll(Duration.ofMillis(100));
for (final ConsumerRecord<String, AvroEvent> record : records) {
final String key = record.key();
ListenableFuture<UserRecordResult> request = kinesisProducer.addUserRecord("my-data-stream", key, randomExplicitHashKey(), value.toByteBuffer(), gsrSchema);
Futures.addCallback(request, CALLBACK, executor);
}
Thread.sleep(Duration.ofSeconds(10).toMillis());
}
}
The callback just does a bit of logging on success/failure.
My Kinesis Config looks as follows:
private static KinesisProducerConfiguration getKinesisConfig() {
KinesisProducerConfiguration config = new KinesisProducerConfiguration();
GlueSchemaRegistryConfiguration schemaRegistryConfiguration = getGlueSchemaRegistryConfiguration();
config.setGlueSchemaRegistryConfiguration(schemaRegistryConfiguration);
config.setRegion("eu-central-1");
config.setCredentialsProvider(new DefaultAWSCredentialsProviderChain());
config.setMaxConnections(2);
config.setThreadingModel(KinesisProducerConfiguration.ThreadingModel.POOLED);
config.setThreadPoolSize(2);
config.setRateLimit(100L);
return config;
}
private static GlueSchemaRegistryConfiguration getGlueSchemaRegistryConfiguration() {
GlueSchemaRegistryConfiguration gsrConfig = new GlueSchemaRegistryConfiguration("eu-central-1");
gsrConfig.setAvroRecordType(AvroRecordType.GENERIC_RECORD ); // have also tried SPECIFIC_RECORD
gsrConfig.setRegistryName("Kinesis_Schema_Registry");
gsrConfig.setCompressionType(AWSSchemaRegistryConstants.COMPRESSION.ZLIB);
return gsrConfig;
}
This setup allows me to read Specific Avro records from Kafka and send them to Kinesis. I have also verified that the correct schema version ID is queried from GSR by my code. However, when my data gets to Firehose, I receive only the following error message for all my records (one per record):
{
"attemptsMade": 1,
"arrivalTimestamp": 1659622848304,
"lastErrorCode": "DataFormatConversion.ParseError",
"lastErrorMessage": "Encountered malformed JSON. Illegal character ((CTRL-CHAR, code 3)): only regular white space (\\r, \\n, \\t) is allowed between tokens\n at [Source: com.fasterxml.jackson.databind.util.ByteBufferBackedInputStream#6252e7eb; line: 1, column: 2]",
"attemptEndingTimestamp": 1659623152452,
"rawData": "<base64EncodedData>",
"sequenceNumber": "<seqNum>",
"dataCatalogTable": {
"databaseName": "<Glue database name>",
"tableName": "<Glue table name>",
"region": "eu-central-1",
"versionId": "LATEST",
"roleArn": "<arn>"
}
}
Unfortunately I can't post the entirety of the data as it is sensitive. However, the relevant part is that it always starts with the above control character that is causing the problem:
0x03 0x05 <schemaVersionId> <data>
My original data does not contain these control characters. After some debugging, I've found that KPL explicitly adds these bytes to the beginning of a UserRecord. In com.amazonaws.services.schemaregistry.serializers.SerializationDataEncoder#write:
public byte[] write(final byte[] objectBytes, UUID schemaVersionId) {
byte[] bytes;
try (ByteArrayOutputStream out = new ByteArrayOutputStream()) {
writeHeaderVersionBytes(out);
writeCompressionBytes(out);
writeSchemaVersionId(out, schemaVersionId);
boolean shouldCompress = this.compressionHandler != null;
bytes = writeToExistingStream(out, shouldCompress ? compressData(objectBytes) : objectBytes);
} catch (Exception e) {
throw new AWSSchemaRegistryException(e.getMessage(), e);
}
return bytes;
}
With writeHeaderVersionBytes(out) and writeCompressionBytes(out) writing to the front of the stream, respectively:
// byte HEADER_VERSION_BYTE = (byte) 3;
private void writeHeaderVersionBytes(ByteArrayOutputStream out) {
out.write(AWSSchemaRegistryConstants.HEADER_VERSION_BYTE);
}
// byte COMPRESSION_BYTE = (byte) 5
// byte COMPRESSION_DEFAULT_BYTE = (byte) 0
private void writeCompressionBytes(ByteArrayOutputStream out) {
out.write(compressionHandler != null ? AWSSchemaRegistryConstants.COMPRESSION_BYTE
: AWSSchemaRegistryConstants.COMPRESSION_DEFAULT_BYTE);
}
Why is Kinesis unable to parse a message that is produced by the library that is supposed to be best suited for writing to it? What am I missing?
I've finally figured out the problem and it's quite dumb.
What it boils down to, is that the transformer that converts data to parquet in Firehose expects a pure JSON payload. It expects records in the form:
{"itemId": 1, "itemName": "someItem"}{"itemId": 2, "itemName": "otherItem"}
It seemingly does not accept the same data in a different format.
This means that Avro-compatible JSON (where the above itemId would look like "itemId": {"long": 1}, or e.g. binary Avro data, is not compatible with the Kinesis Firehose parquet transformer, regardless of the fact that my schema definition in the Glue Schema Registry is explicitly registered as being in Avro format.
In addition, the Firehose parquet transformer requires the use of a Glue table - creating this table from an imported Avro schema simply does not work (see this answer), and had to be created manually. Luckily, even though it can't use the table that is based on an existing schema, the table definition was the same (with the exception of the Serde it needs to use), so it was relatively easy to fix...
To sum up, to get the above code to work I had to:
Create a Glue table for the schema manually (you can use the first table created from the existing schema as a template for creating this second table, but you can't have Firehose link to the first table)
Change the above code:
kinesisProducer.addUserRecord("my-data-stream", key, randomExplicitHashKey(), value.toByteBuffer(), gsrSchema);
to:
ByteBuffer data = ByteBuffer.wrap(value.toString().getBytes(StandardCharsets.UTF_8));
kinesisProducer.addUserRecord("my-data-stream", key, randomExplicitHashKey(), data);
Note that the I am now using the overloaded addUserRecord function that does not include a Schema parameter, which internally invokes the previous function with a null schema parameter. This prevents the KPL from encoding my payload and instead sends the 'plain' JSON over to KDS.
This is contrary to the only AWS Docs example that I could find on the topic, which likely is meant for a Firehose stream which does not convert the data prior to sending it to its destination.
I can't quite understand the reasons for all these undocumented limitations, and it was a pain to debug seeing how neither of the KPL functions nor KDS explicitly mentions anywhere that I can find that this is the expected behaviour. I feel like it's not worth trying to open an issue/PR over at the KPL repo seeing how it seems like Amazon doesn't really care about maintaining it that much...
I'll probably switch over to the plain Kinesis Client + Kinesis Aggregation for a more robust solution in the future, but hey, at least it works.
In fact I am making a Minecraft plugin and I was wondering how some plugins (without using DB) manage to keep information even when the server is off.
For example if we make a grade plugin and we create a different list or we stack the players who constitute each. When the server will shut down and restart afterwards, the lists will become empty again (as I initialized them).
So I wanted to know if anyone had any idea how to keep this information.
If a plugin want to save informations only for itself, and it don't need to make it accessible from another way (a PHP website for example), you can use YAML format.
Create the config file :
File usersFile = new File(plugin.getDataFolder(), "user-data.yml");
if(!usersFile.exists()) { // don't exist
usersFile.createNewFile();
// OR you can copy file, but the plugin should contains a default file
/*try (InputStream in = plugin.getResource("user-data.yml");
OutputStream out = new FileOutputStream(usersFile)) {
ByteStreams.copy(in, out);
} catch (Exception e) {
e.printStackTrace();
}*/
}
Load the file as Yaml content :
YamlConfiguration config = YamlConfiguration.loadConfiguration(usersFile);
Edit content :
config.set(playerUUID, myVar);
Save content :
config.save(usersFile);
Also, I suggest you to make I/O async (read & write) with scheduler.
Bonus:
If you want to make ONE config file per user, and with default config, do like that :
File oneUsersFile = new File(plugin.getDataFolder(), playerUUID + ".yml");
if(!oneUsersFile.exists()) { // don't exist
try (InputStream in = plugin.getResource("my-def-file.yml");
OutputStream out = new FileOutputStream(oneUsersFile)) {
ByteStreams.copy(in, out); // copy default to current
} catch (Exception e) {
e.printStackTrace();
}
}
YamlConfiguration userConfig = YamlConfiguration.loadConfiguration(oneUsersFile);
PS: the variable plugin is the instance of your plugin, i.e. the class which extends "JavaPlugin".
You can use PersistentDataContainers:
To read data from a player, use
PersistentDataContainer p = player.getPersistentDataContainer();
int blocksBroken = p.get(new NamespacedKey(plugin, "blocks_broken"), PersistentDataType.INTEGER); // You can also use DOUBLE, STRING, etc.
The Namespaced key refers to the name or pointer to the data being stored. The PersistentDataType refers to the type of data that is being stored, which can be any Java primitive type or String. To write data to a player, use
p.set(new NamespacedKey(plugin, "blocks_broken"), PersistentDataType.INTEGER, blocksBroken + 1);
I am receiving a set of events in Avro format on different topics. I want to consume these and write to s3 in parquet format.
I have written a below job that creates a different stream for each event and fetches it schema from the confluent schema registry to create a parquet sink for an event.
This is working fine but the only problem I am facing is whenever a new event start coming I have to change in the YAML config and restart the job every time. Is there any way I do not have to restart the job and it start consuming a new set of events.
YamlReader reader = new YamlReader(topologyConfig);
EventTopologyConfig eventTopologyConfig = reader.read(EventTopologyConfig.class);
long checkPointInterval = eventTopologyConfig.getCheckPointInterval();
topics = eventTopologyConfig.getTopics();
List<EventConfig> eventTypesList = eventTopologyConfig.getEventsType();
CachedSchemaRegistryClient registryClient = new CachedSchemaRegistryClient(schemaRegistryUrl, 1000);
FlinkKafkaConsumer flinkKafkaConsumer = new FlinkKafkaConsumer(topics,
new KafkaGenericAvroDeserializationSchema(schemaRegistryUrl),
properties);
DataStream<GenericRecord> dataStream = streamExecutionEnvironment.addSource(flinkKafkaConsumer).name("source");
try {
for (EventConfig eventConfig : eventTypesList) {
LOG.info("creating a stream for ", eventConfig.getEvent_name());
final StreamingFileSink<GenericRecord> sink = StreamingFileSink.forBulkFormat
(path, ParquetAvroWriters.forGenericRecord(SchemaUtils.getSchema(eventConfig.getSchema_subject(), registryClient)))
.withBucketAssigner(new EventTimeBucketAssigner())
.build();
DataStream<GenericRecord> outStream = dataStream.filter((FilterFunction<GenericRecord>) genericRecord -> {
if (genericRecord != null && genericRecord.get(EVENT_NAME).toString().equals(eventConfig.getEvent_name())) {
return true;
}
return false;
});
outStream.addSink(sink).name(eventConfig.getSink_id()).setParallelism(parallelism);
}
} catch (Exception e) {
e.printStackTrace();
}
Yaml file :
!com.bounce.config.EventTopologyConfig
eventsType:
- !com.bounce.config.EventConfig
event_name: "search_list_keyless"
schema_subject: "search_list_keyless-com.bounce.events.keyless.bookingflow.search_list_keyless"
topic: "search_list_keyless"
- !com.bounce.config.EventConfig
event_name: "bike_search_details"
schema_subject: "bike_search_details-com.bounce.events.keyless.bookingflow.bike_search_details"
topic: "bike_search_details"
- !com.bounce.config.EventConfig
event_name: "keyless_bike_lock"
schema_subject: "analytics-keyless-com.bounce.events.keyless.bookingflow.keyless_bike_lock"
topic: "analytics-keyless"
- !com.bounce.config.EventConfig
event_name: "keyless_bike_unlock"
schema_subject: "analytics-keyless-com.bounce.events.keyless.bookingflow.keyless_bike_unlock"
topic: "analytics-keyless"
checkPointInterval: 1200000
topics: ["search_list_keyless","bike_search_details","analytics-keyless"]
Thanks.
I think you want to use a custom BucketAssigner that uses the genericRecord.get(EVENT_NAME).toString() value as the bucket ID, along with whatever event time bucketing was being done by the EventTimeBucketAssigner.
Then you don't need to create multiple streams, and it should be dynamic (whenever a new event name value occurs in a record being written, you'll get a new output sink).
I'm trying to create Avro file in Java (just testing code at the moment). Everything works fine, the code looks about like this:
GenericRecord record = new GenericData.Record(schema);
File file = new File("test.avro");
DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema);
DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(datumWriter);
dataFileWriter.create(schema, file);
dataFileWriter.append(record);
dataFileWriter.close();
The problem I'm facing now is - what kind of Java object do I instantiate when I want to write Union? Not necessarily on the top level, possibly attach the union to a record being written. There are a few objects for complex types prepared, like GenericData.Record, GenericData.Array etc. For those that are not prepared, usually the right object is simply a standard Java object (java.util.Map implementing classes for "map" Avro type etc.).
But I cannot figure out what is the right object to instantiate for writing a Union.
This question refers to writing Avro file WITHOUT code generation. Any help is very much appreciated.
Here's what I did:
Suppose the schema is defined like this:
record MyStructure {
...
record MySubtype {
int p1;
}
union {null, MySubtype} myField = null;
...
}
And this is the Java code:
Schema schema; // the schema of the main structure
// ....
GenericRecord rec = new GenericData.Record(schema);
int i = schema.getField("myField").schema().getIndexNamed("MySubtype");
GenericRecord myField = new GenericData.Record(schema.getField("myField").schema().getTypes().get(i));
myField.put("p1", 100);
rec.put("myField", myField);
I'm starting to design an application, that will, in part, run through a directory of files and compare their extensions to their file headers.
Does anyone have any advice as to the best way to approach this? I know I could simply have a lookup table that will contain the file's header signature. e.g., JPEG: \xFF\xD8\xFF\xE0
I was hoping there might be a simper way.
Thanks in advance for your help.
I'm afraid it'll have to be more complicated than that. Not every file type has a header at all, and some (such as RAR) have their characteristic data structures at the end rather than at the beginning.
You may want to take a look at the Unix file command, which does the same job:
http://linux.die.net/man/1/file
http://linux.die.net/man/5/magic
If you don't need to do dirty work on these values (and you don't have linux) you could simply use an external program, like TrID, that is able to do this thing for you.
Maybe you can just work on its output without caring to doing it by yourself.. in anycase if you have just around 20 kinds of files that you will have to manage having a simple lookup table (eg. HashMap<String,byte[]>) is not that bad. Of cours this will work only if desidered file format has a magic number, otherwise you are on your own (or with an external program).
Because of the problem with the missing significant header for some file types (thanks #Michael) I would create a map of extension to a kind of type checker with a simple API like
public interface TypeCheck throws IOException {
public boolean isValid(InputStream data);
}
Now you can code something like
File toBeTested = ...;
Map<String,TypeCheck> typeCheckByExtension = ...;
TypeCheck check = typeCheckByExtension.get(getExtension(toBeTested.getName()));
if (check != null) {
InputStream in = new FileInputStream(toBeTested);
if (check.isValid(in)) {
// process valid file
} else {
// process invalid file
}
in.close();
} else {
// process unknown file
}
The Header check for JPEG for example may look like
public class JpegTypeCheck implements TypeCheck {
private static final byte[] HEADER = new byte[] {0xFF, 0xD8, 0xFF, 0xE0};
public boolean isValid(InputStream data) throws IOException {
byte[] header = new byte[4];
return data.read(header) == 4 && Arrays.equals(header, HEADER);
}
}
For other types with no significant header you can implement completly other type checks.
You can extract the mime type for each file and compare this to a map of mimetype/extension (Map<String, List<String>>, the first String is the mime type, the second is a list of valid extensions).
Resources :
Get the Mime Type from a File
JMimeMagic
On the same topic :
Java - HowTo extract MimeType from a byte[]
Getting A File's Mime Type In Java
You can know the file type of file reading the header using apache tika. Following code need apache tika jar.
InputStream is = MainApp.class.getResourceAsStream("/NetFx20SP1_x64.txt");
BufferedInputStream bis = new BufferedInputStream(is);
AutoDetectParser parser = new AutoDetectParser();
Detector detector = parser.getDetector();
Metadata md = new Metadata();
md.add(Metadata.RESOURCE_NAME_KEY,MainApp.class.getResource("/NetFx20SP1_x64.txt").getPath());
MediaType mediaType = detector.detect(bis, md);
System.out.println("MIMe Type of File : " + mediaType.toString());