I have a relatively straightforward use case:
1. Read Avro data from a Kafka topic
2. Use KPL (v0.14.12) to send this data to Kinesis Data Streams
3. Use Kinesis Firehose to transform this data into Parquet and transfer it to S3.
The Kafka topic was written into by Kafka Streams using the following producer Configuration:
private void addAwsGlueSpecificProperties(Map<String, Object> props) {
props.put(AWSSchemaRegistryConstants.AWS_REGION, "eu-central-1");
props.put(AWSSchemaRegistryConstants.DATA_FORMAT, DataFormat.AVRO.name());
props.put(AWSSchemaRegistryConstants.SCHEMA_AUTO_REGISTRATION_SETTING, true);
props.put(AWSSchemaRegistryConstants.REGISTRY_NAME, "Kinesis_Schema_Registry");
props.put(AWSSchemaRegistryConstants.COMPRESSION_TYPE, AWSSchemaRegistryConstants.COMPRESSION.ZLIB.name());
props.put(DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
props.put(DEFAULT_VALUE_SERDE_CLASS_CONFIG, GlueSchemaRegistryKafkaStreamsSerde.class.getName());
}
Most notably, I've set SCHEMA_AUTO_REGISTRATION_SETTING to true to try and rule out problems with my schema definition. The auto-registration itself worked without any issues.
I have a very simple loop running for test purposes, which does steps 1 and 2 of the above. It looks as follows:
KinesisProducer kinesisProducer = new KinesisProducer(getKinesisConfig());
try (final KafkaConsumer<String, AvroEvent> consumer = new KafkaConsumer<>(properties)) {
    consumer.subscribe(Collections.singletonList(TOPIC));
    while (true) {
        log.info("Polling...");
        final ConsumerRecords<String, AvroEvent> records = consumer.poll(Duration.ofMillis(100));
        for (final ConsumerRecord<String, AvroEvent> record : records) {
            final String key = record.key();
            final AvroEvent value = record.value();
            // gsrSchema is the schema definition previously fetched from the Glue Schema Registry (not shown)
            ListenableFuture<UserRecordResult> request = kinesisProducer.addUserRecord("my-data-stream", key, randomExplicitHashKey(), value.toByteBuffer(), gsrSchema);
            Futures.addCallback(request, CALLBACK, executor);
        }
        Thread.sleep(Duration.ofSeconds(10).toMillis());
    }
}
The callback just does a bit of logging on success/failure.
My Kinesis Config looks as follows:
private static KinesisProducerConfiguration getKinesisConfig() {
KinesisProducerConfiguration config = new KinesisProducerConfiguration();
GlueSchemaRegistryConfiguration schemaRegistryConfiguration = getGlueSchemaRegistryConfiguration();
config.setGlueSchemaRegistryConfiguration(schemaRegistryConfiguration);
config.setRegion("eu-central-1");
config.setCredentialsProvider(new DefaultAWSCredentialsProviderChain());
config.setMaxConnections(2);
config.setThreadingModel(KinesisProducerConfiguration.ThreadingModel.POOLED);
config.setThreadPoolSize(2);
config.setRateLimit(100L);
return config;
}
private static GlueSchemaRegistryConfiguration getGlueSchemaRegistryConfiguration() {
GlueSchemaRegistryConfiguration gsrConfig = new GlueSchemaRegistryConfiguration("eu-central-1");
gsrConfig.setAvroRecordType(AvroRecordType.GENERIC_RECORD); // have also tried SPECIFIC_RECORD
gsrConfig.setRegistryName("Kinesis_Schema_Registry");
gsrConfig.setCompressionType(AWSSchemaRegistryConstants.COMPRESSION.ZLIB);
return gsrConfig;
}
This setup allows me to read Specific Avro records from Kafka and send them to Kinesis. I have also verified that the correct schema version ID is queried from GSR by my code. However, when my data gets to Firehose, I receive only the following error message for all my records (one per record):
{
"attemptsMade": 1,
"arrivalTimestamp": 1659622848304,
"lastErrorCode": "DataFormatConversion.ParseError",
"lastErrorMessage": "Encountered malformed JSON. Illegal character ((CTRL-CHAR, code 3)): only regular white space (\\r, \\n, \\t) is allowed between tokens\n at [Source: com.fasterxml.jackson.databind.util.ByteBufferBackedInputStream#6252e7eb; line: 1, column: 2]",
"attemptEndingTimestamp": 1659623152452,
"rawData": "<base64EncodedData>",
"sequenceNumber": "<seqNum>",
"dataCatalogTable": {
"databaseName": "<Glue database name>",
"tableName": "<Glue table name>",
"region": "eu-central-1",
"versionId": "LATEST",
"roleArn": "<arn>"
}
}
Unfortunately I can't post the entirety of the data as it is sensitive. However, the relevant part is that it always starts with the above control character that is causing the problem:
0x03 0x05 <schemaVersionId> <data>
My original data does not contain these control characters. After some debugging, I've found that KPL explicitly adds these bytes to the beginning of a UserRecord. In com.amazonaws.services.schemaregistry.serializers.SerializationDataEncoder#write:
public byte[] write(final byte[] objectBytes, UUID schemaVersionId) {
byte[] bytes;
try (ByteArrayOutputStream out = new ByteArrayOutputStream()) {
writeHeaderVersionBytes(out);
writeCompressionBytes(out);
writeSchemaVersionId(out, schemaVersionId);
boolean shouldCompress = this.compressionHandler != null;
bytes = writeToExistingStream(out, shouldCompress ? compressData(objectBytes) : objectBytes);
} catch (Exception e) {
throw new AWSSchemaRegistryException(e.getMessage(), e);
}
return bytes;
}
Here, writeHeaderVersionBytes(out) and writeCompressionBytes(out) write the first and second byte of the stream, respectively:
// byte HEADER_VERSION_BYTE = (byte) 3;
private void writeHeaderVersionBytes(ByteArrayOutputStream out) {
out.write(AWSSchemaRegistryConstants.HEADER_VERSION_BYTE);
}
// byte COMPRESSION_BYTE = (byte) 5
// byte COMPRESSION_DEFAULT_BYTE = (byte) 0
private void writeCompressionBytes(ByteArrayOutputStream out) {
out.write(compressionHandler != null ? AWSSchemaRegistryConstants.COMPRESSION_BYTE
: AWSSchemaRegistryConstants.COMPRESSION_DEFAULT_BYTE);
}
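For completeness, this is how I read the resulting framing while debugging. This is just my own sketch based on the constants above; in particular, treating the schema version id as a 16-byte UUID written as two big-endian longs is my assumption, not something confirmed in the library's documentation:

import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.UUID;

public class GsrFraming {
    // Splits a GSR-framed message into its parts: header byte, compression byte,
    // schema version UUID, and the (possibly zlib-compressed) payload.
    public static void inspect(byte[] message) {
        ByteBuffer buf = ByteBuffer.wrap(message);
        byte headerVersion = buf.get(); // 3 = HEADER_VERSION_BYTE
        byte compression = buf.get();   // 5 = zlib, 0 = uncompressed
        UUID schemaVersionId = new UUID(buf.getLong(), buf.getLong());
        byte[] payload = Arrays.copyOfRange(message, buf.position(), message.length);
        System.out.printf("header=%d compression=%d schemaVersionId=%s payloadBytes=%d%n",
                headerVersion, compression, schemaVersionId, payload.length);
    }
}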
Why is Kinesis unable to parse a message that is produced by the library that is supposed to be best suited for writing to it? What am I missing?
I've finally figured out the problem and it's quite dumb.
What it boils down to is that the transformer that converts data to Parquet in Firehose expects a pure JSON payload. It expects records in the form:
{"itemId": 1, "itemName": "someItem"}{"itemId": 2, "itemName": "otherItem"}
It seemingly does not accept the same data in a different format.
This means that Avro-compatible JSON (where the above itemId would look like "itemId": {"long": 1}), or e.g. binary Avro data, is not compatible with the Kinesis Firehose Parquet transformer, regardless of the fact that my schema definition in the Glue Schema Registry is explicitly registered as being in Avro format.
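To make the difference concrete, here's a small sketch with a made-up Item schema (not my actual schema) showing both JSON shapes. Avro's JsonEncoder produces the union-wrapped form, while toString() on the record produces the plain form that the Parquet conversion accepted for me:

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.io.JsonEncoder;

public class JsonShapes {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Item\",\"fields\":["
            + "{\"name\":\"itemId\",\"type\":[\"null\",\"long\"],\"default\":null},"
            + "{\"name\":\"itemName\",\"type\":\"string\"}]}");

        GenericRecord record = new GenericData.Record(schema);
        record.put("itemId", 1L);
        record.put("itemName", "someItem");

        // Avro JSON encoding wraps union values: {"itemId":{"long":1},"itemName":"someItem"}
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        JsonEncoder encoder = EncoderFactory.get().jsonEncoder(schema, out);
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();
        System.out.println(out.toString("UTF-8"));

        // toString() yields plain JSON: {"itemId": 1, "itemName": "someItem"}
        System.out.println(record);
    }
}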
In addition, the Firehose Parquet transformer requires the use of a Glue table. Creating this table from an imported Avro schema simply does not work (see this answer), so it had to be created manually. Luckily, even though Firehose can't use the table that is based on an existing schema, the table definition was the same (with the exception of the Serde it needs to use), so it was relatively easy to fix...
To sum up, to get the above code to work I had to:
Create a Glue table for the schema manually (you can use the first table created from the existing schema as a template for creating this second table, but you can't have Firehose link to the first table)
Change the above code:
kinesisProducer.addUserRecord("my-data-stream", key, randomExplicitHashKey(), value.toByteBuffer(), gsrSchema);
to:
ByteBuffer data = ByteBuffer.wrap(value.toString().getBytes(StandardCharsets.UTF_8));
kinesisProducer.addUserRecord("my-data-stream", key, randomExplicitHashKey(), data);
Note that I am now using the overloaded addUserRecord function that does not include a Schema parameter; internally it invokes the previous function with a null schema parameter. This prevents the KPL from encoding my payload and instead sends the 'plain' JSON over to KDS.
This is contrary to the only AWS Docs example I could find on the topic, which is likely meant for a Firehose stream that does not convert the data prior to sending it to its destination.
I can't quite understand the reasons for all these undocumented limitations, and it was a pain to debug, seeing as neither the KPL functions nor KDS explicitly mention anywhere I can find that this is the expected behaviour. I feel like it's not worth trying to open an issue/PR over at the KPL repo, seeing how it seems like Amazon doesn't really care much about maintaining it...
I'll probably switch over to the plain Kinesis Client + Kinesis Aggregation for a more robust solution in the future, but hey, at least it works.
I am trying to dynamically parse a given .proto file in Java to decode a Protobuf-encoded binary.
I have the following parsing method, in which the "proto" string contains the content of the .proto file:
public static Descriptors.FileDescriptor parseProto (String proto) throws InvalidProtocolBufferException, Descriptors.DescriptorValidationException {
DescriptorProtos.FileDescriptorProto descriptorProto = DescriptorProtos.FileDescriptorProto.parseFrom(proto.getBytes());
return Descriptors.FileDescriptor.buildFrom(descriptorProto, null);
}
However, on execution the above method throws an exception with the message "Protocol message tag had invalid wire type." I use the example .proto file from Google, so I assume it is valid: https://github.com/google/protobuf/blob/master/examples/addressbook.proto
Here is the stack trace:
15:43:24.707 [pool-1-thread-1] ERROR com.github.whiver.nifi.processor.ProtobufDecoderProcessor - ProtobufDecoderProcessor[id=42c8ab94-2d8a-491b-bd99-b4451d127ae0] Protocol message tag had invalid wire type.
com.google.protobuf.InvalidProtocolBufferException$InvalidWireTypeException: Protocol message tag had invalid wire type.
at com.google.protobuf.InvalidProtocolBufferException.invalidWireType(InvalidProtocolBufferException.java:115)
at com.google.protobuf.UnknownFieldSet$Builder.mergeFieldFrom(UnknownFieldSet.java:551)
at com.google.protobuf.GeneratedMessageV3.parseUnknownField(GeneratedMessageV3.java:293)
at com.google.protobuf.DescriptorProtos$FileDescriptorSet.<init>(DescriptorProtos.java:88)
at com.google.protobuf.DescriptorProtos$FileDescriptorSet.<init>(DescriptorProtos.java:53)
at com.google.protobuf.DescriptorProtos$FileDescriptorSet$1.parsePartialFrom(DescriptorProtos.java:773)
at com.google.protobuf.DescriptorProtos$FileDescriptorSet$1.parsePartialFrom(DescriptorProtos.java:768)
at com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:163)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:197)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:209)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:214)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
at com.google.protobuf.DescriptorProtos$FileDescriptorSet.parseFrom(DescriptorProtos.java:260)
at com.github.whiver.nifi.parser.SchemaParser.parseProto(SchemaParser.java:9)
at com.github.whiver.nifi.processor.ProtobufDecoderProcessor.lambda$onTrigger$0(ProtobufDecoderProcessor.java:103)
at org.apache.nifi.util.MockProcessSession.write(MockProcessSession.java:895)
at org.apache.nifi.util.MockProcessSession.write(MockProcessSession.java:62)
at com.github.whiver.nifi.processor.ProtobufDecoderProcessor.onTrigger(ProtobufDecoderProcessor.java:100)
at org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
at org.apache.nifi.util.StandardProcessorTestRunner$RunProcessor.call(StandardProcessorTestRunner.java:251)
at org.apache.nifi.util.StandardProcessorTestRunner$RunProcessor.call(StandardProcessorTestRunner.java:245)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Any idea?
Thank you!
It looks like you're trying to use FileDescriptorSet.parseFrom to populate a FileDescriptorSet. This will only work if the bytes you're providing are the binary protobuf contents - which is to say: a compiled schema. You can get a compiled schema by using the protoc command-line-tool with the --descriptor_set_out option. What you're actually passing it right now is the text bytes that make up the text schema, which is not what parseFrom expects.
Without a compiled schema, you would need a runtime .proto parser. I'm not aware of one for Java; protobuf-net includes one (protobuf-net.Reflection), but that is C#/.NET. Without an available runtime .proto parser, you'd need to shell-execute protoc instead.
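For illustration, here's a rough Java sketch of the shell-out approach (it assumes protoc is on the PATH and that the .proto file's imports are resolvable from its parent directory; the names are placeholders):

import com.google.protobuf.DescriptorProtos;
import com.google.protobuf.Descriptors;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ProtoCompiler {
    // Compiles a .proto file with the protoc binary and builds a FileDescriptor from the result.
    public static Descriptors.FileDescriptor compile(Path protoFile) throws Exception {
        Path descriptorSet = Files.createTempFile("schema", ".desc");
        Process protoc = new ProcessBuilder(
                "protoc",
                "--proto_path=" + protoFile.getParent(),
                "--include_imports",
                "--descriptor_set_out=" + descriptorSet,
                protoFile.getFileName().toString())
            .inheritIO()
            .start();
        if (protoc.waitFor() != 0) {
            throw new IllegalStateException("protoc failed for " + protoFile);
        }
        DescriptorProtos.FileDescriptorSet set;
        try (InputStream in = Files.newInputStream(descriptorSet)) {
            set = DescriptorProtos.FileDescriptorSet.parseFrom(in);
        }
        Map<String, DescriptorProtos.FileDescriptorProto> protosByName = new HashMap<>();
        for (DescriptorProtos.FileDescriptorProto fdp : set.getFileList()) {
            protosByName.put(fdp.getName(), fdp);
        }
        return build(protoFile.getFileName().toString(), protosByName, new HashMap<>());
    }

    // Recursively builds a FileDescriptor, resolving dependencies from the descriptor set.
    private static Descriptors.FileDescriptor build(
            String name,
            Map<String, DescriptorProtos.FileDescriptorProto> protos,
            Map<String, Descriptors.FileDescriptor> cache)
            throws Descriptors.DescriptorValidationException {
        if (cache.containsKey(name)) {
            return cache.get(name);
        }
        DescriptorProtos.FileDescriptorProto proto = protos.get(name);
        List<Descriptors.FileDescriptor> deps = new ArrayList<>();
        for (String dep : proto.getDependencyList()) {
            deps.add(build(dep, protos, cache));
        }
        Descriptors.FileDescriptor fd = Descriptors.FileDescriptor.buildFrom(
                proto, deps.toArray(new Descriptors.FileDescriptor[0]));
        cache.put(name, fd);
        return fd;
    }
}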
Drawing from the other answers, here's a snippet of working Kotlin code from a library I'm developing.
https://github.com/asarkar/okgrpc
private fun lookupProtos(
    protoPaths: List<String>,
    protoFile: String,
    tempDir: Path,
    resolved: MutableSet<String>
): List<DescriptorProtos.FileDescriptorProto> {
    val schema = generateSchema(protoPaths, protoFile, tempDir)
    return schema.fileList
        .filter { resolved.add(it.name) }
        .flatMap { fd ->
            fd.dependencyList
                .filterNot(resolved::contains)
                .flatMap { lookupProtos(protoPaths, it, tempDir, resolved) } + fd
        }
}

private fun generateSchema(
    protoPaths: List<String>,
    protoFile: String,
    tempDir: Path
): DescriptorProtos.FileDescriptorSet {
    val outFile = Files.createTempFile(tempDir, null, null)
    val stderr = ByteArrayOutputStream()
    val exitCode = Protoc.runProtoc(
        (protoPaths.map { "--proto_path=$it" } + listOf("--descriptor_set_out=$outFile", protoFile)).toTypedArray(),
        DevNull,
        stderr
    )
    if (exitCode != 0) {
        throw IllegalStateException("Failed to generate schema for: $protoFile")
    }
    return Files.newInputStream(outFile).use { DescriptorProtos.FileDescriptorSet.parseFrom(it) }
}
The idea is to use os72/protoc-jar to write out a compiled schema/file descriptor. Then use FileDescriptorSet.parseFrom to read that file, and recurse on its dependencies.
Don't use a Java String to hold the protobuf payload. The issue is that String does translations behind the scenes and makes assumptions about character sets.
Protobuf works on byte arrays, and the exact representation in the array has to be unchanged; going to and from String does not work.
I'm trying to figure out how I can pass a stream of data within ContentProvider.openFile. The data to be sent is created in JNI. I tried createPipe with a transfer thread but I had a ton of trouble with broken pipes. So I thought I might just pass the 'write' pipe to JNI and write the data directly to it.
Java:
ParcelFileDescriptor[] pipe = ParcelFileDescriptor.createPipe();
boolean result = ImageProcessor.getThumb(fd/*source fd*/, pipe[1].getFd()); //JNI call (formerly returned a byte[])
return pipe[0];
C:
unsigned char* jpeg = NULL;
unsigned long jpegSize = 0;
getThumbnail(env, &jpeg, &jpegSize, rawProcessor); // Populates jpeg thumb, works when converted to byte[] in second segment
FILE* out = fdopen(dest, "wb");
int written = fwrite(jpeg, 1, jpegSize, out);
return TRUE;
When I convert to byte[] everything works fine, just not within a ContentProvider obviously:
jbyteArray thumb = env->NewByteArray(jpegSize);
env->SetByteArrayRegion(thumb, 0, jpegSize, (jbyte *) jpeg);
free(jpeg);
return thumb;
When I debug, it gets to fwrite and then the stack trace just seems to disappear. It never hits return TRUE or return pipe[0], but it also doesn't crash or throw. Very strange...
Has anyone done something similar? Is it sufficient to simply write binary to the "write" pipe? Am I doing anything fundamentally wrong here? Thanks.
Update (after discussion with @pskink)
I tried implementing the PipeDataWriter. I used FileProvider.java as an example.
@Override
public void writeDataToPipe(@NonNull ParcelFileDescriptor output, @NonNull Uri uri, @NonNull String mimeType, @Nullable Bundle opts, @Nullable byte[] args)
{
    try (FileOutputStream fout = new FileOutputStream(output.getFileDescriptor()))
    {
        fout.write(args, 0, args.length);
    }
    catch (IOException e)
    {
        Log.e(TAG, "Failed transferring", e);
    }
}
byte[] rawData = ImageUtil.getRawThumb(fd.getParcelFileDescriptor().getFd());
return openPipeHelper(Uri.parse("invalid"), "image/jpg", null, rawData, this);
However, I'm getting the same errors I got when I used the transfer thread above:
java.io.IOException: write failed: EBADF (Bad file descriptor)
    at libcore.io.IoBridge.write(IoBridge.java:498)
    at java.io.FileOutputStream.write(FileOutputStream.java:186)
    at com.anthonymandra.content.MetaProvider.writeDataToPipe(MetaProvider.java:273)
and
java.io.IOException: write failed: EPIPE (Broken pipe)
    at libcore.io.IoBridge.write(IoBridge.java:498)
    at java.io.FileOutputStream.write(FileOutputStream.java:186)
    at com.anthonymandra.content.MetaProvider.writeDataToPipe(MetaProvider.java:273)
When I stepped through to make sure the data was fine for the images I found that everything loaded fine. It looks to me like this is actually a thread safety issue.
There were actually a bunch of things going wrong that all rolled up into a confusing mess:
1. I wasn't closing the ParcelFileDescriptor in a finally.
2. I use Glide for an image cache, and it uses two fetchers when you load a Uri, meaning openFile was being called twice per file.
3. (2) caused endless broken pipe errors.
4. StrictMode was killing the app because of (1), and I missed it in the flurry of errors from (3).
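For reference, here's a condensed sketch of how the pieces fit together after those fixes, with the write side of the pipe wrapped so it is always closed (this is a simplification, not my exact code; how the source fd is resolved from the Uri is omitted):

@Override
public ParcelFileDescriptor openFile(@NonNull Uri uri, @NonNull String mode) throws FileNotFoundException {
    try {
        final ParcelFileDescriptor[] pipe = ParcelFileDescriptor.createPipe();
        final int sourceFd = resolveSourceFd(uri); // placeholder: look up the source image fd for this Uri
        new Thread(() -> {
            // AutoCloseOutputStream closes pipe[1] when the try block exits, even on failure,
            // so the reader sees EOF instead of EPIPE
            try (OutputStream out = new ParcelFileDescriptor.AutoCloseOutputStream(pipe[1])) {
                out.write(ImageUtil.getRawThumb(sourceFd));
            } catch (IOException e) {
                Log.e(TAG, "Failed transferring", e);
            }
        }).start();
        return pipe[0];
    } catch (IOException e) {
        throw new FileNotFoundException("Could not open pipe for: " + uri);
    }
}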
This is somewhat of a shot in the dark in case anyone savvy with the Java implementation of Apache Avro is reading this.
My high-level objective is to have some way to transmit a series of Avro data over the network (let's just say HTTP, for example, but the particular protocol is not that important for this purpose). In my context I have an HttpServletResponse I need to write this data to somehow.
I initially attempted to write the data as what amounted to a virtual version of an avro container file (suppose that "response" is of type HttpServletResponse):
response.setContentType("application/octet-stream");
response.setHeader("Content-transfer-encoding", "binary");
ServletOutputStream outStream = response.getOutputStream();
BufferedOutputStream bos = new BufferedOutputStream(outStream);
Schema someSchema = Schema.parse(".....some valid avro schema....");
GenericRecord someRecord = new GenericData.Record(someSchema);
someRecord.put("somefield", someData);
...
GenericDatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(someSchema);
DataFileWriter<GenericRecord> fileWriter = new DataFileWriter<GenericRecord>(datumWriter);
fileWriter.create(someSchema, bos);
fileWriter.append(someRecord);
fileWriter.close();
bos.flush();
This was all fine and dandy, except that it turns out Avro doesn't really provide a way to read a container file apart from an actual file: the DataFileReader only has two constructors:
public DataFileReader(File file, DatumReader<D> reader);
and
public DataFileReader(SeekableInput sin, DatumReader<D> reader);
where SeekableInput is some avro-specific customized form whose creation also ends up reading from a file. Now given that, unless there is some way to somehow coerce an InputStream into a File (http://stackoverflow.com/questions/578305/create-a-java-file-object-or-equivalent-using-a-byte-array-in-memory-without-a suggests that there is not, and I have tried looking around the Java documentation as well), this approach won't work if the reader on the other end of the OutputStream receives that avro container file (I'm not sure why they allowed one to output avro binary container files to an arbitrary OutputStream without providing a way to read them from the corresponding InputStream on the other end, but that's beside the point). It seems that the implementation of the container file reader requires the "seekable" functionality that a concrete File provides.
Okay, so it doesn't look like that approach will do what I want. How about creating a JSON response that mimics the avro container file?
public static Schema WRAPPER_SCHEMA = Schema.parse(
"{\"type\": \"record\", " +
"\"name\": \"AvroContainer\", " +
"\"doc\": \"a JSON avro container file\", " +
"\"namespace\": \"org.bar.foo\", " +
"\"fields\": [" +
"{\"name\": \"schema\", \"type\": \"string\", \"doc\": \"schema representing the included data\"}, " +
"{\"name\": \"data\", \"type\": \"bytes\", \"doc\": \"packet of data represented by the schema\"}]}"
);
I'm not sure if this is the best way to approach this given the above constraints, but it looks like this might do the trick. I'll put the schema (of "Schema someSchema" from above, for instance) as a String inside the "schema" field, and then put in the avro-binary-serialized form of a record fitting that schema (i.e. "GenericRecord someRecord") inside the "data" field.
I actually wanted to know about a specific detail of that which is described below, but I thought it would be worthwhile to give a bigger context as well, so that if there is a better high-level approach I could be taking (this approach works but just doesn't feel optimal) please do let me know.
My question is, assuming I go with this JSON-based approach, how do I write the avro binary representation of my Record into the "data" field of the AvroContainer schema? For example, I got up to here:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
GenericDatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(someSchema);
Encoder e = new BinaryEncoder(baos);
datumWriter.write(resultsRecord, e);
e.flush();

GenericRecord someRecord = new GenericData.Record(WRAPPER_SCHEMA); // the wrapper record uses the wrapper schema
someRecord.put("schema", someSchema.toString());
someRecord.put("data", ByteBuffer.wrap(baos.toByteArray()));
datumWriter = new GenericDatumWriter<GenericRecord>(WRAPPER_SCHEMA);

// use a separate stream for the JSON output so it isn't appended to the binary bytes above
ByteArrayOutputStream jsonBaos = new ByteArrayOutputStream();
JsonGenerator jsonGenerator = new JsonFactory().createJsonGenerator(jsonBaos, JsonEncoding.UTF8);
e = new JsonEncoder(WRAPPER_SCHEMA, jsonGenerator);
datumWriter.write(someRecord, e);
e.flush();

// set content type and encoding before fetching the writer
response.setContentType("text/plain");
response.setCharacterEncoding("UTF-8");
PrintWriter printWriter = response.getWriter(); // recall that response is the HttpServletResponse
printWriter.print(jsonBaos.toString("UTF-8"));
I initially tried omitting the ByteBuffer.wrap clause, but then the line
datumWriter.write(someRecord, e);
threw an exception that I couldn't cast a byte array into ByteBuffer. Fair enough, it looks like when the Encoder class (of which JsonEncoder is a subclass) is called to write an avro Bytes object, it requires a ByteBuffer to be given as an argument. Thus, I tried encapsulating the byte[] with java.nio.ByteBuffer.wrap, but when the data was printed out, it was printed as a straight series of bytes, without being passed through the avro hexadecimal representation:
"data": {"bytes": ".....some gibberish other than the expected format...}
That doesn't seem right. According to the avro documentation, the example bytes object they give says that I need to put in a json object, an example of which looks like "\u00FF", and what I have put in there is clearly not of that format. What I now want to know is the following:
What is an example of an avro bytes format? Does it look something like "\uDEADBEEFDEADBEEF..."?
How do I coerce my binary avro data (as output by the BinaryEncoder into a byte[] array) into a format that I can stick into the GenericRecord object and have it print correctly in JSON? For example, I want an Object DATA for which I can call on some GenericRecord "someRecord.put("data", DATA);" with my avro serialized data inside?
How would I then read that data back into a byte array on the other (consumer) end, when it is given the text JSON representation and wants to recreate the GenericRecord as represented by the AvroContainer-format JSON?
(reiterating the question from before) Is there a better way I could be doing all this?
As Knut said, if you want to use something other than a file, you can either:
use SeekableByteArrayInput for anything you can shoe-horn into a byte array, or
implement SeekableInput in your own way - for example if you were getting it out of some weird database structure.
Or just use a file. Why not?
Those are your answers.
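For example, a quick sketch of the first option (this assumes the whole container payload fits comfortably in memory):

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.SeekableByteArrayInput;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class ContainerFromBytes {
    // Reads the stream fully into a byte[] and hands it to DataFileReader via SeekableByteArrayInput.
    public static void readAll(InputStream in) throws Exception {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        try (DataFileReader<GenericRecord> reader = new DataFileReader<>(
                new SeekableByteArrayInput(buffer.toByteArray()),
                new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord record : reader) {
                System.out.println(record);
            }
        }
    }
}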
The way I solved this was to ship the schemas separately from the data. I set up a connection handshake that transmits the schemas down from the server, then I send encoded data back and forth. You have to create an outside wrapper object like this:
{'name':'Wrapper','type':'record','fields':[
{'name':'schemaName','type':'string'},
{'name':'records','type':{'type':'array','items':'bytes'}}
]}
Where you first encode your array of records, one by one, into an array of encoded byte arrays. Everything in one array should have the same schema. Then you encode the wrapper object with the above schema -- set "schemaName" to be the name of the schema you used to encode the array.
On the server, you will decode the wrapper object first. Once you decode the wrapper object, you know the schemaName, and you have an array of objects you know how to decode -- use as you will!
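As a sketch of the sending side (the item schema and field values are yours; the wrapper schema is the one shown above, passed in here already parsed with Schema.Parser):

import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class WrapperExample {
    // Binary-encodes a single record against its schema.
    static byte[] encode(Schema schema, GenericRecord record) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();
        return out.toByteArray();
    }

    // Encodes each record on its own, puts the encoded bytes into the "records" array,
    // then encodes the wrapper record itself.
    public static byte[] wrap(Schema wrapperSchema, Schema itemSchema,
                              List<GenericRecord> items) throws Exception {
        List<ByteBuffer> encoded = new ArrayList<>();
        for (GenericRecord item : items) {
            encoded.add(ByteBuffer.wrap(encode(itemSchema, item)));
        }
        GenericRecord wrapper = new GenericData.Record(wrapperSchema);
        wrapper.put("schemaName", itemSchema.getFullName());
        wrapper.put("records", encoded);
        return encode(wrapperSchema, wrapper);
    }
}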
Note that you can get away without using the wrapper object if you use a protocol like WebSockets and an engine like Socket.IO (for Node.js). Socket.IO gives you a channel-based communication layer between browser and server. In that case, just use a specific schema for each channel and encode each message before you send it. You still have to share the schemas when the connection initiates -- but if you are using WebSockets this is easy to implement. And when you are done you have an arbitrary number of strongly-typed, bidirectional streams between client and server.
Under Java and Scala, we tried using inception via code generated using the Scala nitro codegen. Inception is how the Javascript mtth/avsc library solved this problem. However, we ran into several serialization problems using the Java library where there were erroneous bytes being injected into the byte stream, consistently - and we could not figure out where those bytes were coming from.
Of course that meant building our own implementation of Varint with ZigZag encoding. Meh.
Here it is:
package com.terradatum.query
import java.io.ByteArrayOutputStream
import java.nio.ByteBuffer
import java.security.MessageDigest
import java.util.UUID
import akka.actor.ActorSystem
import akka.stream.stage._
import akka.stream.{Attributes, FlowShape, Inlet, Outlet}
import com.nitro.scalaAvro.runtime.GeneratedMessage
import com.terradatum.diagnostics.AkkaLogging
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumWriter, GenericRecord}
import org.apache.avro.io.EncoderFactory
import org.elasticsearch.search.SearchHit
import scala.collection.mutable.ArrayBuffer
import scala.reflect.ClassTag
/*
* The original implementation of this helper relied exclusively on using the Header Avro record and inception to create
* the header. That didn't work for us because somehow erroneous bytes were injected into the output.
*
* Specifically:
* 1. 0x08 prepended to the magic
* 2. 0x0020 between the header and the sync marker
*
* Rather than continue to spend a large number of hours trying to troubleshoot why the Avro library was producing such
* erroneous output, we build the Avro Container File using a combination of our own code and Avro library code.
*
* This means that Terradatum code is responsible for the Avro Container File header (including magic, file metadata and
* sync marker) and building the blocks. We only use the Avro library code to build the binary encoding of the Avro
* records.
*
* @see https://avro.apache.org/docs/1.8.1/spec.html#Object+Container+Files
*/
object AvroContainerFileHelpers {
val magic: ByteBuffer = {
val magicBytes = "Obj".getBytes ++ Array[Byte](1.toByte)
val mg = ByteBuffer.allocate(magicBytes.length).put(magicBytes)
mg.position(0)
mg
}
def makeSyncMarker(): Array[Byte] = {
val digester = MessageDigest.getInstance("MD5")
digester.update(s"${UUID.randomUUID}#${System.currentTimeMillis()}".getBytes)
val marker = ByteBuffer.allocate(16).put(digester.digest()).compact()
marker.position(0)
marker.array()
}
/*
* Note that other implementations of avro container files, such as the javascript library
* mtth/avsc uses "inception" to encode the header, that is, a datum following a header
* schema should produce valid headers. We originally had attempted to do the same but for
* an unknown reason two bytes were being inserted into our header, one at the very beginning
* of the header before the MAGIC marker, and one right before the syncmarker of the header.
* We were unable to determine why this wasn't working, and so this solution was used instead
* where the record/map is encoded per the avro spec manually without the use of "inception."
*/
def header(schema: Schema, syncMarker: Array[Byte]): Array[Byte] = {
def avroMap(map: Map[String, ByteBuffer]): Array[Byte] = {
val mapBytes = map.flatMap {
case (k, vBuff) =>
val v = vBuff.array()
val byteStr = k.getBytes()
Varint.encodeLong(byteStr.length) ++ byteStr ++ Varint.encodeLong(v.length) ++ v
}
Varint.encodeLong(map.size.toLong) ++ mapBytes ++ Varint.encodeLong(0)
}
val schemaBytes = schema.toString.getBytes
val schemaBuffer = ByteBuffer.allocate(schemaBytes.length).put(schemaBytes)
schemaBuffer.position(0)
val metadata = Map("avro.schema" -> schemaBuffer)
magic.array() ++ avroMap(metadata) ++ syncMarker
}
def block(binaryRecords: Seq[Array[Byte]], syncMarker: Array[Byte]): Array[Byte] = {
val countBytes = Varint.encodeLong(binaryRecords.length.toLong)
val sizeBytes = Varint.encodeLong(binaryRecords.foldLeft(0)(_+_.length).toLong)
val buff: ArrayBuffer[Byte] = new scala.collection.mutable.ArrayBuffer[Byte]()
buff.append(countBytes:_*)
buff.append(sizeBytes:_*)
binaryRecords.foreach { rec =>
buff.append(rec:_*)
}
buff.append(syncMarker:_*)
buff.toArray
}
def encodeBlock[T](schema: Schema, records: Seq[GenericRecord], syncMarker: Array[Byte]): Array[Byte] = {
//block(records.map(encodeRecord(schema, _)), syncMarker)
val writer = new GenericDatumWriter[GenericRecord](schema)
val out = new ByteArrayOutputStream()
val binaryEncoder = EncoderFactory.get().binaryEncoder(out, null)
records.foreach(record => writer.write(record, binaryEncoder))
binaryEncoder.flush()
val flattenedRecords = out.toByteArray
out.close()
val buff: ArrayBuffer[Byte] = new scala.collection.mutable.ArrayBuffer[Byte]()
val countBytes = Varint.encodeLong(records.length.toLong)
val sizeBytes = Varint.encodeLong(flattenedRecords.length.toLong)
buff.append(countBytes:_*)
buff.append(sizeBytes:_*)
buff.append(flattenedRecords:_*)
buff.append(syncMarker:_*)
buff.toArray
}
def encodeRecord[R <: GeneratedMessage with com.nitro.scalaAvro.runtime.Message[R]: ClassTag](
entity: R
): Array[Byte] =
encodeRecord(entity.companion.schema, entity.toMutable)
def encodeRecord(schema: Schema, record: GenericRecord): Array[Byte] = {
val writer = new GenericDatumWriter[GenericRecord](schema)
val out = new ByteArrayOutputStream()
val binaryEncoder = EncoderFactory.get().binaryEncoder(out, null)
writer.write(record, binaryEncoder)
binaryEncoder.flush()
val bytes = out.toByteArray
out.close()
bytes
}
}
/**
* Encoding of integers with variable-length encoding.
*
* The avro specification uses a variable length encoding for integers and longs.
* If the most significant bit in a integer or long byte is 0 then it knows that no
* more bytes are needed, if the most significant bit is 1 then it knows that at least one
* more byte is needed. In signed ints and longs the most significant bit is traditionally
* used to represent the sign of the integer or long, but for us it's used to encode whether
* more bytes are needed. To get around this limitation we zig-zag through whole numbers such that
* negatives are odd numbers and positives are even numbers:
*
* i.e. -1, -2, -3 would be encoded as 1, 3, 5, and so on
* while 1, 2, 3 would be encoded as 2, 4, 6, and so on.
*
* More information is available in the avro specification here:
* @see http://lucene.apache.org/core/3_5_0/fileformats.html#VInt
* https://developers.google.com/protocol-buffers/docs/encoding?csw=1#types
*/
object Varint {
import scala.collection.mutable
def encodeLong(longVal: Long): Array[Byte] = {
val buff = new ArrayBuffer[Byte]()
Varint.zigZagSignedLong(longVal, buff)
buff.toArray[Byte]
}
def encodeInt(intVal: Int): Array[Byte] = {
val buff = new ArrayBuffer[Byte]()
Varint.zigZagSignedInt(intVal, buff)
buff.toArray[Byte]
}
def zigZagSignedLong[T <: mutable.Buffer[Byte]](x: Long, dest: T): Unit = {
// sign to even/odd mapping: http://code.google.com/apis/protocolbuffers/docs/encoding.html#types
writeUnsignedLong((x << 1) ^ (x >> 63), dest)
}
def writeUnsignedLong[T <: mutable.Buffer[Byte]](v: Long, dest: T): Unit = {
var x = v
while ((x & 0xFFFFFFFFFFFFFF80L) != 0L) {
dest += ((x & 0x7F) | 0x80).toByte
x >>>= 7
}
dest += (x & 0x7F).toByte
}
def zigZagSignedInt[T <: mutable.Buffer[Byte]](x: Int, dest: T): Unit = {
writeUnsignedInt((x << 1) ^ (x >> 31), dest)
}
def writeUnsignedInt[T <: mutable.Buffer[Byte]](v: Int, dest: T): Unit = {
var x = v
while ((x & 0xFFFFFF80) != 0L) { // loop while any bit above the low 7 bits is still set
dest += ((x & 0x7F) | 0x80).toByte
x >>>= 7
}
dest += (x & 0x7F).toByte
}
}
I am developing an interface that takes as input an encrypted byte stream -- probably a very large one -- and generates output of more or less the same format.
The input format is this:
{N byte envelope}
- encryption key IDs &c.
{X byte encrypted body}
The output format is the same.
Here's the usual use case (heavily pseudocoded, of course):
Message incomingMessage = new Message (inputStream);
ProcessingResults results = process (incomingMessage);
MessageEnvelope messageEnvelope = new MessageEnvelope ();
// set message encryption options &c. ...
Message outgoingMessage = new Message ();
outgoingMessage.setEnvelope (messageEnvelope);
writeProcessingResults (results, outgoingMessage);
outgoingMessage.writeToOutput (outputStream);
To me, it seems to make sense to use the same object to encapsulate this behaviour, but I'm at a bit of a loss as to how I should go about this. It isn't practical to load all of the encrypted body in at a time; I need to be able to stream it (so, I'll be using some kind of input stream filter to decrypt it) but at the same time I need to be able to write out new instances of this object. What's a good approach to making this work? What should Message look like internally?
I wouldn't create one class to handle both input and output - one class, one responsibility. I would use two filter streams instead, one for input/decryption and one for output/encryption:
InputStream decrypted = new DecryptingStream(inputStream, decryptionParameters);
...
OutputStream encrypted = new EncryptingStream(outputStream, encryptionOptions);
They may have something like a lazy init mechanism that reads the envelope before the first read() call / writes the envelope before the first write() call. You can also use classes like Message or MessageEnvelope inside the filter implementations, but they can stay package-protected, non-API classes.
The processing then knows nothing about encryption or decryption; it just works on a stream. You can also use both streams at the same time during processing, streaming the processing input and output.
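A rough sketch of the input side of this idea, where Envelope.readFrom and envelope.cipher() are placeholders for whatever the envelope parsing and key lookup actually look like:

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import javax.crypto.CipherInputStream;

class DecryptingStream extends FilterInputStream {
    private boolean initialized;

    DecryptingStream(InputStream in) {
        super(in);
    }

    // Lazy init: consume the envelope on the first read, then decrypt the rest on the fly.
    private void init() throws IOException {
        if (!initialized) {
            Envelope envelope = Envelope.readFrom(in);           // placeholder: parse the N-byte envelope
            in = new CipherInputStream(in, envelope.cipher());   // placeholder: cipher built from the key IDs
            initialized = true;
        }
    }

    @Override
    public int read() throws IOException {
        init();
        return in.read();
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        init();
        return in.read(b, off, len);
    }
}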
Can you split the body at arbitrary locations?
If so, I would have two threads, input thread and output thread and have a concurrent queue of strings that the output thread monitors. Something like:
ConcurrentLinkedQueue<String> outputQueue = new ConcurrentLinkedQueue<String>();
...

private void readInput(BufferedReader reader) throws IOException {
    String str;
    while ((str = reader.readLine()) != null) {
        outputQueue.offer(processStream(str)); // ConcurrentLinkedQueue has offer/add rather than put
    }
}

private String processStream(String input) {
    // do something
    return output;
}

private void writeOutput(Writer out) throws IOException, InterruptedException {
    while (true) {
        while (outputQueue.peek() == null) {
            Thread.sleep(100);
        }
        String msg = outputQueue.poll();
        out.write(msg);
    }
}
Note: This will definitely not work as-is. Just a suggestion of a design. Someone is welcome to edit this.
If you need to read and write at the same time you either have to use threads (different threads for reading and writing) or asynchronous I/O (the java.nio package). Using input and output streams from different threads is not a problem.
If you want to make a streaming API in Java, you should usually provide an InputStream for reading and an OutputStream for writing. This way they can be passed to other APIs, so that you can chain things and the data stays a stream all the way through.
Input example:
Message message = new Message(inputStream);
results = process(message.getInputStream());
Output example:
Message message = new Message(outputStream);
writeContent(message.getOutputStream());
The Message needs to wrap the given streams with classes that do the needed encryption and decryption.
Note that reading multiple messages at the same time or writing multiple messages at the same time would need support from the protocol too. You need to get the synchronization correct.
You should check the Wikipedia article on block cipher modes that support encryption of streams. Different encryption algorithms may support only a subset of these.
Buffered streams will allow you to read, encrypt/decrypt and write in a loop.
Examples demonstrating ZipInputStream and ZipOutputStream could provide some guidance on how you may solve this. See example.
What you need are cipher streams (CipherInputStream and CipherOutputStream). Here is an example of how to use them.
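Roughly, the idea looks like this (the algorithm choice and the key/IV handling here are placeholders; a real implementation needs to manage those according to the envelope format):

import java.io.InputStream;
import java.io.OutputStream;
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.CipherOutputStream;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class CipherStreams {
    // Wraps an input stream so reads return decrypted bytes.
    public static InputStream decrypting(InputStream in, byte[] key, byte[] iv) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
        return new CipherInputStream(in, cipher);
    }

    // Wraps an output stream so writes are encrypted before hitting the underlying stream.
    public static OutputStream encrypting(OutputStream out, byte[] key, byte[] iv) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
        return new CipherOutputStream(out, cipher);
    }
}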
I agree with Arne: the data processor shouldn't know about encryption; it just needs to read the decrypted body of the message and write out the results, and stream filters should take care of encryption. However, since this is logically operating on the same piece of information (a Message), I think they should be packaged inside one class which handles the message format, although the encryption/decryption streams are indeed independent from this.
Here's my idea for the structure, flipping the architecture around somewhat, and moving the Message class outside the encryption streams:
class Message {
    InputStream input;
    Envelope envelope;

    public Message(InputStream input) {
        assert input != null;
        this.input = input;
    }

    public Message(Envelope envelope) {
        assert envelope != null;
        this.envelope = envelope;
    }

    public Envelope getEnvelope() {
        if (envelope == null && input != null) {
            // Read envelope from beginning of stream
            envelope = new Envelope(input);
        }
        return envelope;
    }

    public InputStream read() {
        assert input != null;
        // Initialise the decryption stream
        return new DecryptingStream(input, getEnvelope().getEncryptionParameters());
    }

    public OutputStream write(OutputStream output) {
        // Write envelope header to output stream
        getEnvelope().write(output);
        // Initialise the encryption
        return new EncryptingStream(output, getEnvelope().getEncryptionParameters());
    }
}
Now you can use it by creating a new message for the input, and one for the output:
OutputStream output; // This is the stream for sending the message
Message inputMessage = new Message(input);
Message outputMessage = new Message(inputMessage.getEnvelope());
process(inputMessage.read(), outputMessage.write(output));
Now the process method just needs to read chunks of data as required from the input, and write results to the output:
public void process(InputStream input, OutputStream output) throws IOException {
    byte[] buffer = new byte[1024];
    int read;
    while ((read = input.read(buffer)) > 0) {
        // Process buffer, writing to output as you go.
    }
}
This all now works in lockstep, and you don't need any extra threads. You can also abort early without having to process the whole message (if the output stream is closed for example).