Is there a BSON serializer/deserializer library out there for PHP or Java?
Another possibility is the bson4jackson extension for Jackson, which adds support for reading and writing BSON.
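If you already use Jackson, bson4jackson plugs in as a factory for a regular ObjectMapper. A minimal sketch (assuming the de.undercouch:bson4jackson artifact is on the classpath; the Map payload is just for illustration):

import java.util.Collections;
import java.util.Map;

import com.fasterxml.jackson.databind.ObjectMapper;
import de.undercouch.bson4jackson.BsonFactory;

// A regular Jackson ObjectMapper, but reading/writing BSON instead of JSON.
ObjectMapper mapper = new ObjectMapper(new BsonFactory());

byte[] bson = mapper.writeValueAsBytes(Collections.singletonMap("hello", "world"));
Map<?, ?> back = mapper.readValue(bson, Map.class);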
BSON encoder/decoder in Java is pretty trivial. The following snippet of code is from my app, so it's in Scala. I am sure you could build a Java implementation from it easily.
import java.io.InputStream

import org.bson.BSON
import com.mongodb.{DBObject, DBDecoder, DefaultDBDecoder}

def convert(dbo: DBObject): Array[Byte] =
  BSON.encode(dbo)

// NB! this is a stateful object and thus it's not thread-safe, so have
// to create one per decoding
def decoder: DBDecoder = DefaultDBDecoder.FACTORY.create

def convert(data: Array[Byte]): DBObject =
  // NOTE: we do not support Ref in input, that's why "null" for DBCollection
  decoder.decode(data, null)

def convert(is: InputStream): DBObject =
  // NOTE: we do not support Ref in input, that's why "null" for DBCollection
  decoder.decode(is, null)
The only significant note is that the DBDecoder instance has internal state it (re)uses during decoding, so it's not thread-safe. It should be fine if you decode objects one by one, but otherwise you'd better create an instance per decoding session.
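For reference, a rough Java port of the Scala snippet above (a sketch only; it assumes the legacy com.mongodb driver classes are available):

import java.io.IOException;
import java.io.InputStream;

import com.mongodb.DBDecoder;
import com.mongodb.DBObject;
import com.mongodb.DefaultDBDecoder;
import org.bson.BSON;

class BsonConverter {

    static byte[] encode(DBObject dbo) {
        return BSON.encode(dbo);
    }

    // the decoder keeps internal state, so create one per decoding
    static DBObject decode(byte[] data) {
        DBDecoder decoder = DefaultDBDecoder.FACTORY.create();
        return decoder.decode(data, null); // null DBCollection: DBRefs not supported
    }

    static DBObject decode(InputStream is) throws IOException {
        DBDecoder decoder = DefaultDBDecoder.FACTORY.create();
        return decoder.decode(is, null);
    }
}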
Check this link:
http://php.net/manual/en/ref.mongo.php
bson_decode — Deserializes a BSON object into a PHP array
bson_encode — Serializes a PHP variable into a BSON string
You might check the MongoDB drivers for those languages, since MongoDB uses BSON. See what they use, or steal their implementation.
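For example, the standalone BSON classes shipped with the MongoDB Java driver can be used without any database connection. A minimal sketch (assuming the org.mongodb BSON classes are available):

import org.bson.BSONObject;
import org.bson.BasicBSONDecoder;
import org.bson.BasicBSONEncoder;
import org.bson.BasicBSONObject;

// Encode a document to BSON bytes and decode it back, no database involved.
BasicBSONEncoder encoder = new BasicBSONEncoder();
byte[] bytes = encoder.encode(new BasicBSONObject("hello", "world"));

BasicBSONDecoder decoder = new BasicBSONDecoder();
BSONObject decoded = decoder.readObject(bytes);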
And here is a C++11 JSON encoder and decoder I've made using Rapidjson, because the native JSON encoder (BSONObj::jsonString) uses a non-standard encoding for longs: https://gist.github.com/ArtemGr/2c44cb451dc6a0cb46af
Also, unlike the native JSON encoder, this one doesn't have a problem decoding top-level arrays.
I am building a tool in an Apache Beam pipeline which will ingest lots of different types of data (different schemas, different file types, etc.) and will output the results as Avro files. Because there are many different types of output schemas, I'm using GenericRecords to write the Avro data. These GenericRecords include schemas generated during ingestion for each unique file/schema layout. In general, I have been using the built-in Avro Schema class to handle these.
I tried using DecoderFactory to convert the JSON data to Avro:
DecoderFactory decoderFactory = new DecoderFactory();
Decoder decoder = decoderFactory.jsonDecoder(schema, content);
DatumReader<GenericData.Record> reader = new GenericDatumReader<>(schema);
return reader.read(null, decoder);
This works just fine for the most part, except when the schema has nullable fields. The data is read in from a JSON format that does not include typed fields, but the generated Schema knows whether each field is nullable, required, etc. This produces a problem when the data is written to Avro:
If I have a nullable record that looks like this:
{"someField": "someValue"}
Avro is expecting the JSON data to look like this:
{"someField": {"string": "someValue"}}. This presents a problem anytime this combination appears (which is very frequent).
One possible solution raised was to use an AvroMapper. I laid it out like it shows on that page, created the Schema object as an AvroSchema, and packaged the data into a byte array with the schema using AvroMapper.writer():
static GenericRecord convertJsonToGenericRecord(String content, Schema schema)
    throws IOException {
  JsonNode node = ObjectMappers.defaultObjectMapper().readTree(content);
  AvroSchema avroSchema = new AvroSchema(schema);
  byte[] avroData =
      mapper                       // mapper is an AvroMapper
          .writer(avroSchema)
          .writeValueAsBytes(node);
  return mapper.readValue(avroData, GenericRecord.class);
}
This may get around the typing problem with nullable records, but it is still giving me issues: the reader does not recognize any AvroSchema for the byte array I'm passing in (avroData). Here is the stack trace:
com.fasterxml.jackson.core.JsonParseException: No AvroSchema set, can not parse
at com.fasterxml.jackson.dataformat.avro.deser.MissingReader._checkSchemaSet(MissingReader.java:68)
at com.fasterxml.jackson.dataformat.avro.deser.MissingReader.nextToken(MissingReader.java:41)
at com.fasterxml.jackson.dataformat.avro.deser.AvroParserImpl.nextToken(AvroParserImpl.java:97)
at com.fasterxml.jackson.databind.ObjectMapper._initForReading(ObjectMapper.java:4762)
at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:4668)
at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3691)
When I checked the avroData byte array just to see what it looked like, it did not include anything other than the actual value I passed into it. It didn't include the schema, and it didn't even include the header or key. For the test, I'm using a single K/V pair as in the example above, and all I got back was the value.
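That matches how jackson-dataformat-avro behaves: the encoded bytes carry no schema, so the schema has to be attached to the reader as well as the writer. A minimal sketch of the read side (reading into a JsonNode here, since GenericRecord is an interface that Jackson will not instantiate out of the box; whether that fits the rest of the pipeline is an open question):

import java.io.IOException;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.dataformat.avro.AvroMapper;
import com.fasterxml.jackson.dataformat.avro.AvroSchema;
import org.apache.avro.Schema;

// The byte[] written above is schema-less, so attach the schema to the reader too.
static JsonNode readBack(byte[] avroData, Schema schema) throws IOException {
    AvroMapper avroMapper = new AvroMapper();
    AvroSchema avroSchema = new AvroSchema(schema);
    return avroMapper
        .readerFor(JsonNode.class)
        .with(avroSchema)
        .readValue(avroData);
}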
An alternative route that I may pursue if this doesn't work is to manually format the JSON data as it comes in, but this is messy, and will require lots of recursion. I'm 99% sure that I can get it working that way, but would love to avoid it if at all possible.
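For what it's worth, a rough sketch of that recursive reformatting (wrapping non-null values of union-typed fields the way the Avro JSON encoding expects; arrays and maps are omitted, and the helper name is made up):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.node.JsonNodeFactory;
import com.fasterxml.jackson.databind.node.ObjectNode;
import org.apache.avro.Schema;

// Rewrites {"someField": "someValue"} into {"someField": {"string": "someValue"}}
// for fields whose schema is a union like ["null", "string"].
static JsonNode wrapUnions(JsonNode node, Schema schema) {
    if (schema.getType() == Schema.Type.RECORD && node.isObject()) {
        ObjectNode obj = (ObjectNode) node;
        for (Schema.Field field : schema.getFields()) {
            JsonNode value = obj.get(field.name());
            if (value == null || value.isNull()) {
                continue; // nulls stay plain null in Avro's JSON encoding
            }
            Schema fieldSchema = field.schema();
            if (fieldSchema.getType() == Schema.Type.UNION) {
                // pick the first non-null branch and wrap the value under its type name
                for (Schema branch : fieldSchema.getTypes()) {
                    if (branch.getType() != Schema.Type.NULL) {
                        ObjectNode wrapped = JsonNodeFactory.instance.objectNode();
                        wrapped.set(branch.getFullName(), wrapUnions(value, branch));
                        obj.set(field.name(), wrapped);
                        break;
                    }
                }
            } else {
                obj.set(field.name(), wrapUnions(value, fieldSchema));
            }
        }
    }
    return node;
}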
To reiterate, what I'm trying to do is package incoming JSON-formatted data (string, byte array, node, whatever) with an Avro Schema to create GenericRecords which will be output to .avro files. I need to find a way to ingest the data and Schema such that it will allow for nullable records to be untyped in the JSON-string.
Thank you for your time, and don't hesitate to ask clarifying questions.
What is the best way to extend org.teiid.translator.ws to connect to a web service that returns JSONP (whose media type is usually application/javascript)? The existing ws translator can read only JSON or XML. In general, was the translator designed to facilitate the injection of transformation logic to handle any response structure/format (e.g., JSONP, plain text, HTML, etc.)?
For JSONP, I am leaning towards creating my own implementation of org.teiid.core.types.InputStreamFactory, say com.acme.JsonpToJsonInputStreamFactory, in which I define my own JsonpToJsonReaderInputStream (extending ReaderInputStream) that skips the leading
randomFunctionName(
and trailing
)
of a JSONP payload, and modify ClobInputStreamFactory.getInputStream to return that, instead of ReaderInputStream. Then I replace both instances of
ds = new InputStreamFactory.ClobInputStreamFactory(...);
in translator-ws-jsonp.BinaryWSProcedureExecution (where translator-ws-jsonp is based on translator-ws) with
ds = new JsonpToJsonInputStreamFactory.ClobInputStreamFactory(...);
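A minimal sketch of the unwrapping step described above (it buffers the whole payload for simplicity, whereas a streaming ReaderInputStream as outlined would avoid that; names are illustrative, not Teiid API):

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Strips the JSONP padding "someCallback( ... )" so that what remains is plain JSON.
static Reader jsonpToJson(Reader jsonp) throws IOException {
    StringBuilder sb = new StringBuilder();
    char[] buf = new char[4096];
    for (int n; (n = jsonp.read(buf)) != -1; ) {
        sb.append(buf, 0, n);
    }
    String s = sb.toString().trim();
    int open = s.indexOf('(');
    int close = s.lastIndexOf(')');
    if (open >= 0 && close > open) {
        s = s.substring(open + 1, close); // drop callback name and parentheses
    }
    return new StringReader(s);
}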
The WS translator returns the results in Blob form; how you unpack the results is up to you. IMO, you do not really need to build another translator.
Currently, the typical use case in JDV is to read the blob and use the JSONTOXML function to convert it into XML, so that the results can then be parsed into a tabular structure using constructs like XMLTABLE. So you can write a UDF like JSONPTOJSON that does the unwrapping you describe, then use JSONTOXML(JSONPTOJSON(blob)) as input to XMLTABLE.
I have a scenario where I need to convert messages present as JSON objects to Apache Parquet format using Java. Any sample code or examples would be helpful. From what I have found, Hive, Pig, or Spark are typically used to convert messages to Parquet. I need to convert to Parquet using only Java, without involving these.
To convert JSON data files to Parquet, you need some in-memory representation. Parquet doesn't have its own set of Java objects; instead, it reuses the objects from other formats, like Avro and Thrift. The idea is that Parquet works natively with the objects your applications probably already use.
To convert your JSON, you need to convert the records to Avro in-memory objects and pass those to Parquet, but you don't need to convert a file to Avro and then to Parquet.
Conversion to Avro objects is already done for you; see Kite's JsonUtil, which is ready to use as a file reader. The conversion method needs an Avro schema, but you can use that same library to infer an Avro schema from JSON data.
To write those records, you just need to use AvroParquetWriter. The whole setup looks like this:
Schema jsonSchema = JsonUtil.inferSchema(fs.open(source), "RecordName", 20);

try (JSONFileReader<Record> reader = new JSONFileReader<>(
    fs.open(source), jsonSchema, Record.class)) {
  reader.initialize();

  try (ParquetWriter<Record> writer = AvroParquetWriter
      .<Record>builder(outputPath)
      .withConf(new Configuration())
      .withCompressionCodec(CompressionCodecName.SNAPPY)
      .withSchema(jsonSchema)
      .build()) {
    for (Record record : reader) {
      writer.write(record);
    }
  }
}
I had the same problem, and what I understood is that there are not many samples available for writing Parquet without using Avro or other frameworks. Finally I went with Avro. :)
Have a look at this; it may help you.
I stored a Java object in HBase. Let's say I have an object 'User' with 3 fields: firstname, middlename, and lastname. I used the following code for serialization in Java:
// SerializationUtils is from Apache Commons Lang
Object object = (Object) user;
byte[] byteData = SerializationUtils.serialize((Serializable) object);
and stored it in HBase by putting the complete object (the byte[] from above) in the value portion of the KeyValue pair.
It is stored in HBase like this (example):
column=container:container, timestamp=1480016194005, value=\xAC\xED\x00\x05sr\x00&com.test.container\x07\x89\x83\xFA\x7F\xD0F\xA5\x02\x00\x08I\x00\x07classIdJ\x00\x14dateTimeInLongFormatZ\x00\x04rootZ\x00\x09undefinedL\x00\x03keyt\x00\x12Ljava/lang/String;L\x00\x04modeq\x00~\x00\x01L\x00\x04nameq\x00~\x00\x01L\x00\x06userIdq\x00~\x00\x01xp\x00\x00\x00\x02\x00\x00\x01X\x967\xBA\xF0\x00\x00t\x00\x1Econtainer_393_5639181044834024t\x00\x06expandt\x00\x02ert\x00\x08testadmin
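For reference, the store step described above typically looks something like this with the HBase Java client (the row key and the table variable are illustrative; byteData is the serialized object from above):

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Put the serialized object into the container:container column of a row.
Put put = new Put(Bytes.toBytes("someRowKey"));
put.addColumn(Bytes.toBytes("container"), Bytes.toBytes("container"), byteData);
table.put(put);  // table is an org.apache.hadoop.hbase.client.Table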
When I try to retrieve the data, I use the following deserialization in Java to convert it back to a readable object:
object = SerializationUtils.deserialize(bytes);
I would like to retrieve the data stored in Java format via HappyBase using Python. I achieved that, and received the data exactly as it is stored in HBase, i.e. the same raw serialized bytes shown in the example above.
Is there a way to deserialize the Java object via Python?
Thanks Much
Hari
There is a Python library for that:
https://pypi.python.org/pypi/javaobj-py3/
Usage seems pretty easy with:
import javaobj

# read the serialized Java object from disk, then deserialize it
with open("obj5.ser", "rb") as fd:
    jobj = fd.read()

pobj = javaobj.loads(jobj)
print(pobj)