Using LARGE JSON Data Source with Apache Flink and Python (2 questions) - java

Flink's Batch Python API (version 1.2) supports:
read_text(path)
read_csv(path, type)
from_elements(*args)
and some specific data types: int, float, bool, string, and byte arrays.
Python is pretty strong with dictionaries, and reading a JSON file through json.load(file) creates a dict variable, but as far as I have searched, dictionary manipulation isn't supported by Flink at the moment.
Question 1: So, besides parsing strings and trying to mess with regular expressions, is there any other way to handle large JSON files with Flink's Python API?
I was thinking of preprocessing my JSON file into lists and combining that with from_elements(*args) to achieve something.
Handling objects also seems to be a problem with Python + Flink.
Question 2: Is switching to Java my best bet?
I would like an easy way to handle JSON data, as well as to use objects derived from that data.
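If you do go the Java route, here is a minimal sketch of what it could look like, assuming a JSON Lines input (one JSON object per line) and Jackson for parsing; the input path and the "name" field are placeholders for this example:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class JsonLinesJob {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Placeholder path; assumes one JSON object per line ("JSON Lines").
        DataSet<String> lines = env.readTextFile("file:///tmp/input.jsonl");

        DataSet<String> names = lines.map(new MapFunction<String, String>() {
            @Override
            public String map(String line) throws Exception {
                // Created per call for brevity; reuse one mapper per task in real code.
                ObjectMapper mapper = new ObjectMapper();
                JsonNode node = mapper.readTree(line);
                // "name" is an assumed field used only for illustration.
                return node.get("name").asText();
            }
        });

        names.print();
    }
}

Jackson can also bind each line straight to a POJO via ObjectMapper.readValue, which covers the object-handling part of the question.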

Related

High performance JSON text parsing and storing in SQLite

I'm working on some refactoring of my Android (Kotlin/Java) app to greatly improve the performance of an initial data synchronization that is done with our back-end systems and stored in a local SQLite db. The app is used on all kinds of Android devices, and this sync can take up to hours on the older ones.
The back-end system uses a JSON (UTF-8) API with around 10,000 items per batch and a lot of strings.
To achieve the highest performance possible I think I have to find a way to use/parse Strings more efficiently. With just a normal JSON parser and the Android SQLiteStatement classes I can only do this:
Parse the received JSON response (an in-memory byte[]) into objects with their Strings.
These Strings are backed by new char[] arrays, and the bytes are first copied and converted to UTF-16 into the char[], effectively doubling the memory needed for a String.
The SQLite db uses UTF-8 encoding. So binding a string to its statement also involves some conversion steps.
I already implemented some ideas, but I still have some problems:
Instead of parsing into String objects, I can use index-overlay parsing: parse these Strings into an object that holds a reference to the original byte[] buffer plus an offset and length. The SQLiteStatement class allows binding a byte[] as a BLOB, effectively inserting UTF-8 bytes directly into SQLite.
This approach is already much faster, but there is still some memory copying involved (see the sketch after this list of ideas). A neater approach would be if SQLiteStatement allowed binding the original byte[] buffer with an offset and length. But this class is final...
Another idea was to subclass String and let this class be backed by the original byte[] buffer, offset & length. But the String class is also final...
Implementing some CharSequence sounds like a neat approach, but SQLiteStatement does not have a method to bind that type...
Binary serdes does not greatly improve performance because of all the strings.
So I was wondering if you have some ideas on how to reduce the object allocation and memory copying?
Can the Unsafe package be of any help here (a proxy String)?
Another option is to copy the android.sqlite package and create my own SQLiteStatement with support for byte[]/offset/length or CharSequence.
Any other ideas?
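For what it's worth, here is a minimal sketch of the BLOB-binding idea from the list above, assuming a hypothetical items(id, name) table; the Arrays.copyOfRange call is the one copy that the final SQLiteStatement class still forces:

import android.database.sqlite.SQLiteDatabase;
import android.database.sqlite.SQLiteStatement;
import java.util.Arrays;

public class BlobInsert {
    // Binds a slice of the original UTF-8 response buffer directly as a BLOB,
    // skipping the byte[] -> String (UTF-16) conversion entirely.
    static void insertItem(SQLiteDatabase db, byte[] response, int offset, int length, long id) {
        SQLiteStatement stmt = db.compileStatement("INSERT INTO items (id, name) VALUES (?, ?)");
        try {
            stmt.bindLong(1, id);
            // The one remaining copy: bindBlob only accepts a whole byte[].
            byte[] slice = Arrays.copyOfRange(response, offset, offset + length);
            stmt.bindBlob(2, slice);
            stmt.executeInsert();
        } finally {
            stmt.close();
        }
    }
}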

Creating a parquet file on AWS Lambda function

I'm receiving a set of (1 MB) CSV/JSON files on S3 that I would like to convert to Parquet. I was expecting to be able to convert these files easily to Parquet using a Lambda function.
After looking on Google I didn't find a solution to this that doesn't involve some sort of Hadoop.
Since this is just a file conversion, I can't believe there is no easy solution for this. Does someone have some Java/Scala sample code to do this conversion?
If your input JSON files are not large (< 64 MB, beyond which Lambda is likely to hit memory caps) and either have simple data types or you are willing to flatten the structs, you might consider using pyarrow, even though the route is slightly convoluted.
It involves using pandas:
df = pd.read_json('file.json')
followed by converting the DataFrame into a Parquet file:
table = pa.Table.from_pandas(df)
pa.parquet.write_table(table, 'file.parquet')
The above example auto-infers the data types. You can override this by using the dtype argument while loading the JSON. Its only major shortcoming is that pyarrow only supports string, bool, float, int, date, time, decimal, list, array.
Update (a more generic solution):
Consider using json2parquet.
However, if the input data has nested dictionaries, it first needs to be flattened, i.e. convert:
{a: {b: {c: d}}} to {a.b.c: d}
Then, this data needs to be ingested as a pyarrow record batch with json2parquet:
pa_batch = j2p.ingest_data(data)
The batch can now be loaded as a PyArrow Table:
df = pa.Table.from_batches([pa_batch])
and written out as a Parquet file:
pa.parquet.write_table(df, 'file.parquet')

Load a Perl Hash into Java

I have a big .pm file, which consists only of a very big Perl hash with lots of sub-hashes. I have to load this hash into a Java program, do some work and make changes to the underlying data, and save it back into a .pm file, which should look similar to the one I started with.
So far, I have tried to convert it line by line with regexes and string matching, converting it into an XML document and later parsing it element-wise back into a Perl hash.
This somehow works, but seems quite dodgy. Is there a more reliable way to parse the Perl hash without having a Perl runtime installed?
You're quite right, it's utterly filthy. Regex and string matching to produce XML is a horrible idea in the first place, and honestly XML is probably not a good fit for this anyway.
I would suggest that you consider JSON. I would be stunned to find that Java can't handle JSON, and JSON is inherently a hash-and-array-oriented data structure.
So you can quite literally:
use JSON;
print to_json ( $data_structure, { pretty => 1 } );
Note: it won't work for serialising objects, but for Perl hash/array/scalar type structures it'll work just fine.
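On the Java side, a minimal sketch (my addition, not part of the original answer) using Jackson to load the dumped JSON into plain Maps and Lists; "data.json" is a placeholder filename:

import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.File;
import java.util.Map;

public class PerlHashBridge {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // Perl hashes become Maps, arrays become Lists, scalars become String/Number/Boolean.
        Map<String, Object> data = mapper.readValue(new File("data.json"), Map.class);
        // ... inspect or modify the nested structure here ...
        // Write it back out as JSON so Perl can read it again.
        mapper.writerWithDefaultPrettyPrinter().writeValue(new File("data.json"), data);
    }
}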
You can then import it back into perl using:
my $new_data = from_json $string;
print Dumper $new_data;
You could Dumper it to a file, but given your requirement is multi-language going forward, just using native JSON as your 'at rest' data is probably the more sensible choice.
But if you're looking at parsing Perl code within Java, without a Perl interpreter? No, that's just insanity.

Java Serialization to transfer data between any language

Question:
Instead of writing my own serialization algorithm, would it be possible to just use the built-in Java serialization, like I have done below, while still having it work across multiple languages?
Explanation:
How I imagine it working would be as follows: I start up a process that will be a language-specific program, written in that language. So I'd have a CppExecutor.exe file, for example. I would write data to a stream to this program. The program would then do what it needs to do, then return a result.
To do this, I would need to serialize the data in some way. The first thing that came to mind was the basic Java Serialization with the use of an ObjectInputStream and ObjectOutputStream. Most of what I have read has only stated that the Java serialization is Java-to-Java applications.
None of the data will ever need to be stored in a file. The method of transferring these packets would be through a java.lang.Process, which I have set up already.
The data will be composed of the following:
String - Mostly containing information that is displayed to the user.
Integer - most likely 32-bit. Won't need to deal with times.
Float - just to handle all floating-point values.
Character - to ensure proper types are used.
Array - Composed of any of the elements in this list.
The best way I have worked out how to do this is as follows: I would start with a 4-byte magic number, just to ensure we are working with the correct data. Following that, I would have an integer specifying how many elements there are. After that, for each of the elements I would have: a single byte signifying the data type (from the list above), followed by any crucial information, e.g. the length for a String or Array, and then the data itself.
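For illustration only, a rough sketch of that layout using DataOutputStream; the magic value, type tags, and the two-element packet are invented for the example:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class PacketWriter {
    private static final int MAGIC = 0xCAFED00D; // arbitrary illustrative value
    private static final byte TYPE_STRING = 1;
    private static final byte TYPE_INT = 2;

    // Encodes one String and one int using the layout described above:
    // magic, element count, then (type tag, [length], payload) per element.
    public static byte[] encode(String message, int value) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);

        out.writeInt(MAGIC);  // 4-byte magic number
        out.writeInt(2);      // number of elements that follow

        byte[] utf8 = message.getBytes(StandardCharsets.UTF_8);
        out.writeByte(TYPE_STRING);
        out.writeInt(utf8.length); // length prefix, since Strings are variable-sized
        out.write(utf8);

        out.writeByte(TYPE_INT);
        out.writeInt(value);       // fixed 4 bytes, no length prefix needed

        out.flush();
        return buf.toByteArray();
    }
}

On the receiving end, a DataInputStream (or its equivalent in the target language) reads the fields back in the same order.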
Side-notes:
I would also like to point out that a lot of these calculations will be taking place, where every millisecond could matter. Because of this, a text-based format (such as JSON) may produce far larger operation times. Considering that none of the packets would need to be interpreted by a human, using only bytes wouldn't be an issue.
I'd recommend Google protobuf: it is binary, stable, proven, and has bindings for all languages you've mentioned. Moreover, it also handles structured data nicely.
There is a binary JSON format called BSON.
I would also like to point out that a lot of these calculations will be taking place, so a text-based format (such as JSON) may produce far larger operation times.
Do not optimize before you have measured.
Premature optimization is the root of all evil.
Can you give it a try and benchmark the throughput to see if it fits your needs?
A rough comparison of Thrift, Protobuf, BSON, JSON, and MessagePack:
Complexity of installation: Thrift >> Protobuf > BSON > MessagePack > JSON
Serialized data size: JSON > MessagePack > Binary Thrift > Compact Thrift > Protobuf
Time cost: Compact Thrift > Binary Thrift > Protobuf > JSON > MessagePack

Incremental streaming JSON library for Java

Can anyone recommend a JSON library for Java which allows me to give it chunks of data as they come in, in a non-blocking fashion? I have read through A better Java JSON library and similar questions, and haven't found precisely what I'd like.
Essentially, what I'd like is a library which allows me to do something like the following:
String jsonString1 = "{ \"A broken";
String jsonString2 = " json object\" : true }";
JSONParser p = new JSONParser(...);
p.parse(jsonString1);
p.isComplete(); // returns false
p.parse(jsonString2);
p.isComplete(); // returns true
Object o = p.getResult();
Notice the actual key name ("A broken json object") is split between pieces.
The closest I've found is this async-json-library which does almost exactly what I'd like, except it cannot recover objects where actual strings or other data values are split between pieces.
There are a few blocking streaming/incremental JSON parsers (as per Is there a streaming API for JSON?); but for async there is nothing yet that I am aware of.
The lib you refer to seems badly named; it does not seem to do real asynchronous processing, but merely allows one to parse a sequence of JSON documents (which multiple other libs allow doing as well).
If there were people who really wanted this, writing one is not impossible -- for XML there is Aalto, and handling JSON is quite a bit simpler than XML.
For what it is worth, there is actually this feature request to add non-blocking parsing mode for Jackson; but very few users have expressed interest in getting that done (via voting for the feature request).
EDIT: (2016-01) while not async, Jackson ObjectMapper allows for convenient sub-tree by sub-tree binding of parts of the stream as well -- see ObjectReader.readValues() (ObjectReader created from ObjectMapper), or short-cut versions of ObjectMapper.readValues(...). Note the trailing s in there, which implies a stream of Objects, not just a single one.
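A minimal sketch of that readValues() usage (my example, assuming the input is a sequence of top-level JSON objects and Jackson 2.6+ for readerFor):

import com.fasterxml.jackson.databind.MappingIterator;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.InputStream;
import java.util.Map;

public class SubTreeBinding {
    // Binds one top-level JSON object at a time instead of loading everything up front.
    public static void process(InputStream in) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        MappingIterator<Map<String, Object>> it =
                mapper.readerFor(Map.class).readValues(in);
        while (it.hasNextValue()) {
            Map<String, Object> obj = it.nextValue();
            // handle one object here
            System.out.println(obj);
        }
        it.close();
    }
}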
Google Gson can incrementally parse Json from an InputStream
https://sites.google.com/site/gson/streaming
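A rough sketch of that streaming usage (my example), assuming the input is a top-level JSON array of objects with a hypothetical "text" field; note that JsonReader still reads from the InputStream as it goes, so it is incremental rather than truly asynchronous:

import com.google.gson.stream.JsonReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class GsonStreamingExample {
    // Pulls tokens one at a time instead of waiting for the whole document.
    public static void readMessages(InputStream in) throws Exception {
        JsonReader reader = new JsonReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        reader.beginArray();                  // assumes a top-level JSON array
        while (reader.hasNext()) {
            reader.beginObject();             // one object in the array
            while (reader.hasNext()) {
                String name = reader.nextName();
                if (name.equals("text")) {    // "text" is an assumed field name
                    System.out.println(reader.nextString());
                } else {
                    reader.skipValue();
                }
            }
            reader.endObject();
        }
        reader.endArray();
        reader.close();
    }
}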
I wrote such a parser: JsonParser.java. See examples of how to use it: JsonParserTest.java.
