I'm receiving a set of small (~1 MB) CSV/JSON files on S3 that I would like to convert to Parquet. I was expecting to be able to convert these files easily to Parquet using a Lambda function.
After searching on Google I didn't find a solution that doesn't involve some sort of Hadoop.
Since this is just a file conversion, I can't believe there is no easy solution for it. Does anyone have some Java/Scala sample code to do this conversion?
If your input JSON files are not large (< 64 MB, beyond which Lambda is likely to hit memory caps) and either have simple data types or you are willing to flatten the structs, you might consider using pyarrow, even though the route is slightly convoluted.
It involves using Pandas:
df = pd.read_json('file.json')
followed by converting the DataFrame into a Parquet file:
table = pa.Table.from_pandas(df)
pa.parquet.write_table(table, 'file.pq')
The example above auto-infers data types. You can override this with the dtype argument when loading the JSON. Its only major shortcoming is that pyarrow only supports string, bool, float, int, date, time, decimal, list, and array.
Update (a more generic solution):
Consider using json2parquet.
However, if the input data has nested dictionaries, it first needs to be flattened, i.e. convert:
{a: {b: {c: d}}} to {a.b.c: d}
Then, this data needs to be ingested as a PyArrow record batch with json2parquet:
pa_batch = j2p.ingest_data(data)
Now the batch can be converted into a PyArrow Table:
table = pa.Table.from_batches([pa_batch])
and written out as a Parquet file:
pa.parquet.write_table(table, 'file.pq')
I have a very long string in a text file. It is basically the string below repeated around 1000 times (as one long string, not 1000 separate strings). The string has variables which change with each repetition (those in bold). I'd like to extract the variables in an automated way and return the output as either a CSV or a formatted txt file (Random Bank, Random Rate, Random Product). I can do this successfully using https://regex101.com, but it involves a lot of manual copy & paste. I'd like to write a bash script to automate extracting the information, but have had no luck with various grep commands. How can this be done? (I'd also consider doing it in Java.)
[{"AccountName":"Random Product","AccountType":"Variable","AccountTypeId":1,"AER":Random Rate,"CanManageByMobileApp":false,"CanManageByPost":true,"CanManageByTelephone":true,"CanManageInBranch":false,"CanManageOnline":true,"CanOpenByMobileApp":false,"CanOpenByPost":false,"CanOpenByTelephone":false,"CanOpenInBranch":false,"CanOpenOnline":true,"Company":"Random Bank","Id":"S9701Monthly","InterestPaidFrequency":"Monthly"
This is JSON-formatted data, which you can't parse with regular expression engines. Get a JSON parser. If this file is larger than, say, 1 GB, find one that lets you 'stream' (the term for parsing the input and handling the data as it is parsed, as opposed to the more usual route of turning the entire input into an object; if the file is huge, that object would be huge and you might run out of memory, hence the need for streaming).
Here is one tutorial for Jackson-streaming.
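For reference, here is a minimal sketch of what that could look like with Jackson's streaming API. It assumes the field names from the sample above (Company, AER, AccountName), that the input is one big array of flat objects each containing a Company field, and made-up file names:

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import java.io.File;
import java.io.PrintWriter;

public class ExtractAccounts {
    public static void main(String[] args) throws Exception {
        JsonFactory factory = new JsonFactory();
        try (JsonParser parser = factory.createParser(new File("input.json"));
             PrintWriter out = new PrintWriter("output.csv")) {
            String company = null, rate = null, product = null;
            JsonToken token;
            while ((token = parser.nextToken()) != null) {
                if (token == JsonToken.FIELD_NAME) {
                    String field = parser.getCurrentName();
                    parser.nextToken();                       // advance to the field's value
                    if ("Company".equals(field))          company = parser.getText();
                    else if ("AER".equals(field))         rate = parser.getText();
                    else if ("AccountName".equals(field)) product = parser.getText();
                } else if (token == JsonToken.END_OBJECT && company != null) {
                    // one account object finished: emit a CSV row and reset
                    out.println(company + "," + rate + "," + product);
                    company = rate = product = null;
                }
            }
        }
    }
}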
Flink's Batch Python API (version 1.2) supports
read_text(path)
read_csv(path, type)
from_elements(*args)
and some specific data types (int, float, bool, string) and byte arrays.
Python is pretty strong with dictionaries, and reading a JSON file through json.load(file) creates a dict-type variable, but as far as I have searched, dictionary manipulation isn't supported by Flink at the moment.
Question num0: So, besides parsing strings and trying to mess with regular expressions, is there any other way to handle large JSON files with Flink's Python API?
I was thinking of preprocessing my JSON file using lists, in combination with from_elements(*args), to achieve something.
Handling Objects seems to be a problem with Python+Flink as well.
Question num1: Is switching to Java my best bet?
I would like to have an easy way to handle json data, as well as using objects derived from these data.
I'm new to Java programming, and I ran into this problem:
I'm creating a program that reads a .csv file, converts its lines into objects and then manipulates these objects.
To be more specific, the application reads every line, gives it an index, and also reads certain values from those lines and stores them in TRIE trees.
The application then can read indexes from the values stored in the trees and then retrieve the full information of the corresponding line.
My problem is that, even though I've been researching for the last couple of days, I don't know how to write these structures to binary files, nor how to read them back.
I want to write the lines (with their indexes) in a binary indexed file and read only the exact index that I retrieved from the TRIEs.
For the tree writing, I was looking for something like this (in C)
fwrite(tree, sizeof(struct TrieTree), 1, file)
For the "binary indexed file", I was thinking on writing objects like the TRIEs, and maybe reading each object until I've read enough to reach the corresponding index, but this probably wouldn't be very efficient.
Recapitulating, I need help in writing and reading objects in binary files and solutions on how to create an indexed file.
I think you are (for starters) best off trying to do this with serialization.
Here is just one example from Stack Overflow: What is object serialization?
(I think copy & pasting the code here does not make sense; please follow the link to read it.)
Admittedly this does not yet solve your index creation problem.
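For illustration only, here is a minimal sketch of how per-record serialization could be combined with a simple offset index. Record is a made-up stand-in for your line class, and the offsets returned by writeAll are what you could store in your TRIEs:

import java.io.*;
import java.util.ArrayList;
import java.util.List;

// Hypothetical record type standing in for one parsed CSV line.
class Record implements Serializable {
    private static final long serialVersionUID = 1L;
    final String line;
    Record(String line) { this.line = line; }
}

public class IndexedRecordFile {

    // Writes each record as (length, serialized bytes) and returns the byte
    // offset of every record; those offsets form the index.
    static List<Long> writeAll(File file, List<Record> records) throws IOException {
        List<Long> offsets = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            for (Record r : records) {
                offsets.add(raf.getFilePointer());
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                try (ObjectOutputStream oos = new ObjectOutputStream(buf)) {
                    oos.writeObject(r);
                }
                byte[] bytes = buf.toByteArray();
                raf.writeInt(bytes.length);   // length prefix
                raf.write(bytes);             // serialized record
            }
        }
        return offsets;
    }

    // Seeks to a previously recorded offset and deserializes a single record.
    static Record readAt(File file, long offset) throws IOException, ClassNotFoundException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            raf.seek(offset);
            byte[] bytes = new byte[raf.readInt()];
            raf.readFully(bytes);
            try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
                return (Record) ois.readObject();
            }
        }
    }
}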
Here is an alternative to Java native serialization, Google Protocol Buffers.
I am going to quote mostly directly from the documentation in this answer, so be sure to follow the link at the end of the answer if you are interested in more details.
What is it:
Protocol buffers are Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler.
In other words, you can serialize your structures in Java and deserialize them in .NET, Python, etc. This is something you don't get with Java's native serialization.
Performance:
This may vary according to the use case, but in principle GPB should be faster, as it's built with performance and interchangeability in mind.
Here is stack overflow link discussing Java native vs GPB:
High performance serialization: Java vs Google Protocol Buffers vs ...?
How does it work:
You specify how you want the information you're serializing to be structured by defining protocol buffer message types in .proto files. Each protocol buffer message is a small logical record of information, containing a series of name-value pairs. Here's a very basic example of a .proto file that defines a message containing information about a person:
message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;

  enum PhoneType {
    MOBILE = 0;
    HOME = 1;
    WORK = 2;
  }

  message PhoneNumber {
    required string number = 1;
    optional PhoneType type = 2 [default = HOME];
  }

  repeated PhoneNumber phone = 4;
}
Once you've defined your messages, you run the protocol buffer compiler for your application's language on your .proto file to generate data access classes. These provide simple accessors for each field (like name() and set_name()) as well as methods to serialize/parse the whole structure to/from raw bytes.
You can then use this class in your application to populate, serialize, and retrieve Person protocol buffer messages. You might then write some code like this:
Person john = Person.newBuilder()
    .setId(1234)
    .setName("John Doe")
    .setEmail("jdoe@example.com")
    .build();
FileOutputStream output = new FileOutputStream(args[0]);
john.writeTo(output);
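Reading the message back is the mirror image; a quick sketch, assuming the Person class was generated from the .proto above with protoc --java_out:

Person john = Person.parseFrom(new FileInputStream(args[0]));
System.out.println(john.getName() + " " + john.getEmail());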
Read all about it here:
https://developers.google.com/protocol-buffers/
You could look at GPB as an alternative format to XSD describing XML structures, just more compact and with faster serialization.
Question:
Instead of writing my own serialization algorithm, would it be possible to just use the built-in Java serialization, like I have done below, while still having it work across multiple languages?
Explanation:
How I imagine it working would be as follows: I start up a process that is a language-specific program written in that language. So I'd have a CppExecutor.exe file, for example. I would write data to a stream to this program. The program would then do what it needs to do and return a result.
To do this, I would need to serialize the data in some way. The first thing that came to mind was basic Java serialization with the use of an ObjectInputStream and ObjectOutputStream. Most of what I have read states that Java serialization only works for Java-to-Java applications.
None of the data will ever need to be stored in a file. The method of transferring these packets would be through a java.lang.Process, which I have set up already.
The data will be composed of the following:
String - mostly containing information that is displayed to the user.
Integer - most likely 32-bit. Won't need to deal with times.
Float - just to handle all floating-point values.
Character - to ensure proper types are used.
Array - composed of any of the elements in this list.
The best way I have worked out to do this is as follows: I would start with a 4-byte magic number, just to ensure we are working with the correct data. Following that, I would have an integer specifying how many elements there are. After that, for each of the elements I would have: a single byte signifying the data type (of the above), followed by any crucial information, e.g. the length for a String or Array, and then the data itself.
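As a rough sketch of the writing side of such a format with DataOutputStream (the magic number, type tags, and helper names below are invented for illustration, not a fixed specification; the stream would typically be the Process's output stream wrapped in a DataOutputStream):

import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class PacketWriter {
    private static final int MAGIC = 0xCAFEBABE;    // arbitrary 4-byte magic number
    private static final byte T_STRING = 1, T_INT = 2, T_FLOAT = 3, T_CHAR = 4;

    // Header: magic number plus the number of elements that follow.
    static void writeHeader(DataOutputStream out, int elementCount) throws IOException {
        out.writeInt(MAGIC);
        out.writeInt(elementCount);
    }

    // Each element: one type byte, then length info if needed, then the data.
    static void writeString(DataOutputStream out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.writeByte(T_STRING);
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    static void writeInt(DataOutputStream out, int value) throws IOException {
        out.writeByte(T_INT);
        out.writeInt(value);
    }

    static void writeFloat(DataOutputStream out, float value) throws IOException {
        out.writeByte(T_FLOAT);
        out.writeFloat(value);
    }

    static void writeChar(DataOutputStream out, char value) throws IOException {
        out.writeByte(T_CHAR);
        out.writeChar(value);
    }
}

Note that DataOutputStream always writes big-endian, so both ends of the pipe must agree on that (or on whatever byte order you choose).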
Side-notes:
I would also like to point out that a lot of these calculations will be taking place, where every millisecond could matter. Because of this, a text-based format (such as JSON) may produce far larger operation times. Considering that none of the packets would need to be interpreted by a human, using only bytes wouldn't be an issue.
I'd recommend Google protobuf: it is binary, stable, proven, and has bindings for all languages you've mentioned. Moreover, it also handles structured data nicely.
There is a binary JSON format called BSON.
I would also like to point out that a lot of these calculations will be taking place, so a text-based format (such as JSON) may produce far larger operation times.
Do not optimize before you have measured.
Premature optimization is the root of all evil.
Can you give it a try and benchmark the throughput to see if it fits your needs?
Comparing Thrift, Protobuf, JSON, and MessagePack:
complexity of installation: Thrift >> Protobuf > BSON > MessagePack > JSON
serialized data size: JSON > MessagePack > Binary Thrift > Compact Thrift > Protobuf
time cost: Compact Thrift > Binary Thrift > Protobuf > JSON > MessagePack
I have a Java class with three ArrayLists: one of type String, one of type Integer, and one of type ArrayList<String>. I have to write these ArrayLists to a structure in C with character arrays, integers and shorts, and output a file with a specific format/extension. The file has to be readable again by the same application. What is the best way to transfer the data from Java to a C structure and then output the C structure to a file? Thank you.
There is no "C compatible file" format. If you have C structs written to disk file directly, then those are in an ad-hoc binary format. Exact format depends on things like packing and padding of the struct, byte order, word size of the CPU (like, 32 or 64 bit), etc.
So, start by defining the format, then forget it is produced by C.
Once you have the format defined, you can write a program to parse it in Java. If it is short, with fixed-length records, I'd probably create a class which internally has just a private byte[] array, plus methods to manipulate it, save it, and load it.
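As an illustration, a minimal sketch of that idea, assuming fixed-length 32-byte records with a big-endian int id at offset 0 (both the size and the layout are made up for the example):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Wraps the raw bytes of one fixed-length record and exposes typed accessors.
class FixedRecord {
    static final int SIZE = 32;               // made-up record size
    private final byte[] data;

    FixedRecord(byte[] data) {
        if (data.length != SIZE) throw new IllegalArgumentException("bad record size");
        this.data = data;
    }

    // Example accessor: a big-endian int stored at offset 0.
    int getId() {
        return ((data[0] & 0xFF) << 24) | ((data[1] & 0xFF) << 16)
             | ((data[2] & 0xFF) << 8)  |  (data[3] & 0xFF);
    }

    byte[] bytes() { return data; }

    // Loads the n-th record of the file by slicing the raw bytes.
    static FixedRecord load(Path file, int index) throws IOException {
        byte[] all = Files.readAllBytes(file);
        byte[] one = new byte[SIZE];
        System.arraycopy(all, index * SIZE, one, 0, SIZE);
        return new FixedRecord(one);
    }
}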
I suggest you write/read the data to a ByteBuffer using native byte ordering. The rest is up to you as to how you do it.
A library which might help is Javolution's Struct library, which helps you map C structs onto ByteBuffers. This can help with C's various packing and padding rules, i.e. the exact layout might not be obvious.
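A small sketch of the plain ByteBuffer route (the field layout here is invented for the example; Javolution's Struct would let you declare a similar layout as a class instead):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class NativeOrderExample {
    public static void main(String[] args) throws IOException {
        // Hypothetical layout: int id, short flag, then a name (not padded to a fixed width here).
        ByteBuffer buf = ByteBuffer.allocate(64).order(ByteOrder.nativeOrder());
        buf.putInt(42);                                          // id
        buf.putShort((short) 1);                                 // flag
        buf.put("example".getBytes(StandardCharsets.US_ASCII));  // name
        buf.flip();

        try (FileChannel ch = FileChannel.open(Paths.get("record.bin"),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            ch.write(buf);
        }
    }
}

Whether nativeOrder() or a fixed order such as LITTLE_ENDIAN is the right choice depends on how the C side reads the file; the important thing is that both sides agree.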