How do you query a parquet file using parquet-mr?

I have a Parquet file stored in AWS S3 that I want to query. I want to retrieve the row whose id equals a given value, much as I would in SQL:
SELECT * FROM file.parquet WHERE id = '1234';
I am using parquet-mr to load the file into memory directly from S3, with an AvroParquetReader set up to read the rows.
For now I've copied every row into a Map for easy querying, but is there a better way to do this? The documentation for parquet-mr is not great, and most tutorials use deprecated methods.
Here is some example code of what I've got:
final ParquetReader<GenericRecord> reader = AvroParquetReader
.<GenericRecord>builder(internalPath)
.withConf(parquetConfiguration).build();
You can use reader.read() to get the next row in the file (which is what I've used to put the rows into a HashMap), but I can't find any methods in parquet-mr that allow you to query a file without loading the entire file into memory.

The feature you are looking for is called predicate pushdown. You can read about it and find examples here.
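As a rough illustration of predicate pushdown with parquet-mr, the sketch below builds an equality filter and hands it to the reader builder from the question. It assumes the parquet-avro and Hadoop client libraries are on the classpath and that id is a string (binary) column; the class and method names are hypothetical.

```java
import java.io.IOException;

import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.filter2.compat.FilterCompat;
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.io.api.Binary;

// Hypothetical helper; assumes a string column named "id".
class ParquetIdLookup {

    // Builds the equivalent of: WHERE id = '<id>'
    static FilterPredicate idEquals(String id) {
        return FilterApi.eq(FilterApi.binaryColumn("id"), Binary.fromString(id));
    }

    // Returns the first matching record, or null if nothing matches.
    static GenericRecord findById(Path path, Configuration conf, String id) throws IOException {
        try (ParquetReader<GenericRecord> reader = AvroParquetReader
                .<GenericRecord>builder(path)
                .withConf(conf)
                .withFilter(FilterCompat.get(idEquals(id)))
                .build()) {
            return reader.read();
        }
    }
}
```

With a filter in place, the reader can skip row groups whose column statistics rule out the value, so only candidate data is decoded rather than every row in the file.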

Related

Searching and storing values from CSV

I'm a Java beginner and want to learn how to read in files and store data in a way that makes it easy to manipulate.
I have a fairly big CSV file (18,000 rows). The data represents the assortment of beverages sold by a liquor store. It consists of about 16 columns with headers like "article number", "name", "producer", "amount of alcohol", and so on. The columns are separated by "\t".
I now want to search this file to find things like how many products are produced in Sweden, or which liquor is the most expensive per liter.
Since I really want to learn how to program and not just find the answer, I'm not looking for exact code here. I'm instead looking for the pseudocode behind this, a good way of thinking when dealing with large sets of data, and which data structures are best suited to the task.
Let's take the "How many products are from Sweden" example.
Since the data consists of strings, ints and floats, I can't put everything in a single list. What is the best way of storing it so that it can be manipulated later? Or can I find the answer as soon as the data is parsed, so that maybe I don't have to store it at all?

If you're new to Java and programming in general I'd recommend a library to help you view and use your data, without getting into databases and learning SQL. One that I've used in the past is Commons CSV.
https://commons.apache.org/proper/commons-csv/user-guide.html#Parsing_files
It lets you easily parse a whole CSV file into CSVRecord objects. For example:
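// Requires org.apache.commons.csv.CSVFormat and CSVRecord, plus java.io.Reader and FileReader.
Reader in = new FileReader("path/to/file.csv");
// Treat the first line as the header so that columns can be read by name.
Iterable<CSVRecord> records = CSVFormat.EXCEL.withFirstRecordAsHeader().parse(in);
for (CSVRecord record : records) {
    String lastName = record.get("Last Name");
    String firstName = record.get("First Name");
}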
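For the "how many products are from Sweden" question, nothing has to be stored at all: each row can be checked as it is read. The following is a minimal sketch using only the standard library, assuming a tab-separated file whose first line is the header; the class name and the column name "country" used in the note below are hypothetical.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper that answers a "count where column = value" question in one pass.
class BeverageStats {

    // Counts data rows whose value in the named column equals the wanted value.
    static long countWhere(Stream<String> lines, String column, String wanted) {
        Iterator<String> it = lines.iterator();
        if (!it.hasNext()) {
            return 0; // empty input: no header, no rows
        }
        // The first line names the columns; find the position of the one we need.
        List<String> header = Arrays.asList(it.next().split("\t", -1));
        int idx = header.indexOf(column);
        if (idx < 0) {
            throw new IllegalArgumentException("No such column: " + column);
        }
        long count = 0;
        while (it.hasNext()) {
            String[] fields = it.next().split("\t", -1);
            if (idx < fields.length && fields[idx].equals(wanted)) {
                count++;
            }
        }
        return count;
    }
}
```

A call such as countWhere(Files.lines(Path.of("beverages.tsv")), "country", "Sweden") would stream the file line by line. A running maximum (for the most expensive product per liter) can be kept the same way, parsing the numeric columns with Double.parseDouble.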

If your data is in a CSV file, you can use a database to store it.
You can read the CSV in Java as described in this link.
Use an ORM framework such as Hibernate together with a Spring application. Use this link to create the application.
With this you can write queries to fetch data, such as "How many products are from Sweden", and make use of the Collections framework. See this link for using HQL queries in the same application.
Create JSP pages to show the results in the UI.
Sorry for my English.

It seems you are looking for an in-memory SQL engine over your CSV file. I would suggest using CQEngine, which provides an indexed view on top of the Java collections framework with SQL-like queries.
You are basically treating a Java collection as a database table. Assuming that each CSV line maps to some POJO class like Beverage:
IndexedCollection<Beverage> table = new ConcurrentIndexedCollection<Beverage>();
table.addIndex(NavigableIndex.onAttribute(Beverage.BEVERAGE_ID));
table.add(new Beverage(...));
table.add(new Beverage(...));
table.add(new Beverage(...));
What you need to do now is read the CSV file, load it into the IndexedCollection, and build a proper index on some fields. After that, you can query the table much like a regular SQL database. At the end, serialize the collection to a new CSV file (if you made any modifications).

Spark - Update the record (in parquet file) if already exists

I am writing a Spark job to read data from a JSON file and write it to a Parquet file. Below is the example code:
DataFrame dataFrame = new DataFrameReader(sqlContext).json(textFile);
dataFrame = dataFrame.withColumn("year", year(to_date(unix_timestamp(dataFrame.col("date"), "yyyy-MM-dd'T'HH:mm:ss.SSS").cast("timestamp"))));
dataFrame = dataFrame.withColumn("month", month(to_date(unix_timestamp(dataFrame.col("date"), "yyyy-MM-dd'T'HH:mm:ss.SSS").cast("timestamp"))));
dataFrame.write().mode(SaveMode.Append).partitionBy("year", "month").parquet("<some_path>");
The JSON file consists of many JSON records, and I want a record to be updated in Parquet if it already exists. I have tried Append mode, but it seems to work at the file level rather than the record level (i.e. if the file already exists, it writes at the end). So running this job twice for the same file duplicates the records.
Is there any way to specify a DataFrame row id as a unique key and ask Spark to update the record if it already exists? All the save modes seem to check the files and not the records.

Parquet is a file format rather than a database. To achieve an update by id, you will need to read the file, update the value in memory, then rewrite the data to a new file (or overwrite the existing file).
You might be better served using a database if this is a use-case that will occur frequently.

You could have a look at the Apache ORC file format instead, see:
https://orc.apache.org/docs/acid.html
Depending on your use case, HBase is another option if you want to stay on top of HDFS.
But keep in mind that HDFS is a write-once file system; if that does not fit your needs, go for something else (maybe Elasticsearch or MongoDB).
Otherwise, in HDFS, you must create new files every time: set up an incremental process that builds a "delta" file, then merge OLD + DELTA = NEW_DATA.
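The OLD + DELTA = NEW_DATA merge can be sketched with plain Java collections. This is not Spark code, only an illustration of the logic: records are keyed by a unique id, and a record from the delta replaces an old record with the same key. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper illustrating an upsert-style merge by key.
class DeltaMerge {

    // Returns the old records with every delta record applied on top.
    // A delta record replaces an old record that has the same key;
    // records with new keys are appended in the order they arrive.
    static <K, R> List<R> merge(List<R> old, List<R> delta, Function<R, K> keyOf) {
        Map<K, R> byKey = new LinkedHashMap<>();
        for (R record : old) {
            byKey.put(keyOf.apply(record), record);
        }
        for (R record : delta) {
            byKey.put(keyOf.apply(record), record); // the delta wins on conflict
        }
        return new ArrayList<>(byKey.values());
    }
}
```

The merged list is then written out as a new file, which replaces the old one. Running the job twice with the same delta gives the same result, so re-processing a file no longer duplicates records.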

Efficient data import PostgreSQL DB

I just designed a PostgreSQL database and need to choose a way of populating it with data. The data consists of txt and csv files, but in general can be any type of file containing characters with delimiters. I'm programming in Java to bring the data into a common structure (there are lots of different kinds of files, and I need to work out what each column of a file represents so I can associate it with a column of my DB). I thought of two ways:
Convert the files into one same type of file (JSON) and then get the DB to regularly check the JSON file and import its content.
Directly connect to the database via JDBC send the strings to the DB (I still need to create a backup file containing what was inserted into the DB so in both cases there is a file created and written into).
Which would you go with in terms of time efficiency? I'm rather tempted to use the first one, as it would be easier to handle a JSON file in the DB.
If you have any other suggestion that would also be welcome!

JSON or CSV
If you are free to convert your data to either CSV or JSON, CSV is the one to choose, because you will then be able to use COPY FROM to bulk load large amounts of data into PostgreSQL at once.
CSV is supported by COPY but JSON is not.
Directly inserting values.
This is the approach to take if you only need to insert a few (or maybe even a few thousand) records, but it is not suited to a large number of records because it will be slow.
If you choose this approach, you can create the backup using COPY TO. However, if you feel that you need to create the backup file with your Java code, choosing CSV as the format means you would be able to bulk load it as discussed above.
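If the Java code writes the CSV itself, fields have to be quoted so that COPY can parse them. Below is a minimal sketch of that step using only the standard library. It assumes COPY's default CSV settings (comma delimiter, double-quote quoting, unquoted empty field read as NULL); the class name is hypothetical.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper that formats one row for COPY ... WITH (FORMAT csv).
class CsvLine {

    // Quotes a field when it contains a delimiter, a quote or a line break.
    // null becomes an unquoted empty field (NULL under COPY's CSV defaults);
    // an empty string is written as "" so it stays distinct from NULL.
    static String escape(String field) {
        if (field == null) {
            return "";
        }
        boolean needsQuotes = field.isEmpty()
                || field.contains(",")
                || field.contains("\"")
                || field.contains("\n")
                || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        // Embedded quotes are escaped by doubling them.
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Joins the escaped fields of one record into a single CSV line.
    static String toLine(List<String> fields) {
        return fields.stream().map(CsvLine::escape).collect(Collectors.joining(","));
    }
}
```

The resulting file serves both as the backup and as the input to a bulk load with COPY FROM.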

H2: How can I change csvread functionality by changing the H2 source code?

I have the following SQL code:
create table cross_links(sid varchar,tid varchar,snd int)
as
select * from csvread('csvfile')
I want to read csvfile twice. The second time, the positions of sid and tid are exchanged before the rows are inserted into the table. But this costs some performance, so I want to read the file only once and get the same result as reading it twice.
How can I do it?
I think this requires changing the source code of H2.

First, you don't need to do this. You can just write a simple CSV reader yourself that swaps or renames the columns as it reads them in.
Also, with your approach, you would also need to modify csvread to support different types of data - it only supports VARCHAR. That is going to be more work!
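A simple reader along those lines might look like the sketch below. It reads each line once and emits both the original row and the row with sid and tid exchanged. It assumes a plain comma-separated file with three fields per line, no header line and no quoted fields; the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical reader that produces both link directions in a single pass.
class CrossLinks {

    // For each input line "sid,tid,snd" returns two rows:
    // {sid, tid, snd} and {tid, sid, snd}.
    static List<String[]> bothDirections(Stream<String> lines) {
        List<String[]> rows = new ArrayList<>();
        lines.forEach(line -> {
            String[] f = line.split(",", -1);
            if (f.length < 3) {
                return; // skip malformed lines
            }
            rows.add(new String[] {f[0], f[1], f[2]});
            rows.add(new String[] {f[1], f[0], f[2]});
        });
        return rows;
    }
}
```

The rows can then be inserted into cross_links with a batched JDBC PreparedStatement, converting snd with Integer.parseInt on the way in.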

Insert Query Builder for java

I have a use case in which I need to read rows from a file, transform them using an engine, and then write the output to a database (which can be configured).
While I could write a query builder of my own, I was interested in knowing whether there is already an available solution (library).
I searched online and found the jOOQ library, but it looks like it is type-safe and has a code-generation tool, so it is probably suited to static database schemas. In my use case, databases can be configured dynamically, and the metadata is read programmatically and made available for writing (so a list of tables would be made available, the user can select the columns to write, and the insert script for these columns needs to be created dynamically).
Is there any library that could help me with the use case?

If I understand correctly, you need to query the database structure, display the result via a GUI, and have the user map data from a file to that structure?
Assuming this is the case, you're not looking for a 'library', you're looking for an ETL tool.
Alternatively, if you're set on writing something yourself, the (very) basic way to do this is:
Read the structure of the database using Connection.getMetaData(). The exact usage can vary between drivers, so you'll need to create an abstraction layer that meets your needs - I'd assume you're just interested in the table structure here.
Map the format of the file to a structure similar to the tables.
Provide a GUI that allows the user to connect elements from the file to columns in the table, including any type mapping that is needed.
Create a parameterized insert statement based on the file-element-to-column mapping - this is just a simple bit of string concatenation.
Loop through the rows in the file, adding each to a batch insert.
My advice: get an ETL tool. This sounds like a simple problem, but it's full of idiosyncrasies; getting even an 80% solution will be tough and time-consuming.
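The string concatenation step mentioned in the list above can be sketched as follows. This is a minimal illustration, assuming the table and column names come from the database metadata rather than from free-form user input; the class name is hypothetical.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical builder for a parameterized INSERT statement.
class InsertSqlBuilder {

    // Produces e.g. "INSERT INTO t (a, b) VALUES (?, ?)".
    // Identifiers are concatenated as given, so they must be trusted
    // (taken from metadata) or validated before they reach this method.
    static String build(String table, List<String> columns) {
        if (columns.isEmpty()) {
            throw new IllegalArgumentException("At least one column is required");
        }
        String columnList = String.join(", ", columns);
        String placeholders = String.join(", ", Collections.nCopies(columns.size(), "?"));
        return "INSERT INTO " + table + " (" + columnList + ") VALUES (" + placeholders + ")";
    }
}
```

The resulting string is passed to Connection.prepareStatement, and each file row is bound with setObject and queued with addBatch before a final executeBatch.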

jOOQ (the library you referenced in your question) can be used without code generation as indicated in the jOOQ manual:
http://www.jooq.org/doc/latest/manual/getting-started/use-cases/jooq-as-a-standalone-sql-builder
http://www.jooq.org/doc/latest/manual/sql-building/plain-sql
When searching through the user group, you'll find other users using jOOQ in the way you intend.

The steps you need to do are:
read the rows
build each row into an object
transform the above object into the target object
insert the target object into the db
Of the four steps above, step 3 is the only one that needs real work.
And for the above purpose, you can use Transmorph, EZMorph, Commons-BeanUtils, Dozer, etc.
