I couldn't find any plain-English explanations of Apache Parquet files, such as:
What are they?
Do I need Hadoop or HDFS to view/create/store them?
How can I create parquet files?
How can I view parquet files?
Any help regarding these questions is appreciated.
What is Apache Parquet?
Apache Parquet is a binary file format that stores data in a columnar fashion.
Data inside a Parquet file is similar to an RDBMS-style table where you have columns and rows. But instead of accessing the data one row at a time, you typically access it one column at a time.
Apache Parquet is one of the modern big data storage formats. It has several advantages, some of which are:
Columnar storage: efficient data retrieval, efficient compression, etc. (see the sketch after this list)
Metadata is at the end of the file: this allows Parquet files to be generated from a stream of data (common in big data scenarios)
Supported by all Apache big data products
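To make the columnar idea concrete, here is a minimal PyArrow sketch (the file and column names are placeholders). Reading a subset of columns is cheap because each column's bytes are stored contiguously:
import pyarrow.parquet as pq
# Read only two columns instead of the whole file; a columnar format
# lets the reader skip the bytes of every other column entirely.
table = pq.read_table('myfile.parquet', columns=['col1', 'col2'])
print(table.schema)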
Do I need Hadoop or HDFS?
No. Parquet files can be stored in any file system, not just HDFS. As mentioned above, it is a file format. So it's just like any other file: it has a name and a .parquet extension. What will usually happen in big data environments, though, is that one dataset will be split (or partitioned) into multiple parquet files for even more efficiency (see the sketch below).
All Apache big data products support Parquet files by default. So that is why it might seem like it only can exist in the Apache ecosystem.
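As an illustration of such partitioning, here is a minimal pandas sketch (it assumes the pyarrow engine is installed; the data and paths are made up):
import pandas as pd
df = pd.DataFrame({
    'year': [2022, 2022, 2023],
    'value': [1.0, 2.0, 3.0],
})
# Writes one sub-directory per year, each holding its own parquet file(s):
# sales/year=2022/... and sales/year=2023/...
df.to_parquet('sales', partition_cols=['year'])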
How can I create/read Parquet Files?
As mentioned, all current Apache big data products such as Hadoop, Hive, Spark, etc. support Parquet files by default.
So it's possible to leverage these systems to generate or read Parquet data. But this is far from practical. Imagine that in order to read or create a CSV file you had to install Hadoop/HDFS + Hive and configure them. Luckily there are other solutions.
To create your own parquet files:
In Java please see my following post: Generate Parquet File using Java
In .NET please see the following library: parquet-dotnet
To view parquet file contents:
Please try the following Windows utility: https://github.com/mukunku/ParquetViewer
Are there other methods?
Possibly. But not many exist, and they mostly aren't well documented. This is due to Parquet being a very complicated file format (I could not even find a formal definition). The ones I've listed are the only ones I'm aware of as I'm writing this response.
This is possible now through Apache Arrow, which helps to simplify communication/transfer between different data formats; see my answer here or the official docs in the case of Python.
Basically this allows you to quickly read/write parquet files in a pandas DataFrame-like fashion, giving you the benefits of using notebooks to view and handle such files as if they were regular CSV files.
EDIT:
As an example, given the latest version of Pandas, make sure pyarrow is installed:
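pip install pyarrow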
Then you can simply use pandas to manipulate parquet files:
import pandas as pd

# read an existing parquet file into a DataFrame
df = pd.read_parquet('myfile.parquet')

# inspect the first rows
df.head()

# write the DataFrame back out to a new parquet file
df.to_parquet('my_newfile.parquet')
In addition to @sal's extensive answer, there is one further question I encountered in this context:
How can I access the data in a parquet file with SQL?
As we are still in the Windows context here, I don't know of many ways to do that. The best results were achieved by using Spark as the SQL engine, with Python as the interface to Spark. However, I assume that the Zeppelin environment works as well, but I haven't tried that out myself yet.
There is a very well done guide by Michael Garlanyk to walk one through the installation of the Spark/Python combination.
Once set up, I'm able to interact with parquets through:
from os import walk
from pyspark.sql import SparkSession

# a SparkSession is needed for spark.read and spark.sql below
spark = SparkSession.builder.getOrCreate()

parquetdir = r'C:\PATH\TO\YOUR\PARQUET\FILES'

# Get all parquet files in the directory.
# There might be easier ways to access single parquets, but I had nested dirs
dirpath, dirnames, filenames = next(walk(parquetdir), (None, [], []))

# For each parquet file, i.e. table in our database, Spark creates a temp view
# whose name equals the parquet filename (minus the 8-character '.parquet' extension)
print('New tables available: \n')
for parquet in filenames:
    print(parquet[:-8])
    spark.read.parquet(parquetdir + '\\' + parquet).createOrReplaceTempView(parquet[:-8])
Once you've loaded your parquets this way, you can interact with them via the PySpark API, e.g.:
my_test_query = spark.sql("""
select
field1,
field2
from parquetfilename1
where
field1 = 'something'
""")
my_test_query.show()
Do I need Hadoop or HDFS to view/create/store them?
No. It can be done using a library from your favorite language. E.g., with Python you can use PyArrow, fastparquet, or pandas.
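For instance, here is a minimal PyArrow round trip on the local file system, with no Hadoop involved (the data and filename are made up):
import pyarrow as pa
import pyarrow.parquet as pq
# Build a small in-memory table and write it to a local parquet file
table = pa.table({'id': [1, 2, 3], 'name': ['a', 'b', 'c']})
pq.write_table(table, 'example.parquet')
# Read it back
print(pq.read_table('example.parquet'))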
How can I view parquet files? How can I create parquet files?
(GUI option for Windows, Linux, macOS)
You can use DBeaver to view parquet data, view metadata and statistics, run SQL queries on one or multiple files, generate new parquet files, etc.
DBeaver leverages the DuckDB driver to perform operations on parquet files.
Simply create an in-memory instance of DuckDB using DBeaver and run the queries as mentioned in this document.
Here is a YouTube video that explains this: https://youtu.be/j9_YmAKSHoA
Alternative:
DuckDB CLI tool usage
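If you prefer scripting over the CLI, the same DuckDB engine is also available as a Python package; a minimal sketch, assuming a recent duckdb version and placeholder file paths:
import duckdb  # pip install duckdb
# Query a parquet file directly with SQL
print(duckdb.sql("SELECT * FROM 'myfile.parquet' LIMIT 10"))
# Write a query result back out as a new parquet file
duckdb.sql("COPY (SELECT * FROM 'myfile.parquet') TO 'subset.parquet' (FORMAT PARQUET)")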
Maybe it's too late for this thread, but here is a complement for anyone who wants to view Parquet files with a desktop application running on macOS or Linux.
There is a desktop application to view Parquet and also other binary-format data like ORC and Avro. It's a pure Java application, so it can run on Linux, Mac, and Windows. Please check Bigdata File Viewer for details.
It supports complex data types like arrays, maps, etc.
Here's a quick "hack" to show single table parquet files using Python in Windows (I use Anaconda Python):
Install pyarrow package https://pypi.org/project/pyarrow/
Install pandasgui package https://pypi.org/project/pandasgui/
Create this simple script parquet_viewer.py:
import os
import sys

import pandas as pd
from pandasgui import show

# load each file given on the command line into a DataFrame,
# keyed by its base filename
dfs = {}
for fn in sys.argv[1:]:
    dfs[os.path.basename(fn)] = pd.read_parquet(fn)

show(**dfs)
Associate the .parquet file extension by running these commands as administrator (of course you need to adapt the paths to your Python installation):
assoc .parquet=parquetfile
ftype parquetfile="c:\Python3\python.exe" "<path to>\parquet_viewer.py" "%1"
This will allow you to open parquet files compressed with compression formats (e.g. Zstd) not supported by the .NET viewer in @Sal's answer.
On Mac, if we want to view the content, we can install parquet-tools:
brew install parquet-tools
parquet-tools head filename
We can always read the parquet file into a DataFrame in Spark and see the content.
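For example, a minimal sketch (the filename is a placeholder):
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet('filename.parquet')
df.show()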
Parquet is a columnar format and is more suitable for analytical environments: write once, read many. Parquet files are more suitable for read-intensive applications.
This link allows you to view small parquet files:
http://parquet-viewer-online.com/
It was originally submitted by Rodrigo Lozano. This site is based on the github project here: https://github.com/elastacloud/parquet-dotnet
There is a plugin for Excel that allows you to connect to parquet files, but it is behind a paywall:
https://www.cdata.com/drivers/parquet/excel/
You can view Parquet files on Windows / MacOS / Linux by having DBeaver connect to an Apache Drill instance through the JDBC interface of the latter:
Download Apache Drill
Choose the links for "non-Hadoop environments".
Click either on "Find an Apache Mirror" or "Direct File Download", not on "Client Drivers (ODBC/JDBC)"
Extract the file
tar -xzvf apache-drill-1.20.2.tar.gz
cd into the extracted folder and run Apache Drill in embedded mode:
cd apache-drill-1.20.2/
bin/drill-embedded
You should end up at a prompt saying apache drill> with no errors.
Make sure the server is running by connecting from your web browser to the web interface of Apache Drill at http://localhost:8047/.
Download DBeaver
From DBeaver, click on File -> New, under the "DBeaver" folder, select "Database Connection", then click "Next"
Select "Apache Drill" and click "Next"
In the connection settings, under the "Main" tab:
In "Connect by:" select "URL"
In "JDBC URL:", write "jdbc:drill:drillbit=localhost"
In username, write "admin"
In password, write "admin"
Click "OK"
Now to view your parquet database, click on "SQL Editor" -> "Open SQL script" and write:
SELECT *
FROM `dfs`.`[PATH_TO_PARQUET_FILE]`
LIMIT 10;
and click the play button.
Done!
You can view it with a WebAssembly app completely in the browser: https://parquetdbg.aloneguid.uk/
Related
I have some Pig output files and want to read them on another machine (without a Hadoop installation). I just want to read a tab-separated plain-text line and parse it into a Java object. I am guessing we should be able to use pig.jar as a dependency and be able to read it. I could not find relevant documentation. I think this class could be used? How can we provide the schema as well?
I suggest you store the data in the Avro serialization format. It is Pig-independent and allows you to handle complex data structures like the ones you described (so you don't need to write your own parser). See this article for examples.
Your pig output files are just text files, right? Then you don't need any pig or hadoop jars.
Last time I worked with Pig was on Amazon's EMR platform, and the output files were stashed in an S3 bucket. They were just text files, and standard Java can read them in.
That class you referenced is for reading into pig from some text format.
Are you asking for a library to parse the Pig data model into Java objects, i.e. the text representation of tuples, bags, etc.? If so, it's probably easier to write it yourself. It's a VERY simple data model with only 3-ish datatypes.
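For the simple flat case described above (plain tab-separated text), here is a rough sketch, in Python for brevity since the same logic is only a few lines of Java; the field names and types are hypothetical:
# Parse one tab-separated Pig output line into a dict;
# the schema (name, age, score) is made up for illustration.
def parse_pig_line(line):
    name, age, score = line.rstrip('\n').split('\t')
    return {'name': name, 'age': int(age), 'score': float(score)}

# 'part-r-00000' is a typical Pig/Hadoop output file name
with open('part-r-00000') as f:
    records = [parse_pig_line(line) for line in f]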
I was just wondering: is there any way we can copy local content, like an XLS sheet, to the file system on a remote server (e.g. a SQL Server) using Java?
I don't want exact code, just a head start. Any help is appreciated. :)
I agree with @beny23: reading a Microsoft format to upload info to Microsoft SQL Server does not sound like a problem where Java would be the first option.
In any case, you just have to consider the parts of your problem. Java can read Excel files (look for Apache POI). From there, connecting Java to SQL Server is also a solved problem (although there are several drivers and you might have to try more than one).
I'm trying to read an Excel file and pass all the data to a DB. I found a few code examples, but all of them required external jars. How can I read Excel files using only the standard library?
If you don't want to use a library, then you will have to download the Excel file-format specs from MS and write an Excel parser yourself (which is extremely complicated and takes > 10 years for one developer). For the OpenXML format spec see here and here.
Thus I really recommend using a library for that...
Try Apache POI - a free Java library for dealing with MS Office documents..
You can save the Excel file as a *.csv separated by ";", then read the file line by line and split each line at the separator to get the columns.
Microsoft Excel uses a binary format to save its data, so manually reading Excel files might be a hassle. If you can convert the Excel file (XLS) to a comma-separated values (CSV) file, then you can just read the file and split your input on the commas.
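For example, once the sheet is exported to CSV, the standard library of most languages can read it; a minimal Python sketch (the filename and the ';' separator are assumptions):
import csv
# Read the exported sheet and split each line into its columns
with open('sheet.csv', newline='') as f:
    for row in csv.reader(f, delimiter=';'):
        print(row)  # each row is a list of column values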
This is a difficult problem. First off, it is not as simple as "adding a third party library". There are no existing Excel-reading libraries that do not cost money, and the one that I know does work is very expensive AND has bugs in it.
One strategy is to create an Excel add-in that reads the data and transfers it to your application by OLE, the clipboard, or a TCP/IP port, or saves it to a temporary file. If you look in the source code of OPeNDAP.org's ODC project you can find an Excel add-in and the TCP capability to do this.
You can try referring to the reader in OpenOffice which is open source code, however, in my opinion that code is not easily refactorable into a private project for various reasons.
Microsoft has components and tools to open Excel files and expose them via COM objects.
You can also learn the BIFF format and write your own parser. You probably would want to write a parser for BIFF5, but be forewarned, this is a BIG project, even if you only parse a limited number of data types.
I have a database in .dbf (FoxPro) format.
How do I retrieve data from FoxPro using Java?
If the data can be migrated to MySQL, how do I do the conversion?
Taking the data through intermediate formats seems flawed, as there are limitations with memo fields and CSV or Excel files.
If you are interested in a more direct approach you could consider something like "VFP2MySQL Data Upload program" or "Stru2MySQL_2", both written by Visual FoxPro developers. Search for them on this download page:
http://leafe.com/dls/vfp
DB-Convert (http://dbconvert.com/convert-foxpro-to-mysql-sync.php) is a commercial product that you might find helpful.
Rick Schummer, VFP MVP
You can use XBaseJ to access (and even modify/write) data from FoxPro databases directly from Java with a simple API.
This would allow you to have the two applications (the old FoxPro one and the new Java one) side by side, constantly synchronizing the data, until the new application is ready to replace the old one (e.g. customers often hang on to, and trust, their old application for a while).
Do you have a copy of FoxPro? You can save the database as an HTML file, if you want. Then, from HTML, you can save to any format you want. I recently did this to save a FoxPro table as an Excel spreadsheet (not that I'd suggest using that for your Java code).
If you plan on using Java, once you have access to the data, why not use one of Java's native storage types?
I worked on the same kind of project once, long back, where the project had been done in FoxPro and we then migrated it to Java with MySQL.
We had the data in Excel sheets or .txt files, so we created tables as exact replicas of the FoxPro data and transferred the data from the Excel/CSV/txt files to MySQL using the Import Data feature.
Once we did this, I think you can take care of the rest from the MySQL data.
But remember, the work will take some time, and we need to be patient.
I suppose doing a CSV export of your FoxPro data and then writing a little Java programme that takes the CSV as input is your best bet. Writing a Java programme that connects to both FoxPro and MySQL is needlessly complicated; you are doing a one-time migration.
By the way PHP could do an excellent job at inserting the data into MySQL too. The main thing is that you get your data in the MySQL schema, so you can use it with your new application (which I assume is in Java.)
Two steps: DBF => CSV and the CSV => MySQL.
To convert DBF (FoxPro tables) to CSV, the link below helps a lot:
http://1stopit.blogspot.com/2009/06/dbf-to-mysql-conversion-on-windows.html
CSV => MySQL
MySQL itself supports a CSV import option; alternatively, to read the CSV file, this link helps:
http://www.csvreader.com/java_csv.php
I read the CSV file using Java CsvReader and inserted the records through a program. For that I used PreparedStatement with batching; the link below gives samples for that:
http://www.codeweblog.com/switch-to-jdbc-oracle-many-data-insert/
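For comparison, the same batched-insert pattern sketched in Python, assuming the PyMySQL driver and a made-up table/CSV layout:
import csv
import pymysql  # pip install pymysql; any DB-API driver works similarly
conn = pymysql.connect(host='localhost', user='user',
                       password='secret', database='mydb')
try:
    with open('export.csv', newline='') as f:
        rows = [tuple(row) for row in csv.reader(f)]
    with conn.cursor() as cur:
        # batched insert, analogous to a JDBC PreparedStatement with batching
        cur.executemany('INSERT INTO mytable (col1, col2) VALUES (%s, %s)', rows)
    conn.commit()
finally:
    conn.close()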
The interop library is slow and needs MS Office installed.
Many times you don't want to install MS Office on servers.
I'd like to use Apache POI, but I'm on .NET.
I need only to extract the text portion of the files, not creating nor "storing information" in Office files.
I need to tell you that I've got a very large document library, and I can't convert it to newer XML files.
I don't want to write a parser for the binaries files.
A library like Apache POI does this for us. Unfortunately, it is only for the Java platform. Maybe I should consider writing this application in Java.
Since I still haven't found an open-source alternative to POI in .NET, I think I'll write my own application in Java.
For all MS Office versions:
You could use third-party components like TX Text Control for Word and TMS FlexCel Studio for Excel.
For the new Office (2007):
You could do some basic stuff using .NET functionality from System.IO.Packaging. See how at http://msdn.microsoft.com/en-us/library/bb332058.aspx
For the old Office (before 2007):
The old Office formats are now documented: http://www.microsoft.com/interop/docs/officebinaryformats.mspx. If you want to do something really easy you might consider trying it. But be aware that these formats are VERY complex.
Check out the Aspose components. They are designed to mimic the Interop functionality without requiring a full Office install on a server.
As the new .docx format is inherently XML-based, you can create and manipulate these files programmatically with standard XML DOM techniques, once you know the structure.
The files are basically zip archives with an alternate file extension. Use the System.IO.Packaging namespace to get access to the internal elements of the file, then open them into an XmlDocument to perform the manipulation.
There are examples available for doing this, and the Office Open XML project on SourceForge may be worth looking at for inspiration.
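To illustrate the zip-archive point outside .NET, here is the same idea as a short Python sketch (the filename is a placeholder); it unzips a .docx and pulls the text runs out of the main body, word/document.xml:
import zipfile
import xml.etree.ElementTree as ET

W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'

# A .docx file is just a zip archive; the document body lives in word/document.xml
with zipfile.ZipFile('example.docx') as z:
    root = ET.fromstring(z.read('word/document.xml'))

# Concatenate all <w:t> text runs in document order
print(''.join(t.text or '' for t in root.iter(W + 't')))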
As for the older binary formats, these were proprietary to MS, and the only way you're likely to get at the content from within is through the Office object model (requires an Office install), or a third party file converter/parser.
Unfortunately there's nothing first party and native to the .NET platform to work with these files.
What do you need to do with those files? If you just want to stream them to the user, then basic file streams are fine. If you want to create new files (perhaps based on a template) to send to the user, which the user can open in Office, there are a variety of workarounds.
If you're actually keeping data in Office documents for use by your web site, you're doing it wrong. Office documents, even Excel spreadsheets and Access databases, are not really an appropriate choice for use with an interactive web site.
If the document is in Word 2007 format, you can use the System.IO.Packaging library to interact with it programmatically.
RWendi
In the Java world, there is also JExcelApi. It is very clearly written and, from what I was able to see, much cleaner than POI. So maybe even a port of that code to .NET is not out of the question, provided of course you have enough time on your hands.
OpenOffice.
You can program against it and have it do a lot for you, without spending money on a server license or taking on the vulnerabilities associated with Office on your server.
Microsoft Excel workbooks can be read using an ODBC driver (or is it an OLE DB driver? I can't remember) that makes the workbook look like a database table. But I don't know whether that driver is available without the Office suite itself.
You can use OpenOffice. It has a command-line conversion tool:
Conversion Howto
In short, you define a macro in OpenOffice and you call that macro with a command-line argument to OpenOffice. The name of the local file (the Office file) is encoded in that argument.
It's not a great solution, but it should be workable.