Store images into HDFS, Hive - java

I'm using the Hadoop 2.5 vanilla version. I need to store a large data set of images in HDFS and Hive, but I don't understand how to do it.
Can anyone help with this?
Thank you in advance.

Storing files in HDFS is easy; see the put documentation:
Usage: hdfs dfs -put <localsrc> ... <dst>
You can write scripts to put the image files in place.
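Since the question mentions Java, here is a minimal sketch of the same thing through the Hadoop FileSystem API; the NameNode address and the paths are placeholders you would need to adapt:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ImageUploader {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Picked up automatically if core-site.xml is on the classpath;
        // otherwise point it at your NameNode (placeholder address below).
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            // Copy a local directory of images into HDFS (equivalent to hdfs dfs -put).
            fs.copyFromLocalFile(new Path("/local/images"), new Path("/user/hadoop/images"));
        }
    }
}
The same API also has copyFromLocalFile overloads that take an array of source paths, which can help if your images are spread over several local directories.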
There is another question that tells you how to do it with Hive: How to Store Binary Data in Hive?
I've seen some discussions online suggesting that storing the images in HDFS, and keeping the metadata and a link to the file in HBase, is a better solution than storing the images directly in HBase.
See following links for reference:
http://apache-hbase.679495.n3.nabble.com/Storing-images-in-Hbase-td4036184.html
http://www.quora.com/Is-HBase-appropriate-for-indexed-blob-storage-in-HDFS
https://www.linkedin.com/groups/What-is-best-NoSQL-DB-3638279.S.5866843079608131586

Related

How to view Apache Parquet file in Windows? [closed]

I couldn't find any plain English explanations regarding Apache Parquet files. Such as:
What are they?
Do I need Hadoop or HDFS to view/create/store them?
How can I create parquet files?
How can I view parquet files?
Any help regarding these questions is appreciated.
What is Apache Parquet?
Apache Parquet is a binary file format that stores data in a columnar fashion.
Data inside a Parquet file is similar to an RDBMS style table where you have columns and rows. But instead of accessing the data one row at a time, you typically access it one column at a time.
Apache Parquet is one of the modern big data storage formats. It has several advantages, some of which are:
Columnar storage: efficient data retrieval, efficient compression, etc...
Metadata is at the end of the file: allows Parquet files to be generated from a stream of data. (common in big data scenarios)
Supported by all Apache big data products
Do I need Hadoop or HDFS?
No. Parquet files can be stored in any file system, not just HDFS. As mentioned above it is a file format. So it's just like any other file where it has a name and a .parquet extension. What will usually happen in big data environments though is that one dataset will be split (or partitioned) into multiple parquet files for even more efficiency.
All Apache big data products support Parquet files by default, which is why it might seem as if Parquet can only exist in the Apache ecosystem.
How can I create/read Parquet Files?
As mentioned, all current Apache big data products such as Hadoop, Hive, Spark, etc. support Parquet files by default.
So it's possible to leverage these systems to generate or read Parquet data. But this is far from practical. Imagine that in order to read or create a CSV file you had to install Hadoop/HDFS + Hive and configure them. Luckily there are other solutions.
To create your own parquet files:
In Java please see my following post: Generate Parquet File using Java (a short sketch also follows below)
In .NET please see the following library: parquet-dotnet
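For a flavour of what the Java route looks like, here is a minimal sketch using the parquet-avro bindings; the schema, file name and values are invented for the example, and the linked post covers the details and dependencies:
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetWriteExample {

    public static void main(String[] args) throws Exception {
        // Example schema with two columns: id (int) and name (string).
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Row\",\"fields\":["
                + "{\"name\":\"id\",\"type\":\"int\"},"
                + "{\"name\":\"name\",\"type\":\"string\"}]}");

        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(new Path("example.parquet"))
                .withSchema(schema)
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                .build()) {
            GenericRecord row = new GenericData.Record(schema);
            row.put("id", 1);
            row.put("name", "example");
            writer.write(row);
        }
    }
}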
To view parquet file contents:
Please try the following Windows utility: https://github.com/mukunku/ParquetViewer
Are there other methods?
Possibly, but not many exist, and they mostly aren't well documented. This is due to Parquet being a very complicated file format (I could not even find a formal definition). The ones I've listed are the only ones I'm aware of as I'm writing this response.
This is now possible through Apache Arrow, which helps to simplify communication/transfer between different data formats; see my answer here or the official docs in the case of Python.
Basically this allows you to quickly read/write parquet files in a pandas DataFrame-like fashion, giving you the benefits of using notebooks to view and handle such files as if they were regular CSV files.
EDIT:
As an example, given a recent version of pandas, make sure pyarrow is installed (pip install pyarrow).
Then you can simply use pandas to manipulate parquet files:
import pandas as pd
# read
df = pd.read_parquet('myfile.parquet')
# write
df.to_parquet('my_newfile.parquet')
df.head()
In addition to @Sal's extensive answer, there is one further question I encountered in this context:
How can I access the data in a parquet file with SQL?
As we are still in the Windows context here, I don't know of many ways to do that. The best results were achieved by using Spark as the SQL engine with Python as the interface to Spark. However, I assume that the Zeppelin environment would work as well, though I have not tried that myself yet.
There is a very well done guide by Michael Garlanyk to walk you through the installation of the Spark/Python combination.
Once set up, I'm able to interact with parquets through:
from os import walk
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

parquetdir = r'C:\PATH\TO\YOUR\PARQUET\FILES'

# Get all parquet files in the directory.
# There might be easier ways to access single parquet files, but I had nested dirs.
dirpath, dirnames, filenames = next(walk(parquetdir), (None, [], []))

# For each parquet file, i.e. table in our database, Spark creates a temp view
# named after the parquet filename (without the .parquet extension).
print('New tables available: \n')
for parquet in filenames:
    print(parquet[:-8])
    spark.read.parquet(parquetdir + '\\' + parquet).createOrReplaceTempView(parquet[:-8])
Once your parquet files are loaded this way, you can interact with them via the PySpark API, e.g.:
my_test_query = spark.sql("""
    select
        field1,
        field2
    from parquetfilename1
    where
        field1 = 'something'
""")

my_test_query.show()
Do I need Hadoop or HDFS to view/create/store them?
No. It can be done using a library from your favorite language. For example, with Python you can use PyArrow, fastparquet, or pandas.
How can I view parquet files? How can I create parquet files?
(GUI option for Windows, Linux, macOS)
You can use DBeaver to view parquet data, view metadata and statistics, run SQL queries on one or multiple files, generate new parquet files, etc.
DBeaver leverages the DuckDB driver to perform operations on parquet files.
Simply create an in-memory instance of DuckDB using DBeaver and run the queries as described in this document.
Here is a YouTube video that explains this: https://youtu.be/j9_YmAKSHoA
Alternative:
DuckDB CLI tool usage
Maybe it's too late for this thread, but here is a small complement for anyone who wants to view Parquet files with a desktop application running on macOS or Linux.
There is a desktop application to view Parquet and also other binary-format data like ORC and Avro. It's a pure Java application, so it can run on Linux, macOS and Windows. Please check Bigdata File Viewer for details.
It supports complex data types like array, map, etc.
Here's a quick "hack" to show single table parquet files using Python in Windows (I use Anaconda Python):
Install pyarrow package https://pypi.org/project/pyarrow/
Install pandasgui package https://pypi.org/project/pandasgui/
Create this simple script parquet_viewer.py:
import pandas as pd
from pandasgui import show
import sys
import os

dfs = {}
for fn in sys.argv[1:]:
    dfs[os.path.basename(fn)] = pd.read_parquet(fn)
show(**dfs)
Associate the .parquet file extension by running these commands as administrator (of course you need to adapt the paths to your Python installation):
assoc .parquet=parquetfile
ftype parquetfile="c:\Python3\python.exe" "<path to>\parquet_viewer.py" "%1"
This will allow you to open parquet files compressed with compression formats (e.g. Zstd) that are not supported by the .NET viewer in @Sal's answer.
On macOS, if we want to view the content, we can install parquet-tools:
brew install parquet-tools
parquet-tools head filename
We can always read the parquet file into a DataFrame in Spark and see the content.
Parquet is a columnar format, which makes it more suitable for analytical environments: write once, read many. Parquet files are better suited to read-intensive applications.
This link allows you to view small parquet files:
http://parquet-viewer-online.com/
It was originally submitted by Rodrigo Lozano. This site is based on the GitHub project here: https://github.com/elastacloud/parquet-dotnet
There is a plugin for Excel that allows you to connect to parquet files, but it is behind a paywall:
https://www.cdata.com/drivers/parquet/excel/
You can view Parquet files on Windows / macOS / Linux by having DBeaver connect to an Apache Drill instance through the JDBC interface of the latter:
Download Apache Drill
Choose the links for "non-Hadoop environments".
Click either on "Find an Apache Mirror" or "Direct File Download", not on "Client Drivers (ODBC/JDBC)"
Extract the file
tar -xzvf apache-drill-1.20.2.tar.gz
cd in the extracted folder and run Apache Drill in embedded mode:
cd apache-drill-1.20.2/
bin/drill-embedded
You should end up at a prompt saying apache drill> with no errors.
Make sure the server is running by connecting from your web browser to the web interface of Apache Drill at http://localhost:8047/.
Download DBeaver
From DBeaver, click on File -> New, under the "DBeaver" folder, select "Database Connection", then click "Next"
Select "Apache Drill" and click "Next"
In the connection settings, under the "Main" tab:
In "Connect by:" select "URL"
In "JDBC URL:", write "jdbc:drill:drillbit=localhost"
In username, write "admin"
In password, write "admin"
Click "OK"
Now to view your parquet database, click on "SQL Editor" -> "Open SQL script", write:
SELECT *
FROM `dfs`.`[PATH_TO_PARQUET_FILE]`
LIMIT 10;
and click the play button.
Done!
You can view Parquet files with a WebAssembly app entirely in the browser: https://parquetdbg.aloneguid.uk/

Read images and displaying(Outputting) from HDFS using Hadoop MapReduce

I have only worked with text files in Hadoop so far. I would like to experiment with images too.
How can I read an image and display (output) images?
When I googled, Stack Overflow itself gave me the idea that
we need to convert the images to a sequence file, and then that file is taken as the input to the MapReduce job. Is that right? And in the second MapReduce job, how can we output it as images?
Also, we will not get the exact single image in our map (if the image size is large). Do we need to go with WholeFileInputFormat?
What I have done so far:
Copied images into HDFS
Wrote a MapReduce job to convert the images to a sequence file.
Please advise.
Can anyone help me with examples?
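For reference, a minimal sketch of the kind of mapper that consumes such a sequence file, assuming the conversion job wrote (Text file name, BytesWritable image bytes) records; the class and key/value choices here are illustrative, not a fixed recipe:
import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Assumes each record in the sequence file is (file name, raw image bytes).
public class ImageMapper extends Mapper<Text, BytesWritable, Text, BytesWritable> {

    @Override
    protected void map(Text fileName, BytesWritable imageBytes, Context context)
            throws IOException, InterruptedException {
        // imageBytes.copyBytes() returns the raw image data; decode it here if needed,
        // e.g. with javax.imageio.ImageIO.read(new ByteArrayInputStream(bytes)).
        context.write(fileName, imageBytes);
    }
}
In the driver you would set job.setInputFormatClass(SequenceFileInputFormat.class) so the (key, value) pairs are delivered straight from the sequence file. To get images back out as viewable files, the simplest route is to write the byte arrays back to HDFS (or local disk) under their original file names rather than relying on a text-oriented output format.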

Upload multiple files to azure blob at a time

I am trying to upload files to an Azure blob; I have referred to this code for the same,
and I am able to successfully upload files too. My problem is:
using this code I have to upload files one by one, and I receive more than one file at a time, so each time I need to iterate over the list and pass the files one by one.
What I want to do is upload all files to the Azure blob in one go.
I tried searching on the internet but was unable to find a way.
Please help.
This is not relevant to Java, but you may want to check out the AzCopy tool, which might be useful to you. It supports uploading blobs in parallel.
http://blogs.msdn.com/b/windowsazurestorage/archive/2012/12/03/azcopy-uploading-downloading-files-for-windows-azure-blobs.aspx
http://blogs.msdn.com/b/windowsazurestorage/archive/2013/09/07/azcopy-transfer-data-with-re-startable-mode-and-sas-token.aspx
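If you want to stay in Java, one simple option is to run the single-file uploads you already have in parallel with a thread pool. A minimal sketch, where uploadSingleFile is a hypothetical placeholder for the per-file upload code from the sample you referred to:
import java.io.File;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelBlobUpload {

    // Placeholder: drop in the single-file upload code you already have,
    // e.g. container.getBlockBlobReference(f.getName()).uploadFromFile(f.getAbsolutePath());
    static void uploadSingleFile(File f) {
        // ...
    }

    public static void uploadAll(List<File> files) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(8); // 8 concurrent uploads; tune as needed
        for (File f : files) {
            pool.submit(() -> uploadSingleFile(f));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
As far as I know, the blob service has no single "upload many files" call, so parallelising the individual uploads (which is essentially what AzCopy does) is the practical approach.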

Java Unzip each file into Database

I am attempting to write a Java application that will unzip an archive and store it in a database.
I would like to insert each file into the database after it has been extracted; does anyone have a good example of a Java unzip procedure?
A little Google search would have helped you: there is a tutorial by Sun.
If you want to store the extracted data in a MySQL database, you'll want to use a BLOB to do so. A tutorial can be found here.
Note: BLOBs should not grow bigger than about 1 MB, because they'll be slower than a normal file system. Here is the full article.
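A minimal sketch that combines the two, using java.util.zip and JDBC; the connection settings and the target table files(name VARCHAR, content BLOB) are hypothetical and need to be adapted:
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class UnzipToDatabase {

    public static void main(String[] args) throws Exception {
        // Hypothetical connection settings and target table: files(name VARCHAR, content BLOB)
        try (Connection con = DriverManager.getConnection("jdbc:mysql://localhost/mydb", "user", "pass");
             ZipInputStream zis = new ZipInputStream(new FileInputStream("archive.zip"));
             PreparedStatement ps = con.prepareStatement("INSERT INTO files (name, content) VALUES (?, ?)")) {

            ZipEntry entry;
            byte[] buf = new byte[8192];
            while ((entry = zis.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // Read the current entry fully into memory, then store it as a BLOB.
                ByteArrayOutputStream bos = new ByteArrayOutputStream();
                int n;
                while ((n = zis.read(buf)) > 0) {
                    bos.write(buf, 0, n);
                }
                ps.setString(1, entry.getName());
                ps.setBytes(2, bos.toByteArray());
                ps.executeUpdate();
            }
        }
    }
}
setBytes is fine for modest entries; for very large ones you would stream with setBinaryStream instead, keeping the size caveat above in mind.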

How to read .rpt files and load data to database

I need to read .rpt files and load the data into MySQL or Oracle using Java.
Can anyone help with this?
As far as I know, .rpt files are used by BIRT.
Are the .rpt (report) files processed in any way or do you just have to write them into a database?
In case of the latter, the necessary steps are:
Create a list of all .rpt files
For each of the files:
Read the file into a String
Create a SQL INSERT statement and execute it against your database
Profit.
I will expand this answer if you clarify your requirements.
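Assuming the latter (the files just need to be written into a table as-is), the steps above could look roughly like this with plain JDBC; the directory, connection string and table reports(filename VARCHAR, content TEXT) are hypothetical:
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class RptLoader {

    public static void main(String[] args) throws Exception {
        // 1. Create a list of all .rpt files in a (hypothetical) directory
        List<Path> rptFiles;
        try (Stream<Path> stream = Files.list(Paths.get("/path/to/rpt/files"))) {
            rptFiles = stream.filter(p -> p.toString().endsWith(".rpt"))
                             .collect(Collectors.toList());
        }

        // Hypothetical MySQL connection and table: reports(filename VARCHAR, content TEXT)
        try (Connection con = DriverManager.getConnection("jdbc:mysql://localhost/mydb", "user", "pass");
             PreparedStatement ps = con.prepareStatement(
                     "INSERT INTO reports (filename, content) VALUES (?, ?)")) {

            // 2. For each file: read it into a String and execute an INSERT
            for (Path p : rptFiles) {
                String content = new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
                ps.setString(1, p.getFileName().toString());
                ps.setString(2, content);
                ps.executeUpdate();
            }
        }
    }
}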
