I have only worked with text files in Hadoop so far. I would like to experiment with images too.
How can I read an image and display (output) images?
When I googled, Stack Overflow itself gave me the idea that
we need to convert the images to a sequence file, which is then taken as the input to the MapReduce job. Is that right? If so, in the second MapReduce job how can we output them as images?
Also, we will not get a complete single image in our map (if the image is large), so do we need to go with WholeFileInputFormat?
What I have done so far:
Copied the images into HDFS.
Wrote a MapReduce job to convert the images to a sequence file.
Please advise.
Can anyone help me with examples?
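A minimal sketch of one way such a conversion step could look, assuming each image is written whole as a BytesWritable value keyed by its filename (the class name and paths here are invented for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Sketch: pack every image in an HDFS directory into one sequence file,
// keyed by filename, with the raw image bytes as the value.
public class ImagesToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path imagesDir = new Path(args[0]);  // e.g. /user/hduser/images (placeholder)
        Path seqFile   = new Path(args[1]);  // e.g. /user/hduser/images.seq (placeholder)

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(seqFile),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (FileStatus status : fs.listStatus(imagesDir)) {
                byte[] bytes = new byte[(int) status.getLen()];
                try (FSDataInputStream in = fs.open(status.getPath())) {
                    IOUtils.readFully(in, bytes, 0, bytes.length);
                }
                writer.append(new Text(status.getPath().getName()),
                              new BytesWritable(bytes));
            }
        }
    }
}

A downstream MapReduce job can then read the sequence file with SequenceFileInputFormat and receive one whole image per record, which sidesteps the splitting problem mentioned above.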
I have >400 JPG files and a JSON file for each, containing the image's tags, description and title. I've found this command:
exiftool -json=picture.json picture.jpg
But I don't want to run this for each and every file.
How can I run this command for the folder containing the JPGs and JSONs or is there another way I can batch process these?
Each JSON file has the same name as its JPG counterpart, so it's easy to identify which files match up with each other.
Assuming your JPGs and JSONs have the same filename but a different extension (e.g. picture001.jpg has an associated picture001.json, etc.), a bash for loop might work.
Assuming you've already cd-ed into the folder and the files aren't nested in folders, something like this should work
for jpg in *.jpg; do exiftool -json="${jpg%.jpg}.json" "$jpg"; done
Note that this isn't tested. I recommend making a copy of your folder and testing there beforehand to make sure you don't irreversibly damage them.
I've also noticed you're using the java tag. I had to work with EXIF data in Java a while back (on Android then) and I used the JHeader library. If you want to roll your own little java command line tool, you should be able to use Java's IO classes to traverse your directory and files and the JHeader library to modify the EXIF data.
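If you do go the Java route, a rough sketch of such a command-line tool (the class name is made up; this version skips EXIF parsing entirely and just shells out to exiftool, which is assumed to be on your PATH) could walk the folder and invoke exiftool once per JPG/JSON pair:

import java.io.File;
import java.io.IOException;

// Sketch: for every .jpg in a directory, run
// "exiftool -json=<name>.json <name>.jpg" if the matching JSON exists.
public class ExifBatch {
    public static void main(String[] args) throws IOException, InterruptedException {
        File dir = new File(args[0]);
        File[] jpgs = dir.listFiles((d, name) -> name.toLowerCase().endsWith(".jpg"));
        if (jpgs == null) {
            throw new IOException("Not a directory: " + dir);
        }
        for (File jpg : jpgs) {
            String base = jpg.getName().substring(0, jpg.getName().length() - 4);
            File json = new File(dir, base + ".json");
            if (!json.exists()) {
                continue; // no matching JSON, skip this picture
            }
            Process p = new ProcessBuilder("exiftool",
                    "-json=" + json.getAbsolutePath(),
                    jpg.getAbsolutePath())
                    .inheritIO()  // show exiftool's output on the console
                    .start();
            p.waitFor();
        }
    }
}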
I'm using vanilla Hadoop 2.5. I need to store a large data set of images in HDFS and Hive, but I don't understand how to do it.
Can anyone help me with this?
Thank you in advance.
Storing files in HDFS is easy; see the put documentation:
Usage: hdfs dfs -put <localsrc> ... <dst>
You can write scripts to put the image files in place.
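If you would rather drive the upload from Java than from a shell script, a minimal sketch using the Hadoop FileSystem API (the paths are placeholders) might look like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: copy a local directory of images into HDFS,
// equivalent to "hdfs dfs -put <localsrc> <dst>".
public class PutImages {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);
        fs.copyFromLocalFile(new Path("/data/images"),        // placeholder local source
                             new Path("/user/hive/images"));  // placeholder HDFS destination
        fs.close();
    }
}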
There is another question that tells you how to do it with Hive: How to Store Binary Data in Hive?
I've seen some discussions online suggesting that storing the images in HDFS, and keeping the metadata and a link to the file in HBase, is a better solution than storing the images directly in HBase.
See following links for reference:
http://apache-hbase.679495.n3.nabble.com/Storing-images-in-Hbase-td4036184.html
http://www.quora.com/Is-HBase-appropriate-for-indexed-blob-storage-in-HDFS
https://www.linkedin.com/groups/What-is-best-NoSQL-DB-3638279.S.5866843079608131586
I'm making two Java applications: one to collect data, another to use it. The collecting one will be importing a file from the other, which will include data and images and will be decrypted.
I'm unsure what file type to use. So far all of the data is in XML and works great, but I need the images too, and I was hoping not to have to rely on shipping all the images in a folder with path references.
Ideas?
Well, I think the best way is to create your own format (.myformat or .data). This file will in fact be a ZIP file that contains your XML file and the images.
There is no perfect example written in Java as far as I know. However, here are some examples (a minimal Java sketch of the ZIP-container idea follows after them):
Not in Java
The best example is, as @Bolo said, the ODT format. Indeed, OpenOffice writes the document in an XML file, and the images too. All of that is wrapped up in an .odt file.
The .exe file is another example. The compiled C files and the resources are put into a single file. Try to open one with 7-Zip; you'll see.
The Skyrim plugins are .esp files that contain the DDS textures, the scripts, the NIF files...
In Java
Minecraft texture packs are ZIP files that contain a .mcmeta file (the metadata) and the textures (.png).
JAR files are like .exe files in this respect.
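As a rough illustration of the idea (the file names and the .myformat extension are placeholders), the container could be written with java.util.zip:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Sketch: bundle data.xml plus some images into a single "mydata.myformat"
// file, which is really just a ZIP archive with a custom extension.
public class FormatWriter {
    public static void main(String[] args) throws IOException {
        String[] contents = { "data.xml", "img/photo1.png", "img/photo2.png" }; // placeholders
        try (ZipOutputStream zip = new ZipOutputStream(new FileOutputStream("mydata.myformat"))) {
            for (String name : contents) {
                zip.putNextEntry(new ZipEntry(name));
                try (FileInputStream in = new FileInputStream(name)) {
                    in.transferTo(zip); // Java 9+; on older JDKs copy with a buffer loop
                }
                zip.closeEntry();
            }
        }
    }
}

The reading application can open the same file with java.util.zip.ZipFile and pull out data.xml and each image by entry name.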
If both programs are in Java you could also go with serialization, which is basically saving an object to a file (the suffix is usually .ser, I think) and then being able to read it back. You should google it; even if it doesn't help right now, it is quite good to know about.
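A very small sketch of that approach (the DataBundle class is invented for the example; both applications need the same class on their classpath):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Map;

// Invented container object holding the XML plus images as raw bytes.
class DataBundle implements Serializable {
    private static final long serialVersionUID = 1L;
    String xml;
    Map<String, byte[]> images; // file name -> image bytes
}

public class SerializationDemo {
    static void save(DataBundle bundle, String path) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(path))) {
            out.writeObject(bundle);
        }
    }

    static DataBundle load(String path) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(path))) {
            return (DataBundle) in.readObject();
        }
    }
}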
I'd suggest using JSON. Gson is a decent library.
You can embed images as byte arrays.
Save the serialized string in a file with a preferred extension, read it from the second application, de-serialize, and reconstruct images.
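Roughly, that exchange could look like the following sketch (the Payload class and file name are made up; by default Gson writes a byte[] as a JSON array of numbers):

import com.google.gson.Gson;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;

// Invented container: the collected data plus images as byte arrays.
class Payload {
    String title;
    Map<String, byte[]> images; // file name -> raw image bytes
}

public class JsonExchange {
    public static void main(String[] args) throws Exception {
        Gson gson = new Gson();

        // Application 1: serialize and write to a file with whatever extension you prefer.
        Payload out = new Payload();
        Files.write(Paths.get("export.data"),
                    gson.toJson(out).getBytes(StandardCharsets.UTF_8));

        // Application 2: read the file back and reconstruct the object (and the images).
        String json = new String(Files.readAllBytes(Paths.get("export.data")),
                                 StandardCharsets.UTF_8);
        Payload in = gson.fromJson(json, Payload.class);
    }
}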
You can convert binary image data to text with Base64 encoding; this way you can embed your images in the XML itself. See http://en.wikipedia.org/wiki/Base64
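For example, with java.util.Base64 (Java 8+); the element and file names are placeholders:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

public class Base64Embed {
    public static void main(String[] args) throws Exception {
        // Encode an image so it can sit inside an XML element as plain text.
        byte[] imageBytes = Files.readAllBytes(Paths.get("photo.png")); // placeholder file
        String encoded = Base64.getEncoder().encodeToString(imageBytes);
        String xml = "<image name=\"photo.png\">" + encoded + "</image>";

        // On the other side, decode the text back into the original bytes.
        byte[] decoded = Base64.getDecoder().decode(encoded);
        Files.write(Paths.get("photo_copy.png"), decoded);
    }
}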
I'm using a Hadoop MapReduce program, and I need to read multiple files and output them into multiple files.
Example:
Input: one.txt, two.txt, three.txt
Output: one_out.txt, two_out.txt
I need to get something like this. How can I achieve it?
Kindly help me.
Thanks
If the files are small, you can simply use FileInputFormat; Hadoop will internally spawn a separate mapper task for every file, which will eventually generate an output file for the corresponding input file (if there are no reducers involved).
If the files are huge, you need to write a custom input format whose isSplitable() method returns false. That ensures Hadoop does not split a file across mappers and does not generate multiple output files per input file; a minimal sketch is below.
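For the new mapreduce API, such an input format (the class name is made up) could be as small as this:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Keeps every input file in a single split, so one mapper reads one whole file
// and (with no reducers) produces one output file per input file.
public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}

In the driver you would then register it with job.setInputFormatClass(NonSplittableTextInputFormat.class).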
I've used Apache Flume to pipe a large number of tweets into HDFS. I was trying to do sentiment analysis on this data - just something simple to begin with, like a positive vs. negative word comparison.
My problem is that all the guides I find showing me how to do it have a text file of positive and negative words and then a huge text file with every tweet.
As I used Flume, all my data is already in Hadoop. When I access it using localhost:50070 I can see the data, in separate files according to month/day/hour, with each file containing three or four tweets. I have maybe 50 of these files for every hour. Although it doesn't say anywhere, I'm assuming they are in JSON format.
Bearing this in mind, how can I perform my analysis on them? In all the examples I've seen where the Mapper and Reducer have been written, there has been a single file the job is run on, not a large collection of small JSON files. What should my next step be?
This example should get you started:
https://github.com/cloudera/cdh-twitter-example
Basically, use a Hive external table to map your JSON data and query it using HiveQL.
When you want to process all the files in a directory, you can just specify the path of the directory as the input to your Hadoop job, so that it will consider all the files in that directory as its input.
For example, if your small files are in the directory /user/flume/tweets/...., then in your Hadoop job you can just specify /user/flume/tweets/ as your input path.
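In the driver that might look like the following sketch (the job name, output path and driver class are placeholders; mapper/reducer setup is omitted):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch of the driver: point the job at the whole tweets directory,
// so every small file under it becomes part of the job's input.
public class TweetJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJobName("tweet-sentiment");       // placeholder name
        job.setJarByClass(TweetJobDriver.class);
        // Mapper/Reducer classes would be set here as usual.
        FileInputFormat.addInputPath(job, new Path("/user/flume/tweets/"));
        FileOutputFormat.setOutputPath(job, new Path("/user/flume/tweets_out")); // placeholder
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}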
If you want to automate the analysis for every hour, you need to write an Oozie workflow.
You can refer to the link below for sentiment analysis in Hive:
https://acadgild.com/blog/sentiment-analysis-on-tweets-with-apache-hive-using-afinn-dictionary/