I'm using a Hadoop MapReduce program, and I need to read multiple files and produce a separate output file for each of them.
Example:
Input: one.txt, two.txt, three.txt
Output: one_out.txt, two_out.txt
I need to get something like this. How can I achieve it?
Kindly help me
Thanks
If the files are small, you can simply use FileInputFormat; Hadoop will internally spawn a separate mapper task for every file, which will eventually produce one output file per input file (if there are no reducers involved).
If a file is huge, you need to write a custom input format that overrides isSplitable() to return false. This ensures that Hadoop does not split your file across mappers and does not generate multiple output files per input file.
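A minimal sketch of such an input format using the new MapReduce API (the class name WholeFileTextInputFormat is just a placeholder; it reuses TextInputFormat's line-oriented record reader but refuses to split):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // Keeps each input file in a single split, so one mapper handles the whole file
    // and (with zero reducers) writes exactly one output file per input file.
    public class WholeFileTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false; // never split, regardless of file size
        }
    }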
I am learning to build neural nets and I came across this code on GitHub:
https://github.com/PavelJunek/back-propagation-java
A training set and a validation set are required, but I don't know where to supply the files. The README doesn't quite explain how to use them. How do I test this code with the different CSV files I have?
How so? It tells you exactly what to do. The program needs two CSV files: one containing all the training data and a second containing all the validation data.
If you have a look at the Program.java file (in the main method), you'll see that you need to pass both files as arguments on the command line.
I am new to Hadoop and have come across a few sequence files. From what I have read, there are three ways to create a sequence file. Now that I have one, how do I know what kind of sequence file it is? How do I read its meta information? I need this because I have been given a sequence file and am expected to create a similar one.
Is there any Hadoop command I can use to check this information?
SequenceFile is a flat file consisting of binary key/value pairs.
The SequenceFile.Reader acts as a bridge and can read any of the
SequenceFile formats.
You don't need to tell SequenceFile.Reader which SequenceFile format it is; the reader instance works these details out on its own and decompresses the file with whatever codec is recorded in the file header.
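For instance, a small sketch (the path is hypothetical) that opens a sequence file and prints the header information you are after: key/value classes, compression type, codec and metadata:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;

    public class SeqFileInfo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path path = new Path("/user/hadoop/sample.seq"); // hypothetical path
            try (SequenceFile.Reader reader =
                     new SequenceFile.Reader(conf, SequenceFile.Reader.file(path))) {
                System.out.println("Key class:        " + reader.getKeyClassName());
                System.out.println("Value class:      " + reader.getValueClassName());
                System.out.println("Compression type: " + reader.getCompressionType());
                System.out.println("Codec:            " + reader.getCompressionCodec());
                System.out.println("Metadata:         " + reader.getMetadata());
            }
        }
    }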
Check out examples here:
Reading and Writing Sequencefile using Hadoop 2.0 Apis
Reading and Writing SequenceFile Example
I have a file structure like this.
a.zip contains a1.zip, a2.zip and a3.zip, and each of these zip files contains one XML file.
I need to process these XML files. Currently I am extracting the zipped files from a.zip, storing them in HDFS and running an MR job to process a1.zip, a2.zip ..... using a custom input format and record reader.
Can anyone help me with a better solution where I don't have to unzip a.zip and can still process the files in parallel?
Why don't you write a normal Java pre-processor class which you can call from the main program? The steps would be:
1) The pre-processor class programmatically extracts the a.zip file into a temp location.
2) It programmatically adds the child zip files to HDFS (a rough sketch of steps 1 and 2 follows below).
3) Fire the XML processing in the way you are doing it now.
4) If you wish, you can extend the pre-processor class to place the XML files directly, so that the XML-processing program stays simpler.
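Here is a rough sketch of steps 1 and 2 (class and method names are just placeholders), using java.util.zip and the HDFS FileSystem API:

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Enumeration;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipFile;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class ZipPreprocessor {
        // Extracts the inner zips (a1.zip, a2.zip, ...) from the outer archive
        // and copies each of them into an HDFS directory.
        public static void extractToHdfs(String outerZip, String hdfsDir, Configuration conf)
                throws IOException {
            FileSystem fs = FileSystem.get(conf);
            try (ZipFile zip = new ZipFile(outerZip)) {
                Enumeration<? extends ZipEntry> entries = zip.entries();
                while (entries.hasMoreElements()) {
                    ZipEntry entry = entries.nextElement();
                    if (entry.isDirectory()) continue;
                    Path target = new Path(hdfsDir, entry.getName());
                    try (InputStream in = zip.getInputStream(entry)) {
                        IOUtils.copyBytes(in, fs.create(target), conf, true);
                    }
                }
            }
        }
    }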
Let me know if something is not clear here.
I have some Pig output files and want to read them on another machine (without a Hadoop installation). I just want to read tab-separated plain-text lines and parse them into Java objects. I am guessing we should be able to use pig.jar as a dependency and read them that way, but I could not find relevant documentation. I think this class could be used? How can we provide the schema as well?
I suggest storing the data in the Avro serialization format. It is Pig-independent and it can handle complex data structures like the ones you described (so you don't need to write your own parser). See this article for examples.
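For example, assuming the Pig job stores its output with AvroStorage, the files can then be read on any machine with just the Avro jar. A minimal sketch (the file name is hypothetical); note that the schema travels inside the Avro file, so nothing extra has to be provided:

    import java.io.File;

    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.DatumReader;

    public class AvroOutputReader {
        public static void main(String[] args) throws Exception {
            DatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
            try (DataFileReader<GenericRecord> fileReader =
                     new DataFileReader<>(new File("part-m-00000.avro"), datumReader)) {
                // The schema is embedded in the file itself.
                System.out.println("Schema: " + fileReader.getSchema());
                for (GenericRecord record : fileReader) {
                    System.out.println(record);
                }
            }
        }
    }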
Your Pig output files are just text files, right? Then you don't need any Pig or Hadoop jars.
The last time I worked with Pig was on Amazon's EMR platform, and the output files were stashed in an S3 bucket. They were just text files, and standard Java can read them in.
That class you referenced is for reading data into Pig from some text format.
Are you asking for a library to parse the Pig data model into Java objects, i.e. the text representation of tuples, bags, etc.? If so, it's probably easier to write it yourself. It's a very simple data model with only three-ish data types.
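If the output really is plain tab-separated text (the default PigStorage format), a simple sketch in plain Java could look like this; the file name and the target object are hypothetical:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Arrays;

    public class PigOutputReader {
        public static void main(String[] args) throws IOException {
            // PigStorage writes one tuple per line, with fields separated by tabs.
            try (BufferedReader reader = Files.newBufferedReader(Paths.get("part-r-00000"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] fields = line.split("\t", -1);
                    // Map the fields onto your own object here, e.g.
                    // MyRecord r = new MyRecord(fields[0], Long.parseLong(fields[1]));
                    System.out.println(Arrays.toString(fields));
                }
            }
        }
    }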
I've used Apache Flume to pipe a large amount of tweets into HDFS. I was trying to do sentiment analysis on this data - just something simple to begin with, like a positive vs. negative word comparison.
My problem is that all the guides I find that show how to do it use a text file of positive and negative words and then a single huge text file containing every tweet.
As I used Flume, all my data is already in Hadoop. When I access it using localhost:50070 I can see the data, in separate files according to month/day/hour, with each file containing three or four tweets. I have maybe 50 of these files for every hour. Although it doesn't say anywhere, I'm assuming they are in JSON format.
Bearing this in mind, how can I perform my analysis on them? In all the examples I've seen where a Mapper and Reducer are written, the job runs over a single file, not a large collection of small JSON files. What should my next step be?
This example should get you started:
https://github.com/cloudera/cdh-twitter-example
Basically, use a Hive external table to map your JSON data and query it using HiveQL.
When you want to process all the files in a directory, you can just specify the directory's path as the input to your Hadoop job, and it will consider all the files in that directory as its input.
For example, if your small files are in the directory /user/flume/tweets/...., then in your Hadoop job you can just specify /user/flume/tweets/ as the input path.
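In the job driver that could look roughly like this (paths and class name are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class TweetJobDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "tweet sentiment");
            // Pointing the input at a directory makes the job read every file inside it.
            FileInputFormat.addInputPath(job, new Path("/user/flume/tweets/"));
            // If the files sit in nested month/day/hour subdirectories, either pass a glob
            // like /user/flume/tweets/*/*/* or enable recursive traversal:
            // FileInputFormat.setInputDirRecursive(job, true);
            FileOutputFormat.setOutputPath(job, new Path("/user/flume/tweets_out"));
            // ... set mapper/reducer classes here ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }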
If you want to automate the analysis so that it runs every hour, you need to write an Oozie workflow.
You can refer to the link below for sentiment analysis in Hive:
https://acadgild.com/blog/sentiment-analysis-on-tweets-with-apache-hive-using-afinn-dictionary/