Hadoop MapReduce with RDF/XML files - java

So I have ten different files, where each file looks like this:
<DocID1> <RDF Document>
<DocID2> <RDF Document>
.
.
.
.
<DocID50000> <RDF Document>
There are actually ~56,000 lines per file. There is a document ID in each line and an RDF document.
My objective is to pass each line into a mapper as the input key/value pair and emit multiple output key/value pairs. In the reduce step, I will store these in a Hive table.
I have a couple of questions to get started, as I am completely new to RDF/XML files.
How am I supposed to parse each line of the file to get the document ID and the RDF document separately to pass to each mapper?
Is there an efficient way of controlling the size of the input for the mapper?

1- If you are using TextInputFormat, you automatically get one line (one record) as the value in each map call. Convert this line into a String and do the desired processing. Alternatively, you could make use of the Hadoop Streaming API by using StreamXmlRecordReader. You have to provide the start and end tags, and all the information sandwiched between them will be fed to the mapper (in your case <DocID1> and <RDF Document>).
Usage :
hadoop jar hadoop-streaming.jar -inputreader "StreamXmlRecord,begin=DocID,end=RDF Document" ..... (rest of the command)
2- Why do you need that? Your goal is to feed one complete line to a mapper; that is the job of the InputFormat you are using. If you still need it, you have to write custom code for this, and for this particular case it is going to be a bit tricky.
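For the TextInputFormat route, here is a minimal mapper sketch, assuming each line is the document ID followed by the RDF/XML payload separated by whitespace (the separator and class names are assumptions, not from the original post):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RdfLineMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Split the line into the document ID and the rest (the RDF/XML document).
        String[] parts = line.toString().split("\\s+", 2);
        if (parts.length < 2) {
            return; // skip malformed lines
        }
        String docId = parts[0];
        String rdfXml = parts[1];

        // Parse rdfXml here and emit as many output key/value pairs as needed.
        context.write(new Text(docId), new Text(rdfXml));
    }
}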

Related

Giving Hashmap as input to Mapper instead of a file

I am writing MR code to perform a regex pattern match on data that is available in HBase and HDFS.
My input file is a large CSV file that has the keys to fetch the unique data from HBase. This input file can have duplicates.
My question -
In my main class, I want to read the input file, perform some processing, and hold the data in a HashMap before feeding it to the mapper class.
In all the examples I have seen, we can only give a file path as input to the mapper class.
Is there a way to pass a HashMap to the mapper instead of a file?
Thank You
Pranay Vyas
Two things:
MapReduce works on data that lives in HDFS, so your best choice is to save your map data as a file in HDFS and then move on to MapReduce.
However, since your data is coming from HBase, why not use that directly: read the data from HBase and perform your regex operations on it. Let me know if I missed something.
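As a sketch of the first suggestion, the driver could write the deduplicated HashMap out to HDFS before submitting the job and point the job's input path at that file. The path and the tab-separated layout below are assumptions, not anything from the original question:

import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HashMapToHdfs {
    public static void main(String[] args) throws Exception {
        // Build the deduplicated map from the CSV in the driver (details omitted).
        Map<String, String> lookup = new HashMap<>();
        lookup.put("rowKey1", "someValue");

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path; the MR job would then use this file as its input.
        Path out = new Path("/tmp/lookup-input/part-00000");
        try (BufferedWriter writer = new BufferedWriter(
                new OutputStreamWriter(fs.create(out, true), StandardCharsets.UTF_8))) {
            for (Map.Entry<String, String> e : lookup.entrySet()) {
                writer.write(e.getKey() + "\t" + e.getValue());
                writer.newLine();
            }
        }
    }
}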

Hadoop - Merge reducer outputs to a single file using Java

I have a Pig script that generates some output to an HDFS directory. The Pig script also generates a SUCCESS file in the same HDFS directory. The output of the Pig script is split into multiple parts, since the number of reducers to use in the script is defined via 'SET default_parallel n;'.
I would now like to use Java to concatenate/merge all the file parts into a single file. I obviously want to ignore the SUCCESS file while concatenating. How can I do this in Java?
Thanks in advance.
You can use getmerge through the shell to merge multiple files into a single file.
Usage: hdfs dfs -getmerge <srcdir> <destinationdir/file.txt>
Example: hdfs dfs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt
In case you don't want to use a shell command to do it, you can write a Java program and use the FileUtil.copyMerge method to merge the output files into a single file. The implementation details are available in this link.
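As an alternative to copyMerge, if you need to skip the success marker explicitly, here is a minimal manual-merge sketch using the Hadoop FileSystem API (the paths are placeholders, not from the original question):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class MergeParts {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path srcDir = new Path("/output/dir/on/hdfs");  // Pig output directory
        Path dstFile = new Path("/merged/output.txt");  // merged result

        try (FSDataOutputStream out = fs.create(dstFile, true)) {
            for (FileStatus status : fs.listStatus(srcDir)) {
                String name = status.getPath().getName();
                // Skip the success marker and any hidden/system files.
                if (name.startsWith("_") || name.startsWith(".")) {
                    continue;
                }
                try (FSDataInputStream in = fs.open(status.getPath())) {
                    // false = don't close the streams here; try-with-resources handles that.
                    IOUtils.copyBytes(in, out, conf, false);
                }
            }
        }
    }
}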
If you want a single output on HDFS itself through Pig, then you need to pass everything through a single reducer. You do that by setting the number of reducers to 1, putting the line below at the start of your script.
--Assigning only one reducer in order to generate only one output file.
SET default_parallel 1;
I hope this will help you.
The reason this does not seem easy to do is that there would typically be little purpose in it. If I have a very large cluster and I am really dealing with a Big Data problem, my output as a single file would probably not fit onto any single machine.
That being said, I can see use cases such as metrics collection, where maybe you just want to output some metrics about your data, like counts.
In that case I would first run your MapReduce program, then create a second MapReduce job that reads the data and sends all the elements to the same single reducer by using a static key with your reduce function.
Or you could also just have your original program use a single reducer with
job.setNumReduceTasks(1);
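Going back to the second-job idea, here is a minimal sketch with hypothetical class names; every map call emits the same constant key, so all records land in one reducer and therefore in one output file:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SingleFileJob {

    // Every input line is emitted under the same static key.
    public static class StaticKeyMapper extends Mapper<LongWritable, Text, Text, Text> {
        private static final Text STATIC_KEY = new Text("all");

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(STATIC_KEY, value);
        }
    }

    // The single reducer sees every record and writes them all out in one place.
    public static class PassThroughReducer extends Reducer<Text, Text, NullWritable, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text value : values) {
                context.write(NullWritable.get(), value);
            }
        }
    }
}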

Avoiding file collisions in Hadoop Pig script that writes multiple output files

I'm writing a Pig script that looks as follows:
...
myGroup = group simplifiedJoinData by (dir1, dir2, dir3, dir4);
betterGroup = foreach myGroup {
value1Value2 = foreach simplifiedJoinedGroup generate value1, value2;
distinctValue1Value2 = DISTINCT value1Value2; generate group, distinctValue1Value2;
}
store betterGroup into '/myHdfsPath/myMultiStorageTest' using MyMultiStorage('output', '0', 'none' );
Please note that the schema of simplifiedJoinData is simplifiedJoinedGroup: {dir1: long,dir2: long,dir3: chararray,dir4: chararray,value1: chararray,value2: chararray}
It uses a custom storage class (MyMultiStorage - basically a modified version of MultiStorage in the piggybank) that writes multiple output files. The custom storage class expects that the values passed to it are in the following format:
{group:(dir1:long,dir2:long,dir3:chararray,dir4:chararray), bag:{(value1:chararray,value2:chararray)}}
What I'd like the custom storage class to do is output multiple files as follows:
dir1/dir2/dir3/dir4/value1_values.txt
dir1/dir2/dir3/dir4/value2_values.txt
where the value1_values.txt contains all the value1 values and value2_values.txt contains all the value2 values. Ideally I would prefer not to write multiple part files that have to be combined later (Note that the example has been simplified for the purposes of this discussion. The real output files are binary structures that can't be combined with a simple cat). I have this working for small data sets; however, when I run with larger data sets, I run into issues where I get exceptions in Hadoop that the output file name already exists or that it is already being created:
java.io.IOException: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException
I suspect that this is because multiple mappers or reducers are attempting to write the same file, and I am not using part IDs in the filename as PigStorage does. However, I would have expected that by grouping the data, I'd only have one record for each dir1, dir2, dir3, dir4 combination, and, as such, only one mapper or reducer would be attempting to write a particular file for a given run. I've tried running without speculative execution for both map and reduce tasks, but that seems to have had no effect. Clearly I don't understand what's going on here.
My question is: Why am I getting the AlreadyBeingCreatedException?
If there is no way for me to have a single reducer write all the data for each record, it would be acceptable to write multiple part output files in a directory (one per reducer) and combine them after the fact. It just wouldn't be ideal. However, as of yet, I have not been able to determine the proper way to have the custom storage class generate a unique filename, and I still end up with multiple reducers trying to create/write the same file. Is there a particular method in the job configuration or context that would allow me to coordinate parts across the job?
Thanks in advance for any help you can provide.
Turns out that there was a condition where I was generating the same file name due to a tuple parsing error. I was getting the AlreadyBeingCreatedException for that exact reason.
Nothing wrong with the custom store function, or approaching the problem in this manner. Just a silly mistake on my part!
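For reference, if the collisions had been real, a common way to make per-task file names unique (roughly what PigStorage does with its part IDs) is to fold the task ID from the TaskAttemptContext into the name. A small sketch under that assumption, with hypothetical method and path names:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class UniqueFileNames {
    // Builds e.g. ".../dir1/dir2/dir3/dir4/value1_values-r-00003.txt" so that
    // two tasks can never try to create the same file.
    public static Path uniquePath(TaskAttemptContext context, String subDirs, String prefix) {
        int taskId = context.getTaskAttemptID().getTaskID().getId();
        Path outputDir = FileOutputFormat.getOutputPath(context);
        String fileName = String.format("%s-r-%05d.txt", prefix, taskId);
        return new Path(new Path(outputDir, subDirs), fileName);
    }
}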

Writing Hadoop MapReduce output to just 2 flat files

So I have a MapReduce job that takes in multiple news articles and outputs the following key value pairs.
.
.
.
<article_id, social_tag.name, social_tag.isCompany, social_tag.code>
<article_id2, social_tag2.name, social_tag2.isCompany, social_tag.code>
<article_id, topic_code.name, topic_code.isCompany, topic_code.rcsCode>
<article_id3, social_tag3.name, social_tag3.isCompany, social_tag.code>
<article_id2, topic_code2.name, topic_code2.isCompany, topic_code2.rcsCode>
.
.
.
As you can see, there are two main types of data rows that I am currently outputting, and right now these get mixed up in the flat files output by MapReduce. Is there any way I can simply output social_tags to file1 and topic_codes to file2, OR maybe output social_tags to a specified group of files (social1.txt, social2.txt, etc.) and topic_codes to another group (topic1.txt, topic2.txt, etc.)?
The reason I'm asking this is so that I can easily store all of these into Hive tables later on. I would prefer to have a separate table for each data type (topic_code, social_tag, etc.). If any of you know a simple way to achieve this without separating the MapReduce output into different files, that would be really helpful too.
Thanks in advance!
You can use MultipleOutputs as already suggested.
Since you asked for a simple way to achieve this without separating the MapReduce output into different files, here is a quick way, provided the amount of data is not really huge and the logic to differentiate the data is not too complex.
First load the mixed output file into a Hive table (say main_table). Then you can create two different tables (topic_code, social_tag) and insert the data from the main table after filtering it with a where clause.
hive > insert into table topic_code
> select * from main_table
> where $condition;
// $condition = the logic you would use to differentiate the records in the MR job
I think you can try MultipleOutputs, present in the Hadoop API. MultipleOutputs allows you to write data to files whose names are derived from the output keys and values, or in fact from an arbitrary string. This allows each reducer (or mapper in a map-only job) to create more than a single file. File names are of the form name-m-nnnnn for map outputs and name-r-nnnnn for reduce outputs, where name is an arbitrary name that is set by the program, and nnnnn is an integer designating the part number, starting from zero.
In the reducer, where we generate the output, we construct an instance of MultipleOutputs in the setup() method and assign it to an instance variable. We then use the MultipleOutputs instance in the reduce() method to write to the output, in place of the context. The write() method takes the key and value, as well as a name.
You can look into the link below for details:
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html
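A minimal reducer sketch using the new-API MultipleOutputs (org.apache.hadoop.mapreduce.lib.output.MultipleOutputs); the named outputs "social" and "topic" and the way the record type is detected are assumptions for illustration:

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class TagReducer extends Reducer<Text, Text, NullWritable, Text> {

    private MultipleOutputs<NullWritable, Text> multipleOutputs;

    @Override
    protected void setup(Context context) {
        multipleOutputs = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // Hypothetical check to tell the two record types apart.
            String name = value.toString().contains("social_tag") ? "social" : "topic";
            // Produces files named social-r-00000, topic-r-00000, and so on.
            multipleOutputs.write(name, NullWritable.get(), value);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        multipleOutputs.close();
    }
}

In the driver you would also register each named output, e.g. MultipleOutputs.addNamedOutput(job, "social", TextOutputFormat.class, NullWritable.class, Text.class).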

How do I output whole files from a map job?

This is a basic question about mapreduce outputs.
I'm trying to create a map function that takes in an XML file and makes a PDF using Apache FOP. However, I'm a little confused as to how to output it, since I know that it goes out as a (key, value) pair.
I'm also not using streaming to do this.
The point of MapReduce is to tackle large amounts of data that would usually not fit in memory, so input and output are usually stored on disk somehow (a.k.a. files).
Input-output must be specified in key-value format
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
I have not tried this, but this is what I would do:
Write the output of the mapper in this form: the key is the filename as Text (keep the filename unique) and the value is the output of FOP. Write it using TextOutputFormat.
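One hedged sketch of how the map side could look: a map-only job where each map call runs FOP on the XML value (assumed here to already be XSL-FO) and writes the resulting PDF straight to HDFS using the key as the file name, so the binary payload never goes through a text-based output format. The input format producing (Text, Text) pairs, the output path, and the class names are all assumptions:

import java.io.File;
import java.io.IOException;
import java.io.StringReader;

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.stream.StreamSource;

import org.apache.fop.apps.Fop;
import org.apache.fop.apps.FopFactory;
import org.apache.fop.apps.MimeConstants;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class XmlToPdfMapper extends Mapper<Text, Text, Text, NullWritable> {

    private FopFactory fopFactory;
    private FileSystem fs;

    @Override
    protected void setup(Context context) throws IOException {
        // FOP 2.x style factory; adjust to your FOP version.
        fopFactory = FopFactory.newInstance(new File(".").toURI());
        fs = FileSystem.get(context.getConfiguration());
    }

    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        // Hypothetical output location; the key is assumed to be a unique document name.
        Path pdfPath = new Path("/pdf-output/" + key.toString() + ".pdf");
        try (FSDataOutputStream out = fs.create(pdfPath, true)) {
            Fop fop = fopFactory.newFop(MimeConstants.MIME_PDF, out);
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.transform(new StreamSource(new StringReader(value.toString())),
                    new SAXResult(fop.getDefaultHandler()));
        } catch (Exception e) {
            throw new IOException("FOP transformation failed for " + key, e);
        }
        // Emit only the path, so the job's own output stays small and text-safe.
        context.write(new Text(pdfPath.toString()), NullWritable.get());
    }
}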
Suggestion:
I am assuming that your use case is just reading input XML (maybe doing some operation on its data) and writing the data to PDF files using FOP. I don't think this is a Hadoop use case in the first place, because whatever you want to do can be done by a batch script. How big are your XML files? How many XML files do you have to process?
EDIT:
SequenceFileOutputFormat writes to a SequenceFile. A SequenceFile has its own headers and other metadata along with the text that is stored. It also stores data in the form of key/value pairs.
SequenceFile Common Header
version - A byte array: 3 bytes of magic header 'SEQ', followed by 1 byte of actual version no. (e.g. SEQ4 or SEQ6)
keyClassName - String
valueClassName - String
compression - A boolean which specifies if compression is turned on for keys/values in this file.
blockCompression - A boolean which specifies if block compression is turned on for keys/values in this file.
compressor class - The classname of the CompressionCodec which is used to compress/decompress keys and/or values in this SequenceFile (if compression is enabled).
metadata - SequenceFile.Metadata for this file (key/value pairs)
sync - A sync marker to denote end of the header.
Using SequenceFile will ruin your application, as you will end up with corrupted output PDF files. Try it out and see for yourself.
You have lots of input files, and this is where Hadoop sucks (read this). Still, I feel that you can do your desired operation using a script to invoke FOP on every document one by one. If you have multiple nodes, run the same script on different subsets of the input documents. Trust me, this will run FASTER than Hadoop, considering the overhead involved in creating maps and reduces (you don't need reduces, I know).
