I am writing a MapReduce job to perform regex pattern matching on data that lives in HBase and HDFS.
My input file is a large CSV file that contains the keys used to fetch the unique data from HBase. This input file can contain duplicates.
My question:
In my main class, I want to read the input file, perform some processing, and hold the data in a HashMap before feeding it to the mapper class.
In all the examples I have seen, only a file path can be passed as input to the mapper class.
Is there a way to pass a HashMap to the mapper instead of a file?
Thank You
Pranay Vyas
Two things:
MapReduce works on data that is in HDFS, so your best choice is to save your map data as a file in HDFS and then run your MapReduce job over it.
However, since your data is coming from HBase, why not use that directly: read the data from HBase and perform your regex operations on it. Let me know if I missed something.
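If you go the HBase route, a minimal sketch of wiring a mapper directly to an HBase table might look like the following. The table name "my_table", the column family/qualifier "cf"/"col", and the regex are placeholders you would replace with your own:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HBaseRegexJob {

    // Mapper that receives one HBase row per call and applies the regex.
    static class RegexMapper extends TableMapper<Text, NullWritable> {
        @Override
        protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
                throws IOException, InterruptedException {
            // "cf" and "col" are placeholder column family/qualifier names.
            byte[] cell = row.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"));
            if (cell != null && Bytes.toString(cell).matches(".*pattern.*")) { // placeholder regex
                context.write(new Text(rowKey.copyBytes()), NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "hbase-regex-match");
        job.setJarByClass(HBaseRegexJob.class);

        Scan scan = new Scan();
        scan.setCaching(500);        // fetch rows in batches
        scan.setCacheBlocks(false);  // recommended for MR scans

        // "my_table" is a placeholder table name.
        TableMapReduceUtil.initTableMapperJob(
                "my_table", scan, RegexMapper.class,
                Text.class, NullWritable.class, job);

        job.setNumReduceTasks(0); // map-only sketch
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}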
My job is to get data from a CSV file between two dates. To do this, I first load all the data into a list of beans, then loop through the list and prepare my result.
The problem is that this process makes me loop through a lot of data.
I want a solution where I load only the necessary data from the CSV; ultimately I want to reduce the number of operations I am doing.
beansExtractedFromCsv = new CsvToBeanBuilder(new FileReader(s3StorageLocation.concat(s3FileName))).withType(CsvExtractTemplate.class).build().parse();
In this line, I am parsing all of the data from the CSV.
dataWriteOnCSVOrDB(fromTime, toTime, false, beansExtractedFromCsv);
Here I am passing all of the data extracted from the CSV to my method, where I loop through a lot of data to calculate my expected result.
There is another option in OpenCSV to read line by line, but I haven't tried that yet.
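One way to avoid materializing every row, assuming a reasonably recent OpenCSV version where CsvToBean is Iterable, is to iterate the beans lazily and keep only the rows in range. The getTimestamp() accessor and the comparison below are placeholders for whatever field CsvExtractTemplate actually exposes:

import java.io.FileReader;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

import com.opencsv.bean.CsvToBean;
import com.opencsv.bean.CsvToBeanBuilder;

// Sketch: stream beans one at a time instead of parse()-ing the whole file.
public class CsvRangeLoader {

    public static List<CsvExtractTemplate> loadBetween(String path, long fromTime, long toTime)
            throws Exception {
        List<CsvExtractTemplate> inRange = new ArrayList<>();
        try (Reader reader = new FileReader(path)) {
            CsvToBean<CsvExtractTemplate> csvToBean =
                    new CsvToBeanBuilder<CsvExtractTemplate>(reader)
                            .withType(CsvExtractTemplate.class)
                            .build();
            // The iterator reads and converts one line at a time, so rows outside the
            // date range are discarded without all beans being kept in memory.
            for (CsvExtractTemplate bean : csvToBean) {
                long ts = bean.getTimestamp(); // placeholder accessor, adjust to your bean
                if (ts >= fromTime && ts <= toTime) {
                    inRange.add(bean);
                }
            }
        }
        return inRange;
    }
}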
I need to do data reconciliation in Hadoop based on key comparisons. That means I will have old data in one folder and the newer data will be put into different folders. At the end of the batch I was planning to simply move the newer data to reside with the old data. The data consists of JSON files from which I have to extract the keys.
I'm taking my first steps with Hadoop, so I want to do this with a MapReduce program only, i.e. without tools such as Spark, Pig, or Hive. I was thinking of simply going through all the old data at the beginning of the program, before the Job object is created, and putting all the IDs into a Java HashMap that would be accessible from the mapper task. If a key is missing from the newer data, the mapper would output that. The reducer would concern itself with the categories of the missing IDs, but that's another story. After the job has finished, I would move the newer data into the old data's folder.
The only thing I find a bit clunky is the loading phase into the Java HashMap. It is probably not the most elegant solution, so I was wondering whether the MapReduce model has dedicated data structures or functionality for this kind of purpose (populating a global hash map with all the data from HDFS before the first map task is run)?
I think the HashMap solution is not a good idea. You can instead use multiple inputs for your job.
Depending on which input file a record comes from, the mapper can tell whether the data is new and write it out with a suitable value. The reducer then checks whether a key is contained only in the "new" input and, if so, writes it out.
So as the result of the job you will get only the new data.
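A minimal sketch of that idea, assuming the old and new datasets sit in two HDFS folders and that MultipleInputs from the Hadoop API is used to tag each side (the JSON key extraction is left as a placeholder):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReconciliationJob {

    // Tags every key from the "old" folder.
    static class OldDataMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(new Text(extractKey(line.toString())), new Text("OLD"));
        }
    }

    // Tags every key from the "new" folder.
    static class NewDataMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(new Text(extractKey(line.toString())), new Text("NEW"));
        }
    }

    // Emits only keys that never appeared on the "old" side.
    static class MissingKeyReducer extends Reducer<Text, Text, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> tags, Context ctx)
                throws IOException, InterruptedException {
            boolean seenOld = false;
            for (Text tag : tags) {
                if ("OLD".equals(tag.toString())) {
                    seenOld = true;
                    break;
                }
            }
            if (!seenOld) {
                ctx.write(key, NullWritable.get());
            }
        }
    }

    // Placeholder: extract the key from one JSON line.
    static String extractKey(String jsonLine) {
        return jsonLine;
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "reconciliation");
        job.setJarByClass(ReconciliationJob.class);
        MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, OldDataMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, NewDataMapper.class);
        job.setReducerClass(MissingKeyReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}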
So I have a MapReduce job that takes in multiple news articles and outputs the following key-value pairs:
.
.
.
<article_id, social_tag.name, social_tag.isCompany, social_tag.code>
<article_id2, social_tag2.name, social_tag2.isCompany, social_tag2.code>
<article_id, topic_code.name, topic_code.isCompany, topic_code.rcsCode>
<article_id3, social_tag3.name, social_tag3.isCompany, social_tag3.code>
<article_id2, topic_code2.name, topic_code2.isCompany, topic_code2.rcsCode>
.
.
.
As you can see, there are two different types of data rows that I am currently outputting, and right now they get mixed together in the flat files produced by MapReduce. Is there any way I can simply output social_tags to file1 and topic_codes to file2, or output social_tags to one group of files (social1.txt, social2.txt, etc.) and topic_codes to another group (topic1.txt, topic2.txt, etc.)?
The reason I'm asking is so that I can easily load all of this into Hive tables later on. I would prefer to have a separate table for each data type (topic_code, social_tag, etc.). If any of you know a simple way to achieve this without splitting the MapReduce output into different files, that would be really helpful too.
Thanks in advance!
You can use MultipleOutputs as already suggested.
Since you asked for a simple way to achieve this without splitting the MapReduce output into different files, here is a quick approach, provided the amount of data is not really huge and the logic to differentiate the records is not too complex.
First, load the mixed output file into a Hive table (say main_table). Then create two tables (topic_code and social_tag) and insert the data from the main table after filtering it with a where clause:
hive > insert into table topic_code
> select * from main_table
> where $condition;
// $condition = the logic you would use to differentiate the records in the MR job
I think you can try MultipleOutputs from the Hadoop API. MultipleOutputs allows you to write data to files whose names are derived from the output keys and values, or in fact from an arbitrary string. This allows each reducer (or mapper in a map-only job) to create more than a single file. File names are of the form name-m-nnnnn for map outputs and name-r-nnnnn for reduce outputs, where name is an arbitrary name that is set by the program and nnnnn is an integer designating the part number, starting from zero.
In the reducer, where we generate the output, we construct an instance of MultipleOutputs in the setup() method and assign it to an instance variable. We then use the MultipleOutputs instance in the reduce() method to write to the output, in place of the context. The write() method takes the key and value, as well as a name.
You can look into the below link for details
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html
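A minimal sketch of that pattern in a reducer, assuming Text keys/values and that each record carries a marker telling you whether it is a social_tag or a topic_code row (the startsWith check is a placeholder for your actual routing logic):

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class SplitOutputReducer extends Reducer<Text, Text, Text, NullWritable> {

    private MultipleOutputs<Text, NullWritable> mos;

    @Override
    protected void setup(Context context) {
        // Created once per reducer and reused for every reduce() call.
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // Placeholder check: route each record by whatever marks its type.
            if (value.toString().startsWith("social_tag")) {
                // Produces files named social-r-00000, social-r-00001, ...
                mos.write(value, NullWritable.get(), "social");
            } else {
                // Produces files named topic-r-00000, topic-r-00001, ...
                mos.write(value, NullWritable.get(), "topic");
            }
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Must be closed, otherwise the named outputs may be incomplete.
        mos.close();
    }
}

In the driver you would typically also call LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class) so that empty default part-r-* files are not created alongside the named outputs.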
So I have ten different files, where each file looks like this:
<DocID1> <RDF Document>
<DocID2> <RDF Document>
.
.
.
.
<DocID50000> <RDF Document>
There are actually ~56,000 lines per file. Each line contains a document ID and an RDF document.
My objective is to pass each <DocID, RDF Document> pair into a mapper as the input key-value pair and emit multiple output key-value pairs. In the reduce step, I will store these in a Hive table.
I have a couple of questions getting started and I am completely new to RDF/XML files.
How am I supposed to parse each line so that the document ID and the RDF document are separated and passed to each mapper?
Is there an efficient way of controlling the size of the input for the mapper?
1- If you are using TextInputFormat, each call to the mapper automatically receives one line as its value. Convert this line into a String and do the desired processing. Alternatively, you could make use of the Hadoop Streaming API with StreamXmlRecordReader. You have to provide the start and end tags, and all the information sandwiched between them will be fed to the mapper (in your case <DocID1> and <RDF Document>).
Usage :
hadoop jar hadoop-streaming.jar -inputreader "StreamXmlRecord,begin=DocID,end=RDF Document" ..... (rest of the command)
2- Why do you need that? Your goal is to feed one complete line to a mapper, and that is the job of the InputFormat you are using. If you still need it, you will have to write custom code, and for this particular case it is going to be a bit tricky.
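For the TextInputFormat route, a minimal mapper sketch, assuming the document ID and the RDF document are separated by a tab on each line (adjust the split to your actual delimiter):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Each call receives one line: "<DocID> <RDF Document>".
public class RdfLineMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Text docId = new Text();
    private final Text rdfDoc = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Assumption: the ID and the RDF payload are tab-separated.
        String[] parts = line.toString().split("\t", 2);
        if (parts.length == 2) {
            docId.set(parts[0]);
            rdfDoc.set(parts[1]);
            // Emit one or more key-value pairs per document; here just one.
            context.write(docId, rdfDoc);
        }
    }
}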
I have a Map<String, Object>. Here the Object (the map's value) can be either a String or a Map<String, String>.
So if I split up the Map, it could be something like this:
Map<String, Map<String, String>> and Map<String, String>
I want to write this Map to HDFS as the value of a SequenceFile key-value pair, so I need to make it Writable. I have written a class for this, but it gives me issues when I write multiple records: while reading them back, values from adjacent records get mixed up.
Please suggest a solution to this problem, or sample code to make this nested Map writable.
You don't have to make your map Writable; you can just serialize it to bytes with a serialization framework of your choice, such as Java serialization or protobuf, and write the bytes to your sequence file.
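A minimal sketch of that idea using plain Java serialization and a BytesWritable value (the HDFS path and the Text key are placeholders):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class NestedMapSequenceFileWriter {

    // Serializes the nested map to a byte array with standard Java serialization.
    static byte[] toBytes(Map<String, Object> map) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject((Serializable) map); // HashMap and String are Serializable
        }
        return bos.toByteArray();
    }

    public static void write(Configuration conf, Map<String, Object> map) throws IOException {
        SequenceFile.Writer writer = null;
        try {
            writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(new Path("/tmp/nested-maps.seq")), // placeholder path
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class));
            byte[] payload = toBytes(map);
            // Each record is self-contained, so adjacent records cannot bleed into each other.
            writer.append(new Text("record-1"), new BytesWritable(payload));
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}

Reading is the mirror image: read the BytesWritable, copy only the first getLength() bytes of getBytes() (the backing buffer can be larger than the record), and deserialize with ObjectInputStream. That getBytes()/getLength() detail is a common cause of adjacent records appearing to mix together.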