The output files produced by my Reduce operation are huge (1 GB after gzipping). I want to break the output into smaller files of about 200 MB. Is there a property/Java class to split the reduce output by size or by number of lines?
I cannot increase the number of reducers, because that has a negative impact on the performance of the Hadoop job.
I'm curious as to why you cannot just use more reducers, but I will take you at your word.
One option is to use MultipleOutputs and write multiple files from one reducer. For example, say that the output file for each reducer is 1 GB and you want 256 MB files instead. This means you need to write 4 files per reducer rather than one.
In your job driver, do this:
JobConf conf = ...;
// You should probably pass this in as parameter rather than hardcoding 4.
conf.setInt("outputs.per.reducer", 4);
// This sets up the infrastructure to write multiple files per reducer.
MultipleOutputs.addMultiNamedOutput(conf, "multi", YourOutputFormat.class, YourKey.class, YourValue.class);
In your reducer, do this:
@Override
public void configure(JobConf conf) {
numFiles = conf.getInt("outputs.per.reducer", 1);
multipleOutputs = new MultipleOutputs(conf);
// other init stuff
...
}
@Override
public void reduce(YourKey key,
                   Iterator<YourValue> valuesIter,
                   OutputCollector<OutKey, OutVal> ignoreThis,
                   Reporter reporter) throws IOException {
// Do your business logic just as you're doing currently.
OutKey outputKey = ...;
OutVal outputVal = ...;
// Now this is where it gets interesting. Hash the value to find
// which output file the data should be written to. Don't use the
// key since all the data will be written to one file if the number
// of reducers is a multiple of numFiles.
int fileIndex = (outputVal.hashCode() & Integer.MAX_VALUE) % numFiles;
// Now use multiple outputs to actually write the data.
// This will create output files named: multi_0-r-00000, multi_1-r-00000,
// multi_2-r-00000, multi_3-r-00000 for reducer 0. For reducer 1, the files
// will be multi_0-r-00001, multi_1-r-00001, multi_2-r-00001, multi_3-r-00001.
multipleOutputs.getCollector("multi", Integer.toString(fileIndex), reporter)
    .collect(outputKey, outputVal);
}
@Override
public void close() throws IOException {
    // You must do this!!!!
    multipleOutputs.close();
}
This pseudo-code was written with the old mapred API in mind. Equivalent APIs exist in the new mapreduce API, though, so either way you should be all set.
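For reference, here is a minimal sketch of the same idea with the new mapreduce API, assuming Text keys and values (adjust the types to your job); the named output "multi" and the "outputs.per.reducer" parameter mirror the example above:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MultiFileReducer extends Reducer<Text, Text, Text, Text> {
    private MultipleOutputs<Text, Text> mos;
    private int numFiles;

    // Driver side (for reference):
    //   job.getConfiguration().setInt("outputs.per.reducer", 4);
    //   MultipleOutputs.addNamedOutput(job, "multi", TextOutputFormat.class, Text.class, Text.class);

    @Override
    protected void setup(Context context) {
        numFiles = context.getConfiguration().getInt("outputs.per.reducer", 1);
        mos = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // Same hashing trick as above: spread records across numFiles outputs.
            int fileIndex = (value.hashCode() & Integer.MAX_VALUE) % numFiles;
            // Produces files like multi0-r-00000, multi1-r-00000, ... per reducer.
            mos.write("multi", key, value, "multi" + fileIndex);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();   // required, just like in the old API
    }
}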
There's no property to do this. You'll need to write your own output format & record writer.
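If you do go that route, the following is a rough sketch (new mapreduce API) of a size-splitting output format. The class name SizeSplittingTextOutputFormat and the property output.max.bytes.per.file are made up for illustration, the size is measured on the uncompressed text rather than the gzipped result, and for simplicity it writes straight to the output directory instead of going through the OutputCommitter's work path:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SizeSplittingTextOutputFormat extends FileOutputFormat<Text, Text> {

    @Override
    public RecordWriter<Text, Text> getRecordWriter(TaskAttemptContext context) throws IOException {
        Configuration conf = context.getConfiguration();
        long maxBytes = conf.getLong("output.max.bytes.per.file", 200L * 1024 * 1024);
        Path dir = getOutputPath(context);
        FileSystem fs = dir.getFileSystem(conf);
        String base = getUniqueFile(context, "part", "");   // e.g. part-r-00000

        return new RecordWriter<Text, Text>() {
            private int fileIndex = 0;
            private long written = 0;
            private FSDataOutputStream out = fs.create(new Path(dir, base + "-" + fileIndex));

            @Override
            public void write(Text key, Text value) throws IOException {
                if (written >= maxBytes) {
                    // Roll over to the next file once the threshold is crossed.
                    out.close();
                    fileIndex++;
                    written = 0;
                    out = fs.create(new Path(dir, base + "-" + fileIndex));
                }
                byte[] line = (key + "\t" + value + "\n").getBytes(StandardCharsets.UTF_8);
                out.write(line);
                written += line.length;
            }

            @Override
            public void close(TaskAttemptContext c) throws IOException {
                out.close();
            }
        };
    }
}

In the driver you would then set job.setOutputFormatClass(SizeSplittingTextOutputFormat.class) and, if needed, conf.setLong("output.max.bytes.per.file", ...).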
Related
I am using the readCsvFile(path) function in the Apache Flink API to read a CSV file and store it in a list variable. How does it work with multiple threads?
For example, does it split the file based on some statistics? If yes, what statistics? Or does it read the file line by line and then send the lines to the threads to process?
Here is the sample code:
//default parallelism is 4
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
csvPath="data/weather.csv";
List<Tuple2<String, Double>> csv= env.readCsvFile(csvPath)
.types(String.class,Double.class)
.collect();
Suppose we have an 800 MB CSV file on local disk; how does it distribute the work between those 4 threads?
The readCsvFile() API method internally creates a data source with a CsvInputFormat which is based on Flink's FileInputFormat. This InputFormat generates a list of so-called InputSplits. An InputSplit defines which range of a file should be scanned. The splits are then distributed to data source tasks.
So, each parallel task scans a certain region of a file and parses its content. This is very similar to how it is done by MapReduce / Hadoop.
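For intuition, here is a rough, self-contained illustration (not Flink's actual code; class name and numbers are just for the example above) of how a FileInputFormat-style source carves one file into byte-range splits that are handed to the parallel source tasks:

public class SplitSketch {
    public static void main(String[] args) {
        long fileLength = 800L * 1024 * 1024;   // ~800 MB file
        int numSplits = 4;                      // one split per parallel source task
        long splitSize = fileLength / numSplits;
        for (int i = 0; i < numSplits; i++) {
            long start = i * splitSize;
            long length = (i == numSplits - 1) ? fileLength - start : splitSize;
            // Each task scans [start, start+length). A task not starting at offset 0
            // first skips ahead to the next line delimiter; a task ending mid-record
            // reads past its range end to finish that record.
            System.out.printf("split %d: start=%d, length=%d%n", i, start, length);
        }
    }
}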
This is the same as How does Hadoop process records split across block boundaries?
I extracted some code from the flink-release-1.1.3 DelimitedInputFormat file.
// else ..
int toRead;
if (this.splitLength > 0) {
// if we have more data, read that
toRead = this.splitLength > this.readBuffer.length ? this.readBuffer.length : (int) this.splitLength;
}
else {
// if we have exhausted our split, we need to complete the current record, or read one
// more across the next split.
// the reason is that the next split will skip over the beginning until it finds the first
// delimiter, discarding it as an incomplete chunk of data that belongs to the last record in the
// previous split.
toRead = this.readBuffer.length;
this.overLimit = true;
}
It's clear that if it doesn't find the line delimiter within one split, it will read on into the next split to complete the record. (I haven't found the corresponding code yet; I will keep looking.)
Plus: the image below shows how I traced the code from readCsvFile() to DelimitedInputFormat.
Hi, I'm pretty new to Hadoop and MapReduce. I'm wondering if something like this is possible.
I'm trying to compare two files through MapReduce.
The first file may look something like this:
t1 r1
t2 r2
t1 r4
The second file will look something like this:
u1 t1 r1
u2 t2 r3
u3 t2 r2
u4 t1 r1
I want it to emit u1, u3 and u4, since those are the records whose (t, r) pairs appear in the first file. The second file will be considerably larger than the first file. I'm not too sure how to compare these files; is this doable in one MapReduce job? I'm willing to chain MapReduce jobs if I have to, though.
You could do a map-side join by placing the first file in the distributed cache and streaming the second file through the map phase to do the join.
How to read from distributed cache:
@Override
protected void setup(Context context) throws IOException, InterruptedException
{
    Path[] filelist = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    for (Path findlist : filelist)
    {
        if (findlist.getName().toString().trim().equals("mapmainfile.dat"))
        {
            fetchvalue(findlist, context);
        }
    }
}
public void fetchvalue(Path realfile, Context context) throws NumberFormatException, IOException
{
    BufferedReader buff = new BufferedReader(new FileReader(realfile.toString()));
    // some operations with the file
}
How to add a file to distributed cache:
DistributedCache.addCacheFile(new URI("/user/hduser/test/mapmainfile.dat"), conf);
You can use a mapper-side join for the comparison. Use the distributed cache to pass the smaller file to all the mappers and read the larger file record by record through the mapper.
Now you can easily compare each large-file record against the small file (from the distributed cache) and emit the records which match.
Note: this will only work when the first file is small enough to fit in the memory of the mapper - typically a catalog file or a lookup file.
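Here is a minimal sketch of that map-side join, assuming both inputs are plain text with the layouts shown in the question ("t1 r1" in the small file, "u1 t1 r1" in the big file); the class name and the cached-file handling are illustrative only:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    private final Set<String> smallFilePairs = new HashSet<String>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Load the small file (shipped via the distributed cache) into memory.
        Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
        String line;
        while ((line = reader.readLine()) != null) {
            smallFilePairs.add(line.trim());            // e.g. "t1 r1"
        }
        reader.close();
    }

    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        // Big-file record "u1 t1 r1" -> id "u1", pair "t1 r1".
        String[] parts = record.toString().split("\\s+", 2);
        if (parts.length == 2 && smallFilePairs.contains(parts[1].trim())) {
            context.write(new Text(parts[0]), NullWritable.get());   // emits u1, u3, u4
        }
    }
}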
You can use a reduce-side join if both files are large. For that:
Create two mappers, one for each of the two files, using MultipleInputs, so one input file goes to one mapper and the other file to the other mapper.
Emit the mappers' output with a composite key (a TextPair): the first part of the pair is the record, e.g. "t1 r1", "t1 r2", etc., and the second part is "0" from the first mapper and "1" from the second mapper. For the values, emit NullWritable from the first mapper and u1, u2, etc. from the second. So the output from the first mapper looks like (("t1 r1", "0"), null), and the output from the second mapper looks like (("t1 r1", "1"), u1), (("t1 r1", "1"), u4), etc. Using "0" for the first mapper ensures its output is received first within each group.
Implement a partitioner and a group comparator based on the first part of the TextPair key (see the sketch after this list).
In the reducer you would get the data grouped by the first part and receive it like this: [(("t1 r1", "0"), null), (("t1 r1", "1"), u1), (("t1 r1", "1"), u4)]
Discard any group whose first key does not carry "0" in its second part (this removes the unmatched entries) and emit the rest of the values: u1, u4, etc.
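A minimal sketch of that partitioner and group comparator, assuming a TextPair writable with getFirst()/getSecond() accessors (as in Tom White's Hadoop: The Definitive Guide); class names are illustrative:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

// Note: since a job has a single map-output value class, in practice both mappers
// would emit Text values (e.g. an empty Text from the first mapper) rather than
// mixing NullWritable and Text.
public class ReduceSideJoinHelpers {

    // Partition on the first part only, so ("t1 r1", "0") and ("t1 r1", "1")
    // end up on the same reducer.
    public static class FirstPartitioner extends Partitioner<TextPair, Text> {
        @Override
        public int getPartition(TextPair key, Text value, int numPartitions) {
            return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

    // Group on the first part only, so all entries for "t1 r1" arrive in a single
    // reduce() call; the key's full compareTo() (first part, then second part)
    // still sorts the "0" entry ahead of the "1" entries within the group.
    public static class FirstGroupingComparator extends WritableComparator {
        protected FirstGroupingComparator() {
            super(TextPair.class, true);
        }
        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            return ((TextPair) a).getFirst().compareTo(((TextPair) b).getFirst());
        }
    }
}

In the driver you would register them with job.setPartitionerClass(ReduceSideJoinHelpers.FirstPartitioner.class) and job.setGroupingComparatorClass(ReduceSideJoinHelpers.FirstGroupingComparator.class).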
I have a file ("small file") with 10,000 rows of key,value pairs.
Different keys in the small file can have the same value.
I have to do a word count on a different file ("big file"),
but I need to replace each key from the big file with the corresponding value from the small file - in the mapper.
Only after that do I count it in the reducer.
I would like to achieve this using a single MapReduce job, WITHOUT using Pig/Hive.
Could you help me and guide me on how to do it?
The small file will be on HDFS and I'm not sure how the other nodes would be able to read from it - I don't think that's even recommended - because the node holding the small file would have to work really hard sending data to each map task.
You could do a map-side join and then count the results on the reduce side. Place your small file in the distributed cache so that your data is available to all the nodes. In your mapper, store all the key,value pairs in a Java HashMap in the setup method and stream the big file through, then do the join in the map method. This will yield something like this:
Small file (K,V)
Big file (K1,V1)
Mapper output (V(key),V1(value))
Then do a count in the reducer based on V (or interchange the key and value in the map output to achieve what you need).
How to read from a distributed cache:
@Override
protected void setup(Context context) throws IOException, InterruptedException
{
    Path[] filelist = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    for (Path findlist : filelist)
    {
        if (findlist.getName().toString().trim().equals("mapmainfile.dat"))
        {
            fetchvalue(findlist, context);
        }
    }
}
public void fetchvalue(Path realfile, Context context) throws NumberFormatException, IOException
{
    BufferedReader buff = new BufferedReader(new FileReader(realfile.toString()));
    // some operations with the file
}
How to add a file to distributed cache:
DistributedCache.addCacheFile(new URI("/user/hduser/test/mapmainfile.dat"),conf);
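Putting it together, here is a minimal sketch of the mapper and reducer described above, assuming the small file has comma-separated "key,value" lines and that big-file tokens with no match are counted as-is (class names and that fallback choice are illustrative):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ReplaceAndCount {

    public static class ReplaceMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final Map<String, String> lookup = new HashMap<String, String>();
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Load the small file from the distributed cache into a HashMap.
            Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
            String line;
            while ((line = reader.readLine()) != null) {
                String[] kv = line.split(",", 2);          // "key,value"
                if (kv.length == 2) lookup.put(kv[0].trim(), kv[1].trim());
            }
            reader.close();
        }

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                // Replace the big-file key with the small-file value when there is a match.
                String replaced = lookup.containsKey(token) ? lookup.get(token) : token;
                context.write(new Text(replaced), ONE);
            }
        }
    }

    // A plain word-count sum over the replaced values.
    public static class CountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            context.write(key, new IntWritable(sum));
        }
    }
}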
Hi, I am quite new to Hadoop and I am trying to import a CSV table into HBase using MapReduce.
I am using Hadoop 1.2.1 and HBase 1.1.1.
I have data in the following format:
Wban Number, YearMonthDay, Time, Hourly Precip
03011,20060301,0050,0
03011,20060301,0150,0
I have written the following code for the bulk load:
public class BulkLoadDriver extends Configured implements Tool{
public static void main(String [] args) throws Exception{
int result= ToolRunner.run(HBaseConfiguration.create(), new BulkLoadDriver(), args);
}
public static enum COUNTER_TEST{FILE_FOUND, FILE_NOT_FOUND};
public String tableName="hpd_table";// name of the table to be inserted in hbase
@Override
public int run(String[] args) throws Exception {
//Configuration conf= this.getConf();
Configuration conf = HBaseConfiguration.create();
Job job= new Job(conf,"BulkLoad");
job.setJarByClass(getClass());
job.setMapperClass(bulkMapper.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
job.setInputFormatClass(TextInputFormat.class);
TableMapReduceUtil.initTableReducerJob(tableName, null, job); //for HBase table
job.setNumReduceTasks(0);
return (job.waitForCompletion(true)?0:1);
}
private static class bulkMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put>{
//static class bulkMapper extends TableMapper<ImmutableBytesWritable, Put> {
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
{
String [] val= value.toString().split(",");
// store the split values in the bytes format so that they can be added to the PUT object
byte[] wban=Bytes.toBytes(val[0]);
byte[] ymd= Bytes.toBytes(val[1]);
byte[] tym=Bytes.toBytes(val[2]);
byte[] hPrec=Bytes.toBytes(val[3]);
Put put=new Put(wban);
put.add(ymd, tym, hPrec);
// Note: printing a byte[] directly only shows its object reference, e.g. "[B@2d22bfc8".
System.out.println(wban);
context.write(new ImmutableBytesWritable(wban), put);
context.getCounter(COUNTER_TEST.FILE_FOUND).increment(1);
}
}
}
I have created a jar for this and ran the following in the terminal:
hadoop jar ~/hadoop-1.2.1/MRData/bulkLoad.jar bulkLoad.BulkLoadDriver /MR/input/200603hpd.txt hpd_table
But the output that I get is hundreds of lines of the following type:
attempt_201509012322_0001_m_000000_0: [B@2d22bfc8
attempt_201509012322_0001_m_000000_0: [B@445cfa9e
I am not sure what they mean or how to perform this bulk upload. Please help.
Thanks in advance.
There are several ways to import data into HBase. Please have a look at the following link:
http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/admin_hbase_import.html
HBase BulkLoad:
Data file in CSV format
Process your data into HFile format. See http://hbase.apache.org/book/hfile_format.html for details about the HFile format. Usually you use a MapReduce job for the conversion, and you often need to write the Mapper yourself because your data is unique. The job must emit the row key as the Key, and either a KeyValue, a Put, or a Delete as the Value. The Reducer is handled by HBase; configure it using HFileOutputFormat.configureIncrementalLoad(), which does the following:
Inspects the table to configure a total order partitioner
Uploads the partitions file to the cluster and adds it to the DistributedCache
Sets the number of reduce tasks to match the current number of regions
Sets the output key/value class to match HFileOutputFormat requirements
Sets the Reducer to perform the appropriate sorting (either KeyValueSortReducer or PutSortReducer)
One HFile is created per region in the output folder. Input data is almost completely re-written, so you need available disk space at least twice the size of the original data set. For example, for a 100 GB output from mysqldump, you should have at least 200 GB of available disk space in HDFS. You can delete the original input file at the end of the process.
Load the files into HBase. Use the LoadIncrementalHFiles command (more commonly known as the completebulkload tool), passing it a URL that locates the files in HDFS. Each file is loaded into the relevant region on the RegionServer for the region. You can limit the number of versions that are loaded by passing the --versions= N option, where N is the maximum number of versions to include, from newest to oldest (largest timestamp to smallest timestamp).
If a region was split after the files were created, the tool automatically splits the HFile according to the new boundaries. This process is inefficient, so if your table is being written to by other processes, you should load as soon as the transform step is done.
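As a concrete starting point, here is a minimal sketch of such a driver, adapted to the question's setup; the mapper name CsvToPutMapper stands in for the question's mapper (which already emits ImmutableBytesWritable/Put), the table name is taken from the question, and newer HBase versions use HFileOutputFormat2 instead of HFileOutputFormat:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HFileBulkLoadDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "csv-to-hfiles");
        job.setJarByClass(HFileBulkLoadDriver.class);
        job.setMapperClass(CsvToPutMapper.class);                // your mapper emitting Puts
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));   // the CSV input
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HFiles are written here
        // Sets up the total-order partitioner, the sorting reducer (PutSortReducer)
        // and the number of reduce tasks to match the table's regions.
        HTable table = new HTable(conf, "hpd_table");
        HFileOutputFormat.configureIncrementalLoad(job, table);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
// Afterwards, load the generated HFiles into the table with the completebulkload tool, e.g.:
//   hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles <hfile-output-dir> hpd_table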
First question here... and learning Hadoop...
I've spent the last 2 weeks trying to understand everything about Hadoop, but it seems every hill has a mountain behind it.
Here's the setup:
Lots (1 million) of small (<50 MB) XML files (documents formatted as XML).
Each file is a single record.
Pseudo-distributed Hadoop cluster (1.1.2).
Using the old mapred API (can change if the new API supports what's needed).
I have found XmlInputFormat ("Mahout XMLInputFormat") to be a good starting point for reading the files, as I can specify the entire XML document as one record.
My understanding is that XmlInputFormat will take care of ensuring each file is its own record (as one tag exists per file/record).
My issue is this: I want to use Hadoop to process every document, search for information, and then, for each file/record, re-write or output a new XML document with a new XML tag added.
Not afraid of reading and learning, but a skeleton to play with would really help me 'play' and learn Hadoop.
Here is my driver:
public static void main(String[] args) {
JobConf conf = new JobConf(myDriver.class);
conf.setJobName("bigjob");
// Input/Output Directories
if (args[0].length()==0 || args[1].length()==0) System.exit(-1);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
conf.set("xmlinput.start", "<document>");
conf.set("xmlinput.end", "</document>");
// Mapper & Combiner & Reducer
conf.setMapperClass(Mapper.class);
conf.setReducerClass(Reduce.class);
conf.setNumReduceTasks(0);
// Input/Output Types
conf.setInputFormat(XmlInputFormat.class);
conf.setOutputFormat(?????);
conf.setOutputKeyClass(????);
conf.setOutputValueClass(????);
try {
JobClient.runJob(conf);
} catch (Exception e) {
e.printStackTrace();
}
}
I would say a simple solution would be to use TextOutputFormat and then use Text as the output key and NullWritable as the output value.
TextOutputFormat uses a delimiting character to separate the key and value pairs you output from your job. For your requirement, you don't need this arrangement, but you'd just like to output a single body of XML. If you pass null or NullWritable as the output key or value, TextOutputFormat will not write the null or the delimiter, just the non-null key or value.
Another approach, instead of XmlInputFormat, would be to use a WholeFileInputFormat (as detailed in Tom White's Hadoop: The Definitive Guide).
Either way, you'll need to write your mapper to consume the input value Text object (maybe with an XML SAX or DOM parser) and then output the transformed XML as a Text object.
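To make that concrete, here is a minimal sketch with the old mapred API (to match the driver above), assuming XmlInputFormat hands the mapper (LongWritable offset, Text xmlDocument) pairs; the class name and the pass-through rewrite are placeholders for your actual logic:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class XmlRewriteMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    public void map(LongWritable offset, Text xmlDocument,
                    OutputCollector<Text, NullWritable> output, Reporter reporter)
            throws IOException {
        // Parse xmlDocument (SAX/DOM), search for your information, add the new tag...
        String rewritten = xmlDocument.toString();   // placeholder: passes the document through
        // NullWritable value => TextOutputFormat writes only the XML, with no trailing delimiter.
        output.collect(new Text(rewritten), NullWritable.get());
    }
}

// The missing driver settings would then be:
//   conf.setMapperClass(XmlRewriteMapper.class);
//   conf.setOutputFormat(TextOutputFormat.class);   // org.apache.hadoop.mapred.TextOutputFormat
//   conf.setOutputKeyClass(Text.class);
//   conf.setOutputValueClass(NullWritable.class);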