Hi, I'm pretty new to Hadoop and MapReduce. I'm wondering if something like this is possible.
I'm trying to compare two files through MapReduce.
The first file may look something like this:
t1 r1
t2 r2
t1 r4
The second file will look something like this:
u1 t1 r1
u2 t2 r3
u3 t2 r2
u4 t1 r1
Based on these files, I want it to emit u1, u3 and u4 (the users whose "t r" pair appears in the first file). The second file will be considerably larger than the first. I'm not too sure how to compare these files; is this doable in one MapReduce job? I'm willing to chain MapReduce jobs if I have to, though.
You could do a map-side join by placing the first file in the distributed cache and traversing the second file in the map phase to do the join.
How to read from the distributed cache:
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    // Fetch the files placed in the distributed cache
    Path[] filelist = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    for (Path findlist : filelist) {
        if (findlist.getName().trim().equals("mapmainfile.dat")) {
            fetchvalue(findlist, context);
        }
    }
}

public void fetchvalue(Path realfile, Context context) throws NumberFormatException, IOException {
    BufferedReader buff = new BufferedReader(new FileReader(realfile.toString()));
    // some operations with the file
}
How to add a file to distributed cache:
DistributedCache.addCacheFile(new URI("/user/hduser/test/mapmainfile.dat"), conf);
You can use a map-side join for the comparison. Use the distributed cache to pass the smaller file to all the mappers, and read the larger file record by record through the mapper.
Now you can easily compare each record of the large file against the small file (from the distributed cache) and emit the records that match; a sketch of such a mapper is shown below.
Note: This only works when the first file is small enough to fit in the mapper's memory, which is generally the case for a catalog or lookup file.
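For concreteness, here is a minimal sketch of such a mapper (class and variable names are illustrative; imports omitted to match the snippets above). It assumes the first file's lines look like "t1 r1" and are cached as mapmainfile.dat, and the second file's lines look like "u1 t1 r1":
public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    // (t, r) pairs loaded from the small file in the distributed cache
    private final Set<String> lookup = new HashSet<String>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        for (Path p : DistributedCache.getLocalCacheFiles(context.getConfiguration())) {
            if (p.getName().trim().equals("mapmainfile.dat")) {
                BufferedReader reader = new BufferedReader(new FileReader(p.toString()));
                String line;
                while ((line = reader.readLine()) != null) {
                    lookup.add(line.trim());                        // e.g. "t1 r1"
                }
                reader.close();
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each line of the big file looks like "u1 t1 r1"
        String[] parts = value.toString().trim().split("\\s+", 2);
        if (parts.length == 2 && lookup.contains(parts[1])) {
            context.write(new Text(parts[0]), NullWritable.get());  // emit u1, u3, u4, ...
        }
    }
}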
You can use a reduce-side join if both files are large. For that:
Create a separate mapper for each of the two files using MultipleInputs, so one input file goes to one mapper and the other file goes to the other mapper.
From both mappers, emit a composite key (a TextPair): the first part is the pair such as "t1 r1", "t2 r2", etc., and the second part is a tag, "0" from the first mapper and "1" from the second. For the values, emit NullWritable from the first mapper and u1, u2, etc. from the second. So the output of the first mapper will look like (("t1 r1", "0"), null), and the output of the second mapper like (("t1 r1", "1"), u1), (("t1 r1", "1"), u4), etc. The "0" tag is used for the first mapper so that its output is received first within each group.
Implement a partitioner and a grouping comparator based on the first part of the TextPair key.
In the reducer you would then receive the data grouped by the first part, like this: [(("t1 r1", "0"), null), (("t1 r1", "1"), u1), (("t1 r1", "1"), u4)]
Discard every group whose first key is not tagged "0" (this removes the unmatched entries) and emit the remaining values, u1, u4, etc. A sketch of such a reducer follows.
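As a rough illustration, here is a minimal sketch of that reducer. It assumes a TextPair writable with getFirst()/getSecond(), a partitioner that hashes only the first part, and a grouping comparator that compares only the first part; it also assumes both mappers emit Text values (the first mapper emitting an empty Text), since both inputs must share one value class. All names here are illustrative:
public static class JoinReducer extends Reducer<TextPair, Text, Text, NullWritable> {
    @Override
    protected void reduce(TextPair key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        boolean inFirstFile = false;
        for (Text value : values) {
            // With the tag as the secondary sort field, the record tagged "0" (from the
            // first file), if present, is the first one seen in the group. The key object
            // is updated as we iterate, so key.getSecond() is always the current tag.
            if ("0".equals(key.getSecond().toString())) {
                inFirstFile = true;                                  // "t r" exists in the first file
            } else if (inFirstFile) {
                context.write(new Text(value), NullWritable.get());  // emit u1, u4, ...
            }
        }
    }
}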
I am using the readCsvFile(path) function of the Apache Flink API to read a CSV file and store it in a list variable. How does it work using multiple threads?
For example, does it split the file based on some statistics? If yes, which statistics? Or does it read the file line by line and then send the lines to threads to process them?
Here is the sample code:
// default parallelism is 4
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
String csvPath = "data/weather.csv";
List<Tuple2<String, Double>> csv = env.readCsvFile(csvPath)
        .types(String.class, Double.class)
        .collect();
Suppose we have an 800 MB CSV file on local disk; how does it distribute the work between those 4 threads?
The readCsvFile() API method internally creates a data source with a CsvInputFormat which is based on Flink's FileInputFormat. This InputFormat generates a list of so-called InputSplits. An InputSplit defines which range of a file should be scanned. The splits are then distributed to data source tasks.
So, each parallel task scans a certain region of a file and parses its content. This is very similar to how it is done by MapReduce / Hadoop.
This is the same as How does Hadoop process records split across block boundaries?
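To make the split idea concrete, here is a toy illustration (not Flink's actual code) of how an 800 MB file could be divided into byte ranges for 4 parallel data source tasks:
// Toy illustration only: divide a file into equal byte ranges, one per task.
long fileSize = 800L * 1024 * 1024;   // 800 MB
int parallelism = 4;
long splitSize = fileSize / parallelism;
for (int i = 0; i < parallelism; i++) {
    long start = i * splitSize;
    long length = (i == parallelism - 1) ? fileSize - start : splitSize;
    System.out.printf("split %d: offset=%d, length=%d%n", i, start, length);
}
// Each task scans its own range; as the answer below shows, it reads past the end of its
// range to finish the last record, while the next task skips the partial record at the
// start of its range.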
I extracted some code from the DelimitedInputFormat class in flink-release-1.1.3.
// else ..
int toRead;
if (this.splitLength > 0) {
    // if we have more data, read that
    toRead = this.splitLength > this.readBuffer.length ? this.readBuffer.length : (int) this.splitLength;
}
else {
    // if we have exhausted our split, we need to complete the current record, or read one
    // more across the next split.
    // the reason is that the next split will skip over the beginning until it finds the first
    // delimiter, discarding it as an incomplete chunk of data that belongs to the last record in the
    // previous split.
    toRead = this.readBuffer.length;
    this.overLimit = true;
}
It's clear that if it doesn't find the line delimiter within its own split, it will read into the next split to find it. (I haven't found the corresponding code for that yet; I will keep looking.)
Plus: the image below shows how I traced the code from readCsvFile() to DelimitedInputFormat.
I have a "small file" with 10,000 rows of Key,Value pairs.
Different keys in the small file can have the same value.
I have to do a word count on a different file (the "big file"),
but I need to replace each key from the big file with the corresponding value from the small file, in the mapper.
Only after that should the counting happen in the reducer.
I would like to achieve this in a single MapReduce job WITHOUT using Pig/Hive.
Could you help me and guide me on how to do it?
The small file will be on HDFS, and I'm not sure how the other nodes would be able to read from it. I don't think that's even recommended, because the node holding the small file would have to work really hard sending data to each map task.
You could do a map-side join and then count the results on the reduce side. Place your small file in the distributed cache so that its data is available to all the nodes. In your mapper, store all the key,value pairs in a Java HashMap in the setup method, stream the big file through, and do the join in the map method. This will yield something like this:
Small file: (K, V)
Big file: (K1, V1)
Mapper output: (V as key, V1 as value)
Then do the count in the reducer based on V (or swap the key and value in the map output to suit your needs). A combined sketch is shown after the distributed cache snippets below.
How to read from a distributed cache:
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    // Fetch the files placed in the distributed cache
    Path[] filelist = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    for (Path findlist : filelist) {
        if (findlist.getName().trim().equals("mapmainfile.dat")) {
            fetchvalue(findlist, context);
        }
    }
}

public void fetchvalue(Path realfile, Context context) throws NumberFormatException, IOException {
    BufferedReader buff = new BufferedReader(new FileReader(realfile.toString()));
    // some operations with the file
}
How to add a file to distributed cache:
DistributedCache.addCacheFile(new URI("/user/hduser/test/mapmainfile.dat"),conf);
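Putting it together, here is a rough sketch of the mapper and reducer (all names are illustrative; it assumes the small file is a key,value CSV cached as mapmainfile.dat and that the big file is also comma-delimited with the key first; it emits (replaced key, 1) pairs that the reducer sums, a slight simplification of the (V, V1) output described above):
public class ReplaceKeyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final Map<String, String> lookup = new HashMap<String, String>();
    private final IntWritable one = new IntWritable(1);

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Load the small file (one key,value per line) from the distributed cache
        for (Path p : DistributedCache.getLocalCacheFiles(context.getConfiguration())) {
            if (p.getName().trim().equals("mapmainfile.dat")) {
                BufferedReader reader = new BufferedReader(new FileReader(p.toString()));
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] kv = line.split(",", 2);
                    if (kv.length == 2) {
                        lookup.put(kv[0].trim(), kv[1].trim());
                    }
                }
                reader.close();
            }
        }
    }

    @Override
    protected void map(LongWritable offset, Text value, Context context)
            throws IOException, InterruptedException {
        String bigFileKey = value.toString().split(",", 2)[0].trim();
        String replacement = lookup.get(bigFileKey);
        if (replacement != null) {
            context.write(new Text(replacement), one);   // count by the replaced key
        }
    }
}

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}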
Hi, I am quite new to Hadoop and I am trying to import a CSV table into HBase using MapReduce.
I am using Hadoop 1.2.1 and HBase 1.1.1.
I have data in the following format:
Wban Number, YearMonthDay, Time, Hourly Precip
03011,20060301,0050,0
03011,20060301,0150,0
I have written the following code for the bulk load:
public class BulkLoadDriver extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int result = ToolRunner.run(HBaseConfiguration.create(), new BulkLoadDriver(), args);
    }

    public static enum COUNTER_TEST {FILE_FOUND, FILE_NOT_FOUND};

    public String tableName = "hpd_table"; // name of the table to be inserted in HBase

    @Override
    public int run(String[] args) throws Exception {
        //Configuration conf = this.getConf();
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "BulkLoad");
        job.setJarByClass(getClass());
        job.setMapperClass(bulkMapper.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        job.setInputFormatClass(TextInputFormat.class);
        TableMapReduceUtil.initTableReducerJob(tableName, null, job); // for HBase table
        job.setNumReduceTasks(0);
        return (job.waitForCompletion(true) ? 0 : 1);
    }

    private static class bulkMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        //static class bulkMapper extends TableMapper<ImmutableBytesWritable, Put> {
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String[] val = value.toString().split(",");
            // store the split values in bytes format so that they can be added to the Put object
            byte[] wban = Bytes.toBytes(val[0]);
            byte[] ymd = Bytes.toBytes(val[1]);
            byte[] tym = Bytes.toBytes(val[2]);
            byte[] hPrec = Bytes.toBytes(val[3]);
            Put put = new Put(wban);
            put.add(ymd, tym, hPrec);
            System.out.println(wban);
            context.write(new ImmutableBytesWritable(wban), put);
            context.getCounter(COUNTER_TEST.FILE_FOUND).increment(1);
        }
    }
}
I created a jar for this and ran the following in the terminal:
hadoop jar ~/hadoop-1.2.1/MRData/bulkLoad.jar bulkLoad.BulkLoadDriver /MR/input/200603hpd.txt hpd_table
But the output I get is hundreds of lines of the following type:
attempt_201509012322_0001_m_000000_0: [B#2d22bfc8
attempt_201509012322_0001_m_000000_0: [B#445cfa9e
I am not sure what they mean or how to perform this bulk upload. Please help.
Thanks in advance.
There are several ways to import data into HBase. Please have a look at the following link:
http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/admin_hbase_import.html
HBase BulkLoad:
Data file in CSV format
Process your data into HFile format. See http://hbase.apache.org/book/hfile_format.html for details about the HFile format. Usually you use a MapReduce job for the conversion, and you often need to write the Mapper yourself because your data is unique. The job must emit the row key as the Key, and either a KeyValue, a Put, or a Delete as the Value. The Reducer is handled by HBase; configure it using HFileOutputFormat.configureIncrementalLoad(), which does the following (a driver sketch follows this list):
- Inspects the table to configure a total order partitioner
- Uploads the partitions file to the cluster and adds it to the DistributedCache
- Sets the number of reduce tasks to match the current number of regions
- Sets the output key/value class to match HFileOutputFormat requirements
- Sets the Reducer to perform the appropriate sorting (either KeyValueSortReducer or PutSortReducer)
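Applied to the question's job, a hedged sketch of the driver for this conversion step could look like the following (HBase 1.1-era API; the output path and the use of HFileOutputFormat2 are assumptions, not part of the original answer):
Configuration conf = HBaseConfiguration.create();
Job job = Job.getInstance(conf, "hfile-conversion");
job.setJarByClass(BulkLoadDriver.class);
job.setMapperClass(bulkMapper.class);                     // emits (ImmutableBytesWritable, Put)
job.setMapOutputKeyClass(ImmutableBytesWritable.class);
job.setMapOutputValueClass(Put.class);
FileInputFormat.setInputPaths(job, new Path("/MR/input/200603hpd.txt"));
FileOutputFormat.setOutputPath(job, new Path("/MR/output/hfiles"));   // HFiles land here

try (Connection connection = ConnectionFactory.createConnection(conf);
     Table table = connection.getTable(TableName.valueOf("hpd_table"));
     RegionLocator locator = connection.getRegionLocator(TableName.valueOf("hpd_table"))) {
    // Wires in the total order partitioner, the sort reducer and HFileOutputFormat2,
    // so the job writes HFiles instead of issuing Puts through the region servers.
    HFileOutputFormat2.configureIncrementalLoad(job, table, locator);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}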
One HFile is created per region in the output folder. Input data is almost completely re-written, so you need available disk space at least twice the size of the original data set. For example, for a 100 GB output from mysqldump, you should have at least 200 GB of available disk space in HDFS. You can delete the original input file at the end of the process.
Load the files into HBase. Use the LoadIncrementalHFiles command (more commonly known as the completebulkload tool), passing it a URL that locates the files in HDFS. Each file is loaded into the relevant region on the RegionServer for that region. You can limit the number of versions that are loaded by passing the --versions=N option, where N is the maximum number of versions to include, from newest to oldest (largest timestamp to smallest timestamp).
If a region was split after the files were created, the tool automatically splits the HFile according to the new boundaries. This process is inefficient, so if your table is being written to by other processes, you should load as soon as the transform step is done.
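For the question's table, that final load step would look roughly like this (using the HFile output path assumed in the sketch above):
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /MR/output/hfiles hpd_table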
I have three MapReduce jobs that operate on the same input files and produce tab-delimited output. The first value on each line is the key; this is the case for every output of these three MR jobs.
What I want to do now is use MapReduce to "stitch" these files together by key. What would be the best Mapper output and Reducer input? I tried using ArrayWritable, but because of the shuffle, for some records the ArrayWritable from one file ends up in the third position instead of the second.
I want this:
Key \t Values-from-first-MR-job \t Values-from-second-MR-job \t Values-from-third-MR-job
And this should be the same for all records. But, as I said, because of the shuffle, sometimes this happens for a few records:
Key \t Values-from-third-MR-job \t Values-from-first-MR-job \t Values-from-second-MR-job
How should I set up my Mapper and Reducer to fix this?
It's possible with simple tagging of the emitted values, since only three types of files are involved. In the map, extract the path of the input split, identify which job it came from, and add a suitable prefix to the value. For clarity, say the outputs are in three directories:
path1/mr_out_1
path2/mr_out_2
path3/mr_out_3
Using TextInputFormat for all these paths, in the map you will do:
String[] keyVal = value.toString().split("\t", 2);
Path filePath = ((FileSplit) context.getInputSplit()).getPath();
String dirName = filePath.getParent().getName();
Text outValue = new Text();
if (dirName.equals("mr_out_1")) {
    outValue.set("1_" + keyVal[1]);
} else if (dirName.equals("mr_out_2")) {
    outValue.set("2_" + keyVal[1]);
} else {
    outValue.set("3_" + keyVal[1]);
}
context.write(new Text(keyVal[0]), outValue);
If you have all the files in the same directory, use the file name instead of the directory name, and identify the tag based on the name (a regex match may be suitable):
String fileName = filePath.getName();
if (fileName.matches("regex")) { ... }
In the reduce, just add the incoming values to a List and sort it. The rest is simple enough:
List<String> list = new ArrayList<String>(3);
for (Text v : values) {
    list.add(v.toString());
}
Collections.sort(list);
StringBuilder builder = new StringBuilder();
for (String s : list) {
    builder.append(s.substring(2) + "\t");
}
context.write(key, new Text(builder.toString().trim()));
I think this will serve the purpose. Keep in mind that the Collections.sort strategy will fail if there are more than 9 files (due to lexicographic ordering, "10_" sorts before "2_"). In that case you may extract the tag separately, parse it to an Integer, and use a TreeMap<tag, actualString> for sorting, as sketched below.
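A minimal sketch of that TreeMap variant (same assumptions as the snippets above):
// Sort the tagged values numerically by tag instead of lexicographically.
TreeMap<Integer, String> ordered = new TreeMap<Integer, String>();
for (Text v : values) {
    String s = v.toString();
    int sep = s.indexOf('_');                 // tag and payload are separated by '_'
    ordered.put(Integer.parseInt(s.substring(0, sep)), s.substring(sep + 1));
}
StringBuilder builder = new StringBuilder();
for (String s : ordered.values()) {
    builder.append(s).append("\t");
}
context.write(key, new Text(builder.toString().trim()));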
NB: All the above snippets use the new API. I didn't use an IDE to write them, so a few syntax errors may exist, and I didn't follow proper conventions throughout. For example, the map's output key could be a class member, and using outKey.set(keyVal[0]) would avoid creating a new Text object for each record.
The output files produced by my reduce operation are huge (1 GB after gzipping). I want it to break the output into smaller files of 200 MB. Is there a property or Java class to split the reduce output by size or by number of lines?
I cannot increase the number of reducers because that has a negative impact on the performance of the Hadoop job.
I'm curious as to why you cannot just use more reducers, but I will take you at your word.
One option is to use MultipleOutputs and write multiple files from one reducer. For example, say the output file for each reducer is 1 GB and you want 256 MB files instead. This means you need to write 4 files per reducer rather than one.
In your job driver, do this:
JobConf conf = ...;
// You should probably pass this in as parameter rather than hardcoding 4.
conf.setInt("outputs.per.reducer", 4);
// This sets up the infrastructure to write multiple files per reducer.
MultipleOutputs.addMultiNamedOutput(conf, "multi", YourOutputFormat.class, YourKey.class, YourValue.class);
In your reducer, do this:
@Override
public void configure(JobConf conf) {
    numFiles = conf.getInt("outputs.per.reducer", 1);
    multipleOutputs = new MultipleOutputs(conf);
    // other init stuff
    ...
}
@Override
public void reduce(YourKey key,
                   Iterator<YourValue> valuesIter,
                   OutputCollector<OutKey, OutVal> ignoreThis,
                   Reporter reporter) {
    // Do your business logic just as you're doing currently.
    OutKey outputKey = ...;
    OutVal outputVal = ...;
    // Now this is where it gets interesting. Hash the value to find
    // which output file the data should be written to. Don't use the
    // key since all the data will be written to one file if the number
    // of reducers is a multiple of numFiles.
    int fileIndex = (outputVal.hashCode() & Integer.MAX_VALUE) % numFiles;
    // Now use multiple outputs to actually write the data.
    // This will create output files named: multi_0-r-00000, multi_1-r-00000,
    // multi_2-r-00000, multi_3-r-00000 for reducer 0. For reducer 1, the files
    // will be multi_0-r-00001, multi_1-r-00001, multi_2-r-00001, multi_3-r-00001.
    multipleOutputs.getCollector("multi", Integer.toString(fileIndex), reporter)
        .collect(outputKey, outputVal);
}
@Override
public void close() throws IOException {
    // You must do this!!!!
    multipleOutputs.close();
}
This pseudocode was written with the old mapred API in mind. Equivalent APIs exist in the new mapreduce API, though, so either way you should be all set.
There's no property to do this. You'll need to write your own output format & record writer.
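For illustration, here is a rough sketch of what such a record writer could look like with the old mapred API: it delegates to TextOutputFormat and rolls to a new part file once an approximate byte threshold is reached (the class name, the property name and the size accounting are all assumptions):
public class RollingTextOutputFormat<K, V> extends TextOutputFormat<K, V> {

    @Override
    public RecordWriter<K, V> getRecordWriter(final FileSystem fs, final JobConf job,
            final String name, final Progressable progress) throws IOException {
        // Roll to a new file after roughly this many bytes (hypothetical property name).
        final long maxBytes = job.getLong("rolling.max.bytes", 200L * 1024 * 1024);
        final TextOutputFormat<K, V> delegate = new TextOutputFormat<K, V>();

        return new RecordWriter<K, V>() {
            private int part = 0;
            private long written = 0;
            private RecordWriter<K, V> current;

            public void write(K key, V value) throws IOException {
                if (current == null) {
                    // e.g. name-roll0, name-roll1, ... within this task's output
                    current = delegate.getRecordWriter(fs, job, name + "-roll" + (part++), progress);
                }
                current.write(key, value);
                // Rough size accounting based on the text form of key and value;
                // good enough to decide when to roll, not an exact byte count.
                written += key.toString().length() + value.toString().length() + 2;
                if (written >= maxBytes) {
                    current.close(Reporter.NULL);
                    current = null;
                    written = 0;
                }
            }

            public void close(Reporter reporter) throws IOException {
                if (current != null) {
                    current.close(reporter);
                }
            }
        };
    }
}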