Hadoop: map/reduce output file (part-00000) and distributed cache - Java

The output value of my map/reduce job is a BytesWritable array, which is written to the output file part-00000 (Hadoop does this by default). I need this array for my next map function, so I want to keep it in the distributed cache. Can somebody tell me how to read the output file (part-00000), which may not be a text file, and store it in the distributed cache?

My suggestion:
Create a new Hadoop job with the following properties:
Use the directory containing all the part-... files as its input.
Create a custom OutputFormat class that writes to your distributed cache.
The job setup would then look essentially like this:
conf.setInputFormat(SequenceFileInputFormat.class);
conf.setMapperClass(IdentityMapper.class);
conf.setReducerClass(IdentityReducer.class);
conf.setOutputFormat(DistributedCacheOutputFormat.class);
Have a look at the Yahoo Hadoop tutorial because it has some examples on this point: http://developer.yahoo.com/hadoop/tutorial/module5.html#outputformat
HTH

Related

Data reconciliation in Hadoop in the most primitive way

I need to do data reconciliation in Hadoop based on key comparisons. That means I will have old data in one folder and the newer data will be put into different folders. At the end of the batch I was planning simply on moving the newer data to reside with the old one. The data would be json files from which I have to extract the keys.
I'm taking my first steps with Hadoop so I just wanna do it with MapReduce program only, i.e. without tools such as Spark, Pig, Hive etc. I was thinking of simply going through all the old data at the beginning of the program, before Job object creation, and putting all the IDs into a Java HashMap that would be accessible from the mapper task. If there's a key missing in the newer data, the mapper would output that. The reducer would concern itself with categories of the IDs that are missing but that's another story. After the job has finished, I would move the newer data into old data's folder.
The only thing I find a bit clunky is this phase of loading into a Java HashMap object. It is probably not the most elegant solution, so I was wondering whether the MapReduce model has some dedicated data structure or functionality for this purpose (populating a global hash map with all the data from HDFS before the first map task runs)?
I don't think the HashMap solution is a good idea. You can give your job several inputs instead.
Based on which input file a record comes from, the mapper can tell whether the record is new and tag it with a suitable value. The reducer then checks whether a key appears only in the "new" input and, if so, writes it out.
As a result, the job outputs only the new data.
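The tagging idea above can be tried out without a cluster. The following is only a plain-Java simulation, not Hadoop code: the class name ReconciliationSketch and the tags "OLD" and "NEW" are made up for illustration, and an in-memory map stands in for the shuffle. In a real job the mapper would have to learn its source from the input path (for example by registering a separate mapper per input directory with MultipleInputs), which is assumed here rather than shown.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Plain-Java simulation of "tag by source in the mapper, filter in the reducer". */
final class ReconciliationSketch {

    /** "Map" step: emit (key, tag), where the tag says which input the key came from. */
    static void map(List<String> keys, String tag, Map<String, Set<String>> shuffle) {
        for (String key : keys) {
            shuffle.computeIfAbsent(key, k -> new HashSet<>()).add(tag);
        }
    }

    /** "Reduce" step: keep only the keys whose set of tags is exactly {"NEW"}. */
    static Set<String> reduce(Map<String, Set<String>> shuffle) {
        Set<String> onlyNew = new TreeSet<>();
        for (Map.Entry<String, Set<String>> entry : shuffle.entrySet()) {
            Set<String> tags = entry.getValue();
            if (tags.size() == 1 && tags.contains("NEW")) {
                onlyNew.add(entry.getKey());
            }
        }
        return onlyNew;
    }

    /** Runs both steps over two hypothetical inputs and returns keys seen only in the new one. */
    static Set<String> keysOnlyInNew(List<String> oldKeys, List<String> newKeys) {
        Map<String, Set<String>> shuffle = new HashMap<>();
        map(oldKeys, "OLD", shuffle);
        map(newKeys, "NEW", shuffle);
        return reduce(shuffle);
    }

    public static void main(String[] args) {
        // Key "b" exists in both inputs, so only "c" is reported: prints [c]
        System.out.println(keysOnlyInNew(Arrays.asList("a", "b"), Arrays.asList("b", "c")));
    }
}
```

Flipping the test in reduce to look for "OLD" alone would instead report keys that are missing from the newer data, which is what the question originally asked for.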

Hadoop - Merge reducer outputs to a single file using Java

I have a pig script that generates some output to a HDFS directory. The pig script also generates a SUCCESS file in the same HDFS directory. The output of the pig script is split into multiple parts as the number of reducers to use in the script is defined via 'SET default_parallel n;'
I would like to now use Java to concatenate/merge all the file parts into a single file. I obviously want to ignore the SUCCESS file while concatenating. How can I do this in Java?
Thanks in advance.
You can use getmerge from the shell to merge multiple files into a single file on the local file system.
Usage: hdfs dfs -getmerge <srcdir> <destinationdir/file.txt>
Example: hdfs dfs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt
If you don't want to use a shell command, you can write a Java program and use the FileUtil.copyMerge method to merge the output files into a single file. The implementation details are available in this link.
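For illustration, the merge logic itself (concatenate every part file in name order and skip everything else, such as the SUCCESS marker) can be sketched with plain java.nio. This is an assumption-laden stand-in: it works on a local directory, for example a copy of the job output, and does not talk to HDFS. The class name PartFileMerger and the "part-" prefix check are choices made for this sketch; on HDFS the same loop would have to go through Hadoop's FileSystem API instead.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Concatenates part files from a local directory, ignoring marker files. */
final class PartFileMerger {

    /** True for names such as part-00000 or part-r-00003; false for _SUCCESS and the like. */
    static boolean isPartFile(String name) {
        return name.startsWith("part-");
    }

    /** Appends all part files in dir, sorted by name, to target. Returns how many were merged. */
    static int merge(Path dir, Path target) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                if (Files.isRegularFile(entry) && isPartFile(entry.getFileName().toString())) {
                    parts.add(entry);
                }
            }
        }
        Collections.sort(parts); // part-...-00000 before part-...-00001, and so on
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path part : parts) {
                Files.copy(part, out); // streams the file's bytes into the merged output
            }
        }
        return parts.size();
    }
}
```

The target file should live outside the source directory, or at least not carry a part- name, so that it is never picked up as an input.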
If you want a single output file on HDFS itself directly from Pig, you need to pass the data through a single reducer, that is, set the number of reducers to 1. To do so, put the line below at the start of your script.
--Assigning only one reducer in order to generate only one output file.
SET default_parallel 1;
I hope this will help you.
The reason this does not seem easy to do is that there would typically be little purpose. If I have a very large cluster and am really dealing with a Big Data problem, my output as a single file would probably not fit onto any single machine.
That being said, I can see use cases such as metrics collection, where you just want to output some metrics about your data, like counts.
In that case I would first run your MapReduce program,
then create a second map/reduce job that reads the data and sends all the elements to the same single reducer by using a static key for every record.
Or you could also just run your original program with a single reducer:
job.setNumReduceTasks(1);
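Why a static key funnels everything into one reducer can be seen from the partitioning arithmetic. The sketch below is plain Java, not a Hadoop job: it reuses the formula of Hadoop's default HashPartitioner (hash code, sign bit cleared, modulo the number of reduce tasks), but the class name SingleReducerSketch, the sample keys and the constant key "ALL" are assumptions for illustration, and String.hashCode() stands in for the hash of a real Writable key.

```java
/** Shows how keys map to reducers under hash partitioning. */
final class SingleReducerSketch {

    /** Same arithmetic as the default hash partitioner: clear the sign bit, then take the modulo. */
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] originalKeys = {"apple", "banana", "cherry"};

        // With several reducers, distinct keys may be spread over different reducers.
        for (String key : originalKeys) {
            System.out.println(key + " -> reducer " + partitionFor(key, 4));
        }

        // Re-keying every record with one constant sends all of them to the same reducer.
        int target = partitionFor("ALL", 4);
        for (String key : originalKeys) {
            System.out.println(key + " re-keyed as ALL -> reducer " + target);
        }

        // With one reduce task the key no longer matters: the partition is always 0.
        for (String key : originalKeys) {
            System.out.println(key + " with one reducer -> reducer " + partitionFor(key, 1));
        }
    }
}
```

Either route ends with one reducer seeing all records, so one part file is written; the second route keeps the original keys, which matters if the reduce logic depends on them.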

Control number of hadoop mapper output files

I have a Hadoop job. When the job is started, a number of mappers are started, and each mapper writes a file to disk, like part-m-00000, part-m-00001. As I understand it, each mapper creates one part file. I have a large amount of data, so there must be more than one mapper, but can I somehow control the number of these output files? I mean, Hadoop would start, for example, 10 mappers, but there would be only three part files?
I found this post
How do multiple reducers output only one part-file in Hadoop?
But it uses the old version of the Hadoop library. I'm using classes from org.apache.hadoop.mapreduce.* and not from org.apache.hadoop.mapred.*
I'm using hadoop version 0.20, and hadoop-core:1.2.0.jar
Is there any possibility to do this, using new hadoop API?
The number of output files equals the number of reducers, or the number of mappers if there are no reducers.
You can add a single reducer to your job so that the output from all the mappers is directed to it and you get a single output file. Note that this will be less efficient, as all the data (the output of the mappers) will be sent over the wire (network IO) to the node where the reducer runs. Also, since a single process will (eventually) get all the data, it will probably run slower.
By the way, the fact that there are multiple parts shouldn't be very significant, as you can pass the directory containing them to subsequent jobs.
I'm not sure you can do it (your link is about multiple outputs, not about converging to a single one). And why use only one output? You would lose all parallelism in the sort.
I'm also working on big files (~10 GB each), and my MR jobs process almost 100 GB each. To lower the number of map tasks, I set a higher block size in HDFS (this applies only to newly written files) and a higher value of mapred.min.split.size in mapred-site.xml.
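To see why those two settings reduce the number of map tasks, it may help to look at the split-size rule. As far as I know, the new-API FileInputFormat computes the split size as max(minSize, min(maxSize, blockSize)); the sketch below just replays that arithmetic in plain Java. The class name SplitSizeSketch, the 128 MB block size and the 512 MB minimum are hypothetical values chosen for the example, and the split count is only approximate because the real code lets the last split be slightly larger.

```java
/** Replays the split-size arithmetic to estimate the number of map tasks for one file. */
final class SplitSizeSketch {

    /** Split size rule: the block size, clamped from below by minSize and from above by maxSize. */
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Rough number of splits (and hence map tasks) for a single file: a ceiling division. */
    static long approxSplits(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileSize = 10L * 1024L * mb; // a 10 GB file, as in the answer above
        long blockSize = 128L * mb;       // hypothetical HDFS block size

        // Default-like setting: a tiny minimum, so the block size decides.
        long defaultSplit = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);
        // Raised minimum split size: the minimum now wins over the block size.
        long raisedSplit = computeSplitSize(blockSize, 512L * mb, Long.MAX_VALUE);

        System.out.println(approxSplits(fileSize, defaultSplit)); // prints 80
        System.out.println(approxSplits(fileSize, raisedSplit));  // prints 20
    }
}
```

Note that this lowers the number of mappers, and therefore the number of part-m files of a map-only job, only indirectly; it does not let you pick an exact file count.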
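You might want to look at MultipleOutputFormat (it belongs to the old org.apache.hadoop.mapred.lib package; the new API offers MultipleOutputs in org.apache.hadoop.mapreduce.lib.output).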
Part of what Javadoc says:
This abstract class extends the FileOutputFormat, allowing to write
the output data to different output files.
Both Mapper and Reducer can use this.
Check this link for how to specify one or more output file names for different mappers writing to HDFS.
NOTE: Moreover, make sure you don't use context.write(), so that 10 files from 10 mappers don't get created. Use only MultipleOutputFormat to write the output.
If the job has no reducers, partitioners or combiners, each mapper writes one output file. At some point, you should run some post-processing to collect the outputs into a large file.

Hadoop MapReduce: Read a file and use it as input to filter other files

I would like to write a hadoop application which takes as input a file and an input folder which contains several files. The single file contains keys whose records need to be selected and extracted out of the other files in the folder. How can I achieve this?
By the way, I have a running hadoop mapreduce application which takes as input a path to a folder, does the processing and writes out the result into a different folder.
I am kind of stuck on how to use a file to get the keys that need to be selected and extracted from the other files in a specific directory. The file containing the keys is big, so it cannot fit into main memory directly. How can I do it?
Thx!
If the number of keys is too large to fit in memory, then consider loading the key set into a bloom filter (of suitable size to yield a low false positive rate) and then process the files, checking each key for membership in the bloom filter (Hadoop comes with a BloomFilter class, check the Javadocs).
You'll also need to perform a second MR Job to do a final validation (most probably in a reduce side join) to eliminate the false positives output from the first job.
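To make the membership check and the reason for the second validation pass concrete, here is a minimal Bloom filter in plain Java. It is a sketch for illustration only and is not Hadoop's BloomFilter class: the name SimpleBloomFilter, the FNV-1a based hashing and the double-hashing scheme are all choices made for this example. The property it demonstrates is the one the answer relies on: a key that was added is always reported as possibly present, while a key that was never added is usually, but not always, rejected.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** A small Bloom filter: no false negatives, occasional false positives. */
final class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    SimpleBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** FNV-1a over the UTF-8 bytes; the seed is mixed in to obtain a second, different hash. */
    private static int hash(String key, int seed) {
        int h = 0x811C9DC5 ^ seed;
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xFF);
            h *= 16777619;
        }
        return h;
    }

    /** i-th bit position via double hashing (h1 + i * h2), mapped into [0, numBits). */
    private int index(String key, int i) {
        int h1 = hash(key, 0);
        int h2 = hash(key, 0x9E3779B9);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    /** Records a key by setting all of its bit positions. */
    void add(String key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(key, i));
        }
    }

    /** False means "definitely not added"; true means "possibly added". */
    boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }
}
```

In the setting described above, the filter would be built once from the key file and each mapper would call the equivalent of mightContain for every record, emitting only the records that pass; the records that pass by accident are what the second job has to weed out.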
I would read the single file first before you run your job. Store all needed keys in the job configuration. You can then write a job to read the files from the folder. In your mapper/reducer setup(context) method, read out the keys from the configuration and store them globally, so that you have the possibility to read them during map or reduce.

How do I append to a file in hadoop?

I want to create a file in HDFS that has a bunch of lines, each generated by a different call to map. I don't care about the order of the lines, just that they all get added to the file. How do I accomplish this?
If this is not possible, then is there a standard way to generate unique file names to put each line of output into a separate file?
There is no way to append to an existing file in hadoop at the moment, but that's not what it sounds like you want to do anyway. It sounds like you want to have the output from your Map Reduce job go to a single file, which is quite possible. The number of output files is (less than or) equal to the number of reducers, so if you set your number of reducers to 1, you'll get a single file of output.
Before you go and do that, however, think about whether that's what you really want. You'll be creating a bottleneck in your pipeline, where all your data has to pass through a single machine for that reduce. Within the HDFS distributed file system, the difference between having one file and having several files is pretty transparent. If you want a single file outside the cluster, you might do better to use getmerge from the file system tools.
Both your map and reduce functions should output the lines. In other words, your reduce function is a pass through function that doesn't do much. Set the number of reducers to 1. The output will be a list of all the lines in one file.
