I have used Pig and Hive before but am new to Hadoop MapReduce. I need to write an application which takes multiple small files as input (say 10). They have different file structures, so I want to process them in parallel on separate nodes so that they can be processed quickly. I know that the strong point of Hadoop is processing large data, but these input files, though small, require a lot of processing, so I was hoping to leverage Hadoop's parallel computing prowess. Is this possible?
It is possible but you're probably not going to get much value. You have these forces against you:
Confused input
You'll need to write a mapper which can handle all of the different input formats (either by detecting the input format, or using the filename of the input to decide which format to expect)
Multiple outputs
You need to either use the slightly tricky multiple output file handling functionality of Hadoop or write your output as a side effect of the reducer (or mapper if you can be sure that each file will go to a different node)
High Cost of initialization
Every Hadoop MapReduce job comes with a hefty startup cost: about 30 seconds on a small cluster, much more on a larger cluster. This point alone will probably lose you more time than you could ever hope to gain by parallelism.
In brief: give NLineInputFormat a try.
There is no problem copying all your input files to all nodes (you can put them in the distributed cache if you like). What you really want to distribute is the check processing.
With Hadoop you can create a (single!) input control file in the format (filename,check2run) or (filename,format,check2run) and use NLineInputFormat to feed a specified number of checks to your nodes (mapreduce.input.lineinputformat.linespermap controls the number of lines fed to each mapper).
Note: Hadoop input format determines how splits are calculated; NLineInputFormat (unlike TextInputFormat) does not care about blocks.
Depending on the nature of your checks, you may be able to compute a linespermap value that covers all files/checks in one wave of mappers (or you may be unable to use this approach at all :) ).
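To make the "one wave of mappers" idea concrete, here is a minimal sketch of the arithmetic in plain Java. It uses no Hadoop classes; the class name, the 45 control lines and the 20 concurrently runnable map tasks are all made-up values for illustration.

```java
// Sketch only: plain arithmetic, no Hadoop classes involved.
final class LinesPerMapSketch {

    /** Smallest lines-per-mapper value that fits every control line into one wave. */
    static int linesPerMap(int controlLines, int concurrentMaps) {
        if (controlLines <= 0 || concurrentMaps <= 0) {
            throw new IllegalArgumentException("both arguments must be positive");
        }
        return (controlLines + concurrentMaps - 1) / concurrentMaps; // ceiling division
    }

    /** Number of mappers you get when each one is handed linesPerMap lines. */
    static int mapperCount(int controlLines, int linesPerMap) {
        return (controlLines + linesPerMap - 1) / linesPerMap;
    }

    public static void main(String[] args) {
        int controlLines = 45;   // hypothetical: lines in the control file
        int concurrentMaps = 20; // hypothetical: map tasks the cluster can run at once
        int n = linesPerMap(controlLines, concurrentMaps);
        System.out.println("linespermap=" + n + " mappers=" + mapperCount(controlLines, n));
    }
}
```

The resulting value is what you would put into mapreduce.input.lineinputformat.linespermap. If individual checks differ a lot in cost, an even line count per mapper will not give an even runtime.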
Related
Are there any ways to improve the MapReduce performance by changing the number of map tasks or changing the split sizes of each mapper?
For example, I have a 100GB text file and 20 nodes. I want to run a WordCount job on the text file, what is the ideal number of mappers or the ideal split size so that it can be done faster?
Would it be faster with more mappers?
Would it be faster with a smaller split size?
EDIT
I am using Hadoop 2.7.1, so just so you know, there is YARN.
Using more mappers is not necessarily faster. Each mapper has a startup and setup time. In the early days of Hadoop, when MapReduce was the de facto standard, it was said that a mapper should run for ~10 minutes. Today the documentation recommends 1 minute. You can vary the number of map tasks with setNumMapTasks(int), which you can set on the JobConf. The documentation of that method has very good information about the mapper count:
How many maps?
The number of maps is usually driven by the total size
of the inputs i.e. total number of blocks of the input files.
The right level of parallelism for maps seems to be around 10-100 maps
per-node, although it has been set up to 300 or so for very cpu-light
map tasks. Task setup takes awhile, so it is best if the maps take at
least a minute to execute.
The default behavior of file-based InputFormats is to split the input
into logical InputSplits based on the total size, in bytes, of input
files. However, the FileSystem blocksize of the input files is treated
as an upper bound for input splits. A lower bound on the split size
can be set via mapreduce.input.fileinputformat.split.minsize.
Thus, if you expect 10TB of input data and have a blocksize of 128MB,
you'll end up with 82,000 maps, unless setNumMapTasks(int) is used to
set it even higher.
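The numbers in that quote can be reproduced with a few lines of arithmetic. This is a rough sketch, not Hadoop code: the split-size rule shown (`max(minSize, min(maxSize, blockSize))`) is, as far as I know, the one used by the newer `mapreduce` FileInputFormat, and the count ignores details such as per-file boundaries and the slack Hadoop allows on the last split. The class and method names are made up.

```java
// Sketch only: approximates how many map tasks a given input size produces.
final class SplitMathSketch {

    static final long MB = 1024L * 1024;
    static final long GB = 1024L * MB;
    static final long TB = 1024L * GB;

    /** Block size is the default; minSize can push it up, maxSize can pull it down. */
    static long splitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Approximate number of splits (and therefore map tasks) for one big input. */
    static long splitCount(long totalBytes, long splitSize) {
        return (totalBytes + splitSize - 1) / splitSize; // ceiling division
    }

    public static void main(String[] args) {
        long defaults = splitSize(128 * MB, 1, Long.MAX_VALUE);
        System.out.println(splitCount(10 * TB, defaults));  // 81920, the "82,000 maps" above
        System.out.println(splitCount(100 * GB, defaults)); // 800 for the 100GB file

        // Raising the minimum split size to 512MB means fewer, longer-running maps.
        long bigger = splitSize(128 * MB, 512 * MB, Long.MAX_VALUE);
        System.out.println(splitCount(100 * GB, bigger));   // 200
    }
}
```

For the 100GB WordCount on 20 nodes, the default gives about 40 maps per node, which already sits inside the 10-100 range the documentation suggests.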
Your question is probably related to this SO question.
To be honest, try to have a look at modern frameworks as well, like Apache Spark and Apache Flink.
In Hadoop I'd like to split a file (almost) equally among the mappers. The file is large and I want to use a specific number of mappers, which is defined at job start. I've customized the input split, but I want to be sure that if I split the file in two (or more) splits I won't cut a line in half, as I want each mapper to have complete lines and not broken ones.
So the question is this: how can I get the approximate size of a file split as it is created or, if that is not possible, how can I estimate the number of (almost) equal file splits for a large file, given the constraint that I don't want any broken lines in any mapper instance?
Everything that you are asking for is the default behavior in MapReduce. Mappers always process complete lines, and by default MapReduce strives to spread the load evenly among the mappers.
You can get more details about it here; check out the InputSplits paragraph.
Also, this answer here, as linked by #Shaw, talks about exactly how lines that span blocks/splits are handled.
I think a thorough reading of the Hadoop bible should clear up most of your doubts in this regard.
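To illustrate why a byte-based split does not produce broken lines, here is a simplified plain-Java simulation of the convention that, to the best of my understanding, Hadoop's line reader follows: every split except the first throws away everything up to and including the first newline, and every split keeps reading until it has finished the line that starts at or before its end offset. This is not Hadoop's code; the class and method names are made up.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Sketch only: simulates line ownership for byte ranges [start, end) of one file.
final class LineSplitSketch {

    static List<String> linesForSplit(byte[] data, int start, int end) {
        int pos = start;
        if (start != 0) {
            // Not the first split: the line we landed in belongs to the previous split.
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // step past the newline
        }
        List<String> lines = new ArrayList<>();
        // Read whole lines; the last one may run past 'end' into the next split's bytes.
        while (pos <= end && pos < data.length) {
            int lineEnd = pos;
            while (lineEnd < data.length && data[lineEnd] != '\n') {
                lineEnd++;
            }
            lines.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = lineEnd + 1;
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\ndelta\n".getBytes(StandardCharsets.UTF_8);
        // Offset 13 falls in the middle of "charlie".
        System.out.println(linesForSplit(data, 0, 13));           // [alpha, bravo, charlie]
        System.out.println(linesForSplit(data, 13, data.length)); // [delta]
    }
}
```

Every line ends up in exactly one split, so the splits can be cut at arbitrary byte offsets and sized almost equally without any custom boundary handling.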
I have a list of lists whose indices reach up to hundreds of millions. Let's say each of the inner lists is a sentence of a text. I would like to partition this data for processing in different threads. I used subList to split the data and send it to different threads for processing.
Is this a standard approach for partitioning data? If not, could you please suggest a standard approach for it?
This will work as long as you do not "structurally modify" the list or any of these sub-lists. Read-only processing is fine.
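A minimal sketch of that read-only pattern, assuming each sentence is a `List<String>` of words and the "processing" is just counting words. The class and method names are made up; `subList` returns views of the original list, so no data is copied.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch only: partitions a list into subList views and processes them in parallel.
final class SubListPartitionSketch {

    /** Splits data into at most 'parts' contiguous, nearly equal subList views. */
    static <T> List<List<T>> partition(List<T> data, int parts) {
        if (parts <= 0) {
            throw new IllegalArgumentException("parts must be positive");
        }
        List<List<T>> views = new ArrayList<>();
        int size = data.size();
        for (int i = 0; i < parts; i++) {
            int from = (int) ((long) size * i / parts);
            int to = (int) ((long) size * (i + 1) / parts);
            if (from < to) {
                views.add(data.subList(from, to)); // a view, not a copy
            }
        }
        return views;
    }

    /** Counts words across all sentences, one task per partition. Read-only access. */
    static long countWords(List<List<String>> sentences, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (List<List<String>> chunk : partition(sentences, threads)) {
                Callable<Long> task = () -> {
                    long words = 0;
                    for (List<String> sentence : chunk) {
                        words += sentence.size();
                    }
                    return words;
                };
                results.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> result : results) {
                total += result.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a worker failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<List<String>> sentences = Arrays.asList(
                Arrays.asList("a", "b"), Arrays.asList("c"), Arrays.asList("d", "e", "f"));
        System.out.println(countWords(sentences, 2)); // 6
    }
}
```

Each worker builds its own partial result and the caller combines them, so nothing shared is mutated while the threads run.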
There are many other "big data" approaches to handling 100s of millions of records, because there are other problems you might hit:
If your program fails (e.g. OutOfMemoryError), you probably don't want to have to start over from the beginning.
You might want to throw >1 machine at the problem, at which point you can't share the data within a single JVM's memory.
After you've processed each sentence, are you building some intermediate result and then processing that as a step 2? You may need to put together a pipeline of steps where you re-partition the data before each step.
You might find you have too many sentences to fit them all into memory at once.
A really common tool for this kind of work is Hadoop. You'd copy the data into HDFS, run a map-reduce job (or more than one job) on the data and then copy the data out of HDFS when you're done.
A simpler approach to implement is just to use a database and assign different ranges for the integer sentence_id column to different threads and build your output in another table.
I have a file where I store some data, this data should be used by every mapper for some calculations.
I know how to read the data from the file, and this can be done inside the mapper function. However, this data is the same for every mapper, so I would like to store it somewhere (in a variable) before the mapping process begins and then use the contents in the mappers.
If I do this in the map function and have, for example, a file with 10 lines as input, then the map function will be called 10 times, correct? So if I read the file contents in the map function I will read it 10 times, which is unnecessary.
thanks in advance
Because many of your Mappers run inside of a different JVM (and possibly on different machines), you cannot read the data into your application once prior to submitting it to Hadoop. However, you can use the Distributed Cache to "Distribute application-specific large, read-only files efficiently."
As per that link: "Its efficiency stems from the fact that the files are only copied once per job and the ability to cache archives which are un-archived on the slaves."
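On the "read it 10 times" worry: in Hadoop's `Mapper`, `setup(Context)` runs once per task before any `map` call, so the cached file is normally loaded there and kept in a field. The sketch below imitates only that lifecycle in plain Java so it runs without Hadoop; the class, its methods and the tab-separated "key, value" file layout are assumptions for illustration, not the Hadoop API.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Sketch only: a plain-Java stand-in for a mapper, showing load-once-then-map.
final class SideDataMapperSketch {

    private final Map<String, String> lookup = new HashMap<>();
    private int loads = 0;

    /** Called once per task: parses "key<TAB>value" lines into memory. */
    void setup(Reader sideData) {
        try (BufferedReader reader = new BufferedReader(sideData)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int tab = line.indexOf('\t');
                if (tab > 0) {
                    lookup.put(line.substring(0, tab), line.substring(tab + 1));
                }
            }
            loads++;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Called once per input record: only consults the in-memory table. */
    String map(String record) {
        return record + "=" + lookup.getOrDefault(record, "?");
    }

    int loadCount() {
        return loads;
    }

    public static void main(String[] args) {
        SideDataMapperSketch mapper = new SideDataMapperSketch();
        mapper.setup(new StringReader("a\t1\nb\t2\n")); // stands in for the cached file
        for (String record : new String[] {"a", "b", "z"}) {
            System.out.println(mapper.map(record));
        }
        System.out.println("side file loaded " + mapper.loadCount() + " time(s)");
    }
}
```

In a real job the reader would be opened on the local copy of the file that the Distributed Cache placed on the node.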
If I understand right, you want to call only 1 function to read all the lines in a file. Assuming yes, here is my view on it.
The mapper lets you read 1 line at a time for safety's sake, so that you can control how many lines of input to read, and each line takes a certain amount of memory. For example, what if the file is large, say 1GB? Are you willing to read all of its contents? That would take up a considerable amount of memory and have an impact on performance.
This is the safety aspect that I mentioned earlier.
My conclusion is that there is no Mapper function that reads all the contents of a file.
Do you agree?
I am working with two large input files on the order of 5 GB each.
They are the output of a Hadoop MapReduce job, but as I am not able to do dependency calculations in MapReduce, I am switching to an optimized for loop for the final calculations (see my previous question on MapReduce design, Recursive calculations using Mapreduce).
I would like suggestions on reading such huge files in Java and doing some basic operations on them. Finally I will be writing out data which will be on the order of around 5 GB.
I appreciate your help
If the files have properties as you described, i.e. 100 integer values per key and are 10GB each, you are talking about a very large number of keys, much more than you can feasibly fit into memory. If you can order files before processing, for example using OS sort utility or a MapReduce job with a single reducer, you can read two files simultaneously, do your processing and output result without keeping too much data in memory.
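A minimal sketch of reading two sorted files side by side, assuming each line is "key<TAB>value", keys are unique within a file, and both files are sorted in the same order that `String.compareTo` uses (for example `LC_ALL=C sort`). The class name, the line layout and the "join matching keys" processing step are assumptions for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Sketch only: walks two key-sorted inputs in lockstep, holding one line of each in memory.
final class SortedMergeSketch {

    /** Writes "key<TAB>leftValue<TAB>rightValue" for keys present in both inputs. */
    static int mergeJoin(Reader leftIn, Reader rightIn, Writer out) {
        try (BufferedReader left = new BufferedReader(leftIn);
             BufferedReader right = new BufferedReader(rightIn)) {
            String l = left.readLine();
            String r = right.readLine();
            int matches = 0;
            while (l != null && r != null) {
                int cmp = key(l).compareTo(key(r));
                if (cmp < 0) {
                    l = left.readLine();   // left key is behind, advance left
                } else if (cmp > 0) {
                    r = right.readLine();  // right key is behind, advance right
                } else {
                    out.write(key(l) + "\t" + value(l) + "\t" + value(r) + "\n");
                    matches++;
                    l = left.readLine();
                    r = right.readLine();
                }
            }
            out.flush();
            return matches;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static String key(String line) {
        int tab = line.indexOf('\t');
        return tab < 0 ? line : line.substring(0, tab);
    }

    private static String value(String line) {
        int tab = line.indexOf('\t');
        return tab < 0 ? "" : line.substring(tab + 1);
    }

    public static void main(String[] args) {
        StringWriter out = new StringWriter();
        int matches = mergeJoin(
                new StringReader("a\t1\nb\t2\nd\t4\n"),
                new StringReader("b\t20\nc\t30\nd\t40\n"),
                out);
        System.out.print(out);                   // b<TAB>2<TAB>20 and d<TAB>4<TAB>40
        System.out.println(matches + " matches"); // 2 matches
    }
}
```

With real files, the two readers would come from something like `Files.newBufferedReader` and the writer from `Files.newBufferedWriter`, so memory use stays flat regardless of file size.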
It sounds like there wouldn't be much to a simple implementation. Just open an InputStream/Reader for the file, then, in a loop:
Read in one piece of your data
Process the piece of data
Store the result: in memory if you'll have room for the complete dataset, in a database of some sort if not
If your result set will be too large to keep in memory, a simple way to fix that would be to use an H2 database with local file storage.
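The read/process/store loop above can be sketched as follows. The record layout (a key followed by whitespace-separated integers) and the per-record work (summing them) are assumptions for illustration, and the result is streamed to a `Writer` rather than to a database so the sketch stays stdlib-only.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Sketch only: processes one record at a time, so memory use does not grow with file size.
final class StreamingLoopSketch {

    /** Reads "key v1 v2 ..." lines and writes "key<TAB>sum" lines; returns the record count. */
    static long process(Reader in, Writer out) {
        long records = 0;
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            while ((line = reader.readLine()) != null) {       // read one piece
                if (line.trim().isEmpty()) {
                    continue;
                }
                String[] fields = line.trim().split("\\s+");
                long sum = 0;
                for (int i = 1; i < fields.length; i++) {      // process it
                    sum += Long.parseLong(fields[i]);
                }
                out.write(fields[0] + "\t" + sum + "\n");      // store the result
                records++;
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        StringWriter out = new StringWriter();
        long records = process(new StringReader("k1 1 2 3\nk2 10 20\n"), out);
        System.out.print(out);                     // k1<TAB>6 and k2<TAB>30
        System.out.println(records + " records");  // 2 records
    }
}
```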
My approach:
I configured the MapReduce program to use 16 reducers, so the final output consisted of 16 files (part-00000 to part-00015) of 300+ MB each, and the keys were sorted in the same order for both input files.
Now at every stage I read 2 input files (around 600 MB) and did the processing. So at every stage I had to hold about 600 MB in memory, which the system could manage pretty well.
The program was pretty quick; it took around 20 minutes for the complete processing.
Thanks for all the suggestions! I appreciate your help.