In Hadoop I'd like to split a file (almost) equally among the mappers. The file is large and I want to use a specific number of mappers, which is defined at job start. I've customized the input split, but I want to be sure that if I split the file into two (or more) splits I won't cut a line in half, since I want each mapper to receive complete lines and not broken ones.
So the question is: how can I get the approximate size of a file split as it is created, or, if that is not possible, how can I estimate the number of (almost) equal file splits for a large file, given the constraint that no mapper instance should receive broken lines?
Everything you are asking for is the default behavior of MapReduce: mappers always process complete lines, and by default MapReduce strives to spread the load evenly amongst mappers.
You can get more details about it here; check out the InputSplits paragraph.
Also, this answer, as linked by @Shaw, talks about how exactly the case of lines spread across block splits is handled.
I think a thorough reading of the Hadoop bible should clear up most of your doubts in this regard.
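If you still want to steer the split count yourself, here is a minimal driver sketch (assuming the new mapreduce API; the mapper count, paths and class names are placeholders, not taken from the question). Forcing both the minimum and maximum split size to fileLength / numMappers asks FileInputFormat for roughly equal splits, while the line record reader still keeps every line intact:

    // Sketch only: steer split size so the file is divided into roughly
    // numMappers pieces. Paths, class names and the mapper count are placeholders.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class EqualSplitDriver {
        public static void main(String[] args) throws Exception {
            int numMappers = 8;                                    // defined at job start
            Configuration conf = new Configuration();
            Path input = new Path(args[0]);
            long fileLength = FileSystem.get(conf).getFileStatus(input).getLen();
            long targetSplit = Math.max(1, fileLength / numMappers);

            Job job = Job.getInstance(conf, "equal-splits");
            job.setJarByClass(EqualSplitDriver.class);
            job.setInputFormatClass(TextInputFormat.class);
            // With min == max every split is ~targetSplit bytes; the line record
            // reader still reads past the split boundary to finish the last line,
            // so no mapper ever sees a broken line.
            FileInputFormat.setMinInputSplitSize(job, targetSplit);
            FileInputFormat.setMaxInputSplitSize(job, targetSplit);
            FileInputFormat.addInputPath(job, input);
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            // set your Mapper/Reducer classes here ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }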
Related
Are there any ways to improve MapReduce performance by changing the number of map tasks or the split size of each mapper?
For example, I have a 100GB text file and 20 nodes. I want to run a WordCount job on the text file, what is the ideal number of mappers or the ideal split size so that it can be done faster?
Would it be faster with more mappers?
Would it be faster with a smaller split size?
EDIT
I am using Hadoop 2.7.1, just so you know that YARN is in the picture.
It is not necessarily faster to use more mappers. Each mapper has start-up and setup time. In the early days of Hadoop, when MapReduce was the de facto standard, it was said that a mapper should run for ~10 minutes; today the documentation recommends at least 1 minute. You can vary the number of map tasks with setNumMapTasks(int), which you can call on the JobConf. The documentation of that method has very good information about the mapper count (a minimal sketch follows the quoted passage below):
How many maps?
The number of maps is usually driven by the total size
of the inputs i.e. total number of blocks of the input files.
The right level of parallelism for maps seems to be around 10-100 maps
per-node, although it has been set up to 300 or so for very cpu-light
map tasks. Task setup takes awhile, so it is best if the maps take at
least a minute to execute.
The default behavior of file-based InputFormats is to split the input
into logical InputSplits based on the total size, in bytes, of input
files. However, the FileSystem blocksize of the input files is treated
as an upper bound for input splits. A lower bound on the split size
can be set via mapreduce.input.fileinputformat.split.minsize.
Thus, if you expect 10TB of input data and have a blocksize of 128MB,
you'll end up with 82,000 maps, unless setNumMapTasks(int) is used to
set it even higher.
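For illustration, a small sketch along those lines with the old mapred API mentioned above; the class name, paths and numbers are placeholders, and setNumMapTasks() is only a hint to the framework:

    // Sketch with the old mapred API. setNumMapTasks() is only a hint; the
    // InputFormat still decides the actual split count. Values are placeholders.
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.JobConf;

    public class MapCountDriver {
        public static JobConf configure() {
            JobConf conf = new JobConf(MapCountDriver.class);
            conf.setNumMapTasks(160);                              // hint only
            // Alternatively steer split sizes directly, using the property
            // from the quote above (value in bytes):
            conf.setLong("mapreduce.input.fileinputformat.split.minsize", 512L * 1024 * 1024);
            FileInputFormat.setInputPaths(conf, new Path("/data/input"));
            return conf;
        }
    }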
Your question is probably related to this SO question.
To be honest, try to have a look at modern frameworks as well, like Apache Spark and Apache Flink.
I've written a program for class which takes in data from a URL, parses it for key phrases, and then writes the phrase, line number, and column number to a text file.
Currently I am doing this as a single operation: the URL is fed to a BufferedReader for reading, then to a Scanner for parsing, and then into a loop where each line is combed through and a series of conditional statements checks for the presence of said key phrases. When a match is found I write to a file.
The file read is about 60K lines of text and it takes about 4000 ms on average to run this full operation from start to finish. Would it be more efficient to break apart the tasks and first read through the file into a data structure and then output the results to the file, instead of doing both at the same time?
Also, how big an impact would pulling the data from the URL have vs. reading it locally? I have the option to do both, but I figure this would depend on my broadband speed.
EDIT: Somewhat of a nice test case: over the week we changed our ISP and upgraded our broadband from 6 Mb/s to 30 Mb/s. This brought my average read/parse/write time down to 1500 ms. Interesting to see how such small variances can have such an impact on performance.
This depends on the way you implement parallelism in your data crunching part.
At the moment you sequentially read everything - then crunch the data - then write it. So even if you broke it into 3 threads each one depends on the result of the previous.
So unless you start processing the data before it is fully received, this would not make a difference but only add overhead.
You would have to model a producer/consumer-like flow where, e.g., lines are read individually and then put on a work queue for processing. The same goes for processed lines, which are then put on a queue to be written to a file.
This would allow parallel read / process / write actions to take place.
Btw - you are probably mostly limited by the speed of reading the file from a URL, since all other steps happen locally and are orders of magnitude faster.
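As a rough sketch of that producer/consumer flow (the URL, key phrases and output file are made-up placeholders, and error handling is trimmed for brevity):

    // One reader thread feeds lines into a bounded queue; one parser thread
    // drains it, checks for key phrases and writes matches as they are found.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.PrintWriter;
    import java.net.URL;
    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class PipelineParser {
        private static final String POISON = "\u0000EOF";                        // end-of-input marker
        private static final List<String> KEY_PHRASES = Arrays.asList("ERROR", "WARN"); // placeholders

        public static void main(String[] args) throws Exception {
            BlockingQueue<String> lines = new LinkedBlockingQueue<>(1000);

            // Producer: reads from the URL and hands lines to the worker.
            Thread reader = new Thread(() -> {
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(new URL("http://example.com/data.txt").openStream()))) {
                    String line;
                    while ((line = in.readLine()) != null) lines.put(line);
                    lines.put(POISON);
                } catch (Exception e) { throw new RuntimeException(e); }
            });

            // Consumer: parses each line and writes matches immediately.
            Thread parser = new Thread(() -> {
                try (PrintWriter out = new PrintWriter("matches.txt")) {
                    int lineNo = 0;
                    String line;
                    while (!(line = lines.take()).equals(POISON)) {
                        lineNo++;
                        for (String phrase : KEY_PHRASES) {
                            int col = line.indexOf(phrase);
                            if (col >= 0) out.println(phrase + " " + lineNo + " " + (col + 1));
                        }
                    }
                } catch (Exception e) { throw new RuntimeException(e); }
            });

            reader.start();
            parser.start();
            reader.join();
            parser.join();
        }
    }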
I have a huge file with a unique word on each line. The size of the file is around 1.6 GB (after this I have to sort other files which are around 15 GB). Until now, for smaller files, I used Arrays.sort(), but for this file I get java.lang.OutOfMemoryError: Java heap space. I know the reason for this error. Is there a way to do this without writing a complete quicksort or merge sort program myself?
I read that Arrays.sort() uses quicksort or a hybrid sort internally. Is there any ready-made procedure like Arrays.sort()?
If I have to write a program for sorting, which one should I use, quicksort or merge sort? I'm worried about the worst case.
Depending on the structure of the data to store, you can do many different things.
In the case of well-structured data, where you need to sort by one or more specific fields (in which case system tools might not be helpful), you are probably better off using a datastore that allows sorting. MongoDB comes to mind as a good fit, given that the size doesn't exceed a few hundred GBs. Other NoSQL datastores might also fit the bill nicely, although Mongo's simplicity of use and installation and its support for JSON data make it a really great candidate.
If you really want to go with the Java approach, it gets really tricky. This is the kind of question you ask at job interviews, and I would never actually expect anybody to implement the code. However, the general solution is merge sort (using random access files is a bad idea because it amounts to insertion sort, i.e. non-optimal running time, which can be bad given the size of your file).
By merge sort I mean reading one chunk of the file at a time small enough to fit it in memory (so it depends on how much RAM you have), sorting it and then writing it back to a new file on disk. After you read the whole file you can start merging the chunk files two at a time by reading just the head of each and writing (the smaller of the two records) back to a third file. Do that for the 'first generation' of files and then continue with the second one until you end up with one big sorted file. Note that this is basically a bottom up way of implementing merge sort, the academic recursive algorithm being the top down approach.
Note that having intermediate files can be avoided altogether by using a multiway merge algorithm. This is typically based on a heap/priority queue, so the implementation might get slightly more complex but it reduces the number of I/O operations required.
Please also see these links.
Implementing the above in Java shouldn't be too difficult with some careful design, although it can definitely get tricky. I still highly recommend an out-of-the-box solution like Mongo.
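As a sketch of the first phase (sorting chunks into temporary run files), under the assumption that the input is one record per line; the chunk size is a guess you would tune to your heap:

    // Read the file in chunks that fit in memory, sort each chunk, and write
    // it to its own temporary "run" file. Returns the list of run file names.
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.PrintWriter;
    import java.nio.file.Files;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class RunGenerator {
        static final int CHUNK_LINES = 1_000_000;          // tune to available memory

        public static List<String> writeSortedRuns(String inputFile) throws Exception {
            List<String> runFiles = new ArrayList<>();
            try (BufferedReader in = new BufferedReader(new FileReader(inputFile))) {
                List<String> chunk = new ArrayList<>(CHUNK_LINES);
                String line;
                while ((line = in.readLine()) != null) {
                    chunk.add(line);
                    if (chunk.size() == CHUNK_LINES) runFiles.add(flush(chunk));
                }
                if (!chunk.isEmpty()) runFiles.add(flush(chunk));
            }
            return runFiles;
        }

        private static String flush(List<String> chunk) throws Exception {
            Collections.sort(chunk);                        // in-memory sort of one chunk
            String name = Files.createTempFile("run-", ".txt").toString();
            try (PrintWriter out = new PrintWriter(name)) {
                for (String s : chunk) out.println(s);
            }
            chunk.clear();
            return name;
        }
    }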
As it turns out, your problem is that your heap cannot accommodate such a large array, so you must forget any solution that implies loading the whole file content in an array (as long as you can't grow your heap).
So you're left with streaming, which is the only (and typical) solution when you have to handle input sources larger than your available memory. I would suggest streaming the file content through your program, which should perform the sorting by writing either to a random access file (trickier) or to a database.
I'd take a different approach.
Given a file with, say, a single element per line, I'd read the first n elements. I would repeat this m times, such that the number of lines in the file is n * m + C, with C being the left-over lines.
When dealing with integers you may wish to use around 100,000 elements per read; with strings I would use fewer, maybe around 1,000. It depends on the data type and the memory needed per element.
From there, I would sort those n elements and write them to a temporary file with a unique name.
Now, since you have all the files sorted, the smallest elements will be at the start. You can then just iterate over the files until you have processed all the elements, finding the smallest element and printing it to the new final output.
This approach reduces the amount of RAM needed, relying on drive space instead, and allows you to sort a file of any size.
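For the merge step, here is a sketch of the multiway idea mentioned above: keep the head line of every sorted temporary file in a priority queue and repeatedly write out the smallest (file handling is simplified):

    // Open every sorted run file, seed a heap with each file's first line, then
    // repeatedly emit the globally smallest line and refill from that file.
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.PrintWriter;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.PriorityQueue;

    public class KWayMerge {
        private static class Head {
            final String line;
            final BufferedReader reader;
            Head(String line, BufferedReader reader) { this.line = line; this.reader = reader; }
        }

        public static void merge(List<String> runFiles, String outputFile) throws Exception {
            PriorityQueue<Head> heap = new PriorityQueue<>((a, b) -> a.line.compareTo(b.line));
            List<BufferedReader> readers = new ArrayList<>();
            for (String f : runFiles) {
                BufferedReader r = new BufferedReader(new FileReader(f));
                readers.add(r);
                String first = r.readLine();
                if (first != null) heap.add(new Head(first, r));   // seed with each file's head
            }
            try (PrintWriter out = new PrintWriter(outputFile)) {
                while (!heap.isEmpty()) {
                    Head smallest = heap.poll();                   // smallest remaining line
                    out.println(smallest.line);
                    String next = smallest.reader.readLine();      // refill from the same file
                    if (next != null) heap.add(new Head(next, smallest.reader));
                }
            }
            for (BufferedReader r : readers) r.close();
        }
    }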
Build an array of the record positions inside the file (a kind of index); maybe that would fit into memory instead. You need one 8-byte Java long per file record. Sort the array, loading records only for comparison and not retaining them (use RandomAccessFile). After sorting, write the new final file using the index pointers to fetch the records in the required order.
This will also work if the records are not all the same size.
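A sketch of that index idea (assuming single-byte-encoded text, since RandomAccessFile.readLine does not decode multi-byte characters):

    // Keep only one long per record in memory, sort the offsets by comparing
    // records read on demand, then copy them out in order (trades I/O for RAM).
    import java.io.PrintWriter;
    import java.io.RandomAccessFile;
    import java.util.ArrayList;
    import java.util.List;

    public class IndexSort {
        public static void sortByIndex(String inputFile, String outputFile) throws Exception {
            try (RandomAccessFile raf = new RandomAccessFile(inputFile, "r");
                 PrintWriter out = new PrintWriter(outputFile)) {

                // 1. Collect the byte offset of every line (8 bytes per record).
                List<Long> offsets = new ArrayList<>();
                long pos = 0;
                while (raf.readLine() != null) {
                    offsets.add(pos);
                    pos = raf.getFilePointer();
                }

                // 2. Sort offsets by the record they point to; records are re-read
                //    for each comparison rather than kept in memory.
                offsets.sort((a, b) -> readAt(raf, a).compareTo(readAt(raf, b)));

                // 3. Write the records out in sorted order.
                for (long off : offsets) out.println(readAt(raf, off));
            }
        }

        private static String readAt(RandomAccessFile raf, long offset) {
            try {
                raf.seek(offset);
                return raf.readLine();
            } catch (Exception e) { throw new RuntimeException(e); }
        }
    }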
I have used Pig and Hive before but am new to Hadoop MapReduce. I need to write an application which has multiple small files as input (say 10). They have different file structures, so I want to process them in parallel on separate nodes so that they can be processed quickly. I know that Hadoop's strong point is processing large data, but these input files, though small, require a lot of processing, so I was hoping to leverage Hadoop's parallel computing prowess. Is this possible?
It is possible but you're probably not going to get much value. You have these forces against you:
Confused input
You'll need to write a mapper which can handle all of the different input formats (either by detecting the input format, or using the filename of the input to decide which format to expect)
Multiple outputs
You need to either use the slightly tricky multiple output file handling functionality of Hadoop or write your output as a side effect of the reducer (or mapper if you can be sure that each file will go to a different node)
High cost of initialization
Every Hadoop MapReduce job comes with a hefty start-up cost, about 30 seconds on a small cluster and much more on a larger one. This point alone will probably lose you more time than you could ever hope to gain through parallelism.
In brief: give NLineInputFormat a try.
There is no problem with copying all your input files to all nodes (you can put them in the distributed cache if you like). What you really want to distribute is the check processing.
With Hadoop you can create a (single!) input control file in the format (filename,check2run) or (filename,format,check2run) and use NLineInputFormat to feed a specified number of checks to your nodes (mapreduce.input.lineinputformat.linespermap controls the number of lines fed to each mapper).
Note: Hadoop input format determines how splits are calculated; NLineInputFormat (unlike TextInputFormat) does not care about blocks.
Depending on the nature of your checks, you may be able to compute a linespermap value that covers all files/checks in one wave of mappers (or you may be unable to use this approach at all :) )
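A minimal driver sketch of that setup (the paths and the commented-out mapper class are placeholders):

    // One control record ("filename,check2run") per line of the control file;
    // with linespermap = 1, each map task receives exactly one check to run.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CheckDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "small-file-checks");
            job.setJarByClass(CheckDriver.class);
            job.setInputFormatClass(NLineInputFormat.class);
            NLineInputFormat.setNumLinesPerSplit(job, 1);   // mapreduce.input.lineinputformat.linespermap
            NLineInputFormat.addInputPath(job, new Path("/control/checks.txt"));
            // job.setMapperClass(CheckMapper.class);       // your mapper opens the named file itself
            FileOutputFormat.setOutputPath(job, new Path("/control/out"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }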
I have a file where I store some data, this data should be used by every mapper for some calculations.
I know how to read the data from the file, and this could be done inside the mapper function. However, this data is the same for every mapper, so I would like to store it somewhere (a variable) before the mapping process begins and then use its contents in the mappers.
If I do this in the map function and have, for example, a file with 10 lines as input, then the map function will be called 10 times, correct? So if I read the file contents in the map function I will read it 10 times, which is unnecessary.
Thanks in advance.
Because your mappers run inside different JVMs (and possibly on different machines), you cannot read the data into your application once prior to submitting it to Hadoop. However, you can use the Distributed Cache to "Distribute application-specific large, read-only files efficiently."
As per that link: "Its efficiency stems from the fact that the files are only copied once per job and the ability to cache archives which are un-archived on the slaves."
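A sketch of how that could look with the Job API (the file path, symlink name and mapper types are placeholders): the driver registers the file once, and each mapper reads it a single time in setup() rather than once per record.

    // In the driver: job.addCacheFile(new java.net.URI("/shared/lookup.txt#lookup"));
    // The "#lookup" fragment gives the symlink name used in the task's working directory.
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class SharedDataMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final List<String> sharedData = new ArrayList<>();

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Loaded once per mapper JVM, not once per input record.
            try (BufferedReader in = new BufferedReader(new FileReader("lookup"))) {
                String line;
                while ((line = in.readLine()) != null) sharedData.add(line);
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // ... use sharedData together with value here ...
        }
    }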
If I understand right, you want to call only one function to read all the lines of a file. Assuming yes, here is my view on it.
The mapper lets you read one line at a time for safety's sake, so that you can control how many lines of input are read, since this takes a certain amount of memory. For example, what if the file is large, say 1 GB? Are you willing to read all of its contents? That would take up a considerable amount of memory and hurt performance.
This is the safety aspect I mentioned earlier.
My conclusion is that there is no mapper function that reads the entire contents of a file.
Do you agree?