map-only job - order - java

I have a csv file. Say it has 2 splits, i.e., one block will be handled by the 1st map() task and the other by the 2nd map() task.
In the given CSV I am replacing "0" with false and "1" with true, so I will be writing a map-only job for that. After the job completes, will the result preserve the order of the input file, given that shuffle and sort happen after the map phase?
Is there any way to get the result in the same order as the input file?

You can call job.setNumReduceTasks(0);. This way, shuffle and sort won't happen.
However, there will be as many output files as there are map tasks (2 in this case). If you concatenate them, you'll get what you want. This probably doesn't matter, since in most cases Hadoop lets you supply a folder wherever a file is expected.
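For illustration, a minimal driver sketch of such a map-only job (the class name and path arguments are placeholders, and the string replacement is deliberately naive):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyDriver {

    // Rewrites each CSV line, replacing "0" with false and "1" with true (naive, for illustration only).
    public static class ReplaceMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws java.io.IOException, InterruptedException {
            String rewritten = line.toString().replace("0", "false").replace("1", "true");
            context.write(NullWritable.get(), new Text(rewritten));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "csv-boolean-rewrite");
        job.setJarByClass(MapOnlyDriver.class);

        job.setMapperClass(ReplaceMapper.class);
        job.setNumReduceTasks(0);                 // map-only: no shuffle, no sort

        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Each map task writes its own part-m-NNNNN file, and within each file the lines stay in the order of that task's input split, so concatenating part-m-00000 and part-m-00001 in order reproduces the input order.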

Related

SparkSession readText file with custom logic

I want to read a Spark text file into a JavaRDD. The code below works perfectly fine:
JavaRDD<String> rdd = sparkSession.sparkContext().textFile(filePath, 100).toJavaRDD();
I want to apply some conditional logic inside this textFile read.
For example:
if the content of the text file is as below (note this is a simplified example):
1
2
2
3
4
4
I want to be able to look ahead or look back and eliminate duplicates based on some logic.
I don't want to do it while processing the RDD; I want to do it at the time of reading the text file itself.
Spark evaluates RDDs lazily, so it will actually perform the transformations and the filter for each line as it is read; it will not need to put all the data in memory.
My advice is to use the filter operation. Furthermore, you can persist the resulting RDD to avoid recomputation.
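A rough sketch of that filter-and-persist advice, assuming a Spark 2.x JavaRDD (the predicate is a placeholder, and distinct() drops duplicates globally rather than by look-ahead/look-back position):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.storage.StorageLevel;

public class FilteredRead {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("filtered-read").getOrCreate();

        // textFile() is lazy: the filter below runs per line as the file is read,
        // rather than on a fully materialized dataset.
        JavaRDD<String> lines = spark.sparkContext()
                .textFile(args[0], 100)
                .toJavaRDD()
                .filter(line -> !line.trim().isEmpty());   // placeholder predicate

        // distinct() drops duplicate lines regardless of position (it requires a shuffle).
        JavaRDD<String> unique = lines.distinct()
                .persist(StorageLevel.MEMORY_AND_DISK());  // cache to avoid recomputation

        System.out.println("unique lines: " + unique.count());
        spark.stop();
    }
}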

Can a reducer pass a message to driver in Hadoop mapreduce?

I have to implement a loop of map-reduce jobs. Each iteration will terminate or continue depending on the previous one. The choice is based on whether a particular word appears in the reducer output.
Of course I can inspect the whole output text file with my driver program, but it is just a single word, and going through the whole file would be overkill. I am wondering whether there is any way to build communication between the reducer and the driver, so the reducer can notify the driver once it detects the word, since the message to be transferred is tiny.
Such a solution will not be clean and will be hard to maintain.
There are multiple ways to achieve what you have asked for:
1. The reducer, as soon as it finds the word, writes to a predefined HDFS location (opens a file in a predefined directory on HDFS and writes there).
2. The client keeps polling that predefined directory / the output dir of the job. If the output dir exists but the marker file does not, it means the word wasn't there.
3. Use ZooKeeper.
The best solution would be to emit from the mapper only if the word is found, and emit nothing otherwise. This will speed up your job, and you can spawn a single reducer. Then you can safely check whether the output of the job contains any file or not. Use lazy output (LazyOutputFormat): if no rows reach the reducer, no output file will be created.
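A minimal sketch of that approach, assuming the target word is passed through the configuration (the target.word key, class names, and argument positions are placeholders):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordFoundJob {

    // Emits a record only when the target word appears in the input line.
    public static class WordFoundMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        private String target;

        @Override
        protected void setup(Context context) {
            target = context.getConfiguration().get("target.word"); // hypothetical config key
        }

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            if (line.toString().contains(target)) {
                context.write(new Text(target), NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("target.word", args[2]);

        Job job = Job.getInstance(conf, "word-found");
        job.setJarByClass(WordFoundJob.class);
        job.setMapperClass(WordFoundMapper.class);
        job.setNumReduceTasks(1);                  // single (identity) reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // Lazy output: a part file is created only if at least one record is written.
        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        Path out = new Path(args[1]);
        FileOutputFormat.setOutputPath(job, out);
        job.waitForCompletion(true);

        // Driver-side check: a non-empty part file means the word was found somewhere.
        FileSystem fs = FileSystem.get(conf);
        boolean found = false;
        for (FileStatus status : fs.listStatus(out)) {
            if (status.getPath().getName().startsWith("part-") && status.getLen() > 0) {
                found = true;
            }
        }
        System.out.println("word found: " + found);
    }
}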

Add input data on the fly to Hadoop Map-Reduce Job?

Can I append input files or input data to a map-reduce job while it's running without creating a race condition?
I think in theory you can add more files into the input as long as it:
Matches your FileInputFormat pattern
Happens before the InputFormat.getSplits() call, which really gives you only a very short window after you submit a job.
Regarding the race condition after splits are computed, note that appending to existing files has only been available since version 0.21.0.
And even if you can modify your files, your split points are already precomputed, and most likely your new data will not be picked up by the mappers. That said, I doubt it will lead to a crash of your flow.
What you can experiment with is disabling splits within a file (that is, assigning one mapper per file, as in the sketch below) and trying to append. I think some data that had a chance to get flushed may end up in a mapper (that's just my wild guess).
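A minimal sketch of disabling in-file splits with the new API (the class name is a placeholder):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// One mapper per file: files handled by this input format are never split.
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}

Then, in the driver, register it with job.setInputFormatClass(WholeFileTextInputFormat.class);.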
Effectively the answer is "no": the splits are computed very early in the game, and after that your new files will not be included.

Control number of hadoop mapper output files

I have a Hadoop job. When the job is started, some number of mappers start, and each mapper writes a file to disk, like part-m-00000, part-m-00001. As I understand it, each mapper creates one part file. I have a big amount of data, so there must be more than one mapper, but can I somehow control the number of these output files? I mean, Hadoop will start, for example, 10 mappers, but can there be only three part files?
I found this post
How do multiple reducers output only one part-file in Hadoop?
But that uses the old version of the Hadoop library. I'm using classes from org.apache.hadoop.mapreduce.* and not from org.apache.hadoop.mapred.*.
I'm using Hadoop version 0.20, and hadoop-core:1.2.0.jar.
Is there any possibility to do this, using new hadoop API?
The number of output files equals the number of reducers, or the number of mappers if there aren't any reducers.
You can add a single reducer to your job so that the output from all the mappers is directed to it and you get a single output file. Note that this will be less efficient, since all the data (the output of the mappers) will be sent over the wire (network IO) to the node where the reducer runs. Also, since a single process will (eventually) get all the data, it would probably run slower.
By the way, the fact that there are multiple parts shouldn't be very significant, as you can pass the directory containing them to subsequent jobs.
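As a rough sketch, a driver configured this way funnels everything into a single part-r-00000 (the identity mapper and reducer here are placeholders for your own classes):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SinglePartFileDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "single-part-file");
        job.setJarByClass(SinglePartFileDriver.class);

        job.setMapperClass(Mapper.class);       // identity mapper; replace with your own
        job.setReducerClass(Reducer.class);     // identity reducer: passes records through
        job.setNumReduceTasks(1);               // everything funnels into one part-r-00000

        job.setOutputKeyClass(LongWritable.class);  // match your mapper's output types
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}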
I'm not sure you can do it (your link is about multiple outputs, not about converging to only one), and why use only one output? You would lose all parallelism on the sort.
I'm also working on big files (~10 GB each), and my MR jobs process almost 100 GB each. So to lower the number of mappers, I set a higher block size in HDFS (this applies only to newly written files) and a higher value of mapred.min.split.size in mapred-site.xml.
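The same knob can also be set per job from the driver with the new-API FileInputFormat (a sketch; the 256 MB value is an arbitrary example):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeConfig {
    // Raise the minimum split size so fewer, larger splits (hence fewer mappers) are created.
    public static void applyTo(Job job) {
        FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024); // 256 MB, arbitrary example
    }
}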
You might want to look at MultipleOutputFormat
Part of what the Javadoc says:
This abstract class extends the FileOutputFormat, allowing to write the output data to different output files.
Both Mapper and Reducer can use this.
Check this link for how you can specify an output file name (or more) from different mappers to output to HDFS.
NOTE: Also, make sure you don't use context.write(), so that 10 files from 10 mappers don't get created. Use only MultipleOutputFormat for output.
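Note that MultipleOutputFormat lives in the old org.apache.hadoop.mapred API; its counterpart in the new API you are using is MultipleOutputs. A rough sketch of routing mapper output into a fixed set of named files (the three-bucket hash is just an illustration):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Routes each record into one of three named files instead of the default part-m-NNNNN files.
public class RoutingMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    private MultipleOutputs<NullWritable, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // The bucket choice is a placeholder: here, a hash of the line picks one of three files.
        String bucket = "out" + ((line.toString().hashCode() & Integer.MAX_VALUE) % 3);
        mos.write(NullWritable.get(), line, bucket);  // no context.write(), so no default part output
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}

In the driver, the job's output key/value classes must match (NullWritable/Text here), and adding LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class) prevents empty default part files from being created alongside the named ones.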
If the job has no reducers, partitioners, or combiners, each mapper produces one output file. At some point, you would have to run some post-processing to collect the outputs into one large file.

How do I append to a file in hadoop?

I want to create a file in HDFS that has a bunch of lines, each generated by a different call to map. I don't care about the order of the lines, just that they all get added to the file. How do I accomplish this?
If this is not possible, then is there a standard way to generate unique file names to put each line of output into a separate file?
There is no way to append to an existing file in Hadoop at the moment, but that's not what it sounds like you want to do anyway. It sounds like you want the output from your MapReduce job to go to a single file, which is quite possible. The number of output files is (less than or) equal to the number of reducers, so if you set your number of reducers to 1, you'll get a single file of output.
Before you go and do that, however, think about whether that's what you really want. You'll be creating a bottleneck in your pipeline, where all your data has to pass through a single machine for that reduce. Within the HDFS distributed file system, the difference between having one file and having several files is pretty transparent. If you want a single file outside the cluster, you might do better to use getmerge from the file system tools.
Both your map and reduce functions should output the lines. In other words, your reduce function is a pass-through function that doesn't do much. Set the number of reducers to 1. The output will be a list of all the lines in one file.
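A minimal sketch of such a pass-through reducer, assuming the mapper emits each line as a Text key with a NullWritable value (names are placeholders):

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Pass-through reducer: re-emits every line it receives, once per occurrence.
// With job.setNumReduceTasks(1), all lines end up in a single part-r-00000 file.
public class PassThroughReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
    @Override
    protected void reduce(Text line, Iterable<NullWritable> occurrences, Context context)
            throws IOException, InterruptedException {
        for (NullWritable ignored : occurrences) {
            context.write(line, NullWritable.get());
        }
    }
}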
