Can I append input files or input data to a map-reduce job while it's running without creating a race condition?
I think in theory you can add more files to the input, as long as each file:
Matches your FileInputFormat pattern
Arrives before the InputFormat.getSplits() call, which leaves you only a very short window after you submit the job.
Regarding the race condition after the splits are computed: note that appending to existing files is only available since version 0.21.0.
And even if you can modify your files, the split points have already been computed, so your new data will most likely not be picked up by the mappers. That said, I doubt it would crash your flow.
What you can experiment with is disabling splits within a file (that is, assigning one mapper per file) and then trying to append. I suspect some data that happened to be flushed could end up in a mapper (that's just my wild guess).
Effectively the answer is "no". The splits are computed very early in the game, and after that your new files will not be included.
Related
I have to implement a loop of map-reduce jobs. Each iteration terminates or continues depending on the previous one, and the decision is based on whether a particular word appears in the reducer output.
Of course I could inspect the whole output text file from my driver program, but since it is just a single word, scanning the entire file would be overkill. Is there any way to build communication between the reducer and the driver, so the reducer can notify the driver once it detects the word? The message to transfer is tiny.
That approach would not be a clean solution and would be hard to maintain.
There are multiple ways to achieve what you are asking for:
1. As soon as the reducer finds the word, it writes to a predefined HDFS location (it opens a file in a predefined directory on HDFS and writes there).
2. The client keeps polling the predefined directory alongside the job's output directory. If the output directory exists but the marker file does not, the word was not there.
3. Use ZooKeeper.
The best solution would be to emit from the mapper only if the word is found, and emit nothing otherwise. This speeds up your job and lets you run a single reducer. You can then safely check whether the job's output directory contains any file. Use lazy output initialization: if no rows reach the reducer, no output file is created.
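The driver-side check from the last approach can be sketched like this. For simplicity this sketch uses java.nio on a local directory; in a real Hadoop driver you would do the same check with org.apache.hadoop.fs.FileSystem against the job's HDFS output path. The `part-` file-name prefix is the usual convention for reducer output files.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class OutputCheck {
    // Returns true if the job's output directory contains a non-empty
    // "part-*" file, i.e. at least one reducer actually emitted the word.
    static boolean wordWasFound(Path outputDir) throws IOException {
        if (!Files.isDirectory(outputDir)) {
            return false; // job has not finished (or failed)
        }
        try (Stream<Path> files = Files.list(outputDir)) {
            return files.anyMatch(p ->
                    p.getFileName().toString().startsWith("part-") && hasContent(p));
        }
    }

    private static boolean hasContent(Path p) {
        try {
            return Files.size(p) > 0;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("job-output");
        System.out.println(wordWasFound(out)); // no part files yet -> false
        Files.write(out.resolve("part-r-00000"), "the-word\n".getBytes());
        System.out.println(wordWasFound(out)); // word was emitted -> true
    }
}
```

With lazy output initialization in the job, an empty or absent part file reliably means the word never reached the reducer.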
For a project I am working on, I am trying to count the vowels in a text file as fast as possible. In order to do so, I am trying a concurrent approach. I was wondering if it is possible to concurrently read a text file as a way to speed up the counting? I believe the bottleneck is the I/O, and since right now I am reading the file via a buffered reader and processing line by line, I was wondering if it is possible to read multiple sections of the file at once.
My original thought was to use
Split File - Java/Linux
but apparently MappedByteBuffers are not great performance-wise, and I would still need to read line by line from each MappedByteBuffer after splitting.
Another option is to split after reading a certain number of lines, but that defeats the purpose.
Would appreciate any help.
The following will NOT split the file - but can help in concurrently processing it!
Using Streams in Java 8 you can do things like:
Stream<String> lines = Files.lines(Paths.get(filename));
lines.filter(StringUtils::isNotEmpty) // ignore empty lines
and if you want to run in parallel you can do:
lines.parallel().filter(StringUtils::isNotEmpty)
In the example above I was filtering empty lines - but of course you can adapt it to your use case (counting vowels) by implementing your own method and calling it.
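Here is a minimal self-contained sketch of that adaptation. It replaces the commons-lang StringUtils filter with a plain character filter, since the vowel check itself already skips everything irrelevant:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class VowelCount {
    // Counts vowels (a, e, i, o, u, case-insensitive) across all lines,
    // processing the lines in parallel.
    static long countVowels(Path file) throws IOException {
        try (Stream<String> lines = Files.lines(file)) {
            return lines.parallel()                  // process lines concurrently
                        .flatMapToInt(String::chars) // stream of characters
                        .filter(c -> "aeiouAEIOU".indexOf(c) >= 0)
                        .count();
        }
    }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("vowels", ".txt");
        Files.write(f, "Hello World\nparallel streams\n".getBytes());
        System.out.println(countVowels(f)); // 8
    }
}
```

Note that the file is still read sequentially from disk; only the per-line processing runs on multiple cores, which is usually where the win is if I/O is not actually the bottleneck.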
I am trying to find out where the output of a Map task is saved to disk before it can be used by a Reduce task.
Note: the version used is Hadoop 0.20.204 with the new API.
For example, when overwriting the map method in the Map class:
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    // word (a Text field) and one (an IntWritable field) are members of the mapper class
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
    }
    // code that starts a new Job.
}
I am interested in finding out where context.write() ends up writing the data. So far I have run into:
FileOutputFormat.getWorkOutputPath(context);
which gives me the following location on HDFS:
hdfs://localhost:9000/tmp/outputs/1/_temporary/_attempt_201112221334_0001_m_000000_0
When I try to use it as input for another job, it gives me the following error:
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://localhost:9000/tmp/outputs/1/_temporary/_attempt_201112221334_0001_m_000000_0
Note: the job is started in the Mapper, so technically the temporary folder where the Mapper task is writing its output exists when the new job begins. Even so, it still says that the input path does not exist.
Any ideas as to where the temporary output is written? Or, where can I find the output of a Map task during a job that has both a Map and a Reduce stage?
The MapReduce framework stores intermediate output on local disk rather than HDFS, since storing it in HDFS would cause unnecessary replication of files.
So, I've figured out what is really going on.
The output of the mapper is buffered until the buffer is about 80% full, at which point it begins spilling the results to local disk while continuing to accept items into the buffer.
I wanted to get the intermediate output of the mapper and use it as input for another job while the mapper was still running. It turns out that this is not possible without heavily modifying the Hadoop 0.20.204 distribution. The way the system works, even after all the things that are specified in the map context:
map(...) {
    setup(context)
    ...
    cleanup(context)
}
and even after cleanup is called, there is still no spill to the temporary folder.
After the whole Map computation, everything eventually gets merged and spilled to disk, becoming the input for the shuffle and sort stages that precede the Reducer.
From everything I have read and looked at, the temporary folder where the output should eventually end up is the one I was guessing beforehand:
FileOutputFormat.getWorkOutputPath(context)
I managed to do what I wanted in a different way. Anyway, if there are any questions about this, let me know.
The TaskTracker starts a separate JVM process for every Map or Reduce task.
Mapper output (intermediate data) is written to the local file system (NOT HDFS) of each mapper slave node. Once the data has been transferred to the Reducer, you won't be able to access these temporary files.
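For reference, in 0.20-era Hadoop the local directories used for this intermediate data are controlled by the mapred.local.dir property in mapred-site.xml; the path below is just an example, not a default you can rely on:

```xml
<property>
  <name>mapred.local.dir</name>
  <!-- comma-separated list of local directories for intermediate map output -->
  <value>/var/lib/hadoop/mapred/local</value>
</property>
```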
If you want to see your Mapper output, I suggest using the IdentityReducer.
I want to write some lines to a file, and I need each line write to be an atomic operation.
For example, I have 3 lines:
111111111111111111111111
222222222222222222222222
333333333333333333333333
When I write them into a file line by line, the program may exit due to an error, so the saved data might be:
11111111111111111111111
222222
This is not what I expected. I want each line to be a transaction, an atomic operation.
How can I do this?
I am currently using Java.
There isn't a 100% reliable way to guarantee this.
I think the closest you can get is by calling flush() on the output stream and then sync() on the underlying file descriptor. Again, there are failure modes where this won't help.
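A minimal sketch of the flush-plus-sync approach: flush() pushes the stream's buffers to the OS, and sync() on the underlying file descriptor asks the OS to push its caches to the storage device. This narrows the window for data loss but, as noted, does not make the line write truly atomic:

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SyncedWriter {
    // Appends one line and forces it toward the physical device.
    static void writeLineSynced(Path file, String line) throws IOException {
        try (FileOutputStream out = new FileOutputStream(file.toFile(), true)) {
            out.write((line + System.lineSeparator()).getBytes());
            out.flush();          // flush stream-level buffers to the OS
            out.getFD().sync();   // ask the OS to flush its caches to disk
        }
    }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("synced", ".log");
        writeLineSynced(f, "111111111111111111111111");
        writeLineSynced(f, "222222222222222222222222");
        System.out.println(Files.readAllLines(f).size()); // 2
    }
}
```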
If you really need atomic appends of new lines to a file, I think the only way is to create a copy under a new name, write the new line, and rename the new file to the original name. The rename operation is atomic, at least under POSIX. On Windows you would need to remove the original file before renaming, which means you cannot restore the file if a problem occurs during that step.
You can use flush/sync as @aix suggests. Otherwise (and better: 99.999% reliable) use some sort of environment (such as a database) that includes transaction support, and use commit.
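The copy-and-rename approach can be sketched with java.nio; Files.move with ATOMIC_MOVE maps to an atomic rename(2) on POSIX systems. The temporary file is created in the same directory as the target so the rename stays within one file system:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

public class AtomicAppend {
    // Appends a line by writing a full copy to a temp file in the same
    // directory and atomically renaming it over the original. A reader
    // sees either the old file or the new one, never a half-written line.
    static void appendLineAtomically(Path file, String line) throws IOException {
        List<String> lines = Files.exists(file)
                ? new ArrayList<>(Files.readAllLines(file))
                : new ArrayList<>();
        lines.add(line);
        Path tmp = Files.createTempFile(file.getParent(), "atomic", ".tmp");
        Files.write(tmp, lines);
        Files.move(tmp, file, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempDirectory("atomic").resolve("data.txt");
        appendLineAtomically(f, "111111111111111111111111");
        appendLineAtomically(f, "222222222222222222222222");
        System.out.println(Files.readAllLines(f));
    }
}
```

Note the cost: each append rewrites the whole file, so this only makes sense for small files or infrequent writes; for anything heavier, the transactional-store suggestion above is the better route.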
I want to create a file in HDFS that has a bunch of lines, each generated by a different call to map. I don't care about the order of the lines, just that they all get added to the file. How do I accomplish this?
If this is not possible, then is there a standard way to generate unique file names to put each line of output into a separate file?
There is no way to append to an existing file in Hadoop at the moment, but that's not what it sounds like you want to do anyway. It sounds like you want the output of your MapReduce job to go to a single file, which is quite possible. The number of output files is (less than or) equal to the number of reducers, so if you set your number of reducers to 1, you'll get a single file of output.
Before you go and do that, however, think about whether that's what you really want. You'll be creating a bottleneck in your pipeline, because all your data has to pass through a single machine for that reduce. Within the HDFS distributed file system, the difference between having one file and having several files is pretty transparent. If you want a single file outside the cluster, you might do better to use getmerge from the file system tools.
Both your map and reduce functions should output the lines. In other words, your reduce function is a pass-through that doesn't do much. Set the number of reducers to 1. The output will be a list of all the lines in one file.