I am trying to save some data from the Mapper to the Job/Main so that I can use it in other jobs.
I tried using a static variable in my main class (the one that contains the main function). The Mapper adds data to it, but when I print the variable after the job is done there is no new data, as if the Mapper had modified another instance of that static variable.
Now I'm trying to use the Configuration to set the data from the Mapper:
Mapper
context.getConfiguration().set("3", "somedata");
Main
boolean step1Completed = step1.waitForCompletion(true);
System.out.println(step1.getConfiguration().get("3"));
Unfortunately this prints null.
Is there another way to do this? I want to save some data for use in other jobs, and using a file just for that seems a bit extreme, since the data is only an (int, String) index mapping some titles that I will need in my last job.
As far as I know, it is not possible. Mappers and Reducers run independently in a distributed fashion, and each task has its own local conf instance. Because each job is independent, you have to persist the data to HDFS.
You can also take advantage of the MapReduce chaining mechanism (example) to run a chain of jobs. In addition, you can design a workflow in Azkaban, Oozie, etc. to pass the output of one job to another.
It is indeed not possible since the configuration goes from the job to the mapper/reducer and not the other way around.
I ended up just reading the file directly from the HDFS in my last job's setup.
Thank you all for the input.
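The one-way flow of the configuration described above can be illustrated with a small plain-Java model. This is only a sketch and not Hadoop code: `ConfFlowSketch`, `runTask` and `driverViewAfterTask` are hypothetical names, and `java.util.Properties` stands in for the job Configuration that gets copied to each task.

```java
import java.util.Properties;

// Plain-Java model of the one-way configuration flow; this is not Hadoop code.
class ConfFlowSketch {

    // Hypothetical stand-in for a map task: it works on its own copy of the
    // job configuration, as a task running in a separate JVM would.
    static void runTask(Properties jobConf) {
        Properties taskConf = new Properties();
        taskConf.putAll(jobConf);               // copy shipped to the task
        taskConf.setProperty("3", "somedata");  // the change stays in the copy
    }

    // Returns what the driver sees under key "3" after the task has run.
    static String driverViewAfterTask() {
        Properties jobConf = new Properties();
        jobConf.setProperty("job.name", "step1");
        runTask(jobConf);
        return jobConf.getProperty("3");
    }

    public static void main(String[] args) {
        System.out.println(driverViewAfterTask()); // prints null
    }
}
```

The driver's copy never sees the key set inside the task, which matches the null printed in the question.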
Related
I need to do data reconciliation in Hadoop based on key comparisons. That means I will have old data in one folder and the newer data will be put into different folders. At the end of the batch I was planning simply on moving the newer data to reside with the old one. The data would be json files from which I have to extract the keys.
I'm taking my first steps with Hadoop so I just wanna do it with MapReduce program only, i.e. without tools such as Spark, Pig, Hive etc. I was thinking of simply going through all the old data at the beginning of the program, before Job object creation, and putting all the IDs into a Java HashMap that would be accessible from the mapper task. If there's a key missing in the newer data, the mapper would output that. The reducer would concern itself with categories of the IDs that are missing but that's another story. After the job has finished, I would move the newer data into old data's folder.
The only thing that I find a bit clunky is this loading phase into a Java HashMap object. This is probably not the most elegant solution, so I was wondering if the MapReduce model has some dedicated data structures/functionality for that kind of purpose (populating a global hash map with all the data from HDFS before the first map task is run)?
I don't think the HashMap solution is a good idea. You can pass several inputs to your job instead.
Based on the input file, the mapper can tell whether a record is old or new and write it out with a matching value. The reducer then checks whether a key appears only in the "new" input and, if so, writes it out.
As a result, the job outputs only the new data.
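The tagging idea from this answer can be sketched in plain Java without any Hadoop classes. Everything here is an illustrative assumption: the class `NewOnlySketch`, the `/data/new/` folder prefix, and the `{path, id}` record layout are made up for the example. In a real job the mapper would have to obtain the input path from its input split, and the grouping step would be done by the shuffle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Plain-Java model of "tag in the mapper, filter in the reducer".
class NewOnlySketch {

    // Map side: derive a tag from the (hypothetical) input folder.
    static String tagFor(String inputPath) {
        return inputPath.startsWith("/data/new/") ? "new" : "old";
    }

    // Each record is {inputPath, id}. Returns the ids seen only in new data.
    static List<String> idsOnlyInNew(List<String[]> records) {
        // Shuffle stand-in: group the tags by id.
        Map<String, Set<String>> tagsById = new TreeMap<>();
        for (String[] rec : records) {
            tagsById.computeIfAbsent(rec[1], k -> new TreeSet<>()).add(tagFor(rec[0]));
        }
        // Reduce side: keep ids whose only tag is "new".
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : tagsById.entrySet()) {
            Set<String> tags = e.getValue();
            if (tags.size() == 1 && tags.contains("new")) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> records = new ArrayList<>();
        records.add(new String[] {"/data/old/a.json", "id1"});
        records.add(new String[] {"/data/old/a.json", "id2"});
        records.add(new String[] {"/data/new/b.json", "id2"});
        records.add(new String[] {"/data/new/b.json", "id3"});
        System.out.println(idsOnlyInNew(records)); // prints [id3]
    }
}
```

Checking for "old" instead of "new" in the reduce step would give the ids that are missing from the newer data, which is what the question asks for.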
For spark streaming, are there ways that we can maintain state only for the current window? I understand updateStateByKey works but that maintains the state forever unless we purge it. Is it possible to store and reset the state per window?
To give more context. I'm trying to convert one type of object into another within a windowed stream. However, the conversion is the following:
Object 1 is either an invocation or a response.
Object 2 is not considered complete until we see both an invocation and a response.
However, since the response for an object could be in a separate batch, I need to maintain state across batches.
But I only wish to maintain the state for the current window. Is there any way I could achieve this with Spark?
thank you!
You can use the mapWithState transformation instead of updateStateByKey and set a timeout on the StateSpec equal to your batch interval. That way the state only ever covers the last batch. This works only if your invocation and response depend on the last batch alone; otherwise, trying to update a key that has already been removed will throw an exception.
mapWithState also performs better than updateStateByKey.
You can find a sample code snippet below.
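import org.apache.spark.streaming._

// updateUserEvents is the user-defined state mapping function (not shown here)
val stateSpec =
  StateSpec
    .function(updateUserEvents _)
    .timeout(Minutes(5)) // state of a key that stays idle this long is removed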
I have two Mapper classes that take files from the same folder as input. The file name contains a timestamp, and that timestamp determines which mapper a file is given to. Sometimes the same input file has to be given to both Mappers. I've tested that everything works when the two Mappers get different inputs, but when I give them the same input, one of the Mapper classes doesn't generate the result to be used for comparison in the reducer.
The code is enormous, so instead of putting it here I'll describe what I did. I scan the files in the directory and, based on the timestamps in their names, put them into two lists, one per Mapper. The two Mappers compute different things, and their results are compared in the reducer. When the same input file ends up in both lists (the time criteria for the two mappers are almost the same), one of the mappers doesn't generate any result. Is it because one mapper cannot access the file while the other is using it, and if so, is there any way around it?
Here MapPath1 is one list and MapPath2 is the other:
for (i = 0; i < MapPath1.size(); i++) {
    MultipleInputs.addInputPath(job, new Path(MapPath1.get(i)), TextInputFormat.class, Map1.class);
}
if (type.equals("comparative")) {
    for (i = 0; i < MapPath2.size(); i++) {
        MultipleInputs.addInputPath(job, new Path(MapPath2.get(i)), TextInputFormat.class, Map2.class);
    }
}
Update
I just found this question (Multiple mappers in hadoop), which is similar to mine, but I don't want to duplicate the input file as it can be large. Can anyone tell me how I can create two separate jobs using different Mappers and feed both into a single reducer?
one of the Mapper class doesn't generate the result to be used for comparison in the reducer.
My guess is that both mappers are launched on the same task tracker node and the intermediate mapper output location is shared by both map tasks. You should check the task tracker nodes where these map tasks are launched to confirm this.
You should also run a mapper-only job, by setting the number of reduce tasks to zero, and check the output. This confirms whether the mappers are sharing output directories.
As for a solution to your problem: it sounds like you are passing the same file to both mappers and the data from both mappers goes to a single reducer. This involves some duplication. Is it OK for your job output to have this duplication?
I have a Hadoop job. When the job is started, some number of mappers are started, and each mapper writes a file to disk, like part-m-00000, part-m-00001. As I understand it, each mapper creates one part file. I have a large amount of data, so there must be more than one mapper, but can I somehow control the number of these output files? I mean, Hadoop would start, for example, 10 mappers, but there would be only three part files?
I found this post
How do multiple reducers output only one part-file in Hadoop?
But it uses the old version of the Hadoop library. I'm using classes from org.apache.hadoop.mapreduce.* and not from org.apache.hadoop.mapred.*
I'm using hadoop version 0.20, and hadoop-core:1.2.0.jar
Is there any way to do this using the new Hadoop API?
The number of output files equals the number of reducers, or the number of mappers if there aren't any reducers.
You can add a single reducer to your job so that the output from all the mappers is directed to it and you get a single output file. Note that this will be less efficient, as all the data (the output of the mappers) will be sent over the wire (network IO) to the node where the reducer runs. Also, since a single process will (eventually) get all the data, it will probably run slower.
By the way, the fact that there are multiple parts shouldn't be very significant, as you can pass the directory containing them to subsequent jobs.
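To see why the reducer count bounds the number of part files, here is a small plain-Java sketch. It applies the hash-and-modulo formula used by Hadoop's default HashPartitioner to `String` keys; the class `PartCountSketch`, its method names and the generated keys are assumptions made for this example, and no Hadoop classes are involved.

```java
import java.util.Set;
import java.util.TreeSet;

// Plain-Java model: records from many mappers land in at most
// numReduceTasks partitions, so at most that many part-r-* files exist.
class PartCountSketch {

    // Hash-and-modulo partitioning, applied here to String.hashCode()
    // (an assumption; real jobs hash the Writable key).
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Collects the distinct partitions hit by all records of all mappers.
    static Set<Integer> partitionsUsed(int numMappers, int recordsPerMapper, int numReduceTasks) {
        Set<Integer> used = new TreeSet<>();
        for (int m = 0; m < numMappers; m++) {
            for (int r = 0; r < recordsPerMapper; r++) {
                used.add(partitionFor("key-" + m + "-" + r, numReduceTasks));
            }
        }
        return used;
    }

    public static void main(String[] args) {
        // 10 mappers, 3 reducers: never more than 3 distinct partitions.
        System.out.println(partitionsUsed(10, 100, 3).size());
    }
}
```

With the new API the reducer count is set on the job via `job.setNumReduceTasks(3)`; the trade-off between fewer files and less parallelism is the one described in the answer.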
I'm not sure you can do it (your link is about multiple outputs, not about converging to only one). And why use only one output? You would lose all parallelism in the sort.
I'm also working on big files (~10GB each), and my MR jobs process almost 100GB each. To lower the number of map tasks, I set a higher block size in HDFS (it applies only to newly written files) and a higher value of mapred.min.split.size in mapred-site.xml.
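The effect of the minimum split size on the number of map tasks can be sketched with a little arithmetic. The split-size rule below is the max/min formula used by the new-API FileInputFormat; the class `SplitSizeSketch`, the helper `approxSplits` and the concrete sizes are assumptions for illustration, and the split count ignores the small slack Hadoop allows on the last split.

```java
// Plain-Java arithmetic for how split size drives the number of map tasks.
class SplitSizeSketch {

    static final long MB = 1024L * 1024L;

    // Split size: the block size, clamped between the min and max split sizes.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Approximate number of splits (one map task each) for a single file.
    static long approxSplits(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize; // ceiling division
    }

    public static void main(String[] args) {
        long fileSize = 10L * 1024L * MB; // a 10GB file, as in the answer
        long block = 64L * MB;            // assumed HDFS block size

        long defaultSplit = computeSplitSize(block, 1L, Long.MAX_VALUE);
        long raisedSplit = computeSplitSize(block, 256L * MB, Long.MAX_VALUE);

        System.out.println(approxSplits(fileSize, defaultSplit)); // prints 160
        System.out.println(approxSplits(fileSize, raisedSplit));  // prints 40
    }
}
```

Raising the minimum split size above the block size gives fewer, larger splits, at the cost of some data locality because one split then spans several blocks.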
You might want to look at MultipleOutputFormat
Part of what Javadoc says:
This abstract class extends the FileOutputFormat, allowing to write
the output data to different output files.
Both Mapper and Reducer can use this.
Check this link for how you can specify one or more output file names for different mappers writing to HDFS.
NOTE: Make sure you don't use context.write(), so that 10 files from 10 mappers don't get created. Use only MultipleOutputFormat for output.
If the job has no reducers, partitioners or combiners, each mapper writes one output file. At some point, you should run some post-processing to collect the outputs into a large file.
I am trying to find out where the output of a Map task is saved to disk before it can be used by a Reduce task.
Note: - version used is Hadoop 0.20.204 with the new API
For example, when overriding the map method in the Map class:
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
    }
    // code that starts a new Job.
}
I am interested in finding out where context.write() ends up writing the data. So far I've run into:
FileOutputFormat.getWorkOutputPath(context);
which gives me the following location on HDFS:
hdfs://localhost:9000/tmp/outputs/1/_temporary/_attempt_201112221334_0001_m_000000_0
When I try to use it as input for another job, it gives me the following error:
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://localhost:9000/tmp/outputs/1/_temporary/_attempt_201112221334_0001_m_000000_0
Note: the job is started in the Mapper, so technically the temporary folder where the Mapper task is writing its output exists when the new job begins. Even so, it still says that the input path does not exist.
Any ideas as to where the temporary output is written? Or what is the location where I can find the output of a Map task during a job that has both a Map and a Reduce stage?
The MapReduce framework stores intermediate output on local disk rather than in HDFS, as writing it to HDFS would cause unnecessary replication of files.
So, I've figured out what is really going on.
The output of the mapper is held in an in-memory buffer until the buffer is about 80% full. At that point it begins to dump (spill) the contents to the local disk while continuing to admit items into the buffer.
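The buffering behaviour just described can be modelled in a few lines of plain Java. This is a deliberately simplified sketch: `SpillSketch` and `countSpills` are hypothetical names, the buffer is emptied instantly on each spill (the real framework spills in the background while the mapper keeps writing), and the final flush at the end of the map task is not counted.

```java
// Plain-Java model of a map-side buffer that spills at a fill threshold.
class SpillSketch {

    // Counts how many times the buffer crosses the spill threshold.
    static int countSpills(long bufferBytes, double spillPercent, long[] recordSizes) {
        long threshold = (long) (bufferBytes * spillPercent);
        long used = 0;
        int spills = 0;
        for (long size : recordSizes) {
            used += size;            // record admitted into the buffer
            if (used >= threshold) { // threshold reached: dump to local disk
                spills++;
                used = 0;            // simplification: spill frees the buffer at once
            }
        }
        return spills;
    }

    public static void main(String[] args) {
        long[] records = new long[20];
        java.util.Arrays.fill(records, 10L); // twenty 10-byte records
        // 100-byte buffer, 80% threshold: a spill after every 8 records.
        System.out.println(countSpills(100L, 0.80, records)); // prints 2
    }
}
```

The last four records in the example stay below the threshold, which corresponds to the data that is only written out when the map task finishes.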
I wanted to get the intermediate output of the mapper and use it as input for another job while the mapper was still running. It turns out that this is not possible without heavily modifying the Hadoop 0.20.204 deployment. The way the system works, even after everything specified in the map context has run:
map .... {
setup(context)
.
.
cleanup(context)
}
and cleanup has been called, nothing has yet been dumped to the temporary folder.
Only after the whole Map computation does everything get merged and dumped to disk, where it becomes the input for the shuffle and sort stages that precede the Reducer.
From everything I've read and looked at so far, the temporary folder where the output should eventually end up is the one I had guessed beforehand:
FileOutputFormat.getWorkOutputPath(context)
I managed to do what I wanted in a different way. Anyway, if there are any questions about this, let me know.
The task tracker starts a separate JVM process for every Map or Reduce task.
Mapper output (intermediate data) is written to the local file system (NOT HDFS) of each mapper slave node. Once the data has been transferred to the Reducer, we won't be able to access these temporary files.
If you want to see your Mapper output, I suggest using IdentityReducer.