I have the following use case. In an Oozie workflow, a map-reduce action generates a series of diagnostic counters. I want to have another Java action following the map-reduce action. The Java action basically does validation based on the counters from the map-reduce action and generates some notifications based on the validation conditions and results. The key thing for this idea to work is that the Java action must be able to access all counters of the upstream map-reduce action, just like how Oozie can use EL to access them in its workflow XML.
Right now I have no idea where to start for this. So, any pointer is very much appreciated.
update
For example, suppose I have a map-reduce action named foomr. In the Oozie workflow XML, you can use EL to access counters, e.g., ${hadoop:counters("foomr")[RECORDS][MAP_IN]}. Then, my question would be: how can I get the same counter inside a Java action? Does Oozie expose any API to access the values that EL can access in the workflow XML?
You can use the capture-output element to capture the output of the Java action. This output, in Java properties format, can then be propagated to other Oozie nodes.
The capture-output element can be used to propagate values back into the Oozie context, which can then be accessed via EL functions. The values need to be written out as a Java properties file. (From the Oozie documentation.)
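For example, with <capture-output/> declared on the Java action, the action's main class writes a Java properties file to the path Oozie supplies via the oozie.action.output.properties system property. Below is a minimal sketch; the property name validationStatus is just a placeholder:

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.Properties;

public class ValidationMain {
    public static void main(String[] args) throws Exception {
        // ... run the validation checks here ...

        Properties props = new Properties();
        props.setProperty("validationStatus", "OK");   // placeholder key/value

        // Oozie passes the location for the action's output via this system property
        // when <capture-output/> is declared for the Java action.
        File outFile = new File(System.getProperty("oozie.action.output.properties"));
        try (OutputStream os = new FileOutputStream(outFile)) {
            props.store(os, "output of the java action");
        }
    }
}

A downstream node should then be able to read the value with something like ${wf:actionData('java-validate')['validationStatus']}, where java-validate is a hypothetical name for the Java action node.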
See the example below for how EL functions and constants are used in a Pig action. The following Hadoop EL constants can also be used.
Hadoop EL Constants
RECORDS: Hadoop record counters group name.
MAP_IN: Hadoop mapper input records counter name.
MAP_OUT: Hadoop mapper output records counter name.
REDUCE_IN: Hadoop reducer input records counter name.
REDUCE_OUT: Hadoop reducer output records counter name.
GROUPS: Hadoop mapper/reducer record groups counter name.
Below is an example showing the use of the wf:user() EL function to calculate a path dynamically. In a similar way you can use the above Hadoop EL constants or user-defined parameters in a workflow.
<pig>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <prepare>
        <delete path="${nameNode}/user/${wf:user()}/${examplesRoot}/output-data/pig"/>
    </prepare>
    <configuration>
        <property>
            <name>mapred.job.queue.name</name>
            <value>${queueName}</value>
        </property>
    </configuration>
    <script>id.pig</script>
    <param>INPUT=/user/${wf:user()}/${examplesRoot}/input-data/text</param>
    <param>OUTPUT=/user/${wf:user()}/${examplesRoot}/output-data/pig</param>
</pig>
Edit :
You can also use the Oozie Java API, which returns the wf_actionData for a given action name:
org.apache.oozie.DagELFunctions.wf_actionData(String actionName)
Return the action data for an action.
Parameters: actionName - action name.
Returns: value of the property.
I saw the lines below in the Oozie docs under the Parameterization of Workflows section:
EL expressions can be used in the configuration values of action and decision nodes. They can be used in XML attribute values and in XML element and attribute values.
They cannot be used in XML element and attribute names. They cannot be used in the name of a node and they cannot be used within the transition elements of a node.
oozie docs
I think Oozie does not expose the workflow action data within the action nodes; we can pass it from outside as parameters to the Java action.
If Hadoop counters are to be accessed, then I think you should check whether YARN or the JobTracker exposes any web service APIs where you can pass a job name and get the corresponding counters as output.
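One option along those lines (an assumption on my part, not something Oozie provides out of the box) is to pass the Hadoop job id of the map-reduce action into the Java action, for example as an <arg> built from the wf:actionExternalId EL function, and then query the counters with the Hadoop client API. A rough sketch, with the job-id argument and counter names as assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Cluster;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobID;

public class CounterFetcher {
    public static void main(String[] args) throws Exception {
        String jobIdStr = args[0];                 // e.g. the external id of the foomr action, passed in by the workflow
        Configuration conf = new Configuration();  // expects yarn-site.xml / mapred-site.xml on the classpath
        Cluster cluster = new Cluster(conf);
        Job job = cluster.getJob(JobID.forName(jobIdStr));
        if (job == null) {
            throw new IllegalArgumentException("Unknown job id: " + jobIdStr);
        }
        Counters counters = job.getCounters();
        long mapIn = counters.findCounter(
                "org.apache.hadoop.mapreduce.TaskCounter", "MAP_INPUT_RECORDS").getValue();
        System.out.println("MAP_INPUT_RECORDS = " + mapIn);
        // ... validate the counters and send notifications here ...
    }
}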
Related
I am trying to save some data from the Mapper to the Job/Main so that I can use it in other jobs.
I tried to use a static variable in my main class (the one that contains the main function), but when the Mapper adds data to the static variable and I try to print the variable after the job is done, there is no new data. It's as if the Mapper modified another instance of that static variable.
Now I'm trying to use the Configuration to set the data from the Mapper:
Mapper
context.getConfiguration().set("3", "somedata");
Main
boolean step1Completed = step1.waitForCompletion(true);
System.out.println(step1.getConfiguration().get("3"));
Unfortunately this prints null.
Is there another way to do things? I am trying to save some data so that I can use it in other jobs, and I find using a file just for that a bit extreme, since the data is only an (int, string) index mapping some titles that I will need in my last job.
It is not possible as far as I know. Mappers and Reducers work independently, in a distributed fashion, and each task has its own local Configuration instance. Since each job is independent, you have to persist the data to HDFS.
You can also take advantage of the MapReduce job chaining mechanism to run a chain of jobs. In addition, you can design a workflow in Azkaban, Oozie, etc. to pass the output of one job to another.
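For the chaining route, a minimal sketch with Hadoop's JobControl/ControlledJob classes might look like this; job1 and job2 are assumed to be fully configured org.apache.hadoop.mapreduce.Job instances created elsewhere:

import java.io.IOException;
import java.util.Collections;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

public class JobChain {
    public static void runChain(Job job1, Job job2) throws IOException, InterruptedException {
        ControlledJob step1 = new ControlledJob(job1, null);
        // step2 only starts once step1 has completed successfully
        ControlledJob step2 = new ControlledJob(job2, Collections.singletonList(step1));

        JobControl control = new JobControl("two-step-chain");
        control.addJob(step1);
        control.addJob(step2);

        Thread runner = new Thread(control);   // JobControl is a Runnable that polls job states
        runner.start();
        while (!control.allFinished()) {
            Thread.sleep(1000);
        }
        control.stop();
    }
}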
It is indeed not possible since the configuration goes from the job to the mapper/reducer and not the other way around.
I ended up just reading the file directly from the HDFS in my last job's setup.
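For reference, reading such a small index file directly from HDFS in a later job's setup() could look roughly like the sketch below; the path and the "id<TAB>title" line format are just assumptions for illustration:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TitleLookupMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<Integer, String> titles = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Hypothetical location of the index written by the earlier job.
        Path indexPath = new Path("/user/me/title-index/part-r-00000");
        FileSystem fs = FileSystem.get(context.getConfiguration());
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(indexPath)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2);   // assumes "id<TAB>title" lines
                titles.put(Integer.parseInt(parts[0]), parts[1]);
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... use the titles map while processing each record ...
    }
}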
Thank you all for the input.
I have a simple input file with 2 columns like
pkg1 date1
pkg2 date2
pkg3 date3
...
...
I want to create an Oozie workflow which will process each row separately. For each row, I want to run multiple actions one after another (Hive, Pig, ...) and then process the next row.
But it is more difficult than I expected. I think I have to create a loop somehow and iterate through the rows.
Can you give me architectural advice on how I can achieve this?
Oozie does not support loops/cycles, since a workflow is a Directed Acyclic Graph:
https://oozie.apache.org/docs/3.3.0/WorkflowFunctionalSpec.html#a2.1_Cycles_in_Workflow_Definitions
Also, there is no inbuilt way (that I'm aware of) to read data from Hive into an Oozie workflow and use it to control the flow of the Oozie workflow.
You could have a single Oozie workflow which launches some custom process (e.g. a Shell Action), and within that process read the data from Hive, and launch a new, separate, Oozie workflow for each entry.
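For example, the custom process could use the Oozie client API to launch one workflow per row; the Oozie URL, application path and property names below are placeholders, and in practice the rows would come from reading your input file:

import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.OozieClientException;

public class PerRowLauncher {
    public static void main(String[] args) throws OozieClientException {
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        // One launch per input row; pkg/date would normally be read from the file.
        String[][] rows = { {"pkg1", "date1"}, {"pkg2", "date2"} };
        for (String[] row : rows) {
            Properties conf = oozie.createConfiguration();
            conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode/apps/per-row-wf");
            conf.setProperty("pkg", row[0]);
            conf.setProperty("date", row[1]);
            String jobId = oozie.run(conf);   // submits and starts the workflow
            System.out.println("Launched workflow " + jobId + " for " + row[0]);
        }
    }
}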
I totally agree with @Mattinbits, you must use some procedural code (shell script, Python, etc.) to run the loop and fire the appropriate Pig/Hive tasks.
But if your process must wait for the tasks to complete before launching the next batch, the coordination part might become a bit more complicated to implement. I can think of a very evil way to use Oozie for that coordination...
write down a generic Oozie Workflow that runs the Pig/Hive actions for 1 set of parameters, passed as properties
write down a "master template" Oozie workflow that just runs the WF above as a sub-workflow with dummy values for the properties
cut the template into 3 parts: XML header, sub-workflow call (with placeholders for the actual property values) and XML footer
your loop will then build the actual "master" workflow dynamically, by concatenating the header, a call to the sub-workflow for the 1st set of values, another call for the 2nd set, and so on, then the footer; finally, submit the workflow to the Oozie server (using the REST API or the command-line interface)
Of course there are some other things to take care of: generating unique names for the sub-workflow actions, chaining them, handling errors. The usual stuff.
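A rough Java sketch of that concatenation step; the @@...@@ placeholder tokens and the action names are invented for the example:

import java.util.List;

public class MasterWorkflowBuilder {

    // header, callTemplate and footer are the three pieces cut out of the master template.
    public static String build(String header, String callTemplate, String footer,
                               List<String[]> rows) {
        StringBuilder xml = new StringBuilder(header);
        for (int i = 0; i < rows.size(); i++) {
            String[] row = rows.get(i);
            String next = (i + 1 < rows.size()) ? "call-" + (i + 1) : "end";
            xml.append(callTemplate
                    .replace("@@NAME@@", "call-" + i)   // unique name for each sub-workflow action
                    .replace("@@NEXT@@", next)          // chain to the next call, or to the end node
                    .replace("@@PKG@@", row[0])
                    .replace("@@DATE@@", row[1]));
        }
        xml.append(footer);
        return xml.toString();
    }
}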
My current issue here is with trying to develop a set of Knime nodes that provide integration with Apache Oozie. Namely, I'm trying to build, launch and monitor Oozie workflows from within Knime.
I've had some success with implementing this for linear Oozie workflows, but have become quite stumped when branching needs to be included.
As background, let me explain the way I did this for linear workflows:
Essentially my solution expresses each Oozie Action as a Knime Node. Each of these nodes has 2 modes of operation, the proper one being called based on the content of certain flow variables. These 2 modes are needed because I have to execute the Oozie portion (OozieStartAction to OozieStopAction) twice, the first iteration generating the Oozie workflow, and the second launching and monitoring it. Also, flow variables persist between iterations of this loop.
In one mode of operation, a node appends the xml content particular to the Oozie action it represents to the overall Oozie workflow xml and then forwards it.
In the other, the node simply polls Oozie for the status of the action it represents.
The following flow vars are used in this workflow:
-OOZIE_XML: contains oozie workflow xml
-OOZIE_JOB_ID: id of the running oozie job launched with the assembled workflow
-PREV_ACTION_NAME: Name of the previous action
In the example above, what would happen step by step is the following:
-OozieStartNode runs, sees it has a blank or no OOZIE_XML variable, so it creates one itself, setting the basic workflow-app and start xml nodes. It also creates a PREV_ACTION_NAME flow var with value "start".
-The first OozieGenericAction sees that it has a blank OOZIE_JOB_ID so it appends a new action to the workflow-app node in the received OOZIE_XML, gets the node with the "name" attribute equal to PREV_ACTION_NAME and sets its transition to the action it just created. PREV_ACTION_NAME is then overwritten with the current action's name.
...
-The StopOozieAction simply creates an end node and sets the previous action's transition to it, much like the previous generic action.
-In the second iteration, OozieStart sees it has XML data, so the secondary execution mode is called. This uploads the workflow XML into HDFS, creates a new Oozie job with this workflow, and forwards the received job id as OOZIE_JOB_ID.
-The following Oozie actions, having a valid OOZIE_JOB_ID, simply poll Oozie for their action names' status, ending execution once their respective actions finish running.
The main problem I'm facing is in the workflow XML assembly: for one, I can't use the previous-node-name variable when branching is involved. If I had a join action with many nodes linking to it, one previous node would overwrite the others and the node relation data would be lost.
Does anybody have any broad ideas in which way I could take this?
How about converting the variable to a column, so that there's a column in the recursive loop called "Previous Action Name"? It might seem like overkill keeping the same value in every row, but the recursive loop would pass it along just like any other column.
BTW, have you seen these?
https://www.knime.org/knime-big-data-connectors
So I have a MapReduce job that takes in multiple news articles and outputs the following key-value pairs.
.
.
.
<article_id, social_tag.name, social_tag.isCompany, social_tag.code>
<article_id2, social_tag2.name, social_tag2.isCompany, social_tag.code>
<article_id, topic_code.name, topic_code.isCompany, topic_code.rcsCode>
<article_id3, social_tag3.name, social_tag3.isCompany, social_tag.code>
<article_id2, topic_code2.name, topic_code2.isCompany, topic_code2.rcsCode>
.
.
.
As you can see, there are two main types of data rows that I am currently outputting, and right now these get mixed up in the flat files produced by MapReduce. Is there any way I can simply output social_tags to file1 and topic_codes to file2, OR maybe output social_tags to a specified group of files (social1.txt, social2.txt, etc.) and topic_codes to another group (topic1.txt, topic2.txt, etc.)?
The reason I'm asking this is so that I can easily store all these into Hive tables later on. I would preferably want to have a separate table for each different data type (topic_code, social_tag, etc.). If any of you know a simple way to achieve this without separating the MapReduce output into different files, that would be really helpful too.
Thanks in advance!
You can use MultipleOutputs as already suggested.
Since you have asked for a simple way to achieve this without separating the MapReduce output into different files, here is a quick way, provided the amount of data is not really huge and the logic to differentiate the data is not too complex.
First load the mixed output file into a Hive table (say main_table). Then you can create two different tables (topic_code, social_tag) and insert the data from the main table after filtering it with a WHERE clause.
hive > insert into table topic_code
> select * from main_table
> where $condition;
// $condition = the logic you would use to differentiate the records in the MR job
I think you can try MultipleOutputs, available in the Hadoop API. MultipleOutputs allows you to write data to files whose names are derived from the output keys and values, or in fact from an arbitrary string. This allows each reducer (or mapper in a map-only job) to create more than a single file. File names are of the form name-m-nnnnn for map outputs and name-r-nnnnn for reduce outputs, where name is an arbitrary name that is set by the program, and nnnnn is an integer designating the part number, starting from zero.
In the reducer, where we generate the output, we construct an instance of MultipleOutputs in the setup() method and assign it to an instance variable. We then use the MultipleOutputs instance in the reduce() method to write to the output, in place of the context. The write() method takes the key and value, as well as a name.
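A minimal reducer sketch along these lines; the record-type test and the topic/social base names are assumptions about your data, not anything prescribed by the API:

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class TagReducer extends Reducer<Text, Text, NullWritable, Text> {

    private MultipleOutputs<NullWritable, Text> out;

    @Override
    protected void setup(Context context) {
        out = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // Hypothetical test: rows carrying an rcsCode are topic_code records.
            String baseName = value.toString().contains("rcsCode") ? "topic" : "social";
            // Writes to files named topic-r-nnnnn / social-r-nnnnn instead of part-r-nnnnn.
            out.write(NullWritable.get(), value, baseName);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        out.close();   // must be closed, otherwise output may be lost
    }
}

With the files split like this, each group can then be loaded into its own Hive table directly.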
You can look into the below link for details
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html
So I have ten different files, where each file looks like this:
<DocID1> <RDF Document>
<DocID2> <RDF Document>
.
.
.
.
<DocID50000> <RDF Document>
There are actually ~56,000 lines per file. There's a document ID in each line and an RDF document.
My objective is to pass each line into a mapper as the input key-value pair and emit multiple output key-value pairs. In the reduce step, I will store these into a Hive table.
I have a couple of questions getting started and I am completely new to RDF/XML files.
How am I supposed to parse each line of the document to get the document ID and the RDF document separately, so they can be passed to each mapper?
Is there an efficient way of controlling the size of the input for the mapper?
1- If you are using TextInputFormat, you automatically get 1 line (1 record) in each mapper call as the value. Convert this line into a String and do the desired processing. Alternatively, you could make use of the Hadoop Streaming API by using StreamXmlRecordReader. You have to provide the start and end tags, and all the information sandwiched between the start and end tags will be fed to the mapper (in your case <DocID1> and <RDF Document>).
Usage :
hadoop jar hadoop-streaming.jar -inputreader "StreamXmlRecord,begin=DocID,end=RDF Document" ..... (rest of the command)
2- Why do you need that? Your goal is to feed one complete line to a mapper; that is the job of the InputFormat you are using. If you still need it, you have to write custom code, and for this particular case it's going to be a bit tricky.
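Coming back to the TextInputFormat route in point 1, a minimal mapper sketch might look like the following; the assumption that the DocID and the RDF document are separated by whitespace on each line is mine, so adjust the split to your actual delimiter:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RdfLineMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Text docId = new Text();
    private final Text rdfDoc = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Assumes each line is "<DocID>" then whitespace, then the RDF document.
        String[] parts = line.toString().split("\\s+", 2);
        if (parts.length == 2) {
            docId.set(parts[0]);
            rdfDoc.set(parts[1]);
            context.write(docId, rdfDoc);   // emit one or more pairs per line as needed
        }
    }
}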