My current issue is with trying to develop a set of KNIME nodes that provide integration with Apache Oozie. Namely, I'm trying to build, launch and monitor Oozie workflows from within KNIME.
I've had some success with implementing this for linear Oozie workflows, but have become quite stumped when branching needs to be included.
As background, let me explain the way I did this for linear workflows:
Essentially, my solution expresses each Oozie Action as a KNIME node. Each of these nodes has two modes of operation, the proper one being called based on the content of certain flow variables. These two modes are needed because I have to execute the Oozie portion (OozieStartAction to OozieStopAction) twice: the first iteration generates the Oozie workflow, and the second launches and monitors it. Also, flow variables persist between iterations of this loop.
In one mode of operation, a node appends the xml content particular to the Oozie action it represents to the overall Oozie workflow xml and then forwards it.
In the other, the node simply polls Oozie for the status of the action it represents.
The following flow vars are used in this workflow:
-OOZIE_XML: contains oozie workflow xml
-OOZIE_JOB_ID: id of the running oozie job launched with the assembled workflow
-PREV_ACTION_NAME: Name of the previous action
In the example above, what would happen step by step is the following:
-OozieStartNode runs, sees it has a blank or no OOZIE_XML variable, so it creates one itself, setting the basic workflow-app and start xml nodes. It also creates a PREV_ACTION_NAME flow var with value "start".
-The first OozieGenericAction sees that it has a blank OOZIE_JOB_ID so it appends a new action to the workflow-app node in the received OOZIE_XML, gets the node with the "name" attribute equal to PREV_ACTION_NAME and sets its transition to the action it just created. PREV_ACTION_NAME is then overwritten with the current action's name.
...
-The StopOozieAction simply creates an end node and sets the previous action's transition to it, much like the previous generic action.
-In the second iteration, OozieStart sees it has XML data, so the secondary execution mode is called. This uploads the workflow XML into HDFS, creates a new Oozie job with this workflow, and forwards the received job ID as OOZIE_JOB_ID.
-The following Oozie Actions, having a valid OOZIE_JOB_ID, simply poll Oozie for the status of their action names, ending execution once their respective actions finish running.
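To make the two modes concrete, the dispatch inside each node's execute method looks roughly like this (a minimal Java sketch, not the actual implementation; appendActionXml, waitForActionCompletion and m_actionName are hypothetical helpers, while peekFlowVariableString/pushFlowVariableString are standard KNIME NodeModel methods):

String jobId = peekFlowVariableString("OOZIE_JOB_ID");
if (jobId == null || jobId.isEmpty()) {
    // First pass: extend the workflow XML and record this node as the predecessor
    String xml = peekFlowVariableString("OOZIE_XML");
    String prev = peekFlowVariableString("PREV_ACTION_NAME");
    pushFlowVariableString("OOZIE_XML", appendActionXml(xml, m_actionName, prev)); // hypothetical helper
    pushFlowVariableString("PREV_ACTION_NAME", m_actionName);
} else {
    // Second pass: block until Oozie reports this node's action as finished
    waitForActionCompletion(jobId, m_actionName); // hypothetical helper that polls Oozie
}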
The main problem I'm facing is in the workflow XML assembly: for one, I can't use the previous-node-name variable when branching is involved. If I had a join action with many nodes linking to it, one previous node name would overwrite the others, and node relation data would be lost.
Does anybody have any broad ideas on which way I could take this?
How about using a variable-to-column approach, so that there's a column in the recursive loop called "Previous Action Name"? It might seem like overkill keeping the same value in it for all rows, but the recursive loop would pass it along just like any other column.
BTW, have you seen these?
https://www.knime.org/knime-big-data-connectors
I'm implementing the logic for rebuilding a file paginated across several Kafka messages with the same key. Every time a page is received, its content is appended to the corresponding file in a shared volume, and once the last page is appended, the topology has to include some extra processing steps.
Should this be done with forEach or with process?
Both forEach and process have a void return type, so how can the final extra steps then be added to the topology?
Both accomplish the same goal. foreach is a terminal action of the DSL; however, the process method of the Processor API "returns" data to the next Processors by forwarding it to the Context (as answered in your last question), not via the process method itself.
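To make the forwarding point concrete, here is a minimal Java sketch of a Processor for the paging use case (the helper methods and the end-of-file marker are assumptions for illustration, not part of the Kafka Streams API):

import org.apache.kafka.streams.processor.AbstractProcessor;

public class PageAssembler extends AbstractProcessor<String, String> {
    @Override
    public void process(String fileKey, String page) {
        appendToSharedFile(fileKey, page);   // hypothetical helper writing to the shared volume
        if (isLastPage(page)) {
            // Although process() returns void, forward() hands the record
            // on to the downstream processors of the topology
            context().forward(fileKey, fileKey + " complete");
        }
    }

    private void appendToSharedFile(String key, String page) { /* append page to file */ }
    private boolean isLastPage(String page) { return page.endsWith("<EOF>"); } // assumed marker
}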
For Spark Streaming, are there ways we can maintain state only for the current window? I understand updateStateByKey works, but that maintains the state forever unless we purge it. Is it possible to store and reset the state per window?
To give more context. I'm trying to convert one type of object into another within a windowed stream. However, the conversion is the following:
Object 1 is either an invocation or a response.
Object 2 is not considered complete until we see both an invocation and a response.
However, since the response for an object could be in a separate batch, I need to maintain state across batches.
But I only wish to maintain the state for the current window. Are there any ways I could achieve this through Spark?
thank you!
You can use the mapWithState transformation instead of updateStateByKey, and set a timeout on the StateSpec with the duration of your batch interval. This way you keep the state only for the last batch each time. But it only works if your invocation and response depend solely on the last batch; otherwise, when you try to update a key which has already been removed, it will throw an exception.
mapWithState also performs better than updateStateByKey.
You can find a sample code snippet below.
import org.apache.spark.streaming._

// updateUserEvents is the user-supplied update function; assumed Int types for illustration:
def updateUserEvents(key: String, value: Option[Int], state: State[Int]): Option[Int] = {
  val total = value.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(total) // keep the running total for this key
  Some(total)         // emitted to the downstream mapped stream
}

val stateSpec = StateSpec
  .function(updateUserEvents _)
  .timeout(Minutes(5)) // keys idle longer than this are evicted
I have a simple input file with 2 columns like
pkg1 date1
pkg2 date2
pkg3 date3
...
...
I want to create an Oozie workflow which will process each row separately. For each row, I want to run multiple actions one after another (Hive, Pig, ...) and then process the next row.
But it is more difficult than I expected. I think I have to create a loop somehow and iterate through it.
Can you give me architectural advice on how I can achieve this?
Oozie does not support loops/cycles, since a workflow is a Directed Acyclic Graph:
https://oozie.apache.org/docs/3.3.0/WorkflowFunctionalSpec.html#a2.1_Cycles_in_Workflow_Definitions
Also, there is no inbuilt way (that I'm aware of) to read data from Hive into an Oozie workflow and use it to control the flow of the Oozie workflow.
You could have a single Oozie workflow which launches some custom process (e.g. a Shell Action), and within that process read the data from Hive, and launch a new, separate, Oozie workflow for each entry.
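For instance, that launcher process could use the Oozie Java client API to fire one workflow per entry; a rough sketch (the server URL, app path, property names and hard-coded rows are assumptions, with the rows read from Hive in practice):

import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class PerRowLauncher {
    public static void main(String[] args) throws Exception {
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");
        String[][] rows = {{"pkg1", "date1"}, {"pkg2", "date2"}}; // read from Hive in practice
        for (String[] row : rows) {
            Properties conf = oozie.createConfiguration();
            conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode/apps/per-row-wf");
            conf.setProperty("pkg", row[0]);
            conf.setProperty("date", row[1]);
            String jobId = oozie.run(conf); // submit and start one workflow per row
            System.out.println("launched " + jobId + " for " + row[0]);
        }
    }
}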
I totally agree with @Mattinbits: you must use some procedural code (shell script, Python, etc.) to run the loop and fire the appropriate Pig/Hive tasks.
But if your process must wait for the tasks to complete before launching the next batch, the coordination part might become a bit more complicated to implement. I can think of a very evil way to use Oozie for that coordination...
write down a generic Oozie Workflow that runs the Pig/Hive actions for 1 set of parameters, passed as properties
write down a "master template" Oozie workflow that just runs the WF above as a sub-workflow with dummy values for the properties
cut the template in 3 parts : XML header, sub-workflow call (with placeholders for actual values of properties) and XML footer
your loop will then build the actual "master" Workflow dynamically, by concatenating the header, a call to the sub-workflow for 1st set of values, another call for 2nd set, etc etc, then the footer -- and finally submit the Workflow to Oozie server (using REST or command line interface)
Of course there are some other things to take care of -- generating unique names for the sub-workflow Actions, chaining them, handling errors. The usual stuff.
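A rough Java sketch of that concatenation approach (the sub-workflow action follows Oozie's standard schema, but the app path, parameter names and hard-coded rows are illustrative assumptions):

import java.util.*;

public class MasterWorkflowBuilder {
    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[]{"pkg1", "date1"},
                new String[]{"pkg2", "date2"}); // in practice, read from your input file

        StringBuilder wf = new StringBuilder();
        wf.append("<workflow-app xmlns='uri:oozie:workflow:0.4' name='master'>\n");
        wf.append("  <start to='step-0'/>\n");
        for (int i = 0; i < rows.size(); i++) {
            String next = (i + 1 < rows.size()) ? "step-" + (i + 1) : "end"; // chain the calls
            wf.append("  <action name='step-").append(i).append("'>\n")
              .append("    <sub-workflow>\n")
              .append("      <app-path>${nameNode}/apps/generic-wf</app-path>\n") // assumed path
              .append("      <configuration>\n")
              .append("        <property><name>pkg</name><value>").append(rows.get(i)[0]).append("</value></property>\n")
              .append("        <property><name>date</name><value>").append(rows.get(i)[1]).append("</value></property>\n")
              .append("      </configuration>\n")
              .append("    </sub-workflow>\n")
              .append("    <ok to='").append(next).append("'/>\n")
              .append("    <error to='fail'/>\n")
              .append("  </action>\n");
        }
        wf.append("  <kill name='fail'><message>step failed</message></kill>\n");
        wf.append("  <end name='end'/>\n");
        wf.append("</workflow-app>\n");
        System.out.println(wf); // submit this to the Oozie server via REST or the CLI
    }
}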
I have the following use case. In an oozie workflow, a map-reduce action generates a series of diagnostic counters. I want to have another java action following the map-reduce action. The java action basically does validation based on the counters from the map-reduce action and generate some notifications based on the validation conditions and results. The key thing for this idea to work is that the java action must be able to access all counters in the upstream map-reduce action, just like how oozie can use EL to access them in its workflow xml.
Right now I have no idea where to start for this. So, any pointer is very much appreciated.
update
For example, suppose I have a map-reduce action named foomr. In oozie workflow xml, you can use EL to access counters, e.g., ${hadoop:counters("foomr")[RECORDS][MAP_IN]}. Then, my question would be, how can I get the same counter inside a java action? Does oozie expose any API to access values that are accessible to EL as in a workflow xml?
You can use the capture-output tag to capture the output of a java action. This output, in Java properties format, can be propagated to the other Oozie nodes.
The capture-output element can be used to propagate values back into the Oozie context, which can then be accessed via EL functions. The values need to be written out as a file in Java properties format. (From the Oozie documentation page.)
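For example, with <capture-output/> declared in the java action's XML, the action's main class writes its properties to the file whose path Oozie passes in the oozie.action.output.properties system property (that property name comes from the Oozie docs; the key written below is a made-up example):

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.Properties;

public class ValidationMain {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("validationStatus", "OK"); // hypothetical key
        // Oozie tells the java action where to write its capture-output file
        File outFile = new File(System.getProperty("oozie.action.output.properties"));
        try (OutputStream os = new FileOutputStream(outFile)) {
            props.store(os, "capture-output");
        }
    }
}

A downstream node can then read the value with ${wf:actionData('java-action-name')['validationStatus']} (the action name here is hypothetical).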
See the example below for how EL constants are used in a Pig action, and refer to the Hadoop EL constants below, which can be used.
Hadoop EL Constants
RECORDS: Hadoop record counters group name.
MAP_IN: Hadoop mapper input records counter name.
MAP_OUT: Hadoop mapper output records counter name.
REDUCE_IN: Hadoop reducer input records counter name.
REDUCE_OUT: Hadoop reducer output records counter name.
GROUPS: 1024 * Hadoop mapper/reducer record groups counter name.
The example below shows the wf:user() EL function being used to calculate a path dynamically. In a similar way, you can use the Hadoop EL constants above, or user-defined ones, in your workflow.
<pig>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <prepare>
        <delete path="${nameNode}/user/${wf:user()}/${examplesRoot}/output-data/pig"/>
    </prepare>
    <configuration>
        <property>
            <name>mapred.job.queue.name</name>
            <value>${queueName}</value>
        </property>
    </configuration>
    <script>id.pig</script>
    <param>INPUT=/user/${wf:user()}/${examplesRoot}/input-data/text</param>
    <param>OUTPUT=/user/${wf:user()}/${examplesRoot}/output-data/pig</param>
</pig>
Edit:
You can also use the Oozie Java API, which returns the action data for a given action name:
org.apache.oozie.DagELFunctions.wf_actionData(String actionName)
Return the action data for an action.
Parameters: actionName action name.
Returns: value of the property.
I saw below line in oozie docs under the Parameterization of Workflows section:
EL expressions can be used in the configuration values of action and decision nodes. They can be used in XML attribute values and in XML element values.
They cannot be used in XML element and attribute names. They cannot be used in the name of a node and they cannot be used within the transition elements of a node.
oozie docs
I think Oozie does not expose the workflow action data within the action nodes; we can pass it in from outside as parameters to the java action.
If Hadoop counters are to be accessed, then I think you should check whether YARN or the JobTracker exposes any web service APIs where you can pass a job name and get the corresponding counters as output.
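As a sketch of that REST route: the MapReduce History Server exposes counters at /ws/v1/history/mapreduce/jobs/{jobid}/counters, so a java action could fetch them like this (the host, port and job id are assumptions, and you would parse the JSON with a real JSON library):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class CounterFetcher {
    public static void main(String[] args) throws Exception {
        String jobId = "job_1400000000000_0001"; // hypothetical job id
        URL url = new URL("http://historyserver:19888/ws/v1/history/mapreduce/jobs/"
                + jobId + "/counters");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/json");
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // raw JSON containing all counter groups
            }
        }
    }
}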
I have been working on branch X and I need to move my code to branch Y. All my code consists of new classes that I started, so no one else has been working on or modifying it, and it does not exist in the branch I'm moving the code to.
So my question is: what is the process to move the code from one branch to another? I have never done it before. Do I copy and paste the classes into the new branch, or is there a tool that is usually used for this?
The key to a ClearCase merge is to do the merge in the destination view (the view associated with the branch or the UCM stream to which you are merging).
You can then start the merge with:
cleartool merge
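For example (run from the destination view, assuming your source branch is named X):

cleartool findmerge . -fver .../X/LATEST -merge -gmerge

findmerge walks the current directory, finds the versions on branch X that need merging, and -gmerge opens the graphical merge tool when there are conflicts.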
Or use the ClearCase Merge Manager: see "How to merge using the ClearCase Merge Manager".
As you can see, the first step is to select said target view:
I would recommend using a dynamic view rather than a snapshot view: a snapshot view would start with an automatic update (which takes time), whereas a dynamic view would start the merge immediately.
See more at "What are the differences between a snapshot view and a dynamic view?"
It supposes that you have:
a source branch or a label which will identify the source versions you want to merge,
a destination view with a config spec allowing you to create new versions on top of a destination branch (so a config spec with -mkbranch rules in it)
See more at "About merging files and directories in base ClearCase":
Yepp, there's a tool that helps you:
it's the ClearCase Merge Manager. It has a nice GUI and helps you get the job done.