I have a simple input file with 2 columns like
pkg1 date1
pkg2 date2
pkg3 date3
...
...
I want to create an Oozie workflow which will process each row separately. For each row, I want to run multiple actions one after another (Hive, Pig, ...) and then process the next row.
But it is more difficult than I expected. I think I have to create a loop somehow and iterate through it.
Can you give me architectural advice on how I can achieve this?
Oozie does not support loops/cycles, since a workflow is a Directed Acyclic Graph:
https://oozie.apache.org/docs/3.3.0/WorkflowFunctionalSpec.html#a2.1_Cycles_in_Workflow_Definitions
Also, there is no inbuilt way (that I'm aware of) to read data from Hive into an Oozie workflow and use it to control the flow of the Oozie workflow.
You could have a single Oozie workflow which launches some custom process (e.g. a Shell Action), and within that process read the data from Hive, and launch a new, separate, Oozie workflow for each entry.
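For illustration, here is a minimal Java sketch of that idea using the Oozie client API. The server URL, HDFS app path, property names (pkg, date) and input file are placeholder assumptions, and in practice the row-reading part would live inside whatever custom process (e.g. the Shell Action) reads the data from Hive:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class LaunchWorkflowPerRow {
    public static void main(String[] args) throws Exception {
        // Placeholder Oozie server URL -- point this at your cluster.
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        try (BufferedReader in = new BufferedReader(new FileReader("input.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] cols = line.trim().split("\\s+");   // "pkg1 date1" -> [pkg1, date1]
                if (cols.length < 2) continue;

                Properties conf = oozie.createConfiguration();
                conf.setProperty(OozieClient.APP_PATH, "hdfs:///apps/per-row-wf"); // workflow.xml location (assumed)
                conf.setProperty("pkg", cols[0]);   // parameters consumed by the Hive/Pig actions
                conf.setProperty("date", cols[1]);

                String jobId = oozie.run(conf);     // submit and start one workflow for this row

                // Optionally wait for this row's workflow before launching the next one.
                WorkflowJob.Status status;
                do {
                    Thread.sleep(10_000);
                    status = oozie.getJobInfo(jobId).getStatus();
                } while (status == WorkflowJob.Status.PREP || status == WorkflowJob.Status.RUNNING);
            }
        }
    }
}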
I totally agree with @Mattinbits: you must use some procedural code (shell script, Python, etc.) to run the loop and fire the appropriate Pig/Hive tasks.
But if your process must wait for the tasks to complete before launching the next batch, the coordination part might become a bit more complicated to implement. I can think of a very evil way to use Oozie for that coordination...
- write down a generic Oozie Workflow that runs the Pig/Hive actions for one set of parameters, passed as properties
- write down a "master template" Oozie workflow that just runs the WF above as a sub-workflow, with dummy values for the properties
- cut the template into 3 parts: the XML header, the sub-workflow call (with placeholders for the actual property values) and the XML footer
- your loop then builds the actual "master" Workflow dynamically, by concatenating the header, a call to the sub-workflow for the 1st set of values, another call for the 2nd set, and so on, then the footer, and finally submits the Workflow to the Oozie server (using the REST API or the command-line interface)
Of course there are some other things to take care of: generating unique names for the sub-workflow Actions, chaining them, handling errors. The usual stuff.
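A rough Java sketch of that header/body/footer concatenation could look like the following. The sub-workflow app path, the property names and the "step-N" naming scheme are assumptions for the example, and the actual submission (via the REST API or CLI) is left out:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class MasterWorkflowBuilder {

    // Template pieces cut from the "master template" workflow (names/paths are illustrative).
    private static final String HEADER =
        "<workflow-app xmlns=\"uri:oozie:workflow:0.5\" name=\"master-wf\">\n"
      + "  <start to=\"step-0\"/>\n";

    private static final String SUBWF_CALL =                      // placeholders filled per row
        "  <action name=\"step-%d\">\n"
      + "    <sub-workflow>\n"
      + "      <app-path>hdfs:///apps/per-row-wf</app-path>\n"
      + "      <propagate-configuration/>\n"
      + "      <configuration>\n"
      + "        <property><name>pkg</name><value>%s</value></property>\n"
      + "        <property><name>date</name><value>%s</value></property>\n"
      + "      </configuration>\n"
      + "    </sub-workflow>\n"
      + "    <ok to=\"%s\"/>\n"
      + "    <error to=\"fail\"/>\n"
      + "  </action>\n";

    private static final String FOOTER =
        "  <kill name=\"fail\"><message>Sub-workflow failed</message></kill>\n"
      + "  <end name=\"end\"/>\n"
      + "</workflow-app>\n";

    /** rows: each element is one "pkg date" pair read from the input file. */
    public static String build(List<String[]> rows) {
        StringBuilder wf = new StringBuilder(HEADER);
        for (int i = 0; i < rows.size(); i++) {
            // Chain step-i to step-(i+1), and the last step to the end node.
            String next = (i == rows.size() - 1) ? "end" : "step-" + (i + 1);
            wf.append(String.format(SUBWF_CALL, i, rows.get(i)[0], rows.get(i)[1], next));
        }
        return wf.append(FOOTER).toString();
    }

    public static void main(String[] args) throws Exception {
        List<String[]> rows = List.of(new String[]{"pkg1", "date1"}, new String[]{"pkg2", "date2"});
        // Write workflow.xml; copy it to HDFS and submit it to the Oozie server afterwards.
        Files.write(Paths.get("workflow.xml"), build(rows).getBytes());
    }
}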
Related
I'm implementing the logic for rebuilding a file that is paginated across several Kafka messages with the same key. Every time a page is received, its content is appended to the corresponding file in a shared volume, and once the last page is appended, the topology has to include some extra processing steps.
Should this be done with forEach or with process?
Both foreach and process have a void return type; how can the final extra steps then be added to the topology?
Both accomplish the same goal. foreach, however, is a terminal operation of the DSL, whereas the process method of the Processor API "returns" data to downstream processors by forwarding it to the ProcessorContext (as answered in your last question), not via the process method's return value.
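For example, a sketch of how that forwarding might look with the newer Processor API (the shared-volume path, topic names and the end-of-file marker are assumptions; any extra steps would simply be more processors added between the assembler and the sink):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;

// Appends each page to the file named by the record key and, only when the last page
// arrives, forwards the finished file's path downstream -- that forward() call is how
// a Processor "returns" data to the rest of the topology.
public class FileAssembler implements Processor<String, String, String, String> {
    private ProcessorContext<String, String> context;

    @Override
    public void init(ProcessorContext<String, String> context) {
        this.context = context;
    }

    @Override
    public void process(Record<String, String> page) {
        try {
            Path file = Path.of("/shared/volume", page.key());            // assumed shared volume
            Files.writeString(file, page.value(),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            if (page.value().endsWith("<EOF>")) {                          // assumed last-page marker
                context.forward(page.withValue(file.toString()));
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // The extra steps go after "assembler": more addProcessor(...) calls before the sink.
    public static Topology buildTopology() {
        Topology topology = new Topology();
        topology.addSource("pages", "pages-topic");
        topology.addProcessor("assembler", FileAssembler::new, "pages");
        topology.addSink("completed", "completed-files-topic", "assembler");
        return topology;
    }
}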
I have a use case where I need to capture the data flow from one API to another. For example, my code reads data from a database using Hibernate, and during processing I convert one POJO to another, do some more processing, and then finally convert it into the final-result Hibernate object. In a nutshell, something like POJO1 to POJO2 to POJO3.
In Java, is there a way I can deduce that an attribute of POJO3 was made/transformed from a particular attribute of POJO1? I am looking for something that can capture the data flow from one model to another. The tool can work at either compile time or runtime; I am OK with both.
I am looking for a tool which can run alongside the code and provide data lineage details for each run.
Now, instead of POJOs, I will call them States! You have a start position, then you iterate and transform your model through different states. At the end you have a final, terminal state that you would like to persist to the database:
stream(A).map(P1).map(P2).map(P3)....-> set of B
If you use a technique known as Event Sourcing you can deduce it, yes. How would this look? Instead of mapping A directly to state P1 and state P1 to state P2, you queue all the operations that are necessary and sufficient to map A to P1, P1 to P2, and so on... If you want to recover P1 or P2 at any time, it is just the product of the queued operations. You can replay forwards or rewind backwards at any time, as long as you have not yet changed your DB state. P1, P2 and P3 can act as snapshots.
This way you will be able to rebuild the exact mapping flow for an attribute. How fine-grained you queue your operations, whether at the attribute level or more coarse-grained, is up to you.
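As a toy illustration of the idea (class and method names are made up), the queued operations plus their descriptions are both the replay mechanism and the lineage record:

import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Toy event-sourcing sketch: transformations are not applied ad hoc but queued as
// events with a human-readable description; any intermediate state (P1, P2, ...) is
// rebuilt by replaying the queue up to that point, and the descriptions are the
// lineage of how each value was produced.
public class TransformationLog {

    private final List<String> descriptions = new ArrayList<>();
    private final List<Function<Object, Object>> events = new ArrayList<>();

    public void record(String description, Function<Object, Object> event) {
        descriptions.add(description);
        events.add(event);
    }

    /** Replays the first n events against the start model -> snapshot of state Pn. */
    public Object replay(Object start, int n) {
        Object state = start;
        for (int i = 0; i < n; i++) {
            state = events.get(i).apply(state);
        }
        return state;
    }

    /** The queued descriptions, e.g. "POJO2.total = POJO1.amount + POJO1.tax". */
    public List<String> lineage() {
        return descriptions;
    }
}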
Here is a good article that depicts event sourcing and how it works: https://kickstarter.engineering/event-sourcing-made-simple-4a2625113224
UPDATE:
I can think of one more technique to capture the attribute changes. You can instrument your POJOs; it is pretty much the same technique Hibernate uses to enhance POJOs and the same technique profilers use for tracing. Then you can capture and react to each setter invocation on Pojo1, Pojo2 and Pojo3. Not sure if I would have gone that way, though...
Here is some detailed reading about bytecode instrumentation: https://www.cs.helsinki.fi/u/pohjalai/k05/okk/seminar/Aarniala-instrumenting.pdf
I would imagine two reasons: either the code was not developed by you, and therefore you want to understand the flow of data and the combinations used to convert input to output, OR your code is behaving in a way you are not expecting.
I think you need to log the values of all the POJOs, inputs and outputs, to some place that you can inspect later for each run.
For example, a database table if you might need the data after hundreds of runs, or just a log in an appropriate form if it is a one-off. Then you need to manually walk those data values layer by layer to map one layer to the next. I think that would be easy with the code available. If you have a different need, please explain.
There are "time travelling debuggers". For Java, a quick search did only spill this out:
Chronon Time Travelling Debugger, see this screencast how it might help you .
Since your transformations probably use setters and getters, this tool might also be interesting: Flow
Writing your own Java agent for tracking this is probably not what you want. You might be able to use AspectJ to add some stack-trace logging to getters and setters. See here for a quick introduction.
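If you go the AspectJ route, a minimal annotation-style aspect could look like the sketch below. The com.example.model package is a placeholder, and the aspect still needs compile-time or load-time weaving to take effect:

import org.aspectj.lang.JoinPoint;
import org.aspectj.lang.annotation.Aspect;
import org.aspectj.lang.annotation.Before;

// Logs every setter call on the model POJOs together with the call stack, so you can
// trace which earlier attribute a value came from.
@Aspect
public class SetterTraceAspect {

    @Before("execution(void com.example.model..*.set*(..))")
    public void logSetter(JoinPoint jp) {
        System.out.printf("%s called with %s%n",
                jp.getSignature().toShortString(),
                java.util.Arrays.toString(jp.getArgs()));
        new Throwable().printStackTrace(System.out);   // crude stack-trace logging
    }
}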
I was reading the Spring documentation for the Spring Batch project, and I want to know if there is an out-of-the-box configuration to chain steps, meaning the output of the first step becomes the input of the second one, and so on.
I'm not asking about step flows, where one step executes after another; it is more about using the output of one step's ItemProcessor as the input of the next one.
What I have in mind is to use a normal step with a reader and processor, and in the writer create a flat file that could be read by the reader of the next step, but this seems inefficient, since it needs to write objects that are already in the JVM and restore them with the second reader.
I'm not sure if this is possible with normal Spring configuration, or whether the JSR does not work exactly as I want.
Instead of multiple steps use multiple ItemProcessors in a chain. You can chain them using a CompositeItemProcessor.
EDIT:
I was reading about the Spring Batch strategies and I did not find any out-of-the-box XML configuration to chain the steps in a kind of pipeline. The best option that fits my needs is to use an ItemProcessorAdapter to run the different pieces of logic I need in the steps and a CompositeItemProcessor (6.21) to chain them.
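For reference, a minimal Java-config sketch of that chaining (the two delegate processors here are trivial placeholders; the XML configuration uses the same CompositeItemProcessor class, and the composite bean is what you would plug into the step's processor):

import java.util.Arrays;

import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.support.CompositeItemProcessor;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Chains two processors inside a single step, so the output of the first
// is the input of the second.
@Configuration
public class ProcessorChainConfig {

    @Bean
    public ItemProcessor<String, Integer> parseProcessor() {
        return Integer::valueOf;                       // step 1 of the chain: parse the raw line
    }

    @Bean
    public ItemProcessor<Integer, String> formatProcessor() {
        return n -> "value=" + (n * 2);                // step 2 of the chain: transform and format
    }

    @Bean
    public CompositeItemProcessor<String, String> compositeProcessor() {
        CompositeItemProcessor<String, String> composite = new CompositeItemProcessor<>();
        composite.setDelegates(Arrays.asList(parseProcessor(), formatProcessor()));
        return composite;
    }
}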
My current issue is with trying to develop a set of Knime nodes that provide integration with Apache Oozie. Namely, I'm trying to build, launch and monitor Oozie workflows from within Knime.
I've had some success implementing this for linear Oozie workflows, but have become quite stumped when branching needs to be included.
As background, let me explain the way I did this for linear workflows:
Essentially my solution expresses each Oozie Action as a Knime node. Each of these nodes has 2 modes of operation, the proper one being called based on the content of certain flow variables. These 2 modes are needed because I have to execute the Oozie portion (OozieStartAction to OozieStopAction) twice: the first iteration generates the Oozie workflow, and the second launches and monitors it. Also, flow variables persist between iterations of this loop.
In one mode of operation, a node appends the XML content particular to the Oozie action it represents to the overall Oozie workflow XML and then forwards it.
In the other, the node simply polls Oozie for the status of the action it represents.
The following flow vars are used in this workflow:
-OOZIE_XML: contains the Oozie workflow XML
-OOZIE_JOB_ID: ID of the running Oozie job launched with the assembled workflow
-PREV_ACTION_NAME: name of the previous action
In the example above, what would happen step by step is the following:
-OozieStartNode runs, sees it has a blank or missing OOZIE_XML variable, so it creates one itself, setting up the basic workflow-app and start XML nodes. It also creates a PREV_ACTION_NAME flow var with the value "start".
-The first OozieGenericAction sees that it has a blank OOZIE_JOB_ID, so it appends a new action to the workflow-app node in the received OOZIE_XML, gets the node whose "name" attribute equals PREV_ACTION_NAME and sets its transition to the action it just created (see the DOM sketch after this list). PREV_ACTION_NAME is then overwritten with the current action's name.
...
-The StopOozieAction simply creates an end node and sets the previous action's transition to it, much like the previous generic action.
-In the second iteration, OozieStart sees it has XML data, so the secondary execution mode is called. This uploads the workflow XML to HDFS, creates a new Oozie job with this workflow, and forwards the returned job ID as OOZIE_JOB_ID.
-The following Oozie Actions, having a valid OOZIE_JOB_ID, simply poll Oozie for the status of the actions they represent, ending execution once their respective actions finish running.
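As an aside, a plain DOM/XPath sketch of that generic-action step might look like this (the element names, the start-node special case and the omitted action body are assumptions based on the description above):

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Sketch of the "append a new action and rewire the previous transition" step.
// The action body (<hive>, <ok>, <error>, ...) is left out for brevity.
public class OozieXmlAppender {

    public static Document appendAction(String oozieXml, String actionName,
                                         String prevActionName) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(oozieXml.getBytes(StandardCharsets.UTF_8)));
        Element workflowApp = doc.getDocumentElement();     // <workflow-app>

        // Append the new <action name="..."> under <workflow-app>.
        Element action = doc.createElement("action");
        action.setAttribute("name", actionName);
        workflowApp.appendChild(action);

        // Point the previous node's transition at the action we just created.
        if ("start".equals(prevActionName)) {
            Element start = (Element) workflowApp.getElementsByTagName("start").item(0);
            start.setAttribute("to", actionName);           // <start to="..."/>
        } else {
            Element prev = (Element) XPathFactory.newInstance().newXPath().evaluate(
                    "//action[@name='" + prevActionName + "']", doc, XPathConstants.NODE);
            Element ok = (Element) prev.getElementsByTagName("ok").item(0);
            ok.setAttribute("to", actionName);              // <ok to="..."/> of the previous action
        }
        return doc;
    }
}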
The main problem I'm facing is in the workflow XML assembly: for one, I can't use the previous-node-name variable when branching is involved. If I had a join action with many nodes linking to it, one prev node would overwrite the others and the node-relation data would be lost.
Does anybody have any broad ideas on the direction I could take with this?
How about using a variable-to-column conversion, so there's a column in the recursive loop called Previous Action Name? It might seem like overkill keeping the same value in one column for all rows, but the recursive loop would pass it along just like any other column.
BTW, have you seen these?
https://www.knime.org/knime-big-data-connectors
I'm developing a big job in Talend Open Studio (about 90 components now and at least 150 by the end of development). But the number of components in a subjob is limited, because the Java method generated for a subjob can't exceed the 65536-byte limit.
So I split my job into multiple subjobs, using a tBufferOutput/tBufferInput pair to pass data between each subjob. And now the problem is that I need to empty the globalBuffer before each tBufferOutput.
I've searched the web and found a solution using a tJava component with globalBuffer.clear(); in it, but when I do that, my job finishes without processing any data.
If you aren't passing the data to a parent job and are instead keeping the data inside the same job (just multiple subjobs inside it), you'd probably be better off with the tHash components. These allow you to cache some data (either in memory or temporarily on disk) and then retrieve it by linking a tHashInput to a specific tHashOutput.
The tBuffer components just drop all the data into one specific pool and pick it up from there, so they aren't really suited to multiple inputs and outputs within a job (although this can be a desired result). You're best off only using them to pass data back to a parent job, by using a tBufferOutput in the child job and then linking the tRunJob in the parent job to whatever you want to pass the data to (which could be another job).