Packaging multiple Apache Beam pipelines in one jar file - java

I'm working on a project with many Beam pipelines written in Java that needs to be packaged as a jar file for execution from our job scheduler. I've attempted to use build profiles to create a jar for each main class, but this seems messy, and I've had issues with dependency conflicts (beam-sdks-java-io-amazon-web-services still looks for its required region options even when it isn't used). I'm also looking for general advice on a sustainable project structure for a growing Beam code base.
What are the best practices for packaging pipelines to be executed on a schedule? Should I package multiple pipelines together so that I can execute each pipeline using the pipeline name and pipeline options as parameters? If so, how (potentially using some sort of master runner main that executes pipelines based on input parameters)? Or should each pipeline be its own Maven project (which requires many jars)? Thoughts?

I don't think there's a single recommended way of solving this. Each approach has benefits and downsides (e.g. consider the effort of updating the pipelines).
I think the common-jar solution is fine if it works for you. For example, there are multiple Beam example pipelines in the same package, and you run them by specifying the main class. That is similar to what you are trying to achieve.
Whether you need a master main also depends on the specifics of your project and environment. It may be sufficient to invoke java -cp your.jar your.MainClass and get by without extra management code.
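For illustration, here is a minimal sketch of such a "master main" (all class and pipeline names below are placeholders, not from the question): the first argument selects the pipeline and everything else is handed to PipelineOptionsFactory, so the usual --runner=... style options still work.

    import java.util.Arrays;

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;

    // Hypothetical dispatcher: one jar, many pipelines, selected by name at runtime.
    public class PipelineDispatcher {

        public static void main(String[] args) {
            if (args.length == 0) {
                throw new IllegalArgumentException("Usage: <pipelineName> [--pipelineOption=value ...]");
            }
            String name = args[0];
            // Everything after the pipeline name is parsed as normal Beam pipeline options.
            String[] rest = Arrays.copyOfRange(args, 1, args.length);
            PipelineOptions options = PipelineOptionsFactory.fromArgs(rest).withValidation().create();

            Pipeline pipeline = Pipeline.create(options);
            switch (name) {
                case "word-count":
                    buildWordCount(pipeline);   // placeholder pipeline definition
                    break;
                case "ingest":
                    buildIngest(pipeline);      // placeholder pipeline definition
                    break;
                default:
                    throw new IllegalArgumentException("Unknown pipeline: " + name);
            }
            pipeline.run().waitUntilFinish();
        }

        private static void buildWordCount(Pipeline p) {
            p.apply(Create.of("placeholder"));  // the real transforms would go here
        }

        private static void buildIngest(Pipeline p) {
            p.apply(Create.of("placeholder"));  // the real transforms would go here
        }
    }

The scheduler would then call something like java -cp pipelines.jar PipelineDispatcher ingest --runner=DirectRunner (jar name and runner here are placeholders as well).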

Related

How to Build Maven Modules in Parallel in Separate JVMs

We have a multi-module Maven project that takes about 2 hours to build and we would like to speed that up by making use of concurrency.
We are aware of the -T option which (as explained, e.g., here) allows using multiple threads within the same JVM for the build.
Sadly, there is a lot of legacy code (which uses a lot of global state) in the project, which makes executing multiple tests in parallel in a single JVM very hard. Removing all of these blockers from the project would be a lot of work, which we would like to avoid.
The surefire and failsafe plugins have multiple options regarding parallel execution behavior; however, as I understand it, this would only parallelize the test executions. Also, spawning a separate JVM for each test (class) seems like overkill to me and would probably cause the build to take even longer than it does now.
Ideally, we would like to do the parallelization on the Maven reactor level and have it build each module in its own (single threaded) JVM with up to x JVMs running in parallel.
So my question is: is there a way to make Maven create a separate JVM for each module build?
Alternatively, can we parallelize the build while making sure that tests (over all modules) are executed sequentially?
I am not completely sure this works, but I guess that if you use Maven Toolchains, each module will start its own forked JVM for the tests instead of reusing already running ones.
I guess it is worth a try.

Flink: Wrap executable non-flink jar to run it in a flink cluster

Assume that I have an executable jar file that doesn't have any Flink code inside, and my job is to make it distributed with Flink. I have already done this once by creating and executing the StreamExecutionEnvironment somewhere in the code and placing the distributable parts of the jar's code inside Flink operators (e.g., map functions).
Yesterday, I was asked to do a similar job but with minimal effort. They told me to find a way to wrap this Flink-less jar so that it can be executed by a Flink cluster (without injecting code and altering the jar like I did above). Is there a way to do this? The docs state that to support execution from a packaged jar, "a program must use the environment obtained by StreamExecutionEnvironment.getExecutionEnvironment()". Is there no other way?
My only guess right now is to wrap the entry point of the jar and place it inside Flink operators, but unfortunately I don't know what this jar does.
You could write a map-only program and package it in a jar. In the map function, you execute the main of the provided jar through reflection.
Your small wrapper could be put into Flink's lib/ directory to make it reusable for other jars, or you could add the other jar to the distributed cache.
Btw, I haven't fully understood the use case, or I find it weird, since it's unclear to me how the parallelism is supposed to work. So sorry if the answer does not help.
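A rough sketch of that map-only wrapper, assuming the wrapped jar's entry point is the placeholder class com.example.LegacyMain and that the jar itself is available on the task managers' classpath (Flink's lib/ directory or the distributed cache, as suggested above):

    import java.lang.reflect.Method;

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    // Hypothetical wrapper job: runs the main() of a non-Flink jar inside a map operator.
    public class JarWrapperJob {

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            env.fromElements("com.example.LegacyMain")   // one element = one invocation
               .map(new MapFunction<String, String>() {
                   @Override
                   public String map(String mainClassName) throws Exception {
                       // Load the wrapped jar's entry point and call its static main via reflection.
                       Class<?> clazz = Class.forName(mainClassName);
                       Method mainMethod = clazz.getMethod("main", String[].class);
                       mainMethod.invoke(null, (Object) new String[0]);
                       return mainClassName;
                   }
               })
               .print();

            env.execute("wrapped-legacy-jar");
        }
    }

Note that this only makes the legacy main run inside a Flink task; it does not by itself parallelize anything, which matches the open question about how the parallelism is supposed to work.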

Compiling a Single NiFi Standard Processor

I have an issue with the NiFi InvokeHTTP processor which requires me to make modifications to it. I am not trying to replace it but to create a fork which I can use alongside the original.
The easiest way I have found to do this is to clone the code, checkout the 1.10 tag and run mvn clean install in the nifi/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors directory.
However, the result of this is a JAR file named "nifi-standard-processors-1.10.0.jar". This contains ALL of the standard processors. Instead of this, I am looking to output each processor individually so I can upload only the modified InvokeHTTP processor to NiFi.
The only thing I can think of is to delete the source for the other processors individually which seems a little long-winded. I have had a look in pom.xml and cannot see anything obvious which would allow me to do this either.
Does anyone know how I can achieve this? Apologies if this is an easy question; I haven't used Java in over a decade and this is my first time using Maven.
Thank you in advance.
Is the code change one you can make by extending the processor rather than changing it at the source? If so, I'd recommend creating a custom processor which extends InvokeHTTP in its own Maven bundle (e.g. nifi-harry-bundle) that depends on nifi-standard-processors. This lets you reuse the functionality already provided, modify only what you need, compile and build just the new code, and then copy that NAR (NiFi Archive) directly into the NiFi lib/ directory to use it.
See building custom processors and this presentation for more details.
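A minimal sketch of such an extension (package, class name, and annotations are placeholders; the real processor would override whatever behaviour actually needs to change):

    import org.apache.nifi.annotation.documentation.CapabilityDescription;
    import org.apache.nifi.annotation.documentation.Tags;
    import org.apache.nifi.processors.standard.InvokeHTTP;

    // Hypothetical fork living in its own bundle (e.g. nifi-harry-bundle); everything not
    // overridden here is inherited unchanged from the standard InvokeHTTP processor.
    @Tags({"http", "https", "custom"})
    @CapabilityDescription("Fork of InvokeHTTP with project-specific modifications.")
    public class MyInvokeHTTP extends InvokeHTTP {
        // Override e.g. onTrigger(...) or a property descriptor here to apply the change,
        // then build only this bundle's NAR and deploy it to NiFi.
    }

The bundle's processor module would declare a Maven dependency on nifi-standard-processors so that InvokeHTTP is available at compile time, as described in the answer above.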

Having a "library" of jenkinsfiles

We have 1900 separate Java projects. At the moment, we are thinking about introducing Jenkins. Many of the projects would need to have the same Jenkinsfile, except for one or two parameters which need to be set.
In the Java projects, I would like to "import" jenkinsfiles, following the logic
Use jenkinsfile "Standard-Jar-Build" with parameters according to "project.properties"
What could be a way to have such a "jenkinsfile library"?
In the company I work for right now, I created a set of Jenkinsfiles that represent certain pipeline workflows. These workflows are generic, and all the project-specific configuration can be passed to the Jenkinsfile using job parameters.
So the only thing projects have to do is use "pipeline from SCM", point to our script files, and then customise it for their project using the available properties.
To keep the Jenkinsfiles small, we also use the global library feature. We call it 'common' and it contains all kinds of methods that the Jenkinsfiles can use.
Added bonus: everything in the common library is automatically allowed, no whitelisting needed.

Java code coverage without instrumentation

I'm trying to figure out which tool to use to get code-coverage information for projects that are running in a kind of stabilization environment.
The projects are deployed as a war and running on Jboss. I need server-side coverage while running manual / automated tests interacting with a running server.
Let's assume I cannot change the projects' build and therefore cannot add any kind of instrumentation to their jars as part of the build process. I also don't have access to the code.
I've done some reading on various tools, and they all present techniques involving instrumenting the jars at build time (BTW, doesn't that affect production, or are two kinds of outputs generated?).
One tool though, JaCoCo, mentions an "on-the-fly instrumentation" feature. Can someone explain what it means? Can it help me with my limitations?
I've also heard about code coverage using runtime profiling techniques; can someone help on that issue?
Thanks,
Ben
AFAIK "on-the-fly-instrumentation" means that the coveragetool hooks into the Classloading-Mechanism by using a special ClassLoader and edits the Class-Bytecode when it's being loaded.
The result should be the same as in "offline-instrumentation" with the JARs.
Have also a look at EMMA, which supports both mechanisms. There's also a Plugin for Eclipse.
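To make the load-time idea concrete: one common way to hook class loading in Java is a java.lang.instrument agent. The sketch below (names are placeholders, and this is not JaCoCo's actual implementation) just logs each class as it is loaded; a coverage tool would return modified bytecode with probes inserted at this point.

    import java.lang.instrument.ClassFileTransformer;
    import java.lang.instrument.Instrumentation;
    import java.security.ProtectionDomain;

    // Illustrative load-time hook: sees every class's bytecode as it is loaded.
    public class LoadTimeAgent {

        public static void premain(String agentArgs, Instrumentation inst) {
            inst.addTransformer(new ClassFileTransformer() {
                @Override
                public byte[] transform(ClassLoader loader, String className,
                                        Class<?> classBeingRedefined,
                                        ProtectionDomain protectionDomain,
                                        byte[] classfileBuffer) {
                    System.out.println("loading: " + className);
                    // Returning null leaves the class unchanged; a coverage tool would
                    // return rewritten bytecode with coverage probes instead.
                    return null;
                }
            });
        }
    }

Such an agent is attached at startup with -javaagent:agent.jar, where the jar's manifest names the agent class in its Premain-Class entry, so no change to the application's own build is needed.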
A possible solution to this problem without actual code instrumentation is to use a JVM C agent. It is possible to attach agents to the JVM, and in such an agent you can intercept every method call made in your Java code without changing the bytecode.
At every intercepted method call you then write out information about the call, which can be evaluated later for code-coverage purposes.
Here you'll find the official guide to JVMTI, which defines how JVM agents can be written.
You don't need to change the build or even have access to the code to instrument the classes. Just instrument the classes found in the delivered jar, re-jar them, and redeploy the application with the instrumented jars.
Cobertura even has an Ant task that does this for you: it takes a war file, instruments the classes inside the jars inside the war, and rebuilds a new war file. See https://github.com/cobertura/cobertura/wiki/Ant-Task-Reference
To answer your question about instrumenting the jars on build: yes, of course, the instrumented classes are not used in production. They're only used for the tests.
