I have a number of builds in my Jenkinsfile which now run in parallel, but the master server is a bit overstrained. So my idea is to limit its builds to a configured value concurrentBuilds.
https://issues.jenkins-ci.org/browse/JENKINS-44085 inspired me, but I'm a bit stuck in my plan. I have a list of services which are now gathered in a map and run in parallel like this:
def stepsForParallel = [:]
stage('read modules') {
    readMavenPom().modules.findAll { module ->
        module.endsWith('-service')
    }.each { service ->
        stepsForParallel[service] = transformIntoStep(service) // this returns { build module } to avoid immediate execution
    }
}
stage('modules') {
    parallel stepsForParallel
}
The build function uses parallel too, so I get a lot of parallel tasks.
My idea was to create a LinkedBlockingDeque (let's call it stepDeque) that gathers all steps that should be done in parallel. Then I'd create a second one (let's call it workingDeque) with a size of the configured concurrentBuilds.
But then my issue arises: as far as I know, I can only run parallel on a map. So when one of the tasks in the workingDeque finishes, I have a free thread.
So my question is: when I poll a job from stepDeque and add it to workingDeque, is there a way to solely run the step I just added? Or is there a simpler way to achieve this?
I have written a Step class which knows its dependents. At the start I gather all steps, put them in a LinkedBlockingQueue, and create n workers using:
def worker = [:]
maxConcurrentSteps.times {
    worker["worker${it}"] = {
        Step work = getWork() // uses take() to get new work
        while (work != null) {
            work.run()
            work = getWork()
        }
    }
}
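One detail the snippet above glosses over is how the workers stop: take() blocks forever, so getWork() needs either a sentinel ("poison pill") per worker or a poll with a timeout in order to ever return null. A minimal sketch of the same queue-and-workers idea in plain Java with a poison pill (the class names and the ExecutorService here are illustrative, not pipeline code):
import java.util.List;
import java.util.concurrent.*;

class BoundedWorkers {
    // Sentinel that tells a worker to stop (assumption: one pill per worker).
    static final Runnable POISON_PILL = () -> { };

    static void runBounded(List<Runnable> steps, int maxConcurrentSteps) throws InterruptedException {
        BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>(steps);
        for (int i = 0; i < maxConcurrentSteps; i++) {
            queue.add(POISON_PILL); // one pill per worker so each one eventually exits
        }
        ExecutorService pool = Executors.newFixedThreadPool(maxConcurrentSteps);
        for (int i = 0; i < maxConcurrentSteps; i++) {
            pool.execute(() -> {
                try {
                    Runnable work = queue.take(); // blocks until work is available
                    while (work != POISON_PILL) {
                        work.run();
                        work = queue.take();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}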
I'm looking for some help since I don't know how to optimize a process.
I have to invoke a service that returns a list with more than 500K elements (I don't know why; these services belong to the client). For each element of the list, I have to invoke 2 more services and then save some attributes in our database. This last step is not the problem, but the entire process takes between 1 and 2 seconds per element, so at that rate it is going to take more than 100 hours to complete.
My approach is the following: I have my main method, inside this method I get the large list, then I use a parallelStream to iterate over the elements of the list and a CompletableFuture to call the method that invokes the 2 services mentioned above. I've tried changing the parallelStream to stream and for-each, tried to split the main list into smaller lists, and many other things, but I don't see better performance. I think the problem is the invocation of those 2 services, but I want to try my luck asking here.
I'm using Java 11 and Spring, and for the invocation of the services I'm using RestTemplate. This is my code:
public void updateDiscount() {
    // List with 500k elements
    var relationshipList = relationshipService.getLargeList();
    // CompletableFuture to make the async calls to the method below
    relationshipList.parallelStream().forEach(level1 -> {
        CompletableFuture.runAsync(() -> relationshipService.asyncDiscountSave(level1));
    });
}
// Second class
@Async("nameOfThePool")
public void asyncDiscountSave(ElementOfList element) {
    // Logic to create the request
    // .........
    var responseClients = anotherClass.getClients(element.getGroup1()); // get the first response with RestTemplate
    var responseProducts = anotherClass.getProducts(element.getGroup2()); // get the second response with RestTemplate
    for (var client : responseClients) {
        for (var product : responseProducts) {
            // Here we just save some attributes of these objects in our DB
        }
    }
}
Thanks for the help.
UPDATE:
For this particular case, the only improvement I can make is to pass a thread pool to the CompletableFuture; the real problem is the response time of the services that I need to invoke.
I decided to follow the second approach, and it took about 5 hours to complete; compared with the first approach, this is acceptable.
As you haven't defined an executor, you are using the default pool. Adding an executor allows you to create as many threads as you need and as the server resources can handle:
public void updateDiscount() {
    Executor executor = Executors.newFixedThreadPool(100); // define the number according to server resources/performance
    // List with 500k elements
    var relationshipList = relationshipService.getLargeList();
    // CompletableFuture to make the async calls to the method above
    relationshipList.parallelStream().forEach(level1 -> {
        CompletableFuture.runAsync(() -> relationshipService.asyncDiscountSave(level1), executor);
    });
}
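Worth noting as well: runAsync is fire-and-forget, so updateDiscount() returns before the work finishes, and the parallel stream adds little once an executor provides the parallelism. A sketch along those lines (assuming you want the method to block until every call completes; imports for Executors, CompletableFuture and Collectors omitted):
public void updateDiscount() {
    ExecutorService executor = Executors.newFixedThreadPool(100); // tune to what the downstream services can handle
    var relationshipList = relationshipService.getLargeList();
    // Collect the futures so we can wait for all of them to finish
    var futures = relationshipList.stream()
            .map(level1 -> CompletableFuture.runAsync(
                    () -> relationshipService.asyncDiscountSave(level1), executor))
            .collect(Collectors.toList());
    CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
    executor.shutdown();
}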
I have 1000 big files to be processed in order as mentioned below:
First, those files need to be copied to a different directory in parallel; I am planning to use an ExecutorService with 10 threads to achieve this.
As soon as any file is copied to the other location (#1), I will submit that file for further processing to an ExecutorService with 10 threads.
And finally, another action needs to be performed on these files in parallel, i.e. #2 gets input from #1 and #3 gets input from #2.
Now, I can use CompletionService here, so I can process the thread results from #1 to #2 and from #2 to #3 in the order they complete. CompletableFuture documentation says we can chain asynchronous tasks together, which sounds like something I can use in this case.
I am not sure if I should implement my solution with CompletableFuture (since it is relatively new and ought to be better) or if CompletionService is sufficient. And why should I choose one over the other in this case?
It would probably be best if you tried both approaches and then chose the one you are more comfortable with. Though it sounds like CompletableFuture is better suited for this task because it makes chaining processing steps / stages really easy. For example, in your case the code could look like this:
ExecutorService copyingExecutor = ...
// Not clear from the requirements, but let's assume you have
// a separate executor for this
ExecutorService processingExecutor = ...
public CompletableFuture<MyResult> process(Path file) {
return CompletableFuture
.supplyAsync(
() -> {
// Retrieve destination path where file should be copied to
Path destination = ...
try {
Files.copy(file, destination);
} catch (IOException e) {
throw new UncheckedIOException(e);
}
return destination;
},
copyingExecutor
)
.thenApplyAsync(
copiedFile -> {
// Process the copied file
...
},
processingExecutor
)
// This separate stage does not make much sense, so unless you have
// yet another executor for this or this stage is applied at a different
// location in your code, it should probably be merged with the
// previous stage
.thenApply(
previousResult -> {
// Process the previous result
...
}
);
}
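If it helps, one hedged way to drive the chain above for all 1000 files and wait for everything (processAll is a hypothetical helper in the same class as process):
public List<MyResult> processAll(List<Path> files) {
    List<CompletableFuture<MyResult>> futures = files.stream()
            .map(this::process) // kicks off copy + processing for each file
            .collect(Collectors.toList());
    // Wait for all chains to finish; join() rethrows any failure as a CompletionException
    CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
    return futures.stream()
            .map(CompletableFuture::join)
            .collect(Collectors.toList());
}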
I have a situation where I want to execute a system process on each worker within Spark. I want this process to be run on each machine once. Specifically, this process starts a daemon which is required to be running before the rest of my program executes. Ideally this should execute before I've read any data in.
I'm on Spark 2.0.2 and using dynamic allocation.
You may be able to achieve this with a combination of a lazy val and a Spark broadcast. It would be something like below. (I have not compiled the code below, so you may have to change a few things.)
object ProcessManager {
  lazy val start = { /* start your process here */ }
}
You can broadcast this object at the start of your application before you do any transformations.
val pm = sc.broadcast(ProcessManager)
Now, you can access this object inside your transformation like you do with any other broadcast variables and invoke the lazy val.
rdd.mapPartitions(itr => {
  pm.value.start
  // Other stuff here; return the partition's iterator
  itr
})
An object with static initialization which invokes your system process should do the trick.
object SparkStandIn extends App {
  object invokeSystemProcess {
    import sys.process._
    val errorCode = "echo Whatever you put in this object should be executed once per jvm".!
    def doIt(): Unit = {
      // this object is constructed once per JVM, but objects are lazy in Scala,
      // so it is only initialized when first referenced (e.g. by calling doIt());
      // another way to make sure instantiation happened is to check that errorCode
      // does not represent an error
    }
  }
  invokeSystemProcess.doIt()
  invokeSystemProcess.doIt() // even if doIt is invoked multiple times, the static initialization happens once
}
A specific answer for a specific use case: I have a cluster with 50 nodes and I wanted to know which ones have the CET timezone set:
(1 until 100).toSeq.toDS.
mapPartitions(itr => {
sys.process.Process(
Seq("bash", "-c", "echo $(hostname && date)")
).
lines.
toIterator
}).
collect().
filter(_.contains(" CET ")).
distinct.
sorted.
foreach(println)
Note that I don't think it's 100% guaranteed you'll get a partition on every node, so the command might not get run on every node, even when using a 100-element Dataset in a cluster with 50 nodes as in the previous example.
I'm trying to deal with some code that runs differently on Spark stand-alone mode and Spark running on a cluster. Basically, for each item in an RDD, I'm trying to add it to a list, and once this is done, I want to send this list to Solr.
This works perfectly fine when I run the following code in stand-alone mode of Spark, but it does not work when the same code is run on a cluster. On a cluster, it is as if the "send to Solr" part of the code is executed before the list to be sent to Solr is filled with items. I tried to force execution with solrInputDocumentJavaRDD.collect(); after the foreach, but it seems to have no effect.
// For each RDD
solrInputDocumentJavaDStream.foreachRDD(
new Function<JavaRDD<SolrInputDocument>, Void>() {
@Override
public Void call(JavaRDD<SolrInputDocument> solrInputDocumentJavaRDD) throws Exception {
// For each item in a single RDD
solrInputDocumentJavaRDD.foreach(
new VoidFunction<SolrInputDocument>() {
@Override
public void call(SolrInputDocument solrInputDocument) {
// Add the solrInputDocument to the list of SolrInputDocuments
SolrIndexerDriver.solrInputDocumentList.add(solrInputDocument);
}
});
// Try to force execution
solrInputDocumentJavaRDD.collect();
// After having finished adding every SolrInputDocument to the list
// add it to the solrServer, and commit, waiting for the commit to be flushed
try {
if (SolrIndexerDriver.solrInputDocumentList != null
&& SolrIndexerDriver.solrInputDocumentList.size() > 0) {
SolrIndexerDriver.solrServer.add(SolrIndexerDriver.solrInputDocumentList);
SolrIndexerDriver.solrServer.commit(true, true);
SolrIndexerDriver.solrInputDocumentList.clear();
}
} catch (SolrServerException | IOException e) {
e.printStackTrace();
}
return null;
}
}
);
What should I do so that the sending-to-Solr part executes after the SolrInputDocuments have been added to solrInputDocumentList (and also works in cluster mode)?
As I mentioned on the Spark Mailing list:
I'm not familiar with the Solr API but provided that 'SolrIndexerDriver' is a singleton, I guess that what's going on when running on a cluster is that the call to:
SolrIndexerDriver.solrInputDocumentList.add(elem)
is happening on different singleton instances of the SolrIndexerDriver on different JVMs while
SolrIndexerDriver.solrServer.commit
is happening on the driver.
In practical terms, the lists on the executors are being filled-in but they are never committed and on the driver the opposite is happening.
The recommended way to handle this is to use foreachPartition like this:
rdd.foreachPartition{iter =>
// prepare connection
Stuff.connect(...)
// add elements
iter.foreach(elem => Stuff.add(elem))
// submit
Stuff.commit()
}
This way you can add the data of each partition and commit the results in the local context of each executor. Be aware that this add/commit must be thread safe in order to avoid data loss or corruption.
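Translated to the Java API used in the question, the inner part could look roughly like the sketch below (assuming SolrIndexerDriver.solrServer can be reached from the executors; in practice you may need to create the Solr client inside the partition function):
// inside the existing foreachRDD(...) callback, replacing the foreach + driver-side commit:
solrInputDocumentJavaRDD.foreachPartition(
    new VoidFunction<Iterator<SolrInputDocument>>() {
        @Override
        public void call(Iterator<SolrInputDocument> partition) throws Exception {
            // Collect this partition's documents locally on the executor
            List<SolrInputDocument> docs = new ArrayList<>();
            while (partition.hasNext()) {
                docs.add(partition.next());
            }
            if (!docs.isEmpty()) {
                // Add and commit from the executor, once per partition
                SolrIndexerDriver.solrServer.add(docs);
                SolrIndexerDriver.solrServer.commit(true, true);
            }
        }
    });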
Have you checked the Spark UI to see the execution plan of this job?
Check how it is getting split into stages and what their dependencies are. That should hopefully give you an idea.
I am not sure if I can put my question in the clearest fashion, but I will try my best.
Let's say I am retrieving some information from a third-party API. The retrieved information will be huge in size. To gain performance, instead of retrieving all the info in one go, I will be retrieving it in a paged fashion (the API gives me that facility, basically an iterator). The return type is basically a list of objects.
My aim here is to process the information I have in hand (that includes comparing, storing in the DB, and many other operations) while I get paged responses to the request.
My question to the expert community is: what data structure do you prefer in such a case? Also, does a framework like Spring Batch help in getting performance gains in such cases?
I know the question is a bit vague, but I am looking for general ideas, tips and pointers.
In these cases, the data structure for me is java.util.concurrent.CompletionService.
For purposes of example, I'm going to assume a couple of additional constraints:
You want only one outstanding request to the remote server at a time
You want to process the results in order.
Here goes:
// a class that knows how to update the DB given a page of results
class DatabaseUpdater implements Callable<Object> { ... }
// a background thread to do the work
final CompletionService<Object> exec = new ExecutorCompletionService<>(
    Executors.newSingleThreadExecutor());
// first call
List<Object> results = ThirdPartyAPI.getPage( ... );
// Start loading those results to DB on background thread
exec.submit(new DatabaseUpdater(results));
while( you need to ) {
    // Another call to remote service
    results = ThirdPartyAPI.getPage( ... );
    // wait for existing work to complete; get() rethrows any failure from the updater
    exec.take().get();
    // send more work to background thread
    exec.submit(new DatabaseUpdater(results));
}
// wait for the last task to complete
exec.take().get();
This is just a simple two-thread design. The first thread is responsible for getting data from the remote service and the second is responsible for writing to the database.
Any exceptions thrown by DatabaseUpdater will be propagated to the main thread when the result is retrieved (via exec.take().get()).
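For completeness, a minimal sketch of what DatabaseUpdater could look like (the actual DB call is whatever persistence code you already have; dao here is hypothetical):
class DatabaseUpdater implements Callable<Object> {
    private final List<Object> page;

    DatabaseUpdater(List<Object> page) {
        this.page = page;
    }

    @Override
    public Object call() throws Exception {
        for (Object result : page) {
            // dao.save(result); // hypothetical DB write for one result
        }
        return null;
    }
}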
Good luck.
In terms of doing the actual parallelism, one very useful construct in Java is the ThreadPoolExecutor. A rough sketch of what that might look like is this:
import java.util.ArrayList;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class YourApp {
    static class Processor implements Runnable {
        Widget toProcess;
        public Processor(Widget toProcess) {
            this.toProcess = toProcess;
        }
        public void run() {
            // commit the Widget to the DB, etc
        }
    }
    public static void main(String[] args) {
        ThreadPoolExecutor executor =
            new ThreadPoolExecutor(1, 10, 30,
                TimeUnit.SECONDS,
                new LinkedBlockingDeque<Runnable>());
        while (thereAreStillWidgets()) {
            ArrayList<Widget> widgets = doExpensiveDatabaseCall();
            for (Widget widget : widgets) {
                Processor processor = new Processor(widget);
                executor.execute(processor);
            }
        }
        executor.shutdown(); // let the pool drain and exit once all widgets are submitted
    }
}
But as I said in a comment: calls to an external API are expensive. It's very likely that the best strategy is to pull all the Widget objects down from the API in one call, and then process them in parallel once you've got them. Doing more API calls gives you the overhead of sending the data all the way from the server to you, every time -- it's probably best to pay that cost the fewest number of times that you can.
Also, keep in mind that if you're doing DB operations, it's possible that your DB doesn't allow for parallel writes, so you might get a slowdown there.