I have a matrix pipeline job that runs a large number of stages (roughly 200), most of which are functional tests whose results are recorded by the following code:
stage('Report') {
    script {
        def summary = junit allowEmptyResults: true, testResults: "**/artifacts/${product}/test-reports/*.xml"
        def buildURL = "${env.BUILD_URL}"
        def TestAnalyzer = buildURL.replace("/${env.BUILD_NUMBER}", "/test_results_analyzer")
        def TestsURL = buildURL.replace("job/${env.JOB_NAME}/${env.BUILD_NUMBER}", "blue/organizations/jenkins/${env.JOB_NAME}/detail/${env.JOB_NAME}/${env.BUILD_NUMBER}/tests")
        slackSend (
            color: summary.failCount == 0 ? 'good' : 'warning',
            message: "*<${TestsURL}|Test Summary>* for *${env.JOB_NAME}* on *${env.HOSTNAME} - ${product}* <${env.BUILD_URL}| #${env.BUILD_NUMBER}> - <${TestAnalyzer}|${summary.totalCount} Tests>, Failed: ${summary.failCount}, Skipped: ${summary.skipCount}, Passed: ${summary.passCount}"
        )
    }
}
The problem is that this Report stage regularly fails with the following error:
> Archive JUnit-formatted test results 9m 25s
[2022-11-16T02:51:49.569Z] Recording test results
Java heap space
I have increased the heap space of the Jenkins server to 8GB by modifying the systemd service configuration this way:
software-dev@magnet:~$ sudo cat /etc/systemd/system/jenkins.service.d/override.conf
[Service]
Environment="JAVA_OPTS=-Djava.awt.headless=true -Xmx8g"
This was taken into account, as I verified with the following command:
software-dev@magnet:~$ tr '\0' '\n' < /proc/$(pidof java)/cmdline
/usr/bin/java
-Djava.awt.headless=true
-Xmx10g
-jar
/usr/share/java/jenkins.war
--webroot=/var/cache/jenkins/war
--httpPort=8080
I have just increased the heap size to 10GB and will wait for the result of tonight's build, but this amount of heap space seems excessive, so I suspect that a plugin, maybe the JUnit one, is buggy and consuming too much memory.
Is anyone aware of such an issue? Are there any workarounds?
More importantly, what methods could I use to track down whether one plugin is consuming too much memory?
I have some notions of Java from my CS degree, but I'm not familiar with the Jenkins development ecosystem.
Thank you in advance.
You can try splitting the tests into chunks/batches/groups, but this solution requires changes to the code.
More details:
https://semaphoreci.com/community/tutorials/how-to-split-junit-tests-in-a-continuous-integration-environment
Grouping JUnit tests
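For example, with JUnit 5 you could tag tests and have each matrix stage run only one group, which also keeps the per-stage JUnit result XML smaller. A minimal sketch, assuming JUnit 5 and Maven Surefire; the tag and class names here are made up:

import org.junit.jupiter.api.Tag;
import org.junit.jupiter.api.Test;

// Hypothetical example: split the functional tests into tagged groups so each
// CI stage only runs (and records results for) a subset of them.
class CheckoutFlowTest {

    @Tag("functional-group-1")
    @Test
    void addsItemToCart() {
        // ... test body ...
    }

    @Tag("functional-group-2")
    @Test
    void appliesDiscountCode() {
        // ... test body ...
    }
}

Each stage can then run a single group, for example mvn test -Dgroups=functional-group-1 with the Surefire plugin.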
I am building a project modeled on this project. The key difference is that I want to output a message conditionally, using the messages from the joined topics, as opposed to the example project, where an aggregation is performed. I am struggling to use a Serde for the JSON messages, so I have simplified the message structure as follows:
t1 (KStream) - a plain text value.
t2 (KTable) - a plain text value separated by a ;.
t3 (KStream) - a CSV string.
I am publishing messages using kafkacat with the -k option to set a key, e.g. k1. The problem I am facing is that I don't see any output in t3.
This is my TopologyProducer.java.
@ApplicationScoped
public class TopologyProducer {

    @Produces
    public Topology buildTopology() {
        StreamsBuilder builder = new StreamsBuilder();

        ObjectMapperSerde<stream1> stream1Serde = new ObjectMapperSerde<>(stream1.class);
        ObjectMapperSerde<topic1> topic1Serde = new ObjectMapperSerde<>(topic1.class);
        ObjectMapperSerde<output1> output1Serde = new ObjectMapperSerde<>(output1.class);

        // Build a GlobalKTable from topic t2
        GlobalKTable<String, topic1> paramTable = builder.globalTable(
                t2,
                Consumed.with(Serdes.String(), topic1Serde));

        // Join each record from t1 against the table and write the result to t3
        builder.stream(t1,
                Consumed.with(Serdes.String(), stream1Serde))
                .join(paramTable,
                        (paramName, paramValue) -> paramName,
                        (paramValue, paramLimits) -> {
                            // Add some logic to return conditionally
                            return new output1("paramName", 0.0, 0.0, true);
                        })
                .to(t3,
                        Produced.with(Serdes.String(), output1Serde));

        return builder.build();
    }
}
The Java version I had in my Dockerfile was wrong.
When I inspected the container logs, I saw an error about the Java version used to compile (higher) being different from the version used to run (lower). I chose the simpler of the two fixes, i.e. I used a more recent version of Java to run the application (rather than adjusting the Java version for the local mvn build). The error can be traced to the following instruction, as documented here:
The Dockerfile created by Quarkus by default needs one adjustment for the aggregator application in order to run the Kafka Streams pipeline. To do so, edit the file aggregator/src/main/docker/Dockerfile.jvm and replace the line FROM fabric8/java-alpine-openjdk8-jre with FROM fabric8/java-centos-openjdk8-jdk.
I edited my Dockerfile to use FROM registry.access.redhat.com/ubi8/openjdk-17:1.11 and have the application running.
Currently we have a Dataflow job which reads from Pub/Sub and writes Avro files to GCS using FileIO.writeDynamic. When we test with, say, 10000 events/sec, it cannot keep up, because WriteFiles/WriteShardedBundlesToTempFiles/GroupIntoShards is very slow. Below is the snippet we are using to write.
How can we improve this?
PCollection<Event> windowedWrites = input.apply("Global Window", Window.<Event>into(new GlobalWindows())
        .triggering(Repeatedly.forever(
                AfterFirst.of(AfterPane.elementCountAtLeast(50000),
                        AfterProcessingTime.pastFirstElementInPane().plusDelayOf(
                                DurationUtils.parseDuration(windowDuration)))))
        .discardingFiredPanes());

return windowedWrites
        .apply("WriteToAvroGCS", FileIO.<EventDestination, Event>writeDynamic()
                .by(groupFn)
                .via(outputFn, Contextful.fn(new SinkFn()))
                .withTempDirectory(avroTempDirectory)
                .withDestinationCoder(destinationCoder)
                .withNumShards(1)
                .withNaming(namingFn));
We use custom file naming, say in the format gs://tenantID.<>/eventname/dddd-mm-dd/<uniq_id-shardIndex-of-numOfShards-pane-paneIndex.avro>
As mentioned in the comments, the issue is likely withNumShards(1), which forces everything to happen on one worker.
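A rough sketch of the corresponding change to the write in the question (the shard count of 10 is only illustrative; note that for unbounded/streaming input FileIO still requires an explicit number of shards or runner-determined sharding, depending on your Beam version):

return windowedWrites
        .apply("WriteToAvroGCS", FileIO.<EventDestination, Event>writeDynamic()
                .by(groupFn)
                .via(outputFn, Contextful.fn(new SinkFn()))
                .withTempDirectory(avroTempDirectory)
                .withDestinationCoder(destinationCoder)
                // More shards let several workers write in parallel; tune this
                // to your throughput and desired output file size.
                .withNumShards(10)
                .withNaming(namingFn));

More shards allow the GroupIntoShards step to spread work across workers, at the cost of more (smaller) output files per window and destination.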
As Robert said, when using withNumShards(1), Dataflow/Beam cannot parallelize the writing, so it all happens on the same worker. When the bundles are relatively large, this has a big impact on the performance of the pipeline. I made an example to demonstrate this:
I ran three pipelines that generate a lot of elements (~2 GB), each with 10 n1-standard-1 workers, but with 1 shard, 10 shards and 0 shards respectively (with 0 shards, Dataflow chooses the number of shards). This is how they behave:
We see a big difference in total time between 0 or 10 shards and 1 shard. If we go to the job with 1 shard, we see that only one worker was doing anything (I disabled autoscaling):
As Reza mentioned, this happens because all elements need to be shuffled onto the same worker so that it can write the single shard.
Note that my example is Batch, which behaves differently from Streaming when it comes to threading, but the effect on pipeline performance is similar enough (in fact, in Streaming it may be even worse).
Here is some Python code so you can test this yourself:
p = beam.Pipeline(options=pipeline_options)

def long_string_generator():
    string = "Apache Beam is an open source, unified model for defining " \
             "both batch and streaming data-parallel processing " \
             "pipelines. Using one of the open source Beam SDKs, " \
             "you build a program that defines the pipeline. The pipeline " \
             "is then executed by one of Beam’s supported distributed " \
             "processing back-ends, which include Apache Flink, Apache " \
             "Spark, and Google Cloud Dataflow. "

    word_choice = random.sample(string.split(" "), 20)
    return " ".join(word_choice)

def generate_elements(element, amount=1):
    return [(element, long_string_generator()) for _ in range(amount)]

(p | Create(range(1500))
   | beam.FlatMap(generate_elements, amount=10000)
   | WriteToText(known_args.output, num_shards=known_args.shards))

p.run()
If I had 20 directories under trunk/ with lots of files in each and only needed 3 of those directories, would it be possible to do a Subversion checkout with only those 3 directories under trunk?
Indeed, thanks to the comments on my post here, it looks like sparse directories are the way to go. I believe the following should do it:
svn checkout --depth empty http://svnserver/trunk/proj
svn update --set-depth infinity proj/foo
svn update --set-depth infinity proj/bar
svn update --set-depth infinity proj/baz
Alternatively, --depth immediates instead of empty checks out files and directories in trunk/proj without their contents. That way you can see which directories exist in the repository.
As mentioned in @zigdon's answer, you can also do a non-recursive checkout. This is an older and less flexible way to achieve a similar effect:
svn checkout --non-recursive http://svnserver/trunk/proj
svn update proj/foo
svn update proj/bar
svn update proj/baz
Subversion 1.5 introduces sparse checkouts, which you may find useful. From the documentation:
... sparse directories (or shallow checkouts) ... allows you to easily check out a working copy—or a portion of a working copy—more shallowly than full recursion, with the freedom to bring in previously ignored files and subdirectories at a later time.
I wrote a script to automate complex sparse checkouts.
#!/usr/bin/env python

'''
This script makes a sparse checkout of an SVN tree in the current working directory.

Given a list of paths in an SVN repository, it will:
1. Checkout the common root directory
2. Update with depth=empty for intermediate directories
3. Update with depth=infinity for the leaf directories
'''

import os
import getpass
import pysvn

__author__ = "Karl Ostmo"
__date__ = "July 13, 2011"

# =============================================================================
# XXX The os.path.commonprefix() function does not behave as expected!
# See here: http://mail.python.org/pipermail/python-dev/2002-December/030947.html
# and here: http://nedbatchelder.com/blog/201003/whats_the_point_of_ospathcommonprefix.html
# and here (what ever happened?): http://bugs.python.org/issue400788

from itertools import takewhile

def allnamesequal(name):
    return all(n==name[0] for n in name[1:])

def commonprefix(paths, sep='/'):
    bydirectorylevels = zip(*[p.split(sep) for p in paths])
    return sep.join(x[0] for x in takewhile(allnamesequal, bydirectorylevels))

# =============================================================================
def getSvnClient(options):

    password = options.svn_password
    if not password:
        password = getpass.getpass('Enter SVN password for user "%s": ' % options.svn_username)

    client = pysvn.Client()
    client.callback_get_login = lambda realm, username, may_save: (True, options.svn_username, password, True)
    return client

# =============================================================================
def sparse_update_with_feedback(client, new_update_path):
    revision_list = client.update(new_update_path, depth=pysvn.depth.empty)

# =============================================================================
def sparse_checkout(options, client, repo_url, sparse_path, local_checkout_root):

    path_segments = sparse_path.split(os.sep)
    path_segments.reverse()

    # Update the middle path segments
    new_update_path = local_checkout_root
    while len(path_segments) > 1:
        path_segment = path_segments.pop()
        new_update_path = os.path.join(new_update_path, path_segment)
        sparse_update_with_feedback(client, new_update_path)
        if options.verbose:
            print "Added internal node:", path_segment

    # Update the leaf path segment, fully-recursive
    leaf_segment = path_segments.pop()
    new_update_path = os.path.join(new_update_path, leaf_segment)

    if options.verbose:
        print "Will now update with 'recursive':", new_update_path
    update_revision_list = client.update(new_update_path)

    if options.verbose:
        for revision in update_revision_list:
            print "- Finished updating %s to revision: %d" % (new_update_path, revision.number)

# =============================================================================
def group_sparse_checkout(options, client, repo_url, sparse_path_list, local_checkout_root):

    if not sparse_path_list:
        print "Nothing to do!"
        return

    checkout_path = None
    if len(sparse_path_list) > 1:
        checkout_path = commonprefix(sparse_path_list)
    else:
        checkout_path = sparse_path_list[0].split(os.sep)[0]

    root_checkout_url = os.path.join(repo_url, checkout_path).replace("\\", "/")
    revision = client.checkout(root_checkout_url, local_checkout_root, depth=pysvn.depth.empty)

    checkout_path_segments = checkout_path.split(os.sep)
    for sparse_path in sparse_path_list:

        # Remove the leading path segments
        path_segments = sparse_path.split(os.sep)
        start_segment_index = 0
        for i, segment in enumerate(checkout_path_segments):
            if segment == path_segments[i]:
                start_segment_index += 1
            else:
                break

        pruned_path = os.sep.join(path_segments[start_segment_index:])

        sparse_checkout(options, client, repo_url, pruned_path, local_checkout_root)

# =============================================================================
if __name__ == "__main__":

    from optparse import OptionParser
    usage = """%prog [path2] [more paths...]"""

    default_repo_url = "http://svn.example.com/MyRepository"
    default_checkout_path = "sparse_trunk"

    parser = OptionParser(usage)
    parser.add_option("-r", "--repo_url", type="str", default=default_repo_url, dest="repo_url", help='Repository URL (default: "%s")' % default_repo_url)
    parser.add_option("-l", "--local_path", type="str", default=default_checkout_path, dest="local_path", help='Local checkout path (default: "%s")' % default_checkout_path)

    default_username = getpass.getuser()
    parser.add_option("-u", "--username", type="str", default=default_username, dest="svn_username", help='SVN login username (default: "%s")' % default_username)
    parser.add_option("-p", "--password", type="str", dest="svn_password", help="SVN login password")
    parser.add_option("-v", "--verbose", action="store_true", default=False, dest="verbose", help="Verbose output")

    (options, args) = parser.parse_args()

    client = getSvnClient(options)

    group_sparse_checkout(
        options,
        client,
        options.repo_url,
        map(os.path.relpath, args),
        options.local_path)
Or do a non-recursive checkout of /trunk, then just do a manual update on the 3 directories you need.
If you already have the full local copy, you can remove unwanted subfolders using the --set-depth command.
svn update --set-depth=exclude www
See: http://blogs.collab.net/subversion/sparse-directories-now-with-exclusion
The set-depth command supports multiple paths.
Updating the root of the local copy will not change the depth of the excluded folder.
To restore the folder to a recursive checkout, use --set-depth again with the infinity parameter.
svn update --set-depth=infinity www
I'm adding this information for those using the TortoiseSVN tool: to obtain the same functionality the OP asked for, you can use the Choose items... button in the Checkout Depth section of the Checkout dialog, as shown in the following screenshot:
Sort of. As Bobby says:
svn co file:///.../trunk/foo file:///.../trunk/bar file:///.../trunk/hum
will get the folders, but you will get separate folders from a Subversion perspective. You will have to do separate commits and updates on each subfolder.
I don't believe you can checkout a partial tree and then work with the partial tree as a single entity.
Not in any especially useful way, no. You can check out subtrees (as in Bobby Jack's suggestion), but then you lose the ability to update/commit them atomically; to do that, they need to be placed under their common parent, and as soon as you check out the common parent, you'll download everything under that parent. Non-recursive isn't a good option, because you want updates and commits to be recursive.
We know that a map task has two parts ('chunk' and 'combine') and a reduce task has three parts ('shuffle', 'sort' and 'reduce').
In the Hadoop source code, what command or API can I use to get the time taken by each of these parts?
There is an API for the JobTracker for submitting and tracking MR jobs in a network environment.
Check this for more details.
https://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/mapred/JobTracker.html
TaskReport[] maps = jobtracker.getMapTaskReports("job_id");
for (TaskReport rpt : maps) {
    System.out.println(rpt.getStartTime());
    System.out.println(rpt.getFinishTime());
}

TaskReport[] reduces = jobtracker.getReduceTaskReports("job_id");
for (TaskReport rpt : reduces) {
    System.out.println(rpt.getStartTime());
    System.out.println(rpt.getFinishTime());
}
Alternatively, if you are using Hadoop 2.x, the ResourceManager and History Server REST APIs are provided:
https://hadoop.apache.org/docs/r2.6.0/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html
https://hadoop.apache.org/docs/r2.6.0/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/HistoryServerRest.html
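For example, you can query the History Server REST API for per-task start and finish times. A minimal sketch (the host, port and job id are placeholders, and the JSON response, which includes startTime, finishTime and elapsedTime per task, is printed raw rather than parsed):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class TaskTimes {
    public static void main(String[] args) throws Exception {
        // Placeholder host, port (19888 is the default history server web port) and job id
        URL url = new URL("http://historyserver.example.com:19888"
                + "/ws/v1/history/mapreduce/jobs/job_1411111111111_0001/tasks");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/json");

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            // Each task entry contains "type" (MAP/REDUCE), "startTime",
            // "finishTime" and "elapsedTime" in milliseconds.
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}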