Java - Apache Spark communication

I'm quite new to Spark and was looking for some guidance :-)
What's the typical way in which a Java MVC application communicates with Spark? To simplify things, let's say I want to count the words in a certain file whose name is provided via a GET request to my server.
My initial approach was to open the context and implement the transformations/computations in a class inside my MVC application. That means that at runtime I would have to come up with an uber jar of spark-core. The problems are that:
The uber jar weighs 80 MB.
I am facing the same problem (akka.version) as in: apache spark: akka version error by build jar with all dependencies
I could have a go with shade to solve it, but I have the feeling this is not the way to go.
Maybe the "provided" scope in Maven would help me, but I'm using Ant.
Should my application, as suggested in that page, already have one jar with the implementation (devoid of any Spark libraries) and use spark-submit every time I receive a request? I guess it would leave the results somewhere.
Am I missing any middle-of-the-road approach?

Using spark-submit each time is kind of heavyweight; I'd recommend using a long-running Spark context of some sort. I think the "middle of the road" option that you might be looking for is to have your job use something like the IBM Spark Kernel, Zeppelin, or the Spark Job Server from Ooyala.
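To make the long-running-context idea concrete for the word-count-per-GET case: create the SparkContext once when the web application starts and reuse it for every request, instead of launching spark-submit each time. A minimal sketch under those assumptions (the holder class, app name and local[*] master below are made-up placeholders, not a prescribed setup):
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Hypothetical holder: one JavaSparkContext for the lifetime of the web application.
public class SparkContextHolder {

    private static final JavaSparkContext SC = new JavaSparkContext(
            new SparkConf().setAppName("wordcount-service").setMaster("local[*]"));

    // Called from an MVC controller with the file name taken from the GET request.
    public static long countWords(String path) {
        return SC.textFile(path)
                 .map(line -> (long) line.split("\\s+").length)
                 .reduce((a, b) -> a + b);
    }
}
Services like the Spark Job Server or Mist mentioned below essentially host such a shared context for you and expose it over HTTP, so the MVC application only has to make a REST call.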

It is good practice to use a middleware service deployed on top of Spark which manages its contexts, job failures, Spark versions, and a lot of other things you would otherwise have to handle yourself.
I would recommend Mist. It implements Spark as a Service and creates a unified API layer for building enterprise solutions and services on top of a Big Data lake.
Mist supports Scala and Python job execution.
The quick start is as follows:
Add the Mist wrapper to your Spark job:
Scala example:
object SimpleContext extends MistJob {
  override def doStuff(context: SparkContext, parameters: Map[String, Any]): Map[String, Any] = {
    val numbers: List[BigInt] = parameters("digits").asInstanceOf[List[BigInt]]
    val rdd = context.parallelize(numbers)
    Map("result" -> rdd.map(x => x * 2).collect())
  }
}
Python example:
import mist

class MyJob:
    def __init__(self, job):
        job.sendResult(self.doStuff(job))

    def doStuff(self, job):
        val = job.parameters.values()
        list = val.head()
        size = list.size()
        pylist = []
        count = 0
        while count < size:
            pylist.append(list.head())
            count = count + 1
            list = list.tail()
        rdd = job.sc.parallelize(pylist)
        result = rdd.map(lambda s: 2 * s).collect()
        return result

if __name__ == "__main__":
    job = MyJob(mist.Job())
Run Mist service:
Build Mist:
git clone https://github.com/hydrospheredata/mist.git
cd mist
./sbt/sbt -DsparkVersion=1.5.2 assembly # change version according to your installed spark
Create a configuration file:
mist.spark.master = "local[*]"
mist.settings.threadNumber = 16
mist.http.on = true
mist.http.host = "0.0.0.0"
mist.http.port = 2003
mist.mqtt.on = false
mist.recovery.on = false
mist.contextDefaults.timeout = 100 days
mist.contextDefaults.disposable = false
mist.contextDefaults.sparkConf = {
  spark.default.parallelism = 128
  spark.driver.memory = "10g"
  spark.scheduler.mode = "FAIR"
}
Run
spark-submit --class io.hydrosphere.mist.Mist \
  --driver-java-options "-Dconfig.file=/path/to/application.conf" \
  target/scala-2.10/mist-assembly-0.2.0.jar
Try curl from the terminal:
curl --header "Content-Type: application/json" -X POST http://192.168.10.33:2003/jobs --data '{"jarPath":"/vagrant/examples/target/scala-2.10/mist_examples_2.10-0.2.0.jar", "className":"SimpleContext$","parameters":{"digits":[1,2,3,4,5,6,7,8,9,0]}, "external_id":"12345678","name":"foo"}'

Related

How to investigate Java Heap Space errors in jenkins

I have a matrix pipeline job which performs multiple stages (something like ~200), most of which are functional tests whose results are recorded by the following code:
stage('Report') {
    script {
        def summary = junit allowEmptyResults: true, testResults: "**/artifacts/${product}/test-reports/*.xml"
        def buildURL = "${env.BUILD_URL}"
        def TestAnalyzer = buildURL.replace("/${env.BUILD_NUMBER}", "/test_results_analyzer")
        def TestsURL = buildURL.replace("job/${env.JOB_NAME}/${env.BUILD_NUMBER}", "blue/organizations/jenkins/${env.JOB_NAME}/detail/${env.JOB_NAME}/${env.BUILD_NUMBER}/tests")
        slackSend (
            color: summary.failCount == 0 ? 'good' : 'warning',
            message: "*<${TestsURL}|Test Summary>* for *${env.JOB_NAME}* on *${env.HOSTNAME} - ${product}* <${env.BUILD_URL}| #${env.BUILD_NUMBER}> - <${TestAnalyzer}|${summary.totalCount} Tests>, Failed: ${summary.failCount}, Skipped: ${summary.skipCount}, Passed: ${summary.passCount}"
        )
    }
}
The problem is that this Report stage regularly fails with the following error:
> Archive JUnit-formatted test results 9m 25s
[2022-11-16T02:51:49.569Z] Recording test results
Java heap space
I have increased the heap space of the Jenkins server to 8 GB by modifying the systemd service configuration this way:
software-dev@magnet:~$ sudo cat /etc/systemd/system/jenkins.service.d/override.conf
[Service]
Environment="JAVA_OPTS=-Djava.awt.headless=true -Xmx8g"
which was taken into account, as I verified with the following command:
software-dev@magnet:~$ tr '\0' '\n' < /proc/$(pidof java)/cmdline
/usr/bin/java
-Djava.awt.headless=true
-Xmx10g
-jar
/usr/share/java/jenkins.war
--webroot=/var/cache/jenkins/war
--httpPort=8080
I have just increased the heap size to 10 GB and will wait for the result of tonight's build, but I have the feeling that this amount of heap space really looks excessive, so I suspect that a plugin, maybe the JUnit one, may be buggy and could be consuming too much memory.
Is anyone aware of such a thing? Could there be workarounds?
More importantly, which methods could I use to track whether one plugin is consuming too much memory?
I have notions of Java from my CS degree, but I'm not familiar with the Jenkins development ecosystem.
Thank you in advance.
You can try splitting the tests into chunks/batches/groups, but this solution requires changes in the code; see the sketch after the links below.
More details:
https://semaphoreci.com/community/tutorials/how-to-split-junit-tests-in-a-continuous-integration-environment
Grouping JUnit tests
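As a rough illustration of that splitting idea with JUnit 4 categories (all class and interface names below are made up, and in a real project each type would be public in its own file; the linked articles cover how to wire the groups into separate CI stages):
import org.junit.Test;
import org.junit.experimental.categories.Categories;
import org.junit.experimental.categories.Category;
import org.junit.runner.RunWith;
import org.junit.runners.Suite;

// Hypothetical marker interface for the "slow" group.
interface SlowTests {}

class PaymentTest {
    @Test
    public void quickCheck() {
        // runs in the default (fast) group
    }

    @Category(SlowTests.class)
    @Test
    public void fullEndToEnd() {
        // runs only when the slow group is selected
    }
}

// A suite that runs only the slow group; point a separate, smaller CI stage at it
// so a single Report/junit step never has to ingest all test reports at once.
@RunWith(Categories.class)
@Categories.IncludeCategory(SlowTests.class)
@Suite.SuiteClasses(PaymentTest.class)
class SlowSuite {}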

Quarkus - KStream and KTable join does not output messages

I am building a project modeled on this project. The key difference is that I want to conditionally output a message using the messages from the joined topics, as opposed to the example project, where an aggregation is performed. I was struggling to use a Serde for JSON messages, so I have simplified the message structure as follows.
t1 (KStream) - a plain text value.
t2 (KTable) - a plain text value separated by a ;.
t3 (KStream) - a CSV string.
I am publishing messages using kafkacat with the -k option to set a key, e.g. k1. The problem I am facing is that I don't see any output in t3.
This is my TopologyProducer.java:
@ApplicationScoped
public class TopologyProducer {

    @Produces
    public Topology buildTopology() {
        StreamsBuilder builder = new StreamsBuilder();

        ObjectMapperSerde<stream1> stream1Serde = new ObjectMapperSerde<>(stream1.class);
        ObjectMapperSerde<topic1> topic1Serde = new ObjectMapperSerde<>(topic1.class);
        ObjectMapperSerde<output1> output1Serde = new ObjectMapperSerde<>(output1.class);

        GlobalKTable<String, topic1> table = builder.globalTable(
                t2,
                Consumed.with(Serdes.String(), topic1Serde));

        builder.stream(t1, Consumed.with(Serdes.String(), stream1Serde))
                .join(table,
                        (paramName, paramValue) -> paramName,
                        (paramValue, paramLimits) -> {
                            // Add some logic to return conditionally
                            return new output1("paramName", 0.0, 0.0, true);
                        })
                .to(t3, Produced.with(Serdes.String(), output1Serde));

        return builder.build();
    }
}
The Java version I had in my Dockerfile was wrong.
When I inspected the container logs, I saw an error about the difference between the Java version used to compile (higher) and the one used to run (lower). I chose the simpler of the two fixes, i.e. I used a more recent version of Java to run the application (rather than adjusting the Java version of the local mvn build). The error can be traced to the following instruction, as documented here.
The Dockerfile created by Quarkus by default needs one adjustment for the aggregator application in order to run the Kafka Streams pipeline. To do so, edit the file aggregator/src/main/docker/Dockerfile.jvm and replace the line FROM fabric8/java-alpine-openjdk8-jre with FROM fabric8/java-centos-openjdk8-jdk.
I edited my Dockerfile to use FROM registry.access.redhat.com/ubi8/openjdk-17:1.11 and now have the application running.

Google dataflow job which reads from Pubsub and writes to GCS is very slow (WriteFiles/WriteShardedBundlesToTempFiles/GroupIntoShards) takes too long

Currently we have a Dataflow job which reads from Pub/Sub and writes Avro files to GCS using FileIO.writeDynamic. When we test with, say, 10000 events/sec, it cannot process them fast enough, because WriteFiles/WriteShardedBundlesToTempFiles/GroupIntoShards is very slow. Below is the snippet we are using to write.
How can we improve this?
PCollection<Event> windowedWrites = input.apply("Global Window", Window.<Event>into(new GlobalWindows())
        .triggering(Repeatedly.forever(
                AfterFirst.of(AfterPane.elementCountAtLeast(50000),
                        AfterProcessingTime.pastFirstElementInPane().plusDelayOf(
                                DurationUtils.parseDuration(windowDuration)))))
        .discardingFiredPanes());

return windowedWrites
        .apply("WriteToAvroGCS", FileIO.<EventDestination, Five9Event>writeDynamic()
                .by(groupFn)
                .via(outputFn, Contextful.fn(new SinkFn()))
                .withTempDirectory(avroTempDirectory)
                .withDestinationCoder(destinationCoder)
                .withNumShards(1)
                .withNaming(namingFn));
We use custom file naming, say in the format gs://tenantID.<>/eventname/dddd-mm-dd/<uniq_id-shardIndex-of-numOfShards-pane-paneIndex.avro>
As mentioned in the comments, the issue is likely withNumShards(1) which forces everything to happen on one worker.
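To make that concrete, the usual fix is to raise the shard count so several workers can write in parallel (or, where the runner supports it, let the runner pick the sharding). A minimal, self-contained sketch using plain FileIO.write() rather than the question's writeDynamic(), with a placeholder output path and an arbitrary shard count:
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Create;

public class ShardedWriteSketch {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create();
        p.apply(Create.of("a", "b", "c"))
         .apply(FileIO.<String>write()
                 .via(TextIO.sink())
                 .to("gs://my-bucket/out/")   // placeholder output location
                 .withNumShards(10));         // more shards -> the write can be spread across workers
        p.run().waitUntilFinish();
    }
}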
As Robert said, when using withNumShards(1) Dataflow/Beam cannot parallelize the writing, making it happen on the same worker. When the bundles are relatively large, this has a big impact on the performance of the pipeline. I made an example to demonstrate this:
I ran 3 pipelines that generate a lot of elements (~2 GB), each with 10 n1-standard-1 workers but with 1 shard, 10 shards and 0 shards (Dataflow would choose the number of shards). There is a big difference in total time between 0 or 10 shards and 1 shard. Looking at the job with 1 shard, only one worker was doing anything (I disabled autoscaling).
As Reza mentioned, this happens because all elements need to be shuffled onto the same worker so that it can write the single shard.
Note that my example is batch, which behaves differently from streaming when it comes to threading, but the effect on pipeline performance is similar enough (in fact, in streaming it may be even worse).
Here is some Python code so you can test this yourself:
import random

import apache_beam as beam
from apache_beam import Create
from apache_beam.io import WriteToText

p = beam.Pipeline(options=pipeline_options)

def long_string_generator():
    string = "Apache Beam is an open source, unified model for defining " \
             "both batch and streaming data-parallel processing " \
             "pipelines. Using one of the open source Beam SDKs, " \
             "you build a program that defines the pipeline. The pipeline " \
             "is then executed by one of Beam’s supported distributed " \
             "processing back-ends, which include Apache Flink, Apache " \
             "Spark, and Google Cloud Dataflow. "
    word_choice = random.sample(string.split(" "), 20)
    return " ".join(word_choice)

def generate_elements(element, amount=1):
    return [(element, long_string_generator()) for _ in range(amount)]

(p | Create(range(1500))
   | beam.FlatMap(generate_elements, amount=10000)
   | WriteToText(known_args.output, num_shards=known_args.shards))

p.run()

How to automatically commit only the src folder (and its content) of each project with SVN [duplicate]

If I had 20 directories under trunk/ with lots of files in each and only needed 3 of those directories, would it be possible to do a Subversion checkout with only those 3 directories under trunk?
Indeed, thanks to the comments to my post here, it looks like sparse directories are the way to go. I believe the following should do it:
svn checkout --depth empty http://svnserver/trunk/proj
svn update --set-depth infinity proj/foo
svn update --set-depth infinity proj/bar
svn update --set-depth infinity proj/baz
Alternatively, --depth immediates instead of empty checks out files and directories in trunk/proj without their contents. That way you can see which directories exist in the repository.
As mentioned in #zigdon's answer, you can also do a non-recursive checkout. This is an older and less flexible way to achieve a similar effect:
svn checkout --non-recursive http://svnserver/trunk/proj
svn update trunk/foo
svn update trunk/bar
svn update trunk/baz
Subversion 1.5 introduces sparse checkouts, which you may find useful. From the documentation:
... sparse directories (or shallow checkouts) ... allows you to easily check out a working copy—or a portion of a working copy—more shallowly than full recursion, with the freedom to bring in previously ignored files and subdirectories at a later time.
I wrote a script to automate complex sparse checkouts.
#!/usr/bin/env python
'''
This script makes a sparse checkout of an SVN tree in the current working directory.
Given a list of paths in an SVN repository, it will:
1. Checkout the common root directory
2. Update with depth=empty for intermediate directories
3. Update with depth=infinity for the leaf directories
'''

import os
import getpass
import pysvn

__author__ = "Karl Ostmo"
__date__ = "July 13, 2011"

# =============================================================================
# XXX The os.path.commonprefix() function does not behave as expected!
# See here: http://mail.python.org/pipermail/python-dev/2002-December/030947.html
# and here: http://nedbatchelder.com/blog/201003/whats_the_point_of_ospathcommonprefix.html
# and here (what ever happened?): http://bugs.python.org/issue400788
from itertools import takewhile

def allnamesequal(name):
    return all(n==name[0] for n in name[1:])

def commonprefix(paths, sep='/'):
    bydirectorylevels = zip(*[p.split(sep) for p in paths])
    return sep.join(x[0] for x in takewhile(allnamesequal, bydirectorylevels))

# =============================================================================
def getSvnClient(options):
    password = options.svn_password
    if not password:
        password = getpass.getpass('Enter SVN password for user "%s": ' % options.svn_username)

    client = pysvn.Client()
    client.callback_get_login = lambda realm, username, may_save: (True, options.svn_username, password, True)
    return client

# =============================================================================
def sparse_update_with_feedback(client, new_update_path):
    revision_list = client.update(new_update_path, depth=pysvn.depth.empty)

# =============================================================================
def sparse_checkout(options, client, repo_url, sparse_path, local_checkout_root):
    path_segments = sparse_path.split(os.sep)
    path_segments.reverse()

    # Update the middle path segments
    new_update_path = local_checkout_root
    while len(path_segments) > 1:
        path_segment = path_segments.pop()
        new_update_path = os.path.join(new_update_path, path_segment)
        sparse_update_with_feedback(client, new_update_path)
        if options.verbose:
            print "Added internal node:", path_segment

    # Update the leaf path segment, fully-recursive
    leaf_segment = path_segments.pop()
    new_update_path = os.path.join(new_update_path, leaf_segment)

    if options.verbose:
        print "Will now update with 'recursive':", new_update_path
    update_revision_list = client.update(new_update_path)

    if options.verbose:
        for revision in update_revision_list:
            print "- Finished updating %s to revision: %d" % (new_update_path, revision.number)

# =============================================================================
def group_sparse_checkout(options, client, repo_url, sparse_path_list, local_checkout_root):
    if not sparse_path_list:
        print "Nothing to do!"
        return

    checkout_path = None
    if len(sparse_path_list) > 1:
        checkout_path = commonprefix(sparse_path_list)
    else:
        checkout_path = sparse_path_list[0].split(os.sep)[0]

    root_checkout_url = os.path.join(repo_url, checkout_path).replace("\\", "/")
    revision = client.checkout(root_checkout_url, local_checkout_root, depth=pysvn.depth.empty)

    checkout_path_segments = checkout_path.split(os.sep)
    for sparse_path in sparse_path_list:

        # Remove the leading path segments
        path_segments = sparse_path.split(os.sep)
        start_segment_index = 0
        for i, segment in enumerate(checkout_path_segments):
            if segment == path_segments[i]:
                start_segment_index += 1
            else:
                break

        pruned_path = os.sep.join(path_segments[start_segment_index:])
        sparse_checkout(options, client, repo_url, pruned_path, local_checkout_root)

# =============================================================================
if __name__ == "__main__":

    from optparse import OptionParser
    usage = """%prog [path2] [more paths...]"""

    default_repo_url = "http://svn.example.com/MyRepository"
    default_checkout_path = "sparse_trunk"

    parser = OptionParser(usage)
    parser.add_option("-r", "--repo_url", type="str", default=default_repo_url, dest="repo_url", help='Repository URL (default: "%s")' % default_repo_url)
    parser.add_option("-l", "--local_path", type="str", default=default_checkout_path, dest="local_path", help='Local checkout path (default: "%s")' % default_checkout_path)

    default_username = getpass.getuser()
    parser.add_option("-u", "--username", type="str", default=default_username, dest="svn_username", help='SVN login username (default: "%s")' % default_username)
    parser.add_option("-p", "--password", type="str", dest="svn_password", help="SVN login password")
    parser.add_option("-v", "--verbose", action="store_true", default=False, dest="verbose", help="Verbose output")

    (options, args) = parser.parse_args()

    client = getSvnClient(options)
    group_sparse_checkout(
        options,
        client,
        options.repo_url,
        map(os.path.relpath, args),
        options.local_path)
Or do a non-recursive checkout of /trunk, then just do a manual update on the 3 directories you need.
If you already have the full local copy, you can remove unwanted sub folders by using --set-depth command.
svn update --set-depth=exclude www
See: http://blogs.collab.net/subversion/sparse-directories-now-with-exclusion
The set-depth command supports multiple paths.
Updating the root of the local copy will not change the depth of the modified folder.
To restore the folder to being recursively checked out, you can use --set-depth again with the infinity parameter.
svn update --set-depth=infinity www
I'm adding this information for those using the TortoiseSVN tool: to obtain the same functionality the OP asked for, you can use the Choose items... button in the Checkout Depth section of the Checkout dialog.
Sort of. As Bobby says:
svn co file:///.../trunk/foo file:///.../trunk/bar file:///.../trunk/hum
will get the folders, but they will be separate folders from a Subversion perspective. You will have to do separate commits and updates on each subfolder.
I don't believe you can check out a partial tree and then work with the partial tree as a single entity.
Not in any especially useful way, no. You can check out subtrees (as in Bobby Jack's suggestion), but then you lose the ability to update/commit them atomically; to do that, they need to be placed under their common parent, and as soon as you check out the common parent, you'll download everything under that parent. Non-recursive isn't a good option, because you want updates and commits to be recursive.

command in hadoop source code for map time and reduce time

We know that map has two parts (chunk and combine), and reduce has three parts (shuffle, sort and reduce).
In the Hadoop source code, what is the command for getting the time taken by each of these parts?
There is an API for the JobTracker for submitting and tracking MR jobs in a network environment.
Check this for more details.
https://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/mapred/JobTracker.html
TaskReport[] maps = jobtracker.getMapTaskReports("job_id");
for (TaskReport rpt : maps) {
    System.out.println(rpt.getStartTime());
    System.out.println(rpt.getFinishTime());
}

TaskReport[] reduces = jobtracker.getReduceTaskReports("job_id");
for (TaskReport rpt : reduces) {
    System.out.println(rpt.getStartTime());
    System.out.println(rpt.getFinishTime());
}
Or, if you are using Hadoop 2.x, ResourceManager and History Server REST APIs are provided:
https://hadoop.apache.org/docs/r2.6.0/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html
https://hadoop.apache.org/docs/r2.6.0/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/HistoryServerRest.html
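For the Hadoop 2.x route, a rough sketch (host, port and job id below are placeholders) of pulling per-task timings from the MapReduce History Server REST API; the /ws/v1/history/mapreduce/jobs/{jobid}/tasks response lists each task with its type (MAP or REDUCE) and its start, finish and elapsed times:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class TaskTimes {
    public static void main(String[] args) throws Exception {
        // Placeholder History Server address and job id; adjust to your cluster and job.
        String url = "http://historyserver:19888/ws/v1/history/mapreduce/jobs/"
                + "job_1420000000000_0001/tasks";
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(url).openStream()))) {
            // The JSON response carries "type", "startTime", "finishTime" and
            // "elapsedTime" fields for each task; print it raw here.
            in.lines().forEach(System.out::println);
        }
    }
}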
