I am testing my first Spark Streaming pipeline, which processes messages from Kafka. However, after several test runs, I got the following error message:
There is insufficient memory for the Java Runtime Environment to continue.
My test data is really small, so this should not happen. After looking into the running processes, I realized that previously submitted Spark jobs may not have been removed completely.
I usually submit jobs as shown below, and I am using Spark 2.2.1:
/usr/local/spark/bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2 ~/script/to/spark_streaming.py
And I stop it using `Ctrl+C`.
The last few lines of the script look like this:
ssc.start()
ssc.awaitTermination()
Update
After changing the way I submit the Spark Streaming job (command below), I still ran into the same issue: after killing the job, the memory is not released. I only started Hadoop and Spark on those 4 EC2 nodes.
/bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2 --py-files ~/config.py --master spark://<master_IP>:7077 --deploy-mode client ~/spark_kafka.py
When you press Ctrl+C, only the submitter process is interrupted; the job itself continues to run. Eventually your system runs out of memory, so no new JVM can be started.
Furthermore, even if you restart the cluster, all previously running jobs will be restarted again.
Read how to stop a running Spark application properly.
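For a streaming job specifically, one common pattern is to enable graceful shutdown and stop the context explicitly instead of relying on Ctrl+C alone. A minimal sketch in PySpark (the app name, batch interval, and Kafka wiring are placeholders):

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

# Ask Spark to finish in-flight batches when the driver receives a shutdown signal.
conf = SparkConf().setAppName("kafka-streaming")  # placeholder app name
conf.set("spark.streaming.stopGracefullyOnShutdown", "true")
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 10)  # placeholder 10-second batch interval

# ... build the Kafka DStream and transformations here ...

ssc.start()
try:
    ssc.awaitTermination()
finally:
    # Stop both the streaming context and the SparkContext so that
    # no driver or executor JVMs are left behind.
    ssc.stop(stopSparkContext=True, stopGraceFully=True)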
The problem might be a bunch of leftover driver processes (spark-app-driver) running on the host you use to submit Spark jobs. Try something like
ps aux --forest
or something similar, depending on your platform, to see which processes are currently running. You can also have a look at the answers to the Stack Overflow question Spark Streaming with Actor Never Terminates; it might give you a clue about what is happening.
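A minimal sketch of that check on Linux, assuming leftover spark-submit/driver JVMs are the culprit (process names may differ on your setup):

ps aux --forest | grep -i "[s]park"   # list Spark-related processes on the submitting host
kill <pid_of_leftover_driver>         # stop orphaned driver JVMs; use kill -9 only as a last resort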
Related
I'm running Java code that copies files from HDFS to the local filesystem, submitted with spark-submit in Spark cluster mode.
The job runs fine with Spark in local mode but fails in cluster mode.
It throws a java.io.IOException: Target /mypath/ is a directory.
I don't understand why it is failing in cluster mode, since I don't receive any exceptions in local mode.
That behaviour occurs because in the first case (local), your driver runs on the same machine from which you submit the whole Spark job. In the second case (cluster), your driver program is shipped to one of your workers and executes from there.
In general, when you want to run Spark jobs in cluster mode and need to process local files such as JSON or XML, you need to ship them along with the executable using the --files <myfile> option. Your driver program will then be able to see that particular file. If you want to include multiple files, separate them with commas (,).
The approach is the same when you want to add jar dependencies: use --jars <myJars>, as in the sketch below.
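A hedged sketch of such a submission (all paths and file names below are placeholders):

/bin/spark-submit --master spark://<master_IP>:7077 --deploy-mode cluster \
  --files /local/path/config.json,/local/path/lookup.xml \
  --jars /local/path/dep1.jar,/local/path/dep2.jar \
  ~/my_spark_job.py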
For more details about this, check this thread.
I'm trying to instrument a Spark (v 1.6.1) application with an APM (Application Performance Management System). To do so with the APM of choice, I must instrument the JVM startup string(s) with a -javaagent flag, pointing to my APM along with any pertinent APM options (e.g., unique-host-id).
On the Spark Master server, I have done this successfully, and on the various Spark Worker servers, I have successfully instrumented the spark worker process as well.
However, when I look at the Java processes that are running, I see one additional process that does not contain my startup string: the CoarseGrainedScheduler. The CoarseGrainedScheduler is the actual Spark executor which runs the worker application code submitted to the Worker by the Master.
I cannot determine from where the CoarseGrainedScheduler is invoked.
So, for more context, here's how I've instrumented the Spark Worker startup string: in {SPARK_HOME}/conf/spark-env.sh, I added the following environment variable:
SPARK_DAEMON_JAVA_OPTS="<java-agent-startup-string>"
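As a sketch, with a placeholder agent path and APM options, that value looks roughly like this:

SPARK_DAEMON_JAVA_OPTS="-javaagent:/path/to/apm-agent.jar=<apm-options, e.g. unique-host-id>"   # placeholder path and options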
That variable gets carried through to the eventual invocation of {SPARK_HOME}/bin/spark-class, which is the root of all Spark invocations; that is, all Spark commands that emanate from {SPARK_HOME}/bin or {SPARK_HOME}/sbin eventually delegate to spark-class.
This is seemingly not, however, where CoarseGrainedScheduler is invoked from. According to this document, though, that is exactly where it should be invoked:
$ ./bin/spark-class org.apache.spark.executor.CoarseGrainedExecutorBackend <opts>
Why, then, is my startup string not being picked up? Working from the assumption/instruction that CoarseGrainedExecutorBackend is invoked via spark-class, I edited that file to add my startup string to every java command it runs. That also fails to add my startup string to CoarseGrainedExecutorBackend, although it does add it to the Spark Worker process itself. So again, it seems as though CoarseGrainedExecutorBackend is not started via spark-class, even though the linked document says it is.
Can anyone help me find the root of the CoarseGrainedExecutorBackend process and how it's invoked? If I can provide any additional details, just let me know.
I am new to Hadoop and Kafka. I inherited code for a Kafka consumer that runs on a desktop Windows machine, receives the HDFS location of new XML data available on a remote cluster, downloads the data for processing, and writes the result back out to the HDFS cluster.
It seems to me that the consumer should run on the cluster, because that's where the data is, but all the sample Kafka consumer code I see suggests that producers/consumers run on regular desktop machines. What is the typical target platform for a Kafka consumer?
Producers and consumers can run anywhere. The examples you see imply desktop execution because that code is much simpler than, say, code running within a Storm topology, and examples tend to be overly simple. The only reason for a desktop environment would be the presence of a UI for the application.
If the application is headless, then it does make a lot of sense to move the execution as close to the data (both Kafka and HDFS) as possible.
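To illustrate the point, a minimal consumer needs nothing machine-specific beyond network access to the brokers, so it runs equally well on a desktop or on a cluster node. A sketch using the kafka-python client (broker address and topic are placeholders):

from kafka import KafkaConsumer

# Placeholder broker and topic; the only requirement is that the brokers
# are reachable over the network from wherever this process runs.
consumer = KafkaConsumer("hdfs-file-events", bootstrap_servers="broker-host:9092")
for message in consumer:
    print(message.value)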
I am facing a strange case. I'd be glad if you could share your comments.
We have a solution running on Java 1.6.0_85, and sometimes the Java process hangs in production. The solution runs on a Linux server.
I investigated the GC logs; there is no Full GC, and pause times also look reasonable.
Then we tried to take a thread dump when the case happens; however, kill -3, ./jstack, and ./jstack -F do not work. No thread dump could be taken. What could be the reason for that? Any ideas on investigating the issue?
BR
-emre
After a while, we understood that the issue occurred due to pstack and gdb commands that were executed against the Java process for operational purposes. Somehow pstack and gdb suspend the Java process, and therefore we were not able to take a thread or heap dump.
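A quick way to confirm this kind of situation (the PID is a placeholder): a process suspended by an attached tool shows up in the stopped/traced state, and it can usually be resumed with SIGCONT once the tool has detached:

ps -o pid,stat,cmd -p <java_pid>   # a STAT of "T" means the process is stopped/traced
kill -CONT <java_pid>              # resume it; kill -3 / jstack should then work again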
We're using jConsole with the topthreads plugin to analyze such cases. The plugin uses JMX to check the thread runtimes and displays each thread's CPU usage since the start of the tracking procedure, as well as its current stack trace.
To connect to our servers from a local machine, we use tunnels in PuTTY, i.e. we first connect to the server via PuTTY and then point jConsole at a local port that is tunneled to the server.
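A rough plain-ssh equivalent of that setup (host, user, and port are placeholders; depending on the JVM version you may also need to pin the JMX/RMI ports so that both go through the tunnel):

# on the server, start the JVM with remote JMX enabled
java -Dcom.sun.management.jmxremote \
     -Dcom.sun.management.jmxremote.port=9010 \
     -Dcom.sun.management.jmxremote.authenticate=false \
     -Dcom.sun.management.jmxremote.ssl=false \
     -jar your-app.jar

# on the local machine, forward the port (PuTTY tunnels do the same thing)
ssh -L 9010:localhost:9010 user@server

# then point jConsole at the tunneled local port
jconsole localhost:9010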
I'm using an Oozie java action step to start a Java main class. This Java application does some calculations and then runs another map-reduce job based on that data.
Since the Oozie java action runs as a map-only job, it also shows up in the JobTracker.
One of our nodes was low on memory, so the TaskTracker killed the Oozie map-only job and restarted it on another node.
However, before killing it, the Java application had already spawned its own map-reduce job.
When the Oozie map-only job was restarted on the other node, it spawned yet another map-reduce job with the same data as the former one.
Looking in the JobTracker now shows duplicate map-reduce jobs running against the same data.
How do you prevent/manage/alter settings so that the Java program that Oozie initiates in the map-only task only gets run once, or is it necessary to constrain the Java application so that it can safely be run multiple times?
Any help would be appreciated,
Ken
There is not a lot you can do on the Oozie side if the one-mapper bootstrap jobs are failing because the hosts are out of memory. This host OOM scenario can be very problematic for every service in the cluster.
The preferred way to deal with this is to ensure that the host does not run out of memory at all by only allowing as many map and reduce slots on each TaskTracker node as there is memory available. You may also find that this allocation of resources to nodes is more efficient and tunable by using the YARN resource management framework instead of JobTracker-based MapReduce (MR1).
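As a sketch (classic MR1; the values are illustrative and should be sized to each node's actual RAM), the per-TaskTracker slot limits go into mapred-site.xml:

<!-- mapred-site.xml on each TaskTracker node; values are illustrative -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>
<!-- keep (slots x per-task heap from mapred.child.java.opts) below the node's physical memory -->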