Apache Spark: Garbage Collection Logs for Driver - java

My Spark driver runs out of memory after running for about 10 hours with the error Exception in thread "dispatcher-event-loop-17" java.lang.OutOfMemoryError: GC overhead limit exceeded. To debug further, I enabled G1GC and GC logging using spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp
but it looks like these options are not taking effect on the driver.
The job got stuck on the driver again after 10 hours, and I don't see any GC logs under stdout on the driver node under /var/log/hadoop-yarn/userlogs/[application-id]/[container-id]/stdout - so I'm not sure where else to look. According to the Spark GC tuning docs, it looks like these settings only apply to worker nodes, which matches what I see here: the workers do have GC logs in stdout after I used the same options under spark.executor.extraJavaOptions. Is there any way to enable/acquire GC logs from the driver? Under Spark UI -> Environment, I see these options listed under spark.driver.extraJavaOptions, which is why I assumed they would be working.
Environment:
The cluster is running on Google Dataproc and I use /usr/bin/spark-submit --master yarn --deploy-mode cluster ... from the master to submit jobs.
EDIT
Setting the same options for the driver on the spark-submit command line works, and I am able to see the GC logs on stdout for the driver. Setting the options programmatically via SparkConf, however, does not seem to take effect for some reason.
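For reference, here is a sketch of the form that worked for me (the main class and jar names below are placeholders, not my actual application):
# com.example.MyApp and my-app.jar are hypothetical; substitute your own
/usr/bin/spark-submit --master yarn --deploy-mode cluster \
  --conf spark.driver.extraJavaOptions="-XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp" \
  --class com.example.MyApp my-app.jar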

I believe spark.driver.extraJavaOptions is handled by SparkSubmit.scala, so it needs to be passed at invocation: in cluster mode, the driver JVM has already been launched by the time your application code builds its SparkConf, so JVM options set there come too late. To pass it at invocation with Dataproc, you can add it to the properties field (--properties in gcloud dataproc jobs submit spark).
Also instead of -Dlog4j.configuration=log4j.properties you can use this guide to configure detailed logging.
I could see GC driver logs with:
gcloud dataproc jobs submit spark --cluster CLUSTER_NAME --class org.apache.spark.examples.SparkPi --jars file:///usr/lib/spark/examples/jars/spark-examples.jar --driver-log-levels ROOT=DEBUG --properties=spark.driver.extraJavaOptions="-XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp" --
You probably don't need --driver-log-levels ROOT=DEBUG; you can instead copy your logging config in from log4j.properties. If you really want to use log4j.properties itself, you can probably pass it with --files log4j.properties.
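For example, a sketch of that variant (untested; assumes log4j.properties is in the directory you submit from):
gcloud dataproc jobs submit spark --cluster CLUSTER_NAME \
  --class org.apache.spark.examples.SparkPi \
  --jars file:///usr/lib/spark/examples/jars/spark-examples.jar \
  --files log4j.properties \
  --properties=spark.driver.extraJavaOptions="-Dlog4j.configuration=log4j.properties -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" --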

Related

Prometheus JMX Exporter java agent for Kafka won't run

I am attempting to set up Confluent Kafka v5.4 with the Prometheus JMX exporter, following this blog: https://alex.dzyoba.com/blog/jmx-exporter/. Kafka is set up and runs just fine, but the endpoint on port 8080 returns nothing. I've tried just about every way of invoking the javaagent in the systemd script, but nothing seems to work. My unit file:
[Unit]
Description=Confluent Kafka Broker
After=network.target network-online.target remote-fs.target zookeeper.service
[Service]
Type=forking
User=confluent
Group=confluent
Environment="KAFKA_JMX_OPTS=-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -javaagent=/opt/prometheus/jmx_prometheus_javaagent.jar=8080:/opt/prometheus/config.yaml"
Environment=LOG_DIR=/var/log/confluent
ExecStart=/opt/confluent/confluent-5.4.0/bin/kafka-server-start -daemon /opt/confluent/confluent-5.4.0/etc/kafka/server.properties
ExecStop=/opt/confluent/confluent-5.4.0/bin/kafka-server-stop
SuccessExitStatus=143
[Install]
WantedBy=multi-user.target
Any ideas on how to invoke that java agent in the systemd script so it works correctly? I have tried multiple ways of setting the OPTS variables, none of them working, including putting the -javaagent option in KAFKA_OPTS. The Kafka logs don't give any clues, and I'm not sure where else to look for this type of issue.
OS: CentOS 7, JMX exporter: 0.12.0, Java: OpenJDK 11
The logs I have found so far don't tell me anything about why it's not running. Maybe I'm looking at the wrong logs.
Edit:
conflue+ 11578 47.4 13.8 8679808 536764 ? Sl 11:59 0:35 java -Xmx1G -Xms1G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -Djava.awt.headless=true -Xlog:gc*:file=/var/log/confluent/kafkaServer-gc.log:time,tags:filecount=10,filesize=102400 -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -javaagent=/opt/prometheus/jmx_prometheus_javaagent.jar=8080:/opt/prometheus/config.yaml -Dkafka.logs.dir=/var/log/confluent -Dlog4j.configuration=file:/etc/kafka/log4j.properties -cp /opt/confluent/confluent-5.4.0/bin/../ce-broker-plugins/build/libs/*:/opt/confluent/confluent-5.4.0/bin/../ce-broker-plugins/build/dependant-libs/*:/opt/confluent/confluent-5.4.0/bin/../ce-auth-providers/build/libs/*:/opt/confluent/confluent-5.4.0/bin/../ce-auth-providers/build/dependant-libs/*:/opt/confluent/confluent-5.4.0/bin/../ce-rest-server/build/libs/*:/opt/confluent/confluent-5.4.0/bin/../ce-rest-server/build/dependant-libs/*:/opt/confluent/confluent-5.4.0/bin/../ce-audit/build/libs/*:/opt/confluent/confluent-5.4.0/bin/../ce-audit/build/dependant-libs/*:/opt/confluent/confluent-5.4.0/bin/../share/java/kafka/*:/opt/confluent/confluent-5.4.0/bin/../share/java/confluent-metadata-service/*:/opt/confluent/confluent-5.4.0/bin/../share/java/rest-utils/*:/opt/confluent/confluent-5.4.0/bin/../share/java/confluent-common/*:/opt/confluent/confluent-5.4.0/bin/../share/java/confluent-security/schema-validator/*:/opt/confluent/confluent-5.4.0/bin/../support-metrics-client/build/dependant-libs-2.12.10/*:/opt/confluent/confluent-5.4.0/bin/../support-metrics-client/build/libs/*:/usr/share/java/support-metrics-client/*:/opt/confluent/confluent-5.4.0/bin/../support-metrics-fullcollector/build/dependant-libs-2.12.10/*:/opt/confluent/confluent-5.4.0/bin/../support-metrics-fullcollector/build/libs/*:/usr/share/java/support-metrics-fullcollector/* io.confluent.support.metrics.SupportedKafka /opt/confluent/confluent-5.4.0/etc/kafka/server.properties
Make sure you run systemctl daemon-reload each time you edit a service file.
Also, I would suggest not using LOG_DIR, removing the RollingFileAppender from the log4j.properties, and letting journalctl handle all your logging from systemd.
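For example (the unit name kafka is an assumption; substitute whatever your service file is called):
sudo systemctl daemon-reload
sudo systemctl restart kafka
journalctl -u kafka -f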
Figured this out
ExecStart=/opt/confluent/confluent-5.4.0/bin/kafka-server-start -daemon /opt/confluent/confluent-5.4.0/etc/kafka/server.properties
was not correct
ExecStart=/opt/confluent/confluent-5.4.0/bin/kafka-server-start -daemon /etc/kafka/server.properties
was correct, even though the two paths are symlinked.
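A quick way to confirm the two paths resolve to the same file (assuming GNU readlink):
readlink -f /opt/confluent/confluent-5.4.0/etc/kafka/server.properties
readlink -f /etc/kafka/server.properties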

SAP Hybris - Tomcat ignoring memory settings

I'm running a local installation of SAP Hybris 1811. I'm trying to increase its memory size, since I've been getting OutOfMemoryError exceptions during Solr index jobs.
However, I'm not able to reliably increase the memory with any method I've tried. Sometimes, after struggling a lot (building the app multiple times, restarting, etc.), Hybris sees and uses the configured memory (I check this in the backoffice), but most of the time it defaults to 2 GB and runs out of memory quickly.
What I've tried:
set JAVA_OPTS=-Xms10G -Xmx10G; in catalina.bat
tomcat.javaoptions=-Xmx10G -Xms10G in local.properties
What is the correct way to reliably set a higher memory for local Hybris server?
Please try the following in your local.properties:
tomcat.generaloptions=-Xmx10G -ea -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dorg.tanukisoftware.wrapper.WrapperManager.mbean=true -Djava.endorsed.dirs="%CATALINA_HOME%/lib/endorsed" -Dcatalina.base=%CATALINA_BASE% -Dcatalina.home=%CATALINA_HOME% -Dfile.encoding=UTF-8 -Djava.util.logging.config.file=jdk_logging.properties -Djava.io.tmpdir="${HYBRIS_TEMP_DIR}"
Please make sure to execute ant after making this change. As a general rule, whenever you make any Tomcat-related change, you need to rerun ant.
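For example, a typical sequence from the hybris/bin/platform directory (a sketch; adjust paths to your installation):
# set up the ant environment shipped with the platform, rebuild, then start
. ./setantenv.sh
ant clean all
./hybrisserver.sh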
For a production environment, you can set these properties as follows:
java.mem=10G
tomcat.generaloptions=-Xmx${java.mem} -Xms${java.mem} -Xss256K -XX:+DisableExplicitGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+CMSClassUnloadingEnabled -XX:+UseCMSInitiatingOccupancyOnly -XX:+CMSParallelRemarkEnabled -XX:+ParallelRefProcEnabled -XX:+CMSScavengeBeforeRemark -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -Xloggc:"${HYBRIS_LOG_DIR}/tomcat/java_gc.log" -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dorg.tanukisoftware.wrapper.WrapperManager.mbean=true -Djava.endorsed.dirs=../lib/endorsed -Dcatalina.base=%CATALINA_BASE% -Dcatalina.home=%CATALINA_HOME% -Dfile.encoding=UTF-8 -Djava.util.logging.config.file=jdk_logging.properties -Djava.io.tmpdir="${HYBRIS_TEMP_DIR}" -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000
After a bit of digging, I found that the memory limit was ignored only when I tried to run the Hybris server with the debug parameter. The properties I had set using tomcat.javaoptions were not present in the wrapper-debug.conf file, which is used when starting the server in debug mode.
Long story short:
tomcat.javaoptions only gets applied to the default wrapper.conf file and is ignored when launching the server with any parameter such as debug.
For the changes to be applied to wrapper-debug.conf, I needed to use the tomcat.debugjavaoptions property.
In the end, my config file with a working memory limit looks like this:
...
tomcat.javaoptions=-Xmx10G -Xms5G
tomcat.debugjavaoptions=-Xdebug -Xnoagent -Xrunjdwp:transport=dt_socket,server=y,address=8000,suspend=n -Xmx10G -Xms5G

GC Logs Overwritten when JVM Crashes

I'm tuning our product for G1GC, and as part of that testing, I'm experiencing regular segfaults on my Spark Workers, which of course causes the JVM to crash. When this happens, the Spark Worker/Executor JVM automagically restarts itself, which then overwrites the GC logs that were written for the previous Executor JVM.
To be honest, I'm not quite sure of the mechanism by which the Executor JVM restarts itself, but I launch the Spark Driver service via init.d, which in turn calls off to a bash script. I do use a timestamp in that script that gets appended to the GC log filename:
today=$(date +%Y%m%dT%H%M%S%3N)
SPARK_HEAP_DUMP="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=${SPARK_LOG_HOME}/heapdump_$$_${today}.hprof"
SPARK_GC_LOGS="-Xloggc:${SPARK_LOG_HOME}/gc_${today}.log -XX:LogFile=${SPARK_LOG_HOME}/safepoint_${today}.log"
GC_OPTS="-XX:+UnlockDiagnosticVMOptions -XX:+LogVMOutput -XX:+PrintFlagsFinal -XX:+PrintJNIGCStalls -XX:+PrintTLAB -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=15 -XX:GCLogFileSize=48M -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime -XX:+PrintAdaptiveSizePolicy -XX:+PrintHeapAtGC -XX:+PrintGCCause -XX:+PrintReferenceGC -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1"
I think the problem is that this script sends these options along to the Spark Driver, which then passes them off to the Spark Executors (via the -Dspark.executor.extraJavaOptions argument), which are all separate servers. When an Executor JVM crashes, it simply restarts with the command it was originally started with, which means the timestamp portion of the GC log filename is static:
SPARK_STANDALONE_OPTS=`property ${SPARK_APP_CONFIG}/spark.properties "spark-standalone.extra.args"`
SPARK_STANDALONE_OPTS="$SPARK_STANDALONE_OPTS $GC_OPTS $SPARK_GC_LOGS $SPARK_HEAP_DUMP"
exec java ${SPARK_APP_HEAP_DUMP} ${GC_OPTS} ${SPARK_APP_GC_LOGS} \
${DRIVER_JAVA_OPTIONS} \
-Dspark.executor.memory=${EXECUTOR_MEMORY} \
-Dspark.executor.extraJavaOptions="${SPARK_STANDALONE_OPTS}" \
-classpath ${CLASSPATH} \
com.company.spark.Main >> ${SPARK_APP_LOGDIR}/${SPARK_APP_LOGFILE} 2>&1 &
This is making it difficult for me to debug the cause of the segfaults, since I'm losing the activity and state of the Workers that led up to the JVM crash. Any ideas for how I can handle this situation and keep the GC logs on the Workers, even after a JVM crash/segfault?
If you are using Java 8 or above, you may be able to get away with it by adding %p to the log file name, which introduces the PID and is more or less unique per crash.
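For example, a sketch against the SPARK_GC_LOGS line from the question (on Java 8 HotSpot, %p expands to the PID and %t to the JVM start timestamp):
# %p/%t are expanded per JVM launch, so a restarted Executor gets a fresh file
SPARK_GC_LOGS="-Xloggc:${SPARK_LOG_HOME}/gc_%p_%t.log -XX:LogFile=${SPARK_LOG_HOME}/safepoint_${today}.log"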

Why does Docker kill jvm?

I use DCOS with Spring boot applications inside Docker containers. I noticed that sometimes containers are killed, but there are no errors in the container logs, only:
Killed
W1114 19:27:59.663599 119266 logging.cpp:91] RAW: Received signal SIGTERM
from process 6484 of user 0; exiting
The health check is enabled only for the SQL connection and disk space. The disk is OK on all nodes, and in case of SQL problems an error should appear in the logs. Another possible reason is memory, but that also looks fine.
From marathon.production.json:
"cpus": 0.1,
"mem": 1024,
"disk": 0
And docker-entrypoint.sh:
java -Xmx1024m -server -XX:MaxJavaStackTraceDepth=10 -XX:+UseNUMA
-XX:+UseCondCardMark -XX:-UseBiasedLocking -Xms1024M -Xss1M
-XX:MaxPermSize=128m -XX:+UseParallelGC -jar app.jar
What could be the reason for container killing and are there any logs on DCOS regarding it?
Solved with java -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap
Or just use openjdk:11.0-jre-slim
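Relatedly, on JDK 10+ (and 8u191+) the JVM reads the container's cgroup memory limit by default, so a sketch of an entrypoint line that sizes the heap as a fraction of the container limit (75% here is just an example value):
# no hardcoded -Xmx; the heap is derived from the container's memory limit
java -XX:MaxRAMPercentage=75.0 -server -jar app.jar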

Set a JVM to dump heap when OutOfMemoryError is thrown

I am trying to configure the JVM of the server I am working on so that it dumps the heap to a file when an OOME occurs.
I know I have to add this option -XX:-HeapDumpOnOutOfMemoryError to the JVM arguments somewhere, but I can't figure out how to do this.
FYI, I can access the server through PuTTY, so I am looking for a command line way of doing this.
The JVM I am using is the OpenJDK 64-Bit Server VM.
I don't know if that's relevant, but the application is a war file.
PS: ps -ef | grep java
tomcat 23837 1 0 Mar25 ? 00:03:46 /usr/lib/jvm/jre/bin/java -classpath :/usr/share/tomcat6/bin/bootstrap.jar:/usr/share/tomcat6/bin/tomcat-juli.jar:/usr/share/java/commons-daemon.jar -Dcatalina.base=/usr/share/tomcat6 -Dcatalina.home=/usr/share/tomcat6 -Djava.endorsed.dirs= -Djava.io.tmpdir=/var/cache/tomcat6/temp -Djava.util.logging.config.file=/usr/share/tomcat6/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager org.apache.catalina.startup.Bootstrap start
EDIT:
I found something (correct me if I'm wrong): since I am using Tomcat, I decided to add these lines to the tomcat.conf file:
JAVA_OPTS=-XX:-HeapDumpOnOutOfMemoryError
JAVA_OPTS=-XX:HeapDumpPath=/root/dump
JAVA_OPTS=-Xmx20m
What do you think?
This option is from the HotSpot VM options. I would think it'd be the same in the OpenJDK VM, but let me know if it's not.
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=<path to dump file>
You can also manually generate a heap dump using jmap if you know the process id:
jmap -J-d64 -dump:format=b,file=<path to dump file> <jvm pid>
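For example, with the Tomcat pid (23837) from the ps output above:
# writes a binary heap dump of the running Tomcat JVM to /tmp
jmap -dump:format=b,file=/tmp/heap.hprof 23837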
You can use jhat to analyze the dump.
jhat <path to dump file>
As mentioned by @CoolBeans, the JVM options to use are:
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=<path to dump file>
To set this in Tomcat, create a file named setenv.sh (setenv.bat for Windows) under the TOMCAT_HOME/bin directory and add the following line:
export CATALINA_OPTS="$CATALINA_OPTS -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=<path to dump file>"
CATALINA_OPTS is preferred over JAVA_OPTS for this kind of option, since it does not need to be applied to the shutdown process.
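After restarting Tomcat, you can check that the flag took effect (assuming a JDK that ships jcmd, i.e. 7+; use the pid reported by ps):
jcmd <jvm pid> VM.flags | grep HeapDump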
