I'm having repeated crashes of the HDFS DataNodes in my Cloudera cluster due to an OutOfMemoryError:
java.lang.OutOfMemoryError: Java heap space
Dumping heap to /tmp/hdfs_hdfs-DATANODE-e26e098f77ad7085a5dbf0d369107220_pid18551.hprof ...
Heap dump file created [2487730300 bytes in 16.574 secs]
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="/usr/lib64/cmf/service/common/killparent.sh"
# Executing /bin/sh -c "/usr/lib64/cmf/service/common/killparent.sh"...
18551 TS 19 ? 00:25:37 java
Wed Aug 7 11:44:54 UTC 2019
JAVA_HOME=/usr/lib/jvm/java-openjdk
using /usr/lib/jvm/java-openjdk as JAVA_HOME
using 5 as CDH_VERSION
using /run/cloudera-scm-agent/process/3087-hdfs-DATANODE as CONF_DIR
using as SECURE_USER
using as SECURE_GROUP
CONF_DIR=/run/cloudera-scm-agent/process/3087-hdfs-DATANODE
CMF_CONF_DIR=/etc/cloudera-scm-agent
4194304
When analyzing the heap dump, the biggest suspects appear to be millions of instances of ScanInfo queued in the ExecutorService of the class org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.
When I inspect the content of each ScanInfo runnable object, I don't see anything weird:
Apart from this and a somewhat high block count in HDFS, I don't have any other information beyond the different DataNodes crashing randomly in my cluster.
Any idea why these objects keep queueing up in the DirectoryScanner thread pool?
You can try the command below once.
$ hadoop dfsadmin -finalizeUpgrade
The -finalizeUpgrade command removes the previous version of the NameNode’s and DataNodes’ storage directories.
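If finalizing the upgrade doesn't help, millions of queued ScanInfo objects usually point at the DataNode heap being too small for the block count, or at the DirectoryScanner doing more work than the node can absorb. As a rough sketch (the property names below are the standard hdfs-site.xml settings, but the right values depend on your cluster, and the Cloudera Manager setting name may differ slightly), you could check what heap the DataNode actually got and relax the scanner:
$ ps -ef | grep '[D]ataNode' | grep -o -- '-Xmx[^ ]*'    # heap the DataNode JVM was started with
# In Cloudera Manager, raising the DataNode "Java Heap Size" is the usual fix for a high block count.
# The scanner itself can be relaxed in hdfs-site.xml, e.g.
#   dfs.datanode.directoryscan.interval   (seconds between scans, default 21600)
#   dfs.datanode.directoryscan.threads    (default 1)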
This problem occurred when I used chipyard to compile Boom. Is this because of insufficient memory? I am running on a 1 core 2G cloud server.
/bin/bash: line 1: 9986 Killed java -Xmx8G -Xss8M
-XX:MaxPermSize=256M -jar /home/cuiyujie/workspace/Boom/chipyard/generators/rocket-chip/sbt-launch.jar
-Dsbt.sourcemode=true -Dsbt.workspace=/home/cuiyujie/workspace/Boom/chipyard/tools ";project utilities; runMain utilities.GenerateSimFiles -td
/home/cuiyujie/workspace/Boom/chipyard/sims/verilator/generated-src/chipyard.TestHarness.LargeBoomConfig
-sim verilator"
/home/cuiyujie/workspace/Boom/chipyard/common.mk:86: recipe for target
'/home/cuiyujie/workspace/Boom/chipyard/sims/verilator/generated-src/chipyard.TestHarness.LargeBoomConfig/sim_files.f'
failed
make: *** [/home/cuiyujie/workspace/Boom/chipyard/sims/verilator/generated-src/chipyard.TestHarness.LargeBoomConfig/sim_files.f]
Error 137
When I adjusted the memory to 4G, this appeared.
Done elaborating. OpenJDK 64-Bit Server VM warning: INFO:
os::commit_memory(0x00000006dc3b7000, 97148928, 0) failed;
error='Cannot allocate memory' (errno=12)
There is insufficient memory for the Java Runtime Environment to continue.
Native memory allocation (mmap) failed to map 97148928 bytes for committing reserved memory.
An error report file with more information is saved as:
/home/cuiyujie/workspace/Boom/chipyard/hs_err_pid2876.log
/home/cuiyujie/workspace/Boom/chipyard/common.mk:97: recipe for target 'generator_temp' failed
make: *** [generator_temp] Error 1
Should I move to 8G of memory, or is there a command I can use to increase the amount of memory the process is allowed to use?
When I adjusted the memory to 16G, this appeared.
/bin/bash: line 1: 2642 Killed java -Xmx8G -Xss8M
-XX:MaxPermSize=256M -jar /home/cuiyujie/workspace/Boom/chipyard/generators/rocket-chip/sbt-launch.jar
-Dsbt.sourcemode=true -Dsbt.workspace=/home/cuiyujie/workspace/Boom/chipyard/tools ";project tapeout; runMain barstools.tapeout.transforms.GenerateTopAndHarness -o
/home/cuiyujie/workspace/Boom/chipyard/sims/verilator/generated-src/chipyard.TestHarness.LargeBoomConfig/chipyard.TestHarness.LargeBoomConfig.top.v
-tho /home/cuiyujie/workspace/Boom/chipyard/sims/verilator/generated-src/chipyard.TestHarness.LargeBoomConfig/chipyard.TestHarness.LargeBoomConfig.harness.v
-i /home/cuiyujie/workspace/Boom/chipyard/sims/verilator/generated-src/chipyard.TestHarness.LargeBoomConfig/chipyard.TestHarness.LargeBoomConfig.fir
--syn-top ChipTop --harness-top TestHarness -faf /home/cuiyujie/workspace/Boom/chipyard/sims/verilator/generated-src/chipyard.TestHarness.LargeBoomConfig/chipyard.TestHarness.LargeBoomConfig.anno.json
-tsaof /home/cuiyujie/workspace/Boom/chipyard/sims/verilator/generated-src/chipyard.TestHarness.LargeBoomConfig/chipyard.TestHarness.LargeBoomConfig.top.anno.json
-tdf /home/cuiyujie/workspace/Boom/chipyard/sims/verilator/generated-src/chipyard.TestHarness.LargeBoomConfig/firrtl_black_box_resource_files.top.f
-tsf /home/cuiyujie/workspace/Boom/chipyard/sims/verilator/generated-src/chipyard.TestHarness.LargeBoomConfig/chipyard.TestHarness.LargeBoomConfig.top.fir
-thaof /home/cuiyujie/workspace/Boom/chipyard/sims/verilator/generated-src/chipyard.TestHarness.LargeBoomConfig/chipyard.TestHarness.LargeBoomConfig.harness.anno.json
-hdf /home/cuiyujie/workspace/Boom/chipyard/sims/verilator/generated-src/chipyard.TestHarness.LargeBoomConfig/firrtl_black_box_resource_files.harness.f
-thf /home/cuiyujie/workspace/Boom/chipyard/sims/verilator/generated-src/chipyard.TestHarness.LargeBoomConfig/chipyard.TestHarness.LargeBoomConfig.harness.fir
--infer-rw --repl-seq-mem -c:TestHarness:-o:/home/cuiyujie/workspace/Boom/chipyard/sims/verilator/generated-src/chipyard.TestHarness.LargeBoomConfig/chipyard.TestHarness.LargeBoomConfig.top.mems.conf
-thconf /home/cuiyujie/workspace/Boom/chipyard/sims/verilator/generated-src/chipyard.TestHarness.LargeBoomConfig/chipyard.TestHarness.LargeBoomConfig.harness.mems.conf
-td /home/cuiyujie/workspace/Boom/chipyard/sims/verilator/generated-src/chipyard.TestHarness.LargeBoomConfig
-ll error"
/home/cuiyujie/workspace/Boom/chipyard/common.mk:123: recipe for target 'firrtl_temp' failed
make: *** [firrtl_temp] Error 137
Short answer: yes.
Error 137 is thrown when your host runs out of memory.
"I am running on a 1 core 2G cloud server"
When you try to assign 8 GB to the JVM, the OOM killer says "no way" and kicks in, sending a SIGKILL. The OOM killer is a kernel mechanism that jumps in to save the system when its available memory gets too low, by killing the most resource-abusive processes.
In this case, the abusive process (very abusive, indeed) is your Java program, which is trying to allocate more than(*) 4 times the maximum available memory on your host.
Exit Codes With Special Meanings:
error code 137 --> kill -9 (SIGKILL)
You should either:
Assign at most ~1.2-1.5 GB to your process (and keep your fingers crossed).
Move to a somewhat more powerful/bigger host if you really do need that much memory for your process.
Check if you really require 8GB for that process.
Also note that the given params are easy to misread: -Xmx8G sets a maximum heap of 8 GB, while -Xss8M sets the per-thread stack size to 8 MB; it does not set a minimum heap. The minimum heap is controlled with -Xms, and it is usually kept closer to the maximum, e.g. -Xmx8G -Xms4G.
(*) The free memory won't be a full 2 GB either, but something between 1.6 and 1.8 GB.
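As a minimal sketch of sizing the heap to such a host (the 1500m value is just an assumption for a ~2 GB machine; in chipyard the -Xmx8G flag comes from the make files, common.mk in this log, so that is where it would be trimmed):
$ free -m    # check how much memory is actually available before picking -Xmx
# then have the sbt launch use something the host can actually back, e.g.
#   java -Xmx1500m -Xss8M -jar generators/rocket-chip/sbt-launch.jar ...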
I run a JMeter test for ActiveMQ using a Linux build agent and I get java.lang.OutOfMemoryError: Java heap space. Detailed log:
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124) ~
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
at java.lang.StringBuilder.append(StringBuilder.java:136)
at org.apache.jmeter.protocol.jms.sampler.SubscriberSampler.extractContent(SubscriberSampler.java:282) ~[ApacheJMeter_jms.jar:5.3]
at org.apache.jmeter.protocol.jms.sampler.SubscriberSampler.sample(SubscriberSampler.java:186) ~[ApacheJMeter_jms.jar:5.3]
at org.apache.jmeter.protocol.jms.sampler.BaseJMSSampler.sample(BaseJMSSampler.java:98) ~
at org.apache.jmeter.threads.JMeterThread.doSampling(JMeterThread.java:635) ~[ApacheJMeter_core.jar:5.4]
at org.apache.jmeter.threads.JMeterThread.executeSamplePackage(JMeterThread.java:558) ~[ApacheJMeter_core.jar:5.4]
at org.apache.jmeter.threads.JMeterThread.processSampler(JMeterThread.java:489) ~[ApacheJMeter_core.jar:5.4]
at org.apache.jmeter.threads.JMeterThread.run(JMeterThread.java:256) ~[ApacheJMeter_core.jar:5.4]
I've already allocated the maximum heap memory (-Xmx8g), but it doesn't help. Yet the same test with the same configuration passes on a Windows build agent without an out-of-memory error.
How can this be handled? Maybe some configuration needs to be done on the Linux machine?
Are you sure your heap setting gets applied on Linux?
You can check it by creating a simple test plan with a single JSR223 Sampler using the following code:
println('Max heap size: ' + Runtime.getRuntime().maxMemory() / 1024 / 1024 + ' megabytes')
and when you run JMeter in command-line non-GUI mode you will see the current maximum JVM heap size printed:
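For example (the test plan name and result file below are placeholders), from JMeter's bin directory:
$ ./jmeter -n -t heap-check.jmx -l results.jtl
# the JSR223 sampler's println goes to the console, so with -Xmx8g really applied you
# would expect something close to: Max heap size: 8192 megabytes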
In order to make the change permanent, amend the HEAP line in the JMeter startup script according to your requirements.
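In JMeter 5.x that line is the HEAP definition in bin/jmeter (bin/jmeter.bat on Windows); a sketch of raising it, with illustrative values:
HEAP="-Xms1g -Xmx8g -XX:MaxMetaspaceSize=256m"
# in recent versions the same variable can also be overridden per run without editing the script, e.g.
#   HEAP="-Xms1g -Xmx8g" ./jmeter -n -t heap-check.jmx -l results.jtl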
The issue was resolved after updating Java to version 11 on the Linux machines.
I want to build Android 10 from source code and followed the official instructions. In order to get started, I want to simply build it for the emulator. However, the build keeps failing and I get the following error:
[11177/12864] rm -rf "out/soong/.intermediates/frameworks/base/api-stubs-docs/android_common/out" "out/soong/.intermediates/frameworks/base/api-stubs-docs/android_common/srcjars" "out/soong/.intermediates/frameworks/base/api-stubs-docs/android_common/stubsDir" && mkdir -p "out/soong/.intermediates/frameworks/base/api-stubs-docs/android_common/out" "out/soong/.intermediates/frameworks/base/api-stubs-docs/android_common/srcjars" "out/soong/.intermediates/frameworks/base/api-stubs-docs/android_common/stubsDir" && out/soong/host/linux-x86/bin/zipsync -d out/soong/.intermediates/frameworks/base/api-stubs-docs/android_common/srcjars -l out/soong/.intermediates/frameworks/base/api-stubs-docs/android_common/srcjars/list -f "*.java" out/soong/.intermediates/frameworks/base/framework-javastream-protos/gen/frameworks/base/core/proto/android/privacy.srcjar [...]
FAILED: out/soong/.intermediates/frameworks/base/api-stubs-docs/android_common/api-stubs-docs-stubs.srcjar out/soong/.intermediates/frameworks/base/api-stubs-docs/android_common/api-stubs-docs_api.txt out/soong/.intermediates/frameworks/base/api-stubs-docs/android_common/api-stubs-docs_removed.txt out/soong/.intermediates/frameworks/base/api-stubs-docs/android_common/private.txt out/soong/.intermediates/frameworks/base/api-stubs-docs/android_common/api-stubs-docs_annotations.zip out/soong/.intermediates/frameworks/base/api-stubs-docs/android_common/api-versions.xml out/soong/.intermediates/frameworks/base/api-stubs-docs/android_common/api-stubs-docs_api.xml out/soong/.intermediates/frameworks/base/api-stubs-docs/android_common/api-stubs-docs_last_released_api.xml
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.base/java.nio.HeapCharBuffer.<init>(HeapCharBuffer.java:68)
at java.base/java.nio.CharBuffer.allocate(CharBuffer.java:341)
at java.base/java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:794)
at java.base/java.nio.charset.Charset.decode(Charset.java:818)
at com.intellij.openapi.fileEditor.impl.LoadTextUtil.convertBytes(LoadTextUtil.java:640)
at com.intellij.openapi.fileEditor.impl.LoadTextUtil.getTextByBinaryPresentation(LoadTextUtil.java:555)
at com.intellij.openapi.fileEditor.impl.LoadTextUtil.getTextByBinaryPresentation(LoadTextUtil.java:545)
at com.intellij.openapi.fileEditor.impl.LoadTextUtil.loadText(LoadTextUtil.java:531)
at com.intellij.openapi.fileEditor.impl.LoadTextUtil.loadText(LoadTextUtil.java:503)
at com.intellij.mock.MockFileDocumentManagerImpl.getDocument(MockFileDocumentManagerImpl.java:53)
at com.intellij.psi.AbstractFileViewProvider.getDocument(AbstractFileViewProvider.java:194)
at com.intellij.psi.AbstractFileViewProvider$VirtualFileContent.getText(AbstractFileViewProvider.java:484)
at com.intellij.psi.AbstractFileViewProvider.getContents(AbstractFileViewProvider.java:174)
at com.intellij.psi.impl.source.PsiFileImpl.loadTreeElement(PsiFileImpl.java:204)
at com.intellij.psi.impl.source.PsiFileImpl.calcTreeElement(PsiFileImpl.java:709)
at com.intellij.psi.impl.source.PsiJavaFileBaseImpl.getClasses(PsiJavaFileBaseImpl.java:66)
at org.jetbrains.kotlin.cli.jvm.compiler.KotlinCliJavaFileManagerImpl$Companion.findClassInPsiFile(KotlinCliJavaFileManagerImpl.kt:250)
at org.jetbrains.kotlin.cli.jvm.compiler.KotlinCliJavaFileManagerImpl$Companion.access$findClassInPsiFile(KotlinCliJavaFileManagerImpl.kt:246)
at org.jetbrains.kotlin.cli.jvm.compiler.KotlinCliJavaFileManagerImpl.findPsiClassInVirtualFile(KotlinCliJavaFileManagerImpl.kt:216)
at org.jetbrains.kotlin.cli.jvm.compiler.KotlinCliJavaFileManagerImpl.access$findPsiClassInVirtualFile(KotlinCliJavaFileManagerImpl.kt:47)
at org.jetbrains.kotlin.cli.jvm.compiler.KotlinCliJavaFileManagerImpl$findClasses$1$$special$$inlined$forEachClassId$lambda$1.invoke(KotlinCliJavaFileManagerImpl.kt:155)
at org.jetbrains.kotlin.cli.jvm.compiler.KotlinCliJavaFileManagerImpl$findClasses$1$$special$$inlined$forEachClassId$lambda$1.invoke(KotlinCliJavaFileManagerImpl.kt:47)
at org.jetbrains.kotlin.cli.jvm.index.JvmDependenciesIndexImpl$traverseDirectoriesInPackage$1.invoke(JvmDependenciesIndexImpl.kt:77)
at org.jetbrains.kotlin.cli.jvm.index.JvmDependenciesIndexImpl$traverseDirectoriesInPackage$1.invoke(JvmDependenciesIndexImpl.kt:32)
at org.jetbrains.kotlin.cli.jvm.index.JvmDependenciesIndexImpl.search(JvmDependenciesIndexImpl.kt:131)
at org.jetbrains.kotlin.cli.jvm.index.JvmDependenciesIndexImpl.traverseDirectoriesInPackage(JvmDependenciesIndexImpl.kt:76)
at org.jetbrains.kotlin.cli.jvm.index.JvmDependenciesIndex$DefaultImpls.traverseDirectoriesInPackage$default(JvmDependenciesIndex.kt:35)
at org.jetbrains.kotlin.cli.jvm.compiler.KotlinCliJavaFileManagerImpl$findClasses$1.invoke(KotlinCliJavaFileManagerImpl.kt:151)
at org.jetbrains.kotlin.cli.jvm.compiler.KotlinCliJavaFileManagerImpl$findClasses$1.invoke(KotlinCliJavaFileManagerImpl.kt:47)
at org.jetbrains.kotlin.util.PerformanceCounter.time(PerformanceCounter.kt:91)
at org.jetbrains.kotlin.cli.jvm.compiler.KotlinCliJavaFileManagerImpl.findClasses(KotlinCliJavaFileManagerImpl.kt:147)
at com.intellij.psi.impl.PsiElementFinderImpl.findClasses(PsiElementFinderImpl.java:45)
When searching for a solution, I only find problems related to jack-server. As I understand it, jack is not used in recent builds anymore. Also, I tried to reduce the number of build threads using m -j1, without success.
Here is some info about my setup: 4 core CPU, 8GB RAM
============================================
PLATFORM_VERSION_CODENAME=REL
PLATFORM_VERSION=10
TARGET_PRODUCT=aosp_arm
TARGET_BUILD_VARIANT=eng
TARGET_BUILD_TYPE=release
TARGET_ARCH=arm
TARGET_ARCH_VARIANT=armv7-a-neon
TARGET_CPU_VARIANT=generic
HOST_ARCH=x86_64
HOST_2ND_ARCH=x86
HOST_OS=linux
HOST_OS_EXTRA=Linux-4.4.0-142-generic-x86_64-Ubuntu-14.04.6-LTS
HOST_CROSS_OS=windows
HOST_CROSS_ARCH=x86
HOST_CROSS_2ND_ARCH=x86_64
HOST_BUILD_TYPE=release
BUILD_ID=QQ1D.200205.002
OUT_DIR=out
============================================
After some research I found a solution. During the build, /prebuilts/jdk/jdk9/linux-x86/bin/java is being called without the -Xmx option. When typing
$ /prebuilts/jdk/jdk9/linux-x86/bin/java -XX:+PrintFlagsFinal -version | grep -iE 'HeapSize'
into the command line, I found out that a maximum of only about 2 GB of heap was allowed.
Solution
Because I don't know in which file java is being called, I just set the heap size to 4 GB using an environment variable:
$ export _JAVA_OPTIONS="-Xmx4g"
Java will automatically pick this option up.
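A quick way to confirm it is being picked up is that the JVM announces it on startup:
$ java -version
Picked up _JAVA_OPTIONS: -Xmx4g
(followed by the usual version banner)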
(Optional) I also increased the swap size from 8 GB to 20 GB.
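For reference, a minimal sketch of growing swap with a swap file (the size and path are just examples):
$ sudo fallocate -l 12G /swapfile_extra
$ sudo chmod 600 /swapfile_extra
$ sudo mkswap /swapfile_extra
$ sudo swapon /swapfile_extra
$ free -h    # verify the new swap total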
We have been using EC2 Windows spot instances as Jenkins slaves; however, when increasing the size from M5d2XLarge to M5d4XLarge (in an attempt to overcome memory issues in some builds), the new agents fail to be created.
I attempted to set the Java options -Xms and -Xmx as part of the init script; however, the error is unchanged. I've read that WinRM may be the issue, but I have a hard time connecting that with the node size increase.
These are the expected log results (from M5d2XLarge):
WinRM service responded. Waiting for WinRM service to stabilize on EC2 (Dev-US-East-1) - Windows Server 2019 Docker (i-xxxxxxxxxx)
WinRM should now be ok on EC2 (Dev-US-East-1) - Windows Server 2019 Docker (i-xxxxxxxxxx)
Connected with WinRM.
Creating tmp directory if it does not exist
Executing init script
C:\Users\Administrator>if not exist C:\Jenkins mkdir C:\Jenkins
init script ran successfully
remoting.jar sent remotely. Bootstrapping it
Launching via WinRM:java -jar C:\Windows\Temp\remoting.jar -workDir C:\Jenkins
Both error and output logs will be printed to C:\Jenkins\remoting
However, right after "Launching via WinRM", it fails due to a garbage collector issue (from M5d4XLarge):
init script ran successfully
remoting.jar sent remotely. Bootstrapping it
Launching via WinRM:java -jar C:\Windows\Temp\remoting.jar -workDir C:\Jenkins
Error occurred during initialization of VM
Unable to allocate 512384KB bitmaps for parallel garbage collection for the requested 16396288KB heap.
Ouch:
java.io.EOFException: unexpected stream termination
Try updating the script that runs remoting.jar. Modify the line that runs the jar to:
java -Xmx1024m -jar C:\Windows\Temp\remoting.jar -workDir C:\Jenkins
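The 16396288KB in the error is consistent with the JVM default of roughly a quarter of physical RAM (an M5d4XLarge has 64 GB), so without an explicit -Xmx the remoting process asks for a ~16 GB heap plus GC bitmaps it cannot get. If you want to confirm the default on the agent, a quick check run on the Windows instance should show it:
java -XX:+PrintFlagsFinal -version | findstr MaxHeapSize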
I started up the cluster last night with few issues. After 45 minutes it went to do a log roll, and then the cluster started to throw JVM wait errors. Since then the cluster won't restart; when starting, the ResourceManager doesn't come up.
The NameNode and DataNodes are also offline.
I had two installs of Hadoop 2.8 on the server; I removed the first one and reinstalled the second, making the adjustments to the config files to get it restarted.
The error logs from when it crashed appear to show a Java stack overflow and out-of-range errors, with a growing saved memory size in the logs. My expectation is that I have misconfigured memory somewhere. I went to delete and reformat the NameNode and I get the same segmentation error. At this point I'm not sure what to do.
Ubuntu-Mate 16.04, Hadoop 2.8, Spark for Hadoop 2.7, NFS, Scala, ...
When I go to start YARN now, I get the following error message:
hduser@nodeserver:/opt/hadoop-2.8.0/sbin$ sudo ./start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /opt/hadoop-2.8.0/logs/yarn-root-resourcemanager-nodeserver.out
/opt/hadoop-2.8.0/sbin/yarn-daemon.sh: line 103: 5337 Segmentation fault nohup nice -n $YARN_NICENESS "$HADOOP_YARN_HOME"/bin/yarn --config $YARN_CONF_DIR $command "$@" > "$log" 2>&1 < /dev/null
node1: starting nodemanager, logging to /opt/hadoop-2.8.0/logs/yarn-root-nodemanager-node1.out
node3: starting nodemanager, logging to /opt/hadoop-2.8.0/logs/yarn-root-nodemanager-node3.out
node2: starting nodemanager, logging to /opt/hadoop-2.8.0/logs/yarn-root-nodemanager-node2.out
starting proxyserver, logging to /opt/hadoop-2.8.0/logs/yarn-root-proxyserver-nodeserver.out
/opt/hadoop-2.8.0/sbin/yarn-daemon.sh: line 103: 5424 Segmentation fault nohup nice -n $YARN_NICENESS "$HADOOP_YARN_HOME"/bin/yarn --config $YARN_CONF_DIR $command "$@" > "$log" 2>&1 < /dev/null
hduser@nodeserver:/opt/hadoop-2.8.0/sbin$
Editing to add more error output for help:
hduser@nodeserver:/opt/hadoop-2.8.0/sbin$ jps
Segmentation fault
and
hduser@nodeserver:/opt/hadoop-2.8.0/bin$ sudo ./hdfs namenode -format
Segmentation fault
Here are logs which appear to show the Java stack went crazy and expanded from 512K to 5056K. So, how does one reset the stack?
Heap:
def new generation total 5056K, used 1300K [0x35c00000, 0x36170000, 0x4a950000)
eden space 4544K, 28% used [0x35c00000, 0x35d43b60, 0x36070000)
from space 512K, 1% used [0x360f0000, 0x360f1870, 0x36170000)
to space 512K, 0% used [0x36070000, 0x36070000, 0x360f0000)
tenured generation total 10944K, used 9507K [0x4a950000, 0x4b400000, 0x74400000)
the space 10944K, 86% used [0x4a950000, 0x4b298eb8, 0x4b299000, 0x4b400000)
Metaspace used 18051K, capacity 18267K, committed 18476K, reserved 18736K
Update 24 hours later: I have tried complete reinstalls of Java and Hadoop, and still no luck. When I try java -version I still get a segmentation fault.
It appears I have a stack overflow with no easy fix. It's easier to start over and rebuild the cluster with clean software.
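That said, a segfault from a bare java -version points at a broken JDK install (or deeper OS/memory trouble) rather than Hadoop configuration, so a quick sanity check before or during the rebuild may save time (Ubuntu commands assumed):
$ which java
$ update-alternatives --list java
$ sudo apt-get install --reinstall openjdk-8-jdk    # or whichever JDK package is actually in use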