OOM crash - Hadoop Filesystem Cache Growth

I have a Java program that submits thousands of Hadoop DistCp jobs, and this is causing an OOM error on the client Java process side (not on the cluster).
The DistCp jobs are submitted internally within the same java process. I do
import org.apache.hadoop.tools.{DistCp, DistCpOptions}
val distCp = new DistCp(<config>, new DistCpOptions(<foo>, <bar>))
distCp.createAndSubmitJob()
In other words, the java program is not spawning external OS processes and running the distCp cli tool separately.
The problem is that the Java program, after a few thousand DistCp job submissions, eventually crashes with an OutOfMemory error. This can take some days.
This is a nuisance because restarting the Java program is not a reasonable solution for us in the medium term.
Analysing a few heap dumps made the issue clear. Almost all the heap is being used to hold objects in the Map<Key, FileSystem> of the FileSystem.Cache of org.apache.hadoop.fs.FileSystem. This cache is global.
A few debugging sessions later I found that upon each instantiation of new DistCp(<config>, new DistCpOptions(<foo>, <bar>)) there is eventually a call to FileSystem.newInstance(<uri>, <conf>) via the org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter of the library org.apache.hadoop:hadoop-yarn-common.
The problem with these FileSystem.newInstance(<uri>, <conf>) calls is that each one creates a unique entry in the cache, and there doesn't seem to be any mechanism to clear these entries. I would have tried setting the configuration flag fs.hdfs.impl.disable.cache to true, but FileSystem.newInstance bypasses the check of that flag, so it doesn't work.
Alternatively, I could call FileSystem.closeAll(). But this would close all file systems of the program, including "legitimate" uses in other parts of the application.
In essence,
My java program launches, over time, thousands of distCp jobs within the same java process.
Each distCp job seems to add one entry to the global FileSystem.Cache.
The cache grows, and eventually the program crashes with OOM.
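The three points above can be reproduced with a toy model. This is a minimal sketch, not Hadoop's code: the class ToyFsCache, its Key, and its methods are hypothetical names that only mimic the behaviour described in the question (a shared lookup that reuses one entry, and a "new instance" lookup that adds a uniquely keyed entry on every call).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Toy model of a global filesystem cache; NOT the Hadoop implementation. */
class ToyFsCache {
    /** Cache key: the URI plus a counter that is 0 for shared entries. */
    static final class Key {
        final String uri;
        final long unique;

        Key(String uri, long unique) {
            this.uri = uri;
            this.unique = unique;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Key && ((Key) o).uri.equals(uri) && ((Key) o).unique == unique;
        }

        @Override
        public int hashCode() {
            return Objects.hash(uri, unique);
        }
    }

    private final Map<Key, Object> entries = new HashMap<>();
    private long counter = 0;

    /** Shared lookup: repeated calls with the same URI reuse one entry. */
    synchronized Key get(String uri) {
        Key k = new Key(uri, 0);
        entries.computeIfAbsent(k, x -> new Object());
        return k;
    }

    /** "New instance" lookup: every call gets a fresh key, so the map grows. */
    synchronized Key newInstance(String uri) {
        Key k = new Key(uri, ++counter);
        entries.put(k, new Object());
        return k;
    }

    /** Releases one entry and leaves all others untouched. */
    synchronized void close(Key k) {
        entries.remove(k);
    }

    synchronized int size() {
        return entries.size();
    }

    public static void main(String[] args) {
        ToyFsCache cache = new ToyFsCache();
        for (int job = 0; job < 1000; job++) {
            cache.get("hdfs://nn");          // shared: one entry in total
            cache.newInstance("hdfs://nn");  // unique: one more entry per job
        }
        System.out.println("entries after 1000 jobs: " + cache.size());
    }
}
```

In this model, close(key) is the only way to release a single entry without touching the others. As far as I know, calling close() on a Hadoop FileSystem instance likewise removes that instance from the cache, but that only helps if the code that obtained the instance actually closes it.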
Does anyone see a solution to this?
I am surprised not to have found similar issues on the internet. I would have thought that a Java process launching thousands of DistCp jobs would be quite a common usage.

Related

Is it better to launch a Java app once and sleep or repeat launching and killing?

I have a Java application that needs to run several times. Every time it runs, it checks if there's data to process and if so, it processes the data.
I'm trying to figure out what's the best approach (performance, resource consumption, etc.) to do this:
1.- Launch it once, and if there's nothing to process make it sleep (All Java).
2.- Using a bash script to launch the Java app, and when it finishes, sleep (the script) and then relaunch the java app.
I was wondering if it is best to keep the Java app alive (sleeping) or relaunching every time.

It's hard to answer your question without the specific context. On the face of it, your question sounds like it could be a premature optimization.
Generally, I suggest you do what's easier for you to do (and to maintain), unless you have good reasons not to. Here are some possible good reasons, pick the ones appropriate to your situation:
For sleeping in Java:
The check of whether there's new data is easier in Java
Starting the Java program takes time or other resources, for example if on startup, your program needs to load a bunch of data
Starting the Java process from bash is complex for some reason - maybe it requires you to fiddle with a bunch of environment variables, files or something else.
For re-launching the Java program from bash:
The check of whether there's new data is easier in bash
Getting the Java process to sleep is complex - maybe your Java process is a complex multi-threaded beast, and stopping, and then re-starting the various threads is complicated.
You need the memory in between Java jobs - killing the Java process entirely would free all of its memory.
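For the "sleep in Java" option, a minimal sketch could look like the following. The class PollingRunner and its parameters are hypothetical; the data check and the processing step are passed in as callbacks because the question does not say what they are, and the loop is bounded by maxChecks only so that the example terminates.

```java
import java.util.function.BooleanSupplier;

class PollingRunner {
    /**
     * Checks for data up to maxChecks times, sleeping when there is nothing to do.
     * Returns how many times the work was actually run.
     */
    static int run(BooleanSupplier hasData, Runnable process, int maxChecks, long sleepMillis)
            throws InterruptedException {
        int processed = 0;
        for (int i = 0; i < maxChecks; i++) {
            if (hasData.getAsBoolean()) {
                process.run();
                processed++;
            } else {
                Thread.sleep(sleepMillis); // idle until the next check
            }
        }
        return processed;
    }

    public static void main(String[] args) throws InterruptedException {
        int[] pending = {3}; // hypothetical backlog of three work items
        int done = run(() -> pending[0] > 0, () -> pending[0]--, 5, 10);
        System.out.println("processed " + done + " batches");
    }
}
```

The bash variant moves the same loop outside the JVM, which trades repeated JVM start-up cost for a guaranteed release of all memory between runs.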

I would not keep it alive.
Instead, you can use a job that runs at defined intervals: you can use Jenkins, or the Windows Task Scheduler configured to run every 5 minutes (or whatever interval you wish).
Run a batch file with Windows task scheduler
And from your batch file you can do following:
REM compile
javac JavaFileName.java
REM execute
java JavaFileName
See here how to execute java file from cmd :
How do I run a Java program from the command line on Windows?

Personally, I would decide based on where the application runs.
If it were my personal computer, I would use the second option with a bash script (resources on my local machine can change a lot due to heavy use of other programs, and at some point I might run out of memory, for example).
If it goes to the cloud (Amazon, Google, whatever), I know exactly what kind of processes are running there (it should not change as dynamically as on my local PC), and a long-running Java process with some scheduler would be fine for me.

Profile a Java Application Running on Google Dataflow

Do you have any idea how to profile a java application running on a dataflow worker?
Do you know any tools that can allow me to discover memory leaks of my application?

For time profiling, you can try the instructions described in this issue 72, but there may be difficulties with workers being torn-down or auto-scaled away before you can get the profiles off the worker. Unfortunately it doesn't provide memory profiling so it won't help with memory leaks.
You can also run with the DirectPipelineRunner, which will execute the pipeline locally on your machine. This will allow you to profile the code in your pipeline without needing to deal with Dataflow workers. Depending on the scale of the pipeline you'll likely need to adjust the input size to be something that can be handled on one machine.
It may also be helpful to distinguish between code that runs on the worker (e.g., the code within a single DoFn) and the structure of the pipeline and the data. For instance, out-of-memory problems can be caused by having a GroupByKey with too many values associated with a single key and reading all of them into a list.
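The last point can be illustrated in plain Java, without any Dataflow API. This is only a sketch: GroupedValues and its methods are hypothetical, and the lazy Iterable stands in for the values that a grouping step would hand to user code for one key.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class GroupedValues {
    /** Memory-hungry: copies every value for the key into a list first. */
    static long sumViaList(Iterable<Long> values) {
        List<Long> all = new ArrayList<>();
        for (Long v : values) {
            all.add(v);
        }
        long sum = 0;
        for (long v : all) {
            sum += v;
        }
        return sum;
    }

    /** Constant memory: folds each value into the result as it is read. */
    static long sumStreaming(Iterable<Long> values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }

    /** A lazy sequence 1..n that never holds more than one value at a time. */
    static Iterable<Long> range(long n) {
        return () -> new Iterator<Long>() {
            long next = 1;

            public boolean hasNext() {
                return next <= n;
            }

            public Long next() {
                return next++;
            }
        };
    }

    public static void main(String[] args) {
        System.out.println(sumStreaming(range(1_000_000)));
    }
}
```

Both methods return the same result; the difference is that sumViaList needs heap proportional to the number of values for the key, while sumStreaming does not.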

How to handle varying input batch size in mapreduce jobs

Issue -
I am running a series of MapReduce jobs wrapped in an Oozie workflow. Input data consists of a bunch of text files, most of which are fairly small (KBs), but every now and then I get files over 1-2 MB which cause my jobs to fail. I see two reasons why the jobs fail: first, inside one or two MR jobs the file is parsed into an in-memory graph, and for a bigger file that job runs out of memory; second, the jobs are timing out.
Questions -
1) I believe I can just disable the timeout by setting mapreduce.task.timeout to 0. But I am not able to find any documentation that mentions any risk in doing this.
2) For the OOM error, what are the various configs I can adjust? Any links here on potential solutions and risks would be very helpful.
3) I see a lot of "container preempted by scheduler" messages before I finally get the OOM. Is this a separate issue or related? How do I get around this?
Thanks in advance.

About the timeout: no need to set it to "unlimited", a reasonably large value can do (e.g. in our Prod cluster it is set to 300000).
About requiring a non-standard RAM quota in Oozie: the properties you are looking for are probably mapreduce.map.memory.mb for global YARN container quota, oozie.launcher.mapreduce.map.java.opts to instruct the JVM about that quota (i.e. fail gracefully with OOM exception instead of crashing the container with no useful error message), and the .reduce. counterparts.
See also that post for the (very poorly documented) oozie.launcher. prefix in case you want to set properties for a non-MR Action -- e.g. a Shell, or a Java program that spawns indirectly a series of Map and Reduce steps
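As a rough sketch of how the two memory settings relate, the helper below derives a JVM heap limit from a container size. The property names are the ones quoted in the answer; the class MemorySettings, the method forContainer, and the heapFraction safety margin are assumptions made for illustration and should be checked against your own cluster.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class MemorySettings {
    /**
     * Builds the map-side properties named in the answer for a given container size.
     * heapFraction is an assumed safety margin, not a value taken from the answer.
     */
    static Map<String, String> forContainer(int containerMb, double heapFraction) {
        if (heapFraction <= 0 || heapFraction > 1) {
            throw new IllegalArgumentException("heapFraction must be in (0, 1]");
        }
        // Keep the JVM heap below the container quota so the JVM fails with an
        // OOM exception before the container itself is killed.
        int heapMb = (int) Math.floor(containerMb * heapFraction);
        Map<String, String> props = new LinkedHashMap<>();
        props.put("mapreduce.map.memory.mb", Integer.toString(containerMb));
        props.put("oozie.launcher.mapreduce.map.java.opts", "-Xmx" + heapMb + "m");
        return props;
    }

    public static void main(String[] args) {
        forContainer(4096, 0.75).forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

The .reduce. counterparts mentioned in the answer would follow the same pattern.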

Oozie java action gets killed then restarted by cluster

I’m using an oozie java action step to start a java main. This java application does some calculations and then runs another map-reduce job based on that data.
Since the oozie java action runs as a map-only job, it is also seen in job tracker.
One of our nodes was low on memory so the task tracker killed the oozie map-only job and restarted it on another node.
However before killing it, the java application had already spawned its own map reduce job.
When the oozie map-only job was restarted on the other node, it again spawned yet another map-reduce job with the same data as the former one.
Job tracker now shows duplicate map-reduce jobs running against the same data.
How do you prevent/manage/alter settings such that the Java program that Oozie initiates in the map-only process only gets run once? Or is it necessary to design the Java application so that it can safely be run multiple times?
Any help would be appreciated,
Ken

There is not a lot you can do on the Oozie side if the one-mapper bootstrap jobs are failing because the hosts are out of memory. This host OOM scenario can be very problematic for every service in the cluster.
The preferred way to deal with this is to ensure that the host does not run out of memory at all by only allowing as many map and reduce slots on each TaskTracker node as there is memory available. You may also find that this allocation of resources to nodes is more efficient and tunable by using the YARN resource management framework instead of JobTracker-based MapReduce (MR1).
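On the other half of the question (making the Java application safe to run more than once), one possible approach is to claim a marker before spawning the job, so that a restarted attempt can detect that the job was already launched. The sketch below is an assumption-laden illustration: RunOnce and the marker name are hypothetical, and it uses a local path for simplicity, whereas on a cluster the marker would have to live on storage that every node can see.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

class RunOnce {
    /**
     * Runs the action only if no earlier attempt has claimed the marker file.
     * Files.createFile checks and creates atomically, so two attempts cannot both succeed.
     */
    static boolean runOnce(Path marker, Runnable action) throws IOException {
        try {
            Files.createFile(marker);
        } catch (FileAlreadyExistsException e) {
            return false; // an earlier attempt already launched the job
        }
        action.run();
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("runonce");
        Path marker = dir.resolve("job-submitted.marker"); // hypothetical marker name
        System.out.println(runOnce(marker, () -> System.out.println("submitting job")));
        System.out.println(runOnce(marker, () -> System.out.println("submitting job")));
    }
}
```

The trade-off is that an attempt killed between claiming the marker and submitting the job leaves the job unsubmitted, so a real implementation would also need a way to detect and clean up stale markers.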

Slowing process creation under Java?

I have a single, large heap (up to 240GB, though in the 20-40GB range for most of this phase of execution) JVM [1] running under Linux [2] on a server with 24 cores. We have tens of thousands of objects that have to be processed by an external executable & then load the data created by those executables back into the JVM. Each executable produces about half a megabyte of data (on disk) that when read right in, after the process finishes, is, of course, larger.
Our first implementation was to have each executable handle only a single object. This involved the spawning of twice as many executables as we had objects (since we called a shell script that called the executable). Our CPU utilization would start off high, but not necessarily 100%, and slowly worsen. As we began measuring to see what was happening we noticed that the process creation time [3] continually slows. While starting at sub-second times it would eventually grow to take a minute or more. The actual processing done by the executable usually takes less than 10 seconds.
Next we changed the executable to take a list of objects to process in an attempt to reduce the number of processes created. With batch sizes of a few hundred (~1% of our current sample size), the process creation times start out around 2 seconds & grow to around 5-6 seconds.
Basically, why is it taking so long to create these processes as execution continues?
[1] Oracle JDK 1.6.0_22
[2] Red Hat Enterprise Linux Advanced Platform 5.3, Linux kernel 2.6.18-194.26.1.el5 #1 SMP
[3] Creation of the ProcessBuilder object, redirecting the error stream, and starting it.
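The batching step described in the question can be sketched as follows. Batcher is a hypothetical helper, and passing the object ids as command-line arguments is an assumption, since the question does not say how the executable receives its list.

```java
import java.util.ArrayList;
import java.util.List;

class Batcher {
    /** Splits items into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    /** Builds one command line per batch: the executable followed by the object ids. */
    static List<String> command(String executable, List<String> batch) {
        List<String> cmd = new ArrayList<>();
        cmd.add(executable);
        cmd.addAll(batch);
        return cmd;
    }

    public static void main(String[] args) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            ids.add("object-" + i);
        }
        List<List<String>> batches = partition(ids, 300);
        // One process per batch instead of one per object; each command list
        // could be handed to a ProcessBuilder.
        System.out.println(batches.size() + " processes instead of " + ids.size());
    }
}
```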

My guess is that you MIGHT be running into problems with fork/exec, if Java is using the fork/exec system calls to spawn subprocesses.
Normally fork/exec is fairly efficient, because fork() does very little - all pages are copy-on-write. This stops being so true with very large processes (i.e. those with gigabytes of pages mapped) because the page tables themselves take a relatively long time to create - and of course, destroy, as you immediately call exec.
As you're using a huge amount of heap, this might be affecting you. The more pages you have mapped in, the worse it may become, which could be what's causing the progressive slowdown.
Consider either:
Using posix_spawn, if that is NOT implemented by fork/exec in libc
Using a single subprocess which is responsible for creating / reaping others; spawn this once and use some IPC (pipes etc) to tell it what to do.
NB: This is all speculation; you should probably do some experiments to see whether this is the case.
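The second suggestion (spawn one helper once and drive it over pipes) might look like this from the Java side. It is a sketch under assumptions: PipeWorker is a hypothetical class, the one-line-in, one-line-out protocol is invented for the example, and cat is only a stand-in for a real helper that would create and reap the executables.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

/** Talks to one long-lived child process over its stdin/stdout pipes. */
class PipeWorker implements AutoCloseable {
    private final Process process;
    private final BufferedWriter toChild;
    private final BufferedReader fromChild;

    /** Spawns the child exactly once; every later request reuses it. */
    PipeWorker(String... command) throws IOException {
        process = new ProcessBuilder(command).redirectErrorStream(true).start();
        toChild = new BufferedWriter(
                new OutputStreamWriter(process.getOutputStream(), StandardCharsets.UTF_8));
        fromChild = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8));
    }

    /** Sends one line and waits for one line back (assumed line-based protocol). */
    String request(String line) throws IOException {
        toChild.write(line);
        toChild.newLine();
        toChild.flush();
        return fromChild.readLine();
    }

    @Override
    public void close() throws IOException, InterruptedException {
        toChild.close();   // EOF on the child's stdin lets it exit
        process.waitFor();
        fromChild.close();
    }

    public static void main(String[] args) throws Exception {
        // "cat" simply echoes each line; a real helper would launch the executable.
        try (PipeWorker worker = new PipeWorker("cat")) {
            for (int i = 0; i < 3; i++) {
                System.out.println(worker.request("object-" + i));
            }
        }
    }
}
```

If the fork cost really does grow with the size of the parent, starting the helper early, before the heap has grown, should keep that single spawn cheap.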

Most likely you are running out of a resource. Are your disks getting busier as you create these processes? Do you ensure you have fewer processes than you have cores (to minimise context switches)? Is your load average below 24?
If your CPU consumption is dropping, you are likely hitting IO (disk/network) contention, i.e. the processes cannot read/write data fast enough to keep them busy. If you have 24 cores, how many disks do you have?
I would suggest you have one process per CPU (in your case I imagine 4). Give each JVM six tasks to run concurrently to use all its cores without overloading the system.

You would be much better off using a set of long-lived processes pulling your data off of queues and sending it back than constantly forking new processes for each event, especially from the host JVM with that enormous heap.
Forking a 240GB image is not free; it consumes a large amount of virtual resources, even if only for a second. The OS doesn't know how long the new process will be alive, so it must prepare itself as if the entire process will be long-lived; thus it sets up the virtual clone of all 240GB before obliterating it with the exec call.
If instead you had a long-lived process that you could send objects to via some queue mechanism (and there are many for both Java and C, etc.), that would relieve you of some of the pressure of the forking process.
I don't know how you are transferring the data from the JVM to the external program. But if your external program can work with stdin/stdout, then (assuming you're using unix) you could leverage inetd. Here you make a simple entry in the inetd configuration file for your process, and assign it a port. Then you open up a socket, pour the data down into it, then read back from the socket. Inetd handles the networking details for you and your program works simply with stdin and stdout. Mind you, you'll have an open socket on the network, which may or may not be secure in your deployment. But it's pretty trivial to set up even legacy code to run via a network service.
You could use a simple wrapper like this:
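#!/bin/sh
# Buffer the request from stdin, run the executable on it, then return the result on stdout.
infile=/tmp/$$.in
outfile=/tmp/$$.out
cat > "$infile"
/usr/local/bin/process -input "$infile" -output "$outfile"
cat "$outfile"
rm "$infile" "$outfile"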
It's not the highest performing server on the planet, designed to handle zillions of transactions, but it's sure a lot faster than forking 240GB over and over and over.
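The JVM side of the inetd approach (open a socket, pour the data in, read the result back) can be sketched as below. SocketRoundTrip is a hypothetical class, and the host and port are whatever you assign in your inetd entry. The startEchoOnce method is only an in-process stand-in for the inetd-managed service so that the example runs on its own; like the wrapper script, it reads the whole request before replying.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

class SocketRoundTrip {
    /** Reads a stream until EOF. */
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toByteArray();
    }

    /** Sends the payload, signals end-of-input, then reads the full reply. */
    static byte[] send(String host, int port, byte[] payload) throws IOException {
        try (Socket socket = new Socket(host, port)) {
            socket.getOutputStream().write(payload);
            socket.getOutputStream().flush();
            socket.shutdownOutput(); // the service sees EOF on its stdin
            return readAll(socket.getInputStream());
        }
    }

    /** Stand-in for the inetd service: accepts one client and echoes its input. */
    static ServerSocket startEchoOnce() throws IOException {
        ServerSocket server = new ServerSocket(0); // 0 = any free port
        Thread t = new Thread(() -> {
            try (Socket client = server.accept()) {
                byte[] request = readAll(client.getInputStream());
                client.getOutputStream().write(request);
                client.getOutputStream().flush();
            } catch (IOException e) {
                e.printStackTrace();
            }
        });
        t.setDaemon(true);
        t.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocket server = startEchoOnce()) {
            byte[] reply = send("127.0.0.1", server.getLocalPort(),
                    "hello".getBytes(StandardCharsets.UTF_8));
            System.out.println(new String(reply, StandardCharsets.UTF_8));
        }
    }
}
```

Calling shutdownOutput after writing matters here: without it the service never sees end-of-input and both sides wait on each other.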

I mostly agree with Peter. You are most probably suffering from IO bottlenecks. Once you have many processes, the OS has to work harder even for trivial tasks, hence the exponential performance penalty.
So the 'solution' could be to create 'consumer' processes and initialise only a few of them: one per CPU, as Peter suggested, or more. Then use some form of IPC to 'transfer' these objects to the consumer processes.
Your 'consumer' processes should manage sub-process creation (the processing executable, which I presume you don't have any access to). This way you don't clutter the OS with too many executables, and the 'job' will "eventually" complete.
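The shape of that design, a fixed number of consumers draining a shared queue, can be sketched inside a single JVM. This is an illustration only: ConsumerPool is a hypothetical class, threads stand in for the consumer processes, and the work function stands in for handing an object to the processing executable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

class ConsumerPool {
    // Distinct object used as an end-of-work marker; compared by identity below.
    private static final String STOP = new String("STOP");

    /** Runs a fixed number of consumers over the jobs; results come back unordered. */
    static List<String> process(List<String> jobs, int consumers, Function<String, String> work)
            throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(jobs);
        for (int i = 0; i < consumers; i++) {
            queue.add(STOP); // one marker per consumer, queued after all real jobs
        }
        ConcurrentLinkedQueue<String> results = new ConcurrentLinkedQueue<>();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < consumers; i++) {
            Thread t = new Thread(() -> {
                try {
                    for (String job = queue.take(); job != STOP; job = queue.take()) {
                        results.add(work.apply(job));
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            t.start();
            threads.add(t);
        }
        for (Thread t : threads) {
            t.join();
        }
        return new ArrayList<>(results);
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> jobs = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            jobs.add("object-" + i);
        }
        int consumers = Runtime.getRuntime().availableProcessors();
        List<String> done = process(jobs, consumers, job -> job + ":processed");
        System.out.println(done.size() + " jobs completed by " + consumers + " consumers");
    }
}
```

The number of concurrently running workers never exceeds the pool size, no matter how many objects are queued, which is the property the answers above are after.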
