How to handle varying input batch size in mapreduce jobs

How to handle varying input batch size in mapreduce jobs - java

Issue -
I am running a series of mapreduce jobs wrapped in an oozie workflow. Input data consists of a bunch text files most of which are fairly small (KBs) but every now and then I get files over 1-2 MB which cause my jobs to fail. I am seeing two reasons why jobs are failing - one, inside one or two mr jobs the file is parsed into a graph in memory and for a bigger file its mr is running out of memory and two, the jobs are timing out.
Questions -
1) I believe I can just disable the timeout by setting mapreduce.task.timeout to 0. But I am not able to find any documentation that mention any risk in doing this.
2) For the OOM error, what are the various configs I can mess around with ? Any links here on potential solutions and risks would be very helpful.
3)I see a lot of "container preempted by scheduler" messages before I finally get OOM.. is this a seperate issue or related? How do I get around this?
Thanks in advance.

About the timeout: no need to set it to "unlimited", a reasonably large value can do (e.g. in our Prod cluster it is set to 300000).
About requiring a non-standard RAM quota in Oozie: the properties you are looking for are probably mapreduce.map.memory.mb for global YARN container quota, oozie.launcher.mapreduce.map.java.opts to instruct the JVM about that quota (i.e. fail gracefully with OOM exception instead of crashing the container with no useful error message), and the .reduce. counterparts.
See also that post for the (very poorly documented) oozie.launcher. prefix in case you want to set properties for a non-MR Action -- e.g. a Shell, or a Java program that spawns indirectly a series of Map and Reduce steps

Related

OOM crash - Hadoop Filesystem Cache Growth

I have a java program that submits thousands of Hadoop-DistCp jobs, and this is causing a OOM error on the client java process side (not on the cluster).
The DistCp jobs are submitted internally within the same java process. I do
import org.apache.hadoop.tools.{DistCp, DistCpOptions}
val distCp = new DistCp(<config>, new DistCpOptions(<foo>, <bar>))
distCp.createAndSubmitJob()
In other words, the java program is not spawning external OS processes and running the distCp cli tool separately.
The problem is the the java program, after a few thousand distCp job submissions, is eventually crashing with an OutOfMemory error. This can take some days.
This is a nuisance because re-starting the java program is a not a reasonable solution for us on the medium term.
By analysing a few heap dumps, the issue became clear. Almost all the heap is being used to hold objects on the map Map<Key, FileSystem> of the FileSystem.Cache of org.apache.hadoop.fs.FileSystem. This cache is global.
A few debugging sessions later I found that upon each instantiation of new DistCp(<config>, new DistCpOptions(<foo>, <bar>)) there is eventually a call to FileSystem.newInstance(<uri>, <conf>) via the org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter of the library org.apache.hadoop:hadoop-yarn-common.
The problem with this FileSystem.newInstance(<uri>, <conf>) calls is that it creates a unique entry on the cache each time and there doesn't seem to exist any mechanism to clear these entries. I would have tried to set the configuration flag fs.hdfs.impl.disable.cache to true, but again the FileSystem.newInstance bypasses the check of the flag, so it doesn't work.
Alternatively, I could to FileSystem.closeAll(). But this will close all file systems of the program, including "legitime" uses on other parts of the application.
In essence,
My java program launches, over time, thousands of distCp jobs within the same java process.
Each distCp job seems to add one entry to the global FileSystem.Cache.
The cache grows, and eventually the program crashes with OOM.
Does anyone see a solution to this?
I am surprised to not have found similar issues on the internet. I would have thought that having java process launching thousands of distCp jobs to be quite a common usage.

How to check why job gets killed on Google Dataflow ( possible OOM )

I've got the simple task. I've got a bunch of files ( ~100GB in total ), each line represents one entity. I have to send this entity to JanusGraph server.
2018-07-07_05_10_46-8497016571919684639 <- job id
After a while, I am getting OOM, logs say that Java gets killed.
From dataflow view, i can see the following logs:
Workflow failed. Causes: S01:TextIO.Read/Read+ParDo(Anonymous)+ParDo(JanusVertexConsumer) failed., A work item was attempted 4 times without success. Each time the worker eventually lost contact with the service. The work item was attempted on:
From stackdriver view, I can see: https://www.dropbox.com/s/zvny7qwhl7hbwyw/Screenshot%202018-07-08%2010.05.33.png?dl=0
Logs are saying:
E Out of memory: Kill process 1180 (java) score 1100 or sacrifice child
E Killed process 1180 (java) total-vm:4838044kB, anon-rss:383132kB, file-rss:0kB
More here: https://pastebin.com/raw/MftBwUxs
How can I debug what's going on?

There is too few information to debug the issue right now, so I am providing general information about Dataflow.
The most intuitive way for me to find the logs is going to Google Cloud Console -> Dataflow -> Select name of interest -> upper right corner (errors + logs).
More detailed information about monitoring is described here (in beta phase).
Some basic clues to troubleshoot the pipeline, as well as the most common error messages, are described here.
If you are not able to fix the issue, update the post with the error information please.
UPDATE
Based on the deadline exceeded error and the information you shared, I think your job is "shuffle-bound" for memory exhaustion. According to this guide:
Consider one of, or a combination of, the following courses of action:
Add more workers. Try setting --numWorkers with a higher value when you run your pipeline.
Increase the size of the attached disk for workers. Try setting --diskSizeGb with a higher value when you run your pipeline.
Use an SSD-backed persistent disk. Try setting --workerDiskType="compute.googleapis.com/projects//zones//diskTypes/pd-ssd"
when you run your pipeline.
UPDATE 2
For specific OOM errors you can use:
--dumpHeapOnOOM will cause a heap dump to be saved locally when the JVM crashes due to OOM.
--saveHeapDumpsToGcsPath=gs://<path_to_a_gcs_bucket> will cause the heap dump to be uploaded to the configured GCS path on next worker restart. This makes it easy to download the dump file for inspection. Make sure that the account the job is running under has write permissions on the bucket.
Please take into account that heap dump support has some overhead cost and dumps can be very large. These flags should only be used for debugging purposes and always disabled for production jobs.
Find other references on DataflowPipelineDebugOptions methods.
UPDATE 3
I did not find public documentation about this but I tested that Dataflow scales the heap JVM size with the machine type (workerMachineType), which could also fix your issue. I am with GCP Support so I filed two documentation requests (one for a description page and another one for a dataflow troubleshooting page) to update the documents to introduce this information.
On the other hand, there is this related feature request which you might find useful. Star it to make it more visible.

Profile a Java Application Running on Google Dataflow

Do you have any idea how to profile a java application running on a dataflow worker?
Do you know any tools that can allow me to discover memory leaks of my application?

For time profiling, you can try the instructions described in this issue 72, but there may be difficulties with workers being torn-down or auto-scaled away before you can get the profiles off the worker. Unfortunately it doesn't provide memory profiling so it won't help with memory leaks.
You can also run with the DirectPipelineRunner, which will execute the pipeline locally on your machine. This will allow you to profile the code in your pipeline without needing to deal with Dataflow workers. Depending on the scale of the pipeline you'll likely need to adjust the input size to be something that can be handled on one machine.
It may also be helpful to try to distinguish code that runs on the worker -- eg., the code within a single DoFn and the structure of the pipeline and the data. For instance, out-of-memory problems can be caused by having a GroupByKey with too many values associated with a single key and reading that into a list.

JVM running out of connections resulting into high CPU utilization and OutOfMemoryException

We have a 64 bit linux machine and we make multiple HTTP connections to other services and Drools Guvnor website(Rule engine if you don't know) is one of them. In drools, we create knowledge base per rule being fired and creation of knowledge base makes a HTTP connection to Guvnor website.
All other threads are blocked and CPU utilization goes up to ~100% resulting into OOM. We can make changes to compile the rules after 15-20 mins. but I want to be sure of the problem if someone has already faced it.
I checked for "cat /proc/sys/kernel/threads-max" and it shows 27000 threads, Can it be a reason?
I have a couple of question:
When do we know that we are running over capacity?
How many threads can be spawned internally (any rough estimate or formula relating diff parameters will work)?
Has anyone else seen similar issues with Drools? Concurrent access to Guvnor website is basically causing the issue.
Thanks,

I am basing my answer on the assumption that you are creating a knowledge base for each request, and this knowledge base creation incudes the download of latest rule sources from Guvnor please correct if I am mistaken.
I suspect that the build /compilation of packages is taking time and hog your system.
Instead of compiling packages on each and every request, you can download pre build packages from guvnor, and also you can cache this packages locally if your rules does not change much. Only restriction is that you need to use the same version of drools both on guvnor and in your application.

I checked for "cat /proc/sys/kernel/threads-max" and it shows 27000
threads, Can it be a reason?
That number does look large but we dont know if a majority of those threads belong to you java app. Create a java thread dump to confirm this. Your thread dump will also show the CPU time taken by each thread.
When do we know that we are running over capacity?
You have 100% CPU and an OOM error. You are over capacity :) Jokes aside, you should monitor your HTTP connection queue to determine what you are doing wrong. Your post says nothing about how you are handling the HTTP connections (presumably through some sort of pooling mechanism backed by a queue ?). I've seen containers and programs queue requests infinitely causing them to crash with a big bang. Plot the following graphs to isolate your problem
The number of blocking threads over time
Time taken for each thread
Number of threads per thread pool and how they increase / decrease with time (pool size)
How many threads can be spawned internally (any rough estimate or
formula relating diff parameters will work)?
Only a load test can answer this question. Load your server and determine the number of concurrent users it can support at 60-70% capacity. Note the number of threads spawned internally at this point. That is your peak capacity (allowing room for unexpected traffic)
Has anyone else seen similar issues with Drools? Concurrent access to
Guvnor website is basically causing the issue
I cant help there since I've not accessed drools this way. Sorry.

Workload Distribution / Parallel Execution in JAVA

I have a situation here where I need to distribute work over to multiple JAVA processes running in different JVMs, probably different machines.
Lets say I have a table with records 1 to 1000. I am looking for work to be collected and distributed is sets of 10. Lets say records 1-10 to workerOne. Then records 11-20 to workerThree. And so on and so forth. Needless to say workerOne never does the work of workerTwo unless and until workerTwo couldnt do it.
This example was purely based on database but could be extended to any system, I believe be it File processing, email processing and so forth.
I have a small feeling that the immediate response would be to go for a Master/Worker approach. However here we are talking about different JVMs. Even if one JVM were to come down the other JVM should just keep doing its work.
Now the million dollar question would be: Are there any good frameworks(production ready) that would give me facility to do this. Even if there are concrete implementations of specific needs like Database records, File processing, Email processing and their likes.
I have seen the Java Parallel Execution Framework, but am not sure if it can be used for different JVMs and if one were to come down would the other keep going.I believe Workers could be on multiple JVMs, but what about the Master?
More Info 1: Hadoop would be a problem because of the JDK 1.6 requirement. Thats bit too much.
Thanks,
Franklin

Might want to look into MapReduce and Hadoop

You could also use message queues. Have one process that generates the list of work and packages it in nice little chunks. It then plops those chunks on a queue. Each one of the workers just keeps waiting on the queue for something to show up. When it does, the worker pulls a chunk off the queue and processes it. If one process goes down, some other process will pick up the slack. Simple and people have been doing it that way for a long time so there's a lot information about it on the net.

Check out Hadoop

I believe Terracotta can do this. If you are dealing with web pages, JBoss can be clustered.
If you want to do this yourself you will need a work manager which keeps track of jobs to do, jobs in progress and jobs never done which needs to be rescheduled. The workers then ask for something to do, do it, and send the result back, asking for more.
You may want to elaborate on what kind of work you want to do.

The problem you've described is definitely best solved using the master/worker pattern.
You should have a look into JavaSpaces (part of the Jini framework), it's really well suited to this kind of thing. Basically you just want to encapsulate each task to be carried out inside a Command object, subclassing as necesssary. Dump these into the JavaSpace, let your workers grab and process one at a time, then reassemble when done.
Of course your performance gains will totally depend on how long it takes you to process each set of records, but JavaSpaces won't cause any problems if distributed across several machines.

If you work on records in a single database, consider performing the work within the database itself using stored procedures. The gain for processing the records on different machine might be negated by the cost of retrieving and transmitting the work between the database and the computing nodes.
For file processing it could be a similar case. Working on files in (shared) filesystem might introduce large I/O pressure for OS.
And the cost for maintaining multiple JVM's on multiple machines might be an overkill too.
And for the question: I used the JADE (Java Agent Development Environment) for some distributed simulation once. Its multi-machine suppord and message passing nature might help you.

I would consider using Jgroups for that. You can cluster your jvms and one of your nodes can be selected as master and then can distribute the work to the other nodes by sending message over network. Or you can already partition your work items and then manage in master node the distribution of the partitions like partion-1 one goes to JVM-4 , partion-2 goes to JVM-3, partion-3 goes to JVM-2 and so on. And if JVM-4 goes down it will be realized by the master node and then master node will tell to one of the other nodes to start pick up partition-1 as well.
One other alternative which is easier to use is redis pub sub support. http://redis.io/topics/pubsub . But then you will have to maintain redis servers which i dont like.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.