Oozie java action gets killed then restarted by cluster - java

I’m using an oozie java action step to start a java main. This java application does some calculations and then runs another map-reduce job based on that data.
Since the oozie java action runs as a map-only job, it is also seen in job tracker.
One of our nodes was low on memory so the task tracker killed the oozie map-only job and restarted it on another node.
However before killing it, the java application had already spawned its own map reduce job.
When the oozie map-only job was restarted on the other node, it again spawned yet another map-reduce job with the same data as the former one.
Looking in the job tracker now shows duplicate map-reduce jobs running against the same data.
How do you prevent/manage/alter settings so that the java program that Oozie initiates in the map-only process only gets run once, or is it necessary to constrain the Java application so that it can safely be run multiple times?
Any help would be appreciated,
Ken

There is not a lot you can do on the Oozie side if the one-mapper bootstrap jobs are failing because the hosts are out of memory. This host OOM scenario can be very problematic for every service in the cluster.
The preferred way to deal with this is to ensure that the host does not run out of memory at all by only allowing as many map and reduce slots on each TaskTracker node as there is memory available. You may also find that this allocation of resources to nodes is more efficient and tunable by using the YARN resource management framework instead of JobTracker-based MapReduce (MR1).
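If the re-run itself cannot always be avoided, the other half of your question, constraining the Java application so it can safely run more than once, usually comes down to making the launcher idempotent. A rough sketch, assuming the child job writes to a known output path that can double as a "this step already ran" marker (the path and class names here are made up, not from your workflow):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object IdempotentLauncher {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Made-up, run-specific output path of the child map-reduce job;
    // a restarted attempt will find it and skip the second submission.
    val output = new Path("hdfs:///app/output/run-2014-01-01")
    val fs = FileSystem.get(output.toUri, conf)

    if (fs.exists(output)) {
      println(s"Output $output already exists; a previous attempt ran this step, exiting.")
      return
    }

    // ... do the calculations and submit the child map-reduce job as before ...
  }
}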

Related

OOM crash - Hadoop Filesystem Cache Growth

I have a java program that submits thousands of Hadoop DistCp jobs, and this is causing an OOM error on the client java process side (not on the cluster).
The DistCp jobs are submitted internally within the same java process. I do
import org.apache.hadoop.tools.{DistCp, DistCpOptions}
val distCp = new DistCp(<config>, new DistCpOptions(<foo>, <bar>))
distCp.createAndSubmitJob()
In other words, the java program is not spawning external OS processes and running the distCp cli tool separately.
The problem is that the java program, after a few thousand distCp job submissions, eventually crashes with an OutOfMemory error. This can take some days.
This is a nuisance because restarting the java program is not a reasonable solution for us in the medium term.
After analysing a few heap dumps, the issue became clear: almost all of the heap is being used to hold entries in the Map<Key, FileSystem> of the FileSystem.Cache of org.apache.hadoop.fs.FileSystem. This cache is global.
A few debugging sessions later I found that upon each instantiation of new DistCp(<config>, new DistCpOptions(<foo>, <bar>)) there is eventually a call to FileSystem.newInstance(<uri>, <conf>) via the org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter of the library org.apache.hadoop:hadoop-yarn-common.
The problem with these FileSystem.newInstance(<uri>, <conf>) calls is that each one creates a unique entry in the cache, and there doesn't seem to be any mechanism to clear those entries. I would have tried setting the configuration flag fs.hdfs.impl.disable.cache to true, but FileSystem.newInstance bypasses the check of that flag, so it doesn't work.
Alternatively, I could call FileSystem.closeAll(). But this would close all file systems of the program, including legitimate uses in other parts of the application.
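For illustration, here is a minimal sketch of the behaviour I am describing (the namenode URI is just a placeholder):
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

object CacheGrowthDemo {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    val uri  = URI.create("hdfs://namenode:8020") // placeholder

    val a = FileSystem.get(uri, conf)          // shared, cached instance
    val b = FileSystem.get(uri, conf)          // returns the same object as `a`
    val c = FileSystem.newInstance(uri, conf)  // adds a new, unique entry to FileSystem.CACHE
    val d = FileSystem.newInstance(uri, conf)  // adds yet another entry

    println(a eq b) // true
    println(c eq d) // false; every newInstance call grows the global cache
    // Entries only leave the cache when each instance is closed (or via closeAll()),
    // which is exactly what never happens for the instances created by FileSystemTimelineWriter.
    c.close(); d.close()
  }
}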
In essence,
My java program launches, over time, thousands of distCp jobs within the same java process.
Each distCp job seems to add one entry to the global FileSystem.Cache.
The cache grows, and eventually the program crashes with OOM.
Does anyone see a solution to this?
I am surprised not to have found similar issues on the internet. I would have thought that a java process launching thousands of distCp jobs would be quite a common usage.

Spark (Kafka) Streaming Memory Issue

I am testing my first Spark Streaming pipeline, which processes messages from Kafka. However, after several test runs, I got the following error message:
There is insufficient memory for the Java Runtime Environment to continue.
My test data is really small, so this should not happen. After looking into the processes, I realized that maybe previously submitted spark jobs were not removed completely?
I usually submit jobs like below, and I am using Spark 2.2.1
/usr/local/spark/bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2 ~/script/to/spark_streaming.py
And stop it using Ctrl+C.
The last few lines of the script look like:
ssc.start()
ssc.awaitTermination()
Update
After changing the way I submit the spark streaming job (command below), I still ran into the same issue: after killing the job, the memory is not released. I only started Hadoop and Spark on those 4 EC2 nodes.
/bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2 --py-files ~/config.py --master spark://<master_IP>:7077 --deploy-mode client ~/spark_kafka.py
When you press Ctrl-C, only the submitter process is interrupted; the job itself continues to run. Eventually your system runs out of memory, so no new JVM can be started.
Furthermore, even if you restart the cluster, all previously running jobs will be restarted again.
Read how to stop a running Spark application properly.
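If the shutdown has to be initiated from the driver itself, here is a rough sketch of a controlled stop instead of Ctrl-C (shown with the Scala API; PySpark's StreamingContext has an equivalent stop(...), and the stop condition below is just an assumption):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ControlledShutdown {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("kafka-streaming-test"), Seconds(10))

    // ... build the Kafka DStream and transformations here ...

    ssc.start()
    // Poll with a timeout instead of blocking forever, so the driver can decide to stop itself.
    var stopped = false
    while (!stopped) {
      stopped = ssc.awaitTerminationOrTimeout(10000)
      if (!stopped && shouldStop()) {
        // Stops the receivers, lets in-flight batches finish, then stops the SparkContext,
        // so the driver and executors actually release their memory.
        ssc.stop(stopSparkContext = true, stopGracefully = true)
        stopped = true
      }
    }
  }

  // Made-up stop condition, e.g. a system property or an HDFS marker file checked each cycle.
  def shouldStop(): Boolean = sys.props.get("stop.streaming").contains("true")
}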
It might be a problem of a bunch of driver processes (spark-app-driver) still running on the host you use to submit spark jobs. Try doing something like
ps aux --forest
or similar, depending on your platform, to understand which processes are running at the moment. Or you can have a look at the answers to the Stack Overflow question Spark Streaming with Actor Never Terminates; it might give you a clue about what is happening.

what is the difference between spark scheduling mode and application queue in spark?

While testing the behavior of spark jobs when multiple jobs are submitted to run concurrently, or smaller jobs are submitted later, I came across two settings in the Spark UI. One is the scheduling mode available within Spark, as shown in the image below.
And one is under the scheduler, as shown below.
I want to understand the difference between the two settings, and how preemption fits in. My requirement is that while a bigger job is running, small jobs submitted in between must get resources without waiting too long.
Let me explain it for the Spark On Yarn mode.
When you submit scala code to spark, the spark client interacts with yarn and launches a yarn application. This application is responsible for all the jobs in your scala code. In most cases, each job corresponds to a Spark action such as reduce() or collect(). Then the problem arises: how should the different jobs in this application be scheduled, for example when 3 concurrent jobs come out and are waiting for execution? To deal with this, Spark defines scheduling rules for jobs: FIFO and Fair. That is to say, the spark scheduler, whether FIFO or Fair, works at the level of jobs, and it is the spark ApplicationMaster that does the scheduling work.
But yarn's scheduler works at the level of containers. Yarn doesn't care what is running in a container; it may be a Mapper task, a Reducer task, a Spark driver process, a Spark executor process, and so on. For example, your MapReduce job is currently asking for 10 containers, each needing (10g memory and 2 vcores), and your spark application is currently asking for 4 containers, each needing (10g memory and 2 vcores). Yarn has to decide how many containers are available in the cluster and how much resource should be allocated for each request according to a rule; this rule is yarn's scheduler, such as FairScheduler or CapacityScheduler.
In general, your spark application asks yarn for several containers, and yarn decides, via its scheduler, how many containers can currently be allocated to your spark application. After these containers are allocated, the Spark ApplicationMaster decides how to distribute them among its jobs.
Below is the official documentation about the Spark scheduler: https://spark.apache.org/docs/2.0.0-preview/job-scheduling.html#scheduling-within-an-application
I think spark.scheduler.mode (FAIR/FIFO), shown in the figure, is for scheduling tasksets (tasks of the same stage) submitted to the TaskScheduler using a FAIR or FIFO policy. These tasksets belong to the same job.
To be able to run jobs concurrently, execute each job (transformations + action) in a separate thread. When a job is submitted to the DAG scheduler, the submitting thread is blocked until the job completes and the result is returned or saved.
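A rough sketch of that pattern, assuming the FAIR scheduling mode is enabled for the application (the pool names and the jobs themselves are placeholders):
import org.apache.spark.sql.SparkSession

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, Future}

object ConcurrentJobs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("fair-scheduling-demo")
      .config("spark.scheduler.mode", "FAIR") // job-level scheduling inside this one application
      .getOrCreate()
    val sc = spark.sparkContext

    // Each action is submitted from its own thread, so the FAIR scheduler can interleave them.
    val bigJob = Future {
      sc.setLocalProperty("spark.scheduler.pool", "batch")       // placeholder pool name
      sc.parallelize(1 to 10000000).map(_ * 2L).count()
    }
    val smallJob = Future {
      sc.setLocalProperty("spark.scheduler.pool", "interactive") // placeholder pool name
      sc.parallelize(1 to 1000).count()
    }

    Await.result(bigJob, Duration.Inf)
    Await.result(smallJob, Duration.Inf)
    spark.stop()
  }
}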

How does Hadoop run the java reduce function on the DataNodes

I am confused about how the DataNodes in a hadoop cluster run the java code for the reduce function of a job. How does hadoop send java code to another computer to execute?
Does Hadoop inject the java code into the nodes? If so, where is the java code located in hadoop?
Or are the reduce functions run on the master node, not the datanodes?
Help me trace the code path where the master node sends the java code for the reduce function to a datanode.
As shown in the picture, here is what happens:
You run the job on the client by using the hadoop jar command, in which you pass the jar file name, the class name and other parameters such as input and output
The client will get a new application id and then copy the jar file and other job resources to HDFS with a high replication factor (by default 10 on large clusters)
Then the client will actually submit the application through the resource manager
The resource manager keeps track of cluster utilization and submits the application master (which co-ordinates the job execution)
The application master will talk to the namenode, determine where the blocks for the input are located, and then work with the nodemanagers to submit the tasks (in the form of containers)
Containers are nothing but JVMs, and they run the map and reduce tasks (the mapper and reducer classes). When a JVM is bootstrapped, the job resources that are on HDFS are copied to it. For mappers, these JVMs are typically created on the same nodes on which the data exists, so once processing starts the jar file is executed to process the data locally on that machine.
To answer your question, the reducer will be running on one or more data nodes as part of the containers. The java code is copied as part of the bootstrap process (when the JVM is created). The data is fetched from the mappers over the network.
No. Reduce functions are executed on data nodes.
Hadoop transfers the packaged code (jar files) to the data nodes that are going to process the data. At run time the data nodes download this code and process their tasks.
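For reference, a minimal driver sketch (the class names and paths are made up): the setJarByClass call is what tells the framework which jar to upload, and that jar is what each data node localizes into its container before running the mapper and reducer.
import java.lang.{Iterable => JIterable}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
import scala.collection.JavaConverters._

class TokenMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()
  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { w =>
      word.set(w)
      context.write(word, one)
    }
  }
}

class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: JIterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    context.write(key, new IntWritable(values.asScala.map(_.get).sum))
  }
}

object WordCountDriver {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "word count")
    // Records which jar holds the mapper/reducer classes; the client uploads that jar
    // (plus the other job resources) to HDFS, and each container localizes it before running its task.
    job.setJarByClass(classOf[TokenMapper])
    job.setMapperClass(classOf[TokenMapper])
    job.setReducerClass(classOf[SumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))
    FileOutputFormat.setOutputPath(job, new Path(args(1)))
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}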

Scheduled job in a multi node environment

I am working on a scheduled job that will run at a certain interval (e.g. once a day at 1pm), scheduled through cron. I am working with Java and Spring.
Writing the scheduled job is easy enough - it does: grab a list of people with certain criteria from the db, then for each person do some calculation and trigger a message.
I am working in a single-node environment locally and in testing; however, when we go to production it will be a multi-node environment (with a load balancer, etc). My concern is: how would the multi-node environment affect the scheduled job?
My guess is I could (or very likely would) end up triggering duplicate messages.
Machine 1: Grab list of people, do calculation
Machine 2: Grab list of people, do calculation
Machine 1: Trigger message
Machine 2: Trigger message
Is my guess correct?
What would be the recommended solution to avoid the above issue? Do I need to create a master/slave distributed system solution to manage the multi-node environment?
If you have, for example, something like three Tomcat instances load balanced behind Apache, and your application runs on each of them, then you will have three different triggers and your job will run three times. You won't have a multi-node environment with distributed job execution unless some kind of mechanism for distributing the parts of the job is in place.
If you haven't looked at this project yet, take a peek at Spring XD. It handles Spring Batch Jobs and can be run in distributed mode.
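If adopting a framework is more than you need, a common lightweight alternative is to let the database the nodes already share pick the winner: each node tries to insert a "run claimed" row for the day, and only the node whose insert succeeds triggers the messages. A rough sketch with plain JDBC (the connection details, table and column names are made up):
import java.sql.{Date, DriverManager, SQLException}
import java.time.LocalDate

object DailyMessageJob {
  def main(args: Array[String]): Unit = {
    // Connection details and table name are made up; job_run has a unique
    // constraint on (job_name, run_date), so only one node's INSERT can succeed per day.
    val conn = DriverManager.getConnection("jdbc:postgresql://db-host/app", "app", "secret")
    try {
      val stmt = conn.prepareStatement("INSERT INTO job_run (job_name, run_date) VALUES (?, ?)")
      stmt.setString(1, "daily-message-job")
      stmt.setDate(2, Date.valueOf(LocalDate.now()))
      stmt.executeUpdate() // fails with a duplicate-key SQLException on every node but one
      runJob()             // only the winning node gets here
    } catch {
      case _: SQLException => println("Another node already claimed today's run; skipping.")
    } finally {
      conn.close()
    }
  }

  // Placeholder for: grab the list of people, do the calculations, trigger the messages.
  def runJob(): Unit = ()
}
Quartz in clustered mode (with a JDBC job store) or a scheduler-lock library builds on the same idea if you would rather not hand-roll it.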
