When we run a data-intensive job on Hadoop, Hadoop executes the job.
What I want is that, when the job completes, it gives me statistics about
the executed job, i.e. time consumed, number of mappers, number of reducers, and other useful information.
This information is displayed in the browser (JobTracker, DataNode pages) during job execution.
But how can I get these statistics in my own application, which runs the job on Hadoop, so that it gives me a report when the job completes? My application is written in Java.
Is there any API that can help me?
Suggestions will be appreciated.
Look into the following methods of JobClient:
getMapTaskReports(JobID)
getReduceTaskReports(JobID)
Both of these calls return arrays of TaskReport objects, from which you can pull the start and finish times and the individual counters for each task.
Chris is correct. The documentation of TaskReport states that org.apache.hadoop.mapred.TaskReport inherits those methods from org.apache.hadoop.mapreduce.TaskReport, so one can get such values.
Here is the code to get the start and finish times of a job's tasks, grouped into map and reduce tasks.
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobStatus;
import org.apache.hadoop.conf.Configuration;
import java.net.InetSocketAddress;
import java.util.*;
import org.apache.hadoop.mapred.TaskReport;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.util.StringUtils;
import java.text.SimpleDateFormat;
public class mini{
public static void main(String args[]){
String jobTrackerHost = "192.168.151.14";
int jobTrackerPort = 54311;
try{
Configuration conf = new Configuration();
JobClient jobClient = new JobClient(new InetSocketAddress(jobTrackerHost, jobTrackerPort), conf);
JobStatus[] activeJobs = jobClient.jobsToComplete();
SimpleDateFormat dateFormat = new SimpleDateFormat("d-MMM-yyyy HH:mm:ss");
for(JobStatus js: activeJobs){
System.out.println(js.getJobID());
RunningJob runningjob = jobClient.getJob(js.getJobID());
while(!runningjob.isComplete()){ Thread.sleep(1000); /* Poll once a second until the job completes, instead of spinning. */ }
TaskReport[] maptaskreports = jobClient.getMapTaskReports(js.getJobID());
for(TaskReport tr: maptaskreports){
System.out.println("Task ID: "+tr.getTaskID()+" Start TIme: "+StringUtils.getFormattedTimeWithDiff(dateFormat, tr.getStartTime(), 0)+" Finish Time: "+StringUtils.getFormattedTimeWithDiff(dateFormat, tr.getFinishTime(), tr.getStartTime()));
}
TaskReport[] reducetaskreports = jobClient.getReduceTaskReports(js.getJobID());
for(TaskReport tr: reducetaskreports){
System.out.println("Task ID: "+tr.getTaskID()+" Start TIme: "+StringUtils.getFormattedTimeWithDiff(dateFormat, tr.getStartTime(), 0)+" Finish Time: "+StringUtils.getFormattedTimeWithDiff(dateFormat, tr.getFinishTime(), tr.getStartTime()));
}
}
}catch(Exception ex){
ex.printStackTrace();
}
}
}
This is a simple example that gets the start and finish times of a running job's tasks. You can adapt it in whatever way you want.
Here is a run of this program for a "Word Count" MapReduce job.
[root@dev1-slave1 ~]# java -classpath /usr/lib/hadoop/hadoop-core.jar:/usr/lib/hadoop/lib/jackson-core-asl-1.8.8.jar:/usr/lib/hadoop/lib/jackson-mapper-asl-1.8.8.jar:/usr/lib/hadoop/lib/commons-logging-1.1.1.jar:/usr/lib/hadoop/lib/commons-configuration-1.6.jar:/usr/lib/hadoop/lib/commons-lang-2.4.jar:. mini
job_201501151144_0042
Task ID: task_201501151144_0042_m_000000 Start TIme: 16-Jan-2015 17:07:35 Finish Time: 16-Jan-2015 17:07:43 (7sec)
Task ID: task_201501151144_0042_m_000001 Start TIme: 16-Jan-2015 17:07:35 Finish Time: 16-Jan-2015 17:07:56 (20sec)
Task ID: task_201501151144_0042_m_000002 Start TIme: 16-Jan-2015 17:07:35 Finish Time: 16-Jan-2015 17:07:43 (7sec)
Task ID: task_201501151144_0042_m_000003 Start TIme: 16-Jan-2015 17:07:43 Finish Time: 16-Jan-2015 17:07:53 (10sec)
Task ID: task_201501151144_0042_m_000004 Start TIme: 16-Jan-2015 17:07:43 Finish Time: 16-Jan-2015 17:07:53 (10sec)
Task ID: task_201501151144_0042_r_000000 Start TIme: 16-Jan-2015 17:07:43 Finish Time: 16-Jan-2015 17:08:00 (17sec)
Task ID: task_201501151144_0042_r_000001 Start TIme: 16-Jan-2015 17:07:43 Finish Time: 16-Jan-2015 17:08:05 (22sec)
Task ID: task_201501151144_0042_r_000002 Start TIme: 16-Jan-2015 17:07:43 Finish Time: 16-Jan-2015 17:08:05 (21sec)
It's a good idea to open the relevant JSP files of Hadoop in its mapreduce/src/webapps/job/ directory and see how the JobTracker web UI displays the information.
I derived the code above from jobtasks.jsp.
Hope it helps. :)
Related
I have two schedulers, which execute with a fixedDelay of 5 s.
I have two use cases:
If the if condition in the BusinessLogic class is true, I want both schedulers to sleep for 3 secs, which means both schedulers should next execute after 8 secs [5 secs + 3 secs].
If the code takes the else branch, both schedulers should continue to execute at the fixed delay of 5 secs.
Code:
Scheduler class:
import java.util.Date;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
@Component
public class TestSchedulers {
@Autowired
private BusinessLogic businessLogic;
@Scheduled(fixedDelay = 5000)
public void scheduler1(){
Date currentDate = new Date();
System.out.println("Started Sceduler 1 at " + currentDate);
String schedulerName = "Scheduler one";
businessLogic.logic(schedulerName);
}
@Scheduled(fixedDelay = 5000)
public void scheduler2(){
Date currentDate= new Date();
System.out.println("Started Sceduler 2 at " + currentDate);
String schedulerName = "Scheduler two";
businessLogic.logic(schedulerName);
}
}
Business logic class:
import java.util.Random;
import org.springframework.stereotype.Service;
@Service
public class BusinessLogic {
public void logic(String schedulerName) {
if(randomGen() < 100){
System.out.println("\nExecuting If condition for [" + schedulerName + "]");
try {
Thread.sleep(3000);
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}else if(randomGen() > 100){
System.out.println("\nExecuting Else condition for [" + schedulerName + "]");
}
}
//Generate random numbers
public int randomGen(){
Random rand = new Random();
int randomNum = rand.nextInt((120 - 90) + 1) + 90;
return randomNum;
}
}
The problem
The two schedulers do not start at the same time.
When the if branch executes, only one scheduler sleeps for the extra 3 secs, but I want both schedulers to do so.
Log for reference:
Started Sceduler 1 at Sun May 26 12:34:53 IST 2019
Executing If condition for [Scheduler one]
2019-05-26 12:34:53.266 INFO 9028 --- [ main] project.project.App : Started App in 1.605 seconds (JVM running for 2.356)
Started Sceduler 2 at Sun May 26 12:34:56 IST 2019
Executing If condition for [Scheduler two]
Started Sceduler 1 at Sun May 26 12:35:01 IST 2019
Executing Else condition for [Scheduler one]
Started Sceduler 2 at Sun May 26 12:35:04 IST 2019
Executing Else condition for [Scheduler two]
Started Sceduler 1 at Sun May 26 12:35:06 IST 2019
Executing If condition for [Scheduler one]
Started Sceduler 2 at Sun May 26 12:35:09 IST 2019
Executing Else condition for [Scheduler two]
Started Sceduler 1 at Sun May 26 12:35:14 IST 2019
Executing If condition for [Scheduler one]
Started Sceduler 2 at Sun May 26 12:35:17 IST 2019
Executing If condition for [Scheduler two]
Started Sceduler 1 at Sun May 26 12:35:22 IST 2019
Executing Else condition for [Scheduler one]
Started Sceduler 2 at Sun May 26 12:35:25 IST 2019
Executing Else condition for [Scheduler two]
Started Sceduler 1 at Sun May 26 12:35:27 IST 2019
Please help.
In each scheduler you invoke if(randomGen() < 100) independently of the other. So for one scheduler the result could be above 100 and for the other below 100, or it could be the same for both. What you need to do is call randomGen() outside of the schedulers and store the single result somewhere both schedulers can access it. Then both rely on the same value in their if(randomGenValue < 100) check and behave the same way.
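A minimal sketch of that idea in plain Java, without Spring. The class SharedRandomDecision and its methods refresh() and shouldSleep() are hypothetical names made up for this illustration; the range 90 to 120 and the threshold 100 come from the question. Who calls refresh(), and how often, is left open here (for example, once per cycle before both schedulers run).

```java
import java.util.Random;
import java.util.concurrent.atomic.AtomicInteger;

public class SharedRandomDecision {
    // One value shared by both schedulers (hypothetical holder, not a Spring class).
    private static final AtomicInteger SHARED_VALUE = new AtomicInteger();
    private static final Random RANDOM = new Random();

    // Draw one number in [90, 120], as randomGen() in the question does,
    // and publish it so that every reader sees the same value.
    public static int refresh() {
        int value = RANDOM.nextInt((120 - 90) + 1) + 90;
        SHARED_VALUE.set(value);
        return value;
    }

    // Both schedulers call this instead of calling randomGen() themselves.
    public static boolean shouldSleep() {
        return SHARED_VALUE.get() < 100;
    }

    public static void main(String[] args) {
        int value = refresh();
        // Both "schedulers" now read the same decision.
        boolean schedulerOne = shouldSleep();
        boolean schedulerTwo = shouldSleep();
        System.out.println("value=" + value
                + ", scheduler one sleeps: " + schedulerOne
                + ", scheduler two sleeps: " + schedulerTwo);
    }
}
```

In the Spring code, BusinessLogic.logic() would check shouldSleep() rather than drawing its own random number, so both schedulers take the same branch until the value is refreshed.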
I have researched this subject a lot but couldn't find any useful information, so I decided to ask my first question on this platform. I am using a scheduled executor to repeat a task at a specific period. Everything works, but there is one thing I misunderstood: if the task takes longer than the schedule interval, the executor waits for it to finish and only then starts the next run. I want a new run to start whenever the scheduled time arrives, without waiting for the previous run to finish. How can I achieve this? I used SwingWorker in a Swing project, but this project is not a Swing project. Thanks for reading.
Main method
LogFactory.log(LogFactory.INFO_LEVEL, Config.MODULE_NAME + " - Available processors for Thread Pool: " + AVAILABLE_PROCESSORS);
ScheduledExecutorService executor = Executors.newScheduledThreadPool(AVAILABLE_PROCESSORS);
LogFactory.log(LogFactory.INFO_LEVEL, Config.MODULE_NAME + " - [ScheduledExecutorService] instance created.");
MainWorker task = new MainWorker();
LogFactory.log(LogFactory.INFO_LEVEL, Config.MODULE_NAME + " - [Main worker] created...");
executor.scheduleWithFixedDelay(task, 0, Config.CHECK_INTERVAL, TimeUnit.SECONDS);
Main Worker
public class MainWorker implements Runnable {
private final NIFIncomingController controller = new NIFIncomingController();
@Override
public void run() {
LogFactory.log(LogFactory.INFO_LEVEL, Config.MODULE_NAME + " - [Task] executed - [" + Thread.currentThread().getName() + "]");
controller.run();
}
}
You can try combining several executors to achieve the desired behavior. Please find example code below:
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
public class CombinedThreadPoolsExample {
private static final DateFormat DATE_FORMAT = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
private static final int INITIAL_DELAY = 0;
private static final int FIXED_DELAY_IN_MILLISECONDS = 1000;
private static final int TASK_EXECUTION_IN_MILLISECONDS = FIXED_DELAY_IN_MILLISECONDS * 2;
public static void main(String[] args) {
int availableProcessors = Runtime.getRuntime().availableProcessors();
System.out.println("Available processors: [" + availableProcessors + "].");
ExecutorService fixedThreadPool = Executors.newFixedThreadPool(availableProcessors);
Runnable runnableThatTakesMoreTimeThanSpecifiedDelay = new Runnable() {
@Override
public void run() {
System.out.println("Thread name: [" + Thread.currentThread().getName() + "], time: [" + DATE_FORMAT.format(new Date()) + "].");
try {
Thread.sleep(TASK_EXECUTION_IN_MILLISECONDS);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
};
ScheduledExecutorService singleThreadScheduledExecutor = Executors.newSingleThreadScheduledExecutor();
singleThreadScheduledExecutor.scheduleWithFixedDelay(new Runnable() {
@Override
public void run() {
fixedThreadPool.execute(runnableThatTakesMoreTimeThanSpecifiedDelay);
}
}, INITIAL_DELAY, FIXED_DELAY_IN_MILLISECONDS, TimeUnit.MILLISECONDS);
}
}
The first lines of the output are the following for my machine:
Available processors: [8].
Thread name: [pool-1-thread-1], time: [2017-12-25 11:22:00.103].
Thread name: [pool-1-thread-2], time: [2017-12-25 11:22:01.104].
Thread name: [pool-1-thread-3], time: [2017-12-25 11:22:02.105].
Thread name: [pool-1-thread-4], time: [2017-12-25 11:22:03.105].
Thread name: [pool-1-thread-5], time: [2017-12-25 11:22:04.106].
Thread name: [pool-1-thread-6], time: [2017-12-25 11:22:05.107].
Thread name: [pool-1-thread-7], time: [2017-12-25 11:22:06.107].
Thread name: [pool-1-thread-8], time: [2017-12-25 11:22:07.107].
Thread name: [pool-1-thread-1], time: [2017-12-25 11:22:08.108].
Thread name: [pool-1-thread-2], time: [2017-12-25 11:22:09.108].
Thread name: [pool-1-thread-3], time: [2017-12-25 11:22:10.108].
Thread name: [pool-1-thread-4], time: [2017-12-25 11:22:11.109].
However, beware of relying on such a solution when a task can take a long time relative to the pool size. Let's say we increase the time a task needs to execute:
private static final int TASK_EXECUTION_IN_MILLISECONDS = FIXED_DELAY_IN_MILLISECONDS * 10;
This won't cause any exceptions during execution, but it cannot enforce the specified delay between executions: once all pool threads are busy, newly submitted tasks wait in the queue. This can be observed in the execution output after the aforementioned change:
Available processors: [8].
Thread name: [pool-1-thread-1], time: [2017-12-25 11:31:23.258].
Thread name: [pool-1-thread-2], time: [2017-12-25 11:31:24.260].
Thread name: [pool-1-thread-3], time: [2017-12-25 11:31:25.261].
Thread name: [pool-1-thread-4], time: [2017-12-25 11:31:26.262].
Thread name: [pool-1-thread-5], time: [2017-12-25 11:31:27.262].
Thread name: [pool-1-thread-6], time: [2017-12-25 11:31:28.263].
Thread name: [pool-1-thread-7], time: [2017-12-25 11:31:29.264].
Thread name: [pool-1-thread-8], time: [2017-12-25 11:31:30.264].
Thread name: [pool-1-thread-1], time: [2017-12-25 11:31:33.260].
Thread name: [pool-1-thread-2], time: [2017-12-25 11:31:34.261].
Thread name: [pool-1-thread-3], time: [2017-12-25 11:31:35.262].
This can be achieved with a single executor service. Say your schedule interval is Y.
Identify the time the task can take in the worst case; call it X.
If that time is not under your control, enforce it in the task implementation with a timeout.
If X > Y, create a second task object and double the schedule period.
So the code may look like this:
executorService.scheduleAtFixedRate(taskObject1, 0 /* initial delay */, 2 * Y, TimeUnit.SECONDS);
executorService.scheduleAtFixedRate(taskObject2, Y /* initial delay */, 2 * Y, TimeUnit.SECONDS);
If X > n(Y), create n+1 tasks with a period of (n+1)Y and initial delays of 0, Y, 2Y, ..., nY respectively.
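The staggering rule above can be sketched as follows. This is only an illustration under assumptions: the class StaggeredSchedule and the helpers copiesNeeded() and initialDelays() are hypothetical names, the values X = 500 ms and Y = 200 ms are made up, and the pool is sized to the number of task copies so that each copy has a thread available.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class StaggeredSchedule {
    // Number of task copies so that one copy's period (copies * Y) covers the
    // worst-case run time X: ceil(X / Y), and at least 1.
    public static int copiesNeeded(long worstCaseMillis, long intervalMillis) {
        return (int) Math.max(1, (worstCaseMillis + intervalMillis - 1) / intervalMillis);
    }

    // Initial delays 0, Y, 2Y, ... so that some copy starts every Y.
    public static long[] initialDelays(int copies, long intervalMillis) {
        long[] delays = new long[copies];
        for (int i = 0; i < copies; i++) {
            delays[i] = i * intervalMillis;
        }
        return delays;
    }

    public static void main(String[] args) throws InterruptedException {
        final long y = 200; // schedule interval Y (illustrative value)
        final long x = 500; // worst-case task duration X (illustrative value)
        int copies = copiesNeeded(x, y);
        long[] delays = initialDelays(copies, y);

        ScheduledExecutorService executor = Executors.newScheduledThreadPool(copies);
        for (int i = 0; i < copies; i++) {
            final int id = i;
            executor.scheduleAtFixedRate(() -> {
                System.out.println("copy " + id + " started at " + System.currentTimeMillis());
                try {
                    Thread.sleep(x); // simulate a run that takes longer than Y
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, delays[i], copies * y, TimeUnit.MILLISECONDS);
        }

        Thread.sleep(2000);     // let a few runs start
        executor.shutdownNow(); // stop the demo
    }
}
```

With these values three copies are scheduled with a period of 600 ms each, so a run starts roughly every 200 ms even though each run takes 500 ms.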
Scenario: I want to create a scheduler application that runs shell scripts according to a defined schedule. To keep it simple, I want the user to add the script name and execution timings to an external file (properties/XML) that my application will read. For now, I am planning to run this application as a background process on a Linux server. In the future we may turn it into a web app.
What I've tried so far:
I came across XMLSchedulingDataProcessorPlugin for this purpose, but it requires the user to write jobs as Java code and then add them to an XML file.
I found some scheduling examples, which presently aren't working.
Please suggest a Quartz API that can help me achieve this.
UPDATE:
public class CronTriggerExample {
public static void main(String[] args) throws Exception {
String[] a = {"script1.sh:0/10 * * * * ?", "script2.sh:0/35 * * * * ?"};
for (String config : a) {
String[] attr = config.split(":");
System.out.println("Iterating for : "+attr[0]);
JobKey jobKey = new JobKey(attr[0], attr[0]);
Trigger trigger = TriggerBuilder
.newTrigger()
.withIdentity(attr[0], attr[0])
.withSchedule(CronScheduleBuilder.cronSchedule(attr[1]))
.build();
Scheduler scheduler = new StdSchedulerFactory().getScheduler();
scheduler.getContext().put("val", config);
JobDetail job = JobBuilder.newJob(HelloJob.class).withIdentity(jobKey).build();
scheduler.start();
scheduler.scheduleJob(job, trigger);
System.out.println("=======================");
}
}
}
My HelloJob class:
public class HelloJob implements Job {
public void execute(JobExecutionContext context) throws JobExecutionException {
String objectFromContext = null;
Date date = new Date();
try {
SchedulerContext schedulerContext = context.getScheduler().getContext();
objectFromContext = (String) schedulerContext.get("val");
} catch (SchedulerException ex) {
ex.printStackTrace();
}
System.out.println("Triggered "+objectFromContext+" at: "+date);
}
}
OUTPUT:
Iterating for : script1.sh
log4j:WARN No appenders could be found for logger (org.quartz.impl.StdSchedulerFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
=======================
Iterating for : script2.sh
=======================
Triggered script2.sh:0/35 * * * * ? at: Mon Apr 18 12:21:50 IST 2016
Triggered script2.sh:0/35 * * * * ? at: Mon Apr 18 12:22:00 IST 2016
Triggered script2.sh:0/35 * * * * ? at: Mon Apr 18 12:22:00 IST 2016
Triggered script2.sh:0/35 * * * * ? at: Mon Apr 18 12:22:10 IST 2016
Triggered script2.sh:0/35 * * * * ? at: Mon Apr 18 12:22:20 IST 2016
Triggered script2.sh:0/35 * * * * ? at: Mon Apr 18 12:22:30 IST 2016
Triggered script2.sh:0/35 * * * * ? at: Mon Apr 18 12:22:35 IST 2016
Triggered script2.sh:0/35 * * * * ? at: Mon Apr 18 12:22:40 IST 2016
What am I missing? I tried to create a new Job for each iteration and pass the script name through the JobExecutionContext.
The tutorial below will help you schedule a shell script.
http://www.mkyong.com/java/how-to-run-a-task-periodically-in-java/
By using
Runtime.getRuntime().exec("sh shellscript.sh");
you can run a shell script.
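As a hedged alternative sketch, ProcessBuilder from the standard library does the same job as Runtime.exec while also letting you capture the script's output and exit code. The class ShellRunner, its nested Result type, and the echo command are made up for this illustration; in practice you would pass the script path read from your configuration file. The sketch assumes an sh binary is available on the PATH.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ShellRunner {
    // Exit code and combined stdout/stderr of one finished command.
    public static final class Result {
        public final int exitCode;
        public final String output;

        Result(int exitCode, String output) {
            this.exitCode = exitCode;
            this.output = output;
        }
    }

    // Runs the command, waits for it to finish, and returns what it printed.
    public static Result run(String... command) {
        try {
            ProcessBuilder builder = new ProcessBuilder(command);
            builder.redirectErrorStream(true); // merge stderr into stdout
            Process process = builder.start();

            StringBuilder output = new StringBuilder();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    output.append(line).append('\n');
                }
            }
            return new Result(process.waitFor(), output.toString());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for the command", e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for "sh /full-path/shell_script_name.sh".
        Result result = run("sh", "-c", "echo hello");
        System.out.println("exit code: " + result.exitCode);
        System.out.print(result.output);
    }
}
```

Reading the output before calling waitFor() avoids the script blocking on a full pipe buffer when it prints a lot.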
I would take the following approach:
Create a class JobShellRunner that implements the Job interface from Quartz:
public class JobShellRunner implements Job {
@Override
public void execute(JobExecutionContext context)
throws JobExecutionException {
// Here you take the needed information about the shell script from the context and run the script
}
}
Read the properties file (I assume it holds the information about which shell script to run and its schedule). For each entry, create its job and its trigger:
JobKey jobKey = new JobKey("jobShellRunner", "group1");
// put the needed information about the shell script (path, etc.) in the job key
JobDetail jobA = JobBuilder.newJob(JobShellRunner.class)
.withIdentity(jobKey).build();
Then the trigger (note that the cron expression should be replaced with the one you read from the properties file):
Trigger trigger = TriggerBuilder
.newTrigger()
.withIdentity("dummyTriggerName1", "group1")
.withSchedule(
CronScheduleBuilder.cronSchedule("0/5 * * * * ?"))
.build();
Then schedule the job:
scheduler.scheduleJob(jobA, trigger);
You can do something like the following.
1. Create a Java application with one scheduled job that reads a property/XML file at some interval; the file says which shell file to execute and at what time.
2. When the scheduled job reads the property/XML file, it gets the following information:
2.1. Shell script file name
2.2. The time at which that script needs to be executed
3. Using the information read in step 2, this job creates a new, independent job that is fully responsible for executing the shell script at the given time (the job time read from your property/XML file). Take care that it runs one time only (as per your requirement).
4. This step repeats until the job has read all the information, generating one new job each time.
5. If the user later edits, updates, or adds a line in the property/XML file, the scheduled job reads only those new changes and proceeds as explained above.
You can see the image below for a better understanding.
For scheduling, you can set up the Spring Quartz API to schedule the job.
Here is a little pseudocode:
public class JobA implements Job {
@Override
public void execute(JobExecutionContext context)
throws JobExecutionException {
// keep reading the property/XML file until it has been read completely
// based on the information read, generate a new one-time job at runtime that executes the shell script
// a shell script can be executed from a Java program like this:
// Runtime.getRuntime().exec("sh /full-path/shell_script_name.sh");
}
}
............
public class CronTriggerExample {
public static void main( String[] args ) throws Exception
{
JobKey jobKeyA = new JobKey("jobA", "group1");
JobDetail jobA = JobBuilder.newJob(JobA.class)
.withIdentity(jobKeyA).build();
Trigger trigger1 = TriggerBuilder
.newTrigger()
.withIdentity("dummyTriggerName1", "group1")
.withSchedule(
CronScheduleBuilder.cronSchedule("0/5 * * * * ?")) // you can set here your comfortable job time...
.build();
Scheduler scheduler = new StdSchedulerFactory().getScheduler();
scheduler.start();
scheduler.scheduleJob(jobA, trigger1);
}
}
So this is the idea I am presenting here, which I believe best suits your requirement.
We've performed a performance test with Oracle Advanced Queue on our Oracle DB environment. We've created the queue and the queue table with the following script:
BEGIN
DBMS_AQADM.create_queue_table(
queue_table => 'verisoft.qt_test',
queue_payload_type => 'SYS.AQ$_JMS_MESSAGE',
sort_list => 'ENQ_TIME',
multiple_consumers => false,
message_grouping => 0,
comment => 'POC Authorizations Queue Table - KK',
compatible => '10.0',
secure => true);
DBMS_AQADM.create_queue(
queue_name => 'verisoft.q_test',
queue_table => 'verisoft.qt_test',
queue_type => dbms_aqadm.NORMAL_QUEUE,
max_retries => 10,
retry_delay => 0,
retention_time => 0,
comment => 'POC Authorizations Queue - KK');
DBMS_AQADM.start_queue('q_test');
END;
/
We've published 1000000 messages with 2380 TPS using a PL/SQL client. And we've consumed 1000000 messages with 292 TPS, using Oracle JMS API Client.
The consumer is roughly eight times slower than the publisher (292 vs. 2380 TPS), and that speed does not meet our requirements.
Below is the piece of Java code that we use to consume messages:
if (q == null) initializeQueue();
System.out.println(listenerID + ": Listening on queue " + q.getQueueName() + "...");
MessageConsumer consumer = sess.createConsumer(q);
for (Message m; (m = consumer.receive()) != null;) {
new Timer().schedule(new QueueExample(m), 0);
}
sess.close();
con.close();
Do you have any suggestions on how we can improve performance on the consumer side?
Your use of Timer may be your primary issue. The Timer documentation reads:
Corresponding to each Timer object is a single background thread that is used to execute all of the timer's tasks, sequentially. Timer tasks should complete quickly. If a timer task takes excessive time to complete, it "hogs" the timer's task execution thread. This can, in turn, delay the execution of subsequent tasks, which may "bunch up" and execute in rapid succession when (and if) the offending task finally completes.
I would suggest you use a ThreadPool.
// My executor.
ExecutorService executor = Executors.newCachedThreadPool();
public void test() throws InterruptedException {
for (int i = 0; i < 1000; i++) {
final int n = i;
// Instead of using Timer, create a Runnable and pass it to the Executor.
executor.submit(new Runnable() {
@Override
public void run() {
System.out.println("Run " + n);
}
});
}
executor.shutdown();
executor.awaitTermination(1, TimeUnit.DAYS);
}
My code looks like this:
String path = "/home/user/tmp/file1";
Path p = FileSystems.getDefault().getPath(path);
PosixFileAttributes attrs = Files.readAttributes(p, PosixFileAttributes.class);
System.out.println("Last Modified Time: "+attrs.lastModifiedTime());
System.out.println("Last Access Time: "+attrs.lastAccessTime());
The times returned by lastModifiedTime() and lastAccessTime() differ from the correct ones by 4 hours.
The output is:
Last Modified Time: 2014-06-25T12:50:31Z
Last Access Time: 2014-06-25T18:26:07Z
stat file1 produces:
Access: 2014-06-25 14:26:07.870281008 -0400
Modify: 2014-06-25 08:50:31.922861913 -0400
Change: 2014-06-25 08:50:31.922861913 -0400
Can anyone help me?
A time like
2014-06-25T12:50:31Z
is in UTC (that's the Z at the end), so it differs from your local time by your time zone's offset (-0400 in your stat output).
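If you want the value printed in local time, one option is to convert the FileTime to an Instant and apply a zone. A minimal sketch: the class FileTimeZones and its format() helper are hypothetical names, the timestamp is the one from the question, and the fixed -04:00 offset is taken from the stat output. On a real system you would normally pass ZoneId.systemDefault() instead.

```java
import java.nio.file.attribute.FileTime;
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class FileTimeZones {
    // Same layout as stat (without fractional seconds): date, time, numeric offset.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss Z");

    // FileTime is a point in time without a zone; the zone is applied only for display.
    public static String format(FileTime time, ZoneId zone) {
        return time.toInstant().atZone(zone).format(FORMAT);
    }

    public static void main(String[] args) {
        // The modification time from the question, as reported in UTC.
        FileTime modified = FileTime.from(Instant.parse("2014-06-25T12:50:31Z"));

        // With the offset shown by stat this prints 2014-06-25 08:50:31 -0400.
        System.out.println(format(modified, ZoneOffset.ofHours(-4)));

        // With the JVM's default zone the result depends on the machine.
        System.out.println(format(modified, ZoneId.systemDefault()));
    }
}
```

The same call works directly on attrs.lastModifiedTime() and attrs.lastAccessTime() from the question's code.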