Spark Job creation time


How can I access the job creation time using TaskContext?
I'm planning to read this time on the different executors and store it with the persisted data, which would be helpful in later processes. Since the job creation time is unique to a job, even when retrieved from different executors, it helps to keep track of the data persisted within one job.
Is it possible to get it from TaskMetrics?
How can I access the JobData class?

I could do this using listeners. I had to extend the class SparkListener, track when a job (all the tasks for a job) starts and ends, and perform actions depending on that.

Related

Perform an action on an entity after a fixed time

In my Spring project, I have an entity Customer.
Now once we get a new Customer, we persist it in our system, and exactly after one hour, I want to check if the Customer has made any purchase.
If yes, I take some action. If not, some other action.
I contemplated two strategies:
1) Firing up an event when the Customer is persisted. And then having the event listener thread sleep for one hour. I believe this will be a very bad way to handle this.
2) Having a cron check every once in a while for customers for whom one hour has passed since registration. But then, I figure it will be very difficult to be accurate. I would have to run the cron every minute which won't be great.
Any ideas?

You could use the 'ScheduledThreadPoolExecutor' which as per javadoc is:
A ThreadPoolExecutor that can additionally schedule commands to run after a given delay, or to execute periodically
In your case, when a customer is created, you can use the 'schedule' method to run a task after 1 hour and perform the required activities. If you want those activities to be executed periodically as well, the same executor provides 'scheduleAtFixedRate' and 'scheduleWithFixedDelay'.
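A minimal sketch of the one-shot variant, assuming a hypothetical hasPurchased lookup (in a real application it would query the database) and a short delay standing in for the one hour. Note that tasks scheduled this way live only in memory, so they are lost if the application restarts.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.function.LongPredicate;

class DelayedCustomerCheck {

    // Schedules a one-shot check for the given customer after the given delay.
    // hasPurchased is a hypothetical lookup supplied by the caller.
    static ScheduledFuture<String> scheduleCheck(ScheduledThreadPoolExecutor scheduler,
                                                 long customerId,
                                                 LongPredicate hasPurchased,
                                                 long delay,
                                                 TimeUnit unit) {
        Callable<String> check =
                () -> hasPurchased.test(customerId) ? "PURCHASED" : "NO_PURCHASE";
        return scheduler.schedule(check, delay, unit);
    }

    public static void main(String[] args) throws Exception {
        ScheduledThreadPoolExecutor scheduler = new ScheduledThreadPoolExecutor(1);
        try {
            // 100 ms stands in for the one-hour delay from the question.
            ScheduledFuture<String> result =
                    scheduleCheck(scheduler, 42L, id -> false, 100, TimeUnit.MILLISECONDS);
            System.out.println("Customer 42: " + result.get());
        } finally {
            scheduler.shutdown();
        }
    }
}
```

In the Spring application you would call scheduleCheck right after persisting the Customer, with a delay of 1 and TimeUnit.HOURS.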

I believe running the cron every minute is not that bad. How many customers would you handle in one minute?

I'm not sure, though, why you cannot react to the purchase event itself: when a particular registered customer makes a purchase, you can take the action inline at that point.
You described two strategies. Both will work, but I would prefer to run a cron job, which you can configure explicitly. That way you avoid the overhead of maintaining the threads. If you configure the cron job timing correctly and allow only a single job to run at a time, I do not see any problem with that. Remember that cron jobs are meant for batch processing rather than for handling events.

Implementing scheduler in spring (defined by user)

I am developing a Spring MVC application.
I have gone through the links below:
http://docs.spring.io/spring/docs/current/spring-framework-reference/html/scheduling.html#scheduling-annotation-support-scheduled
http://www.mkyong.com/spring-batch/spring-batch-and-spring-taskscheduler-example/
These explain how to schedule tasks.
But I have to let users schedule some functionality (to run daily, weekly, etc.) from the GUI.
Can anyone please help me achieve this?

Suppose you have several tasks to be scheduled by the user.
Define an enum for the task names and a Runner that runs a task by enum value. Define a job to be executed every second (minute, hour). The job just checks whether there is a user's task to be executed.
Now the user defines such a task with the following params:
TaskType (the Enum value)
TaskTime (when it should be started e.g. 12:00)
TaskPeriod (how often it should be called)
TaskTime and TaskPeriod could be combined, e.g. in a cron expression.
Then all the task info is stored somewhere (e.g. in a DB).
Every second, your permanent job reads from the DB to see whether there is a task to be executed. It checks the task time and task period and compares them with the current time. If it's time to start, it gets the enum value and calls the Runner's method for that enum value.
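The steps above can be sketched roughly as follows. All names here (UserTaskPoller, TaskType values, UserTask) are hypothetical, an in-memory list stands in for the DB table, and the permanent job is reduced to a poll method that you would call from a @Scheduled method or a ScheduledExecutorService.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

class UserTaskPoller {

    // Hypothetical task names the user can pick from in the GUI.
    enum TaskType { SEND_REPORT, CLEAN_UP }

    // One user-defined task: what to run, when to start, how often to repeat.
    static final class UserTask {
        final TaskType type;
        final LocalDateTime firstRun;
        final Duration period;
        LocalDateTime lastRun; // null until the task has run once

        UserTask(TaskType type, LocalDateTime firstRun, Duration period) {
            this.type = type;
            this.firstRun = firstRun;
            this.period = period;
        }

        // Due if the start time has been reached and a full period has passed since the last run.
        boolean isDue(LocalDateTime now) {
            if (now.isBefore(firstRun)) {
                return false;
            }
            return lastRun == null || !now.isBefore(lastRun.plus(period));
        }
    }

    private final Map<TaskType, Runnable> runners = new EnumMap<>(TaskType.class);
    private final List<UserTask> tasks = new ArrayList<>(); // stands in for the DB table

    void register(TaskType type, Runnable runner) {
        runners.put(type, runner);
    }

    void add(UserTask task) {
        tasks.add(task);
    }

    // Called by the permanent job; runs every due task and returns how many were started.
    int poll(LocalDateTime now) {
        int started = 0;
        for (UserTask task : tasks) {
            if (task.isDue(now)) {
                runners.getOrDefault(task.type, () -> { }).run();
                task.lastRun = now;
                started++;
            }
        }
        return started;
    }

    public static void main(String[] args) {
        UserTaskPoller poller = new UserTaskPoller();
        poller.register(TaskType.SEND_REPORT, () -> System.out.println("sending report"));

        LocalDateTime noon = LocalDateTime.of(2024, 1, 1, 12, 0);
        poller.add(new UserTask(TaskType.SEND_REPORT, noon, Duration.ofDays(1)));

        System.out.println("started: " + poller.poll(noon.minusMinutes(1)));
        System.out.println("started: " + poller.poll(noon));
        System.out.println("started: " + poller.poll(noon.plusDays(1)));
    }
}
```

Recording lastRun as the poll time keeps the sketch short but lets the schedule drift by up to one polling interval per run; storing a cron expression and computing the next fire time avoids that.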

Please check the link. It explains how to schedule tasks by giving cron expressions in a property file.
Another solution is to use the Quartz library directly. We can schedule or reschedule jobs easily with it. Refer to this.
Hope this will help.

Adding a trigger in Quartz scheduler for future use

The Quartz API provides a way to create a job and add it to the scheduler for future use, by doing something like
SchdularFactory.getSchedulerInstance().addJob(jobDetail, false);
This gives me the flexibility to create jobs, store them with the scheduler, and use them at a later stage.
I am wondering whether there is any way to create triggers and add them to the scheduler to be used in the future.
I'm not sure if this is a valid requirement, but if it's not possible, then all I have to do is associate the trigger with some existing job.

In Quartz there is a one-to-many relationship between jobs and triggers, which is understandable: one job can be run by several different triggers but one trigger can only run a single job. If you need to run several jobs, create one composite job that runs these jobs manually.
Back to your question. Creating a job without associated triggers is a valid use-case: you have a piece of logic and later you will attach one or more triggers to execute it at different points in time.
The opposite situation is weird: you want to create a trigger that will run something at a given time, but you don't know yet what. I can't imagine a use case for that.
Note that you can create a trigger for future use (with next fire time far in the future), but it must have a job attached.
Finally, check out How-To: Storing a Job for Later Use in the official documentation.

What is the most appropriate way to manage threads executing the same task?

I have a lot of data in a database (PostgreSQL) and need to process all of it. My program has threads to process this data and follows this logic:
Get a part of data from database
Process
Save data
I have doubts about the best way to do this. I have three ideas:
Create a manager class that runs in a loop getting data from the database and holding a queue of objects to process. Create a process class that runs in a loop getting the object to process from the manager class.
Do the same as above, but without the manager class, so the process classes share the queue of objects among themselves and are also responsible for getting the data from the database.
A manager class that runs in a loop getting data from the database, but it creates the process classes with the data to process, so the process class won't request anything from the manager. It is created, does its processing, and is destroyed, rather than running in a loop.
I don't know which is better, or whether there is another, more efficient solution.

You are describing the so-called manager-worker model. I think your first option is the better one.
The manager pushes data into a queue and multiple workers process it. You can use a thread pool for the workers. The workers wait on the queue. Once work is pushed to the queue, one of the workers takes it immediately. When they are done, they can push the result into an outgoing queue, and yet another thread will send the data to the DB. Alternatively, each worker can save its results itself. It is up to you and depends on your task.
Use Executors and BlockingQueue for the implementation. Everything you need is in the java.util.concurrent package, and you can find a lot of tutorials and examples on the web showing how to use them.
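A minimal sketch of that setup, under some assumptions: the rows are plain integers, "processing" just doubles them, and an empty Optional is used as an end-of-work marker (one per worker) so the workers know when to stop. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.Queue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

class ManagerWorkerDemo {

    // The calling thread acts as the manager: it feeds the queue while a pool of workers drains it.
    static List<Integer> processAll(List<Integer> rows, int workerCount) throws InterruptedException {
        // Bounded queue so the manager cannot run far ahead of the workers.
        BlockingQueue<Optional<Integer>> queue = new LinkedBlockingQueue<>(100);
        Queue<Integer> results = new ConcurrentLinkedQueue<>();
        ExecutorService workers = Executors.newFixedThreadPool(workerCount);

        for (int i = 0; i < workerCount; i++) {
            workers.submit(() -> {
                try {
                    while (true) {
                        Optional<Integer> item = queue.take(); // blocks until work arrives
                        if (!item.isPresent()) {
                            return; // end-of-work marker
                        }
                        results.add(item.get() * 2); // placeholder for the real processing
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        for (Integer row : rows) {
            queue.put(Optional.of(row)); // in the real program: rows fetched from PostgreSQL
        }
        for (int i = 0; i < workerCount; i++) {
            queue.put(Optional.empty()); // one marker per worker
        }

        workers.shutdown();
        workers.awaitTermination(1, TimeUnit.MINUTES);

        List<Integer> sorted = new ArrayList<>(results);
        Collections.sort(sorted); // workers finish in arbitrary order
        return sorted;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(processAll(Arrays.asList(3, 1, 2), 2));
    }
}
```

Here each worker stores its own result. For the outgoing-queue variant, replace the results collection with a second BlockingQueue drained by a dedicated writer thread.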
Good luck.

While your first suggestion is good, I'd try to simplify it a bit
Create a manager class that runs in a loop getting data from database and holding a queue of objects to process. Create a process class that runs in a loop getting the object to process from the manager class.
I'd create a manager class that obtains a list of the current data to process. It then creates instances of executors, each of which simply runs through the single data set it is given when it is created. They then exit.
The manager is responsible for producing the looping, or iterating the data sets it's aware of at a given time. I'd further abstract that and have a scheduled task creating a manager periodically to process new data sets.
The reason for this is that it simplifies concurrent programming. The data set processor is only aware of a single set of data, and you can program it as if it is ignorant of concurrency. It gets a job, processes it, and it's done.
Likewise for the manager, it gets a set of data, processes it by creating processors, and it's done.
The last part of the puzzle would be to ensure that no two managers, if you allow multiple instances, are assigned the same sets of data. It is probably easiest to reason about if you only create a single thread pool to run managers in. If the scheduled time comes up and there's still a manager running, then you don't create a new one.
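One way to sketch the "don't start a new manager while one is still running" rule is a simple flag around the manager run. This is only an illustration with made-up names (SingleManagerGuard, tryRun), not the only way to do it. A single-threaded scheduler using scheduleWithFixedDelay would also prevent overlap.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class SingleManagerGuard {

    private final AtomicBoolean running = new AtomicBoolean(false);

    // Runs the manager unless another run is still in progress.
    // Returns true if the manager ran, false if this invocation was skipped.
    boolean tryRun(Runnable manager) {
        if (!running.compareAndSet(false, true)) {
            return false; // a manager is still running, skip this scheduled slot
        }
        try {
            manager.run();
            return true;
        } finally {
            running.set(false);
        }
    }

    public static void main(String[] args) {
        SingleManagerGuard guard = new SingleManagerGuard();

        // Simulate a scheduled slot arriving while a manager is still active
        // by attempting a second run from inside the first one.
        boolean outer = guard.tryRun(() -> {
            boolean inner = guard.tryRun(() -> System.out.println("second manager"));
            System.out.println("second manager started: " + inner);
        });
        System.out.println("first manager ran: " + outer);
    }
}
```

The scheduled task would call tryRun with the manager's work at every tick and simply do nothing when it returns false.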

Job queueing and execute Mechanism

In my web service, all method calls submit jobs to a queue. Basically, these operations take a long time to execute, so each of them submits a job to a queue and returns a status saying "Submitted". The client then keeps polling another service method to check the status of the job.
Presently, what I do is create my own Queue and Job classes that are Serializable, and persist these jobs (i.e., their serialized byte stream format) into the database. So an UpdateLogistics operation just queues up an "UpdateLogisticsJob" and returns. I have written my own JobExecutor, which wakes up every N seconds, scans the database table for any existing jobs, and executes them. Note that the jobs have to be persisted because they have to survive app-server crashes.
This was done a long time ago, and I used bespoke classes for my Queues, Jobs, Executors etc. But now, I would like to know has someone done something similar before? In particular,
Are there frameworks available for this ? Something in Spring/Apache etc
Any framework that is easy to adapt/debug and plays well along with libraries like Spring will be great.
EDIT - Quartz
Sorry if I had not explained this well enough. Quartz is good for stateless jobs (and also for some stateful jobs), but the key for me is very stateful, persisted "job instances" (not just jobs or tasks). So, for example, an operation such as executeWorkflow("SUBMIT_LEAVE") might actually create 5 job instances, each with at least 5-10 parameters like userId, accountId, etc., to be saved into the database.
I was looking for some support in that area, where job instances can be saved into the DB, recreated, etc.

Take a look at JBoss jBPM. It's a workflow definition package that lets you mix automated and manual processes. Tasks are persisted to a DB back end, and it looks like it has some asynchronous execution properties.

I haven't used Quartz for a long time, but I suspect it would be capable of everything you want to do.

spring-batch plus quartz

Depending upon the nature of your job, you might also look into spring-integration to assist with queue processing. But spring-batch will probably handle most of your requirements.

Please try ted-driver (https://github.com/labai/ted)
Its purpose is similar to what you need: you create a task (or many of them), which is saved in the db, and then ted-driver is responsible for executing it. On error, you can postpone the retry for later or finish with an error status.
Unlike other Java frameworks, here the tasks are kept in a simple and clear structure in the database, where you can manually search or update them using standard SQL.
