I have a Java application that does Flink batch processing of a batch obtained by querying tables from a database and feeds the result into a Kafka topic. How would I schedule this to run periodically? Is there a Flink scheduler? For example, my Java application should keep running in the background, and the Flink scheduler should periodically query the tables from the database, batch process the result with Flink, and feed it into Kafka (the Flink batch processing and feeding into Kafka are already done as part of my application). Please help if anyone has pointers on this.
Flink does not provide a job scheduler.
Have you considered implementing the use case with a continuously running Flink DataStream application? You could implement a SourceFunction that periodically queries the database.
Continuous streaming applications have the benefits of fewer moving parts (no scheduler, no failure handling if something goes wrong) and a consistent view across the boundaries of "batches". The downside is that the job is always consuming resources (Flink is not able to automatically scale down at low load).
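As a rough sketch of the periodic-query idea (not a definitive implementation): the JDBC URL, credentials, query, and poll interval below are placeholders, and SourceFunction is deprecated in recent Flink versions in favor of the new Source API, but it shows the shape of such a source.

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

/**
 * Periodically polls a JDBC table and emits each row as a String.
 * All connection details and the query are placeholders for illustration.
 */
public class PeriodicJdbcSource extends RichSourceFunction<String> {

    private static final long POLL_INTERVAL_MS = 60_000L; // example: poll once a minute
    private volatile boolean running = true;
    private transient Connection connection;

    @Override
    public void open(Configuration parameters) throws Exception {
        // Hypothetical connection details -- replace with your own.
        connection = DriverManager.getConnection(
                "jdbc:postgresql://db-host:5432/mydb", "user", "password");
    }

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        while (running) {
            try (Statement stmt = connection.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT id, payload FROM my_table")) {
                while (rs.next()) {
                    // Hold the checkpoint lock while emitting so records and
                    // checkpoints stay consistent.
                    synchronized (ctx.getCheckpointLock()) {
                        ctx.collect(rs.getLong("id") + "," + rs.getString("payload"));
                    }
                }
            }
            Thread.sleep(POLL_INTERVAL_MS); // wait before the next poll
        }
    }

    @Override
    public void cancel() {
        running = false;
    }

    @Override
    public void close() throws Exception {
        if (connection != null) {
            connection.close();
        }
    }
}
```

The rest of your existing pipeline (processing and the Kafka sink) would then hang off the DataStream produced by this source.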
I am creating an Apache Beam pipeline that reads data from Cloud Storage and writes that data to Bigtable. I want the job to stop automatically by itself once the data has been fully read and written. How do I accomplish that?
How to stop a streaming pipeline in google cloud dataflow
I saw this question, and there is a way to cancel the job. But that would stop it before the execution has finished. How can I ensure the job is done?
The pipeline you described is a batch pipeline because it reads from a bounded source.
The job will automatically finish after reading all your data from the GCS files and writing it to Bigtable.
There is no need to drain or stop it because it's not a streaming job.
It will stop by itself after processing all the data.
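For reference, a minimal sketch of such a batch pipeline in Java; the bucket path is made up, and the Bigtable write step is only indicated in a comment because it depends on your schema and mutation building:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class GcsToBigtableJob {
    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        // Reading files from GCS is a bounded source, so this is a batch pipeline.
        p.apply("ReadFromGcs", TextIO.read().from("gs://my-bucket/input/*.csv"));
        // ... parse each line into Bigtable mutations and write them with
        //     BigtableIO.write() here (details depend on your schema) ...

        // run() submits the job; waitUntilFinish() blocks until the bounded input
        // has been fully processed, after which the job ends on its own.
        p.run().waitUntilFinish();
    }
}
```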
Hi, I am new to Flink and trying to figure out some best practices for the following scenario:
I am playing around with a Flink job that reads unique data from multiple CSV files. Each row in the CSV is composed of three columns: userId, appId, name. I have to do some processing on each of these records (capitalize the name) and post the record to a Kafka Topic.
The goal is to filter out any duplicate records that exist so we do not have duplicate messages in the output Kafka Topic.
I am doing a keyBy(userId, appId) on the stream and keeping a boolean value state "Processed" to filter out duplicate records.
The issue is that when I cancel the Task Manager in the middle of processing a file, to simulate a failure, it starts processing the file from the beginning once it restarts.
This is a problem because the "Processed" State in the Flink job is also wiped clean after the Task Manager fails!
This leads to duplicate messages in the output Kafka topic.
How can I prevent this?
I need to restore the "Processed" Flink state to what it was prior to the Task Manager failing. What is the best practice to do this?
Would Flink's CheckpointedFunction (https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/state/#checkpointedfunction) help? I think not, because this is a keyed stream.
Things to consider:
Flink Checkpointing is already enabled.
K8s pods for the Task Managers (they can be scaled very fast), and parallelism is always > 1.
Files can have millions of rows and need to be processed in parallel.
Thank you for the help!
I would recommend reading up on Flink's fault tolerance mechanisms, checkpointing & savepointing. https://nightlies.apache.org/flink/flink-docs-master/docs/learn-flink/fault_tolerance/ and https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/checkpoints/ are good places to start.
I think you could also achieve your deduplication more easily by using the Table API/SQL. See https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/queries/deduplication/
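For illustration, a sketch of that deduplication query submitted through the Java Table API; the table names and the proctime column are assumptions, and the source/sink tables would have to be registered first (connector details omitted):

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class DeduplicationSqlExample {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Assumes a source table "records" (userId, appId, name, proctime) and a
        // sink table "deduped_records" have already been registered via connectors.
        tEnv.executeSql(
            "INSERT INTO deduped_records " +
            "SELECT userId, appId, name FROM ( " +
            "  SELECT *, ROW_NUMBER() OVER ( " +
            "    PARTITION BY userId, appId ORDER BY proctime ASC) AS row_num " +
            "  FROM records " +
            ") WHERE row_num = 1");
    }
}
```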
You need to use Flink's managed keyed state, which Flink will maintain in a manner that is consistent with the sources and sinks, and Flink will guarantee exactly-once behavior provided you set up checkpointing and configure the sinks properly.
There's a tutorial on this topic in the Flink documentation, and it happens to use deduplication as the example use case.
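Roughly, that deduplication pattern with managed keyed state looks like the sketch below; records are modeled here as Tuple3<userId, appId, name> and the key is simplified to a single String, so adapt it to your actual types:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

/**
 * Emits a record only the first time its key (userId + appId) is seen.
 * Because the "seen" flag lives in Flink's managed keyed state, it is part of
 * every checkpoint and gets restored after a Task Manager failure.
 */
public class DeduplicateFunction
        extends KeyedProcessFunction<String, Tuple3<String, String, String>, Tuple3<String, String, String>> {

    private transient ValueState<Boolean> seen;

    @Override
    public void open(Configuration parameters) {
        seen = getRuntimeContext().getState(
                new ValueStateDescriptor<>("seen", Boolean.class));
    }

    @Override
    public void processElement(Tuple3<String, String, String> record,
                               Context ctx,
                               Collector<Tuple3<String, String, String>> out) throws Exception {
        if (seen.value() == null) {   // first time this (userId, appId) key is seen
            seen.update(true);
            out.collect(record);
        }
        // otherwise it's a duplicate and we drop it
    }
}
```

It would be applied after keying the stream, e.g. stream.keyBy(r -> r.f0 + "|" + r.f1).process(new DeduplicateFunction()); with checkpointing enabled, the "seen" flags survive Task Manager restarts.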
For more info on how to configure Kafka for exactly-once operation, see Exactly-once with Apache Kafka.
Disclaimer: I wrote that part of the Flink documentation, and I work for Immerok.
Is there a simple way to start jobs only when certain data is available in a DB or a JMS queue, i.e. some kind of conditional job launching / hook, ideally without using a program that constantly checks the state of a data source?
Imagine: a preceding job has written some data to a DB, and only once this data is written should the next job start.
While testing the behavior of Spark jobs when multiple jobs are submitted to run concurrently, or when smaller jobs are submitted later, I came across two settings in the Spark UI. One is the scheduling mode available within Spark, as shown in the image below.
And one is under the scheduler, as shown below.
I want to understand the difference between the two settings, and how preemption fits in. My requirement is that while a bigger job is running, small jobs submitted in between must get resources without waiting too long.
Let me explain it for the Spark on YARN mode.
When you submit Scala code to Spark, the Spark client interacts with YARN and launches a YARN application. This application is responsible for all the jobs in your Scala code. In most cases, each job corresponds to a Spark action such as reduce() or collect(). Then the question arises: how should the different jobs in this application be scheduled, for example when 3 concurrent jobs come out of your application and are waiting for execution? To deal with this, Spark defines scheduling rules for jobs: FIFO and Fair. That is to say, the Spark scheduler, whether FIFO or Fair, operates at the level of jobs, and it is the Spark ApplicationMaster that does this scheduling work.
But YARN's scheduler operates at the level of containers. YARN doesn't care what is running in a container; it might be a Mapper task, a Reducer task, a Spark driver process, a Spark executor process, and so on. For example, your MapReduce job is currently asking for 10 containers, each needing 10 GB of memory and 2 vcores, and your Spark application is currently asking for 4 containers, each needing 10 GB of memory and 2 vcores. YARN has to decide how many containers are available in the cluster and how much resource should be allocated for each request according to a rule; this rule is YARN's scheduler, which includes the FairScheduler and the CapacityScheduler.
In general, your Spark application asks YARN for several containers, and YARN decides through its scheduler how many containers can currently be allocated to your Spark application. After these containers are allocated, the Spark ApplicationMaster decides how to distribute them among its jobs.
Here is the official documentation on the Spark scheduler: https://spark.apache.org/docs/2.0.0-preview/job-scheduling.html#scheduling-within-an-application
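To make the two levels concrete, here is a small sketch of the application-side settings; the queue name is only an example, and the YARN scheduler itself (Fair/Capacity) is configured on the cluster, not in application code:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

public class SchedulerConfigExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("scheduler-demo")
                // Job-level scheduling inside this one Spark application
                // (handled by the Spark scheduler in the ApplicationMaster):
                .set("spark.scheduler.mode", "FAIR")
                // Container-level scheduling across applications is decided by YARN;
                // from the application side you only pick the YARN queue to submit to:
                .set("spark.yarn.queue", "default");

        SparkSession spark = SparkSession.builder().config(conf).getOrCreate();
        // Jobs submitted through this session are now scheduled fairly among one
        // another, while YARN's own scheduler arbitrates container requests
        // between applications.
        spark.stop();
    }
}
```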
I think spark.scheduler.mode (FAIR/FIFO), shown in the figure, is for scheduling task sets (tasks of the same stage) submitted to the TaskScheduler using a FAIR or FIFO policy, etc. These task sets belong to the same job.
To be able to run jobs concurrently, execute each job (transformations + action) in a separate thread. When a job is submitted to the DAG scheduler, the submitting thread is blocked until the job completes and the result is returned or saved.
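As a sketch of that threading approach (the parquet paths are hypothetical, and FAIR mode is enabled so the concurrently submitted jobs share the executors):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ConcurrentJobsExample {
    public static void main(String[] args) throws InterruptedException {
        SparkSession spark = SparkSession.builder()
                .appName("concurrent-jobs")
                .config("spark.scheduler.mode", "FAIR") // share executors fairly across jobs
                .getOrCreate();

        // Hypothetical input paths -- replace with your own data sets.
        Dataset<Row> big = spark.read().parquet("/data/big");
        Dataset<Row> small = spark.read().parquet("/data/small");

        ExecutorService pool = Executors.newFixedThreadPool(2);

        // Each action is submitted from its own thread, so the small job does
        // not have to wait for the big job's action to return.
        pool.submit(() -> System.out.println("big count: " + big.count()));
        pool.submit(() -> System.out.println("small count: " + small.count()));

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        spark.stop();
    }
}
```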
I am planning to use Spring Batch in a distributed environment to do some batch processing tasks.
By a distributed environment I mean I have a set of boxes fronted by a web service. A load balancer then distributes the jobs to the boxes.
Now I have few questions:
1) What happens if a job is terminated halfway through (say the box got restarted)? Will Spring Batch automatically restart the job, or do I need to write my own custom watcher and then call the Spring Batch API to restart the job?
2) If Spring Batch has this kind of auto-restart, can 2 boxes pick up and execute the same job at once?
Is this the case?
Spring Batch has four strategies to handle scalability, see here for further details:
Multi-threaded Step (single process)
Parallel Steps (single process)
Remote Chunking of Step (multi process)
Partitioning a Step (single or multi process)
Yours is a multi-process scenario, so you can choose between remote chunking of a step and partitioning a step, depending on the cost of the read part compared to the process/write part.
But in both cases there cannot be two instances that do duplicate work; it's all designed to avoid that. That could only happen if, by accident, one of the two single-process mechanisms were deployed on different machines, which would cause the problem you mention.
Restart logic is also provided; see the Restartability section here for further details.
Upon restart the job will go on reading, processing and writing the next chunk of data. If the reader/processor/writer are configured/written taking into account that the task is chunked, it will all work out of the box.
Usually this involves, as part of the write step, marking the read items in that chunk as 'processed'.
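For illustration, a minimal sketch of a restartable, chunk-oriented step in Java config, using the Spring Batch 4.x style builder factories; the reader/processor/writer beans are assumptions defined elsewhere, and String is used as the item type for brevity:

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
@EnableBatchProcessing
public class ChunkedJobConfig {

    // The reader/processor/writer beans are assumed to be defined elsewhere
    // (e.g. a JdbcCursorItemReader for the read part).
    @Bean
    public Step processStep(StepBuilderFactory steps,
                            ItemReader<String> reader,
                            ItemProcessor<String, String> processor,
                            ItemWriter<String> writer) {
        return steps.get("processStep")
                .<String, String>chunk(100)   // commit interval: 100 items per chunk
                .reader(reader)
                .processor(processor)
                .writer(writer)
                .build();
    }

    @Bean
    public Job batchJob(JobBuilderFactory jobs, Step processStep) {
        // Step execution metadata is persisted in the JobRepository after every
        // committed chunk, so restarting a failed JobInstance resumes from the
        // last successfully written chunk instead of starting over.
        return jobs.get("batchJob")
                .start(processStep)
                .build();
    }
}
```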