I have two questions regarding the failure handling in Flink's DataSet API:
Why isn't the checkpointing mechanism mentioned in the documentation of the DataSet API?
How are failures handled in the DataSet API, e.g., for the reduce or reduceGroup transformations?
Flink handles failures differently for streaming and batch programs.
For streaming programs, the input stream is unbounded, so it is generally not possible or feasible to replay the complete input in case of a failure. Instead, Flink consistently checkpoints the state of operators and user functions and restores that state in case of a failure.
For batch programs, Flink recomputes intermediate results that were lost due to a failure by reading the necessary input data and evaluating the relevant transformations again. This is true for all transformations, including reduce and reduceGroup.
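To make that concrete, here is a minimal sketch (not from the original answer; the input path is an assumption) of a DataSet job with an explicit restart strategy. On a failure, Flink restarts the job, re-reads the input, and re-evaluates the transformations:

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class BatchRecoverySketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Retry the whole job up to 3 times, waiting 10 seconds between attempts.
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.seconds(10)));

        DataSet<Tuple2<String, Integer>> counts = env
                .readTextFile("hdfs:///input/words.txt")
                .map(word -> Tuple2.of(word, 1))
                .returns(Types.TUPLE(Types.STRING, Types.INT))
                .groupBy(0)
                .sum(1); // if a task fails, this aggregation is recomputed from the input

        counts.print();
    }
}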
Hi, I am new to Flink and trying to figure out some best practices for the following scenario:
I am playing around with a Flink job that reads unique data from multiple CSV files. Each row in the CSV is composed of three columns: userId, appId, name. I have to do some processing on each of these records (capitalize the name) and post the record to a Kafka Topic.
The goal is to filter out any duplicate records that exist so we do not have duplicate messages in the output Kafka Topic.
I am doing a keyBy(userId, appId) on the stream and keeping a boolean value state "Processed" to filter out duplicate records.
The issue is that when I cancel the Task Manager in the middle of processing a file (to simulate a failure), it starts processing the file from the beginning once it restarts.
This is a problem because the "Processed" State in the Flink job is also wiped clean after the Task Manager fails!
This leads to duplicate messages in the output Kafka topic.
How can I prevent this?
I need to restore the "Processed" Flink state to what it was prior to the Task Manager failing. What is the best practice to do this?
Would Flink's CheckpointedFunction https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/state/#checkpointedfunction help? I think not, because this is a keyed stream.
Things to consider:
Flink Checkpointing is already enabled.
The Task Manager runs in a Kubernetes pod (which can be scaled very fast) and parallelism is always > 1.
Files can have millions of rows and need to be processed in parallel.
Thank you for the help!
I would recommend reading up on Flink's fault-tolerance mechanisms, checkpointing & savepointing. https://nightlies.apache.org/flink/flink-docs-master/docs/learn-flink/fault_tolerance/ and https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/checkpoints/ are good places to start.
I think you could also achieve your deduplication more easily by using the Table API/SQL. See https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/queries/deduplication/
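For illustration, that deduplication pattern boils down to a ROW_NUMBER query. A rough Java Table API sketch, assuming a table csv_source with userId, appId, name and a processing-time attribute proctime has already been registered (table and column names are assumptions):

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;

public class DeduplicationSqlSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // csv_source is assumed to be registered elsewhere, e.g. via a CREATE TABLE
        // statement that includes a "proctime AS PROCTIME()" computed column.
        Table deduplicated = tEnv.sqlQuery(
                "SELECT userId, appId, name " +
                "FROM ( " +
                "  SELECT *, ROW_NUMBER() OVER ( " +
                "    PARTITION BY userId, appId ORDER BY proctime ASC) AS row_num " +
                "  FROM csv_source " +
                ") WHERE row_num = 1");

        // The first row per (userId, appId) is kept; later duplicates are dropped.
        deduplicated.execute().print();
    }
}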
You need to use Flink's managed keyed state, which Flink will maintain in a manner that is consistent with the sources and sinks, and Flink will guarantee exactly-once behavior provided you set up checkpointing and configure the sinks properly.
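As a minimal sketch of that approach (not the tutorial's code; the Record POJO and its field names are assumptions):

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class DeduplicateFunction extends KeyedProcessFunction<String, Record, Record> {

    // Managed keyed state: snapshotted with every checkpoint and restored
    // after a failure, so already-seen keys survive a Task Manager crash.
    private transient ValueState<Boolean> seen;

    @Override
    public void open(Configuration parameters) {
        seen = getRuntimeContext().getState(new ValueStateDescriptor<>("seen", Types.BOOLEAN));
    }

    @Override
    public void processElement(Record record, Context ctx, Collector<Record> out) throws Exception {
        if (seen.value() == null) {
            seen.update(true);
            record.name = record.name.toUpperCase(); // the "capitalize the name" step
            out.collect(record);
        }
        // duplicates for this (userId, appId) key are silently dropped
    }
}

// Assumed POJO for the CSV rows:
class Record {
    public String userId;
    public String appId;
    public String name;
}

// Wired up on a stream keyed by the composite key, e.g.:
// records.keyBy(r -> r.userId + "|" + r.appId).process(new DeduplicateFunction())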
There's a tutorial on this topic in the Flink documentation, and it happens to use deduplication as the example use case.
For more info on how to configure Kafka for exactly-once operation, see Exactly-once with Apache Kafka.
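For the sink side, a rough sketch with the KafkaSink builder (broker address, topic and transactional-id prefix are placeholders) could look like this:

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;

public class ExactlyOnceKafkaSink {
    public static KafkaSink<String> build() {
        return KafkaSink.<String>builder()
                .setBootstrapServers("broker:9092")           // placeholder address
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("deduplicated-records")     // placeholder topic
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                // Transactional writes: records become visible to read_committed
                // consumers only after the enclosing checkpoint completes.
                .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
                .setTransactionalIdPrefix("dedup-job")
                .build();
    }
}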
Disclaimer: I wrote that part of the Flink documentation, and I work for Immerok.
I want to optimize my usage of HBase for faster writes. I have a task that reads from a Kafka topic and then writes to HBase based on that. Since Kafka will have a log of everything to be written, it's an easy source to recover from. I'm reading "HBase High Performance Cookbook" and there's this note:
Note that this brings an interesting thought about when to use WAL and when not to. By default, WAL writes are on, and the data are always written to, WAL. But if you are sure the data can be rewritten or a small loss won't be impacting the overall outcome of the processing, you disable the write to WAL. WAL provides an easy and definitive recovery. This is the fundamental reason why, by default, it's always enabled. In scenarios where data loss is not expectable, you should leave it in the default settings; otherwise, change it to use memstore. Alternatively, you can plan for a DR (disaster recovery)
How do I configure this recovery to be automatic? I see 2 options:
I write to HBase without WAL (only to memstore) and am somehow notified that writes were lost and not committed to disk. Then I go back in the Kafka log and replay, or
I write to HBase without WAL (only to memstore) and every so often get notified from HBase what Kafka offset can be committed.
How do I do either of these?
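For context, the WAL part of either option is just a per-write flag in the HBase Java client; a minimal sketch (table name, column family and values are made up) is shown below. The "get notified / replay from Kafka" part is, as far as I know, not something HBase provides out of the box and is the open question here.

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SkipWalWrite {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection();
             Table table = conn.getTable(TableName.valueOf("events"))) {

            Put put = new Put(Bytes.toBytes("row-1"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("value"), Bytes.toBytes("payload"));

            // Write to the memstore only; if the region server crashes before a
            // flush, this edit is lost and must be replayed from Kafka.
            put.setDurability(Durability.SKIP_WAL);

            table.put(put);
        }
    }
}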
How do I configure checkpointing for Flink batch processing? I'm interested in knowing how checkpointing works internally. Since checkpoints happen at an interval, if the job fails before the next checkpoint, won't there be duplicate processing when it restarts? Does Flink checkpoint each operator, source, and sink?
Flink does not support checkpointing on the DataSet API.
You can use checkpointing in the DataStream API with finite sources though, which already covers most of the DataSet API use cases. The long-term vision is to completely replace the DataSet API with DataStream + finite sources, so that users do not need to write two programs depending on whether they want to analyze a stream or a batch.
With the Table API and SQL, this goal is already quite close.
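As a sketch of that DataStream-with-finite-sources route (the path and checkpoint interval are placeholders, and the file connector class names assume a recent Flink release):

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BoundedStreamCheckpointing {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 10 seconds; on failure the job restarts from the last
        // completed checkpoint instead of reprocessing everything from scratch.
        env.enableCheckpointing(10_000);

        // No continuous monitoring is configured, so this source is bounded (finite).
        FileSource<String> source = FileSource
                .forRecordStreamFormat(new TextLineInputFormat(), new Path("file:///data/input"))
                .build();

        DataStream<String> lines =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "bounded-file-source");

        lines.map(String::toUpperCase).print();

        env.execute("bounded-checkpointed-job");
    }
}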
I would like to use the Java Streams API to create a pipeline and then terminate it with iterator(), but I want it to "prepare X elements" asynchronously in advance. (Streams API not required, but preferred).
The situation I have is:
Our application loads data from remote files (over network) to a database
Opening a remote file (i.e. executing the pipeline synchronously for a single element) takes a non-trivial amount of time
I cannot open too many files at once (results in connection timeout)
The order in which the files are loaded matters
The Java Streams API can create the pipeline of commands to execute, but to my knowledge it cannot satisfy both requirements above. It can do either single-element execution:
files.stream().map(this::convertToInputStream).iterator()
which exacerbates the first requirement. Or it can do wholesale execution:
files.stream().map(this::convertToInputStream).collect(toList())
which fails the second.
I have implemented the requirements using a Deque<InputStream> and Thread logic to keep it populated up to a certain element count, but it is not pretty. I am asking if anyone knows of a way to create pipelines like so (perhaps using libraries) in a more elegant fashion. It seemed like a reasonable use case for the Streams API.
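For reference, a cleaned-up version of that Deque/Thread idea could look roughly like this (a sketch, not a library solution; the class and method names are mine). A bounded blocking queue caps how many remote files are open at once, and a single producer thread preserves the order:

import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;

public class PrefetchingIterator<T, R> implements Iterator<R> {

    private final BlockingQueue<R> buffer;
    private final int total;
    private int consumed;

    public PrefetchingIterator(List<T> inputs, Function<T, R> mapper, int prefetch) {
        this.buffer = new ArrayBlockingQueue<>(prefetch);
        this.total = inputs.size();

        Thread producer = new Thread(() -> {
            try {
                for (T input : inputs) {
                    // Blocks while 'prefetch' elements are already prepared,
                    // so at most that many remote files are open and waiting.
                    buffer.put(mapper.apply(input));
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.setDaemon(true);
        producer.start();
    }

    @Override
    public boolean hasNext() {
        return consumed < total;
    }

    @Override
    public R next() {
        try {
            consumed++;
            return buffer.take(); // waits if the producer has fallen behind
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}

// Usage, with convertToInputStream from the question:
// Iterator<InputStream> it = new PrefetchingIterator<>(files, this::convertToInputStream, 5);

Error handling (a mapper that throws) would still need to be propagated from the producer thread to the consumer, which is the part that tends to make such wrappers ugly.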
Is it possible to collect the objects within a PCollection in Apache Beam into the driver's memory? Something like:
PCollection<String> distributedWords = ...
List<String> localWords = distributedWords.collect();
I borrowed this method from Apache Spark, but I was wondering whether Apache Beam has similar functionality as well.
Not directly. The pipeline can write the output into a sink (e.g. a GCS bucket or BigQuery table), and signal the progress to the driver program, if needed, via something like PubSub. Then the driver program reads the saved data from the common source. This approach will work for all Beam runners.
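A rough sketch of that write-then-read pattern (local paths are placeholders; it assumes the driver can see the filesystem the sink writes to, e.g. with the DirectRunner or a mounted bucket):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class CollectViaSink {
    public static void main(String[] args) throws Exception {
        Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        PCollection<String> distributedWords =
                pipeline.apply(TextIO.read().from("/tmp/words-input/*"));

        // Materialize the PCollection in a sink instead of "collecting" it.
        distributedWords.apply(TextIO.write().to("/tmp/words-output/part").withSuffix(".txt"));

        // Block the driver until the runner has finished executing the pipeline.
        pipeline.run().waitUntilFinish();

        // Only now can the driver read the results back from the sink.
        List<String> localWords = new ArrayList<>();
        try (Stream<Path> parts = Files.list(Paths.get("/tmp/words-output"))) {
            for (Path part : (Iterable<Path>) parts::iterator) {
                localWords.addAll(Files.readAllLines(part));
            }
        }
        System.out.println(localWords.size() + " words read back into the driver");
    }
}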
There may be other workarounds for specific cases. For example, DirectRunner is a local in-memory execution engine that runs your pipeline locally, in-process, in a sequential manner. It is used mostly for testing, and if it fits your use case you can leverage it, e.g. by storing the processed data in shared in-memory storage that can be accessed by both the driver program and the pipeline execution logic (see TestTable, for example). This won't work in other runners.
In general, pipeline execution can happen in parallel, and the specifics of how that happens are controlled by the runner (e.g. Flink, Dataflow or Spark). A Beam pipeline is just a definition of the transformations you're applying to your data, plus data sources and sinks. Your driver program doesn't read or collect data itself, and it doesn't communicate with the execution nodes directly; it basically only sends the pipeline definition to the runner, which then decides how to execute it, potentially spreading it across a fleet of machines (or using other execution primitives to run it). Each execution node can then independently process the data by extracting it from the input source, transforming it, and writing it to the output. The node in general doesn't know about the driver program; it only knows how to execute the pipeline definition. Execution environments / runners can be very different, and there's no requirement at the moment for runners to implement such a collection mechanism. See https://beam.apache.org/documentation/execution-model/