Is it possible to collect the objects within a PCollection in Apache Beam into the driver's memory? Something like:
PCollection<String> distributedWords = ...
List<String> localWords = distributedWords.collect();
I borrowed the method here from Apache Spark, but I was wondering whether Apache Beam offers similar functionality as well.
Not directly. The pipeline can write its output to a sink (e.g. a GCS bucket or a BigQuery table) and, if needed, signal progress to the driver program via something like PubSub. The driver program then reads the saved data back from that common location. This approach works with all Beam runners.
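For the write-then-read pattern, a minimal sketch with the Beam Java SDK might look like the following (the local /tmp path and the Create.of source are just illustrative stand-ins; on Dataflow you would typically write to GCS or BigQuery and read it back with the corresponding client):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;

public class WriteThenRead {
    public static void main(String[] args) throws Exception {
        Pipeline pipeline = Pipeline.create();
        PCollection<String> distributedWords =
                pipeline.apply(Create.of("foo", "bar", "baz"));   // stand-in for the real source

        // Write to a sink; a single local file is used here purely for illustration.
        distributedWords.apply(TextIO.write().to("/tmp/words").withoutSharding());

        // Block the driver until the pipeline has finished.
        pipeline.run().waitUntilFinish();

        // The data is now materialized, so the driver can read it like any other file.
        List<String> localWords = Files.readAllLines(Paths.get("/tmp/words"));
        System.out.println(localWords);
    }
}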
There may be other workarounds for specific cases. For example, the DirectRunner is a local in-memory execution engine that runs your pipeline in-process, in a mostly sequential manner. It is used mainly for testing; if it fits your use case you can leverage it, e.g. by storing the processed data in a shared in-memory structure that both the driver program and the pipeline execution logic can access (see TestTable for an example of this pattern). This won't work on other runners.
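As a rough illustration of that DirectRunner-only trick (not a public API, and only viable because the DirectRunner executes in the same JVM as the driver), a pipeline could dump elements into a static in-memory collection:

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class DirectRunnerCollect {
    // Shared in-memory "sink"; visible to the DoFn only because the
    // DirectRunner runs the pipeline in the same JVM as the driver program.
    static final Queue<String> COLLECTED = new ConcurrentLinkedQueue<>();

    public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create();   // defaults to the DirectRunner
        pipeline
                .apply(Create.of("foo", "bar", "baz"))
                .apply(ParDo.of(new DoFn<String, Void>() {
                    @ProcessElement
                    public void processElement(ProcessContext c) {
                        COLLECTED.add(c.element());
                    }
                }));
        pipeline.run().waitUntilFinish();

        System.out.println(COLLECTED);           // the "collected" elements
    }
}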
In general, pipeline execution happens in parallel, and the specifics of how it happens are controlled by the runner (e.g. Flink, Dataflow or Spark). A Beam pipeline is just a definition of the transformations you're applying to your data, plus the data sources and sinks. Your driver program doesn't read or collect data itself, and it doesn't communicate with the execution nodes directly; it basically only sends the pipeline definition to the runner, which then decides how to execute it, potentially spreading it across a fleet of machines (or using other execution primitives). Each execution node can then independently process the data by reading it from the input source, transforming it, and writing it to the output. A node in general doesn't know about the driver program; it only knows how to execute the pipeline definition. Execution environments / runners can be very different, and there's no requirement at the moment for runners to implement such a collection mechanism. See https://beam.apache.org/documentation/execution-model/
Related
How to configure checkpointing for Flink batch processing. I'm interested in knowing how checkpointing works internally. Since checkpoints happen at an interval, if the job fails before the next checkpoint, won't there be duplicate processing when it restarts? Does Flink checkpoint each operator, source, and sink?
Flink does not support checkpointing on the DataSet API.
You can use checkpointing in DataStream with finite sources though, which covers most of the DataSet API use cases already. The long-term vision is to completely replace the DataSet API with DataStream + finite sources, such that users do not need to write two programs if they want to analyze a stream or batch.
With Table API and SQL, this goal is already pretty near.
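A minimal sketch of the DataStream route mentioned above, assuming a reasonably recent Flink release where checkpoints can also be taken while bounded sources are running or finishing:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BoundedStreamWithCheckpoints {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpointing is a DataStream feature; enable it with a 10 s interval.
        env.enableCheckpointing(10_000);

        env.fromElements(1, 2, 3, 4, 5)   // a finite (bounded) source
           .map(x -> x * x)
           .print();

        env.execute("bounded stream with checkpoints");
    }
}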
I would like to use the Java Streams API to create a pipeline and then terminate it with iterator(), but I want it to "prepare X elements" asynchronously in advance. (Streams API not required, but preferred).
The situation I have is:
Our application loads data from remote files (over network) to a database
Opening a remote file (i.e. executing the pipeline synchronously for a single element) takes a non-trivial amount of time
I cannot open too many files at once (results in connection timeout)
The order in which the files are loaded matters
The Java Streams API can express this pipeline of operations, but to my knowledge it cannot satisfy both requirements above. It can do either one-at-a-time execution:
files.stream().map(this::convertToInputStream).iterator()
which runs into the first problem (each file is opened only when the consumer asks for it, so every element incurs the full opening delay). Or it can do wholesale execution:
files.stream().map(this::convertToInputStream).collect(toList())
which fails the second.
I have implemented the requirements using a Deque<InputStream> and Thread logic to keep it populated up to a certain element count, but it is not pretty. I am asking if anyone knows of a way to create pipelines like so (perhaps using libraries) in a more elegant fashion. It seemed like a reasonable use case for the Streams API.
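For what it's worth, here is roughly how that bounded-prefetch idea could be sketched with an ExecutorService and a fixed-capacity FIFO queue of futures (convertToInputStream and files are from the snippets above; the class name and prefetch count are made up):

import java.util.Iterator;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Keeps at most 'prefetch' conversions in flight, in submission order.
class PrefetchingIterator<T> implements Iterator<T> {
    private final Iterator<Callable<T>> tasks;
    private final ExecutorService pool;
    private final BlockingQueue<Future<T>> inFlight;

    PrefetchingIterator(Iterator<Callable<T>> tasks, int prefetch) {
        this.tasks = tasks;
        this.pool = Executors.newFixedThreadPool(prefetch);
        this.inFlight = new ArrayBlockingQueue<>(prefetch);
        fill();
    }

    private void fill() {
        while (inFlight.remainingCapacity() > 0 && tasks.hasNext()) {
            inFlight.add(pool.submit(tasks.next()));
        }
    }

    @Override
    public boolean hasNext() {
        return !inFlight.isEmpty();
    }

    @Override
    public T next() {
        try {
            T result = inFlight.remove().get();   // blocks only if this element isn't ready yet
            fill();                               // keep the prefetch window full
            if (inFlight.isEmpty()) {
                pool.shutdown();                  // nothing left to prefetch
            }
            return result;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        }
    }
}

Usage would then look something like this: at most 4 files are being opened ahead of the consumer, in the original order.

Iterator<InputStream> it = new PrefetchingIterator<>(
        files.stream()
             .map(f -> (Callable<InputStream>) () -> convertToInputStream(f))
             .iterator(),
        4);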
I need to create an ETL process that will extract, transform, and then load 100+ tables from several instances of SQL Server to as many instances of Oracle, in parallel, on a daily basis. I understand that I can create multiple threads in Java to accomplish this, but if all of them run on the same machine this approach won't scale. Another approach could be to get a bunch of EC2 instances and start transferring tables for each instance on a different EC2 instance. With this approach, though, I would have to take care of "elasticity" by adding/removing machines from my pool.
Somehow I think I can use "Apache Spark on Amazon EMR" to accomplish this, but in the past I've used Spark only to handle data on HDFS/Hive, so not sure if transferring data from one Db to another Db is a good use case for Spark - or - is it?
Starting from your last question:
"Not sure if transferring data from one Db to another Db is a good use case for Spark":
It is, within the limitations of the Spark JDBC connector: for example, there is no support for updates, and parallel reads of a table require splitting it by a numeric column.
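For reference, a hedged sketch of what the JDBC read/write could look like in Java (hosts, credentials, table names and the "id" partition column are made up; the options themselves are standard Spark JDBC options):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class TableCopy {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("sqlserver-to-oracle").getOrCreate();

        // Parallel read: Spark issues numPartitions range queries over the numeric column "id".
        Dataset<Row> table = spark.read()
                .format("jdbc")
                .option("url", "jdbc:sqlserver://source-host;databaseName=src")
                .option("dbtable", "dbo.customers")
                .option("user", "etl").option("password", "secret")
                .option("partitionColumn", "id")      // must be a numeric column
                .option("lowerBound", "1")
                .option("upperBound", "1000000")
                .option("numPartitions", "8")
                .load();

        // Write: each partition opens its own JDBC connection to the target (append only).
        table.write()
                .format("jdbc")
                .option("url", "jdbc:oracle:thin:@//target-host:1521/tgt")
                .option("dbtable", "CUSTOMERS")
                .option("user", "etl").option("password", "secret")
                .mode(SaveMode.Append)
                .save();

        spark.stop();
    }
}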
Considering the I/O cost and the overall throughput of the RDBMS, running the jobs in FIFO mode (one table after another) does not sound like a good idea. You can instead submit each job with a configuration that requests 1/x of the cluster's resources, so that x tables are processed in parallel.
One of the steps in our job involves running an external process (R in this case) to do some processing on large files in the file system. The external process then outputs files which are fed back into the Spring Batch system.
The external process can take several minutes for each task to complete. We would effectively launch the external process for every file to be processed, so there could easily be on the order of dozens or hundreds of these executions during the life of the overall job. We would like to scale this execution horizontally (and vertically).
Using Spring Batch, would either Remote Chunking or Remote Partitioning be a viable solution for this step? The system really just needs to say "For each of these input files, launch an R script to process it", so there really is not any item or chunk-oriented processing involved.
Remote Chunking/Partitioning has been proving difficult to implement in a sensible manner for this without seeming like overkill. I have thought about instead making this task run "out of band". For example, in the Processor, I would put each "external task" on a JMS queue, let something pull it off and process it and wait for a response that it has finished. This seems like it would be a lot easier than using Remote Chunking/Partitioning.
Other alternative solutions besides Spring Batch are welcome too, but I would like to focus on integrating this solution with Spring Batch for now.
What you are describing is exactly what partitioning does. Even your "out of band" option still falls into what partitioning does.
Let's walk through what I would expect the job to look like.
Job and Master Step
The job, as you noted, is a single-step job. What I would envision is that the single step is a partitioned step. With a partitioned step, the two main pieces you need to configure are the Partitioner (the component that knows how to divide the work up) and the PartitionHandler (the component that knows how to send the work to the workers). For the Partitioner, I'd expect the MultiResourcePartitioner to work. This Partitioner implementation provided by Spring Batch creates one partition per file, as defined by its configuration.
The PartitionHandler is where you choose if you're going to be executing the slaves locally (via the TaskExecutorPartitionHandler) or remotely (via the MessageChannelPartitionHandler). The PartitionHandler is also responsible for aggregating the results of the executing slaves into a single status so the step's result can be evaluated.
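A rough Java-config sketch of that master step (the bean names, file pattern and workerStep reference are illustrative; this assumes the local TaskExecutorPartitionHandler variant, which the builder sets up when you supply a TaskExecutor):

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.partition.support.MultiResourcePartitioner;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.Resource;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Configuration
public class MasterStepConfig {

    @Bean
    public Partitioner partitioner(@Value("file:/data/input/*.dat") Resource[] inputFiles) {
        // One partition per input file; each partition's step execution context
        // gets the file URL under the "fileName" key.
        MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
        partitioner.setResources(inputFiles);
        return partitioner;
    }

    @Bean
    public Step masterStep(StepBuilderFactory steps, Partitioner partitioner, Step workerStep) {
        return steps.get("masterStep")
                .partitioner("workerStep", partitioner)
                .step(workerStep)                               // local slaves via TaskExecutorPartitionHandler;
                .taskExecutor(new SimpleAsyncTaskExecutor())    // swap in MessageChannelPartitionHandler to go remote
                .build();
    }
}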
Slave Step
For the slave step, there are two pieces. The first is the configuration of the step itself. This is no different than if you were running the step in-line. In this example, I'd expect you to use the SystemCommandTasklet to run your R process (unless you're running it on the JVM).
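As a sketch, that slave step might be wired up like this (the Rscript command, timeout and working directory are hypothetical; the "fileName" key is the MultiResourcePartitioner default):

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.core.step.tasklet.SystemCommandTasklet;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class WorkerStepConfig {

    @Bean
    @StepScope
    public SystemCommandTasklet rScriptTasklet(
            @Value("#{stepExecutionContext['fileName']}") String fileName) {
        SystemCommandTasklet tasklet = new SystemCommandTasklet();
        tasklet.setCommand("Rscript process.R " + fileName);  // hypothetical script invocation
        tasklet.setWorkingDirectory("/data/work");            // hypothetical working directory
        tasklet.setTimeout(30 * 60 * 1000L);                  // allow up to 30 minutes per file
        return tasklet;
    }

    @Bean
    public Step workerStep(StepBuilderFactory steps, SystemCommandTasklet rScriptTasklet) {
        return steps.get("workerStep")
                .tasklet(rScriptTasklet)
                .build();
    }
}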
How the step is launched depends on remote vs. local partitioning, but it is also straightforward.
For the record, I did a talk a while back demonstrating remote partitioning that's available on YouTube here: https://www.youtube.com/watch?v=CYTj5YT7CZU The code for that demo is also available on Github here: https://github.com/mminella/Spring-Batch-Talk-2.0
I'm considering Apache Spark (in Java) for a project, but this project requires the data processing framework to support nested iterations. I haven't been able to find any confirmation of that; does it support them?
In addition, is there any example of the use of nested iterations?
Thanks!
Just about anything can be done, but the question is what fits the execution model well enough to bother. Spark's operations are inherently parallel, not iterative. That is, some operations happen in parallel on a bunch of pieces of the data, rather than something happening to each piece sequentially (and then happening again).
However, a Spark (driver) program is just a program and can do whatever you want locally. Of course, nested loops or whatever you like are entirely fine, just as in any Scala program.
I think you might use Spark operations for the bucketing process and to compute summary stats for each bucket, but otherwise run the simple remainder of the logic locally on the driver.
So the process is (sketched in code after the list):
Broadcast a bucketing scheme
Bucket according to that scheme in a distributed operation
Pull small summary stats to the driver
Update bucketing scheme and send again
repeat...
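In Java, a hedged sketch of that driver-side loop might look like the following (BucketingScheme, Record and Summary are hypothetical application types; only the broadcast / distributed-summarize / collect-and-update structure is the point):

import java.util.Map;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

import scala.Tuple2;

public class IterativeBucketing {

    static BucketingScheme refine(JavaSparkContext sc, JavaRDD<Record> data,
                                  BucketingScheme scheme, int maxIterations) {
        data.cache();                                            // reused on every iteration
        for (int i = 0; i < maxIterations; i++) {
            // 1. Broadcast the current bucketing scheme to the executors.
            Broadcast<BucketingScheme> b = sc.broadcast(scheme);

            // 2. Bucket and summarize in a distributed operation,
            // 3. then pull only the small per-bucket summaries back to the driver.
            Map<Integer, Summary> stats = data
                    .mapToPair(r -> new Tuple2<>(b.value().bucketOf(r), Summary.of(r)))
                    .reduceByKey(Summary::merge)
                    .collectAsMap();

            // 4. Update the scheme locally on the driver and go around again.
            scheme = scheme.update(stats);
        }
        return scheme;
    }
}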