I would like to use the Java Streams API to create a pipeline and then terminate it with iterator(), but I want it to "prepare X elements" asynchronously in advance. (Streams API not required, but preferred).
The situation I have is:
Our application loads data from remote files (over the network) into a database
Opening a remote file (i.e. executing the pipeline synchronously for a single element) takes a non-trivial amount of time
I cannot open too many files at once (results in connection timeout)
The order in which the files are loaded matters
The Java Streams API can express this pipeline of operations, but to my knowledge it cannot satisfy both requirements above. It can do either element-by-element execution:
files.stream().map(this::convertToInputStream).iterator()
which runs straight into the first problem (each file's open latency is paid synchronously, one element at a time, with no work done in advance). Or it can do wholesale execution:
files.stream().map(this::convertToInputStream).collect(toList())
which fails the second (all of the files are opened at once).
I have implemented the requirements using a Deque<InputStream> and Thread logic to keep it populated up to a certain element count, but it is not pretty. I am asking if anyone knows of a way to create pipelines like so (perhaps using libraries) in a more elegant fashion. It seemed like a reasonable use case for the Streams API.
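For reference, this is roughly the shape of what I hand-rolled (a simplified sketch, not my actual code; the class and parameter names are made up and error handling is omitted): a background thread converts elements ahead of time into a bounded blocking queue, and the consumer drains it through a plain Iterator.

import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;

// Keeps up to 'prefetchCount' converted elements ready ahead of the consumer.
class PrefetchingIterator<T, R> implements Iterator<R> {
    private final BlockingQueue<R> buffer;
    private int remaining;

    PrefetchingIterator(List<T> sources, Function<T, R> convert, int prefetchCount) {
        this.buffer = new ArrayBlockingQueue<>(prefetchCount);
        this.remaining = sources.size();
        Thread producer = new Thread(() -> {
            try {
                for (T source : sources) {
                    buffer.put(convert.apply(source)); // blocks when the buffer is full
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.setDaemon(true);
        producer.start();
    }

    @Override
    public boolean hasNext() {
        return remaining > 0;
    }

    @Override
    public R next() {
        try {
            R value = buffer.take(); // waits if the producer has fallen behind
            remaining--;
            return value;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}

I then call it roughly as new PrefetchingIterator<>(files, this::convertToInputStream, 4) in place of the stream pipeline above. It works, but it feels like plumbing that a library should be providing.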
Related
I'm new to the whole multithreading in Java thing and I'm unable to get my head around something.
I'm trying to initialise my app properly using multithreading.
For example, I'm using a database (MongoDB to be exact) and need to initialise a connection to it, check that a collection exists, and then read from it.
Once I have that, I will eventually have a list view (JavaFX) that will display the information taken from the database.
Ideally, whilst this is going on, I'd like other things to be done (in true multithreading style).
Would I need to put each submitted task into a queue of sorts and then iterate through, wait if they're not ready and then remove them once they're finished?
I've always done this single-threaded and it's always been slow.
Cheers
Use an asynchronous MongoDB client like the high level ones mentioned here
MongoDB RxJava Driver (An RxJava implementation of the MongoDB Driver)
MongoDB Reactive Streams Java Driver (A Reactive Streams implementation for the JVM)
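As a rough sketch, reading a collection with the Reactive Streams driver and pushing results onto the JavaFX thread might look like this (the database/collection names and the backing list are placeholders, and the exact API depends on the driver version, so check its docs):

import com.mongodb.reactivestreams.client.MongoClient;
import com.mongodb.reactivestreams.client.MongoClients;
import com.mongodb.reactivestreams.client.MongoCollection;
import javafx.application.Platform;
import javafx.collections.FXCollections;
import javafx.collections.ObservableList;
import org.bson.Document;
import org.reactivestreams.Subscriber;
import org.reactivestreams.Subscription;

// The list that backs the ListView; documents are appended as they arrive,
// so the JavaFX application thread is never blocked on the database.
ObservableList<String> listItems = FXCollections.observableArrayList();

MongoClient client = MongoClients.create("mongodb://localhost:27017");
MongoCollection<Document> items = client.getDatabase("myapp").getCollection("items");

items.find().subscribe(new Subscriber<Document>() {
    @Override public void onSubscribe(Subscription s) { s.request(Long.MAX_VALUE); }
    @Override public void onNext(Document doc) {
        Platform.runLater(() -> listItems.add(doc.getString("name")));
    }
    @Override public void onError(Throwable t) { t.printStackTrace(); }
    @Override public void onComplete() { /* e.g. hide a loading indicator */ }
});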
For this purpose you could add a connection pool to your application.
Everything depends on your project's configuration, so the best approach is to make the pool size configurable: when load is low, keep around 4 connections (the minimum) in the pool; when load increases, let it grow up to 20 (the maximum).
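For example, with the MongoDB Java driver those bounds can be expressed directly in the connection string (the host and the exact numbers here are just placeholders):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;

// minPoolSize / maxPoolSize are standard MongoDB connection string options
MongoClient client = MongoClients.create(
        "mongodb://localhost:27017/?minPoolSize=4&maxPoolSize=20");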
You need to coordinate partial tasks. If you represent tasks with threads, coordination can be done using semaphores and/or blocking queues.
A more effective way is to represent tasks as dataflow actors: they consume less memory than threads, and you can create all the tasks at the very start.
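As a rough illustration with plain JDK primitives (CompletableFuture chaining here stands in for the dataflow style; connectToDatabase, loadCollection, initOtherStuff and showInListView are placeholders for your own code):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import javafx.application.Platform;

ExecutorService pool = Executors.newFixedThreadPool(4);

// Database initialisation runs as one chain; other start-up work runs alongside it.
CompletableFuture<Void> dbReady = CompletableFuture
        .supplyAsync(() -> connectToDatabase(), pool)          // open the connection
        .thenApply(db -> loadCollection(db, "myCollection"))   // check the collection and read it
        .thenAccept(data -> Platform.runLater(() -> showInListView(data))); // update the UI on the FX thread

CompletableFuture<Void> otherInit = CompletableFuture.runAsync(() -> initOtherStuff(), pool);

CompletableFuture.allOf(dbReady, otherInit).join();  // wait only where you actually need both
pool.shutdown();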
Is it possible to collect the objects within a PCollection in Apache Beam into the driver's memory? Something like:
PCollection<String> distributedWords = ...
List<String> localWords = distributedWords.collect();
I borrowed the method here from Apache Spark, but I was wondering whether Apache Beam has similar functionality as well?
Not directly. The pipeline can write its output to a sink (e.g. a GCS bucket or a BigQuery table) and, if needed, signal progress to the driver program via something like PubSub. The driver program then reads the saved data from that common source. This approach works for all Beam runners.
There may be other workarounds for specific cases. For example, DirectRunner is a local in-memory execution engine that runs your pipeline in-process in a sequential manner. It is used mostly for testing, and if it fits your use case you can leverage it, e.g. by storing the processed data in shared in-memory storage that both the driver program and the pipeline execution logic can access (see TestTable for an example). This won't work in other runners.
In general, pipeline execution can happen in parallel, and the specifics of how it happens are controlled by the runner (e.g. Flink, Dataflow or Spark). A Beam pipeline is just a definition of the transformations you're applying to your data, plus data sources and sinks. Your driver program doesn't read or collect data itself, and it doesn't communicate with the execution nodes directly; it basically only sends the pipeline definition to the runner, which then decides how to execute it, potentially spreading it across a fleet of machines (or using other execution primitives to run it). Each execution node can then independently process the data by extracting it from the input source, transforming it and writing it to the output. The node in general doesn't know about the driver program; it only knows how to execute the pipeline definition. Execution environments / runners can be very different, and there's currently no requirement for runners to implement such a collection mechanism. See https://beam.apache.org/documentation/execution-model/
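As a rough illustration of the write-to-a-sink approach described above (the bucket paths and the transform are placeholders):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

Pipeline p = Pipeline.create();

p.apply(TextIO.read().from("gs://my-bucket/input/*"))
 .apply(MapElements.into(TypeDescriptors.strings())
                   .via((String word) -> word.toLowerCase()))
 .apply(TextIO.write().to("gs://my-bucket/output/words"));

p.run().waitUntilFinish();  // block the driver until the runner reports completion

// Only after the pipeline has finished does the driver read the materialized output
// back from the sink with ordinary (non-Beam) I/O, e.g. the GCS client library.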
One of the steps in our job involves running an external process (R in this case) to do some processing on large files in the file system. The external process will then output files which then get fed back into the Spring Batch system.
The external process can take several minutes for each task to complete. We would effectively launch the external process for every file to be processed, so there could easily be on the order of dozens or hundreds of these executions during the life of the overall job. We would like to scale this execution horizontally (and vertically).
Using Spring Batch, would either Remote Chunking or Remote Partitioning be a viable solution for this step? The system really just needs to say "For each of these input files, launch an R script to process it", so there really is not any item or chunk-oriented processing involved.
Remote Chunking/Partitioning has been proving difficult to implement in a sensible manner for this without seeming like overkill. I have thought about instead making this task run "out of band". For example, in the Processor, I would put each "external task" on a JMS queue, let something pull it off and process it and wait for a response that it has finished. This seems like it would be a lot easier than using Remote Chunking/Partitioning.
Other alternative solutions besides Spring Batch are welcome too, but I would like to focus on integrating this solution with Spring Batch for now.
What you are describing is exactly what partitioning does. Even your "out of band" option still falls into what partitioning does.
Let's walk through what I would expect the job to look like.
Job and Master Step
The job, as you noted, is a single-step job. What I would envision is that the single step is a partitioned step. With a partitioned step, the two main pieces you need to configure are the Partitioner (the component that knows how to divide the work up) and the PartitionHandler (the component that knows how to send the work to the workers). For the Partitioner, I'd expect the MultiResourcePartitioner to work. This Partitioner implementation provided by Spring Batch creates one partition per file as defined by its configuration.
The PartitionHandler is where you choose if you're going to be executing the slaves locally (via the TaskExecutorPartitionHandler) or remotely (via the MessageChannelPartitionHandler). The PartitionHandler is also responsible for aggregating the results of the executing slaves into a single status so the step's result can be evaluated.
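A rough sketch of that master step in Java config might look like this (the file pattern, key name and bean wiring are illustrative, and the builder API differs a bit between Spring Batch versions):

@Bean
public Step masterStep(StepBuilderFactory steps, Step workerStep, TaskExecutor taskExecutor) throws IOException {
    MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
    partitioner.setKeyName("inputFile");  // each partition gets one file under this key
    partitioner.setResources(
            new PathMatchingResourcePatternResolver().getResources("file:/data/incoming/*.dat"));

    return steps.get("masterStep")
            .partitioner("workerStep", partitioner)  // Partitioner: one partition per file
            .step(workerStep)                        // the slave step defined below
            .taskExecutor(taskExecutor)              // local execution -> TaskExecutorPartitionHandler
            .build();
}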
Slave Step
For the slave step, there are two pieces. The first is the configuration of the step itself. This is no different than if you were running the step in line. In this example, I'd expect you to use the SystemCommandTasklet to run your R process (unless you're running it on the JVM).
How the step is launched depends on remote vs local partitioning, but it is also straightforward.
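A sketch of the slave step wrapping the R invocation in a SystemCommandTasklet might look like this (the script path, timeout and execution-context key are illustrative):

@Bean
@StepScope
public SystemCommandTasklet rTasklet(@Value("#{stepExecutionContext['inputFile']}") String inputFile) {
    SystemCommandTasklet tasklet = new SystemCommandTasklet();
    tasklet.setCommand("Rscript /opt/scripts/process.R " + inputFile);  // illustrative command
    tasklet.setTimeout(600_000);  // SystemCommandTasklet requires a timeout (ms) for the external process
    return tasklet;
}

@Bean
public Step workerStep(StepBuilderFactory steps, SystemCommandTasklet rTasklet) {
    return steps.get("workerStep")
            .tasklet(rTasklet)
            .build();
}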
For the record, I did a talk a while back demonstrating remote partitioning that's available on YouTube here: https://www.youtube.com/watch?v=CYTj5YT7CZU The code for that demo is also available on Github here: https://github.com/mminella/Spring-Batch-Talk-2.0
I'm considering Apache Spark (in Java) for a project, but this project requires the data processing framework to support nested iterations. I haven't been able to find any confirmation of that; does it support them?
In addition, is there any example of the use of nested iterations?
Thanks!
Just about anything can be done, but the question is what fits the execution model well enough to bother. Spark's operations are inherently parallel, not iterative. That is, some operation happens in parallel to a bunch of pieces of the data, rather than something happening to each piece sequentially (and then happening again).
However, a Spark (driver) program is just a program and can do whatever you want, locally. Of course, nested loops or whatever you like are entirely fine, just as in any Scala program.
I think you might use Spark operations for the bucketing process and to compute summary stats for each bucket, but otherwise run the simple remainder of the logic locally on the driver.
So the process is:
Broadcast a bucketing scheme
Bucket according to that scheme in a distributed operation
Pull small summary stats to the driver
Update bucketing scheme and send again
repeat...
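In (Java) Spark code, that loop might look roughly like this (bucketFor and updateScheme are hypothetical helpers standing in for your algorithm's actual bucketing and convergence logic):

import java.util.Arrays;
import java.util.Map;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import scala.Tuple2;

JavaSparkContext sc = new JavaSparkContext("local[*]", "nested-iterations");
JavaRDD<Double> data = sc.parallelize(Arrays.asList(1.0, 2.5, 3.7, 4.2, 9.9));

double[] scheme = {0.0, 5.0, 10.0};  // initial bucket boundaries (illustrative)

for (int i = 0; i < 10; i++) {                         // outer, driver-side loop
    Broadcast<double[]> b = sc.broadcast(scheme);      // 1. broadcast the current scheme

    Map<Integer, Integer> counts = data                // 2. bucket in a distributed operation
            .mapToPair((Double x) -> new Tuple2<>(bucketFor(x, b.value()), 1))
            .reduceByKey(Integer::sum)
            .collectAsMap();                           // 3. pull small summary stats to the driver

    scheme = updateScheme(scheme, counts);             // 4. update the scheme locally and repeat
}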
I have a scenario where web archive (WARC) files are periodically dropped by a crawler into different directories. Each WARC file internally consists of thousands of HTML files.
Now, I need to build a framework to process these files efficiently. I know Java doesn't scale well when it comes to parallel I/O. What I'm thinking is to have a monitor thread which scans the directory, picks up the file names and drops them into an ExecutorService or some Java blocking queue. A bunch of worker threads (maybe a small number, because of the I/O concerns) listening on the executor service will read the files, read the HTML files within and do the respective processing. This is to make sure that threads do not fight over the same file.
Is this the right approach in terms of performance and scalability? Also, how do I handle the files once they are processed? Ideally, the files should be moved or tagged so that they are not picked up by a thread again. Can this be handled through Future objects?
In recent versions of Java (since Java 7's NIO.2) there is already a built-in file change notification service in the standard library. You might want to check this out first instead of rolling your own. See here
My key recommendation is to avoid re-inventing the wheel unless you have some specific requirement.
If you're using Java 7, you can take advantage of the WatchService (as suggested by Simeon G).
If you're restricted to Java 6 or earlier, these services aren't available in the JRE. However, Apache Commons-IO provides file monitoring See here.
As an advantage over Java 7, Commons-IO monitors will create a thread for you that raises events against the registered callback. With Java 7, you will need to poll the event list yourself.
Once you have the events, your suggestion of using an ExecutorService to process files off-thread is a good one. Moving files is supported by Java IO and you can just ignore any delete events that are raised.
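A minimal sketch of that model with the Java 7 WatchService (the directories, pool size and processing logic are placeholders; assume this runs in a method allowed to throw IOException and InterruptedException):

import java.nio.file.*;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

Path inbox = Paths.get("/data/warc-inbox");
Path done = Paths.get("/data/warc-done");
ExecutorService workers = Executors.newFixedThreadPool(4);  // keep small to limit concurrent I/O

WatchService watcher = FileSystems.getDefault().newWatchService();
inbox.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);

while (true) {
    WatchKey key = watcher.take();                       // blocks until events are available
    for (WatchEvent<?> event : key.pollEvents()) {
        if (event.kind() == StandardWatchEventKinds.OVERFLOW) {
            continue;                                    // events were lost; rescan if that matters to you
        }
        Path file = inbox.resolve((Path) event.context());
        workers.submit(() -> {
            processWarc(file);                           // placeholder for the real HTML processing
            try {
                Files.move(file, done.resolve(file.getFileName()),
                        StandardCopyOption.REPLACE_EXISTING);  // move it so it is not picked up again
            } catch (java.io.IOException e) {
                e.printStackTrace();
            }
        });
    }
    key.reset();                                         // re-arm the key for further events
}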
I've used this model in the past with success.
Here are some things to watch out for:
The new-file event will likely be raised as soon as the file exists in the directory. HOWEVER, data may still be being written to it. Consider reasonable expectations for file size and how long you need to wait until a file is considered 'whole'
What is the maximum amount of time you must spend on a file?
Make your executor service parameters tweakable via config - this will simplify your performance testing
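For example, something as simple as system properties works for that last point (the property names here are made up):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Pool sizes read from configuration so they can be tuned without recompiling.
int poolSize = Integer.parseInt(System.getProperty("warc.workers", "4"));
int queueCapacity = Integer.parseInt(System.getProperty("warc.queueCapacity", "100"));

ExecutorService workers = new ThreadPoolExecutor(
        poolSize, poolSize,
        0L, TimeUnit.MILLISECONDS,
        new ArrayBlockingQueue<>(queueCapacity),        // bounded queue applies back-pressure
        new ThreadPoolExecutor.CallerRunsPolicy());     // slow the monitor thread down instead of dropping work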
Hope this helps. Good luck.