AWS Lambda/ Aws Batch work flow

AWS Lambda/ Aws Batch work flow - java

I have written a lambda that is triggered off s3 bucket to unzip a zip file and process a text document inside. Due to the limitation of memory of lambda i need to move my process over to something like AWS batch. Correct me if I am wrong but my work flow should look something like this.
work flow
I beleive I need to write a lambda to put the location of the s3 bucket on amazons SQS were a AWS batch can read the location and do all the unzipping/data processing their were their is more memory.
Here is my current lambda, it takes in the event triggered by the s3 bucket, checks to see if it is a zip file then pushes the name of that s3 Key to SQS.
Should I tell AWS batch to start reading the queue here in my lambda?
I am totally new to AWS in general and not sure were to go from here.
public class dockerEventHandler implements RequestHandler<S3Event, String> {
private static BigData app = new BigData();
private static DomainOfConstants CONST = new DomainOfConstants();
private static Logger log = Logger.getLogger(S3EventProcessorUnzip.class);
private static AmazonSQS SQS;
private static CreateQueueRequest createQueueRequest;
private static Matcher matcher;
private static String srcBucket, srcKey, extension, myQueueUrl;
#Override
public String handleRequest(S3Event s3Event, Context context)
{
try {
for (S3EventNotificationRecord record : s3Event.getRecords())
{
srcBucket = record.getS3().getBucket().getName();
srcKey = record.getS3().getObject().getKey().replace('+', ' ');
srcKey = URLDecoder.decode(srcKey, "UTF-8");
matcher = Pattern.compile(".*\\.([^\\.]*)").matcher(srcKey);
if (!matcher.matches())
{
log.info(CONST.getNoConnectionMessage() + srcKey);
return "";
}
extension = matcher.group(1).toLowerCase();
if (!"zip".equals(extension))
{
log.info("Skipping non-zip file " + srcKey + " with extension " + extension);
return "";
}
log.info("Sending object location to key" + srcBucket + "//" + srcKey);
//pass in only the reference of where the object is located
createQue(CONST.getQueueName(), srcKey);
}
}
catch (IOException e)
{
log.error(e);
}
return "Ok";
}
/*
*
* Setup connection to amazon SQS
* TODO - Find updated api for sqs connection to eliminate depreciation
*
* */
#SuppressWarnings("deprecation")
public static void sQSConnection() {
app.setAwsCredentials(CONST.getAccessKey(), CONST.getSecretKey());
try{
SQS = new AmazonSQSClient(app.getAwsCredentials());
Region usEast1 = Region.getRegion(Regions.US_EAST_1);
SQS.setRegion(usEast1);
}
catch(Exception e){
log.error(e);
}
}
//Create new Queue
public static void createQue(String queName, String message){
createQueueRequest = new CreateQueueRequest(queName);
myQueueUrl = SQS.createQueue(createQueueRequest).getQueueUrl();
sendMessage(myQueueUrl,message);
}
//Send reference to the s3 objects location to the queue
public static void sendMessage(String SIMPLE_QUE_URL, String S3KeyName){
SQS.sendMessage(new SendMessageRequest(SIMPLE_QUE_URL, S3KeyName));
}
//Fire AWS batch to pull from que
private static void initializeBatch(){
//TODO
}
I have setup docker and understand docker images. I believe my docker image should contain all the code to read the queue, unzip, process and kit the file to RDS all in one docker image/container.
I am looking for someone who has something similar done they could share to help. Something along the lines of :
Mr. S3: Hey lambda I have a file
Mr. Lambda :Okay S3 I see you, hey aws batch could you unzip and do stuff to this
Mr. Batch: Gotchya mr lambda, ill take care of that and put it in RDS or some data base after.
I have not written the class/docker image yet but i have all the code done to process/unzip and kick off to rds done. Lambda just is limited to memory due to some of the files being 1gb or bigger.

Okay so after looking through the AWS docs on Batch, you don't need an SQS queue. Batch has a concept called Job Queue which is similar to an SQS FIFO queue, but different in that these job queues have priorities, and jobs within them can have dependencies on other jobs. The basic process is:
First the weird part is setting up IAM roles so that container agents can talk to the container service, and AWS batch is able to launch various instances when it needs to (there's also a separate role needed for if you do spot instances). The details on permissions required can be found in this doc (PDF) at around page 54.
Now when that's done you setup a compute environment. These are EC2 on-demand or spot instances which hold your containers. Jobs operate on a container level. The idea is that your compute environment is the max resource allocation that your job containers can utilize. Once that limit is hit, your jobs have to wait for resources to be freed up.
Now you create a job queue. This associates jobs with the compute environment you created.
Now you create a job definition. Well, technically you don't have to and can do it through lambda but this makes things a bit easier. Your job definition will indicate what container resources will be needed for your job ( you can of course override this in lambda as well )
Now that this is all done you'll want to create a lambda function. This will be triggered by your S3 bucket event. The function will need necessary IAM permissions to run submit job against the batch service (as well as any other permissions). Basically all the lambda needs to do is call submit job to AWS batch. The basic parameters you'll want are the job queue and the job definition. You'll also set the S3 key for the zip needed as a parameter to the job.
Now when the appropriate S3 event is triggered, it calls lambda, which then submits the job to the AWS batch job queue. Then assuming the setup is all good it will happily pull up resources to process your job. Note that depending on EC2 instance size and container resources allocated this may take a bit (much longer than prepping a Lambda function).

Related

How to read and process a directory multithread using thread pool in SpringBoot

I have a java application that fetches all files in a directory and reads these files one by one.
public Set<String> processDirectory(String filePath) {
Set<String> setResult = new Set<>();
try (LineIterator iterator = FileUtils.lineIterator(new File(filePath))) {
while (iterator.hasNext()) {
String fileLine = iterator.nextLine();
// do something with fileLine and store the result in setResult
}
}
return setResult;
}
I want that given a thread pool of size 3, every thread will process the method processDirectory and store the results into a set.
I'm using Spring boot and I was wondering what is the best practice to implement this.
I'm a bit lost with all the tutorials I found on the Net.

I suggest use Spring batch reader/process/writer huge files efficiently with spring boot
Click [here] (https://docs.spring.io/spring-batch/docs/current/reference/html/readersAndWriters.html)

Move already process file from one folder to another folder in flink

I am a new bee to flink and facing some challenges to solve the below use case
Use Case description:
I will receive a csv file with a timestamp on every single day in some folder say input. The file format would be file_name_dd-mm-yy-hh-mm-ss.csv.
Now my flink pipeline will read this csv file in a row by row fashion and it will be written to my Kafka topic.
Immediately after completion of data reading this file needs to be moved to another folder historic folder.
Why i need this is because : suppose that your ververica server stops either abruptly or manually and if you have all the processed files lying at the same location then after the ververica restart flink will re read all the files that it had processed earlier. So to prevent this scenario those files needs to be immediately move already read files to another location.
I googled a lot but did not find anything so can you guide me to achieve this.
Let me know if anything else is required.

Out of the box Flink provides the facility to monitor directory for new files and read them - via StreamExecutionEnvironment.getExecutionEnvironment.readFile (see similar stack overflow threads for examples - How to read newly added file in a directory in Flink / Monitoring directory for new files with Flink for data streams , etc.)
Looking into the source code of the readFile function, it calls for createFileInput() method, which simply instantiates ContinuousFileMonitoringFunction, ContinuousFileReaderOperatorFactory and configures the source -
addSource(monitoringFunction, sourceName, null, boundedness)
.transform("Split Reader: " + sourceName, typeInfo, factory);
ContinuousFileMonitoringFunction is actually a place where most of the logic happens.
So, if I were to implement your requirement, I would extend the functionality of ContinuousFileMonitoringFunction with my own logic of moving the processed file into the history folder and constructed the source from this function.
Given that the run method performs the read and forwarding inside the checkpointLock -
synchronized (checkpointLock) {
monitorDirAndForwardSplits(fileSystem, context);
}
I would say it's safe to move to historic folder on checkpoint completion files which have the modification day older then globalModificationTime, which is updated in monitorDirAndForwardSplits on splits collecting.
That said, I would extend the ContinuousFileMonitoringFunction class and implement the CheckpointListener interface, and in notifyCheckpointComplete would move the already processed files to historic folder:
public class ArchivingContinuousFileMonitoringFunction<OUT> extends ContinuousFileMonitoringFunction<OUT> implements CheckpointListener {
...
#Override
public void notifyCheckpointComplete(long checkpointId) throws Exception {
Map<Path, FileStatus> eligibleFiles = listEligibleForArchiveFiles(fs, new Path(path));
// do move logic
}
/**
* Returns the paths of the files already processed.
*
* #param fileSystem The filesystem where the monitored directory resides.
*/
private Map<Path, FileStatus> listEligibleForArchiveFiles(FileSystem fileSystem, Path path) {
final FileStatus[] statuses;
try {
statuses = fileSystem.listStatus(path);
} catch (IOException e) {
// we may run into an IOException if files are moved while listing their status
// delay the check for eligible files in this case
return Collections.emptyMap();
}
if (statuses == null) {
LOG.warn("Path does not exist: {}", path);
return Collections.emptyMap();
} else {
Map<Path, FileStatus> files = new HashMap<>();
// handle the new files
for (FileStatus status : statuses) {
if (!status.isDir()) {
Path filePath = status.getPath();
long modificationTime = status.getModificationTime();
if (shouldIgnore(filePath, modificationTime)) {
files.put(filePath, status);
}
} else if (format.getNestedFileEnumeration() && format.acceptFile(status)) {
files.putAll(listEligibleForArchiveFiles(fileSystem, status.getPath()));
}
}
return files;
}
}
}
and then define the data stream manually with the custom function:
ContinuousFileMonitoringFunction<OUT> monitoringFunction =
new ArchivingContinuousFileMonitoringFunction <>(
inputFormat, monitoringMode, getParallelism(), interval);
ContinuousFileReaderOperatorFactory<OUT, TimestampedFileInputSplit> factory = new ContinuousFileReaderOperatorFactory<>(inputFormat);
final Boundedness boundedness = Boundedness.CONTINUOUS_UNBOUNDED;
env.addSource(monitoringFunction, sourceName, null, boundedness)
.transform("Split Reader: " + sourceName, typeInfo, factory);

Flink itself does not provide a solution for doing this. You might need to build something yourself, or find a workflow tool that can be configured to handle this.
You can ask about this on the flink user mailing list. I know others have written scripts to do this; perhaps someone can share a solution.

Apache Flink : How to Call One Stream from Another Stream

My scenario is, I want to call one stream based on another stream input. Both Stream type is different. The following is my sample code. I want to trigger one stream when some message is received from Kafka stream.
While Application start up, i can read data from DB. Then again i want to get data from DB based on some kafka message. When i receive kafka message in stream , i want to get data from DB again.This is my actual use case.
How to achieve this? Is it possible ?
public class DataStreamCassandraExample implements Serializable{
private static final long serialVersionUID = 1L;
static Logger LOG = LoggerFactory.getLogger(DataStreamCassandraExample.class);
private transient static StreamExecutionEnvironment env;
static DataStream<Tuple4<UUID,String,String,String>> inputRecords;
public static void main(String[] args) throws Exception {
env = StreamExecutionEnvironment.getExecutionEnvironment();
ParameterTool argParameters = ParameterTool.fromArgs(args);
env.getConfig().setGlobalJobParameters(argParameters);
Properties kafkaProps = new Properties();
kafkaProps.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG,"localhost:9092");
kafkaProps.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "group1");
FlinkKafkaConsumer<String> kafkaConsumer = new FlinkKafkaConsumer<>("testtopic", new SimpleStringSchema(), kafkaProps);
ClusterBuilder cb = new ClusterBuilder() {
private static final long serialVersionUID = 1L;
#Override
public Cluster buildCluster(Cluster.Builder builder) {
return builder.addContactPoint("127.0.0.1")
.withPort(9042)
.withoutJMXReporting()
.build();
}
};
CassandraInputFormat<Tuple4<UUID,String,String,String>> cassandraInputFormat =
new CassandraInputFormat<> ("select * from employee_details", cb);
//While Application is start up , Read data from table and send as stream
inputRecords = getDBData(env,cassandraInputFormat);
// If any data comes from kafka means, again i want to get data from table.
//How to i trigger getDBData() method from inside this stream.
//The below code is not working
DataStream<String> inputRecords1= env.addSource(kafkaConsumer)
.map(new MapFunction<String,String>() {
private static final long serialVersionUID = 1L;
#Override
public String map(String value) throws Exception {
inputRecords = getDBData(env,cassandraInputFormat);
return "OK";
}
});
//This is not printed , when i call getDBData() stream from inside the kafka stream.
inputRecords1.print();
DataStream<Employee> empDataStream = inputRecords.map(new MapFunction<Tuple4<UUID,String,String,String>, Tuple2<String,Employee>>() {
private static final long serialVersionUID = 1L;
#Override
public Tuple2<String, Employee> map(Tuple4<UUID,String,String,String> value) throws Exception {
Employee emp = new Employee();
try{
emp.setEmpid(value.f0);
emp.setFirstname(value.f1);
emp.setLastname(value.f2);
emp.setAddress(value.f3);
}
catch(Exception e){
}
return new Tuple2<>(emp.getEmpid().toString(), emp);
}
}).keyBy(0).map(new MapFunction<Tuple2<String,Employee>,Employee>() {
private static final long serialVersionUID = 1L;
#Override
public Employee map(Tuple2<String, Employee> value)
throws Exception {
return value.f1;
}
});
empDataStream.print();
env.execute();
}
private static DataStream<Tuple4<UUID,String,String,String>> getDBData(StreamExecutionEnvironment env,
CassandraInputFormat<Tuple4<UUID,String,String,String>> cassandraInputFormat){
DataStream<Tuple4<UUID,String,String,String>> inputRecords = env
.createInput
(cassandraInputFormat
,TupleTypeInfo.of(new TypeHint<Tuple4<UUID,String,String,String>>() {}));
return inputRecords;
}
}

this is going to be a very verbose answer.
To correctly use Flink as a developper, you need to have an understading of its basic concepts. I suggest you start by the architecture overview (https://ci.apache.org/projects/flink/flink-docs-release-1.11/concepts/flink-architecture.html), it contains all you need to know in order to get into the world of Flink when you come from programming.
Now, looking at your code, it should not do what you expect because of how Flink will read it. You need to understand that Flink has at least two big steps when it executes your code: first it builds an execution graph which only describes what it needs to do. This happens at the job manager level. The second big step is to ask one or many workers to execute the graph. These two steps are sequential and anything you do regarding the graph description has to be done at the job manager level not inside your operations.
In your case, the graph has:
A Kafak source.
A map that will call getDBData() at a worker level (not good because getDBData() alters the graph by adding a new Input each time it is called).
The line inputRecords = getDBData(env,cassandraInputFormat); will create an orphan branch of the graph. And the line DataStream<Employee> empDataStream = inputRecords.map... will append a branch of a map->keyBy->map to that orphan branch. This will build a part of the graph that will read all the employee records from Cassandra and apply the map->keyBy->map transformations. This will not be linked with the Kafka source in any way.
Now let's get back to your need. I understand you need to fetch data for an employee when his/her id comes from Kafka and do some operations.
The most clean way to handle this is called Side Inputs. This is a data input that you declare when you build your graph and the job manager handles the reading of data and its transmission to the workers. The bad news is that Side Inputs are not yet working for streaming jobs in Flink (https://issues.apache.org/jira/browse/FLINK-2491 - this bug causes streamning jobs to not create checskpoints because side inputs finish quickly and this puts the job in a bizzare state).
With this being said you still have three more options. The right option depends on the size of your employee cassandra table.
The second option is to load all employees to a static final variable employees and use it inside your map functions. The backside of this approach is that the job manager will send a serialized copy of this variable to all workers and may congest your network and may also overload the RAM. If the size of the table is small and should not grow big in the future, then this may be an acceptable work-arround until the Side Inputs are working for streaming jobs. If the size of the table is big or should evolve in the future then consider the third option.
The third option is an improvement of the second one. It uses Flink's broadcast variables (see https://flink.apache.org/2019/06/26/broadcast-state.html and https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/state/broadcast_state.html). Short story: it is the same as before with better transfer management. Flink will find the best way to store and send the variable to the workers. This approach is though a litle bit more complicated to implement correctly.
The last option is not advisable in normal circumstances. It simply consists in making a call to Cassandra inside your map operation. This is not a good practice because it adds repeated latency to all your map executions (there will be as many calls as items passing through Kafka). A call means a connection creation, the actual request with the query and waiting for Cassandra to reply and freeing the connection. This can be a lot of work for a step in your graph. It is a solution to consider when you really can not find any alternatives.
For your case, I would advise the third option. I guess the employee table should not be very big and using Broadcast variables is a good choice.

Issues with Dynamic Destinations in Dataflow

I have a Dataflow job that reads data from pubsub and based on the time and filename writes the contents to GCS where the folder path is based on the YYYY/MM/DD. This allows files to be generated in folders based on date and uses apache beam's FileIO and Dynamic Destinations.
About two weeks ago, I noticed an unusual buildup of unacknowledged messages. Upon restarting the df job the errors disappeared and new files were being written in GCS.
After a couple of days, writing stopped again, except this time, there were errors claiming that processing was stuck. After some trusty SO research, I found out that this was likely caused by a deadlock issue in pre 2.90 Beam because it used the Conscrypt library as the default security provider. So, I upgraded to Beam 2.11 from Beam 2.8.
Once again, it worked, until it didn't. I looked more closely at the error and noticed that it had a problem with a SimpleDateFormat object, which isn't thread-safe. So, I switched to use Java.time and DateTimeFormatter, which is thread-safe. It worked until it didn't. However, this time, the error was slightly different and didn't point to anything in my code:
The error is provided below.
Processing stuck in step FileIO.Write/WriteFiles/WriteShardedBundlesToTempFiles/WriteShardsIntoTempFiles for at least 05m00s without outputting or completing in state process
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at org.apache.beam.vendor.guava.v20_0.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:469)
at org.apache.beam.vendor.guava.v20_0.com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:76)
at org.apache.beam.runners.dataflow.worker.MetricTrackingWindmillServerStub.getStateData(MetricTrackingWindmillServerStub.java:202)
at org.apache.beam.runners.dataflow.worker.WindmillStateReader.startBatchAndBlock(WindmillStateReader.java:409)
at org.apache.beam.runners.dataflow.worker.WindmillStateReader$WrappedFuture.get(WindmillStateReader.java:311)
at org.apache.beam.runners.dataflow.worker.WindmillStateReader$BagPagingIterable$1.computeNext(WindmillStateReader.java:700)
at org.apache.beam.vendor.guava.v20_0.com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:145)
at org.apache.beam.vendor.guava.v20_0.com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:140)
at org.apache.beam.vendor.guava.v20_0.com.google.common.collect.MultitransformedIterator.hasNext(MultitransformedIterator.java:47)
at org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn.processElement(WriteFiles.java:701)
at org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn$DoFnInvoker.invokeProcessElement(Unknown Source)
This error started occurring approximately 5 hours after job deployment and at an increasing rate over time. Writing slowed significantly within 24 hours. I have 60 workers and I suspect that one worker fails every time there is an error, which eventually kills the job.
In my writer, I parse the lines for certain keywords (may not be the best way) in order to determine which folder it belongs in. I then proceed to insert the file to GCS with the determined filename. Here is the code I use for my writer:
The partition function is provided as the following:
#SuppressWarnings("serial")
public static class datePartition implements SerializableFunction<String, String> {
private String filename;
public datePartition(String filename) {
this.filename = filename;
}
#Override
public String apply(String input) {
String folder_name = "NaN";
String date_dtf = "NaN";
String date_literal = "NaN";
try {
Matcher foldernames = Pattern.compile("\"foldername\":\"(.*?)\"").matcher(input);
if(foldernames.find()) {
folder_name = foldernames.group(1);
}
else {
Matcher folderid = Pattern.compile("\"folderid\":\"(.*?)\"").matcher(input);
if(folderid.find()) {
folder_name = folderid.group(1);
}
}
Matcher date_long = Pattern.compile("\"timestamp\":\"(.*?)\"").matcher(input);
if(date_long.find()) {
date_literal = date_long.group(1);
if(Utilities.isNumeric(date_literal)) {
LocalDateTime date = LocalDateTime.ofInstant(Instant.ofEpochMilli(Long.valueOf(date_literal)), ZoneId.systemDefault());
date_dtf = date.format(dtf);
}
else {
date_dtf = date_literal.split(":")[0].replace("-", "/").replace("T", "/");
}
}
return folder_name + "/" + date_dtf + "h/" + filename;
}
catch(Exception e) {
LOG.error("ERROR with either foldername or date");
LOG.error("Line : " + input);
LOG.error("folder : " + folder_name);
LOG.error("Date : " + date_dtf);
return folder_name + "/" + date_dtf + "h/" + filename;
}
}
}
And the actual place where the pipeline is deployed and run can be found below:
public void streamData() {
Pipeline pipeline = Pipeline.create(options);
pipeline.apply("Read PubSub Events", PubsubIO.readMessagesWithAttributes().fromSubscription(options.getInputSubscription()))
.apply(options.getWindowDuration() + " Window",
Window.<PubsubMessage>into(FixedWindows.of(parseDuration(options.getWindowDuration())))
.triggering(AfterWatermark.pastEndOfWindow())
.discardingFiredPanes()
.withAllowedLateness(parseDuration("24h")))
.apply(new GenericFunctions.extractMsg())
.apply(FileIO.<String, String>writeDynamic()
.by(new datePartition(options.getOutputFilenamePrefix()))
.via(TextIO.sink())
.withNumShards(options.getNumShards())
.to(options.getOutputDirectory())
.withNaming(type -> FileIO.Write.defaultNaming(type, ".txt"))
.withDestinationCoder(StringUtf8Coder.of()));
pipeline.run();
}

The error 'Processing stuck ...' indicates that some particular operation took longer than 5m, not that the job is permanently stuck. However, since the step FileIO.Write/WriteFiles/WriteShardedBundlesToTempFiles/WriteShardsIntoTempFiles is the one that is stuck and the job gets cancelled/killed, I would think on an issue while the job is writing temp files.
I found out the BEAM-7689 issue which is related to a second-granularity timestamp (yyyy-MM-dd_HH-mm-ss) that is used to write temporary files. This happens because several concurrent jobs can share the same temporary directory and this can cause that one of the jobs deletes it before the other(s) job finish(es).
According to the previous link, to mitigate the issue please upgrade to SDK 2.14. And let us know if the error is gone.

Since posting this question, I've optimized the dataflow job to dodge bottlenecks and increase parallelization. Much like rsantiago explained, processing stuck isn't an error, but simply a way dataflow communicates that a step is taking significantly longer than other steps, which is essentially a bottleneck that can't be cleared with the given resources. The changes I made seem to have addressed them. The new code is as follows:
public void streamData() {
try {
Pipeline pipeline = Pipeline.create(options);
pipeline.apply("Read PubSub Events", PubsubIO.readMessagesWithAttributes().fromSubscription(options.getInputSubscription()))
.apply(options.getWindowDuration() + " Window",
Window.<PubsubMessage>into(FixedWindows.of(parseDuration(options.getWindowDuration())))
.triggering(AfterWatermark.pastEndOfWindow())
.discardingFiredPanes()
.withAllowedLateness(parseDuration("24h")))
.apply(FileIO.<String,PubsubMessage>writeDynamic()
.by(new datePartition(options.getOutputFilenamePrefix()))
.via(Contextful.fn(
(SerializableFunction<PubsubMessage, String>) inputMsg -> new String(inputMsg.getPayload(), StandardCharsets.UTF_8)),
TextIO.sink())
.withDestinationCoder(StringUtf8Coder.of())
.to(options.getOutputDirectory())
.withNaming(type -> new CrowdStrikeFileNaming(type))
.withNumShards(options.getNumShards())
.withTempDirectory(options.getTempLocation()));
pipeline.run();
}
catch(Exception e) {
LOG.error("Unable to deploy pipeline");
LOG.error(e.toString(), e);
}
}
The biggest change involved removing the extractMsg() function and changing partitioning to only use metadata. Both of these steps forced deserialization/reserialization of messages and heavily impacted performance.
Additionally, since my data set was unbounded, I had to set a non-zero number of shards. I wanted to simplify my filenaming policy, so I set it to 1 without knowing how much it hurt performance. Since then, I've found a good balance of workers/shards/machine type for my job (mostly based on guess & check, unfortunately).
Although it's still possible that a bottleneck might be observed with a large enough data load, the pipeline has been performing well despite heavy load (3-5tb per day). The changes also significantly improved autoscaling, but I'm not sure why. The dataflow job now reacts to spikes and valleys a lot quicker.

How to process multiple files separately after SparkContext.wholeTextFiles?

I'm trying to use wholeTextFiles to read all the files names in a folder and process them one-by-one seperately(For example, I'm trying to get the SVD vector of each data set and there are 100 sets in total). The data are saved in .txt files spitted by space and arranged in different lines(like a matrix).
The problem I came across with is that after I use "wholeTextFiles("path with all the text files")", It's really difficult to read and parse the data and I just can't use the method like what I used when reading only one file. The method works fine when I just read one file and it gives me the correct output. Could someone please let me know how to fix it here? Thanks!
public static void main (String[] args) {
SparkConf sparkConf = new SparkConf().setAppName("whole text files").setMaster("local[2]").set("spark.executor.memory","1g");;
JavaSparkContext jsc = new JavaSparkContext(sparkConf);
JavaPairRDD<String, String> fileNameContentsRDD = jsc.wholeTextFiles("/Users/peng/FMRITest/regionOutput/");
JavaRDD<String[]> lineCounts = fileNameContentsRDD.map(new Function<Tuple2<String, String>, String[]>() {
#Override
public String[] call(Tuple2<String, String> fileNameContent) throws Exception {
String content = fileNameContent._2();
String[] sarray = content .split(" ");
double[] values = new double[sarray.length];
for (int i = 0; i< sarray.length; i++){
values[i] = Double.parseDouble(sarray[i]);
}
pd.cache();
RowMatrix mat = new RowMatrix(pd.rdd());
SingularValueDecomposition<RowMatrix, Matrix> svd = mat.computeSVD(84, true, 1.0E-9d);
Vector s = svd.s();
}});

Quoting the scaladoc of SparkContext.wholeTextFiles:
wholeTextFiles(path: String, minPartitions: Int = defaultMinPartitions): RDD[(String, String)] Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
In other words, wholeTextFiles might not simply be what you want.
Since by design "Small files are preferred" (see the scaladoc), you could mapPartitions or collect (with filter) to grab a subset of the files to apply the parsing to.
Once you have the files per partitions in your hands, you could use Scala's Parallel Collection API and schedule Spark jobs to execute in parallel:
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.