Apache Beam Global Counting - java

I am trying to understand the best way of solving the following:
As a simple example scenario, I have a file which describes a test name and whether its execution passed (true/false).
test-scenario,passed
--------------------
testA,true
testB,false
Using Apache Beam I can read and parse the file into a PCollection<TestDetails>, and then, using subsequent transforms, write all test details which passed to one set of files and likewise for the tests which failed.
After writing the above files, I would finally like to generate some counts: the total number of file records processed, the number of tests that passed, and the number of tests that failed, and write these details to a single file.
Should I use a global combine for this?

For this purpose, you can use Beam Metrics (please see the documentation). It provides counters that can be used for the needs you described above, and the metrics can be fetched once your pipeline has finished. Please take a look at this example. Also, Beam allows exporting metrics to an external sink, if that's more convenient.
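As a rough sketch of that approach (the TestDetails type comes from your description, while its isPassed() accessor and the counter names are assumptions), counters can be declared and incremented inside a DoFn, then queried from the PipelineResult once the pipeline has finished:

import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.MetricNameFilter;
import org.apache.beam.sdk.metrics.MetricQueryResults;
import org.apache.beam.sdk.metrics.MetricResult;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.metrics.MetricsFilter;
import org.apache.beam.sdk.transforms.DoFn;

// Hypothetical DoFn that counts total, passed and failed records while passing them through.
class CountTestDetailsFn extends DoFn<TestDetails, TestDetails> {

  private final Counter total = Metrics.counter(CountTestDetailsFn.class, "total-records");
  private final Counter passed = Metrics.counter(CountTestDetailsFn.class, "passed-tests");
  private final Counter failed = Metrics.counter(CountTestDetailsFn.class, "failed-tests");

  @ProcessElement
  public void processElement(@Element TestDetails details, OutputReceiver<TestDetails> out) {
    total.inc();
    if (details.isPassed()) {
      passed.inc();
    } else {
      failed.inc();
    }
    out.output(details);
  }

  // Fetch the committed counter values after pipeline.run().waitUntilFinish().
  static void printCounters(PipelineResult result) {
    MetricQueryResults metrics = result.metrics().queryMetrics(
        MetricsFilter.builder()
            .addNameFilter(MetricNameFilter.inNamespace(CountTestDetailsFn.class))
            .build());
    for (MetricResult<Long> counter : metrics.getCounters()) {
      System.out.println(counter.getName() + ": " + counter.getCommitted());
    }
  }
}

Writing the queried values to a single summary file would then be a small post-pipeline step outside Beam.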

Related

Multiple readers in single job in single step using spring batch

I am a newbie in Spring Batch. I have a use case in which I have to read files from a specific folder and write those files into a DB.
For example, I have files in a folder like this:
-company_group
|
-my_company_group.json
-my_company_group_alternate_id.json
-sg_company_group.json
-sg_company_group_alternate_id.json
Note: sg = Singapore, my=Malaysia
Now, I want to read these files in the following order:
SG files should be read before MY files.
For each country, the alternate file should come first.
For example,
sg_company_group_alternate_id.json
sg_company_group.json
And same for the MY files
Currently, I'm reading all the files by writing a custom MultiResourcePartitioner and sorting the files in the order I mentioned above.
There will be 1 writer and reader for 1 file.
There will be 1 job.
Now, the problem is that I have a step with the custom partitioner mentioned above: it gets all the files and sorts them, but they all go through a single reader. One reader handles all the files, whereas I want a separate reader per file.
In other words, in one job I have a step which loads all the files. Within this step, one file should be read and written to the DB, and then the same should be repeated for each of the other files in that same step.
As per my understanding, Spring Batch does not allow multiple readers in one step.
Is there any workaround?
Thanks.
I would recommend creating a job instance per file, meaning you pass the file as an identifying job parameter. This has at least two major benefits:
Scaling: you can run multiple jobs in parallel, each job processing a different file
Fault-tolerance: if a job fails, you only restart the failed job, without affecting other files
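As a rough sketch (the job bean, folder layout and parameter name are placeholders, assuming a companyGroupJob that processes a single file), launching one job instance per file could look like this:

import java.io.File;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;

public class CompanyGroupJobRunner {

    private final JobLauncher jobLauncher;
    private final Job companyGroupJob; // hypothetical job that reads one file and writes it to the DB

    public CompanyGroupJobRunner(JobLauncher jobLauncher, Job companyGroupJob) {
        this.jobLauncher = jobLauncher;
        this.companyGroupJob = companyGroupJob;
    }

    public void runForFolder(File folder) throws Exception {
        // One job instance per file: the file path is an identifying job parameter, so each
        // file gets its own instance and can be restarted independently of the others.
        File[] files = folder.listFiles((dir, name) -> name.endsWith(".json"));
        if (files == null) {
            return; // not a directory
        }
        for (File file : files) {
            JobParameters parameters = new JobParametersBuilder()
                    .addString("inputFile", file.getAbsolutePath())
                    .toJobParameters();
            jobLauncher.run(companyGroupJob, parameters);
        }
    }
}

The ordering you need (SG before MY, alternate files first) can be handled by sorting the file array before launching, and the reader inside the job would be step-scoped and bound to the inputFile parameter.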

Spring Batch: Create new steps at job runtime

Context: I am working on a Spring Batch pipeline which will upload data to a database. I have already figured out how to read flat .csv files and write items from them with JdbcBatchItemWriter. The pipeline I am working on must read data from a Zip archive which contains multiple .csv files of different types. I'd like to have archive downloading and inspecting as the first two steps of the job. I do not have enough disk space to unpack the whole downloaded archive. Instead of unpacking, I inspect the zip file content to determine the .csv file paths inside the Zip file system and their types. Inspecting the zip file makes it easy to obtain an InputStream for the corresponding csv file. After that, reading and uploading (directly from the zip) all discovered .csv files will be executed in separate steps of the job.
Question: Is there any way to dynamically populate new Steps for each of the discovered csv entries of the zip at job runtime, as a result of the inspect step?
Tried: I know that Spring Batch has conditional flow, but it seems that this technique only allows configuring a static number of steps that are well defined before job execution. In my case, the number of steps (csv files) and the reader types are discovered at the second step of the job.
Also, there is a MultiResourceItemReader which allows reading multiple resources sequentially. But I'm going to read different types of csv files with appropriate readers. Moreover, I'd like to have "file-wise" step encapsulation such that if the loading step for one file fails, the others will still be executed.
The similar question How to create dynamic steps in Spring Batch does not have a suitable solution for me, as the answer assumes step creation before the job runs, while I need to add steps as a result of the second step of the job.
You could use partitioned steps:
Pass a variable containing the list of csv entries as resources to the JobExecutionContext during your inspect step.
In the partition method, retrieve the list of csv entries and create a partition for each one.
The worker step will then be executed once for each partition created.
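A minimal sketch of such a partitioner (the class name, the "csvEntry" key and the way the list is handed over are assumptions) could look like this:

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

// Hypothetical partitioner: takes the list of csv entries discovered by the inspect step
// and creates one partition (and hence one worker step execution) per entry.
public class CsvEntryPartitioner implements Partitioner {

    private final List<String> csvEntries;

    public CsvEntryPartitioner(List<String> csvEntries) {
        this.csvEntries = csvEntries;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        int index = 0;
        for (String entry : csvEntries) {
            ExecutionContext context = new ExecutionContext();
            context.putString("csvEntry", entry); // consumed by a step-scoped reader
            partitions.put("partition-" + index++, context);
        }
        return partitions;
    }
}

The list itself could be bound from the job execution context, for example by making the partitioner a step-scoped bean and injecting the value with @Value("#{jobExecutionContext['csvEntries']}"); the worker step's reader would then be step-scoped as well and pick the csvEntry key out of its step execution context.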

Create output file for each of input file from directory of files in Spring Batch

I have a directory of CSV files which contains transaction information.
I need to read each file, apply some business logic (validate against the DB whether the account and transaction are valid or not), and write the valid transactions into a new output file.
Input:
Tranx_100.csv, Tranx_101.csv, Tranx_102.csv....
Output:
Tranx_100_output.csv, Tranx_101_output.csv, Tranx_102_output.csv....
I want to use Spring Batch for this. Any suggestions on how to implement it?
For each file, the input, processing, and output are the same - can I run them as part of a 'Step' and repeat the step for each input file in a job?
Instead of looping on the same step or using multiple steps in a single job, I would use a job instance for each file (i.e. the file is an identifying job parameter) and launch jobs in parallel. This approach has multiple advantages:
Fault-tolerance: In case of failure, only the failed file is re-processed
Scalability: You can run multiple jobs in parallel easily
Logging: logs will be separated by design
And all the good reasons for making one thing do one thing and do it well.
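As a sketch of what the inside of such a job could look like (the Transaction class with accountId/amount fields, the parameter names and the bean names are all assumptions), the reader and writer can be step-scoped and bound to the files passed as identifying job parameters:

import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.FlatFileItemWriter;
import org.springframework.batch.item.file.builder.FlatFileItemReaderBuilder;
import org.springframework.batch.item.file.builder.FlatFileItemWriterBuilder;
import org.springframework.batch.item.file.transform.BeanWrapperFieldExtractor;
import org.springframework.batch.item.file.transform.DelimitedLineAggregator;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.FileSystemResource;

@Configuration
public class TransactionFileJobConfig {

    // Reads the input file passed as the identifying job parameter "inputFile".
    @Bean
    @StepScope
    public FlatFileItemReader<Transaction> transactionReader(
            @Value("#{jobParameters['inputFile']}") String inputFile) {
        return new FlatFileItemReaderBuilder<Transaction>()
                .name("transactionReader")
                .resource(new FileSystemResource(inputFile))
                .delimited()
                .names(new String[] {"accountId", "amount"})
                .targetType(Transaction.class)
                .build();
    }

    // Writes valid transactions to the output file passed as the job parameter "outputFile".
    @Bean
    @StepScope
    public FlatFileItemWriter<Transaction> transactionWriter(
            @Value("#{jobParameters['outputFile']}") String outputFile) {
        BeanWrapperFieldExtractor<Transaction> extractor = new BeanWrapperFieldExtractor<>();
        extractor.setNames(new String[] {"accountId", "amount"});
        DelimitedLineAggregator<Transaction> aggregator = new DelimitedLineAggregator<>();
        aggregator.setFieldExtractor(extractor);
        return new FlatFileItemWriterBuilder<Transaction>()
                .name("transactionWriter")
                .resource(new FileSystemResource(outputFile))
                .lineAggregator(aggregator)
                .build();
    }
}

The launcher would then pass inputFile and outputFile (e.g. Tranx_100.csv and Tranx_100_output.csv) as identifying job parameters, one job instance per file, and a processor between the reader and the writer would apply the DB validation and drop invalid transactions.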

Aggregators in Apache Beam with Dataflow runner

I am trying to create aggregators to count values that satisfy a condition across all input data. I looked into the documentation and found the below for creating them:
https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/transforms/Aggregator
I am using: google-cloud-dataflow-java-sdk-all 2.4.0 (Apache Beam based)
However, I am not able to find the corresponding class in the new Beam API.
I looked into the org.apache.beam.sdk.transforms package.
Can you please let me know how I can use aggregators with the Dataflow runner in the new API?
The link you have is for the old SDK (1.x).
In SDK 2.x, you should refer to the apache-beam SDK. As for the Aggregators you mentioned, if I understand correctly, they were for adding counters during processing. I guess the corresponding package should be org.apache.beam.sdk.metrics.
Package org.apache.beam.sdk.metrics
Metrics allow exporting information about the execution of a pipeline.
and the org.apache.beam.sdk.metrics.Counter interface:
A metric that reports a single long value and can be incremented or decremented.
As of now, there seems to be no direct replacement for the Aggregator class in the Apache Beam SDK 2.x. An alternative way to count values that satisfy a condition is to use transforms: by first collecting only the data that meets the condition and then applying a Combine (or Count) transform, you can get a count of the input data satisfying the condition.
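For illustration, here is a rough sketch of that transform-based idea using Filter plus Count.globally (the in-memory input and the v > 10 condition are just placeholders); counting with Beam Metrics, as in the first answer above, is the other option:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.values.PCollection;

public class ConditionalCountExample {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Hypothetical in-memory input; in practice this would come from your actual source.
    PCollection<Integer> values = pipeline.apply(Create.of(3, 42, 7, 100, 5));

    // Keep only the elements satisfying the condition, then count them across all input data.
    PCollection<Long> matchingCount =
        values
            .apply("KeepMatching", Filter.by((Integer v) -> v > 10))
            .apply("CountMatching", Count.globally());

    // matchingCount now holds a single element with the number of matching values;
    // it can be written out or further processed as needed.
    pipeline.run().waitUntilFinish();
  }
}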

How to "ignore" a picked up exchange from Apache Camel File Consumer

I have a batch file consumer that is polling a public directory that many different processes drop files to. These files are "batched" together via a GUID in the filename. Once a particular batch is completed, the applications drop a .done file to trigger the Camel file consumer.
My question is that I'm trying to find a way to "ignore" messages/exchanges that contain files I don't want to process (i.e. files that aren't part of my current batch).
Additionally, I'd like the "ignored" exchanges not to be processed by Camel at all (i.e. not moved to the .processed directory).
I'm currently looking at the Message Filter as a potential way to do this, although I'm not sure it will fulfill my requirement not to process those files.
Any suggestions?
You can use the 'include' or 'antInclude' (or 'exclude' and 'antExclude') parameters on the File component to only process specific files, based on a regex or Ant pattern. Files that aren't matched won't be moved or touched at all.
If you need a more complicated set of rules than can be achieved with a regex or Ant pattern, you may need to write your own custom pluggable filter, which you can then specify using the 'filter' parameter.
See here for more details on the above:
http://camel.apache.org/file2.html
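As a rough sketch (the directory, the GUID value and the processing endpoint are placeholders), an include-based route could look like this:

import org.apache.camel.builder.RouteBuilder;

// Only files whose name contains the current batch GUID are picked up;
// non-matching files are never touched or moved by this consumer.
public class BatchFileRoute extends RouteBuilder {

    private final String batchGuid = "replace-with-current-batch-guid"; // hypothetical, resolved per batch

    @Override
    public void configure() {
        from("file:/shared/drop-zone"
                + "?include=.*" + batchGuid + ".*"   // regex on the file name
                + "&move=.processed"                 // only matched files get moved after processing
                + "&readLock=changed")
            .to("bean:batchProcessor");              // hypothetical bean doing the actual work
    }
}

If you also want Camel to wait for the .done marker before consuming a batch, the doneFileName option on the same endpoint can be combined with this.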
