Aggregators in Apache Beam with Dataflow runner - Java

I am trying to create aggregators to count values that satisfy a condition across all input data. I looked into the documentation and found the class below:
https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/transforms/Aggregator
I am using google-cloud-dataflow-java-sdk-all 2.4.0 (Apache Beam based).
However, I am not able to find the corresponding class in the new Beam API.
I looked into the org.apache.beam.sdk.transforms package.
Can you please let me know how I can use aggregators with the Dataflow runner in the new API?

The link you have is for the old SDK (1.x).
In SDK 2.x, you should refer to the Apache Beam SDK. For the Aggregators you mentioned, if I understand correctly, they were used for adding counters during processing. I guess the corresponding package should be org.apache.beam.sdk.metrics.
Package org.apache.beam.sdk.metrics
Metrics allow exporting information about the execution of a pipeline.
and the org.apache.beam.sdk.metrics.Counter interface:
A metric that reports a single long value and can be incremented or decremented.
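For example, a counter can be declared in a DoFn and incremented per element. A minimal sketch, assuming a hypothetical DoFn, element type, and matching condition:

import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;

public class CountingFn extends DoFn<String, String> {
    // Counters are identified by a namespace and a name; the runner
    // (e.g. Dataflow) aggregates them across all workers.
    private final Counter matched = Metrics.counter(CountingFn.class, "matched");

    @ProcessElement
    public void processElement(ProcessContext c) {
        if (c.element().startsWith("x")) { // hypothetical condition
            matched.inc();
        }
        c.output(c.element());
    }
}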

As of now, there seems to be no direct replacement for the Aggregator class in Apache Beam SDK 2.x. An alternative solution for counting values that satisfy a condition would be transforms: by using a GroupBy transform to collect the data meeting the condition and then a Combine transform, you can get a count of the input data satisfying the condition. A sketch of this idea follows.
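Here is a minimal sketch using Filter followed by Count (a Combine-based transform); the element type and condition are placeholders:

import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.values.PCollection;

// input: the PCollection whose matching elements we want to count
PCollection<Long> matching =
    input
        .apply("KeepMatching", Filter.by((String s) -> s.startsWith("x"))) // hypothetical condition
        .apply("CountMatching", Count.globally());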

Related

Disable logging from a specific class in Apache Beam running on Google Cloud Dataflow

I saw on this website https://cloud.google.com/dataflow/docs/guides/logging that I can programmatically set the logging behavior of the pipeline. In my case, a class from a third-party dependency is producing too many warning messages, so I want to set its logging level to ERROR only.
So here is what I did (a bit different from what the website provides, which is out of date):
val sdkHarnessLogLevelOverrides = SdkHarnessLogLevelOverrides()
sdkHarnessLogLevelOverrides.addOverrideForClass(
    ThirdPartyClass::class.java,
    SdkHarnessOptions.LogLevel.ERROR
)
val optionsWithLoggingConfiguration = options.`as`(SdkHarnessOptions::class.java)
optionsWithLoggingConfiguration.sdkHarnessLogLevelOverrides = sdkHarnessLogLevelOverrides
// I also want to set the default logging behavior for the rest to WARN level
optionsWithLoggingConfiguration.defaultSdkHarnessLogLevel = SdkHarnessOptions.LogLevel.WARN
val pipeline = Pipeline.create(optionsWithLoggingConfiguration)
But I can still see the WARN messages from ThirdPartyClass in Stackdriver, coming from dataflow_step on a Dataflow worker.
I don't know what I did wrong here; the documentation is pretty out of date and has limited examples. Has anyone encountered this problem before? I would love to hear from you. Thank you very much.
Try using DataflowWorkerLoggingOptions.setWorkerLogLevelOverrides
Explanation: Dataflow currently can run in two different modes, which we call "v1" and "v2":
In Dataflow v1, Java pipelines run their DoFn code directly in the same process as the Dataflow worker, so you use DataflowWorkerLoggingOptions.setWorkerLogLevelOverrides.
In Dataflow v2, which you can activate with --experiments=use_runner_v2 and which will eventually become the default, the SDK harness runs your DoFn code in a separate process from the highly optimized C++-based Dataflow v2 worker, so you use SdkHarnessOptions.setSdkHarnessLogLevelOverrides.
A bonus about the v2 approach is that this works on all modernized Beam runners, not just Dataflow, as the SDK harness that runs your DoFn is a portable process that is the same regardless of runner that you choose.
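For reference, the v1 override can be set along these lines (a minimal Java sketch; ThirdPartyClass stands in for the noisy dependency):

import org.apache.beam.runners.dataflow.options.DataflowWorkerLoggingOptions;

DataflowWorkerLoggingOptions loggingOptions =
    options.as(DataflowWorkerLoggingOptions.class);
// Silence the noisy class down to ERROR on the v1 worker...
loggingOptions.setWorkerLogLevelOverrides(
    new DataflowWorkerLoggingOptions.WorkerLogLevelOverrides()
        .addOverrideForClass(ThirdPartyClass.class, DataflowWorkerLoggingOptions.Level.ERROR));
// ...and keep WARN as the default for everything else.
loggingOptions.setDefaultWorkerLogLevel(DataflowWorkerLoggingOptions.Level.WARN);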

Unable to use Elasticsearch new multi_terms aggregation via Java High Level Client

I've updated my ES cluster to version 7.12 to use the brand new multi-terms aggregation. I want to do something like this:
AggregationBuilders.multiTerms(field)
in my Java REST client (from org.elasticsearch.client:elasticsearch-rest-high-level-client).
But as far as I can see, there is no builder for multi terms. I've found it in the sources on GitHub. As I understand it, this class is externalized into some x-pack plugin and I can't get access to it. Should I add this plugin to my classpath? And where can I find it?
Thank you

How to get the execution graph of a created topology in Apache Storm?

Say I have a very complex Apache Storm topology that is created randomly for some experiments and results in a StormTopology object.
I am now interested in the execution graph of this topology (what does it look like?).
But to the best of my knowledge, there is no simple way to obtain the graph structure back from Storm. There are some approaches, but they have their downsides:
I can scan the StormTopology object and find the bolt and spout objects, which contain some stream and input objects. But using their toString() output in my own logging, I get something like this for the components. I cannot see how to rebuild the execution graph from it:
"mybolt1": {
"inputs": {
"GlobalStreamId(componentId:spout1, streamId:default)": "<Grouping shuffle:NullStruct()>"
},
"streams": {
"s1": "StreamInfo(output_fields:[value], direct:false)",
"s1__punctuation": "StreamInfo(output_fields:[__punctuation], direct:false)"
}
By the way, what is meant by punctuation here?
Some old blog posts and the documentation mention the Storm visualization, but since I cannot find any current information on it, I assume this feature is no longer supported.
So my question is: how can I rebuild the execution graph (the format does not matter) from a given StormTopology? I am mainly interested in the arrangement of the included bolts and spouts.
The topology visualization should be available in Storm UI as of v0.9.2; see the release notes. I don't think the feature has been removed; the 2.2.0 release notes still mention the Storm UI visualizer in a few bug fixes.
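That said, the edges can also be recovered programmatically, since every bolt declares its inputs as (GlobalStreamId, Grouping) pairs. A minimal sketch using Storm's Thrift-generated accessors from org.apache.storm.generated:

import java.util.Map;
import org.apache.storm.generated.Bolt;
import org.apache.storm.generated.GlobalStreamId;
import org.apache.storm.generated.Grouping;
import org.apache.storm.generated.StormTopology;

static void printEdges(StormTopology topology) {
    // Each declared input of a bolt is an edge: upstream component/stream -> this bolt.
    for (Map.Entry<String, Bolt> bolt : topology.get_bolts().entrySet()) {
        Map<GlobalStreamId, Grouping> inputs = bolt.getValue().get_common().get_inputs();
        for (GlobalStreamId in : inputs.keySet()) {
            System.out.println(
                in.get_componentId() + " -[" + in.get_streamId() + "]-> " + bolt.getKey());
        }
    }
}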

Apache Beam Global Counting

I am trying to understand the best way of solving the following:
As a simple example scenario, I have a file which describes a test name and whether its execution passed (true/false).
test-scenario,passed
--------------------
testA,true
testB,false
Using Apache Beam, I can read and parse the file into a PCollection<TestDetails> and then, using subsequent transforms, write all test details which passed to one set of files, and likewise for the tests which failed.
After writing the above files, I would finally like to generate some counts: the total number of file records processed, the number of tests that passed, and the number of tests that failed, and write these details to a single file.
Should I use a global combine for this?
For this purpose, you can use Beam Metrics (please see the documentation). It provides counters that can be used for the needs you described above, and the metrics can be fetched once your pipeline is finished. Please take a look at this example. Also, Beam allows exporting metrics to an external sink, if that's more convenient.
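A minimal sketch of that approach, assuming a hypothetical TestDetails with an isPassed() accessor; the counters are queried from the PipelineResult after the run:

import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.MetricNameFilter;
import org.apache.beam.sdk.metrics.MetricQueryResults;
import org.apache.beam.sdk.metrics.MetricResult;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.metrics.MetricsFilter;
import org.apache.beam.sdk.transforms.DoFn;

class CountResultsFn extends DoFn<TestDetails, TestDetails> {
    private final Counter total = Metrics.counter("tests", "total");
    private final Counter passed = Metrics.counter("tests", "passed");
    private final Counter failed = Metrics.counter("tests", "failed");

    @ProcessElement
    public void processElement(ProcessContext c) {
        total.inc();
        if (c.element().isPassed()) { passed.inc(); } else { failed.inc(); }
        c.output(c.element());
    }
}

// After the pipeline finishes, query the counters by namespace:
PipelineResult result = pipeline.run();
result.waitUntilFinish();
MetricQueryResults metrics = result.metrics().queryMetrics(
    MetricsFilter.builder()
        .addNameFilter(MetricNameFilter.inNamespace("tests"))
        .build());
for (MetricResult<Long> counter : metrics.getCounters()) {
    // getAttempted() is supported by all runners; getCommitted() may not be.
    System.out.println(counter.getName() + ": " + counter.getAttempted());
}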

SonarQube - Is there an API to grab a part of analysis for all projects you have?

I want to be able to extract, for example, just the technical debt numbers out of my Sonar instance for all the projects I have, and display them on a page.
Does Sonar provide an API that I can use to achieve this?
SonarQube lets you get exhaustive data using its Web API. Taking your example of a project's measures:
Since SonarQube 5.4
Use the api/measures Web API (see the parameters in the documentation). Example for the project postgresql:
Get the component ID:
https://nemo.sonarqube.org/api/components/show?key=postgresql
Get the desired metrics:
https://nemo.sonarqube.org/api/measures/component?componentId=6d75286c-42bb-4377-a0a1-bfe88169cffb&metricKeys=sqale_debt_ratio&additionalFields=metrics,periods
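As an illustration, the two calls can be scripted; a minimal Java sketch using java.net.http (the URLs and metric key mirror the example above, with the component ID copied from step 1; JSON parsing is omitted):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SonarMeasures {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Step 1: look up the component (extract the "id" field from the JSON response).
        HttpRequest show = HttpRequest.newBuilder(
            URI.create("https://nemo.sonarqube.org/api/components/show?key=postgresql")).build();
        System.out.println(client.send(show, HttpResponse.BodyHandlers.ofString()).body());
        // Step 2: fetch the desired metrics for that component ID.
        HttpRequest measures = HttpRequest.newBuilder(
            URI.create("https://nemo.sonarqube.org/api/measures/component"
                + "?componentId=6d75286c-42bb-4377-a0a1-bfe88169cffb"
                + "&metricKeys=sqale_debt_ratio&additionalFields=metrics,periods")).build();
        System.out.println(client.send(measures, HttpResponse.BodyHandlers.ofString()).body());
    }
}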
Before SonarQube 5.4
Use api/resources Web API:
http://sonarqube_url/api/resources?resource=your_resource&metrics=metric_key
Listing metric keys
Use api/metrics/search (documented here), see also Metric Definitions.
