How to use aggregateField() over multiple columns in Apache Beam Java SDK?

In Apache Beam Python SDK, it is possible to perform the following:
input
| GroupBy(account=lambda s: s["account"])
.aggregate_field(lambda x: x["wordsAddup"] - x["wordsSubtract"], sum, 'wordsRead')
How do we perform a similar action in the Java SDK? Strangely, the programming guide has only examples in Python for this transform.
Here is my attempt at producing the equivalent in Java:
input.apply(
Group.byFieldNames("account")
.aggregateField(<INSERT EQUIVALENT HERE>, Sum.ofIntegers(), "wordsRead"));

There are some Java examples at https://beam.apache.org/documentation/programming-guide/#using-schemas . (Note you may have to select the java tab on a selector that has both Java and Python to see them.)
In Java, I don't think the first argument of aggregateField can take an arbitrary expression; it must be a field name. You can precede the grouping operation with a projection that adds a new field for the desired expression. For example:
input
    .apply(SqlTransform.query(
        "SELECT *, wordsAddup - wordsSubtract AS wordsDiff FROM PCOLLECTION"))
    .apply(Group.byFieldNames("account")
        .aggregateField("wordsDiff", Sum.ofIntegers(), "wordsRead"));
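
To make the intended result concrete, here is a plain-Java sketch (no Beam involved) of the same two-step idea: compute the difference per row, then group by account and sum it. The Row class and the sample values are hypothetical and exist only for illustration; the field names are taken from the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java illustration of "project first, then group and sum".
// This is NOT Beam code; it only shows the result the pipeline should produce.
class ProjectThenAggregate {

    // Hypothetical row type mirroring the fields used in the question.
    static final class Row {
        final String account;
        final int wordsAddup;
        final int wordsSubtract;

        Row(String account, int wordsAddup, int wordsSubtract) {
            this.account = account;
            this.wordsAddup = wordsAddup;
            this.wordsSubtract = wordsSubtract;
        }
    }

    // Computes wordsAddup - wordsSubtract per row and sums it per account.
    // A TreeMap is used so the keys come back in sorted order.
    static Map<String, Integer> wordsReadByAccount(List<Row> rows) {
        return rows.stream()
                .collect(Collectors.groupingBy(
                        r -> r.account,
                        TreeMap::new,
                        Collectors.summingInt(r -> r.wordsAddup - r.wordsSubtract)));
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row("alice", 10, 3),
                new Row("bob", 5, 1),
                new Row("alice", 4, 2));
        System.out.println(wordsReadByAccount(rows)); // prints {alice=9, bob=4}
    }
}
```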

Related

C# .Take() in Java

I'm a C# developer who recently started getting into Java development, and I have a question. Does Java have a built-in method that does the same thing as C#'s .Take()?
C# example:
int diffNo = 1;
someNumber.OrderBy(x => x.someNumber).Take(diffNo).ToList();
Java example:
someNumber.stream().sorted(Comparator.comparing(Object::getSomeNumber)).collect(Collectors.toList());
So far, on the Java side I am only able to do the sorting, and I don't know which method I can use to replace .Take().

Streams have a limit method, used to truncate a stream to up to the number of elements you provide as an argument.
So, assuming diffNo is a number, you can call it like this
someNumber.stream()
.sorted(Comparator.comparing(SomeClass::getSomeNumber))
.limit(diffNo)
.collect(Collectors.toList());
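
A self-contained sketch of the same pipeline, in case it helps to see it run. SomeClass here is a hypothetical stand-in for the question's element type, and takeSmallest is an illustrative helper name, not a library method.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class TakeExample {

    // Hypothetical element type standing in for the class from the question.
    static final class SomeClass {
        private final int someNumber;

        SomeClass(int someNumber) {
            this.someNumber = someNumber;
        }

        int getSomeNumber() {
            return someNumber;
        }
    }

    // Rough equivalent of C#'s OrderBy(x => x.someNumber).Take(n).ToList().
    static List<SomeClass> takeSmallest(List<SomeClass> items, int n) {
        return items.stream()
                .sorted(Comparator.comparingInt(SomeClass::getSomeNumber))
                .limit(n) // keeps at most n elements; fewer if the stream is shorter
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<SomeClass> items = Arrays.asList(
                new SomeClass(5), new SomeClass(1), new SomeClass(3));
        for (SomeClass s : takeSmallest(items, 2)) {
            System.out.println(s.getSomeNumber()); // prints 1, then 3
        }
        System.out.println(takeSmallest(items, 10).size()); // prints 3
    }
}
```

One difference worth knowing: limit throws IllegalArgumentException for a negative argument, whereas C#'s Take returns an empty sequence in that case.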

Dataflow/ApacheBeam Limit input to the first X amount?

I have a bounded PCollection, but I only want to get the first X inputs and discard the rest. Is there a way to do this using Dataflow 2.X/Apache Beam?

As explained by @Andrew in his comments, maybe you can use the Top transform in Apache Beam (for Java or for Python). Specifically, Top.of() returns a PTransform that produces a PCollection containing a single List with the largest N elements, as ordered by the comparator you pass in.
Here is a simple usage example:
PCollection<Student> students = ...;
PCollection<List<Student>> top10Students = students.apply(Top.of(10, new CompareStudentsByAvgGrade()));
And here is another example using the Apache Beam Python SDK, which works around the fact that the resulting PCollection contains a single element (the list).

For a random sample of X elements, you can use the built-in Sample transform (for Python or Java).
Here is an example that shows how to sample 10 elements from an example input of 100 elements:
import apache_beam as beam
from apache_beam.transforms.combiners import Sample

with beam.Pipeline(runner='DirectRunner') as p:
    input = p | beam.Create(range(100))
    output = input | Sample.FixedSizeGlobally(10)
    output | beam.io.WriteToText('output')

Is there any way to get the AST (Abstract Syntax Tree) of a block of code in Java rather than of the entire class ?

I tried using the javalang module available in Python to get the AST of Java source code, but it requires an entire class to generate the AST. Passing a block of code such as an 'if' statement throws an error. Is there any other way of doing it?
PS: I am preferably looking for a Python module to do the task.
Thanks

Javalang can parse snippets of Java code:
>>> import javalang
>>> tokens = javalang.tokenizer.tokenize('System.out.println("Hello " + "world");')
>>> parser = javalang.parser.Parser(tokens)
>>> parser.parse_expression()
MethodInvocation

OP is interested in a non-Python answer.
Our DMS Software Reengineering Toolkit with its Java Front End can accomplish this.
DMS is a general-purpose tool for parsing/analyzing/transforming code, parameterized by language definitions (including grammars). Given a language definition, DMS can easily be invoked on a source file/stream representing the goal symbol for a grammar by calling the Parse method offered by the language parameter, and DMS will build a tree for the parsed string. Special support is provided for parsing source files/streams for arbitrary nonterminals as defined by the language grammar; DMS will build an AST whose root is that nonterminal, parsing the source according to the subgrammar defined by that nonterminal.
Once you have the AST, DMS provides lots of support for visiting the AST, inspecting/modifying nodes, and carrying out source-to-source transformations on the AST using surface-syntax rewrite rules. Finally, you can prettyprint the modified AST and get back valid source code. (If you have only parsed a code fragment for a nonterminal, what you get back is valid code for that nonterminal.)
If OP is willing to compare complete files instead of snippets, our Smart Differencer might be useful out of the box. SmartDifferencer builds ASTs of its two input files, finds the smallest set of conceptual edits (insert, delete, move, copy, rename) over structured code elements that explains the differences, and reports that difference.

Simple java recursive descent parsing library with placeholders

For an application I want to parse a String with arithmetic expressions and variables. Just imagine this string:
((A + B) * C) / (D - (E * F))
So I have placeholders here and no actual integer/double values. I am searching for a library which allows me to get the first placeholder, put (via a database query for example) a value into the placeholder and proceed with the next placeholder.
So what I essentially want to do is to allow users to write a string in their domain language without knowing the actual values of the variables. So the application would provide numeric values depending on some "contextual logic" and would output the result of the calculation.
I googled and did not find any suitable library. I found ANTLR, but I think it would be too heavyweight for my use case. Any suggestions?

You are right that ANTLR is a bit of overkill. However, parsing arithmetic expressions in infix notation isn't that hard; see:
Operator-precedence parser
Shunting-yard algorithm
Algorithms for Parsing Arithmetic Expressions
You should also consider using a scripting language such as Groovy or JRuby. JDK 6 onwards also provides built-in JavaScript support. See my answer here: Creating meta language with Java.

If all you want to do is simple expressions, and you know the grammar for those expressions in advance, you don't even need a library; you can code this trivially in pure Java.
See this answer for a detailed version of how:
Is there an alternative for flex/bison that is usable on 8-bit embedded systems?
If the users are defining their own expression language, and it is always in the form of a few monadic or binary operators for which they can specify the precedence, you can bend the above answer by parameterizing the parser with a list of operators at several levels of precedence.
If the language can be more sophisticated, you might want to investigate metacompilers.
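
As a rough illustration of the hand-written approach, here is a minimal recursive descent sketch for the fixed grammar in the question (+, -, *, /, parentheses, numbers, and placeholders). The class name PlaceholderExpression and its evaluate method are made up for this example; the resolver callback is where a database query or other contextual lookup would go, and it is invoked once per placeholder occurrence, from left to right.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToDoubleFunction;

// Illustrative recursive descent evaluator; all names here are hypothetical.
class PlaceholderExpression {
    private final String src;
    private final ToDoubleFunction<String> resolver; // supplies a value for each placeholder
    private int pos = 0;

    private PlaceholderExpression(String src, ToDoubleFunction<String> resolver) {
        this.src = src;
        this.resolver = resolver;
    }

    static double evaluate(String expression, ToDoubleFunction<String> resolver) {
        PlaceholderExpression p = new PlaceholderExpression(expression, resolver);
        double value = p.parseSum();
        p.skipSpaces();
        if (p.pos != p.src.length()) {
            throw new IllegalArgumentException("Unexpected input at position " + p.pos);
        }
        return value;
    }

    // sum := product (('+' | '-') product)*
    private double parseSum() {
        double value = parseProduct();
        while (true) {
            if (accept('+')) value += parseProduct();
            else if (accept('-')) value -= parseProduct();
            else return value;
        }
    }

    // product := atom (('*' | '/') atom)*
    private double parseProduct() {
        double value = parseAtom();
        while (true) {
            if (accept('*')) value *= parseAtom();
            else if (accept('/')) value /= parseAtom();
            else return value;
        }
    }

    // atom := '(' sum ')' | placeholder | number
    private double parseAtom() {
        if (accept('(')) {
            double value = parseSum();
            if (!accept(')')) throw new IllegalArgumentException("Expected ')' at position " + pos);
            return value;
        }
        skipSpaces();
        int start = pos;
        while (pos < src.length()
                && (Character.isLetterOrDigit(src.charAt(pos)) || src.charAt(pos) == '.')) {
            pos++;
        }
        if (start == pos) throw new IllegalArgumentException("Expected a value at position " + pos);
        String token = src.substring(start, pos);
        // Tokens starting with a letter are placeholders; anything else is parsed as a number.
        if (Character.isLetter(token.charAt(0))) return resolver.applyAsDouble(token);
        return Double.parseDouble(token);
    }

    // Consumes the given character (after optional spaces) if it is next in the input.
    private boolean accept(char c) {
        skipSpaces();
        if (pos < src.length() && src.charAt(pos) == c) {
            pos++;
            return true;
        }
        return false;
    }

    private void skipSpaces() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
    }

    public static void main(String[] args) {
        Map<String, Double> values = new HashMap<>();
        values.put("A", 1.0); values.put("B", 2.0); values.put("C", 4.0);
        values.put("D", 10.0); values.put("E", 2.0); values.put("F", 3.0);
        // (1 + 2) * 4 / (10 - 2 * 3) = 12 / 4
        System.out.println(evaluate("((A + B) * C) / (D - (E * F))", values::get)); // prints 3.0
    }
}
```

This sketch evaluates while parsing, which keeps it short. If the same expression must be evaluated many times with different values, building a small tree first and evaluating it separately would likely be the better design.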

Is there a Java equivalent of Python's printf hash replacement?

Specifically I am converting a python script into a java helper method. Here is a snippet (slightly modified for simplicity).
# hash of values
vals = {}
vals['a'] = 'a'
vals['b'] = 'b'
vals['1'] = 1
output = sys.stdout
file = open(filename).read()
print >>output, file % vals,
So in the file there are %(a), %(b), %(1) etc that I want substituted with the hash keys. I perused the API but couldn't find anything. Did I miss it or does something like this not exist in the Java API?

You can't do this directly without some additional templating library. I recommend StringTemplate. Very lightweight, easy to use, and very optimized and robust.

I doubt you'll find a pure Java solution that'll do exactly what you want out of the box.
With this in mind, the best answer depends on the complexity and variety of Python formatting strings that appear in your file:
If they're simple and not varied, the easiest way might be to code something up yourself.
If the opposite is true, one way to get the result you want with little work is by embedding Jython into your Java program. This will enable you to use Python's string formatting operator (%) directly. What's more, you'll be able to give it a Java Map as if it were a Python dictionary (vals in your code).
