Compute metrics on different windows in Apache Flink - Java

I am using Apache Flink 1.2 and here's my question:
I have a stream of data and I would like to compute a metric over a window of 1 day. Therefore I will write something like:
DataStream<Tuple6<Timestamp, String, Double, Double, Double, Integer>> myStream0 =
env.readTextFile("Myfile.csv")
.map(new MyMapper()) // Parse the input
.assignTimestampsAndWatermarks(new MyExtractor()) //Assign the timestamp of the event
.timeWindowAll(Time.days(1))
.apply(new average()); // compute average, max, sum
Now I would like to compute the same metrics over a window of 1 hour.
I can write the same as before and specify Time.hours(1), but my concern is that this way Apache Flink reads the input file twice and does twice the work. I wonder if there is a way of doing it all together (i.e. using the same stream).

You can compute hourly aggregates and, from those, the daily aggregates. For a simple DataStream<Double> this would look as follows:
DataStream<Double> vals = ... // source + timestamp extractor
DataStream<Tuple2<Double, Long>> valCnt = vals // (sum, cnt)
    .map(new CntAppender()); // Double -> Tuple2<Double, Long(1)>
DataStream<Tuple3<Double, Long, Long>> hourlySumCnt = valCnt // (sum, cnt, endTime)
    .timeWindowAll(Time.hours(1))
    // SumCounter ReduceFunction sums the Double and Long fields (the Long is the count)
    // WindowEndAppender WindowFunction adds the window end timestamp (3rd field)
    .reduce(new SumCounter(), new WindowEndAppender());
DataStream<Tuple2<Double, Long>> hourlyAvg = hourlySumCnt // (avg, endTime)
    .map(new SumDivCnt()); // MapFunction divides sum by cnt for the average
DataStream<Tuple3<Double, Long, Long>> dailySumCnt = hourlySumCnt // (sum, cnt, endTime)
    .map(new StripOffTime()) // removes the unnecessary time field -> Tuple2<Double, Long>
    .timeWindowAll(Time.days(1))
    .reduce(new SumCounter(), new WindowEndAppender()); // same as above
DataStream<Tuple2<Double, Long>> dailyAvg = dailySumCnt // (avg, endTime)
    .map(new SumDivCnt()); // same as above
So, you basically compute sum and count for each hour, and based on that result you
compute the hourly average
compute the daily sum and count, and from those the daily average
Note that I am using a ReduceFunction instead of a WindowFunction for the sum and count computation, because a ReduceFunction is applied eagerly, i.e., the records of a window are not collected but immediately aggregated. Hence the state that needs to be maintained is a single record.
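For reference, a minimal sketch of what the helper functions could look like. The class names and field layout are taken from the comments above; treat the exact signatures as assumptions, not the original author's code:

// assumed imports: org.apache.flink.api.common.functions.{MapFunction, ReduceFunction},
// org.apache.flink.streaming.api.functions.windowing.AllWindowFunction,
// org.apache.flink.streaming.api.windowing.windows.TimeWindow, org.apache.flink.util.Collector
public static class CntAppender implements MapFunction<Double, Tuple2<Double, Long>> {
    @Override
    public Tuple2<Double, Long> map(Double value) {
        return new Tuple2<>(value, 1L); // attach a count of 1 to each value
    }
}
public static class SumCounter implements ReduceFunction<Tuple2<Double, Long>> {
    @Override
    public Tuple2<Double, Long> reduce(Tuple2<Double, Long> v1, Tuple2<Double, Long> v2) {
        return new Tuple2<>(v1.f0 + v2.f0, v1.f1 + v2.f1); // sum the values and the counts
    }
}
public static class WindowEndAppender
        implements AllWindowFunction<Tuple2<Double, Long>, Tuple3<Double, Long, Long>, TimeWindow> {
    @Override
    public void apply(TimeWindow window, Iterable<Tuple2<Double, Long>> values,
                      Collector<Tuple3<Double, Long, Long>> out) {
        Tuple2<Double, Long> v = values.iterator().next(); // pre-aggregated: a single record
        out.collect(new Tuple3<>(v.f0, v.f1, window.getEnd()));
    }
}
public static class SumDivCnt implements MapFunction<Tuple3<Double, Long, Long>, Tuple2<Double, Long>> {
    @Override
    public Tuple2<Double, Long> map(Tuple3<Double, Long, Long> v) {
        return new Tuple2<>(v.f0 / v.f1, v.f2); // (sum, cnt, endTime) -> (avg, endTime)
    }
}
public static class StripOffTime implements MapFunction<Tuple3<Double, Long, Long>, Tuple2<Double, Long>> {
    @Override
    public Tuple2<Double, Long> map(Tuple3<Double, Long, Long> v) {
        return new Tuple2<>(v.f0, v.f1); // drop the endTime field
    }
}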

Related

Using Java Stream API, finding highest value of variable, with the stream of the changes made to the variable

Context/Scenario
Let's say we have an immutable object called Transaction, where transaction.getAction() returns a TransactionAction enum which can be DEPOSIT or WITHDRAW, and transaction.getAmount() returns an Integer which specifies the amount of money being deposited or withdrawn.
enum TransactionAction {
WITHDRAW,
DEPOSIT
}
public class Transaction {
private final TransactionAction action;
private final int amount;
public Transaction(TransactionAction action, int amount) {
this.action = action;
this.amount = amount;
}
public TransactionAction getAction() {
return action;
}
public int getAmount() {
return amount;
}
}
Question
We now have a Stream<Transaction> which is a stream filled with Transaction that can either be DEPOSIT or WITHDRAW. We can imagine this Stream<Transaction> as a history of transactions of one particular bank account.
What I am trying to achieve is to get the highest balance the account has ever reached, in the most efficient manner (thus using the Stream API).
Example
Bob's transaction history is:
// balance start at 0
[DEPOSIT] 1200 // balance: 1200
[DEPOSIT] 500 // balance: 1700
[WITHDRAW] 700 // balance: 1000
[DEPOSIT] 300 // balance: 1300
[WITHDRAW] 800 // balance: 500
[WITHDRAW] 500 // balance: 0
Bob's highest balance is 1700.
What you need is to find the maximum value of a cumulative sum. In pseudo-code, this would be something like:
transactions = [1200, 500, -700, 300, -800, -500]
csum = cumulativeSum(transactions) // should be [1200,1700,1000,1300,500,0]
max(csum) // should be 1700
The imperative way:
The traditional for-loop is well suited for such cases. It should be fairly easy to write and is probably the most efficient alternative both in time and space. It does not require multiple iterations and it does not require extra lists.
int max = 0;
int csum = 0;
for (Transaction t: transactions) {
int amount = (t.getAction() == TransactionAction.WITHDRAW ? -1 : 1) * t.getAmount();
csum += amount;
if (csum > max) max = csum;
}
Diving into functional:
Streams are a functional programming concept and, as such, they are free of side-effects and well suited for stateless operations. Keeping the cumulative state is considered a side-effect, and then we would have to talk about Monads to bring those side-effects under control and... we don't want to go that way.
Java, not being a functional language (although allowing for functional style), cares less about purity. You could simply have a control variable outside the stream to keep track of that external state within the current map or reduce operations. But that would also be giving up everything Streams are meant for.
So let's see how Java's more functionally experienced fellows handle this. In pure Haskell, the cumulative sum can be achieved with a scan left (scanl) operation:
λ> scanl1 (+) [1200, 500, -700, 300, -800, -500]
[1200,1700,1000,1300,500,0]
Finding the maximum of this would be as simple as:
λ> maximum ( scanl1 (+) [1200, 500, -700, 300, -800, -500] )
1700
A Java Streams solution:
Java does not have such an idiomatic way of expressing a scan left, but you may achieve a similar result with collect.
transactions.stream()
.map(t -> (t.getAction() == TransactionAction.WITHDRAW ? -1 : 1) * t.getAmount())
.collect(ArrayList<Integer>::new, (csum, amount) ->
csum.add(csum.size() > 0 ? csum.get(csum.size() - 1) + amount : amount),
ArrayList::addAll)
.stream()
.max(Integer::compareTo);
// returns Optional[1700]
EDIT: As correctly pointed out in the comments, this accumulator function is not associative and problems would appear if trying to use parallelStream instead of stream.
This can be further simplified. For example, if you enrich your TransactionAction enum with a multiplier (-1 for WITHDRAW and 1 for DEPOSIT), then map could be replaced with:
.map(t -> t.getAction().getMultiplier() * t.getAmount())
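A sketch of what that enriched enum could look like (getMultiplier is an assumed accessor name, not from the original code):

enum TransactionAction {
    WITHDRAW(-1),
    DEPOSIT(1);

    private final int multiplier;

    TransactionAction(int multiplier) {
        this.multiplier = multiplier;
    }

    public int getMultiplier() {
        return multiplier;
    }
}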
EDIT: Yet another approach: Parallel Prefix Sum
Since Java 8, arrays offer a parallelPrefix operation that could be used like:
Integer[] amounts = transactions.stream()
.map(t -> (t.getAction() == TransactionAction.WITHDRAW ? -1 : 1) * t.getAmount())
.toArray(Integer[]::new);
Arrays.parallelPrefix(amounts, Integer::sum);
Arrays.stream(amounts).max(Integer::compareTo);
// returns Optional[1700]
Like Stream's collect, it also requires an associative function; Integer::sum satisfies that property. The downside is that it requires an array and can't be used with lists. Although parallelPrefix is very efficient, setting up the array to work with it might not pay off.
Wrapping up:
Again, it's possible to achieve this with Java Streams, although it won't be as efficient as a traditional loop in either time or space. But you benefit from the composability of streams. As always, it's a trade-off.
A stream would not help here. Use a list and a for-loop:
List<Transaction> transactions = ...;
int balance = 0;
int max = 0;
for (Transaction transaction : transactions) {
balance += (transaction.getAction() == TransactionAction.DEPOSIT ? 1 : -1)
* transaction.getAmount();
max = Math.max(max, balance);
}
The problem is that you need to keep track of some state while processing transactions, and you wouldn't be able to do this with streams without introducing complicated or mutable data structures that would make this code bug-prone.
Here is another Stream solution:
AtomicInteger balance = new AtomicInteger(0);
int highestBalance = transactions
.stream()
.mapToInt(transaction -> {
int amount = transaction.getAmount();
if (transaction.getAction() == TransactionAction.WITHDRAW) {
amount = -amount;
}
return balance.accumulateAndGet(amount, Integer::sum);
})
.max()
.orElse(0);
The cumulative sum at each position can be computed like this:
List<Integer> integers = Arrays.asList(1200, 500, -700, 300, -800, -500);
Stream<Integer[]> cumulativeSum = Stream.iterate(
new Integer[]{0, integers.get(0)},
p -> new Integer[]{p[0] + 1, p[1] + integers.get(p[0] + 1)}
)
.limit(integers.size());
With this you can get the max balance in this way:
Integer[] max = cumulativeSum
.max(Comparator.comparing(p -> p[1]))
.get();
System.out.println("Position: " + max[0]);
System.out.println("Value: " + max[1]);
Or with an iterator, but here the problem is that the last sum won't be computed:
Stream<Integer> integerStream = Arrays.stream(new Integer[]{
1200, 500, -700, 300, -800, -500});
Iterator<Integer> iterator = integerStream.iterator();
Integer maxCumulativeSum = Stream.iterate(iterator.next(), p -> p + iterator.next())
.takeWhile(p -> iterator.hasNext())
.max(Integer::compareTo).get();
System.out.println(maxCumulativeSum);
The problem is with takeWhile (available since Java 9); it may be solved with takeWhileInclusive (from an external library).
A wrong solution
// Deposit is positive, withdrawal is negative.
final Stream<Integer> theOriginalDepositWithdrawals = Stream.of(1200, 500, -700, 300, -800, -500);
final Stream<Integer> sequentialDepositWithdrawals = theOriginalDepositWithdrawals.sequential();
final CurrentBalanceMaximumBalance currentMaximumBalance = sequentialDepositWithdrawals.<CurrentBalanceMaximumBalance>reduce(
// Identity.
new CurrentBalanceMaximumBalance(0, Integer.MIN_VALUE),
// Accumulator.
(currentAccumulation, elementDepositWithdrawal) -> {
final int newCurrentBalance =
currentAccumulation.currentBalance +
elementDepositWithdrawal;
final int newMaximumBalance = Math.max(
currentAccumulation.maximumBalance,
newCurrentBalance
);
return new CurrentBalanceMaximumBalance(
newCurrentBalance,
newMaximumBalance
);
},
// Combiner.
(res1, res2) -> {
final int newCurrentBalance =
res1.currentBalance +
res2.currentBalance;
final int newMaximumBalance = Math.max(
res1.maximumBalance,
res2.maximumBalance
);
return new CurrentBalanceMaximumBalance(
newCurrentBalance, newMaximumBalance
);
}
);
System.out.println("Maximum is: " + currentMaximumBalance.maximumBalance);
Helper class:
class CurrentBalanceMaximumBalance {
public final int currentBalance;
public final int maximumBalance;
public CurrentBalanceMaximumBalance(
int currentBalance,
int maximumBalance
) {
this.currentBalance = currentBalance;
this.maximumBalance = maximumBalance;
}
}
This is a wrong solution. It might arbitrarily work, but there is no guarantee that it will.
It breaks the contract of reduce. The properties that are broken are associativity of both the accumulator function and the combiner function. It also doesn't require that the stream respect the order of the original transactions.
This makes it possibly dangerous to use, and might well give wrong results depending on what the implementation of reduce happens to be as well as whether the stream respects the original order of the deposits and withdrawals or not.
Using sequential() here is not sufficient, since sequential() is about sequential vs. parallel execution. An example of a stream that executes sequentially but does not have an ordering is a stream created from a HashSet that then has sequential() called on it.
A correct solution
The problem uses the concept of a "current balance", and that is only meaningful when computed from the first transaction and then in order to the end. For instance, if you have the list [-1000, 10, 10, -1000], you cannot start in the middle and then say that the "current balance" was 20 at some point. You must apply the operations regarding the "current balance" in the order of the original transactions.
So, one straightforward solution is to:
Require that the stream respects the original order of transactions, with a defined "encounter order".
Apply forEachOrdered(), as in the sketch below.
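A minimal sketch of that approach (BalanceTracker is a hypothetical mutable holder introduced here for illustration; it is not part of the original classes):

// hypothetical helper: accumulates the running balance and its maximum
class BalanceTracker {
    int balance = 0;
    int max = 0;
    void accept(int amount) {
        balance += amount;
        max = Math.max(max, balance);
    }
}

BalanceTracker tracker = new BalanceTracker();
transactions.stream() // must have a defined encounter order, e.g. a stream from a List
    .mapToInt(t -> (t.getAction() == TransactionAction.WITHDRAW ? -1 : 1) * t.getAmount())
    .forEachOrdered(tracker::accept);
System.out.println("Maximum is: " + tracker.max);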

Apache Flink: The execution environment and multiple sinks

My question might cause some confusion, so please see the Description first; it might help identify my problem. I will add my code at the end of the question (any suggestions regarding my code structure/implementation are also welcome).
Thank you for any help in advance!
My question:
How do I define multiple sinks in Flink batch processing without having it read the data from one source repeatedly?
What is the difference between createCollectionsEnvironment() and getExecutionEnvironment()? Which one should I use in a local environment?
What is the use of env.execute()? My code outputs the result without this statement. If I add this statement it throws an exception:
Exception in thread "main" java.lang.RuntimeException: No new data sinks have been defined since the last execution. The last execution refers to the latest call to 'execute()', 'count()', 'collect()', or 'print()'.
at org.apache.flink.api.java.ExecutionEnvironment.createProgramPlan(ExecutionEnvironment.java:940)
at org.apache.flink.api.java.ExecutionEnvironment.createProgramPlan(ExecutionEnvironment.java:922)
at org.apache.flink.api.java.CollectionEnvironment.execute(CollectionEnvironment.java:34)
at org.apache.flink.api.java.ExecutionEnvironment.execute(ExecutionEnvironment.java:816)
at MainClass.main(MainClass.java:114)
Description:
I'm new to programming. Recently I needed to process some data (grouping data, calculating standard deviation, etc.) using Flink batch processing.
However, I came to a point where I need to output two DataSets.
The structure was something like this:
From Source (database) -> DataSet 1 (add index using zipWithIndex()) -> DataSet 2 (do some calculation while keeping the index) -> DataSet 3
First I output DataSet 2; the index is e.g. from 1 to 10000.
Then I output DataSet 3 and the index becomes 10001 to 20000, although I did not change the values in any function.
My guess is that when outputting DataSet 3, instead of using the result of the previously calculated DataSet 2, Flink starts by getting data from the database again and then performs the calculation.
With the use of the zipWithIndex() function, this not only gives the wrong index numbers but also increases the number of connections to the database.
I guess this is related to the execution environment: when I use
ExecutionEnvironment env = ExecutionEnvironment.createCollectionsEnvironment();
I get the "wrong" index numbers (10001-20000),
and
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
gives the correct index numbers (1-10000).
The time taken and the number of database connections are different, and the order of printing is reversed.
OS, DB, other environment details and versions:
IntelliJ IDEA 2017.3.5 (Community Edition)
Build #IC-173.4674.33, built on March 6, 2018
JRE: 1.8.0_152-release-1024-b15 amd64
JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o
Windows 10 10.0
My test code (Java):
public static void main(String[] args) throws Exception {
ExecutionEnvironment env = ExecutionEnvironment.createCollectionsEnvironment();
//Table is used to calculate the standard deviation as I figured that there is no such calculation in DataSet.
BatchTableEnvironment tableEnvironment = TableEnvironment.getTableEnvironment(env);
//Get Data from a mySql database
DataSet<Row> dbData =
env.createInput(
JDBCInputFormat.buildJDBCInputFormat()
.setDrivername("com.mysql.cj.jdbc.Driver")
.setDBUrl($database_url)
.setQuery("select value from $table_name where id =33")
.setUsername("username")
.setPassword("password")
.setRowTypeInfo(new RowTypeInfo(BasicTypeInfo.DOUBLE_TYPE_INFO))
.finish()
);
// Add index for assigning group (group capacity is 5)
DataSet<Tuple2<Long, Row>> indexedData = DataSetUtils.zipWithIndex(dbData);
// Replace index(long) with group number(int), and convert Row to double at the same time
DataSet<Tuple2<Integer, Double>> rawData = indexedData.flatMap(new GroupAssigner());
//Using groupBy() to combine individual data of each group into a list, while calculating the mean and range in each group
//put them into a POJO named GroupDataClass
DataSet<GroupDataClass> groupDS = rawData.groupBy("f0").combineGroup(new GroupCombineFunction<Tuple2<Integer, Double>, GroupDataClass>() {
@Override
public void combine(Iterable<Tuple2<Integer, Double>> iterable, Collector<GroupDataClass> collector) {
Iterator<Tuple2<Integer, Double>> it = iterable.iterator();
Tuple2<Integer, Double> var1 = it.next();
int groupNum = var1.f0;
// Using max and min to calculate range, using i and sum to calculate mean
double max = var1.f1;
double min = max;
double sum = var1.f1; // start with the first element so the mean includes it
int i = 1;
// The list is to store individual value
List<Double> list = new ArrayList<>();
list.add(max);
while (it.hasNext())
{
double next = it.next().f1;
sum += next;
i++;
max = next > max ? next : max;
min = next < min ? next : min;
list.add(next);
}
//Store group number, mean, range, and 5 individual values within the group
collector.collect(new GroupDataClass(groupNum, sum / i, max - min, list));
}
});
//print because if no sink is created, Flink will not even perform the calculation.
groupDS.print();
// Get the max group number and range in each group to calculate average range
// if group number start with 1 then the maximum of group number equals to the number of group
// However, because this is the second sink, data will flow from source again, which will double the group number
DataSet<Tuple2<Integer, Double>> rangeDS = groupDS.map(new MapFunction<GroupDataClass, Tuple2<Integer, Double>>() {
@Override
public Tuple2<Integer, Double> map(GroupDataClass in) {
return new Tuple2<>(in.groupNum, in.range);
}
}).max(0).andSum(1);
// collect and print because if no sink is created, Flink will not even perform the calculation.
Tuple2<Integer, Double> rangeTuple = rangeDS.collect().get(0);
double range = rangeTuple.f1/ rangeTuple.f0;
System.out.println("range = " + range);
}
public static class GroupAssigner implements FlatMapFunction<Tuple2<Long, Row>, Tuple2<Integer, Double>> {
@Override
public void flatMap(Tuple2<Long, Row> input, Collector<Tuple2<Integer, Double>> out) {
// index 0-4 will be assigned to group 1, index 5-9 to group 2, etc. (zipWithIndex is 0-based)
int n = (int) (input.f0 / 5) + 1;
out.collect(new Tuple2<>(n, (Double) input.f1.getField(0)));
}
}
It's fine to connect a source to multiple sinks; the source gets executed only once and the records get broadcast to the multiple sinks. See this question: Can Flink write results into multiple files (like Hadoop's MultipleOutputFormat)?
getExecutionEnvironment is the right way to get the environment when you want to run your job. createCollectionsEnvironment is a good way to play around and test. See the documentation.
The exception's error message is very clear: if you call print or collect, your data flow gets executed. So you have two choices:
Either you call print/collect at the end of your data flow, and it gets executed and printed. That's good for testing. Bear in mind you can only call collect/print once per data flow, otherwise it gets executed many times while it's not completely defined.
Or you add a sink at the end of your data flow and call env.execute(). That's what you want to do once your flow is in a more mature shape.
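For example, a sketch of the second option (writeAsText and the output paths are placeholder choices, not taken from the question):

groupDS.writeAsText("file:///tmp/groups"); // first sink
rangeDS.writeAsText("file:///tmp/ranges"); // second sink
env.execute("two-sink job"); // one execution serves both sinks, so the source is read only once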

Invert list of periods

I have a list of periods - each period contains startTime and endTime (as a timestamp).
I want to create a list which will contain missing gaps in given range.
Example:
from 100 to 500 for given list:
Range[150, 200]
Range[230, 400]
It will produce a list:
Range[100, 150]
Range[200, 230]
Range[400, 500]
I created a simple algorithm which iterates over my input list and builds a valid result, but I wonder if I can do the same using the Java 8 time API, or whether there is an external library for that.
You can construct the missing ranges from a list of the individual range boundaries, using a complete set that includes the min and max:
I'm using int[] pairs, which should be easy to translate into your Range object.
The logic is simple: using only the range boundary numbers, build a complete set and then make pairs out of all consecutive boundaries. For that, a sorted list of all (distinct) numbers, including the outer boundaries, is created first...
List<Integer> flat = Arrays.<int[]>asList(new int[] { 150, 200 },
new int[] { 230, 400 }).stream()
.flatMap(e -> Arrays.asList(e[0],
e[1]).stream()).collect(Collectors.toList());
List<Integer> fullRange = new ArrayList<>();
fullRange.add(100);
fullRange.add(500);
fullRange.addAll(flat);
List<Integer> all = fullRange.stream()
.distinct()
.sorted()
.collect(Collectors.toList());
System.out.println(
IntStream.range(0, all.size())
.filter(i -> i < -1 + all.size()) // excluding the last element
.mapToObj(index -> Arrays.asList( //You can create Range objects here
all.get(index),
all.get(index + 1))
)
.collect(Collectors.toList()));
This outputs (note that the result contains the original ranges as well as the gaps):
[[100, 150], [150, 200], [200, 230], [230, 400], [400, 500]]
Here is a solution using my lib Time4J. I have assumed that your timestamps are modelled as "milliseconds since the Unix epoch", but you are free to use any other type. Time4J knows many different types of date- or time-related intervals and offers various methods to calculate interferences of intervals, here the complement of an interval collection.
// define/create your intervals
MomentInterval i1 =
MomentInterval.between(Instant.ofEpochMilli(150), Instant.ofEpochMilli(200));
MomentInterval i2 =
MomentInterval.between(Instant.ofEpochMilli(230), Instant.ofEpochMilli(400));
// collect the intervals into an interval-collection
IntervalCollection<Moment> ic =
IntervalCollection.onMomentAxis().plus(Arrays.asList(i1, i2));
// define/create the outer time window
MomentInterval window =
MomentInterval.between(Instant.ofEpochMilli(100), Instant.ofEpochMilli(500));
// create/calculate the complement of the interval collection
ic.withComplement(window)
.getIntervals()
.forEach(
i ->
System.out.println(
"Range["
+ i.getStart().getTemporal().toTemporalAccessor().toEpochMilli()
+ ", "
+ i.getEnd().getTemporal().toTemporalAccessor().toEpochMilli()
+ "]"
)
);
Range[100, 150]
Range[200, 230]
Range[400, 500]
By the way, Time4J uses the half-open approach for moment/instant intervals, meaning that the end boundary of such intervals is excluded. Therefore, I would rather choose the open bracket ")" instead of "]", but here I have closely followed your question.
Another solution could be as follows. For this I defined some classes and functions.
Class Range :
class Range {
private int start;
private int end;
// constructor and getStart()/getEnd() getters omitted for brevity
}
class RangeInfo:
class RangeInfo {
private List<Range> ranges;
private int newStart;
private int newEnd;
}
function1:
This function creates a new Range object, the gap between two Range objects:
BiFunction<Range,Range,Range> function1 = (r1,r2)->new Range(r1.getEnd(),r2.getStart());
function:
This function has two parameters, a List of Ranges (Range[150, 200] and Range[230, 400]) and an outer Range (100, 500), and returns a List of new Ranges.
BiFunction<List<Range>,Range,List<Range>> function = (r1, pair)->{ // you can use new Pair(100,500) instead of Range object
List<Range> result = new ArrayList<>();
result.add(new Range(pair.getStart(),r1.get(0).getStart()));
IntStream.range(0, r1.size() - 1).mapToObj(i -> function1.apply(r1.get(i), r1.get(i + 1))).forEachOrdered(result::add);
result.add(new Range(r1.get(r1.size()-1).getEnd(),pair.getEnd()));
result.addAll(r1);
return result;
};
And finally apply the function and sort the Ranges as below:
function.apply(ranges,new Range(100,500))
.stream()
.sorted(Comparator.comparingInt(Range::getStart))
.collect(Collectors.toList());
This would be the code in plain Java:
Code
public final class RangesCalculator {
private static final String START_RANGE = "100";
private static final String END_RANGE = "500";
public static List<Range> invertRange(List<Range> ranges) {
List<Range> toReturn = new ArrayList<>();
for (Range range : ranges) {
if (toReturn.isEmpty()) {
if (!START_RANGE.equals(range.getStart())) {
toReturn.add(Range.newBuilder().setStart(START_RANGE).setEnd(range.getStart()).build());
}
} else {
Range lastRange = toReturn.get(toReturn.size() - 1);
lastRange.setEnd(range.getStart());
}
if (!END_RANGE.equals(range.getEnd())) {
toReturn.add(Range.newBuilder().setStart(range.getEnd()).setEnd(END_RANGE).build());
}
}
return toReturn.stream().filter(range -> !range.getStart().equals(range.getEnd())).collect(Collectors.toList());
}
}

Apache Flink Streaming window WordCount

I have the following code to count words from a socketTextStream. Both cumulative word counts and time-windowed word counts are needed. The program has an issue: cumulateCounts is always the same as the windowed counts. Why does this issue occur? What is the correct way to calculate cumulative counts based on windowed counts?
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
final HashMap<String, Integer> cumulateCounts = new HashMap<String, Integer>();
final DataStream<Tuple2<String, Integer>> counts = env
.socketTextStream("localhost", 9999)
.flatMap(new Splitter())
.window(Time.of(5, TimeUnit.SECONDS))
.groupBy(0).sum(1)
.flatten();
counts.print();
counts.addSink(new SinkFunction<Tuple2<String, Integer>>() {
@Override
public void invoke(Tuple2<String, Integer> value) throws Exception {
String word = value.f0;
Integer delta_count = value.f1;
Integer count = cumulateCounts.get(word);
if (count == null)
count = 0;
count = count + delta_count;
cumulateCounts.put(word, count);
System.out.println("(" + word + "," + count.toString() + ")");
}
});
You should first group by key and apply the window on the keyed data stream (your code works on Flink 0.9.1, but the new API in Flink 0.10.0 is strict about this):
final DataStream<Tuple2<String, Integer>> counts = env
.socketTextStream("localhost", 9999)
.flatMap(new Splitter())
.groupBy(0)
.window(Time.of(5, TimeUnit.SECONDS)).sum(1)
.flatten();
If you apply a window on a non-keyed data stream, there will be only a single-threaded window operator on a single machine (i.e., no parallelism) building the window over the whole stream (in Flink 0.9.1, this global window can be split into sub-windows by groupBy(); however, in Flink 0.10.0 this will not work any more). To count words, you want to build a window for each distinct key value, i.e., you first get a sub-stream per key value (via groupBy()) and apply a window operator on each sub-stream (thus, each sub-stream gets its own window operator instance, allowing for parallel execution).
For a global (cumulative) count, you can simply apply a groupBy().sum() construct. First, the stream is split into sub-streams (one for each key value). Second, you compute the sum over each sub-stream. Because the stream is not windowed, the sum is computed cumulatively and updated for each incoming tuple (in more detail, the sum has an initial result value of zero and the result is updated for each tuple as result += tuple.value). After each invocation of sum, the new current result is emitted.
In your code, you should not use your special sink function but do as follows:
counts.groupBy(0).sum(1).print();
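Putting both together, a sketch of the full flow over the same stream (same 0.9/0.10-era API as above; it assumes Splitter emits (word, 1) tuples):

DataStream<Tuple2<String, Integer>> words = env
    .socketTextStream("localhost", 9999)
    .flatMap(new Splitter());
// windowed counts per word
words.groupBy(0)
    .window(Time.of(5, TimeUnit.SECONDS)).sum(1)
    .flatten()
    .print();
// cumulative counts per word
words.groupBy(0).sum(1).print();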

Efficient way to implement 'events since x' in Java

I want to be able to ask an object "how many events have occurred in the last x seconds", where x is an argument.
E.g.: how many events have occurred in the last 120 seconds?
My approach is linear in the number of events, but I wanted to see the most efficient way (in space and time) to achieve this requirement:
public class TimeSinceStat {
private List<DateTime> eventTimes = new ArrayList<>();
public void apply() {
eventTimes.add(DateTime.now());
}
public int eventsSince(int seconds) {
DateTime startTime = DateTime.now().minus(Seconds.seconds(seconds));
for (int i = 0; i < eventTimes.size(); i++) {
DateTime dateTime = eventTimes.get(i);
if (dateTime.compareTo(startTime) > 0)
return eventTimes.subList(i, eventTimes.size()).size();
}
return 0;
}
(PS: I'm using JodaTime for the date/time representation.)
Edit:
The key of this algorithm is to find all events that have happened in the last x seconds; the exact start time (e.g. now - 30 seconds) may or may not be in the collection.
Store the DateTime in a TreeSet and then use tailSet to get the most recent events. This saves you from having to find the starting point by iteration (which is O(n)) and instead by searching (which is O (log n)).
TreeSet<DateTime> eventTimes;
public int eventsSince(int seconds) {
return eventTimes.tailSet(DateTime.now().minus(Seconds.seconds(seconds)), true).size();
}
Of course, you could also binary search on your sorted list, but this does the work for you.
Edit
If it's a concern that multiple events could occur at the same DateTime, you can take the exact same approach with a SortedMultiset from Guava:
TreeMultiset<DateTime> eventTimes;
public int eventsSince(int seconds) {
return eventTimes.tailMultiset(
DateTime.now().minus(Seconds.seconds(seconds)),
BoundType.CLOSED
).size();
}
Edit x2
Here's a much more efficient approach that leverages the fact that you only log events that happened after all other events. With each event, store the number of events up to that date:
// NavigableMap is needed (rather than SortedMap) for the lastEntry/lowerEntry calls below
NavigableMap<DateTime, Integer> eventCounts = initEventMap();
public NavigableMap<DateTime, Integer> initEventMap() {
TreeMap<DateTime, Integer> map = new TreeMap<>();
//prime the map to make subsequent operations much cleaner
map.put(DateTime.now().minus(Seconds.seconds(1)), 0);
return map;
}
private int totalCount() {
//you can handle the edge condition here
return eventCounts.lastEntry().getValue();
}
public void logEvent() {
eventCounts.put(DateTime.now(), totalCount() + 1);
}
Then getting the count since a date is very efficient: just take the total and subtract the count of events that occurred before that date.
public int eventsSince(int seconds) {
DateTime startTime = DateTime.now().minus(Seconds.seconds(seconds));
return totalCount() - eventCounts.lowerEntry(startTime).getValue();
}
This eliminates the inefficient iteration; it's a constant-time lookup plus an O(log n) lookup.
If you were implementing a data structure from scratch, and the data are not in sorted order, you'd want to construct a balanced order statistic tree (also see code here). This is just a regular balanced tree with the size of the tree rooted at each node maintained in the node itself.
The size fields enable efficient calculation of the "rank" of any key in the tree. You can do the desired range query by making two O(log n) probes into the tree for the rank of the min and max range values, finally taking their difference.
The proposed tree and set tail operations are great, except that the tail views need time to construct, even though all you need is their size. The asymptotic complexity is the same as the OST's, but the OST avoids this overhead completely. The difference could be meaningful if performance is very critical.
Of course I'd definitely use the standard library solution first and consider the OST only if the speed turned out to be inadequate.
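A minimal sketch of the size-augmented tree idea, assuming timestamps stored as epoch millis (balancing is omitted for brevity, so a production version would add AVL or red-black rebalancing while maintaining the size fields):

class OstNode {
    final long key; // event timestamp in millis
    int size = 1; // size of the subtree rooted here
    OstNode left, right;
    OstNode(long key) { this.key = key; }
}

class EventTree {
    private OstNode root;

    private static int size(OstNode n) { return n == null ? 0 : n.size; }

    void insert(long key) { root = insert(root, key); }

    private OstNode insert(OstNode n, long key) {
        if (n == null) return new OstNode(key);
        if (key < n.key) n.left = insert(n.left, key);
        else n.right = insert(n.right, key); // duplicates go right
        n.size = 1 + size(n.left) + size(n.right);
        return n;
    }

    // rank = number of keys strictly less than 'key'; one O(log n) probe
    int rank(long key) {
        int r = 0;
        OstNode n = root;
        while (n != null) {
            if (key <= n.key) {
                n = n.left;
            } else {
                r += 1 + size(n.left);
                n = n.right;
            }
        }
        return r;
    }

    // range query [from, to): difference of two rank probes
    int countInRange(long from, long to) { return rank(to) - rank(from); }

    // events at or after 'from', e.g. "events in the last x seconds"
    int countSince(long from) { return size(root) - rank(from); }
}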
Since DateTime already implements the Comparable interface, I would recommend storing the data in a TreeMap instead; you could use TreeMap#tailMap to get a submap of the DateTimes that occur in the desired time range.
Based on your code:
public class TimeSinceStat {
//just in case two or more events start at the "same time"
private NavigableMap<DateTime, Integer> eventTimes = new TreeMap<>();
//if this class needs to be used in multiple threads, use ConcurrentSkipListMap instead of TreeMap
public void apply() {
DateTime dateTime = DateTime.now();
Integer times = eventTimes.containsKey(dateTime) ? eventTimes.get(dateTime) + 1 : 1;
eventTimes.put(dateTime, times);
}
public int eventsSince(int seconds) {
DateTime startTime = DateTime.now().minus(Seconds.seconds(seconds));
NavigableMap<DateTime, Integer> eventsInRange = eventTimes.tailMap(startTime, true);
int counter = 0;
for (Integer time : eventsInRange.values()) {
counter += time;
}
return counter;
}
}
Assuming the list is sorted, you could do a binary search. Java Collections already provides Collections.binarySearch, and DateTime implements Comparable (according to the JodaTime JavaDoc). binarySearch returns the index of the value you want if it exists in the list; otherwise it returns the insertion point with the sign flipped, from which you can derive the index of the greatest value less than the one you want. So, all you need to do is (in your eventsSince method):
// find the time you want.
int index = Collections.binarySearch(eventTimes, startTime);
if (index < 0) index = -(index + 1) - 1; // index of the greatest value less than startTime, or -1 if none
// skip over duplicates so that index points at the last event <= startTime
while (index >= 0 && index != eventTimes.size() - 1
        && eventTimes.get(index).equals(eventTimes.get(index + 1))) {
    index++;
}
// return the number of events strictly after that index
return eventTimes.size() - index - 1;
This should be a faster way to do what you want.
