I'm currently working on my master's thesis, which involves using Drools Fusion to process events coming from multiple streams of XML files (so I am 'replaying' each file as a stream). The files describe a football match in which GPS sensors attached to the players monitor their acceleration, velocity and other measures such as player load.
Each XML file contains instances of events stating an ID, start time, end time and code as follows:
<file>
<SESSION_INFO>
<start_time>2015-09-17 19:02:31.31 +100</start_time>
</SESSION_INFO>
<SORT_INFO>
<sort_type>sort order</sort_type>
</SORT_INFO>
<ALL_INSTANCES>
<instance>
<ID>1</ID>
<start>0</start>
<end>1.51</end>
<code>Accel : 0.00 - 2.00</code>
</instance>
<instance>
<ID>2</ID>
<start>1.52</start>
<end>3.01</end>
<code>Accel : -2.00 - 0.00</code>
</instance>
<instance>
<ID>3</ID>
<start>3.02</start>
<end>4.01</end>
<code>Accel : 0.00 - 2.00</code>
</instance>
<instance>
<ID>4</ID>
<start>4.02</start>
<end>4.21</end>
<code>Accel : 2.00 - 4.00</code>
</instance>
</ALL_INSTANCES>
</file>
I have 9 of these files, all of which need to be processed concurrently, with their events fed into the engine simultaneously. My current implementation uses a JAXB unmarshaller to feed these events into the stream, but I have no idea how to do it concurrently (i.e. feed in the first event of each stream, then the second event of each stream, and so on). I was looking into using threads for that part of the implementation, unless there is another tool in Drools that I've missed. I have searched fairly thoroughly, but found no comprehensive examples of processing multiple streams concurrently.
Another question I have is regarding the pseudo clock. I have 9 different streams whose events happen at different times, so I cannot simply advance the time after every insert: the events would not line up. All the streams start at the same time. For example, if instance 1 in the XML lasts 1.51 seconds and an event from another stream lasts, say, 4 seconds, and I advanced the clock for both of these events, they would be out of sync with each other.
However, all my time-related data exists in each stream. The kick-off time is 19:02:31, and each event carries 'start' and 'end' timestamps in seconds after kick-off, so its duration is (end timestamp - start timestamp). The processing I need to do involves taking these acceleration events and correlating them across streams whenever 2 or more players accelerate at the same rate over roughly the same time interval.
Can anyone give me any pointers or assistance? To summarize, I need to know a better way of concurrently inserting streams into the engine and need to know if I need the pseudoclock for my implementation/processing. I am pretty much a beginner in programming so all I want is to get the system to run.
Thanks a lot!
Stu.
You don't need to process the nine XML files concurrently, i.e., distributed on threads. <instance> elements appear to be sorted according to start or end time (this may depend on what needs to be computed during an instance event), and you can process them all in their natural sequence - just determine what is next in the nine streams.
This way, your issue with the pseudo clock also ceases to be a problem. You can easily advance the clock to the next instance event once you have determined which one it is.
Without knowing all the details, I think that each <instance> defines two events: the player starts moving and the player stops moving. And you may have to reassess the situation at each of these two events.
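The "determine what is next in the nine streams" step can be sketched as a k-way merge over a priority queue. This is only a minimal sketch in plain Java with no Drools dependency: the class names StreamMerger, Instance and Cursor are made up for illustration, and it assumes each file is already sorted by start time. The comment in the loop marks where a Drools pseudo clock would be advanced.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

class StreamMerger {

    /** One <instance> element; times are seconds after kick-off. */
    static final class Instance {
        final int stream;   // index of the source file (illustrative field)
        final double start;
        final double end;
        final String code;

        Instance(int stream, double start, double end, String code) {
            this.stream = stream;
            this.start = start;
            this.end = end;
            this.code = code;
        }
    }

    /** The next unprocessed instance of one stream plus the rest of that stream. */
    private static final class Cursor {
        Instance head;
        final Iterator<Instance> rest;

        Cursor(Iterator<Instance> it) {
            this.rest = it;
            this.head = it.next();
        }
    }

    /** Merges streams that are each sorted by start time into one sorted list. */
    static List<Instance> merge(List<List<Instance>> streams) {
        PriorityQueue<Cursor> heads =
            new PriorityQueue<>(Comparator.comparingDouble((Cursor c) -> c.head.start));
        for (List<Instance> s : streams) {
            if (!s.isEmpty()) {
                heads.add(new Cursor(s.iterator()));
            }
        }
        List<Instance> ordered = new ArrayList<>();
        while (!heads.isEmpty()) {
            Cursor c = heads.poll();
            // With Drools, the pseudo clock would be advanced to c.head.start
            // here, and the event inserted into the session.
            ordered.add(c.head);
            if (c.rest.hasNext()) {
                c.head = c.rest.next(); // safe: c is not in the queue right now
                heads.add(c);
            }
        }
        return ordered;
    }
}
```

Because the merged sequence is globally ordered by start time, a single thread can insert the events one by one and move the clock forward monotonically.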
I want to process events in timestamp order as they come into the system from multiple source systems such as MQ, S3 and Kafka.
What approach should be taken to solve the problem?
As soon as an event comes in, the program can't know if another source will send events that should be processed before this one but have not arrived yet. Therefore, you need some waiting period, e.g. 5 minutes, in which events won't be processed so that late events have a chance to cut in front.
There is a trade-off here: making the waiting window larger gives late events a higher chance of being processed in the right order, but also delays event processing.
For implementation, one way is to use a priority-queue that sorts by min-timestamp. All event sources write to this queue and events are consumed only from the top and only if they are at least x seconds old.
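The priority-queue idea above could look roughly like this. It is a minimal in-memory sketch: ReorderBuffer, Event and drainReady are hypothetical names, the waiting period is passed in as waitMillis, and the crash recovery and acknowledgement aspects are not covered.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class ReorderBuffer {

    static final class Event {
        final long timestampMillis;
        final String payload;

        Event(long timestampMillis, String payload) {
            this.timestampMillis = timestampMillis;
            this.payload = payload;
        }
    }

    private final long waitMillis;
    // Min-heap: the event with the smallest timestamp is always on top.
    private final PriorityQueue<Event> queue =
        new PriorityQueue<>(Comparator.comparingLong((Event e) -> e.timestampMillis));

    ReorderBuffer(long waitMillis) {
        this.waitMillis = waitMillis;
    }

    /** Called by every source (MQ, S3, Kafka, ...) when an event arrives. */
    synchronized void offer(Event e) {
        queue.add(e);
    }

    /** Removes and returns, oldest first, every event that is at least waitMillis old. */
    synchronized List<Event> drainReady(long nowMillis) {
        List<Event> ready = new ArrayList<>();
        while (!queue.isEmpty()
                && queue.peek().timestampMillis <= nowMillis - waitMillis) {
            ready.add(queue.poll());
        }
        return ready;
    }
}
```

A consumer would call drainReady(System.currentTimeMillis()) periodically. Passing the current time in as a parameter keeps the class easy to test.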
One possible optimisation for the processing lag: As long as all data sources provide at least one event that is ready for consumption, you can safely process events until one source is empty again. This only works if sources provide their own events in-order. You could implement this by having a counter for each data source of how many events exist in the priority-queue.
Another aspect is what happens to the priority-queue when a node crashes. Using acknowledgements should help here, so that on crash the queue can be rebuilt from unacknowledged events.
I have a set of rules in my BroadcastStream in Apache Flink.
I am able to apply new rules as they come to my stream of events.
But I can't figure out how to implement rules like these:
rule 1> alert when count of event a is greater than 5 in a window of 5 mins
rule 2> alert when count of event a is greater than 4 in a window of 15 mins
I am a newbie to flink. I am not able to figure this out.
An application based on flink-sql or flink-cep won't be able to do this, because those libraries can only handle rules that are defined at the time the job is compiled. You would need to start a new job for each new rule, which may not meet your requirements.
If you want to have a single job that can handle a dynamic set of rules that are supplied while the job is running, you'll have to build this yourself. You can use a KeyedBroadcastProcessFunction to do this (which it sounds like you have already begun to experiment with).
Here's a sketch of a possible implementation:
You can use keyed state in the KeyedBroadcastProcessFunction to keep track of the current count in each window. If the rules can be characterized by a time interval and a counting threshold, then you could use MapState, where the keys are the rule IDs, and the values in the map are the current count for that rule. You can have a timer for each rule that fires when each window ends.
As events arrive, you iterate through the rule-based map, incrementing the counter for every relevant rule. And when the timers fire, you find the relevant rules, compare the counters to the thresholds, take appropriate action, and clear those counters.
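The bookkeeping described in this sketch can be illustrated in plain Java, without any Flink APIs. Everything here is an assumption made for illustration: the class RuleCounters, the Rule fields, and the two HashMaps, which merely stand in for Flink's broadcast state and keyed MapState. In a real job, onEvent would correspond to processElement, and onTimer would be driven by a timer registered for the value returned by windowEnd.

```java
import java.util.HashMap;
import java.util.Map;

class RuleCounters {

    /** A counting rule; all field names are illustrative. */
    static final class Rule {
        final String id;
        final String eventType;
        final long windowMillis;
        final int threshold;

        Rule(String id, String eventType, long windowMillis, int threshold) {
            this.id = id;
            this.eventType = eventType;
            this.windowMillis = windowMillis;
            this.threshold = threshold;
        }
    }

    // Stands in for the broadcast state holding the current rules.
    private final Map<String, Rule> rules = new HashMap<>();
    // Stands in for keyed MapState: rule ID -> count in the currently open window.
    private final Map<String, Integer> counts = new HashMap<>();

    void addRule(Rule rule) {
        rules.put(rule.id, rule);
    }

    /** End of the tumbling window containing the timestamp; a timer would be set for it. */
    static long windowEnd(Rule rule, long timestampMillis) {
        return timestampMillis - (timestampMillis % rule.windowMillis) + rule.windowMillis;
    }

    /** Counts one event against every rule that watches its type. */
    void onEvent(String eventType) {
        for (Rule rule : rules.values()) {
            if (rule.eventType.equals(eventType)) {
                counts.merge(rule.id, 1, Integer::sum);
            }
        }
    }

    /** Called when a rule's window ends: clears the counter and reports whether to alert. */
    boolean onTimer(String ruleId) {
        Rule rule = rules.get(ruleId);
        Integer count = counts.remove(ruleId);
        return rule != null && count != null && count > rule.threshold;
    }
}
```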
Some potential complications to keep in mind:
This implementation requires that you partition your stream with a keyBy, so that you can use MapState and timers.
The broadcast stream can't have timers associated with it, so the timers will have to be managed by the processElement method that's handling the keyed stream.
Flink only allows one timer for a given key and given timestamp. So take care if you must handle the case where two rules would need to be triggered at the same time.
If events can arrive out of order, then you will need to either first sort the stream by timestamp, or allow for having multiple windows open concurrently.
I am trying to figure out how to get time-based streaming but on an infinite stream. The reason is pretty simple: Web Service call latency results per unit time.
But, that would mean I would have to terminate the stream (as I currently understand it) and that's not what I want.
In words: If 10 WS calls came in during a 1 minute interval, I want a list/stream of their latency results (in order) passed to stream processing. But obviously, I hope to get more WS calls at which time I would want to invoke the processors again.
I could totally be misunderstanding this. I had thought of using Collectors.groupingBy(x -> someTimeGrouping), so all calls are grouped by whatever measurement interval I choose. But then no code will be aware of the results until I call a closing function, at which point the monitoring process is done.
I'm just trying to learn Java 8 by applying it to previous code.
By definition and construction, a stream can only be consumed once, so if you send your results to an infinite stream, you will not be able to access them more than once. Based on your description, it looks like it would make more sense to store the latency results in a collection, say an ArrayList, and use the stream functionality to group them when you need to analyse the data.
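As a minimal sketch of that suggestion, assuming a hypothetical Call class that records when each call started and how long it took, the stored results can be grouped per minute on demand with Collectors.groupingBy:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class LatencyReport {

    /** One recorded web service call (illustrative class and field names). */
    static final class Call {
        final long startMillis;
        final long latencyMillis;

        Call(long startMillis, long latencyMillis) {
            this.startMillis = startMillis;
            this.latencyMillis = latencyMillis;
        }
    }

    /**
     * Groups latencies by the minute in which the call started.
     * Within each minute, the latencies keep the order of the input list.
     */
    static Map<Long, List<Long>> latenciesPerMinute(List<Call> calls) {
        return calls.stream().collect(Collectors.groupingBy(
            c -> c.startMillis / 60_000,   // minute index since the epoch
            TreeMap::new,                  // keep the minutes sorted
            Collectors.mapping(c -> c.latencyMillis, Collectors.toList())));
    }
}
```

Each time a report is needed, a fresh stream is created from the collection, so the data can be analysed as often as required while new calls keep being appended.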
My application takes a lot of measurements of its internal processes. For example, I time certain methods and external web service calls, and I also have variables with changing values and processes that have a 'state' (e.g. PAUSED, WAITING etc).
The application uses 100 to 200 threads, and each bit of data would be associated with a particular thread.
I am looking for some software that I can channel all this information into that would produce useful metrics and graphs of the data (ideally in real time or close to real time), let me set thresholds to trigger warnings, would allow me to filter the data by thread or thread group, etc etc.
The application is performing time critical tasks so the software/api would need to be very fast and never block.
The application is written in Java, and ideally the software/API would be in Java as well. I think what I'm looking for is called Event Stream Processing, but I'm really not sure what terminology to use to describe it.
All I've found so far are Esper and ERMA. Can anyone give me a recommendation? I'm the only one working on this project so I'm hoping for something that is pretty easy to set up and use, and has a workable front end.
In the end I found Graphite, which was pretty close to being exactly what I wanted. It is not the simplest thing to set up and configure, but I got it working.
http://graphite.wikidot.com/
In my case I send data directly from my application to StatsD (via UDP), which collects the data and does some pre-processing before it ends up in the Whisper back end. There is a simple example of a Java interface here: https://github.com/etsy/statsd/commit/2253223f3c19d2149d65ec5bc802198ff93da4cb
Alternatively, you could send your data directly to Graphite; there is an example here: http://neopatel.blogspot.co.uk/2011/04/logging-to-graphite-monitoring-tool.html
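For the StatsD route, sending a metric amounts to writing a short text line into a UDP datagram. The following is a minimal sketch rather than the linked client: the class and method names are made up, and it assumes the usual StatsD line format of name:value|type (ms for timings, g for gauges) and a daemon listening on the conventional port 8125.

```java
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

class StatsdSender {

    /** Formats a timing metric, e.g. "ws.call:42|ms". */
    static String timing(String name, long millis) {
        return name + ":" + millis + "|ms";
    }

    /** Formats a gauge metric, e.g. "queue.size:7|g". */
    static String gauge(String name, long value) {
        return name + ":" + value + "|g";
    }

    /** Sends one metric line as a single UDP datagram without waiting for a reply. */
    static void send(String host, int port, String line) throws IOException {
        byte[] data = line.getBytes(StandardCharsets.UTF_8);
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.send(new DatagramPacket(
                data, data.length, InetAddress.getByName(host), port));
        }
    }
}
```

A call such as StatsdSender.send("localhost", 8125, StatsdSender.timing("ws.call", 42)) would report one timing. Opening a socket per call keeps the sketch short; with 100 to 200 threads it would be better to reuse one socket per thread or hand the lines to a single sender thread.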
Given the following facts, is there an existing open-source Java API (possibly as part of some greater product) that implements an algorithm enabling the reproducible ordering of events in a cluster environment:
1) There are N sources of events, each with a unique ID.
2) Each event produced has an ID/timestamp, which, together with
its source ID, makes it uniquely identifiable.
3) The ids can be used to sort the events.
4) There are M application servers receiving those events.
M is normally 3.
5) The events can arrive at any one or more of the application
servers, in no specific order.
6) The events are processed in batches.
7) The servers have to agree for each batch on the list of events
to process.
8) Each event has an earliest and a latest batch ID in which it
must be processed.
9) They must not be processed earlier, and are "failed" if they
cannot be processed before the deadline.
10) The batches are based on the real clock time. For example,
one batch per second.
11) The events of a batch are processed when 2 of the 3 servers
agree on the list of events to process for that batch (quorum).
12) The "third" server then has to wait until it possesses all the
required events before it can process that batch too.
13) Once an event was processed or failed, the source has to be
informed.
14) [EDIT] Events from one source must be processed (or failed) in
the order of their ID/timestamp, but there is no causality
between different sources.
Less formally, I have these servers that receive events. They start with the same initial state, and should keep in sync by agreeing on which event to process in which order. Luckily for me, the events are not to be processed ASAP, but "in a bit", so that I have some time to get the servers to agree before the deadline. But I'm not sure if that actually makes any real difference to the algorithms. And if all servers agree on all batches, then they will always be in sync, therefore presenting a consistent view when queried.
While I would be most happy with a Java API, I would accept something else if I can call it from Java. And if there is no open-source API, but a clear algorithm, I would also take that as an answer and try to implement it myself.
Looking at the question and your follow-up, there probably wasn't an API that satisfied your requirements at the time. Today you could take a look at Kafka (originally from LinkedIn):
Apache Kafka
And the general concept of "a log" entity, in what folks like to call 'big data':
The Log: What every software engineer should know about real-time data's unifying abstraction
For your question, I'd actually begin with the blog post about "the log". In my terms (and Kafka isn't the only package that does log handling), it works as follows:
Instead of queue-based message passing / publish-subscribe,
Kafka uses a "log" of messages
Subscribers (or end-points) can consume the log
The log is guaranteed to be "in order" (in Kafka, within a single partition); it handles very large volumes of data and is fast
Double-check the guarantee; there's usually a trade-off against reliability
You just read the log. In Kafka, reads are not destructive: entries stay in the log until the configured retention period expires.
Each subscriber (consumer) group keeps its own position in the log, so every group can read every entry before it expires.
The basic handling (compute) process for the log is a map-filter-reduce model: you read everything really fast, keep only the stuff you want, process it (reduce) and produce the outcome(s).
The downside seems to be that you need clusters and so on to make it really shine. Since different servers or sites were mentioned, I think we are still on track. I found it finicky to get up and running with the Apache downloads because they tend to assume non-Windows environments (ho hum).
The other 'fast' option would be
Apache Apollo
Which would need you to do the plumbing for connecting different servers. Since the requirements include ...
servers that receive events. They start with the same initial state, and should keep in sync by agreeing on which event to process in which order. Luckily for me, the events are not to be processed ASAP, but "in a bit", so that I have some time to get the servers to agree before the deadline
I suggest looking at a "Getting Started" example or tutorial with Kafka and then looking at similar ZooKeeper organised message/log software(s). Good luck and Enjoy!
So far I haven't got a clear answer, but I think it would be useful for anyone interested to see what I found.
Here are some theoretical discussions related to the subject:
Dynamic Vector Clocks for Consistent Ordering of Events
Conflict-free Replicated Data Types
One way of making multiple concurrent processes wait for each other, which I could use to synchronize the "batches", is a distributed barrier. One Java implementation seems to be available on top of Hazelcast, and another uses ZooKeeper.
One simpler alternative I found is to use a DB. Every process inserts all the events it receives into the DB. Depending on the DB design, this can be fully concurrent and lock-free, as in VoltDB, for example. Then, at a regular interval of one second, a "cron job" runs that selects and marks the events to be processed in the next batch. The job can run on every server. The first server to run the job for a given batch fixes the set of events, so the others just use the list that the first one defined. That way we have a guarantee that every batch contains the same set of events on all servers. And if we can impose a complete order over the whole batch, which the cron job could specify itself, then the state of the servers will be kept in sync.
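The "complete order over the whole batch" could, for example, be defined by sorting on the event ID/timestamp first and the source ID second. The following is only a sketch under that assumption; BatchOrder, Event and the field names are invented for illustration. Because the (ID, source ID) pair is unique, the comparator never reports a tie between distinct events, so every server derives the same sequence from the same set of events, and events from one source stay in ID order.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;

class BatchOrder {

    /** An event identified by its source and its per-source ID/timestamp. */
    static final class Event {
        final String sourceId;
        final long id;

        Event(String sourceId, long id) {
            this.sourceId = sourceId;
            this.id = id;
        }

        @Override
        public String toString() {
            return sourceId + "#" + id;
        }
    }

    // Total order: ID/timestamp first, source ID as the tie-breaker.
    static final Comparator<Event> TOTAL_ORDER =
        Comparator.comparingLong((Event e) -> e.id).thenComparing(e -> e.sourceId);

    /** Returns the batch in the agreed processing order, whatever the arrival order was. */
    static List<Event> ordered(Collection<Event> batch) {
        List<Event> sorted = new ArrayList<>(batch);
        sorted.sort(TOTAL_ORDER);
        return sorted;
    }
}
```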