I have to design a REST API in Java which:
Accepts a POST request with the JSON below:
{
"instrument": "ABC",
"price": "200.90",
"timestamp" : "2018-09-25T12:00:00"
}
These records would be saved in an in-memory collection, not in any kind of database.
There would be a GET API which returns the statistics for a specific instrument's records received in the last 60 seconds. The GET request would be /statistics/{instrumentName}, e.g. /statistics/ABC. The response looks like this:
{
"count": "3",
"min": "100.00",
"max": "200.00",
"sum": "450.00",
"avg": "150.00"
}
There would be another GET request, /statistics, which returns the statistics of all the instruments received in the last 60 seconds (not specific to a particular instrument like #2).
What makes this algorithm complex to implement is that the GET call should execute in O(1) time and space.
The approach I have thought of for #3 is to have a collection with 60 buckets (since we have to calculate over the past 60 seconds, sampling once per second). Every time a transaction comes in, it goes to a specific bucket depending on the key, i.e. hour-min-sec (it would be a map from this key to the statistics for that second). A rough sketch of what I have in mind is below.
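Roughly like this (illustrative names only; this sketch covers just the all-instruments case of #3, with one aggregate per second indexed by epochSecond % 60 so a stale slot is simply reused in place):

import java.math.BigDecimal;
import java.time.Instant;

class RollingStats {

    static final class Bucket {
        long second;            // the epoch second this slot currently holds
        long count;
        BigDecimal min, max, sum;
    }

    private final Bucket[] buckets = new Bucket[60];

    synchronized void record(BigDecimal price, Instant ts) {
        long sec = ts.getEpochSecond();
        int i = (int) (sec % 60);
        Bucket b = buckets[i];
        if (b == null || b.second != sec) {   // empty or stale: reset the slot
            b = new Bucket();
            b.second = sec;
            b.sum = BigDecimal.ZERO;
            b.min = b.max = price;
            buckets[i] = b;
        }
        b.count++;
        b.sum = b.sum.add(price);
        if (price.compareTo(b.min) < 0) b.min = price;
        if (price.compareTo(b.max) > 0) b.max = price;
    }

    // A GET folds at most 60 buckets: constant work however many records came in.
    synchronized Bucket statistics(Instant now) {
        long cutoff = now.getEpochSecond() - 60;
        Bucket total = new Bucket();
        total.sum = BigDecimal.ZERO;
        for (Bucket b : buckets) {
            if (b == null || b.second <= cutoff) continue; // older than 60 s
            total.count += b.count;
            total.sum = total.sum.add(b.sum);
            if (total.min == null || b.min.compareTo(total.min) < 0) total.min = b.min;
            if (total.max == null || b.max.compareTo(total.max) > 0) total.max = b.max;
        }
        return total;
    }
}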
But what I am not able to understand is how to address problem #2, where we have to get the statistics of a specific instrument (/statistics/ABC) for the last 60 seconds in O(1) time and space.
What would be the best strategy to clean up records older than 60 seconds?
Any help with the algorithm will be appreciated.
Store the data in a Map<String, Instrument>, and have the class look like this:
import java.math.BigDecimal;
import java.time.LocalDateTime;
import java.util.Comparator;
import java.util.Map;
import java.util.SortedMap;

class Instrument {
    private String name;
    private SortedMap<LocalDateTime, BigDecimal> prices;
    private BigDecimal minPrice;
    private BigDecimal maxPrice;
    private BigDecimal sumPrice;

    // Internal helper method: evicts prices older than 60 seconds and
    // repairs the cached statistics accordingly.
    private void cleanup() {
        LocalDateTime expireTime = LocalDateTime.now().minusSeconds(60);
        Map<LocalDateTime, BigDecimal> expiredPrices = this.prices.headMap(expireTime);
        for (BigDecimal expiredPrice : expiredPrices.values()) {
            if (this.minPrice.compareTo(expiredPrice) == 0)
                this.minPrice = null; // min may have expired; recompute below
            if (this.maxPrice.compareTo(expiredPrice) == 0)
                this.maxPrice = null; // max may have expired; recompute below
            this.sumPrice = this.sumPrice.subtract(expiredPrice);
        }
        expiredPrices.clear(); // removes expired prices from this.prices (headMap is a view)
        if (this.minPrice == null && ! this.prices.isEmpty())
            this.minPrice = this.prices.values().stream().min(Comparator.naturalOrder()).get();
        if (this.maxPrice == null && ! this.prices.isEmpty())
            this.maxPrice = this.prices.values().stream().max(Comparator.naturalOrder()).get();
    }

    // other code
}
All the public methods of Instrument must be synchronized and must start with a call to cleanup(), since time has elapsed since the previous call. The addPrice(LocalDateTime, BigDecimal) method must of course also update the three statistics fields.
To keep the statistics in sync, it would be appropriate to have a Statistics class to use as the return value, so that all four main statistics values (including count, obtained from this.prices.size()) represent the same set of prices.
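A minimal sketch of what that could look like (the Statistics shape and the getStatistics name are assumptions, not part of the original answer):

import java.math.BigDecimal;
import java.math.RoundingMode;

// Immutable snapshot, so count/min/max/sum/avg all describe the same set of prices.
class Statistics {
    final long count;
    final BigDecimal min, max, sum, avg;

    Statistics(long count, BigDecimal min, BigDecimal max, BigDecimal sum) {
        this.count = count;
        this.min = min;
        this.max = max;
        this.sum = sum;
        this.avg = (count == 0) ? BigDecimal.ZERO
                : sum.divide(BigDecimal.valueOf(count), 2, RoundingMode.HALF_UP);
    }
}

and, inside Instrument:

// Synchronized, and starts with cleanup(), per the rules above.
public synchronized Statistics getStatistics() {
    cleanup();
    return new Statistics(prices.size(), minPrice, maxPrice, sumPrice);
}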
I need a Java utility method (for my application, which gets thousands of requests per second) with the following behaviour.
The request has an arrivaltime in the format DD-MM-YYYY-HH:MM:SS and a bucketNumber (1-100).
If the same arrivaltime comes in another request for the same bucketNumber, the method should increment the request's arrivaltime by 1 millisecond.
For example:
If for bucketNumber=1 the arrival time of the 1st, 2nd and 3rd requests is 01-01-2016-10:00:00 (actually, down to the millisecond, 01-01-2016-10:00:00:000), and a 4th request arrives with 01-01-2016-10:00:01, then:
for the 2nd request the utility method will return 01-01-2016-10:00:00 (but actually 01-01-2016-10:00:00:001),
for the 3rd request the utility method will return 01-01-2016-10:00:00 (but actually 01-01-2016-10:00:00:002),
and for the 4th request the utility method will return 01-01-2016-10:00:01 unchanged, without performing any operation.
I don't want to keep a huge cache to perform this action (if I use a set then I also want to keep removing the redundant data).
// the signature should look like this
Date getIncrementedArrivalTimeIfSame(Date arrivaltime, int bucketNumber) {
    // return the incremented time if equal, else return the original arrivaltime
}
Should I use a global map with bucketNumber as the key and a set of arrival times as the value? Please help me implement this. The method will be invoked from within a synchronized block, in a thread-safe way.
Below is my solution.
I finally used a map:
import java.util.Date;
import java.util.HashMap;
import java.util.Map;

static Map<Integer, Date> arrivalTimeMap = new HashMap<>();

static Date getIncrementedArrivalTimeIfEqual(Date arrivaltime, int bucketNumber) {
    // put() returns the previous arrival time stored for this bucket, if any
    Date lastArrivalTime = arrivalTimeMap.put(bucketNumber, arrivaltime);
    if (lastArrivalTime != null && !lastArrivalTime.before(arrivaltime)) {
        // collision: bump the new time 1 ms past the last stored time
        arrivaltime = incrementDateByMilliSeconds(lastArrivalTime, 1);
        arrivalTimeMap.put(bucketNumber, arrivaltime);
    }
    return arrivaltime;
}

// Helper referenced above; a minimal version:
static Date incrementDateByMilliSeconds(Date date, int millis) {
    return new Date(date.getTime() + millis);
}
I'm fetching Google Analytics data using the Core Reporting API v4. I'm able to capture at most 10,000 records for a given combination of dimensions and metrics.
My question is: if my query can produce more than 10,000 search results, how can I fetch all of those records? I have gone through the documentation and found that in a single request we can't access more than 10,000 records, whatever the properties set on the ReportRequest object.
ReportRequest request = new ReportRequest()
.setDateRanges(Arrays.asList(dateRange))
.setViewId(VIEW_ID)
.setDimensions(Arrays.asList(dimension))
.setMetrics(Arrays.asList(metric))
.setPageSize(10000);
How can we issue multiple requests in a single run, depending on the number of search results?
For example: if my query can return 35,000 records, there should be 4 requests (10,000, 10,000, 10,000 and 5,000) managed internally.
Please look into this and give me some guidance. Thanks in advance.
The Analytics Core Reporting API returns a maximum of 10,000 rows per
request, no matter how many you ask for.
If the request you are making will generate more than 10,000 rows, then there are additional rows you can request. The response returned from the first request will contain a parameter called nextPageToken, which you can use to request the next set of data.
You will have to dig around the Java library; the only documentation on how to do this that I have found is for HTTP:
POST https://analyticsreporting.googleapis.com/v4/reports:batchGet
{
"reportRequests":[
{
...
# Taken from `nextPageToken` of a previous response.
"pageToken": "XDkjaf98234xklj234",
"pageSize": "10000",
}]
}
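An untested sketch of the equivalent paging loop in the Java client (buildRequest() is a placeholder for however you construct your ReportRequest, and service is the AnalyticsReporting instance):

// Keep requesting pages until nextPageToken comes back null.
String pageToken = null;
do {
    ReportRequest request = buildRequest()          // your own request setup
            .setPageSize(10000)
            .setPageToken(pageToken);               // null on the first call
    GetReportsResponse response = service.reports()
            .batchGet(new GetReportsRequest().setReportRequests(Arrays.asList(request)))
            .execute();
    Report report = response.getReports().get(0);
    // ... consume report.getData().getRows() here ...
    pageToken = report.getNextPageToken();          // null when no more pages
} while (pageToken != null);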
Here's a stable and extensively tested solution in Java. It is a recursive solution that saves each batch of up to 10,000 results (if any) and calls itself until it finds a null nextPageToken. In this specific solution every batch is saved to a CSV file before the recursive call. Note that the first time this function is called from outside, the pageToken argument must also be null; the recursion bottoms out on the same null check.
private static int getComplexReport(AnalyticsReporting service, int reportIndex,
        String startDate, String endDate, ArrayList<String> metricNames,
        ArrayList<String> dimensionNames, String pageToken) throws IOException {

    // createComplexRequest and saveBatchToCsvFile are the author's own helpers.
    ReportRequest req = createComplexRequest(startDate, endDate, metricNames, dimensionNames, pageToken);
    ArrayList<ReportRequest> requests = new ArrayList<>();
    requests.add(req);

    // Create the GetReportsRequest object.
    GetReportsRequest getReport = new GetReportsRequest().setReportRequests(requests);

    // Call the batchGet method.
    GetReportsResponse response = service.reports().batchGet(getReport).execute();

    // Save this batch of up to 10,000 rows to its own CSV file.
    saveBatchToCsvFile("dummy_" + startDate + "_" + endDate + "_" + reportIndex + ".csv",
            startDate + "_" + endDate, response, metricNames, dimensionNames);

    String nextToken = response.getReports().get(0).getNextPageToken();
    if (nextToken != null)
        return getComplexReport(service, reportIndex + 1, startDate, endDate,
                metricNames, dimensionNames, nextToken);
    return reportIndex;
}
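For comparison, the same request in what appears to be the .NET client follows; note that, per the 10,000-row cap quoted above, a PageSize of 400,000 will still return at most 10,000 rows per request: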
var reportRequest = new ReportRequest
{
DateRanges = new List<DateRange> { dateRange },
Dimensions = new List<Dimension> { date, UserId, DeviceCategory},
Metrics = new List<Metric> { sessions },
ViewId = view,
PageSize = 400000
};
I expected the new mapWithState API for Spark 1.6+ to remove timed-out objects near-immediately, but there is a delay.
I'm testing the API with an adapted version of the JavaStatefulNetworkWordCount example below:
SparkConf sparkConf = new SparkConf()
.setAppName("JavaStatefulNetworkWordCount")
.setMaster("local[*]");
JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(1));
ssc.checkpoint("./tmp");
StateSpec<String, Integer, Integer, Tuple2<String, Integer>> mappingFunc =
StateSpec.function((word, one, state) -> {
if (state.isTimingOut())
{
System.out.println("Timing out the word: " + word);
return new Tuple2<String,Integer>(word, state.get());
}
else
{
int sum = one.or(0) + (state.exists() ? state.get() : 0);
Tuple2<String, Integer> output = new Tuple2<String, Integer>(word, sum);
state.update(sum);
return output;
}
});
JavaMapWithStateDStream<String, Integer, Integer, Tuple2<String, Integer>> stateDstream =
ssc.socketTextStream(args[0], Integer.parseInt(args[1]),
StorageLevels.MEMORY_AND_DISK_SER_2)
.flatMap(x -> Arrays.asList(SPACE.split(x))) // SPACE = Pattern.compile(" ")
.mapToPair(w -> new Tuple2<String, Integer>(w, 1))
.mapWithState(mappingFunc.timeout(Durations.seconds(5)));
stateDstream.stateSnapshots().print();
I feed it input with nc (nc -l -p <port>).
When I type a word into the nc window I see the tuple being printed in the console every second. But the timing-out message does not get printed 5 s later, as I expected based on the timeout set. The time it takes for the tuple to expire seems to vary between 5 and 20 s.
Am I missing some configuration option, or is the timeout perhaps only performed at the same time as checkpoints?
Once an event times out it's NOT deleted right away, but is only marked for deletion by saving it to a 'deltaMap':
override def remove(key: K): Unit = {
val stateInfo = deltaMap(key)
if (stateInfo != null) {
stateInfo.markDeleted()
} else {
val newInfo = new StateInfo[S](deleted = true)
deltaMap.update(key, newInfo)
}
}
Then, timed-out events are collected and sent to the output stream only at checkpoint time. That is, events which time out at batch t will appear in the output stream only at the next checkpoint: by default, after 5 batch intervals on average, i.e. at batch t+5:
override def checkpoint(): Unit = {
super.checkpoint()
doFullScan = true
}
...
removeTimedoutData = doFullScan // remove timedout data only when full scan is enabled
...
// Get the timed out state records, call the mapping function on each and collect the
// data returned
if (removeTimedoutData && timeoutThresholdTime.isDefined) {
...
Elements are actually removed only when there are enough of them, and when the state map is being serialized, which currently also happens only at checkpoint time:
/** Whether the delta chain length is long enough that it should be compacted */
def shouldCompact: Boolean = {
deltaChainLength >= deltaChainThreshold
}
// Write the data in the parent state map while copying the data into a new parent map for
// compaction (if needed)
val doCompaction = shouldCompact
...
By default checkpointing occurs every 10 iterations, thus in the example above every 10 seconds; since your timeout is 5 seconds, events are expected within 5-15 seconds.
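If you need timeouts to surface sooner, one option (an untested sketch; the interval must be a multiple of the batch interval) is to shorten the checkpoint interval of the state stream:

// Checkpoint the state stream every batch instead of the default ~10 batches,
// so timed-out entries are emitted (and compacted) sooner. The trade-off is
// more frequent checkpoint I/O.
stateDstream.checkpoint(Durations.seconds(1));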
EDIT: Corrected and elaborated answer following comments by @YuvalItzchakov
Am I missing some configuration option, or is the timeout perhaps only
performed at the same time as snapshots?
Every time a mapWithState is invoked (with your configuration, around every 1 second), the MapWithStateRDD will internally check for expired records and time them out. You can see it in the code:
// Get the timed out state records, call the mapping function on each and collect the
// data returned
if (removeTimedoutData && timeoutThresholdTime.isDefined) {
newStateMap.getByTime(timeoutThresholdTime.get).foreach { case (key, state, _) =>
wrappedState.wrapTimingOutState(state)
val returned = mappingFunction(batchTime, key, None, wrappedState)
mappedData ++= returned
newStateMap.remove(key)
}
}
(Other than the time taken to execute each job, it turns out that newStateMap.remove(key) actually only marks entries for deletion. See "Edit" for more.)
You have to take into account the time it takes for each stage to be scheduled, and the time it takes for each execution of such a stage to actually take its turn and run. It isn't precise, because this runs as a distributed system where other factors can come into play, making your timeout more or less accurate than you expect it to be.
Edit
As @etov rightly points out, newStateMap.remove(key) doesn't actually remove the element from the OpenHashMapBasedStateMap[K, S], but simply marks it for deletion. This is also a reason why you're seeing the expiration time add up.
The actual relevant piece of code is here:
// Write the data in the parent state map while
// copying the data into a new parent map for compaction (if needed)
val doCompaction = shouldCompact
val newParentSessionStore = if (doCompaction) {
val initCapacity = if (approxSize > 0) approxSize else 64
new OpenHashMapBasedStateMap[K, S](initialCapacity = initCapacity, deltaChainThreshold)
} else { null }
val iterOfActiveSessions = parentStateMap.getAll()
var parentSessionCount = 0
// First write the approximate size of the data to be written, so that readObject can
// allocate appropriately sized OpenHashMap.
outputStream.writeInt(approxSize)
while(iterOfActiveSessions.hasNext) {
parentSessionCount += 1
val (key, state, updateTime) = iterOfActiveSessions.next()
outputStream.writeObject(key)
outputStream.writeObject(state)
outputStream.writeLong(updateTime)
if (doCompaction) {
newParentSessionStore.deltaMap.update(
key, StateInfo(state, updateTime, deleted = false))
}
}
// Write the final limit marking object with the correct count of records written.
val limiterObj = new LimitMarker(parentSessionCount)
outputStream.writeObject(limiterObj)
if (doCompaction) {
parentStateMap = newParentSessionStore
}
If deltaMap should be compacted (marked with the doCompaction variable), then (and only then) is the map cleared of all the deleted instances. How often does that happen? Once the delta chain exceeds the threshold:
val DELTA_CHAIN_LENGTH_THRESHOLD = 20
That is, when the delta chain is longer than 20 items and there are items that have been marked for deletion.
I want to send email to multiple (500, 1000, 2000) users.
I have done that using an ExecutorService.
But now I want to collect the number of successful emails sent and the number of failed emails out of the total records.
I have implemented this like:
int startValue = 0;
int endValue = 0;
List<String> userEmailList = getListFromDB();

ExecutorService e = Executors.newFixedThreadPool(10);
Collection<Callable<String>> c = new ArrayList<>();
while (someflag) {
    // in the MyTask class I am sending email to users
    c.add(new MyTask(startValue, endValue, userEmailList));
}
e.invokeAll(c); // here I am calling invokeAll
e.shutdown();
public class MyTask implements Callable<String> {

    MyTask(int startValue, int endValue, List<String> userEmailList) {
        // ...
    }

    public String call() {
        // e.g. batch 1 covers startValue => endValue = 0 - 99,
        // batch 2 covers 100 - 199, ... batch 5 covers 400 - 499
        for (int i = startValue; i < endValue; i++) {
            sendEmailToUser(userEmailList.get(i));
        }
        return "done";
    }
}
But future.get() returns me the number of tasks completed, so the code above returns me 5 tasks.
What I want as output is the number of failed emails and the number of successfully sent emails.
For example, if there are 500 email users and 20 failed, then the output should be 480 success and 20 failed.
But with the code above I am only getting the number of tasks, i.e. 5 tasks.
Can anybody tell me how I can get feedback from all the concurrent tasks (not the number of tasks completed)?
Your MyTask returns a String (implements Callable<String>), which doesn't make much sense in your case. You are free to return any other type you want. You'll need some simple POJO to contain the results, e.g.:
public class Result {

    private final int successCount;
    private final int failureCount;

    public Result(int successCount, int failureCount) {
        this.successCount = successCount;
        this.failureCount = failureCount;
    }

    public int getSuccessCount() { return successCount; }
    public int getFailureCount() { return failureCount; }
}
And return it after a given batch is done (implement Callable<Result>). Of course your MyTask will then have to keep track of how many e-mails failed and return the correct values wrapped in a Result.
However, I see several ways your code can be improved. First of all, instead of passing the startValue, endValue range to MyTask, just use userEmailList.subList(startValue, endValue), which simplifies your code a lot:
new MyTask(userEmailList.subList(startValue, endValue));
//...
public class MyTask implements Callable<Result> {

    private final List<String> userEmailList;

    MyTask(List<String> userEmailList) {
        this.userEmailList = userEmailList;
    }

    public Result call() {
        int success = 0, failure = 0;
        for (String email : userEmailList) {
            try {
                sendEmailToUser(email);  // assuming this throws on failure
                success++;
            } catch (Exception ex) {
                failure++;
            }
        }
        return new Result(success, failure);
    }
}
On the other hand, there is nothing wrong with creating a MyTask that sends just one e-mail. Then, instead of aggregating counts per batch, you simply check the result of one task (one e-mail): either nothing or an exception (or a single Boolean). It's much easier and shouldn't be slower.
I can see that your call method is declared to return a String, but your code doesn't return anything (probably an incomplete snippet). From your statement, I understand that you are returning whether the task has completed, not whether each mail has been sent. You could make sendEmailToUser return success or failure depending on whether the mail was sent successfully, and get the result using Future.get.
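For completeness, a small sketch of how the totals could be collected after invokeAll (assuming the Result POJO above with its getters; executor and tasks stand for your ExecutorService and your collection of Callable<Result> tasks):

// Sum up the per-batch Results; Future.get blocks until that batch finishes.
List<Future<Result>> futures = executor.invokeAll(tasks);
int totalSuccess = 0, totalFailure = 0;
for (Future<Result> f : futures) {
    Result r = f.get();
    totalSuccess += r.getSuccessCount();
    totalFailure += r.getFailureCount();
}
System.out.println(totalSuccess + " success and " + totalFailure + " failed");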
My problem
Let's say I want to hold my messages in some sort of data structure for a long-polling application:
1. "dude"
2. "where"
3. "is"
4. "my"
5. "car"
Asking for the messages at indexes [4,5] should return:
"my", "car".
Next, let's assume that after a while I would like to purge old messages because they aren't useful anymore and I want to save memory. Let's say that after time x, messages [1-3] became stale. I assume it would be most efficient to just do the deletion once every x seconds. After that, my data structure should contain:
4. "my"
5. "car"
My solution?
I was thinking of using a ConcurrentSkipListSet or ConcurrentSkipListMap. I was also thinking of deleting the old messages from inside a newSingleThreadScheduledExecutor. I would like to know how you would implement this (efficiently/thread-safely), or perhaps which library you would use.
The big concern, as I gather it, is how to let certain elements expire after a period. I had a similar requirement and I created a message class that implemented the Delayed interface. This class held everything I needed for a message and (through the Delayed interface) told me when it had expired.
I used instances of this object within a concurrent collection; you could use a ConcurrentMap, because it will allow you to key those objects with an integer key.
I reaped the collection once every so often, removing items whose delay has passed. We test for expiration by using the getDelay method of the Delayed interface:
message.getDelay(TimeUnit.MILLISECONDS);
I used a normal thread that would sleep for a period then reap the expired items. In my requirements it wasn't important that the items be removed as soon as their delay had expired. It seems that you have a similar flexibility.
If you needed to remove items as soon as their delay expired, then instead of sleeping a set period in your reaping thread, you would sleep for the delay of the message that will expire first.
Here's my delayed message class:
class DelayedMessage implements Delayed {
long endOfDelay;
Date requestTime;
String message;
public DelayedMessage(String m, int delay) {
requestTime = new Date();
endOfDelay = System.currentTimeMillis()
+ delay;
this.message = m;
}
public long getDelay(TimeUnit unit) {
long delay = unit.convert(
endOfDelay - System.currentTimeMillis(),
TimeUnit.MILLISECONDS);
return delay;
}
public int compareTo(Delayed o) {
DelayedMessage that = (DelayedMessage) o;
if (this.endOfDelay < that.endOfDelay) {
return -1;
}
if (this.endOfDelay > that.endOfDelay) {
return 1;
}
return this.requestTime.compareTo(that.requestTime);
}
@Override
public String toString() {
return message;
}
}
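A sketch of the reaping thread described above (the map name and the one-second sleep are illustrative choices, not from the original answer):

import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.TimeUnit;

final ConcurrentMap<Integer, DelayedMessage> messages = new ConcurrentHashMap<Integer, DelayedMessage>();

Thread reaper = new Thread() {
    @Override
    public void run() {
        while (!isInterrupted()) {
            // remove every message whose delay has passed
            Iterator<Map.Entry<Integer, DelayedMessage>> it = messages.entrySet().iterator();
            while (it.hasNext()) {
                if (it.next().getValue().getDelay(TimeUnit.MILLISECONDS) <= 0) {
                    it.remove();
                }
            }
            try {
                Thread.sleep(1000); // exact reap timing isn't critical here
            } catch (InterruptedException e) {
                return; // stop reaping when interrupted
            }
        }
    }
};
reaper.start();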
I'm not sure if this is what you want, but it looks like you need a NavigableMap<K,V> to me.
import java.util.*;
public class NaviMap {
public static void main(String[] args) {
NavigableMap<Integer,String> nmap = new TreeMap<Integer,String>();
nmap.put(1, "dude");
nmap.put(2, "where");
nmap.put(3, "is");
nmap.put(4, "my");
nmap.put(5, "car");
System.out.println(nmap);
// prints "{1=dude, 2=where, 3=is, 4=my, 5=car}"
System.out.println(nmap.subMap(4, true, 5, true).values());
// prints "[my, car]" ^inclusive^
nmap.subMap(1, true, 3, true).clear();
System.out.println(nmap);
// prints "{4=my, 5=car}"
// wrap into synchronized SortedMap
SortedMap<Integer,String> ssmap = Collections.synchronizedSortedMap(nmap);
System.out.println(ssmap.subMap(4, 5));
// prints "{4=my}" ^exclusive upper bound!
System.out.println(ssmap.subMap(4, 5+1));
// prints "{4=my, 5=car}" ^ugly but "works"
}
}
Now, unfortunately there's no easy way to get a synchronized version of a NavigableMap<K,V>. A SortedMap does have subMap, but only one overload, whose upper bound is strictly exclusive.
API links
SortedMap.subMap
NavigableMap.subMap
Collections.synchronizedSortedMap
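As an aside, the ConcurrentSkipListMap mentioned in the question is itself a thread-safe NavigableMap, so the inclusive subMap overloads remain available without external synchronization; a minimal sketch:

import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

ConcurrentNavigableMap<Integer, String> cmap = new ConcurrentSkipListMap<Integer, String>();
cmap.put(4, "my");
cmap.put(5, "car");
System.out.println(cmap.subMap(4, true, 5, true).values()); // prints "[my, car]"
cmap.headMap(4).clear(); // purges everything with a key below 4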