Alright so this problem has been breaking my brain all day today.
The Problem: I am currently receiving stock tick data at an extremely high rate through multicasts. I have already parsed this data and am receiving it in the following form.
-StockID: Int64
-TimeStamp: microseconds since the epoch
-Price: Int
-Quantity: Int
Hundreds of these packets are parsed every second. To reduce computation on my storage end, I am packaging this data into dictionaries/hashtables keyed by the stock ID (key == stockID, value == array of [timestamp, price, quantity] elements).
I also want each dictionary to cover timestamps within a 5-minute interval. When incoming timestamps pass the end of the interval, the new data should go into a new dictionary representing the next interval. In addition, a special entry at key -1 records which particular 5-minute interval of the day the dictionary covers (so data received at 12:32am should hash into the dictionary with value 7 at key -1, since that dictionary represents the 12:30am to 12:35am interval for that day). Once an interval's time has passed, its dictionary can be sent off to the dataWrapper.
Now, you might be coming up with some ideas right about now. But here's a big constraint: the timestamps that come in are not necessarily strictly increasing. However, if one waits about 10 seconds after an interval has ended, it is safe to assume that all data still arriving belongs to the current interval.
The reason for all these complications is to reduce computation on the storage side of my application. With the setup above, my storage thread can simply iterate over all of the key/value pairs in a dictionary and store them in the same location on the storage system without having to reopen files, reassign groups, or change directories.
Good Luck! I will greatly appreciate ANY answers btw. :)
Python is preferred (that's what I'm doing the project in), but I can perfectly understand Java, C++, Ruby or PHP.
Summary
I am trying to put stock data into dictionaries, each representing a 5-minute interval; the timestamp that comes with each datum determines which dictionary it belongs in. This would be relatively easy except that incoming timestamps are not strictly increasing, so a dictionary cannot be sent off to the dataWrapper the moment 5 minutes have elapsed by timestamp: more data for it may still arrive for up to 10 seconds, and only after that is it okay to send it to the wrapper.
I just want any kind of ideas, algorithms, or partial implementations that could help me with the scheduling of this. How can I switch the current dictionary using both the data's timestamps (to pick the bucket) and actual clock time (for the 10-second buffer)?
Clarification Edit
The 5-minute window should be data-driven (based on timestamps); the 10-second timeout, however, appears to be wall-clock time.
Perhaps I am missing something ....
It appears you want to keep the data in 5-minute buckets, but you can't be sure you have all the data for a bucket until up to 10 seconds after it has rolled over.
This means for each instrument you need to keep the current bucket and the previous bucket. When it's 10 seconds past the 5-minute boundary you can publish/write out the old bucket.
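A sketch of that two-bucket scheme in Java (the asker preferred Python, but the logic ports directly; the class and method names here are illustrative, not from the question):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Ticks are bucketed by their own timestamp; a bucket is only published once the
// wall clock is 10 seconds past the end of its 5-minute boundary.
public class TickBuckets {
    static final long FIVE_MIN_US = 5L * 60 * 1_000_000;   // bucket width, microseconds
    static final long GRACE_US    = 10L * 1_000_000;       // late-data grace period

    // 1-based 5-minute interval of the day: 00:30-00:35 -> 7, as in the question
    static int intervalOfDay(long tsMicros) {
        long microsIntoDay = tsMicros % (24L * 60 * 60 * 1_000_000);
        return (int) (microsIntoDay / FIVE_MIN_US) + 1;
    }

    // bucketStart -> (stockId -> list of {timestamp, price, quantity})
    final Map<Long, Map<Long, List<long[]>>> buckets = new HashMap<>();

    void addTick(long stockId, long tsMicros, long price, long qty) {
        long bucketStart = (tsMicros / FIVE_MIN_US) * FIVE_MIN_US;
        buckets.computeIfAbsent(bucketStart, k -> new HashMap<>())
               .computeIfAbsent(stockId, k -> new ArrayList<>())
               .add(new long[] { tsMicros, price, qty });
    }

    // Called periodically with the wall clock; returns buckets safe to publish.
    List<Map<Long, List<long[]>>> drainExpired(long nowMicros) {
        List<Map<Long, List<long[]>>> expired = new ArrayList<>();
        buckets.entrySet().removeIf(e -> {
            boolean done = nowMicros >= e.getKey() + FIVE_MIN_US + GRACE_US;
            if (done) expired.add(e.getValue());
            return done;
        });
        return expired;
    }
}
```

intervalOfDay gives the value the question wants stored at key -1, and drainExpired would be driven by a periodic wall-clock check, e.g. from a timer thread.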
Related
I'd like to find a functional data structure that can perform "Flow control".
Example: if any IP visiting my website has visited >= N times in the last M minutes, that IP is blocked from visiting for Z minutes.
Is there any solution that does not require timer (to remove visit records periodically) or large data storage (to remember all the visits from all IPs)?
I can use Java or Scala to construct the data structure.
The simple answers are Yes, No and Yes.
Yes, you can do it without a timer, you only need a single clock. When a request arrives you look at the clock and decide based on the historic data whether to reject the request or not according to your algorithm.
No, you can't do this without recording up to N visit records for each IP. You need to know the time of each request to know how many occurred in the last M minutes. There are various ways of compressing this but you can't implement your algorithm without recording every visit.
Yes, you can use Java or Scala to create the appropriate data structures based on your algorithm.
However you can reduce the data storage if you modify your test. For example you could divide time into windows of length M and count the requests in each window. If the number of requests in the current and previous windows exceeds N then you reject the request. This doesn't give exactly the same results but it achieves the general goal of rate-limiting requests from over-active clients while storing only two values for each IP address.
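A minimal Java sketch of that two-window approximation (the class and field names are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Per IP, keep only the current window's count and the previous window's count.
// No timer is needed: everything is decided from "now" at request time.
public class RateLimiter {
    final long windowMillis;
    final int maxRequests;

    static final class Counts { long windowStart; int current; int previous; }
    final Map<String, Counts> perIp = new HashMap<>();

    RateLimiter(long windowMillis, int maxRequests) {
        this.windowMillis = windowMillis;
        this.maxRequests = maxRequests;
    }

    // Returns true if the request is allowed.
    boolean allow(String ip, long now) {
        Counts c = perIp.computeIfAbsent(ip, k -> new Counts());
        long windowStart = (now / windowMillis) * windowMillis;
        if (windowStart != c.windowStart) {
            // roll the windows forward; if more than one window passed, previous is 0
            c.previous = (windowStart - c.windowStart == windowMillis) ? c.current : 0;
            c.current = 0;
            c.windowStart = windowStart;
        }
        if (c.current + c.previous >= maxRequests) return false;
        c.current++;
        return true;
    }
}
```

As noted above, this is an approximation: a request from up to two windows ago can still count against a client, which is the trade-off for storing only two integers per IP.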
I have several large double and long arrays of 100k values each that need to be accessed for computation at a given time. Even with largeHeap requested, the Android OS doesn't give me enough memory, and I keep getting OutOfMemory exceptions on most of the tested devices. So I went researching for ways to overcome this and, following an answer I got from Waldheinz to my previous question, I implemented an array that is backed by a file: I use a RandomAccessFile to get a channel to it, map it with MappedByteBuffer as suggested, and use the MappedByteBuffer's asLongBuffer or asDoubleBuffer views. This works perfectly; I 100% eliminated the OutOfMemory exceptions. But the performance is very poor: I get lots of get(someIndex) calls that take about 5-15 milliseconds each, and the user experience is ruined.
Some useful information:
I am using binary search on the arrays to find start and end indices, and then I run a linear loop from start to end.
I added a print statement for any get() call that takes more than 5 milliseconds to finish (printing the time it took, the requested index, and the previously requested index). It seems that all of the binary-search get() requests were printed, and a few of the linear ones were too.
Any suggestions on how to make it go faster?
Approach 1
Index your data - add pointers for quick searching
Split your sorted data into 1000 buckets of 100 values each
Maintain an index referencing each bucket's start and end
The algorithm is first to find your bucket in this in-memory index (even a linear loop is fine for this) and then to jump to that bucket in the memory-mapped file
This results in a single seek into the file (one bucket to find) and an iteration over at most 100 elements.
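For instance, with 64-bit values the scheme might look like this (the names are illustrative; BucketedIndex works over any sorted LongBuffer, whether memory-mapped or in-memory):

```java
import java.io.RandomAccessFile;
import java.nio.LongBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;

// A small in-memory index over a large file of sorted longs: binary-search the
// index, then scan at most one bucket of the mapped data.
public class BucketedIndex {
    static final int BUCKET = 100;
    final LongBuffer data;     // the (memory-mapped) sorted values
    final long[] bucketStarts; // first value of each bucket, kept in memory

    BucketedIndex(LongBuffer data) {
        this.data = data;
        int buckets = (data.capacity() + BUCKET - 1) / BUCKET;
        bucketStarts = new long[buckets];
        for (int b = 0; b < buckets; b++) bucketStarts[b] = data.get(b * BUCKET);
    }

    // Index of the first element >= key.
    int lowerBound(long key) {
        int lo = 0, hi = bucketStarts.length - 1;
        while (lo < hi) {                       // last bucket whose start <= key
            int mid = (lo + hi + 1) >>> 1;
            if (bucketStarts[mid] <= key) lo = mid; else hi = mid - 1;
        }
        int end = Math.min((lo + 1) * BUCKET, data.capacity());
        for (int i = lo * BUCKET; i < end; i++)
            if (data.get(i) >= key) return i;
        return end;
    }

    static BucketedIndex mapFile(Path file) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r");
             FileChannel ch = raf.getChannel()) {
            return new BucketedIndex(
                ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()).asLongBuffer());
        }
    }
}
```

The binary search now touches only the small heap array, so the expensive mapped-file page faults are confined to one bucket per lookup.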
Approach 2
Utilize a lightweight embedded database; e.g., MapDB supports Android.
I've read this post, which was very near my question, and I still didn't find what I was looking for.
I'm developing an application that relies on two plain-text files: let's say weekdays.txt and year.txt. One file will most likely (yet to be defined) have seven lines, so it's very small (a few bytes), but the other will contain 365 lines (one per day of the year), which is not very big in bytes (20 KB tops, my guess) but requires more processing power.
The app is not yet done, so I'll try to be explicit:
So my application will get the current date and time and will look in weekdays.txt for the line that corresponds to the current day of the week, parse that line's information and store it in memory.
After that the program should read year.txt and look for the line that corresponds to the current date and parse (and store in memory) that line's info.
Then it should print out all the stored info.
When I say 'parse the info' I mean parsing Strings, something as simple as:
the string "7*1234-568" should be read as:
String id = "7";
int postCode = 1234;
int areaCode = 568;
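That parse can be sketched in a few lines of Java (RecordParser and its field names just follow the example above):

```java
// Splits a "id*postCode-areaCode" line into its three fields.
public class RecordParser {
    static final class Record {
        String id;
        int postCode;
        int areaCode;
    }

    static Record parse(String line) {
        String[] parts = line.split("[*-]");   // "7*1234-568" -> {"7", "1234", "568"}
        Record r = new Record();
        r.id = parts[0];
        r.postCode = Integer.parseInt(parts[1]);
        r.areaCode = Integer.parseInt(parts[2]);
        return r;
    }
}
```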
The goal here is to create a light (and offline, this is crucial) application for quick use.
As you can see, this is a Development 101 level application, and my question is: do you think this is too heavy a workload for a mobile phone? I'm asking because I want my app to be functional on as many of today's cellphones as possible.
By the way, do you think for this kind of work I should be using a database instead? I heard people around the forum talking about RMS, and some said it's kind of limited, so I stuck with text files. Anyway, the idea of the txt files was to make it easy for the user to update them if necessary...
Thanks in advance!
If your config files are read-only and are not going to change with time, then you could include them inside the jar. You should be able to read them using Class.getResourceAsStream that returns an InputStream. An ASCII file with 366 lines (remember leap years) and 80 cols is around 29KB, so even 10 years old phones will read it without major problems (remember to perform IO in a separate thread though).
If the configuration could change, then you'll probably want to create a web service and have the phones fetch the config over the Internet. To provide offline capability you could sync with the remote DB periodically and store the info on the device. RMS is record-based and has a maximum size (device-dependent), but I think it is OK for your case. The drawback of this approach is that at least a first synchronization has to be made, so phones without a data plan would be left out.
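For the read-only case, a standard-Java sketch of loading such a file line by line (on MIDP you would read the stream manually, since BufferedReader is not part of CLDC; the "/year.txt" resource path is an assumption):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

// Reads a small bundled text file into memory, one String per line.
public class ConfigLoader {
    static List<String> readLines(InputStream in) throws IOException {
        List<String> lines = new ArrayList<>();
        BufferedReader r = new BufferedReader(new InputStreamReader(in, "US-ASCII"));
        for (String line = r.readLine(); line != null; line = r.readLine())
            lines.add(line);
        return lines;
    }

    // Resolves the file from inside the jar, as suggested above.
    static List<String> readBundled(String path) throws IOException {
        return readLines(ConfigLoader.class.getResourceAsStream(path)); // e.g. "/year.txt"
    }
}
```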
Since one of your requirements is to work offline, I'd recommend using RMS. I am not that confident about using files in J2ME for such important data (not sure if it's better now), since they can be prone to errors and corruption.
If the amount of data you're going to save is as you say (7 lines for weekdays and 365 for the year), then RMS will have no problems with it.
Good luck!
I am currently creating a bus timetable app. As of now it gets the current date and time.
I currently have an array of the times the bus will come next. I want to compare the current time to those in the array and return the closest one, representing the next bus. However, there are different timetables for every stop, one for each direction; furthermore, they change almost daily.
Can anyone recommend a nicer data structure for storing and recalling this data? My code is very messy, so I would like a better method of comparing, storing and returning data than plain arrays.
Thanks.
I would recommend using a heap data structure (a min-heap).
At any time the topmost element represents the nearest event, so you can remove it, operate on it (computing the next event's time), and then insert it again.
It is easy and fast.
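In Java this maps directly onto java.util.PriorityQueue; a small sketch (departure times as minutes since midnight; class and method names are illustrative):

```java
import java.util.PriorityQueue;

// A min-heap of departure times: the head of the queue is always the next bus.
public class NextBus {
    final PriorityQueue<Integer> departures = new PriorityQueue<>();

    void addDeparture(int minutesSinceMidnight) {
        departures.add(minutesSinceMidnight);
    }

    // Discard departures that have already passed; the head is the next bus
    // (or null if the timetable is exhausted).
    Integer nextAfter(int now) {
        while (!departures.isEmpty() && departures.peek() < now) departures.poll();
        return departures.peek();
    }
}
```

One queue per stop-and-direction keeps the daily-changing timetables independent of each other; reloading a timetable is just rebuilding that stop's queue.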
My application has 24 buttons to count different vehicle types and directions (the app will be used to count traffic). Currently, I'm saving a line to a .csv file each time a button is pressed. The csv file contains a timestamp, direction, etc.
I have to measure how many times every button was pressed in 15-minute intervals.
How should I store the counters for the buttons?
I just need to output how often every button (every button has a different tag for identification) was pressed in 15 minutes.
I was thinking about using a HashMap, which could just take the button's tag as key and the number of occurrences as value, like this:
HashMap<String, Integer> hm = new HashMap<String, Integer>();
Integer value = hm.get(buttonTag);   // no cast needed with generics
if (value == null) {
    hm.put(buttonTag, 1);
} else {
    hm.put(buttonTag, value + 1);
}
However, I don't need the total sums of button presses, but in 15 minute chunks. And as I'm already writing timestamps to the raw csv-file, I'm wondering how I should store these values.
Edit: I'm currently testing the HashMap, and it's working really well for counting, but I see two issues: first, grouping the counts into 15-minute intervals, and second, the HashMap isn't sorted in any way. I'd need to output the values sorted by vehicle type and direction.
If your data is simple (i.e. you only need a key/value pair) then consider using SharedPreferences to store the button id and the time it was pressed.
This is a good approach because it is extremely fast. You already put the info into a .csv file, but extracting the data and traversing it just to compare timestamps is too much overhead IMO.
When a button is pressed, store your data in the .csv file and also store the key/value pair (id/timestamp); then you can iterate through that, do your comparison, and output whatever you need to output.
The other way (and probably even better) is to simply create and write to your .csv file, then dump it and use something more robust to process the data, as you will probably be doing this anyway.
EDIT: I see a lot of answers which are saying to use SQLite, statics, etc...
These are all inferior methods for what you are asking. Here is why...
SQLite is WAY too much for just a simple button id and timestamp. However if you think you might need to store more data in the future this may in fact be an option.
A static variable may be lost if the system happens to destroy the Activity's process.
If you want to be cool, just serialise the HashMap every time your activity gets destroyed (onDestroy) and load it back up when your activity gets created (onCreate). This is going to be the quickest and simplest way.
Where do you serialize your data to? To a file under Environment.getExternalStorageDirectory().
Oh, and you might want to keep track of a timestamp so you can clear the HashMap every 15 minutes; you could keep that timestamp in SharedPreferences.
I assume you already know how to do this but just in case: Java Serialization
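Putting these pieces together, a sketch that keys the counts by button tag plus 15-minute interval index and serializes the whole map, as suggested above (all names are illustrative, and the byte-array round trip stands in for the file on external storage):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;

// Counts button presses per (tag, 15-minute interval) and can serialize the
// whole map, e.g. in onDestroy, and restore it, e.g. in onCreate.
public class PressCounter {
    static final long INTERVAL_MS = 15 * 60 * 1000L;
    final HashMap<String, Integer> counts = new HashMap<>();

    void press(String buttonTag, long timestampMs) {
        // key = tag plus the interval index, so sums stay separated by interval
        String key = buttonTag + "@" + (timestampMs / INTERVAL_MS);
        Integer old = counts.get(key);
        counts.put(key, old == null ? 1 : old + 1);
    }

    byte[] save() throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);
        oos.writeObject(counts);   // HashMap and its contents are Serializable
        oos.flush();
        return bos.toByteArray();
    }

    @SuppressWarnings("unchecked")
    void load(byte[] bytes) throws Exception {
        counts.putAll((HashMap<String, Integer>)
            new ObjectInputStream(new ByteArrayInputStream(bytes)).readObject());
    }
}
```

Sorting the output by vehicle type and direction is then just a matter of sorting the key set before printing.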
It might just be me, but this sounds like an ideal situation for using a SQLite database: you can easily get the 15-minute interval grouping by selecting on the timestamp.