I'd like to find a functional data structure that can perform "Flow control".
Example: for any IP visiting my website, if the IP has visited >= N times in the last M minutes, the IP is blocked from visiting for Z minutes.
Is there any solution that does not require a timer (to remove visit records periodically) or large data storage (to remember all the visits from all IPs)?
I can use Java or Scala to construct the data structure.
The simple answers are Yes, No and Yes.
Yes, you can do it without a timer; you only need a clock. When a request arrives you look at the clock and, based on the historic data, decide whether to reject the request according to your algorithm.
No, you can't do this without recording up to N visit records for each IP. You need to know the time of each request to know how many occurred in the last M minutes. There are various ways of compressing this but you can't implement your algorithm without recording every visit.
Yes, you can use Java or Scala to create the appropriate data structures based on your algorithm.
However, you can reduce the data storage if you modify your test. For example, you could divide time into windows of length M and count the requests in each window. If the number of requests in the current and previous windows exceeds N, then you reject the request. This doesn't give exactly the same results, but it achieves the general goal of rate-limiting requests from over-active clients while storing only two values for each IP address.
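A minimal Java sketch of that two-window counter, assuming a single-threaded caller (the class and method names are just for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Two-window counter: per IP we keep only the current window index and two
// counters instead of a timestamp for every visit.
public class TwoWindowLimiter {
    private static class Counters { long window; int current; int previous; }

    private final Map<String, Counters> byIp = new HashMap<>();
    private final long windowMillis; // M minutes, expressed in milliseconds
    private final int limit;         // N visits

    public TwoWindowLimiter(long windowMillis, int limit) {
        this.windowMillis = windowMillis;
        this.limit = limit;
    }

    /** Returns true if the visit is allowed, false if the IP is rate-limited. */
    public boolean allow(String ip, long nowMillis) {
        long window = nowMillis / windowMillis;
        Counters c = byIp.computeIfAbsent(ip, k -> new Counters());
        if (window != c.window) {
            // Roll the windows forward; anything older than the previous window is dropped.
            c.previous = (window == c.window + 1) ? c.current : 0;
            c.current = 0;
            c.window = window;
        }
        if (c.current + c.previous >= limit) {
            return false; // over the limit for the current and previous windows
        }
        c.current++;
        return true;
    }
}
```

Note that this deliberately implements the relaxed test above, not the exact "N visits in the last M minutes" rule.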
I have a big array of strings, each consisting of 24-32 random characters (which can include 0123456789abcdefghijklmnopqrstuvwxyz!##$%^&*()_+=-[]';/.,<>?}{). Sometimes the array is empty, but at other times it has more than 1000 elements.
I send them to my client, which is a browser, via AJAX every time it requests them, and I want to reload a part of my application only if that array is different, that is, if elements were added or removed. So along with the entire array I want to send some kind of hash of all the elements inside it. I can't simply use MD5 or anything like that, because the elements inside the array might move around.
What do you suggest I do? The server uses Java to serve pages.
Are you sure transmitting 1000 characters is actually a problem in your use case? For instance, this Stack Overflow page is currently about 17,000 bytes, and Stack Overflow makes no effort to transmit it only if it has changed. Put differently, transmitting 1000 characters takes about 1000 bytes, or roughly 8 ms on a 1 Mbit/s connection (which is slow by modern standards ;-).
That said, transmitting data only if it has changed is such a basic optimization strategy that it has been incorporated into the HTTP standard itself. The HTTP standard describes both time-based and ETag-based invalidation, and it is implemented by virtually all software and hardware interacting over HTTP, including browsers and CDNs. To learn more, read a tutorial by Google or the normative specification.
You could use time-based invalidation, either by specifying a fixed lifetime or by interpreting the If-Modified-Since header. You could also use an ETag that is not sensitive to ordering, by putting your elements into a canonical order (e.g. by sorting) before hashing.
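For instance, a minimal sketch of such an order-insensitive ETag in Java, sorting a copy of the array before hashing it (the class name is just for illustration):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Computes an ETag that does not depend on the order of the elements:
// sort a copy of the array, then hash the sorted sequence.
public final class ETags {
    public static String orderInsensitiveETag(String[] elements) {
        try {
            String[] sorted = elements.clone();
            Arrays.sort(sorted); // canonical order, so reorderings hash identically
            MessageDigest md = MessageDigest.getInstance("MD5");
            for (String s : sorted) {
                md.update(s.getBytes(StandardCharsets.UTF_8));
                md.update((byte) 0); // separator so ["ab","c"] != ["a","bc"]
            }
            StringBuilder hex = new StringBuilder("\"");
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.append('"').toString(); // ETags are quoted strings
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always available
        }
    }
}
```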
I would suggest a system that allows you to skip sending the strings altogether if the client has the latest version. The client keeps the version number (or hash code) of the latest version it received. If it hasn't received any strings yet, it can default to 0.
So, when the client needs to get the strings, it can say, "Give me the strings if the current version isn't X," where X is the version that the client currently has.
The server maintains a version number or hash code which it updates whenever the strings change. If it receives a request, and the client's version is the same as the current version, then the server returns a result that says, "You already have the current version."
The point here is twofold: prevent transmitting information that you don't need to transmit, and prevent the client from having to compute a hash code.
If the server needs to compute a hash at every request rather than just keeping a current hash code value, have the server sort the array of strings first, and then do an MD5 or CRC or whatever.
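To make that concrete, here is a rough sketch of the version-check exchange on the server side (the StringStore and Response names are made up for illustration):

```java
import java.util.List;

// The server keeps the strings plus a version number that is bumped on every
// change; the client sends the version it already has, and only gets the
// strings back when the versions differ.
public class StringStore {
    private List<String> strings = List.of();
    private long version = 0;

    public synchronized void update(List<String> newStrings) {
        strings = List.copyOf(newStrings);
        version++;
    }

    /** Handles "give me the strings if the current version isn't clientVersion". */
    public synchronized Response get(long clientVersion) {
        if (clientVersion == version) {
            return new Response(version, null); // "you already have the current version"
        }
        return new Response(version, strings);
    }

    /** stringsOrNull is null when the client is already up to date. */
    public record Response(long version, List<String> stringsOrNull) { }
}
```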
This is not really a language-specific question; it's more of a pattern-related question, but I would like to tag it with some popular languages that I understand.
I don't have much experience with efficiently loading data in combination with searching it (especially in a mobile environment).
The strategy I have used before is to load everything into local memory and search from there (for example, using LINQ in C#).
Another strategy is to reload the data every time a new search is executed. Doing that is of course not efficient, and we may also need to do more complicated work to sync the newly loaded data with the data already in local memory.
The last strategy I can think of is the hardest one to implement: lazily load the data together with the search execution. When a search is executed, the returned results are cached locally, and the search looks in local memory first before fetching new results from the service/server. The result of each search is therefore a combination of the local search and the server search. The purpose is to reduce the amount of data reloaded from the server every time a search is run.
Here is what I can think of to implement this kind of strategy (a rough sketch follows the steps below):
When a search is run, look in local memory first. Finishing this step gives the local result.
Before sending the search request to the server, we need to somehow pass along what is already in the (local) result so those items can be excluded from the server-side search. The search method may therefore take an argument containing all the item IDs found in the first step.
With that search request, the server can exclude the already-found items and return only new items to the client.
The last step is to merge the two results, local and server, into the final search result before showing it in the UI.
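To make the steps concrete, a rough sketch of what I have in mind (SearchService and Item are just placeholders, not a real API):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Steps 1-4: search the local cache first, send the locally found IDs to the
// server so it only returns new items, then merge and cache both results.
public class CachedSearch {
    public interface SearchService {
        List<Item> search(String query, List<Long> excludeIds); // step 3 happens server-side
    }
    public record Item(long id, String text) { }

    private final Map<Long, Item> localCache = new LinkedHashMap<>();
    private final SearchService service;

    public CachedSearch(SearchService service) { this.service = service; }

    public List<Item> search(String query) {
        // Step 1: the local result.
        List<Item> local = localCache.values().stream()
                .filter(i -> i.text().contains(query))
                .toList();
        // Step 2: pass the locally found IDs so the server can exclude them.
        List<Long> foundIds = local.stream().map(Item::id).toList();
        List<Item> fresh = service.search(query, foundIds);
        // Step 4: cache the new items and merge the two results.
        fresh.forEach(i -> localCache.put(i.id(), i));
        List<Item> merged = new ArrayList<>(local);
        merged.addAll(fresh);
        return merged;
    }
}
```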
I'm not sure if this is the right approach, but what bothers me is step 2. Because we need to send the list of item IDs found in step 1 to the server, what if there are hundreds or thousands of such IDs? Sending them to the server may not be very efficient, and the query to exclude such a large number of items may not be efficient either (even using direct SQL or LINQ). I'm still unsure at this point.
Finally, if you have a better idea, especially one that has been used in a production project, please share it. I don't need concrete example code, just the idea or the steps to implement it.
Too long for a comment....
Concerning step 2, you know you can run into many problems:
Amount of data
Over time you may accumulate so much data that even the set of its IDs gets bigger than a normal server answer. In the end, you could need to cache not only previous server answers on the client, but also the client's state on the server. What you're doing is a sort of synchronization, so look at rsync for inspiration; it's an old but smart Unix tool. git push might also be inspiring.
Basically, by organizing your IDs into a tree, you can easily synchronize the information (about what the client already knows) between the server and the client. The price may be increased latency, as multiple round trips may be needed.
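A rough one-level version of that idea in Java, just for illustration: group the IDs into buckets and hash each bucket, so the two sides only need to exchange the IDs of buckets whose hashes differ.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Group IDs into buckets (here simply by id modulo the bucket count) and hash
// each bucket. Comparing per-bucket hashes tells client and server which buckets
// differ, so only those buckets' IDs need to be exchanged. A real tree would
// repeat this recursively inside each mismatching bucket.
public class IdBucketHashes {
    public static Map<Integer, Integer> bucketHashes(Set<Long> ids, int buckets) {
        Map<Integer, Set<Long>> grouped = new HashMap<>();
        for (long id : ids) {
            int bucket = (int) Math.floorMod(id, (long) buckets);
            grouped.computeIfAbsent(bucket, k -> new HashSet<>()).add(id);
        }
        Map<Integer, Integer> hashes = new HashMap<>();
        // Set.hashCode() is order-independent, so it works as a cheap bucket digest.
        grouped.forEach((bucket, members) -> hashes.put(bucket, members.hashCode()));
        return hashes;
    }
}
```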
Using the knowledge
It's quite possible that excluding the already-known objects from the SQL result could be more expensive than not excluding them, especially when you can't easily determine whether a to-be-excluded object would be part of the full answer. Still, you can save bandwidth by post-filtering the data.
Being up to date
If your data change or get deleted, you may find your client keeping obsolete data. Having the client subscribe to relevant changes is one possibility; associating a (logical) timestamp with your IDs is another.
Summary
It can get pretty complicated, and you should measure before you even try. You may find out that the problem itself is hard enough, that achieving these savings is even harder, and that the gain is limited. You know the root of all evil, right?
I would approach the problem by treating local and remote as two different data sources:
When a search is triggered, it is initiated against both data sources (local, i.e. in memory, and the server).
Most likely the local search will return results first, so display them to the user.
When results come back from the server, append the non-duplicate ones.
Optional: in case the server data has changed and some results were removed or modified, update or remove the corresponding local results and refresh the view.
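A minimal Java sketch of that flow, assuming the remote search is asynchronous (the Item type, the in-memory index, and the display callback are placeholders):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

// Treat local and remote as two data sources: show the local hits immediately,
// then append the non-duplicate remote hits when the server answers.
public class DualSourceSearch {
    public record Item(long id, String text) { }

    public void search(String query,
                       List<Item> localIndex,
                       CompletableFuture<List<Item>> remoteResults,
                       Consumer<List<Item>> display) {
        Map<Long, Item> merged = new LinkedHashMap<>();
        // The local search finishes first: display it right away.
        localIndex.stream()
                .filter(i -> i.text().contains(query))
                .forEach(i -> merged.put(i.id(), i));
        display.accept(List.copyOf(merged.values()));

        // When the server answers, append anything not already shown and redisplay.
        remoteResults.thenAccept(remote -> {
            remote.forEach(i -> merged.putIfAbsent(i.id(), i));
            display.accept(List.copyOf(merged.values()));
        });
    }
}
```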
So I am working on a project that requires a collection of clients to be iterated through for updating, with each client requiring an update packet for every other client within proximity. I want to be able to do this quickly, since updates will happen for a large number of clients at frequent intervals.
My original plan of attack was to create regions based on client locations, updating each client only with the other clients in its region. This would entail a LinkedList<Region>, with each Region having its own list of clients which update among each other. One problem with this method was that some regions could have 1 client while others could have 1000. Another level of difficulty arose from the fact that clients will constantly be moving (and thus changing location and Region). These problems could be avoided if there were a way to modify the list while iterating through it, possibly splitting elements when a region gets too large.
Next I thought of creating one large List<Client> that held all players, constantly sorted by location. Then, to update the client at index n of the list with the closest 20 clients, I would only iterate from n-10 to n+10 around its current index. I don't like this method as much, since if there were a 21st client in a nearby area, they could be ignored even though they were just as close to the client at n as the one at n+10. It also seemed slow to re-sort all the clients every tick.
In terms of speed, which of these methods provides better performance? Additionally, are there any other Java collections I should consider? Thanks!
I strongly prefer the first method. Sorting the entire list every tick is going to end up being a very bad idea time-wise, which rules out the second method.
To solve the concurrency issues, you should make a copy of the LinkedList<Region> before updating it in a thread. That way you allow Clients to change their Region at the same time as updates are being pushed out to each Region.
Another note is that if you plan on retrieving an arbitrary Region from the LinkedList<Region> (for example, when you move a Client from one Region to another), you should look into some kind of hash set. It will increase performance greatly when retrieving an arbitrary element from the middle of the list, especially if the list is large.
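For example, here is a rough sketch of keeping the regions in a HashMap keyed by grid cell, so looking up and moving a Client is a constant-time hash operation and neighbours come from the surrounding cells (the class names and the fixed cell size are just illustrative assumptions):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Grid-based regions: clients are bucketed by a cell key derived from their
// coordinates, so the neighbours of a client are the clients in its cell plus
// the 8 surrounding cells, and moving a client is a remove plus an add.
public class ClientGrid {
    public record Cell(int x, int y) { }
    public record Client(long id, double x, double y) { }

    private final double cellSize;
    private final Map<Cell, Set<Client>> cells = new HashMap<>();

    public ClientGrid(double cellSize) { this.cellSize = cellSize; }

    private Cell cellOf(double x, double y) {
        return new Cell((int) Math.floor(x / cellSize), (int) Math.floor(y / cellSize));
    }

    public void add(Client c) {
        cells.computeIfAbsent(cellOf(c.x(), c.y()), k -> new HashSet<>()).add(c);
    }

    public void remove(Client c) {
        Set<Client> bucket = cells.get(cellOf(c.x(), c.y()));
        if (bucket != null) bucket.remove(c);
    }

    /** All clients in the same cell as (x, y) and the 8 neighbouring cells. */
    public List<Client> nearby(double x, double y) {
        Cell center = cellOf(x, y);
        List<Client> result = new ArrayList<>();
        for (int dx = -1; dx <= 1; dx++) {
            for (int dy = -1; dy <= 1; dy++) {
                Set<Client> bucket = cells.get(new Cell(center.x() + dx, center.y() + dy));
                if (bucket != null) result.addAll(bucket);
            }
        }
        return result;
    }
}
```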
I am writing a multiplayer game where I allow players to attack each other.
Only the attacker must be logged in.
I need to know how many attacks a player made in the last 6 hours, and I need to know whether the defender was attacked during the last hour. I don't care about attacks made more than 6 hours ago. Is there any way to implement this better than storing the data in a database and deleting "expired" records (older than 6 hours)?
The server is written in Java; the clients will be Android.
Any ideas / tutorial links or even keywords are appreciated. Also, if you think there is no better solution, please say so :)
For GAE there is no real alternative, no. Besides the datastore, GAE offers a memcache system, which is in fact designed for storing temporary data. However, as stated in the documentation, "Values can expire from the memcache at any time, and may be expired prior to the expiration deadline set for the value." Therefore, it would be best to use the datastore.
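If you do end up putting this data in memcache despite that caveat, a minimal sketch with the App Engine Java memcache API might look like the following (the key scheme is made up for illustration):

```java
import com.google.appengine.api.memcache.Expiration;
import com.google.appengine.api.memcache.MemcacheService;
import com.google.appengine.api.memcache.MemcacheServiceFactory;

// Store the defender's "last attacked" marker with a 1-hour expiration.
// Because memcache entries can be evicted at any time, a missing entry does
// not prove the defender was not attacked -- hence the datastore is safer.
public class AttackCache {
    private final MemcacheService memcache = MemcacheServiceFactory.getMemcacheService();

    public void markAttacked(String defenderId) {
        memcache.put("attacked:" + defenderId,
                     System.currentTimeMillis(),
                     Expiration.byDeltaSeconds(60 * 60)); // 1 hour
    }

    public boolean possiblyAttackedInLastHour(String defenderId) {
        return memcache.get("attacked:" + defenderId) != null;
    }
}
```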
Alright so this problem has been breaking my brain all day today.
The Problem: I am currently receiving stock tick data at an extremely high rate through multicasts. I have already parsed this data and am receiving it in the following form.
- StockID: int64
- Timestamp: microseconds since epoch
- Price: int
- Quantity: int
Hundreds of these packets are parsed every second. I am trying to reduce the computation on my storage end by packaging this data into dictionaries/hashtables keyed by stockID (key == stockID, value == array of [timestamp, price, quantity] elements).
I also want each dictionary to cover timestamps within a 5-minute interval. When the incoming data's timestamps move past the 5-minute interval, the new data should go into a new dictionary that represents the next interval. Also, a special entry at key -1 tells which particular 5-minute interval of the day the dictionary belongs to (so if you receive something at 12:32am, it should go into the dictionary that has value 7 at key -1, since that represents the interval from 12:30am to 12:35am for that day). Once the time has passed, a dictionary whose interval has expired can be sent off to the dataWrapper.
Now, you might be coming up with some ideas right about now. But here's a big constraint: the timestamps coming in are not necessarily strictly increasing; however, if one waits about 10 seconds after an interval has ended, it is safe to assume that all data coming in belongs to the current interval.
The reason I am doing all these complicated things is to reduce computation on the storage side of my application. With the setup above, my storage-side thread can simply iterate over all of the key/value pairs within the dictionary and store them in the same location on the storage system, without having to reopen files, reassign groups, or change directories.
Good Luck! I will greatly appreciate ANY answers btw. :)
I'd prefer something in Python (that's what I'm doing the project in), but I can perfectly understand Java, C++, Ruby or PHP.
Summary
I am trying to put stock data into dictionaries, with each dictionary representing a 5-minute interval. The timestamp that comes with the data determines which dictionary it should be put in. This would be relatively easy, except that timestamps are not strictly increasing as they come in, so a dictionary cannot be sent off to the dataWrapper immediately once 5 minutes have passed according to the timestamps; more data for it may still arrive for up to about 10 seconds, and only after that is it okay to send it to the wrapper.
I just want any kind of ideas, algorithms, or partial implementations that could help me with the scheduling of this. How can I switch the current dictionary based on both the timestamps (for the data) and the actual clock time (for the 10-second buffer)?
Clarification Edit
The 5-minute window should be data driven (based upon the timestamps); however, the 10-second timeout appears to be clock time.
Perhaps I am missing something ....
It appears you want to keep the data in 5-minute buckets, but you can't be sure you have all the data for a bucket until up to 10 seconds after it has rolled over.
This means that for each instrument you need to keep the current bucket and the previous bucket. When it is 10 seconds past the 5-minute boundary, you can publish/write out the old bucket.
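A rough Java sketch of that current/previous-bucket scheme, assuming late ticks arrive at most 10 seconds after their bucket ends (the class names and the dataWrapper callback are just illustrative):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

// Route each tick into a 5-minute bucket by its own timestamp (data-driven),
// and only publish a bucket once the wall clock is 10 seconds past its end,
// so ticks arriving up to 10 seconds late are still captured.
public class TickBucketer {
    private static final long BUCKET_MICROS = 5L * 60 * 1_000_000; // 5 minutes
    private static final long GRACE_MICROS  = 10L * 1_000_000;     // 10 seconds

    public record Tick(long stockId, long timestampMicros, long price, long quantity) { }

    // bucket index -> (stockId -> ticks in that bucket)
    private final Map<Long, Map<Long, List<Tick>>> buckets = new HashMap<>();

    public void onTick(Tick t) {
        long bucket = t.timestampMicros() / BUCKET_MICROS;
        buckets.computeIfAbsent(bucket, k -> new HashMap<>())
               .computeIfAbsent(t.stockId(), k -> new ArrayList<>())
               .add(t);
    }

    /** Call periodically with the current wall-clock time in microseconds since epoch. */
    public void flushExpired(long nowMicros, BiConsumer<Long, Map<Long, List<Tick>>> dataWrapper) {
        // Buckets strictly before this index ended at least 10 seconds ago.
        long safeBefore = (nowMicros - GRACE_MICROS) / BUCKET_MICROS;
        buckets.entrySet().removeIf(e -> {
            if (e.getKey() < safeBefore) {
                dataWrapper.accept(e.getKey(), e.getValue()); // hand the finished bucket off
                return true;
            }
            return false;
        });
    }
}
```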