Compare two large string arrays between client/server - Java

I have a big array of strings, each 24-32 random characters long (drawn from 0123456789abcdefghijklmnopqrstuvwxyz!##$%^&*()_+=-[]';/.,<>?}{). Sometimes the array is empty, but at other times it has more than 1000 elements.
I send them to my client (a browser) via AJAX every time it requests them, and I want to reload a part of my application only if that array has changed, that is, if an element was modified, added, or removed. So I want to send the entire array along with some kind of hash of all the elements inside it. I can't just use MD5 or anything like that over the whole array, because the elements inside the array might move around.
What do you suggest I do? The server uses Java to serve pages.

Are you sure transmitting 1000 short strings is actually a problem in your use case? For instance, this stackoverflow page is currently 17000 bytes large, and stackoverflow makes no effort to only transmit it if it has changed. Put differently, 1000 strings of 24-32 characters come to roughly 30 KB, which takes about a quarter of a second on a 1 Mbit connection (which is slow by modern standards ;-).
That said, transmitting data only if it has changed is such a basic optimization strategy that it has been incorporated into the HTTP standard itself. The standard describes both time-based and ETag-based invalidation, and it is implemented by virtually all software and hardware that speaks HTTP, including browsers and CDNs. To learn more, read a tutorial by Google or the normative specification.
You could use time-based invalidation, either by specifying a fixed lifetime or by interpreting the If-Modified-Since header. You could also use an ETag that is not sensitive to ordering, by putting your elements into a canonical order (e.g. by sorting them) before hashing.
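For illustration, here is a rough servlet sketch of the ETag variant (loadStrings() and toJson() are hypothetical stand-ins for the real data source and serializer): the hash is computed over a sorted copy of the array, so reordering the elements does not change the ETag, and a matching If-None-Match header gets a 304 with no body.

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Collections;
    import java.util.List;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class StringsServlet extends HttpServlet {

        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
            List<String> strings = loadStrings();                     // hypothetical data source
            String etag = '"' + orderInsensitiveHash(strings) + '"';

            if (etag.equals(req.getHeader("If-None-Match"))) {
                resp.setStatus(HttpServletResponse.SC_NOT_MODIFIED);  // 304: the browser keeps its copy
                return;
            }
            resp.setHeader("ETag", etag);
            resp.setContentType("application/json;charset=UTF-8");
            resp.getWriter().write(toJson(strings));
        }

        /** Hash a sorted copy of the elements so reordering does not change the ETag. */
        private static String orderInsensitiveHash(List<String> strings) {
            try {
                MessageDigest md = MessageDigest.getInstance("SHA-256");
                List<String> sorted = new ArrayList<>(strings);
                Collections.sort(sorted);
                for (String s : sorted) {
                    md.update(s.getBytes(StandardCharsets.UTF_8));
                    md.update((byte) 0);                              // separator so ["ab","c"] != ["a","bc"]
                }
                StringBuilder hex = new StringBuilder();
                for (byte b : md.digest()) {
                    hex.append(String.format("%02x", b));
                }
                return hex.toString();
            } catch (NoSuchAlgorithmException e) {
                throw new IllegalStateException(e);
            }
        }

        // Hypothetical stand-ins for the real data source and serializer.
        private List<String> loadStrings() {
            return Arrays.asList("example-element-1", "example-element-2");
        }

        private String toJson(List<String> strings) {
            return "[\"" + String.join("\",\"", strings) + "\"]";
        }
    }

Depending on the Cache-Control headers you send, the browser can handle the conditional request itself, so the client code never needs to track the ETag explicitly.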

I would suggest a system that allows you to skip sending the strings altogether if the client has the latest version. The client keeps the version number (or hash code) of the latest version it received. If it hasn't received any strings yet, it can default to 0.
So, when the client needs to get the strings, it can say, "Give me the strings if the current version isn't X," where X is the version that the client currently has.
The server maintains a version number or hash code which it updates whenever the strings change. If it receives a request, and the client's version is the same as the current version, then the server returns a result that says, "You already have the current version."
The point here is twofold: prevent transmitting information that you don't need to transmit, and prevent the client from having to compute a hash code.
If the server needs to compute a hash at every request rather than just keeping a current hash code value, have the server sort the array of strings first, and then do an MD5 or CRC or whatever.
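For illustration, a minimal sketch of the version-number approach described above (the class and method names are mine, not a standard API): the version is bumped whenever the strings change, so no hash needs to be computed per request and nothing is transmitted when the client is already current.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical server-side holder for the strings and their version number.
    public class StringStore {

        private final List<String> strings = new ArrayList<>();
        private long version = 1;   // clients default to 0, so their first request always gets data

        /** Replace the strings and bump the version; call this whenever the data changes. */
        public synchronized void replaceAll(List<String> newStrings) {
            strings.clear();
            strings.addAll(newStrings);
            version++;
        }

        /** "Give me the strings if the current version isn't X." */
        public synchronized Result getIfChanged(long clientVersion) {
            if (version == clientVersion) {
                return new Result(version, null);   // "you already have the current version"
            }
            return new Result(version, new ArrayList<>(strings));
        }

        /** What goes back over AJAX: the current version plus the strings (null means "unchanged"). */
        public static final class Result {
            public final long version;
            public final List<String> strings;

            Result(long version, List<String> strings) {
                this.version = version;
                this.strings = strings;
            }
        }
    }

The AJAX handler then just calls getIfChanged(clientVersion) and serializes the Result.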

Related

Functional Data structures to support flow control

I'd like to find a functional data structure that can perform "Flow control".
Example: for any IP visiting my website, if that IP has visited N or more times in the last M minutes, it is blocked from visiting for Z minutes.
Is there any solution that does not require a timer (to remove visit records periodically) or large data storage (to remember all the visits from all IPs)?
I can use Java or Scala to construct the data structure.
The simple answers are Yes, No and Yes.
Yes, you can do it without a timer; you only need a single clock. When a request arrives, you look at the clock and decide, based on the historical data, whether to reject the request according to your algorithm.
No, you can't do this without recording up to N visit records for each IP. You need to know the time of each request to know how many occurred in the last M minutes. There are various ways of compressing this but you can't implement your algorithm without recording every visit.
Yes, you can use Java or Scala to create the appropriate data structures based on your algorithm.
However, you can reduce the data storage if you modify your test. For example, you could divide time into windows of length M and count the requests in each window. If the number of requests in the current and previous windows exceeds N, reject the request. This doesn't give exactly the same results, but it achieves the general goal of rate-limiting requests from over-active clients while storing only two values per IP address.
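For example, a rough Java sketch of that windowed test (the names are illustrative, not a standard API); each IP maps to just a window index and two counters.

    import java.util.concurrent.ConcurrentHashMap;

    public class WindowRateLimiter {

        private static final class Counters {
            long windowIndex;
            int current;
            int previous;
        }

        private final long windowMillis;   // M, expressed in milliseconds
        private final int limit;           // N
        private final ConcurrentHashMap<String, Counters> byIp = new ConcurrentHashMap<>();

        public WindowRateLimiter(long windowMillis, int limit) {
            this.windowMillis = windowMillis;
            this.limit = limit;
        }

        /** Decide whether a request from this IP at this time should be allowed. */
        public boolean allow(String ip, long nowMillis) {
            long window = nowMillis / windowMillis;
            Counters c = byIp.computeIfAbsent(ip, k -> new Counters());
            synchronized (c) {
                if (window > c.windowIndex + 1) {          // both stored windows are stale
                    c.previous = 0;
                    c.current = 0;
                    c.windowIndex = window;
                } else if (window == c.windowIndex + 1) {  // roll over into the next window
                    c.previous = c.current;
                    c.current = 0;
                    c.windowIndex = window;
                }
                if (c.previous + c.current >= limit) {     // too many requests in the last two windows
                    return false;
                }
                c.current++;
                return true;
            }
        }
    }

Calling allow(ip, System.currentTimeMillis()) once per request approximates "at most N requests in the last M minutes" without any timer thread and without per-visit records.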

Searching strategy to efficiently load data from service or server?

This is not really a language-specific question, it's more of a pattern-related question, but I'd like to tag it with some popular languages that I understand.
I don't have much experience with efficiently loading data in combination with searching it (especially in a mobile environment).
The strategy I've used before is to load everything into local memory and search there (for example using LINQ in C#).
Another strategy is to reload the data every time a new search is executed. That is of course not efficient, and we may also need more complicated logic to sync the newly loaded data with the data already in local memory.
The last strategy I can think of is the hardest to implement: lazily load the data as part of the search execution. When a search runs, its results are cached locally, and the search looks in local memory first before fetching new results from the service/server, so the result of each search is a combination of the local search and the server search. The purpose is to reduce the amount of data reloaded from the server every time a search is run.
Here is what I can think of to implement this kind of strategy:
When a search is run, look in local memory first. This step produces the local result.
Before sending the request to search on the server side, we need to somehow pass along what is already in the local result so those items can be excluded from the server-side search. So the search method may take an extra argument containing all the item IDs found in the first step.
With that request, the server can exclude the already-found items and return only new items to the client.
The last step is to merge the two results, local and server, into the final search result shown to the user in the UI.
I'm not sure this is the right approach, but what bothers me is step 2: we need to send the list of item IDs found in step 1 to the server, so what if there are hundreds or thousands of such IDs? Sending them to the server may not be very efficient, and the query that excludes such a large number of items may not be efficient either (even with direct SQL or LINQ). I'm still stuck at this point.
Finally, if you have any better idea, ideally one that has been implemented in a production project, please share it. I don't need concrete example code, just the idea or the steps to implement it.
Too long for a comment....
Concerning step 2, you can run into several problems:
Amount of data
Over time, you may accumulate so much data that even the set of its IDs gets bigger than a normal server answer. In the end, you could need to cache not only previous server answers on the client, but also the client's state on the server. What you're doing is a kind of synchronization, so look at rsync for inspiration; it's an old but smart Unix tool. git push might also be inspiring.
Basically, by organizing your IDs into a tree, you can easily synchronize the information about what the client already knows between the server and the client. The price may be increased latency, as multiple round trips may be needed.
Using the knowledge
It's quite possible that excluding the already-known objects from the SQL result could be more expensive than not excluding them, especially when you can't easily determine whether a to-be-excluded object would be part of the full answer. Still, you can save bandwidth by post-filtering the data.
Being up to date
If your data change or get deleted, you may find your client keeping obsolete data. Having the client subscribe to relevant changes is one possibility; associating a (logical) timestamp with your IDs is another.
Summary
It can get pretty complicated, and you should measure before you even try. You may find that the problem itself is hard enough, that achieving these savings is even harder, and that the gain is limited. You know the root of all evil, right?
I would approach the problem by treating local and remote as two different data sources (a rough sketch follows these steps):
When a search is triggered, it is initiated against both data sources (local, in-memory, and the server).
Most likely the local search will return results first, so display them to the user.
When results come back from the server, append the non-duplicate ones.
Optional: in case the server data has changed and some results were removed or modified, update/remove the local results and refresh the view.
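For illustration, a plain-Java sketch of that flow (Item, the CompletableFuture for the server call, and the render callback are hypothetical placeholders): local results are rendered immediately, and the server's results are merged in by ID when they arrive.

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.CompletableFuture;
    import java.util.function.Consumer;

    public class TwoSourceSearch {

        /** Minimal shape of a search result; the real type would carry the displayed fields. */
        public interface Item {
            String id();
        }

        public void search(List<? extends Item> localResults,
                           CompletableFuture<List<? extends Item>> remoteResults,
                           Consumer<List<Item>> render) {
            Map<String, Item> merged = new LinkedHashMap<>();
            for (Item item : localResults) {
                merged.put(item.id(), item);               // local results are shown first
            }
            render.accept(new ArrayList<>(merged.values()));

            remoteResults.thenAccept(serverItems -> {      // later: append non-duplicate server results
                for (Item item : serverItems) {
                    // putIfAbsent keeps the local copy; use put(...) instead if the
                    // server copy should win (the "optional" step above).
                    merged.putIfAbsent(item.id(), item);
                }
                render.accept(new ArrayList<>(merged.values()));
            });
        }
    }

In a real client the thenAccept callback would also need to hop back onto the UI thread before re-rendering.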

Understanding maximum length of an AppEngine key-name in the Java API

I am trying to figure out what the maximum length for an AppEngine key name is in the Java API.
This question has been asked in much less depth before:
How long (max characters) can a datastore entity key_name be? Is it bad to have very long key_names?
and received two conflicting answers (with the one that seems less credible to me being the accepted one...)
#ryan was able to provide links to the relevant Python API source in his answer and I've been trying to find something similar in the Java API.
But neither Key.java, nor KeyFactory.java, nor KeyTranslator.java seem to enforce any limits on the name property of a key. So, if there is a limit, it is implemented elsewhere. KeyTranslator calls com.google.storage.onestore.v3.OnestoreEntity.Path.Element.setName(), which could be the place where the limit is implemented, but unfortunately I can't find the source for this class anywhere...
Specifically, I would like to know:
Is the 500-character limit a hard limit specifically imposed on key names somewhere in the backend, or is it simply a recommendation that should be sufficient to ensure the full key string never exceeds the 1500-byte limit of a short text property (long text properties with more bytes cannot be indexed, if I understand correctly)?
If it is a hard limit:
Is it 500 characters or 500 bytes (i.e. the length after some encoding)?
Are the full 500 bytes/characters available for the name of the key or do the other key-components (kind, parent, app-id, ...) deduct from this number?
If it is a recommendation:
Is it sufficient in all cases?
What is the maximum I can use if all keys are located in the root of my application and the kind is only one letter long? In other words: Is there a formula I can use to calculate the real limit given the other key components?
Lastly, if I simply try to measure this limit by attempting to store keys of increasing length until I get some exception, will I be able to rely on the limit that I find if I only create keys with identical ancestor paths and same-length kinds in the same application? Or are there other variable-length components to a key that might get added and reduce the available key-name-length in some cases? Should it be the same for the Development and the Production Servers?
The Datastore implements all of its validation in the backend (this prevents an operation that succeeds with one client from failing with another). Datastore keys have the following restrictions:
A key can have at most 100 path elements (these are kind, name/id pairs)
Each kind can be at most 1500 bytes.
Each name can be at most 1500 bytes.
The 500-character limit has been converted into a 1500-byte limit, so places where you've seen a 500-character limit before (like #ryan's answer in the linked question) are now 1500 bytes. Strings are encoded using UTF-8.
Importantly, to answer some specifics from your question:
Are the full 500 bytes/characters available for the name of the key or do the other key-components (kind, parent, app-id, ...) deduct from this number?
No, the 1500 byte limit is per field.
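For illustration, a small helper (the class and method names are mine, not part of the API) that checks the UTF-8 byte length of a name before building a key with the standard KeyFactory; the authoritative validation still happens in the backend.

    import java.nio.charset.StandardCharsets;

    import com.google.appengine.api.datastore.Key;
    import com.google.appengine.api.datastore.KeyFactory;

    public class KeyNames {

        private static final int MAX_NAME_BYTES = 1500;   // per-field limit, measured after UTF-8 encoding

        /** Build a key, failing fast locally if the name would exceed the backend limit. */
        public static Key createChecked(String kind, String name) {
            int bytes = name.getBytes(StandardCharsets.UTF_8).length;
            if (bytes > MAX_NAME_BYTES) {
                throw new IllegalArgumentException(
                        "Key name is " + bytes + " bytes in UTF-8; the Datastore limit is " + MAX_NAME_BYTES);
            }
            return KeyFactory.createKey(kind, name);
        }
    }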

Using GWT-RPC vs RequestFactory for passing large arrays

I am building an application which retrieves data and parses it into a two-dimensional array object before sending it back to the client. The application then uses the data to create an image on HTML5 canvas. The array contains several thousand entries, and when I built the application using GWT-RPC, it worked properly, but it took too long to transfer the array to the client (several minutes).
I found this issue when searching for solutions: http://code.google.com/p/google-web-toolkit/issues/detail?id=860
The last response was from a couple of months ago, but there doesn't seem to be a conclusive answer on the best way to pass large arrays from server to client. Since deRPC is being deprecated (I have yet to actually try it), is RequestFactory the only choice? It seems like RequestFactory is meant for accessing a database rather than for performing calculations and returning large results, and I haven't found an example where a request is made for a calculation and a result is passed back. Should I create a JSON object instead of an array in my current implementation and keep RPC, or am I missing something when it comes to RequestFactory?
The issue you linked to is about the slow speed of de-serialization on the client, not the data transfer speed. You should first measure the transfer speed using Firebug or a similar tool and then subtract that time from the total time of the RPC call to find out how much is spent on de-serialization. Roughly speaking, the breakdown looks like this:
Total RPC time = time-spent-on-server + network-time-in-out + deserialization-time
You should first find out which part is the real bottleneck; if it turns out to be the data transfer speed, you'll probably need to rethink your design. See my answer to a related question.
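For illustration, a rough client-side sketch (the service interface and method name are hypothetical stand-ins for your GWT-RPC service): onSuccess only fires after de-serialization has completed, so the time measured there is the whole sum above; subtracting the transfer time from Firebug and the time logged on the server isolates the de-serialization cost.

    import com.google.gwt.core.client.GWT;
    import com.google.gwt.user.client.rpc.AsyncCallback;

    public class RpcTiming {

        /** Hypothetical async interface matching the question's two-dimensional array RPC. */
        public interface MatrixServiceAsync {
            void fetchMatrix(AsyncCallback<double[][]> callback);
        }

        public static void timeCall(MatrixServiceAsync service) {
            final long start = System.currentTimeMillis();
            service.fetchMatrix(new AsyncCallback<double[][]>() {
                @Override
                public void onSuccess(double[][] result) {
                    // Fires only after de-serialization, so this is the full
                    // server + network + deserialization total.
                    GWT.log("total RPC time: " + (System.currentTimeMillis() - start) + " ms");
                }

                @Override
                public void onFailure(Throwable caught) {
                    GWT.log("RPC call failed: " + caught.getMessage());
                }
            });
        }
    }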
EDIT:
IMO, until you have measured the above breakdown, you should put aside the question of whether JSON or another approach is right for you.

A very simple implementation of an onion router

I want to write a very simple implementation of an onion router in Java (but including Chaum mixes). A lot of the public/private key encryption seems pretty straightforward, but I'm struggling to understand how the last router would know that the final onion skin has been 'peeled'.
I was thinking of also encoding some sort of checksum, so that each router tries a decryption with its private key and, if the checksum works out, forwards the newly peeled onion to the next router.
Only this way (assuming some bits of the checksum are stripped every time a successful decryption occurs), there would be a way, by looking at the checksum, to estimate how close the onion is to full decryption. Is this a major vulnerability? Is the checksum methodology an appropriate simplification?
Irrespective of the problem you mention, it's generally good practice to include some integrity check whenever you encrypt/decrypt data. However, checksums aren't really suitable for this. Have a look at Secure Hash algorithms such as SHA-256 (there are implementations built into the standard Java cryptography framework).
Now, coming back to your original question... To each node of the onion you're going to pass an encrypted "packet", but that packet won't just include the actual data to pass on: it'll include details of the next node, your hash code, and whatever else... including whatever flag/indication says whether the next "node" is an onion router or the actual end host. Indeed, the data for the last node will have to carry some special information, namely the details of the actual end host to communicate with. In other words, the last node knows the onion has been peeled because you encode this fact in the data it ends up receiving.
Or at least, I think that's how I'd do it... ;-)
N.B. The encryption per se isn't that complicated, I don't think, but there may be one or two subtleties to be careful of. For example, in a normal single client-server conversation, one subtlety is to never encrypt the same block of data twice with the same key (or at least, that's what it boils down to; research "block modes" and "initialisation vectors" if you're not familiar with this concept). In a single client-server conversation the client and server can dictate parts of the initialisation vector. In an onion router, some other solution will have to be found (at worst, using strongly-generated random numbers generated by the client alone, I suppose).
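To make the layering concrete, here is a minimal sketch using the standard Java crypto API. It is not a faithful onion router: each hop uses a pre-shared AES key rather than the public-key (hybrid) encryption a real implementation needs, and all the names are mine. The point is only the shape of one layer (exit flag, next hop, SHA-256 of the payload, then the payload) and the fresh IV per encryption mentioned in the N.B. above.

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.SecureRandom;
    import java.util.Arrays;
    import javax.crypto.Cipher;
    import javax.crypto.KeyGenerator;
    import javax.crypto.SecretKey;
    import javax.crypto.spec.IvParameterSpec;

    public class OnionSketch {

        /** Wrap (exit flag, next hop, payload) into one encrypted layer for a given hop key. */
        static byte[] wrap(boolean exit, String nextHop, byte[] payload, SecretKey key) throws Exception {
            byte[] hop = nextHop.getBytes(StandardCharsets.UTF_8);
            byte[] hash = MessageDigest.getInstance("SHA-256").digest(payload);
            ByteBuffer plain = ByteBuffer.allocate(1 + 4 + hop.length + 32 + payload.length);
            plain.put((byte) (exit ? 1 : 0)).putInt(hop.length).put(hop).put(hash).put(payload);

            byte[] iv = new byte[16];
            new SecureRandom().nextBytes(iv);             // fresh IV per layer (see the N.B. above)
            Cipher c = Cipher.getInstance("AES/CBC/PKCS5Padding");
            c.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
            byte[] ct = c.doFinal(plain.array());

            ByteBuffer out = ByteBuffer.allocate(16 + ct.length);
            out.put(iv).put(ct);                          // ship the IV alongside the ciphertext
            return out.array();
        }

        /** What a single router does: decrypt its layer, verify the hash, read the flag. */
        static void peel(byte[] onion, SecretKey key) throws Exception {
            byte[] iv = Arrays.copyOfRange(onion, 0, 16);
            byte[] ct = Arrays.copyOfRange(onion, 16, onion.length);
            Cipher c = Cipher.getInstance("AES/CBC/PKCS5Padding");
            c.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(iv));
            ByteBuffer plain = ByteBuffer.wrap(c.doFinal(ct));

            boolean exit = plain.get() == 1;              // the encoded fact: is this the last layer?
            byte[] hop = new byte[plain.getInt()];
            plain.get(hop);
            byte[] hash = new byte[32];
            plain.get(hash);
            byte[] payload = new byte[plain.remaining()];
            plain.get(payload);

            boolean intact = MessageDigest.isEqual(hash,
                    MessageDigest.getInstance("SHA-256").digest(payload));
            System.out.println("next hop: " + new String(hop, StandardCharsets.UTF_8)
                    + ", exit: " + exit + ", integrity ok: " + intact);
            // A real router would now forward `payload` to `hop`, or deliver it if exit == true.
        }

        public static void main(String[] args) throws Exception {
            SecretKey lastHopKey = KeyGenerator.getInstance("AES").generateKey();
            byte[] onion = wrap(true, "end-host.example:80",
                    "hello".getBytes(StandardCharsets.UTF_8), lastHopKey);
            peel(onion, lastHopKey);
        }
    }

Building the full onion is then just calling wrap repeatedly, innermost (exit) layer first, once per hop key, and each router calls peel with its own key.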
You could hide the number of checksums by storing them in a cyclic array whose initial offset is chosen at random when the onion is constructed. Equivalently, you could cyclically shift that array after every decryption.
