Searching strategy to efficiently load data from service or server?


This is not really a language-specific question; it is more of a pattern question, but I have tagged it with some popular languages that I can understand.
I don't have much experience with loading data efficiently in combination with searching it (especially in a mobile environment).
The strategy I have used before is to load everything into local memory and search there (for example, using LINQ in C#).
Another strategy is to reload the data every time a new search is executed. This is of course not efficient, and we may also need to do something more complicated to sync the newly loaded data with the existing data (already loaded into local memory).
The last strategy I can think of is the hardest one to implement: lazily load the data as searches are executed. That is, when a search is executed, the returned result is cached locally. The search looks in local memory first before fetching new results from the service/server, so the result of each search is a combination of the local search and the server search. The purpose is to reduce the amount of data reloaded from the server every time a search is run.
Here is what I can think of to implement this kind of strategy:
1. When a search is run, look in local memory first. This step produces the local result.
2. Before sending the search request to the server, we need to somehow pass along what is already in the local result so that it can be excluded on the server side. So the search method may take a list of arguments containing all the item IDs found in the first step.
3. With that search request, the server can exclude the items already found and return only new items to the client.
4. The last step is to merge the two results, local and server, into the final search result before showing it to the user in the UI.
I'm not sure if this is the right approach, but what doesn't feel right to me is step 2. We need to send the list of item IDs found in step 1 to the server, so if there are hundreds or thousands of such IDs, sending them may not be very efficient. The query that excludes such a large number of items may not be efficient either (even using direct SQL or LINQ). I'm still confused at this point.
Finally, if you have a better idea, and importantly one implemented in a production project, please share it with me. I don't need any concrete example code; I just need some ideas or steps to implement.

Too long for a comment....
Concerning step 2, you know you can run into many problems:
Amount of data
Over time, you may accumulate such a huge amount of data that even the set of their IDs gets bigger than a normal server answer. In the end, you may need to cache not only previous server answers on the client, but also the client's state on the server. What you're doing is a sort of synchronization, so look at rsync for inspiration; it's an old but smart Unix tool. git push might also be inspiring.
Basically, by organizing your IDs into a tree, you can easily synchronize the information (about what the client already knows) between the server and the client. The price may be increasing latency as multiple steps may be needed.
Using the knowledge
It's quite possible that excluding the already known objects from the SQL result could be more expensive than not, especially when you can't easily determine if a to-be-excluded object would be a part of the full answer. Still, you can save bandwidth by post-filtering the data.
Being up to date
If your data change or get deleted, you may find your client keeping obsolete data. Having the client subscribe to relevant changes is one possibility; associating a (logical) timestamp with your IDs is another.
Summary
It can get pretty complicated and you should measure before you even try. You may find out that the problem itself is hard enough and that achieving these savings is even harder and the gain limited. You know the root of all evil, right?

I would approach the problem by treating local and remote as two different data sources:
When a search is triggered, it is run against both data sources (local, in memory, and the server).
Most likely the local search will return results first, so display them to the user.
When results are returned from the server, append the non-duplicate ones.
Optional: in case server data has changed and some results were removed or changed, update or remove the local results and update the view.
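The merge step described above can be sketched roughly as follows. This is only an illustration, not a production design: the `Item` type, its `id` and `name` fields, and the `SearchMerger` class are hypothetical names, and the sketch assumes that when both sources return the same ID the server copy is the fresher one. Handling items deleted on the server would need extra information from the server and is not covered here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical item type; only the ID matters for de-duplication.
final class Item {
    final String id;
    final String name;

    Item(String id, String name) {
        this.id = id;
        this.name = name;
    }
}

final class SearchMerger {
    // Keeps local results in their original order, replaces any item the server
    // also returned (the server copy is assumed to be fresher), and appends
    // server-only items at the end.
    static List<Item> merge(List<Item> local, List<Item> remote) {
        Map<String, Item> byId = new LinkedHashMap<>();
        for (Item item : local) {
            byId.put(item.id, item);
        }
        for (Item item : remote) {
            byId.put(item.id, item);
        }
        return new ArrayList<>(byId.values());
    }
}
```

Because the local results keep their positions, rows already shown to the user do not jump around when the server response arrives; only new rows are appended.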

Related

I want to preserve my data during service restart, but my data is not in simple variable name-value or table format. How should I go about this?

I want to preserve data across service restarts. The data is an ArrayList of ArrayLists of integers, plus some other variables.
Since it is about 40-60 MB, I don't want it to be regenerated each time the service restarts (it takes a lot of time); I want to generate the data once and maybe copy it for the next service restart.
How can this be done?
Before suggesting that I write the data to a file, please consider how I would put a data structure similar to a multidimensional array (3D or above) into a file, and that reading it back will likely take significant time too.

You can try writing your data after generation to a file. Then on next service restart, you can simply read that from the file.

If you need persistent data, then put it into a database
https://developer.android.com/guide/topics/data/data-storage
or try some object database like http://objectbox.io/

So you're afraid that reading from the file would take a long time due to its size and the number and size of the rows (the inner arrays).
I think it might be worth stopping for a minute to ask yourself whether you need all this data at once. Maybe you only need a portion of it at any given time, and there are scenarios in which you don't use some (or maybe most) of the data? If this is likely, I would suggest that you compute the data on demand, when required, and only keep an in-memory cache for future demand in the current session.
Otherwise, if you do need all the data at a given time, you have a trade-off between size on disk and processing time. You can shrink the data using a compression algorithm, but at the expense of processing time. On the other hand, you can just serialize your data object and save it to disk as is: less time, more disk space.
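As a rough illustration of the "save it to disk as is" option, here is a minimal sketch that writes a list of integer lists in a simple length-prefixed binary layout and reads it back. The class name `NestedListStore` and the file layout are assumptions made for this example, not something from the question; standard Java serialization with `ObjectOutputStream` would be another way to do the same thing.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class NestedListStore {
    // Layout: row count, then for each row its length followed by its values.
    static void save(List<? extends List<Integer>> data, Path file) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            out.writeInt(data.size());
            for (List<Integer> row : data) {
                out.writeInt(row.size());
                for (int value : row) {
                    out.writeInt(value);
                }
            }
        }
    }

    // Reads the layout written by save() back into nested ArrayLists.
    static List<List<Integer>> load(Path file) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            int rows = in.readInt();
            List<List<Integer>> data = new ArrayList<>(rows);
            for (int r = 0; r < rows; r++) {
                int length = in.readInt();
                List<Integer> row = new ArrayList<>(length);
                for (int i = 0; i < length; i++) {
                    row.add(in.readInt());
                }
                data.add(row);
            }
            return data;
        }
    }
}
```

The same length-prefix idea nests to deeper structures (a 3D structure is a list of these). Whether reading the file is fast enough compared with regenerating the data is something to measure on the target device.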
Another solution for your scenario could be to just use a DB and a cursor (Room on top of SQLite). I don't know exactly what you're trying to do, but your arrays can easily be modeled in a DB. Model a single row as you'd like and add the outer index of the array to that model. Then save the models into the DB, potentially making the outer index field the primary key of the DB.
Regardless of the above, consider whether you really need this data persisted on the client; maybe you can store it on the server side? If so, there are other storage and access solutions that are not available on the Android client side.

Thank you all for answering this question.
This is what I have finally settled for:
Instead of using the structure as part of the app, I made this into a
tool, which will prepare data to be used with the main app. In doing
so, it also stopped the concern regarding service restart.
This tool will first read all the strings from input file(s).
Then put all of them into the structure one at a time. (This is
the part I was having doubts about, and asked the question about.
Since all the data is in the structure in memory here, as soon as the
program terminates, this structured data is unusable.)
Now, I prepared another structure for putting this data into a file,
and put all this data into a file so that I do not need to read all
the input files again and again, but only a few lines.
Then I thought, why spend time "read"ing files while I can hard code
it into my app. So, as final step of this preprocessing tool, I made
it into a class which has switch(input){case X: return Y}.
Now I will just have to put this class into the app I wanted to make.
I know this all sounds very abstract, even stretching the concept of abstract. If you want to know the details, please let me know. I am also including a link to my "tool". Please visit it and let me know if there would have been a better way.
P.S. There could still be errors in this tool; if you find any, let me know so I can fix them.
P.P.S.
link: Kompressor Tool

What action should a client application take after executing a command?

Background
This question is best illustrated using an example. Say I have a client application (e.g. desktop application, mobile app, etc.) that consumes information from a web service. One of the screens has a list of products that are queried from the web service when the client application starts up and are bound to the UI element. Now, the user creates a new product. This causes the client application to send a command to the web service to add that product to a database.
Question
In the client application, what should happen after the command is issued and is successful? Do you:
1. Query the full product list from the service and refresh the entire product list in the client application?
2. Query just the two newly added products and add them to the product list?
3. Don't query, and instead just use the information available in the client application to create the new products in the GUI, and then add them to the list?
The same questions apply to update too. If you update a product, do you get confirmation of a successful update on the service, and then just let the GUI update the product without further requests to the service?
Edit - Additional details added
From initial feedback, the takeaway appears to be to go with the simplest approach unless it:
Leads to performance concerns
Negatively impacts user experience
There is a major/significant portion of my application where the main way to interact with the application is to drag grid records between a number of different grids. For example, dragging a product onto another grid would create a new order, which would need to be sent to the service. Some of these grids are more complex than your standard grid. Records can be grouped, and each group can be collapsed/expanded (see here). In this case, while the grid can be refreshed from the service very quickly, this would probably lead to usability concerns. When a grid is refreshed with all new data, if the user had any groups expanded/collapsed, this would be lost.
So, while most grids in my application could probably just all be refreshed at once, the more complex ones will need to be updated more carefully. I would think this would lend to option 1 or 2 (at least for creating new records). One thought I had was that the client application could create GUIDs for new records to be sent with the application. That way, no follow-up query would need to be made to the service, as the client application would already have the unique ID. Then, the client application would just wait for a successful response from the service prior to showing the user the new record.
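The client-generated GUID idea could look roughly like the sketch below. `Product`, `ProductService` and `ProductListModel` are hypothetical names standing in for the real client model and web service call; the only library call relied on is `java.util.UUID.randomUUID()`. The record is added to the visible list only after the service reports success, as described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

final class Product {
    final UUID id;
    final String name;

    Product(UUID id, String name) {
        this.id = id;
        this.name = name;
    }
}

// Hypothetical stand-in for the web service command; returns true on success.
interface ProductService {
    boolean create(Product product);
}

final class ProductListModel {
    private final List<Product> visible = new ArrayList<>();
    private final ProductService service;

    ProductListModel(ProductService service) {
        this.service = service;
    }

    // The ID is generated on the client, so no follow-up query is needed
    // to learn it once the service has accepted the command.
    Product add(String name) {
        Product product = new Product(UUID.randomUUID(), name);
        if (!service.create(product)) {
            throw new IllegalStateException("service rejected product: " + name);
        }
        visible.add(product);
        return product;
    }

    List<Product> visible() {
        return visible;
    }
}
```

Since only one row is appended, the grouping and expand/collapse state of the rest of the grid is left untouched.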

Get the whole list
I guess it depends how costly the request/response are. If possible and efficient, I would always choose your first option (get the whole list) until there is a performance concern.
As the saying goes:
The First Rule of Program Optimization: Don't do it.
The Second Rule of Program Optimization – For experts only: Don't do it yet.
There are simply fewer scenarios to cover, less code to write, and less code to maintain, since you'll need the "get the whole list" service no matter what.
It also returns the "most up to date list of products" in case another client added products simultaneously.
Only pros, until there is a performance concern, in my opinion. These last 3 words would imply that this question will only lead to opinions and should be closed...

I don't think there's any definitive right answer; these kinds of questions need to be thought of on a case by case basis. #3 by itself is often not an option - for example, if you need the client to have a database-generated field like an ID, it's gotta get from point A to point B somehow. You also need to think about how you're exposing any errors to your user, because it's a terrible experience if you make it appear that everything succeeded, but you actually had an error and the product didn't really save.
Beyond that, I'd look at usability as my next criterion. What's the experience like for your users if you refresh the list versus adding just a couple of products? Is there a significant difference? A lot comes down to your specific application, and also the workflow being done. If adding products is something that is the main part of someone's job, where they may spend hours a day doing this, shaving even a second off the time is a real win for your users, while if it's an uncommon workflow that people do from time to time, the performance expectations are somewhat lower.
And last I'd look at code maintenance and complexity. If two paths are giving relatively similar experiences, pick the one that's easier to build and maintain.
There are other options, too. You can go with a hybrid approach - for example, maybe on the client you add the data to the product list immediately (perhaps showing some kind of "saving" indicator), while also asynchronously querying the database so you can refresh the product listing and report any errors. Such approaches tend to be the most complex, but you might go down that route if usability demands it.

Efficient recall of a delta-based data log in Java

My application has a number of objects in an internal list, and I need to be able to log them (e.g. once a second) and later recreate the state of the list at any time by querying the log file.
The current implementation logs the entire list every second, which is great for retrieval because I can simply load the log file, scan through it until I reach the desired time, and load the stored list.
However, the majority of my objects (~90%) rarely change, so it is wasteful in terms of disk space to continually log them at a set interval.
I am considering switching to a "delta" based log where only the changed objects are logged every second. Unfortunately this means it becomes hard to find the true state of the list at any one recorded time, without "playing back" the entire file to catch those objects that had not changed for a while before the desired recall time.
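To make the playback cost concrete, here is a minimal in-memory sketch of such a delta log. The `DeltaLog` class, the use of string IDs and string states, and the `TreeMap` keyed by time in seconds are all assumptions for illustration; a real log would live on disk and would also need a way to record deletions, which this sketch ignores.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

final class DeltaLog {
    // time in seconds -> objects that changed at that time (id -> new state)
    private final TreeMap<Long, Map<String, String>> deltas = new TreeMap<>();

    void record(long time, Map<String, String> changed) {
        deltas.put(time, new HashMap<>(changed));
    }

    // Rebuilds the full state at the given time by replaying every delta
    // from the start of the log up to and including that time.
    Map<String, String> stateAt(long time) {
        Map<String, String> state = new HashMap<>();
        for (Map<String, String> delta : deltas.headMap(time, true).values()) {
            state.putAll(delta);
        }
        return state;
    }
}
```

The loop in `stateAt` is exactly the "play back everything" cost: it grows with the length of the log before the requested time, not with the number of objects.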
An alternative could be to store (every second) both the changed objects and the last-changed time for each unchanged object, so that a log reader would know where to look for them. I'm worried I'm reinventing the wheel here though — this must be a problem that has been encountered before.
Existing comparable techniques, I suppose, are those used in version control systems, but I'd like a native object-aware Java solution if possible — running git commit on a binary file once a second seems like it's abusing the intention of a VCS!
So, is there a standard way of solving this problem that I should be aware of? If not, any pitfalls that I might encounter when developing my own solution?

Cache update with db changes

We have a Java-based product which keeps a Calculation object in the database as a blob. At runtime we keep it in memory for fast access. There is another process which updates this Calculation object in the database at regular intervals. What would be the best strategy so that, when this object gets updated in the database, the cache removes the stored object and fetches it again from the database?
I would prefer not to use a caching framework unless it is really necessary.
I would appreciate any response on this.

It is very difficult to give you a good answer without any knowledge of your system architecture, design constraints, IT strategy, etc.
Personally, I would use the Messaging pattern to solve this issue. A few advantages of that pattern are as follows:
Your system components (Calculation process, update process) can be loosely coupled.
Depending on the implementation of the Messaging pattern, you can "connect" many Calculation processes (scaling out) and many update processes (with a master-slave approach).
However, implementing the Messaging pattern might be a very challenging task, and I would recommend using one of the existing frameworks or products.
I hope that will help at least a bit.

I did some work similar to your scenario before; generally there are two ways.
One: the cache holder polls the database regularly, fetches the data it needs and keeps it in memory. The data can be stored in a HashMap or some other collection. This approach is simple and easy to implement, with no extra framework or library needed. But users will have to put up with stale data from time to time. Besides, polling puts a lot of pressure on the DB if the number of pollers is large or the query is not fast enough. However, it is generally not a bad choice if your real-time requirements are not that high and the scale of your system is relatively small.
The other approach is that the cache holder subscribes to notifications from the data updater and updates its data after being notified. It provides a better user experience, but it brings more complexity to your system because you have to get some messaging infrastructure, such as JMS, involved. Developing and tuning it is more time-consuming.

I know I am quite late in responding to this, but it might help somebody searching for the same issue.
Here was my problem: I was storing requestPerMinute information in a HashMap in a Java filter, which gets loaded at the start of the application. The problem is that if somebody updates the DB with new information, the map doesn't know about it.
Solution: I added a variable updateTime to my Java filter, which stores when the HashMap was last updated. On every request, the filter checks whether more than 24 hours have passed since then; if so, it reloads the HashMap from the database. So every 24 hours it just refreshes the whole HashMap.
My use case did not require real-time updates, so this fits it.
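A minimal sketch of this time-based refresh, assuming a loader function that reads the whole table from the database. `RefreshingCache`, the `loader` supplier and the injectable `clock` are hypothetical names chosen for the example (the clock is a parameter only so the behaviour can be tested without waiting 24 hours).

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

final class RefreshingCache<K, V> {
    private final Supplier<Map<K, V>> loader; // assumed to read everything from the DB
    private final LongSupplier clock;         // current time in milliseconds
    private final long maxAgeMillis;
    private Map<K, V> data = new HashMap<>();
    private long loadedAt;
    private boolean loaded;

    RefreshingCache(Supplier<Map<K, V>> loader, Duration maxAge, LongSupplier clock) {
        this.loader = loader;
        this.maxAgeMillis = maxAge.toMillis();
        this.clock = clock;
    }

    // Reloads the whole map on first use and whenever it is older than maxAge.
    synchronized V get(K key) {
        long now = clock.getAsLong();
        if (!loaded || now - loadedAt >= maxAgeMillis) {
            data = new HashMap<>(loader.get());
            loadedAt = now;
            loaded = true;
        }
        return data.get(key);
    }
}
```

In the filter it might be created as `new RefreshingCache<>(this::loadLimitsFromDb, Duration.ofHours(24), System::currentTimeMillis)`, where `loadLimitsFromDb` is a placeholder for the existing query. Note that the request that triggers the refresh pays for the database read.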

Which NoSQL database tool is better to choose for my application?

I am planning to develop an application for connecting with friends of friends of friends. It may look like Facebook or Twitter, but initially I am planning to implement it to learn more about NoSQL databases.
There are a number of NoSQL database tools. I have gone through many database types: document stores, key-value stores, column stores and graph databases. I finally came up with two database tools, Cassandra and Neo4j. Is it right to choose either of these? If not, please correct me and give me your opinions.
One more thing: the language binding I have chosen is Java.
My question is:
Which database tool suits my application?
I look forward to your opinions. Thanks for your time.

Tim, you really should have posted your question separately, rather than as an answer to the OP, which it wasn't.
But to answer, first, go read Ben Black's slides at http://www.slideshare.net/benjaminblack/introduction-to-cassandra-replication-and-consistency.
Done? Okay, now for the specific questions:
"How would differences in [replica] data-state be reconciled on a subsequent read?"
The highest timestamp wins.
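The rule itself is simple enough to show in a few lines. The sketch below is only an illustration of "highest timestamp wins" over a list of replica responses; it is not Cassandra's code or API, and `VersionedValue` and `LastWriteWins` are made-up names.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// One replica's answer for a column: the value plus the client-supplied timestamp.
final class VersionedValue {
    final String value;
    final long timestamp;

    VersionedValue(String value, long timestamp) {
        this.value = value;
        this.timestamp = timestamp;
    }
}

final class LastWriteWins {
    // Picks the response carrying the highest timestamp; empty if no replica answered.
    static Optional<VersionedValue> reconcile(List<VersionedValue> replicas) {
        return replicas.stream()
                .max(Comparator.comparingLong((VersionedValue v) -> v.timestamp));
    }
}
```

Because the timestamps come from the writers, a writer with a fast clock can win over a write that actually happened later.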
"Do all zones work off the same system clock?"
Timestamps are provided by clients (i.e., your app server). They should be synchronized with e.g. ntpd (which is good practice anyway), but high precision is not required because if ordering matters you should be avoiding conflict either by using unique column names or by using external locking.
For example: if you have a list of users following you in a Twitter clone, you should give each follower its own column and there will be no way to lose data no matter how out of sync the clocks are.
If you have an admin tool for your website and two admins upload a new favicon "simultaneously," one update is going to win and it doesn't really matter which. Here, you do want your clocks synchronized but "within a few ms" is close enough.
If you are managing user registration and you want to allow creating account "jbellis" only if it doesn't already exist, you need a lock manager no matter how closely synchronized your clocks are.
"Would stale data get returned?"
A node (a better unit to think about than a "zone") will not have data it missed during its downtime until it is sent that data by read repair, hinted handoff, or anti-entropy repair. In the meantime, it will reply to read requests with stale data; if you use a high enough consistencylevel read requests will wait for enough other replies to make sure you always see the most recent version anyway, which may mean not being able to fulfil requests if enough other replicas are down.
Otherwise, a low consistencylevel (e.g. ONE) implicitly means "I understand that the higher availability and lower latency I get with this lower consistencylevel means I'm okay with seeing stale data temporarily after downtime."

I'm not sure I understand all of the implications of the Cassandra consistency model with respect to data-agreement across multiple availability zones.
Given multiple zones, and given that the coordinator node in Cassandra has used a consistency level that does not require all zones to report back, but only a quorum, how would differences in zone data-state be reconciled on a subsequent read?
Do all zones work off the same system clock? Or does each zone have its own clock? If they don't work off the same clock, how are they synchronized so that timestamps can be compared during the "healing" process when differences are reconciled?
Let's say that a zone that does have accurate, up-to-date data is now offline, and a zone that was offline during a previous write (so it didn't get updated and contains stale data) is now back online. Would stale data get returned? Would the coordinator have any way to know the data were stale?

If you don't need to scale in the short term I'd go with Neo4j because it is designed to store networks like the one you described. (If you eventually do need to scale, maybe you can throw Gizzard in front of it or something. Good luck!)

Have you looked at the Riak database? It has the same background as Cassandra, but you don't need to care about timestamp synchronization (it uses a different method for resolving data status).
My first application was built on a Cassandra database. But I am now trying Riak because it is more suitable. It is not only the difference in keys (keys - values / super column - keys - values); it goes further with the document store feature.
It has a method for creating complex queries using MapReduce. Cassandra does have this option using Hadoop, but it sounds difficult.
Furthermore, it uses a well-known and well-defined access protocol, HTTP/S, so it's easy to manage the server when you have a lot of traffic.
The only bad point is that it is slower than Cassandra. But usually you will read records more than you write them (and Cassandra is optimised for writes, not reads), so the end result should be OK.
