How to sync large lists between client and server - java

I'd like to sync a large list of items between the client and the server. Since the list is pretty big I can't sync it in a single request, so how can I ensure the list gets synced with a reasonable number of calls to the synchronization service?
For example, I want to sync a list with 100,000 items, so I make a web service with the following signature:
getItems(int offset, int quantity): Item[]
The problem comes when the list is modified between calls. For example:
getItems(0,100) : Return items (in the original list) [0,100)
getItems(100,100): Return items (in the original list) [100,200)
##### before the next call the items 0-100 are removed ####
getItems(200,100): Return items (in the original list) [300,400)
So the items [200,300) are never retrieved. (Duplicate items can also be retrieved if items are added instead of removed.)
How can I ensure a correct sync of this list?

From time to time, the service should save immutable snapshots. The interface should then be getItems(long snapshotNumber, int offset, int quantity).
To save time, space, and traffic, not every modification of the list should produce a snapshot; instead, every modification should produce a log message (e.g. add items, remove a range of items), and those log messages should be sent to the client instead of full snapshots. The interface can be getModification(long snapshotNumber, int modificationNumber): Modification.
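A minimal sketch of what such a contract could look like (all names here, including the Modification shape, are illustrative rather than an existing API; Item is the item type from the question):

// Hypothetical sync contract based on the snapshot + modification-log idea above.
public interface ListSyncService {

    // Number of the latest immutable snapshot the server has saved.
    long getLatestSnapshotNumber();

    // Page through a fixed snapshot; the result never changes for a given snapshotNumber.
    Item[] getItems(long snapshotNumber, int offset, int quantity);

    // Replay the changes applied on top of a snapshot, one log entry at a time.
    Modification getModification(long snapshotNumber, int modificationNumber);
}

// Illustrative log entry: enough for the client to re-apply an add or a range removal.
class Modification {
    enum Type { ADD_ITEMS, REMOVE_RANGE }
    Type type;
    int index;          // position in the list the change applies to
    Item[] addedItems;  // set when type == ADD_ITEMS
    int removedCount;   // set when type == REMOVE_RANGE
}

The client first pages through one fixed snapshot, then keeps calling getModification until it has caught up; because the snapshot never changes, the offset-shifting problem from the question cannot occur.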

Can you make the list ordered on some parameter on the server side? For example, a real-world use case for this scenario is showing records in a table on a UI. The number of records on the server side can be huge, so you wouldn't want to get the whole list at once; instead you fetch a batch on each scroll the user makes.
In this case, if the list is ordered, you get a lot of things for free. Your API becomes getItems(long lastRecordId, int quantity), where lastRecordId is a unique key identifying a particular record. You use this key to calculate the offset (on the server side), retrieve the next batch from that position, and return the recordId of the last record to the client, which the client uses in its next API call.
You don't have to maintain snapshots, and no duplicate records will be retrieved. The removal/insertion scenarios you mention don't occur in this case. But at some point you would have to discard the copy the client has and start syncing all over again if you want to track additions and removals in the data the client has already seen.
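A rough sketch of that keyset-style paging, assuming the items are kept in a structure ordered by a unique, stable id (the class and the in-memory map are illustrative; against a database the same lookup would be a WHERE id > :lastRecordId ORDER BY id LIMIT :quantity query):

import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Illustrative server-side paging keyed on the last record id the client has seen.
class ItemPagingService {
    // Items ordered by their unique, stable id.
    private final NavigableMap<Long, Item> itemsById = new ConcurrentSkipListMap<>();

    // Return up to 'quantity' items strictly after lastRecordId
    // (pass Long.MIN_VALUE to get the first page).
    List<Item> getItems(long lastRecordId, int quantity) {
        List<Item> page = new ArrayList<>(quantity);
        for (Item item : itemsById.tailMap(lastRecordId, false).values()) {
            page.add(item);
            if (page.size() == quantity) {
                break;
            }
        }
        return page;
    }
}

Because the cursor is a key rather than a positional offset, rows inserted or removed before the cursor do not shift later pages, which is exactly what goes wrong with getItems(offset, quantity).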

Related

Server/Client live updates

I have some problems understanding the best concept for my problem.
My architecture is pretty basic. I have a backend with data that can be updated and clients that load that data with some filters.
The backend holds the data in an EhCache.
The data model is pretty basic, for example:
{
id: string,
startDate: date,
endDate: date,
username: string,
group: string
}
The data can only be modified by another backend service.
When data is modified, added or deleted, a data update event is generated.
The clients are all web clients and fetch the data from the cache through a Spring Boot REST service.
For the data request, each client sends its own request settings. There are different settings, such as date and text filters. For example:
{
contentFilter: Filter,
startDateFilter: date,
endDateFilter: date
}
The backend uses these settings to filter the data from the cache and then sends the response with the filtered data.
When the cache generates an update event, every client gets notified over a websocket connection and then requests the full data again with the same request settings as before.
My problem is that many cache updates are happening, and the clients can have a lot of data to load if the full dataset is loaded every time.
For example, I have this scenario:
Full dataset in cache: 100 000 rows
Update of rows in cache: 5-10 random rows every 1-5 seconds
Client1 dataset with request filter: 5000 rows
Client2 dataset with request filter: 50 rows
Now, every time a client receives an update notification, it loads its complete dataset (5000 rows), and that happens every 1-5 seconds. If the updates only ever touch the same rows, and those rows aren't loaded by the client because of its filter settings, then the client is loading the data unnecessarily.
I am not sure what would be the best solution to reduce the client updates and increase the performance.
My first thought was to just send the updated rows directly to the clients over the websocket connection.
But for that I would have to know whether a client "needs" the updated row. If the updates happen on rows that a client doesn't load because of its filter settings, then I would spam that client with unnecessary updates.
I could add a check on the client side whether the id of the updated row is in the loaded dataset, but then I would need a separate check for rows that are added to the cache rather than updated.
But I am not sure if that is best practice, and unfortunately I cannot find many resources about this topic.
The most efficient things are always the most work, sadly.
I won't claim to be an expert at this kind of thing - on either the implementation(s) available or even the best practices - but I can give some food for thought at least, which may or may not be of help.
My first choice: your first thought.
You have the problem of knowing if the updated item is relevant to the client, due to the filters.
Save the filters for the client whenever they request the full data set!
A row gets updated; check through all the client filters to see whether it is relevant to any of them, and push it out to the clients it is relevant to.
The effort for maintaining that filter cache is minimal (update whenever they change their filters), and you'll also be sending down minimal data to the clients. You also won't be iterating over a large dataset multiple times, just the smaller client set and only for the few rows that have been updated.
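A rough sketch of option 1, assuming you keep each client's last-known filter next to its websocket session (Filter, Row, matches() and the push call are placeholders for whatever you already have, not an existing API):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative: remember each client's filter and push only the row updates that match it.
class UpdatePusher {
    private final Map<String, Filter> filtersByClientId = new ConcurrentHashMap<>();

    // Called whenever a client (re)loads data with its filter settings.
    void rememberFilter(String clientId, Filter filter) {
        filtersByClientId.put(clientId, filter);
    }

    // Called for every cache update event.
    void onRowUpdated(Row updatedRow) {
        filtersByClientId.forEach((clientId, filter) -> {
            if (filter.matches(updatedRow)) {
                pushToClient(clientId, updatedRow); // e.g. over the existing websocket session
            }
        });
    }

    private void pushToClient(String clientId, Row row) {
        // websocket send, omitted here
    }
}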
Another option:
If you don't go ahead with option 1, option 2 might be to group updates - assuming you have the luxury of not needing immediate, real-time updates.
Instead of telling the clients about every data update, only tell them every x seconds that there might be data waiting for them (might be, you little tease).
I was going to add other options but, to be honest, I don't see why you'd worry about much beyond option 1, maybe with an option 2 addition to reduce traffic if that's an issue.
'Best practice'-wise, sending down multiple FULL datasets to multiple clients multiple times a second is certainly not it.
Sending only the data relevant to each client is a much better solution, and if you can further reduce how much the client even needs to send (i.e. only its filter updates, rather than re-sending something you could already have saved), that's an added bonus.
Edit:
Ah, a stateless server - though it's not really stateless. You're using websockets, so the server has some kind of state for those connections. It's already stateful, so option 1 doesn't really break anything.
If it's to be completely stateless, then you also can't store the updated rows of data, so you can't return those individually. You're back to what you're doing now: a full round trip with a full data read and serve.
Option 3, though, if you're semi-stateless (you don't want to add any metadata to those socket connections) but do hold on to updated rows: timestamp them and have the clients send the time of their last update along with their filters. You can then return only the rows updated since their last update, using the filters they provide - the timestamp simply becomes another filter (arguably that keeps the server stateless too).
Either way, limiting the updated data back down to the client is the main goal if for nothing else than saving data transfer.
Edit 2:
Sounds like you may need to send two bits of data down (or three if you want to split things even further - makes life easier client-side, I guess):
{
newItems: [{...}, ...],
updatedItems: [{...}, ...],
deletedIds: [1,2...]
}
Yes, when a client's request for an update comes in, you'll have to check through your updated items to see whether any were deleted and are relevant to that client's filters, but you can send down a minimal list of ids, rather than whole rows, which the client can then remove.
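As a plain Java DTO, that delta payload could look something like this (field names mirror the JSON above; Row stands in for your record type):

import java.util.List;

// Illustrative delta response: only what changed since the client's last fetch.
class DeltaResponse {
    List<Row> newItems;      // rows added since the last fetch that match the client's filter
    List<Row> updatedItems;  // rows changed since the last fetch that match the client's filter
    List<String> deletedIds; // ids are enough for deletions; the client just removes them locally
}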

Publish pre-trigger values from an OPC UA server to an OPC UA client

Can anyone help me out with the following use case for OPC UA: reading triggered measurements from an OPC UA server together with additional measurement values from the period just before the trigger condition occurred? This pre-trigger period would be configurable, let's say half a second. It allows seeing what happened just before the trigger of interest occurred.
How would I go about this? As soon as the trigger happens, the results are made available to the OPC UA clients, and the client should then act on the same trigger to retrieve the preceding historical measurement values from the period before the trigger. I hope there is a smarter way, so that the client can remain stateless. There is no requirement that the data arrive immediately, so the pre-trigger values can be sent first, before the post-trigger values are sent to the client.
Given that the data must be buffered anyhow to make this possible, could this work?:
My back-end measurement data provider within the OPC UA server could just start returning data values back to the client starting with the values from half a second earlier (the configured pre-trigger period), i.e. not returning the current measurement values but starting with the pre-trigger ones.
I have seen in the Milo server example that the ExampleNameSpace uses the AttributeValueDelegate construct for dynamic nodes. This seems to allow returning data values one at a time, including a timestamp. I don't have the proper test tools to see whether it works if I start returning relatively old values.
The other thing is how this relates to monitored items and sampling intervals. If the client, for example, requested a sampling rate of 10 samples per second, would it then check that the returned monitored item values are actually within that range? I.e. will older values arriving late be discarded, or just pulled in by the client?
No matter what approach you take it's going to require the client to be aware of what you're doing here, so here's an approach that might work:
Create two nodes in the server: a scalar that holds the current measurement value, and an array that holds the last N measurement values.
In the client, create a monitored item for the scalar value with MonitoringMode.Reporting, and create a monitored item for the array value with MonitoringMode.Sampling. Then use the SetTriggering service to create a triggering link between the scalar item and the array item.
What this will result in is you can freely update the array value in the server without the value being reported as changed, but when you update the scalar value and the change is reported, the current value of the array will get reported too.
As a side note, I'd avoid relying on the AttributeDelegate mechanism for new development. It's going to be deprecated and replaced with something else once development on 0.3 starts.

Checking if a Set of items exist in database quickly

I have an external service which I'm grabbing a list of items from, and I'm persisting locally a relationship between those items and a user. I feed that external service a name and get back the items associated with that name. I am choosing to persist them locally because I'd like to keep my own attributes about those external items once they've been discovered by my application. The items themselves are pretty static objects, but the total number of them is unknown to me, and the only time I learn about new ones is when a new user has an association with them on the external service.
When I get a list of them back from the external service, I want to check first whether they exist in my database and use the existing object if so; if not, I need to add them so I can set my own attributes and keep the association to my user.
Right now I have the following (pseudocode, since it's broken into service layers etc):
Set<ExternalItem> items = externalService.getItemsForUser(user.name);
for (ExternalItem externalItem : items) {
    Item dbItem = sessionFactory.getCurrentSession().get(Item.class, externalItem.getId());
    if (dbItem == null) {
        // Not in the database, create it.
        dbItem = mapToItem(externalItem);
    }
    user.addItem(dbItem);
}
sessionFactory.getCurrentSession().save(user); // Saves the associated Items also.
This operation takes around 16 seconds for approximately 500 external items. The remote call accounts for around 1 second of that, and the save is negligible too. The drain I'm noticing comes from the numerous per-item session.get(Item.class, ...) calls I'm doing.
Is there a better way to check for an existing Item in my database than this, given that I get a Set back from my external service?
Note: the external item's id is reliably the same as mine, and a single id will always represent the same external item.
I would definitely recommend a native query, as recommended in the comments.
I would not bother to chunk them, though, given the numbers you are talking about. Postgres should be able to handle an IN clause with 500 elements with no problems. I have had programmatically generated queries with many more items than that which performed fine.
This way you also have only one round trip, which, assuming the proper indexes are in place, really should complete in sub-second time.
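For example, something along these lines: collect the ids first, fetch all existing rows with a single in query, and only then create the missing ones (a Long id, the getId() accessors and mapToItem are assumed from the pseudocode above):

// Collect the external ids, then fetch all matching rows in one round trip.
Set<Long> externalIds = new HashSet<>();
for (ExternalItem externalItem : items) {
    externalIds.add(externalItem.getId());
}

@SuppressWarnings("unchecked")
List<Item> existing = sessionFactory.getCurrentSession()
        .createQuery("from Item i where i.id in (:ids)")
        .setParameterList("ids", externalIds)
        .list();

// Index the existing items by id for constant-time lookups.
Map<Long, Item> existingById = new HashMap<>();
for (Item item : existing) {
    existingById.put(item.getId(), item);
}

for (ExternalItem externalItem : items) {
    Item dbItem = existingById.get(externalItem.getId());
    if (dbItem == null) {
        dbItem = mapToItem(externalItem); // not in the database yet, create it
    }
    user.addItem(dbItem);
}
sessionFactory.getCurrentSession().save(user); // saves the associated Items also

Whether you use HQL as above or a native SQL query, the point is the same: one query carrying roughly 500 ids instead of roughly 500 individual get calls.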

Deleting multiple items from a Redis hash, based on a certain value

What is the most efficient way to delete a bunch of items from a hash, based on whether an item's value contains a specific substring? As far as I know, there is not really a way to do this in one simple block. I have to literally grab all the values of that hash into a Java list, then iterate over this list till I find what I need, then delete its key from the hash, and repeat the same procedure over and over again.
Another approach I tried was to put id references to the hash items in a separate list, so that later on, with a single call, I could grab a list of ids for the items which should be deleted. That was a bit better, but still, the Redis client I use (Jedis) does not support the deletion of multiple hash keys, so again I am left with my hands tied.
Redis does not support referential integrity, right? I mean something like: the keys stored in the Redis list are references to the items in the hash, so if I delete the list, the corresponding items in the hash would be deleted. There is nothing like that in Redis, right?
So I will have to go through this loop and delete every single item separately. I wish at least there was something like a block, where I could collect all 1000 commands and send them in one entire call, rather than 1000 separate ones.
I wish at least there was something like a block,
where I could collect all 1000 commands, and send them in one entire call,
rather than 1000 separate ones.
That's what transactions are for: http://redis.io/topics/transactions
Using a pipeline would let commands from other connected clients be issued in between your pipelined commands, since pipelining only guarantees that your client issues commands without waiting for replies; it gives no guarantee of atomicity.
Commands in a transaction (i.e. between MULTI/EXEC) are issued atomically, which I presume is what you want.
Deleting the ids in a Redis list will not affect the Redis hash fields. To speed things up, consider pipelining. Jedis supports that...
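A minimal sketch of both variants with Jedis (the hash key and the collected field names are made up for illustration):

import java.util.List;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.Pipeline;
import redis.clients.jedis.Transaction;

class HashCleanup {
    // Pipelined: all commands go out in one round trip; replies are collected at sync().
    // No atomicity guarantee - other clients' commands may interleave on the server.
    void deleteFieldsPipelined(Jedis jedis, String hashKey, List<String> fieldsToDelete) {
        Pipeline pipeline = jedis.pipelined();
        for (String field : fieldsToDelete) {
            pipeline.hdel(hashKey, field);
        }
        pipeline.sync();
    }

    // Transactional: everything between multi() and exec() is applied atomically.
    void deleteFieldsAtomically(Jedis jedis, String hashKey, List<String> fieldsToDelete) {
        Transaction tx = jedis.multi();
        for (String field : fieldsToDelete) {
            tx.hdel(hashKey, field);
        }
        tx.exec();
    }
}

If your Jedis version's hdel accepts a varargs list of fields, the whole cleanup can even be collapsed into a single HDEL command.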

Client Side sorting + Hibernate Paging?

I use GWT for the UI and Hibernate/Spring for the business layer. The following GWT widget is used to display the records: http://collectionofdemos.appspot.com/demo/com.google.gwt.gen2.demo.scrolltable.PagingScrollTableDemo/PagingScrollTableDemo.html. I assume the sorting is done on the client side.
I do not retrieve the entire result set since it's huge.
I use
principals = getHibernateTemplate().findByCriteria(criteria, fromIndex, numOfRecords);
to retrieve the data. There is no sorting criterion at the Hibernate layer.
This approach does not give the correct behaviour, since it only sorts the current dataset on the client.
What is the best solution for this problem?
NOTE: I can get the primary sort column and the other sort columns from the UI framework.
Maybe I can sort the result by the primary sort column in the Hibernate layer?
You need to sort on the server.
Then you can either:
send the complete result set to the client and handle pagination on the client side. The problem is that the result set may be too big to retrieve from the db and send to the client.
handle the pagination on the server side. The client and the server request only one page at a time from the db. The problem then is that you will order the same data again and again to extract page 1, page 2, etc., each time you ask the db for a specific page. This can be a problem with a large database.
have a trade-off between both (for a large database), as sketched below:
Set a limit, say 300 items
The server asks the db for the first 301 items according to the order by
The server keeps the result set (up to 301 items) in a cache
The client requests pages from the server one page at a time
The server handles the pagination using the cache
If there are 301 items, the client displays "The hit list contains more than 300 items. It has been truncated."
Note 1: Usually, the client doesn't care if it can't go to the last page. You can improve the solution by first counting the total number of rows (no order by is needed for that), so that you can display a more helpful message to the user, e.g. "The result contained 2023 elements; only the first 300 can be viewed".
Note 2: If you request the data page by page from the database without using any order criterion, most dbs (at least Oracle) don't guarantee any ordering, so you may get the same item in page 1 and page 2 if you make two requests to the database. The same problem happens if multiple items have the same value that is used to order by (e.g. the same date); the db doesn't guarantee any ordering between elements with the same value. If this is the case, I would suggest using the PK as the last order criterion (e.g. ORDER BY date, PK) so that the paging is done in a consistent way.
Note 3: I speak about client and server, but you can adapt the idea to your particular situation.
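A bare-bones sketch of that trade-off (the Row type, the cache key and fetchFirstRowsOrdered are placeholders; the real implementation would run one ORDER BY query limited to 301 rows):

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative: fetch at most LIMIT+1 ordered rows once, then serve pages from the cache.
class CachedPagingService {
    private static final int LIMIT = 300;
    private final Map<String, List<Row>> cacheByQueryKey = new ConcurrentHashMap<>();

    List<Row> getPage(String queryKey, int pageIndex, int pageSize) {
        List<Row> result = cacheByQueryKey.computeIfAbsent(queryKey,
                key -> fetchFirstRowsOrdered(key, LIMIT + 1)); // one ordered db query, 301 rows max
        int from = Math.min(pageIndex * pageSize, result.size());
        int to = Math.min(from + pageSize, result.size());
        return result.subList(from, to);
    }

    // More than LIMIT cached rows means the hit list was truncated.
    boolean isTruncated(String queryKey) {
        List<Row> result = cacheByQueryKey.get(queryKey);
        return result != null && result.size() > LIMIT;
    }

    private List<Row> fetchFirstRowsOrdered(String queryKey, int maxRows) {
        // placeholder for the real ORDER BY ... query limited to maxRows
        throw new UnsupportedOperationException("wire this up to your data access layer");
    }
}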
Always have a sort column. By default it could be "name" or "id".
Use server-side paging, i.e. pass the current page index and fetch the appropriate data subset.
In the fetch criteria / query, use the sort column; if none is selected by the client, use the default.
Thus you will have the desired behaviour without trade-offs.
It will be confusing to the user if you sort a partial result in the GUI and page on the server.
Since the data set is huge, sending the entire data set to the user and doing both paging and sorting there is a no-go.
That only leaves doing both sorting and paging on the server. You can use Criteria.addOrder() to do the sorting in Hibernate; see this tutorial.
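For example, something like this with a DetachedCriteria (the Principal entity, the clientSortColumn/ascending inputs and the default column are illustrative; fromIndex and numOfRecords are the paging values from the question):

import org.hibernate.criterion.DetachedCriteria;
import org.hibernate.criterion.Order;

// Sort on the server, then page: both happen in the same query.
DetachedCriteria criteria = DetachedCriteria.forClass(Principal.class);
String sortColumn = (clientSortColumn != null) ? clientSortColumn : "id"; // default sort column
criteria.addOrder(ascending ? Order.asc(sortColumn) : Order.desc(sortColumn));
criteria.addOrder(Order.asc("id")); // tie-breaker so paging stays consistent (see Note 2 above)

List principals = getHibernateTemplate().findByCriteria(criteria, fromIndex, numOfRecords);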
