I am designing a web service that wraps a very large data source, and I would be very grateful for any suggestions as to whether my design is appropriate or whether I am missing something substantially better.
So here is the problem:
We have several data sources that all provide the same interface, the most important method being RowIterator select(Table table, String where). Functionally, everything works for all our implementations. The problem is that the web service we need to wrap around one of the sources would, in a naive implementation, do the following upon receiving a query:
1. wait for the wrapped data source to return the whole result set
2. marshal the whole result set before sending it to the client
3. at the client side, unmarshal the whole result set before returning it to the caller
Only after this sequence would the caller be able to see the first row. This is quite disappointing behavior, as the caller has to wait unnecessarily for the whole result set twice. I want some pipelining instead: the caller must be able to see the first results while the service is still sending rows. I am planning to achieve this by implementing some kind of paging that is encapsulated in my client-side row iterator. The service would maintain a session id (with a timeout) that is created upon receiving a query and can be used to fetch chunks of data. The session id could be returned even before the actual query is sent to the wrapped data source. The client would then fetch chunks (pages) until a chunk is empty or smaller than the requested chunk size.
So, in this design the caller would be able to see the first results while the service is still sending rows. However, I am wondering whether there is a way to efficiently pipeline results on a per-row basis using a SOAP web service?
Also, would it be possible to return the results to the caller without repeatedly asking for more results?
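The paging idea described above could be sketched roughly like this. It is only a minimal sketch: ChunkSource and fetchChunk are hypothetical names standing in for the generated web service stub, and rows are assumed to be non-null.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical client-side view of the service: returns at most `size` rows starting at `offset`. */
interface ChunkSource<R> {
    List<R> fetchChunk(String sessionId, int offset, int size);
}

/** Iterator that hides the paging from the caller and fetches the next chunk lazily. */
class PagingRowIterator<R> implements Iterator<R> {
    private final ChunkSource<R> source;
    private final String sessionId;
    private final int chunkSize;
    private final Deque<R> buffer = new ArrayDeque<>();
    private int offset = 0;
    private boolean exhausted = false;

    PagingRowIterator(ChunkSource<R> source, String sessionId, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.source = source;
        this.sessionId = sessionId;
        this.chunkSize = chunkSize;
    }

    @Override
    public boolean hasNext() {
        if (buffer.isEmpty() && !exhausted) {
            List<R> chunk = source.fetchChunk(sessionId, offset, chunkSize);
            buffer.addAll(chunk);
            offset += chunk.size();
            // An empty or short chunk signals the end of the result set.
            if (chunk.size() < chunkSize) {
                exhausted = true;
            }
        }
        return !buffer.isEmpty();
    }

    @Override
    public R next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return buffer.removeFirst();
    }
}
```

The caller just iterates; a new request goes out only when the local buffer runs dry, so the first rows are visible after the first chunk arrives.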
In the end I used MTOM to transmit the data in binary and used blocking queues at the client and the server to achieve the desired parallelism. I sketched this here: Streaming large SOAP attachments
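For illustration only, the blocking-queue part could look roughly like the sketch below. This is not the code from the linked post: QueueBackedIterator is a hypothetical name, the MTOM transport is left out entirely, and a sentinel object is assumed as the end-of-stream marker.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Rows are produced on one thread and consumed on another through a bounded queue. */
class QueueBackedIterator<R> implements Iterator<R> {
    // Sentinel that marks the end of the stream (rows themselves must be non-null).
    private static final Object END = new Object();

    private final BlockingQueue<Object> queue;
    private Object next; // row taken from the queue but not yet handed out

    QueueBackedIterator(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Producer side: blocks while the queue is full, which throttles the producer. */
    void put(R row) throws InterruptedException {
        queue.put(row);
    }

    /** Producer side: signals that no more rows will follow. */
    void finish() throws InterruptedException {
        queue.put(END);
    }

    @Override
    public boolean hasNext() {
        if (next == null) {
            try {
                next = queue.take(); // blocks until a row or the sentinel arrives
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for rows", e);
            }
        }
        return next != END;
    }

    @Override
    @SuppressWarnings("unchecked")
    public R next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        R row = (R) next;
        next = null;
        return row;
    }
}
```

The bounded capacity is the design choice that matters here: the consumer sees rows as soon as they are queued, and a slow consumer makes the producer wait instead of letting the buffer grow without limit.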
I have some problems understanding the best concept for my problem.
My architecture is pretty basic. I have a backend with data that can be updated, and clients that load the data with some filters.
The backend holds the data in an EHCache.
The data model is pretty basic for example
{
id: string,
startDate: date,
endDate: date,
username: string,
group: string
}
The data can only be modified by another backend service.
When data is modified, added or deleted, a data update event is generated.
The clients are all web clients and use a Spring Boot REST service to fetch the data from the cache.
For the data request, each client sends its own request settings. There are different settings, like date and text filters. For example
{
contentFilter: Filter,
startDateFilter: date,
endDateFilter: date
}
The backend uses these settings to filter the data from the cache and then sends the response with the filtered data.
When the cache generates an update event, every client gets notified over a websocket connection.
Each client then requests the full data again with the same request settings as before.
My problem now is that many cache updates are happening and that the clients can have a lot of data to load if the full dataset is loaded every time.
For example I have this scenario.
Full dataset in cache: 100 000 rows
Update of rows in cache: 5-10 random rows every 1-5 seconds
Client1 dataset with request filter: 5000 rows
Client2 dataset with request filter: 50 rows
Now every time the client receives an update notification, it loads the complete dataset (5000 rows), and that every 1-5 seconds. If the updates always hit the same row and that row isn't loaded by the client because of its filter settings, the client loads the data unnecessarily.
I am not sure what would be the best solution to reduce the client updates and increase the performance.
My first thought was to just send the updated line directly with the websocket connection to the clients.
But for that I would have to know whether the client "needs" the updated line. If the updates happen on rows that the client doesn't need to load because of its filter settings, I would spam the client with unnecessary updates.
I could add a check on the client side if the id of the updated row is in the loaded dataset but then I would need a separate check if a row is added to the cache instead of an update.
But I am not sure if that is the best practice. And unfortunately I can not find many resources about this topic.
The most efficient things are always the most work, sadly.
I won't claim to be an expert at this kind of thing - on either the implementation(s) available or even the best practices - but I can give some food for thought at least, which may or may not be of help.
My first choice: your first thought.
You have the problem of knowing if the updated item is relevant to the client, due to the filters.
Save the filters for the client whenever they request the full data set!
Row gets updated, check through all the client filters to see if it is relevant to any of them, push out to those it is.
The effort for maintaining that filter cache is minimal (update whenever they change their filters), and you'll also be sending down minimal data to the clients. You also won't be iterating over a large dataset multiple times, just the smaller client set and only for the few rows that have been updated.
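To make that a bit more concrete, here is a rough sketch of such a filter cache. All names are made up for illustration, and the filter semantics (substring match on username, overlap with a date window) are an assumption, since the question doesn't say how the filters are actually evaluated.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Row shape from the question's data model (dates simplified to LocalDate). */
record CachedRow(String id, LocalDate startDate, LocalDate endDate, String username, String group) {}

/** Hypothetical filter: text match on username plus a date window the row must overlap. */
record ClientFilter(String usernameContains, LocalDate from, LocalDate to) {
    boolean matches(CachedRow row) {
        return row.username().contains(usernameContains)
                && !row.endDate().isBefore(from)
                && !row.startDate().isAfter(to);
    }
}

/** Remembers the last filter each client sent, keyed by a client/connection id. */
class FilterRegistry {
    private final Map<String, ClientFilter> filtersByClient = new ConcurrentHashMap<>();

    /** Call whenever a client requests data with (new) filter settings. */
    void register(String clientId, ClientFilter filter) {
        filtersByClient.put(clientId, filter);
    }

    /** Call when the client's websocket connection closes. */
    void unregister(String clientId) {
        filtersByClient.remove(clientId);
    }

    /** On an update event: which clients should have this row pushed to them? */
    List<String> clientsInterestedIn(CachedRow updated) {
        List<String> interested = new ArrayList<>();
        filtersByClient.forEach((clientId, filter) -> {
            if (filter.matches(updated)) {
                interested.add(clientId);
            }
        });
        return interested;
    }
}
```

One thing this sketch doesn't cover: a row that used to match a client's filter and no longer does after the update. The client would need to be told to drop it, so you may want to check the old version of the row as well.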
Another option:
If you don't go ahead with option 1, option 2 might be to group updates - assuming you have the luxury of not needing immediate, real-time updates.
Instead of telling the clients about every data update, only tell them every x seconds that there might be data waiting for them (might be, you little tease).
I was going to add other options but, to be honest, I don't see why you'd worry about much beyond option 1, maybe with an option 2 addition to reduce traffic if that's an issue.
'Best practice'-wise, sending down multiple FULL datasets to multiple clients multiple times a second is certainly not it.
Sending only the data relevant to each client is a much better solution, and further reducing how much the client even needs to send (i.e. only their filter updates, rather than re-sending something you could already have saved) is an added bonus.
Edit:
Ah, stateless server - though it's not really stateless. You're using web sockets, so the server has some kind of state for those connections. It's already stateful so option 1 doesn't really break anything.
If it's to be completely stateless, then you also can't store the updated rows of data, so you can't return those individually. You're back to what you're doing which is a full round-trip and data read + serve.
Option 3, though, if you're semi-stateless (you don't want to add any metadata to those socket connections) but do hold the updated rows: timestamp them and have the clients send the time of their last update along with their filters. You can then return only the rows updated since their last update that match their filters. The timestamp becomes just another filter, so arguably this is still stateless.
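As a rough sketch of "the timestamp becomes just another filter" (names are invented for illustration, and the lastModified field is an assumption, since the question's data model doesn't have one):

```java
import java.time.Instant;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

/** Hypothetical row that carries the time of its last modification. */
record TimestampedRow(String id, String username, Instant lastModified) {}

class DeltaQuery {
    /**
     * Returns only the rows that changed after `lastSeen` and also pass the
     * client's own filter. The server keeps no per-client state; the client
     * sends both values with every request.
     */
    static List<TimestampedRow> changedSince(List<TimestampedRow> rows,
                                             Instant lastSeen,
                                             Predicate<TimestampedRow> clientFilter) {
        return rows.stream()
                .filter(row -> row.lastModified().isAfter(lastSeen))
                .filter(clientFilter)
                .collect(Collectors.toList());
    }
}
```

The client would remember the timestamp of its last successful fetch and send it along next time. Deleted rows won't show up in a query like this unless the server keeps their ids around somewhere.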
Either way, limiting the updated data back down to the client is the main goal if for nothing else than saving data transfer.
Edit 2:
Sounds like you may need to send two bits of data down (or three if you want to split things even further - makes life easier client-side, I guess):
{
newItems: [{...}, ...],
updatedItems: [{...}, ...],
deletedIds: [1,2...]
}
Yes, when their request for an update comes, you'll have to check through your updated items to see if any are deleted and of relevance to the client's filters, but you can send down a minimal list of ids rather than whole rows that your client can then remove.
I've used GigaSpaces in the past and I'd like to know if I can use Ignite in a similar fashion. Specifically, I need to implement a master-worker pattern where one set of process writes objects to the in-memory data grid and another set reads those objects, does some processing, and possibly writes results back to the grid. One important GigaSpaces/JavaSpaces feature I need is leasing. If I write an object to the space and it isn't picked up within a certain time period, it should automatically expire and I should get some kind of notification.
Is Apache Ignite a good match for this use case?
I've worked with GigaSpaces before. What you are looking for is perhaps "continuous queries" in Ignite. They allow you to create a filter for a specific predicate, e.g. checking a field of a new object being written to the grid. Once the filter is evaluated, it triggers a listener that can execute the logic you require and write results or changes back to the grid. You can create as many of these queries as desired and chain them, similar to the "notification container" in GigaSpaces. And as you would expect, you can control the thread pools for this separately.
As for the master-worker pattern, you can configure client Ignite nodes to write the data and server nodes to store and process it. You can even use other client nodes as remote listeners for data changes, as you mentioned.
Check these links:
https://apacheignite.readme.io/docs/continuous-queries
https://apacheignite.readme.io/docs/clients-vs-servers
I am thinking of setting up a page in an application where each query can return a result set that does not fit in memory, or where fetching all the results is very expensive. The user will hit "get more" to get more of those results. I wonder if I could use a yielder for Java, something like this (http://benjiweber.co.uk/blog/2015/03/21/yield-return-in-java/), and whether I will need Web Sockets, e.g. from Spring (http://docs.spring.io/spring/docs/current/spring-framework-reference/html/websocket.html), so that the client can tell the server to push more results. Also, could you please give an example of the handshake? Will the endpoint URI be based on some session id as well? Also, when databases like OrientDB/Neo4j return Iterables, does it mean that we can keep the connection open and get the next rows after minutes without problems? Thanks!
You are talking about two different concepts.
Pagination
If you have a large result set and you need to return it piece by piece to avoid long query times or high memory requirements, you're paginating over the result set.
To do this, the client requests another piece of the set by hitting the "Get More" button. Each time it asks for more, the server receives a request from the client and hits the DB with a paginated query.
Example in SQL (10 results per page, skipping the first 100 rows):
SELECT * FROM Table LIMIT 10 OFFSET 100
Websockets / Yielder
You'll need a websocket / yielder when the server is the one that sends the data. In other words, the client doesn't request an update; it only keeps the socket open and receives updates from the server when they come.
That's the case of a Message service, for example, avoiding constant polling from the client side.
In your case a websocket is absolutely unnecessary. You can also see an example of what I'm saying here -> What's the behavioral difference between HTTP Stay-Alive and Websockets?
However, you can set up a keep-alive connection between your back end and the database to avoid constantly closing and reopening the connection each time the user requests more results.
Finally, your question about Iterable results in Neo4j. Neo4j's result type is an Iterable list of Map<String,Object> which represents a List of key-value pairs. That doesn't keep the connection alive (by default), it only iterates through the returned results of that particular query.
I have been given an API link in the form of a URL and query string. The following is my approach:
Query string format means that a GET request is to be fired.
I also assume that this can be done with the HttpURLConnection in Java
I have some data list that I'm retrieving from db
How would I fire a request for each item in the list? Is a simple for loop not going to be enough for such a sophisticated task?
The API link is a trivial link with a query string, to which the data from the db is appended one item at a time.
Would like to hear how you would approach this task and see if my approach lacks somewhere.
You are right in doubting the simple for loop approach. It would be slow. The request is blocking, so you'll be waiting for the result of request 1 before firing request 2. Look into doing this asynchronously, firing multiple requests at once.
It's hard to say more without details on the API. Is it an online web service? Something internal created by another department? If it does not exist, consider asking for a version of that function that can receive multiple parameters at once, instead of having to do tons of tiny calls.
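A minimal sketch of firing the calls concurrently, under the assumption that each call is independent of the others. ParallelCalls and callAll are made-up names, and the actual HTTP call (via HttpURLConnection or anything else) is passed in as a function so the sketch doesn't depend on the real API.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;
import java.util.stream.Collectors;

class ParallelCalls {
    /**
     * Runs `call` once per item on a fixed-size thread pool and returns the
     * results in the same order as the input list.
     */
    static <T, R> List<R> callAll(List<T> items, Function<T, R> call, int parallelism) {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            // Submit everything first so the requests overlap...
            List<CompletableFuture<R>> futures = items.stream()
                    .map(item -> CompletableFuture.supplyAsync(() -> call.apply(item), pool))
                    .collect(Collectors.toList());
            // ...then wait for each result in input order.
            return futures.stream()
                    .map(CompletableFuture::join)
                    .collect(Collectors.toList());
        } finally {
            pool.shutdown();
        }
    }
}
```

You would pass in something like `item -> httpGet(baseUrl + "?param=" + item)`, where httpGet is whatever blocking helper you write around HttpURLConnection. Keep the parallelism modest; the API owner may not appreciate hundreds of simultaneous requests.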
Here is my situation: I have a Java EE single-page application. All client-server communication is AJAX based, with JSON as the data exchange format. One of my requests takes around 1 min to calculate the data required by the client. This data is also huge (could be > 20 MB), so it is not possible to pass all of it to JavaScript in one go. For this reason I am passing only a few records to the client and using a grid with a paging option to display the data.
Now when the user clicks the next page button, I need to get more data. My question is: how do I cache the data on the server side? I need this data for only one user at a time. Would you recommend caching all the data on the first request, using the session id as key?
Any other suggestions ?
I am assuming you are using a DB backend for that. I'd use limits to return small chunks of data; most DB vendors have a solution for this. That would make your queries faster, and most JS frameworks with grid-type components support paginated results (ExtJS, for example).
If you are fetching data from a 3rd party and passing it on (with or without modifications), I'd still stick to the database and use this workflow: pull the data from the 3rd party, save it in the db, and have your widget request the small chunks required by customers.
Hope this helps.
The cheapest (and not so effective) way of caching data in a Java EE web application is to use the Session object, like you intend to do. It's ineffective since it requires the developer to ensure that the cache does not leak memory; it is up to the developer to nullify the reference to the object once the object is no longer needed.
However, even if you wish to implement the poor man's cache, caching 20MB of data is not advisable, as it does not scale well. The scalability question arises when multiple users utilize the same functionality of the application, in which case 20MB is a lot of data.
You're better off returning paginated "datasets" in the form of JSON, based on the ValueList design pattern. Each request for the query of data will result in partial retrieval of data, which is then sent down the wire to the client. That way, you never have to cache the complete results of the query execution, and you can return partial datasets. It is entirely up to you whether you want to cache; usually caching is done for large datasets that are utilized time and again.
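A minimal sketch of what such a paginated retrieval could look like on the server. The names Page, WindowQuery and Pager are invented for illustration and are not part of any ValueList library; WindowQuery stands in for whatever runs the limited query against your data store.

```java
import java.util.List;

/** One page of results plus a flag telling the grid whether a next page exists. */
record Page<T>(List<T> items, int pageNumber, boolean hasMore) {}

/** Hypothetical hook for the real query: fetch at most `limit` rows starting at `offset`. */
interface WindowQuery<T> {
    List<T> fetch(int offset, int limit);
}

class Pager {
    /** Loads a single page (pageNumber starts at 0) without retrieving the full result. */
    static <T> Page<T> load(WindowQuery<T> query, int pageNumber, int pageSize) {
        if (pageNumber < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("pageNumber must be >= 0 and pageSize > 0");
        }
        int offset = Math.multiplyExact(pageNumber, pageSize);
        // Ask for one extra row so we can tell whether another page exists.
        List<T> rows = query.fetch(offset, Math.addExact(pageSize, 1));
        boolean hasMore = rows.size() > pageSize;
        List<T> items = hasMore ? rows.subList(0, pageSize) : rows;
        return new Page<>(List.copyOf(items), pageNumber, hasMore);
    }
}
```

The Page object can be serialized to JSON as-is, and nothing has to be kept in the session between requests. Whether this helps with a query that takes a minute to compute depends on whether the expensive part can be limited as well.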