There's a REST endpoint, which serves large (tens of gigabytes) chunks of data to my application.
Application processes the data in it's own pace, and as incoming data volumes grow, I'm starting to hit REST endpoint timeout.
Meaning, processing speed is less then network throughoutput.
Unfortunately, there's no way to raise processing speed enough, as there's no "enough" - incoming data volumes may grow indefinitely.
I'm thinking of a way to store incoming data locally before processing, in order to release REST endpoint connection before timeout occurs.
What I've came up so far, is downloading incoming data to a temporary file and reading (processing) said file simultaneously using OutputStream/InputStream.
Sort of buffering, using a file.
This brings it's own problems:
what if processing speed becomes faster then downloading speed for
some time and I get EOF?
file parser operates with
ObjectInputStream and it behaves weird in cases of empty file/EOF
and so on
Are there conventional ways to do such a thing?
Are there alternative solutions?
Please provide some guidance.
Upd:
I'd like to point out: http server is out of my control.
Consider it to be a vendor data provider. They have many consumers and refuse to alter anything for just one.
Looks like we're the only ones to use all of their data, as our client app processing speed is far greater than their sample client performance metrics. Still, we can not match our app performance with network throughoutput.
Server does not support http range requests or pagination.
There's no way to divide data in chunks to load, as there's no filtering attribute to guarantee that every chunk will be small enough.
Shortly: we can download all the data in a given time before timeout occurs, but can not process it.
Having an adapter between inputstream and outpustream, to pefrorm as a blocking queue, will help a ton.
You're using something like new ObjectInputStream(new FileInputStream(..._) and the solution for EOF could be wrapping the FileInputStream first in an WriterAwareStream which would block when hitting EOF as long a the writer is writing.
Anyway, in case latency don't matter much, I would not bother start processing before the download finished. Oftentimes, there isn't much you can do with an incomplete list of objects.
Maybe some memory-mapped-file-based queue like Chronicle-Queue may help you. It's faster than dealing with files directly and may be even simpler to use.
You could also implement a HugeBufferingInputStream internally using a queue, which reads from its input stream, and, in case it has a lot of data, it spits them out to disk. This may be a nice abstraction, completely hiding the buffering.
There's also FileBackedOutputStream in Guava, automatically switching from using memory to using a file when getting big, but I'm afraid, it's optimized for small sizes (with tens of gigabytes expected, there's no point of trying to use memory).
Are there alternative solutions?
If your consumer (the http client) is having trouble keeping up with the stream of data, you might want to look at a design where the client manages its own work in progress, pulling data from the server on demand.
RFC 7233 describes the Range Requests
devices with limited local storage might benefit from being able to request only a subset of a larger representation, such as a single page of a very large document, or the dimensions of an embedded image
HTTP Range requests on the MDN Web Docs site might be a more approachable introduction.
This is the sort of thing that queueing servers are made for. RabbitMQ, Kafka, Kinesis, any of those. Perhaps KStream would work. With everything you get from the HTTP server (given your constraint that it cannot be broken up into units of work), you could partition it into chunks of bytes of some reasonable size, maybe 1024kB. Your application would push/publish those records/messages to the topic/queue. They would all share some common series ID so you know which chunks match up, and each would need to carry an ordinal so they can be put back together in the right order; with a single Kafka partition you could probably rely upon offsets. You might publish a final record for that series with a "done" flag that would act as an EOF for whatever is consuming it. Of course, you'd send an HTTP response as soon as all the data is queued, though it may not necessarily be processed yet.
not sure if this would help in your case because you haven't mentioned what structure & format the data are coming to you in, however, i'll assume a beautifully normalised, deeply nested hierarchical xml (ie. pretty much the worst case for streaming, right? ... pega bix?)
i propose a partial solution that could allow you to sidestep the limitation of your not being able to control how your client interacts with the http data server -
deploy your own webserver, in whatever contemporary tech you please (which you do control) - your local server will sit in front of your locally cached copy of the data
periodically download the output of the webservice using a built-in http querying library, a commnd-line util such as aria2c curl wget et. al, an etl (or whatever you please) directly onto a local device-backed .xml file - this happens as often as it needs to
point your rest client to your own-hosted 127.0.0.1/modern_gigabyte_large/get... 'smart' server, instead of the old api.vendor.com/last_tested_on_megabytes/get... server
some thoughts:
you might need to refactor your data model to indicate that the xml webservice data that you and your clients are consuming was dated at the last successful run^ (ie. update this date when the next ingest process completes)
it would be theoretically possible for you to transform the underlying xml on the way through to better yield records in a streaming fashion to your webservice client (if you're not already doing this) but this would take effort - i could discuss this more if a sample of the data structure was provided
all of this work can run in parallel to your existing application, which continues on your last version of the successfully processed 'old data' until the next version 'new data' are available
^
in trade you will now need to manage a 'sliding window' of data files, where each 'result' is a specific instance of your app downloading the webservice data and storing it on disc, then successfully ingesting it into your model:
last (two?) good result(s) compressed (in my experience, gigabytes of xml packs down a helluva lot)
next pending/ provisional result while you're streaming to disc/ doing an integrity check/ ingesting data - (this becomes the current 'good' result, and the last 'good' result becomes the 'previous good' result)
if we assume that you're ingesting into a relational db, the current (and maybe previous) tables with the webservice data loaded into your app, and the next pending table
switching these around becomes a metadata operation, but now your database must store at least webservice data x2 (or x3 - whatever fits in your limitations)
... yes you don't need to do this, but you'll wish you did after something goes wrong :)
Looks like we're the only ones to use all of their data
this implies that there is some way for you to partition or limit the webservice feed - how are the other clients discriminating so as not to receive the full monty?
You can use in-memory caching techniques OR you can use Java 8 streams. Please see the following link for more info:
https://www.conductor.com/nightlight/using-java-8-streams-to-process-large-amounts-of-data/
Camel could maybe help you the regulate the network load between the REST producer and producer ?
You might for instance introduce a Camel endpoint acting as a proxy in front of the real REST endpoint, apply some throttling policy, before forwarding to the real endpoint:
from("http://localhost:8081/mywebserviceproxy")
.throttle(...)
.to("http://myserver.com:8080/myrealwebservice);
http://camel.apache.org/throttler.html
http://camel.apache.org/route-throttling-example.html
My 2 cents,
Bernard.
If you have enough memory, Maybe you can use in-memory data store like Redis.
When you get data from your Rest endpoint you can save your data into Redis list (or any other data structure which is appropriate for you).
Your consumer will consume data from the list.
I am passing close to 1000 parameters to my server. And i got the infamous "More than the maximum number of POST parameters detected" error. I increased the MAX_COUNT value in my local server and it was working fine.
But in my production environment the server admin is not ready to increase the count above 512, he is sighting DOS attacks as the reason.
Is there a best way to pass these 1000+ parameters to the server. I tried compartmentalizing the page into sections and made save buttons available for each section. But if it wasn't possible to compartmentalize, what would be the best practice to send huge number of parameters to a server?
You could get around this by posting an XML or JSON document with all of the parameters. It will require quite a bit changes, but will get you around the restriction without having to deal with the sysadmin.
I am currently using Spring as a RESTful Web Service for a mobile application I am writing. The server-side of things is pretty well completed (in terms of functionality) but now I am trying to improve its performance.
What I want to do is find an efficient way to return 304 (Not Modified) response codes from my server to the client with an implementation that is not shallow. Meaning, that I want to save both bandwidth and processing cycles.
What I figured I need to do is determine the last time an object has been modified and compare it to the if-modified-since HTTP header. The question here is, how should I get the last updated time of an object quickly (i.e. with zero-to-minimal access to the persistence layer)? Or is there a better approach to this altogether?
Note: I have been referencing this and this.
I'm not sure mongoDB support getting the last update time. I would do it manually. Check the $currentDate operator, here: http://docs.mongodb.org/manual/reference/operator/update/currentDate/
I'm working on a project which uses Facebook Graph Api to download information about users of our application and their Facebook friends. There is a need to update this data, but as I understand Real-Time Update is not an option. For example I would like to have update of profile feed of friends of our app user, and I don't see a way to do this with Real-Time Update.
Could someone give me some advice on this update mechanism? I need to update app users, their friend connections and profile feeds of users and their friends. I understand I'll have to poll Facebook servers to retrieve this data. What I'm trying to find out is some good practices when doing these things. Update frequency? Way to recognize that data has changed? If anyone has experience with this kind of things every advice would mean a lot.
Thanks.
You can use the since= query string parameter of the Graph API call. Here's some pseudocode to help you along
var usersLastPostDate = GetLastPostDateFromDataStore(userId);
if(usersLastPostDate not populated) {
streamItems = GraphApiGet(userId, "me/feed")
lastStreamItemDate = GetNewestStreamItemDate(streamItems)
StoreLastPostDateIntoDataStore(userId, lastStreamItemDate )
}
else {
streamItems = GraphApiGet(userId, "me/feed?since=" + usersLastPostDate )
}
Not massively useful for your use case (as you're wanting to get data which changes frequently), but worth pointing out that the Graph API now supports ETags - https://developers.facebook.com/blog/post/627/.
ETags will tell you if the data has changed since the last time you requested it. This won't stop you from hitting Facebooks API throttling limits, but is a quick and easy way to tell if the data has changed since you last asked for it.
There is no one answer to your question, as it depends on what your application is doing. How often do you need to get the updated information? If your data is stale for 5 minutes, is that really a problem? Can you grab the data from Facebook lazily, when some user action requires that you have it?
If you do need to do a lot of polling try and use non-blocking IO, especially if you're expecting to have a lot of open HTTP requests to Facebook whilst you're polling. Build a reliable queueing mechanism and HTTP poker to ensure requests are being made as expected. Without any idea of what technology stack you're using it's hard to be more specific than that.
HTH
What about this: Open Graph Subscription system ?
I have a small application in java which searches images using bing image search. The problem I am facing is that, its getting only first 20 images. May be because when we search on bing.com it populates first 20 images first and then its an infinite scrolling feature.
Is there any way to search more than 20 images using bing?
Cheers :)
I'm guessing this is because this site uses ajax to populate the "infinite" scrolling list as you call it.
You probably send an http request and get the initial page (btw on my browser I got 6 images accross x 4 down, i.e. 24 not 20; thinking about it maybe my client also got 20 only at first and got the last 4 w/ ajax...), and you'd need to do the paging trough by way of ajax requests.
At a glance, the xhtml and associated javascript of the page is very dense and somewhat obfuscated, It would take a while to get oriented... An alternative to analyzing this page is to instead use a packet sniffer (such as wireshark) and to capture the requests which take place when you scroll down.
Essentially this will likely expose some form of ajax request, which you can then easily emulate with java. Typically the ajax response is easy to parse whatever its nature (xml, jason, gzip...).
A possible snags to this well laid out plan is if the returned data in the ajax response is encrypted, for example where the extra images are bundled in some sort of envelope for which you'll then need to discover the format.
Depending on the actual task at hand, you may try alternatives such as automations within GreaseMonkey (on Firefox) or similar tools.
What of Bing API ?
Note that all the above approaches are akin to screen-scraping and hence quite sensitive to even minute changes in the Bing application, and, depending on effective usage and context, this could put the project in a legal grey area... A better approach may be to register and obtain a proper application ID with MS/Bing and to use the Bing API.
You are simulating a browser? Doesn't the Bing engine have an entry point for programs instead - a web service or so - which would make your task much easier.
EDIT: SDK appears to be here: http://msdn.microsoft.com/en-us/library/cc980922.aspx
Just wanted to post a direct answer to the question:
Bing uses Ajax (of course) for the infinite scroll. Each "tick" is based on a simple ajax get request, which accuires new images.
For instance, this url returns 30 results (121-151) in a "htmlraw" format based on the query "max payne".
http://www.bing.com/images/async?q=max+payne&format=htmlraw&first=121
Edit:
It works with the original url too, just add &first=NUMBER to the querystring. Example:
www.bing.com/images/search?q=payne&go=&form=QBLH&scope=images&filt=all&first=10
I am building my own bulk image collector (for a "learning project" for myself) and I found out that it is paginated like this.
FYI, Google and Bing are easy, Yahoo and Altavista (redundant, since their results are from Yahoo) are far more problematic - they don't post the directlink to the original image.
Have fun! :)
This can be done by using count parameter. For example, I tried GET "https://api.cognitive.microsoft.com/bing/v7.0/images/search?q=shoes&mkt=en-us&count=30" call and it returns 30 images.