Specify InputStream for ServletResponce instead of copying InputStream in OutputStream - java

In short I have a Servlet, which retrieves pictures/videos e t.c. from underlying data store.
In order to archive this I need to copy files InputStream to ServletResponce *OutputStream*
From my point of view this is not effective, since I'll need to copy the file in memory before sending it, it would be more convinient to specify InputStream, from which OutputStream would read data and send it straight away, after reading some data in the buffer.
I looked at ServletResponce documentation and it have some buffer for the message data, so I have a few questions regarding it.
Is this the right mechanism?
What If I decide not to send the file at the end of Servlet processing?
For example:
If I have copied InputStream in OutputStream, and then find out that this is not authorized request, and user have no right to see this Object (Mistake in design maybe) I would still send some data to the client, although this is not what I intended, or not.

To address your first concern, you can easily copy InputStream to OutputStream using IOUtils from Apache Commons Lang:
IOUtils.copy(fileInputStream, servletOutputStream);
It uses 4K buffer, so memory consumption should not be a concern. In fact you cannot just send straight away data from InputStream. At the lowest level the operating system still has to read file contents to some memory location and in order to send it to socket, you need to provide a memory location where the data to be sent resides. Streams are just a useful abstraction.
About your second question: this is how HTTP works: if you start streaming data to the client, servlet container sends all response headers first. If you abort in the middle, from the client perspective it looks like interrupted download.

Is this the right mechanism?
Basically, it is the only mechanism provided by the Servlet APIs. You need to design your servlet with this in mind.
(It is hard to see how it could be done any other way. A read syscall reads data into memory from a device (the disk). A write syscall writes data from memory to a device (the network interface). There is no syscall to transfer data directly from one device to another. The best you can do is to reduce the amount of copying of data within the application. If you use something like IOUtils.copy, it should minimize that as far as possible. The only way you could avoid going through application memort would be to use some special purpose hardware / operating system combination optimized for content delivery.)
However, this is probably moot anyway. In most cases, the performance bottleneck is likely to be movement of data over the network. Data can probably be read from disk to memory, copied, and written to the network interface orders of magnitude faster than it can move through the network to the user's web browser (or whatever).
If it is NOT moot, then a practical way to do content delivery would be to use a separate web server implemented in native code that us optimized for delivering static content; e.g. something like nginx.)
What If I decide not to send the file at the end of Servlet processing? For example: If I have copied InputStream in OutputStream, and then find out that this is not authorized request, and user have no right to see this Object (Mistake in design maybe) I would still send some data to the client, although this is not what I intended, or not.
You should write your servlet to do the access checks BEFORE reading the content into memory. And ideally, before you "commit" the response by sending the response header.

Related

How to improve file IO performance in a rest service

The scenario I have at hand is from my spring boot rest service, read a word doc from resources folder and pass the byte array to the client
I read the word doc in memory using FileInputStream, convert input stream to a byte array using Apache Common IO IOUtils and place it in the response body of the rest service.
The problem here is that I always read the file in memeirh oer service request which is detrimental for there local memory of the process where service is running on.
I can’t read the file line by line and return it to the service caller in that fashion as I need to return the byte array back to the caller all together
Another problem I foresee is with how the file is read. I want to be a non blocking IO instead of a blocking IO.
Wondering what would be an efficient way to solve this
Do you actually need to read the file every time a request comes in.
Otherwise you could just read the file on server startup and then keep the file in memory stored in a Spring Bean. Then fetch it from there on every call?
If you don't want to upload file every time, it's better to create the #Bean, doing that in init/postconstruct phase. You also can add some functionality to your retrieve() method, which checks and stores file modification time with invokation of File.lastModified() to decide whether you have to reload the content or not.

Ways to buffer REST response

There's a REST endpoint, which serves large (tens of gigabytes) chunks of data to my application.
Application processes the data in it's own pace, and as incoming data volumes grow, I'm starting to hit REST endpoint timeout.
Meaning, processing speed is less then network throughoutput.
Unfortunately, there's no way to raise processing speed enough, as there's no "enough" - incoming data volumes may grow indefinitely.
I'm thinking of a way to store incoming data locally before processing, in order to release REST endpoint connection before timeout occurs.
What I've came up so far, is downloading incoming data to a temporary file and reading (processing) said file simultaneously using OutputStream/InputStream.
Sort of buffering, using a file.
This brings it's own problems:
what if processing speed becomes faster then downloading speed for
some time and I get EOF?
file parser operates with
ObjectInputStream and it behaves weird in cases of empty file/EOF
and so on
Are there conventional ways to do such a thing?
Are there alternative solutions?
Please provide some guidance.
Upd:
I'd like to point out: http server is out of my control.
Consider it to be a vendor data provider. They have many consumers and refuse to alter anything for just one.
Looks like we're the only ones to use all of their data, as our client app processing speed is far greater than their sample client performance metrics. Still, we can not match our app performance with network throughoutput.
Server does not support http range requests or pagination.
There's no way to divide data in chunks to load, as there's no filtering attribute to guarantee that every chunk will be small enough.
Shortly: we can download all the data in a given time before timeout occurs, but can not process it.
Having an adapter between inputstream and outpustream, to pefrorm as a blocking queue, will help a ton.
You're using something like new ObjectInputStream(new FileInputStream(..._) and the solution for EOF could be wrapping the FileInputStream first in an WriterAwareStream which would block when hitting EOF as long a the writer is writing.
Anyway, in case latency don't matter much, I would not bother start processing before the download finished. Oftentimes, there isn't much you can do with an incomplete list of objects.
Maybe some memory-mapped-file-based queue like Chronicle-Queue may help you. It's faster than dealing with files directly and may be even simpler to use.
You could also implement a HugeBufferingInputStream internally using a queue, which reads from its input stream, and, in case it has a lot of data, it spits them out to disk. This may be a nice abstraction, completely hiding the buffering.
There's also FileBackedOutputStream in Guava, automatically switching from using memory to using a file when getting big, but I'm afraid, it's optimized for small sizes (with tens of gigabytes expected, there's no point of trying to use memory).
Are there alternative solutions?
If your consumer (the http client) is having trouble keeping up with the stream of data, you might want to look at a design where the client manages its own work in progress, pulling data from the server on demand.
RFC 7233 describes the Range Requests
devices with limited local storage might benefit from being able to request only a subset of a larger representation, such as a single page of a very large document, or the dimensions of an embedded image
HTTP Range requests on the MDN Web Docs site might be a more approachable introduction.
This is the sort of thing that queueing servers are made for. RabbitMQ, Kafka, Kinesis, any of those. Perhaps KStream would work. With everything you get from the HTTP server (given your constraint that it cannot be broken up into units of work), you could partition it into chunks of bytes of some reasonable size, maybe 1024kB. Your application would push/publish those records/messages to the topic/queue. They would all share some common series ID so you know which chunks match up, and each would need to carry an ordinal so they can be put back together in the right order; with a single Kafka partition you could probably rely upon offsets. You might publish a final record for that series with a "done" flag that would act as an EOF for whatever is consuming it. Of course, you'd send an HTTP response as soon as all the data is queued, though it may not necessarily be processed yet.
not sure if this would help in your case because you haven't mentioned what structure & format the data are coming to you in, however, i'll assume a beautifully normalised, deeply nested hierarchical xml (ie. pretty much the worst case for streaming, right? ... pega bix?)
i propose a partial solution that could allow you to sidestep the limitation of your not being able to control how your client interacts with the http data server -
deploy your own webserver, in whatever contemporary tech you please (which you do control) - your local server will sit in front of your locally cached copy of the data
periodically download the output of the webservice using a built-in http querying library, a commnd-line util such as aria2c curl wget et. al, an etl (or whatever you please) directly onto a local device-backed .xml file - this happens as often as it needs to
point your rest client to your own-hosted 127.0.0.1/modern_gigabyte_large/get... 'smart' server, instead of the old api.vendor.com/last_tested_on_megabytes/get... server
some thoughts:
you might need to refactor your data model to indicate that the xml webservice data that you and your clients are consuming was dated at the last successful run^ (ie. update this date when the next ingest process completes)
it would be theoretically possible for you to transform the underlying xml on the way through to better yield records in a streaming fashion to your webservice client (if you're not already doing this) but this would take effort - i could discuss this more if a sample of the data structure was provided
all of this work can run in parallel to your existing application, which continues on your last version of the successfully processed 'old data' until the next version 'new data' are available
^
in trade you will now need to manage a 'sliding window' of data files, where each 'result' is a specific instance of your app downloading the webservice data and storing it on disc, then successfully ingesting it into your model:
last (two?) good result(s) compressed (in my experience, gigabytes of xml packs down a helluva lot)
next pending/ provisional result while you're streaming to disc/ doing an integrity check/ ingesting data - (this becomes the current 'good' result, and the last 'good' result becomes the 'previous good' result)
if we assume that you're ingesting into a relational db, the current (and maybe previous) tables with the webservice data loaded into your app, and the next pending table
switching these around becomes a metadata operation, but now your database must store at least webservice data x2 (or x3 - whatever fits in your limitations)
... yes you don't need to do this, but you'll wish you did after something goes wrong :)
Looks like we're the only ones to use all of their data
this implies that there is some way for you to partition or limit the webservice feed - how are the other clients discriminating so as not to receive the full monty?
You can use in-memory caching techniques OR you can use Java 8 streams. Please see the following link for more info:
https://www.conductor.com/nightlight/using-java-8-streams-to-process-large-amounts-of-data/
Camel could maybe help you the regulate the network load between the REST producer and producer ?
You might for instance introduce a Camel endpoint acting as a proxy in front of the real REST endpoint, apply some throttling policy, before forwarding to the real endpoint:
from("http://localhost:8081/mywebserviceproxy")
.throttle(...)
.to("http://myserver.com:8080/myrealwebservice);
http://camel.apache.org/throttler.html
http://camel.apache.org/route-throttling-example.html
My 2 cents,
Bernard.
If you have enough memory, Maybe you can use in-memory data store like Redis.
When you get data from your Rest endpoint you can save your data into Redis list (or any other data structure which is appropriate for you).
Your consumer will consume data from the list.

What is the best way to write a generic httpservletresponsewrapper that handles all special cases of response's contenttype in a large web app?

I am customizing a large COTS content management system known as Confluence.
Confluence returns many different types of httpservletresponses (text/ascii, image/png, image/jpg, microsoft powerpoint files, PDF files, etc...).
I have written a servletfilter that attempts to modify all responses sent back to the client by writing out a small set of bytes. This works well for the most part. However, I have to continuously check for special cases like powerpoint files, or PDFs, PNGs, etc.. If the user happens to be downloading such content I do not modify the response at all. Modifying the response breaks stream of powerpoint bytes or PDF bytes that are in the process of being served to the client. By simply checking for these special cases and not writing out any of my bytes my problem is solved. But I feel the bigger problem is there could be many many more cases I am not thinking of (perhaps audio and video) or who knows what. I will have to continue playing the game of checking for these special cases as I learn of them.
I was wondering if there is a smarter way to handle this.
I did a google and I ran into this example.
I'm looking for something along the lines of this example, but I was hoping someone could explain to me what's going on behind the scenes and if I can solve this problem in a smarter way.
The filter example is sort of incomplete, but the gist of what it seems to be doing is buffering the entire response in to a byte array, with which you can do whatever you want later. I think the implication is that you might extend this filter, then call getData() after the filter chain fires, and then perform processing.
You don't speak to what you're doing, or why the content type matter, or why "special" content types that you don't care about (that you just pass through) matter.
What you can do, is you could create a registry of content type handlers to classes. Then, as you detect the content type of the outbound request, you can dispatch to the appropriate handler. These handlers can be simply represented as a map of content type -> class name of the handler, with a default pass through "do nothing" handler for any content type that is not registered. You can load that map from a properties file, filter configuration, or a table in the database.
While it may seem attractive to just buffer the entire output stream and then act upon it, I would recommend against it. Imagine the memory pressure if the user is downloading a large (10's to 100's of MB) PDF or video or something else. Perhaps most of your content is appropriate to be buffered, but there may well be some that are not.
Of course your handler can implement many of the portions of the filter chain, and act as a proxy filter, so your handlers can do anything a filter can do.
Also, your filter may interfere with higher order HTTP processing (notably chunk delivery, range support, Etag and caching support, etc.). That stuff can be a pain to have to redo.

Avoid obtaining same InputStream more than once

I can see there are a number of posts regarding reuse InputStream. I understand InputStream is a one-time thing and cannot be reused.
However, I have a use case like this:
I have downloaded the file from DropBox by obtaining the DropBoxInputStream using the DropBox's Java SDK. I then need to upload the file to another system by passing the InputStream. However, as part of the download, I have to provide the MD5 of the file. So I have to read the file from the stream before uploading the file. Because the DropBoxInputStream I received can only be used once, I have to get another DropBoxInputStream after I have calculated the MD5 and before uploading the file. The procedure is like:
Get first DropBoxInputStream
Read from the DropBoxInputStream and calculate MD5
Get the second DropBoxInputStream
Upload file using the MD5 and the second DropBoxInputStream.
I am thinking that, if there are many way for me to "cache" or "backup" the InputStream before I calculate the MD5 so that I can save step 3 of obtaining the same DropBoxInputStream again?
Many thanks
EDIT:
Sorry I missed some information.
What I am currently doing is that I use a MD5DigestOutputStream to calculate MD5. I stream data across the MD5DigestOutputStream and save them locally as a temp file. Once the data goes through the MD5DigestOutputStream, it will calculate the MD5.
I then call a third party library to upload the file using the calculated md5 and a FileInputStream which reads from the temp file.
However, this requires huge disk space sometime and I want to remove the needs to use temp file. The library I use only accepts a MD5 and InputStream. This means I have to calculate the MD5 on my end. My plan is to use my MD5DigestOutputStream to write data to /dev/null (not keeping the file) so that I can calculate theMD5, and get the InputStream from DropBox again and pass that to the library I use. I assume the library will be able to get the file directly from DropBox without the need for me to cache the file either in the memory of at the disk. Will it work?
Input streams aren't really designed for creating copies or re-using, they're specifically for situations where you don't want to read off into a byte array and use array operations on that (this is especially useful when the whole array isn't available, as in, for e.g. socket comunication). You could buffer up into a byte array, which is the process of reading sections from the stream into a byte array buffer until you have enough information.
But that's unnecessary for calculating an md5. Notice that InputStream is abstract, so it needs be implemented in an extended class. It has many implementations- GZIPInputStream, fileinputstream etc. These are, in design pattern speak, decorators of the IO stream: they add extra functionality to the abstract base IO classes. For example, GZIPInputStream gzips up the stream.
So, what you need is a stream to do this for md5. There is, joyfully, a well documented similar thing: see this answer. So you should just be able to pass your dropbox input stream (as it will be itself an input stream) to create a new DigestInputStream, and then you can both take the md5 and continue to read as before.
Worried about type casting? The idea with decorators in Java is that, since the InputStream base class interfaces all the methods and 'beef' you need to do your IO, there's no harm in passing instances of objects inheriting from InputStream in the constructor of each stream implementation, and you can still do the same core IO.
Finally, I should probably answer your actual question- say you still want to "cache" or "backup" the stream anyway? Well, you could just write it to a byte array. This is well documented, but can become faff when your streams get more complicated. Alternatively, try looking at a PushbackInputStream. Here, you can easily write a function to read off n bytes, perform and operation on them, and then restore them to the stream. Generally good to avoid these implementations of streams in Java, as it's bad for memory use, but no worse than buffering everything up which you'd otherwise have to do.
Or, of course, I would have a go with DigestInputStream.
Hope this helps,
Best.
You don't need to open a new InputStream from DropBox.
Once you have read the file from DropBox, you have it locally. So it is either in memory (in a byte array) or you stored it in a local file. Now you can create an InputStream that reads the data from memory (ByteArrayInputStream) or disk (FileInputStream) in order to upload the file.
So instead of caching the InputStream (which you can't) you cache the contents (which you can).

Special OutputStream to work into memory and file depending on the amount of input data

Currently I'm working with an SSH client api providing me stdout and stderr as InputStreams. I have to read all the data from these streams at client side and provide an api for implementors to be able to work with these data the way they want (just drop it, write it to DB, process it etc). First I tried to keep the whole data read in byte arrays, but with huge amount of data (could happen sometimes) this can cause serious memory problems. But I don't want to write all the data of every call into files if that isn't really necessary.
Anyone knows about a solution which reads data into memory until it reaches a limit (like 1mb), after it writes data from memory to a file and appends all the remaining data of the inputstream to the same file?
commons io has a workable solution: DeferredFileOutputStream.
Can you avoid reading the stream until you know what you are going to do with it?
If you use this approach you can dump them, read portions of data and write them to a database as you read it, or read and process the data as you read it.
This way you would not need to read more than 1 MB (or less) at any one time.

Categories

Resources