REST calls - large amounts of data between calls - Java

We are using REST with Jersey. There are a few scenarios where the server (WAS 8.5) sends a large amount of data to the client, which is an RCP application. In some cases the data is so large (150 MB of XML) that the client gets an OutOfMemoryError.
I have the questions below:
How much does the size increase when a Java object is converted to XML?
How can we send a large Java object to the client and still use REST calls?

1) That's a tough question to answer without seeing the XML schema; I've seen well-designed schemas that result in tight, lean XML, and others that are a mess and very bloated. To test it, write some test code that serializes your Java objects to a byte[] and compare its size to the XML payload you currently produce.
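To get a rough number, something like the following comparison can help (just a sketch; MyPayload and loadPayload() stand in for your actual JAXB-annotated class and however you build your object graph):

import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import javax.xml.bind.JAXBContext;

public class SizeCheck {
    public static void main(String[] args) throws Exception {
        MyPayload payload = loadPayload();   // placeholder for building your object graph

        // Java serialization size (MyPayload must implement Serializable for this branch)
        ByteArrayOutputStream javaBytes = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(javaBytes)) {
            oos.writeObject(payload);
        }

        // JAXB/XML size
        ByteArrayOutputStream xmlBytes = new ByteArrayOutputStream();
        JAXBContext.newInstance(MyPayload.class).createMarshaller().marshal(payload, xmlBytes);

        System.out.println("Java serialized: " + javaBytes.size() + " bytes");
        System.out.println("XML:             " + xmlBytes.size() + " bytes");
    }
}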
2) It might be worth looking into a chunking process; 150 MB is pretty large for a single payload. Are you using GZIP compression for this already? It may also be worth looking at Fast Infoset, which is basically a binary encoding for XML that generally helps reduce the size of an XML document.
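You can also estimate the GZIP savings before touching any framework configuration by compressing the XML payload you already produce with plain java.util.zip (xmlBytes here stands for your current XML payload as a byte[]):

import java.io.ByteArrayOutputStream;
import java.util.zip.GZIPOutputStream;

// xmlBytes is the XML payload you currently send, as a byte[]
ByteArrayOutputStream compressed = new ByteArrayOutputStream();
try (GZIPOutputStream gzip = new GZIPOutputStream(compressed)) {
    gzip.write(xmlBytes);
}
System.out.println("XML: " + xmlBytes.length + " bytes, gzipped: " + compressed.size() + " bytes");

As far as I remember, Jersey 2 also ships a GZIP content encoder that can be registered on both client and server so the compression happens transparently.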

Related

Is additional base64 encoding necessary when sending files as byte[] from Java service to Java Service via RestTemplate?

I am sending data via json body in a post request from a client (Java) to a server (Java) using a Spring RestTemplate and RestController.
The data is present as a POJO on the client and will be parsed into a POJO with the same structure on the server.
On the client I am converting a file with Files.readAllBytes to a byte[] and storing it in the content field.
On the server side the whole object including the byte[] will be marshalled to XML using JAXB annotations.
class BinaryObject {
String fileName;
String mimeCode;
byte[] content;
}
Everything is working fine and running as intended.
I heard it could be beneficial to encode the content field before transmitting the data to the server and decode it there before it is marshalled into XML.
My Question
Is it necessary or recommended to additionally encode / decode the content field with base64?
TL;DR
To the best of my knowledge, you are not going against any good practice with your current implementation. One might question the design (exchanging files in JSON? Storing binary inside XML?), but this is a separate question.
Still, there is room for possible optimization, but the toolset you use (e.g. Spring RestTemplate + Spring controller + JSON serialization (Jackson) + XML using JAXB) kind of hides the possible optimizations from you.
You have to carefully weigh the pros and cons of working around your comfortable "automat(g)ical" serializations that work well as of today, to see if it is worth the trouble to tweak them.
We can nonetheless discuss the theory of what could be done.
A discussion about Base64
Base64 encoding is an efficient way to encode binary data in pure text formats (e.g. MIME structures such as email or some HTTP bodies, JSON, XML, ...), but it has two costs: the first is a non-negligible size increase (~33% larger), the second is CPU time.
Sometimes (but you'd have to profile to check whether that is your case), this cost is not negligible, especially for large files (due to some buffering and char/byte conversions in the frameworks, you could easily end up using e.g. 4x the size of the encoded file in the Java heap).
When handling 10 kB files at 10 requests/second, this is usually NOT an issue.
But 10 MB files at 100 requests/second, well, that is another ballpark.
So you'd have to check (I doubt your typical server will reach 100 req/s with 10 MB files, because that would be 1 GB/s of incoming network bandwidth).
What is optimizable in your current process
In your current process, you have multiple encodings taking place: the client needs to Base64-encode the bytes read from the file.
When the request hits the server, the server decodes the base64 to a byte[], then your XML serialization (JAXB) re-converts the byte[] to base64.
So in effect, "you" (more exactly, the REST controller side of things) decoded base64 content, all for nothing, because the XML side of things could have used it directly.
What could be done
A few things.
Do you need base64 at the calling site?
First, you do not have to encode at the client side. When using JSON, there is no choice, but the world did not wait for JSON to exchange files (e.g. arbitrary binary content) over HTTP.
If your content is a file name, a MIME type, and a file body, then a standard, direct HTTP call with no JSON at all is perfectly fine.
The MIME type can be mapped to the Content-Type HTTP header, the file name carried in the Content-Disposition HTTP header, and the contents sent as the raw HTTP body. No base64 needed (but you do need your server side to accept raw HTTP content as is). This is about as standard as it gets.
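Purely as an illustration (not your code), a client-side upload along those lines with RestTemplate could look like the sketch below; the URL, the file path, and the MIME type are made up:

import org.springframework.core.io.FileSystemResource;
import org.springframework.http.HttpEntity;
import org.springframework.http.HttpHeaders;
import org.springframework.http.MediaType;
import org.springframework.web.client.RestTemplate;
import java.io.File;

File file = new File("/path/to/report.pdf");   // hypothetical file
HttpHeaders headers = new HttpHeaders();
headers.setContentType(MediaType.parseMediaType("application/pdf"));   // your mimeCode
headers.set(HttpHeaders.CONTENT_DISPOSITION, "attachment; filename=\"" + file.getName() + "\"");

// the Resource converter writes the raw file bytes into the body - no base64, no giant String
HttpEntity<FileSystemResource> request = new HttpEntity<>(new FileSystemResource(file), headers);
new RestTemplate().postForEntity("http://example.org/files", request, Void.class);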
This change would allow you to remove the encoding (client side), lower the network size of the call (~33% less), and remove one decoding at the server side. The server would just have to base64-encode (once) a raw stream to produce the XML, and you would not even need to buffer the whole file contents for that (you'd have to tweak your JAXB model a bit, but you can JAXB-serialize bytes directly from an InputStream, which means almost no buffer, and since your CPU probably encodes faster than your network serves content, no real latency is incurred).
If this, for some reason, is not an option, let's say your client has to send JSON (and therefore base64 content).
Can you avoid decoding at the server side?
Sort of. You can use a server-side bean where the content is actually a String and NOT a byte[]. This is hacky, but your REST controller will no longer deserialize the base64; it will keep it "as is", as a JSON string (which happens to be base64-encoded content, but the controller does not care).
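A minimal sketch of that server-side variant, assuming the same BinaryObject shape as in the question (the field name is illustrative):

// Server-side DTO: keep the payload as an opaque String so the REST controller
// does not base64-decode it; JSON deserialization just copies the text as-is.
class BinaryObjectWire {
    String fileName;
    String mimeCode;
    String contentBase64;   // still base64 text, exactly as it arrived on the wire
}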
So your server will have saved the CPU cost of one base64 decoding, but in exchange, you'll have a base64 String in the Java heap (compared to the raw byte[], +33% size on Java >= 9 with compact strings, +166% size on Java < 9).
If you are to profit from this, you also have to tweak your JAXB mapping to see the base64-encoded String as a byte[], which is not trivial as far as I can tell, unless you modify the JAXB object in such a way that it accepts a String instead of the byte[], which is kind of hacky (if your JAXB objects are generated from an XML schema, this might really become a pain to implement).
All in all, this is much harder - probably too much if you are not really hitting the wall, performance-wise, on this particular issue.
A few other things
Are your files pure binary, or are they actually text? If they are text, you may benefit from using a CDATA section on the XML side instead of base64.
Is your XML actually a SOAP call? If so, and if the service supports MTOM, you could avoid base64 completely, but that is an altogether different subject.

Ways to buffer REST response

There's a REST endpoint, which serves large (tens of gigabytes) chunks of data to my application.
The application processes the data at its own pace, and as incoming data volumes grow, I'm starting to hit the REST endpoint timeout.
Meaning, the processing speed is less than the network throughput.
Unfortunately, there's no way to raise the processing speed enough, as there's no "enough" - incoming data volumes may grow indefinitely.
I'm thinking of a way to store incoming data locally before processing it, in order to release the REST endpoint connection before the timeout occurs.
What I've come up with so far is downloading the incoming data to a temporary file and reading (processing) said file simultaneously, using an OutputStream/InputStream.
Sort of buffering, using a file.
This brings its own problems:
what if the processing speed becomes faster than the downloading speed for some time and I get an EOF?
the file parser operates with an ObjectInputStream and it behaves weirdly in the case of an empty file/EOF
and so on
Are there conventional ways to do such a thing?
Are there alternative solutions?
Please provide some guidance.
Update:
I'd like to point out: the HTTP server is out of my control.
Consider it to be a vendor data provider. They have many consumers and refuse to alter anything for just one.
Looks like we're the only ones to use all of their data, as our client app's processing speed is far greater than their sample client performance metrics. Still, we cannot match our app's performance to the network throughput.
Server does not support http range requests or pagination.
There's no way to divide the data into chunks to load, as there's no filtering attribute to guarantee that every chunk will be small enough.
In short: we can download all the data in the given time before the timeout occurs, but we cannot process it.
Having an adapter between the InputStream and OutputStream, to perform as a blocking queue, will help a ton.
You're using something like new ObjectInputStream(new FileInputStream(...)), and the solution for EOF could be wrapping the FileInputStream first in a WriterAwareStream, which would block when hitting EOF as long as the writer is still writing.
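There is no WriterAwareStream in the JDK; a crude, polling-based sketch of what such a wrapper could look like (just to show the idea, not production code):

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Wraps the FileInputStream the parser reads from; instead of reporting EOF it waits
// until the download thread has appended more bytes, or has declared itself finished.
public class WriterAwareStream extends FilterInputStream {
    private volatile boolean writerFinished = false;

    public WriterAwareStream(InputStream in) {
        super(in);
    }

    // The download thread calls this once the whole file has been written.
    public void markWriterFinished() {
        writerFinished = true;
    }

    @Override
    public int read() throws IOException {
        byte[] one = new byte[1];
        int n = read(one, 0, 1);
        return n == -1 ? -1 : (one[0] & 0xFF);
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        while (true) {
            int n = super.read(b, off, len);
            if (n != -1) {
                return n;
            }
            if (writerFinished) {
                return super.read(b, off, len);   // one last attempt, then a genuine EOF
            }
            try {
                Thread.sleep(50);                  // crude polling; a lock/condition would be cleaner
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException("interrupted while waiting for the writer", e);
            }
        }
    }
}

The downloading thread calls markWriterFinished() after its last write, so the ObjectInputStream on top only ever sees EOF when the data really is over.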
Anyway, in case latency doesn't matter much, I would not bother starting processing before the download has finished. Oftentimes, there isn't much you can do with an incomplete list of objects.
Maybe some memory-mapped-file-based queue like Chronicle-Queue may help you. It's faster than dealing with files directly and may be even simpler to use.
You could also implement a HugeBufferingInputStream internally using a queue, which reads from its input stream, and, in case it has a lot of data, it spits them out to disk. This may be a nice abstraction, completely hiding the buffering.
There's also FileBackedOutputStream in Guava, which automatically switches from using memory to using a file when the data gets big, but I'm afraid it's optimized for small sizes (with tens of gigabytes expected, there's no point in trying to use memory).
Are there alternative solutions?
If your consumer (the http client) is having trouble keeping up with the stream of data, you might want to look at a design where the client manages its own work in progress, pulling data from the server on demand.
RFC 7233 describes Range Requests:
devices with limited local storage might benefit from being able to request only a subset of a larger representation, such as a single page of a very large document, or the dimensions of an embedded image
HTTP Range requests on the MDN Web Docs site might be a more approachable introduction.
This is the sort of thing that queueing servers are made for. RabbitMQ, Kafka, Kinesis, any of those. Perhaps KStream would work. With everything you get from the HTTP server (given your constraint that it cannot be broken up into units of work), you could partition it into chunks of bytes of some reasonable size, maybe 1024kB. Your application would push/publish those records/messages to the topic/queue. They would all share some common series ID so you know which chunks match up, and each would need to carry an ordinal so they can be put back together in the right order; with a single Kafka partition you could probably rely upon offsets. You might publish a final record for that series with a "done" flag that would act as an EOF for whatever is consuming it. Of course, you'd send an HTTP response as soon as all the data is queued, though it may not necessarily be processed yet.
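A rough sketch of that chunking publisher with the plain Kafka producer API; the topic name, chunk size, and header key are all made up:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import org.apache.kafka.common.serialization.StringSerializer;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Properties;

public class ChunkPublisher {
    public static void publish(InputStream body, String seriesId) throws IOException {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());

        try (Producer<String, byte[]> producer = new KafkaProducer<>(props)) {
            byte[] buffer = new byte[1024 * 1024];   // ~1 MB chunks
            long ordinal = 0;
            int read;
            while ((read = body.read(buffer)) != -1) {
                // key = series ID, so every chunk of this payload lands on the same partition and keeps its order
                ProducerRecord<String, byte[]> chunk =
                        new ProducerRecord<>("large-payloads", seriesId, Arrays.copyOf(buffer, read));
                chunk.headers().add("ordinal", Long.toString(ordinal++).getBytes(StandardCharsets.UTF_8));
                producer.send(chunk);
            }
            // empty record as the "done" marker, acting as an EOF for the consumer
            producer.send(new ProducerRecord<>("large-payloads", seriesId, new byte[0]));
        }
    }
}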
Not sure if this would help in your case because you haven't mentioned what structure and format the data comes to you in; however, I'll assume a beautifully normalised, deeply nested hierarchical XML (i.e. pretty much the worst case for streaming, right?).
I propose a partial solution that could allow you to sidestep the limitation of not being able to control how your client interacts with the HTTP data server:
deploy your own webserver, in whatever contemporary tech you please (which you do control) - your local server will sit in front of your locally cached copy of the data
periodically download the output of the webservice using a built-in HTTP querying library, a command-line util such as aria2c, curl, wget et al., an ETL (or whatever you please) directly onto a local device-backed .xml file - this happens as often as it needs to
point your REST client to your own-hosted 127.0.0.1/modern_gigabyte_large/get... 'smart' server, instead of the old api.vendor.com/last_tested_on_megabytes/get... server
some thoughts:
you might need to refactor your data model to indicate that the XML webservice data that you and your clients are consuming was dated at the last successful run^ (i.e. update this date when the next ingest process completes)
it would be theoretically possible for you to transform the underlying XML on the way through to better yield records in a streaming fashion to your webservice client (if you're not already doing this), but this would take effort - I could discuss this more if a sample of the data structure were provided
all of this work can run in parallel to your existing application, which continues on your last version of the successfully processed 'old data' until the next version of 'new data' is available
^
in trade, you will now need to manage a 'sliding window' of data files, where each 'result' is a specific instance of your app downloading the webservice data and storing it on disc, then successfully ingesting it into your model:
the last (two?) good result(s), compressed (in my experience, gigabytes of XML pack down a helluva lot)
next pending/ provisional result while you're streaming to disc/ doing an integrity check/ ingesting data - (this becomes the current 'good' result, and the last 'good' result becomes the 'previous good' result)
if we assume that you're ingesting into a relational db, the current (and maybe previous) tables with the webservice data loaded into your app, and the next pending table
switching these around becomes a metadata operation, but now your database must store at least webservice data x2 (or x3 - whatever fits in your limitations)
... yes you don't need to do this, but you'll wish you did after something goes wrong :)
Looks like we're the only ones to use all of their data
This implies that there is some way for you to partition or limit the webservice feed - how are the other clients discriminating so as not to receive the full monty?
You can use in-memory caching techniques OR you can use Java 8 streams. Please see the following link for more info:
https://www.conductor.com/nightlight/using-java-8-streams-to-process-large-amounts-of-data/
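Assuming the payload has first been buffered to a local file and is line-delimited, the Java 8 streams idea boils down to processing it lazily instead of loading it all into memory (a sketch; the file name is a placeholder):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

// "dump.json" is a placeholder for the locally buffered response
try (Stream<String> lines = Files.lines(Paths.get("dump.json"))) {
    lines.forEach(line -> {
        // parse/process one record at a time instead of holding the whole payload
    });
}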
Camel could maybe help you regulate the network load between the REST producer and consumer?
You might, for instance, introduce a Camel endpoint acting as a proxy in front of the real REST endpoint, applying some throttling policy before forwarding to the real endpoint:
from("http://localhost:8081/mywebserviceproxy")
.throttle(...)
.to("http://myserver.com:8080/myrealwebservice);
http://camel.apache.org/throttler.html
http://camel.apache.org/route-throttling-example.html
My 2 cents,
Bernard.
If you have enough memory, maybe you can use an in-memory data store like Redis.
When you get data from your REST endpoint, you can save it into a Redis list (or any other data structure that is appropriate for you).
Your consumer will consume data from the list.
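With the Jedis client, for example, that producer/consumer split could look roughly like this (the list name, host, and helper names are made up):

import redis.clients.jedis.Jedis;

// Producer side: push each received chunk/record onto a Redis list
try (Jedis jedis = new Jedis("localhost", 6379)) {
    jedis.rpush("incoming-data", chunkAsString);   // chunkAsString: one unit of work
}

// Consumer side: pop items off the list at its own pace
try (Jedis jedis = new Jedis("localhost", 6379)) {
    String item;
    while ((item = jedis.lpop("incoming-data")) != null) {
        process(item);                             // process() is a placeholder for your logic
    }
}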

Parsing a giant JSON response

First the background info:
I'm using Commons HttpClient to make a GET request to the server.
The server responds with a JSON string.
I parse the string using org.json.
The problem:
Actually, everything works, that is, for small responses (smaller than 2^31 bytes = the max value of an integer, which limits getResponseBody and the StringBuilder). I, on the other hand, have a giant response (over several GB) and I'm getting stuck. I tried using getResponseBodyAsStream of HttpClient, but the response is so big that my system gets stuck. I tried using a String, a StringBuilder, even saving it to a file.
The question:
First, is this the right approach? If so, what is the best way to handle such a response? If not, how should I proceed?
If you ever have a response that can be on the order of gigabytes, you should parse the JSON as a stream, (almost) character by character, and avoid creating any String objects... (this is very important, because Java's stop-the-world garbage collection will cause your system to freeze for seconds if you constantly create a lot of garbage).
You can use SAXophone to create parsing logic.
You'll have to implement all the methods like onObjectStart, onObjectClose, onObjectKey, etc... it's hard at first, but once you take a look at the implementation of PrettyPrinter in the test packages you'll get the idea...
Once properly implemented, you can handle an infinite stream of data ;)
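I don't have a SAXophone snippet at hand, but the same event-driven, no-accumulation style with Jackson's streaming JsonParser (a different library, shown only to illustrate the approach; the field name is made up) looks like this:

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import java.io.InputStream;

// 'in' is the raw response stream, e.g. from getResponseBodyAsStream()
try (JsonParser parser = new JsonFactory().createParser(in)) {
    JsonToken token;
    while ((token = parser.nextToken()) != null) {
        if (token == JsonToken.FIELD_NAME && "record".equals(parser.getCurrentName())) {
            // react to each field/value as it streams past; nothing is accumulated
        }
    }
}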
P.S. this is designed for HFT, so it's all about performance and no garbage...

How to write more than 30 MB of data in XML?

First of all, sorry if I'm repeating this question, but I can't find any relevant solutions to my problem.
I'm facing difficulty finding a way to solve the issues below.
1) I'm facing a scenario where I have to write 30 MB - 400 MB of data to an XML file. When I use a 'String' object to append the data to the XML, I get an 'OutOfMemory' exception.
After spending more time doing R&D, I came to understand that using a 'Stream' would resolve this issue, but I'm not sure about this.
2) Once I have constructed the XML, I have to send this data to the DMZ server from Android devices. As far as I know, sending a large amount of data over HTTP is difficult in this situation. In this case,
a) Would using FTP be helpful in this scenario?
b) Would splitting the data into chunks and sending them be helpful?
Kindly let me know your suggestions. Thanks in advance.
I would consider zipping up the data before FTPing it across. You could use a ZipOutputStream.
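A minimal sketch of that (the file names are illustrative):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

try (ZipOutputStream zip = new ZipOutputStream(new FileOutputStream("payload.zip"));
     FileInputStream xml = new FileInputStream("payload.xml")) {
    zip.putNextEntry(new ZipEntry("payload.xml"));
    byte[] buf = new byte[8192];
    int n;
    while ((n = xml.read(buf)) != -1) {
        zip.write(buf, 0, n);
    }
    zip.closeEntry();
}
// payload.zip is what you then transfer (XML usually compresses very well)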
For the OutOfMemory exception, you could consider increasing the heap size.
Check this : Increase heap size in Java
Can you post some values of heap size you tried, your code and some exception traces?
Use StAX or SAX. These can create XML of any size because they write the XML parts they generate to an OutputStream on the fly.
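For example, with StAX you write elements straight to the stream, so memory use stays flat no matter how large the document gets (the element names and the loop are just placeholders for your real data):

import java.io.FileOutputStream;
import java.io.OutputStream;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

try (OutputStream out = new FileOutputStream("data.xml")) {
    XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out, "UTF-8");
    writer.writeStartDocument("UTF-8", "1.0");
    writer.writeStartElement("records");
    for (int i = 0; i < 1000000; i++) {            // stand-in for your real data source
        writer.writeStartElement("record");
        writer.writeCharacters("row " + i);
        writer.writeEndElement();
    }
    writer.writeEndElement();
    writer.writeEndDocument();
    writer.flush();
    writer.close();
}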
What you should do is:
First, use an XML parser to read and write data in XML format; it could be SAX or DOM. If the data size is huge, try CSV format; it will take less space since you do not have to store the XML tags.
Second, when creating output files, make sure they are small files.
Third, when sending over the network, make sure you have zipped everything.
And for God's sake, don't eat up the user's mobile data cap with this design. Warn the user about the file size and suggest they use a WiFi network.

Which format to create logs in profilers?

In the profiler I am writing, which is in fact a JVMTI agent for Java programs, I need a format for logging the events collected. Furthermore, these logs have to be sent over a socket and read by a GUI somewhere else, so I need serialization that works between two languages.
I already implemented my own protocol in XML and it worked very well. However, I was told to consider another format, as building XML might be very slow, and every additional piece of code executed in the profiler heavily influences the profiled program. This is true, but does XML DOM building take that long?
I have used TinyXML so far. I hope no one points to RapidXML, as I expect they are not that different on a non-embedded machine.
What do you think? Currently I am trying to reimplement it with protobuf, which claims to be n times faster than XML.
I have a design I am working on for all log files in my remit. I record data in JSON, but the JSON data is nested in a very simple XML format.
e.g.
<entry ts="2011-02-23T17:18:19.202" level="trc_1" typ="trace">New Message Received</entry>
<entry ts="2011-02-23T17:18:19.202" level="trace" typ="msg"><data>{"Name":"AgtConf","AgtId":1111,...}</data></entry>
That way I can easily separate out data and logging, but keep the logging directory from getting complicated. It also saves having to write my own parser for a custom format. However, given your situation, I recommend using JSON only, given that you are basically using it to serialise. JSON is very human-readable when it is formatted correctly, it can be very concise, and there are stable parsers for it.
My first choice is always the traditional txt file.
You can append new entries at the end of the file (at the bottom).
