Java - Getting JSON by parts

Generally, is there a way to get a big JSON string in parts over a single request?
For example, if I have a JSON string consisting of three big objects of 1 MB each, can I somehow, in a single request, get the first 1 MB and parse it while the remaining objects are still being downloaded, instead of waiting for the full 3 MB string to download?

If you know how big the parts are, it would be possible to split your request into three using HTTP/1.1 range requests. Assuming your ranges are defined correctly, you should be able to get the JSON objects directly from the server (if the server supports range requests).
Note that this hinges on a) the server's capability to handle range requests, b) the idempotency of your REST operation (it could very well run the call three times, a cache or reverse proxy may help with this) and c) your ability to know the ranges before you call.
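As a rough sketch of the idea (not from the original answer): with Java 11's HttpClient you could request just the first part via a Range header, assuming the server honours ranges and you know the byte offsets in advance. The URL and the 1 MB boundary below are placeholders.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RangeFetch {

    // Fetches only the first megabyte of the resource; the remaining ranges can be
    // requested the same way (possibly in parallel) while this part is being parsed.
    public static byte[] fetchFirstPart(String url) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Range", "bytes=0-1048575")               // first 1 MB
                .GET()
                .build();
        HttpResponse<byte[]> response =
                client.send(request, HttpResponse.BodyHandlers.ofByteArray());
        if (response.statusCode() != 206) {                       // 206 Partial Content
            throw new IllegalStateException("server did not honour the range request");
        }
        return response.body();
    }
}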

Related

Get the size of objects on S3 with a single API call

I have a Java application that extracts and compresses quite a few objects on S3 through streaming. To make it more efficient, the application does not download objects to the local disk and upload them again; instead it streams the files in 5 MB chunks and compresses them on the fly. The challenge I am facing is that, in order to report progress on this operation, I need to know the total size of all the objects and use a counter of how much of that total has been handled as the basis for calculating progress.
The problem is that to get the size of the objects I would need to iterate through all of them first, fetching the size one by one and summing it before starting the process. That is going to be too slow, as there might be millions of objects, which means millions of API calls. If I calculate the total size before starting the compression, the calculation will take longer than the actual compression, which defeats the whole purpose. Therefore, I was wondering if there is any way I can pass a list of objects in a single API call and receive the total size. I know I can match all objects under a given prefix, but since the objects may be stored under different prefixes, this approach will not work.
The following code snippet is how I can get the object size one by one:
public Long getObjectSize(AmazonS3Client amazonS3Client, String bucket, String key)
        throws IOException {
    return amazonS3Client.getObjectMetadata(bucket, key).getContentLength();
}
NOTE: If I relied on the number of objects to calculate the progress, that wouldn't be accurate at all. Some objects are 2-3KB and some of them are quite big (1-2GB).
You could use the Java 8 Stream API to iterate and sum the values, or
you could use the AmazonCloudWatch API to get the BucketSizeBytes metric.
So you would call listMetrics and then request the BucketSizeBytes metric through GetMetricData.
Here are the documentation links:
https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/cloudwatch/AmazonCloudWatch.html#listMetrics-com.amazonaws.services.cloudwatch.model.ListMetricsRequest-
https://docs.aws.amazon.com/AmazonS3/latest/dev/cloudwatch-monitoring.html
Here are some AmazonCloudWatch examples:
https://www.javatips.net/api/com.amazonaws.services.cloudwatch.model.metric
https://www.programcreek.com/java-api-examples/?api=com.amazonaws.services.cloudwatch.AmazonCloudWatchClient
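For reference, a minimal sketch using the AWS SDK for Java v1 and getMetricStatistics (one of the calls the links above document). It assumes the daily S3 storage metrics are already published to CloudWatch, and StandardStorage is just one possible storage-type dimension.

import com.amazonaws.services.cloudwatch.AmazonCloudWatch;
import com.amazonaws.services.cloudwatch.AmazonCloudWatchClientBuilder;
import com.amazonaws.services.cloudwatch.model.Datapoint;
import com.amazonaws.services.cloudwatch.model.Dimension;
import com.amazonaws.services.cloudwatch.model.GetMetricStatisticsRequest;
import com.amazonaws.services.cloudwatch.model.GetMetricStatisticsResult;
import com.amazonaws.services.cloudwatch.model.StandardUnit;
import com.amazonaws.services.cloudwatch.model.Statistic;

import java.util.Date;

public class BucketSizeFetcher {

    // Returns the latest daily BucketSizeBytes datapoint for the bucket,
    // or -1 if CloudWatch has not published the metric yet.
    public static double getBucketSizeBytes(String bucketName) {
        AmazonCloudWatch cloudWatch = AmazonCloudWatchClientBuilder.defaultClient();

        GetMetricStatisticsRequest request = new GetMetricStatisticsRequest()
                .withNamespace("AWS/S3")
                .withMetricName("BucketSizeBytes")
                .withDimensions(
                        new Dimension().withName("BucketName").withValue(bucketName),
                        new Dimension().withName("StorageType").withValue("StandardStorage"))
                .withStartTime(new Date(System.currentTimeMillis() - 2 * 24 * 60 * 60 * 1000L))
                .withEndTime(new Date())
                .withPeriod(86400)                       // one datapoint per day
                .withStatistics(Statistic.Average)
                .withUnit(StandardUnit.Bytes);

        GetMetricStatisticsResult result = cloudWatch.getMetricStatistics(request);
        return result.getDatapoints().stream()
                .mapToDouble(Datapoint::getAverage)
                .max()
                .orElse(-1);
    }
}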
UPDATE:
As I mentioned in one of the comments, you could also use the command line interface.
In that case you still go through CloudWatch, but via the AWS CLI, and you receive the response in JSON format.
One of the links above has an example; here it is:
aws cloudwatch get-metric-statistics --metric-name BucketSizeBytes --namespace AWS/S3 --start-time 2016-10-19T00:00:00Z --end-time 2016-10-20T00:00:00Z --statistics Average --unit Bytes --region us-west-2 --dimensions Name=BucketName,Value=ExampleBucket Name=StorageType,Value=StandardStorage --period 86400 --output json
This other link has more explanations:
http://cloudsqale.com/2018/10/08/s3-monitoring-step-1-bucket-size-and-number-of-objects/
In summary, using CloudWatch seems to be the easiest way to avoid making millions of iterative calls.

Ways to buffer REST response

There's a REST endpoint, which serves large (tens of gigabytes) chunks of data to my application.
The application processes the data at its own pace, and as incoming data volumes grow, I'm starting to hit the REST endpoint's timeout.
In other words, processing speed is less than network throughput.
Unfortunately, there's no way to raise processing speed enough, because there is no "enough": incoming data volumes may grow indefinitely.
I'm thinking of a way to store incoming data locally before processing it, in order to release the REST endpoint connection before the timeout occurs.
What I've come up with so far is downloading the incoming data to a temporary file and reading (processing) that file simultaneously using an OutputStream/InputStream.
Sort of buffering, using a file.
This brings its own problems:
- what if processing becomes faster than downloading for some time and I hit EOF?
- the file parser operates on an ObjectInputStream, and it behaves strangely on an empty file/EOF
- and so on
Are there conventional ways to do such a thing?
Are there alternative solutions?
Please provide some guidance.
Upd:
I'd like to point out: the HTTP server is out of my control.
Consider it to be a vendor data provider. They have many consumers and refuse to alter anything for just one.
Looks like we're the only ones to use all of their data, as our client app's processing speed is far greater than their sample client performance metrics. Still, we cannot match our app's performance with the network throughput.
Server does not support http range requests or pagination.
There's no way to divide data in chunks to load, as there's no filtering attribute to guarantee that every chunk will be small enough.
In short: we can download all the data within the timeout, but we cannot process it in that time.
Having an adapter between the InputStream and the OutputStream, performing as a blocking queue, would help a ton.
You're using something like new ObjectInputStream(new FileInputStream(...)), and the solution for EOF could be wrapping the FileInputStream first in a WriterAwareStream, which would block when hitting EOF as long as the writer is still writing.
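There is no standard class by that name; a hypothetical WriterAwareStream might look roughly like this, assuming the downloading thread flips a shared flag when it finishes writing the temp file (a FileInputStream will return more data on later reads if the file has grown in the meantime):

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical "writer-aware" stream: reads block on EOF until the
// downloading thread signals that it has finished writing the file.
public class WriterAwareStream extends FilterInputStream {

    private final AtomicBoolean writerFinished;

    public WriterAwareStream(InputStream in, AtomicBoolean writerFinished) {
        super(in);
        this.writerFinished = writerFinished;
    }

    @Override
    public int read() throws IOException {
        byte[] one = new byte[1];
        int n = read(one, 0, 1);
        return n == -1 ? -1 : (one[0] & 0xff);
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n;
        while ((n = super.read(buf, off, len)) == -1 && !writerFinished.get()) {
            try {
                Thread.sleep(100);        // crude back-off; a Condition would be nicer
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException("interrupted while waiting for writer", e);
            }
        }
        return n;
    }
}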
Anyway, if latency doesn't matter much, I would not bother starting to process before the download has finished. Oftentimes, there isn't much you can do with an incomplete list of objects.
Maybe some memory-mapped-file-based queue like Chronicle-Queue may help you. It's faster than dealing with files directly and may be even simpler to use.
You could also implement a HugeBufferingInputStream internally using a queue, which reads from its input stream, and, in case it has a lot of data, it spits them out to disk. This may be a nice abstraction, completely hiding the buffering.
There's also FileBackedOutputStream in Guava, which automatically switches from memory to a file when the data gets big, but I'm afraid it's optimized for small sizes (with tens of gigabytes expected, there's no point in trying to use memory).
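For completeness, a tiny sketch of the Guava option; the 8 MB threshold is arbitrary, and note that this variant copies the whole source before handing back an InputStream, so it only releases the HTTP connection early rather than overlapping download and processing:

import com.google.common.io.FileBackedOutputStream;

import java.io.IOException;
import java.io.InputStream;

public class GuavaBufferedCopy {

    // Buffers up to 8 MB in memory, then transparently spills to a temp file.
    public static InputStream bufferLocally(InputStream source) throws IOException {
        FileBackedOutputStream out = new FileBackedOutputStream(8 * 1024 * 1024);
        try {
            source.transferTo(out);          // Java 9+; copies the entire response body
        } finally {
            out.close();
        }
        return out.asByteSource().openStream();
    }
}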
Are there alternative solutions?
If your consumer (the http client) is having trouble keeping up with the stream of data, you might want to look at a design where the client manages its own work in progress, pulling data from the server on demand.
RFC 7233 describes Range Requests:
devices with limited local storage might benefit from being able to request only a subset of a larger representation, such as a single page of a very large document, or the dimensions of an embedded image
HTTP Range requests on the MDN Web Docs site might be a more approachable introduction.
This is the sort of thing that queueing servers are made for. RabbitMQ, Kafka, Kinesis, any of those. Perhaps KStream would work. With everything you get from the HTTP server (given your constraint that it cannot be broken up into units of work), you could partition it into chunks of bytes of some reasonable size, maybe 1024kB. Your application would push/publish those records/messages to the topic/queue. They would all share some common series ID so you know which chunks match up, and each would need to carry an ordinal so they can be put back together in the right order; with a single Kafka partition you could probably rely upon offsets. You might publish a final record for that series with a "done" flag that would act as an EOF for whatever is consuming it. Of course, you'd send an HTTP response as soon as all the data is queued, though it may not necessarily be processed yet.
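A minimal sketch of that chunk-publishing idea with the Kafka Java client; the broker address, topic and 1 MB chunk size are assumptions, and using the series ID as the record key keeps all chunks of one download on the same partition so the consumer can rely on offset order:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import org.apache.kafka.common.serialization.StringSerializer;

import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.Properties;
import java.util.UUID;

public class ChunkPublisher {

    private static final int CHUNK_SIZE = 1024 * 1024;   // ~1 MB per record

    public static void publish(InputStream httpBody, String topic) throws IOException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");                 // assumed broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", ByteArraySerializer.class.getName());

        String seriesId = UUID.randomUUID().toString();                   // ties the chunks together

        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            byte[] buf = new byte[CHUNK_SIZE];
            int n;
            while ((n = httpBody.read(buf)) != -1) {
                // same key => same partition => offsets give the chunk order
                producer.send(new ProducerRecord<>(topic, seriesId, Arrays.copyOf(buf, n)));
            }
            // zero-length record acts as the "done"/EOF marker for this series
            producer.send(new ProducerRecord<>(topic, seriesId, new byte[0]));
            producer.flush();
        }
    }
}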
not sure if this would help in your case because you haven't mentioned what structure & format the data are coming to you in, however, i'll assume a beautifully normalised, deeply nested hierarchical xml (ie. pretty much the worst case for streaming, right? ... pega bix?)
i propose a partial solution that could allow you to sidestep the limitation of your not being able to control how your client interacts with the http data server -
deploy your own webserver, in whatever contemporary tech you please (which you do control) - your local server will sit in front of your locally cached copy of the data
periodically download the output of the webservice using a built-in http querying library, a command-line util such as aria2c, curl, wget et al., an etl (or whatever you please) directly onto a local device-backed .xml file - this happens as often as it needs to
point your rest client to your own-hosted 127.0.0.1/modern_gigabyte_large/get... 'smart' server, instead of the old api.vendor.com/last_tested_on_megabytes/get... server
some thoughts:
you might need to refactor your data model to indicate that the xml webservice data that you and your clients are consuming was dated at the last successful run^ (ie. update this date when the next ingest process completes)
it would be theoretically possible for you to transform the underlying xml on the way through to better yield records in a streaming fashion to your webservice client (if you're not already doing this) but this would take effort - i could discuss this more if a sample of the data structure was provided
all of this work can run in parallel to your existing application, which continues on your last version of the successfully processed 'old data' until the next version 'new data' are available
^
in trade you will now need to manage a 'sliding window' of data files, where each 'result' is a specific instance of your app downloading the webservice data and storing it on disc, then successfully ingesting it into your model:
last (two?) good result(s) compressed (in my experience, gigabytes of xml packs down a helluva lot)
next pending/ provisional result while you're streaming to disc/ doing an integrity check/ ingesting data - (this becomes the current 'good' result, and the last 'good' result becomes the 'previous good' result)
if we assume that you're ingesting into a relational db, the current (and maybe previous) tables with the webservice data loaded into your app, and the next pending table
switching these around becomes a metadata operation, but now your database must store at least webservice data x2 (or x3 - whatever fits in your limitations)
... yes you don't need to do this, but you'll wish you did after something goes wrong :)
Looks like we're the only ones to use all of their data
this implies that there is some way for you to partition or limit the webservice feed - how are the other clients discriminating so as not to receive the full monty?
You can use in-memory caching techniques OR you can use Java 8 streams. Please see the following link for more info:
https://www.conductor.com/nightlight/using-java-8-streams-to-process-large-amounts-of-data/
Maybe Camel could help you regulate the network load between the REST producer and the consumer?
You might for instance introduce a Camel endpoint acting as a proxy in front of the real REST endpoint, apply some throttling policy, and then forward to the real endpoint:
from("http://localhost:8081/mywebserviceproxy")
.throttle(...)
.to("http://myserver.com:8080/myrealwebservice);
http://camel.apache.org/throttler.html
http://camel.apache.org/route-throttling-example.html
My 2 cents,
Bernard.
If you have enough memory, maybe you can use an in-memory data store like Redis.
When you get data from your REST endpoint, you can save it into a Redis list (or any other data structure that is appropriate for you).
Your consumer will then consume the data from that list.
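A minimal sketch with the Jedis client, assuming a list key named rest-buffer and roughly 1 MB chunks; BLPOP on the consumer side blocks until the next chunk arrives:

import redis.clients.jedis.Jedis;

import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.List;

public class RedisBuffer {

    private static final byte[] LIST_KEY = "rest-buffer".getBytes();   // assumed key name

    // Producer side: push ~1 MB chunks of the REST response onto a Redis list.
    public static void buffer(InputStream httpBody, Jedis jedis) throws IOException {
        byte[] buf = new byte[1024 * 1024];
        int n;
        while ((n = httpBody.read(buf)) != -1) {
            jedis.rpush(LIST_KEY, Arrays.copyOf(buf, n));
        }
    }

    // Consumer side: block until a chunk is available, then return it.
    public static byte[] nextChunk(Jedis jedis) {
        List<byte[]> keyAndValue = jedis.blpop(0, LIST_KEY);   // returns [key, value]
        return keyAndValue.get(1);
    }
}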

parsing giant json response

First the background info:
I'm using Commons HttpClient to make a GET request to the server.
The server responds with a JSON string.
I parse the string using org.json.
The problem:
Actually everything works, at least for small responses (smaller than 2^31 bytes, the maximum value of an int, which limits getResponseBody and the StringBuilder). I, on the other hand, have a giant response (several GB) and I'm getting stuck. I tried using getResponseBodyAsStream of HttpClient, but the response is so big that my system grinds to a halt. I tried using a String, a StringBuilder, and even saving it to a file.
The question:
First, is this the right approach, and if so, what is the best way to handle such a response? If not, how should I proceed?
If you may ever have a response on the order of gigabytes, you should parse the JSON as a stream, (almost) character by character, and avoid creating String objects... (This is important because Java's stop-the-world garbage collection can freeze your system for seconds if you constantly create a lot of garbage.)
You can use SAXophone to create the parsing logic.
You'll have to implement all the methods like onObjectStart, onObjectClose, onObjectKey etc... It's hard at first, but once you take a look at the implementation of PrettyPrinter in the test packages you'll get the idea...
Once properly implemented, you can handle an infinite stream of data ;)
P.S. this is designed for HFT so it's all about performance and no garbage...
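I can't vouch for SAXophone's exact API, but the same token-by-token idea can be sketched with Jackson's streaming JsonParser (the "id" field below is just a made-up example of what you might pull out of each object):

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;

import java.io.IOException;
import java.io.InputStream;

public class StreamingJsonReader {

    // Walks a huge JSON document token by token without materialising it in memory.
    public static void process(InputStream body) throws IOException {
        JsonFactory factory = new JsonFactory();
        try (JsonParser parser = factory.createParser(body)) {
            while (parser.nextToken() != null) {
                JsonToken token = parser.getCurrentToken();
                if (token == JsonToken.FIELD_NAME && "id".equals(parser.getCurrentName())) {
                    parser.nextToken();                    // move to the value
                    handleId(parser.getText());            // process and discard
                }
            }
        }
    }

    private static void handleId(String id) {
        // application-specific handling; the value is garbage-collected right after
    }
}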

Web services response conversion issue

I am using a number of web services in my system and generally handle them fine. But recently I have been seeing strange behaviour while handling the response of one of my external systems.
Here is my problem:
When I request data from one of my downstream systems, I get a response with one very big XML. While parsing the response, the whole Java thread got stuck for longer than the configured time. As a temporary fix, we asked the downstream system to limit the response size.
But how is this happening? Regardless of how big the data is, the unmarshalling process should complete, right?
So may I know what the root cause of this issue is?
If you are unmarshalling, then the whole XML will be converted to one object graph containing all the objects specified in the XML. So the bigger the XML, the bigger the resulting object graph. Of course this takes more memory, perhaps more than your application has at its disposal, which could lead to an OutOfMemoryError.
If the XML received contains some kind of a list of items you can consider handling it item by item. You will read in one item at a time and then process it and dispose of it. You will then need only the amount of memory to fit one item's object graph in memory. But to do this you would have to rewrite your processing code to use a library like SAX.
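A rough sketch of that item-by-item approach with the JDK's built-in SAX parser; the "item" element name and the handleItem callback are made up for illustration:

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import java.io.InputStream;

public class ItemByItemParser {

    public static void parse(InputStream xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(xml, new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                text.setLength(0);                    // start collecting a new element's text
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("item".equals(qName)) {           // assumed element name
                    handleItem(text.toString());      // process one item, then let it go
                }
            }
        });
    }

    private static void handleItem(String item) {
        // application-specific processing of a single item
    }
}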

Shorten a String

Is there a better way to shorten (use fewer characters for) a String in Java besides converting the chars to ints and running them through base 36?
For example, say I wanted to shorten a URL.
Short URL services (like 'tinyurl') work by storing a big database table that maps from short URLs to their full form.
When you request a tinyurl, the service allocates a random-looking short url (that is not currently in use) and creates an entry in its table that maps from the short url to your supplied longer one.
When you try to load the short url in a browser, the request first goes to the tinyURL service, which looks up the full URL and then sends an HTTP redirect response to the browser telling it to go to the real URL.
You can implement your own URL shortening service by doing the same thing, though if you are shortening your own URLs you can maybe do the redirection internally to your web server; e.g. using a servlet request filter.
I described the above in the context of shortening URLs in a way that still allows the URLs to be resolved1. But this approach can also be used more generally; i.e. by creating a pair of Map<String,String> objects and populating them with bidirectional mappings between sequentially generated short strings and the original (probably longer) strings. It is possible to prove that this will give a smaller average size of short string than any algorithmic compression or encoding scheme over the same set of long strings.
The downside is the space needed to store the mappings, and the fact that you need the mappings any place (e.g. on any computer) where you need to do the short-to-long or long-to-short conversions.
1 - When you think about it, that is essential. If you shorten a URL string and the result is no longer resolvable, it is not a useful URL for most purposes.
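A minimal in-memory sketch of that bidirectional-map idea (a real service would persist the maps in a database, as described above):

import java.util.HashMap;
import java.util.Map;

public class StringShortener {

    private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789";

    private final Map<String, String> shortToLong = new HashMap<>();
    private final Map<String, String> longToShort = new HashMap<>();
    private long counter = 0;

    public synchronized String shorten(String longString) {
        return longToShort.computeIfAbsent(longString, s -> {
            String key = encode(counter++);          // sequentially generated short key
            shortToLong.put(key, s);
            return key;
        });
    }

    public synchronized String resolve(String shortString) {
        return shortToLong.get(shortString);
    }

    // Base-36-style encoding of the counter into a short string.
    private String encode(long n) {
        StringBuilder sb = new StringBuilder();
        do {
            sb.append(ALPHABET.charAt((int) (n % ALPHABET.length())));
            n /= ALPHABET.length();
        } while (n > 0);
        return sb.toString();
    }
}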
Since URLs are UTF-8, and since the characters are therefore base 256, encoding the same characters as integer code points in base 32 can only make them longer. Or are you not asking what it sounds like you are asking?
Further, in Java, Strings are base-65536 UTF-16, so encoding their code points in base 32 will make Java strings even longer.
Just as encoding binary data in base 64 makes it longer by a factor of 4/3: every 3 bytes requires 4 base-64 bytes to encode.
Put the full URLs in a database and give the id as the redirect URL.
