How can I more efficiently download large files over http? - java

I'm trying to download large files (<1GB) in Kotlin. Since I already knew it, I'm using okhttp and pretty much just used the answer from this question, except that I'm using Kotlin instead of Java, so the syntax is slightly different.
val client = OkHttpClient()
val request = Request.Builder().url(urlString).build()
val response = client.newCall(request).execute()
val input = BufferedInputStream(response.body().byteStream())
val output = FileOutputStream(file)
val data = ByteArray(1024)
var total = 0L
var count = input.read(data)
while (count != -1) {
    total += count
    output.write(data, 0, count)
    count = input.read(data)
}
output.flush()
output.close()
input.close()
That works, in that it downloads the file without using too much memory, but it seems needlessly inefficient in that it constantly tries to read more data without knowing whether any new data has arrived.
That also seems to be confirmed by my own tests while running this on a very resource-limited VM: it uses more CPU while getting a lower download speed than a comparable script in Python, and of course than wget.
What I'm wondering is whether there is a way to register a callback that gets called when x bytes are available, or when the end of the file is reached, so I don't have to constantly try to get more data without knowing whether there is any.
Edit:
If it's not possible with okhttp I don't have a problem using something else; it's just that it's the HTTP library I'm used to.

As of version 11, Java has a built-in HttpClient which implements "asynchronous streams of data with non-blocking back pressure", and that's what you need if you want your code to run only when there's data to process.
If you can afford to upgrade to Java 11, you'll be able to solve your problem out of the box, using the HttpResponse.BodyHandlers.ofFile body handler. You won't have to implement any data transfer logic on your own.
Kotlin example:
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse
import java.nio.file.Paths

fun main(args: Array<String>) {
    val client = HttpClient.newHttpClient()
    val request = HttpRequest.newBuilder()
        .uri(URI.create("https://www.google.com"))
        .GET()
        .build()
    println("Starting download...")
    client.send(request, HttpResponse.BodyHandlers.ofFile(Paths.get("google.html")))
    println("Done with download.")
}
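For comparison, the same download can be started without blocking by using sendAsync instead of send, which is where the non-blocking behaviour mentioned above shows up. A minimal Java sketch (the URL and file name are placeholders):
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Paths;

public class AsyncDownload {
    public static void main(String[] args) {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://www.google.com"))
                .GET()
                .build();
        // sendAsync returns immediately; the body is streamed to the file as data arrives.
        client.sendAsync(request, HttpResponse.BodyHandlers.ofFile(Paths.get("google.html")))
              .thenAccept(response -> System.out.println("Done, status " + response.statusCode()))
              .join(); // block only so this demo does not exit before the download finishes
    }
}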

One could do away with the BufferedInputStream. Or, since its default buffer size in Oracle's Java is 8192, use a ByteArray larger than the current 1024, say 4096.
However, best would be to use java.nio or to try Files.copy:
Files.copy(is, file.toPath());
This removes about 12 lines of code.
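Applied to the question's setup, the whole manual read/write loop collapses into a single Files.copy call on the response's byte stream. A rough Java sketch (assuming okhttp 3.x and the question's urlString and file variables):
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;

static void download(OkHttpClient client, String urlString, File file) throws IOException {
    Request request = new Request.Builder().url(urlString).build();
    try (Response response = client.newCall(request).execute();
         InputStream in = response.body().byteStream()) {
        // Files.copy does the buffered read/write loop internally.
        Files.copy(in, file.toPath(), StandardCopyOption.REPLACE_EXISTING);
    }
}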
Another way is to send the request with the header Accept-Encoding: gzip, asking for gzip compression so the transmission takes less time. When the response then carries the header Content-Encoding: gzip, wrap the stream in a new GZIPInputStream(is). Or, if feasible, store the file compressed with an additional .gz ending: mybiography.md as mybiography.md.gz.
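For example, using plain HttpURLConnection rather than okhttp, the conditional decompression might look roughly like this (a sketch only, not code from the question):
import java.io.File;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;
import java.util.zip.GZIPInputStream;

static void downloadPossiblyGzipped(String url, File target) throws Exception {
    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    conn.setRequestProperty("Accept-Encoding", "gzip"); // ask the server to compress
    InputStream in = conn.getInputStream();
    if ("gzip".equalsIgnoreCase(conn.getContentEncoding())) {
        in = new GZIPInputStream(in); // decompress on the fly
    }
    Files.copy(in, target.toPath(), StandardCopyOption.REPLACE_EXISTING);
}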

Related

Get Zip file size over FTP Connection in Java

I'm sending a zip file over an FTP connection, so to fetch the file size I have used:
URLConnection conn = imageURL.openConnection();
long l = conn.getContentLengthLong();
But it returns -1
Similarly, for files sent over an HTTP request, I get the correct file size.
How do I get the correct file size over an FTP connection in this case?
"for files sent over an HTTP request, I get the correct file size"
MAYBE. URLConnection.getContentLength[Long] returns specifically the content-length header. HTTP (and HTTPS) supports several different ways of delimiting bodies, and depending on the HTTP options and versions the server implements, it might use a content-length header or it might not.
Somewhat similarly, an FTP server may provide the size of a 'retrieved' file at the beginning of the operation, or it may not. But it never uses a content-length header to do so, so getContentLength[Long] doesn't get it. However, the implementation code does store it internally if the server provides it, and it can be extracted by the following quite ugly hack:
URL url = new URL ("ftp://anonymous:dummy#192.168.56.2/pub/test");
URLConnection conn = url.openConnection();
try( InputStream is = conn.getInputStream() ){
if( ! conn.getClass().getName().equals("sun.net.www.protocol.ftp.FtpURLConnection") ) throw new Exception("conn wrong");
Field fld1 = conn.getClass().getDeclaredField("ftp");
fld1.setAccessible(true); Object ftp = fld1.get(conn);
if( ! ftp.getClass().getName().equals("sun.net.ftp.impl.FtpClient") ) throw new Exception ("ftp wrong");
Field fld2 = ftp.getClass().getDeclaredField("lastTransSize");
fld2.setAccessible(true); long size = fld2.getLong(ftp);
System.out.println (size);
}
Hacking undocumented internals may fail at any time, and versions of Java above 8 progressively discourage it: 9 through 15 give a warning message about illegal access unless you use --add-opens to permit it and 16 makes it an error (i.e. fatal). Unless you really need the size before reading the data (which implicitly gives you the size) I would not recommend this.
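If the size is only needed for reporting, one supported alternative is simply to count the bytes while downloading, since Files.copy returns the number of bytes copied. A sketch of that route (not part of the hack above):
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

static long downloadAndMeasure(String ftpUrl, Path target) throws Exception {
    URLConnection conn = new URL(ftpUrl).openConnection();
    try (InputStream in = conn.getInputStream()) {
        // Files.copy returns how many bytes were copied, i.e. the retrieved file's size.
        return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
    }
}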

SQS Message Size Exceeding 256 kB

I'm trying to send a json object (serialized as a string) into an SQS queue that triggers a lambda. The SQS message is exceeding the maximum 256 kB limit that SQS has. I was trying to gzip compress my message before sending it. Here is how I'm trying to do it:
public static String compress(String str) throws Exception {
System.out.println("Original String Length : " + str.length());
ByteArrayOutputStream obj=new ByteArrayOutputStream();
GZIPOutputStream gzip = new GZIPOutputStream(obj);
gzip.write(str.getBytes("UTF-8"));
gzip.close();
String base64Encoded = Base64.getEncoder().encodeToString(obj.toByteArray());
System.out.println("Compressed String length : " + base64Encoded.length());
return base64Encoded;
}
The lambda that this SQS queue triggers is a Node.js-based lambda where I need to unzip and decode this message. I'm trying to use the zlib library in Node.js to unzip and decode my message like this:
exports.handler = async (event, context) => {
let msg = null
event.Records.forEach(record => {
let { body } = record;
var buffer = zlib.inflateSync(new Buffer(body, 'base64')).toString();
msg = JSON.parse(JSON.parse(JSON.stringify(buffer.toString(), undefined, 4)))
});
}
I'm getting the following error on execution:
{
"errorType": "Error",
"errorMessage": "incorrect header check",
"code": "Z_DATA_ERROR",
"errno": -3,
"stack": [
"Error: incorrect header check",
" at Zlib.zlibOnError [as onerror] (zlib.js:180:17)",
" at processChunkSync (zlib.js:429:12)",
" at zlibBufferSync (zlib.js:166:12)",
" at Object.syncBufferWrapper [as unzipSync] (zlib.js:764:14)",
" at /var/task/index.js:12:19",
" at Array.forEach (<anonymous>)",
" at Runtime.exports.handler (/var/task/index.js:10:17)",
" at Runtime.handleOnce (/var/runtime/Runtime.js:66:25)"
]
}
Can someone tell me how I can approach this problem in a better way? Is there a better way to compress the string in Java? Is there a better way to decompress, decode and parse the JSON in Node.js?
256 KB per message is huge; if you send millions of messages like this, it will be extremely hard to process them all. Think about the replication SQS has to do internally.
SQS is not a database and it's not meant to store a lot of text.
I assume that your message contains a lot of business information in addition to some technical message identification parameters.
Usually this points to a wrong design of the system. So you can try the following:
Think about the storage for the content of the business information. It should not be SQS; it can be anything: Mongo, Postgres/MySQL, maybe ElasticSearch, or even Redis in some cases. Since the application is in the cloud, AWS has many additional storage engines (S3, DynamoDB, Aurora, etc.), so find the one that suits your use case best. Probably S3 is the way to go if you only need a document by some key (path), but that decision is beyond the scope of this question.
The "sender" of the message will store the business-related information in this storage and will send a short message to SQS containing a pointer (URL, foreign key, application-specific document ID, whatever) to the document, so that the receiver will be able to fetch that document from the storage once it gets the SQS message.
With this approach you don't need to zip anything; the messages will be short.
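A minimal sketch of that pointer pattern using the AWS SDK for Java v1 (the bucket name and key prefix here are made up for illustration):
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.sqs.AmazonSQS;
import java.util.UUID;

static void sendViaS3Pointer(AmazonS3 s3, AmazonSQS sqs, String queueUrl, String largeJson) {
    String key = "messages/" + UUID.randomUUID();
    s3.putObject("my-payload-bucket", key, largeJson); // store the large payload in S3
    sqs.sendMessage(queueUrl, key);                    // the SQS message is just a pointer
}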
The problem is that you are sending a gzip stream, and then trying to read a zlib stream. They are two different things. Either send gzip and receive gzip, or send zlib and receive zlib. E.g. zlib.gunzipSync on the receive side.
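If you prefer the second option (send zlib and keep the existing inflateSync call), the Java side of the question could use DeflaterOutputStream instead of GZIPOutputStream, since a default Deflater writes zlib-wrapped data. A sketch of that variant:
import java.io.ByteArrayOutputStream;
import java.util.Base64;
import java.util.zip.DeflaterOutputStream;

public static String compressZlib(String str) throws Exception {
    ByteArrayOutputStream obj = new ByteArrayOutputStream();
    DeflaterOutputStream deflater = new DeflaterOutputStream(obj); // zlib framing by default
    deflater.write(str.getBytes("UTF-8"));
    deflater.close();
    return Base64.getEncoder().encodeToString(obj.toByteArray());
}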

Streaming Chunked HTTP Entities with vert.x

I'm beginning an initial review of vert.x and comparing it to akka-http. One area where akka appears to shine is streaming of response bodies.
In akka-http it is possible to create a streaming entity that utilizes back-pressure which allows the client to decide when it is ready to consume data.
As an example, it is possible to create a response with an entity consisting of 1 billion instances of "42" values:
//Iterator is "lazy", therefore this function returns immediately
val bodyData : () => Iterator[ChunkStreamPart] = () =>
Iterator
.continually("42")
.take(1000000000)
.map(ChunkStreamPart.apply)
val route =
get {
val entity : HttpEntity =
Chunked(ContentTypes.`text/plain(UTF-8)`, Source fromIterator bodyData)
complete(HttpResponse(entity=entity))
}
The above code will not "blow up" the server's memory and will return the response to the client before the billion values have been generated.
The "42" values will get created on-the-fly as the client tries to consume the response body.
Question: is this streaming capability also present in vert.x?
A cursory review of the HttpServerResponse class would indicate that it is not, since the write member function can only take in a String or a vert.x Buffer. From my limited understanding it seems that Buffer is not lazy and holds the data in memory, which means the 1 billion "42" example would crash a server with just a few concurrent requests.
Thank you in advance for your consideration and response.

HttpURLConnection never uses gzip

I'm currently developing an app, that should measure (fairly precisely) the size of webpages.
The thing I'm struggling with now is that I need to know the sizes of particular files that are on the website. I have an array of URLs and I try to fetch their headers to get Content-Length; however, some files return -1 since they are chunked. If they return -1, I try to download them to get their size.
And here lies the problem - I found out that I always get the uncompressed version of the file.
Example file -
http://www.google-analytics.com/analytics.js
When I open it in Chrome, the response headers show that the content is gzip-compressed.
However, when I download it using HttpURLConnection, it has a size of 25421 bytes, and when I check the Content-Encoding header, it's always null.
connection = (HttpURLConnection)(new URL(url)).openConnection();
connection.setRequestProperty("Accept-Encoding", "gzip");
connection.connect();
int contentLength = connection.getContentLength();
if (contentLength == -1 && connection != null) {
InputStream input = connection.getInputStream();
byte[] buffer = new byte[4096];
int count = 0, len;
while ((len = input.read(buffer)) > 0) {
count += len;
}
contentLength = count;
}
So the problem is that I download a webpage with my application and it says it has (let's say) 400 kB. But when I measure it using some tool like http://tools.pingdom.com/fpt/, the size is much smaller, like 100 kB, since most of the scripts are gzipped, which means the actual transfer is smaller.
I know 300kB is not that much, but when you are using a mobile transfer, every kB counts, and I want my app to be precise.
Could you point out where I'm making a mistake, or how I could solve this?
Thank you
Your HttpURLConnection setup code looks correct to me. You could try setting the User-Agent to a standard browser one; perhaps the server is trying to be more intelligent than it ought to be. Failing that, run your traffic through a debugging proxy like Fiddler or Burp to see what's going on at the network level.
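For instance, the request setup from the question could be extended like this (the User-Agent string is just a hypothetical desktop-browser value):
import java.net.HttpURLConnection;
import java.net.URL;

static HttpURLConnection openWithBrowserUserAgent(String url) throws Exception {
    HttpURLConnection connection = (HttpURLConnection) new URL(url).openConnection();
    connection.setRequestProperty("Accept-Encoding", "gzip");
    // Some servers may only serve compressed content to clients they recognize.
    connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)");
    connection.connect();
    return connection;
}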
If you are using iJetty, you have to enable gzip compression first
You have to enable the GzipFilter to make Jetty return compressed content. Have a look here on how to do that: http://blog.max.berger.name/2010/01/jetty-7-gzip-filter.html
You can also use the gzip init parameter to make Jetty search for compressed content. That means if the file file.txt is requested, Jetty will look for a file named file.txt.gz and return that.

How to POST binary bytes using java async http client (ning) library

I need to use Async Http Client (https://github.com/sonatype/async-http-client) to post a byte array to a URL. The content type is octet-stream.
How do I do it using async http client?
Should I use ByteArrayBodyGenerator? Is there any example code showing how it is done?
If the byte array is already in memory, is it better to use a ByteArrayInputStream
and RequestBuilder.setBody(InputStream)?
It is suggested in the docs not to use an InputStream in setBody, because in order to get the content length the library will need to load everything into memory.
And it seems that ByteArrayBodyGenerator has the same issue. To get the content length it uses a call to bytes.length, where bytes is your byte array (private final byte[] bytes;). So, to get the length of a byte array, the array needs to be loaded into memory.
Here is the source from github:
https://github.com/sonatype/async-http-client/blob/master/src/main/java/com/ning/http/client/generators/ByteArrayBodyGenerator.java
You may write your own BodyGenerator implementation to avoid the issue.
Also you asked for an example of using BodyGenerator:
final SimpleAsyncHttpClient client = new SimpleAsyncHttpClient.Builder()
.setRequestTimeoutInMs(Integer.MAX_VALUE)
.setUrl(url)
.build();
client.post(new ByteArrayBodyGenerator(YOUR_BYTE_ARRAY)).get();
And if you want to use legacy API:
final AsyncHttpClientConfig config
= new AsyncHttpClientConfig.Builder().setRequestTimeoutInMs(Integer.MAX_VALUE).build();
final AsyncHttpClient client = new AsyncHttpClient(config);
client.preparePost(url)
.setBody(new ByteArrayBodyGenerator(YOUR_BYTE_ARRAY))
.execute()
.get();
