Stop HtmlUnit download after specified file size is reached


I'm stuck trying to stop a download initiated with HtmlUnit after a certain size was reached. The InputStream
InputStream input = button.click().getWebResponse().getContentAsStream();
downloads the complete file correctly. However, it seems that using
OutputStream output = new FileOutputStream(fileName);
int bytesRead;
int total = 0;
while ((bytesRead = input.read(buffer)) != -1 && total < MAX_SIZE) {
output.write(buffer, 0, bytesRead);
total += bytesRead;
System.out.print(total + "\n");
}
output.flush();
output.close();
input.close();
somehow downloads the file to a different location (unknown to me) and once finished copies the max size into the file "fileName". No System.out is printed during this process. Interestingly, while running the debugger in Netbeans and going slowly step-by-step, the total is printed and I get the MAX_SIZE file.
Varying the buffer size in a range between 1024 to 102400 didn't make any difference.
I also tried Commons'
BoundedInputStream b = new BoundedInputStream(button.click().getWebResponse().getContentAsStream(), MAX_SIZE);
without success.
There's this 2.5-year-old post, but I couldn't figure out how to implement the proposed solution.
Is there something I'm missing in order to stop the download at MAX_SIZE?
(Exception handling and other details omitted for brevity.)

There is no need to use HtmlUnit for this. Using it for such a simple task is overkill and will slow things down. The best approach I can think of is the following:
final String url = "http://yoururl.com";
final String file = "/path/to/your/outputfile.zip";
final int MAX_BYTES = 1024 * 1024 * 5; // 5 MB
URLConnection connection = new URL(url).openConnection();
InputStream input = connection.getInputStream();
byte[] buffer = new byte[4096];
int pendingRead = MAX_BYTES;
int n;
OutputStream output = new FileOutputStream(new File(file));
while ((n = input.read(buffer)) >= 0 && (pendingRead > 0)) {
output.write(buffer, 0, Math.min(pendingRead, n));
pendingRead -= n;
}
input.close();
output.close();
In this case I've set a maximum download size of 5 MB and a buffer of 4 KB. The file will be written to disk in every iteration of the while loop, which seems to be what you're looking for.
Of course, make sure you handle all the necessary exceptions (e.g. FileNotFoundException).
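The same idea can be packaged as a small reusable helper. This is only a minimal sketch and not part of the original answer: the class name BoundedCopy and the method name copyAtMost are made up for illustration. Unlike the loop above, it never asks the stream for more bytes than the remaining budget.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper; the names are illustrative assumptions.
final class BoundedCopy {

    /**
     * Copies at most maxBytes from in to out and returns the number of
     * bytes actually copied. Each read() is capped at the remaining
     * budget, so this method never requests more than maxBytes in total.
     */
    static long copyAtMost(InputStream in, OutputStream out, long maxBytes) throws IOException {
        byte[] buffer = new byte[4096];
        long remaining = maxBytes;
        while (remaining > 0) {
            int toRead = (int) Math.min(buffer.length, remaining);
            int n = in.read(buffer, 0, toRead);
            if (n == -1) {
                break; // the stream ended before the limit was reached
            }
            out.write(buffer, 0, n);
            remaining -= n;
        }
        return maxBytes - remaining;
    }
}
```

A caller would typically open the connection's InputStream and a FileOutputStream in a try-with-resources block and call BoundedCopy.copyAtMost(input, output, MAX_BYTES). Closing the input stream afterwards abandons whatever was left on the connection.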

Related

Fastest way to write multiple files in java

I have a requirement where I need to write multiple input streams to a temp file in Java. I have the code snippet below for the logic. Is there a more efficient way to do this?
final String tempZipFileName = "log" + "_" + System.currentTimeMillis();
File tempFile = File.createTempFile(tempZipFileName, "zip");
final FileOutputStream oswriter = new FileOutputStream(tempFile);
for (final InputStream inputStream : readerSuppliers) {
byte[] buffer = new byte[102400];
int bytesRead = 0;
while ((bytesRead = inputStream.read(buffer)) > 0) {
oswriter.write(buffer, 0, bytesRead);
}
buffer = null;
oswriter.write(System.getProperty("line.separator").getBytes());
inputStream.close();
}
I have multiple files ranging from 45 MB to 400 MB. For typical files of 45 MB and 360 MB, this method takes around 3 minutes on average. Can this be improved further?

You could try a BufferedInputStream.
As @StephenC replied, using a BufferedInputStream is irrelevant in this case because the buffer is already big enough.
I reproduced the behaviour on my computer (with an SSD) using a 100 MB file.
It took 110 ms to create the new file with this example.
With a buffered input stream and a plain OutputStream: 120 ms.
With a plain InputStream and a buffered output stream: 120 ms.
With both streams buffered: 110 ms.
My execution time is nowhere near as long as yours.
Maybe the problem comes from your readerSuppliers?

When streaming a file from a server, the file size on disk is bigger than the total bytes read, what's going on?

I'm reading a binary file from artifactory. The file size according to artifactory is 34,952,058 bytes, the totalBytes counter that's logged after reading is finished is also 34,952,058 bytes. But the file size on disk is 39,426,048 bytes. What's going on??
I've tried BufferedOutputStream, FileOutputStream and OutputStream.
It's the same result every time. What am I missing?
This is what my latest code looks like at the moment:
try {
URL url = new URL(fw.getArtifactoryUrl());
URLConnection connection = url.openConnection();
in = connection.getInputStream();
File folder = utils.getFirmwareFolder(null, FirmwareUtils.FIRMWARE_LATEST, true);
StringBuilder builder = new StringBuilder(folder.toString());
builder.append("/").append(fw.getFileName());
Path filePath = Paths.get(builder.toString());
OutputStream out = Files.newOutputStream(filePath);
int read = 0;
int totalBytes = 0;
while ((read = in.read(bytes)) > 0) {
totalBytes += read;
out.write(bytes);
out.flush();
}
logger.info("Total bytes read: " + totalBytes);
in.close();
out.close();
<<< more code >>>

Your code reads correctly but writes incorrectly:
while ((read = in.read(bytes)) > 0) { // Read amount of bytes
totalBytes += read; // Add the correct amount of bytes read to total
out.write(bytes); // Write the whole array, no matter how much we read
out.flush(); // Completely unnecessary, can harm performance
}
You need out.write(bytes, 0, read) to write only the bytes you've read instead of the whole buffer.
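For reference, a corrected copy loop might look like the following. It is a minimal sketch with made-up names (CopyLoop, copy), not the asker's actual code. It writes only the bytes that each read() returned and leaves flushing to close().

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper; the names are illustrative assumptions.
final class CopyLoop {

    /** Copies in to out and returns the total number of bytes copied. */
    static long copy(InputStream in, OutputStream out, int bufferSize) throws IOException {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        byte[] bytes = new byte[bufferSize];
        long totalBytes = 0;
        int read;
        while ((read = in.read(bytes)) != -1) {
            // Write only the first `read` bytes; the rest of the array
            // may still hold data from an earlier iteration.
            out.write(bytes, 0, read);
            totalBytes += read;
        }
        return totalBytes;
    }
}
```

With this loop the number of bytes written always equals the number of bytes read, so the file on disk should match the size the server reports.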

difference between input.read and input.read(array, offset, length)

I'm trying to understand how inputstreams work. The following block of code is one of the many ways to read data from a text file:-
File file = new File("./src/test.txt");
InputStream input = new BufferedInputStream (new FileInputStream(file));
int data = 0;
while (data != -1) // -1 means we reached the end of the file
{
data = input.read(); // if a byte was read, we get its integer value, so 'a' is 97 and 'b' is 98
System.out.println(data + " " + (char) data); // prints the number, a space, then the character
}
input.close();
Now, to use input.read(byte[], offset, length), I have this code (I got it from here):
File file = new File("./src/test.txt");
InputStream input = new BufferedInputStream (new FileInputStream(file));
int totalBytesRead = 0, bytesRemaining, bytesRead;
byte[] result = new byte[ ( int ) file.length()];
while ( totalBytesRead < result.length )
{
bytesRemaining = result.length - totalBytesRead;
bytesRead = input.read ( result, totalBytesRead, bytesRemaining );
if ( bytesRead > 0 )
totalBytesRead = totalBytesRead + bytesRead;
//printing integer version of bytes read
for (int i = 0; i < bytesRead; i++)
System.out.print(result[i] + " ");
System.out.println();
//printing character version of bytes read
for (int i = 0; i < bytesRead; i++)
System.out.print((char)result[i]);
}
input.close();
I'm assuming, based on the name bytesRead, that this read method returns the number of bytes read. The documentation says that the function will try to read as many bytes as possible, so there might be reasons why it wouldn't.
My first question is: What are these reasons?
I could replace that entire while loop with one line of code: input.read(result, 0, result.length)
I'm sure the creator of the article thought about this. It's not about the output because I get the same output in both cases. So there has to be a reason. At least one. What is it?

The documentation of read(byte[], int, int) says that it:
Reads up to len bytes of data.
An attempt is made to read as many as len bytes
A smaller number may be read.
Since we are working with files that are right there on our hard disk, it seems reasonable to expect that the attempt will read the whole file, but input.read(result, 0, result.length) is not guaranteed to read the whole file (the documentation says so nowhere). Relying on undocumented behavior is a source of bugs when that behavior changes.
For instance, the file stream may be implemented differently in other JVMs, some operating systems may impose a limit on the number of bytes you can read at once, the file may be located on the network, or you may later use that piece of code with another stream implementation that doesn't behave that way.
Alternatively, if you are reading the whole file into an array, perhaps you could use DataInputStream.readFully.
As for the loop with read(), it reads a single byte each time. That reduces performance if you are reading a big chunk of data, since each call to read() performs several checks (has the stream ended? etc.) and may ask the OS for one byte. Since you already know that you want file.length() bytes, there is no reason not to use one of the more efficient forms.
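As an illustration of the readFully suggestion, here is a minimal sketch. The class and method names (ReadExactly, readExactly) are made up for this example. DataInputStream.readFully keeps calling read until the array is full and throws EOFException if the stream ends first.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper; the names are illustrative assumptions.
final class ReadExactly {

    /**
     * Reads exactly length bytes from in. Throws java.io.EOFException
     * if the stream ends before that many bytes have arrived.
     */
    static byte[] readExactly(InputStream in, int length) throws IOException {
        byte[] result = new byte[length];
        // readFully loops internally, so short reads are handled for us.
        new DataInputStream(in).readFully(result);
        return result;
    }
}
```

For the file example in the question, this could be called as ReadExactly.readExactly(input, (int) file.length()), which replaces the hand-written while loop without relying on a single read call returning everything.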

Imagine you are reading from a network socket, not from a file. In this case you don't have any information about the total number of bytes in the stream. You would allocate a buffer of fixed size and read from the stream in a loop. During one iteration of the loop you can't expect BUFFERSIZE bytes to be available in the stream, so you fill the buffer as much as possible and iterate again until the buffer is full. This can be useful if you have data blocks of fixed size, for example serialized objects.
ArrayList<MyObject> list = new ArrayList<MyObject>();
try {
InputStream input = socket.getInputStream();
byte[] buffer = new byte[1024];
int bytesRead;
int off = 0;
int len = 1024;
while(true) {
bytesRead = input.read(buffer, off, len);
if(bytesRead == len) {
list.add(createMyObject(buffer));
// reset variables
off = 0;
len = 1024;
continue;
}
if(bytesRead == -1) break;
// buffer is not full, adjust size
off += bytesRead;
len -= bytesRead;
}
} catch(IOException io) {
// stream was closed
}
P.S. The code is untested and is only meant to show how this function can be useful.

You specify the number of bytes to read because you might not want to read the entire file at once, or you might be unable or unwilling to allocate a buffer as large as the file.

Implement pause/resume in file downloading

I'm trying to implement pause/resume in my download manager. I searched the web, read several articles and changed my code accordingly, but resume doesn't seem to work correctly. Any ideas?
if (!downloadPath.exists())
downloadPath.mkdirs();
if (outputFileCache.exists())
{
downloadedSize = outputFileCache.length();
connection.setAllowUserInteraction(true);
connection.setRequestProperty("Range", "bytes=" + downloadedSize + "-");
connection.setConnectTimeout(14000);
connection.connect();
input = new BufferedInputStream(connection.getInputStream());
output = new FileOutputStream(outputFileCache, true);
input.skip(downloadedSize); //Skip downloaded size
}
else
{
connection.setConnectTimeout(14000);
connection.connect();
input = new BufferedInputStream(url.openStream());
output = new FileOutputStream(outputFileCache);
}
fileLength = connection.getContentLength();
byte data[] = new byte[1024];
int count = 0;
int __progress = 0;
long total = downloadedSize;
while ((count = input.read(data)) != -1 && !this.isInterrupted())
{
total += count;
output.write(data, 0, count);
__progress = (int) (total * 100 / fileLength);
}
output.flush();
output.close();
input.close();

Okay, problem fixed. Here is my code for other users who want to implement pause/resume:
if (outputFileCache.exists())
{
connection.setAllowUserInteraction(true);
connection.setRequestProperty("Range", "bytes=" + outputFileCache.length() + "-");
}
connection.setConnectTimeout(14000);
connection.setReadTimeout(20000);
connection.connect();
if (connection.getResponseCode() / 100 != 2)
throw new Exception("Invalid response code!");
else
{
String connectionField = connection.getHeaderField("content-range");
if (connectionField != null)
{
String[] connectionRanges = connectionField.substring("bytes=".length()).split("-");
downloadedSize = Long.valueOf(connectionRanges[0]);
}
if (connectionField == null && outputFileCache.exists())
outputFileCache.delete();
fileLength = connection.getContentLength() + downloadedSize;
input = new BufferedInputStream(connection.getInputStream());
output = new RandomAccessFile(outputFileCache, "rw");
output.seek(downloadedSize);
byte data[] = new byte[1024];
int count = 0;
int __progress = 0;
while ((count = input.read(data, 0, 1024)) != -1
&& __progress != 100)
{
downloadedSize += count;
output.write(data, 0, count);
__progress = (int) ((downloadedSize * 100) / fileLength);
}
output.close();
input.close();
}

It is impossible to tell what is wrong without more information, but here are some things to note:
You must make an HTTP/1.1 request (it's hard to tell from your sample code)
The server must support HTTP/1.1
The server will tell you what it supports with an Accept-Ranges header in the response
If-Range should be the ETag the server gave you for the resource, not the last modified time
You should first check with something simple (like curl or wget) that the origin actually supports Range requests

It may be that your server is taking too long to respond (more than the timeout limit). Also, not all servers support pause/resume.
It is also worth considering whether the file is downloaded through HTTP, HTTPS, FTP or UDP.
"Pausing" could just mean reading some of the stream and writing it to disk. When resuming, you would have to use the headers to specify what is left to download.
You may try something like:
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
if(ISSUE_DOWNLOAD_STATUS.intValue()==ECMConstant.ECM_DOWNLOADING){
File file=new File(DESTINATION_PATH);
if(file.exists()){
downloaded = (int) file.length();
connection.setRequestProperty("Range", "bytes="+(file.length())+"-");
}
}else{
connection.setRequestProperty("Range", "bytes=" + downloaded + "-");
}
connection.setDoInput(true);
connection.setDoOutput(true);
progressBar.setMax(connection.getContentLength());
in = new BufferedInputStream(connection.getInputStream());
fos=(downloaded==0)? new FileOutputStream(DESTINATION_PATH): new FileOutputStream(DESTINATION_PATH,true);
bout = new BufferedOutputStream(fos, 1024);
byte[] data = new byte[1024];
int x = 0;
while ((x = in.read(data, 0, 1024)) >= 0) {
bout.write(data, 0, x);
downloaded += x;
progressBar.setProgress(downloaded);
}
Also, please try to keep things synchronized.

I would start debugging from this line:
connection.setRequestProperty("Range", "bytes=" + downloadedSize + "-");
Since the source code does not show what downloadedSize is, it's hard to elaborate further, but the format should be bytes=from-to, where the end position may be omitted (bytes=from-) to request everything from that offset onward.
Anyway, I would suggest using Apache HttpClient to avoid common pitfalls. Here is a question from someone who uses Apache HttpClient on a similar topic, with some sample code provided.

I think you just need to delete the input.skip(downloadedSize) line. Setting the HTTP header for byte range means the server will skip sending those bytes.
Say you have a file that's 20 bytes long consisting of "aaaaabbbbbcccccddddd", and suppose the transfer is paused after downloading 5 bytes. Then the Range header will cause the server to send "bbbbbcccccddddd". You should read all of this content and append it to the file, with no skip(). But the skip() call in your code will skip "bbbbb", leaving only "cccccddddd" to be downloaded. If you've already downloaded at least 50% of the file, then skip() will exhaust all of the input and nothing will be written.
Also, all of the things in stringy05's post apply. You should make sure the server supports HTTP/1.1, make sure the Range header is supported for the resource (dynamically generated content may not support it), and make sure the resource isn't modified using etag and modification date.
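To make the header handling concrete, here is a minimal sketch of building the Range request value and reading the start offset back from the Content-Range response header. The class and method names (RangeHeaders, resumeRange, firstBytePos) are made up for illustration. It assumes the server answers in the usual "bytes first-last/total" form and returns -1 for anything else, including a missing header.

```java
// Hypothetical helper; the names are illustrative assumptions.
final class RangeHeaders {

    /** Builds the Range request value to resume after alreadyDownloaded bytes. */
    static String resumeRange(long alreadyDownloaded) {
        return "bytes=" + alreadyDownloaded + "-";
    }

    /**
     * Extracts the first byte position from a Content-Range value such as
     * "bytes 5-19/20". Returns -1 if the value is null or not in that form.
     */
    static long firstBytePos(String contentRange) {
        if (contentRange == null) {
            return -1;
        }
        String value = contentRange.trim();
        String prefix = "bytes ";
        if (!value.startsWith(prefix)) {
            return -1;
        }
        int dash = value.indexOf('-');
        if (dash < 0) {
            return -1; // e.g. "bytes */20", sent when the range is unsatisfiable
        }
        try {
            return Long.parseLong(value.substring(prefix.length(), dash).trim());
        } catch (NumberFormatException e) {
            return -1;
        }
    }
}
```

For the 20-byte example above, resumeRange(5) yields "bytes=5-", and a server that honors it would reply with a Content-Range of "bytes 5-19/20". If firstBytePos returns -1, the safest reading is that the server ignored the range and is sending the whole file, so the local file should be overwritten rather than appended to.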

Calculating Download Speed

I am downloading a file but trying to also determine the download speed in KBps. I came up with an equation, but it is giving strange results.
try (BufferedInputStream in = new BufferedInputStream(url.openStream());
FileOutputStream out = new FileOutputStream(file)) {
byte[] buffer = new byte[4096];
int read = 0;
while (true) {
long start = System.nanoTime();
if ((read = in.read(buffer)) != -1) {
out.write(buffer, 0, read);
} else {
break;
}
int speed = (int) ((read * 1000000000.0) / ((System.nanoTime() - start) * 1024.0));
}
}
It's giving me anywhere between 100 and 300,000. How can I make this give the correct download speed? Thanks

You are not tracking the current and previous amounts downloaded.
For example:
int currentAmount = 0;//set this during each loop of the download
/***/
int previousAmount = 0;
int firingTime = 1000;//in milliseconds, here fire every second
public synchronized void run(){
int bytesPerSecond = (currentAmount-previousAmount)/(firingTime/1000);
//update GUI using bytesPerSecond
previousAmount = currentAmount;
}

First, you are measuring read() time plus write() time over very short intervals, so the result will vary depending on when the writes are flushed from the disk cache.
Put the calculation right after the read().
Second, your buffer size (4096) probably does not match the TCP buffer size (yours is probably smaller), so some reads will be very fast because they are served from the local TCP buffer. Use Socket.getReceiveBufferSize() and size your buffer accordingly (say, twice the TCP receive buffer size), then fill it in a nested loop until it is full before calculating.
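Another way to smooth out the readings is to average over a longer window instead of timing each read. The sketch below is one possible approach, not code from either answer. The class name SpeedMeter and its methods are made up, and timestamps are passed in by the caller (for example from System.nanoTime()) so the arithmetic is easy to check.

```java
// Hypothetical helper; the names are illustrative assumptions.
final class SpeedMeter {
    private long windowStartNanos;
    private long bytesInWindow;

    SpeedMeter(long startNanos) {
        this.windowStartNanos = startNanos;
    }

    /** Records the number of bytes returned by one read() call. */
    void add(int bytes) {
        bytesInWindow += bytes;
    }

    /**
     * Returns the average speed in KiB/s since the current window began
     * and starts a new window. Returns -1 if no time has elapsed yet.
     */
    double sampleKiBPerSecond(long nowNanos) {
        long elapsedNanos = nowNanos - windowStartNanos;
        if (elapsedNanos <= 0) {
            return -1;
        }
        double seconds = elapsedNanos / 1_000_000_000.0;
        double kibPerSecond = (bytesInWindow / 1024.0) / seconds;
        windowStartNanos = nowNanos;
        bytesInWindow = 0;
        return kibPerSecond;
    }
}
```

In the download loop, add(read) would be called after every successful read, and sampleKiBPerSecond(System.nanoTime()) only once enough time has passed, for instance once per second. Fast reads served from a local buffer and slow reads waiting on the network then average out within the window.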
