Java resumable hash computation

I would like to achieve resumable, on-the-fly hash generation for a file being uploaded to the server. The files are big, so I am using the update(byte[]) method of the MessageDigest class (as described here, for instance: How can I generate an MD5 hash?) on the fly, as new bytes arrive from the HttpServletRequest's InputStream.
Everything is going well; however, it becomes interesting when I want to add resumable upload support. If an upload is terminated prematurely, the incomplete file is stored on disk, but the controller (and the underlying service) exits, so the MessageDigest object is lost. Before that happens, can I serialize the MessageDigest object to disk (or to the DB, it doesn't matter) in such a way that when I deserialize it, it remembers its intermediate state? Then, when I resume the upload (from the exact byte where it was terminated, so no bytes are duplicated or missing) and keep update()ing the deserialized MessageDigest, I would ultimately get the same hash as if the file had been uploaded in one go.

Grab one of the custom MD5 implementations like this one or this one. Make it serializable, or just make its internal state accessible. Preserve the state when the upload is aborted, and restore it when the upload is resumed.
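If pulling in a library is an option, Bouncy Castle's lightweight MD5Digest can encode and restore its internal state, so you don't have to write or patch an MD5 implementation yourself. A minimal sketch, assuming bcprov (1.52 or later) is on the classpath:

import org.bouncycastle.crypto.digests.MD5Digest;

public class ResumableMd5 {
    public static void main(String[] args) {
        byte[] part1 = "hello ".getBytes();
        byte[] part2 = "world".getBytes();

        // Hash the first part, then snapshot the digest's internal state.
        MD5Digest md5 = new MD5Digest();
        md5.update(part1, 0, part1.length);
        byte[] savedState = md5.getEncodedState(); // persist this to disk or DB

        // Later, possibly in a new JVM: restore the state and continue.
        MD5Digest resumed = new MD5Digest(savedState);
        resumed.update(part2, 0, part2.length);

        byte[] hash = new byte[resumed.getDigestSize()];
        resumed.doFinal(hash, 0);
        // hash now equals MD5("hello world")
    }
}

Within a single JVM you can also checkpoint most JDK digests with (MessageDigest) md.clone(), but a clone does not survive a restart, which is why the encodable state above is the safer bet here.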

Hashes are cheap to compute (MD5 doubly so; are you sure you don't want SHA-1?). I would recommend rehashing everything from the beginning as soon as you detect that an upload has been resumed. The runtime should be low unless the uploads are truly huge, and large, interrupted uploads will hopefully be rare.
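A sketch of that approach: on resume, replay the partial file already on disk through a fresh digest, then keep feeding it the new bytes (the method and path names are illustrative):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public final class ResumeHash {
    // Rebuild the digest state by re-reading the partial upload from disk.
    static MessageDigest rehashPartial(String partialPath)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = new FileInputStream(partialPath)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        return md; // keep calling update() with the resumed request's bytes
    }
}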

Related

How to Implement Huawei's Chunked File Upload Using Java

I need to implement a deployment pipeline, and at the end of the pipeline we are uploading a file, in this case to Huawei's app store. But for a file of more than 5 megabytes in size, we have to use a chunked API. I'm not familiar with how chunked uploads work. Can someone give me an implementation guideline, preferably in Java, of how to implement such a mechanism? The API parameters are as follows:
Edit :
In response to the comment below, let me clarify my question. Looking up some references on how to do a chunked request, libraries such as httpclient and okhttp simply set the chunked flag to true and seem to hide the details from the library's client:
https://www.java-tips.org/other-api-tips-100035/147-httpclient/1359-how-to-use-unbuffered-chunk-encoded-post-request.html
Yet, the input parameters of the API seem to expect that I manage the chunks manually, since it expects a chunkSize and a sequence number. I'm thinking that I might need to use the plain Java HTTP interface to work with the API, yet I failed to find any good source to get me started. If anyone could give me a reference or implementation guidance, that would definitely help.
More updates :
I tried to manually chunk my file into several parts, each 1 megabyte in size. Then I thought I could try calling the API for every chunk, using multipart/form-data. But the server side always closes the connection before writing even begins, causing: Connection reset by peer: socket write error.
It shouldn't be a proxy issue, since I have set that up and could get the token, URL and auth code without problems.
File segmentation: when a file of more than a few gigabytes is uploaded, a server cannot simply receive it, process it and report success in one request, no matter how good the server is. Even if it could, this is not an operation you want to allow, so we have to find another way.
First of all, we have to deal with the size of the file: the only real option is to cut it into chunks of a few MB each, send them to the server in multiple requests, and store them there. Name each chunk with the MD5 of the source file plus the chunk's index (some people use a UUID plus the index instead; the difference between the two is described in detail below). When you upload these small files to the server separately, it is best to record them in the database.
(1) When the first chunk has been uploaded, write the source file's name, type, MD5, upload date, storage path and an 'unfinished' status to a table; once reassembly is complete, change the status to 'finished'. Call this the file table.
(2) After each chunk upload, save a record to the database: the chunk's name (source MD5 + index), the MD5 of the chunk itself (this is a key point), the upload time and the chunk's storage path. Call this the file_tem table.
Instant-upload function: many cloud drives implement this. When an upload starts, send an Ajax request to ask whether the file already exists on the server. HTML5 provides a way to compute the file's MD5 in the browser, so you can ask the server whether that MD5 exists and whether its status is 'finished'. If it does, also verify that the stored file is still present on disk. If both checks pass, return the 'already exists' status to the front end, and you can proudly tell the customer that the upload finished in a second.
here is the link:
https://blog.csdn.net/weixin_42584752/article/details/80873376
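To make the chunking itself concrete, here is a rough sketch that splits a file into fixed-size parts and hands each part, with its sequence number, to an upload call. The chunk size and the body of sendChunk are placeholders for whatever the Huawei API actually expects:

import java.io.FileInputStream;
import java.io.IOException;

public final class ChunkedUploader {
    private static final int CHUNK_SIZE = 1024 * 1024; // 1 MB per part (placeholder)

    public static void upload(String path) throws IOException {
        try (FileInputStream in = new FileInputStream(path)) {
            int seq = 1; // sequence number the API expects
            while (true) {
                // Java 11+: reads a full chunk, or the shorter tail at EOF.
                byte[] chunk = in.readNBytes(CHUNK_SIZE);
                if (chunk.length == 0) {
                    break;
                }
                sendChunk(seq++, chunk); // one HTTP request per chunk
            }
        }
    }

    // Placeholder: POST one part as multipart/form-data, carrying the
    // chunkSize and sequence-number fields defined by the API.
    private static void sendChunk(int seq, byte[] chunk) throws IOException {
        // build the multipart body with your HTTP client of choice
        // (OkHttp, Apache HttpClient, java.net.http.HttpClient, ...)
    }
}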

Java - processing file in memory without the disk R/W

I am receiving files through a socket and saving them to a database. So I'm receiving the byte stream and passing it to a back-end process, say Process1, for the DB save.
I'm looking to do this without saving the stream to disk. So, rather than storing the incoming stream as a file on disk and then passing that file to Process1, I'm looking to pass it on while it's still in memory. This is to eliminate the time-costly disk read and write.
One way I can do this is to pass the byte[] to Process1. I'm wondering whether there's a better way of doing this.
TIA.
You can use a ByteArrayOutputStream. It is, essentially, a growable byte[] into which you can write at will, within the limits of your available heap space.
After having written to it, flushed it and closed it (although the last two operations are essentially no-ops, that's no reason to ditch sane practices), you can obtain the underlying byte array using this class's .toByteArray().
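For example, draining the socket's InputStream entirely into memory might look like this (a sketch; on Java 9+ in.readAllBytes() does the same in one call):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public final class StreamToMemory {
    // Drain an InputStream (e.g. from the socket) entirely into memory.
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray(); // hand this byte[] to Process1
    }
}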
Socket sounds like what you are looking for.

Avoid obtaining same InputStream more than once

I can see there are a number of posts regarding reusing an InputStream. I understand an InputStream is a one-time thing and cannot be reused.
However, I have a use case like this:
I have downloaded the file from DropBox by obtaining a DropBoxInputStream using DropBox's Java SDK. I then need to upload the file to another system by passing along the InputStream. However, as part of the upload, I have to provide the MD5 of the file, so I have to read the file from the stream before uploading it. Because the DropBoxInputStream I received can only be used once, I have to get another DropBoxInputStream after I have calculated the MD5 and before uploading the file. The procedure is:
Get first DropBoxInputStream
Read from the DropBoxInputStream and calculate MD5
Get the second DropBoxInputStream
Upload file using the MD5 and the second DropBoxInputStream.
I am wondering whether there is any way for me to "cache" or "back up" the InputStream before I calculate the MD5, so that I can skip step 3 of obtaining the same DropBoxInputStream again.
Many thanks
EDIT:
Sorry I missed some information.
What I am currently doing is using an MD5DigestOutputStream to calculate the MD5. I stream the data through the MD5DigestOutputStream and save it locally as a temp file. As the data passes through the MD5DigestOutputStream, it calculates the MD5.
I then call a third party library to upload the file using the calculated md5 and a FileInputStream which reads from the temp file.
However, this sometimes requires huge disk space, and I want to remove the need for a temp file. The library I use only accepts an MD5 and an InputStream, which means I have to calculate the MD5 on my end. My plan is to use my MD5DigestOutputStream to write the data to /dev/null (not keeping the file) so that I can calculate the MD5, then get the InputStream from DropBox again and pass that to the library. I assume the library will be able to get the file directly from DropBox without me needing to cache the file either in memory or on disk. Will it work?
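In code, the plan would look roughly like this (a sketch using the JDK's DigestOutputStream in place of my MD5DigestOutputStream, and Java 11's OutputStream.nullOutputStream() as the discarding sink):

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.security.DigestOutputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public final class Md5FirstPass {
    // First pass: stream the DropBox file through the digest, discarding the bytes.
    static byte[] md5Of(InputStream dropboxStream)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (DigestOutputStream out =
                new DigestOutputStream(OutputStream.nullOutputStream(), md)) {
            dropboxStream.transferTo(out); // Java 9+
        }
        return md.digest();
    }
    // Second pass: fetch a fresh DropBoxInputStream and hand it,
    // together with this MD5, to the upload library.
}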
Input streams aren't really designed for creating copies or for re-use; they're specifically for situations where you don't want to read everything into a byte array and use array operations on it (this is especially useful when the whole array isn't available yet, as in, e.g., socket communication). You could buffer up into a byte array, which is the process of reading sections from the stream into a byte-array buffer until you have enough information.
But that's unnecessary for calculating an MD5. Notice that InputStream is abstract, so it needs to be implemented by a subclass. It has many implementations: GZIPInputStream, FileInputStream, etc. Some of these are, in design-pattern speak, decorators of the IO stream: they add extra functionality to the abstract base IO classes. For example, GZIPInputStream decompresses gzipped data as you read from it.
So, what you need is a stream that does this for MD5. There is, joyfully, a well-documented similar thing: see this answer. You should just be able to pass your DropBox input stream (as it will itself be an InputStream) to create a new DigestInputStream, and then you can both take the MD5 and continue to read as before.
Worried about type casting? The idea with decorators in Java is that, since the InputStream base class already declares all the methods and 'beef' you need to do your IO, there's no harm in passing any instance of a class inheriting from InputStream to the constructor of another stream implementation, and you can still do the same core IO.
Finally, I should probably answer your actual question: say you still want to "cache" or "back up" the stream anyway? Well, you could just write it out to a byte array. This is well documented, but can become a faff when your streams get more complicated. Alternatively, try looking at PushbackInputStream. There, you can easily write a function to read off n bytes, perform an operation on them, and then push them back onto the stream. It's generally good to avoid these stream implementations in Java, as they're bad for memory use, but they're no worse than buffering everything up, which you'd otherwise have to do.
Or, of course, I would have a go with DigestInputStream.
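A minimal sketch of that route (note the MD5 is only complete once the stream has been read to the end):

import java.io.IOException;
import java.io.InputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public final class DigestWhileReading {
    static byte[] md5WhileReading(InputStream dropboxIn)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (DigestInputStream in = new DigestInputStream(dropboxIn, md)) {
            byte[] buf = new byte[8192];
            while (in.read(buf) != -1) {
                // consume or forward the bytes exactly as before;
                // every byte read also updates the digest
            }
        }
        return md.digest(); // available once the stream is exhausted
    }
}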
Hope this helps,
Best.
You don't need to open a new InputStream from DropBox.
Once you have read the file from DropBox, you have it locally: either in memory (in a byte array) or stored in a local file. Now you can create an InputStream that reads the data from memory (ByteArrayInputStream) or from disk (FileInputStream) in order to upload the file.
So instead of caching the InputStream (which you can't) you cache the contents (which you can).
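A sketch of that idea: buffer the download once, compute the MD5 over the buffer, then replay it through a ByteArrayInputStream for the upload (fine as long as the file fits comfortably in the heap; the upload call is a placeholder):

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public final class CachedUpload {
    static void uploadWithMd5(InputStream dropboxIn)
            throws IOException, NoSuchAlgorithmException {
        byte[] content = dropboxIn.readAllBytes(); // Java 9+: buffer once in memory
        byte[] md5 = MessageDigest.getInstance("MD5").digest(content);
        InputStream replay = new ByteArrayInputStream(content);
        // thirdPartyLibrary.upload(md5, replay); // placeholder for the real call
    }
}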

Where do we store key/passphrase/salt for encryption?

My app needs to encrypt some data (a user session token). Most examples I see around have a method that generates a Key using a passphrase and a salt, like:
public static Key generateKey(char[] passphrase, byte[] salt) {
...
}
My understanding is that we have three options for generating the passphrase:
Have the user enter it every time the app starts (annoying to the user).
Hard-code the passphrase into the app itself. More convenient for the user, but someone can find out what your passphrase is given your app binary.
Randomly generate a passphrase, but then we have to store the generated Key on disk. Now we've just shifted the problem to having to store the key securely on disk, which also seems impossible. If the attacker finds the generated key, big problem.
Option #1 won't work for me. Options #2 and #3 seem inherently flawed, unless I'm grossly misunderstanding how to go about this (hoping that I am). What's the recommended way to do this if we can't go with #1? Do we put in a bunch of obfuscated hoops for an attacker to jump through and hope for the best?
Thanks
"Do we put in a bunch of obfuscated hoops for an attacker to jump through and hope for the best?" Basically yes. The size and number of the hoops being how hard you want to make it.
If you are not using a server, then whatever you do to obfuscate and encrypt your data is reversible. However, you can make it REALLY hard. For example, here is a technique I used to protect some video assets:
Replaced the first 1024 bytes of the header (it's MP4) with 1024 bytes taken from the middle of one of the app's image assets. I tried several repair tools, all of which failed to automagically recover the file, although it can be done manually. Then...
Encrypted the file using a private key, which is 256 bytes taken from another image asset.
When the key is extracted, it's hashed through an algorithm that does all kinds of otherwise nonsensical maths to mangle the key.
Used a pre-compile obfuscator.
I've tried myself to reverse engineer this, even knowing how it's done, and it's so hard as to make the effort not worth the result.
There are numerous discussions on SO which can be summarised as: if you simply want to stop copying, make it difficult (cost vs. reward), but otherwise sleep easy, because there is ultimately nothing you can do. If the data is commercially sensitive, then a server coupled with system-level security (e.g. whole-device encryption and no root) is required.
You store the salt along with the encrypted data; it is not secret information. You can derive the key from either something the user enters, or some sort of device property: a (hashed) IMEI, MAC address, etc.
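For the generateKey body from the question, a standard choice is PBKDF2; a minimal sketch deriving a 256-bit AES key (the iteration count is illustrative):

import java.security.Key;
import java.security.spec.KeySpec;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;
import javax.crypto.spec.SecretKeySpec;

public static Key generateKey(char[] passphrase, byte[] salt) throws Exception {
    // PBKDF2: stretch the passphrase + salt into a 256-bit AES key.
    KeySpec spec = new PBEKeySpec(passphrase, salt, 10000, 256);
    SecretKeyFactory factory = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA1");
    byte[] keyBytes = factory.generateSecret(spec).getEncoded();
    return new SecretKeySpec(keyBytes, "AES");
}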
Basically, think about who you are protecting your data from, and why. Since the user needs this token, there is not much point trying to protect it from them. If you store it in a private file, other apps cannot read it on a non-rooted phone. If you want to protect it on rooted phones, encryption might help, but as long as the key resides in the app, or is derived from something on the device, you are only making recovery harder, not impossible.
Android does have a system-wide keystore service, but it has no public API and is subject to change. You could use that to protect your key(s), if you are willing to take the risk of your app breaking on future versions. Some details here: http://nelenkov.blogspot.com/2012/05/storing-application-secrets-in-androids.html

Specify InputStream for ServletResponse instead of copying InputStream into OutputStream

In short, I have a servlet which retrieves pictures, videos, etc. from an underlying data store.
In order to achieve this I need to copy the file's InputStream to the ServletResponse's OutputStream.
From my point of view this is not efficient, since the file will be copied through memory before being sent; it would be more convenient to hand over an InputStream from which the OutputStream would read data and send it straight away, after reading some of it into the buffer.
I looked at the ServletResponse documentation, and it has a buffer for the message data, so I have a few questions regarding it.
Is this the right mechanism?
What if I decide not to send the file at the end of servlet processing?
For example:
If I have copied the InputStream into the OutputStream, and then find out that the request is not authorized and the user has no right to see this object (a design mistake, maybe), I would already have sent some data to the client, although that is not what I intended.
To address your first concern: you can easily copy an InputStream to an OutputStream using IOUtils from Apache Commons IO:
IOUtils.copy(fileInputStream, servletOutputStream);
It uses a 4 KB buffer, so memory consumption should not be a concern. In fact you cannot just send data straight from an InputStream: at the lowest level the operating system still has to read the file contents into some memory location, and in order to send them to a socket you must provide a memory location where the data to be sent resides. Streams are just a useful abstraction.
About your second question: this is how HTTP works. If you start streaming data to the client, the servlet container sends all the response headers first; if you abort in the middle, from the client's perspective it looks like an interrupted download.
Is this the right mechanism?
Basically, it is the only mechanism provided by the Servlet APIs. You need to design your servlet with this in mind.
(It is hard to see how it could be done any other way. A read syscall reads data from a device (the disk) into memory. A write syscall writes data from memory to a device (the network interface). Some operating systems do offer zero-copy calls such as Linux's sendfile, but the Servlet API gives you no access to them, so in practice the best you can do is reduce the amount of copying of data within the application. If you use something like IOUtils.copy, it should minimize that as far as possible. The only way to avoid going through application memory entirely would be some special-purpose hardware / operating system combination optimized for content delivery.)
However, this is probably moot anyway. In most cases, the performance bottleneck is likely to be movement of data over the network. Data can probably be read from disk to memory, copied, and written to the network interface orders of magnitude faster than it can move through the network to the user's web browser (or whatever).
If it is NOT moot, then a practical approach to content delivery would be to use a separate web server, implemented in native code and optimized for delivering static content, e.g. something like nginx.
What if I decide not to send the file at the end of servlet processing? For example: if I have copied the InputStream into the OutputStream, and then find out that the request is not authorized and the user has no right to see this object (a design mistake, maybe), I would already have sent some data to the client, although that is not what I intended.
You should write your servlet to do the access checks BEFORE reading the content into memory, and ideally before you "commit" the response by sending the response headers.
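Putting both points together, a sketch of the servlet flow (IOUtils from Commons IO; the authorization check and data-store lookup are placeholders):

import java.io.IOException;
import java.io.InputStream;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.apache.commons.io.IOUtils;

public class MediaServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        // 1. Check access BEFORE touching the content or the response body.
        if (!isAuthorized(req)) { // placeholder check
            resp.sendError(HttpServletResponse.SC_FORBIDDEN);
            return;
        }
        resp.setContentType("image/jpeg"); // headers before any body bytes
        // 2. Stream straight from the store to the client, buffer by buffer.
        try (InputStream in = openFromDataStore(req)) { // placeholder lookup
            IOUtils.copy(in, resp.getOutputStream());
        }
    }

    private boolean isAuthorized(HttpServletRequest req) {
        return true; // stub
    }

    private InputStream openFromDataStore(HttpServletRequest req) {
        return InputStream.nullInputStream(); // stub (Java 11+)
    }
}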
