I know that this type of question exists on SF, but those are very specific; I need a generic suggestion. I need a feature for uploading user files that could be larger than 1 GB. This feature will be an add-on to the existing file-upload feature in the application, which caters to smaller files. Here are some of the options:
Use HTTP and a Java applet. Send the files in chunks and join them at the server. But how do I throttle the network?
Use HTTP and a Flex application. Is it better than an applet with respect to browser compatibility and other environment issues?
Use FTP, or rather SFTP, instead of HTTP for a faster upload process.
Please suggest.
Moreover, I have to make sure that this upload process doesn't hamper the work of other users, or in other words doesn't eat up other users' bandwidth. Are there any mechanisms at the network level to throttle such processes?
Ultimately the customer wanted to have FTP as an option. But I think the answer about handling the files programmatically is also cool.
Use whatever client-side language you want (a Java app, Flex, etc.) and push to the server with HTTP PUT (not from Flex) or POST. In the server-side Java code, regulate the flow of bytes in your input stream loop. A crude, simple sample snippet that limits bandwidth to an average of no more than 10 KB/second:
InputStream is = request.getInputStream();
OutputStream os = new FileOutputStream(new File("myfile.bin"));
try {
    byte[] payload = new byte[10240];
    int bytesRead;
    // Read at most 10 KB per iteration, then sleep one second: roughly <= 10 KB/second on average.
    while ((bytesRead = is.read(payload)) >= 0) {
        if (bytesRead > 0) {
            os.write(payload, 0, bytesRead);
        }
        Thread.sleep(1000); // sleep() is static; no need to go through currentThread()
    }
} catch (InterruptedException e) {
    Thread.currentThread().interrupt(); // restore the interrupt flag and stop throttling
} finally {
    os.close();
}
(With more complexity you could regulate the single-stream bandwidth more accurately, but it gets complicated once you consider socket buffers and such. "Good enough" is usually good enough.)
My application does something similar to the above: we regulate both upstream (POST and PUT) and downstream (GET) bandwidth. We accept files in the hundreds of MB every day and have tested up to 2 GB. (Beyond 2 GB there are the pesky Java int primitive issues to deal with.) Our clients are both Flex and curl. It works for me; it can work for you.
While FTP is great and all, you can avoid many (but not all) firewall issues by using HTTP.
If you want to reduce bandwidth, you may want to send the data compressed (unless it's compressed already). This can shrink the data volume by a factor of 2-3, depending on what you are sending.
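For instance, something along these lines with the JDK's built-in GZIP support; the file names are just placeholders, and if you compress on the fly you would wrap the upload's output stream instead of a file:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.GZIPOutputStream;

public class CompressBeforeUpload {
    public static void main(String[] args) throws IOException {
        // Compress upload.bin into upload.bin.gz before sending it (placeholder file names).
        try (InputStream in = Files.newInputStream(Paths.get("upload.bin"));
             OutputStream out = new GZIPOutputStream(Files.newOutputStream(Paths.get("upload.bin.gz")))) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n); // compressed bytes land in the .gz file
            }
        } // closing the GZIPOutputStream finishes the gzip trailer
    }
}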
For an example of good practice for uploading large files, and the various ways of tackling it, have a look at flickr.com (you may have to sign up to get to the uploader page)
They provide various options, including HTTP form upload, a Java desktop client, or some kind of JavaScript-driven gadget that I can't quite figure out. They don't seem to use Flash anywhere.
For sending files to a server, unless you have to use HTTP, FTP is the way to go. As for throttling, I am not completely sure, at least not programmatically.
Personally, though, it seems like limiting the upload speed would be better accomplished on the server side.
Our file is going to be about 10 TB in size on average. I was wondering if there is a better way of doing this to make it faster?
BufferedReader reader = new BufferedReader(new FileReader("file.txt"));
int lines = 0;
while (reader.readLine() != null) lines++;
reader.close();
I don't think anyone can really answer your question as asked. Here are the missing details that I would need to really give you a good answer.
What file system are you using to store 10TB files?
If they are really 10TB then I am assuming you are running a cluster of some sort. What distributed file system are you using?
What OS are you running on?
Linux/Win/etc.
Do you have to use Java or can you dive down into C/C++?
Handling files this large really really fast will require hooking into system calls that are not portable
Can you write out the number of lines when you are creating the file?
This problem goes away if you just write the number of lines when you are creating it.
If this is on a cluster, are you copying the file locally and then processing it?
Are you mapping/mounting a drive over the network and processing it there? If so, then you are limited by the network bandwidth available to move a 10TB file from your cluster to your workstation.
Without those 6 items anyone is just guessing.
Update with OP response:
Here is what I would do given the info.
Before you do anything at all, you need to see if you are saturating your network connection. Given that you are dealing with HUGE amounts of data over the network, there may be nothing you can do beyond upgrading your switches and tuning the network stack on your servers. If, and only if, you have confirmed that your network connection(s) are not pegged at 100%, the items below are other things I would try.
Start simple and increase the buffer size on your BufferedReader. I think Java defaults to 8192 bytes for the buffer size. Depending on how you have the HDFS file system set up and your network, you may be able to get substantial speedups just by increasing the buffer size (see the sketch after these steps).
If you're still slow, I would try using a FileChannel (also shown in the sketch below).
Still slow? Run two threads, one reading from the start of the file and one from the end. Play with the buffer sizes like you did in step 2.
If you're still too slow, can you hook right into HDFS? If you are reading the file over an NFS mount, hooking directly into HDFS may give a performance boost.
Still slow?? Install another network card and channel bond it to double your throughput and then start back at step 1 :)
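Not a definitive recipe, but here is a rough sketch of the first two steps, assuming the file is readable as an ordinary path from the machine doing the counting; the class and method names are mine and the 1 MB buffer sizes are guesses to tune, not recommendations:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;

public class LineCounters {
    // Step 1: the same readLine() loop, but with a much bigger buffer than the 8 KB default.
    static long countWithBufferedReader(String path) throws IOException {
        long lines = 0;
        try (BufferedReader reader = new BufferedReader(new FileReader(path), 1 << 20)) { // 1 MB buffer
            while (reader.readLine() != null) lines++;
        }
        return lines;
    }

    // Step 2: read raw bytes through a FileChannel and count '\n', avoiding per-line Strings.
    static long countWithFileChannel(String path) throws IOException {
        long lines = 0;
        try (FileChannel channel = FileChannel.open(Paths.get(path))) {
            ByteBuffer buffer = ByteBuffer.allocateDirect(1 << 20);
            while (channel.read(buffer) > 0) {
                buffer.flip();
                while (buffer.hasRemaining()) {
                    if (buffer.get() == '\n') lines++; // count newline bytes only
                }
                buffer.clear();
            }
        }
        return lines;
    }
}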
Good luck!!
My Java web application uses an NFS file system. I use FileOutputStream to open the file, write multiple chunks, and then close it.
From the profiler stats I found that stream.write(byte[] payload, int begin, int length) and even stream.flush() take zero milliseconds. Only the stream.close() call takes a non-zero number of milliseconds.
It seems that Java's FileOutputStream write() and flush() don't really cause the NFS client to send data to the NFS server. Is there another Java class that will make the NFS client flush data in real time, or is there some NFS client tuning that needs to be done?
You are probably running into Unix client-side caching. There are lots of details here in the O'Reilly NFS book.
But in short:
Using the buffer cache and allowing async threads to cluster multiple buffers introduces some problems when several machines are reading from and writing to the same file. To prevent file inconsistency with multiple readers and writers of the same file, NFS institutes a flush-on-close policy:
All partially filled NFS data buffers for a file are written to the NFS server when the file is closed.
For NFS Version 3 clients, any writes that were done with the stable flag set to off are forced onto the server's stable storage via the commit operation.
NFS cache consistency uses an approach called close-to-open cache consistency - that is, you have to close the file before your server (and other clients) get a consistent up-to-date view of the file. You are seeing the downsides of this approach, which aims to minimize server hits.
Avoiding the cache is hard from Java. You'd need to set the file open() O_DIRECT flag if you're using Linux; see this answer for more https://stackoverflow.com/a/16259319/5851520, but basically it disables the client's OS cache for that file, though not the server's.
Unfortunately, the standard JDK doesn't expose O_DIRECT, as discussed here: Force JVM to do all IO without page cache (e.g. O_DIRECT) - essentially, use JNI yourself or use a nice 3rd-party lib. I've heard good things about JNA: https://github.com/java-native-access/jna ...
Alternatively, if you have control over the client mount point, you can use the sync mount option, as per NFS manual. It says:
If the sync option is specified on a mount point, any system call
that writes data to files on that mount point causes that data to be
flushed to the server before the system call returns control to user
space. This provides greater data cache coherence among clients, but
at a significant performance cost.
This could be what you're looking for.
Generally, Java's streams make no guarantee about the effects of flush apart from maybe the flushing of the buffers in the Java classes involved.
To overcome that limitation, Java NIO's Channels can be used; see e.g. https://docs.oracle.com/javase/7/docs/api/java/nio/channels/FileChannel.html#force(boolean). However, if "the file does not reside on a local device then no such guarantee is made." And Java cannot make such a guarantee, because the underlying remote file system or protocol may not be able to provide that function at all. Still, you should be able to achieve (almost) the same level of synchronization with force() that you'd get from the native O_DIRECT access that @SusanW mentioned.
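For illustration, a minimal sketch of the force() route; the NFS path is a placeholder, and as noted above the guarantee is weaker when the file is not on a local device:

import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;

public class ForcedNfsWrite {
    public static void main(String[] args) throws IOException {
        // "/mnt/nfs/myfile.bin" stands in for a path on the NFS mount.
        try (FileOutputStream fos = new FileOutputStream("/mnt/nfs/myfile.bin");
             FileChannel channel = fos.getChannel()) {
            ByteBuffer chunk = ByteBuffer.wrap("chunk of data".getBytes(StandardCharsets.UTF_8));
            while (chunk.hasRemaining()) {
                channel.write(chunk);
            }
            // Ask the OS to push the data out now instead of waiting for close().
            // Whether this actually reaches the NFS server depends on the client and mount options.
            channel.force(false);
        }
    }
}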
I am interested in implemented the following simple flow:
The client sends a simple message to a server process, which stores it. Since the message does not have any hierarchical structure, IMO the best approach is to save it in a file instead of an RDB.
But I want to figure out how to optimize this, since as I see it there are two choices:
1. Server sends a 200 OK to the client and then stores the message, so the client does not notice any delay.
2. Server saves the message and then sends the 200 OK, but then the client notices the overhead of the file I/O.
I prefer the performance of (1), but this could lead to the client thinking all went OK when actually the message was never saved (for various error cases).
So I was thinking whether I could use NIO and memory-mapped files.
But I was wondering: is this a good candidate for memory-mapped files? Would using a memory-mapped file guarantee that, e.g., if the process crashed, the message would still be saved?
In my mind the flow would involve creating/opening and closing many files, so is this a good candidate for memory-mapping files?
Server saves the message and then sends the 200OK but then the client notices the overhead of the file I/O.
I suggest you test this. I doubt a human will notice a 10 milli-second delay and I expect you should get better than this for smaller messages.
So I was thinking if I could use nio and memory mapped files.
I use memory mapping as it can reduce the overhead per write by up to 5 microseconds. Is this important to you? If not, I would stick with the simplest approach.
Would using a memory mapped file guarantee that e.g. if the process crashed the msg would be saved?
As long as the OS doesn't crash, yes.
In my mind the flow would be creating/opening and closing many files so is this a good candidate for memory-mapping files?
Opening and closing files is likely to be far more expensive than writing the data (by an order of magnitude). I would suggest keeping such operations to a minimum.
You might find this library of mine interesting: https://github.com/peter-lawrey/Java-Chronicle It allows you to persist messages in single-digit microseconds for text and sub-microsecond times for small binary messages.
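For what it's worth, here is a minimal sketch of the memory-mapped approach under discussion; the file name, region size, and length-prefixed record format are my own choices for illustration, not Chronicle's API:

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;

public class MappedMessageStore {
    public static void main(String[] args) throws Exception {
        // The file name and the 64 MB mapped region are arbitrary choices for this sketch.
        try (RandomAccessFile raf = new RandomAccessFile("messages.dat", "rw");
             FileChannel channel = raf.getChannel()) {
            MappedByteBuffer map = channel.map(FileChannel.MapMode.READ_WRITE, 0, 64 << 20);
            byte[] msg = "hello".getBytes(StandardCharsets.UTF_8);
            map.putInt(msg.length); // simple length-prefixed record
            map.put(msg);
            // Writes above go to the OS page cache: they survive a JVM crash,
            // but not an OS crash or power loss unless you also call map.force().
        }
    }
}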
I am tasked with writing a client-side data download system (on Linux) that uses FTP or HTTP to download terabyte-sized data from external partners to our local site. Our company's network admin tells me that I cannot exceed a certain bandwidth. What is the best way for me to implement such a system? Are there existing libraries for this?
I am open to writing my own FTP and HTTP clients (in either C or Java on Linux) but would prefer to stay out of the kernel. I know that I can limit the rate at which my FTP/HTTP client calls socket read(), but what happens if the server side calls write() faster than my limit?
You could build another layer on top of an InputStream: in the read method, count the bytes read so far. If the number of bytes per second exceeds a certain limit, let the download thread sleep for a while. TCP's flow control does the rest.
I know Apache JMeter simulates slow connections. You could maybe take a look at its code.
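A rough sketch of that wrapper idea, assuming an average rate limit is good enough; the class name and fields are made up:

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ThrottledInputStream extends FilterInputStream {
    private final long maxBytesPerSecond;
    private final long startNanos = System.nanoTime();
    private long totalBytes;

    public ThrottledInputStream(InputStream in, long maxBytesPerSecond) {
        super(in);
        this.maxBytesPerSecond = maxBytesPerSecond;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) {
            totalBytes++;
            throttle();
        }
        return b;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        int n = super.read(b, off, len);
        if (n > 0) {
            totalBytes += n;
            throttle();
        }
        return n;
    }

    // Sleep whenever the average rate so far exceeds the limit; because we stop
    // draining the socket, TCP flow control eventually slows the sender down too.
    private void throttle() throws IOException {
        double elapsedSeconds = (System.nanoTime() - startNanos) / 1e9;
        double expectedSeconds = (double) totalBytes / maxBytesPerSecond;
        long sleepMillis = (long) ((expectedSeconds - elapsedSeconds) * 1000);
        if (sleepMillis > 0) {
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException("interrupted while throttling", e);
            }
        }
    }
}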
If you know the network path delay you could just set your TCP receive buffer size to the desired bandwidth-delay product. That will throttle the sender all right. But the resulting value may be too small for your platform, so it may adjust it upwards. Check the value after you set it.
Does your netadmin know that TCP automatically shares bandwidth fairly?
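A sketch of the buffer-sizing idea, with made-up numbers for the bandwidth cap and path delay and a placeholder endpoint:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class ReceiveBufferSizing {
    public static void main(String[] args) throws IOException {
        long bytesPerSecond = 1_000_000;   // desired bandwidth cap (made-up figure)
        double pathDelaySeconds = 0.050;   // measured round-trip delay (made-up figure)
        int bufferSize = (int) (bytesPerSecond * pathDelaySeconds); // bandwidth-delay product

        try (Socket socket = new Socket()) {
            // Set before connecting so the receive window can be negotiated accordingly.
            socket.setReceiveBufferSize(bufferSize);
            socket.connect(new InetSocketAddress("example.com", 80)); // placeholder endpoint
            // The platform may have adjusted the value, so check what you actually got.
            System.out.println("Requested " + bufferSize + ", effective " + socket.getReceiveBufferSize());
        }
    }
}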
Are you open to off-the-shelf GUI or command-line products? FileZilla provides this.
There is also a Linux command-line client called lftp. A settable parameter is net:limit-total-rate, which will limit the total rate of transfer. Since this client supports multiple transfers at one time, it also has a parameter net:limit-rate.
To keep it simple, if you are on Linux you could just use wget instead of reinventing the wheel. Take a look at the --limit-rate switch.
But back on topic :) This answer could get you started: How can I implement a download rate limited in Java?
What's the best way to go about things in terms of speed/performance?
Where do things like "Apache Thrift" come in and what are the benefits?
Please add some good resources I can use to learn about any recommendations!
Thanks all
Presuming you mean both processes are already running, then it's going to be via sockets.
Writing a file to the disk from one process then reading it from the other is going to incur the performance hit of the disk write and read (and of course whatever method you employ to keep the reader from accessing the file until it's done being written; either locks or an atomic rename on the disk).
Even ignoring that, your localhost interface is going to have a faster transfer rate than your disk controller, with the possible exception of a 10Gb fiber channel RAID array with 15k RPM drives in it.
Try it out. There's just no other way to find out.
Using sockets or the file system should be comparably fast, since both methods rely on some system calls that are very similar.
Always be aware that this communication involves these steps:
Encoding your data to a stream of bytes (JSON, XML, YAML, X.509 DER, Java Serialization)
Transferring this stream of bytes (TCP socket, UNIX socket, filesystem, ramdisk, pipes)
Decoding the stream of bytes into data (same as step 1)
Steps 1 and 2 are completely independent, so take that into account when you benchmark.
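To make those steps concrete, here is a toy loopback example; the length-prefixed UTF-8 encoding is just one arbitrary choice for steps 1 and 3, and you could swap in JSON, XML, or a Thrift serializer without touching the transfer step:

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class LoopbackTransfer {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {
            Thread reader = new Thread(() -> {
                try (Socket s = server.accept();
                     DataInputStream in = new DataInputStream(s.getInputStream())) {
                    byte[] buf = new byte[in.readInt()];
                    in.readFully(buf);                                         // step 2: transfer
                    String message = new String(buf, StandardCharsets.UTF_8);  // step 3: decode
                    System.out.println("received: " + message);
                } catch (IOException e) {
                    e.printStackTrace();
                }
            });
            reader.start();

            try (Socket s = new Socket("localhost", server.getLocalPort());
                 DataOutputStream out = new DataOutputStream(s.getOutputStream())) {
                byte[] payload = "hello".getBytes(StandardCharsets.UTF_8);     // step 1: encode
                out.writeInt(payload.length);
                out.write(payload);
            }
            reader.join();
        }
    }
}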