I have a question concerning upload speed to Google Cloud Storage using resumable uploads.
I've written a desktop Java client to upload large files to GCS (it has some specialized features, which is why gsutil was not an answer for my company). During test runs about 2 months ago it was utilizing the available connection bandwidth very well, reaching about 20 Mbps on a 25 Mbps connection. The project was frozen for almost 2 months, and now that it has been reopened the same client is uploading at a very poor speed of about 1.4 Mbps out of the 25 Mbps available.
I've written a simple Python script to check whether it would have the same problem; it is a little faster, but still only about 2 Mbps. The gsutil tool performs almost the same as my Python script.
I've also run the test on a different network infrastructure with over 50 Mbps upload speed.
The results are also quite poor:
Java client: 2.4 Mbps
Python script: 3.2 Mbps
gsutil: 3.2 Mbps
The only thing that has changed is the Google Cloud Storage API version. I'm using the JSON API, and the first tests were run against the v1beta API version.
At the moment it makes no difference whether I use the deprecated API or the new one.
Has anyone encountered the same upload speed degradation?
What are your average upload speeds?
What could be a possible reason for such a dramatic decrease in upload performance?
Will parallel uploads of composite objects help me to fully utilize available bandwidth?
To determine the highest bandwidth you can expect, we suggest running the gsutil perfdiag command.
For example, to see how well it uploads a 100 MB file:
gsutil perfdiag -t wthru -s 100M gs://bucketname
This will upload a 100MB file five times and report the results. An example output from my run:
------------------------------------------------------------------------------
Write Throughput
------------------------------------------------------------------------------
Copied a 100 MB file 5 times for a total transfer size of 500 MB.
Write throughput: 71.61 Mbit/s.
It will also output a lot of other information that can help diagnose the problem. If the perfdiag output shows much higher throughput than your application, then something might be wrong with your code. If the perfdiag output also shows low bandwidth, then something might be wrong with your network path to Google's servers, and the perfdiag output can help identify the problem. If that doesn't help solve your problem, please email the result file (perfdiag -o output.json) to gs-team@google.com.
Related
I am trying to upload a 6 GB file; my application restricts uploads to at most 6 GB. I get a timeout error. The logic chunks the file normally and uploads the chunks sequentially. My question might be a bit foolish, but I want to know whether parallel processing of these chunks will really solve the problem for a 6 GB file, and whether it is really required. The timeout error sometimes cannot be reproduced but does happen for some users and could depend on network bandwidth. Please suggest.
Thanks
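To clarify what I mean by "parallel processing of these chunks", here is a rough sketch (not my actual code) using a fixed thread pool; uploadChunk() is a hypothetical stand-in for the existing per-chunk upload call, and the chunk size and thread count are arbitrary:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelChunkUpload {

    static final int THREADS = 4;                      // arbitrary; tune to your bandwidth
    static final long CHUNK_SIZE = 64L * 1024 * 1024;  // arbitrary 64 MB chunks

    public static void main(String[] args) throws Exception {
        long fileSize = 6L * 1024 * 1024 * 1024;       // ~6 GB, as in the question
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        List<Callable<Void>> tasks = new ArrayList<>();
        for (long offset = 0; offset < fileSize; offset += CHUNK_SIZE) {
            final long start = offset;
            final long end = Math.min(offset + CHUNK_SIZE, fileSize);
            tasks.add(() -> { uploadChunk(start, end); return null; });
        }
        // invokeAll blocks until every chunk task has finished (or failed)
        for (Future<Void> f : pool.invokeAll(tasks)) {
            f.get();                                   // rethrows any upload exception
        }
        pool.shutdown();
    }

    // Hypothetical placeholder: the real per-chunk upload logic goes here.
    static void uploadChunk(long start, long end) {
        System.out.println("uploading bytes " + start + " to " + end);
    }
}

Whether this actually avoids the timeout depends on whether the bottleneck is per-chunk latency or overall bandwidth, which is really what I am asking.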
I'm looking for an efficient way to read multiple files from FTP into Google Cloud Storage. Each file is 3-5 GB in size, and there are 100-200 files.
So far I have found the following solution: read the files using a GAE instance.
Any ideas what else I can try?
The best way will be to use gsutil's parallel composite uploads to Cloud Storage. You can try this with:
gsutil -o GSUtil:parallel_composite_upload_threshold=150M cp bigfile gs://your-bucket
Basically:
gsutil divides the file into multiple smaller chunks.
It then uploads all the chunks to Cloud Storage in parallel.
They get composed into a single object.
Then it deletes all the smaller chunks.
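If you end up doing this from your own Java code instead of through gsutil, the Cloud Storage Java client library exposes the same compose operation directly. The following is only a rough sketch assuming the google-cloud-storage library, with placeholder bucket and object names; the upload of the individual parts (which you could parallelize yourself) is omitted:

import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

public class ComposeExample {
    public static void main(String[] args) {
        Storage storage = StorageOptions.getDefaultInstance().getService();
        String bucket = "your-bucket";  // placeholder

        // Assume these parts were uploaded earlier (possibly in parallel).
        String[] parts = { "bigfile.part0", "bigfile.part1", "bigfile.part2" };

        // Compose the parts into a single object (up to 32 sources per compose call).
        BlobInfo target = BlobInfo.newBuilder(BlobId.of(bucket, "bigfile")).build();
        Storage.ComposeRequest request = Storage.ComposeRequest.newBuilder()
                .addSource(parts)
                .setTarget(target)
                .build();
        storage.compose(request);

        // Clean up the temporary parts, as gsutil does.
        for (String part : parts) {
            storage.delete(BlobId.of(bucket, part));
        }
    }
}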
Keep in mind this has a trade off described in the docs:
Using parallel composite uploads presents a tradeoff between upload
performance and download configuration: If you enable parallel
composite uploads your uploads will run faster, but someone will need
to install a compiled crcmod on every machine
where objects are downloaded by gsutil or other Python applications.
Note that for such uploads, crcmod is required for downloading
regardless of whether the parallel composite upload option is on or
not. For some distributions this is easy (e.g., it comes pre-installed
on macOS), but in other cases some users have found it difficult.
If you are not able to use gsutil and can't install the Cloud Storage SDK on your FTP server, you could download the files to a VM and run the Cloud Storage SDK or gsutil from that VM.
App Engine Standard does not allow writing to disk. Thus, any file you transfer has to be held in memory until you upload it to Cloud Storage, so I don't think it is convenient in this case.
App Engine Flexible does allow writing to disk. However, this is an ephemeral disk: its contents are deleted whenever the instance restarts, and instances are restarted every week. You also wouldn't be taking advantage of the load balancer and the automatic scaling the instance has.
In this case, I think the best option would be a Google Cloud preemptible VM. Even though this kind of VM lives for one day at most, it runs at a lower price than a normal VM. When one is about to be terminated, you can check which files have already been uploaded to Cloud Storage and resume the workload on a new preemptible VM. You could also use a large number of these VMs working in parallel to speed up the download and upload process.
Problem Statement: an FTP server is flooded with files arriving at a rate of 100 Mbps (i.e. 12.5 MB/s); each file is approximately 100 MB in size. Files are deleted 30 seconds after their creation timestamp. Any process interested in reading those files must take away the complete file in less than 30 seconds. I am using Java to solve this particular problem.
I need a suggestion on which design pattern would be best suited for this kind of problem. How would I make sure that each file is consumed before the server deletes it?
Your suggestion will be greatly appreciated. Thanks
If the Java application runs on the same machine as the FTP service, then it could use File.renameTo(File) or equivalent to move a required file out of the FTP server directory and into another directory. Then it can process it at its leisure. It could use a WatchService to monitor the FTP directory for newly arrived files. It should watch for events on the directory, and when a file starts to appear it should wait for the writes to stop happening. (Depending on the OS, you may or may not be able to move the file while the FTP service is writing to it.)
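A minimal sketch of that approach, using the newer java.nio Files.move in place of File.renameTo; the directory paths and the 500 ms "writes have stopped" polling interval are assumptions:

import java.nio.file.*;

public class FtpDirectoryWatcher {
    public static void main(String[] args) throws Exception {
        Path ftpDir = Paths.get("/srv/ftp/incoming");   // placeholder
        Path workDir = Paths.get("/srv/work");          // placeholder

        WatchService watcher = FileSystems.getDefault().newWatchService();
        ftpDir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);

        while (true) {
            WatchKey key = watcher.take();              // blocks until an event arrives
            for (WatchEvent<?> event : key.pollEvents()) {
                Path file = ftpDir.resolve((Path) event.context());
                waitUntilStable(file);
                // Move (rename) the file out of the FTP directory; on the same
                // filesystem this is cheap and takes the file out of the 30 s deletion window.
                Files.move(file, workDir.resolve(file.getFileName()),
                        StandardCopyOption.ATOMIC_MOVE);
                // Hand the moved file to a worker pool here for processing.
            }
            if (!key.reset()) break;                    // directory no longer accessible
        }
    }

    // Crude check that writes have stopped: wait until the size is unchanged
    // across one polling interval.
    static void waitUntilStable(Path file) throws Exception {
        long last = -1;
        long current = Files.size(file);
        while (current != last) {
            Thread.sleep(500);                          // assumed polling interval
            last = current;
            current = Files.size(file);
        }
    }
}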
There is a secondary issue of whether a Java application could keep up with the required processing rate. However, if you have multiple cores and multiple worker threads, your app could potentially process files in parallel. (It depends on how computationally and/or I/O intensive the processing is, and on the rate at which a Java thread can read a file, which will be OS and possibly hardware dependent.)
If the Java application is not running on the FTP server, it would probably need to use FTP to fetch the file. I am doubtful that you could implement something to do that consistently and reliably; i.e. without losing files occasionally.
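For reference, a bare-bones fetch with Apache Commons Net's FTPClient might look like the sketch below; the library choice, host, credentials, and paths are assumptions, and it does nothing to address the reliability problem of files being deleted mid-transfer:

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;
import org.apache.commons.net.ftp.FTP;
import org.apache.commons.net.ftp.FTPClient;

public class FtpFetch {
    public static void main(String[] args) throws Exception {
        FTPClient ftp = new FTPClient();
        ftp.connect("ftp.example.com");                 // placeholder host
        ftp.login("user", "password");                  // placeholder credentials
        ftp.enterLocalPassiveMode();
        ftp.setFileType(FTP.BINARY_FILE_TYPE);

        try (OutputStream out = new BufferedOutputStream(
                new FileOutputStream("/srv/work/incoming-file"))) {   // placeholder path
            if (!ftp.retrieveFile("/incoming/incoming-file", out)) {  // placeholder path
                System.err.println("Transfer failed: " + ftp.getReplyString());
            }
        }
        ftp.logout();
        ftp.disconnect();
    }
}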
It's pretty easy to minify scripts with YUI Compressor. Unfortunately, this process is very slow when executing the JAR with exec in PHP.
Example (PHP):
// start with basic command
$cmd = 'java -Xmx32m -jar /bin/yuicompressor-2.4.8pre.jar -o \'/var/www/myscript.min.js\' \'/var/www/myscript.min.temp.js\'';
// execute the command
exec($cmd . ' 2>&1', $ok);
The execution for ~20 files takes up to 30 seconds on a quad-core server with 8 GB RAM.
Does anybody know a faster way to minify a bunch of scripts?
The execution time mainly depends on the file size(s).
Give Google Closure Compiler a try.
It is also a good idea to cache the result in a file, or use an extension (APC, Memcached) in combination with client-side caching headers. If you check the last modification time with filemtime(), you will know whether you need to minify again or not.
I often cache each file separately to avoid re-minifying a large amount of content, then create an MD5 checksum of the whole output. If it has been modified since the last request, I save the new checksum and print out the content; otherwise I just send:
header('HTTP/1.1 304 Not Modified', true, 304);
This way very little computation is needed for each request, even during development. I'm using ExtJS 4 for my current project, which is 1.2 MB raw plus a lot of project code, without any problem and with an under-1-second response time.
I need to write a multithreaded Java application that will be used to load test an MMS server. A transaction starts when the MMS server indicates to my application that an MMS has arrived on the server; I then need to download the attachment that is part of the MMS from the MMS server using the protocol the server supports. Once the attachment is successfully downloaded, the transaction is complete. Since it's a load-testing application for the MMS server, the expected rate is above 1400 TPS, so I need to specify the hardware requirements for this application. I feel that I need horizontal scaling along with a load balancer and network connectivity in Gbps to download attachments. If I have 2 boxes, then each box has to handle 700 TPS. Is it feasible for a multithreaded Java application deployed on a Solaris box to achieve 700 TPS? Please let me know your thoughts on the architecture and hardware; it would also be helpful to get a suggestion on which Solaris hardware to consider. I have the Solaris T5220 in mind.
Thanks a lot in advance for all your help.
I doubt that you'll need such a big machine. This depends on a lot of different factors though, of which quality of code probably is the most important one.
Regarding network usage, you should really come up with the size in KB an average attachment will have. For 10 KB attachments, 1400 TPS would mean 14,000 KB or 14 MB per second. For 1 MB attachments it would be 1.4 GB per second - quite a difference, isn't it?
At 1.4 GB per second, you could also run into serious problems storing the data somewhere, if that is a requirement at all.
The processing itself shouldn't be too much of a problem (but again, depends on a multitude of different factors).
The best thing you could do is to use any free hardware (or virtual machine) you can grab and run some tests. Just see what numbers you get and decide where to go from there.