Issue with file upload in Java

I am trying to upload a 6 GB file; my application restricts uploads to a maximum of 6 GB. I get a timeout error. The current logic chunks the file and uploads the chunks sequentially. My question might be a bit foolish, but I want to know whether parallel processing of these chunks will really solve the problem for a 6 GB file, and whether it is really required. The timeout error sometimes cannot be reproduced, but it does happen for some users and may depend on network bandwidth. Please suggest.
Thanks
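For reference, this is roughly what I have in mind for the parallel version (just a sketch; uploadChunk is a placeholder for the call my application already uses to send one chunk with its offset):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelChunkUploader {

    private static final int CHUNK_SIZE = 8 * 1024 * 1024; // 8 MB per chunk, tune as needed

    public static void upload(final String path, int parallelism)
            throws IOException, InterruptedException, ExecutionException {
        RandomAccessFile probe = new RandomAccessFile(path, "r");
        final long length = probe.length();
        probe.close();

        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        List<Future<?>> results = new ArrayList<Future<?>>();

        for (long offset = 0; offset < length; offset += CHUNK_SIZE) {
            final long chunkOffset = offset;
            final int chunkLength = (int) Math.min(CHUNK_SIZE, length - offset);
            results.add(pool.submit(new Callable<Void>() {
                public Void call() throws IOException {
                    // Each task reads its own slice of the file...
                    byte[] buffer = new byte[chunkLength];
                    RandomAccessFile in = new RandomAccessFile(path, "r");
                    try {
                        in.seek(chunkOffset);
                        in.readFully(buffer);
                    } finally {
                        in.close();
                    }
                    // ...and sends it; uploadChunk stands in for the existing per-chunk upload call.
                    uploadChunk(chunkOffset, buffer);
                    return null;
                }
            }));
        }

        for (Future<?> result : results) {
            result.get(); // propagate any chunk failure
        }
        pool.shutdown();
    }

    private static void uploadChunk(long offset, byte[] data) {
        // Placeholder: send this chunk and its offset to the server.
    }
}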

Related

Processing 100 MB+ files from FTP server in less than 30 sec?

Problem Statement: An FTP server is flooded with files arriving at a rate of 100 Mbps (i.e. 12.5 MB/s); each file is roughly 100 MB. Files are deleted 30 seconds after their creation timestamp. Any process interested in reading these files must take away the complete file in less than 30 seconds. I am using Java to solve this particular problem.
I need a suggestion on which design pattern would be best suited for this kind of problem. How would I make sure that each file is consumed before the server deletes it?
Your suggestions will be greatly appreciated. Thanks
If the Java application runs on the same machine as the FTP service, then it could use File.renameTo(File) or equivalent to move a required file out of the FTP server directory and into another directory. Then it can process it at its leisure. It could use a WatchService to monitor the FTP directory for newly arrived files. It should watch for events on the directory, and when a file starts to appear it should wait for the writes to stop happening. (Depending on the OS, you may or may not be able to move the file while the FTP service is writing to it.)
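A minimal sketch of that approach (directory paths are illustrative, and "wait for the writes to stop" is simplified to "the file size has stopped changing"):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;

public class FtpDirectoryMover {

    public static void main(String[] args) throws IOException, InterruptedException {
        Path ftpDir = Paths.get(args.length > 0 ? args[0] : "/srv/ftp/incoming"); // illustrative
        Path workDir = Paths.get(args.length > 1 ? args[1] : "/srv/work");        // illustrative

        WatchService watcher = ftpDir.getFileSystem().newWatchService();
        ftpDir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);

        while (true) {
            WatchKey key = watcher.take();
            for (WatchEvent<?> event : key.pollEvents()) {
                Path created = ftpDir.resolve((Path) event.context());
                waitUntilStable(created);
                // Equivalent of File.renameTo(File): a move within the same file system is cheap.
                Files.move(created, workDir.resolve(created.getFileName()),
                        StandardCopyOption.ATOMIC_MOVE);
                // Hand the moved file to a worker thread for processing here.
            }
            key.reset();
        }
    }

    // Crude "writes have stopped" check: the size has not changed for one second.
    private static void waitUntilStable(Path file) throws IOException, InterruptedException {
        long previous = -1;
        long current = Files.size(file);
        while (current != previous) {
            Thread.sleep(1000);
            previous = current;
            current = Files.size(file);
        }
    }
}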
There is a secondary issue of whether a Java application could keep up with the required processing rate. However, if you have multiple cores and multiple worker threads, your app could potentially process the files in parallel. (It depends on how computationally and/or I/O intensive the processing is. And on the rate at which a Java thread can read a file ... which will be OS and possibly hardware dependent.)
If the Java application is not running on the FTP server, it would probably need to use FTP to fetch the file. I am doubtful that you could implement something to do that consistently and reliably; i.e. without losing files occasionally.

GCS resumable uploads speed

I have a question concerning upload speed to Google Cloud Storage using resumable uploads.
I wrote a desktop Java client to upload large files to GCS (it has some specialized features, which is why gsutil was not an option for my company). During test runs about 2 months ago it utilized the available connection bandwidth very well, reaching about 20 Mbps on a 25 Mbps connection. The project was then frozen for almost 2 months, and now that it has been reopened the same client uploads at a very poor speed of about 1.4 Mbps out of the 25 Mbps available.
I wrote a simple Python script to check whether it would have the same problem; it is a little faster, but still only about 2 Mbps. The gsutil tool performs almost the same as my Python script.
I have also run the test on a different network infrastructure with over 50 Mbps upload speed.
The results are also quite poor:
Java client 2.4Mbps
Python script 3.2Mbps
gsutil 3.2Mbps
The only thing that has changed is the Google Cloud Storage API version. I am using the JSON API, and the first tests were run against the v1beta API version.
At the moment it makes no difference whether I use the deprecated API or the new one.
Has anyone encountered the same upload speed degradation?
What are your average upload speeds?
What could be a possible reason for such a dramatic drop in upload performance?
Will parallel uploads of composite objects help me to fully utilize available bandwidth?
To ascertain the highest bandwidth you can expect, we suggest running the gsutil perfdiag command.
For example, to see how well it uploads a 100 MB file:
gsutil perfdiag -t wthru -s 100M gs://bucketname
This will upload a 100MB file five times and report the results. An example output from my run:
------------------------------------------------------------------------------
Write Throughput
------------------------------------------------------------------------------
Copied a 100 MB file 5 times for a total transfer size of 500 MB.
Write throughput: 71.61 Mbit/s.
It will also output a lot of other information that might help diagnose the problem. If the perfdiag output shows much higher throughput than your application achieves, then something might be wrong with your code. If the perfdiag output also shows low bandwidth, then something might be wrong with the network path to Google's servers, and the perfdiag output can help identify the problem. If that doesn't solve your problem, please email the result file (perfdiag -o output.json) to gs-team@google.com.

Workaround for handling an expected java.lang.OutOfMemoryError: Java heap space

I am working on a Java web application based on Java 6/Tomcat 6.0. It is a web-based document management system. Customers may upload any kind of file to the web application. After a file is uploaded, a new thread is spawned in which the uploaded file is analyzed. The analysis is done using a third-party library.
This third-party library works fine in about 90% of the analysis jobs, but sometimes (depending on the uploaded file) its logic starts to use all remaining memory, leading to an OutOfMemoryError.
As the whole application runs in a single JVM, the OoM-Error does not only affect the analysis jobs but also impacts other features. In the worst case the application crashes completely or remains in an inconsistent state.
I am now looking for a rather quick (but safe) way to handle these OoM-Errors. Replacing the library is currently not an option (that is why I have neither mentioned the name of the library nor what kind of analysis is done). Does anybody have an idea what could be done to work around this error?
I have been thinking about launching a new process (java.lang.ProcessBuilder) to get a separate JVM. If the third-party lib causes an OoM-Error there, it would not affect the web application. On the other hand, this would mean additional effort to synchronize the new process with the analysis part of the web application. Does anybody have experience with such a system (especially with regard to its stability)?
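To illustrate the idea (just a sketch, not tested): the analysis could be wrapped behind a hypothetical com.example.AnalyzerMain main class and started in its own JVM with a small heap, so an OoM-Error there only fails that one job:

import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;

public class ExternalAnalyzer {

    // Returns true if the external analysis wrote its result successfully.
    public static boolean analyze(File input, File output) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(
                "java", "-Xmx512m",
                "-cp", System.getProperty("java.class.path"),
                "com.example.AnalyzerMain",                 // hypothetical wrapper around the library
                input.getAbsolutePath(), output.getAbsolutePath());
        pb.redirectErrorStream(true);                       // merge stderr into stdout

        Process process = pb.start();

        // Drain the child's output so it cannot block on a full pipe buffer.
        BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream()));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println("[analyzer] " + line);
        }
        reader.close();

        int exitCode = process.waitFor();                   // a crash due to OoM gives a non-zero exit
        return exitCode == 0;
    }
}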
Some more information:
1) The analysis part can be summarized as a kind of text extraction. The module receives a file reference as input and writes the analysis result into a text file. The resulting text file is further processed by the web application's business logic. Currently the workflow is synchronous: the business logic waits for the third-party lib to complete its job. There is no queuing or other asynchronous approach.
2) I am quite sure that the third-party library causes the OoM-Error. I have tested the analysis part in isolation with different files of different sizes. The file that causes the OoM-Error is quite small (about 4 MB). I have done further tests with that particular file: with a 256 MB heap the analysis crashes with the OoM-Error, while the same test passes with a 512 MB heap. However, increasing the heap size only helps for a short time, as a larger test file again makes the test fail with an OoM-Error.
3) A limit on the size of uploaded files is in place, but of course it cannot be 4 MB per file. The same goes for the OS and architecture: the system has to work on both 32-bit and 64-bit systems (Windows and Linux).
It depends on both the client and the server as well as the design of the web app. You need to answer a few questions:
What is supposed to happen as a result of the analysis, and when is it supposed to happen?
Does the client wait for the result of the analysis?
What is returned to the client?
You also need to determine the nature of the OOM.
It is possible that you might want to handle the file upload and the file analysis separately. For instance, your webapp can upload the file to somewhere on the file system and defer the analysis to a web service, which would be passed a reference to the file location. The web service may or may not be called asynchronously, depending on how and when the client that uploaded the file needs to be notified if there is a problem in the analysis.
All of these factors go into your determination.
Other considerations: what JVM are you using, what is the OS, and how is it configured in terms of system memory? Is the JVM 32-bit or 64-bit? What is the maximum file size allowed on upload? What kinds of garbage collectors have you tried?
It is possible that you can solve this problem from an infrastructure perspective rather than by changing the code: limiting the maximum size of file uploads, moving from 32-bit to 64-bit, changing the garbage collector, upgrading libraries after determining whether there is a bug or memory leak in one of them, and so on.
One other glaring red flag: you say "a thread is spawned". While this sort of thing is possible, it is often frowned upon in the JEE world, because spawning threads yourself can interfere with how the container manages resources. Make sure you are not causing the issue yourself: try the file load independently in a test environment on a file that is known to cause problems (if one can be identified). This will help you determine whether the problem is the third-party library or a design issue.
Why not have a (possibly clustered) application per third-party lib that handles file analysis? Those applications are called remotely (possibly asynchronously) from your main application. They are passed a URL pointing to the file they should analyze and return their analysis results.
When a file upload completes, the analysis job is put into a queue. When an analyzer application comes back up after a crash, it resumes consuming messages from the queue.
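As a very rough sketch of the consuming side (all names illustrative), a shared "jobs" directory can stand in for the queue: the web app writes one small job file per upload containing the location of the document, and a separate analyzer JVM polls the directory, deleting a job file only after successful analysis, so unfinished jobs survive a crash:

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class AnalyzerWorker {

    public static void main(String[] args) throws IOException, InterruptedException {
        File jobDir = new File(args.length > 0 ? args[0] : "jobs"); // illustrative queue directory
        while (true) {
            File[] jobs = jobDir.listFiles();
            if (jobs != null) {
                for (File job : jobs) {
                    if (!job.isFile()) {
                        continue;
                    }
                    String documentLocation = new String(Files.readAllBytes(job.toPath()), "UTF-8").trim();
                    analyze(documentLocation);              // call into the third-party library here
                    if (!job.delete()) {                    // delete only after successful analysis
                        System.err.println("Could not delete job file " + job);
                    }
                }
            }
            Thread.sleep(1000);                             // simple polling interval
        }
    }

    private static void analyze(String documentLocation) {
        // Placeholder for the text extraction done by the third-party library.
        System.out.println("Analyzing " + documentLocation);
    }
}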

GC selection for Java/Netty file upload REST service

I have a Java REST service that runs on Netty and is used for uploading (streaming) a large number of files of 5 MB to 500 MB. When I increase the number of concurrent uploads, at some point the application runs out of memory, which is expected, but I am looking for recommendations on which Java GC and VM settings to use in this scenario to improve performance and reduce the memory footprint.
I would really appreciate it if somebody could share similar experiences.
UPDATE: To add more context to the question, the REST service receives the file as a stream and passes the same stream on to Amazon S3.
You should likely use flow control when sending large amounts of data. You can check whether the channel is writable (Channel.isWritable()) before sending data and wait until it is ready. You can use notifications via ChannelInboundHandler.channelWritabilityChanged(ChannelHandlerContext ctx) to track this. Without flow control, all your large files will consume memory while waiting to be sent in Netty's output buffer.
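A minimal sketch of that pattern with Netty 4 (ChunkSource is a hypothetical supplier of the data to send): writing pauses whenever the outbound buffer passes its high-water mark and resumes when Netty signals that it has drained:

import io.netty.channel.Channel;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

public class FlowControlledWriter extends ChannelInboundHandlerAdapter {

    // Hypothetical interface for whatever produces the outgoing chunks.
    public interface ChunkSource {
        boolean hasNext();
        Object next();
    }

    private final ChunkSource source;

    public FlowControlledWriter(ChunkSource source) {
        this.source = source;
    }

    private void writeWhileWritable(Channel ch) {
        // Write only while the outbound buffer is below the high-water mark.
        while (ch.isWritable() && source.hasNext()) {
            ch.writeAndFlush(source.next());
        }
    }

    @Override
    public void channelActive(ChannelHandlerContext ctx) {
        writeWhileWritable(ctx.channel());
        ctx.fireChannelActive();
    }

    @Override
    public void channelWritabilityChanged(ChannelHandlerContext ctx) {
        // Called by Netty when writability flips; resume once the buffer has drained.
        if (ctx.channel().isWritable()) {
            writeWhileWritable(ctx.channel());
        }
        ctx.fireChannelWritabilityChanged();
    }
}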

Issue with FileWriter

I am running my Java application on a Windows 2008 server (64-bit) on the HotSpot VM.
A few months ago I created a tool to assist in detecting deadlocks in my application. For the past month or so, the only thing that has been giving me any problems is writing to text files.
The main thread always seems to get stuck on the following line for what I would estimate to be almost 5 seconds at a time. After a few seconds the application continues to run normally and without problems:
PrintWriter writer = new PrintWriter(new FileWriter(PATH + name + ".txt"));
I am not sure what causes this, but any insight into the problem would be most appreciated. The files I am writing are small, so that is unlikely to be the issue (unless anyone has any objections).
If you need any more information, please let me know.
Is PATH on a network drive? You could see almost any delay writing to a network file system. It is generally a very bad idea to do that from applications; they should generally write all their files locally and then post transactions to a server somehow.
When your file system gets overloaded, you can see delays with even the simplest of tasks. For example, if I create a large file (multiple GB) and then try a simple disk access that is not cached, it can wait seconds.
I would check that your disk write cache is turned on and that your disks are idle most of the time. ;)
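A quick way to confirm where the time goes is to time the construction of the writer separately from the actual writes (a minimal sketch; the path argument stands in for PATH + name):

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

public class WriterTiming {

    public static void main(String[] args) throws IOException {
        String path = args.length > 0 ? args[0] : "out";   // stand-in for PATH + name
        long start = System.nanoTime();
        PrintWriter writer = new PrintWriter(new FileWriter(path + ".txt"));
        long opened = System.nanoTime();
        System.out.printf("open took %.1f ms%n", (opened - start) / 1e6);
        writer.println("test");
        writer.close();
        long done = System.nanoTime();
        System.out.printf("total took %.1f ms%n", (done - start) / 1e6);
    }
}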
