Read big files to GCP with Java

I'm looking for an efficient way to read multiple files from FTP into Google Cloud Storage. Each file is 3-5 GB, and there are 100-200 files.
So far I have found one approach: reading the files using a GAE instance.
Any ideas what else I can try?

The best approach would be to use parallel composite uploads to Cloud Storage, which gsutil implements using compose. You can try this with:
gsutil -o GSUtil:parallel_composite_upload_threshold=150M cp bigfile gs://your-bucket
Basically:
gsutil divides the file into multiple smaller chunks.
It then uploads all the chunks to Cloud Storage.
They get composed into a single object.
It then deletes all the smaller chunks.
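Since the question is about Java, here is a minimal sketch of the same composition step using the google-cloud-storage client library; the chunk names (part-0, part-1) are placeholders for objects you have already uploaded, and the bucket/object names follow the gsutil example above.

import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

public class ComposeExample {
    public static void main(String[] args) {
        Storage storage = StorageOptions.getDefaultInstance().getService();

        // Assume the chunks were already uploaded as part-0 and part-1 (placeholder names).
        BlobInfo target = BlobInfo.newBuilder("your-bucket", "bigfile").build();
        Storage.ComposeRequest request = Storage.ComposeRequest.newBuilder()
                .setTarget(target)
                .addSource("part-0")
                .addSource("part-1")
                .build();

        // Compose the chunks into a single object, then delete the parts.
        storage.compose(request);
        storage.delete("your-bucket", "part-0");
        storage.delete("your-bucket", "part-1");
    }
}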
Keep in mind this has a trade off described in the docs:
Using parallel composite uploads presents a tradeoff between upload performance and download configuration: If you enable parallel composite uploads your uploads will run faster, but someone will need to install a compiled crcmod on every machine where objects are downloaded by gsutil or other Python applications. Note that for such uploads, crcmod is required for downloading regardless of whether the parallel composite upload option is on or not. For some distributions this is easy (e.g., it comes pre-installed on macOS), but in other cases some users have found it difficult.
If you are not able to use gsutil or install the Cloud Storage client libraries on your FTP server, you could download the files to a VM and run the Cloud Storage client libraries or gsutil from that VM.
App Engine Standard does not allow writing to disk, so any file you download has to be held in memory until you upload it to Cloud Storage. I don't think that is practical in this case.
App Engine Flexible does allow writing to disk, but the disk is ephemeral: the instance is restarted at least once a week, and its contents are deleted on restart. You also wouldn't be taking advantage of the load balancing and automatic scaling the instances provide.
In this case, I think the best option is a Google Cloud preemptible VM. Even though a preemptible VM lives for at most 24 hours, it runs at a much lower price than a normal VM. When it is about to be terminated, you can check which files have already been uploaded to Cloud Storage and resume the workload on a new preemptible VM. You could also run many of these VMs in parallel to speed up the download and upload process.
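As a rough sketch of what the per-file work on such a VM could look like in Java, the snippet below streams one file from FTP straight into Cloud Storage using Apache Commons Net and the google-cloud-storage client. The host, credentials, and file/object names are placeholders, and in practice you would add retries and resume logic.

import com.google.cloud.WriteChannel;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import org.apache.commons.net.ftp.FTP;
import org.apache.commons.net.ftp.FTPClient;

import java.io.InputStream;
import java.nio.ByteBuffer;

public class FtpToGcs {
    public static void main(String[] args) throws Exception {
        // Placeholder FTP connection details.
        FTPClient ftp = new FTPClient();
        ftp.connect("ftp.example.com");
        ftp.login("user", "password");
        ftp.enterLocalPassiveMode();
        ftp.setFileType(FTP.BINARY_FILE_TYPE);

        Storage storage = StorageOptions.getDefaultInstance().getService();
        BlobInfo blob = BlobInfo.newBuilder("your-bucket", "bigfile").build();

        // Stream the remote file into a resumable Cloud Storage upload
        // without ever holding the whole 3-5 GB file in memory.
        try (InputStream in = ftp.retrieveFileStream("/remote/bigfile");
             WriteChannel writer = storage.writer(blob)) {
            byte[] buffer = new byte[8 * 1024 * 1024];
            int read;
            while ((read = in.read(buffer)) >= 0) {
                writer.write(ByteBuffer.wrap(buffer, 0, read));
            }
        }
        ftp.completePendingCommand();
        ftp.logout();
        ftp.disconnect();
    }
}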

Related

Eclipse - simulate full disk when running server

I have a small application that doesn't take up much space on its own, but it does a lot of disk writing, so data accumulates. I need to know what happens when the disk is full and the application is still trying to write to it: whether parts of it fail or the entire thing does.
I'm currently running Tomcat within Eclipse and I would like to know if there is a way to limit the disk space allowed to a server created in Eclipse. Any ideas?
You can create a small RAM drive, which can be used like a physical drive but exists entirely in RAM. This has the added benefit that it is really fast and that you don't have to delete your test files afterwards, as the contents are gone once the RAM drive is removed.
As for how exactly you create your RAM drive, this depends on your operating system.
In Linux, you can use tmpfs (taken from https://unix.stackexchange.com/questions/66329/creating-a-ram-disk-on-linux):
mount -o size=16G -t tmpfs none /mnt/tmpfs
Edit: On Windows, a RAM drive isn't shipped with the system, so you need to install additional software. A list of options can be found at http://en.wikipedia.org/wiki/List_of_RAM_drive_software.
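To illustrate what the application-level failure looks like, here is a minimal sketch, assuming a small tmpfs mount like the one above (with a smaller size= value for testing) and a placeholder file path. It keeps writing until the drive fills up and prints the resulting IOException:

import java.io.FileOutputStream;
import java.io.IOException;

public class DiskFullTest {
    public static void main(String[] args) {
        byte[] block = new byte[1024 * 1024]; // 1 MB per write

        // Placeholder path on the small RAM drive mounted above.
        try (FileOutputStream out = new FileOutputStream("/mnt/tmpfs/filler.bin")) {
            while (true) {
                out.write(block);   // eventually fails with "No space left on device"
                out.flush();
            }
        } catch (IOException e) {
            // This is the error your Tomcat application would have to handle.
            System.err.println("Write failed: " + e.getMessage());
        }
    }
}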

GCS resumable uploads speed

I have a question concerning upload speed to Google Cloud Storage using resumable uploads.
I wrote a desktop Java client to upload large files to GCS (it has some specialized features, which is why gsutil was not an option for my company). During test runs about two months ago it utilized the available connection bandwidth very well, reaching about 20 Mbps on a 25 Mbps connection. The project was frozen for almost two months, and now that it has been reopened, the same client uploads at a very poor speed of about 1.4 Mbps out of the 25 Mbps available.
I wrote a simple Python script to check whether it would have the same problem; it is a little faster, but still only about 2 Mbps. The gsutil tool performs almost the same as my Python script.
I've also run the test on different network infrastructure with over 50 Mbps of upload bandwidth.
The results are also quite poor:
Java client: 2.4 Mbps
Python script: 3.2 Mbps
gsutil: 3.2 Mbps
The only thing that has changed is the Google Cloud Storage API version: I'm using the JSON API, and the first tests were run against the v1beta API version.
At the moment it makes no difference whether I use the deprecated API version or the new one.
Has anyone encountered same upload speed degradation?
What are your average upload speeds?
What could be a possible reason for such a dramatic decrease in upload performance?
Will parallel uploads of composite objects help me to fully utilize available bandwidth?
To determine the highest bandwidth you can expect, we suggest running the gsutil perfdiag command.
For example, to see how well it uploads a 100 MB file:
gsutil perfdiag -t wthru -s 100M gs://bucketname
This will upload a 100MB file five times and report the results. An example output from my run:
------------------------------------------------------------------------------
Write Throughput
------------------------------------------------------------------------------
Copied a 100 MB file 5 times for a total transfer size of 500 MB.
Write throughput: 71.61 Mbit/s.
It will also output lots of other information that might help diagnose the problem. If perfdiag shows much higher throughput than your application, then something might be wrong with your code. If perfdiag also reports low bandwidth, then something might be wrong with your network path to Google's servers, and the perfdiag output can help identify where. If that doesn't help solve your problem, please email the result file (perfdiag -o output.json) to gs-team@google.com.
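To compare your Java client against the perfdiag numbers, a sketch like the one below times an upload with the google-cloud-storage client and converts it to Mbit/s; the bucket name and test file are placeholders. Increasing the resumable-upload chunk size via setChunkSize reduces the number of round trips, which is one knob worth checking when throughput is low.

import com.google.cloud.WriteChannel;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class UploadBenchmark {
    public static void main(String[] args) throws Exception {
        Storage storage = StorageOptions.getDefaultInstance().getService();
        Path file = Paths.get("testfile-100mb.bin"); // placeholder test file
        BlobInfo blob = BlobInfo.newBuilder("your-bucket", "perf-test").build();

        long start = System.nanoTime();
        try (InputStream in = Files.newInputStream(file);
             WriteChannel writer = storage.writer(blob)) {
            // Larger chunks mean fewer resumable-upload round trips.
            writer.setChunkSize(8 * 1024 * 1024);
            byte[] buffer = new byte[8 * 1024 * 1024];
            int read;
            while ((read = in.read(buffer)) >= 0) {
                writer.write(ByteBuffer.wrap(buffer, 0, read));
            }
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        double mbit = Files.size(file) * 8 / 1e6 / seconds;
        System.out.printf("Write throughput: %.2f Mbit/s%n", mbit);
    }
}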

Cassandra terminates if no space in disk

I am using Cassandra in my Java application and connecting to it via the Thrift client. If Cassandra's disk gets full, it terminates automatically, so from my Java program I cannot find the real reason why Cassandra is down.
So, how can I avoid Cassandra terminating automatically, or is there any way to identify the disk-full error?
Also, I don't have physical access to the Cassandra drive; it's running on another remote machine.
Disk errors and, more generally, hardware/system errors are not usually handled gracefully by any application. In such scenarios the database can only provide as much durability as possible, and shutting down while breaking as little as possible is the correct behavior.
As for your application: if you cannot connect to the database, it makes no difference what caused the error; your app will not work anyway.
There are dedicated tools that can monitor your machine, e.g. Nagios. If you are the administrator of that server, use such tools: when the disk is filling up, you will receive an email or text message. Don't reinvent the wheel by writing several hundred lines of code to handle rare, random situations.
Set up SSH access to the Cassandra machine and use an SSH client library like JSch to run df /cassandra/drive (on Linux) or fsutil volume diskfree c:\cassandra\drive (on Windows) from your Java client. Capture the output, which is simple to parse, and extract the free disk space. That way your application can monitor what is happening there, alert the user, and refuse to add data when it is about to run out of disk space.
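A minimal sketch of that idea with JSch, assuming password authentication and a placeholder host and data path, might look like this:

import com.jcraft.jsch.ChannelExec;
import com.jcraft.jsch.JSch;
import com.jcraft.jsch.Session;

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class CassandraDiskCheck {
    public static void main(String[] args) throws Exception {
        JSch jsch = new JSch();
        Session session = jsch.getSession("user", "cassandra-host", 22); // placeholders
        session.setPassword("password");
        session.setConfig("StrictHostKeyChecking", "no");
        session.connect();

        // -P forces POSIX output so each filesystem fits on one line.
        ChannelExec channel = (ChannelExec) session.openChannel("exec");
        channel.setCommand("df -P /var/lib/cassandra");
        BufferedReader reader = new BufferedReader(new InputStreamReader(channel.getInputStream()));
        channel.connect();

        reader.readLine();                                // skip the header line
        String[] fields = reader.readLine().trim().split("\\s+");
        long availableKb = Long.parseLong(fields[3]);     // "Available" column, in 1K blocks
        System.out.println("Free space on Cassandra drive: " + availableKb + " KB");

        channel.disconnect();
        session.disconnect();
    }
}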
You can also use standard monitoring tools or set up a server-side script to send a message when disk space is low. However, this will not stop your application from crashing by itself; you need to take action once you see that disk space is low.

Solaris: Mounting a file system on an application's handlers

When mounting an NFS filesystem, all data handling goes through the nfs client. How can I write my own handlers to use something other than NFS?
An alternative would be a localhost NFS server, but that seems awfully inefficient.
Edit
Example of what should happen
Normally with a filesystem you get: the app reads/writes the filesystem, Solaris sees where it is mounted, and if it is a disk, it reads/writes the disk. If it is a software mirror, it reads and writes through the mirror software. If it is NFS, it reads and writes to a remote NFS server. I want it to read and write to our custom storage software instead of any of the above options.
Our storage software stores files that applications use; it is geared towards large or frequently replaced chunks of data that are not kept in a database. It also includes certain flexibility specific to our company.
Old/existing applications don't know about our new software; all they know how to do is read/write a directory. We could tell Solaris that the directory is hosted on NFS and have the NFS server translate and connect to the storage software, but we would prefer to tell Solaris about our new program, which Solaris has never heard of, and teach Solaris how to talk to it.
To me this sounds like you'd have to create a pseudo filesystem. Solaris uses VFS (Virtual File System), under which different filesystems are presented to userspace as one uniform structure. Whether you mount a UFS, NFS, or any other filesystem, users and applications can use filesystem-agnostic tools to interact with it through VFS.
That means you need to create a pseudo filesystem: one that handles the vnode and vfs operations (the VFS syscall interface) such as read(), write(), etc., and ties them (deciding what to do when someone opens a particular file, and so on) to the backend of your choice.
Read more:
http://developers.sun.com/solaris/articles/solaris_internals_ch14_file_system_framework.pdf
Sounds like a big task...
You might want to look at some CIFS servers. Alfresco has JCIFS, a CIFS server library in Java. It lets you present resources as files, as if they were on a Windows system. That means programs can "mount" these CIFS servers, and you can publish data from your backend via that mechanism.
I have not used it myself, but it sounds like what you want to do and may be worth looking into.
There's also FUSE, which lets you create custom filesystems in user mode rather than having to hack the kernel. It works on Unix and Mac OS, and there may be a Windows version as well. This can, in theory, do anything.
For example, there are FUSE filesystems that let you mount a remote system over SSH. These tend to be written in C/C++.
NFS isn't about mounting a directory on software but about mounting a remote share on a directory. Whether the storage device is remote or not doesn't matter that much; access still goes through layers of kernel software. Solaris uses VFS to provide the first layer, and you would have to implement the one underneath it. That would be quite a difficult task even for someone already familiar with VFS, and since you are apparently not familiar with writing kernel code, I would be very pessimistic about the project...
What I would suggest instead is a simpler and less risky approach: implement an interposition library that intercepts the application's I/O calls (open, read, write, close and the like, or perhaps the libc fopen/fwrite level; you have to figure out the best place to interpose) and calls your storage software instead.
Here is a simple example of the process:
http://developers.sun.com/solaris/articles/lib_interposers.html

Memory usage in Google App Engine

I am a bit confused. I wrote a standalone Java app and now I want to use GAE to deploy it on the web, and along the way also learn about GAE.
In my application, I read data from a file, store it in memory, process it, and then store the results in memory or in a file.
I understand that I now need to store the results in GAE's datastore, which is fine. I could run my program on my own computer, write the results to a file, and then use GAE to upload the results to the datastore so users can query them. However, is there a way to move the entire process into the GAE application, so that the application reads the data from a file, does the processing (using memory on the application server rather than my computer; it needs at least 4 GB of RAM), and, when it's done (which might take 1-2 hours), writes everything to the GAE datastore? (It would be an internal "offline" process with no users involved.)
I'm also confused because Google doesn't mention anything about a memory quota.
Thanks!
You will not be able to do your offline processing the way you are envisioning. There is a limit to how much memory your app can use, but that is not the main problem. All processing in App Engine is done in request handlers; in other words, any action you want your app to perform is written as if it were handling a web request. Each of these handlers is limited to 30 seconds of running time, and if your process tries to run longer, it gets shut down. App Engine is optimized for serving web requests, not for heavy computation.
All that being said, you may be able to break up your computational tasks into 30-second chunks and store intermediate results in the datastore or memcache. In that case, you could use a cron job or the task queue (both described in the App Engine docs) to keep calling your processing handler until the data crunching is done.
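A rough sketch of that chaining pattern with the App Engine task queue API might look like the following; the /process URL, the doChunk helper, and the cursor parameter are placeholders for your own logic.

import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Handler mapped to /process; each invocation does less than 30 seconds of work.
public class ProcessServlet extends HttpServlet {
    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp) {
        String cursor = req.getParameter("cursor");

        // Placeholder: process one chunk, persist intermediate results to the
        // datastore or memcache, and return the position to resume from.
        String nextCursor = doChunk(cursor);

        if (nextCursor != null) {
            // Re-enqueue this handler to continue where it left off.
            Queue queue = QueueFactory.getDefaultQueue();
            queue.add(TaskOptions.Builder.withUrl("/process").param("cursor", nextCursor));
        }
    }

    private String doChunk(String cursor) {
        // Your actual computation goes here; return null when everything is done.
        return null;
    }
}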
In summary, yes, it may be possible to do what you want, but it might not be worth the trouble. Look into other cloud solutions like Amazon's EC2 or Hadoop if you want to do computationally intensive things.
