Move a file to a different S3 location in Java

My use case is as follows: I want to move a file from one S3 location to another using the Java S3 SDK.
For instance, if the file is in bucket/current, I want to move it to bucket/old.
I can currently download a file as an S3Object, turn that object into a java.io.File (since, for reasons I don't understand, the S3 Java client does not let you upload an S3Object, the very same object you download!), and then upload that file. I'm curious whether there is a better approach to this.
Thanks!

There is no direct implementation of a rename or move operation in S3. Instead, the typical solution is to copy the object to the new location and then delete the original. You can accomplish this with the AmazonS3#copyObject and AmazonS3#deleteObject methods of the AWS SDK for Java.
This is more efficient than the technique you described of downloading the file locally and re-uploading it under the new key. copyObject internally uses the server-side copy provided by the S3 REST API's PUT Object - Copy operation. The copy is performed entirely on the S3 server side, so you avoid the I/O costs (and the real monetary cost of transferring data out of AWS) of a local download and re-upload.
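A minimal sketch of the copy-then-delete approach with the AWS SDK for Java v1 (bucket and key names below are placeholders):
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
// Server-side copy: the object's bytes never leave S3.
s3.copyObject("bucket", "current/file.txt", "bucket", "old/file.txt");
// Delete the original only after the copy has succeeded.
s3.deleteObject("bucket", "current/file.txt");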
Please be aware that this is much different from the rename operation as provided in a typical local file system, for multiple reasons:
It is not atomic. Most local file systems provide an atomic rename operation, which is useful for building safe "commit" or "checkpoint" constructs to publish the fact that a file is done being written and ready for consumption by some other process.
It is not as fast as a local file system rename. For typical local file systems, rename is a metadata operation that involves manipulating a small amount of information in inodes. With the copy/delete technique I described, all of the data must be copied, even if that copy is performed on the server side by S3.
Your application may be subject to unique edge cases caused by the Amazon S3 Data Consistency Model.

You can use the moveObject method of the StorageService class (from the JetS3t library).

Related

How to manually copy executable to workers with Apache Beam Dataflow on GCP

Somewhat new to Beam and GCP. Following this document and using the Beam 'subprocess' examples, I've been working on a simple Java pipeline that runs a C binary. It runs fine with the directRunner, and I'm now trying to get it to run in the cloud. With the file staged in a GCS bucket, I get the error: 'Cannot run program "gs://mybucketname/tmp/grid_working_files/Echo": error=2, No such file or directory', which makes sense since I guess you can't execute directly out of cloud storage. Where I'm stuck now is how to move the executable to the worker. The document states:
When you use a native Apache Beam language (Java or Python), the Beam SDK automatically moves all required code to the workers. However, when you make a call to external code, you need to move the code manually.

To move the code, you do the following:
Store the compiled external code, along with versioning information, in Cloud Storage.
In the #Setup method, create a synchronized block to check whether the code file is available on the local resource. Rather than implementing a physical check, you can confirm availability using a static variable when the first thread finishes.
If the file isn't available, use the Cloud Storage client library to pull the file from the Cloud Storage bucket to the local worker. A recommended approach is to use the Beam FileSystems class for this task.
After the file is moved, confirm that the execute bit is set on the code file.
In a production system, check the hash of the binaries to ensure that the file has been copied correctly.
I've looked at the FileSystems class, and I think I understand it, but what I don't know is where I need to copy the files to. Is there a known directory or filepath that the workers use? I'm using the Dataflow runner.
You can copy the file to wherever you want in your worker's local filesystem, e.g. you could use java.nio.file.Files.createTempDirectory to create a new, empty temporary directory in which to copy your executable before running it.
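A minimal sketch of that approach using the Beam FileSystems class recommended in the document (the gs:// path is the one from your error message; everything else is a placeholder to adapt):
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.ResourceId;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Inside the DoFn's @Setup method:
Path workDir = Files.createTempDirectory("staged-binary");
Path localBinary = workDir.resolve("Echo");
ResourceId source = FileSystems.matchNewResource(
        "gs://mybucketname/tmp/grid_working_files/Echo", false /* isDirectory */);
try (ReadableByteChannel in = FileSystems.open(source)) {
    Files.copy(Channels.newInputStream(in), localBinary, StandardCopyOption.REPLACE_EXISTING);
}
localBinary.toFile().setExecutable(true); // don't forget the execute bit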
Using custom containers might be a good solution to this as well.

Use the Checkstyle API without providing a java.io.File

Is there a way to use the Checkstyle API without providing a java.io.File?
Our app already has the file contents in memory (they aren't read from a local file, but from another source), so it seems inefficient to have to create a temporary file, write the in-memory contents to it, and then throw it away. I've looked into using in-memory file systems to circumvent this, but it seems java.io.File is always bound to the actual file system. Obviously I have no way of testing whether performance would actually be better; I just wanted to ask whether Checkstyle supports such a use case.
There is no clean way to do this. I recommend creating an issue at Checkstyle expanding more on your process and asking for a way to integrate it with Checkstyle.
Files are needed to support caching: we skip reading and processing a file if it is in the cache and has not changed since the last run. The caching logic is intertwined with the rest of the process, which is why no non-file route exists. Even setting caching aside, Checkstyle processes file contents through FileText, which again requires a File, used only as a file name reference, alongside the lines of the file in a List.
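To illustrate the FileText point, a hypothetical sketch (this assumes the FileText(File, List<String>) constructor; verify the exact signature against your Checkstyle version). The lines can come from memory, but a File is still required as a name reference:
import com.puppycrawl.tools.checkstyle.api.FileText;
import java.io.File;
import java.util.Arrays;
import java.util.List;

// The File argument is used only for its name; the contents come
// from the in-memory list of lines.
List<String> lines = Arrays.asList("public class Foo {", "}");
FileText text = new FileText(new File("Foo.java"), lines);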

How to store files in memory with Java?

I am trying to implement a minimal FTP server in Java. On this server, I want all files to exist in memory only; nothing should be written to disk.
Having said this, I have to create a virtual file system, comprised of a root directory, some sub-directories and files. A few of those will initially be loaded from the hard disk and then will only be handled in memory.
My question is: is there an efficient way to implement this in Java? Is there something pre-implemented, a class I should use? (I don't have access to all libraries, only java.lang and java.io.)
Assuming there is not, I have created my own simple FileSystem, Directory and File classes. I have no idea, however, how I should store the actual data in memory. Knowing that a file can be an image, a text file or anything else that could plausibly be exchanged with an FTP server, how should I store it? Also, there are two transfer modes that I should be able to support: binary and ASCII. So in whatever format I store the data, I should be able to convert it to some kind of binary or ASCII representation.
I know the question is a bit abstract, any sort of hints as to where I should look will be appreciated.
The data will stay in memory unless you explicitly write it to disk.
Assuming you have the relevant data stored in variables, simply never invoke a file writer and the data will remain on the JVM heap without ever touching the file system.
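A hypothetical minimal sketch of that idea: keep each file's contents as a byte[] keyed by path (this uses java.util.HashMap; if you truly only have java.lang and java.io, the same idea works with plain arrays). Binary mode serves the bytes as-is; ASCII mode can convert with new String(bytes, charset) and text.getBytes(charset):
import java.util.HashMap;
import java.util.Map;

// Nothing here ever touches the disk; all contents live on the heap.
public class InMemoryFileSystem {
    private final Map<String, byte[]> files = new HashMap<>();

    public void write(String path, byte[] contents) {
        files.put(path, contents.clone()); // defensive copy
    }

    public byte[] read(String path) {
        byte[] data = files.get(path);
        return data == null ? null : data.clone();
    }
}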

Java content APIs for a large number of files

Does anyone know of any Java libraries (open source) that provide features for handling a large number of files (write/read) on disk? I am talking about 2-4 million files (most of them PDFs and MS Office documents). It is not a good idea to store all of the files in a single directory. Instead of re-inventing the wheel, I am hoping this has already been done by many people.
Features I am looking for
1) Able to write/read files from disk
2) Able to create random directories/sub-directories for new files
3) Provide versioning/auditing (optional)
I was looking at the JCR API and it looks promising, but it starts with a workspace and I am not sure how it will perform when there are many nodes.
Edit: JCR does look pretty good. I'd suggest trying it out to see how it actually performs for your use case.
If you're running your system on Windows and noticed a horrible n^2 performance hit at some point, you're probably running up against the cost incurred by automatic 8.3 filename generation. Of course, you can disable 8.3 filename generation, but as you pointed out, it would still not be a good idea to store large numbers of files in a single directory.
One common strategy I've seen for handling large numbers of files is to create directories for the first n letters of the filename. For example, document.pdf would be stored in d/o/c/u/m/document.pdf. I don't recall ever seeing a library to do this in Java, but it seems pretty straightforward. If necessary, you can create a database to store the lookup table (mapping keys to the uniformly-distributed random filenames), so you won't have to rebuild your index every time you start up. If you want to get the benefit of automatic deduplication, you could hash each file's content and use that checksum as the filename (but you would also want to add a check so you don't accidentally discard a file whose checksum matches an existing file even though the contents are actually different).
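A hypothetical sketch of the checksum-as-filename variant, sharding on the first characters of a SHA-256 hex digest (class name and shard depth are arbitrary choices):
import java.io.File;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ShardedPaths {
    // Maps a file's contents to a path like root/ab/cd/abcdef012345...
    public static File pathFor(byte[] contents, File root) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest(contents)) {
            hex.append(String.format("%02x", b));
        }
        String name = hex.toString();
        File dir = new File(root, name.substring(0, 2) + File.separator + name.substring(2, 4));
        return new File(dir, name);
    }
}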
Depending on the sizes of the files, you might also consider storing the files themselves in a database--if you do this, it would be trivial to add versioning, and you wouldn't necessarily have to create random filenames because you could reference them using an auto-generated primary key.
Combine the functionality in the java.io package with your own custom solution.
The java.io package can write and read files from disk and create arbitrary directories or sub-directories for new files. There is no external API required.
The versioning or auditing would have to be provided with your own custom solution. There are many ways to handle this, and you probably have a specific need that needs to be filled. Especially if you're concerned about the performance of an open-source API, it's likely that you will get the best result by simply coding a solution that specifically fits your needs.
It sounds like your module should scan all the files on startup and form an index of everything that's available. Based on the method used for sharing and indexing these files, it can rescan the files every so often or you can code it to receive a message from some central server when a new file or version is available. When someone requests a file or provides a new file, your module will know exactly how it is organized and exactly where to get or put the file within the directory tree.
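A hypothetical sketch of that startup scan, with no external API required (names are placeholders):
import java.io.File;
import java.util.HashMap;
import java.util.Map;

// Walks the directory tree once at startup and indexes every file by
// name, so later lookups don't have to touch the disk.
public class FileIndex {
    private final Map<String, File> index = new HashMap<>();

    public void scan(File dir) {
        File[] entries = dir.listFiles();
        if (entries == null) return; // not a directory, or an I/O error
        for (File entry : entries) {
            if (entry.isDirectory()) scan(entry);
            else index.put(entry.getName(), entry);
        }
    }

    public File lookup(String name) { return index.get(name); }
}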
It seems that it would be far easier to just engineer a solution specific to your needs.

How to safely update a file that has many readers and one writer?

I have a set of files. The set of files is read-only off an NTFS share, and thus can have many readers. Each file is updated occasionally by one writer that has write access.
How do I ensure that:
If the write fails, that the previous file is still readable
Readers cannot hold up the single writer
I am using Java and my current solution is for the writer to write to a temporary file, then swap it out with the existing file using File.renameTo(). The problem is on NTFS, renameTo fails if target file already exists, so you have to delete it yourself. But if the writer deletes the target file and then fails (computer crash), I don't have a readable file.
NIO's FileLock only works within the same JVM, so it is useless to me.
How do I safely update a file with many readers using Java?
According to the JavaDoc:
This file-locking API is intended to map directly to the native locking facility of the underlying operating system. Thus the locks held on a file should be visible to all programs that have access to the file, regardless of the language in which those programs are written.
I don't know if this is applicable, but if you are running in a pure Vista/Windows Server 2008 solution, I would use TxF (transactional NTFS) and then make sure you open the file handle and perform the file operations by calling the appropriate file APIs through JNI.
If that is not an option, then I think you need to have some sort of service that all clients access which is responsible to coordinate the reading/writing of the file.
On a Unix system, I'd remove the file and then open it for writing. Anybody who had it open for reading would still see the old one, and once they'd all closed it, it would vanish from the file system. I don't know if NTFS has similar semantics, although I've heard it's loosely based on BSD's file system, so maybe it does.
Something that should always work, no matter what OS etc, is changing your client software.
If this is an option, then you could have a file "settings1.ini"; when you want to change it, you create a file "settings2.ini.wait", write your data to it, rename it to "settings2.ini", and then delete "settings1.ini".
Your changed client software would simply always check for settings2.ini if it last read settings1.ini, and vice versa.
This way you always have a working copy.
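A hypothetical sketch of the reader's side of this scheme, using the file names from the description above:
import java.io.File;

// Pick whichever settings file is current; if both exist (a writer
// crashed mid-switch), prefer the newer one.
File s1 = new File("settings1.ini");
File s2 = new File("settings2.ini");
File current;
if (s1.exists() && s2.exists()) {
    current = s1.lastModified() >= s2.lastModified() ? s1 : s2;
} else {
    current = s2.exists() ? s2 : s1;
}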
There might be no need for locking. I am not too familiar with the FS API on Windows, but as NTFS supports both hard links and soft links, AFAIK, you can try this if your setup allows it:
Use a hard or soft link to point to the actual file, and name the actual file differently. Let everyone access the file through the link's name.
Write the new file under a different name, in the same folder.
Once it is finished, point the link at the new file. Ideally, Windows would allow you to create the new link, replacing the existing link, in one atomic operation; then the link would always identify a valid file, either the old or the new one. At worst, you'd have to delete the old link first and then create the link to the new file, in which case there'd be a short time span during which a program could not locate the file. (Also, Mac OS X offers an "ExchangeObjects" function that swaps two items atomically; maybe Windows offers something similar.)
This way, any program that already has the old file open will continue to access the old one, and you won't get in its way when creating the new one. Once an app notices the existence of the new version, it can close the current file and reopen it, thereby getting access to the new version.
I don't know, however, how to create links in Java. Maybe you have to use some native API for that.
I hope this helps anyway.
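Update: since Java 7, the java.nio.file API can create both kinds of link directly, so no native API is needed for that part. A minimal sketch with placeholder paths; note that creating symbolic links on Windows may require elevated privileges:
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

Path target = Paths.get("data.v2"); // the newly written file
Files.createLink(Paths.get("data"), target); // hard link; fails if "data" already exists
Files.createSymbolicLink(Paths.get("data.lnk"), target); // soft link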
I have been dealing with something similar recently. If you are running Java 5 or later, consider using NIO file locks in conjunction with a ReentrantReadWriteLock. Make sure all code referencing the FileChannel object ALSO references the ReentrantReadWriteLock. This way the NIO FileLock guards the file at the inter-process level, while the reentrant lock coordinates the threads within your own JVM.
reentrantReadWriteLock.writeLock().lock(); // in-process lock first; readers use readLock()
try {
    FileLock fileLock = fileChannel.lock(position, size, shared); // OS-level lock
    try { /* do stuff */ } finally { fileLock.release(); }
} finally {
    reentrantReadWriteLock.writeLock().unlock();
}
Note that ReentrantReadWriteLock has no plain lock() method; you lock either its read lock or its write lock, and the try/finally blocks ensure both locks are released even if an exception is thrown.
