I am writing a media transcoding server in which I need to move files around the filesystem, and so far I have been unsure whether Java's renameTo can be replaced by something else that would give me better performance. I was considering exec("mv file1 file2"), but that would be my last resort.
Has anyone had similar experiences, or can anyone help me find a solution?
First of all, renameTo is likely just wrapping a system call.
Secondly, moving a file does not involve copying any data from the file itself (at least on Unix, when the source and destination are on the same filesystem). All that happens is that the link in the old directory is removed and a link in the new directory is added. I don't think you're going to find any performance improvements here.
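For what it's worth, here is a minimal sketch comparing the two calls; the paths are made up. Files.move (NIO.2) gives you real exceptions instead of a boolean and, on the same filesystem, boils down to the same rename system call:

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class MoveExample {
    public static void main(String[] args) throws IOException {
        File src = new File("/tmp/in/clip.mp4");   // hypothetical paths
        File dst = new File("/tmp/out/clip.mp4");

        // Classic API: a thin wrapper around the OS rename call.
        // Returns false (rather than throwing) on failure, e.g. across filesystems.
        boolean renamed = src.renameTo(dst);

        if (!renamed) {
            // NIO.2 alternative: throws a descriptive exception on failure and
            // falls back to copy+delete when the target is on another filesystem.
            Path from = Paths.get("/tmp/in/clip.mp4");
            Path to = Paths.get("/tmp/out/clip.mp4");
            Files.move(from, to, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```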
I don't think that using the default File methods carries a (mentionable) performance penalty, as most of these JVM-to-OS functions already wrap native calls.
The only case where an exec would be needed is if you wanted to act with different rights than the program has, or to use a special tool to copy/move the file (e.g. a smart move when NTFS junctions are involved).
If rename is a significant performance bottleneck, then you need to improve your hardware, as that is your main constraint. The software is a trivial portion of the time spent, and optimising it will make little difference.
What is your disk configuration? How is it optimised for writes?
I'm successfully using Desktop.getDesktop().moveToTrash(File) under macOS to delete files and then retrieve them from the Trash folder. I'd like to do the same under Windows, but I don't see a native Java way to access the Recycle Bin so I can undelete them.
Under macOS I simply rename files from the Trash folder back to where they were. Is there a way I can do that with the Windows Recycle Bin? Or do something similar?
There's nothing in the core API. You have a bunch of options.
But first: there's trash, and there's delete.
"move to trash" means the file is literally undestroyable - as long as it is remains in the trashcan it remains on your disk. Said differently, if you have a completely filled up harddisk, trash 100GB worth of files, that disk is... still completely filled up. Possibly certain OSes have a 'bot' that runs when needed or on a schedule that truly deletes files in the trashcan that fit certain rules (such as 'deleted more than 30 days ago').
"Actually delete" means the disk space is now available - if you have a full disk, actually delete 100GB worth of files, then there's now 100GB available, but those files are STILL on disk!! - the only thing that 'delete' does, is flag the space. It doesn't overwrite the actual bits on disk with zeroes or random data or whatnot. Further use of this disk will eventually mean some other file is written 'over' the deleted file at which point it truly becomes entirely unrecoverable, but if you have some extremely sensitive data, you delete the files, then you toss the computer in the garbage bin, anybody who gets their hands on that machine can trivially recover your data. Because what 'delete' does is set a flag "I am deleted", nothing more. All you need to do to undo this, is to unset the flag.
The reason I mention this, is because you used the term 'undelete' in your question. Which usually means something else.
| Verb | UnVerb | Action |
| --- | --- | --- |
| Trash | Recover, Untrash, Restore, Put Back | Disk space remains unavailable. File is visible in the OS trash can tool. |
| Delete | Undelete | Disk space is now available; data is still on disk but could be overwritten at any time. |
| Wipe | N/A | Data is overwritten with zeroes. Some wiping tools overwrite 7 times with random data to be really sure [1]. |
| Trim | N/A | Pulse all cells on an SSD [2]; intended to make data unrecoverable, applies only to SSDs. |
[1] This fights largely hypothetical trickery where you recover data by scanning for minute magnetic differences in potential. If it's doable at all it requires equipment that costs multiple millions. Still, if you're paranoid, you write random data, and repeat that procedure 7 to 30 times.
[2] SSDs are 'weird' in that they are overprovisioned: They can store more data than the label says. It's because SSDs work in 'cells' and there's a limit to how often a cell can be rewritten. Eventually a cell is near death (it's clear it's only got a few rewrites left in it), at which point the data is copied to an unused cell, and the near-death cell is marked off as no longer usable. The SSD is a 'fake harddrive', exposing itself as a simple contiguous block of writable and addressable space. It's like a mini computer, and will map write ranges to available cells. As a consequence, using basic OS/kernel driver calls to tell the SSD to write 7x random data over a given range of bits does not actually work, in that it is possible that there's a cell with the file data that's been marked as not to be used, and it won't be wiped. While somewhat hard to do, you can send special commands, so-called TRIM commands, to most SSDs to explicitly tell them to pulse-clear all cells on the entire drive, even the ones that have been marked as near-death. This low-level call to the SSD firmware is the only way to securely delete anything off of an SSD. Naturally, the whole point of this exercise is that you can't undo it.
So, to be clear, the one and only thing on this list that is meaningfully doable - without writing extremely complex software that scans the raw disk, byte for byte (a tool you should not be writing in java, as you'd be programming very close to the OS/arch, which java is not good at) - is the Untrash part: undoing the 'trash' action.
Not available in basic java
... unfortunately, even that is not available normally. There's an API to tell the OS to 'trash' a file, but there is no API call to untrash it. That means you'll have to code an untrash implementation for each and every OS you want to support. On mac, you could handroll it by moving files out of ~/.Trash. On windows it's a little trickier.
One "simple" (heh) way is to use JNI to write C code (targeting the windows API, to be compiled on windows with windows oriented C tools) that does the job, and then use JNI to call this C function (compiled to a .dll) on windows specifically. You can ship the DLL and simply not use it on non-windows OSes. You will have to compile this DLL once for every arch you want to target (presumably, x64, possibly x86 and aarch64 (ARM)). This is all highly complicated and requires some knowledge about how to write fairly low-level windows code.
Use command line tools
You can invoke command line tools. For example, windows has fsutil, which can be used to make hard links. I think you can do it - C:\$Recycle.bin is the path, more or less - where C: is itself already a little tricky to find from java (you can have multiple disks in a system, so do you just scan for C:, D:, etc.? But if the machine still has a CD-ROM drive, that'll make it spin up, which surely you didn't want. You can ask windows what kind of disk a letter is, but that again requires JNI; it's not baked into java).
You could write most of the untrash functionality in a powershell script and then simply use java's ProcessBuilder to run it, and have it do the bulk of the work.
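A minimal sketch of that approach, assuming a hypothetical untrash.ps1 script that does the actual Recycle Bin work:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class UntrashLauncher {
    public static void main(String[] args) throws IOException, InterruptedException {
        // untrash.ps1 is a made-up script name; it would do the real restore work.
        ProcessBuilder pb = new ProcessBuilder(
                "powershell.exe", "-ExecutionPolicy", "Bypass",
                "-File", "untrash.ps1", "C:\\Users\\me\\report.txt");
        pb.redirectErrorStream(true);

        Process p = pb.start();
        try (BufferedReader out = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = out.readLine()) != null) {
                System.out.println(line);   // whatever the script prints
            }
        }
        int exitCode = p.waitFor();
        System.out.println("powershell exited with " + exitCode);
    }
}
```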
Use C:\$Recycle.bin
You can try accessing Paths.get("C:\\$Recycle.bin") and see what happens. Presumably, you can just move files out of there. But note that each file has, associated with it, knowledge of where it used to be. The files still have their extension, but their names are mangled, containing only the drive letter they came from plus a number. There's a separate mapping file that tells you where each file was deleted from and what its name was. You will have to open this mapping file and parse through it. You'll have to search the web to figure out what the format of this mapping file is, and you'll have to take care not to corrupt it, and probably to remove the file you just recovered from it (without corrupting it).
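As a starting point, here is a sketch that merely lists what is in there; it assumes you have the rights to read the directory (otherwise expect an AccessDeniedException) and it makes no attempt to parse the mapping file:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ListRecycleBin {
    public static void main(String[] args) throws IOException {
        // Each user's trashed files live in a per-user subdirectory; walk one level down.
        Path bin = Paths.get("C:\\$Recycle.bin");
        try (DirectoryStream<Path> userDirs = Files.newDirectoryStream(bin)) {
            for (Path userDir : userDirs) {
                if (!Files.isDirectory(userDir)) continue;
                try (DirectoryStream<Path> entries = Files.newDirectoryStream(userDir)) {
                    for (Path entry : entries) {
                        // Mangled names; the mapping file holds the original path.
                        System.out.println(entry);
                    }
                }
            }
        }
    }
}
```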
Note that files in the recycle bin have all sorts of flags set, such as the system flag. You may have to write it all in powershell or batch scripts, and start those from java instead, to e.g. run ATTRIB.EXE to change properties first. Hopefully you can do it with the java.nio.file API which exposes some abilities to interact with windows-specific file flag stuff.
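For the flag part specifically, java.nio.file's DosFileAttributeView can read and clear the hidden/system attributes. A small sketch, with a made-up path:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.DosFileAttributeView;
import java.nio.file.attribute.DosFileAttributes;

public class ClearDosFlags {
    public static void main(String[] args) throws IOException {
        Path recovered = Paths.get("C:\\restore\\report.txt");  // hypothetical path

        DosFileAttributeView view =
                Files.getFileAttributeView(recovered, DosFileAttributeView.class);
        DosFileAttributes attrs = view.readAttributes();
        System.out.println("hidden=" + attrs.isHidden() + ", system=" + attrs.isSystem());

        // Clear the flags that tend to be set on recycled files.
        view.setSystem(false);
        view.setHidden(false);
    }
}
```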
Build your own trashcan
In general it's a bad idea to use java to write highly-OS-specific tooling. You can do it, but it'll hurt the entire time. The designers of java didn't make it for this (Project Panama is trying to fix this, but it's not in JDK18 and won't be in 19 either, it's a few years off – and it wasn't really designed for this kind of thing either), and your average java coder wouldn't use it for this, so that means: Few to no libraries, and hard to find support.
Hence, it's a better idea to let a desktop java app do things more in its own way than your average desktop tool would. Which can include 'having its own trashcan'. Let's say you have a code editor written in java, and it has a 'delete' feature. You're free to implement 'delete' by moving files to a trashcan dir you made, where you track (via a DB or shadow files) when the delete occurred, who did it, and where the file came from. Then you build code that can move it back, and code that 'empties the trash', possibly on a schedule.
You can do all that simply with Files.move.
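A minimal sketch of such an application-local trashcan, with the directory layout and naming scheme made up (a real version would also record the original location in a DB or shadow file, as described above):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Minimal application-local trashcan; layout and naming are illustrative only. */
public class AppTrash {
    private final Path trashDir;

    public AppTrash(Path trashDir) throws IOException {
        this.trashDir = Files.createDirectories(trashDir);
    }

    /** "Deletes" a file by moving it into the trash dir; returns the trashed path. */
    public Path trash(Path file) throws IOException {
        // Prefix with a timestamp so repeated deletes of the same name don't collide,
        // and so an 'empty trash after 30 days' job has something to go on.
        // ATOMIC_MOVE assumes the trash dir is on the same filesystem as the file.
        Path target = trashDir.resolve(System.currentTimeMillis() + "_" + file.getFileName());
        return Files.move(file, target, StandardCopyOption.ATOMIC_MOVE);
    }

    /** Restores a previously trashed file to the given destination. */
    public Path restore(Path trashed, Path originalLocation) throws IOException {
        return Files.move(trashed, originalLocation);
    }
}
```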
I'm testing data structure performance with very large data.
As a temporary workaround (see here) I want to write memory to disk.
I want to test with very big datasets - how can I make it so that when the java VM runs out of memory it writes some of it to disk?
Since we're talking about temporary fixes here, you could always increase your page file (swap file in most linux distros) if you need a little extra space.
Here's a link from Microsoft:
http://windows.microsoft.com/en-us/windows-vista/change-the-size-of-virtual-memory
Linux:
http://www.cyberciti.biz/faq/linux-add-a-swap-file-howto/
Now let me say that this isn't a good long-term fix, but I understand that sometimes developers just need to make it work. If this is something that will ever see a production environment, you may want to look at a tool like Hadoop. It allows you to distribute your data processing over multiple JVMs - a tool built for a "big data" application like the one you're describing.
Maybe you can use a stream, or a buffered one; I think that would be the best choice for testing such a structure. If you can read from disk using a stream, and it doesn't create any additional objects (only those that are necessary), then you have all of the JVM's memory available for your structure. But maybe you can describe your problem in more detail?
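For example, a sketch that reads raw longs from a hypothetical data.bin through a buffered stream, allocating no per-record objects:

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;

public class StreamedRead {
    public static void main(String[] args) throws IOException {
        // data.bin is a made-up file of raw longs written earlier by the test harness.
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream("data.bin"), 1 << 16))) {
            long sum = 0;
            try {
                while (true) {
                    sum += in.readLong();   // primitives only: no per-record objects allocated
                }
            } catch (EOFException endOfFile) {
                // reached the end of the file
            }
            System.out.println("checksum: " + sum);
        }
    }
}
```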
I am currently trying to determine the cause of high memory usage in a Java application running on an exotic platform where I know of no instrumented JVM.
I have the source to the application, and can make changes to the source for the purposes of testing.
How can I debug memory usage under these conditions?
If more info is needed, I'll be happy to provide. I'm just a little lost trying to use such an old jvm without much tooling to speak of.
If I were in your shoes I would approach it with:
1. Find the functional areas you know need attention.
2. Make a backup copy of the code.
3. Start inserting print statements with start and end times.
4. See what takes a lot of time and narrow it down.
For Java 5 and later this can be done using Java agents. For earlier versions - including 1.1.8 - you must load native agents to do this. If you cannot instrument your code, you must do the work needed yourself.
One approach to get most of the way is to use a Java 1.1 compatible version of log4j which allows you to essentially write out strings prepended with a timestamp. This can then be massaged afterwards into extracting answers to whatever you want to know.
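If even an old log4j build is unavailable, a plain timestamped print helper works on a 1.1-era JVM. The names below are made up, and the memory figure is just Runtime's own coarse numbers:

```java
// Java 1.1-friendly: no varargs, no java.time, no String.format.
public class TraceLog {
    public static void log(String msg) {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        System.out.println(System.currentTimeMillis() + " used=" + used + " " + msg);
    }

    public static void main(String[] args) {
        log("before building cache");
        // ... the code section you suspect ...
        log("after building cache");
    }
}
```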
If you need memory profiling - and I'd recommend against this - you could start serializing objects out to disk, then measuring disk size as a rough estimate of memory size.
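A rough sketch of that idea, counting serialized bytes in memory rather than on disk, which amounts to the same estimate (the Vector is just example data):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SizeEstimate {
    /** Returns the serialized size in bytes; a crude proxy for in-memory size. */
    static int serializedSize(Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bytes);
        out.writeObject(obj);
        out.close();
        return bytes.size();
    }

    public static void main(String[] args) throws IOException {
        java.util.Vector v = new java.util.Vector();
        for (int i = 0; i < 10000; i++) {
            v.addElement(new Integer(i));
        }
        System.out.println("approx bytes: " + serializedSize(v));
    }
}
```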
If you really want to dig into where you're usually not supposed to be, try the sun.misc package, although I don't know how much of that was around in 1.1.x.
I am writing a servlet which will examine a directory on the server (external to the web container), and recursively search for certain files (by certain files, I mean files that are of a certain extension as well as a certain naming convention). Once these files are found, the servlet responds with a long list of all of the found files (including the full path to the files). My problem is that there are so many files and directories that my servlet goes extremely slow. I was wondering if there was a best practice or existing servlet for this type of problem? Would it be more efficient to simply compile the entire list of files and do the filtering via js/jquery on the client side?
Disk access is slow and as the number of files and directories increases, you'll rapidly reach a point where your servlet will be useless when using the conventional recursive search through the directory tree. You'll especially reach this limit quickly if you have a large number of concurrent users performing the same search at the same time.
Instead, it's much better to use an external batch job to generate the list of files, which can then be read into the servlet through a database call, or even by just parsing a file containing all the file names separated by newlines. Using "find" on Linux is a simple way to do this, e.g.:
find <path_to_directory> -name '*.bin' > list_files.txt
This would list every file name that ends with .bin in a particular directory and output it into a file named list_files.txt. Your servlet could then read in that file and create the list of files from there.
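A sketch of such a servlet, assuming the batch job writes its output to a hypothetical /data/list_files.txt:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

/** Serves the pre-built file list instead of walking the directory tree per request. */
public class FileListServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        resp.setContentType("text/plain");
        PrintWriter out = resp.getWriter();
        // list_files.txt is regenerated periodically by the external 'find' batch job.
        try (BufferedReader in = new BufferedReader(new FileReader("/data/list_files.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                out.println(line);
            }
        }
    }
}
```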
If you really have loads of files, you might think about spawning an external process to do the searching. If you're running on a unix-like server (like linux), you might get speed gains by having the "find" command do the searching for you, and parse its output.
You can google for many examples of how to use "find".
I see two possible reasons why this process might be going slowly:
1) Disk I/O is taking too long. This'll be a real constraint that you can't do much about. Usually the operating system is pretty good at keeping structures in memory that allow it to find files in your folders much quicker. If it is too slow regardless, you might have to build an index yourself in memory. This all depends on how you're doing it.
In any case, if this is the issue (you can try measuring), then there's no way doing the filtering client side will help, as that shouldn't really take very long, no matter where you do it. Instead you're going to make the client slower by sending it more data to sort through.
2) There's something wrong with your directory traversal. You say it's "recursive". If you mean it's actually recursive, i.e. a method that calls itself whenever it encounters a new directory, then that might well be slowing you down (the overhead really adds up). There's some stuff about tree traversal on wikipedia, but basically just use a queue or stack to keep track of where you are in the traversal, instead of using your method state to do so.
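For illustration, here is a sketch of the same search done with an explicit stack instead of method recursion; the root directory and extension are made up:

```java
import java.io.File;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class IterativeSearch {
    /** Depth-first search using an explicit stack instead of method recursion. */
    static List<String> findByExtension(File root, String extension) {
        List<String> matches = new ArrayList<>();
        Deque<File> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            File dir = stack.pop();
            File[] children = dir.listFiles();
            if (children == null) continue;      // unreadable directory
            for (File child : children) {
                if (child.isDirectory()) {
                    stack.push(child);
                } else if (child.getName().endsWith(extension)) {
                    matches.add(child.getAbsolutePath());
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        for (String path : findByExtension(new File("/srv/media"), ".bin")) {
            System.out.println(path);
        }
    }
}
```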
Note that a file system isn't actually a tree, but I'm assuming that it is in this case. It gets a bit hairier otherwise.
I don't agree with the other posters that you can't implement it in-process. It should work pretty well up to a certain point, no need for batch jobs just yet.
I think your servlet is slow because of hard drive speed. If the list of files is fairly permanent, you should load it into memory.
I believe it's the File class, but I heard that it is very expensive in memory.
Is there a better way to work with file paths?
It's hard to say without knowing what you want to do, but please do not prematurely optimize. I doubt the memory use of a File object will be at all noticeable in your application.
The File class doesn't hold much data in and of itself. It has all of two instance fields. If all you're worried about is memory, it doesn't look like it's much of a problem. Nothing is loaded from the file system till you open a stream or a channel.
The File class might be expensive enough that you don't want to use it to store every file on your hard drive in memory. I know I've had issues with that, particularly when I tried to use a tree of File objects. If you do encounter a situation where using the File class is too expensive, consider just using Strings and converting to Files as needed. But if that is the optimization that makes your program practical, it is probably a sign that you have bigger issues; it is far more likely that the overhead of the data structure holding the objects will be the issue.
The only time I know where File uses a lot of memory is when you use File.list()...
See these for some solutions:
Is there a workaround for Java’s poor performance on walking huge directories?
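One mitigation worth knowing about (on Java 7 or later) is Files.newDirectoryStream, which streams directory entries one at a time instead of materializing them all as a single array the way File.list() does. A small sketch with a made-up directory:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class HugeDirWalk {
    public static void main(String[] args) throws IOException {
        Path dir = Paths.get("/srv/media");   // hypothetical huge directory
        // Entries are streamed lazily rather than returned as one big array,
        // which avoids the memory spike File.list() causes on huge directories.
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir, "*.bin")) {
            for (Path entry : entries) {
                System.out.println(entry);
            }
        }
    }
}
```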