Is there any way to access the number of blocks allocated to a file with the standard Java File API? Or even with some unsupported and undocumented API underneath? Anything to avoid native code plugins.
I'm talking about the st_blocks field of struct stat that the fstat/stat syscalls work on in Unix.
What I want to do is create a sparse copy of a file that now contains lots of redundant data, i.e. make a new copy holding only the active data, written sparsely, then swap the two files with an atomic rename/link operation and remove the old file. But beforehand I need a way to find out how many blocks are allocated to the file, since it might already have been sparsely copied.
This will be used to free up disk space in a database application that is 100% Java. The benefit of relying on sparse file support in the filesystem is that I would not have to change the index that points to where the data is located, which would increase the complexity of the task at hand.
I think I can do reasonably well by relying on the file timestamp to see if files have already been cleaned up. But this intrigued me: I cannot even find anything in the Java 7 NIO.2 API for file attribute access at this level.
The only way I can think of is to use ls -s filename to get the actual size of the file on disk. http://www.lrdev.com/lr/unix/sparsefile.html
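For reference, a rough sketch of that ls -s workaround from Java, assuming a Unix-like system where the first column of `ls -s` output is the allocated size in blocks (the class and method names are just for illustration):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class AllocatedBlocks {
    /** Returns the allocated size reported by `ls -s` (in 1K blocks by default on Linux). */
    static long allocatedBlocks(String path) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("ls", "-s", path).start();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line = r.readLine();   // e.g. "8 somefile.db"
            p.waitFor();
            if (line == null) {
                throw new IOException("no output from ls -s " + path);
            }
            return Long.parseLong(line.trim().split("\\s+")[0]);
        }
    }
}
```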
Related
I'm successfully using Desktop.getDesktop().moveToTrash(File) under macOS to delete files and then retrieve them from the Trash folder. I'd like to do the same under Windows, but I don't see a native Java way to access the Recycle Bin so I can undelete them.
Under macOS I simply rename files from the Trash folder back to where they were. Is there a way I can do that with the Windows Recycle Bin, or do something similar?
There's nothing in the core API. You have a bunch of options.
But first: there's trash, and there's delete.
"move to trash" means the file is literally undestroyable - as long as it is remains in the trashcan it remains on your disk. Said differently, if you have a completely filled up harddisk, trash 100GB worth of files, that disk is... still completely filled up. Possibly certain OSes have a 'bot' that runs when needed or on a schedule that truly deletes files in the trashcan that fit certain rules (such as 'deleted more than 30 days ago').
"Actually delete" means the disk space is now available - if you have a full disk, actually delete 100GB worth of files, then there's now 100GB available, but those files are STILL on disk!! - the only thing that 'delete' does, is flag the space. It doesn't overwrite the actual bits on disk with zeroes or random data or whatnot. Further use of this disk will eventually mean some other file is written 'over' the deleted file at which point it truly becomes entirely unrecoverable, but if you have some extremely sensitive data, you delete the files, then you toss the computer in the garbage bin, anybody who gets their hands on that machine can trivially recover your data. Because what 'delete' does is set a flag "I am deleted", nothing more. All you need to do to undo this, is to unset the flag.
The reason I mention this, is because you used the term 'undelete' in your question. Which usually means something else.
| Verb | UnVerb | Action |
| --- | --- | --- |
| Trash | Recover, Untrash, Restore, Put Back | Disk space remains unavailable. File visible in OS trash can tool. |
| Delete | Undelete | Disk space is now available; data is still on disk but could be overwritten at any time. |
| Wipe | N/A | Data is overwritten with zeroes. Some wiping tools overwrite 7 times with random data to be really sure [1] |
| Trim | N/A | Pulse all cells on an SSD [2] - intended to make data unrecoverable; applies only to SSDs |
[1] This fights largely hypothetical trickery where you recover data by scanning for minute magnetic differences in potential. If it's doable at all it requires equipment that costs multiple millions. Still, if you're paranoid, you write random data, and repeat that procedure 7 to 30 times.
[2] SSDs are 'weird' in that they are overprovisioned: They can store more data than the label says. It's because SSDs work in 'cells' and there's a limit to how often a cell can be rewritten. Eventually a cell is near death (it's clear it's only got a few rewrites left in it), at which point the data is copied to an unused cell, and the near-death cell is marked off as no longer usable. The SSD is a 'fake harddrive', exposing itself as a simple contiguous block of writable and addressable space. It's like a mini computer, and will map write ranges to available cells. As a consequence, using basic OS/kernel driver calls to tell the SSD to write 7x random data over a given range of bits does not actually work, in that it is possible that there's a cell with the file data that's been marked as not to be used, and it won't be wiped. While somewhat hard to do, you can send special commands, so-called TRIM commands, to most SSDs to explicitly tell them to pulse-clear all cells on the entire drive, even the ones that have been marked as near-death. This low-level call to the SSD firmware is the only way to securely delete anything off of an SSD. Naturally, the whole point of this exercise is that you can't undo it.
So, to be clear, the one and only thing on this list that is meaningfully doable without writing extremely complex software that scans the raw disk, byte for byte (which is a tool you should not be writing in Java, as you'll be programming a lot against the OS/arch, which Java is not good at), is the untrash part: undoing the 'trash' action.
Not available in basic java
... unfortunately even that is not available normally. There's an API to tell the OS to 'trash' a file; there is no API call to untrash it. That means you'll have to code an untrash implementation for each and every OS you want to support. On macOS, you could hand-roll it by moving files out of ~/.Trash. On Windows it's a little trickier.
One "simple" (heh) way is to use JNI to write C code (targeting the windows API, to be compiled on windows with windows oriented C tools) that does the job, and then use JNI to call this C function (compiled to a .dll) on windows specifically. You can ship the DLL and simply not use it on non-windows OSes. You will have to compile this DLL once for every arch you want to target (presumably, x64, possibly x86 and aarch64 (ARM)). This is all highly complicated and requires some knowledge about how to write fairly low-level windows code.
Use command line tools
You can invoke command line tools. For example, Windows has fsutil, which can be used to make hard links. I think you can do it - C:\$Recycle.Bin is the path, more or less. Where C: is itself already a little tricky to find from Java (you can have multiple disks in a system, so do you just scan for C:, D:, etc.? But if the machine still has a CD-ROM drive, that'll make it spin up, which surely you didn't want. You can ask Windows what kind of disk a letter is, but this again requires JNI; it's not baked into Java).
You could write most of the untrash functionality in a powershell script and then simply use java's ProcessBuilder to run it, and have it do the bulk of the work.
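For example, something along these lines (the script name untrash.ps1 and its argument are hypothetical; the script itself would contain the actual recycle-bin logic):

```java
import java.io.IOException;

public class UntrashLauncher {
    /** Runs a PowerShell script (hypothetical untrash.ps1) that does the actual restore work. */
    static int runUntrashScript(String mangledName) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(
                "powershell.exe", "-NoProfile", "-ExecutionPolicy", "Bypass",
                "-File", "untrash.ps1", mangledName);
        pb.inheritIO(); // let the script's output show up on our console
        return pb.start().waitFor(); // 0 usually means success
    }
}
```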
Use C:\$Recycle.bin
You can try accessing Paths.get("C:\\$Recycle.Bin") and see what happens. Presumably, you can just move files out of there. But note that each file has, associated with it, knowledge of where it used to be. The files still have their extension, but their names are mangled, containing only the drive letter they came from plus a number. There's a separate mapping file that tells you where each file was deleted from and what its name was. You will have to open this mapping file and parse it. You'll have to search the web to figure out its format, take care not to corrupt it, and probably remove the entry for the file you just recovered (again without corrupting it).
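A minimal sketch of just peeking into that directory, assuming you have the permissions for it (on recent Windows versions there is typically one subdirectory per user SID in there):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class RecycleBinPeek {
    public static void main(String[] args) throws IOException {
        Path bin = Paths.get("C:\\$Recycle.Bin");
        // List each per-user subdirectory, then whatever entries we are allowed to see inside it.
        try (DirectoryStream<Path> users = Files.newDirectoryStream(bin)) {
            for (Path userDir : users) {
                System.out.println(userDir);
                try (DirectoryStream<Path> entries = Files.newDirectoryStream(userDir)) {
                    entries.forEach(System.out::println);
                } catch (IOException denied) {
                    System.out.println("  (no access: " + denied.getMessage() + ")");
                }
            }
        }
    }
}
```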
Note that files in the recycle bin have all sorts of flags set, such as the system flag. You may have to write it all in PowerShell or batch scripts and start those from Java instead, e.g. to run ATTRIB.EXE to change attributes first. Hopefully you can do it with the java.nio.file API, which exposes some abilities to interact with Windows-specific file flags.
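The dos: attribute view in java.nio.file does let you flip those flags, though whether Windows allows it on recycle-bin entries is something you'd have to test. A sketch:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ClearDosFlags {
    /** Clears the hidden/system flags so the file behaves like a normal file again. */
    static void unhide(Path file) throws IOException {
        Files.setAttribute(file, "dos:system", false);
        Files.setAttribute(file, "dos:hidden", false);
    }

    public static void main(String[] args) throws IOException {
        unhide(Paths.get(args[0]));
    }
}
```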
Build your own trashcan
In general it's a bad idea to use java to write highly-OS-specific tooling. You can do it, but it'll hurt the entire time. The designers of java didn't make it for this (Project Panama is trying to fix this, but it's not in JDK18 and won't be in 19 either, it's a few years off – and it wasn't really designed for this kind of thing either), and your average java coder wouldn't use it for this, so that means: Few to no libraries, and hard to find support.
Hence, it's a better idea to let desktop Java apps do things more in their own way than your average desktop tool would. Which can include 'having their own trashcan'. Let's say you have a code editor written in Java, and it has a 'delete' feature. You're free to implement 'delete' by moving files to a trashcan dir you made, where you track (via a DB or shadow files) when the delete occurred, who did it, and where the file came from. Then you build code that can move it back, and code that 'empties the trash', possibly on a schedule.
You can do all that simply with Files.move.
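A minimal sketch of such an application-level trashcan (the directory layout and the ".origin" shadow-file convention are made up for illustration):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Instant;

public class AppTrashcan {
    private final Path trashDir;

    public AppTrashcan(Path trashDir) throws IOException {
        this.trashDir = Files.createDirectories(trashDir);
    }

    /** "Deletes" a file by moving it into our own trash dir and recording where it came from. */
    public void trash(Path file) throws IOException {
        Path target = trashDir.resolve(Instant.now().toEpochMilli() + "_" + file.getFileName());
        Files.move(file, target);
        // Shadow file remembers the original location so we can put it back later.
        Files.write(trashDir.resolve(target.getFileName() + ".origin"),
                file.toAbsolutePath().toString().getBytes());
    }

    /** Restores a previously trashed file to where it came from. */
    public void restore(Path trashedFile) throws IOException {
        Path originFile = trashDir.resolve(trashedFile.getFileName() + ".origin");
        Path original = Paths.get(new String(Files.readAllBytes(originFile)));
        Files.move(trashedFile, original);
        Files.delete(originFile);
    }
}
```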
I'm testing data structure performance with very large data.
As a temporary workaround (see here) I want to write memory to disk.
I want to test with very big datasets - how can I make it so that when the java VM runs out of memory it writes some of it to disk?
Since we're talking about temporary fixes here, you could always increase your page file (swap file on most Linux distros) if you need a little extra space.
Here's a link from Microsoft:
http://windows.microsoft.com/en-us/windows-vista/change-the-size-of-virtual-memory
Linux:
http://www.cyberciti.biz/faq/linux-add-a-swap-file-howto/
Now, let me say that this isn't a good long-term fix, but I understand that sometimes developers just need to make it work. If this is something that will ever see a production environment, you may want to look at a tool like Hadoop. It allows you to distribute your data processing over multiple JVMs - a tool built for a "big data" application like the one you're describing.
Maybe you can use a stream, or a buffered one; I think that would be the best choice for testing such a structure. If you can read from disk using a stream that doesn't create any additional objects (only those that are necessary), you can keep all of the JVM's memory for your structure. But maybe you can describe your problem in more detail?
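As a rough illustration of what "use a stream" could mean here: read fixed-size records lazily instead of keeping the whole dataset on the heap (the record layout of two longs per entry is made up for the example):

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;

public class StreamingScan {
    /** Streams (key, value) long pairs from a file, never holding more than one record in memory. */
    static long sumValues(String file) throws IOException {
        long sum = 0;
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            while (true) {
                try {
                    in.readLong();          // key (ignored here)
                    sum += in.readLong();   // value
                } catch (EOFException eof) {
                    break;                  // clean end of file
                }
            }
        }
        return sum;
    }
}
```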
A hard disk is broken into blocks, and every file stored on the hard disk is broken into blocks of that size.
e.g.
Consider a 1MB file and a block size of 512 bytes; then the file's first block is stored at 0x121454 and its second block at 0x846132.
I need to obtain 0x121454 and 0x846132. I want to use Java.
If not in Java, can C be used? If so, I can implement it with the help of JNI.
In Linux, the inode holds the details of all the block addresses, but I am not aware of the Windows equivalent.
You can't do this in pure Java.
You probably can't do this in C / C++ either ... unless you are running a privileged application that can access the "raw" device file in Linux (or the Windows equivalent).
And even then, you'd need to implement a whole stack of code that understands the file system structure at the disc block level, and can do all of the calculations.
And those calculations are far, far more complicated than your question envisages. There are multiple file system formats to deal with, and then there are the mappings from virtual disk block numbers to the level of physical disk / platter / track / sector addressing.
And even once you've achieved this, you can't use the information for anything much:
It would be dangerous to try to write files using physical disk addresses. One mistake and you've trashed the file system. (In fact, it is impossible to do safely unless you unmount the file system first ... because there's no way your application can coordinate with what the OS is doing.)
Even reading would be difficult to do reliably, because the OS could be writing to the file while you are reading, and that could be changing the disc address of the file's contents.
Is there any way of reliably "allocating" (reserving) hard disk space via "standard" Java (J2SE 5 or later)?
Take for example the case of a multithreaded application, executing in a thread pool, where every thread downloads files. How can the application make sure that its download won't be interrupted as a result of disk space exhaustion?
At least, if it knows beforehand the size of the file it is downloading, can it do some sort of "reservation", which would guarantee file download, irrespective of what the other threads are doing?
(There is a similar question in StackOverflow, but it does not discuss multithreading and also uses NIO.)
EDIT: After some further testing, the solution proposed in the similar question does not seem to work, as one can set ANY allowed length via the suggested RandomAccessFile approach, irrespective of the underlying hard disk space. For example, on a partition with only a few gigabytes available, I was able to create terabyte files at will.
The solution would have been sufficient if the getFreeSpace() method of the File class reported a decreased amount of available space every time a new file was created, but it actually doesn't, confirming that these files effectively occupy zero space on disk.
These are at least the results I am seeing on a CentOS 5.6 virtual machine, running in VMWare Player 4.0.0.
Write zeros to the file. That will ensure you have allocated disk space (unless drive compression or some other variable-size encoding of the file is in use).
You might get away with writing a single zero for every block, but determining the blocksize may not be trivial.
This is to avoid the creation of a sparse file which does not have all your space allocated.
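A sketch of that approach, assuming no filesystem compression is getting in the way (the 64 KB chunk size is arbitrary):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class Preallocate {
    /** Reserves roughly `bytes` of disk space by actually writing zeros, not just calling setLength(). */
    static void preallocate(String path, long bytes) throws IOException {
        byte[] zeros = new byte[64 * 1024]; // 64 KB chunks; a new byte[] is already all zero
        try (RandomAccessFile raf = new RandomAccessFile(path, "rw")) {
            long written = 0;
            while (written < bytes) {
                int chunk = (int) Math.min(zeros.length, bytes - written);
                raf.write(zeros, 0, chunk); // fails with IOException once the disk is full
                written += chunk;
            }
            raf.getFD().sync();             // push the data to the device before trusting the reservation
        }
    }
}
```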
I am writing a servlet which will examine a directory on the server (external to the web container) and recursively search for certain files (by certain files, I mean files that have a certain extension as well as a certain naming convention). Once these files are found, the servlet responds with a long list of all of the found files (including the full path to the files). My problem is that there are so many files and directories that my servlet runs extremely slowly. I was wondering if there is a best practice or existing servlet for this type of problem? Would it be more efficient to simply compile the entire list of files and do the filtering via js/jquery on the client side?
Disk access is slow and as the number of files and directories increases, you'll rapidly reach a point where your servlet will be useless when using the conventional recursive search through the directory tree. You'll especially reach this limit quickly if you have a large number of concurrent users performing the same search at the same time.
Instead, it's much better to use an external batch job to generate the list of files, which can then be read into the servlet through a database call, or even by just parsing a file containing all the file names separated by newline characters. Using "find" on Linux is a simple way to do this, e.g.
find <path_to_directory> -name '*.bin' > list_files.txt
This would list every file name that ends with .bin in a particular directory and output it into a file named list_files.txt. Your servlet could then read in that file and create the list of files from there.
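Reading that file back into the servlet is then trivial, for example:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class FileListLoader {
    /** Reads the newline-separated list produced by the external find job. */
    static List<String> loadFileList(String listPath) throws IOException {
        return Files.readAllLines(Paths.get(listPath), StandardCharsets.UTF_8);
    }
}
```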
If you really have loads of files, you might think about spawning an external process to do the searching. If you're running on a unix-like server (like linux), you might get speed gains by having the "find" command do the searching for you, and parse its output.
You can google for many examples of how to use "find".
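A sketch of what spawning "find" and parsing its output could look like, assuming a Unix-like server with find on the PATH:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

public class FindRunner {
    /** Runs `find <dir> -name <pattern>` and collects the matching paths. */
    static List<String> find(String dir, String pattern) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("find", dir, "-name", pattern).start();
        List<String> matches = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                matches.add(line);
            }
        }
        p.waitFor();
        return matches;
    }
}
```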
I see two possible reasons why this process might be going slowly:
1) Disk I/O is taking too long. This'll be a real constraint that you can't do much about. Usually the operating system is pretty good at keeping structures in memory that allow it to find files in your folders much quicker. If it is too slow regardless, you might have to build an index yourself in memory. This all depends on how you're doing it.
In any case, if this is the issue (you can try measuring), then there's no way doing the filtering client side will help, as that shouldn't really take very long, no matter where you do it. Instead you're going to make the client slower by sending it more data to sort through.
2) There's something wrong with your directory traversal. You say it's "recursive". If you mean it's actually recursive, i.e. a method that calls itself whenever it encounters a new directory, then that might well be slowing you down (the overhead really adds up). There's some stuff about tree traversal on wikipedia, but basically just use a queue or stack to keep track of where you are in the traversal, instead of using your method state to do so.
Note that a file system isn't actually a tree, but I'm assuming that it is in this case. It gets a bit hairier otherwise.
I don't agree with the other posters that you can't implement it in-process. It should work pretty well up to a certain point, no need for batch jobs just yet.
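For the traversal itself, a sketch of the stack/queue-based approach with the java.io.File API (the extension filter is just an example):

```java
import java.io.File;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class IterativeSearch {
    /** Walks the directory tree with an explicit stack instead of method recursion. */
    static List<File> findByExtension(File root, String extension) {
        List<File> matches = new ArrayList<>();
        Deque<File> pending = new ArrayDeque<>();
        pending.push(root);
        while (!pending.isEmpty()) {
            File dir = pending.pop();
            File[] children = dir.listFiles();
            if (children == null) {
                continue; // not a directory, or not readable
            }
            for (File child : children) {
                if (child.isDirectory()) {
                    pending.push(child);
                } else if (child.getName().endsWith(extension)) {
                    matches.add(child);
                }
            }
        }
        return matches;
    }
}
```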
I think your servlet works slowly because of hard drive speed. If the list of files is fairly static, you should load it into memory.