Retrieving the starting memory addresses of the blocks of a file - java

A hard disk is divided into blocks, and every file stored on it is split into chunks of that block size before being written out.
e.g.
Consider a 1 MB file with a block size of 512 bytes; suppose the file's first block is stored at 0x121454 and its second block at 0x846132.
I need to obtain 0x121454 and 0x846132. I want to use Java.
If it can't be done in Java, can it be done in C? If so, I can expose it to Java via JNI.
On Linux the inode holds all of these block addresses, but I'm not aware of how Windows does it.

You can't do this in pure Java.
You probably can't do this in C / C++ either ... unless you are running a privileged application that can access the "raw" device file in Linux (or the Windows equivalent).
And even then, you'd need to implement a whole stack of code that understands the file system structure at the disc block level, and can do all of the calculations.
And those calculations are far, far more complicated than your question envisages. There are multiple file system formats to deal with, and then there are the mappings from virtual disk block numbers to the level of physical disk / platter / track / sector addressing.
And even once you've achieved this, you can't use the information for anything much:
It would be dangerous to try to write files using physical disk addresses. One mistake and you've trashed the file system. (In fact, it is impossible to do safely unless you unmount the file system first ... because there's no way your application can coordinate with what the OS is doing.)
Even reading would be difficult to do reliably, because the OS could be writing to the file while you are reading, and that could be changing the disc address of the file's contents.

Related

Using Java can I programmatically undelete a file under Windows?

I'm successfully using Desktop.getDesktop().moveToTrash(File) under MacOS to delete files and then retrieve them from the Trash folder. I'd like to do the same under Windows, but I don't see a native Java way to access the Recycle Bin so I can undelete files.
Under MacOS I simply rename files from the Trash folder back to where they were. Is there a way I can do that with the Windows Recycle Bin? Or do something similar?
There's nothing in the core API. You have a bunch of options.
But first: there's trash, and there's delete.
"move to trash" means the file is literally indestructible: as long as it remains in the trashcan, it remains on your disk. Said differently, if you have a completely filled-up hard disk and trash 100GB worth of files, that disk is... still completely filled up. Possibly certain OSes have a 'bot' that runs when needed or on a schedule and truly deletes files in the trashcan that fit certain rules (such as 'deleted more than 30 days ago').
"Actually delete" means the disk space is now available - if you have a full disk, actually delete 100GB worth of files, then there's now 100GB available, but those files are STILL on disk!! - the only thing that 'delete' does, is flag the space. It doesn't overwrite the actual bits on disk with zeroes or random data or whatnot. Further use of this disk will eventually mean some other file is written 'over' the deleted file at which point it truly becomes entirely unrecoverable, but if you have some extremely sensitive data, you delete the files, then you toss the computer in the garbage bin, anybody who gets their hands on that machine can trivially recover your data. Because what 'delete' does is set a flag "I am deleted", nothing more. All you need to do to undo this, is to unset the flag.
The reason I mention this, is because you used the term 'undelete' in your question. Which usually means something else.
| Verb | UnVerb | Action |
| --- | --- | --- |
| Trash | Recover, Untrash, Restore, Put Back | Disk space remains unavailable. File visible in OS trash can tool. |
| Delete | Undelete | Disk space is now available; data is still on disk but could be overwritten at any time. |
| Wipe | N/A | Data is overwritten with zeroes. Some wiping tools overwrite 7 times with random data to be really sure [1]. |
| Trim | N/A | Pulse all cells on an SSD [2]; intended to make data unrecoverable, applies only to SSDs. |
[1] This fights largely hypothetical trickery where you recover data by scanning for minute magnetic differences in potential. If it's doable at all it requires equipment that costs multiple millions. Still, if you're paranoid, you write random data, and repeat that procedure 7 to 30 times.
[2] SSDs are 'weird' in that they are overprovisioned: They can store more data than the label says. It's because SSDs work in 'cells' and there's a limit to how often a cell can be rewritten. Eventually a cell is near death (it's clear it's only got a few rewrites left in it), at which point the data is copied to an unused cell, and the near-death cell is marked off as no longer usable. The SSD is a 'fake harddrive', exposing itself as a simple contiguous block of writable and addressable space. It's like a mini computer, and will map write ranges to available cells. As a consequence, using basic OS/kernel driver calls to tell the SSD to write 7x random data over a given range of bits does not actually work, in that it is possible that there's a cell with the file data that's been marked as not to be used, and it won't be wiped. While somewhat hard to do, you can send special commands, so-called TRIM commands, to most SSDs to explicitly tell them to pulse-clear all cells on the entire drive, even the ones that have been marked as near-death. This low-level call to the SSD firmware is the only way to securely delete anything off of an SSD. Naturally, the whole point of this exercise is that you can't undo it.
So, to be clear, the one and only thing on this list that is meaningfully doable without writing extremely complex software that scans the raw disk out, byte for byte (which is a tool you should not be writing in java, as you'll be programming a lot towards the OS/arch which java is not good at), is the Untrash part: Undoing the 'trash' action.
Not available in basic java
... unfortunately even that is not available normally. There's an API to tell the OS to 'trash' a file; there is no API call to untrash it. That means you'll have to code an untrash implementation for each and every OS you want to support. On Mac, you could handroll it by moving files out of ~/.Trash. On Windows it's a little trickier.
One "simple" (heh) way is to use JNI to write C code (targeting the windows API, to be compiled on windows with windows oriented C tools) that does the job, and then use JNI to call this C function (compiled to a .dll) on windows specifically. You can ship the DLL and simply not use it on non-windows OSes. You will have to compile this DLL once for every arch you want to target (presumably, x64, possibly x86 and aarch64 (ARM)). This is all highly complicated and requires some knowledge about how to write fairly low-level windows code.
Use command line tools
You can invoke command line tools. For example, Windows has fsutil, which can be used to make hard links. I think you can do it; C:\$Recycle.bin is the path, more or less. Finding C: itself from Java is already a little tricky (you can have multiple disks in a system, so do you just scan for C:, D:, etc.? If the machine still has a CD-ROM drive, that scan will make it spin up, which surely you didn't want. You can ask Windows what kind of disk a letter is, but this again requires JNI; it's not baked into Java).
You could write most of the untrash functionality in a powershell script and then simply use java's ProcessBuilder to run it, and have it do the bulk of the work.
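The Java side of that "do the work in PowerShell" approach can be as small as a generic process launcher; note that the script name used in the usage example below is hypothetical, something you would write yourself:

```java
import java.io.IOException;

// Generic launcher for an external script or command; the real untrash
// logic would live in the (hypothetical) PowerShell script it starts.
public class ScriptRunner {
    public static int run(String... command)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.inheritIO(); // forward the script's output to our console
        return pb.start().waitFor(); // exit code; 0 = success by convention
    }
}
```

On Windows you would then call something like `ScriptRunner.run("powershell.exe", "-ExecutionPolicy", "Bypass", "-File", "untrash.ps1", fileName)`, where `untrash.ps1` is your own script.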
Use C:\$Recycle.bin
You can try accessing Paths.get("C:\\$Recycle.bin") (note the backslash must be escaped in a Java string literal) and see what happens. Presumably, you can just move files out of there. But note that each file carries knowledge of where it used to be. The files still have their extension, but their names are mangled, containing only the drive letter they came from plus a number. There's a separate mapping file that tells you where each file was deleted from and what its name was. You will have to open this mapping file and parse through it. You'll have to search the web to figure out its format. You'll have to take care not to corrupt it, and probably to remove the entry for the file you just recovered (without corrupting it).
Note that files in the recycle bin have all sorts of flags set, such as the system flag. You may have to write it all in powershell or batch scripts, and start those from java instead, to e.g. run ATTRIB.EXE to change properties first. Hopefully you can do it with the java.nio.file API which exposes some abilities to interact with windows-specific file flag stuff.
Build your own trashcan
In general it's a bad idea to use java to write highly-OS-specific tooling. You can do it, but it'll hurt the entire time. The designers of java didn't make it for this (Project Panama is trying to fix this, but it's not in JDK18 and won't be in 19 either, it's a few years off – and it wasn't really designed for this kind of thing either), and your average java coder wouldn't use it for this, so that means: Few to no libraries, and hard to find support.
Hence, it's a better idea to let a desktop Java app do things its own way rather than imitate your average native desktop tool. Which can include 'having its own trashcan'. Let's say you have a code editor written in java, and it has a 'delete' feature. You're free to implement 'delete' by moving files to a trashcan dir you made, where you track (via a DB or shadow files) when the delete occurred, who did it, and where the file came from. Then you build code that can move it back, and code that 'empties the trash', possibly on a schedule.
You can do all that simply with Files.move.
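A minimal sketch of such an application-private trashcan, using only `Files.move` plus a sidecar file to remember each file's original location (the directory layout and `.origin` naming here are assumptions, not any standard):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Application-private trashcan: "delete" moves the file into a trash
// directory and records its original path in a sidecar file; "restore"
// moves it back. Collision handling is omitted for brevity.
public class AppTrash {
    private final Path trashDir;

    public AppTrash(Path trashDir) throws IOException {
        this.trashDir = Files.createDirectories(trashDir);
    }

    public void delete(Path file) throws IOException {
        // Sidecar file remembers where the file came from
        Files.writeString(trashDir.resolve(file.getFileName() + ".origin"),
                file.toAbsolutePath().toString());
        Files.move(file, trashDir.resolve(file.getFileName()));
    }

    public void restore(String fileName) throws IOException {
        Path sidecar = trashDir.resolve(fileName + ".origin");
        Path origin = Paths.get(Files.readString(sidecar));
        Files.move(trashDir.resolve(fileName), origin);
        Files.delete(sidecar);
    }
}
```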

Writing multiple files of same data amount vs writing a single large file of same data amount

I want to write a big file to the local disk.
I split the big file into many small files and tried to write them to disk, and I observed a big increase in total write time compared with writing the single large file.
Also, I copy the files from one disk and write them to another computer's disk (the reducer). I observed a big increase in read time as well. Can anybody explain the reason? I am working with Hadoop.
Thanks!
That's due to the underlying file system and hardware.
There's overhead for each file in addition to its contents, for example the MFT for NTFS (on Windows). So for a single large file the file system can do less bookkeeping; thus it's faster.
As arranged by your OS, a single big file tends to be written to consecutive sectors of the hard drive where possible, but multiple small files may or may not be. The resulting increase in seek time may account for the increased reading time for many small files.
The efficiency of your OS may also play a big part, for example whether it prefetches file contents, how it makes use of buffers, etc. For many small files it's more difficult for the OS to use the buffer (and deal with other issues) efficiently. (Under different scenarios it can behave differently.)
EDIT: As for the copy process you mentioned, your OS generally does it in the following steps:
read data from disk->writing data to buffer->read from buffer->write to (possibly another) disk
This is usually done with multiple threads. When dealing with many small files, the OS may fail to coordinate these threads efficiently (some threads are very busy, while others must wait). For a single large file the OS doesn't have to deal with these issues.
Every file system has a smallest non-sharable allocation unit for storing data; call it a page. Say, for example, the file system has a page size of 4 KB. If you save an 8 KB file, it will consume 2 pages on the disk. But if you break the file into 4 files of 2 KB each, it will consume 4 half-filled pages, using 16 KB of disk space.
Similarly, if you break the file into 8 small files of 1 KB each, it will consume 8 partially filled pages, and 32 KB of your disk space is consumed.
The same is true for reading overhead: if your file occupies several pages, they might be scattered, which leads to high overhead in seek/access time.
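The page arithmetic above can be checked with a one-liner: allocated space is the file size rounded up to a whole number of pages.

```java
// Allocated disk space = file size rounded up to a whole number of pages.
public class BlockMath {
    static long allocated(long fileSize, long pageSize) {
        return ((fileSize + pageSize - 1) / pageSize) * pageSize;
    }

    public static void main(String[] args) {
        long page = 4096;
        System.out.println(allocated(8192, page));     // one 8 KB file -> 8192
        System.out.println(4 * allocated(2048, page)); // four 2 KB files -> 16384
    }
}
```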

Memory-Mapped MappedByteBuffer or Direct ByteBuffer for DB Implementation?

This looks like a long question because of all the context. There are 2 questions inside the novel below. Thank you for taking the time to read this and provide assistance.
Situation
I am working on a scalable datastore implementation that can support working with data files from a few KB to a TB or more in size on a 32-bit or 64-bit system.
The datastore utilizes a Copy-on-Write design; always appending new or modified data to the end of the data file and never doing in-place edits to existing data.
The system can host 1 or more database; each represented by a file on-disk.
The details of the implementation are not important; the only important detail being that I need to constantly append to the file and grow it from KB, to MB, to GB to TB while at the same time randomly skipping around the file for read operations to answer client requests.
First-Thoughts
At first glance I knew I wanted to use memory-mapped files so I could push the burden of efficiently managing the in-memory state of the data onto the host OS and out of my code.
Then all my code needs to worry about is serializing the append-to-file operations on-write, and allowing any number of simultaneous readers to seek in the file to answer requests.
Design
Because the individual data-files can grow beyond the 2GB limit of a MappedByteBuffer, I expect that my design will have to include an abstraction layer that takes a write offset and converts it into an offset inside of a specific 2GB segment.
So far so good...
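A sketch of that abstraction layer: map the file as a list of fixed-size segments and translate an absolute offset into (segment index, offset within segment). The 1 GB segment size and read-only mode are assumptions for illustration:

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;

// Maps a file as a series of fixed-size MappedByteBuffer segments and
// resolves absolute offsets to (segment, offset-within-segment).
public class SegmentedMap {
    static final long SEGMENT_SIZE = 1L << 30; // 1 GB per mapping (assumption)

    final List<MappedByteBuffer> segments = new ArrayList<>();

    SegmentedMap(String path) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r");
             FileChannel ch = raf.getChannel()) {
            long size = ch.size();
            for (long pos = 0; pos < size; pos += SEGMENT_SIZE) {
                long len = Math.min(SEGMENT_SIZE, size - pos);
                // Mappings stay valid even after the channel is closed
                segments.add(ch.map(FileChannel.MapMode.READ_ONLY, pos, len));
            }
        }
    }

    byte get(long absoluteOffset) {
        int segment = (int) (absoluteOffset / SEGMENT_SIZE);
        int within  = (int) (absoluteOffset % SEGMENT_SIZE);
        return segments.get(segment).get(within);
    }
}
```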
Problems
This is where I started to get hung up and think that going with a different design (proposed below) might be the better way to do this.
From reading through 20 or so "memory mapped" related questions here on SO, it seems mmap calls are sensitive to wanting contiguous runs of memory when allocated. So, for example, on a 32-bit host OS if I tried to mmap a 2GB file, due to memory fragmentation, my chances are slim that mapping will succeed and instead I should use something like a series of 128MB mappings to pull an entire file in.
When I think of that design, even say using 1024MB mmap sizes, for a DBMS hosting up a few huge databases all represented by say 1TB files, I now have thousands of memory-mapped regions in memory and in my own testing on Windows 7 trying to create a few hundred mmaps across a multi-GB file, I didn't just run into exceptions, I actually got the JVM to segfault every time I tried to allocate too much and in one case got the video in my Windows 7 machine to cut out and re-initialize with a OS-error-popup I've never seen before.
Regardless of the argument of "you'll never likely handle files that large" or "this is a contrived example", the fact that I could code something up with those kinds of side effects put my internal alarm on high-alert and made me consider an alternative implementation (below).
BESIDES that issue, my understanding of memory-mapped files is that I have to re-create the mapping every time the file is grown, and in the case of this append-only design, the file is literally constantly growing.
I can combat this to some extent by growing the file in chunks (say 8MB at a time) and only re-create the mapping every 8MB, but the need to constantly be re-creating these mappings has me nervous especially with no explicit unmap feature supported in Java.
Question #1 of 2
Given all of my findings up to this point, I would dismiss memory-mapped files as a good solution for primarily read-heavy solutions or read-only solutions, but not write-heavy solutions given the need to re-create the mapping constantly.
But then I look around at the landscape, with solutions like MongoDB embracing memory-mapped files all over the place, and I feel like I am missing some core component here (I do know it allocates in something like 2GB extents at a time, so I imagine they are working around the re-map cost with this logic AND helping to maintain sequential runs on-disk).
At this point I don't know if the problem is Java's lack of an unmap operation that makes this so much more dangerous and unsuitable for my uses or if my understanding is incorrect and someone can point me North.
Alternative Design
An alternative design to the memory-mapped one proposed above that I will go with if my understanding of mmap is correct is as follows:
Define a direct ByteBuffer of a reasonable configurable size (2, 4, 8, 16, 32, 64, 128KB roughly) making it easily compatible with any host platform (don't need to worry about the DBMS itself causing thrashing scenarios) and using the original FileChannel, perform specific-offset reads of the file 1 buffer-capacity-chunk at a time, completely forgoing memory-mapped files at all.
The downside being that now my code has to worry about things like "did I read enough from the file to load the complete record?"
Another down-side is that I don't get to make use of the OS's virtual memory logic, letting it keep more "hot" data in-memory for me automatically; instead I just have to hope the file cache logic employed by the OS is big enough to do something helpful for me here.
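A sketch of this alternative: positional reads into a direct buffer via the plain FileChannel, with no memory mapping at all. One assumption worth noting: the buffer is not thread-safe, so each reader thread would need its own instance.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Reads fixed-size chunks at arbitrary offsets using a reusable direct
// buffer; one instance per reader thread (the buffer is stateful).
public class DirectReader {
    private final FileChannel channel;
    private final ByteBuffer buffer;

    public DirectReader(Path file, int bufferSize) throws IOException {
        channel = FileChannel.open(file, StandardOpenOption.READ);
        buffer = ByteBuffer.allocateDirect(bufferSize);
    }

    // Read up to buffer.capacity() bytes starting at the absolute offset.
    public ByteBuffer readAt(long offset) throws IOException {
        buffer.clear();
        channel.read(buffer, offset); // positional read; channel position unchanged
        buffer.flip();
        return buffer;
    }
}
```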
Question #2 of 2
I was hoping to get a confirmation of my understanding of all of this.
For example, maybe the file cache is fantastic, that in both cases (memory mapped or direct reads), the host OS will keep as much of my hot data available as possible and the performance difference for large files is negligible.
Or maybe my understanding of the sensitive requirements for memory-mapped files (contiguous memory) are incorrect and I can ignore all that.
You might be interested in https://github.com/peter-lawrey/Java-Chronicle
In this I create multiple memory mappings to the same file (each mapping's size is a power of 2, up to 1 GB). The file can be any size (up to the size of your hard drive).
It also creates an index so you can find any record at random and each record can be any size.
It can be shared between processes and used for low latency events between processes.
I make the assumption you are using a 64-bit OS if you want to use large amounts of data. In this case a List of MappedByteBuffer will be all you ever need. It makes sense to use the right tools for the job. ;)
I have found it performs well even with data sizes around 10x your main memory size (I was using a fast SSD drive, so YMMV).
I think you shouldn't worry about mmap'ping files up to 2GB in size.
Looking at the sources of MongoDB as an example of a DB making use of memory-mapped files, you'll find it always maps the full data file in MemoryMappedFile::mapWithOptions() (which calls MemoryMappedFile::map()). DB data spans multiple files, each up to 2GB in size. It also preallocates data files, so there's no need to remap as the data grows, and this prevents file fragmentation. Generally you can draw inspiration from the source code of this DB.

The number of blocks allocated to a sparse file

Is there any way to access the number of blocks allocated to a file with the standard Java File API? Or even do it with some unsupported and undocumented API underneath? Anything to avoid native code plugins.
I'm talking about the st_blocks field of struct stat that the fstat/stat syscalls work on in Unix.
What I want to do is to create a sparse copy of a file that now has lots of redundant data, i.e. make a new copy of it, only containing the active data but sparsely written to it. Then swap the two files with an atomic rename/link operation. But I need a way to find out how many blocks are allocated to the file beforehand, it might already have been sparsely copied. The old file is then removed.
This will be used to free up disk space in a database application that is 100% Java. The benefit of relying on sparse file support in the filesystem is that I would not have to change the index that points out the location of the data, which would increase the complexity of the task at hand.
I think I can do somewhat well by relying on the file timestamp to see if files have already been cleaned up. But this intrigued me. I cannot even find anything at this level in the Java 7 NIO.2 API for file attribute access.
The only way I can think of is to use ls -s filename to get the actual size of the file on disk. http://www.lrdev.com/lr/unix/sparsefile.html
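In the same spirit, you can shell out to GNU stat instead of ls: its %b format prints st_blocks, the number of allocated blocks (in 512-byte units on most systems). This is Linux-only, depends on GNU coreutils being installed, and is of course not pure Java:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

// Linux-only workaround: shell out to GNU `stat -c %b` to read st_blocks
// (allocated 512-byte blocks) for a file. Not pure Java.
public class AllocatedBlocks {
    public static long blocksOf(String path) throws Exception {
        Process p = new ProcessBuilder("stat", "-c", "%b", path).start();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            long blocks = Long.parseLong(r.readLine().trim());
            p.waitFor();
            return blocks;
        }
    }
}
```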

Allocate disk space for multiple file downloads in Java

Is there any way of reliably "allocating" (reserving) hard disk space via "standard" Java (J2SE 5 or later)?
Take for example the case of a multithreaded application, executing in a thread pool, where every thread downloads files. How can the application make sure that its download won't be interrupted as a result of disk space exhaustion?
At least, if it knows beforehand the size of the file it is downloading, can it do some sort of "reservation", which would guarantee file download, irrespective of what the other threads are doing?
(There is a similar question in StackOverflow, but it does not discuss multithreading and also uses NIO.)
EDIT: After some further testing, the solution proposed in the similar question does not seem to work, as one can set ANY allowed length via the suggested RandomAccessFile approach, irrespective of the underlying hard disk space. For example, on a partition with only a few gigabytes available, I was able to create TB (terabyte!) files at will.
The solution would have been sufficient if the getFreeSpace() method of the File class reported a decreased amount of available space every time one created a new file, but it actually doesn't, thus confirming the zero length which, in practice, these files seem to have.
These are at least the results I am seeing on a CentOS 5.6 virtual machine, running in VMWare Player 4.0.0.
Write zeros to the file. That will ensure you have allocated disk space (unless drive compression or some other variable-size encoding of the file is in use).
You might get away with writing a single zero for every block, but determining the block size may not be trivial.
This is to avoid the creation of a sparse file which does not have all your space allocated.
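A sketch of that approach: write zeros in chunks up to the expected download size, so the space is genuinely allocated before the download starts (assuming no filesystem compression; the 64 KB chunk size is arbitrary):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Reserve disk space by actually writing zeros, so the file is not
// sparse; a later write into this region cannot fail for lack of space
// (barring compression or copy-on-write filesystems).
public class Preallocate {
    public static void preallocate(String path, long bytes) throws IOException {
        byte[] zeros = new byte[64 * 1024]; // 64 KB chunks (arbitrary size)
        try (RandomAccessFile raf = new RandomAccessFile(path, "rw")) {
            long remaining = bytes;
            while (remaining > 0) {
                int n = (int) Math.min(zeros.length, remaining);
                raf.write(zeros, 0, n);
                remaining -= n;
            }
        }
    }
}
```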
