How do I allocate contiguous disk space? - java

I am developing a system which works with lots of files, and while doing some Google searches I read about improving the speed of information retrieval from the hard disk. But since I work with Java, I can't find any library that deals with this. I have only a very vague knowledge of C++, and found something about retrieving hard disk information with IOCTL.
Apparently there is no way of getting specific information such as how many contiguous free blocks I can get from my hard disk, or the maximum number of contiguous free blocks it has.
I am currently working with Windows 7 and XP.
I am aware of JNI, but I struggle with C++. Even searching for C++ solutions I can't find anything; maybe I am running the wrong queries on Google.
Could someone please give me a link, suggestions of study or anything? I am willing to study C++ (although I have almost no free time).
Thank you very much!
PS-Edit: I know it would make practically no difference, but I really need to learn about this. Thanks to everyone giving advice.

Have you identified a performance problem? If not, then don't do anything.
Are you sure that the physical distribution of the files on the disk is the cause of this performance problem? If not, then measure where the time is spent in your application, and try to improve the algorithms, introduce caches if necessary.
If you have done all this, and are sure it's the physical distribution of the files on the disk that's causing the performance problem, have you thought about buying a faster disk, or about using several ones? Hardware is often much cheaper than development time.
I very much doubt the physical distribution of the files on the disk has a significant impact on the performance of your app. I would search elsewhere first.
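To make the "measure first" advice concrete, here is a minimal sketch (the names and the workload are placeholders, not from the question) of timing a suspect code path with System.nanoTime() before blaming the disk layout:

```java
public class TimingProbe {
    // Time a single task; for real measurements, repeat the run
    // several times and let the JIT warm up first.
    static long timeNanos(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        // Stand-in for "read one of your files"
        long elapsed = timeNanos(() -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) sb.append(i);
        });
        System.out.println("elapsed ms: " + elapsed / 1_000_000);
    }
}
```

Comparing numbers like these across the candidate bottlenecks (parsing, allocation, actual I/O) usually settles the question faster than guessing about block placement.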

NTFS already tries to allocate your files contiguously, as stated in this blog post by a Windows 7 engineer. Your files will only be fragmented if there is no contiguous chunk of free space that is big enough.
If you believe that it is important for your files not to be fragmented, then I think the best option is to schedule a nightly defragmentation of your disk. That's more of a system administration problem.
Finally, fragmentation is largely irrelevant on SSDs.

AFAIK there's no built-in way, nor a 100% pure Java solution. The problem is that retrieving that kind of information is platform-dependent, and since Java is meant to be platform-independent, you can only use a common subset.

Captain Kernel explains here that this won't necessarily increase disk performance and that, in any case, it is not possible without extensive work.

Related

Prevent file fragmentation

I have an issue with fragmentation on my drive. I have a program that generates over 50,000 files in different folders, and each file grows over time. Each file will be about 500 MB in size, and I need to read the files fast.
The issue I am facing is that each file will be spread over the drive, and defragmentation would take over 4 weeks.
I heard about a filesystem that spreads the files across the drive so that the gap between each file is the same size. I searched the internet for that filesystem but couldn't find anything.
My program is written in Java; maybe there is a way to set the beginning of a file at a specific byte position on the drive.
I would be glad if someone could help me with this issue.
I heard about a filesystem that will spread each file on the drive so that the gap between each file will be the same size. I searched the internet for that filesystem but I couldn't find anything.
Most likely you did not, because it does not exist...
But we have RAID (Redundant Array of Inexpensive Disks) systems, which could ease your pain...
As Timothy said, you can't get to that level by using Java.
I haven't heard of such a filesystem either; it doesn't make much sense anyway.
Perhaps, if you are storing text, you could use a NoSQL database (like MongoDB) that stores data in binary form. You'll probably get good speeds, and the Java driver is easy to use.
Use a Linux filesystem like ext4, where fragmentation is very low, but also make sure you have plenty of disk space left, or fragmentation will happen anyway.
I also don't know of a file system that does this. However I have some info that may help-
If you used an SSD, then fragmentation would be less of a concern for read performance. SSDs store data in chunks - NAND flash pages, 16 KB for instance - which are always stored in scattered order anyway due to the wear-levelling algorithm, very unlike how hard disks work in practice. Pages on SSDs are also accessed in a highly parallel fashion. As a result, fragmentation would have much less impact on read performance with an SSD, though it would still carry some penalty for writes and deletions.
RAID would also allow for higher performance on reads as Timothy mentions.
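One mitigation worth sketching here (my addition, not from the answers above): since the files grow over time, preallocating each file to its expected final size with RandomAccessFile.setLength() gives the filesystem a chance to reserve one mostly contiguous extent instead of growing the file in thousands of small fragments. Caveat: setLength() only sets the logical size; some filesystems create a sparse file and allocate blocks lazily, so treat this as a best-effort hint rather than a guarantee.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class Preallocate {
    // Set the file's final size up front; the question's files are ~500 MB,
    // so in real use you would pass 500L * 1024 * 1024 here.
    static void preallocate(File file, long bytes) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            raf.setLength(bytes);
        }
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("prealloc-demo", ".bin");
        preallocate(f, 4096);           // small size for the demo
        System.out.println(f.length()); // logical size is now 4096
        f.delete();
    }
}
```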

Application slows down over time - Java + Python

This is a difficult one to explain, and I'm not hoping for a single, simple answer, but I thought it was worth a shot. I'm interested in what might slow down a long Python job that interacts with a Java application.
We have an instance of Tomcat running a fairly complex and robust webapp called Fedora Commons (not to be confused with Fedora the OS), software for storing digital objects. Additionally, we have a python middleware that performs long background jobs with Celery. One particular job is ingesting a 400+ page book, where each page of the book has a large TIFF file, then some smaller PDF, XML, and metadata files. Over the course of 10-15 minutes, derivatives are created from these files and they are added to a single object in Fedora.
Our problem: over the course of ingesting one book, adding files to the digital object in the Java app Fedora Commons slows down very consistently and predictably, but I can't figure out how or why.
I thought a graph of the ingest speeds might help; perhaps it reveals a common memory-management pattern that those more experienced with Java might recognize:
The top-left graph is timing large TIFFs, being converted to JP2, then ingested into Fedora Commons. The bottom-left is very small XML files, with no derivative being made, ingested as well. As you can see, the slope of their curve slowing down is almost identical. On the right, are those two processes graphed together.
I've been all over the internet trying to learn about garbage collection (GC) in Java, trying different configurations, but without much effect on the slowdown. If it helps, here are the memory settings we're passing to Tomcat (the tail end, I believe, is mostly diagnostic):
JAVA_OPTS='-server -Xms1g -Xmx1g -XX:+UseG1GC -XX:+DisableExplicitGC -XX:SurvivorRatio=10 -XX:TargetSurvivorRatio=90 -verbose:gc -Xloggc:/var/log/tomcat7/ggc.log -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC'
We're working with 12GB of RAM on this VM.
I realize the number of factors that might result in this behavior is, excuse the pun, off the charts. But we've worked with Fedora Commons and our Python middleware for quite some time and have been mostly successful. This slowdown you could set your watch to just feels suspiciously Java / garbage-collection related, though I could be very wrong about that too.
Any help or advice for digging in more is appreciated!
You say you suspect GC as the problem, but you show no GC metrics. Put your program through a profiler and see whether the GC really is overloaded; it is hard to solve a problem without identifying the cause.
Once you have found where the problem lies, you will likely need to change the code instead of just tweaking GC settings.
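One low-effort way to get hard GC numbers without an external profiler is the standard java.lang.management API; a sketch, to be sampled before and after one book ingest to see whether GC time actually grows with it:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcStats {
    // Sum cumulative GC pause time (ms) across all collectors in this JVM.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if the JVM cannot report it
            if (t > 0) total += t;
        }
        return total;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": count=" + gc.getCollectionCount()
                    + " timeMs=" + gc.getCollectionTime());
        }
        System.out.println("total GC ms: " + totalGcMillis());
    }
}
```

If GC time stays flat while ingest time climbs, the slowdown is in application code, not the collector.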
Thanks to all for the suggestions around GC and Tomcat analysis. Turns out, the slowdown was entirely due to ways that Fedora Commons builds digital objects. I was able to isolate this by creating an extremely simple digital object, iteratively adding near zero-size datastreams and watching the progress. You can see this in the graph below:
The curve of the slowdown was almost identical, which suggested it was not our particular ingest method or file sizes. Furthermore, it prompted me to dig back into old forum posts about Fedora Commons, which confirm that single objects are not meant to contain a large number of datastreams.
It is perhaps interesting how this knowledge was obscured behind the intellectual organization of digital objects, rather than stated in terms of the performance hits you take with Fedora, but that's probably fodder for another forum.
Thanks again to all for the suggestions - if nothing else, normal usage of Fedora is finer tuned and humming along better than before.
Well, instead of looking into obscure GC settings, you might want to start managing memory more deliberately (for example by pooling and reusing objects), so that the GC doesn't affect your execution as much.

Is it possible to get hard disk size using PHP or Java?

I want to detect the hard disk size of my computer and, if possible, the partition information (partition size, free space, used space, etc.). Is this possible in PHP/Java?
For PHP, use the disk_total_space() function.
For me, getting the total disk space in Java looks complicated; I haven't tried it.
For Java, finding free disk space used to be a long-standing feature request. It was finally implemented in Java 6, a.k.a. Mustang.
You can now use File.getFreeSpace() and getUsableSpace(). See e.g. http://www.javalobby.org/java/forums/t19527.html for explanations and examples.
For Java versions prior to Java 6, there is no (easy, cross-platform) solution, just some ugly hacks.
Note: this will give you the free space on the partition that the File instance lives on. I don't know of any way to get a list of all partitions, at least not in pure Java. At any rate, this is highly system-specific information, so it is probably not feasible in pure Java.
Maybe you could describe your problem in more detail, then we can possibly help.
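For completeness, a minimal sketch of the Java 6 API mentioned above. File.listRoots() gets close to "all partitions": on Windows it returns one entry per drive letter, while on Unix it returns just "/", so mounted partitions other than the root are still invisible:

```java
import java.io.File;

public class DiskSpace {
    public static void main(String[] args) {
        for (File root : File.listRoots()) {
            System.out.println(root
                    + " total="  + root.getTotalSpace()    // size of the partition
                    + " free="   + root.getFreeSpace()     // unallocated bytes
                    + " usable=" + root.getUsableSpace()); // bytes available to this JVM
        }
    }
}
```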

I'm asked to tune a slow-starting app to start within a short time period

I've been asked to shorten the start-up period of a slow-starting app; however, I also have to commit to my managers to the amount of time I will shave off the startup - something like 10-20 seconds.
As I'm new at my company, I said I could commit to a timeframe of months (it's a big server, I'm new, and I plan to do lazy loading plus performance tuning).
That answer was not accepted. Instead, I was asked to build some kind of cache holding the important data on another server, so that when my server starts up it fetches all its data from that cache. I find that a kind of workaround, and I don't really like it.
Do you like it?
What do you think I should do?
PS: when I profiled the app I saw many small issues that make the start-up long (about 2 minutes); it would not be a short process to fix them all and implement lazy loading.
Any kind of suggestions would help.
Language is Java.
Thanks
Rule one of performance optimisation: measure it. Get hard figures. At each stage of optimisation measure the performance gain/loss/lack of change. You (and your managers) are not in a position to say that a particular optimisation will or will not work before you try it and measure it. You can always ask to test & measure a solution before implementing it.
Rule two of performance optimisation (or anything really): choose your battles. Please bear in mind that your managers may be very experienced with the system in question, and may know the correct solution already; there may be other things (politics) involved as well, so don't put your position at risk by butting heads at this point.
I agree with MatthieuF. The most important thing to do is to measure it. Then you need to analyze the measurements to see which parts are most costly and also which resource (memory, CPU, network, etc) is the bottleneck.
If you know these answers you can propose solutions. You might be able to create small tests (proof of concepts) of your solution so you can report back early to your managers.
There can be all kinds of solutions; for example, simply buying more hardware might be the best way to go. It's also possible that buying more hardware will have no effect and you need to make modifications instead: optimizing the software or the database, choosing better algorithms, introducing caching (at the expense of more memory), or introducing multithreading to take advantage of multiple CPU cores. You can also change the "surroundings" of your application, such as the configuration or version of your operating system, Java virtual machine, application server, or database server - all of these components have settings that can affect performance.
Again, it's very important to measure, identify the problem, think of a solution, build the solution (maybe as a proof of concept), and measure whether it works. Don't fall into the trap of choosing a solution before you know the problem.
It sounds to me as if you've come in at a relatively junior position, and your managers don't (yet) trust your abilities and judgment.
I don't understand why they would want you to commit to a particular speed-up without knowing if it was achievable.
Maybe they really understand the code and its problems, and know that a certain level of speed-up is achievable. In this case, they should have a good idea how to do it ... so try and get them to tell you. Even if their ideas are not great, you will get credit for at least giving them a try.
Maybe they are just trying to apply pressure (or pass on pressure applied to them) in order to get you to work harder. In this case, I'd probably give them a worthwhile but conservative estimate, then spend some time investigating the problems more thoroughly. And if after a few days' research you find that your "off the cuff" estimates are significantly off the mark, go back to the managers with a more accurate estimate.
On the technical side, a two-minute start-up time sounds rather excessive to me. What is the application doing in all that time? Loading data structures from files or a database? Recalculating things? Profiling may help answer some of these questions, but you also need to understand the system's architecture to make sense of the profiling stats.
Without knowing what the real issues are here, I'd suggest trying to get the service to become available early while doing some of the less critical initialization in the background, or lazily. (And your managers' idea of caching some important data may turn out to be a good one, viewed in this light.) Alternatively, I'd see whether it is feasible to implement a "hot standby" for the system, or to replicate it in a way that allows you to reduce startup times.
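The "become available early" idea can be sketched like this; loadReferenceData() is a hypothetical stand-in for whatever slow, non-critical work your profiling turns up:

```java
import java.util.concurrent.CompletableFuture;

public class LazyStartup {
    private volatile boolean referenceDataReady = false;

    void start() {
        // Critical initialisation only: open ports, wire core services...
        // Then push the slow, non-critical part into the background.
        CompletableFuture.runAsync(this::loadReferenceData)
                         .thenRun(() -> referenceDataReady = true);
    }

    private void loadReferenceData() {
        // Hypothetical slow step: load files, warm caches, etc.
    }

    boolean isFullyInitialised() {
        return referenceDataReady;
    }

    public static void main(String[] args) throws InterruptedException {
        LazyStartup app = new LazyStartup();
        app.start();       // returns almost immediately
        Thread.sleep(200); // give the background step time to finish
        System.out.println("fully initialised: " + app.isFullyInitialised());
    }
}
```

Requests that need the deferred data must either wait on the flag or degrade gracefully until it flips; that trade-off is the price of the fast start.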

Tell Java not to push an object into swap space

Is it possible to tell the JVM to hint to the OS that a certain object should preferably not be pushed out to swap space?
The short answer is no.
Java doesn't allow you any control over what is swapped in and what is 'pinned' into RAM.
Worrying about this sort of thing is usually a sign that something else is wrong in your project. The OS will, on the whole, do a much better job of working out what should be swapped and what shouldn't. Your job is to write your software so that it doesn't try to second-guess what the underlying VM/OS is going to do; just concentrate on delivering your features and a good design.
This problem has also been very noticeable in Eclipse, and the KeepResident dirty-hack plugin (http://suif.stanford.edu/pub/keepresident/) avoids it.
It might be a good place to start. I have not seen it in widespread use, so perhaps it has been integrated into the standard Eclipse distribution?
Hey, you are programming in a managed language - why are you thinking about this? If you can't get this stuff out of your mind, you can always choose to program in C.
The short answer is (as given above): don't do it :-).
It would, however, be possible in principle. Most OSes do allow you to "lock" certain memory areas so they cannot be swapped out (e.g. mlock(2) under Linux, VirtualLock under Windows).
The VM could expose this functionality to Java applications via a suitable API. However, no VM I know of does, so to use it you would first have to modify your VM...
If you access it regularly, then whatever page it happens to be in at the time (the JVM moves objects around during garbage collection) will not be paged out unless something else is requesting memory even more aggressively. But there is no way to tell the JVM not to move it to another page, and the OS only knows about pages.
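That "access it regularly" workaround can be sketched as a toucher that reads one byte per page of a buffer, so the OS keeps seeing those pages as recently used. The 4 KB page size is an assumption (it is OS-specific), and this only discourages swapping; it is nothing like a real mlock:

```java
public class PageToucher {
    static final int ASSUMED_PAGE_SIZE = 4096; // OS-specific; 4 KB is typical

    // Read one byte from every page; return the sum so the JIT
    // cannot eliminate the reads as dead code.
    static long touch(byte[] buf) {
        long sum = 0;
        for (int i = 0; i < buf.length; i += ASSUMED_PAGE_SIZE) {
            sum += buf[i];
        }
        return sum;
    }

    public static void main(String[] args) throws InterruptedException {
        byte[] hot = new byte[16 * 1024 * 1024]; // data we'd like to stay resident
        for (int round = 0; round < 3; round++) { // in real use: a daemon thread, every ~30 s
            touch(hot);
            Thread.sleep(100);
        }
        System.out.println("touched " + hot.length / ASSUMED_PAGE_SIZE + " pages per round");
    }
}
```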
Not an answer, but lacking points to comment, I reserve this option :)
There are reasons not to store information in swap, be it passwords or other confidential information that should not spend eternity on disk. Also, coming back to my PC after a weekend, I'd like some things to be immediately available in memory.
(Non-Java) Natively there is probably some way to do this on each/most operating systems; with Windows it is definitely possible, but not straight out of Java (think JNI).
Depending on how desperate you are, you could always look at using video memory or some other hardware device that does not swap out. That would still let you use a fairly standard Java API, like JOGL, to store information. But somehow I doubt that is in line with the implementation/results you are looking for.
Basically you want to keep the whole JVM in main memory the whole time.
