I need to determine the size of a very large character-encoded file. A read of the file takes a significant amount of time.
My understanding is that when a file is first created/modified the size is cached, so the OS can quickly retrieve the value when the size is requested, say, by a file manager. (eg. it seems quick when opening the properties dialog of a large file in win explorer)
Assuming the above is true, can this be retrieved in Java? I had thought that length() read the file to determine the size... or does it in fact return this cached size? Or does the creation of a File object perform this read / retrieve the cached size?
My own research hasn't been able to answer these questions as yet.
I'd appreciate some help with my understanding
Thanks
File systems generally store the length as part of the file's metadata; that is how the OS knows where the end of the file is. This information is cached when accessed, and repeated calls for it will also be served from the cache.
Note: the OS often reads more data from disk than you ask for, because disk accesses are expensive and memory is relatively cheap. E.g. when you get the length of one file, it may read in the details of many files on the assumption that you might want information about those files too, i.e. the first time you get a file's information it is likely to already be cached.
length() delegates to the underlying native operating-system function to get the length of the file. You should be fine using it.
The length() method doesn't read the file. It calls a native method which delegates to the OS to get the file length. Its response time should not depend on the actual file length.
I think you're overthinking this. length() should query the file system and figure this out very quickly. It's certainly not reading the entire file and counting bytes.
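For reference, a minimal check might look like this (standard JDK calls, hypothetical file name); both return the size from file-system metadata without reading the contents, so they should be near-instant regardless of how big the file is:

    import java.io.File;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class FileSizeDemo {
        public static void main(String[] args) throws IOException {
            // java.io: returns 0 if the file does not exist
            long len = new File("big-file.txt").length();

            // java.nio: throws NoSuchFileException if the file does not exist
            long size = Files.size(Paths.get("big-file.txt"));

            System.out.println(len + " / " + size + " bytes");
        }
    }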
Related
I am trying to get the byte count for a specific file in an HDFS directory.
I tried to use fs.getFileStatus(), but I do not see any method for getting the byte count of the file; I can only see the getBlockSize() method.
Is there any way I can get the byte count of a specific file in HDFS?
fs.getFileStatus() returns a FileStatus object, which has a getLen() method that returns the "length of this file, in bytes". Maybe you should have a closer look at this: https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/fs/FileStatus.html.
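A minimal sketch (hypothetical path, standard Hadoop FileSystem API):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsFileSize {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // getLen() is the length of the file in bytes, not the block size
            FileStatus status = fs.getFileStatus(new Path("/data/myfile.txt"));
            System.out.println(status.getLen() + " bytes");

            fs.close();
        }
    }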
BUT be aware that the file size is not that important on HDFS. Files are organized in so-called data blocks; each data block is 64 MB by default. So if you deal with many small files (which is one big anti-pattern on HDFS) you may have less capacity than you expect. See this link for more details:
https://hadoop.apache.org/docs/r2.6.1/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html#Data_Blocks
We need to use the getLen() method of the FileStatus object (returned by fs.getFileStatus()) to get the file's byte count.
I am trying to download a file from a server in a user-specified number of parts (n). So a file of x bytes is divided into n parts, with each part downloading a piece of the whole file at the same time. I am using threads to implement this, but I have not worked with HTTP before and do not really understand how downloading a file actually works. I have read up on it and it seems the "Range" header needs to be used, but I do not know how to download the different parts and combine them without corrupting the data.
(Since it's a homework assignment I will only give you a hint)
Appending to a single file will not help you at all, since this will mess up the data. You have two alternatives:
Download from each thread to a separate temporary file and then merge the temporary files in the right order to create the final file. This is probably easier to conceive, but it is a rather ugly and inefficient approach.
Do not stick to the usual stream-style semantics - use random access (1, 2) to write data from each thread straight to the right location within the output file.
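Not a complete solution (it is your homework, after all), but a rough sketch of the second approach, assuming the server honours Range requests and that the start/end offsets have already been computed for each part:

    import java.io.InputStream;
    import java.io.RandomAccessFile;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // Downloads bytes [start, end] of the URL and writes them at the same
    // offset in the shared output file. Each thread runs one of these.
    public class PartDownloader implements Runnable {
        private final String url;
        private final String outFile;
        private final long start, end;

        public PartDownloader(String url, String outFile, long start, long end) {
            this.url = url;
            this.outFile = outFile;
            this.start = start;
            this.end = end;
        }

        @Override
        public void run() {
            try {
                HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
                conn.setRequestProperty("Range", "bytes=" + start + "-" + end);

                try (InputStream in = conn.getInputStream();
                     RandomAccessFile out = new RandomAccessFile(outFile, "rw")) {
                    out.seek(start);                  // jump to this part's slot
                    byte[] buf = new byte[8192];
                    int n;
                    while ((n = in.read(buf)) != -1) {
                        out.write(buf, 0, n);
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

Each thread opens its own RandomAccessFile on the same path, so the file pointers don't interfere. Also check that the response code is 206 (Partial Content); a 200 means the server ignored the Range header and every thread would be writing the whole file.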
I'm trying to do the following: I have a database filled with file names located under a directory. This directory changes constantly (downloaded files are being added and removed). My application is supposed to scan this directory for the first time and add the files to the database. The second time the application runs, it needs to check whether the filenames in the database are still available in the directory.
For the check I use the following pseudo code:
    for each filename in the database:
        File f = new File(filename);
        if (f.exists()) {
            mark it as existing;
        } else {
            mark it as removed;
        }

Entries marked as removed will be cleaned up from the database later.
The question is: how can I check whether all the files in the database still exist without producing much garbage? There can be more than 1000 files, and running the loop with "new File(...)" more than 1000 times will cause too much garbage.
Any help is appreciated.
The File object is really tiny. It only holds the path string and a reference to the FileSystem object. It may look like a waste of resources, but it's not.
Think of a File object as a path String with a few helper methods for dealing with file paths.
It has nothing to do with file descriptor or other heavy resources.
Never optimize before profiling. You will end up with non-optimal, difficult-to-maintain code.
"There can be more than 1000 files, and running the loop with "new File(...)" more than 1000 times will cause too much garbage."
Really? Have you tested this? I can't see this being a significant concern on modern systems. (What are you most worried about? The JVM garbage collection?)
Otherwise, get the current directory, then call .list() or .listFiles(), load the result into a Set for performance (a HashSet would probably do nicely), then just query against the Set. (You'll still be creating Strings and entries within the Set, which could be a similar GC concern.) The potential problem here is that you're now loading a potentially "large" number of elements into memory within the JVM, rather than checking on demand as you read each row out of the database.
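A minimal sketch of that approach (the directory path and the database-reading helper are assumptions, and it assumes the database stores bare file names rather than full paths):

    import java.io.File;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class DirectoryCheck {
        public static void main(String[] args) {
            // One directory listing up front...
            String[] names = new File("/path/to/downloads").list();
            Set<String> onDisk = new HashSet<>(Arrays.asList(names));

            // ...then each database row is just a hash lookup.
            for (String filename : readFilenamesFromDatabase()) {   // assumed helper
                if (onDisk.contains(filename)) {
                    // mark as existing
                } else {
                    // mark as removed
                }
            }
        }

        private static Iterable<String> readFilenamesFromDatabase() {
            return Arrays.asList("a.txt", "b.txt");   // placeholder for the real query
        }
    }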
I'd stick with the code that you have outlined. +1 for Michal's answer - please review for additional details as to why doing this should be of no concern.
Do it the other way around: you add a set of rows to a database table, then scan the directory the files are in, get the list of filenames, and compare that list against a 'select names from filesTable' type of query.
So, here is the situation:
I have to read big .gz archives (GBs) and kind of "index" them so that I can later retrieve specific pieces using random access.
In other words, I wish to read the archive line by line, and be able to get the specific location in the file for any such line. (so that I can jump directly to these specific locations upon request). (PS: ...and it's UTF-8 so we cannot assume 1 byte == 1 char.)
So, basically, what I just need is a BufferedReader which keeps track of its location in the file. However, this doesn't seem to exist.
Is there anything available or do I have to roll my own?
A few additional comments:
I cannot use BufferedReader directly, since the underlying stream's position corresponds to what has been buffered so far: a multiple of the internal buffer size rather than the current line's location.
I cannot use InputStreamReader directly for performance reasons. Unbuffered would be way too slow, and, btw, it lacks convenience methods to read lines.
I cannot use RandomAccessFile since 1. it's zipped, and 2. RandomAccessFile uses "modified" UTF-8.
I guess the best would be to use a kind of buffered reader that keeps track of the file location and buffer offset... but this sounds quite cumbersome. But maybe I missed something. Perhaps something already exists to do that: read files line by line and keep track of the location (even if zipped).
Thanks for the tips,
Arnaud
I think jzran could be pretty much what you're looking for:
It's a Java library based on the zran.c sample from zlib. You can preprocess a large gzip archive, producing an "index" that can be used for random read access. You can balance between index size and access speed.
What you are looking for is called mark(), markSupported() and skip().
These methods are declared in both InputStream and Reader, so you are welcome to use them.
GZIP compression does not support seeking. Previous data blocks are needed to build compression tables...
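If you do end up rolling your own, here is a minimal sketch of the byte-counting idea from the question (hypothetical file name). The offsets it records are positions in the uncompressed stream, so to jump back later you would either re-read up to that offset or build a compressed-side index with something like jzran. Splitting on '\n' at the byte level before decoding is safe for UTF-8, since 0x0A never appears inside a multi-byte sequence:

    import java.io.BufferedInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPInputStream;

    public class LineOffsets {
        public static void main(String[] args) throws IOException {
            try (InputStream in = new BufferedInputStream(
                    new GZIPInputStream(new FileInputStream("big.gz")))) {
                ByteArrayOutputStream lineBytes = new ByteArrayOutputStream();
                long offset = 0;      // position in the uncompressed stream
                long lineStart = 0;   // uncompressed offset where the current line begins
                int b;
                while ((b = in.read()) != -1) {
                    offset++;
                    if (b == '\n') {
                        String line = new String(lineBytes.toByteArray(), StandardCharsets.UTF_8);
                        System.out.println(lineStart + "\t" + line);
                        lineBytes.reset();
                        lineStart = offset;
                    } else {
                        lineBytes.write(b);
                    }
                }
            }
        }
    }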
I have some 1000 key-value pairs that I will use in my J2ME application, read from a resource file. However, I will be using only a few of those values at any time, say 10, based on the record number generated inside the application logic. Loading all the values into memory and then looking them up is not an efficient option, as I will not be using all the records. Is there a better scheme to store the values in the file, some indexing or something, so that I can retrieve a key-value pair by skipping the right number of bytes in the file to reach and read the appropriate record? As this is a resource file in the jar, there won't be any modifications to it.
If you know the record lengths when they are created, you could write the records out in binary format to a file. At the start of each record, you could first write a number indicating its size in bytes, and then use a RandomAccessFile to access the records by moving the file pointer.
In terms of speed, loading into memory will be faster than reading from a file, but if memory is at a premium, a file wouldn't be a bad way to go.
Jeff
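RandomAccessFile won't work on a jar resource (or in CLDC at all), but the same idea can be sketched against the resource stream, assuming a fixed record size so that record n starts at n * RECORD_SIZE. The file name, record size and record layout below are all assumptions:

    import java.io.DataInputStream;
    import java.io.IOException;

    public class RecordLookup {
        private static final int RECORD_SIZE = 64;   // assumed fixed record size

        // Reads record n: skip to its offset, then decode key and value, which
        // are assumed to have been written with writeUTF() and padded out to
        // RECORD_SIZE bytes when the resource was generated.
        public static String[] read(int recordNumber) throws IOException {
            DataInputStream in = new DataInputStream(
                    RecordLookup.class.getResourceAsStream("/records.bin"));
            try {
                in.skipBytes(recordNumber * RECORD_SIZE);
                String key = in.readUTF();
                String value = in.readUTF();
                return new String[] { key, value };
            } finally {
                in.close();
            }
        }
    }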
Skipping bytes in a compressed resource file inside a jar is not really going to be optimal either, and the implementation of InputStream you get as a result of calling Class.getResourceAsStream() may be fragmented if you plan on running your application on several devices.
EDIT after additional info in comment:
It could be that the best way to do this is actually to store the (question, answer) data in 1000 different classes.
It's going to feel very weird as a solution, but the class loader should only load the 10 classes you actually use. You can generate the 1000 source files with a simple J2SE program, and you can load 10 random classes based on an integer in their names using java.lang.Class.forName().
If the jar file doesn't become too big to use, you're basically relying on the indexing of its zip file format for class-loader performance...
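A tiny sketch of that loading step; the Record interface and the naming scheme for the generated classes (Record0 ... Record999) are assumptions:

    interface Record {
        String getKey();
        String getValue();
    }

    public class RecordLoader {
        // Only the class for the requested record number gets loaded.
        public static Record load(int recordNumber) throws Exception {
            return (Record) Class.forName("Record" + recordNumber).newInstance();
        }
    }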