I want to retrieve a list of files (say about 100) from a directory.
I use the retrieveFileStream method on the Java FTPClient object to fetch the files individually.
While retrieving the files, I get a socket exception several times in between, and I have retry logic to overcome that.
The problem is that each socket exception causes a delay of 10 seconds, which eventually hurts my code's performance.
I want to change the code so that all the files are retrieved in a single call. I tried the listFiles method on the FTPClient object to get all the files in the directory, but my directory holds a huge number of files (say about 10,000), which again hurts performance.
Is there a method that returns the list of files when given the required file names as an input parameter? Please help me with this.
There's no better solution than the one you already have.
Of course, except for splitting the job across multiple threads.
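If you go the multithreading route, a minimal sketch might look like the following, assuming Apache Commons Net's FTPClient. The host, credentials, and remote directory are placeholders, and your existing retry logic would slot into the catch block.

import org.apache.commons.net.ftp.FTP;
import org.apache.commons.net.ftp.FTPClient;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelFtpFetch {
    public static void fetchAll(List<String> fileNames, int threads)
            throws InterruptedException {
        // each worker pulls file names from a shared queue
        Queue<String> work = new ConcurrentLinkedQueue<>(fileNames);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                // one FTP connection per thread
                FTPClient ftp = new FTPClient();
                try {
                    ftp.connect("ftp.example.com"); // placeholder host
                    ftp.login("user", "password");  // placeholder credentials
                    ftp.enterLocalPassiveMode();
                    ftp.setFileType(FTP.BINARY_FILE_TYPE);
                    String name;
                    while ((name = work.poll()) != null) {
                        try (OutputStream out = new FileOutputStream(name)) {
                            ftp.retrieveFile("/remote/dir/" + name, out); // placeholder path
                        }
                    }
                    ftp.logout();
                } catch (Exception e) {
                    e.printStackTrace(); // your existing retry logic would go here
                } finally {
                    try {
                        ftp.disconnect();
                    } catch (Exception ignored) {
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}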
Related
I want to save several files using a Stream to serialize objects, with one serializable object per file.
The reason I want one file per object is that I have a list of these objects. I am on Android, and at application start I want to load all saved objects (this is the easy part).
But during execution I want to add new elements to this list and delete elements, and I want to update the folder of files at the same time.
So I suppose I have to create one file per object. But maybe there is another solution?
The main problem is: how do I name each file so that I avoid overwriting an existing one?
I first thought I would name my files after a String given by the user, but then I would have to check whether the file name is valid, and so on.
So maybe the solution is to just name my files with an integer and, on the first load, continue the counter from the highest existing file?
I suppose my problem is a common one, so what are your solutions for persisting a dynamic list of objects during execution?
Thank you.
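For what it's worth, one way the integer-counter idea from the question might look; this is just a sketch, and it assumes the object files are named with plain numbers:

import java.io.File;

public class ObjectStore {
    private final File dir;
    private long counter;

    public ObjectStore(File dir) {
        this.dir = dir;
        // resume the counter from the highest existing numeric file name
        String[] names = dir.list();
        if (names != null) {
            for (String n : names) {
                try {
                    counter = Math.max(counter, Long.parseLong(n));
                } catch (NumberFormatException ignored) {
                    // skip files whose names are not pure numbers
                }
            }
        }
    }

    // returns a fresh file name that cannot collide with existing ones
    public File nextFile() {
        return new File(dir, String.valueOf(++counter));
    }
}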
I would like to write a Hadoop application that takes as input a file and a folder containing several files. The single file contains keys whose records need to be selected and extracted from the other files in the folder. How can I achieve this?
By the way, I already have a running Hadoop MapReduce application that takes a folder path as input, does its processing, and writes the result to a different folder.
I am stuck on how to use a file to get the keys that need to be selected and extracted from the other files in a specific directory. The file containing the keys is too big to fit into main memory directly. How can I do it?
Thanks!
If the number of keys is too large to fit in memory, then consider loading the key set into a Bloom filter (sized to yield a low false-positive rate) and then processing the files, checking each key for membership in the Bloom filter (Hadoop comes with a BloomFilter class; check the Javadocs).
You'll also need a second MR job to do a final validation (most probably a reduce-side join) to eliminate the false positives output by the first job.
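A rough sketch of that first-pass mapper, assuming the Bloom filter was built beforehand and shipped to each task (for example via the distributed cache); the local file name "keys.bloom" and the tab-separated record layout are assumptions:

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;

public class BloomFilterMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    private final BloomFilter filter = new BloomFilter();

    @Override
    protected void setup(Context context) throws IOException {
        // "keys.bloom" is assumed to be a local file distributed to each task
        try (DataInputStream in =
                new DataInputStream(new FileInputStream("keys.bloom"))) {
            filter.readFields(in);
        }
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // assume the key is the first tab-separated field of each record
        String key = line.toString().split("\t", 2)[0];
        if (filter.membershipTest(new Key(key.getBytes("UTF-8")))) {
            // candidate match; a second job must weed out false positives
            context.write(new Text(key), line);
        }
    }
}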
I would read the single file before you run your job and store all the needed keys in the job configuration. You can then write a job that reads the files from the folder. In your mapper/reducer setup(context) method, read the keys out of the configuration and store them globally, so that you can access them during map or reduce.
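A sketch of that approach; note it only works if the key set is small enough to stuff into the job configuration, which conflicts with the "big file" constraint above, so the Bloom-filter answer scales better. The property name and record layout are assumptions:

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ConfiguredKeysMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    private Set<String> wantedKeys;

    @Override
    protected void setup(Context context) {
        // in the driver you would do something like:
        //   conf.set("wanted.keys", commaSeparatedKeys);
        Configuration conf = context.getConfiguration();
        wantedKeys = new HashSet<>(
                Arrays.asList(conf.get("wanted.keys", "").split(",")));
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String key = line.toString().split("\t", 2)[0]; // assumed layout
        if (wantedKeys.contains(key)) {
            context.write(new Text(key), line);
        }
    }
}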
I am trying to download a file from a server in a user-specified number of parts (n). So there is a file of x bytes divided into n parts, with each part downloading a piece of the whole file at the same time. I am using threads to implement this, but I have not worked with HTTP before and do not really understand how downloading a file actually works. I have read up on it and it seems the "Range" header is what I need, but I do not know how to download the different parts and append them without corrupting the data.
(Since it's a homework assignment, I will only give you a hint.)
Appending to a single file will not help you at all, since this will mess up the data. You have two alternatives:
Download from each thread to a separate temporary file and then merge the temporary files in the right order to create the final file. This is probably easier to conceive of, but a rather ugly and inefficient approach.
Do not stick to the usual stream-style semantics - use random access (e.g. java.io.RandomAccessFile) to write the data from each thread straight to the right location within the output file.
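To make the hint concrete, the core of the second alternative is an HTTP Range request whose body is written at the matching offset. This is only a sketch; the URL and byte range are placeholders, and each thread would run this for its own slice:

import java.io.InputStream;
import java.io.RandomAccessFile;
import java.net.HttpURLConnection;
import java.net.URL;

public class RangeDownloader {
    static void downloadRange(String url, String outFile,
                              long start, long end) throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(url).openConnection();
        // ask the server for just this slice of the file
        conn.setRequestProperty("Range", "bytes=" + start + "-" + end);
        try (InputStream in = conn.getInputStream();
             RandomAccessFile out = new RandomAccessFile(outFile, "rw")) {
            out.seek(start); // write straight to the right offset
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }
}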
I'm trying to do the following: I have a database filled with the names of files located under a directory. This directory changes constantly (downloaded files are added and removed). My application is supposed to scan this directory the first time and add the files to the database. The second time the application runs, it needs to check whether the filenames in the database are still present in the directory.
For the check I use the following pseudocode:

get the filename from the database
File f = new File(filename);
if (f.exists()) {
    mark it as existing;
} else {
    mark it as deleted; // the database will be cleaned up later
}
The question is: how can I check all the files in the database for existence without producing much garbage? There can be more than 1000 files, and running the loop with new File(...) more than 1000 times will, I expect, create too much garbage.
Any help is appreciated.
The File object is really tiny. It holds only the path string and a reference to the FileSystem object. It may look like a waste of resources, but it isn't.
Think of a File object as a path String with a few helper methods for dealing with file paths.
It has nothing to do with file descriptors or other heavy resources.
Never optimize before profiling; you will end up with non-optimal, difficult-to-maintain code.
There can be more than 1000 files, and running the loop with new File(...) more than 1000 times will, I expect, create too much garbage.
Really? Have you tested this? I can't see it being a significant concern on modern systems. (What are you most worried about, JVM garbage collection?)
Otherwise, get the current directory, then call .list() or .listFiles(), load the result into a Set for performance (a HashSet would probably do nicely), then query against the Set. (You'll still be creating Strings and Set entries, which could raise a similar GC concern.) The potential problem here is that you're now loading a potentially "large" number of elements into memory within the JVM, rather than checking on demand as you read each row out of the database.
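A sketch of that Set-based check; the directory path and the database-supplied name are placeholders:

import java.io.File;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ExistenceCheck {
    public static void main(String[] args) {
        // list the directory once and load the names into a HashSet
        String[] names = new File("/path/to/dir").list(); // placeholder path
        Set<String> onDisk = new HashSet<>(
                Arrays.asList(names != null ? names : new String[0]));

        // then, for each filename pulled from the database:
        String fromDb = "example.txt"; // placeholder row value
        boolean existing = onDisk.contains(fromDb); // O(1) lookup per row
        System.out.println(fromDb + (existing ? " exists" : " removed"));
    }
}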
I'd stick with the code you have outlined. +1 for Michal's answer - please review it for additional details on why doing this should be of no concern.
Do it the other way around: add a set of rows to a database table, then scan the directory the files are in, get the list of filenames, and compare that list to the result of a SELECT name FROM filesTable type of query.
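A sketch of that reversed comparison; the table and column names and the JDBC connection are assumptions:

import java.io.File;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class DirVsDbDiff {
    public static void diff(Connection conn, File dir) throws SQLException {
        // one directory listing, loaded into a set
        String[] names = dir.list();
        Set<String> onDisk = new HashSet<>(
                Arrays.asList(names != null ? names : new String[0]));

        // the names currently recorded in the database
        Set<String> inDb = new HashSet<>();
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT name FROM filesTable")) {
            while (rs.next()) {
                inDb.add(rs.getString(1));
            }
        }

        Set<String> removed = new HashSet<>(inDb);
        removed.removeAll(onDisk); // in the DB but gone from disk
        Set<String> added = new HashSet<>(onDisk);
        added.removeAll(inDb);     // on disk but not yet in the DB
        System.out.println("removed=" + removed + " added=" + added);
    }
}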
How can I be sure whether a file was processed before? There is a remote storage location that serves as a file source for my application. My program fetches files from this location and processes them on a schedule. How can I make sure that the next run fetches only unprocessed files? I'm thinking about using file attributes; the archive flag and the modified date could be a solution. But I have learned that two bits of the file attributes are unused. How can I use these fields in Java? By the way, I don't want to use a database.
A common strategy is to use some form of hash function to create a checksum. Record the checksum of each processed file, and compare the checksum of the file in question against that list. If its checksum is in the list, you have already processed it.
Protect your list of processed-file checksums. If you lose it, or it becomes corrupted, it might be a long, bad day.
To prevent unnecessary network traffic, you might consider preparing 'check' files on the remote repository, each containing the checksum of a potential input file.
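A minimal sketch of the checksum bookkeeping, using SHA-256 via MessageDigest; how the processed set is persisted between runs is left as a stub:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.Set;

public class ProcessedTracker {
    // load this from / save this to durable storage elsewhere
    private final Set<String> processed = new HashSet<>();

    public boolean alreadyProcessed(Path file) throws Exception {
        return processed.contains(checksum(file));
    }

    public void markProcessed(Path file) throws Exception {
        processed.add(checksum(file));
    }

    // stream the file through the digest so large files never sit in memory
    private static String checksum(Path file) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}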
EDIT:
Following up on the comments: it is also possible to interact with file-system attributes directly. The proposed Java 1.7 spec introduces file-system-specific attribute views for this; the view you would be interested in is DosFileAttributeView.
Basic use might look something like this ('input' is a java.nio.file.Path; add exception handling as necessary):
// import as necessary from java.nio.file and java.nio.file.attribute
DosFileAttributeView view =
        Files.getFileAttributeView(input, DosFileAttributeView.class);
// check whether the file system supports this view
if (view != null)
{
    DosFileAttributes attributes = view.readAttributes();
    // skip any file already marked as an archive
    if (!attributes.isArchive())
    {
        myObject.process(input);
        view.setArchive(true); // the archive flag is set through the view
    }
}
Can you rename the file (e.g. to "filename.archive"), or move it into an "archive" subdirectory?
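A sketch of that rename idea with java.nio.file; the ".archive" suffix is just a convention:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ArchiveByRename {
    // rename in place once processing succeeds: input.dat -> input.dat.archive
    public static void markProcessed(Path file) throws IOException {
        Files.move(file, file.resolveSibling(file.getFileName() + ".archive"));
    }
}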