Check if files exists without too much garbage

Check if files exists without too much garbage - java

I'm trying to do the following: I've a database filled with file names located under a directory. This directory is changing constantly (downloaded files are being added and removed). My application is supposed to scan this directory for the first time and add the files into the database. The second time the application will run, it needs to check if the filenames in the database are still available in the directory.
For the check I use the following pseudo code:
get the filename from the database
check if exists (file f = new File(filename))
if (f.exists()){
mark as existing;
} else {
mark is as deleted
}
if it does, then mark it as existing, else mark it as removed (later will clean the database up)
The question is: How can I check all the files on the database if they exists without producing much garbage? Files can be more than 1000. Running the loop with "new File(...)" more than 1000 times will cause too much garbage.
Any help is appreciated.

The File() object is really tiny. It has only path string in it and reference to the FileSystem object. It just look like a wasting resources, but it's not.
Think about File object as a path String with few helper methods to deal with file paths.
It has nothing to do with file descriptor or other heavy resources.
Never do optimization before profiling. You will end up with non optimal difficult to maintain code.

Files can be more than 1000. Running the loop with "new File(...)"
more than 1000 times will cause too much garbage.
Really? Have you tested this? I can't see this being a significant concern under modern systems. (What are you most worried about? The JVM garbage collection?)
Otherwise, get the current directory, then call .list() or .listFiles(), load into a Set for performance (a HashSet would probably do nicely), then just query against the Set. (You'll still be creating Strings and entries within the Set that could be a similar GC concern.) The potential problem here is that you're now loading a potentially "large" number of elements into memory within the JVM - rather than checking on-demand as you read each row out of the database.
I'd stick with the code that you have outlined. +1 for Michal's answer - please review for additional details as to why doing this should be of no concern.

Do it the other way--you add a set of rows to a database table. You then scan the directory the files are in and just get a list of filenames and compare that list to a 'select names from filesTable' type of query.

Related

MarkLogic: Move document from one directory to another on some condition

I'm new to MarkLogic and trying to implement following scenario with its Java API:
For each user I'll have two directories, something like:
1.1. user1/xmls/recent/
1.2. user1/xmls/archived/
When user is doing something with his xml - it's put to the "recent" directory;
When user is doing something with his next xml and "recent" directory is full (e.g. has some amount of documents, let's say 20) - the oldest document is moved to the "archived" directory;
User can request all documents from the "recent" directory and should get no more than 20 records;
User can remove something from the "recent" directory manually; In this case, if it had 20 documents, after deleting one it must have 19;
User can do something with his xmls simultaneously and "recent" directory should never become bigger than 20 entries.
Questions are:
In order to properly handle simultaneous adding of xmls to the "recent" directory, should I block whole "recent" directory when adding new entry (to actually add it, check if there are more than 20 records after adding, select the oldest 21st one and move it to the "archived" directory and do all these steps atomically)? How can I do it?
Any suggestions on how to implement this via Java API?
Is it possible to change document's URI (e.g. replace "recent" with "archived" in my case)?
Should I consider using MarkLogic's collections here?
I'm open to any suggestions and comments (as I said I'm new to MarkLogic and maybe my thoughts on how to handle described scenario are completely wrong).

You can achieve atomicity of a sequence of transactions using Multi-Statement Transactions (MST)
It is possible to MST from the Java API: http://docs.marklogic.com/guide/java/transactions#id_79848
It's not possible to change a URI. However, it is possible to use an MST to delete the old document and reinsert a new one using the new URI in one an atomic step. This would have the same effect.
Possibly, and judging from your use case, unless you must have the recent/archived information as part of the URI, it may be simpler to store this information in collections. However, you should read the documentation and evaluate for yourself: http://docs.marklogic.com/guide/search-dev/collections#chapter

Personally I would skip all the hassle with separate directories as well as collections. You would endlessly have to move files around, or changes their properties. It would be much easier to not calculate anything up front, and simply use lastModified property, or something alike, to determine most recent items at run-time.
HTH!

is it possible for delete file without new file instance in Java?

I have a simple function used for file delete,
it will check the file size,
if small than a specific value, delete the file
however, this function will be called thousand times
and every time it will new file instance,
i think it will be expensive on file object creation issue,
is there any other way to fix this issue?
public void checkFile(String filePath) {
File file = new File(filePath); //this is expensive
if (file.length() < 500) {
file.delete();
}
}

The effect on the performance of the new File() compared to checking the file size on the disk is miniscule. Don't worry about it.
If you really really think that it will make a difference, measure it and then optimise it.

IMHO "thinking" isn't good enough; have you really identified that File object creation is a bottle neck in your application? Anyways, I don't think you can delete a file without creating a File object, unless you are planning on writing your own "native" method which unlinks the file by just taking in the file path as a string.

Why would the code be expensive? Creating temporary objects in Java is not expensive anymore, due to generational GC. And a File is just an object encapsulating a path to the file system. It's not expensive to create one.

Standard java API does not allow this. And thousands of times is almost nothing for modern computer. Creation of java.io.File instance takes less time than deletion, so do not worry. If you see any problems with this code you can create cache as a Map<String, File> and get the file instance from there.
But again, do not do this unless you see that this is your problem. No pre-mature optimization!

There is no way to delete a file in pure Java that doesn't entail creating a File object. The impure alternatives are:
using JNI or JNA to call native code that will call unlink or the Window equivalent,
running the rm or del command as an external process.
The first is at best only marginally faster than new File().delete(). The second is significantly slower.
I'd say that 90+% of the cost of new File().delete() is in the system call and the operating system's file system layers.

File size caching & efficient retrieval of file sizes in Java

I need to determine the size of a very large character-encoded file. A read of the file takes a significant amount of time.
My understanding is that when a file is first created/modified the size is cached, so the OS can quickly retrieve the value when the size is requested, say, by a file manager. (eg. it seems quick when opening the properties dialog of a large file in win explorer)
Assuming the above is true, can this be retrieved in Java? I had thought that length() read the file to determine the size...or does it in fact get this cached size? Or does the creation of a File object do this read/retrieved the cached size?
My own research hasn't been able to answer these questions as yet.
I'd appreciate some help with my understanding
Thanks

File systems generally store the length as a part of the file description. This way the OS knows where the end of the file is. This information is cached when accessed. And repeated calls for this information will also be cache.
Note: the OS often reads more data from disk than you ask for. This is because access to disk are expensive and memory is relatively cheap. e.g. when you get the length of one file it may read in the detail of many files on the assumption you might want information about those files too. i.e. the first time you get a file's information it is likely to already be cached.

getLength() delegates to the underlying native operating system function to get the length of the file. You should be fine using this.

The length() method doesn't read the file. It calls a native method which delegates to the OS to get the file length. Its response time should not depend on the actual file length.

I think you're over thinking this. Length should query the file system and figure this out very quickly. It's certainly not reading the entire file, and counting bytes.

read/write to a large size file in java

i have a binary file with following format :
[N bytes identifier & record length] [n1 bytes data]
[N bytes identifier & record length] [n2 bytes data]
[N bytes identifier & record length] [n3 bytes data]
as you see i have records with different lengths. in each record i have N bytes fixed which contains and id and the length of data in record.
this file is very big and can contains 3 millions records.
I want to open this file by an application and let user to browse and edit the records.
( Insert / Update / Delete records)
my initial plan is to create and index file from original file and for each record, keep next and previous record address to navigate forward and backward easily. (some sort of linked list but in file not in memory)
is there library (java library) to help me to implement this requirement ?
any recommendation or experience that you think is useful?
----------------- EDIT ----------------------------------------------
Thanks for guides and suggestions,
some more info:
the original file and its format is out of my control (it's a third party file) and i can't change the file format. but i have to read it, let user to navigate over records and edit some of them (insert new record/ update an existing record/ delete a record) and at the end save it back to original file format.
do u still recommend DataBase instead of a normal index file ?
----------------- SECOND EDIT ----------------------------------------------
record size in update mode is fixed. it means updated (edited) record has same length as original record's, unless user delete the record and create another record with different format.
Many Thanks

Seriously, you should NOT be using a binary file for this. You should use a database.
The problems with trying to implement this as a regular file stem from the fact that operating systems do not allow you to insert extra bytes into the middle of an existing file. So if you need to insert a record (anywhere but the end), update a record (with a different size) or remove a record, you would need to:
rewrite other records (after the insertion/update/deletion point) to make or reclaim space, or
implement some kind of free space management within the file.
All of this is complicated and / or expensive.
Fortunately, there is a class of software that implements this kind of thing. It is called database software. There are a wide range of options, ranging from using a full-scale RDBMS to light-weight solutions like BerkeleyDB files.
In response to your 1st and 2nd edits, a database will still be simpler.
However, here's an alternative that might perform better for this use-case than using a DB... without doing complicated free-space management.
Read the file and build an in-memory index that maps ids to file locations.
Create a second file to hold new and updated records.
Perform the record adds/updates/deletes:
An addition is handled by writing the new record to the end of the second file, and adding an index entry for it.
An update is handled by writing the updated record to the end of the second file, and changing the existing index entry to point to it.
A delete is handled by deleting the index entry for the record's key.
Compact the file as follows:
Create a new file.
Read each record in the old file in order, and check the index for the record's key. If the entry still points to the location of the record, copy the record to the new file. Otherwise skip it.
Repeat the step 4.2 for the second file.
If we completed all of the above successfully, delete the old file and second file.
Note this relies on being able to keep the index in memory. If that is not feasible, then the implementation is going to be more complicated ... and more like a database.

Having a data file and an index file would be the general base idea for such an implementation, but you'd pretty much find yourself dealing with data fragmentation upon repeated data updates/deletion, etc. This kind of project, in itself, should be a separate project and should not be part of your main application. However, essentially, a database is what you need as it is specifically designed for such operations and use cases and will also allow you to search, sort, and extend (alter) your data structure without having to refactor an in-house (custom) solution.
May I suggest you to download Apache Derby and create a local embedded database (derby does it for you want you create a new embedded connection at run-time). It will not only be faster than anything you'll write yourself, but will make your application easier to maintain.
Apache Derby is a single jar file that you can simply include and distribute with your project (check the license if any legal issue may apply in your app). There is no need for a database server or third party software; it's all pure Java.
Bottom line as that it all depends on how large is your application, if you need to share the data across many clients, if speed is a critical aspect of your app, etc.
For a stand-alone, single user project, I recommend Apache Derby. For a n-tier application, you might want to look into MySQL, PostgreSQL or (hrm) even Oracle. Using already made and tested solutions is not only smart, but will cut down your development time (and maintenance efforts).
Cheers.

Generally you are better off letting a library or database do the work for you.
You may not want to have an SQL database and there are plenty of simple databases which don't use SQL. http://nosql-database.org/ lists 122 of them.
At a minimum, if you are going to write this I suggest you read the source for one of these databases to see how they work.
Depending on the size of the records, 3 million isn't that much and I would suggest you keep as much in memory as possible.
The problem you are likely to have is ensuring the data is consistent and recovering the data when a corruption occurs. The second problem is dealing with fragmentation efficiently (some thing the brightest minds working on the GC deal with) The third problem is likely to be maintain the index in a transaction fashion with the source data to ensure there are no inconsistencies.
While this may appear simple at first, there are significant complexities in making sure there data is reliable, maintainable and can be accessed efficiently. This is why most developers use an existing database/datastore library and concentrate on the features which are unqiue to their application.

(Note: My answer is about the problem in general, not considering any Java libraries or - like the other answers also proposed - using a database (library), which might be better than reinventing the wheel)
The idea to create an index is good and will be very helpful performance-wise (although you wrote "index file", I think it should be kept in memory). Generating the index should be quite fast if you read the ID and record length for each entry and then just skip the data with a file seek.
You should also think about the edit functionality. Especially inserting and deleting can be very slow on such a big file if you do it wrong (f.e. deleting and then moving all the following entries to close the gap).
The best option would be to only mark deleted entries as deleted. When inserting, you can overwrite one of those or append to the end of the file.

Insert / Update / Delete records
Inserting (rather than merely appending) and deleting records to a file is expensive because you have to move all the following content of the file to create space for the new record or to remove the space it used. Updating is similarly expensive if the update changes the length of the record (you say they are variable length).
The file format you propose is fundamentally unsuitable for the kinds of operations you want to perform. Others have suggested using a data-base. If you don't want to go that far, adding an index file (as you suggest) is the way to go. I recommend making the index records all the same length.

As others have stated a database would seem a better solution. The following are Java SQL DB's that could be used: H2, Derby or HSQLDB
If you want to use an index file look at Berkley DB or No Sql
If there is some reason for using a file, look at JRecord . It has
Several Classes for reading/writing files with variable length binary records (they where written for Cobol VB files). Any of Mainframe / Fujitsu / Open Cobol VB file structures should do the job.
An Editor for editing JRecord files. The latest version of the Editor can handle large files (it uses Compression / spill file). The editor suffers from having to download the whole file and only one user can edit the file at one time.
The JRecord solution will only work if
There is a limited number (preferably one) users all located in the one location
Fast infostructure

Need advice in Efficiency: Scanning 2 very large files worth of information

I have a relatively strange question.
I have a file that is 6 gigabytes long. What I need to do, is scan the entire file, line by line, and determine all rows that match an id number of any other row in the file. Essentially, its like analyzing a web log file where there are many session ids that are organized by the time of each click rather than by userID.
I tried to do the simple (dumb) thing, which was to create 2 file readers. One that scans the file line by line getting the userID, and the next to 1. verify that the userID has not been processed already and 2. If it hasn't been processed, read every line that begins with the userID that is contained in the file and store (some value X, related to the rows)
Any advice or tips on how I can make this process work more efficiently?

Import file into SQL database
Use SQL
Performance!
Seriously, that's it. Databases are optimized exactly for this kind of thing. Alternatively, if you have a machine with enough RAM, just put all the data into a HashMap for easy lookup.

Easiest: create a datamodel and import the file in a database and take benefit of JDBC and SQL powers. You can if necessary (when the file format is pretty specific) write a some Java which does import line by line with help of under each BufferedReader#readLine() and PreparedStatement#addBatch().
Hardest: write your Java code so that it doesn't unnecessarily keep large amounts of data in the memory. You're then basically reinventing what the average database already does.

For each row R in the file {
Let N be the number that you need to
extract from R.
Check if there is a file called N. If
not, create it.
Append R to the file called N
}

How much data are you storing about each line, compared with the size of the line? Do you have enough memory to maintain the state for each distinct ID (e.g. number of log lines seen, number of exceptions or whatever)? That's what I'd do if possible.
Otherwise, you'll either need to break the log file into separate chunks (e.g. split it based on the first character of the ID) and then parse each file separately, or perhaps have some way of pretending you have enough memory to maintain the state for each distinct ID: have an in-memory cache which dumps values to disk (or reads them back) only when it has to.

You don't mention whether or not this is a regular, ongoing thing or an occasional check.
Have you considered pre-processing the data? Not practical for dynamic data, but if you can sort it based on the field you're interested in, it makes solving the problem much easier. Extracting only the fields you care about may reduce the data volume to a more manageable size as well.

Alot of the other advice here is good but assumes that you'll be able to load what you need into memory without running out of memory. If you can do that that would be better than the 'worst case' solution I'm mentioning.
If you have large files you may end up needing to sort them first. In the past I've dealt with multiple large files where I needed to match them up based on a key (sometimes matches were in all files, sometimes only in a couple, etc). If this is the case the first thing you need to do is sort your files. Hopefully you're on a box where you can easily do this (for example there are many good Unix scripts for this). After you've sorted each file read each file until you get matching IDs then process.
I'd suggest:
1. Open both files and read the first record
2. See if you have matching IDs and processing accordingly
3. Read the file(s) for the key just processed and do step 2 again until EOF.
For example if you had a key of 1,2,5,8 in FILE1 and 2,3,5,9 in FILE2 you'd:
1. Open and read both files (FILE1 has ID 1, FILE2 had ID2).
2. Process 1.
3. Read FILE1 (FILE1 has ID 2)
4. Process 2.
5. Read FILE1 (ID 5) and FILE2 (ID 3)
6. Process 3.
7. Read FILE 2 (ID 5)
8. Process 5.
9. Read FILE1 (ID 8) and FILE2 (ID 9).
10. Process 8.
11. Read FILE1 (EOF....no more FILE1 processing).
12. Process 9.
13. Read FILE2 (EOF....no more FILE2 processing).
Make sense?

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.