We run a webserver that has to serve files from a rather large directory, so finding the file via a simple wildcard search like "abcd*jklp*" has serious performance issues.
Is there a way (a trick or a library) to speed up file search in Java? If not, is there a simple caching solution, such that each search is done only once unless the application explicitly empties the cache?
When your web application starts (and every x minutes after that), cache every file you intend to serve in a static variable. When users search for a specific file, search your static cache rather than running a search on the actual file server.
Caching everything works great if all you allow is searching on a file name. You could store every file path in a List/Array. If the list/array is already sorted, you can use a binary search for exact or prefix queries. If the query contains wildcards, generate the corresponding regex and match it against the cached names.
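A minimal sketch of the cached file-name search, assuming only `*` needs wildcard treatment. The class name `FileNameCache` and its methods are hypothetical, and the refresh every x minutes is left out:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: holds a snapshot of file names and answers wildcard queries in memory.
class FileNameCache {
    private final List<String> names;

    FileNameCache(List<String> names) {
        this.names = new ArrayList<>(names); // snapshot taken at startup or on refresh
    }

    // Translate a query such as "abcd*jklp*" into a regex. Only '*' is treated as a wildcard;
    // every other character is quoted so that '.' or '(' in a file name stays literal.
    static Pattern toRegex(String wildcard) {
        String[] parts = wildcard.split("\\*", -1);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                regex.append(".*");
            }
            regex.append(Pattern.quote(parts[i]));
        }
        return Pattern.compile(regex.toString());
    }

    // Linear scan over the cached names; no file system access is involved.
    List<String> search(String wildcard) {
        Pattern pattern = toRegex(wildcard);
        List<String> hits = new ArrayList<>();
        for (String name : names) {
            if (pattern.matcher(name).matches()) {
                hits.add(name);
            }
        }
        return hits;
    }
}
```

The scan is still linear in the number of cached names, but it runs entirely in memory, which is usually the part that matters for a large directory.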
For full text searching of file contents, storing it all statically would not be feasible. Do something lazy like back your files with a database or buy a search appliance like GSA http://www.google.com/enterprise/search/gsa.html
The "trick" in searching is to provide as much information as possible in the initial query. If my desired file is called BobAndAlice, a search for B* will in theory be slower than a search for Bo*, because it matches more candidates. For the sake of this discussion, caching works by building a lookup table similar to a HashMap. Each submitted search is checked against your query map; if the query has already been run and the cache has not been emptied since, you go straight to the lookup table, which holds pointers to the results of the earlier query. That gives fast lookups of data you have already retrieved. Where you can run into trouble is if you store copies of the files instead of pointers to them. The same process applies to actually serving the file to the user.
This doesn't seem like a Java problem; it's more of an algorithmic problem. As I understand it, you have a large number of files in a given directory and, given a wildcard pattern string, you need to process the files matching that pattern. This is essentially a string-matching problem: you have a lot of strings and need to find only those that match particular criteria. There are many ways to do this, but I would suggest a suffix tree for this scenario, as it will give almost O(n) performance in file search.
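The query map described above could look roughly like this. It is a sketch only: `QueryCache` is a hypothetical name, and the slow directory search is passed in as a function so that the cache does not care how it is implemented. The cached values are file paths (the "pointers"), not file contents:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical query cache: each distinct query string is searched once until clear() is called.
class QueryCache {
    private final Map<String, List<String>> results = new ConcurrentHashMap<>();
    private final Function<String, List<String>> slowSearch;

    QueryCache(Function<String, List<String>> slowSearch) {
        this.slowSearch = slowSearch; // e.g. the existing wildcard search over the directory
    }

    // Returns the cached paths for this query, running the slow search only on a miss.
    List<String> search(String query) {
        return results.computeIfAbsent(query, slowSearch);
    }

    // Explicitly empties the cache, as the question asks for.
    void clear() {
        results.clear();
    }
}
```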
Related
I have to access a database with 380,000 entries. I don't have write access to the DB, I can just read it. I've made a search function using a map to search for users by firstname. Here is my process:
1 - Load everything from the DB
2 - Store everything into a Map<Character, ArrayList<User>>, using letters of the alphabet as keys to store users according to the first letter of their firstname.
<A> {Alba, jessica, Alliah jane, etc ...}
<B> {Birsmben bani, etc ...}
When someone searches for a user, I take the firstletter of the firstname typed and use map.get(firstletter), then iterate on the ArrayList to find all the users.
The Map takes a huge amount of memory, I guess (380,000 User objects). I had to increase the heap size.
I want to make it faster. One idea is to use the firstname as the key for the Map (there are many people with the same firstname).
I have two solutions in mind:
1 - Still use a map with firstname as key (increasing the heap size again?)
2 - Use files on the disk instead of the Map (Alba.dat will contain all Albas, for example) and open the right file for each search. No need to increase the heap size, but are there any side effects?
Which one is better? (pros and cons)
Update with more info
It's a database of customers who call our customer service line. The person who takes the call has to search using the customer's name (usually firstname and then lastname). Searching the DB directly is too slow. The solution I've implemented is already much faster (1/2 seconds vs 26 seconds using the DB), but I want to improve it.
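For reference, option 1 (a map keyed by firstname) might be sketched as follows. The nested `User` class is a hypothetical stand-in for the real one, and lower-casing the key is an assumption about how agents type names:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Sketch of an in-memory index keyed by the full firstname instead of its first letter.
class FirstNameIndex {
    // Hypothetical minimal stand-in for the real User object.
    static final class User {
        final String firstName;
        final String lastName;

        User(String firstName, String lastName) {
            this.firstName = firstName;
            this.lastName = lastName;
        }
    }

    private final Map<String, List<User>> byFirstName = new HashMap<>();

    void add(User user) {
        byFirstName.computeIfAbsent(key(user.firstName), k -> new ArrayList<>()).add(user);
    }

    // One hash lookup instead of iterating over every user sharing a first letter.
    List<User> find(String firstName) {
        return byFirstName.getOrDefault(key(firstName), Collections.emptyList());
    }

    // Assumption: lookups should ignore case and surrounding whitespace.
    private static String key(String name) {
        return name.trim().toLowerCase(Locale.ROOT);
    }
}
```

The map stores references to the User objects, so compared with the first-letter map the extra memory is mostly the additional keys and lists, not a second copy of the users.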
IMHO, you don't have to cache all the entries in memory, only a part of them. For example:
Just use a ring buffer, or
More complicated, but more sensible: implement an LFU cache that keeps only the N most frequently accessed items. See this question for a hint of how to implement such a cache.
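A deliberately simple LFU sketch, to show the idea rather than a tuned implementation. `LfuCache` is a hypothetical name, eviction is a linear scan over the hit counters (O(N) per eviction), and the class is not thread-safe:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal least-frequently-used cache: keeps at most `capacity` entries.
class LfuCache<K, V> {
    private final int capacity;
    private final Map<K, V> values = new HashMap<>();
    private final Map<K, Integer> hits = new HashMap<>();

    LfuCache(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    // Returns the cached value or null, counting each successful lookup as a hit.
    V get(K key) {
        V value = values.get(key);
        if (value != null) {
            hits.merge(key, 1, Integer::sum);
        }
        return value;
    }

    // Inserts or updates an entry, evicting the least frequently used one when full.
    void put(K key, V value) {
        if (!values.containsKey(key) && values.size() >= capacity) {
            evictLeastFrequentlyUsed();
        }
        values.put(key, value);
        hits.putIfAbsent(key, 0);
    }

    // Linear scan for the entry with the fewest hits; ties are broken arbitrarily.
    private void evictLeastFrequentlyUsed() {
        K victim = null;
        int fewest = Integer.MAX_VALUE;
        for (Map.Entry<K, Integer> entry : hits.entrySet()) {
            if (entry.getValue() < fewest) {
                fewest = entry.getValue();
                victim = entry.getKey();
            }
        }
        values.remove(victim);
        hits.remove(victim);
    }
}
```

On a cache miss the caller would fall back to the database and `put` the result.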
There are several issues with your approach:
It implies that the number of users doesn't change. A good application design would work with any number of users without software changes.
It implies that the current problem is the only one. What happens if the next requirement that needs implementation is "search by caller id" or "search by zip code"?
It is reinventing the wheel: you are starting to write a database, index or information retrieval solution (however you want to name it) from scratch.
The right thing to do is to export the user data into a database engine which provides proper search capabilities. The export/extraction can hopefully be sped up if you have modification timestamps or if you can intercept updates and reapply them to your search index.
What you use for your search does not matter too much; a simple database on a modern system is fast enough. Most also provide indexing capabilities to speed up your search. If you want something that can be embedded in your application, is specialized in search and solves the problems above, I'd recommend using Lucene.
This question is not very language-specific; it's more of a pattern-related question, but I would like to tag it with some popular languages that I can understand.
I don't have much experience with efficiently loading data in combination with searching it (especially in a mobile environment).
The strategy I have used before is to load everything into local memory and search from there (for example using LINQ in C#).
Another strategy is to reload the data every time a new search is executed. This is of course not efficient, and we may also need to do some more complicated things to sync the newly loaded data with the existing data (already loaded into local memory).
The last strategy I can think of is the hardest one to implement: lazily load the data as searches are executed. That is, when a search is executed, the returned result is cached locally. The search looks in local memory first before fetching new results from the service/server, so the result of each search is a combination of the local search and the server search. The purpose is to reduce the amount of data reloaded from the server every time a search is run.
Here is what I can think of to implement this kind of strategy:
When a search is run, look in the local memory first. Finishing this step gives out the local result.
Now, before sending the search request to the server, we need to somehow pass along what is already in the local result so that it can be excluded on the server side. So the search method may take a list of arguments containing all the item IDs found in the first step.
With that search request, the server can exclude the items already found and return only new items to the client.
The last step is to merge the 2 results, local and server, into the final search result before showing it to the user in the UI.
I'm not sure if this is the right approach, but what I don't feel good about is step 2. Because we need to send the list of item IDs found in step 1 to the server, what if we have hundreds or thousands of such IDs? Sending them to the server may not be very efficient. The query to exclude such a large number of items may not be efficient either (even using direct SQL or LINQ). I'm still confused at this point.
Finally, if you have a better idea, ideally one implemented in a production project, please share it with me. I don't need concrete example code, just some ideas or steps to implement.
Too long for a comment....
Concerning step 2, you know you can run into many problems:
Amount of data
Over time, you may accumulate so much data that even the set of their IDs gets bigger than a normal server answer. In the end, you may need to cache not only the server's previous answers on the client, but also the client's state on the server. What you're doing is a sort of synchronization, so look at rsync for inspiration; it's an old but smart Unix tool. git push might also be inspiring.
Basically, by organizing your IDs into a tree, you can easily synchronize the information (about what the client already knows) between the server and the client. The price may be increasing latency as multiple steps may be needed.
Using the knowledge
It's quite possible that excluding the already known objects from the SQL result could be more expensive than not, especially when you can't easily determine if a to-be-excluded object would be a part of the full answer. Still, you can save bandwidth by post-filtering the data.
Being up to date
If your data change or get deleted, you may find your client keeping obsolete data. The client subscribing to relevant changes is one possibility; associating a (logical) timestamp with your IDs is another.
Summary
It can get pretty complicated and you should measure before you even try. You may find out that the problem itself is hard enough and that achieving these savings is even harder and the gain limited. You know the root of all evil, right?
I would approach the problem by treating local and remote as two different data sources:
When a search is triggered, it is run against both data sources (local in-memory and server).
Most likely the local search will return results first, so display them to the user.
When results are returned from the server, append the non-duplicate ones.
Optional: in case server data has changed and some results were removed or changed, update or remove the local results and update the view.
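The merge step could be sketched like this, assuming every item carries a stable ID. `ResultMerger` and `Item` are hypothetical names, and letting the server copy replace a local copy with the same ID is an assumption about which side is fresher:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: combine local results (shown first) with server results, de-duplicated by ID.
class ResultMerger {
    // Hypothetical search result with a stable identifier.
    static final class Item {
        final String id;
        final String title;

        Item(String id, String title) {
            this.id = id;
            this.title = title;
        }
    }

    static List<Item> merge(List<Item> local, List<Item> server) {
        // LinkedHashMap keeps first-insertion order, so local hits stay at the top.
        Map<String, Item> byId = new LinkedHashMap<>();
        for (Item item : local) {
            byId.put(item.id, item);
        }
        for (Item item : server) {
            // A known ID is overwritten in place (server copy assumed fresher); a new ID is appended.
            byId.put(item.id, item);
        }
        return new ArrayList<>(byId.values());
    }
}
```

Handling items deleted on the server would need extra information (for example tombstones or timestamps) that this sketch does not model.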
So I haven't really done any serious multithreading before (with the exception of the typical for-loop textbook example), so I thought I might give it a try. The task that I am trying to accomplish is the following:
Read an identification code from a file called ids.txt
Search for that identification code in a separate file called sequence.txt
Once identification is found, extract the string that follows the id.
Create an object of type DataSequence (which encapsulates the identification code and the extracted sequence) and add it to an ArrayList.
Repeat for 3000+ ids.
I have tried this the "regular" way within a single thread but the process is way too slow. How can I approach this issue in a multi-threaded fashion?
Without seeing profiling data, it's hard to know what to recommend. But as a blind guess, I'd say that repeatedly opening, searching, and closing sequence.txt is taking most of the time. If this guess is accurate, then the biggest improvement (by far) would be to find a way to process sequence.txt only once. The easiest way to do that would be to read the relevant information from the file into memory, building a hash map from id to the string that follows it. The entire file is only 53.3 MB, so this is an eminently reasonable approach. Then as you process ids.txt, you only need to look up the relevant string in the map, which is a very quick operation.
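A rough sketch of the single-pass approach. The question does not show the layout of sequence.txt, so the line format `<id><whitespace><sequence>` is purely an assumption, and `SequenceLookup` is a hypothetical name:

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: read sequence.txt once into a map, then resolve every id with a hash lookup.
class SequenceLookup {
    // Assumed line format: "<id><whitespace><sequence>"; lines without a sequence are skipped.
    static Map<String, String> loadSequences(Reader source) {
        Map<String, String> byId = new HashMap<>();
        new BufferedReader(source).lines().forEach(line -> {
            String[] parts = line.trim().split("\\s+", 2);
            if (parts.length == 2) {
                byId.put(parts[0], parts[1]);
            }
        });
        return byId;
    }

    // Returns the sequences for the given ids in order, skipping ids that were not found.
    static List<String> lookupAll(List<String> ids, Map<String, String> byId) {
        List<String> found = new ArrayList<>();
        for (String id : ids) {
            String sequence = byId.get(id);
            if (sequence != null) {
                found.add(sequence);
            }
        }
        return found;
    }
}
```

In the real program the reader would come from something like `Files.newBufferedReader(Paths.get("sequence.txt"))`, and each hit would be wrapped in a DataSequence before being added to the ArrayList.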
An alternative would be to use the java.nio classes to create a memory-mapped file for sequence.txt.
I'd be hesitant about looking to multithreading to improve what seems to be a disk-bound operation, particularly if the threads will all end up contending for access to the same file (even if it is only read access). This does not strike me as a good problem with which to learn multithreading techniques; the payoff is just not likely to be there.
Multi-threading could be overkill here. Try the following algorithmic approach:
1. Open the file ids.txt in read mode
2. Declare a HashMap for storing key-value pairs
3. Loop till the end of the file
3A. Read a line as a string
3B. Parse the line as id (key) and rest of the line (value) to store in the HashMap object
4. Now search using the HashMap as desired or do whatever you need with this.
Note: 3A and 3B can be put in two different tasks for two different threads in a producer-consumer design.
I have a multiuser system. Each user creates indexable content, but each user can only search their own content.
Which is the better way?
Create a single directory index, index everything in there, and then filter when searching.
Create a directory index for each client and show all results
If there is no need to share the data among users' content, I would go for the second option. Filtering adds overhead and searches might take longer as the corpus will be larger. Not to mention scalability issues, unnecessary GC overhead, etc.
The downside is that you will likely not be able to benefit from field cache as you will have to open/close the index for each user every time. But if you can identify which users are still active and keep their readers open, this can be alleviated.
Sotirios Delimanolis raised a point that 10M directories might be a pain to manage. This is a valid point: having many files/directories in a single directory does not scale on most file systems. But you can always distribute these directories so that they form a nice balanced tree.
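One common way to get such a tree is to derive two levels of bucket directories from a hash of the user ID. This is only a sketch: `IndexDirectoryLayout` is a hypothetical name, the 256 x 256 fan-out is an arbitrary choice, and the user ID is assumed to be safe to use as a directory name:

```java
// Sketch: spread per-user index directories over two levels of 256 buckets each.
class IndexDirectoryLayout {
    // Returns a relative path such as "00/61/a" for the given user ID.
    static String pathFor(String userId) {
        int hash = userId.hashCode();
        int level1 = (hash >>> 8) & 0xFF; // second-lowest byte of the hash
        int level2 = hash & 0xFF;         // lowest byte of the hash
        return String.format("%02x/%02x/%s", level1, level2, userId);
    }
}
```

With 10M users this leaves roughly 150 index directories per leaf bucket, provided the hash spreads the IDs evenly.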
I need to search a big number of files (i.e. 600 files, 0.5 MB each) for a specific string.
I'm using Java, so I'd prefer the answer to be a Java library or in the worst case a library in a different language which I could call from Java.
I need the search to return the exact position of the found string in a file (so it seems Lucene for example is out of the question).
I need the search to be as fast as possible.
EDIT START:
The files might have different format (i.e. EDI, XML, CSV) and contain sometimes pretty random data (i.e. numerical IDs etc.). This is why I preliminarily ruled out an index-based searching engine.
The files will be searched multiple times for similar but different strings (i.e. for IDs which might have similar length and format, but they will usually be different).
EDIT END
Any ideas?
600 files of 0.5 MB each is about 300 MB - that can hardly be considered big nowadays, let alone large. A simple string search on any modern computer should actually be more I/O-bound than CPU-bound - a single thread on my system can search 300 MB for a relatively simple regular expression in under 1.5 seconds, which goes down to 0.2 seconds if the files are already present in the OS cache.
With that in mind, if your purpose is to perform such a search infrequently, then using some sort of index may result in an overengineered solution. Start by iterating over all files, reading each block-by-block or line-by-line and searching - this is simple enough that it barely merits its own library.
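A minimal sketch of that brute-force starting point, which also reports the exact position of each hit. `BruteForceSearch` and `Hit` are hypothetical names. Each file is read whole (reasonable at 0.5 MB) and decoded as ISO-8859-1 so that one char corresponds to one byte; this assumes the search string is ASCII/Latin-1, and the reported offset is then a byte offset:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Sketch: scan every regular file in a directory for a literal string.
class BruteForceSearch {
    // A match: the file and the byte offset at which the string starts.
    static final class Hit {
        final Path file;
        final int offset;

        Hit(Path file, int offset) {
            this.file = file;
            this.offset = offset;
        }
    }

    static List<Hit> search(Path dir, String needle) {
        if (needle.isEmpty()) {
            throw new IllegalArgumentException("needle must not be empty");
        }
        List<Hit> hits = new ArrayList<>();
        try (Stream<Path> listing = Files.list(dir)) {
            List<Path> files = listing.sorted().collect(Collectors.toList());
            for (Path file : files) {
                if (!Files.isRegularFile(file)) {
                    continue;
                }
                // ISO-8859-1 maps each byte to exactly one char, so indexes are byte offsets.
                String text = new String(Files.readAllBytes(file), StandardCharsets.ISO_8859_1);
                for (int i = text.indexOf(needle); i >= 0; i = text.indexOf(needle, i + 1)) {
                    hits.add(new Hit(file, i));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hits;
    }
}
```

Profiling this against the real files gives a baseline to compare any index-based solution with.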
Set down your performance requirements, profile your code, verify that the actual string search is the bottleneck and then decide whether a more complex solution is warranted. If you do need something faster, you should first consider the following solutions, in order of complexity:
Use an existing indexing engine, such as Lucene, to filter out the bulk of the files for each query and then explicitly search in the (hopefully few) remaining files for your string.
If your files are not really text, so that word-based indexing would not work, preprocess the files to extract a term list for each file and use a DB to create your own indexing system - I doubt you will find an FTS engine that uses anything other than words for its indexing.
If you really want to reduce the search time to the minimum, extract term/position pairs from your files, and enter those in your DB. You may still have to verify by looking at the actual file, but it would be significantly faster.
PS: You do not mention at all what kind of strings we are discussing. Do they contain delimited terms, e.g. words, or do your files contain random characters? Can the search string be broken into substrings in a meaningful manner, or is it a bunch of letters? Is your search string fixed, or could it also be a regular expression? The answer to each of these questions could significantly limit what is and what is not actually feasible - for example indexing random strings may not be possible at all.
EDIT:
From the question update, it seems that the concept of a term/token is generally applicable, as opposed to e.g. searching for totally random sequences in a binary file. That means that you can index those terms. By searching the index for any tokens that exist in your search string, you can significantly reduce the cases where a look at the actual file is needed.
You could keep a term->file index. If most terms are unique to each file, this approach might offer a good complexity/performance trade-off. Essentially you would narrow down your search to one or two files and then perform a full search on those files only.
You could keep a term->file:position index. For example, if your search string is "Alan Turing", you would first search the index for the tokens "Alan" and "Turing". You would get two lists of files and positions that you could cross-reference. By e.g. requiring that the positions of the token "Alan" precede those of the token "Turing" by at most, say, 30 characters, you would get a list of candidate positions in your files that you could verify explicitly.
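A small in-memory sketch of the term->file:position idea. `PositionalIndex` and `Posting` are hypothetical names, and treating runs of ASCII letters and digits as tokens is an assumption that would need adjusting for EDI, XML or CSV content:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: maps each token to the files and character positions where it occurs.
class PositionalIndex {
    static final class Posting {
        final String file;
        final int position;

        Posting(String file, int position) {
            this.file = file;
            this.position = position;
        }
    }

    // Assumption: a token is a run of ASCII letters or digits.
    private static final Pattern TOKEN = Pattern.compile("[A-Za-z0-9]+");
    private final Map<String, List<Posting>> postings = new HashMap<>();

    void addFile(String file, String content) {
        Matcher m = TOKEN.matcher(content);
        while (m.find()) {
            postings.computeIfAbsent(m.group(), k -> new ArrayList<>())
                    .add(new Posting(file, m.start()));
        }
    }

    // Positions of `first` that are followed by `second` in the same file within maxGap characters.
    // These are only candidates; the caller still verifies them against the actual file.
    List<Posting> candidates(String first, String second, int maxGap) {
        List<Posting> out = new ArrayList<>();
        for (Posting a : postings.getOrDefault(first, Collections.emptyList())) {
            for (Posting b : postings.getOrDefault(second, Collections.emptyList())) {
                int gap = b.position - a.position;
                if (a.file.equals(b.file) && gap > 0 && gap <= maxGap) {
                    out.add(a);
                    break; // one matching `second` is enough for this occurrence of `first`
                }
            }
        }
        return out;
    }
}
```

The nested loop is quadratic in the number of postings for the two tokens; for rare tokens such as IDs that is usually fine, while frequent tokens would call for sorted posting lists.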
I am not sure to what degree existing indexing libraries would help. Most are targeted towards text indexing and may mishandle other types of tokens, such as numbers or dates. On the other hand, your case is not fundamentally different either, so you might be able to use them - if necessary, by preprocessing the files you feed them to make them more palatable. Building an indexing system of your own, tailored to your needs, does not seem too difficult either.
You still haven't mentioned if there is any kind of flexibility in your search string. Do you expect being able to search for regular expressions? Is the search string expected to be found verbatim, or do you need to find just the terms in it? Does whitespace matter? Does the order of the terms matter?
And more importantly, you haven't mentioned if there is any kind of structure in your files that should be considered while searching. For example, do you want to be able to limit the search to specific elements of an XML file?
Unless you have an SSD, your main bottleneck will be all the file accesses. It's going to take about 10 seconds to read the files, regardless of what you do in Java.
If you have an SSD, reading the files won't be a problem, and the CPU speed in Java will matter more.
If you can create an index for the files this will help enormously.