I am trying to create a small search engine that uses the Java Scanner class to read a file and match user queries against keywords in the file.
However, I have a problem: I need to rank the matches. If I search for "computer" and the file being searched contains 4 instances of "computer", they will all be displayed together on one line, because they are the same.
However, if it returns "the computer shop", that should be ranked lower than a plain "computer" match, because I did not search for "the computer shop".
I hope you understand. How can I do this?
Thanks
As far as I understand, your problem is in the search-engine logic. The Scanner class is irrelevant here; it is just a convenient utility for reading data from a stream.
Concerning the search engine, define your input and required output more precisely. In general you should look for the best match between your query and the target text. What does that mean? It is very complicated: perhaps a longer matching character sequence, perhaps more matching words, and so on. People have written hundreds of PhD theses about this and created thousands of companies (have you heard of Google? :)).
So, unless this is homework, try tools like Solr or Lucene. Otherwise, think about the strategies I mentioned above.
Good luck.
A better approach might be to create an inverted index. Instead of going from a file to the words in the file, you do the opposite.
A simple implementation in Java could use a Map<String, List<File>>, where the key is the word and the list contains the files in which that word appears.
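A minimal sketch of that idea, still using Scanner to pull out the words (a Set is used here instead of the List mentioned above so the same file is not recorded twice for a word; class and method names are just illustrative):

    import java.io.IOException;
    import java.nio.file.Path;
    import java.util.*;

    public class InvertedIndex {
        // word -> files that contain it
        private final Map<String, Set<Path>> index = new HashMap<>();

        // Read a file with Scanner and record every whitespace-separated word.
        public void addFile(Path file) throws IOException {
            try (Scanner scanner = new Scanner(file)) {
                while (scanner.hasNext()) {
                    String word = scanner.next().toLowerCase();
                    index.computeIfAbsent(word, k -> new HashSet<>()).add(file);
                }
            }
        }

        // Look up the files that contain the queried word.
        public Set<Path> search(String word) {
            return index.getOrDefault(word.toLowerCase(), Collections.emptySet());
        }
    }

Ranking can then be layered on top, e.g. by also counting how often the word occurs in each file.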
Related
I'm working on a program for my debate team, and one of its features will be searching through text files for certain keywords. Since there is always limited time to prepare speeches in debate, speed is my absolute top priority, but the search methods I've tried so far aren't fast enough. The fastest was using grep on each file; it technically works, but there are about 2,500 files to search, so even at roughly 5 milliseconds per file the time adds up quickly when searching for multiple keywords, or searching repeatedly as the user needs different things.
What I really need is a way to ensure that my program won't have to scan every document when it searches, or something that would substantially cut down the number of documents it has to look through. Does anyone know if something like that is possible? If not, could anyone point me toward something to research that would cut down the search time in other ways?
I think you are looking for a text search engine; I believe Apache Lucene will help you. What you can do is create an index of all your files, based on their content. Then you can quickly search that index for the words and phrases you are interested in, and Lucene will tell you which file best matches each word or phrase.
The index should be stored on disk so you don't have to re-create it every time you search; you only extend it when a new document arrives.
Lucene will do even more for you, because it can also search for similar words (like Google does).
Describing how to use Lucene is, I think, out of the scope of this short answer, but you will find a nice introduction at this link:
http://www.lucenetutorial.com/sample-apps/textfileindexer-java.html
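As a very rough sketch of what that looks like in code (assuming a recent Lucene release and Java 11+; the "docs" and "index" directories, field names and query string are placeholders to adapt):

    import java.nio.file.*;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.*;
    import org.apache.lucene.index.*;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.*;
    import org.apache.lucene.store.FSDirectory;

    public class FileIndexer {
        public static void main(String[] args) throws Exception {
            Path indexDir = Paths.get("index");           // the index lives on disk and is reused
            StandardAnalyzer analyzer = new StandardAnalyzer();

            // 1. Index every .txt file in the "docs" directory.
            try (FSDirectory dir = FSDirectory.open(indexDir);
                 IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
                 DirectoryStream<Path> files = Files.newDirectoryStream(Paths.get("docs"), "*.txt")) {
                for (Path file : files) {
                    Document doc = new Document();
                    doc.add(new StringField("path", file.toString(), Field.Store.YES));
                    doc.add(new TextField("contents", Files.readString(file), Field.Store.NO));
                    writer.addDocument(doc);
                }
            }

            // 2. Search the index and print the best-matching files.
            try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(indexDir))) {
                IndexSearcher searcher = new IndexSearcher(reader);
                Query query = new QueryParser("contents", analyzer).parse("some keyword");
                for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                    System.out.println(searcher.doc(hit.doc).get("path") + "  score=" + hit.score);
                }
            }
        }
    }

Each keyword lookup then touches only the index rather than all 2,500 files, which is where the speedup comes from.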
Either use Lucene or some kind of index as stated by Vicctor.
Or, see other grep like solutions:
ignore some files if possible
Fastest possible grep <- Interesting
https://beyondgrep.com/feature-comparison/
Or, if you want to learn how to code it, try doing it yourself!
I need to search a large number of files (e.g. 600 files, 0.5 MB each) for a specific string.
I'm using Java, so I'd prefer the answer to be a Java library or in the worst case a library in a different language which I could call from Java.
I need the search to return the exact position of the found string in a file (so it seems Lucene for example is out of the question).
I need the search to be as fast as possible.
EDIT START:
The files may be in different formats (e.g. EDI, XML, CSV) and sometimes contain fairly random data (e.g. numerical IDs). This is why I preliminarily ruled out an index-based search engine.
The files will be searched multiple times for similar but different strings (e.g. IDs that may have similar length and format, but will usually differ).
EDIT END
Any ideas?
600 files of 0.5 MB each is about 300MB - that can hardly be considered big nowadays, let alone large. A simple string search on any modern computer should actually be more I/O-bound than CPU-bound - a single thread on my system can search 300MB for a relatively simple regular expression in under 1.5 seconds - which goes down to 0.2 if the files are already present in the OS cache.
With that in mind, if your purpose is to perform such a search infrequently, then using some sort of index may result in an over-engineered solution. Start by iterating over all files, reading each one block-by-block or line-by-line and searching - this is simple enough that it barely merits its own library.
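A minimal version of that loop, reporting file, line and column for each hit (the directory and search string are placeholders; the encoding is a guess you would adapt to your data):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.*;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    public class BruteForceSearch {
        public static void main(String[] args) throws IOException {
            Pattern needle = Pattern.compile(Pattern.quote("ID-1234567"));  // literal string to find
            List<Path> files;
            try (Stream<Path> walk = Files.walk(Paths.get("data"))) {
                files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
            }
            for (Path file : files) {
                // ISO-8859-1 never fails to decode, so odd bytes in EDI/CSV files won't abort the scan.
                try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.ISO_8859_1)) {
                    String line;
                    int lineNo = 0;
                    while ((line = reader.readLine()) != null) {
                        lineNo++;
                        Matcher m = needle.matcher(line);
                        while (m.find()) {
                            System.out.println(file + ":" + lineNo + ":" + (m.start() + 1));
                        }
                    }
                }
            }
        }
    }
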
Set down your performance requirements, profile your code, verify that the actual string search is the bottleneck and then decide whether a more complex solution is warranted. If you do need something faster, you should first consider the following solutions, in order of complexity:
Use an existing indexing engine, such as Lucene, to filter out the bulk of the files for each query and then explicitly search in the (hopefully few) remaining files for your string.
If your files are not really text, so that word-based indexing would not work, preprocess the files to extract a term list for each file and use a DB to create your own indexing system - I doubt you will find an FTS engine that uses anything other than words for its indexing.
If you really want to reduce the search time to the minimum, extract term/position pairs from your files, and enter those in your DB. You may still have to verify by looking at the actual file, but it would be significantly faster.
PS: You do not mention at all what kind of strings we are discussing. Do they contain delimited terms, e.g. words, or do your files contain random characters? Can the search string be broken into substrings in a meaningful manner, or is it just a bunch of letters? Is your search string fixed, or could it also be a regular expression? The answer to each of these questions could significantly limit what is and what is not actually feasible - for example, indexing random strings may not be possible at all.
EDIT:
From the question update, it seems that the concept of a term/token is generally applicable, as opposed to e.g. searching for totally random sequences in a binary file. That means that you can index those terms. By searching the index for any tokens that exist in your search string, you can significantly reduce the cases where a look at the actual file is needed.
You could keep a term->file index. If most terms are unique to each file, this approach might offer a good complexity/performance trade-off. Essentially you would narrow down your search to one or two files and then perform a full search on those files only.
You could keep a term->file:position index. For example, if your search string is "Alan Turing", you would first search the index for the tokens "Alan" and "Turing". You would get two lists of files and positions that you could cross-reference. By e.g. requiring that the positions of the token "Alan" precede those of the token "Turing" by at most, say, 30 characters, you would get a list of candidate positions in your files that you could verify explicitly.
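A hand-rolled sketch of that term->file:position idea, kept in an in-memory map for brevity (a real version would put the postings in a DB as suggested above; the tokenizer here is a naive alphanumeric split):

    import java.nio.file.Path;
    import java.util.*;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class PositionIndex {
        // One occurrence of a term: which file, and at which character offset.
        static class Posting {
            final Path file;
            final int offset;
            Posting(Path file, int offset) { this.file = file; this.offset = offset; }
        }

        private static final Pattern TOKEN = Pattern.compile("[A-Za-z0-9]+");
        private final Map<String, List<Posting>> index = new HashMap<>();

        // Record every alphanumeric token of a file together with its character offset.
        public void addFile(Path file, String content) {
            Matcher m = TOKEN.matcher(content);
            while (m.find()) {
                index.computeIfAbsent(m.group().toLowerCase(), k -> new ArrayList<>())
                     .add(new Posting(file, m.start()));
            }
        }

        // Candidate positions where `first` is followed by `second` within maxGap characters.
        public List<Posting> findNear(String first, String second, int maxGap) {
            List<Posting> candidates = new ArrayList<>();
            for (Posting a : index.getOrDefault(first.toLowerCase(), Collections.emptyList())) {
                for (Posting b : index.getOrDefault(second.toLowerCase(), Collections.emptyList())) {
                    if (a.file.equals(b.file) && b.offset > a.offset && b.offset - a.offset <= maxGap) {
                        candidates.add(a);   // still verify against the actual file before reporting
                    }
                }
            }
            return candidates;
        }
    }
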
I am not sure to what degree existing indexing libraries would help. Most are targeted towards text indexing and may mishandle other types of tokens, such as numbers or dates. On the other hand, your case is not fundamentally different either, so you might be able to use them - if necessary, by preprocessing the files you feed them to make them more palatable. Building an indexing system of your own, tailored to your needs, does not seem too difficult either.
You still haven't mentioned whether there is any kind of flexibility in your search string. Do you expect to be able to search for regular expressions? Is the search string expected to be found verbatim, or do you need to find just the terms in it? Does whitespace matter? Does the order of the terms matter?
And more importantly, you haven't mentioned if there is any kind of structure in your files that should be considered while searching. For example, do you want to be able to limit the search to specific elements of an XML file?
Unless you have an SSD, your main bottleneck will be all the file accesses. It's going to take about 10 seconds to read the files, regardless of what you do in Java.
If you have an SSD, reading the files won't be a problem, and the CPU speed in Java will matter more.
If you can create an index for the files this will help enormously.
I am currently trying to build a small system that reads in a bunch of file names (at the moment, only a few hundred) and then allows the user to search those file names. The end goal is to find duplicates: ones that will not necessarily have exactly the same name, but will share common words. I would eventually like to add a feature that suggests possible duplicates as well.
Currently I add each file path to an ArrayList, and then pass each word of the file name to a Hashtable which uses chaining. The words are created using String.split(), and all non-alphanumeric characters are converted into whitespace. This part works fine, and you can search for single words no worries.
I know the theory behind searching for multiple terms, getting the responses, and building a basic relevance score based on how many times each document is selected.
My current issue is with file names that are something akin to 'mybestfile'. My program can only handle them as a single word, and unless you search for 'mybestfile' you will find nothing.
Can anyone suggest a design path that I should head down from here? I know I could parse in an entire dictionary, then try to pull words out by matching substrings, but to be honest, this is just meant to be a simplistic program and I'd rather avoid that kind of thing.
Any help would be appreciated!!
(Also, the point of this is half learning, half proving I can do it, so I would like to know of solutions that already exist, but more for how they did it rather than to use them instead.)
You could start by playing with the various "sounds like" and distance algorithms available in the Apache Commons Codec language package. (I think the distance algorithm is in Commons Lang, not Codec.)
SimMetrics is another. I can't actually find the one I'm looking for, but here's a list, too.
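For instance, with Commons Codec (and Commons Text, where a Levenshtein implementation lives these days) on the classpath, a crude similarity test might look like the following; the edit-distance threshold of 2 is just a guess to tune:

    import org.apache.commons.codec.language.Soundex;
    import org.apache.commons.text.similarity.LevenshteinDistance;

    public class FuzzyWordMatch {
        private static final Soundex SOUNDEX = new Soundex();
        private static final LevenshteinDistance DISTANCE = new LevenshteinDistance();

        // Two file-name words "match" if they sound alike or are within 2 edits of each other.
        public static boolean similar(String a, String b) {
            return SOUNDEX.soundex(a).equals(SOUNDEX.soundex(b))
                    || DISTANCE.apply(a, b) <= 2;
        }

        public static void main(String[] args) {
            System.out.println(similar("report", "reprot"));   // true: one transposition away
            System.out.println(similar("speech", "budget"));   // false: sounds and spells differently
        }
    }
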
I am new to Lucene and my project is to provide specialized search for a set of booklets. I am using Lucene Java 3.1.
The basic idea is to help people know where to look for information in the (rather large and dry) booklets by consulting the index to find out which booklet and page numbers match their query. Each Document in my index represents a particular page in one of the booklets.
So far I have been able to successfully scrape the raw text from the booklets, insert it into an index, and query it just fine using StandardAnalyzer on both ends.
So here's my general question:
Many queries on the index will involve searching for place names mentioned in the booklets. Some place names use notational variants. For instance, in the body text it will be called "Ship Creek" on one page, but in a map diagram elsewhere it might be listed as "Ship Cr." or even "Ship Ck.". What I need to know is how to approach treating the two consecutive words as a single term and adding the notational variants as synonyms.
My goal is of course to search with any of the variants and catch all occurrences. If I search for (Ship AND (Cr Ck Creek)) this does not give me what I want because other words may appear between [ship] and [cr]/[ck]/[creek] leading to false positives.
So, in a nutshell I probably still need the basic stuff provided by StandardAnalyzer, but with specific term grouping to emit place names as complete terms and possibly insert synonyms to cover the variants.
For instance, the text "...allowed from the mouth of Ship Creek upstream to ..." would result in tokens [allowed],[mouth],[ship creek],[upstream]. Perhaps via a TokenFilter along the way, the [ship creek] term would expand into [ship creek][ship ck][ship cr].
As a bonus it would be nice to treat the trickier text "..except in Ship, Bird, and Campbell creeks where the limit is..." as [except],[ship creek],[bird creek],[campbell creek],[where],[limit].
This seems like a pretty basic use case, but it's not clear to me how I might be able to use existing components from Lucene contrib or SOLR to accomplish this. Should the detection and merging be done in some kind of TokenFilter? Do I need a custom Analyzer implementation?
Some of the term grouping can probably be done heuristically ([X],[creek] becomes [X creek]), but I also have an exhaustive list of the places mentioned in the text, if that helps.
Thanks for any help you can provide.
You can use Solr's Synonym Filter. Just set up "creek" to have synonyms "ck", "cr" etc.
I'm not aware of any existing functionality to solve your "bonus" problem.
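The question is on Lucene 3.1, where this lives in SynonymFilter/SynonymMap; in current Lucene/Solr the equivalent is SynonymGraphFilter. A rough sketch of a custom Analyzer that maps the single-word variants onto "creek" (package names shift between Lucene versions, so treat this as an outline rather than drop-in code):

    import java.io.IOException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
    import org.apache.lucene.analysis.synonym.SynonymMap;
    import org.apache.lucene.util.CharsRef;

    public class PlaceNameAnalyzer extends Analyzer {
        private final SynonymMap synonyms;

        public PlaceNameAnalyzer() throws IOException {
            SynonymMap.Builder builder = new SynonymMap.Builder(true);
            // Map the abbreviated variants onto the canonical token, keeping the originals too.
            builder.add(new CharsRef("cr"), new CharsRef("creek"), true);
            builder.add(new CharsRef("ck"), new CharsRef("creek"), true);
            this.synonyms = builder.build();
        }

        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer source = new StandardTokenizer();
            TokenStream stream = new LowerCaseFilter(source);
            stream = new SynonymGraphFilter(stream, synonyms, true);
            return new TokenStreamComponents(source, stream);
        }
    }

That covers the cr/ck/creek variants; emitting "ship creek" as one term would still need multi-word rules (which SynonymMap.Builder supports via its join helper) or the custom TokenFilter the question describes.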
I am looking for a Java library to do some initial spell checking / data normalization on user generated text content, imagine the interests entered in a Facebook profile.
This text will be tokenized at some point (before or after spell correction, whichever works better), and some of it will be used as keys to search for (exact match). It would be nice to cut down on misspellings and the like to produce more matches. It would be even better if the correction performed well on tokens longer than just one word, e.g. "trinking coffee" would become "drinking coffee" and not "thinking coffee".
I found the following Java libraries for doing spelling correction:
JAZZY does not seem to be under active development. Also, the dictionary-distance based approach seems inadequate because of the use of non-standard language in social network profiles and multi-word tokens.
APACHE LUCENE seems to have a statistical spell checker that should be much better suited. The question here would be how to create a good dictionary. (We are not using Lucene otherwise, so there is no existing index.)
Any suggestions are welcome!
What you want to implement is not a spelling corrector but a fuzzy search. Peter Norvig's essay is a good starting point for building a fuzzy search from candidates checked against a dictionary.
Alternatively, have a look at BK-trees.
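A minimal BK-tree sketch over Levenshtein distance, just to show the shape of the structure (everything here is hand-rolled, not a library API):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class BkTree {
        private static class Node {
            final String word;
            final Map<Integer, Node> children = new HashMap<>(); // keyed by distance to this node
            Node(String word) { this.word = word; }
        }

        private Node root;

        public void add(String word) {
            if (root == null) { root = new Node(word); return; }
            Node node = root;
            while (true) {
                int d = levenshtein(word, node.word);
                if (d == 0) return;                        // already present
                Node child = node.children.get(d);
                if (child == null) { node.children.put(d, new Node(word)); return; }
                node = child;
            }
        }

        // All stored words within maxDist edits of the query.
        public List<String> query(String word, int maxDist) {
            List<String> result = new ArrayList<>();
            search(root, word, maxDist, result);
            return result;
        }

        private void search(Node node, String word, int maxDist, List<String> out) {
            if (node == null) return;
            int d = levenshtein(word, node.word);
            if (d <= maxDist) out.add(node.word);
            // Triangle inequality: only children whose edge lies in [d - maxDist, d + maxDist] can match.
            for (int i = Math.max(1, d - maxDist); i <= d + maxDist; i++) {
                search(node.children.get(i), word, maxDist, out);
            }
        }

        private static int levenshtein(String a, String b) {
            int[] prev = new int[b.length() + 1], curr = new int[b.length() + 1];
            for (int j = 0; j <= b.length(); j++) prev[j] = j;
            for (int i = 1; i <= a.length(); i++) {
                curr[0] = i;
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
                }
                int[] tmp = prev; prev = curr; curr = tmp;
            }
            return prev[b.length()];
        }
    }
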
An n-gram index (used by Lucene) produces better results for longer words. The approach of generating candidates up to a given edit distance will probably work well enough for words found in normal text, but not well enough for names, addresses and scientific text. It will increase your index size, though.
If you have the texts indexed, you already have your corpus (your dictionary). Only what is in your data can be found anyway, so you do not need an external dictionary.
A good resource is Introduction to Information Retrieval - Dictionaries and tolerant retrieval. It includes a short description of context-sensitive spelling correction.
With regard to populating a Lucene index as the basis of a spell checker, this is a good way to solve the problem. Lucene has an out-of-the-box SpellChecker you can use.
There are plenty of word dictionaries available on the net that you can download and use as the basis for your Lucene index. I would suggest supplementing these with a number of domain-specific texts as well, e.g. if your users are medics, then supplement the dictionary with source texts from medical theses and publications.
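A rough sketch of that setup, using the SpellChecker from Lucene's spellchecker/suggest module (the word-list path is a placeholder, and exact signatures may differ slightly between Lucene versions):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.search.spell.PlainTextDictionary;
    import org.apache.lucene.search.spell.SpellChecker;
    import org.apache.lucene.store.FSDirectory;

    public class DictionarySpellChecker {
        public static void main(String[] args) throws IOException {
            try (FSDirectory spellIndex = FSDirectory.open(Paths.get("spell-index"));
                 SpellChecker checker = new SpellChecker(spellIndex)) {
                // Build the spelling index from a plain word list (one word per line),
                // e.g. a downloaded dictionary plus your own domain-specific terms.
                checker.indexDictionary(
                        new PlainTextDictionary(Files.newBufferedReader(Paths.get("wordlist.txt"))),
                        new IndexWriterConfig(new StandardAnalyzer()),
                        true);

                for (String suggestion : checker.suggestSimilar("trinking", 5)) {
                    System.out.println(suggestion);   // hopefully "drinking", "thinking", ...
                }
            }
        }
    }
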
Try Peter Norvig's spell checker.
You can hit Project Gutenberg or the Internet Archive for lots and lots of corpus text.
Also, I think Wiktionary could help you. You can even download it directly.
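If you end up rolling your own, the core of Norvig's approach ports to Java in a few lines. A sketch (the word-frequency map would be built by counting tokens in whatever corpus you download):

    import java.util.*;

    public class SimpleCorrector {
        private final Map<String, Integer> wordFreq;   // word -> count in the training corpus

        public SimpleCorrector(Map<String, Integer> wordFreq) {
            this.wordFreq = wordFreq;
        }

        // All strings one edit away from word: deletes, transposes, replaces and inserts.
        private Set<String> edits1(String word) {
            String letters = "abcdefghijklmnopqrstuvwxyz";
            Set<String> edits = new HashSet<>();
            for (int i = 0; i <= word.length(); i++) {
                String left = word.substring(0, i), right = word.substring(i);
                if (!right.isEmpty())
                    edits.add(left + right.substring(1));                                     // delete
                if (right.length() > 1)
                    edits.add(left + right.charAt(1) + right.charAt(0) + right.substring(2)); // transpose
                for (char c : letters.toCharArray()) {
                    if (!right.isEmpty())
                        edits.add(left + c + right.substring(1));                             // replace
                    edits.add(left + c + right);                                              // insert
                }
            }
            return edits;
        }

        // Return the most frequent known candidate, falling back to the input itself.
        public String correct(String word) {
            if (wordFreq.containsKey(word)) return word;
            return edits1(word).stream()
                    .filter(wordFreq::containsKey)
                    .max(Comparator.comparingInt(wordFreq::get))
                    .orElse(word);
        }
    }

Note this only corrects single words; multi-word context like "trinking coffee" needs some language-model logic on top, as discussed above.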
http://code.google.com/p/google-api-spelling-java is a good Java spell-checking library, but I agree with Thomas Jung that it may not be the answer to your problem.