Efficient searching of words using indexing - java

I'm doing a project in which I have to search for a word in a dictionary efficiently.
Can anyone provide Java code for implementing this search with indexing?
Can I use a b+ tree for the implementation?

Check out this answer.
The best way I know of (personally) to efficiently map from strings to other values is with a Trie. The answer I provided includes links to several already implemented versions.
An alternative is to intern all your strings and index on yourString.intern().hashCode().

This sounds like homework. If it is, please tag it as such.
Is "using an index" an external requirement, or one you've invented because you think it's part of the solution?
I would consider using a data structure called a "Trie" for this kind of requirement (assuming the use of an index isn't actually mandated, although even then you could argue that the Trie IS the index).
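To make the Trie suggestion concrete, here is a minimal sketch in Java. It is not taken from any of the linked implementations; the class name WordTrie and its methods are illustrative assumptions, and it stores one node per character with no compression.

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal trie for exact-word and prefix lookups (illustrative sketch, hypothetical names). */
final class WordTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord; // true if an inserted word ends at this node
    }

    private final Node root = new Node();

    /** Adds a word to the dictionary. */
    void insert(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.isWord = true;
    }

    /** Returns true only if exactly this word was inserted. */
    boolean contains(String word) {
        Node node = find(word);
        return node != null && node.isWord;
    }

    /** Returns true if some inserted word starts with the prefix (the empty prefix always matches). */
    boolean startsWith(String prefix) {
        return find(prefix) != null;
    }

    /** Walks the trie along s and returns the node reached, or null if the path does not exist. */
    private Node find(String s) {
        Node node = root;
        for (int i = 0; i < s.length(); i++) {
            node = node.children.get(s.charAt(i));
            if (node == null) {
                return null;
            }
        }
        return node;
    }

    public static void main(String[] args) {
        WordTrie trie = new WordTrie();
        trie.insert("apple");
        trie.insert("app");
        System.out.println(trie.contains("app"));
        System.out.println(trie.contains("appl"));
        System.out.println(trie.startsWith("appl"));
    }
}
```

Lookup cost depends on the length of the word rather than on the number of words stored, which is why a Trie can serve as the index itself.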

Related

Using self-built approaches in Lucene search engine

I'm looking for a search engine in which I can use my own similarity measure and tokenization approaches. Lucene has been recommended as a good choice for this purpose, but I know little about it. I searched online for tutorials on recent versions of Lucene, but most of the pages are from a few years ago. My questions are as follows:
Is it possible to change the similarity measure, tokenization, and stemming approaches and use self-built classes in Lucene? If yes, how?
Is there any difference between how text is indexed for keyword search and for phrase search? Should I build two different indexes, one for each? (I think removing stop words will affect the results of phrase search, and keeping them will affect the results of keyword search, won't it?)
Any information about this topic is appreciated.
This is possible, yes, and we do it in a couple of solutions at my workplace. Here is a reasonable tutorial on how to do this. The tutorial uses Solr, a search server built on Lucene. To answer your questions directly:
Yes, you can do this by overriding the relevant classes and providing your own implementation (see tutorial). Depending on how funky you need to get with tokenization, it can often be handled within Solr's default configuration without overriding any classes.
Yes. Building an index that returns accurate results is a matter of understanding how your users will search it. That said, a large part of the complexity in how queries search comes from people wanting matching results to float to the top of the results list, which is done via scoring. Since it sounds like you're looking to override the scoring, it may not matter for you. Note, though, that by default Lucene will score hits across multiple columns (fields, in Lucene terms) higher than a single exact match on one column. That means that if you store data across many columns (and you search across many columns by default), your search will get less and less "accurate".
Full-text search against a single column tends to be pretty accurate for phrases as well as words, but you'll end up with a pretty large index.

If you have a dictionary of strings, what's the fastest way to search a file and increment the number of times the strings appear?

Let's say you have a dictionary with 5 strings in it, and you also have multiple files. I want to iterate through those files and see how many times the strings in my dictionary appear in them. How can I do this most efficiently?
I would like this to scale as well, so more than 5 strings and more than a few documents. I'm pretty open about what language I use. Preferably Java or C#, but once again, I can work in another language.
"Most efficient" is always a trade-off between the time you want to put into it and the results you want (or need).
One easy approach is to use a regular expression. With five strings this will probably be fairly efficient. If that isn't good enough for you, you can certainly find a better approach.
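As a rough sketch of the regular-expression approach, the Java snippet below joins the dictionary strings into one alternation and counts matches in a single pass over the text. The class name RegexTermCounter is hypothetical, and reading each file into a string is left to the caller. Two caveats apply: matches are non-overlapping, and if one term is a prefix of another, the term listed first in the alternation wins.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Counts dictionary terms in a text using one alternation regex (illustrative sketch). */
final class RegexTermCounter {
    static Map<String, Integer> count(List<String> terms, String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String term : terms) {
            counts.put(term, 0);
        }
        if (terms.isEmpty()) {
            return counts; // an empty alternation would match the empty string everywhere
        }
        // Pattern.quote keeps regex metacharacters in the terms from being interpreted.
        String alternation = terms.stream().map(Pattern::quote).collect(Collectors.joining("|"));
        Matcher matcher = Pattern.compile(alternation).matcher(text);
        while (matcher.find()) {
            counts.merge(matcher.group(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count(Arrays.asList("cat", "dog"), "cat dog cat bird"));
    }
}
```

Calling count once per file and summing the returned maps gives totals across all documents.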
This is a pattern matching problem. A well-known algorithm for this kind of problem is the Knuth-Morris-Pratt algorithm. It is a famous algorithm, so you will find descriptions of it in many places; it is also covered in the book Introduction to Algorithms.
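For reference, a compact Java sketch of Knuth-Morris-Pratt that counts occurrences of one pattern in a text is shown below. The class and method names are illustrative assumptions. KMP searches for a single pattern, so for a dictionary this sketch would be run once per string.

```java
/** Knuth-Morris-Pratt occurrence counting for a single pattern (illustrative sketch). */
final class KmpSearch {
    /** failure[i] is the length of the longest proper prefix of pattern[0..i] that is also its suffix. */
    static int[] failureTable(String pattern) {
        int[] failure = new int[pattern.length()];
        int k = 0; // length of the currently matched prefix
        for (int i = 1; i < pattern.length(); i++) {
            while (k > 0 && pattern.charAt(i) != pattern.charAt(k)) {
                k = failure[k - 1];
            }
            if (pattern.charAt(i) == pattern.charAt(k)) {
                k++;
            }
            failure[i] = k;
        }
        return failure;
    }

    /** Counts occurrences of pattern in text, including overlapping ones. */
    static int countOccurrences(String text, String pattern) {
        if (pattern.isEmpty()) {
            return 0; // assumption: an empty pattern is treated as having no occurrences
        }
        int[] failure = failureTable(pattern);
        int count = 0;
        int k = 0; // number of pattern characters matched so far
        for (int i = 0; i < text.length(); i++) {
            while (k > 0 && text.charAt(i) != pattern.charAt(k)) {
                k = failure[k - 1];
            }
            if (text.charAt(i) == pattern.charAt(k)) {
                k++;
            }
            if (k == pattern.length()) {
                count++;
                k = failure[k - 1]; // fall back so overlapping matches are still found
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countOccurrences("abababa", "aba"));
    }
}
```

The text is scanned once per pattern without backing up, so each search takes time proportional to the text length plus the pattern length.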

Lightweight library capable of suggesting different spellings of words from a bounded set?

I am looking for a lightweight library that would let me feed it a bunch of words and then ask it whether a given word has any close matches.
I'm not particularly concerned with the underlying algorithm (I reckon a simple Hamming distance algorithm would probably suffice, were I to undertake the task myself).
I'm developing a small language, and I think it would be nifty to make suggestions to the user when an "Undefined class" error is detected (lots of times it's just a misspelled word). I don't want to spend much time on the issue, though.
Thanks
Levenshtein distance is a common way of handling it. Just add all the words to a list, then brute-force iterate over it and return the word with the smallest distance. Here's one library with a Levenshtein function: http://commons.apache.org/lang/api-2.4/org/apache/commons/lang/StringUtils.html
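The brute-force approach might look like the following Java sketch. To stay self-contained it uses a hand-written distance function rather than the linked library (where StringUtils.getLevenshteinDistance should serve the same purpose); the class name ClosestWord and its methods are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.List;

/** Brute-force nearest-word lookup by Levenshtein distance (illustrative sketch). */
final class ClosestWord {
    /** Classic dynamic-programming edit distance, keeping only two rows of the table. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns the word with the smallest distance to target, or null if the list is empty. */
    static String closest(String target, List<String> words) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String word : words) {
            int distance = levenshtein(target, word);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = word;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(closest("Strng", Arrays.asList("String", "Integer", "List")));
    }
}
```

For an "Undefined class" suggestion, it is probably worth rejecting the result when the best distance exceeds a small threshold, so that unrelated names are not suggested.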
If you have a large number of words and you want it to run fast, then you'd have to use n-grams. Split each word into bigrams and then add (bigram, word) to a map. Use the map to look up the bigrams in the target word, and then iterate through the candidates. That's probably more work than you want to do, though.
Not necessarily a library, but I think this article may be really helpful. It mostly describes how a spelling corrector works, in Python, but it also has a link to a Java implementation which you may use if that is what you are looking for specifically (note that I haven't used the Java one myself).
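The bigram map described above can be sketched in Java roughly as follows. The class name BigramIndex is hypothetical, the index is case-sensitive, and words shorter than two characters produce no bigrams and therefore never appear as candidates.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Maps each bigram to the words containing it, to narrow down spelling candidates (sketch). */
final class BigramIndex {
    private final Map<String, Set<String>> index = new HashMap<>();

    /** Returns the distinct two-character substrings of a word. */
    static Set<String> bigrams(String word) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i + 2 <= word.length(); i++) {
            result.add(word.substring(i, i + 2));
        }
        return result;
    }

    /** Registers a word under each of its bigrams. */
    void add(String word) {
        for (String bigram : bigrams(word)) {
            index.computeIfAbsent(bigram, k -> new HashSet<>()).add(word);
        }
    }

    /** Returns the indexed words that share at least one bigram with the target, sorted. */
    Set<String> candidates(String target) {
        Set<String> result = new TreeSet<>();
        for (String bigram : bigrams(target)) {
            result.addAll(index.getOrDefault(bigram, Collections.<String>emptySet()));
        }
        return result;
    }

    public static void main(String[] args) {
        BigramIndex idx = new BigramIndex();
        idx.add("String");
        idx.add("Integer");
        idx.add("List");
        System.out.println(idx.candidates("Strng"));
    }
}
```

The candidate set is usually much smaller than the full word list, so an edit-distance comparison only needs to run over those candidates.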

SortedBiTreeMultimap data structure in Java?

Is there any Java library with TreeMap-like data structure which also supports all of these:
lookup by value (like Guava's BiMap)
possibility of non-unique keys as well as non unique values (like Guava's Multimap)
keeps track of sorted values as well as sorted keys
If it exists, it would probably be called SortedBiTreeMultimap, or similar :)
This can be produced using a few data structures together, but I never took time to unite them in one nice class, so I was wondering if someone else has done it already.
I think you are looking for a "Graph". You might be interested in this slightly similar question asked a while ago, as well as this discussion thread on BiMultimaps / Graphs. Google has a BiMultimap in its internal code base, but they haven't yet decided whether to open source it.
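As the question notes, such a structure can be assembled from existing collections. Here is a minimal sketch built from two TreeMaps of TreeSets. The class name SortedBiMultimap and its methods are illustrative assumptions, not an existing library API, and the sketch treats the mapping as a set of distinct (key, value) pairs.

```java
import java.util.Collections;
import java.util.NavigableSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Sorted two-way multimap built from two TreeMaps (illustrative sketch, hypothetical API). */
final class SortedBiMultimap<K extends Comparable<K>, V extends Comparable<V>> {
    private final TreeMap<K, TreeSet<V>> forward = new TreeMap<>();
    private final TreeMap<V, TreeSet<K>> inverse = new TreeMap<>();

    /** Records the pair in both directions; duplicate pairs are ignored. */
    void put(K key, V value) {
        forward.computeIfAbsent(key, k -> new TreeSet<>()).add(value);
        inverse.computeIfAbsent(value, v -> new TreeSet<>()).add(key);
    }

    /** Removes one pair and returns true if it was present. */
    boolean remove(K key, V value) {
        TreeSet<V> values = forward.get(key);
        if (values == null || !values.remove(value)) {
            return false;
        }
        if (values.isEmpty()) {
            forward.remove(key);
        }
        TreeSet<K> keys = inverse.get(value);
        keys.remove(key);
        if (keys.isEmpty()) {
            inverse.remove(value);
        }
        return true;
    }

    /** Sorted values mapped to the key (empty if none). */
    NavigableSet<V> valuesFor(K key) {
        TreeSet<V> values = forward.get(key);
        if (values == null) {
            return Collections.<V>emptyNavigableSet();
        }
        return Collections.unmodifiableNavigableSet(values);
    }

    /** Sorted keys mapped to the value (empty if none), i.e. lookup by value. */
    NavigableSet<K> keysFor(V value) {
        TreeSet<K> keys = inverse.get(value);
        if (keys == null) {
            return Collections.<K>emptyNavigableSet();
        }
        return Collections.unmodifiableNavigableSet(keys);
    }

    /** All keys in sorted order. */
    NavigableSet<K> keys() {
        return Collections.unmodifiableNavigableSet(forward.navigableKeySet());
    }

    /** All values in sorted order. */
    NavigableSet<V> values() {
        return Collections.unmodifiableNavigableSet(inverse.navigableKeySet());
    }
}
```

The cost is roughly double the memory of a one-way map, since every pair is stored once per direction.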

Metalanguage like BNF or XML-Schema to validate a tree-instance against a tree-model

I'm implementing a new machine learning algorithm in Java that extracts a prototype data structure from a set of structured datasets (tree structures). As I'm developing a generic library for that purpose, I kept my design independent of concrete data representations like XML.
My problem now is that I need a way to define a data model, which is basically a ruleset describing valid trees, against which a set of trees is matched. I thought of using BNF or a similar dialect.
Basically, I need a way to iterate through the space of all valid TreeNodes defined by the ModelTree (like a search through the search space for algorithms like A*) so that I can compare my set of concrete trees with the model. I know that I'll have to deal with infinite spaces there, but first things first.
I know, it's rather tricky (and my sentences are pretty bumpy) but I would appreciate any clues.
Thanks in advance,
Stefan
I believe that you are talking about a Regular Tree Grammar. This Wikipedia page is an entry point for the topic, and the book that it links to might be helpful.
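To give a feel for what a regular tree grammar looks like in code, here is a small Java sketch that checks whether a concrete tree can be derived from a nonterminal. All names (TreeGrammarCheck, Tree, Production, derives) are illustrative assumptions, and the restricted form used here, where each production maps a nonterminal to one labelled node whose children are nonterminals, is a simplification chosen for brevity.

```java
import java.util.Arrays;
import java.util.List;

/** Validates a tree against a simple regular tree grammar (illustrative sketch). */
final class TreeGrammarCheck {
    /** A concrete tree node: a label plus ordered children. */
    static final class Tree {
        final String label;
        final List<Tree> children;

        Tree(String label, Tree... children) {
            this.label = label;
            this.children = Arrays.asList(children);
        }
    }

    /** A production lhs -> label(childNonterminals...). */
    static final class Production {
        final String lhs;
        final String label;
        final List<String> childNonterminals;

        Production(String lhs, String label, String... childNonterminals) {
            this.lhs = lhs;
            this.label = label;
            this.childNonterminals = Arrays.asList(childNonterminals);
        }
    }

    /** Returns true if the tree can be derived from the given nonterminal. */
    static boolean derives(List<Production> grammar, String nonterminal, Tree tree) {
        for (Production p : grammar) {
            boolean shapeMatches = p.lhs.equals(nonterminal)
                    && p.label.equals(tree.label)
                    && p.childNonterminals.size() == tree.children.size();
            if (!shapeMatches) {
                continue;
            }
            boolean allChildrenMatch = true;
            for (int i = 0; i < tree.children.size() && allChildrenMatch; i++) {
                allChildrenMatch = derives(grammar, p.childNonterminals.get(i), tree.children.get(i));
            }
            if (allChildrenMatch) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Example model: lists of natural numbers built from nil, cons, zero and succ.
        List<Production> grammar = Arrays.asList(
                new Production("List", "nil"),
                new Production("List", "cons", "Nat", "List"),
                new Production("Nat", "zero"),
                new Production("Nat", "succ", "Nat"));
        Tree valid = new Tree("cons", new Tree("succ", new Tree("zero")), new Tree("nil"));
        System.out.println(derives(grammar, "List", valid));
    }
}
```

The recursion always descends into a child of the tree being checked, so the check terminates on any finite tree even when the grammar describes an infinite set of valid trees.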
