How to use nearly related parameters retrieve the same result - java

I am building an sms based application that will retrieve railway schedules.Now the problem that I am facing is that if the user types the wrong name of a particular station(Suppose he writes 'Kolkta' instead of 'Kolkata') then my app would not be able to forward the result of query that has got nearest match to it.How will I do it?Is there an API in java for this?

I guess Apache Lucene provides support you want in java.

Lucence Apache sounds promising, but if you want something more straightforward that you can cook at home very easily, you can try computing the minimal edit distance between the user input and the entire set of railway names. This is a measurement of similarity between strings and can be computed very efficiently (especially in your case, where the strings are very short).
The link above contains a scary mathematical formula but this is the nature of all formal representations. They are scary. Scroll a bit downwards a you will find the extremely short pseudo code for the algorithm (almost copy paste).

Related

Using self-built approaches in Lucene search engine

I'm looking for an appropriate search engine that I can use my own similarity measure and tokenization approaches in it. Lucene search engine is introduced as a good one for this purpose but I have no idea about that. I searched on the internet about the tutorial of new versions of Lucene search engine but most of the pages are from a few years ago. Some of my questions are as follow:
Is it possible to change the similarity measure, tokenization and Stemming approaches and use self-built classes in the Lucene? If yes, How to do that?
Is there any difference between how we index the text for keywords search or phrasal search? should I make two different index for keyword search and phrasal search? (I think if we remove stop words, it will affect on the result of phrasal search and if I don't remove stop words, it will affect on the result of keyword search, won't it?)
Any information about this topic is appreciated.
This is possible, yes, and we do it on a couple solutions at my workplace. Here is a reasonable tutorial on how to do this. The tutorial uses Solr, which is a good Lucene implementation. To answer your questions directly:
Yes, there is a way to do this by overriding interfaces and providing your own implementation (see tutorial). Tokenization can be done without needing to override classes within Solr's default configuration, depending on how funky you need to get with Tokenization.
Yes, making an index that will return accurate results is a measure in understanding how your users will be searching the index. That having been said, a large part of the complexity in how queries search comes from people wanting matching results to float to the top of the results list, which is done via scoring. Given it sounds like you're looking to override the scoring, it may not matter for you. You should note though that by default, Lucene will match on hits to multiple columns higher than a single match exactly on a single column. That means that if you store data across many columns (and you search by default across many columns) your search will get less and less "accurate".
Full text search against a single column tends to be pretty accurate phrase vs words, but you'll end up with a pretty large index.

The closer the better approach in Lucene

I'm pretty new to Lucene, so forgive me in advance if some of my terminology is wrong.
Lucene offers different types of fields (keyword, text, unstored, unindexed), but it seems it also supports Numeric field, Int field and Float field.
Now, I'm wondering if "the closer the better" functionality exists/or is easy to implement in Lucene:
I want the creation_date of a document stored as the unix time into a float field.
Then I want to be able to compare the unix time given in a query with the indexed unix time of the documents.
Instead of a range query (which checks if the range is between particular bounds) or a boolean query (which checks if the values are the same) I want to be able to return a sense of similarity based on the time between the unix times. If the timespan is small it should end up with a higher score than if the timespan is large. Preferably this shouldn't happen linear but instead exponentially for example. So as the title of this question says: The closer, the better.
I've noticed that ElasticSearch, which uses Lucene as core offers decay function scores, is this the behaviour that I'm looking for and is this present in Lucene?
Lastly, I'm wondering: can one compare this 'type' of scoring together with the default tf-idf scoring that is used to query the body of the documents, in a way that the final score is a combination of the score of the timespan between the documents and the textual similarity of the bodies.
I dont think you get it out of the box like elastic search. You could always try to add it yourself as a module. These algorithms are available at large on the internet.
You could also use the boosting and negative boosting systems in lucene in combination with the exisiting ranking system to experiment if that gives you the sort of results you would want. I am doing that on apache SOLR and it's working like a charm :)
on your last point, tf-idf module is available in solr, if not already in lucene just copy it from solr and add it as module in lucene and combine your own module with the tf-idf module to achieve a combined result.

AppEngine Approximate Partial String Matching Algorithm

So, I realize that this covers a wide array of topics and pieces of them have been covered before on StackOverflow, such as this question. Similarly, Partial String Matching and Approximate String Matching are popular algorithmic discussions, it seems. However, using these ideas in conjunction to suit a problems where both need to be discussed seems highly inefficient. I'm looking for a way to combine the two problems in to one solution, efficiently.
Right now, I'm using AppEngine with Java and the Persistent DataStore. This is somewhat annoying, since it doesn't seem to have any arithmetic usage in the queries to make things easier, so I'm currently considering doing some precalculation and storing it as an extra field in the database. Essentially, this is the idea that a friend and I were having on how to possibly implement a system for matching and I was more or less hoping for suggestions on how to make it more efficient. If it needs to be scrapped in favor of something better that already exists, I can handle that, as well.
Let's start off with a basic example of what I'd look to do a search for. Consider the following nonsense sentence:
The isolating layer rackets the principal beneath your hypocritical rubbish.
If a user does a search for
isalatig pri
I would think that this would be a fairly good starting match for the string, and the value should be returned. The current method that we are considering using basically assigns a value to test divisibility. Essentially, there is a table with the following data
A: 2 B: 3 C: 5
D: 7 E: 11 F: 13
...
with each character being mapped to a prime number (multiple characters don't make a difference, only one character is needed). And if the query string divides the string in the database, then the value is returned as a possible match.
After this, keywords that aren't listed as stopwords are compared from the search string to see if they are starting substrings of words in the possible match under a given threshold of an edit distance (currently using the Levenshtein distance).
distance("isalatig", "isolating") == 2
distance("pri", "principal") == 0 // since principal has a starting
// substring of pri it passes
The total distance for each query is then ranked in ascending order and the top n values are then returned back to the person doing the querying.
This is the basic idea behind the algorithm, though since this is my first time dealing with such a scenario, I realize that I'm probably missing something very important (or my entire idea may be wrong). What is the best way to handle the current situation that I'm trying to implement. Similarly, if there are any utilities that AppEngine currently offers to combat what I'm trying to do, please let me know.
First off, a clarification: App Engine doesn't allow arithmetic in queries because there's no efficient way to query on the result of an arbitrary arithmetic expression. When you do this in an SQL database, the planner is forced to select an inefficient query plan, which usually involves scanning all the candidate records one by one.
Your scheme will not work for the same reason: There's no way to index an integer such that you can efficiently query for all numbers that are divisible by your target number. Other potential issues include words that translate into numbers that are too large to store in a fixed length integer, and being unable to distinguish between 'rental', 'learnt' and 'antler'.
If we discard for the moment your requirement for matching arbitrary prefixes of strings, what you are searching for is full-text indexing, which is typically implemented using an inverted index and stemming. Support for fulltext search is on the App Engine roadmap but hasn't been released yet; in the meantime your best option appears to be SearchableModel, or using an external search engine such as Google Site Search.

"Quick and Dirty" Facial Recognition and Database Storage/Lookup in Java

For the last week I've been researching and experimenting with facial recognition. The intended application is for a person to be able to look up a person's information in a database (SQL) by simply taking a picture of their face. The initial expectation was to be able to compress a face down to a key or hash and use this as the database lokup. This need not be extremely accurate as the person looking up the information can and most likely will end up doing a final comparison between the original image on file and the person standing in front of them.
OpenCV/JavaCV seems to be the obvious starting point, and the facial detection that it provides works well, however the implementation of Eigenfaces for facial recognition isn't ideal because online training by recompiling hundreds of thousands of user faces every time a new face needs to be added to the training set wouldn't work.
I am experimenting with using SURF descriptors on a face extracted using OpenCV's Haar Cascade features, and this appears to get me closer to the intended result, however I am unable to think of a way to efficiently lookup and compare roughly 30 descriptors (which are either 64 or 128 dimensional vectors) in a database. I've done some reading about LSH and Spectral Hashing algorithms, however there are no implementations to be found for Java and my math isn't strong enough to implement them myself.
Does anyone have any thoughts or ideas on how this might be accomplished, or if it is even possible?
Hashing isn't complicated, nor do you need a degree in maths.
Assuming that any 2 images will result in a fairly similar number of 'descriptors' then it only requires that you get a reasonable match with enough of them to get to a high enough confidence factor.
How specific these descriptors are determines what level of collision you can accept in your hashing algorithm.
As you have several of them, I would suggest that you don't need anything too sophisticated - after all, you probably want a level of 'fuzziness' in your search?
Start with something simple - experiment and refine. You might even find that you'll need different hashing for different descriptors - i.e. some might be more specific than others?
Hopefully some food for thought.

iso 19794-2 fingerprint format

I am using iso 19794-2 fingerprint data format. All the data are in the iso 19794-2 format. I have more than hundred thousand fingerprints. I wish to make efficient search to identify the match. Is it possible to construct a binary tree like structure to perform an efficient(fastest) search for match? or suggest me a better way to find the match. and also suggest me an open source api for java to do fingerprint matching. Help me. Thanks.
Do you have a background in fingerprint matching? It is not a simple problem and you'll need a bit of theory to tackle such a problem. Have a look at this introduction to fingerprint matching by Bologna University's BioLab (a leading research lab in this field).
Let's now answer to your question, that is how to make the search more efficient.
Fingerprints can be classified into 5 main classes, according to the type of macro-singularity that they exhibit.
There are three types of macro-singularities:
whorl (a sort of circle)
loop (a U inversion)
delta (a sort of three-way crossing)
According to the position of those macro-singularities, you can classify the fingerprint in those classes:
arch
tented arch
right loop
left loop
whorl
Once you have narrowed the search to the correct class, you can perform your matches. From your question it looks like you have to do an identification task, so I'm afraid that you'll have to do all the comparisons, or else add some layers of pre-processing (like the classification I wrote about) to further narrow the search field.
You can find lots of information about fingerprint matching in the book Handbook of Fingerprint Recognition, by Maltoni, Maio, Jain and Prabhakar - leading researchers in this field.
In order to read ISO 19794-2 format, you could use some utilities developed by NIST called BiomDI, Software Tools supporting Standard Biometric Data Interchange Formats. You could try to interface it with open source matching algorithms like the one found in this biometrics SDK. It would however need a lot of work, including the conversion from one format to another and the fine-tuning of algorithms.
My opinion (as a Ph.D. student working in biometrics) is that in this field you can easily write code that does the 60% of what you need in no time, but the remaining 40% will be:
hard to write (20%); and
really hard to write without money and time (20%).
Hope that helps!
Edit: added info about NIST BiomDI
Edit 2: since people sometimes email me asking for a copy of the standard, I unfortunately don't have one to share. All I have is a link to the ISO page that sells the standard.
The iso format specifies useful mechanisms for matching and decision parameters. Decide on what mechanism you wish to employ to identify the match, and the relevant decision parameters. When you have determined these mechanisms and decision parameters, examine them to see which are capable of being put into an order - with a fairly high degree of individual values, as you want to avoid multiple collisions on the data. When you have identified a small number of data items (preferably one) that have this property, calculate the property for each fingerprint - preferably as they are added to the database, though a bulk load can be done initially. Then the search for a match is done on the calculated characteristic, and can be done by a binary tree, a black-red tree, or a variety of other search processes. I cannot recommend a particular search strategy without knowing what form and degree of differentiation of values you have in your database. Such a search strategy should, however, be capable of delivering a (small) range of possible matches - which can then be tested individually against your match mechanism and parameters, before deciding on a specific match.

Categories

Resources