I'm currently designing some desktop software, and I've always wanted to implement an intuitive search function. For example, I need to write an algorithm that parses a search query like "next monday between 2 and 3pm" or "anytime after 2 on friday", or even "how do I use ". So queries can look very different yet ask the same thing, which is what gets me.
Should I be tokenizing the query (which is what I'm doing so far), or should I treat the string as a whole pattern and compare it to a library of patterns of some sort?
I'm not sure if SO is the right place for this, so if necessary, point me in the right direction. Basically, I would just like some advice as to the approach I should be taking.
Thanks.
The question "Temporal Extraction (i.e. Extract date/time entities from free form text) - How?" might give you some pointers.
"Entity extraction" is the process of extracting human recognizable entities (names, places, dates, etc.) from unstructured text. That article deals specifically with temporal entities but reading up on "entity extraction" in general is a good place to start.
Entity extraction has to be done per language, though, so expect difficulty when you're trying to internationalize your product for other locales. For Google Calendar, we spent a lot of time on temporal entity extraction and on expressing recurrence relations in human-readable form ("every last Friday in November"), and each of the 40 locales we operate in has its own quirks.
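To make "temporal entity extraction" concrete, here is a minimal, English-only sketch (the patterns are invented for illustration and nowhere near production quality) that pulls day and time mentions out of a query with plain regexes:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NaiveTemporalExtractor {
    // Invented, English-only patterns; a real extractor needs one set per locale.
    private static final Pattern DAY =
        Pattern.compile("\\b(monday|tuesday|wednesday|thursday|friday|saturday|sunday)\\b",
                        Pattern.CASE_INSENSITIVE);
    private static final Pattern TIME =
        Pattern.compile("\\b(\\d{1,2})(?::(\\d{2}))?\\s*(am|pm)?\\b",
                        Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) {
        String query = "next monday between 2 and 3pm";
        Matcher day = DAY.matcher(query);
        while (day.find()) {
            System.out.println("day entity: " + day.group(1));
        }
        Matcher time = TIME.matcher(query);
        while (time.find()) {
            System.out.println("time entity: " + time.group().trim());
        }
    }
}
```

Even this toy version surfaces the hard part: is the bare "2" in "between 2 and 3pm" 2am or 2pm? That disambiguation (and doing it per locale) is where the real work lives.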
If you are planning to use a predefined grammar, you should consider using a state machine. There is, for example, the Ragel State Machine Compiler, which lets you define a state machine with simple regular expressions and can generate the actual source code for various target languages.
Here is a simple parser that I wrote to get all table names from an SQL select query. You could do something similar (https://gist.github.com/1524986).
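To illustrate the state-machine idea without Ragel, here is a hand-rolled sketch for one phrase shape; the states and the tiny grammar are invented for illustration:

```java
public class TimeRangeMachine {
    // Invented states for a toy "between X and Y" grammar.
    private enum State { START, SAW_BETWEEN, SAW_FIRST_TIME, SAW_AND, DONE }

    public static boolean matches(String query) {
        State state = State.START;
        for (String token : query.toLowerCase().split("\\s+")) {
            switch (state) {
                case START:
                    if (token.equals("between")) state = State.SAW_BETWEEN;
                    break;
                case SAW_BETWEEN:
                    if (token.matches("\\d{1,2}(am|pm)?")) state = State.SAW_FIRST_TIME;
                    break;
                case SAW_FIRST_TIME:
                    if (token.equals("and")) state = State.SAW_AND;
                    break;
                case SAW_AND:
                    if (token.matches("\\d{1,2}(am|pm)?")) state = State.DONE;
                    break;
                default:
                    break;
            }
        }
        return state == State.DONE;
    }

    public static void main(String[] args) {
        System.out.println(matches("next monday between 2 and 3pm")); // true
        System.out.println(matches("anytime after 2 on friday"));     // false
    }
}
```

Ragel generates essentially this structure for you from a regular-expression grammar, which is much easier to maintain once you have more than a handful of phrase shapes.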
Relatively simple question. I need to translate/localize a legacy Java application.
For newer applications, our company uses .properties files in Java for localizing strings, a concept very similar to .resx files in C# (which some of our products also use).
The problem is this is a legacy product that was around before we started thinking about localization. It is full of hard coded strings and also various forms of hard-coded string concatenation/formatting.
As far as I am aware, I have the very daunting task of pulling all our strings and formatting out into .properties files in the product and then referencing those from the code.
Personally I have no huge issue doing this work, but I want to make sure I am not missing something.
So I have a couple general questions.
Is there a faster way of converting my product to use the .properties files? Off the top of my head, I could write a script that would automate maybe 30-40% of the work...
Are there any "gotchas" I should be worried about specific to converting a legacy product (I am not looking for general localization "gotchas", which I can google for, but anything specific to this scenario)?
Finally, are there any completely different strategies I am overlooking for localization? This is just how we translate our existing products, but because this is a legacy product (and on the agenda to be re-written), it is essentially throw-away code and I could do pretty much whatever I want, including just finding the cheapest, dirtiest, fastest way possible, although I am obviously leaning toward doing the job properly.
Any thoughts, people?
As a guideline I would say try to keep answers focused on the questions being asked, but any informational contributions or questions are always welcome in comments.
No, there is no faster way. You have to go through the code line by line.
There are plenty of gotchas, since internationalization is about more than just string constants.
You may already know that number formats and date formats need to be localized, but you'll need to be on the lookout for numbers and dates being embedded into strings via concatenation or StringBuilder.append calls. Also watch for implicit toString() calls, such as when a Number or Date is supplied as a Swing model value (for example, returning a Number from the TableModel.getValueAt method), or when a JSP or JSF EL expression refers to such a value directly instead of formatting it.
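For example, this is the shape of the refactoring you'll be doing over and over (a minimal sketch):

```java
import java.text.DateFormat;
import java.text.NumberFormat;
import java.util.Date;
import java.util.Locale;

public class FormatExample {
    public static void main(String[] args) {
        Locale locale = Locale.getDefault();
        double total = 1234.56;
        Date due = new Date();

        // Before: locale-blind concatenation, typical of legacy code.
        String bad = "Total: " + total + ", due " + due;

        // After: explicit, locale-aware formatting.
        String good = "Total: "
            + NumberFormat.getNumberInstance(locale).format(total)
            + ", due "
            + DateFormat.getDateInstance(DateFormat.MEDIUM, locale).format(due);

        System.out.println(bad);
        System.out.println(good);
    }
}
```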
Similarly, keep an eye out for enum constants directly displayed to the user, implicitly invoking their toString() method.
Creating sentences through string concatenation is a problem not only because of the formatting of numbers, dates, and enums, but also because other languages may order the parts of a sentence differently. Such string concatenation should be replaced with localized MessageFormats.
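A sketch of what that replacement looks like (the pattern string is hard-coded here to keep the example self-contained; in a real app it would live in the per-locale .properties file):

```java
import java.text.MessageFormat;
import java.util.Date;
import java.util.Locale;

public class MessageFormatExample {
    public static void main(String[] args) {
        // In practice this pattern comes from a ResourceBundle, e.g.
        //   orderSummary={0} ordered {1,number,integer} items on {2,date,long}
        String pattern = "{0} ordered {1,number,integer} items on {2,date,long}";

        MessageFormat format = new MessageFormat(pattern, Locale.getDefault());
        String text = format.format(new Object[] { "Alice", 3, new Date() });
        System.out.println(text);
        // A translator can reorder {0}, {1}, {2} freely in each locale's
        // pattern, which plain concatenation can never accommodate.
    }
}
```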
Keystrokes need to be localized, including all mnemonics (and accelerators if it's a desktop app).
Layouts are an issue. You'll want to address places where the application assumes left-to-right orientation; even if you're only planning to localize for other left-to-right languages, you probably know that putting off good i18n practices is asking for trouble down the line.
If your app is a Swing application, you'll want to convert LEFT/WEST and RIGHT/EAST layout constraints to LINE_START and LINE_END. If your app is a web application, you'll need to factor out margin-left, margin-right, padding-left, padding-right, border-left, and border-right (and probably many others I'm forgetting) into lang-specific CSS blocks.
Swing apps also need to call applyComponentOrientation after building each window, usually right before calling pack().
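Putting those two Swing points together, a minimal sketch:

```java
import java.awt.BorderLayout;
import java.awt.ComponentOrientation;
import java.util.Locale;
import javax.swing.JButton;
import javax.swing.JFrame;
import javax.swing.SwingUtilities;

public class OrientationExample {
    public static void main(String[] args) {
        SwingUtilities.invokeLater(() -> {
            JFrame frame = new JFrame("i18n layout sketch");
            // LINE_START/LINE_END instead of WEST/EAST, so the layout
            // flips automatically for right-to-left locales.
            frame.add(new JButton("Start"), BorderLayout.LINE_START);
            frame.add(new JButton("End"), BorderLayout.LINE_END);

            // Apply orientation after building the window, before pack().
            frame.applyComponentOrientation(
                ComponentOrientation.getOrientation(Locale.getDefault()));
            frame.pack();
            frame.setVisible(true);
        });
    }
}
```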
Some programmers like to store parts of a UI in a database. I'm not talking about user content (which you shouldn't localize); I'm talking about label text, window titles, layout constraints, and so on. I have a hearty dislike for that practice, personally, but people do it. If your app is doing that, I guess either the database table needs a locale column, or the practice of storing the UI in the database needs to be removed entirely.
To answer your final question, if there are any better strategies than stepping through the code, I've never heard of them. You could just search for double-quote characters in the code, of course. I suppose the choice depends on how professional and polished your superiors want the application to look.
One thing I've learned is that throw-away code often isn't. Don't be surprised if that rewrite ends up trying to salvage large swaths of code from the legacy version.
I'm not really sure whether this is the right kind of question to post here, but I thought I'd give it a go.
I'm working on a project where I take text data from a public knowledge base and want to use this text to automatically expand tag-based search queries with additional terms that are supposed to be relevant to the original query. The public knowledge base is basically a collection of data from Wikipedia; in my case, the abstracts of 3.74 million articles.
In the beginning, I simply performed a search based on the original query, fetched the words used in the articles matching that query, and did a simple term-frequency calculation to get the N most used terms.
It seemed to be a simple idea that worked to begin with, but as I tested more queries I started running into problems. It's clear that I need some kind of semantic analysis of my custom text collection, but I have no idea where to even begin with something like this. Any tools I find online that are supposed to do semantic analysis like this only work on a predefined collection of texts. As stated: I need something that can process a custom collection and later use that index to perform searches.
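For reference, the term-frequency step I described above is essentially this (a toy sketch; the two strings stand in for matched abstracts):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NaiveExpansion {
    // Returns the n most frequent terms across the matched documents.
    static List<String> topTerms(List<String> matchedDocs, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : matchedDocs) {
            for (String term : doc.toLowerCase().split("\\W+")) {
                if (term.length() > 2) { // crude noise/stop-word filter
                    counts.merge(term, 1, Integer::sum);
                }
            }
        }
        List<String> terms = new ArrayList<>(counts.keySet());
        terms.sort(Comparator.comparing(counts::get).reversed());
        return terms.subList(0, Math.min(n, terms.size()));
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
            "The quick brown fox jumps over the lazy dog",
            "A fox is a small omnivorous mammal");
        System.out.println(topTerms(docs, 3)); // e.g. [fox, ...]
    }
}
```

The problem is that raw frequency says nothing about whether a term is actually semantically related to the query, which is exactly where this approach breaks down.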
Any ideas or suggestions?
Let me describe the problem. A lot of suppliers send us data files in various formats (with various headers). We do not have any control over the data format (what columns the suppliers send us). This data then needs to be converted to our standard transactions (this standard is constant and defined by us).
The challenge here is that we have no control over what columns suppliers send us in their files, while the destination standard is constant. Now I have been asked to develop a framework through which end users can define their own data transformation rules through a UI (say, field A in the destination transaction equals columnX + columnY, or the first 3 characters of columnZ from the input file). There will be many such data transformation rules.
The goal is that users should be able to add all these supplier files (and convert all their data to our company's format) from the front-end UI with minimal code change. Please suggest some frameworks for this (preferably Java-based).
I've worked in a similar field before. I'm not sure I would trust customers/suppliers to use such a tool correctly and design 100% bulletproof transformations. Mapping columns is one thing, but what about formatting problems in dates, monetary values and the like? You'd probably need to manually check their creations anyway, or you'll end up with some really nasty data consistency issues. Errors caused by faulty data transformation are little beasts hiding in the dark and jumping out at you when you expect them the least.
If all you need is a relatively simple, graphical way to design data conversions, check out something like Talend Open Studio (just google it). It calls itself an ETL tool, but we used it for all kinds of stuff.
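If Talend turns out to be overkill and you end up rolling your own, the core abstraction can stay small. A sketch (names invented) of the kind of rule the question describes:

```java
import java.util.Map;
import java.util.function.Function;

public class RuleSketch {
    // A transformation rule maps one input row (column name -> value)
    // to one destination field value.
    interface FieldRule extends Function<Map<String, String>, String> {}

    public static void main(String[] args) {
        // "Field A = columnX + columnY" from the question:
        FieldRule fieldA = row -> row.get("columnX") + row.get("columnY");
        // "first 3 characters of columnZ":
        FieldRule fieldB = row -> row.get("columnZ").substring(0, 3);

        Map<String, String> supplierRow =
            Map.of("columnX", "foo", "columnY", "bar", "columnZ", "quux");
        System.out.println(fieldA.apply(supplierRow)); // foobar
        System.out.println(fieldB.apply(supplierRow)); // quu
    }
}
```

The UI then becomes a builder for such rule objects (or for a small expression language that compiles to them), and all the validation I warned about above happens in one place.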
I am implementing a small CRM system, and the concept of data mining to predict and find opportunities and trends is essential for such systems. One data mining approach is clustering. This is a very small CRM project, using Java to provide the interface for information retrieval from the database.
My question is this: when I insert a customer into the database, I have a text field which allows customers to be tagged on their way into the database, i.e. at the registration point.
Would you regard this tagging technique as clustering? If so, is it a data mining technique?
I am sure there are complex APIs, such as the Java Data Mining API, that allow data mining. But for the sake of my project I just wanted to know whether tagging users with keywords (the way Stack Overflow allows tagging questions with keywords when posting) is a form of data mining, since through those tagged words one can easily find trends and patterns through searching.
To make it short, yes, tags are additional information that will make data mining easier to conduct later on.
They probably won't be enough, though. Tags are linked to entities and, depending on how you compute them, they might not show interesting relations between different entities. With your tagging system, the only usable relation I see is 'has same tag', and it might not be enough.
Clustering your data can be done using community detection techniques on graphs built using your data and relations between entities.
This example is in Python and uses the networkx library but it might give you an idea of what I'm talking about: http://perso.crans.org/aynaud/communities/
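Since your project is in Java, here is a rough sketch of the same idea without any library: build the 'has same tag' graph and use naive connected components as a crude stand-in for real community detection (the customers and tags are made up):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TagGraphSketch {
    public static void main(String[] args) {
        // Hypothetical customers and their tags.
        Map<String, Set<String>> tags = Map.of(
            "alice", Set.of("vip", "retail"),
            "bob",   Set.of("retail"),
            "carol", Set.of("wholesale"));

        // Build the 'has same tag' adjacency.
        Map<String, Set<String>> adj = new HashMap<>();
        for (String a : tags.keySet())
            for (String b : tags.keySet())
                if (!a.equals(b) && !Collections.disjoint(tags.get(a), tags.get(b)))
                    adj.computeIfAbsent(a, k -> new HashSet<>()).add(b);

        // Naive clusters: connected components via BFS.
        Set<String> seen = new HashSet<>();
        List<Set<String>> clusters = new ArrayList<>();
        for (String start : tags.keySet()) {
            if (!seen.add(start)) continue;
            Set<String> cluster = new HashSet<>();
            Deque<String> queue = new ArrayDeque<>(List.of(start));
            while (!queue.isEmpty()) {
                String cur = queue.poll();
                cluster.add(cur);
                for (String next : adj.getOrDefault(cur, Set.of()))
                    if (seen.add(next)) queue.add(next);
            }
            clusters.add(cluster);
        }
        System.out.println(clusters); // e.g. [[alice, bob], [carol]]
    }
}
```

Real community detection algorithms (like the Louvain method used in the link above) go further by weighting edges and optimizing modularity, but the graph-building step looks just like this.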
Yes, tagging is definitely one way of grouping your users. However, it's different from 'clustering.' Here's why: you're making a conscious decision on how you want to group them, but there may be better/different user groups, based on a range of behaviors, that may not be obvious to you.
Clustering methods are unsupervised learning methods that can help you uncover these patterns. These methods are “unsupervised” because you don’t have a specific target variable; rather, you want to find groups/ patterns that are most prominent in the data. You can feed CRM data to clustering algorithms to uncover ‘hidden’ relationships.
Also, if you’re using ‘tagging,’ it’s more of a descriptive analytics problem - you’ve well-defined groups in the data, and you’re identifying their behavior. Clustering would be a predictive analytics problem - algorithms will try to predict groups based on the user behavior they recognize in the data.
I am using the ISO 19794-2 fingerprint data format; all my data is in this format. I have more than a hundred thousand fingerprints, and I want an efficient search to identify a match. Is it possible to construct a binary-tree-like structure to perform an efficient (fastest) search for a match? Or can you suggest a better way to find the match? Also, please suggest an open source Java API for fingerprint matching. Thanks.
Do you have a background in fingerprint matching? It is not a simple problem and you'll need a bit of theory to tackle such a problem. Have a look at this introduction to fingerprint matching by Bologna University's BioLab (a leading research lab in this field).
Let's now answer your question: how to make the search more efficient.
Fingerprints can be classified into 5 main classes, according to the type of macro-singularity that they exhibit.
There are three types of macro-singularities:
whorl (a sort of circle)
loop (a U inversion)
delta (a sort of three-way crossing)
According to the position of those macro-singularities, you can classify the fingerprint in those classes:
arch
tented arch
right loop
left loop
whorl
Once you have narrowed the search to the correct class, you can perform your matches. From your question it looks like you have to do an identification task, so I'm afraid that you'll have to do all the comparisons, or else add some layers of pre-processing (like the classification I wrote about) to further narrow the search field.
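Concretely, that pre-processing layer can be as simple as bucketing templates by class; here is a sketch in Java (the matcher itself, which is the genuinely hard part, is left as a stub):

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

public class ClassifiedSearchSketch {
    enum FingerprintClass { ARCH, TENTED_ARCH, RIGHT_LOOP, LEFT_LOOP, WHORL }

    // A stored template plus its precomputed class.
    record Template(byte[] iso19794_2, FingerprintClass cls) {}

    private final Map<FingerprintClass, List<Template>> index =
        new EnumMap<>(FingerprintClass.class);

    void add(Template t) {
        index.computeIfAbsent(t.cls(), k -> new ArrayList<>()).add(t);
    }

    // Identification: only compare against templates in the same class,
    // cutting the number of full comparisons by roughly the class ratio.
    Template identify(byte[] probe, FingerprintClass probeClass) {
        for (Template candidate : index.getOrDefault(probeClass, List.of())) {
            if (matchScore(probe, candidate.iso19794_2()) > 0.9) {
                return candidate;
            }
        }
        return null;
    }

    // Stub: the actual minutiae matcher is the hard part.
    private double matchScore(byte[] probe, byte[] stored) {
        return 0.0;
    }
}
```

Note that classes are not uniformly distributed in the population (loops are far more common than arches), so the speedup is smaller than a naive 1/5.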
You can find lots of information about fingerprint matching in the book Handbook of Fingerprint Recognition, by Maltoni, Maio, Jain and Prabhakar - leading researchers in this field.
In order to read ISO 19794-2 format, you could use some utilities developed by NIST called BiomDI, Software Tools supporting Standard Biometric Data Interchange Formats. You could try to interface it with open source matching algorithms like the one found in this biometrics SDK. It would however need a lot of work, including the conversion from one format to another and the fine-tuning of algorithms.
My opinion (as a Ph.D. student working in biometrics) is that in this field you can easily write code that does the 60% of what you need in no time, but the remaining 40% will be:
hard to write (20%); and
really hard to write without money and time (20%).
Hope that helps!
Edit: added info about NIST BiomDI
Edit 2: since people sometimes email me asking for a copy of the standard, I unfortunately don't have one to share. All I have is a link to the ISO page that sells the standard.
The ISO format specifies useful mechanisms for matching, along with decision parameters. Decide on what mechanism you wish to employ to identify a match, and the relevant decision parameters.
When you have determined these mechanisms and decision parameters, examine them to see which are capable of being put into an order - ideally with a high degree of variation between individual values, as you want to avoid multiple collisions in the data. When you have identified a small number of data items (preferably one) with this property, calculate the property for each fingerprint - preferably as they are added to the database, though a bulk load can be done initially.
The search for a match is then done on the calculated characteristic, and can use a binary tree, a red-black tree, or a variety of other search structures. I cannot recommend a particular search strategy without knowing what form and degree of differentiation of values you have in your database. Such a search should, however, be capable of delivering a (small) range of possible matches, which can then be tested individually against your match mechanism and parameters before deciding on a specific match.
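In Java terms, that pre-filtering index might look like the following sketch; the characteristic function is a placeholder for whatever orderable property you settle on:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.TreeMap;

public class CharacteristicIndexSketch {
    // Ordered index from computed characteristic to fingerprints sharing it.
    private final TreeMap<Long, List<byte[]>> index = new TreeMap<>();

    // Placeholder: derive an orderable value from the ISO 19794-2 record,
    // e.g. a quantized minutiae count or ridge density.
    private long characteristic(byte[] template) {
        return template.length; // stand-in only
    }

    public void add(byte[] template) {
        index.computeIfAbsent(characteristic(template), k -> new ArrayList<>())
             .add(template);
    }

    // Returns the small candidate set whose characteristic falls within the
    // tolerance; each candidate is then tested with the full matcher.
    public List<byte[]> candidates(byte[] probe, long tolerance) {
        long c = characteristic(probe);
        List<byte[]> result = new ArrayList<>();
        for (Collection<byte[]> bucket :
                index.subMap(c - tolerance, true, c + tolerance, true).values()) {
            result.addAll(bucket);
        }
        return result;
    }
}
```

The TreeMap gives you the ordered, range-searchable structure described above; how well it narrows the field depends entirely on how discriminating your chosen characteristic is.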