Natural language processing to recognise numerical data

Natural language processing to recognise numerical data - java

My requirement is to recognize and extract numerical data from a natural language sentence (English only) in response to queries. Platform is Java. For example if the user query is "What is the height of mount Everest" and we have a paragraph as:
In 1856, the Great Trigonometric Survey of British India established the first published height of Everest, then known as Peak XV, at 29,002 ft (8,840 m). In 1865, Everest was given its official English name by the Royal Geographical Society upon recommendation of Andrew Waugh, the British Surveyor General of India at the time, who named it after his predecessor in the post, and former chief, Sir George Everest.[4] Chomolungma had been in common use by Tibetans for centuries, but Waugh was unable to propose an established local name because Nepal and Tibet were closed to foreigners. (Pasted from wikipedia)
For a user query "Height of mount Everest" from the paragraph I need to get 29002 ft or 8840 m as the answer. Can anyone please suggest any possible ways of doing it in Java? Are there any open source libraries for the same?

Obviously, doing this well is extremely difficult to do. If it's an assignment though then I'm guessing the expectation is a bit lower. Here are some thoughts to hopefully get you started:
I'd split the problem into 2 parts; parsing the question block and then passing the answer block. From the question block, you need to know 2 pieces of information, the noun of what you're searching for, and also the type of the answer. In this case the noun is Everest and the type is height. "Types" of data you can build a dictionary for fairly quickly to search your input string for (e.g. "height", "weight", "distance", "age"). The nouns are more difficult, so I'd say to just assume that every non-type in the question is a potential noun, perhaps removing a dictionary of known non-nouns (such as "at", "the", "of" etc.).
Once you've identified the noun and type from the question, you can begin scanning your answer block. I'd begin by breaking that up into sentences. Then scan each sentence for each of your nouns. If one is found in that sentence, you need to scan the sentence again for numbers (taking into account possible whitespace or comma delimiting). Finally, you need to look "around" any numbers you find for a measurement type. So in this case, your "type" that we parsed from the question was "height". You would need to create a mapping of types to measurements, so "height" would map "km, ft, in, cm, m" etc. If the number has one of these types around it, then return the number and measurement type as the answer.
Hope that gets you started. As stated above, this is not intended to be a robust, commercial solution. It's homework-level.

Related

Is there a good way to identify whether there is date information contained in a String

I had this problem of trying to identifying whether there is a date information contained in a paragraph. So here are the issues:
We don't know where the date string might appear. A paragraph would be something like "We would like set the appointment at Nov. 15th. Then we would .....". So we cannot directly use DateTime.parse()
The format of the date is arbitrary, it can be more formal forms like "Nov. 15th" or "08/21/1988" or "5th in this month".
It would be unlikely to cover all the cases given that the date information can have various forms, I just want to cover as many cases as possible. The lightweight solution I can come up with would be regular expressions I guess.... And again that would be a huge expression. Does anyone know if there are better solutions or available regular expressions for this?
(P.S. I would prefer more light weighted approaches, methods like machine learning might be more general but is not applicable to my task here)

I'd propably approach it with a regular expression (or multiple) as well.
I'd make the regular expression match regions that look date-like by matching everything around "th", "nd" "st", month/day names and abbreviations, dot/line/slash/colon separated numbers or such things. Experiment with that and see how good it finds dates with a ton of test-cases.
Parsing the possible dates is another story. I guess you'd need something as powerful as PHP's strtotime.
Another approach is to just clearly define a big collection of possible formats. Then, when one is detected, you can easily parse it. Feels too brute-force for me though

As a starting point, there are seven pages of date regexes over at http://regexlib.com. If you don't know which one you're looking for, I would create an array and apply them one at a time. You'll still have a problem with dates like 11/12/2015 vs. 12/11/2015 so some kind of process for clarification is still necessary (e.g., automatically mail back and ask "Do you mean December 11 or November 12?").

How to calculate similarity between Chamber of Commerce numbers?

I am working on an engine that does OCR post-processing, and currently I have a set of organizations in the database, including Chamber of Commerce Numbers.
Also from the OCR output I have a list of possible Chamber of Commerce (COC) numbers.
What would be the best way to search the most similar one? Currently I am using Levenshtein Distance, but the result range is simply too big and on big databases I really doubt it's feasibility. Currently it's implemented in Java, and the database is a MySQL database.
Side note: A Chamber of Commerce number in The Netherlands is defined to be an 8-digit number for every company, an earlier version of this system used another 4 digits (0000, 0001, etc.) to indicate an establishment of an organization, nowadays totally new COC numbers are being given out for those.
Example of COCNumbers:
30209227
02045251
04087614
01155720
20081288
020179310000
09053023
09103292
30039925
13041611
01133910
09063023
34182B01
27124701
List of possible COCNumbers determined by post-processing:
102537177
000450093333
465111338098
NL90223l30416l
NLﬂ0737D447B01
12juni2013
IBANNL32ABNA0242244777
lncassantNL90223l30416l10000
KvK13041611
BtwNLﬂ0737D447B01
A few extra notes:
The post-processing picks up words and word groups from the invoice, and those word groups are being concatenated in one string. (A word group is at it says, a group of words, usually denoted by a space between them).
The condition that the post-processing uses for it to be a COC number is the following: The length should be 8 or more, half of the content should be numbers and it should be alphanumerical.
The amount of possible COCNumbers determined by post-processing is relatively small.
The database itself can grow very big, up to 10.000s of records.
How would I proceed to find the best match in general? (In this case (13041611, KvK13041611) is the best (and moreover correct) match)

Doing this matching exclusively in MySQL is probably a bad idea for a simple reason: there's no way to use a regular expression to modify a string natively.
You're going to need to use some sort of scoring algorithm to get this right, in my experience (which comes from ISBNs and other book-identifying data).
This is procedural -- you probably need to do it in Java (or some other procedural programming language).
Is the candidate string found in the table exactly? If yes, score 1.0.
Is the candidate string "kvk" (case-insensitive) prepended to a number that's found in the table exactly? If so, score 1.0.
Is the candidate string the correct length, and does it match after changing lower case L into 1 and upper case O into 0? If so, score 0.9
Is the candidate string the correct length after trimming all alphabetic characters from either beginning or the end, and does it match? If so, score 0.8.
Do both steps 3 and 4, and if you get a match score 0.7.
Trim alpha characters from both the beginning and end, and if you get a match score 0.6.
Do steps 3 and 6, and if you get a match score 0.55.
The highest scoring match wins.
Take a visual look at the ones that don't match after this set of steps and see if you can discern another pattern of OCR junk or concatenated junk. Perhaps your OCR is seeing "g" where the input is "8", or other possible issues.
You may be able to try using Levenshtein's distance to process these remaining items if you match substrings of equal length. They may also be few enough in number that you can correct your data manually and proceed.
Another possibility: you may be able to use Amazon Mechanical Turk to purchase crowdsourced labor to resolve some difficult cases.

Simple physical quantity measurement unit parser for Java

I want to be able to parse expressions representing physical quantities like
g/l
m/s^2
m/s/kg
m/(s*kg)
kg*m*s
°F/(lb*s^2)
and so on. In the simplest way possible. Is it possible to do so using something like Pyparsing (if such a thing exists for Java), or should I use more complex tools like Java CUP?
EDIT: To answere MrD's question the goal is to make conversion between quantities, so for example convert g to kg (this one is simple...), or maybe °F/(kg*s^2) to K/(lb*h^2) supposing h is four hour and lb for pounds

This is harder than it looks. (I have done a fair amount of work here). The main problem is there is no standard (I have worked with NIST on units and although they have finally created a markup language few people use it). So it's really a form of natural language processing and has to deal with :
ambiguity (what does "M" mean - meters or mega)
inconsistent punctuation
abbreviations
symbols (e.g. "mu" for micro)
unclear semantics (e.g. is kg/m/s the same as kg/(m*s)?
If you are just creating a toy system then you should create a BNF for the system and make sure that all examples adhere to it. This will use common punctuation ("/", "", "(", ")", "^"). Character fields can be of variable length ("m", "kg", "lb"). Algebra on these strings ("kg" -> 1000"g" has problems as kg is a fundamental unit.
If you are doing it seriously then ANTLR (#Yaugen) is useful, but be aware that units in the wild will not follow a regular grammar due to the inconsistencies above.
If you are REALLY serious (i.e. prepared to put in a solid month), I'd be interested to know. :-)
My current approach (which is outside the scope of your question) is to collect a large number of examples from the literature automatically and create a number of heuristics.

How to read single question from text file containing several question and options as well as information?

I have a Text file which contains data (4 to 5 paragraph) and then contain 100 question along with each containing four option (a,b,c,d). Later it contain answer for above question (1.a,2.b and so on). My problem is I want to read each time a single question and options to display. Here when the user selects the option I want to check whether he choose the right one or not. Similarly I want to do this for all questions. And at last I want to display how many question are correct and how many are false.
My question is, how can I get each question and option separately? Similarly the answer. How to get information present in text file separately when required?
my file
Section I – General Questions
Instructions: choose the most correct answer.
1. The International Labour Organization’s first Maternity Protection Convention was issued in:
A.) 1952.
B.) 2000.
C.) 1919.
D.) 1997.
2. Beyond the 5th month of pregnancy, noise in excess of _______________________ may cause
hearing loss in the fetus:
A.) 90 dBA TWA per OSHA requirements
B.) 155 dBC peak and 115 dBC TWA per ACGIH® TLV® Booklet notes
C.) 75 dBA TWA per American Academy of Pediatrics guidelines
D.) 65 dBC peak per EPA studies
ANSWERS – Section I General Questions
1. C
2. B

I wouldn't worry about the format of the file at this point too much. Design your objects (presumably things like Question, Answer, Student etc etc). Get that all working with some test data and then worry about parsing the file. Your program objects don't need to know what the file will look like.

I imagine this text file uses some kind of delimiters to indicate the end of onw questions and the beginning of the next one. If so, I suggest:
Open the text file at the beginning of your program.
When you need to display the first question, read until the first delimiter (the one that occurs between the end of the 1st question and the beginning of the 2nd question).
When you have to show the next question, start from your current position (at the end of question #1), and read until the next delimiter (which would occur after question #2).
The answers can also be read similarly. If you have a special delimiter (say, an XML tag) which indicates the start of the answers, skip ahead up to that position. Then start reading the entries one-by-one, and parsing them.
Sorry for the somewhat vague answer :) If you could be more specific about your requirements, and the delimiters, etc., we can provide more detailed answers.

Use regular expression.
I do not know exact format of your file. I is easier if question can be recognized somehow. For example if the file has format like this:
This is the first question
a. aaaaa
b. bbbbb
c. ccccc
1.b
This is the second question
a. aaaaa
b. bbbbb
c. ccccc
2.c
You can write expression like this:
Pattern p = Pattern.compile("^\\d+\\.\\s+(.+?)(?:[abc]\\.\\s+(.+)){3}", Pattern.MULTILINE)
Now just iterate over the pattern:
Matcher m = p.matcher(text);
while (m.find()) {
String questionNumber = m.group(1);
String questionText = m.group(2);
String answerA = m.group(3);
String answerB = m.group(4);
String answerC = m.group(5);
// your code
}
Note that I have not debugged this pattern, so it obviously does not work. But I believe it gives you enough tips, so it is a good start.

comparing "the likes" smartly

Suppose you need to perform some kind of comparison amongst 2 files. You only need to do it when it makes sense, in other words, you wouldn't want to compare JSON file with Property file or .txt file with .jar file
Additionally suppose that you have a mechanism in place to sort all of these things out and what it comes down to now is the actual file name. You would want to compare "myFile.txt" with "myFile.txt", but not with "somethingElse.txt". The goal is to be as close to "apples to apples" rules as possible.
So here we are, on one side you have "myFile.txt" and on another side you have "_myFile.txt", "_m_y_f_i_l_e.txt" and "somethingReallyClever.txt".
Task is to pick the closest name to later compare. Unfortunately, identical name is not found.
Looking at the character composition, it is not hard to figure out what the relationship is. My algo says:
_myFile.txt to _m_y_f_i_l_e.txt 0.312
_myFile.txt to somethingReallyClever.txt 0.16
So _m_y_f_i_l_e.txt is closer to_myFile.txt then somethingReallyClever.txt. Fantastic. But also says that ist is only 2 times closer, where as in reality we can look at the 2 files and would never think to compare somethingReallyClever.txt with _myFile.txt.
Why?
What logic would you suggest i apply to not only figure out likelihood by having chars on the same place, but also test whether determined weight makes sense?
In my example, somethingReallyClever.txt should have had a weight of 0.0
I hope i am being clear.
Please share your experience and thoughts on this.
(whatever approach you suggest should not depend on number of characters filename consists out of)

Possibly helpful previous question which highlights several possible algorithms:
Word comparison algorithm
These algorithms are based on how many changes would be needed to get from one string to the other - where a change is adding a character, deleting a character, or replacing a character.
Certainly any sensible metric here should have a low score as meaning close (think distance between the two strings) and larger scores as meaning not so close.

Sounds like you want the Levenshtein distance, perhaps modified by preconverting both words to the same case and normalizing spaces (e.g. replace all spaces and underscores with empty string)

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.