Text processing to identify parts of speech - java

I have to write a program (in Java) to identify several parts of speech such as nouns, adjectives, verbs, etc. The program should also identify numbers written in digits (e.g. 10) and numbers written in plain English (ten, hundred, etc.), and much more. I'm not sure what the way forward is. Is there any library available that can help? Can this be done with regex alone? Or do I need to learn NLP?
Please suggest a way forward.

(1) OpenNLP
(2) LingPipe
(3) Stanford NLP
All three of the above (Java-based) will help you identify the POS out of the box.
For numbers use regular expressions.
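As a minimal sketch of the regex approach (plain JDK, no external libraries): a pattern can pick out digit-based numbers, while number words can be checked against a fixed list. The word list here is deliberately small and illustrative, not exhaustive:

```java
import java.util.Set;
import java.util.regex.Pattern;

public class NumberSpotter {
    // Matches integers and simple decimals, e.g. "10" or "3.14"
    private static final Pattern DIGITS = Pattern.compile("\\d+(\\.\\d+)?");

    // A small, illustrative set of English number words
    private static final Set<String> NUMBER_WORDS = Set.of(
        "zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "twenty", "hundred", "thousand", "million");

    public static boolean isNumber(String token) {
        return DIGITS.matcher(token).matches()
            || NUMBER_WORDS.contains(token.toLowerCase());
    }

    public static void main(String[] args) {
        System.out.println(isNumber("10"));      // true
        System.out.println(isNumber("hundred")); // true
        System.out.println(isNumber("cat"));     // false
    }
}
```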

Part-of-speech (POS) tagging is a pretty standard NLP task. You could in theory write regular expressions that would POS-tag very simple sentences, but you're unlikely to achieve reasonable coverage or accuracy with a regex-based model. You can do pretty well by training a reasonably simple HMM or a discriminative tagger on a hand-tagged training set.
But to tag a specific corpus, you don't necessarily need to learn all the details of POS tagging and roll your own - learning to use an existing library will probably suffice (e.g. NLTK or the Stanford NLP libraries).
Converting textual number representations to their Arabic-numeral form (or vice versa) falls under the label of 'text normalization'. Regular expressions (or other finite-state transducers) might be more useful there, although again, you might want to look for an existing solution that meets your needs before starting from scratch.
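To make the text-normalization idea concrete, here is a hedged sketch (plain JDK) that converts simple English number phrases to integers. The value tables are deliberately tiny and the grammar it accepts is simplistic; an existing library would handle far more cases:

```java
import java.util.Map;

public class NumberNormalizer {
    // Illustrative value tables; extend as needed
    private static final Map<String, Integer> UNITS = Map.of(
        "zero", 0, "one", 1, "two", 2, "three", 3, "four", 4,
        "five", 5, "six", 6, "seven", 7, "eight", 8, "nine", 9);
    private static final Map<String, Integer> TENS = Map.of(
        "ten", 10, "twenty", 20, "thirty", 30, "forty", 40, "fifty", 50);
    private static final Map<String, Integer> SCALES = Map.of(
        "hundred", 100, "thousand", 1000, "million", 1_000_000);

    /** Parses phrases like "two hundred five" or "twenty-one". */
    public static int parse(String phrase) {
        int total = 0, current = 0;
        for (String word : phrase.toLowerCase().split("[\\s-]+")) {
            if (UNITS.containsKey(word)) {
                current += UNITS.get(word);
            } else if (TENS.containsKey(word)) {
                current += TENS.get(word);
            } else if (SCALES.containsKey(word)) {
                int scale = SCALES.get(word);
                if (scale == 100) {
                    current *= 100;          // "two hundred" -> 200
                } else {
                    total += current * scale; // "three thousand" -> 3000
                    current = 0;
                }
            }
        }
        return total + current;
    }
}
```

For example, `parse("two hundred five")` accumulates 2, multiplies by 100 at "hundred", then adds 5.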

Related

Are there constituency parsers that do not aim for a full parse?

I am currently working on a set of report-style documents from which I want to extract information. At the moment, I am trying to divide the text body into smaller constituents for individual classification (what kind of information do we expect in the phrase). Because of the inaccurate grammar in which the reports are written, a standard constituency parser won't find a common root for the sentences. This obviously cries out for dependency parsing. I was, however, interested in whether there are constituency parsers which do not aim for a full parse of the sentence. Something along the lines of a probabilistic CKY parser which tries to return the most probable subtrees. I am currently working in the Python NLTK framework, but Java solutions would be fine as well.
Sounds like you're looking for "shallow parsing", or "chunking". A chunker might identify just the NPs in your text, or just NPs and VPs, etc. I don't believe NLTK provides a ready-to-use one, but it's pretty easy to train your own. Chapter 7 of the NLTK book provides detailed instructions for creating or training various types of chunkers. The chunks can even be nested if you want a bit of hierarchical structure.
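To illustrate what a chunker does, here is a toy NP chunker in Java that scans a POS tag sequence for the pattern DT? JJ* NN+ (optional determiner, any adjectives, one or more nouns). This crude hand-written pattern is purely for illustration; real chunkers are usually trained on data:

```java
import java.util.ArrayList;
import java.util.List;

public class ToyNPChunker {
    /** Returns [start, end) token ranges matching DT? JJ* NN+ . */
    public static List<int[]> chunk(String[] tags) {
        List<int[]> chunks = new ArrayList<>();
        int i = 0;
        while (i < tags.length) {
            int start = i;
            if (i < tags.length && tags[i].equals("DT")) i++;  // DT?
            while (i < tags.length && tags[i].equals("JJ")) i++; // JJ*
            int nouns = 0;
            while (i < tags.length && tags[i].equals("NN")) { i++; nouns++; } // NN+
            if (nouns > 0) {
                chunks.add(new int[]{start, i});
            } else {
                i = start + 1;  // no NP starting here; advance one token
            }
        }
        return chunks;
    }
}
```

For example, the tag sequence `DT JJ NN VBZ NN` ("the quick fox eats cheese") yields two NP chunks: tokens 0-2 and token 4.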

Preserve punctuation characters when using Lucene's StandardTokenizer

I'm thinking of leveraging Lucene's StandardTokenizer for word tokenization in a non-IR context.
I understand that this tokenizer removes punctuation characters. Would anybody know (or happen to have experience with) making it output punctuation characters as separate tokens?
Example of current behaviour:
Welcome, Dr. Chasuble! => Welcome Dr. Chasuble
Example of desired behaviour:
Welcome, Dr. Chasuble! => Welcome , Dr. Chasuble !
Generally, for custom tokenization of both IR and non-IR content, it is a good idea to use ICU (ICU4J is the Java version).
This would be a good place to start:
http://userguide.icu-project.org/boundaryanalysis
The tricky part is preserving the period as part of "Dr.". You would have to use the dictionary-based iterator; or, optionally, implement your own heuristic, either in your code or by creating your own iterator, which in ICU can be created as a file with a number of regexp-style definitions.
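The boundary-analysis model ICU4J uses is mirrored by the JDK's own `java.text.BreakIterator`, so a dependency-free sketch of the approach looks like the following. Note that the stock word iterator will split "Dr." into "Dr" and "." rather than keep them attached, which is exactly the tricky case described above:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class BoundaryDemo {
    /** Splits text on word boundaries, keeping punctuation as tokens. */
    public static List<String> tokens(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                 start = end, end = it.next()) {
            String piece = text.substring(start, end);
            if (!piece.isBlank()) out.add(piece);  // drop pure-whitespace segments
        }
        return out;
    }
}
```

Running `tokens("Welcome, Dr. Chasuble!")` keeps the comma and exclamation mark as separate tokens, matching the desired behaviour except for the "Dr." period.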
Consider using a tokenization tool from the NLP community instead. Such issues are usually well taken care of there.
Some off-the-shelf tools: Stanford CoreNLP (it has individual components for tokenization as well). UIUC's pipeline should also handle this elegantly.
http://cogcomp.cs.illinois.edu/page/software/

Sentence Classification (Categorization)

I have been reading about text classification and found several Java tools which are available for classification, but I am still wondering: is text classification the same as sentence classification?
Is there any tool which focuses on sentence classification?
There's no formal difference between 'text classification' and 'sentence classification'. After all, a sentence is a type of text. But generally, when people talk about text classification, IMHO they mean larger units of text such as an essay, review, or speech. Classifying a politician's speech as Democrat or Republican is a lot easier than classifying a tweet. When you have a lot of text per instance, you don't need to squeeze each training instance for all the information it can give you, and you can get pretty good performance out of a bag-of-words naive Bayes model.
Basically, you might not get the required performance numbers if you throw off-the-shelf Weka classifiers at a corpus of sentences. You might have to augment the data in each sentence with POS tags, parse trees, word ordering, n-grams, etc. Also gather any related metadata such as creation time, creation location, and attributes of the sentence's author. Obviously, all of this depends on what exactly you are trying to classify; the features that work out for you need to be intuitively meaningful to the problem at hand.
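As one concrete example of such augmentation, here is a dependency-free sketch that turns a token sequence into word-bigram features, which could then be fed to any classifier. The `w1_w2` feature-naming scheme is just an illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class BigramFeatures {
    /** Emits a "w1_w2" feature string for each adjacent token pair. */
    public static List<String> bigrams(String[] tokens) {
        List<String> feats = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.length; i++) {
            feats.add(tokens[i].toLowerCase() + "_" + tokens[i + 1].toLowerCase());
        }
        return feats;
    }
}
```

For example, `bigrams(new String[]{"The", "quick", "fox"})` yields `["the_quick", "quick_fox"]`; the same pattern extends to POS-tag n-grams.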

Java Parser for Natural Language

I am looking for a parser (or parser generator) in Java that is capable of the following:
I will provide sentences that are already part-of-speech tagged. I will use my own tag set.
I don't have any statistical data. So if the parser is statistical, I want to be able to use it without this feature.
Easily adaptable to other languages; low learning curve.
The Stanford Parser (which was listed on that other SO question) will do everything you list.
You can provide your own POS tags, but you will need to do some translation to the Penn Treebank tag set if they are not already in that format. Parsers are either statistical or they're not. If they're not, you need a set of grammar rules. No parsers are really built this way anymore, except as toys, because they are really Bad™. So, you can rely on the statistical data the Stanford Parser uses (with no additional work from you). This does mean, however, that statistics about your own tags (if they don't map directly to the Penn Treebank tags) will be ignored. But since you don't have statistics for your tags anyway, that should be expected.
They have parsers trained for several other languages too, but you will need your own tagged data if you want to go to a language they don't have available. There's no getting around that, no matter which parser you use.
If you know Java (and I assume you do), the Stanford Parser is very straightforward and easy to get going. Also their mailing list is a great resource and is fairly active.
I'm not entirely clear on what you want, but the first thing I thought of was MALLET:
http://mallet.cs.umass.edu/index.php

Online (preferably) lookup API of a word's class

I have a list of words and I want to filter it down so that I only have the nouns from that list of words (Using Java). To do this I am looking for an easy way to query a database of words for their type.
My question is: does anybody know of a free, easy word-lookup API that would enable me to find the class of a word, not necessarily its semantic definition?
Thanks!
Ben.
EDIT: By class of the word I meant 'part-of-speech' thanks for clearing this up
Word type? Such as verb, noun, adjective, etc? If so, you might run into the issue that some words can be used in more than one way. For example: "Can you trade me that card?", "That was a bad trade."
See this thread for some suggestions.
Have a look at this as well, seems like it might do exactly what you're looking for.
I think what you are looking for is the part of speech (POS) of a word. In general, that will not be possible to determine except in the context of a sentence. Many words can have several different potential parts of speech (e.g. 'bank' can be used as a verb or a noun).
You could use a POS tagger to get the information you want. However, the following part-of-speech taggers assume that you are tagging words within a well-structured English sentence...
The OpenNLP Java libraries are generally very good and released under the LGPL. There is a part-of-speech tagger for English and a few other languages included in the distribution. Just go to the project page to get the jar (and don't forget to download the models too).
There is also the Stanford part-of-speech tagger, written in Java under the GPL. I haven't had any direct experience with this library, but the Stanford NLP lab is generally pretty awesome.
Querying a database of words is going to lead to the problem that Ben S. mentions, e.g. is it lead (v., to show the way) or lead (n., Pb)? If you want to spend some time on the problem, look at part-of-speech tagging. There's some good info in another SO thread.
For English, you could use WordNet with one of the available Java APIs to find the lexical category of a word (which in NLP is most commonly called the part of speech). Using a dedicated POS tagger would be another option.
