Java Parser for Natural Language

I am looking for a parser (or a generated parser) in Java that is capable of the following:
I will provide sentences that are already part-of-speech tagged. I will use my own tag set.
I don't have any statistical data, so if the parser is statistical, I want to be able to use it without that feature.
It should be easily adaptable to other languages, with a low learning curve.

The Stanford Parser (which was listed on that other SO question) will do everything you list.
You can provide your own POS tags, but you will need to do some translation to the Penn TreeBank set if they are not already in that format. Parsers are either statistical or they're not. If they're not, you need a set of grammar rules. No parsers are really built this way anymore, except as toys, because they are really Bad™. So, you can rely on the statistical data the Stanford Parser uses (with no additional work from you). This does mean, however, that statistics about your own tags (if they don't map directly to the Penn TreeBank tags) will be ignored. But since you don't have statistics for your tags anyway, that should be expected.
They have parsers trained for several other languages too, but you will need your own tagged data if you want to go to a language they don't have available. There's no getting around that, no matter which parser you use.
If you know Java (and I assume you do), the Stanford Parser is very straightforward and easy to get going. Also their mailing list is a great resource and is fairly active.
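For the tag translation mentioned above, a minimal sketch might look like the following. The Penn TreeBank target tags are real; the custom source tags (SUBST, VERB_FIN, etc.) are made up for illustration, since your actual tag set isn't specified:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class TagTranslator {
    // Hypothetical mapping from a custom tag set to Penn TreeBank tags.
    private static final Map<String, String> CUSTOM_TO_PTB = Map.of(
        "SUBST", "NN",        // common noun, singular
        "SUBST_PL", "NNS",    // common noun, plural
        "VERB_FIN", "VBZ",    // finite verb, 3rd person singular
        "ADJ", "JJ",
        "DET", "DT"
    );

    /** Translates "word/TAG" tokens into "word/PTB_TAG" tokens. */
    public static List<String> translate(List<String> taggedTokens) {
        return taggedTokens.stream()
            .map(t -> {
                int slash = t.lastIndexOf('/');
                String word = t.substring(0, slash);
                String tag = t.substring(slash + 1);
                // Unknown tags pass through unchanged.
                return word + "/" + CUSTOM_TO_PTB.getOrDefault(tag, tag);
            })
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(translate(List.of("the/DET", "dog/SUBST", "barks/VERB_FIN")));
        // [the/DT, dog/NN, barks/VBZ]
    }
}
```

Once the tokens carry Penn TreeBank tags, you can hand them to the parser directly.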

I'm not very clear on what you'd want, but the first thing I thought of was Mallet:
http://mallet.cs.umass.edu/index.php

Related

Is there an easy and standard way to customize Lucene snowball stemmer?

I'm using Lucene 7.x and ItalianStemmer. I have looked at the code of the ItalianStemmer class, and it seems like it would take a long time to understand. So I'm looking for a quick (ideally standard) way to customize the Italian stemmer, without extending ItalianStemmer or SnowballProgram, because I only have a few days.
The point is that I don't understand why the name "saluto" (greeting) is stemmed to "sal". It should be stemmed to "salut", since the verb "salutare" (greet) is stemmed to "salut". Moreover, "sala" (room) and "sale" (rooms) are also stemmed to "sal", which is confusing, because they have a different meaning.
The standard way would be to copy the source, and create your own.
Stemming is a heuristic process, based on rules. It is designed to generate stems that, while imperfect, are usually good enough to facilitate search. It doesn't have a dictionary of conjugated words and their stems for you to modify. -uto is one of the verb suffixes removed from words by the Italian snowball stemmer, as described here. You could create your own version removing that suffix from the list, but you are probably going to create more problems than you solve, all told.
A tool that returns the correct root word would generally be called a lemmatizer, and I don't believe any come with Lucene out of the box. The morphological analysis tends to be slower and more complex. If it's important to your use case, you might want to look for an Italian lemmatizer and work it into a custom filter, or preprocess your text before passing it off to the analyzer.
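To see why suffix stripping conflates these words, here is a toy illustration. This is not the real Snowball algorithm (which works in several steps over defined word regions); it just applies a handful of the Italian suffixes, checked in order:

```java
import java.util.List;

public class ToySuffixStripper {
    // A few of the suffixes the Italian Snowball stemmer removes;
    // an illustrative subset only, checked longer suffixes first.
    private static final List<String> SUFFIXES =
        List.of("uto", "are", "ato", "a", "e", "o");

    public static String stem(String word) {
        for (String suffix : SUFFIXES) {
            // Require a minimal remaining stem of 3 characters.
            if (word.endsWith(suffix) && word.length() - suffix.length() >= 3) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(stem("saluto"));   // sal  -- "-uto" is treated as a verb suffix
        System.out.println(stem("salutare")); // salut
        System.out.println(stem("sala"));     // sal
        System.out.println(stem("sale"));     // sal  -- three different words, one stem
    }
}
```

Because "saluto" happens to end in the participle suffix "-uto", a rule-based stemmer cannot tell it apart from a verb form, and "sala"/"sale" collapse onto the same stem. That conflation is inherent to the approach, not a bug in Lucene.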

Are there constituency parsers that do not aim for a full parse?

I am currently working on a set of report-style documents from which I want to extract information. At the moment, I am trying to divide the text body into smaller constituents for individual classification (what kind of information do we expect in the phrase). Because of the inaccurate grammar in which the reports are written, a standard constituency parser won't find a common root for the sentences. This obviously cries out for dependency parsing. I was, however, interested in whether there are constituency parsers that do not aim for a full parse of the sentence. Something along the lines of probabilistic CKY that tries to return the most probable subtrees. I am currently working in the Python nltk framework, but Java solutions would be fine as well.
Sounds like you're looking for "shallow parsing", or "chunking". A chunker might just identify NPs in your text, or just NPs and VPs, etc. I don't believe the nltk provides a ready-to-use one, but it's pretty easy to train your own. Chapter 7 of the nltk book provides detailed instructions for creating or training various types of chunkers. The chunks can even be nested if you want a bit of hierarchical structure.
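As a sketch of what such a chunker does, here is a hand-rolled Java analogue of the regex-over-tags pattern DT? JJ* NN+ that the nltk book uses for simple NP chunking (tags are Penn TreeBank; input is "word/TAG" tokens):

```java
import java.util.List;

public class NounPhraseChunker {
    /**
     * Groups runs matching DT? JJ* NN+ into [NP ...] brackets.
     * A toy version of the rule-based chunkers described in
     * chapter 7 of the nltk book, ported to plain Java.
     */
    public static String chunk(List<String> tagged) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < tagged.size()) {
            int start = i;
            if (tag(tagged.get(i)).equals("DT")) i++;                          // optional determiner
            while (i < tagged.size() && tag(tagged.get(i)).equals("JJ")) i++;  // adjectives
            int nounStart = i;
            while (i < tagged.size() && tag(tagged.get(i)).startsWith("NN")) i++; // nouns
            if (i > nounStart) {          // at least one noun: emit an NP chunk
                out.append("[NP");
                for (int j = start; j < i; j++) out.append(' ').append(tagged.get(j));
                out.append("] ");
            } else {                      // no noun: emit one token unchunked, retry from there
                i = start + 1;
                out.append(tagged.get(start)).append(' ');
            }
        }
        return out.toString().trim();
    }

    private static String tag(String token) {
        return token.substring(token.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        System.out.println(chunk(List.of("the/DT", "big/JJ", "dog/NN", "barks/VBZ")));
        // [NP the/DT big/JJ dog/NN] barks/VBZ
    }
}
```

The point is that a chunker never needs a root for the whole sentence, so ungrammatical input only degrades it locally.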

Is there a way I can modify a ParseTree and its accompanying TokenStream?

My question is both a language-implementation question and an ANTLR4 API question. Is there a way I can modify a ParseTree and its accompanying TokenStream?
Here is the scenario. I have a simple language that defines a dataflow program. You can see it on github, if you're curious. I lex and parse the language with ANTLR4. I use listeners to walk the parse tree and evaluate the code.
The problem I have most recently run into is I need to be able to modify the code at runtime. I need to be able to define new objects and create instances from them. Note, I'm not referring to having reflection in the language. I'm referring to having a program like an IDE modify the internal representation of the source code.
I have started off down the path of defining a bunch of definition objects to create an AST, but I just realized this approach will require me to come up with my own solutions for walking the AST. Rather than reinvent the wheel, I'd rather use ANTLR's listeners/visitors.
Another problem I face is the need to be able to output the current state the AST as code at any point in time (The tool I'm embedding the language in needs to be able to save.) I am using StringTemplate to generate the code from my definition objects. I think I should be able to make ST render the parse tree.
In general, I need to able to lex, parse, evaluate, refactor, evaluate, and generate code all from within my runtime.
Rather than create my own definition objects, I'm wondering: what is the best approach to modifying the ParseTree/TokenStream?
I checked out your language. It looks pretty simple, and I'm assuming it is.
From your description, I'm working on the basis that the IDE will operate directly on the tree. Given that, you need:
A parser for your language, to convert source code into a tree. ANTLR can do this, but you may need to build your own tree rather than rely on what is provided. Writing your own parser is not that hard.
Tree rewriting rules: a series of permitted transformations of the tree that ensure it remains valid. This can be in the form of an API that operates on the tree. This page may be helpful: http://www.program-transformation.org/Transform/TreeRewriting
Source code generation: the ability to reconstruct source code from the tree. If you need to preserve comments and formatting, then either (a) include them in the grammar, or (b) merge generated source code with authored code.
The end result is that you can round-trip any change in either the source code or the generated tree.
If you think this is a lot of work, it is. However, it is the basis for many other tools that aim to provide these facilities. The saving grace is that your language is simple.
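As a tiny illustration of the round-trip idea (using a hypothetical mini-language with a single statement type, not your actual grammar): each node knows how to print itself back as source, so any edit to the tree can immediately be saved again as code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Every node can regenerate the source it represents.
abstract class Node {
    abstract String toSource();
}

// One made-up statement type: "from -> to;" connects two dataflow nodes.
class Connect extends Node {
    final String from, to;
    Connect(String from, String to) { this.from = from; this.to = to; }
    String toSource() { return from + " -> " + to + ";"; }
}

class Program extends Node {
    final List<Node> statements = new ArrayList<>();
    String toSource() {
        return statements.stream().map(Node::toSource).collect(Collectors.joining("\n"));
    }
}

public class RoundTrip {
    public static void main(String[] args) {
        Program p = new Program();
        p.statements.add(new Connect("source", "filter"));
        // An IDE-style edit: insert a new statement into the tree...
        p.statements.add(new Connect("filter", "sink"));
        // ...and regenerate source from the edited tree.
        System.out.println(p.toSource());
    }
}
```

This only round-trips canonical formatting; preserving the user's original comments and whitespace requires one of the two strategies above.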
We have no means to alter a parse tree safely at the moment; it's best to derive a new version from the old. You should also look at the token stream rewrite engine (TokenStreamRewriter).
Ter
The short answer to your question is yes, but as you surmise, it will be painful and fragile.
A far more workable alternative, unless you have a hard requirement explicitly to modify the 'internal' representation of the source code/parse tree, is to simply re-parse the modified source code between (virtually) every keystroke. Just grab the current contents of the active editor as a String and feed it as a stream to the Lexer.
While this may sound like an expensive operation, Antlr is actually quite fast. Eclipse in particular works well with this approach - I have used it with a number of DSL editors without any noticeable impact on editor performance. The parse occurs entirely on a background thread. Editor problemMarkers are only updated when there is a sufficient pause in the foreground editing thread. NetBeans should be similar.

Text processing to identify parts of speech

I have to write a program (in Java) to identify several parts of speech, like nouns, adjectives, verbs, etc. The program should also identify numbers in numeric form (e.g. 10) and numbers written in plain English (ten, hundred, etc.), and much more. I'm not sure what the way forward is. Is there any library available that can help? Can this be done with regex alone? Or do I need to learn NLP?
Please suggest a way forward.
(1) OpenNLP
(2) LingPipe
(3) Stanford NLP
All three of the above (Java-based) will help you identify parts of speech out of the box.
For numbers, use regular expressions.
Part-of-speech (POS) tagging is a pretty standard NLP task. You could in theory write regular expressions that would POS-tag very simple sentences, but you're unlikely to achieve reasonable coverage or accuracy with a regex model. You can do pretty well training a reasonably simple HMM model or a discriminative tagger on a hand-tagged training set.
But to tag a specific corpus, you don't necessarily need to learn all the details of POS tagging and roll your own - learning to use an existing library will probably suffice (e.g. NLTK or the Stanford NLP libraries).
Converting textual number representations to their Arabic-numeral form (or vice versa) falls under the label of 'text normalization'. Regular expressions (or other finite-state transducers) might be more useful there, although again, you might want to look for an existing solution that meets your needs before you start from scratch.
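A minimal sketch of that kind of normalization: a regex for digit strings plus a word lookup. The word list here is deliberately tiny; a real normalizer also needs compounds ("twenty-one") and scale words ("two hundred"):

```java
import java.util.Map;
import java.util.regex.Pattern;

public class NumberNormalizer {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Illustrative lexicon only; extend as needed.
    private static final Map<String, Integer> WORDS = Map.of(
        "one", 1, "two", 2, "ten", 10, "hundred", 100
    );

    /** Returns the numeric value of a token, or -1 if it is not a number. */
    public static int parseNumber(String token) {
        if (DIGITS.matcher(token).matches()) {
            return Integer.parseInt(token);
        }
        return WORDS.getOrDefault(token.toLowerCase(), -1);
    }

    public static void main(String[] args) {
        System.out.println(parseNumber("10"));      // 10
        System.out.println(parseNumber("ten"));     // 10
        System.out.println(parseNumber("hundred")); // 100
        System.out.println(parseNumber("dog"));     // -1
    }
}
```

Notice that neither case needs any statistical machinery, which is why regex is the right tool for this sub-problem even if you use an NLP library for the tagging.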

Parsing a simple language for title formatting

I'm writing an audio player in Java, and I want a title formatter similar to foobar2000's. Examples are:
$if($strcmp(%album artist%,%artist%),%artist%,$if2(%album%,Unknown))
or
[%album%][ '('CD $ifgreater(%totaldiscs%,1,%discnumber%,)')']
The first example returns a string, and if it is the same for sequential tracks in a playlist, they are grouped; the second example formats the album name and adds a CD number if one exists. The full reference is here.
I do not need everything but at least some of the functionality.
How difficult is it to write a parser for such a language? Are there any compiler-compilers for Java?
Would it be easier to use Java's JavaScript engine? How fast would it be (processing several thousand tracks)?
Have you considered using a templating engine such as Freemarker or Velocity? Their markup languages aren't exactly what you specified, but they're very flexible. Might be too verbose if it's for end-users, though.
There are some tools around for parsing custom languages. One of the common tools is the parser generator antlr.
To answer your question: is it difficult? Yes, it is. Parsing the statements into a syntax tree is just the first step. You still need to write the interpreter, and this foobar2000 title-generator syntax looks pretty complex!
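To give a feel for the work involved, here is a sketch of a recursive-descent interpreter for a tiny subset of that syntax: %field% lookups and $if2(x,y) (first argument if non-empty, otherwise the second). The real language also needs [...] optional sections, $if, $strcmp, quoting, and error handling, so treat this as a starting point, not a solution:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class TitleFormat {
    private final String src;
    private final Map<String, String> tags;
    private int pos = 0;

    private TitleFormat(String src, Map<String, String> tags) {
        this.src = src;
        this.tags = tags;
    }

    public static String eval(String fmt, Map<String, String> tags) {
        return new TitleFormat(fmt, tags).parse("");
    }

    // Evaluates until end of input or any char in `stops` (left unconsumed).
    private String parse(String stops) {
        StringBuilder out = new StringBuilder();
        while (pos < src.length() && stops.indexOf(src.charAt(pos)) < 0) {
            char c = src.charAt(pos);
            if (c == '%') {                  // %field% -> tag lookup, empty if missing
                int end = src.indexOf('%', pos + 1);
                out.append(tags.getOrDefault(src.substring(pos + 1, end), ""));
                pos = end + 1;
            } else if (c == '$') {           // $name(arg,...) -> function call
                int open = src.indexOf('(', pos);
                String name = src.substring(pos + 1, open);
                pos = open + 1;
                List<String> args = new ArrayList<>();
                while (true) {               // arguments are themselves sub-expressions
                    args.add(parse(",)"));
                    if (src.charAt(pos++) == ')') break;
                }
                out.append(apply(name, args));
            } else {                         // plain literal character
                out.append(c);
                pos++;
            }
        }
        return out.toString();
    }

    private String apply(String name, List<String> args) {
        if (name.equals("if2")) {
            return args.get(0).isEmpty() ? args.get(1) : args.get(0);
        }
        throw new IllegalArgumentException("unknown function: $" + name);
    }

    public static void main(String[] args) {
        Map<String, String> tags = Map.of("artist", "Miles Davis");
        System.out.println(eval("$if2(%album%,Unknown) - %artist%", tags));
        // Unknown - Miles Davis
    }
}
```

Because arguments are evaluated by recursive calls to parse(), nested functions like the $if inside the first example fall out of the same structure; each new function is just another case in apply().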
