I want to write Java code to build a LALR parser for my grammar. Can someone please suggest some books or some links where I can learn how to write Java code for a LALR parser?
Writing a LALR parser by hand is difficult, but it can be done. If you want to learn the theory behind constructing LALR parsers by hand, consider looking into "Parsing Techniques: A Practical Guide" by Grune and Jacobs. It's an excellent book on general parsing techniques, and the chapter on LR parsing is particularly good.
If you're more interested in just getting a LALR parser that is written in Java, consider looking into Java CUP, which is a general-purpose parser generator for Java.
Hope this helps!
You can split the LALR functionality in two parts: preparation of the tables and parsing the input.
The first part is complex and error-prone, so even if you like knowing how it works, I suggest using a proven, working table generator for the LALR states (and for the tokenizer DFA as well).
The second part consists of consuming those tables using some quite simple algorithms to tokenize and process the input into a parse tree/concrete syntax tree. This is easier to implement yourself if you like to do so, and you still have full control over how it works and what it does.
When doing parsing tasks, I personally use the free GOLD Parsing System. It has a nice UI for creating and debugging the grammar, and it also generates table files that can be loaded and processed by an existing engine or by your own implementation (the file format for these CGT files is well documented).
As previously stated, you would always use a parser generator to produce an LALR parser. A few such tools for Java are:
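To give an idea of how small the "second part" is, here is a minimal sketch of a table-driven shift/reduce loop in Java. The grammar (S -> ( S ) | x), the hand-written tables, and the class name TinyLrParser are all assumptions made up for illustration; a real generator such as GOLD or CUP would produce much larger tables in its own format.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative only: the tables below were worked out by hand for the toy grammar
//   rule 1: S -> ( S )
//   rule 2: S -> x
final class TinyLrParser {
    private static final int ACCEPT = Integer.MAX_VALUE;

    // Number of symbols on the right-hand side of each rule (index 0 is unused).
    private static final int[] RULE_LENGTH = {0, 3, 1};

    // ACTION[state][terminal]: n > 0 means "shift and go to state n",
    // n < 0 means "reduce by rule -n", 0 means "syntax error".
    // Columns: '(' = 0, ')' = 1, 'x' = 2, end of input = 3.
    private static final int[][] ACTION = {
        { 2,  0, 3,  0      },  // state 0: start
        { 0,  0, 0,  ACCEPT },  // state 1: S recognised at top level
        { 2,  0, 3,  0      },  // state 2: after '('
        { 0, -2, 0, -2      },  // state 3: after 'x'
        { 0,  5, 0,  0      },  // state 4: after '(' S
        { 0, -1, 0, -1      },  // state 5: after '(' S ')'
    };

    // GOTO for the only nonterminal S, indexed by state; -1 means "no entry".
    private static final int[] GOTO_S = {1, -1, 4, -1, -1, -1};

    static boolean parse(String input) {
        Deque<Integer> states = new ArrayDeque<>();
        states.push(0);
        int pos = 0;
        while (true) {
            // Map the current character to a table column; 3 is the end marker.
            int column = pos < input.length() ? "()x".indexOf(input.charAt(pos)) : 3;
            if (column < 0) {
                return false;                      // unknown character
            }
            int action = ACTION[states.peek()][column];
            if (action == ACCEPT) {
                return true;
            } else if (action > 0) {               // shift
                states.push(action);
                pos++;
            } else if (action < 0) {               // reduce
                for (int i = 0; i < RULE_LENGTH[-action]; i++) {
                    states.pop();
                }
                int next = GOTO_S[states.peek()];
                if (next < 0) {
                    return false;
                }
                states.push(next);
            } else {
                return false;                      // syntax error
            }
        }
    }
}
```

A real engine would also build a parse tree node on every reduce instead of just popping states, but the control flow stays the same.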
SableCC (my personal favourite)
CUP
Beaver3
SJPT
Gold
Just want to mention that my project CookCC ( http://coconut2015.github.io/cookcc/ ) is a LALR(1) parser + Lexer (much like flex).
The unique feature of CookCC is that you can write your lexer and parser in Java using Java annotations. See the calculator example here: https://github.com/coconut2015/cookcc/blob/master/tests/javaap/calc/Calculator.java
Related
I am developing a multi-mode resource-constrained project scheduling solver in Java. I was looking for test instances, but I only found this. It is in a .mm file, which is an extension associated with a C++ compiler. Is there any way to transform this data into something easily readable by Java, like XML or JSON?
As suggested you could of course parse the file as a text file. Alternatively the two other main approaches would be:
Use clang/llvm's abstract syntax tree (AST) to interpret the data in the file.
Use an Objective-C++ grammar for a compiler generator like yacc or, since you're using Java, JavaCC. This will also yield a syntax tree that you can then walk and extract information from.
I am attempting to parse a nested file format in Java.
The file format looks like this:
head [
A [
property value
property2 value
property3 [
... down the rabbit hole ...
]
]
... more As ...
B [
.. just the same as A
]
... more Bs ...
]
What is the best/easiest technique to parse this into my program?
Finite State Machine?
Manually read it word by word and keep track of what part of the structure I am in?
Write a grammar...?
As a side note, I have no control over the format - because I knew someone would say it!
If the grammar is indeed nested like this, writing a very simple top-down parser would be a trivial task: you have very few tokens to recognize, and the nested structure repeats itself very conveniently for a textbook recursive-descent parser.
I would not even bother with ANTLR or another parser generator for something this simple, because the learning curve would eat the potential benefits for the project*.
* Potential benefits for you from learning a parser generator are hard to overestimate: if you can spend a day or two learning to build parsers with ANTLR, your view of structured text files will change forever.
I second the recommendation to take a look at Antlr. StAX adds SAX-like event handling.
http://www.antlr.org/wiki/display/ANTLR3/Interfacing+StAX+to+ANTLR
Yes, there is a learning curve, but by the time you handled all the odd cases and debugged your code, you'd probably break even -- plus you'd have a new item on your resume.
Arguably the easiest way to parse files of these kinds is using a recursive descent parser (http://en.m.wikipedia.org/wiki/Recursive_descent_parser). I guess this is what you mean by manually reading and keeping track of the structure you have found.
A finite state machine wouldn't work if you have to be able to deal with unlimited nesting. If there are only two levels it could be enough.
Writing a grammar and generating a parser would also work, but if you haven't done that before or don't have the time to learn how to use the tools it's probably overkill...
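For what it's worth, here is a rough sketch of what such a recursive descent parser could look like in Java. It assumes that every token (names, values, "[" and "]") is separated by whitespace and that values never contain spaces; the question doesn't say either way, so adjust the tokenizing if that's wrong. The class and field names are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class NestedFormatParser {
    /** One entry: either "name value" (a leaf) or "name [ ... ]" (a block). */
    static final class Node {
        final String name;
        final String value;                        // null for blocks
        final List<Node> children = new ArrayList<>();

        Node(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    private final List<String> tokens;
    private int pos = 0;

    private NestedFormatParser(List<String> tokens) {
        this.tokens = tokens;
    }

    static Node parse(String text) {
        // Assumption: tokens are whitespace-separated.
        List<String> tokens = Arrays.asList(text.trim().split("\\s+"));
        NestedFormatParser parser = new NestedFormatParser(tokens);
        Node root = parser.parseEntry();
        if (parser.pos != tokens.size()) {
            throw new IllegalArgumentException("Unexpected token: " + tokens.get(parser.pos));
        }
        return root;
    }

    // entry := NAME '[' entry* ']'  |  NAME VALUE
    private Node parseEntry() {
        String name = next();
        if (name.equals("[") || name.equals("]")) {
            throw new IllegalArgumentException("Expected a name but found " + name);
        }
        String second = next();
        if (second.equals("]")) {
            throw new IllegalArgumentException("Missing value for " + name);
        }
        if (!second.equals("[")) {
            return new Node(name, second);         // simple "property value" pair
        }
        Node block = new Node(name, null);
        while (!peek().equals("]")) {
            block.children.add(parseEntry());      // recursion handles any nesting depth
        }
        next();                                    // consume the closing ']'
        return block;
    }

    private String peek() {
        if (pos >= tokens.size()) {
            throw new IllegalArgumentException("Unexpected end of input");
        }
        return tokens.get(pos);
    }

    private String next() {
        String token = peek();
        pos++;
        return token;
    }
}
```

Calling NestedFormatParser.parse(fileContents) gives you the root "head" node, and you can walk its children to build whatever data model you need.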
The fastest approach is to use an existing format that already looks like this, e.g. JSON or YAML. These formats handle nesting and are well supported.
As a side note, I have no control over the format
If you want to know the best way to parse something that is like YAML but isn't, read the code for a simple YAML parser.
Just parsing the file is unlikely to be enough; you will also want to trigger events or generate a data model from the data you load.
(Sorry, not sure if ad-hoc is the right word here ... open for a better suggestion)
I'm trying to parse the Galaxy ToolConfig XML CLI tool wrapper format in a Java app, for replicating (in part) the behaviour of the Galaxy software itself.
The format includes some "free-text" if/else clauses, inside the command tag (that's the only place they occur, AFAIK):
...
<command interpreter="python">
sam_to_bam.py
--input1=$source.input1
--dbkey=${input1.metadata.dbkey}
#if $source.index_source == "history":
--ref_file=$source.ref_file
#else
--ref_file="None"
#end if
--output1=$output1
--index_dir=${GALAXY_DATA_INDEX_DIR}
</command>
...
What would be a recommended strategy for parsing this if/else structure into something that can be used to remodel the if/else logic in Java?
Is BNF/ANTLR overkill, better just to parse into some object structure, or? Any design patterns that would fit here? (Haven't worked with BNF/ANTLR before, but am willing to look into it if it will be worth it).
If you want to capture all the structure of your input, a parser is the only way to go. One can code a top-down recursive parser manually, but there is little point in doing that, which is why parser generator tools exist; use them.
Regarding the #if #then #else: if that's the only structure you want to capture, then you need only a pretty primitive grammar that also allows tokens containing arbitrary text to pick up the goo between the #if#then#else constructs as a blob of text.
If you want to capture all code structure, and the conditionals are only allowed in certain places, then their existence can be simply integrated into whatever BNF you are using.
If, as I suspect, these can occur anywhere ("ad hoc"? the #if follows C preprocessor style, and those conditionals can occur virtually anywhere in the input stream), then parsing the text and retaining the conditionals is presently at the bleeding edge of what state-of-the-art parsing can do. This is the standard C-preprocessing disease, and there have been no good solutions to this. Standard parser generators pretty much can't help in this case. (Hand-coded parsers don't fare better here either; the same kind of solution has to be used in either case).
One of the recent schemes (just reported as PhD research results in the last few months) to handle this is to fork the parse whenever a #if token is found, to handle both the #if and the #else arms, and to join when #endif is found; you then need a way to fuse the generated subtrees, typically as ambiguous subtrees marked with the arm of the conditional they came from.
If you want to get on with your life, I suggest you simply insist that these conditionals occur in well-defined places in your grammar, and put up with the occasional complaint from people that write unstructured preprocessor directives. ("You wrote crazy code? Sorry, my tool doesn't handle it").
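As a rough illustration of the "primitive grammar" option, here is a sketch in Java that treats everything between the directives as text blobs. It assumes that each #if ...:, #else and #end if sits on its own line, as in the example in the question; the class and field names are invented for this sketch and have nothing to do with Galaxy's own code.

```java
import java.util.ArrayList;
import java.util.List;

final class ConditionalBlockParser {
    /** Either a text blob (text != null) or an if/else node (condition != null). */
    static final class Node {
        final String text;
        final String condition;
        final List<Node> thenBranch = new ArrayList<>();
        final List<Node> elseBranch = new ArrayList<>();

        Node(String text, String condition) {
            this.text = text;
            this.condition = condition;
        }
    }

    private final String[] lines;
    private int pos = 0;

    private ConditionalBlockParser(String[] lines) {
        this.lines = lines;
    }

    static List<Node> parse(String command) {
        ConditionalBlockParser parser = new ConditionalBlockParser(command.split("\\R"));
        List<Node> nodes = parser.parseBlock();
        if (parser.pos < parser.lines.length) {
            throw new IllegalArgumentException("Unmatched directive: " + parser.lines[parser.pos].trim());
        }
        return nodes;
    }

    // Reads nodes until "#else", "#end if" or end of input; the terminator is left unconsumed.
    private List<Node> parseBlock() {
        List<Node> nodes = new ArrayList<>();
        while (pos < lines.length) {
            String line = lines[pos].trim();
            if (line.equals("#else") || line.equals("#end if")) {
                break;
            }
            pos++;
            if (line.startsWith("#if ")) {
                nodes.add(parseConditional(line));
            } else if (!line.isEmpty()) {
                nodes.add(new Node(line, null));   // plain text blob
            }
        }
        return nodes;
    }

    // Called after the "#if ..." line has been consumed.
    private Node parseConditional(String ifLine) {
        String condition = ifLine.substring("#if ".length()).trim();
        if (condition.endsWith(":")) {
            condition = condition.substring(0, condition.length() - 1).trim();
        }
        Node node = new Node(null, condition);
        node.thenBranch.addAll(parseBlock());
        if (pos < lines.length && lines[pos].trim().equals("#else")) {
            pos++;
            node.elseBranch.addAll(parseBlock());
        }
        if (pos >= lines.length || !lines[pos].trim().equals("#end if")) {
            throw new IllegalArgumentException("Missing #end if for: " + ifLine);
        }
        pos++;
        return node;
    }
}
```

The condition is kept as an uninterpreted string here; evaluating it against your own variable bindings is a separate step.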
I would like to be able to parse XML that isn't necessarily well-formed. I'd be looking for a fuzzy rather than a strict parser, able to recover from badly nested tags, for example. I could write my own but it's worth asking here first.
Update:
What I'm trying to do is extract links and other info from HTML. In the case of well-formed XML I can use the Scala XML API. In the case of ill-formed XML, it would be nice to somehow convert it into correct XML (somehow) and deal with it the same way, otherwise I'd have to have two completely different sets of functions for dealing with documents.
Obviously, because the input is not well-formed and I'm trying to create a well-formed tree, there would have to be some heuristic involved (such as when you see <parent><child></parent> you would close the <child> first, and when you then see a </child> you ignore it). But of course this isn't a proper grammar and so there's no correct way of doing it.
What you're looking for would not be an XML parser. XML is very strict about nesting, closing, etc. One of the other answers suggests Tag Soup. This is a good suggestion, though technically it is much closer to a lexer than a parser. If all you want from XML-ish content is an event stream without any validation, then it's almost trivial to roll your own solution. Just loop through the input, consuming content which matches regular expressions along the way (this is exactly what Tag Soup does).
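A minimal sketch of such a roll-your-own event loop, written in Java with java.util.regex (it is callable from Scala as-is). The regular expression, the class name LenientTagLexer and the "START/END/TEXT" event strings are my own simplifications for illustration; this is not how any particular library does it, and it ignores comments, CDATA and attribute values containing ">".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LenientTagLexer {
    // group 1: "/" for a closing tag, group 2: tag name, group 3: the rest (attributes, "/").
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9:_-]*)([^>]*)>");

    /** Turns XML-ish input into a flat list of events, with no nesting checks at all. */
    static List<String> events(String input) {
        List<String> events = new ArrayList<>();
        Matcher matcher = TAG.matcher(input);
        int last = 0;
        while (matcher.find()) {
            addText(events, input.substring(last, matcher.start()));
            String name = matcher.group(2);
            boolean closing = !matcher.group(1).isEmpty();
            boolean selfClosing = matcher.group(3).trim().endsWith("/");
            if (closing) {
                events.add("END " + name);
            } else {
                events.add("START " + name);
                if (selfClosing) {
                    events.add("END " + name);     // <br/> yields START followed by END
                }
            }
            last = matcher.end();
        }
        addText(events, input.substring(last));
        return events;
    }

    private static void addText(List<String> events, String text) {
        String trimmed = text.trim();
        if (!trimmed.isEmpty()) {
            events.add("TEXT " + trimmed);
        }
    }
}
```

Note that it happily emits events for badly nested input; nothing here decides what the tree should look like.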
The problem is that a lexer is not going to be able to give you many of the features you want from a parser (e.g. production of a tree-based representation of the input). You have to implement that logic yourself because there is no way that such a "lenient" parser would be able to determine how to handle cases like the following:
<parent>
<child>
</parent>
</child>
Think about it: what sort of tree would you expect to get out of this? There's really no sane answer to that question, which is precisely why a parser isn't going to be of much help.
Now, that's not to say that you couldn't use Tag Soup (or your own hand-written lexer) to produce some sort of tree structure based on this input, but the implementation would be very fragile. With tree-oriented formats like XML, you really have no choice but to be strict, otherwise it becomes nearly impossible to get a reasonable result (this is part of why browsers have such a hard time with compatibility).
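To make the fragility concrete, here is a sketch in Java of the heuristic mentioned in the question: close any still-open children before closing a parent, and ignore closing tags that match nothing. It assumes the input has already been split into tokens such as "<parent>", "</parent>" or plain text, with no attributes; the class name TagBalancer is made up. It is one arbitrary repair policy among many, not "the" correct one.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

final class TagBalancer {
    /** Rewrites a token stream so that every tag is properly nested and closed. */
    static String balance(List<String> tokens) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>();   // names of currently open tags
        for (String token : tokens) {
            if (token.startsWith("</")) {
                String name = token.substring(2, token.length() - 1);
                if (!open.contains(name)) {
                    continue;                      // stray closing tag: ignore it
                }
                // Close every child that is still open, then the tag itself.
                while (!open.peek().equals(name)) {
                    out.append("</").append(open.pop()).append('>');
                }
                out.append("</").append(open.pop()).append('>');
            } else if (token.startsWith("<")) {
                open.push(token.substring(1, token.length() - 1));
                out.append(token);
            } else {
                out.append(token);                 // plain text passes through
            }
        }
        // Close whatever is still open at the end of the input.
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }
}
```

For the example above, this policy produces <parent><child></child></parent>, which is well-formed but only one of several defensible readings of the input.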
Try the parser on the XHtml object. It is much more lenient than the one on XML.
Take a look at htmlcleaner. I have used it successfully to convert "HTML from the wild" to valid XML.
Try Tag Soup.
JTidy does something similar but only for HTML.
I mostly agree with Daniel Spiewak's answer. This is just another way to create "your own parser".
While I don't know of any Scala-specific solution, you can try using Woodstox, a Java library that implements the StAX API. (Being an event-based API, I am assuming it will be more fault tolerant than a DOM parser.)
There is also a Scala wrapper around Woodstox called Frostbridge, developed by the same guy who made the Simple Build Tool for Scala.
I had mixed opinions about Frostbridge when I tried it, but perhaps it is more suitable for your purposes.
I agree with the answers that turning invalid XML into "correct" XML is impossible.
Why don't you just do a regular text search for the hrefs if that's all you're interested in? One issue would be commented out links, but if the XML is invalid, it might not be possible to tell what is intended to be commented out!
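Something along these lines, sketched in Java with java.util.regex (usable from Scala unchanged). The pattern and the class name HrefExtractor are just an illustration; it only handles quoted attribute values and, as noted, cannot tell whether a link is commented out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HrefExtractor {
    // Matches href="..." or href='...', case-insensitively; unquoted values are not handled.
    private static final Pattern HREF = Pattern.compile(
            "href\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')", Pattern.CASE_INSENSITIVE);

    /** Returns every quoted href value found in the text, in order of appearance. */
    static List<String> hrefs(String html) {
        List<String> result = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            // Exactly one of the two groups is set, depending on the quote style.
            result.add(matcher.group(1) != null ? matcher.group(1) : matcher.group(2));
        }
        return result;
    }
}
```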
Caucho has a JAXP compliant XML parser that is a little bit more tolerant than what you would usually expect. (Including support for dealing with escaped character entity references, AFAIK.)
Find JavaDoc for the parsers here
A related topic (with my solution) is listed below:
Scala and html parsing
At my current place of work we have to use a web-service which works with a text template we feed to it. Now we would like to use those templates at other places in code and so I wondered what template language that could possibly be and if it's some off-the-shelf or off-the-net software. It's presumably something from Java or .NET world, but can be essentially anything.
It's got the following tokens:
$$variable$$
##function##
##function_call[$$parameter$$]##
Condition: ##IF[$$boolen$$]##THEN##Text##[ELSE##Text##]ENDIF##
Does someone recognize this?
It seems to be a home-grown templating system. At least it is not:
Velocity (Java)
StringTemplate (Java)
FreeMarker (Java)
##Looks like a really grim one##, $$made$$ by someone with ###no sense### of making things loook attractive $when$ written:######$$$$[][]
I would be surprised if it was the syntax of an actual product or open-source template engine.