How to represent a context-free grammar in code? - java

I want to write a compiler as a personal project and am in the process of reading about and understanding parsers (LL(k), LR(k), SLR, etc.).
All of these parsers are based on a grammar that comes from the user, and this grammar is generally written in a text file (for example, in ANTLR it comes in a .g4 file, which is a plain text file). If I want my parser to create its parsing tables from such a grammar file, what is the best way to parse it and represent the productions in code?
EDIT:
For example, let's say I have this grammar:
S -> 'a'|'b'|'('S')'|T
T -> '*'S
I was thinking of parsing this grammar and storing it as an ArrayList&lt;ArrayList&lt;String&gt;&gt;. This way, every item in the outer ArrayList would be the collection of productions for a single non-terminal:
// With this representation, I can assign an id to each production.
// For example, S -> 'a' could have id 01, T -> '*'S id 10, and so on.
{
{"S", "'a'", "'b'", "'('S')'", "T"},
{"T", "'*'S"}
}
I am not sure about representing the grammar as an AST, because then I don't know how to assign ids to each production. The above representation seems like a pretty naive design, though, and I suspect there is some standard way of doing this that would be easier to work with.
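A step up from nested ArrayLists is to give each production its own object with an explicit id, left-hand side, and right-hand-side symbol list; this mirrors how parser-generator texts number productions for table construction. The class and field names below are just a sketch, not a standard API:

```java
import java.util.ArrayList;
import java.util.List;

// A production has a numeric id, a left-hand-side non-terminal,
// and an ordered list of symbols on the right-hand side.
final class Production {
    final int id;
    final String lhs;
    final List<String> rhs;

    Production(int id, String lhs, List<String> rhs) {
        this.id = id;
        this.lhs = lhs;
        this.rhs = rhs;
    }

    @Override
    public String toString() {
        return id + ": " + lhs + " -> " + String.join(" ", rhs);
    }
}

public class Grammar {
    public static void main(String[] args) {
        List<Production> productions = new ArrayList<>();
        // S -> 'a' | 'b' | '(' S ')' | T  (each alternative is its own production)
        productions.add(new Production(0, "S", List.of("'a'")));
        productions.add(new Production(1, "S", List.of("'b'")));
        productions.add(new Production(2, "S", List.of("'('", "S", "')'")));
        productions.add(new Production(3, "S", List.of("T")));
        // T -> '*' S
        productions.add(new Production(4, "T", List.of("'*'", "S")));

        productions.forEach(System.out::println);
    }
}
```

Parse-table construction can then refer to a production by its index in the list, which is exactly the id a reduce action needs.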

Related

Should I use a regexp or a top-down LL(1) parser in the following case?

I need to parse a flightplan document in order to extract relevant information. A sample document is given below:
(FPL-REU974-IS
-B77L/H-SDE1E2E3GFHIJ4J5M1RWXYZ/LB1D1
-LFPG1845
-N0490F310 OKASI UL612 MILPA UM730 TOP UL50 ELB UL12 VELAD UM728 NERAR UP3
RCA/N0489F350 UR611 TIKAT UG300 MAV UM665G ITLOX UM665 UVESO/N0486F370 DCT DENLI
UR780 MIDRI UR780G UVENA
-FMEE1036 FIMP
-PBN/A1D1L1S1 NAV/RNP10 DOF/121114 REG/FOLRA EET/LSAS0039 LFFF0039 LIMM0048 LIRR0111
LMMM0218 HLLL0237 HECC0343 HSSS0425 HAAA0545 HKNA0700 HCSM0701 FSSS0745 FMMM0900
SEL/CGFR ORGN/RUKOUU PER/C SRC/RQP RMK/ADSB ACARS EQUIPPED TCAS EQUIPPED)
As illustrated above, the flightplan information is contained between ( and ). The first line contains the FPL "keyword", the aircraft identification (REU974) and the type of flight (IS) (- is used as a separator).
The following lines contain additional information about the flight, such as the departure airport and time (line 3, LFPG1845), a list of waypoints (line 4), the arrival airport and time (line 5), and so forth.
I was considering two approaches to parsing this data:
Use regular expressions to extract the relevant information from the document.
Write a top-down LL(1) recursive descent lexer and parser, and consider the flightplan document as a "DSL" of sorts.
In the latter case, I would write a lexer which would return tokens such as (, -, /, \n, number, word. These tokens would in turn be consumed by a parser in order to extract the relevant information from the flightplan. The parser would also validate the input. For example, in the first line, the aircraft identification can be no more than 7 characters long, and the type of flight can only contain characters from a given set.
Would the previous approach be overkill compared to using regular expressions?
Also, what would be the most efficient approach, considering that I need to parse thousands of flightplans and need it to be fast?
Edit: The implementation language would be Java
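For what it's worth, the lexer half of option 2 is small. A minimal sketch follows; the token names and the letter/digit splitting rule are my assumptions for illustration, not taken from the FPL specification:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal token types for the flightplan format described above.
enum TokenType { LPAREN, RPAREN, DASH, SLASH, NEWLINE, NUMBER, WORD }

record Token(TokenType type, String text) {}

public class FlightplanLexer {
    public static List<Token> lex(String input) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            switch (c) {
                case '(' -> { tokens.add(new Token(TokenType.LPAREN, "(")); i++; }
                case ')' -> { tokens.add(new Token(TokenType.RPAREN, ")")); i++; }
                case '-' -> { tokens.add(new Token(TokenType.DASH, "-")); i++; }
                case '/' -> { tokens.add(new Token(TokenType.SLASH, "/")); i++; }
                case '\n' -> { tokens.add(new Token(TokenType.NEWLINE, "\n")); i++; }
                case ' ' -> i++; // spaces separate fields but carry no information here
                default -> {
                    if (Character.isDigit(c)) {
                        int start = i;
                        while (i < input.length() && Character.isDigit(input.charAt(i))) i++;
                        tokens.add(new Token(TokenType.NUMBER, input.substring(start, i)));
                    } else if (Character.isLetter(c)) {
                        int start = i;
                        while (i < input.length() && Character.isLetter(input.charAt(i))) i++;
                        tokens.add(new Token(TokenType.WORD, input.substring(start, i)));
                    } else {
                        i++; // skip any other character in this sketch
                    }
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        lex("(FPL-REU974-IS").forEach(t -> System.out.println(t.type() + " " + t.text()));
    }
}
```

Splitting at letter/digit boundaries means LFPG1845 lexes as WORD "LFPG" then NUMBER "1845", which conveniently separates the airport code from the time; validation rules like "identification ≤ 7 characters" then live in the parser, not the lexer.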

I need to parse, modify and write back Java source files? [duplicate]

I need to parse, modify and write back Java source files. I investigated some options, but it seems I am missing the point.
When the parsed AST is written back to a file, the output is always reformatted to some standard style instead of keeping the original formatting.
Basically I want something that can do: content(write(parse(sourceFile))).equals(content(sourceFile)).
I tried JavaParser but failed. I might use Eclipse JDT's parser as a standalone parser, but this feels heavyweight. I would also like to avoid rolling my own. JavaParser, for instance, already has column and line information, but writing the AST back seems to ignore it.
I would like to know how I can parse and write back so that the output looks the same as the input (indents, lines, everything); basically, a solution that preserves the original formatting.
[Update]
The modifications I want to make are basically anything that is possible with the AST: adding or removing implemented interfaces, removing or adding final on local variables, but also generating source for methods and constructors.
The idea is to add/remove anything, but the rest needs to remain untouched, especially the formatting of methods and expressions, even when the resulting line is longer than the page margin.
You may try using ANTLR 4 with its Java 8 grammar file.
The grammar skips all whitespace by default, but based on token positions you may be able to reconstruct source that is close to the original.
The output of a parser generated by REx is a sequence of events written to this interface:
public interface EventHandler
{
    public void reset(CharSequence input);
    public void startNonterminal(String name, int begin);
    public void endNonterminal(String name, int end);
    public void terminal(String name, int begin, int end);
    public void whitespace(int begin, int end);
}
where the integers are offsets into the input. The event stream can be used to construct a parse tree. As the event stream completely covers all of the input, the resulting data structure can represent it without loss.
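For instance, a handler along these lines (hypothetical, not shipped with REx) can rebuild the input verbatim from the events, because terminal and whitespace spans jointly cover every character of the input:

```java
// The EventHandler interface from the answer above, plus a hypothetical
// handler that rebuilds the original text from the event stream.
interface EventHandler {
    void reset(CharSequence input);
    void startNonterminal(String name, int begin);
    void endNonterminal(String name, int end);
    void terminal(String name, int begin, int end);
    void whitespace(int begin, int end);
}

public class SourceRebuilder implements EventHandler {
    private CharSequence input;
    private final StringBuilder out = new StringBuilder();

    public void reset(CharSequence input) {
        this.input = input;
        out.setLength(0);
    }

    // Nonterminal events carry structure only; no characters of their own.
    public void startNonterminal(String name, int begin) {}
    public void endNonterminal(String name, int end) {}

    public void terminal(String name, int begin, int end) {
        out.append(input, begin, end); // copy the exact input span
    }

    public void whitespace(int begin, int end) {
        out.append(input, begin, end); // whitespace is preserved, not skipped
    }

    public String result() {
        return out.toString();
    }
}
```

A real formatting-preserving rewriter would do the same thing, except it would substitute new text for the spans of the nodes it changed and copy everything else through untouched.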
There is a sample driver implementing XmlSerializer on top of this interface. It streams out an XML parse tree, which is just markup added to the input. Thus the string value of the XML document is identical to the original input.
To see it in action, use the Java 7 sample grammar and generate a parser with the command-line options
-ll 2 -backtrack -tree -main -java
Then run the main method of the resulting Java.java, passing in some Java source file name.
Our DMS Software Reengineering Toolkit with its Java Front End can do this.
DMS is a program transformation system (PTS), designed to parse source code to an internal representation (usually ASTs), let you make changes to those trees, and regenerate valid output text for the modified tree.
Good PTSes will preserve your formatting/layout in places where you didn't change the code, or generate nicely formatted results, including the comments in the original source. They also let you write source-to-source transformations of the form:
if you see *this* pattern, replace it by *that* pattern
where pattern is written in the surface syntax of your targeted language (in this case, Java). Writing such transformations is usually a lot easier than writing procedural code to climb up and down the tree, inspecting and hacking individual nodes.
DMS has all these properties, including OP's request for idempotency of the null transform.
[Reacting to another answer: yes, it has a Java 8 grammar]

Formal notation for code translations by compiler

I'm designing a DSL which translates to Java source code. Are there notations that are commonly used to specify the semantics/translation of a compiler?
Example:
DSL:
a = b = c = 4
Translates into:
Integer temp0 = 4;
Integer a = temp0;
Integer b = temp0;
Integer c = temp0;
Thanks in advance,
Jeroen
Pattern-matching languages can be used to formalise small tree transforms. For an example of such a DSL, take a look at the Nanopass framework. A more generic approach is to think of tree transforms as a form of term rewriting.
Such transforms are formal enough, e.g., they can be certified, as in CompCert.
There are formal languages to define semantics; you can see such languages and definitions in almost any technical paper in conference proceedings on programming languages. Texts on the topic are available: https://mitpress.mit.edu/.../semantics-programming-languages You need some willingness to read concise mathematical notation.
As a practical matter, these semantics are not used to drive translations/compilers; that is still a research topic. See http://www.andrew.cmu.edu/user/asubrama/dissertation.pdf To read these you typically need to have spent some time with introductory texts such as the above.
There has been more practical work on defining translations; the most practical are program transformation systems.
With such tools, one can write transformation rules, using the notation of the source language (e.g., your DSL) and the notation of the target language (e.g., Java or assembler or whatever), of the form:
replace source_language_fragment by target_language_fragment if condition
These tools are driven by grammars for the source and target languages, and they interpret the transformation rules from their readable form into AST-to-AST rewrites. Fully translating a complex DSL to another language typically requires hundreds of rules, but a key point is that the rules are much easier to read than the procedural code typical of hand-written translators.
Trying to follow OP's example, assuming one has grammars for the OP's "MyDSL" and for "Java" as a target, and using our DMS Software Reengineering Toolkit's style of transformation rules:
source domain dsl;
target domain Java;
rule translate_single_assignment(t: dsl_IDENTIFIER, e: dsl_expression):
" \t = \e " -- MyDSL syntax
-> -- read as "rewrites to"
" int \JavaIdentifier\(\t\)=\e;
".
rule translate_multi_assignment(t1: dsl_IDENTIFIER, t2: dsl_IDENTIFIER, e: dsl_expression):
" \t1 = \t2 = \e " -- MyDSL syntax
-> -- read as "rewrites to"
" \>\dsl \t2 = \e \statement
int \t1;
\t1=\t2;
".
You need two rules: one for the base case of a simple assignment t = e, and one to handle the multiple-assignment case. The multiple-assignment rule peels off the outermost assignment, generates code for it, and inserts the remainder of the multiple assignment back in its original DSL form, to be reprocessed by one of the two rules.
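For contrast with the declarative rules, a hand-written procedural translator for just this one construct might look like the following plain-Java sketch (all names invented; it handles only the chained-assignment pattern from the question, nothing else):

```java
import java.util.ArrayList;
import java.util.List;

// Procedural equivalent of the two rewrite rules: peel identifiers off
// "a = b = c = 4" from the left, evaluate the expression once into a
// temporary, then assign the temporary to each identifier in turn.
public class MultiAssignTranslator {
    public static List<String> translate(String dsl) {
        String[] parts = dsl.split("=");
        for (int i = 0; i < parts.length; i++) parts[i] = parts[i].trim();

        List<String> java = new ArrayList<>();
        String value = parts[parts.length - 1];     // the expression, e.g. "4"
        java.add("Integer temp0 = " + value + ";"); // evaluate once
        for (int i = 0; i < parts.length - 1; i++) {
            java.add("Integer " + parts[i] + " = temp0;");
        }
        return java;
    }

    public static void main(String[] args) {
        translate("a = b = c = 4").forEach(System.out::println);
    }
}
```

Even this toy version has to worry about tokenization (the split on "=") and ordering; scaled to a whole language, that bookkeeping is exactly what the rule-based notation hides.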
You can see another example of this used for refactoring (source_language == target_language) at https://stackoverflow.com/questions/22094428/programmatic-refactoring-of-java-source-files/22100670#22100670

ANTLR: Multiple ASTs using the same ambiguous grammar?

I'm building an ANTLR parser for a small query language. The query language is by definition ambiguous, and we need all possible interpretations (ASTs) to process the query.
Example:
query : CLASSIFIED_TOKEN UNCLASSIFIED_TOKEN
| ANY_TOKEN UNCLASSIFIED_TOKEN
;
In this case, if the input matches both rules, I need to get two ASTs with both interpretations, but ANTLR will return only the first matched AST.
Do you know a simple way to get all possible ASTs for the same input? I'm thinking about running the parser multiple times, "turning off" already-matched rules between iterations; this seems dirty. Is there a better idea? Maybe another lexer/parser tool with Java support that can do this?
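To make the ambiguity concrete, here is a brute-force illustration (entirely hypothetical; a real GLR parser does this over whole parse forests, not two-token queries): instead of committing to the first alternative that matches, try every alternative and collect all that do.

```java
import java.util.ArrayList;
import java.util.List;

// Each token may satisfy several token classes at once -- that is the
// source of the ambiguity in the grammar rule above.
public class AllParses {
    record Tok(boolean classified, boolean unclassified, boolean any) {}

    // query : CLASSIFIED_TOKEN UNCLASSIFIED_TOKEN | ANY_TOKEN UNCLASSIFIED_TOKEN ;
    // Try both alternatives and keep every one that matches.
    public static List<String> query(Tok t1, Tok t2) {
        List<String> interpretations = new ArrayList<>();
        if (t1.classified() && t2.unclassified())
            interpretations.add("CLASSIFIED_TOKEN UNCLASSIFIED_TOKEN");
        if (t1.any() && t2.unclassified())
            interpretations.add("ANY_TOKEN UNCLASSIFIED_TOKEN");
        return interpretations;
    }

    public static void main(String[] args) {
        Tok ambiguous = new Tok(true, false, true); // both CLASSIFIED and ANY
        Tok plain = new Tok(false, true, false);
        query(ambiguous, plain).forEach(System.out::println);
    }
}
```

A deterministic parser like ANTLR stops after the first `if` succeeds; getting both interpretations requires either enumeration like this or a parser that carries all alternatives forward.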
Thanks
If I were you, I'd remove the ambiguities. You can often do that by using contextual information to determine which grammar rules actually trigger. For instance, in
C* X;
in C (not your language, but this is just to make a point), you can't tell whether this is just a pointless multiplication (legal to write in C) or a declaration of a variable X of type "pointer to C". So there are two valid (ambiguous) parses. But if you know that C is a type (from some context, perhaps an earlier declaration), you can hack the parser to kill off the inappropriate choices and end up with just the one "correct" parse, with no ambiguities.
If you really don't have the context, then you likely need a GLR parser, which will happily generate both parses in your final tree. I don't know of one available for Java.
Our DMS Software Reengineering Toolkit [not a Java-based product] has GLR parsing support, and we use it all the time to parse difficult languages with ambiguities. The way we handle the C example above is to produce both parses, because the GLR parser is happy to do this, and then, if we have additional information (such as a symbol table), post-process the tree to remove the inappropriate parses.
DMS is designed to support the customized analysis and transformation of arbitrary languages, such as your query language, and makes it easy to define the grammar. Once you have a context-free grammar (ambiguities or not), DMS can parse code and you can decide what to do later.
I doubt you're going to get ANTLR to return multiple parse trees without wholesale rewriting of the code.
I believe you're going to have to partition the ambiguities, each into its own unambiguous grammar, and run the parse multiple times. If the total number of ambiguous productions is large, you could end up with an unmanageable set of distinct grammars. For example, for three binary ambiguities (two choices each) you'll end up with 8 distinct grammars, though there might be slightly fewer if one ambiguous branch eliminates one or more of the others.
Good luck

How to parse JSON array with no object name

How would I parse this JSON array in Java? I'm confused because there is no object. Thanks!
EDIT: I'm an idiot! I should have read the documentation... that's probably what it's there for...
[
{
"id":"63565",
"name":"Buca di Beppo",
"user":null,
"phone":"(408)377-7722",
"address":"1875 S Bascom Ave Campbell, California, United States",
"gps_lat":"37.28967000",
"gps_long":"-121.93179700",
"monhh":"",
"tuehh":"",
"wedhh":"",
"thuhh":"",
"frihh":"",
"sathh":"",
"sunhh":"",
"monhrs":"",
"tuehrs":"",
"wedhrs":"",
"thuhrs":"",
"frihrs":"",
"sathrs":"",
"sunhrs":"",
"monspecials":"",
"tuespecials":"",
"wedspecials":"",
"thuspecials":"",
"frispecials":"",
"satspecials":"",
"sunspecials":"",
"description":"",
"source":"ripper",
"worldsbarsname":"BucadiBeppo31",
"url":"www.bucadebeppo.com",
"maybeDupe":"no",
"coupontext":"",
"couponimage":"0",
"distance":"1.00317",
"images":[
0
]
}
]
It is perfectly valid JSON. It is an array containing one object.
In JSON, arrays and objects don't have names. Only attributes of objects have names.
This is all described clearly by the JSON syntax diagrams at http://json.org. (FWIW, the site has translations in a number of languages ...)
How do you parse it? There are many libraries for parsing JSON. Many of them are linked from the site above. I suggest you use one of those rather than writing your own parsing code.
In response to this comment:
OTOH, writing your own parser is a reasonable project, and a good exercise for both learning JSON and learning Java (or whatever language). A reasonable parser can be written in about 500 lines of code.
In my opinion (having written MANY parsers in my time), writing a parser for a language is a very inefficient way to gain a working understanding of the syntax of a language. And depending on how you implement the parser (and the nature of the language's syntax specification), you can easily come away with an incorrect understanding.
A better approach is to read the language's syntax specification, which the OP has now done, and which you would have to do in order to implement a parser.
Writing a parser can be a good learning exercise, but it is really a learning exercise in writing parsers. Even then, you need to pick an appropriate implementation approach, and an appropriate language to be parsed.
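To give a sense of scale for that 500-line estimate, here is a deliberately tiny recursive-descent sketch for a JSON subset sufficient for the sample above (objects, arrays, strings without escapes, integers, null). It is an illustration only; a real library also handles escapes, doubles, booleans, and error reporting:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TinyJson {
    private final String s;
    private int i = 0;

    private TinyJson(String s) { this.s = s; }

    public static Object parse(String s) { return new TinyJson(s).value(); }

    private Object value() {
        skipWs();
        char c = s.charAt(i);
        if (c == '{') return object();
        if (c == '[') return array();
        if (c == '"') return string();
        if (s.startsWith("null", i)) { i += 4; return null; }
        return number();
    }

    private Map<String, Object> object() {
        Map<String, Object> m = new LinkedHashMap<>();
        i++; // consume '{'
        skipWs();
        if (s.charAt(i) == '}') { i++; return m; }
        while (true) {
            skipWs();
            String key = string();
            skipWs();
            i++; // consume ':'
            m.put(key, value());
            skipWs();
            if (s.charAt(i++) == '}') return m; // otherwise it was ','
        }
    }

    private List<Object> array() {
        List<Object> a = new ArrayList<>();
        i++; // consume '['
        skipWs();
        if (s.charAt(i) == ']') { i++; return a; }
        while (true) {
            a.add(value());
            skipWs();
            if (s.charAt(i++) == ']') return a; // otherwise it was ','
        }
    }

    private String string() {
        int start = ++i; // skip opening quote
        while (s.charAt(i) != '"') i++;
        return s.substring(start, i++);
    }

    private Long number() {
        int start = i;
        if (s.charAt(i) == '-') i++;
        while (i < s.length() && Character.isDigit(s.charAt(i))) i++;
        return Long.parseLong(s.substring(start, i));
    }

    private void skipWs() {
        while (i < s.length() && Character.isWhitespace(s.charAt(i))) i++;
    }
}
```

Note how the top-level `value()` dispatches on the first character: the OP's document starts with `[`, so it parses as an array whose single element is an object, which is exactly the structure described in the answers.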
It's an array containing one element. That element is an object. The object (dictionary) contains about 20 name/value pairs.
