Handling comments and Line/Column numbers in COBOL grammar using Javacc


I am working on a COBOL parser using JavaCC. A COBOL file usually has line/column numbers in columns 1 to 6. If the line/column numbers are not present, those columns contain spaces.
I need to know how to handle comments and the Sequence Area in a COBOL file so that only the Main Area is parsed.
I have tried many expressions, but none of them works. I created a special token that checks for a newline followed by six occurrences of either a space or any character other than a space or carriage return; the seventh character is then "*" for comments and " " for normal lines.
I am using the Cobol.jj file available here: http://java.net/downloads/javacc/contrib/grammars/cobol.jj
Can anyone suggest what grammar I should use?
Here is a sample of my grammar file:
PARSER_END(CblParser)
////////////////////////////////////////////////////////////////////////////////
// Lexical structure
////////////////////////////////////////////////////////////////////////////////
SPECIAL_TOKEN :
{
< EOL: "\n" > : LINE_START
| < SPACECHAR: ( " " | "\t" | "\f" | ";" | "\r" )+ >
}
SPECIAL_TOKEN :
{
< COMMENT: ( ~["\n","\r"," "] ~["\n","\r"," "] ~["\n","\r"," "] ~["\n","\r"," "] ~["\n","\r"," "] ~["\n","\r"," "] ) ( "*" | "|" ) (~["\n","\r"])* >
| < PREPROC_COMMENT: "*|" (~["\n","\r"])* >
| < SPACE_SEPARATOR : ( <SPACECHAR> | <EOL> )+ >
| < COMMA_SEPARATOR : "," <SPACE_SEPARATOR> >
}
<LINE_START> SKIP :
{
< ((~[])(~[])(~[])(~[])(~[])(~[])) (" ") >
}

Since the parser starts at the start of a line, you should use the DEFAULT state to represent the start of a line. I would do something like the following [untested code follows].
// At the start of each line, the first 6 characters are ignored and the 7th is used
// to determine whether this is a code line or a comment line.
// (Continuation lines are handled elsewhere.)
// If there are fewer than 7 characters on the line, it is ignored.
// Note that there will be a TokenManagerError if a line has at least 7 characters and
// the 7th character is other than a "*", a "/", or a space.
<DEFAULT> SKIP :
{
< (~[]){0,6} ("\n" | "\r" | "\r\n") > :DEFAULT
|
< (~[]){6} (" ") > :CODE
|
< (~[]){6} ("*"|"/") > :COMMENT
}
<COMMENT> SKIP :
{ // At the end of a comment line, return to the DEFAULT state.
< "\n" | "\r" | "\r\n" > : DEFAULT
| // All non-end-of-line characters on a comment line are ignored.
< ~["\n","\r"] > : COMMENT
}
<CODE> SKIP :
{ // At the end of a code line, return to the DEFAULT state.
< "\n" | "\r" | "\r\n" > : DEFAULT
| // White space is skipped, as are semicolons.
< ( " " | "\t" | "\f" | ";" )+ >
}
<CODE> TOKEN :
{
< ACCEPT: "accept" >
|
... // all rules for tokens should be in the CODE state.
}
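To make the column rules in the grammar comments concrete, here is a minimal plain-Java sketch of the same per-line decision. It is not part of the JavaCC answer: the class name `CobolLineClassifier`, the `Kind` enum and the `mainArea` helper are hypothetical names chosen for illustration, and the sketch assumes each line is already available as a separate string without its line terminator.

```java
final class CobolLineClassifier {
    /** How a fixed-format source line is treated (hypothetical categories). */
    enum Kind { IGNORED, COMMENT, CODE, OTHER }

    /** Classifies a line by its 7th character, mirroring the DEFAULT-state rules. */
    static Kind classify(String line) {
        if (line.length() < 7) {
            return Kind.IGNORED;          // fewer than 7 characters: skipped
        }
        char indicator = line.charAt(6);  // column 7 is index 6
        if (indicator == '*' || indicator == '/') {
            return Kind.COMMENT;
        }
        if (indicator == ' ') {
            return Kind.CODE;
        }
        return Kind.OTHER;                // e.g. '-'; the grammar above would report an error
    }

    /** Returns the text after column 7 for code lines, and an empty string otherwise. */
    static String mainArea(String line) {
        return classify(line) == Kind.CODE ? line.substring(7) : "";
    }

    public static void main(String[] args) {
        System.out.println(classify("000100 MOVE A TO B."));
        System.out.println(classify("000200* A COMMENT LINE"));
        System.out.println(mainArea("000100 MOVE A TO B."));
    }
}
```

A pre-pass like this could also be used to blank out the first six columns before the text reaches the token manager, but that is a different design from the lexical-state approach shown above.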

Related

How to allow lexer to parse specific code parts from java?

I am currently creating a compiler with ANTLR4 that should allow embedded Java code.
How do I allow the following:
public void =(Integer value) => java { this.value = value; }
so that the code between java { } is not parsed by ANTLR, but can still be reached by a visitor in my parser?
Currently I have
javaStatementBody: KWJAVA LCURLY .*? RCURLY
but this obviously does not work, and .*? parses the whole file.
Please do not answer with "use quotes". That is not going to be my solution, because I want to allow Java code highlighting.

You could create separate lexer and parser grammars so that you can use lexical modes. Whenever the lexer "sees" the input java {, it moves to JAVA_MODE. While in the Java mode, you tokenise comments and string and char literals, so that braces inside them are not counted. If you encounter a { in this mode, you push the same JAVA_MODE again so that the lexer knows it is nested one level deeper. When you encounter a }, you pop a mode from the stack (resulting in either going back to the default mode, or staying in the Java mode but one level less deep).
A quick demo:
IslandLexer.g4
lexer grammar IslandLexer;
JAVA_START
: 'java' SPACES '{' -> pushMode(JAVA_MODE)
;
OTHER
: .
;
fragment SPACES : [ \t\r\n]+;
mode JAVA_MODE;
JAVA_CHAR : '\'' ( ~[\\'\r\n] | '\\' [tbnrf'\\] ) '\'';
JAVA_STRING : '"' ( ~[\\"\r\n] | '\\' [tbnrf"\\] )* '"';
JAVA_LINE_COMMENT : '//' ~[\r\n]*;
JAVA_BLOCK_COMMENT : '/*' .*? '*/';
JAVA_OPEN_BRACE : '{' -> pushMode(JAVA_MODE);
JAVA_CLOSE_BRACE : '}' -> popMode;
JAVA_OTHER : ~[{}];
IslandParser.g4
parser grammar IslandParser;
options { tokenVocab=IslandLexer; }
parse
: unit* EOF
;
unit
: base_language
| java_language
;
base_language
: OTHER+
;
java_language
: JAVA_START java_atom+
;
java_atom
: JAVA_CHAR
| JAVA_STRING
| JAVA_LINE_COMMENT
| JAVA_BLOCK_COMMENT
| JAVA_OPEN_BRACE
| JAVA_CLOSE_BRACE
| JAVA_OTHER
;
Test it with the following code:
String source = "foo \n" +
"\n" +
"java { \n" +
" char foo() { \n" +
" /* a quote in a comment \\\" */ \n" +
" String s = \"java {...}\"; \n" +
" return '}'; \n" +
" }\n" +
"}\n" +
"\n" +
"bar";
IslandLexer lexer = new IslandLexer(CharStreams.fromString(source));
IslandParser parser = new IslandParser(new CommonTokenStream(lexer));
System.out.println(parser.parse().toStringTree(parser));
which prints the resulting parse tree.
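For comparison, the nesting logic that the mode stack implements can be sketched in plain Java without ANTLR. This is only an illustration of the idea, not a replacement for the grammar: the class `JavaIslandScanner` and its methods are hypothetical names, and the sketch assumes the same simplified view of Java as the lexer above (line comments, block comments, string and char literals, and braces).

```java
final class JavaIslandScanner {
    /**
     * Returns the index of the '}' that closes the '{' at openIdx,
     * or -1 if the block is not terminated.
     */
    static int matchingBrace(String s, int openIdx) {
        int depth = 0;                       // plays the role of the mode stack
        int i = openIdx;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (s.startsWith("//", i)) {     // line comment: skip to end of line
                int eol = s.indexOf('\n', i);
                if (eol < 0) return -1;
                i = eol + 1;
            } else if (s.startsWith("/*", i)) { // block comment: skip to "*/"
                int end = s.indexOf("*/", i + 2);
                if (end < 0) return -1;
                i = end + 2;
            } else if (c == '"' || c == '\'') { // string or char literal
                i = skipQuoted(s, i, c);
                if (i < 0) return -1;
            } else {
                if (c == '{') {
                    depth++;                 // like pushMode(JAVA_MODE)
                } else if (c == '}' && --depth == 0) {
                    return i;                // like the final popMode
                }
                i++;
            }
        }
        return -1;
    }

    /** Returns the index just past the closing quote, or -1 if unterminated. */
    private static int skipQuoted(String s, int start, char quote) {
        for (int i = start + 1; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\') {
                i++;                         // skip the escaped character
            } else if (c == quote) {
                return i + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String src = "foo java { char foo() { /* } */ String s = \"java {...}\"; return '}'; } } bar";
        int open = src.indexOf('{');
        System.out.println(src.substring(open, matchingBrace(src, open) + 1));
    }
}
```

The depth counter does what pushMode and popMode do in the lexer grammar: braces inside comments and literals never reach it, so only real nesting changes the depth.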

Multiline non terminated regex

I came across a problem with regex parsing columns in ASCII tables.
Imagine an ASCII table like:
COL1  | COL2    | COL3
======================
ONE   | APPLE   | PIE
----------------------
TWO   | APPLE   | PIES
----------------------
THREE | PLUM-   | PIES
      | APRICOT |
For the first two entries, a trivial capture regex does the job:
(?:(?<COL1>\w+)\s*\|\s*(?<COL2>\w+)\s*\|\s*(?<COL3>\w+)\s*)
However, this regex also captures the header, and it does not capture the third entry.
I can't solve the following two problems:
How do I exclude the header?
How do I extend the COL2 capture group to capture the multiline entry PLUM-APRICOT?
Thanks for your help!

Some people, when confronted with a problem, think
“I know, I'll use regular expressions.” Now they have two problems. (http://regex.info/blog/2006-09-15/247)
I've assumed an input string like:
String input = ""
+ "\n" + "COL1 | COL2 | COL3"
+ "\n" + "======================"
+ "\n" + "ONE | APPLE | PIE "
+ "\n" + "----------------------"
+ "\n" + "TWO | APPLE | PIES"
+ "\n" + "----------------------"
+ "\n" + "THREE | PLUM- | PIES"
+ "\n" + " | APRICOT | ";
To separate the header from the table, you can use input.split("={2,}"). This returns an array of strings holding the header and the table body.
After trimming the table body, you can use table.split("-{2,}") to get the rows of the table.
Each row can be converted to an array of cells by using row.split("\\|").
Dealing with multiline rows: before converting a row to cells, you can call row.split("\n") to split it into its physical lines.
When this split operation returns an array with more than one element, each line should be split on pipes (split("\\|")) and the resulting cells merged column by column.
From here it's just element manipulation to get the data into the format you desire.
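The steps above can be sketched as follows. This is one possible reading of them, under a few assumptions: the class name `AsciiTableSplit` and the method `parse` are made up for illustration, cells from continuation lines are merged by plain concatenation (so PLUM- and APRICOT become PLUM-APRICOT), and no cell contains two or more consecutive "-" or "=" characters.

```java
import java.util.ArrayList;
import java.util.List;

final class AsciiTableSplit {
    /** Returns the body rows as lists of merged, trimmed cells; the header is skipped. */
    static List<List<String>> parse(String input) {
        // Everything after the ===== rule is the table body.
        String body = input.split("={2,}")[1].trim();
        List<List<String>> rows = new ArrayList<>();
        for (String row : body.split("-{2,}")) {
            List<StringBuilder> merged = new ArrayList<>();
            // A logical row may span several physical lines.
            for (String line : row.trim().split("\n")) {
                // Limit -1 keeps trailing empty cells such as the one after "APRICOT |".
                String[] cells = line.split("\\|", -1);
                for (int i = 0; i < cells.length; i++) {
                    if (i == merged.size()) {
                        merged.add(new StringBuilder());
                    }
                    merged.get(i).append(cells[i].trim());
                }
            }
            List<String> result = new ArrayList<>();
            for (StringBuilder cell : merged) {
                result.add(cell.toString());
            }
            rows.add(result);
        }
        return rows;
    }

    public static void main(String[] args) {
        String input = ""
            + "\n" + "COL1 | COL2 | COL3"
            + "\n" + "======================"
            + "\n" + "ONE | APPLE | PIE "
            + "\n" + "----------------------"
            + "\n" + "TWO | APPLE | PIES"
            + "\n" + "----------------------"
            + "\n" + "THREE | PLUM- | PIES"
            + "\n" + " | APRICOT | ";
        for (List<String> row : parse(input)) {
            System.out.println(row);
        }
    }
}
```

With the sample input, the third row comes back as the three cells THREE, PLUM-APRICOT and PIES.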

how can I make this JAVACC grammar work with [ ]?

I'm trying to change a grammar in the JSqlParser project, which uses a JavaCC grammar file (.jj) specifying the standard SQL syntax. I had difficulty getting one section to work, and I narrowed it down to the following much-simplified grammar.
Basically, I have a definition of Column : [ table . ] field
but table itself can also contain the "." character, which causes confusion.
Intuitively, I think the following grammar should accept all of the following sentences:
select mytable.myfield
select myfield
select mydb.mytable.myfield
but in practice it only accepts the 2nd and 3rd of these. Whenever it sees the ".", it goes on to demand the two-dot version of table (i.e. the first alternative in the Table rule).
How can I make this grammar work?
Thanks a lot,
Yang
options{
IGNORE_CASE=true ;
STATIC=false;
DEBUG_PARSER=true;
DEBUG_LOOKAHEAD=true;
DEBUG_TOKEN_MANAGER=false;
// FORCE_LA_CHECK=true;
UNICODE_INPUT=true;
}
PARSER_BEGIN(TT)
import java.util.*;
public class TT {
}
PARSER_END(TT)
///////////////////////////////////////////// main stuff concerned
void Statement() :
{ }
{
<K_SELECT> Column()
}
void Column():
{
}
{
[LOOKAHEAD(3) Table() "." ]
//[
//LOOKAHEAD(2) (
// LOOKAHEAD(5) <S_IDENTIFIER> "." <S_IDENTIFIER>
// |
// LOOKAHEAD(3) <S_IDENTIFIER>
//)
//
//
//
//]
Field()
}
void Field():
{}{
<S_IDENTIFIER>
}
void Table():
{}{
LOOKAHEAD(5) <S_IDENTIFIER> "." <S_IDENTIFIER>
|
LOOKAHEAD(3) <S_IDENTIFIER>
}
////////////////////////////////////////////////////////
SKIP:
{
" "
| "\t"
| "\r"
| "\n"
}
TOKEN: /* SQL Keywords. prefixed with K_ to avoid name clashes */
{
<K_CREATE: "CREATE">
|
<K_SELECT: "SELECT">
}
TOKEN : /* Numeric Constants */
{
< S_DOUBLE: ((<S_LONG>)? "." <S_LONG> ( ["e","E"] (["+", "-"])? <S_LONG>)?
|
<S_LONG> "." (["e","E"] (["+", "-"])? <S_LONG>)?
|
<S_LONG> ["e","E"] (["+", "-"])? <S_LONG>
)>
| < S_LONG: ( <DIGIT> )+ >
| < #DIGIT: ["0" - "9"] >
}
TOKEN:
{
< S_IDENTIFIER: ( <LETTER> | <ADDITIONAL_LETTERS> )+ ( <DIGIT> | <LETTER> | <ADDITIONAL_LETTERS> | <SPECIAL_CHARS>)* >
| < #LETTER: ["a"-"z", "A"-"Z", "_", "$"] >
| < #SPECIAL_CHARS: "$" | "_" | "#" | "#">
| < S_CHAR_LITERAL: "'" (~["'"])* "'" ("'" (~["'"])* "'")*>
| < S_QUOTED_IDENTIFIER: "\"" (~["\n","\r","\""])+ "\"" | ("`" (~["\n","\r","`"])+ "`") | ( "[" ~["0"-"9","]"] (~["\n","\r","]"])* "]" ) >
/*
To deal with database names (columns, tables) using not only latin base characters, one
can expand the following rule to accept additional letters. Here is the addition of german umlauts.
There seems to be no way to recognize letters by an external function to allow
a configurable addition. One must rebuild JSqlParser with this new "Letterset".
*/
| < #ADDITIONAL_LETTERS: ["ä","ö","ü","Ä","Ö","Ü","ß"] >
}

You could rewrite your grammar like this:
Statement --> "select" Column
Column --> Prefix <ID>
Prefix --> (<ID> ".")*
Now the only choice is whether to iterate or not. Assuming a "." can't follow a Column, this is easily done with a lookahead of 2:
Statement --> "select" Column
Column --> Prefix <ID>
Prefix --> (LOOKAHEAD( <ID> ".") <ID> ".")*
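To show what the two-token lookahead decides, here is a small hand-written sketch of the same rule in plain Java. It is not JavaCC output and not JSqlParser code: `ColumnParser`, `peek` and `isId` are hypothetical names, the input is assumed to be already split into tokens, and the identifier pattern is a simplification of S_IDENTIFIER.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ColumnParser {
    private final List<String> tokens;
    private int pos = 0;

    ColumnParser(List<String> tokens) {
        this.tokens = tokens;
    }

    /** Returns the token k positions ahead, or null past the end of input. */
    private String peek(int k) {
        int i = pos + k;
        return i < tokens.size() ? tokens.get(i) : null;
    }

    /** Simplified identifier check (an assumption, not the real S_IDENTIFIER). */
    private static boolean isId(String t) {
        return t != null && t.matches("[A-Za-z_$][A-Za-z0-9_$#]*");
    }

    /** Column --> Prefix <ID>; returns the qualifiers followed by the field name. */
    List<String> column() {
        List<String> parts = new ArrayList<>();
        // Prefix --> (LOOKAHEAD(<ID> ".") <ID> ".")*
        // Iterate only while the next two tokens are an identifier and a dot.
        while (isId(peek(0)) && ".".equals(peek(1))) {
            parts.add(peek(0));
            pos += 2;
        }
        if (!isId(peek(0))) {
            throw new IllegalStateException("expected identifier at token " + pos);
        }
        parts.add(tokens.get(pos++));
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(new ColumnParser(Arrays.asList("myfield")).column());
        System.out.println(new ColumnParser(Arrays.asList("mytable", ".", "myfield")).column());
        System.out.println(new ColumnParser(Arrays.asList("mydb", ".", "mytable", ".", "myfield")).column());
    }
}
```

Because the loop never has to choose between a one-part and a two-part table name up front, all three example sentences are handled by the same rule.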

Indeed, the following grammar in flex+bison (an LR parser) works fine, recognizing all of the following sentences correctly:
create mydb.mytable
create mytable
select mydb.mytable.myfield
select mytable.myfield
select myfield
So the problem is indeed due to a limitation of LL parsers.
%%
statement:
create_sentence
|
select_sentence
;
create_sentence: CREATE table
;
select_sentence: SELECT table '.' ID
|
SELECT ID
;
table : table '.' ID
|
ID
;
%%

If you need Table to be its own nonterminal, you can do this by using a boolean parameter that says whether the table is expected to be followed by a dot.
void Statement():{}{
"select" Column() | "create" "table" Table(false) }
void Column():{}{
[LOOKAHEAD(<ID> ".") Table(true) "."] <ID> }
void Table(boolean expectDot):{}{
<ID> MoreTable(expectDot) }
void MoreTable(boolean expectDot):{}{
LOOKAHEAD("." <ID> ".", {expectDot}) "." <ID> MoreTable(expectDot)
|
LOOKAHEAD(".", {!expectDot}) "." <ID> MoreTable(expectDot)
|
{}
}
Doing it this way precludes using Table in any syntactic lookahead specifications either directly or indirectly. E.g. you shouldn't have LOOKAHEAD( Table()) anywhere in your grammar, because semantic lookahead is not used during syntactic lookahead. See the FAQ for more information on that.

Your examples are parsed perfectly well using JSqlParser V0.9.x (https://github.com/JSQLParser/JSqlParser)
CCJSqlParserUtil.parse("SELECT mycolumn");
CCJSqlParserUtil.parse("SELECT mytable.mycolumn");
CCJSqlParserUtil.parse("SELECT mydatabase.mytable.mycolumn");

java indexOf returns -1 when it's supposed to return a positive number

I'm new to network programming and have never used Java for it before.
I'm writing a server in Java and have a problem processing messages from the client. I used
DataInputStream inputFromClient = new DataInputStream( socket.getInputStream() );
while ( true ) {
// Receive radius from the client
byte[] r=new byte[256000];
inputFromClient.read(r);
String Ffss =new String(r);
System.out.println( "Received from client: " + Ffss );
System.out.print("Found Index :" );
System.out.println(Ffss.indexOf( '\a' ));
System.out.print("Found Index :" );
System.out.println(Ffss.indexOf( ' '));
String Str = new String("add 12341\n13243423");
String SubStr1 = new String("\n");
System.out.print("Found Index :" );
System.out.println( Str.indexOf( SubStr1 ));
}
If I do this, and have a sample input asg 23\aag, it will return:
Found Index :-1
Found Index :3
Found Index :9
It's clear that if the String object is created from scratch, indexOf can locate "\n".
Why does the code have a problem locating \a when the String is obtained from the DataInputStream?

Try String abc = new String("\\a"); you need \\ to get a backslash in a string literal, because otherwise the \ marks the start of an "escape sequence".
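A minimal sketch of that suggestion, assuming the client really sends the ten characters asg 23\aag (a literal backslash followed by the letter a). The class name `BackslashIndexDemo` and the helper `indexOfBackslashA` are made up for the example; no socket is involved, the received text is simply written as a literal.

```java
final class BackslashIndexDemo {
    /** Finds a literal backslash followed by 'a' in the received text. */
    static int indexOfBackslashA(String received) {
        // "\\a" in source code is the two-character string: backslash, 'a'.
        return received.indexOf("\\a");
    }

    public static void main(String[] args) {
        // In source code the backslash must be doubled; the string itself
        // holds: a s g space 2 3 \ a a g
        String received = "asg 23\\aag";
        System.out.println("Found Index :" + indexOfBackslashA(received)); // 6
        System.out.println("Found Index :" + received.indexOf(' '));       // 3
    }
}
```

The search argument is a two-character String rather than a char, since a backslash followed by a is two characters in the data that arrives over the socket.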

It looks like the a is being escaped.
Have a look at this article to understand how the backslash affects a string.
Escape Sequences
A character preceded by a backslash (\) is an escape
sequence and has special meaning to the compiler. The following table
shows the Java escape sequences:
| Escape Sequence | Description|
|:----------------|------------:|
| \t | Insert a tab in the text at this point.|
| \b | Insert a backspace in the text at this point.|
| \n | Insert a newline in the text at this point.|
| \r | Insert a carriage return in the text at this point.|
| \f | Insert a formfeed in the text at this point.|
| \' | Insert a single quote character in the text at this point.|
| \" | Insert a double quote character in the text at this point.|
| \\ | Insert a backslash character in the text at this point.|

recognize fractional numbers in JFlex 1.4.3

In my SL.lex file I have these regular expressions for fractional numbers:
Digit = [1-9]
Digit0 = 0|{Digit}
Num = {Digit} {Digit0}*
Frac = {Digit0}* {Digit}
Pos = {Num} | '.' {Frac} | 0 '.' {Frac} | {Num} '.' {Frac}
PosOrNeg = -{Pos} | {Pos}
Numbers = 0 | {PosOrNeg}
and then in
/* literals */
{Numbers} { return new Token(yytext(), sym.NUM, getLineNumber()); }
But every time I try to recognize a number with a dot, it fails and I get an error.
Instead of '.' I also tried \\., \. and ".", but it fails every time.

You are right: . needs to be escaped, otherwise it matches any character except a line terminator.
But characters are quoted with double quotes, not single quotes.
Pos = {Num} | "." {Frac} | 0 "." {Frac} | {Num} "." {Frac}
If you do that, the input:
123.45
works as expected:
java -cp target/classes/ Yylex src/test/resources/test.txt
line: 1 match: --123.45--
action [29] { return new Yytoken(zzAction, yytext(), yyline+1, 0, 0); }
Text : 123.45
index : 10
line : 1
null
Also, regular expressions offer more than just unions, so you could make the definitions more concise:
Digit = [1-9]
Digit0 = 0 | {Digit}
Num = {Digit} {Digit0}*
Frac = {Digit0}* {Digit}
Pos = 0? "." {Frac} | {Num} ("." {Frac})?
PosOrNeg = -?{Pos}
Number = {PosOrNeg} | 0
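One way to try out these definitions without regenerating the lexer is to transcribe them into java.util.regex. The sketch below does that; note that it swaps JFlex for Java's regex engine, so it only checks the shape of the patterns, and the class name `NumberPatternDemo` and method `isNumber` are invented for the example.

```java
import java.util.regex.Pattern;

final class NumberPatternDemo {
    // Each constant mirrors the JFlex macro of the same name.
    private static final String DIGIT = "[1-9]";
    private static final String DIGIT0 = "[0-9]";
    private static final String NUM = DIGIT + DIGIT0 + "*";
    private static final String FRAC = DIGIT0 + "*" + DIGIT;
    // Pos = 0? "." {Frac} | {Num} ("." {Frac})?
    private static final String POS =
        "(?:0?\\." + FRAC + "|" + NUM + "(?:\\." + FRAC + ")?)";
    // Number = -?{Pos} | 0
    private static final Pattern NUMBER = Pattern.compile("-?" + POS + "|0");

    /** Returns true if the whole string is a number according to the macros above. */
    static boolean isNumber(String s) {
        return NUMBER.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"123.45", ".5", "0.5", "-7", "0", "012", "1.50", "1."}) {
            System.out.println(s + " -> " + isNumber(s));
        }
    }
}
```

Under these definitions 1.50 and 012 are rejected, because Frac must end in a non-zero digit and Num must not start with 0. That follows from the macros as written in the question.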
