I have a simple grammar defined in Antlr 3 as shown below:
grammar StringProcessor;
options {
output=AST;
}
@header {
package com.processor;
}
@rulecatch {
// ANTLR does not generate its normal rule try/catch
catch(RecognitionException e) {
throw e;
}
}
truevalue : 'true';
falsevalue : 'false';
nullvalue : 'null';
simpleValue : truevalue | falsevalue | nullvalue | STRING | INTEGER | FLOAT;
INTEGER : '0'..'9'+;
FLOAT : INTEGER'.'INTEGER;
QUOTE : '"';
SPECIALCHAR : '-'|':'|';'|'('|')'|'£'|'&'|'#'|','|'!'|'['|']'|'{'|'}'|'#'|'^'|'*'|'+'|'='|'_'|'<'|'>'|'€'|'$'|'%'|'/'|'.'|'?'|'~'|'|';
STRING : QUOTE('a'..'z'|'A'..'Z'|INTEGER|SPECIALCHAR|WS)+QUOTE;
WS : (' '|'\t'|'\f'|'\n'|'\r')+ {skip();}; // handle white space between keywords
When I try the following STRING in the ANTLRWorks interpreter:
"5Java Developer"
This works, and the white space is included. But when I try to parse the same input from the Java program, it throws a NoViableAltException. I have seen other posts, but those solutions do not apply to my problem, because here the WS is part of the STRING. The problem is that the Java program does not parse anything containing white space, whereas the interpreter displays it correctly.
An example to show the Exception:
public static void main(String[] args) throws Exception {
String input = ("\"5Java Developer\"");
StringProcessorParser parser = buildParser(input);
CommonTree commonTree = (CommonTree) parser.simpleValue().getTree(); // exception thrown
}
public static StringProcessorParser buildParser(String query) {
CharStream cs = new ANTLRStringStream(query);
// the input needs to be lexed
StringProcessorLexer lexer = new StringProcessorLexer(cs);
CommonTokenStream tokens = new CommonTokenStream();
StringProcessorParser parser = new StringProcessorParser(tokens);
tokens.setTokenSource(lexer);
// use the ASTTreeAdaptor so that the grammar is aware to build tree in AST format
parser.setTreeAdaptor((TreeAdaptor) new ASTTreeAdaptor().getASTTreeAdaptor());
return parser;
}
By contrast, input without white space parses correctly:
input = new String("\"5JavaDeveloper\"");
Any idea why this is not working?
EDIT:
I have also tried adding $channel = HIDDEN; but it still does not work:
WS : (' '|'\t'|'\f'|'\n'|'\r')+ { $channel = HIDDEN; skip();}; // handle white space between keywords
Removing the skip() has fixed my problem.
I am working through the book "The Definitive ANTLR 4 Reference" and I'm trying to run the ArrayInit.g4 example. I have provided everything that is necessary, but when I run the example and enter the values into the console, nothing happens (pages 29 and 30).
Here is the grammar :
grammar ArrayInit;
init : '{' value ( ',' value)* '}';
value : init | INT ;
INT : [0-9]+ ;
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
And here is the Test.java
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.*;
public class Test {
public static void main(String[] args) throws Exception {
ANTLRInputStream input = new ANTLRInputStream(System.in);
ArrayInitLexer lexer = new ArrayInitLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
ArrayInitParser parser = new ArrayInitParser(tokens);
ParseTree tree = parser.init();
System.out.println(tree.toStringTree(parser));
}
}
Given input is : {1,{2,3},4}
The expected output is : ( init {(value 1), (value (init { (value 2), (value 3) })), (value 4)} )
I am trying to parse a CSV file with Java and have the following issue: the second column is a string (which may also contain commas) enclosed in double quotes, unless the string itself contains a double quote, in which case the entire string is enclosed in single quotes.
For example, lines may look like this:
someStuff,"hello", someStuff
someStuff,"hello, SO", someStuff
someStuff,'say "hello, world"', someStuff
someStuff,'say "hello, world', someStuff
someStuff are placeholders for other elements, which can also include quotes in the same style
I'm looking for a generic way to split the lines at commas UNLESS enclosed in single OR double quotes in order to get the second column as a String. With second column I mean the fields:
hello
hello, SO
say "hello, world"
say "hello, world
I tried OpenCSV, but it fails because one can only specify one type of quote:
public class CSVDemo {
public static void main(String[] args) throws IOException {
CSVDemo demo = new CSVDemo();
demo.process("input.csv");
}
public void process(String fileName) throws IOException {
String file = this.getClass().getClassLoader().getResource(fileName)
.getFile();
CSVReader reader = new CSVReader(new FileReader(file));
String[] nextLine;
while ((nextLine = reader.readNext()) != null) {
System.out.println(nextLine[0] + " | " + nextLine[1] + " | "
+ nextLine[2]);
}
}
}
The solution with opencsv fails on the last line where there is only one double quote enclosed in single quotes:
someStuff | hello | someStuff
someStuff | hello, SO | someStuff
someStuff | 'say "hello, world"' | someStuff
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
If you truly cannot use a real CSV parser, you could use a regex. This is generally not a good idea, as there are always edge cases that you cannot handle, but if the formatting is strictly as you describe, this may work.
public void test() {
String[] tests = {"numeStuff,\"hello\", someStuff, someStuff",
"numeStuff,\"hello, SO\", someStuff, someStuff",
"numeStuff,'say \"hello, world\"', someStuff, someStuff"
};
/* Matches a field and a potentially empty separator.
*
* ( - Field Group
* \" - Start with a quote
* [^\"]*? - Non-greedy match on anything that is not a quote
* \" - End with a quote
* | - Or
* ' - Start with a strop
* [^']*? - Non-greedy match on anything that is not a strop
* ' - End with a strop
* | - Or
* [^\"'] - Not starting with a quote or strop
* [^,\r\n]*? - Non-greedy match on anything that is not a comma or line break
* ) - End field group
* ( - Separator group
* [,\r\n]|$ - Comma or line-break separator, or end of input
* ) - End separator group
*/
Pattern p = Pattern.compile("(\"[^\"]*?\"|'[^\']*?\'|[^\"'][^,\r\n]*?)([,\r\n]|$)");
for (String t : tests) {
System.out.println("Matching: " + t);
Matcher m = p.matcher(t);
while (m.find()) {
System.out.println(m.group(1));
}
}
}
It does not appear that opencsv supports this out of the box. You could extend com.opencsv.CSVParser and implement your own algorithm for handling two types of quotes. This is the source of the method you would be changing and here is a stub to get you started.
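class MyCSVParser extends CSVParser {
// Assumes parseLine is overridable (protected) in your opencsv version; an overriding method cannot be private.
@Override
protected String[] parseLine(String nextLine, boolean multi) throws IOException {
// Your algorithm here
throw new UnsupportedOperationException("not implemented yet");
}
}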
Basically you only need to track ," and ,' (trimming what's in the middle).
When you encounter one of those, set the appropriate flag (eg. singleQuoteOpen, doubleQuoteOpen) to true to indicate they're open and you are in ignore-commas mode.
When you meet the appropriate closing quote, reset the flag and keep slicing the elements.
To perform the check, stop at every comma (when not in ignore-commas mode) and look at the next char (if any, and trimming).
Note: the regex solution is good and also shorter, but less customizable for edge cases (at least without big headaches).
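The flag-based approach described above can be sketched in plain Java as follows. This is only a minimal sketch under the formatting rules stated in the question: the class name QuoteAwareSplitter and its split method are hypothetical, a single openQuote variable stands in for the two flags, and escaped quotes or multi-line fields are not handled.

```java
import java.util.ArrayList;
import java.util.List;

public class QuoteAwareSplitter {

    /** Splits a line at commas that are not inside single or double quotes. */
    public static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        char openQuote = 0;        // 0 = not inside quotes, otherwise the quote char that opened the field
        boolean fieldStart = true; // true until the first non-blank char of a field has been seen

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (openQuote != 0) {
                if (c == openQuote) {
                    openQuote = 0;     // matching closing quote: leave ignore-commas mode
                } else {
                    current.append(c); // commas and the other quote type are plain text here
                }
            } else if (c == ',') {
                fields.add(current.toString().trim());
                current.setLength(0);
                fieldStart = true;
            } else if (fieldStart && (c == '"' || c == '\'')) {
                openQuote = c;         // a quote only opens a field when it is the first non-blank char
                fieldStart = false;
            } else {
                if (!Character.isWhitespace(c)) {
                    fieldStart = false;
                }
                current.append(c);
            }
        }
        fields.add(current.toString().trim());
        return fields;
    }

    public static void main(String[] args) {
        String[] lines = {
            "someStuff,\"hello\", someStuff",
            "someStuff,\"hello, SO\", someStuff",
            "someStuff,'say \"hello, world\"', someStuff",
            "someStuff,'say \"hello, world', someStuff"
        };
        for (String line : lines) {
            System.out.println(split(line).get(1)); // second column of each line
        }
    }
}
```

Run against the four sample lines from the question, this prints hello, then hello, SO, then say "hello, world", then say "hello, world. Note that trim() also removes leading and trailing blanks inside quoted fields, which may or may not be what you want.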
If the use of single and double quotes is consistent per line, one could choose the corresponding type of quote per line:
public class CSVDemo {
public static void main(String[] args) throws IOException {
CSVDemo demo = new CSVDemo();
demo.process("input.csv");
}
public void process(String fileName) throws IOException {
String file = this.getClass().getClassLoader().getResource(fileName)
.getFile();
CSVParser doubleParser = new CSVParser(',', '"');
CSVParser singleParser = new CSVParser(',', '\'');
String[] nextLine;
try (BufferedReader br = new BufferedReader(new FileReader(file))) {
String line;
while ((line = br.readLine()) != null) {
if (line.contains(",'") && line.contains("',")) {
nextLine = singleParser.parseLine(line);
} else {
nextLine = doubleParser.parseLine(line);
}
System.out.println(nextLine[0] + " | " + nextLine[1] + " | "
+ nextLine[2]);
}
}
}
}
It doesn't seem opencsv supports this. However, have a look at this previous question and my answer, as well as the other answers, in case they help you: https://stackoverflow.com/a/15905916/1688441
Below is an example. Please note that notInsideComma actually means "not inside quotes". The code could be extended to check for both single and double quotes.
public static ArrayList<String> customSplitSpecific(String s)
{
ArrayList<String> words = new ArrayList<String>();
boolean notInsideComma = true;
int start =0, end=0;
for(int i=0; i<s.length()-1; i++)
{
if(s.charAt(i)==',' && notInsideComma)
{
words.add(s.substring(start,i));
start = i+1;
}
else if(s.charAt(i)=='"')
notInsideComma=!notInsideComma;
}
words.add(s.substring(start));
return words;
}
I'm trying to write a grammar to handle binary numbers and compute their values:
grammar T;
options
{
backtrack=true;
}
prog :
(b2 = binarynum NEWLINE)+ EOF {System.out.println($binarynum.value);}
|
b1 = binarynum EOF {System.out.println($binarynum.value);}
;
binarynum returns [double value] :
s1=string '.' s2=string
{$value = $s1.value + $s2.value/Math.pow(2.0,$s2.length);}
|
string
{$value = $string.value;}
;
string returns [double value, int length] :
bit s2=string
{$value = $bit.value*Math.pow(2.0,$s2.length)+$s2.value; $length = $s2.length+1; }
|
bit
{$value = $bit.value; $length = 1; }
;
bit returns [double value] :
'0'
{ $value = 0;}
|
'1'
{ $value = 1;}
;
NEWLINE: ('\r')? '\n' {skip();} ;
Java code:
import org.antlr.runtime.*;
public class TestT {
public static void main(String[] args) throws Exception {
// Create an TLexer that feeds from that stream
//TLexer lexer = new TLexer(new ANTLRInputStream(System.in));
TLexer lexer = new TLexer(new ANTLRFileStream("input.txt"));
// Create a stream of tokens fed by the lexer
CommonTokenStream tokens = new CommonTokenStream(lexer);
// Create a parser that feeds off the token stream
TParser parser = new TParser(tokens);
// Begin parsing at rule prog
parser.prog();
}
}
Input File ("input.txt") contains:
11111.111
1000
1000.1
Error: line 3:4 missing EOF at '.'
I first tested the code with having just one input with the prog statement as the following:
prog :
binarynum EOF {System.out.println($binarynum.value);}
;
Everything works out just fine when I do the above modification with one input, however I can't seem to get the hang of it when using multiple inputs separated by new lines.
Can someone please help me out and tell me where I went wrong.
I also have another question, when should the EOF not be included in the grammar? When I tested the grammar for one input after removing the EOF from the grammar I received no errors and a correct output.
Can someone please help me out and tell me where I went wrong.
Your lexer is skipping line breaks while your parser uses them. Remove {skip();} from the lexer rule.
I also have another question, when should the EOF not be included in the grammar?
You'll usually have it at the end of your top level parser rule, which will force the parser to consume the entire input.
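To check what the corrected grammar should print for input.txt, the arithmetic in the binarynum and string actions can be reproduced in plain Java. This is only a sketch for cross-checking values: the class BinaryValue and its methods are hypothetical and are not part of the generated parser.

```java
public class BinaryValue {

    /** Value of a bit string such as "1000" read as a binary integer. */
    static double bitsValue(String bits) {
        double value = 0;
        for (char c : bits.toCharArray()) {
            if (c != '0' && c != '1') {
                throw new IllegalArgumentException("not a bit: " + c);
            }
            value = value * 2 + (c - '0'); // shift left by one bit, then add the new bit
        }
        return value;
    }

    /** Mirrors the binarynum action: s1.value + s2.value / 2^(s2.length). */
    static double parse(String text) {
        int dot = text.indexOf('.');
        if (dot < 0) {
            return bitsValue(text);
        }
        String s1 = text.substring(0, dot);
        String s2 = text.substring(dot + 1);
        return bitsValue(s1) + bitsValue(s2) / Math.pow(2.0, s2.length());
    }

    public static void main(String[] args) {
        for (String line : new String[] {"11111.111", "1000", "1000.1"}) {
            System.out.println(line + " = " + parse(line));
        }
    }
}
```

For the three lines of input.txt this gives 31.875, 8.0 and 8.5. Unlike the grammar, the sketch does not reject an empty bit string on either side of the dot.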
Given a String like..
(a+(a+b)), (d*e) :- (e-f)
Note: (d*e) and (e-f) are different expressions. How can I fetch the expressions from this string? I have the grammar defined as..
parse returns [String value]
: addExp {$value=$addExp.value;} EOF
;
addExp returns [String value]
: multExp {$value=$multExp.value;} (('+' | '-' | '*') multExp{$value+= '+' + $multExp.value;})*
;
multExp returns [String value]
: atom {$value=$atom.value;} (('*' | '/') atom {$value+=$atom.value;})*
;
atom returns [String value]
: x=ID {$value=$x.text;}
| '(' addExp ')' {$value='('+$addExp.value+')';}
;
ID : 'a'..'z' | 'A'..'Z';
I tried..
ANTLRStringStream a=new ANTLRStringStream("(a+(a+b)), (d*e) :- (e-f)");
SLexer l=new SLexer(a);
CommonTokenStream c=new CommonTokenStream(l);
SParser p=new SParser(c);
String exp;
while(exp = p.parse())
{
System.out.println(exp);
}
I'm thinking of something like hasNext() and then fetching.
Your lexer rule TEXT can match an empty string, which causes the lexer to create an infinite number of tokens. Also, you don't need all those returns clauses after your rules: you can simply grab what a parser (or lexer) rule matched by adding .text after it.
You could let your parser return a List<String>, or let it return a single String and repeatedly invoke that parser rule until EOF is encountered.
A little demo:
grammar T;
@parser::members {
public static void main(String[] args) throws Exception {
String src = "likes(a, b) :- likes(a, X), likes(X, b). hates(a, b) " +
":- hates(a,X), hates(X,b). likes(a,b) :- says(god, likes(a,b)).";
TLexer lexer = new TLexer(new ANTLRStringStream(src));
TParser parser = new TParser(new CommonTokenStream(lexer));
List<String> statements = parser.parse();
for(String s : statements) {
System.out.println(s);
}
}
}
parse returns [List<String> statements]
@init{$statements = new ArrayList<String>();}
: (statement {$statements.add($statement.text);} ~TEXT+)+ EOF
;
statement
: TEXT OPAR params CPAR
;
params
: (param (COMMA param)*)?
;
param
: TEXT
| statement
;
COMMA : ',';
OPAR : '(';
CPAR : ')';
TEXT : ('a'..'z' | 'A'..'Z')+;
SPACE : (' ' | '\t') {$channel=HIDDEN;};
OTHER : . ;
Note that ~TEXT+ in the parse rule matches one or more tokens other than TEXT.
If you now create a lexer and parser and run the TParser class:
*nix/MacOS
java -cp antlr-3.3.jar org.antlr.Tool T.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar TParser
or
Windows
java -cp antlr-3.3.jar org.antlr.Tool T.g
javac -cp antlr-3.3.jar *.java
java -cp .;antlr-3.3.jar TParser
you will see the following being printed to your console:
likes(a, b)
likes(a, X)
likes(X, b)
hates(a, b)
hates(a,X)
hates(X,b)
likes(a,b)
says(god, likes(a,b))
EDIT
And here's how to return a single String as opposed to a List<String>:
@parser::members {
public static void main(String[] args) throws Exception {
String src = "likes(a, b) :- likes(a, X), likes(X, b). hates(a, b) " +
":- hates(a,X), hates(X,b). likes(a,b) :- says(god, likes(a,b)).";
TLexer lexer = new TLexer(new ANTLRStringStream(src));
TParser parser = new TParser(new CommonTokenStream(lexer));
String s;
while((s = parser.parse()) != null) {
System.out.println(s);
}
}
}
parse returns [String s]
: statement ~(TEXT| EOF)* {$s = $statement.text;}
| EOF {$s = null;}
;
You should just be able to call sentence() repeatedly until you hit the end of input.
I need to make JavaCC aware of a context (current parent token), and depending on that context, expect different token(s) to occur.
Consider the following pseudo-code:
TOKEN <abc> { "abc*" } // recognizes "abc", "abcd", "abcde", ...
TOKEN <abcd> { "abcd*" } // recognizes "abcd", "abcde", "abcdef", ...
TOKEN <element1> { "element1" "[" expectOnly(<abc>) "]" }
TOKEN <element2> { "element2" "[" expectOnly(<abcd>) "]" }
...
So when the generated parser is "inside" a token named "element1" and it encounters "abcdef", it recognizes it as <abc>, but when it's "inside" a token named "element2", it recognizes the same string as <abcd>.
element1 [ abcdef ] // aha! it can only be <abc>
element2 [ abcdef ] // aha! it can only be <abcd>
If I'm not wrong, it would behave similar to more complex DTD definitions of an XML file.
So, how can one specify which token(s) are valid/expected in which "context"?
NOTE: For my real case it would not be enough to define a kind of "hierarchy" of tokens, so that "abcdef" is always matched first against <abcd> and then <abc>. I really need context-aware tokens.
OK, it seems that you need a technique called lookahead here. Here is a very good tutorial:
Lookahead tutorial
My first attempt was wrong then, but as it works for distinct tokens which define a context I'll leave it here (Maybe it's useful for somebody ;o)).
Let's say we want to have some kind of markup language. All we want to "markup" are:
Expressions consisting of letters (abc...zABC...Z) and whitespaces --> words
Expressions consisting of numbers (0-9) --> numbers
We want to enclose words in <WORDS> tags and numbers in <NUMBER> tags. So, if I understood you correctly, this is what you want to do: if you're in the word context (between word tags), the parser should expect letters and whitespace; in the number context it expects numbers.
I created the file WordNumber.jj which defines the grammar and the parser to be generated:
options
{
LOOKAHEAD= 1;
CHOICE_AMBIGUITY_CHECK = 2;
OTHER_AMBIGUITY_CHECK = 1;
STATIC = true;
DEBUG_PARSER = false;
DEBUG_LOOKAHEAD = false;
DEBUG_TOKEN_MANAGER = false;
ERROR_REPORTING = true;
JAVA_UNICODE_ESCAPE = false;
UNICODE_INPUT = false;
IGNORE_CASE = false;
USER_TOKEN_MANAGER = false;
USER_CHAR_STREAM = false;
BUILD_PARSER = true;
BUILD_TOKEN_MANAGER = true;
SANITY_CHECK = true;
FORCE_LA_CHECK = false;
}
PARSER_BEGIN(WordNumberParser)
/** Model-tree Parser */
public class WordNumberParser
{
/** Main entry point. */
public static void main(String args []) throws ParseException
{
WordNumberParser parser = new WordNumberParser(System.in);
parser.Input();
}
}
PARSER_END(WordNumberParser)
SKIP :
{
" "
| "\n"
| "\r"
| "\r\n"
| "\t"
}
TOKEN :
{
< WORD_TOKEN : (["a"-"z"] | ["A"-"Z"] | " " | "." | ",")+ > |
< NUMBER_TOKEN : (["0"-"9"])+ >
}
/** Root production. */
void Input() :
{}
{
( WordContext() | NumberContext() )* < EOF >
}
/** WordContext production. */
void WordContext() :
{}
{
"<WORDS>" (< WORD_TOKEN >)+ "</WORDS>"
}
/** NumberContext production. */
void NumberContext() :
{}
{
"<NUMBER>" (< NUMBER_TOKEN >)+ "</NUMBER>"
}
You can test it with a file like that:
<WORDS>This is a sentence. As you can see the parser accepts it.</WORDS>
<WORDS>The answer to life, universe and everything is</WORDS><NUMBER>42</NUMBER>
<NUMBER>This sentence will make the parser sad. Do not make the parser sad.</NUMBER>
The last line will cause the parser to throw an exception like this:
Exception in thread "main" ParseException: Encountered " <WORD_TOKEN> "This sentence will make the parser sad. Do not make the parser sad. "" at line 3, column 9.
Was expecting:
<NUMBER_TOKEN> ...
That is because the parser did not find what it expected.
I hope that helps.
Cheers!
P.S.: The parser can't "be" inside a token as a token is a terminal symbol (correct me if I'm wrong) which can't be replaced by production rules any further. So all the context aspects have to be placed inside a production rule (non terminal) like "WordContext" in my example.
You need to use lexer states. Your example becomes something like:
<DEFAULT> TOKEN: { <ELEMENT1: "element1">: IN_ELEMENT1 }
<DEFAULT> TOKEN: { <ELEMENT2: "element2">: IN_ELEMENT2 }
<IN_ELEMENT1> TOKEN: { <ABC: "abc" (...)*>: DEFAULT }
<IN_ELEMENT2> TOKEN: { <ABCD: "abcd" (...)*>: DEFAULT }
Please note that the (...)* are not proper JavaCC syntax, but your example is not either so I can only guess.
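To illustrate the idea behind lexer states outside of JavaCC, here is a minimal hand-written sketch in plain Java. Everything in it is an assumption made for illustration: the class StatefulLexerSketch, the regular expressions (the question's "abc*" is read as "abc" followed by any lowercase letters), and the choice to skip brackets and blanks in every state. It is not what JavaCC generates; it only shows that the same text is classified differently depending on the current state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StatefulLexerSketch {

    enum State { DEFAULT, IN_ELEMENT1, IN_ELEMENT2 }

    // Hypothetical token shapes, guessed from the pseudo-code in the question.
    private static final Pattern ELEMENT = Pattern.compile("element[12]");
    private static final Pattern ABC = Pattern.compile("abc[a-z]*");
    private static final Pattern ABCD = Pattern.compile("abcd[a-z]*");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        State state = State.DEFAULT;
        int pos = 0;
        while (pos < input.length()) {
            char c = input.charAt(pos);
            if (Character.isWhitespace(c) || c == '[' || c == ']') {
                pos++; // brackets and blanks are skipped in every state, for brevity
                continue;
            }
            // The current state decides which token pattern is allowed next.
            Pattern expected = state == State.IN_ELEMENT1 ? ABC
                             : state == State.IN_ELEMENT2 ? ABCD
                             : ELEMENT;
            Matcher m = expected.matcher(input).region(pos, input.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException(
                        "unexpected input at " + pos + " in state " + state);
            }
            String text = m.group();
            switch (state) {
                case IN_ELEMENT1:
                    tokens.add("ABC(" + text + ")");
                    state = State.DEFAULT; // like ": DEFAULT" after the token definition
                    break;
                case IN_ELEMENT2:
                    tokens.add("ABCD(" + text + ")");
                    state = State.DEFAULT;
                    break;
                default:
                    boolean first = text.equals("element1");
                    tokens.add((first ? "ELEMENT1(" : "ELEMENT2(") + text + ")");
                    state = first ? State.IN_ELEMENT1 : State.IN_ELEMENT2;
                    break;
            }
            pos = m.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("element1 [ abcdef ] element2 [ abcdef ]"));
    }
}
```

With the input from the question, the first abcdef is reported as ABC and the second as ABCD, because the preceding element token switched the state.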