As a beginner learning ANTLR4 from The Definitive ANTLR 4 Reference, I tried to run my modified version of the exercise from Chapter 7:
/**
* to parse properties file
* this example demonstrates using embedded actions in code
*/
grammar PropFile;
@header {
import java.util.Properties;
}
@members {
Properties props = new Properties();
}
file
:
{
System.out.println("Loading file...");
}
prop+
{
System.out.println("finished:\n"+props);
}
;
prop
: ID '=' STRING NEWLINE
{
props.setProperty($ID.getText(),$STRING.getText());//add one property
}
;
ID : [a-zA-Z]+ ;
STRING :(~[\r\n])+; //if use STRING : '"' .*? '"' everything is fine
NEWLINE : '\r'?'\n' ;
Since Java properties are just key-value pairs, I use STRING to match everything except NEWLINE (I don't want it to support only strings in double quotes). When I ran the following input, I got:
D:\Antlr\Ex\PropFile\Prop1>grun PropFile prop -tokens
driver=mysql
^Z
[@0,0:11='driver=mysql',<3>,1:0]
[@1,12:13='\r\n',<4>,1:12]
[@2,14:13='<EOF>',<-1>,2:14]
line 1:0 mismatched input 'driver=mysql' expecting ID
When I use STRING : '"' .*? '"' instead, it works.
I would like to know where I was wrong so that I can avoid similar mistakes in the future.
Please give me some suggestion, thank you!
Since both ID and STRING can match the input text starting with "driver", the lexer will choose the longest possible match, even though the ID rule comes first.
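The longest-match rule can be illustrated outside ANTLR. The following is a minimal sketch in plain Java, not ANTLR's actual implementation: the class name LongestMatchDemo and the method nextToken are made up for illustration, and java.util.regex patterns stand in for the three lexer rules from the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LongestMatchDemo {
    // Rules in declaration order, mirroring the lexer rules in the question.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("ID", Pattern.compile("[a-zA-Z]+"));
        RULES.put("STRING", Pattern.compile("[^\\r\\n]+"));
        RULES.put("NEWLINE", Pattern.compile("\\r?\\n"));
    }

    /** Returns "RULE:text" for the token starting at pos, or null if no rule matches. */
    static String nextToken(String input, int pos) {
        String bestRule = null;
        int bestEnd = pos;
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            m.region(pos, input.length());
            // A strictly longer match wins; on a tie the earlier rule is kept.
            if (m.lookingAt() && m.end() > bestEnd) {
                bestRule = rule.getKey();
                bestEnd = m.end();
            }
        }
        return bestRule == null ? null : bestRule + ":" + input.substring(pos, bestEnd);
    }

    public static void main(String[] args) {
        System.out.println(nextToken("driver=mysql\n", 0)); // STRING:driver=mysql
        System.out.println(nextToken("driver\n", 0));       // ID:driver
    }
}
```

In the first call STRING wins because it consumes 12 characters against 6 for ID; rule order only matters in the second call, where both rules match the same 6 characters.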
So, you have several choices here. The most direct is to remove the ambiguity between ID and STRING (which is how your alternative works) by requiring the string to start with the equals sign.
file : prop+ EOF ;
prop : ID STRING NEWLINE ;
ID : [a-zA-Z]+ ;
STRING : '=' (~[\r\n])+;
NEWLINE : '\r'?'\n' ;
You can then use an action to trim the equals sign from the text of the string token.
Alternately, you can use a predicate to disambiguate the rules.
file : prop+ EOF ;
prop : ID '=' STRING NEWLINE ;
ID : [a-zA-Z]+ ;
STRING : { isValue() }? (~[\r\n])+;
NEWLINE : '\r'?'\n' ;
where the isValue method looks backwards on the character stream to verify that it follows an equals sign. Something like:
@members {
public boolean isValue() {
int offset = _tokenStartCharIndex;
for (int idx = offset-1; idx >=0; idx--) {
String s = _input.getText(Interval.of(idx, idx));
if (Character.isWhitespace(s.charAt(0))) {
continue;
} else if (s.charAt(0) == '=') {
return true;
} else {
break;
}
}
return false;
}
}
For a given plain JSON data do the following formatting:
replace all the special characters in key with underscore
remove the key double quote
replace the : with =
Example:
JSON Data: {"no/me": "139.82", "gc.pp": "\u0000\u000", ...}
After formatting: no_me="139.82", gc_pp="\u0000\u000"
Is it possible with a regular expression? or any other single command execution?
A single regex for the whole changes may be overkill. I think you could code something similar to this:
(NOTE: Since I do not code in Java, my example is in JavaScript, just to give you the idea.)
var json_data = '{"no/me": "139.82", "gc.pp": "0000000", "foo":"bar"}';
console.log(json_data);
var data = JSON.parse(json_data);
var out = '';
for (var x in data) {
var clean_x = x.replace(/[^a-zA-Z0-9]/g, "_");
if (out != '') out += ', ';
out += clean_x + '="' + data[x] + '"';
}
console.log(out);
Basically, you loop through the keys and clean them (replacing unwanted characters); with the new key and the original value you build a new string in the format you like.
Important: Bear in mind overlapping ids. For example, both no/me and no#me will collapse into the same id no_me. This may not be important since you are not outputting JSON after all. I mention it just in case.
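Since the question asks about Java, here is a rough Java counterpart of the loop above that also guards against the overlapping-id problem. It is only a sketch: it assumes the JSON has already been parsed into a Map by whatever JSON library you use, and the names KeyFormatter and format are invented for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class KeyFormatter {
    /** Formats already-parsed entries as key="value" pairs; fails if two keys clean to the same id. */
    static String format(Map<String, String> entries) {
        Map<String, String> seen = new LinkedHashMap<>(); // cleaned key -> original key
        StringJoiner out = new StringJoiner(", ");
        for (Map.Entry<String, String> e : entries.entrySet()) {
            // Same cleaning rule as the JavaScript example: non-alphanumerics become underscores.
            String clean = e.getKey().replaceAll("[^a-zA-Z0-9]", "_");
            String previous = seen.put(clean, e.getKey());
            if (previous != null) {
                throw new IllegalArgumentException(
                    "Keys '" + previous + "' and '" + e.getKey() + "' both map to '" + clean + "'");
            }
            out.add(clean + "=\"" + e.getValue() + "\"");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> data = new LinkedHashMap<>();
        data.put("no/me", "139.82");
        data.put("gc.pp", "0000000");
        System.out.println(format(data)); // no_me="139.82", gc_pp="0000000"
    }
}
```

Throwing on a collision is just one option; appending a numeric suffix or keeping the first key would work as well, depending on what consumes the output.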
I haven't done Java in a long time, but I think you need something like this.
I'm assuming you mean 'all Non-Word characters' by specialchars here.
import java.util.regex.*;
String jsonData = "{\"no/me\": \"139.82\", \"gc.pp\": \"\\u0000\\u000\", ...}";
// remove { and }
jsonData = jsonData.substring(1, jsonData.length() - 1);
try {
Pattern regex = Pattern.compile("\"([^\"]+)\"\\s*:\\s*"); // find the keys, including quotes and colon
Matcher regexMatcher = regex.matcher(jsonData);
StringBuffer result = new StringBuffer();
while (regexMatcher.find()) {
String key = regexMatcher.group(1).replaceAll("\\W", "_") + "="; // no_me=
// quoteReplacement keeps any $ or backslash in the key literal
regexMatcher.appendReplacement(result, Matcher.quoteReplacement(key));
}
regexMatcher.appendTail(result); // copy the text after the last key
jsonData = result.toString(); // Strings are immutable, so keep the new value
} catch (PatternSyntaxException ex) {
// regex has syntax error
}
System.out.println(jsonData);
Is there any way to parse a CSV file (with a variable number of columns) with the help of some CSV parser (e.g. SuperCSV) into a set of List<String> without dropping the quotes in Java? For the input:
id,name,text,sth
1,"John","Text with 'c,o,m,m,a,s' and \"",qwerty
2,Bob,"",,sth
after parsing, I'd like to have in the set the same text as in input instead of:
id,name,text,sth
1,John,Text with 'c,o,m,m,a,s' and \",qwerty
2,Bob,null,null,sth
so that the element
"John" will be parsed to the string "John" (instead of John)
"" --> ""
,, --> ,null,
etc.
I already wrote about this here, but I probably didn't make this clear enough.
I want to parse a CSV file into a set of List<String>, do something with it, and print it to stdout, leaving the quotes where they were. Please help me.
Something like this? Not using any existing parser, doing it from scratch:
public List<String> parse(String st) {
List<String> result = new ArrayList<String>();
boolean inText = false;
StringBuilder token = new StringBuilder();
char prevCh = 0;
for (int i = 0; i < st.length(); i++) {
char ch = st.charAt(i);
if (ch == ',' && !inText) {
result.add(token.toString());
token = new StringBuilder();
continue;
}
if (ch == '"' && inText) {
if (prevCh == '\\') {
token.deleteCharAt(token.length() - 1);
} else {
inText = false;
}
} else if (ch == '"' && !inText) {
inText = true;
}
token.append(ch);
prevCh = ch;
}
result.add(token.toString());
return result;
}
Then
String st = "1,\"John\",\"Text with 'c,o,m,m,a,s' and \\\"\",qwerty";
List<String> result = parse(st);
System.out.println(result);
Will print out:
[1, "John", "Text with 'c,o,m,m,a,s' and "", qwerty]
I have used this one:
http://opencsv.sourceforge.net/
And I was pretty satisfied with the results. I had a bunch of differently organized CSV files (it's sometimes funny what kinds of things people call CSV these days), and I managed to set up the reader for them. However, I don't think it will generate commas, but it will leave blanks where there is an empty field. Since you can fetch the whole line as an array, you can iterate over it and put a comma between the elements.
Look up the settings, there is a bunch of them, including quote characters.
I am facing a little difficulty with a Syntax highlighter that I've made and is 90% complete. What it does is that it reads in the text from the source of a .java file, detects keywords, comments, etc and writes a (colorful) output in an HTML file. Sample output from it is:
(I couldn't upload a whole html page, so this is a screenshot.) As (I hope) you can see, my program seems to work correctly with keywords, literals and comments (see below) and hence can normally document almost all programs. But it seems to break apart when I store the escape sequence for " i.e. \" inside a String. An error case is shown below:
The string literal highlighting doesn't stop at the end of the literal, but continues until it finds another cue, like a keyword or another literal.
So, the question is how do I disguise/hide/remove this \" from within a String?
The stringFilter method of my program is:
public String stringFilter(String line) {
if (line == null || line.equals("")) {
return "";
}
StringBuffer buf = new StringBuffer();
if (line.indexOf("\"") <= -1) {
return keywordFilter(line);
}
int start = 0;
int startStringIndex = -1;
int endStringIndex = -1;
int tempIndex;
//Keep moving through String characters until we want to stop...
while ((tempIndex = line.indexOf("\"")) > -1 && !isInsideString(line, tempIndex)) {
//We found the beginning of a string
if (startStringIndex == -1) {
startStringIndex = 0;
buf.append( stringFilter(line.substring(start,tempIndex)) );
buf.append("</font>");
buf.append(literal).append("\"");
line = line.substring(tempIndex+1);
}
//Must be at the end
else {
startStringIndex = -1;
endStringIndex = tempIndex;
buf.append(line.substring(0,endStringIndex+1));
buf.append("</font>");
buf.append(normal);
line = line.substring(endStringIndex+1);
}
}
buf.append( keywordFilter(line) );
return buf.toString();
}
EDIT
in response to the first few comments and answers, here's what I tried:
A snippet from htmlFilter(String), but it doesn't work :(
//replace '&' i.e. ampersands with HTML escape sequence for ampersand.
line = line.replaceAll("&", "&amp;");
//line = line.replaceAll(" ", "&nbsp;");
line = line.replaceAll("" + (char)35, "&#35;");
// replace less-than signs which might be confused
// by HTML as tag angle-brackets;
line = line.replaceAll("<", "&lt;");
// replace greater-than signs which might be confused
// by HTML as tag angle-brackets;
line = line.replaceAll(">", "&gt;");
line = multiLineCommentFilter(line);
//replace the '\\' i.e. escape for backslash with HTML escape sequences.
//fixes a problem when backslashes preceed quotes.
//line = line.replaceAll("\\\"", "\"");
//line = line.replaceAll("" + (char)92 + (char)92, "\\");
return line;
My idea is that when a backslash is met, ignore the next character.
String str = "blah\"blah\\blah\n";
int index = 0;
while (true) {
// find the beginning
while (index < str.length() && str.charAt(index) != '\"')
index++;
int beginIndex = index;
if (index == str.length()) // no string found
break;
index++;
// find the ending
while (index < str.length()) {
if (str.charAt(index) == '\\') {
// escape, ignore the next character
index += 2;
} else if (str.charAt(index) == '\"') {
// end of string found
System.out.println(beginIndex + " " + index);
break;
} else {
// plain content
index++;
}
}
if (index >= str.length())
throw new IllegalArgumentException(
"String literal is not properly closed by a double-quote");
index++;
}
Check the char found at tempIndex-1; if it is \ then don't consider the quote as the beginning or ending of a string.
String originalLine=line;
if ((tempIndex = originalLine.indexOf("\"", tempIndex + 1)) > -1) {
if (tempIndex==0 || originalLine.charAt(tempIndex - 1) != '\\') {
...
Steps to follow:
First replace all \" with some temp string such as
String tempStr="forward_slash_followed_by_double_quote";
line = line.replaceAll("\\\\\"", tempStr);
//line = line.replaceAll("\\\"", tempStr);
Do whatever you are doing
Finally replace that temp string with \"
line = line.replaceAll(tempStr, "\\\\\"");
//line = line.replaceAll(tempStr, "\\\"");
The trouble with finding a quote and then trying to work out whether it's escaped is that it's not enough to simply look at the previous character to see if it's a backslash - consider
String basedir = "C:\\Users\\";
where the \" isn't an escaped quote, but is actually an escaped backslash followed by an unescaped quote. In general a quote preceded by an odd number of backslashes is escaped, one preceded by an even number of backslashes isn't.
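The odd/even rule can be checked directly by counting the backslashes in front of a quote. This is a small sketch for illustration only; the names QuoteEscapes and isEscaped are made up here and are not part of the highlighter in the question.

```java
class QuoteEscapes {
    /** True if the quote at quoteIndex is preceded by an odd number of backslashes. */
    static boolean isEscaped(String source, int quoteIndex) {
        int backslashes = 0;
        // Walk left from the quote while we keep seeing backslashes.
        for (int i = quoteIndex - 1; i >= 0 && source.charAt(i) == '\\'; i--) {
            backslashes++;
        }
        return backslashes % 2 == 1;
    }

    public static void main(String[] args) {
        String escaped = "say \\\"hi";            // one backslash before the quote
        String closing = "C:\\\\Users\\\\\"";     // two backslashes before the quote
        System.out.println(isEscaped(escaped, escaped.indexOf('"'))); // true
        System.out.println(isEscaped(closing, closing.indexOf('"'))); // false
    }
}
```

This fixes the single-character look-behind, but it still assumes you already know the quote sits inside a string literal rather than in a comment or char literal.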
A more sensible approach would be to parse through the string one character at a time from left to right rather than trying to jump ahead to quote characters. If you don't want to have to learn a proper parser generator like JavaCC or antlr then you can tackle this case with regular expressions using the \G anchor (to force each subsequent match to start at the end of the previous one with no gaps) - if we assume that str is a substring of your input starting with the character following the opening quote of a string literal then
Pattern p = Pattern.compile("\\G(?:\\\\u[0-9A-Fa-f]{4}|\\\\.|[^\"\\\\])");
StringBuilder buf = new StringBuilder();
Matcher m = p.matcher(str);
while(m.find()) buf.append(m.group());
will leave buf containing the content of the string literal up to but not including the closing quote, and will handle escapes like \", \\ and unicode escapes \uNNNN.
Use double slash "\\"" instead of "\""... Maybe it works...
I've never been good with regex and I can't seem to get this...
I am trying to match statements along these lines (these are two lines in a text file I'm reading)
Lname Fname 12.35 1
Jones Bananaman 7.1 3
Currently I am using this for a while statement
reader.hasNext("\\w+ \\w+ \\d*\\.\\d{1,2} [0-5]")
But it doesn't enter the while statement.
The program reads the text file just fine when I remove the while.
The code segment is this:
private void initializeFileData(){
try {
Scanner reader = new Scanner(openedPath);
while(reader.hasNext("\\w+ \\w+ \\d*\\.\\d{1,2} [0-5]")){
employeeInfo.add(new EmployeeFile(reader.next(), reader.next(), reader.nextDouble(), reader.nextInt(), new employeeRemove()));
}
for(EmployeeFile element: employeeInfo){
output.add(element);
}
} catch (FileNotFoundException e) {
e.printStackTrace();
}
}
Use the \s character class for the spaces between words:
while(reader.hasNext("\\w+\\s\\w+\\s\\d*\\.\\d{1,2}\\s[0-5]"))
Update:
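According to the javadoc for the Scanner class, by default it splits its input into tokens using whitespace, and hasNext(String pattern) tests only the next token, so a pattern that contains spaces can never match. You can change the delimiter it uses with the useDelimiter(String pattern) method of Scanner.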
private void initializeFileData(){
try {
Scanner reader = new Scanner(openedPath).useDelimiter("\\n");
...
while(reader.hasNext("\\w+\\s\\w+\\s\\d*\\.\\d{1,2}\\s[0-5]")){
...
http://docs.oracle.com/javase/1.5.0/docs/api/java/util/Scanner.html
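An alternative to changing the delimiter is to read whole lines and match each one against the pattern. The sketch below assumes the two-line sample from the question; the class name LineRecords, the method parse, and the capture groups are additions for illustration, and it reads from a String rather than the openedPath file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineRecords {
    // Same shape as the pattern in the question, with capture groups added.
    static final Pattern LINE =
        Pattern.compile("(\\w+)\\s+(\\w+)\\s+(\\d*\\.\\d{1,2})\\s+([0-5])");

    /** Returns the four captured fields of every line that matches LINE; other lines are skipped. */
    static List<String[]> parse(String text) {
        List<String[]> rows = new ArrayList<>();
        try (Scanner reader = new Scanner(text)) {
            while (reader.hasNextLine()) {
                Matcher m = LINE.matcher(reader.nextLine().trim());
                if (m.matches()) {
                    rows.add(new String[] { m.group(1), m.group(2), m.group(3), m.group(4) });
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // With the default delimiter only the token "Lname" is tested, so this prints false.
        System.out.println(
            new Scanner("Lname Fname 12.35 1").hasNext("\\w+ \\w+ \\d*\\.\\d{1,2} [0-5]"));
        for (String[] row : parse("Lname Fname 12.35 1\nJones Bananaman 7.1 3\n")) {
            System.out.println(String.join("|", row));
        }
    }
}
```

Keeping the regex on whole lines also means a malformed line is skipped instead of silently ending the loop.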
From what I can see (And correct me if I'm wrong, because regex always seems to trick my brain :p), you're not handling the spaces correctly. You need to use \s, not just the standard ' ' character
EDIT: Sorry, \s. Someone else beat me to it :p
Actually
\w+
is going to catch [Lname, Fname, 12, 35, 1] for Lname Fname 12.35 1. So you can just store reader.nextLine() and then extract all regex matches from there. From there, you can abstract it a bit for instance by :
class EmployeeFile {
.....
public EmployeeFile(String firstName, String lastName,
Double firstDouble, int firstInt,
EmployeeRemove er){
.....
}
public EmployeeFile(String line) {
//TODO : extract all the required info from the string array
// instead of doing it while reading at the same time.
// Keep input parsing separate from input reading.
// Turn this into a string array using the regex pattern
// mentioned above
}
}
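The token extraction described above could be sketched as follows. The names WordTokens and extract are hypothetical; the EmployeeFile(String line) constructor would call something like this and then convert the pieces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordTokens {
    private static final Pattern WORD = Pattern.compile("\\w+");

    /** Collects every maximal run of word characters in the line, in order. */
    static List<String> extract(String line) {
        List<String> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(line);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(extract("Lname Fname 12.35 1")); // [Lname, Fname, 12, 35, 1]
    }
}
```

Note that the decimal number arrives as two tokens (12 and 35), so the constructor has to rejoin them with a dot before calling Double.parseDouble.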
I created my own version, without files and the last loop, that goes like that:
private static void initializeFileData() {
String[] testStrings = {"Lname Fname 12.35 1", "Jones Bananaman 7.1 3"};
Pattern myPattern = Pattern.compile("(\\w+)\\s+(\\w+)\\s+(\\d*\\.\\d{1,2})\\s+([0-5])");
for (String s : testStrings) {
Matcher myMatcher = myPattern.matcher(s);
if (myMatcher.matches()) { // groupCount() is always 4 for this pattern; matches() actually runs the match
String lastName = myMatcher.group(1);
String firstName = myMatcher.group(2);
double firstValue = Double.parseDouble(myMatcher.group(3) );
int secondValue = Integer.parseInt(myMatcher.group(4));
//employeeInfo.add(new EmployeeFile(lastName, firstName, firstValue, secondValue, new employeeRemove()));
}
}
}
Notice that I kept the backslash before the dot (you want a literal dot, not any character) and inserted the parentheses in order to create the groups.
I hope it helps.
I need to make JavaCC aware of a context (current parent token), and depending on that context, expect different token(s) to occur.
Consider the following pseudo-code:
TOKEN <abc> { "abc*" } // recognizes "abc", "abcd", "abcde", ...
TOKEN <abcd> { "abcd*" } // recognizes "abcd", "abcde", "abcdef", ...
TOKEN <element1> { "element1" "[" expectOnly(<abc>) "]" }
TOKEN <element2> { "element2" "[" expectOnly(<abcd>) "]" }
...
So when the generated parser is "inside" a token named "element1" and it encounters "abcdef", it recognizes it as <abc>, but when it's "inside" a token named "element2" it recognizes the same string as <abcd>.
element1 [ abcdef ] // aha! it can only be <abc>
element2 [ abcdef ] // aha! it can only be <abcd>
If I'm not wrong, it would behave similar to more complex DTD definitions of an XML file.
So, how can one specify, in which "context" which token(s) are valid/expected?
NOTE: It would not be enough for my real case to define a kind of "hierarchy" of tokens, so that "abcdef" is always matched first against <abcd> and then <abc>. I really need context-aware tokens.
OK, it seems that you need a technique called lookahead here. Here is a very good tutorial:
Lookahead tutorial
My first attempt was wrong then, but as it works for distinct tokens which define a context I'll leave it here (Maybe it's useful for somebody ;o)).
Let's say we want to have some kind of markup language. All we want to "markup" are:
Expressions consisting of letters (abc...zABC...Z) and whitespaces --> words
Expressions consisting of numbers (0-9) --> numbers
We want to enclose words in <WORDS> tags and numbers in <NUMBER> tags. So if I understood you correctly, this is what you want to do: if you're in the word context (between word tags) the parser should expect letters and whitespace, and in the number context it expects numbers.
I created the file WordNumber.jj which defines the grammar and the parser to be generated:
options
{
LOOKAHEAD= 1;
CHOICE_AMBIGUITY_CHECK = 2;
OTHER_AMBIGUITY_CHECK = 1;
STATIC = true;
DEBUG_PARSER = false;
DEBUG_LOOKAHEAD = false;
DEBUG_TOKEN_MANAGER = false;
ERROR_REPORTING = true;
JAVA_UNICODE_ESCAPE = false;
UNICODE_INPUT = false;
IGNORE_CASE = false;
USER_TOKEN_MANAGER = false;
USER_CHAR_STREAM = false;
BUILD_PARSER = true;
BUILD_TOKEN_MANAGER = true;
SANITY_CHECK = true;
FORCE_LA_CHECK = false;
}
PARSER_BEGIN(WordNumberParser)
/** Model-tree Parser */
public class WordNumberParser
{
/** Main entry point. */
public static void main(String args []) throws ParseException
{
WordNumberParser parser = new WordNumberParser(System.in);
parser.Input();
}
}
PARSER_END(WordNumberParser)
SKIP :
{
" "
| "\n"
| "\r"
| "\r\n"
| "\t"
}
TOKEN :
{
< WORD_TOKEN : (["a"-"z"] | ["A"-"Z"] | " " | "." | ",")+ > |
< NUMBER_TOKEN : (["0"-"9"])+ >
}
/** Root production. */
void Input() :
{}
{
( WordContext() | NumberContext() )* < EOF >
}
/** WordContext production. */
void WordContext() :
{}
{
"<WORDS>" (< WORD_TOKEN >)+ "</WORDS>"
}
/** NumberContext production. */
void NumberContext() :
{}
{
"<NUMBER>" (< NUMBER_TOKEN >)+ "</NUMBER>"
}
You can test it with a file like that:
<WORDS>This is a sentence. As you can see the parser accepts it.</WORDS>
<WORDS>The answer to life, universe and everything is</WORDS><NUMBER>42</NUMBER>
<NUMBER>This sentence will make the parser sad. Do not make the parser sad.</NUMBER>
The last line will cause the parser to throw an exception like this:
Exception in thread "main" ParseException: Encountered " <WORD_TOKEN> "This sentence will make the parser sad. Do not make the parser sad. "" at line 3, column 9.
Was expecting:
<NUMBER_TOKEN> ...
That is because the parser did not find what it expected.
I hope that helps.
Cheers!
P.S.: The parser can't "be" inside a token as a token is a terminal symbol (correct me if I'm wrong) which can't be replaced by production rules any further. So all the context aspects have to be placed inside a production rule (non terminal) like "WordContext" in my example.
You need to use lexer states. Your example becomes something like:
<DEFAULT> TOKEN: { <ELEMENT1: "element1">: IN_ELEMENT1 }
<DEFAULT> TOKEN: { <ELEMENT2: "element2">: IN_ELEMENT2 }
<IN_ELEMENT1> TOKEN: { <ABC: "abc" (...)*>: DEFAULT }
<IN_ELEMENT2> TOKEN: { <ABCD: "abcd" (...)*>: DEFAULT }
Please note that the (...)* are not proper JavaCC syntax, but your example is not either so I can only guess.
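To show what lexer states buy you, here is a hand-written simulation in plain Java. It is not JavaCC output and does not use any JavaCC API; the class LexerStateSketch, the whitespace-separated input, and the startsWith checks standing in for the abc and abcd patterns are all assumptions made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

class LexerStateSketch {
    enum State { DEFAULT, IN_ELEMENT1, IN_ELEMENT2 }

    /** Tokenizes whitespace-separated input; the current state decides how a word is labeled. */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        State state = State.DEFAULT;
        for (String word : input.trim().split("\\s+")) {
            if (word.equals("[") || word.equals("]")) {
                tokens.add(word);                  // brackets are accepted in every state
            } else if (state == State.DEFAULT && word.equals("element1")) {
                tokens.add("ELEMENT1");
                state = State.IN_ELEMENT1;         // like ": IN_ELEMENT1" in the grammar
            } else if (state == State.DEFAULT && word.equals("element2")) {
                tokens.add("ELEMENT2");
                state = State.IN_ELEMENT2;
            } else if (state == State.IN_ELEMENT1 && word.startsWith("abc")) {
                tokens.add("ABC(" + word + ")");
                state = State.DEFAULT;             // back to DEFAULT after the content token
            } else if (state == State.IN_ELEMENT2 && word.startsWith("abcd")) {
                tokens.add("ABCD(" + word + ")");
                state = State.DEFAULT;
            } else {
                throw new IllegalArgumentException("Unexpected '" + word + "' in state " + state);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("element1 [ abcdef ]")); // [ELEMENT1, [, ABC(abcdef), ]]
        System.out.println(tokenize("element2 [ abcdef ]")); // [ELEMENT2, [, ABCD(abcdef), ]]
    }
}
```

The same text abcdef gets a different token kind purely because of the state the preceding token switched to, which is the behavior the question asks for.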