BASIC Lexer with regex written in Java


I have to write a lexer in Java for a dialect of BASIC.
I group all the token types in an enum:
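public enum TokenType {
    INT("-?[0-9]+"),
    BOOLEAN("(TRUE|FALSE)"),
    PLUS("\\+"),
    MINUS("\\-"),
    // others...
    ;

    private final String pattern;
    TokenType(String pattern) { this.pattern = pattern; }
    public String getPattern() { return pattern; }
}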
Each constant's name is the token type name, and the string in parentheses is the regex used to match that type.
For example, to match the INT type I use "-?[0-9]+".
But now I have a problem. I put the regexes of all the token types into a StringBuffer with this:
private String pattern() {
    StringBuffer tokenPatternsBuffer = new StringBuffer();
    for (TokenType token : TokenType.values())
        tokenPatternsBuffer.append("|(?<" + token.name() + ">" + token.getPattern() + ")");
    // Drop the leading "|"
    String tokenPatternsString = tokenPatternsBuffer.toString().substring(1);
    return tokenPatternsString;
}
So it returns a String like:
(?<INT>-?[0-9]+)|(?<BOOLEAN>(TRUE|FALSE))|(?<PLUS>\+)|(?<MINUS>\-)|(?<PRINT>PRINT)....
Now I use this string to create a Pattern:
Pattern pattern = Pattern.compile(STRING);
Then I create a Matcher:
Matcher match = pattern.matcher("line of code");
Now I want to match all the token types and collect them into an ArrayList of Token. If the code's syntax is correct, it returns an ArrayList of Token (token name, value).
But I don't know how to exit the while loop if the syntax is incorrect and then print an error.
This is a piece of code used to create the ArrayList of Token.
private void lex() {
    ArrayList<Token> tokens = new ArrayList<Token>();
    int tokenSize = TokenType.values().length;
    int counter = 0;
    // Iterate over arrayLinee (an ArrayList of String) to get matches of the pattern
    for (String linea : arrayLinee) {
        counter = 0;
        Matcher match = pattern.matcher(linea);
        while (match.find()) {
            System.out.println(match.group(1));
            counter = 0;
            for (TokenType token : TokenType.values()) {
                counter++;
                if (match.group(token.name()) != null) {
                    tokens.add(new Token(token, match.group(token.name())));
                    counter = 0;
                    continue;
                }
            }
            if (counter == tokenSize) {
                System.out.println("Syntax Error in line : " + linea);
                break;
            }
        }
        tokenList.add("EOL");
    }
}
The code doesn't break out of the loop when the for loop iterates over all the token types without matching any of their regexes. How can I return an error if the syntax isn't correct?
Or do you know where I can find information on developing a lexer?

All you need to do is add an extra "INVALID" token at the end of your enum type with a regex like ".+" (match everything). Because the regexes are evaluated in order, it will only match if no other token was found. You then check whether the last token in your list is the INVALID token.
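Here is a minimal sketch of that idea, not a full BASIC lexer. It assumes a reduced token set, and SimpleLexer, Token, and the WS and INVALID members are names made up for illustration. WS is an extra assumption: with a catch-all at the end, whitespace has to be matched explicitly, otherwise it would end up in INVALID too. Instead of checking the list afterwards, this version throws as soon as INVALID matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SimpleLexer {

    // Reduced token set for illustration. WS and INVALID are hypothetical additions;
    // INVALID must stay last so it only matches when no other alternative does.
    enum TokenType {
        INT("-?[0-9]+"),
        BOOLEAN("TRUE|FALSE"),
        PLUS("\\+"),
        MINUS("-"),
        PRINT("PRINT"),
        WS("\\s+"),
        INVALID(".+");

        final String regex;
        TokenType(String regex) { this.regex = regex; }
    }

    static final class Token {
        final TokenType type;
        final String text;
        Token(TokenType type, String text) { this.type = type; this.text = text; }
        @Override public String toString() { return type + "(" + text + ")"; }
    }

    private static final Pattern PATTERN = buildPattern();

    // Joins all token regexes into one alternation of named groups.
    private static Pattern buildPattern() {
        StringBuilder sb = new StringBuilder();
        for (TokenType t : TokenType.values()) {
            if (sb.length() > 0) sb.append('|');
            sb.append("(?<").append(t.name()).append('>').append(t.regex).append(')');
        }
        return Pattern.compile(sb.toString());
    }

    // Returns the tokens of one line, or throws if the catch-all group matched.
    static List<Token> lex(String line) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = PATTERN.matcher(line);
        while (m.find()) {
            for (TokenType t : TokenType.values()) {
                String text = m.group(t.name());
                if (text == null) continue;          // this alternative did not match
                if (t == TokenType.INVALID) {
                    throw new IllegalArgumentException(
                        "Syntax error at column " + m.start() + ": " + text);
                }
                if (t != TokenType.WS) tokens.add(new Token(t, text)); // skip whitespace
                break;                               // exactly one group matches per find()
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(lex("PRINT 1 + 2")); // prints [PRINT(PRINT), INT(1), PLUS(+), INT(2)]
        try {
            lex("PRINT 1 @ 2");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // prints Syntax error at column 8: @ 2
        }
    }
}
```

Throwing an exception is just one choice; you could equally add the INVALID token to the list and report it later, as described above.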

If you are working in Java, I recommend trying out ANTLR 4 for creating your lexer. The grammar syntax is much cleaner than regular expressions, and the lexer generated from your grammar will automatically support reporting syntax errors.

If you are writing a full lexer, I'd recommend using an existing grammar builder. ANTLR is one solution, but I personally recommend parboiled instead, which lets you write grammars in pure Java.

Not sure if this was answered, or whether you came to an answer, but a lexer is broken into two distinct phases: the scanning phase and the parsing phase. You can combine them into a single pass (regex matching), but you'll find that a single-pass lexer has weaknesses if you need to do anything more than the most basic string translations.
In the scanning phase you break the character sequence apart based on the tokens you've specified. It would have helped to include an example of the text you are trying to parse. Wikipedia has a good example of a simple text lexer that turns a sentence into tokens (e.g. str.split(' ')). So with the scanner you tokenize the block of text into chunks by spaces (this should almost always be the first action) and then tokenize further based on other tokens, such as the ones you're attempting to match.
Then the parsing/evaluation phase iterates over the tokens and decides what to do with each one depending on the business logic, syntax rules, or whatever else you define. This could be evaluating some sort of math function (e.g. max(3,2)), or, more commonly, building a query language. You might make a web app with its own query language (SOLR comes to mind, as well as any SQL/NoSQL DB) that is translated into another language to make requests against a data source. Lexers are also commonly used in IDEs for code hinting and auto-completion.
This isn't a code-based answer, but it should give you an idea of how to tackle the problem.

Related

Regular expression for hgsv notation in java

HGSV nomenclature has a pattern:
xxxxx.yyyy:charactersnumbercharacters
I would like to write a regex in Java and fetch all the tokens from the above, e.g. it should yield 5 tokens:
{ 'xxxxx', 'yyyy', 'characters', 'number' , 'characters'}
I have used a simple split to fetch the tokens, but I don't think it is an optimal solution.
My current code is:
String hgsv = "BRAF.p:V600E";
String[] tokens = hgsv.split("\\."); // split() takes a regex, so the dot must be escaped
this.symbol = tokens[0];
String type = tokens[1].split(":")[0];
I would like to use Pattern and Matcher in Java, but I have no idea how to write a regex for the above tokens.
Any clue how to do that?
(Even to separate characters, numbers, and characters I will be using a regex, so why not use a regex for the entire string?)
I found a link, but it is in Python; I need something similar in Java.

I think what you're probably looking for is to use capture groups, like this:
String s = "BRAF.p:V600E";
Pattern p = Pattern.compile("(\\w+)\\.(\\w+):([a-zA-Z]+)(\\d+)([a-zA-Z]+)");
Matcher m = p.matcher(s);
if (m.matches()) {
String[] parts = {m.group(1),
m.group(2),
m.group(3),
m.group(4),
m.group(5)};
// Prints "[BRAF, p, V, 600, E]"
System.out.println(Arrays.toString(parts));
} else {
// The input String is invalid.
}
That's really just a lot like a split, but it's more stable because you're using the pattern to validate the String beforehand.
Note that I have no idea if that is the exact right pattern that you should be using. I don't know the exact details of the HGSV notation you're talking about and your description is actually pretty vague. (What are e.g. xxxxx and yyyy? What are "characters"?) If you link me to some sort of specification or detailed description of this notation I can try to write a regex that's more definitely correct.
Anyhow, my example shows the basic idea. You might also see http://www.regular-expressions.info/brackets.html for more information.

How tokenize code with ANTLR v4

First, I apologize for my bad English.
I am building a web app, and my task is to tokenize Java code. I found ANTLR v4 and tried to use it:
public class Tokenizer {
public void tokenizer(String code) {
ANTLRInputStream in = new ANTLRInputStream(code);
Java8Lexer lexer = new Java8Lexer(in);
List<? extends Token> tokenList = new ArrayList<>();
tokenList = lexer.getAllTokens();
for(Token token : tokenList){
System.out.println("Next token :" + token.getType() + "\n");
}
}
}
This code prints a list of ints: the numeric type of each token.
I need something like this:
the code annotated with something like "comments" on the code.
How can I get this result?
I am using this grammar: https://github.com/antlr/grammars-v4/tree/master/java8

The Token class contains several methods including
int getLine();
int getCharPositionInLine();
that associate the token with the corresponding source.

Using
token.getText()
you should get the parsed text the token represents.
In addition, you should get the token's name by
lexer.getVocabulary().getSymbolicName(token.getType())

The problem you are facing here is that you want a mix of tokens and rules in the output. For instance, VARIABLE_DECLARATION is actually a parser rule, while IDENTIFIER ASSIGN IDENTIFIER consists of 3 lexer rules. You can use the token stream to print the recognized lexemes, but that won't give you any parser rules.
What you can try instead is to print the returned parse tree, which you get when you do a real parse run on your input (see ParseTree.toString()). You can use a parser listener to walk a parse tree and transform it into a stream of rule descriptions along with the text that belongs to each rule (context).

Java Regex with Pattern and Matcher

I am using the Pattern and Matcher classes in Java.
I am reading a template text and I want to replace:
1. src="scripts/test.js" with src="scripts/test.js?Id=${Id}"
2. src="Servlet?Template=scripts/test.js" with src="Servlet?Id=${Id}&Template=scripts/test.js"
I'm using the code below to handle case 2:
//strTemplateText is the Template's text
Pattern p2 = Pattern.compile("(?i)(src\\s*=\\s*[\"'])(.*?\\?)");
Matcher m2 = p2.matcher(strTemplateText);
strTemplateText = m2.replaceAll("$1$2Id=" + CurrentESSession.getAttributeString("Id", "") + "&");
The above code works correctly for case 2. but how can I create a regex to combine both cases 1. and 2. ?
Thank you

You don't need a regular expression. If you change case 2 to
replace Servlet?Template=scripts/test.js with Servlet?Template=scripts/test.js&Id=${Id}
all you need to do is check whether the source string contains a ?: if not, add ?Id=${Id}, else add &Id=${Id}.
After all,
if (strTemplateText.contains("?")) {
    strTemplateText += "&Id=${Id}";
} else {
    strTemplateText += "?Id=${Id}";
}
does the job.
Or even shorter:
strTemplateText += strTemplateText.contains("?") ? "&Id=${Id}" : "?Id=${Id}";

Your actual question doesn't match up so well with your example code. The example code seems to handle a more general case, and it substitutes an actual session Id value instead of a reference to one. The code below takes the example code to be more indicative of what you really want, but the same approach could be adapted to what you asked in the question text (using a simpler regex, even).
With that said, I don't see any way to do this with a single replaceAll() because the replacement text for the two cases is too different. You could nevertheless do it with one regex, in one pass, if you used a different approach:
Pattern p2 = Pattern.compile("(src\\s*=\\s*)(['\"])([^?]*?)(\\?.*?)?\\2",
Pattern.CASE_INSENSITIVE);
Matcher m2 = p2.matcher(strTemplateText);
StringBuffer revisedText = new StringBuffer();
while (m2.find()) {
// Append the whole match except the closing quote
m2.appendReplacement(revisedText, "$1$2$3$4");
// group 4 is the optional query string; null if none was matched
revisedText.append((m2.group(4) == null) ? '?' : '&');
revisedText.append("Id=");
revisedText.append(CurrentESSession.getAttributeString("Id", ""));
// append a copy of the opening quote
revisedText.append(m2.group(2));
}
m2.appendTail(revisedText);
strTemplateText = revisedText.toString();
That relies on BetaRide's observation that query parameter order is not significant, although the same general approach could accommodate a requirement to make Id the first query parameter, as in the question. It also matches the end of the src attribute in the pattern to the correct closing delimiter, which your pattern does not address (though it needs to do to avoid matching text that spans more than one src attribute).
Do note that nothing in the above prevents a duplicate query parameter 'Id' being added; this is consistent with the regex presented in the question. If you want to avoid that with the above approach then in the loop you need to parse the query string (when there is one) to determine whether an 'Id' parameter is already present.
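As a rough sketch of that duplicate check: the helper below looks at the parameter names in a URL's query string before appending Id. QueryParams, hasParam, and withId are hypothetical names, and the sketch assumes the URL has no fragment part and that parameter names are compared literally (case-sensitive, no URL decoding).

```java
final class QueryParams {

    // True if the URL's query string already has a parameter with exactly this name.
    // Assumption: no '#fragment' in the URL and no URL-encoded parameter names.
    static boolean hasParam(String url, String name) {
        int q = url.indexOf('?');
        if (q < 0) return false;                       // no query string at all
        for (String pair : url.substring(q + 1).split("&")) {
            int eq = pair.indexOf('=');
            String key = (eq < 0) ? pair : pair.substring(0, eq);
            if (key.equals(name)) return true;
        }
        return false;
    }

    // Appends Id=<id> with the right separator unless an Id parameter is already present.
    static String withId(String url, String id) {
        if (hasParam(url, "Id")) return url;
        return url + (url.contains("?") ? "&" : "?") + "Id=" + id;
    }

    public static void main(String[] args) {
        System.out.println(withId("scripts/test.js", "42"));
        System.out.println(withId("Servlet?Template=scripts/test.js", "42"));
        System.out.println(withId("Servlet?Id=7&Template=scripts/test.js", "42"));
    }
}
```

Inside the loop above you would call something like hasParam on the matched URL (groups 3 and 4) and skip the append when it returns true.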

You can do the following:
//strTemplateText is the Template's text
String strTemplateText = "src=\"scripts/test.js\"";
strTemplateText = "src=\"Servlet?Template=scripts/test.js\"";
java.util.regex.Pattern p2 = java.util.regex.Pattern.compile("(src\\s*=\\s*[\"'])(.*?)((?:[\\w\\s\\d.\\-\\#]+\\/?)+)(?:[?]?)(.*?\\=.*)*(['\"])");
java.util.regex.Matcher m2 = p2.matcher(strTemplateText);
System.out.println(m2.matches());
strTemplateText = m2.replaceAll("$1$2$3?Id=" + CurrentESSession.getAttributeString("Id", "") + (m2.group(4)==null? "":"&") + "$4$5");
System.out.println(strTemplateText);
It works on both cases.
If you are using Java 7 or later, you could use named capturing groups to make the regex more human-readable and easier to debug.
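To illustrate what named groups buy you, here is a small sketch that handles both cases with a simpler pattern than the one above. It is an assumption-laden example, not a drop-in replacement: SrcRewriter, addId, and the group names attr, quote, path, and query are made up, the Id value is passed in as a plain string instead of coming from CurrentESSession, and Id is appended at the end of the query string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SrcRewriter {

    // attr  = "src=" including optional whitespace
    // quote = the opening quote, matched again at the end via \k<quote>
    // path  = everything before an optional '?'
    // query = optional query string, including the leading '?'
    private static final Pattern SRC = Pattern.compile(
        "(?<attr>src\\s*=\\s*)(?<quote>['\"])(?<path>[^?'\"]*)(?<query>\\?[^'\"]*)?\\k<quote>",
        Pattern.CASE_INSENSITIVE);

    static String addId(String template, String id) {
        Matcher m = SRC.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String query = m.group("query");            // null when there is no query string
            String sep = (query == null) ? "?" : "&";
            String replacement = m.group("attr") + m.group("quote") + m.group("path")
                + (query == null ? "" : query) + sep + "Id=" + id + m.group("quote");
            // quoteReplacement keeps '$' and '\' in the text from being read as group references
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(addId("<script src=\"scripts/test.js\"></script>", "42"));
        System.out.println(addId("<script src='Servlet?Template=scripts/test.js'></script>", "42"));
    }
}
```

Reading m.group("query") is a lot easier to follow than remembering what group 4 was.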

ANTLR: "missing attribute access on rule scope" problem

I am trying to build an ANTLR grammar that parses tagged sentences such as:
DT The NP cat VB ate DT a NP rat
and have the grammar:
fragment TOKEN : (('A'..'Z') | ('a'..'z'))+;
fragment WS : (' ' | '\t')+;
WSX : WS;
DTTOK : ('DT' WS TOKEN);
NPTOK : ('NP' WS TOKEN);
nounPhrase: (DTTOK WSX NPTOK);
chunker : nounPhrase {System.out.println("chunk found "+"("+$nounPhrase+")");};
The grammar generator generates the "missing attribute access on rule scope: nounPhrase" in the last line.
[I am still new to ANTLR and although some grammars work it's still trial and error. I also frequently get an "OutOfMemory" error when running grammars as small as this - any help welcome.]
I am using ANTLRWorks 1.3 to generate the code and am running under Java 1.6.

"missing attribute access" means that you've referenced a scope ($nounPhrase) rather than an attribute of the scope (such as $nounPhrase.text).
In general, a good way to troubleshoot problems with attributes is to look at the generated parser method for the rule in question.
For example, my initial attempt at creating a new rule when I was a little rusty:
multiple_names returns [List<Name> names]
@init {
    names = new ArrayList<Name>(4);
}
    : a=fullname ' AND ' b=fullname { names.add($a.value); names.add($b.value); };
resulted in "unknown attribute for rule fullname". So I tried
multiple_names returns [List<Name> names]
@init {
    names = new ArrayList<Name>(4);
}
    : a=fullname ' AND ' b=fullname { names.add($a); names.add($b); };
which results in "missing attribute access". Looking at the generated parser method made it clear what I needed to do though. While there are some cryptic pieces, the parts relevant to scopes (variables) are easily understood:
public final List<Name> multiple_names() throws RecognitionException {
List<Name> names = null; // based on "returns" clause of rule definition
Name a = null; // based on scopes declared in rule definition
Name b = null; // based on scopes declared in rule definition
names = new ArrayList<Name>(4); // snippet inserted from `@init` block
try {
pushFollow(FOLLOW_fullname_in_multiple_names42);
a=fullname();
state._fsp--;
match(input,189,FOLLOW_189_in_multiple_names44);
pushFollow(FOLLOW_fullname_in_multiple_names48);
b=fullname();
state._fsp--;
names.add($a); names.add($b);// code inserted from {...} block
}
catch (RecognitionException re) {
reportError(re);
recover(input,re);
}
finally {
// do for sure before leaving
}
return names; // based on "returns" clause of rule definition
}
After looking at the generated code, it's easy to see that the fullname rule is returning instances of the Name class, so what I needed in this case was simply:
multiple_names returns [List<Name> names]
@init {
    names = new ArrayList<Name>(4);
}
    : a=fullname ' AND ' b=fullname { names.add(a); names.add(b); };
The version you need in your situation may be different, but you'll generally be able to figure it out pretty easily by looking at the generated code.

In the original grammar, why not include the attribute it is asking for, most likely:
chunker : nounPhrase {System.out.println("chunk found "+"("+$nounPhrase.text+")");};
Each of your rules (chunker being the one I can spot quickly) has attributes (extra information) associated with it. You can find a quick list of the attributes for the different types of rules at http://www.antlr.org/wiki/display/ANTLR3/Attribute+and+Dynamic+Scopes. It would be nice if that page described each attribute (for example, the start and stop attributes of parser rules refer to tokens from your lexer, which lets you get back to the line number and position).
I think your chunker rule just needs a slight change: instead of $nounPhrase you should use $nounPhrase.text. text is an attribute of your nounPhrase rule.
You might want to do a little other formatting as well: usually the parser rules (which start with a lowercase letter) appear before the lexer rules (which start with an uppercase letter).

If you accidentally do something silly like $thing.$attribute where you mean $thing.attribute, you will also see the missing attribute access on rule scope error message. (I know this question was answered a long time ago, but this bit of trivia might help someone else who sees the error message!)

Answering question after having found a better way...
WS : (' '|'\t')+;
TOKEN : (('A'..'Z') | ('a'..'z'))+;
dttok : 'DT' WS TOKEN;
nntok : 'NN' WS TOKEN;
nounPhrase : (dttok WS nntok);
chunker : nounPhrase ;
The problem was I was getting muddled between the lexer and the parser (this is apparently very common). The uppercase items are lexical, the lowercase in the parser. This now seems to work. (NB I have changed NP to NN).

java email extraction regular expression?

I would like a regular expression that will extract email addresses from a String (using Java regular expressions).
That really works.

Here's the regular expression that really works.
I've spent an hour surfing on the web and testing different approaches,
and most of them didn't work although Google top-ranked those pages.
I want to share with you a working regular expression:
[_A-Za-z0-9-]+(\\.[_A-Za-z0-9-]+)*@[A-Za-z0-9]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})
Here's the original link:
http://www.mkyong.com/regular-expressions/how-to-validate-email-address-with-regular-expression/

I had to add some dashes to allow for them. So a final result in Javanese:
final String MAIL_REGEX = "([_A-Za-z0-9-]+)(\\.[_A-Za-z0-9-]+)*@[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)*(\\.[A-Za-z]{2,})";
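For the extraction part of the question, a pattern like this can be run over a string with Matcher.find(). This is only a sketch: EmailExtractor and extract are made-up names, the pattern is the one above with @ separating the local part from the domain, and like any such regex it will not cover every valid address.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EmailExtractor {

    // Same pattern as MAIL_REGEX above, written as a Java string literal.
    private static final Pattern MAIL = Pattern.compile(
        "[_A-Za-z0-9-]+(\\.[_A-Za-z0-9-]+)*@[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)*(\\.[A-Za-z]{2,})");

    // Collects every substring of the text that matches the pattern, in order.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = MAIL.matcher(text);
        while (m.find()) {
            found.add(m.group());   // group() with no argument is the whole match
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(extract("Contact alice@example.com or bob.smith@mail.example.org."));
        // prints [alice@example.com, bob.smith@mail.example.org]
    }
}
```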

Install this regex tester plugin into Eclipse, and you'll have a whale of a time testing regexes:
http://brosinski.com/regex/
Points to note:
In the plugin, use only one backslash for character escapes. But when you transcribe the regex into a Java/C# string you have to double them, because two escapes are performed: first the Java/C# string literal escapes the backslash, and then the regex engine applies its own character escape.
Surround the sections of the regex whose text you wish to capture with parentheses. Then you can use the group functions in Java or C# to find out the values of those sections.
([_A-Za-z0-9-]+)(\.[_A-Za-z0-9-]+)@([A-Za-z0-9]+)(\.[A-Za-z0-9]+)
For example, using the above regex, the following string
abc.efg@asdf.cde
yields
start=0, end=16
Group(0) = abc.efg@asdf.cde
Group(1) = abc
Group(2) = .efg
Group(3) = asdf
Group(4) = .cde
Group 0 is always the whole matched string.
If you do not enclose any section in parentheses, you can only detect a match, not capture parts of the text.
It might be less confusing to write a few regexes than one long catch-all regex, since you can test them one by one programmatically and then decide which regexes should be consolidated, especially when you find a new email pattern that you had never considered before.
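To make the two points above concrete, here is the same four-group regex transcribed into a Java string literal (note the doubled backslashes) with the groups read back through Matcher. GroupDemo and groups are hypothetical names used only for this sketch.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupDemo {

    // In the tester plugin this is written with single backslashes (\.);
    // inside a Java string literal each backslash has to be doubled (\\.).
    private static final Pattern P = Pattern.compile(
        "([_A-Za-z0-9-]+)(\\.[_A-Za-z0-9-]+)@([A-Za-z0-9]+)(\\.[A-Za-z0-9]+)");

    // Returns group 0 (the whole match) followed by groups 1..4, or an empty array if no match.
    static String[] groups(String input) {
        Matcher m = P.matcher(input);
        if (!m.find()) return new String[0];
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(groups("abc.efg@asdf.cde")));
        // prints [abc.efg@asdf.cde, abc, .efg, asdf, .cde]
    }
}
```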

a little late but ok.
Here is what I use. Just paste it into the Firebug console and run it. Then look on the web page for a textarea (most likely at the bottom of the page) that contains a comma-separated list of all email addresses found in A tags.
// Load jQuery into the page
var jquery = document.createElement('script');
jquery.setAttribute('src', 'http://code.jquery.com/jquery-1.10.1.min.js');
document.body.appendChild(jquery);
// Create the textarea that will hold the result
var list = document.createElement('textarea');
list.setAttribute('id', 'emaillist');
document.body.appendChild(list);
var lijst = "";
$("#emaillist").val("");
// Collect the href of every link that contains an @
$("a").each(function(idx, el) {
    var mail = $(el).filter('[href*="@"]').attr("href");
    if (mail) {
        lijst += mail.replace("mailto:", "") + ",";
    }
});
$("#emaillist").val(lijst);

Android's built-in email address pattern (android.util.Patterns.EMAIL_ADDRESS) works perfectly:
public static List<String> getEmails(@NonNull String input) {
    List<String> emails = new ArrayList<>();
    Matcher matcher = Patterns.EMAIL_ADDRESS.matcher(input);
    while (matcher.find()) {
        int matchStart = matcher.start(0);
        int matchEnd = matcher.end(0);
        emails.add(input.substring(matchStart, matchEnd));
    }
    return emails;
}
