How to tokenize code with ANTLR v4 - Java

At the start I want to apologize for my bad English.
I'm building a web app, and my task is to tokenize Java code. I found the tool ANTLR v4 and tried to implement it:
import java.util.List;

import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.Token;

public class Tokenizer {
    public void tokenizer(String code) {
        ANTLRInputStream in = new ANTLRInputStream(code);
        // Java8Lexer is generated by ANTLR from the Java8 grammar linked below
        Java8Lexer lexer = new Java8Lexer(in);
        List<? extends Token> tokenList = lexer.getAllTokens();
        for (Token token : tokenList) {
            System.out.println("Next token: " + token.getType() + "\n");
        }
    }
}
This code prints a list of ints on screen - the numeric token types.
I need something like this instead: the original code annotated with "comments" describing each token.
How can I get this result?
I have this grammar: https://github.com/antlr/grammars-v4/tree/master/java8

The Token class provides several methods, including
int getLine();
int getCharPositionInLine();
which associate the token with the corresponding source location.

Using
token.getText()
you get the parsed text the token represents.
In addition, you can get the token's symbolic name via
lexer.getVocabulary().getSymbolicName(token.getType())
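Putting these together, a small sketch (assuming the generated Java8Lexer from the linked grammar is on the classpath) that prints one descriptive line per token:

import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.Token;

public class TokenPrinter {
    public static void printTokens(String code) {
        Java8Lexer lexer = new Java8Lexer(new ANTLRInputStream(code));
        for (Token token : lexer.getAllTokens()) {
            // symbolic name, matched text, and source position of each token
            System.out.printf("%s '%s' at line %d:%d%n",
                    lexer.getVocabulary().getSymbolicName(token.getType()),
                    token.getText(),
                    token.getLine(),
                    token.getCharPositionInLine());
        }
    }
}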

The problem you are facing here is that you want a mix of tokens and rules in the output. For instance, VARIABLE_DECLARATION is actually a parser rule, while IDENTIFIER ASSIGN IDENTIFIER consists of 3 lexer rules. You can use the token stream to print the recognized lexemes, but that won't give you any parser rule.
What you can try instead is to print the parse tree returned when you do a real parse run on your input (see ParseTree.toStringTree()). You can also use a parse tree listener to walk the tree and transform it into a stream of rule descriptions along with the text that belongs to each rule (context).
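A rough sketch of that approach, assuming the generated Java8Parser and Java8BaseListener from the same grammar (compilationUnit is that grammar's start rule):

import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.ParserRuleContext;
import org.antlr.v4.runtime.tree.ParseTree;
import org.antlr.v4.runtime.tree.ParseTreeWalker;

public class RulePrinter {
    public static void printRules(String code) {
        Java8Lexer lexer = new Java8Lexer(new ANTLRInputStream(code));
        Java8Parser parser = new Java8Parser(new CommonTokenStream(lexer));
        ParseTree tree = parser.compilationUnit();

        // walk the tree and print every entered rule with the text it covers
        ParseTreeWalker.DEFAULT.walk(new Java8BaseListener() {
            @Override
            public void enterEveryRule(ParserRuleContext ctx) {
                System.out.println(Java8Parser.ruleNames[ctx.getRuleIndex()]
                        + " : " + ctx.getText());
            }
        }, tree);
    }
}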

Related

Parse a single POJO from multiple YAML documents representing different classes

I want to use a single YAML file which contains several different objects - for different applications. I need to fetch one object to get an instance of MyClass1, ignoring the rest of the docs for MyClass2, MyClass3, etc. - some sort of selective deserialization: now this class, then that one. The structure of MyClass2 and MyClass3 is totally unknown to the application working with MyClass1. The file is always valid YAML, of course.
The YAML may be of any structure we need to implement such a multi-class container. The preferred parsing tool is SnakeYAML.
Is this sensible? How can I ignore all but one object?
UPD: replaced all "document" with "object". I think we have to speak about a single YAML document containing several objects of different structure. Moreover, the parser knows exactly one structure and wants to ignore the rest.
UPD2: I think it is impossible with SnakeYAML. We have to read all objects anyway - and select the needed one later. But maybe I'm wrong.
UPD3: a sample config file
---
-
  exportConfiguration781:
    attachmentFieldName: "name"
    baseSftpInboxPath: /home/user/somedir/
    somebool: false
    days: 9999
    expected:
      - ABC w/o quotes
      - "Cat ABC"
      - "Some string"
    dateFormat: yyyy-MMdd-HHmm
    user: someuser
-
  anotherConfiguration:
    k1: v1
    k2:
      - v21
      - v22
This is definitely possible with SnakeYAML, albeit not trivial. Here's a general rundown of what you need to do.
First, let's have a look at what loading with SnakeYAML does. Here's the important part of the Yaml class:
private Object loadFromReader(StreamReader sreader, Class<?> type) {
    Composer composer = new Composer(new ParserImpl(sreader), resolver, loadingConfig);
    constructor.setComposer(composer);
    return constructor.getSingleData(type);
}
The composer parses YAML input into Nodes. To do that, it doesn't need any knowledge about the structure of your classes, since every node is either a ScalarNode, a SequenceNode or a MappingNode and they just represent the YAML structure.
The constructor takes a root node generated by the composer and generates native POJOs from it. So what you want to do is to throw away parts of the node graph before they reach the constructor.
The easiest way to do that is probably to derive from Composer and override its getNode() method like this:
public class MyComposer extends Composer {
    private final int objIndex;

    public MyComposer(Parser parser, Resolver resolver, int objIndex) {
        super(parser, resolver);
        this.objIndex = objIndex;
    }

    public MyComposer(Parser parser, Resolver resolver, LoaderOptions loadingConfig, int objIndex) {
        super(parser, resolver, loadingConfig);
        this.objIndex = objIndex;
    }

    @Override
    public Node getNode() {
        return strip(super.getNode());
    }

    private Node strip(Node input) {
        return ((SequenceNode) input).getValue().get(objIndex);
    }
}
The strip implementation is just an example. In this case, I assumed your YAML looks like this (object content is arbitrary):
- {first: obj}
- {second: obj}
- {third: obj}
And you simply select the object you actually want to deserialize by its index in the sequence. But you can also have something more complex like a searching algorithm.
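For example, here is a hedged sketch of a key-based variant (stripByKey is a hypothetical name; it assumes each sequence element is a single-key mapping, as in the sample above):

import org.yaml.snakeyaml.nodes.MappingNode;
import org.yaml.snakeyaml.nodes.Node;
import org.yaml.snakeyaml.nodes.NodeTuple;
import org.yaml.snakeyaml.nodes.ScalarNode;
import org.yaml.snakeyaml.nodes.SequenceNode;

// hypothetical drop-in replacement for the index-based strip():
// select the sequence element whose single mapping key equals the given name
private Node stripByKey(Node input, String key) {
    for (Node item : ((SequenceNode) input).getValue()) {
        NodeTuple first = ((MappingNode) item).getValue().get(0);
        if (key.equals(((ScalarNode) first.getKeyNode()).getValue())) {
            return item;
        }
    }
    throw new IllegalArgumentException("no object with key " + key);
}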
Now that you have your own composer, you can do
Constructor constructor = new Constructor();
// assuming we want to get the object at index 1 (i.e. second object)
Composer composer = new MyComposer(new ParserImpl(sreader), new Resolver(), 1);
constructor.setComposer(composer);
MyObject result = (MyObject)constructor.getSingleData(MyObject.class);
The answer of @flyx was very helpful to me, opening the way to work around the library's (in our case, SnakeYAML's) limitations by overriding some methods. Thanks a lot! It's quite possible a final solution lies in that direction - but not for now. Besides, the simple solution below is robust and should be considered even if we had found a complete library-intruding solution.
I've decided to solve the task by double distilling - sorry, processing - the configuration file. Imagine the latter consisting of several parts, with every part marked by a unique token-delimiter. For the sake of keeping the YAML-likeness, it may be:
---
#this is a unique key for the configuration A
<some YAML document>
---
#this is another key for the configuration B
<some YAML document>
The first pass is pre-processing. For the given String fileString and String key (and DELIMITER = "\n---\n", for example) we select the substring with the key-defined configuration:
int begIndex;
do {
    begIndex = fileString.indexOf(DELIMITER);
    if (begIndex == -1) {
        break;
    }
    if (fileString.startsWith(DELIMITER + key, begIndex)) {
        fileString = fileString.substring(begIndex + DELIMITER.length() + key.length());
        break;
    }
    // spoil alien delimiter and repeat search
    fileString = fileString.replaceFirst(DELIMITER, " ");
} while (true);
int endIndex = fileString.indexOf(DELIMITER);
if (endIndex != -1) {
    fileString = fileString.substring(0, endIndex);
}
Now we feed fileString to plain YAML parsing:
ExportConfiguration configuration = new Yaml(new Constructor(ExportConfiguration.class))
        .loadAs(fileString, ExportConfiguration.class);
This time we have a single document that must correspond to the ExportConfiguration class.
Note 1: The structure, and even the content, of the rest of the configuration file plays absolutely no role. This was the main idea: to get independent configurations in a single file.
Note 2: The rest of the configurations may be JSON or XML or whatever. We have a preprocessor method that returns a String configuration, and the next processor parses it properly.

BASIC Lexer with regex written in Java

I have to code a lexer in Java for a dialect of BASIC.
I grouped all the token types in an enum:
public enum TokenType {
    INT("-?[0-9]+"),
    BOOLEAN("(TRUE|FALSE)"),
    PLUS("\\+"),
    MINUS("\\-");
    // others.....

    private final String pattern;

    TokenType(String pattern) { this.pattern = pattern; }

    public String getPattern() { return pattern; }
}
The name is the TokenType name, and in the brackets is the regex I use to match that type.
If I want to match the INT type, I use "-?[0-9]+".
Now I have a problem: I concatenate all the TokenType regexes in a StringBuffer with this:
private String pattern() {
    StringBuffer tokenPatternsBuffer = new StringBuffer();
    for (TokenType token : TokenType.values()) {
        tokenPatternsBuffer.append("|(?<" + token.name() + ">" + token.getPattern() + ")");
    }
    // drop the leading "|"
    String tokenPatternsString = tokenPatternsBuffer.toString().substring(1);
    return tokenPatternsString;
}
So it returns a String like:
(?<INT>-?[0-9]+)|(?<BOOLEAN>(TRUE|FALSE))|(?<PLUS>\+)|(?<MINUS>\-)|(?<PRINT>PRINT)....
Now I use this string to create a Pattern:
Pattern pattern = Pattern.compile(pattern());
Then I create a Matcher:
Matcher match = pattern.matcher("line of code");
Now I want to match all the token types and group them into an ArrayList of Token. If the code syntax is correct, this returns an ArrayList of Token (token name, value).
But I don't know how to exit the while loop and print an error if the syntax is incorrect.
This is the piece of code used to create the ArrayList of Token:
private void lex() {
    ArrayList<Token> tokens = new ArrayList<Token>();
    int tokenSize = TokenType.values().length;
    int counter = 0;
    // Iterate over arrayLinee (ArrayList of String) to get matches of pattern
    for (String linea : arrayLinee) {
        counter = 0;
        Matcher match = pattern.matcher(linea);
        while (match.find()) {
            System.out.println(match.group(1));
            counter = 0;
            for (TokenType token : TokenType.values()) {
                counter++;
                if (match.group(token.name()) != null) {
                    tokens.add(new Token(token, match.group(token.name())));
                    counter = 0;
                    continue;
                }
            }
            if (counter == tokenSize) {
                System.out.println("Syntax Error in line : " + linea);
                break;
            }
        }
        tokenList.add("EOL");
    }
}
The code doesn't break out of the loop if the for loop iterates over all token types without matching any regex. How can I report an error if the syntax isn't correct?
Also, do you know where I can find information on developing a lexer?
All you need to do is add an extra "INVALID" token at the end of your enum with a regex like ".+" (match everything). Because the regexes are evaluated in order, it will only match if no other token matched first. You then check whether the last token in your list was the INVALID token.
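A minimal sketch of that idea, reusing the question's own names (INVALID and the final check are the only additions):

public enum TokenType {
    INT("-?[0-9]+"),
    BOOLEAN("(TRUE|FALSE)"),
    PLUS("\\+"),
    MINUS("\\-"),
    // ... other real tokens ...
    INVALID(".+"); // must stay last: only matches where nothing else did

    private final String pattern;

    TokenType(String pattern) { this.pattern = pattern; }

    public String getPattern() { return pattern; }
}

Inside the while (match.find()) loop you can then bail out as soon as the catch-all group fires:

if (match.group(TokenType.INVALID.name()) != null) {
    System.out.println("Syntax Error in line : " + linea);
    return; // or throw an exception, depending on how you want to report it
}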
If you are working in Java, I recommend trying ANTLR 4 for creating your lexer. The grammar syntax is much cleaner than regular expressions, and the lexer generated from your grammar will automatically support reporting syntax errors.
If you are writing a full lexer, I'd recommend using an existing grammar builder. ANTLR is one solution, but I personally recommend parboiled instead, which allows you to write grammars in pure Java.
Not sure if this was answered, or you came to an answer, but a lexer is broken into two distinct phases: the scanning phase and the parsing phase. You can combine them into one single pass (regex matching), but you'll find that a single-pass lexer has weaknesses if you need to do anything more than the most basic string translations.
In the scanning phase you break the character sequence apart based on the specific tokens you've specified. You should have included an example of the text you were trying to parse, but Wikipedia has a great example of a simple text lexer that turns a sentence into tokens (e.g. str.split(' ')). So with the scanner you tokenize the block of text into chunks by spaces (this should almost always be the first action), and then you tokenize further based on other tokens, such as the ones you're attempting to match.
Then the parsing/evaluation phase iterates over each token and decides what to do with it depending on the business logic, syntax rules, etc. that you set. This could be evaluating some sort of math function (e.g. max(3,2)); a more common example is query-language building. You might make a web app that has a specific query language (SOLR comes to mind, as well as any SQL/NoSQL DB) that is translated into another language to make requests against a datasource. Lexers are commonly used in IDEs for code hinting and auto-completion as well.
This isn't a code-based answer, but it's one that should give you an idea of how to tackle the problem.
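To make the two phases concrete anyway, here is a toy sketch (purely illustrative, not tied to any BASIC dialect):

public class TwoPhaseDemo {
    public static void main(String[] args) {
        // scanning phase: break the input into tokens on whitespace
        String[] tokens = "PRINT 3 + 2".split(" ");

        // parsing/evaluation phase: decide what to do with the tokens
        if (tokens[0].equals("PRINT") && tokens[2].equals("+")) {
            int sum = Integer.parseInt(tokens[1]) + Integer.parseInt(tokens[3]);
            System.out.println(sum); // prints 5
        }
    }
}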

Java - generating conditions from string

I'm trying to generate conditions from strings I get as input.
For example, I get the string "length = 15" as input, and I want to create from it the condition
length == 15.
To be more specific, I have an int in my program called length, and it is set to a specific value.
I want to get a condition as input from the user ("length < 15", "length = 15", ...) and create an if statement that builds the condition and tests it.
What is the best way of doing that?
Thanks a lot,
Ben
Unless you're talking about code generation (i.e. generating Java code from input strings), you can't create an if statement based on a string at runtime.
You'll have to write a parser for your condition language and interpret the resulting parse trees.
In the end it would look something like this:
Condition cond = ConditionParser.parse("length = 15");
if (cond.eval()) {
    // condition is true
}
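A minimal sketch of such a parser, assuming conditions always have the shape variable operator number (Condition and ConditionParser are illustrative names, not an existing library):

import java.util.Map;

// illustrative sketch: parses and evaluates conditions like "length = 15"
final class Condition {
    private final String variable;
    private final String op;
    private final int value;

    Condition(String variable, String op, int value) {
        this.variable = variable;
        this.op = op;
        this.value = value;
    }

    boolean eval(Map<String, Integer> vars) {
        int left = vars.get(variable);
        switch (op) {
            case "=": return left == value;
            case "<": return left < value;
            case ">": return left > value;
            default:  throw new IllegalArgumentException("unknown operator: " + op);
        }
    }
}

final class ConditionParser {
    static Condition parse(String input) {
        String[] parts = input.trim().split("\\s+"); // e.g. ["length", "=", "15"]
        return new Condition(parts[0], parts[1], Integer.parseInt(parts[2]));
    }
}

Usage would then be, for example, ConditionParser.parse("length = 15").eval(Collections.singletonMap("length", 15)).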
Use a StringTokenizer. The default way of distinguishing tokens (the smallest parts of the input string) is whitespace, which works to your benefit here.
Check out the Javadocs for details:
http://docs.oracle.com/javase/1.3/docs/api/java/util/StringTokenizer.html
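A quick sketch of that approach on the question's input shape:

import java.util.StringTokenizer;

public class TokenizeCondition {
    public static void main(String[] args) {
        StringTokenizer st = new StringTokenizer("length < 15"); // splits on whitespace
        String variable = st.nextToken();             // "length"
        String operator = st.nextToken();             // "<"
        int value = Integer.parseInt(st.nextToken()); // 15
        System.out.println(variable + " " + operator + " " + value);
    }
}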
Depending on what restrictions you can place on your input format, you could consider using Rhino to embed JavaScript. Your 'conditions' then just have to be valid JavaScript code. Something like this (disclaimer: haven't compiled it):
import javax.script.*;

public boolean evalCondition(Object context, String javascript) throws ScriptException {
    ScriptEngine engine = new ScriptEngineManager().getEngineByName("javascript");
    // note: the context would need to be exposed to the script, e.g. via engine.put(...)
    Object result = engine.eval(javascript);
    return (Boolean) result;
}
See the Embedding Rhino Tutorial for more details.
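Note that the script can only see variables you expose to the engine first; a sketch using the standard javax.script API:

import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;

public class EvalDemo {
    public static void main(String[] args) throws ScriptException {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("javascript");
        engine.put("length", 15); // expose the Java value under the name "length"
        Boolean result = (Boolean) engine.eval("length == 15");
        System.out.println(result); // prints true
    }
}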

ANTLR: "missing attribute access on rule scope" problem

I am trying to build an ANTLR grammar that parses tagged sentences such as:
DT The NP cat VB ate DT a NP rat
and have the grammar:
fragment TOKEN : (('A'..'Z') | ('a'..'z'))+;
fragment WS : (' ' | '\t')+;
WSX : WS;
DTTOK : ('DT' WS TOKEN);
NPTOK : ('NP' WS TOKEN);
nounPhrase: (DTTOK WSX NPTOK);
chunker : nounPhrase {System.out.println("chunk found "+"("+$nounPhrase+")");};
The grammar generator reports "missing attribute access on rule scope: nounPhrase" for the last line.
[I am still new to ANTLR, and although some grammars work, it's still trial and error. I also frequently get an "OutOfMemory" error when running grammars as small as this one - any help welcome.]
I am using ANTLRWorks 1.3 to generate the code and am running under Java 1.6.
"missing attribute access" means that you've referenced a scope ($nounPhrase) rather than an attribute of the scope (such as $nounPhrase.text).
In general, a good way to troubleshoot problems with attributes is to look at the generated parser method for the rule in question.
For example, my initial attempt at creating a new rule when I was a little rusty:
multiple_names returns [List<Name> names]
@init {
    names = new ArrayList<Name>(4);
}
    : a=fullname ' AND ' b=fullname { names.add($a.value); names.add($b.value); };
resulted in "unknown attribute for rule fullname". So I tried
multiple_names returns [List<Name> names]
@init {
    names = new ArrayList<Name>(4);
}
    : a=fullname ' AND ' b=fullname { names.add($a); names.add($b); };
which results in "missing attribute access". Looking at the generated parser method made it clear what I needed to do, though. While there are some cryptic pieces, the parts relevant to scopes (variables) are easily understood:
public final List<Name> multiple_names() throws RecognitionException {
    List<Name> names = null; // based on "returns" clause of rule definition
    Name a = null;           // based on scopes declared in rule definition
    Name b = null;           // based on scopes declared in rule definition
    names = new ArrayList<Name>(4); // snippet inserted from @init block
    try {
        pushFollow(FOLLOW_fullname_in_multiple_names42);
        a=fullname();
        state._fsp--;
        match(input,189,FOLLOW_189_in_multiple_names44);
        pushFollow(FOLLOW_fullname_in_multiple_names48);
        b=fullname();
        state._fsp--;
        names.add($a); names.add($b); // code inserted from {...} block
    }
    catch (RecognitionException re) {
        reportError(re);
        recover(input,re);
    }
    finally {
        // do for sure before leaving
    }
    return names; // based on "returns" clause of rule definition
}
After looking at the generated code, it's easy to see that the fullname rule is returning instances of the Name class, so what I needed in this case was simply:
multiple_names returns [List<Name> names]
@init {
    names = new ArrayList<Name>(4);
}
    : a=fullname ' AND ' b=fullname { names.add(a); names.add(b); };
The version you need in your situation may be different, but you'll generally be able to figure it out pretty easily by looking at the generated code.
In the original grammar, why not include the attribute it is asking for? Most likely:
chunker : nounPhrase {System.out.println("chunk found "+"("+$nounPhrase.text+")");};
Each of your rules (chunker being the one I can spot quickly) has attributes (extra information) associated with it. You can find a quick list of the different attributes for the different types of rules at http://www.antlr.org/wiki/display/ANTLR3/Attribute+and+Dynamic+Scopes. It would be nice if descriptions were put on that page for each attribute (for example, the start and stop attributes of parser rules refer to tokens from your lexer, which would allow you to get back to your line number and position).
I think your chunker rule just needs a slight change: instead of $nounPhrase you should use $nounPhrase.text. text is an attribute of your nounPhrase rule.
You might want to do a little other formatting as well; usually the parser rules (which start with a lowercase letter) appear before the lexer rules (which start with an uppercase letter).
PS. When I type in the box, the chunker rule starts on a new line, but in my original answer it didn't.
If you accidentally write something silly like $thing.$attribute where you mean $thing.attribute, you will also see the "missing attribute access on rule scope" error message. (I know this question was answered a long time ago, but this bit of trivia might help someone else who sees it!)
Answering my own question after having found a better way...
WS : (' '|'\t')+;
TOKEN : (('A'..'Z') | ('a'..'z'))+;
dttok : 'DT' WS TOKEN;
nntok : 'NN' WS TOKEN;
nounPhrase : (dttok WS nntok);
chunker : nounPhrase ;
The problem was that I was getting muddled between the lexer and the parser (this is apparently very common). The uppercase items are lexer rules; the lowercase ones are parser rules. This now seems to work. (NB: I have changed NP to NN.)
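A small driver for the fixed grammar might look like this (MyGrammarLexer and MyGrammarParser stand in for whatever class names ANTLR generates from your grammar's name):

import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;

public class ChunkerDemo {
    public static void main(String[] args) throws RecognitionException {
        ANTLRStringStream in = new ANTLRStringStream("DT the NN cat");
        MyGrammarLexer lexer = new MyGrammarLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        MyGrammarParser parser = new MyGrammarParser(tokens);
        parser.chunker(); // invoke the start rule
    }
}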

Preprocessing with JavaCC / pushing some chars onto the front of the stream?

Using JavaCC, can I push some new characters onto the front of the input stream?
For example, let's say that my parser parses the following syntax:
#define Paragraphs "Paragraph+"
#define Volume "(Title,(Chapter,${Paragraphs})+)"
Book=${Volume}+;
How can I tell JavaCC that its scanner should preprocess ${Volume} to (Title,(Chapter,Paragraph+)+) before invoking the parser?
Can it be achieved using the MORE statement?
Thanks
Token.image is a public field, so you could also just set it directly. Here's an example from my JavaCC book's tokenizer chapter:
TOKEN : {
    // (the token declaration that precedes this action is not shown in this excerpt)
    { matchedToken.image = image.append("B").toString(); }
}
You can download all the book's example source code here.
OK, I think I've found the solution: Java statements can be added in the TOKEN section, and the current buffer is available in a StringBuilder named image:
| < Y : "${" (<NAME>)+ "}" >
  {
      String oldValue = image.toString();
      image.setLength(0);
      image.append(my_dict.get(oldValue));
  }
