Given a grammar (simplified version below) where I can enter arbitrary text in a section of the grammar, is it possible to format the content of the arbitrary text? I understand how to format the position of the arbitrary text in relation to the rest of the grammar, but not whether it is possible to format the content string itself.
Sample grammar
Model:
    'content' content=RT;

terminal RT: // returns ecore::EString
    'RT>>' -> '<<RT';
Sample content
content RT>>
# Some sample arbitrary text
which I would like to format
<<RT
You can add a custom ITextReplacer to the region of the string.
Assuming you have a grammar like
Model:
    greetings+=Greeting*;

Greeting:
    'Hello' name=STRING '!';
you can do something like the following in the formatter:
def dispatch void format(Greeting model, extension IFormattableDocument document) {
    model.prepend[newLine]
    val region = model.regionFor.feature(MyDslPackage.Literals.GREETING__NAME)
    val r = new AbstractTextReplacer(document, region) {
        override createReplacements(ITextReplacerContext it) {
            val text = region.text
            var int index = text.indexOf(SPACE) // SPACE: a String constant such as " "
            val offset = region.offset
            while (index >= 0) {
                // replace each space inside the string region with a newline
                it.addReplacement(region.textRegionAccess.rewriter.createReplacement(offset + index, SPACE.length, "\n"))
                index = text.indexOf(SPACE, index + SPACE.length)
            }
            it
        }
    }
    addReplacer(r)
}
This will turn this model
Hello "A B C"!
into
Hello "A
B
C"!
Of course, you will need to come up with more sophisticated formatter logic.
See also How to define different indentation levels in the same document with Xtext formatter.
I am attempting to parse this JavaScript via Nashorn:
function someFunction() { return b + 1 };
and navigate to all of the statements, including statements inside the function.
The code below just prints:
"function {U%}someFunction = [] function {U%}someFunction()"
How do I "get inside" the function node to it's body "return b + 1"? I presume I need to traverse the tree with a visitor and get the child node?
I have been following the second answer to the following question:
Javascript parser for Java
import jdk.nashorn.internal.ir.Block;
import jdk.nashorn.internal.ir.FunctionNode;
import jdk.nashorn.internal.ir.Statement;
import jdk.nashorn.internal.parser.Parser;
import jdk.nashorn.internal.runtime.Context;
import jdk.nashorn.internal.runtime.ErrorManager;
import jdk.nashorn.internal.runtime.Source;
import jdk.nashorn.internal.runtime.options.Options;
import java.util.List;
public class Main {
    public static void main(String[] args) {
        Options options = new Options("nashorn");
        options.set("anon.functions", true);
        options.set("parse.only", true);
        options.set("scripting", true);

        ErrorManager errors = new ErrorManager();
        Context context = new Context(options, errors, Thread.currentThread().getContextClassLoader());
        Source source = Source.sourceFor("test", "function someFunction() { return b + 1; } ");
        Parser parser = new Parser(context.getEnv(), source, errors);
        FunctionNode functionNode = parser.parse();
        Block block = functionNode.getBody();
        List<Statement> statements = block.getStatements();

        for (Statement statement : statements) {
            System.out.println(statement);
        }
    }
}
Using private/internal implementation classes of the Nashorn engine is not a good idea. With a security manager on, you'll get an access exception. With jdk9 and beyond, you'll get a module access error with or without a security manager (as the jdk.nashorn.internal.* packages are not exported from the nashorn module).
You have two options to parse JavaScript using Nashorn:
Nashorn parser API: https://docs.oracle.com/javase/9/docs/api/jdk/nashorn/api/tree/Parser.html
To use the Parser API, you need jdk9+.
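A minimal sketch of using that API (assuming jdk9+; the class name and the visitor methods overridden here are illustrative choices):

import jdk.nashorn.api.tree.CompilationUnitTree;
import jdk.nashorn.api.tree.FunctionDeclarationTree;
import jdk.nashorn.api.tree.Parser;
import jdk.nashorn.api.tree.ReturnTree;
import jdk.nashorn.api.tree.SimpleTreeVisitorES5_1;

public class ParserApiDemo {
    public static void main(String[] args) {
        Parser parser = Parser.create();
        // Parse a source string; the third argument is an optional DiagnosticListener.
        CompilationUnitTree cut = parser.parse("test.js",
                "function someFunction() { return b + 1; }", null);
        // SimpleTreeVisitorES5_1's default visit methods descend into child nodes,
        // so the return statement inside the function body is reached.
        cut.accept(new SimpleTreeVisitorES5_1<Void, Void>() {
            @Override
            public Void visitFunctionDeclaration(FunctionDeclarationTree node, Void v) {
                System.out.println("found function declaration");
                return super.visitFunctionDeclaration(node, v);
            }

            @Override
            public Void visitReturn(ReturnTree node, Void v) {
                System.out.println("found return statement");
                return super.visitReturn(node, v);
            }
        }, null);
    }
}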
For jdk8, you can use parser.js:
load("nashorn:parser.js");
and call "parse" function from script. This function returns a JSON object that represents AST of the script parsed.
See this sample: http://hg.openjdk.java.net/jdk8u/jdk8u-dev/nashorn/file/a6d0aec77286/samples/astviewer.js
I have a pkb file. It contains a package, and under that package it has multiple functions.
I have to get the following details out of it:
package name
function names (for all functions one by one)
params in function
return type of function
Approach: I am parsing the pkb file. I have taken grammars from these sources:
Presto
Antlrv4 Grammer for plsql
After getting these grammars, I downloaded antlr-4.5.3-complete.jar. Then, using
java -cp antlr-4.5.3-complete.jar org.antlr.v4.Tool grammar.g4
I executed this command on each grammar separately to generate the listener, lexer, parser, and other files.
After this I created two projects in Eclipse, one for each grammar. I imported the generated files into the respective projects and added antlr-4.5.3-complete.jar to the build path. Then I used the following code to check whether my .pkb file parses correctly:
public static void parse(String file) {
    try {
        // ANTLRInputStream(String) treats its argument as the input text itself,
        // so read the file instead of lexing the path string.
        SqlBaseLexer lex = new SqlBaseLexer(new org.antlr.v4.runtime.ANTLRFileStream(file));
        CommonTokenStream tokens = new CommonTokenStream(lex);
        SqlBaseParser parser = new SqlBaseParser(tokens);
        // Invoke a start rule (singleStatement in Presto's SqlBase grammar);
        // without this, nothing is parsed and the error count is always 0.
        parser.singleStatement();
        System.err.println(parser.getNumberOfSyntaxErrors() + " Errors");
    } catch (java.io.IOException e) {
        System.err.println(e.toString());
    } catch (RecognitionException e) {
        System.err.println(e.toString());
    } catch (java.lang.OutOfMemoryError e) {
        System.err.println(file + ":");
        System.err.println(e.toString());
    } catch (java.lang.ArrayIndexOutOfBoundsException e) {
        System.err.println(file + ":");
        System.err.println(e.toString());
    }
}
I am not getting any errors when parsing the file.
But after this I am stuck on the next steps. I need to get the package name, functions, params, etc.
How do I get these details?
Also, is my approach correct for obtaining the required output?
The Presto grammar is a generic SQL grammar and is not suitable for parsing Oracle packages. The ANTLRv4 grammar for PL/SQL is the right tool for your task.
Generally, an ANTLR grammar as such works as a validator. When you want to perform additional processing while parsing, you should use ANTLR actions (see the overview slide in this presentation). These are blocks of text written in the target language (e.g. Java) and enclosed in curly braces (see the documentation).
There are at least two ways to solve your task with ANTLR actions.
Stdout output
The simplest way is to add println()s to certain rules.
To print the package name, modify the package_body rule in plsql.g4 as follows:
package_body
    : BODY package_name (IS | AS) package_obj_body*
      (BEGIN seq_of_statements | END package_name?)
      {System.out.println("Package name is " + $package_name.text);}
    ;
Similarly, to print information about a function's arguments and return type, add println()s to the create_function_body rule. But there is an issue with printing the parameters. If you use $parameter.text, it returns the name, type specification, and default value according to the parameter rule, without spaces (as a token sequence). If you add a println() to the parameter rule and use $parameter_name.text, it prints every parameter's name (including parameters of procedures, not only of functions). So you can add an ANTLR return value to the parameter rule and assign $parameter_name.text to it:
parameter returns [String p_name]
    : parameter_name (IN | OUT | INOUT | NOCOPY)*
      type_spec? default_value_part?
      {$p_name = $parameter_name.text;}
    ;
Thus, in the context of create_function_body, we can access the parameter's name as $parameter.p_name:
create_function_body
    : (CREATE (OR REPLACE)?)? FUNCTION function_name
      {System.out.println("Parameters of function " + $function_name.text + ":");}
      ('(' parameter {System.out.println($parameter.p_name);}
       (',' parameter {System.out.println($parameter.p_name);})* ')')?
      RETURN type_spec
      (invoker_rights_clause | parallel_enable_clause | result_cache_clause | DETERMINISTIC)*
      ((PIPELINED? (IS | AS) (DECLARE? declare_spec* body | call_spec))
       | (PIPELINED | AGGREGATE) USING implementation_type_name) ';'
      {System.out.println("Return type of function "
          + $function_name.text + " is "
          + $type_spec.text);}
    ;
Accumulation
You can also save intermediate results to variables and access them as parser class members. E.g., you can accumulate function names in a variable func_name. For this, add a @members section at the beginning of the grammar:
grammar plsql;

@members {
    String func_name = "";
}
and modify the function_name rule as follows:
function_name
    : id ('.' id_expression)? {func_name = func_name + $id.text + " ";}
    ;
Using lexer and parser classes
Here is an example of an application, parse.java, that runs your parser:
import org.antlr.v4.runtime.*;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class parse {
    static String readFile(String path) throws IOException {
        byte[] encoded = Files.readAllBytes(Paths.get(path));
        return new String(encoded, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        // create input stream `in`
        ANTLRInputStream in = new ANTLRInputStream(readFile(args[0]));
        // create lexer `lex` with `in` as input
        plsqlLexer lex = new plsqlLexer(in);
        // create token stream `tokens` with `lex` as input
        CommonTokenStream tokens = new CommonTokenStream(lex);
        // create parser with `tokens` as input
        plsqlParser parser = new plsqlParser(tokens);
        // call the start rule of the parser
        parser.sql_script();
        // print the accumulated func_name
        System.out.println("Function names: " + parser.func_name);
    }
}
Compile and run
After this, generate the Java code with ANTLR:
java org.antlr.v4.Tool plsql.g4
and compile your Java code:
javac plsqlLexer.java plsqlParser.java plsqlListener.java parse.java
then run it on some .pkb file:
java parse green_tools.pkb
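Note, as an assumption about your setup: all three commands need antlr-4.5.3-complete.jar on the classpath, e.g. on Linux/macOS:

java -cp .:antlr-4.5.3-complete.jar org.antlr.v4.Tool plsql.g4
javac -cp .:antlr-4.5.3-complete.jar plsqlLexer.java plsqlParser.java plsqlListener.java parse.java
java -cp .:antlr-4.5.3-complete.jar parse green_tools.pkb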
You can find modified parse.java, plsql.g4 and green_tools.pkb here.
While iterating over the tokens using a Listener, I would like to know how to use the ParserRuleContext to peek at the next token, or the next few tokens, in the token stream.
In the code below I am trying to peek at all the tokens after the current token up to EOF:
@Override
public void enterSemicolon(JavaParser.SemicolonContext ctx) {
    Token tok, semiColon = ctx.getStart();
    int currentIndex = semiColon.getStartIndex();
    int reqInd = currentIndex + 1;
    TokenSource tokSrc = semiColon.getTokenSource();
    CharStream srcStream = semiColon.getInputStream();
    srcStream.seek(currentIndex);
    while (true) {
        tok = tokSrc.nextToken();
        System.out.println(tok);
        if (tok.getText() == "<EOF>") { break; }
        srcStream.seek(reqInd++);
    }
}
But the output I get is:
.
.
.
[#-1,131:130='',<-1>,13:0]
[#-1,132:131='',<-1>,13:0]
... (more empty tokens of type -1) ...
[#-1,160:159='',<-1>,13:0]
[#-1,161:160='<EOF>',<-1>,13:0]
[#-1,137:136='',<-1>,13:0]
[#-1,138:137='',<-1>,13:0]
... (the same pattern repeats from each semicolon) ...
[#-1,160:159='',<-1>,13:0]
[#-1,161:160='<EOF>',<-1>,13:0]
.
.
.
We see that although I am able to traverse through all the tokens up to EOF, I am unable to get the actual content or type of the tokens. I would like to know if there is a neat way of doing this using listener traversal.
Hard to be certain, but
tok = tokSrc.nextToken();
appears to be rerunning the lexer, starting at a presumed token boundary, but without having reset the lexer. The lexer throwing errors might explain the observed behavior.
Still, a better approach would be to simply reuse the existing token stream:
public class Walker implements YourJavaListener {
    CommonTokenStream tokens;

    public Walker(JavaParser parser) {
        tokens = (CommonTokenStream) parser.getTokenStream();
    }
then access the stream to get the desired tokens:
@Override
public void enterSemicolon(JavaParser.SemicolonContext ctx) {
    TerminalNode semi = ctx.semicolon(); // adjust as needed for your impl.
    Token tok = semi.getSymbol();
    int idx = tok.getTokenIndex();
    while (tok.getType() != IntStream.EOF) {
        System.out.println(tok);
        tok = tokens.get(++idx); // advance to the next token in the stream
    }
}
An entirely different approach that might serve your ultimate purpose is to take a limited slice of the token stream, bounded by the parent context:

ParserRuleContext pctx = ctx.getParent();
// Note: ParserRuleContext.getTokens(int) filters by token *type*, so instead
// take the slice of the buffered stream between the parent's start and stop.
List<Token> toks = tokens.get(pctx.getStart().getTokenIndex(),
                              pctx.getStop().getTokenIndex());
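For completeness, a brief sketch of consuming that slice (assuming the tokens field from the Walker class above):

for (Token t : toks) {
    System.out.printf("%-20s type=%d%n", t.getText(), t.getType());
}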
I am using RapidMiner 5. I want to make a text preprocessing module to use with a categorization system. I created a process in RapidMiner with these steps:
Tokenize
Transform Case
Stemming
Filtering stopwords
Generating n-grams
I want to write a script to do spell correction for these words, so I used the 'Execute Script' operator and wrote a Groovy script for it (from here: raelcunha). This is the code (with help from the RapidMiner community) that I wrote in the Execute Script operator:
Document doc = input[0]
List<Token> newTokens = new LinkedList<Token>();
nWords = train("set2.txt")

for (Token token : doc.getTokenSequence()) {
    //String output = correct((String) token.getToken(), nWords)
    println token.getToken();
    Token nToken = new Token(correct("garbge", nWords), token);
    newTokens.add(nToken);
}
doc.setTokenSequence(newTokens);
return doc;
This is the code for spell correction (thanks to Norvig):
import com.rapidminer.operator.text.Document;
import com.rapidminer.operator.text.Token;
import java.util.List;
import java.util.LinkedList;

def train(f) {
    def n = [:]
    new File(f).eachLine{ it.toLowerCase().eachMatch(/\w+/){ n[it] = n[it] ? n[it] + 1 : 1 } }
    n
}

// Note: for an empty or one-character word, the ranges below go negative
// (e.g. 0..n-1 with n == 0), which can throw StringIndexOutOfBoundsException.
def edits(word) {
    def result = [], n = word.length() - 1
    for (i in 0..n) result.add(word[0..<i] + word.substring(i + 1))                              // deletions
    for (i in 0..n-1) result.add(word[0..<i] + word[i+1] + word[i, i+1] + word.substring(i + 2)) // transpositions
    for (i in 0..n) for (c in 'a'..'z') result.add(word[0..<i] + c + word.substring(i + 1))      // replacements
    for (i in 0..n) for (c in 'a'..'z') result.add(word[0..<i] + c + word.substring(i))          // insertions
    result
}

def correct(word, nWords) {
    if (nWords[word]) return word
    def list = edits(word), candidates = [:]
    for (s in list) if (nWords[s]) candidates[nWords[s]] = s
    if (candidates.size() > 0) return candidates[candidates.keySet().max()]
    for (s in list) for (w in edits(s)) if (nWords[w]) candidates[nWords[w]] = w
    return candidates.size() > 0 ? candidates[candidates.keySet().max()] : word
}
I am getting a StringIndexOutOfBoundsException when calling the edits method.
And I do not know how to debug this, because RapidMiner just tells me that there is an issue in the Execute Script operator, without saying which line of the script caused it.
So I am planning to do the same thing by creating an operator in Java, as described here: How to extend RapidMiner.
The things I did:
Included all jar files from the RapidMiner lib folder (C:\Program Files (x86)\Rapid-I\RapidMiner5\lib) in the build path of my Java project.
Started coding using the guide linked above.
The input for my operator is a Document (com.rapidminer.operator.text.Document), as in the script.
But I am not able to use this Document object in this code. Can you tell me why? Where are the text processing jars located?
For using the plugin jars, should we add some other locations to the build path?