I am trying to implement a simple auto-complete for a command line SQL client. I am using ANTLR for generating a parser in the rest of the application, and I wanted to reuse the grammar for autocompletion. My idea is:
- Parse the incomplete statement when the user asks for completion (for example select a from)
- Get from the parser the list of tokens which were expected when it raised a NoViableAltException
From this list of tokens I then wanted to do:
if (isreserved_word) { propose for completion}
else { notify the user an identifier is expected}
This in principle looked like a sensible idea (to me at least) and I found this:
http://www.antlr.org/wiki/pages/viewpage.action?pageId=11567208 which convinced me it was possible
However, after doing some testing I realized that not many tokens were in state.following[state._fsp];
for example, for an input of create it only contained ';'
when my grammar for this part looks like:
root : statement? (SEMICOLON!)? EOF!;
statement : create | ...;
create : CREATE | ( TABLE table_create | USER user_create | ....);
So I was confused and looked at the generated code:
try {
    int alt6=16;
    alt6 = dfa6.predict(input);
    switch (alt6) {
        case 1 :
            {
                root_0 = (CommonTree)adaptor.nil();
                pushFollow(FOLLOW_create_in_statement1088);
                create8=create();
                state._fsp--;
                adaptor.addChild(root_0, create8.getTree());
            }
            break;
        case 2 :
            ...
So then it made sense to me: the parser tries to read the next token and then, from that token, finds (via the switch cases) the next rule.
In my case the predict just fails as there is no next token.
So from there I understood I would need to hack ANTLR a bit, and looking in the templates, in Java.stg, I found these pieces of code:
/** A (...) subrule with multiple alternatives */
block(alts,decls,decision,enclosingBlockLevel,blockLevel,decisionNumber,maxK,maxAlt,description) ::= <<
// <fileName>:<description>
int alt<decisionNumber>=<maxAlt>;
<decls>
<#predecision()>
<decision>
<#postdecision()>
<#prebranch()>
switch (alt<decisionNumber>) {
<alts:{a | <altSwitchCase(i,a)>}>
}
<#postbranch()>
>>
and
/** A case in a switch that jumps to an alternative given the alternative
* number. A DFA predicts the alternative and then a simple switch
* does the jump to the code that actually matches that alternative.
*/
altSwitchCase(altNum,alt) ::= <<
case <altNum> :
<#prealt()>
<alt>
break;<\n>
>>
From there I thought all I had to do was write my own function that pushes every altNum onto a stack before the call to predict, so I tried:
/*
Yout }>*/
And I was expecting to get nice little lists of token ids, but not at all: I get really different things.
So I'm really lost and would like to know whether there is an easier way to provide this autocomplete feature without doing it all by hand, or what I am missing when modifying the template to push the different alternatives of a rule onto a custom stack so I can read them when the exception is raised.
Thank you very much
Sorry to say this but: don't use a parser directly for auto completion. There are several reasons why this won't work as you expect it, without massive manual changes in the generated parser (which requires intimate knowledge):
You often have incomplete input, and unless you have only a simple language you will often find yourself in an unexpected rule path because of the backtracking nature of the parser. For instance, if you have several alts in a rule where the first alt would match if only an additional token were available, the parser will not fail before it has tried all other alts, giving you either completely different tokens or many more tokens than are really necessary.
The follow set is only available in an error case. However there might be no error or there is an error but at a completely different position than where the caret is currently (and where the user would expect an auto completion box).
The follow set only covers a small part of the info you want to present (namely the keywords). However, usually you want to show, say, possible tables in a database if you are in a FROM clause (assuming an SQL language here). You will not get this type of info from the parser, simply because the parser does not have such context information. What you get instead is 'identifier', which can be anything from a table, function name, variable or similar.
My current approach for this type of problem is to tokenize the input and apply domain knowledge in a decision tree. That is, I walk the input tokens and decide based on knowledge I have from the grammar what would be the most important stuff to show.
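To make that concrete, here is a minimal sketch of the token-walking idea, assuming an ANTLR4-generated lexer called SqlLexer with FROM and CREATE token types; the class, token and helper names are made up for illustration, and a real decision tree would be much richer:

import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.Token;
import java.util.ArrayList;
import java.util.List;

public class CompletionSketch {
    public static List<String> suggest(String partialInput) {
        // tokenize the (possibly incomplete) input; no parser involved
        SqlLexer lexer = new SqlLexer(CharStreams.fromString(partialInput));
        List<? extends Token> tokens = lexer.getAllTokens();

        List<String> candidates = new ArrayList<>();
        if (tokens.isEmpty()) {
            candidates.add("SELECT");
            candidates.add("CREATE");
            return candidates;
        }

        // decide based on the last significant token (domain knowledge)
        Token last = tokens.get(tokens.size() - 1);
        switch (last.getType()) {
            case SqlLexer.FROM:      // after FROM we expect table names
                candidates.addAll(lookupTableNames());
                break;
            case SqlLexer.CREATE:    // after CREATE we expect an object kind
                candidates.add("TABLE");
                candidates.add("USER");
                break;
            default:
                candidates.add("<identifier>");
        }
        return candidates;
    }

    private static List<String> lookupTableNames() {
        // domain knowledge: query the catalog or a schema cache here
        return List.of("employees", "departments");
    }
}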
I have a map in Java, Map<String,Object> dataMap, whose content looks like this -
{country=Australia, animal=Elephant, age=18}
Now while parsing the map, various conditional statements may be used, like -
if(dataMap.get("country").contains("stra"))
OR
if(dataMap.get("animal") || 100 ==0)
OR
Some other operation inside if
I want to create a config file that contains all the rules for how the data inside the Map should look. In simple words, I want to define the conditions that the values corresponding to the keys country, animal, and age should follow, and what operations should be performed on them, all in the config file, so that the if-elses and extra code can be removed. The config file will be used for parsing the map.
Can someone tell me how such a config file can be written, and how can it be used inside Java?
Sample examples and code references will be of help.
I am thinking of creating a json file for this purpose
Example -
Boolean b = true;
List<String> conditions = new ArrayList<>();
if(dataMap.get("animal").toString().contains("pha")){
    conditions.add("condition1 satisfied");
    if(((Integer.parseInt(dataMap.get("age").toString()) || 100) == 0)){
        conditions.add("condition2 satisfied");
        if(dataMap.get("country").equals("Australia")){
            conditions.add("condition3 satisfied");
        }
        else{
            b = false;
        }
    }
    else{
        b = false;
    }
}
else{
    b = false;
}
Now suppose I want to define the conditions in a config file for each map value, such as the operation (equals, OR, contains) and the test values, instead of using if-elses. Then the config file can be used for parsing the Java map.
Just to manage expectations: Doing this in JSON is a horrible, horrible idea.
To give you some idea of what you're trying to make:
Grammars like this are best visualized as a tree structure. The 'nodes' in this tree are:
'atomics' (100 is an atom, so is "animal", so is dataMap).
'operations' (+ is an operation, so is or / ||).
potentially, 'actions', though you can encode those as operations.
Java works like this, so do almost all programming languages, and so does a relatively simple 'mathematical expression engine', such as something that can evaluate e.g. the string "(1 + 2) * 3 + 5 * 10" into 59.
In java, dataMap.get("animal") || 100 ==0 is parsed into this tree:
            OR operation
           /            \
  INVOKE get[1]        equality
    /        \          /     \
dataMap   "animal"  INT(100)  INT(0)
where [1] is stored as INVOKEVIRTUAL java.util.Map :: get(Object) with as 'receiver' an IDENT node, which is an atomic, with value dataMap, and an args list node which contains 1 element, the string literal atomic "animal", to be very precise.
Once you see this tree you see how the notion of precedence works - your engine will need to be capable of representing both (1 + 2) * 3 as well as 1 + (2 * 3), so doing this without trees is not really possible unless you delve into bizarre syntaxes where the lexical ordering matches the processing ordering (if you want that, look at how reverse polish notation calculators work, or something like Forth - stack based language design. I don't think you'll like what you find there).
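To make the tree idea concrete, here is a minimal sketch of such nodes in Java (records, so a recent JDK; all names are made up for illustration, and a real engine would need many more node types):

import java.util.Map;

interface Node {
    Object eval(Map<String, Object> data);
}

// atomics: literal values and map lookups
record Atom(Object value) implements Node {
    public Object eval(Map<String, Object> data) { return value; }
}

record Lookup(String key) implements Node {
    public Object eval(Map<String, Object> data) { return data.get(key); }
}

// operations: combine child nodes
record Contains(Node target, Node needle) implements Node {
    public Object eval(Map<String, Object> data) {
        return String.valueOf(target.eval(data)).contains(String.valueOf(needle.eval(data)));
    }
}

record And(Node left, Node right) implements Node {
    public Object eval(Map<String, Object> data) {
        return Boolean.TRUE.equals(left.eval(data)) && Boolean.TRUE.equals(right.eval(data));
    }
}

class RuleTreeDemo {
    public static void main(String[] args) {
        Map<String, Object> dataMap = Map.of("country", "Australia", "animal", "Elephant", "age", 18);
        // (country contains "stra") AND (animal contains "pha")
        Node rule = new And(
                new Contains(new Lookup("country"), new Atom("stra")),
                new Contains(new Lookup("animal"), new Atom("pha")));
        System.out.println(rule.eval(dataMap)); // prints: true
    }
}

Even this toy version already needs a parser (or a very verbose tree encoding in the config file), precedence rules and a type story, which is exactly the point being made here.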
You're already making language design decisions here. Apparently, you think the language should adopt a 'truthy'/'falsy' concept, where dataMap.get("animal") which presumably returns an animal object, is to be considered as 'true' (as you're using it in a boolean operation) if, presumably, it isn't null or whatnot.
So, you're designing an entire programming language here. Why handicap yourself by enforcing that it is written in, of all things, JSON, which is epically unsuitable for the job? Go whole hog and write an entire language. It'll take 2 to 3 years, of course. Doing it in JSON isn't going to knock more than a week off of that total, and you'll make something so incredibly annoying to write that nobody would ever do it, buying you nothing.
The language will also naturally trend towards Turing completeness. Once a language is Turing complete, it becomes mathematically impossible to answer questions such as "Is this code ever going to actually finish running, or will it loop forever?" (see 'halting problem'); you have no idea how much memory or CPU power it takes, and there are other issues that then result in security needs. These are solvable problems (sandboxing, for example), but it's all very complicated.
The JVM is, what, 2000 person-years' worth of experience and effort?
If you have 2000 person-years to write all this, by all means. The point is: there is no 'simple' way here. Either it's a woefully incomplete thing that never feels like you can actually do what you want to do (which is express arbitrary ideas in a manner that feels natural enough, can be parsed by your system, and still makes sense when you read it back), or it's as complex as any language would be.
Why not just ... use a language? Let folks write not JSON but full-blown Java, or JS, or Python, or Ruby, or Lua, or anything else that already exists, is open source, and seems well designed?
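If you go that route, the standard JSR-223 scripting API is one low-effort option. This is only a sketch under the assumption that a JavaScript engine is available at runtime (Nashorn on JDK 8-14, or GraalJS added as a dependency and configured to allow host access); the rule text itself can then live in any plain config file:

import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import java.util.Map;

public class RuleScriptDemo {
    public static void main(String[] args) throws Exception {
        ScriptEngine js = new ScriptEngineManager().getEngineByName("javascript");
        Map<String, Object> dataMap = Map.of("country", "Australia", "animal", "Elephant", "age", 18);
        js.put("data", dataMap);

        // the rule is ordinary JavaScript read from your config, not JSON
        Object ok = js.eval("data.get('country').contains('stra') && data.get('age') >= 18");
        System.out.println(ok); // prints: true
    }
}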
I have an example json response:
{
  "colours": ["green", "blue", "red"],
  "type": ["shoes", "socks", "t-shirts"],
  "make": ["nike", "adidas"]
}
I have Scenario outline table:
|colours|type |make |
|red |shoes |nike |
|blue |socks |nike |
|green |t-shirts|adidas|
I want to use the scenario table to assert against the json response. Now I know how to check this one by one, for example
* Assert colour is correct: <colours>
* Assert type is correct: <type>
* Assert make is correct: <make>
And then perform the step definition like the example below for colour:
@Step("Assert colour is correct: <colours>")
public void assertColourIsCorrect(String colourValue) {
    String responseBody = BridgeTestState.getLastResponse().getBody().toString();
    itemState itemStateResp = new Gson().fromJson(responseBody, itemState.class);
    assertThat("colours", itemStateResp.getColour(), is(equalTo(colourValue.toLowerCase())));
}
Note: the getColour() comes from a getter and setter I have defined.
Now this works, but as you can see it's a bit long-winded, as I have to create three separate steps to assert against each column.
I want to be a little smarter than this but don't know how to implement it. What I would like is a step definition that looks at the json response, compares it to the table based on its field, and from there checks the value.
Something along the lines of:
Assert correct "fields" and "values" are outputted.
I hope that makes sense, basically a smart one step definition to perform the check between the json response and the table row.
From the comments it seems you want a row-by-row approach where you use the headlines of a data table as keys into the json. You cannot achieve that with the Examples table of a Scenario Outline, because that is specifically meant to be mapped directly into the steps the way you describe yourself. As I see it there are two ways of dealing with this, depending on your use case.
First, still deal with it as parameters in the steps, i.e.,
Then the "colours" is <colours>
And the "type" is <type>
and then just have one step implementation
Then("the {string} is {string}")
public void theKeyIsValue(String key, String value) {
assertThat(json.get(key)).contains(value);
}
Another, and most likely better, option would be to deal with it as a normal scenario, as already suggested in the comments (I did not understand why you claim that you can't).
However, most likely the correct solution is - annoyingly enough - to actually rethink your scenario. There are some really great guidelines for best practices etc. at https://cucumber.io/docs/bdd/ ; they are fairly quick and easy to read, and will help with a lot of the initial problems.
It's hard without a complete example, but from what you write I suspect that your tests might be too technical. It's an extremely hard balance, but try to keep them vague enough that they do not specify the "How" but only the "What". Example: Given the username "Kate" is a better step than Given I type the username "Kate", because in the latter you are specifying that there should be something you can type text into. I usually ask people if their tests would work with a voice assistant.
Another thing I suspect is that you are trying to test too many things at once. One thing I notice, for instance, is that there is no apparent connection between your json and your table. If the data should match on the index, for instance, it might make more sense. However, looking at the sparse data, I think the tests you need are:
Scenario: The colour options
Given ...
When the options are given
Then the following can be chosen
| Colour |
| red |
| blue |
| green |
Scenario: The clothing options
Given ..
When the options are given
Then the following can be chosen
| Type |
| shoes |
| socks |
| t-shirts |
That way you can still re-use the steps, you can use the headline as a key into the json, and judging by your data the tests actually relate more closely to the expected things. A sketch of such a step definition follows below.
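A step definition along those lines might look like this, assuming Cucumber-JVM with a one-column data table whose headline maps onto a json key; the fetchJsonListFor helper and the headline-to-key mapping are placeholders you would adapt to your own response handling:

import io.cucumber.datatable.DataTable;
import io.cucumber.java.en.Then;
import java.util.ArrayList;
import java.util.List;
import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.hasItems;

public class OptionSteps {

    @Then("the following can be chosen")
    public void theFollowingCanBeChosen(DataTable table) {
        List<List<String>> rows = table.asLists();
        // headline of the single column, e.g. "Colour"; map it to your json key as needed
        String key = rows.get(0).get(0).toLowerCase();
        List<String> expected = new ArrayList<>();
        for (List<String> row : rows.subList(1, rows.size())) {
            expected.add(row.get(0));
        }
        List<String> actual = fetchJsonListFor(key);
        assertThat(actual, hasItems(expected.toArray(new String[0])));
    }

    // placeholder: however your framework exposes the parsed json response
    private List<String> fetchJsonListFor(String key) {
        throw new UnsupportedOperationException("wire this to your response holder");
    }
}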
Writing acceptance tests is an art that requires practice. I hope some of the suggestions here can be used; however, it is hard to come up with more direct suggestions without more context.
Doing what you want to do is counter productive and against the underlying design principles of Cucumber, Scenario Outlines, and probably even data tables in Cucumber.
What the cuke should be doing is explaining WHAT the json response represents and WHY it's important. HOW the json response is constructed, and the details of exploring its validity and content structure, should be pushed down to the step definitions, or better yet helper methods called by the step definitions.
It's really hard to illustrate this with your sample data because it's really hard to work out WHAT
{
  "colours": ["green", "blue", "red"],
  "type": ["shoes", "socks", "t-shirts"],
  "make": ["nike", "adidas"]
}
represents. It's also pretty hard to understand why you want to make the assertions you want to make.
If you gave a real example and explained what the data represents and WHY it's important, and perhaps also WHY you need to check it and WHAT you would do if it wasn't correct, then we might be able to make more progress.
So I'll try with my own example which won't be very good
Given a set of clothing
When I get a json representation of the clothing
Then the json should be valid clothing
and the steps
Given 'a set of clothing' do
  create_clothing_set
end

When 'I get a json representation of the clothing' do
  @json = make_clothing_request type: :json
end

Then 'the json should be valid clothing' do
  res = validate_clothing json: @json
  expect res ...
end
Now your problem is how to write some code to validate your clothing i.e.
def validate_clothing(json: )
...
end
Which is a much simpler problem being executed in a much more powerful environment. This has nothing to do with Cucumber; there are no interactions with features, scenarios etc., it's just a simple programming problem.
In general with Cucumber, either push technical problems down so they become programming problems, or pull problems up so they are outside Cucumber and become scripting problems. This keeps Cucumber on topic. Its job is to describe WHAT and WHY and provide a framework to automate.
I am using proto3 (Java) in my projects. I have some huge protobufs with smaller messages embedded in them. Is there a way I can achieve partial decoding of only the few nested sub-messages I want to look at? The current issue I am having is that I need to join this huge proto-based record data with other records, but my joins are based on very small sub-messages, so I don't want to decode the entire huge protobuf; I want to decode only the nested message (a string id) to join, and then decode the entire protobuf only for the joined data.
I tried using the [lazy=true] tagging method, but I don't see any difference in the generated code. I also tried benchmarking the deserialization time with and without the lazy keyword, and it didn't seem to make any difference. Is this feature on by default for all fields? Or is this even possible? I do see there are a few classes like LazyFields.java and test cases in the protobuf GitHub repo, so I assume this feature has been implemented.
For those that happen to look at this conversation later and finding it hard to understand, here's what Marc's talking about:
If your object is something like
message MyBigMessage {
    string id = 1;
    int32 sourceType = 2;
    // ... and many other fields here that would be expensive to parse
}
And you get a block of bytes that you have to parse. But you want to only parse messages from a certain source and maybe match a certain id range.
You could first parse those bytes with another message as:
message MyFilterMessage {
    string id = 1;         // tag has to stay 1 to match
    int32 sourceType = 2;  // tag has to stay 2 to match
    // ... and NOTHING ELSE here
}
And then you could look at sourceType and id. If they match whatever you are filtering for, you could go and parse the bytes again, but this time using MyBigMessage to parse the whole thing.
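A minimal sketch of that two-pass approach in Java might look like this; MyFilterMessage and MyBigMessage stand for the classes generated from the definitions above, and the concrete filter values are arbitrary examples:

import com.google.protobuf.InvalidProtocolBufferException;

public class PartialDecodeDemo {

    static void handle(byte[] bytes) throws InvalidProtocolBufferException {
        // cheap first pass: only id and sourceType are populated,
        // all other fields are skipped or kept as unknown fields
        MyFilterMessage header = MyFilterMessage.parseFrom(bytes);

        if (header.getSourceType() == 42 && header.getId().startsWith("ord-")) {
            // expensive second pass only for records that survive the filter
            MyBigMessage full = MyBigMessage.parseFrom(bytes);
            process(full);
        }
    }

    static void process(MyBigMessage msg) {
        // join / downstream work goes here
    }
}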
One other thing to know:
FYI: As of 2017, lazy parsing was disabled in Java (except MessageSet) according to this post:
https://github.com/protocolbuffers/protobuf/issues/3601#issuecomment-341516826
I don't know the current status. Too lazy to try to find out! :-)
I need to parse, modify and write back Java source files. I investigated some options, but it seems that I am missing the point.
The output of the parsed AST, when written back to a file, always screws up the formatting: it uses a standard format rather than the original one.
Basically I want something that can do: content(write(parse(sourceFile))).equals(content(sourceFile)).
I tried JavaParser but failed. I might use Eclipse JDT's parser as a standalone parser, but this feels heavy. I would also like to avoid rolling my own. JavaParser, for instance, already has column and line information, but writing the source back seems to ignore it.
I would like to know how I can achieve parsing and writing back such that the output looks the same as the input (indents, lines, everything): basically a solution that preserves the original formatting.
[Update]
The modifications I want to make are basically everything that is possible with the AST, like adding or removing implemented interfaces, removing or adding final on local variables, but also generating source methods and constructors.
The idea is to be able to add/remove anything, but the rest needs to remain untouched, especially the formatting of methods and expressions when the resulting line is wider than the page margin.
You may try using ANTLR4 with its Java 8 grammar file.
The grammar skips all whitespace by default, but based on token positions you may be able to reconstruct source that is close to the original.
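One way to exploit the token positions is ANTLR4's TokenStreamRewriter, which emits every token it was not told to change exactly as it was read. The sketch below assumes a lexer/parser generated from the Java grammar (here called JavaLexer/JavaParser with a compilationUnit start rule) and that whitespace and comments are sent to a hidden channel rather than skipped:

import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.ParserRuleContext;
import org.antlr.v4.runtime.TokenStreamRewriter;

public class RewriteSketch {
    public static String addFinalToLocals(String source) {
        JavaLexer lexer = new JavaLexer(CharStreams.fromString(source));
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        JavaParser parser = new JavaParser(tokens);
        ParserRuleContext tree = parser.compilationUnit();

        TokenStreamRewriter rewriter = new TokenStreamRewriter(tokens);
        // walk the tree and, for each local variable declaration ctx, do e.g.:
        // rewriter.insertBefore(ctx.getStart(), "final ");

        // everything not touched by the rewriter is emitted exactly as read,
        // so untouched code keeps its original formatting
        return rewriter.getText();
    }
}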
The output of a parser generated by REx is a sequence of events written to this interface:
public interface EventHandler
{
public void reset(CharSequence input);
public void startNonterminal(String name, int begin);
public void endNonterminal(String name, int end);
public void terminal(String name, int begin, int end);
public void whitespace(int begin, int end);
}
where the integers are offsets into the input. The event stream can be used to construct a parse tree. As the event stream completely covers all of the input, the resulting data structure can represent it without loss.
There is a sample driver implementing XmlSerializer on top of this interface. It streams out an XML parse tree, which is just markup added to the input. Thus the string value of the XML document is identical to the original input.
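The same property can be seen with a trivial handler that just copies every reported span back out; because terminals and whitespace together cover the whole input, the result is the original text verbatim (the class name below is made up):

public class EchoHandler implements EventHandler {
    private CharSequence input;
    private final StringBuilder out = new StringBuilder();

    public void reset(CharSequence input) { this.input = input; out.setLength(0); }
    public void startNonterminal(String name, int begin) { /* markup only */ }
    public void endNonterminal(String name, int end) { /* markup only */ }
    public void terminal(String name, int begin, int end) { out.append(input, begin, end); }
    public void whitespace(int begin, int end) { out.append(input, begin, end); }

    public String text() { return out.toString(); }
}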
For seeing it in action, use the Java 7 sample grammar and generate a parser using command line
-ll 2 -backtrack -tree -main -java
Then run the main method of the resulting Java.java, passing in some Java source file name.
Our DMS Software Reengineering Toolkit with its Java Front End can do this.
DMS is a program transformation system (PTS), designed to parse source code to an internal representation (usually ASTs), let you make changes to those trees, and regenerate valid output text for the modified tree.
Good PTSes will preserve your formatting/layout at places where you didn't change the code or generate nicely formatted results, including comments in the original source. They will also let you write source-to-source transformations in the form of:
if you see *this* pattern, replace it by *that* pattern
where pattern is written in the surface syntax of your targeted language (in this case, Java). Writing such transformations is usually a lot easier than writing procedural code to climb up and down the tree, inspecting and hacking individual nodes.
DMS has all these properties, including OP's request for idempotency of the null transform.
[Reacting to another answer: yes, it has a Java 8 grammar]
I'm building an ANTLR parser for a small query language. The query language is by definition ambiguous, and we need all possible interpretations (ASTs) to process the query.
Example:
query : CLASSIFIED_TOKEN UNCLASSIFIED_TOKEN
| ANY_TOKEN UNCLASSIFIED_TOKEN
;
In this case, if the input matches both rules, I need to get 2 ASTs with both interpretations. ANTLR returns only the first matched AST.
Do you know a simple way to get all possible ASTs for the same grammar? I'm thinking about running the parser multiple times, "turning off" already matched rules between iterations; this seems dirty. Is there a better idea? Maybe another lex/parser tool with Java support that can do this?
Thanks
If I were you, I'd remove the ambiguities. You can often do that by using contextual information to determine which grammar rules actually trigger. For instance, in
C* X;
in C (not your language, but this is just to make a point), you can't tell if this is just a pointless multiplication (legal to write in C), or a declaration of a variable X of type "pointer to C". So, there are two valid (ambiguous) parses. But if you know that C is a type declaration (from some context, perhaps an earlier code declaration), you can hack the parser to kill off the inappropriate choices and end up with just the one "correct" parse, no ambiguities.
If you really don't have the context, then you likely need a GLR parser, which will happily generate both parses in your final tree. I don't know of any available for Java.
Our DMS Software Reengineering Toolkit [not a Java-based product] has GLR parsing support, and we use that all the time to parse difficult languages with ambiguities. The way we handle the C example above is to produce both parses, because the GLR parser is happy to do this, and then, if we have additional information (such as a symbol table), post-process the tree to remove the inappropriate parses.
DMS is designed to support the customized analysis and transformation of arbitrary languages, such as your query language, and makes it easy to define the grammar. Once you have a context-free grammar (ambiguities or not), DMS can parse code and you can decide what to do later.
I doubt you're going to get ANTLR to return multiple parse trees without wholesale rewriting of the code.
I believe you're going to have to partition the ambiguities, each into its own unambiguous grammar, and run the parse multiple times. If the total number of ambiguous productions is large, you could end up with an unmanageable set of distinct grammars. For example, for three binary ambiguities (two choices each) you'll end up with 8 distinct grammars, though there might be slightly fewer if one ambiguous branch eliminates one or more of the other ambiguities.
Good luck