Recursion in parser combinator running out of stack space - java

I've been creating a simple parser combinator library in Java and for a first attempt I'm using programatic strcutures to define both the tokens and the parser grammar, see below:
final Combinator c2 = new CombinatorBuilder()
/*
.addParser("SEXPRESSION", of(Option.of(new Terminal("LPAREN"), zeroOrMore(new ParserPlaceholder("EXPRESSION")), new Terminal("RPAREN"))))
.addParser("QEXPRESSION", of(Option.of(new Terminal("LBRACE"), zeroOrMore(new ParserPlaceholder("EXPRESSION")), new Terminal("RBRACE"))))
*/
.addParser("SEXPRESSION", of(Option.of(new Terminal("LPAREN"), new ParserPlaceholder("EXPRESSION"), new Terminal("RPAREN"))))
.addParser("QEXPRESSION", of(Option.of(new Terminal("LBRACE"), new ParserPlaceholder("EXPRESSION"), new Terminal("RBRACE"))))
.addParser("EXPRESSION", of(
Option.of(new Terminal("NUMBER")),
Option.of(new Terminal("SYMBOL")),
Option.of(new Terminal("STRING")),
Option.of(new Terminal("COMMENT")),
Option.of(new ParserPlaceholder("SEXPRESSION")),
Option.of(new ParserPlaceholder("QEXPRESSION"))
)).build()
If I take the first Parser "SEXPRESSION" defined using the builer I can explain the structure:
Parameters to addParser:
Name of parser
an ImmutableList of disjunctive Options
Parameters to Option.of:
An array of Elements where each element is either a Terminal, or a ParserPlaceholder which is later substituted for the actual Parser where the names match.
The idea is to be able to reference one Parser from another and thus have more complex grammars expressed.
The problem I'm having is that using the grammar above to parse a string value such as "(+ 1 2)" gets stuck in an infinite recursive call when parsing the RPAREN ')' as the "SEXPRESSIONS" and "EXPRESSION" Parsers have "one or many" cardinaltiy.
I'm sure I could get creative and come up with some way of limiting the depth of the recursive calls, perhaps by ensuring that when the "SEXPRESSION" parser hands off to the "EXPRESSION" parser which then hands off to the "SEXPRESSION" parser, and no token are taken, then drop out? But I don't want a hacky solution if a standard solution exists.
Any ideas?
Thanks

Not to dodge the question, but I don't think there's anything wrong with calling an application using VM arguments to increase stack size.
This can be done in Java by adding the flag -XssNm where N is the amount of memory the application is called with.
The default Java stack size is 512 KB which, frankly, is hardly any memory at all. Minor optimizations aside, I felt that it was difficult, if not impossible to work with that little memory to implement complex recursive solutions, especially because Java isn't the least bit efficient when it comes to recursion.
So, some examples of this flag, as as follows:
-Xss4M 4 MB
-Xss2G 2 GB
It also goes right after you call java to launch the application, or if you are using an IDE like Eclipse, you can go in and manually set the command line arguments in run configurations.
Hope this helps!

Related

Is there any way to write parsing logic using json?

I have a map in java Map<String,Object> dataMap whose content looks like this -
{country=Australia, animal=Elephant, age=18}
Now while parsing the map the use of various conditional statements may be made like-
if(dataMap.get("country").contains("stra")
OR
if(dataMap.get("animal") || 100 ==0)
OR
Some other operation inside if
I want to create a config file that contains all the rules on how the data inside the Map should look like. In simple words, I want to define the conditions that value corresponding to keys country, animal, and age should follow, what operations should be performed on them, all in the config file, so that the if elses and extra code can be removed. The config file will be used for parsing the map.
Can someone tell me how such a config file can be written, and how can it be used inside Java?
Sample examples and code references will be of help.
I am thinking of creating a json file for this purpose
Example -
Boolean b = true;
List<String> conditions = new ArrayList<>();
if(dataMap.get("animal").toString().contains("pha")){
conditions.add("condition1 satisifed");
if(((Integer.parseInt(dataMap.get("age").toString()) || 100) ==0)){
conditions.add("condition2 satisifed");
if(dataMap.get("country").equals("Australia")){
conditions.add("condition3 satisifed");
}
else{
b=false;
}
}
else{
b=false;
}
}
else{
b=false;
}
Now suppose I want to define the conditions in a config file for each map value like the operation ( equals, OR, contains) and the test values, instead of using if else's. Then the config file can be used for parsing the java map
Just to manage expectations: Doing this in JSON is a horrible, horrible idea.
To give you some idea of what you're trying to make:
Grammars like this are best visualized as a tree structure. The 'nodes' in this tree are:
'atomics' (100 is an atom, so is "animal", so is dataMap).
'operations' (+ is an operation, so is or / ||).
potentially, 'actions', though you can encode those as operations.
Java works like this, so do almost all programming languages, and so does a relatively simple 'mathematical expression engine', such as something that can evaluate e.g. the string "(1 + 2) * 3 + 5 * 10" into 59.
In java, dataMap.get("animal") || 100 ==0 is parsed into this tree:
OR operation
/ \
INVOKE get[1] equality
/ \ / \
dataMap "animal" INT(100) INT(0)
where [1] is stored as INVOKEVIRTUAL java.util.Map :: get(Object) with as 'receiver' an IDENT node, which is an atomic, with value dataMap, and an args list node which contains 1 element, the string literal atomic "animal", to be very precise.
Once you see this tree you see how the notion of precedence works - your engine will need to be capable of representing both (1 + 2) * 3 as well as 1 + (2 * 3), so doing this without trees is not really possible unless you delve into bizarre syntaxis, where the lexical ordering matching processing ordering (if you want that, look at how reverse polish notation calculators work, or something like fortran - stack based language design. I don't think you'll like what you find there).
You're already making language design decisions here. Apparently, you think the language should adopt a 'truthy'/'falsy' concept, where dataMap.get("animal") which presumably returns an animal object, is to be considered as 'true' (as you're using it in a boolean operation) if, presumably, it isn't null or whatnot.
So, you're designing an entire programming language here. Why handicap yourself by enforcing that it is written in, of all things, JSON, which is epically unsuitable for the job? Go whole hog and write an entire language. It'll take 2 to 3 years, of course. Doing it in json isn't going to knock off more than a week off of that total, and make something that is so incredibly annoying to write, nobody would ever do it, buying you nothing.
The language will also naturally trend towards turing completeness. Once a language is turing complete, it becomes mathematically impossible to answer such questions as: "Is this code ever going to actually finish running or will it loop forever?" (see 'halting problem'), you have no idea how much memory or CPU power it takes, and other issues that then result in security needs. These are solvable problems (sandboxing, for example), but it's all very complicated.
The JVM is, what, 2000 personyears worth of experience and effort?
If you got 2000 years to write all this, by all means. The point is: There is no 'simple' way here. It's a woefully incomplete thing that never feels like you can actually do what you'd want to do (which is express arbitrary ideas in a manner that feels natural enough, can be parsed by your system, and when you read back still makes sense), or it's as complex as any language would be.
Why not just ... use a language? Let folks write not JSON but write full blown java, or js, or python, or ruby, or lua, or anything else that already exists, is open source, seems well designed?

Load a Perl Hash into Java

I have a big .pm File, which only consist of a very big Perl hash with lots of subhashes. I have to load this hash into a Java program, do some work and changes on the data lying below and save it back into a .pm File, which should look similar to the one i started with.
By now, i tried to convert it linewise by regex and string matching, converting it into a XML Document and later Elementwise parse it back into a perl hash.
This somehow works, but seems quite dodgy. Is there any more reliable way to parse the perl hash without having a perl runtime installed?
You're quite right, it's utterly filthy. Regex and string for XML in the first place is a horrible idea, and honestly XML is probably not a good fit for this anyway.
I would suggest that you consider JSON. I would be stunned to find java can't handle JSON and it's inherently a hash-and-array oriented data structure.
So you can quite literally:
use JSON;
print to_json ( $data_structure, { pretty => 1 } );
Note - it won't work for serialising objects, but for perl hash/array/scalar type structures it'll work just fine.
You can then import it back into perl using:
my $new_data = from_json $string;
print Dumper $new_data;
Either Dumper it to a file, but given you requirement is multi-language going forward, just using native JSON as your 'at rest' data is probably a more sensible choice.
But if you're looking at parsing perl code within java, without a perl interpreter? No, that's just insanity.

Nashorn OutOfMemoryError when building large js objects (strings)

I'm using Nashorn on 1.8u60 to create model objects to pass back to view tier (thymeleaf). Part of the model object is a somewhat large string (not big enough to cause any issues in plain java) containing HTML. When trying to convert the object back into Java using ScriptObjectMirror methods i'm hitting the following exception. Changing max heap size doesn't seem to have any effect ( changed from 900mb to 1800mb, same error). I couldn't find much online about this, but are there any restrictions that Nashorn places on object sizes? I'm going to try latest 1.8 JDK now.
java.lang.OutOfMemoryError: Java heap space
at jdk.nashorn.internal.runtime.ConsString.flatten(ConsString.java:105)
at jdk.nashorn.internal.runtime.ConsString.flattened(ConsString.java:98)
at jdk.nashorn.internal.runtime.ConsString.toString(ConsString.java:69)
at jdk.nashorn.api.scripting.ScriptObjectMirror.wrap(ScriptObjectMirror.java:704)
at jdk.nashorn.api.scripting.ScriptObjectMirror.wrapLikeMe(ScriptObjectMirror.java:721)
at jdk.nashorn.api.scripting.ScriptObjectMirror.wrapLikeMe(ScriptObjectMirror.java:730)
at jdk.nashorn.api.scripting.ScriptObjectMirror.access$300(ScriptObjectMirror.java:64)
at jdk.nashorn.api.scripting.ScriptObjectMirror$13.call(ScriptObjectMirror.java:371)
at jdk.nashorn.api.scripting.ScriptObjectMirror$13.call(ScriptObjectMirror.java:364)
at jdk.nashorn.api.scripting.ScriptObjectMirror.inGlobal(ScriptObjectMirror.java:859)
at jdk.nashorn.api.scripting.ScriptObjectMirror.entrySet(ScriptObjectMirror.java:364)
...
Thanks,Adrian
That line reads
final char[] chars = new char[length];
so it appears there's indeed not enough memory for the final string. Nashorn uses ConsString as a way to amortize concatenation costs by delaying concatenation until the result is used (most JS engines use this optimization otherwise e.g. a concatenating lots of strings in a loop will require O(n^2) time).
This means that you might have a result of many + operators on strings be a tree of ConsString objects that get "flattened" at once. The tradeoff for linearizing the time of the concatenation is the need to keep those ConsStrings around, which'll require more than twice the memory required for the string (more than twice 'cause of the ConsString objects own overhead).
One way to get around this is to periodically invoke str.toString(). It is seemingly a no-op but internally it forces flattening of the concatenation tree. Try introducing it into your code at some point and see whether it helps.

ANTLR: Multiple ASTs using the same ambiguous grammar?

I'm building an ANTLR parser for a small query language. The query language is by definition ambiguous, and we need all possible interpretations (ASTs) to process the query.
Example:
query : CLASSIFIED_TOKEN UNCLASSIFIED_TOKEN
| ANY_TOKEN UNCLASSIFIED_TOKEN
;
In this case, if input matches both rules, I need to get 2 ASTs with both interpretations. ANTLR will return the first matched AST.
Do you know a simple way to get all possible ASTs for the same grammar? I'm thinking about running parser multiple times, "turning off" already matched rules between iterations; this seems dirty. Is there a better idea? Maybe other lex/parser tool with java support that can do this?
Thanks
If I were you, I'd remove the ambiguities. You can often do that by using contextual information to determine which grammar rules actually trigger. For instance, in
C* X;
in C (not your language, but this is just to make a point), you can't tell if this is just a pointless multiplication (legal to write in C), or a declaration of a variable X of type "pointer to C". So, there are two valid (ambiguous) parses. But if you know that C is a type declaration (from some context, perhaps an earlier code declaration), you can hack the parser to kill off the inappropriate choices and end up with just the one "correct" parse, no ambiguities.
If you really don't have the context, then you likely need a GLR parser, which happily generate both parses in your final tree. I don't know of any available for Java.
Our DMS Software Reengineering Toolkit [not a Java-based product] has GLR parsing support, and we use that all the time to parse difficult languages with ambiguities. The way we handle the C example above is to produce both parses, because the GLR parser is happy to do this, and then if we have additional information (such as symbol table table), post-process the tree to remove the inappropriate parses.
DMS is designed to support the customized analysis and transformation of arbitrary languages, such as your query language, and makes it easy to define the grammar. Once you have a context-free grammar (ambiguities or not), DMS can parse code and you can decide what to do later.
I doubt you're going to get ANTLR to return multiple parse trees without wholesale rewriting of the code.
I believe you're going to have to partition the ambiguities, each into its own unambiguous grammar and run the parse multiple times. If the total number of ambiguous productions is large you could have an unmanageable set of distinct grammars. For example, for three binary ambiguities (two choices) you'll end up with 8 distinct grammars, though there might be slightly fewer if one ambiguous branch eliminates one or more of the other ambiguities.
Good luck

BigDecimal notation eclipse plugin or nice external tool

I need to make a lot of operations using BigDecimal, and I found having to express
Double a = b - c * d; //natural way
as
BigDecimal a = b.subtract(c.multiply(d))//BigDecimal way
is not only ugly, but a source of mistakes and communication problems between me and business analysts. They were perfectly able to read code with Doubles, but now they can't.
Of course a perfect solution will be java support for operator overloading, but since this not going to happen, I'm looking for an eclipse plugin or even an external tool that make an automatic conversion from "natural way" to "bigdecimal way".
I'm not trying to preprocess source code or dynamic translation or any complex thing, I just want something I can input text and get text, and keep the "natural way" as a comment in source code.
P.S.: I've found this incredible smart hack but I don't want to start doing bytecode manipulation. Maybe I can use that to create a Natural2BigDecimal translator, but I don't want to reinvent the wheel if someone has already done such a tool.
I don't want to switch to Scala/Groovy/JavaScript and I also can't, company rules forbid anything but java in server side code.
"I'm not trying to preprocess source code ... I just want something I can input [bigDecimal arithmetic expression] text".
Half of solving a problem is recognizing the problem for what it is. You exactly want something to preprocess your BigDecimal expressions to produce legal Java.
You have only two basic choices:
A stand-alone "domain specific language" and DSL compiler that accepts "standard" expressions and converts them directly to Java code. (This is one kind of preprocessor). This leaves you with the problem of keeping all the expression fragments around, and somehow knowing where to put them in the Java code.
A tool that reads the Java source text, finds such expressions, and converts them to BigDecimal in the text. I'd suggest something that let you code the expressions outside the actual code and inserted the translation.
Perhaps (stolen from another answer):
// BigDecimal a = b - c * d;
BigDecimal a = b.subtract( c.multiply( d ) );
with the meaning "compile the big decimal expression in the comment into its java equivalent, and replace the following statement with that translation.
To implement the second idea, you need a program transformation system, which can apply source-to-source rewriting rules to transforms (generate as a special case of transform) the code. This is just a preprocessor that is organized to be customizable to your needs.
Our DMS Software Reengineering Toolkit with its Java Front End could do this. You need a full Java parser to do that transformation part; you'll want name and type resolution so that you can parse/check the proposed expression for sanity.
While I agree that the as-is Java notation is ugly, and your proposal would make it prettier, my personal opinion is this isn't worth the effort. You end up with a dependency on a complex tool (yes, DMS is complex: manipulating code isn't easy) for a rather marginal gain.
If you and your team wrote thousands of these formulas, or the writers of such formulas were Java-naive it might make sense. In that case,
I'd go further, and simply insist you write the standard expression format where you need it. You could customize the Java Front End to detect when the operand types were of decimal type, and do the rewriting for you. Then you simply run this preprocessor before every Java compilation step.
I agree, it's very cumbersome! I use proper documentation (comments before each equation) as the best "solution" to this.
// a = b - c * d;
BigDecimal a = b.subtract( c.multiply( d ) )
You might go the route of an expression evaluator. There is a decent (albeit paid) one at http://www.singularsys.com/jep. Antlr has a rudimentary grammar that also does expression evaluation (tho I am not sure how it would perform) at http://www.antlr.org/wiki/display/ANTLR3/Expression+evaluator.
Neither would give you the compile-time safety you would have with true operators. You could also write the various algorithm-based classes in something like Scala, which does support operator overloading out of the box and would interoperate seamlessly with your other Java classes.

Categories

Resources