What is the correct abstract syntax tree for representing algebra? I have tried far too many setups and have been constantly rewriting the syntax tree, and every configuration ends up missing something important (e.g. fractions not being supported). My current configurations for equations and expressions seem fine: an expression simply consists of an array of terms, each with a positive/negative sign and a coefficient. That's where the trouble comes in: what exactly is a term? Wikipedia helps some, and even has an example AST for a couple of terms. However, for practical purposes I'm trying to keep everything closer to the concepts we use when we learn algebra, rather than breaking it down into nothing but variables and operators. It appears that just about anything can be contained in a term: terms can contain fractions (which contain expressions), sub-terms, sub-expressions, and regular variables, each with its own exponent.
Currently my configuration is something like this:
Term
 |
 +-- Coefficient (an integer/decimal, or a fraction containing no variables)
 +-- ArrayList of sub-expressions
 +-- ArrayList of powers of sub-expressions*
 +-- ArrayList of fractions (may contain variables)
 +-- ArrayList of powers of fractions*
*Expressions/fractions don't have exponents of their own, but may sometimes have one applied from outside (e.g. 2(x+3)^3).
NOTE: For the sake of simplicity the diagram leaves out an ArrayList of variables (and one for roots), and an ArrayList of their respective exponents, all contained by the term.
NOTE 2: In case it's not clear, the diagram doesn't show inheritance. It's showing members of the Term class.
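For concreteness, a minimal Java sketch of the structure the diagram describes (all class and field names are my own illustration, not from real code):

import java.util.ArrayList;
import java.util.List;

class Expression { List<Term> terms = new ArrayList<>(); }

class Fraction { Expression numerator, denominator; }

// Either an integer/decimal value or a variable-free Fraction.
class Coefficient { Double value; Fraction fraction; }

class Term {
    boolean negative;                   // the term's +/- sign
    Coefficient coefficient;

    List<Expression> subExpressions = new ArrayList<>();
    List<Integer> subExpressionPowers = new ArrayList<>();  // parallel list, e.g. (x+3)^3

    List<Fraction> fractions = new ArrayList<>();           // may contain variables
    List<Integer> fractionPowers = new ArrayList<>();       // parallel list

    List<Character> variables = new ArrayList<>();          // e.g. 'x'
    List<Integer> variablePowers = new ArrayList<>();       // parallel list
}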
This seems rather sloppy to me, and might not scale well as the project gets more complex. Is a term really supposed to be this kind of soup? I have a feeling yet another thing should be included in the term, but I can't think of what it would be. Although I've been struggling with this for some months, I haven't had the discipline to just stop and really work it out, which I should have done before starting.
Am I making a mistake in making nearly everything fit in a term? If so, what should I be doing instead? If not, is it really supposed to be this... ugly/non-intuitive? Part of my feeling that this must be wrong comes from the fact that almost no one thinks of an algebraic term this way.
Example term: 2.3x(2/3)^4(√23)((x+6)/(x-6)) (overly complex, I know, but it contains everything mentioned above).
My real question: What is the correct syntax structure for the term, the heart and soul of algebra?
Related
I have a map in Java, Map<String,Object> dataMap, whose content looks like this:
{country=Australia, animal=Elephant, age=18}
Now, while parsing the map, various conditional statements may be used, like:
if(dataMap.get("country").contains("stra")
OR
if(dataMap.get("animal") || 100 ==0)
OR
Some other operation inside if
I want to create a config file that contains all the rules for how the data inside the Map should look. In simple words, I want to define the conditions that the values for the keys country, animal, and age should follow, and the operations that should be performed on them, all in the config file, so that the if-elses and extra code can be removed. The config file will be used for parsing the map.
Can someone tell me how such a config file can be written, and how can it be used inside Java?
Sample examples and code references will be of help.
I am thinking of creating a JSON file for this purpose.
Example -
Boolean b = true;
List<String> conditions = new ArrayList<>();
if (dataMap.get("animal").toString().contains("pha")) {
    conditions.add("condition1 satisfied");
    // original condition was "(Integer.parseInt(...) || 100) == 0", which does not
    // compile; "% 100 == 0" is a guess at the intended divisibility check
    if (Integer.parseInt(dataMap.get("age").toString()) % 100 == 0) {
        conditions.add("condition2 satisfied");
        if (dataMap.get("country").equals("Australia")) {
            conditions.add("condition3 satisfied");
        } else {
            b = false;
        }
    } else {
        b = false;
    }
} else {
    b = false;
}
Now suppose I want to define the conditions for each map value in a config file, i.e. the operation (equals, OR, contains) and the test values, instead of using if-elses. The config file would then be used for parsing the Java map.
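For illustration, I imagine the JSON could look something like this (all keys and operation names here are invented):

[
  { "key": "animal",  "operation": "contains", "value": "pha" },
  { "key": "age",     "operation": "mod",      "value": 100, "expect": 0 },
  { "key": "country", "operation": "equals",   "value": "Australia" }
]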
Just to manage expectations: Doing this in JSON is a horrible, horrible idea.
To give you some idea of what you're trying to make:
Grammars like this are best visualized as a tree structure. The 'nodes' in this tree are:
'atomics' (100 is an atom, so is "animal", so is dataMap).
'operations' (+ is an operation, so is or / ||).
potentially, 'actions', though you can encode those as operations.
Java works like this, so do almost all programming languages, and so does a relatively simple 'mathematical expression engine', such as something that can evaluate e.g. the string "(1 + 2) * 3 + 5 * 10" into 59.
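As a rough Java sketch of such a tree (all names invented):

// A bare-bones expression tree: atoms are leaves, operations are interior nodes.
interface Node {}

class IntAtom implements Node { final long value; IntAtom(long v) { value = v; } }
class StrAtom implements Node { final String value; StrAtom(String v) { value = v; } }
class Ident   implements Node { final String name;  Ident(String n)  { name = n; } }

class BinOp implements Node {          // e.g. "+", "*", "||", "=="
    final String op; final Node left, right;
    BinOp(String op, Node l, Node r) { this.op = op; left = l; right = r; }
}

// (1 + 2) * 3 would be built as:
// new BinOp("*", new BinOp("+", new IntAtom(1), new IntAtom(2)), new IntAtom(3))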
In Java, dataMap.get("animal") || 100 ==0 is parsed into this tree:
            OR operation
           /            \
   INVOKE get[1]      equality
     /       \         /     \
 dataMap  "animal"  INT(100)  INT(0)
where, to be very precise, [1] is stored as INVOKEVIRTUAL java.util.Map :: get(Object), with as 'receiver' an IDENT node (an atomic with value dataMap) and an args list node containing one element, the string literal atomic "animal".
Once you see this tree you see how the notion of precedence works - your engine will need to be capable of representing both (1 + 2) * 3 as well as 1 + (2 * 3), so doing this without trees is not really possible unless you delve into bizarre syntaxes where lexical ordering matches processing ordering (if you want that, look at how reverse Polish notation calculators work, or at Forth-style stack-based language design. I don't think you'll like what you find there).
You're already making language design decisions here. Apparently, you think the language should adopt a 'truthy'/'falsy' concept, where dataMap.get("animal"), which presumably returns an animal object, is to be considered 'true' (as you're using it in a boolean operation) if, presumably, it isn't null or whatnot.
So, you're designing an entire programming language here. Why handicap yourself by enforcing that it is written in, of all things, JSON, which is epically unsuitable for the job? Go whole hog and write an entire language. It'll take 2 to 3 years, of course. Doing it in JSON isn't going to knock more than a week off that total, and it will produce something so incredibly annoying to write that nobody would ever use it, buying you nothing.
The language will also naturally trend towards Turing completeness. Once a language is Turing-complete, it becomes mathematically impossible to answer questions such as: "Is this code ever going to finish running, or will it loop forever?" (see the halting problem). You also have no idea how much memory or CPU power it will take, and other issues follow that then create security needs. These are solvable problems (sandboxing, for example), but it's all very complicated.
The JVM is, what, 2000 person-years' worth of experience and effort?
If you've got 2000 years to write all this, by all means. The point is: there is no 'simple' way here. Either it's a woefully incomplete thing that never feels like it can do what you actually want (express arbitrary ideas in a manner that feels natural, can be parsed by your system, and still makes sense when you read it back), or it's as complex as any language would be.
Why not just... use a language? Let folks write not JSON but full-blown Java, or JS, or Python, or Ruby, or Lua, or anything else that already exists, is open source, and seems well designed?
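For instance, a minimal sketch using the standard javax.script API (this assumes a JavaScript engine such as Nashorn or GraalJS is available on the classpath; the rule is then just a line of JS kept in a plain config file):

import java.util.Map;
import javax.script.Bindings;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

public class RuleRunner {
    // Evaluates a user-supplied rule against the map via an embedded JS engine.
    public static boolean evaluate(String rule, Map<String, Object> dataMap) throws Exception {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("javascript");
        Bindings bindings = engine.createBindings();
        bindings.put("dataMap", dataMap);
        return Boolean.TRUE.equals(engine.eval(rule, bindings));
    }
}

// Usage, with the rule read from a config file instead of hard-coded ifs:
// evaluate("dataMap.get('animal').contains('pha') && dataMap.get('country') == 'Australia'", dataMap)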
I want to be able to parse expressions representing physical quantities like
g/l
m/s^2
m/s/kg
m/(s*kg)
kg*m*s
°F/(lb*s^2)
and so on, in the simplest way possible. Is it possible to do this with something like Pyparsing (if such a thing exists for Java), or should I use more complex tools like Java CUP?
EDIT: To answer MrD's question: the goal is to convert between quantities, for example to convert g to kg (this one is simple...), or maybe °F/(kg*s^2) to K/(lb*h^2), supposing h is for hours and lb for pounds.
This is harder than it looks (I have done a fair amount of work here). The main problem is that there is no standard (I have worked with NIST on units, and although they have finally created a markup language, few people use it). So it's really a form of natural language processing and has to deal with:
ambiguity (what does "M" mean - meters or mega?)
inconsistent punctuation
abbreviations
symbols (e.g. "mu" for micro)
unclear semantics (e.g. is kg/m/s the same as kg/(m*s)?)
If you are just creating a toy system then you should create a BNF for the system and make sure that all examples adhere to it. This will use common punctuation ("/", "*", "(", ")", "^"). Character fields can be of variable length ("m", "kg", "lb"). Algebra on these strings ("kg" -> 1000 * "g") has problems, as kg is a fundamental unit.
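For instance, a toy grammar and a hand-rolled recursive-descent parser might look like this (a rough sketch; the BNF and all names are mine, not any standard):

// Toy BNF:
//   unit   := factor ( ("*" | "/") factor )*
//   factor := NAME ( "^" INT )?  |  "(" unit ")"
import java.util.*;

public class UnitParser {
    private final List<String> tokens;
    private int pos = 0;

    public UnitParser(String input) {
        // Split into names, integers, and the punctuation * / ^ ( )
        tokens = new ArrayList<>();
        java.util.regex.Matcher m = java.util.regex.Pattern
                .compile("[A-Za-z°]+|\\d+|[*/^()]").matcher(input);
        while (m.find()) tokens.add(m.group());
    }

    private String peek() { return pos < tokens.size() ? tokens.get(pos) : ""; }
    private String next() { return tokens.get(pos++); }

    // unit := factor ( ("*" | "/") factor )*   -- returns exponent per base unit
    public Map<String, Integer> unit() {
        Map<String, Integer> dims = factor();
        while (peek().equals("*") || peek().equals("/")) {
            int sign = next().equals("*") ? 1 : -1;
            factor().forEach((k, v) -> dims.merge(k, sign * v, Integer::sum));
        }
        return dims;
    }

    // factor := NAME ( "^" INT )? | "(" unit ")"
    private Map<String, Integer> factor() {
        if (peek().equals("(")) {
            next();                         // consume "("
            Map<String, Integer> inner = unit();
            next();                         // consume ")"
            return inner;
        }
        String name = next();
        int exp = 1;
        if (peek().equals("^")) { next(); exp = Integer.parseInt(next()); }
        Map<String, Integer> dims = new HashMap<>();
        dims.put(name, exp);
        return dims;
    }

    public static void main(String[] args) {
        System.out.println(new UnitParser("m/s^2").unit());     // {m=1, s=-2}
        System.out.println(new UnitParser("m/(s*kg)").unit());  // {m=1, s=-1, kg=-1}
    }
}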
If you are doing it seriously then ANTLR (@Yaugen's suggestion) is useful, but be aware that units in the wild will not follow a regular grammar, due to the inconsistencies above.
If you are REALLY serious (i.e. prepared to put in a solid month), I'd be interested to know. :-)
My current approach (which is outside the scope of your question) is to collect a large number of examples from the literature automatically and create a number of heuristics.
I'm trying to do document classification using the Weka Java API.
Here is the directory structure of my data files.
+- text_example
|
+- class1
| |
| 3 html files
|
+- class2
| |
| 1 html file
|
+- class3
|
3 html files
I have the ARFF file created with TextDirectoryLoader. Then I use the StringToWordVector filter on the created ARFF file, with filter.setOutputWordCounts(true).
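For reference, the loading step described above looks roughly like this in code (a sketch using the standard Weka classes; the directory name follows the structure shown):

import java.io.File;
import weka.core.Instances;
import weka.core.converters.TextDirectoryLoader;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class LoadAndVectorize {
    public static void main(String[] args) throws Exception {
        TextDirectoryLoader loader = new TextDirectoryLoader();
        loader.setDirectory(new File("text_example"));  // class1/class2/class3 subfolders
        Instances raw = loader.getDataSet();            // one string attribute + the class

        StringToWordVector filter = new StringToWordVector();
        filter.setOutputWordCounts(true);               // counts instead of 0/1 presence
        filter.setInputFormat(raw);
        Instances data = Filter.useFilter(raw, filter);
        System.out.println(data.numAttributes() + " attributes generated");
    }
}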
Below is a sample of the output once the filter is applied. I need to get a few things clarified.
@attribute </form> numeric
@attribute </h1> numeric
.
.
@attribute earth numeric
@attribute easy numeric
This huge list should be the tokenization of the content of the initial HTML files, right?
Then I have,
@data
{1 2,3 2,4 1,11 1,12 7,..............}
{10 4,34 1,37 5,.......}
{2 1,5 6,6 16,...}
{0 class2,34 11,40 15,.....,4900 3,...
{0 class3,1 2,37 3,40 5....
{0 class3,1 2,31 20,32 17......
{0 class3,32 5,42 1,43 10.........
Why is there no class attribute for the first 3 items? (They should have class1.)
What does the leading 0 mean, as in {0 class2,..} and {0 class3,..}?
It says, for instance, that in the 3rd HTML file in the class3 folder, the word identified by the integer 32 appears 5 times. How do I get the word (token) referred to by 32?
How do I reduce the dimensionality of the feature vector? Don't we need to make all the feature vectors the same size? (E.g. consider only, say, the 100 most frequent terms from the training set, and later, when it comes to testing, consider the occurrence of only those 100 terms in the test documents. What happens this way if we come across a totally new word in the testing phase - will the classifier just ignore it?)
Am I missing something here? I'm new to Weka.
Also, I would really appreciate it if someone could explain how the classifier uses the vector created with the StringToWordVector filter (creating the vocabulary from the training data, dimensionality reduction - do those happen inside the Weka code?).
The huge list of @attribute declarations contains all the tokens derived from your input.
Your @data section is in the sparse format, that is, for each attribute the value is only stated if it is different from zero. For the first three lines, the class attribute is class1, you just can't see it (if it were unknown, you would see a 0 ? at the beginning of the first three lines). Why is that so? Weka internally represents nominal attributes (that includes classes) as doubles and starts counting at zero. So your three classes are internally: class1=0.0, class2=1.0, class3=2.0. As zero values are not stated in the sparse format, you can't see the class in the first three lines. (Also see the section "Sparse ARFF files" on http://www.cs.waikato.ac.nz/ml/weka/arff.html)
To get the word/token represented by index n, you can either count or, if you have the Instances object, invoke attribute(n).name() on it. For that, n starts counting at 0.
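For example, assuming data is the filtered Instances object:

String token = data.attribute(32).name();  // the word behind sparse index 32 (0-based)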
To reduce the dimensionality of the feature vector, there are a lot of options. If you only want to keep the 100 most frequent terms, call stringToWordVector.setWordsToKeep(100). Note that this will try to keep 100 words of every class. If you do not want to keep 100 words per class, call stringToWordVector.setDoNotOperateOnPerClassBasis(true). You will get slightly more than 100 if several words have the same frequency, so the 100 is just a kind of target value.
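In code, the whole filter setup would look roughly like this (a sketch; rawData stands for the Instances loaded from the text directory):

StringToWordVector filter = new StringToWordVector();
filter.setOutputWordCounts(true);
filter.setWordsToKeep(100);                   // ~100 most frequent words (a target, not exact)
filter.setDoNotOperateOnPerClassBasis(true);  // keep ~100 overall instead of per class
filter.setInputFormat(rawData);
Instances data = Filter.useFilter(rawData, filter);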
As for new words occurring in the test phase, I think that cannot happen, because you have to hand the StringToWordVector all instances before classifying. I am not 100% sure on that one though, as I am using a two-class setup and I let StringToWordVector transform all my instances before telling the classifier anything about them.
I can generally recommend experimenting with the Weka KnowledgeFlow tool to learn how to use the different classes. If you know how to do things there, you can use that knowledge in your Java code quite easily.
Hope I was able to help you, although the answer is a bit late.
I'm currently working on a Java application where I need to implement a system for building BPF expressions. I also need to implement a mechanism for detecting equivalent BPF expressions.
Building the expression is not too hard. I can build a syntax tree using the Interpreter design pattern and implement toString to produce the BPF syntax.
However, detecting if two expressions are equivalent is much harder. A simple example would be the following:
A: src port 1024 and dst port 1024
B: dst port 1024 and src port 1024
In order to detect that A and B are equivalent I probably need to transform each expression into a "normalized" form before comparing them. This would be easy for the above example; however, it gets much harder when working with a combination of nested AND, OR and NOT operations.
Does anyone know how I should best approach this problem?
One way to compare boolean expressions may be to convert both to the disjunctive normal form (DNF), and compare the DNF. Here, the variables would be Berkeley Packet Filter tokens, and the same token (e.g. port 80) appearing anywhere in either of the two expressions would need to be assigned the same variable name.
There is an interesting-looking applet at http://www.izyt.com/BooleanLogic/applet.php - sadly I can't give it a try right now due to Java problems in my browser.
I'm pretty sure detecting equivalent expressions is either an NP-hard or NP-complete problem, even for boolean-only expressions. Meaning that to do it perfectly, you basically have to build complete truth tables of all possible input combinations and results, then compare the tables.
Maybe BPF expressions are limited in some way that changes that? I don't know, so I'm assuming not.
If your problems are small, that may not be a problem. I do exactly that as part of a decision-tree designing algorithm.
Alternatively, don't try to be perfect. Allow some false negatives (cases which are equivalent, but which you won't detect).
A simple approach may be a variant of normal expression evaluation, but evaluating an alternative representation of the expression rather than the result. Impose an ordering on the operands of commutative operators. Apply some obvious simplifications during the evaluation. Replace a rich operator set with a minimal set of primitive operators - e.g. using De Morgan's laws to eliminate OR operators.
This alternative representation forms a canonical representation for all members of a set of equivalent expressions. It should be an equivalence class in the sense that you always find the same canonical form for any member of that set. But that's only the set-theory/abstract-algebra sense of an equivalence class - it doesn't mean that all equivalent expressions are in the same equivalence class.
For efficient dictionary lookups, you can use hashes or comparisons based on that canonical representation.
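To make the idea concrete, a minimal sketch of such a canonicalization (node classes invented; this handles commutativity only - De Morgan rewriting and flattening of nested ANDs/ORs would slot into the same pass):

import java.util.*;
import java.util.stream.Collectors;

// Leaves are BPF tokens; interior nodes are AND/OR (commutative) or NOT.
abstract class Expr {
    abstract String canonical();  // canonical string form used for comparison
}
class Leaf extends Expr {
    final String token; Leaf(String t) { token = t; }
    String canonical() { return token; }
}
class Not extends Expr {
    final Expr inner; Not(Expr e) { inner = e; }
    String canonical() { return "NOT(" + inner.canonical() + ")"; }
}
class Op extends Expr {
    final String op; final List<Expr> children;
    Op(String op, Expr... cs) { this.op = op; children = Arrays.asList(cs); }
    String canonical() {
        // Commutative: sort the children's canonical forms lexicographically.
        String body = children.stream().map(Expr::canonical)
                              .sorted().collect(Collectors.joining(","));
        return op + "(" + body + ")";
    }
}

class Demo {
    public static void main(String[] args) {
        Expr a = new Op("AND", new Leaf("src port 1024"), new Leaf("dst port 1024"));
        Expr b = new Op("AND", new Leaf("dst port 1024"), new Leaf("src port 1024"));
        System.out.println(a.canonical().equals(b.canonical()));  // true
    }
}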
I'd definitely go with syntax normalization. That is, as aix suggested, transform the booleans into DNF and reorder the abstract syntax tree such that the lexically smallest arguments are on the left-hand side. Normalize all comparisons to < and <=. Then, two equivalent expressions should have equivalent syntax trees.
Suppose you need to perform some kind of comparison between 2 files. You only need to do it when it makes sense; in other words, you wouldn't want to compare a JSON file with a properties file, or a .txt file with a .jar file.
Additionally, suppose that you have a mechanism in place to sort all of these things out, and what it comes down to now is the actual file name. You would want to compare "myFile.txt" with "myFile.txt", but not with "somethingElse.txt". The goal is to be as close to "apples to apples" rules as possible.
So here we are, on one side you have "myFile.txt" and on another side you have "_myFile.txt", "_m_y_f_i_l_e.txt" and "somethingReallyClever.txt".
The task is to pick the closest name for the later comparison. Unfortunately, an identical name is not found.
Looking at the character composition, it is not hard to figure out what the relationship is. My algo says:
_myFile.txt to _m_y_f_i_l_e.txt 0.312
_myFile.txt to somethingReallyClever.txt 0.16
So _m_y_f_i_l_e.txt is closer to _myFile.txt than somethingReallyClever.txt is. Fantastic. But it also says that it is only 2 times closer, whereas in reality we can look at the 2 files and would never think to compare somethingReallyClever.txt with _myFile.txt.
Why?
What logic would you suggest I apply to not only figure out likelihood by having chars in the same places, but also to test whether the determined weight makes sense?
In my example, somethingReallyClever.txt should have had a weight of 0.0
I hope I am being clear.
Please share your experience and thoughts on this.
(whatever approach you suggest should not depend on the number of characters the filename consists of)
Possibly helpful previous question which highlights several possible algorithms:
Word comparison algorithm
These algorithms are based on how many changes would be needed to get from one string to the other - where a change is adding a character, deleting a character, or replacing a character.
Certainly any sensible metric here should have a low score as meaning close (think distance between the two strings) and larger scores as meaning not so close.
Sounds like you want the Levenshtein distance, perhaps modified by first converting both words to the same case and normalizing separators (e.g. replacing all spaces and underscores with the empty string).
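For reference, a standard dynamic-programming Levenshtein implementation with that normalization applied first (a sketch; note that the distance between the two example names collapses to 0 after normalization):

public class NameDistance {
    static String normalize(String s) {
        return s.toLowerCase().replaceAll("[ _]", "");
    }

    // Classic two-row DP: cost 1 each for insert, delete, substitute.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int sub = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                curr[j] = Math.min(sub, Math.min(prev[j] + 1, curr[j - 1] + 1));
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        String target = normalize("_myFile.txt");
        System.out.println(levenshtein(target, normalize("_m_y_f_i_l_e.txt")));          // 0
        System.out.println(levenshtein(target, normalize("somethingReallyClever.txt"))); // large
    }
}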