I'm currently looking for a Java library (or a native library with a Java API) for formula parsing and evaluation.
Using recommendations from here, I took a look at many libraries:
JFormula
JEval
Symja
JEP
But none of them fulfills my needs, which are:
Multiple formula evaluation with dependencies between them (a formula is always an assignment to a variable, using other variables or numerical values)
Possibility to change only one formula out of maybe 50, with good performance if only one formula changes
No need to handle variable dependencies by hand
Automatically update other dependent variables when a formula changes
Possibility to listen for which variables changed
No need for a specific variable format (the user will directly enter a name and doesn't want a complex notation)
Maybe an example will make this clearer. Let's say we have, entered into the system in this order:
a = b + c
c = 2 * d
b = 3
d = 2
I would like to be able to enter those 4 lines in this order, and ask for the result of "a" (or "b", whatever).
Then if, in the user interface (basically a variable <> formula table), "b" is changed to "2 * d", the library will automatically change the values of "b" and "a", and return (or fire an event with, or pass to a callback) a list of changes.
The best library would be one just like JEP, but with the out-of-order variable definition capability and the ability to auto-evaluate dependent variables.
I know that compilers and spreadsheet software use such mechanisms, but I haven't found any Java or Java-compatible libraries that are directly usable.
Does someone know of one?
EDIT: To be precise, the question is really about a library, or possibly a set of libraries to link together. The question is for a project in a company, and the idea is to spend the minimum amount of time. The "do it yourself" solution has already been estimated and is not in the scope of the question.
For a project in which I also needed a simple formula parser, I used the code from the article Lexical analysis, Part 2: Build an application on javaworld.com. It's simple and small (7 classes), and you can adapt it to your needs.
You can download the source from here (search for the 'Lexical Analysis Part II' entry).
Don't know of any libraries.
Assuming what you have is a set of equations with a single variable on at least one side of each equation (A+B=C-D is disallowed) and no cycles (e.g., A=B+1; B=A-2), what you technically need to do is build a data-flow graph showing how each operator depends on its operands. For side-effect-free equations (e.g., pure math) this is pretty easy; you end up with a directed acyclic graph (a forest with shared subtrees representing shared subexpressions). Then, if the value of a variable changes or a new formula is introduced, you revise the DAG and re-evaluate the changed parts, propagating changes up to the DAG roots. So you need to build trees for the expressions and then share them (often by hashing subtrees to find potentially equivalent candidates). In short: lots of structure manipulation to keep the DAG (and its root values) up to date.
But if it's only 50 variables of the complexity you show, you could in fact simply re-evaluate them all. If you store the expressions as trees (or, better yet, in reverse Polish notation) you can evaluate each tree quite fast, and you don't pay any overhead to keep all those data structures up to date.
If you have hundreds of equations, the DAG scheme is likely a lot better.
If you have constraint equations (i.e., you aren't restricted as to what can be on each side), you're outside the spreadsheet paradigm and into constraint solvers, which is a far more complex technology.
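A minimal sketch of this DAG scheme in plain Java (the class and method names are my own invention, not any library's API): each formula declares which variables it reads, and redefining one re-evaluates only its dependents, collecting the list of changed variables the question asks for. It assumes acyclic dependencies, and it tolerates out-of-order entry by skipping formulas whose inputs aren't defined yet.

```java
import java.util.*;
import java.util.function.ToDoubleFunction;

// Sketch of the dependency-graph idea: each formula knows which variables it
// reads; changing one formula re-evaluates only its (transitive) dependents.
// Assumes the dependency graph is acyclic. All names are illustrative.
public class FormulaGraph {
    private final Map<String, ToDoubleFunction<Map<String, Double>>> formulas = new HashMap<>();
    private final Map<String, Set<String>> deps = new HashMap<>();   // var -> vars it reads
    private final Map<String, Double> values = new HashMap<>();

    // (Re)define a formula; returns every variable whose value changed.
    public List<String> define(String var, Set<String> reads,
                               ToDoubleFunction<Map<String, Double>> body) {
        formulas.put(var, body);
        deps.put(var, reads);
        List<String> changed = new ArrayList<>();
        recompute(var, changed);
        return changed;
    }

    public double get(String var) { return values.get(var); }

    private void recompute(String var, List<String> changed) {
        Set<String> reads = deps.get(var);
        if (reads.stream().anyMatch(v -> !values.containsKey(v))) return; // input not yet defined
        double v = formulas.get(var).applyAsDouble(values);
        Double old = values.put(var, v);
        if (old != null && old == v) return;   // value unchanged: stop propagating
        changed.add(var);
        // propagate to every formula that reads this variable
        for (Map.Entry<String, Set<String>> e : deps.entrySet())
            if (e.getValue().contains(var)) recompute(e.getKey(), changed);
    }
}
```

Entering the question's four lines in the question's order (a = b + c first) works because each definition only propagates once its inputs exist; redefining b to 2 * d then reports b and a as changed.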
Why not just write your own? Your assessment of the complexity of this task might be wrong; it is much easier than you might think. Chances are, learning how to deal with any third-party library would require much more effort than implementing such a trivial thing from scratch. It should not take more than a couple of hours in the worst case.
It does not make any sense to look for third-party libraries for doing simple things (I know, it is part of the Java ethos, but still...).
I'd recommend taking a look at the Cells library for inspiration. It is in Common Lisp, but the ideas are basic enough to be transferred anywhere else.
You can check these links too:
MathPiper (a Java fork of the Java Yacas version) (has its own editor based on jEdit) (GPL) http://code.google.com/p/mathpiper/
Symja/Matheclipse (my own project, uses the JAS and Commons Math libraries) (LGPL) http://krum.rz.uni-mannheim.de/jas/
Java Algebra System (JAS) (LGPL) http://krum.rz.uni-mannheim.de/jas/
I would embed Groovy; see the tutorial about embedding here. Freeplane (a Java mind mapper) also uses Groovy for formulas.
Whenever a variable changes, you have to put the new value into the binding.
All the cell code should be given to the Groovy shell as a single piece of code. You can register for changes via BindPath.
Anyway, I assume you will have to implement a thin layer to fulfill these requirements:
No need to handle variable dependencies by hand
Possibility to listen for which variables changed
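That thin layer could be as small as a variable table that fires standard java.beans events when a value changes. A sketch, with invented names (the Groovy Binding would sit behind setVariable; this only shows the listener side):

```java
import java.beans.PropertyChangeListener;
import java.beans.PropertyChangeSupport;
import java.util.HashMap;
import java.util.Map;

// Sketch of the "thin layer": a variable table that fires an event whenever a
// value actually changes, so the UI can listen per variable. Names invented.
public class ObservableBinding {
    private final Map<String, Object> vars = new HashMap<>();
    private final PropertyChangeSupport pcs = new PropertyChangeSupport(this);

    public void addListener(PropertyChangeListener l) {
        pcs.addPropertyChangeListener(l);
    }

    public void setVariable(String name, Object value) {
        Object old = vars.put(name, value);
        // fires only when old and new differ (or either is null)
        pcs.firePropertyChange(name, old, value);
    }

    public Object getVariable(String name) { return vars.get(name); }
}
```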
I have been thinking about an approach for this problem, but I have not found any solution that convinces me. I am programming a crawler and I have a download task for every URL from a URL list. In addition, the different HTML documents are parsed in different ways depending on the site URL and the information that I want to extract. So my problem is how to link every task with its appropriate parser.
The ideas are:
Creating a huge 'if' that checks the download type and associates a parser.
(Avoided, because the 'if' grows with every new site added to the crawler)
Using polymorphism: create a different download task for every site, tied to the type of information I want to get, and then use a post-action that links it to its parser.
(Increases the complexity again with every new parser)
So I am looking for some kind of software pattern or idea that says:
"Hey, I am a download task with this information."
"Really? Then you need this parser to extract it. Here is the parser you need."
Additional information:
The architecture is very simple: a list of URLs which are seeds for the crawler; a producer which downloads the pages; another list with the downloaded HTML documents; and a consumer which should apply the right parser to each page.
Depending on the page downloaded, sometimes we need to use parser A, sometimes parser B, etc.
EDIT
An example:
We have three web sites: site1.com, site2.com and site3.com.
There are three URL types which we want to parse: site1.com/A, site1.com/B, site1.com/C, site2.com/A, site2.com/B, site2.com/C, ... site3.com/C
Every URL is parsed differently, and usually the same information is shared between site1.com/A, site2.com/A and site3.com/A; ...; site1.com/C, site2.com/C and site3.com/C.
Looks like a genetic algorithm approach fits your description of the problem; what you need to find first are the basic (atomic) solutions.
Here's a short description from Wikipedia:
In a genetic algorithm, a population of candidate solutions (called individuals, creatures, or phenotypes) to an optimization problem is evolved toward better solutions. Each candidate solution has a set of properties (its chromosomes or genotype) which can be mutated and altered; traditionally, solutions are represented in binary as strings of 0s and 1s, but other encodings are also possible.[2]
The evolution usually starts from a population of randomly generated individuals, and is an iterative process, with the population in each iteration called a generation. In each generation, the fitness of every individual in the population is evaluated; the fitness is usually the value of the objective function in the optimization problem being solved. The more fit individuals are stochastically selected from the current population, and each individual's genome is modified (recombined and possibly randomly mutated) to form a new generation. The new generation of candidate solutions is then used in the next iteration of the algorithm. Commonly, the algorithm terminates when either a maximum number of generations has been produced, or a satisfactory fitness level has been reached for the population.
A typical genetic algorithm requires:
a genetic representation of the solution domain,
a fitness function to evaluate the solution domain.
A standard representation of each candidate solution is as an array of bits.[2] Arrays of other types and structures can be used in essentially the same way. The main property that makes these genetic representations convenient is that their parts are easily aligned due to their fixed size, which facilitates simple crossover operations. Variable length representations may also be used, but crossover implementation is more complex in this case. Tree-like representations are explored in genetic programming and graph-form representations are explored in evolutionary programming; a mix of both linear chromosomes and trees is explored in gene expression programming.
Once the genetic representation and the fitness function are defined, a GA proceeds to initialize a population of solutions and then to improve it through repetitive application of the mutation, crossover, inversion and selection operators.
I would externalize the parsing pattern/structure in some form (like XML) and use it dynamically.
For example, say I have to download site1.com and site2.com, which have two different layouts. I would create two XML files that hold the layout patterns,
and one master XML file which records which URL should use which XML file.
At startup, load the master XML and use it as a dictionary. When you have to download a page, download it, look the XML up in the dictionary, and pass the dictionary and the stream to the parser (a single generic parser) which reads the stream based on the XML flow and the XML information.
In this way, we can create common patterns in XML and use them to read similar sites. Use regular expressions in the XML patterns to cover most sites with a single XML file.
If a layout is completely different, just create one more XML file and modify the master XML; that's it.
The secret/success of this design is how you create such generic XML files, and that depends purely on what you need and what you are doing after parsing.
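The master-dictionary lookup itself can be sketched in plain Java as a registry from URL regexes to parsers; all names here are invented for illustration, and the XML loading is left out:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Sketch of the "master dictionary" idea: URL patterns (regexes, as suggested
// above) mapped to parsers, checked in registration order. Names invented; a
// real version would populate this registry from the master XML file.
public class ParserRegistry {
    public interface PageParser { String parse(String html); }

    private final Map<Pattern, PageParser> parsers = new LinkedHashMap<>();

    public void register(String urlRegex, PageParser parser) {
        parsers.put(Pattern.compile(urlRegex), parser);
    }

    // Pick the first parser whose pattern matches the URL; null if none.
    public PageParser lookup(String url) {
        for (Map.Entry<Pattern, PageParser> e : parsers.entrySet())
            if (e.getKey().matcher(url).find()) return e.getValue();
        return null;
    }
}
```

With patterns like `site[0-9]+\.com/A`, one entry covers site1.com/A, site2.com/A and site3.com/A, matching the example's observation that the /A pages share their information.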
This seems to be a connectivity problem. I'd suggest considering the quick-find algorithm.
See here for more details.
http://jaysonlagare.blogspot.com.au/2011/01/union-find-algorithms.html
And here's a simple Java sample:
https://gist.github.com/gtkesh/3604922
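For reference, here is a minimal quick-find sketch of the union-find structure the link describes (my own rendering, not taken from the gist):

```java
// Minimal quick-find sketch of union-find: id[i] names the component of
// element i; union rewrites one component's name, so find is O(1) and
// union is O(n) per call (hence "quick find").
public class QuickFind {
    private final int[] id;

    public QuickFind(int n) {
        id = new int[n];
        for (int i = 0; i < n; i++) id[i] = i;   // each element starts alone
    }

    public boolean connected(int p, int q) { return id[p] == id[q]; }

    public void union(int p, int q) {
        int pid = id[p], qid = id[q];
        if (pid == qid) return;
        for (int i = 0; i < id.length; i++)      // relabel p's whole component
            if (id[i] == pid) id[i] = qid;
    }
}
```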
I'm looking at the Mallet source code, and it seems that most of the classifier implementations (e.g. naive Bayes) don't really take feature selection into account, even though the InstanceList class has a setFeatureSelection method.
Now I want to run some quick experiments on my datasets with feature selection involved. As a technical shortcut, I am thinking I might take the lowest-ranking features and set their values to 0 in the instance vectors. Is this equivalent, in machine-learning terms, to feature selection in classifier training, whereby the features are not considered at all (if smoothing, e.g. Laplace estimation, is not involved)?
thank you
Yes, setting the feature value to zero will have the same effect as removing it from the feature vector, since MALLET has no notion of "missing features," only zero and nonzero feature values.
Using the FeatureSelection class isn't too painful, though. MALLET comes with several built-in classes that apply a "mask" under the hood based on RankedFeatureVector subclasses. For example, to use information gain feature selection, you should just be able to do this:
FeatureSelection fs = new FeatureSelection(new InfoGain(ilist), numFeatures);
ilist.setFeatureSelection(fs);
You can also implement your own RankedFeatureVector subclass (the API is here) for something more customized. To manually select features some other way, you can still do so by creating a feature mask as a BitSet that contains all the feature ids (from the Alphabet) that you want to use, e.g.:
java.util.BitSet featureMask = /* some code to pick your features */;
FeatureSelection fs = new FeatureSelection(ilist.getAlphabet(), featureMask);
ilist.setFeatureSelection(fs);
In general, I recommend using FeatureSelection objects instead of destructively changing the instance data.
I learned that PostgreSQL is written in C. I would like to extend it with
a customized index structure
a customized nearest neighbor retrieval (with various distance functions)
custom data types
So far I have been wary of using PostgreSQL because it is written in C. However, I've seen on the PostgreSQL about page (http://www.postgresql.org/about/) that it supports "library interfaces", e.g. for Java. Can I, then, use Java to implement (at least) nearest neighbor retrieval and custom data types (I guess not the index structure, since that is quite low-level)?
The answer here is "it's complicated." You can actually go quite far with a procedural language (including pl/java) but you are never going to get quite the flexibility you can get with C. What is fundamentally missing is being able to do proper indexing support in PL/Java because one cannot create new primitives. For quite a bit more, you may want to look at my blog although most of the examples are in pl/pgsql.
Types
Now you can actually get very far with PL/Java (or PL/Perl, or PL/Python, or whatever you like) but there are some things that are going to be out of reach. This is also a very high overview of what is possible with a procedural language in the db and what is not.
There are two effective ways you can work with types in procedural languages. You can work with domains (subtypes of primitives), or you can work with complex types (objects with properties each of which is another type, either a primitive, a domain or a complex type itself). In general you cannot really do much in terms of indexing complex types themselves but you can index their members. Another thing that is not safe to do is output formatting, but you can supply other functions to replace this.
For example, suppose we want to have a type for storing PNG files and processing them for certain properties in the database. We might do this in the following way:
CREATE DOMAIN png_image AS bytea CHECK (VALUE LIKE [magic number goes here]);
We could then create a bunch of stored procedures to process the png in various ways. For example we might look for orange near the top in a function is_sunset. We might be able to do something like:
SELECT l.name FROM landmark l
JOIN landmark s ON (s.name = 'San Diego City hall'
and ST_DISTANCE(l.coords, s.coords) < 20)
WHERE is_sunset(l.photo)
ORDER BY l.name;
There is no reason that is_sunset could not be handled in Java, Perl, or whatever language you like. Since is_sunset returns a bool, we could even:
CREATE INDEX l_name_sunset_idx ON landmark (name) where is_sunset(photo);
This would speed up the query by allowing us to cache the index of names of photographs of sunsets.
What you can't do in Java is create new primitive types. Keep in mind that things like index support live at the primitive level, and therefore you can't, for example, create a new IP address type supporting GiST indexing (not that you'd need to, since ip4r is available).
So, to the extent you can re-use and work with the primitives that already exist, you can do your development in Java or whatever you like. You are really limited only by the available primitives, and enough people have written new ones in C that you may not need to touch C at all.
Indexes
Index code is pretty much C only as are the primitives. You cannot customize index behavior in a procedural language. What you can do is work with the primitives of other developers and so forth. This is the area where you are most likely to have to drop to C.
(Update: As I think about it, it may be possible to hook into existing index types to add support for various indexes based on other PL functions, using the CREATE OPERATOR CLASS and CREATE OPERATOR commands. I have no experience doing this though.)
Performance
Keep in mind that PL/Java means you are running a JVM in each backend process. In many cases, if you can do what you want in PL/pgSQL, you will get better performance. The same goes for other languages too, of course, because you need an interpreter or other environment in the backend process.
I'm writing a biological evolution simulator. Currently, all of my code is written in Python. For the most part, this is great and everything works sufficiently well. However, there are two steps in the process which take a long time and which I'd like to rewrite in Scala.
The first problem area is sequence evolution. Imagine you're given a phylogenetic tree which relates a large set of proteins. The length of each branch represents the evolutionary distance between the parent and child. The root of the tree is seeded with a single sequence, and then an evolutionary model (e.g. http://en.wikipedia.org/wiki/Models_of_DNA_evolution) is used to evolve the sequence along the tree structure, taking the branch lengths into account. PyCogent takes a long time to perform this step, and I believe that a reasonable Java/Scala implementation would be significantly faster. Do you know of any libraries that implement this type of functionality? I want to write the application in Scala, so, thanks to interoperability, any Java library will suffice.
The second problem area is the comparison of the generated sequences. The problem is, given a set of sequences for the proteins in a number of different extant species, attempt to use the sequence to reconstruct the phylogenetic tree which relates the species. This problem is inherently computationally demanding, because one must basically do a pairwise comparison between all sequences in the extant species. Here again, however, I feel like a Java/Scala implementation would perform significantly faster than a Python one, if for nothing else than the unfortunately slow speed of looping in Python. This part I could write from scratch more easily than the sequence evolution part, but I'd be willing to use a library for it as well if a good one exists.
Thanks,
Rob
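To gauge the work involved in the second part, the pairwise-comparison loop itself is small on the JVM. Here is a p-distance sketch (fraction of mismatching positions over pre-aligned, equal-length sequences); this is my deliberate simplification, not the model-based distances real tools use:

```java
// Sketch of the pairwise-comparison step: a simple p-distance matrix over
// pre-aligned, equal-length sequences. Real pipelines use model-corrected
// distances (or full ML/Bayesian inference); this only illustrates the
// O(n^2) all-pairs loop the question describes.
public class PDistance {
    public static double[][] matrix(String[] seqs) {
        int n = seqs.length;
        double[][] d = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                d[i][j] = d[j][i] = pDistance(seqs[i], seqs[j]);
        return d;
    }

    // fraction of positions at which the two aligned sequences differ
    static double pDistance(String a, String b) {
        int mismatches = 0;
        for (int k = 0; k < a.length(); k++)
            if (a.charAt(k) != b.charAt(k)) mismatches++;
        return (double) mismatches / a.length();
    }
}
```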
For the second problem, why not use an existing program for comparing sequences and inferring phylogenetic trees, like RAxML or MrBayes, and call that? Maximum likelihood and Bayesian inference are very sophisticated models for these problems, and using them seems a far better idea than implementing it yourself. Something like a maximum parsimony or a neighbour-joining tree, which probably could be written from scratch for such a project, is not sufficient for evolutionary analysis (trees inferred via MP or NJ are really often quite wrong), unless you just want a very quick and dirty topology, where you can probably use something like this
For my university's debate club, I was asked to create an application to assign debate sessions, and I'm having some difficulty coming up with a good design for it. I will write it in Java. Here's what's needed:
What you need to know about BP debates: there are four teams of 2 debaters each and a judge. The four teams are assigned specific positions: gov1, gov2, op1, op2. There is no significance to the order within a team.
The goal of the application is to take as input the debaters who are present (for example, if there are 20 people, we will hold 2 debates) and assign them to teams and positions with regard to the history of each debater, so that:
Each debater should debate with (be on the same team as) as many different people as possible.
Each debater should rotate uniformly through the different positions.
The debates should be fair: debaters have different levels of experience, and teams should be as even as possible; i.e., there shouldn't be a team of two very experienced debaters against a team of two juniors.
There should be an option for the user to restrict the assignment in various ways, such as:
Specifying that two people should debate together, in a specific position or not.
Specifying that a single debater should be in a specific position, regardless of the partner.
If anyone can give me some pointers on a design for this application, I'll be very thankful!
Also, I've never implemented a GUI before, so I'd appreciate some pointers on that as well, but it's not the major issue right now.
Finally, there is the issue of keeping debater information in a file, which I have also never done in Java, and I would like some tips on that as well.
This seems like a textbook constraint problem. GUI notwithstanding, it would be perfect for a technology like Prolog (ECLiPSe Prolog ships with a couple of different Java integration libraries).
But since you want this in Java, why not store the debaters' history in a SQL database and use SQL to express the constraints? You can then wrap those SQL queries as Java methods.
There are two parts (three if you count entering and/or saving the data): the underlying algorithm and the UI.
For the UI, I'm weird: I use this technique (there is a link to my SourceForge project there). A Java version would have to be written, which would not be too hard. It's weird because very few people have ever used it, but it saves an order of magnitude of coding effort.
For the algorithm, the problem looks small enough that I would approach it with a simple tree search. I would have a scoring algorithm and just report the schedule with the best score.
That's a bird's-eye overview of how I would approach it.
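The tree search plus scoring idea can be sketched for a single debate of 8 people: enumerate every split into four teams of two, score each split by how evenly experience is spread, and keep the best. All names here are invented, and a real scorer would also weigh partner history and position rotation:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of "tree search + score" for one BP debate of 8 debaters: enumerate
// all splits into 4 teams of 2, score by experience balance, keep the best.
// Names invented; assumes exactly 8 debaters and an acyclic, exhaustive search.
public class DebateAssigner {
    // experience[i] is debater i's level; returns 4 teams as index pairs
    public static int[][] bestTeams(int[] experience) {
        List<Integer> pool = new ArrayList<>();
        for (int i = 0; i < experience.length; i++) pool.add(i);
        int[][] best = new int[4][2];
        search(pool, new int[4][2], 0, experience, best, new double[]{Double.MAX_VALUE});
        return best;
    }

    private static void search(List<Integer> pool, int[][] teams, int t,
                               int[] exp, int[][] best, double[] bestScore) {
        if (t == 4) {                            // a full assignment: score it
            double s = score(teams, exp);
            if (s < bestScore[0]) {
                bestScore[0] = s;
                for (int i = 0; i < 4; i++) best[i] = teams[i].clone();
            }
            return;
        }
        int first = pool.remove(0);              // fix the lowest free index so
        for (int j = 0; j < pool.size(); j++) {  // each pair is tried only once
            int second = pool.remove(j);
            teams[t] = new int[]{first, second};
            search(pool, teams, t + 1, exp, best, bestScore);
            pool.add(j, second);                 // backtrack
        }
        pool.add(0, first);
    }

    // lower is better: spread between strongest and weakest team
    static double score(int[][] teams, int[] exp) {
        int min = Integer.MAX_VALUE, max = Integer.MIN_VALUE;
        for (int[] team : teams) {
            int s = exp[team[0]] + exp[team[1]];
            min = Math.min(min, s);
            max = Math.max(max, s);
        }
        return max - min;
    }
}
```

The scoring function is the natural extension point: add terms for repeated partners and position history, with whatever weights the club considers fair.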