Short, Java implementation of a suffix tree and usage?

I'm looking for a short, simple suffix tree building/usage algorithm in Java. The best I've found so far lies within the Semantic Discovery Toolkit, but the implementation is several thousand lines long and spans several classes. Ideally, the implementation would be as short as possible and span no more than a few hundred lines.
Does anyone have such an implementation?

I just finished a Java implementation of a suffix tree. In my blog entry you can find out more about suffix trees, see how to use my library, as well as download and build the library using Subversion and Maven. Yes, it's longer than just a few lines in a single class file, but it is highly documented and is created for use in the real world for practical purposes. In addition, it uses the Ukkonen approach for linear time construction. (Most of the implementations noted here have at least O(n^2) running time.)

The article "Simple Linear Work Suffix Array Construction", by Kärkkäinen and Sanders, ends with 50 lines of C++. You will probably also want something to produce the LCP array. Googling for "Computing the LCP array in linear time, given S and the suffix array POS." should find you that.
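To give an idea of what the LCP step looks like, here is a minimal Java sketch of the linear-time computation commonly known as Kasai's algorithm. The class and method names are made up for illustration, and the suffix array here is built with a naive comparison sort only so the example is self-contained; it is not the linear-time construction from the article.

```java
import java.util.Arrays;

final class LcpSketch {
    /** Naive suffix array via comparison sort; for illustration only, not linear time. */
    static int[] suffixArray(String s) {
        Integer[] idx = new Integer[s.length()];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, (a, b) -> s.substring(a).compareTo(s.substring(b)));
        int[] sa = new int[idx.length];
        for (int i = 0; i < sa.length; i++) sa[i] = idx[i];
        return sa;
    }

    /** lcp[r] = longest common prefix of the suffixes at sa[r - 1] and sa[r]; lcp[0] = 0. */
    static int[] lcpArray(String s, int[] sa) {
        int n = s.length();
        int[] rank = new int[n];
        for (int r = 0; r < n; r++) rank[sa[r]] = r;
        int[] lcp = new int[n];
        int h = 0; // current match length, reused between consecutive text positions
        for (int i = 0; i < n; i++) {
            if (rank[i] == 0) {
                h = 0;
                continue;
            }
            int j = sa[rank[i] - 1]; // suffix that precedes suffix i in sorted order
            while (i + h < n && j + h < n && s.charAt(i + h) == s.charAt(j + h)) h++;
            lcp[rank[i]] = h;
            if (h > 0) h--; // the next suffix shares at least h - 1 characters
        }
        return lcp;
    }

    public static void main(String[] args) {
        int[] sa = suffixArray("banana");
        System.out.println(Arrays.toString(sa));
        System.out.println(Arrays.toString(lcpArray("banana", sa)));
    }
}
```

The `lcpArray` part runs in linear time because `h` drops by at most one per text position, so the inner loop does O(n) work in total.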

You can also take mine, but it is not Ukkonen's algorithm: like all other simple approaches, it runs in quadratic time. I agree that a naive algorithm (which may work fine for shorter sequences) is easy to write in half a day at most.
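For reference, a naive quadratic version can be very short. The sketch below is not the implementation mentioned above, and strictly speaking it builds an uncompressed suffix trie rather than a suffix tree: every suffix is inserted character by character, which costs O(n^2) time and space. The class name `NaiveSuffixTrie` is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Naive suffix trie: inserts every suffix one character at a time. */
final class NaiveSuffixTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
    }

    private final Node root = new Node();

    NaiveSuffixTrie(String text) {
        for (int start = 0; start < text.length(); start++) {
            Node node = root;
            for (int i = start; i < text.length(); i++) {
                node = node.children.computeIfAbsent(text.charAt(i), c -> new Node());
            }
        }
    }

    /** Returns true if the pattern occurs anywhere in the indexed text. */
    boolean contains(String pattern) {
        Node node = root;
        for (int i = 0; i < pattern.length(); i++) {
            node = node.children.get(pattern.charAt(i));
            if (node == null) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        NaiveSuffixTrie trie = new NaiveSuffixTrie("banana");
        System.out.println(trie.contains("nan")); // substring of "banana"
        System.out.println(trie.contains("nab")); // not a substring
    }
}
```

Once the trie is built, a substring query takes time proportional to the pattern length, which is the main reason to build the structure at all.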

Related

If you have a dictionary of strings, what's the fastest way to search a file and increment the number of times the strings appear?

Let's say you have a dictionary with 5 strings in it, and you also have multiple files. I want to iterate through those files and see how many times the strings in my dictionary appear in them. How can I do this most efficiently?
I would like this to scale as well, so more than 5 strings and more than a few documents. I'm pretty open about what language I use. Preferably Java or C#, but once again, I can work in another language.

Most efficient is always a trade-off between the time you want to put into it and the results you want (or need).
One easy and reasonably efficient approach is to use a regular expression. With five strings this will probably be good enough. If it isn't, you can certainly find a better approach.
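A minimal Java sketch of the regular expression approach, assuming the file contents have already been read into a `String`. The class and method names are hypothetical. Note two limitations of a plain alternation: matches do not overlap, and if one dictionary string is a prefix of another, the one listed first wins.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class RegexCounter {
    /** Counts non-overlapping occurrences of each dictionary word in the text. */
    static Map<String, Integer> count(List<String> words, String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String w : words) counts.put(w, 0);
        if (words.isEmpty()) return counts;

        // Quote each word so regex metacharacters are matched literally.
        String alternation = words.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        Matcher m = Pattern.compile(alternation).matcher(text);
        while (m.find()) {
            counts.merge(m.group(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count(List.of("cat", "dog"), "cat dog cat bird"));
    }
}
```

Compiling the pattern once and reusing it across all files avoids repeating the setup cost per document.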

This is a pattern matching problem. The best algorithm for this kind of problem is the Knuth-Morris-Pratt algorithm. It is a famous algorithm, so you will find descriptions of it almost anywhere; one is in the book Introduction to Algorithms.
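A compact Java sketch of Knuth-Morris-Pratt, written here to count occurrences of a single pattern. The class and method names are made up for illustration. KMP handles one pattern at a time, so for a dictionary you would call it once per string.

```java
final class Kmp {
    /** failure[i] = length of the longest proper prefix of p[0..i] that is also a suffix of it. */
    static int[] failure(String p) {
        int[] f = new int[p.length()];
        int k = 0;
        for (int i = 1; i < p.length(); i++) {
            while (k > 0 && p.charAt(i) != p.charAt(k)) k = f[k - 1];
            if (p.charAt(i) == p.charAt(k)) k++;
            f[i] = k;
        }
        return f;
    }

    /** Counts occurrences of pattern in text, including overlapping ones, in O(n + m) time. */
    static int countOccurrences(String text, String pattern) {
        if (pattern.isEmpty()) return 0;
        int[] f = failure(pattern);
        int count = 0;
        int k = 0; // number of pattern characters currently matched
        for (int i = 0; i < text.length(); i++) {
            while (k > 0 && text.charAt(i) != pattern.charAt(k)) k = f[k - 1];
            if (text.charAt(i) == pattern.charAt(k)) k++;
            if (k == pattern.length()) {
                count++;
                k = f[k - 1]; // keep going so overlapping matches are found
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countOccurrences("abababa", "aba"));
    }
}
```

The text index `i` never moves backwards, which is what gives the linear running time.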

How to measure C++ or Java file complexity?

I want to start measuring what Michael Feathers has referred to as the turbulence of code, namely churn vs. complexity.
To do this, I need to measure the complexity of a C++ or Java file. So I found a couple tools that measure cyclomatic complexity (CC). They each measure CC well at the function or method level. However, I need a metric at the file level, and they don't do so well there. One tool just returns the average of all method complexities in the file, and the other tool treats the whole file like it is one giant method, i.e., it counts all the decision points in the whole file.
So I did some research and found that McCabe defines CC only in terms of modules--and they define a module as a function--not as a file (see slides 20 and 30 of this presentation). And I think that makes sense.
So now I'm left with trying to figure out how to represent file complexity. My thought is that I should just use the maximum method CC for that file.
Any thoughts about that approach or any other suggestions?
Thanks!
Ken

A few years ago I had the same question. I answered it in the following way, and it has worked well for me ever since:
The purpose of minimizing complexity is to improve maintainability. Cyclomatic complexity is an indicator of logical complexity, and you are right: it applies to the smallest 'unit', i.e. a function. It is possible to derive 'summary' metrics, like total/max/min/etc., but they rarely show anything useful when it comes to cyclomatic complexity. I tried to use 'summary' metrics to compare 2 code bases, but concluded that only distribution graphs of cyclomatic complexity are really useful here.
So, what could be used to indicate something about the maintainability level of bigger units/levels of abstraction, like files/components/subsystems? I found that the first metric is the size of a unit in lines of code. If you limit the size of a file, say to 1000 lines, and limit the cyclomatic complexity of each function in the file, you will have a relatively "simple" file, because it is "small" and contains only "simple" functions. You may include or exclude comment/blank lines, or count only statements or only executable lines...
However, I concluded that it does not really matter in this particular application. Just limit some 'size' metric and it will serve the purpose in most cases. Later you may think about limiting the total number of lines of code per component/subsystem. It will have the same effect: a component is "simple" because it contains a "small" number of "simple" files.
The post you referred to is very good. It can be extended to a broader metric, which is usually called a 'maintainability index'. The index is very high if a function is complex, the file is big and changes frequently, has little test coverage, and so on (add here whatever you think defines maintainability). It is the best way I know to find hot-spots for refactoring...
Disclaimer: I maintain the Metrix++ tool, which implements the use case I explained above.

Existing Algorithm for Scheduling Problems?

Let's say I want to build a function that would properly schedule three bus drivers to drive in a week with the following constraints:
Each driver must not drive more than five times per week
There must be two drivers driving every day
They will rest one day each week (will not clash with other drivers' rest day)
What kind of algorithm would be used to solve a problem like this?
I looked through several sites and I found these:
1) Backtracking algorithm (brute force)
2) Genetic algorithm
3) Constraint programming
Frankly, these are all "culture shock" for me as I have never learnt any kind of linear programming in the past. There are three things I want to know:
1) Which algorithm will best suit the case scenario above?
2) What would be the simplest algorithm to solve this problem?
3) Please suggest any other algorithms I can look into to solve the above problem.

1) I agree brute force is bad.
2) Your problem is an integer programming problem. Such problems can still be solved with linear programming techniques, though.
3) You can distinguish two different approaches: heuristics and exact approaches.
Heuristics provide good solutions in reasonable computation time. They are used when there are strict requirements on the computation time or when the problem is too hard to compute an optimal solution for. Genetic algorithms are a heuristic.
As your problem is comparatively simple, you would probably go with an exact approach.
4) The standard way to solve this exactly is to embed a linear program in a branch & bound search tree. There is lots of literature on it. The procedure can be outlined as follows:
  1. Solve the linear program with the simplex algorithm
  2. Find a fractional variable for branching, e.g. x = 1.5
  3. Create two new nodes and add the constraints x <= 1 and x >= 2 respectively
  4. Go into one node (selected by some strategy)
  5. Go to point 1
Additionally, at every node in the tree, after point 1, the algorithm checks whether the node can be pruned. That means it stops searching 'deeper' from this node on, because
a) the problem has become infeasible,
b) a better solution already exists,
c) an integer solution is found. The objective value of this solution is used to determine point b.
The procedure finishes when all nodes are pruned.
Luckily, as Nicolas stated, there are free implementations that do just this. All you have to do is create your model: code its objective and constraints in some tool and let it solve.

First of all, this is a discrete optimization problem, so linear programming is probably not a good idea (since it is meant for continuous optimization). You can still solve this using linear programming (it will become an integer or mixed-integer program), but that is exponentially hard (if your input size is small then it is OK).
Now back to the comparison:
Brute force : worst.
Genetic: Cannot guarantee optimality. The algorithm may not be able to solve the problem.
Constraint programming: definitely the best in this case (and in many discrete optimization problems). There is a super efficient implementation of it in the IBM ILOG CPLEX solver (it is not free, though it is free for academia or for testing).
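To make the constraints concrete, here is a small Java sketch that searches for a feasible schedule. It is plain backtracking with a simple pruning check, not a constraint programming solver and not branch & bound, and it is only practical because the instance is tiny. It rests on one reading of the question, which is an assumption: since two of the three drivers drive every day, exactly one driver rests each day, so rest days never clash. All names are hypothetical.

```java
import java.util.Arrays;

final class DriverSchedule {
    static final int DRIVERS = 3;
    static final int DAYS = 7;
    static final int MAX_SHIFTS = 5;
    // Driving at most MAX_SHIFTS days means resting at least this many days (and at least one).
    static final int MIN_REST = Math.max(1, DAYS - MAX_SHIFTS);

    /** Returns rest[day] = index of the driver resting that day, or null if no schedule exists. */
    static int[] solve() {
        int[] rest = new int[DAYS];
        return assign(0, rest, new int[DRIVERS]) ? rest : null;
    }

    private static boolean assign(int day, int[] rest, int[] restDays) {
        // Prune: the remaining days must be able to cover the rest days still owed.
        int stillNeeded = 0;
        for (int r : restDays) stillNeeded += Math.max(0, MIN_REST - r);
        if (stillNeeded > DAYS - day) return false;
        if (day == DAYS) return true;

        for (int d = 0; d < DRIVERS; d++) {
            rest[day] = d;      // driver d rests, the other two drive
            restDays[d]++;
            if (assign(day + 1, rest, restDays)) return true;
            restDays[d]--;      // undo and try the next driver
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(solve()));
    }
}
```

With larger instances or an objective to optimize, the solver-based approaches described in the answers are the more realistic route.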

Lightweight library capable of suggesting different spellings of words from a bounded set?

I was looking for a lightweight library that'd allow me to feed it a bunch of words, and then ask it whether a given word would have any close matches.
I'm not particularly concerned with the underlying algorithm (I reckon a simple hamming distance algorithm would probably suffice, were I to undertake the task myself).
I'm developing a small language and I thought it would be nifty to make suggestions to the user when an "Undefined class" error is detected (lots of times it's just a misspelled word). I don't want to lose much time on the issue though.
Thanks

Levenshtein distance is a common way of handling it. Just add all the words to a list and then brute-force iterate over it and return the word with the smallest distance. Here's one library with a Levenshtein function: http://commons.apache.org/lang/api-2.4/org/apache/commons/lang/StringUtils.html
If you have a large number of words and you want it to run fast, then you'd have to use n-grams. Split each word into bigrams and then add (bigram, word) to a map. Use the map to look up the bigrams in the target word, and then iterate through the candidates. That's probably more work than you want to do, though.
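If you'd rather not pull in a library, the brute-force variant is short enough to write by hand. Below is a minimal Java sketch with a hand-rolled Levenshtein distance and a linear scan over the known words. The names `Suggester`, `closest` and the `maxDistance` cutoff are made up for illustration and are not part of any library.

```java
import java.util.List;

final class Suggester {
    /** Classic dynamic-programming Levenshtein distance using two rolling rows. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns the known word closest to the input, or null if none is within maxDistance. */
    static String closest(String word, List<String> known, int maxDistance) {
        String best = null;
        int bestDistance = maxDistance + 1;
        for (String candidate : known) {
            int d = levenshtein(word, candidate);
            if (d < bestDistance) {
                bestDistance = d;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(closest("Strng", List.of("String", "Integer"), 2));
    }
}
```

The cutoff keeps the suggester from proposing something unrelated when nothing in the set is actually close.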

Not necessarily a library, but I think this article may be really helpful. It mostly describes how a spelling corrector works, using Python, but it also has a link to a Java implementation which you may use if that is what you are looking for specifically (note that I haven't used the Java one myself).

Is there a library similar to PyCogent, but in Java (or Scala)?

I'm writing a biological evolution simulator. Currently, all of my code is written in Python. For the most part, this is great and everything works sufficiently well. However, there are two steps in the process which take a long time and which I'd like to rewrite in Scala.
The first problem area is sequence evolution. Imagine you're given a phylogenetic tree which relates a large set of proteins. The length of each branch represents the evolutionary distance between the parent and child. The root of the tree is seeded with a single sequence, and then an evolutionary model (e.g. http://en.wikipedia.org/wiki/Models_of_DNA_evolution) is used to evolve the sequence along the tree structure, taking into account the branch lengths. PyCogent takes a long time to perform this step, and I believe that a reasonable Java/Scala implementation would be significantly faster. Do you know of any libraries that implement this type of functionality? I want to write the application in Scala, so, due to interoperability, any Java library will suffice.
The second problem area is the comparison of the generated sequences. The problem is, given a set of sequences for the proteins in a number of different extant species, attempt to use the sequence to reconstruct the phylogenetic tree which relates the species. This problem is inherently computationally demanding, because one must basically do a pairwise comparison between all sequences in the extant species. Here again, however, I feel like a Java/Scala implementation would perform significantly faster than a Python one, if for nothing else than the unfortunately slow speed of looping in Python. This part I could write from scratch more easily than the sequence evolution part, but I'd be willing to use a library for it as well if a good one exists.
Thanks,
Rob

For the second problem, why not make use of an existing program for comparing sequences and inferring phylogenetic trees, like RAxML or MrBayes, and call that? Maximum likelihood and Bayesian inference are very sophisticated models for these problems, and using them seems a far better idea than implementing it yourself. Something like a maximum parsimony or a neighbour-joining tree, which probably could be written from scratch for such a project, is not sufficient for evolutionary analysis. Unless you just want a very quick and dirty topology (and trees inferred via MP or NJ are really often quite false), where you can probably use something like this
