I want to start measuring what Michael Feathers has referred to as the turbulence of code, namely churn vs. complexity.
To do this, I need to measure the complexity of a C++ or Java file. So I found a couple tools that measure cyclomatic complexity (CC). They each measure CC well at the function or method level. However, I need a metric at the file level, and they don't do so well there. One tool just returns the average of all method complexities in the file, and the other tool treats the whole file like it is one giant method, i.e., it counts all the decision points in the whole file.
So I did some research and found that McCabe defines CC only in terms of modules, and he defines a module as a function, not a file (see slides 20 and 30 of this presentation). And I think that makes sense.
So now I'm left with trying to figure out how to represent file complexity. My thought is that I should just use the maximum method CC for that file.
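For concreteness, here is a minimal sketch of that idea in Java, assuming the per-method CC values have already been extracted by one of the tools (the values below are made up):

    import java.util.Collections;
    import java.util.List;

    public class FileComplexity {
        // File-level complexity as the maximum cyclomatic complexity of any
        // method defined in the file (0 for a file with no methods).
        static int fileComplexity(List<Integer> methodComplexities) {
            return methodComplexities.isEmpty() ? 0 : Collections.max(methodComplexities);
        }

        public static void main(String[] args) {
            // Hypothetical per-method CC values reported by a tool for one file.
            List<Integer> ccPerMethod = List.of(3, 11, 2, 7);
            System.out.println("file CC (max of methods) = " + fileComplexity(ccPerMethod));
        }
    }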
Any thoughts about that approach or any other suggestions?
Thanks!
Ken
A few years ago I had the same question. I answered it in the following way, and it has worked well for me ever since:
The purpose of minimizing complexity is to improve maintainability. Cyclomatic complexity is an indicator of logical complexity, and you are right: it applies to the smallest 'unit', i.e. a function. It is possible to derive 'summary' metrics, like total/max/min/etc., but they rarely show anything useful when it comes to cyclomatic complexity. I tried to use 'summary' metrics to compare two code bases, but concluded that only distribution graphs of cyclomatic complexity are really useful there.
So, what could indicate something about the maintainability of bigger units/levels of abstraction, like files/components/subsystems? I found that the first such metric is the size of a unit in lines of code. If you limit the size of a file, say to 1000 lines, and limit the cyclomatic complexity of each function in the file, you will have a relatively "simple" file, because it is "small" and contains only "simple" functions. You may include or exclude comment/blank lines, or count only statements or only executable lines...
However, I concluded that it does not really matter in this particular application. Just limit some 'size' metric and it will serve the purpose in most cases. Later you may think about limiting the total number of lines of code per component/subsystem. It will have the same effect: a component is "simple" because it contains a "small" number of "simple" files.
The post you referred to is very good. It can be extended to a broader metric, usually called a 'maintainability index'. The index is very high for a function that is complex, sits in a big file with frequent changes, has little test coverage, and so on (add whatever else you think defines maintainability). It is the best way I know to find hot-spots for refactoring...
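As a rough illustration of the idea only (the formula and thresholds below are made up for this answer, not what any particular tool uses), a hot-spot score can multiply normalized factors, so a file ranks high only when several of them are bad at the same time:

    public class HotSpotScore {
        // Illustrative hot-spot score: the higher, the more urgent the refactoring.
        // All thresholds (15, 1000, 20) are arbitrary example limits.
        static double score(int maxMethodCc, int linesOfCode, int commitsLastYear, double testCoverage) {
            double complexityFactor = maxMethodCc / 15.0;     // vs. a per-method CC limit of 15
            double sizeFactor       = linesOfCode / 1000.0;   // vs. a file size limit of 1000 lines
            double churnFactor      = commitsLastYear / 20.0; // vs. roughly 20 commits per year
            double coveragePenalty  = 1.0 - testCoverage;     // coverage given in [0, 1]
            return complexityFactor * sizeFactor * churnFactor * (1.0 + coveragePenalty);
        }

        public static void main(String[] args) {
            // A big, complex, frequently changed, poorly tested file scores far higher
            // than an equally big but stable and well-tested one.
            System.out.println(score(25, 1800, 40, 0.10));
            System.out.println(score(25, 1800, 2, 0.90));
        }
    }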
Disclaimer: I maintain the Metrix++ tool, which implements the use-case scenario I explained above.
I have a few questions about my genetic algorithm and GAs overall.
I have created a GA that, when given points on a curve, tries to figure out what function produced that curve.
An example is the following
Points
{{-2, 4},{-1, 1},{0, 0},{1, 1},{2, 4}}
Function
x^2
Sometimes I give it points for which it never produces a function, and sometimes points for which it only occasionally produces one. It can even depend on how deep the initial trees are.
Some questions:
Why does the tree depth matter in trying to evaluate the points and
produce a satisfactory function?
Why do I sometimes get premature convergence, where the GA never breaks out of the cycle?
What can I do to prevent a premature convergence?
What about annealing? How can I use it?
Can you take a quick look at my code and tell me if anything is obviously wrong with it? (This is test code, I need to do some code clean up.)
https://github.com/kevkid/GeneticAlgorithmTest
Source: http://www.gp-field-guide.org.uk/
EDIT:
Looks like Thomas's suggestions worked well: I get very fast results and less premature convergence. I feel like increasing the gene pool gives better results, but I am not exactly sure whether it is actually getting better with every generation or whether the randomness simply allows it to find a correct solution.
EDIT 2:
Following Thomas's suggestions I was able to get it to work properly; it seems I had an issue with selecting survivors and with expanding my gene pool. Also, I recently added constants to my GA test if anyone else wants to look at it.
In order to avoid premature convergence you can also use multiple subpopulations. Each subpopulation evolves independently, and at the end of each generation you exchange some individuals between subpopulations.
I did an implementation with multiple subpopulations for a Genetic Programming variant: http://www.mepx.org/source_code.html
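A minimal sketch of the migration step (the population representation, the ring topology and the number of migrants are placeholders; plug in whatever individual type your GA evolves):

    import java.util.List;
    import java.util.Random;

    // Illustrative island-model migration; Individual is whatever your GA evolves.
    class Migration<Individual> {
        private final Random rng = new Random();

        // After each generation, move a few random individuals from each island
        // to the next one (ring topology).
        void migrate(List<List<Individual>> islands, int migrantsPerIsland) {
            for (int i = 0; i < islands.size(); i++) {
                List<Individual> source = islands.get(i);
                List<Individual> target = islands.get((i + 1) % islands.size());
                for (int m = 0; m < migrantsPerIsland && !source.isEmpty(); m++) {
                    target.add(source.remove(rng.nextInt(source.size())));
                }
            }
        }
    }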
I don't have the time to dig into your code but I'll try to answer from what I remember on GAs:
Sometimes I give it points for which it never produces a function, and sometimes points for which it only occasionally produces one. It can even depend on how deep the initial trees are.
I'm not sure what the question is here, but if you need a result you could select the function that has the least distance to the given points (the distance could be a sum, a mean, the number of matched points, etc., depending on your needs).
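For example, a fitness function could sum the squared error between a candidate's output and the target points (CandidateFunction here is just a stand-in for however your trees are evaluated):

    // Stand-in for an evolved expression tree: evaluates y for a given x.
    interface CandidateFunction {
        double evaluate(double x);
    }

    class Fitness {
        // Lower is better: sum of squared distances to the target points.
        static double sumSquaredError(CandidateFunction candidate, double[][] points) {
            double error = 0.0;
            for (double[] point : points) {
                double diff = candidate.evaluate(point[0]) - point[1];
                error += diff * diff;
            }
            return error;
        }

        public static void main(String[] args) {
            double[][] points = {{-2, 4}, {-1, 1}, {0, 0}, {1, 1}, {2, 4}};
            // A candidate that happens to be x^2 has zero error on these points.
            System.out.println(sumSquaredError(x -> x * x, points));
        }
    }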
Why does the tree depth matter in trying to evaluate the points and produce a satisfactory function?
I'm not sure what tree depth you mean but it could affect two things:
accuracy: the higher the depth, the more accurate the solution might be, and the more possibilities for mutation there are
performance: depending on what tree you mean a higher depth might increase performance (allowing for more educated guesses on the function) or decrease it (requiring more solutions to be generated and compared).
Why do I sometimes get premature convergence, where the GA never breaks out of the cycle?
That might be due to too few mutations. If you have a set of solutions that all converge around a local optimum, slight mutations might not move the resulting solutions far enough away from that local optimum to break out.
What can I do to prevent a premature convergence?
You could allow for bigger mutations, e.g. when solutions start to converge. Alternatively you could throw entirely new solutions into the mix (think of it as "immigration").
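A sketch of the "immigration" idea: once convergence is detected, replace the worst part of the population with freshly generated random individuals (the comparator and the factory are placeholders for your own fitness and construction code):

    import java.util.Comparator;
    import java.util.List;
    import java.util.function.Supplier;

    class Immigration<Individual> {
        // Replace the worst `count` individuals with brand-new random ones.
        // `byFitnessBestFirst` orders the population best-to-worst;
        // `randomIndividual` creates a fresh, unrelated solution.
        void inject(List<Individual> population, int count,
                    Comparator<Individual> byFitnessBestFirst,
                    Supplier<Individual> randomIndividual) {
            population.sort(byFitnessBestFirst);
            for (int i = 0; i < count && i < population.size(); i++) {
                population.set(population.size() - 1 - i, randomIndividual.get());
            }
        }
    }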
What about annealing? How can I use it?
Annealing could be used to gradually improve your solutions once they start to converge on a certain point/optimum, i.e. you'd improve the solutions in a more controlled way than with "random" mutations.
You can also use it to break out of a local optimum depending on how those are distributed. As an example, you could use your GA until solutions start to converge, then use annealing and/or larger mutations and/or completely new solutions (you could generate several sets of solutions with different approaches and compare them at the end), create your new population and if the convergence is broken, start a new iteration with the GA. If the solutions still converge at the same optimum then you could stop since no bigger improvement is to be expected.
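One simple way to apply the annealing idea is to tie the mutation magnitude, and the probability of accepting a worse offspring, to a temperature that decays each generation. A rough sketch (the cooling rate and acceptance rule are the standard simulated-annealing ones; how to wire this into your GA is up to you):

    import java.util.Random;

    class AnnealingSchedule {
        private double temperature;
        private final double coolingRate;   // e.g. 0.95 per generation
        private final Random rng = new Random();

        AnnealingSchedule(double initialTemperature, double coolingRate) {
            this.temperature = initialTemperature;
            this.coolingRate = coolingRate;
        }

        // Mutation step size shrinks as the temperature drops.
        double mutationMagnitude() {
            return temperature * rng.nextGaussian();
        }

        // Classic acceptance rule: worse solutions are accepted less and less often.
        boolean accept(double currentError, double candidateError) {
            if (candidateError <= currentError) return true;
            return rng.nextDouble() < Math.exp((currentError - candidateError) / temperature);
        }

        void cool() {
            temperature *= coolingRate;
        }
    }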
Besides all that, heuristic algorithms may still hit a local optimum but that's the tradeoff they provide: performance vs. accuracy.
There are many tools for code quality. But sometimes I need to gain performance even when the code does not conform to code-quality rules. Is there an open source tool for that?
Thanks.
There's no tool for exactly that, but you can try out jVisualVM.
http://download.oracle.com/javase/6/docs/technotes/tools/share/jvisualvm.html
It usually comes with your JDK, e.g. # C:\Program Files\Java\jdk1.6.0_21\bin
No tool is going to tell you performance and quality. Both are hard to measure.
You can certainly use something like FindBugs or IntelliJ's Inspector to examine your code, but they'll just look for rule violations. I'm not aware of a tool that will point out when I've written code that performs badly. How will a Java code inspector know that your database has no indexes?
I can't answer you regarding code quality; others can. But when you need to gain performance, I would rather tell you how to do it than tell you what tools to use.
There are tools, but more important than tools is understanding what you're doing.
The most important is to understand that measuring doesn't tell you what to fix to get higher performance; it only tells you how much improvement you got.
The way to improve performance is to find activities, whatever they are, that account for a significant fraction of time and can be improved.
Measuring is not finding.
Example:
I can manually sample the state of a program, several times, and see it much of the time doing container class manipulations, like fetching elements, testing for end conditions, etc.
(That's the finding part.)
This can be happening in many different places in the code, so no particular routine appears to be causing a large fraction of time to be spent.
There is no particular hotspot or obvious bottleneck.
There is no "bad algorithm" or "slow routine", the kinds of thing people say they look for.
Nevertheless, I can see in those few samples that it is doing container class operations, and I can see exactly where.
If I can replace those container class operations with something else that accomplishes the same purpose, I can save time.
How much time? Up to roughly the fraction of time I saw those operations happening, and that can be quite large.
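In Java, one crude way to take such samples from inside the process is to dump all thread stacks every so often and eyeball which calls keep showing up (this is only an illustration of the idea; pausing the program under a debugger or with jstack works just as well):

    import java.util.Map;

    // Crude in-process sampler: run this in a background thread of the program you are studying.
    public class StackSampler implements Runnable {
        @Override
        public void run() {
            try {
                for (int sample = 0; sample < 10; sample++) {
                    Thread.sleep(1000);   // take a sample roughly once a second
                    System.out.println("=== sample " + sample + " ===");
                    for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
                        System.out.println(e.getKey().getName());
                        for (StackTraceElement frame : e.getValue()) {
                            System.out.println("    at " + frame);
                        }
                    }
                }
            } catch (InterruptedException ignored) {
                Thread.currentThread().interrupt();
            }
        }
    }

If the same container-class calls appear in most of the samples, that fraction of samples is roughly the fraction of time you could save by replacing them.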
The real payoff for doing this is there can be multiple issues.
Suppose issue A costs 40% of the time, B costs 20%, and C costs 10%,
and the total time is, say, 10 seconds.
You go after A, the most obvious one.
Fixing it reduces time to about 6 seconds. (Speedup 10/6 = 1.67).
Then problem B takes a larger percent of time (2/6 = .33) so it is easier to find with samples.
Fixing it reduces time to 4 seconds (Speedup 6/4 = 1.5)
Then C is (1/4 = 25%) and is much easier to find than before.
Removing it reduces time to 3 seconds (Speedup 4/3 = 1.33).
The total speedup factor is 10/3 = 3.33.
You can look at it as the compounded product of each speedup: 10/6 * 6/4 * 4/3 = 10/3.
Now I'm dealing in numbers here, but none of these had to be measurements of time spent in localized pieces of code.
They were just rough estimates gotten from describing what was happening in a small number of detailed samples of what the program was doing.
The samples aren't really concerned with measuring.
They are concerned with exposing the problems.
I want to ask a complex question.
I have to code a heuristic for my thesis. I need the following:
Evaluate some integral functions
Minimize functions over an interval
Do this thousands and thousands of times.
So I need a fast programming language to do these jobs. Which language do you suggest? I started with Java first, but evaluating the integrals became a problem. And I'm not sure about its speed.
Connecting Java to other software like MATLAB may be a good idea. Since I'm not sure, I'd like to hear your opinions.
Thanks!
C, Java, ... are all Turing-complete languages. They can compute the same functions with the same precision.
If you want to achieve performance goals, use C: it is a compiled, high-performance language, and it can decrease your computation time by avoiding the method calls and high-level features of a managed language like Java.
Anyway, remember that your implementation may impact performance more than the language you choose, because as the input size grows it is the computational complexity that dominates ( http://en.wikipedia.org/wiki/Computational_complexity_theory ).
It's not the programming language; it's probably your algorithm. Determine the big-O complexity of your algorithm. If you use nested loops where you could use a hash lookup in a Map instead, your algorithm can be made many times faster.
Note: modern JVMs (JDK 1.5 or 1.6) compile just-in-time to native code (as in, not interpreted) for a specific OS, OS version and hardware architecture. You can try the -server flag to JIT even more aggressively (at the cost of an even longer initialization time).
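To make the "hash instead of nested loops" point concrete, here is a toy comparison (the method names are made up): counting how many values of one list also appear in another.

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class LookupExample {
        // O(n * m): scans the whole second list for every element of the first.
        static int countCommonNaive(List<Integer> a, List<Integer> b) {
            int common = 0;
            for (int x : a) {
                if (b.contains(x)) {        // linear scan hidden inside contains()
                    common++;
                }
            }
            return common;
        }

        // O(n + m): one pass to build the hash set, one pass to probe it.
        static int countCommonHashed(List<Integer> a, List<Integer> b) {
            Set<Integer> lookup = new HashSet<>(b);
            int common = 0;
            for (int x : a) {
                if (lookup.contains(x)) {   // constant-time probe on average
                    common++;
                }
            }
            return common;
        }
    }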
Do this thousands and thousands of times.
Are you sure it's not more, something like 10^1000 instead? Try to calculate accurately how many times you need to run that loop; it might surprise you. The types of problems on which heuristics are used tend to have really big search spaces.
Before you start switching languages, I'd first try to do the following things:
Find the best available algorithms.
Find available implementations of those algorithms usable from your language.
There are, e.g., scientific libraries for Java; try to use these libraries (for the two core operations, see also the plain-Java sketch after this list).
If they are not fast enough, investigate whether there is anything to be done about it. Is your problem more specific than what the library assumes? Can you improve the algorithm based on that knowledge?
What is it that takes so much time/memory? Is it really related to your language? Make sure you are not measuring JVM start-up time instead of the time spent on the actual calculation.
Then, I'd consider switching languages. But don't expect it to be easy to beat optimized third-party Java libraries in C.
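To show that plain Java handles both tasks from the question without switching languages, here is a small self-contained sketch: composite Simpson's rule for the integrals and a golden-section search for the 1-D minimization (the tolerances and the example function are arbitrary; a library like the ones mentioned above would normally do this for you):

    import java.util.function.DoubleUnaryOperator;

    public class NumericsSketch {
        // Composite Simpson's rule on [a, b] with n (even) subintervals.
        static double integrate(DoubleUnaryOperator f, double a, double b, int n) {
            double h = (b - a) / n;
            double sum = f.applyAsDouble(a) + f.applyAsDouble(b);
            for (int i = 1; i < n; i++) {
                sum += f.applyAsDouble(a + i * h) * (i % 2 == 0 ? 2 : 4);
            }
            return sum * h / 3.0;
        }

        // Golden-section search for a minimum of a unimodal function on [a, b].
        static double minimize(DoubleUnaryOperator f, double a, double b, double tol) {
            double phi = (Math.sqrt(5) - 1) / 2;   // about 0.618
            double c = b - phi * (b - a);
            double d = a + phi * (b - a);
            while (b - a > tol) {
                if (f.applyAsDouble(c) < f.applyAsDouble(d)) {
                    b = d;
                } else {
                    a = c;
                }
                c = b - phi * (b - a);
                d = a + phi * (b - a);
            }
            return (a + b) / 2;
        }

        public static void main(String[] args) {
            DoubleUnaryOperator f = x -> (x - 1.5) * (x - 1.5);   // example objective
            System.out.println(integrate(f, 0, 3, 1000));         // ~2.25
            System.out.println(minimize(f, 0, 3, 1e-9));          // ~1.5
        }
    }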
Order of the algorithm
Typically, switching languages only reduces the required time by a constant factor. Say you can double the speed by using C; but if your algorithm is O(n^2), it will still take four times as long when you double the data, no matter the language.
And the JVM can optimize a lot of things getting good results.
Some possible optimizations in Java
If you have functions that are called many times, make them final, and do the same for entire classes. The compiler will then know it can inline the method code, avoiding the creation of method-call stack frames for those calls.
For a project at university, we had to implement a few different algorithms to calculate the equivalence classes when given a set of elements and a collection of relations between said elements.
We were instructed to implement, among others, the Union-Find algorithm and its optimizations (union by depth, union by size). By accident (doing something I thought was necessary for the correctness of the algorithm) I discovered another way to optimize the algorithm.
It isn't as fast as Union By Depth, but close. I couldn't for the life of me figure out why it was as fast as it was, so I consulted one of the teaching assistants who couldn't figure it out either.
The project was in Java, and the data structures I used were based on simple arrays of Integers (the object, not the primitive int).
Later, at the project's evaluation, I was told that it probably had something to do with 'Java caching', but I can't find anything online about how caching would affect this.
What would be the best way, without calculating the complexity of the algorithm, to prove or disprove that my optimization is this fast because of Java's way of doing things? Implementing it in another (lower-level?) language? But who's to say that language won't do the same thing?
I hope I made myself clear,
thanks
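For readers who want a concrete picture of the structure being discussed, a standard weighted quick-union with path compression looks roughly like this (a generic textbook baseline, not the asker's implementation):

    // Textbook Union-Find: union by size plus path compression.
    public class UnionFind {
        private final int[] parent;
        private final int[] size;

        public UnionFind(int n) {
            parent = new int[n];
            size = new int[n];
            for (int i = 0; i < n; i++) {
                parent[i] = i;
                size[i] = 1;
            }
        }

        // Find the representative of x's class, flattening the path as we go.
        public int find(int x) {
            while (parent[x] != x) {
                parent[x] = parent[parent[x]];   // path halving
                x = parent[x];
            }
            return x;
        }

        // Merge the classes of a and b, attaching the smaller tree under the larger one.
        public void union(int a, int b) {
            int rootA = find(a), rootB = find(b);
            if (rootA == rootB) return;
            if (size[rootA] < size[rootB]) { int tmp = rootA; rootA = rootB; rootB = tmp; }
            parent[rootB] = rootA;
            size[rootA] += size[rootB];
        }
    }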
The only way is to prove the worst-case (average case, etc) complexity of the algorithm.
Because if you don't, it might just be a consequence of a combination of
The particular data
The size of the data
Some aspect of the hardware
Some aspect of the language implementation
etc.
It is generally very difficult to perform such a task with modern VMs! As you hint, they do all sorts of things behind your back: method calls get inlined, objects are reused, and so on. A prime example is seeing how trivial loops get compiled away when they obviously do nothing but counting, or how a function call in functional programming gets inlined or tail-call optimized.
Further, you have the difficulty of proving your point in general, on any data set. An O(n^2) algorithm can easily be much faster than a seemingly faster, say O(n), algorithm. Two examples:
Bubble sort is faster at sorting a nearly-sorted data collection than quicksort.
Quicksort, in the general case, is of course faster.
Generally, big-O notation purposely ignores constants, which in a practical situation can mean life or death for your implementation, and those constants may be what hit you. So in practice 0.00001 * n^2 (say, the running time of your algorithm) can be faster than 1000000 * n log n for realistic input sizes.
So reasoning is hard given the limited information you provide.
It is likely that either the compiler or the JVM found an optimisation for your code. You could try reading the bytecode that is output by the javac compiler, and disabling runtime JIT compilation with the -Djava.compiler=NONE option.
If you have access to the source code -- and the JDK source code is available, I believe -- then you can trawl through it to find the relevant implementation details.
I'm looking for a short, simple suffix tree building/usage algorithm in Java. The best I've found so far lies within the Semantic Discovery Toolkit, but the implementation is several thousand lines long and spans several classes. Ideally, the implementation would be as short as possible and span no more than a few hundred lines.
Does anyone have such an implementation?
I just finished a Java implementation of a suffix tree. In my blog entry you can find out more about suffix trees, see how to use my library, as well as download and build the library using Subversion and Maven. Yes, it's longer than just a few lines in a single class file, but it is highly documented and is created for use in the real world for practical purposes. In addition, it uses the Ukkonen approach for linear time construction. (Most of the implementations noted here have at least O(n^2) running time.)
The article "Simple Linear Work Suffix Array Construction", by Karkkainen and Sanders, terminates with 50 lines of C++. You will probably also want something to produce the LCP array. Googling for "Computing the LCP array in linear time, given S and the suffix array POS." should find you that.
You can also take mine, but this is not Ukkonen's algorithm; like all the other simple approaches, it runs in quadratic time. I agree that a naive algorithm (which may work fine for shorter sequences) is easy to write in half a day at most.
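In that same "simple but quadratic" spirit, a naive suffix array (sort all the suffixes; walk neighbouring entries afterwards if you also need LCPs) fits in a few lines of Java. It is fine for short inputs, but nowhere near Ukkonen or Karkkainen-Sanders for long ones:

    import java.util.Arrays;
    import java.util.Comparator;

    public class NaiveSuffixArray {
        // Returns the starting positions of all suffixes of text, in lexicographic order.
        static Integer[] build(String text) {
            Integer[] suffixStarts = new Integer[text.length()];
            for (int i = 0; i < text.length(); i++) {
                suffixStarts[i] = i;
            }
            // Comparing substrings makes this O(n^2 log n) in the worst case:
            // acceptable for short strings only.
            Arrays.sort(suffixStarts, Comparator.comparing(text::substring));
            return suffixStarts;
        }

        public static void main(String[] args) {
            System.out.println(Arrays.toString(build("banana")));  // [5, 3, 1, 0, 4, 2]
        }
    }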