Code understanding, reverse engineering, best concepts and tools. Java

One of the most demanding tasks for any programmer or architect is understanding other people's code.
For example, I am a contractor hired to rescue a project very quickly: fix bugs and plan a global refactoring. I therefore need the most efficient way to understand the code. What are the key concepts, their priority, and the best tools for this?
What I know of so far: reverse engineering the code to create object models (creating a diagram per package is not so convenient) and creating sequence diagrams (the tool connects to the system in debug mode and generates diagrams at runtime). Also some visualization techniques, and tools that work not just with .java files but also with e.g. JPA implementations like Hibernate. Generating a diagram not for the whole codebase, but starting from one class and then adding the classes it uses.
Is Sparx Enterprise Architect state of the art in reverse engineering, or far from it? Are there better tools? Ideally the tool would make me understand the code as if I had written it myself :)

The book Object-Oriented Reengineering Patterns deals with this in detail. Unfortunately there is no silver bullet attached :-)
However, it lists a lot of useful techniques for taking over legacy code. In brief:
- interview at least some of the original developers (if they are still around) about
  - development history: phases, releases
  - current state of affairs
  - team social structure, politics, dynamics: when and why did people join and leave
  - bugs: typical, easiest, hardest
  - code quality: cleanest / ugliest parts
  - configuration data: form, content and usage
  - unit / integration / manual / ... test cases and data
  - SCM branch structure and usage
  - documentation: what is documented where, is it up to date
  - contact persons for external interfaces
- watch developers / users during a demo to find
  - main features
  - typical use cases
  - usage anecdotes
  - good / bad, missing / superfluous functionality
- "read all the code in one hour"
  - get a high level view of class hierarchies, interfaces
  - take multiple sessions if needed
  - identify large structures (these often contain important functionality)
  - look for design patterns
  - check comments (they can reveal a lot, but may also be misleading)
- skim documentation (if there is any)
  - just record the availability of specific types of docs, e.g. specification, UML diagram, Wiki, Javadoc etc.
  - is it useful and why (not)
  - is it up to date

By far the most important tools are your ears, your tongue and your larynx. Ask the people who are familiar with the code - they'll be able to help you understand its general architecture much better than any software tools.
Automatically reverse-engineered complete UML models are generally nearly useless because they cannot distinguish between important abstractions and implementation details - which is the whole point of such models.
Software tools are more useful for answering very specific questions when you are investigating details, such as "where is this method called from?" or "what classes implement this interface?" Any good IDE will be able to do that. Debuggers can help too: placing breakpoints at key points of the code and looking at the call stack when they are hit is often very enlightening.
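When attaching a debugger is not practical, the same call-stack information can be captured from inside the code. The following is only a minimal sketch, assuming Java 9 or later for `StackWalker`; the names `CallStackProbe`, `handleOrder` and `saveOrder` are hypothetical stand-ins for the code being investigated.

```java
import java.util.List;
import java.util.stream.Collectors;

class CallStackProbe {

    /** Returns the method names currently on the stack, innermost first, excluding this helper. */
    static List<String> callers() {
        return StackWalker.getInstance().walk(frames -> frames
                .skip(1) // skip the frame of callers() itself
                .map(StackWalker.StackFrame::getMethodName)
                .collect(Collectors.toList()));
    }

    // Hypothetical legacy method whose callers we want to discover.
    static List<String> saveOrder() {
        return callers();
    }

    // Hypothetical entry point that ends up calling saveOrder().
    static List<String> handleOrder() {
        return saveOrder();
    }

    public static void main(String[] args) {
        // Prints the chain that led to saveOrder(), innermost frame first.
        System.out.println(handleOrder());
    }
}
```

The list starts with `saveOrder` followed by `handleOrder`; whatever comes after that depends on how the program was launched. Logging such a list temporarily at a suspicious method gives roughly the same view as the call stack at a breakpoint.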

Just to elaborate on Michael's mention of good IDEs, which can help you:
I use the following Eclipse facilities a lot:
Shift-F2 when the cursor is placed in an identifier brings up the Javadoc for that identifier, if any. Good for navigating.
Hovering the mouse over an identifier brings up a box with the Javadoc in it, if any. Good for reminding when writing e.g. a method call.
The Declaration view shows the source where the identifier under the cursor is defined. It is updated whenever the cursor moves.
F3 goes to the definition of the current identifier.
Ctrl-T on an identifier shows all subclasses and implementations in a popup. Very useful when working on interfaces.
F4 on an identifier brings up the implementation hierarchy of that identifier in a panel, which can be navigated. Very useful to learn how things are connected. This includes both classes and interfaces.

EclipseUML Omondo is the best Java reverse engineering tool. It reverses all the Java code, all packages, and even class interactions with interfaces that are not in the same package. Just amazing.
You can also reverse:
- .class
- hibernate annotations
- JPA annotations
What I like about this tool is that my code stays clean, because all the model information is saved in XMI format and not as tags in my code. You can also create small pieces of documentation inside each existing package, using diagrams as views of the model. Just marvelous, and it respects the official UML 2.2 specification.
The only problem is that it is really too expensive, so the price is a showstopper for me!

Doesn't extract high level architectures, but does make it much easier to climb around your Java code: our Java Source Code Browser. This reads source code (and supporting class files) and produces Javadoc style documentation plus source text bi-directionally hyperlinked to the Javadoc information.
(I'm one of the principals behind it).

I use Enterprise Architect for whole UML (including reverse engineering with Java) and it works perfectly.

Related

How Do You Keep UML Diagrams Up To Date?

I am from a Physics background and not a Computer Science background and never did any course at University on class/component diagrams etc and I have never found the need to use them at work.
The main thing that I don't understand is how do you keep them up to date if the code is still being developed or maintained?
e.g. What's to stop me from refactoring several methods or classes and making the class diagram obsolete?
Do you have to constantly update the diagram manually?
I have seen tools that generate UML from the code and these could keep it up to date I suppose but from what I have seen, the auto-generated diagrams don't seem to be useful enough.
Is the UML for a project likely to be created at the start then be left in a documentation folder and gradually get more and more out of date?

I work for a moderately large government agency, so most of our major projects fall into the "Enterprise Java" category. This is what works for us:
Architects model any changes and extensions to our corporate data model using UML diagrams. Generally there will be a conceptual model class diagram, plus a few sequence diagrams that illustrate how the various parts of the system will interact, and maybe a couple of component diagrams.
We have a walkthrough with the business analysts, DBAs and lead developers. The idea of this is to challenge the new model and agree on changes (there is a lot of "robust" discussion at these sessions). With a good architect, the changes are minimal.
A senior developer creates a technical specification that will typically include a physical database ER diagram based on the architectural model. From the physical model, we automatically generate a database creation script.
The DBAs upgrade the creation script (e.g. Add tablespace and indexspace info) and create/extend the database.
The code gets written. Developers may create their own mini class hierarchies (e.g. POJOs to carry around data). We don't bother to model these in UML as the code should be self-documenting, and changes are inevitable as the code evolves.
Quite often changes will occur during the development phase, especially if using agile methodologies. If these impact on the corporate data model, then the UML and ER diagrams will be updated.
At the end of the project, the documentation is updated to reflect "as built" state.
Getting back to the gist of your question, I'm not a great believer in automated UML <-> code generation. Generally there is data that is specific to the UML diagram (notes, relationship cardinality, sequence diagrams etc.) that does not appear in the code or is very difficult to extract. Conversely, the code contains things (e.g. the working logic of methods, data structures and caches) that do not necessarily show in the logical UML model. Then there is the whole question of how you map the logical model class hierarchy to database tables...
To summarise, I recommend:
Get the design correct up front. Changes to the logical model are expensive and awkward to implement.
Use a modelling tool that will support all of the artifacts you need from the same data source. That is, the initial UML logical model, the database ER diagram and the database creation SQL DDL. We use Enterprise Architect, but there are lots of other tools that will do this.
Use UML to model the "big picture" and forget it for describing detailed coding. A good rule of thumb is you need UML if any change to the model affects more than just your team. (e.g. A new database field may require a change to the database, a change to a web service, a change to the GUI and a change to a mainframe batch process. UML has a place in defining the data change in a way multiple teams can understand)

Although this is not a question that is normally answered on SO, here's what I've seen and heard:
A project's SW development plan must define how design is being done, and if UML is used, how an update to the SW must be made. That plan can define that the UML is "one shot" - so it is indeed forgotten after the first design progresses into code. OTOH, a strict follow-up rule and ensuing checking may require and guarantee that the UML design is updated during bug fixes (if required) or more extensive changes. (More often than not, you may even have to go back to requirements and update there, too.)
A completely different approach is to generate code from UML - that way you never change the code. Whether this works or not, given the potential differences between UML's expressiveness and what and how a language like Java or C++ provide to implement the semantics of the various diagrams, is a question I'd dearly love to have answered on more reliable data than a salesman's pitch.

As for my experience, class UML diagrams are mostly useless. Generic code changes too often, thus having UML diagrams for it adds too much burden.
Possible exceptions are:
Architecture (component-level) diagrams. Created once, changes rarely, useful for others
Business model. If your application operates on complex model, it may be worth it to generate classes from UML representation. This UML can become quite valuable if you have many applications operating on the same model.
University projects - no comments :-)

It depends on what you choose, what you agree on with the team and stakeholders, what are your priorities, what are your processes and their deliverable artifacts and what are the costs and who will pay them.
As of today there are no production-ready tools to keep the UML documentation up to date completely automatically, although many come close (e.g. Graphviz + Doxygen to generate UML class diagrams) and many make the task easier (e.g. Sparx Systems' Enterprise Architect or Rapid Quality Systems' Code Rocket).
Like any other process, creating and maintaining UML documentation needs to be defined, implemented, managed and optimized (the same way you need to manage experiments in Physics, which you already know).
There is a whole website devoted to this topic at Agile Modeling - Effective Practices for Modeling and Documentation

What is the difference between Acceleo and Xpand?

I have a DSL which is based on a custom metamodel, which in its turn is based on EMF/Ecore. I am trying to figure out which solution to choose, and I cant find any decent comparisons anywhere.
Does anyone have any reasons why I should choose one over the other?
What I know so far is that Acceleo uses a OMG standardized language, but it seems harder to use than Xpand.

First of all, I wonder why you consider Acceleo more difficult to learn than Xpand. While the two languages have differences (blocks and delimiters, for example), they have quite a similar structure. I won't detail all the elements of both languages but, for example, I don't see much of a difference between something like:
«FOREACH myAttributes AS a»«a.name»«ENDFOREACH»
and
[for (a: Attribute|myAttributes)][a.name/][/for]
Both are template based languages and as such they have quite the same structure. The main difference between Acceleo and Xpand comes from the fact that Acceleo is based on the standards MOFM2T and OCL from the OMG and the tooling.
I am not very familiar with Xpand tooling but you can find more about it on their wiki. Acceleo on the other side contains an editor with syntax highlighting, code completion, error detection, refactoring and more. It also contains a debugger, a profiler, Ant and Maven support. You can also easily deploy your generators as Eclipse plugin for other users or use them out of Eclipse in a regular Java application. You can find more information on Acceleo here. You can see in videos most of the features of Acceleo on the Obeo Network (registration required).
Finally, the latest activity on Xpand occurred a year ago, while Acceleo is actively developed. You can even follow the Acceleo development on GitHub if you want.
Stephane Begaudeau
Disclaimer: I am one of the members of the Acceleo dev team.

I am a dabbler, not an expert.
My impression is that if you need little more than a templating language, then Xpand is the way to go. Otherwise, pick Acceleo - but as you say, the learning curve is very steep.
When do you need more than a templating language? For me, they seem to run out of gas when the structure (not content) of the output is dependent on multiple independent pieces of the input. If you don't want to get into Acceleo, but have one of these cases, consider inventing an auto-generated "shim" language that gets you partway from input language to output language, perhaps with a lot of redundancy in it to avoid lookups at template-generation time.

I've been using the old Acceleo 2.x on a full-scale project and have done some tests with the new one.
The language is pretty easy to use, but with the new version it's a little more difficult to bind Java code to your template when the script language is not enough.
I was a very big fan of 2.x, but with 3.x I had a lot of trouble making it work. You have to write Java code to handle Eclipse resources, for instance. I totally gave up when updating to Juno: my Acceleo projects didn't work anymore and I didn't manage to fix them in two days. I hope they will make it easier to use out of the box.

Basically the main difference is that Acceleo is an implementation of the MOF Model to Text Transformation Language, which is the OMG (Object Management Group) standard for defining model-to-text transformations. It is therefore a standard language designed by the same group that designed MOF, UML, SysML and MDA in general. Xpand is a language which I guess existed before the standard, but it now differs from it.
If you start from scratch then start with Acceleo.

In my case, I use a custom meta-model (derived from UML2) with custom stereotypes and stereotype properties. I tried both the Acceleo and Xpand template languages. Indeed, they are pretty similar in terms of structure and capabilities.
However, I can see one big difference (which makes Xpand much better in this use case): you can use your custom stereotypes in your Xpand templates.
Xpand engine brilliantly chooses the "best matching template/rule" for every stereotype (taking into account inheritance between stereotypes as well).
Furthermore, it is very easy to obtain stereotype properties.
These two "features" make the templates very elegant, compact and readable.
For example:
«DEFINE myTemplate FOR MyUmlProfile::MyStereoType»
MyValue: «this.myStereotypeProperty» or simply: «myStereotypeProperty»
«ENDDEFINE»
In Acceleo, I found it clumsy to achieve the same (longer statements, more code) and my templates ended up lengthy and complex. The positive thing about Acceleo, however, was that it worked conveniently from IBM RSA (applied directly to RSA (emx) models). It has code highlighting and auto-complete working nicely.
Xpand only worked if I exported my RSA models to ".uml" (~XML) format. It doesn't offer code highlighting or auto-complete (or at least I didn't figure out how).
Considering all pros and cons, I still vote for Xpand (in my use case).

Designing APIs in Java with top-down approach - Is writing up the Javadoc the best starting point?

Whenever I have the need to design an API in Java, I normally start off by opening up my IDE, and creating the packages, classes and interfaces. The method implementations are all dummy, but the javadocs are detailed.
Is this the best way to go about things? I am beginning to feel that the API documentation should be the first to be churned out - even before the first .java file is written up. This has few advantages:
The API designer can complete the design & specification and then split up the implementation among several implementors.
More flexible - change in design does not require one to bounce around among java files looking for the place to edit the javadoc comment.
Are there others who share this opinion? And if so, how do you go about starting off with the API design?
Further, are there any tools out there which might help? Probably even some sort of annotation-based tool which generates documentation and then the skeleton source (kind of like model-to-code generators)? I came across Eclipse PDE API tooling - but this is specific to Eclipse plugin projects. I did not find anything more generic.

For an API (and for many types of problems IMO), a top-down approach for problem partitioning and analysis is the way to go.
However (and this is just my 2c based on my own personal experience, so take it with a grain of salt), focusing on the Javadoc part of it is a good thing to do, but that is still not sufficient, and cannot reliably be the starting point. In fact, that is very implementation oriented. So what happened to the design, the modeling and reasoning that should take place before that (however brief that might be)?
You have to do some sort of modeling to identify the entities (the nouns, roles and verbs) that make up your API. And no matter how "agile" one would like to be, such things cannot be prototyped without having a clear picture of the problem statement (even if it is just a 10K foot view of it.)
The best starting point is to specify what you are trying to implement, or more precisely, what type of problems your API is trying to address. BDD might be of help (more of that below). That is, what is it that your API will provide (datum elements), and to whom, performing what actions (the verbs) and under what conditions (the context). That leads then to identify what entities provide these things and under what roles (interfaces, specifically interfaces with a single, clear role or function, not as catch-all bags of methods). That leads to an analysis on how they are orchestrated together (inheritance, composition, delegation, etc.)
Once you have that, you might be in a good position to start doing some preliminary Javadoc. Then you can start working on the implementation of those interfaces, of those roles. More Javadoc follows (in addition to other documentation that might not fall within Javadoc, i.e. tutorials, how-tos, etc.)
You start your implementation with use cases and verifiable requirements and behavioral descriptions of what each thing should do alone or in collaboration. BDD would be extremely helpful here.
As you work on, you continuously refactor, hopefully by taking some metrics (cyclomatic complexity and some variant of LCOM). These two tell you where you should refactor.
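As a rough illustration of the first metric, cyclomatic complexity can be approximated as one plus the number of decision points in a method. The sketch below counts branching keywords and short-circuit operators with a regular expression, which is an assumption made for brevity: it does not parse Java, so matches inside comments, string literals or generic wildcards are miscounted. A real metrics tool works on the syntax tree. `ComplexityEstimate` is a hypothetical name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ComplexityEstimate {

    // Decision points: branching keywords, short-circuit operators, ternary '?'.
    private static final Pattern DECISION = Pattern.compile(
            "\\b(if|for|while|case|catch)\\b|&&|\\|\\||\\?");

    /** Approximates cyclomatic complexity as 1 + number of decision points found. */
    static int estimate(String methodSource) {
        Matcher m = DECISION.matcher(methodSource);
        int count = 1; // a method with no branches has exactly one path
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String source =
                "int sign(int x) {\n"
              + "    if (x > 0 && x < 100) return 1;\n"
              + "    for (int i = 0; i < 3; i++) { }\n"
              + "    return x == 0 ? 0 : -1;\n"
              + "}\n";
        // Decision points: if, &&, for, ? (four in total)
        System.out.println(estimate(source)); // prints 5
    }
}
```

Even a crude number like this is enough to rank methods and pick refactoring candidates; the absolute value matters less than the outliers.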
A development of an API should not be inherently different from the development of an application. After all, an API is a utilitarian application for a user (who happens to have a development role.)
As a result, you should not treat API engineering any differently from general software-intensive application engineering. Use the same practices, tune them according to your needs (as everyone who works with software should), and you'll do fine.
Google has been uploading its "Google Tech Talk" video lecture series on youtube for quite some time. One of them is an hour long lecture titled "How To Design A Good API and Why it Matters". You might want to check it out also.
Some links for you that might help:
Google Tech Talk's "Beyond Test Driven Development: Behaviour Driven Development" : http://www.youtube.com/watch?v=XOkHh8zF33o
Behavior Driven Development : http://behaviour-driven.org/
Website Companion to the book "Practical API Design" : http://wiki.apidesign.org/wiki/Main_Page
Going back to the Basics - Structured Design#Cohesion and Coupling : http://en.wikipedia.org/wiki/Structured_Design#Structured_Design

Defining the interface first is the programming-by-contract style of declaring preconditions, postconditions and invariants. I find it combines well with Test-Driven-Development (TDD), because the invariants and postconditions you write first are the behaviours that your tests can check for.
As an aside, the Behaviour-Driven-Development elaboration of TDD seems to have come about because of programmers who did not habitually think of the interface first.
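A minimal sketch of this interface-first style: the contract (precondition, postcondition, invariant) is written in the Javadoc, and a small check exercises it against a trivial implementation. `BoundedCounter` and `SimpleCounter` are hypothetical names invented for this example, not part of any library.

```java
class ContractDemo {

    /** A counter that never leaves the range [0, limit]. */
    interface BoundedCounter {

        /**
         * Increments the counter.
         * Precondition: value() < limit().
         * Postcondition: value() is one greater than before the call.
         *
         * @throws IllegalStateException if the precondition does not hold
         */
        void increment();

        /** Invariant: 0 <= value() <= limit(). */
        int value();

        int limit();
    }

    /** Simplest implementation that honours the documented contract. */
    static class SimpleCounter implements BoundedCounter {
        private final int limit;
        private int value;

        SimpleCounter(int limit) {
            if (limit < 0) {
                throw new IllegalArgumentException("limit must be >= 0");
            }
            this.limit = limit;
        }

        @Override
        public void increment() {
            if (value >= limit) { // enforce the documented precondition
                throw new IllegalStateException("counter is at its limit");
            }
            value++;
        }

        @Override
        public int value() { return value; }

        @Override
        public int limit() { return limit; }
    }

    public static void main(String[] args) {
        BoundedCounter c = new SimpleCounter(1);
        c.increment();
        System.out.println(c.value()); // prints 1
        try {
            c.increment(); // violates the precondition
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Each sentence of the Javadoc maps onto one check, which is what makes the contract a convenient starting point for tests.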

As for my self, I always prefer starting with writing the interfaces along with their documentation and only then start with the implementation.
In the past I took another approach which was starting with the UML and then using the automatic code generation.
The best tool I encountered for this matter was Rational Rose which is not free but I'm sure there are plenty of free plugins and utils.
The advantage of Rational Rose over other designers I bumped into was that you can "attach" the design to your code and then modify on either code or design and the other will update.

I jump right in with the coding with a prototype. Any required interfaces soon pop out at you and you can mould your proto into a final product. Get feedback along the way from whomever is going to be using your API if you can.
There is no 'best way' of approaching API design, do whatever works for you. Domain knowledge also has a large part to play

I'm a great fan of programming to the interface. It forms a contract between the implementors and the users of your code.
Rather than diving straight into code, I usually start with a basic model of my system (UML diagrams etc, depending on the complexity). Not only does this serve as good documentation, it provides a visual clarification of the system structure. Having this makes the coding part much easier to do. This kind of design documentation also makes it easier to understand the system when you come back to it in 6 months, or try to fix bugs :)
Prototyping also has its merits, but be prepared to throw it away and start again.

Intelligent search and generation of Java code, preferably using Python?

Basically, I do lots of one-off code generation, large-scale refactorings, etc. etc. in Java.
My tool language of choice is Python, but I'll take whatever solutions you can offer.
Here is a simplified illustration of what I would like, in pseudocode:
Generating an implementation for an interface
    search within my project:
        for each Interface as iName:
            write class(name=iName+"Impl", implements=iName)
            search within the body of iName:
                for each Method as mName:
                    write method(name=mName, body="// TODO implement this...")
Basically, the tool I'm searching for would allow me to:
parse files according to their Java structure ("search for interfaces")
search for words contextualized by language elements and types ("variables of type SomeClass", "doStuff() method calls on SomeClass instances")
to run searches with structural context ("within the body of the current result")
easily replace or generate code (with helpers to generate, as above, or functions for replacing, "rename the interface to Foo", "insert the line Blah.Blah()", etc.)
The point is, I don't want to spend a lot of time writing these things, as they are usually throwaway. But sometimes I need something just a little smarter than what grep offers. It wouldn't be too hard to write up a simplistic version of this, but if I'm going to use something like this at all, I'd expect it to be robust.
Any suggestions of a tool/library that will help me accomplish this?
Edit to add some clarification
Python is definitely not necessary; I'll take whatever there is. I merely suggest it in case there are choices.
This is to be used in combination with IDE refactoring; sometimes it just doesn't do everything I want.
In instances where I'm using for code generation (as above), it's for augmenting the output of other code generators. e.g. a library we use outputs a tonne of interfaces, and we need to make standard implementations of each one to mesh it to our codebase.
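For the interface-to-implementation case in the pseudocode above, plain Java reflection can go a fair way without extra tooling. The sketch below is an illustration under stated assumptions, not a robust generator: it works on compiled interfaces rather than source files, ignores generics and declared exceptions, names parameters `arg0`, `arg1`, ..., and `ImplSkeleton` is a hypothetical name.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Arrays;

class ImplSkeleton {

    /** Returns Java source for a class named <Interface>Impl with a stub per abstract method. */
    static String generate(Class<?> iface) {
        if (!iface.isInterface()) {
            throw new IllegalArgumentException(iface + " is not an interface");
        }
        StringBuilder src = new StringBuilder();
        src.append("public class ").append(iface.getSimpleName()).append("Impl implements ")
           .append(iface.getCanonicalName()).append(" {\n");

        Method[] methods = iface.getMethods();
        // getMethods() has no guaranteed order, so sort for reproducible output.
        Arrays.sort(methods, (a, b) -> a.toString().compareTo(b.toString()));

        for (Method m : methods) {
            if (!Modifier.isAbstract(m.getModifiers())) {
                continue; // default and static methods already have bodies
            }
            src.append("    @Override\n    public ")
               .append(m.getReturnType().getCanonicalName()).append(' ')
               .append(m.getName()).append('(');
            Class<?>[] params = m.getParameterTypes();
            for (int i = 0; i < params.length; i++) {
                if (i > 0) {
                    src.append(", ");
                }
                src.append(params[i].getCanonicalName()).append(" arg").append(i);
            }
            src.append(") {\n        // TODO implement this...\n")
               .append("        throw new UnsupportedOperationException();\n    }\n");
        }
        return src.append("}\n").toString();
    }

    public static void main(String[] args) {
        // Prints a RunnableImpl skeleton with a single stubbed run() method.
        System.out.println(generate(Runnable.class));
    }
}
```

Working from class files sidesteps the need for a Java parser, at the cost of losing parameter names and comments; the searches with structural context described above would still need a real parser.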

First, I am not aware of any tools or libraries implemented in Python that are specifically designed for refactoring Java code, and a Google search did not give me any leads.
Second, I would posit that writing a decent tool or library for refactoring Java in Python would be a large task. You would have to implement a Java compiler front-end (lexer/parser, AST builder and type analyser) in Python, then figure out how to integrate this with a program editor. I'm not surprised that nobody has done this ... given that mature alternatives already exist.
Thirdly, a tool that refactors without a full analysis of the source code (using pattern matching, for example) will be incapable of complex refactorings, and is likely to make mistakes in edge cases that the implementor did not think of. I expect that is the level at which the OP is currently operating ...
Given that bleak outlook, what are the alternatives:
One alternative is to use one of the existing Java IDEs (e.g. NetBeans, Eclipse, IDEA. etc) as a refactoring tool. The OP won't be able to extend the capabilities of such a tool in Python code, but the chances are that he won't really need to. I expect that at least one of these IDEs does 95% of what he needs, and (if he is realistic) that should be good enough. Especially when you consider that IDEs have lots of incidental features that help make refactoring easier; e.g. structured editing, undo/redo, incremental compilation, intelligent code completion, intelligent searching, type and call hierarchy views, and so on.
(Aside ... if existing IDEs are not good enough (#WizardOfOdds - only the OP can make that call!!), it would make more sense to try to extend the refactoring capability of an existing IDE than start again in a different implementation language.)
Depending on what he is actually doing, model-driven code generation may be another alternative. For instance, if the refactoring is happening because he is frequently creating and recreating his object model(s), then an alternative is to code the models in some modeling language and generate his code from those models. My tool of choice when doing this kind of thing is Eclipse EMF and related technologies. The EMF technologies include generation of editors, XML serialization, persistence, queries, model to model transformation and so on. I have used EMF to implement and roll out projects with object models consisting of 50 to 100 distinct classes with complex relationships and validation requirements. EMF's support for merging source code edits when you regenerate from an updated model is a key feature.

If you are coding in Java, I strongly recommend that you use NetBeans IDE. It has this kind of refactoring support builtin. Eclipse also supports this kind of thing (although I prefer NetBeans). Both projects are open source, so if you want to see how they perform this refactoring, you can look at their source code.

Java has its fair share of criticism these days but in the area of tooling - it isn't justified.
We are spoiled for choice: Eclipse, NetBeans and IntelliJ are the big three IDEs. All of them offer excellent searching and refactoring. Eclipse has the edge on NetBeans, I think, and IntelliJ is often ahead of Eclipse.
You can also use static analysis tools such as FindBugs, Checkstyle etc. to find issues, i.e. excessively long methods and classes or overly complex code.
If you really want to leverage your Python skills, take a look at Jython. It's a Python interpreter written in Java.

Generating Class Diagram

Hi all, I am at the end of the release of my project. To keep us working, our manager asked us to generate class diagrams for the code we have written. It is a medium-sized project with 3500 Java files, so I think we need to generate the class diagrams. First, I need to know how reverse engineering works here. I also looked for some tools on Google (Green, Violet) but I am not sure whether they are of any help. Please suggest how to proceed. A good beginner tutorial would also be appreciated.

I strongly recommend BOUML. Its Java reverse support is absolutely ROCK SOLID.
BOUML has many other advantages:
it is extremely fast (fastest UML tool ever created, check out benchmarks),
has rock solid C++, Java, PHP and others import support,
it is multiplatform (Linux, Windows, other OSes),
has a great SVG export support, which is important, because viewing large graphs in vector format, which scales fast in e.g. Firefox, is very convenient (you can quickly switch between "birds eye" view and class detail view),
it is full featured, impressively intensively developed (look at development history, it's hard to believe that such fast progress is possible).
supports plugins, has modular architecture (this allows user contributions, looks like BOUML community is forming up)

The tool you want to use is Doxygen. It's similar to Javadoc, but works across multiple languages. It figures out the dependencies and can call Graphviz to render the class diagrams. Here's an example of a few Java classes run through Doxygen.

This is more a toolchain than a tool, and I haven't tried it out myself, but it may be a starting point: UMLGraph, Ant and Graphviz. It is explained step by step in this article.

I've used Visual Paradigm for UML for what you want to do and it was quite good.
See here for details.
Just go Tools -> Instant reverse and select your packages.

You may be able to reverse engineer class diagrams with the open source modelling tool ArgoUML http://argouml.tigris.org/

ObjectAid is pretty nice. You can drag classes into a diagram and arrange them the way you want.

Visual Paradigm for UML Standard Edition (or better) will reverse engineer Java files into class diagrams.

I guess if your boss just wants to keep you busy until the next project starts then there's no harm in it, but you will find pretty quickly that creating a class diagram with 3500 classes will tell you exactly NOTHING about your system. In fact, you don't really want a diagram with more than about 10 classes on it. So once you have reversed all the code into your modelling tool, you will want to start organizing and arranging to find the meaning. Create a new diagram, drop a single important class onto it and bring in all the classes that are directly related to that class. Repeat for maybe the 300 most significant classes. Don't worry, it isn't as horrible as it sounds, maybe a week's work.
For the record, my modelling tool of choice is Enterprise Architect by Sparx Systems. It will reverse java sources or .jar files. There is a free 30 day trial edition.
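The "one important class plus its direct neighbours" selection described above can also be approximated programmatically, for instance to decide what to drop onto each diagram. This is a minimal sketch using reflection, assuming the classes are on the classpath; it only looks at the superclass, implemented interfaces and declared field types, and `Neighbours` plus the small domain classes are hypothetical.

```java
import java.lang.reflect.Field;
import java.util.Set;
import java.util.TreeSet;

class Neighbours {

    /** Names of classes directly related to c: superclass, interfaces, declared field types. */
    static Set<String> directlyRelated(Class<?> c) {
        Set<String> related = new TreeSet<>(); // sorted for stable output
        Class<?> parent = c.getSuperclass();
        if (parent != null && parent != Object.class) {
            related.add(parent.getName());
        }
        for (Class<?> i : c.getInterfaces()) {
            related.add(i.getName());
        }
        for (Field f : c.getDeclaredFields()) {
            Class<?> t = f.getType();
            while (t.isArray()) {
                t = t.getComponentType(); // unwrap arrays to their element type
            }
            if (!t.isPrimitive() && t != c) {
                related.add(t.getName());
            }
        }
        return related;
    }

    // Hypothetical domain classes standing in for the real codebase.
    static class Customer { String name; }
    static class OrderLine { int quantity; }
    static class Order implements java.io.Serializable {
        Customer customer;
        OrderLine[] lines;
    }

    public static void main(String[] args) {
        // Lists Customer, OrderLine and java.io.Serializable as neighbours of Order.
        System.out.println(directlyRelated(Order.class));
    }
}
```

Method parameter and return types are left out on purpose to keep the neighbourhood small; adding them widens each diagram quickly.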

There are some tools available that will help you generate these diagrams. These cost money.
Otherwise you could try to parse your Java files yourself. This could be as simple as writing a small parser that reads the Java files, writes the name of each class and all its import statements to a file, and generates a class diagram from there; Graphviz can help you with that.
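A minimal sketch of such a small parser: it pulls the class name and the import statements out of one source text with regular expressions and emits edges in Graphviz DOT syntax. The regexes are a deliberate simplification (they ignore comments, nested classes, and static or wildcard imports), and `ImportGraph` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImportGraph {

    private static final Pattern TYPE_DECL =
            Pattern.compile("\\b(?:class|interface|enum)\\s+(\\w+)");
    private static final Pattern IMPORT =
            Pattern.compile("(?m)^\\s*import\\s+([\\w.]+)\\s*;");

    /** Returns DOT edges from the first declared type to each imported type. */
    static List<String> edges(String source) {
        List<String> result = new ArrayList<>();
        Matcher decl = TYPE_DECL.matcher(source);
        if (!decl.find()) {
            return result; // no type declaration found
        }
        String className = decl.group(1);
        Matcher imp = IMPORT.matcher(source);
        while (imp.find()) {
            String imported = imp.group(1);
            // Keep only the simple name so the node labels stay readable.
            String simple = imported.substring(imported.lastIndexOf('.') + 1);
            result.add("\"" + className + "\" -> \"" + simple + "\";");
        }
        return result;
    }

    public static void main(String[] args) {
        String source =
                "package shop;\n"
              + "import java.util.List;\n"
              + "import shop.model.Customer;\n"
              + "public class OrderService { }\n";
        System.out.println("digraph classes {");
        edges(source).forEach(e -> System.out.println("  " + e));
        System.out.println("}");
    }
}
```

Running this over every file and concatenating the edges inside one `digraph` block gives a dependency picture that Graphviz's `dot` can render. It shows imports only, not inheritance or associations.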

I've been using Enterprise Architect for a number of years. A JBoss developer suggested it to me. It works very well for all types of UML modeling including the reverse engineering of class models (Java, C# and others). The basic version is currently $120 per seat, but it has most of the capabilities of much more expensive tools and it is much easier to learn. I particularly like its ability to generate HTML and RTF documentation.
It is very easy to synchronize code between the tool and your source code. Even bi-directional if you want.
Your PM may also like the activity and sequence diagrams that it can create. I also frequently use the deployment diagrams. It's very helpful to have all of this in one tool.
