Static code parser for Java source code to extract methods / comments

I'm looking for a parser that can extract the methods from a Java class (static source code, i.e. a .java file), along with each method's signature, comments / documentation, and variables. Preferably the parser itself is written in Java.
Could someone please advise?
Thanks.

You can use ASTParser from Eclipse JDT. It's super simple to use.
Find a quick standalone example here.

Here is what I do to extract the method signatures from one or more Java files:
I open the file I want the signatures from in Sublime Text 2, then run a find (Ctrl+F) with regular expression mode enabled, using the following regex I wrote (I tested it on my code and it works; I hope it works for you too):
((synchronized +)?(public|private|protected) +(static [a-Z\[\]]+|[a-Z\[\]]+) [a-Z]+\([a-Z ,\[\]]*\)\n?[a-Z ,\t\n]*\{)
After Sublime Text 2 highlights the results, I click "Find All", copy (Ctrl+C), open a new tab (Ctrl+N), and paste (Ctrl+V).
You will then see all your method signatures.
I hope it helped.
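The same search can probably be run from Java with java.util.regex. The following is only a sketch under a few assumptions: the class and method names are made up for illustration, and the character ranges are written as [a-zA-Z] because java.util.regex rejects the reversed range a-Z. Like the original regex, it misses package-private methods, generics, annotations, and names containing digits or underscores.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SignatureFinder {
    // Same shape as the Sublime Text regex, with [a-Z] spelled out as [a-zA-Z].
    private static final Pattern SIGNATURE = Pattern.compile(
            "(synchronized +)?(public|private|protected) +"
            + "(static [a-zA-Z\\[\\]]+|[a-zA-Z\\[\\]]+) "   // optional 'static', return type
            + "[a-zA-Z]+\\([a-zA-Z ,\\[\\]]*\\)"            // method name and parameter list
            + "\\n?[a-zA-Z ,\\t\\n]*\\{");                  // optional throws clause, then '{'

    /** Returns every matched signature, without the trailing '{'. */
    static List<String> findSignatures(String source) {
        List<String> signatures = new ArrayList<>();
        Matcher m = SIGNATURE.matcher(source);
        while (m.find()) {
            String match = m.group();
            signatures.add(match.substring(0, match.length() - 1).trim());
        }
        return signatures;
    }

    public static void main(String[] args) {
        String source = "public class Demo {\n"
                + "    public static int add(int a, int b) {\n"
                + "        return a + b;\n"
                + "    }\n"
                + "}\n";
        findSignatures(source).forEach(System.out::println);
    }
}
```

Reading the file into a string first (for example with java.nio.file.Files) and passing it to findSignatures would cover the ".java file" case.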

If all you want is the exact text of each method, and the exact text of the variables inside methods, you could get by with a parser that produces a CST, walking the CST to find the right nodes, and then prettyprinting the found subtrees. ANTLR has a Java parser that would work for this. I don't know if it will capture comments. I think the main distribution of ANTLR is coded in Java.
You can likely do this more hackily, in Java, with a lexer for Java, implementing what amounts to a bad island parser that looks for the key phrases. ("After 'class', find '{' and print out everything you find up to the matching '}'" would give you all the methods and fields).
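The "find '{' and print up to the matching '}'" idea can be sketched roughly as below. This is a deliberately crude version that skips the lexer entirely: it assumes the first occurrence of "class " is the class declaration and that no braces appear inside string literals, character literals, or comments. The class and method names are hypothetical.

```java
class ClassBodyExtractor {
    /**
     * Returns the text between the first '{' after "class " and its
     * matching '}', or null if no balanced body is found.
     */
    static String extractClassBody(String source) {
        int classPos = source.indexOf("class ");
        if (classPos < 0) {
            return null;
        }
        int open = source.indexOf('{', classPos);
        if (open < 0) {
            return null;
        }
        int depth = 0;
        for (int i = open; i < source.length(); i++) {
            char c = source.charAt(i);
            if (c == '{') {
                depth++;
            } else if (c == '}' && --depth == 0) {
                return source.substring(open + 1, i); // body without the outer braces
            }
        }
        return null; // unbalanced braces
    }
}
```

Running a real Java lexer first and counting only brace tokens would remove the string and comment caveats, at the cost of more code.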
If you want more precise detail (e.g., you want to know the actual type of an argument rather than just its name, or where the type is actually defined), you'll need a parser with a full front end and name resolution. (ANTLR won't do this.) The Eclipse JDT certainly builds trees; it likely does name resolution. Our DMS Software Reengineering Toolkit with its Java Front End can provide everything necessary for this task, including comment capture and extraction. DMS isn't coded in Java.
You objected to Javadoc as being inadequate, because it doesn't give you the content of methods. Perhaps our Java Source Browser, which does give you that code, would serve better. It integrates name resolution data from our DMS/Java Front End to hyperlink JavaDoc-type information into browsable source text; all fields as well as local variables are explicitly indexed. The Source Browser isn't coded in Java, but then presumably you simply want to run it and scrape your result. Such scraping might be harder than it appears staring at the screen; there's a lot of HTML behind such a display.

Related

Is it possible to remove tags (or sequences) and relate or remember them as indexes?

I'm working with HTML tags, and I need to interpret HTML documents. Here's what I need to achieve:
I have to recognize and remove HTML tags without removing the original content.
I have to store the index of the previously existing markups.
So here's a example. Imagine that I have the following markup:
This <strong>is a</strong> message.
In this example, we have a String sequence of 35 characters, marked up with a strong tag. As we know, HTML markup has a start and an end, and if we treat the start and end tags as sequences of characters, each one also has a start and an end (a character index).
Again, in the previous example, the beginning index of the opening tag is 5 (counting from index 0), and its end index is 13. The same logic applies to the closing tag.
Now, once we remove the markup, we end up with the following:
This is a message.
The question:
How can I keep track, for this stripped sequence, of the positions where the markup would have to be inserted again?
For example, once the markup has been removed, how do I know that I have to insert the opening tag in the X position/index, and the closing tag in the Y position/index... Like so:
This is a message.
     5   9
index 5 = <strong>
index 9 = </strong>
Note that input like the following is also possible:
<a>T<b attribute="value">h<c>i<d>s</a> <g>i<h>s</h></g> </b>a</c> <e>t</e>e<f>s</d>t</f>.
I need to implement this in Java. I've figured out how to get the start and end index of each tag in a document. For this, I'm using regular expressions (Pattern and Matcher), but I still do not know how to insert the tags again properly (as described). I would like a working example (if possible). It does not have to be the best example (the best solution) in the world, but only that it works the right way for any kind of situation.
If anything in my question is unclear, please comment and I will improve it.
Thanks in advance.
EDIT
People in the comments are saying that I should not use regular expressions to work with HTML. I do not care whether or not I use regular expressions to solve this problem; I just want to solve it, no matter how (but, of course, in the most appropriate way).
I mentioned that I'm using regular expressions, but I do not mind using another approach that gives the same result. I read that an XML parser could be the solution. Is that correct? Is there an XML parser capable of doing everything I need?
Again, Thanks in advance.
EDIT 2
I'm making this edit to explain where my problem comes from (as asked). Before I start, I want to say that what I'm trying to do is something I've never done before and is not in my area, so it may not be the most appropriate way to do it. Anyway...
I'm developing a site where users are allowed to read content but cannot edit it (edit or remove text). However, users can still mark/highlight excerpts (ranges) of the content (with some styling). That is the short summary.
Now the problem is how to do this (in Java). On the client side, for now, I was thinking of using TinyMCE to enable styling of content without text editing. I could save the styled text to a database, but this would take up a lot of space, since every client is allowed to do this and there are many clients. So if a client marks snippets of a paragraph, saving the whole paragraph back to the database for each client in the system is rather costly in terms of storage.
So I thought of saving just the ranges (indexes) of the markup made by users in a database. It is much cheaper to save a few numbers than all the text with the required styling. For example, I could save a row in a table that says:
In paragraph X, from index Y to index Z, user P applied style ABC.
This would require a translation / conversion, from database to HTML, and HTML to database. Setting a converter can be easy (I guess), but I do not know how to get the indexes (following this logic). And then we stop again at the beginning of my question.
Just to make it clear:
If someone offers a solution that will cost money, such as a paid API, tool, or something similar, unfortunately this option is not feasible for me. I'm sorry :/
In a similar way, I know it would be ideal to do this processing with JavaScript (client-side). It turns out that I do not have a specialized JavaScript team, so this needs to be done on the server side (unfortunately), which is written in Java. I can only use a JavaScript solution if it is already ready, easy and quick to use. Would you know of any ready-made, easy-to-use library that can do it in a simple way? Does it exist?

You can't use a regular expression to parse HTML. See this question (which includes this rather epic answer as well as several other interesting answers) for more information, but HTML isn't a regular language because it has a recursive structure.
A language that allows arbitrarily deep nesting of matching open/close pairs isn't regular, so a regex can't parse it.
Keep in mind that HTML is a context-free language (or, at least, pretty close to context-free). See also the Chomsky hierarchy.
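For the narrower task in the question (stripping tags and remembering where each one sat), the nesting never has to be validated, so a single left-to-right scan may be enough. The following is a minimal sketch, not a full HTML tokenizer: it assumes every '<' starts a tag and every tag ends at the next '>', which breaks on '>' inside attribute values, on comments, and on script content. The class and field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class TagStripper {
    /** A removed tag and the offset in the stripped text where it used to sit. */
    static final class Tag {
        final int offset;
        final String markup;

        Tag(int offset, String markup) {
            this.offset = offset;
            this.markup = markup;
        }
    }

    final StringBuilder text = new StringBuilder(); // content with tags removed
    final List<Tag> tags = new ArrayList<>();       // tags in document order

    TagStripper(String html) {
        int i = 0;
        while (i < html.length()) {
            int close = html.charAt(i) == '<' ? html.indexOf('>', i) : -1;
            if (close >= 0) {
                // Record the tag at the current length of the stripped text.
                tags.add(new Tag(text.length(), html.substring(i, close + 1)));
                i = close + 1;
            } else {
                text.append(html.charAt(i++));
            }
        }
    }

    /** Re-inserts the recorded tags; going right to left keeps earlier offsets valid. */
    String restore() {
        StringBuilder out = new StringBuilder(text);
        for (int k = tags.size() - 1; k >= 0; k--) {
            out.insert(tags.get(k).offset, tags.get(k).markup);
        }
        return out.toString();
    }
}
```

For "This <strong>is a</strong> message." this records the two tags at offsets 5 and 9, matching the indexes in the question. Because only offsets into the stripped text are stored, overlapping tags like those in the longer example are restored in their original order as well.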

Efficient java library for text templating?

I've got a simple string coming in from a UI component as The device id is %{test}. Assume %{test} is a dynamic variable and the values for it are being assigned from the backend code. The final string should look like:
The device id is some text
-----------------^ %{test} should be replaced by this value
I've read a bit and tried out some of the libraries that were pointed out here, such as Velocity and FreeMarker. But I don't know how those libraries compare in terms of efficiency and performance.
I hope I can get some insights on this, since I'm pretty new to it. Any help would be appreciated.

I suggest you take a look at Arco Template Engine: it compiles the template at compile time, producing a .java (or .class) file, so at run time the expansion is very fast.
The templates should be coded in JSP format. Thus, all variable references must be written ${variable} (not %{variable}).
The only thing to take into account is that templates must be statically generated (in order to be processed at compile time).
(Read the FAQ and the examples).
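If a full template engine turns out to be more than the single %{test} case needs, the substitution can probably be done with the JDK's own regex classes. This is a minimal sketch, assuming placeholders look like %{name} with word characters only; the class and method names are made up, and unknown placeholders are left untouched by choice.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaceholderTemplate {
    // Matches %{name}; group 1 captures the variable name.
    private static final Pattern PLACEHOLDER = Pattern.compile("%\\{(\\w+)\\}");

    /** Replaces each %{name} with values.get(name); unknown names stay as they are. */
    static String render(String template, Map<String, String> values) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = values.getOrDefault(m.group(1), m.group());
            // quoteReplacement keeps '$' and '\' in the value from being interpreted.
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The pattern is compiled once and reused, so the per-call cost is a single pass over the template string.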

How to write "AstRoot" object to a file including comments using Rhino?

I've already parsed JavaScript source using Rhino and reconstructed it successfully.
When I call astroot.toSource(), it shows me the reconstructed source correctly.
But the .toSource() method doesn't print comments.
With .toSource(), all the comments in my JavaScript source disappear.
So how can I get the full source, including comments?
My goal is to write the AstRoot object (which contains the source) to a new JavaScript file that includes all the comments.
I'm using Rhino 1.7R4

In general, this is difficult because comments can appear in the middle of any declaration, statement, or expression. So how should that fact be represented in the various AST objects? It could be done, but it is very messy for the parser and the AST objects it creates.
If you restrict yourself to only allowing comments on statement boundaries there are some possible solutions.
One way would be to write your own javascript tokenizer and inspect the stream while reading the file. Then you would need to figure out how to track them. One hackish way would be to transform them into 'var somexXXxx = "comment";' and use a naming convention to transform them back after ast.toSource() call. That would map your comments into the AST node structure.

Generate HTML from plain text using Java

I have to convert a .log file into a nice and pretty HTML file with tables. Right now I just want to get the HTML header down. My current method is to println every single line of the HTML file to the output file. For example:
p.println("<html>");
p.println("<script>");
etc. There has to be a simpler way, right?

How about using a JSP scriptlet and JSTL? You could create a custom object that holds all the important information and display it formatted using the Expression Language.

Printing raw HTML text as strings is probably the "easiest" (most straightforward) way to do what you're asking but it has its drawbacks (e.g. properly escaping the content text).
You could use the DOM (e.g. Document et al) interface provided by Java but that would hardly be "easy". Perhaps there are "DOM builder" type tools/libraries for Java that would simplify this task for you; I suggest looking at dom4j.
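To make the escaping drawback concrete, here is a rough sketch of the string-building approach with a small escape helper. It assumes one table row per log line and escapes only the four characters shown; the class and method names are hypothetical.

```java
import java.util.List;

class LogToHtml {
    /** Escapes the characters that would otherwise be read as markup. */
    static String escape(String s) {
        return s.replace("&", "&amp;")   // must run first, or later entities get re-escaped
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Builds a one-column table with one row per log line. */
    static String toTable(List<String> lines) {
        StringBuilder sb = new StringBuilder("<table>\n");
        for (String line : lines) {
            sb.append("  <tr><td>").append(escape(line)).append("</td></tr>\n");
        }
        return sb.append("</table>\n").toString();
    }
}
```

Building the page in a StringBuilder and writing it out once also avoids a println call per HTML line.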

Look at this Java HTML Generator library (easy to use). It should make generating the actual HTML muuuch clearer. There are complications when creating HTML with Java Strings (what happens if you want to change something like a rowspan?) that can be avoided with this library. Especially when dealing with tables.

There are many templating engines available. Have a look at https://stackoverflow.com/questions/174204/suggestions-for-a-java-based-templating-engine
This way you can define a template in a txt file and have the java code fill in the variables.

Can I automatically refactor an entire java project and rename uppercase method parameters to lowercase?

I'm working in a java project where a big part of the code was written with a formatting style that I don't like (and is also non standard), namely all method parameters are in uppercase (and also all local variables).
In IntelliJ I am able to use "Analyze -> Inspect Code" and actually find all occurrences of uppercase method parameters (over 1000).
To fix one occurrence I can do "refactor > rename parameter" and it works fine (let's assume there is no overlapping).
Is there a way to do this refactoring automatically (e.g. rename every method parameter starting with an uppercase letter to the same name starting with a lowercase letter)?

Use a Source Parser
I think what you need to do is use a source code parser like javaparser to do this.
For every java source file, parse it to a CompilationUnit, create a Visitor, probably using ModifierVisitorAdapter as base class, and override (at least) visit(MethodDeclaration, arg). Then write the changed CompilationUnit to a new File and do a diff afterwards.
I would advise against changing the original source file; creating a shadow file tree may be a good idea instead (e.g. old file: src/main/java/com/mycompany/MyClass.java, new file: src/main/refactored/com/mycompany/MyClass.java; that way you can diff the entire directories).

I'd advise that you think about a few things before you do anything:
If this is a team effort, inform your team.
If this is for an employer, inform your boss.
If this is checked into a version control system, realize that you'll have diffs coming out the wazoo.
If it's not checked into a version control system, check it in.
Take a backup before you make any changes.
See if you have tests that can confirm the behavior is the same before and after the change.
This is a dangerous refactoring. Be careful.

I am not aware of any direct, out-of-the-box support for such a refactoring in IDEs. Most IDEs do support rename refactoring (which is regularly used), though, so you may be able to write an IDE plugin that walks the source code (AST) and invokes rename refactoring behind the scenes for every parameter name matching that format.

I have done a lot of such refactorings across a rather large number of files, using TextPad or WildPad and a bunch of reg-ex replace-alls. It always worked for me!
I'm confident that if the code is first formatted using an IDE like Eclipse (if it is not already properly formatted), and a reg-ex matching the method signature (scope, return type, name, bracket, argument list, bracket) can then be devised, your job will be done in seconds with these tools. You might need more than one set of reg-ex replace-alls.
The only time-taking activity would be to come up with such a set of reg-ex.
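As an illustration of what one such replace-all could look like in Java, here is a small sketch that renames a single identifier inside a piece of method source. It is purely textual: it will also rename a type, field, or string literal content that happens to have the same name, so the result needs review. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParameterRenamer {
    /** Lowercases the first character, e.g. "UserName" becomes "userName". */
    static String decapitalize(String name) {
        return name.isEmpty()
                ? name
                : Character.toLowerCase(name.charAt(0)) + name.substring(1);
    }

    /** Replaces whole-word occurrences of oldName with its decapitalized form. */
    static String rename(String methodSource, String oldName) {
        // \b on both sides keeps "Count" from matching inside "CountMax".
        Pattern wholeWord = Pattern.compile("\\b" + Pattern.quote(oldName) + "\\b");
        return wholeWord.matcher(methodSource)
                .replaceAll(Matcher.quoteReplacement(decapitalize(oldName)));
    }
}
```

Applying it per method body rather than per file limits the damage when the same name is used for something else elsewhere.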
Hope this helps!
