I am currently writing a compiler for a large subset of Java, and I can't seem to find anything useful on name resolution techniques. Can you please point me towards some resources?
The VM specification for class file format contains the naming conventions expected by the various types of names used in the JVM.
They differ slightly (not by much) depending on whether you are referring to a class name, a package name, a member name, or a method signature.
As far as name resolution techniques, you need to ensure that you follow the resolution rules as laid out in the Language Specification.
Basically, if you violate the rules laid out in the language spec (or the names expected in the class file spec), then you violate how the Java language works, or what the class loader expects (respectively).
In Java, the class name must always be the same as the file name, but sometimes a file contains multiple classes. Only a single class (or interface) in a file can be public, and it must have the same name as the file. But how is the file name determined if the file has multiple classes (or interfaces) that are not public?
interface Foo {}
class Bar{}
Some people seem to be confused about this question
I actually know that it'll work regardless of whether I choose Foo or Bar as the file name. What interests me is whether there is some kind of convention for naming the file.
Why don't I just name it whatever I feel like? Because I'm actually writing an application that refactors code, and whenever it renames classes, I need to know how and when to change the file name.
So far I think the right way is:
if the file has a public class, use its name as the file name,
else just pick the first class, so in this example Foo would win. So, to simplify the question: is this the right way, or is there something more to it?
Quoting the Java Language Specification, section 7.6 Top Level Type Declarations :
If and only if packages are stored in a file system (§7.2), the host system may choose to enforce the restriction that it is a compile-time error if a type is not found in a file under a name composed of the type name plus an extension (such as .java or .jav) if either of the following is true:
The type is referred to by code in other compilation units of the package in which the type is declared.
The type is declared public (and therefore is potentially accessible from code in other packages).
This restriction implies that there must be at most one such type per compilation unit. This restriction makes it easy for a Java compiler to find a named class within a package. In practice, many programmers choose to put each class or interface type in its own compilation unit, whether or not it is public or is referred to by code in other compilation units.
So, as you can see, it is not a requirement that "class name must always be the same as file name", as you put it.
It is simply a way to give some compilers an easy way to find the class source code during compilation.
But, more importantly, it also helps humans find the source code. If you see a reference to class com.example.Foo, you know exactly where to find it, because it's going to be in file com/example/Foo.java.
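The mapping from a qualified name to a source path can be sketched in a few lines. This is only an illustration of the convention described above: SourceLocator and sourcePathFor are hypothetical names, and the sketch assumes a top-level type stored in a file system layout (nested types and other storage schemes are not handled).

```java
class SourceLocator {
    // Maps a fully qualified top-level type name to the conventional
    // source path relative to the source root.
    // Assumption: the name refers to a top-level type, not a nested one.
    static String sourcePathFor(String qualifiedName) {
        return qualifiedName.replace('.', '/') + ".java";
    }

    public static void main(String[] args) {
        System.out.println(sourcePathFor("com.example.Foo"));
    }
}
```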
Non-public (package-private) top-level classes can technically be placed in files of any name, and multiple such classes can be bundled in a single file, but that makes them difficult to find. For this reason, I've seen a guideline (don't remember where) that said that you should always put top-level classes in their own file, with one exception:
If the non-public class is only used by one other class, it's ok to place it in the same compilation unit (.java file) as that other class.
Basically this means that you should consider any top-level class whose name is not the file name to be "file-scoped", even though it's technically package-scoped.
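For a refactoring tool, the "file-scoped" view suggests a simple rule: rename the file only when the renamed type is public or the file is currently named after it. The sketch below is one possible reading of that guideline, not something the specification mandates; RenamePlanner and fileNameAfterRename are hypothetical names.

```java
class RenamePlanner {
    /**
     * Returns the new file name if the file should follow the type rename,
     * or null if the file should keep its current name.
     * Assumption: a public type, or a type the file is named after, "owns" the file;
     * any other top-level type is treated as file-scoped.
     */
    static String fileNameAfterRename(String currentFileName, String oldTypeName,
                                      boolean typeIsPublic, String newTypeName) {
        boolean fileNamedAfterType = currentFileName.equals(oldTypeName + ".java");
        if (typeIsPublic || fileNamedAfterType) {
            return newTypeName + ".java";
        }
        return null; // file-scoped helper type: leave the file name alone
    }
}
```

With the Foo/Bar example from the question stored in Foo.java, renaming Bar leaves the file alone, while renaming Foo renames the file.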
There are 2 rules to follow:
1st Rule: The class can have either package (default) or public visibility
2nd Rule: The class which you have declared public must be implemented in a .java source file with the same name; non-public classes, however, can live in source files with other names.
Is there a way to use a shortened package name in Java if you have conflicting names?
For instance, instead of typing out com.domain.a.b, if the conflict is in com.domain.a, you can just say b.SomeClass instead of com.domain.a.b.SomeClass. C# has a feature similar to this.
No, you either use fully qualified names or short names. You're probably looking for obscuring:
A simple name may occur in contexts where it may potentially be
interpreted as the name of a variable, a type, or a package. In these
situations, the rules of §6.5 specify that a variable will be chosen
in preference to a type, and that a type will be chosen in preference
to a package. Thus, it may sometimes be impossible to refer to a
visible type or package declaration via its simple name. We say that
such a declaration is obscured.
If you follow Java naming conventions, you shouldn't really have any issues.
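A small sketch of the obscuring rule quoted above. All names here (ObscuringDemo, Config, Holder) are made up for illustration, and the local variable deliberately breaks the naming conventions so that it collides with a type name.

```java
class ObscuringDemo {
    static class Config { static String value = "type"; }
    static class Holder { String value = "variable"; }

    static String viaSimpleName() {
        Holder Config = new Holder(); // local variable with the same simple name as the type
        return Config.value;          // the variable is chosen in preference to the type
    }

    static String viaQualifiedName() {
        Holder Config = new Holder(); // the type is still obscured by this variable
        return ObscuringDemo.Config.value; // a qualified name reaches the type again
    }

    public static void main(String[] args) {
        System.out.println(viaSimpleName());    // variable
        System.out.println(viaQualifiedName()); // type
    }
}
```

If the variable were named config, as the conventions suggest, the collision would not arise in the first place.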
Can anyone give me some idea of how to extract information from a given C++ or Java program (source code)? The information may be the names of classes or methods, inheritance relations, the class hierarchy, etc. The extraction has to be done by a C++ or Java program. I have tried this and got it partly working, but it is not totally correct. Right now I read the given program line by line and check for the "class" keyword; if I find it, the word right after it is the name of a class. Are there any built-in libraries in C or Java that can do this work more efficiently? Please suggest some simple ideas (not external libraries or plugins).
If all you want is the names of classes and methods within classes, you can rig a set of regular expressions to pick off various tokens (identifiers, "{", "}", operators, numbers, strings), and a crummy parser (called an "island parser") to recognize the sequences of tokens that make up class declarations and method declarations. (Hint: for Java and C++, make sure you somehow match corresponding "{ ... }".)
This stunt works for classes and methods because in essence this is how real compilers work: they break the input stream into tokens (usually using the compiler generalization of regexps called "lexer generators"), and then use a parser to determine the actual code structure, and classes and methods are pretty easy to spot in the syntax. (This solution is a kind of clean version of what OP posted.)
If you want any other information from Java or C++ source code (e.g., types of method arguments, etc.) you probably need a tool that actually parses the languages and builds symbol tables, so you have a chance of knowing what the identifiers found in various locations mean.
(EDIT: OP indicated he wants to find out which function calls which other function. He can't do this sensibly without a full language front end, with a parser plus symbol table as a minimum.)
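A minimal sketch of such an island parser for Java-like source, under stated assumptions: the class and method names (IslandParser, declarations, and so on) are hypothetical, only methods with a body are reported, and constructs such as annotations with bodies, enum constants with bodies, or lambdas in field initializers can fool it. It tokenizes with one regular expression and tracks brace depth so that only identifiers directly inside a class body are considered.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IslandParser {
    // Comments and literals come first so that braces or the word "class"
    // inside them are consumed here and never become tokens.
    private static final Pattern TOKEN = Pattern.compile(
        "//[^\\n]*|/\\*.*?\\*/"            // comments
        + "|\"(?:\\\\.|[^\"\\\\])*\""      // string literals
        + "|'(?:\\\\.|[^'\\\\])*'"         // char literals
        + "|[A-Za-z_$][A-Za-z_$0-9]*"      // identifiers and keywords
        + "|[{}();]",                      // the only punctuation we need
        Pattern.DOTALL);

    static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            char c = m.group().charAt(0);
            boolean skipped = c == '/' || c == '"' || c == '\'';
            if (!skipped) tokens.add(m.group());
        }
        return tokens;
    }

    /** Returns entries such as "type Foo" and "method Foo.bar" in source order. */
    static List<String> declarations(String source) {
        List<String> tokens = tokenize(source);
        List<String> found = new ArrayList<>();
        Deque<String> types = new ArrayDeque<>();       // enclosing type names
        Deque<Integer> typeDepths = new ArrayDeque<>(); // brace depth of each type body
        int depth = 0;
        String pendingType = null; // type name seen, body "{" not yet reached
        for (int i = 0; i < tokens.size(); i++) {
            String t = tokens.get(i);
            if ((t.equals("class") || t.equals("interface") || t.equals("enum"))
                    && i + 1 < tokens.size() && isIdentifier(tokens.get(i + 1))) {
                pendingType = tokens.get(++i);
                found.add("type " + pendingType);
            } else if (t.equals("{")) {
                depth++;
                if (pendingType != null) {
                    types.push(pendingType);
                    typeDepths.push(depth);
                    pendingType = null;
                }
            } else if (t.equals("}")) {
                if (!typeDepths.isEmpty() && typeDepths.peek() == depth) {
                    types.pop();
                    typeDepths.pop();
                }
                depth--;
            } else if (t.equals("(") && i > 0 && isIdentifier(tokens.get(i - 1))
                    && !(i > 1 && tokens.get(i - 2).equals("new"))
                    && !typeDepths.isEmpty() && typeDepths.peek() == depth) {
                // identifier "(" directly inside a type body: a method if a body follows
                int close = matchingParen(tokens, i);
                if (close >= 0 && bodyFollows(tokens, close)) {
                    found.add("method " + types.peek() + "." + tokens.get(i - 1));
                }
            }
        }
        return found;
    }

    private static boolean isIdentifier(String token) {
        return Character.isJavaIdentifierStart(token.charAt(0));
    }

    private static int matchingParen(List<String> tokens, int open) {
        int nesting = 0;
        for (int j = open; j < tokens.size(); j++) {
            if (tokens.get(j).equals("(")) nesting++;
            else if (tokens.get(j).equals(")") && --nesting == 0) return j;
        }
        return -1;
    }

    private static boolean bodyFollows(List<String> tokens, int close) {
        int j = close + 1;
        while (j < tokens.size() && isIdentifier(tokens.get(j))) j++; // skip "throws X"
        return j < tokens.size() && tokens.get(j).equals("{");
    }
}
```

Calling IslandParser.declarations on the text of a source file yields the type and method names in the order they appear. Matching the braces is what keeps calls inside method bodies from being mistaken for declarations.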
You can get various tools to parse C++ (GCC, Clang, Elsa, ...) and various other tools to parse Java (ANTLR, javacc, ...). You will find that GCC is pretty hard to bend to general tasks, Clang and Elsa less problematic. ANTLR and Javacc will parse Java code but don't AFAIK build symbol tables, so they fall a little flat for general purpose tasks. What you will find is that dealing with a C++ tool will turn out to be completely different than dealing with a Java tool since none of these tools have any common compiler infrastructure.
How you extract class and method names from each of these will vary in detail, but most of them offer some kind of way to climb over a parse tree (and you code some ad hoc match for what you want to find, e.g., class declaration syntax) and/or navigate symbol tables (and spit out symbols marked as "class" or "method" names). Finding the right syntax requires you to know the structure of the tree in intimate detail and to code lots of tests to match the proper tree structures.
If you really want to process both languages, and use a single infrastructure to do it, you could consider our DMS Software Reengineering Toolkit. DMS is language agnostic but can be tuned to arbitrary languages, and can then parse those languages, build abstract symbol tables and perform various kinds of flow analysis. DMS has both a full C++ front end (with a built-in preprocessor, handling C++ in its various forms including the new standard C++11) and a Java front end handling all dialects of Java up through 1.6 (with 1.7 happening momentarily).
To do OP's (originally stated) task of finding classes and methods, you'd tell DMS to parse the file and then climb over trees or symbol tables, much as for the other tools. You can code an ad hoc tree matcher in DMS, but it is easier to write patterns:
pattern match_class_declaration(i: identifier, b: statements): class_declaration
= " class \i { \b } ";
can be used with DMS to match those trees that happen to be class declarations, and will return "i" (and "b", which we don't care about) bound to the corresponding subtrees. "i" of course contains the class name you want. Other patterns can be used to recognize other constructs, such as classes that inherit or implement interfaces, or methods that return some type or methods that return void.
The point is you don't have to know the tree structure in any great detail to use such patterns.
To go further, as OP seems to want to do (e.g build caller/callee information), you'd need to construct control flow graphs, do points-to analysis, etc. DMS provides support for that.
The good news is one infrastructure handles both languages; you can even mix C++ and Java in DMS without it getting anything confused. The more difficult news is that DMS is a fairly complex beast, but that's because it has to handle all the complexities of C++ and Java (as well as many other languages). Still beats working with two different language parsers with two radically different implementations and thus two complete sets of learning curves.
The question sounds too vague to answer. Please elaborate.
From what I could gauge: use reflection when you are working with Java classes to figure out almost everything about a class and its methods. There are other (static) APIs that you could use on the Class object (if you have that at hand). Refer to the Javadocs for more.
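A rough sketch of the reflection approach. Note that reflection works on compiled, loaded classes rather than on source text, so it only applies once the code has been compiled. ReflectionInfo, describe, Animal and Dog are hypothetical names used for illustration.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class ReflectionInfo {
    // Sample classes to inspect (assumptions for this sketch)
    static class Animal { void speak() { } }
    static class Dog extends Animal { void fetch() { } @Override void speak() { } }

    /** Lists the class name, its superclass, and the methods it declares itself. */
    static List<String> describe(Class<?> type) {
        List<String> lines = new ArrayList<>();
        Class<?> parent = type.getSuperclass();
        lines.add("class " + type.getSimpleName()
                + (parent == null ? "" : " extends " + parent.getSimpleName()));
        List<String> methods = new ArrayList<>();
        for (Method m : type.getDeclaredMethods()) {
            if (!m.isSynthetic()) methods.add("method " + m.getName());
        }
        Collections.sort(methods); // getDeclaredMethods() returns no particular order
        lines.addAll(methods);
        return lines;
    }

    public static void main(String[] args) {
        describe(Dog.class).forEach(System.out::println);
    }
}
```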
You could try to use some source from compilers, like gcc. They already have all the syntax parsing and preprocessing there, so you could save tons of time.
For compiled Java you could also use bytecode manipulation libraries (like asm).
As you're trying to parse a text file, a shell script based on awk and/or sed would be sufficient. You'll have to define some simple regular expressions based on the language's keywords and syntax to extract the information you need.
For instance, this regular expression would match most of the class declarations of a C++ source file:
class *([A-Za-z_][A-Za-z_0-9]*) *\{?$
The parentheses allow you to extract the identifier you're looking for; this is called a capturing group.
If you really want to do it in C/C++/Java, you'll have to find a library that provides regular expressions facilities (Java standard library already provides some). Maybe Boost Regex for a C++ program.
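As a sketch of the Java route, the same expression can be used with java.util.regex from the standard library. ClassNameRegex and classNames are hypothetical names; the MULTILINE flag is added so that $ matches at the end of every line instead of only at the end of the input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ClassNameRegex {
    // Same expression as above; group 1 is the capturing group with the class name.
    private static final Pattern CLASS_DECL =
        Pattern.compile("class *([A-Za-z_][A-Za-z_0-9]*) *\\{?$", Pattern.MULTILINE);

    static List<String> classNames(String source) {
        List<String> names = new ArrayList<>();
        Matcher m = CLASS_DECL.matcher(source);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        String cpp = "class Foo {\n  int x;\n};\nclass Bar\n{\n};\nclass Baz : public Foo {\n};\n";
        System.out.println(classNames(cpp)); // [Foo, Bar]
    }
}
```

Baz is missed because of its inheritance list, which is why the expression only matches most declarations, not all of them.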
Here's an example that builds up how to parse a C++ file using the clang (LLVM) libraries. It's long and pretty detailed, but you should be able to adapt it to do what you need (for C and C++ anyway; I don't know whether LLVM is any good at handling Java, or whether that approach is easy to adapt for Java).
Not sure about current Java, but C++ is a true nightmare to parse if you want to fully extract semantic information (consider that it took YEARS for the industry to agree 100% on how, and whether, certain constructs should be parsed).
Note that while class name in C++ is easy enough (just remember however that the word class or struct can also be present before a template parameter instead of typename, that you can have "nested classes" and that you can have class "forward declarations") for members things are much harder because member name comes after the type and even understanding what is a type, where the type ends or what is the member name is not trivial... consider
int (*foo)(int x, int y);
Node<Bar, Baz, Allocator<Foo, &Q::operator > >, 12> (*rex)(int);
in the first case the member name is foo, and in the second case member name is rex (note that I'm not sure if the second example is valid C++ code or, supposing it's valid, if common C++ compilers would accept it).
Note that even just understanding where the class member list begins after the class name is not trivial (you have to skip the inheritance list that can include templated classes with parameters that are generic types).
So, giving up on regular expressions (which clearly cannot parse a type, since a type is a complex recursive entity), the only solution is to use code written by someone else.
For this job (for C++) you can try for example GCC-XML that has been written exactly for this reason (it generates an XML result from parsing C++ source code).
I’m a huge believer in consistency, and hence conventions.
However, I’m currently developing a framework in Java where these conventions (specifically the get/set prefix convention) seem to get in the way of readability. For example, some classes will have id and name properties and using o.getId() instead of o.id() seems utterly pointless for a number of reasons:
The classes are immutable so there will (generally) be no corresponding setter,
there is no chance of confusion,
the get in this case conveys no additional semantics, and
I use this get-less naming schema quite consistently throughout the library.
I am getting some reassurance from the Java Collection classes (and other classes from the Java Platform library) which also violate JavaBean conventions (e.g. they use size instead of getSize etc.).
To get this concern out of the way: the component will never be used as a JavaBean since they cannot be meaningfully used that way.
On the other hand, I am not a seasoned Java user and I don’t know what other Java developers expect of a library. Can I follow the example of the Java Platform classes in this or is it considered bad style? Is the violation of the get/set convention in Java library classes deemed a mistake in retrospect? Or is it completely normal to ignore the JavaBean conventions when not applicable?
(The Sun code conventions for Java don’t mention this at all.)
If you follow the appropriate naming conventions, then 3rd-party tools can easily integrate with and use your library. They will expect getX(), isX() etc. and try to find these through reflection.
Although you say that these won't be exposed as JavaBeans currently, I would still follow the conventions. Who knows what you may want to do further down the line? Or perhaps at a later stage you'll want to extract an interface to this object and create a proxy that can be accessed via other tools?
I actually hate this convention. I would be very happy if it were replaced by a real Java tool that provided the accessor/modifier methods.
But I do follow this convention in all my code. We don't program alone, and even if the whole team agrees on a special convention right now, you can be assured that future newcomers, or a future team that will maintain your project, will have a hard time at the beginning... I believe the inconvenience for get/set is not as big as the inconvenience from being non-standard.
I would like to raise another concern: Java software often uses too many accessors and modifiers (get/set). We should apply the "Tell, don't ask" advice much more. For example, replace the getters on B by a "real" method:
class A {
    B b;
    String c;
    void a() {
        String c = b.getC();
        String d = b.getD();
        // algorithm with b, c, d
    }
}
by
class A {
    B b;
    String c;
    void a() {
        b.a(c); // Class B has the algorithm.
    }
}
Many good properties are obtained by this refactor:
B can be made immutable (excellent for thread-safe)
Subclasses of B can modify the computation, so B might not require another property for that purpose.
The implementation is simpler in B than it would have been in A, because you don't have to use the getter and external access to the data; you are inside B and can take advantage of implementation details (checking for errors, special cases, using cached values...).
Because the algorithm lives in B, to which it has more coupling (two properties, versus one for A), chances are that refactoring A will not impact the algorithm. A refactoring of B may be an opportunity to improve the algorithm. So there is less maintenance.
The violation of the get/set convention in the Java library classes is most certainly a mistake. I'd actually recommend that you follow the convention, to avoid the complexity of knowing why/when the convention isn't followed.
Josh Bloch actually sides with you in this matter in Effective Java, where he advocates the get-less variant for things which aren't meant to be used as beans, for readability's sake. Of course, not everyone agrees with Bloch, but it shows there are cases for and against dumping the get. (I think it's easier to read, and so if YAGNI, ditch the get.)
Concerning the size() method from the collections framework; it seems unlikely it's just a "bad" legacy name when you look at, say, the more recent Enum class which has name() and ordinal(). (Which probably can be explained by Bloch being one of Enum's two attributed authors. ☺)
The get-less schema is used in languages like Scala (among others), with the Uniform Access Principle:
Scala keeps field and method names in the same namespace, which means we can’t name the field count if a method is named count. Many languages, like Java, don’t have this restriction, because they keep field and method names in separate namespaces.
Since Java is not meant to offer UAP for "properties", it is best to refer to those properties with the get/set conventions.
UAP means:
Foo.bar and Foo.bar() are the same and refer to reading the property, or to a read method for the property.
Foo.bar = 5 and Foo.bar(5) are the same and refer to setting the property, or to a write method for the property.
In Java, you cannot achieve UAP because Foo.bar and Foo.bar() are in two different namespaces.
That means to access the read method, you will have to call Foo.bar(), which is no different than calling any other method.
So this get-set convention can help to differentiate that call from the others (not related to properties), since "All services (here "just reading/setting a value, or computing it") offered by a module cannot be available through a uniform notation".
It is not mandatory, but is a way to recognize a service related to get/set or compute a property value, from the other services.
If UAP were available in Java, that convention would not be needed at all.
Note: size() instead of getSize() is probably a bad legacy name, preserved for the sake of Java's mantra 'Backwardly compatible: always'.
Consider this: lots of frameworks can be told to reference a property of an object by its field name, such as "name". Under the hood, the framework first turns "name" into "setName", figures out from its single parameter what the return type is, and then forms either "getName" or "isName".
If you don't provide such well-documented, sensible accessor/mutator mechanism, your framework/library just won't work with the majority of other libraries/frameworks out there.
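To see what a convention-driven tool picks up, here is a sketch using the standard java.beans.Introspector, which is one such mechanism (other frameworks may use their own lookup). BeanPropertyDemo, Item and readableProperties are hypothetical names. The class offers one JavaBean-style accessor and one get-less accessor.

```java
import java.beans.BeanInfo;
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.ArrayList;
import java.util.List;

class BeanPropertyDemo {
    // Sample class (an assumption for this sketch)
    public static class Item {
        private final String name = "widget";
        private final int id = 7;
        public String getName() { return name; } // JavaBean-style accessor
        public int id() { return id; }           // get-less accessor
    }

    /** Names of the readable properties that the JavaBeans introspector finds. */
    static List<String> readableProperties(Class<?> type) {
        try {
            // Stop at Object so the inherited "class" property is not reported
            BeanInfo info = Introspector.getBeanInfo(type, Object.class);
            List<String> names = new ArrayList<>();
            for (PropertyDescriptor pd : info.getPropertyDescriptors()) {
                if (pd.getReadMethod() != null) names.add(pd.getName());
            }
            return names;
        } catch (IntrospectionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readableProperties(Item.class)); // [name]
    }
}
```

Only name is reported; id() is an ordinary method as far as the introspector is concerned.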
When I create a Java class, a file with the same name as the class is generated automatically.
But after the class has been generated, can the file name be changed to something different from the class name?
Am I missing something?
Quoting the section 7.6 Top Level Type Declarations from the Java Language Specification:
When packages are stored in a file
system (§7.2.1), the host system
may choose to enforce the restriction
that it is a compile-time error if a
type is not found in a file under a
name composed of the type name plus an
extension (such as .java or .jav)
if either of the following is true:
The type is referred to by code in other compilation units of the package
in which the type is declared.
The type is declared public (and therefore is potentially accessible
from code in other packages).
This restriction implies that there
must be at most one such type per
compilation unit. This restriction
makes it easy for a compiler for the
Java programming language or an
implementation of the Java virtual
machine to find a named class within a
package; for example, the source code
for a public type wet.sprocket.Toad
would be found in a file Toad.java
in the directory wet/sprocket, and
the corresponding object code would be
found in the file Toad.class in the
same directory.
When packages are stored in a database
(§7.2.2), the host system must
not impose such restrictions. In
practice, many programmers choose to
put each class or interface type in
its own compilation unit, whether or
not it is public or is referred to by
code in other compilation units.
Because the language designers say so. It really is that simple. It's a convention and they force you to follow it.
The language specification itself does not dictate this (I've just had a look, and can find no reference to it), but it's generally enforced by tools. It makes it considerably easier for tools' dependency management, since it knows where to look for class B if class A has a reference to it. The convention extends to the directory structure echoing the package structure, but again, this is just a convention.
If I could change the world, I wish the C# designers had done that too.
How much time could be saved by forcing people not to create a file classes.cs and put ALL the code in it? Isn't it like the requirement of parentheses for if? Why does the language force me to do this silly thing:
if (true)
{
}
instead of
if true
{
}
:-)