After taking a look in the Java VM specification, I noticed that a lot more than just ASCII letters could be used to create an identifier.
Firstly, I was wondering whether any extra symbols (apart from $) are available for identifiers.
Do you think it would be possible to implement true Java generics by encoding additional information in an identifier using the extended character set, together with a custom classloader?
Of course, you would have to get around type erasure, but that could be possible with a custom parser?
So you could store generic names in a format like: $g$GenericList$_Java_lang_String$
I'm using GenericList here as I don't intend to modify the original implementation!
Load them in with the class loader, create a proper GenericList<String> version and send it back.
EDIT: I plan to use this for a language I'm building on the JVM. As it uses $'s and _'s as special characters, encoding information like that might just work!
EDIT 2: I suppose the more difficult thing to do would be generic methods? Does anyone have any information on how those would be implemented?
EDIT 3: Since classes can only be unloaded when the classloader disappears, would I be able to cache and remove resolved templates the way .NET does, or would I do it like C++?
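The JVM allows almost any character in class/field/method names. The exceptions are ., ;, [ and /, which have a special meaning in the class file format (method names additionally may not contain < or >, apart from the special names <init> and <clinit>). Obfuscators commonly use digits and other unusual characters to make decompiling difficult.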
However you could just use the $ and _ for generated class/fields/methods.
Note: JDK 7 is supposed to have better generic support with the Type with a combination of Class and generics.
EDIT:
One way to have a proper generic type available at runtime is to always use
Set<String> set = new LinkedHashSet<String>() { };
The use of { } creates an anonymous class which has a parent type with the generic you want. You can get this information via reflection.
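To make the reflection step concrete, here is a minimal sketch. The class and method names (GenericCapture, firstTypeArgument, demoAnonymous, demoPlain) are made up for illustration; only the java.lang.reflect calls are standard.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.LinkedHashSet;
import java.util.Set;

class GenericCapture {
    // Returns the first type argument recorded in the generic superclass
    // of obj's class, or null if the superclass is not parameterized.
    static Type firstTypeArgument(Object obj) {
        Type superType = obj.getClass().getGenericSuperclass();
        if (superType instanceof ParameterizedType) {
            return ((ParameterizedType) superType).getActualTypeArguments()[0];
        }
        return null;
    }

    // Anonymous subclass: its superclass is recorded as LinkedHashSet<String>.
    static Type demoAnonymous() {
        Set<String> set = new LinkedHashSet<String>() { };
        return firstTypeArgument(set);
    }

    // Plain instance: the class is LinkedHashSet itself, whose superclass is
    // HashSet<E>, so only the type variable E is visible, not String.
    static Type demoPlain() {
        Set<String> set = new LinkedHashSet<String>();
        return firstTypeArgument(set);
    }
}
```

demoAnonymous() yields String.class, while demoPlain() only yields the type variable E. That is the whole point of the { }: the type argument is stored in the class file of the anonymous subclass, not in the object.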
You can cache and remove classes by having your own class loader, which you dispose of as you wish. The most extreme case would be to have a ClassLoader per class.
Once you have your own Generic types, you could just use these in your methods, like normal types.
As you can use Unicode, you can use basically everything except the few characters mentioned in the previous answer.
There is no such thing as "true generics", by the way ... I know what you mean ;D and that is called "templates".
Yes, you can use Unicode letters and digits (not just ASCII) in identifier names in Java. See here for identifier names allowed in Java. But as mentioned in the previous answer, by "true generics" you mean "templates".
I have some (maybe) strange requirements: I want to detect definitions of local (method) variables of a given interface type. When I find such a variable, I would like to detect which methods (set*/get*) are called on it.
I tried Javassist without luck, and I am now taking a deeper look at ASM, but I am not sure whether what I want is possible.
The reason for this is that I would like to generate a dependency graph with GraphViz of beans that depend on the same data structure.
If this is possible, could somebody please give me a hint on how it could be done? Maybe there are other frameworks that could do it?
01.09.2015
To make things more clear:
The interface is self written - the target of the whole action is to create a dependency graph in the first step automatically - later on a graphical editor should be implemented that is based on the dependencies.
I wonder how FindBugs/PMD work, because they also use the byte code and detect for example null pointer calls (variable not initialized and method will be called on it). So I thought that I could implement my idea in the same way. The whole code is Spring based - maybe this opens another solution to the point? Last but not least I could work on a source-jar?
While thinking about the problem: would it be possible via ASM/Javassist to detect all available methods of the interface and find calls to them in the other classes?
I’m afraid what you want to do is not possible. In compiled Java code, there are no local variables in the form you have in the source code. Methods use stack frames which have memory reserved for local variables, which is addressed by a numerical index. The type is implied by what instructions write to it and may change throughout the method’s code, as the memory may get reused for different variables having disjoint scopes. The names, on the other hand, are completely irrelevant.
When bytecode gets verified, the effect of all instructions on the stack frame is modeled to infer the type of each stack frame slot at each point of the execution so that the validity of all operations can be checked. Starting with class file version 50, there are StackMapTable attributes aiding the process by containing explicit type information, but only for code with branches. For sequential code, the type of variables still has to be derived by inference.
These inferred types are not necessarily the declared types. E.g., on the byte code level, there will be no difference between
CharSequence cs="foo";
cs.charAt(0);
and
String s="foo";
((CharSequence)s).charAt(0);
In both cases, there will be a storage of a String constant into a local variable followed by the invocation of an interface method. The inferred type will be String in both cases and the invocation of a CharSequence method considered valid as String implements CharSequence.
This disproves the idea of detecting that there is a local variable declared using the CharSequence (interface) type, as the actual declared type is irrelevant and not stored in the regular byte code.
There are, however, debugging attributes containing information about the local variables (see the LocalVariableTable attribute), and libraries like ASM will tell you about the declarations if such information is present. But you can’t rely on this optional information. E.g. Oracle’s JRE libraries are by default shipped without it.
I'm not sure I understood exactly what you want, but
you can have each object implement an interface:
every class that has a getter can implement an interface called, say, Getable,
and then you can operate only on the objects that provide the method declared by Getable.
https://docs.oracle.com/javase/tutorial/java/IandI/createinterface.html
Can anyone give me some idea of how to extract information from a given C++ or Java program (source code)? The information may be names of classes, names of methods, inheritance relations, the class hierarchy, etc. You have to write a C++ or Java program to do this. I have tried and was able to do it, but the result is not totally correct. Right now I am reading the given program line by line and checking for the "class" keyword; if I find it, the word right after it is the name of that class (to extract the names of classes). I am wondering whether there are any built-in libraries in C or Java which can do this work more efficiently. And please suggest some simple ideas (not external libraries or plugins).
If all you want is the names of classes and methods within classes, you can rig a set of regular expressions to pick off various tokens (identifiers, "{", "}", operator, number, string), and a crummy parser (called an "island parser") to recognize the sequence of tokens that make up class declarations and method declarations. (Hint: for Java and C++, make sure you somehow match corresponding "{ ... }".)
This stunt works for classes and methods because in essence this is how real compilers work: they break the input stream into tokens (usually using the compiler-generalization of regexps called "lexer generators"), and then use a parser to determine the actual code structure, and classes and methods are pretty easy to spot in the syntax. (This solution is a kind of clean version of what OP posted.)
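A rough sketch of such an island parser in Java, to show the shape of the idea. The class name IslandParser, the method classNames and the "depth:name" output format are my own choices, and the token set is deliberately tiny: it only knows comments, string/char literals, identifiers and braces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IslandParser {
    // Comments and literals are tokenized first so that a brace or the word
    // "class" inside them is never mistaken for real code.
    private static final Pattern TOKEN = Pattern.compile(
            "//[^\\n]*|/\\*.*?\\*/"              // line and block comments
            + "|\"(?:\\\\.|[^\"\\\\])*\""        // string literals
            + "|'(?:\\\\.|[^'\\\\])*'"           // character literals
            + "|[A-Za-z_$][A-Za-z_$0-9]*"        // identifiers and keywords
            + "|[{}]",                           // the braces we need to balance
            Pattern.DOTALL);

    // Returns entries of the form "depth:name", where depth is the brace
    // nesting level at which the class declaration was found.
    static List<String> classNames(String source) {
        List<String> found = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        int depth = 0;
        boolean expectName = false;
        while (m.find()) {
            String tok = m.group();
            char first = tok.charAt(0);
            if (first == '/' || first == '"' || first == '\'') {
                continue; // comment or literal: water, not island
            }
            if (tok.equals("{")) {
                depth++;
            } else if (tok.equals("}")) {
                depth--;
            } else if (expectName) {
                found.add(depth + ":" + tok);
                expectName = false;
            } else if (tok.equals("class")) {
                // Ignore class literals such as Foo.class
                expectName = m.start() == 0 || source.charAt(m.start() - 1) != '.';
            }
        }
        return found;
    }
}
```

Being a crummy parser, it will still misreport things like C++ "template <class T>", which is exactly where a real front end starts to pay off.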
If you want any other information from Java or C++ source code (e.g., types of method arguments, etc.) you probably need a tool that actually parses the languages, and builds symbol tables so you have a chance of knowing what the identifiers found in various locations mean.
(EDIT: OP indicated he wants to find out what function calls what other function. He can't do this sensibly without a full language front end (parser + symbol table as a minimum).)
You can get various tools to parse C++ (GCC, Clang, Elsa, ...) and various other tools to parse Java (ANTLR, JavaCC, ...). You will find that GCC is pretty hard to bend to general tasks, Clang and Elsa less problematic. ANTLR and JavaCC will parse Java code but don't AFAIK build symbol tables, so they fall a little flat for general purpose tasks. What you will find is that dealing with a C++ tool will turn out to be completely different than dealing with a Java tool since none of these tools have any common compiler infrastructure.
How you extract class and method names from each of these will vary in detail, but most of them offer some kind of way to climb over a parse tree (and you code some ad hoc match for what you want to find, e.g., class declaration syntax) and/or navigate symbol tables (and spit out symbols marked as "class" or "method" names). Finding the right syntax requires you to know the structure of the tree in intimate detail and to code lots of tests to match the proper tree structures.
If you really want to process both languages, and use a single infrastructure to do it, you could consider our DMS Software Reengineering Toolkit. DMS is language agnostic but can be tuned to arbitrary languages, and then parse those languages, build abstract symbol tables and various kinds of flow analysis. DMS has both a full C++ front end (with a built-in preprocessor and handling C++ in its various forms including the new standard C++11) and a Java front end handling all dialects of Java up through 1.6 (with 1.7 happening momentarily).
To do OP's (originally stated) task of finding classes and methods, you'd tell DMS to parse the file and then climb over trees or symbol tables, much as for the other tools. You can code an ad hoc tree matcher in DMS, but it is easier to write patterns:
pattern match_class_declaration(i: identifier, b: statements): class_declaration
= " class \i { \b } ";
can be used with DMS to match those trees that happen to be class declarations, and will return "i" (and "b" which we don't care about) bound to the corresponding subtrees. "i" of course contains the class name you want. Other patterns can be used to recognize other constructs, such as class names that inherit, or implement interfaces, or methods that return some type or methods that return void.
The point is you don't have to know the tree structure in any great detail to use such patterns.
To go further, as OP seems to want to do (e.g build caller/callee information), you'd need to construct control flow graphs, do points-to analysis, etc. DMS provides support for that.
The good news is one infrastructure handles both languages; you can even mix C++ and Java in DMS without it getting anything confused. The more difficult news is that DMS is a fairly complex beast, but that's because it has to handle all the complexities of C++ and Java (as well as many other languages). Still beats working with two different language parsers with two radically different implementations and thus two complete sets of learning curves.
The question sounds too vague to answer. Please elaborate.
From what I could gauge: use reflection when you are working with Java classes to figure out almost everything about a class and its methods. There are other (static) APIs that you could use on the Class object (if you have that at hand). Refer to the javadocs for more.
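A small sketch of what the reflection route looks like. Shape, Circle and ClassInspector are hypothetical names invented for the example; the reflection calls themselves (getDeclaredMethods, getSuperclass, getInterfaces) are standard. Keep in mind this works on loaded classes, so the program has to be compiled first; it does not read source text.

```java
import java.lang.reflect.Method;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical classes to inspect.
class Shape {
    double area() { return 0.0; }
}

class Circle extends Shape implements Comparable<Circle> {
    final double radius;
    Circle(double radius) { this.radius = radius; }
    @Override double area() { return Math.PI * radius * radius; }
    @Override public int compareTo(Circle other) { return Double.compare(radius, other.radius); }
}

class ClassInspector {
    // Names of the methods declared directly in the given class
    // (inherited methods and compiler-generated ones are left out).
    static Set<String> declaredMethodNames(Class<?> type) {
        Set<String> names = new TreeSet<>();
        for (Method m : type.getDeclaredMethods()) {
            if (!m.isSynthetic()) {
                names.add(m.getName());
            }
        }
        return names;
    }
}
```

For example, ClassInspector.declaredMethodNames(Circle.class) contains area and compareTo, Circle.class.getSuperclass() is Shape.class, and Circle.class.getInterfaces() lists Comparable.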
You could try to use some source from compilers, like gcc. They already have all the syntax parsing and preprocessing there, so you could save tons of time.
For compiled Java you could also use bytecode manipulation libraries (like asm).
As you're trying to parse a text file, a shell script based on awk and/or sed would be sufficient. You'll have to define some simple regular expressions based on the languages' keywords and syntax to extract the information you need.
For instance, this regular expression would match most of the class declarations of a C++ source file:
class *([A-Za-z_][A-Za-z_0-9]*) *\{?$
The parentheses allow you to extract the identifier you're looking for; this is called a capturing group.
If you really want to do it in C/C++/Java, you'll have to find a library that provides regular expressions facilities (Java standard library already provides some). Maybe Boost Regex for a C++ program.
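With the Java standard library (java.util.regex) this could look roughly like the sketch below. Note that I tightened the expression slightly (a word boundary and at least one space after "class") so that words like "classify" don't match; ClassNameRegex and find are made-up names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ClassNameRegex {
    // MULTILINE makes $ match at the end of every line, not just at the end of the input.
    private static final Pattern CLASS_DECL = Pattern.compile(
            "\\bclass +([A-Za-z_][A-Za-z_0-9]*) *\\{?$", Pattern.MULTILINE);

    // Collects the content of capturing group 1 (the identifier) for every match.
    static List<String> find(String source) {
        List<String> names = new ArrayList<>();
        Matcher m = CLASS_DECL.matcher(source);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }
}
```

On a source containing "class Foo {", "class Bar" and "class Baz : public Foo {" on separate lines, this finds Foo and Bar only. The declaration with an inheritance list is missed, which is why the expression only matches "most" class declarations.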
Here's an example building up how to parse a C++ file using the clang (llvm) libraries. It's long and pretty detailed, but you should be able to adapt it to do what you need (for C and C++ anyway ... I don't know if llvm is any good at handling Java, and I don't know if it's easy to adapt that approach for Java).
Not sure about current Java, but C++ is a true nightmare to parse if you want to fully extract semantic information (consider that it took YEARS for the industry to agree 100% on how and if certain construct should have been parsed).
Note that while class name in C++ is easy enough (just remember however that the word class or struct can also be present before a template parameter instead of typename, that you can have "nested classes" and that you can have class "forward declarations") for members things are much harder because member name comes after the type and even understanding what is a type, where the type ends or what is the member name is not trivial... consider
int (*foo)(int x, int y);
Node<Bar, Baz, Allocator<Foo, &Q::operator > >, 12> (*rex)(int);
in the first case the member name is foo, and in the second case member name is rex (note that I'm not sure if the second example is valid C++ code or, supposing it's valid, if common C++ compilers would accept it).
Note that even just understanding where the class member list begins after the class name is not trivial (you have to skip the inheritance list that can include templated classes with parameters that are generic types).
So, giving up on a regular expression (which clearly is not able to parse a type, since a type is a complex recursive entity), the only solution is to use code written by someone else.
For this job (for C++) you can try for example GCC-XML that has been written exactly for this reason (it generates an XML result from parsing C++ source code).
Let's say I have:
class A {
Integer b;
void c() {}
}
Why does Java have this syntax: A.class, and doesn't have a syntax like this: b.field, c.method?
Is there any use that is so common for class literals?
The A.class syntax looks like a field access, but in fact it is a result of a special syntax rule in a context where normal field access is simply not allowed; i.e. where A is a class name.
Here is what the grammar in the JLS says:
Primary:
    ParExpression
    NonWildcardTypeArguments ( ExplicitGenericInvocationSuffix | this Arguments )
    this [Arguments]
    super SuperSuffix
    Literal
    new Creator
    Identifier { . Identifier } [ IdentifierSuffix ]
    BasicType {[]} .class
    void.class
Note that there is no equivalent syntax for field or method.
(Aside: The grammar allows b.field, but the JLS states that b.field means the contents of a field named "field" ... and it is a compilation error if no such field exists. Ditto for c.method, with the addition that a field c must exist. So neither of these constructs means what you want it to mean ... )
Why does this limitation exist? Well, I guess because the Java language designers did not see the need to clutter up the language syntax / semantics to support convenient access to the Field and Method objects. (See * below for some of the problems of changing Java to allow what you want.)
Java reflection is not designed to be easy to use. In Java, it is best practice to use static typing where possible. It is more efficient, and less fragile. Limit your use of reflection to the few cases where static typing simply won't work.
This may irk you if you are used to programming in a language where everything is dynamic. But you are better off not fighting it.
Is there any use that is so common for class literals?
I guess, the main reason they supported this for classes is that it avoids programs calling Class.forName("some horrible string") each time you need to do something reflectively. You could call it a compromise / small concession to usability for reflection.
I guess the other reason is that the <type>.class syntax didn't break anything, because class was already a keyword. (IIRC, the syntax was added in Java 1.1.)
* If the language designers tried to retrofit support for this kind of thing there would be all sorts of problems:
The changes would introduce ambiguities into the language, making compilation and other parser-dependent tasks harder.
The changes would undoubtedly break existing code, whether or not method and field were turned into keywords.
You cannot treat b.field as an implicit object attribute, because it doesn't apply to objects. Rather b.field would need to apply to field / attribute identifiers. But unless we make field a reserved word, we have the anomalous situation that you can create a field called field but you cannot refer to it in Java sourcecode.
For c.method, there is the problem that there can be multiple visible methods called c. A second issue is that if there is a field called c and a method called c, then c.method could be a reference to a field called method on the object referred to by the c field.
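For comparison, here is roughly what you write today without such a syntax. The class A is the one from the question, plus a second overload c(int) that I added purely to show why a bare c.method would be ambiguous; MemberLookup and its methods are hypothetical helper names.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;

class A {
    Integer b;
    void c() {}
    void c(int n) {} // hypothetical overload, not in the question
}

class MemberLookup {
    // What A.class saves you from: a string lookup that can fail at runtime.
    static Class<?> classByName(String name) {
        try {
            return Class.forName(name);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // The reflective stand-in for the imagined "b.field".
    static Field fieldB() {
        try {
            return A.class.getDeclaredField("b");
        } catch (NoSuchFieldException e) {
            throw new IllegalStateException(e);
        }
    }

    // The reflective stand-in for "c.method": the parameter types are
    // needed to pick one of the overloads.
    static Method methodC(Class<?>... parameterTypes) {
        try {
            return A.class.getDeclaredMethod("c", parameterTypes);
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The class literal is checked by the compiler, whereas the field and method names are plain strings that only fail at runtime. The methodC(int.class) call also shows that a method name alone does not identify a method.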
I take it you want this info for logging and such. It is most unfortunate that such information is not available, although the compiler has full access to it.
With a little creativity you can get the information using reflection. I can't provide any examples, as there are few requirements to go on and I'm not in the mood to completely waste my time :)
I'm not sure if I fully understand your question. You are being unclear in what you mean by A.class syntax. You can use the reflection API to get the class from a given object by:
A a = new A();
Class c = a.getClass();
or
Class c = A.class;
Then do some things using c.
The reflection API is mostly used for debugging tools. Since Java supports polymorphism, you can always find out the actual Class of an object at runtime, so the reflection API was developed to help debug problems (sub-class given when super-class behavior is expected, etc.).
The reason there is no b.field or c.method is that they have no meaning and no functional purpose in Java. You cannot create a reference to a method (at least not before Java 8 introduced method references such as A::c), and a field cannot change its type at runtime; these things are set at compile time. Java is a very rigid language, without much in the way of runtime flexibility (unless you use dynamic class loading, but even then you need some information on the loaded objects). If you have come from a flexible language like Ruby or JavaScript, then you might find Java a little controlling for your tastes.
However, having the compiler help you figure out potential problems in your code is very helpful.
In Java, not everything is an object.
You can have
A a = new A();
Class cls = a.getClass();
or directly from the class
A.class
With this you get the object for the class.
With reflection you can get methods and fields, but this gets complicated, since not everything is an object. This is not a language like Scala or Ruby where everything is an object.
Reflection tutorial : http://download.oracle.com/javase/tutorial/reflect/index.html
BTW: You did not specify public/private/protected, so by default your members are declared package-private. This is package-level access: http://download.oracle.com/javase/tutorial/java/javaOO/accesscontrol.html
This question is inspired from Joel's "Making Wrong Code Look Wrong"
http://www.joelonsoftware.com/articles/Wrong.html
Sometimes you can use types to enforce semantics on objects beyond their interfaces. For example, the Java interface Serializable does not actually define methods, but the fact that an object implements Serializable says something about how it should be used.
Can we have UnsafeString and SafeString interfaces/subclasses in, say Java, that are used in much of the same way as Joel's Hungarian notation and Java's Serializable so that it doesn't just look bad--it doesn't compile?
Is this feasible in Java/C/C++ or are the type systems too weak or too dynamic?
Also, beyond input sanitization, what other security functions can be implemented in this manner?
The type system already enforces a huge number of such safety features. That is essentially what it's for.
For a very simple example, it prevents you from treating a float as an int. That's one aspect of safety -- it guarantees that the types you're working on are going to behave as expected. It guarantees that only string methods are called on a string. Assembly doesn't have that safeguard, for example.
It's also the job of the type system to ensure that you don't call private functions on a class. That's another safety feature.
Java's type system is too anemic to enforce a lot of interesting constraints effectively, but in many other languages (including C++), the type system can be used to enforce far more wide-ranging rules.
In C++, template metaprogramming gives you a lot of tools for prohibiting "bad" code. For example:
class myclass : boost::noncopyable {
...
};
enforces at compile-time that the class can not be copied. The following will produce compile errors:
myclass m;
myclass m2(m); // copy construction isn't allowed
myclass m3;
m3 = m; // assignment also not allowed
Likewise, we can ensure at compile-time that a template function only gets called on types which fulfill certain criteria (say, they must be random-access iterators, while bidirectional ones aren't allowed; or they must be POD types; or they must not be any kind of integer type (char, short, int, long), while all other types should be legal).
A textbook example of template metaprogramming in C++ implements a library for computing physical units. It allows you to multiply a value of type "meter" with another value of the same type, and automatically determines that the result must be of type "square meter". Or divide a value of type "mile" with a value of type "hour" and get a unit of type "miles per hour".
Again, a safety feature that prevents you from getting your types mixed up and accidentally getting your units mixed up. You'll get a compile error if you compute a value and try to assign it to the wrong type. Trying to divide, say, liters by meters^2 and assigning the result to a value of, say, kilograms, will result in a compile error.
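Since the question also asks about Java: the same idea can be approximated there, though without templates every unit and every legal combination has to be written out by hand. A minimal sketch, with Meters and SquareMeters as made-up types:

```java
// Hand-written unit types; in C++ the result type of a multiplication
// would be derived by templates, here each combination is spelled out.
final class Meters {
    final double value;
    Meters(double value) { this.value = value; }

    // Adding two lengths gives a length.
    Meters plus(Meters other) { return new Meters(value + other.value); }

    // Multiplying two lengths gives an area, as the return type records.
    SquareMeters times(Meters other) { return new SquareMeters(value * other.value); }
}

final class SquareMeters {
    final double value;
    SquareMeters(double value) { this.value = value; }
}
```

With these, "SquareMeters area = new Meters(3).times(new Meters(4));" compiles, while "Meters wrong = new Meters(3).times(new Meters(4));" is rejected by the compiler. It works, but it does not scale the way the template version does.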
Most of this requires some manual work to set up, certainly, but the language gives you the tools you need to basically build the type-checks you want. Some of this could be better supported directly in the language, but the more creative checks would have to be implemented manually in any case.
Yes, you can do such a thing. I don't know about Java, but in C++ it isn't customary and there is no support for this, so you have to do some manual work. It is customary in some other languages, Ada for example, which have the equivalent of a typedef which introduces a new type which can't be converted implicitly into the original one (this new type "inherits" some basic operations from the one it is created from, so it stays useful).
BTW, in general inheritance isn't a good way to introduce the new types, as even if there is no implicit conversion in one way, there is one in the other one.
You can do a certain amount of this out of the box in Ada. For example, you can make integer types that cannot implicitly interoperate with each other, and Ada enumerations are not compatible with any integer type. You can still convert between them, but you have to explicitly do it, which calls attention to what you are doing.
You could do the same with present-day C++, but you'd have to wrap all your integers and enums in classes, which is just way too much work for something that should be simple (or better yet, the default way of doing things).
I understand the next version of C++ is going to fix at least the enumeration issue.
In C++, I suppose you could use typedef to create a synonym for a primitive type. Your synonym could imply something about the content of that variable, replacing the function of Apps Hungarian notation. (A typedef is only an alias, though, so the compiler will not stop you from mixing the synonym with the underlying type.)
Intellisense will report the synonym you used during declaration, so if you don't like using actual Hungarian, it does save you from scrolling about (or using Go To Definition).
I guess you are thinking of something along the lines of Perl's "tainting" analysis.
In Java, it should be possible to use custom annotations and an annotation processor to implement this. Not necessarily easy though.
You can't have an UnsafeString subclass of String in Java, since java.lang.String is final.
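What you can do instead of subclassing is wrap the String. Below is a minimal sketch of that idea; UnsafeString, SafeString and Page are hypothetical classes, and the escaping is only illustrative (it is not a complete HTML sanitizer).

```java
// User input goes into an UnsafeString and cannot be written out directly.
final class UnsafeString {
    private final String raw;
    UnsafeString(String raw) { this.raw = raw; }

    // The intended way to get from unsafe to safe is through escaping.
    SafeString htmlEscape() {
        StringBuilder sb = new StringBuilder();
        for (char ch : raw.toCharArray()) {
            switch (ch) {
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '&': sb.append("&amp;"); break;
                case '"': sb.append("&quot;"); break;
                default: sb.append(ch);
            }
        }
        return new SafeString(sb.toString());
    }
}

final class SafeString {
    private final String value;
    // Package-private here; in real code you would restrict who may construct it.
    SafeString(String value) { this.value = value; }
    String value() { return value; }
}

final class Page {
    private final StringBuilder out = new StringBuilder();

    // Only SafeString is accepted, so write(new UnsafeString(...)) does not compile.
    void write(SafeString s) { out.append(s.value()); }

    String html() { return out.toString(); }
}
```

Passing an UnsafeString (or a plain String) to Page.write is a compile error, which is the "doesn't compile" version of Joel's "looks wrong".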
In general, you cannot provide any kind of security on the source level - if you want to protect against evil code, you must do that on the binary level (e.g. Java bytecode). That's why private/protected can't be used as a security mechanism in C++: it is possible to bypass that with pointer manipulations.
I've been writing .NET software for years but have started to dabble a bit in Java. While the syntax is similar the methodology is often different so I'm asking for a bit of help in these concept translations.
Properties
I know that properties are simply abstracted get_/set_ methods - the same in C#. But what are the commonly accepted naming conventions? Do you use 'get_' with an underscore or just 'get' by itself?
Constructors
In C# the base constructor is called automatically. Is this also true in Java?
Events
Like properties, events in .NET are abstracted add_/remove_/fire_ methods that work on a Delegate object. Is there an equivalent in Java? If I want to use some sort of subscriber pattern do you simply define an interface with an Invoke/Run method and collect objects or is there some built-in support for this pattern?
Update: One more map:
String Formatting
Is there an equivalent to String.Format?
Java from a C# developer's perspective
Dare Obasanjo has updated his original 10 year old article with a version 2:
C# from a Java Developer's Perspective v2.0
Although for you it's the other way round :)
To answer your specific questions:
Properties
By convention, Java uses "get" or "set" followed by the variable name in upper camel case. For example, "getUserIdentifier()". Booleans often use "is" instead of "get".
Constructors
In Java, superclass constructors are called first, descending down the type hierarchy.
Events
By convention (this is the one you'll get the least agreement on...different libraries do it slightly differently), Java uses methods named like "addEventTypeListener(EventTypeListener listener)" and "removeEventTypeListener(EventTypeListener listener)", where EventType is a semantic name for the type of event (like MouseClick for addMouseClickListener) and EventTypeListener is an interface (usually top-level) that defines the methods available on the receivers - obviously one or more of those methods is essentially a "fire" method.
Additionally, there is usually an Event class defined (for example, "MouseClickEvent"). This event class contains the data about the event (perhaps x,y coordinates, etc) and is usually an argument to the "fire" methods.
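Put together, the convention looks roughly like this. It is a sketch with made-up names (Widget, click), following the add/remove listener naming described above; real toolkits such as Swing differ in the details.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// The event class carries the data about what happened.
class MouseClickEvent {
    final int x;
    final int y;
    MouseClickEvent(int x, int y) { this.x = x; this.y = y; }
}

// The listener interface plays the role of the .NET delegate type.
interface MouseClickListener {
    void mouseClicked(MouseClickEvent event);
}

class Widget {
    // CopyOnWriteArrayList lets listeners remove themselves while being notified.
    private final List<MouseClickListener> listeners = new CopyOnWriteArrayList<>();

    void addMouseClickListener(MouseClickListener listener) { listeners.add(listener); }

    void removeMouseClickListener(MouseClickListener listener) { listeners.remove(listener); }

    // The "fire" step: build the event and notify every subscriber.
    void click(int x, int y) {
        MouseClickEvent event = new MouseClickEvent(x, y);
        for (MouseClickListener listener : listeners) {
            listener.mouseClicked(event);
        }
    }
}
```

Because MouseClickListener has a single method, from Java 8 on a subscriber can be a lambda, e.g. widget.addMouseClickListener(e -> System.out.println(e.x + "," + e.y)).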
Wikipedia has a nice in depth comparison here: http://en.wikipedia.org/wiki/Comparison_of_C_Sharp_and_Java
A bean property getter in Java is named 'get' followed by the property name starting with a capital letter. For instance the property 'color' would be associated with the methods 'getColor()' and 'setColor(int color)' (assuming the property is of type int). There is a special case for boolean properties: the getter will be called 'is'... as in 'isWhite()', 'isBlack()'. The setter remains the same.
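As a quick illustration of that naming, here is a hypothetical Pixel bean with an int property 'color' and a boolean property 'white':

```java
class Pixel {
    private int color;
    private boolean white;

    // Property "color": getColor / setColor
    public int getColor() { return color; }
    public void setColor(int color) { this.color = color; }

    // Boolean property "white": the getter uses "is", the setter still uses "set"
    public boolean isWhite() { return white; }
    public void setWhite(boolean white) { this.white = white; }
}
```

Coming from C#, this is roughly what "public int Color { get; set; }" turns into when written by hand.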
When a class is created in java, all its parent class constructors are called in order, parents before children.
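A small sketch to show the order. Base, Derived and the shared LOG are made up for the example. Note that the implicit call is always to the no-argument superclass constructor; to reach any other one you write super(...) yourself, where C# would use ": base(...)".

```java
class Base {
    // Shared log so the order of constructor bodies can be observed.
    static final StringBuilder LOG = new StringBuilder();

    Base() { LOG.append("Base "); }
    Base(String name) { LOG.append("Base(" + name + ") "); }
}

class Derived extends Base {
    // No explicit super call: the compiler inserts super() as the first statement.
    Derived() { LOG.append("Derived "); }

    // Explicit call to a specific superclass constructor; it must come first.
    Derived(String name) {
        super(name);
        LOG.append("Derived(" + name + ") ");
    }
}
```

Creating new Derived() and then new Derived("x") leaves "Base Derived Base(x) Derived(x) " in the log, i.e. the parent body always runs before the child body.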
Events in Java are specific to a given event model, and not a core part of the language. Examine the documentation for Swing or SWT for information on the event models of those GUI toolkits.
Sun's Code Conventions are a great reference for the Java way of doing and naming things.
Property getters and setters can go by whichever naming convention you desire, or that your organization has standardized. A good naming convention is simply one that is common among those who will use/see it. That said, most in the Java community use 'getSomething/setSomething' as the naming convention on getters and setters.
Base constructors are called automatically, just like C#.
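On the String.Format question, which the answers above leave open: the closest counterpart is the static method String.format (note the lower-case f), which uses printf-style specifiers like %s and %d rather than C#'s {0} placeholders. A minimal sketch, with FormatDemo and summary as made-up names and an explicit locale so the decimal separator is predictable:

```java
import java.util.Locale;

class FormatDemo {
    // %s = string, %d = integer, %.2f = floating point with two decimals.
    static String summary(String name, int count, double total) {
        return String.format(Locale.ROOT, "%s has %d items, total %.2f", name, count, total);
    }
}
```

FormatDemo.summary("cart", 3, 9.5) gives "cart has 3 items, total 9.50". If you prefer index-style placeholders, java.text.MessageFormat uses {0}, {1}, ... and may feel more familiar.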