Get name resolution in a C++ project - java

Problem
I am developing a Java stand-alone static testing tool for C++ projects. In this tool, I need to get name resolution inside a project.
For example, given two statements in a function:
int x = 0;
int y = x + 1;
By using name resolution on the variable x of the second statement, I detect that it is declared at the first one.
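To make the example concrete, here is a minimal sketch of what "resolving x" means for simple local variables. The class `ScopeResolver` and its methods are hypothetical names invented for this illustration. Real C++ lookup (namespaces, classes, overloads, templates, argument-dependent lookup) is far more involved than this scope stack.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Toy scope stack: maps each name to the line of its innermost visible declaration. */
public class ScopeResolver {
    private final Deque<Map<String, Integer>> scopes = new ArrayDeque<>();

    /** Opens a new nested scope, e.g. when entering a function body or block. */
    public void enterScope() {
        scopes.push(new HashMap<>());
    }

    /** Closes the innermost scope; its declarations are no longer visible. */
    public void exitScope() {
        scopes.pop();
    }

    /** Records that the given name is declared at the given line in the current scope. */
    public void declare(String name, int line) {
        scopes.peek().put(name, line);
    }

    /** Searches from the innermost scope outwards and returns the declaring line, if any. */
    public Optional<Integer> resolve(String name) {
        // ArrayDeque iterates from the most recently pushed scope to the oldest one.
        for (Map<String, Integer> scope : scopes) {
            Integer line = scope.get(name);
            if (line != null) {
                return Optional.of(line);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        ScopeResolver resolver = new ScopeResolver();
        resolver.enterScope();      // function body
        resolver.declare("x", 1);   // line 1: int x = 0;
        // line 2: int y = x + 1;  the use of x is looked up before y is declared
        System.out.println(resolver.resolve("x")); // prints Optional[1]
        resolver.declare("y", 2);
        resolver.exitScope();
    }
}
```

An inner block that redeclares `x` would shadow the outer declaration until `exitScope()` is called, which is the behaviour a hand-written resolver has to reproduce for every C++ scoping rule.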
Current solution
First, I used the Eclipse CDT plugin (only a part of CDT) to create abstract syntax trees (ASTs): each source file in the project under test is parsed into a corresponding AST. However, these ASTs do not contain name resolution information, so I have to analyze the ASTs myself to resolve names. My current solution works for simple cases, but it fails for large-scale projects and complex structures.
Later, I learned that Eclipse CDT can provide name resolution information automatically. But I need to create a stand-alone tool (outside Eclipse), which means I cannot integrate my tool into CDT.
I know that C++ uses static rather than dynamic name resolution, so this information can be collected without running the program. Can you suggest any further ideas to overcome this issue?
Updated (based on recommendations below)
Some people suggested the following; my responses are below.
+ Use Clang
It is true that Clang can analyze C++ files (and C files too), and there is no denying that Clang is a good choice. However, the language I want to use is Java. So far, I have found only one Java option (i.e., the Eclipse CDT plugin). As I said, the CDT plugin does not support name resolution when I use it outside the Eclipse CDT IDE.
My current stand-alone Java tool, CFT4Cpp, uses the CDT plugin to parse C/C++ programs. Because of the CDT plugin's limits, I resolve names with some simple algorithms of my own. However, these algorithms fail on syntactically complex projects.

C++ is a very complex programming language (and different from C). Parsing it is a very difficult task (many years of work, perhaps a lifetime if working alone, if you do that from scratch).
So build your tool on top of some existing C++ parsing technology. You could use GCC, perhaps through GCC plugins, or Clang (see this), or the Edison C++ frontend, etc. Free software C++ compilers are huge beasts (several million lines of code), continuously evolving and growing, and mastering them requires a lot of work. BTW, you could use common inter-process communication (e.g. JSON-RPC or other approaches) or foreign function interface techniques (e.g. JNI) to use C++ compiler frameworks from Java.
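As a rough sketch of the inter-process route, a Java tool can start Clang as a child process and read back its AST as JSON. This assumes a `clang++` binary (Clang 9 or later) on the PATH; the class and method names are made up for illustration, and the JSON still has to be parsed with a library of your choice. To my knowledge, the `DeclRefExpr` nodes in that dump carry a `referencedDecl` entry pointing at the declaration, which is the name resolution information asked for, but check this against your Clang version.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class ClangAstDump {
    /** Builds the command line; assumes a clang++ executable is on the PATH. */
    public static List<String> buildCommand(Path source) {
        return Arrays.asList(
                "clang++", "-fsyntax-only", "-Xclang", "-ast-dump=json", source.toString());
    }

    /** Runs Clang on one file and returns the JSON AST printed on standard output. */
    public static String dumpAst(Path source) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(source))
                .redirectError(ProcessBuilder.Redirect.INHERIT) // show Clang diagnostics as-is
                .start();
        // Read everything before waiting, so a large dump cannot fill the pipe and stall Clang.
        String json = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        process.waitFor();
        return json;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java ClangAstDump <file.cpp>");
            return;
        }
        String json = dumpAst(Paths.get(args[0]));
        System.out.println(json.length() + " characters of JSON AST");
    }
}
```

The design choice here is to keep the C++ side as an unmodified compiler and pay for it with JSON parsing on the Java side; a custom Clang tool or plugin would give finer control at the cost of writing C++.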
However, the language I want to use is Java.
Be pragmatic: code a small part in C++ (on top of existing parsers, e.g. from C++ compilers), and the rest in Java.
(for an academic prototype, using some inter-process communication between an adapted compiler in C++ and some tool in Java is probably the least difficult approach; however, you will have to code several thousand lines on the compiler side in C++, since C++ is complex; and you'll need more on your Java side; BTW, you probably need a bit of practice in C++ to be able to design useful things for it...)
(since you probably won't find complete C++ compilers or front-ends in Java)
Even if you build your tool on top of an existing C++ parser, the task is not easy and could consume several months of your time. And existing C++ parsers are evolving (e.g. the internal representations of GCC change slightly from one version to the next). So you need to budget for the evolution of these parsers.
And parsing C++ is itself an ill-defined task (think of preprocessing, template expansion, etc.). You need to define which form of C++-related code representation you want to work on. Of course, the C++ standard has several releases, etc.
Perhaps you should consider asking your manager (or get some research grant, if you are academic) to work full time on that for several years. But ask yourself if it is worthwhile....
Alternatively, if you restrict yourself to one C++ project, consider instead defining some project-specific conventions, generating some C++ code and some tests in it. YMMV.
Another approach (which works on Linux, but probably not everywhere else) is to ask your users to compile with debug information enabled (e.g. with g++ -g if they use GCC) and to parse the DWARF debug information.
BTW, I did work on similar goals: a few years ago in GCC MELT, and now in my bismon github project (temporary name, will change). Be sure to get funded for several years of work full time, since your goal is very ambitious.

There is more to choosing an existing, widespread compiler over your own solution than the complexity involved in the implementation.
C++ is now an ever-changing language. Since C++11 the road map is a new version of the standard every 3 years. And they stuck with it: we have C++11, C++14, C++17, and C++20 is on track.
You will have a very difficult and time-consuming challenge ahead just to keep up with the changes in the standard.
For example, I show just one change per version that you would need to add support for. Can you, and are you willing to, support each new standard version in its entirety? Or are you going to end up with a tool that is already obsolete by the time it gets out of development?
C++98:
int x = 0;
int y = x + 1;
C++11:
auto x = 0;
auto y = x + 1;
C++14
[](auto x) { auto y = x; }
C++17
if (const auto [iter, inserted] = mySet.insert(value); inserted)
C++20 hopefully this:
template <class T, class F, class P>
requires requires(T x, F f, P p) {
f(x);
{p(f(x))} -> bool;
}
auto bar(T x, F f, P p)
{
//
}
With a solution based on a compiler like GCC or Clang, all of this is taken care of by the compiler itself. All you need to do is use it for your own purpose.

Related

If, in C/C++, we use #define, what about in Java?

Okay so, if I am not mistaken, in C or C++ we use the code below to shorten or substitute the statement to a different one. So you can just write P rather than printf as a command right?
#define P printf
Then how do we do that in Java?
Java does not have macros, or a pre-processing step.
One must realize that with every programming language comes its own set of tools.
Many times MACROS are used where C++ templates or Java generics can be used, for example in case of a MAX macro.
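For instance, a C-style `MAX(a, b)` macro can be replaced by a generic method. The sketch below is only an illustration (the class name `GenericMax` is made up); it relies on `Comparable`, so it works for any type with a natural ordering and is type-checked by the compiler, unlike a textual macro.

```java
public class GenericMax {
    /** Type-checked replacement for a C-style MAX(a, b) macro. */
    public static <T extends Comparable<? super T>> T max(T a, T b) {
        // Each argument is evaluated exactly once, unlike in a textual macro expansion.
        return a.compareTo(b) >= 0 ? a : b;
    }

    public static void main(String[] args) {
        System.out.println(max(3, 7));            // prints 7
        System.out.println(max("apple", "pear")); // prints pear
    }
}
```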
If you really want a preprocessing stage, consider adding a step to your build system (e.g. a Maven plugin) that goes over your "Java code with macros", generates real Java files from it (similar to how inline functions behave in C++), and then compiles the generated Java code.
You can find examples of this where Java code is generated from XSD or other schemas, so theoretically, why not generate it from "Java with macros" code?
If you look, for example, at Project Lombok, you will see that it introduces a "properties" syntax to Java, but in fact it just provides IDE plugins (so the code does not look "broken" or "in error" in your favorite IDE) and Maven steps/goals so you can actually build something developed with Lombok.
Maybe you should adopt a similar approach if this is that crucial for you (in the past, prior to JDK 5, this is how "annotations" were used in some frameworks), but you should have a really good reason to do that in your code.
Java does not have a preprocessor step like the languages you enumerated (the C macro language is handled by the preprocessor). You can make a static final function, or you could use cpp to pre-process your Java src (which I would not recommend because it wouldn't work with standard tools). Another somewhat similar alternative (but only in the sense of being able to omit a class name by adding a symbol to a local namespace) might be the static import.
import static java.lang.System.out;
// ...
out.println("Hello, World"); // <-- System.out.println
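The "static function" alternative mentioned above could look like the following sketch. The class name `Shorthand` and the method name `p` are arbitrary choices for this example; the method simply forwards to `System.out.printf`, so call sites can write `p(...)` much as the C code would write `P(...)`.

```java
public class Shorthand {
    /** Short alias playing the role of "#define P printf". */
    public static void p(String format, Object... args) {
        System.out.printf(format, args);
    }

    public static void main(String[] args) {
        p("%d + %d = %d%n", 1, 2, 3); // prints 1 + 2 = 3
    }
}
```

From another class, `import static Shorthand.p;` would not compile for a class in the default package, so in a real project the helper would live in a named package and be imported with a static import.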
Java doesn't have any built-in preprocessor, but if a project strongly needs one (usually a mobile project that requires small code adjustments for many target devices), external tools can be used. Some people even use the C/C++ preprocessor on their sources; I use my own java-comment-preprocessor. However, none of the Java preprocessors I have seen allow the kind of tricks the C/C++ preprocessor does, because preprocessor directives are not supported at the Java language level.

Assembly Programming on the JVM?

I was kind of curious whether it is possible to do assembly programming on the JVM in a similar fashion to using NASM with C.
After a quick Google search to see if assembly language programming on the JVM is possible, I was surprised to find some results.
Has anyone tried doing something like this before?
I'm also wondering if there is any assembly support for Clojure or Scala.
Invoking Assembly Language Programming from Java
minijavac: Not in English, but it looks like it uses some kind of NASM support.
Assembly is usually used in C so that a) you can access instructions C doesn't generate or b) you can do lower-level performance tuning.
As byte code is designed for Java, there aren't any useful byte code instructions it doesn't generate.
The JVM looks for common patterns in byte code generated by the compiler and optimises for those. This means that if you write the byte code yourself, it is more likely to be less optimised, i.e. slower, unless it is the same as what the compiler would produce.
Write a JNI library in C with inline assembly in it.
In theory, you could write a JNI-compliant library in pure assembly, but why bother?
I'd like to point to another solution: generating assembly code at runtime from your java program.
Some (long) time ago there was a project called softwire, written in C++, that did exactly that. It (ab)used (method and operator) overloading to create a kind of C++ DSL that closely resembles x86 ASM, and which behind the scenes would assemble the corresponding machine code. The main goal was to be able to dynamically assemble a routine customized for a specific configuration, while eliminating nearly all the branching (the routine would be recompiled if the configuration changed).
This was an excellent library, and the author used it to great effect to implement a software renderer with shading support (shaders were dynamically translated to x86 assembly and then assembled, all at runtime), so this was not just a crazy idea. Unfortunately, he was hired by a company and the library was acquired in the process.
Today, to follow such a route you could create a JNI binding to DynASM (that alone is probably no small task) and use it to assemble at runtime. If you are willing to use Scala over Java, you can even relatively easily create a DSL à la softwire that will, under the hood, generate the assembly source code and pass it to DynASM.
Sounds like fun :-)
No reason to be bored anymore.
Are you looking for something like the Jasmin project? For some reason, minijava always reminds me of the Jasmin parser...

Application of Project Sumatra to other JVM languages

I have just recently discovered Project Sumatra, which aims to bring the JVM to the graphics card. From their webpage this includes a custom compiler (called Rootbeer) for Java.
This is all good news, however, I would like to hear from someone with more knowledge about the project internals if this means that project Sumatra applies to other JVM languages as well? Will it be possible to make Aparapi calls from Scala or Clojure directly? Or will you have to develop some core functionality in Java and then access that via other JVM languages?
I only just came across this question. Apologies for taking so long. Full disclosure I am the Aparapi inventor/lead and co sponsor of Sumatra.
Unlike Aparapi Sumatra has the advantage of working from the IR (Intermediate Representation) of Java methods from inside the JVM. This means that ultimately it will detect opportunities for GPU offload based on patterns found at this abstract level. Aparapi had to reverse engineer opportunities from bytecode.
It is likely that Sumatra will initially key off user hints, rather than trying to auto-parallelize code. The main focus at present is the new 'lambda' feature of Java 8 and its companion 'stream API'. So where Aparapi required the user to inherit from a Kernel base class, Sumatra will likely use the 'explicit' hint of parallelism suggested by:
IntRange.range(1024).parallel().forEach(gid->{out[gid]=a[gid]+b[gid];});
Although for obvious cases, such as
for (int gid=0; gid<1024; gid++){
    out[gid]=a[gid]+b[gid];
}
It should be entirely possible to offload this loop. So support for other JVM based languages will depend on how ambitious we are looking for opportunities to auto-parallelize. I suspect that many patterns from other languages (JavaScript (Nashorn), JRuby, Scala, JPython etc) will be detectable.
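For reference, the stream API that eventually shipped in Java 8 spells the explicit-parallelism hint with `java.util.stream.IntStream`. The sketch below (class name `VectorAdd` chosen for this example) runs on ordinary CPU threads from the common fork/join pool; it only illustrates the shape of code that such a hint-based offload would look for, not GPU execution itself.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class VectorAdd {
    /** Element-wise sum; every index is independent, so the range may be processed in parallel. */
    public static int[] add(int[] a, int[] b) {
        int[] out = new int[a.length];
        // Each lambda invocation writes a distinct element, so no synchronization is needed.
        IntStream.range(0, a.length).parallel().forEach(gid -> out[gid] = a[gid] + b[gid]);
        return out;
    }

    public static void main(String[] args) {
        int[] sum = add(new int[] {1, 2, 3}, new int[] {10, 20, 30});
        System.out.println(Arrays.toString(sum)); // prints [11, 22, 33]
    }
}
```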
AFAIK Rootbeer (a university project) and Aparapi (an AMD-based project) are unrelated, so you may have missed something here.
Regarding Aparapi itself, its wiki states that it won't work with Scala/Clojure etc., or in fact with anything except pure Java, since it depends on the patterns produced by the JDK's javac to properly analyse bytecode. It also requires you to extend its Kernel class to be able to convert the bytecode into OpenCL and execute it on the GPU. So it looks like you would use one or the other.
Back to your question: based on all this you would have to develop in Java and call it from other JVM languages.

Java decompiling and JNI

This is a little bit like the question "How to lock compiled Java classes to prevent decompilation?". I am well aware of how to decompile an application and try to understand it, even if it is obfuscated, but one thing I'm not too sure about is how the same process would work if the application loaded C libraries (.so files) using JNI.
For example, say there was a calculator. If this calculator was built in pure Java, it would be possible to go in and mess up the square button so that when you passed in 2 it would give back 2^3 rather than 2^2.
Now if this application used JNI to do all these math commands (so it passed the 2 to a native method), how would you be able to go into the C code and change it so that it returns 2^3 and not 2^2?
Just figure out the C function signature and compile your own object file that implements that signature.
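Figuring out the signature is mostly mechanical, because by default the JVM binds a native method to an exported symbol whose name is derived from the class and method names. The sketch below computes that short (non-overloaded) symbol name following the JNI escaping rules; the class `JniNames` and the example names `com.example.Calc` and `square` are hypothetical, and libraries that register their methods explicitly with `RegisterNatives` do not follow this naming.

```java
public class JniNames {
    /** Escapes one name according to the JNI rules for '.', '_', ';', '[' and non-ASCII characters. */
    private static String mangle(String name) {
        StringBuilder sb = new StringBuilder();
        for (char c : name.toCharArray()) {
            if (c == '.' || c == '/') {
                sb.append('_');            // package separator
            } else if (c == '_') {
                sb.append("_1");           // a literal underscore
            } else if (c == ';') {
                sb.append("_2");
            } else if (c == '[') {
                sb.append("_3");
            } else if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9')) {
                sb.append(c);
            } else {
                sb.append(String.format("_0%04x", (int) c)); // other characters as lower-case hex
            }
        }
        return sb.toString();
    }

    /** Returns the short symbol name a replacement library has to export. */
    public static String symbolFor(String className, String methodName) {
        return "Java_" + mangle(className) + "_" + mangle(methodName);
    }

    public static void main(String[] args) {
        // For a hypothetical "native int square(int)" in com.example.Calc, the C side would be roughly:
        // JNIEXPORT jint JNICALL Java_com_example_Calc_square(JNIEnv *, jobject, jint);
        System.out.println(symbolFor("com.example.Calc", "square")); // prints Java_com_example_Calc_square
    }
}
```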
Years ago, working in a mainframe shop, my boss made his own version of the system date function and re-linked a commercial app we were using so he didn't have to renew the time-limited license. It was illegal as hell, but it worked.
Decompilation is older than bytecode. Pretty much everything can be decompiled. It's definitely harder (both to decompile and to understand/modify the result) with mangled, optimized machine code with zero metadata preserved, but nonetheless possible. Of course you'd need a different decompiler, and - as hinted before - it would be a bit harder, but the fact (which makes all DRM tools imperfect, by the way) "if their CPU runs it, they can modify it", holds for native code as much as for any bytecode.
One option is to use a disassembler. A simpler option is to replace the library with your own library. I do this for test purposes almost every day.
You could use a debugger to step into the C code.
You could disassemble it. IDA (Interactive Disassembler) was (is?) a great example, and could produce high quality disassembled code (cross-references, documentation, name of system/lib functions in calls, ...).
It is then possible to patch the binary (which could be protected in some way).
If your concern is that you don't want the people who use your app to see the code or even change it, could you consider running it as a web or client/server application, where the user doesn't have access to the server? This would resolve the problem.

Linking languages

I asked a question earlier about which language to use for an AI prototype. The consensus seemed to be that if I want it to be fast, I need to use a language like Java or C++, but that Python / Perl / Ruby would be good for the interface bits.
So, this leads me on to another question. How easy is it to link these languages together? And which combination works best? So, if I wanted to have a Ruby CGI-type program calling C++ or Java AI functions, is that easy to do? Any pointers for where I look for information on doing that kind of thing? Or would a different combination be better?
My main experience with writing web applications started with C++ CGI and then moved on to Java servlets (about 10 years ago) and then after a long gap away from programming I did some PHP. But I've not had experience of writing a web application in a scripting language which then calls out to a compiled language for the speed-critical bits. So any advice will be welcome!
Boost.Python provides an easy way to turn C++ code into Python modules. It's rather mature and works well in my experience.
For example, the inevitable Hello World...
char const* greet()
{
return "hello, world";
}
can be exposed to Python by writing a Boost.Python wrapper:
#include <boost/python.hpp>
BOOST_PYTHON_MODULE(hello_ext)
{
using namespace boost::python;
def("greet", greet);
}
That's it. We're done. We can now build this as a shared library. The resulting DLL is now visible to Python. Here's a sample Python session:
>>> import hello_ext
>>> print hello_ext.greet()
hello, world
(example taken from boost.org)
First, a meta comment: I would highly recommend coding the entire thing in a high-level language, profiling like mad, and optimizing only where profiling shows it's necessary. First optimize the algorithm, then the code, then think about bringing in the heavy iron. Having an optimum algorithm and clean code will make things much easier when/if you need to reimplement in a lower-level language.
Speaking for Python, IronPython/C# is probably the easiest optimization path.
CPython with C++ is doable, but I find C a lot easier to handle (but not all that easy, being C). Two tools that ease this are cython/pyrex (for C) and shedskin (for C++). These compile Python into C/C++, and from there you can access C/C++ libraries without too much ado.
I've never used jython, but I hear that the jython/Java optimization path isn't all that bad.
I agree with the Idea of coding first in a high level language such as Python, Profiling and then Implementing any code that needs speeding up in C / C++ and wrapping it for use in the high level language.
As an alternative to Boost, I would like to suggest SWIG for creating Python-callable code from C. It's reasonably painless to use, and it will compile callable modules for a wide range of languages (Python, Ruby, Java, and Lua, to name a few) from C code.
The wrapping process is semi automated, so there is no need to add new functions to the base C code, making a smoother work flow.
If you choose Perl there are plenty of resources for interfacing other languages.
Inline::C
Inline::CPP
Inline::Java
From Inline::C-Cookbook:
use Inline C => <<'END_C';
void greet() {
printf("Hello, world\n");
}
END_C
greet;
With Perl 6 it gets even easier to import subroutine from native library code using NativeCall.
use v6.c;
sub c-print ( Str() $s ){
use NativeCall;
# restrict the function to inside of this subroutine because printf is
# vararg based, and we only handle '%s' based inputs here
# it should be possible to handle more but it requires generating
# a Signature object based on the format string and then do a
# nativecast with that Signature, and a pointer to printf
sub printf ( str, str --> int32 ) is native('libc:6') {}
printf '%s', $s
}
c-print 'Hello World';
This is just a simple example, you can create a class that has a representation of a Pointer, and have some of the methods be C code from the library you are using. ( only works if the first argument of the C code is the pointer, otherwise you would have to wrap it )
If you need the Perl 6 subroutine/method name to be different you can use the is symbol trait modifier.
There are also Inline modules for Perl 6 as well.
Perl has several ways to use other languages. Look at the Inline:: family of modules on CPAN. Following the advice from others in this question, I'd write the whole thing in a single dynamic language (Perl, Python, Ruby, etc) and then optimize the bits that need it. With Perl and Inline:: you can optimize in C, C++, or Java. Or you could look at AI::Prolog which allows you to embed Prolog for AI/Logic programming.
It may be a good approach to start with a script, and call a compilation-based language from that script only for more advanced needs.
For instance, calling java from ruby script works quite well.
require "java"
# The next line exposes Java's String as JString
include_class("java.lang.String") { |pkg, name| "J" + name }
s = JString.new("f")
You can build your program in one of the higher level languages for example Python or Ruby and then call modules that are compiled in the lower level language for the parts you need performance. You can choose a platform depending on the lower level language you want.
For example, if you want to do C++ for the speedy stuff, you can just use plain Python or Ruby and call DLLs compiled from C++. If you want to use Java, you can use Jython or one of the other dynamic languages on the Java platform to call the Java code. This is easier than the C++ route because you've got a common virtual machine, so a Java object can be used directly in Jython or JRuby. The same can be done on the .NET platform with the Iron languages and C#, although you seem to have more experience with C++ and Java, so those would be better options.
I have a different perspective, having had lots of luck with integrating C++ and Python for some real time live video image processing.
I would say you should match the language to the task for each module. If you're responding to a network, do it in Python: Python can keep up with network traffic just fine. UI: Python. People are slow, and Python is great for UIs using wxPython, PyObjC on Mac, or PyGTK. If you're doing math on lots of data, or signal processing, or image processing, code it in C or C++ with unit tests, then use SWIG to create the bindings to any higher-level language.
I used the image libraries in wxWidgets in my C++, which are already exposed to Python through wxPython, so it was extremely powerful and quick. SCONS is a build tool (like make) which knows what to do with swig's .i files.
The topmost level can be in C or Python, you'll have more control and fewer packaging and deployment issues if the top level is in C or C++... but it will take a really long time to duplicate what Py2EXE or Py2App gives you on Windows or Mac (or freeze on Linux.)
Enjoy the power of hybrid programming! (I call using multiple languages in a tightly coupled way 'hybrid' but it's just a quirk of mine.)
If the problem domain is hard (and AI problems can often be hard), then I'd choose a language which is expressive or suited to the domain first, and then worry about speeding it up second. For example, Ruby has meta-programming primitives (ability to easily examine and modify the running program) which can make it very easy/interesting to implement certain types of algorithms.
If you implement it in that way and then later need to speed it up, then you can use benchmarking/profiling to locate the bottleneck and either link to a compiled language for that, or optimise the algorithm. In my experience, the biggest performance gain is from tweaking the algorithm, not from using a different implementation language.
