Writing languages for the JVM - java

Suppose I write a programming language; for the sake of a name, I'll call it lang.
To begin the long journey of writing lang, I decide to start by writing lang in itself. I can't actually run it, because there's nothing yet that can run a compiler written in lang.
So I begin by writing another compiler for lang in Java. This time, when I am done, I decide to convert it to Bytecode, and leave it at that. I now have a working compiler, which will convert all my lang code into Bytecode.
So I decide to plug in my self-compiler for the language, into the compiler I just made in Java. I then convert the self-compiler to Bytecode, and chuck out the Java compiler. I now have a lang compiler, purely written in itself, converted into Bytecode, ready for use.
This creates a solid program, and I understand all of this, but my question is, relative to compiler design for the JVM, what if I decide to release an update for my language? How do I go about updating the Bytecode? Do I simply re-write the updated version of the language in the older one?
I ask this because this is what I want to do. Write a non-existing language in itself, and then bootstrap it to the JVM by firstly creating a compiler in Java.
It's the same as what was done with C++. C with Classes was written, and then C++ in it, and finally C with Classes was abandoned for the bootstrapped C++. But then how on earth did they ever go about updating the language?

I'll answer this from two possible scenarios in your development. With any byte-code language at any time you can update the virtual machine or the language.
Suppose first you wanted to update your language to have new syntax or change the current semantics. Then you'd keep your current compiled compiler written in lang (compiler A) and edit its source so that it can correctly compile your new features. Then you compile your compiler using the old one giving you compiler B. If necessary, you can now rewrite the compiler to use the new features and then compile it using compiler B to give you compiler C.
What if the JVM changes? Well in that case you keep an old version of the JVM around, adjust your compiler to cope with the new bytecode changes, and then compile it with the old one (this is analogous to compiler B from before). That will get you a compiler that compiles to the new bytecode but runs on the old VM. The next step is to get it to compile itself, and now you have a new compiler that runs on the new VM (analogous to compiler C).

I don't think your compiler is the best way to go about this.
I'd start with a grammar for my language.
Next comes the lexer/parser to turn expressions in my language to an abstract syntax tree (AST). The AST is a correct intermediate representation of the expression.
You would emit bytecode or assembly language instructions for the virtual machine or processor of your choice by writing a code generator that traverses the AST.
Where does your update happen?
If it's language fundamentals, you have to modify both the grammar and the bytecode emission.
If you're optimizing the bytecode or porting to a new processor you have to modify the code generator.
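To make the AST and code generator steps concrete, here is a minimal sketch in Java. The node classes (Num, Add, Mul) and the textual instructions (push, add, mul) are made up for illustration; they are not real JVM bytecode, just an assumed stack-machine instruction set.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: a tiny expression AST plus a code generator that traverses it. */
final class AstCodegenSketch {

    /** Marker interface for AST nodes (a real language would have many more node types). */
    interface Expr {}

    static final class Num implements Expr {
        final int value;
        Num(int value) { this.value = value; }
    }

    static final class Add implements Expr {
        final Expr left, right;
        Add(Expr left, Expr right) { this.left = left; this.right = right; }
    }

    static final class Mul implements Expr {
        final Expr left, right;
        Mul(Expr left, Expr right) { this.left = left; this.right = right; }
    }

    /** Post-order traversal: emit code for both operands first, then the operator. */
    static void emit(Expr e, List<String> out) {
        if (e instanceof Num) {
            out.add("push " + ((Num) e).value);
        } else if (e instanceof Add) {
            Add a = (Add) e;
            emit(a.left, out);
            emit(a.right, out);
            out.add("add");
        } else if (e instanceof Mul) {
            Mul m = (Mul) e;
            emit(m.left, out);
            emit(m.right, out);
            out.add("mul");
        } else {
            throw new IllegalArgumentException("unknown node: " + e);
        }
    }

    /** Convenience wrapper that returns the generated instruction list. */
    static List<String> generate(Expr e) {
        List<String> out = new ArrayList<>();
        emit(e, out);
        return out;
    }

    public static void main(String[] args) {
        // AST for the expression 1 + 2 * 3
        Expr ast = new Add(new Num(1), new Mul(new Num(2), new Num(3)));
        for (String instruction : generate(ast)) {
            System.out.println(instruction);
        }
    }
}
```

With this split, a change to the language (say, a new operator) touches the grammar, the AST, and emit, while retargeting to another instruction set only touches emit.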

The first lang compiler can be written in a subset of lang, and you only need a subset (bootstrap) compiler (or even an interpreter) to run it. This can be written in Java.
Later, more extensive compilers can be written in lang, and so can newer versions.
You could even write a translator that converts a lang program to java, and use that to create a first translator in lang, and then turn it into a bytecode compiler.

Related

How does the JVM distinguish between Scala bytecode and Java bytecode?

Scala also produces bytecode that is executed by the JVM. I am wondering how the JVM distinguishes between Scala bytecode and Java bytecode. Can anyone please explain?
scalac Myprogram.scala
java Myprogram
So are these statements perfectly fine?
I am wondering how the JVM distinguishes between Scala bytecode and Java bytecode.
It doesn't. There is no such thing as Scala bytecode. The Scala compiler compiles to JVM bytecode. Just like the Java compiler also compiles to JVM bytecode.
The JVM doesn't know anything about Scala. It doesn't know anything about Java, either. Nor does it know anything about Groovy, Clojure, Kotlin, Ceylon, Fantom, Ruby, Python, ECMAScript, or any of the other ~400 programming languages for which there are implementations on the JVM.
The JVM only knows about one language: JVM bytecode.
Note that this is really no different from any other machine, virtual or not. The CLR only knows about CIL, it knows nothing about C#, VB.NET, or F#. An Intel Core CPU knows only about AMD64 and x86 machine code, it knows nothing about C, C++, Objective-C, Swift, Go, Java, Python. The CPython VM only knows about CPython bytecode, it knows nothing about Python.
It doesn't. Scala compiles to the same bytecode as Java.
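One way to convince yourself is to look at the start of a .class file: whichever compiler produced it, the file begins with the same magic number, followed by the minor and major class-file version. Below is a minimal sketch that reads just that header; the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical sketch: read the fixed 8-byte header found at the start of every .class file. */
final class ClassHeaderSketch {

    /**
     * Returns {minor, major} class-file version.
     * Throws IllegalArgumentException if the bytes do not start with the class-file magic number.
     */
    static int[] readVersion(byte[] classFile) {
        if (classFile.length < 8) {
            throw new IllegalArgumentException("too short to be a class file");
        }
        ByteBuffer buf = ByteBuffer.wrap(classFile); // big-endian by default, like the class format
        int magic = buf.getInt();
        if (magic != 0xCAFEBABE) {
            throw new IllegalArgumentException("bad magic number: " + Integer.toHexString(magic));
        }
        int minor = buf.getShort() & 0xFFFF; // u2 minor_version
        int major = buf.getShort() & 0xFFFF; // u2 major_version
        return new int[] { minor, major };
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ClassHeaderSketch <path-to-.class-file>");
            return;
        }
        int[] version = readVersion(Files.readAllBytes(Paths.get(args[0])));
        System.out.println("class file version " + version[1] + "." + version[0]);
    }
}
```

Running it on a class compiled by scalac and on one compiled by javac should show the same kind of header; nothing in it names the source language.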
Both scalac and javac generate bytecode. The JVM doesn't care how the bytecode was produced, it's all the same to the JVM.
However, scala and java set up the boot CLASSPATH differently, so if your code contains Scala Runtime Library calls, and it very likely will, it needs to be run by scala, not java.
You can set up the boot CLASSPATH manually using java, if you absolutely have to, but why go through all that extra work, when scala will do it for you?
Scala compiles to normal Java bytecode, so the JVM doesn't see any difference. The extra features of Scala that Java doesn't have are implemented through a combination of compile time passes and runtime helper functions. If you disassemble generated Scala classes, you'll probably see tons of calls to the Scala runtime for stuff like boxing and unboxing arguments.
The byte code as defined in the standard is the same from the point of view of the executor. But when you work with a debugger, more information is needed to identify the correct language and source files.
The only way to indicate what is the source code from which the JVM bytecode has been generated is through optional class attributes SourceFile (from Java 1.0.2) and SourceDebugExtension (from Java 5.0).
https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-4.html#jvms-4.7.10 and 11
An additional attribute, LineNumberTable, maps the bytecode back to line numbers in the source; this is also optional.
https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-4.html#jvms-4.7.12
You can find the equivalent for other JVM specs.
IDEs use these attributes to map a JVM class back to its source, but they are optional, and which debug information is present depends on the compiler.
So they can hint at which compiler was used, but you cannot rely on them. If they are defined, their meaning should be described in the compiler's documentation.
With scalac you can control which debug information is stored with the -g option. It works much like javac's, except for notc, since Java does not perform tail call optimization (TCO):
-g:{none,source,line,vars,notc}
"none" generates no debugging info,
"source" generates only the source file attribute,
"line" generates source and line number information,
"vars" generates source, line number and local variable information,
"notc" generates all of the above and will not perform tail call optimization.
The default is line.
To retrieve this information from class files (for example, from within a ClassLoader), you can use the open-source library ASM: http://asm.ow2.org/
You can also parse the class file or internal storage yourself; retrieving class attributes is not very difficult.

byte code, libraries and Java

If I wanted to create a new language for Java, I should make a compiler that is able to generate byte code compatible with the JVM spec, right? And one that also works with the JDK libraries?
Where can I find some info?
Thanks.
Depends what you mean by "create a new language for Java" -- do you mean a language that compiles to bytecode and the code it generates can be used from any Java app (e.g. Groovy) or an interpreted language (for which you want to write a parser in Java)?
If it is the former, then @Joachim is right: look at the JVM spec. For the latter, look at the likes of JavaCC for creating a parser for your language grammar.
I would start with a compiler which produced Java source. You may find this easier to read/understand/debug. Later you can optimise it to produce byte code.
EDIT:
If you have features which cannot be easily translated to Java code, you should be able to create a small number of byte code classes using Jasmin with all the exotic functionality which you can test to death. From the generated Java code this will look like a plain method call. The JVM can still inline the method so this might not impact performance at all.
The Java Virtual Machine Spec should have most of what you need.
An excellent library for bytecode generation/manipulation is ASM: http://asm.ow2.org.
It is very versatile and flexible. Note, however, that its API is based on events (similar to SAX parsers): it reads .class files and invokes a method whenever it encounters a new entity (class declaration, method declaration, statements, etc.). This may seem a bit awkward at first, but it saves a lot of memory compared to the alternative, in which the library reads the input, builds a fully formed tree structure, and then you have to iterate over it.
I don't think this will help much practically, but it has a lot of sweet theoretical stuff that I think you'll find useful.
http://www.codeproject.com/KB/recipes/B32Machine1.aspx

Programming an Interpreter for a Compiler

I'm writing an interpreter for a compiler program in Java. So after checking the source code, syntax and semantics, I want to be able to run the source code, which is the input for my compiler. I'm just wondering if I can just translate some tokens, for example, out (it prints stuff on screen), can I just replace it with System.out.print? then feed the source code again to run it in java?
I've heard of using the Java Compiler API, would this be a good plan?
Thank you very much in advance!
What you are asking about is a virtual machine implementation technique. To run your code, in general you should implement the following:
You have probably already done the first few steps (design and describe the language semantics, construct the AST, and perform the required validation of the code).
You need to generate your byte code. Java itself works in exactly the same way: it generates another representation of the source code, going from human readable to machine readable.
Here you can see what Java byte code looks like: http://www.ibm.com/developerworks/ibm/library/it-haggar_bytecode/
You need to implement a virtual machine (a stack machine) that reads the byte code and executes it.
So, as you can see, you should have three separate components (projects) for your task:
1. Language grammar
2. Compiler (byte code generator)
3. Virtual machine (interpreter of byte code)
P.S. I have experience creating a tiny Java-like compiler from scratch (defining the grammar with ANTLR, implementing the compiler, implementing the virtual machine), so I can probably share more information with you (even source code) if you need something in particular.
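As a rough illustration of the third component, here is a minimal stack machine sketch in Java. The opcodes (PUSH, ADD, MUL, HALT) and their numeric values are invented for this example and have nothing to do with real JVM bytecode.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch: a tiny stack machine that interprets made-up integer byte code. */
final class StackMachineSketch {

    // Invented opcodes; PUSH is followed by one operand in the code array.
    static final int PUSH = 0;
    static final int ADD = 1;
    static final int MUL = 2;
    static final int HALT = 3;

    /** Executes the program and returns the value on top of the stack when HALT is reached. */
    static int run(int[] code) {
        Deque<Integer> stack = new ArrayDeque<>();
        int pc = 0; // program counter: index of the next instruction
        while (true) {
            int op = code[pc++];
            switch (op) {
                case PUSH:
                    stack.push(code[pc++]); // operand follows the opcode
                    break;
                case ADD:
                    stack.push(stack.pop() + stack.pop());
                    break;
                case MUL:
                    stack.push(stack.pop() * stack.pop());
                    break;
                case HALT:
                    return stack.pop();
                default:
                    throw new IllegalStateException("bad opcode " + op + " at " + (pc - 1));
            }
        }
    }

    public static void main(String[] args) {
        // Byte code for 1 + 2 * 3
        int[] program = { PUSH, 1, PUSH, 2, PUSH, 3, MUL, ADD, HALT };
        System.out.println(run(program)); // prints 7
    }
}
```

The compiler from step 2 would produce the int array; the loop above is the whole "virtual machine" for this toy instruction set.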
You really need to read some books and/or take courses on compilers - this can't be solved by a two-paragraph answer on SO.
You could create a cross-compiler which reads your language and outputs Java code to do the same thing. This may be the simplest option.
The Java Compiler API can be used to compile Java code. You would need to translate your existing code to Java first to use it.
This would not be the same thing as writing an interpreter. Is this homework? Does the task say you have to write the interpreter or can you have the code run any way which works?
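A minimal sketch of that cross-compiler route, assuming a toy language in which every line has the form `out <expression>` and the expression happens to be valid Java already. The class name, the generated class name Prog, and the toy grammar are all made up for illustration. The Java Compiler API calls (ToolProvider.getSystemJavaCompiler, JavaCompiler.run) are real, but they need a JDK at run time, not just a JRE.

```java
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch: translate a toy language to Java source, then compile and run it. */
final class TranslateAndRunSketch {

    /** Translates each `out <expr>` line into a System.out.print call inside a main method. */
    static String translate(String className, String program) {
        StringBuilder java = new StringBuilder();
        java.append("public class ").append(className).append(" {\n");
        java.append("    public static void main(String[] args) {\n");
        for (String raw : program.split("\n")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (!line.startsWith("out ")) {
                throw new IllegalArgumentException("unsupported statement: " + line);
            }
            java.append("        System.out.print(")
                .append(line.substring(4).trim())
                .append(");\n");
        }
        java.append("    }\n}\n");
        return java.toString();
    }

    /** Compiles the generated source with the Java Compiler API and invokes its main method. */
    static void compileAndRun(String className, String javaSource) throws Exception {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("no system Java compiler available (JRE only?)");
        }
        Path dir = Files.createTempDirectory("lang");
        Path file = dir.resolve(className + ".java");
        Files.write(file, javaSource.getBytes(StandardCharsets.UTF_8));
        // Without -d, the .class file is written next to the source file.
        if (compiler.run(null, null, null, file.toString()) != 0) {
            throw new IllegalStateException("compilation failed");
        }
        try (URLClassLoader loader = new URLClassLoader(new URL[] { dir.toUri().toURL() })) {
            loader.loadClass(className)
                  .getMethod("main", String[].class)
                  .invoke(null, (Object) new String[0]);
        }
    }

    public static void main(String[] args) throws Exception {
        String program = "out \"Hello, \"\nout 6 * 7\n";
        compileAndRun("Prog", translate("Prog", program));
    }
}
```

Note that this only works because Java does the real work: the translator never evaluates anything itself, which is why it is a cross-compiler rather than an interpreter.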
Unfortunately, you did not mention which scripting language you are planning to support. If it is one of the well-known languages, just use an existing interpreter for it written in pure Java. See BSF and Java 5 scripting (http://www.ibm.com/developerworks/java/library/j-javascripting1/)
If it is your own language,
think twice: do you really need it?
If you are sure you need your own language, think about JavaCC.
First of all, thank you very much for the fast replies.
As part of our compiler project, we need to be able to compile and run a program written in our own specified language. The language is very similar to C. I am confused on how an interpreter works, is there a simpler way to implement this? Without generating byte codes? My idea was to translate each statement into Java equivalent statements, and let Java handle the byte code generation.
I would look into the topics mentioned. Again, thank you very much for the suggestions.

Difference between C++ and Java compilation process [duplicate]

Possible Duplicate:
Why does C++ compilation take so long?
Hi,
I searched Google for the differences between the C++ and Java compilation processes, but the results were about C++ and Java language features and their differences.
I am proficient in Java, but not in C++, although I have fixed a few bugs in C++. From my experience, I noticed that C++ always took more time to build than Java for minor changes.
Regards
Bala
There are a few high-level differences that come to my mind. Some of those are generalizations and should be prefixed with "Often ..." or "Some compilers ...", but for the sake of readability I'll leave that out.
C/C++ compilation doesn't read any information from binary files, but reads method/type definitions only from header files that need to be parsed in full (exception: precompiled headers)
C/C++ compilation includes a pre-processor step that can do a wide array of text-replacement (which makes header pre-compilation harder to do)
The C++ syntax is a lot more complex than the Java syntax
The C++ type system is a lot more complex than the Java type system
C++ compilation usually produces native assembler code, which is a lot more complex to produce than the relatively simple byte code
C++ compilers need to do optimizations because there isn't any other thing that will do them. The Java compiler pretty much does a simple 1:1 translation of Java source code to Java byte code, no optimizations are done at that step (that's left for the JVM to do).
C++ has a template language that's Turing complete! (so strictly speaking C++ code needs to be run to produce executable code and a C++ compiler would need to solve the halting problem to tell you if arbitrary C++ code is compilable).
Java compiles code into bytecode, which is interpreted by the Java VM. C++ must compile into object code, which is then linked into a machine-code executable. Because of this, it's possible for Java to compile only a single class for minor changes, while C++ object files must be re-linked with other object files into a machine-code executable (or DLLs). This may make the process take a bit longer.
I am not sure why you expect the compilation speed of Java and C++ to be comparable since they are different languages with completely different design goals and implementations.
That said a few specific differences to keep in mind are:
Java is compiled to byte code and not right down to machine code. Compiling to this abstract virtual machine is simpler.
C++ compilation involves not only compilation but also linking. So it is typically a multi step process.
Java performs some late binding; that is, the association between a call to a function and the actual code to run is made at runtime. So a small change in one area need not trigger a recompile of the whole program. In C++ this association needs to be made at compile time; this is called early binding.
A C++ program using all the language's features is inherently more difficult to compile. A few template invocations with a number of types can easily double or triple the amount of code to generate.
Glossing over a lot of details, in Java you compile .java files into one or more .class files. In C++ you compile .cc (or whatever) source files into .o files, and then link the .o files together into an executable or library. The linking process is usually what kills you, especially for minor changes as the amount of work for linking is roughly proportional to the size of your entire project. (this is ignoring incremental linkers, which are specifically designed to not behave as badly for small changes)
Another factor is that the #include mechanism means that whenever you change a .h file, all of the .o files that depend on it need to be rebuilt. In Java, a .class file can depend on more than one .java file (eg: because of constant inlining), but there tend to be far fewer of these "hot spots" where changing one source file requires many other source files to be rebuilt.
Also, if you're using an IDE like Eclipse it's building your Java code in the background all the time, so by the time you tell it to build it's already mostly (if not completely) done.
Java compiles any source code into bytecode, which is interpreted by the JVM. Because of this, it can be used on multiple platforms.

Compiler to translate Java bytecode to platform-independent C code before runtime?

I'm looking for a compiler to translate Java bytecode to platform-independent C code before runtime (Ahead-of-Time compilation).
I should then be able to use a standard C compiler to compile the C code into an executable for the target platform. I understand this approach is suitable only for certain Java applications that are modified infrequently.
So what Java-to-C compilers are available?
I could suggest a tool called JCGO which is a Java source to C translator. If you need to convert bytecode then you can decompile the class files by some tool (e.g., JadRetro+Jad) and pass the source files to JCGO. The tool translates all the classes of your java program at once and produces C files (one .c and .h for each class), which could, further, be compiled (by third-party tools) into highly-optimized native code for the target platform. Java generics is not supported yet. AWT/Swing and SWT are supported.
Why do that? The Java virtual machine includes a runtime Java-to-assembly compiler.
Compilation at runtime can yield better performance, since all information about runtime values is available, while ahead-of-time compilation has to make assumptions about runtime values and thus may emit slower code. Please refer to Java vs C performance by Cliff Click for more details.
GCJ has this capability, but it hasn't got great support for Java features past 1.4, and Swing support is likely to be troublesome. In practice though, the HotSpot JIT compiler beats all the ahead-of-time compilers for Java. See benchmarks from Excelsior JET.
To clarify: GCJ converts java source/bytecode to natively compiled code
Toba will convert (old) Java bytecode to C source. However, it hasn't been updated since Java 1.1. It may be helpful to partially facilitate the porting, but it just can't handle all the complex libraries Java has.
https://github.com/badlogic/jack -- Java to C++ transpiler, ignores memory model and other stuff, uses Boehm GC for extra slowness and GC pauses
The license is unclear to me.
http://ptolemy.eecs.berkeley.edu/publications/papers/03/java-2-C/ -- A Retargetable Optimizing Java-to-C Compiler for Embedded Systems
A paper, not sure whether the program is available.
(I've been googling for this stuff, this is how I came to this question at SO.)
AFAIK, there is no such product but you have two options:
Implement your own byte-code to C transpiler. Byte-code is pretty simple, this isn't too hard.
If you just want a native binary (i.e. when you don't need the C source code), then give GCJ a try.
Note: If you're doing this for performance reasons, then you're going to be disappointed. Java is generally as fast as C/C++. Moreover, improvements to the VM will make all Java code faster but not your native binary. Compiling the code will just give you a little better startup time.
Not really an answer to my own question, but how does Oracle do it?
http://download.oracle.com/docs/cd/B28359_01/java.111/b31225/chone.htm#BABCIHGA
There used to be a product called TowerJ, which was essentially a "via C" static compiler for Java, but it is long gone.
I was told that Sun Labs has created something like this as part of the Sun SPOT project, but I am not sure if it is public.
@BobMcGee: In the benchmarks you refer to, GCJ indeed loses, but Excelsior JET (which is a 32-bit AOT compiler) beats the 32-bit HotSpot on all three test systems, so I am not sure what your point was.
But, after all, there are lies, damn lies, and benchmarks. :)
