byte code, libraries and Java - java

If I wanted to create a new language for Java, I should make a compiler that is able to generate byte code compatible with the JVM spec, right? And should it also be compatible with the JDK libraries?
Where can I find some info?
Thanks.

Depends what you mean by "create a new language for Java" -- do you mean a language that compiles to bytecode and the code it generates can be used from any Java app (e.g. Groovy) or an interpreted language (for which you want to write a parser in Java)?
If it is the former, then @Joachim is right: look at the JVM spec. For the latter, look at the likes of JavaCC for creating a parser for your language grammar.

I would start with a compiler which produced Java source. You may find this easier to read/understand/debug. Later you can optimise it to produce byte code.
EDIT:
If you have features which cannot be easily translated to Java code, you should be able to create a small number of byte code classes using Jasmin with all the exotic functionality which you can test to death. From the generated Java code this will look like a plain method call. The JVM can still inline the method so this might not impact performance at all.
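As a rough sketch of the "compile to Java source first" idea: the toy syntax (`let` and `print` statements) and the class name ToyToJava below are made up for illustration and are not part of any real tool. The translator maps each line of the toy language to one Java statement and wraps the result in a class that javac could then compile.

```java
import java.util.ArrayList;
import java.util.List;

final class ToyToJava {
    /** Translates one statement of the hypothetical toy language into one Java statement. */
    static String translateStatement(String line) {
        String s = line.trim();
        if (s.startsWith("let ")) {
            // "let x = 2 + 3"  ->  "int x = 2 + 3;"
            return "int " + s.substring(4).trim() + ";";
        }
        if (s.startsWith("print ")) {
            // "print x * 10"  ->  "System.out.println(x * 10);"
            return "System.out.println(" + s.substring(6).trim() + ");";
        }
        throw new IllegalArgumentException("Unknown statement: " + line);
    }

    /** Wraps the translated statements in a class with a main method. */
    static String translateProgram(String className, List<String> lines) {
        StringBuilder out = new StringBuilder();
        out.append("public class ").append(className).append(" {\n");
        out.append("    public static void main(String[] args) {\n");
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            out.append("        ").append(translateStatement(line)).append("\n");
        }
        out.append("    }\n}\n");
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> program = new ArrayList<>();
        program.add("let x = 2 + 3");
        program.add("print x * 10");
        System.out.print(translateProgram("Generated", program));
    }
}
```

The generated text is ordinary Java, so you can read it, step through it in a debugger, and only later swap this back end for one that emits byte code directly.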

The Java Virtual Machine Spec should have most of what you need.

An excellent library for bytecode generation/manipulation is ASM: http://asm.ow2.org.
It is very versatile and flexible. Note, however, that its API is based on events (similar to SAX parsers): it reads .class files and invokes a method whenever it encounters a new entity (class declaration, method declaration, instructions, etc.). This may seem a bit awkward at first, but it saves a lot of memory compared to the alternative, in which the library reads the input, spits out a fully built tree structure, and then leaves you to iterate over it.

I don't think this will help much practically, but it has a lot of sweet theoretical stuff that I think you'll find useful.
http://www.codeproject.com/KB/recipes/B32Machine1.aspx

Related

Writing languages for the JVM

Suppose I write a programming language; for namesake, I'll call it lang.
To begin the long journey of writing lang, I decide to start by writing lang in itself. I can't actually run it, because there's nothing to run the program that runs itself.
So I begin by writing another compiler for lang in Java. This time, when I am done, I decide to convert it to Bytecode, and leave it at that. I now have a working compiler, which will convert all my lang code into Bytecode.
So I decide to plug in my self-compiler for the language, into the compiler I just made in Java. I then convert the self-compiler to Bytecode, and chuck out the Java compiler. I now have a lang compiler, purely written in itself, converted into Bytecode, ready for use.
This creates a solid program, and I understand all of this, but my question is, relative to compiler design for the JVM, what if I decide to release an update for my language? How do I go about updating the Bytecode? Do I simply re-write the updated version of the language in the older one?
I ask this because this is what I want to do. Write a non-existing language in itself, and then bootstrap it to the JVM by firstly creating a compiler in Java.
It's the same as what was done with C++. C with Classes was written, and then C++ in it, and finally C with Classes was abandoned for the bootstrapped C++. But then how on earth did they ever go about updating the language?
I'll answer this from two possible scenarios in your development. With any byte-code language at any time you can update the virtual machine or the language.
Suppose first you wanted to update your language to have new syntax or change the current semantics. Then you'd keep your current compiled compiler written in lang (compiler A) and edit its source so that it can correctly compile your new features. Then you compile your compiler using the old one giving you compiler B. If necessary, you can now rewrite the compiler to use the new features and then compile it using compiler B to give you compiler C.
What if the JVM changes? Well in that case you keep an old version of the JVM around, adjust your compiler to cope with the new bytecode changes, and then compile it with the old one (this is analogous to compiler B from before). That will get you a compiler that compiles to the new bytecode but runs on the old VM. The next step is get it to compile itself, and now you have a new compiler that runs on the new VM (analogous to compiler C).
I don't think your compiler is the best way to go about this.
I'd start with a grammar for my language.
Next comes the lexer/parser to turn expressions in my language to an abstract syntax tree (AST). The AST is a correct intermediate representation of the expression.
You would emit bytecode or assembly language instructions for the virtual machine or processor of your choice by writing a code generator that traverses the AST.
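A minimal sketch of that last step, assuming the parser has already produced the AST. The node classes (Num, BinOp) and the instruction names (push, add, mul) are invented for this example and are not real JVM mnemonics. The code generator is just a post-order walk over the tree.

```java
import java.util.ArrayList;
import java.util.List;

final class AstCodegen {
    /** Every AST node knows how to append its own instructions. */
    interface Expr {
        void emit(List<String> out);
    }

    /** Integer literal. */
    static final class Num implements Expr {
        final int value;

        Num(int value) {
            this.value = value;
        }

        public void emit(List<String> out) {
            out.add("push " + value);
        }
    }

    /** Binary operation; operands are emitted first, then the operator. */
    static final class BinOp implements Expr {
        final char op;
        final Expr left;
        final Expr right;

        BinOp(char op, Expr left, Expr right) {
            this.op = op;
            this.left = left;
            this.right = right;
        }

        public void emit(List<String> out) {
            left.emit(out);
            right.emit(out);
            switch (op) {
                case '+': out.add("add"); break;
                case '*': out.add("mul"); break;
                default: throw new IllegalArgumentException("Unsupported operator: " + op);
            }
        }
    }

    /** Traverses the tree and returns the instruction list for a stack machine. */
    static List<String> generate(Expr root) {
        List<String> out = new ArrayList<>();
        root.emit(out);
        return out;
    }

    public static void main(String[] args) {
        // AST for (1 + 2) * 3, built by hand in place of a parser.
        Expr ast = new BinOp('*', new BinOp('+', new Num(1), new Num(2)), new Num(3));
        for (String instruction : generate(ast)) {
            System.out.println(instruction);
        }
    }
}
```

With this split, a change to the language syntax touches the grammar and parser, while a change of target only touches the emit methods.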
Where does your update happen?
If it's language fundamentals, you have to modify both the grammar and the bytecode emission.
If you're optimizing the bytecode or porting to a new processor you have to modify the code generator.
The first lang compiler can be written in a subset of lang. And you only need a compiler (or even an interpreter) for that subset to bootstrap it. This can be written in Java.
Later, more extensive compilers can be written in lang. Newer versions can be, too.
You could even write a translator that converts a lang program to java, and use that to create a first translator in lang, and then turn it into a bytecode compiler.

Assembly Programming on the JVM?

I was kind of curious whether it is possible to do assembly programming on the JVM, in a similar fashion to using NASM with C.
After a quick Google search to see if assembly language programming on the JVM is possible, I was surprised to find some results.
Has anyone tried doing something like this before?
I'm also wondering if there is any assembly support for Clojure or Scala.
Invoking Assembly Language Programming from Java
minijavac: not in English, but it looks like it uses some kind of NASM support.
Assembly is usually used in C either a) to access instructions that C doesn't generate or b) for lower-level performance tuning.
As byte code is designed for Java, there aren't any useful byte code instructions that the Java compiler doesn't generate.
The JVM looks for common patterns in the byte code generated by the compiler and optimises for those. This means that byte code you write yourself is likely to be less optimised, i.e. slower, unless it is the same as what the compiler would produce.
Write a JNI library in C with inline assembly in it.
In theory, you could write a JNI-compliant library in pure assembly, but why bother?
I'd like to point to another solution: generating assembly code at runtime from your java program.
Some (long) time ago there was a project called softwire, written in C++, that did exactly that. It (ab)used method and operator overloading to create a kind of C++ DSL that closely resembles x86 assembly and that, behind the scenes, would assemble the corresponding machine code. The main goal was to be able to dynamically assemble a routine customized for a specific configuration, while eliminating nearly all the branching (the routine would be reassembled if the configuration changed).
This was an excellent library and the author used it to great effect to implement a software renderer with shading support (shaders were dynamically translated to x86 assembly and then assembled, all at runtime), so this was not just a crazy idea. Unfortunately, he was hired by a company and the library was acquired in the process.
Today, to follow such a route you could create a JNI binding to DynASM (that alone is probably no small task) and use it to assemble at runtime. If you are willing to use Scala over Java, you can even relatively easily create a DSL à la softwire that will, under the hood, generate the assembly source code and pass it to DynASM.
Sounds like fun :-)
No reason to be bored anymore.
Are you looking for something like the Jasmin project? For some reason, minijava always reminds me of the Jasmin parser...

Is there a programming language that would have:

I am curious about such thing... Is there a programming language that would have:
syntax such as Java and/or C++
templates/generics support
manual memory management (no garbage collection)
"clean syntax" (no mess like perl or c/c++)
"normal" OOP (polymorphism, interfaces, abstract classes, overloading, etc.)
(preferably) compiles to machine code
namespace support
exception support
no source preprocessor (as is in c\c++)
statically typed
Maybe Ada? I would advise you to learn C/C++ or Java or something else and use it smartly; then you'll get everything you need.
UPD: You may be interested in D
syntax such as Java and/or C++
"clean syntax" (no mess like perl or c/c++)
So, basically you want syntax such as C++, but you don't want syntax such as C++. It should be obvious that such a language cannot possibly exist, since the intersection of the set of languages that have syntax such as C++ and the set of languages that do not have syntax such as C++ must necessarily be the empty set.
There are also some other requirements that don't make sense, like this one:
(preferably) compiles to machine code
What the compiler produces as its output is a trait of the compiler; it has nothing to do with the language. Every language can be compiled to every other language, provided the target language has at least the same computational power as the source language. (This typically means that the target language must be Turing-complete, since most source languages are Turing-complete.)
What is your need for those features? Or are they things you think you need? Why not find a syntax you think you'll feel comfortable with, since that seems to be the most important thing in your list, and then explore your other application requirements
Vala - designed as unmanaged C# for gnome
D - Built on c but simpler than C++. I think it has some kind of GC though
The newer versions of Delphi. It doesn't have curly-brace syntax, though
I'm betting you'll have a hard time finding a language that meets all your criteria. However, these may be worth looking into:
Go. Clean syntax, compiles to machine code. Has GC, though. And isn't strictly O-O.
Scala addresses many, but not all, of your issues (as mentioned by others in this thread).
Haskell. Functional, not O-O. But worth looking at anyway.
D, also as mentioned by others.
It's definitely Scala. It meets all your points
To put it bluntly: Learn C++ and use it the way it should be used.
Done.
You only get GC issues if you discard objects. Write your application to recycle objects instead and you won't have any garbage to collect.
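A minimal sketch of recycling objects, assuming a single-threaded caller. The class name BufferPool and the choice of StringBuilder as the pooled type are only for illustration. Instead of dropping a buffer after use, you hand it back to the pool, which resets it and gives it out again.

```java
import java.util.ArrayDeque;

final class BufferPool {
    // Buffers that have been handed back and are ready for reuse.
    private final ArrayDeque<StringBuilder> free = new ArrayDeque<>();

    /** Returns a recycled buffer if one is available, otherwise allocates a new one. */
    StringBuilder acquire() {
        StringBuilder sb = free.poll();
        return sb != null ? sb : new StringBuilder(256);
    }

    /** Clears the buffer and keeps it for the next caller instead of letting it become garbage. */
    void release(StringBuilder sb) {
        sb.setLength(0);
        free.push(sb);
    }

    public static void main(String[] args) {
        BufferPool pool = new BufferPool();
        StringBuilder first = pool.acquire();
        first.append("some work");
        pool.release(first);
        StringBuilder second = pool.acquire();
        System.out.println(first == second); // same instance, so no new allocation
    }
}
```

Once the pool has warmed up, the steady state allocates nothing, so the collector has nothing to do.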
You can design an application that only GCs overnight, for example, i.e. zero cost during the day, but some garbage is allowed.
Perhaps you could say what your concern is with having a GC. There may be ways to work around the problem, which would open up languages like C# and Java.
BTW: Java and C# are compiled to machine code at run time.

Programming an Interpreter for a Compiler

I'm writing an interpreter for a compiler program in Java. So after checking the source code, syntax and semantics, I want to be able to run the source code, which is the input for my compiler. I'm just wondering if I can just translate some tokens, for example, out (it prints stuff on screen), can I just replace it with System.out.print? then feed the source code again to run it in java?
I've heard of using the Java Compiler API, would this be a good plan?
Thank you very much in advance!
What you are asking about is a virtual machine implementation technique. To run your code, in general you should implement the following:
The first few steps I guess you have already done (design/describe the language semantics, construct the AST, and perform the required validation of the code).
You need to generate your byte code. Java itself works in exactly the same way: it generates another representation of the source code, turning the human-readable form into a machine-readable one.
Here you can see what Java byte code looks like: http://www.ibm.com/developerworks/ibm/library/it-haggar_bytecode/
You need to implement a virtual machine (a stack machine) that reads the byte code and executes it.
So, as you can see, you should have 3 separate components (projects) for your task:
1. Language grammar
2. Compiler (byte code generator)
3. Virtual machine (interpreter of byte code)
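To give an idea of the third component, here is a minimal sketch of a stack machine. The opcode numbering and the class name TinyVm are invented for this example and have nothing to do with real JVM byte code. The compiler from step 2 would produce the int array that this loop executes.

```java
final class TinyVm {
    // Hypothetical opcodes; PUSH is followed by one operand in the code array.
    static final int PUSH = 0;
    static final int ADD = 1;
    static final int MUL = 2;
    static final int PRINT = 3;
    static final int HALT = 4;

    /** Executes the program and returns everything it printed. */
    static String run(int[] code) {
        int[] stack = new int[64];
        int sp = 0; // index of the next free stack slot
        int pc = 0; // index of the next instruction
        StringBuilder output = new StringBuilder();
        while (true) {
            int op = code[pc++];
            switch (op) {
                case PUSH:
                    stack[sp++] = code[pc++];
                    break;
                case ADD: {
                    int b = stack[--sp];
                    int a = stack[--sp];
                    stack[sp++] = a + b;
                    break;
                }
                case MUL: {
                    int b = stack[--sp];
                    int a = stack[--sp];
                    stack[sp++] = a * b;
                    break;
                }
                case PRINT:
                    output.append(stack[--sp]).append('\n');
                    break;
                case HALT:
                    return output.toString();
                default:
                    throw new IllegalStateException("Bad opcode " + op + " at " + (pc - 1));
            }
        }
    }

    public static void main(String[] args) {
        // Byte code for: print (1 + 2) * 3
        int[] program = { PUSH, 1, PUSH, 2, ADD, PUSH, 3, MUL, PRINT, HALT };
        System.out.print(TinyVm.run(program)); // prints 9
    }
}
```

A real VM adds variables, jumps and calls, but the fetch/decode/execute loop stays the same shape.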
P.S. I have experience creating a tiny Java-like compiler from scratch (defining the grammar with ANTLR, implementing the compiler, implementing the virtual machine), so I can probably share more information with you (even source code) if you need something in particular.
You really need to read some books and/or take courses on compilers - this can't be solved by a two-paragraph answer on SO.
You could create a cross-compiler which reads your language and outputs Java code to do the same thing. This may be the simplest option.
The Java Compiler API can be used to compile Java code. You would need to translate your existing code to Java first to use it.
This would not be the same thing as writing an interpreter. Is this homework? Does the task say you have to write the interpreter or can you have the code run any way which works?
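For the Java Compiler API route, a sketch along these lines should work when run on a JDK (on a bare JRE, ToolProvider.getSystemJavaCompiler() returns null). The helper names CompileAndRun, StringSource and compileAndInvoke are my own; the javax.tools calls are the standard ones. It takes Java source that your translator generated, compiles it into a temporary directory, loads the class and calls a public static no-argument method on it.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import javax.tools.JavaCompiler;
import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;

final class CompileAndRun {
    /** Keeps generated source in memory so no .java file has to be written. */
    static final class StringSource extends SimpleJavaFileObject {
        private final String code;

        StringSource(String className, String code) {
            super(URI.create("string:///" + className.replace('.', '/') + ".java"), Kind.SOURCE);
            this.code = code;
        }

        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    /** Compiles the source, loads the class and invokes a public static no-arg method. */
    static Object compileAndInvoke(String className, String source, String methodName) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("No system compiler available; run this on a JDK");
        }
        try {
            Path outDir = Files.createTempDirectory("generated-classes");
            List<String> options = Arrays.asList("-d", outDir.toString());
            List<JavaFileObject> units =
                    Collections.<JavaFileObject>singletonList(new StringSource(className, source));
            Boolean ok = compiler.getTask(null, null, null, options, null, units).call();
            if (!ok) {
                throw new IllegalStateException("Generated source did not compile");
            }
            try (URLClassLoader loader = new URLClassLoader(new URL[] { outDir.toUri().toURL() })) {
                return loader.loadClass(className).getMethod(methodName).invoke(null);
            }
        } catch (IOException | ReflectiveOperationException e) {
            throw new IllegalStateException("Could not compile or run generated code", e);
        }
    }

    public static void main(String[] args) {
        String source = "public class GeneratedHello {"
                + " public static String run() { return \"hello from generated code\"; } }";
        System.out.println(compileAndInvoke("GeneratedHello", source, "run"));
    }
}
```

Note that the generated class and method have to be public, otherwise the reflective call fails.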
Unfortunately, you did not mention which scripting language you are planning to support. If it is one of the well-known languages, just use an existing interpreter for it written in pure Java. See BSF and the Java scripting API (javax.script, added in Java SE 6): http://www.ibm.com/developerworks/java/library/j-javascripting1/
If it is your own language,
think twice: do you really need it?
If you are sure you need your own language, think about JavaCC.
First of all, thank you very much for the fast replies.
As part of our compiler project, we need to be able to compile and run a program written in our own specified language. The language is very similar to C. I am confused on how an interpreter works, is there a simpler way to implement this? Without generating byte codes? My idea was to translate each statement into Java equivalent statements, and let Java handle the byte code generation.
I would look into the topics mentioned. Again, thank you very much for the suggestions.

Are there any examples of code that is difficult to decompile?

Sometimes when decompiling Java code, the decompiler doesn't manage to decompile it properly and you end up with little bits of bytecode in the output.
What are the weaknesses of decompilers? Are there any examples of Java source code that compiles into difficult-to-decompile bytecode?
Update:
Note that I'm aware that exploiting this information is not a safe way to hide secrets in code, and that decompilers can be improved in the future.
Nonetheless, I am still interested in finding out what kinds of code fox today's crop of decompilers.
Any Java byte code that's been through an obfuscator will have "ridiculous" output from the decompiler. Also, when you have other languages like Scala that compile to JVM byte code, there's no rule that the byte code must be easily representable in Java, and it likely isn't.
Over time, decompilers have to keep up with the new language features and the byte code they produce, so it's plausible that new language features are not easily reversed by the tools you're using.
Edit: As an example in .NET, the following code:
lock (this)
{
    DoSomething();
}
compiles to roughly this:
Monitor.Enter(this);
try
{
    DoSomething();
}
finally
{
    Monitor.Exit(this);
}
The decompiler has to know that C# (as opposed to any other .NET language) has a special syntax dedicated to exactly those two calls. Otherwise you get unexpected (verbose) results.
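The same thing happens in Java. As a sketch (the class and method names are just for illustration): the two methods below behave identically, and javac turns the enhanced for loop into an explicit iterator loop very much like the second method. A decompiler has to recognise that pattern to give you back the first form; otherwise you get the second.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class ForEachDesugar {
    /** What you write: the enhanced for loop. */
    static int sumWithForEach(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    /** Roughly what the compiler generates: an explicit iterator loop. */
    static int sumWithIterator(List<Integer> values) {
        int total = 0;
        Iterator<Integer> it = values.iterator();
        while (it.hasNext()) {
            int v = it.next();
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(1, 2, 3);
        System.out.println(sumWithForEach(values) == sumWithIterator(values));
    }
}
```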
The JDBC type-4 drivers for DB2 Connect are classics. Everything called one or two-letter names, irrelevant code that ends up having no effect, and more. I once tried to take a look to debug a particularly annoying problem and basically gave up. I'm hoping (but by no means confident) that this was passed through an obfuscator rather than the code actually looking like that.
Another favorite trick (although I can't remember the product) was to rename all objects to be constructed from the set {'0','O','l','1'}, which made reading it very difficult.
Assuming you can decompile back to a reasonable style of source code (you can't always do that), what is hard to "reverse engineer" are algorithms that operate in unfamiliar problem domains. If you don't understand Fast Fourier transforms, it doesn't matter much if you can get back the code that implements an FFT Butterfly.
(If this phrase is unfamiliar to you, I've already won if I encode one. If it is familiar to you, you are a pretty good engineer and probably don't have any interest in reverse engineering code). [Your mileage with North Koreans may vary.]
Java keeps a lot of information in the bytecode (for instance, many names), so it is relatively easy to decompile. Hard-to-decompile bytecode is mostly generated from hard-to-read source code (so that's not really an option). If you really want to obfuscate your code, use an obfuscator that renames all methods and variables to something unrecognizable.
Exceptions are often difficult to decompile.
However, any code which has been obfuscated or has been written in another language is difficult to decompile.
BTW: Why would you want to know this?
Java Bytecode does not correspond directly to Java constructions, so decompiling implies that you know that a certain java byte code sequence corresponds to a Java code construction.
The Soot framework for decompiling java byte code has a lot of information on this, but their webpage is down for me right now.
http://www.sable.mcgill.ca/soot/
