I’ve been reading this fantastic book called “Crafting Interpreters” lately by Bob Nystrom.
It talks about the different types of programming languages. We can think about all compilers as falling into three categories.
- Compilers that output machine code.
- Compilers that output bytecode and need to be interpreted.
- Compilers that output an intermediary format and then hand it off to another compiler.
There is a lot I find interesting about the book so far, but one thing that I was fascinated to read about the differences between “compiled” and “interpreted” languages.
Bob talks about if you are building a new language, you need to need to:
- Build a parser to handle your new syntax and build an AST.
- Write any transformations or optimizations.
- Build a generator to generate output code.
This third step is the really interesting one!
An example of a compiler that does #3 from above (Compilers that output an intermediary format and then hand it off to another compiler) would be Babel. Babel basically works like this.
- Turn it into an AST.
- Transform the AST into the equivalent ES5 AST.
If you are building a language that must run on multiple processor architectures (let’s say x86 and ARM), you’ll need to write two different “back ends” or “generators.” These will both take the same AST your compiler created and output CPU specific instructions (this is a lot of work!). These are what we refer to as “compiled” languages. Every compiled language has a custom generator for each CPU architecture they support.
Another approach would be to build a single generator. Instead of having it generate code for an existing CPU architecture, you could have it generate code for a “virtual” or “hypothetical” CPU. This is what Java does and we refer to the output as “bytecode.”
If you create an interpreted language, you’ll need to also build a “virtual machine” so that your hypothetical CPU code can run on real CPUs. This is where things are kind of neat but a little confusing.
If you build a VM (something capable of executing your generated bytecode) in an existing language, then the VM can run on any CPU architecture that compiles that language. So you can:
- Write in a custom langauge
- Compile it to your custom bytecode
- Make a small VM in a popular language like C
- Now your custom language has only one generator and can run anywhere that C can compile!
I thought this was an interesting distinction and made the different approaches C++ and Java took seem a lot more like a reasonable trade-off as opposed to a mystery.
Written by Jon Kuperman living in Florida working on Adobe's Creative Cloud. You should follow him on Twitter