This is a 2 min read ยท Published on 08 Nov 2019

I've been reading this fantastic book called "Crafting Interpreters" lately by Bob Nystrom.

It talks about the different types of programming languages. We can think about all compilers as falling into three categories.

  1. Compilers that output machine code.
  2. Compilers that output bytecode and need to be interpreted.
  3. Compilers that output an intermediary format and then hand it off to another compiler.

There is a lot I find interesting about the book so far, but one thing that I was fascinated to read about the differences between "compiled" and "interpreted" languages.

Bob talks about if you are building a new language, you need to need to:

  1. Build a parser to handle your new syntax and build an AST.
  2. Write any transformations or optimizations.
  3. Build a generator to generate output code.

This third step is the really interesting one!

An example of a compiler that does #3 from above (Compilers that output an intermediary format and then hand it off to another compiler) would be Babel. Babel basically works like this.

  1. Let users author ES6+ JavaScript.
  2. Turn it into an AST.
  3. Transform the AST into the equivalent ES5 AST.
  4. Convert the AST to JavaScript.
  5. Hand that JavaScript off to a different compiler (the one in your browser).

So, in this case, the "output code" that Babel generates is just JavaScript! But what if we wanted to make a language that actually generated CPU instructions?

Compiled vs. Interpreted

Compiled

If you are building a language that must run on multiple processor architectures (let's say x86 and ARM), you'll need to write two different "back ends" or "generators." These will both take the same AST your compiler created and output CPU specific instructions (this is a lot of work!). These are what we refer to as "compiled" languages. Every compiled language has a custom generator for each CPU architecture they support.

Interpreted

Another approach would be to build a single generator. Instead of having it generate code for an existing CPU architecture, you could have it generate code for a "virtual" or "hypothetical" CPU. This is what Java does and we refer to the output as "bytecode."

If you create an interpreted language, you'll need to also build a "virtual machine" so that your hypothetical CPU code can run on real CPUs. This is where things are kind of neat but a little confusing.

If you build a VM (something capable of executing your generated bytecode) in an existing language, then the VM can run on any CPU architecture that compiles that language. So you can:

  1. Write in a custom langauge
  2. Compile it to your custom bytecode
  3. Make a small VM in a popular language like C
  4. Now your custom language has only one generator and can run anywhere that C can compile!

I thought this was an interesting distinction and made the different approaches C++ and Java took seem a lot more like a reasonable trade-off as opposed to a mystery.


Subscribe to my email list!

Let me be real with you. Sometimes when I'm bored I log in to Buttondown and look at the audience number. If it's bigger than it was the week before, well that makes me feel really happy! I promise I'll never spam you and I will, at most, send you a monthly update with what's happening on this site.