Say you want to create a new programming language, and you want it to have garbage collection, advanced object-oriented features, etc. but don’t want to go through the trouble of implementing those features in your compiler from scratch. In fact, the more you think about it, the more your idea sounds like “Java, but with X”.
Well luckily for you, there exists the Polyglot Extensible Compiler Framework, which is a Java 7 compiler that can be modified to add additional language features or change the behavior of features already present in Java.
I’m Interested, Tell Me More
Polyglot is a compiler frontend, or a source-to-source compiler - this means that it takes in Java source code, typechecks it, and outputs Java source code. By extending Polyglot, you can make a compiler that takes in your new language’s source code, typechecks it, and translates it into another language (typically Java, but other targets such as LLVM are possible).
For the past few months, my research project has been building a compiler using Polyglot. This extension is relatively complex - it adds several new language features and even an entirely new type system layered on top of Java’s type system, and outputs Java source code with generated classes to interface with the custom runtime.
Despite the complexity of this extension, I’ve written less than 5000 lines of code for this new compiler. Since the base language for Polyglot is Java 7, features like classes and polymorphism are supported out of the box, making the new language practically useful for writing programs, as opposed to just a toy language used as proof-of-concept.
The best references for learning/using Polyglot are probably the most recent tutorial, the paper, and the Polyglot source code. The tutorial is pretty in-depth and the source code interfaces are well-documented, but one thing I found lacking when I was learning how to work with Polyglot was high-level guidance for how to implement large extensions.
The example extensions in the tutorial and the
/examples directory of the repository (some of which are described in the paper) are pretty simple. It makes sense to limit the complexity of examples presented in a paper or tutorial, but it makes planning out a non-trivial Polyglot extension a daunting task - sometimes it’s hard to know where to start or what you need to do.
Non-trivial examples do exist - there’s a list of some other projects that use Polyglot on the website, some of which have source code available. However, they are not ideal for learning from due to two reasons:
- The first reason is complexity - the larger extensions are a little too large to easily digest. For example, the Jif extension described in the Polyglot paper comprises several hundred files in half a dozen packages, making it nearly as large as Polyglot itself. Unlike the examples from the paper and the tutorial, there’s no in-depth guide explaining the implementation of these extensions.
- The second reason is age - Polyglot has been around for almost 2 decades now, and many major additions/changes have been made since then, including adding support for Java 5 and Java 7, and adding a language dispatcher abstraction. Many of the example extensions are from the earlier days of Polyglot, so they extend only Java 1.4 and don’t make use of Polyglot’s newer features & abstractions.
That said, it’s relatively easy to dive into Polyglot by following the tutorial and learning via making small syntax extensions. After that, it should be easier to scale up into larger extensions that change the type system. It took me approximately 2-3 months of tinkering with Polyglot part-time before I realized the best way to implement my extension and refactored my existing code to follow better patterns.
While writing the extension, I got to experience a lot of the benefits of Polyglot’s architecture, but also some of the drawbacks and tradeoffs.
One of the nice things about Polyglot is that the amount of code you have to write is proportional to how much is changed from the base language. That doesn’t mean there isn’t any boilerplate, it just means that you have to write the same amount of boilerplate for each AST node you add. Some of the boilerplate is partly due to intentional design decisions.
To support extensibility, almost every class has its own interface in a separate file. This allows developers to take advantage of multiple inheritance in interfaces to easily add new nodes that implement the same operations as existing nodes, without having to extend or modify any existing nodes. However, it essentially doubles the number of files in the project and sometimes hinders navigating to the actual implementation of a method via IDE.
As alluded to earlier, extensions that affect the type system are more challenging to implement, because most of Polyglot’s flexibility is in adding/changing AST nodes. Although operations on types can be overridden the same way as AST operations, the systems for creating new types and resolving package & class names are difficult to change, and developers looking to make major changes/additions to the type system should be prepared to override a significant fraction of the methods in the 2000-line type system class.
One thing that might trip up new developers is Polyglot’s immutable AST. During each pass, the original node is not changed; any changes are made on a deep-copy of the node, and that modified node is output to the next pass. This allows developers to reason about passes as functions that take in an AST and output a new AST, without side effects on the original AST. While this is a nice feature, it feels more at home in functional languages where objects are by default immutable. The most common errors I encountered were null-pointer exceptions caused by extension objects not being copied properly between passes, but perhaps that is due to errors on my part.
Debugging Polyglot extensions didn’t present any major issues; Polyglot’s passes have good error handling and reporting, so user code errors can be easily identified. If the compiler crashes, Java’s stack trace is enough to pinpoint the source of the error, although there tends to be a lot of junk on the stack trace due to operations being dispatched between the language dispatcher and nodes/visitors/extension objects.
Overall, my experience working with Polyglot was positive - it wasn’t always easy or straightforward to figure out how to support each new language feature, but it was way better than having to write a compiler from scratch and rebuild all the OOP features that come for free in Java.
Polyglot is most useful when your language has largely the same features, semantics, and syntax to Java. If your language has completely different syntax and semantics from Java, Polyglot may be more trouble than it’s worth.
If you want a broad overview of how to write Polyglot extensions, check out my Guide to Writing Polyglot Extensions.