DANNY YANG

about me · blog · projects · hire me

Guide to Writing Polyglot Compiler Extensions

23 Jul 2020 - 1784 words - 8 minute read - RSS

This post is intended to be a supplement for the 2014 Polyglot tutorial. First, I’ll briefly go over the structure and naming conventions for Polyglot, focusing on clarifications and corrections from the tutorial. Then, I’ll discuss at a high level several ways we might want to extend Polyglot, and what we need to do to make it happen.

The rest of this post assumes you’ve at least skimmed the Polyglot tutorial; I will frequently refer back to the Carray extension from that tutorial.

Code Structure #

Recommended readings:

Polyglot has several naming conventions, which are not all listed in the same place in the tutorial. They are as follows:

  • File names for an extension are prefixed with the name of the extension.
  • Most classes have their own separate interface for extensibility. Concrete classes that implement a particular interface use the same name suffixed with _c. For example, the type system class for the Java 7 extension is named JL7TypeSystem_c, and its interface is JL7TypeSystem. Thus, creating a CarrayTypeSystem_c that extends JL7TypeSystem_c also requires creating a corresponding CarrayTypeSystem interface that extends JL7TypeSystem.

These conventions will be important because writing a Polyglot extension doesn’t just involve writing new, self-contained methods; extension code will import, implement, extend, or call on many classes from the base Polyglot compiler.

AST & Extension Objects #

Abstract Extension Factory: #

In the carray.ast package, the Extension Factory is actually split between CarrayAbstractExtFactory and CarrayExtFactory, with the latter extending the former. It’s not listed on the extension code structure overview, but shows up in a later section. The Abstract Extension Factory is a boilerplate class that adds support for attaching extension objects to AST nodes added in this extension.

TypeNode vs Type: #

Classes extending TypeNodes are AST nodes which represent user-written type annotations which are parsed. For example, formal parameters and local declarations each contain a TypeNode. During the Type Building, Disambiguation, and Type Checking passes, the TypeNode is resolved to an actual Java type (a Type object), which can be accessed using the type() method of the TypeNode. After the typechecking pass, each Expr node will also hold a Type.

Ext classes: #

These classes define extension objects, which are all subclasses of the default extension object for that extension (in our case, they all extend CarrayExt). Extension objects do not follow the typical _c naming convention. For example, if we wanted to create an extension object that holds extra information specifically for the LocalDecl node, we would just create a class called CarrayLocalDeclExt.

While the lack of extension object interfaces appears to conflict with the goal of extensibility, this is actually an intentional design choice. Extension objects from different extensions should not be part of the same typing hierarchy. If our Polyglot extension builds on top of another another extension (for example, the Java 7 extension), then there there’s already a JL7Ext attached to each AST node. If CarrayExt extends JL7Ext, then there will be 2 JL7Exts attached to each node, which is not desirable.

In general, our extension object should only contain the implementation for functionality exclusive to our extension, since it can call the “parent” extension’s implementation through the language dispatcher. Every extension object has a pointer to the current extension’s language dispatcher (through the lang() function) and the parent extension’s language dispatcher (through the superLang() function). To invoke the parent extension’s operations for a particular node, we should call superLang().typeCheck(...) as opposed to the usual super.typeCheck(...).

Ops classes: #

These interfaces define operations supported by AST nodes. In Polyglot, a few examples include NodeOps, CallOps, and TryOps.

In an extension, there should be a default ops class which defines operations added to all nodes by the extension (in the case of Carray, it would be CarrayOps). If our extension doesn’t add any operations, then the interface can be empty. Additional interfaces can be added to define operations which are not supported by all nodes. For example, if we add an operation that only applies to expressions, we can put it in CarrayExprOps.

Type System #

Polyglot’s Type System isn’t designed to be as extensible as its AST nodes (there’s no concept of extension objects in the type system), so you’ll have to extend the classes directly if we want to modify or add functionality.

Instance classes: #

These classes represent typing information for local variables, methods, constructors and more. They are created by factories in the Type System class.

There is a single MethodInstance for each method that is declared in a class, and after method call nodes are resolved the Call node itself will contain a pointer to that MethodInstance. Similarly, each local variable declaration is associated with a LocalInstance, and all further usages of that local variable contain pointers to that LocalInstance.

Creation and resolution of Instances happens during the Type Building, Disambiguation, and Type Checking passes.

How To Extend Polyglot #

Adding new syntax: #

  • Create classes/interfaces for the new AST nodes.
  • Update the Abstract Extension Factory to handle the new nodes.
  • If our extension added new operations for AST nodes, see “Adding fields/operations to AST nodes” and treat the new nodes the same as existing nodes.
  • Update the Node Factory to add methods to construct the new nodes.
  • Modify the flex lexer file if necessary (if new tokens or keywords are needed).
  • Modify the ppg parser file to account for any changes to the lexer, and add constructions for the new nodes by calling the new methods in the Node Factory.

Adding fields/operations to AST nodes: #

If the field/operation should be supported by all AST nodes:

  • Just add it to CarrayExt and (if it is a method) CarrayOps

If the field/operation isn’t supported by all AST nodes:

  • Create a new extension class. For example, if we’re adding a new operation to all statements, we would create a CarrayStmtExt that extends CarrayExt.
  • Add the field/operation to the new class.
  • Modify CarrayExtFactory and override the corresponding factory method to return the new extension class. In our example, we would override extStmt.
  • To avoid casting a lot when we use extension objects, I’d recommend writing 2 getter methods. The first method gets the extension object from the node (in our example, a static method with type signature Stmt -> CarrayStmtExt). The second method overrides the Ext.node method to return a Stmt instead of a Node.

Extending the Type System #

In general, adding new types or type system operations is similar to adding new AST nodes / operations, except the factory methods for types are in the Type System class instead of the Node Factory. New operations should be added to both the Type System class and the type classes themselves; oftentimes (but not always), the former simply delegates to the latter. If operations are being added to existing types, we will have to add the new operation by subclassing each existing type class, since extension objects do not exist for types.

Adding new types presents additional challenges if you’re building on top of Polyglot’s Java 5 or Java 7 extensions, because you’ll need to make sure that the new type interacts correctly with generics and the type inference mechanism.

The Type System class has a lot of methods that do things like checking if method calls are valid, finding the correct version of an overloaded method, or checking if a class implements a particular interface. In a larger extension, many of these operations will need to be overridden, which may require a lot of engineering effort. Larger extensions that I’ve seen have Type System classes that are 1000-3000 loc, which is similar in length to Polyglot’s base TypeSystem_c class. This suggests that many operations cannot simply be delegated to the parent class’s implementation and must be re-implemented from scratch.

Extending the Type System: Alternative Approach #

In some cases, we can bypass the engineering complexity of rewriting/extending the numerous type system operations by storing additional typing information in the AST nodes themselves. This approach leverages the AST’s superior extensibility, and is the approach I used in my compiler for research. It is appropriate if our additional type system mechanics do not interact much with Java’s type system - as in, if it could be viewed as a new type system layered on top of Java’s.

In this case, you’d need to create extension objects that contain extra fields for this typing information, and override the buildTypes and typeCheck methods to check this information before delegating to the parent extension’s implementation.

If we need to store extra information about each method or variable or class, then we have to extend the type system’s Instance classes and their factory methods. For this, we can consult the source code of Polyglot’s Java 5 extension for an example of extending MethodInstance to support throw types and generics.

Generating Additional Java classes: #

The Polyglot tutorial discusses how to generate translations of added AST nodes relatively well. In particular, the tutorial covers the QuasiQuoter feature, which is very important for conciseness, as nesting AST node constructors gets very verbose.

I have two additional tips for codegen:

  • When generating a new AST node that doesn’t cleanly map to the original source, we can set its position with Position.COMPILER_GENERATED.
  • To generate multiple Java source files from a single extended language source file, we can create an extension for the SourceFile node and override the extRewrite method to return a SourceCollection.

Conclusion #

The overview I gave here represents the guidance that I wish I had when I started working with Polyglot. I hope that it’s useful for anyone that’s planning to write meaningful extensions with Polyglot in the future.

For those that want to learn more, I’ve written a post about my experience writing an extension with Polyglot.



github · linkedin · email · rss