Compiling without Continuations
Luke Maurer Paul Downen
Zena M. Ariola
University of Oregon, USA
{maurerl,pdownen,ariola}@cs.uoregon.edu
Simon Peyton Jones
Microsoft Research, UK
simonpj@microsoft.com
Abstract
Many fields of study in compilers give rise to the concept
of a join point—a place where different execution paths
come together. Join points are often treated as functions
or continuations, but we believe it is time to study them
in their own right. We show that adding join points to a
direct-style functional intermediate language is a simple
but powerful change that allows new optimizations to be
performed, including a significant improvement to list fusion.
Finally, we report on recent work on adding join points to the
intermediate language of the Glasgow Haskell Compiler.
CCS Concepts  • Software and its engineering → Compilers; Software performance; Formal language definitions; Functional languages; Control structures; • Theory of computation → Equational logic and rewriting

Keywords  Intermediate languages; CPS; ANF; Stream fusion; Haskell; GHC
1. Introduction
Consider this code, in a functional language:
if (if e1 then e2 else e3) then e4 else e5
Many compilers will perform a commuting conversion [13], which naïvely would produce:
if e1 then (if e2 then e4 else e5)
else (if e3 then e4 else e5)
Commuting conversions are tremendously important in practice (Sec. 2), but there is a problem: the conversion duplicates e4 and e5. A natural countermeasure is to name the offending expressions and duplicate the names instead:
let { j4 () = e4; j5 () = e5 }
in if e1 then (if e2 then j4 () else j5 ())
else (if e3 then j4 () else j5 ())
We describe j4 and j5 as join points, because they say where execution of the two branches of the outer if joins up again.
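The equivalence underlying this rewrite is easy to check directly in Haskell. In the sketch below (our own illustration, with e1–e5 instantiated as function arguments), the naïve duplicating conversion and the join-point version agree on every input:

```haskell
-- The nested conditional from the text, before and after naming the
-- branches. The names e1..e5, j4, j5 follow the running example.
original :: Bool -> Bool -> Bool -> Int -> Int -> Int
original e1 e2 e3 e4 e5 = if (if e1 then e2 else e3) then e4 else e5

converted :: Bool -> Bool -> Bool -> Int -> Int -> Int
converted e1 e2 e3 e4 e5 =
  let j4 () = e4   -- join point for the 'then' branch
      j5 () = e5   -- join point for the 'else' branch
  in if e1 then (if e2 then j4 () else j5 ())
           else (if e3 then j4 () else j5 ())
```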
The duplication is gone, but a new problem has surfaced: the compiler may allocate closures for locally-defined functions like j4 and j5. That is bad because allocation is expensive. And it is tantalizing because all we are doing here is encoding control flow: it is plain as a pikestaff that the "call" to j4 should be no more than a jump, with no allocation anywhere. That's what a C compiler would do! Some code generators can cleverly eliminate the closures, but perhaps not if further transformations intervene.
The reader of Appel's inspirational book [1] may be
thinking “Just use continuation-passing style (CPS)!” When
expressed over CPS terms, many classic optimizations boil
down to β-reduction (i.e., function application), or arithmetic reductions, or variants thereof. And indeed it turns out that
commuting conversions fall out rather naturally as well. But
using CPS comes at a fairly heavy price: the intermediate
language becomes more complicated, some transformations
are harder or out of reach, and (unlike direct style) CPS
commits to a particular evaluation order (Sec. 8).
Inspired by Flanagan et al. [10], the reader may now be
thinking “OK, just use administrative normal form (ANF)!”
That paper shows that many transformations achievable in
CPS are equally accessible in direct style. ANF allows an
optimizer to exploit CPS technology without needing to
implement it. The motto is: Think in CPS; work in direct
style.
But alas, a subsequent paper by Kennedy shows that there
remain transformations that are inaccessible in ANF but fall
out naturally in CPS [16]. So the obvious question is this:
could we extend ANF in some way, to get all the goodness
of direct style and the benefits of CPS? In this paper we say
“yes!”, making the following contributions:
• We describe a modest extension to a direct-style λ-calculus intermediate language, namely adding join points (Sec. 3). We give the syntax, type system, and operational semantics, together with optimising transformations.
• We describe how to infer which ordinary bindings are in fact join points (Sec. 4). In a CPS setting this analysis is called contification [16], but it looks rather different here.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than ACM
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a
fee. Request permissions from Permissions@acm.org.
Copyright is held by the owner/author(s). Publication rights licensed to ACM.
PLDI’17, June 18–23, 2017, Barcelona, Spain
ACM. 978-1-4503-4988-8/17/06...$15.00
http://dx.doi.org/10.1145/3062341.3062380
• We show that join points can be recursive, and that recursive join points open up a new and entirely unexpected (to us) optimization opportunity for fusion (Sec. 5). In particular, this insight fully resolves a long-standing tension between two competing approaches to fusion, namely stream fusion [6] and unfold/destroy fusion [26].
• We give some metatheory in Sec. 6, including type soundness and correctness of the optimizing transformations. We show the safety of adding jumps as a control effect by establishing an equivalence with System F.
• We demonstrate that our approach works at scale, in a state-of-the-art optimizing compiler for Haskell, GHC (Sec. 7). As hoped, adding join points turned out to be a very modest change, despite GHC's scale and complexity. Like any optimization, it does not make every program go faster, but it has a dramatic effect on some.
Overall, adding join points to ANF has an extremely good
power-to-weight ratio, and we strongly recommend it to any
direct-style compiler. Our title is somewhat tongue-in-cheek,
but we now know of no optimizing transformation that is
accessible to a CPS compiler but not to a direct-style one.
2. Motivation and Key Ideas
We review compilation techniques for commuting conver-
sions, to expose the challenge that we tackle in this paper. For
the sake of concreteness we describe the way things work in
GHC. However, we believe that the whole paper is equally
applicable to a call-by-value language.
The Case-of-Case Transformation Consider:
isNothing :: Maybe a -> Bool
isNothing x = case x of Nothing -> True
Just _ -> False
mHead :: [a] -> Maybe a
mHead ps = case ps of [] -> Nothing
(p:_) -> Just p
null :: [a] -> Bool
null as = isNothing (mHead as)
Here null¹ is a simple composition of the library functions isNothing and mHead. When the optimizer works on null, it will inline both isNothing and mHead to yield:
null as = case (case as of [] -> Nothing
(p:_) -> Just p) of
{ Nothing -> True; Just _ -> False }
Executed directly, this would be terribly inefficient; if the argument list is non-empty we would allocate a result Just p only to immediately decompose it. We want to move the outer case into the branches of the inner one, like this:
null as = case as of
  []  -> case Nothing of Nothing -> True
                         Just z  -> False
  p:_ -> case Just p of Nothing -> True
                        Just _  -> False

¹ Haskell's standard null function returns whether a list is empty.
This is a commuting conversion, specifically the case-of-case
transformation. In this example, it now happens that both inner case expressions scrutinize a data constructor, so they can be simplified, yielding
null as = case as of { [] -> True; _:_ -> False }
which is exactly the code we would have written for null from scratch.
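The agreement between the composed and fused definitions is easy to check. Here is a small self-contained version (we rename the functions slightly to avoid clashing with Prelude.null and Data.Maybe.isNothing):

```haskell
-- The library functions from the text, plus the composed version of
-- null and the fused version the optimizer produces.
isNothing' :: Maybe a -> Bool
isNothing' x = case x of { Nothing -> True; Just _ -> False }

mHead :: [a] -> Maybe a
mHead ps = case ps of { [] -> Nothing; (p:_) -> Just p }

nullComposed :: [a] -> Bool
nullComposed as = isNothing' (mHead as)

-- What inlining plus case-of-case yields:
nullFused :: [a] -> Bool
nullFused as = case as of { [] -> True; _:_ -> False }
```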
GHC does a tremendous amount of inlining, including
across modules or even packages, so commuting conversions
like this are very important in practice: they are the key that
unlocks a cascade of further optimizations.
Join Points  Commuting conversions have a problem, though: they often duplicate the outer case. In our example that was OK, but what about
case (case v of { p1 -> e1; p2 -> e2 }) of
{ Nothing -> BIG1; Just x -> BIG2 }
where BIG1 and BIG2 are big expressions? Duplicating these large expressions would risk bloating the compiled code, perhaps exponentially when case expressions are deeply nested [17]. It is easy to avoid this duplication by first introducing an auxiliary let binding:
let { j1 () = BIG1; j2 x = BIG2 } in
case (case v of { p1 -> e1; p2 -> e2 }) of
{ Nothing -> j1 (); Just x -> j2 x }
Now we can move the outer case expression into the arms of the inner case, without duplicating BIG1 or BIG2, thus:
let { j1 () = BIG1; j2 x = BIG2 } in
case v of
p1 -> case e1 of Nothing -> j1 ()
Just x -> j2 x
p2 -> case e2 of Nothing -> j1 ()
Just x -> j2 x
Notice that j2 takes as its parameter the variable bound by the pattern Just x, whereas j1 has no parameters².
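To make the transformation concrete, here is a small executable instance in which placeholder expressions of our own choosing stand in for BIG1 and BIG2; both sides compute the same result:

```haskell
-- Placeholders of our own for the abstract BIG1 and BIG2.
big1 :: Int
big1 = 0

big2 :: Int -> Int
big2 x = x + 1

-- Before: the outer case scrutinizes an inner case directly.
beforeT :: Bool -> Int
beforeT v =
  case (case v of { True -> Nothing; False -> Just 10 }) of
    Nothing -> big1
    Just x  -> big2 x

-- After: the branches are named, and the outer case is duplicated
-- into the arms of the inner case without duplicating big1 or big2.
afterT :: Bool -> Int
afterT v =
  let j1 () = big1   -- join point for BIG1
      j2 x  = big2 x -- join point for BIG2
  in case v of
       True  -> case Nothing of { Nothing -> j1 (); Just x -> j2 x }
       False -> case Just 10 of { Nothing -> j1 (); Just x -> j2 x }
```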
Compiling Join Points Efficiently  We call j1 and j2 join points because you can think of them as places where control joins up again, but so far they are perfectly ordinary let-bound functions, and as such they will be allocated as closures
in the heap. But that’s ridiculous: all that is happening here
is control flow splitting and joining up again. A C compiler
would generate a jump to a label, not a call to a heap-allocated
function closure!
So, right before code generation, GHC performs a simple
analysis to identify bindings that can be compiled as join
points. This identifies let-bound functions that will never be captured in a closure or thunk, and will only be tail-called with exactly the right number of arguments. (We leave the exact criteria for Sec. 4.) These join-point bindings do not allocate anything; instead a tail call to a join point simply adjusts the stack and jumps to the code for the join point.

² The dummy unit parameter is not necessary in a lazy language, but it is in a call-by-value language.
The case-of-case transformation, including the idea of using let bindings to avoid duplication, is very old; for example, both are features of Steele's Rabbit compiler for Scheme [24]. In Rabbit the transformation is limited to booleans, but the discussion above shows that it generalizes very naturally to arbitrary data types. In this more general form, it has been part of GHC for decades [19]. Likewise, the idea of generating more efficient code for non-escaping let bindings is well established in many other compilers [15, 23, 27], dating back to Rabbit and its BIND-ANALYZE routine [24].
Preserving and Exploiting Join Points
So far so good, but
there is a serious problem with recognizing join points only
in the back end of the compiler. Consider this expression:
case (let j x = BIG in
case v of { A -> j 1; B -> j 2; C -> True })
of { True -> False; False -> True }
Here j is a join point. Now suppose we do case-of-case on this expression. Treating the binding for j as an ordinary let binding (as GHC does today), we move the outer case past the let, and duplicate it into the branches of the inner case:
let j x = BIG in
case v of
A -> case j 1 of { True -> False; False -> True }
B -> case j 2 of { True -> False; False -> True }
C -> case True of { True -> False; False -> True }
The third branch simplifies nicely, but the first two do not.
There are two distinct problems:
1. The binding for j is no longer a join point (it is not tail-called), so the super-efficient code generation strategy does not apply, and the compiler will allocate a closure for j at runtime. This happens in practice: we have cases in which GHC's optimizer actually increases allocation because it inadvertently destroys a join point.
2. Even worse, the two copies of the outer case now scrutinize an uninformative call like (j 1). So the extra code bloat from duplicating the outer case is entirely wasted. And it's a huge lost opportunity, as we shall see.
So it is not enough to generate efficient code for join points;
we must identify, preserve, and exploit them. In our example,
if the optimizer knew that the binding for j is a join point, it could exploit that knowledge to transform our original expression like this:
let j x = case BIG of True -> False
False -> True
in case v of
A -> j 1
B -> j 2
C -> case True of { True -> False; False -> True }
This is much, much better than our previous attempt:
• The outer case has moved into the right-hand side of the join point, so it now scrutinizes BIG. That's good, because BIG might be a data constructor or a case expression (which would expose another case-of-case opportunity). So the outer case now scrutinizes the actual result of the expression, rather than an uninformative join-point call. That solves problem (2).
• The A and B branches do not mention the outer case, because it has moved into the join point itself. So j is still tail-called and remains an efficiently-compiled join point. That solves problem (1).
• The outer case still scrutinizes the branches that do not finish with a join point call, e.g. the C branch.
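We can again check the improved transformation on a concrete instance, with a small placeholder of our own standing in for BIG:

```haskell
data V = A | B | C

big :: Int -> Bool
big x = x > 1   -- placeholder for BIG

-- Before: the outer case scrutinizes the result of the inner case.
beforeJ :: V -> Bool
beforeJ v =
  let j x = big x
  in case (case v of { A -> j 1; B -> j 2; C -> True }) of
       { True -> False; False -> True }

-- After: the outer case has moved into j's right-hand side, so j is
-- still tail-called from the A and B branches.
afterJ :: V -> Bool
afterJ v =
  let j x = case big x of { True -> False; False -> True }
  in case v of
       A -> j 1
       B -> j 2
       C -> case True of { True -> False; False -> True }
```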
The Key Idea  Thus motivated, in the rest of this paper we explore the following very simple idea:
• Distinguish certain let bindings as join-point bindings, and their (tail-)call sites as jumps. This, by itself, is not new; see Section 9.
• Adjust the case-of-case transformation to take account of join-point bindings and jumps.
• In all the other transformations carried out by the compiler, ensure that join points remain join points.
Our key innovation is that, by recognising join points as a
language construct, we both preserve join points through
subsequent transformations and exploit them to make those
transformations more effective. Next, we formalize this ap-
proach; subsequent sections develop the consequences.
3. System FJ: Join Points and Jumps
We now formalize the intuitions developed so far by describing System FJ, a small intermediate language with join points. FJ is an extension of GHC's Core intermediate language [19]. We omit existentials, GADTs, and coercions [25], since they are largely orthogonal to join points.
Syntax  System FJ is a simple λ-calculus language in the style of System F, with let expressions, data type constructors, and case expressions; its syntax is given in Fig. 1. System FJ is an explicitly-typed language, so all binders are typed, but in our presentation we will often drop the types. The join-point extension is highlighted in the figure and consists of two new syntactic constructs:
• A join binding declares a (possibly-recursive) join point. Each join point has a name, a list of type parameters, a list of value parameters, and a body.
• A jump expression invokes a join point, passing all indicated arguments as well as an additional result-type argument (as discussed shortly, under "The type of a join point").
Although we use curried syntax for jumps, join points are polyadic; partial application is not allowed.
Static Semantics  The type system for System FJ is given in Fig. 2, where typeof gives the type of a constructor and ctors gives the set of constructors for a datatype.
Terms
  x    term variables            j    label variables
  e, u, v ::= x | l | λx:σ. e | e u
            | Λa. e | e φ                    (type polymorphism)
            | K φ̄ ē                          (data construction)
            | case e of alt̄                  (case analysis)
            | let vb in v                    (let binding)
            | join jb in u                   (join-point binding)
            | jump j φ̄ ē τ                   (jump)
  alt ::= K x̄:σ̄ → u                          (case alternative)

Value bindings and join-point bindings
  vb ::= x:τ = e                             (non-recursive value)
       | rec x̄:τ̄ = ē                         (recursive values)
  jb ::= j ā x̄:σ̄ = e                         (non-recursive join point)
       | rec j̄ ā x̄:σ̄ = ē                     (recursive join points)

Answers
  A ::= λx:σ. e | Λa. e | K φ̄ v̄

Types
  a, b    type variables
  τ, σ, φ ::= a                              (variable)
            | T                              (datatype)
            | σ → τ                          (function type)
            | τ φ                            (application)
            | ∀a. τ                          (polymorphic type)

Frames, evaluation contexts, and stacks
  F ::= □ v                                  (applied function)
      | □ τ                                  (instantiated polymorphism)
      | case □ of p̄ → ū                      (case scrutinee)
      | join jb in □                         (join point)
  E ::= □ | F[E]                             (evaluation contexts)
  s ::= ε | F : s                            (stacks)

Tail contexts
  L ::= □                                    (empty unary context)
      | case e of p̄ → L̄                      (case branches)
      | let vb in L                          (body of let)
      | join j ā x̄:σ̄ = L in L                (join point, body)
      | join rec j̄ ā x̄:σ̄ = L̄ in L            (rec join points, body)

Miscellaneous
  C    general single-hole term contexts
  Σ ::= · | Σ, x:σ = v                       (heap)
  c ::= ⟨e; s; Σ⟩                            (configuration)

Figure 1: Syntax of System FJ.
The typing judgement carries two environments, Γ and Δ, with Δ binding join points. The environment Δ is extended by a join (rules JBIND and RJBIND) and consulted at a jump. Note that we rely on scoping conventions in some places: if Γ; Δ ⊢ e : τ, then every variable (type or term) free in e or τ appears in Γ, and the symbols in Γ are unique. Similarly, every label free in e appears in Δ.
For jumps to truly be compiled as mere jumps, they must not occur in any subterm whose evaluation will take place in some unknown context. Otherwise, the jump would leave the stack in an unknown state. We enforce this invariant by resetting Δ to ε in every premise for such a subterm, for example in the premise for the body e in rule ABS.
The Type of a Join Point  The type given to a join point deserves some attention. A join point that binds type variables ā and value arguments of types σ̄ is given the type ∀ā. σ̄ → ∀r. r (rule JBIND). Dually, a jump applies a join point to some type arguments (to instantiate ā), some value arguments (to saturate the σ̄), and a final type argument (to instantiate r) that specifies the type returned by the jump. We put the universal quantification of r at the end to indicate that the argument types σ̄ do not (and must not³) mention this "return-type" parameter. Indeed, when we introduce the abort axiom (Sec. 3), it will need to change this type argument arbitrarily, which it can only safely do if the type is never actually used in the other parameters.
So a join point's result type r does not reflect the value of its body. What then keeps a join point from returning arbitrary values? It is the JBIND rule (or its recursive variant) that checks the right-hand side of the join point, making sure it is the same as that of the entire join expression. Thus we cannot have
  join j = "Gotcha!" in if b then jump j Int else 4
because j returns a String but the body of the join returns an Int. In short, the burden of typechecking has moved: whereas a function can be declared to return any type but can only be invoked in certain contexts, a join point can be invoked in any context but can only return a certain type.
Finally, the reader may wonder why join points are polymorphic (apart from the result type). In FJ as presented here, we could manage with monomorphic join points, but they become absolutely necessary when we add data constructors that bind existential type variables. We omitted existentials from this paper for simplicity, but they are very important in practice and GHC certainly supports them.
Managing Δ  The typing of join points is a little bit more flexible than you might suspect. Consider this expression:
  (join j x = RHS
   in case v of A → jump j True C2C
                B → jump j False C2C
                C → λc. c) 'x'
where C2C = Char → Char. This is certainly well typed. A valid transformation is to move the application to 'x' into
Γ; Δ ⊢ e : τ

(VAR)     Γ; Δ ⊢ x : τ                            if (x:τ) ∈ Γ

(CON)     Γ; Δ ⊢ K φ̄ ū : T φ̄                      if typeof(K) = ∀ā. σ̄ → T ā
                                                  and Γ; ε ⊢ u : σ{φ̄/ā} for each u, σ

(ABS)     Γ; Δ ⊢ λ(x:σ). e : σ → τ                if Γ, x:σ; ε ⊢ e : τ

(TABS)    Γ; Δ ⊢ Λa. e : ∀a. τ                    if Γ, a; ε ⊢ e : τ

(APP)     Γ; Δ ⊢ e u : τ                          if Γ; Δ ⊢ e : σ → τ and Γ; ε ⊢ u : σ

(TAPP)    Γ; Δ ⊢ e φ : τ{φ/a}                     if Γ; Δ ⊢ e : ∀a. τ

(JUMP)    Γ; Δ ⊢ jump j φ̄ ū τ : τ                 if (j : ∀ā. σ̄ → ∀r. r) ∈ Δ
                                                  and Γ; ε ⊢ u : σ{φ̄/ā} for each u, σ

(VBIND)   Γ; Δ ⊢ let x:σ = u in e : τ             if Γ; ε ⊢ u : σ and Γ, x:σ; Δ ⊢ e : τ

(RVBIND)  Γ; Δ ⊢ let rec x̄:σ̄ = ū in e : τ         if Γ, x̄:σ̄; ε ⊢ u : σ for each u, σ
                                                  and Γ, x̄:σ̄; Δ ⊢ e : τ

(JBIND)   Γ; Δ ⊢ join j ā x̄:σ̄ = u in e : τ        if Γ, ā, x̄:σ̄; Δ ⊢ u : τ
                                                  and Γ; Δ, (j : ∀ā. σ̄ → ∀r. r) ⊢ e : τ

(RJBIND)  Γ; Δ ⊢ join rec j̄ ā x̄:σ̄ = ū in e : τ    if Γ, ā, x̄:σ̄; Δ, j̄ : ∀ā. σ̄ → ∀r. r ⊢ u : τ for each j, u
                                                  and Γ; Δ, j̄ : ∀ā. σ̄ → ∀r. r ⊢ e : τ

(CASE)    Γ; Δ ⊢ case e of K x̄:ν̄ → ū : τ          if Γ; Δ ⊢ e : T φ̄ and ctors(T) = {K̄}
                                                  and, for each alternative, typeof(K) = ∀ā. σ̄ → T ā,
                                                  ν̄ = σ̄{φ̄/ā}, and Γ, x̄:ν̄; Δ ⊢ u : τ

Figure 2: Type system for System FJ.
both the body and the right-hand side of the join, thus:
  join j x = RHS 'x'
  in (case v of A → jump j True C2C
                B → jump j False C2C
                C → λc. c) 'x'
Now we can move the application into the branches:
  join j x = RHS 'x'
  in case v of A → (jump j True C2C) 'x'
               B → (jump j False C2C) 'x'
               C → (λc. c) 'x'
Should this be well typed? The jumps to j are not exactly tail calls, but they can (and indeed must) discard their context (here the application to 'x') and resume execution at j. We will see shortly how this program can be further transformed to remove the redundant applications to 'x', but the point here is that this intermediate program should be well typed. That point is reflected in the typing rules by the fact that Δ is not reset in the function part of an application (rule APP), or in the scrutinee of a case (rule CASE).
Operational Semantics  We give System FJ an operational semantics (Fig. 3) in the style of an abstract machine. A configuration of the machine is a triple ⟨e; s; Σ⟩ consisting of an expression e which is the current focus of execution; a stack s representing the current evaluation context (including join-point bindings); and a heap Σ of value bindings. The stack is a list of frames, each of which is an argument to apply, a case analysis to perform, or a bound join point (or recursive group). Each frame is moved to the stack via the push rule. Since we define evaluation contexts by composing frames (hence F[E] in Fig. 1), the rule has a simple form. Most of the
⟨e; s; Σ⟩ ↦ ⟨e′; s′; Σ′⟩

(push)  ⟨F[e]; s; Σ⟩ ↦ ⟨e; F : s; Σ⟩
(β)     ⟨λx. e; □ v : s; Σ⟩ ↦ ⟨e; s; Σ, x = v⟩
(βτ)    ⟨Λa. e; □ φ : s; Σ⟩ ↦ ⟨e{φ/a}; s; Σ⟩
(bind)  ⟨let vb in e; s; Σ⟩ ↦ ⟨e; s; Σ, vb⟩
(look)  ⟨x; s; Σ[x = v]⟩ ↦ ⟨v; s; Σ[x = v]⟩
(case)  ⟨K φ̄ v̄; case □ of alt̄ : s; Σ⟩ ↦ ⟨u; s; Σ, x̄ = v̄⟩
            if (K x̄ → u) ∈ alt̄
(jump)  ⟨jump j φ̄ v̄ τ; s′ ++ (join jb in □ : s); Σ⟩ ↦ ⟨u{φ̄/ā}; join jb in □ : s; Σ, x̄ = v̄⟩
            if (j ā x̄ = u) ∈ jb
(ans)   ⟨A; join jb in □ : s; Σ⟩ ↦ ⟨A; s; Σ⟩

Figure 3: Call-by-name operational semantics for System FJ.
rules are quite conventional. We describe only call-by-name evaluation here, as rule look shows; switching to call-by-need by pushing an update frame is absolutely standard.
Note that only value bindings are put in the heap. Join points are stack-allocated in a frame: they represent mere code blocks, not first-class function closures. As expected, a jump throws away its context (the jump rule); it does so by popping all the frames from the stack to the binding (as usual, ++ stands for the concatenation of two stacks):
  ⟨join j x = x in case (jump j 2 (Int → Bool)) 3 of . . .; ε; ε⟩
  ↦ ⟨jump j 2 (Int → Bool); □ 3 : case □ of . . . : join j x = x in □ : ε; ε⟩
  ↦ ⟨x; join j x = x in □ : ε; x = 2⟩
Here three frames are pushed onto the stack: the join-point binding, the case analysis, and finally the application of the jump to 3. Then the jump is evaluated, popping the latter two frames, replacing the term with the one from the join point, and binding the argument.
The ans rule removes a join-point binding from the context once an answer A (see Fig. 1) is computed; note that a well-typed answer cannot contain a jump, so at that point the binding must be dead code. Continuing our example:
  ⟨x; join j x = x in □ : ε; x = 2⟩ ↦ ⟨2; ε; x = 2⟩
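The machine of Fig. 3 is straightforward to prototype. The sketch below is our own untyped, monomorphic fragment (no type abstraction and no case); all data types and names here are assumptions of this sketch, not the paper's. It shows the essential behaviour: a jump pops frames until it reaches its join-point binding.

```haskell
-- A toy abstract machine for a fragment of System FJ.
data Expr = Var String | Lam String Expr | App Expr Expr
          | Lit Int
          | Join String String Expr Expr  -- join j x = u in e
          | Jump String Expr              -- jump j v
  deriving (Eq, Show)

data Frame = AppF Expr                 -- [] v
           | JoinF String String Expr  -- join j x = u in []
type Stack = [Frame]
type Heap  = [(String, Expr)]          -- call-by-name: bindings unevaluated

isAnswer :: Expr -> Bool
isAnswer (Lit _)   = True
isAnswer (Lam _ _) = True
isAnswer _         = False

step :: (Expr, Stack, Heap) -> Maybe (Expr, Stack, Heap)
step (App e u, s, h)          = Just (e, AppF u : s, h)            -- push
step (Join j x u e, s, h)     = Just (e, JoinF j x u : s, h)       -- push
step (Lam x e, AppF v : s, h) = Just (e, s, (x, v) : h)            -- beta
step (Var x, s, h)            = (\v -> (v, s, h)) <$> lookup x h   -- look
step (Jump j v, JoinF j' x u : s, h)
  | j == j'                   = Just (u, JoinF j' x u : s, (x, v) : h)  -- jump
step (Jump j v, _ : s, h)     = Just (Jump j v, s, h)  -- jump pops other frames
step (a, JoinF{} : s, h)
  | isAnswer a                = Just (a, s, h)                     -- ans
step _                        = Nothing

run :: Expr -> Expr
run e0 = go (e0, [], [])
  where go c@(e, _, _) = maybe e go (step c)
```

Running the example from the text, run (Join "j" "x" (Var "x") (App (Jump "j" (Lit 2)) (Lit 3))) discards the pending application to 3 and yields Lit 2, mirroring the trace above.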
Optimizing Transformations  The operational semantics operates on closed configurations. An optimizing compiler, by contrast, must transform open terms. To describe possible optimizations, then, we separately develop a sound equational theory (Fig. 4), which lays down the "rules of the game" by which the optimizer is allowed to work. It is up to the optimizer to determine how to apply the rules to rewrite code. All the axioms carry the usual implicit scoping restrictions to avoid free-variable capture. We spell out the side conditions in drop and jdrop because these actually restrict when the rule can be applied rather than merely ensuring hygiene. (In these side conditions, the bv function stands for "bound variables"; for instance, bv(x:σ = e) = {x}.)
The β, βτ, and case axioms are analogues of the similarly-named rules in the operational semantics. Since there is no heap, β and case create let expressions instead. Compile-time substitution, or inlining, is performed for values by inline and for join points by jinline. If a binding is inlined exhaustively, it becomes dead code and can be eliminated by the drop or jdrop axiom. Values may be substituted anywhere⁴, which we indicate using a general single-hole context C in inline. Inlining of join points is a bit more delicate. A jump indicates both that we should execute the join point and that we should throw out the evaluation context up to the join point's declaration. Simply copying the body accomplishes the former but not the latter. For example:
  join j (x : Int) = x + 1 in (jump j 2 (Int → Int)) 3
If we naïvely inline j here, we end up with the ill-typed term:
  join j (x : Int) = x + 1 in (2 + 1) 3
Inlining is safe, however, if the jump is a tail call, since then there is no extra evaluation context to throw away. To specify the allowable places to inline a join point, then, we use a syntactic notion called a tail context. A tail context L (see Fig. 1) is a multi-hole context describing the places where a term may return to its evaluation context. Since □ 3 is not a tail context, the jinline axiom fails for the above term.
⁴ For brevity, we have omitted rules allowing inlining a recursive definition into the definition itself (or another definition in the same recursive group).
The casefloat, float, jfloat, and jfloat_rec axioms perform commuting conversions. The former two are conventional, but jfloat and jfloat_rec exploit the new join-point construct to perform exactly the transformation we needed in Sec. 2 to avoid destroying a join point. The only difference between the two is that jfloat_rec acts on a recursive binding; the operation performed is the same.
Consider again the example at the beginning of Sec. 2. With our new syntax, we can write it as:
  case (join j x = BIG
        in case v of A → jump j 1 Bool
                     B → jump j 2 Bool
                     C → True)
  of {True → False; False → True}
We can use jfloat to move the outer case both into the right-hand side of the join binding and into its body; use casefloat to move the outer case into the branches of the inner case; use abort to discard the outer case where it scrutinizes a jump; and use case to simplify the C alternative. The result is just what we want:
  join j x = case BIG of {True → False; False → True}
  in case v of A → jump j 1 Bool
               B → jump j 2 Bool
               C → False
The commute Axiom  The left-hand sides of axioms float, jfloat, jfloat_rec, and casefloat enumerate the forms of a tail context L (Figure 1). So the four axioms are all instances of a single equivalent form:
  E[L[e₁, …, eₙ]] = L[E[e₁], …, E[eₙ]]   (commute)
This rule commute moves the evaluation context E into each hole of the tail context L.⁵
We can also derive new axioms succinctly using tail contexts. For example, our commuting conversions as written risk quite a bit of code duplication by copying E arbitrarily many times (into each branch of a case and each join point). Of course, in a real implementation, we would prefer not to do this, so instead we might use a different axiom:
  E[L[e₁, …, eₙ] : τ] = join j x = E[x] in L[jump j e₁ τ, …, jump j eₙ τ]
This can be derived from commute by first applying jdrop and jinline backward.
4. Contification: Inferring Join Points
Not all join points originate from commuting conversions. Though the source language doesn't have join points or jumps, many let-bound functions can be converted to join points without changing the meaning of the program. In particular,
⁵ Since L has a hole wherever something is returned to E, commute "substitutes" the latter into the places where it is invoked. In fact, from a CPS standpoint, commute (in concert with abort) is precisely a substitution operation.
e = e′

(β)          (λx:σ. e) v = let x:σ = v in e
(βτ)         (Λa. e) φ = e{φ/a}
(inline)     let vb in C[x] = let vb in C[v]   if (x:σ = v) ∈ vb
(drop)       let vb in e = e   if bv(vb) ∩ fv(e) = ∅
(jinline)    join jb in L[ē, jump j φ̄ v̄ τ, ē′]
               = join jb in L[ē, let x̄:σ̄ = v̄ in u{φ̄/ā}, ē′]   if (j ā x̄:σ̄ = u) ∈ jb
(jdrop)      join jb in e = e   if bv(jb) ∩ fv(e) = ∅
(case)       case K φ̄ v̄ of alt̄ = let x̄:σ̄ = v̄ in e   if (K x̄:σ̄ → e) ∈ alt̄
(casefloat)  E[case e of (K x̄ → u)ᵢ] = case e of (K x̄ → E[u])ᵢ
(float)      E[let vb in e] = let vb in E[e]
(jfloat)     E[join j ā x̄ = u in e] = join j ā x̄ = E[u] in E[e]
(jfloat_rec) E[join rec (j ā x̄ = u)ᵢ in e] = join rec (j ā x̄ = E[u])ᵢ in E[e]
(abort)      E[jump j φ̄ ē τ] : τ′ = jump j φ̄ ē τ′

Figure 4: Common optimizations for System FJ.
e = e′

(contify)      let f = Λā. λx̄. u in L[ē] : τ = join j ā x̄ = u in L[tailρ(ē)]
               if ρ(f ā x̄) = jump j ā x̄ τ and f ∉ fv(L), ⊢ u : τ

(contify_rec)  let rec f̄ = Λā. λx̄. L[ū] in L′[ē] : τ
                 = join rec j̄ ā x̄ = L[tailρ(ū)] in L′[tailρ(ē)]
               if ρ(f ā x̄) = jump j ā x̄ τ for each f, j,
               and f ∉ fv(L̄), f ∉ fv(L′), ⊢ L[ū] : τ for each f, L, ū

tailρ(f σ̄ ū) ≜ e{σ̄/ā}{ū/x̄}   if ρ(f ā x̄) = e and dom(ρ) ∩ fv(ū) = ∅
tailρ(e)     ≜ e              if dom(ρ) ∩ fv(e) = ∅
tailρ(e)     ≜ undefined      otherwise

Figure 5: Contification as a source-to-source transformation.
if every call to a given function is a saturated tail call (i.e. appears only in an L-context), and we turn the calls into jumps, then whenever one of the jumps is executed, there will be nothing to drop from the evaluation context (the s′ in jump will be empty).
The process is a form of contification [16] (or continuation demotion), which we formalize in Fig. 5, where fv(e) means the set of free variables of e (and similarly fv(L) for tail contexts), and dom(ρ) means the domain of the environment ρ (to be described shortly).
The non-recursive version, contify, attempts to decompose the body of the let (i.e., the scope of f) into a tail context L and its arguments, where the arguments contain all the occurrences of f, then attempts to run the special partial function tail on each argument to the tail context. This function will only succeed if there are no non-tail calls to f.
The tail function takes an environment ρ mapping applications of contifiable variables f to jumps to corresponding join points j. For each expression that matches the form of a saturated call to such an f, then, tail turns the call into a jump to its j, provided that none of the arguments to the function contains a free occurrence of a variable being contified; an occurrence in argument position is disallowed by the typing rules. For any other expression, tail changes nothing but does check that no variable being contified appears; otherwise, tail fails, causing the contify axiom not to match.
There is one last proviso in the contify and contify_rec axioms, which is that the body of each function to be contified must have the same type as the body of the let. This can fail if some function f is polymorphic in its return type [8].
Finding bindings to which contify or contify_rec will apply is not difficult. Our implementation is essentially a free-variable analysis that also tracks whether each free variable has appeared only in the holes of tail contexts. This is much simpler than previous contification algorithms because we only look for tail calls. We invite the reader to compare to [11] or to Sec. 5 of [16], which both allow for more general calls to be dealt with. Yet we claim that, in concert with the Simplifier and the Float In pass, our algorithm covers most of the same ground.
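As an illustration of how simple the tail-call criterion is, here is a sketch of such an analysis over a toy λ-calculus of our own devising (not GHC's actual implementation): a binding is contifiable when every occurrence of its variable is the head of a saturated call in tail position.

```haskell
-- A toy expression type for the sketch.
data E = V String | A E E | L String E | Let String E E | Case E [E]
  deriving Show

-- 'tailOnly f n e' checks that every free occurrence of f in e is the
-- head of an application to exactly n arguments, in tail position.
tailOnly :: String -> Int -> E -> Bool
tailOnly f n = go
  where
    -- go: f may occur, but only as a saturated tail call
    go (Case s alts) = noOcc s && all go alts  -- branches are tail positions
    go (Let x r b)   = noOcc r && (x == f || go b)
    go e             = call e 0
    -- call: e is f applied to exactly n f-free arguments, or f-free
    call (A e1 e2) k = noOcc e2 && call e1 (k + 1)
    call (V x) k
      | x == f       = k == n
    call e _         = noOcc e
    -- noOcc: f does not occur free at all
    noOcc (V x)       = x /= f
    noOcc (A e1 e2)   = noOcc e1 && noOcc e2
    noOcc (L x b)     = x == f || noOcc b  -- lambda bodies are not tail positions
    noOcc (Let x r b) = noOcc r && (x == f || noOcc b)
    noOcc (Case s as) = noOcc s && all noOcc as
```

For instance, tail calls in case branches pass the check, while a call in argument position (or in a case scrutinee) fails it, just as the contify side conditions require.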
To demonstrate, consider the local CPS transformation in Moby [23], which produces mutually tail-recursive functions
to improve code generation in much the same way GHC does.
Moby uses a direct-style intermediate representation, but its
contification pass is expressed in terms of a CPS transform,
which turns
let f x = ... in E[... f y ... f z ...]
(where the calls to f are tail calls within E) into
let { j x = E[x]; f x = j <rhs> }
in ...f y...f z...
where the tail calls to f are now compiled as jumps. Note that f now matches the contify axiom, but it did not before due to the E in the way. Nonetheless, our extended GHC achieves the same effect, only in stages. Starting with:

    let f x = rhs in E[... f y ... f z ...]

First, applying float from right to left floats f inward:

    E[let f x = rhs in ... f y ... f z ...]

Next, contify applies, since the calls to f are now tail calls:

    E[join f x = rhs in ... jump f y τ ... jump f z τ ...]

And now jfloat pushes E into the join point f and the body:

    join f x = E[rhs] in ... E[jump f y τ] ... E[jump f z τ] ...

From here, abort removes E from the jumps, and we can abstract E by running jdrop and jinline backward:

    join { j x = E[x]; f x = jump j rhs τ } in ... f y ... f z ...

Thus we achieve the same result without any extra effort.⁶
Naturally, contification is more routine and convenient in CPS-based compilers [11, 16]. The ability to handle an intervening context comes nearly "for free" since contexts already have names. Notably, it is still possible to name contexts in direct style (the Moby paper [23] does so using labelled expressions), so it is only a matter of convenience.
5. Recursive Join Points and Fusion

We have mentioned, without stressing the point, that join points can be recursive. We have also shown that it is rather easy to identify let-bindings that can be re-expressed (more efficiently) as join points. To our complete surprise, we discovered that the combination of these two features allowed us to solve a long-standing problem with stream fusion.

Recursive Join Points   Consider this program, which finds the first element of a list that satisfies a predicate p:

    find = Λa. λ(p : a → Bool) (xs : [a]).
      let go xs = case xs of
                    x : xs' → if p x then Just x
                              else go xs'
                    []      → Nothing
      in go xs
⁶ The parts of this sequence not specifically to do with join points were already implemented in GHC: the Float In pass applies float in reverse, and the Simplifier regularly creates join points to share evaluation contexts (except that previously they were ordinary let bindings).
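In ordinary source Haskell (eliding the explicit type abstraction), find reads:

```haskell
find :: (a -> Bool) -> [a] -> Maybe a
find p = go
  where
    go (x:xs) | p x       = Just x
              | otherwise = go xs
    go []                 = Nothing
```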
Programmers quite often write loops like this, with a local definition for go, perhaps to allow find to be inlined at a call site. Our first observation is this: go is a (recursive) join point! The contification transformation of Sec. 4 will identify go as a join point, will transform the let (which allocates) to a join (which does not), and will turn each call to go into an efficient jump.
But it gets better! Because go is a join point, it can participate in a commuting conversion. Suppose, for example, that find is called from any like this:

    any = Λa. λ(p : a → Bool) (xs : [a]).
      case find p xs of { Just x → True; Nothing → False }

The call to find can be inlined:

    any = Λa. λ(p : a → Bool) (xs : [a]).
      case (join go xs = case xs of
                           x : xs' → if p x then Just x
                                     else jump go xs' (Maybe a)
                           []      → Nothing
            in jump go xs (Maybe a))
      of { Just x → True; Nothing → False }
Now we have a case scrutinizing a join, so we can apply axiom jfloat from Figure 4. After some easy further transformations, we get:

    any = Λa. λ(p : a → Bool) (xs : [a]).
      join go xs = case xs of
                     x : xs' → if p x then True
                               else jump go xs' Bool
                     []      → False
      in jump go xs Bool
Look carefully at what has happened here: the consumer (any) of a recursive loop (go) has moved all the way to the return point of the loop, so that we were able to cancel the case in the consumer with the data constructor returned at the conclusion of the loop.
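Rendered back into source Haskell, the fused result amounts to the loop below (the name any' is ours); no intermediate Maybe value is ever built:

```haskell
any' :: (a -> Bool) -> [a] -> Bool
any' p = go
  where
    go (x:xs) | p x       = True   -- 'Just x' cancelled against the case
              | otherwise = go xs
    go []                 = False  -- likewise 'Nothing'
```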
Stream Fusion   It turns out that this new ability to move a consumer all the way to the return points of a tail-recursive loop has direct implications for a very widely used transformation: stream fusion. The key idea of stream fusion is to represent a list (or array, or other sequence) by a pair of a state and a stepper function, thus:⁷

    data Stream a where
      MkStream :: s -> (s -> Step s a) -> Stream a

There are two competing approaches to the Step type. In unfoldr/destroy fusion (Svenningsson [26]), we have:

    data Step s a = Done | Yield s a
Hence a stepper function takes an incoming state and either
yields an element and a new state or signals the end. Now
⁷ Note that Stream is an existential type, so as to abstract the internal state type s as an implementation detail of the stream.
a pipeline of list processors can be rewritten as a pipeline of stepper functions, each of which produces and consumes elements one by one. A typical stepper function for a stream transformer looks like:

    next s = case <incoming step> of
               Yield s' a -> <process element>
               Done       -> <process end of stream>
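To make this concrete, here is a minimal Skip-less stream library in the style of [26]; this is a sketch, and the function names are ours:

```haskell
{-# LANGUAGE GADTs #-}

data Step s a = Done | Yield s a

data Stream a where
  MkStream :: s -> (s -> Step s a) -> Stream a

-- Convert a list to a stream: the list itself is the state.
fromList :: [a] -> Stream a
fromList xs = MkStream xs next
  where next (y:ys) = Yield ys y
        next []     = Done

-- A stream transformer: its stepper scrutinizes the incoming step.
mapS :: (a -> b) -> Stream a -> Stream b
mapS f (MkStream s0 next) = MkStream s0 next'
  where next' s = case next s of
          Yield s' a -> Yield s' (f a)
          Done       -> Done

-- A consumer: the only loop in the whole pipeline.
sumS :: Num a => Stream a -> a
sumS (MkStream s0 next) = go 0 s0
  where go acc s = case next s of
          Yield s' a -> go (acc + a) s'
          Done       -> acc
```

After inlining, a pipeline such as sumS (mapS f (fromList xs)) becomes a single loop whose body is a nest of cases, each scrutinizing the step produced by the stage before it.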
When composed together and inlined, the stepper functions become a nest of cases, each scrutinizing the output of the previous stepper. It is crucial for performance that each Yield or Done expression be matched to a case, much as we did with Just and Nothing in the example that began Sec. 2. Fortunately, case-of-case and the other commuting conversions that GHC performs are usually up to the task.
Alas, this approach requires a recursive stepper function when implementing filter, which must loop over incoming elements until it finds a match. This breaks up the chain of cases by putting a loop in the way, much as our any above becomes a case on a loop. Hence until now, recursive stepper functions have been un-fusible. Coutts et al. [6] suggested adding a Skip constructor to Step, thus:

    data Step s a = Done | Yield s a | Skip s

Now the stepper function can say to update the state and call again, obviating the need for a loop of its own. This makes filter fusible, but it complicates everything else! Everything gets three cases instead of two, leading to more code and more runtime tests; and functions like zip that consume two lists become more complicated and less efficient.
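The contrast shows up directly in filter's stepper. Below is a sketch of both variants (the types are renamed so the two can share one module, and all names are ours): under the Skip-less Step the stepper must recurse, while with Skip it hands control back after every step.

```haskell
{-# LANGUAGE GADTs #-}

-- Skip-less representation.
data Step s a = Done | Yield s a

data Stream a where
  MkStream :: s -> (s -> Step s a) -> Stream a

-- Skip-less filter: the stepper itself loops until p holds.
-- This recursive 'next'' is what historically blocked fusion.
filterS :: (a -> Bool) -> Stream a -> Stream a
filterS p (MkStream s0 next) = MkStream s0 next'
  where next' s = case next s of
          Yield s' a | p a       -> Yield s' a
                     | otherwise -> next' s'   -- loop!
          Done                   -> Done

-- Representation with Skip.
data SkipStep s a = SDone | SYield s a | SSkip s

data SkipStream a where
  MkSkipStream :: s -> (s -> SkipStep s a) -> SkipStream a

-- With Skip, filter's stepper is non-recursive, but every other
-- combinator must now handle a third case.
filterSkip :: (a -> Bool) -> SkipStream a -> SkipStream a
filterSkip p (MkSkipStream s0 next) = MkSkipStream s0 next'
  where next' s = case next s of
          SYield s' a | p a       -> SYield s' a
                      | otherwise -> SSkip s'   -- no loop needed
          SSkip s'                -> SSkip s'
          SDone                   -> SDone
```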
But with join points, just as with any, Svenningsson's original Skip-less approach fuses just fine! Result: simpler code, less of it, and faster to execute. It's a straight win.
6. Metatheory of F_J

Proofs can be found in the extended version of this paper.⁸
Correctness and Type Safety   The way to "run" a program on our abstract machine is to initialize the machine with an empty stack and an empty store. Type safety, then, says that once we start the machine, the program either runs forever or successfully returns an answer.

Proposition 1 (Type safety). If ε; ε ⊢ e : τ, then either:
1. The initial configuration ⟨e; ε; ε⟩ diverges, or
2. ⟨e; ε; ε⟩ ↦* ⟨A; ε; Σ⟩, for some store Σ and answer A.
To establish the correctness of our rewriting axioms, we first define a notion of observational equivalence.

Definition 2. Two terms e and e′ are observationally equivalent, written e ≅ e′, if, given any context C, ⟨C[e]; ε; ε⟩ diverges if and only if ⟨C[e′]; ε; ε⟩ diverges.

The equational theory is sound with respect to ≅:
⁸ https://www.microsoft.com/en-us/research/publication/compiling-without-continuations/
Proposition 3. If e = e′, then e ≅ e′.
Equivalence to System F   The best way to be sure that F_J can be implemented without headaches is to show that it is equivalent to GHC's existing System F-based language. This would suggest that join points do not allow us to write any new programs, only to implement existing programs more efficiently. To prove the equivalence, we establish an erasure procedure that removes all join points from an F_J term, leaving an equivalent System F term.
To erase the join points, we want to apply the contify axiom (or its recursive variant) from right to left. However, we cannot necessarily do so immediately for each join point, since contify only applies when all invocations are in tail position. For example, we cannot de-contify j here:

    join j x = x + 1 in (jump j 1 (Int → Int)) 2

Simply rewriting the join point as a function and the jump as a function call would change the meaning of the program; in fact, it would not even be well-typed:

    let f = λx. x + 1 in f 1 2

However, if we apply abort first:

    join j x = x + 1 in jump j 1 Int

the jump is a tail call, so contify applies.
The abort axiom is not enough on its own, since the jump may be buried inside a tail context:

    join j x = x + 1 in
      (case b of
         True  → jump j 1 (Int → Int)
         False → jump j 3 (Int → Int)) 2

However, this can be handled by a commuting conversion:

    join j x = x + 1 in
      case b of
        True  → (jump j 1 (Int → Int)) 2
        False → (jump j 3 (Int → Int)) 2

And now abort applies twice and j can be de-contified.
Lemma 4. For any well-typed term e, there is an e′ such that e′ = e and every jump in e′ is in tail position.

By "tail position", we mean one of the holes in a tail context that starts with the binding for the join point being called. In other words, given a term join j ā x̄ = u in L[ē], the terms ē are in tail position for j.
The proof of Lemma 4 relies on the observation that the places in a term that may contain free occurrences of labels are precisely those appearing in the hole of either an evaluation or a tail context. For example, the CASE typing rule propagates the label context ∆ into both the scrutinee and the branches; note that case □ of alts is an evaluation context and case e of { pᵢ → □ } is a tail context. But □ e is (in call-by-name) neither an evaluation context nor a tail context, and APP does not propagate ∆ into the argument.
Thus any expression can be written as:

    L[Ē[L′[Ē′[... [L⁽ⁿ⁾[Ē⁽ⁿ⁾[e]]] ...]]]]    (1)
which is to say, a tree of tail contexts alternating with evaluation contexts, where all free occurrences of join points are at the leaves. By iterating commute and abort, we can flatten the tree, rewriting (1) to say that any expression can be written L[ē], where each eᵢ is a leaf from the tree in (1). In this form, no eᵢ can be expressed as E[L[...]] for non-trivial, non-binding⁹ E and nontrivial L, and every jump to a free occurrence of a label is some eᵢ. Let us say a term in the above form is in commuting-normal form. (Note that ANF is simply commuting-normal form with named intermediate values.) By commute and abort, every term has a commuting-normal form, and by construction, every jump in a commuting-normal form is a tail call. Thus every label can be decontified, and we have:
Theorem 5 (Erasure). For any closed, well-typed F_J term e, there is a System F term e′ such that e′ = e.
7. Join Points in Practice

It is one thing to define a calculus, but quite another to use it in a full-scale optimising compiler. In this section we report on our experience of doing so in GHC.
Implementing Join Points in GHC   We have implemented System F_J as an extension to the Core language in GHC. As a representation choice, instead of adding two new data constructors for join and jump to the Core data type, we re-use ordinary let-bindings and function applications, distinguishing join points only by a flag on the identifier itself. Thus, with no code changes, GHC treats join-point identifiers identically to other identifiers, and join-point bindings identically to ordinary let bindings. This is extremely convenient in practice. For example, all the code that deals with dropping dead bindings, inlining a binding that occurs just once, inlining a binding whose right-hand side is small, and so on, works automatically for join points too.
With the modified Core language in hand, we had three tasks. First, GHC has an internal typechecker, called Core Lint, that (optionally) checks the type-correctness of the intermediate program after each pass. We augmented Core Lint for F_J according to the rules of Fig. 2.

Second, we added a simple new contification analysis to identify let-bindings that can be converted into join points (see Sec. 4). Since the analysis is simple, we run it frequently, whenever the so-called occurrence analyzer runs.
Finally, the new Core Lint forensically identified several existing Core-to-Core passes that were "destroying" join points (see Sec. 2). Destroying a join point de-optimizes the program, so it is wonderful now to have a way to nail such problems at their source. Moreover, once Core Lint flagged a problem, it was never difficult to alter the Core-to-Core transformation to make it preserve join points. Here are some of the specifics about particular passes:

⁹ A join can be treated as either an evaluation context or a tail context; using commute to push a join inward is not necessarily helpful, however.
The Simplifier is a sort of partial evaluator responsible for many local transformations, including commuting conversions and inlining [19]. It is implemented as a tail-recursive traversal that builds up a representation of the evaluation context as it goes; as such, implementing the jfloat and abort axioms (Sec. 3) requires only two new behaviors:

(jfloat) When traversing a join-point binding, copy the evaluation context into the right-hand side.

(abort) When traversing a jump, throw away the evaluation context.
The Float Out pass moves let bindings outwards [20]. Moving a join binding outwards, however, risks destroying the join point, so we modified Float Out to leave join bindings alone in most cases.
The Float In pass moves let bindings inwards. It too can destroy join points, by un-saturating them. For example, given let j x y = ... in j 1 2, the Float In pass wants to narrow j's scope as much as possible: (let j x y = ... in j) 1 2. We modified Float In so that it never un-saturates a join point.
Strictness analysis is as useful for join points as it is for ordinary let bindings, so it is convenient that join bindings are, by default, treated identically to ordinary let bindings. In GHC, the results of strictness analysis are exploited by the so-called worker/wrapper transform [12, 19]. We needed to modify this transform so that the generated worker and wrapper are both join points. We found that GHC's constructed product result (CPR) analysis [3] caused the wrapper to invoke the worker inside a case expression, thus preventing the worker from being a join point. We simply disable CPR analysis for join points; it turns out that the commuting conversions for join points do a better job anyway.
Benchmarks   The reason for adding join points is to improve performance; expressiveness is unchanged (Sec. 6). So does performance improve? Table 1 presents benchmark data on allocations, collected from the standard spectral, real, and shootout NoFib benchmark suites.¹⁰ We ran the tests on our modified GHC branch, and compared them to the GHC baseline to which our modifications were applied. Remember, the baseline compiler already recognises join points in the back end and compiles them efficiently (Sec. 2); the performance changes here come from preserving and exploiting join points during optimization.

We report only heap allocations because they are a repeatable proxy for runtime; the latter is much harder to measure
¹⁰ The imaginary suite had no interesting cases. We believe this is because join points tend to show up only in fairly large functions, and the imaginary tests are all micro-benchmarks.
    spectral                 real                     shootout
    Program       Allocs     Program       Allocs     Program          Allocs
    fibheaps      -1.1%      anna          +0.5%      k-nucleotide     -85.9%
    ida           -1.4%      cacheprof     -0.5%      n-body          -100.0%
    nucleic2      +0.2%      fem           +3.6%      spectral-norm     -0.8%
    para          -4.3%      gamteb        -1.4%      (5 others)
    primetest     -3.6%      hpg           -2.1%      Min             -100.0%
    simple        -0.9%      parser        +1.2%      Max               +0.0%
    solid         -8.4%      rsa           -4.7%      Geo. Mean          n/a
    sphere        -3.3%      (18 others)
    transform     +1.1%      Min           -4.7%
    (45 others)              Max           +3.6%
    Min           -8.4%      Geo. Mean     -0.2%
    Max           +1.1%
    Geo. Mean     -0.4%

Table 1: Benchmarks from the spectral, real, and shootout NoFib suites.
reliably. All tests omitted from the tables had an improvement
in allocations, but less than 0.3%.
There are some startling figures: using join points eliminated all allocations in n-body and 85.9% of them in k-nucleotide. We caution that these are highly atypical programs, already hand-crafted to run fast. Still, it seems that our work may make it easier for performance-hungry authors to squeeze more performance out of their inner loops.

The complex interaction between inlining and other transformations makes it impossible to guarantee improvements. For example, improving a function f might make it small enough to inline into g, but this may cause g to become too large to inline elsewhere, and that in turn may lose the optimization opportunities previously exposed by inlining g. GHC's approach is heuristic, aiming to make losses unlikely, but they do occur, including a 1.1% increase in allocations in spectral/transform and a 3.6% increase in real/fem.
Beyond Benchmarks   These benchmarks show modest but fairly consistent improvements for existing, unmodified programs. But we believe that the systematic addition of join points may have a more significant effect on programming patterns. Our discussion of fusion in Sec. 5 is a case in point: with join points we can use skip-less unfoldr/destroy streams without sacrificing fusion. That knowledge in turn affects the way libraries are written: they can be smaller and faster.

Moreover, the transformation pipeline becomes more robust. In GHC today, if a "join point" is inlined we get good fusion behavior, but if its size grows to exceed the (arbitrary) inlining threshold, suddenly behavior becomes much worse. An innocuous change in the source program can lead to a big change in execution time. That step-change problem disappears when we formally add join points.
8. Why Not Use Continuation-Passing Style?

Our join points are, of course, nothing more than continuations, albeit second-class continuations that do not escape, and thus can be implemented efficiently. So why not just use CPS? Kennedy's work makes a convincing argument for CPS as a language in which to perform optimization [16].

There are many similarities between Kennedy's work and ours. Notably, Kennedy distinguishes ordinary bindings (let) from continuation bindings (letcont), just as we distinguish ordinary bindings from join points (join); similarly, he distinguishes continuation invocations (i.e., jumps) from ordinary function calls, and we follow suit. But there are a number of reasons to prefer direct style, if possible:
Direct style is, well, more direct. Programs are simply easier to understand, and the compiler's optimizations are easier to follow. Although it sounds superficial, in practice this is a significant advantage of direct style; for example, Haskell programmers often pore over GHC's Core dumps of their programs.

The translation into CPS encodes a particular order of evaluation, whereas direct style does not. That dramatically inhibits code-motion transformations. For example, GHC does a great deal of "let floating" [20], in which a let binding is floated outwards or inwards, which is valid for pure (effect-free) bindings. This becomes harder or impossible in CPS, where the order of evaluation is prescribed.

Fixing the order of evaluation is a particular issue when compiling a call-by-need language, since the known call-by-need CPS transform [18] is quite involved.
Some transformations are much harder in CPS. For example, consider common sub-expression elimination (CSE). In f (g x) (g x), the common sub-expression is easy to see. But it is much harder to find in the CPS version:

    letcont k1 xv = letcont k2 yv = f k xv yv
                    in g k2 x
    in g k1 x
GHC makes extensive use of user-written rewrite rules as optimizing transformations [22]. For example, stream fusion relies on the following rule, which states that turning a stream into a list and back does nothing [6]:

    {-# RULES "stream/unstream"
        forall s. stream (unstream s) = s #-}

In CPS, these nested function applications are more difficult to spot. Also, rule matching is simply easier to reason about when the rules are written in more-or-less the same syntax as the intermediate language, but CPS makes radical changes compared to the source language.
9. Related Work

Join Points and Commuting Conversions   Join points have been around for a long time in practice [27], but they have lacked a formal treatment until now. By introducing join points at the level at which common optimizations are applied, we are able to exploit them more fully. For example, stream fusion as discussed in Sec. 5 depends on several algorithms working in concert, including commuting conversions, inlining, user-specified rewrite rules [22], and call-pattern specialization [21].

Fluet and Weeks [11] describe MLton's intermediate language, whose syntax is much like ours (only first-order). However, it requires that non-tail calls be written so as to pass the result to a named continuation (what we would call a join point). As the authors note, however, this is only a minor syntactic change from passing the continuation as a parameter, and so the language has more in common with CPS than with direct style.
Commuting conversions are also discussed by Benton et al. in a call-by-value setting [4]. Consider:

    let z = let y = case a of { A -> e1; B -> e2 }
            in e3
    in e4

They show how to apply commuting conversions from the inside outward, creating functions to share code, getting:

    let z = let j2 y = e3
            in case a of { A -> j2 e1; B -> j2 e2 }
    in e4

and then:

    let { j1 z = e4; j2 y = e3 }
    in case a of { A -> j1 (j2 e1); B -> j1 (j2 e2) }

They call j1 a "useless function": it is only applied to the result of j2. It would be better to combine j1 with j2 to save a function call. Their solution is to be careful about the order of commuting conversions, since the problem does not occur if one goes from the outside inward instead. However, with join points, the order does not matter! If we make j2 a join point, then the second step is instead:

    join j2 y = let z = e3 in e4
    in case a of { A -> j2 e1; B -> j2 e2 }

which is the same result one gets starting from the outside. So our approach is more robust to the order in which transformations are applied.
SSA   The majority of current commercial and open-source compilers (including GCC, LLVM, and Mozilla's JavaScript engine) and compiler frameworks use the Static Single Assignment (SSA) form [7], which for an assembly-like language means that variables are assigned only once. If a variable might have different values, it is defined by a φ-node, which chooses a value depending on control flow. This makes data flow explicit, which helps to simplify some optimizations.

As it happens, SSA is inter-derivable with CPS [2] or ANF [5]. Code blocks in SSA become mutually recursive continuations in CPS or functions in ANF, and φ-nodes indicate the parameters at the different call sites. In fact, in ANF, the functions representing blocks are always tail-called, so adding join points to ANF gives a closer correspondence to SSA code: functions correspond to functions and join points correspond to blocks. Indeed, the Swift Intermediate Language (SIL) appears to have adopted the idea of "basic blocks with arguments" instead of φ-nodes [14].
Sequent Calculus   Our previous work [8] showed how to define an intermediate language, called Sequent Core, which sits in between direct style and CPS. Sequent Core disentangles the concepts of "context" and "evaluation order": contexts are invaluable, but Haskell has no fixed evaluation order, a fact which GHC exploits ruthlessly. The inspiration for our language's design came from logic, namely the sequent calculus. The sequent calculus is the twin brother of natural deduction, the foundation of direct-style representations. In this paper, we use Sequent Core as our inspiration much as Flanagan et al. [10] used CPS, with a new motto: Think in sequent calculus; work in λ-calculus.
Relation to a Language with Control   Since F_J has a notion of control, it becomes natural to relate it to known control theories such as the one developed to reason about callcc in Scheme [9]. In fact, our language can encode callcc v as join j x = x in ⟦v⟧ (λy. jump j y). By design, this encoding does not type-check in our system, since the continuation variable j is free in a lambda-abstraction. This has repercussions on the semantics: join points can no longer be saved on the stack but need to be stored in the heap, which is precisely what is needed to implement callcc.
10. Reflections

Based on our experience in a mature compiler for a statically-typed functional language, the use of F_J as an intermediate language seems very attractive. Compared to the baseline of System F, F_J is a rather small change; other transformations are barely affected; the new commuting conversions are valuable in practice; and they make the transformation pipeline more robust.

Although we have presented F_J as a lazy language, everything in this paper applies equally to a call-by-value language. All one needs to do is to change the evaluation context, the notion of what is substitutable, and a few typing rules (as described in Sec. 6).
Acknowledgments
We gratefully acknowledge helpful feedback from Arjan
Boeijink, Joachim Breitner, Gabor Greif, Andrew Kennedy,
Oleg Kiselyov, Simon Marlow, and Carter Schonwald. This
material is based upon work supported by the National
Science Foundation under Grant No. CCF-1423617.
References
[1]
A. W. Appel. Compiling with Continuations. Cambridge
University Press, 1992. ISBN 0-521-41695-7.
[2]
A. W. Appel. SSA is functional programming. SIGPLAN
Notices, 33(4):17–20, 1998.
[3]
C. A. Baker-Finch, K. Glynn, and S. L. Peyton Jones. Con-
structed product result analysis for Haskell. J. Funct. Program.,
14(2):211–245, 2004.
[4]
N. Benton, A. Kennedy, S. Lindley, and C. V. Russo. Shrinking
reductions in SML.NET. In Implementation and Application of
Functional Languages, 16th International Workshop, IFL 2004,
L
¨
ubeck, Germany, September 8-10, 2004, Revised Selected
Papers, pages 142–159, 2004.
[5]
M. M. T. Chakravarty, G. Keller, and P. Zadarnowski. A
functional perspective on SSA optimisation algorithms. Electr.
Notes Theor. Comput. Sci., 82(2):347–361, 2003.
[6]
D. Coutts, R. Leshchinskiy, and D. Stewart. Stream fusion:
from lists to streams to nothing at all. In ACM SIGPLAN Inter-
national Conference on Functional Programming (ICFP’07),
Freiburg, Germany, Oct. 2007. ACM.
[7]
R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K.
Zadeck. Efficiently computing static single assignment form
and the control dependence graph. ACM Transactions on
Programming Languages and Systems, 13(4):451–490, 1991.
[8]
P. Downen, L. Maurer, Z. M. Ariola, and S. Peyton Jones.
Sequent calculus as a compiler intermediate language. In
Proceedings of the 21st ACM SIGPLAN International Confer-
ence on Functional Programming, ICFP 2016, Nara, Japan,
September 18-22, 2016, pages 74–88, 2016.
[9]
M. Felleisen and R. Hieb. A revised report on the syntactic
theories of sequential control and state. Theoretical Computer
Science, 103(2):235–271, 1992.
[10]
C. Flanagan, A. Sabry, B. F. Duba, and M. Felleisen. The
essence of compiling with continuations. In Proceedings of
the ACM SIGPLAN’93 Conference on Programming Language
Design and Implementation (PLDI), Albuquerque, New Mexico,
USA, June 23-25, 1993, pages 237–247, 1993.
[11]
M. Fluet and S. Weeks. Contification using dominators.
In Proceedings of the Sixth ACM SIGPLAN International
Conference on Functional Programming (ICFP ’01), Firenze
(Florence), Italy, September 3-5, 2001., pages 2–13, 2001.
[12]
A. Gill and G. Hutton. The worker/wrapper transformation. J.
Funct. Program., 19(2):227–251, 2009.
[13]
J.-Y. Girard, P. Taylor, and Y. Lafont. Proofs and types,
volume 7. Cambridge University Press Cambridge, 1989.
[14]
J. Groff and C. Lattner. Swift intermediate lan-
guage. LLVM Developers Meeting
http://www.llvm.org/
devmtg/2015-10/, 2015.
[15]
A. Keep, A. Hearn, and R. Dybvig. Optimizing closures in
O(0) time. In Annual Workshop on Scheme and Functional
Programming. ACM, 2012.
[16]
A. Kennedy. Compiling with continuations, continued. In Pro-
ceedings of the 12th ACM SIGPLAN International Conference
on Functional Programming, ICFP 2007, Freiburg, Germany,
October 1-3, 2007, pages 177–190, 2007.
[17]
S. Lindley. Normalisation by evaluation in the compilation
of typed functional programming languages. PhD thesis,
University of Edinburgh, College of Science and Engineering,
School of Informatics, 2005.
[18]
C. Okasaki, P. Lee, and D. Tarditi. Call-by-need and
continuation-passing style. Lisp and Symbolic Computation, 7
(1):57–82, 1994.
[19]
S. Peyton Jones and A. Santos. A transformation-based
optimiser for Haskell. Science of Computer Programming,
32(1-3):3–47, Sept. 1998.
[20]
S. Peyton Jones, W. Partain, and A. Santos. Let-floating: mov-
ing bindings to give faster programs. In ACM SIGPLAN Inter-
national Conference on Functional Programming (ICFP’96),
Philadelphia, May 1996. ACM.
[21]
S. L. Peyton Jones. Call-pattern specialisation for Haskell
programs. In Proceedings of the 12th ACM SIGPLAN Inter-
national Conference on Functional Programming, ICFP 2007,
Freiburg, Germany, October 1-3, 2007, pages 327–337, 2007.
[22]
S. L. Peyton Jones, A. Tolmach, and T. Hoare. Playing by the
rules: rewriting as a practical optimisation technique in GHC.
In R. Hinze, editor, 2001 Haskell Workshop. ACM, September
2001.
[23]
J. H. Reppy. Optimizing nested loops using local CPS con-
version. Higher-Order and Symbolic Computation, 15(2-3):
161–180, 2002.
[24]
G. L. Steele, Jr. RABBIT: A compiler for SCHEME. Technical
Report AITR-474, Massachusetts Institute of Technology,
Artificial Intelligence Laboratory, 1978.
[25]
M. Sulzmann, M. Chakravarty, S. P. Jones, and K. Donnelly.
System F with type equality coercions. In ACM SIGPLAN
International Workshop on Types in Language Design and
Implementation (TLDI’07), pages 53–66. ACM, January 2007.
ISBN 1-59593-393-X.
[26]
J. Svenningsson. Shortcut fusion for accumulating parame-
ters & zip-like functions. In Proceedings of the Seventh ACM
SIGPLAN International Conference on Functional Program-
ming (ICFP ’02), Pittsburgh, Pennsylvania, USA, October 4-6,
2002., pages 124–132, 2002.
[27]
A. P. Tolmach and D. Oliva. From ML to Ada: Strongly-
typed language interoperability via source translation. J. Funct.
Program., 8(4):367–412, 1998.