Stitching in DyC

Official UW-DCG Design Document
Written by Brian Grant (grant@cs.washington.edu)
Date created: 7/25/96
Last modified: 3/18/97
Revision history

Preface

This document is dated, but much of it is still applicable to our current design. The annotations, of course, have changed. See Markus's quals writeup for a description of makeStatic and makeDynamic, the new types of policies, and the new function annotations. In summary, keepConst has become makeStatic, with a few more options added.

Consequently, the BTA has changed as well. Some aspects of it are up in the air. We think that the live-variables analysis must precede the BTA. It is possible that the BTA need not be iterative if we have sufficient use-def information, but the reachability analysis might require iteration. It is clear that we must keep track of the derivations of all static variables, and that keeping track of which annotated variable each temp corresponds to is a pain. Clustering of unit boundaries will have to either happen within the BTA, or clustering could incrementally update output from the BTA, or the BTA may need to run before and after clustering.

Our approach to building the specialized dynamic compilers has also changed. Charlie and Matthai have written a Perl script to merge the setup code with the template code in a post-pass to the compiler. So, Multiflow will run only once, at a cost of less well scheduled code in the specialized dynamic compiler.

The cache scheme has changed to a simpler, flat design. There will be one flat cache per unit.

Many of these changes should be outlined in our upcoming PEMP '97 paper.

Introduction

In the next-generation dynamic-compilation system, we have planned to integrate setup and stitching. We have also designed a more general set of annotations that requires replicating specialization and lazy stitching of nested dynamic regions. Thus, we must implement both eager and lazy stitching.

This document describes how the static compiler will construct a fast specialized dynamic compiler for each dynamic region. The dynamic compiler will perform both setup and stitching, and lazily stitch when required.

In the following description of our approach to setup and stitching, I will generally refer to the combined task of setup and stitching simply as stitching.

The New Annotations

For the next generation of our dynamic-compilation system, we designed all-new annotations based around keepConst. The only annotation we have preserved from the original set is the dynamicRegion construct.

The dynamicRegion annotation need not be single-entry or single-exit, but must be completely contained within a function. As before, it is used to demarcate a code region that is to be specialized. dynamicRegions may be nested.

The isConst and keepConst annotations specify the variables on which the region is to be specialized. These variables are called run-time constants because their values are hardcoded into each specialized version of the code, like true constants in statically compiled code. The annotations may be used at the beginning of or anywhere within a dynamic region.

isConst indicates that particular variables are to be made run-time constants in the specialized code that follows until their values can no longer be uniquely determined from true and run-time constants. keepConst implies this as well, but further dictates that code replication should be used to keep the specified variables run-time-constant past merges of paths on which the variables acquire different run-time-constant values, past late entries into a dynamic region (if there are multiple entries), and past assignments of dynamic values (e.g., input) to the specified variables. If keepConst is used on a loop induction variable, and the loop bound is also a run-time constant, the effect is the unrolling of the loop. The unkeepConst annotation can be used to terminate this replicating specialization prior to the end of the enclosing dynamic region.

lazy is a hint to the dynamic-compilation system that it might want to stitch the annotated statement lazily. When the system cannot determine that it is safe to stitch the statement eagerly, it ensures that the statement is stitched lazily. The current effect of the annotation is to cause conditionals (?:, ||, &&, if, switch, while, and for) to be stitched lazily.

Some of the new functionality these annotations provide include:

Code cacheing (for "run-time constants" that change)
Replicating specialization
Multi-way loop unrolling
Nested and overlapping dynamic regions
Conditional specialization (if (x < 100) keepConst(x);)
Dynamic guards for specialized code (via the lazy annotation)

For a complete description of the new annotations, see the annotations document.

Nested Regions and Lazy Stitching

As mentioned above, it is possible to nest dynamic regions, or even overlap specialization policies in unstructured ways using isConst, keepConst, and unkeepConst. The values of the variables to become run-time constants are not known until the isConst or keepConst annotation is reached, unless the variables are already run-time constants (it is permissible to use isConst and keepConst on variables that have already been annotated for specialization or that are derived run-time constants). The values of variables to be kept constant may also be unknown after dynamic assignments. For these reasons, we must lazily stitch portions of dynamic regions. Note also that the outermost region must always be lazily stitched when it is reached from the surrounding code.

Since some laziness is required, we considered stitching all templates lazily. In such an implementation, the template code could be executed along with the setup code and stitcher instructions since the program's original control flow would be followed. In the mixed-eager/lazy approach (eager stitching where possible, and lazy stitching only when necessary), multiple paths through the graph are stitched eagerly, violating the program's original control flow, and requiring all eager stitching be completed before executing the stitched templates.

In the all-lazy approach, fewer context switches between the stitcher and the stitched code should be required during the first run through the code, but more context switches may be required to switch back to the stitcher to stitch unstitched paths through the code compared to the mixed-eager/lazy approach. Also, interleaving the three types of instructions (setup, stitcher, template) in the all-lazy approach may create additional register pressure and spills. Because of the questionable benefit of the all-lazy approach and its possible disadvantages, we decided on the mixed-eager/lazy approach.

However, it should be pointed out that eager stitching is potentially unsafe. If some unsafe code to be specialized is guarded by a dynamic conditional, then the stitcher could raise an exception (e.g., a/b, tan(a), *p) or enter into an infinite loop while stitching eagerly, although the program ran correctly when statically compiled. This case can occur because the BTA cannot always determine the relationship between the conditional predicate and the unsafe expression. Thus, we should provide a mechanism for the programmer to ensure that unsafe code of this type is stitched lazily. This is the rationale behind the lazy annotation.

Since we must implement both eager and lazy stitching anyway, we have also decided to implement the eager and lazy specifications designed for the keepConst annotation to control the eagerness of code replication. The same implementation should work for all types of eagerness and laziness.

Replicating Specialization

Replicating specialization, triggered by the keepConst annotation, keeps a variable run-time-constant beyond dynamic assignments and beyond merges where the variable has different definitions, run-time-constant or dynamic, on different incoming edges, as long as the variable is run-time-constant on at least one incoming edge. This is achieved by duplicating the code beyond such merges, called discordant merges, specializing the code for the different incoming values of the variable. The following code shows a typical example.

dynamicRegion {
	...

	x = foo(x, y, z);

	keepConst(x);
	/* one version will be generated for every
	   value of x */
	...

	if (d)
		x += 1;
	else
		x += 2;

	/* two versions of the code that follows
	   will be generated for every version of
	   the preceding code */
	...
}

Replication ends at an unkeepConst annotation, at the end of the enclosing dynamic region, or after the last reference of the variable to be kept constant (the last case is an optimization).

The following code presents a simple example of nested and overlapping regions of replicating specialization.

dynamicRegion {
	unit_0: /* Kconst={} */ ...
	keepConst(x);
	unit_1: /* Kconst={x} */ ...
	keepConst(y);
	unit_2: /* Kconst={x,y} */ ...
	keepConst(z);
	unit_3: /* Kconst={x,y,z} */ ...
	unkeepConst(y);
	unit_4: /* Kconst={x,z} */ ...
	unkeepConst(x);
	unit_5: /* Kconst={z} */ ...
}

Here is an example of multi-way unrolling of a byte-code interpreter for a language that allows indirect jumps on run-time-computed values (e.g., a hardware simulator). The instructionList is an array of instructions that will be run-time-constant. opcode(pc) extracts the operation from the instruction, and is a derived run-time constant. firstOperand(pc) and secondOperand(pc) are used to extract the operands from the instruction at pc, and are also derived run-time constants. The registerFile is an array used to represent the register file, and its contents are dynamic.

dynamicRegion {
	...

	pc = &instructionList[0];
	keepConst( pc :cache);
	keepConst(*pc :unchecked);
	while (opcode(pc) != RETURN) {
		...
		switch opcode(pc) {
		case BR:
			/* we can compute the pc on both paths eagerly
			 * ==> multi-way unrolling
			 */
			if (registerFile[firstOperand(pc)]) {
				pc += secondOperand(pc);
			} else {
				pc += 4;
			}
			break;
		case IJMP:
			/* no way to compute the new pc eagerly!
			 */
			pc = registerFile[firstOperand(pc)]; break;
		...
		default:
			pc += 4; break;
		}
	}
}

The keepConst annotations will cause the loop to be multi-way unrolled. The IJMP instruction does not prevent the unrolling, though it does halt eager stitching. When the IJMP case is reached in the stitched code, a cache check will determine whether the loop has been unrolled for the new pc value. If not, then control will transfer back to the stitcher, with eager multi-way unrolling again proceeding from there.

Replicating specialization essentially requires splitting at merges where variables to be kept constant could have different incoming values. However, the situation is complicated by loops. At merges with different incoming sets of variables to be kept constant, the graph is split in the BTA. At discordant merges where the same variables are to be kept constant on all incoming paths, replicating specialization is implemented using the same mechanism as for implementing nested dynamic regions: specialization units (discussed below). The same decisions about whether or not to perform cache checks and whether to stitch lazily or eagerly apply to the beginnings of dynamic regions and to discordant merges.

Building a Specialized Dynamic Compiler

Why Build a Specialized Dynamic Compiler?

Run-time efficiency. We decided to merge the setup code and stitching to reduce our run-time overheads. The more work we can do statically, the less work required by the dynamic compiler.

Steps for Building a Specialized Dynamic Compiler

In order to construct a specialized dynamic compiler, we will run the static compiler on the annotated source file twice. Both times, the initial steps will be the same.

Run the BTA, computing:
- Iconst, the set of root run-time constants (includes cacheing and lazy/eager policies),
- Kconst, the set of variables to be kept constant (includes cacheing and lazy/eager policies),
- Context, the set of variables from Iconst and Kconst that make up the current specialization context,
- RTconst, the set of run-time constants, and
- RC, the reachability conditions.
Convert the run-time constants to SSA form. This is necessary because the reordering phase that comes later may move redefinitions of a variable between a use of it and its corresponding defs.
Perform a liveness analysis to refine the output of the BTA. Variables should not enter or leave the Context while they are not live. Premature or late changes in context due to dead variables entering or leaving the sets causes new units to be created. Discordant merges also should not start new merges if all discordant run-time constants are dead at the merge. There is an issue, however, of what to do if run-time constants derived from dead variables are still live.
Convert values computed to be run-time-constant to type VALUE_RTCONST. Also, move static branches into blocks of their own to allow greater freedom of movement and partition basic blocks based on Context changes.
Identify back edges using p2dfo (Multiflow calls them quasi back edges) and compute topological numbers and predecessor/successor information with p2topo.
Assign basic blocks to specialization units (or just units), mark the units as eager or lazy.

Afterwards, during the first run of the static compiler, the following steps must be taken to generate the templates and directives to the second pass.

Eliminate all non-template instructions from the dynamic region (except static branches) and insert dummy loads to load run-time constants.
Insert cache checks and generate specialized context switches.
Introduce dependencies to restrict code movement in the scheduler.

In the second run of the static compiler, the templates are read from the object file generated during the first pass and incorporated into the setup code to construct the specialized stitcher.

Break the units apart and put the blocks of each unit in the order in which they will be processed by the dynamic compiler.
Eliminate template instructions from the dynamic region.
Fix (insert some, delete some) the emitted branches based on the order in which the dynamic compiler will generate code, and insert code for patching branch destinations.
Insert cache checks and context switches.
Link the units together to establish the control flow for the dynamic compiler and indicate data dependencies to the back end.
Read the directives from the object file and use their information to read template instructions and insert them into lda instructions in the appropriate setup basic blocks. Reconstruct static labels.
Insert code to patch holes in template instructions, and to perform other optimizations (load-elimination, peephole, register-allocation, etc.).
Correct instruction alignment in the stitched code; the correct alignment within each unit should be statically computable (provided optimizations that insert or delete instructions maintain the current alignment) and we can pad the end of each unit with a noop, if necessary, to ensure each unit is properly aligned. Units should be aligned on cache-line boundaries as well.

Isolating the Templates

Specialization Units

What Are They?

Every program point is specialized on the set of computed run-time constants, called the RTconst set. However, we only need to generate different specialized versions of the code when variables must be kept constant whose values cannot be computed from values of other run-time constants. This can occur at isConst or keepConst annotations or when a variable to be kept constant is assigned a dynamic value. So, we can use the RTconst set and the set of variables to be kept constant, called the Kconst set, to decide when we may need to generate a new specialized version.

To manage specialization of different portions of the dynamic region on different variables, we partition the graph into specialization units. Each unit is specialized on a particular set of variables. A unit is marked eager if the values on which it specializes can be computed based of the values of run-time constants from preceding units, and lazy if it must specialize on potentially dynamic values.

Preparing the CFG for Unit Partitioning

The unit-partitioning algorithm operates on basic blocks of the CFG for the dynamic region. So, we must partition a basic block if something important happens in the middle. We must partition a block in the following circumstances:

When variables are added to the Iconst set by isConst,
When variables are added to the Kconst set by keepConst,
When variables are removed from the Iconst set at a dynamic assignment, a discordant merge, or the end of a dynamicRegion, or
When variables are removed from the Kconst set at an unkeepConst or the end of a dynamicRegion, or
When variables in the Kconst set are killed from the RTconst set (e.g., due to dynamic assignments).

Partitioning the CFG into Units

Once we're sure that the only changes to the RTconst, Iconst, and Kconst sets we care about happen between basic blocks, we can partition the graph. We can perform the partitioning by traversing the nodes of the graph in topological order and by marking nodes with the head of their unit.

We must begin a new lazy unit at node N, with N as the unit's head, in the following situations:

If a parent of N is a dynamic branch marked lazy,
If live* variables are added to the Iconst set by isConst that are not in the RTconst set,
If live* variables are added to the Kconst set by keepConst that are not in the RTconst set, or
If variables in the Kconst set are killed from the RTconst set (e.g., due to dynamic assignments).

We must begin a new eager unit at node N, with N as the unit's head, in the following situations:

If variables are removed from the Iconst set at a dynamic assignment, a discordant merge, or the end of a dynamicRegion,
If variables are removed from the Kconst set at an unkeepConst or the end of a dynamicRegion,
If N is a discordant merge (a variable in the Kconst set has multiple reaching definitions), or
If the immediate predecessors of N are from different units.

Otherwise, the node is assigned to the same unit as its immediate predecessors. At this stage, all immediate predecessors must have the same Kconst sets because merges with different Kconst sets are split in the BTA. That means we don't have to mark any units created by the last rule (multiple preceding units) as lazy.

*: Actually, we want to delay the creation of a new unit until the variable first becomes live, and then only create a new unit if the value of the variable cannot be derived from constants and run-time constants. We may also need to examine where run-time contants die, or at least where they permanently die (and will not be resurrected). Perhaps a liveness analysis should follow the BTA and refine its output.

Inserting Cache Checks

During this process, we can also insert the cache checks. Cache checks must be inserted in both the stitcher and in the stitched code. In the stitched code, we need a cache check when the code that follows must be specialized on a variable with a dynamic value. These checks are inserted into blocks with branches to lazy units. If the cacheing policy of the lazy unit is cache or cache1, then the emitted branches to the unit are replaced with an emitted cache check and either a branch to the stitched code or a context switch to the stitcher at the lazy unit. If the cacheing policy is replicate, the emitted branches are replaced by just a context switch to the stitcher. If the policy is unchecked, the emitted branches are replaced with an indirect jump that initially goes to code that performs a context switch to the stitcher, but is switched to point to the stitched code after the stitcher executes (as in the prototype). The same can be done for lazy units created voluntarily through the annotations, when the context is known to be the same as the unit's predecessors'.

In the stitcher, lazy units need no cache checks because the checks are performed in the stitched code. However, we may need to perform other actions. If the cache policy is cache1, the old stitched version of the lazy unit must be removed from the cache. If the policy is unchecked, the stitcher must change the indirect jumps to the unit to jump to the stitched code. In the head of the lazy unit, we also need to set a variable with the address of the code that is about to be stitched.

Cache checks are needed in the stitcher at the tops of eager units whose cache policies are cache or cache1 and which are preceded by either multiple units or a single unit with a different Context set (includes variables from both the Iconst and Kconst sets). If we reach an eager unit that is in the cache, we know we can stop because all eager units following the unit must also be in the cache. So, if the eager unit is in the cache, then patch branches from the code just stitched into the units in the cache and switch to the code just stitched. If the cache policy is unchecked, then the check is still required, but only that there is some version in the cache (the whole context is not checked). A check can be eliminated completely between an unchecked unit and one with the same Context set as its predecessors because we know the unchecked unit will be stitched only once for each version of its predecessors.

Cache Structure

To optimize cache checks, we must establish a hierarchy of units specialized on the same variables. The cache can be structured as a lattice based on the context of each unit. Units with unchecked cache policies are inserted in the cache as if they belong in their parents' context.

For every unit for which a cache check is required, we must enter that unit into the cache. We will maintain a cache for each context type (the particular Context set). Each such cache will contain a hash table of pointers to per-context caches. Each per-context cache will have pointers to caches for context types that are strict supersets of the current context type and to the units specialized for the context.

When moving from one context to another, the cache pointer must be changed to point to the current cache. Only the additional context need be checked when moving to a new context that is a superset of the old one. When moving to a subset context, merely look for the correct unit (no context needs to be checked).

The following code example creates seven units and five context types.

dynamicRegion {
	Unit0: /* Context={} */ ...
	keepConst(x);
	Unit1: /* Context={x} */ ...
	keepConst(y);
	Unit2: /* Context={x,y} */ ...
	keepConst(z);
	Unit3: /* Context={x,y,z} */ ...
	Unit4: z = dynamic_foo(z); ...
	unkeepConst(y);
	Unit5: /* Context={x,z} */ ...
	unkeepConst(Z);
	Unit6: /* Context={x} */ ...
}

Here is the cache structure created after one execution of the region (x=1, y=7, z=3 initially and z=9 after the call to dynamic_foo):

	RootCache = Cache0
	Cache0
		ParentContext=		NIL
		ContextType=		()
		ContextHashTable=	{Context0}
	Context0
		ParentCaches=		{Cache0}
		Context=		()
		SubCacheHashTable=	{Cache1}
		UnitHashTable=		{Unit0}
	Cache1
		ParentContext=		Context0
		ContextType=		(x)
		ContextHashTable=	{Context1}
	Context1
		ParentCaches=		{Cache1}
		Context=		(1)
		SubCacheHashTable=	{Cache2, Cache4}
		UnitHashTable=		{Unit1, Unit6}
	Cache2
		ParentContext=		Context1
		ContextType=		(x,y)
		ContextHashTable=	{Context2}
	Context2
		ParentCaches=		{Cache2}
		Context=		(1,7)
		SubCacheHashTable=	{Cache3}
		UnitHashTable=		{Unit2}
	Cache3
		ParentContext=		Context2
		ContextType=		(x,y,z)
		ContextHashTable=	{Context3, Context4}
	Context3
		ParentCaches=		{Cache3}
		Context=		(1,7,3)
		SubCacheHashTable=	{}
		UnitHashTable=		{Unit3}
	Context4
		ParentCaches=		{Cache3, Cache5}
		Context=		(1,7,9)
		SubCacheHashTable=	{}
		UnitHashTable=		{Unit4}
	Cache4
		ParentContext=		Context1
		ContextType=		(x,z)
		ContextHashTable=	{Context5}
	Context5
		ParentCaches=		{Cache4}
		Context=		(1,9)
		SubCacheHashTable=	{Cache5}
		UnitHashTable=		{Unit5}
	Cache5
		ParentContext=		Context5
		ContextType=		(x,y,z)
		ContextHashTable=	{Context4}

Note at the cache check for Unit5, we first check the parent caches of Context4 for the correct parent context. If the correct context is not found, as when the check is done for the first time, then we must check the subcaches of Context1.

Statically Reordering the Units

Why Statically Reorder the Blocks in Each Unit?

Minimize stitch-time overhead by:

Solving our live-variables problem with Multiflow, allowing setup computations to be performed using temps,
Eliminating checks of whether blocks need to be stitched or not, and
Reducing the the use of stacks and/or work lists to keep track of setup computations to perform and templates that need to be stitched.

Why Put the Blocks in Topological Order?

The depth-first topological traversal is the traversal algorithm most commonly applied to CFGs. In a topological traversal (TT), no successor is traversed before a predecessor, which simplifies analyses and makes it easier to maintain invariants on the graph.

Here is a list of reason why we should put the blocks in topological order, roughly ordered from most important to least important.

Topological sort (TS) preserves data-flow dependencies, which is necessary for correctly executing setup code after linearization,
Lengths of live ranges are kept relatively short,
Spacial locality is largely maintained (there are some weaknesses in this respect with loops with multiple entries),
TS doesn't overly break up the control flow, which increases the number of branches that must be inserted, and
For structured programs, the code generated by TS is likely to have the same layout as the source program.

Why Is This So Difficult?

Unfortunately, we don't want to stitch all blocks in a unit. Essentially, static branches select among different sets of blocks that should be stitched. That doesn't sound so bad. However, the situation becomes complicated in the presence of:

Unstructured code,
Combinations of static and dynamic branches, and
Loops / back edges.

We have the most troubles when all three of these situations occur simultaneously. In particular, chains of overlapping back edges directed into the bodies of static ifs are present the biggest challenge. If a loop head is reached from one path, and its tail from another, then we must ensure the head will be stitched if the tail is reached. However, by the time the tail is reached, it is usually too late to then stitch the head, if we want to maintain topological order.

Another issue also arose in the discussion of reordering blocks in a unit. There was a vague feeling that maybe we should try to preserve "structured subgraphs". However, because of the difficulty in identifying such subgraphs and because of the difficulties created by violating the topological order, this idea was abandoned.

Finally, I was originally attempting to be more aggressive in eliminating the use of stacks and worklists to traverse the CFG, reserving their use only for implementing an eager solution to replicating specialization. Relaxing this goal, breaking the graph into smaller specialization units, and traversing the units using a stack simplified the reordering algorithms.

Competing Solutions to Statically Order Unit Blocks

Over the past three months, I have been working towards a solution to this problem. I have proposed several algorithms for linearizing the blocks of the CFG. I will summarize them here and then explain the chosen algorithm in detail.

Edge Redirection

In this solution, the CFG is traversed with a DFTT and a duplicate graph is constructed with the same nodes but different edges from the original. The paradigm for the algorithm is edge redirection. That is, an edge in the duplicate graph either goes to its original destination or is redirected (or is omitted from the duplicate graph). The result was essentially a topological sort of the graph, but with static branches left intact to skip blocks that did not need to be stitched.

Unfortunately, this solution never fully worked. It's remaining problems were:

Although a block might originally have several incoming edges, each block was linked into the duplicate graph only by the last incoming edge. Unfortunately, the last incoming edge might not be traversed if one or more static branches lead to that edge. I tried to check when a block might be passed by that needed to be stitched and to check whether it had been stitched later, but my solution was flawed.
1. For one, too many checks were inserted.
2. More importantly, simply jumping back to missed blocks doesn't work. Either a gosub mechanism or code duplication (duplication of the missed blocks) would be required to continue stitching the correct part of the graph after the missed blocks have been stitched.
Overlapping loops were not handled correctly.

Equivalence Classes on Reachability Conditions

This strategy solved many of the problems of the edge-redirection approach. The algorithm identifies blocks that can be strung together and those that are reached by static branches. It does this by computing nested equivalence classes of blocks on the reachability conditions. New equivalence classes are spawned by static branches and disjunctive merges, merges that introduce disjunctions into the reachability conditions. Blocks in the same equivalence class can be linearized. Blocks in a nested equivalence class are linearized before the parent class is linearized.

Blocks that are reachable by different static branches are those in equivalence classes spawned by disjunctive merges. The decision to stitch such blocks is postponed until it has been definitively determined whether or not they need to be stitched. Loops are not a problem because the loop head is either in the same equivalence class as the tail, in an ancestor class, or in a disjunctive-merge class nested in an ancestor class.

A big advantage over the edge-redirection approach is that no jumping back is necessary. The checks on the missed blocks are performed in one place. Another is that the algorithm is easier to understand.

One disadvantage is that some checks may be performed unnecessarily. Another is that because the disjunctive-merge classes are appended to the end of their parent class, live ranges of the run-time constants they reference can be extended.

Splitting and Merging

The splitting-and-merging approach eliminates the need for checks and maintains strict topological order on the blocks within a unit. The concept for this algorithm is based on splitting at merges and stitching a subset of the graph in topological order, the subset being determined by which static branches are taken at stitch-time.

We maintain a list in topological order of blocks that need to be stitched and a list of nodes we need to visit. We first add the unit head to the list of blocks that we need to visit. Then, we repeatedly remove the block with the least topological number from that list, add it to the list of blocks that need to be stitched, and add the children of the block to the list of nodes we need to visit. We duplicate the two lists for every path from every static branch, creating a number of lists that is exponential in the number of static branches.

The merging algorithm uses the lists of nodes that must be stitched along the different paths to create the final graph.

The Splitting and Merging Algorithms

Pseudocode Notation

RC(Node) returns the reachability conditions for Node
UnitHead(Node) returns the head of the unit to which Node belongs
IsLazyUnit(Node) returns true if the unit to which Node belongs must be stitched lazily, false if the unit can be stitched eagerly
Zero(Node) returns 0
Topo#(Node) returns the topological number of Node
IsPredecessor(Node1, Node2) is true if Node1 is a predecessor of Node2 or Node1 == Node2, false if Node1 succeeds Node2 or Node1 neither preceeds nor succeeds Node2
NewNode = CopyNode(OldNode) creates a copy NewNode of node OldNode (not including connecting edges)
Edges are ordered by the topological numbers of their destinations; Node->Edge[i] is the ith edge of Node; Node->Edge[0] is the first edge of Node
Edge->Destination is the destination node of Edge
Edge->Condition is the condition (NoCondition, True, False, 1, 100, etc.) of Edge
(F₀, F₁, ..., F_N) is an (N+1)-tuple of objects F₀, F₁, ..., F_N; F₀ is in field 0, F₁ is in field 1, etc.
Field(Tuple, N) returns field N of Tuple
Stacks: stacks contain fixed-length tuples of pointers or scalars

Stack = CreateEmptyStack(Size)
Creates the empty stack Stack of tuples of size Size
DestroyStack(Stack)
Frees Stack
EmptyStack(Stack)
Returns true if Stack is empty, false otherwise
PushOntoStack(Stack, Tuple)
Pushes Tuple onto Stack
Tuple = PopOffStack(Stack)
Pops and returns the top of Stack
Tuple = PeekAtStack(Stack)
Returns the top of Stack without popping
Lists: sorted lists contain fixed-length tuples of pointers or scalars and no duplicate tuples

List = CreateEmptyList(Size, SortField, ValueFunction)
Creates a list of tuples of size Size that is sorted in ascending order by the value returned by invoking ValueFunction on field SortField of each tuple
DestroyList(List)
Frees List
EmptyList(List)
Returns true if List is empty, false otherwise
InsertIntoList(List, Tuple)
Inserts Tuple into List after all other tuples of lesser or equal rank
RemoveFromList(List, Tuple)
Removes Tuple from List; a wildcard (*) may be used in any tuple field
IsInList(List, Tuple)
Returns true if Tuple is in List, false otherwise
Tuple = HeadOfList(List)
Returns the head of List (least tuple by the ValueFunction)
Tuple = TailOfList(List)
Returns the tail of List (greatest tuple by the ValueFunction)
NewList = CopyList(OldList)
Creates a copy NewList of list OldList

The Sorting & Splitting Algorithm

Description

This algorithm creates a list of lists of blocks for each unit. These lists are then passed to the merging algorithm for merging. The blocks in each list are in topological order. Each list corresponds to one combination of true and false paths from static branches. So, the first list contains the blocks reached when all static branches in the unit evaluate to true, and the last list contains the blocks reached when they all evaluate to false.

As presented, this algorithm has a problem with back edges from one side of a static if into another static if body. This situation can create a difference between lists (i.e., blocks in one list but not the other) before the static branch causing the difference appears in the lists. The two possible solutions are to include the blocks in question in both lists, or to move the blocks below the static branch. The first solution has the disadvantage that some blocks will be unnecessarily stitched on some paths. The second solution inhibits merging. I plan to go with the second solution because its run-time efficiency is higher (no blocks unnecessarily stitched) and because it is easier to implement.

Note that this algorithm assumes that units cannot contain disconnected blocks.

The merging algorithm requires a tree rather than a list of lists. That tree can be reconstructed from the lists generated by this algorithm, or this algorithm can be modified to generate the tree directly.

Variables

UnitRootList is a list of nodes sorted in ascending order by topological number; it is a list of nodes that are the roots of specialization units
PathStack is a stack of 2-tuples: (List, List); the tuples contain information about paths that need to be completed
NextNodeList is a list of nodes of the graph sorted in ascending order by topological number; it is a list of nodes that need to be stitched, but haven't been copied to the duplicate graph yet
VisitedNodeList is a list of 2-tuples: (Node, EdgeCondition); where Node is a node of the graph and EdgeCondition indictates which path was taken from a static branch (NoCondition for nodes that aren't static branches); the list is sorted in ascending order by topological number of the nodes; it is a list of nodes along one path down a unit
UnitPathList is a list of all VisitedNodeLists, and is what will be handed to the tail-merging algorithm

Pseudocode

procedure SortAndSplitUnits(UnitRootList) {

    UnitPathList = CreateEmptyList(1, 0, Zero)
    
    until EmptyList(UnitRootList) {
    
        /* Get the root of this specialization unit */
        (CurrentUnit) = RemoveHeadOfList(UnitRootList)
    
        PathStack = CreateEmptyStack(2)
    
        VisitedNodeList = CreateEmptyList(2, 0, Topo#)
        InsertIntoList(UnitPathList, (VisitedNodeList))
    
        NextNodeList = CreateEmptyList(1, 0, Topo#)
        InsertIntoList(NextNodeList, (CurrentUnit))
    
        PushOntoStack(PathStack, (VisitedNodeList,
          NextNodeList))
    
        until EmptyStack(PathStack) {
            /* Start on a new path through the unit */
            (VisitedNodeList, NextNodeList) = PopOffStack(PathStack)
    
            /* Add to the path until it is completed */
            until EmptyList(NextNodeList) {
                (NextNode) = RemoveHeadOfList(NextNodeList)
    
                if EndsInStaticBranch(NextNode) {
                    /* Static branches spawn new paths through the
                       unit.  The current path follows the first edge.
                     */

                    for each outgoing edge NextNode->Edge[i]
                      except the first (in reverse topo. order) {
                        /* Create a new path */
                        NewVisitedList = CopyList(VisitedNodeList)
                        InsertIntoList(UnitPathList, (NewVisitedList))

                        /* Keep track of which path we took */
                        InsertIntoList(NewVisitedList,
                          (NextNode, NextNode->Edge[i]->Condition))
    
                        /* Work on each new path begins at a different
                           destination.  Stay within the unit and
                           don't revisit visited nodes (must check
			   because we follow back edges).
                         */
                        NewNextList = CopyList(NextNodeList)
                        Destination = NextNode->Edge[i]->Destination
                        if ((UnitHead(Destination) == CurrentUnit) &&
                           (! IsInList(NewVisitedList, (Destination,
                           *)))) {
                            InsertIntoList(NewNextList, (Destination))
                        }
    
                        PushOntoStack(PathStack,
                          (NewVisitedList, NewNextList))
                    }
    
                    InsertIntoList(VisitedNodeList,
                      (NextNode, NextNode->Edge[0]->Condition))
    
                    Destination = NextNode->Edge[0]->Destination
                    if ((UnitHead(Destination) == CurrentUnit) &&
                      (! IsInList(VisitedNodeList, (Destination, *)))) {
                        InsertIntoList(NextNodeList, (Destination))
                    }
                }
                else {
                    InsertIntoList(VisitedNodeList,
                      (NextNode, NoCondition))
    
                    /* Visit all children of this node that are in
                       this unit and that have not already been
                       visited.
                     */
                    for each outgoing edge NextNode->Edge[i] {
                        Destination = NextNode->Edge[i]->Destination
                        if ((UnitHead(Destination) == CurrentUnit) &&
                           (! IsInList(VisitedNodeList, (Destination,
                           *)))) {
                            InsertIntoList(NextNodeList, (Destination))
                        }
                    }
                }
            }
            DestroyList(NextNodeList)
        }
        DestroyStack(PathStack)
    }
    DestroyList(UnitRootList)
    
    return UnitPathList
}

The Merging Algorithm

First, we merge the lists from the front to create a tree with forks at static branches. As mentioned above, we could alternatively create the tree in the splitting algorithm. Given the tree, we then merge its branches, starting with the leaves. The result is a DAG.

For each unit:

Create a dummy node and link it to all of the heads of the unit's lists in UnitPathList, creating a tree.
Add the root of the tree (the dummy node) to a work list.
Until the work list is empty, remove the head of the list. Compare the destinations of all outgoing edges from the node. If two destination nodes represent the same block and they have the same predecessors in this tree, then merge the two nodes and add the merged node to the work list if it isn't there already.
Create a dummy node and link it to all of the leaves of the unit's tree, creating a graph with a source and a sink.
Add the sink to a work list.
Until the work list is empty, remove the head of the list. Compare the sources of all incoming edges into the node. If two source nodes represent the same block and they have the same successors in this graph, then merge the two nodes and add the merged node to the work list if it isn't there already.

This algorithm does not perform some of the reordering optimizations we discussed previously. For example, if all successors of a static branch represented the same block, then we could merge the successors and move the merged node above the branch. Or, if one predecessor of a merge appears in more lists than another, and they both appear in some lists, the more frequently occuring predecessor should be listed last (changing the topological order) so that more merges could be performed.

However, I feel that those cases would occur infrequently and the merging algorithm would be unnecessarily complicated by the addition of code to handle them. If we later feel it is necessary, we can create another pass to reorder blocks to create more opportunities for the merging algorithm.

Linking the Units

Where are there emitted branches to eager units, also push the eager units to a unit stack. At the leaves of a unit with no eager successors, pop from the unit stack. If the stack is empty, switch to the code just stitched.

To ensure Multiflow allocates registers appropriately and that it inserts needed spills and loads, on each leaf of a unit with:

One eager successor,: Simply create an edge to the successor.
Multiple eager successors,: Build a bogus switch (that will be removed after register allocation) with edges to all eager successors.
No eager successors,: Build a bogus switch (to be removed later) with back edges to preceding bogus switches in the unit graph (to which we could backtrack when popping off the unit stack at stitch time).

I'm not entirely sure this solution will work. I need to find out what is possible in Multiflow before this problem will be solved conclusively.

We also need to ensure that all run-time constants live across edges to lazy units are saved so that they can be restored when returning to the lazy units. Furthermore, we need to pass new run-time-constant values from the stitched code to the stitcher when stitching lazy units.

Lazy Stitching

Most of the lazy-stitching discussion was focused on minimizing context-switching overhead. We can generate specialized fix-up code to save and restore context and to pass new run-time-constant values to the stitcher in registers. Should we use the JSR_COROUTINE instruction? How exactly we will do this remains an open question that likely won't be solved until we try to do it.

CFG Test Suite

Revision History

7/25/96:: Document started. First description of specialization fragments set down.
7/26/96:: Idea of nested fragments discarded. Fragments renamed units.
7/29/96:: Fleshed out several sections. Added introduction.
7/30/96:: Added list of steps for building a specialized dynamic compiler (SDC). Added several graphs to the CFG test suite.
8/1/96:: Summarized three competing reordering algorithms. Updated splitting algorithm to work with units.
8/2/96:: Added criteria for breaking basic blocks. Updated list for breaking units. Updated the "Steps for Building a Specialized Dynamic Compiler" (SBSDC) list. More editting.
8/5/96:: Added discussion of cache checks and brief summaries of the merging algorithm and lazy stitching.
8/6/96:: Minor restructuring to make better use of subsections and to move side discussions to separate pages. Fleshed out cache structure and added an example.
8/7/96:: Enhanced the description of the merging algorithm. Added the option of generating a tree in the splitting algorithm. Found bug in the splitting algorithm regarding loops and described a fix. Fleshed out the "Linking the Units" section a little.
9/4/96:: Expanded explanation of multi-way-loop-unrolling example. Eliminated "Generate code for templates". Added "Separating Setup from Templates" section. Improved wording a few places.
9/27/96:: Added list of new functionality provided by the annotations. Updated SBSDC to describe the two-pass idea.
10/1/96:: Added comment about liveness analysis to discussion of partitioning the graph into units. Added discussion of all-lazy vs. mixed-eager/lazy approaches to stitching.
10/10/96:: Added a couple paragraphs on the new lazy annotation. Reordered SBSDC. Added to the liveness discussion.
3/18/97:: Added preface.