********************************************************************************
********************************************************************************
********************************************************************************

To: eggers@thistle.cs.washington.edu
cc: grant@thistle.cs.washington.edu, spin-comp@thistle.cs.washington.edu
Subject: eager stitching
Date: Fri, 07 Jun 1996 14:01:37 PDT
From: Brian Grant <grant@thistle.cs.washington.edu>

         |   |
         S   Z
        / \ /
       A   B
        \ /
         M
         |

Assume S is a static branch, and Z is "the sky" (an edge from some
unspecified place in the CFG).

In our previous discussions, you wanted to leave edges AM and BM intact
and place a test (does B need to be stitched?) on edge ZB.  I failed to
make an argument about why that couldn't be done in general.  I think now
I have one.  The main impediment to leaving those edges intact is
identifying that there is something special about the (nonconstant) merge M.  

If ZB is a cross edge (B is neither a successor or predecessor of Z in
the depth-first-search tree created by traversing the CFG), it can be
identified and (possibly) removed.  If we recompute reachability conditions
with cross edges removed, we eliminate the "most unstructured" instances of
this example and we can conservatively just place tests on the cross edges.

Unfortunately, most common cases of unstructure introduce forward edges,
which cannot be eliminated from the graph.  Forward edges are everywhere.
A forward edge is an edge to a successor in the CFG.  For example, edge BM
in the picture above is a forward edge.  If ZB is a forward edge, edge
AM must be changed.  Here's a more concrete example:

if (D && S)
	{A}
else
	{B}
{M}

This code translates to the following CFG:

         |
         D
        / \
       S  |             B and M are nonconstant merges
      / \ /
     A   B
      \ /
       M
       |

Edges DB and BM are forward edges.  We always want to eagerly stitch B, but
only want to stitch A if S is true.  So, we want to create the following
CFG:
         |
         D
         |
         S
        /|
       A |
        \|
         B
         |
         M
         |

What we have done is redirected edge AM to point to the destination of
edge DB (the next edge from a dynamic branch that we backtrack to in the
traversal), creating edge AB.  So, we don't touch edges from static branches,
but I don't see a way to avoid potentially changing edges at a 
nonconstant merge below a static branch.  

Here are some similar examples (S denotes static branch, D dynamic):

         |                              |
         S                              S
        / \                            / \
       D  |                           D  |
      / \ /     transformed ->        |  /       B is a constant merge
     A   B                            A /        M is a nonconstant merge
      \ /                             |/
       M                              B
       |                              |
                                      M
         |                            |
         D1
        / \
       D2 |
      / \ /     transformed -> -D1-D2-A-B-M-
     A   B                                       B and M are nonconstant
      \ /                                        merges with identical
       M                                         reachability conditions
       |                                         on all incoming edges

         |
         S1
        / \
       S2 |
      / \ /     transformed -> unchanged
     A   B                                       B and M are constant merges
      \ /
       M
       |

Craig suggested a new way of looking at the problem by grouping blocks
into nested equivalence classes.  Blocks with the same reachability
conditions would be put in the same equivalence class.  New nested
equivalence classes would only be created at static branches and weird
merges (merges with edges from the sky).  Subgraphs within equivalence
classes would only contain dynamic branches and could be linearized by
topological sort.  So, by starting at the innermost nested equivalence
classes and working out, the whole graph could be linearized.  Static
branches would choose between linearized subgraphs at setup/stitch time.
The main trickiness is in dealing with equivalence classes created at weird
merges.  A check would have to be inserted somewhere to make sure the
subgraph would be stitched if it needed to be.  This is equivalent to my
original problem of figuring out where to put in checks to go back and
stitch some blocks.  It is just a new way of looking at it.

What I need to think about:

1) What happens if I remove cross edges, recompute reachability conditions,
then reinsert the edges with checks on them (does this block need to be
stitched?)?  Does it mess anything up?  Does it simplify the rest of the
algorithm or improve the quality of our code layout?

2) Does this notion of equivalence classes help?  Where do we check to go
back and stitch subgraphs started at weird merges?  Will there be too many
of these weird merges with small amounts of unstructure, creating a lot
checks?  How does this relate to control-dependence trees?

3) What exactly is the algorithm for doing the topological sort?  I
have a start.  At a static branch, do nothing special.  At a dynamic
branch, push all edges but the first on a stack and remove them from the
graph.  At a constant merge, do nothing.  On any edge but the last at a
nonconstant merge, redirect the edge to point to the destination of the
first edge popped off the stack that doesn't lead to a nonconstant merge
(or is the last edge into a nonconstant merge).  I think there may also be
a special case when all edges into a merge have the same reachability
conditions.

4) What happens if the CFG has backedges (unrolled loops, esp. multiway)?
Backedges for loops that are not unrolled can be removed if other edges
into the loop head have the same (or less restrictive) reachability
conditions as the backedge.  That is, we're sure the destination of the
backedge will be stitched.  Otherwise, we have to insert a check on the
backedge.

--Brian


********************************************************************************
********************************************************************************
********************************************************************************

To: spin-comp@thistle.cs.washington.edu
Subject: algorithm for eager stitching
Date: Mon, 01 Jul 1996 15:07:49 PDT
From: Brian Grant <grant@thistle.cs.washington.edu>

I think I have the algorithm for eager stitching mostly done, except for two
things (one thing?):

	1) How to combine eager stitching with lazy stitching, if we're
	   going to implement lazy.
	2) How to handle multi-way unrolling if we're not.

I realize it's too late for people to read this before the meeting.  It's
also too late for me to prepare a good presentation of it, so I suggest we
postpone the full discussion until next week.

--Brian

Eager Stitching in DyC
Brian K. Grant


********************************************************************************
The Problem
********************************************************************************

We would like to statically transform the CFG for the combined setup
and stitching of the dynamic region (I'll call this the SCFG, for
static CFG) such that basic blocks corresponding to specialized basic
blocks that are to be eagerly stitched (I'll call this the DCFG, for
dynamic CFG) will be reached, without maintaining a stack or worklist
at run-time (except in the case of multi-way loop unrolling).  The
problem has two levels:

1) The order that the setup/stitcher blocks (of the SCFG) are ordered
statically, and
2) The order in which specialized blocks (of the DCFG) will be
stitched at run-time.

We would like both to correspond roughly to a topological traversal of
the basic blocks, for several reasons (first three are pretty standard):

1) Shorter live ranges, solving our problem with Multiflow and
hopefully producing fewer spills,
2) Minimization of the number of branches,
3) Spacial locality, and (at run-time)
4) Minimization of checks for whether blocks of the DCFG have already
been stitched or not.

What makes this problem difficult is that in a topological traversal,
a merge is not visited except until it is reached by its last incoming
edge.  However, at setup/stitch time we might not traverse the last
incoming edge due to the presence of static branches on the path to
that edge.  To preserve the topological ordering of blocks we must
wait until that edge is either traversed or passed by before we visit
its destination.  So, the tricky part is figuring out where to check
whether that edge was traversed or not and to go back and stitch the
unstitched blocks.

An alternative to this checking is splitting the CFG at the merges,
duplicating the code.  Optimizing the resulting graph to eliminate
unnecessary duplicates (tail merging) looks as difficult as inserting
checks, so I'm going with inserting checks.

At the moment, this discussion assumes that the SCFG initially looks
like the original CFG, back edges and all, and that there is more or
less a one-to-one correspondence between blocks in the SCFG and blocks
in the DCFG.  However, this may not be the case.  This brings up two
questions:

1) How do blocks in the DCFG correspond to blocks in the SCFG?  We
need to determine whether global optimizations on the templates moving
instructions across basic-block boundaries is a problem or not.  This
problem also exists with lazy stitching.

2) Should we perform the CFG transformation before or after merging
the templates with the setup code?  If we do the transformation
before, we have to make sure that no new basic blocks are added to the
SCFG during the merge.  If we do it after, we have to be sure that we
don't modify the DCFG, which is embedded in the SCFG that we are
transforming.


********************************************************************************
Structured vs. unstructured code
********************************************************************************

One possible definition of structured code is code containing only
looping and decision constructs that are strictly nested, with no
short cuts or gotos.  However, most C code contains simple forms of
unstructure:

	-- short-circuiting Boolean expressions
	-- break
	-- continue
	-- fall-through cases
	-- goto bottom of function

Ideally, we should be able to handle these simple forms of unstructure
as efficiently as perfectly structured code.  Can we come up with more
lenient definitions of structure that include these cases?  

Can we identify the underlying structure in a CFG with a few stray
"unstructured edges"?  If we could determine which edges were
unstructured and which ones were structured, we could possibly
eliminate the unstructured ones when transforming the CFG in order to
preserve its structure.

	-- OR in reachability conditions: introduced at merges that
	   are the destinations of unstructured control flow from
	   static branches
		Doesn't capture unstructuredness when dynamic branches
		are involved.  Also includes the simple forms of
		unstructure.

	-- What about cross edges?  What about forward edges?
		See below.


********************************************************************************
Classification of edges
********************************************************************************

I made some mistakes in the labeling of cross edges in my previous
writeup.  I'll try to clarify that here.

The definitions of tree, forward, and cross edges of the CFG come from
the DFS tree built when traversing the CFG.  In the DFS traversal, a
merge node is traversed the first time it is reached and the algorithm
must backtrack on all subsequent visits.  This is the opposite of the
depth-first topological traversal, in which a merge node is only
traversed once it is reached from the last parent to be traversed.

Nodes are added to the DFS tree as the CFG is traversed.  A tree edge
is an edge to a node not yet in the DFS tree (the first edge to a
block).  A forward edge is an edge to a successor in the DFS tree (an
nth edge into a merge).  A cross edge is an edge to a node that is
neither a successor or predecessor in the DFS tree.

Any right-hand edge entering a merge is a cross edge, even the edge
from the else clause of a simple if statement.  Cross edges that don't
fall into this category generally point backwards in the program text,
but are not loop edges.  One example is a goto from an else clause to
a then clause of the same if statement.

Equivalent unstructured constructs in the other direction produce
forward edges.  In fact, an unstructured edge can be a tree edge of
the DFS tree, making the structured edge a forward edge.  This
happens with fall-through cases in switch statements.  So, it is
impossible to classify an edge as structured or unstructured based
purely on its DFS classification.


********************************************************************************
Should we remove any edges?
********************************************************************************

If we remove any (e.g., forward, cross, back) edges, we may violate
the topological ordering of blocks.  For every such violation, we have
to insert checks to stop stitching.  This introduces two problems:

1) Where do we insert the checks?  

2) How do we know where we should continue stitching if we decide to
stop stitching along one path?  A function-call-like mechanism would
necessitate saving and restoring state at different points in the CFG,
similar to what is required for lazy stitching.  A long list of
explicit checks (if we came from X, goto to Y) might be just as
inefficient.

By removing edges, we may also create dead-ends in the CFG.

Back edges are a special case.  They can be removed without creating
dead-ends.  They generally have to be removed (or at least ignored)
when performing a topological traversal.  We still may get in trouble
if the last non-back edge into the destination of a back edge has more
restrictive or different reachability conditions than the back edge.

********************************************************************************
Why can't all subgraphs corresponding to static conditionals be left
unchanged?
********************************************************************************

Study the following diagram of a CFG.  Assume S is a static branch,
and Z is "the sky" (an edge from some unspecified place in the CFG).

         |   |
         S   Z
        / \ /
       A   B
        \ /
         M
         |

Why can't we leave edges SB, AM, and BM intact and place a test (does
B need to be stitched?) on edge ZB?  The main impediment to leaving
those edges intact is identifying that there is something special
about the (nonconstant) merges B and M.

Edges SA, AM, and SB in the picture above are tree edges, BM is a
cross edge, and ZB could be a forward or cross edge (assume for now
that it is not a back edge).  If we could identify that there were
something special about the edge ZB, we could eliminate it from the
graph and recompute the reachability conditions.  That would eliminate
the merge at B and make M a constant merge.  Unfortunately, I don't
know how to identify that edge.

Here is a more concrete example (D dynamic, S static):

if (D && S)
        {A}
else
        {B}
{M}

This code translates to the following CFG:

         |
         D
        / \
       S  |             B and M are nonconstant merges
      / \ /
     A   B
      \ /
       M
       |

Edges DB and BM are forward edges.  We always want to eagerly stitch B, but
only want to stitch A if S is true.  So, we want to create the following
CFG:
         |
         D
         |
         S
        /|
       A |
        \|
         B
         |
         M
         |

What we have done is redirected edge AM to point to the destination of
edge DB (the next edge from a dynamic branch that we backtrack to in
the traversal), creating edge AB.  Edge SB is not really left
untouched.  It is also redirected to the destination of edge DB, which
happens to be to the same block, B.

Here are some similar examples (S denotes static branch, D dynamic):

         |                              |
         S                              S
        / \                            / \
       D  |                           D  |
      / \ /     transformed ->        |  /       B is a constant merge
     A   B                            A /        M is a nonconstant merge
      \ /                             |/
       M                              B
       |                              |
                                      M
         |                            |
         D1
        / \
       D2 |
      / \ /     transformed -> -D1-D2-A-B-M-
     A   B                                       B and M are nonconstant
      \ /                                        merges with identical
       M                                         reachability conditions
       |                                         on all incoming edges

         |
         S1
        / \
       S2 |
      / \ /     transformed -> unchanged
     A   B                                       B and M are constant merges
      \ /
       M
       |


********************************************************************************
Equivalence classes and control-dependence trees
********************************************************************************

One way of tackling the transformation problem is to identify
subgraphs of the SCFG that are to be linearized, linearize those
subgraphs, and connect them so that the appropriate subgraphs are
stitched.

To do this, we group blocks in nested equivalence classes.  Blocks
with the same reachability conditions are put in the same equivalence
class.  New nested equivalence classes are created at static branches
and weird merges.  A weird merge is one producing an OR in the
reachability conditions.  Subgraphs within equivalence classes by
definition only contain dynamic branches and (possibly) nested
equivalence classes.  The blocks in the innermost equivalence classes
contain only dynamic branches and can be linearized by topological
sort.  Once that is completed, the surrounding equivalence class can
be linearized.  Static branches would choose between linearized
subgraphs at setup/stitch time.  Subgraphs corresponding to
equivalence classes created at weird merges would just be concatenated
and appended to the subgraph of the parent equivalence class.  So, by
starting at the innermost nested equivalence classes and working out,
the whole graph could be linearized.  I'm not yet sure how back edges
should be handled.

The following diagram shows a CFG with two weird merges.

		 ENTRY
		   |
		   1
		   |
		   S1
                  /  \
		2      3
		|      |
	        S2     S3
               /  \   /  \
	      4    5 6    7
	       \   \ /   /
                \   8   /
                 \ /   /
                  9   /
                   \ /
                   10
		    |
		  EXIT

The following tree shows the nested equivalence classes for this CFG.
Blocks within parentheses are in the same equivalence class, and
children in the tree are in nested equivalence classes.

             (ENTRY,1,10,EXIT)
                   | \_ \_
		   S1  \_ \_
                 t/ \f   \w \w
               (2)   (3) (8)(9)
                |     |
                S2    S3
              t/ \f  t/ \f
	     (4) (5) (6) (7)

From this equivalence-class tree, we construct the following CFG.

		 ENTRY
		   |
		   1
		   |
		   S1
                  /  \
		2      3
		|      |
	        S2     S3
               /  \   /  \
	      4    5 6    7
	      |    | |    |
             9NS 8NS 8NS  |	
	       \  /   \  /
		\/     \/
                 \     /
                  \   /
                   \ /
                   10
		    |
		   8NS?
		   /|
                  8 |
                   \|
                   9NS?
		   /|
		  9 |
		  \ |
		   \|
		  EXIT

So, all blocks are stitched that need to be.  Since we don't want
blocks 8 and 9 to be stitched unnecessarily, we set
need-to-be-stitched flags for the blocks on the edges from blocks 4,
5, and 6 and check those flags before the blocks are stitched.

How does this relate to control-dependence trees?  Reachability
conditions are a type of control-dependence information, where we are
only concerned about dependence upon static branches.  Building the
nested equivalence classes based on reachability conditions is very
similar to insertion of region nodes into the control-dependence
subgraph.  The region nodes become the parents of nodes with the same
control dependencies and children of nodes with fewer dependencies.
The main difference in the two approaches is in the handling of weird
merges and back edges.

Does this notion of equivalence classes make eager stitching:

	-- Easier to implement?
	-- Easier to understand?
	-- More or less efficient?

I think the equivalence-class method is easy to understand, but
building up the equivalence classes would be more work than it is
worth.  I also think that weird merges occur quite frequently and just
appending the blocks and their tests in a linear fashion is less
efficient than the algorithm I propose.  It should be possible to
insert subgraphs for equivalence classes for weird merges back into
the CFG more intelligently than appending them to the end, but I
haven't yet thought of one.  However, I haven't yet worked on this
approach for as long as I worked on the algorithm that follows, which
I had nearly worked out before Craig came up with this idea.


********************************************************************************
Identifying predecessors in the CFG
********************************************************************************

Remove all back edges from the CFG.  Perform a depth-first topological
traversal and count the nodes as they are visited.  These counts are
the begin times.  Then, reverse all edges (with back edges removed
from the original graph), and do the same again.  These counts are the
end times.  If the (begin, end) range of one node is completely
contained within the range of another node, the first node is a
successor of the second in the original CFG.


********************************************************************************
The Algorithm
********************************************************************************

Since we're traversing the SCFG and making modifications as we go
along, we maintain two graphs for sanity (traverse the original and
build a duplicate).

By passing another parameter to transform(), we could eliminate the
BACK_STACK, but it would complicate the code.

This algorithm is quite short, but complicated, and therefore
difficult to understand.  I am nearly convinced that is correct, but I
have been thinking about it for more than a month.  Understandably, it
may be more difficult to convince you that it is correct, and it would
be nice to have a proof.  I have included several comments to try to
describe the rationale for each piece of the algorithm.

Notation
--------

RC is reachability conditions (modified)

ME is mutually exclusive

is_predecessor(A, B) is true if A is a predecessor of B

SCFG2 is a duplicate CFG, initially containing duplicates of all the
nodes, but no edges.

The Algorithm
-------------

1) Peel off the first iteration of every one-way unrolled loop.  This
makes stitching of unrolled loops with multiple entries more efficient
and any unnecessarily generated code will be eliminated.

2) Find all loops and annotate each loop-entry edge with the head of
the outermost loop it enters.

3) Remove back edges from SCFG.

4) Redirect each loop-entry edge to point to the head of the outermost
loop entered.

5) Insert empty nodes into the SCFG on all non-last edges into
nonconstant merges.  This prevents redirection to edges that will
later themselves be redirected (edges from dynamic branches into
nonconstant merges).

6) Recompute reachability conditions.

7) Compute SCFG predecessors.

8) Copy the nodes of SCFG to SCFG2.

9) Start the recursive transformation and topological traversal by
calling transform(SCFG->entrynode).

10) Add back edges corresponding to loops to be unrolled to SCFG2.


transform(node) {
	/* if we're done, quit */
	if node == SCFG->exitnode {
		return
	}

	if ends_in_dynamic_branch(node) {
		/* Push the destinations of dynamic edges onto the
		   stack so that we can redirect edges into
		   nonconstant merges to them.  We want the edges to
		   be popped from BACK_STACK in the same order in
		   which they are traversed.  We don't redirect to
		   edges emanating from static branches because:
			-- If we're redirecting, the RC of the edge
			   are not ME with those of the last edge into
			   a nonconstant merge.
			-- So, there must be an edge from a dynamic
			   branch that leads to the path from which
			   the last edge comes.
			-- The RC of the next edge from a static
			   branch to be traversed must be ME with
			   those of the redirected edge, and only one
			   of the two paths should be stitched.
		 */ 
		for each outgoing edge node->e[i] (in reverse order) {
			push(BACK_STACK, node->e[i]->dest)
		}

		/* Add only the first edge to the new graph.  The
		   other branch destinations will be strung together.
		 */
		Add edge (node -> peek(BACK_STACK)) to SCFG2
	}

	/* In the case of a static branch or single edge, copy all
	   edges to SCFG2.
	 */
	else {
		for each outgoing edge node->e[i] {
			Add edge node->e[i] to SCFG2
		}
	}

	/* Recursively invoke transform() on graph successors,
	   backtracking at non-last edges into merges.
	 */
	for each outgoing edge node->e[i] {
		/* Remove the current edge from the stack so the top
		   of the stack is the next dynamic edge we're going
		   to handle.
		 */
		if ends_in_dynamic_branch(node) {
			assert(node->e[i]->dest == peek(BACK_STACK))
			pop(BACK_STACK)
		}

	/* Case 1: edge is only incoming edge (not a merge)

	   Simply move down the graph.
	 */

		/* If the edge destination is not a merge, then go to
		   that node.
		 */
		if (not is_merge(node->e[i]->dest) {
			transform(node->e[i]->dest)
		}

	/* Case 2: last edge into a merge, an edge into a constant
	   merge, or an edge into a nonconstant merge where the RC of
	   the current edge are ME with those of the last edge

	   Insert checks to go back if necessary and move down the
	   graph if it's the last edge.
	 */

		/* we know the edge destination is a merge */

		/* If the edge is the last edge or its reachability
		   conditions are mutually exclusive with those of the
		   last edge, we're going to stitch the merge next.
		   So, if the merge is a nonconstant merge, check if
		   we need to insert checks to go back and stitch any
		   blocks to preserve topological order.  At a
		   constant merge (this case catches all constant
		   merges), there won't be any blocks to go back and
		   stitch. 
		 */
		else if (node->e[i] == node->e[i]->dest->last_in_edge)
		  or ( RC(node->e[i]) ME RC(node->e[i]->dest->last_in_edge) ) {
			if is_nonconstant_merge(node->e[i]->dest) {
				/* Merges in the CHECK_SET are tested
				   in ascending order by their begin
				   times.  This ensures that no merge
				   will be checked before a
				   predecessor.  That is necessary to
				   ensure that we don't stitch the same
				   part of the graph twice.
				*/
				for every node n in CHECK_SET {
					/* We need to insert a check
					   to go back to a merge if
					   the destination of the
					   current edge is a successor
					   of a merge in CHECK_SET,
					   but the source of the edge
					   is not.  If the source is
					   also a successor, we are
					   guaranteed the merge to
					   check has already been
					   stitched due to the strict
					   topological ordering,
					   making the check unnecessary.
					 */
					if is_predecessor(n,
					  node->e[i]->dest) && (not
					  is_predecessor(n, node)) {
						Insert "is NTBS flag
						  set for n?" on edge
						  node->e[i]
					}
				}
			}

			/* If the edge is the last edge into a merge,
			   then go to that node.  If the edge is not
			   the last edge into the merge, then we need
			   to backtrack. 
			 */
			if (node->e[i] == node->e[i]->dest->last_in_edge) {
				transform(node->e[i]->dest)
			}
		}

	/* Case 3: non-last edge into a nonconstant merge

	   Redirect the edge and determine whether we might need to
	   come back to the merge or not.
	 */

		/* we know the edge destination is a nonconstant merge */
		/* we know the edge is not the last edge into the
		   merge, so we're going to backtrack */
		/* we know the reachability conditions of the edge
		   are not mutually exclusive with those of the last
		   edge, so we're going to redirect */
	
		/* If the reachability conditions of the current edge
		   are not mutually exclusive with those of the last
		   edge (includes most edges at nonconstant merges),
		   then redirect the edge and check whether the last
		   edge might not be traversed.  
		 */
		else {
			/* If reachability conditions of the current
			   edge imply those of all remaining incoming
			   edges (the last edge and all edges whose
			   reachability conditions are mutually
			   exclusive with those of the last edge),
			   then we know we will reach the merge again
			   and no check is needed.  Otherwise, insert
			   the check.
			 */
			rc_remaining_edges = RC(node->e[i]->dest->last_in_edge)
			for each incoming edge node->e[i]->dest->ie[j] {
				if ( RC(node->e[i]->dest->ie[j]) ME
				     RC(node->e[i]->dest->last_in_edge) ) {
					rc_remaining_edges =
					  rc_remaining_edges or 
					  RC(node->e[i]->dest->ie[j])
				}
			}

			if ( rc_remaining_edges && RC(node->e[i]) ) !=
			  RC(node->e[i]) { 
				/* The remaining edges might not be
				   traversed, so insert a need-to-
				   be-stitched flag and add the merge
				   to the set of nodes to go back and
				   check.
				 */
				assert(is_a_cross_edge(
				  node->e[i]->dest->last_in_edge))
				Insert "set NTBS flag for 
				  node->e[i]->dest" on edge node->e[i]
				Add node->e[i]->dest to CHECK_SET
				  (begin times in ascending order)
			}

			Redirect edge node->e[i] in SCFG2 to
			  peek(BACK_STACK), or add the edge if it is
			  not present
		}
	}

	return
}

This algorithm correctly transforms the following CFG.

		    |
		    D1
		   /  \
		 D2    E
		/  \    \
	       A    S    |
	       |   / \   |
	        \ B   C  |
	         \ \ /  /
		  \ |  /
		   M1 /
		    |/
		    M2
		    |

To:
		    |
		    D1
		    |
		    D2
		    |
		    A
		    |
		    S
		   / \
		  B   C
		   \ /
		    M1
		    |
		    E
		    |
		    M2
		    |


********************************************************************************
Does this algorithm correctly handle back edges?
********************************************************************************

What happens if the SCFG has back edges?

Backedges for loops that are not unrolled can be ignored (not copied
to the duplicate graph).  Loops with multiple entries are handled
correctly because all entering edges are redirected to the loop head,
ensuring that the whole loop will be stitched and that no successor
will be stitched before a predecessor.  The following CFG is a good
example of this situation.

		     |
		     S   __
		    /|\ / |
		   A | H  |
		   |  \|  |
		   |   Q  |
		    \ /   |
		     D    |
		    / \   |
		   B   T  |
		   |   |__|

With one-way unrolled loops, we should be able to just add the back
edge to the duplicate graph after the transformation algorithm runs.

Multi-way unrolled loops are a problem.  If done eagerly, they require
a stitch-time stack.  I've only just begun thinking about how they
affect this algorithm.


********************************************************************************
Integration with lazy stitching
********************************************************************************

How do we integrate this algorithm with the lazy stitching
algorithm?  Subgraphs of the SCFG to be stitched lazily should not be
modified by this transformation, and movement between eager and lazy
portions of the CFGs should be well-defined.

Lazily stitched branches are similar, but not identical, to static
branches.  For example, we need to save some state before a lazy
branch.  Also, how do we identify merges below lazy branches so that
we don't redirect edges?  Do we need to identify lazy subgraphs in
advance?


********************************************************************************
********************************************************************************
********************************************************************************

To: spin-comp@thistle.cs.washington.edu
Subject: looks like we need lazy stitching
Date: Mon, 15 Jul 1996 12:30:13 PDT
From: Brian Grant <grant@thistle.cs.washington.edu>

I've discovered two problems with stitching only eagerly:

	1) I don't think shareConst, cacheOne, and replicate make sense
	   if we only stitch eagerly.

	2) We never know the values of rtconsts at at start of a dynamic
	   region until it is reached, and there are situations where
	   we can encounter this problem inside the region as well.  If we
	   don't know the values, we can't stitch.  See the following
	   examples.

DynRegion {

	...

	x = foo(x, y, z);

	keepConst(x);

	if (d)
		x += 1;
	else
		x += 2;

	...
}

Since we don't know the initial value of x, we can't compute its value on the
two paths created at the merge of the if statement.  Alternatively, we
might know the initial value, but not the value on all paths.  Here is
an example of multi-way unrolling of a byte-code interpreter for a language
that allows indirect jumps on run-time-computed values (e.g., a hardware
simulator):

	keepConst(pc);
	pc = 0;
	while (opcode(pc) != RETURN) {
		...
		switch opcode(pc) {
		case BR:
			pc += secondOperand(pc); break;
		case IJMP:
			pc = registerFile[firstOperand(pc)]; break;
			/* no way to compute this
			   at stitch time! */
		...
		default:
			pc += 4; break;
		}
	}

So, the question is: how do we implement lazy?  If we only use lazy for
specialization, we can't execute at the same time as stitching because
the control flow of the stitcher is different from that of the original
code.  So, we need to alternate stitching and executing, but at what
granularity?  My first thought is to stitch until we hit a place where we
need to specialize, then execute.  Once the end of the stitched code is
reached, go back and stitch some more (potentially skipping by the
path that requires specialization).  If we do lazy, we also need to relax
the topological-ordering constraint because some blocks will not be stitched
eagerly.  So, the eager-stitching algorithm will have to be tweaked.

As discussed before, the stitcher will need to run with a different stack
than the stitched code, and we need to save and restore state when we
switch between the stitcher and the stitched code.  We'll need space to
store state for every specialized path through the CFG that we don't
complete.

--Brian

********************************************************************************
********************************************************************************
********************************************************************************

To: mock
Subject: new algorithm w/ splitting and specialization
Date: Mon, 12 Jul 1996
From: Brian Grant <grant@thistle.cs.washington.edu>


Eager Stitching in DyC
Brian K. Grant


********************************************************************************
Summary
********************************************************************************

The Problem

	-- Statically transform the CFG for eager setup & stitching
	-- Ensure all branch destinations are stitched (tricky: loops,
	   cross edges)
	-- Stitch each block only once
	-- Minimize stitch-time overhead (statcks, work lists, checks, etc.)

Assumptions (for now)
	-- No specialization other than simple unrolled loops
	-- Overlapping loops are disallowed (to ensure a well-defined
	   outermost loop and unique loop head)

The Solution

	-- Linearize the graph with a modified topological sort
	-- Don't serialize mutually exclusive paths
	-- Redirect loop entries to loop heads
	-- If a node might not be stitched, go back to it where the
	   predecessor before successor rules would otherwise be
	   violated
	-- Include back edges for unrolled loops

The Algorithm

	1) Peel 1st iteration off unrolled loops
	2) Compute for each loop-entry edge the head of the outermost
	   loop entered
	3) Remove back edges from the graph
	4) Insert empty nodes on all non-last edges into nonconstant
	   merges
	5) Redirect each loop-entry to the head of the outermost loop
	   entered
	6) Recompute reachability conditions
	7) Compute graph predecessors
	8) Copy the nodes to a duplicate graph
	9) Invoke the recursive transformation routine on the entry
	   node of the graph
	10) Add the back edges corresponding to unrolled loops to the
	   duplicate graph

See the bottom of this writeup for the recursive transformation routine.


********************************************************************************
The Problem
********************************************************************************

We would like to statically transform the CFG for the combined setup
and stitching of the dynamic region (I'll call this the SCFG, for
static CFG) such that basic blocks corresponding to specialized basic
blocks that are to be eagerly stitched (I'll call this the DCFG, for
dynamic CFG) will be reached, without maintaining a stack or worklist
at run-time (except in the case of multi-way loop unrolling).  The
problem has two levels:

1) The order that the setup/stitcher blocks (of the SCFG) are ordered
statically, and
2) The order in which specialized blocks (of the DCFG) will be
stitched at run-time.

We would like both to correspond roughly to a topological traversal of
the basic blocks, for several reasons (first three are pretty standard):

1) Shorter live ranges, solving our problem with Multiflow and
hopefully producing fewer spills,
2) Minimization of the number of branches,
3) Spacial locality, and (at run-time)
4) Minimization of checks for whether blocks of the DCFG have already
been stitched or not.

What makes this problem difficult is that in a topological traversal,
a merge is not visited except until it is reached by its last incoming
edge.  However, at setup/stitch time we might not traverse the last
incoming edge due to the presence of static branches on the path to
that edge.  To preserve the topological ordering of blocks we must
wait until that edge is either traversed or passed by before we visit
its destination.  So, the tricky part is figuring out where to check
whether that edge was traversed or not and to go back and stitch the
unstitched blocks.

At the moment, this discussion assumes that the SCFG initially looks
like the original CFG, back edges and all, and that there is more or
less a one-to-one correspondence between blocks in the SCFG and blocks
in the DCFG.  However, this may not be the case.  This brings up two
questions:

1) How do blocks in the DCFG correspond to blocks in the SCFG?  We
need to determine whether global optimizations on the templates moving
instructions across basic-block boundaries is a problem or not.  This
problem also exists with lazy stitching.

2) Should we perform the CFG transformation before or after merging
the templates with the setup code?  If we do the transformation
before, we have to make sure that no new basic blocks are added to the
SCFG during the merge.  If we do it after, we have to be sure that we
don't modify the DCFG, which is embedded in the SCFG that we are
transforming.


********************************************************************************
Structured vs. unstructured code
********************************************************************************

One possible definition of structured code is code containing only
looping and decision constructs that are strictly nested, with no
short cuts or gotos.  However, most C code contains simple forms of
unstructure:

	-- short-circuiting Boolean expressions
	-- break
	-- continue
	-- fall-through cases
	-- early return
	-- goto bottom of function

Ideally, we should be able to handle these simple forms of unstructure
as efficiently as perfectly structured code.  Can we come up with more
lenient definitions of structure that include these cases?  

Can we identify the underlying structure in a CFG with a few stray
"unstructured edges"?  If we could determine which edges were
unstructured and which ones were structured, we could possibly
eliminate the unstructured ones when transforming the CFG in order to
preserve its structure.

	-- OR in reachability conditions: introduced at merges that
	   are the destinations of unstructured control flow from
	   static branches
		Doesn't capture unstructuredness when dynamic branches
		are involved.  Also includes the simple forms of
		unstructure.

	-- What about cross edges?  What about forward edges?
		See below.


********************************************************************************
Classification of edges
********************************************************************************

The definitions of tree, forward, and cross edges of the CFG come from
the DFS tree built when traversing the CFG.  In the DFS traversal, a
merge node is traversed the first time it is reached and the algorithm
must backtrack on all subsequent visits.  This is the opposite of the
depth-first topological traversal, in which a merge node is only
traversed once it is reached from the last parent to be traversed.

Nodes are added to the DFS tree as the CFG is traversed.  A tree edge
is an edge to a node not yet in the DFS tree (the first edge to a
block).  A forward edge is an edge to a successor in the DFS tree (an
nth edge into a merge).  A cross edge is an edge to a node that is
neither a successor or predecessor in the DFS tree.

Any right-hand edge entering a merge is a cross edge, even the edge
from the else clause of a simple if statement.  Cross edges that don't
fall into this category generally point backwards in the program text,
but are not loop edges.  One example is a goto from an else clause to
a then clause of the same if statement.

Equivalent unstructured constructs in the other direction produce
forward edges.  In fact, an unstructured edge can be a tree edge of
the DFS tree, making the structured edge a forward edge.  This
happens with fall-through cases in switch statements.  So, it is
impossible to classify an edge as structured or unstructured based
purely on its DFS classification.


********************************************************************************
Should we remove any edges?
********************************************************************************

If we remove any (e.g., forward, cross, back) edges, we may violate
the topological ordering of blocks.  For every such violation, we have
to insert checks to stop stitching.  This introduces two problems:

1) Where do we insert the checks?  

2) How do we know where we should continue stitching if we decide to
stop stitching along one path?  A function-call-like mechanism would
necessitate saving and restoring state at different points in the CFG,
similar to what is required for lazy stitching.  A long list of
explicit checks (if we came from X, goto to Y) might be just as
inefficient.

By removing edges, we may also create dead-ends in the CFG or nodes
with no predecessors.

Back edges are a special case.  They can be removed without orphaning
nodes.  They generally have to be removed (or at least ignored)
when performing a topological traversal.  We still may get in trouble
if the loop has multiple entries.  The solution is to redirect all
entries to the loop head, ensuring that the entire loop is stitched.

********************************************************************************
Why can't all subgraphs corresponding to static conditionals be left
unchanged?
********************************************************************************

Study the following diagram of a CFG.  Assume S is a static branch,
and Z is "the sky" (an edge from some unspecified place in the CFG).

         |   |
         S   Z
        / \ /
       A   B
        \ /
         M
         |

Why can't we leave edges SB, AM, and BM intact and place a test (does
B need to be stitched?) on edge ZB?  The main impediment to leaving
those edges intact is identifying that there is something special
about the (nonconstant) merges B and M.

Edges SA, AM, and SB in the picture above are tree edges, BM is a
cross edge, and ZB could be a forward or cross edge (assume for now
that it is not a back edge).  If we could identify that there were
something special about the edge ZB, we could eliminate it from the
graph and recompute the reachability conditions.  That would eliminate
the merge at B and make M a constant merge.  Unfortunately, I don't
know how to identify that edge.

Here is a more concrete example (D dynamic, S static):

if (D && S)
        {A}
else
        {B}
{M}

This code translates to the following CFG:

         |
         D
        / \
       S  |             B and M are nonconstant merges
      / \ /
     A   B
      \ /
       M
       |

Edges DB and BM are forward edges.  We always want to eagerly stitch B, but
only want to stitch A if S is true.  So, we want to create the following
CFG:
         |
         D
         |
         S
        /|
       A |
        \|
         B
         |
         M
         |

What we have done is redirected edge AM to point to the destination of
edge DB (the next edge from a dynamic branch that we backtrack to in
the traversal), creating edge AB.  Edge SB is not really left
untouched.  It is also redirected to the destination of edge DB, which
happens to be to the same block, B.

Here are some similar examples (S denotes static branch, D dynamic):

         |                              |
         S                              S
        / \                            / \
       D  |                           D  |
      / \ /     transformed ->        |  /       B is a constant merge
     A   B                            A /        M is a nonconstant merge
      \ /                             |/
       M                              B
       |                              |
                                      M
         |                            |
         D1
        / \
       D2 |
      / \ /     transformed -> -D1-D2-A-B-M-
     A   B                                       B and M are nonconstant
      \ /                                        merges with identical
       M                                         reachability conditions
       |                                         on all incoming edges

         |
         S1
        / \
       S2 |
      / \ /     transformed -> unchanged
     A   B                                       B and M are constant merges
      \ /
       M
       |


********************************************************************************
Identifying predecessors in the CFG
********************************************************************************

Remove all back edges from the CFG.  Perform a depth-first topological
traversal and count the nodes as they are visited.  These counts are
the begin times.  Then, reverse all edges (with back edges removed
from the original graph), and do the same again.  These counts are the
end times.  If the (begin, end) range of one node is completely
contained within the range of another node, the first node is a
successor of the second in the original CFG.


********************************************************************************
Equivalence classes and control-dependence trees
********************************************************************************

One way of tackling the transformation problem is to identify
subgraphs of the SCFG that are to be linearized, linearize those
subgraphs, and connect them so that the appropriate subgraphs are
stitched.

To do this, we group blocks in nested equivalence classes.  Blocks
with the same reachability conditions are put in the same equivalence
class.  New nested equivalence classes are created at static branches
and weird merges.  A weird merge is one producing an OR in the
reachability conditions.  Subgraphs within equivalence classes by
definition only contain dynamic branches and (possibly) nested
equivalence classes.  The blocks in the innermost equivalence classes
contain only dynamic branches and can be linearized by topological
sort.  Once that is completed, the surrounding equivalence class can
be linearized.  Static branches would choose between linearized
subgraphs at setup/stitch time.  Subgraphs corresponding to
equivalence classes created at weird merges would just be concatenated
and appended to the subgraph of the parent equivalence class.  So, by
starting at the innermost nested equivalence classes and working out,
the whole graph could be linearized.  I'm not yet sure how back edges
should be handled.

The following diagram shows a CFG with two weird merges.

		 ENTRY
		   |
		   1
		   |
		   S1
                  /  \
		2      3
		|      |
	        S2     S3
               /  \   /  \
	      4    5 6    7
	       \   \ /   /
                \   8   /
                 \ /   /
                  9   /
                   \ /
                   10
		    |
		  EXIT

The following tree shows the nested equivalence classes for this CFG.
Blocks within parentheses are in the same equivalence class, and
children in the tree are in nested equivalence classes.

             (ENTRY,1,10,EXIT)
                   | \_ \_
		   S1  \_ \_
                 t/ \f   \w \w
               (2)   (3) (8)(9)
                |     |
                S2    S3
              t/ \f  t/ \f
	     (4) (5) (6) (7)

From this equivalence-class tree, we construct the following CFG.

		 ENTRY
		   |
		   1
		   |
		   S1
                  /  \
		2      3
		|      |
	        S2     S3
               /  \   /  \
	      4    5 6    7
	      |    | |    |
             9NS 8NS 8NS  |	
	       \  /   \  /
		\/     \/
                 \     /
                  \   /
                   \ /
                   10
		    |
		   8NS?
		   /|
                  8 |
                   \|
                   9NS?
		   /|
		  9 |
		  \ |
		   \|
		  EXIT

So, all blocks are stitched that need to be.  Since we don't want
blocks 8 and 9 to be stitched unnecessarily, we set
need-to-be-stitched flags for the blocks on the edges from blocks 4,
5, and 6 and check those flags before the blocks are stitched.

How does this relate to control-dependence trees?  Reachability
conditions are a type of control-dependence information, where we are
only concerned about dependence upon static branches.  Building the
nested equivalence classes based on reachability conditions is very
similar to insertion of region nodes into the control-dependence
subgraph.  The region nodes become the parents of nodes with the same
control dependencies and children of nodes with fewer dependencies.
The main difference in the two approaches is in the handling of weird
merges and back edges.

Does this notion of equivalence classes make eager stitching:

	-- Easier to implement?
	-- Easier to understand?
	-- More or less efficient?

I think the equivalence-class method is easy to understand, but
building up the equivalence classes would be more work than it is
worth.  I also think that weird merges occur quite frequently and just
appending the blocks and their tests in a linear fashion is less
efficient than the algorithm I propose (which doesn't change the graph
in the example above).  It should be possible to insert subgraphs for
equivalence classes for weird merges back into the CFG more
intelligently than appending them to the end, but I haven't yet
thought of one.  However, I haven't yet worked on this approach for as
long as I worked on the algorithm that follows, which I had nearly
worked out before Craig came up with this idea.


********************************************************************************
Node splitting and tail merging
********************************************************************************

The concept for this algorithm is based on splitting at merges and
stitching a subset of the graph in topological order, the subset being
determined by which static branches are taken at stitch-time.

NB: I'm not thinking about specialization yet.

The duplicate CFG built by the following algorithm is a tree, with new
branches of the tree created at static branches.  After the algorithm
runs, another pass would attempt to merge the tails (quadratic in
number of tails?  exponential in number of static branches?).  Without
specialization (including loop unrolling), no stitch-time checks or
data structures (stacks, queues, etc.) are necessary.

Before we include each block in the tree, we check whether we might
reach a loop tail (block ending in a back edge) but not the loop head
(destination of the back edge).  If this is possible, we insert the
loop head in the sorted list of nodes to stitch.  The test I use to
check this is whether the reachability conditions of the next node in
the sorted list imply the intersection of the reachability conditions
of each tail and its corresponding head.

Assumptions and notation
------------------------

-- Edges are ordered by the topological numbers of their destinations

-- Edge->Dest is the destination node of Edge

-- Node->Edge[0] is the first edge of Node

-- SCFG->Start has a single successor

-- BranchStack is a stack of branches of the tree (SCFG2) that need to
   be completed

-- NextNodeList is a list of nodes of SCFG sorted in ascending order by
   topological number; it is a list of nodes that need to be stitched,
   but haven't been copied to the duplicate graph yet; no node can be
   inserted into the list more than once (the list contains no duplicates)

-- LoopList is a list of node pairs sorted in ascending order by the
   topological number of the first node in the pair;  it is a list of
   loop heads paired with their tails; a loop head may appear more
   than once if it has multiple tails; each tail appears only once in
   the list, since a back edge can only point to one head

-- CopyList(L2, L1) creates a copy L2 of list L1

-- CopyNode(N2, N1) creates a copy N2 of node N1 (not including
   connecting edges)

-- RC is reachability conditions

-- is_predecessor(N1, N2) is true if N1 is a predecessor of N2 or
     N1 == N2, false if N1 succeeds N2 or N1 neither preceeds nor
     succeeds N2


Algorithm
---------

CreateEmptyList(LoopList)
FindLoopHeadsAndTails(LoopList, SCFG->Start)

RemoveBackEdges(SCFG->Start)

ComputeTopologicalNumbers(SCFG->Start)

CopyNode(SCFG2->Start, SCFG->Start)

CreateEmptyList(NextNodeList)
InsertIntoList(NextNodeList, SCFG->Start->Edge[0]->Dest)

CreateEmptyStack(BranchStack)
PushOntoStack(BranchStack, (SCFG2->Start, NoCondition, NextNodeList,
  LoopList))

until EmptyStack(BranchStack) {
	/* Start on a branch of the tree */
	(PreviousDupNode, EdgeCondition, NextNodeList, LoopList) =
	  PopOffStack(BranchStack)

	/* Add to the branch until it is completed */
	until EmptyList(NextNodeList) {

		/* Check if there is a loop that we need to stitch
		   before the next node scheduled to be stitched.
		 */
		NextNode = HeadOfList(NextNodeList)
		for each node pair (Head, Tail) in LoopList {
			/* If the next node to be stitched is the loop
			   head or if we already passed by the tail,
			   we don't need to worry about this loop, so
			   remove it from the list.
			 */
			if (NextNode == Head) or
			  (Topo#(Tail) < Topo#(NextNode)) {
				RemoveFromList(LoopList, (Head, Tail))
			}
			else  {
				/* If the reachability conditions of
				   the next node to be stitched imply
				   the intersection of the
				   reachability conditions of the head
				   and the tail, then we might reach
				   the tail from NextNode but not the
				   head, so we had better ensure that
				   we stitch the head.
				 */
				if RC(NextNode) ==> (RC(Head) &&
				  RC(Tail)) {
					InsertIntoList(NextNodeList,
					  Head)
					RemoveFromList(LoopList,
					  (Head, *))
				}
			}
		}

		/* Copy the next node in the topological order that
		   needs to be stitched.
		 */
		ThisOrigNode = RemoveHeadOfList(NextNodeList)
		CopyNode(ThisDupNode, ThisOrigNode)
		CreateEdge(PreviousDupNode, ThisDupNode, EdgeCondition)

		/* Static branches spawn new branches in the SCFG2
		   tree.  The current branch follows the first edge.
		 */
		if EndsInStaticBranch(ThisOrigNode) {
			for each outgoing edge ThisOrigNode->Edge[i] except
			  the first (in reverse topo. order) {
				CopyList(NewNodeList, NextNodeList)
				CopyList(NewLoopList, LoopList)
				InsertIntoList(NewNodeList,
				  ThisOrigNode->Edge[i]->Dest)
				PushOntoStack(BranchStack,
				  (ThisDupNode,
				  Condition(ThisOrigNode->Edge[i]),
				  NewNodeList, NewLoopList))
			}

			InsertIntoList(NextNodeList,
			  ThisOrigNode->Edge[0]->Dest)
			EdgeCondition = Condition(ThisOrigNode->Edge[0])
		}
		else {
			/* Ensure all the children of this dynamic
			   branch or singleton edge will be stitched
			   in the proper order by inserting them into
			   the sorted list.
			 */
			for each outgoing edge ThisOrigNode->Edge[i] {
				InsertIntoList(NextNodeList,
				  ThisOrigNode->Edge[i]->Dest) 
			}

			EdgeCondition = NoCondition
		}

		PreviousDupNode = ThisDupNode
	}
}


********************************************************************************
The Algorithm
********************************************************************************

Since we're traversing the SCFG and making modifications as we go
along, we maintain two graphs for sanity (traverse the original and
build a duplicate).

By passing another parameter to transform(), we could eliminate the
BACK_STACK, but it would complicate the code.

The topological traversal defines a unique order on the edges (by the
topological numbers of the edge destinations).

Notation
--------

NTBS_LIST is a sorted list of nodes

NTBS is need to be stitched

HBS is has been stitched

RC is reachability conditions (modified)

ME is mutually exclusive

is_predecessor(A, B) is true if A is a predecessor of B or A == B

SCFG2 is a duplicate CFG, initially containing duplicates of all the
nodes, but no edges.

The Algorithm
-------------

1) Peel off the first iteration of every one-way unrolled loop.  This
makes stitching of unrolled loops with multiple entries more efficient
and any unnecessarily generated code will be eliminated.

2) Find all loops and annotate each loop-entry edge with the head of
the outermost loop it enters.

3) Remove back edges from SCFG.

4) Redirect each loop-entry edge to point to the head of the outermost
loop entered.

5) Insert empty nodes into the SCFG on all edges from dynamic
branches.  This prevents redirection of the first edge from a dynamic
branch, and prevents redirection to edges that will later themselves
be redirected (edges from dynamic branches into nonconstant merges).
It also ensures that the destination of each edge from a dynamic
branch has only one incoming edge.

6) Recompute reachability conditions.

7) Compute SCFG predecessors.

8) Copy the nodes of SCFG to SCFG2.

9) Start the recursive transformation and topological traversal by
calling transform(SCFG->entrynode).

10) Add back edges corresponding to loops to be unrolled to SCFG2.


Invariants on the SCFG when transform is called:
	-- No back edges
	-- Edges from dynamic branches don't go into merges


transform(node) {
	/* if we're done, quit */
	if node == SCFG->exitnode {
		return
	}

	if ends_in_dynamic_branch(node) {
		/* Push the destinations of dynamic edges onto the
		   stack so that we can redirect edges into
		   nonconstant merges to them.  We want the edges to
		   be popped from BACK_STACK in the same order in
		   which they are traversed.  We don't redirect to
		   edges emanating from static branches because:
			-- If we're redirecting, the RC of the edge
			   are not ME with those of the last edge into
			   a nonconstant merge.
			-- So, there must be an edge from a dynamic
			   branch that leads to the path from which
			   the last edge comes.
			-- The RC of the next edge from a static
			   branch to be traversed must be ME with
			   those of the redirected edge, and only one
			   of the two paths should be stitched.
		 */ 
		for each outgoing edge node->e[i] (in reverse order) {
			push(BACK_STACK, node->e[i]->dest)
		}
	}

	/* Recursively invoke transform() on graph successors,
	   backtracking at non-last edges into merges.
	 */
	for each outgoing edge node->e[i] {
		/* Remove the current edge from the stack so the top
		   of the stack is the next dynamic edge we're going
		   to handle.
		 */
		if ends_in_dynamic_branch(node) {
			assert(node->e[i]->dest == peek(BACK_STACK))
			pop(BACK_STACK)
		}

	/* Case 1: edge is only incoming edge (not a merge)

	   Simply move down the graph.
	 */

		/* If the edge destination is not a merge, then go to
		   that node.
		 */
		if (not is_merge(node->e[i]->dest) {
			/* For dynamic branches, add only the first
			   edge to the new graph.  The other branch
			   destinations will be strung together.  All
			   singleton edges and edges from static
			   branches may be added.
			 */
			if (not ends_in_dynamic_branch(node)) or
			  (ends_in_dynamic_branch(node) and
			  node->e[i] is the first edge from node) {
				Add edge node->e[i] to SCFG2
			}

			transform(node->e[i]->dest)
		}

	/* Case 2: last edge into a merge, an edge into a constant
	   merge, or an edge into a nonconstant merge where the RC of
	   the current edge are ME with those of the last edge

	   Insert checks to go back if necessary and move down the
	   graph if it's the last edge.
	 */

		/* we know the edge destination is a merge */

		/* If the edge is the last edge or its reachability
		   conditions are mutually exclusive with those of the
		   last edge, we're going to stitch the merge next.
		   So, if the merge is a nonconstant merge, check if
		   we need to insert checks to go back and stitch any
		   blocks to preserve topological order.  At a
		   constant merge (this case catches all constant
		   merges), there won't be any blocks to go back and
		   stitch. 
		 */
		else if (node->e[i] == node->e[i]->dest->last_in_edge)
		  or ( RC(node->e[i]) ME RC(node->e[i]->dest->last_in_edge) ) {
			if is_nonconstant_merge(node->e[i]->dest) {
				/* Merges in the NTBS_LIST will be tested
				   in ascending order by their begin
				   times.  This ensures that no merge
				   will be checked before a
				   predecessor.  That is necessary to
				   ensure that we don't stitch the same
				   part of the graph twice.  The
				   NTBS_LIST is actually in
				   descending order and we insert the
				   checks backwards.

				   We may also be able to use the
				   ordering to optimize this loop.
				*/
				v = node->e[i]->dest
				for every node n in NTBS_LIST {
					/* We need to insert a check
					   to go back to a merge if
					   the destination of the
					   current edge is a successor
					   of a merge in NTBS_LIST,
					   but the source of the edge
					   is not.  If the source is
					   also a successor, we are
					   guaranteed the merge to
					   check has already been
					   stitched due to the strict
					   topological ordering,
					   making the check unnecessary.
					 */
					if is_predecessor(n,
					  node->e[i]->dest) && (not
					  is_predecessor(n, node)) {
						Add subgraph u
						 "if (n NTBS && !(n HBS)),
						  then goto n, else goto v"
						  to SCFG2
						v = u
					}
				}

				/* Connect node to the string of
				   checks anchored to
				   node->e[i]->dest.
				 */
				Add edge (node, v) to SCFG2
			}
			else {
				/* No checks are necessary, so just
				   add the edge.
				 */
				Add edge node->e[i] to SCFG2
			}

			/* If the edge is the last edge into a merge,
			   then go to that node.  If the edge is not
			   the last edge into the merge, then we need
			   to backtrack. 
			 */
			if (node->e[i] == node->e[i]->dest->last_in_edge) {
				transform(node->e[i]->dest)
			}
		}

	/* Case 3: non-last edge into a nonconstant merge

	   Redirect the edge and determine whether we might need to
	   come back to the merge or not.
	 */

		/* we know the edge destination is a nonconstant merge */
		/* we know the edge is not the last edge into the
		   merge, so we're going to backtrack */
		/* we know the reachability conditions of the edge
		   are not mutually exclusive with those of the last
		   edge, so we're going to redirect */
		/* we know the edge is a singleton edge or an edge
		   from a static branch */
	
		/* If the reachability conditions of the current edge
		   are not mutually exclusive with those of the last
		   edge (includes most edges at nonconstant merges),
		   then redirect the edge and check whether the last
		   edge might not be traversed.  
		 */
		else {
			/* If reachability conditions of the current
			   edge imply those of all remaining incoming
			   edges (the last edge and all edges whose
			   reachability conditions are mutually
			   exclusive with those of the last edge),
			   then we know we will reach the merge again
			   and no check is needed.  Otherwise, insert
			   the check.
			 */
			rc_remaining_edges = RC(node->e[i]->dest->last_in_edge)
			for each incoming edge node->e[i]->dest->ie[j] {
				if ( RC(node->e[i]->dest->ie[j]) ME
				     RC(node->e[i]->dest->last_in_edge) ) {
					rc_remaining_edges =
					  rc_remaining_edges or 
					  RC(node->e[i]->dest->ie[j])
				}
			}

			if ( rc_remaining_edges && RC(node->e[i]) ) !=
			  RC(node->e[i]) { 
				/* The remaining edges might not be
				   traversed, so insert a need-to-
				   be-stitched flag and add the merge
				   to the set of nodes to go back and
				   check.
				 */
				assert(is_a_cross_edge(
				  node->e[i]->dest->last_in_edge))
				Add node n 
				  "set NTBS flag for node->e[i]->dest"
				  to SCFG2
				Add edge (node, n) to SCFG2
				v = n
				Insert "set HBS flag for node
				  node->e[i]->dest" in node
				  node->e[i]->dest, if it is not there
				  already
				Insert node->e[i]->dest into NTBS_LIST
				  (begin times in descending order)
			}
			else {
				v = node
			}

			/* If we could have backtracked from the
			   merge, we're going to want to skip back to
			   it.
			 */
			backtrack = false
			for every node n in NTBS_LIST {
				if is_predecessor(n, node) && (not 
				  is_predecessor(n,
				  node->e[i]->dest->last_in_edge)) {
					backtrack = true
					break
				}
			}

			/* If we don't want to go straight back to the
			   merge, go to the redirected location.
			 */
			if backtrack {
				Add subgraph u
				  "if peek(BACK_STACK) HBS,
				  then goto node->e[i]->dest,
				  else goto peek(BACK_STACK)"
				  to SCFG2
				Add edge (v, u) to SCFG2
				Insert "set HBS flag for node
				  peek(BACK_STACK)" in node
				  peek(BACK_STACK), if it is not there
				  already
			}
			else {
				Add edge (v, peek(BACK_STACK)) to SCFG2
			}
		}
	}

	return
}

This algorithm correctly transforms the following CFG.

		    |
		    D1
		   /  \
		 D2    E
		/  \    \
	       A    S    |
	       |   / \   |
	        \ B   C  |
	         \ \ /  /
		  \ |  /
		   M1 /
		    |/
		    M2
		    |

To:
		    |
		    D1
		    |
		    D2
		    |
		    A
		    |
		    S
		   / \
		  B   C
		   \ /
		    M1
		    |
		    E
		    |
		    M2
		    |


********************************************************************************
Does this algorithm correctly handle back edges?
********************************************************************************

What happens if the SCFG has back edges?

Backedges for loops that are not unrolled can be ignored (not copied
to the duplicate graph).  Loops with multiple entries are handled
correctly because all entering edges are redirected to the loop head,
ensuring that the whole loop will be stitched and that no successor
will be stitched before a predecessor.  The following CFG is a good
example of this situation.

		     |
		     S   __
		    /|\ / |
		   A | H  |
		   |  \|  |
		   |   Q  |
		    \ /   |
		     D    |
		    / \   |
		   B   T  |
		   |   |__|

With one-way unrolled loops, we should be able to just add the back
edge to the duplicate graph after the transformation algorithm runs.
We also need to ensure that NTBS and HBS flags for nodes within the
unrolled loop are duplicated for each copy of the loop body.

Multi-way unrolled loops are a problem.  If done eagerly, they require
a stitch-time stack.  I've only just begun thinking about how they
affect this algorithm.


********************************************************************************
Integration with lazy stitching, multi-way unrolling, and other specialization
********************************************************************************

How do we integrate this algorithm with the lazy stitching
algorithm?  Subgraphs of the SCFG to be stitched lazily should not be
modified by this transformation, and movement between eager and lazy
portions of the CFGs should be well-defined.

Lazily stitched branches are similar, but not identical, to static
branches.  For example, we need to save some state before a lazy
branch.  Also, how do we identify merges below lazy branches so that
we don't redirect edges?  Do we need to identify lazy subgraphs in
advance?

Even without lazy, how do we handle multi-way unrolling and other
specialization?


********************************************************************************
********************************************************************************
********************************************************************************

To: mock
Subject: latest stitching writeup w/ splitting and specialization
Date: Mon, 20 Jul 1996
From: Brian Grant <grant@thistle.cs.washington.edu>


Stitching in DyC
Brian K. Grant


********************************************************************************
Summary
********************************************************************************

I will generally refer to the combined task of setup and stitching
simply as stitching.

The Problem
-----------

	-- Statically transform the CFG for combo eager & lazy setup &
	   stitching by using a modified topological sort; this
	   produces code the stitcher simply follows, eliminating the
	   need for the stitcher to manipulate or traverse the templates

Why Perform A Static Transformation?
------------------------------------

	-- Minimize stitch-time overhead by
		-- Solving our problem with Multiflow, allowing setup
		   computations to be performed using temps
		-- Eliminating the need for stacks and work lists to
		   keep track of setup computations to perform and
		   templates that need to be stitched
		-- Minimizing checks of whether blocks need to be
		   stitched or not
		-- Reducing context-switching overhead (between
		   stitched code and the stitcher) by eagerly
		   stitching as much as possible and by generating
		   optimized fixup code
	-- Fits with the idea of specializing the stitcher to the
	   dynamic region

Why Use A Topological Sort?
---------------------------

	-- It is _the_ traversal of the CFG
		-- Both depth-first and bread-first topological
		   traversals are possible, but DF is easier to
		   implement
		-- No successor is traversed before a predecessor, so
		   graph invariants are easier to maintain (especially
		   if information is propagated along edges)
		-- Preserves data-flow dependencies
		-- For structured programs, the code generated is in
		   has the same layout as the source program 
		-- It is easy to understand

	-- Why not DFS?
		-- Graph invariants are difficult or impossible to
		   maintain because merges are traversed before all
		   predecessors have been traversed

	-- Why not something else (like BFS)?
		-- Length of live ranges (short with TS, DFS)
		-- Spacial locality (good with TS, DFS)
		-- Number of branches (low with TS, DFS)
		-- Doesn't preserve data dependencies

What's So Difficult About That?
-------------------------------

	-- We don't want to stitch all blocks, just all blocks we need
	   to stitch, and we don't know which blocks those are statically
	-- We must ensure all branch destinations are stitched
		This is tricky because of
		-- Unstructured code
		-- Combinations of static and dynamic branches
		-- Replicating specialization (keepConst)
		-- Loops, especially overlapping ones, are the biggest
		   problem

Competing Strategies
--------------------

1) Edge Redirection
	-- Perform a topological traversal of the graph
	-- Include only the first edge from a dynamic branch, all
	   edges from static branches
	-- Redirect certain edges at nonconstant merges
		-- Non-last edges whose reachability conditions are
		   not mutually exclusive with those of the last edge
		-- Redirect to the next edge from a dynamic branch
		   that we will traverse
	-- Blocks that need to be stitched could be bypassed because
	   only the last edge into a merge is not redirected, so if
	   we could miss some blocks, insert them with NTBS checks

	The remaining questions are:
	-- Elimination of redundant NTBS checks?
	-- Loops?
	-- Specialization?

2) Equivalence Classes on Reachability Conditions and Specialized Variables
	-- Compute nested equivalence classes of nodes with the same
	   reachability conditions
	-- New nested equivalence classes begin at the destinations of
	   static branches and at "weird merges", merges that
	   introduce OR's into the reachability conditions (note that
	   this includes some constant merges)
	-- Innermost equivalence classes are topologically sorted,
	   then connected to their parent class, which can then
	   sorted, and so on
	-- Equivalence classes started at the children of static
	   branches are simply connected to the static branches;
	   constant weird merges can also be connected by their
	   original incoming edges; nonconstant weird merges must be
	   appended to the end of the parent graph

	The remaining questions are:
	-- Is there a more efficient way to connect equivalence
	   classes started by nonconstant weird merges?
	-- How do loops affect this algorithm?
		-- We must be careful that the NTBS flags for the heads
		   of loops entered are set
		-- We must also ensure that the flags are set before
		   they are tested (can't just set them on back edges)
	-- How does specialization affect this algorithm?
		-- We need to intersect RC equiv. classes with Kconst
		   equiv. classes
		-- A nested Kconst equiv. class should be stitched
		   eagerly if the new variables kept constant were
		   already rtconsts
		-- We need a stack for eagerly stitching multiple
		   versions

3) Splitting and Merging Plus Topological Sort Plus Equivalence
   Classes on Specialized Variables


********************************************************************************
Classification of edges
********************************************************************************

I will sometimes use DFS edge-classification terminology when
describing certain types of edges.  The edge classifications tree,
forward, and cross have no particular significance in a CFG.  Back
edges, however, do, and they must be eliminated before performing a
topological traversal.

The definitions of tree, forward, and cross edges of the CFG come from
the DFS tree built when traversing the CFG.  In the DFS traversal, a
merge node is traversed the first time it is reached and the algorithm
must backtrack on all subsequent visits.  This is, in a sense, the
opposite of the depth-first topological traversal, in which a merge
node is only traversed once it is reached from the last parent to be
traversed.

Nodes are added to the DFS tree as the CFG is traversed.  A tree edge
is an edge to a node not yet in the DFS tree (the first edge to a
block).  A forward edge is an edge to a successor in the DFS tree.  A
cross edge is an edge to a node that is neither a successor or
predecessor in the DFS tree.


********************************************************************************
Identifying predecessors in the CFG
********************************************************************************

Predecessor information is used in the algorithms.

1) Identify back edges using DFS.
2) Remove all back edges from the CFG.
3) Perform a depth-first topological traversal and count the nodes as
they are visited.  These counts are the topological numbers and will
also serve as begin times.
4) Reverse all edges and continue counting from the bottom of the CFG
to the top.  These counts are the end times.  The bottom of the CFG
will have end_time = begin_time + 1.

If the (begin, end) range of one node is completely contained within
the range of another node, the first node is a successor of the second
in the original CFG.


********************************************************************************
Can we preserve structured subgraphs?  Can we preserve subgraphs beginning at
static branches?
********************************************************************************

Suppose we want to preserve "structured subgraphs" during the graph
transformation.  Can we identify the underlying structure in a CFG
with a few stray "unstructured edges"?  To do that, we would need
definitions of "structured" and "unstructured".

One possible definition of structured code is code containing only
looping and decision constructs that are strictly nested, with no
short cuts or gotos.  However, most C code contains simple forms of
unstructure:

	-- short-circuiting Boolean expressions
	-- break
	-- continue
	-- fall-through cases
	-- early return
	-- goto bottom of function

Ideally, we should be able to handle these common forms of unstructure
as efficiently as perfectly structured code.  Can we come up with more
lenient definitions of "structured subgraphs" that include these cases?

Unfortunately, isolating the "unstructuredness" of a CFG is very
difficult.  One characteristic of a node in the graph we could use to
locate "unstructuredness" is whether a new OR is introduced into the
reachability conditions at a merge.  We'll call merges with new ORs
introduced into the reachability conditions weird merges.  It turns
out we'll use the concept of weird merges in one of the algorithms.

Let's look at how we would want to transform some simple graphs with a
little unstructuredness.

Study the following diagram of a CFG.  Assume S is a static branch,
and Z is "the sky" (an edge from some unspecified place in the CFG).

         |   |
         S   Z
        / \ /
       A   B
        \ /
         M
         |

Why can't we leave edges SB, AM, and BM intact and place a test (does
B need to be stitched?) on edge ZB?  The main impediment to leaving
those edges intact is identifying that there is something special
about the nonconstant merges B and M, that treatment of the edge ZB
should be delayed.  The merge at B may not even be a weird merge, such
as in the following example.

Here is a more concrete example (D dynamic, S static):

if (D && S)
        {A}
else
        {B}
{M}

This code translates to the following CFG:

         |
         D
        / \
       S  |             B and M are nonconstant merges, but not weird
      / \ /
     A   B
      \ /
       M
       |

The reachability conditions at B and M are the same as those at D, so
we always want to eagerly stitch B, but only want to stitch A if S is
true.  So, we want to create the following CFG:

         |
         D
         |
         S
        /|
       A |
        \|
         B
         |
         M
         |

If we look at this from the edge-redirection perspective, then we have
redirected edge AM to point to the destination of edge DB (the next
edge from a dynamic branch that we backtrack to in the traversal),
creating edge AB.  Edge SB is not really left untouched.  It is also
redirected to the destination of edge DB, which happens to be to the
same block, B.

From the RC-equivalence-class perspective, blocks D, S, B, and M are
in the same equivalence class, with A being in a nested class.  A is
made conditional on the test at S being true and the other blocks are
topologically sorted.

From the splitting and merging perspective, we connect D to S, then
split at S, creating two tails ABM and BM.  We obtain the above graph
by merging the BM suffixes.

Here are some similar examples (S denotes static branch, D dynamic):

         |                              |
         S                              S
        / \                            / \
       D  |                           D  |
      / \ /     transformed ->        |  /       B is a constant merge
     A   B                            A /        M is a nonconstant merge
      \ /                             |/
       M                              B
       |                              |
                                      M
         |                            |
         D1
        / \
       D2 |
      / \ /     transformed -> -D1-D2-A-B-M-
     A   B                                       B and M are nonconstant
      \ /                                        merges with identical
       M                                         reachability conditions
       |                                         on all incoming edges

         |
         S1
        / \
       S2 |
      / \ /     transformed -> unchanged
     A   B                                       B and M are constant merges
      \ /					 B is a weird merge
       M
       |


********************************************************************************
Equivalence Classes based on Reachability Conditions
********************************************************************************

One way of tackling the transformation problem is to identify
subgraphs of the SCFG that are to be linearized, linearize those
subgraphs, and connect them so that the appropriate subgraphs are
stitched.

To do this, we group blocks in nested equivalence classes.  Blocks
with the same reachability conditions are put in the same equivalence
class.  Thus, new nested equivalence classes are created at static
branches and weird merges.  A weird merge is one introducing an OR
into the reachability conditions.

The blocks in the innermost equivalence classes contain no static
branches and so can be linearized by topological sort.  Once that is
completed, the surrounding equivalence class can be linearized.  So,
by starting at the innermost nested equivalence classes and working
out, the whole graph can be linearized.

A subgraph of a nested equivalence class spawned by a static branch is
connected to the subgraph of its parent by the original static branch.
Static branches, then, choose between linearized subgraphs at
stitch-time.

Subgraphs of weird-merge equivalence classes are connected to the
parent graph in

 different ways, depending on whether the merge is
constant or nonconstant.

can be connected to their parents in the original CFG.

set flag on edge from original parent
 concatenated and appended to the subgraph of the parent equivalence class.  


The following diagram shows a CFG with two constant weird merges.

		 ENTRY
		   |
		   1
		   |
		   S1
                  /  \
		2      3
		|      |
	        S2     S3
               /  \   /  \
	      4    5 6    7
	       \   \ /   /
                \   8   /
                 \ /   /
                  9   /
                   \ /
                   10
		    |
		  EXIT

The following tree shows the nested equivalence classes for this CFG.
Blocks within parentheses are in the same equivalence class, and
children in the tree are in nested equivalence classes.

             (ENTRY,1,10,EXIT)
                   | \_ \_
		   S1  \_ \_
                 t/ \f   \w \w
               (2)   (3) (8)(9)
                |     |
                S2    S3
              t/ \f  t/ \f
	     (4) (5) (6) (7)

From this equivalence-class tree, we construct the following CFG.

		 ENTRY
		   |
		   1
		   |
		   S1
                  /  \
		2      3
		|      |
	        S2     S3
               /  \   /  \
	      4    5 6    7
	      |    | |    |
             9NS 8NS 8NS  |	
	       \  /   \  /
		\/     \/
                 \     /
                  \   /
                   \ /
                   10
		    |
		   8NS?
		   /|
                  8 |
		  | |
		9NS |
                   \|
                   9NS?
		   /|
		  9 |
		  \ |
		   \|
		  EXIT

So, all blocks are stitched that need to be.  Since we don't want
blocks 8 and 9 to be stitched unnecessarily, we set
need-to-be-stitched flags for the blocks on the edges from blocks 4,
5, and 6 and check those flags before the blocks are stitched.

How does this relate to control-dependence trees?  Reachability
conditions are a type of control-dependence information, where we are
only concerned about dependence upon static branches.  Building the
nested equivalence classes based on reachability conditions is very
similar to insertion of region nodes into the control-dependence
subgraph.  The region nodes become the parents of nodes with the same
control dependencies and children of nodes with fewer dependencies.
The main difference in the two approaches is in the handling of weird
merges and back edges.


********************************************************************************
Splitting and Merging + Topological Sort + Equivalence Classes on Kconst
********************************************************************************

The concept for this algorithm is based on splitting at merges and
stitching a subset of the graph in topological order, the subset being
determined by which static branches are taken at stitch-time.

The duplicate CFG built by the following algorithm is a forest, with
new roots created at special merges and new tree branches created at
static branches.  After the algorithm runs, another pass will attempt
to merge the tails (quadratic in number of tails?  exponential in
number of static branches?).

Before we include each block in the tree, we check whether we might
reach a loop tail (block ending in a back edge) but not the loop head
(destination of the back edge).  If this is possible, we insert the
loop head in the sorted list of nodes to stitch.  The test I use to
check this is whether the reachability conditions of the next node in
the sorted list imply the intersection of the reachability conditions
of each tail and its corresponding head.

Each tree created corresponds to a group of blocks that will be
stitched eagerly.  The tree rooted at the start of the original CFG is
stitched first, then the stitcher will jump to the stitched code.
Other trees will be stitched lazily.

In this version of the writeup, I haven't included the details about
how to link the trees together.  We need to ensure that the values of
live rtconsts are carried to the roots of the trees.  We also need to
ensure that the stitcher returns to the right root when more code
needs to be stitched lazily.  This is not as straightforward as it
seems because there can be multiple entries into unrolled loops.

I also haven't worked out all of the details of the stitcher, such as
branch patching.  Because many copies and versions of the same block
can be stitched, patching correctly will be nontrivial.


Assumptions and notation
------------------------

-- The BTA runs before this algorithm, computing the reachability
   conditions and marking merges as special where specialization must
   begin

-- RC is reachability conditions

-- IsSpecial(N) is true if N is a merge where we need to begin
     specializing; that is, N is a merge with multiple reaching
     definitions for some variable in the set Kconst(N)

-- MakeUnspecial(N) causes IsSpecial(N) to return false

-- Zero(N) returns 0

-- Topo#(N) returns the topological number of node N

-- IsPredecessor(N1, N2) is true if N1 is a predecessor of N2 or
     N1 == N2, false if N1 succeeds N2 or N1 neither preceeds nor
     succeeds N2

-- CopyNode(N2, N1) creates a copy N2 of node N1 (not including
   connecting edges)

-- Edges are ordered by the topological numbers of their destinations

-- Edge->Dest is the destination node of Edge

-- Node->Edge[0] is the first edge of Node

-- SCFG->Start has a single successor

-- (F0, F1, ..., FN) is an (N+1)-tuple of objects F0, F1, F2, ..., FN;
   F0 is in field 0, F1 is in field 1, etc.

-- Field(Tuple, N) returns field N of Tuple

-- Stacks: stacks contain fixed-length tuples of pointers or scalars
   CreateEmptyStack(Stack, Size) creates the empty stack Stack of
     tuples of size Size
   DestroyStack(Stack) frees Stack
   EmptyStack(Stack) returns true if Stack is empty, false otherwise
   PushOntoStack(Stack, Tuple) pushes Tuple onto Stack
   Tuple = PopOffStack(Stack) pops and returns the top of Stack
   Tuple = PeekAtStack(Stack) returns the top of Stack without popping

-- BranchStack is a stack of 5-tuples: (Node, EdgeCondition, List,
   List, List); the tuples contain information about branches of the
   tree (SCFG2) that need to be completed

-- Lists: sorted lists contain fixed-length tuples of pointers or
     scalars and no duplicate tuples
   CreateEmptyList(List, Size, SortField, ValueFunction) creates a
     list of tuples of size Size that is sorted in ascending order by
     the value returned by invoking ValueFunction on field SortField
     of each tuple
   DestroyList(List) frees List
   EmptyList(List) returns true if List is empty, false otherwise
   InsertIntoList(List, Tuple) inserts Tuple into List in the proper
     position
   RemoveFromList(List, Tuple) removes Tuple from List; a wildcard (*)
     may be used in any tuple field
   IsInList(List, Tuple) returns true if Tuple is in List, false otherwise
   Tuple = HeadOfList(List) returns the head of the list (least tuple
     by the ValueFunction)
   Tuple = TailOfList(List) returns the tail of the list (greatest
     tuple by the ValueFunction)
   CopyList(NewList, OldList) creates a copy NewList of list OldList

-- SpecializationRootList is a list of nodes in the SCFG sorted in
   ascending order by topological number; it is a list of roots of
   specialized subgraphs that must be created 

-- NextNodeList is a list of nodes of SCFG sorted in ascending order by
   topological number; it is a list of nodes that need to be stitched,
   but haven't been copied to the duplicate graph yet 

-- LoopHeadList is a list of (LoopHead, LoopTail) node pairs sorted in
   ascending order by the topological number of the first node in the
   pair; a loop head may appear more than once if it has multiple
   tails; each tail appears only once in the list, since each back
   edge can only point to one head

-- LoopTailList contains the same node pairs as LoopHeadList, but is
   sorted by the second node in the pair

Algorithm
---------

CreateEmptyList(OriginalHeadList, 2, 0, Topo#)
CreateEmptyList(OriginalTailList, 2, 1, Topo#)
FindLoopHeadsAndTails(OriginalHeadList, OriginalTailList,
  SCFG->Start)

RemoveBackEdges(SCFG->Start)

ComputeTopologicalNumbers(SCFG->Start)

CreateEmptyList(SpecializationRootList, 1, 0, Topo#)
InsertIntoList(SpecializationRootList, (SCFG->Start))

CreateEmptyList(RootsDone, 1, 0, Topo#)

/* This list doesn't need to be sorted */
CreateEmptyList(GraphsDone, 1, 0, Zero)

until EmptyList(SpecializationRootList) {

	/* Create a dummy node from which to hang this subgraph */
	CreateEmptyNode(DummyNode)
	InsertIntoList(GraphsDone, (DummyNode))

	/* Get the root of this subgraph */
	(ThisRoot) = RemoveHeadOfList(SpecializationRootList)
	InsertIntoList(RootsDone, (ThisRoot))
	CreateEmptyList(NextNodeList, 1, 0, Topo#)
	InsertIntoList(NextNodeList, (ThisRoot))

	/* We don't want to stop at the root again */
	MakeUnspecial(ThisRoot)

	/* Initialize the loop lists and prune them by eliminating tails
	   that are not reachable.  We could compute the transitive
	   closure instead.
	 */
	CopyList(TmpTailList, OriginalTailList)
	until EmptyList(TmpTailList) {
		(Head, Tail) = TailOfList(TmpTailList)
		/* If the root or another loop head precedes the tail,
		   then it is reachable.
		 */
		if (IsPredecessor(ThisRoot, Tail)) {
			InsertIntoList(LoopHeadList, (Head, Tail))
			InsertIntoList(LoopTailList, (Head, Tail))
		}
		else if (not EmptyList(LoopHeadList)) {
			CopyList(TmpHeadList, LoopHeadList)
			until EmptyList(TmpHeadList) {
				(ReachedHead, ReachedTail) =
				  HeadOfList(TmpHeadList)
				if IsPredecessor(ReachedHead, Tail) {
					InsertIntoList(LoopHeadList,
					  (Head, Tail))
					InsertIntoList(LoopTailList,
					  (Head, Tail))
					break
				}
				RemoveFromList(TmpHeadList,
				  (ReachedHead, ReachedTail))
			}
			DestroyList(TmpHeadList)
		}
		RemoveFromList(TmpTailList, (Head, Tail))
	}

	/* Add all heads preceding the root to the NextNodeList.
	 */
	while ((not EmptyList(LoopHeadList)) and
	  (Topo#(Field(HeadOfList(LoopHeadList), 0)) <
	  Topo#(ThisRoot)) {
		(Head, Tail) = HeadOfList(LoopHeadList)
		InsertIntoList(NextNodeList, (Head))
		RemoveFromList(LoopHeadList, (Head, *))
		RemoveFromList(LoopTailList, (Head, *))
	}

	CreateEmptyStack(BranchStack, 5)
	PushOntoStack(BranchStack, (DummyNode, NoCondition,
	  NextNodeList, LoopHeadList, LoopTailList))

	until EmptyStack(BranchStack) {
		/* Start on a branch of the tree */
		(PreviousDupNode, EdgeCondition, NextNodeList,
		  LoopHeadList, LoopTailList) = PopOffStack(BranchStack)

		/* Add to the branch until it is completed */
		until EmptyList(NextNodeList) {

			(NextNode) = HeadOfList(NextNodeList)

			/* Prune the loop lists */
			if (not EmptyList(LoopTailList)) {
				/* Move all (head, tail) pairs with
				   tails with topological numbers less
				   than that of NextNode to a new
				   list.
				 */
				(Head, Tail) = HeadOfList(LoopTailList)
				while (Topo#(Tail) < Topo#(NextNode)) {
					RemoveFromList(LoopHeadList,
					  (Head, Tail))
					RemoveFromList(LoopTailList,
					  (Head, Tail))
					InsertIntoList(TmpTailList,
					  (Head, Tail))
					if (not EmptyList(LoopTailList)) {
						(Head, Tail) =
						  HeadOfList(LoopTailList)
					}
					else {
						break
					}
				}

				/* We may reach the tails remaining in
				   the loop lists.  If there is a
				   corresponding head whose
				   topological number is less than or
				   equal to the tail we're checking,
				   then we must add that (head, tail)
				   pair back to the loop lists.
				 */
				until EmptyList(TmpTailList) {
					(Head, Tail) =
					  TailOfList(TmpTailList)
					if ((not EmptyList(LoopHeadList)) and
					  ((Topo#(Field(HeadOfList(LoopHeadList),
					  0)) <= Topo#(Tail)))) {
						InsertIntoList(LoopHeadList,
						  (Head, Tail))
						InsertIntoList(LoopTailList,
						  (Head, Tail))
					}
					RemoveFromList(TmpTailList,
					  (Head, Tail))
				}
			}

			Assert(Topo#(Field(HeadOfList(LoopHeadList),
			  0)) > Topo#(NextNode))

			/* Check if there is a loop that we need to
			   stitch before the next node scheduled to be
			   stitched.
			 */
			CopyList(TmpHeadList, LoopHeadList)
			until EmptyList(TmpHeadList) {
				(Head, Tail) = HeadOfList(TmpHeadList)

				/* If the next node to be stitched is
				   the loop head, we don't need
				   to worry about the loop, so remove
				   it from the loop lists.
				 */
				if (NextNode == Head) {
					RemoveFromList(LoopHeadList,
					  (Head, *))
					RemoveFromList(LoopTailList,
					  (Head, *))
					RemoveFromList(TmpHeadList,
					  (Head, *)) 
				}

				/* If the reachability
				   conditions of the next node
				   to be stitched imply the
				   intersection of the
				   reachability conditions of
				   the head and the tail, then
				   we might reach the tail
				   from NextNode but not the
				   head, so we had better
				   ensure that we stitch the
				   head.
				 */
				else if RC(NextNode) ==> (RC(Head) &&
				  RC(Tail)) {
					InsertIntoList(NextNodeList,
					  (Head))
					RemoveFromList(LoopHeadList,
					  (Head, *))
					RemoveFromList(LoopTailList,
					  (Head, *))
					RemoveFromList(TmpHeadList,
					  (Head, *))
				}

				else {
					RemoveFromList(TmpHeadList,
					  (Head, Tail))
				}

			}
			DestroyList(TmpHeadList)

			/* If the next node is a merge where we need to
			   begin specializing, then it becomes a root
			   of its own subgraph and we will come back
			   to it.
			 */
			while IsSpecial(Field(HeadOfList(NextNodeList), 0)) {
				(NewRoot) = RemoveHeadOfList(NextNodeList)
				if (not IsInList(RootsDone, (NewRoot))) {
					InsertIntoList(SpecializationRootList,
					  (NewRoot))
				}
			}

			/* Copy the next node in the topological order
			   that needs to be stitched.
			 */
			(ThisOrigNode) = RemoveHeadOfList(NextNodeList)
			CopyNode(ThisDupNode, ThisOrigNode)
			CreateEdge(PreviousDupNode, ThisDupNode,
			  EdgeCondition)

			/* Static branches spawn new branches in the
			   SCFG2 tree.  The current branch follows the
			   first edge.
			 */
			if EndsInStaticBranch(ThisOrigNode) {
				for each outgoing edge
				  ThisOrigNode->Edge[i] except the
				  first (in reverse topo. order) {
					CopyList(NewNodeList, NextNodeList)
					CopyList(NewLoopList, LoopList)
					InsertIntoList(NewNodeList,
					  (ThisOrigNode->Edge[i]->Dest))
					PushOntoStack(BranchStack,
					  (ThisDupNode,
					  Condition(ThisOrigNode->Edge[i]),
					  NewNodeList, NewLoopList))
				}

				InsertIntoList(NextNodeList,
				  (ThisOrigNode->Edge[0]->Dest))
				EdgeCondition =
				  Condition(ThisOrigNode->Edge[0])
			}
			else {
				/* Ensure all the children of this
				   dynamic branch or singleton edge
				   will be stitched in the proper
				   order by inserting them into the
				   sorted list.
				 */
				for each outgoing edge
				  ThisOrigNode->Edge[i] {
					InsertIntoList(NextNodeList,
					  (ThisOrigNode->Edge[i]->Dest)) 
				}

				EdgeCondition = NoCondition
			}

			PreviousDupNode = ThisDupNode
		}
		DestroyList(NextNodeList)
		DestroyList(LoopHeadList)
		DestroyList(LoopTailList)
	}
	DestroyStack(BranchStack)
	DestroyList(TmpTailList)
}
DestroyList(TmpTailList)
DestroyList(OriginalHeadList)
DestroyList(OriginalTailList)
DestroyList(SpecializationRootList)
DestroyList(GraphsToBuild)

MergeTails(GraphsDone)


********************************************************************************
Specialization and lazy stitching
********************************************************************************

Specialization will require insertion of cache checks at merges where
variables to be kept constant would have different incoming values.
We'll need to build a new tree containing blocks with uses of rtconsts
derived from the rtconsts that have changed and the intervening
blocks.  The new tree extends from the cache check to the unkeepConst
of the variables (or bottom of the nested dynamic region, such as for
unrolled loops).  The original tree (well, the graph, after tail
merging) will be specialized and stitched first.  The new tree (again,
graph after merging) will be stitched each time we respecialize for
different values.

It looks like specialization will have to be handled lazily.  I've
discovered two problems with stitching only eagerly:

	1) I don't think shareConst, cacheOne, and replicate make sense
	   if we only stitch eagerly.

	2) We never know the values of rtconsts at at start of a dynamic
	   region until it is reached, and there are situations where
	   we can encounter this problem inside the region as well.  If we
	   don't know the values, we can't stitch.  See the following
	   examples.

DynRegion {

	...

	x = foo(x, y, z);

	keepConst(x);

	if (d)
		x += 1;
	else
		x += 2;

	...
}

Since we don't know the initial value of x, we can't compute its value
on the two paths created at the merge of the if statement.  So, if any
of the arguments of a keepConst or isConst is not already an rtconst,
it creates a region that cannot be stitched until it is reached and so
it must be stitched lazily (e.g., the outermost dynamic region is not
stitched until it is reached).

So, the question is: how do we implement lazy?  If we only use lazy for
specialization, we can't execute at the same time as stitching because
the control flow of the stitcher is different from that of the original
code.  So, we need to alternate stitching and executing, but at what
granularity?  My first thought is to stitch until we hit a place where we
need to specialize, then execute.  Once the end of the stitched code is
reached, go back and stitch some more (potentially skipping by the
path that requires specialization).  If we do lazy, we also need to relax
the topological-ordering constraint because some blocks will not be stitched
eagerly.  So, the eager-stitching algorithm will have to be tweaked.

As discussed before, the stitcher will need to run with a different stack
than the stitched code, and we need to save and restore state when we
switch between the stitcher and the stitched code.  We'll need space to
store state for every specialized path through the CFG that we don't
complete.


********************************************************************************
When replicating specialization stops
********************************************************************************

1) At unkeepConst or end of enclosing dynamic region
2) When the BTA can no longer determine the variable's value based on
other rtconsts.

Here is an example of multi-way unrolling of a byte-code interpreter
for a language that allows indirect jumps on run-time-computed values
(e.g., a hardware simulator):

	keepConst(pc);
	pc = 0;
	while (opcode(pc) != RETURN) {
		...
		switch opcode(pc) {
		case BR:
			pc += secondOperand(pc); break;
		case IJMP:
			pc = registerFile[firstOperand(pc)]; break;
			/* no way to compute this
			   at stitch time! */
		...
		default:
			pc += 4; break;
		}
	}

In this case, the pc can no longer be kept constant after the merge
because it takes on a totally dynamic value, not one based on other
rtconsts.


********************************************************************************
CFG Test Suite
********************************************************************************

Simple case #1:

		   |
		   D
		  / \
		 S   C
		/ \  |
	       A   B |
		\ /  |
		 M   |
		  \ /
		   Z
		   |

Edge ME with last edge:

		   |
		   D
		  / \
		 A   S
		 |  / \
		 | B   C
		 \ |   /
		  \ \ /
		   \|/
		    Z
		    |

Several constant weird merges:

		 ENTRY
		   |
		   1
		   |
		   S1
                  /  \
		2      3
		|      |
	        S2     S3
               /  \   /  \
	      4    5 6    7
	       \   \ /   /
                \   8   /
                 \ /   /
                  9   /
                   \ /
                   10
		    |
		  EXIT


Some nonconstant weird merges:

		 ENTRY
		   |
		   1
		   |
		   D1
                  /  \
		2      3
		|      |
	        S2     S3
               /  \   /  \
	      4    5 6    7
	       \   \ /   /
                \   8   /
                 \ /   /
                  9   /
                   \ /
                   10
		    |
		  EXIT

A graph with edges that cut across both left and right:

		   |
		  D1
		 /  \
	       S1    D2
	      / \    /|\
	     A   | S3 E H
	     |   |/  \| |
	     |   B    F |
	     |   |    | |
	     |  S2   /  |
	      \ /\___   |
	       C   / \ /
	        \ /   I
		 G   /
		  \ /
		   M
		   |

Lotsa weird merges:

		   |
		   D1
		  /  \
		 /    \
	       D2      S1
	      /  \    /  \
	    S2    A  D3   B
	   /||\   | /\   /
	  / || |  _/  \ /
	D4 / | | /     I
	/\/  | |/ |    |
       C  E  F G  |    |
       \  | / /   |    |
	\ || /    |   /
	 \||/    /   /
	  J     /   /
	   \   /   /
	    \ /   /
	     K   /
	      \ /
	       Z
	       |

The following loop is an example of one with multiple entries:

		     |
		     S   __
		    /|\ / |
		   A | H  |
		   |  \|  |
		   |   Q  |
		    \ /   |
		     D    |
		    / \   |
		   B   T  |
		   |   |__|

Ensure loop is stitched iff it needs to be:

		     |
		    S1
		   /  \
		  /   S2   __
		 .   /  \ /  \
		 .  |    H    |
		 .  |    |    |
		    |    S3   |
		    |   /  \  |
		    |  A    B |
		    |   \  /  |
		    |    M1   |
		    \    |    |
		     \  /     |
		      \/      |
		      X1      |
		     /  \     |
		    /   S4    |
		   /   /  \   |
		  .   .    T  |
		  .   .    |__|
		  .   .

Simple, straight-line, overlapping loops:

		     | / \ 
		     H1 _|__
		     | / | |
		     H2  | |
		     |   | |
		     T1  | |
		    /|\__| |
		   / T2    |
		  / /|_____|
		 . .
		 . .
		 . .

Loops can be much more convoluted, with multiple entries, multiple
tails, and overlapping loops with heads and tails on different paths:

		     |
		 _   S1  _
		/ \ / \ / \
	       |   H1  H2  |
	       |   |   |   |
	       |    \ /    |
	       |     M     |
	       |     |     |
	       |     S2    |
	       |    / \    |
	       |   T1  T2  |
	        \_/|   |\_/
		   |   |
		    \ /
		     Z
		     |

Another one:

		____ |
	       /    \|
	       |     H1 ______
	       |     | /      \
	       |     H2        |
	       |     |	       |
	       |     S1	       |
	       |    /  \       |
	       |  D1    S2     |
	       |_/  \  /  \    |
		     D2    D3  |
		    /  \  /  \_|
		   |    \/
		   |    M1
		    \   /
		     \ /
		      M2
		      |