Section 6: Title (GEgen)?
======================================================================

Steps we perform
----------------

1) Generate new functions for different divisions of specialized
   functions (and create Function structs), and split merges with
   multiple divisions, yielding monovariant divisions
2) Unit identification
3) Separate setup and template code
4) Insert calls to the specializer at dynamic-to-static promotion
   points on region entries from static code
5) Produce optimized generating extensions, including merging setup
   and template in the back end, inserting emits and instructions to
   patch intra-unit branches


Unit Identification
----------------------------------------------------------------------

What are specialization units?
------------------------------

-- Specialization units are the largest subgraphs that the
   specialization algorithm can manipulate.

-- The specializer makes decisions about which code to specialize at
   the granularity of units; consequently, cache insertions,
   deletions, and lookups also operate on units.

-- Cache lookups ensure that specialized code is reused/shared if it
   already exists (important at promotions, demotions, and discordant
   merges).

What do specialization units look like?
---------------------------------------

-- Roughly, a unit begins at a specialization point (a term used in
   the literature, e.g., by Bulyonkov and Kochetov '96; they cite the
   JGS93 book), which is a point of which the specializer should be
   aware (to look up and/or begin specialization).

-- Units differ from basic blocks in that they can contain multiple
   calls and exits.

-- A unit must have a single entry.

-- If a predecessor of a node has a different context (and, hence,
   must be from a different unit), a cache lookup will be required
   either at run time (if there are promotions) or at specialization
   time (if there are only demotions), and so the node must start a
   new unit even if it has other predecessors with the same context.

-- Even if all predecessors of a node have the same context as the
   node, but they don't all belong to the same unit (e.g., due to
   "voluntary" laziness), the node must start a new unit, because it
   must be possible to generate code for that node when it is reached
   from any of its predecessors (remember that the specializer only
   generates code for whole units).

-- Even if you know that a particular predecessor will be specialized
   before the node is reached, when the other predecessors are
   specialized you need to be able to find the node in order to link
   (backpatch) the branches to it.  Thus, some kind of
   context-dependent lookup is required, which is essentially a cache
   lookup.

   For example: if some node d in the unit of one of the predecessors
   of node n dominates all of the predecessors, and there exist
   downward paths (not traversing back edges) from d to each of the
   predecessors that do not contain lazy edges, then n _could_ be
   included in the same unit as d.

   A concrete instance of this:

            make_static(x);
        L1: ...x...
            if (d1) {
                ...x...
            } else {
                if (d2) {
                    make_static(y);
        L2:         ...y...
                    make_dynamic(y);
                }
        L3:     ...x...
            }
        L4: ...x...
            make_dynamic(x);

   The node marked L4 could be included in the same unit as the node
   marked L1.  But, without a cache lookup, how would the specializer
   link the bottom of L3 to L4?

Single-Run-Time-Predecessor Optimizations
-----------------------------------------

When I say a unit has the all_unchecked policy, I mean that some
static variable has that policy as its current policy, causing the
unit always to be specialized.
We can relax the compile-time single-entry requirement for units if we
know that a node will have only a single predecessor _at_run_time_.

We can eliminate unit boundaries for eager successors with single
run-time predecessors, and for those with multiple predecessors but
the replicate (all_unchecked) policy.  For example, a static merge has
only one run-time predecessor, so we don't need a unit boundary at a
static multi-unit merge where all predecessors have the same context.
Actually, we only require that predecessors from different units be on
statically mutually exclusive paths.  Multiple non-mutually-exclusive
predecessors from the same unit are acceptable, because they will be
converted to a single predecessor on that static path during DAG
conversion.  Thus, it is possible for multiple units to share the same
code, and the graph is no longer strictly partitioned.  Note that the
shared code may actually be copied during DAG conversion if it is
reached in different orders on statically mutually exclusive paths.

Similarly, we can eliminate cache lookups (though not the unit
boundary) for lazy successors with no dynamic-to-static promotions and
single run-time predecessors (e.g., at lazy successors to dynamic
branches), or with multiple predecessors and the replicate
(all_unchecked) policy.

Above I mentioned only variables with the all_unchecked policy; the
optimizations apply equally to all other combinations of cache
policies, including one_unchecked.

NB: we do have to look at back edges when performing these
optimizations, which means unit identification is iterative!  This
could justify including "simple" unit identification in the BTA.  If
we require back edges to be mutually exclusive with all other incoming
edges, then "simple" unit identification is non-iterative, because it
won't matter from which unit the back edges emanate.

Simple unit identification
--------------------------

-- A node begins a new unit if any of the following is true:

   -- At least one incoming edge has dynamic-to-static promotions
      (explicit in BTA output)
   -- At least one incoming edge has static-to-dynamic "demotions"
      (difference of context sets)
   -- At least one incoming edge is lazy (explicit in BTA output)
   -- It is a discordant merge (explicit in BTA output)
   -- It has immediate predecessors (predecessors found on back edges
      are ignored) from different units, it is a non-constant merge,
      and the cache policy is not replicate (this is for single entry;
      such a node is called a multi-unit merge)

-- Unchecked eager promotions for static dereferences of static
   pointers do not create new unit boundaries

-- Build unit structs (input to specializer), including id, context
   variables, other live static variables, all known static variables,
   cache policy, successor edge information (dest. unit, successor
   number, eager/lazy, promotions), and call information (in the
   interprocedural version), and make demotions explicit as well for
   clustering; only CFG information need not be constructed at this
   point
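To make the contents of these unit structs concrete, here is a minimal
sketch of what such a descriptor might hold; all of the type and field
names below are hypothetical and simply mirror the list above, not the
actual declarations.

    /* Hypothetical sketch of a per-unit descriptor; names and layout
     * are illustrative only. */
    typedef struct VarSet { int NumVars; const char **Names; } VarSet;

    typedef enum CachePolicy {
        POLICY_CACHE_ALL, POLICY_CACHE_ONE,
        POLICY_CACHE_ALL_UNCHECKED,  /* replicate */
        POLICY_CACHE_ONE_UNCHECKED
    } CachePolicy;

    typedef struct UnitEdge {
        int    DestUnitId;        /* destination unit                          */
        int    SuccessorNumber;   /* which successor of the source unit        */
        int    IsLazy;            /* eager (0) or lazy (1) edge                */
        VarSet Promotions;        /* dynamic-to-static promotions on the edge  */
    } UnitEdge;

    typedef struct Unit {
        int         Id;
        VarSet      ContextVars;       /* variables compared in cache lookups      */
        VarSet      OtherLiveStatics;  /* live static variables not in the context */
        VarSet      AllKnownStatics;   /* all static variables known in the unit   */
        CachePolicy Policy;
        int         NumSuccessors;
        UnitEdge   *Successors;        /* successor edge information               */
        VarSet      Demotions;         /* made explicit for clustering             */
        /* call information would be added for the interprocedural version */
    } Unit;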
Clustering
----------

Matthai is writing up the algorithm.  Here is the basic idea.
Clustering attempts to reduce the number of units by moving, merging,
and/or eliminating unit boundaries.  It does so by moving the
promotions and demotions within the CFG.  If promotions and demotions
are made to coincide on the same edges with each other or with lazy
edges, or if promotions and demotions are moved to edges entering
discordant merges or multi-unit merges, then the number of unit
boundaries is reduced.  Discordant merges and multi-unit merges can be
eliminated as well, by pushing promotions or demotions past such
merges.  Clustering must update the information in the unit structs.


Produce optimized generating extensions
----------------------------------------------------------------------

In order to make run-time generation of dynamic code as efficient as
possible, the Specialize function is manually specialized for each
specialization unit, creating an optimized generating extension for
each unit.  All of the information in the unit structs is known at
static compile time, and so can be used in specializing the
specializer.  All of the run-time primitives called by the specializer
(see section 2) can be inlined into the body of the generating
extension.  The bulk of the generating extension is, of course,
embodied in the specialized ReduceAndResidualize primitive.  The other
primitives, when inlined and specialized, incur little additional
overhead.

Other optimizations
-------------------

Recursive calls to the specializer are implemented with a manually
manipulated stack of units to specialize and with jumps.  In a few
cases, this stack manipulation may be avoided.  For example, when a
dynamic branch to another unit appears on all static paths through a
unit, the specializer may proceed directly to that successor unit
after specializing the first one.

We do explicitly pass around lists of all live static variables, but
we are exploring ways of reducing redundancy in constructing the list.
We could also pass sets of indices into a table of the values, which
would reduce long-term memory usage but incur an additional cost for
the indirection.

We don't include all live static variables in the context used for
cache lookups, but rather only those that trigger polyvariant
specialization (promoted and discordant variables).  Because derived
static variables are monovariantly specialized and not promoted, they
never appear in the context.

A hybrid hierarchical caching scheme is possible, in which a level is
added to the hierarchy when adding to the context (at promotions), but
the level is reset to the top of the hierarchy when removing from the
context (at demotions).  The hierarchy allows checking only the new
variables in the context.  Resetting the level to the top at demotions
avoids the cost of traversing back up the tree and possibly down
another branch (e.g., at a merge of contexts {x,y} and {x,z}).
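As a rough illustration of this scheme, here is a minimal sketch of
such a hierarchical cache, assuming a simple linked representation;
CacheNode, CacheDescend, and the other names are hypothetical and do
not correspond to the actual cache implementation.

    /* Hypothetical sketch of the hybrid hierarchical cache. */
    #include <stdlib.h>
    #include <string.h>

    typedef struct CacheNode {
        long             *Key;      /* values of the variables added at this level */
        int               KeyLen;
        void             *Code;     /* specialized code for this context, if any   */
        struct CacheNode *Child;    /* first entry at the next level               */
        struct CacheNode *Sibling;  /* next entry at this level                    */
    } CacheNode;

    /* Find (or create) the child of Level whose key matches Vals; only the
     * given values are compared, not the whole context. */
    static CacheNode *CacheDescend(CacheNode *Level, long *Vals, int N)
    {
        CacheNode *c;
        for (c = Level->Child; c != NULL; c = c->Sibling)
            if (c->KeyLen == N && memcmp(c->Key, Vals, N * sizeof(long)) == 0)
                return c;                 /* hit: reuse the existing entry */
        c = calloc(1, sizeof *c);         /* miss: add a new entry         */
        c->Key = Vals;
        c->KeyLen = N;
        c->Sibling = Level->Child;
        Level->Child = c;
        return c;
    }

    /* At a promotion, add a level below the current one, checking only the
     * newly promoted values.  At a demotion, reset to the top of the
     * hierarchy and check the full remaining context in one step, rather
     * than walking back up the tree and down another branch. */
    static CacheNode *CacheAtPromotion(CacheNode *Cur, long *NewVals, int N)
    {
        return CacheDescend(Cur, NewVals, N);
    }

    static CacheNode *CacheAtDemotion(CacheNode *Root, long *Context, int N)
    {
        return CacheDescend(Root, Context, N);
    }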
DAG conversion
--------------

At specialize time, we must reduce and residualize all blocks of a
unit that could potentially be reached at run time, so we must pursue
all paths under dynamic control.  So that ReduceAndResidualize does
not have to traverse the unit subgraph using a work list and visited
flags, the unit subgraph is converted to a DAG that can simply be
traversed top to bottom.  This process goes beyond simple unrolling of
the traversal loop.  Because converting the unit subgraph to a DAG
also lets us know the order of traversal of the blocks at compile
time, we can better optimize the generating extension using standard
compiler optimizations.

*** Following DAG conversion of the static code, we must merge with
    the dynamic code, inserting emits. ***

Building a DAG of the basic blocks reached is difficult in the
presence of unstructured code, combinations of static and dynamic
branches, and loops or other back edges.  Our solution is to pursue
all static paths (static paths are determined by which static branches
are taken), copying blocks when they are reached in different orders
on different static paths.  Thus, the order of traversal of the basic
blocks is known for each static path.

Our algorithm first splits all static paths, creating a tree of basic
blocks.  Nodes under dynamic control are linearized in a topological
order.  Nodes in the tree with fanout greater than one correspond to
static branches.  Each node can appear in a given static path at most
once.  Then, all leaves of the tree are linked to a sink node, which
is added to a work list.  Until the work list is empty, remove a node
from the work list and unify all of its identical predecessors that
have the same successors, adding unified nodes to the work list.  The
merging algorithm can also be used to merge common tails of different
units.  (A rough sketch of this merging pass appears at the end of
this section.)


The Sorting & Splitting Algorithm

Pseudocode Notation

    RC(Node)                returns the reachability conditions for Node
    UnitHead(Node)          returns the head of the unit to which Node
                            belongs
    IsLazyUnit(Node)        returns true if the unit to which Node belongs
                            must be stitched lazily, false if the unit can
                            be stitched eagerly
    Zero(Node)              returns 0
    Topo#(Node)             returns the topological number of Node
    IsPredecessor(Node1, Node2)
                            is true if Node1 is a predecessor of Node2 or
                            Node1 == Node2, false if Node1 succeeds Node2
                            or Node1 neither precedes nor succeeds Node2
    NewNode = CopyNode(OldNode)
                            creates a copy NewNode of node OldNode (not
                            including connecting edges)

    Edges are ordered by the topological numbers of their destinations;
    Node->Edge[i] is the ith edge of Node; Node->Edge[0] is the first edge
    of Node

    Edge->Destination       is the destination node of Edge
    Edge->Condition         is the condition (NoCondition, True, False, 1,
                            100, etc.) of Edge

    (F0, F1, ..., FN) is an (N+1)-tuple of objects F0, F1, ..., FN; F0 is
    in field 0, F1 is in field 1, etc.

    Field(Tuple, N)         returns field N of Tuple

    Stacks: stacks contain fixed-length tuples of pointers or scalars

        Stack = CreateEmptyStack(Size)
                            Creates the empty stack Stack of tuples of
                            size Size
        DestroyStack(Stack)
                            Frees Stack
        EmptyStack(Stack)
                            Returns true if Stack is empty, false otherwise
        PushOntoStack(Stack, Tuple)
                            Pushes Tuple onto Stack
        Tuple = PopOffStack(Stack)
                            Pops and returns the top of Stack
        Tuple = PeekAtStack(Stack)
                            Returns the top of Stack without popping

    Lists: sorted lists contain fixed-length tuples of pointers or scalars
    and no duplicate tuples

        List = CreateEmptyList(Size, SortField, ValueFunction)
                            Creates a list of tuples of size Size that is
                            sorted in ascending order by the value returned
                            by invoking ValueFunction on field SortField of
                            each tuple
        DestroyList(List)
                            Frees List
        EmptyList(List)
                            Returns true if List is empty, false otherwise
        InsertIntoList(List, Tuple)
                            Inserts Tuple into List after all other tuples
                            of lesser or equal rank
        RemoveFromList(List, Tuple)
                            Removes Tuple from List; a wildcard (*) may be
                            used in any tuple field
        Tuple = RemoveHeadOfList(List)
                            Removes and returns the head of List (used in
                            the pseudocode below)
        IsInList(List, Tuple)
                            Returns true if Tuple is in List, false
                            otherwise
        Tuple = HeadOfList(List)
                            Returns the head of List (least tuple by the
                            ValueFunction)
        Tuple = TailOfList(List)
                            Returns the tail of List (greatest tuple by the
                            ValueFunction)
        NewList = CopyList(OldList)
                            Creates a copy NewList of list OldList

Description

This algorithm creates a list of lists of blocks for each unit.  These
lists are then passed to the merging algorithm for merging.  The
blocks in each list are in topological order.  Each list corresponds
to one combination of true and false paths from static branches.  So,
the first list contains the blocks reached when all static branches in
the unit evaluate to true, and the last list contains the blocks
reached when they all evaluate to false.

As presented, this algorithm has a problem with back edges from one
side of a static if into another static if body.  This situation can
create a difference between lists (i.e., blocks in one list but not
the other) before the static branch causing the difference appears in
the lists.  The two possible solutions are to include the blocks in
question in both lists, or to move the blocks below the static branch.
The first solution has the disadvantage that some blocks will be
unnecessarily stitched on some paths.  The second solution inhibits
merging.  I plan to go with the second solution because its run-time
efficiency is higher (no blocks unnecessarily stitched) and because it
is easier to implement.

Note that this algorithm assumes that units cannot contain
disconnected blocks.

The merging algorithm requires a tree rather than a list of lists.
That tree can be reconstructed from the lists generated by this
algorithm, or this algorithm can be modified to generate the tree
directly.
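A small, hypothetical illustration of the lists this algorithm
produces (the block names are invented): consider a unit whose head B1
ends in a static branch, with B2 on the true arm, B3 on the false arm,
and both arms joining at B4.  The algorithm produces two lists, one
per static path:

        branch true:   (B1, B2, B4)
        branch false:  (B1, B3, B4)

B4 appears in both lists; because it occupies the same relative
position on both paths, the merging pass can later unify the two
occurrences (they copy the same block and have the same successors),
yielding a diamond-shaped DAG.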
Variables

    UnitRootList is a list of nodes sorted in ascending order by
    topological number; it is a list of nodes that are the roots of
    specialization units

    PathStack is a stack of 2-tuples: (List, List); the tuples contain
    information about paths that need to be completed

    NextNodeList is a list of nodes of the graph sorted in ascending
    order by topological number; it is a list of nodes that need to be
    stitched, but haven't been copied to the duplicate graph yet

    VisitedNodeList is a list of 2-tuples: (Node, EdgeCondition), where
    Node is a node of the graph and EdgeCondition indicates which path
    was taken from a static branch (NoCondition for nodes that aren't
    static branches); the list is sorted in ascending order by
    topological number of the nodes; it is a list of nodes along one
    path down a unit

    UnitPathList is a list of all VisitedNodeLists, and is what will be
    handed to the tail-merging algorithm

Pseudocode

procedure SortAndSplitUnits(UnitRootList)
{
    UnitPathList = CreateEmptyList(1, 0, Zero)

    until EmptyList(UnitRootList) {

        /* Get the root of this specialization unit */
        (CurrentUnit) = RemoveHeadOfList(UnitRootList)

        PathStack = CreateEmptyStack(2)

        VisitedNodeList = CreateEmptyList(2, 0, Topo#)
        InsertIntoList(UnitPathList, (VisitedNodeList))

        NextNodeList = CreateEmptyList(1, 0, Topo#)
        InsertIntoList(NextNodeList, (CurrentUnit))

        PushOntoStack(PathStack, (VisitedNodeList, NextNodeList))

        until EmptyStack(PathStack) {

            /* Start on a new path through the unit */
            (VisitedNodeList, NextNodeList) = PopOffStack(PathStack)

            /* Add to the path until it is completed */
            until EmptyList(NextNodeList) {

                (NextNode) = RemoveHeadOfList(NextNodeList)

                if EndsInStaticBranch(NextNode) {

                    /* Static branches spawn new paths through the unit.
                     * The current path follows the first edge. */
                    for each outgoing edge NextNode->Edge[i] except the
                            first (in reverse topo. order) {

                        /* Create a new path */
                        NewVisitedList = CopyList(VisitedNodeList)
                        InsertIntoList(UnitPathList, (NewVisitedList))

                        /* Keep track of which path we took */
                        InsertIntoList(NewVisitedList,
                                       (NextNode, NextNode->Edge[i]->Condition))

                        /* Work on each new path begins at a different
                         * destination.  Stay within the unit and don't
                         * revisit visited nodes (must check because we
                         * follow back edges).
                         */
                        NewNextList = CopyList(NextNodeList)
                        Destination = NextNode->Edge[i]->Destination
                        if ((UnitHead(Destination) == CurrentUnit)
                            && (! IsInList(NewVisitedList, (Destination, *)))) {
                            InsertIntoList(NewNextList, (Destination))
                        }

                        PushOntoStack(PathStack, (NewVisitedList, NewNextList))
                    }

                    InsertIntoList(VisitedNodeList,
                                   (NextNode, NextNode->Edge[0]->Condition))

                    Destination = NextNode->Edge[0]->Destination
                    if ((UnitHead(Destination) == CurrentUnit)
                        && (! IsInList(VisitedNodeList, (Destination, *)))) {
                        InsertIntoList(NextNodeList, (Destination))
                    }
                }
                else {
                    InsertIntoList(VisitedNodeList, (NextNode, NoCondition))

                    /* Visit all children of this node that are in this
                     * unit and that have not already been visited. */
                    for each outgoing edge NextNode->Edge[i] {
                        Destination = NextNode->Edge[i]->Destination
                        if ((UnitHead(Destination) == CurrentUnit)
                            && (! IsInList(VisitedNodeList, (Destination, *)))) {
                            InsertIntoList(NextNodeList, (Destination))
                        }
                    }
                }
            }

            DestroyList(NextNodeList)
        }

        DestroyStack(PathStack)
    }

    DestroyList(UnitRootList)

    return UnitPathList
}


The Actual Code

STATIC VOID
dyc_SortAndSplitUnits(
    DYC_UNIT_INFO UnitList
)
{
    DYC_UNIT_INFO   CurrentUnit;
    BBLOCK          CurrentUnitHead;
    DYC_STACK       PathStack;
    DYC_LIST        VisitedNodeList, NewVisitedList;
    DYC_LIST        NextNodeList, NewNextList;
    BBLOCK          NextNode, NNSuccessor;
    BBLOCK_LIST     SuccessorList, ReverseSuccessorList;
    DYC_GRAPH       Graph;
    DYC_GRAPH_NODE  ParentGraphNode, GraphNode;

    CurrentUnit = UnitList;
    WHILE (CurrentUnit != NULL) {

        /* Skip units outside of dynamic regions */
        IF (! dyc_UNIT_insideDynamicRegion(CurrentUnit)) {
            DYC_UNIT_INFO_UnitGraph(CurrentUnit) = (DYC_GRAPH) NULL;
            CurrentUnit = DYC_UNIT_INFO_NextUnit(CurrentUnit);
            CONTINUE;
        }

        /* Get the root of this specialization unit */
        CurrentUnitHead = DYC_UNIT_INFO_UnitHead(CurrentUnit);

        /* Create the graph we build to restructure the unit.  The graph
         * node for each basic block is created when the node is inserted
         * into the NextNodeList.
         */
        DYC_UNIT_INFO_UnitGraph(CurrentUnit) = Graph =
            DYC_GRAPH_Create(1, dyc_Allocate, dyc_Deallocate);

        /* Create the list of nodes to visit.  dyc_TopologicallyPreceeds
         * should work even though it only expects pointers to scalars
         * instead of pointers to arrays.  It just uses the first element
         * of each array.
         */
        NextNodeList = DYC_LIST_Create(1, dyc_Allocate, dyc_Deallocate,
                                       dyc_TopologicallyPreceeds);
        DYC_LIST_Insert(NextNodeList, CurrentUnitHead);

        /* Create the (initially empty) list of nodes visited on the
         * current path */
        VisitedNodeList = DYC_LIST_Create(1, dyc_Allocate, dyc_Deallocate,
                                          dyc_TopologicallyPreceeds);

        /* Create the stack used to keep track of split paths */
        PathStack = DYC_STACK_Create(3, dyc_Allocate, dyc_Deallocate);
        DYC_STACK_Push(PathStack, VisitedNodeList, NextNodeList, (void *) NULL);

        WHILE (! DYC_STACK_EmptyP(PathStack)) {

            /* Start on a new path through the unit */
            DYC_STACK_Pop(PathStack, &VisitedNodeList, &NextNodeList,
                          &ParentGraphNode);

            /* Add to the path until it is completed */
            WHILE (! DYC_LIST_EmptyP(NextNodeList)) {

                /* Get the next node */
                DYC_LIST_RemoveHead(NextNodeList, &NextNode);

                /* Mark the node as visited on this path */
                DYC_LIST_Insert(VisitedNodeList, NextNode);

                /* Add the node to the graph.  Does the edge-order matter? */
                GraphNode = DYC_GRAPH_CreateNode(Graph, NextNode);
                IF (ParentGraphNode == (DYC_GRAPH_NODE) NULL)
                    DYC_GRAPH_AddRoot(Graph, GraphNode);
                ELSE
                    DYC_GRAPH_AddEdge(Graph, ParentGraphNode, GraphNode);

                IF (dyc_StatIsAStaticBranch(BBLOCK_last_stat(NextNode))) {

                    /* Static branches spawn new paths through the
                     * unit.  The current path follows the first edge.
                     */

                    /* Create a list of successors in the reverse order,
                     * skipping the first successor.
                     */
                    SuccessorList = BBLOCK_succ_list(NextNode);
                    assert(SuccessorList);
                    ReverseSuccessorList =
                        BBLOCK_List_Reverse_List(BBLOCK_list_tail(SuccessorList));

                    /* NB: If the branch is an IGOTO, the successors are tag
                     * anchors, which must always be the immediate successors
                     * of the block.  How can we guarantee this?
                     */

                    /* Push the successors onto the PathStack in reverse order */
                    LOOP {
                        FOR_EACH_BBLOCK_LIST_BBLOCK(ReverseSuccessorList,NNSuccessor) DO {

                            /* Create a new path */
                            NewVisitedList = DYC_LIST_Copy(VisitedNodeList);
                            NewNextList = DYC_LIST_Copy(NextNodeList);
                            dyc_AddBBlockToNextNodeList(NNSuccessor, NewNextList,
                                                        NewVisitedList, CurrentUnit);
                            DYC_STACK_Push(PathStack, NewVisitedList, NewNextList,
                                           GraphNode);
                        }
                    }

                    /* The first successor is on the current path */
                    NNSuccessor = BBLOCK_list_bblock(SuccessorList);
                    dyc_AddBBlockToNextNodeList(NNSuccessor, NextNodeList,
                                                VisitedNodeList, CurrentUnit);

                    BBLOCK_List_Free_List(ReverseSuccessorList);
                }
                ELSE {
                    /* Visit all children of this node that are in
                     * this unit and that have not already been visited.
                     */
                    LOOP {
                        FOR_EACH_BBLOCK_SUCC_BBLOCK(NextNode,NNSuccessor) DO {
                            dyc_AddBBlockToNextNodeList(NNSuccessor, NextNodeList,
                                                        VisitedNodeList, CurrentUnit);
                        }
                    }
                }

                ParentGraphNode = GraphNode;
            }

            DYC_LIST_Destroy(NextNodeList);
            DYC_LIST_Destroy(VisitedNodeList);
        }

        DYC_STACK_Destroy(PathStack);

        CurrentUnit = DYC_UNIT_INFO_NextUnit(CurrentUnit);
    }
}
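For reference, here is a minimal sketch of the tail-merging pass
described under "DAG conversion" above (link all leaves to a sink,
then repeatedly unify identical predecessors that share the same
successors).  Everything in it (MNode, MergeCommonTails, the
fixed-size edge arrays) is hypothetical and only illustrates the idea;
it is not the actual merging code.

    #define MAX_EDGES 8

    /* One node of the path tree (plus the added sink); each node is a
     * copy of a basic block, identified by BlockId. */
    typedef struct MNode {
        int           BlockId;
        int           NumSucc;
        struct MNode *Succ[MAX_EDGES];   /* ordered successor edges */
        int           NumPred;
        struct MNode *Pred[MAX_EDGES];   /* predecessor edges       */
    } MNode;

    static int SameSuccessors(const MNode *a, const MNode *b)
    {
        int i;
        if (a->NumSucc != b->NumSucc) return 0;
        for (i = 0; i < a->NumSucc; i++)
            if (a->Succ[i] != b->Succ[i]) return 0;
        return 1;
    }

    static void RemovePred(MNode *n, MNode *p)
    {
        int i;
        for (i = 0; i < n->NumPred; i++)
            if (n->Pred[i] == p) { n->Pred[i] = n->Pred[--n->NumPred]; return; }
    }

    /* Make every edge that enters Dead enter Kept instead. */
    static void RedirectPreds(MNode *Dead, MNode *Kept)
    {
        int i, j;
        for (i = 0; i < Dead->NumPred; i++) {
            MNode *p = Dead->Pred[i];
            for (j = 0; j < p->NumSucc; j++)
                if (p->Succ[j] == Dead) p->Succ[j] = Kept;
            Kept->Pred[Kept->NumPred++] = p;
        }
    }

    /* Work-list pass: starting from the sink, unify pairs of predecessors
     * that copy the same basic block and have identical successor lists
     * (no overflow checks; sketch only). */
    static void MergeCommonTails(MNode *Sink, MNode **Work, int Cap)
    {
        int Top = 0;
        Work[Top++] = Sink;
        while (Top > 0) {
            MNode *n = Work[--Top];
            int i, j, k, merged = 0;
            for (i = 0; i < n->NumPred && !merged; i++) {
                for (j = i + 1; j < n->NumPred && !merged; j++) {
                    MNode *a = n->Pred[i], *b = n->Pred[j];
                    if (a != b && a->BlockId == b->BlockId && SameSuccessors(a, b)) {
                        for (k = 0; k < b->NumSucc; k++)  /* detach b from its succs */
                            RemovePred(b->Succ[k], b);
                        RedirectPreds(b, a);              /* fold b into a           */
                        if (Top < Cap) Work[Top++] = n;   /* re-examine n            */
                        if (Top < Cap) Work[Top++] = a;   /* a's preds may now merge */
                        merged = 1;
                    }
                }
            }
        }
    }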