Section 6: Title (GEgen)?
======================================================================

Steps we perform
----------------

1) Generate new functions for different divisions of specialized
   functions (and create Function structs), and split merges with
   multiple divisions, yielding monovariant divisions
2) Unit identification
3) Separate setup and template code
4) Insert calls to the specializer at dynamic-to-static promotion
   points on region entries from static code
5) Produce optimized generating extensions, including merging setup
   and template in the back end, inserting emits and instructions to
   patch intra-unit branches


Unit Identification
----------------------------------------------------------------------

What are specialization units?
------------------------------

-- Specialization units are the largest subgraphs that the
   specialization algorithm can manipulate.

-- The specializer makes decisions about which code to specialize at
   the granularity of units; consequently, cache insertions,
   deletions, and lookups also operate on units.

-- Cache lookups ensure that specialized code is reused/shared if it
   already exists (important at promotions, demotions, and discordant
   merges).

What do specialization units look like?
---------------------------------------

-- Roughly, a unit begins at a specialization point (a term used in
   the literature, e.g., by Bulyonkov and Kochetov '96; they cite the
   JGS93 book), which is a point of which the specializer should be
   aware (to look up and/or begin specialization).

-- Units differ from basic blocks in that they can contain multiple
   calls and exits.

-- A unit must have a single entry.

-- If a predecessor of a node has a different context (and, hence,
   must be from a different unit), a cache lookup will be required
   either at run time (if there are promotions) or at specialization
   time (if there are only demotions), and so the node must start a
   new unit even if it has other predecessors with the same context.

-- Even if all predecessors of a node have the same context as the
   node, but they don't all belong to the same unit (e.g., due to
   "voluntary" laziness), the node must start a new unit, because it
   must be possible to generate code for that node when it is reached
   from any of its predecessors (remember that the specializer only
   generates code for whole units).

-- Even if you know that a particular predecessor will be specialized
   before the node is reached, when the other predecessors are
   specialized you need to be able to find the node in order to link
   (backpatch) the branches to it.  Thus, some kind of
   context-dependent lookup is required, which is essentially a cache
   lookup.

   For example: if some node d in the unit of one of the predecessors
   of node n dominates all of the predecessors, and there exist
   downward paths (not traversing back edges) from d to each of the
   predecessors that do not contain lazy edges, then n _could_ be
   included in the same unit as d.

   A concrete instance of this:

            make_static(x);
        L1: ...x...
            if (d1) {
                ...x...
            } else {
                if (d2) {
                    make_static(y);
        L2:         ...y...
                    make_dynamic(y);
                }
        L3:     ...x...
            }
        L4: ...x...
            make_dynamic(x);

   The node marked L4 could be included in the same unit as the node
   marked L1.  But, without a cache lookup, how would the specializer
   link the bottom of L3 to L4?

Single-Run-Time-Predecessor Optimizations
-----------------------------------------

When I say a unit has the all_unchecked policy, I mean that some
static variable has that policy as its current policy, causing the
unit always to be specialized.
We can relax the compile-time single-entry requirement for units if we
know that a node will have only a single predecessor _at_run_time_.

We can eliminate unit boundaries for eager successors with single
run-time predecessors, and for those with multiple predecessors but
the replicate (all_unchecked) policy.  For example, a static merge has
only one run-time predecessor, so we don't need a unit boundary at a
static multi-unit merge where all predecessors have the same context.
Actually, we only require that predecessors from different units be on
statically mutually exclusive paths.  Multiple non-mutually-exclusive
predecessors from the same unit are acceptable, because they will be
converted to a single predecessor on that static path during DAG
conversion.  Thus, it is possible for multiple units to share the same
code, and the graph is no longer strictly partitioned.  Note that the
shared code may actually be copied during DAG conversion if it is
reached in different orders on statically mutually exclusive paths.

Similarly, we can eliminate cache lookups (though not the unit
boundary) for lazy successors with no dynamic-to-static promotions and
single run-time predecessors (e.g., at lazy successors to dynamic
branches), or with multiple predecessors and the replicate
(all_unchecked) policy.

Above I mentioned only variables with the all_unchecked policy; the
optimizations apply equally to all other combinations of cache
policies, including one_unchecked.

NB: we do have to look at back edges when performing these
optimizations, which means unit identification is iterative!  This
could justify including "simple" unit identification in the BTA.  If
we require back edges to be mutually exclusive with all other incoming
edges, then "simple" unit identification is non-iterative, because it
won't matter from which unit the back edges emanate.

Simple unit identification
--------------------------

-- A node begins a new unit if any of the following is true:

   -- At least one incoming edge has dynamic-to-static promotions
      (explicit in BTA output)
   -- At least one incoming edge has static-to-dynamic "demotions"
      (difference of context sets)
   -- At least one incoming edge is lazy (explicit in BTA output)
   -- It is a discordant merge (explicit in BTA output)
   -- It has immediate predecessors (predecessors found on back edges
      are ignored) from different units, it is a non-constant merge,
      and the cache policy is not replicate (this is for single entry;
      such a node is called a multi-unit merge)

-- Unchecked eager promotions for static dereferences of static
   pointers do not create new unit boundaries

-- Build unit structs (input to specializer), including id, context
   variables, other live static variables, all known static variables,
   cache policy, successor edge information (dest. unit, successor
   number, eager/lazy, promotions), and call information (in the
   interprocedural version), and make demotions explicit as well for
   clustering; only CFG information need not be constructed at this
   point
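To make the contents of these unit structs concrete, here is a minimal
sketch of what such a descriptor might hold; all of the type and field
names below are hypothetical and simply mirror the list above, not the
actual declarations.

    /* Hypothetical sketch of a per-unit descriptor; names and layout
     * are illustrative only. */
    typedef struct VarSet { int NumVars; const char **Names; } VarSet;

    typedef enum CachePolicy {
        POLICY_CACHE_ALL, POLICY_CACHE_ONE,
        POLICY_CACHE_ALL_UNCHECKED,  /* replicate */
        POLICY_CACHE_ONE_UNCHECKED
    } CachePolicy;

    typedef struct UnitEdge {
        int    DestUnitId;        /* destination unit                          */
        int    SuccessorNumber;   /* which successor of the source unit        */
        int    IsLazy;            /* eager (0) or lazy (1) edge                */
        VarSet Promotions;        /* dynamic-to-static promotions on the edge  */
    } UnitEdge;

    typedef struct Unit {
        int         Id;
        VarSet      ContextVars;       /* variables compared in cache lookups      */
        VarSet      OtherLiveStatics;  /* live static variables not in the context */
        VarSet      AllKnownStatics;   /* all static variables known in the unit   */
        CachePolicy Policy;
        int         NumSuccessors;
        UnitEdge   *Successors;        /* successor edge information               */
        VarSet      Demotions;         /* made explicit for clustering             */
        /* call information would be added for the interprocedural version */
    } Unit;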
Clustering
----------

Matthai is writing up the algorithm.  Here is the basic idea.
Clustering attempts to reduce the number of units by moving, merging,
and/or eliminating unit boundaries.  It does so by moving the
promotions and demotions within the CFG.  If promotions and demotions
are made to coincide on the same edges with each other or with lazy
edges, or if promotions and demotions are moved to edges entering
discordant merges or multi-unit merges, then the number of unit
boundaries is reduced.  Discordant merges and multi-unit merges can be
eliminated as well, by pushing promotions or demotions past such
merges.  Clustering must update the information in the unit structs.


Produce optimized generating extensions
----------------------------------------------------------------------

In order to make run-time generation of dynamic code as efficient as
possible, the Specialize function is manually specialized for each
specialization unit, creating an optimized generating extension for
each unit.  All of the information in the unit structs is known at
static compile time, and so can be used in specializing the
specializer.  All of the run-time primitives called by the specializer
(see section 2) can be inlined into the body of the generating
extension.  The bulk of the generating extension is, of course,
embodied in the specialized ReduceAndResidualize primitive.  The other
primitives, when inlined and specialized, incur little additional
overhead.

Other optimizations
-------------------

Recursive calls to the specializer are implemented with a manually
manipulated stack of units to specialize and with jumps.  In a few
cases, this stack manipulation may be avoided.  For example, when a
dynamic branch to another unit appears on all static paths through a
unit, the specializer may proceed directly to that successor unit
after specializing the first one.

We do explicitly pass around lists of all live static variables, but
we are exploring ways of reducing redundancy in constructing the list.
We could also pass sets of indices into a table of the values, which
would reduce long-term memory usage but incur an additional cost for
the indirection.

We don't include all live static variables in the context used for
cache lookups, but rather only those that trigger polyvariant
specialization (promoted and discordant variables).  Because derived
static variables are monovariantly specialized and not promoted, they
never appear in the context.

A hybrid hierarchical caching scheme is possible, in which a level is
added to the hierarchy when adding to the context (at promotions), but
the level is reset to the top of the hierarchy when removing from the
context (at demotions).  The hierarchy allows checking only the new
variables in the context.  Resetting the level to the top at demotions
avoids the cost of traversing back up the tree and possibly down
another branch (e.g., at a merge of contexts {x,y} and {x,z}).
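As a rough illustration of this scheme, here is a minimal sketch of
such a hierarchical cache, assuming a simple linked representation;
CacheNode, CacheDescend, and the other names are hypothetical and do
not correspond to the actual cache implementation.

    /* Hypothetical sketch of the hybrid hierarchical cache. */
    #include <stdlib.h>
    #include <string.h>

    typedef struct CacheNode {
        long             *Key;      /* values of the variables added at this level */
        int               KeyLen;
        void             *Code;     /* specialized code for this context, if any   */
        struct CacheNode *Child;    /* first entry at the next level               */
        struct CacheNode *Sibling;  /* next entry at this level                    */
    } CacheNode;

    /* Find (or create) the child of Level whose key matches Vals; only the
     * given values are compared, not the whole context. */
    static CacheNode *CacheDescend(CacheNode *Level, long *Vals, int N)
    {
        CacheNode *c;
        for (c = Level->Child; c != NULL; c = c->Sibling)
            if (c->KeyLen == N && memcmp(c->Key, Vals, N * sizeof(long)) == 0)
                return c;                 /* hit: reuse the existing entry */
        c = calloc(1, sizeof *c);         /* miss: add a new entry         */
        c->Key = Vals;
        c->KeyLen = N;
        c->Sibling = Level->Child;
        Level->Child = c;
        return c;
    }

    /* At a promotion, add a level below the current one, checking only the
     * newly promoted values.  At a demotion, reset to the top of the
     * hierarchy and check the full remaining context in one step, rather
     * than walking back up the tree and down another branch. */
    static CacheNode *CacheAtPromotion(CacheNode *Cur, long *NewVals, int N)
    {
        return CacheDescend(Cur, NewVals, N);
    }

    static CacheNode *CacheAtDemotion(CacheNode *Root, long *Context, int N)
    {
        return CacheDescend(Root, Context, N);
    }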
DAG conversion
--------------

At specialize time, we must reduce and residualize all blocks of a
unit that could potentially be reached at run time, so we must pursue
all paths under dynamic control.  So that ReduceAndResidualize does
not have to traverse the unit subgraph using a work list and visited
flags, the unit subgraph is converted to a DAG that can simply be
traversed top to bottom.  This process goes beyond simple unrolling of
the traversal loop.  Because converting the unit subgraph to a DAG
also lets us know the order of traversal of the blocks at compile
time, we can better optimize the generating extension using standard
compiler optimizations.

*** Following DAG conversion of the static code, we must merge with
    the dynamic code, inserting emits. ***

Building a DAG of the basic blocks reached is difficult in the
presence of unstructured code, combinations of static and dynamic
branches, and loops or other back edges.  Our solution is to pursue
all static paths (static paths are determined by which static branches
are taken), copying blocks when they are reached in different orders
on different static paths.  Thus, the order of traversal of the basic
blocks is known for each static path.

Our algorithm first splits all static paths, creating a tree of basic
blocks.  Nodes under dynamic control are linearized in a topological
order.  Nodes in the tree with fanout greater than one correspond to
static branches.  Each node can appear in a given static path at most
once.  Then, all leaves of the tree are linked to a sink node, which
is added to a work list.  Until the work list is empty, remove a node
from the work list and unify all of its identical predecessors that
have the same successors, adding unified nodes to the work list.  The
merging algorithm can also be used to merge common tails of different
units.  (A rough sketch of this merging pass appears at the end of
this section.)


The Sorting & Splitting Algorithm

Pseudocode Notation

    RC(Node)                returns the reachability conditions for Node
    UnitHead(Node)          returns the head of the unit to which Node
                            belongs
    IsLazyUnit(Node)        returns true if the unit to which Node belongs
                            must be stitched lazily, false if the unit can
                            be stitched eagerly
    Zero(Node)              returns 0
    Topo#(Node)             returns the topological number of Node
    IsPredecessor(Node1, Node2)
                            is true if Node1 is a predecessor of Node2 or
                            Node1 == Node2, false if Node1 succeeds Node2
                            or Node1 neither precedes nor succeeds Node2
    NewNode = CopyNode(OldNode)
                            creates a copy NewNode of node OldNode (not
                            including connecting edges)

    Edges are ordered by the topological numbers of their destinations;
    Node->Edge[i] is the ith edge of Node; Node->Edge[0] is the first edge
    of Node

    Edge->Destination       is the destination node of Edge
    Edge->Condition         is the condition (NoCondition, True, False, 1,
                            100, etc.) of Edge

    (F0, F1, ..., FN) is an (N+1)-tuple of objects F0, F1, ..., FN; F0 is
    in field 0, F1 is in field 1, etc.

    Field(Tuple, N)         returns field N of Tuple

    Stacks: stacks contain fixed-length tuples of pointers or scalars

        Stack = CreateEmptyStack(Size)
                            Creates the empty stack Stack of tuples of
                            size Size
        DestroyStack(Stack)
                            Frees Stack
        EmptyStack(Stack)
                            Returns true if Stack is empty, false otherwise
        PushOntoStack(Stack, Tuple)
                            Pushes Tuple onto Stack
        Tuple = PopOffStack(Stack)
                            Pops and returns the top of Stack
        Tuple = PeekAtStack(Stack)
                            Returns the top of Stack without popping

    Lists: sorted lists contain fixed-length tuples of pointers or scalars
    and no duplicate tuples

        List = CreateEmptyList(Size, SortField, ValueFunction)
                            Creates a list of tuples of size Size that is
                            sorted in ascending order by the value returned
                            by invoking ValueFunction on field SortField of
                            each tuple
        DestroyList(List)
                            Frees List
        EmptyList(List)
                            Returns true if List is empty, false otherwise
        InsertIntoList(List, Tuple)
                            Inserts Tuple into List after all other tuples
                            of lesser or equal rank
        RemoveFromList(List, Tuple)
                            Removes Tuple from List; a wildcard (*) may be
                            used in any tuple field
        Tuple = RemoveHeadOfList(List)
                            Removes and returns the head of List (used in
                            the pseudocode below)
        IsInList(List, Tuple)
                            Returns true if Tuple is in List, false
                            otherwise
        Tuple = HeadOfList(List)
                            Returns the head of List (least tuple by the
                            ValueFunction)
        Tuple = TailOfList(List)
                            Returns the tail of List (greatest tuple by the
                            ValueFunction)
        NewList = CopyList(OldList)
                            Creates a copy NewList of list OldList

Description

This algorithm creates a list of lists of blocks for each unit.  These
lists are then passed to the merging algorithm for merging.  The
blocks in each list are in topological order.  Each list corresponds
to one combination of true and false paths from static branches.  So,
the first list contains the blocks reached when all static branches in
the unit evaluate to true, and the last list contains the blocks
reached when they all evaluate to false.

As presented, this algorithm has a problem with back edges from one
side of a static if into another static if body.  This situation can
create a difference between lists (i.e., blocks in one list but not
the other) before the static branch causing the difference appears in
the lists.  The two possible solutions are to include the blocks in
question in both lists, or to move the blocks below the static branch.
The first solution has the disadvantage that some blocks will be
unnecessarily stitched on some paths.  The second solution inhibits
merging.  I plan to go with the second solution because its run-time
efficiency is higher (no blocks unnecessarily stitched) and because it
is easier to implement.

Note that this algorithm assumes that units cannot contain
disconnected blocks.

The merging algorithm requires a tree rather than a list of lists.
That tree can be reconstructed from the lists generated by this
algorithm, or this algorithm can be modified to generate the tree
directly.
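A small, hypothetical illustration of the lists this algorithm
produces (the block names are invented): consider a unit whose head B1
ends in a static branch, with B2 on the true arm, B3 on the false arm,
and both arms joining at B4.  The algorithm produces two lists, one
per static path:

        branch true:   (B1, B2, B4)
        branch false:  (B1, B3, B4)

B4 appears in both lists; because it occupies the same relative
position on both paths, the merging pass can later unify the two
occurrences (they copy the same block and have the same successors),
yielding a diamond-shaped DAG.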
Variables

    UnitRootList is a list of nodes sorted in ascending order by
    topological number; it is a list of nodes that are the roots of
    specialization units

    PathStack is a stack of 2-tuples: (List, List); the tuples contain
    information about paths that need to be completed

    NextNodeList is a list of nodes of the graph sorted in ascending
    order by topological number; it is a list of nodes that need to be
    stitched, but haven't been copied to the duplicate graph yet

    VisitedNodeList is a list of 2-tuples: (Node, EdgeCondition), where
    Node is a node of the graph and EdgeCondition indicates which path
    was taken from a static branch (NoCondition for nodes that aren't
    static branches); the list is sorted in ascending order by
    topological number of the nodes; it is a list of nodes along one
    path down a unit

    UnitPathList is a list of all VisitedNodeLists, and is what will be
    handed to the tail-merging algorithm

Pseudocode

procedure SortAndSplitUnits(UnitRootList)
{
    UnitPathList = CreateEmptyList(1, 0, Zero)

    until EmptyList(UnitRootList) {

        /* Get the root of this specialization unit */
        (CurrentUnit) = RemoveHeadOfList(UnitRootList)

        PathStack = CreateEmptyStack(2)

        VisitedNodeList = CreateEmptyList(2, 0, Topo#)
        InsertIntoList(UnitPathList, (VisitedNodeList))

        NextNodeList = CreateEmptyList(1, 0, Topo#)
        InsertIntoList(NextNodeList, (CurrentUnit))

        PushOntoStack(PathStack, (VisitedNodeList, NextNodeList))

        until EmptyStack(PathStack) {

            /* Start on a new path through the unit */
            (VisitedNodeList, NextNodeList) = PopOffStack(PathStack)

            /* Add to the path until it is completed */
            until EmptyList(NextNodeList) {

                (NextNode) = RemoveHeadOfList(NextNodeList)

                if EndsInStaticBranch(NextNode) {

                    /* Static branches spawn new paths through the unit.
                     * The current path follows the first edge. */
                    for each outgoing edge NextNode->Edge[i] except the
                            first (in reverse topo. order) {

                        /* Create a new path */
                        NewVisitedList = CopyList(VisitedNodeList)
                        InsertIntoList(UnitPathList, (NewVisitedList))

                        /* Keep track of which path we took */
                        InsertIntoList(NewVisitedList,
                                       (NextNode, NextNode->Edge[i]->Condition))

                        /* Work on each new path begins at a different
                         * destination.  Stay within the unit and don't
                         * revisit visited nodes (must check because we
                         * follow back edges).
                         */
                        NewNextList = CopyList(NextNodeList)
                        Destination = NextNode->Edge[i]->Destination
                        if ((UnitHead(Destination) == CurrentUnit)
                            && (! IsInList(NewVisitedList, (Destination, *)))) {
                            InsertIntoList(NewNextList, (Destination))
                        }

                        PushOntoStack(PathStack, (NewVisitedList, NewNextList))
                    }

                    InsertIntoList(VisitedNodeList,
                                   (NextNode, NextNode->Edge[0]->Condition))

                    Destination = NextNode->Edge[0]->Destination
                    if ((UnitHead(Destination) == CurrentUnit)
                        && (! IsInList(VisitedNodeList, (Destination, *)))) {
                        InsertIntoList(NextNodeList, (Destination))
                    }
                }
                else {
                    InsertIntoList(VisitedNodeList, (NextNode, NoCondition))

                    /* Visit all children of this node that are in this
                     * unit and that have not already been visited. */
                    for each outgoing edge NextNode->Edge[i] {
                        Destination = NextNode->Edge[i]->Destination
                        if ((UnitHead(Destination) == CurrentUnit)
                            && (! IsInList(VisitedNodeList, (Destination, *)))) {
                            InsertIntoList(NextNodeList, (Destination))
                        }
                    }
                }
            }

            DestroyList(NextNodeList)
        }

        DestroyStack(PathStack)
    }

    DestroyList(UnitRootList)

    return UnitPathList
}


The Actual Code

STATIC VOID
dyc_SortAndSplitUnits(
    DYC_UNIT_INFO UnitList
)
{
    DYC_UNIT_INFO   CurrentUnit;
    BBLOCK          CurrentUnitHead;
    DYC_STACK       PathStack;
    DYC_LIST        VisitedNodeList, NewVisitedList;
    DYC_LIST        NextNodeList, NewNextList;
    BBLOCK          NextNode, NNSuccessor;
    BBLOCK_LIST     SuccessorList, ReverseSuccessorList;
    DYC_GRAPH       Graph;
    DYC_GRAPH_NODE  ParentGraphNode, GraphNode;

    CurrentUnit = UnitList;
    WHILE (CurrentUnit != NULL) {

        /* Skip units outside of dynamic regions */
        IF (! dyc_UNIT_insideDynamicRegion(CurrentUnit)) {
            DYC_UNIT_INFO_UnitGraph(CurrentUnit) = (DYC_GRAPH) NULL;
            CurrentUnit = DYC_UNIT_INFO_NextUnit(CurrentUnit);
            CONTINUE;
        }

        /* Get the root of this specialization unit */
        CurrentUnitHead = DYC_UNIT_INFO_UnitHead(CurrentUnit);

        /* Create the graph we build to restructure the unit.  The graph
         * node for each basic block is created when the node is inserted
         * into the NextNodeList.
         */
        DYC_UNIT_INFO_UnitGraph(CurrentUnit) = Graph =
            DYC_GRAPH_Create(1, dyc_Allocate, dyc_Deallocate);

        /* Create the list of nodes to visit.  dyc_TopologicallyPreceeds
         * should work even though it only expects pointers to scalars
         * instead of pointers to arrays.  It just uses the first element
         * of each array.
         */
        NextNodeList = DYC_LIST_Create(1, dyc_Allocate, dyc_Deallocate,
                                       dyc_TopologicallyPreceeds);
        DYC_LIST_Insert(NextNodeList, CurrentUnitHead);

        /* Create the (initially empty) list of nodes visited on the
         * current path */
        VisitedNodeList = DYC_LIST_Create(1, dyc_Allocate, dyc_Deallocate,
                                          dyc_TopologicallyPreceeds);

        /* Create the stack used to keep track of split paths */
        PathStack = DYC_STACK_Create(3, dyc_Allocate, dyc_Deallocate);
        DYC_STACK_Push(PathStack, VisitedNodeList, NextNodeList, (void *) NULL);

        WHILE (! DYC_STACK_EmptyP(PathStack)) {

            /* Start on a new path through the unit */
            DYC_STACK_Pop(PathStack, &VisitedNodeList, &NextNodeList,
                          &ParentGraphNode);

            /* Add to the path until it is completed */
            WHILE (! DYC_LIST_EmptyP(NextNodeList)) {

                /* Get the next node */
                DYC_LIST_RemoveHead(NextNodeList, &NextNode);

                /* Mark the node as visited on this path */
                DYC_LIST_Insert(VisitedNodeList, NextNode);

                /* Add the node to the graph.  Does the edge-order matter? */
                GraphNode = DYC_GRAPH_CreateNode(Graph, NextNode);
                IF (ParentGraphNode == (DYC_GRAPH_NODE) NULL)
                    DYC_GRAPH_AddRoot(Graph, GraphNode);
                ELSE
                    DYC_GRAPH_AddEdge(Graph, ParentGraphNode, GraphNode);

                IF (dyc_StatIsAStaticBranch(BBLOCK_last_stat(NextNode))) {

                    /* Static branches spawn new paths through the
                     * unit.  The current path follows the first edge.
                     */

                    /* Create a list of successors in the reverse order,
                     * skipping the first successor.
                     */
                    SuccessorList = BBLOCK_succ_list(NextNode);
                    assert(SuccessorList);
                    ReverseSuccessorList =
                        BBLOCK_List_Reverse_List(BBLOCK_list_tail(SuccessorList));

                    /* NB: If the branch is an IGOTO, the successors are tag
                     * anchors, which must always be the immediate successors
                     * of the block.  How can we guarantee this?
                     */

                    /* Push the successors onto the PathStack in reverse order */
                    LOOP {
                        FOR_EACH_BBLOCK_LIST_BBLOCK(ReverseSuccessorList,NNSuccessor) DO {

                            /* Create a new path */
                            NewVisitedList = DYC_LIST_Copy(VisitedNodeList);
                            NewNextList = DYC_LIST_Copy(NextNodeList);
                            dyc_AddBBlockToNextNodeList(NNSuccessor, NewNextList,
                                                        NewVisitedList, CurrentUnit);
                            DYC_STACK_Push(PathStack, NewVisitedList, NewNextList,
                                           GraphNode);
                        }
                    }

                    /* The first successor is on the current path */
                    NNSuccessor = BBLOCK_list_bblock(SuccessorList);
                    dyc_AddBBlockToNextNodeList(NNSuccessor, NextNodeList,
                                                VisitedNodeList, CurrentUnit);

                    BBLOCK_List_Free_List(ReverseSuccessorList);
                }
                ELSE {
                    /* Visit all children of this node that are in
                     * this unit and that have not already been visited.
                     */
                    LOOP {
                        FOR_EACH_BBLOCK_SUCC_BBLOCK(NextNode,NNSuccessor) DO {
                            dyc_AddBBlockToNextNodeList(NNSuccessor, NextNodeList,
                                                        VisitedNodeList, CurrentUnit);
                        }
                    }
                }

                ParentGraphNode = GraphNode;
            }

            DYC_LIST_Destroy(NextNodeList);
            DYC_LIST_Destroy(VisitedNodeList);
        }

        DYC_STACK_Destroy(PathStack);

        CurrentUnit = DYC_UNIT_INFO_NextUnit(CurrentUnit);
    }
}
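For reference, here is a minimal sketch of the tail-merging pass
described under "DAG conversion" above (link all leaves to a sink,
then repeatedly unify identical predecessors that share the same
successors).  Everything in it (MNode, MergeCommonTails, the
fixed-size edge arrays) is hypothetical and only illustrates the idea;
it is not the actual merging code.

    #define MAX_EDGES 8

    /* One node of the path tree (plus the added sink); each node is a
     * copy of a basic block, identified by BlockId. */
    typedef struct MNode {
        int           BlockId;
        int           NumSucc;
        struct MNode *Succ[MAX_EDGES];   /* ordered successor edges */
        int           NumPred;
        struct MNode *Pred[MAX_EDGES];   /* predecessor edges       */
    } MNode;

    static int SameSuccessors(const MNode *a, const MNode *b)
    {
        int i;
        if (a->NumSucc != b->NumSucc) return 0;
        for (i = 0; i < a->NumSucc; i++)
            if (a->Succ[i] != b->Succ[i]) return 0;
        return 1;
    }

    static void RemovePred(MNode *n, MNode *p)
    {
        int i;
        for (i = 0; i < n->NumPred; i++)
            if (n->Pred[i] == p) { n->Pred[i] = n->Pred[--n->NumPred]; return; }
    }

    /* Make every edge that enters Dead enter Kept instead. */
    static void RedirectPreds(MNode *Dead, MNode *Kept)
    {
        int i, j;
        for (i = 0; i < Dead->NumPred; i++) {
            MNode *p = Dead->Pred[i];
            for (j = 0; j < p->NumSucc; j++)
                if (p->Succ[j] == Dead) p->Succ[j] = Kept;
            Kept->Pred[Kept->NumPred++] = p;
        }
    }

    /* Work-list pass: starting from the sink, unify pairs of predecessors
     * that copy the same basic block and have identical successor lists
     * (no overflow checks; sketch only). */
    static void MergeCommonTails(MNode *Sink, MNode **Work, int Cap)
    {
        int Top = 0;
        Work[Top++] = Sink;
        while (Top > 0) {
            MNode *n = Work[--Top];
            int i, j, k, merged = 0;
            for (i = 0; i < n->NumPred && !merged; i++) {
                for (j = i + 1; j < n->NumPred && !merged; j++) {
                    MNode *a = n->Pred[i], *b = n->Pred[j];
                    if (a != b && a->BlockId == b->BlockId && SameSuccessors(a, b)) {
                        for (k = 0; k < b->NumSucc; k++)  /* detach b from its succs */
                            RemovePred(b->Succ[k], b);
                        RedirectPreds(b, a);              /* fold b into a           */
                        if (Top < Cap) Work[Top++] = n;   /* re-examine n            */
                        if (Top < Cap) Work[Top++] = a;   /* a's preds may now merge */
                        merged = 1;
                    }
                }
            }
        }
    }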