Outstanding Issues
----------------------------------------------------------------------

Name
----
We need a name for our system.  There appears to be little enthusiasm
for DyC.  My best suggestion is Duchamp.  Why?

    Marcel Duchamp, b. July 28, 1887, d. Oct. 2, 1968, was a French
    painter and theorist and one of the most influential figures of
    avant-garde 20th-century art.  Duchamp developed a type of
    symbolic painting, a dynamic version of facet CUBISM (similar to
    FUTURISM), in which the image depicted successive movements of a
    single body....Following his maxim never to repeat himself...

Note _dynamic_, "depicted successive movements" (sounds like
specialized versions), and "never to repeat" (CacheAllUnchecked!).
Note also that Duchamp was French, and that if it is mispronounced by
Americans it sounds like "duh Champ!"  One of Duchamp's styles,
dynamism, attempted to capture dynamic behavior (i.e., motion) in a
static medium (i.e., painting).  Sounds like templates!  :-):-):-)

Title
-----
We've changed the scope of the paper.  Should we change the title?
Too late!!!

Abstract
--------

Section 1: Introduction
-----------------------

Section 2: Example
------------------

Section 3: Run-Time Specializer
-------------------------------

Section 4: Annotations
----------------------

Inlining

No mention of run-time inlining.

Section 5: Analysis of the Annotations
--------------------------------------

MergePartitions and Partition

It would also be helpful to say more about how these work in the text.

Section 6: Generating the Run-Time Specializer
----------------------------------------------

Integrating dynamic code into static code

This is also where intra-unit branch patching needs to happen.
Section 7: Related Work
-----------------------

Section 8: Conclusions
----------------------

Acknowledgements
----------------

References
----------

Appendices
----------

Things to remember, but we don't need in the paper
--------------------------------------------------

Return nodes

"Return nodes need no analysis function, since there are no program
points after return nodes."  This is more or less true.  However, the
interprocedural specializer could use this information (Tempo does).
Unfortunately, the BTA would also need to be interprocedural in order
to propagate return divisions for specialized (rather than just
memoized) calls.

Lazy policy

We should think about the semantics of the lazy policy.  Should we
really make every dynamic branch lazy if any static variable with the
lazy policy is live?  That sounds ok, and I don't have an alternative
proposal.  However, the following situation is the type I'm thinking
of:

    make_static(x);
    x = foo;
    if (d1) { /* no uses or defs of x */ }
    if (d2) { /* no uses or defs of x */ }
    etc.
    ...x...

Do we really want to make all intervening dynamic branches lazy
between defs and uses of a variable annotated lazy, even if they
don't affect the variable directly?

Do we really want to find the branches that determine execution of
the merge?  For example:

    make_static(x);
    if (d1) {
        if (d2) x++;
        else x--;
    DM: ...x...
    }

So, the true edge of the d1 if will be made lazy?  Why is this the
behavior we want?

Single-Run-Time-Predecessor (SRTP) Optimizations

This is an updated version of what was in my GEgen writeup (also now
updated).  When I say a unit has the all_unchecked policy, I mean
some static variable has that policy as its current policy, causing
the unit to always be specialized.

A unit must have a single entry...
-- If a predecessor to a node has a different context (and, hence,
   must be from a different unit), a cache lookup will be required
   either at run time (if there are promotions) or at specialization
   time (if there are only demotions), and so the node must start a
   new unit even if it has other predecessors with the same context

-- Even if all predecessors to a node have the same context as the
   node, but they don't all belong to the same unit (e.g., due to
   "voluntary" laziness), then the node must start a new unit because
   it must be possible to generate code for that node if it is
   reached by any of its predecessors (remember that the specializer
   only generates code for whole units)

-- Even if you know that a particular predecessor will be specialized
   before the node is reached, when the other predecessors are
   specialized you need to be able to find the node in order to link
   (backpatch) the branches to it.  Thus, some kind of
   context-dependent lookup is required, which is essentially a cache
   lookup.

For example:  if some node d in the unit of one of the predecessors
of node n dominates all of the predecessors, and there exist downward
paths (not traversing back edges) from d to each of the predecessors
that do not contain lazy edges, then n _could_ be included in the
same unit as d.  A concrete instance of this:

    make_static(x);
    L1: ...x...
    if (d1) {
        ...x...
    } else {
        if (d2) {
            make_static(y);
            L2: ...y...
            make_dynamic(y);
        }
        L3: ...x...
    }
    L4: ...x...
    make_dynamic(x);

The node marked L4 could be included in the same unit as the node
marked L1.  But, without a cache lookup, how would the specializer
link the bottom of L3 to L4?

We can relax the compile-time single-entry requirement for units if
we know that a node will have only a single predecessor
_at_run_time_.  We can eliminate unit boundaries for eager successors
with single run-time predecessors, and for those with multiple
predecessors but the replicate (all_unchecked) policy.
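The "essentially a cache lookup" point can be made concrete.  Here is
a minimal C sketch of a context-keyed specialization cache; the names
(cache_lookup, cache_insert) and the single-int context are invented
for illustration -- the real cache would be keyed on the full tuple
of static-variable values in the unit's context, and would hash
rather than scan.

```c
#include <stddef.h>

/* Hypothetical sketch: a tiny cache mapping a context (here, one
   static value) to the address of the code specialized for it. */

#define CACHE_SLOTS 16

struct cache_entry { int ctx; void *code; int used; };
static struct cache_entry cache[CACHE_SLOTS];

/* Return the specialized code for this context, or NULL on a miss
   (in which case the unit must be specialized and inserted). */
void *cache_lookup(int ctx) {
    for (int i = 0; i < CACHE_SLOTS; i++)
        if (cache[i].used && cache[i].ctx == ctx)
            return cache[i].code;
    return NULL;
}

/* Record newly specialized code; returns 0 if the cache is full. */
int cache_insert(int ctx, void *code) {
    for (int i = 0; i < CACHE_SLOTS; i++)
        if (!cache[i].used) {
            cache[i].ctx = ctx;
            cache[i].code = code;
            cache[i].used = 1;
            return 1;
        }
    return 0;
}
```

This is the lookup that the SRTP optimizations try to eliminate when
a node is known to have only a single run-time predecessor.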
For example, a static merge has only one run-time predecessor, so we
don't need a unit boundary at a static multi-unit merge where all
predecessors have the same context.  Actually, we only require that
predecessors from different units be on statically mutually exclusive
paths.  Multiple non-mutually-exclusive predecessors from the same
unit are acceptable because they will be converted to a single
predecessor on that static path during DAG conversion.  Thus, it is
possible for multiple units to share the same code, and the graph is
no longer strictly partitioned.  Note that the shared code may
actually be copied during DAG conversion if it is reached in
different orders on statically mutually exclusive paths.  I probably
haven't thought enough about this.  For example, when looking for an
edge from such a shared subgraph, how do we consider it when applying
the SRTP opt. to its successor, especially if that successor has
multiple predecessors?

Similarly, we can eliminate cache lookups (though not the unit
boundary) for lazy successors with no dynamic-to-static promotions
and single run-time predecessors (e.g., at lazy successors to dynamic
branches), or for those with multiple predecessors and the replicate
(all_unchecked) policy.

I only mentioned variables with the all_unchecked policy.  The
optimizations apply equally to all other combinations of cache
policies, including one_unchecked.

NB: we do have to look at back edges when performing these
optimizations.  This means unit identification is iterative!  This
could provide justification to include "simple" unit identification
in the BTA.  If we require back edges to be mutually exclusive with
all other incoming edges, then "simple" unit identification is
non-iterative, because it won't matter from which unit they emanate.

Polyvariant division and policies

We had a discussion about this a while ago and decided it was a good
idea to split on policies because that was the "strongest" notion of
keepConst.
However, I could imagine cases where the programmer wouldn't want
that behavior.  We could have a footnote saying that splitting only
on the raw division (static/dynamic) might also be reasonable, and
that we don't have enough experience to know whether we would want to
split or not.  One could also imagine yet another policy to control
the splitting (poly/mono_policies).

Motivation

Why compute root vars for static vars?  We should have an example of
how we'd otherwise screw up if we didn't kill derived variables:

    make_static(x);
    ...x...
    make_static(y);
    z = x + y;
    make_dynamic(x);
    A: ...z...

If we have one version of the code labelled A for all values of x,
clearly we cannot specialize on z without putting it in the context,
which we can't if it is simply an unannotated, derived static
variable.  So, we must kill it.

Why compute discordant variables?  We need them now to change cache
policies.  Why output them?  The specializer could use them in a
hierarchical caching scheme.

Why compute lazy points?  This should be clear from the lazy/eager
edge info used by the specializer.  Why we push lazy points up to
dynamic branches _should_ be discussed, however.

Lazy branch successors

The lazy-branch identification introduces new unit boundaries.  There
must be a boundary at the discordant merge in any case.  So, there is
a trade-off between introducing new unit boundaries and (potentially)
needlessly specializing code on one side of the branch.  We should
perhaps mention the single-run-time-predecessor optimizations here as
well, since they provide the mechanism to eliminate this cache
lookup.

Interprocedural annotations (specialize vs. make_static)

What you say is fine, but I'm just leaving this here because I want
to keep it.

    specialize f(x,y,z) on (x,y);

vs.

    f(x,y,z) { make_static(x,y); ... }

1. specialize only specializes f on x and y if they are static at the
   call site.  A make_static inside f would cause all arguments
   passed for x and y to be promoted.

2.
   specialize permits x and y to begin as static in f, without the
   potentially redundant promotion at the beginning of f.  Thus, f
   may be eagerly specialized after specializing the caller, possibly
   eliminating a run-time cache check.  Of course, if we say this, we
   must at least discuss extending the specializer to be
   interprocedural in the previous section.

3. specialize doesn't have to pass x and y to the specialized version
   of f (merge & split, as mentioned).

Bottom line: specialize allows function parameters to begin life
static.  make_static must assume all function parameters are dynamic.

Function pointers

Perhaps we should at least mention something about specialize and
function pointers (stubs, calling general versions, topic of future
work, whatever).

setjmp() and longjmp()

Should we mention them?  What should we do?

Static and discordant merges

Called constant merges in the PLDI paper; we no longer care about
static merges.  We don't typically care whether all the merge
predecessors are mutually exclusive, but rather whether particular
edges are mutually exclusive.  If we perform the SRTP optimizations
(see below), the reachability conditions must be an output of the
BTA.

Craig's comment: we care about equivalence classes of predecessors
defined by defs of discordant vars being disjoint.  Static merges are
a coarser, more conservative test.

A merge is discordant if some variable has potentially different
definitions along different predecessors that are not mutually
exclusive.  This better handles the following case:

    make_static(x);
    if (s) {
        x++;
        if (d1) { /* no defs of x */ }
        else    { /* no defs of x */ }
    } else {
        x--;
        if (d2) { /* no defs of x */ }
        else    { /* no defs of x */ }
    }
    M: ...x...

The merge at M has four incoming edges, not all of which are mutually
exclusive with each other, but the two defs of x are mutually
exclusive.  I could make this example more explicit by inserting four
"goto M;" statements.
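Following that suggestion, here is the same example with the four
explicit "goto M;" statements, wrapped in a compilable function.  The
wrapper name (merge_example) and the initial value of x are invented;
the make_static annotation appears only as a comment, since it isn't
plain C.

```c
/* Four edges reach M, not all mutually exclusive with each other,
   but the two defs of x (x++ on the s path, x-- on the !s path)
   are mutually exclusive -- so M need not be discordant. */
static int merge_example(int s, int d1, int d2) {
    int x = 0;                              /* make_static(x); */
    if (s) {
        x++;                                /* one def of x */
        if (d1) { /* no defs of x */ goto M; }
        else    { /* no defs of x */ goto M; }
    } else {
        x--;                                /* the other def of x */
        if (d2) { /* no defs of x */ goto M; }
        else    { /* no defs of x */ goto M; }
    }
M:                                          /* M: ...x... */
    return x;
}
```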
Adding derived variables to the context

If any root variable of a derived annotated variable is removed from
the context, the annotated variable must be added to the context.
This is an eager promotion, and probably shouldn't be in the
Promotions set.  Here is the example:

    make_static(w,x,y);
    z = w + x + y;
    make_static(z);
    ...z...           /* z need not be in context yet */
    make_dynamic(y);  /* add z to context here */

The equations now handle this (in FilterStaticVars).

Removing dead variables from the context

This issue also arises because we remove dead variables from the
context.  We hope that between our handling of the above situation
and clustering, we'll do better than leaving variables in the context
until their last use.  This is pretty cool!  Here is the example:

    make_static(x,y,z);
    ...x...           /* x is dead after this use */
    demote(x);        /* inserted by BTA */
    ...               /* code w/o uses of x */
    x = DEF;          /* could be static or dynamic */
    ...
    make_dynamic(y);
    ...x...           /* last use of x */
    demote(x);        /* inserted by BTA */

The idea is that we don't really want to keep a variable in the
context until its last use, because then we needlessly specialize on
it while it's dead.  On the other hand, we don't want to introduce a
lot of demotions and promotions.  We've recently had a few neat
insights into this problem that allow us to remove a variable from
the context whenever it goes dead (demote).

If the DEF of x is dynamic, there must be a promotion there in any
case, so demoting it won't really hurt.  There are no uses of the
variable between the inserted demotion and the promotion, so
clustering could conceivably move the demotion down to the promotion,
eliminating that unit boundary.  If the demotion is clustered with a
different boundary, then we've cut down on the amount of code
needlessly specialized.

If the DEF of x is static, the operation must not use x, because
otherwise x would be live at the def.  Therefore, we don't want x in
the context, because its roots will be in the context.
x will only be added back into the context when one of its roots is
removed, as discussed above.  At that point, the "promotion" of x
need not be lazy, since we know its value.  Once again, the demotion
may be moved by clustering, potentially all the way to the promotion,
because there are no intervening uses of x.  Excellent!

Restriction to control-equivalent program points is key

Do we want to say more about this?  For example, that it ensures all
holes will be in the same place.  Or, that we don't have to consider
pushing above or below branches or merges (which we might want to
do)?  I guess we could say something about it when discussing future
work.

Integrating dynamic code into static code

In the long run, we don't want to use strict block-to-block mappings.
We should be able to get by with fewer restrictions on instruction
movement.  The weakest possible restrictions are:

- instructions cannot be moved outside their units
- instructions cannot be speculatively hoisted above static branches

Optimizing specializer interactions

Elimination of cache lookups for the replicate policy is discussed
under the SRTP opts.  The overwriting of computed jumps with direct
jumps (or fall-throughs) applies to any unit for which the cache
lookup is eliminated by the SRTP opts.  We may want to stress that no
cache flush is required.  I love that.  :-)

The only other optimizations we (plan to) perform have to do with
optimizing the calls to the Specialize function themselves.  I'm
talking about implementing our own stack (the work list), reducing
register saving and restoring, passing values in registers, etc.

Repatching direct branches to replaced cache_one units

Eager and lazy unit edges w/o promotions are patched to be direct
branches to their destinations.  What happens if their destination is
removed from the cache?
I thought of three possible solutions:

1) Require a run-time cache lookup for any unit where any static
   variable has the cache_one policy (henceforth called a cache_one
   unit), rather than putting in a direct branch.

2) Leave a callback in the place of the cache_one unit; a cache flush
   is required*, and the space required by the callback cannot be
   freed until all units jumping to it are freed; thus, we must also
   maintain a list of units branching to the cache_one unit.

3) Insert an indirect jump, rather than a direct branch.  When the
   cache_one unit is generated, set the jump address to the address
   of the unit.  Initially, or when the unit is removed from the
   cache, direct the jump to a callback stub.  We wouldn't be able to
   set the prefetch-hint bits, because then the indirect jumps would
   require patching as in #2.

#1 is conceptually simple and has no specialize-time cost, but the
greatest run-time cost (run time being time spent executing generated
code).  #2 is the most complicated scheme and has a significant
specialize-time cost, but no run-time cost.  #3 is very simple and
has a small specialize-time cost and a small run-time cost.  #1 is
obviously out.  #3 would be easier to implement than #2.

One might also think of using a valid/invalid flag for the cache_one
unit, but I couldn't think of a way to use a flag that wouldn't be
strictly more expensive (at both specialize time and run time) than
#2.

* If we have to flush the entire instruction cache anyway to replace
  the previous specialized version, then no additional flush is
  needed.
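A minimal sketch of how #3 might look, modeling the indirect-jump
cell as a C function pointer.  All names (unit_entry,
specialize_stub, flush_cache_one_unit) are invented, and the
"specialized code" is a stand-in; a real implementation would redirect
an actual indirect jump's target word rather than a function pointer.

```c
/* The unit is always entered through unit_entry (the indirect-jump
   cell), so callers never need repatching when the unit is evicted. */

typedef int (*unit_code)(int);

static int specialize_count = 0;        /* how many (re)specializations */

/* Stand-in for code the specializer would generate for the unit. */
static int specialized_body(int x) { return x * 2; }

static unit_code unit_entry;            /* the indirect-jump cell */

/* Callback stub: entered when the unit is absent from the cache.
   It (re)specializes the unit, redirects the cell, and re-enters. */
static int specialize_stub(int x) {
    specialize_count++;
    unit_entry = specialized_body;      /* point the cell at new code */
    return unit_entry(x);
}

/* Evicting the cache_one unit just redirects the cell to the stub;
   no list of branching units is needed, unlike solution #2. */
static void flush_cache_one_unit(void) {
    unit_entry = specialize_stub;
}
```

After the first call through the cell, every later call pays only the
indirect jump (the small run-time cost noted above); eviction costs
one store (the small specialize-time cost).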