Outstanding Issues
----------------------------------------------------------------------

Name
----
We need a name for our system.  There appears to be little enthusiasm
for DyC.  My best suggestion is Duchamp.  Why?

    Marcel Duchamp, b. July 28, 1887, d. Oct. 2, 1968, was a French
    painter and theorist and one of the most influential figures of
    avant-garde 20th-century art.  Duchamp developed a type of
    symbolic painting, a dynamic version of facet CUBISM (similar to
    FUTURISM), in which the image depicted successive movements of a
    single body....Following his maxim never to repeat himself...

Note _dynamic_, "depicted successive movements" (sounds like
specialized versions), and "never to repeat" (CacheAllUnchecked!).
Note also that Duchamp was French, and that if it is mispronounced by
Americans it sounds like "duh Champ!"  One of Duchamp's styles,
dynamism, attempted to capture dynamic behavior (i.e., motion) in a
static medium (i.e., painting).  Sounds like templates!  :-):-):-)

Title
-----
We've changed the scope of the paper.  Should we change the title?
Too late!!!

Abstract
--------

Section 1: Introduction
-----------------------

Section 2: Example
------------------

Section 3: Run-Time Specializer
-------------------------------

Section 4: Annotations
----------------------

Inlining

No mention of run-time inlining.

Section 5: Analysis of the Annotations
--------------------------------------

MergePartitions and Partition

It would also be helpful to say more about how these work in the text.

Section 6: Generating the Run-Time Specializer
----------------------------------------------

Integrating dynamic code into static code

This is also where intra-unit branch patching needs to happen.
Section 7: Related Work
-----------------------

Section 8: Conclusions
----------------------

Acknowledgements
----------------

References
----------

Appendices
----------

Things to remember, but we don't need in the paper
--------------------------------------------------

Return nodes

"Return nodes need no analysis function, since there are no program
points after return nodes."  This is more or less true.  However, the
interprocedural specializer could use this information (Tempo does).
Unfortunately, the BTA would also need to be interprocedural in order
to propagate return divisions for specialized (rather than just
memoized) calls.

Lazy policy

We should think about the semantics of the lazy policy.  Should we
really make every dynamic branch lazy if any static variable with the
lazy policy is live?  That sounds ok, and I don't have an alternative
proposal.  However, the following situation is the type I'm thinking
of:

    make_static(x);
    x = foo;
    if (d1) { /* no uses or defs of x */ }
    if (d2) { /* no uses or defs of x */ }
    etc.
    ...x...

Do we really want to make all intervening dynamic branches lazy
between defs and uses of a variable annotated lazy, even if they
don't affect the variable directly?

Do we really want to find the branches that determine execution of
the merge?  For example:

    make_static(x);
    if (d1) {
        if (d2) x++;
        else x--;
    DM: ...x...
    }

So, the true edge of the d1 if will be made lazy?  Why is this the
behavior we want?

Single-Run-Time-Predecessor (SRTP) Optimizations

This is an updated version of what was in my GEgen writeup (also now
updated).  When I say a unit has the all_unchecked policy, I mean
some static variable has that policy as its current policy, causing
the unit to always be specialized.

A unit must have a single entry...
-- If a predecessor to a node has a different context (and, hence,
   must be from a different unit), a cache lookup will be required
   either at run time (if there are promotions) or at specialization
   time (if there are only demotions), and so the node must start a
   new unit even if it has other predecessors with the same context

-- Even if all predecessors to a node have the same context as the
   node, but they don't all belong to the same unit (e.g., due to
   "voluntary" laziness), then the node must start a new unit because
   it must be possible to generate code for that node if it is
   reached by any of its predecessors (remember that the specializer
   only generates code for whole units)

-- Even if you know that a particular predecessor will be specialized
   before the node is reached, when the other predecessors are
   specialized you need to be able to find the node in order to link
   (backpatch) the branches to it.  Thus, some kind of
   context-dependent lookup is required, which is essentially a cache
   lookup.

For example:  if some node d in the unit of one of the predecessors
of node n dominates all of the predecessors, and there exist downward
paths (not traversing back edges) from d to each of the predecessors
that do not contain lazy edges, then n _could_ be included in the
same unit as d.  A concrete instance of this:

    make_static(x);
    L1: ...x...
    if (d1) {
        ...x...
    } else {
        if (d2) {
            make_static(y);
            L2: ...y...
            make_dynamic(y);
        }
        L3: ...x...
    }
    L4: ...x...
    make_dynamic(x);

The node marked L4 could be included in the same unit as the node
marked L1.  But, without a cache lookup, how would the specializer
link the bottom of L3 to L4?

We can relax the compile-time single-entry requirement for units if
we know that a node will have only a single predecessor
_at_run_time_.  We can eliminate unit boundaries for eager successors
with single run-time predecessors, and for those with multiple
predecessors but the replicate (all_unchecked) policy.
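The "essentially a cache lookup" point can be made concrete.  Here is
a minimal C sketch of a context-keyed specialization cache; the names
(cache_lookup, cache_insert) and the single-int context are invented
for illustration -- the real cache would be keyed on the full tuple
of static-variable values in the unit's context, and would hash
rather than scan.

```c
#include <stddef.h>

/* Hypothetical sketch: a tiny cache mapping a context (here, one
   static value) to the address of the code specialized for it. */

#define CACHE_SLOTS 16

struct cache_entry { int ctx; void *code; int used; };
static struct cache_entry cache[CACHE_SLOTS];

/* Return the specialized code for this context, or NULL on a miss
   (in which case the unit must be specialized and inserted). */
void *cache_lookup(int ctx) {
    for (int i = 0; i < CACHE_SLOTS; i++)
        if (cache[i].used && cache[i].ctx == ctx)
            return cache[i].code;
    return NULL;
}

/* Record newly specialized code; returns 0 if the cache is full. */
int cache_insert(int ctx, void *code) {
    for (int i = 0; i < CACHE_SLOTS; i++)
        if (!cache[i].used) {
            cache[i].ctx = ctx;
            cache[i].code = code;
            cache[i].used = 1;
            return 1;
        }
    return 0;
}
```

This is the lookup that the SRTP optimizations try to eliminate when
a node is known to have only a single run-time predecessor.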
For example, a static merge has only one run-time predecessor, so we
don't need a unit boundary at a static multi-unit merge where all
predecessors have the same context.  Actually, we only require that
predecessors from different units be on statically mutually exclusive
paths.  Multiple non-mutually-exclusive predecessors from the same
unit are acceptable because they will be converted to a single
predecessor on that static path during DAG conversion.  Thus, it is
possible for multiple units to share the same code, and the graph is
no longer strictly partitioned.  Note that the shared code may
actually be copied during DAG conversion if it is reached in
different orders on statically mutually exclusive paths.  I probably
haven't thought enough about this.  For example, when looking for an
edge from such a shared subgraph, how do we consider it when applying
the SRTP opt. to its successor, especially if that successor has
multiple predecessors?

Similarly, we can eliminate cache lookups (though not the unit
boundary) for lazy successors with no dynamic-to-static promotions
and single run-time predecessors (e.g., at lazy successors to dynamic
branches), or for those with multiple predecessors and the replicate
(all_unchecked) policy.

I only mentioned variables with the all_unchecked policy.  The
optimizations apply equally to all other combinations of cache
policies, including one_unchecked.

NB: we do have to look at back edges when performing these
optimizations.  This means unit identification is iterative!  This
could provide justification to include "simple" unit identification
in the BTA.  If we require back edges to be mutually exclusive with
all other incoming edges, then "simple" unit identification is
non-iterative, because it won't matter from which unit they emanate.

Polyvariant division and policies

We had a discussion about this a while ago and decided it was a good
idea to split on policies because that was the "strongest" notion of
keepConst.
However, I could imagine cases where the programmer wouldn't want
that behavior.  We could have a footnote saying that splitting only
on the raw division (static/dynamic) might also be reasonable, and
that we don't have enough experience to know whether we would want to
split or not.  One could also imagine yet another policy to control
the splitting (poly/mono_policies).

Motivation

Why compute root vars for static vars?  We should have an example of
how we'd otherwise screw up if we didn't kill derived variables:

    make_static(x);
    ...x...
    make_static(y);
    z = x + y;
    make_dynamic(x);
    A: ...z...

If we have one version of the code labelled A for all values of x,
clearly we cannot specialize on z without putting it in the context,
which we can't if it is simply an unannotated, derived static
variable.  So, we must kill it.

Why compute discordant variables?  We need them now to change cache
policies.  Why output them?  The specializer could use them in a
hierarchical caching scheme.

Why compute lazy points?  This should be clear from the lazy/eager
edge info used by the specializer.  Why we push lazy points up to
dynamic branches _should_ be discussed, however.

Lazy branch successors

The lazy-branch identification introduces new unit boundaries.  There
must be a boundary at the discordant merge in any case.  So, there is
a trade-off between introducing new unit boundaries and (potentially)
needlessly specializing code on one side of the branch.  We should
perhaps mention the single-run-time-predecessor optimizations here as
well, since they provide the mechanism to eliminate this cache
lookup.

Interprocedural annotations (specialize vs. make_static)

What you say is fine, but I'm just leaving this here because I want
to keep it.

    specialize f(x,y,z) on (x,y);

vs.

    f(x,y,z) { make_static(x,y); ... }

1. specialize only specializes f on x and y if they are static at the
   call site.  A make_static inside f would cause all arguments
   passed for x and y to be promoted.

2.
   specialize permits x and y to begin as static in f, without the
   potentially redundant promotion at the beginning of f.  Thus, f
   may be eagerly specialized after specializing the caller, possibly
   eliminating a run-time cache check.  Of course, if we say this, we
   must at least discuss extending the specializer to be
   interprocedural in the previous section.

3. specialize doesn't have to pass x and y to the specialized version
   of f (merge & split, as mentioned).

Bottom line: specialize allows function parameters to begin life
static.  make_static must assume all function parameters are dynamic.

Function pointers

Perhaps we should at least mention something about specialize and
function pointers (stubs, calling general versions, topic of future
work, whatever).

setjmp() and longjmp()

Should we mention them?  What should we do?

Static and discordant merges

Called constant merges in the PLDI paper; we no longer care about
static merges.  We don't typically care whether all the merge
predecessors are mutually exclusive, but rather whether particular
edges are mutually exclusive.  If we perform the SRTP optimizations
(see below), the reachability conditions must be an output of the
BTA.

Craig's comment: we care about equivalence classes of predecessors
defined by defs of discordant vars being disjoint.  Static merges are
a coarser, more conservative test.

A merge is discordant if some variable has potentially different
definitions along different predecessors that are not mutually
exclusive.  This better handles the following case:

    make_static(x);
    if (s) {
        x++;
        if (d1) { /* no defs of x */ }
        else    { /* no defs of x */ }
    } else {
        x--;
        if (d2) { /* no defs of x */ }
        else    { /* no defs of x */ }
    }
    M: ...x...

The merge at M has four incoming edges, not all of which are mutually
exclusive with each other, but the two defs of x are mutually
exclusive.  I could make this example more explicit by inserting four
"goto M;" statements.
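Following that suggestion, here is the same example with the four
explicit "goto M;" statements, wrapped in a compilable function.  The
wrapper name (merge_example) and the initial value of x are invented;
the make_static annotation appears only as a comment, since it isn't
plain C.

```c
/* Four edges reach M, not all mutually exclusive with each other,
   but the two defs of x (x++ on the s path, x-- on the !s path)
   are mutually exclusive -- so M need not be discordant. */
static int merge_example(int s, int d1, int d2) {
    int x = 0;                              /* make_static(x); */
    if (s) {
        x++;                                /* one def of x */
        if (d1) { /* no defs of x */ goto M; }
        else    { /* no defs of x */ goto M; }
    } else {
        x--;                                /* the other def of x */
        if (d2) { /* no defs of x */ goto M; }
        else    { /* no defs of x */ goto M; }
    }
M:                                          /* M: ...x... */
    return x;
}
```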
Adding derived variables to the context

If any root variable of a derived annotated variable is removed from
the context, the annotated variable must be added to the context.
This is an eager promotion, and probably shouldn't be in the
Promotions set.  Here is the example:

    make_static(w,x,y);
    z = w + x + y;
    make_static(z);
    ...z...           /* z need not be in context yet */
    make_dynamic(y);  /* add z to context here */

The equations now handle this (in FilterStaticVars).

Removing dead variables from the context

This issue also arises because we remove dead variables from the
context.  We hope that between our handling of the above situation
and clustering, we'll do better than leaving variables in the context
until their last use.  This is pretty cool!  Here is the example:

    make_static(x,y,z);
    ...x...           /* x is dead after this use */
    demote(x);        /* inserted by BTA */
    ...               /* code w/o uses of x */
    x = DEF;          /* could be static or dynamic */
    ...
    make_dynamic(y);
    ...x...           /* last use of x */
    demote(x);        /* inserted by BTA */

The idea is that we don't really want to keep a variable in the
context until its last use, because then we needlessly specialize on
it while it's dead.  On the other hand, we don't want to introduce a
lot of demotions and promotions.  We've recently had a few neat
insights into this problem that allow us to remove a variable from
the context whenever it goes dead (demote).

If the DEF of x is dynamic, there must be a promotion there in any
case, so demoting it won't really hurt.  There are no uses of the
variable between the inserted demotion and the promotion, so
clustering could conceivably move the demotion down to the promotion,
eliminating that unit boundary.  If the demotion is clustered with a
different boundary, then we've cut down on the amount of code
needlessly specialized.

If the DEF of x is static, the operation must not use x, because
otherwise x would be live at the def.  Therefore, we don't want x in
the context, because its roots will be in the context.
x will only be added back into the context when one of its roots is
removed, as discussed above.  At that point, the "promotion" of x
need not be lazy, since we know its value.  Once again, the demotion
may be moved by clustering, potentially all the way to the promotion,
because there are no intervening uses of x.  Excellent!

Restriction to control-equivalent program points is key

Do we want to say more about this?  For example, that it ensures all
holes will be in the same place.  Or, that we don't have to consider
pushing above or below branches or merges (which we might want to
do)?  I guess we could say something about it when discussing future
work.

Integrating dynamic code into static code

In the long run, we don't want to use strict block-to-block mappings.
We should be able to get by with fewer restrictions on instruction
movement.  The weakest possible restrictions are:

- instructions cannot be moved outside their units
- instructions cannot be speculatively hoisted above static branches

Optimizing specializer interactions

Elimination of cache lookups for the replicate policy is discussed
under the SRTP opts.  The overwriting of computed jumps with direct
jumps (or fall-throughs) applies to any unit for which the cache
lookup is eliminated by the SRTP opts.  We may want to stress that no
cache flush is required.  I love that.  :-)

The only other optimizations we (plan to) perform have to do with
optimizing the calls to the Specialize function themselves.  I'm
talking about implementing our own stack (the work list), reducing
register saving and restoring, passing values in registers, etc.

Repatching direct branches to replaced cache_one units

Eager and lazy unit edges w/o promotions are patched to be direct
branches to their destinations.  What happens if their destination is
removed from the cache?
I thought of three possible solutions:

1) Require a run-time cache lookup for any unit where any static
   variable has the cache_one policy (henceforth called a cache_one
   unit), rather than putting in a direct branch.

2) Leave a callback in the place of the cache_one unit; a cache flush
   is required*, and the space required by the callback cannot be
   freed until all units jumping to it are freed; thus, we must also
   maintain a list of units branching to the cache_one unit.

3) Insert an indirect jump, rather than a direct branch.  When the
   cache_one unit is generated, set the jump address to the address
   of the unit.  Initially, or when the unit is removed from the
   cache, direct the jump to a callback stub.  We wouldn't be able to
   set the prefetch-hint bits, because then the indirect jumps would
   require patching as in #2.

#1 is conceptually simple and has no specialize-time cost, but the
greatest run-time cost (run time being time spent executing generated
code).  #2 is the most complicated scheme and has a significant
specialize-time cost, but no run-time cost.  #3 is very simple and
has a small specialize-time cost and a small run-time cost.  #1 is
obviously out.  #3 would be easier to implement than #2.

One might also think of using a valid/invalid flag for the cache_one
unit, but I couldn't think of a way to use a flag that wouldn't be
strictly more expensive (at both specialize time and run time) than
#2.

* If we have to flush the entire instruction cache anyway to replace
  the previous specialized version, then no additional flush is
  needed.
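A minimal sketch of how #3 might look, modeling the indirect-jump
cell as a C function pointer.  All names (unit_entry,
specialize_stub, flush_cache_one_unit) are invented, and the
"specialized code" is a stand-in; a real implementation would redirect
an actual indirect jump's target word rather than a function pointer.

```c
/* The unit is always entered through unit_entry (the indirect-jump
   cell), so callers never need repatching when the unit is evicted. */

typedef int (*unit_code)(int);

static int specialize_count = 0;        /* how many (re)specializations */

/* Stand-in for code the specializer would generate for the unit. */
static int specialized_body(int x) { return x * 2; }

static unit_code unit_entry;            /* the indirect-jump cell */

/* Callback stub: entered when the unit is absent from the cache.
   It (re)specializes the unit, redirects the cell, and re-enters. */
static int specialize_stub(int x) {
    specialize_count++;
    unit_entry = specialized_body;      /* point the cell at new code */
    return unit_entry(x);
}

/* Evicting the cache_one unit just redirects the cell to the stub;
   no list of branching units is needed, unlike solution #2. */
static void flush_cache_one_unit(void) {
    unit_entry = specialize_stub;
}
```

After the first call through the cell, every later call pays only the
indirect jump (the small run-time cost noted above); eviction costs
one store (the small specialize-time cost).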