Annotation-Directed Run-Time Specialization in C

Brian Grant, Markus Mock, Matthai Philipose, Craig Chambers, Susan Eggers

The first thing we have to ask ourselves is "what is our contribution?"
Possibilities:
-- Primitives for describing run-time specialization
   -- makeStatic
   -- makeDynamic
-- The concept of dynamic-to-static promotion (dropping?), including
   dynamic re-specialization, nested and overlapping dynamic regions,
   automatic dynamic-to-static promotion, and associated caching
-- The idea of associating specialization policies with static variables
-- Conditional specialization (though Consel & Danvy mentioned
   conditionally static values, it wasn't intended as a means to control
   specialization)
-- Lazy (or incremental) specialization (all run-time specialization is,
   in a sense, lazy, but we can use laziness to ensure safety, reduce
   specialization cost, etc.)
-- The BTA and run-time-specialization algorithm
-- Unit optimizations (similar to identification of specialization
   points or configuration analysis?)

I don't think that the intraprocedural/subfunction/direct-style nature
of our approach is a contribution, but rather a requirement.

-- Introduction
   -- Motivating example: stack-based byte-code interpreter
      -- Multi-way unrolling
      -- Automatic dynamic-to-static promotion
      -- Conditional specialization
-- The Run-Time Specializer
   -- Novel features
      -- Dynamic-to-static promotion
      -- Dynamic re-specialization
      -- Lazy/incremental specialization
      -- Unit-based
   -- Simplifications
      -- Intraprocedural
         -- Though "constant" functions can be invoked
         -- Can they contain dynamic regions?
      -- No generalize decisions made
   -- Algorithm (see specializer.txt)
-- The Annotations
   -- makeStatic
   -- makeDynamic
   -- Function annotations
   -- Specialization policy
      -- monoSpecialize
      -- polySpecialize
   -- Division policy
      -- monoDivide
      -- polyDivide
   -- Promotion policy
      -- manualPromote
      -- autoPromote
      -- Possible 3rd policy for promoting only at must defs
   -- Laziness policy
      -- eager
      -- loopLazy
      -- polyLazy
      -- lazy
   -- Cache policy
      -- CacheAll, Cache1, Replicate, Unchecked
      -- Could want different policies at different places
         -- promotions & demotions
         -- split discordant merges
      -- Other possibilities
         -- Cache(n)
         -- User-defined replacement policies
   -- Benefits
      -- Easy to use
         -- Turn them off for debugging
      -- Lots of control
         -- Good for experimentation
      -- Target for fully automated system
-- BTA
   -- Prepass
      -- Compute use-def chains
      -- Compute live ranges and sums of all live ranges
      -- Find apparent boundaries of the dynamic region
      -- Insert makeStatics and makeDynamics
   -- Polyvariant specialization
   -- Polyvariant division
   -- Automatic d->s promotion
   -- IDFA over CFG
      -- Flow functions
      -- Keep track of root variables of derived temps
   -- Eliminate redundant makeStatics and makeDynamics
   -- Output
      -- Static/dynamic variables w/ cache policies
      -- D->S promoted variables
      -- Discordant variables
      -- Context sets
      -- Eager/lazy edges
-- GEgen
   -- (Sub-)basic blocks as units
      -- Split merges with multiple divisions
      -- Partition basic blocks at specialization points
         -- Context changes
         -- Dynamic-to-static promotions
      -- Insert calls to the specializer at d->s promo. pts. outside the
         dynamic region
   -- Bigger units
      -- Unit identification and graph partitioning
         -- New units at specialization points
            -- Context changes
            -- Dynamic-to-static promotions
            -- Discordant merges
         -- New units at merges with any lazy incoming edges
         -- New units at merges with predecessors from different units
            (for single-entry)
         -- Clustering!
      -- DAG construction via splitting & merging
         -- Want to specialize all blocks of a unit that could
            potentially be reached => pursue all paths under dynamic
            control
         -- Want to know the order of traversal of the blocks at compile
            time
            -- Better optimization of static code
            -- Eliminate run-time work-list overhead
         -- So, build a DAG of the blocks reached
         -- Difficult because of
            -- Unstructured code
            -- Combinations of static and dynamic branches
            -- Loops / back edges
         -- Solution
            -- Pursue all static paths (static paths are determined by
               which static branches are taken)
            -- Copy blocks when they are reached in different orders on
               different static paths
         -- Algorithm (should we put the full algorithms in the
            appendix, given that they are untested? how much detail in
            the body?)
            -- Splitting
               -- Split all static paths
               -- Add nodes to a given static path as they are reached;
                  topologically order nodes on different dynamic paths
            -- Merging: merge the tails of branches of the tree created
               by splitting
               -- Create a dummy node and link it to all of the leaves
                  of the unit's tree
               -- Add the sink to a work list
               -- Until the work list is empty, remove the head of the
                  list. Compare the sources of all incoming edges into
                  the node. If two source nodes represent the same block
                  and they have the same successors in this graph, then
                  merge the two nodes and add the merged node to the
                  work list if it isn't there already.
   -- Eliminate unit boundaries for eager successors with single
      run-time predecessors, and for those with multiple predecessors
      but the Replicate policy, statically copying code for those with
      multiple compile-time predecessors
   -- Eliminate cache lookups for lazy successors with no
      dynamic-to-static promotions and either single run-time
      predecessors or multiple predecessors and the Replicate policy
   -- What about unchecked eager promotions for static pointer derefs?
-- Further optimizations to the run-time specializer
   -- In-line and specialize the run-time primitives
      -- Everything in the Unit structures is known at compile time
   -- Recursion is implemented with a stack and jump (tail recursion!)
      -- Eliminate stack manipulation for all first eager successors
   -- Saving/restoring of state and argument passing can be specialized
      in many cases as well (e.g., at lazy callbacks)
      -- We don't explicitly pass around a set of all live static
         variables; we load/store them from/to the cache
   -- Hybrid caching scheme?
      -- Use hierarchy when adding to the context
      -- Reset to the top of the hierarchy when removing from the
         context
-- Related Work
   -- Tempo (Consel)
   -- Fabius (Leone, Lee)
   -- `C (Engler, Kaashoek)
   -- Cmix (Andersen)
   -- Schism (Consel)
-- Conclusions
   -- Status
   -- Future work
      -- Alias analysis
      -- Interprocedural analysis
      -- Interprocedural specializer
         -- Overcome required laziness at function boundaries
      -- Better handling of side effects
      -- Different cache organizations (e.g., hierarchical)
         -- Cache interface
            -- Replacement policies?
            -- Information for making specialization decisions
      -- SPIN: must be able to reclaim space
         -- Don't know how important it will be
      -- Add a run-time optimizer to the run-time specializer
         -- Register allocation
         -- Instruction scheduling
         -- Won't always be profitable => optimization policy