Ideas for a submission to PLDI '98
The submission deadline is 5pm CST November 7.
Priorities
 What follows is a list of Things To Do for the
submission.  We need to prioritize the list, decide which subset to
target for the submission, and decide which (slightly larger) subset to
target for the final paper.  Each item on the list could individually
be finished by the final-paper deadline (but not all of them
together!), though not necessarily by the submission deadline.
- *Name for our system
	- My latest proposal is Art Deco (for
	At Run Time Dynamic Compilation)
 
	- Previous proposals:
	
		- Sorcerer (for
		Statically Optimized
		Run-time Compiler)
 
		- Duchamp (dynamist artist)
 
	- De facto name: DyC
 
 
- *Benchmarks
	- Find the code
 
	- Run it through the 64-bit version of Multiflow
	and get help from Charlie, if required
 
	- Determine what functionality is required from our system
 
	- Annotate it
 
	- Examples
	
		- Our previous benchmarks
 
		- Byte-code interpreter
 
		- New SPIN dispatcher mockup
 
		- `C benchmarks (e.g., xv)
 
		- Tempo benchmarks (e.g., FFT)
 
		- Java
 
		- OpenGL
 
		- Perl, Tcl, etc.
 
		- Andrew Certain's multi-resolution viewer
 
 
- *Regression tests
	- Collect past test cases and re-annotate them
 
	- Dust off the script(s)
 
	- Develop a more comprehensive testing suite
 
 
- *Instrumentation
	- Figure out how to use the Alpha cycle counter, or find another
	way to accurately measure user cycles
 
	- Based on what we want to claim, decide what to measure and why
	
		- Overhead (partially, to estimate benefit of optimizations we
		don't do)
		
			- Categorized unit boundary crossings
			(promotions, discordant merges, etc.)
 
			- Cache lookups
 
			- Work-list operations
 
			- Function-call overhead (for lazy GE calls)
 
			- Per-instruction-generated (with & without
			above overheads)
 
		- Speedup
 
		- Breakeven points
 
		- Amount of code reuse
 
 
- Features
	- *Static annotation (@) for other unsafe
	operations, such as calls and divide
 
	- *Register actions
 
	- Split the graph for multiple divisions
 
	- Function annotations
 
	- promote_all_invalidate,
	promote_one_invalidate, and
	promote_one_invalidate_unchecked, plus explicit invalidation
	annotations
  
	- cache_one
 
	- cache(n)
 
	- Global space management / global cache policies
 
	- Program-point lazy annotation
 
	- Division querying
 
	- Run-time inlining
 
	- Specialized varargs
 
	- `C emulation layer (IR-building operations and a
	specializing interpreter)
 
	- Data specialization
 
 
- Optimizations
	- *Multiflow's inlining for apps and for GE operations
 
	- *Investigate excessive spilling at trace boundaries
 
	- Static and dynamic IGOTOs
 
	- Move loads of live static variables to their uses
 
	- Discordant SELECTs (a control-flow merge converted to a
	data-flow merge; we should treat the two equally: convert dynamic
	SELECTs of static operands back to control flow inside the BTA,
	saving enough state to revert if the binding times change)
 
	- Reduce scheduling constraints (eliminate use of trace
	boundaries to stop code motion)
 
	- Reduce caching overheads
	
		- Make cache_one_unchecked a single jump
 
		- Specialize lazy entries
 
		- Sort vectors of live static variables by size to
		reduce fragmentation
 
		- PIC-style cache lookups
 
		- Hierarchical-context-based caching
 
		- Choose optimal cache-context composition
 
	- Reachability conditions
 
	- Reduce work-list overhead
	
		- Peel / partially unroll the work-list loop
 
		- Eliminate redundant copying of static variables
		live across unit boundaries
 
	- Reduce unit-boundary crossings
	
		- Linearization
 
		- Clustering
 
		- Single-run-time-predecessor optimizations
 
	- Further optimize memory allocation
 
	- Eliminate redundant table-pointer construction
 
	- Overwrite jmps with brs
 
	- Eliminate branch laziness when possible
 
	- Scrap the merger and generate the GEs in a second pass of the
	compiler; this would allow more scheduling constraints on the
	template code to be removed
 
 
- Other Infrastructure
	- Oracle
 
	- Compile-time specializer (could we use Tempo?)
 
 
Other Paper Thoughts
- Slant
	- Dynamic-compilation system that can provide speedups for
	real apps written in C
 
	- Useful, practical declarative system
 
	- What functionality is required for speedups?
 
	- How fast does dynamic compilation need to be?
 
	- Are there applications of dynamic compilation outside
	operating systems and OOPLs?
 
 
- Comparisons
	- How will we compare our system with others?
 
	- With statically compiled C
 
	- With `C?
 
	- With an oracle?
 
	- Java: JIT? Toba? Vortex?
 
 
Abstract
Here is a sample abstract that I wrote a few months ago.  It has a
weak slant, but no mention of contributions.  Somebody munge it to
make it better or replace it.
We previously presented the design of a dynamic compilation system for
C [PEPM97].  The system requires the user to specify what code is to
be dynamically compiled and leaves most of the key cost/benefit
trade-offs open to user control through declarative annotations.
From these annotations our system automatically produces light-weight
specialized dynamic compilers for annotated regions of the program.
In this paper, we present results from applying the system to several
real programs and reflect upon which functionality was required to
obtain those results and to what degree we have succeeded in making
dynamic compilation both fast and effective.
Introduction
What is DC; DC is cool
Dynamic compilation, the generation and optimization of code at run
time, has received much attention recently as a means of transforming
programs using information available only at run time.  Proposed
applications of this capability include specializing architectural
simulators for the configuration being simulated, language
interpreters for the program being interpreted, numerical programs for
the dimensions and values of frequently used arrays, and critical
paths in operating systems for the type of data being processed and
the current state of the system.  Trends in software engineering
toward dynamic reconfigurability, parameterization for re-use, and
portability across hardware architectures all imply a promising role
for dynamic compilation.
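The interpreter application is concrete enough to sketch.  Below is a
toy bytecode dispatch loop (our illustration, not code from any of the
systems cited); since the code array is invariant at run time, a
dynamic compiler can specialize run() for a given program, unrolling
the loop and deleting the switch dispatch entirely.  The opcode set
and function names are made up for the example.

```c
/* Toy stack-machine interpreter.  The program (code[]) is a run-time
 * constant, so a dynamic compiler could unroll this loop for a given
 * program, eliminating the pc arithmetic and the switch dispatch. */
enum { PUSH, ADD, MUL, HALT };

static int run(const int *code) {
    int stack[16], sp = 0, pc = 0;
    for (;;) {
        switch (code[pc]) {
        case PUSH: stack[sp++] = code[pc + 1]; pc += 2; break;
        case ADD:  sp--; stack[sp - 1] += stack[sp]; pc++; break;
        case MUL:  sp--; stack[sp - 1] *= stack[sp]; pc++; break;
        case HALT: return stack[sp - 1];
        }
    }
}
```

Specialized for the fixed program { PUSH, 2, PUSH, 3, ADD, HALT }, the
entire loop collapses to "return 5".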
People have shown DC is viable, but not for big programs
Recent research efforts have made considerable progress toward proving
the viability of dynamic compilation.  In particular, researchers have
demonstrated that, for many small, frequently used kernels, run-time
optimization overhead can be quickly amortized by the increased
efficiency of the optimized code.  Existing techniques have, however,
failed to demonstrate applicability to existing applications of the
size and complexity of the interpreters and simulators mentioned
above.  The underlying reason for this failure is the tradeoff between
the expressivity of the language interface for dynamic compilation and
its ease of use:
	- Systems that require the programmer to explicitly construct,
	  compose, and compile the dynamically generated code ("imperative
	  systems") imply large-scale rewriting of existing code that
	  needs to be dynamically compiled.  (The expressivity of these
	  systems is also in question: e.g., breaks out of loops.)
 
	- Systems that offer a declarative scheme of sparse
	  user annotation to guide dynamic compilation ("declarative
	  systems") have not been able to express directions for compiling
	  complex programs in a natural, compact way.
 
This is the new stuff we do
In this paper, we describe a declarative annotation-based system that:
	- Is expressive enough to exploit complex opportunities for
	  dynamic compilation in large, realistic programs.  The required
	  expressiveness is significantly beyond that of existing
	  annotation-based DC systems.
 
	- Requires only sparse annotations of existing statically
	  compilable code, thereby eliminating the rewriting hurdle of
	  existing imperative systems.
 
	- Is based on a principled extension of traditional offline
	  PE techniques to the online case.
 
	- Produces code of quality comparable to that of other existing
	  systems, at similar speeds.
This is how we do it, in a paragraph
Summarize features (~list from PEPM paper intro):
- per program-point specialization.
 
- polyvariant specialization and division.
 
- on-demand specialization to avoid specializing for values not yet
computed.
 
- interprocedural specialization with support for separate compilation. 
 
- automatic caching of specialized code
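
The automatic-caching feature can be illustrated with a toy version
(our sketch, not the system's actual data structure): each dynamic
region keeps a table keyed on the values of its live static variables,
and only a miss invokes the generating extension (GE).  dummy_gen and
the direct-mapped table are made up for the example.

```c
#include <stddef.h>

/* Toy direct-mapped cache for specialized code, keyed on the value of
 * a single static variable.  A real implementation would hash the
 * vector of live static variables; dummy_gen stands in for invoking
 * the generating extension (GE) on a miss. */
#define CACHE_SIZE 8

struct entry { int key; void *code; int valid; };
static struct entry cache[CACHE_SIZE];
static int ge_invocations;               /* counts cache misses */

static void *dummy_gen(int key) {
    ge_invocations++;                    /* a real GE would emit code here */
    return (void *)&cache[(unsigned)key % CACHE_SIZE];
}

static void *lookup_or_generate(int key) {
    struct entry *e = &cache[(unsigned)key % CACHE_SIZE];
    if (!e->valid || e->key != key) {    /* miss: (re)generate */
        e->key = key;
        e->code = dummy_gen(key);
        e->valid = 1;
    }
    return e->code;                      /* specialized-code entry point */
}
```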
 
We can get speedups on reasonable inputs for _big_ programs
We demonstrate the effectiveness of our approach by profitably
applying dynamic compilation to javap, the bytecode interpreter for
the Java programming language, and to MIPSI, an instruction-level
simulator for the MIPS R3000.  Translating the dynamically compiled
regions of these programs (x and y lines of C) into an imperative
language would clearly require a lengthy, painstaking effort by
programmers.  In contrast, we enable dynamic compilation of both
systems by adding a few carefully selected annotations (x and y lines)
to their C source code.  We also identify the features of these
applications that prevent existing annotation-based systems from
optimizing them without substantial rewriting.  (These applications
are representative -- talk of other similar alluring ones!)
JITs do much better than us; why? opportunities...
Using Java provides us the opportunity...
The existence of hand-written "just-in-time" (and conventional)
compilers for Java bytecode allows us to measure a few more
data-points on the code-quality vs. compilation time curve.
We do as well as other systems: not trading off expressivity/ease
   for speed
The improved expressivity and ease of use of our system do not come at
the expense of dynamically generated code quality or longer dynamic
compile times.  Though our current system is not configured for
extremely aggressive run-time optimizations (we do not perform global
optimizations such as register allocation or scheduling), we use a
combination of cheap local optimizations, extensive static global
optimization of templates for dynamic code, and customized code
generators to achieve speedups and code-generation costs comparable to
those of other existing systems.
We will do much better, once our future work is done: microbenchmarks with 
proposed optimizations
Show this (Brian's list):
	- Cost of not specializing calls
 
	- Categorized unit-boundary crossings (promotions,
	  discordant merges, etc.)
 
	- Cache lookups
 
	- Work-list operations
 
	- Function-call overhead
 
	- Per-instruction-generated (with & without above overheads)
 
	- Cost of not doing register actions
 
Summary of sections (important: what will be the order and content of 
sections?)
Outline
None yet.  Anyone, anyone?