Ideas for a submission to PLDI '98
The submission deadline is 5pm CST November 7.
Priorities
 What follows is a list of Things To Do for the
submission.  We need to prioritize the list, decide which subset to
target for the submission, and decide which (slightly larger) subset to
target for the final paper.  Each item on the list could individually
be finished by the final-paper deadline (but not all of them
together!), though not necessarily by the submission deadline.
- *Name for our system
	- My latest proposal is Art Deco (for
	At Run Time Dynamic Compilation)
 
	- Previous proposals:
	
		- Sorcerer (for
		Statically Optimized
		Run-time Compiler)
 
		- Duchamp (dynamist artist)
 
	- De facto name: DyC
 
 
- *Benchmarks
	- Find the code
 
	- Run it through the 64-bit version of Multiflow
	and get help from Charlie, if required
 
	- Determine what functionality is required from our system
 
	- Annotate it
 
	- Examples
	
		- Our previous benchmarks
 
		- Byte-code interpreter
 
		- New SPIN dispatcher mockup
 
		- `C benchmarks (e.g., xv)
 
		- Tempo benchmarks (e.g., FFT)
 
		- Java
 
		- OpenGL
 
		- Perl, Tcl, etc.
 
		- Andrew Certain's multi-resolution viewer
 
 
- *Regression tests
	- Collect past test cases and re-annotate them
 
	- Dust off the script(s)
 
	- Develop a more comprehensive testing suite
 
 
- *Instrumentation
	- Figure out how to use the Alpha cycle counter, or find another
	way to accurately measure user cycles
 
	- Based on what we want to claim, decide what to measure and why
	
		- Overhead (partially, to estimate benefit of optimizations we
		don't do)
		
			- Categorized unit boundary crossings
			(promotions, discordant merges, etc.)
 
			- Cache lookups
 
			- Work-list operations
 
			- Function-call overhead (for lazy GE calls)
 
			- Per-instruction-generated (with & without
			above overheads)
 
		- Speedup
 
		- Breakeven points
 
		- Amount of code reuse
 
 
- Features
	- *Static annotation (@) for other unsafe
	operations, such as calls and divide
 
	- *Register actions
 
	- Split the graph for multiple divisions
 
	- Function annotations
 
	- promote_all_invalidate,
	promote_one_invalidate, and
	promote_one_invalidate_unchecked, plus explicit invalidation
	annotations
  
	- cache_one
 
	- cache(n)
 
	- Global space management / global cache policies
 
	- Program-point lazy annotation
 
	- Division querying
 
	- Run-time inlining
 
	- Specialized varargs
 
	- `C emulation layer (IR-building operations and a
	specializing interpreter)
 
	- Data specialization
 
 
- Optimizations
	- *Multiflow's inlining for apps and for GE operations
 
	- *Investigate excessive spilling at trace boundaries
 
	- Static and dynamic IGOTOs
 
	- Move loads of live static variables to their uses
 
	- Discordant SELECTs (a control-flow merge converted to a
	data-flow merge; we should treat the two equally: convert dynamic
	SELECTs of static operands back to control flow inside the BTA,
	saving enough state to revert if the binding times change)
 
	- Reduce scheduling constraints (eliminate use of trace
	boundaries to stop code motion)
 
	- Reduce caching overheads
	
		- Make cache_one_unchecked a single jump
 
		- Specialize lazy entries
 
		- Sort vectors of live static variables by size to
		reduce fragmentation
 
		- PIC-style cache lookups
 
		- Hierarchical-context-based caching
 
		- Choose optimal cache-context composition
 
	- Reachability conditions
 
	- Reduce work-list overhead
	
		- Peel / partially unroll the work-list loop
 
		- Eliminate redundant copying of static variables
		live across unit boundaries
 
	- Reduce unit-boundary crossings
	
		- Linearization
 
		- Clustering
 
		- Single-run-time-predecessor optimizations
 
	- Further optimize memory allocation
 
	- Eliminate redundant table-pointer construction
 
	- Overwrite jmps with brs
 
	- Eliminate branch laziness when possible
 
	- Scrap the merger and generate the GEs in a second pass of the
	compiler; this would allow more scheduling constraints on the
	template code to be removed
 
 
- Other Infrastructure
	- Oracle
 
	- Compile-time specializer (could we use Tempo?)
 
 
Other Paper Thoughts
- Slant
	- Dynamic-compilation system that can provide speedups for
	real apps written in C
 
	- Useful, practical declarative system
 
	- What functionality is required for speedups?
 
	- How fast does dynamic compilation need to be?
 
	- Are there applications of dynamic compilation outside
	operating systems and OOPLs?
 
 
- Comparisons
	- How will we compare our system with others?
 
	- With statically compiled C
 
	- With `C?
 
	- With an oracle?
 
	- Java: JIT? Toba? Vortex?
 
 
Abstract
Here is a sample abstract that I wrote a few months ago.  It has a
weak slant, but no mention of contributions.  Somebody munge it to
make it better or replace it.
We previously presented the design of a dynamic compilation system for
C [PEPM97].  The system requires the user to specify what code is to
be dynamically compiled and leaves most of the key cost/benefit
trade-offs open to user control through declarative annotations.
From these annotations our system automatically produces light-weight
specialized dynamic compilers for annotated regions of the program.
In this paper, we present results from applying the system to several
real programs and reflect upon which functionality was required to
obtain those results and to what degree we have succeeded in making
dynamic compilation both fast and effective.
Introduction
What is DC; DC is cool
Dynamic compilation, the generation and optimization of code at run
time, has received much attention recently as a means of transforming
programs using information available only at run time.  Proposed
applications of this capability include specializing architectural
simulators for the configuration being simulated, language
interpreters for the program being interpreted, numerical programs for
the dimensions and values of frequently used arrays, and critical
paths in operating systems for the type of data being processed and
the current state of the system.  Trends in software engineering
toward dynamic reconfigurability, parameterization for re-use, and
portability across hardware architectures all imply a promising role
for dynamic compilation.
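The interpreter application is concrete enough to sketch.  Below is a
toy bytecode dispatch loop (our illustration, not code from any of the
systems cited); since the code array is invariant at run time, a
dynamic compiler can specialize run() for a given program, unrolling
the loop and deleting the switch dispatch entirely.  The opcode set
and function names are made up for the example.

```c
/* Toy stack-machine interpreter.  The program (code[]) is a run-time
 * constant, so a dynamic compiler could unroll this loop for a given
 * program, eliminating the pc arithmetic and the switch dispatch. */
enum { PUSH, ADD, MUL, HALT };

static int run(const int *code) {
    int stack[16], sp = 0, pc = 0;
    for (;;) {
        switch (code[pc]) {
        case PUSH: stack[sp++] = code[pc + 1]; pc += 2; break;
        case ADD:  sp--; stack[sp - 1] += stack[sp]; pc++; break;
        case MUL:  sp--; stack[sp - 1] *= stack[sp]; pc++; break;
        case HALT: return stack[sp - 1];
        }
    }
}
```

Specialized for the fixed program { PUSH, 2, PUSH, 3, ADD, HALT }, the
entire loop collapses to "return 5".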
People have shown DC is viable, but not for big programs
Recent research efforts have made considerable progress toward proving
the viability of dynamic compilation.  In particular, researchers have
demonstrated that, for many small, frequently used kernels, run-time
optimization overhead can be quickly amortized by the increased
efficiency of the optimized code.  Existing techniques have, however,
failed to demonstrate applicability to existing applications of the
size and complexity of the interpreters and simulators mentioned
above.  The underlying reason for this failure is the tradeoff between
the expressivity of the language interface for dynamic compilation and
its ease of use:
	- Systems that require the programmer to explicitly construct,
	  compose, and compile the dynamically generated code ("imperative
	  systems") imply large-scale rewriting of existing code that
	  needs to be dynamically compiled.  (The expressivity of these
	  systems is also in question: e.g., breaks out of loops.)
 
	- Systems that offer a declarative scheme of sparse
	  user annotation to guide dynamic compilation ("declarative
	  systems") have not been able to express directions for compiling
	  complex programs in a natural, compact way.
 
This is the new stuff we do
In this paper, we describe a declarative annotation-based system that:
	- Is expressive enough to exploit complex opportunities for
	  dynamic compilation in large, realistic programs.  The required
	  expressiveness is significantly beyond that of existing
	  annotation-based DC systems.
 
	- Requires only sparse annotations of existing statically
	  compilable code, thereby eliminating the rewriting hurdle of
	  existing imperative systems.
 
	- Is based on a principled extension of traditional offline
	  PE techniques to the online case.
 
	- Produces code of quality comparable to that of other existing
	  systems, at similar speeds.
This is how we do it, in a paragraph
Summarize features (~list from PEPM paper intro):
- per program-point specialization.
 
- polyvariant specialization and division.
 
- on-demand specialization to avoid specializing for values not yet
computed.
 
- interprocedural specialization with support for separate compilation. 
 
- automatic caching of specialized code
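
The automatic-caching feature can be illustrated with a toy version
(our sketch, not the system's actual data structure): each dynamic
region keeps a table keyed on the values of its live static variables,
and only a miss invokes the generating extension (GE).  dummy_gen and
the direct-mapped table are made up for the example.

```c
#include <stddef.h>

/* Toy direct-mapped cache for specialized code, keyed on the value of
 * a single static variable.  A real implementation would hash the
 * vector of live static variables; dummy_gen stands in for invoking
 * the generating extension (GE) on a miss. */
#define CACHE_SIZE 8

struct entry { int key; void *code; int valid; };
static struct entry cache[CACHE_SIZE];
static int ge_invocations;               /* counts cache misses */

static void *dummy_gen(int key) {
    ge_invocations++;                    /* a real GE would emit code here */
    return (void *)&cache[(unsigned)key % CACHE_SIZE];
}

static void *lookup_or_generate(int key) {
    struct entry *e = &cache[(unsigned)key % CACHE_SIZE];
    if (!e->valid || e->key != key) {    /* miss: (re)generate */
        e->key = key;
        e->code = dummy_gen(key);
        e->valid = 1;
    }
    return e->code;                      /* specialized-code entry point */
}
```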
 
We can get speedups on reasonable inputs for _big_ programs
We demonstrate the effectiveness of our approach by profitably
applying dynamic compilation to javap, the bytecode interpreter for
the Java programming language, and to MIPSI, an instruction-level
simulator for the MIPS R3000.  Translating the dynamically compiled
regions of these programs (x and y lines of C) into an imperative
language would clearly require a lengthy, painstaking effort by
programmers.  In contrast, we enable dynamic compilation of both
systems by adding a few carefully selected annotations (x and y lines)
to their C source code.  We also identify the features of these
applications that prevent existing annotation-based systems from
optimizing them without substantial rewriting.  (These applications
are representative -- talk of other similar alluring ones!)
JITs do much better than us; why? opportunities...
Using Java provides us the opportunity...
The existence of hand-written "just-in-time" (and conventional)
compilers for Java bytecode allows us to measure a few more
data-points on the code-quality vs. compilation time curve.
We do as well as other systems: not trading off expressivity/ease
   for speed
The improved expressivity and ease of use of our system do not come at
the expense of dynamically generated code quality or longer dynamic
compile times.  Though our current system is not configured for
extremely aggressive run-time optimizations (we do not perform global
optimizations such as register allocation or scheduling), we use a
combination of cheap local optimizations, extensive static global
optimization of templates for dynamic code, and customized code
generators to achieve speedups and code-generation costs comparable to
those of other existing systems.
We will do much better, once our future work is done: microbenchmarks with 
proposed optimizations
Show this (Brian's list):
	- Cost of not specializing calls
 
	- Categorized unit-boundary crossings (promotions,
	  discordant merges, etc.)
 
	- Cache lookups
 
	- Work-list operations
 
	- Function-call overhead
 
	- Per-instruction-generated (with & without above overheads)
 
	- Cost of not doing register actions
 
Summary of sections (important: what will be the order and content of 
sections?)
Outline
None yet.  Anyone, anyone?