Research Agenda
This is a list of possible research directions we would like to
pursue using our new system, code-named DyC.
1
Fast, Effective Run-Time Specialization in C
(Brian)
This topic is titled Fast, Effective Run-Time Specialization
in C because it covers only the run-time specializer and
specialization, not other optimizations, such as run-time register
allocation, that we might include in a full-blown dynamic compiler.
This topic considers two questions:
- What optimizations are most important to the performance of the
run-time specializer (sans a back-end global optimizer)?
- What capabilities and optimizations are most important to the
performance of the target applications? There are three primary
determinants of application performance:
- Capabilities of the specializer
- Cache effects
- Code quality
This topic definitely includes the first, and possibly the second.
The third, however, will be considered with architectural issues.
Most likely, these issues will have to be studied in the context of
multiple architectures, but architecture won't be the focus of
the study. The reason for including multiple architectures in the
study is to generalize the conclusions beyond a single platform.
2/8/98: I'd like to avoid addressing architectural issues as
much as possible. A recent summary of my plans can be found in this e-mail message.
1.1
Designing a State-of-the-Art Specializer
1.1.1
Inlining
To make our run-time specializer state-of-the-art, we need to add
interprocedural specialization, and to support both compile-time and
run-time inlining. Compile-time inlining should probably be
implemented in the context of the compile-time specializer, which will
be needed for the architectural studies (below). Several issues need
to be addressed to support run-time inlining:
- Additional annotations needed
- Representation: what information is necessary or useful?
- Implementation
- Static preparations (e.g., register allocation
across call boundaries)
- What optimizations should be performed after inlining (not
including inter-unit optimizations)?
- Inlining through static function pointers (see the sketch below)
- Termination/cost management
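To make the function-pointer case concrete, the following sketch (in
annotated C) shows the kind of call site run-time inlining could
target. The make_static() macro is only a placeholder for a DyC-style
annotation (defined as a no-op so the fragment compiles as plain C),
and the sorting routine is purely illustrative.

    /* Inlining through a static function pointer: cmp is a run-time
       constant at this call site, so the specializer could turn the
       indirect call into a direct one and, with run-time inlining,
       inline the comparator into the specialized loop. */
    #define make_static(v)   /* annotation placeholder, not real syntax */

    typedef int (*cmp_fn)(const void *, const void *);

    void insertion_sort(void **items, int n, cmp_fn cmp)
    {
        make_static(cmp);                /* cmp is invariant here */
        for (int i = 1; i < n; i++) {
            void *key = items[i];
            int j = i - 1;
            /* This indirect call becomes a direct, inlinable call in
               the specialized code. */
            while (j >= 0 && cmp(items[j], key) > 0) {
                items[j + 1] = items[j];
                j--;
            }
            items[j + 1] = key;
        }
    }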
1.1.2
Data Specialization
We will already be performing a limited type of data
specialization for pointer and floating-point values. That is, we
will stage computations based on those types, but not embed their
values in code. It would be interesting to extend this capability to
arbitrary types of variables by adding a new policy, nogen,
that would indicate that computations based on a variable should be
staged and cached, but not embedded in dynamically generated code.
The policy corresponding to what we currently do would be
cogen.
The nogen policy would result in a different type of staging
from what we have implemented. If specialization under the
cogen policy has been indicated for other variables, both
early and late data-specialized stages would be dynamically generated.
Although this new type of staging introduces a third subgraph that
we must create, it wouldn't substantially increase the complexity of
our system because it doesn't introduce any new dynamic code
generation. We can think of it as further subdividing our dynamic
(template) subgraph.
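As a rough illustration of how the two policies might be written,
consider the sketch below; the annotation syntax is invented for the
example (make_static() is a no-op macro here), and only the policy
names come from the proposal above.

    /* Proposed data-specialization policies: values specialized under
       cogen are embedded in dynamically generated code, while values
       specialized under nogen are staged and cached but never burned
       into instructions. */
    #define make_static(v, policy)   /* annotation placeholder */

    double poly_eval(const double *coeffs, int ncoeffs, double x)
    {
        make_static(ncoeffs, cogen);   /* embedded: drives loop unrolling */
        make_static(coeffs, nogen);    /* staged and cached, not embedded */
        double sum = 0.0;
        for (int i = 0; i < ncoeffs; i++)   /* unrolled under cogen       */
            sum = sum * x + coeffs[i];      /* coefficient read from the  */
                                            /* table of cached values     */
        return sum;
    }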
1.2
Fast Run-Time Specialization
1.2.1
The specialized specializer
We should compare the unit-based specialized specializer approach
to a unit-based stitcher (a rough sketch of the stitcher appears at
the end of this subsection). In addition, we should determine whether
the more complicated unit optimizations are necessary. Note that both
the stitcher and the specialized specializer permit only peephole
optimizations at run time. The various types of units include (in
order of increasing size of units and complexity of the
unit-assignment algorithm):
- True basic blocks (instructions between branches
and merges; linear unit identification)
- Disjoint, single-entry subgraphs containing no dynamic
branches (quadratic unit identification)
- Disjoint, single-entry, linearized subgraphs with no
duplication (unsure how to identify such units)
- Disjoint, single-entry, linearized subgraphs with
duplication (linearization via splitting and merging is
exponential)
In addition, there are three other optimizations we can perform to
increase the size of units:
- Single-run-time-predecessor optimizations (possibly
creating units that are neither disjoint nor single-entry)
- Boundary clustering
- Matthai's optimal linear-time algorithm for
control-equivalent ranges
- Algorithm permitting movement below branches and
above merges
- Clustering different types of boundaries
- Determine need for clustering
- Without alias analysis
- With local alias analysis
- With interprocedural alias analysis
Hence, alias analysis is of some interest (see
automation, below).
- Deciding whether to push laziness of discordant merges
up to dynamic branches or not (one-time lazy boundary plus an
eager cache lookup vs. a lazy cache lookup)
Beyond increasing unit size, other optimizations can be performed
to increase the performance of the specialized specializer:
- Perform successor analysis to determine which specializer
units will be executed next, hopefully allowing us to optimize
copying of static state
- Reduce scheduling constraints within units
- Or, more generally, coschedule the setup and template code
(using separate interference graphs and resource
allocation tables for the two kinds of code); Matthai has some
interest in this as well
- Schedule the specialized specializer after merging
- Reuse setup computations somehow when respecializing where
some values have changed but some have remained invariant
(data specialization?)
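As a very rough sketch of the unit-based stitcher mentioned at the
start of this subsection (the structures and the single 32-bit "hole"
per unit are assumptions made for the example, not the actual design):

    /* Each unit is a pre-compiled machine-code template with a hole to
       be patched with a run-time constant; the stitcher copies units
       into the code buffer and fills in the holes. */
    #include <stdint.h>
    #include <string.h>

    typedef struct {
        const uint8_t *bytes;  /* template code for the unit            */
        size_t         size;   /* template size in bytes                */
        size_t         hole;   /* offset of a 32-bit immediate to patch */
    } unit_template;

    /* Copy one unit into the buffer, patching its hole with the current
       value of the run-time constant; returns the new end of buffer. */
    static uint8_t *stitch_unit(uint8_t *buf, const unit_template *u,
                                int32_t rtconst)
    {
        memcpy(buf, u->bytes, u->size);
        memcpy(buf + u->hole, &rtconst, sizeof rtconst);
        return buf + u->size;
    }

A specialized specializer would instead compile this copy-and-patch
work into custom code for each unit; the cost difference between the
two approaches is what the comparison above would measure.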
1.2.2
Fast memory allocation and reclamation
In addition to hyper-optimizing the specialized specializer, we
should give a fair amount of thought to fast memory allocation,
deallocation, and reallocation. Matthai found that a large part of
the overhead of the previous dynamic compiler was due to calls to
malloc and free.
A few of the issues include:
- Minimizing the number of sufficient-space checks and other
overhead
- Differentiating fixed- and variable-sized allocations and
differentiating by expected lifetime
- Temporary specializer data structures
- Tables of long-lived static values
- Long-lived (but reclaimable) code
- Permanent tables and code
- Making the system thread-safe?
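A minimal sketch of the kind of arena-style allocator that could back
the short-lived specializer data structures listed above (all names
and the alignment policy are assumptions):

    #include <stdlib.h>

    /* One sufficient-space check per allocation; the whole arena is
       reclaimed at once when the temporary data is no longer needed. */
    typedef struct {
        char *base, *next, *limit;
    } arena;

    static int arena_init(arena *a, size_t size)
    {
        a->base = malloc(size);
        if (!a->base) return -1;
        a->next = a->base;
        a->limit = a->base + size;
        return 0;
    }

    static void *arena_alloc(arena *a, size_t n)
    {
        n = (n + 7) & ~(size_t)7;                 /* keep allocations aligned */
        if (a->next + n > a->limit) return NULL;  /* single space check       */
        void *p = a->next;
        a->next += n;
        return p;
    }

    static void arena_reset(arena *a) { a->next = a->base; }  /* bulk free */
    static void arena_free(arena *a)  { free(a->base); }

Separate arenas (or free lists) could then be dedicated to the
different lifetime classes above, so that long-lived tables and code
are not reclaimed along with temporary specializer state.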
1.2.3
Caching schemes
Finally, an efficient implementation of the cache of specialized code
is essential to a fast run-time specializer. We first need to
complete the design of the cache_one implementation. In
addition to flushing the appropriate cache entries at the right
times, there are a few other tricky issues:
- We must keep track of direct branches to replaceable code
so that we can repatch them
- We should keep track of whether specialized
cache_all_unchecked and cache_one_unchecked
code is still reachable if it is immediately preceded by only
cache_one units
Next, we should investigate a couple of hybrid hierarchical/flat
caching schemes to reduce the cost of hashing on large contexts.
It might be interesting to think about an invalidation-based
scheme in addition to cache lookups for determining whether a
specialization can be used, especially for static data structures.
Finally, we should explore cache policies beyond the simple
ones we have included in our initial design:
- cache(n) and replacement policies
- Flexible cache interface for conditional
specialization and conditional caching
- Global, rather than local, cache policies
(i.e., we would like to free memory for
specializations of functions that won't be
called again)
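For concreteness, here is a sketch of a flat, direct-mapped cache of
specialized code, keyed on the run-time values of the variables in the
specialization context. Every name and the fixed context size are
assumptions for the example; a real implementation would also need the
replacement, invalidation, and branch-repatching machinery discussed
above.

    #include <stdint.h>
    #include <string.h>

    #define CACHE_SLOTS   256
    #define CONTEXT_WORDS 2    /* number of static values in the context */

    typedef void (*spec_code)(void);

    typedef struct {
        uintptr_t key[CONTEXT_WORDS];
        spec_code code;              /* NULL if the slot is empty */
    } cache_entry;

    static cache_entry cache[CACHE_SLOTS];

    static unsigned hash_context(const uintptr_t *ctx)
    {
        uintptr_t h = 0;
        for (int i = 0; i < CONTEXT_WORDS; i++)
            h = h * 31 + ctx[i];
        return (unsigned)(h % CACHE_SLOTS);
    }

    /* Return cached code for this context, or NULL so the caller can
       invoke the run-time specializer and install the result. */
    static spec_code cache_lookup(const uintptr_t *ctx)
    {
        cache_entry *e = &cache[hash_context(ctx)];
        if (e->code && memcmp(e->key, ctx, sizeof e->key) == 0)
            return e->code;
        return NULL;
    }

    static void cache_install(const uintptr_t *ctx, spec_code code)
    {
        cache_entry *e = &cache[hash_context(ctx)];  /* direct-mapped: evict */
        memcpy(e->key, ctx, sizeof e->key);
        e->code = code;
    }

A hierarchical scheme would instead look up one context variable per
level; a hybrid scheme might, for example, hash the large,
rarely-changing part of the context once and only the small,
frequently-changing part on each lookup.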
1.2.4
Demand-Driven Specialization
It would be enlightening to compare eager (speculative) and lazy
(demand-driven) specialization, in terms of utility and cost. Some of
the relevant issues include:
- Difference in cost between eager and lazy specialization
- Cost of eagerly specializing unused code
- Using laziness to guard unsafe (trapping or
non-idempotent) operations
- How often is laziness required?
- Techniques for eliminating loop laziness
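A sketch of the lazy end of this spectrum, assuming a hypothetical
specialize_region() entry point into the run-time specializer:

    /* Demand-driven specialization: the first call goes through a stub
       that runs the specializer and patches the dispatch pointer, so
       code on paths that never execute is never generated. */
    typedef void (*entry_fn)(void *state);

    extern entry_fn specialize_region(void *state);  /* hypothetical */

    static void lazy_stub(void *state);
    static entry_fn region_entry = lazy_stub;   /* patched after first call */

    static void lazy_stub(void *state)
    {
        region_entry = specialize_region(state);   /* specialize on demand */
        region_entry(state);                       /* then run the result  */
    }

    void run_region(void *state) { region_entry(state); }

The eager alternative would instead call the specializer for all
successors ahead of time, trading possible specialization of
never-executed code for the elimination of the indirection and the
one-time stub.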
1.2.5
Conditional Specialization
How can conditional specialization be exploited to minimize
specialization overhead? Are division-querying predicates (à la
filters: is_static(v) and is_dynamic(v)) useful?
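One way to picture conditional specialization is as an ordinary
run-time test that decides whether the annotations take effect at all.
In the sketch below, make_static() is only a placeholder for a
DyC-style annotation and the size threshold is an invented heuristic;
a division-querying predicate such as is_static(ksize) would
additionally let later code ask which case it is executing in.

    /* Conditional specialization: specialize only when the kernel is
       small enough that unrolling the inner loop is likely to pay off. */
    #define make_static(v)   /* annotation placeholder */

    void apply_filter(const double *in, double *out, int n,
                      const double *kernel, int ksize)
    {
        if (ksize <= 7) {            /* small, likely-reused kernel */
            make_static(ksize);      /* unroll the inner loop       */
            make_static(kernel);     /* embed kernel coefficients   */
        }
        for (int i = 0; i + ksize <= n; i++) {
            double acc = 0.0;
            for (int j = 0; j < ksize; j++)
                acc += in[i + j] * kernel[j];
            out[i] = acc;
        }
    }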
1.3
Effective Run-Time Specialization
1.3.1
Capabilities of the specializer
The capabilities of the specializer determine whether it can
specialize the application to the degree necessary to attain a
speedup. With the addition of run-time inlining, our system
will have more capabilities than any other specializer we know of. We
need only factor out the BTA by using annotations that can compensate
for any shortcomings in its analyses (e.g., we
should support promote_one_unchecked_eager).
We should determine which capabilities yield the most benefit for
which programs (e.g., multi-way loop unrolling and run-time
inlining), and which are largely unnecessary (e.g.,
specialize_lazy, promote_one). We should also emphasize
which capabilities are necessary to specialize each application at all
(e.g., support for unstructured code).
When is data specialization more or less useful than full dynamic
specialization?
1.3.2
Cache effects
Cache effects could have a significant influence on performance.
This issue can also be studied from the architectural perspective.
- "Lazy" instruction-cache updates (e.g.,
overwriting indirect jumps with direct jumps)
- Minimizing instruction-cache flushing
- Optimizing code layout
- of basic blocks (cf. Pettis & Hansen)
- of procedures (cf. Hashemi et al.)
- of just dynamic code, or across both static and
dynamic code (cf. Chen & Leupen)
- Optimizing data layout (i.e., layout of our
static-value tables)
- Prefetching (from main memory) cached dynamically generated code
1.3.3
VM/OS effects
VM/OS effects could also have an impact on performance.
- How do dynamic code generation and self-modifying code
impact the system's security model (which pages can be only
read vs. executed, etc.)?
- Can the OS security model be changed to avoid zeroing of
pages of dynamically generated instructions? Applications
need to guarantee to the system (or the system needs to
ensure) that such pages won't be read before they are written.
- Interaction of dynamic code generation with VM
organization and related OS assumptions
- Should dynamically generated code be backed in the
swap space? Statically generated code is not.
Backing dynamically generated code in the swap space
could quickly use up that precious resource if many
processes employ dynamic code generation.
- How could dynamically generated code be shared
across multiple processes? For example, we might like
all instances of csh to be able to share the
same cache of specialized code.
- Prefetching (from disk) cached dynamically generated code
1.3.4
Other optimizations
Since peephole optimizations are the only possible back-end
optimizations that can be performed at run time in this framework, we
should probably look at how applicable and effective they are. This
is also of interest when determining the value of run-time global
back-end optimizations.
Rematerialization may be important for specializing arrays, and
its effect cannot be achieved through additional annotations.
Are there other static optimizations that should be considered
here?
1.4
Unstructured Code
We have made the decision to fully support arbitrary unstructured
code. Is this really necessary for performance of the specializer or
for performance of the specialized code? Are real programs
unstructured? This may be of interest in the context of automation as
well.
- What kinds of unstructured control flow are present in real programs?
- Compare Tempo's approach (Erosa and Hendren; match
branches with merges) with ours
- Reachability conditions
- How many more discordant merges arise without reachability conditions?
- Static merges (all predecessors have mutually
exclusive static reachability conditions) vs. binary
tests of exclusivity (pairs of predecessors have
mutually exclusive static reachability conditions)
- For dynamic branches, too? (for eliminating
laziness)
- Linearized units with and without code duplication
2
Global (Inter-Unit) Run-Time Optimizations
(?)
One of our long-term goals could be to have a full compiler
infrastructure whose pieces could be selectively invoked at run time.
We need to determine which optimizations would be profitable at run
time in which contexts. Interpreters are, perhaps, easy targets
because specializing an interpreter should produce code similar to
unoptimized code generated by a simple compiler.
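The interpreter case can be made concrete with the usual toy example;
make_static() below is only a placeholder annotation and the bytecode
format is invented for the sketch. Specializing interp() on prog
unrolls the dispatch loop and folds the switch, leaving roughly the
straight-line code a simple compiler would emit, but accesses to the
operand stack and to args remain memory operations until optimizations
such as the scalar promotion discussed below are applied.

    #define make_static(v)   /* annotation placeholder */

    enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

    int interp(const int *prog, const int *args)
    {
        make_static(prog);            /* specialize on the program       */
        int stack[64], sp = 0, pc = 0;
        for (;;) {
            switch (prog[pc++]) {     /* static: the dispatch folds away */
            case OP_PUSH: stack[sp++] = args[prog[pc++]]; break;
            case OP_ADD:  sp--; stack[sp - 1] += stack[sp]; break;
            case OP_MUL:  sp--; stack[sp - 1] *= stack[sp]; break;
            case OP_HALT: return stack[sp - 1];
            }
        }
    }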
- Global, inter-unit run-time optimizations should
provide big performance gains for interpreters which
access data structures representing local variables or
fixed stack locations at addresses known (only) at run
time
- Scalar promotion of static members of (partially) static
data structures
- Optimizations on temps were already performed at
static compile time.
- Global, inter-unit run-time optimizations will be
performed on scalar-promoted static memory locations.
- Optimizations on variables stored at static memory
locations can happen only at specialization time
because before that, a given static memory reference
may refer to many different static memory locations,
depending on the context.
- Accesses to values at static (run-time-known)
memory locations can be treated as accesses to temps
(scalars) for the purposes of the optimizations; this
is called scalar promotion (see the sketch following this list).
- Run-time intermediate representation
- Global optimizations will require some type of
IR
- How should static variables, static temps, and
scalar-promoted memory locations be represented?
- How do the optimizations we want to perform affect
the IR?
- How can the IR support fast code generation?
- Should we use an existing IR such as ICODE or the
JavaVM? If we develop a new IR, which existing IRs
should we compare against?
- System architecture
- It will probably not be feasible to inline the
entire dynamic compiler into the setup code. What
parts could/should be inlined via the merger? For
example, a merged version could generate IR rather
than code. Does it make sense to merge at all?
We would like to avoid the intermediate values table
of the first implementation.
- What part of the IR can/should be statically
generated and what part must be constructed
dynamically? How can the IR be quickly generated from
static information?
- Is there any information we can compute statically
about locations that will be dynamically scalar
promoted? What about alias analysis? Side effect
analysis?
- How much analysis must/should we do at run time?
- How do we perform fast code generation? If we
used an existing IR, we could use an existing code
generator, or at least compare our code generator with
an existing one.
- What is the difference in cost between the most
optimized specialized specializer (coscheduled setup
and template code) and an IR-based dynamic compiler
performing no global optimizations?
- Standard scalar data-flow optimizations
- Constant propagation and folding
- Copy propagation
- CSE
- Loop-invariant code motion
- PRE
- ...
- Control: annotations?
- Register allocation of scalar-promoted memory locations
(overlaps with architecture-dependent optimizations, below)
- Instruction scheduling across compile-time boundaries
(overlaps with architecture-dependent optimizations, below)
- How can we support dynamic linking?
- How is this different from just-in-time compilation (JIT)?
- How is this different from running Multiflow at run
time?
- What pieces of the static compilation process do we
definitely not need (or want) to run at run time? What makes
us think we can perform compilation more cheaply at run time?
- We don't need to run the front end: scanning,
parsing, type inference (though we may be able to
remove some type checks at run time), etc.
- We can't further optimize operations on dynamic
temps or dynamic memory locations, so we can ignore
them during optimizations like constant and copy
propagation and CSE, which only affect operations on
scalar-promoted memory locations.
- Does it make sense to run a BTA on scalar-promoted
memory locations?
- Probably it won't be feasible to perform
interprocedural analysis at run time.
- Even though you might be willing to do it
statically, at run time it doesn't make sense to
perform dozens of optimizations, each of which has
only marginal benefit.
- We should be able to guess statically which
optimizations would be profitable to perform at run
time, and then only perform those potentially
profitable optimizations, rather than the entire suite
of available optimizations. It is likely that most
optimizations will only have limited applicability at
run time (e.g., run-time alias analysis).
- The code generator can be statically specialized
for the architecture and, possibly, for the program.
- Dynamic linking is inexpensive because the number
of jump destinations and unresolved symbols is
relatively small. It is also incremental.
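To make the scalar-promotion idea above concrete (the register file,
the register indices, and both routines are purely illustrative):

    /* Before specialization: every access to the interpreter's virtual
       register file goes through memory. */
    void step(int *regs, int dst, int src)
    {
        regs[dst] = regs[dst] + regs[src];
        regs[dst] = regs[dst] * 2;
    }

    /* Conceptually, after specialization with dst and src static and
       the address of regs known at run time, the location regs[dst] is
       scalar-promoted: it lives in a temp, is optimized like one, and
       is written back once. */
    void step_promoted(int *regs)
    {
        int t = regs[3];     /* promoted load of regs[dst], dst == 3 */
        t = t + regs[5];     /* regs[src], src == 5                  */
        t = t * 2;           /* optimized as an ordinary temp        */
        regs[3] = t;         /* single write-back                    */
    }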
3
The Run-Time Back End and Architectural Issues
(Matthai)
- Architecture-dependent optimizations
- Run-time register allocation
- Register reservation strategy: some
registers may be allocated statically,
and some dynamically. What is a good
scheme for deciding how many to reserve
for run-time allocation in each part of the
CFG? Or, should registers be entirely
reallocated at run time (for both static temps
and dynamic scalar-promoted memory locations)?
- Register allocation of scalar-promoted
memory locations: a simple allocation scheme
could attempt to allocate a register to every
referenced static
memory location. If, however, many locations
are referenced, this could introduce a large
number of scalar-promoted locations for
consideration; a faster scheme would be to
consider only memory locations known (by static or
dynamic measures, preferably the latter) to be
heavily accessed.
- Static planning plus local dynamic
decisions: register actions
- Simple global dynamic scheme:
e.g., tcc's approach
- More complex global dynamic allocation
- Speculative static solution?
- Control: annotations?
- Run-time instruction scheduling
- Does software scheduling matter, given modern
dynamically scheduled processors?
(different answers from different camps)
- What are the main components of the run-time
scheduler for a modern processor? Can any
of the main components be optimized by
using static information? For example, can
inter-iteration static dependence
information be used to reduce the overhead
of computing dependence info at run time?
- Local/peephole/moving loads over stores
(within basic blocks and within units)
- Fully global scheduling
- Control: annotations?
- Speculative, static approaches
- Run-time disambiguation (Nicolau)
- Problem of code explosion
- Interaction of DC with existing architectures
- How does code explosion due to specialization
stack up against limited icache bandwidth?
(Interestingly, specializing interpreter programs
will probably yield programs with acceptable
icache behavior, since programs have good
locality; on the other hand, a specialized
sparse matrix-multiply may not fare as well.
Basically, if the specialized code has many
frequently executed loops in it, it should
perform well; otherwise, not. Another way of
saying this is that when taking
resource bounds into account while specializing,
it is not the total amount of code the
specializer produces that should be limited, but
rather the size of the largest working set produced.)
- How effective is DC in "compiling away" loads
and branches, for different classes of apps?
[Bill Joy, at the Hot Chips 96 keynote speech,
seemed to think that DC should get rid of all
"easily predictable" branches and loads]
- What types of instructions are eliminated by DC
and what types should be eliminated?
- Effects of number of registers and register
renaming on register-allocation strategies
- Effects of multiple issue and dynamic scheduling
on instruction-scheduling strategies
- Opportunities in parallel systems (e.g.,
ZPL)
- "Nouvelle" architectural features to support DC
- Dynamic compilation on SMT: Perform dynamic
compilation in parallel with program
execution. If done right, this should guarantee
that DC _never_ produces a slowdown. Since
communication between compiling and computing
threads is very fast, this could allow
fine-grained dynamic (re-)compilation. This
could be a way in which an SMT processor improves
performance of sequential code.
- Value locality, value prediction (asplos96), and dynamic
instruction reuse (isca97): if cache-lookups become a
significant part of our overhead (automating
detection of possible rtconsts, as below, could
exacerbate this effect), modifications of these
proposed HW structures could be a win.
- The HW community has shown a lot of interest
recently in incorporating feedback features in
HW [refs, e.g., DEC]. What is the right interface
for a dynamic compiler (or even a static
compiler!) to use this feedback? e.g. "hot-path"
profiling [Ball and Larus, PLDI97] could trigger
re-scheduling if the trace paths change (i.e. the
identity of frequently taken edges changes) a la
the Dynamo project at HP Labs; and re-allocation
of static memory locations to temps (if the
identity of frequently used memory locations
changes).
4
Automating Dynamic Compilation
(?)
This is clearly important, as it is our overall long-term goal for the
project.
- Make analysis sufficiently aggressive to not require many
annotations
- Interaction of compiler and BTA
- Naming problem
- Annotations interfering with other
optimizations
- Reassociate expressions to create more
static values
- Alias analysis
- Points-to analysis
- Shape analysis?
- Safely and efficiently eliminating
annotation-specified laziness (e.g.,
Ball and Larus's static branch prediction)
- Interprocedural analyses
- Ensure safety of all transformations
- Using laziness to ensure termination
- Reachability analysis for dynamic branches
- Handle recursion as well as loops
- Determining when it is safe to reduce calls
- Making static loads safe
- Using alias/shape information
- Analysis of conditional expressions and
theorem proving?
- Pessimistic approach: laziness
- Optimistic approach: catch exceptions and
roll back
- Estimate costs and benefits of dynamic compilation
- Off-line selection of dynamic regions, specialization
variables, and optimizations to perform
- On-line decisions to dynamically compile (which regions
of code for which variables using what optimizations)
- Using run-time instrumentation information in performing
optimizations (e.g., specialize commonly taken
paths)
- Tool-assisted (interactive?) compared with fully
automatic approaches (-O9)
5
Recurring Themes
These are issues that cut across all research topics.
5.1
Applications
- We have an urgent need to find applications that benefit
from DC (VSO)
- What types of applications benefit most from DC (VSO) and why?
Check out Joel's list of
potential applications.
- When is our technique more appropriate than, say,
just-in-time compiling?
- Are operating systems the most likely clients of DC?
- Specialize hash functions
- Specialize argument marshaling and unmarshaling
- Compose protocol-processing routines
- Compile database queries
- Compile packet filters
- Compile user-installed kernel extensions (spindles)
- Graphics workloads?
- Databases?
- Simulators?
- Interpreters: we can exploit conditional specialization;
how does DC compare to JIT?
- Parallel codes, especially SPMD ones, are written in a
parameterized style and should greatly benefit from dynamic
compilation; the ZPL gang is interested in this and we could
perhaps run on a T3D with little effort (knock on wood)
- Whether dynamic compilation is generally applicable or
not is a big question.
- Is dynamic compilation in the form of value-specific
optimization important as a compiler technique that provides an
additional few percent improvement in performance? Or, is dynamic
compilation more important as an enabler of more powerful language or
system features?
- Varargs
- Generic functions
- Polymorphism
- Dynamic dispatch
- Higher-order functions / lambda expressions
- Function currying
- Continuations
- Dynamic linking
- Machine independence
5.2
Run-Time Optimizations
- How much benefit do various optimizations provide?
- Specialization (e.g., loop unrolling, inlining)
- Peephole optimizations (e.g., strength
reduction)
- Standard data-flow optimizations (e.g.,
constant propagation and folding, CSE)
- Back-end optimizations (e.g., register
allocation, scheduling)
- How can the work (especially analyses) be divided into
static and dynamic components?
- Does dynamic compilation suggest new types of optimizations?
- What else can we do other than value-based optimization?
- What run-time attributes of variables can we
exploit other than just values? Types? What else?
- Dynamic-dispatch elimination (kind of VSO)
- Cache-specific optimization?
- Architecture-specific optimization? (like 486 v. Pentium)
- Using run-time instrumentation information (e.g.,
specialize commonly taken paths)
- What else?
5.3
Run-Time Intermediate Representations
- What information does each optimization require?
- Which representations facilitate which optimizations and how?
- How expensive is it to generate code from different program
representations?
- How expensive is it to regenerate one representation from
another? For example, what does SPIN currently do?
- How can multiple representations be used?
- Do alternative (other than CFG) full-blown program
representations have anything to offer?
6
Generalized Partial Computation
Would we like a more general system of dynamic assertions and
invariants, combined with analysis of relational expressions and
theorem proving? This would result in a system that could perform
generalized partial computation, a generalization of partial
evaluation. Positive context propagation (propagating the information
that certain conditions hold) might subsume our current reachability
analysis.
7
Multi-Stage Programming
Our system creates a two-stage run-time system. Our generating
extensions are all statically generated. We could create a more
general multi-stage system that dynamically generates generating
extensions. This may not be worthwhile if programs aren't truly
multi-staged, since it only affects the performance of the generating
extensions and not of the code they generate.
8
Imperative Dynamic Compilation
Should we have some kind of imperative mechanism for run-time code
generation (à la `C or MetaML), given that we can simulate imperative
RTCG? `C combines imperative and declarative approaches by
performing specialization within tick expressions. MetaML also has an
interesting blend of operational and transformational approaches to
run-time code generation.
An imperative mechanism might appear to be incompatible with
automating dynamic compilation. However, MetaML is a possible
annotation language for a multi-stage BTA.