Research Agenda
This is a list of possible research directions we would like to
pursue using our new system, code-named DyC.
1
Fast, Effective Run-Time Specialization in C
(Brian)
This topic is titled Fast, Effective Run-Time Specialization
in C because it covers only the run-time specializer and
specialization, not other optimizations, such as run-time register
allocation, that we might include in a full-blown dynamic compiler.
This topic considers two questions:
- What optimizations are most important to the performance of the
run-time specializer (sans a back-end global optimizer)?
- What capabilities and optimizations are most important to the
performance of the target applications? There are three primary
determinants of application performance:
- Capabilities of the specializer
- Cache effects
- Code quality
This topic definitely includes the first, and possibly the second.
The third, however, will be considered with architectural issues.
Most likely, these issues will have to be studied in the context of
multiple architectures, but architecture won't be the focus of
the study. The reason for including multiple architectures in the
study is to generalize the conclusions beyond a single platform.
2/8/98: I'd like to avoid addressing architectural issues as
much as possible. A recent summary of my plans can be found in this e-mail message.
1.1
Designing a State-of-the-Art Specializer
1.1.1
Inlining
To make our run-time specializer state-of-the-art, we need to add
interprocedural specialization, and to support both compile-time and
run-time inlining. Compile-time inlining should probably be
implemented in the context of the compile-time specializer, which will
be needed for the architectural studies (below). Several issues need
to be addressed to support run-time inlining:
- Additional annotations needed
- Representation: what information is necessary or useful?
- Implementation
- Static preparations (e.g., register allocation
across call boundaries)
- What optimizations should be performed after inlining (not
including inter-unit optimizations)?
- Inlining through static function pointers (see the sketch below)
- Termination/cost management
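To make the function-pointer case concrete, the following sketch (in
annotated C) shows the kind of call site run-time inlining could
target. The make_static() macro is only a placeholder for a DyC-style
annotation (defined as a no-op so the fragment compiles as plain C),
and the sorting routine is purely illustrative.

    /* Inlining through a static function pointer: cmp is a run-time
       constant at this call site, so the specializer could turn the
       indirect call into a direct one and, with run-time inlining,
       inline the comparator into the specialized loop. */
    #define make_static(v)   /* annotation placeholder, not real syntax */

    typedef int (*cmp_fn)(const void *, const void *);

    void insertion_sort(void **items, int n, cmp_fn cmp)
    {
        make_static(cmp);                /* cmp is invariant here */
        for (int i = 1; i < n; i++) {
            void *key = items[i];
            int j = i - 1;
            /* This indirect call becomes a direct, inlinable call in
               the specialized code. */
            while (j >= 0 && cmp(items[j], key) > 0) {
                items[j + 1] = items[j];
                j--;
            }
            items[j + 1] = key;
        }
    }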
1.1.2
Data Specialization
We will already be performing a limited type of data
specialization for pointer and floating-point values. That is, we
will stage computations based on those types, but not embed their
values in code. It would be interesting to extend this capability to
arbitrary types of variables by adding a new policy, nogen,
that would indicate that computations based on a variable should be
staged and cached, but not embedded in dynamically generated code.
The policy corresponding to what we currently do would be
cogen.
The nogen policy would result in a different type of staging
from what we have implemented. If specialization under the
cogen policy has been indicated for other variables, both
early and late data-specialized stages would be dynamically generated.
Although this new type of staging introduces a third subgraph that
we must create, it wouldn't substantially increase the complexity of
our system because it doesn't introduce any new dynamic code
generation. We can think of it as further subdividing our dynamic
(template) subgraph.
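As a rough illustration of how the two policies might be written,
consider the sketch below; the annotation syntax is invented for the
example (make_static() is a no-op macro here), and only the policy
names come from the proposal above.

    /* Proposed data-specialization policies: values specialized under
       cogen are embedded in dynamically generated code, while values
       specialized under nogen are staged and cached but never burned
       into instructions. */
    #define make_static(v, policy)   /* annotation placeholder */

    double poly_eval(const double *coeffs, int ncoeffs, double x)
    {
        make_static(ncoeffs, cogen);   /* embedded: drives loop unrolling */
        make_static(coeffs, nogen);    /* staged and cached, not embedded */
        double sum = 0.0;
        for (int i = 0; i < ncoeffs; i++)   /* unrolled under cogen       */
            sum = sum * x + coeffs[i];      /* coefficient read from the  */
                                            /* table of cached values     */
        return sum;
    }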
1.2
Fast Run-Time Specialization
1.2.1
The specialized specializer
We should compare the unit-based specialized specializer approach
to a unit-based stitcher (a rough sketch of the stitcher appears at
the end of this subsection). In addition, we should determine whether
the more complicated unit optimizations are necessary. Note that both
the stitcher and the specialized specializer permit only peephole
optimizations at run time. The various types of units include (in
order of increasing size of units and complexity of the
unit-assignment algorithm):
- True basic blocks (instructions between branches
and merges; linear unit identification)
- Disjoint, single-entry subgraphs containing no dynamic
branches (quadratic unit identification)
- Disjoint, single-entry, linearized subgraphs with no
duplication (unsure how to identify such units)
- Disjoint, single-entry, linearized subgraphs with
duplication (linearization via splitting and merging is
exponential)
In addition, there are three other optimizations we can perform to
increase the size of units:
- Single-run-time-predecessor optimizations (possibly
creating units that are neither disjoint nor single-entry)
- Boundary clustering
- Matthai's optimal linear-time algorithm for
control-equivalent ranges
- Algorithm permitting movement below branches and
above merges
- Clustering different types of boundaries
- Determine need for clustering
- Without alias analysis
- With local alias analysis
- With interprocedural alias analysis
Hence, alias analysis is of some interest (see
automation, below).
- Deciding whether to push laziness of discordant merges
up to dynamic branches or not (one-time lazy boundary plus an
eager cache lookup vs. a lazy cache lookup)
Beyond increasing unit size, other optimizations can be performed
to increase the performance of the specialized specializer:
- Perform successor analysis to determine which specializer
units will be executed next, hopefully allowing us to optimize
copying of static state
- Reduce scheduling constraints within units
- Or, more generally, coschedule the setup and template code
(using separate interference graphs and resource
allocation tables for the two kinds of code); Matthai has some
interest in this as well
- Schedule the specialized specializer after merging
- Reuse setup computations somehow when respecializing where
some values have changed but some have remained invariant
(data specialization?)
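As a very rough sketch of the unit-based stitcher mentioned at the
start of this subsection (the structures and the single 32-bit "hole"
per unit are assumptions made for the example, not the actual design):

    /* Each unit is a pre-compiled machine-code template with a hole to
       be patched with a run-time constant; the stitcher copies units
       into the code buffer and fills in the holes. */
    #include <stdint.h>
    #include <string.h>

    typedef struct {
        const uint8_t *bytes;  /* template code for the unit            */
        size_t         size;   /* template size in bytes                */
        size_t         hole;   /* offset of a 32-bit immediate to patch */
    } unit_template;

    /* Copy one unit into the buffer, patching its hole with the current
       value of the run-time constant; returns the new end of buffer. */
    static uint8_t *stitch_unit(uint8_t *buf, const unit_template *u,
                                int32_t rtconst)
    {
        memcpy(buf, u->bytes, u->size);
        memcpy(buf + u->hole, &rtconst, sizeof rtconst);
        return buf + u->size;
    }

A specialized specializer would instead compile this copy-and-patch
work into custom code for each unit; the cost difference between the
two approaches is what the comparison above would measure.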
1.2.2
Fast memory allocation and reclamation
In addition to hyper-optimizing the specialized specializer, we
should give a fair amount of thought to fast memory allocation,
deallocation, and reallocation. Matthai found that a large part of
the overhead of the previous dynamic compiler was due to calls to
malloc and free.
A few of the issues include:
- Minimizing the number of sufficient-space checks and other
overhead
- Differentiating fixed- and variable-sized allocations and
differentiating by expected lifetime
- Temporary specializer data structures
- Tables of long-lived static values
- Long-lived (but reclaimable) code
- Permanent tables and code
- Making the system thread-safe?
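A minimal sketch of the kind of arena-style allocator that could back
the short-lived specializer data structures listed above (all names
and the alignment policy are assumptions):

    #include <stdlib.h>

    /* One sufficient-space check per allocation; the whole arena is
       reclaimed at once when the temporary data is no longer needed. */
    typedef struct {
        char *base, *next, *limit;
    } arena;

    static int arena_init(arena *a, size_t size)
    {
        a->base = malloc(size);
        if (!a->base) return -1;
        a->next = a->base;
        a->limit = a->base + size;
        return 0;
    }

    static void *arena_alloc(arena *a, size_t n)
    {
        n = (n + 7) & ~(size_t)7;                 /* keep allocations aligned */
        if (a->next + n > a->limit) return NULL;  /* single space check       */
        void *p = a->next;
        a->next += n;
        return p;
    }

    static void arena_reset(arena *a) { a->next = a->base; }  /* bulk free */
    static void arena_free(arena *a)  { free(a->base); }

Separate arenas (or free lists) could then be dedicated to the
different lifetime classes above, so that long-lived tables and code
are not reclaimed along with temporary specializer state.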
1.2.3
Caching schemes
Finally, an efficient implementation of the cache of specialized code
is essential to a fast run-time specializer. We first need to
complete the design of the cache_one implementation. In
addition to flushing the appropriate cache entries at the right
times, there are a few other tricky issues:
- We must keep track of direct branches to replaceable code
so that we can repatch them
- We should keep track of whether specialized
cache_all_unchecked and cache_one_unchecked
code is still reachable if it is immediately preceded by only
cache_one units
Next, we should investigate a couple of hybrid hierarchical/flat
caching schemes to reduce the cost of hashing on large contexts.
It might be interesting to think about an invalidation-based
scheme in addition to cache lookups for determining whether a
specialization can be used, especially for static data structures.
Finally, we should explore cache policies beyond the simple
ones we have included in our initial design:
- cache(n) and replacement policies
- Flexible cache interface for conditional
specialization and conditional caching
- Global, rather than local, cache policies
(i.e., we would like to free memory for
specializations of functions that won't be
called again)
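For concreteness, here is a sketch of a flat, direct-mapped cache of
specialized code, keyed on the run-time values of the variables in the
specialization context. Every name and the fixed context size are
assumptions for the example; a real implementation would also need the
replacement, invalidation, and branch-repatching machinery discussed
above.

    #include <stdint.h>
    #include <string.h>

    #define CACHE_SLOTS   256
    #define CONTEXT_WORDS 2    /* number of static values in the context */

    typedef void (*spec_code)(void);

    typedef struct {
        uintptr_t key[CONTEXT_WORDS];
        spec_code code;              /* NULL if the slot is empty */
    } cache_entry;

    static cache_entry cache[CACHE_SLOTS];

    static unsigned hash_context(const uintptr_t *ctx)
    {
        uintptr_t h = 0;
        for (int i = 0; i < CONTEXT_WORDS; i++)
            h = h * 31 + ctx[i];
        return (unsigned)(h % CACHE_SLOTS);
    }

    /* Return cached code for this context, or NULL so the caller can
       invoke the run-time specializer and install the result. */
    static spec_code cache_lookup(const uintptr_t *ctx)
    {
        cache_entry *e = &cache[hash_context(ctx)];
        if (e->code && memcmp(e->key, ctx, sizeof e->key) == 0)
            return e->code;
        return NULL;
    }

    static void cache_install(const uintptr_t *ctx, spec_code code)
    {
        cache_entry *e = &cache[hash_context(ctx)];  /* direct-mapped: evict */
        memcpy(e->key, ctx, sizeof e->key);
        e->code = code;
    }

A hierarchical scheme would instead look up one context variable per
level; a hybrid scheme might, for example, hash the large,
rarely-changing part of the context once and only the small,
frequently-changing part on each lookup.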
1.2.4
Demand-Driven Specialization
It would be enlightening to compare eager (speculative) and lazy
(demand-driven) specialization, in terms of utility and cost. Some of
the relevant issues include:
- Difference in cost between eager and lazy specialization
- Cost of eagerly specializing unused code
- Using laziness to guard unsafe (trapping or
non-idempotent) operations
- How often is laziness required?
- Techniques for eliminating loop laziness
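A sketch of the lazy end of this spectrum, assuming a hypothetical
specialize_region() entry point into the run-time specializer:

    /* Demand-driven specialization: the first call goes through a stub
       that runs the specializer and patches the dispatch pointer, so
       code on paths that never execute is never generated. */
    typedef void (*entry_fn)(void *state);

    extern entry_fn specialize_region(void *state);  /* hypothetical */

    static void lazy_stub(void *state);
    static entry_fn region_entry = lazy_stub;   /* patched after first call */

    static void lazy_stub(void *state)
    {
        region_entry = specialize_region(state);   /* specialize on demand */
        region_entry(state);                       /* then run the result  */
    }

    void run_region(void *state) { region_entry(state); }

The eager alternative would instead call the specializer for all
successors ahead of time, trading possible specialization of
never-executed code for the elimination of the indirection and the
one-time stub.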
1.2.5
Conditional Specialization
How can conditional specialization be exploited to minimize
specialization overhead? Are division-querying predicates (à la
filters: is_static(v) and is_dynamic(v)) useful?
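One way to picture conditional specialization is as an ordinary
run-time test that decides whether the annotations take effect at all.
In the sketch below, make_static() is only a placeholder for a
DyC-style annotation and the size threshold is an invented heuristic;
a division-querying predicate such as is_static(ksize) would
additionally let later code ask which case it is executing in.

    /* Conditional specialization: specialize only when the kernel is
       small enough that unrolling the inner loop is likely to pay off. */
    #define make_static(v)   /* annotation placeholder */

    void apply_filter(const double *in, double *out, int n,
                      const double *kernel, int ksize)
    {
        if (ksize <= 7) {            /* small, likely-reused kernel */
            make_static(ksize);      /* unroll the inner loop       */
            make_static(kernel);     /* embed kernel coefficients   */
        }
        for (int i = 0; i + ksize <= n; i++) {
            double acc = 0.0;
            for (int j = 0; j < ksize; j++)
                acc += in[i + j] * kernel[j];
            out[i] = acc;
        }
    }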
1.3
Effective Run-Time Specialization
1.3.1
Capabilities of the specializer
The capabilities of the specializer determine whether it can
specialize the application to the degree necessary to attain a
speedup. With the addition of run-time inlining, our system
will have more capabilities than any other specializer we know of. We
need only factor out the BTA by using annotations that can compensate
for any shortcomings in its analyses (e.g., we
should support promote_one_unchecked_eager).
We should determine which capabilities yield the most benefit for
which programs (e.g., multi-way loop unrolling and run-time
inlining), and which are largely unnecessary (e.g.,
specialize_lazy, promote_one). We should also emphasize
which capabilities are necessary to specialize each application at all
(e.g., support for unstructured code).
When is data specialization more or less useful than full dynamic
specialization?
1.3.2
Cache effects
Cache effects could have a significant influence on performance.
This issue can also be studied from the architectural perspective.
- "Lazy" instruction-cache updates (e.g.,
overwriting indirect jumps with direct jumps)
- Minimizing instruction-cache flushing
- Optimizing code layout
- of basic blocks (cf. Pettis & Hansen)
- of procedures (cf. Hashemi et al.)
- of just dynamic code, or across both static and
dynamic code (cf. Chen & Leupen)
- Optimizing data layout (i.e., layout of our
static-value tables)
- Prefetching (from main memory) cached dynamically generated code
1.3.3
VM/OS effects
VM/OS effects could also have an impact on performance.
- How do dynamic code generation and self-modifying code
impact the system's security model (which pages can be only
read vs. executed, etc.)?
- Can the OS security model be changed to avoid zeroing of
pages of dynamically generated instructions? Applications
need to guarantee to the system (or the system needs to
ensure) that such pages won't be read before they are written.
- Interaction of dynamic code generation with VM
organization and related OS assumptions
- Should dynamically generated code be backed in the
swap space? Statically generated code is not.
Backing dynamically generated code in the swap space
could quickly use up that precious resource if many
processes employ dynamic code generation.
- How could dynamically generated code be shared
across multiple processes? For example, we might like
all instances of csh to be able to share the
same cache of specialized code.
- Prefetching (from disk) cached dynamically generated code
1.3.4
Other optimizations
Since peephole optimizations are the only possible back-end
optimizations that can be performed at run time in this framework, we
should probably look at how applicable and effective they are. This
is also of interest when determining the value of run-time global
back-end optimizations.
Rematerialization may be important for specializing arrays, and
its effect cannot be achieved through additional annotations.
Are there other static optimizations that should be considered
here?
1.4
Unstructured Code
We have made the decision to fully support arbitrary unstructured
code. Is this really necessary for performance of the specializer or
for performance of the specialized code? Are real programs
unstructured? This may be of interest in the context of automation as
well.
- What kinds of unstructured control flow are present in real programs?
- Compare Tempo's approach (Erosa and Hendren; match
branches with merges) with ours
- Reachability conditions
- How many more discordant merges arise without reachability conditions?
- Static merges (all predecessors have mutually
exclusive static reachability conditions) vs. binary
tests of exclusivity (pairs of predecessors have
mutually exclusive static reachability conditions)
- For dynamic branches, too? (for eliminating
laziness)
- Linearized units with and without code duplication
2
Global (Inter-Unit) Run-Time Optimizations
(?)
One of our long-term goals could be to have a full compiler
infrastructure whose pieces could be selectively invoked at run time.
We need to determine which optimizations would be profitable at run
time in which contexts. Interpreters are, perhaps, easy targets
because specializing an interpreter should produce code similar to
unoptimized code generated by a simple compiler.
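The interpreter case can be made concrete with the usual toy example;
make_static() below is only a placeholder annotation and the bytecode
format is invented for the sketch. Specializing interp() on prog
unrolls the dispatch loop and folds the switch, leaving roughly the
straight-line code a simple compiler would emit, but accesses to the
operand stack and to args remain memory operations until optimizations
such as the scalar promotion discussed below are applied.

    #define make_static(v)   /* annotation placeholder */

    enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

    int interp(const int *prog, const int *args)
    {
        make_static(prog);            /* specialize on the program       */
        int stack[64], sp = 0, pc = 0;
        for (;;) {
            switch (prog[pc++]) {     /* static: the dispatch folds away */
            case OP_PUSH: stack[sp++] = args[prog[pc++]]; break;
            case OP_ADD:  sp--; stack[sp - 1] += stack[sp]; break;
            case OP_MUL:  sp--; stack[sp - 1] *= stack[sp]; break;
            case OP_HALT: return stack[sp - 1];
            }
        }
    }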
- Global, inter-unit run-time optimizations should
provide big performance gains for interpreters which
access data structures representing local variables or
fixed stack locations at addresses known (only) at run
time
- Scalar promotion of static members of (partially) static
data structures
- Optimizations on temps were already performed at
static compile time.
- Global, inter-unit run-time optimizations will be
performed on scalar-promoted static memory locations.
- Optimizations on variables stored at static memory
locations can happen only at specialization time
because before that, a given static memory reference
may refer to many different static memory locations,
depending on the context.
- Accesses to values at static (run-time-known)
memory locations can be treated as accesses to temps
(scalars) for the purposes of the optimizations; this
is called scalar promotion (see the sketch following this list).
- Run-time intermediate representation
- Global optimizations will require some type of
IR
- How should static variables, static temps, and
scalar-promoted memory locations be represented?
- How do the optimizations we want to perform affect
the IR?
- How can the IR support fast code generation?
- Should we use an existing IR such as ICODE or the
JavaVM? If we develop a new IR, which existing IRs
should we compare against?
- System architecture
- It will probably not be feasible to inline the
entire dynamic compiler into the setup code. What
parts could/should be inlined via the merger? For
example, a merged version could generate IR rather
than code. Does it make sense to merge at all?
We would like to avoid the intermediate values table
of the first implementation.
- What part of the IR can/should be statically
generated and what part must be constructed
dynamically? How can the IR be quickly generated from
static information?
- Is there any information we can compute statically
about locations that will be dynamically scalar
promoted? What about alias analysis? Side effect
analysis?
- How much analysis must/should we do at run time?
- How do we perform fast code generation? If we
used an existing IR, we could use an existing code
generator, or at least compare our code generator with
an existing one.
- What is the difference in cost between the most
optimized specialized specializer (coscheduled setup
and template code) and an IR-based dynamic compiler
performing no global optimizations?
- Standard scalar data-flow optimizations
- Constant propagation and folding
- Copy propagation
- CSE
- Loop-invariant code motion
- PRE
- ...
- Control: annotations?
- Register allocation of scalar-promoted memory locations
(overlaps with architecture-dependent optimizations, below)
- Instruction scheduling across compile-time boundaries
(overlaps with architecture-dependent optimizations, below)
- How can we support dynamic linking?
- How is this different from just-in-time compilation (JIT)?
- How is this different from running Multiflow at run
time?
- What pieces of the static compilation process do we
definitely not need (or want) to run at run time? What makes
us think we can perform compilation more cheaply at run time?
- We don't need to run the front end: scanning,
parsing, type inference (though we may be able to
remove some type checks at run time), etc.
- We can't further optimize operations on dynamic
temps or dynamic memory locations, so we can ignore
them during optimizations like constant and copy
propagation and CSE, which only affect operations on
scalar-promoted memory locations.
- Does it make sense to run a BTA on scalar-promoted
memory locations?
- Probably it won't be feasible to perform
interprocedural analysis at run time.
- Even though you might be willing to do it
statically, at run time it doesn't make sense to
perform dozens of optimizations, each of which has
only marginal benefit.
- We should be able to guess statically which
optimizations would be profitable to perform at run
time, and then only perform those potentially
profitable optimizations, rather than the entire suite
of available optimizations. It is likely that most
optimizations will only have limited applicability at
run time (e.g., run-time alias analysis).
- The code generator can be statically specialized
for the architecture and, possibly, for the program.
- Dynamic linking is inexpensive because the number
of jump destinations and unresolved symbols is
relatively small. It is also incremental.
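To make the scalar-promotion idea above concrete (the register file,
the register indices, and both routines are purely illustrative):

    /* Before specialization: every access to the interpreter's virtual
       register file goes through memory. */
    void step(int *regs, int dst, int src)
    {
        regs[dst] = regs[dst] + regs[src];
        regs[dst] = regs[dst] * 2;
    }

    /* Conceptually, after specialization with dst and src static and
       the address of regs known at run time, the location regs[dst] is
       scalar-promoted: it lives in a temp, is optimized like one, and
       is written back once. */
    void step_promoted(int *regs)
    {
        int t = regs[3];     /* promoted load of regs[dst], dst == 3 */
        t = t + regs[5];     /* regs[src], src == 5                  */
        t = t * 2;           /* optimized as an ordinary temp        */
        regs[3] = t;         /* single write-back                    */
    }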
3
The Run-Time Back End and Architectural Issues
(Matthai)
- Architecture-dependent optimizations
- Run-time register allocation
- Register reservation strategy: some
registers may be allocated statically,
and some dynamically. What is a good
scheme for deciding how many to reserve
for run-time allocation in each part of the
CFG? Or, should registers be entirely
reallocated at run time (for both static temps
and dynamic scalar-promoted memory locations)?
- Register allocation of scalar-promoted
memory locations: a simple allocation scheme
could attempt to allocate a register to every
referenced static
memory location. If, however, many locations
are referenced, this could introduce a large
number of scalar-promoted locations for
consideration; a faster scheme would be to
consider only memory locations known (by static or
dynamic measures, preferably the latter) to be
heavily accessed.
- Static planning plus local dynamic
decisions: register actions
- Simple global dynamic scheme:
e.g., tcc's approach
- More complex global dynamic allocation
- Speculative static solution?
- Control: annotations?
- Run-time instruction scheduling
- Does software scheduling matter, given modern
dynamically scheduled processors?
(different answers from different camps)
- What are the main components of the run-time
scheduler for a modern processor? Can any
of the main components be optimized by
using static information? For example, can
inter-iteration static dependence
information be used to reduce the overhead
of computing dependence info at run time?
- Local/peephole/moving loads over stores
(within basic blocks and within units)
- Fully global scheduling
- Control: annotations?
- Speculative, static approaches
- Run-time disambiguation (Nicolau)
- Problem of code explosion
- Interaction of DC with existing architectures
- How does code explosion due to specialization
stack up against limited icache bandwidth?
(Interestingly, specializing interpreter programs
will probably yield programs with acceptable
icache behavior, since programs have good
locality; on the other hand, a specialized
sparse matrix-multiply may not fare as well.
Basically, if the specialized code has many
frequently executed loops in it, it should
perform well; otherwise, not. Another way of
saying this is that when taking
resource bounds into account while specializing,
it is not the total amount of code the
specializer produces that should be limited, but
rather the size of the largest working set produced.)
- How effective is DC in "compiling away" loads
and branches, for different classes of apps?
[Bill Joy, at the Hot Chips 96 keynote speech,
seemed to think that DC should get rid of all
"easily predictable" branches and loads]
- What types of instructions are eliminated by DC
and what types should be eliminated?
- Effects of number of registers and register
renaming on register-allocation strategies
- Effects of multiple issue and dynamic scheduling
on instruction-scheduling strategies
- Opportunities in parallel systems (e.g.,
ZPL)
- "Nouvelle" architectural features to support DC
- Dynamic compilation on SMT: Perform dynamic
compilation in parallel with program
execution. If done right, this should guarantee
that DC _never_ produces a slowdown. Since
communication between compiling and computing
threads is very fast, this could allow
fine-grained dynamic (re-)compilation. This
could be a way in which an SMT processor improves
performance of sequential code.
- Value locality, value prediction (asplos96), and dynamic
instruction reuse (isca97): if cache-lookups become a
significant part of our overhead (automating
detection of possible rtconsts, as below, could
exacerbate this effect), modifications of these
proposed HW structures could be a win.
- The HW community has shown a lot of interest
recently in incorporating feedback features in
HW [refs, e.g., DEC]. What is the right interface
for a dynamic compiler (or even a static
compiler!) to use this feedback? e.g. "hot-path"
profiling [Ball and Larus, PLDI97] could trigger
re-scheduling if the trace paths change (i.e. the
identity of frequently taken edges changes) a la
the Dynamo project at HP Labs; and re-allocation
of static memory locations to temps (if the
identity of frequently used memory locations
changes).
4
Automating Dynamic Compilation
(?)
This is clearly important, as it is our overall long-term goal for the
project.
- Make analysis sufficiently aggressive to not require many
annotations
- Interaction of compiler and BTA
- Naming problem
- Annotations interfering with other
optimizations
- Reassociate expressions to create more
static values
- Alias analysis
- Points-to analysis
- Shape analysis?
- Safely and efficiently eliminating
annotation-specified laziness (e.g.,
Ball and Larus's static branch prediction)
- Interprocedural analyses
- Ensure safety of all transformations
- Using laziness to ensure termination
- Reachability analysis for dynamic branches
- Handle recursion as well as loops
- Determining when it is safe to reduce calls
- Making static loads safe
- Using alias/shape information
- Analysis of conditional expressions and
theorem proving?
- Pessimistic approach: laziness
- Optimistic approach: catch exceptions and
roll back
- Estimate costs and benefits of dynamic compilation
- Off-line selection of dynamic regions, specialization
variables, and optimizations to perform
- On-line decisions to dynamically compile (which regions
of code for which variables using what optimizations)
- Using run-time instrumentation information in performing
optimizations (e.g., specialize commonly taken
paths)
- Tool-assisted (interactive?) compared with fully
automatic approaches (-O9)
5
Recurring Themes
These are issues that cut across all research topics.
5.1
Applications
- We have an urgent need to find applications that benefit
from DC (VSO)
- What types of applications benefit most from DC (VSO) and why?
Check out Joel's list of
potential applications.
- When is our technique more appropriate than, say,
just-in-time compiling?
- Are operating systems the most likely clients of DC?
- Specialize hash functions
- Specialize argument marshaling and unmarshaling
- Compose protocol-processing routines
- Compile database queries
- Compile packet filters
- Compile user-installed kernel extensions (spindles)
- Graphics workloads?
- Databases?
- Simulators?
- Interpreters: we can exploit conditional specialization;
how does DC compare to JIT?
- Parallel codes, especially SPMD ones, are written in a
parameterized style and should greatly benefit from dynamic
compilation; the ZPL gang is interested in this and we could
perhaps run on a T3D with little effort (knock on wood)
- Whether dynamic compilation is generally applicable or
not is a big question.
- Is dynamic compilation in the form of value-specific
optimization important as a compiler technique that provides an
additional few percent improvement in performance? Or, is dynamic
compilation more important as an enabler of more powerful language or
system features?
- Varargs
- Generic functions
- Polymorphism
- Dynamic dispatch
- Higher-order functions / lambda expressions
- Function currying
- Continuations
- Dynamic linking
- Machine independence
5.2
Run-Time Optimizations
- How much benefit do various optimizations provide?
- Specialization (e.g., loop unrolling, inlining)
- Peephole optimizations (e.g., strength
reduction)
- Standard data-flow optimizations (e.g.,
constant propagation and folding, CSE)
- Back-end optimizations (e.g., register
allocation, scheduling)
- How can the work (especially analyses) be divided into
static and dynamic components?
- Does dynamic compilation suggest new types of optimizations?
- What else can we do other than value-based optimization?
- What run-time attributes of variables can we
exploit other than just values? Types? What else?
- Dynamic-dispatch elimination (kind of VSO)
- Cache-specific optimization?
- Architecture-specific optimization? (like 486 v. Pentium)
- Using run-time instrumentation information (e.g.,
specialize commonly taken paths)
- What else?
5.3
Run-Time Intermediate Representations
- What information does each optimization require?
- Which representations facilitate which optimizations and how?
- How expensive is it to generate code from different program
representations?
- How expensive is it to regenerate one representation from
another? For example, what does SPIN currently do?
- How can multiple representations be used?
- Do alternative (other than CFG) full-blown program
representations have anything to offer?
6
Generalized Partial Computation
Would we like a more general system of dynamic assertions and
invariants, combined with analysis of relational expressions and
theorem proving? This would result in a system that could perform
generalized partial computation, a generalization of partial
evaluation. Positive context propagation (propagating the information
that certain conditions hold) might subsume our current reachability
analysis.
7
Multi-Stage Programming
Our system creates a two-stage run-time system. Our generating
extensions are all statically generated. We could create a more
general multi-stage system that dynamically generates generating
extensions. This may not be worthwhile if programs aren't truly
multi-staged, since it only affects the performance of the generating
extensions and not of the code they generate.
8
Imperative Dynamic Compilation
Should we have some kind of imperative mechanism for run-time code
generation (à la `C or MetaML), given that we can simulate imperative
RTCG? `C combines imperative and declarative approaches by
performing specialization within tick expressions. MetaML also has an
interesting blend of operational and transformational approaches to
run-time code generation.
An imperative mechanism might appear to be incompatible with
automating dynamic compilation. However, MetaML is a possible
annotation language for a multi-stage BTA.