Research Agenda

This is a list of possible research directions we would like to pursue using our new system, code-named DyC.

1 Fast, Effective Run-Time Specialization in C
(Brian)

This topic is titled Fast, Effective Run-Time Specialization in C because it only covers the run-time specializer and specialization, and not other optimizations we may include in a full-blown dynamic compiler such as run-time register allocation.

This topic considers two questions:

  1. What optimizations are most important to the performance of the run-time specializer (sans a back-end global optimizer)?
  2. What capabilities and optimizations are most important to the performance of the target applications? There are three primary determinants of the performance of the applications. This topic definitely includes the first, and possibly the second. The third, however, will be considered with architectural issues.

Most likely, these issues will have to be studied in the context of multiple architectures, but architecture won't be the focus of the study. The reason for including multiple architectures is to generalize the conclusions beyond a single platform.

2/8/98: I'd like to avoid addressing architectural issues as much as possible. A recent summary of my plans can be found in this e-mail message.

1.1 Designing a State-of-the-Art Specializer

1.1.1 Inlining

To make our run-time specializer state-of-the-art, we need to add interprocedural specialization, and to support both compile-time and run-time inlining. Compile-time inlining should probably be implemented in the context of the compile-time specializer, which will be needed for the architectural studies (below). Several issues need to be addressed to support run-time inlining:

1.1.2 Data Specialization

We will already be performing a limited type of data specialization for pointer and floating-point values. That is, we will stage computations based on those types, but not embed their values in code. It would be interesting to extend this capability to arbitrary types of variables by adding a new policy, nogen, that would indicate that computations based on a variable should be staged and cached, but not embedded in dynamically generated code. The policy corresponding to what we currently do would be cogen.

The nogen policy would result in a new type of staging, different from what we have implemented. If specialization under the cogen policy has been indicated for other variables, both early and late data-specialized stages would be dynamically generated.

Although this new type of staging introduces a third subgraph that we must create, it wouldn't substantially increase the complexity of our system because it doesn't introduce any new dynamic code generation. We can think of it as further subdividing our dynamic (template) subgraph.
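A small C sketch of the intended staging (all names here are hypothetical and illustrative, not part of our annotation set): under nogen, the value derived from the static variable is computed once and cached in memory, rather than embedded as an immediate in dynamically generated code as cogen would do.

```c
/* Illustrative sketch of nogen-style data specialization (names are
 * hypothetical, not DyC annotations).  The early stage computes and
 * caches a value derived from the "static" variable scale; the late
 * stage reads it back from memory.  Under cogen, 1.0/scale would
 * instead be embedded as an immediate in generated code. */

typedef struct {
    double scale;      /* static value this entry was computed for */
    double inv_scale;  /* derived value, computed once per scale */
    int    valid;
} ScaleCache;

/* Early stage: runs only when the static value changes. */
static void early_stage(ScaleCache *c, double scale) {
    if (!c->valid || c->scale != scale) {
        c->scale = scale;
        c->inv_scale = 1.0 / scale;  /* the staged computation */
        c->valid = 1;
    }
}

/* Late stage: no code generation, just a memory load. */
static double late_stage(const ScaleCache *c, double x) {
    return x * c->inv_scale;
}

static double normalize(ScaleCache *c, double scale, double x) {
    early_stage(c, scale);
    return late_stage(c, x);
}
```

The point of the sketch is that adding nogen changes only where staged values live (a memory cache versus code immediates), not how much dynamic code generation occurs.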

1.2 Fast Run-Time Specialization

1.2.1 The specialized specializer

We should compare the unit-based specialized specializer approach to a unit-based stitcher. In addition, we should determine whether the more complicated unit optimizations are necessary. Note that both the stitcher and the specialized specializer permit only peephole optimizations at run time. The various types of units include (in order of increasing size of units and complexity of the unit-assignment algorithm):

In addition, there are three other optimizations we can perform to increase the size of units:

Other than increasing unit size, there are other optimizations that can be performed to increase performance of the specialized specializer:

1.2.2 Fast memory allocation and reclamation

In addition to hyper-optimizing the specialized specializer, we should give a fair amount of thought to fast memory allocation, deallocation, and reallocation. Matthai found that a large part of the overhead of the previous dynamic compiler was due to calls to malloc and free.
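One standard remedy is a bump-pointer arena, sketched below under the assumption that the specializer's scratch memory can be released all at once when a specialization finishes; the names and the fixed-size block are illustrative, not DyC's actual allocator.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of a bump-pointer arena for specializer scratch memory
 * (hypothetical; not DyC's actual allocator).  Allocation is a bump
 * of a cursor, and everything is reclaimed at once by resetting the
 * arena, avoiding per-object malloc/free overhead. */

#define ARENA_SIZE (64 * 1024)

typedef struct {
    unsigned char buf[ARENA_SIZE];
    size_t used;
} Arena;

static void *arena_alloc(Arena *a, size_t n) {
    /* round the cursor up to 8-byte alignment */
    size_t aligned = (a->used + 7) & ~(size_t)7;
    if (aligned + n > ARENA_SIZE)
        return NULL;    /* out of space; a real allocator would chain blocks */
    a->used = aligned + n;
    return a->buf + aligned;
}

/* Reclaim everything in O(1), e.g. after a specialization completes. */
static void arena_reset(Arena *a) {
    a->used = 0;
}
```

Whether this discipline fits depends on the lifetimes involved; cached specialized code outlives a single specialization, so it would need a separate allocation strategy.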

A few of the issues include:

1.2.3 Caching schemes

Finally, efficient implementation of the cache of specialized code is essential to a fast run-time specializer. We first need to complete the design of the cache_one implementation. In addition to flushing the appropriate lines of the cache at appropriate times, there are also a few other tricky issues.
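A minimal sketch of what a cache_one lookup might look like (illustrative names and structure, not our actual design): exactly one specialized version is retained per specialization point, so a lookup under a new context discards the old version and re-invokes the specializer.

```c
/* Sketch of the cache_one policy for a single specialization point
 * (illustrative; names are not DyC's actual interface). */

typedef void *(*Specializer)(int ctx);

typedef struct {
    int   ctx;    /* run-time constant the cached version was built for */
    void *code;   /* the specialized code, opaque here */
    int   valid;
} CacheOne;

static void *cache_one_lookup(CacheOne *c, int ctx, Specializer spec) {
    if (c->valid && c->ctx == ctx)
        return c->code;   /* hit: reuse the one cached version */
    c->code = spec(ctx);  /* miss: replace it (a real system would also
                           * free or recycle the old code) */
    c->ctx = ctx;
    c->valid = 1;
    return c->code;
}

/* Toy specializer used only for illustration: counts invocations. */
static int  spec_calls = 0;
static char fake_code[4];
static void *toy_specializer(int ctx) {
    spec_calls++;
    return fake_code + (ctx & 3);
}
```

Even in this toy form, the tricky issues show up at the edges: when the displaced version can safely be freed, and what happens if it is still executing.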

Next, we should investigate a couple of hybrid hierarchical/flat caching schemes to reduce the cost of hashing on large contexts.

It might be interesting to think about an invalidation-based scheme in addition to cache lookups for determining whether a specialization can be used, especially for static data structures.

Finally, we should explore different types of cache policies other than the simple policies we have included in our initial design.

1.2.4 Demand-Driven Specialization

It would be enlightening to compare eager (speculative) and lazy (demand-driven) specialization, in terms of utility and cost. Some of the relevant issues include:
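The lazy side of the comparison can be pictured as a stub-patching scheme, sketched below; this is a hedged simulation only, since real demand-driven specialization patches branches in generated code rather than a C function pointer. On the first call the stub runs the specializer, patches the entry pointer, and subsequent calls go straight to the specialized version.

```c
/* Sketch of lazy (demand-driven) specialization via an indirect call
 * through a patchable pointer (hypothetical scheme; real systems patch
 * the generated code itself). */

static int lazy_spec_count = 0;

typedef int (*Fn)(int);

/* Stand-in for code the specializer would produce on demand. */
static int specialized_version(int x) { return x * 2; }

static int lazy_stub(int x);
static Fn entry = lazy_stub;         /* call sites jump through this */

/* First call: do the specialization work, then patch the pointer. */
static int lazy_stub(int x) {
    lazy_spec_count++;               /* stand-in for specializer cost */
    entry = specialized_version;     /* later calls skip the stub */
    return specialized_version(x);
}

static int call_site(int x) {
    return entry(x);
}
```

Eager specialization would pay lazy_stub's cost up front for every potential path, including paths never executed; laziness pays it only on first use, at the price of a check or indirection on the fast path.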

1.2.5 Conditional Specialization

How can conditional specialization be exploited to minimize specialization overhead? Are division-querying predicates (à la filters: is_static(v) and is_dynamic(v)) useful?
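As a rough illustration of the intended effect (purely hypothetical; real conditional specialization would be driven by annotations and the run-time specializer, not hand-written variants): a run-time test guards entry into the specialized path, so specialization overhead is paid only in the cases where it is expected to pay off.

```c
/* Hand-simulated conditional specialization (illustrative only).
 * The guard plays the role of a condition on a division-querying
 * predicate: only the common, profitable case takes the
 * "specialized" path. */

static int generic_calls = 0, specialized_calls = 0;

/* General version: works for any exponent. */
static long power_generic(long base, int exp) {
    long r = 1;
    generic_calls++;
    while (exp-- > 0) r *= base;
    return r;
}

/* Stand-in for code the specializer would produce for exp == 3:
 * the loop is fully unrolled and the exponent is gone. */
static long power_spec3(long base) {
    specialized_calls++;
    return base * base * base;
}

static long power(long base, int exp) {
    if (exp == 3)                  /* the specialization condition */
        return power_spec3(base);
    return power_generic(base, exp);
}
```

The open question in the text is how expressive such conditions need to be, and whether predicates over the division itself (is_static / is_dynamic) earn their complexity.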

1.3 Effective Run-Time Specialization

1.3.1 Capabilities of the specializer

The capabilities of the specializer determine whether the specializer can specialize the application to the degree necessary to attain a speedup. With the addition of run-time inlining, our system will have more capabilities than any other specializer we know of. We only need to factor out the BTA by using annotations that can make up for any possible shortcomings in its analyses (e.g., we should support promote_one_unchecked_eager).

We should determine which capabilities yield the most benefit for which programs (e.g., multi-way loop unrolling and run-time inlining), and which were largely unnecessary (e.g., specialize_lazy, promote_one). We should emphasize what capabilities were necessary to specialize each application at all (e.g., support for unstructured code).

When is data specialization more or less useful than full dynamic specialization?

1.3.2 Cache effects

Cache effects could have a significant influence on performance. This issue can also be studied from the architectural perspective.

1.3.3 VM/OS effects

VM/OS effects could also have an impact on performance.

1.3.4 Other optimizations

Since peephole optimizations are the only possible back-end optimizations that can be performed at run time in this framework, we should probably look at how applicable and effective they are. This is also of interest when determining the value of run-time global back-end optimizations.
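As a concrete, if toy, illustration of the kind of run-time peephole cleanup in question, the pass below deletes identity operations from a small linear IR; the representation is hypothetical, but the dead work it removes is exactly the kind that embedding run-time constants tends to expose.

```c
#include <stddef.h>

/* Sketch of a run-time peephole pass over a toy linear IR
 * (illustrative; DyC's actual representation differs). */

typedef enum { ADD, MUL } Op;
typedef struct { Op op; int arg; } Instr;

/* Remove ADD 0 and MUL 1 in place; returns the new instruction count. */
static size_t peephole(Instr *code, size_t n) {
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        if ((code[i].op == ADD && code[i].arg == 0) ||
            (code[i].op == MUL && code[i].arg == 1))
            continue;            /* identity on the accumulator: drop it */
        code[out++] = code[i];
    }
    return out;
}

/* Execute the IR over an initial accumulator value. */
static int run(const Instr *code, size_t n, int acc) {
    for (size_t i = 0; i < n; i++)
        acc = (code[i].op == ADD) ? acc + code[i].arg
                                  : acc * code[i].arg;
    return acc;
}
```

Because the pass is a single linear scan with no global analysis, it fits the constraint that only peephole-scale work is affordable at run time in this framework.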

Rematerialization may be important for specializing arrays, and its effect cannot be achieved through additional annotations.

Are there other static optimizations that should be considered here?

1.4 Unstructured Code

We have made the decision to fully support arbitrary unstructured code. Is this really necessary for performance of the specializer or for performance of the specialized code? Are real programs unstructured? This may be of interest in the context of automation as well.

2 Global (Inter-Unit) Run-Time Optimizations
(?)

One of our long-term goals could be to have a full compiler infrastructure whose pieces could be selectively invoked at run time. We need to determine which optimizations would be profitable at run time in which contexts. Interpreters are, perhaps, easy targets because specializing an interpreter should produce code similar to unoptimized code generated by a simple compiler.
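A toy example of why interpreters are natural targets: specializing the dispatch loop of the interpreter below with respect to a fixed program removes all the interpretive overhead, leaving straight-line residual code much like what a simple compiler would emit. The bytecode and the hand-written residual are hypothetical.

```c
#include <stddef.h>

/* Toy stack-machine interpreter (hypothetical bytecode). */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

static int interp(const int *prog, int x) {
    int stack[16], sp = 0;
    stack[sp++] = x;                 /* the dynamic input */
    for (size_t pc = 0; ; ) {
        switch (prog[pc++]) {        /* dispatch: pure overhead */
        case OP_PUSH: stack[sp++] = prog[pc++]; break;
        case OP_ADD:  sp--; stack[sp - 1] += stack[sp]; break;
        case OP_MUL:  sp--; stack[sp - 1] *= stack[sp]; break;
        case OP_HALT: return stack[sp - 1];
        }
    }
}

/* A fixed "static" program computing 3*x + 1. */
static const int prog[] = { OP_PUSH, 3, OP_MUL, OP_PUSH, 1, OP_ADD, OP_HALT };

/* What specializing interp() on prog should leave behind: no dispatch
 * loop, no stack traffic, program constants embedded. */
static int residual(int x) {
    return 3 * x + 1;
}
```

The residual code resembles unoptimized output from a simple compiler, which is why further run-time global optimization could matter most for exactly this class of application.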

3 The Run-Time Back End and Architectural Issues
(Matthai)

4 Automating Dynamic Compilation
(?)

This is clearly important, as it is our overall long-term goal for the project.

5 Recurring Themes

These are issues that cut across all research topics.

5.1 Applications

5.2 Run-Time Optimizations

5.3 Run-Time Intermediate Representations

6 Generalized Partial Computation

Would we like a more general system of dynamic assertions and invariants, combined with analysis of relational expressions and theorem proving? This would result in a system that could perform generalized partial computation, a generalization of partial evaluation. Positive context propagation (propagating the information that certain conditions hold true) may subsume our current reachability analysis.

7 Multi-Stage Programming

Our system creates a two-stage run-time system. Our generating extensions are all statically generated. We could create a more general multi-stage system that dynamically generates generating extensions. This may not be worthwhile if programs aren't truly multi-staged, since it only affects the performance of the generating extensions and not of the code they generate.

8 Imperative Dynamic Compilation

Should we have some kind of imperative mechanism for run-time code generation (à la `C or MetaML), given that we can simulate imperative RTCGs? `C combines imperative and declarative approaches by performing specialization within tick expressions. MetaML also has an interesting blend of operational and transformational approaches to run-time code generation.

An imperative mechanism might appear to be incompatible with automating dynamic compilation. However, MetaML is a possible annotation language for a multi-stage BTA.


Last updated June 17, 1997.
Brian Grant (grant@cs.washington.edu)