Remaining Unit Details

This page outlines proposals for handling the remaining details in the design of the unit-based specialized specializers. This page is more or less notes for me and is not terribly readable; if you have any questions, let me know - bkg.

Current (9/17/97) To-Do List

  1. Debugging
  2. Memory allocation
  3. Eliminate redundant work-list pushes and pops (make eager fall-throughs)
  4. Don't insert table-pointer loads if the table pointer will be the same as for the previous unit; when creating a run-time merge, find the table pointer of the merge unit and insert a table-pointer load before the jump to that unit
  5. Dynamic jump-table construction
  6. Unit linearization
  7. Fix intra-unit branches if linearized setup blocks are laid out in a different order from the template blocks

Design Issues

Forming Units

  1. Initially, units will be true basic blocks (instructions between branches and merges)
  2. Next, units will be disjoint, single-entry subgraphs containing no dynamic branches
  3. Later, units will be disjoint linearized subgraphs
  4. Eventually, I'd like to experiment with the single-run-time-predecessor optimizations, creating units that are not necessarily disjoint
  5. Finally, we should attempt clustering of unit boundaries

Connecting Units

The template subgraph retains most of its original connectivity. Eager edges corresponding to static branches between units of the setup subgraph remain. Eager edges corresponding to dynamic branches to other units remain or are replaced by dummy IGOTOs. Lazy edges extend from the template subgraph to the setup subgraph. IGOTOs are inserted at promotions, and connect to corresponding units in both the setup and template subgraphs.

This scheme may need to be changed if Multiflow yacks or if registers aren't saved and restored correctly. It just has to be refined by trial and error.
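
For orientation, the kinds of inter-unit edges in this connection scheme might be tabulated roughly as below; the type and field names are hypothetical and not actual Multiflow/DyC data structures.

    /* Hypothetical classification of the inter-unit edges described above. */
    typedef enum {
        EDGE_EAGER_STATIC,    /* static branch between setup units; kept as-is     */
        EDGE_EAGER_DYNAMIC,   /* dynamic branch to another unit; kept, or replaced
                                 by a dummy IGOTO                                  */
        EDGE_LAZY,            /* from the template subgraph back to the setup
                                 subgraph                                          */
        EDGE_PROMOTION_IGOTO  /* IGOTO inserted at a promotion, targeting the
                                 corresponding unit in both subgraphs              */
    } EdgeKind;

    typedef struct UnitEdge {
        int      src_unit;    /* id of the originating unit  */
        int      dst_unit;    /* id of the destination unit  */
        EdgeKind kind;
    } UnitEdge;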

GE Entries

We will create multiple entries to each unit of the setup subgraph, eager entries and lazy entries. Generally, there will be at most one eager entry, but lazy entries will be replicated in the template subgraph. These "entries" may or may not be Multiflow ENTRYs.

Entries may or may not include cache lookups. The cache lookups will be implemented as described in the PEPM paper, using a two-level hierarchy. Lazy entries with no lookup will be implemented using IGOTOs that initially jump to setup code and, once the code has been stitched, to the stitched code; except in the case of cache_one units, these IGOTOs are then overwritten with direct branches.

Cache Lookups

First, a hash function must be computed on the values of the cache_all variables. Then, comparisons must be made between the context found and the current context to find the correct cache location or to determine that the current context is not present in the cache. If the context is found, a second hash is computed on the values of the cache_one variables and similar comparisons must be made between the current context and the keys in the cache. If there are no cache_one variables, there need not be a second level to the cache. If there are no cache_all variables, there need not be a first level to the cache. If there are neither cache_all nor cache_one variables, there is no cache for that unit.
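
A minimal sketch of this two-level lookup in C follows. The hash function, bucket counts, entry layout, and all names (L1Entry, cache_lookup, and so on) are assumptions for illustration only; the real lookups are emitted inline by the generating extensions as described in the PEPM paper.

    #include <stddef.h>
    #include <string.h>

    #define L1_BUCKETS 64            /* illustrative sizes only */
    #define L2_BUCKETS 16

    typedef struct L2Entry {         /* keyed on the cache_one variables */
        long key[4];
        int  nkeys;
        void *code;                  /* specialized code for this context */
        struct L2Entry *next;
    } L2Entry;

    typedef struct L1Entry {         /* keyed on the cache_all variables */
        long key[4];
        int  nkeys;
        void *code;                  /* used directly when there are no cache_one variables */
        L2Entry *l2[L2_BUCKETS];     /* second-level table */
        struct L1Entry *next;
    } L1Entry;

    static unsigned hash_vals(const long *v, int n)
    {
        unsigned h = 0;
        for (int i = 0; i < n; i++)
            h = h * 31u + (unsigned)v[i];
        return h;
    }

    /* Return the specialized code for the current context, or NULL if it has
       not been generated yet.  With no cache_all variables, the first level
       would be skipped in the same way the second level is skipped here.    */
    void *cache_lookup(L1Entry *l1[L1_BUCKETS],
                       const long *all_vals, int n_all,
                       const long *one_vals, int n_one)
    {
        L1Entry *e = l1[hash_vals(all_vals, n_all) % L1_BUCKETS];
        while (e && !(e->nkeys == n_all &&
                      memcmp(e->key, all_vals, n_all * sizeof *all_vals) == 0))
            e = e->next;
        if (e == NULL)
            return NULL;                     /* cache_all context not present */
        if (n_one == 0)
            return e->code;                  /* no second level needed        */
        L2Entry *s = e->l2[hash_vals(one_vals, n_one) % L2_BUCKETS];
        while (s && !(s->nkeys == n_one &&
                      memcmp(s->key, one_vals, n_one * sizeof *one_vals) == 0))
            s = s->next;
        return s ? s->code : NULL;
    }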

Eventually, I would like to try the hybrid hierarchical/flat caching scheme, which would create a hierarchy across different units.

Eager-Successor Work List and Passing Live Static Variables

Except for final successors (e.g., successors of static branches), all eager successors are pushed onto a LIFO work list. The information needed in a work-list node includes the address of the successor (provided by an anchor tag for the eager entry to that setup unit), the address of the vector of values of live static variables (LSV vector), and a pointer to the next node. Thus, where there would be a dynamic branch to another unit, the LSV vector is allocated and filled in, and then a node for the work list is allocated, initialized, and pushed onto the work list.

At the end of a unit that has no eager successors, a node is removed from the work list and returned to the free list, and the head of the work list is updated. The values from the LSV vector are loaded, the LSV vector is returned to the successor's free list, and the specializer jumps to the address of the eager entry for the successor.
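
As a sketch only, the work-list node and the push/pop steps just described might look like the following; the names and the pool allocator are hypothetical, and the real code is generated rather than written as C.

    #include <stddef.h>

    extern void *alloc_from_table_space(size_t nbytes);  /* hypothetical allocator
                                                             over DYC_TableSpace   */

    typedef struct WorkNode {
        void *entry_addr;           /* eager entry of the successor setup unit,
                                       taken from its anchor tag               */
        long *lsv;                  /* vector of live static variable values   */
        struct WorkNode *next;      /* next node on the LIFO work list         */
    } WorkNode;

    static WorkNode *work_list = NULL;   /* head of the LIFO work list */
    static WorkNode *node_free = NULL;   /* global free list of nodes  */

    /* At a would-be dynamic branch to another unit: record the eager successor. */
    static void push_successor(void *entry_addr, long *lsv)
    {
        WorkNode *n = node_free;
        if (n != NULL)
            node_free = n->next;                      /* reuse a freed node  */
        else
            n = alloc_from_table_space(sizeof *n);    /* or take fresh space */
        n->entry_addr = entry_addr;
        n->lsv        = lsv;
        n->next       = work_list;
        work_list     = n;
    }

    /* At the end of a unit with no eager successors: pop the next successor,
       recycle its node, and hand back its LSV vector and entry address. */
    static void *pop_successor(long **lsv_out)
    {
        WorkNode *n    = work_list;
        void     *addr = n->entry_addr;
        work_list      = n->next;
        *lsv_out       = n->lsv;      /* caller loads the values, then returns the
                                         vector to the successor's free list      */
        n->next        = node_free;
        node_free      = n;
        return addr;
    }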

The values of the LSV vector are sorted such that values of cache_all variables are contiguous and values of cache_one variables are contiguous. This simplifies the construction of cache lookups.

Static-Value Tables

Only values of long-lived static variables will be stored in static-value tables (SVTs). Initially, the only long-lived static variables will be large run-time constants (i.e., run-time constants too big to fit in an immediate field). Such a run-time constant will be duplicated in the static-value table for each use in the template. We can statically compute an upper bound on the number of static variables to be stored in an SVT. If the amount of space remaining in the current SVT is not sufficient to store this upper bound, a new SVT will be allocated.

Each specialized unit will only access a single SVT, the address of which will be written into a reserved register at the beginning of the unit. Each SVT will be 2^16 bytes because the immediate fields of load and store instructions are 16 bits. This should be plenty big for a single unit and, in general, we expect multiple specialized units to share a single SVT.
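
A rough sketch of the intended SVT allocation, assuming a hypothetical helper alloc_new_svt() that obtains a fresh table from the SVT free list or from DYC_TableSpace:

    #include <stddef.h>
    #include <stdint.h>

    #define SVT_SIZE (1 << 16)     /* 2^16 bytes: offsets fit a 16-bit immediate field */

    extern uint8_t *alloc_new_svt(void);   /* hypothetical: fresh SVT from the free
                                              list or from DYC_TableSpace            */

    static uint8_t *svt_base;      /* current SVT; its address is what gets written
                                      into the reserved register for each unit       */
    static size_t   svt_used;      /* first unused byte within the current SVT       */

    /* Reserve room for a unit's long-lived static values, given the statically
       computed upper bound on the space the unit can need. */
    static uint8_t *svt_reserve(size_t upper_bound)
    {
        if (svt_base == NULL || svt_used + upper_bound > SVT_SIZE) {
            svt_base = alloc_new_svt();             /* not enough room: new SVT */
            svt_used = 0;
        }
        uint8_t *p = svt_base + svt_used;           /* values are then addressed as
                                                       16-bit offsets from svt_base */
        svt_used += upper_bound;
        return p;
    }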

Memory Allocation, Deallocation, and Reallocation

We will link with a library containing two huge arrays, DYC_TableSpace[] and DYC_CodeSpace[]. Space for intermediate specializer data structures and static-value tables will be allocated from DYC_TableSpace, and code space will be allocated from DYC_CodeSpace. Space will never be returned to these two pools; rather, we will maintain a number of free lists containing fixed-size objects. For example, for each unit we will maintain a free list for the LSV vectors (whose sizes are known at static compile time). There will be global free lists for work-list nodes and SVTs.
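
A minimal sketch of the fixed-size free-list scheme over DYC_TableSpace (the helper names are hypothetical; only the pool array itself comes from the linked-in library):

    #include <stddef.h>

    extern char  DYC_TableSpace[];                     /* huge pool from the library */
    static char *table_space_next = DYC_TableSpace;    /* first unused location      */

    /* One free list per object size (e.g., a unit's LSV vectors, or the global
       lists of work-list nodes and SVTs).  Freed objects are threaded through
       their own first word; fresh objects are carved from DYC_TableSpace and
       are never returned to the pool itself. */
    typedef struct FreeNode { struct FreeNode *next; } FreeNode;

    static void *fixed_alloc(FreeNode **free_list, size_t objsize)
    {
        if (*free_list != NULL) {               /* reuse a recycled object */
            FreeNode *n = *free_list;
            *free_list  = n->next;
            return n;
        }
        void *p = table_space_next;             /* otherwise carve fresh pool space */
        table_space_next += objsize;
        return p;
    }

    static void fixed_free(FreeNode **free_list, void *obj)
    {
        FreeNode *n = obj;
        n->next     = *free_list;
        *free_list  = n;
    }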

An SVT can be freed when its count of referencing units reaches zero. The count is incremented by each unit that uses that table and decremented when such a (cache_one) unit is reallocated.

Hence, we need global pointers to the first unused location in each of DYC_TableSpace and DYC_CodeSpace, to the current SVT and the first unused location therein, and to the global free lists. We require per-unit pointers for the per-unit free lists.

We're not sure how to handle fragmentation in the SVTs and in chunks of memory allocated for stitching. If we know the code chunks will never be reallocated (no variable has the cache_one policy), then we can stitch code back-to-back, returning unused space to DYC_CodeSpace. If a chunk may be reallocated, then we can't return the extra space and we must place an unconditional branch to the next unit at the end of the valid instructions. Consequently, a significant portion of the memory could go unused.

A much greater problem is how to deallocate or reallocate the memory of units previously reached only by a cache_one unit that has been reallocated. For example, a cache_one unit could be followed by an unrolled loop with the cache_all_unchecked policy. Fortunately, cache_one is not part of the initial implementation, so we can think about this for a while.

Note also that cache_one units and units that are lazily stitched should be started on new cache lines.

Compiler-Merger Interface

WARNING: this spec. is largely obsolete! I need to rewrite it.

The compiler will reserve two registers (s9 and s10) for use by the merger. "Communication" between the compiler and the merger will be through these registers, various global variables, and tags (labels).

Code and Table Space

The merger will assume that the pointer to the space to be used for stitching will be in a particular reserved register (s10) and that the address at which large run-time constants should be stored will be held in the global variable DyCRT_StaticValueTable. These locations will be set by instructions that the compiler inserts at the beginning of the generating extension.

The beginning of each unit template is tagged with function.DYC.UNITTEMPLATE.unit_id. The beginning of each unit generating extension is tagged with function.DYC.GEBEGIN.unit_id to mark the location where the table-pointer load should be inserted.

The current proposal for embedding the 64-bit table pointer in the code is to keep a global table of table pointers, DyCRT_TablePointers (size is 2^16 bytes, or 2^13 pointers). An LDA will build the index (DyCRT_TablePointerIndex) into this table, followed by a load to get the address of the static-value table. The compiler will insert code to update the index and store entries into the table of table pointers.
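
For illustration only, the effect of the inserted LDA/load pair amounts to the C below; DyCRT_TablePointers and DyCRT_TablePointerIndex are the globals named above, but the helper itself is hypothetical.

    extern void *DyCRT_TablePointers[1 << 13];   /* 2^13 pointers = 2^16 bytes        */
    extern long  DyCRT_TablePointerIndex;        /* updated by compiler-inserted code */

    /* What the inserted instruction pair computes: the LDA's 16-bit immediate
       supplies the unit's slot in the table, and the following load fetches
       the full 64-bit address of its static-value table. */
    static void *load_table_pointer(long slot)
    {
        return DyCRT_TablePointers[slot];
    }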

At the end of the unit (any leaf), the compiler will use the updated values of register s10 and of DyCRT_StaticValueTable to reclaim unused space.

Calls to Generating Extensions

The merger must replace calls to DyCRT_CallGeneratingExtension with calls through register 16.

Returns from the generating extension are inserted by the compiler, but must be set up by the merger. The merger must pick up a storage location from any load marked function.DYC.GEReturnAddress: (and remove all such labels). Then, the merger must insert stores at lazy entries, marked by function.DYC.LAZY_ENTRY.unit_id, to that location.

Branch Patching

The compiler now inserts the call to the intra-unit branch patcher, and wrapper.c is defunct. LabelAddrs has become DyCRT_IntraUnitLabelAddrs. BranchTable has become DyCRT_IntraUnitBranches. NumBranches has become DyCRT_BranchIndex. These are declared in dycrt.c. The compiler initializes DyCRT_BranchIndex. So, the merger need only insert branch-recording operations.

The merger must store the address of an inter-unit branch in DyCRT_LastBranch. Such branches are labelled in the template code with function.DYC.BRANCH..src_unit.taken_dst_type.taken_dst_unit.nottaken_dst_type.nottaken_dst_unit.B.src_blk, where the dst_types are S for static destinations and D for dynamic destinations. When at least one dst_type is dynamic, the merger must ensure that the branch is directed the right way (i.e., to the `taken' unit). Note that if one side is static and the other is dynamic, the dynamic edge will always be `not taken'. If both destinations are static units, an unconditional branch must be added to jump to the `not taken' unit.
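
A sketch of the case analysis the merger performs for each inter-unit branch (the enum and function are hypothetical; the actual work is done by patching instructions in place):

    /* Destination types as encoded in the function.DYC.BRANCH label:
       S = static destination, D = dynamic destination. */
    typedef enum { DST_STATIC = 'S', DST_DYNAMIC = 'D' } DstType;

    typedef enum {
        PATCH_TAKEN_TARGET,     /* at least one dynamic destination: redirect the
                                   conditional branch so its taken side reaches the
                                   `taken' unit; in a static/dynamic mix the dynamic
                                   edge is always the not-taken (fall-through) side */
        ADD_UNCONDITIONAL_JUMP  /* both destinations static: an unconditional branch
                                   to the `not taken' unit must also be added       */
    } PatchAction;

    static PatchAction classify_branch(DstType taken, DstType nottaken)
    {
        if (taken == DST_STATIC && nottaken == DST_STATIC)
            return ADD_UNCONDITIONAL_JUMP;
        return PATCH_TAKEN_TARGET;
    }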

The tops of the destination units are marked by tags containing their unit ids: function.DYC.UNITTEMPLATE.unit_id for dynamic units and function.DYC.STATICUNIT.unit_id for static ones.

Eventually, we should extend this to check whether a relative branch or absolute jump is required, and to handle patching of dynamically allocated jump tables.

Squashing Dummy Branches and Jumps

There are dummy branches and/or jumps in both setup code and template code. The dummy branches and jumps in the setup code serve to connect the unit graph. The dummy jumps in the template code serve to demarcate the ends of unit templates for the purpose of appropriately constraining instruction motion across unit boundaries.

The dummy branches and jumps in the setup code are marked by function.DYC.DUMMY_GE_LINK.stat_number. These should simply be eliminated.

The dummy jumps in the template code are marked by function.DYC.DUMMY_TO_DYNAMIC.src_unit.dst_unit.src_blk when it is a jump to another dynamically generated unit and function.DYC.JUMP_TO_STATIC.src_unit.dst_unit.src_blk when it is a jump to static code. In general, these jumps will need to be replaced with a branch or jump to the next unit. (The jumps to static code can actually be left as is.)

There are additional dummy jumps in both the setup and template code labelled function.DYC.DUMMY_IGOTO_MARKER.count that were inserted to block code motion. The count doesn't encode any useful information.

Other Changes

{op,mem,branch}_{max,min,mask} have been prefixed with "DyCRT_" and are declared in dycrt.c. Similarly, DycError has become DyCRT_Error.


Last updated September 17, 1997.
Brian Grant (grant@cs.washington.edu)