Register Actions for DyC

Markus Mock 3/5/1998

1. The Problem

We want to allocate (parts of) array variables to registers. A precondition is that the array is accessed only at run-time constant offsets, so that we know at specialization time which array elements are accessed. The programmer specifies with a source annotation which array variable(s) and which offsets of those the programmer would like to be allocated to registers. At compile time, we compute register actions that rewrite the template code at setup time to allocate the annotated elements to registers (or a subset thereof if more elements are to be allocated than registers are available). The problem consists of the following distinct sub-problems:

Specification of the elements that are to be allocated to registers. We use annotations that could be automatically generated or written by the programmer.
Computation of register actions that will specify how to rewrite the template code at specialization time to allocate (a subset of) the annotated array elements to registers.
Selection of the subset of candidate array elements that are allocated to registers. Which of the candidate array elements are selected if the number of available registers is exceeded by the number of candidates, can be based on heuristics, profiling or be at random (our initial default).

Wall [Wall86] allocates registers globally, i.e., for the whole program,. He computes the program's call graph in order to allocate global variables to the same registers across different procedures and also to avoid that different locals in distinct procedures are allocated to the same registers, which avoids register save and restore operations. Our task is restricted to allocating array variables for a particular procedure (part of) which is dynamically compiled. This restriction also means that we need to generate code to save and restore registers allocated by register actions across function calls. Initially, we can avoid that complexity by setting aside a set of registers for register actions (we're already doing this for a few scratch registers used while producing dynamic code from code templates)

Restrictions

Apart from the restriction that the register-allocated array only be accessed at run-time constant offsets, there are no other restrictions, for instance, array elements may be freely assigned in the dynamic region. The current BTA is, however, not capable of identifying that values at certain indices are run-time constant, but we may be able to extend it for that purpose if that appears profitable. To be able to access the correct values of array elements after the dynamic region, code needs to be generated at the end of a region that stores the values of register-allocated array elements back to the array. For local arrays a simple use-analysis can determine if this is necessary, a conservative default is to always write back the values from the registers to memory at the end of a region.

Generalization to C-Structs

The algorithm for register actions to allocate array variables for DyC can be generalized to C-structs. Instead of indices the field names could be used to specify what parts of the struct should be allocated to registers, and a small modification of the algorithm could compute registers actions for structs as well (cf. Section on computation of register actions).

2. Annotations

We use dyc_allocate(VAR, Int, Int),

e.g., dyc_allocate(reg,1,3), to specify that reg[1] .. reg[3] are eligible for register allocation through register actions.
An alternative or additional annotation that could be used, could specify a function called at specialization time to decide which array elements are to be candidates for register allocation: dyc_allocate(VAR, fn) where fn: Int ® Bool a function that is called at specialization time.
E.g., dyc_allocate(reg, f) reg[i] is a candidate for register allocation iff f(i) = true.
In an interpreter, for instance, the user-supplied function could check the bytecodes to determine which registers are used in the interpreted program or it could be an interface function that consults profiling data of previous runs to determine the most frequently accessed array offsets, etc. Multiple dyc_allocate annotations for the same array variable can be used to specify non-consecutive array indices as candidates for register allocation.

3. Computation of Register Actions

The actions can be computed by essentially one modification to Wall's algorithm [WALL86]. Wall's algorithm computes the actions for each basic block. Each instruction in the basic block is either

a leaf (load of a variable),
an assign (assignment to a variable), or
an operate command (an operation using one or two values in registers producing a value in a target register.

To compute when instructions can be annotated with REMOVE,and when it is necessary to replace them with register moves, Wall's algorithm computes what he calls "time-critical leaves", which are loads that cannot be removed, and which also determine when an operand can be annotated with the OP action. In addition, there are some conditions when stores cannot be removed but have to be replaced by register moves. Here are the different possible situations that motivate the decisions in Wall's algorithm:

1.REMOVE, OP, and RESULT actions

r1 := load y     REMOVE.y
r2 := load x     REMOVE.x
r3 := r1 + r2    OP1.y OP2.z RESULT.x
store x := r3    REMOVE,x

REMOVE.v specifies that the load/store can be removed if variable v is allocated to a register, OPi.v specifies that operand i is replaced by the register to which variable v is allocated (if it is), and RESULT.v says that the result register needs to be replaced by the register to which variable v is alloated.

2.Load Actions

If a variable is evaluated (loaded), then changed and later the original value is used, a remove action cannot be generated. For instance, for the C-statement x = y++ + z the following code would be produced:

r1 := load y     LOAD.y
r2 := r1 + 1     RESULT.y
store y := r2    REMOVE.y
r2 := load z     REMOVE.z
r3 := r1 + r2    OP2.z RESULT.x
store x := r3    REMOVE.x

The LOAD action LOAD.y says that the instruction says that the load will be changed to a register move from the register y's allocated to. Since the loaded value of y is used after it is changed, the load cannot be removed since the original value of y has to be available for the add operation. Similarly, the first operand of the add operation cannot be annotated OP1.y since the original value of y (preserved in register r1) must be used. Load operations whose current value is used after the variable is changed, are called time-critical by Wall. Instead of a remove action, a load action is generated for them. In addition, operands that use the loaded value in operations are not annotated with an OP-action.

3.STORE Actions

If a computed value is used as an operand in more than one command, using the RESULT action becomes impossible. For instance, for the C-statement x = y = a + b the following code would be produced:

r1 := load a     REMOVE.a
r2 := load b     REMOVE.b
r3 := r1 + r2    OP1.a OP2.b
store y:= r3     STORE.y
store x := r3    STORE.x

So, instead of using a RESULT action, for each use of the computed value in an assignment to a variable, a STORE action is produced that will be replaced by a register move in case y or x, respectively, are allocated to registers.

One degenerate case arises as well: naively generating register actions for x = y might produce:

r1 := load y REMOVE.y
store x := r1 OP1.y REMOVE.x

in which case both commands would be removed if both x and y were allocated to registers. Instead,

r1 := load y REMOVE.y
store x := r1 OP1.y STORE.x

has to be generated. The rule here is, if the operand of a store is a leaf command (i.e., a load) then generate a STORE action.

Computing Time-Criticality

Time-critical leaves are identified in a backwards pass over the commands of a basic block, which identifies where the values of commands are used. If it turns out that a leaf is used after its variable is assigned, it has to be marked as time critical. Here's an example:

r1 := load y     #1 leaf y
r2 := r1 + 1     #2 operate #1 + 1
store y := r2    #3 assign y := #2
r2 := load z     #4 leaf z
r3 := r1 + r2    #5 operate #1 + #4
store x := r3    #6 assign x := #5

The first column shows the original code for x = y++ + z, the second column a three-address representation where loads are designated as leaf, stores as assign, and operations producing values in registers as operate. Leaf command #1 is used as an operand in command #5 and the destination of the assignment in command #3 is the same as variable loaded by it, leaf #1 is marked as time-critical.

To decide if an assign command is critical, Wall's algorithm checks whether the destination of the assign (command #3 in the example) is a live leaf, that is a leaf that is used after the assign (in the example leaf #1 is used in command #5 after command #3). Since we allocate parts of arrays whose indices we don't know until specialization time, we don't know until then if the destination of an assign is equal to a particular leaf, for instance if reg[rt] refers to the same location as reg[rs]. Therefore, instead of computing for each leaf it is time-critical or noe, we generate a condition marker for each leaf that is initialized to false. Whenever the algorithm checks for potential criticality we OR the condition that determines if the leaf will be critical. For instance, if the load is from offset n and the store is to offset m, the condition that is added to the leaf is ` OR(n == m). The back quote is used here to indicate that the condition is typically not evaluated at compile time when the action is pre-computed since n and m are typically run-time and not compile-time constants. At specialization time the disjunction constructed for each leaf is evaluated to decide if the leaf is in fact time critical. Loads of and stores to time critical leaves will not be removed, instead they will be replaced by register moves. So here's Wall's modified algorithm:

For each leaf LEAF do:
initialize LEAF.TimeCritical := false

Let LIVELEAVES be the set of leaves in this basic block,

After computing which leaves are potentially time critical, register actions are computed. In contrast to Wall's approach, we cannot distinguish between a remove and load action until specialization time. So instead of a REMOVE and LOAD action, we have a TC_LOAD action, which is completed at specialization time to either a REMOVE or LOAD action. Similarly, the OP-action, which rewrites operation operands, is modified to incorporate a specialization-time condition that determines if the action is carried out or not.

Register action types:

TC_LOAD(v,exp1, cond): replace a load of v[exp1] by a register move if element exp1 has been selected for register allocation in step 3 and cond evaluates to true at specialization time; if v[exp1] has been selected for register allocation and cond evaluates to false at specialization time, remove the load instruction; otherwise the action has no effect. If cond can be evaluated at compile time, produce a remove or load action at compile time.
RESULT(v, n): replace the result of the instruction by the register to which v[n] is allocated; if v[n] is not allocated to a register, the action has no effect.
STORE(v, n): replace a store to v[n] by a register move to the register to which v[n] is allocated; if v[n] is not allocated to a register, the action has no effect.
REMOVE(v,n): remove the load of / store to v[n] if v[n] is allocated to a register; if v[n] is not allocated to a register, the action has no effect.
OP_i(v, n, cond): if v[n] is allocated to a register and cond evaluates to false at specialization time, replace operand i by the register to which v[n] is allocated, otherwise the action has no effect. If cond is a compile-time condition that evaluates to true, the action has no effect.

The conditions that determine which action is generated for a particular command, are similar to Wall's:

If an instruction LEAF is a leaf load v[exp] for an array variable v that is a candidate for register allocation, annotate it with TC_LOAD(v, exp, LEAF.timeCritical). If LEAF.timeCritical is a compile-time constant, the action can be completed at compile time to either REMOVE(v,exp) if the condition is false, or LOAD(v,exp) if the condition is true.
If an instruction is an operation and the result of the operation is used only once in an assignment command, then annotate the instruction with RESULT(v,n).
If the instruction is a store to v[n] and the operand is a leaf command, or if the operand is used in some other command, then annotate the store with STORE(v,n), otherwise, annotate it with REMOVE(v,n).
To determine where to generate OP actions, it is more convenient to look at the actions generated for a particular variable v, instead of looking at the instructions. So in both case 2 and 3 for each operand that is a leaf LEAF v[n], annotate any instructions that use the leaf command with OP_i(v, n, LEAF.timeCritical.)(i=1,2 for the first and second operand, respectively).

There are only two things going on here: (1) Cases 1 and 2 make sure that a load action is generated instead of a remove action if the loaded values is used after the loaded variable is modified. Case 3 takes care of the cases where a computed value is used more than once and the degenerate case where a loaded value is stored to some other variable. For case 4, we are looking at the actions generated for a particular variable v that is used as an operand in a command.

An Example

Here's the code we generate for our interpreter branch that does the multiply:

ldl $27,interpreter.rtconst_5_1($24)
ldl $25,interpreter.rtconst_5_3($24)
mull $27,$25,$25
stl $25,interpreter.rtconst_5_5($24)

at the BTA stage this looks as follows:

536 <S> SHIFT.ASHL<>t474 @<_rt$1.2>t457 2(I64)
537 <D> ALD<>t349 @<>t474<>t0 _reg$1.2
538 <S> SHIFT.ASHL<>t475 @<_rs$1.2>t458 2(I64)
602 <D> ALD<>t351 @<>t475<>t0 _reg$1.2
539 <D> GIMUL<>t352 <>t349<>t351]
540 <S> SHIFT.ASHL<>t476 @<_rd$1.2>t456 2(I64)
541 <D> AST<>t352 @<>t476<>t0 _reg$1.2

Computing the leaf information for the dynamic (<D>) operations gives:

#1 leaf reg[t474]
#2 leaf reg[t475]
#3 oper #1 * #2
#4 assign reg[t476] = #3

where t474, t475, and t476 are all run-time constants.

None of the leaves has a time critical condition, so for all of them timeCritical = `false.

So the following actions would be generated:

TC_LOAD(reg, t474, false)
TC_LOAD(reg, t475, false)

RESULT(reg, t476) OP1(reg, t474, false), OP2(reg, t475, false)
REMOVE(reg, t476)

At specialization time we check the values of the run-time constant offsets and can decide if we rewrite the code. Assuming, for instance, that t474=1, t475=2, t476=3 and that these offsets are allocated to registers r1, r2, and r3, respectively, the dynamic code would be rewritten to:

mull r1, r2, r3

getting rid of the loads and the store.

References

[WALL86] David D. Wall: Global Register Allocation at Link Time. In SIGPLAN 86 Symposium on Compiler Construction, Palo Alto, California, 1986, pp. 264-275. Also as Technical Report URL: ftp://gatekeeper.dec.com/pub/DEC/WRL/research-reports/WRL-TR-86.3.ps