Comparing Speedups across Architectures

The issue width and instruction latencies of the processor on which the dynamically generated code runs have a large impact on speedup due to dynamic compilation. In particular, on a multi-issue machine, if the instructions removed by dynamic compilation are not on the critical path of the computation, the total running time of the computation may not be affected.

Example: `pow`

Summary

For example, all instructions from the inner loop of the pow benchmark's kernel can be removed by dynamic compilation except one floating-point multiplication. Unfortunately, on the DEC Alpha 21064, the latency of that operation, 6 cycles, dominates the loop's critical path. Removing the other instructions, therefore, does not significantly reduce the time required to execute the loop. For this reason, that benchmark is unlikely to benefit from dynamic compilaton on modern architectures.

Detailed Explanation

Consider the following loop schedule (the kernel of the pow benchmark) running on the dual-issue 21064, where a floating-point multiply has a latency of six cycles. Each line represents a cycle of the schedule, and contains a maximum of two instructions that can be issued together. All instructions except the multiplies have a one-cycle latency. The exact functioning of the loop is unimportant, but note that the multiply M is data-dependent on itself.

L0:     and t1, a1, t0

        beq t0, L1           <-- branch B1

        mult f0, f16, f0

L1:     addq t1, t1, t0 

        addl t0, zero, t1

        cmple t1, a1, t0

        mult f16, f16, f16   <-- multiply M	bne t0, L0 <--branch B2

According to this schedule, the loop body requires a minimum of 6 cycles to execute (this is the case if branch B1 is taken). Even if every instruction in the body except M were removed (as dynamic compilation could), the body would still take 6 cycles to execute, since M has a dependence of latency 6 on itself. On this machine, therefore, dynamic compilation would not yield a speedup.

On the other hand, when executing on a single-issue machine, the minimum latency of the above schedule would be 7 cycles, since branch B2 would issue in a cycle of its own. Removing all instructions except M would result in a loop of latency 6 cycles (as before), producing a speedup of 7/6 (= 1.17x) for the dynamically compiled code. On the other hand, if the multiply cost only 3 cycles (e.g., this is the latency for single precision floating-point multiplication on the MicroSparc II used in the tcc experiments) instead of 6, the speedup due to dynamic compilation would double to 7/3 (= 2.34x).

Comparing Speedups across Architectures

Example: pow

Summary

Detailed Explanation

Last updated January 12, 1998. Brian Grant (grant@cs.washington.edu)

Example: `pow`

Last updated January 12, 1998.
Brian Grant (grant@cs.washington.edu)