Written by Matthai Philipose

The main topics discussed were:
1) How to merge the setup calculation code with the stitcher code.
2) Figuring out whether setup and stitching should happen eagerly
   (as it does now) or lazily.

1) Integrating setup and stitcher:
----------------------------------
We agreed that the way to merge the setup and stitcher code was, for
each piece c1 of template code (each basic block? or each piece of
code between holes?) that uses runtime constants r1...rn, to insert
before c1 code to compute r1...rn if they are not already computed.

Consider, e.g., the dot-product example from Ruf et al.; the
conventional (non-dc) assembler for this could be:

Dot product of (x1,y1,z1), (x2,y2,z2), with scale factor scale:

    mult x1,x2,sx
    mult y1,y2,sy
    mult z1,z2,sz
    add  sx,sy,temp
    add  temp,sz,sum
    mult sum,scale,res

If x1,y1,x2,y2,scale are rtcs, our current ("old") approach would
produce:

setup:
------
    mult x1,x2,rtc1
    mult y1,y2,rtc2
    add  rtc1,rtc2,rtc3
    ;; store rtcs into table
    stl  rtc1,0(tbl)
    ...

template:
---------
    ;; <...> denotes a "hole"
    add  <...>,sz,sum
    mult sum,<...>,res
    ...and some directives

The new approach would produce, _in a single basic block_:

    ;; compute all rtcs used in the basic block
    ;; (corresponds to setup above)
  * mult x1,x2,rtc1
  * mult y1,y2,rtc2
  * add  rtc1,rtc2,rtc3
    ;; stitch in the template code on the basis
    ;; of these calculated values;
    ;; stitch(code) is the (inlined) assembler needed
    ;; for stitching the code "code"
  * stitch( add  <...>,sz,sum
  *         mult sum,<...>,res
  *       )
    ;; depending on our final design, we
    ;; may want to execute the
    ;; template instructions the first
    ;; time around:
    add  rtc1,sz,sum
    mult sum,scale,res

The idea is to produce the template code for the entire dynamic
region just as we do now (so that we get good scheduling between the
template instructions), and to introduce the "extra" code (each
"extra" line of code in the above example begins with a *) _after_
all the scheduling has taken place.
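To make the merged setup+stitch idea concrete, here is a minimal
sketch in Python. The function name make_dot_product and the
closure-based modelling of "stitched code" are my inventions for
illustration only; the real system emits assembly, not closures. The
point is just that the rtc computations run once, immediately before
the code that consumes them, and the remaining dynamic inputs (z1,z2)
flow through the residual code:

```python
def make_dot_product(x1, y1, x2, y2, scale):
    # "Setup": evaluate the runtime constants exactly once,
    # right before the template code that needs them.
    rtc1 = x1 * x2          # mult x1,x2,rtc1
    rtc2 = y1 * y2          # mult y1,y2,rtc2
    rtc3 = rtc1 + rtc2      # add  rtc1,rtc2,rtc3

    # "Stitch": plug the computed constants into the template's
    # holes; here the stitched code is modelled as a closure over
    # the still-dynamic inputs z1 and z2.
    def stitched(z1, z2):
        sz = z1 * z2        # mult z1,z2,sz
        s = rtc3 + sz       # add  <rtc3>,sz,sum
        return s * scale    # mult sum,<scale>,res
    return stitched

dot = make_dot_product(1, 2, 3, 4, 10)   # rtc3 = 1*3 + 2*4 = 11
print(dot(5, 6))                          # (11 + 5*6) * 10 = 410
```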
Problems to be addressed include:

1) What registers should the setup and stitcher code use?
   -- set some registers aside (- mflow can't use them: increases
      reg. pressure)
   -- save and restore needed registers (- runtime cost)
   -- use registers dead within the basic block (- will there be any?)
   -- one advantage of doing a setup-stitch before each instruction
      with a hole (instead of before each basic block containing
      holes) is that the number of extra registers needed is small

2) How to traverse the basic blocks of the above modified graph:
----------------------------------------------------------------
We figured there were two ways to solve this problem:

a) The eager approach (the one we currently use):
-------------------------------------------------
Given the rtcs for a given program run, stitch every basic block that
can possibly be traversed, and evaluate the rtcs used in those basic
blocks. This is the current approach used both by our stitcher and
our setup code.

The way the setup code does this right now (roughly) is to maintain a
worklist of blocks to be traversed, adding to the worklist at dynamic
branches and looping through the worklist, _executing_ the next block
on the list. One undesirable side effect of this structure is that
the control structure of the dynamic region (DR) becomes:

    loop over the worklist {
        bblock = worklist.first
        case bblock ...
          /  /  ...  |  ... \
        b1  b2      bi      bn
    }

where the bi are (roughly) the original basic blocks of the DR.

-- One big problem with this approach is that every temp in blocks
   b1...bn now becomes simultaneously live, breaking the Multiflow
   register allocator and forcing us to spill many (every?) temps to
   memory.
-- Another problem is, of course, the overhead of maintaining the
   worklist.
-- Any other problems? Joel said something about the semantics of
   "dynamic(...)" annotations on one side of a branch being difficult
   to implement...
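The eager worklist traversal above can be sketched as follows. The
dict-of-pairs CFG representation (block name mapped to an
rtc-computation and a successor-chooser) and the function name
eager_stitch are illustrative assumptions, not our actual data
structures; "stitching" a block is modelled as appending its name to
a trace:

```python
def eager_stitch(cfg, entry, rtcs):
    stitched = []                 # blocks stitched so far, in order
    worklist = [entry]
    while worklist:
        block = worklist.pop(0)   # bblock = worklist.first
        if block in stitched:
            continue              # never stitch the same block twice
        stitched.append(block)    # "execute" (i.e. stitch) the block
        compute_rtcs, successors = cfg[block]
        compute_rtcs(rtcs)        # evaluate the rtcs this block uses
        # at a dynamic branch, add every possibly-taken successor
        worklist.extend(successors(rtcs))
    return stitched

# Tiny diamond CFG: b1 branches dynamically to b2 or b3; both join at b4.
cfg = {
    "b1": (lambda r: r.update(t=r["n"] * 2), lambda r: ["b2", "b3"]),
    "b2": (lambda r: None,                   lambda r: ["b4"]),
    "b3": (lambda r: None,                   lambda r: ["b4"]),
    "b4": (lambda r: None,                   lambda r: []),
}
print(eager_stitch(cfg, "b1", {"n": 21}))  # stitches all of b1..b4
```

Note that both sides of the dynamic branch out of b1 get stitched,
whether or not they will ever execute -- exactly the behaviour the
lazy approach below tries to avoid.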
[Note that a worklist approach is needed because, for an unstructured
CFG g, we haven't figured out a way to statically construct the
corresponding graph g'_rtcvals which speculatively stitches the right
blocks of g, given "rtcvals", the runtime values of the various
rtconsts. (Craig's suggestion of doing the exponential-size expansion
g'' of g, followed by an optimizing phase that tries to clump
subgraphs of g'' to make it non-exponential, is the only insight we
have in this direction.)]

Joel suggested the optimization of finding linearizable subgraphs of
the CFG for the dynamic region; instead of switching on the next
_basic block_ bi, as in the above picture, we should divide the graph
into larger subgraphs (Matthai suggested these subgraphs should be
single-entry single-exit regions; Joel had another suggestion...
(Joel, clarify)). The advantage is that for completely structured
graphs, the above loop (modified to iterate over subgraphs rather
than blocks) would have only one case to switch over, so no
live-temps problem, no worklist overhead, etc.; further, for less
structured graphs, the number of temps to be spilled across these
larger regions of the graph will presumably be smaller than the
number that need to be spilled across basic blocks.

b) The lazy approach:
---------------------
The idea here is to defer stitching a piece of code, and evaluating
the corresponding runtime constants, until the program needs to
execute the code. Thus, the not-taken sides of static branches will
automatically be left unstitched. The not-taken sides of dynamic
branches will be left unstitched too, until they need to be executed.

The first question is: during an execution of the stitched region,
what happens if we come across a dynamic branch where the side of the
branch to be executed hasn't yet been stitched?
We figured all unstitched branch instructions should be patched to
point to fixup code which will:

 i) re-patch the stitched branch instruction to point to where the
    body of the branch will be stitched
ii) jump to the setup/stitcher code for stitching the body of the
    branch (before the jump, the fixup has to make sure that the
    values in registers are what the code in the body, which we are
    jumping to, expects; on the other hand, if we stitch things
    right, maybe the stitched code and the template code use the same
    registers for the same instructions)

We need some way for the fixup code to figure out which branch
instruction jumped to it; the fastest way to do this is probably to
have one fixup fragment per dynamic branch. Each fixup fragment will
then know which branch jumped to it (so it can patch that branch),
and where in the setup/stitcher code it has to branch to. Each
fragment will probably be about 4-5 assembly instructions long.

Once we begin executing the stitcher/setup code, we need some way to
figure out if some piece of code we are about to execute has been
stitched before; we probably need to add something like a check per
basic block for this purpose (most of these checks can probably be
avoided for structured subgraphs of the original graph). When we come
to a piece of code which has already been stitched (say at location
L), we need to stop stitching, and stitch in a branch to L.
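The patch-on-first-use scheme can be sketched as below. I'm modelling
"patching a branch instruction" as swapping an entry in a dispatch
table, and the names lazy_region, make_fixup, and the
templates dict are all invented for the sketch; in the real system
the fixup fragment rewrites the branch target in memory instead:

```python
def lazy_region(templates):
    targets = {}      # branch label -> callable (stitched code or fixup)
    stitch_log = []   # order in which bodies actually get stitched

    def make_fixup(label):
        # one fixup fragment per dynamic branch: it knows which
        # branch jumped to it, so it can patch exactly that branch
        def fixup(*args):
            stitch_log.append(label)       # stitch the branch body now
            stitched = templates[label]
            targets[label] = stitched      # (i) re-patch the branch
            return stitched(*args)         # (ii) run the freshly stitched code
        return fixup

    # initially every dynamic-branch target points at its fixup
    for label in templates:
        targets[label] = make_fixup(label)
    return targets, stitch_log

templates = {"then": lambda x: x + 1, "else": lambda x: x - 1}
targets, log = lazy_region(templates)
targets["then"](5)   # first execution: goes through fixup, stitches "then"
targets["then"](7)   # branch already patched: no fixup, no re-stitching
print(log)           # only "then" was ever stitched; "else" never was
```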
We then have two options:

1) Jump back to the stitched version (especially good if no more
   areas need to be stitched); again, if we co-ordinate register
   names in the setup/stitch code and the stitched code, we probably
   don't have to restore registers here.
2) Continue executing the setup/stitch code, not stitching any basic
   blocks which have already been stitched; the advantage here is
   that _if_ there's more stuff to be stitched, we don't have to
   traverse fixup code (see (i) & (ii) above). But I guess we'll have
   to special-case the patching of the dynamic branch (which led to
   this particular region not being stitched previously) anyway...

(1) sounds like a better idea.

Finally, the optimization that Joel suggested, of glomming structured
subgraphs (SESE regions?) of the CFG, will also be beneficial here:
for any such subgraph, we essentially eagerly stitch the entire
subgraph, and avoid paying any of the fixup overhead above at some
later stage.

The attraction of this approach is that if certain branches are never
taken, we never stitch them. Also, we pay the extra price (for fixup)
only in the unstructured parts of the graph. We avoid having to
choose any traversal of the entire graph, and the problem of a very
large number of live temps goes away. Also (Joel--pl. clarify), the
problem with "dynamic" annotations in the middle of the dynamic
region goes away.

Detracting from the approach, it's still missing a lot of details. We
haven't considered single/multi-way loop unrolling, for instance, and
the exact cost (time and space) of the fixup code is not clear. Joel
says the eager approach can be improved to get order(s?) of magnitude
speedup... so maybe it is sufficient. I don't have a feel for how
either method scales (lazy seems more promising...).

-------------------------------------------------------
To summarize, we're pretty clear on how to integrate the setup and
stitcher code without sacrificing the good scheduling we get
throughout the entire dynamic region.
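The glomming idea combined with option (1) suggests a hybrid that can
be sketched as below: stitch a whole structured region eagerly on
first entry, and pay the per-entry "already stitched?" check only at
region boundaries rather than per basic block. The precomputed
regions dict and the name make_region_stitcher are hypothetical
stand-ins for whatever region-finding analysis we end up with:

```python
def make_region_stitcher(regions):
    stitched_regions = set()
    stitch_log = []

    def enter(region):
        # one check per region entry; the blocks *inside* a region
        # need no individual already-stitched checks
        if region not in stitched_regions:
            stitched_regions.add(region)
            stitch_log.extend(regions[region])  # eagerly stitch whole region
        return stitch_log
    return enter

# two single-entry single-exit regions of some CFG (illustrative only)
regions = {"R1": ["b1", "b2", "b3"], "R2": ["b4", "b5"]}
enter = make_region_stitcher(regions)
enter("R1")           # stitches b1..b3 in one go
enter("R1")           # already stitched: the check fails fast, no work
print(enter("R2"))    # ['b1', 'b2', 'b3', 'b4', 'b5']
```

If R2 is never entered, b4 and b5 are never stitched, which is the
laziness we want across the unstructured parts of the graph.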
On the other hand, we are unclear on whether setup/stitching should
happen in an eager or a lazy fashion, and are leaning towards a
combination of the two.

This writeup probably has lots of holes and bugs. Feel free to get
back to me with corrections!

Matthai