Written by Matthai Philipose

The main topics discussed were:
1) How to merge the setup calculation code with the stitcher code.
2) Figuring out whether setup and stitching should happen eagerly
   (as it does now) or lazily.

1) Integrating setup and stitcher:
----------------------------------
We agreed that the way to merge the setup and stitcher code was, for
each piece c1 of template code (each basic block? or each piece of
code between holes?) that uses runtime constants r1...rn, to insert
before c1 code to compute r1...rn if they are not already computed.

Consider, e.g., the dot-product example from Ruf et al.; the
conventional (non-dc) assembler for this could be:

Dot product of (x1,y1,z1), (x2,y2,z2), with scale factor scale:

    mult x1,x2,sx
    mult y1,y2,sy
    mult z1,z2,sz
    add  sx,sy,temp
    add  temp,sz,sum
    mult sum,scale,res

If x1,y1,x2,y2,scale are rtcs, our current ("old") approach would
produce:

setup:
------
    mult x1,x2,rtc1
    mult y1,y2,rtc2
    add  rtc1,rtc2,rtc3
    ;; store rtcs into table
    stl  rtc1,0(tbl)
    ...

template:
---------
    ;; <...> denotes a "hole"
    add  <...>,sz,sum
    mult sum,<...>,res
    ...and some directives

The new approach would produce, _in a single basic block_:

    ;; compute all rtcs used in the basic block
    ;; (corresponds to setup above)
  * mult x1,x2,rtc1
  * mult y1,y2,rtc2
  * add  rtc1,rtc2,rtc3
    ;; stitch in the template code on the basis
    ;; of these calculated values;
    ;; stitch(code) is the (inlined) assembler needed
    ;; for stitching the code "code"
  * stitch( add  <...>,sz,sum
  *         mult sum,<...>,res
  *       )
    ;; depending on our final design, we
    ;; may want to execute the
    ;; template instructions the first
    ;; time around:
    add  rtc1,sz,sum
    mult sum,scale,res

The idea is to produce the template code for the entire dynamic
region just as we do now (so that we get good scheduling between the
template instructions), and to introduce the "extra" code (each
"extra" line of code in the above example begins with a *) _after_
all the scheduling has taken place.
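To make the merged setup+stitch idea concrete, here is a minimal
sketch in Python. The function name make_dot_product and the
closure-based modelling of "stitched code" are my inventions for
illustration only; the real system emits assembly, not closures. The
point is just that the rtc computations run once, immediately before
the code that consumes them, and the remaining dynamic inputs (z1,z2)
flow through the residual code:

```python
def make_dot_product(x1, y1, x2, y2, scale):
    # "Setup": evaluate the runtime constants exactly once,
    # right before the template code that needs them.
    rtc1 = x1 * x2          # mult x1,x2,rtc1
    rtc2 = y1 * y2          # mult y1,y2,rtc2
    rtc3 = rtc1 + rtc2      # add  rtc1,rtc2,rtc3

    # "Stitch": plug the computed constants into the template's
    # holes; here the stitched code is modelled as a closure over
    # the still-dynamic inputs z1 and z2.
    def stitched(z1, z2):
        sz = z1 * z2        # mult z1,z2,sz
        s = rtc3 + sz       # add  <rtc3>,sz,sum
        return s * scale    # mult sum,<scale>,res
    return stitched

dot = make_dot_product(1, 2, 3, 4, 10)   # rtc3 = 1*3 + 2*4 = 11
print(dot(5, 6))                          # (11 + 5*6) * 10 = 410
```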
Problems to be addressed include:

1) What registers should the setup and stitcher code use?
   -- set some registers aside (- mflow can't use them: increases
      reg. pressure)
   -- save and restore needed registers (- runtime cost)
   -- use registers dead within the basic block (- will there be any?)
   -- one advantage of doing a setup-stitch before each instruction
      with a hole (instead of before each basic block containing
      holes) is that the number of extra registers needed is small

2) How to traverse the basic blocks of the above modified graph:
----------------------------------------------------------------
We figured there were two ways to solve this problem:

a) The eager approach (the one we currently use):
-------------------------------------------------
Given the rtcs for a given program run, stitch every basic block that
can possibly be traversed, and evaluate the rtcs used in those basic
blocks. This is the current approach used both by our stitcher and
our setup code.

The way the setup code does this right now (roughly) is to maintain a
worklist of blocks to be traversed, adding to the worklist at dynamic
branches and looping through the worklist, _executing_ the next block
on the list. One undesirable side effect of this structure is that
the control structure of the dynamic region (DR) becomes:

    loop over the worklist {
        bblock = worklist.first
        case bblock ...
          /  /  ...  |  ... \
        b1  b2      bi      bn
    }

where the bi are (roughly) the original basic blocks of the DR.

-- One big problem with this approach is that every temp in blocks
   b1...bn now becomes simultaneously live, breaking the Multiflow
   register allocator and forcing us to spill many (every?) temps to
   memory.
-- Another problem is, of course, the overhead of maintaining the
   worklist.
-- Any other problems? Joel said something about the semantics of
   "dynamic(...)" annotations on one side of a branch being difficult
   to implement...
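The eager worklist traversal above can be sketched as follows. The
dict-of-pairs CFG representation (block name mapped to an
rtc-computation and a successor-chooser) and the function name
eager_stitch are illustrative assumptions, not our actual data
structures; "stitching" a block is modelled as appending its name to
a trace:

```python
def eager_stitch(cfg, entry, rtcs):
    stitched = []                 # blocks stitched so far, in order
    worklist = [entry]
    while worklist:
        block = worklist.pop(0)   # bblock = worklist.first
        if block in stitched:
            continue              # never stitch the same block twice
        stitched.append(block)    # "execute" (i.e. stitch) the block
        compute_rtcs, successors = cfg[block]
        compute_rtcs(rtcs)        # evaluate the rtcs this block uses
        # at a dynamic branch, add every possibly-taken successor
        worklist.extend(successors(rtcs))
    return stitched

# Tiny diamond CFG: b1 branches dynamically to b2 or b3; both join at b4.
cfg = {
    "b1": (lambda r: r.update(t=r["n"] * 2), lambda r: ["b2", "b3"]),
    "b2": (lambda r: None,                   lambda r: ["b4"]),
    "b3": (lambda r: None,                   lambda r: ["b4"]),
    "b4": (lambda r: None,                   lambda r: []),
}
print(eager_stitch(cfg, "b1", {"n": 21}))  # stitches all of b1..b4
```

Note that both sides of the dynamic branch out of b1 get stitched,
whether or not they will ever execute -- exactly the behaviour the
lazy approach below tries to avoid.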
[Note that a worklist approach is needed because, for an unstructured
CFG g, we haven't figured out a way to statically construct the
corresponding graph g'_rtcvals which speculatively stitches the right
blocks of g, given "rtcvals", the runtime values of the various
rtconsts. (Craig's suggestion of doing the exponential-size expansion
g'' of g, followed by an optimizing phase that tries to clump
subgraphs of g'' to make it non-exponential, is the only insight we
have in this direction.)]

Joel suggested the optimization of finding linearizable subgraphs of
the CFG for the dynamic region; instead of switching on the next
_basic block_ bi, as in the above picture, we should divide the graph
into larger subgraphs (Matthai suggested these subgraphs should be
single-entry single-exit regions; Joel had another suggestion...
(Joel, clarify)). The advantage is that for completely structured
graphs, the above loop (modified to iterate over subgraphs rather
than blocks) would have only one case to switch over, so no
live-temps problem, no worklist overhead, etc.; further, for less
structured graphs, the number of temps to be spilled across these
larger regions of the graph will presumably be smaller than the
number that need to be spilled across basic blocks.

b) The lazy approach:
---------------------
The idea here is to defer stitching a piece of code, and evaluating
the corresponding runtime constants, until the program needs to
execute the code. Thus, the not-taken sides of static branches will
automatically be left unstitched. The not-taken sides of dynamic
branches will be left unstitched too, until they need to be executed.

The first question is: during an execution of the stitched region,
what happens if we come across a dynamic branch where the side of the
branch to be executed hasn't yet been stitched?
We figured all unstitched branch instructions should be patched to
point to fixup code which will:

 i) re-patch the stitched branch instruction to point to where the
    body of the branch will be stitched
ii) jump to the setup/stitcher code for stitching the body of the
    branch (before the jump, the fixup has to make sure that the
    values in registers are what the code in the body, which we are
    jumping to, expects; on the other hand, if we stitch things
    right, maybe the stitched code and the template code use the same
    registers for the same instructions)

We need some way for the fixup code to figure out which branch
instruction jumped to it; the fastest way to do this is probably to
have one fixup fragment per dynamic branch. Each fixup fragment will
then know which branch jumped to it (so it can patch that branch),
and where in the setup/stitcher code it has to branch to. Each
fragment will probably be about 4-5 assembly instructions long.

Once we begin executing the stitcher/setup code, we need some way to
figure out if some piece of code we are about to execute has been
stitched before; we probably need to add something like a check per
basic block for this purpose (most of these checks can probably be
avoided for structured subgraphs of the original graph). When we come
to a piece of code which has already been stitched (say at location
L), we need to stop stitching, and stitch in a branch to L.
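The patch-on-first-use scheme can be sketched as below. I'm modelling
"patching a branch instruction" as swapping an entry in a dispatch
table, and the names lazy_region, make_fixup, and the
templates dict are all invented for the sketch; in the real system
the fixup fragment rewrites the branch target in memory instead:

```python
def lazy_region(templates):
    targets = {}      # branch label -> callable (stitched code or fixup)
    stitch_log = []   # order in which bodies actually get stitched

    def make_fixup(label):
        # one fixup fragment per dynamic branch: it knows which
        # branch jumped to it, so it can patch exactly that branch
        def fixup(*args):
            stitch_log.append(label)       # stitch the branch body now
            stitched = templates[label]
            targets[label] = stitched      # (i) re-patch the branch
            return stitched(*args)         # (ii) run the freshly stitched code
        return fixup

    # initially every dynamic-branch target points at its fixup
    for label in templates:
        targets[label] = make_fixup(label)
    return targets, stitch_log

templates = {"then": lambda x: x + 1, "else": lambda x: x - 1}
targets, log = lazy_region(templates)
targets["then"](5)   # first execution: goes through fixup, stitches "then"
targets["then"](7)   # branch already patched: no fixup, no re-stitching
print(log)           # only "then" was ever stitched; "else" never was
```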
We then have two options:

1) Jump back to the stitched version (especially good if no more
   areas need to be stitched); again, if we co-ordinate register
   names in the setup/stitch code and the stitched code, we probably
   don't have to restore registers here.
2) Continue executing the setup/stitch code, not stitching any basic
   blocks which have already been stitched; the advantage here is
   that _if_ there's more stuff to be stitched, we don't have to
   traverse fixup code (see (i) & (ii) above). But I guess we'll have
   to special-case the patching of the dynamic branch (which led to
   this particular region not being stitched previously) anyway...

(1) sounds like a better idea.

Finally, the optimization that Joel suggested, of glomming structured
subgraphs (SESE regions?) of the CFG, will also be beneficial here:
for any such subgraph, we essentially eagerly stitch the entire
subgraph, and avoid paying any of the fixup overhead above at some
later stage.

The attraction of this approach is that if certain branches are never
taken, we never stitch them. Also, we pay the extra price (for fixup)
only in the unstructured parts of the graph. We avoid having to
choose any traversal of the entire graph, and the problem of a very
large number of live temps goes away. Also (Joel--pl. clarify), the
problem with "dynamic" annotations in the middle of the dynamic
region goes away.

Detracting from the approach, it's still missing a lot of details. We
haven't considered single/multi-way loop unrolling, for instance, and
the exact cost (time and space) of the fixup code is not clear. Joel
says the eager approach can be improved to get order(s?) of magnitude
speedup... so maybe it is sufficient. I don't have a feel for how
either method scales (lazy seems more promising...).

-------------------------------------------------------
To summarize, we're pretty clear on how to integrate the setup and
stitcher code without sacrificing the good scheduling we get
throughout the entire dynamic region.
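The glomming idea combined with option (1) suggests a hybrid that can
be sketched as below: stitch a whole structured region eagerly on
first entry, and pay the per-entry "already stitched?" check only at
region boundaries rather than per basic block. The precomputed
regions dict and the name make_region_stitcher are hypothetical
stand-ins for whatever region-finding analysis we end up with:

```python
def make_region_stitcher(regions):
    stitched_regions = set()
    stitch_log = []

    def enter(region):
        # one check per region entry; the blocks *inside* a region
        # need no individual already-stitched checks
        if region not in stitched_regions:
            stitched_regions.add(region)
            stitch_log.extend(regions[region])  # eagerly stitch whole region
        return stitch_log
    return enter

# two single-entry single-exit regions of some CFG (illustrative only)
regions = {"R1": ["b1", "b2", "b3"], "R2": ["b4", "b5"]}
enter = make_region_stitcher(regions)
enter("R1")           # stitches b1..b3 in one go
enter("R1")           # already stitched: the check fails fast, no work
print(enter("R2"))    # ['b1', 'b2', 'b3', 'b4', 'b5']
```

If R2 is never entered, b4 and b5 are never stitched, which is the
laziness we want across the unstructured parts of the graph.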
On the other hand, we are unclear on whether setup/stitching should
happen in an eager or a lazy fashion, and are leaning towards a
combination of the two.

This writeup probably has lots of holes and bugs. Feel free to get
back to me with corrections!

Matthai