1. Loop Unrolling [40 marks]
Consider the following loop:
loop: l.d f4,0(r1) l1
l.d f6,0(r2) l2
mul.d f4,f4,f0 m1
mul.d f6,f6,f2 m2
add.d f4,f4,f6 a1
s.d f4,0(r1) s1
daddui r1,r1,#-8 sub1
daddui r2,r2,#-8 sub2
bnez r1,loop br
Note: Our default is that FP arithmetics have 4 x-boxes.
a) [4 marks] Using the names ‘l1’ to ‘s1’ for the first six instructions
in the body of this loop, draw the flow-dependence graph for just these
instructions. Label each arrow with the dependence gap between the
producer and the consumer.
graph:
In what follows, focus on three flow-dependence types: i) FP arith to
FP arith, ii) FP arith to FP store, and iii) FP load to FP arith. Denote
the number of m-boxes in memory references by ‘#m’, and the number of
x-boxes in FP arithmetics by ‘#x’. By the way, memrefs have one x-box,
and arithmetics have one (never-used) m-box.
b) [6 marks] For each of the three designated flow-dependence types,
indicate the number of stalls in adjacent producer-consumer pairs as
functions of ‘#m” and ‘#x’.
FP arith —> FP arith =
FP arith —> FP store =
FP load —> FP arith =
c) [10 marks] Suppose #m = 1 and #x = 4. How many stalls occur in one
iteration of the loop if it is executed exactly as written? Don’t
write out a space-time diagram; use the method of Chapter 3.
ans:
e) [10 marks] Unroll the loop twice. If one reschedules the unrolled
loop optimally, how many stalls are left? (Keep the branch as the last
instruction. Show the rescheduled code using the _short_ names.)
ans:
f) [10 marks] Increases the ‘mul-add dependence gap’ to 5 cycles, leaving
everything else unchanged. Unroll the loop three times. If one reschedules
the unrolled loop optimally, how many stalls are left? (Keep the branch as
the last instruction. Show the rescheduled code using the _short_ names.)
ans:
2. Dynamic Instruction Scheduling I [30 marks]
Imagine that reservation stations only track whether floating-point operands
are valid, and that integer operands appear by magic whenever needed.
a) [10 marks] Stage instructions ‘l1’ to ‘s1’ to reservation stations rs1(l1)
to rs6(s1). Show the contents of each reservation station. Indicate both
present and pending operands in each station. For the loads, you may make
the operand entries ‘blank’. That is, mark all load dependences as resolved.
For the store, just invent a regular single-operand reservation station.
rs1(l1):
rs2(l2):
rs3(m1):
rs4(m2):
rs5(a1):
rs6(s1):
b) [10 marks] After both loads have completed, but no further action has
occurred, show the contents of each reservation station. You may mark
entire reservation stations as ‘free’. Syntax: result[rs9(a9)]; val[f12].
You may alter the tag inside parentheses if warranted.
rs1(l1):
rs2(l2):
rs3(m1):
rs4(m2):
rs5(a1):
rs6(s1):
c) [10 marks] At this point, dispatch the two loads of the second iteration,
viz., ‘l3’ and ‘l4’, into reservation stations ‘rs1’ and ‘rs2’. Moreover,
let instruction ‘m1’ of the first iteration complete. Show the contents of
each reservation station.
rs1(l3):
rs2(l4):
rs3(m1):
rs4(m2):
rs5(a1):
rs6(s1):
3. Dynamic Instruction Scheduling II [15 marks]
a) [5 marks] Find a flow dependence in the original code, and show that it
cannot cause harm.
ans:
b) [5 marks] Find an antidependence in the original code, and show that it
cannot cause harm.
ans:
c) [5 marks] Find an output dependence in the original code, and show that
it cannot cause harm.
ans:
4. Load Buffers and Store Buffers [15 marks]
a) [5 marks] Consider two code sequences:
l.d f4,0(r1) l.d f6,0(r1)
s.d f6,0(r1) s.d f6,0(r1)
Show that ignoring the register dependence and respecting the memory
dependence provides a uniform, correct solution to issue delay in
both code sequences.
ans:
b) [5 marks] Consider:
s.d f4,0(r1)
l.d f6,0(r1)
Assume these two instructions have been loaded into a store buffer and
a load buffer, respectively. There is a memory dependence. If we
delay the issue of the load until the store “completes”, operationally
what do we mean by “complete”? (Forwarding is a different approach).
ans:
c) [5 marks] A _speculated_ load is allowed to progress through more
stages of an out-of-order pipeline compared to a _speculated_ store.
What is difference and why is it necessary?