### Credits

# EE 457 Unit 9b

Tomasulo Part 2: In-Order Completion and Speculation

- Some of the material in this presentation is taken from:
  - Computer Architecture: A Quantitative Approach
    - John Hennessy & David Patterson
- Some of the material in this presentation is derived from course notes and slides from
  - Prof. Michel Dubois (USC)
  - Prof. Murali Annavaram (USC)
  - Prof. David Patterson (UC Berkeley)



**USC** Viterbi


### Tomasulo w/ Speculative Execution

#### **Tomasulo 1**

- In-order Issue
- Out-of-Order Execution
- Out-of-order Completion
  - Completion = Commit = Update state = Write to Reg./Mem.

- No speculative execution beyond branches (stall dispatch until branch is resolved)
- No precise exceptions

#### Tomasulo 2

- In-order Issue
- Out-of-Order Execution
- \_\_\_\_\_ Completion
- Plus, we now allow "Speculative" Execution
- Execute out of order but don't write reg/memory immediately; instead "\_\_\_\_\_" (\_\_\_\_\_\_ store) results and commit in-order.
- Can speculate branch outcomes and dispatch down a pathway before they execute, flushing instruction results if we are wrong
- Support precise exceptions!

### Changes to Tomasulo Part 1

- Removed structures:
  - No more TAG FIFO: Use ROB location (write pointer) as TAG of the instruction
  - No more RST (Register Status Table): Instead do an associative search of the ROB
  - D-Cache was shown in one place (LW and SW used it in the same place)

- New Structures:
  - ROB (\_\_\_\_\_\_): Enables in-order completion and \_\_\_\_\_\_after misspeculated branch
  - BPB (Branch Prediction Buffer): Enables speculating (issuing instructions) past branches
  - SAB (Store Address Buffer): Helps with memory disambiguation
  - D-Cache is now shown in two places (LW and SW use it at different places/times)



# Re-Order Buffer (ROB) Structure

- ROB is a FIFO + \_\_\_\_\_ Access
  - In a modern system: 128-256 locations
- WP = Write pointer
  - Used by Dispatch Unit
  - Each instruction issues in order and "takes a number" (its "\_\_\_\_\_")
- Instructions can write results to their ROB entries (out of order) whenever they execute and put their results on the CDB
- RP = Read pointer =
  - Used for committing (allow writeback for) the most senior / oldest instruction when it has completed without generating an exception

|               | Index | Valid | Comp | Rd   | RegWr | Result | Others |
|---------------|-------|-------|------|------|-------|--------|--------|
|               | 0     | 0     | 0    | 0    | 1     |        |        |
|               | 1     | 0     | 0    | \$2  | 1     |        |        |
|               | 2     | 0     | 0    | 0    | 0     |        |        |
| Top (rp) →    | 3     | 1     | 1    | \$1  | 1     |        |        |
|               | 4     | 1     | 0    | \$2  | 1     |        |        |
|               | 5     | 1     | 0    | \$15 | 1     |        |        |
|               | 6     | 1     | 1    | \$2  | 1     |        |        |
|               | 7     | 1     | 1    | \$6  | 1     |        |        |
|               | 8     | 1     | 0    | \$2  | 0     |        |        |
| Bottom (wp) → | 9     | 1     | 0    | \$7  | 0     |        |        |
|               | 10    | 0     | 0    | \$13 | 1     |        |        |
|               | 11    | 0     | 0    | 0    | 1     |        |        |
|               | 12    | 0     | 0    | \$4  | 0     |        |        |
|               | 13    | 0     | 0    | \$2  | 1     |        |        |
|               | 14    | 0     | 0    | 0    | 1     |        |        |
|               | 15    | 0     | 0    | 0    | 0     |        |        |

Note: Valid is not needed (uses items from RP to WP). Others: MemWrite (SW), MemAddr
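The FIFO-plus-random-access behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `RobEntry`, `Rob`, `dispatch`, `cdb_write`, and `commit` are hypothetical, the write pointer is assumed to point at the next free slot (the diagram's convention may differ), an occupancy counter is assumed for telling full from empty, and exceptions are not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobEntry:
    comp: bool = False            # result has been written (Comp column)
    rd: int = 0                   # destination register number (Rd column)
    reg_wr: bool = False          # instruction writes a register (RegWr column)
    result: Optional[int] = None  # buffered result (Result column)


class Rob:
    """Circular FIFO with random-access result writes (hypothetical sketch)."""

    def __init__(self, size: int = 16):
        self.size = size
        self.entries = [RobEntry() for _ in range(size)]
        self.rp = 0     # read pointer: oldest in-flight instruction
        self.wp = 0     # write pointer: ASSUMED to be the next free slot
        self.count = 0  # ASSUMED occupancy counter (tells full from empty)

    def dispatch(self, rd: int, reg_wr: bool = True) -> Optional[int]:
        """Allocate the next entry in program order; its index is the tag."""
        if self.count == self.size:
            return None  # ROB full: dispatch must stall
        tag = self.wp
        self.entries[tag] = RobEntry(rd=rd, reg_wr=reg_wr)
        self.wp = (self.wp + 1) % self.size
        self.count += 1
        return tag

    def cdb_write(self, tag: int, value: int) -> None:
        """Random-access write: an executed instruction buffers its result."""
        entry = self.entries[tag]
        entry.result = value
        entry.comp = True

    def commit(self, regfile: dict) -> Optional[int]:
        """Retire the oldest instruction only if it has completed."""
        if self.count == 0 or not self.entries[self.rp].comp:
            return None
        entry = self.entries[self.rp]
        if entry.reg_wr:
            regfile[entry.rd] = entry.result
        tag = self.rp
        self.rp = (self.rp + 1) % self.size
        self.count -= 1
        return tag


rob = Rob(size=4)
regs = {}
t0 = rob.dispatch(rd=2)
t1 = rob.dispatch(rd=6)
rob.cdb_write(t1, 99)       # the younger instruction finishes first
stalled = rob.commit(regs)  # nothing retires: the oldest entry is not complete
rob.cdb_write(t0, 7)
first = rob.commit(regs)
second = rob.commit(regs)
```

In this run the younger instruction completes first, yet the register file is updated strictly in dispatch order.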

# Dispatch and the ROB

- No more token FIFO (for tagging instructions) as in OoO execution and completion
  - The ROB location (write pointer) is your \_\_\_\_\_ and is allocated for an instruction on issue/dispatch
  - When an instruction finishes executing, its result is buffered in the ROB entry until it can be committed safely
- It does not use the RST (Register Status Table) as before (because of the difficulty of implementing speculative execution with it)
  - When an instruction is dispatched, the ROB is searched for its source register (Rs and/or Rt) producers
    - Unproduced: If an entry in the ROB is producing Rs/Rt but has NOT YET EXECUTED, the ROB tag/slot of the producer is taken with the dependent instruction
    - Produced: If an entry in the ROB is producing Rs/Rt and the result is PRODUCED BUT WAITING TO BE COMMITTED, that value is taken with the dependent instruction
    - Unfound: If no entry in the ROB is producing Rs/Rt, DATA IN THE REGISTER FILE IS THE LATEST value and is taken with the dependent instruction
  - Since multiple entries in the ROB may match Rs/Rt, a priority resolver is necessary

# Re-Order Buffer (ROB) Structure


- We will not use the RST (Register Status Table), though this may vary depending on implementation
- On instruction dispatch, the ROB is searched for the producers of the instruction's source registers (Rs and/or Rt), and each source operand is found in one of three ways:
  - Unproduced (e.g. add \$8, **\$2, \$2**)
    - Situation: Producer still waiting to \_\_\_\_\_\_
    - Action: Take \_\_\_\_\_\_ of producer (ROB8)
  - Produced (e.g. add \$8, **\$6, \$6**)
    - Situation: Producer executed and is waiting to \_\_\_\_\_\_
    - Action: Take \_\_\_\_\_ from ROB (data from ROB7)
  - Unfound (e.g. add \$8, **\$3, \$3**)
    - Situation: Latest value is in \_\_\_\_\_\_
    - Action: Take value from RegFile
- Since multiple entries in the ROB may match Rs/Rt, a priority resolver is necessary (e.g. \$2)

|               | Index | Valid | Comp | Rd   | RegWr | Result | Others |
|---------------|-------|-------|------|------|-------|--------|--------|
|               | 0     | 0     | 0    | 0    | 1     |        |        |
|               | 1     | 0     | 0    | \$2  | 1     |        |        |
|               | 2     | 0     | 0    | 0    | 0     |        |        |
| Top (rp) →    | 3     | 1     | 1    | \$1  | 1     |        |        |
|               | 4     | 1     | 0    | \$2  | 1     |        |        |
|               | 5     | 1     | 0    | \$15 | 1     |        |        |
|               | 6     | 1     | 1    | \$2  | 1     |        |        |
|               | 7     | 1     | 1    | \$6  | 1     |        |        |
|               | 8     | 1     | 0    | \$2  | 0     |        |        |
| Bottom (wp) → | 9     | 1     | 0    | \$7  | 0     |        |        |
|               | 10    | 0     | 0    | \$13 | 1     |        |        |
|               | 11    | 0     | 0    | 0    | 1     |        |        |
|               | 12    | 0     | 0    | \$4  | 0     |        |        |
|               | 13    | 0     | 0    | \$2  | 1     |        |        |
|               | 14    | 0     | 0    | 0    | 1     |        |        |
|               | 15    | 0     | 0    | 0    | 0     |        |        |

Note: Valid is not needed (uses items from RP to WP). Others: MemWrite (SW), MemAddr
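As a rough illustration of the three lookup outcomes, the sketch below searches the in-flight entries from youngest to oldest so that the most recent producer wins. It mirrors the Comp and Rd columns of entries 3 to 9 in the table, but the function name `lookup_source`, the `(comp, rd, result)` tuple layout, and all result and register-file values are made up for the example (the slide leaves the Result column empty). It also assumes every in-flight entry shown writes its Rd, so RegWr is not consulted.

```python
def lookup_source(rob, rp, count, regfile, src):
    """Return ("tag", rob_index) or ("value", data) for source register src.

    rob     : list of (comp, rd, result) tuples, None for unused slots
    rp      : index of the oldest in-flight entry
    count   : number of in-flight entries starting at rp (ASSUMED bookkeeping)
    """
    size = len(rob)
    # Walk from the youngest entry (largest depth) back to the oldest.
    for depth in range(count - 1, -1, -1):
        index = (rp + depth) % size
        comp, rd, result = rob[index]
        if rd == src:
            # Produced: take the buffered value. Unproduced: take the tag.
            return ("value", result) if comp else ("tag", index)
    # Unfound: the register file holds the latest value.
    return ("value", regfile[src])


rob = [None] * 16
rob[3] = (True, 1, 11)     # hypothetical result values throughout
rob[4] = (False, 2, None)
rob[5] = (False, 15, None)
rob[6] = (True, 2, 22)
rob[7] = (True, 6, 66)
rob[8] = (False, 2, None)
rob[9] = (False, 7, None)
regfile = {r: 100 + r for r in range(32)}  # hypothetical register contents

src2 = lookup_source(rob, rp=3, count=7, regfile=regfile, src=2)
src6 = lookup_source(rob, rp=3, count=7, regfile=regfile, src=6)
src3 = lookup_source(rob, rp=3, count=7, regfile=regfile, src=3)
```

Three entries (4, 6, and 8) name \$2 as their destination; the youngest-first walk is one simple way to model the priority resolver, and it picks ROB8 as the slide does.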

School of Engineering

## Not Just a FIFO: ROB Interfaces

- ROB has many interfaces
  - RP, WP: work like a FIFO (sequential access)
  - Rs, Rt: source register/tag lookup (associative search)
  - CDB: write execution results (index / random access)



## **ROB DEPTH AND PRIORITY RESOLUTION**

# Motivation for finding ROB Depth

- How do we determine the correct ROB entry to use when trying to obtain our source registers?
  - e.g. add \$8, **\$2, \$2**
- We need to understand ROB depth calculation and priority resolution
- In the diagram, how many instructions are waiting in the ROB?
  - Answer:
- Can we just use the LARGEST valid index that matches the desired register?

|               | Index | Valid | Comp | Rd   | RegWr | Result | Others |
|---------------|-------|-------|------|------|-------|--------|--------|
|               | 0     | 0     | 0    | 0    | 1     |        |        |
|               | 1     | 0     | 0    | \$2  | 1     |        |        |
|               | 2     | 0     | 0    | 0    | 0     |        |        |
| Top (rp) →    | 3     | 1     | 1    | \$1  | 1     |        |        |
|               | 4     | 1     | 0    | \$2  | 1     |        |        |
|               | 5     | 1     | 0    | \$15 | 1     |        |        |
|               | 6     | 1     | 1    | \$2  | 1     |        |        |
|               | 7     | 1     | 1    | \$6  | 1     |        |        |
|               | 8     | 1     | 0    | \$2  | 0     |        |        |
| Bottom (wp) → | 9     | 1     | 0    | \$7  | 0     |        |        |
|               | 10    | 0     | 0    | \$13 | 1     |        |        |
|               | 11    | 0     | 0    | \$0  | 1     |        |        |
|               | 12    | 0     | 0    | \$4  | 0     |        |        |
|               | 13    | 0     | 0    | \$5  | 1     |        |        |
|               | 14    | 0     | 0    | \$9  | 1     |        |        |
|               | 15    | 0     | 0    | \$0  | 0     |        |        |


# **ROB Matches**

- Can we just use the LARGEST valid index that matches the desired register?
  - In the example to the right should we say to use entry 30's information?
- Not necessarily
  - Need to know where the \_\_\_\_\_ and \_\_\_\_\_ are
  - What if RP=30 and WP=2?
  - Let's explore more

Figure (ROB entry fields): Rd, RdTag, Instruction Valid, Instruction Completed, RdData



## **ROB** Depth/Distance

- Case 1
  - Your number is 55 and mine is 65
  - I am \_\_\_\_\_ numbers (after / before) you.
- Case 2
  - Your number is 55 and mine is 45
  - I am \_\_\_\_\_ numbers (after / before) you.
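One way to make "distance" independent of wrap-around is to measure every entry relative to the read pointer using modular arithmetic. The sketch below is an illustration only: the function names `rob_depth` and `youngest_match` are hypothetical, and the 32-entry ROB size is an assumption chosen so that an entry 30 exists.

```python
def rob_depth(index: int, rp: int, size: int) -> int:
    """Number of entries between the oldest instruction (at rp) and index.

    Python's % always returns a non-negative result for a positive modulus,
    so this stays correct after the indices wrap around.
    """
    return (index - rp) % size


def youngest_match(matches, rp: int, size: int) -> int:
    """Priority resolution: among matching entries, pick the deepest one."""
    return max(matches, key=lambda index: rob_depth(index, rp, size))


ROB_SIZE = 32  # ASSUMED size for this example

# With RP=30 the in-flight order is 30, 31, 0, 1, so entry 1 is younger than
# entry 30 even though its index is smaller.
depth_of_30 = rob_depth(30, rp=30, size=ROB_SIZE)
depth_of_1 = rob_depth(1, rp=30, size=ROB_SIZE)
winner = youngest_match([30, 1], rp=30, size=ROB_SIZE)
```

Comparing depths rather than raw indices is what allows the largest-index shortcut to be replaced by a rule that works for any RP and WP.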














#### **Tournament Predictor**

- Dynamically selects when to use the global vs. local predictor
  - Accuracy of the global vs. local predictor may vary for different branches
- Tournament predictor keeps the history of both predictors (global and local) for a branch and then selects the one that is currently the most accurate
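The slides do not say how the selection history is stored. A common choice, assumed here purely for illustration, is a 2-bit saturating "chooser" counter per branch that moves toward whichever predictor was right when the two disagree. The class name `TournamentChooser` and its methods are hypothetical.

```python
class TournamentChooser:
    """2-bit saturating selector between a local and a global prediction.

    ASSUMPTION: counter values 0-1 favor the local predictor and 2-3 favor
    the global predictor; the slides do not specify the encoding.
    """

    def __init__(self, initial: int = 2):
        self.counter = initial

    def predict(self, local_pred: bool, global_pred: bool) -> bool:
        """Forward the prediction of the currently favored predictor."""
        return global_pred if self.counter >= 2 else local_pred

    def update(self, local_pred: bool, global_pred: bool, taken: bool) -> None:
        """Train only when exactly one of the two predictors was correct."""
        local_ok = local_pred == taken
        global_ok = global_pred == taken
        if global_ok and not local_ok:
            self.counter = min(3, self.counter + 1)
        elif local_ok and not global_ok:
            self.counter = max(0, self.counter - 1)


chooser = TournamentChooser()
before = chooser.predict(local_pred=True, global_pred=False)  # follows global
chooser.update(local_pred=True, global_pred=False, taken=True)
after = chooser.predict(local_pred=True, global_pred=False)   # follows local
```

When both predictors agree, or both are wrong, the counter is left alone, since that outcome says nothing about which one is more accurate.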



# SELECTIVE FLUSHING

### **Flushing Mechanism**

- When we mispredict, we need to flush executed instructions in the \_\_\_\_\_ and not-yet-executed instructions in the \_\_\_\_\_
- To do so, we provide the following to the backend (ROB, issue queues):
  - A 'flush' command signal
  - Current Top of ROB
  - Depth of the Branch Instruction
- All instructions in the backend (as well as the ROB) with depth \_\_\_\_\_ than the successful branch need to leave (be flushed)
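A minimal sketch of the per-instruction flush decision follows, reading the rule as "flush everything younger than the branch", i.e. anything whose depth from the top of the ROB exceeds the branch's depth. The function names, the 32-entry ROB size, and the example tags are assumptions for illustration.

```python
def rob_depth(index: int, rp: int, size: int) -> int:
    """Distance of a ROB tag from the current top of the ROB (rp)."""
    return (index - rp) % size


def should_flush(my_tag: int, rp: int, branch_depth: int, size: int) -> bool:
    """Each backend entry compares its own depth with the branch's depth."""
    return rob_depth(my_tag, rp, size) > branch_depth


ROB_SIZE = 32                      # ASSUMED size
rp = 30                            # current top of ROB, sent with the flush
in_flight = [30, 31, 0, 1, 2, 3]   # hypothetical tags, oldest first
branch_depth = rob_depth(31, rp, ROB_SIZE)  # the branch sits in entry 31

flushed = [t for t in in_flight if should_flush(t, rp, branch_depth, ROB_SIZE)]
survivors = [t for t in in_flight if t not in flushed]
```

Because every entry only needs the top-of-ROB value and the branch depth, the same comparison can run in parallel in the ROB and in each issue queue slot.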



### Selective Flushing for Branch Misprediction

- Paper token analogy
  - Say the store is going to close in 20 min. and they noticed too many people are waiting
  - They may announce that they will serve up to token #72 and people having tokens after that may leave now
    - If the last token pulled is #92, then people with tokens #73 to #92 will leave
    - If the last token pulled is #32, then people with tokens \_\_\_\_\_ and \_\_\_\_\_ will leave
- Because of the circular nature of the tokens/ROB FIFO mechanism, one cannot simply compare one's token with #72 to decide whether to stay or leave
- Leave if you are more than 20 people away from the current person being served (i.e. #52)



# **Register Hazard Summary**

- Recall, RAW hazard for registers was handled by
  - Dependent instructions are given the ROB tag of their specific producer to wait on in the backend
  - When the specific producer comes on the CDB and announces the value, the dependent instruction grabs the value
  - Once the dependent instruction has all its sources, it raises its hand to say, "I am ready to go to the execution unit" and waits for the issue unit to grant permission
- We must still take care with WAR and WAW hazards for registers, but we do so by taking ROB tags (solves WAR) and In-Order Completion/Writeback (solves WAW)
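The RAW handling recalled above (wait on a producer's ROB tag, grab the value from the CDB, then signal readiness) can be sketched as follows. The class and field names are hypothetical, and the sketch models a single waiting instruction rather than a full issue queue.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Operand:
    tag: Optional[int] = None    # ROB tag being waited on; None once resolved
    value: Optional[int] = None  # operand data once it is known


@dataclass
class WaitingInstruction:
    src1: Operand
    src2: Operand

    def snoop_cdb(self, tag: int, value: int) -> None:
        """Grab the broadcast value for any source waiting on this tag."""
        for operand in (self.src1, self.src2):
            if operand.tag == tag:
                operand.value = value
                operand.tag = None

    @property
    def ready(self) -> bool:
        """'Raise a hand' to the issue unit once both sources are present."""
        return self.src1.tag is None and self.src2.tag is None


# Hypothetical instruction: one source waits on ROB8, the other is known.
inst = WaitingInstruction(src1=Operand(tag=8), src2=Operand(value=5))
ready_before = inst.ready
inst.snoop_cdb(tag=4, value=1)   # a different producer: ignored
inst.snoop_cdb(tag=8, value=42)  # the specific producer: value captured
ready_after = inst.ready
```

Waiting on a specific producer's tag, rather than on a register name, is what lets a later writer of the same register go ahead without confusing this instruction.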

# Tomasulo 2: Memory Assumptions



## RAW, WAR, WAW for Memory

- We said hazards may occur in memory
- WAR and WAW hazards are handled through In-Order Completion
  - R = Read = LW (load word)
  - W = Write = SW (store word)
- An 'LW' reads cache in the execution unit before going to ROB
- An 'SW' writes into cache (i.e. commits) when it reaches the "top" of ROB (meaning it became the oldest instruction)
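The effect of letting a store write the cache only when it reaches the top of the ROB can be illustrated with a small commit loop. This is a sketch under assumptions: the dictionary fields `comp`, `mem_write`, `addr`, and `data`, the function name `commit_ready`, and the addresses and values are all hypothetical, and the cache is modeled as a plain dictionary.

```python
from collections import deque


def commit_ready(rob: deque, memory: dict) -> int:
    """Retire completed entries from the top; stores write memory only here."""
    committed = 0
    while rob and rob[0]["comp"]:
        entry = rob.popleft()
        if entry["mem_write"]:
            memory[entry["addr"]] = entry["data"]
        committed += 1
    return committed


memory = {}
rob = deque([
    {"comp": True, "mem_write": True, "addr": 0x10, "data": 1},       # older sw
    {"comp": False, "mem_write": False, "addr": None, "data": None},  # add
    {"comp": True, "mem_write": True, "addr": 0x10, "data": 2},       # younger sw
])

first_pass = commit_ready(rob, memory)   # only the older sw can retire
after_first = dict(memory)
rob[0]["comp"] = True                    # the add finishes executing
second_pass = commit_ready(rob, memory)  # add retires, then the younger sw
```

Both stores had finished executing before the add, yet memory sees them strictly in program order, so the final value at the address comes from the younger store.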



Figure: D-Cache and ROB, with instructions 1 add, 2 sw, 3 lw, 4 sw. Reorder Buffer and In-order Completion solve WAW (and help with WAR).











