

### EE 457 Unit 9a

#### Exploiting ILP: Out-of-Order Execution

### Credits

- Some of the material in this presentation is taken from:
  - Computer Architecture: A Quantitative Approach
    - John Hennessy & David Patterson
- Some of the material in this presentation is derived from course notes and slides from
  - Prof. Michel Dubois (USC)
  - Prof. Murali Annavaram (USC)
  - Prof. David Patterson (UC Berkeley)







## **Exploiting Parallelism**

School of Engineering

- With the increasing transistor budgets of modern processors (i.e., we can do more things at the same time), the question becomes how to find enough *useful* tasks to increase performance or, put another way, what is the most effective way of exploiting parallelism?
- Many types of parallelism available
  - Instruction Level Parallelism (ILP): Overlapping instructions within a single process/thread of execution
  - Thread Level Parallelism (TLP): Overlap execution of multiple processes/threads
  - Data Level Parallelism (DLP): Overlap an operation (instruction) that is to be applied independently to multiple data values (usually, an array)

for (int i=0; i < MAX; i++) { A[i] = A[i] + 5; }

- We'll focus on ILP in this unit

## Outline

- Instruction Level Parallelism
  - In-order (IO) pipeline
    - From academic 5-stage pipeline
    - To 8-stage MIPS R4000 pipeline
    - Superscalar, superpipelined
  - Out-of-Order (OoO) Execution
    - This unit: OoO Execution (compute the result) AND OoO Completion (write the result to memory or a register). Problem: exceptions
    - Next Unit: OoO Execution BUT In-order completion

### Instruction Level Parallelism (ILP)

- Although a program defines a sequential ordering of instructions, in reality many instructions can be executed in parallel (i.e. **out of (program) order**).
- ILP refers to the process of finding instructions from a single program/thread of execution that can be executed in parallel
- Data flow (data dependencies) limits out-of-order execution
- Independent instructions (no data dependencies) can be executed at the same time
- Control hazards also provide some ordering constraints



### **Basic Blocks**

- Basic Block (def.) = Sequence of instructions that will always be executed together
  - No conditional branches out
  - No branch targets coming in
  - Also called "straight-line" code
  - Average size: 5-7 instructions

| Label | Instruction | Operands |
|-------|-------------|----------|
| L1: | lw | \$s3,0(\$s4) |
| | and | \$t3,\$t2,\$t3 |
| | add | \$t0,\$t0,\$s4 |
| | or | \$t5,\$t3,\$t2 |
| | sub | \$t1,\$t1,\$t2 |
| | beq | \$t0,\$t8,L1 |
| | xor | \$s0,\$t1,\$s2 |

The instructions from L1 through beq form a basic block (starts w/ a branch target, ends with a branch)

- Instructions in a basic block can be overlapped if there are no data dependencies
- Control dependences really limit our window of possible instructions to overlap
  - W/o extra hardware, we can only overlap execution of instructions within a basic block
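
The definition above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the slides: instructions are modeled as hypothetical `(label, mnemonic, operands)` tuples, and `BRANCHES` is an assumed set of control-transfer mnemonics. A new block starts at the first instruction, at every branch target, and right after every branch.

```python
# Toy model (an assumption): each instruction is (label or None, mnemonic, operands).
BRANCHES = {"beq", "bne", "j"}  # assumed set of control-transfer mnemonics

def basic_blocks(program):
    """Split a list of instructions into basic blocks (lists of indices)."""
    labels = {lab: i for i, (lab, _, _) in enumerate(program) if lab}
    leaders = {0}  # the first instruction always starts a block
    for i, (_, op, operands) in enumerate(program):
        if op in BRANCHES:
            target = operands.split(",")[-1].strip()
            if target in labels:
                leaders.add(labels[target])  # a branch target starts a block
            if i + 1 < len(program):
                leaders.add(i + 1)           # so does the instruction after a branch
    starts = sorted(leaders)
    ends = starts[1:] + [len(program)]
    return [list(range(s, e)) for s, e in zip(starts, ends)]

program = [
    ("L1", "lw",  "$s3,0($s4)"),
    (None, "and", "$t3,$t2,$t3"),
    (None, "add", "$t0,$t0,$s4"),
    (None, "or",  "$t5,$t3,$t2"),
    (None, "sub", "$t1,$t1,$t2"),
    (None, "beq", "$t0,$t8,L1"),
    (None, "xor", "$s0,$t1,$s2"),
]
blocks = basic_blocks(program)
print(blocks)
```

For the example code from this slide, the first block covers `lw` through `beq`, and the `xor` after the branch begins a new block.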

## **SUPERSCALAR & SUPERPIPELINING**

Other In-Order techniques




### **Overview**


- Superscalar = More than 1 instruction completing per clock cycle (IPC > 1)
  - 2-way superscalar = Proc. that can issue 2 instructions per clock cycle
  - Success is sensitive to ability to find independent instructions to issue in the same cycle
- Superpipelining = Many small stages to boost clock freq.
  - Success depends on finding instructions to schedule in the shadow of data and control hazards

| Superscalar | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 |
|---------------|--------------|---------------|---------|-------------|------------|
| Instruction 1 | Instr. Fetch | Instr. Decode | Execute | Data Memory | Write back |
| Instruction 2 | Instr. Fetch | Instr. Decode | Execute | Data Memory | Write back |

Superscalar: Executing more than 1 instruction per clock cycle (CPI < 1 or IPC > 1)

| Superpipelining | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 | Cycle 8 | Cycle 9 |
|-----------------|-----|-----|-----|----|-----|-----|-----|-----|----|
| Instruction 1 | IF1 | IF2 | ID | EX | DM1 | DM2 | DM3 | WB | |
| Instruction 2 | | IF1 | IF2 | ID | EX | DM1 | DM2 | DM3 | WB |

Superpipelining: Divide logic into many short stages (Higher Clock Frequency)

#### 2-way Superscalar


- Ex: One ALU & Data transfer (LW/SW) instruction can be issued at the same time
- Relies on the compiler to find and reorder appropriate instructions (using nops if no appropriate instruction can be found)



## Sample Scheduling

The compiler can reorder instructions to pair an integer instruction with a memory instruction so that both can be sent down the pipeline at the same time

```
void f1(int *A, int n) {
    do {
        *A += 5;
        A++;
        n--;
    } while (n != 0);
}
```

```
# $6 = A
# $7 = n = # of iterations
L1: lw   $9, 0($6)      # load *A
    addi $9, $9, 5      # *A + 5
    sw   $9, 0($6)      # store back to *A
    addi $6, $6, 4      # A++
    addi $7, $7, -1     # n--
    bne  $7, $0, L1     # loop while n != 0
```

| Cycle | Int./Branch Slot | LD/ST Slot |
|-------|-----------------------|------------------|
| 1 | addi \$7, \$7, -1 | lw \$9,0(\$6) |
| 2 | addi \$6, \$6, 4 | |
| 3 | addi \$9, \$9, 5 | |
| 4 | bne \$0,\$7,L1 | sw \$9,-4(\$6) |


With modifications and code movement, IPC = 6 instructions / 4 cycles = 1.5

## **Scheduling Strategies**

- Static Scheduling
  - Compiler re-orders instructions in such a way that no dependencies will be violated and allows for OoOE
- Dynamic Scheduling
  - HW implementing the Tomasulo algorithm or other similar approach will re-order instructions to allow for OoOE
- More Advanced Concepts
  - Branch prediction and speculative execution (executing beyond a branch and flushing if the prediction was incorrect) will be covered later

## Static Scheduling

- Strengths
  - Hardware simplicity [Better clock rate]
    - Power/energy advantage
    - Compiler has a global view of the program anyway, so it should be able to do a "good" job
  - Very predictable: static performance predictions are reliable
- Weaknesses
  - Requires re-compilation to take advantage of new/modified architecture
  - Cannot foresee dynamic (data-dependent) events
    - Cache misses, conditional branches (can only reschedule instructions in a basic block)
  - Cannot precompute memory addresses
  - No good solution for precise exceptions with out-of-order completion

### **OUT-OF-ORDER EXECUTION**




## **Out-of-Order Motivation**

- We will focus on dynamically scheduled, OoO processors
- Hide the impact of dynamic events such as a cache miss
  - Let independent instructions behind a stalled instruction execute
- Separate functional units (ALU, MUL, DMEM, etc.)
- "Queues" where instructions wait until they are ready at which point they can execute "out-of-order"




| Example code |
|------------------------------|
| LW \$4,0(\$5) // cache miss |
| ADD \$6,\$7,\$4 |
| SUB \$1,\$2,\$3 |
| MUL \$9,\$7,\$2 |

### Dispatch, Execution, and Completion

- "Execution" here means producing the results not necessarily writing them to a register or memory
- Completion means committing/writing the results to register file or memory
- While we say out-of-order execution we really mean/want:
  - In-order (Program order) Issue/Dispatch (IoD)
  - Out-of-Order Execution (OoOE)
  - In-order Completion (IoC) [hard]
    - So we'll start with the easier Out-of-Order Completion (OoOC)

| Example code |
|------------------------------------------|
| LW <mark>\$4</mark>,0(\$5) // cache miss |
| ADD \$6,\$7,<mark>\$4</mark> |
| SUB \$1,\$2,\$3 |
| MUL \$9,\$7,\$2 |



## **Branch Handling**


- We will present the concept of OoOC (out-of-order completion), which is a bit easier, and then come back to the desired approach of In-Order Completion (IoC)
- OoOC Issues
  - Branches...we should not commit an instruction that came after (in program order) a branch



### **Data Hazard Stalling**


- In our 5-stage pipeline (in-order execution) RAW dependency was solved by
  - Forwarding (preferably) or
  - Stalling (LW followed by dependent instruction)
- Dependent instructions stalled in the ID stage if necessary
- Do we want to stall in the decode stage in our OoO processor?
  - No! Doing so would necessarily stall everyone behind us



#### **EX Stage Stalling**

- In our 5-stage pipeline, could we have stalled in the EX stage?
- No! If ADD depended on an instruction in WB then it has no place to store that forwarded data while it stalls



Thus we stall in ID so we can use the Register File to grab dependent values. Further, stalling in ID incurs only a 1-cycle penalty, the same as stalling in EX would.




### Where to Stall?


- But to implement OoO execution, we cannot stall in the decode stage since that would prevent any further issuing of instructions
- Thus, now we will issue to queues for each of the multiple functional units and have the instruction stall in the queue until it is ready





## Forwarding in OoO Execution

- In the 5-stage pipeline, later instructions carried their source register IDs into the EX stage to be compared with the destination register IDs of earlier instructions
- But in OoO execution, we may have many (earlier) instructions in front of us and would require more complex hardware to determine who is producing the data we need (especially when multiple producers exist and we want the latest version)
- Instead, the dispatch unit will explicitly tell the dependent instruction who to get data from using part of Tomasulo's algorithm



### Tomasulo's Plan


- OoO Execution
- Multiple functional units
  - Integer ALU, Data memory, Multiplier, Divider
- Queues between ID and EX stages (in place of ID/EX register)
  - Allows later instructions to keep issuing even if earlier ones are stalled
- Method for dealing with RAW data hazards by specifying who dependent instructions should get data from
  - But with OoO execution, new hazards arise!

## **NEW DATA HAZARDS**

WAR and WAW



# RAW, WAR, and WAW


- RAW = Read After Write
  - lw <mark>\$8</mark>, 40(\$2)
  - add \$9, <mark>\$8</mark>, \$7
- WAR = Write After Read
  - add \$9, \$8, \$6 ← say \$6 is not available yet, can LW execute?
  - lw **\$8**, 40(\$2)
- WAW = Write After Write
  - add \$9, \$8, \$6 ← say \$6 is not available yet, can LW execute?
  - lw \$9, 40(\$2)

Why would anyone produce one result in \$9 without using that result? Why would they overwrite it with another result? How is this possible?

### WAW can easily occur

- How is WAW possible?
- Example 1
  - Say a company gives standard bonus to most of the employees and a higher bonus to managers
  - The software may set a default value to the standard bonus and then overwrite for the special case
- Example 2
  - Consider multiple iterations of a loop body

int x = standard_bonus; if (manager) x = special_bonus; set_bonus(x);

for(i=MAX; i != 0; i--) A[i] = A[i] \* 3;


| Label | Instruction | Operands |
|-------|-------------|-----------------|
| L1: | lw | \$2, 40(\$1) |
| | mult | \$4, \$2, \$3 |
| | sw | \$4, 40(\$1) |
| | addi | \$1, \$1,-4 |
| | bne | \$1, \$0, L1 |

**Original Code** 





## RAW, WAR, and WAW

- Some terminology to remember
- RAW = Read After Write: a true dependency
  - lw **\$8**, 40(\$2)
  - add \$9, **\$8**, \$7
- WAR = Write After Read: an anti-dependency (a name dependency)
  - add \$9, **\$8**, \$6
  - lw **\$8**, 40(\$2)
- WAW = Write After Write: an output dependency (also a name dependency)
  - add **\$9**, \$8, \$6
  - lw **\$9**, 40(\$2)
- Note: No information is communicated in WAR/WAW hazards. If no info is communicated, can we somehow solve these hazards?

## RAW, WAR, and WAW

- In-order execution:
  - We need to deal with RAW only
- Out-of-order execution
  - Now we need to deal with WAR and WAW hazards in addition to RAW
  - Any of these hazards seems to prevent re-ordering instructions and executing them out-of-order
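
As a rough illustration of the three definitions, the sketch below classifies the register hazards between an earlier and a later instruction. The `(dest, sources)` tuple encoding and the `hazards` helper are assumptions made for this example only.

```python
def hazards(first, second):
    """Return the set of register hazards when `second` follows `first`
    in program order. Each instruction is (dest, sources); dest may be None."""
    d1, s1 = first
    d2, s2 = second
    found = set()
    if d1 is not None and d1 in s2:
        found.add("RAW")  # second reads what first writes
    if d2 is not None and d2 in s1:
        found.add("WAR")  # second overwrites what first still has to read
    if d1 is not None and d1 == d2:
        found.add("WAW")  # both write the same register
    return found

lw_8  = ("$8", ["$2"])        # lw  $8, 40($2)
add_a = ("$9", ["$8", "$7"])  # add $9, $8, $7
add_b = ("$9", ["$8", "$6"])  # add $9, $8, $6
lw_9  = ("$9", ["$2"])        # lw  $9, 40($2)

print(hazards(lw_8, add_a))   # the RAW pair from the slides
print(hazards(add_b, lw_8))   # the WAR pair
print(hazards(add_b, lw_9))   # the WAW pair
```

Only the RAW case passes a value from one instruction to the other; the WAR and WAW cases arise purely from reusing a register name.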




### **Register Renaming**

- WAR and WAW hazards can always be solved by simply choosing a DIFFERENT register since no data is being communicated but we were simply "reusing" a register
- If we had 64 registers instead of 32 registers, then perhaps the compiler might have used \$48 instead of \$8 and we could have executed the second part of the code before the first part




| Iteration | Instruction | Operands |
|-----------------------------------------------------|-----------------|----------------------------------------------------|
| First iteration | lw<br>add<br>sw | \$8, 40(\$2)<br>\$8, \$8, \$8<br>\$8, 40(\$2) |
| Second iteration (using alternate register, \$48) | lw<br>add<br>sw | \$48, 60(\$3)<br>\$48, \$48, \$48<br>\$48, 60(\$3) |

### **Register Renaming**


- Renaming requires more registers
- We have limited architectural registers
  - Registers the instruction set is aware of
- We could have more physical registers
  - Actual registers that are part of the register file





## **Increasing Number of Registers**

- Can a later implementation provide 64 registers (instead of 32) while maintaining binary compatibility with previously compiled code?
- Answer: No
- Why? Machine code has 5-bit fields for register IDs, so an instruction can name only 2^5 = 32 registers

## **Register Renaming**

- Rather than creating new architectural registers, let us internally provide multiple "versions" of the same architectural register
  - \$8v1 = \$8 version 1
  - \$8v2 = \$8 version 2

| Instruction | Operands |
|-------------|-------------------------|
| lw | \$8v1, 40(\$2) |
| add | \$8v2, **\$8v1, \$8v1** |
| sw | \$8v2, 40(\$2) |
| lw | \$8v3, 60(\$3) |
| add | \$8v4, **\$8v3, \$8v3** |
| sw | \$8v4, 60(\$3) |
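
The versioning idea can be sketched as a small renaming pass. This is a simplified software model, not the hardware mechanism: the `(op, dest, sources)` encoding and the `rename` helper are assumptions for illustration. Every write creates a new version of the register, and every read uses the latest version created so far.

```python
def rename(program):
    """Give every register write a fresh version; reads use the latest version."""
    latest = {}   # architectural register -> current version number
    out = []
    for op, dest, srcs in program:
        # Sources are renamed BEFORE the destination gets its new version.
        new_srcs = [f"{r}v{latest[r]}" if r in latest else r for r in srcs]
        new_dest = None
        if dest is not None:
            latest[dest] = latest.get(dest, 0) + 1
            new_dest = f"{dest}v{latest[dest]}"
        out.append((op, new_dest, new_srcs))
    return out

program = [
    ("lw",  "$8", ["$2"]),        # lw  $8, 40($2)
    ("add", "$8", ["$8", "$8"]),  # add $8, $8, $8
    ("sw",  None, ["$8", "$2"]),  # sw  $8, 40($2)
    ("lw",  "$8", ["$3"]),        # lw  $8, 60($3)
    ("add", "$8", ["$8", "$8"]),  # add $8, $8, $8
    ("sw",  None, ["$8", "$3"]),  # sw  $8, 60($3)
]
renamed = rename(program)
for instr in renamed:
    print(instr)
```

After renaming, the second group of three instructions touches only `$8v3` and `$8v4`, so it no longer has any WAR or WAW conflict with the first group.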




- Cannot change the number of architectural registers
- Instead we will perform Register Renaming through *Tagging* Registers
  - This solves name dependency problems (WAR and WAW) while attending to true dependency (RAW) through waiting in queues
  - Please be sure you understand this!




#### **OoO Execution & Tomasulo's Algorithm**



(Simplified for EE457)

## Tomasulo's Algorithm


- Dispatch/Issue unit decodes and dispatches instructions
- Assign a binary code (aka TAG) to each instruction producing a register value using the TAG FIFO
- Adds a Register Status Table (RST) that holds the TAG of the instruction that is producing the LATEST version of each architectural register or NULL if the LATEST version is in the register file
- The destination operand is represented by the TAG but not the actual register name
- For source operands, an instruction carries either the values (if TAG is null in RST) or TAGs of the operands (but not the actual register name)
- When an instruction executes and produces a result it broadcasts the result and its destination TAG
  - Any instruction waiting can compare its SRC tags with the destination tag and grab the value if they match
  - If entry in RST matches the TAG then this instruction is the latest producer of the register and the value will be written to the register file
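
The bullets above can be modeled in software roughly as follows. This is a simplified sketch, not the EE457 hardware: the `Dispatcher` class, its field names, and the choice to return a tag to the FIFO at broadcast time are all assumptions made for illustration. The tag names DOG, LION, and TIGER are borrowed from the backup slides.

```python
from collections import deque

class Dispatcher:
    def __init__(self, tags, regfile):
        self.free_tags = deque(tags)            # TAG FIFO
        self.rf = dict(regfile)                 # register file: name -> value
        self.rst = {r: None for r in regfile}   # None = latest value is in the RF
        self.waiting = []                       # dispatched, not yet executed

    def dispatch(self, op, dest, srcs):
        # Sources: the value if the RST entry is null, else the producer's TAG.
        operands = [("val", self.rf[r]) if self.rst[r] is None
                    else ("tag", self.rst[r]) for r in srcs]
        tag = None
        if dest is not None:
            tag = self.free_tags.popleft()
            self.rst[dest] = tag                # now the latest producer of dest
        entry = {"op": op, "dest_tag": tag, "operands": operands}
        self.waiting.append(entry)
        return entry

    def broadcast(self, tag, value):
        # Waiting instructions grab the value if one of their source tags matches.
        for e in self.waiting:
            e["operands"] = [("val", value) if o == ("tag", tag) else o
                             for o in e["operands"]]
        # The RF is written only if this tag is still the latest producer.
        for reg, t in list(self.rst.items()):
            if t == tag:
                self.rf[reg] = value
                self.rst[reg] = None
        self.free_tags.append(tag)              # assumption: tag is reusable now

d = Dispatcher(["DOG", "LION", "TIGER"], {"$2": 16, "$8": 0, "$10": 9})
e1 = d.dispatch("sqrt", "$2", ["$10"])        # sqrt $2, $10
e2 = d.dispatch("lw",   "$8", ["$2"])         # lw   $8, 40($2)
e3 = d.dispatch("add",  "$8", ["$8", "$8"])   # add  $8, $8, $8
d.broadcast("DOG", 3)    # sqrt finishes: lw receives its operand, RF[$2] is written
d.broadcast("LION", 7)   # lw finishes: add receives 7, but RF[$8] is NOT written
print(e2["operands"], e3["operands"], d.rf, d.rst)
```

In this run the `lw` result never reaches the register file, because by the time it broadcasts the RST entry for `$8` already names TIGER (the later `add`) as the latest producer.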

[Figure: Tagging process. The instruction stream (sqrt \$2, \$10; lw \$8, 40(\$2); add \$8, \$8, \$8; sw \$8, 40(\$2); lw \$8, 60(\$3); add \$8, \$8, \$8; sw \$8, 60(\$3)) enters the issue logic, which holds the RST and RF (\$1 ... \$31) and feeds the INT ALU, INT MUL/DIV/SQRT, and Load/Store units. The first dispatched entry is T1: SQRT \$2 Val / \$10 Val. RST = Register Status Table (identifies the latest version of a register); RF = Register File.]


### Tagging process: CC2







## Tagging process: CC4











USC Viterbi

### Tagging process: CC8





# Tagging process: CC10






## Tagging process: CC11













#### **Register Renaming**




# Unique TAGs

- Like an SSN, each TAG must be unique
- SSNs are reused
- Similarly, TAGs can be reused
- TAGs are similar to numbered TOKENs



A numbered token helps to create a virtual queue. We do not need that here.



At the State Bank of India, the cashier issues a brass token to each customer withdrawing money as an ID (and not at all to put them in any virtual queue / ordering). Token numbers are in random order.

The cashier verifies the signature in the record rooms, returns with money, calls the token number and issues the money.

Tokens are reclaimed & reused.


## Tags (= Tokens)


- How many tokens should the bank cashier have to start with?
- What happens if the tokens run out?
- Does the cashier need to have any order in holding tokens and issuing tokens?
- Do they have to collect the tokens back?



#### TAG FIFO

FIFOs are taught in EE 560

- To issue and collect tokens (TAGS) use a circular FIFO (First-In/First-Out) unit
  - While the FIFO order is not important here, a FIFO is the easiest to implement in hardware compared to a random order in a pile
- Filled with (say) 64 tokens (in any order) initially on reset
- Tokens return in any order
- Put tokens back in the FIFO and reissue
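
A circular FIFO of tags might be modeled like this. The `TagFifo` class and its method names are hypothetical, and returning `None` when no tag is free (so that dispatch would have to stall) is an assumption about how running out of tokens is handled.

```python
class TagFifo:
    """Circular FIFO of free tags, sized for the total number of tags."""
    def __init__(self, num_tags):
        self.buf = list(range(num_tags))  # filled with all tags on reset
        self.head = 0                     # next tag to issue
        self.tail = 0                     # next slot for a returned tag
        self.count = num_tags             # number of free tags currently held

    def issue(self):
        if self.count == 0:
            return None                   # no free tag (assumed: dispatch stalls)
        tag = self.buf[self.head]
        self.head = (self.head + 1) % len(self.buf)
        self.count -= 1
        return tag

    def give_back(self, tag):
        assert self.count < len(self.buf), "more tags returned than issued"
        self.buf[self.tail] = tag
        self.tail = (self.tail + 1) % len(self.buf)
        self.count += 1

fifo = TagFifo(4)
a, b, c = fifo.issue(), fifo.issue(), fifo.issue()
fifo.give_back(c)   # tags come back in any order
fifo.give_back(a)
later = [fifo.issue() for _ in range(4)]
print((a, b, c), later)
```

Tags are reissued in the order they were returned, not in numeric order, which is fine because a tag only has to be unique among the instructions currently in flight.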




# Organization for OoO Execution






# Front-End & Back-End

- IFQ (Instruction Fetch Queue)
  - A FIFO structure
- Dispatch (Issue) Unit
  - Includes RST, RF, Tag FIFO
- Load/Store and other Issue Queues
- Issue Units
- Functional units
- CDB (Common Data Bus)
  - Like a public address system that everyone can see/hear when data is produced



# More Tomasulo Algorithm

- Front End
  - Instructions are fetched
  - They are stored in a FIFO (IFQ)
  - When an instruction reaches the head of the IFQ it is
    - Decoded
    - Dispatched to an issue queue/functional unit
    - Even if some of the inputs are not ready (takes TAGs)
- Back End
  - Instructions in issue queues wait for their input operands
  - Once register operands are ready instructions can be scheduled for execution provided they will not conflict for the CDB or their functional unit
  - Instructions execute in their functional unit and their result is put on the CDB
  - All instructions in queues and the register file "watch" the CDB and grab the value they are waiting for when it is produced
- Bottleneck in Tomasulo's algorithm?
  - The CDB!!!
  - Do all instructions use the CDB? No, not SW, J (jump), BEQ



### **MEMORY DISAMBIGUATION**

Data hazards and memory

# Load/Store Queue (LSQ)

- For our course, the LSQ performs
  - Address calculation
  - Memory disambiguation
    - RAW, WAR, WAW hazards due to memory reads and writes

```
// Is there a dependency here?
SW $2,0($5)
LW $8,0($5)
// What about here?
SW $2, 1000($4)
LW $3, 0($6)
```


# **Memory Disambiguation**

- Data hazards (RAW, WAR, WAW) can occur in memory just as with registers, and hazards in memory are much harder to deal with since many base-register/offset combinations could produce the same address

| Hazard | Code | Constraint |
|--------|------|------------|
| RAW | sw …, 2000(\$0)<br>lw …, 2000(\$0) | This later lw can proceed only if there is no store ahead of it with the same address |
| WAW | sw \$2, 2000(\$0)<br>sw \$8, 2000(\$0) | This later sw can proceed only if there is no store ahead of it with the same address |
| WAR | lw \$2, 2000(\$0)<br>sw \$8, 2000(\$0) | This later sw can proceed only if there is no load ahead of it with the same address |

# Address Calculation for LW/SW


- EE 557 approach for address calculation
  - Loads & stores are split into 2 sub-instructions
    - 1 instruction computes the address and is dispatched to the integer ALU
    - 1 instruction accesses the data cache and is issued to the LSQ
    - Address is communicated from integer ALU to LSQ via CDB forwarding using a tag
- EE 560/457 approach
  - Use a dedicated adder in the LSQ to compute address (so just 1 dispatched instruction)

# **Memory Disambiguation**


- When can the LSQ issue a LW or SW to the cache?
  - Loads can issue to the cache when their address is ready
  - Stores can issue to the cache when both address & data are ready
  - Memory hazards (RAW, WAR, WAW) are resolved in the LSQ
    - A load can issue to the cache if no store with the same address is before it
    - A store can issue to the cache if no store or load with the same address is before it
    - Otherwise, the access waits in the LSQ
  - If an address is unknown it is assumed to be the same
    - Worst case to enforce correctness
  - The process of figuring out and comparing memory addresses is called "disambiguation"
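
The rules above can be written as a single check. This is a minimal sketch under stated assumptions: the LSQ is modeled as a Python list in program order, each entry is a hypothetical dict with `op`, `addr`, and (for stores) `data_ready`, and entries are assumed to leave the list once they have gone to the cache.

```python
def can_issue(lsq, index):
    """Return True if lsq[index] may be sent to the cache.
    lsq: memory ops in program order; 'addr' is None until computed."""
    me = lsq[index]
    if me["addr"] is None:
        return False                 # own address not ready yet
    if me["op"] == "sw" and not me["data_ready"]:
        return False                 # a store also needs its data
    for older in lsq[:index]:
        # An unknown address is conservatively treated as a match.
        same = older["addr"] is None or older["addr"] == me["addr"]
        if not same:
            continue
        if older["op"] == "sw":
            return False             # RAW (for a load) or WAW (for a store)
        if me["op"] == "sw":
            return False             # WAR: an older load uses this address
    return True

lsq = [
    {"op": "sw", "addr": 2000, "data_ready": False},
    {"op": "lw", "addr": 2000},
    {"op": "lw", "addr": 3000},
    {"op": "sw", "addr": None, "data_ready": True},
    {"op": "lw", "addr": 4000},
]
status = [can_issue(lsq, i) for i in range(len(lsq))]
print(status)
```

Only the load from address 3000 may proceed here. The last load is blocked by the older store whose address is still unknown, which is the worst-case assumption described above.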

# LAST CONSIDERATIONS FOR OUT-OF-ORDER EXECUTION/COMPLETION

Issue Queue priority, Branches, etc.



## **Issue Unit**


- How do we determine when to issue an instruction to the functional unit?
  - Is the instruction ready?
  - Is the functional unit free to start the operation?
  - CDB availability constraint
    - Will there be room on the CDB when the operation finishes?
  - Priority/conflict resolution
    - If many instructions are available, which should be chosen? (Is round-robin priority adequate?)
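
One way to combine these checks is sketched below. The function name, the dict-based queue entries, and the oldest-first selection policy are assumptions chosen for illustration (the slide leaves the priority question open). The CDB is modeled as a set of cycle numbers that are already reserved.

```python
def pick_and_issue(queue, fu_free, cdb_reserved, now, latency):
    """Issue the oldest ready instruction if its result can get the CDB.
    queue: entries in arrival order, each with a boolean 'ready'.
    cdb_reserved: set of cycles in which the CDB is already taken."""
    if not fu_free:
        return None
    done = now + latency          # cycle in which the result would be broadcast
    if done in cdb_reserved:
        return None               # no room on the CDB then: try again next cycle
    for entry in queue:           # oldest first (assumed priority policy)
        if entry["ready"]:
            queue.remove(entry)
            cdb_reserved.add(done)
            return entry
    return None

queue = [
    {"name": "mul1", "ready": False},
    {"name": "mul2", "ready": True},
    {"name": "mul3", "ready": True},
]
cdb_reserved = {9}
first = pick_and_issue(queue, True, cdb_reserved, now=5, latency=4)
second = pick_and_issue(queue, True, cdb_reserved, now=6, latency=4)
print(first, second, cdb_reserved)
```

In cycle 5 nothing issues because the CDB is already booked for cycle 9. In cycle 6 the oldest ready entry issues and reserves the CDB for cycle 10.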



## **Issue Queue Priority**


- Priority (based on the order of arrival among ready instructions)
  - Is it necessary or just desirable?
  - Local priority within queues?
  - Global priority across the queues?



# LSQ Ordering/Priority


- Maintaining instructions in the order of arrival
  - Issue order/program order in a queue
- Is this necessary and/or desirable?
  - In the case of LSQ?
    - Necessary! To enforce memory disambiguation
  - In the case of Integer, MUL, DIV queues?
    - Desirable, so that an earlier instruction gets executed whenever possible, thereby reducing queue pressure from too many instructions waiting on it

## **Conditional Branches**

- Dispatcher stalls when it reaches a branch (and waits until it is resolved)
- Branches are dispatched to integer queue where they wait for their operands (if necessary)
- When branch executes it puts its outcome & target on CDB
  - If untaken, dispatch unit resumes
  - If taken, then dispatch flushes the IFQ and resumes at the target
- Since we stop dispatching instructions after a branch, does it mean that this branch is the last instruction to be executed in the back-end?
- Is it possible that the back-end holds simultaneously
  - A. Some instructions dispatched before the branch .. AND ..
  - B. Some instructions issued after the branch

| Label | Instruction | Operands |
|-------|-------------|-------------|
| | … | \$4,\$5,\$5 |
| | … | \$6,\$7,L1 |
| L1: | SUB | \$1,\$2,\$3 |
| | MUL | \$9,\$7,\$2 |




## Structural Hazards + Exceptions

- Structural Stalls
  - Dispatch must stall if the IFQ is empty, or if an instruction is attempting to dispatch and all entries in the issue queue of its functional unit are occupied
  - Fetch unit must stall if the IFQ is full
  - Functional units stall when there are no ready instructions in the queue or when there are CDB scheduling conflicts
- Precise exceptions not supported
  - Some instructions after the offending instruction may have updated registers or memory! BAD!
  - We'll handle this in the next unit




#### **BACKUP**


## **Tagging Registers: CC1**



[Figure: RST and RF (\$1 ... \$31) in CC1. The RST holds the tag DOG. Legend: dependent source. RST = Register Status Table; RF = Register File.]

# Tagging Registers: CC2



[Figure: RST and RF (\$1 ... \$31) in CC2. The RST holds the tags DOG and LION. RST = Register Status Table; RF = Register File.]

# Tagging Registers: CC3



[Figure: RST and RF (\$1 ... \$31) in CC3. The RST holds the tags DOG and TIGER. RST = Register Status Table; RF = Register File.]

# Tagging Registers: CC4



[Figure: RST and RF (\$1 ... \$31) in CC4. The RST holds the tags DOG and TIGER. RST = Register Status Table; RF = Register File.]



## **Tagging Registers Review**





- Dispatch unit decodes and dispatches instructions
- For the destination operand, an instruction carries a TAG (but not the actual register name)
- For source operands, an instruction carries either the values (if no TAG in RST) or TAGs of the operands (but not the actual register name)
- When

# Organization for OoO Execution




# **Multiple Functional Units**


- We now provide multiple functional units
- After decode, issue to a queue, stalling if the unit is busy or waiting for data dependency to resolve




## **Functional Unit Latencies**



| Functional Unit | Latency<br>(Required stall cycles<br>between dependent [RAW] instructions) | Initiation Interval<br>(Distance between 2 independent instructions<br>requiring the same FU) |
|-----------------|--------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------|
| Integer ALU     | 0                                                                        | 1                                                                                             |
| FP Add          | 3                                                                        | 1                                                                                             |
| FP Mul.         | 6                                                                        | 1                                                                                             |
| FP Div.         | 24                                                                       | 25                                                                                            |
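
To make the latency column concrete, the sketch below computes the earliest issue cycle of each instruction in a chain where every instruction depends (RAW) on the one before it. The `LATENCY` dictionary copies the table; the helper name and the assumption that a dependent instruction issues exactly `1 + latency` cycles after its producer are illustrative.

```python
# Required stall cycles between dependent (RAW) instructions, from the table.
LATENCY = {"Integer ALU": 0, "FP Add": 3, "FP Mul.": 6, "FP Div.": 24}

def chain_issue_cycles(units):
    """Earliest issue cycle of each instruction in a RAW-dependent chain."""
    cycles, cycle = [], 0
    for unit in units:
        cycles.append(cycle)
        cycle += 1 + LATENCY[unit]  # the consumer waits out the producer's latency
    return cycles

chain = ["FP Mul.", "FP Add", "Integer ALU"]
issue_cycles = chain_issue_cycles(chain)
print(issue_cycles)
```

With a latency of 0 (integer ALU), a dependent instruction can issue in the very next cycle, which matches forwarding in the 5-stage pipeline.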

## OoO Execution w/ ROB


- ROB allows for OoO execution but in-order completion

