# EE 457 Unit 9a

Exploiting ILP: Out-of-Order Execution

### Credits

- Some of the material in this presentation is taken from:
  - *Computer Architecture: A Quantitative Approach*, John Hennessy & David Patterson
- Some of the material in this presentation is derived from course notes and slides from
  - Prof. Michel Dubois (USC)
  - Prof. Murali Annavaram (USC)
  - Prof. David Patterson (UC Berkeley)






**USC**Viterbi

## **Exploiting Parallelism**

- With the increasing transistor budgets of modern processors (i.e., we can do more things at the same time), the question becomes how to find enough *useful* tasks to increase performance. Put another way: what is the most effective way of exploiting parallelism?
- Many types of parallelism are available
  - Instruction Level Parallelism (ILP): Overlapping instructions within a single process/thread of execution
  - Thread Level Parallelism (TLP): Overlapping execution of multiple processes/threads
  - Data Level Parallelism (DLP): Overlapping an operation (instruction) that is to be applied independently to multiple data values (usually an array)

```
for (int i=0; i < MAX; i++) { A[i] = A[i] + 5; }
```

- We'll focus on ILP in this unit

## Outline

Instruction Level Parallelism

### In-Order (IO) pipeline

- From academic 5-stage pipeline
- To 8-stage MIPS R4000 pipeline
- Superscalar, superpipelined

### Out-of-Order (OoO) Execution

- This unit: OoO Execution (Compute the result) AND OoO Completion (write result to memory or a register). (Problem: Exceptions)
- Next Unit: OoO Execution BUT In-order completion



Superpipelining: Divide logic into many short stages (Higher Clock Frequency)

## Sample Scheduling

- 2-way superscalar
  - Ex: One ALU & one data transfer (LW/SW) instruction can be issued at the same time
- Relies on the compiler to find and reorder appropriate instructions (using nops if no appropriate instruction can be found)
- The compiler can reorder instructions to find integer and memory instructions to fuse together that can be run down the pipeline at the same time

[Figure: pipeline timing diagram. In each cycle one "ALU or branch" instruction and one "LW/SW" instruction enter the pipeline together and pass through the IF, ID, EX, MEM, and WB stages.]

[Figure: 2-way superscalar datapath with PC, I-Cache (fetches 2 instructions), register file (4 read, 2 write ports), integer slot ALU, LD/ST slot address calculation, and D-Cache.]

C source: `void f1(int *A, int n) { do { *A += 5; A++; n--; } while (n != 0); }`

Original assembly (`$6` = A, `$7` = # of iterations):

| Label | Instruction |
|-------|-------------|
| L1: | `ld $9, 0($6)` |
|  | `add $9, $9, 5` |
|  | `st $9, 0($6)` |
|  | `add $6, $6, 4` |
|  | `add $7, $7, -1` |
|  | `bne $0, $7, L1` |

Schedule w/ modifications and code movement:

| Cycle | Int./Branch Slot | LD/ST Slot |
|-------|------------------|------------|
| 1 | `addi $7, $7, -1` | `lw $9, 0($6)` |
| 2 | `addi $6, $6, 4` | nop |
| 3 | `addi $9, $9, 5` | nop |
| 4 | `bne $0, $7, L1` | `st $9, -4($6)` |

IPC = 6 instructions / 4 cycles = 1.5

## **Scheduling Strategies**

- Static Scheduling
  - The compiler re-orders instructions in such a way that no dependencies will be violated and allows for OoOE
- Dynamic Scheduling
  - Hardware implementing the Tomasulo algorithm or another similar approach will re-order instructions to allow for OoOE
- Requires more advanced concepts
  - Branch prediction and speculative execution (execution beyond a branch, flushing if incorrect) will be covered later

Static Scheduling Strengths

- Hardware simplicity [better clock rate]
- Power/energy advantage
- Compiler has a global view of the program anyway, so it should be able to do a "good" job
- Very predictable: static performance predictions are reliable

Static Scheduling Weaknesses

- Requires recompilation to take advantage of a new/modified architecture
- Cannot foresee dynamic (data-dependent) events
  - Cache misses, conditional branches (can only reschedule instructions within a basic block)
- Cannot precompute memory addresses
- No good solution for precise exceptions with out-of-order completion
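The pairing rule in the 2-way superscalar example can be sketched as a small check. This is a minimal illustration, not the course's scheduler: the `SlotInstr` record, the `can_dual_issue` rule (different slots, no read of the partner's result, no shared destination), and the `ipc` helper are all assumptions made for the example.

```c
#include <stdbool.h>

/* Hypothetical instruction record; all names here are illustrative. */
typedef enum { SLOT_INT_BRANCH, SLOT_LOAD_STORE } SlotKind;

typedef struct {
    SlotKind slot;   /* which issue slot the instruction needs */
    int dest;        /* destination register, or -1 if none (sw, bne) */
    int src1, src2;  /* source registers, or -1 if unused */
} SlotInstr;

/* Can `first` and `second` (in program order) share one issue packet?
 * Assumed rule: they need different slots, `second` must not read the
 * register `first` writes, and they must not write the same register.
 * A dest of -1 (none) or 0 (MIPS $0) creates no dependence. */
bool can_dual_issue(const SlotInstr *first, const SlotInstr *second) {
    if (first->slot == second->slot) return false;
    if (first->dest > 0 &&
        (second->src1 == first->dest || second->src2 == first->dest))
        return false;   /* RAW inside the packet */
    if (first->dest > 0 && first->dest == second->dest)
        return false;   /* both write the same register */
    return true;
}

/* Instructions per cycle for a finished schedule. */
double ipc(int instructions, int cycles) {
    return (double)instructions / (double)cycles;
}
```

With these definitions, `addi $7,$7,-1` and `lw $9,0($6)` may share a cycle, while `lw $9,0($6)` and `addi $9,$9,5` may not, which is why the schedule leaves the LD/ST slot empty in some cycles.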





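- In the 5-stage in-order pipeline, a hazard is handled by stalling the dependent instruction in the decode stage, which also holds back all later instructions
- But to implement OoO execution, we cannot stall in the decode stage, since that would prevent any further issuing of instructions
- Thus, we will now issue to queues, one for each of the multiple functional units, and have the instruction wait in its queue until it is ready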



- In the 5-stage pipeline, later instructions carried their source register IDs into the EX stage to be compared with the destination register IDs of earlier instructions

- But in OoO execution, we may have many (earlier) instructions in front of us and would require more complex hardware to determine who is producing the data we need (especially when multiple producers exist and we want the latest version)
- Instead, the dispatch unit will \_\_\_\_\_\_ tell the dependent instruction who to get data from using part of Tomasulo's algorithm
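One way to picture "telling the dependent instruction who to get data from" is a table indexed by register number that holds the tag of the newest in-flight producer. The sketch below is an assumption-laden illustration: `RegStatusTable`, `rst_lookup`, `rst_claim`, `rst_complete`, and the `NO_TAG` encoding are hypothetical names, and real implementations differ in detail.

```c
#include <stdbool.h>

#define NUM_REGS 32
#define NO_TAG   (-1)   /* newest value is already in the register file */

/* Hypothetical register status table: for each register, the tag of the
 * in-flight instruction that will produce its newest value. */
typedef struct { int producer_tag[NUM_REGS]; } RegStatusTable;

typedef struct {
    bool ready;  /* true: read the value from the register file */
    int  tag;    /* otherwise: wait for this tag to be broadcast */
} SourceOperand;

void rst_init(RegStatusTable *rst) {
    for (int r = 0; r < NUM_REGS; r++) rst->producer_tag[r] = NO_TAG;
}

/* At dispatch: find out where one source operand will come from. */
SourceOperand rst_lookup(const RegStatusTable *rst, int reg) {
    SourceOperand op;
    op.tag = rst->producer_tag[reg];
    op.ready = (op.tag == NO_TAG);
    return op;
}

/* At dispatch, after the source lookups: this instruction (identified
 * by `tag`) becomes the newest producer of `dest`. */
void rst_claim(RegStatusTable *rst, int dest, int tag) {
    if (dest != 0) rst->producer_tag[dest] = tag;  /* $0 is constant in MIPS */
}

/* When `tag` broadcasts its result: clear entries only where it is
 * still the newest producer, so a later writer is not forgotten. */
void rst_complete(RegStatusTable *rst, int tag) {
    for (int r = 0; r < NUM_REGS; r++)
        if (rst->producer_tag[r] == tag) rst->producer_tag[r] = NO_TAG;
}
```

Because each lookup returns only the newest tag, a consumer never has to compare against every earlier instruction, even when several in-flight instructions write the same register.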



## Tomasulo's Plan

- OoO Execution
- Multiple functional units
  - Integer ALU, Data memory, Multiplier, Divider
- Queues between ID and EX stages (in place of ID/EX register)
  - Allows later instructions to keep issuing even if earlier ones are stalled
- Method for dealing with RAW data hazards by specifying who dependent instructions should get data from
  - But with OoO execution, WAR and WAW hazards arise! (NEW DATA HAZARDS)

## RAW, WAR, and WAW

- RAW = Read After Write
  - `lw $8, 40($2)`
  - `add $9, $8, $7`
- WAR = Write After Read
  - `add $9, $8, $6` ← say `$6` is not available yet, can LW execute?
  - `lw $8, 40($2)`
- WAW = Write After Write
  - `add $9, $8, $6` ← say `$6` is not available yet, can LW execute?
  - `lw $9, 40($2)`
  - Why would anyone produce one result in `$9` without utilizing that result? Why would they overwrite it with another result? How is this possible?

## How is WAW possible?

WAW can easily occur.

- Example 1
  - Say a company gives a standard bonus to most of the employees and a higher bonus to managers
  - The software may set a default value to the standard bonus and then overwrite it for the special case
  - `int x = standard_bonus; if (manager) x = special_bonus; set_bonus(x);`
- Example 2
  - Consider multiple iterations of a loop body: `for(i=MAX; i != 0; i--) A[i] = A[i] * 3;`

Original code for Example 2:

| Label | Instruction |
|-------|-------------|
| L1: | `lw $2, 40($1)` |
|  | `mult $4, $2, $3` |
|  | `sw $4, 40($1)` |
|  | `addi $1, $1, -4` |
|  | `bne $1, $0, L1` |

When two iterations are in flight together, both `lw` instructions write `$2`, so the second `lw` has a WAW dependence on the first.
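The three hazard definitions can be checked mechanically by comparing the registers two instructions read and write. The following is a minimal sketch; the `RegUse` record and `classify_hazards` function are hypothetical names introduced only for this illustration.

```c
#include <stdbool.h>

/* Bit flags so one pair of instructions can report several hazards. */
enum { HAZ_RAW = 1, HAZ_WAR = 2, HAZ_WAW = 4 };

/* Hypothetical summary of the registers an instruction touches. */
typedef struct {
    int dest;        /* register written, or -1 if none */
    int src1, src2;  /* registers read, or -1 if unused */
} RegUse;

/* Does instruction `i` read register `reg`? */
static bool reads(const RegUse *i, int reg) {
    return reg >= 0 && (i->src1 == reg || i->src2 == reg);
}

/* Hazards that `later` has with respect to `earlier` (program order). */
int classify_hazards(const RegUse *earlier, const RegUse *later) {
    int h = 0;
    if (reads(later, earlier->dest)) h |= HAZ_RAW;   /* later reads what earlier writes */
    if (reads(earlier, later->dest)) h |= HAZ_WAR;   /* later writes what earlier reads */
    if (earlier->dest >= 0 && earlier->dest == later->dest)
        h |= HAZ_WAW;                                /* both write the same register */
    return h;
}
```

Applied to the pairs above, `lw $8` followed by `add $9,$8,$7` reports RAW, `add $9,$8,$6` followed by `lw $8` reports WAR, and `add $9,$8,$6` followed by `lw $9` reports WAW.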













Do all instructions use the \_\_\_\_\_

\_\_\_\_\_



## Issue Unit

- How do we determine when to issue an instruction to the functional unit?
  - Is the instruction ready?
  - Is the functional unit free to start the operation?
  - CDB availability constraint
    - Will there \_\_\_\_\_\_ when the operation finishes?
  - Priority/conflict resolution
    - If many instructions are ready, which should be chosen? (Is round-robin priority adequate?)
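The issue conditions listed above can be combined into one selection routine. This sketch picks the oldest ready entry, which is only one possible answer to the priority question; the `QueueEntry` layout, the `age` field, and modelling the CDB constraint as a single `cdb_available` flag supplied by the caller are all assumptions for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical issue-queue entry. */
typedef struct {
    bool valid;       /* entry currently holds an instruction */
    bool src1_ready;  /* first operand has arrived */
    bool src2_ready;  /* second operand has arrived */
    unsigned age;     /* smaller value = dispatched earlier */
} QueueEntry;

/* Returns the index of the entry to issue this cycle, or -1 if none.
 * Policy assumed here: oldest ready instruction first. */
int select_for_issue(const QueueEntry *q, size_t n,
                     bool fu_free, bool cdb_available) {
    if (!fu_free || !cdb_available) return -1;
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (!q[i].valid || !q[i].src1_ready || !q[i].src2_ready)
            continue;                       /* empty or still waiting */
        if (best < 0 || q[i].age < q[(size_t)best].age)
            best = (int)i;                  /* older ready entry found */
    }
    return best;
}
```

Swapping the age comparison for a rotating start index would turn the same loop into a round-robin policy, which makes it easy to compare the two choices the slide asks about.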




## **Issue Queue Priority**

- Priority (based on the order of arrival among ready instructions)
  - Is it necessary or just desirable?
  - Local priority within queues?
  - Global priority across the queues?



## LSQ Ordering/Priority

- Maintaining instructions in the order of arrival (issue order/program order) in a queue
- Is this necessary and/or desirable?
  - In the case of LSQ?
  - In the case of Integer, MUL, DIV queues?
    - Desirable, so that an earlier instruction gets executed whenever possible, thereby reducing queue pressure from too many instructions waiting on it

## LAST CONSIDERATIONS FOR OUT-OF-ORDER EXECUTION/COMPLETION

Issue Queue priority, Branches, etc.

## **Conditional Branches**

- Dispatcher stalls when it reaches a branch (and waits until it is resolved)
- Branches are dispatched to the integer queue, where they wait for their operands (if necessary)
- When a branch executes, it puts its outcome & target on the CDB
  - If untaken, the dispatch unit resumes
  - If taken, the dispatch unit flushes the IFQ and resumes at the target
- Since we stop dispatching instructions after a branch, does it mean that this branch is the last instruction to be executed in the back-end?
- Is it possible that the back-end simultaneously holds
  - A. Some instructions dispatched before the branch .. AND ..
  - B. Some instructions issued after the branch?

| Label | Operands |
|-------|----------|
|       | `$4,$5,$5` |
|       | `$6,$7,L1` |
| L1:   | `$1,$2,$3` |
|       | `$9,$7,$2` |

[Figure: OoO datapath block diagram with I-Cache, instruction queue, dispatch unit, register file, register status table, TAG FIFO, integer/branch queue, mult queue, issue unit, and D-Cache.]
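The branch handling described in this section (stall dispatch at a branch, then resume or flush once the outcome appears on the CDB) can be sketched as a tiny front-end state machine. The `FrontEnd` struct and the three functions are hypothetical names, and the IFQ is reduced to an occupancy count for brevity.

```c
#include <stdbool.h>

/* Hypothetical front-end state; the IFQ is modelled only by its occupancy. */
typedef struct {
    bool stalled_on_branch;  /* dispatch is waiting for a branch outcome */
    unsigned next_pc;        /* address the fetch unit will use next */
    int ifq_count;           /* instructions waiting in the IFQ */
} FrontEnd;

/* Called when the dispatcher sends a branch to the integer queue. */
void on_branch_dispatched(FrontEnd *fe) {
    fe->stalled_on_branch = true;
}

/* Dispatch may proceed only when no branch is pending and the IFQ has work. */
bool can_dispatch(const FrontEnd *fe) {
    return !fe->stalled_on_branch && fe->ifq_count > 0;
}

/* Called when the branch outcome and target appear on the CDB. */
void on_branch_resolved(FrontEnd *fe, bool taken, unsigned target) {
    if (taken) {
        fe->ifq_count = 0;      /* flush instructions fetched past the branch */
        fe->next_pc = target;   /* refetch from the branch target */
    }
    fe->stalled_on_branch = false;  /* dispatch resumes in both cases */
}
```

Note that the stall affects only dispatch: instructions dispatched before the branch may still be waiting in their queues or executing while the branch is unresolved.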

## Structural Hazards + Exceptions

- Structural Stalls
  - Dispatch must stall if \_\_\_\_\_ OR all entries in the desired functional unit's issue queue are occupied AND an instruction of that type is attempting to dispatch
  - Fetch unit must stall if the \_\_\_\_\_
  - Functional units stall when there are no ready instructions in the queue or when there are CDB scheduling conflicts
- Precise exceptions not supported
  - Some instructions \_\_\_\_\_\_ the offending instruction may have updated registers or memory! \_\_\_\_\_
  - We'll handle this in the next unit

