

#### Goals

- Understand the terms and ideas used in a modern, high-performance processor
- Various systems have different kinds of processors and you should understand the pros and cons of each kind of processor
- Terms to listen for and understand:
  - Superscalar/multiple issue, loop unrolling, register renaming, out-of-order execution, speculation, and branch prediction

#### CS356 Unit 12b

**Advanced Processor Organization** 



#### A New Instruction

- In x86, we often perform
  - cmp %rax, %rbx
  - je L1 or jne L1
- Many instruction sets have a single instruction that both compares and jumps (limited to registers only)
  - je %rax, %rbx, L1
  - jne %rax, %rbx, L1
- Let us assume x86 supports such an instruction in our subsequent discussion



#### **INSTRUCTION LEVEL PARALLELISM**



#### Have We Hit The Limit

- Under ideal circumstances, a pipeline would allow us to achieve a throughput (IPC = Instructions per clock) of 1
- Can we do better? Can we execute more than one instruction per clock?
  - Not with a single pipeline
  - But what if we had \_\_\_\_\_\_
  - What if we fetched multiple \_\_\_\_\_ per clock and let them run down the pipeline in parallel
- Let's exploit \_\_\_\_\_!



#### Instruction Level Parallelism (ILP)

- Although a program defines a sequential ordering of instructions, in reality many instructions can be executed in parallel.
- ILP refers to the process of finding instructions from a single program/thread
  of execution that can be executed in parallel
- Data flow (data \_\_\_\_\_\_) is what truly \_\_\_\_\_ ordering
   We call these dependencies \_\_\_\_\_\_ Hazards
- Independent instructions can be executed in parallel
- Control hazards also provide ordering constraints

| Instruction | Operands      | Register effect |
|-------------|---------------|-----------------|
| ld          | 0(%r8), %r9   |                 |
| and         | %r10, %r11    | write %r11      |
| or          | %r11, %r13    | read %r11       |
| sub         | %r14, %r15    | write %r15      |
| add         | %r10, %r12    | write %r12      |
| je          | \$0, %r12, L1 | read %r12       |
| xor         | %r15, %rax    | read %r15       |

Dependency graph (figure): nodes LD, AND, OR, SUB, ADD, JE, XOR.

| Cycle   | Instructions |
|---------|--------------|
| Cycle 1 | / / /        |
| Cycle 2 | / / /        |
| Cycle 3 | / / /        |
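
As a rough illustration of how data flow alone constrains ordering, the sketch below assigns each of the seven instructions the earliest cycle in which it could start. It assumes only RAW dependencies matter, a one-cycle latency for every instruction, unlimited issue width, and no control or memory hazards. The `struct instr` layout, the register numbering (for example 0 standing in for %rax) and the name `earliest_cycles` are illustrative choices, not part of the slides.

```c
#include <assert.h>
#include <stddef.h>

#define NO_REG (-1)

/* One instruction: up to two source registers and one destination.
 * Register numbers are illustrative (9 stands for %r9, 0 for %rax). */
struct instr {
    const char *name;
    int src[2];
    int dst;
};

/* The seven-instruction example; NO_REG marks an unused operand slot. */
static const struct instr prog[] = {
    {"ld",  {8,  NO_REG}, 9},       /* ld  0(%r8), %r9 */
    {"and", {10, 11},     11},      /* and %r10, %r11  */
    {"or",  {11, 13},     13},      /* or  %r11, %r13  */
    {"sub", {14, 15},     15},      /* sub %r14, %r15  */
    {"add", {10, 12},     12},      /* add %r10, %r12  */
    {"je",  {12, NO_REG}, NO_REG},  /* je  $0,%r12,L1  */
    {"xor", {15, 0},      0},       /* xor %r15, %rax  */
};

/* cycle[i] = 1 + the latest cycle among the producers of i's sources. */
void earliest_cycles(const struct instr *p, size_t n, int *cycle)
{
    assert(p != NULL && cycle != NULL);
    for (size_t i = 0; i < n; i++) {
        cycle[i] = 1;
        for (int s = 0; s < 2; s++) {
            int reg = p[i].src[s];
            if (reg == NO_REG)
                continue;
            /* Walk backward to the most recent writer of this register. */
            for (size_t j = i; j-- > 0; ) {
                if (p[j].dst == reg) {
                    if (cycle[j] + 1 > cycle[i])
                        cycle[i] = cycle[j] + 1;
                    break;
                }
            }
        }
    }
}
```

Under these assumptions LD, AND, SUB and ADD can start in cycle 1, while OR, JE and XOR each wait one cycle for their producer. A real machine with limited issue width would need more cycles.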



#### **Exploiting Parallelism**

- With increasing transistor budgets of modern processors (i.e. we can do more things at the same time), the question becomes how do we find enough *useful* tasks to increase performance or, put another way, what are the most effective ways of exploiting parallelism?
- Many types of parallelism are available
  - Instruction Level Parallelism (ILP): Overlapping instructions within a single process/thread of execution
  - Thread Level Parallelism (TLP): Overlap execution of multiple processes / threads
  - Data Level Parallelism (DLP): Overlap an operation (instruction) that is to be applied to multiple data values (usually in an array)
    - for(i=0; i < MAX; i++) { A[i] = A[i] + 5; }
- We'll focus on ILP in this unit



#### **Basic Blocks**

- Basic Block (def.) = Sequence of instructions that will always be executed together
  - No conditional branches out
  - No branch targets coming in
  - Also called "straight-line" code
  - Average size: \_\_\_\_\_ instrucs.
- ld 0(%r8),%r9
  and %r10,%r11
  L1: add %r8,%r12
  or %r11,%r13
  sub %r14,%r10
  jeq %r12,%r14,L1
  xor %r10,%r15

This is a basic block (starts w/ \_\_\_\_\_, ends with \_\_\_\_\_)
- Instructions in a basic block can be overlapped if there are no data dependencies
- Control dependences really limit our window of possible instructions to overlap
  - Without extra hardware, we can only overlap execution of instructions within a basic block



### Superscalar

- When airplanes broke the sound barrier we said they were super-sonic
- When a processor (HW) can complete more than 1 instruction per clock cycle we say it is superscalar
- Problem: The HW can execute 2 or more instructions during the same cycle but the SW may be written and compiled assuming 1 instruction executing at a time.



- Solutions:
  - Recompile the code and rely on the compiler to safely order instructions that can be run in parallel (static scheduling)
  - Build the HW to be smart, scheduling the instructions on the fly while guaranteeing correctness (dynamic scheduling)







### **Data Flow and Dependency Graphs**

- The compiler produces a sequential order of instructions
- Modern processors will transform the sequential order to execute instructions in parallel
- Instructions can be executed in any valid ordering of the dependency graph

| Instruction | Operands    |
|-------------|-------------|
| ld          | 0(%r8), %r9 |
| and         | %r9, %r11   |
| or          | %r11, %r13  |
| sub         | %r14, %r15  |
| add         | %r10, %r12  |
| je          | \$0,%r12,L1 |
| xor         | %r15, %r9   |





# Superscalar (Multiple Issue)

- Multiple "pipelines" that can fetch, decode, and potentially execute more than 1 instruction per clock
  - k-way superscalar = Ability to complete up to k instructions per clock cycle
- Benefits
  - Theoretical throughput greater than 1 (IPC > 1)
- Problems
  - Hazards
    - Dependencies between instructions limiting parallelism
    - Branch/jump requires flushing all pipelines
  - Finding enough parallel instructions



Compiler-based solutions

#### STATIC MULTIPLE ISSUE MACHINES



#### Static Multiple Issue

- Compiler is responsible for finding and packaging instructions that can execute in parallel into issue packets
  - Only certain combinations of instructions can be in a packet together
  - Instruction packet example:
    - (1) Integer/Branch instruction slot
    - (1) LD/ST instruction
    - (1) FP operation
- An issue packet is often thought of as a LONG instruction containing multiple instructions (a.k.a. Very Long Instruction Word, VLIW)
  - Intel's Itanium used this technique (static multiple issue) but called it EPIC (Explicitly Parallel Instruction Computing)



#### 2-way VLIW Scheduling

- 1.) No forwarding w/in an issue packet (between instructions in a packet)
- 2.) Full forwarding to previous instructions
  - Those behind in the pipeline
- 3.) Still 1 stall cycle necessary when LD is followed by a dependent instruction





### Example 2-way VLIW machine

- One issue slot for INT/BRANCH operations & another for LD/ST instructions
- I-Cache reads out an entire issue packet (more than 1 instruction)
- HW is added to allow many registers to be accessed at one time
  - Just more multiplexers
- Address Calculation Unit (just a simple adder)





### Sample Scheduling

 Schedule the following loop body on our 2-way static issue machine



Register usage: %rdi = A, %esi = n = # of iterations

| Label | Instruction | Operands    |
|-------|-------------|-------------|
| L1:   | ld          | 0(%rdi),%r9 |
|       | add         | \$5,%r9     |
|       | st          | %r9,0(%rdi) |
|       | add         | \$4,%rdi    |
|       | add         | \$-1,%esi   |
|       | jne         | \$0,%esi,L1 |



| Int./Branch Slot | LD/ST Slot |
|------------------|------------|
|                  |            |
|                  |            |
|                  |            |
|                  |            |
|                  |            |
|                  |            |

w/ modifications and code movement IPC = \_\_\_ instrucs. / \_\_\_ cycle =



#### **Annotated Example**

| Int./Branch Slot | LD/ST Slot     |
|------------------|----------------|
|                  | ld 0(%rdi),%r9 |
| add \$-1,%esi    |                |
| add \$5,%r9      |                |
| add \$4,%rdi     | st %r9,0(%rdi) |
| jne \$0,%esi,L1  |                |
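
To make the throughput of this schedule concrete, the following sketch counts the real instructions in the five issue packets and divides by the number of cycles. It is only a bookkeeping aid: `struct packet`, `count_instrs` and `ipc` are hypothetical names, and a `NULL` slot is assumed to mean a nop.

```c
#include <assert.h>
#include <stddef.h>

/* One issue packet of the 2-way machine; NULL means the slot holds a nop. */
struct packet {
    const char *int_branch;
    const char *ld_st;
};

/* The annotated schedule, one packet per cycle. */
static const struct packet schedule[] = {
    {NULL,             "ld 0(%rdi),%r9"},
    {"add $-1,%esi",   NULL},
    {"add $5,%r9",     NULL},
    {"add $4,%rdi",    "st %r9,0(%rdi)"},
    {"jne $0,%esi,L1", NULL},
};

/* Number of non-nop instructions across all packets. */
size_t count_instrs(const struct packet *s, size_t cycles)
{
    size_t n = 0;
    for (size_t i = 0; i < cycles; i++) {
        if (s[i].int_branch != NULL)
            n++;
        if (s[i].ld_st != NULL)
            n++;
    }
    return n;
}

/* Instructions per clock for one pass through the schedule. */
double ipc(const struct packet *s, size_t cycles)
{
    assert(cycles > 0);
    return (double)count_instrs(s, cycles) / (double)cycles;
}
```

For this schedule that is 6 instructions in 5 cycles, an IPC of 1.2, well short of the machine's peak of 2.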



## USC Viterbi 12b.18

### **Loop Unrolling**

- Often not enough ILP w/in a single iteration (body) of a loop
- However, different iterations of the loop are often independent and can thus be run in parallel
- This parallelism can be exposed in static issue machines via loop unrolling
  - Copy the body of the loop k times and iterate only n/k times
  - Instructions from different body iterations can be run in parallel

```
void f1(int* A, int n)
{
  for(; n != 0; n--, A++)
    *A += 5;
}
```

```
// Loop unrolled 4 times
void f1(int* A, int n)
{ // assume n is a multiple of 4
  for(; n != 0; n-=4, A+=4){
    *A += 5;
    *(A+1) += 5;
    *(A+2) += 5;
    *(A+3) += 5;
  }
}
```
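
The unrolled version above assumes n is a multiple of 4. One common way to drop that assumption, sketched here with the hypothetical name `add5_unrolled`, is to follow the unrolled loop with a short cleanup loop for the leftover elements.

```c
#include <assert.h>

/* Adds 5 to each of the n ints in A, with the main loop unrolled 4 times.
 * A cleanup loop handles the leftover n % 4 elements. */
void add5_unrolled(int *A, int n)
{
    assert(n >= 0);
    int i = 0;

    /* Main loop: four independent element updates per iteration. */
    for (; n - i >= 4; i += 4) {
        A[i]     += 5;
        A[i + 1] += 5;
        A[i + 2] += 5;
        A[i + 3] += 5;
    }

    /* Cleanup loop: at most 3 remaining elements. */
    for (; i < n; i++)
        A[i] += 5;
}
```

Calling it on a 7-element array runs the unrolled body once and the cleanup loop three times.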



#### **Loop Unrolling**

```
void f1(int* A, int n) {
  for( ; n != 0; n--, A++)
     *A += 5;
}

# %rdi = A
# %esi = n = # of iterations
```

#### **Original Code**

A side effect of unrolling is the reduction of overhead instructions (fewer branches and counter/ptr. updates)

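```
// Loop unrolled 4 times
for(i=0; i < MAX; i+=4){
    A[i] = A[i] + 5;
    A[i+1] = A[i+1] + 5;
    A[i+2] = A[i+2] + 5;
    A[i+3] = A[i+3] + 5;
}
```

#### **Unrolled Code**

Register usage: %rdi = A, %esi = n = # of iterations

| Label | Instruction | Operands     |
|-------|-------------|--------------|
| L1:   | ld          | 0(%rdi),%r9  |
|       | add         | \$5,%r9      |
|       | st          | %r9,0(%rdi)  |
|       | ld          | 4(%rdi),%r9  |
|       | add         | \$5,%r9      |
|       | st          | %r9,4(%rdi)  |
|       | ld          | 8(%rdi),%r9  |
|       | add         | \$5,%r9      |
|       | st          | %r9,8(%rdi)  |
|       | ld          | 12(%rdi),%r9 |
|       | add         | \$5,%r9      |
|       | st          | %r9,12(%rdi) |
|       | add         | \$16,%rdi    |
|       | add         | \$-4,%esi    |
|       | jne         | \$0,%esi,L1  |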



- To effectively schedule the code, the compiler will often move code but must take care not to change the intended program behavior
- Must deal with WAR (Write-After-Read) and WAW (Write-After-Write) hazards in addition to RAW hazards when moving code
  - WAW and WAR hazards are not true hazards (no data communication between instrucs.) but simply conflicts because we want to use the same register...we call them antidependencies or name dependences!
  - How can we solve?

| Original                     | After moving the LD up |
|------------------------------|------------------------|
| L1: add %r8, %r9 [write %r9] | L1: ld 0(%r11), %r9    |
| add %r9, %r10                | add %r8, %r9           |
| ld 0(%r11), %r9 [write %r9]  | add %r9, %r10          |
| sub %r9, %r12                | sub %r9, %r12          |

- The LD instruction is REALLY independent (only needs %r11). Could the LD instruction be moved between the 2 add's?
- Could the LD instruction be run in parallel with or before the first add?



#### **Register Renaming**

- Unrolling is not enough because even though each iteration is independent there are conflicts in the use of registers (%r9 in this case)
  - Can't move another 'ld' instruction up until 'st' is complete due to a WAR hazard even though there is not a true data dependence
- Since there is no true dependence ('ld' does not need data from the 'st' or 'add' above) we can solve the problem by register renaming
- Register Renaming: Using different registers to solve WAR / WAW hazards
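
The same idea can be mimicked in C as a loose analogy. In the sketch below, `step_shared` reuses one temporary `t` for all four elements (like reusing %r9), while `step_renamed` gives each element its own temporary so the four loads can be grouped ahead of the adds and stores. Both function names are hypothetical, and a real C compiler performs its own register allocation, so this only illustrates the dependence structure.

```c
#include <assert.h>
#include <stddef.h>

/* One unrolled step in which every element shares the temporary t,
 * so each load must wait until the previous store has used t. */
void step_shared(int *A)
{
    assert(A != NULL);
    int t;
    t = A[0]; t += 5; A[0] = t;
    t = A[1]; t += 5; A[1] = t;
    t = A[2]; t += 5; A[2] = t;
    t = A[3]; t += 5; A[3] = t;
}

/* The same step after renaming: t0..t3 play the role of four different
 * registers, so the loads, adds and stores can each be grouped. */
void step_renamed(int *A)
{
    assert(A != NULL);
    int t0 = A[0], t1 = A[1], t2 = A[2], t3 = A[3];  /* all loads  */
    t0 += 5; t1 += 5; t2 += 5; t3 += 5;              /* all adds   */
    A[0] = t0; A[1] = t1; A[2] = t2; A[3] = t3;      /* all stores */
}
```

Both functions leave the array in the same state; only the freedom to reorder differs.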



#### Scheduling w/ Unrolling & Renaming

 Schedule the following loop body on our 2-way static issue machine

Register usage: %rdi = A, %esi = n = # of iterations

| Label | Instruction | Operands      |
|-------|-------------|---------------|
| L1:   | ld          | 0(%rdi),%r9   |
|       | add         | \$5,%r9       |
|       | st          | %r9,0(%rdi)   |
|       | ld          | 4(%rdi),%r10  |
|       | add         | \$5,%r10      |
|       | st          | %r10,4(%rdi)  |
|       | ld          | 8(%rdi),%r11  |
|       | add         | \$5,%r11      |
|       | st          | %r11,8(%rdi)  |
|       | ld          | 12(%rdi),%r12 |
|       | add         | \$5,%r12      |
|       | st          | %r12,12(%rdi) |
|       | add         | \$16,%rdi     |
|       | add         | \$-4,%esi     |
|       | jne         | \$0,%esi,L1   |

| Int./Branch Slot | LD/ST Slot |
|------------------|------------|
|                  |            |
|                  |            |
|                  |            |
|                  |            |
|                  |            |
|                  |            |
|                  |            |
|                  |            |
|                  |            |
w/ Loop Unrolling and Register Renaming (Notice how the compiler would have to modify the code to effectively reschedule)





#### **Data Dependency Hazards Summary**

- RAW = Only real data dependence
  - Must be respected in terms of code movement and ordering
  - Forwarding reduces latency of dependent instructions
- WAW and WAR hazards = Antidependencies
  - Solved by using register renaming
- RAR = No issues / dependencies



#### Loop Unrolling & Register Renaming Summary

- Loop unrolling increases code size (memory needed to store instructions)
- Register renaming burns more registers and thus may require HW designers to add more registers
- Must have some amount of independence between loop bodies
  - Dependence between iterations known as loop carried dependence

```
// Dependence between iterations
A[0] = 5;
for(i=1; i < MAX; i++)
  A[i] = A[i-1] + 5;
```
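
For this particular loop the carried dependence happens to be removable, because the values follow the closed form A[i] = 5 + 5*i. The sketch below contrasts the two versions. The function names are hypothetical, and most loop-carried dependences have no such simple closed form.

```c
#include <assert.h>

/* Loop-carried version: iteration i needs the result of iteration i-1,
 * so the iterations cannot be overlapped. */
void fill_dependent(int *A, int n)
{
    assert(n > 0);
    A[0] = 5;
    for (int i = 1; i < n; i++)
        A[i] = A[i - 1] + 5;
}

/* Closed-form version: no iteration reads another iteration's result,
 * so unrolling and reordering are safe. */
void fill_independent(int *A, int n)
{
    assert(n > 0);
    for (int i = 0; i < n; i++)
        A[i] = 5 + 5 * i;
}
```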



### Memory Hazard Issue

- Suppose %rsi and %rdi are passed in as arguments to a function, can we reorder the instructions below?
  - \_\_\_\_\_, if we have a data \_\_\_\_\_ in \_\_\_\_\_
- Data dependencies can occur via \_\_\_\_\_ and are harder for the compiler to find at compile time, forcing it to be conservative

| Instruction | Operands     |
|-------------|--------------|
| ld          | 0(%rsi),%edx |
| addl        | \$5,%edx     |
| st          | %edx,0(%rsi) |
| ld          | 0(%rdi),%eax |
| addl        | \$1,%eax     |
| st          | %eax,0(%rdi) |

Can we reorder these instructions?

| Int./Branch Slot | LD/ST Slot      |
|------------------|-----------------|
|                  | ld 0(%rsi),%edx |
|                  | ld 0(%rdi),%eax |
| addl \$5,%edx    |                 |
| addl \$1,%eax    | st %edx,0(%rsi) |
|                  | st %eax,0(%rdi) |

Can we move the 2<sup>nd</sup> 'ld' up to enhance performance?



#### Itanium 2 Case Study

- Max 6 instruction issues/clock
- 6 Integer Units, 4 Memory units, 3 Branch, 2 FP
  - Although full utilization is rare
- Registers
  - (128) 64-bit GPR's
  - (128) FPR's
- On-chip L3 cache (12 MB of cache memory)



### **Memory Disambiguation**

- Data dependencies occur in MEMORY and not just registers
- Memory RAW dependencies are also made harder because of different ways of addressing the same memory location
  - Can the following be reordered?
  - st %eax, 4(%rdi)
    ld -12(%rsi), %ecx
     \_\_\_! What if %rsi = \_\_\_\_\_\_\_
- Memory disambiguation refers to the process of determining if a sequence of stores and loads reference the \_\_\_\_\_\_ address (ordering often needs to be maintained)
- We can only reorder LD and ST instructions if we can disambiguate their addresses to determine any RAW, WAR, WAW hazards
  - LD -> LD is always fine (RAR)
  - ST -> LD (RAW), LD -> ST (WAR) or ST -> ST (WAW) hazards that need to be disambiguated
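
A small C sketch can show why the reordering question depends on the addresses. `update_in_order` follows program order, `update_loads_hoisted` performs both loads before either store, and `same_address` checks whether `st %eax, 4(%rdi)` and `ld -12(%rsi), %ecx` compute the same effective address. All three names are hypothetical, and the address check ignores partial overlap of unaligned accesses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Program order: ld, addl, st through p, then ld, addl, st through q. */
void update_in_order(int *p, int *q)
{
    assert(p != NULL && q != NULL);
    *p += 5;
    *q += 1;
}

/* Both loads hoisted above both stores. Safe only if p and q differ. */
void update_loads_hoisted(int *p, int *q)
{
    assert(p != NULL && q != NULL);
    int a = *p;   /* ld through p                        */
    int b = *q;   /* ld through q, moved above the store */
    a += 5;
    b += 1;
    *p = a;       /* st through p */
    *q = b;       /* st through q */
}

/* Nonzero when 4(%rdi) and -12(%rsi) name the same address. */
int same_address(uintptr_t rdi, uintptr_t rsi)
{
    return rdi + 4 == rsi - 12;
}
```

With distinct pointers both functions agree. When both pointers refer to the same int, the in-order version adds 6 in total while the hoisted version loses the first update, which is why the compiler must be conservative unless it can prove the addresses differ.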



#### Static Multiple Issue Summary

- Compiler is in charge of reordering, renaming, unrolling original program code to achieve better performance
- Processor is designed to fetch/decode/execute multiple instructions per cycle in the order determined by the compiler
- Pros: HW can be \_\_\_\_\_ and thus \_\_\_\_\_
  - More cores
  - Potentially higher clock rates
- Cons: Requires \_\_\_\_\_\_
  - No support for legacy software



HW-based solutions

# DYNAMIC MULTIPLE ISSUE MACHINES



#### **Out-Of-Order Execution**

- Idea: Have processor find dependencies as instructions are fetched/decoded and execute independent instructions that come after stalled instructions
  - Known as Out-of-Order Execution or \_\_\_\_\_ Scheduling
  - HW will determine the "dependency" graph at runtime and as long as an instruction isn't waiting for an earlier instruction, let it execute!







### Overcoming the Memory Latency

- What happens to instruction execution if we have a cache miss?
  - All instructions behind us need to stall
  - Could take potentially \_\_\_\_\_\_ of clock cycles to fetch the data
- Can we overcome this?





### **Organization for OoO Execution**










## Dynamic Multiple Issue

- Burden of scheduling code for parallelism is placed on the HW and is performed as the program runs (not necessarily at compile time)
  - Compiler can help by moving code, but HW guarantees correct operation no matter what
- Goal is for HW to determine data dependencies and let independent instructions execute even if previous instructions (dependent on something) are stalled
  - We call this a form of Out-of-Order Execution
- Primarily used in conjunction with speculation methods but we'll start by examining non-speculative methods (i.e. don't execute until all previous branches are resolved)



#### **Problems with OoO Execution**

- What if an \_\_\_\_\_\_ (e.g. page fault) occurs in an earlier instruction AFTER later instructions have already completed
  - OS will save the state of the program and handle the page miss
  - When OS resumes it will restart the process at the ST instruction
  - The subsequent instructions will execute for a 2<sup>nd</sup> time. BAD!!!

- I need to fetch and dispatch multiple instructions per cycle but when I hit a jump/branch I don't know which way to fetch
- Solution: Execution with ability to "Rollback"

Register usage: %rdi = A, %esi = n = # of iterations, %rdx = s

| Label | Instruction     |
|-------|-----------------|
| f1:   | ld 0(%rdx),%r8  |
|       | addl \$1,%r8    |
|       | st %r8,0(%rdx)  |
| L1:   | ld 0(%rdi),%r9  |
|       | add \$5,%r9     |
|       | st %r9,0(%rdi)  |
|       | add \$4,%rdi    |
|       | add \$-1,%esi   |
|       | jne %r8,%esi,L1 |

Figure annotations: the first st stalls on a page fault (resume after it is handled); later instructions have already completed; at the jne, where do we fetch next?

Figure (datapath): I-Cache, Instruc. Queue, Dispatch, Register Status Table, and functional units (Integer / Branch, D-Cache, Div, Mul).





#### Speculation w/ Dynamic Scheduling

- Basic block size of 5-7 instructions between branches limits ability to issue/execute instructions
  - For safety, we might consider stalling (stop dispatching instructions) until we know the outcome of a jump/branch
- But speculation allows us to predict a branch outcome and continue issuing and executing down that path
- Of course, we could be wrong so we need the ability to roll-back the instructions we should not have executed if we mispredict
- We add a structure known as the commit unit (or \_\_\_\_\_\_)



#### SPECULATIVE EXECUTION



### **Out-Of-Order Diagram**







- When an instruction completes, its results are forwarded to others but are also stored in the ROB (rather than written back to the reg. file) until it reaches the head of the ROB
- Commit unit only commits the instruction(s) at the head of the queue and only when it is fully complete
  - Ensures that everything committed was correct
  - If we hit an exception or misspeculate/mispredict a branch then throw away everyone behind it in the ROB and start fresh using the correct outcome
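
The behavior described above can be sketched as a small circular queue. This is a minimal model, assuming an 8-entry ROB, one destination register per instruction and a plain array as the register file. The names `struct rob`, `rob_dispatch`, `rob_complete`, `rob_commit` and `rob_flush_after` are illustrative and not taken from any real design.

```c
#include <assert.h>
#include <stdbool.h>

#define ROB_SIZE 8

struct rob_entry {
    int  dest_reg;  /* register this instruction will write */
    int  value;     /* result, held here until commit       */
    bool done;      /* has execution finished?              */
};

struct rob {
    struct rob_entry e[ROB_SIZE];
    int head;   /* index of the oldest instruction */
    int count;  /* number of entries in flight     */
};

void rob_init(struct rob *r)
{
    r->head = 0;
    r->count = 0;
}

/* Allocate an entry at the tail in program order; returns its index,
 * or -1 if the ROB is full and dispatch must stall. */
int rob_dispatch(struct rob *r, int dest_reg)
{
    if (r->count == ROB_SIZE)
        return -1;
    int idx = (r->head + r->count) % ROB_SIZE;
    r->e[idx] = (struct rob_entry){dest_reg, 0, false};
    r->count++;
    return idx;
}

/* Execution may finish in any order; the result waits in the ROB. */
void rob_complete(struct rob *r, int idx, int value)
{
    r->e[idx].value = value;
    r->e[idx].done = true;
}

/* Commit from the head only, in program order, writing results to the
 * register file. Returns how many instructions were committed. */
int rob_commit(struct rob *r, int *regfile)
{
    int n = 0;
    while (r->count > 0 && r->e[r->head].done) {
        regfile[r->e[r->head].dest_reg] = r->e[r->head].value;
        r->head = (r->head + 1) % ROB_SIZE;
        r->count--;
        n++;
    }
    return n;
}

/* On an exception or misprediction at entry idx, discard every entry
 * younger than idx. */
void rob_flush_after(struct rob *r, int idx)
{
    int age = (idx - r->head + ROB_SIZE) % ROB_SIZE;  /* distance from head */
    assert(age < r->count);
    r->count = age + 1;
}
```

Note that a younger instruction that finishes early still cannot update the register file until everything older than it has committed.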



#### Re-Order Buffer (ROB)





### Commit Unit (ROB)

- What happens if the ST instruction that is STALLED ends up causing a page fault...
  - ROB allows us to throw away instructions after it and replay them after the page fault is handled
- When we get to the JEQ, we don't know %r9 so we'll just guess (predict) the outcome and fetch down that predicted path
  - If we mispredict, ROB allows us to throw away instruction results

| Label | Instruction | Operands    |
|-------|-------------|-------------|
|       | ld          | …           |
|       | and         | …           |
| L1:   | add         | %r8,%r12    |
|       | sub         | %r8,%r10    |
|       | st          | %r9,0(%r13) |
|       | jeq         | %r9,%r14,L1 |
|       | xor         | …           |






### **Branch Prediction + Speculation**

- To keep the backend fed with enough instructions we need to predict a branch's outcome and perform "speculative" execution beyond the predicted (unresolved) branch
  - Roll back mechanism (flush) in case of misprediction





### **Speculation Example**

- Predict branches and execute most likely path
  - Simply flush ROB entries after the mispredicted branch
  - Need good prediction capabilities to make this useful













Pipeline begins to fill w/ correct path



Not responsible for this material

#### **BONUS MATERIAL**



#### **BRANCH PREDICTION**



#### **Branch Target Availability**

- Branches perform PC = PC + displacement where displacement is stored as part of the instruction
- Usually can't get the target until after the instruction is completely fetched (displacement is part of instruction)
  - May be 2-3 cycles in a deeply pipelined processor (ex. [I-TLB, I-Cache Lookup, I-Cache Access, Decode])
  - If a 4-way superscalar and 3 cycle branch penalty, we throw away 12 instructions on a misprediction
- Key observation: Branches always branch to same place (target is constant for each branch)





#### **Branch Prediction**

- Since basic blocks are often small, multiple issue (static or dynamic) processors may encounter a branch every 1-2 cycles
- We not only need to know the outcome of the branch but the target of the branch
  - Branch target: Branch target buffer (cache)
  - Branch outcome: Static (compile-time) or dynamic (run-time / HW assisted) prediction techniques
- To keep the pipeline full and make speculation efficient and not wasteful, the processor needs accurate predictions



#### Finding the Branch Target

- Key observation: Jump/branches always branch to same place (target is constant for each branch)
  - The first time we fetch a jump we'll have no idea where it is going to jump to and thus have to wait several cycles
  - But let's save the address where the jump instruction lives AND where it wants to jump to (i.e. the target)
  - Next time the PC gets to the starting address of the jump we can lookup the target quickly
  - Keep all this info in a small "branch target cache/buffer"



00000000004004d6 \<sum\>:

| Address | Bytes                | Instruction | Operands               |
|---------|----------------------|-------------|------------------------|
| 4004d6: | 85 f6                | test        | %esi,%esi              |
| 4004d8: | 7e 1d                | jle         | 4004f7 \<sum+0x21\>    |
| 4004da: | 48 89 fa             | mov         | %rdi,%rdx              |
| 4004dd: | 8d 46 ff             | lea         | -0x1(%rsi),%eax        |
| 4004e0: | 48 8d 4c 87 04       | lea         | 0x4(%rdi,%rax,4),%rcx  |
| 4004e5: | b8 00 00 00 00       | mov         | \$0x0,%eax             |
| 4004ea: | 03 02                | add         | (%rdx),%eax            |
| 4004ec: | 48 83 c2 04          | add         | \$0x4,%rdx             |
| 4004f0: | 48 39 ca             | cmp         | %rcx,%rdx              |
| 4004f3: | 75 f5                | jne         | 4004ea \<sum+0x14\>    |
| 4004f5: | eb 05                | jmp         | 4004fc \<sum+0x26\>    |
| 4004f7: | b8 00 00 00 00       | mov         | \$0x0,%eax             |
| 4004fc: | c6 05 36 0b 20 00 01 | movb        | \$0x1,0x200b36(%rip)   |
| 400503: | c3                   | retq        |                        |



### **Branch Target Buffer**

- Idea: Keep a cache (branch target buffer / BTB) of branch targets that can be accessed using the PC in the 1st stage
  - Cache holds target addresses and is accessed using the PC (address of instruction)
  - First time a branch is executed, cache will miss, and we'll take the branch penalty but save its target address in the cache
  - Subsequent accesses will hit (until evicted) in the BTB and we can use that target if we predict the branch is taken.
- Note: BTB is a "fully-associative" cache (search all entries for PC match)...thus it can't be very large
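
A branch target buffer along these lines might be sketched as follows. This is a minimal model, assuming 4 entries and round-robin replacement (the slides do not specify a replacement policy). The names `struct btb`, `btb_lookup` and `btb_insert` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BTB_ENTRIES 4

struct btb_entry {
    bool     valid;
    uint64_t branch_pc;  /* address where the branch instruction lives */
    uint64_t target;     /* address it jumped to last time             */
};

struct btb {
    struct btb_entry e[BTB_ENTRIES];
    int next;  /* round-robin victim for replacement (an assumption) */
};

void btb_init(struct btb *b)
{
    for (int i = 0; i < BTB_ENTRIES; i++)
        b->e[i].valid = false;
    b->next = 0;
}

/* Fully associative: compare the fetch PC against every entry.
 * On a hit, *target receives the remembered target address. */
bool btb_lookup(const struct btb *b, uint64_t pc, uint64_t *target)
{
    assert(target != NULL);
    for (int i = 0; i < BTB_ENTRIES; i++) {
        if (b->e[i].valid && b->e[i].branch_pc == pc) {
            *target = b->e[i].target;
            return true;
        }
    }
    return false;
}

/* Called once a branch resolves: remember where it jumped. */
void btb_insert(struct btb *b, uint64_t pc, uint64_t target)
{
    for (int i = 0; i < BTB_ENTRIES; i++) {
        if (b->e[i].valid && b->e[i].branch_pc == pc) {
            b->e[i].target = target;  /* refresh an existing entry */
            return;
        }
    }
    b->e[b->next] = (struct btb_entry){true, pc, target};
    b->next = (b->next + 1) % BTB_ENTRIES;
}
```

Using the `jne` at 0x4004f3 from the disassembly, the first lookup misses, and after the branch resolves and its target 0x4004ea is inserted, later lookups hit.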





#### Local vs. Global History

- What history should we look at?
  - Should we look at just the previous executions of only the particular branch we're currently predicting or at surrounding branches as well
- Local History: The previous outcomes of that branch only
  - Usually good for loop conditions
- Global History: The previous outcomes of the last m branches in time (other previous branches)





#### **Branch Outcome Prediction**

- Now that we have predicted the target, we now need to predict the outcome
- Static prediction
  - Have compiler make a fixed guess and put that as a "hint" in the instruction itself
  - Effective for loops
- Dynamic prediction
  - Some jumps are data dependent (e.g. if(x < y))
  - Keep some "history"/records of each branch's outcomes from the past & use that to predict the future
  - Store that history in a cache
  - Questions
    - What history should we use to predict a branch
    - How much history should we use/keep to predict a branch
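
One widely used way to keep a small amount of per-branch history is a table of 2-bit saturating counters indexed by the low bits of the PC. The slides do not name this scheme, so treat the sketch below as one possible answer to the questions above rather than the method intended here. The table size and all names are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_BITS 4
#define TABLE_SIZE (1u << TABLE_BITS)

/* Counter values 0 and 1 predict not-taken; 2 and 3 predict taken. */
struct predictor {
    uint8_t counter[TABLE_SIZE];
};

/* Start every counter at 1 ("weakly not-taken"). */
void pred_init(struct predictor *p)
{
    assert(p != NULL);
    for (unsigned i = 0; i < TABLE_SIZE; i++)
        p->counter[i] = 1;
}

/* Low PC bits select the counter, so each branch mostly sees only its
 * own (local) history. Distinct branches can still collide. */
static unsigned pred_index(uint64_t pc)
{
    return (unsigned)(pc & (TABLE_SIZE - 1));
}

bool pred_predict(const struct predictor *p, uint64_t pc)
{
    return p->counter[pred_index(pc)] >= 2;
}

/* Move the counter toward the actual outcome, saturating at 0 and 3. */
void pred_update(struct predictor *p, uint64_t pc, bool taken)
{
    uint8_t *c = &p->counter[pred_index(pc)];
    if (taken) {
        if (*c < 3)
            (*c)++;
    } else {
        if (*c > 0)
            (*c)--;
    }
}
```

Because the counter saturates, a loop branch that is taken many times and then falls through once is still predicted taken on the next visit to the loop.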







### Global (Correlating) Predictor

 Use the outcomes of the last m branches that were executed to select a prediction



- Given last m jumps, 2<sup>m</sup> possible combinations of outcomes & thus predictions
  - When jeq1=NT and jeq2=NT, predict jne = T, when jeq1=NT and jeq2=T, predict jne = NT, etc.
- Branch predictor indexed by concatenating LSB's of PC and m-bits of last m branch outcomes
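
The indexing step can be sketched directly. The code below assumes 4 low PC bits and m = 2 history bits, giving a 64-entry table. `PC_BITS`, `M`, `history_push` and `predictor_index` are illustrative names and sizes.

```c
#include <assert.h>
#include <stdint.h>

#define PC_BITS 4  /* low PC bits used in the index (assumed) */
#define M       2  /* number of recent branch outcomes kept   */

/* Shift the newest outcome (1 = taken, 0 = not taken) into the
 * m-bit global history, dropping the oldest outcome. */
unsigned history_push(unsigned history, int taken)
{
    return ((history << 1) | (taken ? 1u : 0u)) & ((1u << M) - 1);
}

/* Index = low PC bits concatenated with the m history bits. */
unsigned predictor_index(uint64_t pc, unsigned history)
{
    assert(history < (1u << M));
    unsigned pc_low = (unsigned)(pc & ((1u << PC_BITS) - 1));
    return (pc_low << M) | history;
}
```

For a branch at 0x4004f3 whose two preceding branches went NT then T, the history is binary 01 and the index is (3 << 2) | 1 = 13, so the same branch uses a different table entry for each of the 2<sup>m</sup> history patterns.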





#### Tournament Predictor

- Dynamically selects when to use the global vs. local predictor
  - Accuracy of global vs. local predictor for a branch may vary for different branches
  - Tournament predictor keeps the history of both predictors (global or local) for a branch and then selects the one that is currently the most accurate







### Pros/Cons of Static vs. Dynamic

- Static
  - HW can be simpler since compiler schedules code
  - Compiler can see "deeper" in the code to find parallelism
  - Used in many high-performance embedded processors like GPUs, etc. where code is more regular with high computation demand
- Dynamic
  - Allows for performance increase even for legacy software
  - Can be better at predicting unpredictable branches
  - HW structures do not scale well (ROB, reservation stations, etc.) beyond small sizes and more waste (time & power)
  - Better for unpredictable, general purpose control code



### **Dynamic Scheduling Summary**

- You can understand a modern architecture
  - https://en.wikichip.org/wiki/intel/microarchitectures/haswell (client)
- Software implications
  - Code with a lot of branches will perform worse than regular code
  - Many cache misses will limit the performance
- But compared to a statically scheduled processor...



#### **PHYSICAL VS ARCHITECTURAL REGISTERS**



### Virtual Registers?

- In static scheduling, the compiler accomplished register renaming by changing the instruction to use other programmer-visible GPR's
- In dynamic scheduling, the HW can "rename" registers on the fly
  - In the code on the left we would want %r9 to be renamed to %r10, %r11, etc. for each iteration
- Solution: A level of indirection
  - Let the register numbers be "virtual" and then perform translation to a "physical" register
  - Every time we write to the same register we are creating a "new version" of the register...so let's just allocate a physically different register

```
# %rdi = A
# %esi = n = # of iterations
# %rdx = s
    ld   0(%rdx),%r8
    addl $1,%r8
    st   %r8,0(%rdx)
L1: ld   0(%rdi),%r9
    add  $5,%r9
    st   %r9,0(%rdi)
    add  $4,%rdi
    add  $-1,%esi
    jne  $0,%esi,L1
    ld   0(%rdi),%r9
    add  $5,%r9
    st   %r9,0(%rdi)
    add  $4,%rdi
    add  $-1,%esi
    jne  $0,%esi,L1
    ld   0(%rdi),%r9
    add  $5,%r9
    st   %r9,0(%rdi)
    add  $4,%rdi
    add  $-1,%esi
    jne  $0,%esi,L1
```

Trace of instructions over 3 loop iterations. Each iteration is independent if we can rename %r9.

#### **Register Renaming**

- Whenever an instruction produces a new value for a register, allocate a new physical register and update the table
  - Mark the old physical register as "free"
  - Mark the newly allocated register as "used"
- An instruction that wants to read a register just uses whatever physical register the current mapping table indicates
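
The mapping table and free pool can be sketched as below, assuming 16 architectural and 32 physical registers. One deliberate difference from the bullets above: this sketch hands the old physical register back to the caller instead of freeing it immediately, since earlier instructions still in flight may need its value. All names (`struct rename_table`, `rename_src`, `rename_dst`, `rename_free`) are hypothetical.

```c
#include <assert.h>

#define NUM_ARCH 16  /* architectural (programmer-visible) registers */
#define NUM_PHYS 32  /* physical registers in the pool (assumed)     */

struct rename_table {
    int map[NUM_ARCH];        /* architectural -> physical mapping */
    int free_list[NUM_PHYS];  /* stack of free physical registers  */
    int num_free;
};

/* Initially architectural register i maps to physical register i,
 * and physical registers NUM_ARCH..NUM_PHYS-1 are free. */
void rename_init(struct rename_table *t)
{
    for (int i = 0; i < NUM_ARCH; i++)
        t->map[i] = i;
    t->num_free = 0;
    for (int p = NUM_PHYS - 1; p >= NUM_ARCH; p--)
        t->free_list[t->num_free++] = p;
}

/* A source operand reads whichever physical register is mapped now. */
int rename_src(const struct rename_table *t, int arch)
{
    return t->map[arch];
}

/* A destination gets a fresh physical register. Returns the new
 * register, or -1 if none is free (the front end must stall).
 * The previous mapping is reported through old_phys. */
int rename_dst(struct rename_table *t, int arch, int *old_phys)
{
    if (t->num_free == 0)
        return -1;
    *old_phys = t->map[arch];
    t->map[arch] = t->free_list[--t->num_free];
    return t->map[arch];
}

/* Return a physical register to the pool once nothing needs it. */
void rename_free(struct rename_table *t, int phys)
{
    assert(t->num_free < NUM_PHYS);
    t->free_list[t->num_free++] = phys;
}
```

With this table, the `ld 0(%rdi),%r9` of each loop iteration receives a different physical register, so iterations no longer conflict on %r9.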





**Physical Registers** 




### Architectural vs. Physical Registers

- Architectural registers = The (16) x86 registers visible to the programmer or compiler
  - Truly just names ("virtual")
  - The mapping table needs 1 entry per architectural register
- Physical registers = A greater number of actual registers than architectural registers that is used as a "pool" for renaming
- Often a large pool of physical registers (80-128) to support large number of instructions executing at once or waiting in the commit unit


