### EE 457 Unit 9c

#### Thread Level Parallelism

**USC** Viterbi School of Engineering

#### Credits

- Some of the material in this presentation is taken from:
  - Computer Architecture: A Quantitative Approach
    - John Hennessy & David Patterson
- Some of the material in this presentation is derived from course notes and slides from:
  - Prof. Michel Dubois (USC)
  - Prof. Murali Annavaram (USC)
  - Prof. David Patterson (UC Berkeley)




#### Power

- Power and energy consumption are a MAJOR concern for processors
- Power consumption can be decomposed into:
  - \_\_\_\_\_(P<sub>STAT</sub>): Power constantly being dissipated (grows with # of transistors)
  - \_\_\_\_\_ (P<sub>DYN</sub>): Power consumed for switching a bit (1 to 0)
  - $\mathbf{P}_{\mathsf{DYN}} = \mathbf{I}_{\mathsf{DYN}} * \mathbf{V}_{\mathsf{DD}} \approx \frac{1}{2} \mathbf{C}_{\mathsf{TOT}} \mathbf{V}_{\mathsf{DD}}^2 \mathbf{f}$
  - Recall, I = C dV/dt
  - V<sub>DD</sub> is the logic '1' voltage, f = clock frequency
- Dynamic power favors parallel processing vs. higher clock rates
  - The V<sub>DD</sub> value is tied to f, so a reduction/increase in f leads to a similar change in V<sub>DD</sub>
  - Implies power is proportional to f<sup>3</sup> (a cubic savings in power if we can reduce f)
  - Take a core and replicate it 4x => 4x performance and \_\_\_\_\_ power
  - Take a core and increase clock rate 4x => 4x performance and \_\_\_\_\_ power
- Static power
  - Leakage occurs no matter what the frequency is
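The scaling argument above can be checked numerically. The following is a minimal sketch, not a power model: it assumes that V<sub>DD</sub> scales linearly with f (as the slide states), that replicating a core multiplies the total switched capacitance by the number of cores, and that static power is ignored. The function names are hypothetical and all quantities are normalized to a single baseline core.

```python
def dynamic_power(c_tot, v_dd, f):
    """P_DYN ~= 1/2 * C_TOT * V_DD^2 * f (all arguments in normalized units)."""
    return 0.5 * c_tot * v_dd ** 2 * f


def relative_power_replicated(n_cores):
    """Power of n identical cores relative to one core.

    Assumption: same V_DD and f, so only total capacitance grows (by n).
    """
    return dynamic_power(n_cores, 1.0, 1.0) / dynamic_power(1.0, 1.0, 1.0)


def relative_power_overclocked(k):
    """Power of one core clocked k times faster, relative to the baseline.

    Assumption: V_DD must scale by the same factor k as f.
    """
    return dynamic_power(1.0, k, k) / dynamic_power(1.0, 1.0, 1.0)


if __name__ == "__main__":
    print("4 cores :", relative_power_replicated(4))
    print("4x clock:", relative_power_overclocked(4))
```

Under these assumptions the replicated design grows linearly in power while the overclocked design grows with the cube of the speedup, which is the reason the slide says dynamic power favors parallel processing.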

#### **BACKGROUND KNOWLEDGE**









#### Answer 4

- But in OoO processors, can't we just deepen our ROB, Issue queues, Store Address Buffer, etc to hide cache misses?
  - Associative \_\_\_\_\_\_ structures are expensive and slow down dramatically as they deepen
  - DOES NOT \_\_\_\_\_ WELL



#### Motivating HW Multithread/Multicore

- Issues that prevent us from exploiting ILP in more advanced single-core processors with deeper pipelines and OoO Execution
  - Slow memory hierarchy
  - Increased power with higher clock rates
  - Increased wire delay & size with more advanced structures (ROBs, Issue queues, etc.) for potentially \_\_\_\_\_
- All of these issues point us to find "easier" sources of parallelism such as: **TLP (Thread-Level Parallelism)**


### **OVERVIEW OF TLP**



- Thread: a stream of instructions representing a separately schedulable task


- Schedulable task: Can be transparently paused and resumed by the OS scheduler
- Consider the processor:
  - For what resources would each thread need their own copy to execute in parallel?





Very high overhead!





| Fetch | Thread Select | Decode | Exec. | Mem. | WB |
|-------|---------------|--------|-------|------|----|

Source: http://ogun.stanford.edu/~kunle/publications/niagra\_micro.pdf

### T1 Pipeline

- Fetch stage [Stage 1]
  - Thread select mux chooses which thread's instruction to issue and uses that thread's PC to fetch more instructions
  - Accesses I-TLB and I-Cache
  - 2 instructions fetched per cycle
- Thread select stage [Stage 2]
  - Chooses instructions to issue from ready threads
  - Issues based on
    - Instruction type
    - Misses
    - Resource conflicts
    - Traps and interrupts

#### **Pipeline Scheduling**

- No pipeline flush on context switch (except potentially of instructions from faulting thread)
- Full forwarding/bypassing to consuming, junior instructions of same thread
- In case of load, wait \_\_\_\_ cycles before an instruction from the same thread is issued
  - Solved \_\_\_\_\_ issue
- Scheduler guarantees fairness between threads by prioritizing the least recently scheduled thread
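The scheduling policy above can be sketched as a small cycle-by-cycle simulation. This is an illustrative model only, not the T1 hardware: the function names, the `"alu"`/`"load"` instruction labels, and the `load_delay` parameter are hypothetical, and the delay value is a placeholder because the slide leaves the actual number of cycles blank.

```python
def select_thread(ready, last_issued):
    """Pick the ready thread that was scheduled least recently.

    Ties (e.g. threads that have never issued) are broken by lowest thread id.
    Returns None when no thread is ready.
    """
    if not ready:
        return None
    return min(ready, key=lambda t: (last_issued[t], t))


def simulate(programs, load_delay=3):
    """Issue one instruction per cycle from a set of threads.

    programs: one list of "alu"/"load" strings per thread.
    load_delay: hypothetical number of cycles a thread waits after a load.
    Returns the issue trace: the thread id issued each cycle, or None for a bubble.
    """
    n = len(programs)
    pc = [0] * n              # next instruction index per thread
    ready_at = [0] * n        # first cycle at which each thread may issue again
    last_issued = [-1] * n    # cycle of each thread's most recent issue
    trace = []
    cycle = 0
    while any(pc[t] < len(programs[t]) for t in range(n)):
        ready = [t for t in range(n)
                 if pc[t] < len(programs[t]) and ready_at[t] <= cycle]
        t = select_thread(ready, last_issued)
        if t is not None:
            op = programs[t][pc[t]]
            pc[t] += 1
            last_issued[t] = cycle
            # After a load, hold this thread back; other threads fill the gap.
            ready_at[t] = cycle + 1 + (load_delay if op == "load" else 0)
        trace.append(t)
        cycle += 1
    return trace


if __name__ == "__main__":
    print(simulate([["load", "alu"], ["alu", "alu", "alu"]]))
```

With a single thread the load delay shows up as bubbles in the trace; with a second thread available, the same delay is covered by that thread's instructions and no pipeline flush is modeled.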

### T1 Pipeline

- Decode stage [Stage 3]
  - Accesses register file
- Execute Stage [Stage 4]
  - Includes ALU, shifter, MUL and DIV units
  - Forwarding Unit
- Memory stage [Stage 5]
  - DTLB, Data Cache, and 4 store buffers (1 per thread)
- WB [Stage 6]
  - Write to register file

#### A View Without HW Multithreading



### Types/Levels of Multithreading

- How should we overlap and share the HW between instructions from different threads?
  - Coarse-grained Multithreading: Execute one thread with all HW resources until a cache miss or misprediction incurs a stall or pipeline flush, then switch to another thread
  - Fine-grained Multithreading: Alternate fetching instructions from a different thread each clock
  - Simultaneous Multithreading: Fetch and execute instructions from different threads at the same time

*Figure: Levels of TLP (issue slots vs. time).*

- Superscalar: only instructions from a single thread; a cache miss incurs an expensive miss penalty
- Coarse-grained MT: switch threads when one hits a long-latency event, such as a stall due to a cache miss or a pipeline flush
- Fine-grained MT: alternate threads every cycle (Sun UltraSparc T2)
- Simultaneous MT: mix instructions from threads during the same issue cycle (Intel HyperThreading, IBM Power 5)

#### **Fine Grained Multithreading**

- Like Sun Niagara
- Alternates issuing instructions from different threads each cycle provided a thread has instructions ready to execute (i.e. not stalled)
- With enough threads, long latency events may be completely hidden
  - Some processors, like Cray, may have \_\_\_\_\_ or more threads
- Degrades single-thread performance, since each thread gets only 1 out of every N cycles if all N threads are ready

#### **Coarse Grained Multithreading**


- Swaps threads on long-latency event
- Hardware does not have to swap threads in a single cycle (as in fine-grained multithreading) but can take a few cycles since the current thread has hit a long latency event
- Requires flushing pipeline of current thread's instructions and filling pipeline with new thread's
- Better single-thread performance
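To compare the fine-grained and coarse-grained policies side by side, here is a minimal single-issue simulation. It is a sketch under simplifying assumptions: every instruction is either `"alu"` (one cycle) or `"miss"` (issues, then blocks its thread for `miss_latency` cycles), and a coarse-grained switch costs `switch_penalty` bubble cycles to stand in for the flush and refill. All names and latency values are hypothetical.

```python
def _unfinished(threads, pc):
    return any(pc[t] < len(threads[t]) for t in range(len(threads)))


def run_fine_grained(threads, miss_latency):
    """Rotate among ready threads every cycle; return total cycles."""
    n = len(threads)
    pc, ready_at = [0] * n, [0] * n
    cycle, nxt = 0, 0
    while _unfinished(threads, pc):
        for k in range(n):                      # first ready thread in rotation
            t = (nxt + k) % n
            if pc[t] < len(threads[t]) and ready_at[t] <= cycle:
                if threads[t][pc[t]] == "miss":
                    ready_at[t] = cycle + 1 + miss_latency
                pc[t] += 1
                nxt = (t + 1) % n
                break
        cycle += 1                              # a bubble if nobody was ready
    return cycle


def run_coarse_grained(threads, miss_latency, switch_penalty):
    """Run one thread until it blocks, then switch; return total cycles."""
    n = len(threads)
    pc, ready_at = [0] * n, [0] * n
    cycle, cur = 0, 0
    while _unfinished(threads, pc):
        if pc[cur] < len(threads[cur]) and ready_at[cur] <= cycle:
            if threads[cur][pc[cur]] == "miss":
                ready_at[cur] = cycle + 1 + miss_latency
            pc[cur] += 1
            cycle += 1
            continue
        # Current thread is blocked or done: look for another ready thread.
        others = [(cur + k) % n for k in range(1, n)]
        ready = [t for t in others
                 if pc[t] < len(threads[t]) and ready_at[t] <= cycle]
        if ready:
            cur = ready[0]
            cycle += switch_penalty             # flush and refill the pipeline
        else:
            cycle += 1                          # nothing to run: idle
    return cycle


if __name__ == "__main__":
    work = [["alu", "miss", "alu"], ["alu", "miss", "alu"]]
    print("fine-grained  :", run_fine_grained(work, miss_latency=4))
    print("coarse-grained:", run_coarse_grained(work, miss_latency=4, switch_penalty=2))
```

The model deliberately leaves out forwarding, fetch, and cache behavior; it only shows how the switch penalty and the per-cycle rotation trade off total throughput against single-thread progress.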

#### ILP and TLP

- TLP can also help ILP by providing another source of independent instructions
- In a 3- or 4-way issue processor, better utilization can be achieved when instructions from 2 or more threads are executed simultaneously

#### Simultaneous Multithreading

- Uses multiple-issue, dynamic scheduling mechanisms to execute instructions from multiple threads at the same time by filling issue slots with as many available instructions as possible from any thread
  - Overcome poor utilization due to cache misses or lack of independent instructions
  - Requires HW to \_\_\_\_\_ instructions based on their thread
- Requires a greater level of hardware resources (separate register renamer, branch prediction, store buffers, multiple register files, etc.)
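The slot-filling idea can be illustrated for a single issue cycle. The sketch below assumes a hypothetical `fill_issue_slots` helper that is given, per thread, the number of independent instructions ready this cycle, and that fills slots by rotating over threads one instruction at a time. Real SMT issue logic is far more involved; this only shows why a second thread raises slot utilization.

```python
def fill_issue_slots(ready_per_thread, width):
    """Fill up to `width` issue slots for one cycle.

    ready_per_thread: number of ready, independent instructions per thread.
    Returns a list with one thread id per filled slot, so each slot records
    which thread its instruction came from.
    """
    remaining = list(ready_per_thread)
    slots = []
    while len(slots) < width and any(remaining):
        # Take one instruction from each thread in turn so that no single
        # thread monopolizes the issue slots.
        for t in range(len(remaining)):
            if remaining[t] > 0 and len(slots) < width:
                slots.append(t)
                remaining[t] -= 1
    return slots


def utilization(ready_per_thread, width):
    """Fraction of issue slots filled this cycle."""
    return len(fill_issue_slots(ready_per_thread, width)) / width


if __name__ == "__main__":
    print("one thread :", utilization([1], width=4))
    print("two threads:", utilization([1, 3], width=4))
```

When one thread has only a single ready instruction (for example, because of a cache miss or a dependence chain), the other thread's instructions occupy the slots that would otherwise go empty.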



#### Example

- Intel HyperThreading Technology (HTT) is essentially SMT
- Recent processors including Core i7 are multicore, multi-threaded, multi-issue, OoO (dynamically scheduled) superscalar processors

#### Future of Multicore/Multithreaded

- Multiple cores in shared memory configuration
- Per-core L1 or even L2
- Large on-chip shared cache
- Multiple threads on each core to fight memory wall
- Ever increasing on-chip threads
  - To continue to meet Moore's Law
  - CMPs with 1000s of threads envisioned
  - Only sane option from technology perspective (i.e. out of necessity)
  - The big road block is parallel programming

#### **Parallel Programming**

- Implicit parallelism via...
  - Parallelizing compilers
  - Programming frameworks (e.g. MapReduce)
- Explicit parallelism
  - Task Libraries
    - Intel Threading Building Blocks, Java Task Library
  - Native threading (Windows threads, \_\_\_\_\_ threads)
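As a small illustration of the two explicit styles listed above, the sketch below sums a list once with hand-managed threads and once with a task library. Python's standard `threading` and `concurrent.futures` modules stand in for the native-thread and task-library APIs named on the slide; they are not the libraries the slide refers to, and the helper names are hypothetical. Note that in standard CPython, threads running CPU-bound code may not execute truly in parallel, so this shows program structure rather than speedup.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def partial_sum(chunk):
    """Unit of work handed to each thread or task."""
    return sum(chunk)


def split(data, n):
    """Split data into n interleaved chunks."""
    return [data[i::n] for i in range(n)]


def sum_with_native_threads(data, n_threads):
    """Explicit threads: the programmer creates, starts, and joins each one."""
    chunks = split(data, n_threads)
    results = [0] * n_threads

    def worker(i):
        results[i] = partial_sum(chunks[i])   # each thread writes its own slot

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results)


def sum_with_task_library(data, n_tasks):
    """Task library: the programmer describes tasks; the pool schedules them."""
    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        return sum(pool.map(partial_sum, split(data, n_tasks)))


if __name__ == "__main__":
    data = list(range(1000))
    print(sum_with_native_threads(data, 4), sum_with_task_library(data, 4))
```

The task-library version hides thread creation and joining, which is the main convenience such libraries offer over native threading.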