



### **Programming Model**

- Applications are partitioned into a set of cooperating processes
- Processes can be seen as "virtual processors"
  - Usually there are many more processes than processors and time-sharing is required
- Processes may communicate by passing messages
  - Usually done by shared mailboxes (shared memory variables) or shared regions of memory in a shared memory system
  - Interprocessor interrupts or network I/O in a message passing system
- For shared memory systems, synchronization protocols must be carefully followed to avoid read-modify-write race conditions
- Scheduling: Binding processes to processors

#### Difficulties in Exploiting MIMD

- Synchronization, locks, race conditions, etc.
- In many cases, parallel programming requires a fair amount of knowledge of the underlying \_\_\_\_\_\_ to achieve
- Limitation of speedup due to \_\_\_\_\_\_ (i.e. the portion of code that is NOT parallelized)
  - A sequential job takes 100 time units
  - 80 time units are parallelized across 10 processors
  - New Exec. Time = \_\_\_\_\_\_
  - Speedup = \_\_\_\_
    - (Compare to the linear speedup expectation: 10 processors => 10x speedup)
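
The two blanks in the example can be checked with a short calculation. The sketch below assumes the usual model behind this kind of exercise: the non-parallelized portion runs unchanged, and the parallelized portion divides evenly across the processors with no overhead. The function names are illustrative and do not come from the slides.

```c
/* Execution time when `parallel` out of `total` time units is spread evenly
 * over `procs` processors and the remainder stays sequential.
 * Assumed model: perfect division of the parallel part, no overhead. */
double new_exec_time(double total, double parallel, int procs) {
    double serial = total - parallel;   /* portion that is NOT parallelized */
    return serial + parallel / procs;
}

/* Speedup = old execution time / new execution time. */
double speedup(double total, double parallel, int procs) {
    return total / new_exec_time(total, parallel, procs);
}
```

With the numbers from the example (100 time units, 80 of them parallelized, 10 processors), `new_exec_time` gives 20 + 80/10 = 28 time units and `speedup` gives 100/28, about 3.57x, well short of the 10x a linear speedup would suggest.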

ISCA '90 Tutorial "Memory System Architectures for Tightly-coupled Multiprocessors", Michel Dubois and Faye A. Briggs © 1990.

# Synchronization

- Example: Suppose we need to sum 10,000 numbers on 10 processors. Each processor sums 1,000 numbers at its own pace, and then the results need to be combined
- We need to wait until the 10 threads have completed to combine results
- This is an example of a barrier synchronization where all threads must check in and reach the "barrier" sync point *before* any thread may continue
  - No one shall execute beyond the barrier until all others reach that point
- To implement this we keep a count and increment it atomically
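
The count-and-increment idea can be sketched with C11 atomics. This is a minimal single-use barrier under stated assumptions: the type and function names (`barrier_t`, `barrier_arrive`, and so on) are hypothetical, waiting is a plain busy-wait, and the barrier is not reset for reuse.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Single-use counting barrier (illustrative; names are not from the slides). */
typedef struct {
    atomic_int arrived;   /* number of threads that have checked in */
    int        expected;  /* number of threads that must check in */
} barrier_t;

void barrier_init(barrier_t *b, int expected) {
    atomic_init(&b->arrived, 0);
    b->expected = expected;
}

/* Check in: the increment is one atomic read-modify-write, so no arrival is lost. */
void barrier_arrive(barrier_t *b) {
    atomic_fetch_add(&b->arrived, 1);
}

/* True once every expected thread has checked in. */
bool barrier_open(barrier_t *b) {
    return atomic_load(&b->arrived) >= b->expected;
}

/* What each thread would call at the sync point: check in, then spin until all have arrived. */
void barrier_wait(barrier_t *b) {
    barrier_arrive(b);
    while (!barrier_open(b)) {
        /* busy-wait; a production barrier would yield or sleep instead */
    }
}
```

In the 10-processor example, each thread would call `barrier_wait` after finishing its 1,000-number partial sum, and the results would be combined only after the call returns.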




# Problem of Atomicity

- Sum an array, A, of numbers {5,4,6,7,1,2,8,5}
- Sequential method
  - for(i=0; i < 8; i++) { sum = sum + A[i]; }
- Parallel method (2 threads with ID=0 or 1)
  - for(i=ID\*4; i < (ID+1)\*4; i++) { local\_sum = local\_sum + A[i]; }
  - sum = sum + local\_sum;
- Problem
  - Updating a shared variable (e.g. sum)
  - Both threads read sum=0, perform sum=sum+local\_sum, and write their respective values back to sum
  - Sum ends up with only a partial sum
  - Any read/modify/write of a shared variable is susceptible
- Solution
  - Atomic updates accomplished via some form of locking
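
The lost update described above can be replayed deterministically. The sketch below does not start real threads; it performs the two threads' reads and writes by hand in the problematic order, so the outcome is repeatable. The helper names are illustrative and not from the slides.

```c
/* The array from the slide. */
const int A[8] = {5, 4, 6, 7, 1, 2, 8, 5};

/* Sum the 4-element slice owned by thread `id` (0 or 1). */
int local_sum_for(int id) {
    int local_sum = 0;
    for (int i = id * 4; i < (id + 1) * 4; i++) {
        local_sum = local_sum + A[i];
    }
    return local_sum;
}

/* Replay the bad interleaving: both threads read sum before either writes it back. */
int racy_total(void) {
    int sum = 0;
    int t0_read = sum;                  /* thread 0 reads sum = 0 */
    int t1_read = sum;                  /* thread 1 reads sum = 0 */
    sum = t0_read + local_sum_for(0);   /* thread 0 writes 22 */
    sum = t1_read + local_sum_for(1);   /* thread 1 overwrites it with 16 */
    return sum;
}

/* The intended result: each update sees the one before it. */
int correct_total(void) {
    int sum = 0;
    sum = sum + local_sum_for(0);
    sum = sum + local_sum_for(1);
    return sum;
}
```

Here `racy_total()` returns 16, only thread 1's partial sum, while `correct_total()` returns the full 38.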








### **Atomic Operations**

- Read/modify/write sequences are usually done with separate instructions
- Possible Sequence:
  - P1 Reads sum (lw)
  - P1 Modifies sum (add)
  - P2 Reads sum (lw)
  - P1 Writes sum (sw)
  - P2 uses old value...
- Partial Solution: Have a separate flag/"lock" variable (0=Lock is free/unlocked, 1 = Locked)
- Lock variable is susceptible to same problem as sum (read/modify/write)
- Hardware has to support some kind of instruction that implements atomic operations, usually by not releasing the bus between the read and the write




| Thread 1:  | Thread 2:  |
|------------|------------|
| Lock L     | Lock L     |
| Update sum | Update sum |
| Unlock L   | Unlock L   |

# Locking/Atomic Instructions

- TSL (Test and Set Lock)
  - tsl reg, addr\_of\_lock\_var
  - Atomically stores the constant '1' in lock\_var and returns the old value of lock\_var in reg
    - Atomicity is ensured by HW not releasing the bus during the RMW cycle
- LL and SC (MIPS & others)
  - Lock-free atomic RMW
  - LL = Load Linked
    - Normal lw operation but tells HW to track any external accesses to addr.
  - SC = Store Conditional
    - Like sw, but stores only if there have been no other writes to the address since the LL; returns 1 in reg if the store succeeded, 0 if it failed

| Label   | Instruction | Operands         |
|---------|-------------|------------------|
| LOCK:   | TSL         | \$4,lock_addr    |
|         | BNE         | \$4,\$zero,LOCK  |
|         | return;     |                  |
| UNLOCK: | sw          | \$zero,lock_addr |

| Label   | Instruction | Operands              |
|---------|-------------|-----------------------|
|         | LA          | \$8,lock_addr         |
| LOCK:   | ADDI        | \$9,\$0,1             |
|         | LL          | \$4,0(\$8)            |
|         | SC          | \$9,0(\$8)            |
|         | BEQ         | \$9,\$zero,LOCK       |
|         | BNE         | \$4,\$zero,LOCK       |
|         |             |                       |
|         | LA          | \$t1,sum              |
| UPDATE: | LL          | \$5,0(\$t1)           |
|         | ADD         | \$5,\$5,local_sum     |
|         | SC          | \$5,0(\$t1)           |
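
C has no direct LL/SC operation, so the sketch below uses a C11 compare-and-swap retry loop as a stand-in for the lock-free UPDATE sequence. This is an analogy, not the slides' method: `atomic_compare_exchange_weak` stores only if the variable still holds the value read earlier, whereas SC fails on any intervening write. The weak form may also fail spuriously, so it must sit in a retry loop, much as a failed SC leads back to the LL. The names `shared_sum` and `atomic_add_to_sum` are hypothetical.

```c
#include <stdatomic.h>

atomic_int shared_sum = 0;   /* shared variable updated without a lock */

/* Lock-free read-modify-write: read, compute, and store only if the value is unchanged. */
void atomic_add_to_sum(int local_sum) {
    int old = atomic_load(&shared_sum);   /* plays the role of LL */
    /* Plays the role of SC: on failure, `old` is refreshed with the current
     * value of shared_sum and the update is attempted again. */
    while (!atomic_compare_exchange_weak(&shared_sum, &old, old + local_sum)) {
        /* retry with the refreshed value */
    }
}
```

No lock variable is needed: a thread whose store attempt fails simply recomputes from the newer value, so no partial sum is lost.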

#### Solving Problem of Atomicity

- Sum an array, A, of numbers {5,4,6,7,1,2,8,5}
- Sequential method
  - for(i=0; i < 8; i++) { sum = sum + A[i]; }
- Parallel method (2 threads with ID=0 or 1)
  - lock L;
  - for(i=ID\*4; i < (ID+1)\*4; i++) { local\_sum = local\_sum + A[i]; }
  - getlock(L); sum = sum + local\_sum; unlock(L);



# Cache Coherency

- Most multi-core processors are shared memory systems where each processor has its own cache
- Problem: Multiple cached copies of same memory block
  - Each processor can get its own copy, change it, and perform calculations on its own, different value...INCOHERENT!
- Solution: \_\_\_\_\_ caches...



- A memory system is coherent if the value returned on a Load instruction is always the value given by the latest Store instruction with the same address
- This simple definition makes it possible to understand the basic problems of private caches in MP systems

#### Write Through Caches

- The bus interface unit of each processor "watches" the bus address lines and invalidates its cached copy when the cache contains a copy of the block with the modified word
- The state of a memory block b in cache i can be described by the following state diagram
  - State INV: there is no copy of block b in cache i or, if there is, it is invalidated
  - State VAL: there is a valid copy of block b in cache i









Michel Dubois, Murali Annavaram and Per Stenström © 2011.





#### Coherency Example

| Processor<br>Activity  | Bus Activity | P1 \$<br>Content | P1 Block<br>State<br>(M,S,I) | P2 \$<br>Content | P2 Block<br>State<br>(M,S,I) | Memory<br>Contents |
|------------------------|--------------|------------------|------------------------------|------------------|------------------------------|--------------------|
|                        |              | -                | -                            | -                | -                            | A                  |
| P1 reads<br>block X    | BusRd        |                  |                              |                  |                              |                    |
| P2 reads<br>block X    | BusRd        |                  |                              |                  |                              |                    |
| P1 writes<br>block X=B |              |                  |                              |                  |                              |                    |
| P2 reads<br>block X    |              |                  |                              |                  |                              |                    |

## Updated Coherency Example


| Processor<br>Activity | Bus Activity | P1 \$<br>Content | P1 Block<br>State<br>(M,S,I) | P2 \$<br>Content | P2 Block<br>State<br>(M,S,I) | Memory<br>Contents |
|-----------------------|--------------|------------------|------------------------------|------------------|------------------------------|--------------------|
|                       |              | -                | -                            | -                | -                            | A                  |
| P1 reads<br>block X   | BusRd        |                  |                              |                  |                              |                    |
| P1 writes<br>X=B      |              |                  |                              |                  |                              |                    |
| P2 writes<br>X=C      |              |                  |                              |                  |                              |                    |
| P1 reads<br>block X   |              |                  |                              |                  |                              |                    |
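
One way to check a filled-in version of the two tables is a small two-cache simulator. The sketch below assumes the textbook snooping MSI protocol with `BusRd`, `BusRdX` (read with intent to modify), and `BusUpgr` transactions, and assumes a cache holding the block in Modified writes it back to memory when another processor requests it. The slides' own state diagram is not reproduced in this text, so these choices, along with all type and function names, are assumptions.

```c
typedef enum { I, S, M } state_t;                        /* Invalid, Shared, Modified */
typedef enum { NONE, BUS_RD, BUS_RDX, BUS_UPGR } bus_t;  /* bus activity caused by an access */

typedef struct {
    state_t state[2];   /* state of block X in P1's (index 0) and P2's (index 1) cache */
    char    data[2];    /* cached value of X; meaningful only when state != I */
    char    memory;     /* value of X in main memory */
} system_t;

void sys_init(system_t *s, char initial) {
    s->state[0] = s->state[1] = I;
    s->data[0] = s->data[1] = '-';
    s->memory = initial;
}

/* Processor p (0 or 1) reads X. Returns the bus transaction it caused. */
bus_t proc_read(system_t *s, int p) {
    int other = 1 - p;
    if (s->state[p] != I) return NONE;      /* read hit in S or M: no bus activity */
    if (s->state[other] == M) {             /* owner writes the block back and drops to S */
        s->memory = s->data[other];
        s->state[other] = S;
    }
    s->data[p] = s->memory;
    s->state[p] = S;
    return BUS_RD;
}

/* Processor p writes value v to X. Returns the bus transaction it caused. */
bus_t proc_write(system_t *s, int p, char v) {
    int other = 1 - p;
    bus_t bus = NONE;                       /* write hit in M needs no bus activity */
    if (s->state[p] == S) {
        bus = BUS_UPGR;                     /* data already present; only invalidate other copies */
    } else if (s->state[p] == I) {
        bus = BUS_RDX;                      /* fetch the block and invalidate other copies */
        if (s->state[other] == M) s->memory = s->data[other];  /* owner writes back first */
    }
    if (bus != NONE) s->state[other] = I;   /* snooping cache invalidates its copy */
    s->data[p] = v;
    s->state[p] = M;
    return bus;
}
```

Replaying the first table's sequence (P1 reads, P2 reads, P1 writes B, P2 reads) with `sys_init(&s, 'A')` yields `BUS_RD`, `BUS_RD`, `BUS_UPGR`, `BUS_RD` under these assumptions, ending with both caches in S and memory holding B.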

# Problem with MSI

- Read miss followed by write causes two bus accesses
- Solution: MESI
  - New "Exclusive" state that indicates you have the \_\_\_\_\_ copy and can \_\_\_\_\_ modify it



#### Exclusive State & Shared Signal

- The Exclusive state avoids the BusUpgr that MSI requires when moving from Shared to Modified, even when no other copy exists
- New state definitions:
  - Exclusive = only copy of (modified / unmodified) cache block
  - Shared = multiple copies exist of (modified / unmodified) cache block
- New "Shared" handshake signal is introduced on the bus
  - When a read request is placed on the bus, other snooping caches assert this signal if they have a copy
  - If signal is not asserted, the reader can assume exclusive access
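
The effect of the Shared signal on the "read miss followed by write" case can be sketched by counting bus accesses. The code assumes the standard MESI behavior in which a write to a block held in the Exclusive or Modified state needs no bus transaction; the enum and function names are illustrative, not from the slides.

```c
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

/* State the reader enters after a read miss, given the bus "Shared" signal. */
mesi_t state_after_read_miss(int shared_signal_asserted) {
    return shared_signal_asserted ? SHARED : EXCLUSIVE;
}

/* Bus accesses needed to write a block currently held in state `st`. */
int bus_accesses_for_write(mesi_t st) {
    switch (st) {
    case INVALID: return 1;   /* must fetch the block with intent to modify */
    case SHARED:  return 1;   /* BusUpgr to invalidate the other copies */
    default:      return 0;   /* EXCLUSIVE or MODIFIED: only copy, write locally */
    }
}

/* Total bus accesses for "read miss, then write" on one block. */
int read_then_write_cost(int shared_signal_asserted) {
    mesi_t st = state_after_read_miss(shared_signal_asserted);
    return 1 /* BusRd for the miss */ + bus_accesses_for_write(st);
}
```

When no other cache asserts the signal, `read_then_write_cost(0)` is 1 instead of the 2 bus accesses the same sequence costs when the block lands in the Shared state.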



When P3 reads and the block is in the shared state, the slow memory supplies the data.

We can add an "Owned" state where one cache takes "ownership" of a shared block and supplies it quickly to other readers when they request it. The result is MOESI.

- In the interim, any other cache read request is serviced by the owner quickly
- Summary: Owner is responsible for...
  - Supplying a copy of the block when another cache requests it
  - Transferring ownership back to main memory when it is invalidated

