

#### Blade – A Timing Violation Resilient Asynchronous Template

Dylan Hand\*, Matheus Trevisan Moreira\*<sup>†</sup>, Hsin-Ho Huang\*, Danlei Chen\*, Frederico Butzke<sup>‡</sup>, Zhichao Li\*, Matheus Gibiluka<sup>†</sup>, Melvin Breuer\*, Ney Laert Vilar Calazans<sup>†</sup>, and Peter A. Beerel\*

May 4th, 2015

\* University of Southern California, Los Angeles, CA
† Pontifícia Universidade Católica do Rio Grande do Sul, Porto Alegre, Brazil
‡ Universidade de Santa Cruz do Sul, Santa Cruz do Sul, Brazil











Traditional synchronous design suffers from increased margins

Margins are worse in low-voltage and near-threshold regions



# **Data Dependent Delays**





Delay variation due to data is rarely exploited in traditional designs








• Correct via architectural replay or gating/pausing clock






Effective approaches have been elusive!







The Problem [Beer et al, 2014]

- Flop metastability can propagate to ERR signal
- Metastability in control path can cause system failure










[RazorII, Das, 2009]

- Error signal must go through synchronizer, increasing delay
- Uses architectural replay to recover from errors








# Hold Time Concerns





[SafeRazor, Cannizzaro, 2014]

Relies on latch for error correction

Hold times are problematic



RESILIENT DESIGN | 7 University of Southern California



#### **Resiliency Landscape**



| Design Template | Sync / Async | MTBF Safe | Avoids Replay Logic | Hold Time Robust | Low Error Penalty |
|-----------------|--------------|-----------|---------------------|------------------|-------------------|
| Bubble Razor    | Sync         | No        | Yes                 | Yes              | Yes               |
| Razor II        | Sync         | Yes       | No                  | No               | No                |
| SafeRazor       | Async        | Yes*      | Yes                 | No               | Yes               |
| Blade           | Async        | Yes       | Yes                 | Yes              | Yes               |

Blade combines the best features of past resiliency schemes







Our proposed resilient solution – Blade

Case study – 3-stage Plasma CPU

- Automated flow
- Area efficiency features
- Results and comparisons

Conclusions and future work













# Blade Template Operation





Send request speculatively before data is guaranteed stable

Timing errors delay handshaking signals
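The two statements above can be illustrated with a rough behavioural sketch. It rests on assumptions that are not on the slide: each stage sends its request after a nominal delay `delta`, late data is flagged as a timing error if it arrives within a detection window `window` after that, and an error stretches that cycle by exactly `window`. The function names and the sample arrival times are hypothetical.

```python
def stage_cycle_time(arrival, delta, window):
    """Return (cycle_time, error) for one token in a hypothetical Blade-like stage.

    arrival: time at which the stage's data actually becomes stable
    delta:   assumed nominal delay after which the request is sent speculatively
    window:  assumed detection window following delta
    """
    if arrival <= delta:
        # Data was stable before the request went out: no error, no penalty.
        return delta, False
    if arrival <= delta + window:
        # Data changed after the request: the error delays the handshake.
        return delta + window, True
    raise ValueError("data arrived after the detection window; outside this model")


def average_cycle_time(arrivals, delta, window):
    """Average cycle time over a sequence of data arrival times."""
    times = [stage_cycle_time(a, delta, window)[0] for a in arrivals]
    return sum(times) / len(times)


# Made-up arrival times: 3 of the 10 tokens arrive later than delta.
arrivals = [0.8, 0.9, 1.1, 0.7, 1.2, 0.95, 0.6, 1.0, 0.85, 1.05]
avg = average_cycle_time(arrivals, delta=1.0, window=0.5)
```

In this toy model the penalty is paid only on the cycles that actually see an error, so the average cycle time grows with the error rate rather than with the worst case.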








# **Positive Hold Margins**





Handshaking delays create positive hold margin!







C-element stores error signal, which is sampled by Q-Flop














Q-Flop prevents metastability propagation to control path














C-element and OR gates amortize overhead over many EDLs
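As a reminder of how a C-element can hold an error indication, here is a minimal behavioural sketch of a generic two-input Muller C-element fed by an OR of several EDL error bits. The wiring is an assumption made for illustration only (the OR output on one input, a hypothetical `window_open` signal on the other); the actual Blade error-detection logic may use a different C-element variant and different connections.

```python
class CElement:
    """Generic two-input Muller C-element.

    The output follows the inputs when they agree and holds its
    previous value when they differ.
    """

    def __init__(self, initial=0):
        self.out = initial

    def update(self, a, b):
        if a == b:
            self.out = a
        return self.out


def or_tree(error_bits):
    """Combine the error outputs of many EDLs into a single bit."""
    return int(any(error_bits))


c = CElement()
trace = []
# Each step: (hypothetical window_open signal, error bits from four EDLs).
steps = [
    (1, [0, 0, 0, 0]),  # window open, no error: output stays 0
    (1, [0, 1, 0, 0]),  # one EDL flags an error: output rises to 1
    (0, [0, 1, 0, 0]),  # window closes: the error is held
    (0, [0, 0, 0, 0]),  # both inputs low: output resets to 0
]
for window_open, edl_errors in steps:
    trace.append(c.update(or_tree(edl_errors), window_open))
```

One shared C-element behind an OR tree is what lets the storage cost be spread over many latches in this sketch.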










# **Controller Implementation**





Three-part burst-mode state machine

• Implemented using the 3D synthesis tool [Yun, 1992]








# Case Study: Plasma




MIPS OpenCore

3-stage pipeline

28nm FDSOI @ 666MHz (w/ ideal clock and Vdd)

| Type          | Count  |
|---------------|--------|
| Combinational | 11,740 |
| Buf/Inv       | 1,683  |
| Seq. (Non-RF) | 531    |
| RF            | 2,048  |
| Total         | 14,319 |

[1] http://opencores.org/project,plasma



# Automatic Conversion Flow



- Convert single-clock sync RTL design to Blade
- Re-uses synchronous EDA tools and libraries
- Seamless integration into existing flows









Synthesize sync RTL design using standard EDA tools







Replace flip-flops with master-slave latches

Two-phase non-overlapping clocking



# Latch Retiming

[Flow diagram: Sync Synthesis → FF to Latch Conversion → Latch Retiming → Add EDL + Async Control → Simulation]

Retime latches to spread logic delay across stages

Allow time borrowing to reduce area overhead



### **EDL** Insertion





Replace non-time-borrowing (non-TB) latches with error detecting latches (EDLs)

Not all latches need be error detecting




# Async Control





Remove synchronous clock trees

Add Blade controllers and delay lines



# Simulation





Back-annotated SDF simulation using the final netlist







































### **Brute Force Resynthesis**





Evaluated hundreds of resynthesis runs

• Each run applies a max delay constraint to a single latch

Chose the result that led to the largest reduction in area and error rate
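The search described above can be sketched as a simple loop. This is only an outline under stated assumptions: `resynthesize` stands in for a real resynthesis run and is hypothetical, the slide does not say how area and error rate were weighed against each other, so the score used here (sum of the two relative reductions) is an assumption, and the numbers in the toy table are made up.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    latch: str         # latch that received the max delay constraint
    area: float        # resulting area (arbitrary units)
    error_rate: float  # resulting error rate (0..1)


def pick_best(baseline, results):
    """Pick the run with the largest combined relative reduction (assumed score)."""
    def score(r):
        return ((baseline.area - r.area) / baseline.area
                + (baseline.error_rate - r.error_rate) / baseline.error_rate)

    best = max(results, key=score)
    # Keep the baseline if no run improves on it under this score.
    return best if score(best) > 0 else baseline


def brute_force(latches, resynthesize, baseline):
    """One resynthesis run per latch, then choose the best result."""
    results = [resynthesize(latch) for latch in latches]
    return pick_best(baseline, results)


def fake_resynthesize(latch):
    """Toy stand-in for a resynthesis run; values are invented for illustration."""
    table = {"L1": (1000.0, 0.30), "L2": (980.0, 0.28), "L3": (990.0, 0.25)}
    area, error_rate = table[latch]
    return RunResult(latch, area, error_rate)


baseline = RunResult("none", 1000.0, 0.30)
best = brute_force(["L1", "L2", "L3"], fake_resynthesize, baseline)
```

In a real flow each call to `resynthesize` would be a full tool run, which is why the number of runs (hundreds, per the slide) dominates the cost of this approach.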



# Area and Performance






### Performance Comparison with Margins



Must add margins for PVT variation, clock skew / jitter, and/or aging

• Synchronous frequency degraded to accommodate

Margin in Blade design is only imposed when an error occurs

A 30% error rate reduces impact of margin by ~70%

| Margin | Synchronous | Blade (30% ER) | % Advantage |
|--------|-------------|----------------|-------------|
| 0%     | 666MHz      | 800MHz         | 20%         |
| 15%    | 566MHz      | 764MHz         | 35%         |
| 30%    | 466MHz      | 728MHz         | 56%         |
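The table values are consistent with a simple model, reconstructed here from the numbers rather than taken from the slides: the synchronous design loses the full margin on every cycle, while Blade loses it only on the fraction of cycles that see an error. The function names are hypothetical.

```python
def sync_freq(f_nominal_mhz, margin):
    """Synchronous frequency when the full margin is paid on every cycle."""
    return f_nominal_mhz * (1.0 - margin)


def blade_freq(f_nominal_mhz, margin, error_rate):
    """Blade frequency when the margin is paid only on erroneous cycles."""
    return f_nominal_mhz * (1.0 - error_rate * margin)


rows = []
for margin in (0.0, 0.15, 0.30):
    s = sync_freq(666.0, margin)
    b = blade_freq(800.0, margin, error_rate=0.30)
    advantage_pct = 100.0 * (b / s - 1.0)
    rows.append((round(s), round(b), round(advantage_pct)))

for margin, (s, b, adv) in zip((0, 15, 30), rows):
    print(f"{margin:>3}% margin: sync {s} MHz, Blade {b} MHz, advantage {adv}%")
```

With a 30% error rate, only 30% of the margin shows up in the Blade frequency, which is the "~70%" reduction quoted above.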




# **Other Related Work**



Canary Circuits [Sato, 2007]

- Removes some PVT margins
- But cannot take advantage of data dependency

Bundled Data Designs [Sutherland, 1989; Nowick, 1997]

- Speculative completion sensing exploits *some* data dependency
- Margins impact performance on every cycle
- No observability of errors

Soft Mousetrap [Liu, 2013]

- Hold time constraints remain difficult to meet



# Conclusions



#### Blade Template

- Achieves higher performance by exploiting data dependency
- Benefits from average-case rather than worst-case metastability (MS) resolution times
- Reduces impact of margins for PVT variations
- Enables voltage scaling for power savings

#### Plasma Case Study

- Highlights design and CAD techniques for area efficiency
- Achieves 19% increase in performance with 8.4% area overhead





## Questions?


