

Addressing Emerging Fault Modes with Testing and Reliability

Vilas Sridharan, Senior Fellow, RAS Architecture, Advanced Micro Devices, Inc.

With credit to: Sankar Gurumurthy, Sudhanva Gurumurthi, Jeff Rearick, Steve Hesley

AMD together we advance\_

#### **Our Mission**

#### Public Trusts Compute

How

Security

Privacy

Integrity

Reliability

# The Next Five Years: Computing Market Transformation



#### **Data Center and Cloud**

Insatiable Performance Demands

Workload Optimized Compute/Networking

Edge Compute: Distributed DC

Security from Core to Edge

Efficiency and Sustainability Focus

#### **Explosion of AI**

AI Workloads Proliferating

Dominating the Data Center

Expanding to Edge and Endpoint

Increasingly Large Models



#### **PCs & Gaming**

Hybrid Work Focused on Improving Collaboration, Battery Life, Security

Billions of Gamers Gaming Anywhere and at Anytime

AI-powered Productivity, Creativity and Gaming

#### **Technology Scaling**



Process Technology is not scaling at Moore's Law

New approaches are required

Chiplets & Die Stacking are becoming ubiquitous


### Is the industry achieving the mission?

"CPU SDCs [silent data corruptions] are orders of magnitude higher than soft-error based FIT simulations" [1]

#### **Silent Data Corruptions at Scale**

Harish Dattatraya Dixit, Sneha Pendharkar, Matt Beadon, Chris Mason, Tejasvi Chakravarthy, Bharath Muthiah, Sriram Sankar (Facebook, Inc.)

"On the order of a few mercurial cores per several thousand machines" [2]

#### Cores that don't count

Peter H. Hochschild, Paul Turner, Jeffrey C. Mogul, Rama Govindaraju, Parthasarathy Ranganathan, David E. Culler, Amin Vahdat (Google, Sunnyvale, CA, US)

"CPU SDCs occur at a low but non-negligible frequency" [3]

#### Understanding Silent Data Corruptions in a Large Production CPU Population

Shaobu Wang, Guangyan Zhang*, Junyu Wei (Tsinghua University); Yang Wang (The Ohio State University); Jiesheng Wu, Qingchao Luo (Alibaba Cloud)

[1] Meta: "Silent Data Corruptions at Scale"
[2] Google: "Cores that don't count"
[3] Alibaba: "Understanding Silent Data Corruptions in a Large Production CPU Population"


#### What the industry has learned

#### Root causes

#### Small delay faults (SDFs) due to marginal defects [4] [5] [6]

[4] VTS 2023: "Silent data errors: Sources, Detection, and Modeling"
[5] SIGARCH CAT 2023: "Emerging Fault Modes: Challenges and Research Opportunities"
[6] IRPS 2024: "Defect Mechanisms Responsible for Silent Data Errors"
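To make the small delay fault idea concrete, here is a minimal sketch under a simple timing model: a path fails only when its nominal delay plus the defect's extra delay exceeds the clock period. The `Path` class, the path names, and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Path:
    name: str
    delay_ns: float  # nominal propagation delay of the path


def fails_timing(path: Path, clock_period_ns: float, extra_delay_ns: float) -> bool:
    """True if the path misses the clock edge once the defect's delay is added."""
    return path.delay_ns + extra_delay_ns > clock_period_ns


# Hypothetical numbers, for illustration only.
CLOCK_PERIOD_NS = 1.0
EXTRA_DELAY_NS = 0.10  # size of the small delay fault

paths = [Path("short", 0.40), Path("medium", 0.75), Path("critical", 0.95)]

# Only paths with slack smaller than the added delay are affected.
failing = [p.name for p in paths if fails_timing(p, CLOCK_PERIOD_NS, EXTRA_DELAY_NS)]
print(failing)
```

Under this model the same defect is invisible on most paths, which is one way to see why a test that does not exercise the low-slack path can miss it.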








### How SDFs Affect Product Lifecycle

#### Increasing cost of detection $\rightarrow$



$\leftarrow$ Imperative to move detection





### **Areas for Innovation**

### Testing

#### Burn-in techniques and coverage to accelerate latent defects

#### Structural tests that can better mimic mission mode conditions

Improved functional tests (manufacturing and online)

### **Burn-in**



New method: UDFM based on E-fields





Traditionally, scan toggle is used in burn-in

Toggle coverage at the nodes used as the metric

Does that satisfy the requirement of getting electric field across dielectrics and channels?
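As a rough illustration of the toggle coverage metric mentioned above, the sketch below counts a node as covered when it sees both a rising and a falling transition. That definition, the node names, and the traces are assumptions made for this example, not a description of any particular tool.

```python
def toggle_coverage(traces: dict[str, list[int]]) -> float:
    """Fraction of nodes that saw both a 0->1 and a 1->0 transition."""

    def toggled(values: list[int]) -> bool:
        transitions = set(zip(values, values[1:]))
        return (0, 1) in transitions and (1, 0) in transitions

    if not traces:
        return 0.0
    return sum(toggled(v) for v in traces.values()) / len(traces)


# Hypothetical per-node logic values sampled once per scan cycle.
traces = {
    "n1": [0, 1, 0, 1],  # rises and falls: covered
    "n2": [0, 0, 1, 1],  # rises only: not covered
    "n3": [1, 1, 1, 1],  # never toggles: not covered
    "n4": [1, 0, 1, 0],  # falls and rises: covered
}
coverage = toggle_coverage(traces)
print(coverage)
```

Note that this metric only records whether a node switched. It carries no information about how long, or at what magnitude, a field was applied across a given dielectric or channel, which is the gap the question above points at.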

### **Structural Testing**



A simplistic view of **Scan based test** application

Combinatorial explosion in the number of faults for path delay fault models

Electrical environment during a scan test does not replicate mission mode
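The combinatorial explosion can be seen with a small counting sketch. It assumes a hypothetical netlist of `depth` layers with `width` gates each, fully connected from layer to layer; real designs are not shaped like this, but the growth in path count is the point.

```python
from functools import lru_cache


def count_paths(fanout: dict[str, list[str]], sources: list[str], sinks: set[str]) -> int:
    """Count distinct source-to-sink paths in an acyclic netlist graph."""

    @lru_cache(maxsize=None)
    def paths_from(node: str) -> int:
        if node in sinks:
            return 1
        return sum(paths_from(nxt) for nxt in fanout.get(node, []))

    return sum(paths_from(s) for s in sources)


def layered_netlist(width: int, depth: int):
    """Hypothetical netlist: `depth` layers of `width` gates, fully connected."""
    fanout = {}
    for layer in range(depth - 1):
        for g in range(width):
            fanout[f"L{layer}G{g}"] = [f"L{layer + 1}G{h}" for h in range(width)]
    sources = [f"L0G{g}" for g in range(width)]
    sinks = {f"L{depth - 1}G{g}" for g in range(width)}
    return fanout, sources, sinks


# 40 gates in total, yet the number of paths is width ** depth.
num_paths = count_paths(*layered_netlist(width=4, depth=10))
print(num_paths)
```

A fault model with one fault per gate grows linearly with design size, while a model with one fault per path grows with the path count, which is why exhaustive path delay targeting is generally impractical.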

### **Functional Testing**

### **Manufacturing Test**



A simplistic view of *functional self-test* application

Limited generation of functional tests targeting fault models

Increased design complexity means system level tests don't fully represent mission mode

Coverage evaluation and other toolsets for functional tests are lagging

### **Online Test**



A simplistic view of **online functional test** lifecycle

Test period must be short enough to catch degraded parts before they affect real compute

Testing should not affect the overall utilization of the servers
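The two requirements above pull in opposite directions, which a back-of-the-envelope sketch can show. It assumes a fixed-duration test run once per period, that the test detects the degradation when it runs, and that worst-case exposure equals the period; the function names and the budgets are hypothetical.

```python
from typing import Optional


def utilization_overhead(test_duration_s: float, period_s: float) -> float:
    """Fraction of server time spent running the online test."""
    return test_duration_s / period_s


def pick_period(test_duration_s: float, max_exposure_s: float,
                max_overhead: float) -> Optional[float]:
    """Return a test period meeting both budgets, or None if they conflict.

    Worst-case exposure is taken to be one full period (a part degrades just
    after a test finishes), so the longest admissible period is
    max_exposure_s, which is also the one with the lowest overhead.
    """
    period_s = max_exposure_s
    if utilization_overhead(test_duration_s, period_s) <= max_overhead:
        return period_s
    return None


# Hypothetical budgets: a 60 s test and at most 2% of server time on testing.
hourly = pick_period(test_duration_s=60, max_exposure_s=3600, max_overhead=0.02)
ten_minutes = pick_period(test_duration_s=60, max_exposure_s=600, max_overhead=0.02)
print(hourly, ten_minutes)
```

When the function returns `None`, the exposure and utilization budgets cannot both be met with that test, so either the test must get shorter or one of the budgets must be relaxed.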





Can we craft **high-coverage** functional tests, **targeted** at specific hardware blocks and **specific fault models**, in an **automated** manner?



Adapts hardware fuzzing techniques to automatically generate functional tests

Hardware coverage metrics for grading tests:

AVF: transient faults in arrays

IBR: stuck-at faults in functional units

Maximizing hardware coverage $\rightarrow$ higher likelihood of catching a defect that manifests with a given fault model



N. Karystinos, O. Chatzopoulos, G. Fragkoulis, G. Papadimitriou, D. Gizopoulos, S. Gurumurthi, "Harpocrates: Breaking the Silence of CPU Faults through Hardware-in-the-Loop Program Generation," ISCA 2024
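The general shape of coverage-guided test generation can be sketched as a mutate-and-keep loop. This is a generic illustration, not the Harpocrates algorithm: the opcode list, the `toy_coverage` metric (a stand-in for a hardware coverage metric such as AVF or IBR), and all parameters are hypothetical.

```python
import random

OPCODES = ["add", "sub", "mul", "and", "or", "xor", "shl", "shr"]


def toy_coverage(program: list[str]) -> float:
    """Stand-in coverage metric: fraction of distinct opcodes exercised."""
    return len(set(program)) / len(OPCODES)


def mutate(program: list[str], rng: random.Random) -> list[str]:
    """Replace one randomly chosen instruction with a random opcode."""
    child = list(program)
    child[rng.randrange(len(child))] = rng.choice(OPCODES)
    return child


def generate_test(length: int = 8, iterations: int = 200, seed: int = 0):
    """Keep a mutated program whenever its coverage does not decrease."""
    rng = random.Random(seed)
    best = [rng.choice(OPCODES) for _ in range(length)]
    best_score = toy_coverage(best)
    for _ in range(iterations):
        candidate = mutate(best, rng)
        score = toy_coverage(candidate)
        if score >= best_score:
            best, best_score = candidate, score
    return best, best_score


program, score = generate_test()
print(program, score)
```

In a hardware-in-the-loop setting the grading step would come from a microarchitectural model or the hardware itself rather than a function like `toy_coverage`; the loop structure is the only part this sketch is meant to convey.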

### **Reliability Architecture**

#### Precise techniques that approach coverage of "big hammer" techniques

#### Metrics that quantify ROI for protection of design blocks

#### Techniques to target reliability architecture at small delay faults

### **Big Hammer Techniques**



Use multiple copies of logic to check each other

Examples: lockstep, redundant multi-threading



Theoretically can cover a large portion of a design

Can be run during regular operation in fleet



Will not be replicating the electrical conditions seen in mission mode

Cost (performance/power/area)
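A software analogy for the redundancy idea is to run the same computation more than once and compare the results. The sketch below is only an analogy: real lockstep and redundant multi-threading are hardware mechanisms, and the helper names and the injected bit flip are hypothetical.

```python
class MismatchError(RuntimeError):
    """Raised when redundant copies of a computation disagree."""


def run_redundantly(fn, *args, copies: int = 2):
    """Run fn several times and return the result only if all copies agree."""
    results = [fn(*args) for _ in range(copies)]
    if any(r != results[0] for r in results[1:]):
        raise MismatchError(f"redundant copies disagree: {results}")
    return results[0]


def make_faulty_adder(faulty_call: int):
    """Adder whose result has one bit flipped on the given call (fault injection)."""
    calls = {"n": 0}

    def add(a: int, b: int) -> int:
        calls["n"] += 1
        result = a + b
        if calls["n"] == faulty_call:
            result ^= 1 << 3  # flip bit 3 to emulate a corrupted result
        return result

    return add


clean = run_redundantly(lambda a, b: a + b, 2, 3)

detected = False
try:
    run_redundantly(make_faulty_adder(faulty_call=2), 2, 3)
except MismatchError:
    detected = True
print(clean, detected)
```

The check needs no knowledge of what the logic computes, which is why it can cover a large portion of a design, but every operation is executed `copies` times, which is the performance, power, and area cost listed above.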


### **Precise Techniques**



Information redundancy

Protect smaller sections of logic with lower-overhead protection techniques

Examples: parity checking, error correction codes (ECCs), parity prediction



Lower overhead to design

Better diagnosability



Covers only portions of the design and hence requires identification of logic to protect

Difficult to reason about the ROI of protecting different sections of the design
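Parity checking, the simplest of the information-redundancy examples above, can be sketched in a few lines. This assumes even parity over a single word; the word value and helper names are hypothetical.

```python
def parity_bit(word: int) -> int:
    """Even-parity bit: 1 if the word has an odd number of set bits."""
    return bin(word).count("1") & 1


def check(word: int, stored_parity: int) -> bool:
    """True if the word is consistent with the parity stored alongside it."""
    return parity_bit(word) == stored_parity


word = 0b10110010
stored = parity_bit(word)

single_flip_ok = check(word ^ 0b00000100, stored)  # one bit corrupted
double_flip_ok = check(word ^ 0b00000110, stored)  # two bits corrupted
print(check(word, stored), single_flip_ok, double_flip_ok)
```

A single extra bit detects any odd number of flipped bits but misses an even number, which illustrates the trade-off: low overhead and a precise location for the error, at the price of partial coverage.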

#### DelayAVF

| Term                 | Definition                                                                                                                            |
|----------------------|---------------------------------------------------------------------------------------------------------------------------------------|
| DelayAVF             | The probability that a small delay fault in a microarchitectural structure propagates to a program-visible error                     |
| DelayACE             | A circuit element $e$ is $DelayACE_d$ in cycle $i$ if an added fixed-length propagation delay $d$ results in a program-visible error |
| Calculating DelayAVF | $DelayAVF_{d}(T) = \sum_{\forall e \in T} \sum_{i=1}^{N} \frac{DelayACE_{d}(e, i)}{N \cdot \lvert E \rvert}$                           |

Figure: normalized DelayAVF values versus delay duration (% of clock cycle).
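The DelayAVF formula can be evaluated directly once per-element, per-cycle DelayACE outcomes are known. The sketch below assumes that $E$ in the denominator is the set of circuit elements in structure $T$ and that $N$ is the number of cycles; the 0/1 matrix is made-up data, and obtaining real DelayACE values requires fault injection or analysis that is not shown here.

```python
def delay_avf(delay_ace: list[list[int]]) -> float:
    """DelayAVF for one structure and one delay value d.

    delay_ace[e][i] is 1 if circuit element e is DelayACE_d in cycle i
    (the added delay leads to a program-visible error), else 0.
    """
    num_elements = len(delay_ace)   # |E|, assumed to be the elements in T
    num_cycles = len(delay_ace[0])  # N
    total = sum(sum(row) for row in delay_ace)
    return total / (num_cycles * num_elements)


# Hypothetical outcomes: 3 circuit elements observed over 4 cycles.
delay_ace = [
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 0, 0],
]
avf = delay_avf(delay_ace)
print(avf)
```

Computing this value per structure gives a way to rank design blocks by vulnerability to small delay faults, which is the kind of metric needed to reason about the ROI of protecting one block over another.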


### Industry efforts

#### **OCP Server Component Resilience: Research Grant Awards**

| University      | Topic                                                                                                                                         |
|-----------------|-----------------------------------------------------------------------------------------------------------------------------------------------|
| Arizona State   | MOTION: Probabilistic Fault Modeling and Test Generation using On-chip Telemetry IntegratION & Generative AI                                  |
| Auburn          | Understanding Test Escapes and SDC Failures in ICs Caused by Transistors with Extreme Device Parameters from Random Manufacturing Variations |
| Carnegie Mellon | SDC Detection and Correction In Software via Application-level Coding Techniques                                                              |
| Stanford        | Mobilizing Hardware and Software Towards SDC Testing, Detection, and Correction                                                               |
| U of Athens     | Grade Early and Detect Fast – Tackling Silent Data Corruption through the Power of Microarchitectural Modeling                                |
| U of Chicago    | Formal Verification of HW Failures & Understanding Impact on Accelerators                                                                     |

6 winning proposals with wide-ranging solutions proposed

Demonstrates strong industry and academic commitment to solving SDC




### Learning from the Past



Reference: S. Mukherjee, "Architecture Design for Soft Errors," Morgan Kaufmann, 2008




### Thank You

#### Disclaimer

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. Any computer system has risks of security vulnerabilities that cannot be completely prevented or mitigated. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

THIS INFORMATION IS PROVIDED "AS IS." AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY RELIANCE, DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

Third-party content is licensed to you directly by the third party that owns the content and is not licensed to you by AMD. ALL LINKED THIRD-PARTY CONTENT IS PROVIDED "AS IS" WITHOUT A WARRANTY OF ANY KIND. USE OF SUCH THIRD-PARTY CONTENT IS DONE AT YOUR SOLE DISCRETION AND UNDER NO CIRCUMSTANCES WILL AMD BE LIABLE TO YOU FOR ANY THIRD-PARTY CONTENT. YOU ASSUME ALL RISK AND ARE SOLELY RESPONSIBLE FOR ANY DAMAGES THAT MAY ARISE FROM YOUR USE OF THIRD-PARTY CONTENT.

AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc.

© 2024 Advanced Micro Devices, Inc. All rights reserved.
