Outsmart Silent Data Corruption in AI Processors with Two-Stage Detection

GenAI and ML workloads are causing a ramp up in silent data corruption. Multi-stage detection with on-chip, AI-based telemetry offers smarter fault prevention.

Vijay N

Related To:

proteanTecs

Oct. 20, 2025


What you'll learn:

The current breadth of the silent data corruption (SDC) problem no one is talking about.
How SDCs are increasing due to rising AI and ML workloads.
Solving the issue with multi-stage detection during both chip manufacturing and in-field operation using on-chip, AI-based telemetry.


As transistor geometries shrink and system complexity scales, one inconvenient truth becomes harder to ignore: Silent data corruption (SDC) is more common and consequential than most system architects assume. These errors leave no trace, making them notoriously hard to identify. Yet a single one can distort model weights across independent nodes, quietly derailing a training run that may span weeks, involve over 25,000 GPUs, and cost more than $100 million.

Even with massive investments in validation and testing, undetected faults continue to challenge silicon reliability in fleet-level AI deployments.

If a single chip introduces a silent error during synchronization, the corruption can propagate across the cluster. IEEE studies show a dramatic rise in soft error rates, from one failure per year at 65 nm to one every 1.5 hours at 16 nm (see figure).

Soft errors such as silent data corruption (SDC) have increased from one failure per year at 65 nm to one failure every 1.5 hours at 16 nm.

Meta and Alibaba reported hardware errors every three hours and 361 defective parts per million (DPPM), respectively, in their AI and cloud infrastructures. A rate of 361 DPPM, or even several thousand DPPM, might not raise alarms at a small scale, but the picture changes dramatically across fleets with millions of devices, where SDC events become frequent enough to jeopardize system-wide reliability.
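To make the scale effect concrete, the short Python sketch below turns a DPPM rate into an expected count of defective devices and into the chance that a fleet contains at least one. The 361 DPPM figure and the 25,000-GPU cluster size come from the text above. The function names, the other fleet sizes, and the assumption that devices are defective independently of one another are illustrative choices, not something the cited reports state.

```python
def expected_defective(fleet_size: int, dppm: float) -> float:
    """Expected number of defective devices in a fleet at a given DPPM rate."""
    return fleet_size * dppm / 1_000_000


def prob_at_least_one(fleet_size: int, dppm: float) -> float:
    """Probability that a fleet holds at least one defective device.

    Assumption: each device is defective independently with probability
    dppm / 1e6. Real defects may cluster by lot or wafer.
    """
    p = dppm / 1_000_000
    return 1.0 - (1.0 - p) ** fleet_size


if __name__ == "__main__":
    # Fleet sizes are illustrative; 25,000 matches the training-run example.
    for fleet in (1_000, 25_000, 1_000_000):
        count = expected_defective(fleet, 361)
        chance = prob_at_least_one(fleet, 361)
        print(f"{fleet:>9} devices: ~{count:.1f} defective expected, "
              f"P(at least one) = {chance:.3f}")
```

Under these assumptions, a defective part is unlikely in a thousand-device deployment but close to certain in a cluster the size of a large training run.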

As AI Grows, So Does the Threat of Silent Data Corruption

SDC is a growing reliability threat to scaling generative AI and machine-learning (ML) workloads, including model training, inference, and high-performance AI applications. These processes often push processors to their limits, increasing the probability of silent corruption.

Unlike memory bit flips, which are typically mitigated by error-correcting codes, SDC stems from subtle compute-level faults: timing violations, aging effects, or marginal defects that escape conventional semiconductor testing. These errors silently distort computations, often without triggering alerts, and go undetected until they manifest as incorrect outputs or potentially flawed decision-making. The larger and more complex the AI system, the more likely these faults will occur and the more damaging their effects.

Traditional redundancy methods can protect memory and communication paths, but they offer little defense against execution-level faults — the primary source of SDC in modern AI environments. Real-world consequences range from barely noticeable miscalculations to business-impacting failures. Industry reports have documented cases including lost database files due to miscalculated mathematical operations in a defective CPU and a storage application reporting checksum mismatches of user data caused by defective CPUs.

Trying to Stem the SDC Problem

As process nodes shrink and chip architectures become more advanced, traditional test methods such as scan ATPG (automatic test pattern generation), BIST (built-in self-test), and basic functional testing haven’t kept pace. While sufficient for catching discrete manufacturing defects, they often fail to detect the subtler semiconductor process variations that lead to SDC.

This creates a persistent blind spot, underscoring the necessity of in-field monitoring. According to Meta, SDC debugging can take months. Troubleshooting a fault that leaves no trace requires ingenuity, often alongside extensive resources. To make matters worse, many SDC investigations end inconclusively despite substantial investment, effectively perpetuating the uncertainty.

At an ITC-Asia 2023 session, Broadcom reported that up to 50% of its SDC investigations ended without a resolution, labeled "No Trouble Found." Such challenges highlight the limits of conventional testing and the urgent need for more advanced approaches.

In-field testing also presents gaps. In-situ methods using canary circuits are often blind to real, critical path timing margins, which might decrease due to aging and process variations. This consideration has become crucial with the increase in on-chip variation within a device, as mentioned in the “MRHIEP.”

Periodic maintenance testing might not be sensitive enough: it mostly identifies distinct failures while overlooking subtler SDC-related issues. Because tested devices are removed from the fleet, it also lacks the real-life operating conditions that characterize in-situ monitoring, so subtle anomalies that lead to SDC remain undetected.

Some organizations attempt to overcome these limitations with redundant compute methods, replicating execution across multiple cores and accepting a result only if all replicas agree. While this can prevent propagation of SDC, it’s hardware-intensive, costly, and impractical at hyperscale.
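The redundant-execution idea can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: `run_redundant`, `ResultMismatch`, and `make_flaky_adder` are hypothetical names, and the replicas run one after another in a single process, whereas a real deployment would place each replica on a different core or device.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class ResultMismatch(RuntimeError):
    """Raised when redundant executions of the same computation disagree."""


def run_redundant(fn: Callable[[], T], replicas: int = 3) -> T:
    """Run fn `replicas` times and accept the result only if every run agrees."""
    results = [fn() for _ in range(replicas)]
    first = results[0]
    if any(r != first for r in results[1:]):
        raise ResultMismatch(f"replicas disagree: {results!r}")
    return first


def make_flaky_adder(a: int, b: int, faulty_call: int) -> Callable[[], int]:
    """Build an adder that flips the lowest bit of its result on one call.

    This simulates a single silent compute error for demonstration only.
    """
    calls = {"n": 0}

    def add() -> int:
        calls["n"] += 1
        result = a + b
        if calls["n"] == faulty_call:
            result ^= 1  # inject a one-bit corruption
        return result

    return add


if __name__ == "__main__":
    print(run_redundant(lambda: 2 + 3))  # all replicas agree
    try:
        run_redundant(make_flaky_adder(2, 3, faulty_call=2))
    except ResultMismatch as exc:
        print("corruption caught:", exc)
```

The cost is visible in the structure itself: every protected computation runs `replicas` times, which is why this approach becomes expensive at fleet scale.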

Two-Stage Detection Approach to Solving the Issue

As data centers expand and energy demands rise, it's not sustainable to pour extensive engineering hours into tracing undetectable faults across thousands of servers. A scalable solution lies in superior testing methods, namely AI-enabled, two-stage deep data detection.

Multi-stage detection during both chip manufacturing and in-field operation allows chipmakers to recover product reliability and gives fleet operators renewed confidence in their hardware. Monitoring multiple stages with deep data visibility greatly improves the probability of detecting SDC-prone components before they fail.

To be effective, testing must move beyond binary pass/fail grading. Higher-granularity silicon testing with parametric grading that accounts for process variation and predicted performance margins can flag outlier devices even if they technically pass standard tests. This prevents "walking wounded" chips from reaching production fleets.
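One simple way to picture parametric grading is a population-based outlier check layered on top of the usual pass/fail limit. The sketch below is an assumption-laden illustration and not the method described by the author: it uses a robust z-score (median and median absolute deviation) as a stand-in for richer analytics, and the device names, margin values, spec limit, and threshold are all made up.

```python
from statistics import median


def robust_z_scores(values):
    """Robust z-scores based on the median and the median absolute deviation."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return [0.0 for _ in values]  # no spread, so nothing stands out
    # 1.4826 makes the MAD comparable to a standard deviation for normal data.
    return [(v - med) / (1.4826 * mad) for v in values]


def grade(margins_ps, spec_limit_ps, z_threshold=-3.5):
    """Grade each device as 'fail', 'outlier', or 'pass'.

    margins_ps maps a device ID to a measured timing margin in picoseconds
    (hypothetical data). A device below spec_limit_ps fails outright. A device
    that meets the limit but sits far below its peers is graded 'outlier'.
    """
    scores = robust_z_scores(list(margins_ps.values()))
    grades = {}
    for (device, margin), z in zip(margins_ps.items(), scores):
        if margin < spec_limit_ps:
            grades[device] = "fail"
        elif z < z_threshold:
            grades[device] = "outlier"
        else:
            grades[device] = "pass"
    return grades


if __name__ == "__main__":
    margins = {"d0": 50, "d1": 52, "d2": 49, "d3": 51,
               "d4": 50, "d5": 48, "d6": 22, "d7": 8}
    for device, result in grade(margins, spec_limit_ps=10).items():
        print(device, result)
```

In this made-up population, device `d6` clears the 10 ps limit yet sits far below its peers, so a binary test would ship it while parametric grading would flag it for closer inspection or a more conservative bin.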

Reaching this level of detection demands a shift in chip diagnostics: away from boundary checks and toward embedded AI-based telemetry that continuously assesses the health of each device. By embedding intelligence into the silicon and applying ML to rich telemetry data, it's possible to enable continuous visibility both during manufacturing and throughout in-field operation.

AI algorithms can detect subtle parametric variations and predict failure modes that conventional testing overlooks, identifying latent vulnerabilities long before they lead to silent faults. This proactive, data-rich approach catches vulnerabilities early and enables smarter decisions around chip binning, deployment, and fleet-wide reliability management, all without adding major cost or delay.
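As a rough picture of what in-field prediction could look like, the sketch below fits a straight line to periodic timing-margin readings from one device and estimates how long until the margin reaches a guard-band threshold. It deliberately substitutes a simple least-squares trend for the ML models the article refers to; real aging is not necessarily linear, and the function names, sample readings, and threshold are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def hours_until_threshold(hours, margins_ps, threshold_ps):
    """Extrapolate the margin trend to estimate remaining hours.

    Returns the estimated hours from the last sample until the fitted margin
    reaches threshold_ps, or None if the margin is not decreasing.
    """
    slope, intercept = fit_line(hours, margins_ps)
    if slope >= 0:
        return None
    crossing_hour = (threshold_ps - intercept) / slope
    return max(0.0, crossing_hour - hours[-1])


if __name__ == "__main__":
    # Hypothetical telemetry: operating hours and margin readings in ps.
    hours = [0, 1000, 2000, 3000]
    margins = [40.0, 38.0, 36.0, 34.0]
    remaining = hours_until_threshold(hours, margins, threshold_ps=20.0)
    print(f"estimated hours until guard band: {remaining:.0f}")
```

A fleet operator could use an estimate like this to schedule a device for retirement or reduced clocking before its margin is exhausted, rather than after a silent fault has already occurred.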

As AI continues to scale, the cost of undetected faults will rise with it. Silent data corruption is no longer a theoretical concern; it’s a material risk to performance, reliability, and business outcomes. Traditional testing methods weren’t built for this challenge. New solutions that combine deep data, lifecycle monitoring, and AI-driven analytics offer a clear path forward. With a two-stage detection approach, the industry can finally begin to outsmart SDC before it disrupts the systems we rely on most.

About the Author

Vijay N

Vice President of Worldwide Field Operations, proteanTecs

Vijay N is vice president of worldwide field operations for proteanTecs. Prior to proteanTecs, Vijay was director of engineering at Intel and led the implementation of multiple projects, including structured ASIC and IOTG SoC subsystems. Vijay has over 22 years of experience and is skilled in resolving complex business and technical issues.

He received his BS in electrical engineering from Model Engineering College and MS in microelectronics from Birla Institute of Technology and Science.
