Silent errors, as they are called, are hardware defects that leave no trace in system logs. Factors such as temperature and age can make them more frequent. They are an industry-wide problem and a major challenge for datacenter infrastructure, since they can wreak havoc across applications for prolonged periods while remaining undetected.
In a newly published paper, Meta has detailed how it detects and mitigates these errors in its infrastructure. Meta combines two approaches: testing machines while they are offline for maintenance, and running smaller tests during production. Meta has found that while the former achieves greater overall coverage, in-production testing can achieve robust coverage within a much shorter timespan.
Silent errors, also called silent data corruptions (SDC), are the result of an internal hardware defect. To be more specific, these errors occur in places where there is no check logic, so the defect goes undetected. They can be further influenced by factors such as temperature variance, datapath variations and age.
The defect causes incorrect circuit operation. This can then manifest itself at the application level as a flipped bit in a data value, or it may even lead the hardware to execute the wrong instructions altogether. Its effects can even propagate to other services and systems.
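To make the "flipped bit" concrete, here is a minimal Python sketch that inverts a single bit of an integer. The helper `flip_bit` and the sample values are illustrative assumptions and do not come from Meta's paper; the point is only that one bit can change a value slightly or wipe it out entirely.

```python
def flip_bit(value: int, bit: int) -> int:
    """Return value with the bit at the given position inverted (XOR with a one-bit mask)."""
    return value ^ (1 << bit)


original = 1024                  # binary: a single 1 at bit position 10
low = flip_bit(original, 0)      # flip the least significant bit
high = flip_bit(original, 10)    # flip the only bit that is set

print(original, low, high)
```

Flipping the lowest bit nudges the value by one, while flipping the single set bit turns it into zero, which is the kind of quietly wrong result that no log entry would flag.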
As an example, in one case study a simple calculation in a database returned the wrong answer, 0, which resulted in missing rows and subsequently led to data loss. At Meta’s scale, the company reports having observed hundreds of such SDCs. Meta has found an SDC occurrence rate of one in a thousand silicon devices, which it claims reflects fundamental silicon challenges rather than particle effects or cosmic rays.
Meta has been running detection and testing frameworks since 2019. These strategies fall into two buckets: fleetscanner for out-of-production testing, and ripple for in-production testing.
Silicon testing funnel
Before a silicon device enters the Meta fleet, it goes through a silicon testing funnel. Even before launch, during development, a silicon chip goes through verification (simulation and emulation) and subsequently post-silicon validation on actual samples. Both of these stages can last several months. During manufacturing, the device undergoes further automated tests at the device and system level. Silicon vendors often use this level of testing for binning, as there will be variations in performance. Nonfunctional chips result in a lower manufacturing yield.
Finally, when the device arrives at Meta, it undergoes infrastructure intake (burn-in) testing on many software configurations at the rack level. Traditionally, this would have concluded the testing, and the device would have been expected to work for the rest of its lifecycle, relying on built-in RAS (reliability, availability and serviceability) features to monitor the system’s health.
However, these methods cannot detect SDCs. Catching them requires dedicated test patterns that run periodically during production, which in turn requires orchestration and scheduling. In the most extreme case, these tests are done during
It is notable that the closer the device gets to running production workloads, the shorter the tests become, but also the harder it is to root-cause (diagnose) silicon defects. In addition, the cost and complexity of testing, as well as the potential impact of a defect, increase. For example, at the system level multiple types of devices have to work together, while the infrastructure level adds complex applications and operating systems.
Fleetwide testing observations
Silent errors are tricky since they can produce erroneous results that go undetected, as well as impact numerous applications. These errors will continue to propagate until they produce noticeable differences at the application level.
Moreover, there are multiple factors that impact their occurrence. Meta has found that these factors fall into four major categories:
- Data randomization. Corruptions tend to depend on the input data, for example on certain bit patterns. This creates a large state space for testing. For example, perhaps 3 times 5 is evaluated correctly to 15, while 3 times 4 is evaluated to 10.
- Electrical variations. Changes in voltage, frequency and current may lead to higher occurrences of data corruptions. Under one set of these parameters, the result may be accurate, while this might not be the case for another set. This further complicates the testing state space.
- Environmental variations. Other variations such as temperature and humidity can also impact silent errors, since these may directly influence the physics associated with the device. Even in a controlled environment like a datacenter, there can still be hotspots. In particular, this could lead to variations in results across datacenters.
- Lifecycle variations. Like regular device failures, the occurrence of SDCs can also vary across the silicon lifecycle.
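The data randomization factor can be illustrated with a small Python sketch. It simulates a multiplier that is wrong for exactly one operand pair (the 3 times 4 example from the list) and sweeps a tiny operand range against a reference result. The function names `faulty_multiply` and `find_mismatches` are hypothetical, and the defect is simulated in software purely for illustration.

```python
from itertools import product


def faulty_multiply(a: int, b: int) -> int:
    """Hypothetical defective multiplier: wrong for one operand pair only."""
    if (a, b) == (3, 4):
        return 10            # simulated corruption; the correct answer is 12
    return a * b


def find_mismatches(op, operands):
    """Compare op against the reference product for every ordered operand pair."""
    return [
        (a, b, op(a, b))
        for a, b in product(operands, repeat=2)
        if op(a, b) != a * b
    ]


mismatches = find_mismatches(faulty_multiply, range(8))
print(mismatches)

# The sweep above covers 8 * 8 = 64 pairs. Two 64-bit operands would give
# 2**128 pairs, which is why exhaustive testing is not an option.
pairs_64bit = 2 ** 128
print(pairs_64bit)
```

Only one of the 64 pairs fails, and a test that happened to skip it would report a healthy device. Electrical, environmental and lifecycle variations multiply this input space further.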
Meta has implemented two categories of fleetwide testing across millions of machines. These are out-of-production and in-production testing.
In out-of-production testing, machines are taken offline and subjected to known patterns of inputs. The output is then compared to references. In these tests, all of the variables discussed above are taken into account and explored using state search policies.
Mostly, machines are not taken offline specifically to test for silent errors; rather, they are tested opportunistically while offline for other reasons, such as firmware and kernel upgrades, provisioning or traditional server repair.
During such server maintenance, Meta performs silent error detection with a test tool called fleetscanner. This mode of operation minimizes overhead and hence cost. When a silent data corruption is detected, the machine is quarantined and subjected to further tests.
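The opportunistic flow described here (scan while the machine is already offline, compare outputs to references, quarantine on a mismatch) can be sketched in a few lines of Python. This is not fleetscanner's actual implementation: the `Machine` class, the `defective` flag, the test patterns and the state names are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str
    defective: bool = False      # simulated hidden defect (illustration only)
    state: str = "production"


def run_known_patterns(machine: Machine) -> bool:
    """Feed fixed inputs and compare with reference outputs; True if all match."""
    patterns = [(3, 5, 15), (3, 4, 12), (7, 6, 42)]   # (a, b, expected product)
    for a, b, expected in patterns:
        result = a * b
        if machine.defective and (a, b) == (3, 4):
            result = 10          # simulated silent corruption
        if result != expected:
            return False
    return True


def on_maintenance(machine: Machine, reason: str) -> str:
    """Piggyback a scan on maintenance that is happening anyway."""
    machine.state = "offline"
    # The maintenance named by `reason` would run here (hypothetical placeholder).
    machine.state = "production" if run_known_patterns(machine) else "quarantined"
    return machine.state


fleet = [Machine("host-a"), Machine("host-b", defective=True)]
states = {m.name: on_maintenance(m, "kernel upgrade") for m in fleet}
print(states)
```

Because the scan only runs when a machine is offline for another reason, its cost is mostly hidden, but so is its cadence: a machine that is rarely serviced is rarely tested.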
Since out-of-production testing is slow to react to newly identified signatures, Meta also performs in-production testing with a tool called ripple. It is co-located with the workload and executes test instructions at millisecond-level intervals. Meta reported that it has been able to perform shadow testing by running A/B tests across different variables, and that it keeps the tool always on. Meta has identified ripple testing in particular as a major evolution for silent data corruption tools.
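As a rough illustration of the in-production idea, the Python sketch below interleaves very short known-answer checks with ordinary request handling. It is a simplified, single-threaded stand-in and makes no claim about how ripple is built: `micro_test`, `serve`, the 2 ms budget and the every-third-request cadence are all assumptions.

```python
import time


def micro_test(budget_s: float = 0.002) -> bool:
    """Run tiny known-answer checks for roughly budget_s seconds; True if all pass."""
    deadline = time.perf_counter() + budget_s
    n = 100
    while True:
        # Known answer: the sum of the first n odd numbers equals n squared.
        if sum(range(1, 2 * n, 2)) != n * n:
            return False
        n += 1
        if time.perf_counter() >= deadline:
            return True


def serve(requests, test_every: int = 3):
    """Handle requests, slipping in a micro-test after every few of them."""
    handled, tests_run, healthy = [], 0, True
    for i, req in enumerate(requests, start=1):
        handled.append(req.upper())      # stand-in for the real workload
        if i % test_every == 0:
            tests_run += 1
            healthy = micro_test() and healthy
    return handled, tests_run, healthy


handled, tests_run, healthy = serve(["a", "b", "c", "d", "e", "f", "g"])
print(handled, tests_run, healthy)
```

Keeping each check within a small time budget is what lets it share a machine with live traffic; the tradeoff is that each individual check covers very little, so coverage comes from running continuously.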
Findings and tradeoffs
Based on three years of observations, fleetscanner achieved 93% coverage for a certain defect family, and 23% unique coverage that was not reachable by ripple. However, this comes at the price of a nontrivial amount of time (and hence cost) spent testing. By contrast, ripple offered 7% unique coverage. Meta argues that fleetscanner could not achieve this coverage, because it stems from the frequent workload transitions that ripple is exposed to.
When comparing the time to achieve an equivalent SDC coverage of 70%, fleetscanner would take 6 months compared to just 15 days for ripple.
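The reported figures can be cross-checked with some simple arithmetic, sketched in Python below. Two assumptions are made that the article does not state outright: all percentages refer to the same defect population, and a month is counted as 30 days. The derived values (`shared`, `ripple_total`, `combined`, `speedup`) are inferences from the reported numbers, not figures quoted from Meta.

```python
# Reported figures (percent of the defect family).
fleetscanner_total = 93     # overall fleetscanner coverage
fleetscanner_unique = 23    # reachable only by fleetscanner
ripple_unique = 7           # reachable only by ripple

# Derived under the same-population assumption.
shared = fleetscanner_total - fleetscanner_unique    # caught by both tools
ripple_total = shared + ripple_unique                # implied ripple coverage
combined = fleetscanner_total + ripple_unique        # coverage of both together

# Time to reach the equivalent coverage level.
fleetscanner_days = 6 * 30  # assumption: one month is about 30 days
ripple_days = 15
speedup = fleetscanner_days / ripple_days

print(shared, ripple_total, combined, speedup)
```

Under these assumptions the overlap between the two tools works out to 70%, the same level used for the time comparison, and ripple reaches it roughly twelve times faster.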
When silent data corruptions remain undetected, applications may be exposed to them for months. This in turn could lead to significant impacts such as data loss that could take months to debug. Hence, this poses a critical problem for datacenter infrastructure.
Meta has implemented a comprehensive testing methodology consisting of an out-of-production fleetscanner that runs during maintenance performed for other purposes, and faster (millisecond-level) in-production ripple testing.