Error Correcting Code (ECC) technology, such as Low-Density Parity Check codes, has been around longer than most of you reading this have been alive. The reason is simple: no storage or transmission medium is perfect, and all of them experience some level of errors. To avoid re-reading or re-transmitting information every time an error occurs, storage and networking systems are equipped with built-in ECC technology. The strength of the ECC needed is a function of two things: the raw error rate of the storage or networking system, and the acceptable output error rate of the final solution.

This article provides an introduction to Low-Density Parity Check (LDPC) Codes, a very powerful ECC technology that is now being used with an increasingly popular storage medium: NAND flash memory.

The Need for (Better) SSD Error Correction

NAND flash memory is a non-volatile, solid state storage medium that affords four powerful advantages over rotating magnetic media such as hard disks: higher performance; higher density; higher reliability; and lower power consumption. These advantages make flash memory ideal for use in portable devices, as well as in high-performance solid state disks (SSDs) and server-side caching systems.

But NAND flash memory has a weakness: the memory cells deteriorate slightly with each program/erase (P/E) or write/delete cycle. As each individual cell deteriorates, its ability to accurately hold a given charge state diminishes, causing its read error rate to increase. At some point (after too many P/E cycles), the errors can no longer be corrected, making the cell unusable.

The stronger the error correction, therefore, the longer the usable life of the flash memory cells. In other words, a really strong ECC technology enables cells to become substantially “weaker” and still be read reliably. To date, ECC technologies like Bose-Chaudhuri-Hocquenghem (BCH) and Reed-Solomon (RS) codes have worked quite well in solid state storage solutions. But that is now changing as chip fabrication geometries shrink, and as densities increase from single-level and multi-level cells to three-level cells, storing one, two or three bits per cell, respectively.

Storing more bits in smaller cells makes it possible to fit more storage into smaller form factors, but the smaller/denser cells hold a proportionally smaller charge, which increases the raw bit error rate of the data stored in them. NAND flash memory provides a fixed amount of “spare” storage for ECC beyond the “binary” user capacity, for example an extra 80 bytes for every 1K bytes of user data. Given the fixed spare storage, BCH and RS codes can meet output bit error rate requirements only up to a certain raw bit error rate; when the cells deteriorate beyond that point, the rate of uncorrectable errors becomes unacceptable. A more powerful error correction technology that withstands a higher raw bit error rate would let the cells deteriorate further; that is, it would enable more P/E cycles on the NAND flash memory.
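To make the spare-area constraint concrete, the following Python sketch works through the 80-bytes-per-1K example. It assumes that "1K" means 1,024 bytes, that the whole spare area is available for parity, and it uses the textbook bound that a binary BCH code of length up to 2^m - 1 needs at most m parity bits per correctable bit error. The function names are illustrative only; real controllers may partition the spare area differently.

```python
def code_rate(user_bytes: int, spare_bytes: int) -> float:
    """Fraction of the stored bits that carry user data."""
    return user_bytes / (user_bytes + spare_bytes)


def bch_correctable_bits(user_bytes: int, spare_bytes: int) -> int:
    """Estimate how many bit errors a binary BCH code can correct
    when all of the spare area is used for parity (an assumption)."""
    spare_bits = spare_bytes * 8
    codeword_bits = user_bytes * 8 + spare_bits
    # Smallest m such that the codeword fits in 2**m - 1 bits.
    m = codeword_bits.bit_length()
    # Textbook bound: at most m parity bits per corrected bit error.
    return spare_bits // m


rate = code_rate(1024, 80)
t = bch_correctable_bits(1024, 80)
print(f"code rate = {rate:.4f}, correctable bit errors = {t}")
```

Under these assumptions the spare area supports a BCH code that corrects about 45 bit errors per codeword at a code rate of roughly 0.93. Once the raw bit error rate pushes the typical error count toward that limit, only a stronger code can help, because the spare area cannot grow.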

LDPC 101

Low-Density Parity Check decoding is a powerful ECC algorithm that dates back to the 1960s. The problem is: LDPC decoding is so computation-intensive that it took years of advances in processor performance (including the transition from vacuum tubes to integrated circuits!) to enable the algorithms to operate in real-time.

LDPC codes were first deployed about a decade ago in the telecommunications industry to correct transmission errors across various media. They are currently used, for example, in 10GBase-T Ethernet (10 Gb/s over twisted-pair cabling), as well as for the 802.11n and 802.11ac Wi-Fi standards as part of the High Throughput PHY specification. More recently, LDPC codes have been used for error correction for the magnetic media in hard disk drives (HDDs).

This renaissance of LDPC codes has not gone unnoticed in the solid state storage industry. Universities and vendors alike are researching the best ways to utilize LDPC codes for error correction in next-generation flash controllers, and this work has already resulted in the debut of a few commercial implementations that will soon be shipping in SSDs and flash cache products.

All ECCs, including LDPC codes, have a probability of failing at a given Raw Bit Error Rate (RBER). You can flip a coin and get 32 heads in a row – it’s unlikely, but it is possible. The design objective for the ECC of an SSD might be to have only a 10^-15 chance of encountering an uncorrectable error at an RBER less than or equal to a limit based on the expected lifetime of the NAND flash memory. Compared to BCH, LDPC codes can meet that same probability objective at a significantly higher RBER, which translates into more P/E cycles from the NAND flash cells.
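The relationship between RBER and failure probability can be illustrated with a short Python sketch. It computes the coin-flip probability from the text, then the chance that a codeword contains more bit errors than a decoder can correct, assuming independent bit errors and a decoder that fails exactly when more than t bits are wrong. That model suits a BCH-style code; LDPC failure rates have no such simple formula and are usually estimated by simulation. The codeword size (8,832 bits) and t = 45 are illustrative values, not figures from any particular product.

```python
import math


def prob_all_heads(flips: int) -> float:
    """Probability that a fair coin lands heads on every flip."""
    return 0.5 ** flips


def codeword_failure_prob(n_bits: int, t: int, rber: float) -> float:
    """P(more than t of n_bits are in error), for independent errors
    occurring with probability rber (0 < rber < 1)."""
    total = 0.0
    for k in range(t + 1, n_bits + 1):
        # Binomial term computed in the log domain to avoid overflow.
        log_term = (
            math.lgamma(n_bits + 1)
            - math.lgamma(k + 1)
            - math.lgamma(n_bits - k + 1)
            + k * math.log(rber)
            + (n_bits - k) * math.log1p(-rber)
        )
        total += math.exp(log_term)
    return total


print(f"32 heads in a row: {prob_all_heads(32):.2e}")
for rber in (1e-3, 2e-3, 3e-3):
    p_fail = codeword_failure_prob(8832, 45, rber)
    print(f"RBER {rber:.0e}: codeword failure probability {p_fail:.2e}")
```

The failure probability climbs very steeply as the RBER rises, which is why a code that tolerates even a modestly higher RBER can add a meaningful number of P/E cycles.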

It is important to note that some enterprise-class SSDs and flash cache solutions include provisions to detect or remedy any errors that the ECC itself fails to detect or correct. Two such provisions are an end-to-end cyclic redundancy check (CRC) and RAID-like data protection. The slight negative effect on price/performance and usable capacity with these additional provisions is readily justifiable in mission-critical applications.
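As a rough illustration of the end-to-end CRC idea, the Python sketch below appends a CRC-32 to a payload before it is written and checks it after the data comes back through the ECC decoder. It uses the standard library's zlib.crc32; the function names, the 4-byte little-endian layout and the choice of CRC-32 are assumptions made for the example, not a description of any vendor's format.

```python
import zlib


def protect(payload: bytes) -> bytes:
    """Append a CRC-32 of the payload (hypothetical on-media layout)."""
    return payload + zlib.crc32(payload).to_bytes(4, "little")


def verify(block: bytes) -> bool:
    """Recompute the CRC-32 and compare it with the stored value."""
    payload, stored = block[:-4], block[-4:]
    return zlib.crc32(payload).to_bytes(4, "little") == stored


block = protect(b"user data sector")
print(verify(block))  # intact block passes the check

# Simulate an error that slipped past the ECC: flip one bit.
corrupted = bytes([block[0] ^ 0x01]) + block[1:]
print(verify(corrupted))  # the CRC mismatch exposes the error
```

A failed check tells the system that the decoded data is wrong even though the ECC reported success, at which point a RAID-like layer can rebuild the data from redundancy.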

There are two types of decoding with LDPC codes: hard-decision and soft-decision. Hard-decision decoding provides error correction comparable to that of BCH codes. By employing only one quantization level between two adjacent storage states, hard-decision decoding is effectively binary on a per-bit basis, including in multi- and tri-level cells. This makes hard-decision LDPC (HLDPC) decoding implementable with reasonable performance. But HLDPC decoding alone offers very little improvement in error correction over typical implementations of BCH.
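A minimal Python sketch of hard-decision decoding follows, using a simple bit-flipping scheme: compute which parity checks fail, then flip the bit involved in the most failed checks. The tiny 4-by-6 parity-check matrix is a toy chosen so the example can be followed by hand. Production LDPC matrices have thousands of columns, and commercial decoders use more refined algorithms than this one.

```python
def syndrome(H, bits):
    """One entry per parity check: 0 if satisfied, 1 if violated."""
    return [sum(h * b for h, b in zip(row, bits)) % 2 for row in H]


def bit_flip_decode(H, received, max_iters=10):
    """Hard-decision bit-flipping decoder (toy version).

    Returns (bits, success), where success means all checks pass.
    """
    bits = list(received)
    for _ in range(max_iters):
        s = syndrome(H, bits)
        if not any(s):
            return bits, True
        # For each bit, count the violated checks it takes part in.
        counts = [
            sum(s[r] for r in range(len(H)) if H[r][c])
            for c in range(len(bits))
        ]
        worst = max(counts)
        # Flip the bit(s) implicated in the most violated checks.
        bits = [b ^ 1 if n == worst else b for b, n in zip(bits, counts)]
    return bits, not any(syndrome(H, bits))


# Toy matrix: every bit is in two checks, every check covers three bits.
H = [
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 1],
]
codeword = [1, 1, 0, 1, 0, 0]   # satisfies all four checks
received = [1, 1, 0, 1, 1, 0]   # same word with bit 4 flipped
decoded, ok = bit_flip_decode(H, received)
print(decoded, ok)
```

Here the flipped bit is the only one that participates in both violated checks, so a single iteration restores the original codeword. Note that the decoder sees only ones and zeros; it has no idea how confident the read of each bit was.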

In contrast with HLDPC decoding, soft-decision LDPC (SLDPC) decoding uses more levels of quantization per bit. Think of each bit not just as a zero or a one, but as a probability of being a zero or a one. The extra information provided by the probabilities for each bit is what gives SLDPC decoding so much additional error correction performance. The more information, the stronger the error correction.

Whereas hard-decision decoding is a purely binary technology, soft-decision decoding requires diving deep into the analog voltage levels between adjacent storage states. For this reason, many of the mechanisms used require fairly sophisticated digital signal processing technology to convert the data read from the NAND flash cells at different reference voltage levels into probabilities to be processed by SLDPC decoding.
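One common way to express these probabilities is the log-likelihood ratio (LLR), and the Python sketch below shows how extra reads refine it. It assumes a deliberately simple model: two adjacent storage states whose cell voltages follow Gaussian distributions of equal width, with the lower-voltage state representing a one. The voltages, the width and the function names are invented for illustration; real NAND distributions are asymmetric and shift with wear, which is where the sophisticated signal processing comes in.

```python
import math


def gaussian_cdf(x: float, mean: float, sigma: float) -> float:
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))


def bin_llrs(read_refs, mean_one, mean_zero, sigma):
    """LLR = ln(P(bin | bit 0) / P(bin | bit 1)) for each voltage bin.

    The read reference voltages split the voltage axis into
    len(read_refs) + 1 bins. Positive LLR favors 0, negative favors 1.
    """
    edges = [-math.inf] + sorted(read_refs) + [math.inf]
    llrs = []
    for lo, hi in zip(edges, edges[1:]):
        p_zero = gaussian_cdf(hi, mean_zero, sigma) - gaussian_cdf(lo, mean_zero, sigma)
        p_one = gaussian_cdf(hi, mean_one, sigma) - gaussian_cdf(lo, mean_one, sigma)
        # Clamp to avoid log(0) for bins far from both states.
        llrs.append(math.log(max(p_zero, 1e-300) / max(p_one, 1e-300)))
    return llrs


# Assumed model: "1" centered at 1.0 V, "0" at 2.0 V, same spread.
hard_llrs = bin_llrs([1.5], mean_one=1.0, mean_zero=2.0, sigma=0.35)
soft_llrs = bin_llrs([1.3, 1.5, 1.7], mean_one=1.0, mean_zero=2.0, sigma=0.35)
print("one read:   ", [round(v, 2) for v in hard_llrs])
print("three reads:", [round(v, 2) for v in soft_llrs])
```

With a single read every bit gets the same confidence. With three reads, cells that land in the two middle bins receive LLRs close to zero, flagging them as unreliable, and that is exactly the information a soft-decision decoder exploits.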

In some sophisticated systems, SLDPC decoding takes over whenever hard-decision decoding fails to correct an error. A variety of soft-decision mechanisms are possible, some of which are capable of providing significant improvements in error correction. This is why soft-decision decoding is the current frontier for advancing the state-of-the-art in flash memory ECC. Protecting this intellectual property is why vendors are reluctant to disclose too much information about the mechanisms they use, but it is possible to provide some insight (at a high level!).

As might be expected, the additional error-correcting capabilities of soft-decision decoding come at a price: increased latency. There are two sources for this latency. One is the need to perform additional and more precise readings of the NAND flash cells at various reference voltage levels and/or to collect additional information about the error characteristics of the NAND flash cells; the other is the need to perform more complex digital signal processing and algorithmically complex decoding of the results.

One technique for minimizing the latency while fully preserving SLDPC decoding’s error correction capabilities is to apply progressively stronger levels of soft-decision decoding only as needed to correct errors. One solution has five such levels of SLDPC decoding built atop the very fast HLDPC decoding (Fig. 1).
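The escalation logic can be sketched in a few lines of Python. Each decoder is modeled as a callable that returns the corrected data or None on failure; this interface, the names and the stub decoders are hypothetical stand-ins, since the actual mechanisms behind each level are proprietary.

```python
from typing import Callable, Optional, Sequence, Tuple

# A decoder returns the corrected data, or None if it could not correct it.
Decoder = Callable[[], Optional[bytes]]


def progressive_decode(
    hard: Decoder, soft_levels: Sequence[Decoder]
) -> Tuple[Optional[bytes], int]:
    """Try fast hard-decision decoding first, then escalate one soft
    level at a time. Returns (data, level): level 0 is hard decision,
    1..N are soft levels; data is None if every level failed."""
    data = hard()
    if data is not None:
        return data, 0
    for level, decode in enumerate(soft_levels, start=1):
        data = decode()
        if data is not None:
            return data, level
    return None, len(soft_levels)


# Stand-in decoders: this simulated page only decodes at soft level 2.
attempts = []


def make_stub(level: int, succeeds: bool) -> Decoder:
    def decode() -> Optional[bytes]:
        attempts.append(level)
        return b"page data" if succeeds else None
    return decode


hard = make_stub(0, False)
soft = [make_stub(n, n >= 2) for n in range(1, 6)]  # five soft levels
result, level_used = progressive_decode(hard, soft)
print(level_used, attempts)  # prints: 2 [0, 1, 2]
```

Levels 3 through 5 are never invoked for this page, so their extra reads and processing time are never spent. Most reads on healthy flash would stop at level 0 and pay no soft-decision latency at all.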

Soft-decision decoding performance itself can be enhanced through the use of advanced digital signal processing (DSP) and multi-processor parallelism. Robust, application-specific DSP technology is able to process flash memory cell voltage measurements more quickly and accurately to enable correcting errors at lower (and therefore faster) levels of SLDPC decoding, while parallel processing is a general technique capable of accelerating decoding at every level.

Another way to improve performance is to increase the amount of over-provisioned memory used for ECC over time. When the NAND flash memory chips are new and errors are few, less memory is needed for ECC (to achieve the desired output bit error rate). As the cells begin to wear out and raw bit errors increase, allocating more memory for the error correction information helps improve the ECC performance. The goal with such adaptive ECC memory allocation is to strike a prudent balance between capacity and endurance.

Yet another technique involves anticipating and mitigating the various sources of raw bit errors (e.g. P/E cycling, retention, read disturb, etc.) to which smaller NAND flash memory cells are increasingly prone. These error sources, if managed properly, can be handled by the very fast hard-decision LDPC decoding. Identifying the cause and decreasing the impact of these error sources is, therefore, an effective way to avoid the performance penalty that would otherwise be incurred to correct for the same error sources with the slower soft-decision LDPC decoding.

Conclusion

What does all this mean at a practical level? It means greater flash memory endurance and higher capacities in next-generation solid state storage solutions. NAND flash memory chips are specified to have a certain number of P/E cycles before experiencing an unacceptably high error rate. LDPC error-correction technology, particularly through judicious use of soft-decision LDPC decoding, is able to meet output error rate requirements with much higher raw bit error rates, and thus can greatly extend the usable P/E cycles of NAND flash memory. These techniques can make sub-20 nanometer three-level cell chips commercially viable in high-capacity, high-performance SSDs and flash cache solutions.

LDPC codes and smaller/denser chips are not alone, of course, in the advances being made in solid state storage. Indeed, there are many other technologies available for increasing endurance, improving performance and reliability, and reducing power consumption. Examples of these technologies include data reduction to minimize writes and, therefore, write amplification, and wear-leveling so that all the cells last about the same amount of time, and far longer than even the more optimistic expectations for the service life of the host tablet, PC or server.