New “Hidden” Compression Techniques Pack More Data On Your Devices

May 7, 2012
Samplify CTO Al Wegener reviews the many ways files are compressed in personal computer systems, including the "x" extension on Microsoft Office files, the Windows "disk compress" feature, Lempel-Ziv (LZ) for OpenXML, Sandforce's LZ compression for MLC flash, APAX, and HTTP compression.

Today’s desktops, laptops, and mobile devices can store thousands of compressed photographs, songs, and other files. Operating systems make it easy to identify them by their extensions, such as .jpg, .mp3, or .zip. Most users are aware of these categories as they browse their computers for the files they want.

Many media-centric apps and programs, such as Windows Media Player and Adobe Photoshop, only process compressed media. With so many visibly compressed files, it may surprise readers that embedded or “hidden” compression is becoming increasingly prevalent in their computing systems.

In 2007, Microsoft introduced new Office file formats for Word (.doc), Excel (.xls), and PowerPoint (.ppt) files, with a new “x” suffix (.docx, .xlsx, and .pptx). While many PC users scratched their heads about the new file extensions, they weren’t aware that the “x” suffix indicates that the file is encoded in a flexible format called OpenXML and is then compressed.

Since text documents, spreadsheets, and presentations (XML format or not) cannot be lossy-compressed, Microsoft used the famous Lempel-Ziv (LZ) lossless compression algorithm to compress the new OpenXML files. LZ typically achieves between 1.5:1 and 2:1 lossless compression ratios on OpenXML-format files.

Office files consume a measurable percentage of most desktop and laptop disk drive storage, so it makes sense to compress these files prior to writing them to disk. So whenever you create, open, or save a Microsoft Office 2007 file, you’re using embedded compression.
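
To make the idea concrete, here is a minimal sketch of XML being packed into a ZIP container with DEFLATE, an LZ77-based method, which is how Office Open XML packages are stored. The XML snippet, the helper names `pack_xml` and `unpack_xml`, and the single-part layout are illustrative assumptions, not Microsoft's actual document schema. Because the sample text is highly repetitive, the ratio it prints will be much higher than the 1.5:1 to 2:1 quoted above for real documents.

```python
import io
import zipfile

# Hypothetical stand-in for a document body; real Office files contain
# several XML parts with a far more elaborate schema.
PARAGRAPH = "<w:p><w:r><w:t>Quarterly results were in line with expectations.</w:t></w:r></w:p>"
xml_body = ("<?xml version='1.0' encoding='UTF-8'?><w:document>"
            + PARAGRAPH * 200
            + "</w:document>").encode("utf-8")

def pack_xml(data: bytes) -> bytes:
    """Store data as one DEFLATE-compressed member of an in-memory ZIP archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("word/document.xml", data)
    return buffer.getvalue()

def unpack_xml(container: bytes) -> bytes:
    """Read the XML part back out; the round trip is lossless."""
    with zipfile.ZipFile(io.BytesIO(container)) as archive:
        return archive.read("word/document.xml")

container = pack_xml(xml_body)
ratio = len(xml_body) / len(container)
print(f"raw XML: {len(xml_body)} bytes, container: {len(container)} bytes, ratio {ratio:.1f}:1")
```

Renaming a real .docx file to .zip and opening it with any archive tool shows the same structure.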

Since 2004, the Microsoft Windows operating system has offered a little-known “disk compress” feature (see the figure). When a user navigates to a disk drive in Windows, right-clicks it to open its properties, and checks the box “Compress this drive to save disk space,” Windows applies LZ compression to all disk traffic, which can provide 20% to 50% more disk capacity.

The Microsoft Windows operating system can compress disk drive files transparently, increasing disk capacity and accelerating transfer rates for compressible files.

Once the box is checked, the Windows user shouldn’t notice that the files are compressed, because the CPU performs the LZ compression, which is orders of magnitude faster than the disk drive that stores the compressed files.

Of course, the disk compress option won’t achieve any compression on .jpg, .mp3, .docx, .xlsx, or .pptx files that are already compressed. As time goes by, then, the compression performed by these applications is slowly eroding the benefits of the “disk compress” feature.
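
The effect is easy to reproduce. The sketch below compresses some text-like data with zlib (a DEFLATE implementation in the LZ family) and then compresses the result a second time. The word list and the `make_text` helper are hypothetical test fixtures, and zlib merely stands in for whatever compressor Windows applies.

```python
import random
import zlib

# Hypothetical vocabulary used to build compressible, text-like sample data.
WORDS = ["disk", "drive", "file", "flash", "sector", "write", "cache", "photo",
         "song", "video", "browser", "server", "client", "page", "format", "data"]

def make_text(word_count: int = 20000, seed: int = 0) -> bytes:
    """Build a deterministic stream of space-separated words."""
    rng = random.Random(seed)
    return " ".join(rng.choice(WORDS) for _ in range(word_count)).encode("utf-8")

def ratio(original: bytes, compressed: bytes) -> float:
    """Compression ratio expressed as original size over compressed size."""
    return len(original) / len(compressed)

text = make_text()
first_pass = zlib.compress(text)         # plain text: compresses well
second_pass = zlib.compress(first_pass)  # already compressed: little or nothing to gain

print(f"first pass:  {ratio(text, first_pass):.2f}:1")
print(f"second pass: {ratio(first_pass, second_pass):.2f}:1")
```

The second ratio should land at roughly 1:1, which is what the disk compress feature sees when it is handed a .jpg or .docx file.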

New Techniques

Sandforce, a startup acquired in January 2012 by LSI Logic, produced a particularly interesting example of embedded compression in 2008. As many users are aware, flash memory supports a limited number of write-erase cycles, typically about 5000 for multi-level-cell (MLC) flash and 100,000 for single-level-cell (SLC) flash. Sandforce differentiated itself by offering SLC’s five-year durability at MLC prices.

Sandforce’s secret sauce was using embedded LZ compression to reduce the number of flash sector writes, compared to writing uncompressed data. Reducing the amount of data written to MLC flash resulted in longer MLC flash lifetime.
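
The arithmetic behind that claim can be sketched in a few lines. This is not Sandforce's controller design, which this article does not describe in detail. It simply counts how many fixed-size sectors a payload occupies before and after zlib compression. The 4096-byte sector size, the sample log data, and the function names are assumptions chosen for illustration.

```python
import math
import zlib

SECTOR_BYTES = 4096  # assumed sector size, for illustration only

def sectors_needed(payload: bytes) -> int:
    """Number of whole sectors required to hold the payload."""
    return math.ceil(len(payload) / SECTOR_BYTES)

def make_log(lines: int = 5000) -> bytes:
    """Hypothetical, highly compressible log data."""
    return "".join(
        f"2012-05-07 12:00:{i % 60:02d} INFO request {i} served\n" for i in range(lines)
    ).encode("ascii")

def write_savings(payload: bytes) -> tuple:
    """Return (sectors written uncompressed, sectors written after compression)."""
    raw = sectors_needed(payload)
    packed = sectors_needed(zlib.compress(payload))
    return raw, packed

raw_sectors, packed_sectors = write_savings(make_log())
print(f"uncompressed: {raw_sectors} sectors, compressed: {packed_sectors} sectors")
```

Fewer sectors written per file means fewer write-erase cycles consumed, which is the lifetime benefit described above. Feeding the same function already-compressed data would show little or no saving.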

Had Sandforce advertised that LZ compression was its differentiating technique, customers would have asked questions regarding the average compression ratio over a typical mix of files. By emphasizing how its controller lengthened MLC flash life, the company adeptly sidestepped direct questions about LZ features—compression ratios and file type distributions—and just emphasized the user benefit, longer MLC flash life.

When Sandforce integrated lossless LZ compression into MLC flash controllers, it often didn’t achieve much overall compression on the typical mix of user files since, as described above, many of today’s files are already compressed (.jpg, .docx, .pptx, .mp3, etc.).

In contrast, a new compression approach called APplication AXceleration encoding technology (APAX) for numerical data—integers, floating-point values, and uncompressed images and video—lets users choose the appropriate compression setting for the dataset.

APAX compression ratios on numerical data are typically between 2:1 and 6:1, much higher than Sandforce’s average lossless compression ratios. Consumer media (which is numerical data) is almost always lossy-compressed, and users are satisfied with the quality, so lossy compression already plays a big role in optimizing bandwidth and storage, which was also Sandforce’s original goal.

Lossy compression ratios between 2:1 and 6:1 can be achieved on many numerical datasets, especially those generated by sensors and used in high-performance computing (HPC). So LZ lossless compression isn’t always required, and the benefits of integrating technologies like APAX into hardware can be significant, while still preserving the original results.
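
The trade-off between compression setting and accuracy can be illustrated with a generic stand-in. The sketch below is not the APAX algorithm, whose internals this article does not describe. It applies plain uniform quantization (dropping low-order bits) to noisy 16-bit samples and then runs zlib over the result. The synthetic sine-plus-noise signal and all function names are assumptions.

```python
import array
import math
import random
import zlib

def make_samples(count: int = 4096, seed: int = 0) -> list:
    """Hypothetical sensor data: a sine wave plus Gaussian noise, within int16 range."""
    rng = random.Random(seed)
    return [int(12000 * math.sin(2 * math.pi * i / 64) + rng.gauss(0, 40))
            for i in range(count)]

def quantize(samples: list, dropped_bits: int) -> list:
    """Zero the lowest bits of each sample; the error stays below 2**dropped_bits."""
    return [(s >> dropped_bits) << dropped_bits for s in samples]

def compressed_size(samples: list) -> int:
    """Bytes needed after packing as 16-bit integers and compressing with zlib."""
    return len(zlib.compress(array.array("h", samples).tobytes()))

samples = make_samples()
lossless_size = compressed_size(samples)
print(f"lossless: {lossless_size} bytes")

for dropped_bits in (2, 4, 6):  # the user-selectable "setting" in this sketch
    approx = quantize(samples, dropped_bits)
    worst_error = max(abs(a - b) for a, b in zip(samples, approx))
    print(f"drop {dropped_bits} bits: {compressed_size(approx)} bytes, "
          f"worst-case error {worst_error}")
```

Whether a given error bound still preserves the results depends on the application, which is why a user-selectable setting matters.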

Finally, readers might be surprised to learn that most Web browsers support HTTP compression, and it’s actually rather difficult to disable! If the Web server and the client browser agree, via the “Accept-Encoding” request header, the server sends Web pages (HTML code) compressed with gzip or deflate, two popular LZ-based formats. This typically speeds up Web page downloads by a factor of two, not including images that may be embedded in the Web page. For such images, .gif compression is most typically used.
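
A simplified sketch of that negotiation follows. The functions `choose_encoding` and `encode_body` are hypothetical helpers, not part of any Web server, and the parsing deliberately ignores the quality values (such as `q=0`) that a real server must honor. The sample HTML is repetitive, so its compression ratio will be higher than a typical page's.

```python
import gzip
import zlib

def choose_encoding(accept_encoding: str) -> str:
    """Pick the first supported coding the client lists; fall back to no compression."""
    offered = [token.split(";")[0].strip().lower()
               for token in accept_encoding.split(",")]
    for candidate in ("gzip", "deflate"):
        if candidate in offered:
            return candidate
    return "identity"

def encode_body(body: bytes, encoding: str) -> bytes:
    """Compress the response body according to the chosen content coding."""
    if encoding == "gzip":
        return gzip.compress(body)
    if encoding == "deflate":
        return zlib.compress(body)  # HTTP's "deflate" coding uses the zlib format
    return body

# Hypothetical page content.
html = ("<html><body><table>"
        + "<tr><td>row</td><td>value</td></tr>" * 300
        + "</table></body></html>").encode("utf-8")

encoding = choose_encoding("gzip, deflate")
wire_bytes = encode_body(html, encoding)
print(f"Content-Encoding: {encoding}, {len(html)} bytes -> {len(wire_bytes)} bytes")
```

The browser reverses the step transparently (here, `gzip.decompress(wire_bytes)` would restore the page), which is why users never notice it.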

A Hidden Benefit

In conclusion, compression is now a daily part of most people’s lives, from the high-frequency bond trader’s spreadsheets to the Web-browsing coffee farmer in Brazil. In many cases, compression has been so smoothly integrated into the data transfer or storage that users don’t even realize it’s there. And that’s how it should be.

About the Author

Al Wegener

Al Wegener is the CTO and founder of Samplify Systems, a fabless semiconductor startup in Santa Clara, Calif. He holds 17 patents and is named on additional Samplify patent applications. He earned a BSEE from Bucknell University and an MSCS from Stanford University.
