Given a massive amount of data to process, you could consider employing sampling or streaming techniques. Sampling is straightforward: you randomly select samples of your complete original data and perform your calculations only on those samples. With streaming, your algorithm, which fits into very limited memory, examines input data element by element in a single pass, deriving a summary, sketch, or synopsis of the data.
MIT Professor Piotr Indyk discussed both approaches in a lecture on fast algorithms in the online course “Tackling the Challenges of Big Data,” which runs through March 17.
With sampling, he said, performing a computation only on a sample reduces the needed storage and computation time. On the other hand, he said, streaming ensures that “no element is left behind.” Streaming can easily answer this question: Does the digit 1 appear in a stream of decimal digits? If 1 does appear in the stream but only very infrequently, sampling can give you the wrong answer.
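The contrast can be sketched in a few lines of Python (illustrative code, not from the lecture): a streaming check examines every element in one pass, while a sampling check inspects only a random subset and can miss a rare element.

```python
import random

def stream_contains(stream, target):
    """Single-pass streaming check: every element is examined,
    so even a very rare target is never missed."""
    return any(x == target for x in stream)

def sample_contains(data, target, k, seed=0):
    """Sampling check: inspects only k randomly chosen elements,
    so a target that appears very infrequently is easily missed."""
    rng = random.Random(seed)
    return target in rng.sample(data, k)

# A stream of one million decimal digits containing a single 1.
digits = [0] * 1_000_000
digits[123_456] = 1

stream_contains(iter(digits), 1)   # always True: "no element is left behind"
sample_contains(digits, 1, 100)    # very likely False: each of the 100 samples
                                   # has a one-in-a-million chance of being the 1
```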
Streaming algorithms, Indyk said, are both randomized and approximate: they use pseudorandom number generators to guide decisions, and they present an approximate answer with a controllable level of accuracy.
As an example of a fast streaming algorithm he cited LogLog, which requires only 128 bytes to estimate the number of distinct words in all the works of Shakespeare. Other applications of streaming include finding heavy hitters—for instance, given a stream of clicks, identify the most popular websites—or estimating entropy.
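The LogLog idea can be conveyed with a simplified sketch (a teaching approximation, not Durand and Flajolet's tuned algorithm): hash each item, route it to one of a few dozen byte-sized registers, and record the position of the first 1-bit in the hash; the register average yields a distinct-count estimate from a tiny amount of state.

```python
import hashlib

def loglog_estimate(items, b=6):
    """Simplified LogLog-style distinct-count sketch: m = 2**b small
    registers each remember the maximum leading-1-bit position seen
    among the hashes routed to them."""
    m = 1 << b
    registers = [0] * m
    for item in items:
        h = int(hashlib.md5(item.encode()).hexdigest(), 16)
        j = h & (m - 1)   # low b bits choose a register
        w = h >> b        # remaining bits determine the rank
        rank = 1
        while w & 1 == 0 and rank < 64:
            w >>= 1
            rank += 1
        registers[j] = max(registers[j], rank)
    alpha = 0.39701  # LogLog bias-correction constant (large-m limit)
    return alpha * m * 2 ** (sum(registers) / m)

# Duplicates never change the registers, so repeating the stream three
# times still estimates roughly 5,000 distinct items.
stream = [f"word{i}" for i in range(5000)] * 3
print(round(loglog_estimate(stream)))  # roughly 5000, typically within ~16%
```

With b=6 the sketch uses 64 registers, in the same spirit as the 128-byte figure Indyk cited for Shakespeare's vocabulary.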
He next turned his attention to sampling, using the sparse FFT as an example. A naïve Fourier transform algorithm, he said, takes quadratic time and is too slow for big data applications. The FFT, developed by Cooley and Tukey in 1965, takes n log n time, a big improvement but still too slow for big data applications. Fortunately, we can take advantage of the sparsity of a signal, whose spectrum is dominated by a few large peaks, to accomplish the Fourier transform in sublinear time.
And because the sparse Fourier transform operates on only a sample of the signal, you reduce potentially costly data-acquisition time, lower the communication bandwidth burden, and save power. The most recent version of the algorithm, sFFT 3.0, has run times that scale with the number of non-zero frequencies, offering orders-of-magnitude performance improvements over the FFT for many applications.
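Spectral sparsity is easy to see in a small, self-contained example (illustrative only; this is the quadratic-time direct transform, not the sFFT): a signal built from two sinusoids produces a spectrum in which only four of 64 coefficients are significant.

```python
import cmath
import math

def naive_dft(x):
    """Direct O(n^2) discrete Fourier transform -- the quadratic-time
    baseline that the FFT improves to O(n log n)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

# A sparse signal: two sinusoids, so the true spectrum has just four peaks
# (each real sinusoid contributes a mirrored pair of frequencies).
n = 64
signal = [math.sin(2 * math.pi * 5 * t / n) + 0.5 * math.sin(2 * math.pi * 12 * t / n)
          for t in range(n)]

spectrum = naive_dft(signal)
peaks = [k for k, c in enumerate(spectrum) if abs(c) > 1.0]
print(peaks)  # [5, 12, 52, 59]: only 4 of 64 coefficients are significant
```

A sparse FFT exploits exactly this structure, recovering the few large coefficients without computing all n of them.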
Indyk concluded by saying, “Both streaming and sampling have their pros and cons. But both are very efficient ways to process a massive amount of data using only very limited resources.”
See these related posts on the online course “Tackling the Challenges of Big Data”:
- Big data challenges: volume, velocity, variety
- Commuting via rapidly mixing Markov chains
- GPUs drive MIT’s MapD Twitter big-data application
- Emerging tools address data curation
- What is cloud computing?
- Big data impels database evolution
- Distributed computing platforms serve big-data applications
- NewSQL takes advantage of main-memory databases like H-Store
- Onions of encryption boost big-data security
- Lecture addresses scalability in multicore systems
- User interfaces can clarify, or obscure, big data’s meaning
- It’s a small world, but with big data