Let’s examine an impending problem looming at the intersection of big data and cloud computing. Big data is the vague, all-encompassing name given to immense datasets stored on enterprise servers like those at Google (which organizes 100 trillion Web pages), Facebook (1 million gigabytes of disk storage), and YouTube (20 petabytes of new video content per year).

Big data also is found in scientific applications such as weather forecasting, earthquake prediction, seismic processing, molecular modeling, and genetic sequencing. Many of these applications require servers with tens of petabytes of storage, such as the Sequoia (Lawrence Livermore) and Blue Waters (NCSA) supercomputers.

Cloud computing simply performs a desired computation (often on big data) on a remote server that a subscriber configures and controls, rather than on the subscriber’s local desktop PC or tablet. Amazon EC2, Microsoft Azure, and Google Compute Engine (still in beta) are leading commercial cloud computing providers. Cloud computing providers charge users as little as $0.10 per CPU-hour for renting MIPS, memory, and disk space.

Cloud servers can house up to a few hundred thousand processor cores, plus many terabytes of disk storage. Cloud computing also offers virtualization technology that gives users a selection of operating systems, applications, and network interconnects—additional software flexibility for their modest rental fee.

The Plumbing Problem

So where’s the plumbing problem? It can be identified from statistics found in two excellent recent reports: Akamai’s “State of the Internet” report from the third quarter of 2012,1 and IDC’s “The Digital Universe in 2020” report from December 2012.2 These trends vary widely from country to country, but the global trends are unmistakable.

The plumbing problem arises because of the rate at which data is being created and stored. The digital universe will approximately double every two years, or 41% per year,2 which is significantly faster than the growth in the bandwidth of network connections. In 2012, wired connection speeds grew just 11%, to an average of 2.8 Mbits/s.1 The growth of connections isn’t keeping up with the growth in data. That’s the plumbing problem.
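The equivalence between “doubling every two years” and “41% per year” is simple compound-growth arithmetic, which a few lines can verify:

```python
# Compound annual growth rate implied by doubling every two years:
# solve (1 + r)^2 = 2 for r.
annual_growth = 2 ** (1 / 2) - 1
print(f"{annual_growth:.1%}")  # roughly 41% per year
```

By the same rule, an 11% annual growth in connection speeds takes about six and a half years to double, versus two years for the data itself.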

Where will this new data come from, and how will it get into the digital universe that IDC describes? In 2020, according to IDC, the digital universe will comprise 40,000 exabytes, and 68% of that will be created or consumed by end users (versus businesses). Netflix and similar video-on-demand services occupied 30% of Internet bandwidth in December 2012. Similarly, YouTube received 72 hours of new video every minute, which required 17 petabytes of new storage in 2012.

Mobile devices will both consume and generate much of this data. By the end of 2012, mobile devices generated 25% of Internet traffic. According to Cisco, video will account for 86% of all wireless traffic by 2016. Mobile devices also generate lots of sensor data, such as GPS location readings and patient-monitoring measurements. Thus, they are the primary source of the machine-to-machine (M2M) traffic that comprises the Internet of Things. The IDC report forecasts that machine-generated data will represent 42% of all data by 2020, up from 11% in 2005.

Which Internet connections aren’t keeping up? For employees at work, or people living in large, well-wired cities like Seoul, Korea, or Tokyo, Japan, download speeds across wired connections can reach 50 Mbits/s, but upload speeds can be one tenth of that. In the U.S., college students often have the fastest Internet connections, since many campuses are well-wired with multiple 10-Gbit/s fiber-optic links. Mobile 4G networks support average download speeds of 5 Mbits/s and upload speeds of 3 Mbits/s, but U.S. carriers like AT&T cap data transfers at about 5 Gbytes per month. So, uploading big data via 4G would be very expensive, at overage rates of about $10 per gigabyte!
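Using the figures above, the economics of 4G uploads are easy to estimate. A minimal sketch, assuming a flat $10-per-gigabyte overage rate beyond the 5-Gbyte monthly cap (the exact billing structure varies by carrier):

```python
def lte_upload_cost(dataset_gbytes, cap_gbytes=5, overage_per_gbyte=10.0):
    """Approximate 4G cost to upload a dataset in one billing month,
    charging only for data beyond the monthly cap."""
    overage = max(0.0, dataset_gbytes - cap_gbytes)
    return overage * overage_per_gbyte

# Even a modest 1-Tbyte (1000-Gbyte) dataset is prohibitively expensive:
print(f"${lte_upload_cost(1000):,.0f}")  # $9,950
```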

Unclogging The Pipes

One solution to the plumbing problem may be shipping disk drives to cloud computing providers. For example, Amazon’s AWS Import/Export service will receive your shipped disk drive and transfer your data to a local AWS server.3 If you’re a business, you can buy a faster Internet connection from a variety of vendors, as long as you live near a medium-sized city. And if you live in Kansas City, you can subscribe to Google Fiber.

The Amazon AWS option assumes a FedEx cost of $50 per drive, an AWS fee of $80 per drive, and $2.49 per hour of transfer time. Business Internet providers like Comcast offer 50-Mbit/s download and 10-Mbit/s upload speeds for $200 per month. For those readers fortunate enough to live near Kansas City, Google Fiber delivers 1-Gbit/s upload and download speeds for just $120 per month (including television channels). Similar 1-Gbit/s access is available to tech-savvy folks at many colleges and universities, which often provide Internet connections over 1-Gbit/s or 10-Gbit/s fiber-optic links.

The table shows the incredible advantage of a fast, affordable upload pipe.4 A 100-Tbyte dataset could be uploaded in nine days for just $37 using a gigabit/second link. In contrast, the same 100-Tbyte dataset could be moved to Amazon AWS servers by shipping 34 disk drives of 3 Tbytes each, and all of the data could be copied in a little over a day, with FedEx shipping accounting for most of that time. But this option costs nearly $5000.
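The table’s upload-time and cost figures reduce to a few lines of arithmetic. A sketch reproducing the gigabit-fiber numbers, assuming the $120/month Google Fiber rate quoted above is pro-rated by usage:

```python
def upload_days(dataset_tbytes, link_mbits_per_s):
    """Days to upload a dataset at a sustained link rate."""
    bits = dataset_tbytes * 1e12 * 8
    seconds = bits / (link_mbits_per_s * 1e6)
    return seconds / 86400

# 100-Tbyte dataset over a 1-Gbit/s fiber link
days = upload_days(100, 1000)
cost = days / 30 * 120  # pro-rated $120/month fee (assumption)
print(f"{days:.1f} days, ${cost:.0f}")   # about 9.3 days, about $37

# The same dataset over a 10-Mbit/s business upload link
print(f"{upload_days(100, 10):.0f} days")  # more than two and a half years
```

The disk-shipping path instead costs roughly 34 drives times ($50 FedEx + $80 AWS fee), or about $4400, before transfer-time charges, which is consistent with the nearly $5000 total above.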

Option 2 (10-Mbit/s upload) isn’t really practical for big data because both upload times and costs quickly become prohibitive. Depending on the data type of the dataset being moved, compression can reduce upload times for all three options, as long as compression and decompression processing is fast enough to keep up with the link. Until more of us have access to gigabit/second upload links, plumbing problems in cloud computing probably will be alleviated by shipping disk drives—but it won’t be cheap!
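Compression’s effect on upload time is multiplicative: halve the bytes, halve the time. A minimal sketch, where the 2:1 ratio is an illustrative assumption rather than a figure from the table:

```python
def compressed_upload_seconds(dataset_bytes, link_bits_per_s, ratio):
    """Upload time after compressing a dataset by the given ratio,
    assuming (de)compression throughput keeps pace with the link."""
    return (dataset_bytes / ratio) * 8 / link_bits_per_s

# 100 Tbytes over 1 Gbit/s, with an illustrative 2:1 compression ratio
secs = compressed_upload_seconds(100e12, 1e9, 2.0)
print(f"{secs / 86400:.1f} days")  # roughly 4.6 days instead of 9.3
```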

References

1. “The State of the Internet,” Akamai, www.akamai.com/stateoftheinternet/

2. “The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East,” IDC, www.emc.com/leadership/digital-universe/iview/index.htm

3. “AWS Import/Export,” Amazon Web Services, http://aws.amazon.com/importexport/

4. Information compiled by Samplify based on information from company Web sites

Al Wegener, CTO and founder of Samplify, www.samplify.com, earned an MSCS from Stanford University and a BSEE from Bucknell University.