Differentiation is no longer about who can collect the most data. It’s about who can quickly make sense of the data they collect. There was a time when hardware sampling rates, limited by the speed of analog-to-digital conversion, physically restricted how much data could be acquired. But today, hardware is no longer the limiting factor in acquisition applications. The management of acquired data is the challenge of the future.
Advances in computing technology, including increasing microprocessor speed and hard-drive storage capacity, combined with decreasing costs for hardware and software have provoked an explosion of data coming in at a blistering pace. In measurement applications in particular, engineers and scientists can collect vast amounts of data every second of every day. For every second that the Large Hadron Collider at CERN runs an experiment, the instrument generates 40 terabytes of data. For every 30 minutes that a Boeing jet engine runs, the system creates 10 terabytes of operations information (Gantz, 2011). That’s “big data.”

The big data phenomenon adds new challenges to data analysis, search, integration, reporting, and system maintenance that must be met to keep pace with the exponential growth of data. And the sources of data are many. However, among the most interesting to the engineer and scientist is data derived from the physical world. This is analog data that is captured and digitized. Thus, it can be called “Big Analog Data™.” It is collected from measurements of vibration, RF signals, temperature, pressure, sound, image, light, magnetism, voltage, and so on. Challenges unique to Big Analog Data™ have driven three technology trends in the widespread field of data acquisition.

Contextual Data Mining*

The physical characteristics of some real-world phenomena make high acquisition rates a necessity: unless data is sampled fast enough, the information simply cannot be captured, so small data sets are not an option. And even when the characteristics of the measured phenomena would permit slower sampling, small data sets often limit the accuracy of conclusions and predictions.

Consider a gold mine where only 20% of the gold is visible. The remaining 80% is in the dirt where you can’t see it. Mining is required to realize the full value of the contents of the mine. This leads to the term “digital dirt,” meaning digitized data can have concealed value. Hence, data analytics and data mining are required to achieve new insights that have never before been seen.

Data mining is the practice of using the contextual information saved along with data to search through and pare down large data sets into more manageable, applicable volumes. Storing raw data alongside its original context, or “metadata,” makes that data easier to accumulate, locate, and later manipulate and understand. For example, examine a series of seemingly random integers: 5126838937. At first glance, it is impossible to make sense of this raw information. However, when given context like (512) 683-8937, the data is much easier to recognize and interpret as a phone number.

Descriptive information about measurement data context provides the same benefits and can detail anything from sensor type, manufacturer, or calibration date for a given measurement channel to revision, designer, or model number for an overall component under test. In fact, the more context that is stored with raw data, the more effectively that data can be traced throughout the design life cycle, searched for or located, and correlated with other measurements in the future by dedicated data post-processing software.
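To make the idea concrete, here is a minimal sketch, in Python, of mining records by their stored context rather than by their raw values. The record layout, field names, and `mine()` helper are illustrative assumptions, not any particular vendor’s storage format.

```python
# Illustrative sketch: raw measurements stored alongside contextual
# metadata, then mined by context. All field names are assumptions.

records = [
    {"data": [0.12, 0.15, 0.11],
     "meta": {"sensor": "accelerometer", "channel": "ch0",
              "unit": "g", "calibrated": "2013-01-15"}},
    {"data": [21.4, 21.6, 21.5],
     "meta": {"sensor": "thermocouple", "channel": "ch1",
              "unit": "degC", "calibrated": "2012-11-02"}},
]

def mine(records, **criteria):
    """Return only the records whose metadata matches every criterion."""
    return [r for r in records
            if all(r["meta"].get(k) == v for k, v in criteria.items())]

# Pare the full data set down to just the vibration measurements.
vibration = mine(records, sensor="accelerometer")
```

The more context stored per record, the more precisely a query like this can pare a huge archive down to the handful of channels that actually matter.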

Intelligent DAQ Nodes

Data acquisition applications are incredibly diverse. But across a wide variety of industries and applications, data is rarely acquired simply for the sake of acquiring it. Engineers and scientists invest critical resources into building advanced acquisition systems, but the raw data produced by those systems is not the end game. Instead, raw data is collected so it can be used as an input to analysis or processing algorithms that lead to the actual results system designers seek.

For example, automotive crash tests can collect gigabytes of data in a few tenths of a second that represent speeds, temperatures, forces of impact, and acceleration. But one of the key pieces of pertinent knowledge that can be computed from this raw data is the Head Injury Criterion (HIC), a single scalar, calculated value representing the likelihood of a crash dummy to experience a head injury in the crash.
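The HIC itself is a published formula: the maximum, over all time windows up to a fixed length, of the window duration multiplied by the average resultant head acceleration (in g) raised to the 2.5 power. A brute-force sketch of that computation, for clarity rather than performance:

```python
# Sketch of the Head Injury Criterion: HIC is the maximum over all
# windows [t1, t2] (capped at max_window_s, commonly 15 ms) of
#   (t2 - t1) * (average acceleration over the window) ** 2.5
# where accel_g holds resultant head acceleration samples in g and
# dt is the sample period in seconds.

def hic(accel_g, dt, max_window_s=0.015):
    n = len(accel_g)
    cum = [0.0]                      # running integral of a(t) dt
    for a in accel_g:
        cum.append(cum[-1] + a * dt)
    best = 0.0
    max_steps = round(max_window_s / dt)
    for i in range(n):
        for j in range(i + 1, min(i + max_steps, n) + 1):
            duration = (j - i) * dt
            avg = (cum[j] - cum[i]) / duration
            if avg > 0:              # negative averages cannot raise HIC
                best = max(best, duration * avg ** 2.5)
    return best
```

For a constant 50 g pulse lasting 10 ms, this reduces gigabytes of potential raw channels to the single scalar 0.01 × 50^2.5, roughly 176.8.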

Additionally, some applications, particularly in environmental, structural, or machine condition monitoring, lend themselves to periodic, slow acquisition rates that can be increased drastically in bursts when a noteworthy condition is detected. This technique keeps acquisition speeds low and minimizes logged data while still allowing sampling rates adequate for high-speed waveforms when necessary. To incorporate tactics such as processing raw data into results or adjusting measurement details when certain criteria are met, you must integrate intelligence into the data acquisition system.
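One way to picture the burst technique in software: log sparsely at the slow monitoring rate, then keep every sample for a fixed window whenever a threshold crossing is detected. The decimation factor, threshold, and burst length below are arbitrary assumptions for illustration.

```python
# Simplified model of condition-triggered burst logging. In steady
# state, only every 100th sample is kept; a threshold crossing switches
# the logger to full-rate capture for burst_len samples.

def burst_logger(samples, threshold, burst_len):
    logged = []
    burst_remaining = 0
    for i, s in enumerate(samples):
        if abs(s) >= threshold:
            burst_remaining = burst_len      # noteworthy event: go fast
        if burst_remaining > 0:
            logged.append((i, s))            # high-speed burst capture
            burst_remaining -= 1
        elif i % 100 == 0:
            logged.append((i, s))            # slow periodic monitoring
    return logged
```

A real system would run this logic on the acquisition node itself, so the slow stream, not the full-rate stream, is what crosses the bus.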

Though it’s common to stream test data to a host PC (the “intelligence”) over standard buses like USB and Ethernet, high-channel-count measurements with fast sampling rates can easily overload the communication bus. An alternative approach is to store data locally and transfer files for post-processing after a test is run, which increases the time it takes to realize valuable results. To overcome these challenges, the latest measurement systems integrate leading technology from ARM, Intel, and Xilinx to offer increased performance and processing capabilities as well as off-the-shelf storage components to provide high-throughput streaming to disk.

With onboard processors, the intelligence of measurement systems has become more decentralized by having processing elements closer to the sensor and the measurement itself. Modern data acquisition hardware includes high-performance multicore processors that can run acquisition software and processing-intensive analysis algorithms in line with the measurements. These intelligent measurement systems can analyze and deliver results more quickly without waiting for large amounts of data to transfer, or without having to log it in the first place, which optimizes the system to use disk space more efficiently. 
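As a toy illustration of that decentralization, an onboard processor can reduce each block of raw samples to a single result, such as an RMS value, and transmit only the result. The `transmit` callback here is a hypothetical stand-in for whatever bus or network call a real node would use.

```python
import math

def process_inline(sample_stream, block_size, transmit):
    """Reduce each block of raw samples to one RMS value and send only
    that, cutting bus traffic by a factor of block_size."""
    block = []
    for s in sample_stream:
        block.append(s)
        if len(block) == block_size:
            rms = math.sqrt(sum(x * x for x in block) / block_size)
            transmit(rms)            # one value sent instead of block_size
            block.clear()
```

Because the reduction happens in line with the measurement, there is no large raw file to transfer or even to log, which is exactly the disk-space optimization described above.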

The Rise Of Cloud Storage And Computing**

The unification of DAQ hardware and onboard intelligence has enabled systems to be increasingly embedded or remote. In many industries, it has paved the way for entirely new applications. As a result, the Internet of Things is finally unfolding before our very eyes as the physical world is embedded with intelligence and humans can now collect data sets about virtually any environment around them. The ability to process and analyze these new data sets about the physical world will have profound effects across a massive array of industries. From health care to energy generation, from transportation to fitness equipment, and from building automation to insurance, the possibilities are virtually endless.

In most of these industries, content, meaning the data collected, is not the problem; plenty of smart people are already collecting lots of useful data. To date, the challenge has mainly been an IT problem: the Internet of Things is generating massive amounts of data from remote, field-based equipment spread literally across the world, sometimes in the most remote and inhospitable environments.

These distributed acquisition and analysis nodes (DAANs) embedded in other end products are effectively computer systems with software drivers and images that often connect to several computer networks in parallel. They form some of the most complex distributed systems and generate some of the largest data sets the world has ever seen. These systems need remote network-based systems management tools to automate the configurations, maintenance, and upgrades of the DAANs and a way to efficiently and cost-effectively process all of that data.

Complicating matters, if you reduce the traditional IT topology of most organizations collecting such data to its simplest form, you find they are actually running two parallel networks of distributed systems: “the embedded network,” which connects all of the field devices (DAANs) collecting the data, and “the traditional IT network,” where the most useful data analysis is implemented and distributed to users.

More often than not, there is a massive fracture between these two parallel networks within organizations, and they are incapable of interoperating. This means that the data sets cannot get to the point(s) where they are most useful. Think of the power an oil and gas company could achieve by collecting real-time data on the amount of oil coming out of the ground and running through a pipeline in Alaska and then being able to get that data to the accounting department, the purchasing department, the logistics department, or the financial department—all located in Houston—within minutes or hours instead of days or months.

The existence of parallel networks within organizations and the major investment made in them have been major inhibitors for the Internet of Things. However, today cloud storage, cloud computational power, and cloud-based “big data” tools have met these challenges. It is simple to use cloud storage and cloud computing resources to create a single aggregation point for data coming in from a large number of embedded devices (such as the DAANs) and provide access to that data from any group within the organization. This solves the problem of the two parallel embedded and IT networks that don’t interoperate.
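In skeletal form, the single-aggregation-point idea looks like the sketch below. The class, node names, and query interface are hypothetical, standing in for a real cloud storage service and its query API.

```python
# Toy model of a cloud aggregation point: many field nodes push records
# into one store, and any department queries that same store. Node IDs
# and field names are made up for illustration.

class AggregationPoint:
    def __init__(self):
        self.records = []

    def push(self, node_id, payload):
        """Accept a record from any embedded device (DAAN)."""
        self.records.append({"node": node_id, **payload})

    def query(self, **criteria):
        """Serve any group in the organization the same shared data."""
        return [r for r in self.records
                if all(r.get(k) == v for k, v in criteria.items())]

cloud = AggregationPoint()
cloud.push("pipeline-ak-01", {"metric": "flow", "value": 412.0})
cloud.push("pipeline-ak-02", {"metric": "flow", "value": 398.5})
accounting_view = cloud.query(metric="flow")   # same data, any department
```

Because both the embedded network and the IT network talk to the same aggregation point, the two no longer need to interoperate directly.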

Placing near-infinite cloud storage and computing resources, used and billed on demand, at users’ fingertips solves the challenges of both managing distributed systems and crunching huge sets of acquired measurement data. Big data tool suites offered by cloud providers make it easy to ingest and make sense of these huge measurement data sets.

To summarize, cloud technologies offer three broad benefits for distributed system management and data access: aggregation of data, access to data, and offloading of computationally heavy tasks.

* Contribution by Dr. Tom Bradicich, R&D Fellow, National Instruments

** Contribution by Matt Wood, Senior Manager and Principal Data Scientist, Amazon Web Services

Richard McDonell is the director of Americas technical marketing at National Instruments. He joined National Instruments in 1999 after earning a bachelor’s degree in electrical engineering from Texas A&M University, and he led the successful adoption of NI TestStand test management software and PXI modular instrumentation while serving as an industry leader in the test engineering community through many technical presentations, articles, and whitepapers. His technical focus areas include modular test software and hardware system design, parallel test strategy, and instrument control bus technology.
