Today's online, connected lifestyle delivers unprecedented amounts of raw data on consumer preferences to modern businesses. Companies working to harness this information, however, must first winnow petabytes of raw data to glean concentrated, actionable intelligence. Data such as server logs, click streams, search histories, likes, links, and postings contain valuable information about customer interests and preferences. Yet these data do not fit comfortably into the kind of database management systems designed for more structured IT tasks, such as accounting, customer relationship management, and enterprise resource planning.
The explosion in the volume of this unstructured data has prompted the development of new data management and processing tools that convert raw information into intelligence. Foremost among these tools is the Apache Hadoop framework. Hadoop provides the tools necessary to process data sets that are much too big to fit in the file system of a single server. It harnesses large clusters of servers to sift through vast amounts of data and distill the essential intelligence.
This article discusses techniques for accelerating the Apache Hadoop framework using hardware-accelerated data compression and a transparent compression/decompression file system (CeDeFS).