The case for gathering and analyzing Big Data streams is growing. A recent study by the MIT Center for Digital Business found that the more a company characterizes itself as data-driven, the better it performs in terms of productivity and profitability [1]. So what is “Big Data”?
To start, we at Performance Architects define “Big Data” as a broad category of data that is higher in volume and more unstructured in nature than the data typically found in traditional financial reports and IT relational databases. The size and subject matter of Big Data vary by industry.
So what characterizes Big Data? There are examples in nearly every industry, and they share the “3 Vs” of Big Data:
- Volume: Big Data volumes usually start in the terabytes (1 TB = ~1,000 GB). However, it is not uncommon for companies to process petabytes (1 PB = ~1,000,000 GB) or more. It is estimated that Walmart collects more than 2.5 PB of data every hour from customer transactions [2].
- Velocity: Big Data accumulates in organizations at a rapid pace, far faster than traditional data sets. The most commonly cited example of Big Data velocity is opinions posted to social media from mobile devices.
- Variety: Big Data is uniquely varied. Streams like social media and GPS tracking have been ubiquitous for only a few years, and differ greatly from traditional Chart of Accounts (COA) based views of a business.
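To put the Volume figures in perspective, the unit arithmetic can be sketched in a few lines of Python. The function and constant names here are made up for illustration, and the conversion uses the same approximate decimal factors quoted in the list (1 TB = ~1,000 GB, 1 PB = ~1,000,000 GB).

```python
# Approximate decimal conversion factors, as quoted in the Volume bullet.
GB_PER_TB = 1_000
GB_PER_PB = 1_000_000

def pb_to_gb(petabytes):
    """Convert petabytes to gigabytes (approximate, decimal units)."""
    return petabytes * GB_PER_PB

def gb_to_tb(gigabytes):
    """Convert gigabytes to terabytes (approximate, decimal units)."""
    return gigabytes / GB_PER_TB

# The 2.5 PB figure cited for Walmart, expressed in smaller units.
walmart_gb = pb_to_gb(2.5)
walmart_tb = gb_to_tb(walmart_gb)
print(f"2.5 PB is about {walmart_gb:,.0f} GB, or {walmart_tb:,.0f} TB")
```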
So what will adopting Big Data mean for an organization? As with many IT paradigm shifts, adopting Big Data means changes in hardware, software, and expertise. This does not necessarily translate into a multi-year, multimillion-dollar investment, however.
In terms of hardware, the new trend is to break the incoming stream into parts and distribute the processing across many smaller-scale nodes at once. So, instead of purchasing a few very expensive high-end servers, organizations can process Big Data streams with clusters of off-the-shelf hardware.
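The split-and-distribute idea can be illustrated with a small Python sketch. This is only an analogy under stated assumptions: worker threads on one machine stand in for the nodes of a cluster, the sales records are made-up sample data, and the function names are hypothetical. A real cluster would add networking, storage, and fault tolerance that this sketch ignores.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(records, chunk_size):
    """Break a list of records into consecutive chunks of at most chunk_size."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]

def process_chunk(chunk):
    """Hypothetical per-node work: total the sale amounts in one chunk."""
    return sum(amount for _, amount in chunk)

def distributed_total(records, chunk_size=2, workers=3):
    """Split the records, process the chunks in parallel, and combine the results."""
    chunks = split_into_chunks(records, chunk_size)
    # Each worker thread stands in for one small node in a cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_totals = list(pool.map(process_chunk, chunks))
    # Combine the partial results into a single answer.
    return sum(partial_totals)

# Made-up sample data: (store_id, sale_amount)
sales = [("s1", 10), ("s2", 25), ("s1", 5), ("s3", 40), ("s2", 20)]
total = distributed_total(sales)
print(total)
```

The point of the design is that no single worker ever needs to see the whole data set, which is what allows many modest machines to replace one large one.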
On the software side, many of the available offerings are open source and are in use by some of the most advanced tech companies. Hadoop is one such open source software framework, used by firms such as Facebook and Yahoo to process enormous Big Data sets. The Hadoop framework also offers a platform for processing streams through MapReduce programs as well as querying them through tools like Hive.
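For readers new to MapReduce, the programming model itself can be sketched in plain Python. This is not the Hadoop API, and it runs in a single process; it only shows the three conceptual steps (map, group by key, reduce) that a framework like Hadoop spreads across a cluster. The sample posts and function names are assumptions made for illustration.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit (key, value) pairs for one input record."""
    # Each word in a post becomes the pair (word, 1).
    return [(word.lower(), 1) for word in record.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)

def map_reduce(records):
    """Run map, shuffle, and reduce in sequence over the records."""
    pairs = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

# Made-up sample of short social media posts.
posts = ["great service", "slow service", "great price"]
counts = map_reduce(posts)
print(counts)
```

Because each call to `map_phase` looks at only one record and each call to `reduce_phase` looks at only one key, both steps can in principle run on different machines at the same time.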
And finally, processing and interpreting these Big Data sets will require some new skill sets, but these skills should be complementary to many of those already found in organizations today. Identify the people in your organization who best fit the Data Architect role, and let them experiment with the Hadoop framework to better understand the technology behind Big Data. Interpreting and correlating these data streams is yet another skill set, one that is being defined under the emerging Data Scientist role.
The bottom line is that getting started with a Big Data initiative should be thought of in terms of weeks, rather than months or years. Start with the hypothesis that capturing and storing Big Data streams, and analyzing them against existing structured and newer unstructured information streams, may reveal new insights into your organization.
Author: Michael Bender, Performance Architects
Sources: [1], [2] “Big Data: The Management Revolution,” Andrew McAfee and Erik Brynjolfsson, Harvard Business Review, October 2012