Big data has become the next big thing. In my field of industrial-organizational psychology and management, there are articles about it in major journals and sessions about it at conferences. You see it mentioned often in social media. People talk about it as a new tool that will revolutionize things. But as I look past the headlines to see what people are actually talking about, I wonder: is it really big data?
What Is Big Data?
Big data is a term that was originally defined as a case where you have a massive amount of information, so much that it takes multiple computers to analyze. An IT colleague told me he worked on one problem that took 1,200 computers. The entire data set cannot fit on your new 8-terabyte hard drive.
Big data is described by the three Vs:
- Volume. To be big data you must have a massive amount of data.
- Velocity. Data is being continuously collected at a very high rate. Think about the constant stream of tweets on Twitter.
- Variety. Data can take many forms, such as numbers, text, audio, and video, as we see in posts on social media platforms such as LinkedIn.
What Is It Really?
Often when people talk about big data, they are talking about applications with massive amounts of streaming data. For example, companies like Amazon use people's browsing behavior on their websites to predict sales. There are also discussions of using social media posts to screen job applicants. However, I often see people talking about "big data" when their situation has none of the three Vs. They are using the term to refer to the use of analytics and data mining to solve real-world problems. In other words, their application uses data mining tools to make evidence-based decisions.
Data mining combines machine learning with statistics to find interesting and useful patterns in a set of data. This is an exploratory approach in which we investigate a data set with little or no preconceived notion of what to expect. Our data set might be huge (big data), but it doesn't have to be. Size does not determine whether something is data mining, although it does set limits on what we can reasonably accomplish. You see these tools used in human resources (predicting who will be a good employee), marketing (predicting purchasing behavior), and politics (predicting voting behavior), just to name a few use cases.
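To make the exploratory idea concrete, here is a minimal sketch in Python of the kind of pattern hunting described above, run on a data set that is nowhere near big. Everything in it is an assumption made for illustration: the six records are invented, the column names (tenure_years, training_hours, commute_km, rating) are hypothetical, and ranking pairwise Pearson correlations is only one simple stand-in for the many techniques that fall under data mining.

```python
from itertools import combinations
from math import sqrt

# Invented toy records for illustration only (not real data):
# one row per hypothetical employee.
records = [
    {"tenure_years": 1, "training_hours": 5, "commute_km": 30, "rating": 2.1},
    {"tenure_years": 2, "training_hours": 12, "commute_km": 5, "rating": 2.9},
    {"tenure_years": 3, "training_hours": 9, "commute_km": 22, "rating": 2.7},
    {"tenure_years": 5, "training_hours": 20, "commute_km": 14, "rating": 3.8},
    {"tenure_years": 8, "training_hours": 26, "commute_km": 40, "rating": 4.3},
    {"tenure_years": 10, "training_hours": 31, "commute_km": 9, "rating": 4.6},
]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rank_pairs(rows):
    """Score every pair of columns and sort by strength of association.

    No hypothesis is stated up front: every pair is examined, and the
    strongest relationships (positive or negative) are listed first.
    """
    columns = list(rows[0])
    scored = []
    for a, b in combinations(columns, 2):
        r = pearson([row[a] for row in rows], [row[b] for row in rows])
        scored.append((a, b, r))
    return sorted(scored, key=lambda item: abs(item[2]), reverse=True)


ranked = rank_pairs(records)
for a, b, r in ranked:
    print(f"{a} vs {b}: r = {r:+.2f}")
```

Six rows fit comfortably on any laptop, yet the approach is the same one that people often label big data: scan everything, see which patterns stand out, and then decide which ones deserve a closer look. With so few rows, any pattern that surfaces is only a lead to check against more data, not a finding.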
What Do People Mean When They Use the Term?
The term big data can be confusing because it has taken on two meanings: the original one that involves the three Vs, and the more recent one that refers to analytics and data mining. Perhaps this is why many data scientists do not like to use the term, as its meaning has become unclear. When I hear the term big data, I like to ask for specifics so I am sure what the person means. Is it really big data, or is it just data?