When you read about Big Data online, you cannot help getting excited about the opportunity at hand. Experiencing it hands-on is a little different. Seeing where the data resides, how it is queried, and how unstructured it is turns out to be an experience in itself. A few weeks ago I wrote about the difference between big data and small data. Even though no actual model was built, I gathered some interesting facts about what needs to be done to build a big data model.
Big Data Overview
The data lives in the cloud and is highly unstructured. It is not immediately usable for analysis because it has many missing values and lacks response variables. From my initial reading of the data, there is demographic data and session data, but very few variables that could serve as response variables, that is, variables that reflect organizational goals. To be successful, a data modeler needs a response variable.
Creating Knowledge From Big Data
Having an abundance of information is great, but not having a goal, meaning a response (dependent) variable to model the predictor variables against, is an issue. Thus, your first task is to go to the business leaders and define a response variable. If none exists in the data, one can be created by running some descriptive statistics on your Big Data set.
For example, suppose your goal is sales volume and you do not have a response variable. You can run descriptive statistics (mean, median, max, and min) on your sales volume data by transaction. Once you know your average transaction, you can segment your transactions into above-average and below-average sales. You can then build a model to understand which predictor variables distinguish higher-than-average sales from lower-than-average sales.
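As a rough illustration of the sales example, here is a minimal Python sketch. The transaction amounts and the function names `describe` and `label_by_mean` are made up for this post, and using the mean as the cut-off is just one possible choice.

```python
from statistics import mean, median

# Hypothetical transaction amounts; in practice these come from your own data.
sales = [12.50, 40.00, 7.25, 99.90, 23.10, 55.00]

def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
    }

def label_by_mean(values):
    """Label each value 1 if it is above the mean, otherwise 0."""
    avg = mean(values)
    return [1 if v > avg else 0 for v in values]

stats = describe(sales)
labels = label_by_mean(sales)  # this list can serve as a binary response variable
print(stats)
print(labels)
```

The `labels` list is the derived response variable: each transaction is now tagged as an above-average or below-average sale, and the predictor variables can be modeled against it.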
Another example is session data. You can divide sessions into below-average, above-average, and completed sessions, which gives you three different ways to look at the predictor variables against the response variable. In this case, your time is better spent modeling the completed and above-average sessions than modeling the negative sessions. Your time should focus on what creates the most good for your organization.
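The session split can be sketched the same way. This assumes "average" refers to session duration and that each session carries a completed flag; both are my assumptions for illustration, as are the sample records and the name `segment_sessions`.

```python
from statistics import mean

# Hypothetical session records: (duration in seconds, session completed?).
sessions = [
    (30, False),
    (200, True),
    (120, False),
    (45, False),
    (310, True),
    (150, False),
]

def segment_sessions(records):
    """Assign each session to one of three classes.

    Completed sessions get their own class; the rest are split
    by whether their duration is above the average duration.
    """
    avg = mean([duration for duration, _ in records])
    labels = []
    for duration, completed in records:
        if completed:
            labels.append("completed")
        elif duration > avg:
            labels.append("above_average")
        else:
            labels.append("below_average")
    return labels

session_labels = segment_sessions(sessions)
print(session_labels)
```

To focus on the sessions that matter most, you could then keep only the records labeled `completed` or `above_average` before modeling.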
Building the Model
Your data type determines the model and the processing technique. The software available to you also determines the model used. You will have to clean your data before it can be used. Big data has so many missing values that you may find only 30% of your data is usable. Big data sets are so large, however, that 30% may still be a sufficient sample size.
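To see how much of your data survives cleaning, you can count the rows that have no missing values in the fields you need. This is a minimal sketch with made-up rows and field names; dropping incomplete rows is only one way to handle missing values.

```python
# Hypothetical rows; None marks a missing value.
rows = [
    {"age": 34, "region": "west", "sale": 40.0},
    {"age": None, "region": "east", "sale": 12.5},
    {"age": 51, "region": None, "sale": 99.9},
    {"age": 28, "region": "north", "sale": 23.1},
    {"age": 45, "region": "south", "sale": None},
]

def complete_rows(records, required):
    """Keep only the rows where every required field is present."""
    return [
        r for r in records
        if all(r.get(field) is not None for field in required)
    ]

usable = complete_rows(rows, ["age", "region", "sale"])
usable_fraction = len(usable) / len(rows)
print(f"{len(usable)} of {len(rows)} rows usable ({usable_fraction:.0%})")
```

On a real data set, it is the absolute number of usable rows, not just the fraction, that tells you whether enough sample remains.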
Building the model takes time, as many steps and procedures are needed to validate the parameters and find the handful of variables that explain your response variable. Both statistical and graphical steps are taken to validate your model. Keep in mind that the goal is a model that is both reliable and valid.
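One common validation step, used here only as an illustration since the right steps depend on your model, is to hold back part of the data and check the model against rows it has never seen. The function `holdout_split` and its default values are hypothetical choices for this sketch.

```python
import random

def holdout_split(records, test_fraction=0.3, seed=0):
    """Shuffle the records and split them into training and test sets."""
    shuffled = list(records)
    # A fixed seed makes the split repeatable from run to run.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Stand-in record ids; real records would be your cleaned rows.
train, test = holdout_split(range(10))
print(len(train), len(test))
```

You would fit the model on `train` and measure how well it predicts the response variable on `test`.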
There is much excitement about data modeling of Big Data, but few managers know the intermediate steps needed to create a valid model. Managers have to keep in mind that Data Scientists cannot be kept on a short leash and need time to do the necessary model building and validation steps. If that time is allotted, the resulting model can be of tremendous value to business goals.