February 19, 2017

Getting over the fear of using big data and predictive modeling

In the last year, I have noticed some fear of using data models. George E. P. Box said that “Essentially, all models are wrong, but some are useful”, and in that spirit we should take a directional approach when we start building data models. Without knowing it, most of us use predictive modeling daily when we look at our car’s fuel range predictor.

Taking a step back, we already understand data models on an intuitive level. When you first fill up your car and it says you have a range of 415 miles, you accept that this is not exactly precise. Yet when you look at your dashboard mid-tank and see 271 miles left, you know this reading is more accurate because it is based on your current driving behavior. As your tank gets closer to empty, the fuel range prediction gets more precise. Unless traffic comes to a complete standstill, the fuel range prediction is usually close to the mark.

The same applies to data modeling. Initially, we get a sense of what direction the data is taking. At the midpoint, we can refine variables and work toward much stronger statistical significance. There is nothing to fear with data modeling. You have to start somewhere directional and accept that it’s an iterative process.

Men’s March Madness Descriptive Data Model: Predicted Duke with 74.6% Accuracy

Every year during the Men’s March Madness basketball tournament, thousands of people fill out their brackets. Most of these brackets are filled out with gut-based decision making or a bias toward a favorite team or alma mater. But what if there is a way to put data science to use and win your pool? This is the question I set out to answer, and I can tell you that I not only predicted Duke to win it all but also picked 47 of 63 games correctly, an accuracy of 74.6%. My actual bracket is below.

[Image: bracket2015]

Building the Data Model
Modeling the NCAA tournament isn’t the easiest thing to do, as there are many data sources and you don’t know exactly which variables will help you derive the right model. This is why I started with 2014 tournament data, to understand which variables could be gathered to help me build a descriptive model to predict the 2015 tournament.

I downloaded the 2014 data with the following variables: Wins, Losses, Winning Percentage, RPI, Strength of Schedule, AP ranking, Coaches Poll Ranking, Tournament Seed, Points Per Game, and Points Allowed. All of these variables are descriptive variables and not the dependent variable or goal of the data model.

The goal of the data model is to predict the winner of each game based on the descriptive variables. In order to do this, I created a goal variable called Tournament Wins. The maximum wins that a team in 2014 could have had is 5, achieved by the University of Connecticut. I manually counted the number of wins for all 64 teams in the 2014 tournament and added the goal variable to my spreadsheet.
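The manual count could also be scripted. This is a minimal sketch, assuming the game results are available as (winner, loser) pairs. The function name, team names, and results are made up for illustration and are not the real 2014 data.

```python
from collections import Counter

def tournament_wins(games, teams):
    """Count wins per team; teams that never won get 0."""
    wins = Counter(winner for winner, _loser in games)
    return {team: wins.get(team, 0) for team in teams}

# Hypothetical four-team mini bracket (not real tournament results).
teams = ["Team A", "Team B", "Team C", "Team D"]
games = [
    ("Team A", "Team B"),  # semifinal 1
    ("Team C", "Team D"),  # semifinal 2
    ("Team A", "Team C"),  # final
]
wins = tournament_wins(games, teams)
```

The resulting dictionary can be pasted into the spreadsheet as the Tournament Wins column.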

Processing the Data Through a Data Mining Tool
To process the descriptive variables and how they relate to the dependent variable Tournament Wins, I used a free machine learning tool called Weka. This allowed me to run different mining algorithms on the data and create a descriptive model. The first model I ran was a Conjunctive Rule model, which told me that Strength of Schedule had a 73% correlation with Tournament Wins.

Second, a Decision Tree model told me that Strength of Schedule had a 74% correlation coefficient with Tournament Wins. Third, a Linear Regression model told me that Winning Percentage and Strength of Schedule had the highest impact on Tournament Wins, with a 64% correlation coefficient. It also showed that Points Per Game had a negative correlation with Tournament Wins. Last, the M5 rule model told me that Winning Percentage and Strength of Schedule had an 82% correlation coefficient with Tournament Wins.
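To make the idea of a correlation coefficient concrete, here is a minimal sketch of a Pearson correlation computed by hand between one descriptive variable and the goal variable. It is not Weka’s output or its internal method, and the numbers are invented for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical values for five teams (not the real 2014 data).
strength_of_schedule = [0.52, 0.55, 0.58, 0.60, 0.63]
tournament_wins = [0, 1, 1, 3, 4]

r = pearson(strength_of_schedule, tournament_wins)
```

A value of r near 1 means the two variables rise together, a value near -1 means one falls as the other rises, and a value near 0 means little linear relationship.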

Descriptive Model Conclusion
From the data models, I concluded that Tournament Wins were strongly influenced by a team’s Strength of Schedule and Winning Percentage. I started to fill out my bracket using both of these rules and realized that sometimes one team had the stronger Strength of Schedule while the other had the better Winning Percentage. I needed a tie breaker. Because Points Per Game was negatively correlated with Tournament Wins, I decided to use the team that allows the fewest points as the tie breaker. Thus, the rules were Strength of Schedule, Winning Percentage and, if needed, the lowest points allowed per game. Based on these rules I filled out the tournament bracket above, got 74.6% of the games right, predicted Duke as the winner, and won my pool.
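The bracket rules can be written down as a small function. This is a sketch under two assumptions of mine: a higher Strength of Schedule number means a tougher schedule, and a team that leads in both variables is picked outright. The Team class, the pick_winner name, and all values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Team:
    name: str
    strength_of_schedule: float  # assumed: higher means tougher schedule
    winning_pct: float
    points_allowed: float        # points allowed per game

def pick_winner(a: Team, b: Team) -> Team:
    """Pick by Strength of Schedule and Winning Percentage;
    fall back to fewest points allowed when the two rules split."""
    a_score = (a.strength_of_schedule > b.strength_of_schedule) + (a.winning_pct > b.winning_pct)
    b_score = (b.strength_of_schedule > a.strength_of_schedule) + (b.winning_pct > a.winning_pct)
    if a_score != b_score:
        return a if a_score > b_score else b
    # Tie breaker: the team that allows the fewest points.
    return a if a.points_allowed <= b.points_allowed else b

# Hypothetical teams (not real statistics).
x = Team("Team X", strength_of_schedule=0.60, winning_pct=0.80, points_allowed=65.0)
y = Team("Team Y", strength_of_schedule=0.55, winning_pct=0.85, points_allowed=60.0)
z = Team("Team Z", strength_of_schedule=0.50, winning_pct=0.70, points_allowed=58.0)

split_pick = pick_winner(x, y)  # rules split, so the tie breaker decides
clear_pick = pick_winner(x, z)  # Team X leads on both rules
```

Applying pick_winner round by round, feeding each round’s winners into the next, fills out a whole bracket.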

Forecast Your Website Traffic With Historical Data

Many times, managers, marketers, and advertisers are asked to forecast future behavior. Without formal education in time series analysis, this can be a pretty daunting and scary task. However, with the proper tools and education, you can create a model that fits the data and produces a good forecast of your future website traffic.

The Challenge
Your website data has many fluctuations based on campaigns. Campaigns can be timely, constant, or pulsating, making traffic to your website change drastically over time. These variances in traffic data make it difficult for a model to produce reliable forecasts. [Read more…]

How To Improve Digital Media Return on Investment (ROI)

You just saw a commercial for the nth time and you think to yourself, “doesn’t this company know that they are wasting their advertising money?” This is a stage where your audience is still receiving Impressions, but their propensity to Convert has diminished. There is a name for this phenomenon: Ad Wear-Out. With digital media, we are very lucky, because we can track when this phenomenon happens, and I will tell you how to do so.

First, let’s look at how this happens in the digital media return on investment bell curve below. The first stage in the curve is Awareness; every campaign has a stage where the customer is becoming aware of your offering. Awareness slowly ramps up into a stage between the orange and blue bars called Return On Investment. This middle stage is where most of your revenue is gained, as users are aware of your product and are purchasing. [Read more…]

Measuring KPIs: Always Capture Diagnostic Metrics

Every executive wants to see the top Key Performance Indicators (KPIs) on a regular reporting cadence. However, when major changes happen in your KPIs, you need to be able to explain what happened, what it means, and what you are going to improve.

Factors Influencing Your Key Performance Indicators
The factors influencing your key performance indicator to the left are a few that I picked as an example. Increases or decreases in your purchase funnels will highly affect your KPIs. The factors I chose on the left are a guess, but there is a way to find exactly what is affecting your KPIs through diagnostic metrics. Your analyst needs to find direct relationships between variables (influencing diagnostic metrics) and your KPI (response variable). [Read more…]

My First Hands On Experience With Big Data

When you read about Big Data online, you cannot help getting incredibly excited about the opportunity at hand. Experiencing it hands on is a little bit different. Looking at it and seeing where it resides, how it is queried, and how unstructured the data is, is an experience. A few weeks ago I wrote about the difference between big data and small data. Even though no actual model was built, I gathered some interesting facts on what needs to be done to build a big data model.

Big Data Overview
The data lives in the cloud and is highly unstructured. It is not immediately usable for analysis because it has missing data and lacks response variables. From my initial read of the data, there is demographic data and session data, but very few response variables, which are the variables that represent organizational goals. For a data modeler to be successful, we need response variables. [Read more…]

Decision Trees vs. Neural Networks

Two popular data modeling techniques are Decision Trees, also called classification trees, and Neural Networks. These two techniques are very different, from the way they look to the way they find relationships among variables. The neural network is an assembly of nodes that looks somewhat like the human brain, while the decision tree is an easy-to-follow, top-down way of looking at the data.

[Image: decision tree vs neural network]

[Read more…]

What is The Difference Between Big Data and Small Data

Big Data is here. Before looking at the difference between Big Data and Small Data, it is important to assess the data mining process for the data. This will allow you to see the steps necessary to arrive at valid data insights.


[Read more…]

What Web Analytics Platforms Cannot Do: Prove Causality

Web analytics platforms such as Omniture (Adobe Marketing Cloud), WebTrends and Google Analytics are amazing tools. They are tools for trending your website’s actions and conversions. In a web analytics tool you can pull forward paths to conversion, pages users saw prior to conversion, and many other reports that let you make assumptions about what drove a conversion. However, you cannot establish causality with statistical significance.

In order to have a causal relationship, you first need to have association. Once a strong association has been established, you can infer a causal relationship between two variables. In other words, Page A caused Conversion A. For this, you need raw data and access to IBM SPSS, SAS or R.
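As an illustration of the association step only, here is a minimal sketch of a chi-square test on a 2x2 table of raw visit counts. It is written in plain Python rather than SPSS, SAS or R, the function name is mine, and the counts are hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the table [[a, b], [c, d]],
    without continuity correction."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    observed = [a, b, c, d]
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts (not real data):
#                  converted   did not convert
# saw Page A          30            70
# did not see it      10            90
chi2 = chi_square_2x2(30, 70, 10, 90)

# 3.841 is the chi-square critical value for 1 degree of freedom at the 0.05 level.
significant = chi2 > 3.841
```

With these made-up counts the statistic is 12.5, so the association between seeing Page A and converting would be statistically significant at the 0.05 level.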

[Read more…]

What is Machine Learning, Big Data Modeling

Big Data is the hottest topic in tech news. Everyone talks about what Big Data is, but rarely does anyone show how data mining algorithms actually work. This post shows how to make a simple decision tree model from stock data, and how to understand it.

The Data
Data was downloaded from a free online source for the John Deere stock ticker DE, from August 3, 2012 to May 3, 2012. The data shows DE Open, DE High, DE Low, DE Volume, and other indicators such as Dow Jones, NYSE, Shanghai, Nasdaq, Greek, Euro VKG, Price of Corn, and a Buy DE stock Yes/No variable. Buy DE is Yes when DE Close Price > DE Open Price.
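The Buy DE rule is simple enough to sketch in a few lines. The function name, the row layout, and the prices below are hypothetical and are not taken from the downloaded data.

```python
def buy_label(open_price, close_price):
    """Return "Yes" when the close is above the open, otherwise "No"."""
    return "Yes" if close_price > open_price else "No"

# Hypothetical daily rows (not real DE prices).
rows = [
    {"open": 80.0, "close": 81.5},
    {"open": 82.0, "close": 81.0},
    {"open": 79.5, "close": 79.5},
]
labels = [buy_label(row["open"], row["close"]) for row in rows]
```

A day where the close equals the open is labeled No, because the rule requires the close to be strictly greater.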

[Read more…]