## May 12 Making Predictions With Simple Linear Regression Models

One of the things that really got me interested in Machine Learning algorithms and Neural Networks was their ability to make pretty good predictions. The best way to appreciate how these technologies work is to find some real-world data and come up with a project that is interesting and, hopefully, reveals some useful results.

I also find it rather interesting how all of the mathematics I learned years ago in school makes more sense when you apply it to real world situations. If nothing else, Machine Learning makes you appreciate the beauty of math and statistics. Math becomes more than just a set of rules and formulas with obscure meaning and instead presents itself as a valuable tool in the realm of Artificial Intelligence.

For example, let's take a look at linear regression...

## Linear Regression Formula

``y = β0 + β1 * x``

This formula was rammed down our throats in school. We were told that it is how you estimate a line and to make sure we memorized it for the test. Like most students, I jammed it into memory and held on to it without really understanding the true significance of this amazing formula and how it can be used to make interesting predictions.

Let's examine this formula, starting with the obvious. We know 'x' and 'y' are variables, and a variable can represent anything we want it to. What is interesting about this formula is not the variables themselves, but their relationship to one another. Statisticians describe this relationship as dependent vs. independent. To know what 'y' is, one must solve for it, and solving for 'y' requires extra information; this is why 'y' can be thought of as the dependent variable. On the other hand, 'x' is going to be whatever it is regardless of what happens to 'y'. In fact, 'y' has no meaning unless 'x' has one, which is why 'x' is known as the independent variable. 'x' exists whether you have 'y' or not.

What does this formula have to do with a line, and more importantly how is this useful in the real world and automation?

After taking another look at the formula we can see the two coefficients, β0 and β1. One can think of β0 as the intercept of the line, the constant value of 'y' where the line crosses the y-axis. Every line needs a starting point, and β0 represents this. β1 is the slope of the line, tracking how much 'y' changes for each unit change in 'x'. The cool thing about this formula is that it can be used to find the best-fitting line for any dataset that has a linear relationship.
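To make the intercept and slope concrete, here is the formula evaluated directly in R. The coefficients below are made up for illustration; they do not come from any real dataset:

```r
# Hypothetical coefficients, chosen only for illustration:
beta0 <- 2    # intercept: the value of y when x is 0
beta1 <- 0.5  # slope: how much y changes per unit of x

# Evaluate y = beta0 + beta1 * x at a few x values
x <- c(0, 10, 20)
y <- beta0 + beta1 * x
y  # 2 7 12
```

Each step of one unit in 'x' moves 'y' up by exactly β1, which is why β1 is the slope.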

Linear relationships are all around us, but one can't always comfortably assume that two variables are correlated. Many times, variables we assumed were correlated turn out not to be. The assumption of correlation must always be put to the test.

For example, suppose someone said that in football the more points you score per game, the more games you will win. This sounds like common sense. In fact, it's so obvious that one might not feel the need to prove a statement like that with mathematics. But what if we had to prove it? What if we could back that assumption up with real data?! Even better, what if we could use that data to make further predictions? It turns out we can.

## THE PROBLEM

Let's look at some data from the real world. Below is a chart of some statistics from 2016's NFL regular season.

We see a list of teams and two variables. The dependent variable, in this case, is the number of wins in a given season. The independent variable is the average points per game. What interests us is whether averaging more points per game leads to more wins in the season. That's the fundamental assumption. In addition, we would like to know if there is a correlation between the two, and whether it is possible to predict how many games a team will win based on the number of points it averages per game. If we plot this data using average points per game as 'x' and the number of wins in a season as 'y', we can visually see how all these data points stack up. Here is a graphical representation of this dataset.

Right off the bat, we notice that it is difficult to determine whether a high average of points per game leads to more wins. In fact, this graph could be used to argue that the majority of team averages sit in the middle of the graph. Notice how we have a team that averaged more than 32 points per game and still only managed about 11 wins on the season. Another point to see is a team that achieved 14 wins on the season but averaged around 28 points a game. Where do we go from here?

Before we lose all hope and consider our initial assumption trash we can get our answer to the question of the relationship between our two variables by drawing a line through our data. This is where our linear regression formula comes in.

## THE CODE

Our goal is to teach our machine to find the relationship in this dataset using the Linear Regression algorithm within the R ecosystem. Once training is complete we will test our system to see if it has learned this relationship. Finally, we will give our algorithm datasets from NFL seasons it has never seen before and determine how it handles them.

- First, we import the 2016 NFL data

- Our data set has 3 columns but we are only interested in the last 2 as the first column is a list of team names. Our algorithm only needs the numerical data.

After importing our data and selecting only columns 2 & 3 we get the following output...

- Output of NFL data as an R data frame.
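The import and column selection described above might look like the sketch below. The team names and numbers are a few illustrative rows, and the column names `AvgPoints` and `Wins` are assumptions, not the original source file:

```r
# A few illustrative rows standing in for the imported 2016 table;
# in practice this would come from something like read.csv("nfl_2016.csv")
dataset <- data.frame(
  Team      = c("Falcons", "Cowboys", "Patriots", "Browns"),
  AvgPoints = c(33.8, 26.3, 27.6, 16.5),
  Wins      = c(11, 13, 14, 1)
)

# Column 1 holds team names; keep only the two numeric columns
dataset <- dataset[, 2:3]
head(dataset)
```

Dropping the name column leaves the algorithm with only the numerical data it needs.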

Next, we need to split our data into one set used for training and another used for testing. A good rule of thumb is that 80% of the data should go to training and 20% to testing the system. We'll do that next.

- Here we take a sampled amount of the number of wins in the seasons and give it a split ratio of 80%

- That data is then turned into a subset called training_set

- Anything that was not a part of the 80% will be stored into a subset called test_set
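The three bullets above can be sketched as follows. The original likely used `caTools::sample.split`; base R's `sample` does the same job here, and the toy numbers are stand-ins for the real table:

```r
set.seed(123)  # make the random split reproducible

# Toy stand-in for the full imported dataset
dataset <- data.frame(
  AvgPoints = c(33.8, 26.3, 27.6, 16.5, 22.1, 24.8, 19.4, 21.2, 25.6, 17.0),
  Wins      = c(11, 13, 14, 1, 7, 9, 5, 6, 10, 3)
)

# 80% of the rows go to training_set, the remaining 20% to test_set
train_rows   <- sample(nrow(dataset), size = 0.8 * nrow(dataset))
training_set <- dataset[train_rows, ]
test_set     <- dataset[-train_rows, ]
```

Every row lands in exactly one of the two subsets, so the test rows stay unseen during training.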

Now we will use the training set we have to train our brand new data model.

- Our NFL_model gets trained using the linear model function, lm, in R.

- The lm function takes a formula argument, where we place our dependent variable and our independent variable separated by a "~".

- Finally, the data we give it is our training_set variable which is an R data frame
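Put together, the training step is one line of lm. The column names `Wins` and `AvgPoints` and the toy numbers below are assumptions standing in for the real training_set:

```r
# Toy stand-in for training_set; column names are assumed
training_set <- data.frame(
  AvgPoints = c(33.8, 26.3, 27.6, 16.5, 22.1, 19.4),
  Wins      = c(11, 13, 14, 1, 7, 5)
)

# Wins is the dependent variable, AvgPoints the independent one
NFL_model <- lm(formula = Wins ~ AvgPoints, data = training_set)
coef(NFL_model)  # the fitted intercept (beta0) and slope (beta1)
```

The two numbers coef returns are exactly the β0 and β1 from the formula at the top of the post, now estimated from data.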

Now we are ready to make predictions! Luckily, R has a built-in prediction function we can take advantage of ...

- We make a prediction and store the results in our wins_predictions variable.

- The predict function takes the NFL model we trained as well as the test set.
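A sketch of that prediction step, again on toy stand-in data with assumed column names:

```r
# Train on toy data standing in for the real training_set
training_set <- data.frame(
  AvgPoints = c(33.8, 26.3, 27.6, 16.5, 22.1, 19.4),
  Wins      = c(11, 13, 14, 1, 7, 5)
)
NFL_model <- lm(Wins ~ AvgPoints, data = training_set)

# predict() applies the fitted line to rows the model has never seen
test_set <- data.frame(AvgPoints = c(24.8, 18.0))
wins_predictions <- predict(NFL_model, newdata = test_set)
wins_predictions
```

Each prediction is just β0 + β1 × AvgPoints using the coefficients the model learned.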

Next, we are ready to see how our model performed on the training set. For this part, we will need to take our model and make predictions. Once the predictions are made we will then need to graph them accordingly. This will give us a good idea of how our system did on our training set.

- Here we use the ggplot library to do the heavy lifting for us. First, we add all the x & y points from our original training set to the graph using the aes function, coloring those points red to distinguish them.

- Next, we add our prediction line, using average points per game on the x-axis and the newly predicted y values from our training set data. We make the line blue so that we can see the difference.

- Finally, we add labels for x & y as well as a title to our plot so it's easy to read.
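The plotting steps above might look like the sketch below with ggplot2. The column names and toy numbers are assumptions, not the article's actual data:

```r
library(ggplot2)

# Toy stand-in for training_set
training_set <- data.frame(
  AvgPoints = c(33.8, 26.3, 27.6, 16.5, 22.1, 19.4),
  Wins      = c(11, 13, 14, 1, 7, 5)
)
NFL_model <- lm(Wins ~ AvgPoints, data = training_set)

# Red points: the actual training data; blue line: the model's predictions
p <- ggplot() +
  geom_point(aes(x = training_set$AvgPoints, y = training_set$Wins),
             colour = "red") +
  geom_line(aes(x = training_set$AvgPoints,
                y = predict(NFL_model, newdata = training_set)),
            colour = "blue") +
  xlab("Average points per game") +
  ylab("Wins in the season") +
  ggtitle("Wins vs. Average Points Per Game (Training Set)")
p
```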

Now let's see how our system performed on the training set.

- Training Set showing our prediction line through the data

Not too bad. We can see our system has found pretty good coefficients on our training data and was able to generate a clear trend line for us to see. Let's get a quick summary of our newly trained model. In R, this is pretty easy if we use the summary function.

- Three numbers of interest to us:

- p-value: something you want to be low, because a low p-value indicates the relationship between our two variables is statistically significant rather than due to chance. In this case, our p-value is just above zero, indicating a strong relationship.

- Residual Standard Error, AKA RSE: the average amount the system's predictions will deviate from the true regression line. In this case, our system has an RSE of 2.717 wins. The smaller this number is, the better our predictions.

- Finally, R-squared expresses how much of the variation in 'y' the model explains, on a scale from 0 to 1. This gives us an idea of how good our predictions will be in simpler terms. In our case, our system is at 0.3871, meaning points per game explains roughly 39% of the variation in wins, which is a solid showing for a model with a single predictor.
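All three of those numbers can be pulled straight out of the summary object. A sketch on toy stand-in data (so the values here won't match the 2.717 and 0.3871 from the real model):

```r
# Toy model standing in for NFL_model
training_set <- data.frame(
  AvgPoints = c(33.8, 26.3, 27.6, 16.5, 22.1, 19.4),
  Wins      = c(11, 13, 14, 1, 7, 5)
)
NFL_model <- lm(Wins ~ AvgPoints, data = training_set)
s <- summary(NFL_model)

s$coefficients["AvgPoints", "Pr(>|t|)"]  # p-value for the slope
s$sigma                                  # residual standard error (RSE)
s$r.squared                              # R-squared
```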

This is only half the battle though. We still need to see how the model performs on data it wasn't trained on. Initially, we reserved about 20% of the data for testing purposes. Let's see how we did on that last 20% of test data.

- Test 1 Performance

Looks like our model did pretty well with this dataset as well. Notice how its line tracks data points it never saw during training. That's pretty impressive! Let's really try to trip it up by showing our model data from 2015's NFL season. Let's see how it does.

- Test 2

Above you can see how it performed on the 2015 Regular season data set. We are still getting our consistent line even though the data is completely different! Let's see what happens when we give our system 2014 data.

- Test 3

Different data, same trend line.

## THE PREDICTION

Our model has seen data from 3 separate NFL seasons, and now that these observations have been made we can start seeing some consistencies. The first thing we know is that there is a consistent relationship between a team's average points per game and their number of wins in a season. This is probably a no-brainer for any offensive coordinator, but this does serve as a mathematical explanation. Notice the line has an upwards slope in every single data set.

We can also make some predictions based on this data. For example, teams that average 30 or more points a game can expect to win more than 12 games a season. This prediction is based on the data we have over the past 3 seasons, utilizing the two variables average points per game and number of wins in the season. On the opposite side of the spectrum, teams should not expect to win more than 4 games if they can't average more than 15 points per game. If we fed this model data from every NFL season ever played, it would likely give us even more reliable estimates of how a team's scoring average translates into wins. This is the beauty of Machine Learning and Linear Regression.
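Predictions like those translate directly into predict calls. The model below is trained on toy stand-in data, so its outputs illustrate the mechanics rather than reproduce the article's three-season figures:

```r
# Toy model standing in for the model trained on the full seasons
training_set <- data.frame(
  AvgPoints = c(33.8, 26.3, 27.6, 16.5, 22.1, 19.4),
  Wins      = c(11, 13, 14, 1, 7, 5)
)
NFL_model <- lm(Wins ~ AvgPoints, data = training_set)

# Expected wins for a 30 ppg offense vs. a 15 ppg offense
predict(NFL_model, newdata = data.frame(AvgPoints = c(30, 15)))
```

Because the fitted slope is positive, the 30 ppg offense always comes out with the higher expected win total.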

## CONCLUSION

The Machine Learning Age is here. You are likely to hear about breakthroughs happening at breakneck speed over the next few months and years as these powerful algorithms are utilized in all kinds of fields. There is so much left to be discovered in all the data around us. All it takes is knowing these algorithms well and applying them to the right situations. It's very important that we start putting these techniques to good use, as they may well be the difference between one company's operating efficiency and survival and another's. An organization that is not making good use of its data doesn't stand a chance against one that is gathering insights and leveraging algorithms to accomplish its goals.

*NFL data source: Pro Football Statistics