
Introduction to GLMs

Ask any member of the general public what they think “Statistics” is, and most will probably arrive at the iconic “line of best fit,” where a straight line is sandwiched between a scattering of data points, showing the general trend between them.

[Figure: visualization of a line of best fit]

This “line of best fit” is the simplest case of an important class of statistical models called generalized linear models (GLMs). You probably know them under the more familiar name regression models; the distinction really isn’t important for our purposes.

These models seek to capture a relationship between a dependent variable, $y$ (often called the target), and one or more independent variables, referred to collectively as $\vec{x}$ and individually as $x_1, x_2, \ldots$, etc. GLMs assume that there exists some link function, denoted $g$, that relates $y$ to the values in $\vec{x}$. Given the values $\vec{x}$ for a single data point, a good GLM should satisfy

$$y \approx g\left( \vec{x} \right)$$

Because the association can never be 100% exact, due to errors in the data-gathering process (excluded data, missing features, measurement errors, etc.), GLMs account for this by including an error term, denoted $\varepsilon$:

$$y = g\left( \vec{x} \right) + \varepsilon$$

In our line-of-best-fit example above, the link function is the identity function $g(x) = x$ (note we use $x$ instead of $\vec{x}$ since we are only considering a single independent variable in that example).
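To make the example concrete, here is a minimal sketch, assuming NumPy is available and using made-up data, of fitting a line of best fit — a GLM with an identity link and a single independent variable:

```python
import numpy as np

# Made-up example data: y is roughly 2*x + 1 plus noise (the epsilon term)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(scale=1.0, size=x.shape)

# Fit a degree-1 polynomial (a straight line) by least squares
slope, intercept = np.polyfit(x, y, deg=1)

# Predictions from the fitted line: the identity link applied to x times its parameter
y_hat = slope * x + intercept
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

This is only an illustration; the variable names and the simulated data are invented for the sketch.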

Fitting GLMs

It is very unlikely that the association between $y$ and $\vec{x}$ is completely one-to-one. A small increase in some $x_i \in \vec{x}$ may lead to a disproportionately large increase in $y$, or the opposite may hold, where a large change in $x_i$ leads to only a small change in $y$. Additionally, not all $x_i$ will influence $y$ to the same degree. Because of this, GLMs have parameters that are tuned to each individual $x_i \in \vec{x}$, denoted collectively as $\vec{\beta}$ and individually as $\beta_i$. So a GLM takes the form

$$y = g\left( \vec{x} \cdot \vec{\beta} \right) + \varepsilon$$

Note that $( \cdot )$ denotes the dot product here.

In practice, GLMs are not fit on a single data point $\vec{x}$ and $y$. Instead, we have a data set $X$ and a list of targets, called $y$, containing one target for each $\vec{x}_i \in X$. Each GLM then has a loss or score function, $\mathcal{L}\left( X, y, \vec{\beta} \right)$, which can be minimized with respect to $\vec{\beta}$ through various techniques like ordinary least squares or gradient descent. The $\beta_i$s that minimize $\mathcal{L}$ are the parameters of the GLM.
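For concreteness, here is a minimal sketch of what minimizing $\mathcal{L}$ by gradient descent might look like for the identity-link case with a squared-error loss. The data, learning rate, and iteration count are all made up for illustration; in practice you would usually call a library routine instead:

```python
import numpy as np

# Made-up data set X (100 points, 3 features) and targets y
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_beta = np.array([1.5, -2.0, 0.5])
y = X @ true_beta + rng.normal(scale=0.1, size=100)  # identity link plus noise

# Squared-error loss L(X, y, beta) and its gradient with respect to beta
def loss(beta):
    residuals = y - X @ beta
    return np.mean(residuals ** 2)

def gradient(beta):
    return -2 * X.T @ (y - X @ beta) / len(y)

# Plain gradient descent on beta
beta = np.zeros(3)
for _ in range(2000):
    beta -= 0.05 * gradient(beta)

print("estimated beta:", np.round(beta, 2))
print("final loss:", round(loss(beta), 4))
```

The recovered $\vec{\beta}$ should land close to the values used to simulate the data, which is all this sketch is meant to show.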

Small note: many GLMs include an “intercept” term, denoted $\beta_0$, that helps align the model correctly. In the linear regression case, $\beta_0$ shifts the line of best fit upwards or downwards. Because accounting for $\beta_0$ adds many annoying details to a lot of equations, we will ignore it for the sake of simplicity.

Why Use GLMs

GLMs have historically been important because of their simplicity. The simplest GLM, simple linear regression, predates the modern computer by about 100 years! In the era before the 2000s, when computers either didn’t exist or were too slow to meet the computational demands, GLMs offered a way to make statistical inferences in a reasonable amount of time. Just remember, while you are waiting for your neural net to finish training, you at least aren’t calculating standard deviations by hand like we used to!

In an age of random forests, kernel methods, and neural nets, one might ask why we are still using these old-school GLMs. There are several reasons:

  1. Quick Deployment and Training: Even on very large data sets, GLMs typically train in well under a minute, and they can be deployed very quickly because the fitted model is just a small, fixed set of parameters that is easy to export.

  2. Computational Time: Sometimes we need a prediction quickly (within milliseconds) and can’t afford to wait on a neural net or parallelize our code. For example, we might need to find out how much a user on a streaming service would enjoy a specific show based on their past ratings. In this case, we want the recommendation very quickly to promote good UX. GLMs offer this speed.

  3. Explainability: GLMs have a standardized output in both base R and Python’s statsmodels package, and that output can easily be digested by anyone who understands the basic principles of regression (see the sketch after this list). This is especially helpful when we want to perform feature selection and find out which features are most important and significant for a prediction. An example of this would be when you need to prepare a model that will be shown in court or in some other official government setting, like patent filings.

  4. Legacy Fields: Fields like finance, defense, international development, and especially medicine are very slow to adopt new statistical models because of their strong regulatory environments and high-risk nature. Because of this, a data scientist working in these fields may find themselves needing to “restrict” themselves to older models when performing their analysis.
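To illustrate the explainability point above, here is a minimal sketch, using Python’s statsmodels with made-up data, of the kind of standardized summary a GLM fit produces; coefficient estimates, standard errors, and p-values come out of the box:

```python
import numpy as np
import statsmodels.api as sm

# Made-up data: two features, one of which (the second) barely matters
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] + 0.01 * X[:, 1] + rng.normal(size=200)

# Add the intercept column and fit ordinary least squares
X_with_const = sm.add_constant(X)
model = sm.OLS(y, X_with_const).fit()

# The summary table reports coefficients, standard errors, and p-values,
# which is what makes feature importance and significance easy to communicate
print(model.summary())
```

The simulated data and variable names here are invented; the point is only that the summary table is standardized and readable.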

For these reasons, GLMs remain an essential part of every data scientist’s toolkit. Companies like Amazon, Walmart, Uber, and Aetna require a good understanding of GLMs from any potential employee!

Still, it’s important to keep in mind the limitations of GLMs:

  1. Shallowness: GLMs are among the shallowest learning methods available. This means they require more careful feature selection than other shallow learning methods like SVMs and Random Forests. When feature selection is hard, either because of the nature of the task (as with geospatial, photographic, or textual data) or because of the sheer number of features in a dataset, GLMs tend to do very poorly.

  2. Requires Specification of Relationships Between Variables: GLMs require pre-specification of the type of relationship between the independent and dependent variables (for example, linear, exponential, or quadratic). This leads to GLMs having less robust decision boundaries than other shallow learning methods like Support Vector Machines and Random Forests. Thus, if you don’t have at least a somewhat decent intuition about the form of the relationship between your variables, GLMs may not be appropriate for the task.

  3. Over-fitting: GLMs are prone to over-fitting, meaning they can perform poorly on data they’ve never seen before. If your task involves extrapolation to scenarios not accounted for in your training data (for example, an exceptionally high- or low-traffic day in a food delivery time estimator), GLMs may not be well suited for the task unless the parameters are retrained regularly. This also applies to tasks where, as time goes on, you expect to add features to your model that were not in the original training set.

  4. Strong Assumptions on Data: GLMs are among the most parametric models available, meaning they make a variety of assumptions about the data set they are trained on. If these assumptions don’t hold for your dataset, the resulting regression model will perform very poorly when deployed. This is particularly true if your data set is small (fewer than 1,000 data points).

Course Outline

While there are dozens of GLMs, we will be focusing on three main types:

  1. Traditional Linear and Non-Linear Regression

  2. Logistic Regression (Regression with categorical dependent variables)

  3. Time Series Models (Regression with a temporal component)
