What You Can Learn from a Single Machine Learning Course

This is a republication of a post I originally wrote on November 10, 2022 over on my dedicated blog site that I am rolling into wonksecurity.com.

Disclaimer: This post discusses Andrew Ng’s newly updated Machine Learning Specialization on Coursera. Other than being a student in the specialization on Coursera, I am not affiliated in any way with Coursera, Ng, or DeepLearning.AI. These are entirely my own unsolicited views and opinions.

Recently, I had the pleasure of enrolling in Andrew Ng’s Machine Learning Specialization on Coursera, and I just completed the first of its three courses. Instead of providing a standard review of the first course, what I would like to do in this post is show what kind of powerful tools can be learned from a single very well-taught, three-week-long course.

I believe there are many people who could benefit professionally from grasping the essentials of machine learning. However, some of those people may be reluctant to get into the subject due to how technical it is. If that describes you, my hope is that by the end of this post you will: 1) have a taste of the analytical and predictive power afforded by machine learning and 2) believe that you too can learn how to grasp and utilize that power.

About the Machine Learning Specialization

The Machine Learning Specialization courses are produced by Andrew Ng, one of the superstars of the machine learning field, and his technology education company DeepLearning.AI. When the original version of the specialization first came out in 2012, it was an immediate hit and helped millions of analysts, data scientists, and machine learning engineers get their start in the field. Earlier this year, Ng unveiled an updated version of the specialization, complete with lessons on some of the newest advances in machine learning since the early 2010s.

Following the recommended learning pace, I recently completed the first course, titled “Supervised Machine Learning: Regression and Classification,” in just under three weeks. In those three weeks, I learned, first and foremost, what machine learning is, the different types and use cases of machine learning, and how to implement one of the most popular categories of machine learning algorithms: supervised learning.

Machine Learning Prerequisites

Now, before I get into the specific concepts that you can learn in the first course, I want to touch on the prerequisites for the specialization. All you need to succeed in the class is a basic understanding of algebra and some minimal experience with Python (such as data structures and NumPy). Andrew Ng does an amazing job of explaining all the rest of the mathematics you could possibly need. I can attest to this personally, since I’ve never taken a single course in calculus or linear algebra. Despite this, after Dr. Ng explained the intuition behind how an algorithm works, the big and messy equations on the screen were no longer confusing or scary.

Furthermore, if all you’re interested in is having an understanding of the concepts so you can better understand what people mean when they talk about machine learning, you don’t even really need Python knowledge. Ng explains regression and classification in plain English, and no code is required outside of the optional labs. If you do learn to code, however, you can implement the complicated-looking calculus equations in Python and let the computer solve them for you in seconds. That means no need to work through partial derivatives, polynomials, or dot products by hand like in math class.

So What Can You Learn?

The supervised machine learning course, as its name suggests, focuses on two of the most powerful and popular kinds of learning algorithms in the machine learning field today: regression and classification. At their heart, learning algorithms create models that predict a certain output given a certain input. In supervised learning, the algorithm is given data, referred to as training data, that has the proper answers attached. That is to say, each training example specifies both an input and an output, known as the x value and the y value, respectively.

The example that recurs throughout the course is house sales, which I will summarize here in an abridged and relatively math-free way. In this example, the x, or input, values might be some features about a collection of houses that have already sold, and the y, or output, values might be the prices those houses sold for. Based on this training data, the algorithm fits a model that allows it to predict the price a new house might sell for based on the new house’s features. This particular example uses an algorithm known as linear regression (which many analysts and researchers also use to model relationships between variables).

Linear Regression

In the context of machine learning, a linear regression algorithm fits a straight line to a series of x and y data points. Using the parameters of the fitted model, a person can input a value x and get a prediction of what y would be. In the housing example, this x value might be square footage, and the y value would be price. Now this would obviously make for a very simplistic and likely inaccurate model. After all, numerous factors besides square footage influence house price.
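To make that concrete, here is a minimal sketch of what a fitted single-variable model does when asked for a prediction. The values of w and b are made up for illustration (they are assumptions, not numbers from the course), and the function name is hypothetical.

```python
def predict_price(square_feet, w, b):
    """Single-variable linear model: price = w * square_feet + b."""
    return w * square_feet + b

# Hypothetical parameters: $200 per square foot plus a $50,000 base price.
w = 200.0
b = 50_000.0

price = predict_price(1_500, w, b)
print(price)  # 350000.0
```

Everything the model "knows" lives in those two numbers, which is why so much of the course is about finding good values for them.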

Multiple Linear Regression

That is where multiple variable linear regression, sometimes shortened to multiple linear regression, comes into play. Multiple linear regression works in the same way that single variable linear regression works by mapping the input x to the output y; however, multiple linear regression allows for multiple x’s to be used. Now, instead of just being limited to square footage as the only x variable, it is possible to assign square footage to x₁ while including the number of bedrooms as x₂, the number of bathrooms as x₃, the number of floors as x₄, and so on. Multiple linear regression allows for increasingly complex models that more accurately fit the data, which in turn allows for more accurate predictions.
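As a rough sketch of the same idea with several features, the snippet below uses NumPy’s dot product to combine each feature with its own weight. The weights, the intercept, and the example house are all invented for illustration.

```python
import numpy as np

def predict_price_multi(x, w, b):
    """Multiple linear regression: dot product of weights and features, plus b."""
    return float(np.dot(w, x) + b)

# Hypothetical weights for: square feet, bedrooms, bathrooms, floors.
w = np.array([150.0, 10_000.0, 5_000.0, 2_000.0])
b = 20_000.0

# One house: 1,500 sq ft, 3 bedrooms, 2 bathrooms, 1 floor.
house = np.array([1_500.0, 3.0, 2.0, 1.0])
print(predict_price_multi(house, w, b))  # 287000.0
```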

The representation below depicts a multiple linear regression model having its parameters adjusted to fit the data. As you can see, instead of being a line on a flat surface, the model is now a plane in three-dimensional space because another variable was added to the simple linear regression model. The inclusion of an x₂ changes the graph by adding a z-axis.

gif showing a multiple linear regression model in action — *Graphical representation of a multiple linear regression model being adjusted to fit the data. By Cfbaf, Public domain, via Wikimedia Commons.*

Polynomial Regression

There are certain times where a straight line just does not fully represent the data, such as in areas where the rise in price of a home might taper off after going above a certain square footage or where the price of a home might increase at an exponential rate as you move closer to a busy city. For that, polynomial regression, a subset of linear regression, is used. Polynomial regression allows for the inclusion of non-linear features, created through exponentiation, that let the line curve in ways that more accurately fit the data. Instead of just taking features x₁, x₂, and x₃, polynomial regression allows x₁² or even x₁³ to be included in the model alongside the original features to improve the accuracy of the model when a straight line does not fit the data well.
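One way to picture this, sketched below under my own assumptions, is that the squared and cubed terms are simply extra columns added to the data before an ordinary linear model is fit. The helper name is hypothetical, and expressing square footage in thousands is just a choice to keep the powers small.

```python
import numpy as np

def add_polynomial_features(x):
    """Stack x, x^2, and x^3 as columns so a linear model can fit a curve."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x, x**2, x**3])

# Square footage in thousands keeps the powers at a manageable scale.
sqft_thousands = np.array([1.0, 2.0, 3.0])
features = add_polynomial_features(sqft_thousands)
print(features)
# [[ 1.  1.  1.]
#  [ 2.  4.  8.]
#  [ 3.  9. 27.]]
```

The model is still linear in its parameters, since each column gets its own weight, which is why polynomial regression counts as a form of linear regression.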

What Is Linear Regression Used For?

Linear regression is an incredibly powerful way to produce models that can let a person understand what input variables most heavily influence the output variable. Even more importantly, though, linear regression allows a person to make accurate predictions based on new data. As IBM puts it, linear regression is “a proven way to scientifically and reliably predict the future.” While I would caveat that there are limitations to such predictions, namely that certain assumptions must be met, there is no doubt that linear regression is highly important across a wide range of use cases. This article by Statology has some great additional examples of how linear regression is used in the real world to solve business and scientific problems.

Logistic Regression

While the previous regression models were all variations of the same linear regression algorithm, logistic regression is different in that it is used for classification. Unlike linear regression, where y can take on a virtually unlimited range of values, logistic regression uses binary values for y, which take the form of either 0 or 1. The logistic regression algorithm is very popular for classification, which is a way of sorting data points into one of two categories, a positive case and a negative case.

The resulting model produces a probability value that indicates whether a given data point belongs in one basket or another, based on a pre-determined decision boundary. This probability value allows the model to ultimately determine how a certain data point should be coded. For example, based on input values x, is a tumor malignant or benign, is a transaction legitimate or fraudulent, or does an image depict a cat or a dog? However, it is important to note that in each case, a different decision boundary would likely be pre-determined according to the problem at hand.

To elaborate on decision boundary selection, if we were predicting the malignancy of tumors, we might set our decision boundary to over-detect because the cost of being wrong could kill people. Thus, we might code a particular data point as 1, or malignant, if the model outputs a probability value of 0.35 or higher, meaning the model thinks there is at least a 35 percent chance the particular data point is malignant. Conversely, we might use a more neutral boundary for lower-stakes examples, such as classifying pictures of cats or dogs. For this, an output of 0.5 or above would classify the image as 1, corresponding to dogs, and an output under 0.5 would classify it as 0, corresponding to cats.
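The effect of moving that cutoff can be sketched in a few lines. This is an illustrative example only: the parameter values and the single input are made up, and the function names are my own rather than anything from the course labs.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def classify(x, w, b, threshold=0.5):
    """Return (probability, label) for one example with feature vector x."""
    probability = float(sigmoid(np.dot(w, x) + b))
    label = 1 if probability >= threshold else 0
    return probability, label

# Hypothetical parameters and a single-feature example.
w = np.array([1.0])
b = -1.0
x = np.array([0.5])

# z = 1.0 * 0.5 - 1.0 = -0.5, so the probability is roughly 0.378.
print(classify(x, w, b, threshold=0.5))   # labeled 0 at the 0.5 cutoff
print(classify(x, w, b, threshold=0.35))  # labeled 1 at the 0.35 cutoff
```

The model’s output is identical in both calls. Only the cutoff changes, which is why choosing it is a decision about the problem rather than about the math.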

graphical depiction of a sigmoid function, which is a key component of logistic regression — *Image of a logistic curve known as a sigmoid function. By Qef, Public domain, via Wikimedia Commons*

What Is Logistic Regression Used For?

Like linear regression, logistic regression is an incredibly powerful way to make predictions, though it differs in that its predictions are categorical rather than numeric. One of the most popular fields for the application of logistic regression is medicine. In addition to helping doctors detect malignant tumors earlier, logistic regression models can be used to predict a patient’s risk of having a heart attack, for example. The banking industry also makes heavy use of logistic regression to predict if a transaction is fraudulent or to predict whether a person’s credit score will go up or down in the future. Another common implementation is the labeling of emails as spam or not spam. A great deal of real-world problems can be addressed through well-designed logistic regression models.

Cost Functions

In both regression and classification, the algorithm determines the optimal model by working to minimize a cost that depends on the parameters w and b. The best way to think of the parameters w and b is with the equation of a line from algebra: y = mx + b, where m is the slope of the line and b is the point where the line crosses the y-axis (the value of y when x is 0), known as the y-intercept. In the case of linear regression, imagine that w is the m or slope and b remains the y-intercept. The goal of linear regression is to get to a line that fits the data as accurately as possible, which means finding the slope and y-intercept that pass through the data as closely as possible. This is accomplished using a cost function and gradient descent, which will be covered in the next section. Similarly, a cost function is used for logistic regression to determine the most accurate decision boundary, albeit with a different cost function formula that includes some additional steps discussed in the course.

As the algorithm tries to find the optimal values to fit the data, a cost function is used to calculate how inaccurate the model is by comparing the model’s predictions with the actual data. This cost function is a mathematical equation that produces a single value for how well the model fits the data. The worse the model fits the data, the higher the cost. In regression analysis like that used by the researchers mentioned above, the individual differences between predictions and observations are called residuals, and the cost plays a role similar to summary error measures such as mean squared error. It can be thought of as the average of how far off the line is from each observation in the dataset.
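Here is a minimal sketch of one such cost function for single-variable linear regression, the squared error cost. I am assuming the common convention of dividing by 2m (twice the number of examples), and the toy data is invented so that the right answer is obvious.

```python
import numpy as np

def compute_cost(x, y, w, b):
    """Squared error cost: the mean squared prediction error, halved."""
    m = x.shape[0]
    predictions = w * x + b
    errors = predictions - y
    return float(np.sum(errors ** 2) / (2 * m))

# Toy data generated by y = 2x, so w = 2 and b = 0 fit it perfectly.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

print(compute_cost(x, y, w=2.0, b=0.0))  # 0.0, a perfect fit
print(compute_cost(x, y, w=1.0, b=0.0))  # about 2.33, a worse fit
```

Squaring the errors keeps positive and negative misses from canceling out and penalizes large misses more heavily than small ones.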

Gradient Descent

However, calculating a bunch of different possible lines to find the one with the lowest cost would be highly inefficient. Thus, a very powerful algorithm known as gradient descent is employed. Starting from some initial values of w and b, usually set to 0, this algorithm updates both parameters simultaneously on each step, using a pre-set learning rate known as alpha, denoted by the lower-case Greek letter α, to incrementally try different models until it reaches the lowest cost it can find. The learning rate, α, dictates how big these increments are, thus affecting how long it may take the algorithm to arrive at an answer. Across multiple iterations of gradient descent, the cost becomes smaller and smaller until it levels off at a minimum and further updates barely change w and b, which is known as convergence. The minimum cost is not necessarily zero, since real data rarely falls exactly on a line. The graphic below shows a visual representation of the rapidly dissolving error in the function F(X).
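The loop below is a simplified sketch of gradient descent for single-variable linear regression with a squared error cost. The learning rate, iteration count, and toy data are assumptions chosen so the example converges quickly. Real datasets usually need tuning and feature scaling.

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, iterations=2000):
    """Fit y ≈ w * x + b by repeatedly stepping downhill on the cost."""
    m = x.shape[0]
    w, b = 0.0, 0.0  # initial guesses
    for _ in range(iterations):
        errors = (w * x + b) - y
        # Partial derivatives of the squared error cost with respect to w and b.
        dj_dw = np.sum(errors * x) / m
        dj_db = np.sum(errors) / m
        # Simultaneous update: both gradients were computed from the old w and b.
        w, b = w - alpha * dj_dw, b - alpha * dj_db
    return float(w), float(b)

# Toy data generated by y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

w, b = gradient_descent(x, y)
print(round(w, 3), round(b, 3))  # close to 2.0 and 1.0
```

If α is set too high for the data, the updates can overshoot and the cost can grow instead of shrink, which is one reason the learning rate has to be chosen with care.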

Once gradient descent does its work, it outputs values of the parameters w and b that enable the model to, hopefully, fit the data. Yet these values are not always perfect. The course discusses some of the challenges of finding a model that is a good fit, such as problems with underfitting, overfitting, and problems where gradient descent may not converge on the best values of w and b.

A simplistic graphic showing the perils of over and underfitting a machine learning model to data. — On the left, we have an example of a model that is underfit. In the middle, the model is a good fit for the data. On the right, the model is overfit and curves wildly trying to encompass as many points as possible. Image my own.

Start Learning About Machine Learning

As you might’ve gleaned from this last paragraph, implementing regression and classification algorithms is as much art as it is science. The features and parameters that go into the model have to be carefully selected, engineered, and adjusted to fit the problem at hand. However, Dr. Ng also dives into this and provides multiple best practices and troubleshooting tips to ensure that the model produced is the best that it can be.

With that in mind, the material that I’ve covered in this blog post is just the tip of the iceberg in terms of what you can learn in just the first course in the Machine Learning Specialization on Coursera. I can’t recommend taking this course enough if you’re interested in using machine learning or even if you just want to have an understanding of the concepts that are increasingly shaping our daily lives. The specialization itself is also just the tip of the iceberg in terms of what is possible in the machine learning field, but these classes are a great place to start to get a firm grasp of the essential concepts that everything else in the field builds upon.