What is Regression? (And a tutorial)

Aryan Jha
5 min read · Nov 11, 2021

If you’re into machine learning, you’ve probably heard the word “regression”. However, some articles or videos don’t bother to explain what it means. That’s what I hope to do in this article.

Basically, regression is a method that models the relationship between a dependent variable and one or more independent variables. Two of the most common types are linear regression and logistic regression.

Linear regression is a method that fits a line of best fit between one or more independent variables and one dependent variable. Simple linear regression is a linear regression method with only one independent variable and one dependent variable. Multiple linear regression is a method with multiple independent variables and one dependent variable. For example, the dependent variable could be the price of a car, and the independent variables could be the model, the make, etc. Linear regression can be used when you need to predict a value that is not categorical (like the price of a house).

Example of a linear regression line of best fit
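To make the “line of best fit” idea a bit more concrete, here’s a minimal sketch of how the slope and intercept can be worked out by hand for simple linear regression using the least-squares formulas. The numbers are hypothetical (I picked them so they sit exactly on the line y = 2x + 1), and the variable names are just for illustration.

```python
import numpy as np

# Hypothetical data that lies exactly on the line y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Least-squares slope: how x and y vary together, divided by how much x varies
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# The best-fit line always passes through the point (mean of x, mean of y)
intercept = y.mean() - slope * x.mean()

print(slope, intercept)
```

For this data the slope comes out as 2 and the intercept as 1, which matches the line the points were taken from. With real, noisy data the line won’t pass through every point, but the same formulas give the line with the smallest total squared error.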

Logistic regression is a method that is used when the value you want to predict is categorical. An example of that would be predicting if a product is a good deal or a bad deal. If there are only two possibilities for the output (like yes or no), then it would be binary logistic regression. If there were more than two unordered categories for the output, it would be multinomial logistic regression. If there were more than two ordered categories, then it would be ordinal logistic regression. An example of ordered categories would be medals in the Olympics, while an example of unordered categories would be the colour of pencils.

Using them in a program

So, now you know the difference between linear regression and logistic regression. How about we do some coding?

Using the scikit-learn package, we can actually do that! In case you don’t have it already installed, go to a terminal and type:

pip install scikit-learn

I will be doing my work in a Jupyter notebook, which you can get by installing JupyterLab:

pip install jupyterlab

The files will be available here.

I don’t want to make this article too long, so I’ll only explain how to make a linear regression model. Don’t worry, an article will be coming out soon about how to make a logistic regression model with scikit-learn.

First, we need to import the LinearRegression class from scikit-learn’s linear_model module.

from sklearn.linear_model import LinearRegression

Then, we will need to import pandas and numpy. These are two packages that will help us while creating the dataset.

import pandas as pd
import numpy as np

Next, we need to define the dataset. This isn’t a real dataset; it’s just one I made up. I chose to make it about the salary of workers and the experience they have in that field (in years). In each row, the first number is the salary and the second is the years of experience. We will need two datasets: one for training, and one for testing. The training dataset is what the model will learn from, while the testing dataset is what we will test it on. Here’s the data for the training dataset:

# Each row is [salary, years of experience]
dataset = pd.DataFrame([[100000, 15], [95000, 13], [90000, 10], [85000, 8], [70000, 7], [60000, 3]])

While here’s the data for the testing dataset:

dataset_test = pd.DataFrame([[97750, 14], [87750, 9], [75000, 7]])
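As a side note, with real data you usually wouldn’t type out the training and testing tables separately. A common alternative is to start with one table and split it randomly. Here’s a minimal sketch of that, assuming scikit-learn’s train_test_split helper (which I don’t use in the rest of this tutorial). The column names and the all_rows variable are hypothetical and only there for readability.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# All nine made-up rows from the two tables above, combined into one DataFrame
all_rows = pd.DataFrame(
    [[100000, 15], [95000, 13], [90000, 10], [85000, 8], [70000, 7], [60000, 3],
     [97750, 14], [87750, 9], [75000, 7]],
    columns=["salary", "years"],  # hypothetical column names
)

# Hold back 3 rows for testing; random_state makes the shuffle repeatable
train, test = train_test_split(all_rows, test_size=3, random_state=0)

print(len(train), len(test))
```

This leaves six rows for training and three for testing, the same sizes as the hand-written tables. I’ll stick with the hand-written ones below so the results are easy to follow.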

Now, we need to separate the dataset into the dependent variable and the independent variable. If you recall, the dependent variable is the variable that changes based on the independent variable or variables. In our case, we only have one independent variable.

dependent = dataset.iloc[:,[0]]
independent = dataset.iloc[:,[1]]
dependent_test = dataset_test.iloc[:,[0]]
independent_test = dataset_test.iloc[:,[1]]

What .iloc does in this case is select the data at the positions that we provide. The : means “every row”, and the 0 and 1 pick the first and second columns, because in Python, indexes start at 0. The reason why we have another pair of [] around the 0 and 1 is that selecting a single column with a plain number would automatically give us a one-dimensional Series. We want a two-dimensional DataFrame instead, because the model expects its input data to be two-dimensional.
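If you want to see the difference the extra brackets make, here’s a small sketch you can run on its own. It uses the first three rows of the made-up training data; the names as_series and as_frame are just for illustration.

```python
import pandas as pd

dataset = pd.DataFrame([[100000, 15], [95000, 13], [90000, 10]])

as_series = dataset.iloc[:, 0]    # plain integer -> one-dimensional Series
as_frame = dataset.iloc[:, [0]]   # list of integers -> two-dimensional DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

The Series has a shape of (3,), while the DataFrame has a shape of (3, 1): the same numbers, but with an explicit column dimension.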

regressor = LinearRegression()

The line above just creates the model, using the LinearRegression class we imported earlier.

The next line trains the model on the dataset, using the independent and dependent variables we specified earlier.

regressor.fit(independent, dependent)
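If you’re curious what the model actually learned, you can look at the slope and intercept of the fitted line. Here’s a self-contained sketch that repeats the steps so far and then reads scikit-learn’s coef_ and intercept_ attributes. The slope and intercept variable names are my own.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Same made-up training data as above: [salary, years of experience]
dataset = pd.DataFrame([[100000, 15], [95000, 13], [90000, 10], [85000, 8], [70000, 7], [60000, 3]])
dependent = dataset.iloc[:, [0]]
independent = dataset.iloc[:, [1]]

regressor = LinearRegression()
regressor.fit(independent, dependent)

# Because the dependent variable was passed as a two-dimensional DataFrame,
# coef_ has shape (1, 1) and intercept_ has shape (1,)
slope = regressor.coef_[0][0]
intercept = regressor.intercept_[0]

print(f"predicted salary = {intercept:.0f} + {slope:.0f} * years of experience")
```

The slope tells you how much the predicted salary goes up for each extra year of experience, and the intercept is the predicted salary at zero years.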

Now, you’re done with the model! Surprisingly simple, right? As an extra, I’ll show you how to visualize the model and test it using the testing dataset.

First, we’ll want to import matplotlib so we can create our graphs.

import matplotlib.pyplot as plt

Then, we’ll make a scatter plot using the test variables so we can see where the real values lie on the graph. The independent variable is the x value, while the dependent variable is the y value. We will make this red, so it can contrast against the other elements on the graph.

plt.scatter(independent_test, dependent_test, color='red')

Now, we’ll make a line representing our model’s prediction. The independent variable is the x value, while the model’s prediction is the y value. The regressor.predict() part asks the model for its predictions. The reason why we put independent_test in the brackets instead of dependent_test is that the model is supposed to use the independent variable to find the dependent variable. We’ll make this line blue.

plt.plot(independent_test, regressor.predict(independent_test), color='blue')

This next command displays the graph. After it runs, any further plotting commands go onto a new graph instead of this one, which is useful when you have multiple graphs in the same Python file.

plt.show()

The graph should look something like this:

Not bad for a model trained and tested on made-up data.
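If you’d like a number to go with the picture, here’s a minimal sketch that scores the model on the testing dataset. It assumes two helpers from sklearn.metrics, mean_absolute_error and r2_score, which I don’t use anywhere else in this tutorial, and it repeats the earlier steps so it can run on its own.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Same made-up data as above: [salary, years of experience]
dataset = pd.DataFrame([[100000, 15], [95000, 13], [90000, 10], [85000, 8], [70000, 7], [60000, 3]])
dataset_test = pd.DataFrame([[97750, 14], [87750, 9], [75000, 7]])

dependent = dataset.iloc[:, [0]]
independent = dataset.iloc[:, [1]]
dependent_test = dataset_test.iloc[:, [0]]
independent_test = dataset_test.iloc[:, [1]]

regressor = LinearRegression()
regressor.fit(independent, dependent)

# Compare the model's predictions with the real salaries in the testing dataset
predictions = regressor.predict(independent_test)
mae = mean_absolute_error(dependent_test, predictions)  # average error, in salary units
r2 = r2_score(dependent_test, predictions)              # 1.0 would be a perfect fit

print(f"Mean absolute error: {mae:.2f}")
print(f"R^2 score: {r2:.3f}")
```

The mean absolute error is how far off the predictions are on average, and the R² score gets closer to 1 the better the line matches the test points. With only three made-up test rows, these numbers are more of a demonstration than a real evaluation.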

And you’re done! You made a linear regression model, tested it, and visualized it using a graph.

Some next steps would be:

  • Using real datasets. I would recommend using Kaggle.com to find datasets.
  • Using a more complex package like Keras to make your model.
  • Experimenting with other types of models like image classification models.

If some of those sound a bit intimidating, don’t worry, because I’ll have articles coming out explaining how to do those.

Thanks for reading! If you have any questions or feedback, feel free to message me on LinkedIn.
