Linear Regression 101 in R

A guide to the fundamentals of linear regression in R and the tools you can use to apply it

Statistics
R
Linear Regression
Author

Jorge Roa

Published

December 10, 2023

Introduction


Future public policy makers must possess the skills to analyze and interpret data for different purposes. Quantitative methods cover the statistical and mathematical techniques used to analyze numerical data, and they are fundamental if we want to propose public policies based on scientific evidence. Linear regression, a core component of quantitative methods, is particularly significant because it helps us understand the relationship between variables; it is often used to analyze the impact of policy decisions and to forecast future trends. Although linear regression is a simple technique, having its fundamentals clearly explained is important for understanding more complex models and, hence, for using them more effectively.

Learning objectives


  • Follow a typical pipeline to preprocess and wrangle data so it is ready for analysis
  • Understand the fundamentals of linear regression
  • Learn how to interpret the results of a linear regression model
  • Learn which tools help visualize the results of a linear regression model and check the assumptions behind it
  • Understand the limitations of linear regression and when to use it

Linear Regression 101


What is linear regression?

Linear regression is a statistical technique that helps us understand the relationship between a dependent variable and one or more independent variables. The dependent variable (also known as the outcome or response variable) is the variable we want to predict or explain, while the independent variables (also known as explanatory or predictor variables) are the variables we use to predict or explain it. In the standard setup the dependent variable is continuous, while the independent variables can be continuous or categorical (Wooldridge 2012). In other words, we want to know how the dependent variable changes when the independent variables change. For example, suppose we want to know how the price of a house changes when the size of the house changes: the price is the dependent variable and the size is the independent variable. This is a simple example, but the main idea is that we want to predict a numeric response based on one or more numeric and/or categorical predictors (independent variables).

Source: Wong (2020)

In a nutshell, we want to find the line that best fits the data: the regression line. It is the straight line that minimizes the distance between the actual values (the values we observe in the data) and the predicted values (the values the line predicts); this distance is called the error. The “linear” refers to the model being linear in the parameters. The regression line is also known as the line of best fit or the least squares line.

The equation of a simple Ordinary Least Squares (OLS) regression, also known as the regression line or the line of best fit, is given by:

 

\[ y = \beta_0 + \beta_1x + \varepsilon \]

 

Where:

  • \(y\) represents the dependent variable.

  • \(x\) is the independent variable.

  • \(\beta_0\) is the intercept of the regression line.

  • \(\beta_1\) is the slope of the regression line, indicating the change in \(y\) for a one-unit change in \(x\).

  • \(\varepsilon\) is the error term, denoting the difference between the actual and predicted values.

 

This is a simple linear regression model because it has only one independent variable. Usually, trying to explain a dependent variable with a single independent variable is not enough, because other factors also affect it; this matters especially in the social sciences, where we often want to do causal inference. If we want to predict GDP growth, for example, we need to include other variables that affect it, such as inflation, unemployment, and interest rates; with a single regressor we will not be able to explain GDP growth very well. If we have more than one independent variable, we have a multiple linear regression model. The equation of a multiple linear regression model is given by:

 

\[ y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n + \varepsilon \]

 

Where:

  • \(y\) represents the dependent variable.

  • \(x_1, x_2, ..., x_n\) are the independent variables.

  • \(\beta_0\) is the intercept of the regression line.

  • \(\beta_1, \beta_2, ..., \beta_n\) are the slopes, indicating the change in \(y\) for a one-unit change in the corresponding \(x_1, x_2, ..., x_n\).

  • \(\varepsilon\) is the error term, denoting the difference between the actual and predicted values.

The multiple linear regression model is more “realistic” than the simple model because it allows us to explain the dependent variable with more than one independent variable, at the cost of being somewhat more complex to estimate and interpret.

If you want to understand the fundamentals of linear regression in a more interactive way, I recommend the following resource:

 

Source: Wilber (2022)

 

Assumptions of linear regression


Before we start to analyze the results of a linear regression model, we need to understand the assumptions behind it. If you want to use linear regression to analyze your data, you need to make sure your data meets these assumptions; if it does not, the interpretation of your results could be biased, which could lead to wrong conclusions about the problem you are trying to solve. The assumptions of linear regression are the following (Wooldridge 2012):

 

Linear in Parameters: The model is linear in parameters.

This doesn’t mean that the variables themselves must be linear; rather, the parameters (coefficients) multiplying these variables in the regression equation must enter linearly. This allows for transformations of the variables (like log or square) as long as the coefficients are linear.

Code
#dataset
data(mtcars)

#LM, we predict miles per gallon (mpg) based on the number of cylinders (cyl) and weight (wt)
lm_model_ex <- lm(mpg ~ cyl + wt, data = mtcars)

#Summary 
summary(lm_model_ex)

Call:
lm(formula = mpg ~ cyl + wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.2893 -1.5512 -0.4684  1.5743  6.1004 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  39.6863     1.7150  23.141  < 2e-16 ***
cyl          -1.5078     0.4147  -3.636 0.001064 ** 
wt           -3.1910     0.7569  -4.216 0.000222 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.568 on 29 degrees of freedom
Multiple R-squared:  0.8302,    Adjusted R-squared:  0.8185 
F-statistic: 70.91 on 2 and 29 DF,  p-value: 6.809e-12

mpg is the dependent variable, and both cyl and wt are independent variables. The model is linear in parameters because it expresses mpg as a linear function of cyl and wt. The coefficients for cyl and wt indicate how much mpg changes with a one-unit change in these variables, assuming a linear relationship.
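To illustrate the point about transformations, here is a minimal sketch (my own addition, still using mtcars): a model with a log-transformed regressor is still linear in the parameters, so lm() estimates it by OLS exactly as before.

Code
#Log-transforming a regressor keeps the model linear in the parameters,
#so OLS via lm() still applies
lm_model_log <- lm(mpg ~ cyl + log(wt), data = mtcars)
summary(lm_model_log)

#By contrast, a model such as mpg = b0 * wt^b1 + error is nonlinear in the
#parameters and would need a different estimator, e.g. nls()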

 

Random Sampling: The sample of observations is randomly drawn from the population.

The sample data should be randomly drawn from the population. This is vital to ensure that the sample represents the population accurately, allowing for generalizable and unbiased inferences from the model. Violations of this assumption can lead to biased, ungeneralizable results. In practice, truly random samples are rare: when we deal with real data, it is difficult to obtain one. However, we can try to get a sample that is as random as possible, and there are resampling techniques, such as bootstrapping and jackknifing, that can help with inference when the classical assumptions about the sample are shaky.

An example of what a random sample looks like is the following:

Code
#Setting a seed 
set.seed(66)

#Generate a synthetic population
v_pop_size <-  10000
df_pop <- data.frame(
  v1 = rnorm(v_pop_size, mean = 50, sd = 10), #We chose a normal distribution because it is easier to visualize
  v2 = runif(v_pop_size, min = 10, max = 60)  #We chose a uniform distribution because it is easier to visualize and it is bounded
)

# Random sample from the population
v_sample_size <- 1000 #We set the sample size to 1000
v_sample <- df_pop[sample(1:nrow(df_pop), v_sample_size), ] #Here, we are sampling 1000 rows from the population

# Compare summary statistics of the population and the sample
sum_pop <- summary(df_pop)
sum_sample <- summary(v_sample)

sum_pop
       v1              v2       
 Min.   :14.22   Min.   :10.01  
 1st Qu.:43.29   1st Qu.:22.64  
 Median :49.73   Median :35.04  
 Mean   :49.80   Mean   :35.07  
 3rd Qu.:56.24   3rd Qu.:47.52  
 Max.   :87.66   Max.   :60.00  
Code
sum_sample
       v1              v2       
 Min.   :17.80   Min.   :10.08  
 1st Qu.:43.53   1st Qu.:21.87  
 Median :49.93   Median :34.60  
 Mean   :50.04   Mean   :34.89  
 3rd Qu.:56.38   3rd Qu.:47.30  
 Max.   :79.33   Max.   :59.94  

As you can see, the summary statistics of the population and the sample are very similar because we took a random sample from the population. With a non-random sample, the summary statistics of the sample could differ substantially from those of the population.
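The paragraph above mentions bootstrapping. As a minimal sketch (my own addition, reusing the synthetic sample from the code above), here is how you could bootstrap the mean of v1 with base R:

Code
#Resample the sample with replacement many times and recompute the mean each time
set.seed(66)
v_boot_means <- replicate(2000, mean(sample(v_sample$v1, replace = TRUE)))

#Bootstrap estimate of the mean and a simple 95% percentile interval
mean(v_boot_means)
quantile(v_boot_means, probs = c(0.025, 0.975))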

 

No Perfect Multicollinearity: There is no perfect multicollinearity among the independent variables.

None of our independent variables should be an exact linear combination of other independent variables. If this assumption is violated, it becomes difficult to assess the effect of each independent variable on the dependent variable.

Code
#Load the car package, which provides the vif() function
library(car)

#Predict miles per gallon (mpg) based on the number of cylinders (cyl), horsepower (hp), and weight (wt)
lm_model_ex_2 <- lm(mpg ~ cyl + hp + wt, data = mtcars)

#Calculate VIF (Variance Inflation Factor)
vif <- vif(lm_model_ex_2)

#VIF result
vif

Loading required package: carData
     cyl       hp       wt 
4.757456 3.258481 2.580486 

A VIF value greater than 5 or 10 suggests high multicollinearity. In our case we don’t reach those values, but there is still some degree of multicollinearity, so we need to be careful when we interpret the results of our model.
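To see what perfect multicollinearity looks like in practice, here is a small illustrative sketch (my own addition): if we add a regressor that is an exact linear combination of another one, lm() cannot separate their effects and reports an NA coefficient for the redundant variable.

Code
#Copy the data and add a regressor that is an exact rescaling of wt
df_cars <- mtcars
df_cars$wt_kg <- df_cars$wt * 453.6  #wt is in 1000 lbs, so this is just a change of units

#lm() detects the perfect collinearity and drops the redundant regressor (NA estimate)
lm_perfect <- lm(mpg ~ wt + wt_kg, data = df_cars)
summary(lm_perfect)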

 

Zero Conditional Mean: The error term has a conditional mean of zero given any value of the independent variables.

This implies that the model does not omit any variable that both influences the dependent variable and is correlated with the independent variables. If this assumption is violated, the estimates will be biased. This assumption is also known as the exogeneity assumption: the independent variables are not correlated with the error term.

Code
#LM
lm_model_ex_3 <- lm(mpg ~ cyl + wt, data = mtcars)

#We get the residuals and the fitted values
v_residuals <- residuals(lm_model_ex_3)
v_fitted_values <- fitted(lm_model_ex_3)

#Plot residuals against fitted values
plot(v_fitted_values, v_residuals, xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, col = "red")  # Adding a horizontal line at 0

According to the plot, the residuals are scattered around the horizontal line at 0, although some of them are fairly large in magnitude. The idea behind this check is to plot the residuals against the fitted values and look for a systematic pattern: clear curvature, for example, points to a misspecified functional form or omitted variables, which are ways in which the zero conditional mean assumption can fail. The plot alone cannot prove that the assumption holds. Other approaches, such as instrumental variables, can help deal with violations by using an exogenous variable that is correlated with the problematic regressor but not with the error term.
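To make the consequences of a violated zero conditional mean more concrete, here is a small simulation (my own sketch, not part of the original analysis): the true model has two correlated regressors, and omitting one of them biases the coefficient on the other.

Code
set.seed(123)
n <- 1000

#x2 is correlated with x1, and both affect y
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n)
y  <- 2 + 1 * x1 + 1.5 * x2 + rnorm(n)

#Including both regressors: the coefficient on x1 is close to its true value of 1
coef(lm(y ~ x1 + x2))

#Omitting x2 pushes it into the error term, which is now correlated with x1,
#so the coefficient on x1 is biased upward (roughly 1 + 1.5 * 0.8 = 2.2)
coef(lm(y ~ x1))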

 

Homoscedasticity: The error term has a constant variance, meaning there is no heteroscedasticity.

The spread of the residuals should be roughly the same for all fitted values. When this condition is met, it’s referred to as homoscedasticity; if not, it’s called heteroscedasticity. If this assumption is violated, the OLS coefficients remain unbiased, but the estimator is inefficient and the usual standard errors are no longer valid. This assumption is also known as the homogeneity of variance assumption.

Code
#LM
lm_model_ex_4 <- lm(mpg ~ cyl + wt, data = mtcars)

#We get the residuals and the fitted values
v_residuals <- residuals(lm_model_ex_4)
v_fitted_values <- fitted(lm_model_ex_4)

#Plot residuals against fitted values
plot(v_fitted_values, v_residuals, xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, col = "red")  # Adding a horizontal line at 0

To check this assumption, we look at whether the spread of the residuals is roughly constant across the fitted values (no funnel shape). The visual check is the first step; we can complement it with a formal test, the Breusch-Pagan test, shown below.

Code
#Breusch-Pagan test
library(lmtest)
bptest(lm_model_ex_4)

Loading required package: zoo

Attaching package: 'zoo'

The following objects are masked from 'package:base':

    as.Date, as.Date.numeric

    studentized Breusch-Pagan test

data:  lm_model_ex_4
BP = 3.2548, df = 2, p-value = 0.1964

Similar to the zero conditional mean check, we plot the residuals against the fitted values and look for a pattern; a pattern would suggest a violation of the homoscedasticity assumption. Remedies such as weighted least squares or heteroscedasticity-robust standard errors can help us deal with this violation. It’s worth mentioning that the Breusch-Pagan test is not very reliable when the sample size is small, but it is a useful formal test for heteroscedasticity. In our case the test is not significant, so we do not find evidence of heteroscedasticity.
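If the Breusch-Pagan test had indicated heteroscedasticity, a common remedy besides weighted least squares is to keep the OLS coefficients but report heteroscedasticity-robust standard errors. A minimal sketch, assuming the sandwich package is installed alongside lmtest:

Code
library(sandwich)
library(lmtest)

#Coefficient table of lm_model_ex_4 with HC1 heteroscedasticity-robust standard errors
coeftest(lm_model_ex_4, vcov = vcovHC(lm_model_ex_4, type = "HC1"))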

 

Normality of the Error Terms

The error term is assumed to be normally distributed. Violating this assumption does not bias the OLS estimates, but it can make the usual t- and F-tests unreliable in small samples. This assumption is also known as the normality assumption.

In larger samples, thanks to the Central Limit Theorem, the sampling distributions of the OLS estimators become approximately normal regardless of the error distribution. The Central Limit Theorem states that the sampling distribution of the mean of any independent, random variable will be normal or nearly normal, if the sample size is large enough.

Code
#Q-Q plot
qqnorm(residuals(lm_model_ex_4))
qqline(residuals(lm_model_ex_4), col = "red")

Code
#Shapiro-Wilk normality test
shapiro.test(residuals(lm_model_ex_4))

    Shapiro-Wilk normality test

data:  residuals(lm_model_ex_4)
W = 0.93745, p-value = 0.06341

According to the Q-Q plot, most residuals lie close to the red line, but several in the tails deviate from it, which hints at a possible violation of the normality assumption. Even so, OLS can produce unbiased estimates with the least variance among linear unbiased estimators even if the error terms are not normally distributed. We can also use the Shapiro-Wilk test to test for normality. In our case the test is not significant, so we do not reject normality; nevertheless, the p-value is close to 0.05, so the assumption is borderline.
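To see why, as noted above, larger samples make the normality assumption less critical, here is a small simulation (my own sketch): even with strongly skewed errors, the sampling distribution of the OLS slope is approximately normal.

Code
set.seed(42)

#Fit many regressions with skewed (centered exponential) errors and keep the slope
v_slopes <- replicate(2000, {
  x <- rnorm(200)
  y <- 1 + 2 * x + (rexp(200) - 1)  #errors are skewed but have mean zero
  coef(lm(y ~ x))[2]
})

#The estimated slopes are approximately normally distributed around the true value of 2
hist(v_slopes, breaks = 40, main = "Sampling distribution of the OLS slope", xlab = "Estimated slope")
qqnorm(v_slopes); qqline(v_slopes, col = "red")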

In summary, we should be aware that some of these assumptions are strong and in practice difficult to meet. Knowing them, however, gives us the basis for understanding quantitative methods, and there are methods that help us deal with violations: for example, instrumental variables when the zero conditional mean assumption fails, and weighted least squares or heteroscedasticity-robust standard errors when the homoscedasticity assumption fails.

The key focus of OLS is the linear relationship between the independent variables and the dependent variable. For estimation purposes, as long as the other OLS assumptions are met, the method can be used regardless of the distribution of the data. However, when we do research we need to pay attention to how the data are distributed, since the distribution of the dependent variable can tell us which model to use. For example, if the dependent variable is binary, we should use a logistic regression model; if it is a count, a Poisson regression model; and if it is continuous, a linear regression model (Wooldridge 2012).

The image below shows some of the data distributions we can encounter in practice.
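These modeling choices map directly onto R’s glm() function. As a hedged illustration with simulated placeholder data (the variables here are made up, not part of the Boston analysis):

Code
#Simulated data, only to illustrate the syntax for non-continuous outcomes
set.seed(1)
df_glm_ex <- data.frame(
  x       = rnorm(200),
  y_bin   = rbinom(200, size = 1, prob = 0.5),  #binary outcome
  y_count = rpois(200, lambda = 3)              #count outcome
)

#Binary dependent variable: logistic regression
glm(y_bin ~ x, data = df_glm_ex, family = binomial)

#Count dependent variable: Poisson regression
glm(y_count ~ x, data = df_glm_ex, family = poisson)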

 

Pipeline (steps) to do a linear regression: from the theory to the code

Here I want to explain the steps I would follow to preprocess the data, run a linear regression, and evaluate the model. We will use different packages for this purpose, and you’ll see that running a linear regression in R is easy; we just need to pay attention to the theory behind the model and its assumptions, which is really important. For this example we will work with a real estate dataset containing information about house prices in Boston. It is a basic example we can use to understand the steps of a linear regression analysis.

 

1.-Establish an introduction

The first step is to establish an introduction. Here we explain the problem and how we can describe and tackle it using statistics. For example, we can say that we want to know how the price of a house is related to the number of rooms and to the socioeconomic composition of the neighborhood, possibly controlling for other neighborhood characteristics. This is important because we need to be clear about what we want to know and how we can describe it using statistics.

 

2.-Research question

The second step is to establish a research question. Here we need to establish a question that we want to answer regarding our introduction and what we are trying to predict.

Example: How is the price of a house related to the number of rooms and to the share of lower-status population in its neighborhood?

 

3.-Hypothesis

The third step is to establish a hypothesis: a tentative explanation or prediction that can be tested with statistical analysis. In our case, I want to test whether the number of rooms and the share of lower-status population in the neighborhood are related to the price of the house.

Example:

Null hypothesis (Ho): The number of rooms and the share of lower-status population are not related to the price of the house.

Alternative hypothesis (Ha): The number of rooms and the share of lower-status population are related to the price of the house.

Important

Important: The null hypothesis is the hypothesis we try to reject, and the alternative hypothesis is the one we are gathering evidence for. If we don’t have enough evidence to reject the null hypothesis, we cannot claim support for the alternative hypothesis, at least statistically speaking.

 

4.-Data

The fourth step is to get the data we need to answer our research question and test our hypothesis. In our case, we need data on house prices (medv), the number of rooms (rm), and the share of lower-status population in the neighborhood (lstat). For this we will use the Boston dataset from the MASS package, which contains information about house prices in Boston.

Code
# Load the MASS package

library(MASS)
library(dplyr)
# Load the Boston dataset

df_boston <- MASS::Boston

glimpse(df_boston)
Rows: 506
Columns: 14
$ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829,…
$ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 1…
$ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.…
$ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524,…
$ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631,…
$ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 9…
$ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505…
$ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
$ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 31…
$ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15…
$ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396.90…
$ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10…
$ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15…

 

5.-Data preprocessing

The fifth step is to describe the data with summary statistics (mean, standard deviation, percentiles, minimum, and maximum). Here we can use the st() command from the vtable package, which produces a summary table of the variables in the dataset.

library(vtable) #Load the vtable package
st(df_boston) #Describe the data in a table
Summary Statistics

Variable   N     Mean    Std. Dev.  Min      Pctl. 25  Pctl. 75  Max
crim       506   3.6     8.6        0.0063   0.082     3.7       89
zn         506   11      23         0        0         12        100
indus      506   11      6.9        0.46     5.2       18        28
chas       506   0.069   0.25       0        0         0         1
nox        506   0.55    0.12       0.39     0.45      0.62      0.87
rm         506   6.3     0.7        3.6      5.9       6.6       8.8
age        506   69      28         2.9      45        94        100
dis        506   3.8     2.1        1.1      2.1       5.2       12
rad        506   9.5     8.7        1        4         24        24
tax        506   408     169        187      279       666       711
ptratio    506   18      2.2        13       17        20        22
black      506   357     91         0.32     375       396       397
lstat      506   13      7.1        1.7      7         17        38
medv       506   23      9.2        5        17        25        50

 

6.-Data visualization

The sixth step is to visualize the data to see how it is distributed. For this we can use the ggplot2 package, which makes this kind of visualization straightforward.

Code
# Load necessary libraries
library(ggplot2)
library(gridExtra)

# Plot 1: Price of the house
gg1 <- ggplot(df_boston, aes(x = medv)) +
  geom_histogram(binwidth = 1, fill = "blue", color = "black") +
  labs(title = "Price of the house", x = "Price of the house", y = "Frequency") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

# Plot 2: Number of rooms
gg2 <- ggplot(df_boston, aes(x = rm)) +
  geom_histogram(binwidth = 1, fill = "blue", color = "black") +
  labs(title = "Number of rooms", x = "Number of rooms", y = "Frequency") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

# Plot 3: Lower-status population share
gg3 <- ggplot(df_boston, aes(x = lstat)) +
  geom_histogram(binwidth = 1, fill = "blue", color = "black") +
  labs(title = "Lower-status population share", x = "Lower-status population (%)", y = "Frequency") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

# Combine all plots into one
grid.arrange(gg1, gg2, gg3, ncol = 2)

 

7.-Data analysis

Plot medv (price of the house) vs rm (number of rooms)

Code
# Plot medv (price of the house) vs rm (number of rooms)

ggplot(df_boston, aes(x = rm, y = medv)) +
  geom_point() +
  labs(title = "Price of the house vs Number of rooms", x = "Number of rooms", y = "Price of the house") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

Plot medv (price of the house) vs lstat (lower-status population share)

Code
# Plot medv (price of the house) vs lstat (lower-status population share)

ggplot(df_boston, aes(x = lstat, y = medv)) +
  geom_point() +
  labs(title = "Price of the house vs Lower-status population share", x = "Lower-status population (%)", y = "Price of the house") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

Now plot the pairs. The ggpairs() function in R creates a matrix of scatterplots for each variable paired with each of the other variables. This is a great way to quickly look at the relationships between multiple variables at once.

Code
# Load necessary libraries
library(ggplot2)
library(GGally)

ggpairs(df_boston, 
        columns = c("medv", "ptratio", "black", "lstat", "dis", "rm", "crim"), 
        title = "Boston Data Pairs Plot",
        upper = list(continuous = wrap("cor", size = 4)), # Show correlations in the upper panel
        lower = list(continuous = "smooth"), # Use smooth lines in the lower panel
        diag = list(continuous = "densityDiag")) + 
  theme_light() + # Apply a clean theme
  theme(plot.title = element_text(hjust = 0.5, size = 20, face = "bold"), #Customize title
    axis.text = element_text(size = 12), #Customize axis text
    axis.title = element_text(size = 14, face = "bold"))

 

8.-Modeling

The eighth step is to build the model we will use to answer our research question and test our hypothesis. Remember, our hypothesis is that the number of rooms and the share of lower-status population are related to the price of the house.

 

\[ \small \begin{align*} MedianValue = \beta_0 + \beta_1 * num\_rooms + \beta_2 * lower\_status + \epsilon \end{align*} \]  

Where:

  • \(\beta_0\) is the intercept

  • \(\beta_1\) is the coefficient for the number of rooms

  • \(\beta_2\) is the coefficient for the lower-status population share

  • \(\epsilon\) is the error term

 

# Build the model

lm_1 <- lm(medv ~ rm + lstat, data = df_boston)

# Summarize the model

summary(lm_1)

Call:
lm(formula = medv ~ rm + lstat, data = df_boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.076  -3.516  -1.010   1.909  28.131 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.35827    3.17283  -0.428    0.669    
rm           5.09479    0.44447  11.463   <2e-16 ***
lstat       -0.64236    0.04373 -14.689   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.54 on 503 degrees of freedom
Multiple R-squared:  0.6386,    Adjusted R-squared:  0.6371 
F-statistic: 444.3 on 2 and 503 DF,  p-value: < 2.2e-16

 

\[ \small \begin{align*} MedianValue = -1.3583 + 5.0948 * num\_rooms - 0.6424 * lower\_status + \epsilon \end{align*} \]

We can see that more rooms (rm) positively affect the median value of a house: since medv is measured in thousands of dollars, each additional room is associated with an increase of about $5,095. The lower-status population share (lstat) negatively affects the median value, with each additional percentage point associated with a decrease of about $642. Both variables are statistically significant, with p-values well below 0.05. The adjusted R-squared of 0.64 indicates that the model explains about 64% of the variance in the median value of a house.
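Beyond the point estimates, it can be helpful to report confidence intervals for the coefficients; because medv is measured in thousands of dollars, multiplying by 1,000 expresses them in dollars. A small sketch:

Code
#95% confidence intervals for the coefficients of lm_1 (in thousands of dollars)
confint(lm_1)

#The same intervals expressed in dollars
confint(lm_1) * 1000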

Now let’s generate predictions for the whole dataset with the predict() function: we predict the median value of each house from the number of rooms and the lower-status share, add the predicted values to the original dataset, and finally plot the predicted values against the actual values.

# Predict the median value of each house based on the number of rooms and the lower-status share

df_boston$predicted <- predict(lm_1, df_boston)

# Plot the predicted values against the actual values

ggplot(df_boston, aes(x = medv, y = predicted)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  labs(title = "Actual vs Predicted", x = "Actual", y = "Predicted") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

 

This plot shows that the predicted values are reasonably close to the actual values, although there are some outliers. Let’s see if we can improve the model by removing them.

#Remove the outliers

df_boston <- df_boston %>% 
  filter(predicted > 0 & predicted < 50)

# Build the model

lm_2 <- lm(medv ~ rm + lstat, data = df_boston)

# Summarize the model

summary(lm_2)

Call:
lm(formula = medv ~ rm + lstat, data = df_boston)

Residuals:
     Min       1Q   Median       3Q      Max 
-18.4225  -3.4386  -0.9385   1.9941  28.1030 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.73138    3.13449  -0.552    0.581    
rm           5.19033    0.43957  11.808   <2e-16 ***
lstat       -0.66488    0.04363 -15.238   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.471 on 501 degrees of freedom
Multiple R-squared:  0.6464,    Adjusted R-squared:  0.645 
F-statistic: 457.9 on 2 and 501 DF,  p-value: < 2.2e-16

 

What is the difference between the two models?

By removing the outliers, we might expect a better fit. Usually we look at the R-squared and the p-values to see whether the model improved. However, I personally prefer the AIC (Akaike Information Criterion) and the BIC (Bayesian Information Criterion) to compare models. The AIC and BIC measure the goodness of fit of an estimated statistical model and can be used for model selection; they are preferable to the R-squared for this purpose because they penalize the number of parameters in the model. They are calculated as follows:

 

\[ AIC = 2k - 2ln(L) \]

\[ BIC = ln(n)k - 2ln(L) \]

 

Where:

  • \(k\) is the number of parameters in the model

  • \(n\) is the number of observations

  • \(L\) is the likelihood function of the model

 

The likelihood function is the probability of observing the data given the model parameters; the higher the likelihood, the better the model fits. AIC and BIC can be positive or negative; what matters is the comparison across models, where lower values indicate a better trade-off between fit and complexity (Chakrabarti and Ghosh 2011). Let’s compare the AIC and BIC of the two models.

# Compare the AIC and BIC of the two models

AIC(lm_1, lm_2)
     df      AIC
lm_1  4 3173.542
lm_2  4 3148.251
BIC(lm_1, lm_2)
     df      BIC
lm_1  4 3190.448
lm_2  4 3165.142
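To connect these formulas with the numbers R reports, we can reproduce AIC(lm_1) by hand from its log-likelihood; for a linear model, k counts the regression coefficients plus the error variance. A minimal sketch:

Code
#Reproduce AIC and BIC of lm_1 from the formulas above
ll <- logLik(lm_1)               #log-likelihood of the model
k  <- attr(ll, "df")             #number of estimated parameters (3 coefficients + error variance)

2 * k - 2 * as.numeric(ll)                #AIC = 2k - 2ln(L), should match AIC(lm_1)
log(nobs(lm_1)) * k - 2 * as.numeric(ll)  #BIC = ln(n)k - 2ln(L), should match BIC(lm_1)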

9.- Adding more variables to the model

Any variable that affects the median house value but is not included in the model ends up in the error term, which means we are not accounting for its effect. Let’s see if we can improve the model by adding more variables; the extended specification is:

\[ \small \begin{align*} MedianValue =\ &\beta_0 + \beta_1 \times num\_rooms + \beta_2 \times lower\_status \\ &+ \beta_3 \times pupil\_teacher\_ratio + \beta_4 \times black \\ &+ \beta_5 \times dist\_to\_employ\_ctr \\ &+ \beta_6 \times crime\_rate + \epsilon \end{align*} \]

Where:

  • \(\beta_0\) is the intercept

  • \(\beta_1\) is the coefficient for the number of rooms

  • \(\beta_2\) is the coefficient for the lower-status population share

  • \(\beta_3\) is the coefficient for the pupil-teacher ratio

  • \(\beta_4\) is the coefficient for the proportion of black residents

  • \(\beta_5\) is the coefficient for the distance to employment centers

  • \(\beta_6\) is the coefficient for the crime rate

  • \(\epsilon\) is the error term

 

# Build the model

lm_3 <- lm(medv ~ rm + lstat + ptratio + black + dis + crim, data = df_boston)

# Summarize the model

summary(lm_3)

Call:
lm(formula = medv ~ rm + lstat + ptratio + black + dis + crim, 
    data = df_boston)

Residuals:
     Min       1Q   Median       3Q      Max 
-16.3216  -2.8206  -0.7685   1.6822  27.6463 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 16.627007   4.210866   3.949 9.00e-05 ***
rm           4.617388   0.412863  11.184  < 2e-16 ***
lstat       -0.608469   0.048251 -12.611  < 2e-16 ***
ptratio     -0.879837   0.113297  -7.766 4.67e-14 ***
black        0.010248   0.002725   3.761  0.00019 ***
dis         -0.683896   0.124865  -5.477 6.88e-08 ***
crim        -0.082178   0.031055  -2.646  0.00840 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.949 on 497 degrees of freedom
Multiple R-squared:  0.7129,    Adjusted R-squared:  0.7095 
F-statistic: 205.7 on 6 and 497 DF,  p-value: < 2.2e-16
BIC(lm_1, lm_2, lm_3)
     df      BIC
lm_1  4 3190.448
lm_2  4 3165.142
lm_3  8 3084.927
BIC(lm_3)
[1] 3084.927

 

Results

The new variables give us some interesting results.

  • Average Number of Rooms (rm): Each additional room is associated with an increase of approximately $4,617 in the median home value, holding other factors constant.

  • Lower Status of the Population (lstat): Each percentage point increase in lstat is associated with a decrease of about $608 in the median value, holding other factors constant.

  • Pupil-Teacher Ratio (ptratio): Each unit increase in ptratio is associated with a reduction of around $880 in the median home value, holding other variables constant.

  • Proportion of Black Residents (black): A slight positive effect on home value, with each unit increase raising the median value by about $10, other factors being constant.

  • Distance to Employment Centers (dis): Each unit increase in this distance is associated with a decrease of about $684 in the median home value, holding other variables constant.

  • Crime Rate (crim): Each unit increase in the crime rate is associated with a decrease of about $82 in the median home value, other factors held constant.

AIC, BIC

# Compare the AIC and BIC of the three models

AIC(lm_1, lm_2, lm_3)
     df      AIC
lm_1  4 3173.542
lm_2  4 3148.251
lm_3  8 3051.146
BIC(lm_1, lm_2, lm_3)
     df      BIC
lm_1  4 3190.448
lm_2  4 3165.142
lm_3  8 3084.927

Comparing the AIC and BIC of the three models, we can see that the third model has the lowest values, which suggests that the additional variables improve the fit enough to offset the penalty for the extra parameters. One caveat: lm_1 was fitted on the full dataset while lm_2 and lm_3 were fitted on the filtered data, so the comparison is strictly valid only between models estimated on the same sample.
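Because lm_2 and lm_3 are estimated on the same filtered data and lm_2 is nested inside lm_3, we can also compare them with an F-test; this is a small addition on top of the AIC/BIC comparison:

Code
#F-test for the joint significance of the variables added in lm_3 relative to lm_2
anova(lm_2, lm_3)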

 

10.- Reporting the results

We can report the results of the model by hand, or use the report package, which automatically describes the results of the model in words:

# Report the results
library(report)
report(lm_3)
We fitted a linear model (estimated using OLS) to predict medv with rm, lstat,
ptratio, black, dis and crim (formula: medv ~ rm + lstat + ptratio + black +
dis + crim). The model explains a statistically significant and substantial
proportion of variance (R2 = 0.71, F(6, 497) = 205.72, p < .001, adj. R2 =
0.71). The model's intercept, corresponding to rm = 0, lstat = 0, ptratio = 0,
black = 0, dis = 0 and crim = 0, is at 16.63 (95% CI [8.35, 24.90], t(497) =
3.95, p < .001). Within this model:

  - The effect of rm is statistically significant and positive (beta = 4.62, 95%
CI [3.81, 5.43], t(497) = 11.18, p < .001; Std. beta = 0.35, 95% CI [0.29,
0.41])
  - The effect of lstat is statistically significant and negative (beta = -0.61,
95% CI [-0.70, -0.51], t(497) = -12.61, p < .001; Std. beta = -0.46, 95% CI
[-0.53, -0.39])
  - The effect of ptratio is statistically significant and negative (beta =
-0.88, 95% CI [-1.10, -0.66], t(497) = -7.77, p < .001; Std. beta = -0.21, 95%
CI [-0.26, -0.16])
  - The effect of black is statistically significant and positive (beta = 0.01,
95% CI [4.89e-03, 0.02], t(497) = 3.76, p < .001; Std. beta = 0.10, 95% CI
[0.05, 0.15])
  - The effect of dis is statistically significant and negative (beta = -0.68,
95% CI [-0.93, -0.44], t(497) = -5.48, p < .001; Std. beta = -0.16, 95% CI
[-0.21, -0.10])
  - The effect of crim is statistically significant and negative (beta = -0.08,
95% CI [-0.14, -0.02], t(497) = -2.65, p = 0.008; Std. beta = -0.08, 95% CI
[-0.13, -0.02])

Standardized parameters were obtained by fitting the model on a standardized
version of the dataset. 95% Confidence Intervals (CIs) and p-values were
computed using a Wald t-distribution approximation.

11.- Report the results in tables

# Report the results

library(sjPlot)

tab_model(lm_1, lm_2, lm_3, title = "Regression Table Example",
dv.labels = c("(LM) (1)", "(LM: no outliers) (2)", "(LM) (more variables)"), show.aic = T)
Regression Table Example

                   (LM) (1)                      (LM: no outliers) (2)          (LM) (more variables)
Predictors         Estimates  CI             p      Estimates  CI             p      Estimates  CI             p
(Intercept)        -1.36   -7.59 – 4.88   0.669     -1.73   -7.89 – 4.43   0.581      16.63   8.35 – 24.90   <0.001
rm                  5.09    4.22 – 5.97   <0.001     5.19    4.33 – 6.05   <0.001      4.62   3.81 – 5.43    <0.001
lstat              -0.64   -0.73 – -0.56  <0.001    -0.66   -0.75 – -0.58  <0.001     -0.61  -0.70 – -0.51   <0.001
ptratio                                                                               -0.88  -1.10 – -0.66   <0.001
black                                                                                  0.01   0.00 – 0.02    <0.001
dis                                                                                   -0.68  -0.93 – -0.44   <0.001
crim                                                                                  -0.08  -0.14 – -0.02    0.008
Observations        506                              504                               504
R2 / R2 adjusted    0.639 / 0.637                    0.646 / 0.645                     0.713 / 0.709
AIC                 3173.542                         3148.251                          3051.146

 

12.- Plot assumptions

We can plot the assumptions of the model using the performance package and check whether they are met.

Model

# Plot the assumptions
library(performance)
check_model(lm_3)

 

Look at the normality-of-residuals plot: the residuals do not follow the normal distribution closely. For the other assumptions the model looks acceptable, but the normality assumption is not well satisfied, which can affect the reliability of hypothesis testing. To address this, consider transforming the data, adding variables, or using robust or nonparametric methods, especially if the sample size is small (Wooldridge 2012).

13.- Tips:

 

Finally, here are some tips from my experience that can help you when working with linear regression models:

 

1.- Try different units of measurement:

Sometimes the units of measurement are not the most convenient for interpretation. For example, in the Boston dataset the median house value (medv) is expressed in thousands of dollars, so values range roughly from 5 to 50. Multiplying medv by 1,000 expresses the same information in dollars (5,000 to 50,000). Rescaling does not change the fit of the model, but it can make the coefficients easier to read and report, as in the sketch below.
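As a quick sketch of what rescaling does (and does not do): the coefficients and standard errors scale with the outcome, while the R-squared, t-statistics, and p-values stay the same.

Code
#Same model with the outcome expressed in dollars instead of thousands of dollars
lm_dollars <- lm(I(medv * 1000) ~ rm + lstat, data = df_boston)

#Coefficients are 1,000 times those of lm_2; the fit statistics are unchanged
coef(lm_2)
coef(lm_dollars)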

 

2.- Try different transformations of the variables:

Sometimes a variable’s raw form does not work well in the model. For example, in the Boston dataset the per capita crime rate (crim) is strongly right-skewed: most towns have values close to zero, while a few are much higher. A transformation such as the logarithm can bring its relationship with the dependent variable closer to linear and, as noted in the assumptions section, the model remains linear in the parameters. See the sketch below.
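A minimal sketch of such a transformation (my own addition): crim is always positive in this dataset, so we can take its log directly.

Code
#Using log(crim) instead of crim; the model is still linear in the parameters
lm_log_crim <- lm(medv ~ rm + lstat + log(crim), data = df_boston)
summary(lm_log_crim)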

 

3.- Gathering external data:

Using external data can help us build a better model. For example, we could add the median income of each neighborhood to the Boston dataset to better predict the median value of a house. Remember the commands we used to gather data from the internet; we can use them to collect external data and merge it into our dataset. For now, here is a basic example of how you would use left join (and, analogously, right join, inner join, and outer join) to merge two data frames.

# Create a data frame

df_1 <- data.frame(
  id = c(1, 2, 3, 4, 5),
  name = c("Jorge", "Carmen", "Arturo", "Rocky", "Yayo"),
  age = c(20, 30, 40, 50, 60)
)

# Print the data frame

df_1
  id   name age
1  1  Jorge  20
2  2 Carmen  30
3  3 Arturo  40
4  4  Rocky  50
5  5   Yayo  60

 

Now, let’s say that we have the following data frame:

# Create a data frame

df_2 <- data.frame(
  id = c(1, 2, 3, 4, 5),
  income = c(1000, 2000, 3000, 4000, 5000)
)

# Print the data frame

df_2
  id income
1  1   1000
2  2   2000
3  3   3000
4  4   4000
5  5   5000

 

We can merge the two data frames using the following command:

# Merge the two data frames

df_3 <- left_join(df_1, df_2, by = "id")

# Print the data frame

df_3
  id   name age income
1  1  Jorge  20   1000
2  2 Carmen  30   2000
3  3 Arturo  40   3000
4  4  Rocky  50   4000
5  5   Yayo  60   5000

 

We can also use right join, inner join, and outer join. You can read more about these commands here.

4.- Have a good justification for choosing the variables:

Testing your hypothesis means that you have a theory behind your model. You need to select the variables that, according to your theory, describe and predict the problem in the best possible way. Don’t worry so much about whether your variables turn out to be significant; worry about whether they make sense in your theory.

5.- Creating factors:

Sometimes you need to create factors to better analyze the data. For example, in the Boston dataset we can divide the crime rate into three categories: low, medium, and high. You can make this part of your theory, arguing that whether the crime rate is low, medium, or high affects the median value of a house. You can create the factor with the following command:

#Create a factor

df_boston$crime_rate_factor <- cut(df_boston$crim, breaks = c(0, 1, 10, 100), labels = c("low", "medium", "high"))

head(df_boston)
     crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
  medv predicted crime_rate_factor
1 24.0  28.94101               low
2 21.6  25.48421               low
3 34.7  32.65907               low
4 33.4  32.40652               low
5 36.2  31.63041               low
6 28.7  28.05453               low
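Once the factor exists, it can enter the regression like any other variable; R automatically creates dummy variables for the non-reference levels ("low" is the reference here). A small sketch:

Code
#Include the crime-rate factor as a categorical predictor
lm_factor <- lm(medv ~ rm + lstat + crime_rate_factor, data = df_boston)
summary(lm_factor)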

Conclusion

Understanding the assumptions behind linear regression and applying them in a real context is something you need to be very clear about. This tutorial was an exercise to show you that there are multiple ways to do this kind of analysis; use the tools you feel most comfortable with. The important thing is that you understand the assumptions and how to apply them.

References


Chakrabarti, Arijit, and Jayanta K. Ghosh. 2011. “AIC, BIC and Recent Advances in Model Selection.” In Philosophy of Statistics, edited by Prasanta S. Bandyopadhyay and Malcolm R. Forster, 7:583–605. Handbook of the Philosophy of Science. Amsterdam: North-Holland. https://doi.org/10.1016/B978-0-444-51862-0.50018-6.
Wilber, Jared. 2022. “Linear Regression: A Visual Introduction to (Almost) Everything You Should Know.” https://mlu-explain.github.io/linear-regression/.
Wong, Jason. 2020. “Linear Regression Explained: A High-Level Overview of Linear Regression Analysis.” Towards Data Science, November. https://towardsdatascience.com/linear-regression-explained-1b36f97b7572.
Wooldridge, Jeffrey M. 2012. Introductory Econometrics: A Modern Approach. Mason, Ohio: South-Western Cengage Learning.
Note

Cite this page: Roa, J. (2023, December 10). Linear Regression 101. Hertie Coding Club. URL