Master Multiple Linear Regression in Python with scikit-learn and statsmodels

Introduction

Mastering multiple linear regression in Python with libraries like scikit-learn and statsmodels is an essential skill for any data scientist. In this article, we’ll walk you through the process of implementing and evaluating multiple linear regression models, using tools like scikit-learn and statsmodels to preprocess data, select features, and address challenges like multicollinearity and outliers. You’ll also learn how to evaluate model performance with metrics like R-squared and Mean Squared Error, and understand the impact of cross-validation on model reliability. Whether you’re analyzing the California Housing Dataset or working with your own, this guide will help you master the art of multiple linear regression in Python.

What is Multiple Linear Regression?

Multiple Linear Regression is a statistical method used to predict a target variable based on the relationship with multiple independent variables. It helps in understanding how different factors contribute to a specific outcome, such as predicting house prices using features like income, number of rooms, and occupancy rates. This technique is commonly used in data analysis and machine learning for making predictions and identifying important factors influencing a particular result.

So, here’s the thing: imagine you’re trying to predict something, like the price of a house, but you know that just one factor won’t give you the full picture. Sure, the size of the house is important, but so is the number of bedrooms, the location, and even how old the house is. That’s where Multiple Linear Regression (MLR) comes in, and it’s pretty great because it lets you consider all these factors at once, helping you get a clearer idea of what’s actually going on.

Think of it like making a smoothie. If you only use one ingredient, say bananas, you’re stuck with just that one flavor. But if you mix in some strawberries and blueberries, you get a much richer, more complex taste. In the same way, MLR looks at several factors—like the size, number of bedrooms, and location of a house—to predict the house price.

Here’s how it works:

The relationship between the dependent and independent variables in MLR is represented by an equation like this:


Y = β₀ + β₁ X₁ + β₂ X₂ + ⋯ + βₙ Xₙ + ε

Let’s break that down:

  • Y is the dependent variable, the thing you’re trying to predict—in this case, the house price.
  • β₀ is the intercept, the starting point of your model. It’s like the baseline from where the rest of the model builds.
  • β₁, β₂, …, βₙ are the coefficients. These tell you how much each independent variable affects the outcome—basically, how important each factor is in determining the result.
  • X₁, X₂, …, Xₙ are the independent variables, like the size of the house, the number of bedrooms, or the location.
  • Finally, ε is the error term. It accounts for all the small things the model doesn’t capture—like factors that might influence the price but aren’t included in the model.
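
To see the equation in action, here is a tiny sketch with made-up numbers. The coefficients below are purely illustrative (not fitted from any dataset); the point is just how the pieces of the equation combine into a prediction:

import numpy as np

# Hypothetical model for illustration only:
# price = 50000 + 200 * size_sqft + 10000 * bedrooms
beta_0 = 50000                         # intercept (β₀)
betas = np.array([200.0, 10000.0])     # β₁ (size), β₂ (bedrooms)
x = np.array([1500.0, 3.0])            # X₁ = 1,500 sq ft, X₂ = 3 bedrooms

predicted_price = beta_0 + betas @ x   # Y, ignoring the error term ε
print(predicted_price)                 # 380000.0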

Let’s make this a little more relatable with a real-world example:

Imagine you’re looking at a house, and you want to predict its price. Here’s how the variables fit in:

  • The dependent variable? The house price—this is what you’re trying to predict.
  • The independent variables? Things like the house’s size (how big it is), the number of bedrooms, and its location (is it in a fancy neighborhood or next to a freeway?).

With multiple linear regression, you can see how each of these factors plays a role in determining the price. Maybe the size of the house has a big impact, but the number of bedrooms doesn’t matter much. And, of course, location might be the biggest factor—if it’s in a great spot, the price could shoot up. By plugging all these factors into one model, MLR gives you a more accurate price prediction than if you only considered one thing at a time.

And here’s the cool part: this technique is used everywhere. Whether you’re analyzing the stock market, predicting how diseases progress in biology, or designing something in engineering, MLR helps make sense of complex problems where a lot of different factors are at play. It’s like reading a map and trying to understand how all the roads and landmarks connect so you can get to your destination.

Assumptions of Multiple Linear Regression

Let’s dive into a story about Multiple Linear Regression (MLR). Imagine you’re about to build a model to predict something important, like housing prices. But before you get into the fun part of making predictions, there are a few assumptions you need to check—think of them as the foundation of a house. If the foundation isn’t solid, the house (or in this case, the model) might collapse. So, let’s walk through these assumptions, one by one. It’s like prepping a recipe before you cook. If you miss a key ingredient or step, the dish might not turn out how you want. But if you follow the steps correctly, you’ll have a great model!

Linearity

Here’s the first assumption: Linearity. For MLR to work, the relationship between the dependent variable (the thing you’re predicting, like house price) and the independent variables (the factors influencing that price, such as house size or number of rooms) must be linear. In simpler terms, the price should change in a straight-line relationship with the factors.

Imagine this: you’re tracking the price of a house, and every time the number of rooms increases by one, the price increases by a fixed amount. If this doesn’t happen and the price starts to change unpredictably or non-linearly, your model could make wildly inaccurate predictions. You can check for linearity by plotting scatter plots of the dependent variable against each independent variable. Look for that nice straight line—if you see it, you’re in business!
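
For example, a quick way to eyeball linearity is to scatter each candidate feature against the target. Here is a minimal sketch, assuming the housing_df DataFrame that we build in the preprocessing section below:

import matplotlib.pyplot as plt

# Scatter each feature against the target to eyeball linearity
for feature in ['MedInc', 'AveRooms', 'AveOccup']:
    plt.scatter(housing_df[feature], housing_df['MedHouseValue'], alpha=0.3)
    plt.xlabel(feature)
    plt.ylabel('MedHouseValue')
    plt.title(feature + ' vs. MedHouseValue')
    plt.show()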

Independence of Errors

Next up is Independence of Errors. This assumption is about the residuals (the differences between the predicted and actual values) being independent from each other. Think of residuals as those little mistakes the model makes when predicting. If these mistakes start to show a pattern (maybe they’re connected in some way), it means the model missed something important.

For example, if you’re predicting house prices over time and your residuals show a trend (like prices always being too low during a certain period), this suggests your model hasn’t caught something in the data. To test this, you can use the Durbin-Watson test. If you get a value close to 2, that’s a good sign—the errors are independent, just like you want them.
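
Here is a minimal sketch of that check, assuming model_sm is the fitted statsmodels model we build later in this article:

from statsmodels.stats.stattools import durbin_watson

dw = durbin_watson(model_sm.resid)      # residuals from the fitted OLS model
print("Durbin-Watson statistic:", dw)   # values near 2 suggest independent errors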

Homoscedasticity

Now, let’s talk about Homoscedasticity—that’s a fancy word, right? But don’t worry! This assumption just means that the spread of residuals should remain constant across all levels of the independent variables. In simpler terms, as you go through the data, the amount of error in your predictions should stay about the same.

Imagine your predictions about house prices. Whether you’re predicting for a small studio or a massive mansion, the spread of errors (the residuals) should be pretty consistent. If it’s not, and the errors start growing larger as the house price increases, that’s a problem. You can check this with a plot of residuals against the predicted values. A nice random scatter means you’re good to go, but if you see a funnel or cone shape, you’ve got yourself a case of heteroscedasticity—and that’s something you’ll need to fix.

No Multicollinearity

Here’s another biggie: No Multicollinearity. Multicollinearity happens when two or more independent variables are too closely related to each other. It’s like trying to predict house prices using both the square footage of a house and the number of rooms. These two are often highly correlated, which can confuse your model about which factor is really affecting the price.

The solution? You need to check for this using the Variance Inflation Factor (VIF). If you see a VIF over 10, that’s a red flag! It means one or more of your independent variables is too similar to another. To fix this, you might need to remove or combine some variables. Think of it like cleaning up a cluttered room—you just need to clear out the duplicates.
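
Here is a hedged sketch of that VIF check with statsmodels, assuming X is the DataFrame of independent variables we select later in this article:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_const = sm.add_constant(X)   # include the intercept column when computing VIFs
vif = pd.DataFrame({
    'feature': X_const.columns,
    'VIF': [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])]
})
print(vif)   # a VIF above roughly 10 flags problematic multicollinearity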

Normality of Residuals

This one’s important for accurate hypothesis testing: Normality of Residuals. In order for your tests to be valid, the residuals (remember, those are the model’s prediction errors) should follow a normal distribution. If they don’t, any statistical tests you run, like t-tests or F-tests, might not be reliable.

To check this, you can use a Q-Q plot. This is a simple visual tool that shows you if your residuals follow a straight line, which would indicate normality. If your plot looks like a line, you’re golden. If it’s wobbly or bent, then the residuals are probably not normal, and you might need to adjust your model.

Outlier Influence

Finally, let’s talk about Outlier Influence. We’ve all seen those data points that are way out in left field, right? Well, these outliers—also known as high-leverage points—can cause big problems in your regression model. They can pull your predictions off-course and make your model less reliable.

You’ll want to identify these outliers and decide what to do with them. Should you remove them or use a more robust regression method that can handle them? You can find outliers with leverage plots or by calculating Cook’s distance. If you spot any outliers, take a closer look—don’t let them skew your model’s predictions.
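
One possible way to run that check, assuming model_sm is the fitted statsmodels model from later in this article:

import matplotlib.pyplot as plt

influence = model_sm.get_influence()
cooks_d, _ = influence.cooks_distance   # one Cook's distance per observation

plt.stem(cooks_d)
plt.xlabel('Observation')
plt.ylabel("Cook's distance")
plt.title("Cook's Distance per Observation")
plt.show()

threshold = 4 / len(cooks_d)   # a common rule of thumb, not a hard cutoff
print("Potentially influential points:", (cooks_d > threshold).sum())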

And there you have it! These assumptions might seem like a lot to keep track of, but they’re essential to making sure that your Multiple Linear Regression model is on the right track. If any of these assumptions are violated, your model could lead you astray, offering misleading results. So, it’s a good idea to check these assumptions before jumping into the analysis—just like you wouldn’t bake a cake without checking the recipe!

Preprocess the Data

Alright, let’s imagine you’re about to embark on a journey to predict house prices using Multiple Linear Regression in Python. You’ve got this trusty California Housing Dataset at your side, filled with information about houses, and you’re ready to figure out how different factors—like house size, number of rooms, and location—affect the price. But before you get into the thick of the analysis, there’s some prep work to do, like cleaning up the data, picking the right features, and getting everything ready for the big regression showdown.

Step 1 – Load the Dataset

The first step in this grand adventure is to load the dataset. The California Housing Dataset is a well-known dataset in the world of machine learning. It’s a perfect starting point for learning about regression tasks since it’s packed with details like house sizes, average number of rooms, population in the neighborhood, and—most importantly—house prices.

To kick things off, you’ll need to install a few Python packages. It’s like gathering your tools before you start fixing the car. Use the following command:

$ pip install numpy pandas matplotlib seaborn scikit-learn statsmodels

With your tools ready, you’re good to go! Now, let’s import the necessary libraries to load the dataset and start tinkering with it:


from sklearn.datasets import fetch_california_housing # To fetch the California Housing dataset
import pandas as pd # For data manipulation
import numpy as np # For numerical operations

Now, let’s load the actual data and get it into a pandas DataFrame. Think of this as organizing your workbench before you start building:


housing = fetch_california_housing() # This loads the dataset
housing_df = pd.DataFrame(housing.data, columns=housing.feature_names) # Creating the DataFrame
housing_df['MedHouseValue'] = housing.target # Add the target column: house price

You’re almost there! To see what you’re working with, print out the first few rows of the dataset:

print(housing_df.head())

You’ll see something like this:

Output

MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  MedHouseValue
0 8.3252  41.0  6.984127  1.023810  322.0  2.555556  37.88  -122.23  4.526
1 8.3014  21.0  6.238137  0.971880  2401.0  2.109842  37.86  -122.22  3.585
2 7.2574  52.0  8.288136  1.073446  496.0  2.802260  37.85  -122.24  3.521
3 5.6431  52.0  5.817352  1.073059  558.0  2.547945  37.85  -122.25  3.413
4 3.8462  52.0  6.281853  1.081081  565.0  2.181467  37.85  -122.25  3.422

And just like that, you’ve got your hands on the data!

Understanding the Variables

Now, let’s break down what we’re looking at here. The dataset contains several variables (or features), and each one plays a role in predicting house prices (the dependent variable, MedHouseValue):

  • MedInc: The median income in the block
  • HouseAge: The median age of houses in the block
  • AveRooms: The average number of rooms in each house
  • AveBedrms: The average number of bedrooms in each house
  • Population: The total population of the block
  • AveOccup: The average number of people per household
  • Latitude and Longitude: The geographical location of the houses
  • MedHouseValue: The median price of houses (our target!)

Step 2 – Preprocess the Data

Before diving into regression, we need to make sure the data is clean. A messy dataset is like trying to build a house with broken bricks—it just won’t work. Let’s check for missing values:

print(housing_df.isnull().sum())

This will show if there are any missing values. Ideally, you want the output to look something like this:

Output

MedInc  0
HouseAge  0
AveRooms  0
AveBedrms  0
Population  0
AveOccup  0
Latitude  0
Longitude  0
MedHouseValue  0
dtype: int64

No missing values? Great! You can move on and start selecting the most relevant features for the regression model.
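
If your own dataset does have gaps, one simple option (a sketch, not the only approach) is median imputation with scikit-learn's SimpleImputer:

from sklearn.impute import SimpleImputer

# Only needed when missing values are present; the California data has none
imputer = SimpleImputer(strategy='median')   # the median is robust to outliers
housing_df = pd.DataFrame(imputer.fit_transform(housing_df), columns=housing_df.columns)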

Feature Selection

Here’s where the fun really begins: choosing the right features for the model. You need to figure out which features (independent variables) have the strongest connection to the house prices (dependent variable). This is where the magic of correlation comes in.

To get a clearer picture, create a correlation matrix to see how closely each feature is related to house prices. The correlation matrix is like a map showing how all the variables are connected.


correlation_matrix = housing_df.corr() # Create the correlation matrix
print(correlation_matrix['MedHouseValue']) # See how each feature correlates with the house price

After running this, you might get output like this:

Output

MedInc  0.688075
HouseAge  0.105623
AveRooms  0.151948
AveBedrms  -0.046701
Population  -0.024650
AveOccup  -0.023737
Latitude  -0.144160
Longitude  -0.045967
MedHouseValue  1.000000

From here, it’s easy to see that MedInc (median income) has the strongest relationship with MedHouseValue. AveRooms (average number of rooms) also has a positive correlation, but it’s not as strong. AveOccup (average occupancy) has a weak negative correlation, suggesting that as occupancy increases, the house price tends to decrease—though not by much.

Selecting Features for the Model

Now, based on the correlations, we’ll pick the features that matter most: MedInc, AveRooms, and AveOccup. These three are the stars of the show:


selected_features = ['MedInc', 'AveRooms', 'AveOccup']
X = housing_df[selected_features] # Use these features for our model
y = housing_df['MedHouseValue'] # Our target is the house price

Step 3 – Scaling the Features

Now that you’ve selected your features, it’s time to scale them. This is where we make sure all the features are on the same level playing field. You don’t want one feature—say, house size—to dominate just because it’s measured in square feet, while income is measured in thousands of dollars.

To fix this, we’ll use StandardScaler from scikit-learn, which standardizes the data by adjusting it to have a mean of 0 and a standard deviation of 1. This helps the regression model perform better, without being swayed by one feature having larger numbers:


from sklearn.preprocessing import StandardScaler # Import the scaler
scaler = StandardScaler() # Initialize the scaler
X_scaled = scaler.fit_transform(X) # Apply scaling

Now, let’s take a look at the scaled data:

print(X_scaled)

The output will show something like this:

Output

[[ 2.34476576  0.62855945 -0.04959654]
 [ 2.33223796  0.32704136 -0.09251223]
 [ 1.7826994   1.15562047 -0.02584253]
 ...
 [-1.14259331 -0.09031802 -0.0717345 ]
 [-1.05458292 -0.04021111 -0.09122515]
 [-0.78012947 -0.07044252 -0.04368215]]

Now that all the features are standardized, they’re ready to be fed into the model. You’re one step closer to predicting house prices with Multiple Linear Regression! By following these steps, you’ve prepped your data—cleaned it, selected the best features, and standardized it—making sure everything is in tip-top shape for the next phase.

And that’s it! You’ve just set the stage for creating a model that will predict house prices based on these carefully selected features. Pretty cool, right?

California Housing Dataset

Implement Multiple Linear Regression

Okay, we’re almost there! You’ve completed the initial steps of cleaning and preparing your data, so now it’s time for the fun part—implementing multiple linear regression in Python. Think of this like building a house from scratch; now that we’ve got the materials, it’s time to put everything together and see what we can create! We’re going to walk through splitting the data, training the model, and evaluating how well it predicts house prices. Buckle up!

Step 1: Import Libraries and Prepare the Data

Before diving into the code, let’s gather all the necessary tools. These are your trusty sidekicks for the journey ahead. Here’s what we’ll need to import:


from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns

These libraries are like your toolkit for the project. From scikit-learn for machine learning tasks to matplotlib and seaborn for visualizing the results, you’ve got everything you need.

Step 2: Split the Data into Training and Testing Sets

Before we start building our model, we need to split the data into two parts: one for training and one for testing. Think of training as teaching your model and testing as checking its homework. We’ll use scikit-learn’s train_test_split function to do this. 80% of the data will be used to train the model, and the other 20% will be held back for testing.


X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

By doing this, we ensure that we’re training the model on one set of data, then testing it on another set. This gives us an idea of how well the model will perform on new, unseen data.

Step 3: Train the Linear Regression Model

Now comes the exciting part: training the model! This is where Linear Regression steps in, the hero of the show. It takes the independent variables (like house size, number of rooms, etc.) and tries to learn their relationship with the dependent variable (house price). Here’s how we train it:


model = LinearRegression()
model.fit(X_train, y_train)

Now the model has learned from the training data. It knows how different factors like MedInc, AveRooms, and AveOccup influence the price of a house. But does it know enough to make predictions? Let’s find out!

Step 4: Make Predictions and Evaluate the Model

The next step is to see how well our trained model can predict house prices. We’ll use it to predict prices based on the test data and then evaluate its performance. We’ll focus on two metrics: Mean Squared Error (MSE) and R-squared (R²).


y_pred = model.predict(X_test)
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("R-squared:", r2_score(y_test, y_pred))

Let’s break it down:

  • MSE measures how far off the predictions are from the actual values. Lower is better here.
  • R-squared tells us how well the independent variables explain the variance in the house prices. A higher R² means the model is doing a better job of explaining the price.

You might see something like this:

Output

Mean Squared Error: 0.7007
R-squared: 0.4653

That means the average squared difference between the predicted and actual house prices is about 0.70; taking the square root, a typical prediction is off by roughly 0.84 units of the target, which is measured in hundreds of thousands of dollars (not terrible, but could be better). And the model explains about 46.53% of the variance in house prices. It’s not perfect, but it’s a good start!

Step 5: Visualize the Results

To really understand how our model is performing, let’s make some plots. First up, we’ll create a Residual Plot to see the difference between the predicted and actual values. If everything’s going smoothly, the residuals should be randomly scattered around zero. If there’s a pattern, that could signal a problem.


residuals = y_test - y_pred
plt.scatter(y_pred, residuals, alpha=0.5)
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.axhline(y=0, color='red', linestyle='--')
plt.show()

Next, let’s look at the Predicted vs. Actual Plot. This one’s really cool because it shows how closely the predicted house prices match the actual prices. If our model was perfect, the points would all lie along that red line.


plt.scatter(y_test, y_pred, alpha=0.5)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Predicted vs Actual Values')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=4)
plt.show()

Step 6: Using Statsmodels for Statistical Analysis

While scikit-learn has served us well for the machine learning aspects, statsmodels is here to offer more detailed statistical analysis. Think of it like getting the “behind-the-scenes” information about the model—like how each variable affects the outcome.

Let’s use statsmodels to fit a regression model and get detailed stats:


import statsmodels.api as sm
X_train_sm = sm.add_constant(X_train)  # Add an intercept term to the model
model_sm = sm.OLS(y_train, X_train_sm).fit()

And now, let’s print the summary of our model:


print(model_sm.summary())

Here’s what the output might look like:

Output

==============================================================================
Dep. Variable:          MedHouseValue   R-squared:                       0.485
Model:                            OLS   Adj. R-squared:                  0.484
Method:                 Least Squares   F-statistic:                    5173.0
Date:                Fri, 17 Jan 2025   Prob (F-statistic):                0.00
Time:                        09:40:54   Log-Likelihood:                -20354.0
No. Observations:               16512   AIC:                          4.072e+04
Df Residuals:                   16508   BIC:                          4.075e+04
Df Model:                           3   Covariance Type:              nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.0679      0.006    320.074      0.000       2.055       2.081
x1             0.8300      0.007    121.245      0.000       0.817       0.843
x2            -0.1000      0.007    -14.070      0.000      -0.114      -0.086
x3            -0.0397      0.006     -6.855      0.000      -0.051      -0.028
==============================================================================

Output Interpretation:

  • The R-squared value here, 0.485, is computed on the training data, so it sits a bit above the 0.4653 we measured on the held-out test set; in other words, the model explains about 48.5% of the variance in house prices for the data it was trained on.
  • We also see that the coefficient for MedInc (median income) is 0.8300. Because the features were standardized, this means a one-standard-deviation increase in median income raises the predicted house value by about 0.83 units, assuming all other factors stay the same.

Model Diagnostics:

  • Omnibus Test: High value = residuals not normally distributed.
  • Durbin-Watson: Value close to 2 means no significant autocorrelation in residuals.
  • Jarque-Bera: High value = residuals are not normally distributed.
  • Skewness: Residuals are a bit skewed to the right.
  • Kurtosis: Heavier tails than a normal distribution.

But don’t worry too much—this isn’t unusual in real-world data!
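
If you’d rather pull these diagnostics out of the fitted model programmatically instead of reading them off the summary, here is one way to do it with statsmodels’ helper functions:

from statsmodels.stats.stattools import durbin_watson, jarque_bera

dw = durbin_watson(model_sm.resid)
jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(model_sm.resid)

print("Durbin-Watson:", dw)
print("Jarque-Bera:", jb_stat, "p-value:", jb_pvalue)
print("Skewness:", skew, "Kurtosis:", kurtosis)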

And that’s it! By applying multiple linear regression, we’ve built a model that predicts house prices based on important features like median income, average number of rooms, and average occupancy. We’ve also evaluated its performance with mean squared error and R-squared and dug deeper with statsmodels. You’ve just unlocked a whole new world of predictive analysis! 🎉

A Comprehensive Guide to Regression Models (2023)

Using statsmodels

Alright, now that we’ve talked about the basics of multiple linear regression, it’s time to dive a little deeper into statsmodels, a Python library that’s a total game-changer for anyone doing statistical analysis. It’s like the Swiss army knife of statistics, giving you access to a wide range of models and tests. Whether you’re working with linear regression, time series analysis, or something a bit more complex, statsmodels has got your back. In this story, we’re going to use statsmodels to fit a linear model to our data and see what kind of magic we can work!

Using statsmodels for Linear Regression

Here’s the fun part. You’ll start by using statsmodels to build a linear regression model. But before we get into the code, let me walk you through it a bit. You’ll need to add a constant (which is just the intercept term in our equation), fit the model using Ordinary Least Squares (OLS), and then display the summary. It’s kind of like setting up your ingredients, cooking the meal, and then sitting down to savor the details.

Let’s take a look at the code to do that:


import statsmodels.api as sm  # Import the statsmodels API
# Add a constant to the model (intercept term)
X_train_sm = sm.add_constant(X_train)
# Fit the model using Ordinary Least Squares (OLS)
model_sm = sm.OLS(y_train, X_train_sm).fit()
# Display the model summary
print(model_sm.summary())

This code does a couple of things:

  • Add the constant: We’re adding the intercept term (the β₀ in our equation) because it’s important to have that baseline value in our regression.
  • Fit the model: We’re training the model using the data. This is where the model learns how the independent variables (like income, room sizes, etc.) influence the dependent variable (house price).
  • Display the summary: The magic happens here—the summary gives us all the important stats that show how well the model is doing.

Model Summary Output

Once you run this, you’ll get a model summary that looks something like this:

Output

==============================================================================
Dep. Variable:          MedHouseValue   R-squared:                       0.485
Model:                            OLS   Adj. R-squared:                  0.484
Method:                 Least Squares   F-statistic:                    5173.0
Date:                Fri, 17 Jan 2025   Prob (F-statistic):                0.00
Time:                        09:40:54   Log-Likelihood:                -20354.0
No. Observations:               16512   AIC:                          4.072e+04
Df Residuals:                   16508   BIC:                          4.075e+04
Df Model:                           3   Covariance Type:              nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.0679      0.006    320.074      0.000       2.055       2.081
x1             0.8300      0.007    121.245      0.000       0.817       0.843
x2            -0.1000      0.007    -14.070      0.000      -0.114      -0.086
x3            -0.0397      0.006     -6.855      0.000      -0.051      -0.028
==============================================================================

Interpretation of Model Summary

Okay, so what does all this mean? Let me break it down for you:

  • R-squared: This is one of the most important stats. It tells us how well the model is explaining the variation in the house prices. In our case, the R-squared value is 0.485, meaning the model explains 48.5% of the variation in the target variable (house price). It’s a good start, but not perfect. We still have some room to improve.
  • Adjusted R-squared: This value is very similar to R-squared, but it takes into account how many variables are in the model. It helps make sure we’re not just fitting a model that looks good but doesn’t really work well. Here, it’s 0.484, which is pretty much the same as R-squared, suggesting we haven’t overcomplicated the model.
  • F-statistic: This one tells us how much better our model is compared to a model with no predictors. With an F-statistic of 5173.0 and a p-value of 0.00, the predictors are jointly statistically significant. Keep in mind that significance speaks to the reliability of the relationship, not to how large R-squared is.
  • Coefficients: These numbers are the heart of your model. They tell us how each independent variable affects the dependent variable (house price). Because the features were standardized, each coefficient describes the effect of a one-standard-deviation change (a small sketch right after this list shows how to convert them back to original units). The MedInc coefficient is 0.8300, which means a one-standard-deviation increase in median income raises the predicted house value by about 0.83 units. For AveRooms and AveOccup, the coefficients are negative, meaning that, with the other variables held fixed, higher values of these features are associated with slightly lower predicted prices.
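
If you would rather read the slopes in the original units of each feature, you can undo the scaling by dividing each standardized slope by the standard deviation the scaler learned. A minimal sketch, assuming model_sm, scaler, and selected_features from the earlier steps:

import pandas as pd

# model_sm.params holds [const, x1, x2, x3]; skip the intercept and rescale the slopes
original_unit_coefs = model_sm.params[1:].values / scaler.scale_
print(pd.Series(original_unit_coefs, index=selected_features))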

Model Diagnostics

Next, we get into the diagnostics. These stats give us more insight into how well the model is performing and if there are any red flags we need to address:

  • Omnibus Test: This test tells us if the residuals (errors) are normally distributed. A high value here suggests they’re not. In our case, the statistic is 3981.290, which means our residuals aren’t perfectly normal.
  • Durbin-Watson: This checks for autocorrelation in the residuals. A value close to 2 means there’s no significant autocorrelation. Ours is 1.983, so we’re good here.
  • Jarque-Bera Test: This test further checks if the residuals are normally distributed. The value is 11583.284, which suggests the residuals deviate from normality.
  • Skewness: 1.260 means our residuals are right-skewed, with a long tail of unusually large positive residuals rather than a symmetric spread around zero.
  • Kurtosis: 6.239 suggests our residuals have a higher peak and heavier tails than a normal distribution, meaning there might be some outliers affecting the model more than expected.

Q-Q Plot for Residuals

To check out how well our residuals match a normal distribution, we can plot them in a Q-Q plot. This plot compares the distribution of residuals to a normal distribution. Here’s how we do it:


sm.qqplot(model_sm.resid, line='s')
plt.title('Q-Q Plot of Residuals')
plt.show()

Ideally, if our residuals are normally distributed, the points should align along the diagonal line. If they deviate significantly, that’s a sign we might need to adjust the model.
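
As a small extra step, the same fitted results object can also generate predictions for the held-out test set, as long as the test features get the same constant column added:

X_test_sm = sm.add_constant(X_test)     # match the design matrix used for training
y_pred_sm = model_sm.predict(X_test_sm)
print(y_pred_sm[:5])                    # first few predicted median house values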

Conclusion

Using statsmodels for linear regression in Python is incredibly powerful. You’ve got everything you need to understand how well your model fits the data: coefficients, R-squared values, diagnostic tests, and even a way to check the residuals. With statsmodels, you’re not just running a regression model—you’re diving into the heart of your data, uncovering hidden relationships, and making predictions that are backed by strong statistical analysis. Keep using these tools, and you’ll become a stats wizard in no time!

statsmodels documentation

Cross-Validation Techniques

Imagine you’ve built a shiny new machine learning model, and you’re eager to see how well it performs. The catch? You don’t have a ton of data to test it with, and you’re worried that if you just train and test on the same set, the model might get a bit too comfortable with that data, essentially memorizing it. This is where cross-validation swoops in, like a trusty sidekick, ensuring that your model isn’t just good with the data it’s seen but can generalize to new, unseen data.

Cross-validation is a technique that lets you split your data into multiple subsets, then train and test the model on each subset. Why is this important? Because it helps us avoid overfitting—that nasty tendency where a model works great on the training data but falls apart when it encounters real-world scenarios. Cross-validation helps give you a much clearer, more reliable estimate of the model’s true performance. It also ensures that the model is more adaptable, rather than overly attached to one specific set of data.

Here’s the cool part: The number of times you split the data is controlled by a parameter called k. Think of k as the number of “folds” or chunks your data gets divided into. In k-fold cross-validation, the data is split into k equal parts. The model gets trained on k-1 of these parts, with the remaining part used for testing. Then, the process repeats for each fold, so every data point gets a turn at being in the test set.
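
To make those folds a bit more concrete, here is a minimal sketch of the same idea done by hand with scikit-learn’s KFold, assuming the X_scaled and y arrays from the preprocessing section (the convenience function in the next part automates exactly this loop):

from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(kf.split(X_scaled), start=1):
    # Train on k-1 folds, score on the held-out fold
    fold_model = LinearRegression().fit(X_scaled[train_idx], y.iloc[train_idx])
    r2 = fold_model.score(X_scaled[test_idx], y.iloc[test_idx])
    print(f"Fold {fold} R^2: {r2:.4f}")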

Implementation in Python

Now, let’s get our hands dirty with some code. To implement k-fold cross-validation in Python, we’re going to use the cross_val_score function from the scikit-learn library. This handy function automates the resampling process for us and spits out performance metrics for each fold. Here’s how you can do it:

First, we’ll need to import the necessary function from sklearn.model_selection :


from sklearn.model_selection import cross_val_score

Next, let’s run the cross-validation on our model. We’ll use the R-squared metric, which will tell us how well the model explains the variance in the dependent variable (in our case, house prices):


scores = cross_val_score(model, X_scaled, y, cv=5, scoring='r2')
print("Cross-Validation Scores:", scores)
print("Mean CV R^2:", scores.mean())

Here, cv=5 tells the function to use 5 folds for the cross-validation. The scoring='r2' argument ensures that we evaluate the model's performance using the R-squared metric. R-squared tops out at 1, and it can even dip below 0 for a model that fits worse than simply predicting the mean, so values closer to 1 indicate a better fit.

Visualizing Cross-Validation Results

Once we have the scores, it’s time to visualize them. A line plot can help us understand how consistent the model’s performance is across the different data subsets. Think of it like checking how your favorite sports team plays in various conditions—are they consistent or does their performance depend on the opponent?

Here’s how you can visualize it:


# Line Plot for Cross-Validation Scores
plt.plot(range(1, 6), scores, marker='o', linestyle='--')
plt.xlabel('Fold')
plt.ylabel('R-squared')
plt.title('Cross-Validation R-squared Scores')
plt.show()

Output Explanation

So, what kind of output can you expect from running the cross-validation? Here’s an example:

Output
Cross-Validation Scores: [0.42854821 0.37096545 0.46910866 0.31191043 0.51269138]
Mean CV R^2: 0.41864482644003276

Let’s break it down:

  • Cross-Validation Scores: Each value in this array represents the R-squared score for each fold. The values range from 0.3119 to 0.5127, showing that the model’s performance varies across the different data subsets. This variation is normal, but it tells us that some data points may have a bigger influence on the model than others.
  • Mean CV R²: This is the average R-squared across all the folds. In this case, the Mean CV R² is 0.4186, meaning that, on average, our model explains about 41.86% of the variance in house prices. It’s not a perfect fit (we can always improve), but it’s a good start. The closer this value is to 1, the better our model’s predictions are.

Interpretation of the Results

What does all this tell us? Well, the cross-validation scores suggest that our model’s performance is somewhat inconsistent, with scores ranging from 0.3119 to 0.5127. This means the model’s effectiveness might depend on the specific data it’s trained on. If you think of it like trying to predict the weather, sometimes you get sunny days (high R-squared) and other times it’s a bit cloudy (lower R-squared).

With a mean R² of 0.4186, it’s clear that the model is capturing some relationships, but it could definitely use some tuning. This suggests that adding more features or experimenting with other techniques (like regularization) could help improve the model’s accuracy.
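
As a hedged illustration of that regularization idea, you could swap plain LinearRegression for Ridge and see whether the cross-validated R-squared moves (alpha=1.0 below is just a starting point, not a tuned value):

from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Same 5-fold setup as before, but with an L2-penalized linear model
ridge_scores = cross_val_score(Ridge(alpha=1.0), X_scaled, y, cv=5, scoring='r2')
print("Ridge mean CV R^2:", ridge_scores.mean())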

Conclusion

Cross-validation is a powerful tool in your machine learning toolkit, especially when you’re working with limited data. By splitting your data into multiple folds and training/testing the model on different combinations of those folds, you get a more reliable estimate of how well your model will perform in the real world. It’s like getting a sneak peek at how your model behaves across different scenarios, making sure it’s not just memorizing the data but truly learning how to generalize. With the right tweaks and further exploration of the data, you can make that R-squared value climb higher and higher!

Cross-Validation in Machine Learning

Feature Selection Methods

Feature selection—sounds a bit like a shopping spree, doesn’t it? You’re sifting through shelves, trying to pick the best items for your shopping cart. In the world of machine learning, feature selection is a crucial step, especially when you’re dealing with large datasets full of variables (or features). It’s about figuring out which features really matter for predicting the target variable and getting rid of those that just add noise. Think of it like choosing the perfect ingredients for a dish—only the best will make it to the table.

One popular method to do this is Recursive Feature Elimination (RFE). RFE is like a skillful chef who starts with a mountain of ingredients, tastes them all, and keeps only the most flavorful ones, gradually getting rid of the ones that don’t contribute to the perfect dish. The goal here? To reduce the complexity of the model while keeping the features that have the most impact on the final prediction.

Recursive Feature Elimination (RFE)

Now, let’s dive into how this technique actually works. RFE is a bit like a dance. It starts by training a model on all the features, ranking them by importance, and then one-by-one, it eliminates the least important features. This continues until we’re left with only the features that really matter.

Imagine we’re working with a dataset of house prices, where we have several features—things like house size, number of rooms, and location. We want to figure out which of these features play the biggest role in predicting the price. Here’s how we can do that using RFE in Python:


First, we need to import the necessary class from the sklearn.feature_selection module:

from sklearn.feature_selection import RFE

Then, we create an RFE object. This is where we specify the model we’ll use to evaluate feature importance. In this case, we’re going with LinearRegression. We also tell RFE to keep the top 2 most important features:

rfe = RFE(estimator=LinearRegression(), n_features_to_select=2)

Next, we fit the RFE object to the scaled features (X_scaled) and the target variable (y). This step runs the feature elimination process:

rfe.fit(X_scaled, y)

Now, we can check which features were selected. RFE gives us a boolean array where True means the feature was kept, and False means it was eliminated:

print("Selected Features:", rfe.support_)

Visualizing Feature Rankings

But what if you want to see which features are the most important? Well, RFE doesn’t leave you hanging. It also ranks the features based on their importance. The lower the rank, the more crucial the feature is to the model. So, we can visualize this ranking with a bar chart. Here’s the code to create that chart:


feature_ranking = pd.DataFrame({'Feature': selected_features, 'Ranking': rfe.ranking_})
feature_ranking.sort_values(by='Ranking').plot(kind='bar', x='Feature', y='Ranking', legend=False)
plt.title('Feature Ranking (Lower is Better)')
plt.ylabel('Ranking')
plt.show()

Output Interpretation

When you run the above code, the output will tell you which features were selected and how they rank. For example, it might look something like this:

Output
Selected Features: [ True True False ]

What does this mean? It means that the top two features—MedInc (Median Income) and AveRooms (Average Number of Rooms)—made the cut, while the third feature was eliminated. This makes sense when you consider the correlation matrix, where MedInc and AveRooms had a strong relationship with the target variable, MedHouseValue.

In other words, we’re focusing on the factors that really drive house prices, leaving the less influential ones behind.
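
A quick way to turn that boolean mask back into feature names, using the selected_features list and the fitted rfe object from above:

# Map the boolean mask back to the feature names from earlier
kept = [name for name, keep in zip(selected_features, rfe.support_) if keep]
print("Kept features:", kept)   # expected: ['MedInc', 'AveRooms']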

Conclusion

In the world of machine learning, feature selection is like picking out the best players for your team. You don’t need everyone on the field—just the stars that will help you win the game. Recursive Feature Elimination (RFE) helps you do exactly that. By iteratively eliminating the least important features, it helps ensure that the model is efficient and not bogged down by irrelevant data. In our house price prediction example, the two features that mattered most were MedInc (Median Income) and AveRooms (Average Number of Rooms)—the ones with the highest impact on predicting house prices. This method is essential when you’re working with large datasets and want to create a lean, high-performing model. With the power of RFE, you’re well on your way to building better, more focused machine learning models!

Recursive Feature Elimination (RFE) Explanation

FAQs

How do you implement multiple linear regression in Python?

Alright, let’s say you want to predict something based on multiple factors. Maybe you’re trying to figure out how house prices change based on the number of bedrooms, the size of the house, and the location. Enter multiple linear regression. This is where Python comes in handy. You can use libraries like scikit-learn or statsmodels to get the job done. Let’s check out a simple example using scikit-learn.

First, we need to import the necessary modules. Here’s how you’d set it up:


from sklearn.linear_model import LinearRegression
import numpy as np

Now, let’s create some example data. This will represent the predictors (things like size and number of rooms) and the target variable (house prices):


X = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])  # Predictor variables
y = np.array([5, 7, 9, 11])  # Target variable

Next, you’ll create and fit the linear regression model:


model = LinearRegression()
model.fit(X, y)

At this point, your model is trained and ready to give you some insights. To find the coefficients (how much each predictor variable contributes) and the intercept (the baseline value when all predictors are zero), you can use:


print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)

Finally, make some predictions:


predictions = model.predict(X)
print("Predictions:", predictions)

There you go! You just fit a multiple linear regression model, got the coefficients, and made predictions using scikit-learn. Simple, right?
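
If you also want p-values and a full statistical summary for the same toy data, statsmodels offers an equivalent route (note that these two toy predictors are perfectly correlated, which a full summary() would flag as multicollinearity):

import statsmodels.api as sm
import numpy as np

X = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])  # same toy predictors as above
y = np.array([5, 7, 9, 11])                     # same toy target

X_const = sm.add_constant(X)        # add the intercept term
results = sm.OLS(y, X_const).fit()
print(results.params)               # intercept first, then the two slopes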

What are the assumptions of multiple linear regression in Python?

When you’re working with multiple linear regression, there are a few key assumptions you need to be aware of. These assumptions help ensure that the model you’re building will give you reliable results. So, here’s what you need to keep in mind:

  • Linearity: This means the relationship between the predictors (independent variables) and the target variable is linear. In other words, a change in the predictors should cause a proportional change in the target. If this assumption is violated, the model might not perform well.
  • Independence: The observations (data points) should be independent of each other. Think of it like a group of people making decisions—if one person’s decision affects another, you can’t treat them as independent, and the model might give you skewed results.
  • Homoscedasticity: This is a fancy word that means the variance of the residuals (the difference between actual and predicted values) should be constant. If the spread of the residuals changes as you move along the x-axis, you might have a problem.
  • Normality of Residuals: The residuals should follow a normal distribution. You can check this assumption with a histogram of the residuals or a Q-Q plot to see whether the errors roughly track a normal curve.
  • No Multicollinearity: This means that your predictors shouldn’t be highly correlated with each other. If they are, it makes it hard to figure out which predictor is really causing the effect on the target variable. You can check this using Variance Inflation Factor (VIF).

These assumptions are crucial for building a valid model. If they’re violated, your model might give you unreliable predictions.

How do you interpret multiple regression results in Python?

Now that you’ve built your model, it’s time to interpret the results. What do the numbers mean? Let’s break it down:

  • Coefficients (coef_): These are the numbers that tell you how much each predictor affects the target variable. For example, if you have a coefficient of 2 for X1, it means that for each 1-unit increase in X1, the target variable will increase by 2 units, assuming all other variables stay the same.
  • Intercept (intercept_): This is the value of the target variable when all the predictors are zero. So, if X1 and X2 are both zero, the intercept tells you what the target variable will be.
  • R-squared: This number tells you how much of the variation in the target variable is explained by the model. If the R-squared value is 0.85, it means 85% of the variation in the target variable is explained by the predictors in the model. A higher R-squared is better, but don’t get too caught up in chasing a perfect score.
  • P-values: These values help you assess whether each predictor is statistically significant. If the p-value is less than 0.05, it suggests that the predictor has a significant impact on the target variable.

Together, these metrics help you understand how well your model is doing and whether the predictors you’ve chosen are truly meaningful.
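
With a fitted statsmodels results object (like model_sm from the earlier sections), all of these quantities are available directly as attributes:

print("Coefficients:", model_sm.params.values)   # intercept first, then slopes
print("R-squared:", model_sm.rsquared)
print("Adjusted R-squared:", model_sm.rsquared_adj)
print("P-values:", model_sm.pvalues.values)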

What is the difference between simple and multiple linear regression in Python?

Let’s talk about simple linear regression vs. multiple linear regression. Here’s the deal: The key difference is the number of independent variables involved.

In simple linear regression, you’re dealing with just one predictor variable. Imagine you’re trying to predict house prices based only on the number of bedrooms. The model is pretty straightforward, and interpreting it is relatively easy.

Now, multiple linear regression adds complexity. Instead of just one predictor, you’re using multiple ones—things like the number of bedrooms, square footage, and location. It’s like solving a puzzle with more pieces. This makes the model more flexible and better at handling complex relationships, but it also comes with challenges like overfitting, where the model gets too tailored to the training data.

So, in short:

  • Simple Linear Regression: One predictor, simpler, easier to interpret.
  • Multiple Linear Regression: Multiple predictors, more flexible, but harder to interpret and more prone to overfitting.

Here’s the quick breakdown:

Feature                         | Simple Linear Regression | Multiple Linear Regression
Number of Independent Variables | One                      | More than one
Model Equation                  | y = β₀ + β₁x + ε         | y = β₀ + β₁X₁ + β₂X₂ + ⋯ + βₙXₙ + ε
Model Complexity                | Less complex             | More complex
Risk of Overfitting             | Lower                    | Higher
Interpretability                | Easier                   | More challenging
Applicability                   | Simple relationships     | Complex relationships

So, the choice depends on what you’re working with. If you’ve got a lot of features and want to capture a more complex relationship, multiple linear regression is your best bet. But if you’re keeping it simple and just want to understand one key factor, simple linear regression might be all you need.

USA Annual House Price Growth Rate (2025)

Conclusion

In conclusion, mastering multiple linear regression (MLR) in Python with libraries like scikit-learn and statsmodels is essential for data scientists looking to build powerful, reliable models. We’ve walked through the key steps of preprocessing data, selecting relevant features, and addressing challenges like multicollinearity and outliers. With tools like R-squared, Mean Squared Error, and cross-validation, you can confidently evaluate model performance and refine your results for better accuracy. As you continue to work with Python, scikit-learn, and statsmodels, remember that feature selection and understanding key assumptions are vital for creating stable, high-performing regression models. Stay updated on future advancements in machine learning and explore new techniques to keep improving your modeling skills. For more insights on mastering regression models, continue experimenting with Python and scikit-learn to optimize your results and drive meaningful analysis.

Master Python Programming: A Beginner’s Guide to Core Concepts and Libraries (2025)

Alireza Pourmahdavi

I’m Alireza Pourmahdavi, a founder, CEO, and builder with a background that combines deep technical expertise with practical business leadership. I’ve launched and scaled companies like Caasify and AutoVM, focusing on cloud services, automation, and hosting infrastructure. I hold VMware certifications, including VCAP-DCV and VMware NSX. My work involves constructing multi-tenant cloud platforms on VMware, optimizing network virtualization through NSX, and integrating these systems into platforms using custom APIs and automation tools. I’m also skilled in Linux system administration, infrastructure security, and performance tuning. On the business side, I lead financial planning, strategy, budgeting, and team leadership while also driving marketing efforts, from positioning and go-to-market planning to customer acquisition and B2B growth.
