Master Multiple Linear Regression with Python, Scikit-learn, Statsmodels

Implementing multiple linear regression in Python using scikit-learn and statsmodels to predict house prices.


Introduction

Mastering multiple linear regression with Python, scikit-learn, and statsmodels is a crucial skill for data scientists looking to build predictive models. This article guides you through implementing MLR, from preprocessing data to evaluating model performance using techniques like cross-validation and feature selection. You’ll learn how to use powerful tools like scikit-learn and statsmodels to predict outcomes such as house prices based on key factors, including median income and room size. By the end, you’ll understand how to measure the model’s effectiveness with metrics like R-squared and Mean Squared Error.

What is Multiple Linear Regression?

Multiple Linear Regression is a statistical method used to predict an outcome based on several different factors. It helps to understand how different independent variables, like house size, number of bedrooms, and location, can influence a dependent variable, such as the price of a house. This method is applied by creating a mathematical model that explains the relationship between these variables and can be used to predict future values.

In more detail, MLR models how one thing (the dependent variable) relates to two or more other things (the independent variables). It’s an upgrade to simple linear regression, which only looks at the relationship between one dependent variable and one independent variable. With MLR, you’re diving deeper to see how multiple factors work together to influence the thing you’re trying to predict, and you can use the fitted model to predict future outcomes based on these relationships.

So, here’s the thing: multiple linear regression works on the idea that there’s a straight-line relationship between the dependent variable and the independent variables. What that means is, as the independent variables change, the dependent variable will change in a proportional way.

The formula for MLR looks like this:

𝑌 = 𝑏₀ + 𝑏₁𝑋₁ + 𝑏₂𝑋₂ + ⋯ + 𝑏ₙ𝑋ₙ + ϵ

Where:

  • 𝑌 is the dependent variable (the thing you want to predict),
  • 𝑋₁, 𝑋₂, … , 𝑋ₙ are the independent variables (the factors you think affect 𝑌),
  • 𝑏₀ is the intercept (basically where the line starts),
  • 𝑏₁, 𝑏₂, … , 𝑏ₙ are the coefficients (they show how much each independent variable impacts 𝑌),
  • ϵ is the error term, which covers any random fluctuations that can’t be explained by the model.

Let’s look at an example to make it clearer: imagine you’re trying to predict how much a house costs. Here, the price of the house would be the dependent variable 𝑌, and your independent variables 𝑋₁, 𝑋₂, 𝑋₃ might be things like the size of the house, the number of bedrooms, and where it’s located. In this case, you can use multiple linear regression to figure out how these factors (size, bedrooms, location) all come together to affect the price of the house.
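
To make the formula concrete, here’s a tiny sketch that plugs numbers into the MLR equation. The intercept, coefficients, and feature values below are made up purely to illustrate the arithmetic—they don’t come from any real model.

# Toy illustration of the MLR equation with made-up (hypothetical) numbers
b0 = 50_000                  # intercept
b = [120, 10_000, 25_000]    # illustrative coefficients: per sq ft, per bedroom, per location point
x = [1_500, 3, 2]            # illustrative house: 1,500 sq ft, 3 bedrooms, location score 2

predicted_price = b0 + sum(bi * xi for bi, xi in zip(b, x))
print(predicted_price)  # 50_000 + 120*1_500 + 10_000*3 + 25_000*2 = 310_000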

Now, the great thing about using multiple linear regression is that it looks at all these variables together. This gives you a more accurate prediction because it takes more factors into account. This is a lot better than simpler models that only look at one variable at a time. And when you think about real-life situations, we all know that more than one factor plays a part in most outcomes, right? So, MLR gives you a much clearer picture.

Read more about multiple linear regression techniques and applications in this detailed guide on Multiple Linear Regression and Its Applications.

Assumptions of Multiple Linear Regression

Before you dive into implementing Multiple Linear Regression (MLR), it’s really important to make sure that some key assumptions are met. These assumptions are like the foundation of a solid house—they help ensure that the regression model you’re working with is reliable and that the results you’re getting are meaningful. If you skip these steps, you might end up with predictions that are a bit off or even completely misleading. Let’s break down each assumption and see why it matters for MLR.

Linearity

The first assumption you need to check is that the relationship between the dependent variable and the independent variables is linear. What does that mean? Well, a change in an independent variable should lead to a proportional change in the dependent variable. To check this, you can use scatter plots or look at residuals for patterns. If the relationship isn’t linear, using linear regression could mess up your predictions. If this happens, you might need to transform your variables or try using a different model altogether.
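
If you want to eyeball linearity yourself, a quick scatter plot of a feature against the target works well. This is just a sketch: it assumes housing_df is the California Housing DataFrame loaded later in this tutorial, and uses that dataset’s column names.

import matplotlib.pyplot as plt

# Scatter plot of one feature against the target to eyeball linearity
# (housing_df is the DataFrame built in the preprocessing section below)
plt.scatter(housing_df['MedInc'], housing_df['MedHouseValue'], alpha=0.3)
plt.xlabel('Median Income')
plt.ylabel('Median House Value')
plt.title('Checking Linearity: MedInc vs MedHouseValue')
plt.show()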

Independence of Errors

Next up, the errors (or residuals) of your model need to be independent of one another. In simple terms, the error for one data point shouldn’t affect the error for another. To test for this, you can use the Durbin-Watson statistic, which helps check if there’s autocorrelation in your residuals. Autocorrelation happens often with time-series data, where errors might get all tangled up over time. If this assumption is broken, you might end up with underestimated standard errors and unreliable significance tests.
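
As a quick sketch, statsmodels ships a durbin_watson function you can run on the residuals of a fitted model. Here, model_sm is assumed to be the statsmodels OLS model fitted later in this article.

from statsmodels.stats.stattools import durbin_watson

# Values near 2 suggest little autocorrelation; values toward 0 or 4 suggest
# positive or negative autocorrelation (model_sm comes from the statsmodels section below)
dw = durbin_watson(model_sm.resid)
print("Durbin-Watson statistic:", dw)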

Homoscedasticity

This one’s a bit of a mouthful, but it’s important! The idea here is that the variance of your residuals should be the same no matter the value of your independent variables. If the variance isn’t constant (a situation called heteroscedasticity), it can mess with your regression coefficients and their statistical significance. You can use a residual plot to check this. If the plot looks like a funnel or has any patterns, it could mean your data doesn’t meet this assumption. If that happens, there are ways to fix it, like using weighted least squares regression.
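
Besides eyeballing a residual plot, you can run a Breusch-Pagan test from statsmodels as a rough check. This sketch assumes model_sm and X_train are the fitted OLS model and training features built later in the article.

import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# A small Breusch-Pagan p-value suggests the residual variance is not constant
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model_sm.resid, sm.add_constant(X_train))
print("Breusch-Pagan p-value:", lm_pvalue)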

No Multicollinearity

Here’s where things get interesting: in MLR, you don’t want your independent variables to be too closely related to each other. If they are, it’s called multicollinearity, and it can cause issues with the stability of your coefficient estimates. Basically, it makes it tough to figure out the effect of each independent variable on your dependent variable. You can use the Variance Inflation Factor (VIF) to spot multicollinearity. If the VIF is over 5 or 10, it’s time to investigate. If you do have multicollinearity, you might need to remove or combine some variables or even use principal component analysis (PCA).

Normality of Residuals

Your residuals should follow a normal distribution, especially if you’re planning on doing hypothesis testing or calculating confidence intervals. To check this, you can use a Q-Q plot or statistical tests like the Shapiro-Wilk test. If your residuals aren’t normal, don’t panic—it doesn’t mess with the predictions themselves, but it can throw off the accuracy of your p-values and confidence intervals. If that’s the case, transforming your variables might help.
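
Here’s a minimal sketch of the Shapiro-Wilk check using SciPy, again assuming model_sm is the fitted statsmodels model from later in the article.

from scipy.stats import shapiro

# p < 0.05 suggests the residuals deviate from normality; with large samples the test
# flags even tiny deviations, so pair it with a Q-Q plot rather than relying on it alone
stat, p_value = shapiro(model_sm.resid)
print("Shapiro-Wilk p-value:", p_value)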

Outlier Influence

Outliers are data points that stand out from the rest—like those really high or really low values that don’t seem to fit with the rest of your data. These outliers can have an outsized impact on your regression model, making the results less reliable. It’s important to identify these points and handle them properly. Tools like Cook’s Distance or leverage statistics can help you spot influential points. Now, don’t just remove outliers automatically—sometimes they’re important, but you do want to understand their impact on your model to make sure your predictions hold up.
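
For a rough look at influential points, you can pull Cook’s distance from a fitted statsmodels model. The 4/n cutoff below is just a common rule of thumb, and model_sm is the fitted OLS model from the statsmodels section later on.

# model_sm is the fitted statsmodels OLS model from later in this article
influence = model_sm.get_influence()
cooks_d, _ = influence.cooks_distance  # one distance per training observation

# Flag observations above the common 4/n rule-of-thumb threshold
threshold = 4 / len(cooks_d)
influential = [i for i, d in enumerate(cooks_d) if d > threshold]
print("Potentially influential points:", len(influential))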

Meeting these assumptions is key to building a solid multiple linear regression model. If one of these assumptions doesn’t hold up, it could mean your results are a bit off. In that case, you might need to look at other modeling techniques to get the most accurate predictions.

For a deeper understanding of the assumptions underlying multiple linear regression, explore this comprehensive resource on Multiple Linear Regression Assumptions.

Preprocess the Data

Data preprocessing is a super important step before you jump into using a Multiple Linear Regression (MLR) model. It’s like getting your data ready for the main event! You want to make sure everything is in tip-top shape before applying your fancy regression model. In this part, we’ll go through how to load, clean, and prep the data so it’s all set for modeling. Trust me, the better the prep, the better your model will perform. Preprocessing includes fixing missing values, picking the right features, and scaling those features to make sure everything’s consistent. Let’s dive in and see how to do it all with the California Housing Dataset.

Step 1 – Load the Dataset

The first thing you need to do is load your dataset. For this tutorial, we’re using the California Housing Dataset, which has all sorts of interesting data, like the median income, house age, average rooms per house, and, of course, the target variable—the median house value. It’s a popular dataset for regression tasks, so we’re in good company!

To load the dataset into Python, here’s the code you’ll need:


from sklearn.datasets import fetch_california_housing
import pandas as pd
import numpy as np

# Load the California Housing dataset using the fetch_california_housing function
housing = fetch_california_housing()
# Convert the dataset's data into a pandas DataFrame, using the feature names as column headers
housing_df = pd.DataFrame(housing.data, columns=housing.feature_names)
# Add the target variable 'MedHouseValue' to the DataFrame, using the dataset's target values
housing_df['MedHouseValue'] = housing.target
# Display the first few rows of the DataFrame to get an overview of the dataset
print(housing_df.head())

This little bit of code loads your dataset, turns it into a pandas DataFrame called housing_df, and adds the target variable ‘MedHouseValue’ (the median house value) to the mix. After running it, you can check out the first few rows of your data and get a good feel for how it’s structured.

Step 2 – Handle Missing Values

Now that the data is loaded, you’ve got to check for any missing values. Missing data can mess up your model’s performance, so you definitely don’t want that. Thankfully, the California Housing Dataset doesn’t have any missing values, but it’s always a good idea to double-check.

Here’s the code to do that:


print(housing_df.isnull().sum())

This code checks each column in the dataset and tells you if there are any missing values. If there are, you’ve got options. You can either fill in the missing data with something like the mean or median of the column (that’s called imputation), or you can just drop the rows or columns if they’re too messy. Whatever works for your data!
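
If your own dataset does have gaps, here’s a minimal sketch of the two options mentioned above. The California data doesn’t need either, so treat this as a hypothetical example.

# Option 1: impute numeric gaps with the column median
housing_df = housing_df.fillna(housing_df.median(numeric_only=True))

# Option 2: drop any rows that still contain missing values
housing_df = housing_df.dropna()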

Step 3 – Feature Selection

Next, it’s time to pick the features that matter the most. Feature selection is about deciding which independent variables (the ones you think will help predict the target) should actually make it into the model. One way to do this is by checking how strongly each feature is related to the target variable. If there’s a strong correlation, that feature is probably important.

You can check the correlation with this code:


correlation_matrix = housing_df.corr()
print(correlation_matrix['MedHouseValue'])

This will give you a nice matrix showing how each feature correlates with the target variable ‘MedHouseValue.’ You might find that things like ‘MedInc’ (median income) and ‘AveRooms’ (average number of rooms) have a strong correlation with house prices, while features like ‘HouseAge’ or ‘Latitude’ might be less important. Based on this, you can decide which features to keep in your model.
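
For example, you could keep every feature whose absolute correlation with the target clears a threshold. The 0.1 cutoff below is just an illustrative choice, and the resulting list becomes the selected_features (and the feature matrix X) used in the next steps.

# Keep features whose absolute correlation with the target exceeds a chosen threshold
target_corr = correlation_matrix['MedHouseValue'].drop('MedHouseValue')
selected_features = target_corr[target_corr.abs() > 0.1].index.tolist()
print("Selected features:", selected_features)

# These columns form the feature matrix X used in the scaling and splitting steps below
X = housing_df[selected_features]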

Step 4 – Feature Scaling

Feature scaling is all about making sure all your features are on the same playing field. Why? Well, some features might have a really big range (like income) while others might be smaller (like the number of rooms). This can mess with your model, especially in regression where we want everything to be on equal terms.

A popular technique for scaling is called standardization. This transforms all the features to have a mean of 0 and a standard deviation of 1, which is super helpful for MLR.

Here’s how you can scale your features using scikit-learn:


from sklearn.preprocessing import StandardScaler

# Initialize the StandardScaler object
scaler = StandardScaler()
# Fit the scaler to the data and transform it
X_scaled = scaler.fit_transform(X)
# Print the scaled data
print(X_scaled)

This code sets up a StandardScaler, fits it to your selected features (which are in X), and transforms them so everything’s on the same scale. After running it, you’ll have X_scaled, which is now ready to be used in your regression model.

Step 5 – Prepare the Data for Model Training

Now that the data is prepped and scaled, it’s time to split it into training and testing sets. This way, you can train your model on one set of data and test it on another to see how well it’s performing. You don’t want to test your model on the same data you trained it on, or else you won’t get an honest read on how well it’s working.

Here’s how you split the data:


from sklearn.model_selection import train_test_split

# Define the target variable (y); the scaled feature matrix X_scaled was created in the previous step
y = housing_df['MedHouseValue']
# Split the scaled features and target into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
# Display the shapes of the training and testing sets
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

This code takes the scaled features (X_scaled) and the target variable (y), and then splits them into training and testing sets. We’re using 80% of the data for training and 20% for testing. The random_state=42 makes sure that the data is split the same way every time. After running it, you’ll get the shapes of your training and testing sets so you can check that everything was split correctly.

Once the data is all prepped and split, you’re good to go! You can move on to implementing your multiple linear regression model, using the training data to teach the model, and the testing data to see how well it performs.

For more insights on data preprocessing techniques and their role in machine learning models, check out this helpful guide on Data Preprocessing in Machine Learning.

Implement Multiple Linear Regression

Once you’ve prepped your data and made sure everything’s in order for multiple linear regression, you’re ready to dive into implementing the model itself. This is where the magic happens: creating the regression model, training it with your data, and then evaluating how well it performs. Let’s walk through the steps of implementing multiple linear regression using Python’s awesome scikit-learn library.

Step 1: Import Necessary Libraries

Before we get started with building the regression model, you need to make sure you’ve got the right libraries in place. For this job, we’re going to be using scikit-learn for the regression algorithm, as well as a few helper functions—like the one that splits our data into training and testing sets. We’ll also need matplotlib and seaborn for visualizing the results.

Here’s how you import everything you need:


from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns

Here’s what each of these does:

  • train_test_split : Splits your dataset into two parts—training and testing.
  • LinearRegression : This is the model we’re going to use for multiple linear regression.
  • mean_squared_error and r2_score : These are the metrics that will help you measure how well your model is performing.
  • matplotlib.pyplot and seaborn : These are used to create visualizations of the results.

Step 2: Split the Data into Training and Testing Sets

You can’t just use all the data to train your model and test it. You need to make sure you have separate training and testing sets, so you can evaluate how well your model generalizes to new, unseen data. For that, we’ll use the train_test_split() function.

Here’s how you do it:


X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

Here’s what’s happening:

  • X_scaled : The features from the dataset that have already been scaled.
  • y : The target variable (for example, house prices).
  • test_size=0.2 : We’re using 20% of the data for testing and 80% for training.
  • random_state=42 : This ensures that every time you run the code, the data splits the same way, so you get consistent results.

Step 3: Train the Linear Regression Model

With your training data ready, now it’s time to train your linear regression model. This means teaching the model how the independent variables (features) are related to the dependent variable (target). To do this, you’ll initialize the LinearRegression model and then fit it to the training data like this:


model = LinearRegression()
model.fit(X_train, y_train)

What happens here?

  • model.fit(X_train, y_train) : This step trains the model using your training data. The model will figure out the best coefficients for the features to predict the target.

Step 4: Make Predictions

Once your model is trained, it’s time to test it! You’ll use the predict() method to make predictions using the test data. Here’s the code to do it:


y_pred = model.predict(X_test)

This is where you actually get the predicted values for your target variable, using the test data you split earlier.

Step 5: Evaluate the Model’s Performance

Now that we’ve got some predictions, it’s time to check how well the model is doing. We’ll use a couple of common metrics: Mean Squared Error (MSE) and R-squared (R²).

Mean Squared Error (MSE)

MSE tells you how far off your model’s predictions are from the actual values on average. The lower the MSE, the better your model is performing. Here’s how to calculate MSE:


mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

R-squared (R²)

R² measures how well your independent variables explain the variance in the target variable. It ranges from 0 to 1, with 1 meaning perfect predictions. Here’s how to calculate R²:


r2 = r2_score(y_test, y_pred)
print("R-squared:", r2)

The higher the R², the better your model fits the data.

Step 6: Visualize the Results

It’s always nice to see your results visually to get a better understanding of how your model is performing. Two popular plots for regression models are residual plots and predicted vs actual plots.

Residual Plot

A residual plot helps you see the errors of the model—the differences between predicted and actual values. Ideally, these should be randomly scattered around zero, meaning the model captured the underlying patterns in the data.

Here’s how to make a residual plot:


residuals = y_test - y_pred
plt.scatter(y_pred, residuals, alpha=0.5)
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.axhline(y=0, color='red', linestyle='--')
plt.show()

Predicted vs Actual Plot

This plot shows how your predicted values stack up against the actual values. In a perfect model, the points should line up along a straight line. Here’s how to make it:


plt.scatter(y_test, y_pred, alpha=0.5)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Predicted vs Actual Values')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=4)
plt.show()

Step 7: Interpretation of Coefficients

One of the coolest things about a multiple linear regression model is the coefficients it gives you. These coefficients show how much the target variable (for example, house price) changes when one independent variable changes by one unit, while holding all other variables constant.

For instance, if the coefficient for median income (MedInc) is 0.83, that means for every one-unit increase in median income, the predicted house price increases by 0.83 units, assuming everything else stays the same.

You can access the coefficients like this:


print("Coefficients:", model.coef_)

This will show you the coefficients for each feature in your model, helping you understand how strongly each feature is related to the target.

By following these steps, you’ll be able to implement multiple linear regression in Python and use it to predict outcomes based on your data. With the evaluation metrics, visualizations, and coefficient interpretations, you’ll get a solid understanding of how well your model is working and where you might need to improve.

For a detailed step-by-step approach to implementing multiple linear regression models, check out this guide on Multiple Linear Regression in Python.

Using statsmodels

The statsmodels library in Python is like your trusty toolbox when it comes to statistical analysis. It’s packed with all kinds of statistical models and tests, making it a great choice for exploring the relationships between variables. In the world of multiple linear regression, statsmodels stands out because it gives you a much deeper statistical output compared to other libraries like scikit-learn . This can be a real lifesaver when you want to dive deeper into things like model coefficients, how well the model fits, and even run diagnostic tests.

Step 1: Import the statsmodels Library

To get started with multiple linear regression in statsmodels, the first thing you need to do is import the necessary libraries. You’ll typically use the OLS (Ordinary Least Squares) method to fit the model, and you’ll also need to add a constant to the feature matrix (this is the intercept).

Here’s the code to get you started:


import statsmodels.api as sm

Now, let’s add that intercept term to the feature matrix, which is pretty important for the model to give you accurate results:


X_train_sm = sm.add_constant(X_train)

What’s going on here? sm.add_constant(X_train) adds a column of ones to your feature matrix, which accounts for the intercept in the regression model. It’s important because, without this, your model would ignore the intercept, leading to incorrect results.

Step 2: Fit the Model Using OLS

Once you’ve got your data set up, the next step is to fit the model using the OLS method. OLS works by finding the line (or hyperplane, in the case of multiple variables) that best fits the data by minimizing the sum of squared errors (residuals).

Here’s how you can fit your model:


model_sm = sm.OLS(y_train, X_train_sm).fit()

What does this do? sm.OLS(y_train, X_train_sm) initializes the OLS regression model, taking in the target variable ( y_train ) and the features ( X_train_sm ) with the intercept term added. .fit() fits the model to the data, which means it calculates the coefficients (the “weights” that tell the model how much influence each feature has) to best predict the target variable.

Step 3: Model Summary

Once your model is fitted, statsmodels gives you a detailed summary of the regression results. This summary includes stats that help you evaluate how well your model did. Some of the key values you’ll want to focus on are:

  • Coefficients: These tell you how much the target variable changes when one of the features changes by one unit, assuming everything else stays the same.
  • R-squared: This shows how well the independent variables explain the variation in the dependent variable.
  • P-values: These help you understand the significance of each feature. A low p-value (typically less than 0.05) means the feature is statistically significant.
  • Confidence Intervals: These give you a range of values within which the true coefficients are likely to fall (usually at a 95% confidence level).

Here’s how you can view the summary:


print(model_sm.summary())

Step 4: Diagnostic Plots

One of the best things about statsmodels is that it offers diagnostic plots to help you check the assumptions of your regression model. These plots help you figure out whether your model is working well or if there are any potential issues. For example, a Q-Q (quantile-quantile) plot can help you see if the residuals follow a normal distribution. This is important because, for linear regression to be valid, the residuals should follow a normal distribution.

Here’s how to make a Q-Q plot:


sm.qqplot(model_sm.resid, line='s')
plt.title('Q-Q Plot of Residuals')
plt.show()

What’s happening here? model_sm.resid : This gives you the residuals (errors) from your fitted model. sm.qqplot() : This function creates the Q-Q plot, which will tell you whether the residuals are normally distributed. If the points lie along a straight line, it’s a good sign.

Step 5: Interpreting the Results

Once the model is fitted and the summary is printed, interpreting the results is key to understanding what’s going on. The coefficients show how much the target variable changes when one of the features changes by one unit.

For example, if the coefficient for median income (MedInc) is 0.83, that means for every increase of 1 unit in median income, the predicted median house value will go up by 0.83 units, assuming everything else stays the same.

To access the coefficients, you can run this code:


print("Coefficients:", model_sm.params)

This will give you the coefficients for each feature, including the intercept. You can use these to understand the strength and direction of the relationship between each feature and the target.

Step 6: Make Predictions

Now that your model is trained and you’ve interpreted the results, it’s time to make some predictions! The predict() method from statsmodels makes this super easy. Here’s how you can predict values using the test data:


# Add the constant (intercept) column to the test features, just like the training data
X_test_sm = sm.add_constant(X_test)
y_pred_sm = model_sm.predict(X_test_sm)

What’s going on here? X_test_sm : This is your test data with the constant term added (just like we did for the training data). y_pred_sm : This contains the predicted values of the target variable for the test data.

Once you’ve got those predictions, you can compare them with the actual values to evaluate the model’s performance, using metrics like Mean Squared Error (MSE) or R-squared.
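
For instance, a quick way to do that—reusing the scikit-learn metrics from earlier—might look like this:

from sklearn.metrics import mean_squared_error, r2_score

# Score the statsmodels predictions with the same metrics used for the scikit-learn model
mse_sm = mean_squared_error(y_test, y_pred_sm)
r2_sm = r2_score(y_test, y_pred_sm)
print("statsmodels MSE:", mse_sm)
print("statsmodels R-squared:", r2_sm)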

Using statsmodels gives you a more detailed statistical output compared to other libraries, which can be a huge advantage when you need to make sense of your regression model’s performance. It’s especially helpful when you want to dive deeper into the significance of your predictors and perform hypothesis testing.

For an in-depth explanation of using statsmodels for regression analysis, take a look at this comprehensive guide on Logistic Regression in Python with Statsmodels.

Handling Multicollinearity

So, multicollinearity—what’s that about? Well, it happens when two or more independent variables in a multiple regression model are super closely related. Imagine you’re trying to figure out how a couple of different factors affect house prices, but, oh no, some of those factors are basically telling you the same story. When this happens, it can get tricky to figure out how each predictor is actually impacting your outcome. Essentially, the regression model gets confused, and it can’t reliably calculate the coefficients for those closely related variables, which might mess up your results and lead to some wonky conclusions.

Why Multicollinearity Matters

You might be wondering why you should care about multicollinearity. Here’s the thing—if multicollinearity is lurking around, it can mess with your regression analysis in several ways:

  • Inflated Standard Errors: When your independent variables are highly correlated, the model’s coefficients become more “spread out” (variance increases). This causes the standard errors to get bigger, making it harder to figure out whether a variable is really making a difference or if it’s just statistical noise.
  • Unstable Coefficients: Multicollinearity can make the coefficients unstable. This means that small changes in the data might cause big swings in the model’s coefficients. It can also mess with the signs and sizes of the coefficients when you use different subsets of data, making the model super unreliable.
  • Incorrect Statistical Inference: You know how the p-value tells you if a variable is important? Well, multicollinearity can make p-values tricky to interpret. Even if a variable looks like it has a high p-value (meaning it’s not significant), it could still actually be an important predictor, but the model is just having trouble figuring it out.

Detecting Multicollinearity

Now that we know why it’s a problem, how do we spot it? There are a few ways to check for multicollinearity in your regression model:

  • Correlation Matrix: This is a simple first check to see if some variables are getting a bit too friendly with each other. If you see a high correlation (say, above 0.8 or 0.9), it’s a good sign that you might have some multicollinearity going on.

You can create a correlation matrix like this:


correlation_matrix = housing_df.corr()
print(correlation_matrix)

This will show you the correlation coefficients between all the independent variables. If some of them are close to 1 (or -1), you’ve got some multicollinearity.

  • Variance Inflation Factor (VIF): If you really want to dig deep, the VIF tells you how much the variance of a regression coefficient is inflated due to collinearity with other variables. If your VIF is super high (over 5 or 10), it’s a clear sign of multicollinearity.

Here’s how to check it out using statsmodels:


from statsmodels.stats.outliers_influence import variance_inflation_factor

vif_data = pd.DataFrame()
vif_data['Feature'] = selected_features
vif_data['VIF'] = [variance_inflation_factor(X_scaled, i) for i in range(X_scaled.shape[1])]
print(vif_data)

This will give you the VIF values for each variable. A high VIF means you’re dealing with multicollinearity.

Dealing with Multicollinearity

Okay, so you’ve spotted multicollinearity. Now what? Don’t worry; there are plenty of ways to deal with it:

  • Remove Highly Correlated Variables: If two variables are really similar, you might want to just drop one. But, here’s the thing—you need to be careful not to remove something important. You don’t want to throw the baby out with the bathwater!
  • Combine Correlated Variables: Sometimes, instead of removing variables, you can combine them into one. For example, if two variables measure similar things, you might add them together or take the average. This way, you keep the useful info without the multicollinearity headache.
  • Principal Component Analysis (PCA): PCA is like a magic trick for handling multicollinearity. It takes all the correlated variables and combines them into a smaller number of uncorrelated components. These components are then used in your regression model. It’s a cool trick if you need to reduce the dimensionality of your data.
  • Ridge Regression: If you don’t want to remove variables but still want to deal with the multicollinearity, ridge regression might be your best friend. Ridge regression adds a penalty to the regression equation, which helps shrink the influence of the correlated variables, making the model more stable. You can use scikit-learn to do it like this:


from sklearn.linear_model import Ridge

ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train, y_train)

The alpha parameter controls the strength of the regularization. The bigger the alpha, the more regularization happens.

  • Lasso Regression: Another option is lasso regression, which is similar to ridge regression but with an added twist—it can also remove unnecessary variables altogether by setting some of their coefficients to zero. This is super helpful if you want to simplify your model and get rid of irrelevant features. Here’s how to use lasso regression:


from sklearn.linear_model import Lasso

lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X_train, y_train)

Again, the alpha parameter controls the regularization strength. By using these techniques, you can deal with multicollinearity and still create a solid multiple linear regression model. The goal is to get rid of any noise and make sure your model gives you accurate, reliable results.

To further explore techniques for handling multicollinearity in regression models, check out this detailed article on Multicollinearity in Machine Learning.

Cross-Validation Techniques

Cross-validation is a super handy technique in machine learning to check how well your model is going to perform on new, unseen data. It’s like testing your model’s ability to generalize beyond just the training data. Essentially, cross-validation splits your dataset into several smaller chunks, tests the model on different combinations of those chunks, and checks how well it performs each time. It’s a great way to ensure that your model doesn’t overfit to your training data, which could make it do poorly when it sees new data. This is especially useful when you’ve got a limited dataset and want to make the most of what you’ve got.

K-Fold Cross-Validation

One of the most popular cross-validation methods is K-fold cross-validation. Here’s how it works: You take your data and divide it into “k” equal chunks, or folds. You then train your model using k-1 of those folds, and the last fold is used to test the model. You repeat this process k times, so every fold gets a chance to be the test set. After that, you average the performance results (like R-squared or Mean Squared Error) across all k folds to get a more reliable estimate of the model’s overall performance.

To use K-fold cross-validation in scikit-learn, you can use the cross_val_score function, which will evaluate your model based on the chosen scoring metric (like R-squared). Here’s how you can do it:


from sklearn.model_selection import cross_val_score

# Use 5-fold cross-validation to evaluate the model
scores = cross_val_score(model, X_scaled, y, cv=5, scoring='r2')
# Print the cross-validation score for each fold and their mean
print("Cross-Validation Scores:", scores)
print("Mean CV R^2:", scores.mean())

cv=5 : This tells the function to do 5-fold cross-validation. You can change this number based on your dataset.

scoring='r2' : This sets R-squared as the evaluation metric, but you can use others like 'neg_mean_squared_error' if needed.

scores.mean() : This gives you the average performance from all the folds, which gives you a more reliable estimate than just a single train-test split.

Stratified K-Fold Cross-Validation

Stratified K-fold cross-validation is a variation of K-fold that’s especially useful for classification tasks where your target variable might be imbalanced. For instance, if you’re predicting customer churn and only a small percentage of your customers churn, stratified cross-validation ensures that each fold has the same proportion of churn and non-churn cases. This makes the results more stable and reliable.

In scikit-learn, you can use StratifiedKFold for this:


from sklearn.model_selection import StratifiedKFold, cross_val_score

# Initialize the StratifiedKFold object
skf = StratifiedKFold(n_splits=5)
# Note: stratification needs a discrete class target, so in this snippet `model` and `y`
# stand for a classifier and its class labels rather than the regression setup above
scores = cross_val_score(model, X_scaled, y, cv=skf, scoring='accuracy')
# Print the cross-validation scores
print("Stratified Cross-Validation Scores:", scores)
print("Mean Stratified CV Accuracy:", scores.mean())

Leave-One-Out Cross-Validation (LOOCV)

Leave-One-Out Cross-Validation (LOOCV) is like the extreme version of cross-validation. In this case, the number of folds is equal to the number of data points you have. So for each iteration, you train the model on all but one data point, and use that one data point to test the model. This process is repeated for each data point in your dataset.

While LOOCV gives you super low bias (because it uses almost all the data for training every time), it can be very slow, especially if you’ve got a large dataset. It’s useful when your dataset is small and you want the most precise estimate of your model’s performance. Here’s an example of how you might use LOOCV in Python:


from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
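
# A minimal sketch of running LOOCV on the regression model from earlier (this fits one
# model per data point, so it can be slow on large datasets).
# R^2 isn't defined for single-observation test folds, so we score with negative MSE instead.
loo = LeaveOneOut()
scores = cross_val_score(model, X_scaled, y, cv=loo, scoring='neg_mean_squared_error')
print("Mean LOOCV MSE:", -scores.mean())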

Time Series Cross-Validation

When working with time series data, traditional K-fold cross-validation isn’t suitable because it doesn’t respect the chronological order of the data. In real life, you can’t test on future data that the model hasn’t seen yet, right? So, time series cross-validation (or rolling forecast origin) comes to the rescue. In this case, the training set keeps expanding with each fold, and the test set always contains data points that come after the training set. This reflects how the model would behave in a real-world forecasting situation.

For time series, you can use TimeSeriesSplit in scikit-learn:


from sklearn.model_selection import TimeSeriesSplit

# Initialize the TimeSeriesSplit object
tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(model, X_scaled, y, cv=tscv, scoring='r2')
# Print the time series cross-validation scores
print("Time Series Cross-Validation Scores:", scores)
print("Mean Time Series CV R^2:", scores.mean())

Cross-Validation with Custom Scoring

Sometimes, you might need a custom scoring metric to evaluate your model—something specific to your business or project. Maybe you don’t want to use R-squared or Mean Squared Error; you could create your own scoring function and plug that into the cross-validation process.

Here’s how you can use a custom scoring function in scikit-learn:


from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer

# Define a custom scoring function (e.g., Mean Absolute Error)
def custom_scoring(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

# For error metrics like MAE, pass greater_is_better=False so lower errors count as better
custom_scorer = make_scorer(custom_scoring, greater_is_better=False)
scores = cross_val_score(model, X_scaled, y, cv=5, scoring=custom_scorer)
# Print the custom cross-validation scores
print("Custom Cross-Validation Scores:", scores)
print("Mean Custom CV Score:", scores.mean())

Evaluating Model Performance Using Cross-Validation

Once you’ve run the cross-validation, it’s important to analyze the results. The mean score from all the folds gives you a good, unbiased estimate of how well the model is performing. However, you should also take a look at how much the scores vary across the different folds. If you see a lot of variability, it could mean the model is sensitive to the specific data it’s trained on, and you might need to adjust the model or try some regularization techniques.
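
A quick way to quantify that variability is to look at the standard deviation of the fold scores alongside their mean, for example with the K-fold scores computed earlier:

# The spread of the fold scores hints at how sensitive the model is to the training data
print("Mean CV score:", scores.mean())
print("Std of CV scores:", scores.std())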

Cross-validation is key when you want to know how your model will perform in the real world—on data it hasn’t seen before. Whether you’re doing K-fold, LOOCV, or even using custom metrics, cross-validation ensures that you’re getting a solid and trustworthy performance estimate for your model.

To dive deeper into cross-validation techniques and their applications in model evaluation, check out this informative guide on K-Fold Cross-Validation in Machine Learning.

Feature selection methods

Feature selection is a big deal when you’re building machine learning models. It’s all about picking the most important features (or variables) from your dataset that really make a difference in your model’s predictions. By getting rid of irrelevant or redundant features, you not only simplify the model but also make it easier to understand and improve its ability to generalize. This is key for better performance and avoiding overfitting. There are a bunch of ways to do feature selection, like statistical tests, recursive techniques, and regularization methods.

Recursive Feature Elimination (RFE)

Let’s talk about Recursive Feature Elimination (RFE). This method works by getting rid of the least important features one by one. It starts by fitting a model using all the features, then ranks them based on how important they are. After that, it removes the least important feature, trains the model again, and repeats this until you’re left with the features that matter the most. RFE is great for identifying which features really matter because it methodically eliminates the less important ones.

RFE is typically used with models that have a built-in feature importance measure, like linear regression, decision trees, or support vector machines (SVMs). The best part about RFE is that it works with any machine learning model and gives you the optimal set of features that contribute the most to prediction accuracy.

Here’s how you can use RFE with a linear regression model in Python:


from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Initialize the linear regression model
model = LinearRegression()

# Initialize RFE with the linear regression model
rfe = RFE(estimator=model, n_features_to_select=3)

# Fit RFE to the data
rfe.fit(X_scaled, y)

# Print the selected features
print("Selected Features:", rfe.support_)

# Print the ranking of features
print("Feature Ranking:", rfe.ranking_)

In this example, n_features_to_select=3 means you’re keeping the top 3 most important features. rfe.support_ gives you a boolean array showing which features were selected, and rfe.ranking_ shows the ranking of all features, where lower values indicate more important features.

Variance Thresholding

Variance thresholding is a simple method where you get rid of features that have low variance. If a feature doesn’t vary much (it’s basically constant), it probably won’t help the model much. This method is super useful when you have lots of features, some of which might be constant or nearly constant across all data points.

Here’s how to do it in Python using VarianceThreshold:


from sklearn.feature_selection import VarianceThreshold

# Initialize VarianceThreshold with a threshold of 0.1 (remove features with variance below 0.1)
selector = VarianceThreshold(threshold=0.1)

# Apply it to the unscaled features: after standardization every feature has variance 1,
# so thresholding X_scaled would never remove anything
X_selected = selector.fit_transform(X)

# Print how many features are left
print("Selected Features after Variance Thresholding:", X_selected.shape[1])

This removes any feature with a variance below 0.1. X_selected.shape[1] will tell you how many features are left after applying this threshold.

Univariate Feature Selection

Univariate feature selection is a method where you evaluate each feature individually using statistical tests. You look at how each feature relates to the target variable and keep the ones that show a strong connection. It’s great when you’ve got lots of features and want to reduce the number by focusing on their individual significance.

For example, you can use the SelectKBest method from scikit-learn, which picks the top k features based on a statistical test like the chi-square test (for classification targets) or the f-test (f_regression when the target is continuous, as it is here).

Here’s how to implement univariate feature selection using the f-test:


from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression

# Initialize SelectKBest with the regression f-test as the scoring function
# (f_classif is meant for classification targets; our target is continuous)
selector = SelectKBest(score_func=f_regression, k=5)

# Fit the selector to the data
X_selected = selector.fit_transform(X_scaled, y)

# Print the selected features
print("Selected Features after Univariate Feature Selection:", selector.get_support())

In this case, k=5 means you’re keeping the top 5 features based on their f-test scores. The get_support() method gives you a boolean array showing which features were selected.

L1 Regularization (Lasso Regression)

L1 regularization, also known as Lasso (Least Absolute Shrinkage and Selection Operator), is another awesome technique for feature selection. It adds a penalty term to the model’s objective function that penalizes the absolute values of the coefficients. This causes the coefficients of less important features to shrink to zero, effectively removing them from the model. Lasso is super handy when you have a lot of features and want to do both feature selection and regularization at the same time.

Here’s how you can use Lasso in Python:


from sklearn.linear_model import Lasso

# Initialize the Lasso model with alpha (regularization strength)
lasso = Lasso(alpha=0.01)

# Fit the Lasso model to the data
lasso.fit(X_scaled, y)

# Print the coefficients
print("Lasso Coefficients:", lasso.coef_)

# Identify the selected features (non-zero coefficients)
selected_features = [i for i, coef in enumerate(lasso.coef_) if coef != 0]
print("Selected Features after Lasso:", selected_features)

In this case, alpha=0.01 controls the strength of the regularization. lasso.coef_ gives you the coefficients for each feature, and the non-zero coefficients indicate which features are selected.

Feature Importance from Tree-based Models

Another powerful method for feature selection is using tree-based models, like decision trees, random forests, or gradient boosting machines. These models can calculate the importance of each feature based on how useful they are in splitting the data. Features that are used often to split the data and reduce impurity are considered more important.

Here’s how you can get feature importances using a random forest model:


from sklearn.ensemble import RandomForestRegressor

# Initialize a RandomForest model
rf = RandomForestRegressor()

# Fit the model to the data
rf.fit(X_scaled, y)

# Get feature importances
feature_importances = rf.feature_importances_

# Print the feature importances
print("Feature Importances from RandomForest:", feature_importances)

# Select features with the highest importance
important_features = [i for i, importance in enumerate(feature_importances) if importance > 0.1]
print("Selected Important Features:", important_features)

Here, feature_importances_ returns an array of importance scores, and features with an importance greater than 0.1 are selected.

Conclusion

Feature selection is crucial for building efficient and accurate machine learning models. Whether you’re using Recursive Feature Elimination (RFE), variance thresholding, univariate feature selection, L1 regularization (Lasso), or tree-based feature importance, each method helps identify and keep the most important features while removing the ones that are irrelevant or redundant. By choosing the right feature selection method for your data, you end up with simpler models that are easier to interpret and generalize better.

To explore more on how feature selection methods impact machine learning models, check out this detailed article on Feature Selection Techniques in Machine Learning with Python.

Conclusion

In conclusion, mastering multiple linear regression (MLR) with Python, scikit-learn, and statsmodels equips you with powerful tools for building robust predictive models. By following the steps of data preprocessing, model fitting, and evaluation with techniques like cross-validation and feature selection, you can confidently analyze and predict outcomes, such as house prices using real-world datasets like the California Housing Dataset. Understanding key metrics like R-squared and Mean Squared Error helps you assess your model’s performance accurately. As data science continues to evolve, staying up to date with tools like scikit-learn and statsmodels will remain essential for tackling more complex regression challenges and enhancing your data analysis skills.
