Master Multiple Linear Regression in Python with Scikit-learn and Statsmodels


Introduction

Mastering multiple linear regression in Python is essential for anyone looking to build powerful predictive models. In this tutorial, we’ll dive into how to implement multiple linear regression (MLR) using Python’s popular libraries, scikit-learn and statsmodels. We’ll walk through key concepts like data preprocessing, handling multicollinearity, and performing cross-validation, all using the California Housing Dataset. Whether you’re new to MLR or aiming to refine your skills, this guide will provide practical, step-by-step instructions to help you build and evaluate robust regression models.

What is Multiple Linear Regression?

Multiple Linear Regression is a statistical method used to predict a target variable based on multiple factors or independent variables. It helps analyze the relationship between one dependent variable and several independent variables, making it useful for predicting outcomes like house prices based on factors such as size, location, and number of rooms. This method requires preprocessing the data, ensuring it meets specific assumptions, and evaluating the model using metrics like R-squared and Mean Squared Error.
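To make the idea concrete, here is a minimal sketch of fitting an MLR model with scikit-learn on a tiny made-up dataset (the numbers are purely illustrative, not taken from this tutorial's data) and reading off the learned intercept and coefficients:


import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: house size (sq ft), number of rooms, and price (in $1000s).
X_toy = np.array([[1400, 3], [1600, 3], [1700, 4], [1875, 4], [2350, 5]])
y_toy = np.array([245, 312, 279, 308, 419])

# Fit price = b0 + b1 * size + b2 * rooms.
toy_model = LinearRegression().fit(X_toy, y_toy)
print("Intercept:", toy_model.intercept_)
print("Coefficients:", toy_model.coef_)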

Read more about multiple linear regression and its applications in the Complete Guide to Multiple Linear Regression in Python.

Assumptions of Multiple Linear Regression

Before jumping into multiple linear regression, it’s really important to make sure certain basic assumptions are in place. These assumptions help make sure that your model’s results are both reliable and valid. If any of these assumptions are off, your results might be skewed or just plain misleading. Let’s break down each one:

Linearity: So, here’s the thing—you need a straight-line relationship between the dependent variable (the one you’re trying to predict) and the independent variables (the ones you’re using to make that prediction). Basically, if you change one of your independent variables, the dependent variable should change in a straight-line way. You can check this by plotting your variables and seeing if the relationship looks linear or using statistical tests. If the relationship isn’t linear, you might need to try polynomial regression or transform your variables to get things on track.
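One quick way to eyeball linearity is to scatter each candidate predictor against the target. This small sketch assumes the housing_df DataFrame that we load later in this tutorial:


import matplotlib.pyplot as plt

# Scatter each selected predictor against the target to eyeball linearity.
for feature in ['MedInc', 'AveRooms', 'AveOccup']:
    plt.scatter(housing_df[feature], housing_df['MedHouseValue'], alpha=0.3)
    plt.xlabel(feature)
    plt.ylabel('MedHouseValue')
    plt.title(feature + ' vs MedHouseValue')
    plt.show()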

Independence of Errors: Now, the residuals (aka the errors between your predicted and actual values) should be totally independent of each other. What does this mean? Well, if the errors are linked together, it suggests that your model’s missing something important—like it’s not capturing all the patterns in your data. To test this, people usually run the Durbin-Watson test. If the result is about 2, then all’s good. If it’s much higher or lower than 2, you’ve got autocorrelation in your residuals, which could cause problems.
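If you want to run the Durbin-Watson test yourself, here is a minimal sketch using statsmodels; it assumes the residuals array (y_test - y_pred) computed in the evaluation step later in this tutorial:


from statsmodels.stats.stattools import durbin_watson

# A statistic near 2 suggests independent errors; values well below 2 hint at
# positive autocorrelation, values well above 2 at negative autocorrelation.
print("Durbin-Watson statistic:", durbin_watson(residuals))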

Homoscedasticity: This is a big word, but it’s pretty simple when you break it down. It means the spread of the residuals should stay roughly the same no matter what values your independent variables have. If the residuals start to fan out or squish together as the values change, your model might be off. This is called heteroscedasticity, and you can spot it with a residual plot. If you see that fan-shaped pattern, you might want to transform your data or consider using weighted least squares regression to fix it.
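Besides eyeballing the residual plot, you can test for heteroscedasticity more formally. Here is a minimal sketch using the Breusch-Pagan test from statsmodels (a test not covered elsewhere in this tutorial); it assumes the residuals and X_test arrays created in the model-fitting steps later on:


import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# The test needs an explanatory matrix that includes a constant term.
exog = sm.add_constant(X_test)
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(residuals, exog)
# A small p-value (e.g. below 0.05) suggests heteroscedasticity is present.
print("Breusch-Pagan p-value:", lm_pvalue)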

No Multicollinearity: This one’s important—your independent variables shouldn’t be too closely related to each other. If they are, it makes it hard to figure out how each one is affecting the dependent variable because they’re all stepping on each other’s toes. To detect this, you can calculate the Variance Inflation Factor (VIF). If your VIF is over 5 or 10, that’s a sign that some of your predictors are too similar, and you might need to drop or combine a few of them.

Normality of Residuals: For your statistical tests to be reliable, the residuals should follow a normal distribution. If they don’t, the results from tests like t-tests or F-tests might be off. You can check this with a Q-Q plot, which compares your residuals to a normal distribution. If the points on the plot form a straight line, you’re good to go. If they don’t, you might need to do some data transformations.
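A quick way to produce that Q-Q plot is statsmodels' qqplot helper. This minimal sketch assumes the residuals array computed later in this tutorial:


import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Compare the residuals to a theoretical normal distribution; points close to
# the reference line indicate roughly normal residuals.
sm.qqplot(np.asarray(residuals), line='s')
plt.title('Q-Q Plot of Residuals')
plt.show()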

Outlier Influence: You know those weird data points that stick out like a sore thumb? Those are outliers, and they can really mess up your model, especially if you’ve got a small dataset. These high-leverage points can make your predictions way off. To check for them, you can use leverage and Cook’s distance techniques. If they’re really influencing your model, you might need to remove them or use robust regression methods that are less affected by outliers.
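Here is a minimal sketch of how you might compute Cook's distance with statsmodels; it assumes the scaled feature matrix X_scaled and target y used throughout this tutorial, and the 4/n cutoff is just a common rule of thumb, not a hard rule:


import numpy as np
import statsmodels.api as sm

# Fit an OLS model (with a constant) and pull the influence measures.
ols_model = sm.OLS(y, sm.add_constant(X_scaled)).fit()
cooks_d = ols_model.get_influence().cooks_distance[0]

# Flag points whose Cook's distance exceeds the 4/n rule of thumb.
threshold = 4 / len(y)
print("Potentially influential points:", np.where(cooks_d > threshold)[0])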

So, to wrap it up, these assumptions form the backbone of multiple linear regression. If you don’t check them and they turn out to be wrong, your model’s conclusions could end up being pretty useless. That’s why it’s super important to check for any violations and fix them as needed. It’ll save you a lot of headache down the road and ensure your results are solid.

Read more about the key assumptions of multiple linear regression and how they affect model reliability Assumptions of Multiple Linear Regression.

Preprocess the Data

In this section, you’ll get hands-on with using the Multiple Linear Regression model in Python to predict house prices based on features from the California Housing Dataset. The process will guide you through how to preprocess the data, fit a regression model, and evaluate its performance. Along the way, we’ll also take a look at common challenges, like multicollinearity, outliers, and feature selection, that pop up now and then.

Step 1 – Load the Dataset

So, we’re going to use the California Housing Dataset, a well-known benchmark for regression tasks. It contains 8 features describing census block groups in California, along with the corresponding median house value for each block group. Let’s kick things off by installing the packages we need to analyze this data. You’ll want to run this in your terminal:


$ pip install numpy pandas matplotlib seaborn scikit-learn statsmodels

Next, we’re going to import the libraries and load the dataset. Here’s the Python code for that:


from sklearn.datasets import fetch_california_housing  # Import the function to fetch the dataset.
import pandas as pd  # Import pandas for data manipulation and analysis.
import numpy as np  # Import numpy for numerical computing.

# Load the California Housing dataset using the fetch_california_housing function.
housing = fetch_california_housing()
# Convert the dataset's data into a pandas DataFrame, using the feature names as column headers.
housing_df = pd.DataFrame(housing.data, columns=housing.feature_names)
# Add the target variable 'MedHouseValue' to the DataFrame, using the dataset's target values.
housing_df['MedHouseValue'] = housing.target
# Display the first few rows of the DataFrame to get an overview of the dataset.
print(housing_df.head())

Running this code will show you an output like this:

Output

MedInc   HouseAge   AveRooms   AveBedrms   Population   AveOccup   Latitude   Longitude   MedHouseValue
0  8.3252   41.0   6.984127   1.023810   322.0   2.555556   37.88   -122.23   4.526
1  8.3014   21.0   6.238137   0.971880   2401.0   2.109842   37.86   -122.22   3.585
2  7.2574   52.0   8.288136   1.073446   496.0   2.802260   37.85   -122.24   3.521
3  5.6431   52.0   5.817352   1.073059   558.0   2.547945   37.85   -122.25   3.413
4  3.8462   52.0   6.281853   1.081081   565.0   2.181467   37.85   -122.25   3.422

Explanation of Variables:

  • MedInc: Median income in the block
  • HouseAge: Median house age in the block
  • AveRooms: Average number of rooms per household
  • AveBedrms: Average number of bedrooms per household
  • Population: Block population
  • AveOccup: Average house occupancy
  • Latitude: Latitude of the block
  • Longitude: Longitude of the block

Step 2 – Preprocess the Data

Check for Missing Values

Before we dive into the regression stuff, we need to make sure there are no missing values in the dataset—missing data can mess things up. We can easily check for missing values with this simple code:


print(housing_df.isnull().sum())

If you run this, you should see that there are no missing values:

Output

MedInc        0
HouseAge        0
AveRooms        0
AveBedrms        0
Population        0
AveOccup        0
Latitude        0
Longitude        0
MedHouseValue        0
dtype: int64

Correlation Analysis

Next up, let’s take a look at how the features relate to each other and to the target variable (in this case, MedHouseValue, the price). This will help us figure out which features to keep for the regression model. A great way to do this is by creating a correlation matrix. Here’s the code to do that:


correlation_matrix = housing_df.corr()
print(correlation_matrix['MedHouseValue'])

The output will look something like this:

Output

MedInc        0.688075
HouseAge        0.105623
AveRooms        0.151948
AveBedrms        -0.046701
Population        -0.024650
AveOccup        -0.023737
Latitude        -0.144160
Longitude        -0.045967
MedHouseValue        1.000000

Interpretation:

  • MedInc (Median Income) has a strong positive correlation (0.688075) with MedHouseValue, which means that higher median income in a block is strongly linked to higher house prices.
  • AveRooms (Average Number of Rooms) has a weak positive correlation (0.151948) with MedHouseValue, so houses with more rooms tend to have somewhat higher prices.
  • AveOccup (Average Occupancy) has a weak negative correlation (-0.023737) with MedHouseValue, meaning that as the average number of people per house goes up, house prices tend to drop a little, but not much.

You can visualize this correlation matrix using a heatmap to make things even clearer:


import seaborn as sns
import matplotlib.pyplot as plt

# Assuming 'housing_df' is the DataFrame containing the data.
plt.figure(figsize=(10, 8))
sns.heatmap(housing_df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

This will create a heatmap where stronger correlations show up as more intense colors: warm (red) cells for strong positive correlations and cool (blue) cells for strong negative ones.

Feature Selection

Now, based on the correlation matrix, we’ll pick a small set of features to work with. For this tutorial we’ll keep MedInc and AveRooms, which have the strongest positive correlations with MedHouseValue, plus AveOccup, which we’ll carry along as an example of a weakly correlated predictor. Here’s the code for that:


selected_features = ['MedInc', 'AveRooms', 'AveOccup']
X = housing_df[selected_features]
y = housing_df['MedHouseValue']

This creates a new DataFrame X that only includes the selected features and extracts the target variable MedHouseValue into y.

Scaling Features

Alright, now it’s time to scale the features. Why? Well, features like MedInc (income) and AveRooms (rooms) might be on different scales, and if they are, the model could get confused. To make sure everything is on the same page, we standardize the features so they all have a mean of 0 and a standard deviation of 1. Here’s how you can do that:


from sklearn.preprocessing import StandardScaler

# Initialize the StandardScaler object.
scaler = StandardScaler()
# Fit the scaler to the data and transform it.
X_scaled = scaler.fit_transform(X)
# Print the scaled data.
print(X_scaled)

When you run this, you’ll see the scaled values for your features, like this:

Output

[[ 2.34476576  0.62855945  -0.04959654]
[ 2.33223796  0.32704136  -0.09251223]
[ 1.7826994   1.15562047  -0.02584253]
…
[ -1.14259331  -0.09031802  -0.0717345]
[ -1.05458292  -0.04021111  -0.09122515]
[ -0.78012947  -0.07044252  -0.04368215]]

Each row represents a data point, and each column corresponds to one of the features: MedInc, AveRooms, and AveOccup. After applying the StandardScaler, everything is now on the same scale, which is crucial for models like multiple linear regression that are sensitive to how big or small the features are.
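If you want to double-check the scaling, this small optional snippet prints the per-column mean and standard deviation of X_scaled, which should come out very close to 0 and 1:


import numpy as np

# Each column should now have (approximately) mean 0 and standard deviation 1.
print("Column means:", np.round(X_scaled.mean(axis=0), 6))
print("Column std devs:", np.round(X_scaled.std(axis=0), 6))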

For more insights on how to preprocess data effectively in Python, check out this comprehensive guide on Data Cleaning and Preprocessing in Python.

Implement Multiple Linear Regression

Now that we’ve done all the necessary data preprocessing, let’s dive into implementing Multiple Linear Regression in Python. This part will walk you through splitting the data, fitting the model, and then evaluating how well the model performs. You’ll also get to see how we can visually check the model’s accuracy.

Step 1 – Splitting the Data

Before we get into the model, we need to split the data into two parts: training and testing. This is super important because we want to train the model using one part of the data, then test it on another part that it hasn’t seen before. It’s like studying for a test and then taking the test, you know? We’ll use 80% of the data for training and save the remaining 20% for testing, which is a standard practice in machine learning. Here’s how you do that:


from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

This train_test_split function will split the data, keeping the test data separate from the training data, so our model can be evaluated properly. The random_state=42 ensures that we can get the same split every time we run the code, just in case we need to reproduce the results.

Step 2 – Fitting the Model

Now that the data is split, it’s time to create our LinearRegression model and fit it to the training data. This is where the model learns how the predictors (independent variables) relate to the target variable (dependent variable). The model gets trained on the data, like a student learning for an exam. Here’s the code:


from sklearn.linear_model import LinearRegression

model = LinearRegression()  # Initialize the Linear Regression model.
model.fit(X_train, y_train)  # Fit the model to the training data.

Step 3 – Making Predictions

Now that the model has been trained, it’s time to put it to work and make some predictions! The model will use what it’s learned to predict the target variable (house prices) based on the test data. Here’s how you make the predictions:


y_pred = model.predict(X_test)  # Predict the target variable for the test set.

Step 4 – Model Evaluation

To check how well our model is performing, we’ll use two key metrics: Mean Squared Error (MSE) and R-squared (R2). MSE tells us how far off the predictions are from the actual values (lower is better), and R-squared tells us how much of the variance in the target variable is explained by the model (higher is better). Let’s evaluate the model’s performance:


from sklearn.metrics import mean_squared_error, r2_score

print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("R-squared:", r2_score(y_test, y_pred))

Explanation of Metrics:

Mean Squared Error (MSE): The MSE for the model is 0.7006855912225249. This number represents how far off, on average, our predictions are from the actual values. A lower MSE means the model is performing better because the predicted values are closer to the real values.

R-squared (R2): The R-squared value is 0.4652924370503557, which means the model explains about 46.53% of the variance in house prices based on the features we’ve used. Ideally, this number should be closer to 1, but this shows that our model captures a decent portion of the data’s behavior.
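One optional follow-up, shown here as a small sketch rather than part of the main workflow: taking the square root of the MSE gives the root mean squared error (RMSE), which is in the same units as MedHouseValue and is often easier to interpret:


import numpy as np
from sklearn.metrics import mean_squared_error

# RMSE is the square root of MSE, expressed in the target's own units.
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error:", rmse)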

Step 5 – Visualizing the Results

To really get a feel for how well our model is doing, let’s create a couple of visualizations.

Residual Plot:

This helps us see if there are any patterns in the errors (residuals). Ideally, the errors should be randomly scattered, like confetti. If they’re not, it might mean the model is missing something.

Predicted vs Actual Plot:

This one compares our model’s predictions to the real values. If the model were perfect, all the points would lie right on the line.

Here’s how you make these plots:


# Residual Plot
residuals = y_test - y_pred
plt.scatter(y_pred, residuals, alpha=0.5)
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.axhline(y=0, color='red', linestyle='--')  # Add a horizontal line at y=0 for reference
plt.show()

# Predicted vs Actual Plot
plt.scatter(y_test, y_pred, alpha=0.5)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Predicted vs Actual Values')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=4)  # Add a diagonal line for comparison
plt.show()

The Residual Plot will show if the errors are scattered around zero or if there’s some pattern (which could mean the model is missing something). The Predicted vs Actual Plot will help us see how close the predictions are to the real values. If the dots are close to the red dashed line, the model is doing well.

By looking at these metrics and plots, you’ll be able to figure out how well your multiple linear regression model is working and what areas need a bit more work.

To dive deeper into implementing multiple linear regression and its various techniques, check out this detailed resource on Understanding Multiple Linear Regression with Python.

For a more in-depth understanding of using statsmodels in regression analysis, refer to this insightful guide on Statistics in Python with Statsmodels.

Handling Multicollinearity

So, here’s the thing about multicollinearity—it’s a common problem when doing multiple linear regression analysis. Basically, when two or more of the independent variables are too closely related, it makes it super tricky to figure out what each variable is actually doing. This can throw off your results because the regression coefficients can get all wobbly and unreliable. And when that happens, the model’s predictions can become biased, which is not what we want, right?

Now, how do you find and handle multicollinearity? A great tool to help with that is the Variance Inflation Factor (VIF). The VIF helps you measure how much the variance of each regression coefficient gets inflated due to the relationships between the predictors. Basically, it tells you if one of your variables is too chatty with the others. A VIF of 1 means no correlation at all, while a VIF above 5 or 10 is like a red flag, suggesting that your predictors are too chummy with each other. That’s when you might want to step in and fix things.

Let’s calculate the VIF for each independent variable in our model and check whether any of them are too cozy with each other. If you spot a VIF greater than 5, you might want to reconsider that variable and think about dropping it. Here’s how you can do that:


from statsmodels.stats.outliers_influence import variance_inflation_factor
# Create an empty DataFrame to store the VIF values.
vif_data = pd.DataFrame()
# Add the names of the selected features (independent variables) to the DataFrame.
vif_data['Feature'] = selected_features
# Calculate the VIF for each feature.
vif_data['VIF'] = [variance_inflation_factor(X_scaled, i) for i in range(X_scaled.shape[1])]
# Print the VIF values for each feature.
print(vif_data)
# Generate a bar plot for the VIF values.
vif_data.plot(kind='bar', x='Feature', y='VIF', legend=False)
plt.title('Variance Inflation Factor (VIF) by Feature')
plt.ylabel('VIF Value')
plt.show()

Explanation of VIF Values:

Let’s say you run the code, and you get the following output for the VIF:

Output

Feature       VIF
0    MedInc   1.120166
1    AveRooms 1.119797
2    AveOccup 1.000488

Here’s what this means:

  • MedInc: The VIF for MedInc is 1.120166, which is pretty low. This means that MedInc isn’t really correlated with the other variables, so it’s not causing multicollinearity.
  • AveRooms: The VIF for AveRooms is 1.119797. Again, this is low, so no worries here. AveRooms isn’t causing any multicollinearity.
  • AveOccup: The VIF for AveOccup is 1.000488, which is essentially 1, meaning it has almost no correlation with the other variables. So, AveOccup is effectively independent and not causing any issues.

Since all of these VIF values are well below 5, you’re in the clear: multicollinearity isn’t an issue for this set of features, and you can keep all of them in the model.

To dive deeper into understanding multicollinearity and its effects on regression models, check out this comprehensive article on Multicollinearity in Regression Analysis.

Cross-Validation Techniques

Alright, so let’s talk about cross-validation. This is like your secret weapon to check how well your machine learning model is actually performing. It helps you figure out how well your model will do when it sees new, unseen data. You know, when you don’t have an infinite amount of data, and you’re trying to get a more reliable idea of how effective your model is.

The idea is pretty simple. The whole method revolves around a single number, k, which represents how many groups, or “folds,” your data gets split into. This method is called k-fold cross-validation, and it works like this:

  • You break your data into k equal parts.
  • Then, you train your model on k-1 of those parts and test it on the leftover one.
  • You repeat this for every fold, and in the end, you average all the results to get a better overall performance score.

Sounds cool, right?

Now, let’s dive into how you can actually do this in Python. Here’s how you can use scikit-learn to calculate your cross-validation scores using R-squared (it’s a measure of how well your model explains the variance of the target variable):


from sklearn.model_selection import cross_val_score
# Perform k-fold cross-validation with 5 folds and calculate R-squared scores.
scores = cross_val_score(model, X_scaled, y, cv=5, scoring='r2')
# Print the individual cross-validation scores for each fold.
print("Cross-Validation Scores:", scores)
# Calculate and print the mean R-squared score across all folds.
print("Mean CV R^2:", scores.mean())
# Plot the cross-validation scores.
plt.plot(range(1, 6), scores, marker='o', linestyle='--')
plt.xlabel('Fold')
plt.ylabel('R-squared')
plt.title('Cross-Validation R-squared Scores')
plt.show()

Explanation of the Code:

  • cross_val_score : This is the function that does all the hard work. It splits your data into k folds (in this case, 5) and calculates the R-squared score for each fold.
  • model : This is the regression model you’ve already trained on your dataset. It’s what you’re testing.
  • X_scaled and y : These are your input features and the target variable you’re trying to predict.
  • scoring='r2' : This tells the function that you want to use R-squared as your evaluation metric.

Once the function does its thing, we get this cool line plot of R-squared values across the folds.

Example of the Output:

You might get something like this:

Output
Cross-Validation Scores: [0.42854821 0.37096545 0.46910866 0.31191043 0.51269138]
Mean CV R^2: 0.41864482644003276

What Does This Mean?

The cross-validation scores show how the model performed across different subsets of your data. As you can see, they range from 0.3119 to 0.5127, which means the model didn’t do the same in every fold, but it wasn’t a huge disaster either.

The Mean CV R-squared value is 0.4186, which means that, on average, the model explains 41.86% of the variance in the target variable. That’s a pretty decent start, but it’s also a sign that there’s room for improvement. If you want your model to do better, you’ll want to tweak it further.

If you look at how the scores vary across the folds, it shows you whether the model’s performance is consistent. If the R-squared is higher, closer to 1, that’s good news. If it’s lower, it’s like, “Okay, we need to fix this.” These results help you figure out if your model is overfitting (doing well on training data but poorly on new data) or underfitting (struggling to make good predictions on both training and test data). You want a balance here, and cross-validation is perfect for making sure you’re not missing the mark.
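If you want a quick, optional check for overfitting versus underfitting on this particular model, you can compare the R-squared on the training set with the R-squared on the test set; a large gap suggests overfitting, while two low scores suggest underfitting:


# Compare training and test R-squared: a big gap hints at overfitting,
# while low scores on both hint at underfitting.
print("Train R^2:", model.score(X_train, y_train))
print("Test R^2:", model.score(X_test, y_test))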

For more on cross-validation techniques and their application in machine learning models, check out this insightful article on Understanding Cross-Validation Techniques in Machine Learning with Python.

Feature Selection Methods

The Recursive Feature Elimination (RFE) method is a really handy way to pick out the most important features for your machine learning model. You know, sometimes, you end up with a bunch of features, but not all of them are really necessary. RFE helps solve this by gradually eliminating the less important ones until you’re left with just the top ones. This makes your model more efficient and can even help reduce overfitting—winning all around! It’s especially useful when you’ve got a ton of features and want to focus on the ones that actually make a difference.

So, here’s the plan: we’re going to use RFE with Linear Regression to figure out which features matter the most. In the code below, we’re telling RFE to pick just the top 2 features that have the biggest impact on our target variable. This way, we narrow down our focus to the ones that truly matter.

Here’s how you do it:


from sklearn.feature_selection import RFE

rfe = RFE(estimator=LinearRegression(), n_features_to_select=2)
# Fit the RFE object to the scaled features and target variable.
rfe.fit(X_scaled, y)
# Print which features have been selected.
print("Selected Features:", rfe.support_)
# Create a bar plot for feature rankings.
feature_ranking = pd.DataFrame({
    'Feature': selected_features,
    'Ranking': rfe.ranking_
})
# Sort the feature rankings and plot them.
feature_ranking.sort_values(by='Ranking').plot(kind='bar', x='Feature', y='Ranking', legend=False)
plt.title('Feature Ranking (Lower is Better)')
plt.ylabel('Ranking')
plt.show()

Here’s what the code does:

  • RFE: This is the magic function that does the heavy lifting. It uses an estimator (in this case, LinearRegression) to figure out how important each feature is based on its impact on the model.
  • n_features_to_select=2: This tells RFE that we want the top 2 features. You can tweak this number if you want more or fewer features depending on what you’re looking for.
  • rfe.fit(X_scaled, y): This fits the RFE model to your data—basically, it starts figuring out which features matter.
  • support_: This shows you which features were selected. It’s like a little cheat sheet telling you which ones made the cut.
  • ranking_: This one ranks your features, so the lower the number, the more important the feature is. We then plot these rankings so it’s easy to see which ones stand out.

Example Output:

Output
Selected Features: [ True  True  False ]

This tells us that the first two features, MedInc (median income) and AveRooms (average rooms per household), are the most important for predicting MedHouseValue (the median house value). The third feature, AveOccup (average house occupancy), didn’t make the cut, so it’s less important.

The bar plot you get from the code will give you a nice, visual breakdown of which features matter the most. As you can see, MedInc and AveRooms are the heavy hitters here, and that matches up with the model’s output. By focusing on these key features, your model can make better predictions without being weighed down by less informative ones.

For more insights into feature selection methods and their importance in model optimization, you can refer to this detailed guide on Feature Selection Techniques in Machine Learning with Python.

Conclusion

In conclusion, mastering multiple linear regression (MLR) in Python is a valuable skill for building powerful predictive models. By utilizing tools like scikit-learn and statsmodels, you can effectively apply MLR to analyze datasets, handle multicollinearity, and perform cross-validation. In this guide, we’ve covered essential steps, from preprocessing data to selecting important features, allowing you to predict outcomes like house prices based on various factors. As you continue exploring machine learning, keep in mind that refining your MLR skills will open doors to more advanced techniques and applications. The future of predictive modeling is evolving, and understanding tools like scikit-learn and statsmodels will keep you ahead of the curve.
