Exploring Machine Learning - Part 2

Predicting Car Fuel Efficiency

Welcome back to the second part of our project on predicting car fuel efficiency. In Part 1, we laid the groundwork by loading, exploring, and preparing the Auto MPG dataset. Now, it's time to put that preparation to use and build our machine learning model.

In Part 2, we'll focus on training the model, evaluating its performance, and visualizing the results. These steps will help you understand how well your model performs and how to interpret its predictions.

What We'll Cover in Part 2:

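Prepare the Data: Select the features and target, then split the data into training and testing sets.
Train the Model: Build and train a linear regression model.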
Evaluate the Model: Assess the performance of the model using various metrics.
Visualize the Results: Create visual representations to understand the model's performance.
Conclusion: Summarize the findings and discuss potential improvements.

By the end of Part 2, you'll have a fully trained model capable of predicting car fuel efficiency and a deeper understanding of the machine learning process.

Let's dive in and complete our project!

Step 5: Prepare the Data

We’ll select the features and the target, then split the data into training and testing sets.

Why Split the Data into Training and Testing Samples?

Splitting the data into training and testing sets is a crucial step in machine learning for the following reasons:

Evaluation of Model Performance:
- The primary purpose of splitting the data is to evaluate how well a trained model generalises to unseen data. If we train and test the model on the same data, the score will be overly optimistic: an overfitted model performs well on the data it has already seen but poorly on new, unseen data, and we would have no way of noticing.
Avoiding Overfitting:
- By splitting the data, we ensure that the model's performance is evaluated on data it hasn't seen during training. A large gap between training and testing performance is a sign of overfitting, where the model learns the noise in the training data instead of the actual patterns.
Model Validation:
- The testing set acts as a proxy for new data. Evaluating the model on the testing set gives a realistic estimate of its performance on new data. This helps in validating the model before deploying it in a real-world scenario.
Hyperparameter Tuning:
- Hyperparameters should be tuned on a separate validation set (or with cross-validation) taken from the training data, not on the testing set. Keeping the testing set untouched until the very end means the final performance estimate is not biased by the tuning process.
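One common way to keep the testing set untouched while still comparing settings is to carve a validation set out of the training data. The sketch below is only an illustration: it uses a small synthetic DataFrame (the values are made up and are not the Auto MPG data), and the 60/20/20 proportions are an assumption, not a rule.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset (hypothetical values, 100 rows).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Weight': rng.uniform(1800, 4500, size=100),
    'Horsepower': rng.uniform(50, 200, size=100),
})
toy['MPG'] = 45 - 0.007 * toy['Weight'] + rng.normal(0, 2, size=100)

X_toy = toy[['Weight', 'Horsepower']]
y_toy = toy['MPG']

# First split: hold out 20% of the rows as the final testing set.
X_rest, X_test_toy, y_rest, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Second split: 25% of the remaining 80% becomes the validation set
# (20% of the original data), leaving 60% for training.
X_train_toy, X_val_toy, y_train_toy, y_val_toy = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(f'Training samples: {len(X_train_toy)}')
print(f'Validation samples: {len(X_val_toy)}')
print(f'Testing samples: {len(X_test_toy)}')
```

With this layout you would compare candidate settings on the validation set and look at the testing set only once, at the end.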

from sklearn.model_selection import train_test_split

# Define features and target
X = data[['Cylinders', 'Displacement', 'Horsepower', 'Weight', 'Acceleration', 'Model Year']]
y = data['MPG']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f'Training samples: {len(X_train)}')
print(f'Testing samples: {len(X_test)}')

Explanation

1. Importing the Function:

from sklearn.model_selection import train_test_split

This line imports the train_test_split function from the model_selection module of scikit-learn. This function is used to split the dataset into training and testing sets.

2. Defining Features and Target:

X = data[['Cylinders', 'Displacement', 'Horsepower', 'Weight', 'Acceleration', 'Model Year']]
y = data['MPG']

X: This variable contains the features (independent variables) used for prediction. Here, we select multiple columns ('Cylinders', 'Displacement', 'Horsepower', 'Weight', 'Acceleration', 'Model Year') from the dataset as features.
y: This variable contains the target variable (dependent variable) that we want to predict. Here, we select the 'MPG' column from the dataset as the target.

3. Splitting the Data:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

train_test_split(X, y, test_size=0.2, random_state=42):

X: The features to be split.
y: The target variable to be split.
test_size=0.2: This parameter specifies the proportion of the dataset to include in the test split. Here, 20% of the data is allocated to the testing set, and the remaining 80% is allocated to the training set.
random_state=42: This parameter sets the seed for random number generation. By setting this parameter, we ensure that the split is reproducible. The same seed value will always produce the same split.
The function returns four objects, in this order:
X_train: The training set features.
X_test: The testing set features.
y_train: The training set target variable.
y_test: The testing set target variable.

4. Printing the Sizes of the Training and Testing Sets:

print(f'Training samples: {len(X_train)}')
print(f'Testing samples: {len(X_test)}')

This code prints the number of samples in the training and testing sets. It helps verify that the data has been split correctly according to the specified proportions.

Summary of Explanation

By splitting the data into training and testing sets, we can effectively train our model on one portion of the data and evaluate its performance on another, unseen portion. This process helps ensure that our model generalises well to new data and provides an accurate estimate of its performance in real-world scenarios.
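To see what random_state does in practice, here is a minimal sketch on made-up data (the arrays below are placeholders, not the Auto MPG dataset). It splits the same data twice with the same seed and once with a different seed, then checks whether the testing sets match.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 50 samples with 2 features each.
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.arange(50)

# Two splits with the same seed.
_, _, _, y_test_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
_, _, _, y_test_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# One split with a different seed.
_, _, _, y_test_c = train_test_split(X_demo, y_demo, test_size=0.2, random_state=7)

same_seed_match = np.array_equal(y_test_a, y_test_b)
different_seed_match = np.array_equal(y_test_a, y_test_c)

print(f'Testing samples: {len(y_test_a)}')
print(f'Same seed gives the same split: {same_seed_match}')
print(f'Different seed gives the same split: {different_seed_match}')
```

The two splits that share a seed are identical, which is what makes results repeatable from one run to the next.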

Step 6: Train the Model

We’ll use a simple linear regression model to predict MPG.

from sklearn.linear_model import LinearRegression

# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Display the coefficients
print(f'Coefficients: {model.coef_}')
print(f'Intercept: {model.intercept_}')

Explanation

Let's break down and explain the code:

1. Creating the Model:

model = LinearRegression()

LinearRegression():

This creates an instance of the LinearRegression class from the sklearn.linear_model module.
LinearRegression is a linear model that fits a linear equation to observed data. It finds the best-fit line (or, with several features as in our case, the best-fit plane) that minimises the sum of squared residuals (the differences between the observed and predicted values).
By default, it includes an intercept (constant term) in the model, which represents the value of the dependent variable when all the independent variables are zero.

model:

This variable holds the instance of the LinearRegression model, which will be used for training and making predictions.

2. Training the Model:

model.fit(X_train, y_train)

fit(X_train, y_train):

This method trains the linear regression model on the training data.
- X_train:
  - The training set features. These are the independent variables (or predictors) that the model will use to learn patterns and relationships.
- y_train:
  - The training set target variable. This is the dependent variable (or response) that the model is trying to predict.

What happens during fit?

The fit method performs the following steps:

1. Computes the Parameters:

It calculates the coefficients (slopes) for each feature in X_train that minimise the sum of squared residuals between the actual target values (`y_train`) and the predicted target values.
These coefficients represent the relationship between each feature and the target variable.

2. Learns the Intercept:

It determines the intercept (constant term) of the best-fit line.

3. Stores the Model:

The computed coefficients and intercept are stored within the model object for use in making predictions on new data.

Summary of Explanation

Creating the Model: The line model = LinearRegression() initialises a linear regression model.
Training the Model: The line model.fit(X_train, y_train) trains the model using the training data, learning the relationship between the features (`X_train`) and the target variable (`y_train`).
After these steps, the model object contains the fitted linear regression model, which can be used to make predictions on new data and to evaluate the model's performance on the testing set or other unseen data.
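To make the idea of "minimising the sum of squared residuals" more concrete, the sketch below fits LinearRegression on synthetic data with a known relationship and compares the result with NumPy's general least-squares solver. The data and the names demo_model, X_demo, and y_demo are made up for illustration; this is not a description of how scikit-learn is implemented internally.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a known relationship: y = 3*x1 - 2*x2 + 5 (no noise).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(30, 2))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + 5

# Fit with scikit-learn.
demo_model = LinearRegression()
demo_model.fit(X_demo, y_demo)

# Solve the same least-squares problem directly. Appending a column of
# ones lets the last solved parameter play the role of the intercept.
X_with_ones = np.column_stack([X_demo, np.ones(len(X_demo))])
params, *_ = np.linalg.lstsq(X_with_ones, y_demo, rcond=None)

print(f'scikit-learn coefficients: {demo_model.coef_}')
print(f'scikit-learn intercept: {demo_model.intercept_}')
print(f'Least-squares parameters: {params}')
```

Both approaches recover the slopes 3 and -2 and the intercept 5 (up to floating-point rounding), because the data was generated from exactly that equation.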

Output

The output of the coefficients and intercept from the linear regression model provides insights into the relationship between the features and the target variable (MPG in this case).

Let's break down what each part of the output means:

Coefficients

The coefficients represent the change in the target variable (MPG) for a one-unit change in each feature, holding all other features constant. The output shows:

Coefficients: [-0.116173 0.00101347 -0.00227634 -0.00656101 0.06173551 0.7603644]

Each coefficient corresponds to a feature in the same order they were provided in X. Given the features used were:

['Cylinders', 'Displacement', 'Horsepower', 'Weight', 'Acceleration', 'Model Year']

The coefficients can be interpreted as follows:

1. Cylinders: -0.116173

For each additional cylinder, the predicted MPG decreases by approximately 0.12, assuming other factors remain constant.

2. Displacement: 0.00101347

For each unit increase in displacement, the predicted MPG increases by approximately 0.001, assuming other factors remain constant. This small positive sign looks counterintuitive. Displacement is strongly correlated with cylinders, horsepower, and weight, so its individual coefficient is hard to interpret in isolation.

3. Horsepower: -0.00227634

For each additional horsepower, the predicted MPG decreases by approximately 0.002, assuming other factors remain constant.

4. Weight: -0.00656101

For each additional pound of weight, the predicted MPG decreases by approximately 0.0066, assuming other factors remain constant.

5. Acceleration: 0.06173551

For each unit increase in acceleration, the predicted MPG increases by approximately 0.06, assuming other factors remain constant. In the Auto MPG dataset this column is usually described as the time in seconds to go from 0 to 60 mph, so a higher value means a slower car.

6. Model Year: 0.7603644

For each additional model year, the predicted MPG increases by approximately 0.76, assuming other factors remain constant.
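A labelled table makes it easier to keep track of which coefficient belongs to which feature. The sketch below simply copies the coefficient values from the output above into a pandas Series; the names coef_table and delta_mpg_100lb are made up for illustration.

```python
import pandas as pd

feature_names = ['Cylinders', 'Displacement', 'Horsepower',
                 'Weight', 'Acceleration', 'Model Year']

# Coefficient values copied from the output shown earlier.
coefficients = [-0.116173, 0.00101347, -0.00227634,
                -0.00656101, 0.06173551, 0.7603644]

coef_table = pd.Series(coefficients, index=feature_names, name='Coefficient')
print(coef_table)

# Predicted change in MPG for a car that is 100 lb heavier,
# with all other features held constant.
delta_mpg_100lb = coef_table['Weight'] * 100
print(f'Effect of +100 lb: {delta_mpg_100lb:.3f} MPG')
```

With the fitted model from Step 6 you could build the same table from pd.Series(model.coef_, index=X.columns) instead of typing the numbers by hand.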

Intercept

The intercept is the expected value of the target variable (MPG) when all the features are set to zero. The output shows:

Intercept: -15.0577585282361

This means that when all features are zero, the predicted MPG is approximately -15.06. In the context of this dataset, this intercept does not have a practical real-world interpretation, since a car with all features set to zero (e.g., zero cylinders, zero displacement) is not realistic and a negative MPG is impossible. However, the intercept is still a necessary part of the regression equation.

Summary

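Coefficients: Indicate the direction and magnitude of the relationship between each feature and the target variable (MPG). Positive coefficients mean the predicted MPG rises as the feature increases, while negative coefficients mean it falls. Keep in mind that each magnitude depends on the feature's units, so coefficients of features measured on different scales (e.g., pounds vs. cylinders) cannot be compared directly.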
Intercept: Represents the baseline value of the target variable when all features are zero, though its practical interpretation might be limited.
These values together form the regression equation used to predict MPG based on the provided features. The equation would look something like this:

MPG = −0.116173×Cylinders + 0.00101347×Displacement − 0.00227634×Horsepower − 0.00656101×Weight + 0.06173551×Acceleration + 0.7603644×Model Year − 15.0577585282361
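To check the equation by hand, the sketch below plugs one made-up car into it. The car's values are hypothetical and assume the usual Auto MPG conventions (displacement in cubic inches, weight in pounds, acceleration in seconds, and a two-digit model year such as 76, which is consistent with the size of the intercept).

```python
import numpy as np

# Coefficients and intercept copied from the output shown earlier.
coefficients = np.array([-0.116173, 0.00101347, -0.00227634,
                         -0.00656101, 0.06173551, 0.7603644])
intercept = -15.0577585282361

# Hypothetical car, in the same feature order as X:
# Cylinders, Displacement, Horsepower, Weight, Acceleration, Model Year
hypothetical_car = np.array([4, 120, 90, 2500, 15, 76])

# Multiply each feature by its coefficient, add them up, then add the intercept.
predicted_mpg = float(np.dot(coefficients, hypothetical_car) + intercept)
print(f'Predicted MPG: {predicted_mpg:.2f}')
```

For this made-up car the equation gives a prediction of about 26.71 MPG. This is the same calculation that model.predict performs for every row it is given.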

Step 7: Evaluate the Model

We’ll check how well our model performs on the testing data.

from sklearn.metrics import mean_squared_error, r2_score

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R^2 Score: {r2}')

Explanation

Let's break down and explain the code in detail:

Making Predictions and Evaluating the Model

The provided code snippet shows how to use a trained linear regression model to make predictions and evaluate its performance.

Here’s a detailed explanation of each part:

1. Make Predictions:

model.predict(X_test):

This line uses the trained linear regression model to make predictions on the test set features (`X_test`).
The predict method computes the predicted values of the target variable (MPG) based on the learned coefficients and intercept from the training phase.

y_pred:

This variable stores the predicted MPG values for the test set. These predictions will be compared with the actual MPG values (`y_test`) to evaluate the model's performance.

2. Evaluate the Model: Mean Squared Error (MSE):

mean_squared_error(y_test, y_pred):

This function calculates the Mean Squared Error (MSE) between the actual target values (`y_test`) and the predicted values (`y_pred`).
MSE is a common metric for regression models, representing the average of the squared differences between the actual and predicted values.

Why MSE?

MSE is useful because it penalises larger errors more significantly due to the squaring of differences, providing a clear indication of model performance.

mse:

This variable stores the computed MSE value. A lower MSE indicates better model performance, as it means the predicted values are closer to the actual values.

3. Evaluate the Model: R-Squared (R²) Score:

r2_score(y_test, y_pred):

This function calculates the R-Squared (R²) score, which represents the proportion of the variance in the dependent variable (MPG) that is predictable from the independent variables (features).

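Why R²?

The R² score is at most 1. A score of 1 indicates that the model perfectly explains the variance in the target variable, while a score of 0 indicates that the model does no better than always predicting the mean of the target. On a testing set the score can even be negative, which means the model performs worse than that mean-only baseline.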
R² is useful for understanding the goodness-of-fit of the model, i.e., how well the model captures the variability of the target variable.

r2:

This variable stores the computed R² score. A higher R² score indicates better model performance, as it means the model explains more of the variance in the target variable.

4. Print the Evaluation Metrics:

print(f'Mean Squared Error: {mse}')
print(f'R^2 Score: {r2}')

These lines print the computed Mean Squared Error (MSE) and R-Squared (R²) score, providing a summary of the model's performance.

Summary of Explanation

Making Predictions: The predict method generates predictions for the test set features based on the trained model.
Evaluating the Model: The mean_squared_error function calculates the average squared differences between actual and predicted values, while the r2_score function calculates the proportion of variance explained by the model.
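If you'd like to see what these two functions compute, the sketch below works both metrics out by hand on five made-up MPG values and compares them with scikit-learn's results. The arrays are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted MPG values.
y_true_demo = np.array([18.0, 25.0, 31.0, 22.0, 15.0])
y_pred_demo = np.array([20.0, 24.0, 28.0, 23.0, 16.0])

# MSE: the mean of the squared differences.
mse_manual = np.mean((y_true_demo - y_pred_demo) ** 2)

# R^2: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f'MSE by hand: {mse_manual}, scikit-learn: {mean_squared_error(y_true_demo, y_pred_demo)}')
print(f'R^2 by hand: {r2_manual}, scikit-learn: {r2_score(y_true_demo, y_pred_demo)}')

# Predictions that are worse than always guessing the mean give a negative R^2.
y_bad_demo = np.full_like(y_true_demo, 40.0)
r2_bad = r2_score(y_true_demo, y_bad_demo)
print(f'R^2 for a poor constant guess: {r2_bad}')
```

The hand-computed values match scikit-learn's, and the last line shows that R² is not limited to the range 0 to 1.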

Output

Interpreting the Metrics:

Mean Squared Error (MSE):
- The MSE value of 10.502 is the average squared difference between the actual and predicted MPG values. Because the differences are squared, this number is in squared units (MPG²) rather than MPG.
- A lower MSE is better, and this value suggests reasonable model performance, though there might be room for improvement.
R-Squared (R²) Score:
- The R² score of 0.794 indicates that the model explains approximately 79.4% of the variance in the target variable (MPG).
- A higher R² score (closer to 1) is better, and this value suggests a good fit of the model to the data.
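Since MSE is measured in squared units, its square root (the root mean squared error, or RMSE) is often easier to read because it is back in MPG. A quick check of the arithmetic for the MSE value quoted above:

```python
import math

mse_reported = 10.502  # MSE value quoted in this article
rmse = math.sqrt(mse_reported)
print(f'RMSE: {rmse:.2f} MPG')
```

This works out to roughly 3.24 MPG. RMSE weights large errors more heavily than small ones, so it is a rough guide to the size of a typical error rather than an exact average.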

Step 8: Visualise the Results

Let’s visualise the predicted MPG against the actual MPG to see how our model performs.

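import matplotlib.pyplot as plt

# Scatter plot of actual vs. predicted MPG
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred)
plt.xlabel('Actual MPG')
plt.ylabel('Predicted MPG')
plt.title('Actual vs. Predicted MPG')

# Reference line where the predicted value equals the actual value
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red')
plt.show()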

Explanation

plt.plot():
- The plot function from Matplotlib is used to create a 2D line plot. It takes at least two arguments: the x-coordinates and the y-coordinates of the points to be plotted.
[min(y_test), max(y_test)] (First argument):
- This list represents the x-coordinates of the two points that define the line.
- min(y_test) gives the minimum value of the actual target values in the test set.
- max(y_test) gives the maximum value of the actual target values in the test set.
- Together, [min(y_test), max(y_test)] defines the range of the x-coordinates from the minimum to the maximum value of y_test.
[min(y_test), max(y_test)] (Second argument):
- This list represents the y-coordinates of the two points that define the line.
- Similar to the x-coordinates, min(y_test) gives the minimum value, and max(y_test) gives the maximum value of the actual target values in the test set.
- Together, [min(y_test), max(y_test)] defines the range of the y-coordinates from the minimum to the maximum value of y_test.
color='red':
- This argument sets the color of the line to red. It helps to distinguish the reference line from other elements in the plot.

The purpose of this line of code is to draw a red reference line that goes from the bottom-left to the top-right of the scatter plot of actual vs. predicted values. This reference line represents the ideal scenario where the predicted values exactly match the actual values.

Output

Interpretation of the Scatter Plot and Results

The provided image shows a scatter plot of actual vs. predicted MPG values, along with a red reference line representing the ideal predictions where actual values equal predicted values. Let's interpret the results shown in the plot and the code.

Scatter Plot

Axes:
- X-axis: Actual MPG (Miles Per Gallon) values from the test set (y_test).
- Y-axis: Predicted MPG values (y_pred) generated by the linear regression model.
Data Points:
- Each blue dot represents a car from the test set. The horizontal position of a dot corresponds to the actual MPG of that car, while the vertical position corresponds to the predicted MPG.
Reference Line:

The red line is the reference line where predicted values perfectly match the actual values.
This line is drawn using the command:

plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red')

Points lying on this line indicate perfect predictions by the model.

Interpretation of the Scatter Plot

Alignment with the Reference Line:
- Many points are close to the red line, indicating that the model's predictions are reasonably accurate for those data points.
- Points that are further away from the red line indicate larger prediction errors. For example, if a point is above the red line, the model overestimated the MPG. If a point is below the red line, the model underestimated the MPG.
Overall Trend:
- There is a noticeable positive correlation between actual and predicted values, suggesting that the model captures the general trend well.
- However, there are some deviations from the line, indicating that the model's predictions are not perfect.
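The "above the line" and "below the line" reading can also be checked with numbers. The sketch below uses five made-up pairs of actual and predicted MPG values (not our real results) and counts how many predictions were too high or too low.

```python
import numpy as np

# Hypothetical actual and predicted MPG values.
y_actual_demo = np.array([18.0, 25.0, 31.0, 22.0, 15.0])
y_predicted_demo = np.array([20.0, 24.0, 28.0, 23.0, 16.0])

# Positive error: the point sits above the reference line (overestimate).
# Negative error: the point sits below the reference line (underestimate).
errors = y_predicted_demo - y_actual_demo

n_over = int(np.sum(errors > 0))
n_under = int(np.sum(errors < 0))
largest_miss = float(np.max(np.abs(errors)))

print(f'Overestimated: {n_over}, underestimated: {n_under}')
print(f'Largest miss: {largest_miss} MPG')
```

For these made-up values, three predictions are too high, two are too low, and the largest miss is 3.0 MPG. With the real data you would use y_pred - y_test in place of the demo arrays.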

Conclusion

Well done! You’ve built a model to predict car fuel efficiency based on various features. This project highlights the steps involved in a typical machine learning workflow: data loading, cleaning, exploration, preparation, modeling, and evaluation.

Feel free to tweak the model or experiment with different algorithms. The more you practice, the better you'll get at building and understanding machine learning models.

Keep exploring, keep coding 👩‍💻👨‍💻, and enjoy your journey into data analytics with Python!

Stay tuned for our next exciting project in the following edition!

Happy coding!🚀📊✨