Regression Analysis Tutorial
Introduction to Regression Analysis
Regression analysis is a powerful statistical method that allows you to examine the relationship between two or more variables of interest. While there are many types of regression analysis, at their core they all examine the influence of one or more independent variables on a dependent variable.
Types of Regression Analysis
There are several types of regression analysis, including:
- Linear Regression
- Multiple Linear Regression
- Polynomial Regression
- Logistic Regression
- Ridge Regression
- Lasso Regression
In this tutorial, we'll focus primarily on Linear Regression.
Linear Regression
Linear regression is the simplest form of regression. It allows us to predict a dependent variable (Y) based on the value of an independent variable (X). The relationship between X and Y is modeled by a linear equation:
Y = β0 + β1X + ε
Where:
- Y is the dependent variable we are trying to predict.
- X is the independent variable we are using to make the prediction.
- β0 is the y-intercept.
- β1 is the slope of the line.
- ε is the error term.
Example of Linear Regression
Let's consider a simple example where we are trying to predict a person's weight based on their height. We can use Python to perform this analysis.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Sample data
heights = np.array([150, 160, 170, 180, 190]).reshape(-1, 1)
weights = np.array([50, 60, 70, 80, 90])
# Create a linear regression model
model = LinearRegression()
model.fit(heights, weights)
# Predict weight for a new height
new_height = np.array([175]).reshape(-1, 1)
predicted_weight = model.predict(new_height)
print(f'Predicted weight for height 175 cm: {predicted_weight[0]} kg')
# Plot the data and the regression line
plt.scatter(heights, weights, color='blue')
plt.plot(heights, model.predict(heights), color='red')
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')
plt.title('Height vs Weight Linear Regression')
plt.show()
Predicted weight for height 175 cm: 75.0 kg
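The fitted model's parameters map directly onto the terms of the linear equation above: `model.intercept_` is β0 and `model.coef_` holds β1. A minimal sketch, reusing the same sample data (with this perfectly linear toy data, the fit is exact):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same sample data as in the example above
heights = np.array([150, 160, 170, 180, 190]).reshape(-1, 1)
weights = np.array([50, 60, 70, 80, 90])

model = LinearRegression()
model.fit(heights, weights)

# beta0 (intercept) and beta1 (slope) recovered from the fitted model;
# here weight = height - 100 exactly, so beta0 = -100 and beta1 = 1
print(f'Intercept (beta0): {model.intercept_:.2f}')
print(f'Slope (beta1): {model.coef_[0]:.2f}')
```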
Assumptions of Linear Regression
Linear regression analysis relies on several key assumptions:
- Linearity: The relationship between the independent and dependent variables is linear.
- Independence: The residuals (errors) are independent.
- Homoscedasticity: The residuals have constant variance.
- Normality: The residuals of the model are normally distributed.
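A quick practical check of these assumptions is to inspect the residuals of a fitted model. The sketch below (numpy only, with illustrative noisy data invented for this example; not a complete diagnostic suite) computes the residuals and two summary statistics. In practice you would also plot residuals against fitted values to look for curvature (non-linearity) or changing spread (heteroscedasticity):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: a linear trend plus random noise (seeded for reproducibility)
rng = np.random.default_rng(0)
X = np.arange(150, 200, 5, dtype=float).reshape(-1, 1)
y = X.ravel() - 100 + rng.normal(0, 2, size=X.shape[0])

model = LinearRegression()
model.fit(X, y)
residuals = y - model.predict(X)

# For least squares with an intercept, residuals always average to ~0;
# it is their spread and pattern (ideally plotted) that reveals
# heteroscedasticity or non-linearity.
print(f'Mean residual: {residuals.mean():.4f}')
print(f'Residual std:  {residuals.std():.4f}')
```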
Multiple Linear Regression
Multiple linear regression is an extension of simple linear regression. It allows us to predict a dependent variable based on multiple independent variables. The relationship is modeled by the equation:
Y = β0 + β1X1 + β2X2 + ... + βnXn + ε
Where:
- X1, X2, ..., Xn are the independent variables.
- β1, β2, ..., βn are the coefficients for each independent variable.
Example of Multiple Linear Regression
Consider predicting a person's weight based on their height and age. Here is an example using Python. (Note that in this toy dataset height and age are perfectly correlated, so the individual coefficients are not uniquely determined; real-world features should not be this collinear.)
import numpy as np
from sklearn.linear_model import LinearRegression
# Sample data
heights = np.array([150, 160, 170, 180, 190])
ages = np.array([20, 25, 30, 35, 40])
weights = np.array([50, 60, 70, 80, 90])
# Combine the features into a single array
X = np.column_stack((heights, ages))
# Create a linear regression model
model = LinearRegression()
model.fit(X, weights)
# Predict weight for new height and age
new_data = np.array([[175, 28]])
predicted_weight = model.predict(new_data)
print(f'Predicted weight for height 175 cm and age 28: {predicted_weight[0]:.2f} kg')
Predicted weight for height 175 cm and age 28: 75.00 kg
Evaluating Regression Models
It is important to evaluate the performance of your regression model. Common metrics include:
- R-squared (R²): Indicates the proportion of the variance in the dependent variable that is predictable from the independent variable(s).
- Mean Absolute Error (MAE): The average of the absolute errors.
- Mean Squared Error (MSE): The average of the squared errors.
- Root Mean Squared Error (RMSE): The square root of the MSE.
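All four metrics are available in (or easily derived from) scikit-learn's `sklearn.metrics` module. A minimal sketch using small illustrative arrays (the values are made up for demonstration):

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Illustrative true vs. predicted values
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)   # average |error|
mse = mean_squared_error(y_true, y_pred)    # average squared error
rmse = np.sqrt(mse)                         # back in the units of y
r2 = r2_score(y_true, y_pred)               # fraction of variance explained

print(f'MAE:  {mae:.3f}')   # 0.500
print(f'MSE:  {mse:.3f}')   # 0.375
print(f'RMSE: {rmse:.3f}')  # 0.612
print(f'R²:   {r2:.3f}')    # 0.949
```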
Conclusion
Regression analysis is a fundamental tool in data science for understanding and predicting relationships between variables. Whether you are using simple linear regression or more complex methods, the principles remain the same. By understanding and applying these techniques, you can gain valuable insights from your data.