Regression Analysis Tutorial
Introduction to Regression Analysis
Regression analysis is a powerful statistical method that allows you to examine the relationship between two or more variables of interest. While there are many types of regression analysis, at their core they all examine the influence of one or more independent variables on a dependent variable.
Types of Regression Analysis
There are several types of regression analysis, including:
- Linear Regression
- Multiple Linear Regression
- Polynomial Regression
- Logistic Regression
- Ridge Regression
- Lasso Regression
In this tutorial, we'll focus primarily on Linear Regression.
Linear Regression
Linear regression is the simplest form of regression. It allows us to predict a dependent variable (Y) based on the value of an independent variable (X). The relationship between X and Y is modeled by a linear equation:
Y = β0 + β1X + ε
Where:
- Y is the dependent variable we are trying to predict.
- X is the independent variable we are using to make the prediction.
- β0 is the y-intercept.
- β1 is the slope of the line.
- ε is the error term.
Example of Linear Regression
Let's consider a simple example where we are trying to predict a person's weight based on their height. We can use Python to perform this analysis.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Sample data
heights = np.array([150, 160, 170, 180, 190]).reshape(-1, 1)
weights = np.array([50, 60, 70, 80, 90])
# Create a linear regression model
model = LinearRegression()
model.fit(heights, weights)
# Predict weight for a new height
new_height = np.array([175]).reshape(-1, 1)
predicted_weight = model.predict(new_height)
print(f'Predicted weight for height 175 cm: {predicted_weight[0]} kg')
# Plot the data and the regression line
plt.scatter(heights, weights, color='blue')
plt.plot(heights, model.predict(heights), color='red')
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')
plt.title('Height vs Weight Linear Regression')
plt.show()
Predicted weight for height 175 cm: 75.0 kg
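The fitted model's parameters map directly onto the terms of the linear equation above: `model.intercept_` is β0 and `model.coef_` holds β1. A minimal sketch, reusing the same sample data (with this perfectly linear toy data, the fit is exact):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same sample data as in the example above
heights = np.array([150, 160, 170, 180, 190]).reshape(-1, 1)
weights = np.array([50, 60, 70, 80, 90])

model = LinearRegression()
model.fit(heights, weights)

# beta0 (intercept) and beta1 (slope) recovered from the fitted model;
# here weight = height - 100 exactly, so beta0 = -100 and beta1 = 1
print(f'Intercept (beta0): {model.intercept_:.2f}')
print(f'Slope (beta1): {model.coef_[0]:.2f}')
```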
Assumptions of Linear Regression
Linear regression analysis relies on several key assumptions:
- Linearity: The relationship between the independent and dependent variables is linear.
- Independence: The residuals (errors) are independent.
- Homoscedasticity: The residuals have constant variance.
- Normality: The residuals of the model are normally distributed.
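A quick practical check of these assumptions is to inspect the residuals of a fitted model. The sketch below (numpy only, with illustrative noisy data invented for this example; not a complete diagnostic suite) computes the residuals and two summary statistics. In practice you would also plot residuals against fitted values to look for curvature (non-linearity) or changing spread (heteroscedasticity):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: a linear trend plus random noise (seeded for reproducibility)
rng = np.random.default_rng(0)
X = np.arange(150, 200, 5, dtype=float).reshape(-1, 1)
y = X.ravel() - 100 + rng.normal(0, 2, size=X.shape[0])

model = LinearRegression()
model.fit(X, y)
residuals = y - model.predict(X)

# For least squares with an intercept, residuals always average to ~0;
# it is their spread and pattern (ideally plotted) that reveals
# heteroscedasticity or non-linearity.
print(f'Mean residual: {residuals.mean():.4f}')
print(f'Residual std:  {residuals.std():.4f}')
```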
Multiple Linear Regression
Multiple linear regression is an extension of simple linear regression. It allows us to predict a dependent variable based on multiple independent variables. The relationship is modeled by the equation:
Y = β0 + β1X1 + β2X2 + ... + βnXn + ε
Where:
- X1, X2, ..., Xn are the independent variables.
- β1, β2, ..., βn are the coefficients for each independent variable.
Example of Multiple Linear Regression
Consider predicting a person's weight based on their height and age. Here is an example using Python. (Note that in this toy dataset height and age are perfectly correlated, so the individual coefficients are not uniquely determined; real-world features should not be this collinear.)
import numpy as np
from sklearn.linear_model import LinearRegression
# Sample data
heights = np.array([150, 160, 170, 180, 190])
ages = np.array([20, 25, 30, 35, 40])
weights = np.array([50, 60, 70, 80, 90])
# Combine the features into a single array
X = np.column_stack((heights, ages))
# Create a linear regression model
model = LinearRegression()
model.fit(X, weights)
# Predict weight for new height and age
new_data = np.array([[175, 28]])
predicted_weight = model.predict(new_data)
print(f'Predicted weight for height 175 cm and age 28: {predicted_weight[0]:.2f} kg')
Predicted weight for height 175 cm and age 28: 75.00 kg
Evaluating Regression Models
It is important to evaluate the performance of your regression model. Common metrics include:
- R-squared (R²): Indicates the proportion of the variance in the dependent variable that is predictable from the independent variable(s).
- Mean Absolute Error (MAE): The average of the absolute errors.
- Mean Squared Error (MSE): The average of the squared errors.
- Root Mean Squared Error (RMSE): The square root of the MSE.
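All four metrics are available in (or easily derived from) scikit-learn's `sklearn.metrics` module. A minimal sketch using small illustrative arrays (the values are made up for demonstration):

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Illustrative true vs. predicted values
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)   # average |error|
mse = mean_squared_error(y_true, y_pred)    # average squared error
rmse = np.sqrt(mse)                         # back in the units of y
r2 = r2_score(y_true, y_pred)               # fraction of variance explained

print(f'MAE:  {mae:.3f}')   # 0.500
print(f'MSE:  {mse:.3f}')   # 0.375
print(f'RMSE: {rmse:.3f}')  # 0.612
print(f'R²:   {r2:.3f}')    # 0.949
```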
Conclusion
Regression analysis is a fundamental tool in data science for understanding and predicting relationships between variables. Whether you are using simple linear regression or more complex methods, the principles remain the same. By understanding and applying these techniques, you can gain valuable insights from your data.