Hi everyone, today we will be talking about R-square and Adjusted R-square. If you want to learn more about measuring the goodness of fit of your linear model, keep reading.
The objective of any regression exercise is to explain the variation in the dependent variable Y. Once a regression model has been fitted, the next step is to evaluate its performance and understand how good it is compared with a benchmark model.
In this blog we will discuss the topics listed below:
- What is R²?
- How to Calculate R²?
- Range of R².
- What is a good R² value?
- Limitation of R².
- Adjusted R².
What is R²?
R-square (R²), also known as the coefficient of determination, is the proportion of the variation in Y that is explained by the independent variables X. It is a measure of the goodness of fit of the model.
If R² is 0.8, it means that 80% of the variation in the output can be explained by the input variables. So, in simple terms, the higher the R², the more variation is explained by your input variables, and the better the model fits the data.
How to Calculate R-square (R²)?
R² is one minus the ratio of the residual sum of squares to the total sum of squares, that is, R² = 1 - (SSR / SST).
- SSR (Sum of Squares of Residuals) is the sum of the squares of the differences between the actual observed values (y) and the predicted values (y^).
- SST (Total Sum of Squares) is the sum of the squares of the differences between the actual observed values (y) and the average of the observed y values (Avg).
Let us understand these terms with the help of an example. Consider a simple example where we have some observations on how the experience of a person affects the salary.
We have the black line, which is the regression line that shows where the predicted values of salary lie with respect to experience along the x-axis. The stars represent the actual values of salary, which are the observed y values for each level of experience. The cross marks represent the predicted value of salary for an observed value of experience, denoted by y^.
SSR = Σ (yᵢ - y^ᵢ)², summed over i = 1 to n, where n is the number of observations.
The black line in the above image denotes where the average salary (Avg) lies, regardless of experience. SST = Σ (yᵢ - Avg)², summed over the same n observations.
R-squared can now be calculated as R² = 1 - (SSR / SST).
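As a rough illustration, the calculation R² = 1 - (SSR / SST) can be sketched in plain Python. The experience and salary figures below are made up for this example and are not taken from the article's chart. The helper names `fit_line` and `r_squared` are likewise hypothetical.

```python
# Hypothetical data (assumption): years of experience and salary in thousands.
experience = [1, 2, 3, 4, 5]
salary = [30, 35, 45, 50, 60]


def fit_line(x, y):
    """Least-squares slope and intercept for a single predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(actual, predicted):
    """R² = 1 - SSR / SST."""
    mean_actual = sum(actual) / len(actual)
    # SSR: squared gaps between observed values and predictions.
    ssr = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # SST: squared gaps between observed values and their average.
    sst = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ssr / sst


slope, intercept = fit_line(experience, salary)
predicted = [intercept + slope * x for x in experience]
r2 = r_squared(salary, predicted)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, R²={r2:.4f}")
```

On this toy data SSR is 7.5 and SST is 570, so R² comes out at roughly 0.987.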
Range of R-square (R²)
It is often said that the range of R² is 0 to 1, but it is actually (-infinity) to 1.
- R² = 0: The model explains none of the variation in Y, so it does no better than always predicting the mean of Y.
- R² = 1: It indicates a perfect fit.
- R² negative: The predictions are so bad that the residual sum of squares becomes greater than the total sum of squares.
“And what does a negative R-square mean?
It means that the model is performing worse than the horizontal line which predicts the mean value every time.”
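The three cases can be checked with a small sketch. The numbers are invented for illustration, and `r_squared` is a hypothetical helper that applies R² = 1 - (SSR / SST) directly.

```python
def r_squared(actual, predicted):
    """R² = 1 - SSR / SST."""
    mean_actual = sum(actual) / len(actual)
    ssr = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ssr / sst


# Hypothetical observed salaries (assumption, not from the article).
actual = [30, 35, 45, 50, 60]
mean_value = sum(actual) / len(actual)

# Predictions equal to the observations: SSR = 0, so R² = 1.
perfect = r_squared(actual, actual)

# Always predicting the mean: SSR = SST, so R² = 0.
baseline = r_squared(actual, [mean_value] * len(actual))

# Predictions that run in the wrong direction: SSR > SST, so R² < 0.
bad = r_squared(actual, [60, 50, 45, 35, 30])

print(perfect, baseline, bad)
```

The last case is worse than the horizontal mean line, which is exactly the situation the quote describes.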
What is a good R² value?
A value of 0 indicates that the dependent variable cannot be explained by the independent variable at all.
A value of 1 indicates that the dependent variable can be perfectly explained without error by the independent variable.
Now consider a hypothetical situation in which all the predicted values exactly match the actual observations in the dataset. In this case, y is equal to y^ for every observation, so SSR is zero and R² = 1.
In another scenario, if the predicted values lie far away from the actual observations, SSR grows without bound. This increases the ratio SSR/SST and therefore decreases R². Once SSR exceeds SST, R² becomes negative.
Thus R² helps us judge how well a model fits the data. The closer R² is to one, the better the regression fits.
When is R-square negative?
Appearances can be deceptive. R² is computed as 1 - (SSR / SST) rather than by squaring a quantity, so while it is surprising to see something called “squared” take a negative value, it is not impossible. (For an ordinary least-squares line with an intercept, R² does equal the square of the correlation between X and Y, and it cannot be negative on the data the line was fitted to.)
R² will be negative when the best-fit line or curve does an awful job of fitting the data. This typically happens when you fit a poorly chosen model (perhaps by mistake), when you apply constraints to the model that don’t make any sense (perhaps you entered a positive number when you intended to enter a negative number), or when you score a model on data it was not fitted to.
If R² is negative, check that you picked an appropriate model, and set any constraints correctly.
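One way to see how a constraint can produce a negative R² is to force the line through the origin on data that clearly needs an intercept. This is a minimal sketch with invented numbers. The constraint chosen here (no intercept) is an assumption made for illustration, not the only way R² can go negative.

```python
def r_squared(actual, predicted):
    """R² = 1 - SSR / SST."""
    mean_actual = sum(actual) / len(actual)
    ssr = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ssr / sst


# Hypothetical data that follows y = 102 - 2x exactly.
x = [1, 2, 3, 4, 5]
y = [100, 98, 96, 94, 92]

# Constrained model: y = b * x, with the intercept forced to zero.
slope_origin = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
r2_origin = r_squared(y, [slope_origin * xi for xi in x])

# Unconstrained model: y = a + b * x.
mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)
slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
intercept = mean_y - slope * mean_x
r2_with_intercept = r_squared(y, [intercept + slope * xi for xi in x])

print(f"through origin: {r2_origin:.2f}, with intercept: {r2_with_intercept:.2f}")
```

The constrained fit is far worse than simply predicting the mean of y, so its R² is strongly negative, while the unconstrained line fits this toy data perfectly.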
Limitation of R².
As above, consider the simple example where we have two variables, Experience and Salary, and we predict the salary from the experience of the employee. R² is then computed from this one-variable regression.
The problem appears when we add another variable to the regression equation.
Once we add a new variable to a least-squares model, SSR either decreases or stays the same, while SST is not affected, so SSR / SST cannot increase and R² cannot decrease. This is the limitation of R²: when we add variables, R² never decreases, even if the new variable is irrelevant.
So after adding a variable, R² alone cannot tell you whether the variable actually improved your model, because R² goes up or stays the same after every addition. To overcome this problem we use Adjusted R².
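This behaviour can be demonstrated with a small least-squares sketch in plain Python. Everything here is an assumption made for illustration: the simulated salary data, the pure-noise second variable, and the helper names `solve` and `ols_r_squared`. The fit uses the normal equations, which is fine for a toy example but is not how a statistics library would usually do it.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols_r_squared(columns, y):
    """Fit y on the predictor columns plus an intercept and return R²."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    predicted = [sum(bj * rj for bj, rj in zip(beta, r)) for r in rows]
    mean_y = sum(y) / n
    ssr = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ssr / sst


random.seed(0)
experience = [float(i) for i in range(1, 21)]
salary = [25 + 4 * x + random.gauss(0, 5) for x in experience]  # simulated
noise = [random.gauss(0, 1) for _ in experience]  # unrelated to salary

r2_one = ols_r_squared([experience], salary)
r2_two = ols_r_squared([experience, noise], salary)
print(f"R² with experience only:      {r2_one:.4f}")
print(f"R² with experience and noise: {r2_two:.4f}")
```

Even though the second variable is random noise, the R² of the larger model is not lower than that of the smaller one.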
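Like R², Adjusted R² measures the proportion of the variation in the dependent variable that the model explains, but it also accounts for the number of independent variables used.
The formula for Adjusted R-square:
Adjusted R² = 1 - (1 - R²) × (n - 1) / (n - p - 1), where n is the number of observations and p is the number of independent variables.
While R² increases as variables are added, the fraction (n - 1) / (n - p - 1) also increases as variables are added, and it multiplies the unexplained share (1 - R²).
Thus adjusted R² imposes a cost on adding variables to the regression, so Adjusted R-square can decrease when variables are added.
Hence, adjusted R² will only increase when the added variable improves the fit by more than enough to offset this cost.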
“Note that Adjusted R² is always less than or equal to R².”
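A short sketch of the adjustment, assuming the usual formula Adjusted R² = 1 - (1 - R²) × (n - 1) / (n - p - 1). The R² values, the sample size, and the function name `adjusted_r_squared` are hypothetical and chosen only to show the penalty at work.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R² for n observations and p independent variables."""
    if n - p - 1 <= 0:
        raise ValueError("n must be greater than p + 1")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


n = 20  # hypothetical number of observations

# Three variables with R² = 0.800.
before = adjusted_r_squared(0.800, n, p=3)

# A fourth variable nudges R² up to 0.801, which is not enough to pay for itself.
after = adjusted_r_squared(0.801, n, p=4)

print(f"before: {before:.4f}, after: {after:.4f}")
```

Here R² rises slightly when the fourth variable is added, yet the adjusted value falls from about 0.7625 to about 0.7479, and both adjusted values stay below their R².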
Therefore, it is recommended to use Adjusted R² rather than R² when comparing the goodness of fit of models with different numbers of variables.
So that is all about R² and adjusted R². I hope you enjoyed this!
If you have any questions or suggestions, please let me know!
Thank You!