Situations when multicollinearity in regression model variables isn’t important

When you build a basic multiple regression model, predictor variables that are correlated with each other usually present a problem: the resulting coefficient estimates can become unstable, with inflated standard errors.

One way to test for multicollinearity is to check for relatively high Variance Inflation Factors (VIFs). Many R packages make this easy; most recently I used the check_collinearity function from the performance package.
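As a minimal sketch of what that looks like (assuming the performance package is installed; the model below just uses R's built-in mtcars data for illustration):

```r
library(performance)

# Fit an ordinary linear model with a few correlated predictors
fit <- lm(mpg ~ disp + hp + wt, data = mtcars)

# Report a VIF for each predictor; values above roughly 5-10 are
# commonly read as a sign of problematic collinearity
check_collinearity(fit)
```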

Over on the Statistical Horizons blog, Paul Allison writes clearly about three circumstances in which you may not actually need to worry much about high VIFs and multicollinearity.

In summary, they are:

  1. If the variables with high VIFs are all control variables that are not of individual interest to you. His example is a model where the dependent variable is graduation result and the variable of interest is type of college. It is not a great worry if the control variables – e.g. SAT scores and ACT scores – have high VIFs, as long as you don’t intend to interpret their coefficients. Multicollinearity is “only a problem for the variables that are collinear”.
  2. If the variables with high VIFs are transformations of other variables in the model, for instance powers or products of them. In a model like y = x + x^2 + … you would expect x and x^2 to be highly correlated (see the first sketch after this list).
  3. If the variables with high VIFs are indicator/dummy variables representing a single categorical variable with several categories. Particularly when the reference category is small, you might expect high VIFs among the indicators. You might then find that the p-values for the individual indicators are high, but an overall test that all the indicators have zero coefficients is unaffected (the second sketch below illustrates this).
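To make point 2 concrete, here is a small simulated sketch of my own (the data and coefficients are invented purely for illustration, not taken from Allison's post): with x drawn uniformly from (0, 10), x and x^2 are very strongly correlated, so both terms in a quadratic model show large VIFs even though nothing is wrong with the specification.

```r
library(performance)

set.seed(1)
x <- runif(200, min = 0, max = 10)
y <- 1 + 0.5 * x - 0.05 * x^2 + rnorm(200)

# x and its square are strongly correlated (around 0.97 for this range)
cor(x, x^2)

# Both the linear and quadratic terms show large VIFs, but the
# collinearity is inherent to including x and x^2 together
fit_quad <- lm(y ~ x + I(x^2))
check_collinearity(fit_quad)
```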
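And a similar sketch for point 3, again with simulated data of my own: a three-level categorical variable with a deliberately small reference category produces high VIFs for its dummy variables, yet the joint F-test of whether the whole variable matters (comparing models with and without it) is not harmed.

```r
library(performance)

set.seed(2)
n     <- 500
x     <- rnorm(n)
# Three-level factor with a deliberately small reference category "A"
group <- factor(sample(c("A", "B", "C"), n, replace = TRUE,
                       prob = c(0.02, 0.49, 0.49)))
y     <- 0.4 * x + 0.3 * (group == "B") + 0.5 * (group == "C") + rnorm(n)

# Code the categories as explicit dummies so each gets its own VIF
B <- as.numeric(group == "B")
C <- as.numeric(group == "C")

full    <- lm(y ~ x + B + C)
reduced <- lm(y ~ x)

# The two dummies are near mirror images of each other, so their VIFs are high
check_collinearity(full)

# The overall F-test that both dummy coefficients are zero is not
# affected by that collinearity
anova(reduced, full)
```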
