Regression

Regression is a statistical approach for finding the relationship between variables. It allows you to estimate the value of a dependent variable (Y) from a given independent variable (X).

When to use

  • You want to understand how two numerical variables are related.
  • Your data is continuous (see Types of Data for more information).
  • Your data is linked (paired) – each value of your independent variable has a matching value for the dependent variable (like X-Y coordinates).

Features

In a regression model, variable X is known as the predictor variable and variable Y is known as the response variable.

There are many types of regression:

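Using linear regression, we find the straight line that best “fits” the data, known as the least squares regression line. It is the line that minimises the sum of the squared vertical distances between the data points and the line. An online calculator can be found HERE or you can use a spreadsheet program such as Google Sheets or Microsoft Excel.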
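As a rough illustration of what a calculator or spreadsheet does behind the scenes, the sketch below computes the least squares line by hand. The data values and the function name least_squares_line are made up for this example and do not come from this page.

```python
# Hypothetical paired observations (illustrative only, not from this page).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


def least_squares_line(x, y):
    """Return (slope, intercept) of the least squares regression line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations of x and y, and sum of squared deviations of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    # The least squares line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = least_squares_line(x, y)
print(f"Y = {slope:.3f}X + {intercept:.3f}")
```

The slope is the sum of cross-deviations divided by the sum of squared X deviations, and the intercept follows from the fact that the line passes through the point of means.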

Logistic regression is used to fit a regression model that describes the relationship between one or more predictor variables and a binary response variable (such as yes/no or does/does not). For example, researchers want to know how exercise and weight impact the probability of developing diabetes. To understand the relationship between the predictor variables and the probability of developing diabetes, researchers can perform logistic regression because there are only two potential outcomes: either someone develops diabetes, or they do not. An online calculator can be found HERE.
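To show how a fitted logistic model turns predictor values into a probability, here is a minimal sketch. The coefficients are invented purely for illustration (a real analysis would estimate them from data), and the function name diabetes_probability and the units are assumptions.

```python
import math


def diabetes_probability(exercise_hours, weight_kg,
                         b0=-6.0, b_exercise=-0.3, b_weight=0.08):
    """Predicted probability from a logistic model.

    The default coefficients are hypothetical, chosen only to
    illustrate the shape of the model.
    """
    # Linear combination of the predictors, as in linear regression.
    score = b0 + b_exercise * exercise_hours + b_weight * weight_kg
    # The logistic function squeezes the score into the range 0 to 1.
    return 1.0 / (1.0 + math.exp(-score))


for hours in (0, 3, 6):
    p = diabetes_probability(hours, 80)
    print(f"{hours} h exercise per week, 80 kg: probability = {p:.2f}")
```

Because the output is always between 0 and 1, it can be read as the probability of the "yes" outcome, which is why this model suits a binary response variable.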

Polynomial regression is used to fit a regression model that describes the relationship between one or more predictor variables and a numeric response variable using a curve (a polynomial, such as a quadratic) rather than a straight line. This is sometimes done after you try linear regression and observe that a polynomial curve would fit the data better. In the figure below, polynomial regression results in a higher R² value (0.9749 compared to 0.8928), which indicates that the polynomial curve fits the data better. An online calculator can be found HERE.
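The comparison between a straight line and a polynomial curve can be sketched as follows. The data are hypothetical (deliberately curved), and the helper names solve, polyfit and r_squared are introduced here for illustration; this is one simple way to do the fit, not necessarily how the online calculator works.

```python
def solve(a, b):
    """Solve the linear system a * coeffs = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: move the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coeffs = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * coeffs[c] for c in range(r + 1, n))
        coeffs[r] = (m[r][n] - s) / m[r][r]
    return coeffs


def polyfit(x, y, degree):
    """Least squares polynomial coefficients, lowest power first."""
    size = degree + 1
    # Normal equations for the polynomial model.
    a = [[sum(xi ** (i + j) for xi in x) for j in range(size)]
         for i in range(size)]
    b = [sum(yi * xi ** i for xi, yi in zip(x, y)) for i in range(size)]
    return solve(a, b)


def r_squared(x, y, coeffs):
    """R-squared: 1 minus residual variation over total variation."""
    mean_y = sum(y) / len(y)
    predicted = [sum(c * xi ** k for k, c in enumerate(coeffs)) for xi in x]
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Hypothetical curved data (illustrative only).
x = [0, 1, 2, 3, 4, 5, 6]
y = [1.2, 1.9, 4.8, 10.1, 17.2, 25.8, 37.1]

r2_linear = r_squared(x, y, polyfit(x, y, 1))
r2_quadratic = r_squared(x, y, polyfit(x, y, 2))
print(f"Linear R-squared:    {r2_linear:.4f}")
print(f"Quadratic R-squared: {r2_quadratic:.4f}")
```

On the same data, a quadratic can never have a lower R-squared than a straight line, so a higher value on its own does not prove that the curve is the more sensible model.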

Multiple linear regression finds the best-fitting linear model for data comprising two or more independent variables (for example X1 and X2) and one dependent variable (Y). For example, if you collected data on height, age and number of flowers of a certain plant species, multiple regression would allow you to predict the number of flowers based on a plant’s height and age. An online calculator can be found HERE.
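For the two-predictor case, the least squares coefficients can be worked out directly, as in the sketch below. The plant measurements are invented for illustration, and fit_two_predictors is a hypothetical helper name.

```python
def fit_two_predictors(x1, x2, y):
    """Least squares fit of Y = b0 + b1*X1 + b2*X2. Returns (b0, b1, b2)."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    # Sums of squared deviations and cross-deviations about the means.
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    # Solve the 2x2 normal equations with Cramer's rule.
    # det is zero if X1 and X2 are perfectly correlated.
    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2


# Hypothetical plant data (illustrative only).
height_cm = [12, 15, 20, 24, 30, 33]
age_weeks = [3, 5, 4, 8, 7, 10]
flowers = [4, 6, 7, 11, 12, 15]

b0, b1, b2 = fit_two_predictors(height_cm, age_weeks, flowers)
print(f"flowers = {b0:.2f} + {b1:.2f} * height + {b2:.2f} * age")
```

Each coefficient estimates the change in the response for a one-unit change in that predictor while the other predictor is held constant.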

Using regression to predict/explain results

Once you have fitted your regression model, you will have an equation that predicts the response variable for different values of the predictor variable. This can be used to estimate values beyond the range of the data you collected (extrapolation, which becomes less reliable the further you go from your data) or as a basis for explaining the relationship between the variables.

Meaning of R² in regression

R-squared (R²), also called the coefficient of determination, is a goodness-of-fit measure for regression models. It is based on the differences between each data point and your line/curve (the residuals), as shown in the figure to the right.

It measures the proportion of the variation in the dependent variable that is explained by your line/curve, on a 0-100% scale. The closer R² is to 100% (1), the better your line/curve fits the data.
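One common way of writing this down is shown below. The symbols are notation introduced here rather than taken from this page: y_i is each observed value, \hat{y}_i is the value predicted by the line/curve, and \bar{y} is the mean of the observed values.

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

The numerator adds up the squared differences between the data points and the line/curve, so the smaller those differences are, the closer the result is to 1.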

NOTE: This R-squared value (the coefficient of determination) is different to the Pearson Correlation Coefficient and the Spearman Rank Correlation. Don't confuse them!

The graphs below show the difference between a high R² value and a low R² value. The data points in the graph on the left are closer to the line than those on the right.

Are low R² values always bad?

No. Some areas of study have an inherently higher amount of unexplainable variation. In these areas, your R² values will always be lower. For example, studies that try to explain human behaviour generally have R² values less than 50%.

However, if you have a low R² value but the independent variables are statistically significant, you can still draw important conclusions about the relationships between the variables.

Are high R² values always good?

No! A regression model with a high R² value can have problems. For example, the regression equation may predict negative values for the response variable, which may be nonsensical, or in a polynomial model the equation may predict that the response variable will begin reducing in value after a certain point, which may also be nonsensical.

Your Turn

You decide to investigate the usefulness of Plecoptera (stonefly) nymphs as indicators of environmental factors in streams. Samples from 15 streams are obtained by displacing nymphs from a streambed into a net by means of a standardised-kick technique. Values of water hardness – calcium carbonate concentration – are obtained from the local water authority. The observations are shown in the table below.

You decide to do linear regression on the two variables.

Using the online calculator HERE, enter your data.

The results show the following:

 The R² value is 0.4231.

 The equation of the line is Y = -0.1727X + 26.31.

 When water has zero calcium carbonate concentration, the expected number of Plecoptera (stonefly) nymphs is about 26.

 When water has zero calcium carbonate concentration, we can say with 95% confidence that the expected number of Plecoptera (stonefly) nymphs will be between about 15 and 38.

 We expect there to be zero nymphs when the calcium carbonate concentration is about 152.

The y-intercept (where the regression line crosses the y-axis) indicates the expected number of nymphs when the calcium carbonate concentration is zero.

The x-intercept (where the regression line crosses the x-axis) indicates the concentration of calcium carbonate at which there will be zero nymphs.
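The two intercepts can be checked with a few lines of arithmetic, using the regression equation from this exercise (-0.1727X + 26.31). The function name expected_nymphs is a hypothetical helper introduced here.

```python
# Coefficients taken from the regression equation in this exercise.
slope = -0.1727
intercept = 26.31


def expected_nymphs(caco3_concentration):
    """Predicted number of nymphs at a given calcium carbonate concentration."""
    return slope * caco3_concentration + intercept


# y-intercept: the prediction when the concentration is zero.
y_intercept = expected_nymphs(0)

# x-intercept: solve 0 = slope * X + intercept for X.
x_intercept = -intercept / slope

print(f"Expected nymphs at zero concentration: {y_intercept:.2f}")
print(f"Concentration at which zero nymphs are expected: {x_intercept:.1f}")
```

Beyond the x-intercept the equation predicts a negative number of nymphs, which is a reminder that a regression line should only be trusted within a sensible range.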

The R² value of 0.4231 is not high. This indicates that the regression line does not fit the data particularly well. We can see this from the graph on the output page.

The graph also highlights that the first three data points look like outliers (or at least require explanation).

Re-run your data excluding the first three points.

You should see the new R² value of 0.7764. What can you conclude from this?

  • The new linear regression line fits the data better.
  • The first three points should clearly be excluded.
  • If the first three points are indeed valid results, you should try a different form of regression (perhaps polynomial regression).

The new linear regression line does fit the data better.

The first three data points need to be investigated. You should not make the decision to exclude them based solely on the regression results.

It seems reasonable that polynomial regression might fit the complete data set better. However, using the polynomial regression calculator HERE, we can see that the R² value is 0.6785, lower than the 0.7764 obtained from linear regression without the first three points. It also raises a question: is it reasonable that the number of nymphs starts increasing once the CaCO3 concentration reaches about 120?

 

