Pearson's (r) for Discrete data?

Saturday 12 December 2020

A subscriber asked the following 'good' question:

"I have a student that wants to look at the correlation between time students spend on their phones and IB predicted grade. Will the student get penalized for using predicted grade, which is not continuous data, when doing Pearson's?"

This is a 'good' question (i.e. a lengthier one to answer!). Replies to such 'stats' questions often, I think, reveal 'grey' areas.

I imagine the student would want to calculate Pearson's r to justify using a regression equation to make predictions (?), i.e. rather than to test whether mobile use and IB grade are independent or not (a χ² test for independence).
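For contrast, a χ² test for independence would bin both variables into categories and ask only whether the observed counts deviate from what independence would predict. A minimal sketch, assuming scipy and a hypothetical 2×2 table (the counts below are made up purely for illustration):

```python
from scipy.stats import chi2_contingency

# Hypothetical counts: rows = mobile use (low, high),
# columns = IB grade band (5 and above, 4 and below).
observed = [[12, 4],
            [5, 11]]

# chi2_contingency returns the test statistic, the p-value,
# the degrees of freedom, and the expected counts under independence.
chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi-squared = {chi2:.2f}, p = {p:.3f}, dof = {dof}")
```

Note the trade-off: the χ² test says nothing about the direction or strength of any relationship, only whether the two categorical variables appear dependent.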

The scatter plot at the top of this page consists of made-up 'IB grade' (y) and 'time spent on mobile phone' (x) data, which I designed to have a negative correlation. It's advisable to plot two-variable data in a scatter plot to 'visualise' likely patterns, models etc.

The causal/independent factor is plotted on the x-axis (though arguments might be made, depending on what the student is using their mobile phone for, e.g. IB revision websites, research, a calculator etc., that poor or good grades could be 'causes' of changes in a student's time spent on their mobile!).


The calculated R² value = 0.963 (which, for 'linear' data, is the same as Pearson's r squared).
The interpretation of this would usually be something along the lines of '96.3% of the variation in y (IB grade) can be explained by variation in x (time spent on mobile)'. Yet we can see that the variation in the 'time on mobile' data is not the same in each IB grade category. This breaks the assumption of 'homoscedasticity': that the variance around the least squares regression model is approximately the same for any two data points selected at random. Pearson's r is a measure of the average strength of the relationship between the x and y variables. As with mean averages, such measures tell us little about any particular point, or subset of points, if the variation around the mean is large (and/or changing).
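To make the r and R² relationship concrete, here is a small sketch, assuming scipy/numpy. The numbers are not the data behind the scatter plot above, just illustrative values chosen to correlate negatively:

```python
import numpy as np
from scipy.stats import pearsonr, linregress

# Made-up illustrative data: minutes per day on a mobile (x)
# and IB grade on the 1-7 scale (y).
minutes = np.array([30, 45, 60, 90, 120, 150, 180, 210, 240, 300])
grade = np.array([7, 7, 6, 6, 5, 5, 4, 3, 3, 2])

r, p_value = pearsonr(minutes, grade)   # Pearson's r (negative here)
fit = linregress(minutes, grade)        # least squares regression line

# For a simple linear fit, R-squared is exactly Pearson's r squared.
print(f"r = {r:.3f}, R^2 = {fit.rvalue ** 2:.3f}")
print(f"regression line: y = {fit.slope:.4f}x + {fit.intercept:.2f}")
```

The code confirms the identity but, as argued above, it cannot check homoscedasticity for you; only inspecting the spread of the residuals (e.g. on the scatter plot) can do that.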

Spearman’s and Pearson’s
Is the R² interpretation above a reasonable one, in this context and given our scatter graph? Not really.

I think majority mathematical opinion would prefer Spearman's rank, or Kendall's tau-type measures, over Pearson's where one of the two variables is 'discrete'.
Spearman's rank is a 'Pearson's correlation coefficient' calculation, but performed on 'ranks' rather than on raw data. Any interpretation that follows is focused on the 'ranks' of each variable, not on the raw data values. This is perhaps a more reasonable degree of accuracy to aim for with such broad, multi-variate questions as 'what influences a person's IB grade', i.e. can I predict what IB grade (treated as a 'rank') a student will get based on their 'rank' in terms of time spent on mobile?
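The 'Pearson's on ranks' definition of Spearman's can be checked directly. A sketch assuming scipy, with the same hypothetical minutes/grade numbers as before (chosen only for illustration):

```python
import numpy as np
from scipy.stats import spearmanr, pearsonr, rankdata

# Hypothetical data: minutes on mobile (x) and IB grade (y).
minutes = np.array([30, 45, 60, 90, 120, 150, 180, 210, 240, 300])
grade = np.array([7, 7, 6, 6, 5, 5, 4, 3, 3, 2])

rho, _ = spearmanr(minutes, grade)

# Rank both variables (tied values receive the average of their ranks),
# then apply Pearson's formula to the ranks: this IS Spearman's rho.
r_on_ranks, _ = pearsonr(rankdata(minutes), rankdata(grade))

print(f"Spearman rho = {rho:.3f}, Pearson on ranks = {r_on_ranks:.3f}")
```

The two values agree exactly, including when there are ties (as with the repeated grades above), because average ranks are used on both sides.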

ToK
'Rank', for 'time spent on mobile app', would have to be determined by the student here, throwing up some nice ToK points relevant to the 2020 IB ToK essay title: '"Labels are a necessity in the organization of knowledge, but they also constrain our understanding." Discuss this statement with reference to two areas of knowledge.' Jim Noble is our resident ToK 'expert' and, I'm sure, could offer a lot of interesting reflection on this . . .

Internal Assessment Criteria

Whilst you can’t be penalised for the same error across more than one criterion, you can get credit across more than one criterion for a given section.

Use of 'r' can touch on a range of criteria. The descriptors for criterion E, from E3 (commensurate with the course) upward, all focus on the degree of understanding shown: 'limited, some, good, thorough'.


Criterion D is obviously also very relevant here, as is criterion C, based on the degree to which students engage and persist with the 'grey areas' that the use of statistical techniques unearths.

Some discussion of the degree of accuracy used for the parameters in the linear regression, and for the R² value displayed by the student, would be expected (criterion B). Justifications would need to be provided, i.e. is the level of accuracy given in the model (and R²) values above 'appropriate' given the 'IB grades' and 'time in minutes on mobile' context?

A ‘Good’ statistical education

I think an essential part of our job as 'educators' is to give students a very real, concrete understanding of the dangers of applying simple, or uni-dimensional, analysis to complex problems. The classic problem facing questions such as 'what factors affect a student's IB grade?' is that the answer is, at the least, a very large number (if not an infinite number!) of very diverse factors.

The problem of multi-collinearity is highlighted by such questions (and many other ‘human science’ questions).

To suggest that a statistical technique, no matter how ingenious, can offer any certainty as to the effects of just one factor on someone's IB grade, given the impossible task of isolating a single factor's effects from the multitude of other possible explanatory factors, should be of serious concern (and this is not an IB grading opinion, but an 'aim of education' professional view).

Moreover, not being sensitive to, or aware of, multi-variate analysis is the root cause of prejudice and bias (another 2020-21 ToK essay title!), e.g. does the questionnaire check what students who spend a long time on their mobile are using it for? A mobile phone today is a 'computer' with access to online learning, calculator apps, YouTube, TED talks etc.

Making clear any assumptions, and the reasoning behind them, is important.

In today's data-rich society, such complex issues are tackled by professionals skilled in both statistics and computer science. This talk offers students interested in careers in Data Science an excellent overview of the skillsets required to tackle such questions.