Predicting Value from Another Measured Variable

In research, it is often desirable to determine whether values of one variable can be predicted from the values of another variable. For example, in a psycholinguistics study, we may want to know how well reaction time in a lexical decision task (determining if a presented stimulus is a word or a non-word) can be predicted from word frequency (how frequently the stimulus word appears in the language in general). In this section, we review the two most common methods for studying the relationship between two variables/distributions.

The example data came from an in-class survey where we collected students' GPAs and their reported hours of sleep. We want to know if there is a relationship between these two variables. The example data are as follows:

GPA	Hours of sleep
3.1	8.5
3.3	6
3.4	7
3.0	6.5
3.9	7
3.4	5
3.8	8
2.8	3
3.7	8.5
3.6	8

Our null hypothesis for the analyses below will be:

H₀: There is no relationship between college students' GPA and the number of hours slept (or that one variable does not predict the other).

We first apply the Pearson r test to this data set, and then construct a linear regression model.

Pearson r Correlation

The Pearson r correlation test is applicable in situations where measurements are assumed to come from Normal distributions. If this assumption is unfamiliar, a rough rule of thumb is that the test suits most continuous measurements, but not variables that are clearly categorical or ordinal (see "Experimental Design Basics"). In the above example, both GPA and hours of sleep are continuous variables and the samples are drawn from a relatively large population, so the Pearson r test is appropriate.

Calculating the Pearson correlation coefficient, r, consists of three steps. In Step 1, we find the mean and standard deviation of each variable. Let's use G to refer to the GPA variable and H to the hours of sleep variable. The mean GPA is $\overline{G}$ = 3.4 and the standard deviation is s_G = 0.36. The mean hours of sleep is $\overline{H}$ = 6.75 and the standard deviation is s_H = 1.74.
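As a quick check of Step 1, the means and standard deviations can be recomputed from the table. This is a minimal sketch using Python's standard `statistics` module; the variable names (`gpa`, `hours`, `mean_g`, and so on) are chosen for this illustration only.

```python
from statistics import mean, stdev

# Survey data from the table above (one entry per student).
gpa = [3.1, 3.3, 3.4, 3.0, 3.9, 3.4, 3.8, 2.8, 3.7, 3.6]
hours = [8.5, 6, 7, 6.5, 7, 5, 8, 3, 8.5, 8]

mean_g, mean_h = mean(gpa), mean(hours)
# stdev() is the sample standard deviation (it divides by N - 1).
s_g, s_h = stdev(gpa), stdev(hours)

print(f"mean GPA   = {mean_g:.2f}, s_G = {s_g:.2f}")
print(f"mean hours = {mean_h:.2f}, s_H = {s_h:.2f}")
```

Rounded to two decimals, these reproduce the values quoted above.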

In Step 2, we calculate the covariance between GPA and hours of sleep. Covariance intuitively reflects whether GPA and hours of sleep deviate from their means in the same direction for each student. For example, if a student's GPA is above average and her hours of sleep are also above average, that student adds a positive term to the covariance. The formula for calculating covariance is:

$Cov(G, H) = \frac{1}{N-1} \sum (G_{i}-\overline{G})(H_{i} - \overline{H}) = \frac{1}{10 - 1}\left[(3.1-3.4)(8.5-6.75) + \ldots + (3.6-3.4)(8-6.75)\right] = 0.37$
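The covariance sum can be spelled out in a few lines of Python. This is only a sketch of the formula above; the function name `sample_covariance` is made up for this example.

```python
# Survey data from the table above (one entry per student).
gpa = [3.1, 3.3, 3.4, 3.0, 3.9, 3.4, 3.8, 2.8, 3.7, 3.6]
hours = [8.5, 6, 7, 6.5, 7, 5, 8, 3, 8.5, 8]


def sample_covariance(xs, ys):
    """Sum of paired deviation products, divided by N - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    products = ((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)


cov_gh = sample_covariance(gpa, hours)
print(f"Cov(G, H) = {cov_gh:.2f}")
```

Each term in the sum is positive when a student is above (or below) average on both variables, and negative when the two deviations point in opposite directions.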

In Step 3, the final step, the Pearson correlation coefficient is obtained by dividing the covariance by the product of the two standard deviations:

$r = \frac{Cov(G,H)}{s_{G}s_{H}} = \frac{0.37}{0.36 \times 1.74} = 0.59$
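Putting the three steps together, r can be computed directly from the raw data. The sketch below follows the same recipe (covariance divided by the product of the sample standard deviations); `pearson_r` is an illustrative name, not a library function.

```python
from statistics import mean, stdev

# Survey data from the table above (one entry per student).
gpa = [3.1, 3.3, 3.4, 3.0, 3.9, 3.4, 3.8, 2.8, 3.7, 3.6]
hours = [8.5, 6, 7, 6.5, 7, 5, 8, 3, 8.5, 8]


def pearson_r(xs, ys):
    """Covariance divided by the product of the sample standard deviations."""
    mean_x, mean_y = mean(xs), mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))


r = pearson_r(gpa, hours)
print(f"r = {r:.2f}")
```

The order of the two arguments does not matter, since the formula is symmetric in the two variables.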

So, we know the correlation coefficient between GPA and hours of sleep is 0.59. Is this a strong correlation? Are GPA and hours of sleep significantly correlated?

To answer the first question: sort of. Correlation coefficients range from -1 to 1, where -1 means a perfect negative correlation and 1 means a perfect positive correlation. (The null hypothesis corresponds to a population correlation of 0.) However, it is important to keep in mind that this correlation coefficient is calculated purely from the sample data. To determine whether a correlation exists in the population (whether GPA and hours of sleep are correlated for all students), we must test the significance of the obtained r. This will give us an answer to the second question.

The test for the significance of r is in fact a t-test. The formula is:

$t_{obt} = \frac{r - \rho}{\sqrt{\frac{1 - r^2}{N-2}}}$

where r is the correlation coefficient we just calculated, N is the number of subjects, and $\rho$ is the population correlation coefficient as specified by the null hypothesis, which is 0. Substituting r = 0.59, $\rho$ = 0, and N = 10 gives t_obt = 2.07.
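The substitution is easy to verify in code. This is a minimal sketch of the formula above, using the rounded r = 0.59 from the text; the function name `t_for_correlation` is hypothetical.

```python
from math import sqrt


def t_for_correlation(r, n, rho=0.0):
    """t statistic for a sample correlation r from n subjects (n - 2 df).

    rho is the population correlation under the null hypothesis, which is 0 here.
    """
    return (r - rho) / sqrt((1 - r**2) / (n - 2))


t_obt = t_for_correlation(r=0.59, n=10)
print(f"t_obt = {t_obt:.2f}")
```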

Given N - 2 = 8 degrees of freedom and an alpha level of 0.05 (two-tailed), the critical value of t is 2.306. Because our t_obt = 2.07 is smaller than the critical value, this is not a significant correlation. There is not sufficient evidence to suggest that GPA and hours of sleep are correlated in the population.
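As a cross-check that does not rely on a printed t table, the two-tailed p-value for the obtained t can be approximated numerically. The sketch below is only an illustration: the helper names `t_pdf` and `two_tailed_p` are made up for this example, and integrating the t density with Simpson's rule is one convenient choice here, not a method taken from the text.

```python
from math import gamma, pi, sqrt


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coeff = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value via Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * t_pdf(i * h, df)
    central_area = 2 * (h / 3) * total  # probability between -|t| and |t|
    return 1 - central_area


p_value = two_tailed_p(2.07, df=8)
verdict = "significant" if p_value < 0.05 else "not significant"
print(f"two-tailed p = {p_value:.3f} ({verdict} at alpha = 0.05)")
```

A p-value above 0.05 corresponds to the same decision as an obtained t below the two-tailed critical value.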

The lesson to take home: the absolute value of a correlation coefficient can be deceiving. Whether there is a significant correlation needs more than a calculation of the Pearson r.

Linear Regression

What if our research question is to find an equation that relates GPA to hours of sleep, so that we can predict a student's GPA given her reported hours of sleep? This can be achieved by constructing a best-fitting linear regression model based on the statistics we just calculated.

In simple linear regression, a dependent variable is predicted from the value of an explanatory variable, and the linear equation usually has the following form:

$y = a + bx$

where y is the dependent variable, x is the explanatory variable, b is the slope estimate, and a is the intercept estimate. Our goal is to find the values of a and b. (The null hypothesis might be said to be b = 0, as this would indicate a flat slope and thus that you cannot predict one from the other.)

In the context of our problem, y is GPA and x is hours of sleep. We can find the slope b directly by calculating:

$b = r\frac{s_{G}}{s_{H}} = 0.59 \times \frac{0.36}{1.74} = 0.12$

The intercept a is related to the slope b in the following way:

$a = \overline{G} - b \overline{H} = 3.4 - 0.12 \times 6.75 = 2.59$
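The slope and intercept can also be computed from the raw data without rounding intermediate values. This is a minimal sketch of the two formulas above; `fit_line` is an illustrative name chosen for this example.

```python
from statistics import mean, stdev

# Survey data from the table above (one entry per student).
gpa = [3.1, 3.3, 3.4, 3.0, 3.9, 3.4, 3.8, 2.8, 3.7, 3.6]
hours = [8.5, 6, 7, 6.5, 7, 5, 8, 3, 8.5, 8]


def fit_line(xs, ys):
    """Return (a, b) for y = a + b * x, using b = r * s_y / s_x."""
    mean_x, mean_y = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)
    r = cov / (s_x * s_y)
    b = r * s_y / s_x          # slope
    a = mean_y - b * mean_x    # intercept
    return a, b


a, b = fit_line(hours, gpa)  # x = hours of sleep, y = GPA
print(f"GPA = {a:.2f} + {b:.2f} * hours")
```

Working from unrounded values gives an intercept of about 2.58 rather than 2.59; the small gap comes from rounding b to 0.12 before computing a.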

Thus, the linear regression model is GPA = 2.59 + 0.12 * Hours of sleep. Of course, taking the equation to extreme cases yields funny interpretations. For example, according to the model, a student who gets no sleep at all is predicted to have a GPA of 2.59, which is probably not true. For the same reason, a student who gets 20 hours of sleep is predicted to have a GPA of 4.99, which is also unlikely. Thus, when interpreting the result as "one more hour of sleep increases GPA by 0.12", we should be careful to state the range over which the model applies; here the observed sleep values run from 3 to 8.5 hours.
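One way to keep the model's range of applicability in view is to flag predictions made outside the observed data. The sketch below uses the coefficients from the text; the function `predict_gpa` and the constant names are hypothetical, and treating the observed minimum and maximum as the boundary is an assumption made for illustration.

```python
INTERCEPT = 2.59             # a, from the model in the text
SLOPE = 0.12                 # b, from the model in the text
OBSERVED_HOURS = (3.0, 8.5)  # smallest and largest sleep values in the survey


def predict_gpa(hours):
    """Return (predicted GPA, True if hours lies within the observed range)."""
    low, high = OBSERVED_HOURS
    return INTERCEPT + SLOPE * hours, low <= hours <= high


for h in (0, 7, 20):
    gpa_hat, in_range = predict_gpa(h)
    note = "" if in_range else " (extrapolation, interpret with caution)"
    print(f"{h:>2} hours -> predicted GPA {gpa_hat:.2f}{note}")
```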
