Uses of Linear Regression
Learning Objectives
- Find and interpret the correlation coefficient
- Identify the line of best fit (Least Squares Regression)
- Make predictions using a line of best fit
Why This Matters
Every time a bank decides your credit limit, an insurance company sets your premium, or a hospital predicts your surgical recovery time, a regression model has likely turned your data into a prediction. The correlation coefficient r helps show which of your attributes -- income, age, blood pressure -- track the outcome closely enough to include. The least squares line sets the equation, and the prediction tool gives the number that decides what happens next. These three tools are the engine behind credit scoring, actuarial tables, and clinical risk calculators that affect billions of decisions every year.
How to Use This Simulation
- Drag any data point on the scatter plot and watch r, r², and the regression equation update in real time.
- Toggle "Show Residuals" to see the vertical distances from each point to the regression line and the Sum of Squared Residuals (SSE).
- Use the Prediction Tool to enter an x value and see the predicted ŷ plotted on the regression line. The tool flags whether the prediction is an interpolation (x inside the observed range) or an extrapolation (x outside it).
- Switch between preset datasets to explore different correlation strengths and patterns.
What's Happening
Quick Check
A study finds that the correlation between daily exercise minutes and resting heart rate is r = −0.80. A health journalist writes: "Exercise explains 80% of the variation in resting heart rate." Is this claim accurate?
Try This
Load the "Hours Studied vs Exam Score" preset. Read off r and r² from the results panel. In one sentence, describe the strength and direction of the linear relationship using the qualitative label shown. Use the prediction tool to find ŷ when x = 4.5 hours. Write the predicted exam score with its units. Verify visually that your prediction falls on the regression line.
Using the "Apartment Sq Ft vs Monthly Rent" preset data (visible in the data table below the scatter plot), compute r by hand using the formula r = Σ[(xᵢ − x̄)(yᵢ − ȳ)] / √[Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)²]. Build a calculation table with columns for xᵢ, yᵢ, (xᵢ − x̄), (yᵢ − ȳ), their product, and their squares. Verify your r against the simulation. Then compute r² and interpret it: what percentage of the variation in monthly rent is explained by apartment size? Finally, predict rent for a 1200 sq ft apartment (interpolation) and for a 1500 sq ft apartment (extrapolation). In one sentence, explain why the 1500 sq ft prediction is less reliable.
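The calculation table described above can be sketched in a few lines of Python. The data values here are hypothetical placeholders, not the simulation's preset; substitute the rows from the data table before comparing against the simulation.

```python
import math

# Hypothetical (x, y) pairs: apartment size in sq ft and monthly rent in dollars.
# These are NOT the simulation's preset values; replace them with the data table's rows.
sq_ft = [600, 750, 900, 1050, 1200, 1350]
rent = [950, 1100, 1300, 1400, 1650, 1750]

x_bar = sum(sq_ft) / len(sq_ft)
y_bar = sum(rent) / len(rent)

# One row per data point: (x, y, x - x_bar, y - y_bar, product, dx squared, dy squared)
rows = []
for x, y in zip(sq_ft, rent):
    dx, dy = x - x_bar, y - y_bar
    rows.append((x, y, dx, dy, dx * dy, dx ** 2, dy ** 2))

# Column totals feed directly into the formula for r.
sum_products = sum(row[4] for row in rows)
sum_dx_sq = sum(row[5] for row in rows)
sum_dy_sq = sum(row[6] for row in rows)

r = sum_products / math.sqrt(sum_dx_sq * sum_dy_sq)
r_squared = r ** 2

for row in rows:
    print("  ".join(f"{value:12.2f}" for value in row))
print(f"r = {r:.4f}, r^2 = {r_squared:.4f}")
```

Each tuple in `rows` corresponds to one line of the hand-built table, so the printed columns can be checked against a paper calculation one entry at a time.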
A real estate investor has data on 50 recent home sales: square footage (x) ranges from 900 to 3,800 sq ft, with r = 0.74 between square footage and sale price. The regression equation is ŷ = 45,200 + 142.50x. The investor asks you to predict the sale price of a 4,500 sq ft luxury home. (1) Compute the prediction using the equation. (2) Compute r² and state what percentage of price variation is explained by size alone. (3) This prediction is extrapolation since the maximum observed home is 3,800 sq ft. In two sentences, explain why using this prediction to make a $600,000+ purchasing decision is risky, and name one additional variable that would make the prediction more reliable if included in a more sophisticated model.
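As a rough sketch of steps (1) and (2), the arithmetic can be checked with a short script. The equation, range, and r come from the exercise above; the function name `predict_price` is a hypothetical helper introduced only for illustration.

```python
# Values stated in the exercise.
INTERCEPT = 45_200.0        # dollars
SLOPE = 142.50              # dollars per sq ft
X_MIN, X_MAX = 900, 3_800   # observed square-footage range
R = 0.74

def predict_price(sq_ft):
    """Return (predicted price, is_extrapolation) for a given square footage."""
    y_hat = INTERCEPT + SLOPE * sq_ft
    is_extrapolation = not (X_MIN <= sq_ft <= X_MAX)
    return y_hat, is_extrapolation

price, extrapolated = predict_price(4_500)
r_squared = R ** 2

print(f"Predicted price: ${price:,.2f}")            # 45,200 + 142.50 * 4,500 = 686,450
print(f"Extrapolation: {extrapolated}")             # 4,500 is above the 3,800 maximum
print(f"r^2 = {r_squared:.4f} ({r_squared:.1%} of price variation)")
```

The flag only reports that x lies outside the observed range. It says nothing about how wrong the prediction might be, which is the point of part (3).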
Instructor Notes
Teaching Notes
The most effective teaching moment in this simulation is the outlier drag. Have students load the "Hours Studied vs Exam Score" preset, note that r = 0.92, then drag the last point (7, 88) downward to about (7, 40). Watch r plummet to roughly 0.35. The visual shock -- a single point destroying a strong correlation -- creates the conceptual anchor for why outlier analysis matters in any regression context.
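The same effect can be reproduced outside the simulation. The sketch below uses hypothetical data, not the actual preset (only the point (7, 88) is taken from the text), so the exact r values will differ from those quoted above. The helper `pearson_r` is an illustrative name.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from deviations about the means."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical stand-in for the preset: hours studied vs exam score.
hours = [1, 2, 3, 4, 5, 6, 7]
scores = [52, 58, 65, 70, 74, 83, 88]

r_before = pearson_r(hours, scores)

# "Drag" the last point from (7, 88) down to (7, 40).
scores_dragged = scores[:-1] + [40]
r_after = pearson_r(hours, scores_dragged)

print(f"r before drag: {r_before:.3f}")
print(f"r after drag:  {r_after:.3f}")
```

With only seven points, one outlier at the edge of the x range carries a large share of the total deviation, which is why a single drag can move r so far.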
The residual toggle is best introduced after students have seen the regression line and understand what it represents. Turn on residuals and ask: "Why does this line, and not some other line, give the smallest sum of squared vertical distances?" The sum-of-squared-residuals counter makes the abstract "least squares" criterion concrete and observable.
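One way to make the criterion tangible away from the simulation is to compute the least squares line and then compare its SSE with that of slightly shifted or tilted lines. The data below are hypothetical, and the size of each nudge is an arbitrary choice.

```python
# Hypothetical data, not one of the simulation presets.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

def sse(intercept, slope):
    """Sum of squared residuals for the line y = intercept + slope * x."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Closed-form least squares estimates.
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
intercept = y_bar - slope * x_bar

best_sse = sse(intercept, slope)

# Nudge the line up, down, steeper, and shallower; each alternative has a larger SSE.
alternatives = [
    (intercept + 0.5, slope),
    (intercept - 0.5, slope),
    (intercept, slope + 0.1),
    (intercept, slope - 0.1),
]

print(f"least squares line: y = {intercept:.3f} + {slope:.3f}x, SSE = {best_sse:.4f}")
for a, b in alternatives:
    print(f"  y = {a:.3f} + {b:.3f}x, SSE = {sse(a, b):.4f}")
```

Students can add their own candidate lines to `alternatives`; none will produce an SSE below `best_sse`.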
Common Student Errors
- Confusing r (the correlation coefficient: dimensionless and always between −1 and +1) with the slope b (which carries units of y per unit of x and is unbounded). The simulation displays both prominently to surface this distinction.
- Claiming r = 0.9 means 90% of variance is explained. The proportion explained is r² = 0.81, or 81%. The Quick Check targets this directly.
- Treating extrapolation predictions as if they were as reliable as interpolation predictions. The prediction tool visually flags extrapolation to build this awareness.
- Interpreting r = 0 as "no relationship" rather than "no linear relationship." A strong curved pattern can produce r near zero.
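Two of the errors above lend themselves to a quick numerical check: the gap between r and r², and a perfectly curved pattern whose r is zero. The sketch below uses a hypothetical helper `pearson_r` and made-up data (y = x² on x values symmetric about zero).

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from deviations about the means."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# r versus r^2: a correlation of 0.9 explains 81% of the variance, not 90%.
r = 0.9
explained = r ** 2
print(f"r = {r}, r^2 = {explained:.2f}")

# A perfect curved relationship with no linear component.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]          # y is completely determined by x
r_curve = pearson_r(xs, ys)
print(f"r for y = x^2 on symmetric x: {r_curve:.4f}")
```

Here y is fully determined by x, yet the positive and negative deviations cancel in the numerator of r, so the linear correlation vanishes.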
Discussion Questions
- If two datasets have the same r but different slopes, what does that tell you about the data? (Hint: think about the spread of x and y values.)
- A news article says a study found r = 0.95 between ice cream sales and drowning deaths. Should we ban ice cream? What's missing from this analysis?
- When would you choose to report r² instead of r to a non-technical audience? Which number is more interpretable and why?
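For the first discussion question, a small numerical experiment may help. The slope satisfies b = r · (s_y / s_x), where s_x and s_y are the standard deviations of x and y, so stretching the y values changes the slope without changing r. The data and the helper `r_and_slope` below are hypothetical.

```python
import math

def r_and_slope(xs, ys):
    """Return (r, least squares slope) for paired data."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy), sxy / sxx

# Hypothetical dataset and a copy whose y values are stretched tenfold.
xs = [1, 2, 3, 4, 5, 6]
ys = [3, 5, 4, 7, 8, 8]
ys_stretched = [10 * y for y in ys]

r_a, slope_a = r_and_slope(xs, ys)
r_b, slope_b = r_and_slope(xs, ys_stretched)

print(f"original:  r = {r_a:.4f}, slope = {slope_a:.4f}")
print(f"stretched: r = {r_b:.4f}, slope = {slope_b:.4f}")
```

The two datasets share the same r because the points sit equally tightly around their lines; only the spread of y, and therefore the slope, differs.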
Exam Connection
Typical exam questions give a correlation coefficient and ask students to interpret it, compute r², or make a prediction using the regression equation. The most common exam error is confusing r with r² (the Quick Check targets this). The Stretch challenge directly practices the hand-computation of r using the formula. The Challenge tier prepares students for questions that ask them to evaluate whether a prediction is reliable -- a higher-order question that distinguishes strong from average students.