Correlation and Causation
Learning Objectives
- Interpret the slope and y-intercept of the least squares regression line
- Understand the difference between correlation and causation
Why This Matters
Every time a headline says "coffee causes cancer" or "chocolate prevents heart disease," a correlation has been dressed up as a cause. The researchers found two variables moving together; the journalist wrote that one made the other happen. That swap -- from "linked to" to "causes" -- has driven billions in misdirected health spending and contradictory dietary guidelines. Learning to spot it is one of the most useful ways to separate a finding you can act on from one you can't.
How to Use This Simulation
- Start with the Interpret the Equation tab to practice reading slope and y-intercept in real-world context.
- Switch to Correlation vs Causation to classify real-world correlations using the four-causes framework.
- Watch the Explanation Panel below -- it updates as you interact and connects both skills to interpreting regression results responsibly.
- Work through the Quick Check and Try This challenges to test your understanding.
When two variables are correlated, four structures are possible: X causes Y, Y causes X (reverse causation), a lurking variable causes both, or the correlation is coincidental. Classify each scenario below.
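The lurking-variable structure can be seen in a small simulation. The sketch below is illustrative only: the variables (temperature, ice cream sales, sunburn cases) and every coefficient are made-up assumptions, not data from this simulation. Neither outcome appears in the other's formula, yet the two are correlated because both depend on temperature.

```python
import random
import statistics


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical lurking variable: daily high temperature (degrees F).
temperature = [rng.uniform(50, 100) for _ in range(2000)]

# Both outcomes depend on temperature plus independent noise.
# Neither outcome is used to compute the other.
ice_cream_sales = [2.0 * t + rng.gauss(0, 15) for t in temperature]
sunburn_cases = [0.5 * t + rng.gauss(0, 5) for t in temperature]

r = pearson_r(ice_cream_sales, sunburn_cases)
print(f"r across all days: {r:.2f}")

# Hold the lurking variable roughly constant: keep only days in a
# narrow temperature band and recompute the correlation.
band = [
    (s, b)
    for t, s, b in zip(temperature, ice_cream_sales, sunburn_cases)
    if 70 <= t < 75
]
r_band = pearson_r([s for s, _ in band], [b for _, b in band])
print(f"r on days between 70 and 75 degrees: {r_band:.2f}")
```

Under these assumptions the correlation across all days should be clearly positive, while the correlation inside the narrow temperature band should sit close to zero. That pattern is what a lurking variable predicts, and it is not what "X causes Y" predicts.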
Quick Check
A regression analysis of apartment rental data finds: predicted monthly rent = 800 + 2.4 × (square footage). A student writes: "The y-intercept tells us that a zero-square-foot apartment costs $800 per month." Which response best evaluates this interpretation?
Try This
A regression analysis of data from a local gym gives the equation ŷ = −0.8 + 0.6x, where x is weekly exercise hours and ŷ is predicted monthly weight loss in pounds.
(1) Identify the slope and y-intercept. (2) Write the slope interpretation: "For each additional _____, the predicted _____ increases/decreases by _____." (3) The y-intercept predicts −0.8 pounds of weight loss (a slight weight gain) for someone who exercises 0 hours per week. Is this interpretation contextually meaningful, or does it represent extrapolation beyond the data? Explain in one sentence.
Verify your interpretations against the simulation's Interpret the Equation tab using a similar scenario.
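The arithmetic for the gym equation can also be checked by hand or with a few lines of code. In this minimal sketch the function name `predicted_weight_loss` and the chosen x values are illustrative assumptions. It only evaluates the line; the contextual interpretation is still yours to write.

```python
def predicted_weight_loss(exercise_hours):
    """Predicted monthly weight loss (lb) from y-hat = -0.8 + 0.6x."""
    return -0.8 + 0.6 * exercise_hours


# The prediction at x = 0 is the y-intercept.
at_zero = predicted_weight_loss(0)

# The change in prediction for a one-hour increase is the slope,
# whichever starting value of x is used.
change_from_4_to_5 = predicted_weight_loss(5) - predicted_weight_loss(4)
change_from_9_to_10 = predicted_weight_loss(10) - predicted_weight_loss(9)

print(f"prediction at 0 hours: {at_zero:.1f} lb")
print(f"change from 4 to 5 hours: {change_from_4_to_5:.1f} lb")
print(f"change from 9 to 10 hours: {change_from_9_to_10:.1f} lb")
```

The two differences match because a line has the same slope everywhere. Whether x = 0 is a reasonable value for a gym member is a question about the data, not the arithmetic.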
A study reports r = 0.65 between weekly hours of social media use and self-reported anxiety levels among college students. A campus wellness newsletter writes: "Social media is driving the anxiety epidemic among students."
(1) List all four possible causal structures consistent with this correlation. (2) Propose at least one plausible lurking variable and explain how it could produce the observed correlation without social media causing anxiety. (3) In two sentences, explain why the correlation alone does not support the newsletter's causal claim. (4) What kind of study design (think back to the first simulation in this series) would provide stronger causal evidence? Name the design and explain why it helps.
A health magazine publishes the headline: "Study Finds People Who Eat Breakfast Have Lower BMI -- Skipping Breakfast Causes Weight Gain." The underlying study surveyed 2,400 adults and found r = −0.38 between breakfast frequency (days per week) and BMI.
(1) Identify the correlation-causation conflation in the headline. What did the study actually find, and what did the headline claim? (2) Propose at least two alternative causal structures consistent with r = −0.38 (reverse causation, lurking variable). Name specific variables. (3) What evidence would be needed to support the causal claim in the headline? Be specific about study design. (4) Write a corrected headline (under 15 words) that accurately represents what the study established without implying causation.
Instructor Notes
Teaching Notes
This simulation works best when you let students encounter Card 4 (margarine vs divorce, r = 0.99) without warning. Most students will have built confidence classifying the first three cards correctly, and the near-perfect correlation on Card 4 triggers the prediction that it must be strongly causal. The surprise when they learn it's coincidental is the pedagogical hinge of the entire simulation -- and arguably the most important single moment in the 31-simulation series. Do not preview the punchline.
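If students ask how a near-perfect correlation can be pure coincidence, a short demonstration may help: when many unrelated series are compared, some pair will line up by chance, especially when each series is short and drifts over time. The sketch below is a hypothetical illustration using random walks, not the margarine or divorce data, and the counts (100 series, 10 values each) are arbitrary assumptions.

```python
import itertools
import random
import statistics


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(1)  # fixed seed so the run is repeatable


def random_walk(length):
    """A series built from independent random steps."""
    value, walk = 0.0, []
    for _ in range(length):
        value += rng.gauss(0, 1)
        walk.append(value)
    return walk


# 100 unrelated series of 10 "yearly" values each.
series = [random_walk(10) for _ in range(100)]

# Search every pair for the strongest correlation, in either direction.
strongest = max(
    abs(pearson_r(a, b)) for a, b in itertools.combinations(series, 2)
)
print(f"strongest |r| among {len(series)} unrelated series: {strongest:.2f}")
```

No series influences any other, yet searching thousands of pairs reliably turns up a very strong correlation. The same search effect is one plausible way a coincidental pairing like Card 4 can be found.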
For the slope interpretation work, ask students to compare the Housing Prices scenario (where the intercept is contextually questionable) with the Coffee Shop scenario (where the intercept makes sense). The question "When does the y-intercept tell you something real?" produces more engagement than "Interpret the y-intercept."
Common Student Errors
- Classifying every strong correlation as causal (addressed by Cards 4 and 7)
- Classifying every correlation as "lurking variable" after first encountering the concept (overcorrection; Card 3's genuinely causal example corrects this)
- Omitting units from slope interpretation, treating "slope = 185" as a complete statement
- Assuming the y-intercept is always meaningful without evaluating whether x = 0 is within the observed range
- Confusing lurking variables with sampling bias, a concept from the sampling simulation (lurking variables are real causal forces, not data collection errors)
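The unit and intercept errors in this list can be made concrete with a small calculation. The sketch below fits a least squares line to a made-up data set (hypothetical square footage and rent values, not the simulation's data), reports the slope with its units, and checks whether x = 0 falls inside the observed range before the intercept is interpreted.

```python
import statistics

# Hypothetical data: apartment size (sq ft) and monthly rent (dollars).
square_feet = [450, 600, 750, 900, 1100, 1300]
monthly_rent = [1850, 2300, 2550, 3000, 3400, 3950]

mean_x = statistics.fmean(square_feet)
mean_y = statistics.fmean(monthly_rent)

# Least squares: slope = Sxy / Sxx, intercept = mean_y - slope * mean_x.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(square_feet, monthly_rent))
sxx = sum((x - mean_x) ** 2 for x in square_feet)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# State the slope with units, not as a bare number.
print(f"slope: {slope:.2f} dollars of predicted rent per additional square foot")
print(f"y-intercept: {intercept:.2f} dollars of predicted rent at 0 square feet")

# The intercept describes a real case only if x = 0 is in the observed range.
x_zero_observed = min(square_feet) <= 0 <= max(square_feet)
if not x_zero_observed:
    print(
        f"x = 0 is outside the observed sizes "
        f"({min(square_feet)} to {max(square_feet)} sq ft); "
        "the intercept is an extrapolation."
    )
```

Asking students to run the range check before writing the intercept sentence turns "is x = 0 meaningful here?" into a routine step rather than an afterthought.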
Discussion Questions
- A news article reports that "countries with higher chocolate consumption have more Nobel Prize winners per capita (r = 0.79)." A student concludes chocolate makes people smarter. Walk through the four possible causal structures. Which is most plausible and why?
- If you wanted to determine whether social media actually causes anxiety (not just correlates with it), what kind of study would you design? What ethical constraints might prevent that study?
- Why do scientists prefer the phrase "associated with" over "causes" when describing correlational findings? What would have to be true for them to use "causes"?
Exam Connection
Typical exam questions give a regression equation in context and ask students to interpret the slope and y-intercept in the variables' units. The most common errors are omitting units and giving a mechanical interpretation without addressing whether it is meaningful in context. A second common format presents a correlation finding and asks whether causation can be inferred -- students must identify alternative explanations. This simulation gives direct practice with both formats.