STATISTICS · HL
Presenting Data
Histograms, scatter, stem & leaf, skew.
Section 1 of 6
Histogram
A histogram shows grouped (continuous) data as bars. Bar height = frequency.
Bars touch — intervals are continuous, not separate categories.
(i) Worked example
Ages of 20 people:
| Age | 0–2 | 2–4 | 4–6 | 6–8 |
|---|---|---|---|---|
| People | 5 | 8 | 3 | 4 |
Bar width = age interval. Bar height = number of people.
(ii) Median from a histogram
Total = $5 + 8 + 3 + 4 = 20$ people.
Median position $= \dfrac{n+1}{2}$.
$\dfrac{20+1}{2} = \dfrac{21}{2} = 10.5$
So the median sits between the 10th and 11th person.
Count bars left to right: $5$ in $0$–$2$, then the next $8$ are in $2$–$4$.
The 10th and 11th both fall in the $2$–$4$ interval.
The median age lies between $2$ and $4$.
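The counting steps above can be sketched in a few lines of Python. This is only an illustration: the function name `median_interval` and the way the intervals are stored are assumptions, not part of the lesson.

```python
# Grouped ages from the worked example: (lower, upper) intervals and their frequencies.
intervals = [(0, 2), (2, 4), (4, 6), (6, 8)]
frequencies = [5, 8, 3, 4]

def median_interval(intervals, frequencies):
    """Return the interval holding the median, using position (n + 1) / 2."""
    n = sum(frequencies)
    position = (n + 1) / 2          # 10.5 when n = 20
    running_total = 0
    for interval, freq in zip(intervals, frequencies):
        running_total += freq       # people counted so far, left to right
        if running_total >= position:
            return interval
    raise ValueError("frequencies must contain at least one observation")

print(median_interval(intervals, frequencies))  # (2, 4)
```

If the two middle people fall in different bars, the median sits on the shared boundary; this sketch then returns the upper interval.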
Must learn
1. Bar height = frequency.
2. Bars touch: the data is continuous.
3. Median position $= \dfrac{n+1}{2}$. Count bars to find the interval.
YOU TRY · 1
From the same histogram, in which age interval does the median lie?
Median position = $(n+1)/2$. Then count bars.
$n = 20$, position $= 21/2 = 10.5$
5 in 0–2, next 8 in 2–4 — so 10th & 11th sit in 2–4.
Median is in the $2$–$4$ interval.
Section 2 of 6
Scatter plot
A scatter plot shows pairs $(x, y)$ as dots. We're looking for a relationship between the two variables.
(i) Example — education vs income
Years of education ($x$) and income in €1,000s ($y$):
| Years | Income (€000s) |
|---|---|
| 11 | 65 |
| 11 | 28 |
| 12 | 30 |
| 13 | 35 |
| 13 | 43 |
| 14 | 55 |
| 15 | 38 |
| 16 | 45 |
| 17 | 58 |
| 19 | 70 |
(ii) The outlier
One point sticks out — far from the trend:
$(11, 65)$ — the "big farmer".
11 years of education but earning €65,000 — doesn't fit the trend. We flag it and explain.
(iii) Is there a relationship?
Yes. There is a positive correlation.
As years of education increase, in general so does income.
Correlation
1. The correlation coefficient is $r$, with $-1 \le r \le 1$.
2. The closer $r$ is to $1$ or $-1$, the stronger the relationship.
3. Correlation measures the strength of the LINEAR relationship between $2$ variables.
4. Slope of line of best fit = rate of change of one variable as the other changes.
$r = 0.45 \quad \Rightarrow \quad$ weak positive relationship.
(iv) Using the line of best fit
Equation of the line:
$y = a + bx$
$y = 9.1 + 2.5x$
(1) What is income after 18 years of education?
$x = 18$
$y = 9.1 + 2.5(18) = 54.1$
Income $\approx$ €$54{,}100$
(2) How many years to earn €50,000?
$y = 50$
$50 = 9.1 + 2.5x$
$2.5x = 40.9$
$x = 16.36 \approx 16.4$ years
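Both questions use the same equation, once forwards and once rearranged for $x$. A minimal Python sketch, assuming the lesson's line $y = 9.1 + 2.5x$ (the function names are made up for illustration):

```python
A = 9.1   # intercept a from the lesson's line y = a + bx
B = 2.5   # slope b: change in income (in thousands of euro) per extra year

def predict_income(years):
    """Income in thousands of euro for a given number of years of education."""
    return A + B * years

def years_for_income(income):
    """Rearranged line: x = (y - a) / b."""
    return (income - A) / B

print(round(predict_income(18), 1))    # 54.1
print(round(years_for_income(50), 2))  # 16.36
```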
(v) Meaning of the slope
$b = m = 2.5$
For every extra year in education, income increases by €2,500.
(vi) Line of best fit without a calculator
Pick any 2 points on the line. Carl uses $(12, 30)$ and $(17, 58)$.
$m = \dfrac{y_2 - y_1}{x_2 - x_1} = \dfrac{58 - 30}{17 - 12} = \dfrac{28}{5}$
$y - y_1 = m(x - x_1)$
$y - 30 = \dfrac{28}{5}(x - 12)$
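Multiplying out the last line gives $y = 5.6x - 37.2$. As a check, here is a small Python sketch of the two-point method. The helper `line_through` is hypothetical and assumes whole-number coordinates so the fractions stay exact.

```python
from fractions import Fraction

def line_through(p1, p2):
    """Return (slope, intercept) of the line through two points as exact fractions."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("a vertical line has no slope")
    slope = Fraction(y2 - y1, x2 - x1)   # m = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1          # from y - y1 = m(x - x1) with x = 0
    return slope, intercept

m, c = line_through((12, 30), (17, 58))
print(m, c)                # 28/5 -186/5
print(float(m), float(c))  # 5.6 -37.2
```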
YOU TRY · 2
Using $y = 9.1 + 2.5x$, predict the income for someone with $20$ years of education.
Sub $x = 20$ into the equation.
$y = 9.1 + 2.5(20) = 9.1 + 50 = 59.1$
$\approx$ €$59{,}100$
Section 3 of 6
Causality
Correlation is not the same as causation.
Causality
Q. Has one variable caused the other to happen?
(i) Smoking causes cancer.
Yes — causation.
(ii) Hot weather causes more ice cream sales.
No — that's only correlation.
Both rise together in summer, but the temperature doesn't cause the sale. People choose to buy.
YOU TRY · 3
A study finds that students who eat breakfast score higher in exams. Does breakfast cause the higher score?
Correlation ≠ causation. Could something else explain it?
It's a correlation. Other factors — sleep, home support, organisation — could be the real cause.
No — correlation, not causation.
Section 4 of 6
Stem and leaf
A stem and leaf plot shows the raw data, sorted, with the leaf = one digit only.
The KEY is key — without it, nobody knows what the numbers mean.
(i) Worked example
Draw a stem and leaf of: $143, \; 137, \; 129, \; 133, \; 144$
| Stem | Leaf |
|---|---|
| 12 | 9 |
| 13 | 3 7 |
| 14 | 3 4 |
Key: $12 \mid 9 = 129$
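The same plot can be built in Python by sorting the data and splitting off the last digit as the leaf. This is a sketch only: `stem_and_leaf` is a made-up helper and it assumes non-negative whole numbers.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group sorted values by stem (all digits but the last) with one-digit leaves."""
    rows = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)   # 129 -> stem 12, leaf 9
        rows[stem].append(leaf)
    return dict(rows)

plot = stem_and_leaf([143, 137, 129, 133, 144])
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
print("Key: 12 | 9 = 129")
```

Stems with no data are skipped here; on paper you would still write the empty stem.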
(ii) Back-to-back stem and leaf
Compare two sets on the same stem. Read the left side outwards.
John: $33, \; 45, \; 37, \; 42, \; 29$
Ann: $29, \; 21, \; 33, \; 48, \; 49$
| John | Stem | Ann |
|---|---|---|
| 9 | 2 | 1 9 |
| 7 3 | 3 | 3 |
| 5 2 | 4 | 8 9 |
Keys: $9 \mid 2 = 29$ (John) · $4 \mid 9 = 49$ (Ann)
(iii) Compare the two
Compare using mean and standard deviation.
Smaller $\sigma$ = more consistent.
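To see the comparison in numbers, here is a short sketch using Python's standard `statistics` module. It assumes $\sigma$ means the population standard deviation (`pstdev`); the lesson does not say which version is intended.

```python
from statistics import mean, pstdev

john = [33, 45, 37, 42, 29]
ann = [29, 21, 33, 48, 49]

# Assumption: sigma is the population standard deviation.
for name, scores in (("John", john), ("Ann", ann)):
    print(f"{name}: mean = {mean(scores):.1f}, sigma = {pstdev(scores):.2f}")
# John: mean = 37.2, sigma = 5.81
# Ann: mean = 36.0, sigma = 10.92
```

The means are close, but John's $\sigma$ is about half of Ann's, so John is the more consistent.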
Must learn
1. Leaf = one digit only.
2. The key is key. Always include it.
3. Back-to-back: left side reads outwards.
4. Compare with mean and $\sigma$.
YOU TRY · 4
Looking at the John vs Ann back-to-back plot, which person has the more consistent scores?
Consistent = smaller spread = smaller $\sigma$.
John: $29, 33, 37, 42, 45$ range $= 16$
Ann: $21, 29, 33, 48, 49$ range $= 28$
John — smaller spread, smaller $\sigma$.
Section 5 of 6
Line plot (dot plot)
A line plot (a.k.a. dot plot) shows frequencies as stacked dots above each value.
(i) Worked example
Goals scored:
| Goals | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| Cases | 5 | 3 | 2 | 1 |
Show on a line plot — one × per case, stacked above each value.
The shape on the dot plot tells you about the skew — see next section.
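A rough text version of the line plot for the goals data can be drawn in Python. This is a sketch: `line_plot` is a hypothetical helper, it uses a plain `x` as the mark, and it assumes single-digit values on the axis.

```python
goals = [0, 1, 2, 3]
cases = [5, 3, 2, 1]

def line_plot(values, counts, mark="x"):
    """Return a text line plot: one mark per case, stacked above each value."""
    rows = []
    for level in range(max(counts), 0, -1):        # build the top row first
        row = " ".join(mark if c >= level else " " for c in counts)
        rows.append(row.rstrip())
    rows.append(" ".join(str(v) for v in values))  # axis labels underneath
    return "\n".join(rows)

print(line_plot(goals, cases))
# x
# x
# x x
# x x x
# x x x x
# 0 1 2 3
```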
YOU TRY · 5
From the line plot, what is the total number of cases?
Add up all the cases.
$5 + 3 + 2 + 1 = 11$
$11$ cases
Section 6 of 6
Skew
Skew = the cow's tail. The skew points in the direction of the long tail.
(i) Right skew (positive)
Mode < Median < Mean
Example: family size. Most families small, a few very large pull the mean up.
(ii) Left skew (negative)
Mean < Median < Mode
Example: 6th year shoe size. Most at the upper end, a few smaller pull the mean down.
(iii) Symmetrical
Mode $=$ Mean $=$ Median
Skew — remember
1. Skew = the cow's tail.
2. Right (positive): Mode < Median < Mean. (family size)
3. Left (negative): Mean < Median < Mode. (6th yr shoe size)
4. Symmetrical: all three equal.
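The three rules can be written as a simple check in Python. This is a sketch: `classify_skew` is a made-up name and the sample numbers are invented for illustration, not taken from real data.

```python
def classify_skew(mean, median, mode):
    """Classify skew from the order of mode, median and mean."""
    if mode < median < mean:
        return "right (positive)"
    if mean < median < mode:
        return "left (negative)"
    if mean == median == mode:
        return "symmetrical"
    return "unclear from these three values"

# Invented values: a few large values pull the mean above the median.
print(classify_skew(mean=4.1, median=3, mode=2))   # right (positive)
print(classify_skew(mean=7.5, median=8, mode=9))   # left (negative)
```

Real data does not always follow these orderings exactly, which is why the sketch has a fallback answer.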
YOU TRY · 6
A distribution has Mean $= 8$, Median $= 6$, Mode $= 5$. What's the skew?
Compare the three. Where does the tail point?
Mode $= 5$ < Median $= 6$ < Mean $= 8$. Tail points right.
Right skew (positive).
SUM
The lot in one box
Presenting data toolkit
1. Histogram: continuous data, bars touch. Median position $= (n+1)/2$.
2. Scatter: outlier? relationship? line of best fit $y = a + bx$. Slope $b$ = rate of change.
3. Correlation $r$: $-1 \le r \le 1$. Closer to $\pm 1$ = stronger. LINEAR only.
4. Causality: has one variable caused the other? Correlation ≠ causation.
5. Stem & leaf: leaf $=$ one digit, key is key. Back-to-back reads outwards on left.
6. Line plot: stacked $\times$ above each value.
7. Skew: the cow's tail. Right: Mode < Median < Mean. Left: Mean < Median < Mode.
End of lesson
Presenting Data — HL · Mathslive.ie