STATISTICS · HIGHER LEVEL
Statistics — Basic
From variables to standard deviation
Section 1 of 11
Variables and Data
Statistics is the study of data — collecting it, summarising it, and drawing conclusions from it. Before we touch a number we need to nail down the language.
The vocabulary — must learn
1. Variable = the characteristic of interest. e.g. exam results, hours of study, height.
2. Observation = a single value of the variable. e.g. one student scored 72%.
3. Data set = all the observations — all the information you have.
4. Population = all the data under consideration. e.g. every student in the school, every tree in the town.
5. Sample = a part of the population.
6. Census = collect information from the whole population.
7. Sampling frame = a list of every item in the population.
Spotting which one is being asked for is half the marks in a definitions question. Read the wording carefully — "every student" is a population; "100 students chosen at random" is a sample.
You try
A market researcher rings 200 households out of every household in Ireland and asks what brand of tea they buy. Name (a) the variable, (b) the population, (c) the sample.
What is being measured? Who is being studied in total? Who actually got asked?
Variable = the brand of tea bought (the characteristic of interest).
Population = every household in Ireland (everyone under consideration).
Sample = the 200 households actually rung.
Variable: brand of tea. Population: every household in Ireland. Sample: the 200 households rung.
You try
Explain the difference between a census and a sample.
One uses the whole population, the other only a part of it.
A census collects information from every member of the population.
A sample collects information from part of the population.
Census = everyone in the population. Sample = part of the population.
Section 2 of 11
The Three Averages
There are three different "averages" you need to know — and they often give different answers for the same data. The skill is knowing which one to compute and how.
The three averages — must learn
1. Mode = the value that occurs most often. In a frequency table it's the value with the biggest frequency. The mode is sometimes called the modal value.
2. Median = the middle value when the data is put in order of size.
3. Mean = the sum of all observations divided by the number of observations.
(i) Finding the median — position formula
Once the data is in order, the median sits in position $\dfrac{n+1}{2}$, where $n$ is the number of observations.
$n = 7$ (odd)
Position $= \dfrac{n+1}{2} = \dfrac{8}{2} = 4$
Median = 4th value in the ordered list.
When $n$ is even, the position formula lands halfway between two values — you take the mean of those two.
$n = 6$ (even)
Position $= \dfrac{n+1}{2} = \dfrac{7}{2} = 3.5$
Median = mean of the 3rd and 4th values.
(ii) The mean — formula and notation
Mean formula
$\text{Mean} \;=\; \dfrac{\sum x}{n}$
$\sum x$ = sum of all observations. $n$ = number of observations.
The symbol $\sum$ (capital sigma) just means "add them up". We'll meet two different symbols for the mean later:
$\bar{x}$ = sample mean
$\mu$ = population mean (Greek letter "mu")
For now, just write "Mean" — we'll come back to $\bar{x}$ vs $\mu$ once we've met sampling properly.
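If you'd like to check the three averages on a computer, here is a minimal Python sketch. It is not part of the Leaving Cert course, and the function names (`mean`, `median_by_position`, `modes`) and the data are made up for illustration. The median function follows the position formula $\dfrac{n+1}{2}$ from part (i).

```python
from collections import Counter

def mean(data):
    """Sum of all observations divided by the number of observations."""
    return sum(data) / len(data)

def median_by_position(data):
    """Median via the position formula (n + 1) / 2 on the ordered data."""
    ordered = sorted(data)
    n = len(ordered)
    position = (n + 1) / 2            # 1-based position in the ordered list
    lower = ordered[int(position) - 1]
    if position == int(position):     # odd n: the position is a whole number
        return lower
    upper = ordered[int(position)]    # even n: halfway between two values
    return (lower + upper) / 2

def modes(data):
    """All values with the highest frequency; empty list if nothing repeats."""
    counts = Counter(data)
    top = max(counts.values())
    if top == 1:
        return []
    return [value for value, count in counts.items() if count == top]

scores = [3, 8, 5, 8, 6]              # made-up data, n = 5
print(mean(scores))                   # 6.0
print(median_by_position(scores))     # 6
print(modes(scores))                  # [8]
print(median_by_position([4, 1, 9, 6]))  # 5.0 (mean of the 2nd and 3rd values)
```

Returning a list from `modes` is a design choice: it copes with data that has no mode (empty list) or more than one.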
Section 3 of 11
Worked Averages and Outliers
(i) Find the mean, mode and median of 2, 3, 5, 1, 100, 1, 1
Mean — add them up and divide by 7.
Mean $= \dfrac{2 + 3 + 5 + 1 + 100 + 1 + 1}{7}$
$= \dfrac{113}{7}$
$= 16.14$
Mode — the value that appears most often is $1$ (it appears three times).
Mode $= 1$
Median — put the seven values in order, then pick the middle one.
Ordered: 1, 1, 1, 2, 3, 5, 100
Position $= \dfrac{n+1}{2} = \dfrac{7+1}{2} = 4$
Median = 4th value $= 2$
(ii) What just happened? — the outlier
Look at the answers side by side: mean $= 16.14$, but mode $= 1$ and median $= 2$. The mean is wildly different from the other two.
The reason is the value $100$ — it's much bigger than every other observation in the data. We call this an outlier.
Outlier — must learn
An outlier is an observation that is much bigger or much smaller than the other observations in the data.
Outliers pull the mean towards themselves, but barely affect the median or mode. That's why the median is sometimes a better "typical value" than the mean.
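To see the pull of an outlier for yourself, one option is to compute the averages with and without the 100. The sketch below uses Python's built-in `statistics` module; the variable names are just illustrative.

```python
import statistics

data = [2, 3, 5, 1, 100, 1, 1]                    # the data from part (i)
without_outlier = [x for x in data if x != 100]   # same data, 100 removed

mean_with = statistics.mean(data)                 # 113 / 7
mean_without = statistics.mean(without_outlier)   # 13 / 6
median_with = statistics.median(data)
median_without = statistics.median(without_outlier)

print(round(mean_with, 2), round(mean_without, 2))
print(median_with, median_without)
```

Removing the single value 100 drops the mean from about 16.14 to about 2.17, while the median only moves from 2 to 1.5.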
(iii) Find the mean, mode and median of 5, 3, 1, 7, 6, 4
Mode — no value appears more than once.
Mode = none (disadvantage of the mode!)
"No mode" is one of the weaknesses of the mode — sometimes it just doesn't exist.
Median — order the six values, then find the middle position.
Ordered: 1, 3, 4 || 5, 6, 7
Position $= \dfrac{n+1}{2} = \dfrac{6+1}{2} = 3.5$ (between 3rd and 4th)
Median $= \dfrac{4 + 5}{2}$
$= 4.5$
Mean — add and divide by 6.
Mean $= \dfrac{5 + 3 + 1 + 7 + 6 + 4}{6} = \dfrac{26}{6}$
$= 4.33$
You try
Find the mean, mode and median of 4, 7, 2, 7, 3, 5, 7, 1, 4.
$n = 9$. Order the data first — the median position is $\dfrac{9+1}{2} = 5$.
Ordered: 1, 2, 3, 4, 4, 5, 7, 7, 7
Mean $= \dfrac{1+2+3+4+4+5+7+7+7}{9} = \dfrac{40}{9}$
Mode $= 7$ (appears three times)
Median position $= \dfrac{9+1}{2} = 5$
Mean $= 4.44$, Mode $= 7$, Median $= 4$
You try
Find the mean and median of 8, 4, 6, 2, 250, 5, 3. Comment on which average is the better "typical value".
There's a 250 in there — that's an outlier. Watch what it does to the mean.
Ordered: 2, 3, 4, 5, 6, 8, 250
Mean $= \dfrac{2+3+4+5+6+8+250}{7} = \dfrac{278}{7} = 39.71$
Median position $= \dfrac{7+1}{2} = 4 \;\Rightarrow\;$ Median $= 5$
Mean $= 39.71$, Median $= 5$. The median is the better typical value — the outlier 250 has dragged the mean far away from the rest of the data.
Mean $= 39.71$, Median $= 5$. The median is more representative — the 250 is an outlier that distorts the mean.
Section 4 of 11
Algebra with the Mean
Exam questions don't always give you all the numbers. Sometimes one (or more) of the data values is a letter and you have to solve for it using the mean formula.
The strategy
Write the mean equation $\dfrac{\sum x}{n} = \text{mean}$ with the unknown letter included in $\sum x$, then solve.
(i) The numbers $5,\,3,\,4,\,x,\,9$ have a mean of $6$. Find $x$.
$\dfrac{5 + 3 + 4 + x + 9}{5} = 6$
$21 + x = 30$
$x = 9$
(ii) The numbers $x,\,2,\,3,\,y,\,7$ have a mean of $5$. Write $x$ in terms of $y$.
$\dfrac{x + 2 + 3 + y + 7}{5} = 5$
$x + y + 12 = 25$
$x = 13 - y$
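When there is only one unknown, as in (i), the rearrangement boils down to "total = mean × number, then subtract what you know". A small Python sketch of that idea, with a hypothetical helper name `missing_value`:

```python
def missing_value(known_values, mean):
    """Return the single unknown value, given the other values and the mean."""
    n = len(known_values) + 1       # the unknown counts as one observation
    total = mean * n                # total = mean x number
    return total - sum(known_values)

print(missing_value([5, 3, 4, 9], 6))   # 9, matching part (i)
```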
(iii) Combined means — group them in a table
Question. 5 boys have a mean age of 9. 6 girls have a mean age of 12. Find the mean age of all 11 people.
Build a "number / mean / total" table. The total for each row is just $\text{number} \times \text{mean}$.
| No. | Mean | Total |
|---|---|---|
| 5 | 9 | 45 |
| 6 | 12 | 72 |
| 11 | | 117 |
Mean of all 11 $= \dfrac{\text{combined total}}{\text{combined number}}$
$= \dfrac{117}{11} = 10.64$
(iv) Mean of 6 numbers is $x$. Mean of 9 more numbers is $y$. Mean of all 15 is $z$. Write $x$ in terms of $y$ and $z$.
| No. | Mean | Total |
|---|---|---|
| 6 | $x$ | $6x$ |
| 9 | $y$ | $9y$ |
| 15 | $z$ | $15z$ |
$6x + 9y = 15z$
$x = \dfrac{15z - 9y}{6}$
$x = \dfrac{5z - 3y}{2}$
Always reach for the number / mean / total table when two or more groups are combined. It works every time.
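The number / mean / total table translates almost line for line into code. Here is a rough Python sketch, assuming each group is written as a `(number, mean)` pair; the function name `combined_mean` is an illustrative choice.

```python
def combined_mean(groups):
    """groups is a list of (number, mean) pairs, one per row of the table."""
    combined_number = sum(number for number, _ in groups)
    combined_total = sum(number * mean for number, mean in groups)  # Total column
    return combined_total / combined_number

# The boys and girls from part (iii): 5 with mean 9, 6 with mean 12.
print(round(combined_mean([(5, 9), (6, 12)]), 2))   # 10.64
```

Note that the answer is not the mean of 9 and 12: the bigger group counts for more.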
You try
The numbers $7,\,k,\,4,\,8,\,6$ have a mean of $6$. Find $k$.
Sum $= 5 \times 6 = 30$. The four known numbers add to $25$.
$\dfrac{7 + k + 4 + 8 + 6}{5} = 6$
$25 + k = 30$
$k = 5$
You try
The numbers $a,\,4,\,b,\,9,\,3$ have a mean of $7$. Write $a$ in terms of $b$.
Mean $\times$ number $=$ total $\;\Rightarrow\;$ everything in the numerator adds to $35$.
$\dfrac{a + 4 + b + 9 + 3}{5} = 7$
$a + b + 16 = 35$
$a = 19 - b$
You try
10 first-year students have a mean height of 142 cm. 15 second-year students have a mean height of 156 cm. Find the mean height of all 25 students.
Build a number / mean / total table. First-year total $= 10 \times 142$.
First years: total $= 10 \times 142 = 1420$.
Second years: total $= 15 \times 156 = 2340$.
Combined: $\dfrac{1420 + 2340}{25} = \dfrac{3760}{25}$
$= 150.4$ cm
$150.4$ cm
Section 5 of 11
Sampling and Inference
In real life we almost never get to study the full population — it's too big or too expensive. So we use a sample and try to make conclusions about the population from it.
Sampling vocabulary — must learn
1. Random sample = every member of the population has an equal chance of being selected.
2. Statistics = data collected from a sample.
3. Statistical inference = conclusions drawn from a sample and applied back to the population.
"Random" matters because a biased sample leads to wrong conclusions. If you survey only people leaving a gym about how often they exercise, your sample is not random — you'll wildly overestimate the population's exercise habits.
Whenever you see the phrase "applied to the population", the marker is looking for the words statistical inference.
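If you are curious how a computer picks a random sample, here is a minimal Python sketch. The population of 600 ID numbers and the seed value are assumptions made up for the example; `random.Random.sample` chooses without repeats and gives every member the same chance of being picked.

```python
import random

population = list(range(1, 601))      # hypothetical ID numbers 1 to 600

rng = random.Random(2024)             # fixed seed so the run is repeatable
sample = rng.sample(population, 50)   # 50 different members, chosen at random

print(len(sample))                    # 50
print(sorted(sample)[:5])             # the five smallest IDs that were picked
```

Contrast this with the gym survey: there, people who exercise had a much bigger chance of being asked, so the sample was not random.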
You try
A school nurse weighs 50 students chosen at random and uses the average to estimate the average weight of all 600 students in the school. Which of the four words — variable, statistic, inference, population — best fits this process?
She measured 50 students and used the result to draw a conclusion about all 600 — that's the key idea.
She took a random sample of 50 students.
The data she collected from the sample is a statistic.
Using that statistic to estimate the average weight across all 600 students is statistical inference.
Statistical inference.
Section 6 of 11
Types of Data
Data comes in two big flavours: numerical (the answer is a number) and categorical (the answer is a category or label). Each splits into two sub-types.
(i) Primary vs Secondary data
Where the data comes from
Primary data = data you collect yourself (e.g. running a survey).
Advantage: you can trust it. Disadvantage: takes time and money.
Secondary data = data collected by someone else (e.g. internet, CSO, published research).
Advantage: quick and cheap. Disadvantage: trust — you don't know how it was collected.
(ii) Numerical data — discrete or continuous
Numerical = quantitative
Discrete = takes only certain values (whole-number "counts"). e.g. number of children in a family, goals in a match.
Continuous = takes any value on a scale. e.g. height, time, weight, temperature.
Quick test: can the value lie between two whole numbers and still make sense? If yes (1.73 m of height) it's continuous. If no (1.73 children doesn't exist) it's discrete.
(iii) Categorical data — ordinal or nominal
Categorical = qualitative
Ordinal = categories with a natural order. e.g. exam grade (A, B, C, D), Olympic medal.
Nominal (un-ordinal) = categories with no order. e.g. gender, favourite colour, eye colour.
Final word on names:
Numerical $=$ Quantitative
Categorical $=$ Qualitative
You try
Classify each variable as numerical-discrete, numerical-continuous, categorical-ordinal, or categorical-nominal:
(a) the time it takes to run 100 m,
(b) the number of cars in a car park,
(c) the colour of a car,
(d) the LC grade a student got in Maths.
If you can answer in fractions, it's continuous. If the answer is a label, it's categorical — and then ask "is there an order?".
(a) Time can be 12.34 s — numerical, continuous.
(b) Cars come in whole numbers — numerical, discrete.
(c) Colour is a label with no order — categorical, nominal.
(d) Grades have an order (H1 better than H2) — categorical, ordinal.
(a) continuous (b) discrete (c) nominal (d) ordinal
You try
A researcher uses the CSO (Central Statistics Office) website to find population data for every county in Ireland. Is this primary or secondary data? Give one advantage and one disadvantage.
Did the researcher collect it themselves?
The researcher didn't collect it — the CSO did. So it's secondary data.
Advantage: quick and cheap to access.
Disadvantage: the researcher must trust the CSO's method of collection.
Secondary. Adv: quick and cheap. Disadv: trust — must rely on CSO's method.
Section 7 of 11
Range and Interquartile Range
Mean, mode and median are called measures of central tendency: they tell us where the middle of the data sits. But two data sets can have the same mean and look completely different. We also need to measure how spread out the data is.
(i) Strengths and weaknesses of the three averages
Quick comparison — must learn
Mode: easy to find. Disadvantage: not used in further statistics.
Median: easy to find, not affected by outliers. Disadvantage: not used in further statistics.
Mean: the average that's used everywhere in statistics. Disadvantage: can be time-consuming and is distorted by outliers.
(ii) Range and IQR — the spread measures
Spread measures — must learn
Range = top value $-$ bottom value.
Interquartile range (IQR) = upper quartile $-$ lower quartile.
The quartiles divide the ordered data into four equal parts. We find their positions the same way we find the median position — by formula. Order the data first, then:
Median position $= \dfrac{n+1}{2}$
Lower quartile $Q_{1}$ position $= \dfrac{n+1}{4}$
Upper quartile $Q_{3}$ position $= \dfrac{3(n+1)}{4}$
Carl's quick trick: once you've found the median position, count the same number of steps from the bottom up for the lower quartile, and from the top down for the upper quartile.
(iii) Worked example — Find the mean, median, range and IQR of $3,\,1,\,5,\,7,\,6,\,4,\,9$
Mean $= \dfrac{3 + 1 + 5 + 7 + 6 + 4 + 9}{7} = \dfrac{35}{7}$
$= 5$
Ordered: 1, 3, 4, 5, 6, 7, 9
Median position $= \dfrac{n+1}{2} = \dfrac{7+1}{2} = 4 \;\Rightarrow\;$ 4th value
Median $= 5$
Lower quartile position $= \dfrac{n+1}{4} = \dfrac{8}{4} = 2 \;\Rightarrow\;$ 2 up from bottom
$Q_{1} = 3$
Upper quartile $\;\Rightarrow\;$ 2 down from top
$Q_{3} = 7$
Range $= 9 - 1$
$= 8$
IQR $= Q_{3} - Q_{1} = 7 - 3$
$= 4$
The IQR is often smaller and more useful than the range because it ignores the extreme top and bottom values — exactly the values most likely to be outliers.
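The position formulas for the quartiles can be sketched in Python as follows. The helper names are made up for illustration. One assumption to flag: when a position lands on a fraction other than .5 (for example 1.75), this sketch moves that fraction of the way from the lower value to the upper value. The notes only show whole-number and halfway positions, and the sketch agrees with those.

```python
def value_at_position(ordered, position):
    """Value at a 1-based position; fractional positions are interpolated."""
    whole = int(position)
    fraction = position - whole
    if fraction == 0:
        return ordered[whole - 1]
    below, above = ordered[whole - 1], ordered[whole]
    return below + fraction * (above - below)

def spread_summary(data):
    """Range, quartiles and IQR using positions (n+1)/4 and 3(n+1)/4."""
    ordered = sorted(data)
    n = len(ordered)
    if n < 3:
        raise ValueError("need at least 3 observations for these positions")
    q1 = value_at_position(ordered, (n + 1) / 4)
    q3 = value_at_position(ordered, 3 * (n + 1) / 4)
    return {"range": ordered[-1] - ordered[0], "Q1": q1, "Q3": q3, "IQR": q3 - q1}

# The data from the worked example in (iii).
print(spread_summary([3, 1, 5, 7, 6, 4, 9]))
# {'range': 8, 'Q1': 3, 'Q3': 7, 'IQR': 4}
```

Be aware that software packages use several slightly different quartile conventions, so a spreadsheet may not always agree with the position formula used here.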
You try
Find the mean, median, range and IQR of $2,\,8,\,5,\,1,\,7,\,4,\,9,\,3,\,6$.
$n = 9$. Median position $= \dfrac{9+1}{2} = 5$. Quartile position $= \dfrac{9+1}{4} = 2.5$ (so halfway between 2nd and 3rd).
Ordered: 1, 2, 3, 4, 5, 6, 7, 8, 9
Mean $= \dfrac{45}{9} = 5$ · Median position $= 5 \Rightarrow$ median $= 5$
$Q_{1}$ position $= 2.5 \Rightarrow Q_{1} = \dfrac{2+3}{2} = 2.5$
$Q_{3}$ position $= \dfrac{3(10)}{4} = 7.5 \Rightarrow Q_{3} = \dfrac{7+8}{2} = 7.5$
Range $= 9 - 1 = 8$. IQR $= 7.5 - 2.5 = 5$.
Mean $= 5$, Median $= 5$, Range $= 8$, IQR $= 5$.
You try
Find the median, range and IQR of $12,\,15,\,11,\,18,\,9,\,14,\,10,\,16,\,13,\,17,\,8$.
$n = 11$. Median position $= \dfrac{12}{2} = 6$. Quartile position $= \dfrac{12}{4} = 3$ from each end.
Ordered: 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18
Median = 6th value $= 13$.
$Q_{1}$ = 3rd value $= 10$. $Q_{3}$ = 9th value $= 16$.
Range $= 18 - 8 = 10$. IQR $= 16 - 10 = 6$.
Median $= 13$, Range $= 10$, IQR $= 6$.
Section 8 of 11
Standard Deviation
The range uses only two values: the very top and the very bottom. The IQR is better, but it still depends only on the values at the two quartile positions. The standard deviation uses every observation in the data, which makes it the most informative measure of spread we have.
Standard deviation — must learn
The standard deviation $\sigma$ measures the typical distance of the observations from the mean. (Strictly, it is the square root of the mean of the squared distances from the mean.)
It is our main measure of spread. A small $\sigma$ means the data clusters tightly around the mean; a large $\sigma$ means the data is widely scattered.
(i) The notation — sample vs population
Greek letters — must learn
$\bar{x}$ = sample mean, spoken "x bar". Used when the data is a sample.
$\mu$ = population mean, Greek "mu". Used when the data is a full population.
$\sigma$ = standard deviation, Greek "sigma" (lowercase).
In Leaving Cert exam questions either symbol can appear — read the wording and use the matching letter. If the question says "a class of 25 students" (a sample of the school), use $\bar{x}$. If it says "all 100 employees" (the full population of the firm), use $\mu$.
(ii) Computing $\sigma$ — use your calculator
The formula for $\sigma$ is messy by hand. In the Leaving Cert exam you use the statistics mode on your calculator. Steps for Casio fx-83/85/991:
1. MODE $\to$ STAT $\to$ 1-VAR (set the calc to 1-variable stats)
2. Type each value, pressing = after each
3. AC (closes the data entry screen)
4. SHIFT 1 $\to$ Var $\to$ $\bar{x}$ (for the mean)
5. SHIFT 1 $\to$ Var $\to$ $\sigma x$ (for the standard deviation)
Common slip — pressing $sx$ instead of $\sigma x$. $sx$ is the sample SD (divides by $n-1$); $\sigma x$ is the SD we want at Leaving Cert level. Always pick $\sigma x$.
(iii) Worked example — Find $\bar{x}$, median, IQR and $\sigma$ of $1,\,3,\,7,\,9,\,11$
Mean $\bar{x} = \dfrac{1 + 3 + 7 + 9 + 11}{5} = \dfrac{31}{5}$
$\bar{x} = 6.2$
Ordered (already): 1, 3, 7, 9, 11
Median position $= \dfrac{5+1}{2} = 3 \;\Rightarrow\;$ 3rd value
Median $= 7$
Quartile position $= \dfrac{5+1}{4} = 1.5 \;\Rightarrow\;$ between 1st and 2nd from each end
$Q_{1} = \dfrac{1 + 3}{2} = 2$ · $Q_{3} = \dfrac{9 + 11}{2} = 10$
IQR $= 10 - 2 = 8$
Standard deviation — straight off the calculator with STAT 1-VAR.
$\sigma = 3.7$ (correct to 1 dp)
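For the curious, the calculation behind the $\sigma x$ key is $\sigma = \sqrt{\dfrac{\sum (x - \text{mean})^2}{n}}$. The Python sketch below works it out step by step for the data above and compares it with the built-in `statistics` functions. The function name `population_sd` is an illustrative choice; `statistics.pstdev` divides by $n$ (like $\sigma x$) and `statistics.stdev` divides by $n-1$ (like $sx$).

```python
import math
import statistics

def population_sd(data):
    """Square root of the mean squared distance from the mean (divides by n)."""
    n = len(data)
    mean = sum(data) / n
    squared_distances = [(x - mean) ** 2 for x in data]
    return math.sqrt(sum(squared_distances) / n)

data = [1, 3, 7, 9, 11]                     # the data from the worked example
print(round(population_sd(data), 1))        # 3.7, the sigma-x answer
print(round(statistics.pstdev(data), 1))    # 3.7, library version (divides by n)
print(round(statistics.stdev(data), 1))     # 4.1, divides by n - 1 (like sx)
```

The last line shows why the choice of key matters: the $n-1$ version gives a noticeably different answer for a small data set.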
You try
Use your calculator to find $\bar{x}$ and $\sigma$ for $4,\,7,\,8,\,10,\,12,\,15$. (Give $\sigma$ correct to 2 dp.)
MODE $\to$ STAT $\to$ 1-VAR. Type each number, =, then AC. SHIFT 1 $\to$ Var $\to$ $\bar{x}$ and $\sigma x$.
$\bar{x} = \dfrac{4+7+8+10+12+15}{6} = \dfrac{56}{6}$
$\bar{x} = 9.33$, $\sigma = 3.54$
You try
Use your calculator to find $\bar{x}$ and $\sigma$ for the test scores $62,\,68,\,71,\,74,\,77,\,80,\,85,\,90$. (Give $\sigma$ correct to 2 dp.)
$n = 8$. Sum $= 607$. Use the calculator.
$\bar{x} = \dfrac{607}{8} = 75.875$
$\bar{x} = 75.88$ (2 dp), $\sigma = 8.54$
$\bar{x} = 75.88$, $\sigma = 8.54$
Section 9 of 11
The Empirical Rule
Many large real-world data sets (at this level, take $n \geq 30$ as "big enough") roughly follow a bell-shaped curve called the normal distribution. The normal distribution has a remarkable property: the spread is divided up by the standard deviation in fixed percentages.
The Empirical Rule — must learn
1$\sigma$: Approximately 68% of the data lies within 1 standard deviation of the mean.
2$\sigma$: Approximately 95% of the data lies within 2 standard deviations of the mean.
3$\sigma$: Approximately 99.7% of the data lies within 3 standard deviations of the mean.
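The 68%, 95% and 99.7% are split symmetrically around the mean. On each side of the mean: 34% of the data lies within $1\sigma$, a further 13.5% lies between $1\sigma$ and $2\sigma$, a further 2.35% lies between $2\sigma$ and $3\sigma$, and the remaining 0.15% lies beyond $3\sigma$.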
(i) Worked example — Leaving Cert results
Question. A sample of $n \geq 30$ Leaving Cert results has mean $\bar{x} = 100$ and standard deviation $\sigma = 10$. Estimate what percentage of students scored:
(a) between $90$ and $110$, (b) between $80$ and $120$, (c) more than $120$.
(a) $90$ to $110$ = $\bar{x} - \sigma$ to $\bar{x} + \sigma$ = within $1\sigma$ of the mean.
$\approx 68\%$
(b) $80$ to $120$ = $\bar{x} - 2\sigma$ to $\bar{x} + 2\sigma$ = within $2\sigma$ of the mean.
$\approx 95\%$
(c) "More than $120$" is above $\bar{x} + 2\sigma$. The total outside 2$\sigma$ is $100\% - 95\% = 5\%$, split equally above and below.
$\dfrac{5\%}{2} = 2.5\%$ (equivalent to the diagram's 2.35% + 0.15%)
$\approx 2.5\%$ scored more than 120
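The word "approximately" in the Empirical Rule matters. As a rough check, the Python sketch below uses `statistics.NormalDist` (available from Python 3.8) to compute the percentages for an exactly normal distribution with the mean and standard deviation from this example. The function name `percent_within` is an illustrative choice.

```python
from statistics import NormalDist

def percent_within(k, mean=0.0, sd=1.0):
    """Percentage of a normal distribution within k standard deviations."""
    dist = NormalDist(mean, sd)
    return 100 * (dist.cdf(mean + k * sd) - dist.cdf(mean - k * sd))

for k in (1, 2, 3):
    print(k, round(percent_within(k, mean=100, sd=10), 2))

# Part (c): the share above mean + 2 sd, i.e. more than 120.
above_120 = 100 * (1 - NormalDist(100, 10).cdf(120))
print(round(above_120, 2))
```

The exact figures come out at about 68.27%, 95.45% and 99.73%, and about 2.28% above 120, so the rule's 68%, 95%, 99.7% and 2.5% are convenient roundings. In the exam, use the rounded figures from the rule.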
You try
In a sample of $n \geq 30$ students, the mean weight is $60$ kg and $\sigma = 5$ kg. Approximately what percentage of students weigh between $55$ kg and $65$ kg?
$55 = \bar{x} - \sigma$ and $65 = \bar{x} + \sigma$. That's "within 1$\sigma$".
$55$ to $65$ = $60 \pm 5$ = within $1\sigma$ of the mean.
$\approx 68\%$
You try
A factory produces bolts with mean length $\mu = 50$ mm and $\sigma = 2$ mm. Approximately what percentage of bolts have length between $46$ mm and $54$ mm?
$46 = 50 - 2\sigma$ and $54 = 50 + 2\sigma$.
$46$ to $54$ = $50 \pm 2(2)$ = within $2\sigma$ of the mean.
$\approx 95\%$
You try
A test has $\mu = 70$ marks and $\sigma = 5$. The class has 200 students. Approximately how many students scored below 55 marks?
$55 = 70 - 15 = \mu - 3\sigma$. The total outside 3$\sigma$ is $0.3\%$, split equally.
$55$ is at $\mu - 3\sigma$. Outside $3\sigma$ is $100\% - 99.7\% = 0.3\%$ total.
Below $\mu - 3\sigma$ is half of that: $0.15\%$.
$0.15\%$ of $200 = 0.0015 \times 200 = 0.3$
$\approx 0$ students (so few that we'd expect none)
$0.15\%$ of 200 $\approx 0$ students.
Section 10 of 11
Frequency Tables
In real data you often see the same value many times — eight families with 2 children, twelve students who got an A. Rather than write every observation out separately, we summarise with a frequency table: each value on top, each frequency underneath.
(i) Tally — turning raw data into a frequency table
A survey records 25 dogs as pregnant ($P$) or not pregnant ($N$). Raw data:
P, N, P, P, P
P, P, N, N, P
P, P, N, P, P
N, P, N, P, P
P, P, P, P, P
Tally the $P$'s and the $N$'s, then write the frequencies in a table:
| Pregnant? | Yes | No |
|---|---|---|
| Number | 19 | 6 |
Pregnancy level $= \dfrac{19}{25} \times \dfrac{100}{1}$
$= 76\%$
Mode = Pregnant (it has the bigger frequency)
(ii) Numerical frequency tables — Goals per game
The number of goals scored in each of 20 matches was recorded:
0, 1, 2, 1, 1
2, 3, 3, 2, 1
0, 0, 0, 0, 1
2, 1, 0, 1, 1
Tally — count the 0's, the 1's, the 2's and the 3's:
| Goals | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| Games | 6 | 8 | 4 | 2 |
To find the mean from a frequency table:
Mean from a frequency table — must learn
$\bar{x} \;=\; \dfrac{\sum f x}{\sum f}$
$f$ = frequency, $x$ = value. You multiply each value by its frequency, sum, then divide by the total frequency.
$\sum f x = (0)(6) + (1)(8) + (2)(4) + (3)(2)$
$= 0 + 8 + 8 + 6 = 22$
$\sum f = 6 + 8 + 4 + 2 = 20$
$\bar{x} = \dfrac{22}{20}$
$\bar{x} = 1.1$ goals per game
Standard deviation: straight off the calculator using FREQ mode (turn ON in SETUP), entering each value with its frequency.
$\sigma = 0.94$ (correct to 2 dp)
On the Casio: SHIFT $\to$ SETUP $\to$ STAT $\to$ Frequency ON. Then in STAT 1-VAR you'll see two columns — $x$ for values and FREQ for frequencies. Saves a huge amount of typing.
You try
The number of children per family in a survey of 30 families gave the table below. Find the mean and standard deviation.
| Children | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Families | 3 | 8 | 10 | 6 | 3 |
$\sum f = 30$. $\sum fx = (0)(3) + (1)(8) + (2)(10) + (3)(6) + (4)(3)$.
$\sum f x = 0 + 8 + 20 + 18 + 12 = 58$
$\sum f = 30$
$\bar{x} = \dfrac{58}{30} = 1.93$
$\bar{x} = 1.93$, $\sigma = 1.12$ (calculator, 2 dp)
$\bar{x} = 1.93$, $\sigma = 1.12$
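The frequency-table formula $\bar{x} = \dfrac{\sum fx}{\sum f}$ can be checked with a short Python sketch. The function name `frequency_table_stats` is made up for illustration, and the standard deviation here divides by the total frequency, which is what the $\sigma x$ key does.

```python
import math

def frequency_table_stats(values, frequencies):
    """Mean and standard deviation from a frequency table (divides by sum of f)."""
    total_f = sum(frequencies)
    sum_fx = sum(f * x for x, f in zip(values, frequencies))
    mean = sum_fx / total_f
    sum_f_sq = sum(f * (x - mean) ** 2 for x, f in zip(values, frequencies))
    return mean, math.sqrt(sum_f_sq / total_f)

children = [0, 1, 2, 3, 4]        # top row of the table above
families = [3, 8, 10, 6, 3]       # bottom row: the frequencies
mean, sd = frequency_table_stats(children, families)
print(round(mean, 2), round(sd, 2))   # 1.93 1.12
```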
You try
In a class test the marks (out of 5) were tabulated as below. Find the mean mark and the standard deviation.
| Mark | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Students | 2 | 5 | 9 | 8 | 6 |
$\sum f = 30$ students. $\sum fx = (1)(2) + (2)(5) + (3)(9) + (4)(8) + (5)(6)$.
$\sum f x = 2 + 10 + 27 + 32 + 30 = 101$
$\bar{x} = \dfrac{101}{30} = 3.37$
$\bar{x} = 3.37$, $\sigma = 1.17$ (calculator, 2 dp)
$\bar{x} = 3.37$, $\sigma = 1.17$
Section 11 of 11
Grouped Continuous Data
Continuous data — like height, time or age — is usually given in class intervals rather than exact values. You'll see ages grouped as $0\text{–}2$, $2\text{–}4$, $4\text{–}6$ instead of every individual age.
We don't know the exact age of each person in the $2\text{–}4$ class, so we use the mid-interval value as our best estimate.
Mid-interval method — must learn
For continuous grouped data, the mid-interval value replaces $x$ in the mean formula.
$\text{Mid-interval} \;=\; \dfrac{\text{lower} + \text{upper}}{2}$
Then compute the mean the same way as before: $\bar{x} = \dfrac{\sum f x}{\sum f}$, where $x$ is now the mid-interval value.
(i) Worked example — Ages of children at a playground
| Age (years) | $0\text{–}2$ | $2\text{–}4$ | $4\text{–}6$ | $6\text{–}8$ |
|---|---|---|---|---|
| People | 5 | 8 | 9 | 4 |
Step 1. Replace each class interval with its mid-interval value.
$0\text{–}2 \;\Rightarrow\; \dfrac{0+2}{2} = 1$
$2\text{–}4 \;\Rightarrow\; \dfrac{2+4}{2} = 3$
$4\text{–}6 \;\Rightarrow\; \dfrac{4+6}{2} = 5$
$6\text{–}8 \;\Rightarrow\; \dfrac{6+8}{2} = 7$
| Mid-interval $x$ | 1 | 3 | 5 | 7 |
|---|---|---|---|---|
| Frequency $f$ | 5 | 8 | 9 | 4 |
Step 2. Apply the frequency-table mean formula.
$\sum f x = (1)(5) + (3)(8) + (5)(9) + (7)(4)$
$= 5 + 24 + 45 + 28 = 102$
$\sum f = 5 + 8 + 9 + 4 = 26$
$\bar{x} = \dfrac{102}{26}$
$\bar{x} = 3.9$ years (1 dp)
Step 3. Standard deviation — calculator with FREQ on, entering mid-intervals and frequencies.
$\sigma = 1.9$ years (1 dp)
When the question gives continuous grouped data, the answer is an estimate: we don't have the exact values, only the intervals. The mid-interval method is the standard way to get that estimate.
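The three steps of the mid-interval method can be sketched in Python like this. The function name `grouped_estimates` and the choice to write each class as a `(lower, upper)` pair are assumptions made for the example.

```python
import math

def grouped_estimates(intervals, frequencies):
    """Estimate the mean and SD of grouped data using mid-interval values."""
    # Step 1: replace each class interval with its mid-interval value.
    mids = [(lower + upper) / 2 for lower, upper in intervals]
    # Step 2: frequency-table mean with the mid-intervals as x.
    total_f = sum(frequencies)
    mean = sum(f * x for x, f in zip(mids, frequencies)) / total_f
    # Step 3: standard deviation, dividing by the total frequency.
    variance = sum(f * (x - mean) ** 2 for x, f in zip(mids, frequencies)) / total_f
    return mids, mean, math.sqrt(variance)

ages = [(0, 2), (2, 4), (4, 6), (6, 8)]   # the playground classes
people = [5, 8, 9, 4]
mids, mean, sd = grouped_estimates(ages, people)
print(mids, round(mean, 1), round(sd, 1))
# [1.0, 3.0, 5.0, 7.0] 3.9 1.9
```

Both outputs are estimates, for the same reason as on paper: every person in a class is treated as if they sat exactly at the mid-interval value.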
You try
The heights (cm) of 40 plants were grouped as below. Estimate the mean height.
| Height | $10\text{–}20$ | $20\text{–}30$ | $30\text{–}40$ | $40\text{–}50$ |
|---|---|---|---|---|
| Plants | 6 | 14 | 12 | 8 |
Mid-intervals are 15, 25, 35, 45. $\sum f = 40$.
Mid-intervals: 15, 25, 35, 45.
$\sum f x = (15)(6) + (25)(14) + (35)(12) + (45)(8)$
$= 90 + 350 + 420 + 360 = 1220$
$\bar{x} = \dfrac{1220}{40}$
$\bar{x} = 30.5$ cm
You try
The times (minutes) taken by 25 students to complete a puzzle were grouped as below. Use mid-intervals to estimate the mean and standard deviation.
| Time | $0\text{–}5$ | $5\text{–}10$ | $10\text{–}15$ | $15\text{–}20$ | $20\text{–}25$ |
|---|---|---|---|---|---|
| Students | 2 | 7 | 9 | 5 | 2 |
Mid-intervals are 2.5, 7.5, 12.5, 17.5, 22.5. $\sum f = 25$.
Mid-intervals: 2.5, 7.5, 12.5, 17.5, 22.5.
$\sum f x = 5 + 52.5 + 112.5 + 87.5 + 45 = 302.5$
$\bar{x} = \dfrac{302.5}{25}$
$\bar{x} = 12.1$ minutes, $\sigma = 5.28$ minutes (calculator, 2 dp)
$\bar{x} = 12.1$ minutes, $\sigma = 5.28$ minutes
You're done.
Statistics — Basic · Higher Level · Mathslive.ie