The methods, techniques and warnings — the things that don't fit on a flashcard. Read it, learn it, use it.
1. Set a clear target population.
2. Set a clear sampling frame — the list you actually sample from.
3. Make sure the sample size is large enough.
4. Sample must be random.
5. No bias — don’t over-sample one group (e.g. men over women).
Bias = distorted results. Three classic ways it sneaks in:
1. Sample not representative — e.g. asking about alcohol abuse in a pub.
2. Failure to respond — the people who don’t answer may be different.
3. Dishonest answers.
Use a natural division (gender, year group, class) to split the population into strata.
Pick the sample in proportion to each stratum’s size.
Random within each stratum — otherwise you reintroduce bias.
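A minimal sketch of the proportional step, assuming a made-up school with three year groups as the strata. The names, counts and the use of simple rounding are illustrative assumptions; rounded quotas will not always add up to exactly n.

```python
import random

def stratified_sample(strata, n, seed=None):
    """Draw a proportional stratified sample.

    strata: dict mapping stratum name -> list of members.
    n: total sample size wanted.
    """
    rng = random.Random(seed)
    total = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Quota in proportion to the stratum's share of the population.
        quota = round(n * len(members) / total)
        # Random selection within the stratum, without replacement.
        sample[name] = rng.sample(members, quota)
    return sample

# Hypothetical school: 300 students in three year groups.
school = {
    "4th year": [f"4-{i}" for i in range(150)],
    "5th year": [f"5-{i}" for i in range(90)],
    "6th year": [f"6-{i}" for i in range(60)],
}
picked = stratified_sample(school, n=30, seed=1)
sizes = {name: len(members) for name, members in picked.items()}
print(sizes)  # quotas of 15, 9 and 6
```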
Postal — cheap, large reach. But poor response rate, limited type of data.
Personal — high response, can ask lots. But expensive, interviewer bias.
Observation — systematic. But laborious, time-consuming.
• Brief, clear, simple questions first.
• Multiple-choice answers where possible.
• Clear instructions.
• Be clear who fills it in and how answers are recorded.
• NO leading questions.
• NO embarrassing or personal questions.
Sum = mean × n
If you know the mean and how many values there are, you know the total. Almost every “missing number” mean problem starts here.
1. Use Sum = mean × n.
2. Write: (sum of known) + x = mean × n.
3. Solve for x.
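The three steps above can be sketched in a few lines. The numbers are invented for illustration.

```python
def missing_value(known, mean, n):
    """Return x such that the n values (known plus x) have the given mean."""
    # Sum = mean × n, so x = mean × n - (sum of known).
    return mean * n - sum(known)

# Hypothetical example: five values have mean 7; four of them are 4, 6, 9, 10.
x = missing_value([4, 6, 9, 10], mean=7, n=5)
print(x)  # 35 - 29 = 6
```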
1. From x̄ = Σfx / Σf:
2. Write Σfx = mean × Σf.
3. Both sides contain the missing f. Expand and solve.
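A sketch of the same algebra, assuming a made-up frequency table with one unknown frequency. Writing S for the known part of Σfx and F for the known part of Σf, the equation S + f·x = mean × (F + f) rearranges to f = (mean × F − S) / (x − mean).

```python
def missing_frequency(known, x_missing, mean):
    """Solve for the unknown frequency f of the value x_missing.

    known: list of (x, f) pairs whose frequencies are known.
    Not usable when x_missing equals the mean (f then cancels out).
    """
    s = sum(x * f for x, f in known)    # known part of Σfx
    big_f = sum(f for _, f in known)    # known part of Σf
    # s + f * x_missing = mean * (big_f + f), solved for f
    return (mean * big_f - s) / (x_missing - mean)

# Hypothetical table: values 1, 2, 3, 4 with frequencies 3, f, 5, 2; mean 2.5.
f = missing_frequency([(1, 3), (3, 5), (4, 2)], x_missing=2, mean=2.5)
print(f)  # 2.0
```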
Use the mid-interval value of each group as x, then apply x̄ = Σfx / Σf.
The modal class is the group with the largest frequency — you can’t pick a single modal value in grouped data.
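A short sketch of both ideas, using an invented grouped table. Each group is stored as ((lower, upper), frequency).

```python
# Hypothetical grouped data: (interval, frequency)
groups = [((0, 10), 4), ((10, 20), 10), ((20, 30), 4), ((30, 40), 2)]

def grouped_mean(groups):
    """Estimate the mean using the mid-interval value of each group as x."""
    sum_fx = sum(f * (lo + hi) / 2 for (lo, hi), f in groups)
    sum_f = sum(f for _, f in groups)
    return sum_fx / sum_f

def modal_class(groups):
    """Return the interval with the largest frequency."""
    return max(groups, key=lambda g: g[1])[0]

print(grouped_mean(groups))  # 340 / 20 = 17.0
print(modal_class(groups))   # (10, 20)
```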
1. Find the mean x̄.
2. Set up three columns: x · (x − x̄) · (x − x̄)².
3. Sum the squares, divide by n, square root.
σ = √[ Σ(x − x̄)² / n ]
Frequency table: σ = √[ Σf(x − x̄)² / Σf ].
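A sketch of the calculation, covering both the plain list and the frequency-table version (population standard deviation, dividing by n). The data is invented for illustration.

```python
import math

def std_dev(values, freqs=None):
    """Standard deviation of the values, optionally weighted by frequencies."""
    if freqs is None:
        freqs = [1] * len(values)
    n = sum(freqs)                                        # Σf
    mean = sum(f * x for x, f in zip(values, freqs)) / n  # mean of the data
    squares = sum(f * (x - mean) ** 2 for x, f in zip(values, freqs))
    return math.sqrt(squares / n)

# Plain list: the mean is 5 and the squared deviations sum to 32.
print(std_dev([2, 4, 4, 4, 5, 5, 7, 9]))            # 2.0
# Same data written as a frequency table.
print(std_dev([2, 4, 5, 7, 9], [1, 3, 2, 1, 1]))    # 2.0
```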
Mode & Median — not affected by extreme values.
Mean — distorted by extremes but best for further analysis (σ, z-scores, etc.).
If the data is skewed or has outliers, quote the median. If it’s for further stats work, you need the mean.
Frequency on the VERTICAL axis.
Always LABEL both axes (with units).
If you skip a label or put frequency on the x-axis, you lose marks even if the bars are right.
Right (positive) skew — long tail on the right. Mode < Median < Mean. e.g. family size.
Left (negative) skew — long tail on the left. Mean < Median < Mode. e.g. 6th year shoe size.
Symmetric — Mean = Median = Mode.
The mean is pulled towards the tail. Just remember that and the order works itself out.
A straight line through the middle of the data on a scatter plot.
y = a + bx — a is the y-intercept, b is the slope.
Use it to predict y for a given x.
Slope = rate of change of one variable as the other changes.
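A sketch of fitting and using y = a + bx. The notes do not say how the line is found, so the least-squares method here is an assumption, and the hours/marks data is made up.

```python
def fit_line(xs, ys):
    """Least-squares line y = a + b*x (assumed method). Returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # slope: change in y per unit change in x
    a = mean_y - b * mean_x  # y-intercept
    return a, b

# Hypothetical data: hours studied vs. test mark.
hours = [1, 2, 3, 4, 5]
marks = [52, 55, 61, 64, 68]
a, b = fit_line(hours, marks)
predicted = a + b * 6        # predict the mark for 6 hours
print(round(a, 1), round(b, 1), round(predicted, 1))  # 47.7 4.1 72.3
```

Predictions far outside the range of the observed x values should be treated with caution.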
r measures the strength and direction of the linear relationship.
−1 ≤ r ≤ 1. |r| > 0.7 strong; 0.3 ≤ |r| ≤ 0.7 moderate; |r| < 0.3 weak. The sign of r gives the direction.
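A sketch of computing r by hand and labelling its strength with the cut-offs above. The formula used is the usual Pearson correlation coefficient, and the data is made up.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def strength(r):
    """Label |r| using the 0.7 and 0.3 cut-offs from these notes."""
    size = abs(r)
    if size > 0.7:
        return "strong"
    if size >= 0.3:
        return "moderate"
    return "weak"

# Hypothetical data: hours studied vs. test mark.
r = pearson_r([1, 2, 3, 4, 5], [52, 55, 61, 64, 68])
print(round(r, 3), strength(r))  # about 0.994, strong (and positive)
```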
For normally distributed data:
• Within 1σ of the mean → 68%
• Within 2σ → 95%
• Within 3σ → 99.7%
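These percentages are rounded. A quick check with Python's standard-library `statistics.NormalDist` shows the underlying values.

```python
from statistics import NormalDist

def within(k):
    """Share of a normal distribution lying within k standard deviations."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(k, round(100 * within(k), 2))  # roughly 68.27, 95.45, 99.73
```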
Look up a in the z-table using its row and column headings:
• Row = units & tenths (e.g. 1.2).
• Column = hundredths (e.g. 0.03).
The number you read off is P(z < a).
P(z > a) = 1 − P(z < a)
The total area under the standard normal is 1. Take the left tail off 1.
P(z < −a) = 1 − P(z < a) — the left tail of −a equals the right tail of +a.
P(z > −a) = P(z < a) — everything to the right of a negative equals everything to the left of the positive.
For an interval: P(−a < z < b) = P(z < b) − P(z < −a). Convert the negative one using symmetry first.
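A sketch of these rules in code. The `table` function is a stand-in for the printed z-table (it refuses negative values, as the printed table does); the other function names are illustrative.

```python
from statistics import NormalDist

def table(a):
    """Stand-in for the z-table: P(z < a), for a >= 0 only."""
    if a < 0:
        raise ValueError("the table only lists non-negative z")
    return NormalDist().cdf(a)

def p_less(a):
    """P(z < a); a negative a is converted using symmetry."""
    return table(a) if a >= 0 else 1 - table(-a)

def p_greater(a):
    """P(z > a) = 1 - P(z < a)."""
    return 1 - p_less(a)

def p_between(lo, hi):
    """P(lo < z < hi) = P(z < hi) - P(z < lo)."""
    return p_less(hi) - p_less(lo)

print(round(p_less(1.23), 4))       # 0.8907
print(round(p_less(-1.23), 4))      # 0.1093
print(round(p_greater(-1.23), 4))   # 0.8907
print(round(p_between(-1, 2), 4))   # 0.8186
```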
If the probability is > 0.5 (e.g. P(z < k) = 0.8765):
Find 0.8765 in the body of the table, then read k off the row & column. k is positive.
If the probability is < 0.5 (e.g. P(z < k) = 0.1234):
k is negative. Look up 1 − 0.1234 = 0.8766, read the z-value, then negate it.
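A sketch of the reverse lookup, with `NormalDist.inv_cdf` standing in for reading the table backwards. The function name `find_k` is illustrative.

```python
from statistics import NormalDist

def find_k(prob):
    """Return k with P(z < k) = prob, mimicking the table method."""
    std = NormalDist()
    if prob >= 0.5:
        # Probability is in the table: k is positive (or zero).
        return std.inv_cdf(prob)
    # Probability below 0.5: look up 1 - prob, then negate.
    return -std.inv_cdf(1 - prob)

print(round(find_k(0.8765), 2))  # 1.16
print(round(find_k(0.1234), 2))  # -1.16
```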
For a large sample (n ≥ 30) the sample mean is approximately normal, with:
• Mean of the sample means = μ
• Standard deviation of the sample means = σ / √n
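A sketch of using this result, assuming a made-up population with μ = 50 and σ = 12 and samples of size 36. The standard deviation of the sample means is σ / √n.

```python
import math
from statistics import NormalDist

def sample_mean_distribution(mu, sigma, n):
    """Approximate distribution of the sample mean for large n."""
    return NormalDist(mu, sigma / math.sqrt(n))

# Hypothetical population: mean 50, standard deviation 12, samples of 36.
dist = sample_mean_distribution(50, 12, 36)
print(dist.mean, dist.stdev)   # 50.0 2.0

# Probability that a sample mean exceeds 53 (z = 1.5).
p_above = 1 - dist.cdf(53)
print(round(p_above, 4))       # 0.0668
```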
Population proportion p, sample proportion p̂:
p̂ − 1.96·SE ≤ p ≤ p̂ + 1.96·SE
Or simply p̂ ± 1.96·SE.
Where SE = √[p̂(1−p̂)/n]. As a rough version, use 1/√n for the whole margin of error 1.96·SE.
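A sketch of the 95% interval for a proportion, using an invented survey in which 40% of 600 people said yes.

```python
import math

def proportion_ci(p_hat, n, z=1.96):
    """95% confidence interval for a population proportion."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error
    margin = z * se                          # margin of error
    return p_hat - margin, p_hat + margin

# Hypothetical survey: sample proportion 0.4 from n = 600.
low, high = proportion_ci(0.4, 600)
print(round(low, 4), round(high, 4))  # 0.3608 0.4392

# Rough margin of error for comparison.
print(round(1 / math.sqrt(600), 4))   # 0.0408
```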
1. State H0 (null) and HA (alternative).
2. Find the standard error.
3. Find the margin of error = 1.96 · SE.
4. Build the confidence interval.
5. Is the claimed figure inside the CI?
• YES → fail to reject H0.
• NO → reject H0.
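The five steps can be sketched as one function. The claim (50% support) and the survey counts are made up, and H0 is taken to be "the population proportion equals the claimed figure".

```python
import math

def check_claim(claimed_p, successes, n, z=1.96):
    """Test H0: p = claimed_p using a 95% confidence interval."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)   # step 2: standard error
    margin = z * se                           # step 3: margin of error
    low, high = p_hat - margin, p_hat + margin  # step 4: the interval
    # Step 5: is the claimed figure inside the interval?
    decision = "fail to reject H0" if low <= claimed_p <= high else "reject H0"
    return low, high, decision

# Hypothetical: a party claims 50% support; 240 of 600 sampled agree.
low, high, decision = check_claim(0.5, 240, 600)
print(round(low, 4), round(high, 4), decision)  # 0.3608 0.4392 reject H0
```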
1. State H0 and HA.
2. Test statistic: z = (x̄ − μ) / (σ / √n) (in log tables).
3. Decision rule:
• −1.96 ≤ z ≤ 1.96 → fail to reject H0.
• Outside → reject H0.
4. State conclusion in plain English.
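A sketch of the test at the 5% level, with the test statistic taken as z = (sample mean − μ) / (σ / √n). The figures (a claimed mean of 500 with σ = 20, and a sample of 64 averaging 495) are invented.

```python
import math

def z_test(sample_mean, mu, sigma, n, critical=1.96):
    """Two-sided z-test of H0: population mean = mu."""
    z = (sample_mean - mu) / (sigma / math.sqrt(n))
    if -critical <= z <= critical:
        decision = "fail to reject H0"
    else:
        decision = "reject H0"
    return z, decision

# Hypothetical: claimed mean 500, sigma 20, sample of 64 with mean 495.
z, decision = z_test(495, 500, 20, 64)
print(z, decision)  # -2.0 reject H0
```

In plain English for this example: there is evidence at the 5% level that the true mean differs from 500.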
The p-value is the probability of a result at least as extreme as the one observed, IF H0 were true.
• p-value < 0.05 → reject H0.
• p-value ≥ 0.05 → fail to reject.
Two-sided test: multiply the tail probability by 2.
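A sketch of turning a z test statistic into a two-sided p-value. The z values used are illustrative.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """Probability of a result at least as extreme as z, in either tail."""
    tail = 1 - NormalDist().cdf(abs(z))  # one tail beyond |z|
    return 2 * tail                      # two-sided: double it

p = two_sided_p_value(-2.0)
print(round(p, 4))                        # 0.0455
print("reject H0" if p < 0.05 else "fail to reject H0")  # reject H0
```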