Statistics — Need to Know

Sampling & surveys

5 cards

Card 1

Sampling — guidelines before you pick a method

1. Set a clear target population.

2. Set a clear sampling frame — the list you actually sample from.

3. Make sure the sample size is large enough.

4. Sample must be random.

5. No bias — don’t over-sample one group (e.g. men over women).

Sample size — too small = unreliable. Too big = costly & slow.

Card 2

Three sources of bias

Bias = distorted results. Three classic ways it sneaks in:

1. Sample not representative — e.g. asking about alcohol abuse in a pub.

2. Failure to respond — the people who don’t answer may be different.

3. Dishonest answers.

Card 3

Stratified — proportional, not equal

Use a natural division (gender, year group, class) to split the population into strata.

Pick the sample in proportion to each stratum’s size.

e.g. 1st years = 50, 6th years = 100 ⇒ take twice as many 6th years as 1st years in the sample.

Random within each stratum — otherwise you reintroduce bias.
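
A minimal sketch of proportional allocation, using the year-group sizes from the example above. The helper name `proportional_allocation`, the total sample size of 30 and the student numbering are assumptions for illustration only.

```python
import random

def proportional_allocation(strata_sizes, sample_size):
    """Share a total sample size across strata in proportion to their sizes."""
    population = sum(strata_sizes.values())
    return {name: round(sample_size * size / population)
            for name, size in strata_sizes.items()}

strata = {"1st years": 50, "6th years": 100}
allocation = proportional_allocation(strata, sample_size=30)
print(allocation)  # {'1st years': 10, '6th years': 20}

# Random selection within a stratum (1st years numbered 1 to 50 here).
first_year_sample = random.sample(range(1, 51), allocation["1st years"])
```

With awkward numbers the rounded shares may not add up exactly to the total, so check the sum and adjust by hand.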

Card 4

Survey types — pros & cons

Postal — cheap, large reach. But poor response rate, limited type of data.

Personal — high response, can ask lots. But expensive, interviewer bias.

Observation — systematic. But laborious, time-consuming.

Card 5

Good questionnaire — rules

• Brief, clear, simple questions first.

• Multiple-choice answers where possible.

• Clear instructions.

• Be clear who fills it in and how answers are recorded.

• NO leading questions.

• NO embarrassing or personal questions.

Averages, mean problems & spread

6 cards

Card 6

Mean problems — the workhorse formula

Sum = mean × n

If you know the mean and how many values there are, you know the total. Almost every “missing number” mean problem starts here.

Card 7

Missing number, given the mean

1. Use Sum = mean × n.

2. Write: (sum of known) + x = mean × n.

3. Solve for x.

e.g. Mean of 5 numbers is 7, four of them are 3, 8, 6, 9. → 3 + 8 + 6 + 9 + x = 35 → x = 9.
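
The three steps can be sketched in a few lines of Python, using the numbers from the example above. The function name `missing_value` is an assumption for illustration.

```python
def missing_value(known, mean, n):
    """Return x so that the n values (the known ones plus x) have the given mean."""
    total = mean * n          # Sum = mean × n
    return total - sum(known)

x = missing_value([3, 8, 6, 9], mean=7, n=5)
print(x)  # 9
```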

Card 8

Missing frequency, given the mean

1. From x̄ = Σfx / Σf:

2. Write Σfx = mean × Σf.

3. Both sides contain the missing f. Expand and solve.
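
A worked sketch of the same steps, assuming a made-up table: x = 1, 2, 3, 4 with frequencies 2, f, 5, 3 and a mean of 2.5. The table and the function name `missing_frequency` are assumptions for illustration. Here Σfx = 29 + 2f and Σf = 10 + f, so 29 + 2f = 2.5(10 + f).

```python
def missing_frequency(known, x_missing, mean):
    """known: list of (x, f) pairs; x_missing: the value whose frequency is unknown."""
    sum_fx = sum(x * f for x, f in known)
    sum_f = sum(f for _, f in known)
    # sum_fx + f * x_missing = mean * (sum_f + f), rearranged for f
    return (mean * sum_f - sum_fx) / (x_missing - mean)

f = missing_frequency([(1, 2), (3, 5), (4, 3)], x_missing=2, mean=2.5)
print(f)  # 8.0
```

Check: with f = 8, Σfx = 45 and Σf = 18, and 45 / 18 = 2.5.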

Card 9

Grouped data — what to use

Use the mid-interval value of each group as x, then apply x̄ = Σfx / Σf.

The modal class is the group with the largest frequency — you can’t pick a single modal value in grouped data.

Note. Discrete data can sit on a continuous scale — e.g. a wage of €25,000 falls in the €20,000–€30,000 group.
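
A short sketch of the grouped-data mean and modal class. The three groups and their frequencies are made-up values for illustration.

```python
# Each entry is ((low, high), frequency); the data are illustrative only.
groups = [((0, 10), 4), ((10, 20), 6), ((20, 30), 10)]

# Mid-interval value (low + high) / 2 stands in for x in each group.
sum_fx = sum((low + high) / 2 * f for (low, high), f in groups)
sum_f = sum(f for _, f in groups)
mean = sum_fx / sum_f

# Modal class = the group with the largest frequency.
modal_class = max(groups, key=lambda g: g[1])[0]

print(mean)         # 18.0
print(modal_class)  # (20, 30)
```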

Card 10

σ — the method (without a calculator)

1. Find the mean x̄.

2. Set up three columns: x · (x − x̄) · (x − x̄)².

3. Sum the squares, divide by n, square root.

σ = √[ Σ(x − x̄)² / n ]

Frequency table: σ = √[ Σf(x − x̄)² / Σf ].

Tip. Use the calculator (1-VAR stats) unless the question says “without calculator”.
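
The three-column method can be sketched as follows, with a made-up data set chosen so the numbers come out whole. `statistics.pstdev` (the population standard deviation, which divides by n) is used only as a cross-check.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values

mean = sum(data) / len(data)                         # step 1
squared_deviations = [(x - mean) ** 2 for x in data]  # step 2
sigma = math.sqrt(sum(squared_deviations) / len(data))  # step 3

print(mean, sigma)               # 5.0 2.0
print(statistics.pstdev(data))   # 2.0
```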

Card 11

Mean vs Median vs Mode — which to trust

Mode & Median — not affected by extreme values.

Mean — distorted by extremes but best for further analysis (σ, z-scores, etc.).

If the data is skewed or has outliers, quote the median. If it’s for further stats work, you need the mean.

Presenting data & correlation

4 cards

Card 12

Axis rules — easy marks

Frequency on the VERTICAL axis.

Always LABEL both axes (with units).

If you skip a label or put frequency on the x-axis, you lose marks even if the bars are right.

Card 13

Skew — mean follows the tail

Right (positive) skew — long tail on the right. Mode < Median < Mean. e.g. family size.

Left (negative) skew — long tail on the left. Mean < Median < Mode. e.g. 6th year shoe size.

Symmetric — Mean = Median = Mode.

The mean is pulled towards the tail. Just remember that and the order works itself out.

Card 14

Line of best fit — equation and use

A straight line through the middle of the data on a scatter plot.

y = a + bx — a is the y-intercept, b is the slope.

Use it to predict y for a given x.

Slope = rate of change of one variable as the other changes.

e.g. b = 4.5, with income measured in €1,000s ⇒ each extra year of education goes with about €4,500 more income.
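
The card does not say how the line is found; at this level it is often drawn by eye. As a sketch, one standard way to compute a and b is least squares, shown here on made-up points that lie exactly on y = 1 + 2x. The function name `line_of_best_fit` is an assumption.

```python
def line_of_best_fit(xs, ys):
    """Least-squares intercept a and slope b for y = a + bx."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx
    a = mean_y - b * mean_x
    return a, b

a, b = line_of_best_fit([1, 2, 3, 4, 5], [3, 5, 7, 9, 11])
print(a, b)       # 1.0 2.0
print(a + b * 6)  # 13.0  (predicted y for x = 6)
```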

Card 15

r — what it is and what it isn’t

r measures the strength and direction of the linear relationship.

−1 ≤ r ≤ 1. Rule of thumb: |r| > 0.7 strong; 0.3 ≤ |r| ≤ 0.7 moderate; |r| < 0.3 weak.

r is NOT the slope of the line of best fit. The slope is b. They’re different numbers.

r = 0 means no linear relationship — not “no relationship at all”.

Correlation ≠ Causality. Two variables can move together because of a hidden third variable, or by chance. e.g. ice-cream sales ↔ drownings: not causal (hot weather drives both). Smoking ↔ cancer: causal.
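
A small sketch of two points from this card, using the usual product-moment formula for r and made-up data. The function name `pearson_r` is an assumption. The first two data sets have different slopes but the same r; the third has a perfect curved relationship (y = x²) and still gives r = 0.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient r for paired data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

print(pearson_r([1, 2, 3], [2, 4, 6]))               # 1.0  (slope 2)
print(pearson_r([1, 2, 3], [20, 40, 60]))            # 1.0  (slope 20)
print(pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))  # 0.0  (y = x squared)
```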

z-scores & the z-table

5 cards

Card 16

Empirical Rule — 68 / 95 / 99.7

For normally distributed data:

• Within 1σ of the mean → 68%

• Within 2σ → 95%

• Within 3σ → 99.7%

Quick check. Anything beyond 2σ is unusual; beyond 3σ is very rare.
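
The three percentages can be checked with Python's standard-library `statistics.NormalDist`, as a quick sketch.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

for k in (1, 2, 3):
    within = z.cdf(k) - z.cdf(-k)  # area between -k and +k
    print(k, round(within, 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```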

Card 17

P(z < a) — direct lookup

Look up a in the body of the z-table:

• Row = units & tenths (e.g. 1.2).

• Column = hundredths (e.g. 0.03).

The number you read off is P(z < a).

Always sketch the curve and shade what you want before looking up — it stops sign mistakes.

Card 18

P(z > a) — total area is 1

P(z > a) = 1 − P(z < a)

The total area under the standard normal is 1. Take the left tail off 1.

Card 19

Negatives — use symmetry

P(z < −a) = 1 − P(z < a) — the left tail of −a equals the right tail of +a.

P(z > −a) = P(z < a) — everything to the right of a negative equals everything to the left of the positive.

For an interval: P(−a < z < b) = P(z < b) − P(z < −a). Convert the negative one using symmetry first.
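
A sketch that reproduces the table lookups from the last three cards, using `statistics.NormalDist` in place of the printed z-table. The values a = 1.23 and the interval −1 < z < 2 are illustrative choices.

```python
from statistics import NormalDist

phi = NormalDist().cdf  # phi(a) = P(z < a), the number the z-table gives

a = 1.23
print(round(phi(a), 4))             # 0.8907  P(z < 1.23), direct lookup
print(round(1 - phi(a), 4))         # 0.1093  P(z > 1.23) = 1 - P(z < 1.23)
print(round(phi(-a), 4))            # 0.1093  P(z < -1.23), same by symmetry
print(round(phi(2) - phi(-1), 4))   # 0.8186  P(-1 < z < 2)
```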

Card 20

Inverse z — given a probability, find k

If the probability is > 0.5 (e.g. P(z < k) = 0.8765):

Find 0.8765 in the body of the table, then read k off the row & column. k is positive.

If the probability is < 0.5 (e.g. P(z < k) = 0.1234):

k is negative. Look up 1 − 0.1234 = 0.8766, read the z-value, then negate it.
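
The two cases can be checked with `NormalDist.inv_cdf`, which does the reverse lookup directly. This is a sketch; in the exam you read the nearest table entry instead.

```python
from statistics import NormalDist

z = NormalDist()

k_pos = z.inv_cdf(0.8765)  # probability above 0.5, so k is positive
k_neg = z.inv_cdf(0.1234)  # probability below 0.5, so k is negative

print(round(k_pos, 2))  # 1.16
print(round(k_neg, 2))  # -1.16

# The table route for the negative case: look up 1 - 0.1234, then negate.
k_via_table = -z.inv_cdf(1 - 0.1234)
```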

Confidence intervals & hypothesis testing

5 cards

Card 21

CLT — use σ/√n, NOT σ

For a large sample (n ≥ 30) the sample mean x̄ is approximately normal, with:

• Mean of the sample means = μ

• Standard deviation of the sample means = σ/√n

Watch. Use σ/√n — NOT just σ. The sample-mean spread is smaller than the raw spread.
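
A one-function sketch of the standard error of the mean, with illustrative values σ = 12 and n = 36. The name `standard_error` is an assumption.

```python
import math

def standard_error(sigma, n):
    """Standard deviation of the sample means: sigma divided by root n."""
    return sigma / math.sqrt(n)

print(standard_error(12, 36))  # 2.0, much smaller than the raw sigma of 12
```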

Card 22

95% Confidence Interval — formula

Population proportion p, sample proportion p̂:

p̂ − 1.96·SE ≤ p ≤ p̂ + 1.96·SE

Or simply p̂ ± 1.96·SE.

Where SE = √[p̂(1−p̂)/n]. As a rough version, the whole margin of error 1.96·SE can be taken as 1/√n.
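
A sketch of the interval for an illustrative sample: proportion 0.6 from n = 400. The function name `proportion_ci` is an assumption. The rough 1/√n figure is printed alongside the margin of error 1.96·SE, which is the quantity it approximates.

```python
import math

def proportion_ci(p_hat, n, z_crit=1.96):
    """95% confidence interval for a population proportion."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    margin = z_crit * se
    return p_hat - margin, p_hat + margin, margin

low, high, margin = proportion_ci(0.6, 400)
print(round(low, 3), round(high, 3))  # 0.552 0.648
print(round(margin, 3))               # 0.048
print(1 / math.sqrt(400))             # 0.05, the rough margin of error
```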

Card 23

Hypothesis test — 5-step CI method

1. State H₀ (null) and H_A (alternative).

2. Find the standard error.

3. Find the margin of error = 1.96 · SE.

4. Build the confidence interval.

5. Is the claimed figure inside the CI?

• YES → fail to reject H₀.

• NO → reject H₀.
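
The five steps can be sketched as one function. The helper name `ci_test` and the numbers (claimed proportion 0.5, sample proportion 0.6, n = 400) are assumptions for illustration.

```python
import math

def ci_test(p_hat, n, claimed, z_crit=1.96):
    """Build a 95% CI around p_hat and check whether the claimed value is inside."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # step 2: standard error
    margin = z_crit * se                     # step 3: margin of error
    low, high = p_hat - margin, p_hat + margin  # step 4: confidence interval
    # step 5: is the claimed figure inside the CI?
    decision = "fail to reject H0" if low <= claimed <= high else "reject H0"
    return low, high, decision

low, high, decision = ci_test(p_hat=0.6, n=400, claimed=0.5)
print(round(low, 3), round(high, 3), decision)  # 0.552 0.648 reject H0
```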

Card 24

Hypothesis test — z-score method

1. State H₀ and H_A.

2. Test statistic: z = (x̄ − μ) / (σ/√n) (in log tables).

3. Decision rule:

• −1.96 ≤ z ≤ 1.96 → fail to reject H₀.

• Outside → reject H₀.

4. State conclusion in plain English.
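
A sketch of the test statistic and decision rule, assuming illustrative values μ = 50, σ = 12 and n = 36. The function name `z_test` is an assumption.

```python
import math

def z_test(sample_mean, mu, sigma, n, z_crit=1.96):
    """Two-sided z-test at the 5% level: returns the z-score and the decision."""
    z = (sample_mean - mu) / (sigma / math.sqrt(n))
    decision = "fail to reject H0" if -z_crit <= z <= z_crit else "reject H0"
    return z, decision

print(z_test(54.5, mu=50, sigma=12, n=36))  # (2.25, 'reject H0')
print(z_test(53, mu=50, sigma=12, n=36))    # (1.5, 'fail to reject H0')
```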

Card 25

p-value method — and the wording warning

The p-value is the probability of a result at least as extreme as the one observed, IF H₀ were true.

• p < 0.05 → reject H₀.

• p ≥ 0.05 → fail to reject.

Two-sided test: multiply the tail probability by 2.

e.g. z = 2 ⇒ P(z > 2) = 0.0228 ⇒ p = 2 × 0.0228 = 0.0456 < 0.05 ⇒ reject H₀.

NEVER use the word “accept” — only “fail to reject”. Failing to reject doesn’t prove H₀ is true.

Low p-value = strong evidence against H₀.
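
A sketch of the two-sided p-value for the z = 2 example, using `statistics.NormalDist` instead of the table. The function name `two_sided_p_value` is an assumption. The unrounded result is 0.0455; the 0.0456 above comes from doubling the table value 0.0228, which is already rounded.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """Double the tail area beyond |z| under the standard normal curve."""
    tail = 1 - NormalDist().cdf(abs(z))
    return 2 * tail

p = two_sided_p_value(2)
print(round(p, 4))                                     # 0.0455
print("reject H0" if p < 0.05 else "fail to reject H0")  # reject H0
```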