Modern statistics
- motivate by problem solving
- start with visualization and exploring data
- focus on what can reasonably be learned from data, biases in the data, and when causation can be concluded
- models and algorithms
- assessing uncertainty through resampling data (bootstrap)
- probability theory as a neat way of turning random variation into uncertainty about what is true
- hypothesis testing and its potential problems
- Bayesian methods
- PPDAC: problem -> plan -> data -> analysis -> conclusion
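A minimal sketch of the bootstrap idea from the list above: resample the observed data with replacement many times, recompute the statistic each time, and read an uncertainty interval off the spread of the results. The data values, the function name `bootstrap_ci` and its parameters are all made up for illustration; this is the simple percentile bootstrap, one of several variants.

```python
import random
import statistics

def bootstrap_ci(sample, stat=statistics.mean, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for `stat` (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(sample)
    # Each resample draws n values from the sample WITH replacement.
    estimates = sorted(
        stat(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    # Cut off the lower and upper tails of the resampled estimates.
    lo_idx = round((1 - level) / 2 * n_resamples)
    hi_idx = round((1 + level) / 2 * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]

# Made-up house prices (in thousands), purely for illustration.
prices = [210, 225, 240, 250, 265, 270, 290, 310, 340, 380, 420, 515]

low, high = bootstrap_ci(prices)
print(f"sample mean: {statistics.mean(prices):.1f}")
print(f"95% bootstrap interval for the mean: {low:.1f} to {high:.1f}")
```

No probability formula is needed here: the interval comes entirely from re-using the data, which is why the bootstrap can come before probability theory.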
What problems do I care about
- housing price
- employability
- …
Looking at data: What was the pattern of Harold Shipman’s murders?
- Problem: can more detail tell us more about what Shipman did?
- Plan: compare the actual times at which his patients died with the times of death recorded by other local GPs
- Data: a huge exercise requiring examination of death certificates
- Analysis: simple plotting
Inference and bias: How many sexual partners have people in Britain had in their lifetime?
- Problem: cannot know this as a fact
- Plan: survey in which people are carefully asked about their sexual activity
- Data: reports of numbers of partners
- Analysis: plotting and summary statistics
Regression, prediction and algorithms: Who was the luckiest person on the Titanic?
The mysteries of the P-value
- P-value: a measure of the conflict between the data and a null hypothesis of no effect
- Specifically, P = probability of getting a result at least as extreme as the one observed, were the null hypothesis true
- Not the probability of the null hypothesis
- Traditional threshold of 5% for declaring a result statistically significant
- not significant does not mean no effect
- if there are many tests or a crucial decision is at stake, use a more stringent threshold
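A small worked example of the points above, assuming a made-up experiment: a coin lands heads 16 times in 20 flips, and the null hypothesis is that the coin is fair. The p-value is the probability of 16 or more heads if the null were true. The function name `one_sided_p_value` and the figure of 10 tests are hypothetical, and dividing the threshold by the number of tests (the Bonferroni correction) is only one common way to make it more stringent.

```python
from math import comb

def one_sided_p_value(heads, flips, p_null=0.5):
    """P(at least `heads` heads in `flips` flips), were the null hypothesis true."""
    return sum(
        comb(flips, k) * p_null**k * (1 - p_null) ** (flips - k)
        for k in range(heads, flips + 1)
    )

# Made-up observation: 16 heads in 20 flips of a supposedly fair coin.
p = one_sided_p_value(16, 20)
alpha = 0.05  # traditional 5% threshold

print(f"p = {p:.4f}")  # p = 0.0059
print("significant at 5%:", p < alpha)  # True

# Suppose this was one of 10 tests: a Bonferroni-style threshold is alpha / 10.
n_tests = 10
print("significant at the stricter threshold:", p < alpha / n_tests)  # False
```

Note that p = 0.0059 is the probability of data this extreme given a fair coin, not the probability that the coin is fair.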
Quick analogy
- H0: The defendant is innocent
- Evidence (data): test results, testimony
- p-value: how likely you would be to see evidence this strong if the defendant were actually innocent
- A very low p-value is like very strong evidence against the defendant being innocent -> convict the defendant (reject H0)
- If p = 0.01: were the defendant truly innocent, there would be only a 1% chance of seeing evidence this strong. The evidence is very unlikely under the assumption of innocence, which leads you to reject that assumption.
Another analogy - explain the p-value to me like I’m 5
- H0: I didn’t eat all the cookies
- T (data): cookie crumbs on their shirt, chocolate on their hands
- Question: if they really didn’t eat the cookies, how likely is it that you would see this evidence? p = 0.01 => a 1% chance of seeing it => extremely rare
- p = 0.01 < 0.05 => reject H0 (the null hypothesis): you don’t believe them
- p > 0.05 => can’t reject H0, which means you can’t tell whether they ate all the cookies