Situations when multicollinearity in regression model variables isn’t important

When creating basic multiple regression models, if your predictor variables correlate with each other this usually presents a problem in that you can end up with unstable estimates for the resulting coefficients. One way to test for multi-collinearity is to check for a relatively high Variance Inflation Factor, or VIF. Many packages exist that make … Continue reading Situations when multicollinearity in regression model variables isn’t important →

Notes on the book “Becoming a Data Head”

Below are notes that I took when reading Alex J. Gutman and Jordan Goldmeier's book "Becoming a Data Head - How to Think, Speak, and Understand Data Science, Statistics, and Machine Learning". The notes simply aim to summarise the parts of the book that most attracted my attention, sometimes reworded or reorganised, and don’t necessarily … Continue reading Notes on the book “Becoming a Data Head” →

A super fast and flexible near-optimal matching method using the quickmatch library in R

Sometimes we don't have the luxury of running a gold-standard randomised controlled trial when wanting to understand the effect of some intervention on some population. Perhaps the required experiment would be unethical, too costly, or otherwise unfeasible. Or maybe the powers that be just never thought about doing a proper experiment, but yet want to … Continue reading A super fast and flexible near-optimal matching method using the quickmatch library in R →

Chebyshev’s inequality – a 68-95-99.7 style rule for all distributions

Pafnuty Chebyshev, looking stern, from Wikipedia. Most people that have studied a certain amount of statistics theory will likely have encountered the 68-95-99.7 rule. It could surely do with a more catchy name, but the point of it is to quickly express the proportion of values that should lie within 1, 2 and 3 standard deviations … Continue reading Chebyshev’s inequality – a 68-95-99.7 style rule for all distributions →

Are you (statistically) smarter than a politican?

Time to test yourself! Give the below three questions a go, before proceeding 1. If you toss a fair coin twice, what is the probability of getting two heads? 2. Suppose you roll a 6-sided die. The rolls are: 1, 3, 4, 1, and 6. What is the mean value? 3. And what is the mode value? 4. Suppose there was a diagnostic test for a virus. The false-positive rate (the proportion of people without the virus who get a positive result) is one in 1,000. You have taken the test and tested positive. What is the probability that you have the virus?

How to evaluate the results of an experiment early and often without increasing false positives

Most data folk I know love experiments. They're the ideal way to use data to answer the question of not only whether A is associated with B, but also if A causes B. Randomised Controlled Trials are a subset of experiments that most interested people seem to agree are the gold standard in, for instance, … Continue reading How to evaluate the results of an experiment early and often without increasing false positives →

The effectiveness of the Covid-19 vaccine: 95% or 0.84%?

At the time of writing, about 87% of UK adults have received at least one dose of a Covid-19 vaccine. The huge majority of mainstream scientific or journalistic sources report the vaccine efficacy as being very high, up to 95% depending on the specific vaccine and specific measure in question. It may be somewhat lower … Continue reading The effectiveness of the Covid-19 vaccine: 95% or 0.84%? →

Create similar test and control groups by randomising participants with blocking in R

In the classic randomised experiment we randomly assign participants to at least two groups, test and control, by metaphorically tossing a coin to allocate them to one or the other. However, in reality sometimes slightly more sophisticated methods can be useful. One such method is blocking. Here, you first create "blocks" of participants, usually based … Continue reading Create similar test and control groups by randomising participants with blocking in R →

Covid-19 testing in England’s secondary schools: the story so far

Last week, secondary schools in England reopened for all pupils, after several months of semi-closure due to the Covid-19 pandemic. During that period, most teachers were busy labouring hard to conduct their lessons remotely, often persevering with minimal extra resources and inconsistent central guidance. A smaller proportion were still going to school, in order to … Continue reading Covid-19 testing in England’s secondary schools: the story so far →

Using R to run many hypothesis tests (or other functions) on subsets of your data in one go

It's easy to run a basic hypothesis test in R, once you know how. For example, if you've a nice set of data that you know meets the relevant assumptions, then you can run a t test in the following sort of way . Here we'll assume that you're interested in comparing the differences in … Continue reading Using R to run many hypothesis tests (or other functions) on subsets of your data in one go →