Earlier this month, as anyone who has been anywhere near social media or a newspaper even here in the UK will know, a draft opinion from the US Supreme Court was leaked. It showed a majority opinion that the decision that has protected the right of Americans to access abortion services for the past almost … Continue reading The Roe vs Wade leak and data privacy
Pafnuty Chebyshev, looking stern, from Wikipedia. Most people who have studied a certain amount of statistics theory will likely have encountered the 68-95-99.7 rule. It could surely do with a more catchy name, but the point of it is to quickly express the proportion of values that should lie within 1, 2 and 3 standard deviations … Continue reading Chebyshev’s inequality – a 68-95-99.7 style rule for all distributions
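As a rough illustration of the rule mentioned in that excerpt, the sketch below computes the proportion of a normal distribution lying within 1, 2 and 3 standard deviations of the mean, next to the minimum proportion that Chebyshev's inequality guarantees for any distribution with a finite variance. The function names `normal_within` and `chebyshev_lower_bound` are made up for this example and do not come from the post.

```python
import math


def normal_within(k: float) -> float:
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


def chebyshev_lower_bound(k: float) -> float:
    """Minimum proportion within k standard deviations (k > 0) per Chebyshev's inequality."""
    return max(0.0, 1 - 1 / k**2)


for k in (1, 2, 3):
    print(
        f"k={k}: normal {normal_within(k):.4f}, "
        f"any distribution at least {chebyshev_lower_bound(k):.4f}"
    )
```

The normal figures round to the familiar 68%, 95% and 99.7%, whereas the Chebyshev bound is far looser (nothing guaranteed at 1 standard deviation, 75% at 2), which is the price of making no assumption about the shape of the distribution.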
Time to test yourself! Give the four questions below a go before proceeding: 1. If you toss a fair coin twice, what is the probability of getting two heads? 2. Suppose you roll a 6-sided die five times. The rolls are: 1, 3, 4, 1, and 6. What is the mean value? 3. And what is the mode value? 4. Suppose there was a diagnostic test for a virus. The false-positive rate (the proportion of people without the virus who get a positive result) is one in 1,000. You have taken the test and tested positive. What is the probability that you have the virus?
Most data folk I know love experiments. They're the ideal way to use data to answer the question of not only whether A is associated with B, but also whether A causes B. Randomised Controlled Trials are a subset of experiments that most interested people seem to agree are the gold standard in, for instance, … Continue reading How to evaluate the results of an experiment early and often without increasing false positives
Duolingo is a popular app-and-website for learning a new (human) language, with hundreds of millions of users across the world. You tell it what language you speak and which you'd like to learn, and it teaches you via bite-size lessons, stories and audio clips with interactive tests and the like. Even as someone who hasn't … Continue reading Accessing your Duolingo data for analysis via Python
At the time of writing, about 87% of UK adults have received at least one dose of a Covid-19 vaccine. The huge majority of mainstream scientific or journalistic sources report the vaccine efficacy as being very high, up to 95% depending on the specific vaccine and specific measure in question. It may be somewhat lower … Continue reading The effectiveness of the Covid-19 vaccine: 95% or 0.84%?
Sometimes one gets a dataset that is in one sense missing rows, but in another sense missing nothing, because those rows represent occasions where nothing happened. That's perhaps a rather confusing description, so to demonstrate with a common example of this let's imagine some sales data. Here each row tells you how much each customer … Continue reading Create rows for missing combinations of data with R
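The post above works in R, so the following is only a Python sketch of the general idea rather than the author's method. It assumes hypothetical sales records with `customer`, `month` and `amount` fields, and fills in a zero-amount row for every customer and month combination that has no recorded sale. The helper `complete_combinations` is invented for this illustration.

```python
from itertools import product

# Hypothetical data: a row exists only where a sale actually happened.
sales = [
    {"customer": "A", "month": "2022-01", "amount": 10.0},
    {"customer": "A", "month": "2022-02", "amount": 5.0},
    {"customer": "B", "month": "2022-02", "amount": 7.5},
]


def complete_combinations(rows, keys, value, fill=0.0):
    """Return one row for every combination of the key values seen in the data."""
    # Map each observed key combination to its recorded value.
    observed = {tuple(r[k] for k in keys): r[value] for r in rows}
    # Collect the distinct values of each key, sorted for a stable output order.
    levels = [sorted({r[k] for r in rows}) for k in keys]
    return [
        {**dict(zip(keys, combo)), value: observed.get(combo, fill)}
        for combo in product(*levels)
    ]


completed = complete_combinations(sales, keys=("customer", "month"), value="amount")
for row in completed:
    print(row)
```

Here customer B gains a row for 2022-01 with an amount of 0.0, so later averages or charts treat that month as "no sales" rather than as absent.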
After perhaps day 2 of many real-world jobs, most analysts have likely learnt to never fully trust any dataset without doing at least a little pre-exploration of the content and quality of the data. Before beginning the work to generate your world-shattering insights, it's therefore usually wise to run a few checks. One of my … Continue reading A quick way to count the number of null values in each field of a BigQuery table
Whilst looking into a broken Google BigQuery query recently, I chanced upon the "time travel" feature. This lets you query your BQ database to see what results it would have returned given the state of the data in the past, even if they are different to the results it now returns. I used that to … Continue reading Travelling through time to query a BigQuery database from the past
In the classic randomised experiment we randomly assign participants to at least two groups, test and control, by metaphorically tossing a coin to allocate them to one or the other. However, in reality sometimes slightly more sophisticated methods can be useful. One such method is blocking. Here, you first create "blocks" of participants, usually based … Continue reading Create similar test and control groups by randomising participants with blocking in R
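The post above uses R, so what follows is just a minimal Python sketch of blocked randomisation under assumed details: participants are dictionaries with an `id`, and the blocking variable is a hypothetical `age_band` field. Within each block the participants are shuffled and then alternated between test and control, which keeps the two groups balanced on the blocking variable.

```python
import random
from collections import defaultdict


def blocked_assignment(participants, block_key, seed=None):
    """Randomly assign participants to 'test' or 'control' within each block."""
    rng = random.Random(seed)

    # Group participants into blocks using the supplied key function.
    blocks = defaultdict(list)
    for p in participants:
        blocks[block_key(p)].append(p)

    assignment = {}
    for members in blocks.values():
        rng.shuffle(members)
        # Randomise which arm receives the extra participant in odd-sized blocks.
        offset = rng.randrange(2)
        for i, p in enumerate(members):
            assignment[p["id"]] = "test" if (i + offset) % 2 == 0 else "control"
    return assignment


# Hypothetical participants: six in each of two age bands.
participants = [
    {"id": i, "age_band": "under 40" if i < 6 else "40 plus"} for i in range(12)
]
groups = blocked_assignment(participants, block_key=lambda p: p["age_band"], seed=42)
print(groups)
```

With even-sized blocks, as here, each block splits exactly in half, so both groups end up with three participants from each age band whatever the seed.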