Duolingo, the company behind the famous language learning app, face a similar challenge to the majority of companies whose existence largely depends on usage of an app or website. They want to promote "user engagement". That's to say it's in their interest - and fortunately in this case also in the interest of their customers … Continue reading Multi-armed bandits, and the Duolingo example
The Roe vs Wade leak and data privacy
Earlier this month, as anyone who has been anywhere near social media or a newspaper even here in the UK will know, a draft opinion from the US Supreme Court was leaked. It showed a majority opinion that the decision that has protected the right of Americans to access abortion services for the past almost …
Chebyshev’s inequality – a 68-95-99.7 style rule for all distributions
Image: Pafnuty Chebyshev, looking stern, from Wikipedia.
Most people who have studied a certain amount of statistics theory will likely have encountered the 68-95-99.7 rule. It could surely do with a catchier name, but the point of it is to quickly express the proportion of values that should lie within 1, 2 and 3 standard deviations …
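As a rough illustration of the idea in the title: Chebyshev's inequality, in its standard form, says that for any distribution with a finite mean and variance, at least 1 - 1/k² of values lie within k standard deviations of the mean (for k > 1). The sketch below compares that bound with what is observed in a deliberately skewed sample. The function names and the choice of an exponential sample are assumptions made for this example, not taken from the post.

```python
import random
import statistics


def chebyshev_lower_bound(k: float) -> float:
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        return 0.0  # the bound says nothing useful for k <= 1
    return 1.0 - 1.0 / k**2


def proportion_within(values, k: float) -> float:
    """Observed proportion of values within k (population) standard deviations."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(abs(v - mean) <= k * sd for v in values) / len(values)


# Hypothetical data: a skewed, clearly non-normal sample.
random.seed(0)
sample = [random.expovariate(1.0) for _ in range(10_000)]

for k in (1, 2, 3):
    bound = chebyshev_lower_bound(k)
    observed = proportion_within(sample, k)
    print(f"k={k}: guaranteed at least {bound:.3f}, observed {observed:.3f}")
```

The guarantees for k = 1, 2 and 3 are 0%, 75% and roughly 88.9%, which is much weaker than 68-95-99.7 but holds without assuming normality.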
Are you (statistically) smarter than a politician?
Time to test yourself! Give the questions below a go before proceeding.
1. If you toss a fair coin twice, what is the probability of getting two heads?
2. Suppose you roll a 6-sided die. The rolls are: 1, 3, 4, 1, and 6. What is the mean value?
3. And what is the mode value?
4. Suppose there was a diagnostic test for a virus. The false-positive rate (the proportion of people without the virus who get a positive result) is one in 1,000. You have taken the test and tested positive. What is the probability that you have the virus?
How to evaluate the results of an experiment early and often without increasing false positives
Most data folk I know love experiments. They're the ideal way to use data to answer the question of not only whether A is associated with B, but also if A causes B. Randomised Controlled Trials are a subset of experiments that most interested people seem to agree are the gold standard in, for instance, …
Accessing your Duolingo data for analysis via Python
Duolingo is a popular app-and-website for learning a new (human) language, with hundreds of millions of users across the world. You tell it what language you speak and which you'd like to learn, and it teaches you via bite-size lessons, stories and audio clips with interactive tests and the like. Even as someone who hasn't …
The effectiveness of the Covid-19 vaccine: 95% or 0.84%?
At the time of writing, about 87% of UK adults have received at least one dose of a Covid-19 vaccine. The vast majority of mainstream scientific or journalistic sources report the vaccine efficacy as being very high, up to 95% depending on the specific vaccine and specific measure in question. It may be somewhat lower …
Create rows for missing combinations of data with R
Sometimes one gets a dataset that is in one sense missing rows, but in another sense missing nothing, because those rows represent occasions where nothing happened. That's perhaps a rather confusing description, so to demonstrate with a common example of this let's imagine some sales data. Here each row tells you how much each customer …
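To make the "missing rows where nothing happened" idea concrete, here is a minimal sketch in Python rather than R, and not the post's own method. The sales records, the `month` field and the `complete_combinations` helper are all hypothetical, invented for this illustration. It builds every customer-by-month combination and fills the ones that never occurred with zero.

```python
from itertools import product

# Hypothetical sales data: one row per customer per month in which a sale happened.
sales = [
    {"customer": "A", "month": "2024-01", "amount": 10},
    {"customer": "A", "month": "2024-03", "amount": 5},
    {"customer": "B", "month": "2024-02", "amount": 7},
]


def complete_combinations(rows, keys, value, fill=0):
    """Return one row per combination of the observed key values.

    Combinations absent from `rows` get `fill` as their value. If a
    combination appears more than once, the last occurrence wins.
    """
    levels = [sorted({row[k] for row in rows}) for k in keys]
    observed = {tuple(row[k] for k in keys): row[value] for row in rows}
    return [
        {**dict(zip(keys, combo)), value: observed.get(combo, fill)}
        for combo in product(*levels)
    ]


completed = complete_combinations(sales, keys=("customer", "month"), value="amount")
for row in completed:
    print(row)
```

With two customers and three months this yields six rows, three of which are the zero-amount rows that were implicit in the original data.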
A quick way to count the number of null values in each field of a BigQuery table
After perhaps day 2 of many real-world jobs, most analysts have likely learnt to never fully trust any dataset without doing at least a little pre-exploration of the content and quality of the data. Before beginning the work to generate your world-shattering insights, it's therefore usually wise to run a few checks. One of my …
Travelling through time to query a BigQuery database from the past
Whilst looking into a broken Google BigQuery query recently, I chanced upon the "time travel" feature. This lets you query your BQ database to see what results it would have returned given the state of the data in the past, even if they are different to the results it now returns. I used that to …