A quick way to count the number of null values in each field of a BigQuery table

After perhaps day 2 of many real-world jobs, most analysts have likely learnt to never fully trust any dataset without doing at least a little pre-exploration of the content and quality of the data. Before beginning the work to generate your world-shattering insights, it's therefore usually wise to run a few checks. One of my … Continue reading A quick way to count the number of null values in each field of a BigQuery table →

Travelling through time to query a BigQuery database from the past

Whilst looking into a broken Google BigQuery query recently, I chanced upon the "time travel" feature. This lets you query your BQ database to see what results it would have returned given the state of the data in the past, even if they are different to the results it now returns. I used that to … Continue reading Travelling through time to query a BigQuery database from the past →

Code-efficient alternatives to CASE WHEN statements in SQL

The CASE statement in SQL is the archetypal conditional statement, corresponding to the "if <A> then <B> else <C>" construct in other languages. Here's a quick refresher on what it looks like. Imagine we have a data table consisting of people, their ages and the number of years they've lived at their current and previous … Continue reading Code-efficient alternatives to CASE WHEN statements in SQL →

Extracting the date and time a UUID was created with Bigquery SQL (with a brief foray into the history of the Gregorian calendar)

I was recently working with records in a database that were identified by a Universally Unique Identifier, aka a UUID. These IDs are strings of characters that look something like "31ae75f0-cbe0-11e8-b568-0800200c9a66". I needed to know which records were generated during in a particular time period, but sadly there was no field about dates to be … Continue reading Extracting the date and time a UUID was created with Bigquery SQL (with a brief foray into the history of the Gregorian calendar) →

Free dataset: all Reddit comments available for download

As terrifying a thought as it might be, Jason from Pushshift.io has extracted pretty much every Reddit comment from 2007 through to May 2015 that isn't protected, and made it available for download and analysis. This is about 1.65 million comments, in JSON format. It's pretty big, so you can download it via a torrent, as per the … Continue reading Free dataset: all Reddit comments available for download →