After perhaps day 2 of many real-world jobs, most analysts have likely learnt to never fully trust any dataset without doing at least a little pre-exploration of the content and quality of the data.
Before beginning the work to generate your world-shattering insights, it’s therefore usually wise to run a few checks. One of my favourites is to establish how many records in your database table have null values. This is often an indication of missing or unknown data. The presence of nulls might be perfectly fine in some instances and for some analytical approaches – sometimes we legitimately don’t have a datapoint but can still proceed – but often it isn’t.
Checking for nulls one field at a time is fairly basic SQL. If you have a table, let’s call it “participants”, that looks like this:
…then you can quickly count the number of records with a null participant, like this:
SELECT COUNT(1) FROM participants WHERE participant_id IS NULL
If we got with the assumption that the blank in column 1, row 3, of the above table is in fact a null in the database – not just a blank string – then that will give us the result ‘1’, because there’s one row with a null value in the participant_id column.
If you wanted to know the percentage of null records in a column – sometimes useful when selecting variables to insert into a statistical model where you rarely want to include a column that’s 99% blank, in addition to data quality checks, then an option that works on a single column would be:
SELECT COUNTIF(participant_id IS NULL) / COUNT(1) AS pct_null_participant_ids FROM participants
In this case that’d return 0.25, as 1 out of the 4, i.e. 25% of the records are null in the participant_id field within the tiny table above.
So far so good – but what if you want to test all fields for nulls? Maybe you want to know in one fell swoop that participant_id and location have one missing value, gender has two, whereas age and score are complete. And worse yet, what if it’s a super wide table, just brimming with features for consideration in your model. Do you really have to write a query out, one field at a time, to see how many and where the nulls are?
Naturally not! We can take advantage of a function in BigQuery that converts a whole table to a JSON text string, TO_JSON_STRING. When this function tries to convert a NULL in the database, it comes out as the text string “null”. So the export will contain an entry that looks like “your_fieldname”:null for each null present in the table, where your_fieldname is the name of the field that contains that null
So, something like:
SELECT column_name, COUNT(1) AS nulls_count FROM participants, UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(participants), r'"(\w+)":null')) column_name GROUP BY column_name ORDER BY nulls_count DESC
Running that will give us a nice table of results that looks like this:
…summarising all the fields that have any nulls in them, here presented in descending order of how many null records there are in that field. Note that fields with no nulls in them won’t appear in the results at all.
This is definite overkill for an example with either very few columns, or a tiny dataset you can trivially inspect by eye . But if you have a table on the wide side, it’s a nice quick way to check out which fields have those potentially dangerous null values in.
Credit is entirely due to stackoverflow for the original idea, although to my shame I don’t recall specifically which poster. Also to note, that given the magical flexibility of regular expressions, there’s no doubt many other potential uses for this convert-to-string then reg-exp type pattern.