The CASE statement in SQL is the archetypal conditional statement, corresponding to the “if <A> then <B> else <C>” construct in other languages. Here’s a quick refresher on what it looks like.
Imagine we have a data table consisting of people, their ages and the number of years they’ve lived at their current and previous address:
person_number | age | years_at_current_address | years_at_previous_address |
---|---|---|---|
1 | 41 | 5 | 3 |
2 | 17 | 6 | 7 |
3 | 25 | 4 | 2 |
4 | 64 | 3 | 3 |
We can then include this CASE snippet in the SELECT statement to categorise people as to whether they are children or adults, assuming we’re going to call you an adult if you are at least 18 years old.
SELECT
CASE WHEN age >= 18
THEN 'Adult'
ELSE 'Child'
END AS age_category
FROM people
You can have as many WHEN parts to the statement as you wish, with each having different conditions. Each one is checked in order, and the first one where the condition is true is the one that’s returned. If none of the conditions are true then whatever is in the ELSE part of the statement is returned, or if there is no ELSE part, then a null value. So something like this also makes sense if you want more than two categories.
SELECT
CASE
WHEN age >= 18 THEN 'Adult'
WHEN age BETWEEN 13 AND 17 THEN 'Teenager'
WHEN age IS NOT NULL THEN 'Younger child'
ELSE 'Age not known'
END AS age_category
FROM people
Anyway, case statements are super useful, flexible and common. But over time I’ve found myself using alternative functions to get results that are quicker to write and/or more readable for any observers. Whether or not they are in any way more database-efficient would depend on the database itself, but in the current world of fast and cheap databases I often find myself prioritising efficiency and understandability of the query above caring how the database feels about it.
Below are thus a few examples of places where you could use CASE statements, but don’t need to. Note that at present I’m most often working with Google BigQuery as a database, so the below code should work for that technology. However, most any database that supports SQL will often have the same, or similar, functions.
Choose your task:
- Checking if a value is null
- Returning the first non-null value from a set of fields
- Picking the highest or lowest value from a set of fields
- What percentage of rows match a certain criteria?
Checking if a value is null
This CASE statement checks whether the age entry is missing, aka null. If it is then it fills it in with the number 9999, otherwise it returns the actual age itself. In this and the other snippets below, imagine the code is placed in the SELECT part of the query, between SELECT and FROM.
CASE
WHEN age IS NULL
THEN 9999
ELSE age
END AS age_no_nulls
But instead of that, this use-case is entirely fulfilled with the nice short IFNULL function; the field corresponding to the first parameter to the function is returned if it’s not null, otherwise the second parameter is returned.
IFNULL(age, 9999)
Whilst in the above I used the fixed value of 9999 as what I want to see if there’s no age, you can also put in the name of another field as the second parameter if you want a construct to the effect of “use the contents of field A if it’s there, or the contents of field B if A is null”.
Returning the first non-null value from a set of several fields
Similar to the above, but more flexible, if less intuitively named, is COALESCE. This function takes a list of fields or values, and the first one that isn’t null is returned. If there are only 2 fields/values involved, this is basically equivalent to the IFNULL function above but harder to spell. So why bother?
We bother because it’s not limited to 2 fields or values. Maybe the database has a terrible design and the last five addresses someone has lived at are stored in the fields address_1, address_2, address_3, address_4, and address_5.
Then perhaps you want to return the first address the person lived in. But for someone who lived only in one place so far that’s in address_1. Someone else though may have lived in 3 houses to date; in that case we’d need the value from address_3.
You could do this:
CASE
WHEN address_5 IS NOT NULL THEN address_5
WHEN address_4 IS NOT NULL THEN address_4
WHEN address_3 IS NOT NULL THEN address_3
WHEN address_2 IS NOT NULL THEN address_2
ELSE address_1
END AS oldest_known_address
But instead why not try:
COALESCE(address_5 , address_4, address_3, address_2, address_1)
?
As a constant in SQL is by definition not null, you can also again use this to return a field filled in with a certain value if its null for a given row. Perhaps none of those addresses have values – oh no, missing data! – you want to return a message showing that we have no address for that person.
COALESCE(address_5 , address_4, address_3, address_2, address_1, "No address found")
By including the non-null value of “No address found” above, we ensure that if all of the address_x fields are null then as the string “No address found” is by definition never null, so that’s what gets returned.
Picking the highest or lowest value from a set of fields
Now let’s tackle the question of the longest time a person has lived at any address, based on their current and previous addresses. This means we want to return, per person, whichever the highest value is from the fields years_at_current_address and years_at_previous_address.
With CASE, that might look like this:
CASE
WHEN years_at_current_address > years_at_previous_address
THEN years_at_current_address
ELSE years_at_previous_address
END AS maximum_years_at_an_address
But there’s a function that can express this in a shorter and more easily understandable-at-a-glance way. Using GREATEST:
GREATEST(years_at_current_address, years_at_previous_address)
And if you wanted instead whatever the lowest number of years they lived at an address, the opposite of GREATEST is LEAST.
LEAST(years_at_current_address, years_at_previous_address)
What percentage of rows match a certain criteria?
So, what % of these folk are adults, aged at least 18 years old? It’s easy to find the number of rows with an age over 18 using a WHERE statement, but in doing that you lose the context of how many rows there were in total, which you need in order to get to a percentage.
Using CASE, you could do this:
SUM(
CASE WHEN age >= 18
THEN 1
ELSE 0
END) /
COUNT(1) AS pct_adults
But it’s a bit of a mouthful. Somehow I lived a fairly long time without realising there’s a COUNTIF function in (BigQuery) SQL. That works in the same way as the classic COUNTIF function in Excel, counting rows that match the criteria expressed within it. So the above snippet could be rewritten as:
COUNTIF(age >= 18) / COUNT(1) AS pct_adults
Or perhaps you want the count of folk who are adults, and the count of those who lived at their current address for just 1 year returned in a single dataset, noting that these two counts might refer to different rows.
COUNTIF(age >= 18) AS count_of_adults,
COUNTIF(years_at_current_address = 1) AS recently_moved_house
None of this is particularly revolutionary, but since adopting some of these techniques I’ve saved myself a few key-presses, plus found it a bit easier to check and reinterpret my previous code; making querying just a little bit more efficient.