Sometimes one gets a dataset that is in one sense missing rows, but in another sense missing nothing, because those rows represent occasions where nothing happened. That’s perhaps a rather confusing description, so to demonstrate with a common example of this let’s imagine some sales data. Here each row tells you how much each customer spent on a given day at your nice new store.
The key thing to notice here is that there are only records in this table for days when the customer concerned bought something. So each of our four customers, Maria, James, Michael and Mary, bought something on January 1st. But only Mary and Michael came back and bought something on January 2nd, so there’s no row for James or Maria dated January 2nd.
This can cause problems when aggregating the data if you’re not careful. For example, let’s try and figure out the average amount our four customers spend per day.
Intuitively, in R, tidyverse style, you might do:
library(dplyr) data %>% group_by(date) %>% summarise(avg_revenue = mean(revenue))
But this doesn’t work to properly answer our question. It returns a value of $3.25 for January 2nd. This is indeed the average of Michael’s $4 and Mary’s $2.50. But what about Maria and James? They’re still customers from the point of view of this question. So we need to take their $0 contributions into account.
Per usual, in R there are many many ways to approach this question. But here we’ll use this as an example of how to create rows in our dataset for these “implicitly missing” datapoints. We can later include these in the average so as to take into account the customers who didn’t buy anything on a given day.
The main function I like here is “complete” from the tidyr package. Here it is in action:
library(tidyr) data %>% complete(name, date)
Quite simply, we pass our data into that function and ask it to “complete” all combinations of the name and date variables. You can add as many or as few variables into that list as you like, and rows will be generated for each combination of all those variables.
If you don’t like the pipe syntax, it’d look like this:
complete(date, name, date)
The output is the following dataframe:
As you can see, the function has returned a dataframe that is guaranteed to have one row for everything you asked to “complete”. Here, that’s name and date. Rows where there was originally no data for the columns you didn’t specify to be completed are by default given a value of NA i.e. missing data.
In our example, that’s not actually quite what we want. The data isn’t missing in the sense that we don’t know what it is. We know for sure that the reason the rows weren’t there is because the customer didn’t provide us with any revenue on that day. Or stated in a different way: they provided us with a revenue of $0. The NAs will also cause us problems if we go back to using the same method to calculate the daily average per customer that we used above.
Instead, in our example, we need to tell R to fill in these automatically generated rows with a revenue of zero, not NA. The complete function lets us do that at the same time, by passing a list of the variables to fill and the values to use via the fill parameter. Here, it’s just the revenue variable we want to fill in, and we want it to be given a value of 0 if it doesn’t already have a value.
data %>% complete(name, date, fill = list(revenue = 0))
Why do we have to go to the trouble passing a list to the fill parameter? That’s because in theory you can use it to fill in multiple columns at once. Imagine we also had a cost column representing the cost of servicing each customer that day. And that for some reason it costs $1 per day to service a customer when they don’t buy anything. In that case you could fill in the revenue column with a 0 and the cost column with a 1 for any of the originally-missing combinations like this:
data %>% complete(name, date, fill = list(revenue = 0, cost = 1))
But anyway, back to the original example. Now we have that nice table, with zeroes inserted for the revenue on days the customer didn’t buy anything, we can easily calculate the average revenue per customer per day using the same method as we tried above. Let’s imagine we saved that “completed” data as a dataframe called complete_data. In that case, we can do:
complete_data %>% group_by(date) %>% summarise(avg_revenue = mean(revenue))
And this time, we get the correct result for January 2nd!
Mary and Michael’s total contributions of $6.50 divided by our 4 customers is indeed $1.625.
OK, let’s make things a little more complicated. With this similar-but-different example, imagine we also store the customer’s surname. By coincidence, these are customers with some of the most popular names in the US, according to Ancestry, albeit muddled up a bit to make for a hopefully less confusing example in what follows.
Let’s imagine we’re trying to do the same thing as before – add explicit rows with $0 revenue for days the customer didn’t buy anything. But now we have a surname field to contend with, so perhaps:
data %>% complete(forename, surname, date, fill = list(revenue = 0))
Um no, this produces a long table with 48 rows. I’ll spare you scrolling through it all here, but here are the rows generated for the first day, January 1st, where in reality all customers purchased something.
The function has actually done what we asked, and generated one row for every combination of forename, surname and date. But this means a row where each individual forename is associated with each individual surname. James Garcia doesn’t exist here. James does, but his surname should always be Johnson, and only Johnson. We don’t want rows for James where has has been assigned Maria’s surname.
Fair enough. But we can still use the complete function to fix this example up. We just need to add the nesting function to it.
This function takes a set of variables, and will only allow production of rows where the combination of those variables already exists at least once in the data. In this case, whilst we do want rows for every customer and every date, we only want to generate combinations of forenames and surnames that already exist in the data. Other combinations aren’t actually real customers. But we do still want those combinations associated with every individual date in the dataset, so the date doesn’t need nesting.
data %>% complete(nesting(forename, surname), date, fill = list(revenue = 0))
Perfect! One row for every customer for every date, but no rows for jumbled up names of customers who don’t actually exist.
One last trick: what if you want to generate rows for data that doesn’t actually exist anywhere in the table as it stands?
One common scenario for this might be for time series data. Imagine if in our example above, we’ve actually been trading for 7 days, January 1st to January 7th 2021. But our store was really unpopular, and so after the first three days, no-one ever came back.
How much did revenue did we make per day per customer over the whole 7 day period the store was open for? We can’t just do:
data %>% complete(nesting(forename, surname), date, fill = list(revenue = 0)) %>% summarise(avg_revenue = mean(revenue))
…because, as we saw in the table above, we’re only averaging rows that correspond to data from January 1st to 3rd. January the 4th through 7th didn’t feature at all in the dataset, so the complete function doesn’t generate rows for them.
But we can get around that by passing in the sequence of data we want using in our completion operation, originating outside of the dataframe we’re applying it to itself. For example, given that:
seq(as.Date("2021-01-01"), as.Date("2021-01-07"), by = "day")
produces a daily sequence of dates from 2021-01-01 through to 2021-01-07 inclusive, we can replace the “date” in the original dataset above with the sequence generated by seq() instead:
data %>% complete(nesting(forename, surname), date = seq(as.Date("2021-01-01"), as.Date("2021-01-07"), by = "day"), fill = list(revenue = 0))
This gives the following result, although I’ll just show the first 14 rows for brevity:
…where we see one row for every forename-surname combo that existed in the original dataframe, but for all dates generated by the seq command.
We can then use the results of that to perform a simple average over the whole dataset:
complete_data_7_days %>% summarise(avg_revenue = mean(revenue))
in order to see the sad fact that our less-than-impressive store made an average of $1.14 per customer per day for the seven days it was open.