Those of us who write code can, and should, be thankful for the existence and popularity of code version control systems. Git, and the GitHub service built around it, are probably the most famous examples, but many other options are available.
These are systems that keep track of the changes you make to code over time – the who, the when and, if used properly, hopefully the why of them. Whilst they are nigh-on essential for analyst teams that code, they can be greatly useful even for the solo data analyst. Who amongst us has never wanted to undo all the changes we made to our project today after we somehow broke all of yesterday’s stellar work? Who amongst us has never wanted to remember why we made a decision when seeking inspiration from code written a year ago?
In more recent times I’ve started to realise benefits from version-controlling my data as well as my code.
Even when you have a perfect version-controlled copy of your original analysis queries and code, you might still not be able to fully recreate and evolve your analysis if something about the original source of your data changes.
Database schemas change, tables are deleted, definitions are altered, data is archived, deleted, anonymised, people with access quit, companies go bust; there are any number of reasons why running the data retrieval query that worked yesterday might not work today.
If your analysis is most definitely a one-off thing then that might be fine. But if you are ever likely to get follow-up questions in the future, about more than just the encoded logic of your analysis, you might wish you had a copy of the data you actually used when you did the analysis.
Furthermore, analysis is an iterative process. As you develop and experiment with your approach to a given work product, in a similar vein to the coding process, you might occasionally wish you could turn back the clock and use the data you had before you or someone else messed something up in the latest pull.
To quote the Institute of Data Analytics on “Why is Data Versioning Important?”:
Reproducibility: In data science, ensuring that analyses and models can be replicated requires access to the exact versions of datasets used. Data versioning facilitates this by providing a detailed record of data changes and enabling easy access to past versions.
Collaboration: In team settings, different members might need to work on various aspects of a dataset. Without proper versioning, this can lead to conflicts, overwritten data and confusion. Data versioning tools allow team members to work independently while ensuring that their contributions can be integrated.
Data Integrity: Data is rarely static, it is often dynamic; it evolves due to updates, corrections and additions. Keeping track of these changes helps maintain the integrity of the data. If an error is introduced, versioning allows data scientists to pinpoint when and where the error occurred and to roll back to a correct version if necessary.
Auditability: For industries that require strict compliance and auditing standards, such as finance and healthcare, data versioning provides a transparent trail of data modifications. This can be crucial for audits, regulatory compliance and ensuring the credibility of data-driven decisions.
Experimentation: In the exploratory phases of data science, multiple hypotheses and approaches are tested. Data versioning allows data scientists to branch out from a specific dataset version to explore various methods and then merge the best results back into the main dataset. This freedom to experiment without fear of losing previous work accelerates innovation and discovery.
Whilst you can generally store smaller, ideally text-based, datafiles in the typical version control systems like Git just fine – and indeed GitHub displays the likes of CSVs quite nicely in a web browser – these systems are not usually designed for storing and version controlling large or binary datasets. On GitHub, for instance, you will get a warning for files greater than 50 MB and there’s a hard limit of 100 MB. The diff view will also be pretty useless in many cases.
As GitHub’s own documentation on large files puts it:
Git is not designed to handle large SQL files. To share large databases with other developers, we recommend using a file sharing service.
Whilst I do not presently have access to an organisation-wide formal data version control system, I have found a good personal (and potentially team) solution in the “pins” library.
This library is available for both R and Python, so it may well work for you too if you’re a user of either or both.
From the blurb of the R version:
The pins package publishes data, models, and other R objects, making it easy to share them across projects and with your colleagues. You can pin objects to a variety of pin boards, including folders (to share on a networked drive or with services like DropBox), Posit Connect, Databricks, Amazon S3, Google Cloud Storage, Azure storage, and Microsoft 365 (OneDrive and SharePoint). Pins can be automatically versioned, making it straightforward to track changes, re-run analyses on historical data, and undo mistakes.
You can use pins from Python as well as R. For example, you can use one language to read a pin created with the other.
Let’s see how it works in R!
First we need to create a “pinboard” in order to have somewhere to “pin” our data to. You might think of this as the nearest equivalent of a GitHub repo.
There are many options for where to store your data, as can be seen from the multitude of board_ functions in the package.
Presently they include:
- an Azure storage container
- Posit Connect
- a Databricks volume
- Google Cloud Storage
- a Google Drive folder
- a local folder on your computer
- a OneDrive or SharePoint library
- an S3 bucket
In my examples below I will be using a Google Drive folder, but, aside from the board_ function, everything else should work the same irrespective of destination.
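Just to illustrate how little changes between destinations, here’s a quick sketch of a couple of alternatives. The folder path and bucket name are placeholders, and the S3 version assumes your AWS credentials are already configured elsewhere:
library(pins)

# A pinboard in a local folder (or a mounted network drive)
board_local <- board_folder("~/my-pinboard", versioned = TRUE)

# A pinboard in an S3 bucket
board_s3 <- board_s3("my-example-bucket", versioned = TRUE)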
One important note re Google Drive: presently you cannot use a folder on a shared Google Drive without a somewhat faffy workaround. That’s logged as a bug in the package’s GitHub issues. But it works fine on a non-shared Google Drive, and hopefully one day it will be fully solved.
Starting with examples that use R – here’s one method you can use to set up a new Google Drive pinboard or connect to an existing one. Here I specify versioned to be TRUE as I explicitly want the system to keep track of different versions of my data.
install.packages("pins") # if you haven't already done this once
library(pins)
board <- board_gdrive("ADD THE URL OF YOUR GOOGLEDRIVE FOLDER HERE", versioned = TRUE)
Now you’re in a position to start pinning data. Each time you want to pin new data, or update an existing pin with a new version of its data, you can use the pin_write function. Here we’ll use R’s built-in mtcars dataset to illustrate:
pin_write(
  board = board,
  x = mtcars,
  name = "mtcars",
  type = "parquet",
  description = "Example mtcars dataset"
)
which, as you can see below, creates a new version of a pin called “mtcars”, generates an automatic version identifier, and saves the mtcars dataframe to it.
Creating new version '20260118T123652Z-5239d'
Writing to pin 'mtcars'
board is the board we already created.
x is the data you want to save. Usually that’s a dataframe although it doesn’t have to be.
type is the file format to save to. There are several options, most of which are pretty self-explanatory:
- csv
- json
- rds
- parquet
- arrow
- qs
I usually use parquet, which Apache describes as “an open source, column-oriented data file format designed for efficient data storage and retrieval”. It is far more space-efficient than e.g. CSV, which can be important if dealing with large datasets.
name and description are strings to help you later identify the data.
The name parameter is the name of the pin, which should be unique to the pin you want to maintain. It’s how the software identifies what should be a new version of an existing pin versus what should be a whole new pin. So, if you’re planning on making lots of pins over time, you probably want to come up with a naming system so you don’t accidentally overwrite pins with anything other than new and improved versions of the data you originally wanted to store in them.
Each time you pin_write with the same name parameter to the same board, the pin with that name will get updated with whatever you put in the x parameter – with the previous versions being maintained such that you can retrieve them if desired.
The write function is sensible enough to check whether anything has changed in your data since you last wrote to the pin. If it sees that there are no changes then it won’t waste space and time by updating the pin with a new version of the exact same data. Instead you’ll see a message indicating that nothing changed, so no new version was created.
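Incidentally, as mentioned above, what you pin doesn’t have to be a dataframe. Here’s a minimal sketch of pinning a fitted model instead – the model and pin name are purely illustrative – using the R-native rds format, since a model isn’t rectangular data:
# Pin an arbitrary R object (here a fitted linear model) rather than a dataframe
model <- lm(mpg ~ wt + hp, data = mtcars)

pin_write(
  board = board,
  x = model,
  name = "mtcars-model",   # illustrative pin name
  type = "rds",            # R-native format, suitable for non-rectangular objects
  description = "Example linear model of mpg vs weight and horsepower"
)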
So that’s writing data. Reading it back is pretty much what you might assume. The simplest version being:
my_data <- pin_read(board, "mtcars")
The first parameter is your board, the second is the pin name. Here I’m assigning the retrieved data to the my_data dataframe. Used in the above manner you will be returned the newest version of your pin.
But what if you wanted an older version? Then you can use the version parameter.
my_data <- pin_read(board, "mtcars", version = "20260118T123652Z-5239d")
But how on earth would you know the identifier of the version you want?
Easy! Use pin_versions:
pin_versions(board, "mtcars")
which gives you output like this, showing all the available versions and when they were created:
  version                created             hash
  <chr>                  <dttm>              <chr>
1 20260118T123652Z-5239d 2026-01-18 12:36:52 5239d
2 20260118T131041Z-c2dcc 2026-01-18 13:10:41 c2dcc
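Because pin_versions returns an ordinary dataframe, you can also use it programmatically rather than copying identifiers by hand. For example, a small sketch that pulls out the earliest version listed above and reads that data back in:
versions <- pin_versions(board, "mtcars")

# In the output above, versions are listed oldest first,
# so the first row holds the earliest identifier
oldest_version <- versions$version[[1]]

my_old_data <- pin_read(board, "mtcars", version = oldest_version)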
Incidentally, if you don’t know the name of the pin you’re after you can also list all pins on a board with pin_list:
pin_list(board)
There’s only 1 pin on my board so far:
[1] "mtcars"
But after a while you may end up with many, many pins. No need to read through a massive list of pins to identify the one you want if you have a vague idea what you called it. Instead, search for it:
pin_search(board, "cars")
which in this case gives the following output:
  name   type    title                              created             file_size   meta
  <chr>  <chr>   <chr>                              <dttm>              <fs::bytes> <list>
1 mtcars parquet mtcars: a pinned 5 x 11 data frame 2026-01-18 13:10:41 1.97K       <pins_met>
Given the typical iterative process of analysis, over time you might build up a load of versions of your pins that you just know you’ll never need again. Obviously be careful here. Who knows what the future might bring? But if you know for sure that you don’t need certain prior versions of a pin you can delete a single version, without deleting the whole pin, like this:
pin_version_delete(board, "mtcars", version = "20260118T123652Z-5239d")
which gives confirmation:
File trashed:
* 20260118T123652Z-5239d <id: 1dtkWsloExAWGn5hh54NHBFphtSqOBNzd>
If you’ve a few versions to delete, rather than typing in all their identifiers you can “prune” them. pin_versions_prune lets you delete versions either because they are older than a given number of days, or because they are further back than the most recent n versions. Whichever you pick, the most recent version will always be retained.
board |> pin_versions_prune("mtcars", n = 2) # keep only the newest 2 versions
board |> pin_versions_prune("mtcars", days = 7) # keep only the versions that were created in the last 7 days
That can be a convenient way of tidying up your disk once you’ve arrived at a version of the data you’re actually happy with.
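Finally on the tidying-up front, should you ever want to remove a pin entirely, every version included, the package also has a pin_delete function. The same “are you sure you’ll never need it again?” caveat obviously applies:
pin_delete(board, "mtcars") # removes the whole "mtcars" pin, all versions included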
Whilst the above examples use R code, as noted, the pins library also exists for Python, and you can use either language to read data that was written by the other.
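For instance, assuming a Python script had pinned a dataframe called “mtcars” to a local folder board, using a cross-language format like parquet or csv, then reading it back from R is just a matter of pointing at the same folder. A minimal sketch:
library(pins)

# Point at the same folder the Python code used for its board_folder()
# (the path is a placeholder)
board <- board_folder("./NAME OF YOUR FOLDER")

# Read the pin exactly as if it had been written from R
my_data <- pin_read(board, "mtcars")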
Below, for reference, is most of the above but in Python.
Although, for those of us that use both languages, one important note is that so far the Python pins library doesn’t seem to have the ability to directly use a Google Drive folder as a board. There’s no board_gdrive function. You’d likely have to mount the Google Drive as a local folder if you wanted to do that.
The pinboard storage options in the Python library at the time of writing are:
- local folder
- AWS S3 bucket
- Google Cloud Storage bucket
- Azure Datalake Filesystem folder
- Databricks Volume folder
- Posit Connect server
In the below example I’ll just use a standard local folder. Other than that, and some intuitive changes reflecting the differences between Python and R syntax, the workflow is exactly the same in Python as it was in R.
pip install pins # the first time you use it
import pins
from pins.data import mtcars  # the example mtcars dataset bundled with the Python pins package

# Create a pinboard
board = pins.board_folder("./NAME OF YOUR FOLDER", versioned = True)

# Write data
board.pin_write(
    x = mtcars,
    name = "mtcars",
    type = "parquet",
    description = "Example mtcars dataset"
)
my_data = board.pin_read("mtcars") # read latest version of pin data in
my_data = board.pin_read("mtcars", version = "20260118T133959Z-40cae") # read specific version
board.pin_list() # List pins
board.pin_versions("mtcars") # List pin versions of a pin
board.pin_search("cars") # Search for a pin
board.pin_versions_prune("mtcars", days = 7) # Prune versions from more than 7 days ago
board.pin_version_delete("mtcars", version = "20260118T133959Z-40cae") # Delete specific versions