How to get a Wikipedia (or other HTML) table into R as a dataframe

I recently wanted to use some data I found in a Wikipedia article for analysis in R. The usual buyer-beware caveats about Wikipedia data apply, although these days it often seems as reliable as any other source.

It turns out it’s pretty easy to do. You can use the rvest library, a Tidyverse-style library designed for web scraping:

rvest helps you scrape (or harvest) data from web pages. It is designed to work with magrittr to make it easy to express common web scraping tasks, inspired by libraries like beautiful soup and RoboBrowser.

Step one is naturally to install and load the library if you haven’t already:

install.packages("rvest")
library(rvest)

Now let’s say we want to get the contents of one of the tables from the Wikipedia page “2024 United Kingdom general election”. That page’s URL is https://en.wikipedia.org/wiki/2024_United_Kingdom_general_election.

We start by retrieving its content using the read_html function:

my_webpage <- read_html("https://en.wikipedia.org/wiki/2024_United_Kingdom_general_election")

This performs an HTTP request to the given URL and parses the result. For that page, the result is a long wall of HTML that you wouldn’t want to pick through by hand.
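If you want to sanity-check what came back, the parsed result is an xml_document. Handily, read_html() also accepts a literal HTML string, so you can experiment without any network requests at all; the markup below is invented purely for illustration:

```r
library(rvest)

# read_html() parses a literal HTML string just as it would a URL,
# which makes for an easy offline sanity check.
doc <- read_html("<html><body><p>Hello</p></body></html>")
class(doc)  # "xml_document" "xml_node"
```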

But right now we only care about retrieving the contents of one of the tables, so we can use rvest’s html_elements function to filter the page’s HTML down to just the elements defined as HTML tables.

my_tables <- html_elements(my_webpage, "table")

You can pass the same function any CSS selector, so it works for all sorts of pick-out-parts-of-an-HTML-document tasks, not just tables.
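For instance, Wikipedia typically marks its data tables with the wikitable CSS class, so you could narrow the selection straight away. Here’s a sketch on a tiny in-memory page (the table contents are invented; the selector syntax is standard CSS):

```r
library(rvest)

# A small in-memory page with one "wikitable" and one plain table.
page <- read_html('<html><body>
  <table class="wikitable"><tr><td>data table</td></tr></table>
  <table><tr><td>layout table</td></tr></table>
</body></html>')

# "table" matches both tables; "table.wikitable" matches only the first.
length(html_elements(page, "table"))            # 2
length(html_elements(page, "table.wikitable"))  # 1
```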

Now unfortunately the page I’m using in my example has 33 tables on it (you can check with length(my_tables)). Bear in mind that some things that don’t look like tables in the conventional visual sense may in fact be tables in the HTML sense.

We only want the results from one of those tables though. So say you want the 20th table on that page, which for reference looks like this on Wikipedia:

[Image: table showing the BBC exit poll results for the 2024 UK General Election, from Wikipedia]

You can parse an individual HTML table into a dataframe by passing it to rvest’s html_table() function, selecting it by its index:

my_dataframe <- html_table(my_tables[[20]])

Now you’ll find the contents of that table as a nice neat R dataframe.

print(my_dataframe)
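As an aside, html_table() can also take the whole parsed page and return a list of dataframes (tibbles), one per table, so the same result could be reached as html_table(my_webpage)[[20]]. Demonstrated here on a small in-memory page with invented contents:

```r
library(rvest)

# Two tiny tables in one page, to show the list-of-tables behaviour.
page <- read_html("<html><body>
  <table><tr><th>x</th></tr><tr><td>1</td></tr></table>
  <table><tr><th>y</th></tr><tr><td>2</td></tr></table>
</body></html>")

all_tables <- html_table(page)  # one dataframe per table, in page order
length(all_tables)  # 2
all_tables[[2]]     # the second table as a dataframe
```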

Well, if it was a nice neat HTML table in the first place. For a lot of tables on Wikipedia and elsewhere you’ll probably find you need to do a bit of data manipulation and cleanup to get them into a truly tidy dataframe.
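As a flavour of the kind of cleanup often needed, here’s a hedged sketch. The column names and values below are invented, but the pattern comes up constantly with scraped tables: numbers arrive as character strings, complete with thousands separators and footnote markers, and need stripping and converting before analysis.

```r
# A stand-in for a freshly scraped table; real scraped columns often
# arrive as character vectors rather than numbers.
scraped <- data.frame(
  Party = c("Example A", "Example B"),
  Votes = c("9,704,655[a]", "6,828,925"),
  stringsAsFactors = FALSE
)

# Strip everything that isn't a digit, then convert to numeric.
scraped$Votes <- as.numeric(gsub("[^0-9]", "", scraped$Votes))
scraped$Votes
#> [1] 9704655 6828925
```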
