Kaggle, a company most famous for facilitating competitions that allow organisations to solicit the help of teams of data scientists to solve their problems in return for a nice big prize, recently introduced a new section useful even for the less competitive types: "Kaggle Datasets". Here they host "high quality public datasets" you can access for free. … Continue reading Kaggle now offers free public dataset and script combos
Category: Free data
Microsoft Academic Graph: paper, journals, authors and more
The Microsoft Academic Graph is a heterogeneous graph containing scientific publication records, citation relationships between those publications, as well as authors, institutions, journals and conference "venues" and fields of study. Microsoft have been good enough to structure and release a bunch of web-crawled data around scientific papers, journals, authors, URLs, keywords, references between them, and so on for … Continue reading Microsoft Academic Graph: paper, journals, authors and more
Free dataset: all Reddit comments available for download
As terrifying a thought as it might be, Jason from Pushshift.io has extracted pretty much every Reddit comment from 2007 through to May 2015 that isn't protected, and made it available for download and analysis. This is about 1.65 billion comments, in JSON format. It's pretty big, so you can download it via a torrent, as per the … Continue reading Free dataset: all Reddit comments available for download
Free data: Constituency Explorer – UK demographics, politics, behaviour
From some combination of the Office for National Statistics, the House of Commons and Durham library comes Constituency Explorer. Billing itself as "reliable evidence for politicians and journalists - data for everyone", it allows interactive visualisation of many interesting demographic/behavioural/political attributes by UK political constituency. It's easy to view distributions and compare between a specific constituency, the region … Continue reading Free data: Constituency Explorer – UK demographics, politics, behaviour
The most toxic place on Reddit
Reddit, the "front page of the internet" - and a network I hardly ever dare enter for fear of being sucked into reading 100s of comments for hours on highly pointless yet entertaining things - has had its share of controversies over the years. The site is structurally divided up into "subreddits", which … Continue reading The most toxic place on Reddit
Free data: data.gov.uk – thousands of datasets from the UK government
Data.gov.uk is the official portal that releases what the UK government deems to be open data. The government is opening up its data for other people to re-use. This is only about non-personal, non-sensitive data – information like the list of schools, crime rates or the performance of your council. At the time of writing it … Continue reading Free data: data.gov.uk – thousands of datasets from the UK government
Free data: Yelp “challenge” dataset: 1.6mi reviews, tips, business data
"1.6M reviews and 500K tips by 366K users for 61K businesses. 481K business attributes, e.g., hours, parking availability, ambience. Social network of 366K users for a total of 2.9M social edges. Aggregated check-ins over time for each of the 61K businesses." Plus if you're a student you could win $5000 for playing with it. Go … Continue reading Free data: Yelp “challenge” dataset: 1.6mi reviews, tips, business data