As terrifying a thought as it might be, Jason from Pushshift.io has extracted pretty much every Reddit comment from 2007 through to May 2015 that isn’t protected, and made it available for download and analysis.
This is about 1.65 billion comments, in JSON format. It’s pretty big, so you can download it via a torrent, as per the announcement on archive.org.
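If you do grab a local copy, a few lines of Python are enough to start poking at it. The sketch below is only an illustration: it assumes the dump holds one JSON object per line and that each comment has a `subreddit` field. Check both against the real files before relying on it. The function name `count_by_subreddit` and the sample comments are made up for the example.

```python
import json
from collections import Counter


def count_by_subreddit(lines):
    """Count comments per subreddit from an iterable of JSON strings.

    Assumes one JSON object per line with a "subreddit" key; both are
    assumptions about the dump layout, not something confirmed here.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        comment = json.loads(line)
        counts[comment.get("subreddit", "(unknown)")] += 1
    return counts


# Made-up sample records standing in for lines of the real dump.
sample = [
    '{"subreddit": "programming", "author": "alice", "body": "First!"}',
    '{"subreddit": "pics", "author": "bob", "body": "Nice."}',
    '{"subreddit": "programming", "author": "carol", "body": "Agreed."}',
]

print(count_by_subreddit(sample).most_common())
```

Because the function takes any iterable of lines, you could pass it an open file handle instead of the sample list and it would stream through the file without loading it all into memory.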
If you don’t need a local copy, Reddit user fhoffa has loaded most of it into Google BigQuery for anyone to use.
If you have an account there, you can visualise the data directly in Tableau, which now has a native BigQuery connector. Mr Hoffa has done exactly that and shared the result with the world on Tableau Public.
Although you get a certain amount of uploading and usage from BigQuery for free, you will most likely need a paid account to integrate it directly into a Tableau (or equivalent) project like this, as you’ll want to create a BigQuery dataset to connect Tableau to.
However, if you only need to run some SQL on the freely available dataset to get some output – which you can then manually download and integrate into whatever you like – your free monthly allowance of BigQuery usage might well be enough.
Here’s the link to the data in BigQuery, or at least to one of the tables. Once you have it open, you’ll see the rest of the tables listed in the interface on the left.
You can then run some BigQuery SQL over it using the web interface – for free, up to a point, and retrieve whichever results you need.
SELECT * FROM [fh-bigquery:reddit_comments.2007] LIMIT 10
will give you 10 Reddit comments from (surprise surprise) 2007.
As you can see on the bottom right, you can save results into a BigQuery table (this requires a dataset for which you need to enable billing on your BigQuery account) or download as CSV / JSON to do whatever you want with.
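Once you have a CSV download, you can work with it using nothing more than the Python standard library. The following is a minimal sketch, assuming your query returned `author`, `subreddit` and `score` columns. Those column names, the helper name `top_scoring` and the sample rows are all assumptions for illustration, so adjust them to whatever your own query selects.

```python
import csv
import io


def top_scoring(csv_file, n=3):
    """Return the n rows with the highest score from a CSV file object.

    Assumes a header row containing a numeric "score" column; that column
    name is an assumption about your query output.
    """
    rows = list(csv.DictReader(csv_file))
    rows.sort(key=lambda row: int(row["score"]), reverse=True)
    return rows[:n]


# Made-up data standing in for a downloaded results file.
sample_csv = io.StringIO(
    "author,subreddit,score\n"
    "alice,programming,12\n"
    "bob,pics,3\n"
    "carol,programming,40\n"
)

best = top_scoring(sample_csv, n=2)
print([row["author"] for row in best])
```

For a real download, pass `open("results.csv", newline="")` in place of the in-memory sample (the file name here is hypothetical).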