Situations when multicollinearity in regression model variables isn’t important

When creating basic multiple regression models, predictor variables that correlate with each other usually present a problem: you can end up with unstable estimates for the resulting coefficients. One way to test for multicollinearity is to check for a relatively high Variance Inflation Factor, or VIF. Many packages exist that make …
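As a rough illustration of what a VIF measures (a minimal sketch on simulated data, not necessarily how the full post or any particular package does it): the VIF for a predictor can be computed by regressing it on the other predictors and taking 1 / (1 - R²). The variable names `x1`, `x2`, `x3` and the helper `vif_for` are made up for this example.

```r
# Simulated predictors: x2 is deliberately built to correlate with x1
# (illustrative data only)
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)
x3 <- rnorm(n)
predictors <- data.frame(x1, x2, x3)

# Hypothetical helper: VIF for one predictor.
# Regress the target predictor on all the others, then apply 1 / (1 - R^2).
vif_for <- function(data, target) {
  others <- setdiff(names(data), target)
  fit <- lm(reformulate(others, response = target), data = data)
  1 / (1 - summary(fit)$r.squared)
}

# One VIF per predictor, returned as a named numeric vector
vifs <- sapply(names(predictors), function(v) vif_for(predictors, v))
print(round(vifs, 2))
```

Here `x1` and `x2` should show clearly inflated values while `x3` stays close to 1, because it was generated independently of the other two.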

The great SQL leading vs trailing commas debate

It might seem a small thing, but I noticed that a recent update of the Snowflake database now allows you to have a trailing comma after the last column in a SQL SELECT list. For example, this now works: SELECT my_field, my_field_2, FROM my_table Whereas before that'd give an error. It's arguably bad form nonetheless, but …

The Data Is Plural newsletter provides a mass of free and fascinating data

I recently chanced upon "Data Is Plural" - an email newsletter, currently on issue 370. Each week it provides a list of, and some commentary on, "useful/curious datasets". There are a ton of links in each issue for anyone who wants data to play or work with to get stuck into. To give a taster of what …

Notes on the book “Becoming a Data Head”

Below are notes that I took when reading Alex J. Gutman and Jordan Goldmeier's book "Becoming a Data Head - How to Think, Speak, and Understand Data Science, Statistics, and Machine Learning". The notes simply aim to summarise the parts of the book that most attracted my attention, sometimes reworded or reorganised, and don’t necessarily …

Using ChatGPT’s Data Analysis bot to analyse your data

One less widely known feature of OpenAI's large language model chatbot, ChatGPT, is that if you become a paying subscriber then you can create your own bots that are attuned to be good at doing specific types of task. OpenAI also provides you with a few examples that they created, which include the one I'm …

The ongoing battle between human creators and AI trainers

In order for the current generation of generative AI tools - large language model chatbots, art generators et al - to work they must first undergo an extensive training process whereby they are fed a huge number of examples of the sort of content they will be later expected to produce. Per Wikipedia, the basic …

The differences between what Twitter users say they do and what they actually do

I always enjoy studies comparing what people say they do to what they actually do. Most of us are often pretty uninsightful, oftentimes systematically so, in terms of knowing what we did, or at least what we're prepared to tell someone else we did. Probably both. Of course since the time we decided everything should …

Is British gas and oil really 4x as good for the environment as imported fuel?

The British Prime Minister, Rishi Sunak, recently declared that he's going to enable a huge expansion of North Sea gas and oil extraction. There is a lot to criticise about this plan to say the least. But here I will endeavor to restrict myself to digging into one of his more surprising claims about this …

Writing conditional filter statements in dplyr

Somehow only recently did I realise that you can use if statements directly within R’s dplyr library filter function. This lets you create conditional filter criteria that can filter on different variables based on some other condition external to the function call. For instance you can change what you filter for by referencing another unrelated variable in your code. …
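A minimal sketch of the idea, assuming the built-in `iris` dataset and a made-up flag called `filter_by_species` (neither is taken from the full post): the `if` expression inside `filter()` picks which condition is applied, depending on a variable that has nothing to do with the data.

```r
library(dplyr)

# Hypothetical external flag, unrelated to the data being filtered
filter_by_species <- TRUE

result <- iris %>%
  filter(
    # Which column gets filtered depends on the flag above
    if (filter_by_species) Species == "setosa" else Sepal.Length > 7
  )

# With the flag set to TRUE, only setosa rows remain
print(nrow(result))
```

Setting `filter_by_species` to `FALSE` would instead keep the rows where `Sepal.Length` exceeds 7, without changing the pipeline itself.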