The risks of using generative AI when writing code


I’ve been nearly as fixated on artificial intelligence over the last few months as much of the rest of the online world, particularly on the large language models (LLMs) out there, ChatGPT being the most famous example.

Whilst the hype seems to be dying down a little – it’s been a while since I’ve seen many articles about extremely well paid prompt engineers, or claims that “the most important job skill of this century” is going to be talking to AI – I continue to find these tools at least somewhat useful for certain types of data analysis tasks, particularly those that are somewhat formulaic or repetitive. And given the tendency of these models to “hallucinate”, aka confidently lie to you, it’s absolutely critical for anything of importance that you restrict them to tasks where you have the ability to sense-check or fully test any response they give you. It’s not for nothing that Stack Overflow banned these tools a while back – a policy that they now seem to be relaxing a bit, but that in itself is proving controversial.

All that aside, I was somewhat excited to hear that an official integration between GitHub Copilot and RStudio – probably the analysis tool I use the most at present – is in the works. So far I haven’t tried the Copilot lifestyle directly, but this may be enough to tempt me to give it a go.

That said, perhaps I shouldn’t embrace various code editors’ future integrations with LLM AIs too much. Samsung is amongst several companies that have banned their employees from using tools such as ChatGPT.

Samsung Electronics Co. is banning employee use of popular generative AI tools like ChatGPT after discovering staff uploaded sensitive code to the platform, dealing a setback to the spread of such technology in the workplace.

The main issue would appear to be that these models generally use the text you input as training data for future iterations of the tool. They “learn” from what you type into them. The nature of these tools is such that it’s feasible that one day a future response to someone else’s chat session might include stuff that you told it, possibly verbatim. So if that was something proprietary or confidential then that might be bad.

There are also more conventional security concerns, insomuch as the company providing the AI tool presumably has access, at least in theory, to transcripts of your chats; imagine if it was a direct competitor of yours, or if it decided to sell the data on.

But even if these companies are extremely well behaved, any future security issues with their tools might open up possibilities for Bad People to get access to your dialogue. After all, it wasn’t so long ago that OpenAI was accidentally showing parts of people’s chat histories to the wrong people.

Amazon has also warned its employees to be careful about what they tell chatbots:

The attorney, a senior corporate counsel at Amazon, suggested employees follow the company’s existing conflict of interest and confidentiality policies because there have been “instances” of ChatGPT responses looking similar to internal Amazon data.

“This is important because your inputs may be used as training data for a further iteration of ChatGPT, and we wouldn’t want its output to include or resemble our confidential information (and I’ve already seen instances where its output closely matches existing material),” the lawyer wrote.

According to TechCrunch, a bunch of banks including Bank of America, Citi, Deutsche Bank, Goldman Sachs, Wells Fargo and JPMorgan have similarly restricted their employees’ use of ChatGPT.

These stories are mostly about ChatGPT, probably because it’s the most famous of these tools and, for most people, the most impressive. The technology behind other such tools, like Google Bard, is similar enough that it’d make sense to me to have the same concerns about them.

It might be that more specialised tools designed to help one code, such as Copilot, are set up differently given the use-case they’re aimed at, but I haven’t yet looked into Copilot’s inner workings in great detail.

Copilot is based on a generative large language model called Codex, created by OpenAI, the same company that runs ChatGPT, and optimised for translating people’s natural language prompts into working computer code. The fact that the model appears to be primarily trained on publicly accessible source code might reduce the risk of a certain class of problems, although I note that they do collect “user engagement data” (mandatory) and user “prompts and suggestions” (you can turn this off).

But, even at best, this focus on training with open source code certainly doesn’t eliminate all concerns. Back in 2022 Copilot was producing some potentially problematic responses, serious enough for the various parties involved to be sued.

This case wasn’t about leaking company secrets; the model was trained only on publicly accessible code. The main claim was that whilst GitHub had chosen to train the model on a ton of public code repositories, the output paid no heed to the licensing of said code, even in the instances where it reproduced the code exactly in its original form.

Perhaps the most famous example is from 2021 when Armin Ronacher found that Copilot was word-for-word suggesting code that came from the Quake 3 Arena videogame. This code is more memorable than most due to the swearing embedded in the comments.

Just because code is “free to read” does not mean that you are free to do anything you want with it. There are many, many different licenses that can be applied to open-source code. Thus Copilot occasionally suggests code snippets that could lead users to violate the terms of a license they never saw. For example, a developer of a closed-source product may have Copilot suggest to them, word for word, code that comes from an open source project with a GPLv3 license, which prohibits exactly that style of usage.

As the law firm involved writes:

GitHub Copilot, an AI-based coding product made by GitHub in cooperation with OpenAI, appears to profit from the work of open-source programmers by violating the conditions of their open-source licenses. According to GitHub, Copilot has been trained on billions of lines of publicly-available code, leaving open-source programmers with serious concerns regarding license violations. Microsoft apparently is profiting from others’ work by disregarding the conditions of the underlying open-source licenses and other legal requirements.

It sounds like GitHub feels that the fair use doctrine, alongside the fact that it’s only in a small minority of circumstances that big chunks of existing code are reproduced verbatim, means that they’re not doing anything legally problematic.

You can follow the updates to this case as well as get more details on the specific laws that the class-action is alleging were broken on a dedicated site.

Not related to the above case, but it was also noted at the time that Copilot would occasionally offer up things that should remain secret; for example, paid-for API keys or passwords. This is perhaps a grey area because, assuming the tool is working as claimed, the fact that Copilot knew about them implies that someone somewhere must have inadvertently included them in their publicly available code. The secrets were already public in a sense: fancy AI aside, if you happened to manually search for the right thing on GitHub itself you’d have found them.
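As an aside, the simplest mitigation on your own side is to scan your code for secret-looking strings before it ever becomes public (or gets pasted into some AI tool). Here’s a minimal, hypothetical sketch in Python of what such a scan might look like; the patterns and file extensions are illustrative assumptions of mine, nowhere near as thorough as the rule sets used by dedicated secret-scanning tools.

    import re
    from pathlib import Path

    # Illustrative patterns only; dedicated secret scanners use far larger,
    # vendor-specific rule sets than this.
    SUSPICIOUS_PATTERNS = {
        "AWS-style access key": re.compile(r"AKIA[0-9A-Z]{16}"),
        "generic API key assignment": re.compile(
            r"(api[_-]?key|secret|token)\s*[:=]\s*['\"][A-Za-z0-9_\-]{16,}['\"]",
            re.IGNORECASE,
        ),
        "private key header": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
    }

    def scan_repo(root="."):
        """Yield (file, line number, rule name) for each suspicious-looking line."""
        for path in Path(root).rglob("*.py"):  # extend the glob to other file types as needed
            for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), start=1):
                for name, pattern in SUSPICIOUS_PATTERNS.items():
                    if pattern.search(line):
                        yield str(path), lineno, name

    if __name__ == "__main__":
        for file, lineno, rule in scan_repo():
            print(f"{file}:{lineno}: possible {rule}")

In practice you’d probably reach for an established tool rather than rolling your own, but the principle is the same: catch the secret before it ever leaves your machine.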

But perhaps the different method of delivery is of some relevance. You would likely know if you had gone out of your way to search for something you should have paid for, or the code might contain contextual comments that lead you to understand you shouldn’t really use it. But you might have no way to distinguish between a Copilot suggestion that included such a thing and one that was absolutely intended for reuse.

Things may well have improved somewhat. Earlier this year GitHub updated the Copilot model such that it was less likely to suggest code that contained secrets. They also introduced filters that are supposed to prevent suggestions that are exact matches for existing public code it was trained on, although some folk have found that the filters don’t work that well.
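GitHub hasn’t, as far as I know, published exactly how that filter works, so purely for illustration here’s a sketch of one general way such a check could be done: hash overlapping runs of tokens from a suggestion and reject it if any of them also appear in an index built from the public corpus. The window size, the crude tokenisation and the function names are all my own inventions, not a description of the real implementation.

    import hashlib

    def window_hashes(code, window=20):
        """Hash every run of `window` consecutive tokens, ignoring whitespace layout."""
        tokens = code.split()  # crude tokenisation; a real filter would tokenise code properly
        return {
            hashlib.sha256(" ".join(tokens[i:i + window]).encode()).hexdigest()
            for i in range(max(len(tokens) - window + 1, 1))
        }

    def overlaps_public_code(suggestion, public_index, window=20):
        """True if any token window from the suggestion also appears in the public-code index."""
        return not window_hashes(suggestion, window).isdisjoint(public_index)

    # The index itself would be built offline from the training corpus, roughly:
    #   public_index = set()
    #   for source_file in corpus_files:  # corpus_files: some iterable of paths (hypothetical)
    #       public_index |= window_hashes(source_file.read_text(errors="ignore"))

One obvious weakness of anything along these lines is that trivial edits, such as renaming a variable, change the hashes entirely; perhaps that goes some way to explaining why people report the real filters misfiring.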

Whilst the legal cases work themselves out and society comes up with some kind of at least vaguely enforceable opinion, and thus policy, on the huge variety of matters affected by modern-day AI, I suppose the workers of today probably shouldn’t be entering company secrets – or really any other kind of secrets – into any external AI-thing without thinking carefully about it. Even if these tools do offer the tantalising possibility of doing their job for them (or, more charitably, making them more effective at it).

And anyone developing a product that involves code, particularly one they intend to make accessible to the public, should probably do their best to check they’re not inadvertently copying someone else’s code in a way that infringes its license. That is of course easier said than done whilst the whole field remains legally murky, with no legislative requirement for these tools to be transparent about where what they tell you comes from.
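If you did want to do some rough due diligence yourself, one low-tech option is to compare any sizeable AI-suggested snippet against local clones of the open-source projects it seems most likely to have come from. The sketch below uses Python’s difflib to flag files that share a long verbatim run with a snippet; the corpus directory and the 80% threshold are arbitrary assumptions of mine, and a character-by-character comparison like this is slow and easily fooled by renamed variables, so treat it as a starting point rather than any kind of guarantee.

    import difflib
    from pathlib import Path

    def longest_shared_run(snippet, file_text):
        """Fraction of the snippet that appears as one contiguous run inside file_text."""
        match = difflib.SequenceMatcher(None, snippet, file_text, autojunk=False).find_longest_match(
            0, len(snippet), 0, len(file_text)
        )
        return match.size / max(len(snippet), 1)

    def flag_possible_copies(snippet, corpus_dir, threshold=0.8):
        """Print any file under corpus_dir sharing a long verbatim run with the snippet."""
        for path in Path(corpus_dir).rglob("*"):
            if not path.is_file():
                continue
            try:
                text = path.read_text(errors="ignore")
            except OSError:
                continue
            share = longest_shared_run(snippet, text)
            if share >= threshold:
                print(f"{share:.0%} of the snippet appears verbatim in {path}; check its license.")

    # e.g. flag_possible_copies(suggested_code, "oss-clones")  # both names hypothetical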
