┌╦   ╦┐     ┌╦═══╦┐     ┌╦═══╦┐     ┌╦═╦═╦┐     ┌╦═══╦┐
│║ ║ ║│     ├╬═══╬┤     └╩═══╦┐        ║        ├╬══
└╩═╩═╩┘ └╩┘ └╩   ╩┘ └╩┘ └╩═══╩┘ └╩┘    ╩    └╩┘ └╩═══╩┘ └╩┘

GitHub Copilot is piracy but fascinating

GitHub Copilot is a machine learning model sold as a service by GitHub/Microsoft. It works as a super-powerful programmer's autocomplete. It lets the programmer type a comment like "function to calculate prime numbers" and Copilot will offer a complete implementation or at least a skeleton of an implementation. It can also just see that the programmer is starting to type a prime number calculator, and offer to complete the implementation, without any specific prompt or instruction.

This is fascinating technology. You never know when it will work but when it does, it frequently does very well. It works because the machine learning model behind it was trained with gargantuan amounts of source code available on GitHub.

Copilot frequently spits out code that can be traced back to its origins, and people have made it generate dozens of lines of copyrighted code verbatim. Whether the code is verbatim, or the ML model has "learned" something, it was trained on code whose authors never authorized this sort of use, and it can lead to unintentional copyright violations. Because the model itself contains those code snippets, it can be argued (very reasonably) that Copilot itself is a copyright violator.

This has divided the software development community. Some are horrified by the copyright aspect of it. Others seem to perceive the copyright angle as a speed bump in the way of progress, as Copilot is undeniably useful, and clearly points at what the future looks like for software development.

Some will even argue that Copilot learns from copyrighted code the way a human does: just as humans are not copyright violators for reproducing what they learned from a copyrighted book, the ML model shouldn't be bound by the copyrighted works it learned from.

This claim is very easily debunked by the streams of verbatim code Copilot generates. However, even where it seems to be doing something considerably smarter, and not just offering literal code from other projects, the claim that it is incorporating and copying copyrighted code does not go away. Copilot is, after all, just software and allowing it to be treated as a human is moving things way too quickly.

The way it was released is its own litmus test: Microsoft is one of the world's great software firms. They own the source code to Windows, Office, Azure, Xbox, Flight Simulator, Bing, SQL Server, etc. Certainly many, many million lines of code. If they had released Copilot based on an ML model trained on their own code, the copyright angle would disappear. But instead, they opted to train it only on public source code available on GitHub. They were protective of their own intellectual property while treating others' as a free resource. This speaks volumes in favor of the view that Copilot is a copyright violator.

Claims that the ML model is somehow learning are, I think, laughable. We are in the "stochastic parrots" era of AI, impressed by glistening objects and awed by smoke and mirrors. Rather than being so quick to extend an ML model human-like rights, we should think about whether extending those rights to a model means that we should hold animals like cows to the same standard.

Legality aside, the fact is that Copilot is very useful, and it points at even more impactful work in the future. There is a very strong argument that outlawing such a model on the basis of copyright will harm the development of an important productivity multiplier. This may be societally more important than the preservation of copyrights.

There is the possibility that, if banned outright, models like Copilot could only exist inside companies with a sufficiently large code base and development team to make this kind of research viable. So Microsoft could have its internal model, trained on its own code, for its own employees. Same at Amazon, and Oracle, and SAP. Smaller development companies would likely not be able to invest in a model at all, or wouldn't have the large codebase to train the model on. This would leave them at a relative disadvantage.

A better option may be to have software licenses specify whether the code can be used in ML models. Much like developers now choose between the GNU GPL or a proprietary license, they could allow their code to train ML models - or not.

It wouldn't be hard to imagine a world where there would be a GPL model trained on GPL code, and usable by GPL projects. It's fragmentation, but achieves most of the benefit of large models without the copyright violation.

While I think this world of explicit consent would be desirable, I think it is unlikely. I believe such models will be normalized before courts and legislators have been able to process them for what they are. This is already happening in the AI-generated art space, where stochastic-parrot approximations of copyrighted works are becoming very popular. If anything, the direct usefulness of source code provides a more clear-cut case for the utility of such models, which may lead courts to treat the entire issue as a fait accompli by the time they finally get around to investigating the space.
