Thoughts on GitHub Copilot and Sudowrite
Over the past few days, a lot has been written about the launch of GitHub Copilot, “Your AI pair programmer”, including much about the legality, ethics, and even utility of such a tool. The reason I decided to write about this topic on this mailing list is that very similar discussions are happening in the fiction writing community, thanks to GPT-3. Copilot and Sudowrite (both in limited beta) are making the technology productive now, adding urgency to the conversation. Computer assistance of this sort is coming for all kinds of creative work, and it’s worth thinking about how we each feel about it.
What are they?
GitHub Copilot is turbocharged autocomplete for Visual Studio Code. It can do things like suggesting an entire function when you write out a comment about what the function is for.
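To make that concrete, here is a hand-written illustration of the comment-to-function workflow. It is not actual Copilot output: the comment, the function name count_words, and the body are all assumptions chosen for this example. The idea is that the programmer types the comment and the signature, and the tool proposes a body along these lines.

```python
import re
from collections import Counter


# Count how many times each word appears in a piece of text,
# ignoring case and punctuation.
def count_words(text):
    # Hypothetical suggested body: lowercase the text, pull out runs of
    # letters and apostrophes, and tally them.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


if __name__ == "__main__":
    print(count_words("The cat and the hat."))
```

Even in a toy case like this, the suggestion embeds choices the programmer still has to review, such as whether apostrophes should count as part of a word.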
Sudowrite provides a collection of tools for generating side characters in your story, new plot twists, or new possible next steps. Earlier GPT-3-based tools would take some input text and generate the next couple of paragraphs. The output was remarkably reasonable, but often not useful. Sudowrite seeks to optimize the interaction for these specific use cases in order to make GPT-3 a useful augmentation of human creativity.
Training data’s murky legality and ethics
Let me preface this section with: I am not a lawyer. Even so, actual lawyers seem to be doing a lot of guessing when it comes to the legality of this generated output, because there is a lot that’s untested in court. I have done a good deal of reading about licensing issues in open source over many years, so I do have some basis for what I’ll write here, but it’s all just my opinion and definitely not legal advice.
GitHub’s model is trained on all of the public source code on GitHub, which is a lot of code. Some authors of code licensed under the GNU General Public License (GPL) or the AGPL are not happy about the inclusion of their code in the model. The GPL requires derivative works to also be licensed under the GPL, but GitHub imposes no restrictions on how people use the generated code.
GPT-3, the text model used by Sudowrite, is trained on a huge amount of text from around the internet. Unlike open source code, that text is unlikely to carry any license permitting further use.
What is generated is disconnected from the training data. This is where it gets truly interesting. Both Copilot and Sudowrite claim that less than 1% of what is generated is going to be verbatim from training data. Instead, you’re getting code and text informed by the training data and your own, specific inputs. Not only is the amount of output relatively small, but it has also gone through customization for your case. The fair use doctrine was already somewhat complicated, but this adds another layer.
You’re probably going to customize further. Certainly in the case of Sudowrite, and also probably fairly often in the case of Copilot, you’re not going to just use the output given to you by the tool. You’ll take what’s there and customize it, rendering the already-likely-original output even more unique.
Opt-in would be better, but isn’t going to happen. Ideally, people producing creative work would be able to choose whether or not to have their work be a part of the training data for these systems, rather than the work just being slurped up and used automatically. It seems to me that tech companies with enough financial backing have been willing to take the risk to create the tool they want to create, even if the legal footing is murky. Google Book Search is a good example of this: Google scanned large numbers of books to make the information searchable, and spent years in court as a result. In the end, they prevailed.
But is it right? Before I give my answer to that question, I want to first talk about whether or not these tools are even useful.
AI Weirdness is Janelle Shane’s wonderful site highlighting strange creations that AI thinks are perfectly reasonable things to generate. As illustrated there, AI has a long way to go before it can replace humans in creative work, if it ever actually will. But these models don’t need to replace people to be useful.
People have spotted subtle bugs in some of the code generated by Copilot. Knowing that you’ll have to check the output carefully for bugs may diminish its value somewhat, but most of the commentary I’ve seen from people using Copilot is that it is producing useful snippets of code and saving them time. It’s only going to get better with time.
English is squishier than code, so minor issues may not matter and serious issues would likely be obvious to the writer. Generating English that people would recognize as good writing and that actually sticks to a topic is a very hard problem, but tools like Sudowrite constrain the problem enough to provide useful output. Sudowrite can help writers overcome writer’s block and more quickly enrich their worlds with additional details or their plots with more variation.
I diverted from the ethics of AI generation to talk about utility because I think these tools will genuinely assist people in their creative endeavors in brand new ways. Word processors with spellcheckers or even grammar checkers have helped us with polish, but computers are gaining the ability to help with the act of creation itself, and that is new and important. But I’m not making a purely utilitarian argument here. The fact that these tools have value is not, by itself, what should make us think they’re ethically okay.
Are these tools morally okay?
This is a new area and I’m willing to change my opinion as new information comes in, but I do have a way of thinking about this that leads me to believe we (humankind collectively) should continue down this path.
When we create, our creations are influenced by our own inputs. We read books and watch movies, we read non-fiction and tweets. All of this gets squashed into the soup in our heads and comes out radically transformed when we make our own creations. It is much the same for the computers. They have the ability to consume a lot more input than we do and process that vast amount of information in ways very different from the ways we do. This combination allows ML models to provide completely new directions for us to take our thinking in.
Is the computer learning from its input really wrong in a way that us learning from our input isn’t?
I do think that there are tons of edge cases in this line of thinking that we’ll need to work through. For example, if a model is trained only on Stephen King’s work and essentially generates variations of King’s prose, that crosses a line for me. Shakespeare is fair game, because his work is in the public domain.
There may also come a time when the machines are generating significant amounts of the text, just guided by humans. At that point, we may want norms around giving credit to the tool. No one needs to say “Written in Microsoft Word” today, but Microsoft Word is not actually doing the writing!
I’m excited for this technology, because I think it will help us create more and create better. GPT-3 was a huge advance over GPT-2, and I think the 2023 tools are going to put today’s tools to shame.
I’d love to hear your thoughts! Feel free to comment, email me, or @ me on Twitter.
p.s. None of this article was generated with AI, nor was my brand-new serialized urban fantasy The Dragon of DC, which is available now in Kindle Vella. Amazon is giving away 200 tokens right now (essentially a 20,000-word novella’s worth) to get people interested.