on Toshareproject.it - curated by Bruce Sterling
*I really didn’t expect the venture-capital consultant to refer to Nadar.*
Similar problems come up in art, along with some interesting cultural differences. If I ask Midjourney for an image in the style of a particular artist, some people consider this obviously outright theft, but if you chat to the specialists at Christie’s or Sotheby’s, or wander the galleries of lower Manhattan or Mayfair, most people there will not only disagree but be perplexed by the premise – if you make an image ‘in the style of’ Cindy Sherman, you haven’t stolen from her and no-one who values Cindy Sherman will consider your work a substitute (except in the Richard Prince sense). I know which I agree with, but that isn’t what matters. How did we reach a consensus about sampling in hip hop? Indeed, do we agree about Richard Prince? We’ll work it out.
Let’s take another problem. I think most people understand that if I post a link to a news story on my Facebook feed and tell my friends to read it, it’s absurd for the newspaper to demand payment for this. A newspaper, indeed, doesn’t pay a restaurant a percentage when it writes a review. If I can ask ChatGPT to read ten newspaper websites and give me a summary of today’s headlines, or explain a big story to me, then suddenly the newspapers’ complaint becomes a lot more reasonable – now the tech company really is ‘using the news’. Unsurprisingly, as soon as ChatGPT announced that it had its own web crawler, news sites started blocking it.
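The blocking itself is mundane: OpenAI’s crawler identifies itself with a user-agent token (documented by OpenAI as GPTBot), and a site that wants to opt out adds a rule to its robots.txt – a minimal sketch:

```
# robots.txt at the site root – ask OpenAI's crawler to stay out
User-agent: GPTBot
Disallow: /
```

Note this is an honour-system convention, not an enforcement mechanism – which is partly why the newspapers’ complaint is a policy question rather than a purely technical one.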
But just as for my ‘make me something like the top ten hits’ example, ChatGPT would not be reproducing the content itself, and indeed I could ask an intern to read the papers for me and give a summary (I often describe AI as giving you infinite interns). That might be breaking the self-declared terms of service, but summaries (as opposed to extracts) are not generally considered to be covered by copyright – indeed, no-one has ever suggested this newsletter is breaking the copyright of the sites I link to.
Does that mean we’ll decide this isn’t a problem? The answer probably has very little to do with what today’s law happens to say in one country or another. Rather, one way to think about this might be that AI makes practical at a massive scale things that were previously only possible on a small scale. This might be the difference between the police carrying wanted pictures in their pockets and the police putting face recognition cameras on every street corner – a difference in scale can be a difference in principle. What outcomes do we want? What do we want the law to be? What can it be?
But the real intellectual puzzle, I think, is not that you can point ChatGPT at today’s headlines, but that on one hand all headlines are somewhere in the training data, and on the other, they’re not in the model.
OpenAI is no longer open about exactly what it uses, but even if it isn’t training on pirated books, it certainly uses some of the ‘Common Crawl’, which is a sampling of a double-digit percentage of the entire web. So, your website might be in there. But the training data is not the model. LLMs are not databases. They deduce or infer patterns in language by seeing vast quantities of text created by people – we write things that contain logic and structure, and LLMs look at that and infer patterns from it, but they don’t keep it. So ChatGPT might have looked at a thousand stories from the New York Times, but it hasn’t kept them.
Moreover, those thousand stories themselves are just a fraction of a fraction of a percent of all the training data. The purpose is not for the LLM to know the content of any given story or any given novel – the purpose is for it to see the patterns in the output of collective human intelligence.