Some lawsuits over generative AI

Here are some links and notes about copyright lawsuits being filed against OpenAI, Meta, Google, and Microsoft over their generative AI systems’ probable use of various kinds of training data.

Links first:

Now a few thoughts from me:

The various things that the companies are being accused of are mostly getting labeled as copyright violations, but I feel like there may be some pretty different issues involved, so I think it’s worth separating out the strands. Some of the things they’re being accused of:

  • Using collections of pirated ebooks as training data. Those books aren’t offered for free by the authors or publishers for any use.
  • Using free-to-read ebooks as training data. In particular, Smashwords offers some free or temporarily free ebooks (usually the first book in a series, offered with the author’s permission to entice readers to read the rest of the series). Smashwords says that their interpretation of their terms of service doesn’t allow this use of their free ebooks.
  • Using publicly available code as training data, “[much] of which [is] published with licenses that require anyone reusing the code to credit its creators.”
  • Using the open web in general as training data, including publicly available personal information about people (though the personal-data lawsuit doesn’t appear to be talking about copyright as such).
  • Using publicly available summaries and descriptions and quotes from specific works as training data. For example, Goodreads includes a lot of that kind of material. (But to be clear: the appearance of that material on Goodreads itself isn’t a copyright violation.) Another example: sites like SparkNotes provide “study guides” that can include detailed summaries of books.

So it’ll be interesting to see how and whether courts look differently at those different issues.

Side note: As far as I know, nobody has yet accused the LLMs of producing text that’s identical to copyrighted text in their training data (except in the code case); if an LLM did that, that might be a clearer-cut copyright issue than some of the above. (But I don’t mean to say that’s the only possible way a copyright claim could succeed; for example, close paraphrases can also be copyright violations.)

Another side note: I find it particularly interesting that all five of the authors involved appear to be basing their claims (at least in part) on the LLMs’ ability to generate detailed summaries of the books in question. To me, those summaries look more like they’re derived from human-written summaries and reviews than like they’re derived from the text of the books themselves.
