Good Models Borrow, Great Models Steal: Intellectual Property Rights and Generative AI

This article addresses three critical policy questions that will determine the impact of generative AI on the knowledge economy and the creative sector. The first concerns how we think about the training of such models — in particular, whether the creators or owners of the data that are “scraped” (lawfully or unlawfully, with or without permission) should be compensated for that use. The second question revolves around the ownership of the output generated by AI, which is continually improving in quality and scale. These questions are inherently linked to the realm of intellectual property, a legal framework designed to incentivize and reward human creativity and innovation. For instance, the United Kingdom has historically maintained a distinct category with limited rights for new “computer-generated” works, while Singapore recently introduced an exemption allowing for computational data analysis of existing works. The third section of this article explores the broader implications of these policy choices, weighing the advantages of reducing the cost of content creation and the value of expertise against the potential risks to various careers and sectors of the economy, which may be rendered unsustainable. Some lessons might be found in the music industry, which also went through a period of unrestrained piracy in the early digital era, epitomized by the rise and fall of the file-sharing service Napster. Similar litigation and legislation may help navigate the present uncertainty, along with an emerging market for “legitimate” models that respect the copyright of humans and are clear about the provenance of their own creations.


Introduction
When people think of the risks associated with artificial intelligence (AI), Hollywood looms large. Movies have long conjured the worst-case scenarios: from HAL refusing to open the pod bay doors in 2001, to a murderous Arnold Schwarzenegger travelling back through time.
If there is a robot apocalypse, however, it is unlikely to resemble a Terminator movie. A more probable scenario is what was recently seen off-screen in, ironically enough, the Writers Guild of America (WGA) strike of 2023.
Hollywood's scriptwriters were protesting, in part, about the threat of many jobs being replaced by new generative AI tools that can perform similar functions at little or no cost.
The concern is not that humanity will wake up to discover that it has been replaced by AI; rather, it is that AI will progressively reduce the economic viability of certain careers by salami-slicing full-time jobs into tasks that can be commoditised and outsourced. This can be thought of as the dark side of the gig economy (Prassl 2018). Where Uber, Grab, and the like offered flexible arrangements that were attractive for young workers who later discovered that no foundation had been laid for a career, ChatGPT threatens to take existing careers and break them into gig work for hire. Such precarity is not limited to scriptwriters. After an initial panic by academics worldwide that this new technology might enable students to cheat on their papers, it became clear that generative AI had larger implications for the knowledge economy, comparable perhaps to the impact of the industrial revolution on manufacturing. "Knowledge workers" was the term introduced in 1959 by management consultant Peter Drucker for non-routine problem solvers (Drucker 1959). People who "think for a living" earn through their ability to analyse and write, something that ChatGPT can replicate in almost no time and at almost no cost.
Journalists, already taking a beating as readers turn from traditional to social media, now face the prospect of technology taking over the writing task as well. Yet that same threat confronts anyone who analyses or writes for a living, such as lawyers and even (gasp) academics. Applications are not limited to prose, as ChatGPT has demonstrated proficiency in coding as well as poetry (Dwivedi, et al. 2023). Similar developments have shaken the art world, with generative AI images flooding social media and, increasingly, traditional media. Video and multimodal content is close behind.
This article will consider three policy questions facing governments around the world in relation to how generative AI will impact the knowledge economy and the creative sector. The first concerns how we think about the training of such models, in particular whether the creators or owners of the data that are "scraped" (lawfully or unlawfully, with or without permission) should be compensated for that use. The second question is who (if anyone) should own the output of generative AI, which is being produced at ever greater quality on ever greater scale. Both issues are linked to intellectual property, a body of laws that was adopted to incentivize and reward human creativity and innovation. Section three of the article considers the larger implications of the answers to those questions, weighing the benefits of lowering the cost of creation and the value of expertise against the possibility that diverse careers and sectors of the economy may be rendered unsustainable (Mims 2023).

I Think, Therefore I'm Paid
AI has always depended on access to data (Roberts, et al. 2021). Large language models (LLMs) in particular are trained on huge datasets, comprising publicly available material as well as copyrighted and pirated material available online (O'Leary 2013; Zikopoulos, et al. 2012). That scale transformed public debate about the impact of AI with the release of ChatGPT by OpenAI in November 2022, quickly followed by competitors such as Google's Bard, Anthropic's Claude, and Meta's Llama. Excitement and trepidation about the uses for systems able to respond to natural language queries with human-like responses, in text as well as images, suggested that the long-heralded economic promise of AI might be at hand. Goldman Sachs breathlessly reported that generative AI could increase global GDP by seven percent (2023).
How (if at all) should the rights of creators, whose text and images train such models, be recognized and compensated? The use of pirated or illegally obtained material appears at first blush to be a simple case of theft of intellectual property, but has been notoriously difficult to prove. Around the world, concepts like fair use are being stretched by the wholesale consumption of books, photographs, and other materials. In some jurisdictions, new rights to data-mine have sought to balance the interests of developers against those of creators.
If a work is not in the public domain, even temporary unauthorized use can be an infringement. This is the subject of ongoing litigation brought by Getty Images against Stability AI, for example, alleging that the Stable Diffusion model was trained on millions of copyrighted images and metadata. Getty claims that this deprived it of the revenue from licensing those images. Evidence of the alleged infringement includes content generated by Stable Diffusion with distortions of the watermark Getty uses to protect its product. Given the secretive nature of much model training, proving infringement is rarely as easy as this. Even in the Getty case, it appears possible that infringement may need to be demonstrated on a case-by-case basis, establishing substantial similarity for each image one by one, rather than the systemic infringement alleged by Getty (Tan 2024a).
Even if infringement can be established, fair use is a defence that balances the rights of creators and the interests of the wider public in distributing and using their works. It generally considers the purpose of the use, the nature of the work, the amount used, and the effect on the market for the original work. When an individual records a televised broadcast to watch at a later time, for example, that can be considered fair use. Projecting such a recording for an audience and charging for tickets, by contrast, would not be (Beebe 2008). The limits of fair use were tested when a 1981 photograph of Prince by Lynn Goldsmith served as the basis for a silkscreen portrait that Andy Warhol created in 1984 to illustrate an article in Vanity Fair; an orange version later appeared on the cover of a special edition magazine published by Condé Nast in 2016. Litigation followed, with the Supreme Court ultimately concluding that, while Warhol's art might be fair use if hung in a museum, using the image for a magazine cover was precisely the kind of purpose for which Goldsmith licensed her own photos. She was therefore entitled to compensation (Andy Warhol Foundation for the Visual Arts, Inc. v. Goldsmith 2023).
Returning to generative AI, a key question is whether using data to train models, which are then used to produce works that directly compete with the authors of those data, constitutes fair use. This appears to be distinct from other forms of data mining. When Google began scanning vast quantities of books in 2002, there were challenges that this infringed copyright. Google was, for the most part, successful in arguing that it made the information available but was not itself providing a substantial substitute or competing with the market for the original works (Authors Guild v. Google 2015; Maguire 2020).
The ability of generative AI to produce text and images that may, in fact, compete directly with past and present works produced by the authors and artists whose works trained those models is central to several of the lawsuits currently underway, including one brought against OpenAI, the creator of ChatGPT, by prominent authors such as John Grisham, Jonathan Franzen, and Elin Hilderbrand (Alter and Harris 2023; Reisner 2023b).
Mark Lemley, among others, has argued that model training should be regarded as fair use on the basis that machine learning is a transformative use of the underlying data. He and coauthor Bryan Casey also argue that this will encourage the creation of new databases with greater transparency, as well as recognizing that licensing materials for such large training sets is impractical given their scale (Lemley and Casey 2021). Lemley, who is part of Stability AI's defence team, has gone on to argue that the infringement question may be inapplicable to generative AI "for the simple reason that generative AI is not about copying existing works but about creating new ones" (Guadamuz 2023; Lemley 2023).
In the absence of statutory reform, lawsuits are likely to proliferate.1 Singapore is an example of a jurisdiction that has tried to thread this needle through legislation. Amendments to its copyright law in 2021 include a permitted use to make a copy of a work for the purpose of "computational data analysis", which includes extracting and analysing information and using it to "improve the functioning of a computer program in relation to that type of information or data" (Copyright Act 2021, ss. 243-244).
The provision still requires lawful access to the underlying data, but appears more open to data mining and model training than traditional conceptions of fair use (Lim 2023) or the "non-commercial" text and data analysis exception adopted in the United Kingdom in 2014 (Copyright, Designs and Patents Act 1988, s. 29A). An information sheet produced by the Intellectual Property Office of Singapore (IPOS) explicitly states that the provision is intended to allow "training machine learning" (IPOS Factsheet 2022). Yet analysing text or images for the purpose of making recommendations or optimising workflows is quite distinct from using those text and images to generate more text and images. The difference is not just the usage, where copying is central to the process, but also the economic impact of that usage (Tan 2023; Torrance and Tomlinson forthcoming). This is no longer a hypothetical problem. In addition to the possibility of diluting human authors' works, it is possible that they will simply be swamped by the volume of generative AI content produced. An early example was the science fiction magazine Clarkesworld, which had to shut down unsolicited submissions because it was being flooded with AI content (Silberling 2023). Amazon, which is now one of the world's largest publishers of books, was becoming so overwhelmed by submissions that it now imposes a limit that its self-published authors may "only" publish three books per day (Creamer 2023).
Returning to the question of lawful access, much of the data used by LLMs for training is pirated in the first place. More than 70,000 pirated books were found when Peter Schoppert analysed the "Books3" dataset (Reisner 2023a; Schoppert 2023). No one is seriously suggesting that generative AI should not be trained. But it is reasonable to expect that models are not trained on stolen data, and that those who profit from this technology pay something to the creators whose works serve as its fuel (Tan 2024b).
1 #update https://aicopyright.substack.com/p/i-will-get-an-order-out-when-i-get https://originality.ai/blog/openai-chatgpt-lawsuit-list

Author, Author!
A second set of questions concerns who should own the outputs of generative AI. In an unscientific experiment, the author decided to ask ChatGPT itself and got two very different answers:2 "I do not have the ability to own intellectual property or any other legal rights," ChatGPT replied at first. "Any text or other content that I generate is the property of OpenAI, as the creator and owner of the tool that I am." The author pointed out that OpenAI itself now explicitly states that it will not claim copyright over any content generated by ChatGPT (Ellison 2022; Guadamuz 2022; Schade 2023).3 This led to a revised answer: "The text generated is not the intellectual property of the model itself. Instead, the intellectual property rights belong to the person or entity who has commissioned the model to generate the text." Clear and concise, but also wrong.
In most jurisdictions, automatically generated text does not receive copyright protection at all. The U.S. Copyright Office has stated that legislative protection of "original works of authorship" is limited to works "created by a human being" (17 USC § 102(a)). It will not register works "produced by a machine or mere mechanical process that operates randomly or automatically without any creative input or intervention from a human author" (Compendium of U.S. Copyright Office Practices, 3rd edition 2019) (emphasis added). The word "any" is key and raises the question of what level of human involvement is required to assert authorship (Gervais 2020; Phelan and Carey 2023). Early photographs, for example, were not protected because the mere capturing of light through the lens of a camera obscura was not regarded as true "authorship" (de Cock Buning 2018, p. 524). It took an iconic picture of Oscar Wilde going all the way to the US Supreme Court before copyright was recognised in mechanically-produced creations (Burrow-Giles Lithographic Co v. Sarony 1884). Arguments continued in other jurisdictions, however, with Germany withholding full copyright of photographs until 1965 (Nordemann 1999).
The issue today is distinct: not whether a photographer can own images passively captured by a machine, but who might own new works actively created by one. Computer programs like word processors do not own the text typed on them, any more than a pen owns the words that it writes. But AI systems now generate news reports, compose songs, paint pictures. These activities generate value: can and should they be protected by the law? At present, the answer in most places is no. Unless there is an identifiable human author, copyright will not apply. The policy behind this is often said to be incentivizing and rewarding innovation. This has long been dismissed as unnecessary or inappropriate for computers. "All it takes," Pamela Samuelson wrote in 1986, "is electricity (or some other motive force) to get the machines into production" (Samuelson 1986, p. 1199).
Indeed, protecting such works might disincentivise innovation (by humans, at least). AI has already unleashed an economic tornado in the art world, massively lowering the cost of producing original images (Menéndez 2023). If we wish to have a thriving arts sector that gainfully employs humans, it is arguable that their creations should be protected while machine creations should not be. Automatically generated content may not be eligible for copyright protection, but edited and curated content that draws on such material could still be owned by the person doing the editing and curating.
The author fed that into ChatGPT, which agreed that this was correct, sensibly adding that legal advice should be sought if there were any further questions.
An alternative approach, adopted in Britain, is to have more limited protections for "computer-generated" work, the "author" of which is deemed to be the person who undertook "the arrangements necessary for the creation of the work". "Computer-generated" is defined as meaning that the work was "generated by computer in circumstances such that there is no human author of the work" (Copyright, Designs and Patents Act 1988). Similar legislation has been adopted in New Zealand (Copyright Act 1994), India (Copyright Amendment Act 1994), Hong Kong (Copyright Ordinance 1997), and Ireland (Copyright and Related Rights Act 2000). Though disputes about who undertook the "arrangements necessary" may arise, ownership by a recognized legal person or by no one at all remain the only possible outcomes (Brown, et al. 2019, pp. 100-01; Nova Productions v. Mazooma Games 2007). The duration is generally for a shorter period, and the deemed "author" is unable to assert moral rights, including the right to be identified as the author of the work (Copyright, Designs and Patents Act 1988).
A World Intellectual Property Organization (WIPO) issues paper recognized the dilemma, noting that excluding these works would favour "the dignity of human creativity over machine creativity" at the expense of making the largest number of creative works available to consumers. A middle path, it observed, was to offer "a reduced term of protection and other limitations" (Revised Issues Paper on Intellectual Property Policy and Artificial Intelligence 2020). Several commentators have suggested similar approaches (Abbott 2020, pp. 71-91; du Sautoy 2019, p. 102).
As human authorship becomes more ambiguous, that middle ground may help preserve and reward flesh and blood authorship, while also encouraging experiments in collaboration with our silicon and metal partners.
Europe is actively considering such a measure (Séjourné 2020). The Singapore Academy of Law's Law Reform Committee proposed something similar in 2020 (Rethinking Database Rights and Data Ownership in an AI World 2020), but only traditional human authorship remains recognised under the new Copyright Act adopted the following year. AI-assisted works may still warrant protection where there is a causal connection to a human exercising input or control, though determining the threshold for that connection is left to the courts (Tan and Tan 2022).
An indication of the difficulty can be seen in the case of Jason M. Allen, who was denied protection for the work "Théâtre D'opéra Spatial", which won first prize at the Colorado State Fair in 2022, before he revealed that it was created using Midjourney.

Brave New World?
Generative AI has the potential to transform the arts as well as the knowledge economy. The precise impact is presently unknowable, with a recent study suggesting a "jagged frontier" of innovation across different fields based on a survey of complex, realistic, and knowledge-intensive tasks (Dell'Acqua, et al. 2023). Protecting IP rights too strictly could hinder the development of new tools and works enhanced by AI; failing to protect those rights could render millions of jobs unsustainable and undermine the viability of the arts sector in particular.
In the near term, the most important regulatory steps are two forms of transparency in how such models are developed and deployed. Development should at least disclose the origins of the data used to train them, with appropriate compensation paid; deployment should make clear the relative contribution of AI to new "works", with a new category of computer-generated work offering a reasonable middle ground between purely human- and purely AI-generated content.
With regard to development and economic sustainability, the music industry offers interesting parallels (Huber 2023). It also went through a period of unrestrained piracy in the early digital era, which radically transformed the economics of copying and gave rise to file-sharing services such as Napster (Tan 2017). Lawsuits and legislative changes led to most media platforms adopting copyright policies and takedown protocols (Digital Millennium Copyright Act (DMCA) 1998; Seng 2014), while those like Napster were shut down completely (Menn 2003). Producers and distributors developed technical means to limit copying, but a certain amount of piracy is often priced in as the cost of doing business (Aguiar, et al. 2018; Herings, et al. 2018).
It is possible that a similar evolution will take place in AI, at least with regard to LLMs. For all the concerns that IP protection will constrain the development of new models, the market for "legitimate" models appears to be growing. Adobe, for example, has built its Firefly tools using training sets consisting only of public domain and licensed works. Shutterstock has also committed to building AI tools with a Contributor Fund to compensate artists (Hayes 2023). Other models might also be used, such as the manner in which YouTube allows certain usages of music and other copyrighted material by sharing advertising revenue with owners of the original work through its Content ID system (Edwards 2018).
On the deployment of AI models, this connects to the larger question of whether consumers should know whether a given work is the product of a machine or a human. That might seem like a simple question, but AI-assisted decision-making increasingly blurs that line. For many years, certain customer relations chatbots have started on automatic for basic queries, moving through suggested responses that are vetted by a human, escalating up to direct contact with a person for unusual or more complex interactions (Kucherbaev, et al. 2018).
For the raw text and images produced by AI, at least, it should be possible to disclose their provenance. To guard against misrepresentation, various efforts are underway to detect AI-generated text through anti-plagiarism software, though these have had mixed success, at best (Barrett, et al. 2023; Morris 2023). A more difficult but effective approach would be to "watermark" text and images in a manner that is invisible to users but detectable using a key (Li, et al. 2023; Sun, et al. 2023). Given the likely spread of the underlying software, this would be practical only if it is required by law. Even then, however, the spread of deepfake porn points to the difficulty of policing any such rules.
Much of the energy in this context comes from governments around the world concerned about generative AI being used to produce ever more realistic content at ever greater scale. "Fake news" existed long before Donald Trump: it appeared in the New York Times at least by 1894 ("The "A.P." News" 1894) and in a headline by 1901 (Bartlett 1901). But AI-generated videos of Ukrainian President Volodymyr Zelenskyy "surrendering" in 2022 made clear how it might be operationalised as a weapon of war (Geng 2023, pp. 159-60). Yuval Noah Harari has gone further to argue that such usages of generative AI threaten democracy itself (Harari 2023). Hyperbole aside, greater understanding of what content is produced by AI and how it is generated would aid regulators around the world in making better decisions, rather than relying on the market and the good graces of technology companies.

Conclusion
T.S. Eliot once observed that "good authors borrow, great authors steal". Occasionally, this is taken literally, a case in point being the German writer Helene Hegemann, whose 2010 best-selling novel Axolotl Roadkill lifted entire pages from another novel. When confronted with the apparent theft, the seventeen-year-old responded that "There's no such thing as originality, just authenticity" (Ellis 2010).
Eliot was not, of course, condoning plagiarism. His larger point was to challenge naïve idealization of the creative process: in arts, as much as in science, each new thinker and writer builds on the work of those who have come before. Painters inspire and echo one another; writers offer variations on plots and structures that can be mapped and catalogued (Booker 2006; Koestler 1964). This is perhaps clearest in music, where the limits of the heptatonic scale and chord progressions mean that melodies will inevitably echo one another, as Ed Sheeran successfully argued in a case concerning similarities between his hit song "Thinking Out Loud" and Marvin Gaye's "Let's Get It On" (Seabrook 2023).
With regard to the first question considered in this article, it may seem pointless to argue that AI models should pay for the use of data when the entire Internet has already been absorbed (Guadamuz 2023). In addition to the market for "legitimate" models, however, there is evidence that further refinement of those models and the training of new ones depends not just on the volume of data but its quality. In particular, early suggestions that LLMs might continue improving based on synthetic data that they themselves create have foundered on projections that such AI-generated data will "poison" future models (Martínez, et al. 2023; Rao 2023). Presuming that there is an ongoing market for data and the political will to regulate it, the idea that generative AI will have its own "Napster moment" is at least plausible.
As regards AI-produced content, existing laws appear capable of holding the line on protection of the rights of human creators.Much of the regulatory attention is focused on the threats posed by the quality and scale of synthetic content and its ability to overwhelm the market by sheer volume or exert influence on populations through deception.
Jurisdictions like Singapore that adopted laws intended to address misinformation and disinformation online were criticized as draconian (Jayakumar, et al. 2021), but similar tools are increasingly being considered by western liberal democracies also (Bollinger and Stone 2022; Giusti and Piras 2021).
Underpinning all of this is the question of how societies choose to regulate this sector. In theory, governments regulate activities to address market failures, or in support of social or other policies. In practice, relationships with industry and political interests may cause politicians to act, or refrain from acting, in less principled ways (Baldwin, et al. 2011, pp. 15-24). Though the troubled relationship between Big Tech and government is well documented (Alfonsi 2019; Romm 2020), this article assumes good faith on the part of regulators.
From a market perspective, failing to protect human-authored works used in the training of generative AI, or offering too much protection to the computer-generated outputs, would reduce the incentive for additional human creations. It is conceivable that this would not be a net loss if AI content more than makes up for the deficit. That would certainly be the case with regard to quantity, and may yet be so with regard to quality. (In discussions of AI content, it is common to hear sage observations that AI will never produce Michelangelo's "David" or Jane Austen's Pride and Prejudice. That may be true. Yet I will never produce such works either, nor will you.) Nonetheless, if the concerns about model poisoning are correct, even the AI models themselves will continue to require human creativity to achieve further improvements.
In any case, regulation is not simply about market optimization. If societies value the arts, then investments should be made in them. Again, there is precedent for this, even if it is not particularly inspiring. Photography all but killed portraiture, though painting remains a niche activity (Graw and Lajer-Burcharth 2016). Motion pictures and television did not lead to the end of live theatre, but far fewer see it today than a century ago. Such art forms, along with dance, opera, orchestral music and the like, continue with government subsidies and, even then, are often regarded as bourgeois conceits (Bennett 2016). It is possible that human-generated text and images will become similarly rarefied, preserved as a callback to a different era, like Shakespeare's Globe Theatre.
There are far larger implications, of course, connected to how we relate to knowledge. For the past two decades, "to Google" came to reflect how many questions were formulated. The answers came with a ranked list of responses, which had the salutary consequence of making clear that there were multiple possible answers, along with subtle indications that some of them might be supported by advertisers who paid for the whole enterprise. If, as appears likely, generative AI leads to our interactions with ChatGPT, Bard, Claude, and so on becoming the first point of inquiry, it is probable that answers will be clear, succinct, and opaque. At that point, understanding the inputs that go into generative AI and who is responsible for its outputs will have political as well as economic consequences.
draft for LKYSPP Conference October 2023, and later submission to Policy & Society Journal Special Issue: Governance of Generative Artificial Intelligence

Figure 1. From the complaint in Getty Images litigation, Getty image (left) and Stability AI image (right)

Figure 2: A black and white portrait photograph of Prince taken in 1981 by Lynn Goldsmith

Figure 3: A purple silkscreen portrait of Prince created in 1984 by Andy Warhol to illustrate an article in Vanity Fair

Figure 4: An orange silkscreen portrait of Prince on the cover of a special edition magazine published in 2016 by Condé Nast

Figure 5: Oscar Wilde, in the Smithsonian Magazine, May 2004

Figure 7: A Recent Entrance to Paradise