It is now clear that generative artificial intelligence (AI) such as large language models (LLMs) is here to stay and will substantially change the ecosystem of online text and images. Here we consider what may happen to GPT-{n} once LLMs contribute much of the text found online. We find that indiscriminate use of model-generated content in training causes irreversible defects in the resulting models, in which tails of the original content distribution disappear. We refer to this effect as ‘model collapse’ and show that it can occur in LLMs as well as in variational autoencoders (VAEs) and Gaussian mixture models (GMMs). We build theoretical intuition behind the phenomenon and portray its ubiquity among all learned generative models. We demonstrate that it must be taken seriously if we are to sustain the benefits of training from large-scale data scraped from the web. Indeed, data collected about genuine human interactions with systems will be increasingly valuable in the presence of LLM-generated content in data crawled from the Internet.
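
For a rough feel of the "tails disappear" effect, here is a toy sketch in Python. It is not the paper's experiment: it assumes the simplest possible setting, a single 1-D Gaussian that is refit by maximum likelihood to a small sample drawn from the previous generation's fit, with no original data mixed back in. The function name `simulate_collapse` and its parameters (`generations`, `sample_size`, `seed`) are made up for illustration.

```python
import random
import statistics


def simulate_collapse(generations=200, sample_size=10, seed=0):
    """Repeatedly fit a Gaussian to samples drawn from the previous fit.

    Returns the fitted standard deviation at every generation,
    starting with generation 0 (the assumed "real" distribution).
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0: stand-in for human-made data
    history = [sigma]
    for _ in range(generations):
        # Draw a finite training set from the current model only.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # Refit by maximum likelihood; nothing from generation 0 is reused.
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        history.append(sigma)
    return history


if __name__ == "__main__":
    history = simulate_collapse()
    for gen in (0, 10, 50, 100, 200):
        print(f"generation {gen:3d}: sigma = {history[gen]:.3e}")
```

Each refit sees only a finite sample, so rare values in the tails are under-represented, and the estimation error compounds from one generation to the next. In this toy setup the fitted spread tends to shrink toward zero over many generations, which is the one-dimensional analogue of the tails of the distribution vanishing.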

  • CosmoNova@lemmy.world
    4 months ago

    Sadly, there’s a silver lining for giga corporations exclusively. They have near endless resources to amass more and more human-made data and IP to keep feeding their content machine for years to come. You and me won’t be able to train anything decent from datasets that scrape random websites anymore for the known reasons, but Microsoft, Facebook and Google are above us filthy plebs in that they already own or are able to pay for high quality datasets that they can monetize completely legally. I mean, they’re lobbying for exactly that: to lock the tech away from the public. And of course the US government being the US government, they’re already making it happen with nightmarish regulations that hand the keys to the tech to the super rich.

    Though I wonder how much enshittification us people can take before we simply leave most parts of the internet to experience real life instead. Because the digital world looks more surreal by the day lately and it kind of stops existing as soon as we avert our eyes from our screens.