Another great example (from DeepMind) is AlphaFold. Because there’s relatively little data on protein structures (only 175k in the PDB), you can’t really build a model that requires millions or billions of structures. On top of that, getting the structure of a new protein in the lab is really hard, and most proteins are highly similar to one another (you share about 60% of your genes with a banana).
So the researchers generated a bunch of “plausible yet never seen in nature” protein structures (that their model thought were high quality) and used them for training.
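To make the idea concrete, here’s a rough sketch of confidence-filtered self-training on a toy 1-D classifier. This is not AlphaFold’s actual pipeline; the nearest-centroid “model”, the margin-based confidence score, the 0.8 threshold, and all function names are made up for illustration.

```python
from statistics import mean

def fit_centroids(points, labels):
    """Toy 'training': the model is just the mean of each class."""
    return {c: mean(p for p, l in zip(points, labels) if l == c)
            for c in set(labels)}

def predict_with_confidence(centroids, x):
    """Predict the nearest centroid; confidence is the relative distance margin (0 to 1)."""
    ranked = sorted(centroids, key=lambda c: abs(x - centroids[c]))
    d_best = abs(x - centroids[ranked[0]])
    d_next = abs(x - centroids[ranked[1]])
    total = d_best + d_next
    confidence = (d_next - d_best) / total if total else 0.0
    return ranked[0], confidence

def self_train(points, labels, unlabeled, threshold=0.8):
    """Label the unlabeled data with the current model, keep only the
    confident predictions, and retrain on real + generated examples."""
    model = fit_centroids(points, labels)
    pseudo = []
    for x in unlabeled:
        label, confidence = predict_with_confidence(model, x)
        if confidence >= threshold:  # assumed cutoff, purely illustrative
            pseudo.append((x, label))
    new_points = list(points) + [x for x, _ in pseudo]
    new_labels = list(labels) + [l for _, l in pseudo]
    return fit_centroids(new_points, new_labels), pseudo

# The ambiguous point 5.1 is dropped; 0.2 and 9.9 are kept as training data.
model, pseudo = self_train([0.0, 1.0, 9.0, 10.0], [0, 0, 1, 1], [0.2, 5.1, 9.9])
```

The filter is the important part: only predictions the model itself rates as high quality get fed back in, which is also why any mistakes it is confident about get reinforced.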
Granted, even though AlphaFold has made incredible progress, it still hasn’t delivered any biological breakthroughs (e.g. 80% accuracy is much better than the 60% accuracy we were at 10 years ago, but it’s still nowhere near where we really need to be).
Image models, on the other hand, are quite sophisticated, and many of them can “beat” humans or produce images that look “more natural” than an actual photograph. Trying to eke the final 0.01% out of a 99.9% accurate model is when model collapse happens: the model starts to learn from the “nearly accurate to the human eye but containing unseen flaws” images.
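A toy way to see the collapse, assuming a plain Gaussian as a stand-in for the model (nothing like a real image model, and the function name and parameters are hypothetical): each generation is fit only to samples drawn from the previous generation, so small estimation errors compound instead of averaging out.

```python
import random
from statistics import fmean, pstdev

def simulate_collapse(generations=200, sample_size=10, seed=0):
    """Refit a Gaussian to its own samples each generation.

    Returns the fitted standard deviation per generation, starting
    with the 'real data' value at index 0.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0: the real data distribution
    history = [sigma]
    for _ in range(generations):
        # Train only on the previous generation's output.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        mu, sigma = fmean(samples), pstdev(samples)
        history.append(sigma)
    return history

history = simulate_collapse()
print(f"std at start: {history[0]:.3f}, after {len(history) - 1} generations: {history[-1]:.3g}")
```

With small samples the fitted spread drifts toward zero over the generations: the model ends up very sure about an ever narrower slice of what the original data looked like.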
next up: “Great thanks we’re gonna sell all your photos unless you pay for a subscription. Gotta keep in business somehow!”