Synthetic Data is Fentanyl for AI
SHOT
The internet of digital natives (and immigrants) is slowly dying.
Per The Verge:
“Google is trying to kill the 10 blue links.
“Twitter is being abandoned to bots and blue ticks.
“There’s the junkification of Amazon and the enshittification of TikTok.
“Layoffs are gutting online media.
“A job posting looking for an ‘AI editor’ expects ‘output of 200 to 250 articles per week.’
“OpenAI’s ChatGPT is being used to generate whole spam sites.
“Etsy is flooded with ‘AI-generated junk.’
“Chatbots cite one another in a misinformation ouroboros.
“LinkedIn is using AI to stimulate tired users.
“Snapchat and Instagram hope bots will talk to you when your friends don’t.
“Redditors are staging blackouts.
“Stack Overflow mods are on strike.
“The Internet Archive is fighting off data scrapers, and ‘AI is tearing Wikipedia apart.’”
CHASER
Synthetic data is fentanyl for emergent Clippy, resulting in the collapse of the large language model.
Data governance is the active practice of curating and stewarding trusted data assets to create and sustain value for the business.
Per Jason Bloomberg of Intellyx:
“Model collapse occurs when AI models train on AI-generated content. It’s a process where small errors or biases in generated data compound with each cycle, eventually steering the model away from generating inferences based on the original distribution of data.
“In other words, the model eventually forgets the original data entirely and ends up creating useless noise.”
“People are poisoning the Web all the time. Perhaps even you have done so. All you need to do to accomplish this nefarious deed is to post any AI-generated content online.
“Poisoning, after all, can be either intentional or inadvertent.
“Because AI is creating the synthetic data, however, there is the risk that the data sets that trained the synthetic data creation models included AI-created data themselves, thus establishing the vicious feedback loop that leads to model collapse.”
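Bloomberg’s vicious feedback loop can be sketched in a few lines of Python. This is a toy illustration, not actual LLM training: the “model” here is just a Gaussian fitted to data, each generation “trains” only on the previous generation’s synthetic output, and small estimation errors compound until the distribution degrades. The function name `collapse_demo` and all parameters are invented for this sketch.

```python
import random
import statistics

def collapse_demo(n=10, generations=300, seed=42):
    """Toy model-collapse loop: fit a Gaussian, sample from the fit,
    refit on the samples, and repeat -- each cycle trains only on
    the previous cycle's AI-generated output."""
    rng = random.Random(seed)
    # Generation 0: "real" data drawn from the true distribution N(0, 1).
    data = [rng.gauss(0.0, 1.0) for _ in range(n)]
    sigmas = []
    for _ in range(generations):
        # "Train": estimate the distribution from the current data set.
        mu, sigma = statistics.mean(data), statistics.pstdev(data)
        sigmas.append(sigma)
        # "Generate": replace the training set with synthetic samples.
        data = [rng.gauss(mu, sigma) for _ in range(n)]
    return sigmas

sigmas = collapse_demo()
# Small sampling errors compound with each cycle: the estimated spread
# decays toward zero and the model drifts from the original distribution,
# eventually "forgetting" the real data entirely.
print(f"generation 0 stdev: {sigmas[0]:.3f}")
print(f"generation {len(sigmas) - 1} stdev: {sigmas[-1]:.6f}")
```

The tiny sample size (`n=10`) exaggerates the effect so the decay is visible within a few hundred generations; with larger training sets the same drift happens, just more slowly.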
This intentional and inadvertent poisoning of the well of high-cost training data will only make the ROI of AI projects harder to justify.
Active data governance of curated, trusted data is a way for product organizations to retain value from the AI implementations in their offerings.