Generative AI is not trained on "data"
- data
- factual information (such as measurements or statistics) used as a basis for reasoning, discussion, or calculation
Is a poem factual information?
Do not go gentle into that good night,
Old age should burn and rave at close of day;
Rage, rage against the dying of the light.
Doesn’t look like it. But what about this?
The first verse of “Do Not Go Gentle into That Good Night” by Dylan Thomas goes like this: “Do not go gentle into that good night, / Old age should burn and rave at close of day; / Rage, rage against the dying of the light.”
Oh dear, that’s looking pretty factual to me! See how the magical transformation is completed:
- information in digital form that can be transmitted or processed
Behold, a .txt file:
| Offset | Character |
| --- | --- |
| 0000 | D |
| 0001 | o |
| 0002 | |
| 0003 | n |
| 0004 | o |
| 0005 | t |
| 0006 | |
| 0007 | g |
| 0008 | o |
| ... | |
Aha! Merely by saving it in a file, I have transformed Do Not Go Gentle… from poetry to data! It’s presented as a table, it must be so.
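For the curious, the trick can be sketched in a few lines of Python. This is only an illustration: the `as_rows` helper and its `limit` parameter are made up for this example, and it assumes the offsets in the table are plain decimal character positions.

```python
# First line of the poem, as quoted above.
line = "Do not go gentle into that good night,"


def as_rows(text, limit=9):
    """Return one '| offset | character |' row per character.

    Hypothetical helper: it only mimics the table above by pairing
    each of the first `limit` characters with a zero-padded offset.
    """
    return [
        f"| {offset:04d} | {char} |"
        for offset, char in enumerate(text[:limit])
    ]


# Print the poem as "data", one character per row.
for row in as_rows(line):
    print(row)
```

Nothing about the poem changes here; only the way it is laid out on the screen does.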
And there’s more! The advent of the multimedia computer means I can perform the same trick on paintings, songs, speeches, movies and anything else your heart desires!
Maybe we shouldn’t treat everything that can be saved on a computer like soil density
Even though an amateur rhetorician (that’s a portmanteau of “rhetoric” and “magician”, I’m still working on it) like me can trivially transmute works into data, the two have very different moral frameworks.
Depending on where you live and what your ideology is, data is either unownable or owned by the person who collects it. Even if it’s unownable, you can still charge for the service of collecting it; you just can’t stop others from doing the same. Either way, the producer of data has no rights over it[1]: you don’t pay the wind to measure its speed.
But with creative works, the producer has all the rights. There’s copyright, for one, but even if you don’t believe in that, most people still think that plagiarism is bad and that artists should be able to sell their art as long as money exists.
These frameworks are sensible if you assume that data is produced involuntarily, mostly by inanimate beings, and that collecting it takes effort, while works are the result of intentional human labor and are easy to “collect”, since most creatives want their works seen.
They’re also sensible if you assume that data is collected by researchers to advance science or provide reference material, and works are collected by publishers to sell for entertainment and edification.
So, when a research non-profit uses well-established resources like Common Crawl to advance the field of artificial intelligence, nobody bats an eye. But when that firm pivots to selling its research artifact as a product, it becomes the political issue of the decade.
Wait… aren’t AI models data too?
The spirit known as OpenAI and its thralls have turned all web-published works into data by saving them into a database and doing research to them (this is the same spell we performed earlier).
But these powerful entities, through rituals like word vectorization, reinforcement learning with human feedback, and bribing politicians, are now trying to transmute the database into an ownable, sellable work without turning the data inside back into works as well.
They will, of course, be granted challenges to business models—as new technologies always are—especially for those who make their money off of gating up and charging access to data. But such practices simply aren’t tenable in the long term, legally or practically (let alone morally). Under US law, facts aren’t copyrightable (thanks to the landmark Supreme Court decision in Feist v. Rural Telephone Service) and databases are just collections of facts.
(Another “data is factual information”, nice! Maybe I’ll take all the works that say this and put them in a CSV so they’re data.)
I like that facts aren’t copyrightable.[2] But “gating up and charging access to” something is sadly the only reliable way to make money off of it. And now that all digital works have been turned into data, creatives are losing their ability to do exactly that.
The megacorp spirits continue to make money, because they can perform the bribery rituals.
I cannot. So, instead, I’m trying to cast a counterspell upon the initial data-fication spell.
Training materials
I’m dropping the magic shtick now.
This post is not an appeal for you to stop saying “training data”. After years of wrestling with “free as in beer”, I remain unconvinced that using a different word to describe the same thing makes people want to hear you out.
But the data-works distinction exists, and it’s a distinction between concepts, not words. (I don’t think “works” is the best word for it anyway; anyone have a better one?) When you’re building software, everything is “data”,[3] but we need to step back at some point and realize that this is only an illusory state that works must inhabit to fit inside the computer: they get turned into data temporarily, after the moment of creation, until being restored at the point of retrieval. For anyone who doesn’t live in the computer, a poem is not data.
So, I will be calling the inputs to an AI training process “training materials”, unless they are truly data. You can do it too. It can be our thing. It won’t make the general public interested, and nobody will be begging us to explain one more time that you can pay for free software, but at least we’ll know when we’ve chanced upon one another at the slop trough.
[1] Unless that producer is a human in a country with privacy regulations like the GDPR, but even then, the “data subject” in GDPR terminology is not a “data owner”.
[2] I wish they weren’t tradable either; it is absurd that information about me that I give to a business so they can provide services is an asset that can be valuated and sold. How did this ever become normal?
[3] or “content”, as in the information-free phrase “Content Management System”. Hey, why are we still building Thing Doers? Hasn’t anyone designed a flexible enough Thing Doer that can Do any Thing?