#012 - Dirt in the AI machine
Welcome to the shadier corners of AI.
You're reading Complex Machinery, a newsletter about risk, AI, and related topics. (You can also subscribe to get this newsletter in your inbox.)
This week I'm expanding on a couple of links I shared last time, bookended by recent news.
It's all about the dirty side of AI: how companies collect training data, how they prepare it for use, and how they sell their products.
Hungry for training data
GenAI products elicit a mix of "OMG this is amazing" and "hey they stole my work."
The latter should make AI hopefuls think twice about how they source their training data and deploy their models. But not so much. Figma's AI design tool has been cranking out apps that look suspiciously like Apple Weather. Which is mild compared to Perplexity's AI system allegedly plagiarizing a Wired article … that was critical of Perplexity. (The results thus far: Figma has since pulled their tool in order to address the problem. Perplexity is standing firm in its delightfully twisted (mis?)interpretation of web crawler standards.)
OpenAI is doing its best to lead the pack. In addition to racking up lawsuits, they've said the quiet part out loud:
As The Telegraph reports, [OpenAI] said in a filing submitted to a House of Lords subcommittee that using only content from the public domain would be insufficient to train the kind of large language models (LLMs) it's building, suggesting that the company must therefore be allowed to use copyrighted material.
"Because copyright today covers virtually every sort of human expression — including blog posts, photographs, forum posts, scraps of software code, and government documents — it would be impossible to train today's leading AI models without using copyrighted materials," the company wrote in the evidence filing. "Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today's citizens."
Right. Right.
What?
Mixed messaging
If I read that correctly, OpenAI has just acknowledged that their entire business model requires a rather spirited interpretation of fair use. And that they pretty much agree with plaintiffs in the various copyright infringement lawsuits filed against them. Plus, they think their products are more important than existing laws and norms.
To borrow an old phrase: if this is what they say in public, what do they say behind closed doors?
Their employee agreements offer a hint:
The whistleblowers said OpenAI issued its employees overly restrictive employment, severance and nondisclosure agreements that could have led to penalties against workers who raised concerns about OpenAI to federal regulators, according to a seven-page letter sent to the SEC commissioner earlier this month that referred to the formal complaint. The letter was obtained exclusively by The Washington Post.
OpenAI made staff sign employee agreements that required them to waive their federal rights to whistleblower compensation, the letter said. These agreements also required OpenAI staff to get prior consent from the company if they wished to disclose information to federal authorities. OpenAI did not create exemptions in its employee nondisparagement clauses for disclosing securities violations to the SEC.
A document that says hey don't blab company secrets to the competition? That's an NDA. Standard fare in today's information-heavy tech field.
A document that says hey don't tell the feds what we're up to? That … does not sound like an NDA. I don't even know what you'd call that. I don't understand what legitimate business might need it. Nor is it the sort of promise that you'd want to commit to written form on company letterhead, because it might come back to haunt you in a courtroom. Especially if employees disregard the "don't talk to the feds" bit and, y'know, go to the feds.
But that's just me.
Not a new problem
Getting sued over your training data is a distraction. If you don't have the deep pockets and/or nerve of an AI startup, you can amend your privacy policy for the genAI era. As many companies have done.
I am disappointed. But I am not surprised.
This hunger for dodgy, gray-area data collection has been around since the early days of predictive analytics and Big Data. Long before that, even, when you consider the very quiet, behind-the-scenes world of data brokers. (How quiet? I dare you to name four data brokers that predate social media. Go ahead. I'll wait.)
Sketchy privacy policies contort the idea of informed consent. And when people complain, we roll out the tired "if you're not the customer then you're the product" line. We should really focus on the flip side of that coin, for companies: "this business model only works if the raw materials are free." It's the financial equivalent of defying gravity. That deserves more attention.
Consider the human cost
While sneaky data collection is old, modern-day ML/AI has created a couple of new problems of its own. One stems from data labeling.
Once you've stolen, er, collected all of that raw data, you need to mark up images, assign categories to documents, and otherwise tell the computers what to look for. That job is best suited to people, so a cottage industry has cropped up to coordinate the labor.
Content moderation is a sizable consumer of data labeling work. This requires labeling teams to look at all the slurs and violence that we do not want to see on social media. And a good deal of this digital drudgework happens overseas – out of sight and out of mind of the wealthy western countries that benefit from the content moderation systems.
If you speak German (or you trust online translation tools) this recent piece in Der Spiegel interviews two people who've performed data labeling work for content moderation systems. It's eye-opening, if for no other reason than to demonstrate the humans involved in – and the human cost of – AI.
Nouns and adjectives
A couple of months ago I noted that the AI field is deep in Fake It Till You Make It territory:
New technology lends itself to the murkier practice of pitching a distant, possible-future state as a present-day reality. For AI this means that we might someday have chatbots that are hallucination-free replacements for search. And we might someday have fully-autonomous cars. But despite what the AI hype train may tell you, we don't have either one just yet.
The Atlantic's Charlie Warzel raised similar concerns when he spoke to Arianna Huffington and OpenAI founder Sam Altman about their new AI-based health venture. The pair were as upbeat about the project as they were squishy on the details.
(I won't provide an excerpt here because the entire article is quoteworthy. I encourage you to read the whole thing and then apply that lens to every other AI product.)
Why would they be so vague? I can't say for sure. But I can tell you two things about sales:
1/ Founders are always in sales mode. They must sell prospective hires, buyers, and investors on some glorious (possible) future.
2/ Do you know what's surprisingly easy to sell? Something that doesn't exist.
Since The Thing isn't real (yet), it can be anything. It neatly conforms to the shape of every buyer's desires while offering plausible deniability to sellers. It can be frustrating to critique for this same reason.
My antidote to this AI vaporware stems from advice I once received about real estate listings:
Focus on the nouns, not the adjectives.
Every noun in a house listing constitutes a falsifiable claim – something a buyer can test, and for which the seller can be held to account. Adjectives? I won't say terms like "cozy" and "charming" are worthless. But if you show me a house listing that is all adjectives, I will show you a reason to look elsewhere. Stick with the noun-heavy descriptions.
The same holds for sussing out an AI vendor's pitch. Warzel tried this in his interview with Huffington and Altman. He probed for concrete ideas and came away empty-handed. That tells me everything I need to know about this AI health venture.
Because remember: without hard facts, all you're buying is someone else's dream.
In other news …
- There's an old Chris Rock line: a thief will steal your wallet; a junkie will steal your wallet and pretend to help you look for it. No idea why that comes to mind right now. No idea! But completely unrelated, OpenAI is pairing up with Los Alamos Lab to protect the world from … AI. (Gizmodo)
- AT&T recently suffered a data breach. This would have been bad enough in the era before predictive ML and genAI. But when you consider what today's technology could do with that haul of call metadata, that's scary. (TechCrunch)
- Intuit, maker of TurboTax and QuickBooks, is cutting workers while also hiring other workers. Because, AI. Of course. (WSJ)
- Investors and other high-profile figures are raising concerns about a possible AI bubble. I don't see any quotes from Complex Machinery #007 on this topic, but I'll just assume that they've all read it. (Le Monde 🇫🇷)
The wrap-up
This was an issue of Complex Machinery.
Who’s behind Complex Machinery? I'm Q McCallum. I think a lot about AI and risk, which I write about here.
Disclaimer: This newsletter does not constitute professional advice.