Let’s not treat our AI product’s terms lightly: (a few) people do read them…
And that’s enough to start an internet “tornado” raging against your company.
We have seen this kind of communication nightmare over and over recently: Zoom, Adobe, Midjourney, OpenAI… These events highlight the growing public scrutiny of AI, specifically concerns over how people’s data and content could be used to train large AI models without their consent or without compensation. Once word of mouth spreads, companies find themselves playing communication catch-up (e.g. Zoom’s AI product terms), but trust is damaged for good, and an opening is created for competitors.
That said, we should not be naïve: those competitors are unlikely to be any better. AI learns from data. For service providers to bring you clever services, they have to use some data to train their models. Any good AI product strategy includes a good data collection strategy: growing the company’s dataset over time and updating the models frequently, ideally with real data from the services themselves. This keeps inference performance at its best, because production data stays close to what the model has learned. Otherwise model performance may decay, a phenomenon most often called data drift.
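To make the data drift point concrete, here is a minimal sketch (assuming Python with NumPy and SciPy; the feature arrays are hypothetical placeholders) of how a team might compare training-time data with production data to spot a drifting feature:

```python
# Minimal drift check: compare the distribution of one feature at training time
# with the distribution seen in production requests.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(seed=0)
training_values = rng.normal(loc=0.0, scale=1.0, size=5_000)    # data the model was trained on
production_values = rng.normal(loc=0.4, scale=1.2, size=5_000)  # data the service sees today

# Two-sample Kolmogorov-Smirnov test: a small p-value suggests the
# production distribution no longer matches the training distribution.
statistic, p_value = ks_2samp(training_values, production_values)
if p_value < 0.01:
    print(f"Possible data drift (KS statistic={statistic:.3f}, p={p_value:.4f}) — consider retraining")
else:
    print("No significant drift detected")
```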
Why should we be concerned about our data being used for training purposes? Companies have trade secrets, and they don’t want competitors to use large models to “reason” over their data.
Even in your private sphere, once a model has learned information about you, scraped from the internet, it can be a nightmare to get it “unlearnt”: training a large model costs millions of dollars, so it is not done on a daily basis. Model hallucinations might not work in your favor either (a ChatGPT hallucination recently got OpenAI sued), although things are changing rapidly: OpenAI just announced you can block their scraper.
Companies like Zama might alleviate these privacy concerns with their FHE (Fully Homomorphic Encryption) technology. They bring privacy with what they call “encryption in use” (encryption at rest and encryption in transit are what GDPR requires as security by design). In essence, it means processing data without ever having to decrypt it (more on the subject in “This company believes to have the solution to chatgpt privacy problems”).
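To make the idea of computing on encrypted data tangible, here is a toy Python sketch. It is not Zama’s actual API, and it is only additively homomorphic (a textbook Paillier construction with deliberately tiny demo primes) rather than fully homomorphic, but the workflow is the same: the client encrypts, the server operates on ciphertexts it cannot read, and only the client can decrypt the result.

```python
# Toy Paillier-style additively homomorphic encryption, for illustration only.
# FHE schemes go much further (arbitrary computation on ciphertexts), but the
# client/server workflow is the same.
import math
import secrets

# Tiny demo primes; real deployments use keys of 2048 bits or more.
p, q = 293, 433
n = p * q
n2 = n * n
g = n + 1
lam = math.lcm(p - 1, q - 1)                 # Python 3.9+

def L(x: int) -> int:
    return (x - 1) // n

mu = pow(L(pow(g, lam, n2)), -1, n)          # modular inverse, Python 3.8+

def encrypt(m: int) -> int:
    r = secrets.randbelow(n - 1) + 1
    while math.gcd(r, n) != 1:               # r must be coprime to n
        r = secrets.randbelow(n - 1) + 1
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c: int) -> int:
    return (L(pow(c, lam, n2)) * mu) % n

# Client encrypts two private values and sends only the ciphertexts.
c1, c2 = encrypt(42), encrypt(58)

# Server "adds" the values without ever seeing them:
# multiplying ciphertexts corresponds to adding plaintexts.
c_sum = (c1 * c2) % n2

# Client decrypts the result locally.
print(decrypt(c_sum))  # 100
```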
How to compensate data owners (authors) remains rather unclear. Adobe has committed to rewarding artists. Some claim blockchain might help. But the reality is that nothing ties, say, a picture used at training time to a given model inference.
Now, is the freemium vs premium business model the answer to this privacy concern? “When something is free, the user is the product,” right? We seem to have widely accepted this model, and the GAFAM platforms have leveraged it for a long time. Zoom’s little oversight: they should have considered this before publishing their new terms. Their free users are likely more open to seeing their data used for training AI in exchange for a new (optional) service.
So let’s be prepared to handle AI product terms with extreme care. We will surely see more issues of this kind in the press.