AI Has Already Run Out of Training Data, Goldman's Data Chief Says (businessinsider.com) 81
AI has run out of training data, according to Neema Raphael, Goldman Sachs' chief data officer and head of data engineering. "We've already run out of data," Raphael said on the bank's podcast. He said this shortage is already shaping how developers build new AI systems. China's DeepSeek may have kept costs down by training on outputs from existing models instead of fresh data. The web has been tapped out.
Developers have been using synthetic data -- machine-generated material that offers unlimited supply but carries quality risks. Raphael said he doesn't think the lack of fresh data will be a massive constraint. "From an enterprise perspective, I think there's still a lot of juice I'd say to be squeezed in that," he said. Proprietary datasets held by corporations could make AI tools far more valuable. The challenge is "understanding the data, understanding the business context of the data, and then being able to normalize it."
The withered teat has been sucked dry (Score:3, Insightful)
and the future is safe from Skynet.
How angry are you? (Score:1, Troll)
Rage against the AI machine? People want wise oracles, but the AIs are stupid and some people are therefore learning to limit their thinking to what the AI is good at talking about. Even worse, some folks think the YUGE Orange Buffoon is some kind of gawd king...
How about updating the old song to "AI can do anything better than you"?
(And I think all the ACs could be replaced with a genAI set to the style of "stupid".)
Re: (Score:2)
Rage against the AI machine? People want wise oracles, but the AIs are stupid and some people are therefore learning to limit their thinking to what the AI is good at talking about. Even worse, some folks think the YUGE Orange Buffoon is some kind of gawd king...
How about updating the old song to "AI can do anything better than you"?
(And I think all the ACs could be replaced with a genAI set to the style of "stupid".)
Really? What part of that joke triggered the censor trolls with mod points? Perhaps the reference to he who should not be mentioned?
Copying is easier than innovation (Score:2, Troll)
It is very easy to copy other people, hard to innovate. One of the reasons China has kept up with the US.
Similarly, it is very hard for someone to create something smarter than they are. One of the reasons I am not afraid of AI - it will never become smarter than a human. Its speed of growth is because it is following in our footsteps; that does not translate to breaking new ground.
Re: (Score:3)
Not everything is about smarts though, some of it is just volume.
How many polymer formations can you model and observe the properties of in the search for some new plastic material for a specific application? The AI does not have to be smarter, it just has to be faster than the human materials scientists before it.
Re:Copying is easier than innovation (Score:5, Interesting)
Not everything is about smarts though, some of it is just volume.
How many polymer formations can you model and observe the properties of in the search for some new plastic material for a specific application? The AI does not have to be smarter, it just has to be faster than the human materials scientists before it.
In the end though, that's just (a set of) new algorithms. The FFT already transformed the world. The ability of mobile phones to handover from one cell to another automatically transformed communication. Being able to fold proteins more efficiently will transform biology. There are lots of these examples. They are important and valuable but they aren't smart.
None of them comes close to the effect that an AGI would have. If you could get something as smart as a cat, it's reasonably likely that small tweaks would allow you to create something more intelligent than a human. The fact remains that even the greatest LLMs and other current "AIs" are visibly lacking. The future, many years from now, when intelligence is finally actually understood, will likely laugh at our current AI experts claiming that they are close to intelligence, in the same way that we laugh at alchemists who claimed they had almost transmuted lead into gold long before atomic theory was understood.
Re: (Score:3)
Yeah this whole data exhaustion thing should be ringing alarm bells for investors. My toddler has not been exposed to the entire recorded media archive of humanity, yet they are self-aware and able to understand the world and interact with it.
Even for someone who doesn't understand the tech, this should be a strong indication to them that the LLM approach is not at all replicating the most important biological processes required for sentience. And if that is the case, then it's absolutely crazy to say that
Synthetic data is quality controlled (Score:1)
I think "China's DeepSeek may have kept costs down by training on outputs from existing models" is a negative framing. Models like Microsoft's Phi or Auraflow used a lot of synthetic data, because the usual datasets, both for text and images, are mostly unsorted results of crawling everything that's not behind login walls and contain a lot of garbage.
While there are some tools like aesthetic scoring, the result of such a classification is not as reliable as one would want, and quality may also be subjective.
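In code, that kind of score-based filtering comes down to something like the sketch below; score_quality and the 0.5 threshold are made-up stand-ins for a real aesthetic or quality classifier, not any specific project's pipeline.

```python
# Minimal sketch of score-based dataset filtering for crawled data.
# score_quality() and the threshold are hypothetical placeholders; real
# pipelines use trained aesthetic/quality scorers instead.

def score_quality(sample: dict) -> float:
    """Return a quality score in [0, 1] for a crawled sample (toy heuristic)."""
    text = sample.get("caption", "")
    words = text.split()
    if len(words) < 3:
        return 0.1                      # penalize near-empty captions
    unique_ratio = len(set(words)) / len(words)
    return min(1.0, 0.5 + 0.5 * unique_ratio)

def filter_dataset(samples: list[dict], threshold: float = 0.5) -> list[dict]:
    """Keep only samples whose score clears the threshold."""
    return [s for s in samples if score_quality(s) >= threshold]

crawled = [
    {"caption": "a photo of a red bicycle leaning against a brick wall"},
    {"caption": "img_0231.jpg"},        # typical crawl garbage
]
print(filter_dataset(crawled))          # only the first sample survives
```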
Got enough for bootstrapping (Score:3, Insightful)
For example, how does a self-driving car 'run out of training data'? They are gathering vast amounts every day from their already-deployed cars. Probably more than they can handle.
Same with call center AI. Where could it get more speech data? Obviously it gets reams more every day in the course of doing its job, and can get better over time.
Re:Got enough for bootstrapping (Score:4, Insightful)
Tell me you have no clue how AI training works without telling me you have no clue how AI training works ...
Re: (Score:2)
Re: (Score:3)
He's not wrong. Many of the large models are large because they contain more knowledge, not more intelligence. But if you add RAG and other techniques to access, let's say, a Wikipedia dump, you don't need the knowledge in the weights. So train the weights to provide as much intelligence, instruction following, etc. as you can and then give the model the most recent Wikipedia dump and access to Google search. Most humans don't have the knowledge a typical LLM has in most areas, so why would an LLM need to try t
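As a rough sketch of that split between intelligence in the weights and knowledge in an external store, retrieval-augmented generation looks roughly like the following; the tiny corpus, the word-overlap retriever, and the prompt format are illustrative assumptions, not any particular product's API.

```python
# Minimal RAG sketch: keep knowledge in an external store and hand the
# model only the relevant passages at query time. A real system would use
# a vector index over a Wikipedia dump and a real LLM call; here the
# retriever is naive word overlap and the "model" is just the prompt.

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank passages by word overlap with the query (placeholder retriever)."""
    q_words = set(query.lower().split())
    ranked = sorted(corpus,
                    key=lambda p: len(q_words & set(p.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_prompt(query: str, passages: list[str]) -> str:
    """Assemble the prompt the LLM would actually see."""
    context = "\n".join(f"- {p}" for p in passages)
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")

corpus = [
    "The FFT reduces the cost of the discrete Fourier transform to O(n log n).",
    "Handover lets a mobile phone move between cells without dropping a call.",
]
query = "What does the FFT do?"
print(build_prompt(query, retrieve(query, corpus)))
```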
Re: (Score:3)
What experts bring to the table is meta-knowledge. Yes, you definitely do not need to know everything. You need to know some things or you will never develop that meta-knowledge. And you need to have a pretty good idea about what exists and what it can do, so you know what to look up. The latter also comes into play when plausibility-checking what an artificial cretin (a.k.a. LLM) delivered to you for completeness and applicability (or existence) in the first place.
Re: (Score:3)
I guess meta-knowledge and "intelligence" go hand in hand. You need some kind of "basic understanding", logic, reasoning, and conclusions together with language and grammar. If this is scaled high enough, you can have the knowledge in external well-curated databases and can effectively prevent hallucinations and so on.
The point with LLMs is that their base training does not train them on logic, etc.; logic is an emergent effect of training them on a huge pile of knowledge. As long as we rely on the emergent
Re: (Score:2)
Sorry, but you are deeply mistaken. LLMs cannot do logic and there is no "emergent" intelligence in LLMs. The mathematics used does not allow that. Logic (and intelligence) needs generalization steps LLMs cannot do.
Re: (Score:2)
The reality of LLMs disproves you.
Re: (Score:3)
Re: (Score:2)
https://ancillary-proxy.atarimworker.io?url=https%3A%2F%2Fwww.reuters.com%2Fbusine... [reuters.com]
Of course Waymo never did rely on opportunistically scraping crowdsourced data in the first place, so "running out" was never going to be an issue.
Re: (Score:2)
Re: (Score:2)
My prediction is that self-driving won't level off before handling a large share of all trips, for several reasons: the cars are already somewhat adaptable and will become more so; increased trips will provide ever-increasing data to train on; and temporary roadway changes, e.g. for construction, will be handled more systematically to accommodate automation, e.g. via waze, google
Re: (Score:2)
Re: (Score:3)
It may have sampled a major highway 50,000 times but it's not going to learn anything extra so it has run out of training data.
You're looking at the road as a fixed object. It's not, it is ever changing. The varying position of cars, obstructions, and the chaotic nature of the way humans react present virtually limitless training data. That's to say nothing of seasonal changes. Does that road look the same in summer, winter, day, night, with water, fog, snow, covered in leaves, etc.?
There's a question of whether more training actually improves performance, but nature and humans are naturally chaotic, meaning the training data will not
Re: (Score:2)
Re: (Score:2)
Technically correct, but cars don't need to learn anything about the highway anymore. They still have plenty to learn about roadworks, about human driving behaviour, etc.
You don't actually need to learn a highway at all, you could in theory just map it once.
Re: (Score:2)
Re: (Score:3)
To take your car example, one of the ways that Tesla "full self driving" kills people is by slamming into overturned vehicles at high speed. Its computer vision system simply doesn't understand what vehicles that have turned over look like.
To train for that they either need a lot of images of vehicles that have turned over, been smashed up, mangled, and distorted, in all lighting and weather conditions, or they need to give it more general training to understand solid obstacles even when it doesn't know wh
"synthetic data" (Score:3)
Re: (Score:2)
"synthetic data" is made-up crap.
Indeed. Synthetic data means you could have just put the assumptions the data was generated from into some more reliable AI approach than an LLM. Newsflash: that was possible before. It was just way too expensive. That constraint remains.
Also, if you put stupid in, then route it through a data-synthesis step and an LLM training process, it will still be stupid, just with hallucinations on top.
Re: (Score:2)
That is the intuition behind model collapse, and well understood in the field before the paper that coined the term. A lot of us are kicking ourselves for not jumping on the opportunity! My name wasn't as cool though, so while I'll miss the citations, it's probably for the best.
To complete the thought: There can be no new information produced, only the old information with whatever error was encoded (no model perfectly captures everything we want) and whatever error was introduced in the generation of th
Re: (Score:3)
The big misunderstanding about model collapse is the curation of data.
Get a model that is trained with good hands and produces bad hands 20% of the time and train it with its own output. The rate of bad hands will increase, as it now works with worse data than before. Now put a human in the loop and train only on the 80% output data with good hands. The rate of bad hands will decrease, as first the model got more training on good hands (even generated ones) and second the set of generated hands are from the
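A minimal sketch of that curation loop, where generate(), human_accepts(), and retrain() are placeholders for a real generator, a human review step, and a training job:

```python
# Sketch of human-in-the-loop curation of synthetic data: generate
# samples, keep only the ones a reviewer accepts, retrain on those.
import random

def generate(n: int) -> list[dict]:
    """Stand-in generator: roughly 20% of samples come out flawed."""
    return [{"id": i, "flawed": random.random() < 0.2} for i in range(n)]

def human_accepts(sample: dict) -> bool:
    """Stand-in for human review: reject the visibly flawed samples."""
    return not sample["flawed"]

def retrain(samples: list[dict]) -> None:
    print(f"retraining on {len(samples)} curated samples")

synthetic = generate(1000)
curated = [s for s in synthetic if human_accepts(s)]
retrain(curated)   # only the accepted ~80% feed the next training round
```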
Re: (Score:3)
Whether what you described is the goal or not is an unanswered question. If the goal is output that deals with known situations properly, it may achieve that. If, on the other hand, the goal is to have output that is good on old, known situations and new situations, it will probably fail at that.
It's like curve fitting. Give me a set of 10 data points and I can make an equation that perfectly fits those 10 data points. Then see if an 11th data point lies directly on that curve, and there's a good likelihood
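To make the curve-fitting point concrete, here is a small sketch: a degree-9 polynomial can pass through 10 sample points almost exactly, yet be wildly off at an 11th point just outside the sampled range. The underlying function and the noise are invented purely for illustration.

```python
# Overfitting demo: (near-)perfect fit on 10 points, poor prediction on the 11th.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, size=10)  # noisy "measurements"

coeffs = np.polyfit(x, y, deg=9)      # degree 9 interpolates the 10 points
x_new = 1.1                           # an 11th point, outside the data range
pred = np.polyval(coeffs, x_new)
true = np.sin(2 * np.pi * x_new)

print(f"prediction at x=1.1: {pred:.1f}, true value: {true:.2f}")
```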
Re: (Score:2)
The point, in the end, is that extrapolation (in contrast to interpolation) expects the learned pattern to hold outside the training data. How well that works depends on how different the unknown data is, and on how the trained model is designed. Two models that are identical at interpolating can be very different at extrapolating. And by definition you don't know the data to extrapolate to beforehand.
I'd say text and image models benefit a lot from things having much more pattern than we recognize. You'd think
Re: (Score:2)
Re: (Score:2)
There are too many models, even from larger corporations like Microsoft, trained on synthetic data to say it has no place. Phi has a lot of synthetic data, and the "distillation" of the smaller Llama 3 models is no classical distillation but training with synthetic data from the large model. The whole Stable Diffusion community trains character LoRAs and styles on synthetic data.
For LLM it is often more or less distillation and reinforcement of good data, for image AI you have things like fixing characters. You'v
How about "No." (Score:5, Interesting)
"From an enterprise perspective, I think there's still a lot of juice I'd say to be squeezed in that," he said. Proprietary datasets held by corporations could make AI tools far more valuable. The challenge is "understanding the data, understanding the business context of the data, and then being able to normalize it."
Most business owners have become aware of how tech companies treat any data that they get ahold of. I don't know of many that will be all aboard with allowing these AI companies full access to their corporate-owned proprietary datasets to train their public models. And while some businesses will happily allow an AI running in a local, company-owned data center to train on their proprietary datasets, they're not going to sign off willingly on letting that training data escape the company network. So, they'll be able to train on proprietary data, but it will remain locked behind corporate firewalls, unless these companies are exactly as scummy as suspected and simply have the local model phone home with data, regardless of how they are set up.
And if they do, the day that's discovered by the corporations paying for local models is going to be a very, very interesting one. Stealing data from all us rubes is one thing, but pissing off the money holders? That's a bad move. They can actually do something about it.
Re:How about "No." (Score:4, Interesting)
Marketing 101 says that the more tightly you define your market, the more accurately you can provide a solution, and the more money you can charge for it. i.e. Autocad. What I think you're saying is that companies want to train LLMs or whatever it is, on their corporate knowledge and culture and data and restrictions. They want the AI customized to their use case. That seems logical to me.
The smart ones will do everything themselves and not allow Big Brains anywhere near their data or model. Work from first principles, own your valuable intellectual property. Don't give it away, chumps.
Re:How about "No." (Score:4, Interesting)
Interesting comments. The quote you chose is where the future lies, imho. The idea that all data is good data and that ChatGPT is all things to all people was naively flawed from the beginning. Based on my experience, businesses want deterministic answers to specific questions, not a distribution of "sort of" answers. What I'm reading from calling data "synthetic" means TELLING the model what is right or wrong. You must bias your model, also known as "teaching" your model. The idea that all you have to do is give the model EVERYTHING then it will magically be "smart" is ludicrous from inception.
Marketing 101 says that the more tightly you define your market, the more accurately you can provide a solution, and the more money you can charge for it. i.e. Autocad. What I think you're saying is that companies want to train LLMs or whatever it is, on their corporate knowledge and culture and data and restrictions. They want the AI customized to their use case. That seems logical to me. The smart ones will do everything themselves and not allow Big Brains anywhere near their data or model. Work from first principles, own your valuable intellectual property. Don't give it away, chumps.
Yes, spinning up open, non corporate models and "teaching" them in the business is an option for corporations. Buying proprietary models, or worse, using "cloud" (other people's) models as a catch-all, will never lead to gains for the businesses involved, only gains in data aggregation for the companies selling the models.
I do think there's some value in domain specific AIs being trained within businesses. I've done a minor bit of that myself for one of my hobbies, feeding one my fictional universe and "correcting" it when it gives me incorrect information with correct references. It's interesting to watch how they improve on the subject matter as you train them. But I'd never feel comfortable doing that with a cloud model, only with one I can run locally and completely disconnected from the outside world.
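For what it's worth, the correct-and-reinforce workflow described above can be as simple as logging each correction into a fine-tuning file that the local model is later trained on; the record layout and file name below are an assumed example, not tied to any particular local-model tool.

```python
# Sketch of a correction log for later local fine-tuning: each time the
# model answers wrongly about the fictional universe, store the question,
# the corrected answer, and the reference it should have used.
import json

def log_correction(question: str, corrected_answer: str, reference: str,
                   path: str = "corrections.jsonl") -> None:
    record = {
        "prompt": question,
        "completion": corrected_answer,
        "reference": reference,        # where the right answer comes from
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

# Hypothetical example entry; names and lore are made up.
log_correction(
    "Who founded the Council of Nine?",
    "It was founded by Arden Vale in year 412 of the Concord calendar.",
    "Chapter 3, 'The Founding'",
)
```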
Re: (Score:2)
I imagine I could google it or ironically ask an AI
Re: (Score:3)
interesting... do you have any advice or references about how to do that ... how to setup and train your own model?
I imagine I could google it or ironically ask an AI .. but I'm interested to learn even the broad strokes on how that is done.
Here's one of the ones I've tried [localai.io] that seemed to be fairly easy to implement, and I've only had to struggle a little to get the training to stick with a few different models.
Re: (Score:2)
Should be interesting.
AI belongs to the Big four (Score:2)
AI has a lot of problems and a lot of reasons it's only going to be bad for humanity. But one of those reasons is that, because of the way training data works, AI automatically consolidates around a couple of big players, leaving everybody else out in the cold.
That means no real comp
No surprise (Score:5, Interesting)
This is entirely expected. That this limit has been reached (probably some time ago) is one of the indicators that this stupid hype may not go on much longer. It also means that what we have now is the max we will get in capabilities for the foreseeable future. Except for very specialized models. Maybe.
Re: (Score:2)
I agree. However, diffusion algorithms and LLMs are very impressive parlour tricks, and all these companies need to do is find some more things like this. Hell, they can even start brute forcing features (some variation on the AI companies that were actually using Indian IT workers, or exactly what Elon is doing with his teleoperation).
We are at the phase where you just need to show 'computer does thing human did!!! - look we've almost figured out AGI' and the money keeps flowing.
There are so many wealthy
The limiting factor is algorithms (Score:5, Insightful)
There is no way that the existing data is insufficient for "AI". All of Wikipedia contains more knowledge than any human can absorb, and that isn't anywhere near the amount of data the big companies have for training. The data exists.
The limiting factor is that there isn't an algorithm to take advantage of that data. Current methods are just statistical models that try to duplicate existing patterns (and they do this well). They work merely because brute-force use of gigantic data sets gives reasonably good statistical results.
Give a human brain the ability to read all that data (obviously no one can do it - there is just too much), and it would figure out connections in ways that LLMs cannot. Maybe there are algorithms that can do similar - or at the very least do more than LLMs can do with that data.
Re: (Score:2)
It is a mix: you need algorithms that can do more with a smaller, tailored set of training data, but if they think AGI is going to come from the existing algorithms, much larger data sets are needed.
Re: (Score:2)
but if they think AGI is going to come from the existing algorithms
I am sure there are those who believe this, but I am not one of them. New algorithms are required. There never will be a sufficiently large data set for the existing algorithms to have AGI. The existing data contains a large chunk of human knowledge. More data is not needed, just better use of the data that is there. The algorithms themselves need to figure out (or be somehow taught) how to recognize high-quality vs. low-quality data and exploit that information appropriately.
Re: (Score:2)
The algorithms themselves need to figure out (or be somehow taught) how to recognize high-quality vs. low-quality data and exploit that information appropriately.
This rapidly gets into questions of self-awareness though. It seems like the algorithm needs to be self-aware, but we don't even know what that means. We can certainly make something that appears to be self-aware (LLMs in chatbot format have characteristics of this), but does making something simulate self-awareness in more and more convincing ways eventually lead to real self-awareness, and then a singularity point at which it is able to learn and think, etc.? Or is there something more fundamental missing?
In
Re: (Score:2)
AGI requires self-awareness. Otherwise it isn't AGI. I don't really think this ought to be a goal as it seems too far away. AGI isn't needed for useful applications.
I personally think that somehow these algorithms need to be able to recognize high-quality data from lower-quality and "concentrate" their training on the high-quality. Today, if a training set contains 50 copies of a low quality statement and 1 copy that convincingly debunks the other 50, my bet is that the 50 will be far more likely to be used i
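One sketch of that idea: sample training examples in proportion to a quality score instead of uniformly, so a single well-sourced debunking isn't drowned out by the 50 copies of the low-quality claim. The scores here are hypothetical placeholders for whatever curation signal you actually trust.

```python
# Quality-weighted sampling of training data (toy example).
import random

dataset = [{"text": "low-quality claim", "quality": 0.1}] * 50 + \
          [{"text": "well-sourced debunking", "quality": 0.9}]

def sample_batch(data: list[dict], batch_size: int) -> list[dict]:
    """Draw a batch with probability proportional to each sample's quality."""
    weights = [d["quality"] for d in data]
    return random.choices(data, weights=weights, k=batch_size)

batch = sample_batch(dataset, 100)
hits = sum(1 for d in batch if d["text"] == "well-sourced debunking")
# Uniform sampling would pick the debunking ~2% of the time; weighting
# raises that to ~15% here, and a stronger quality signal raises it further.
print(f"{hits} of 100 draws are the debunking")
```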
Re: (Score:2)
Indeed, and the only way I can see them doing this by themselves is that they need the ability to construct an internally consistent model of the world (or whatever it is we want them to do) from the data fed into them, and then evaluate that model against new data and adjust it as required. But this is just so much more complex than what LLMs are doing. To solve any sort of real world task it needs this absurd layer upon layer of concepts about the world it is trying to interact with.
I doubt any o
Re: (Score:2)
I'm pretty sure AGI won't come from transformers trained on text.
Ok, here's an idea... (Score:2)
Re: (Score:2)
Train your LLM on Dad jokes?
Re: (Score:2)
"proprietary datasets" (Score:5, Informative)
"Proprietary datasets, like those that come from businesses' data, may hold the key to the data hole."
And now you know why Microsoft and Google are doing everything they can to force you at gunpoint to store your confidential and proprietary business information, personal financial information, personal health information, personal letters, photos, and everything else that should be held in strict confidence, in their cloud for "FREE" for some reason.
Re: (Score:2)
We have recently had Big AI (yeah, we should start calling it that) admit that their models produced copyright-violating output, not surprising when you use copyrighted information for training. I imagine corporations are going to be very leery about allowing sensitive proprietary data to be sucked into publicly accessible models. Their competitors will hit the models with advanced prompt engineering to exfiltrate that data for their own perusal.
Business data would be great for building models that the business i
Re: (Score:2)
We have recently had Big AI (yeah, we should start calling it that) admit that their models produced copyright-violating output,
When, and who?
Is it Halloween already ? (Score:2)
The LLM zombie is yearning for more brains.
It's not about the quantity (Score:4, Insightful)
but the quality.
It's not going to become intelligent by training on loads of nonsense as data.
It can't even reason about what's wrong and correct.
Re: (Score:2)
"Quantity has a quality of its own."
--Thomas A. Callaghan Jr., U.S. defense consultant in the 1970s and 80s. (Often misattributed to Joseph Stalin or Gen. George Patton)
So the entirety of human knowledge wasn't enough? (Score:3, Interesting)
So, training LLMs on the sum of human knowledge wasn't enough to make a human-equivalent AI? The human brain runs on ~20 watts of power; OpenAI/Oracle/SoftBank are building five data centers that will use ~10 gigawatts of power, the equivalent of ~500 million of those 20-watt human brains. Guess it's back to the drawing board on how to create a human-level AI?
Re: (Score:2)
So, training LLMs on the sum of human knowledge
It's complete shit. I have a couple of favorite questions I ask search engines, LLMs, and whatnot. None of them gets the answer right; they base their answers on incorrect references in more recent sources. As far as they are concerned, nothing existed before Google started scraping the Internet. Even if the online source references the correct source, the LLM can't figure that out.
Re: (Score:2)
I'm looking for an old paperback book about Einstein published in the 1970s which was in a graphic novel style.
I need a list of childrens encyclopedia published in England in the early 1970s
Tell me the plot of the eleventh book in [scifi series]. (There are only 10 books in that series).
I'm looking for a childrens book where there's a character called Cherry, and the kids help an old lady to find a missing will.
I'm looking for the name of
Re: (Score:2)
Re: (Score:2)
If someone gave you all of human knowledge in written form, but you had no other sensory experiences, do you think you could learn as well as a person who learns using all senses?
AI has already run out of *easily accessible* data (Score:2)
FTFY
Digitizing old books (Score:2)
There are millions of books, magazines, newspapers, speeches, inscriptions, recordings which are in the public domain and have not been digitized.
Re: (Score:2)
So, since copyright is good for 90+ years these days (or life of author + 70 yrs more, sometimes), it might be interesting to have LLMs based on ye olde language only. It might be necessary if the copyright Nazis totally prevail.
A pre-1930s American English LLM might sound a little stilted, but would have Einstein's relativity baked in (together with silent film sensibility, at least). Earlier instantiations might talk like Shakespeare, or talk like a pirate.
With sufficient powers of "reasoning", perhaps a goo
Re: (Score:2)
low hanging fruit has been picked (Score:2)
Initially it was scraping up all human knowledge and interaction and using that to train models that can mimic super-knowledgeable humans in situations where the training data applies. Now the researchers are focusing on reasoning ability beyond the learned patterns: AI that has a "world model", can be trained on physical environments, and has situational awareness.
https://ancillary-proxy.atarimworker.io?url=https%3A%2F%2Farstechnica.com%2Fai%2F202... [arstechnica.com]
Unpaywalled (Score:3)
Unpaywalled story. [archive.is]
Get yourself a printed encyclopedia (Score:3)
AI (Score:2)
If the AI was actually intelligent, you wouldn't need any more training data than the average person comes into contact with in a few short years of life.
Just saying.
How?? (Score:2)
Getting past typeahead (Score:2)
which is all that the chatbots are.
The idea of trying to train them on the real world seems to have passed over their heads unnoticed.