Why know-it-all LLMs make second-rate forecasters
A bevy of experiments suggests LLMs are ill-suited for time-series forecasting
In early August, OpenAI released GPT-5, the latest version of its large language model (LLM). The promise is of an AI assistant as knowledgeable as a PhD on most topics, able to code a sleek web app in minutes, more accurate than most human doctors at responding to medical queries, and so on.
GPT-5 and LLMs of its type know most of what there is to know. For some applications in quantitative finance, though, such breadth of learning turns out to be a weakness.
Practitioners in financial markets and academics alike have been working on ways to use LLMs to forecast time-series data such as the path of inflation or interest rates or stock prices. Increasingly, it’s becoming clear the models are poorly suited to the task.
In a study last year by academics at the University of Virginia and the University of Washington, researchers found that models with an LLM component removed performed just as well on a range of forecasting tasks as the same models with the component included.
They train on as much data as possible going back in time – data that may no longer be relevant. They can’t really adapt
Alexander Denev, Turnleaf Analytics
When the academics jumbled the time-series inputs to their models it made no more difference to an LLM versus other model types, suggesting the language models were showing no special understanding of sequential patterns in the data.
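The jumbling experiment the academics ran can be sketched in a few lines. This is a hypothetical illustration, not the study's actual code: `model_fn` stands in for any forecaster that returns an error score, and the test simply compares its error on ordered versus randomly permuted inputs. A model that genuinely exploits temporal structure should degrade sharply on the shuffled version.

```python
import numpy as np

def shuffled_input_ablation(model_fn, X, y, seed=0):
    """Compare a forecaster's error on time-ordered inputs vs. randomly
    permuted inputs. model_fn(X, y) is any callable returning an error
    score (hypothetical interface, for illustration only)."""
    rng = np.random.default_rng(seed)
    base_err = model_fn(X, y)            # error with temporal order intact
    X_shuf = X[rng.permutation(len(X))]  # destroy the temporal ordering
    shuf_err = model_fn(X_shuf, y)       # error once sequence info is gone
    return base_err, shuf_err
```

The study's finding, in these terms, was that the gap between the two error numbers was no wider for the LLM-based models than for other model types.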
“We’ve seen a lot of evolution in LLMs, and if you have a hammer every problem starts to look like a nail,” says Alexander Denev, co-founder of Turnleaf Analytics, a macro and inflation forecaster that uses machine learning and alternative data.
Denev and his colleagues ran inflation forecasting tests, comparing so-called zero-shot time-aware LLM models from Google and Amazon versus generic tools like ChatGPT and the firm’s own, much simpler, machine learning models. “They cannot be compared,” Denev says. “The errors of these LLM models are very large.”
To be clear, quants including Denev say LLMs are fantastically useful for some things: digging up obscure datasets from the internet or getting up to speed on an unfamiliar topic, say.
When it comes to predicting the future, though, they sink under the weight of their knowledge. “They train on as much data as possible going back in time – data that may no longer be relevant,” Denev says. “They can’t really adapt.”
Some machine learning scientists had hoped that LLMs would be able to forecast better with limited data, and perhaps apply reasoning to make more accurate predictions. Optimists had LLMs lined up to solve forecasting tasks using data ranging from electricity consumption to weather patterns and, in finance, all manner of macro or market data.
Further investigation, though, has exposed what looks like a killer weakness: the LLMs’ inability to unknow what they know.
In backtests of a forecasting model, quants need to check how well the model makes predictions using only the information it would have possessed at the time. And because LLMs are trained with data up to the present moment, quants cannot be sure the model isn’t leveraging that knowledge when undergoing such scrutiny. It’s a familiar problem for quants, who take great care to eliminate so-called “look-ahead” or “peek-ahead” bias in their tests.
A simple model with fewer parameters is quicker to change by construction
Alexander Denev, Turnleaf Analytics
LLMs “don’t have a point-in-time notion”, Denev says.
A second problem arises from the non-stationarity of financial markets data, meaning the speed, frequency and degree with which market patterns break with the past. Languages change only slowly. The Oxford English Dictionary adds a few hundred new words a year to a lexicon of more than half a million. In finance, something like US tariffs can change overnight how economies and markets move.
LLMs are trained on masses of data to develop a highly accurate picture of the status quo. That means they must ingest a lot of new data before they will recognise that the status quo has changed.
Simpler models with fewer elements to retune to new data adapt faster. “A simple model with fewer parameters is quicker to change by construction,” Denev says. Gold stocks jumped higher in January as investors sought to hedge against inflation. A model that uses gold stocks and a few other datasets to forecast inflation will pick up such changes more quickly, he says.
The time and cost of running the more complex models are not matched by their value. “It’s overkill,” he says. “They are computationally expensive and difficult to train for a marginal or no advantage.”
Methods exist that could ameliorate the shortcomings. One approach in backtesting, for example, is to build simpler models from scratch, train them up to a specific point in time, test their prediction ability, then train a step further, test again, and so on. To make this feasible, though, a model would have to be basic.
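The train-test-extend procedure described above is the standard walk-forward backtest. A minimal sketch, using a generic ridge regression as a stand-in for whatever simple model a quant might choose (all names here are illustrative, not from the article):

```python
import numpy as np
from sklearn.linear_model import Ridge

def walk_forward_backtest(X, y, initial_window, step=1):
    """Walk-forward evaluation: at each step, fit only on data available
    up to time t, predict the next observation, then extend the window.
    X and y are aligned, time-ordered arrays."""
    errors = []
    for t in range(initial_window, len(y), step):
        model = Ridge(alpha=1.0)          # simple model, cheap to refit
        model.fit(X[:t], y[:t])           # train strictly on the past
        pred = model.predict(X[t:t + 1])  # predict the next point only
        errors.append(abs(pred[0] - y[t]))
    return float(np.mean(errors))         # out-of-sample mean absolute error
```

The key property is that every prediction uses only past data, which is exactly the point-in-time discipline an LLM trained on data "up to the present moment" cannot offer.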
Jeff Shen, co-chief investment officer in the systematic investment team at BlackRock, likens this to training a model to the level of a high-school student. Doing so would give up much of the promise of LLMs, which can reach PhD-level expertise in certain tasks.
BlackRock, after experimenting with LLM forecasters, has chosen to focus more attention on a different “more promising” machine learning technology: online learning. These models update themselves continuously as they ingest new information, rather than training in one big exercise.
The trick, Shen says, is to create a model that can make judgments about how far new information should lead it to revise its parameters.
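The idea of a model that judges how far new information should move its parameters can be sketched with a simple online learner. This is purely illustrative and in no way BlackRock's method: a linear model updated one observation at a time, whose step size grows when the latest error looks surprising relative to recent errors, so it revises faster after a structural break.

```python
import numpy as np

class OnlineForecaster:
    """Hypothetical online-learning sketch: a linear model updated per
    observation, with a step size scaled by how surprising the new data
    is relative to the recent error level."""
    def __init__(self, n_features, base_lr=0.01):
        self.w = np.zeros(n_features)
        self.base_lr = base_lr
        self.err_ema = 1.0  # running scale of typical errors

    def update(self, x, y):
        pred = self.w @ x
        err = y - pred
        # Surprise ratio: this error vs. what the model usually sees
        surprise = abs(err) / (self.err_ema + 1e-8)
        lr = self.base_lr * min(surprise, 10.0)  # revise faster after breaks
        self.w += lr * err * x                   # gradient step on squared error
        self.err_ema = 0.99 * self.err_ema + 0.01 * abs(err)
        return pred
```

There is no one-off training run: the model's parameters drift continuously with the data, which is the contrast Shen draws with LLMs trained "in one big exercise".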
BlackRock already does some of this, he says, tracking the Euclidean distance across 300 market variables to determine whether today’s markets share similarities with periods in history, and to update model parameters accordingly.
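The distance-based matching of today's market state against history reduces to a nearest-neighbour search. A minimal sketch of the idea (illustrative only; the variable count, standardisation and function names are assumptions, not BlackRock's implementation):

```python
import numpy as np

def nearest_regimes(today, history, k=5):
    """Find the k historical dates whose market state vector (e.g. 300
    standardised variables, one row per date) is closest to today's,
    by Euclidean distance."""
    dists = np.linalg.norm(history - today, axis=1)  # distance to each past date
    return np.argsort(dists)[:k]                      # indices of closest periods
```

The returned indices point at the historical periods most like the present, which can then be weighted more heavily when updating model parameters.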
For time-series forecasting, it seems, a model that knows the right thing is likely to beat a model that knows everything.
Editing by Kris Devasabai
Copyright Infopro Digital Limited. All rights reserved.