Empirical science is reeling. In the past five years, in fields from biomedicine to social psychology, top journals have upended canonical studies by showing their results cannot be reproduced.
A review of 100 major psychology studies, for instance, found that only 36% of the replication attempts produced statistically significant results. Over half the alien planets identified by Nasa’s Kepler telescope turned out to be stars. And in preclinical cancer research, a mere six out of 53 breakthrough studies were found to be reproducible.
Quantitative finance does not fare much better.
“It’s a gigantic problem – spurious results are the norm,” says Zak David, co-founder of the analytics firm Mile 59, and former engineer of high-frequency trading software. David says he has tried to replicate any number of studies in the past decade, and been “consistently disappointed.”
Data-driven traders and quant researchers rely on the same processes of statistical inference and number-crunching as scientists, to design investment strategies and suss out the things that drive returns. But the field of quantitative finance is now littered with the stiffened detritus of data mining.
How bad is it? Take factor investing, which sidesteps traditional analysis of companies and stocks and instead looks at a quantitative selection of securities with shared characteristics – factors – that purportedly drive above-market returns. Factors underpin quant investment strategies, risk premia products and smart beta exchange-traded funds.
But a 2014 study found that approximately half the discoveries of factors could not be replicated.
A follow-up study found that the risk premia associated with 85 of the 99 most recent factor “discoveries” correlated with previously identified factors, and therefore provided little diversification in terms of risk. For example, a “new” factor proposed by Steven Heston and Ronnie Sadka relating to the seasonality of the stock market was found to be highly correlated with a six-month momentum factor proposed in a 1993 study.
In yet another indication, two out of five bank risk premia products built on factors have been withdrawn in recent years, raising questions about poor performance related to “overfitting,” where strategies are engineered to look strong in backtests but underperform once launched.
The causes given for unreliable quant research land broadly in four camps: there is way too much data; there is no human input; there is not nearly enough data; forget humans – machines are smarter.
Monkey and machine
Many criticisms of Wall Street’s unreliable data results parallel the infinite monkeys theorem – the idea that an infinite number of monkeys with infinite time could end up randomly typing a work by Shakespeare.
John Ioannidis, a professor of medicine at Stanford University and one of the leading whistleblowers on spurious research, asserts that as much as nine-tenths of medical research may be based on false information. He attributes that to data dredging, where scientists scour databases for patterns, with no hypothesis in mind.
Because there is no starting idea, “people can go wild about mining these datasets,” he says. “Nothing is shared. It’s just a scientist with his fellows, data dredging day and night, and producing unreproducible results.”
The same can be said of attempts to glean patterns from large datasets in quantitative finance, which have likewise churned up false positives.
“If you compare enough datasets with other datasets, you’ll find one which appears to predict another,” says Tom Howat, chief technology officer at GAM Systematic’s Cantab team. “But because you started with so many contenders, one was bound to appear to predict another, when in fact it has little or no predictive power.”
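Howat's point can be illustrated with a small simulation. The sketch below is hypothetical and not drawn from any firm's research: it generates a random "target" series and a pool of equally random "candidate" series, then reports the strongest correlation found. The function names, the pool size of 1,000 and the 50 observations per series are all assumptions chosen for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def best_spurious_correlation(n_candidates=1000, n_obs=50, seed=0):
    """Largest |correlation| between a random target and purely random candidates."""
    rng = random.Random(seed)
    # The target is pure noise, so no candidate can genuinely predict it.
    target = [rng.gauss(0, 1) for _ in range(n_obs)]
    best = 0.0
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(n_obs)]
        best = max(best, abs(pearson(candidate, target)))
    return best


single = best_spurious_correlation(n_candidates=1)
many = best_spurious_correlation(n_candidates=1000)
print(f"best |corr| with 1 candidate:     {single:.2f}")
print(f"best |corr| with 1000 candidates: {many:.2f}")
```

Because every series is noise, any apparent relationship is spurious by construction. The best correlation can only grow as more candidates are tried, which is the trap Howat describes.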
Anthony Morris, Nomura’s global head of quantitative strategies, says the use of artificial intelligence and machine learning to resolve non-linear relationships makes the problem “a hundred times worse because there are so many more degrees of freedom to play around with”.
Meanwhile, George Mussalli, chief investment officer at quant hedge fund PanAgora, believes the rise of alternative datasets has exacerbated the problem. Asset managers spent $400 million on alternative data last year, an amount projected to exceed $1 billion by 2019.
“It becomes a bigger danger when using machine learning techniques on bigger datasets. You can find more spurious patterns,” he says. The sheer volume of data – for example, credit card information on 10 million people can amount to terabytes of data – becomes its own trap, he adds.
Regardless, data today is a fact on the ground from which there is no going back. So, many quants are building methodologies to diminish the incidence of false positives.
Canadian economist Campbell Harvey was one of the authors of the 2014 study, which found that of 296 published papers seeking to explain returns in equities, 158 were false positives. Harvey, a professor of finance at Duke’s Fuqua School of Business, has since been leading the call for financial researchers to raise the level of statistical significance of their findings.
Currently, quant researchers use a metric called the t-statistic, with a value of about 2 conventionally taken to indicate a 95% level of confidence in their results. Harvey wants to raise the threshold to over 3, or three standard deviations (3σ), which implies a confidence level above 99%.
Amit Goyal is an early adopter of the higher t-stat. “Two has been the golden standard for a very long time,” says Goyal, a finance professor at the University of Lausanne. “Physicists use 5. In finance, we are advocating between 3 and 4.”
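To put those thresholds in perspective, the sketch below converts a t-statistic cut-off into a two-sided confidence level using the normal approximation, which is reasonable for large samples. It also estimates how many false positives would be expected if a hypothetical batch of 300 independent tests were run on data with no real effect. The function names, the batch size and the independence assumption are illustrative and do not come from Harvey's or Goyal's work.

```python
import math


def two_sided_confidence(t_stat):
    """Confidence level implied by a threshold, under the normal approximation."""
    # P(|Z| < t) for a standard normal Z equals erf(t / sqrt(2)).
    return math.erf(t_stat / math.sqrt(2))


def expected_false_positives(n_tests, t_stat):
    """Expected chance 'discoveries' among n_tests independent tests of no real effect."""
    return n_tests * (1 - two_sided_confidence(t_stat))


N_TESTS = 300  # hypothetical number of factors tried, assumed independent

for t in (2, 3, 4, 5):
    conf = two_sided_confidence(t)
    fp = expected_false_positives(N_TESTS, t)
    print(f"t > {t}: confidence {conf:.5%}, expected false positives {fp:.4f}")
```

Under these assumptions, a threshold of 2 lets through roughly 14 chance findings out of 300 tries, while a threshold of 3 lets through fewer than one.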
Another metric used to gauge the robustness of findings is the p value, i.e., the probability that results at least as strong would occur by pure chance if there were no real effect. A threshold of 0.05 means accepting a 5% probability of a false positive when no real effect exists. Data mining is sometimes referred to as p-hacking, since researchers sift the data to get as small a p value as possible.
But there is a problem, says Bryan Cross, head of quantitative evidence and data science at UBS Asset Management: there is debate on what p value actually means.
It is generally thought that the smaller the p value, the better the model. But there is disagreement as to whether p values are a good way to quantify uncertainty at all, since there is uncertainty in the calculation of the p value itself.
Instead of using the “frequentist” approach of fixed confidence levels like t-stats and p values, UBS Asset Management has turned to a Bayesian approach, which Cross says is better understood by fundamental analysts.
A frequentist interprets probability as the frequency of an outcome recurring in repeated experiments. A Bayesian approach is more in line with human understanding of probability as the belief in the likelihood of an outcome, and allows for the inclusion of prior knowledge in the calculation of a probability.
“Instead of saying, ‘Sales are going to go up 10% this quarter with a t-statistic of 2,’ we can show the fundamental analyst a distribution of outcomes based on the data,” Cross says. For example, he can say: “There’s a 60% chance that sales will increase this quarter.”
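A statement of that kind can come out of even a very simple Bayesian model. The sketch below is a textbook Beta-Bernoulli update and is not UBS Asset Management's method: the prior, the count of past quarters and the function names are all hypothetical. It treats each quarter as a rise or a fall in sales and updates a prior belief about the chance of a rise.

```python
def beta_posterior(prior_a, prior_b, rises, falls):
    """Update a Beta(prior_a, prior_b) belief after observing rises and falls."""
    return prior_a + rises, prior_b + falls


def probability_of_rise(a, b):
    """Posterior mean of a Beta(a, b), i.e. the predicted chance of a rise next quarter."""
    return a / (a + b)


# Hypothetical data: sales rose in 5 of the last 8 quarters.
RISES, FALLS = 5, 3

# A flat prior, Beta(1, 1), encodes no initial view either way.
flat = beta_posterior(1, 1, RISES, FALLS)
print(f"flat prior:       {probability_of_rise(*flat):.2f}")  # 0.60

# An analyst who already expects growth might start from Beta(8, 2) instead.
optimistic = beta_posterior(8, 2, RISES, FALLS)
print(f"optimistic prior: {probability_of_rise(*optimistic):.2f}")
```

The second call shows where prior knowledge enters: the same data produce a higher probability when the starting belief already leans towards growth.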
But not everyone thinks these precautions are necessary. Jeff Holman, chief investment officer of Sentient Investment Management, notes that unlike science, finance is a Darwinian arena, where natural selection will quickly annihilate those who invest on shaky data.
“Financial journals should have higher standards. That makes sense. But who cares?” Holman says. “I think ultimately it’s self-policing in the real world: I’m going to lose money on my trade and go out of business.”
But shouldn’t there be a human idea?
In traditional scientific methodology, a researcher puts forward a hypothesis before conducting any experiment. Jai Jacob, a managing director leading Lazard Asset Management’s multi-asset investment team, says the practice of mining datasets without an economic rationale leads to a proliferation of false positives. Lazard’s fundamental analysts are now required to submit a formal hypothesis if they want the quant team’s help.
“A lot of times we have a request that says, ‘Let’s get all the world’s credit card information,’ but I don’t want people on my team to be just swimming through datasets,” Jacob says. The analysts “have to be specific and they have to be comfortable with the null hypothesis”, referring to an outcome of no statistical significance.
Besides an economic idea, quants could come up with measurable corollaries that should also be true if the hypothesis is true, suggests John Fawcett, co-founder of the crowdsourced hedge fund Quantopian.
“If you have an economic rationale that there should be a correlation with interest rates, you can then do experiments where you’ll simulate what would happen if the rates spike or drop, and then your model should explain that behaviour,” he says.
But others say the very point of machine learning is precisely that it goes beyond human cognition. Holman at Sentient Investment notes the role of emotion.
“Unfortunately, we can rationalise almost anything. It’s motivated cognition,” he says. “You spend all this time on finding this pattern, so of course, you can come up with a rationale that makes sense.”
Humans can be hindered by their own biases and the limits of their knowledge, Holman says. In addition, they might not pick up on highly complex, sometimes non-intuitive drivers of the markets.
“If you restrict yourself to only trading in cases where you have an intuition, I think you’re leaving a lot of money on the table,” he says.
Testing the tests
One statistical technique that has emerged to combat overfitting is out-of-sample testing, i.e. testing and training quantitative models on one set of data and then validating results on another. Depending on the model, the data can be split into time periods and run on various assets, as well as across different countries and markets.
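A minimal sketch of the idea, under loud assumptions: the "strategies" below are nothing but random daily returns, the first half of each series serves as the in-sample period and the second half as the out-of-sample period. The function name, the 200 strategies and the 500 days are hypothetical choices for illustration.

```python
import random
import statistics


def select_and_validate(n_strategies=200, n_days=500, seed=1):
    """Pick the strategy with the best in-sample mean return, then check it out of sample."""
    rng = random.Random(seed)
    split = n_days // 2
    best_in, best_out = float("-inf"), 0.0
    for _ in range(n_strategies):
        # Pure-noise daily returns: no strategy has any real edge.
        returns = [rng.gauss(0, 0.01) for _ in range(n_days)]
        in_sample = statistics.mean(returns[:split])
        out_of_sample = statistics.mean(returns[split:])
        if in_sample > best_in:
            best_in, best_out = in_sample, out_of_sample
    return best_in, best_out


best_in, best_out = select_and_validate()
print(f"winner's mean daily return, in sample:     {best_in:+.5f}")
print(f"winner's mean daily return, out of sample: {best_out:+.5f}")
```

The in-sample winner looks profitable because it was selected for looking profitable. Its out-of-sample figure is just another draw of noise, which is the gap the held-out data is meant to expose.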
But the results can be apples and oranges. For instance, many quant strategies are based on American economic or market data due to ease of access. “But try that investment strategy on Japanese data and it often fails,” Nomura’s Morris says. “The markets can be quite different.”
The real problem is that even out-of-sample testing relies to some extent on events of the past. “I can come up with a strategy that happens to work on a testing period and an out-of-sample period,” Morris says. “As long as they’re both from the past, it doesn’t really mean anything. For something to be truly out of sample, it needs to be a dataset that you’ve never seen before.”
Nonetheless, Quantopian’s Fawcett believes out-of-sample testing can work in some instances. For example, a model based on high-frequency data over a several-month time period produces a strong statistical result on whether the strategy has predictive power, he says.
“It gets more difficult when you have less and less frequent data. So it is not a cure-all for all of this,” he says. “I personally think the ultimate thing would be to publish both the data and the source code that codifies the theory, because then peers could evaluate whether you have enough data to either validate or invalidate that paper or that theory,” he adds.
More is more
Yet others think the problem is not curating the data sample or nailing down a hypothesis, but getting enough data points. The head of quantitative strategy at one systematic hedge fund says the trick is choosing the right statistical method for the question at hand, and that often depends on the size of the dataset.
“You don’t always need a sound economic thesis, you just need to have enough data,” he says. Sparse data may not provide “enough observations”.
Nonetheless, most quants have faith that more stringent statistical techniques will reduce false positives and uncover data patterns unlikely to be found by human intelligence.
John Alberg, a co-founder of the machine learning hedge fund Euclidean Technologies, says moving to a higher level of statistical significance “doesn’t seem sustainable long term” as it will only serve to kick the can down the road as more factors are discovered.
“Instead of giving up on building better models from data,” Alberg says, “quants should borrow statistical tools from the world of machine learning to validate their results.”
For example, recently developed algorithms used in machine learning allow researchers to test the likelihood that their backtest is overfitted by repeatedly validating the model selection process on sub-samples of the data.
“In other fields like medicine, self-driving cars and language translation, researchers have been able to use machine learning to create complex models that are better than human performance. And we can do the same if we leverage tools that help us avoid overfitting.”
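The article does not name the algorithms in question. One possible reading of validating the model selection process on sub-samples is sketched below, loosely in the spirit of combinatorially symmetric cross-validation. The data are cut into blocks, every way of assigning half the blocks to the in-sample set is tried, and the sketch counts how often the in-sample winner lands below the median out of sample. The function name, the use of mean return as the performance measure and the block count are assumptions, not Alberg's or Euclidean's method.

```python
import itertools
import random
import statistics


def overfitting_probability(returns_by_strategy, n_blocks=6):
    """Share of in/out splits where the in-sample winner is below median out of sample."""
    n_obs = len(returns_by_strategy[0])
    block_len = n_obs // n_blocks
    blocks = [list(range(i * block_len, (i + 1) * block_len)) for i in range(n_blocks)]
    below_median, trials = 0, 0
    for in_ids in itertools.combinations(range(n_blocks), n_blocks // 2):
        in_idx = [t for b in in_ids for t in blocks[b]]
        out_idx = [t for b in range(n_blocks) if b not in in_ids for t in blocks[b]]
        # Performance here is simply the mean return over the chosen periods.
        in_perf = [statistics.mean([r[t] for t in in_idx]) for r in returns_by_strategy]
        out_perf = [statistics.mean([r[t] for t in out_idx]) for r in returns_by_strategy]
        winner = max(range(len(in_perf)), key=in_perf.__getitem__)
        if out_perf[winner] < statistics.median(out_perf):
            below_median += 1
        trials += 1
    return below_median / trials


# Hypothetical input: 50 strategies of pure noise over 240 periods.
rng = random.Random(7)
noise = [[rng.gauss(0, 0.01) for _ in range(240)] for _ in range(50)]
print(f"noise strategies: {overfitting_probability(noise):.2f}")

# A strategy with a real, persistent edge should win both in and out of sample.
with_edge = [[1.0] * 12, [0.0] * 12, [-1.0] * 12]
print(f"one real edge:    {overfitting_probability(with_edge):.2f}")  # 0.00
```

A value near zero suggests the selection process picks strategies that hold up on unseen data. For pure noise the figure tends to sit near one half on average, though any single run can vary considerably.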
In the meantime, the soul searching in financial research continues – as does Wall Street’s romance with machines.
The granularity of data “has become finer,” notes David of Mile 59, “and the ubiquity of machine learning software packages has placed predictive modeling in the hands of anyone with a pulse.”
Copyright Infopro Digital Limited. All rights reserved.