Why machine learning quants need ‘golden’ datasets
An absence of shared datasets is holding back the development of ML models in finance
Today’s computers are able to tell the difference between all manner of everyday things – cats and dogs, fire hydrants and traffic lights – because individuals have painstakingly catalogued 14 million such images, by hand, for the computers to learn from. Quants think finance needs something similar.
The labelled pictures used to train and test image recognition algorithms sit in a publicly available database called ImageNet. It’s been critical in making those algos better. Developers are able to benchmark their progress by their success rate in categorising ImageNet pictures correctly.
Without ImageNet, it would be far tougher to tell whether one model was beating another.
Finance is no different. Like all machine learning models, those used in investing or hedging reflect the data they have learnt from. So comparing models that have been trained on different data can tell quants lots about the data, but far less about the models themselves.
Measuring a firm’s machine learning model against other known models in the industry, or even against different models from the same organisation, becomes all but impossible.
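A toy illustration of the problem, offered as a sketch rather than anything taken from the research described here: the function names, volatility levels and seeds below are all hypothetical, and the "models" are two trivial forecasting rules. Scoring the same rule on a calm and a stressed synthetic return series gives very different errors, so a gap in scores across datasets reflects the data. Only when both rules are scored on the same series does the gap reflect the rules themselves.

```python
import random


def synthetic_returns(n, vol, seed):
    """Independent Gaussian returns; a deliberately crude stand-in for market data."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, vol) for _ in range(n)]


def mse(model, returns):
    """Mean squared error of one-step-ahead forecasts, where model(prev) predicts the next return."""
    errors = [(model(prev) - nxt) ** 2 for prev, nxt in zip(returns, returns[1:])]
    return sum(errors) / len(errors)


def zero_model(prev):
    """Always forecast a zero return."""
    return 0.0


def momentum_model(prev):
    """Forecast that the last return repeats."""
    return prev


# Hypothetical datasets: a calm regime and a stressed regime.
calm = synthetic_returns(5000, vol=0.005, seed=1)
stressed = synthetic_returns(5000, vol=0.03, seed=2)

# Same model, different data: the score gap comes from the data alone.
same_model_calm = mse(zero_model, calm)
same_model_stressed = mse(zero_model, stressed)

# Different models, same data: the score gap now says something about the models.
zero_on_stressed = mse(zero_model, stressed)
momentum_on_stressed = mse(momentum_model, stressed)

print("zero model, calm vs stressed:", same_model_calm, same_model_stressed)
print("stressed data, zero vs momentum:", zero_on_stressed, momentum_on_stressed)
```

This sketch only varies the evaluation data. Varying the training data as well, as the article describes, adds a second source of differences that has nothing to do with the model design.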
The idea, then, is to create shared datasets that quants could use to weigh models one against another. In finance, it’s a more complex task than just collecting and labelling pictures, though.
For one, banks and investing firms are reluctant to share proprietary data – sometimes due to privacy concerns, often because the data has too much commercial value. Such reticence can make collecting raw information for benchmark datasets a challenge from the start.
Secondly, the new “golden” datasets would need masses of data covering all market scenarios – including scenarios that have never actually occurred in history.
This is a well-known problem for machine learning models trained on historical data: in financial markets, the future seldom looks like the past.
“If the dataset you train your model on resembles the data or scenarios it encounters in real life, you’re in business,” says Blanka Horvath, professor of mathematical finance at the Technical University of Munich. “If it’s significantly different, you don’t know what the model is going to do.”
The solution to both problems, quants think, could be to create some of the benchmark data themselves.
Horvath, with a team at TUM’s Data Science Institute, has launched a project called SyBenDaFin – synthetic benchmark datasets for finance – to do just that.
The plan is to formulate gold standard datasets that reflect what happened in markets in the past but also what could have happened, even if it didn’t.
Synthesising data in this way is increasingly common in finance. Horvath, in another project, carried out tests on machine learning deep hedging engines, for example, by training a model on synthetic data and comparing its output against a conventional hedging approach.
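The kind of comparison described above can be sketched in a few lines, with heavy caveats. Everything here is an assumption made for illustration: the paths come from a geometric Brownian motion with zero interest rates, which is far simpler than anything a real benchmark would use, the parameter values are arbitrary, and no model is trained. The "conventional" approach is a Black-Scholes delta hedge, and the alternative is simply not hedging. A learnt hedging model could be passed in as another `strategy` function with the same signature. The point is only that a fixed seed gives every strategy the same synthetic scenarios to be judged on.

```python
import math
import random
import statistics


def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_d1(s, k, sigma, tau):
    """Black-Scholes d1 with zero interest rates (an assumption of this sketch)."""
    return (math.log(s / k) + 0.5 * sigma ** 2 * tau) / (sigma * math.sqrt(tau))


def bs_call_price(s, k, sigma, tau):
    """Black-Scholes call price with zero interest rates."""
    d1 = bs_d1(s, k, sigma, tau)
    return s * norm_cdf(d1) - k * norm_cdf(d1 - sigma * math.sqrt(tau))


def bs_delta(s, k, sigma, tau):
    """Conventional hedge: hold the Black-Scholes delta of the call."""
    return norm_cdf(bs_d1(s, k, sigma, tau))


def no_hedge(s, k, sigma, tau):
    """Baseline strategy: hold nothing."""
    return 0.0


def make_benchmark_paths(n_paths, n_steps, s0, sigma, horizon, seed):
    """Simulate driftless geometric Brownian motion paths; the seed makes them shareable."""
    rng = random.Random(seed)
    dt = horizon / n_steps
    paths = []
    for _ in range(n_paths):
        path = [s0]
        for _ in range(n_steps):
            z = rng.gauss(0.0, 1.0)
            path.append(path[-1] * math.exp(-0.5 * sigma ** 2 * dt + sigma * math.sqrt(dt) * z))
        paths.append(path)
    return paths


def hedging_errors(paths, strategy, k, sigma, horizon):
    """Terminal profit and loss per path for a short call, funded by its premium and hedged by `strategy`."""
    premium = bs_call_price(paths[0][0], k, sigma, horizon)
    errors = []
    for path in paths:
        n_steps = len(path) - 1
        dt = horizon / n_steps
        pnl = premium
        for i in range(n_steps):
            tau = horizon - i * dt  # time left to expiry at the start of step i
            pnl += strategy(path[i], k, sigma, tau) * (path[i + 1] - path[i])
        pnl -= max(path[-1] - k, 0.0)  # pay out the option at expiry
        errors.append(pnl)
    return errors


# Hypothetical benchmark: 2,000 three-month paths, rebalanced 50 times.
paths = make_benchmark_paths(2000, 50, s0=100.0, sigma=0.2, horizon=0.25, seed=42)

results = {}
for name, strategy in [("delta hedge", bs_delta), ("no hedge", no_hedge)]:
    errors = hedging_errors(paths, strategy, k=100.0, sigma=0.2, horizon=0.25)
    results[name] = statistics.pstdev(errors)
    print(name, "std of hedging error:", results[name])
```

The standard deviation of the hedging error is used here as the score because it is easy to read. A real benchmark would likely fix a richer set of metrics alongside the data.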
Quants say it would be too complex to formulate a universal dataset comparable to ImageNet for all types of finance models.
The market patterns that would test a model that rebalances every few seconds, for example, would be different from events that would challenge a model trading on a monthly horizon.
Instead, the idea would be to create multiple sets of data, each designed to test models created for a specific use.
Benchmarks could help practitioners grasp the strengths and weaknesses of models as well as whether changes to a model bring improvement or not.
Regulators, too, stand to benefit. Potentially, they could train models using the gold standard data and see how well they perform versus the same model trained on a firm’s in-house data.
In a paper last year, authors from the Alan Turing Institute and the Universities of Edinburgh and Oxford said the industry today had little understanding of how appropriate or optimal different machine learning methods were in different cases. A “clear opportunity” exists for finance to use synthetic data generators in benchmarking, they wrote.
“Firms are increasingly relying on black-box algorithms and methods,” says Sam Cohen, one of the authors and an associate professor with the Mathematical Institute at the University of Oxford and the Alan Turing Institute. “This is one way of verifying our understanding of what they are actually going to do.”