Machine-Learning-in-Assset-Pricing-Scholarly-Papers-with-Notes
This repo contains a list and constructive summaries of seminal and important scholarly research papers in the area of Machine Learning in Asset Pricing. The purpose of this initiative is to provide a comprehensive repo for the subject area. For an interactive reading experience, visit this repo’s GitHub Page. If you like the repo, please star it; this motivates us to update the repo frequently.
Asset Return Prediction
Paper | Summary | Data | Issues |
---|---|---|---|
Gu, Shihao, Bryan Kelly, and Dacheng Xiu. "Autoencoder asset pricing models." Journal of Econometrics 222, no. 1 (2021): 429-450. | Add Summary | | |
Gu, Shihao, Bryan Kelly, and Dacheng Xiu. "Empirical asset pricing via machine learning." The Review of Financial Studies 33, no. 5 (2020): 2223-2273 | Add Summary | | |
Chen, Luyang, Markus Pelger, and Jason Zhu. "Deep learning in asset pricing." Management Science (2023) | This paper proposes a GAN-based deep learning model for stock returns. The authors connect the no-arbitrage condition of asset pricing with deep learning: they use a generative adversarial network (GAN) to estimate the stochastic discount factor (SDF). The asset pricing modeler chooses a candidate SDF, while the adversary chooses the conditions under which that SDF performs worst. In each iteration, the adversary selects the moment conditions that lead to the largest mispricing, and the modeler revises the candidate SDF to account for them. In addition, they use an LSTM and take its final output as a summary of the state of time-varying macroeconomic conditions. | 46 time-varying, firm-specific characteristics and 178 macroeconomic time series | |
Leippold, Markus, Qian Wang, and Wenyu Zhou. "Machine learning in the Chinese stock market." Journal of Financial Economics 145, no. 2 (2022): 64-82 | | | |
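
The adversarial setup summarized in the Chen, Pelger, and Zhu row above can be written as a minimax problem. The following is a sketch of how this objective is usually stated, not a quotation from the paper. The notation is assumed here: $\omega$ is the modeler's SDF weight function, $g$ is the adversary's conditioning function, $I_t$ is macroeconomic information, $I_{t,i}$ are firm characteristics, and $R^e_{t+1,i}$ is the excess return of asset $i$.

$$
M_{t+1} = 1 - \sum_{i=1}^{N} \omega(I_t, I_{t,i})\, R^e_{t+1,i},
\qquad
\mathbb{E}\left[ M_{t+1}\, R^e_{t+1,j}\, g(I_t, I_{t,j}) \right] = 0 \quad \text{for all } j \text{ and all } g
$$

$$
\min_{\omega} \; \max_{g} \; \frac{1}{N} \sum_{j=1}^{N}
\left\| \mathbb{E}\left[ \left( 1 - \sum_{i=1}^{N} \omega(I_t, I_{t,i})\, R^e_{t+1,i} \right) R^e_{t+1,j}\, g(I_t, I_{t,j}) \right] \right\|^2
$$

The inner maximization is the adversary picking the moment conditions with the largest pricing error; the outer minimization is the modeler updating the SDF weights to shrink that error.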
Machine Learning in Asset Pricing — Literature Review
This repository presents a comprehensive collection of academic papers exploring the intersection of machine learning and asset pricing.
Each paper entry includes a collapsible summary with data descriptions, limitations, and potential future research directions.
Table of Contents
- Accounting / NLP
- Accounting / fraud detection (ML applied)
- Accounting / misstatement detection
- Alternative data / NLP
- Alternative data / computer vision (asset pricing application)
- Alternative data / narrative factors
- Alternative data / news NLP
- Corporate disclosure / NLP
- Factor construction / tree-based methods
- Factor discovery / representation learning/Latent factor models
- Fixed income / macro-finance
- Fixed income / prediction
- Fundamental analysis / ML
- High-frequency / microstructure ML
- Human+AI / analyst forecasting
- Macroeconomic expectations / ML
- Methodology / inference with ML
- Momentum / deep learning
- Prediction / economic restrictions
- Prediction / forecasting
- Prediction / forecasting (market-specific)
- Prediction / industry-focused methods
- SDF estimation / deep learning
- Structural estimation / deep learning
- Uncategorized
Accounting / NLP
[Brown, N. C., Crowley, R. M., & Elliott, W. B. (2019). What are You Saying? Using topic to Detect Financial Misreporting. Journal of Accounting Research.](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2803733) (Wiley / SSRN)
Year: 2019/2020 Category: Accounting / NLP
Summary: Introduces topic-based textual features to detect financial misreporting and shows topic models significantly improve out-of-sample detection performance. Demonstrates text topics add information beyond traditional style and financial features.
Data Used: 10-K filings text, SEC enforcement actions (AAERs), restatements, and financial statement data.
Challenges / Limitations:
- Topic models capture co-occurrence patterns but may miss nuanced or cleverly disguised language.
- Label and event-timing uncertainty for misreporting cases complicates evaluation.
- Language drift over time may require periodic model retraining.

Future Research Directions:
- Combine topic models with supervised deep-learning classifiers for richer feature sets.
- Develop time-adaptive topic models to handle language evolution.
- Evaluate cross-lingual transferability for international filings.
Accounting / fraud detection (ML applied)
Bao, Ke, Li, Yu, & Zhang. Detecting Accounting Fraud in Publicly Traded U.S. Firms Using a Machine Learning Approach. Journal of Accounting Research (2020).
Year: 2020 Category: Accounting / fraud detection (ML applied)
Summary: Uses ML classifiers to detect accounting fraud in US firms, leveraging financial statement features and textual signals; reports improved detection relative to standard benchmarks.
Data Used: Compustat/CRSP financials, audit reports, possibly text of filings; see paper for dataset and labeling procedure.
Challenges / Limitations:
- Label quality: fraud/misstatement labels are noisy and subject to detection biases.
- Class imbalance (fraud events are rare) complicates training and evaluation.
- Potential adverse incentives if algorithms are used operationally without human oversight.

Future Research Directions:
- Develop causal/interpretability tools to explain flagged cases to auditors.
- Use multi-source signals (text, network, alternative data) to improve robustness.
- Evaluate real-world deployment impacts on audit selection and false-positive costs.
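
The class-imbalance limitation listed for this entry is often handled by resampling the training set. Below is a minimal sketch of random undersampling of the majority class. It is an illustration only, not the procedure used in the paper; the function name `undersample_majority`, the `ratio` parameter, and the toy labels are all hypothetical.

```python
import random


def undersample_majority(labels, ratio=1.0, seed=0):
    """Keep every positive (fraud) case and a random subset of negatives.

    `ratio` is the number of negatives to keep per positive (hypothetical knob).
    Returns the sorted indices of the retained observations.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(neg), int(round(ratio * len(pos))))
    kept_neg = rng.sample(neg, n_keep)
    return sorted(pos + kept_neg)


# Toy data: 5 fraud firm-years among 1000 observations (made-up numbers).
labels = [1 if i % 200 == 0 else 0 for i in range(1000)]
idx = undersample_majority(labels, ratio=2.0, seed=42)
balanced = [labels[i] for i in idx]
print(sum(balanced), len(balanced) - sum(balanced))  # positives, negatives kept
```

Undersampling should be applied to the training sample only; evaluation on the untouched, imbalanced test sample keeps the reported detection rates honest.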
Accounting / misstatement detection
Bertomeu, Cheynel, Floyd, & Pan. Using machine learning to detect misstatements. Review of Accounting Studies (2020).
Year: 2020 Category: Accounting / misstatement detection
Summary: Applies various ML methods to detect financial misstatements and misreporting, comparing performance to traditional models and highlighting useful features.
Data Used: Financial statement data and enforcement/SEC restatement records; see paper for details.
Challenges / Limitations:
- False positives have real costs; need to balance precision vs recall in operational settings.
- Heterogeneity across firms/industries may reduce generalization of trained models.
- Regulatory and privacy constraints limit access to rich features in practice.

Future Research Directions:
- Better calibration and cost-sensitive learning tailored to audit priorities.
- Cross-firm transfer learning to improve small-sample performance.
- Integrate explainable AI for audit trail documentation.
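
The precision-versus-recall trade-off and the cost-sensitive learning direction mentioned for this entry can be made concrete with a threshold search. The sketch below picks the score cutoff that minimizes total misclassification cost on a toy sample. The function `best_threshold`, the cost values, and the scores are hypothetical and not taken from the paper.

```python
def best_threshold(scores, labels, cost_fp, cost_fn):
    """Return (threshold, cost, false_positives, false_negatives) with minimal cost.

    A firm is flagged when its score is >= threshold. The extra candidate
    float("inf") covers the option of flagging no one.
    """
    best = None
    for t in sorted(set(scores)) + [float("inf")]:
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        cost = cost_fp * fp + cost_fn * fn
        if best is None or cost < best[1]:
            best = (t, cost, fp, fn)
    return best


# Toy predicted misstatement probabilities and true labels (made-up numbers).
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.80, 0.90]
labels = [0, 0, 0, 1, 0, 0, 1, 1]

# Missing a misstatement is 5x as costly as a false alarm: low cutoff.
print(best_threshold(scores, labels, cost_fp=1, cost_fn=5))
# A false alarm is 5x as costly as a miss: high cutoff.
print(best_threshold(scores, labels, cost_fp=5, cost_fn=1))
```

Changing the relative costs moves the cutoff, which is why a single accuracy number says little about how useful a detector is for audit selection.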
Alternative data / NLP
[Azimi, M., & Agrawal, A. Is Positive Sentiment in Corporate Annual Reports Informative? Evidence from Deep Learning. Review of Asset Pricing Studies (2021).](https://doi.org/10.1093/rapstu/raab005) ([SSRN](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3258821))
Year: 2021 Category: Alternative data / NLP
Summary: Applies deep-learning-based sentiment extraction to 10-K filings and finds both positive and negative sentiments predict abnormal returns and future firm fundamentals around filing dates. Shows finer-grained sentiment measures (vs. simple net-sentiment) contain incremental information.
Data Used: EDGAR 10-K filings for U.S. firms, stock returns around filing dates, and firm fundamentals.
Challenges / Limitations:
- Text-sentiment models require careful training and can be sensitive to label noise.
- Filing-based signals can be confounded by concurrent announcements or news.
- Generalizing across jurisdictions/time requires retraining sentiment models.

Future Research Directions:
- Event-level causal identification strategies (e.g., instrumental variables).
- Cross-sectional tests across industries and international filings.
- Release trained models and preprocessing code for reproducibility.
- Incorporate multi-modal data (text + earnings calls + management forecasts) for richer signals.
- Test long-horizon predictive power and economic exploitability after costs.
- Explore causal mechanisms linking narrative tone to firm fundamentals.
Alternative data / computer vision (asset pricing application)
Aubry, Mathieu, Roman Kräussl, Gustavo Manso, and Christophe Spaenjers. "Biased auctioneers." The Journal of Finance (2023).
Year: 2023 Category: Alternative data / computer vision (asset pricing application)
Summary: Uses neural-network based image and metadata analysis to predict art auction prices and documents systematic auctioneer biases in pricing. Combines visual and non-visual features.
Data Used: Proprietary data from the Blouin Art Sales Index, including information on buy-ins, covering 2008 to 2015 (with 2015 as the test set): 1.2 million lots by 130k individual artists. The data include information about the artist, the artwork, and the auction, plus a high-quality image of each artwork for analysis. Amounts are converted to USD using the spot rate at the time of sale.
Challenges / Limitations:
- External validity: art markets are niche and results may not generalize to other asset classes.
- Image-based valuation models can be sensitive to sample selection and feature extraction choices.
- Causality: disentangling information effects vs. behavioral reactions to published estimates.

Future Research Directions:
- Explore ML valuation feedback loops across other illiquid markets (collectibles, real estate).
- Use experimental settings to test causal mechanisms behind auctioneer bias.
- Combine alternative data (provenance, exhibition history) with image features.
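
The data description for this entry mentions two preprocessing steps: converting sale amounts to USD at the spot rate on the sale date, and holding out 2015 as the test set. A minimal sketch of both steps follows. The record layout (`year`, `price`, `usd_per_unit`) and all numbers are hypothetical and are not taken from the Blouin data.

```python
def to_usd(lot):
    """Convert a sale price to USD using the spot rate recorded for the sale date."""
    return lot["price"] * lot["usd_per_unit"]


def split_by_year(lots, test_year=2015):
    """Train on all years before `test_year`; test on `test_year` only."""
    train = [lot for lot in lots if lot["year"] < test_year]
    test = [lot for lot in lots if lot["year"] == test_year]
    return train, test


# Toy lots with made-up prices and exchange rates.
lots = [
    {"year": 2008, "price": 1000.0, "usd_per_unit": 1.5},
    {"year": 2012, "price": 2000.0, "usd_per_unit": 1.25},
    {"year": 2014, "price": 800.0, "usd_per_unit": 1.0},
    {"year": 2015, "price": 400.0, "usd_per_unit": 1.25},
]

train, test = split_by_year(lots)
train_usd = [to_usd(lot) for lot in train]
test_usd = [to_usd(lot) for lot in test]
print(len(train), len(test), train_usd, test_usd)
```

Splitting by calendar year rather than at random keeps future sales out of the training sample, so the test-year predictions are genuinely out-of-sample.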
Alternative data / narrative factors
Bybee, L., Kelly, B., Su, Y. (2022). Narrative Asset Pricing: Interpretable Systematic Risk Factors from News Text. SSRN.
Year: 2022/2023 Category: Alternative data / narrative factors
Summary: Extracts narrative factors from WSJ news text using LDA + IPCA + group lasso and shows these news-derived factors have strong pricing performance and high out-of-sample Sharpe ratios. Shows that news attention topics track macroeconomic activity and help forecast aggregate stock market returns.
Data Used: Wall Street Journal text, returns on anomaly portfolios, macro variables.
Challenges / Limitations:
- Dependence on a single news source (WSJ) and LDA hyperparameters.
- Risk of overfitting when selecting narrative topics.
- Economic interpretation beyond correlations requires care.

Future Research Directions:
- Cross-validate across news sources and years.
- Link narrative factors more tightly to macro-investment opportunities.
Alternative data / news NLP
[Bybee, L., Kelly, B., Manela, A., & Xiu, D. (2021). Business News and Business Cycles.](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3446225) ([Journal link](https://onlinelibrary.wiley.com/doi/full/10.1111/jofi.13377))
Year: 2024 Category: Alternative data / news NLP
Summary: Constructs topical measures from Wall Street Journal business news and shows that news attention topics track macroeconomic activity and help forecast aggregate stock market returns.
Data Used: Full text of Wall Street Journal articles (1984–2017), macro series, and market returns.
Challenges / Limitations:
- News coverage may itself be endogenous to economic events.
- Topic modeling choices and coverage biases influence results.
- Limited to WSJ; generalizability to other media uncertain.

Future Research Directions:
- Apply method to broader news sources and international outlets.
- Investigate causal channels between news narratives and real economic activity.
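
The "news attention" measure in this entry can be read as the share of a period's news text devoted to each topic. The sketch below aggregates per-article topic word counts into monthly attention shares. It assumes the topic assignment (for example from a topic model) has already been done; the function `monthly_topic_attention`, the input layout, and the toy counts are hypothetical.

```python
from collections import defaultdict


def monthly_topic_attention(articles):
    """Compute each topic's share of total topic-assigned words per month.

    `articles` is an iterable of (month, {topic: word_count}) pairs.
    Months with no assigned words are skipped to avoid dividing by zero.
    """
    totals = defaultdict(float)
    by_topic = defaultdict(lambda: defaultdict(float))
    for month, counts in articles:
        for topic, n in counts.items():
            by_topic[month][topic] += n
            totals[month] += n
    return {
        month: {topic: n / totals[month] for topic, n in topics.items()}
        for month, topics in by_topic.items()
        if totals[month] > 0
    }


# Toy articles with made-up topic word counts.
articles = [
    ("1984-01", {"recession": 30, "oil": 70}),
    ("1984-01", {"recession": 20, "oil": 80}),
    ("1984-02", {"recession": 60, "oil": 40}),
]

attention = monthly_topic_attention(articles)
print(attention["1984-01"], attention["1984-02"])
```

The resulting monthly series (one per topic) are the kind of inputs that can then be compared with macro series or used as predictors of market returns.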