Machine-Learning-in-Assset-Pricing-Scholarly-Papers-with-Notes

This repo contains a list and constructive summaries of seminal and important scholarly research papers on machine learning in asset pricing. The purpose of this initiative is to provide a comprehensive repo for the subject area. For an interactive reading experience, visit this repo’s GitHub Page. If you like the repo, please star it; this motivates us to update it frequently.

Asset Return Prediction

Paper	Summary	Data
Gu, Shihao, Bryan Kelly, and Dacheng Xiu. "Autoencoder asset pricing models." Journal of Econometrics 222, no. 1 (2021): 429-450.	Add Summary	94 characteristics (61 updated annually, 13 updated quarterly, and 20 updated monthly); 74 industry dummies corresponding to the first two digits of SIC codes; eight macroeconomic predictors
Gu, Shihao, Bryan Kelly, and Dacheng Xiu. "Empirical asset pricing via machine learning." The Review of Financial Studies 33, no. 5 (2020): 2223-2273	Add Summary	94 characteristics (61 updated annually, 13 updated quarterly, and 20 updated monthly); 74 industry dummies corresponding to the first two digits of SIC codes; eight macroeconomic predictors. Covariates are the interactions $x_t \otimes c_{i,t}$ plus the industry dummies, for a total of $94 \times (8+1) + 74 = 920$
Chen, Luyang, Markus Pelger, and Jason Zhu. "Deep learning in asset pricing." Management Science (2023)	This paper proposes a GAN-based deep learning model for modeling stock returns. The authors connect the no-arbitrage condition of asset pricing with deep learning: they use a generative adversarial network to estimate the stochastic discount factor (SDF). The asset pricing modeler chooses an asset pricing model, while the adversary chooses the conditions under which that model performs worst. In each iteration, the adversary selects the moment conditions that lead to the largest mispricing, and the modeler revises the candidate SDF to incorporate the identified factor. In addition, they use an LSTM and take its final output layer as a summary of the state of time-varying macroeconomic conditions.	46 time-varying, firm-specific characteristics and 178 macroeconomic time series
Leippold, Markus, Qian Wang, and Wenyu Zhou. "Machine learning in the Chinese stock market." Journal of Financial Economics 145, no. 2 (2022): 64-82
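
The covariate count in the Gu, Kelly, and Xiu (2020) row can be checked with a small sketch. It assumes the covariates are the Kronecker product of a macro vector (a constant plus eight predictors) with the 94 stock characteristics, followed by the 74 industry dummies. The helper names `kron` and `build_features` and all input values are hypothetical placeholders, not the authors' code or data.

```python
import random

def kron(a, b):
    """Kronecker product of two 1-D vectors given as lists."""
    return [ai * bj for ai in a for bj in b]

def build_features(macro, chars, industry_dummies):
    """Hypothetical helper: interact characteristics with (1, macro), then append dummies."""
    x = [1.0] + list(macro)  # constant plus macro predictors
    return kron(x, chars) + list(industry_dummies)

random.seed(0)
macro = [random.gauss(0, 1) for _ in range(8)]      # 8 placeholder macro predictors
chars = [random.uniform(-1, 1) for _ in range(94)]  # 94 placeholder characteristics
dummies = [0.0] * 74
dummies[10] = 1.0                                   # one active industry dummy

z = build_features(macro, chars, dummies)
print(len(z))  # 94 * (8 + 1) + 74 = 920
```

Because the first element of the macro vector is the constant, the first 94 entries of `z` are the raw characteristics themselves; the remaining interaction blocks scale them by each macro predictor.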

Machine Learning in Asset Pricing — Literature Review

This repository presents a comprehensive collection of academic papers exploring the intersection of machine learning and asset pricing.
Each paper entry includes a collapsible summary with data descriptions, limitations, and potential future research directions.

Accounting / NLP
Accounting / fraud detection (ML applied)
Accounting / misstatement detection
Alternative data / NLP
Alternative data / computer vision (asset pricing application)
Alternative data / narrative factors
Alternative data / news NLP
Corporate disclosure / NLP
Factor construction / tree-based methods
Factor discovery / representation learning/Latent factor models
Fixed income / macro-finance
Fixed income / prediction
Fundamental analysis / ML
High-frequency / microstructure ML
Human+AI / analyst forecasting
Macroeconomic expectations / ML
Methodology / inference with ML
Momentum / deep learning
Prediction / economic restrictions
Prediction / forecasting
Prediction / forecasting (market-specific)
Prediction / industry-focused methods
SDF estimation / deep learning
Structural estimation / deep learning
Uncategorized

Accounting / NLP

[Brown, N. C., Crowley, R. M., & Elliott, W. B. (2019). What are You Saying? Using topic to Detect Financial Misreporting. Journal of Accounting Research.](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2803733)

Year: 2019/2020 Category: Accounting / NLP

Summary: Introduces topic-based textual features to detect financial misreporting and shows topic models significantly improve out-of-sample detection performance. Demonstrates text topics add information beyond traditional style and financial features.

Data Used: 10-K filings text, SEC enforcement actions (AAERs), restatements, and financial statement data.

Challenges / Limitations:
- Topic models capture co-occurrence patterns but may miss nuanced or cleverly disguised language.
- Label and event-timing uncertainty for misreporting cases complicates evaluation.
- Language drift over time may require periodic model retraining.

Future Research Directions:
- Combine topic models with supervised deep-learning classifiers for richer feature sets.
- Develop time-adaptive topic models to handle language evolution.
- Evaluate cross-lingual transferability for international filings.

Accounting / fraud detection (ML applied)

Bao, Ke, Li, Yu, & Zhang. Detecting Accounting Fraud in Publicly Traded U.S. Firms Using a Machine Learning Approach. Journal of Accounting Research (2020).

Year: 2020 Category: Accounting / fraud detection (ML applied)

Summary: Uses ML classifiers to detect accounting fraud in US firms, leveraging financial statement features and textual signals; reports improved detection relative to standard benchmarks.

Data Used: Compustat/CRSP financials, audit reports, possibly text of filings; see paper for dataset and labeling procedure.

Challenges / Limitations:
- Label quality: fraud/misstatement labels are noisy and subject to detection biases.
- Class imbalance (fraud events are rare) complicates training and evaluation.
- Potential adverse incentives if algorithms are used operationally without human oversight.

Future Research Directions:
- Develop causal/interpretability tools to explain flagged cases to auditors.
- Use multi-source signals (text, network, alternative data) to improve robustness.
- Evaluate real-world deployment impacts on audit selection and false-positive costs.

Accounting / misstatement detection

Bertomeu, Cheynel, Floyd, & Pan. Using machine learning to detect misstatements. Review of Accounting Studies (2020).

Year: 2020 Category: Accounting / misstatement detection

Summary: Applies various ML methods to detect financial misstatements and misreporting, comparing performance to traditional models and highlighting useful features.

Data Used: Financial statement data and enforcement/SEC restatement records; see paper for details.

Challenges / Limitations:
- False positives have real costs; need to balance precision vs recall in operational settings.
- Heterogeneity across firms/industries may reduce generalization of trained models.
- Regulatory and privacy constraints limit access to rich features in practice.

Future Research Directions:
- Better calibration and cost-sensitive learning tailored to audit priorities.
- Cross-firm transfer learning to improve small-sample performance.
- Integrate explainable AI for audit trail documentation.
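
The precision versus recall trade-off noted above, and the rarity of misstatement cases, can be made concrete with a small sketch. It uses made-up scores and labels (not data from any of the papers listed here) and a hypothetical `precision_recall` helper to show how moving the flagging threshold changes both metrics when positives are rare.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when flagging every case with score >= threshold."""
    flagged = [s >= threshold for s in scores]
    tp = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
    fp = sum(1 for f, y in zip(flagged, labels) if f and y == 0)
    fn = sum(1 for f, y in zip(flagged, labels) if not f and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy data: 2 misstatements among 10 firm-years (illustrative only).
scores = [0.95, 0.80, 0.60, 0.55, 0.40, 0.30, 0.20, 0.15, 0.10, 0.05]
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

for threshold in (0.9, 0.5):
    p, r = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

In this toy setting the strict threshold flags one firm-year and misses one misstatement, while the looser threshold catches both at the cost of two false positives. Which point is preferable depends on the relative cost of a missed case versus an unnecessary review.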

Alternative data / NLP

[Azimi, M., & Agrawal, A. Is Positive Sentiment in Corporate Annual Reports Informative? Evidence from Deep Learning. Review of Asset Pricing Studies (2021).](https://doi.org/10.1093/rapstu/raab005) (SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3258821)

Year: 2021 Category: Alternative data / NLP

Summary: Applies deep-learning-based sentiment extraction to 10-K filings and finds both positive and negative sentiments predict abnormal returns and future firm fundamentals around filing dates. Shows finer-grained sentiment measures (vs. simple net-sentiment) contain incremental information.

Data Used: EDGAR 10-K filings for U.S. firms, stock returns around filing dates, and firm fundamentals.

Challenges / Limitations:
- Text-sentiment models require careful training and can be sensitive to label noise.
- Filing-based signals can be confounded by concurrent announcements or news.
- Generalizing across jurisdictions/time requires retraining sentiment models.

Future Research Directions:
- Event-level causal identification strategies (e.g., instrumental variables).
- Cross-sectional tests across industries and international filings.
- Release trained models and preprocessing code for reproducibility.
- Incorporate multi-modal data (text + earnings calls + management forecasts) for richer signals.
- Test long-horizon predictive power and economic exploitability after costs.
- Explore causal mechanisms linking narrative tone to firm fundamentals.

Alternative data / computer vision (asset pricing application)

Aubry, Mathieu, Roman Kräussl, Gustavo Manso, and Christophe Spaenjers. "Biased auctioneers." The Journal of Finance (2023).

Year: 2023 Category: Alternative data / computer vision (asset pricing application)

Summary: Uses neural-network based image and metadata analysis to predict art auction prices and documents systematic auctioneer biases in pricing. Combines visual and non-visual features.

Data Used: Proprietary data from the Blouin Art Sales Index, including information on buy-ins, covering 2008 to 2015 (with 2015 as the test set): 1.2 million lots by 130k individual artists. The data include information about the artist, the artwork, and the auction, along with a high-quality image of each artwork for analysis. Amounts are converted to USD using the spot rate at the time of sale.

Challenges / Limitations:
- External validity: art markets are niche and results may not generalize to other asset classes.
- Image-based valuation models can be sensitive to sample selection and feature extraction choices.
- Causality: disentangling information effects vs. behavioral reactions to published estimates.

Future Research Directions:
- Explore ML valuation feedback loops across other illiquid markets (collectibles, real estate).
- Use experimental settings to test causal mechanisms behind auctioneer bias.
- Combine alternative data (provenance, exhibition history) with image features.

Alternative data / narrative factors

Bybee, L., Kelly, B., Su, Y. (2022). Narrative Asset Pricing: Interpretable Systematic Risk Factors from News Text. SSRN.

Year: 2022/2023 Category: Alternative data / narrative factors

Summary: Extracts narrative factors from WSJ news text using LDA + IPCA + group lasso and shows these news-derived factors have strong pricing performance and high out-of-sample Sharpe ratios. Shows that news attention topics track macroeconomic activity and help forecast aggregate stock market returns.

Data Used: Wall Street Journal text, returns on anomaly portfolios, macro variables.

Challenges / Limitations:
- Dependence on a single news source (WSJ) and LDA hyperparameters.
- Risk of overfitting when selecting narrative topics.
- Economic interpretation beyond correlations requires care.

Future Research Directions:
- Cross-validate across news sources and years.
- Link narrative factors more tightly to macro-investment opportunities.

Alternative data / news NLP

[Bybee, L., Kelly, B., Manela, A., & Xiu, D. (2021). Business News and Business Cycles.](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3446225) (Journal link: https://onlinelibrary.wiley.com/doi/full/10.1111/jofi.13377)

Year: 2024 Category: Alternative data / news NLP

Summary: Constructs topical measures from Wall Street Journal business news and shows that news attention topics track macroeconomic activity and help forecast aggregate stock market returns.

Data Used: Full text of Wall Street Journal articles (1984–2017), macro series, and market returns.

Challenges / Limitations:
- News coverage may itself be endogenous to economic events.
- Topic modeling choices and coverage biases influence results.
- Limited to WSJ; generalizability to other media uncertain.

Future Research Directions:
- Apply method to broader news sources and international outlets.
- Investigate causal channels between news narratives and real economic activity.

Machine-Learning-in-Asset-Pricing-Papers