Causal Machine Learning: A Survey and Open Problems

Authors: Jean Kaddour, Aengus Lynch, Qi Liu, Matt J. Kusner, Ricardo Silva

Venue: arXiv, 2022 — arXiv:2206.15475

TL;DR

A comprehensive, 191-page survey of causal machine learning (CausalML), which formalizes the data-generating process as a structural causal model (SCM) in order to reason about interventions and counterfactuals. It groups CausalML work into five categories (causal supervised learning, causal generative modeling, causal explanations, causal fairness, and causal reinforcement learning), systematically compares methods, lists open problems, and covers applications to vision, NLP, and graph learning.

Contributions

  • Unified taxonomy of CausalML methods across five problem categories with open problems for each
  • Causal foundations (Chapter 2): Self-contained introduction to structural causal models, interventions, counterfactuals, and identifiability without assuming prior knowledge of causal inference
  • Causal supervised learning (Chapter 3): Invariant feature learning and invariant mechanism learning, which aim for domain-robust representations that remain predictive across environments
  • Causal generative modeling (Chapter 4): Structural assignment learning and causal disentanglement to generate counterfactual samples
  • Causal explanations (Chapter 5): Feature attribution and contrastive explanations grounded in causal graphs
  • Causal fairness (Chapter 6): Counterfactual and interventional fairness criteria to mitigate discrimination
  • Causal reinforcement learning (Chapter 7): Model-based RL, off-policy evaluation, and counterfactual data augmentation
  • Modality-specific applications (Chapter 8): Causal computer vision, NLP, and graph representation learning
  • Benchmarks and open challenges (Chapters 9–10): Causal benchmarks, limitations of current approaches, and future directions

Method

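The survey adopts a causal perspective on machine learning. Rather than treating data as i.i.d. samples from a fixed distribution, CausalML formalizes the data-generating process as a structural causal model (SCM): a set of structural assignments, one per variable, each a function of that variable's direct causes and an independent noise term. The assignments induce a directed acyclic graph (DAG) whose nodes are variables and whose edges are direct causal relationships. This enables: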

  1. Reasoning about interventions: What happens to \(Y\) if we set treatment \(T\) to \(t\), i.e., what is \(P(Y \mid do(T = t))\)? (interventional prediction)
  2. Identifying causal effects: When is the causal effect identifiable from observational data? (relies on graph structure and back-door / front-door criteria)
  3. Invariance and robustness: Which features remain predictive under distribution shifts? (via the principle of independent mechanisms)
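
The first two points can be illustrated with a minimal sketch, assuming a hypothetical three-variable SCM in which a binary confounder \(Z\) affects both treatment \(T\) and outcome \(Y\). All probabilities and function names below are made up for illustration and do not come from the survey. Because \(Z\) satisfies the back-door criterion here, the adjustment \(P(Y \mid do(T = t)) = \sum_z P(Y \mid T = t, Z = z)\, P(Z = z)\) should recover the interventional effect (0.2 by construction), whereas the naive observational contrast is biased (analytically 0.44 for these assumed parameters).

```python
import random

def sample_scm(n, seed=0, do_t=None):
    """Sample (z, t, y) rows from a toy SCM; do_t overrides T's assignment."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = 1 if rng.random() < 0.5 else 0              # Z := noise
        if do_t is None:
            p_t = 0.8 if z else 0.2                     # T := f(Z, noise)
            t = 1 if rng.random() < p_t else 0
        else:
            t = do_t                                    # intervention do(T = t)
        p_y = 0.3 + 0.2 * t + 0.4 * z                   # Y := f(T, Z, noise)
        y = 1 if rng.random() < p_y else 0
        rows.append((z, t, y))
    return rows

def mean_y(rows, t, z=None):
    """Empirical E[Y | T=t] or, if z is given, E[Y | T=t, Z=z]."""
    ys = [y for (zi, ti, y) in rows if ti == t and (z is None or zi == z)]
    return sum(ys) / len(ys)

def naive_effect(rows):
    """Observational contrast E[Y | T=1] - E[Y | T=0]; confounded by Z."""
    return mean_y(rows, 1) - mean_y(rows, 0)

def backdoor_effect(rows):
    """Back-door adjustment over Z using observational data only."""
    n = len(rows)
    effect = 0.0
    for z in (0, 1):
        p_z = sum(1 for (zi, _, _) in rows if zi == z) / n
        effect += p_z * (mean_y(rows, 1, z) - mean_y(rows, 0, z))
    return effect

def interventional_effect(n, seed=1):
    """Ground truth by simulating do(T=1) and do(T=0) directly."""
    y1 = mean_y(sample_scm(n, seed, do_t=1), 1)
    y0 = mean_y(sample_scm(n, seed + 1, do_t=0), 0)
    return y1 - y0

obs = sample_scm(100_000)
print("naive:     ", round(naive_effect(obs), 3))
print("back-door: ", round(backdoor_effect(obs), 3))
print("simulated: ", round(interventional_effect(100_000), 3))
```

In practice the graph is rarely known with this certainty; the sketch only shows why the adjustment set matters once a graph is assumed.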

Key concepts:

  • Spurious associations arise from unobserved confounders; e.g., in ImageNet, bird images often have trees in the background due to photographer bias, not because trees cause birds
  • Style and content decomposition: Disentangle "style variables" (domain-specific features subject to interventions) from "content variables" (causal parents of the outcome)
  • Counterfactual invariance: Predictions must be invariant to interventions on attributes we don't want to influence predictions (e.g., race or gender in fairness)
  • Causal influence: Quantify one variable's causal effect on another via KL-divergence or other information-theoretic measures
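
Counterfactual invariance can be made concrete with a small sketch, assuming a hypothetical linear SCM \(X := \alpha A + U\) with a binary attribute \(A\) and a known coefficient \(\alpha\). Neither the model nor the predictor names come from the survey. The counterfactual is computed by the usual three steps: abduction (recover the noise \(U\) from the observation), action (set \(A\) to a new value), and prediction (recompute \(X\)). A predictor is then checked for invariance by comparing its output on each sample and on its counterfactual.

```python
ALPHA = 2.0  # assumed, known structural coefficient of A on X (illustrative)

def abduct_noise(a, x):
    """Abduction: recover the exogenous noise U from an observed (a, x)."""
    return x - ALPHA * a

def counterfactual_x(a, x, a_new):
    """Action + prediction: value X would have taken had A been a_new."""
    u = abduct_noise(a, x)
    return ALPHA * a_new + u

def predictor_raw(a, x):
    """Hypothetical predictor that uses X directly, so it inherits A's effect."""
    return 0.5 * x

def predictor_residual(a, x):
    """Hypothetical predictor that uses only the part of X not caused by A."""
    return 0.5 * abduct_noise(a, x)

def is_counterfactually_invariant(predictor, samples, tol=1e-9):
    """True if flipping A (and propagating to X) never changes the prediction."""
    for a, x in samples:
        a_cf = 1 - a
        x_cf = counterfactual_x(a, x, a_cf)
        if abs(predictor(a, x) - predictor(a_cf, x_cf)) > tol:
            return False
    return True

samples = [(0, 0.3), (1, 2.5), (0, -1.2), (1, 1.1)]
print(is_counterfactually_invariant(predictor_raw, samples))
print(is_counterfactually_invariant(predictor_residual, samples))
```

The check depends entirely on the assumed structural equation; with a misspecified \(\alpha\) or hidden confounding, the abducted noise and hence the invariance verdict would be wrong.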

The survey reviews methods across modalities: data augmentation for deconfounding (vision), contrastive learning and foundation-model fine-tuning (NLP), and causal graph learning (graph neural networks).

Results

The survey has no empirical results table of its own; it is a methodological review. Key takeaways from the reviewed literature include:

  • Invariant feature learning outperforms standard supervised learning on out-of-distribution benchmarks (WILDS, DomainNet, ImageNet-C) when spurious associations are present
  • Contrastive learning implicitly enforces causal structure by comparing samples under soft interventions (style augmentations)
  • Causal fairness methods successfully reduce disparate impact in hiring, lending, and criminal justice domains compared to naive approaches
  • Causal RL achieves better off-policy evaluation and sample efficiency than model-free methods on simulated benchmarks
  • Open challenges include: tractable causal discovery from high-dimensional data, learning from limited observational data with hidden confounders, and generalizing across multiple environments
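
The intuition behind the invariant-feature-learning takeaway can be sketched with a toy simulation. It is not a reproduction of any benchmark result, and the environments, probabilities, and selection rule are all assumptions made for illustration. A causal feature predicts the label with the same accuracy in every environment, while a spurious feature is more accurate in training but its relationship to the label flips at test time. The crude selection rule below (pick the feature whose accuracy varies least across training environments) is a stand-in for invariance-based objectives, not the method of any specific paper.

```python
import random

def sample_env(n, p_spurious, seed):
    """Rows (c, s, y): Y depends on C via a stable mechanism;
    S agrees with Y with an environment-specific probability."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        c = 1 if rng.random() < 0.5 else 0
        y = c if rng.random() < 0.75 else 1 - c          # stable: C -> Y
        s = y if rng.random() < p_spurious else 1 - y    # unstable: Y -> S
        rows.append((c, s, y))
    return rows

def accuracy(predict, rows):
    return sum(1 for (c, s, y) in rows if predict(c, s) == y) / len(rows)

PREDICTORS = {
    "causal": lambda c, s: c,     # predicts Y from the causal feature only
    "spurious": lambda c, s: s,   # predicts Y from the spurious feature only
}

def pick_invariant_feature(train_envs):
    """Choose the predictor whose accuracy gap across environments is smallest."""
    def gap(name):
        accs = [accuracy(PREDICTORS[name], env) for env in train_envs]
        return max(accs) - min(accs)
    return min(PREDICTORS, key=gap)

train_a = sample_env(20_000, p_spurious=0.9, seed=0)
train_b = sample_env(20_000, p_spurious=0.8, seed=1)
test = sample_env(20_000, p_spurious=0.1, seed=2)    # spurious relation flipped

chosen = pick_invariant_feature([train_a, train_b])
for name, predict in PREDICTORS.items():
    print(name, round(accuracy(predict, train_a), 3), round(accuracy(predict, test), 3))
print("selected:", chosen)
```

Selecting on pooled training accuracy alone would favor the spurious feature here, which is the failure mode that motivates using multiple environments.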

Notes

This survey synthesizes a nascent and fast-growing field. The main strength is the unified vocabulary and taxonomy—CausalML comprises heterogeneous methods (data augmentation, contrastive learning, causal discovery, fairness constraints) that share common causal principles. The authors usefully distinguish between "the good" (CausalML enables more robust, interpretable, and fair models), "the bad" (identifiability and confounding assumptions are strong and often untestable in practice), and "the ugly" (causal discovery from high-dimensional data remains open, and the cost of enforcing invariance can be high in flexible models). For fake news research, causal approaches are relevant to understanding how models generalize across sources, time periods, and languages—and to designing deconfounded features that capture misinformation signals rather than spurious patterns in training data.