Explainability in misinformation detection

Explainability (or interpretability) refers to the ability to understand why a machine learning model makes a particular prediction. In the context of misinformation detection, explainability is crucial for:

  • Trust and deployment: Content moderators and platform operators need to understand which signals triggered a misinformation flag before acting on it
  • Debugging and validation: Identifying whether a model relies on spurious features (e.g., source URL patterns) vs. genuine veracity signals
  • Regulatory compliance: Increasingly, platforms must explain content moderation decisions to users and regulators
  • Discovering new detection signals: Interpretable models can reveal which linguistic, structural, or behavioral patterns are most predictive

Approaches to explainability

Attention-based explanations: Attention weights in neural networks can highlight which input elements (words, users, images) the model focused on. Models like dEFEND and GCAN use co-attention to relate source content to its social context: dEFEND pairs source article sentences with user comments, while GCAN pairs source tweet words with user propagation patterns.
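
As a rough illustration of how attention weights become an explanation, the sketch below computes scaled dot-product attention for a single query over a handful of tokens and ranks the tokens by weight. The tokens, the 2-d key vectors, and the query are made-up values standing in for learned parameters; this is not the dEFEND or GCAN architecture.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over a list of key vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)

# Hypothetical toy inputs: tokens of a claim and invented 2-d key vectors.
tokens = ["doctors", "hate", "this", "miracle", "cure"]
keys = [[0.2, 0.1], [0.9, 0.3], [0.0, 0.1], [1.5, 0.8], [1.2, 0.9]]
query = [1.0, 1.0]  # stands in for a learned query vector

weights = attention_weights(query, keys)

# Rank tokens by attention weight, highest first; this ranking is the "explanation".
ranked = sorted(zip(tokens, weights), key=lambda tw: tw[1], reverse=True)
for token, weight in ranked:
    print(f"{token:10s} {weight:.3f}")
```

The ranking only shows where the attention mass went; whether those tokens actually drove the decision is a separate question.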

Gradient-based methods: Saliency maps and integrated gradients estimate which input features most influence the output by computing gradients of the prediction with respect to the input features.
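
A minimal sketch of both ideas, assuming a small logistic model whose gradient can be written by hand. The feature names, weights, and all-zero baseline are hypothetical. Plain saliency is the gradient at the input, while integrated gradients averages the gradient along a straight path from the baseline to the input and scales it by the input difference.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, x):
    """Probability that the item is misinformation under a logistic model."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def input_gradient(weights, bias, x):
    """Gradient of the predicted probability with respect to each input feature."""
    p = predict(weights, bias, x)
    return [p * (1.0 - p) * w for w in weights]

def integrated_gradients(weights, bias, x, baseline, steps=200):
    """Midpoint Riemann-sum approximation of integrated gradients."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight line between baseline and input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = input_gradient(weights, bias, point)
        totals = [t + g for t, g in zip(totals, grad)]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]

# Hypothetical features and weights, chosen only for illustration.
feature_names = ["exclamation_marks", "cites_source", "all_caps_ratio"]
weights, bias = [0.8, -1.5, 2.0], -0.5
x = [3.0, 0.0, 0.4]
baseline = [0.0, 0.0, 0.0]

saliency = input_gradient(weights, bias, x)
attributions = integrated_gradients(weights, bias, x, baseline)

for name, s, a in zip(feature_names, saliency, attributions):
    print(f"{name:18s} saliency={s:+.4f} integrated_gradients={a:+.4f}")
```

A useful sanity check is that the integrated-gradients attributions sum (approximately) to the difference between the prediction at the input and at the baseline. In a deep model the hand-written gradient would be replaced by automatic differentiation.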

Post-hoc interpretation: Methods like LIME (Local Interpretable Model-agnostic Explanations) and SHAP (SHapley Additive exPlanations) explain individual predictions of an already-trained model. LIME fits an interpretable surrogate model around a specific example, while SHAP assigns each feature an additive contribution based on Shapley values.
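
To make the Shapley-value idea behind SHAP concrete, the sketch below computes exact Shapley values by enumerating every feature coalition. This is feasible only for a handful of features and is not the SHAP library, which relies on approximations. The scoring function, the example, and the baseline are all hypothetical; features outside a coalition are replaced by their baseline value.

```python
import math
from itertools import combinations

def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating every coalition of the other features."""
    values = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Shapley weight: |S|! * (n - |S| - 1)! / n!
            weight = (math.factorial(size) * math.factorial(n_features - size - 1)
                      / math.factorial(n_features))
            for subset in combinations(others, size):
                with_i = value_fn(set(subset) | {i})
                without_i = value_fn(set(subset))
                values[i] += weight * (with_i - without_i)
    return values

def model(x):
    """Hypothetical black-box misinformation score with one interaction term."""
    exclaim, cites, caps = x
    return 0.2 * exclaim + 0.5 * caps - 0.4 * cites + 0.3 * exclaim * caps

x = [1.0, 0.0, 1.0]         # example to explain (illustrative values)
baseline = [0.0, 1.0, 0.0]  # reference input (illustrative values)

def value_fn(coalition):
    """Model output when coalition features keep their real value and the rest use the baseline."""
    mixed = [x[j] if j in coalition else baseline[j] for j in range(len(x))]
    return model(mixed)

phi = shapley_values(value_fn, len(x))
for name, value in zip(["exclaim", "cites", "caps"], phi):
    print(f"{name:8s} {value:+.3f}")
```

The contributions add up to the difference between the model output on the example and on the baseline, and the interaction term is split evenly between the two features involved.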

Feature importance: Simpler statistical models (e.g., logistic regression with engineered features) provide direct interpretability: each feature has a coefficient indicating its contribution.
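
As a small sketch of this direct interpretability, the code below fits a logistic regression by batch gradient descent on a tiny invented dataset and reads off the coefficients. The two engineered features (exclamation count and a "cites a source" flag), the labels, and the hyperparameters are assumptions made purely for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Invented toy data: [exclamation_count, cites_source]; label 1 = misinformation.
feature_names = ["exclamation_count", "cites_source"]
X = [[3, 0], [4, 0], [2, 0], [0, 1], [1, 1], [0, 1]]
y = [1, 1, 1, 0, 0, 0]

w, b = fit_logistic_regression(X, y)

# A positive coefficient pushes toward "misinformation", a negative one away from it.
for name, coef in zip(feature_names, w):
    print(f"{name:18s} {coef:+.3f}")
```

Coefficient magnitudes are only comparable across features when the features are on similar scales, so in practice inputs are usually standardized before the coefficients are read as importance scores.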

Explainability vs. accuracy trade-off

One of the key challenges is the trade-off between model accuracy and explainability:

  • Interpretable models (decision trees, logistic regression) are easier to explain but often achieve lower accuracy
  • Black-box models (deep neural networks, ensembles) often achieve higher accuracy but are harder to interpret
  • Neural attention mechanisms partially bridge this gap: they add interpretability to deep models without dramatic accuracy loss

Explainability in graph neural networks

The survey "A Comprehensive Survey on Trustworthy Graph Neural Networks: Privacy, Robustness, Fairness, and Explainability" reviews explainability methods for graph neural networks, including saliency-based approaches (feature importance), example-based methods (prototype selection), and model-based approaches (attention mechanisms, decomposition).

Explainability in the literature

Overview: Danilevsky et al. (2020) survey explainability techniques in NLP. They categorize approaches as local vs. global and self-explaining vs. post-hoc, detail five major techniques (feature importance, surrogate models, example-driven, provenance-based, induction-based rules), and identify evaluation gaps that remain relevant to misinformation detection.

Misinformation-specific work: Most misinformation detection papers prioritize accuracy and do not provide explanations. A few exceptions:

  • FOLK (Wang & Shu 2023): uses first-order logic to decompose claims into verifiable sub-claims and generates natural-language explanations with high coverage and readability; demonstrates that symbolic reasoning guides LLMs to produce more interpretable outputs
  • dEFEND (Shu et al. 2019): uses co-attention to jointly highlight the source article sentences and user comment sentences that are most relevant to each other and to the prediction
  • GCAN (Lu & Li 2020): uses dual co-attention to identify suspicious retweeting users and informative source words

Open challenges

  • How do we validate that model explanations are correct? (An attention weight doesn't guarantee the model actually uses that feature for the decision.)
  • Can we design models that are both highly accurate and easily interpretable?
  • How do explanations generalize across datasets and domains?
  • What explanation format is most useful for content moderators: visualizations, feature importance scores, natural language, or examples?
  • Can adversarial actors game models by understanding their explainability mechanisms?
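
On the first question, one simple sanity check is an erasure test: replace each feature with a baseline value and see how much the prediction actually moves, then compare that with the importance the explanation claims. The sketch below assumes a toy logistic model; the feature names, weights, and the "claimed" attention-style scores are hypothetical and deliberately chosen to disagree.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, x):
    """Predicted probability under a toy logistic model."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def erasure_effects(weights, bias, x, baseline):
    """Drop in predicted probability when each feature is replaced by its baseline value."""
    full = predict(weights, bias, x)
    effects = []
    for i in range(len(x)):
        erased = list(x)
        erased[i] = baseline[i]
        effects.append(full - predict(weights, bias, erased))
    return effects

# Hypothetical model, input, and explanation scores.
feature_names = ["source_url_pattern", "unsupported_claim", "all_caps_ratio"]
weights, bias = [0.1, 2.0, 0.5], -1.0
x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
claimed_importance = [0.7, 0.2, 0.1]  # e.g. attention weights reported as the explanation

effects = erasure_effects(weights, bias, x, baseline)
top_claimed = max(range(len(x)), key=lambda i: claimed_importance[i])
top_erasure = max(range(len(x)), key=lambda i: effects[i])
agrees = top_claimed == top_erasure

print("explanation points to:", feature_names[top_claimed])
print("erasure points to:    ", feature_names[top_erasure])
print("agreement:", agrees)
```

Here the explanation highlights one feature while erasure shows the prediction depends mostly on another, which is the kind of mismatch the parenthetical above warns about. Single-feature erasure ignores feature interactions, so it is a rough check rather than proof of faithfulness.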