Explainability in misinformation detection

Explainability (or interpretability) refers to the ability to understand why a machine learning model makes a particular prediction. In the context of misinformation detection, explainability is crucial for:

Trust and deployment: Content moderators and platform operators need to understand which signals triggered a misinformation flag before acting on it
Debugging and validation: Identifying whether a model relies on spurious features (e.g., source URL patterns) rather than genuine veracity signals
Regulatory compliance: Increasingly, platforms must explain content moderation decisions to users and regulators
Discovering new detection signals: Interpretable models can reveal which linguistic, structural, or behavioral patterns are most predictive

Approaches to explainability

Attention-based explanations: Attention weights in neural networks can highlight which input elements (words, users, images) the model focused on. dEFEND applies co-attention between news article sentences and user comments, while GCAN applies it between source tweet words and user propagation patterns, so each model can show which parts of the content and which parts of the social context it weighted most heavily.
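As a minimal sketch of how attention weights are read as an explanation, the snippet below normalizes raw attention scores with a softmax and ranks tokens by weight. The tokens and scores are made-up values for illustration; they are not output from dEFEND or GCAN.

```python
import math

def softmax(scores):
    """Convert raw attention scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_attended(tokens, scores, k=2):
    """Return the k (token, weight) pairs with the highest attention weight."""
    weights = softmax(scores)
    ranked = sorted(zip(tokens, weights), key=lambda tw: tw[1], reverse=True)
    return ranked[:k]

# Hypothetical tokens and unnormalized attention scores (illustrative only).
tokens = ["scientists", "confirm", "miracle", "cure", "overnight"]
scores = [0.2, 0.1, 2.5, 1.8, 0.9]

for token, weight in top_attended(tokens, scores):
    print(f"{token}: {weight:.3f}")
```

The ranking shows where the model put its weight, which is not necessarily the same as what drove the decision.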

Gradient-based methods: Saliency maps and integrated gradients show which input features most influence the output by computing gradients of the prediction with respect to the input features.
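The following sketch illustrates integrated gradients on a tiny logistic model, where the gradient has a closed form. The feature names, weights, bias, and the all-zeros baseline are assumptions chosen for illustration; a real detector would obtain gradients through automatic differentiation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, w, bias):
    """Probability from a logistic model: sigmoid(w . x + bias)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + bias)

def gradient(x, w, bias):
    """Gradient of the prediction with respect to each input feature."""
    p = predict(x, w, bias)
    return [p * (1.0 - p) * wi for wi in w]

def integrated_gradients(x, baseline, w, bias, steps=200):
    """Midpoint Riemann approximation of the path integral from baseline to x."""
    totals = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps
        point = [b0 + alpha * (xi - b0) for xi, b0 in zip(x, baseline)]
        for i, g in enumerate(gradient(point, w, bias)):
            totals[i] += g
    return [(xi - b0) * t / steps for xi, b0, t in zip(x, baseline, totals)]

# Hypothetical features: exclamation_count, has_citation, clickbait_score.
w = [0.8, -1.2, 1.5]
bias = -0.5
x = [2.0, 0.0, 1.0]
baseline = [0.0, 0.0, 0.0]

attributions = integrated_gradients(x, baseline, w, bias)
print(attributions)
# Completeness check: attributions should sum to f(x) - f(baseline).
print(sum(attributions), predict(x, w, bias) - predict(baseline, w, bias))
```

The attributions approximately sum to the difference between the prediction at the input and at the baseline, which gives a quick sanity check on the implementation.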

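Post-hoc interpretation: Methods like LIME (Local Interpretable Model-agnostic Explanations) and SHAP (SHapley Additive exPlanations) explain individual predictions of an already-trained model. LIME fits an interpretable surrogate model around a specific example, while SHAP attributes the prediction to input features using Shapley values from cooperative game theory.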
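To make the idea behind SHAP concrete, the sketch below computes exact Shapley values by enumerating every feature subset for a three-token input. This is plain Shapley enumeration, not the approximation algorithms used by the SHAP library, and `black_box` is a hypothetical stand-in for a trained classifier.

```python
from itertools import combinations
from math import factorial

def black_box(present):
    """Hypothetical classifier score given the set of tokens kept in the text."""
    score = 0.1
    if "miracle" in present:
        score += 0.3
    if "cure" in present:
        score += 0.2
    if "miracle" in present and "cure" in present:
        score += 0.3  # interaction: the phrase is more suspicious than its parts
    if "study" in present:
        score -= 0.05
    return score

def shapley_values(features, value_fn):
    """Exact Shapley value of each feature; cost grows exponentially with len(features)."""
    n = len(features)
    values = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                kept = set(subset)
                total += weight * (value_fn(kept | {f}) - value_fn(kept))
        values[f] = total
    return values

features = ["miracle", "cure", "study"]
phi = shapley_values(features, black_box)
for name, value in phi.items():
    print(f"{name}: {value:+.3f}")
```

The interaction bonus is split evenly between "miracle" and "cure", and the values sum to the difference between the score with all tokens and the score with none.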

Feature importance: Simpler statistical models (e.g., logistic regression with engineered features) provide direct interpretability: each feature has a coefficient indicating its contribution.
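As a small illustration of reading coefficients directly, the sketch below takes an already-fitted logistic regression and breaks one prediction into per-feature contributions. The feature names, coefficients, and intercept are hypothetical values, not results from any dataset.

```python
import math

# Hypothetical coefficients of an already-fitted logistic regression.
coefficients = {
    "exclamation_count": 0.6,
    "all_caps_ratio": 1.4,
    "cites_known_source": -1.1,
}
intercept = -0.8

def explain(features):
    """Return the predicted probability and each feature's contribution to the logit."""
    contributions = {
        name: coefficients[name] * value for name, value in features.items()
    }
    logit = intercept + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-logit))
    return probability, contributions

article = {"exclamation_count": 3, "all_caps_ratio": 0.5, "cites_known_source": 0}
probability, contributions = explain(article)

print(f"P(misinformation) = {probability:.3f}")
for name, value in sorted(contributions.items(), key=lambda kv: -abs(kv[1])):
    # exp(coefficient) is the odds ratio for a one-unit increase in the feature.
    print(f"{name}: contribution {value:+.2f}, odds ratio {math.exp(coefficients[name]):.2f}")
```

A negative coefficient (here `cites_known_source`) lowers the predicted odds whenever the feature is present, which is the kind of direct reading that black-box models do not offer.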

Explainability vs. accuracy trade-off

One of the key challenges is the trade-off between model accuracy and explainability:

- Interpretable models (decision trees, logistic regression) are easier to explain but often achieve lower accuracy.
- Black-box models (deep neural networks, ensembles) often achieve higher accuracy but are harder to interpret.
- Neural attention mechanisms partially bridge this gap: they add interpretability to deep models without dramatic accuracy loss.

Explainability in graph neural networks

The survey "A Comprehensive Survey on Trustworthy Graph Neural Networks: Privacy, Robustness, Fairness, and Explainability" reviews explainability methods for graph neural networks, including saliency-based approaches (feature importance), example-based methods (prototype selection), and model-based approaches (attention mechanisms, decomposition).

Explainability in the literature

Overview: Danilevsky et al. (2020) survey explainability techniques in NLP. They categorize approaches as local vs. global and as self-explaining vs. post-hoc, describe five major techniques (feature importance, surrogate models, example-driven, provenance-based, and induction-based rules), and identify evaluation gaps that remain relevant to misinformation detection.

Misinformation-specific work: Most misinformation detection papers prioritize accuracy and do not provide explanations. A few exceptions:

FOLK (Wang & Shu 2023): uses first-order logic to decompose claims into verifiable sub-claims and generates natural-language explanations with high coverage and readability; demonstrates that symbolic reasoning guides LLMs to produce more interpretable outputs
dEFEND (Shu et al. 2019): uses sentence-comment co-attention to jointly highlight the source article sentences and the user comments that are most relevant to the prediction
GCAN (Lu & Li 2020): uses dual co-attention to identify suspicious retweeting users and informative source words
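The co-attention idea used by dEFEND and GCAN can be sketched in a heavily simplified form: build an affinity matrix between two sets of vectors, then derive one attention distribution over each set. The version below uses plain dot products and max pooling, which is an assumption made for brevity; the published models use learned projection matrices, and the toy vectors here are invented.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def co_attention(sentence_vecs, comment_vecs):
    """Simplified co-attention: dot-product affinity, max-pooled in each direction."""
    affinity = [[dot(s, c) for c in comment_vecs] for s in sentence_vecs]
    # A sentence is weighted by its best-matching comment, and vice versa.
    sentence_weights = softmax([max(row) for row in affinity])
    comment_weights = softmax(
        [max(row[j] for row in affinity) for j in range(len(comment_vecs))]
    )
    return sentence_weights, comment_weights

# Toy 2-dimensional embeddings (illustrative only).
sentence_vecs = [[1.0, 0.0], [0.0, 1.0], [0.2, 0.1]]
comment_vecs = [[0.0, 2.0], [0.1, 0.1]]

sentence_weights, comment_weights = co_attention(sentence_vecs, comment_vecs)
print("sentence weights:", [round(w, 3) for w in sentence_weights])
print("comment weights:", [round(w, 3) for w in comment_weights])
```

The highest-weighted sentence and comment form the pair that would be surfaced to a reviewer as the explanation.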

Related topics:

Fake news detection: the broader task
Neural approaches to fake news detection: deep learning methods underlying many explainable models
Propagation-based fake news detection: propagation structures can be visualized to explain model predictions
Graph Neural Networks: attention-based GNN variants provide graph-level explanations

Open challenges

How do we validate that model explanations are correct? (A high attention weight does not guarantee that the model actually relies on that input for its decision.)
Can we design models that are both highly accurate and easily interpretable?
How do explanations generalize across datasets and domains?
What explanation format is most useful for content moderators: visualizations, feature importance scores, natural language, or examples?
Can adversarial actors game models by understanding their explainability mechanisms?
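For the first question above, one common sanity check is a deletion test: remove the input that the explanation ranks highest and see whether the prediction drops more than when the lowest-ranked input is removed. The sketch below assumes a hypothetical keyword-based `model_score` as a stand-in for a real classifier, and both attribution vectors are invented.

```python
def model_score(tokens):
    """Hypothetical stand-in classifier based on a small keyword lexicon."""
    weights = {"miracle": 0.4, "cure": 0.3, "overnight": 0.1}
    return min(1.0, 0.1 + sum(weights.get(t, 0.0) for t in tokens))

def deletion_drop(tokens, index):
    """How much the score falls when the token at `index` is removed."""
    reduced = tokens[:index] + tokens[index + 1:]
    return model_score(tokens) - model_score(reduced)

def faithfulness_gap(tokens, attributions):
    """Drop from deleting the top-attributed token minus drop from the bottom one."""
    top = max(range(len(tokens)), key=lambda i: attributions[i])
    bottom = min(range(len(tokens)), key=lambda i: attributions[i])
    return deletion_drop(tokens, top) - deletion_drop(tokens, bottom)

tokens = ["scientists", "confirm", "miracle", "cure", "overnight"]

# Two invented explanations for the same prediction.
plausible = [0.05, 0.05, 0.5, 0.3, 0.1]    # ranks "miracle" highest
misleading = [0.6, 0.1, 0.05, 0.1, 0.15]   # ranks "scientists" highest

print("plausible gap:", round(faithfulness_gap(tokens, plausible), 3))
print("misleading gap:", round(faithfulness_gap(tokens, misleading), 3))
```

A positive gap is consistent with a faithful explanation, while a negative gap signals that the highlighted input is not what the model relied on. A single deletion is only a coarse check and does not settle the question on its own.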