SAFE: Similarity-Aware Multi-Modal Fake News Detection

Authors: Xinyi Zhou, Jindi Wu, Reza Zafarani (Zhou and Wu contributed equally)
Venue: arXiv:2003.04981 [cs.CL], February 2020

TL;DR

Fake news articles frequently pair misleading textual claims with irrelevant or manipulative images, creating a detectable gap between the two modalities. SAFE exploits this gap. It extracts separate neural representations of a news article's text and images (applying the same Text-CNN architecture to the text and to an image2sentence embedding of the visual content), then measures their cross-modal similarity with a modified cosine metric. The final classifier jointly optimizes a modal-independent prediction loss and a similarity-based loss. SAFE outperforms text-only, image-only, and prior multi-modal baselines in accuracy and F₁ on both the PolitiFact (F₁ 0.896) and GossipCop (F₁ 0.895) partitions of FakeNewsNet.

Contributions

First method to explicitly model the relationship (similarity) between news textual and visual information as a signal for fake news detection, distinct from simply concatenating multi-modal features.
SAFE framework with three modules: (1) multi-modal feature extraction via Text-CNN on text and image2sentence embeddings, (2) modal-independent fake news prediction, and (3) cross-modal similarity extraction, jointly optimized with loss ℒ = αℒ_p + βℒ_s.
Empirical ablation showing that cross-modal similarity (SAFE\S → SAFE) consistently improves over concatenated multi-modal features alone, and that textual information contributes more than visual (SAFE\V > SAFE\T).
Case studies demonstrating fake news articles receive low similarity scores (s ≈ 0.001–0.044) while true news articles receive high scores (s ≈ 0.966–0.983).

Method

Problem formulation. A news article A = {T, V} consists of textual content T and visual content V. The goal is to predict ŷ ∈ {0, 1} (real vs. fake) by learning d-dimensional representations t = ℳ_t(T, θ_t) and v = ℳ_v(V, θ_v) and their similarity s = ℳ_s(t, v) ∈ [0, 1].

Textual feature extraction. Text-CNN (Kim, 2014) is extended with an additional fully connected layer. Words are embedded via word2vec; a convolutional layer with window sizes H = {3, 4} produces feature maps over n-gram windows, followed by max-over-time pooling and a linear projection to t ∈ ℝᵈ.
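
The pipeline described above (convolution over word windows, max-over-time pooling, linear projection) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `conv_max_pool` and `text_cnn`, the use of a single filter per window size, the ReLU activation, and all toy weights are hypothetical choices made for readability.

```python
def conv_max_pool(embeddings, filt, bias=0.0):
    """Slide one filter over h-word windows, apply ReLU, then max-over-time pool."""
    h = len(filt)   # window size = number of word rows covered by the filter
    pooled = 0.0    # ReLU outputs are >= 0, so 0.0 is a safe starting maximum
    for start in range(len(embeddings) - h + 1):
        window = embeddings[start:start + h]
        activation = bias + sum(
            w * x
            for filt_row, word_vec in zip(filt, window)
            for w, x in zip(filt_row, word_vec)
        )
        pooled = max(pooled, max(0.0, activation))  # ReLU (assumed), then running max
    return pooled


def text_cnn(embeddings, filters, proj_weights, proj_bias):
    """Map a sequence of word vectors to a d-dimensional feature vector."""
    pooled = [conv_max_pool(embeddings, f) for f in filters]
    return [
        sum(w * p for w, p in zip(row, pooled)) + b
        for row, b in zip(proj_weights, proj_bias)
    ]


# Toy input: 5 words with 2-dimensional embeddings (real word2vec vectors are far larger).
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [2.0, 0.0]]
# One all-ones filter per window size in H = {3, 4} (hypothetical weights).
filters = [[[1.0, 1.0]] * h for h in (3, 4)]
# Linear projection from 2 pooled features to d = 3 dimensions (hypothetical weights).
proj_weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
proj_bias = [0.0, 0.0, 0.0]

t = text_cnn(words, filters, proj_weights, proj_bias)
print(t)  # [4.0, 5.0, 4.5]
```

A practical model would learn many filters per window size; the single-filter version only shows how each window size contributes one pooled feature before the projection to ℝᵈ.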

Visual feature extraction. Rather than applying a pre-trained vision model (e.g., VGG-19) directly to pixel data, images are first processed by the image2sentence model (Vinyals et al., 2016) to produce a sentence-like embedding of the visual content. The same Text-CNN architecture is then applied to produce v ∈ ℝᵈ. This places t and v in a comparable representation space, so that a similarity score between them is meaningful.

Modal-independent prediction. The concatenation t ⊕ v is passed to a softmax classifier ℳ_p with cross-entropy loss:

\[\mathcal{L}_p(\theta_t, \theta_v, \theta_p) = -\mathbb{E}_{(a,y) \sim (A,Y)}\!\left[y \log \mathcal{M}_p(\mathbf{t}, \mathbf{v}) + (1-y)\log(1-\mathcal{M}_p(\mathbf{t}, \mathbf{v}))\right]\]
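
As a rough per-article illustration of the classifier and loss above, the sketch below applies a two-class softmax to the concatenation t ⊕ v and evaluates the cross-entropy for one label. The function names, the class indexing (index 1 = fake, so y = 1 means fake), and the toy weights are assumptions for illustration only.

```python
import math


def predict_fake_probability(t, v, weights, bias):
    """Two-class softmax on the concatenation of t and v; returns P(fake)."""
    features = list(t) + list(v)
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]
    shift = max(logits)  # subtract the max logit for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    return exps[1] / sum(exps)  # index 1 is assumed to be the 'fake' class


def prediction_loss(p_fake, y, eps=1e-12):
    """Per-article cross-entropy; y = 1 for fake, y = 0 for true (assumed encoding)."""
    p = min(max(p_fake, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


# Toy example with d = 2 and hypothetical classifier weights.
t, v = [1.0, 0.0], [0.0, 1.0]
weights = [[0.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 1.0]]  # rows: true, fake
bias = [0.0, 0.0]

p_fake = predict_fake_probability(t, v, weights, bias)
print(p_fake, prediction_loss(p_fake, 1), prediction_loss(p_fake, 0))
```

With these weights the logits are 0 and 2, so the model leans towards "fake" and the loss is smaller for y = 1 than for y = 0.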

Cross-modal similarity extraction. Similarity is a modified cosine that maps to [0, 1]:

\[\mathcal{M}_s(\mathbf{t}, \mathbf{v}) = \frac{\mathbf{t} \cdot \mathbf{v} + \|\mathbf{t}\|\|\mathbf{v}\|}{2\|\mathbf{t}\|\|\mathbf{v}\|}\]
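
Dividing numerator and denominator by ‖t‖‖v‖ shows that this metric equals (1 + cos(t, v)) / 2, which is why it lies in [0, 1]. A minimal sketch of the computation follows; the function name `modified_cosine` and the zero-vector check are illustrative choices, not part of the source.

```python
import math


def modified_cosine(t, v):
    """Return (t·v + ‖t‖‖v‖) / (2‖t‖‖v‖), i.e. (1 + cos(t, v)) / 2."""
    dot = sum(a * b for a, b in zip(t, v))
    norm_product = math.sqrt(sum(a * a for a in t)) * math.sqrt(sum(b * b for b in v))
    if norm_product == 0.0:
        raise ValueError("similarity is undefined for a zero vector")
    return (dot + norm_product) / (2.0 * norm_product)


print(modified_cosine([1.0, 2.0], [2.0, 4.0]))    # parallel vectors: close to 1
print(modified_cosine([1.0, 0.0], [0.0, 1.0]))    # orthogonal vectors: 0.5
print(modified_cosine([1.0, 2.0], [-1.0, -2.0]))  # opposite vectors: close to 0
```

Orthogonal representations therefore score 0.5 rather than 0, and only vectors pointing in opposite directions reach the bottom of the range.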

The similarity loss ℒ_s is a cross-entropy term built on the assumption that low-similarity (mismatched) articles are more likely to be fake. Joint loss: ℒ = αℒ_p + βℒ_s. The parameters θ_p, θ_t, and θ_v are updated by gradient steps until convergence.
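
The summary does not spell out ℒ_s, so the sketch below assumes one natural reading of the stated assumption: 1 − s is treated as the probability that an article is fake, and the cross-entropy is taken against the label (y = 1 for fake). The function names and this exact form are assumptions, not a transcription of the paper.

```python
import math


def similarity_loss(s, y, eps=1e-12):
    """Cross-entropy treating 1 - s as P(fake); an assumed form of the similarity loss."""
    s = min(max(s, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(1.0 - s) + (1 - y) * math.log(s))


def joint_loss(loss_p, loss_s, alpha, beta):
    """Weighted sum of the prediction loss and the similarity loss."""
    return alpha * loss_p + beta * loss_s


# Similarity values taken from the reported case-study ranges.
low_s, high_s = 0.044, 0.966

print(similarity_loss(low_s, y=1))   # fake article with mismatched image: small loss
print(similarity_loss(high_s, y=1))  # fake article with matching image: large loss
print(similarity_loss(high_s, y=0))  # true article with matching image: small loss

# Combine with a hypothetical prediction loss of 0.3 using alpha:beta = 0.4:0.6.
print(joint_loss(0.3, similarity_loss(low_s, y=1), alpha=0.4, beta=0.6))
```

Under this reading, the similarity term penalizes the model most when a fake article's text and image representations are pushed close together, or when a true article's are pushed apart.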

Results

Datasets: PolitiFact (1,056 articles: 432 fake, 624 true) and GossipCop (22,140 articles: 5,323 fake, 16,817 true) from FakeNewsNet. Split: 80/20 by publication date; 5-fold cross-validation. Learning rate 10⁻⁴, 100 iterations, window sizes H = {3, 4}.

Dataset	Method	Acc.	Pre.	Rec.	F₁
PolitiFact	LIWC	0.822	0.785	0.846	0.815
	VGG-19	0.649	0.668	0.787	0.720
	att-RNN	0.769	0.735	0.942	0.826
	SAFE	0.874	0.889	0.903	0.896
GossipCop	LIWC	0.836	0.878	0.317	0.466
	VGG-19	0.775	0.775	0.970	0.862
	att-RNN	0.743	0.788	0.913	0.846
	SAFE	0.838	0.857	0.937	0.895
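
As a quick consistency check, the F₁ column can be recomputed from the precision and recall columns using the standard harmonic-mean definition. The helper name `f1_score` is an illustrative choice; the input values are copied from the table.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the results table.
rows = {
    ("PolitiFact", "SAFE"): (0.889, 0.903),
    ("GossipCop", "SAFE"): (0.857, 0.937),
    ("GossipCop", "LIWC"): (0.878, 0.317),
}

for (dataset, method), (pre, rec) in rows.items():
    print(dataset, method, round(f1_score(pre, rec), 3))
# PolitiFact SAFE 0.896
# GossipCop SAFE 0.895
# GossipCop LIWC 0.466
```

The recomputed values match the reported F₁ scores, and the LIWC row shows how a high precision cannot compensate for a recall of 0.317.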

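Ablation (F₁ on PolitiFact / GossipCop): SAFE\T 0.761/0.837, SAFE\V 0.782/0.868, SAFE\S 0.813/0.868, SAFE\W 0.795/0.876, SAFE 0.896/0.895. The full model is best on both datasets and SAFE\T is worst on both; the order of the intermediate variants is not strict (SAFE\S beats SAFE\W on PolitiFact but trails it on GossipCop, and SAFE\V ties SAFE\S on GossipCop). Best weighting: α:β = 0.4:0.6 on PolitiFact, 0.6:0.4 on GossipCop.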

Connections

Uses FakeNewsNet (PolitiFact + GossipCop); for social-context approaches on the same data, see Shu et al. (2019).
Extends the content-based detection paradigm to the multi-modal regime; feature engineering baselines (LIWC, RST) are the comparison regime that SAFE surpasses.
See multimodal detection for the broader landscape of text+image approaches.
Sitaula et al. (2019) demonstrate that source-credibility features outperform 23 text-content features. SAFE's result is complementary: its neural content representations also outperform the hand-crafted LIWC text baseline.
SAFE is used as a multimodal baseline in ReCOVery (Zhou et al., 2020), achieving F₁ 0.833/0.672 (reliable/unreliable) on COVID-19 data — the best-performing baseline in that benchmark.

Notes

The image2sentence step converts images to a text-like embedding via a captioning model, making the cross-modal similarity essentially a text-vs-generated-caption comparison. This is architecturally elegant and ensures commensurability, but sacrifices low-level visual signals (color, composition, copy-paste artifacts) that might independently indicate image manipulation.

SAFE is content-only by design, explicitly targeting the pre-diffusion stage where no social context exists. This makes it complementary to social-context methods like UPF but also limits it: well-curated fake news stories with deliberately matching images will not be caught by the similarity signal.

GossipCop is heavily class-imbalanced (5,323 fake vs. 16,817 true). LIWC reaches nearly the same accuracy as SAFE (0.836 vs. 0.838) while its recall on the fake class collapses to 0.317, illustrating how accuracy alone is misleading under imbalance. SAFE achieves strong recall (0.937) on the minority fake class without sacrificing overall F₁.
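
To make the imbalance point concrete, the sketch below computes the accuracy of a trivial classifier that always predicts the larger class, using the dataset counts reported in the Results section. The function name `majority_baseline_accuracy` is an illustrative choice.

```python
# Article counts as reported for the two FakeNewsNet partitions.
counts = {
    "PolitiFact": {"fake": 432, "true": 624},
    "GossipCop": {"fake": 5323, "true": 16817},
}


def majority_baseline_accuracy(fake, true):
    """Accuracy of always predicting the larger class."""
    return max(fake, true) / (fake + true)


for name, c in counts.items():
    print(name, round(majority_baseline_accuracy(c["fake"], c["true"]), 3))
# PolitiFact 0.591
# GossipCop 0.76
```

On GossipCop, always answering "true" already yields about 0.76 accuracy with zero recall on fake articles, so accuracies in the 0.83 to 0.84 range say little without the accompanying recall and F₁.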