SAFE: Similarity-Aware Multi-Modal Fake News Detection

Authors: Xinyi Zhou, Jindi Wu, Reza Zafarani (Zhou and Wu contributed equally)

Venue: arXiv:2003.04981 [cs.CL], February 2020

TL;DR

Fake news articles frequently pair misleading textual claims with irrelevant or manipulative images, creating a detectable gap between the two modalities. SAFE exploits this gap in three steps. It first extracts separate neural representations of an article's text and images, applying the same Text-CNN architecture to the text and to an image2sentence embedding of the visual content. It then measures the cross-modal similarity of the two representations with a modified cosine metric. Finally, the classifier jointly optimizes a modal-independent prediction loss and a similarity-based loss. SAFE outperforms text-only, image-only, and prior multi-modal baselines on both the PolitiFact (F₁ 0.896) and GossipCop (F₁ 0.895) partitions of FakeNewsNet.

Contributions

  • First method to explicitly model the relationship (similarity) between news textual and visual information as a signal for fake news detection, distinct from simply concatenating multi-modal features.
  • SAFE framework with three modules: (1) multi-modal feature extraction via Text-CNN on text and image2sentence embeddings, (2) modal-independent fake news prediction, and (3) cross-modal similarity extraction, jointly optimized with loss ℒ = αℒ_p + βℒ_s.
  • Empirical ablation showing that cross-modal similarity (SAFE\S → SAFE) consistently improves over concatenated multi-modal features alone, and that textual information contributes more than visual (SAFE\V > SAFE\T).
  • Case studies demonstrating fake news articles receive low similarity scores (s ≈ 0.001–0.044) while true news articles receive high scores (s ≈ 0.966–0.983).

Method

Problem formulation. A news article A = {T, V} consists of textual content T and visual content V. The goal is to predict ŷ ∈ {0, 1} (0 = true, 1 = fake) by learning d-dimensional representations t = ℳ_t(T, θ_t) and v = ℳ_v(V, θ_v) and their similarity s = ℳ_s(t, v) ∈ [0, 1].

Textual feature extraction. Text-CNN (Kim, 2014) is extended with an additional fully connected layer. Words are embedded via word2vec; a convolutional layer with window sizes H = {3, 4} produces feature maps over n-gram windows, followed by max-over-time pooling and a linear projection to t ∈ ℝᵈ.
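
The pipeline described above (n-gram convolutions, max-over-time pooling, linear projection) can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function name `text_cnn_features`, the ReLU activation, and the toy inputs are assumptions, and the word vectors stand in for word2vec embeddings.

```python
def text_cnn_features(embeddings, filters, proj_w, proj_b):
    """Map a sequence of word vectors to a d-dimensional vector.

    embeddings: list of n word vectors, each of length k
    filters:    list of (h, weights, bias); weights is a flat list of length h*k
    proj_w:     d rows, each with one weight per filter
    proj_b:     d biases
    """
    pooled = []
    for h, weights, bias in filters:
        if len(embeddings) < h:
            raise ValueError("sequence is shorter than the filter window")
        activations = []
        # Slide a window of h consecutive words over the sequence.
        for i in range(len(embeddings) - h + 1):
            window = [x for vec in embeddings[i:i + h] for x in vec]
            z = sum(w * x for w, x in zip(weights, window)) + bias
            activations.append(max(0.0, z))  # ReLU: an assumed activation
        pooled.append(max(activations))      # max-over-time pooling
    # Extra fully connected layer projecting the pooled features to d dimensions.
    return [sum(w * p for w, p in zip(row, pooled)) + b
            for row, b in zip(proj_w, proj_b)]


# Toy input: four one-dimensional "word vectors" and a single filter with h = 3.
toy = text_cnn_features(
    embeddings=[[1.0], [2.0], [3.0], [4.0]],
    filters=[(3, [1.0, 1.0, 1.0], 0.0)],
    proj_w=[[2.0]],
    proj_b=[1.0],
)
print(toy)  # [19.0]
```

In the toy call the two windows sum to 6 and 9, pooling keeps 9, and the projection gives 2 · 9 + 1 = 19. A real model would use several filters for each window size in H and learn all weights by backpropagation.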

Visual feature extraction. Rather than applying a pre-trained vision model (e.g., VGG-19) directly to pixel data, images are first processed by the image2sentence model (Vinyals et al., 2016) to produce a sentence-like embedding of the visual content. The same Text-CNN architecture is then applied to produce v ∈ ℝᵈ. This ensures t and v inhabit a comparable representation space, making cross-modal similarity computationally well-defined.

Modal-independent prediction. The concatenation t ⊕ v is passed to a softmax classifier ℳ_p, trained with the cross-entropy loss:

\[\mathcal{L}_p(\theta_t, \theta_v, \theta_p) = -\mathbb{E}_{(a,y) \sim (A,Y)}\!\left[y \log \mathcal{M}_p(\mathbf{t}, \mathbf{v}) + (1-y)\log(1-\mathcal{M}_p(\mathbf{t}, \mathbf{v}))\right]\]
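
A minimal sketch of this prediction step, assuming a single linear layer ahead of the softmax and the convention that output index 1 is the fake class. The helper names `predict_fake_probability` and `prediction_loss` are hypothetical, and the expectation is approximated by a mean over a batch.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_fake_probability(t, v, W, b):
    """Linear layer over the concatenated vector, followed by softmax.

    W has two rows (index 0 = true, index 1 = fake; an assumed ordering),
    each of length len(t) + len(v). Returns the probability of "fake".
    """
    tv = list(t) + list(v)  # concatenation of the two modal representations
    logits = [sum(w * x for w, x in zip(row, tv)) + bias
              for row, bias in zip(W, b)]
    return softmax(logits)[1]


def prediction_loss(p_fake, labels, eps=1e-12):
    """Mean binary cross-entropy; label 1 = fake, 0 = true."""
    total = 0.0
    for p, y in zip(p_fake, labels):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


# With all-zero weights the classifier is maximally uncertain.
p = predict_fake_probability([1.0], [2.0], [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
print(p)  # 0.5
```

An uncertain prediction of 0.5 costs log 2 per article regardless of the label, which is the usual reference point for this loss.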

Cross-modal similarity extraction. Similarity is a modified cosine that maps to [0, 1]:

\[\mathcal{M}_s(\mathbf{t}, \mathbf{v}) = \frac{\mathbf{t} \cdot \mathbf{v} + \|\mathbf{t}\|\|\mathbf{v}\|}{2\|\mathbf{t}\|\|\mathbf{v}\|}\]
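
The formula is equivalent to (1 + cos(t, v)) / 2, which shifts the usual cosine range [−1, 1] to [0, 1]. A short sketch of the computation follows; the function name `modified_cosine` is illustrative, and the zero-vector check is an added safeguard that the formula itself does not specify.

```python
import math


def modified_cosine(t, v):
    """Cross-modal similarity (t·v + |t||v|) / (2|t||v|), in [0, 1]."""
    dot = sum(a * b for a, b in zip(t, v))
    norm = math.sqrt(sum(a * a for a in t)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("similarity is undefined for a zero vector")
    return (dot + norm) / (2.0 * norm)


print(modified_cosine([1.0, 0.0], [1.0, 0.0]))   # 1.0 (same direction)
print(modified_cosine([1.0, 0.0], [0.0, 1.0]))   # 0.5 (orthogonal)
print(modified_cosine([1.0, 0.0], [-1.0, 0.0]))  # 0.0 (opposite direction)
```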

The similarity loss ℒ_s is also a cross-entropy, built on the assumption that low-similarity (mismatched) articles are more likely to be fake. The joint loss is ℒ = αℒ_p + βℒ_s, and the parameters θ_p, θ_t, and θ_v are updated by iterative gradient steps until convergence.
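
One way to read the joint objective is sketched below. The exact form of ℒ_s is an assumption inferred from the description above: it treats 1 − s as the probability that an article is fake, so a fake article with low similarity incurs little loss. The function name `joint_loss` and the batch-mean approximation of the expectation are also illustrative choices.

```python
import math


def _clip(p, eps=1e-12):
    """Keep probabilities away from 0 and 1 so the logarithms stay finite."""
    return min(max(p, eps), 1.0 - eps)


def joint_loss(p_fake, sims, labels, alpha, beta):
    """alpha * L_p + beta * L_s over a batch; label 1 = fake, 0 = true.

    p_fake: classifier probabilities that each article is fake
    sims:   cross-modal similarities s in [0, 1]
    """
    n = len(labels)
    # Prediction loss: standard binary cross-entropy on p_fake.
    l_p = -sum(y * math.log(_clip(p)) + (1 - y) * math.log(1.0 - _clip(p))
               for p, y in zip(p_fake, labels)) / n
    # Similarity loss (assumed form): cross-entropy with 1 - s as P(fake).
    l_s = -sum(y * math.log(1.0 - _clip(s)) + (1 - y) * math.log(_clip(s))
               for s, y in zip(sims, labels)) / n
    return alpha * l_p + beta * l_s


# A fake article (y = 1) with mismatched vs. well-matched text and image.
mismatched = joint_loss([0.5], [0.1], [1], alpha=0.4, beta=0.6)
matched = joint_loss([0.5], [0.9], [1], alpha=0.4, beta=0.6)
print(mismatched < matched)  # True
```

The weights α = 0.4 and β = 0.6 are the setting reported as best on PolitiFact; any other non-negative pair can be substituted.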

Results

Datasets: PolitiFact (1,056 articles: 432 fake, 624 true) and GossipCop (22,140 articles: 5,323 fake, 16,817 true), both from FakeNewsNet. Split: 80/20 by publication date; 5-fold cross-validation. Hyperparameters: learning rate 10⁻⁴, 100 iterations, window sizes H = {3, 4}.

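| Dataset    | Method  | Acc.  | Pre.  | Rec.  | F₁    |
|------------|---------|-------|-------|-------|-------|
| PolitiFact | LIWC    | 0.822 | 0.785 | 0.846 | 0.815 |
| PolitiFact | VGG-19  | 0.649 | 0.668 | 0.787 | 0.720 |
| PolitiFact | att-RNN | 0.769 | 0.735 | 0.942 | 0.826 |
| PolitiFact | SAFE    | 0.874 | 0.889 | 0.903 | 0.896 |
| GossipCop  | LIWC    | 0.836 | 0.878 | 0.317 | 0.466 |
| GossipCop  | VGG-19  | 0.775 | 0.775 | 0.970 | 0.862 |
| GossipCop  | att-RNN | 0.743 | 0.788 | 0.913 | 0.846 |
| GossipCop  | SAFE    | 0.838 | 0.857 | 0.937 | 0.895 |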
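
As a quick consistency check, the reported F₁ values for SAFE can be recomputed from the precision and recall columns, since F₁ is their harmonic mean. The snippet below is only a sanity check on the rounded figures shown here; the dictionary name `safe_rows` is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


# (precision, recall) for SAFE, copied from the results table.
safe_rows = {"PolitiFact": (0.889, 0.903), "GossipCop": (0.857, 0.937)}

for dataset, (pre, rec) in safe_rows.items():
    print(dataset, round(f1_score(pre, rec), 3))
# PolitiFact 0.896
# GossipCop 0.895
```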

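Ablation (F₁ on PolitiFact / GossipCop): SAFE\T 0.761/0.837; SAFE\V 0.782/0.868; SAFE\S 0.813/0.868; SAFE\W 0.795/0.876; SAFE 0.896/0.895. The full model is best and SAFE\T is worst on both datasets, but the middle variants do not rank the same way everywhere: SAFE\S beats SAFE\W on PolitiFact and trails it on GossipCop. Best weighting: α:β = 0.4:0.6 on PolitiFact, 0.6:0.4 on GossipCop.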

Connections

  • Uses FakeNewsNet (PolitiFact + GossipCop); for social-context approaches on the same data, see Shu et al. (2019).
  • Extends the content-based detection paradigm to the multi-modal regime; feature engineering baselines (LIWC, RST) are the comparison regime that SAFE surpasses.
  • See multimodal detection for the broader landscape of text+image approaches.
  • Sitaula et al. (2019) demonstrate that source-credibility features outperform 23 text-content features. This is a complementary finding: SAFE likewise shows the limits of hand-crafted text features by outperforming the LIWC baseline.
  • SAFE is used as a multimodal baseline in ReCOVery (Zhou et al., 2020), achieving F₁ 0.833/0.672 (reliable/unreliable) on COVID-19 data — the best-performing baseline in that benchmark.

Notes

The image2sentence step converts images to a text-like embedding via a captioning model, making the cross-modal similarity essentially a text-vs-generated-caption comparison. This is architecturally elegant and ensures commensurability, but sacrifices low-level visual signals (color, composition, copy-paste artifacts) that might independently indicate image manipulation.

SAFE is content-only by design, explicitly targeting the pre-diffusion stage where no social context exists. This makes it complementary to social-context methods like UPF but also limits it: well-curated fake news stories with deliberately matching images will not be caught by the similarity signal.

GossipCop is heavily class-imbalanced (5,323 fake vs. 16,817 true, about 24% fake). LIWC reaches 0.836 accuracy there while recovering only 0.317 of the fake articles, which illustrates how accuracy alone misleads under imbalance. SAFE achieves strong recall (0.937) on the minority fake class without sacrificing overall F₁ (0.895).
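
To put the imbalance in numbers, the sketch below computes the accuracy of a trivial classifier that labels every GossipCop article as true. It uses the full-dataset counts quoted above and assumes the test split has a similar class ratio, which may not hold exactly.

```python
fake, true = 5323, 16817
total = fake + true

# A classifier that always answers "true" gets every true article right
# and every fake article wrong, so its fake-class recall is 0.
majority_accuracy = true / total
fake_share = fake / total

print(total)                        # 22140
print(round(majority_accuracy, 3))  # 0.76
print(round(fake_share, 3))         # 0.24
```

Against this baseline of roughly 0.76, LIWC's 0.836 accuracy is a modest gain that hides its 0.317 recall on fake articles.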