Embracing Domain Differences in Fake News: Cross-domain Fake News Detection using Multimodal Data

Authors: Amila Silva, Ling Luo, Shanika Karunasekera, Christopher Leckie

Affiliation: School of Computing and Information Systems, The University of Melbourne

Venue: AAAI 2021 — arXiv

TL;DR

Most fake news detection models are trained on a single domain (e.g., politics) and fail when tested on others (e.g., entertainment, COVID-19). This paper proposes a multimodal framework that uses both domain-specific and cross-domain knowledge to detect fake news across domains. By combining unsupervised domain discovery with supervised domain-agnostic classification, the model achieves a 7.55% F1-score improvement over baselines on rarely-appearing domains.

Contributions

  • A multimodal fake news detection framework that exploits both domain-specific and cross-domain knowledge without requiring hard domain labels.
  • An unsupervised domain embedding learning technique using propagation networks and textual content to discover domains in unlabeled data.
  • A supervised domain-agnostic classification module using domain-specific and cross-domain embedding spaces to preserve both local and global knowledge.
  • An LSH-based instance selection technique to reduce manual labeling cost while maintaining domain coverage across multiple datasets.
  • Empirical validation on three public datasets (PolitiFact, GossipCop, CoAID) showing substantial improvements on rarely-appearing domains.

Method

The proposed framework consists of two main components: (1) unsupervised domain discovery and (2) supervised domain-agnostic fake news classification.

Unsupervised Domain Discovery (Module A): The framework builds a heterogeneous network from news propagation patterns (users who tweet or share a news record, and words that appear in its title) and derives domain embeddings from it in two steps. First, the Louvain algorithm detects communities in this network, identifying distinct user groups with homophilous interests. Second, each record's domain embedding is formed by concatenating its soft-membership probabilities across the detected communities, yielding a low-dimensional representation in which records cluster by domain.
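As a rough illustration of the second step, the sketch below turns a community assignment into a soft-membership vector for one news record. It assumes the communities have already been found (for example by a Louvain implementation) and uses a simplified stand-in for the membership computation: the fraction of a record's linked users and title words that fall in each community. The names `domain_embedding`, `record_nodes`, and `node_community` are hypothetical and are not taken from the paper.

```python
from collections import Counter


def domain_embedding(record_nodes, node_community, n_communities):
    """Soft membership of one record over communities (simplified stand-in).

    record_nodes: users and title words linked to the record.
    node_community: node -> community id, assumed to come from Louvain.
    """
    counts = Counter(
        node_community[node] for node in record_nodes if node in node_community
    )
    total = sum(counts.values())
    if total == 0:
        # No known neighbours: fall back to a uniform membership vector.
        return [1.0 / n_communities] * n_communities
    return [counts.get(c, 0) / total for c in range(n_communities)]


# Toy example: three nodes in community 0, one node in community 1.
node_community = {"user_a": 0, "user_b": 0, "word_tax": 0, "user_c": 1}
record = ["user_a", "user_b", "user_c", "word_tax"]
print(domain_embedding(record, node_community, n_communities=2))  # [0.75, 0.25]
```

Records whose users and words concentrate in the same communities end up with similar vectors, which is the property the clustering by domain relies on.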

Supervised Domain-agnostic Classification (Module B): Each news record is represented as a multimodal vector combining text content (via BERT) and propagation network features. The classifier maps this input into two separate subspaces: one that preserves domain-specific knowledge (via a domain-specific decoder) and one that preserves cross-domain knowledge (via a shared decoder). The two decoders are trained jointly with an adversarial loss so that domain information does not leak into the cross-domain representation.
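One way to write down a joint objective of this kind is sketched below. The notation is assumed for illustration and may differ from the paper's own equations: the domain-specific branch is rewarded for recovering the domain embedding, while the cross-domain branch is penalized whenever even the best shared decoder can recover it.

```latex
% Illustrative notation (assumed, not taken from the paper):
%   x: multimodal input, y: fake/real label, d: domain embedding from Module A
%   f_s, f_c: encoders into the domain-specific and cross-domain subspaces
%   g_s, g_c: domain-specific and shared decoders, h: classifier
\begin{align*}
\mathcal{L}_{\text{pred}}     &= \mathrm{BCE}\big(h([f_s(x);\, f_c(x)]),\, y\big) \\
\mathcal{L}_{\text{specific}} &= \lVert g_s(f_s(x)) - d \rVert_2^2 \\
\mathcal{L}_{\text{shared}}   &= \min_{g_c} \lVert g_c(f_c(x)) - d \rVert_2^2 \\
\mathcal{L}                   &= \mathcal{L}_{\text{pred}}
                                 + \lambda_1 \mathcal{L}_{\text{specific}}
                                 - \lambda_2 \mathcal{L}_{\text{shared}}
\end{align*}
```

The minus sign is what makes the setup adversarial: the shared decoder tries to reconstruct the domain embedding from the cross-domain representation, and the encoder is trained to make that reconstruction fail. The weights λ₁ and λ₂ are hypothetical hyperparameters.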

LSH-based Instance Selection (Module C): To limit labeling cost when training across multiple domains, the framework uses Locality-Sensitive Hashing (LSH) to select high-quality instances for labeling that maximize domain coverage. The method (1) creates random hash functions that map records to buckets; (2) constructs a hash table; (3) randomly picks records from the hash buckets with equal probability; and (4) repeats until the selected set reaches the labeling budget. This approach substantially outperforms random selection for rarely-appearing domains.
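The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses random-hyperplane (sign) hashing over domain embeddings, draws one record per bucket per round, and redraws the hash functions each round. The function name `lsh_select` and the parameters `n_hashes` and `seed` are hypothetical.

```python
import random
from collections import defaultdict


def lsh_select(embeddings, budget, n_hashes=4, seed=0):
    """Select up to `budget` record indices, one per LSH bucket per round."""
    if not embeddings:
        return []
    rng = random.Random(seed)
    dim = len(embeddings[0])
    budget = min(budget, len(embeddings))
    remaining = set(range(len(embeddings)))
    selected = []
    while len(selected) < budget:
        # (1) Random hyperplanes: each contributes one sign bit to the bucket key.
        planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_hashes)]
        # (2) Hash table from bucket key to the not-yet-selected records in it.
        table = defaultdict(list)
        for idx in sorted(remaining):
            vec = embeddings[idx]
            key = tuple(
                sum(p * v for p, v in zip(plane, vec)) >= 0 for plane in planes
            )
            table[key].append(idx)
        # (3) One uniformly random record from each bucket, in random bucket order.
        buckets = list(table.values())
        rng.shuffle(buckets)
        for bucket in buckets:
            if len(selected) == budget:
                break
            pick = rng.choice(bucket)
            selected.append(pick)
            remaining.discard(pick)
        # (4) The loop repeats with fresh hash functions until the budget is met.
    return selected


# Toy domain embeddings: a large domain (20 records) and a rare one (2 records).
records = [[1.0, 0.1 * (i % 3)] for i in range(20)] + [[-1.0, 0.2], [-1.0, 0.0]]
print(lsh_select(records, budget=6))
```

Because each bucket contributes at most one record per round regardless of how many records it holds, a small bucket is sampled about as often as a large one. Uniform random sampling, by contrast, would pick records from the rare domain only in proportion to its size.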

Results

The model was evaluated on three public datasets combined to create a cross-domain dataset: PolitiFact (269 fake, 280 real), GossipCop (1,269 fake, 2,466 real), and CoAID (135 fake, 1,568 real).
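To make the imbalance in the combined dataset concrete, the short script below computes each dataset's share of all records and its fraction of fake records from the counts listed above. The helper name `summarize` is only illustrative.

```python
# Record counts as reported above: (fake, real) per dataset.
datasets = {
    "PolitiFact": (269, 280),
    "GossipCop": (1269, 2466),
    "CoAID": (135, 1568),
}


def summarize(counts):
    """Return the total record count and per-dataset size, share, and fake ratio."""
    total = sum(fake + real for fake, real in counts.values())
    rows = {}
    for name, (fake, real) in counts.items():
        size = fake + real
        rows[name] = {
            "size": size,
            "share": size / total,
            "fake_ratio": fake / size,
        }
    return total, rows


total, rows = summarize(datasets)
for name, row in rows.items():
    print(
        f"{name:<10} n={row['size']:>4}  "
        f"share={row['share']:.1%}  fake={row['fake_ratio']:.1%}"
    )
```

With these counts, PolitiFact accounts for roughly 9% of the 5,987 records, and only about 8% of CoAID records are fake, so both domain size and label balance vary widely across the three sources.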

Main results: The proposed approach achieves F1-scores of 0.836–0.869 across domains, outperforming seven widely used baselines (including LIWC, TextCNN, HPNF+SVM, EANN-Unimodal, EANN-Multimodal, and SAFE), which achieve F1-scores of 0.744–0.831. The best baseline (EANN-Multimodal) achieves an F1-score of 0.833. The gap suggests that capturing domain-specific information while still generalizing across domains is what gives the proposed approach its advantage.

Ablation study: Removing the domain-specific loss (Eq. 5) lowers the F1-score by ~3%, removing the network modality lowers it by ~3%, and removing the text modality lowers it by ~2%, indicating that each component contributes. Compared with random selection, the LSH-based instance selection yields a 24% F1-score improvement on PolitiFact and 27% on CoAID, with substantially better domain coverage.

Notes

This paper addresses a critical practical problem: fake news detectors trained on one domain often fail dramatically on others. The key insight is that domain differences are not random noise but systematic patterns in user behavior and language, and that these patterns should be preserved rather than averaged out. The unsupervised domain discovery approach is elegant in that it uses network structure to infer domains without labels, though it assumes that communities correlate with domains, which may not always hold. The LSH-based instance selection is a useful contribution for practitioners managing labeling budgets across multiple domains.