MVAE: Multimodal Variational Autoencoder for Fake News Detection
Authors: Dhruv Khattar, Jaipal Singh Goud, Manish Gupta, Vasudeva Varma
Venue: World Wide Web Conference (WWW), May 2019, San Francisco — DOI
TL;DR
The paper proposes MVAE, a variational autoencoder that learns shared representations of text and images in social media posts to detect fake news. The model jointly trains an encoder-decoder (to discover correlations across modalities) with a binary classifier. On Twitter and Weibo datasets, MVAE outperforms prior multimodal detection methods by ~6% accuracy and ~5% F1 score.
Contributions
- A novel multimodal variational autoencoder (MVAE) that learns joint text-image representations for fake news detection.
- Joint training of the VAE (encoder-decoder) with a fake news classifier, enabling the model to discover correlations across modalities while performing detection.
- Extensive evaluation on two real-world social media datasets (Twitter, Weibo) with multimodal content, showing substantial improvements over state-of-the-art baselines.
Method
Encoder: Encodes text and image into a shared latent representation:
- Textual encoder: Stacked bi-directional LSTMs over pre-trained Word2Vec embeddings, followed by a fully connected layer.
- Visual encoder: Pre-trained VGG-19 (frozen weights) followed by fully connected layers, producing features of the same dimensionality as the text features.
- Fusion: The two feature vectors are concatenated, passed through a fully connected layer, and reparameterized using Gaussian sampling to produce the latent vector.
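The Gaussian reparameterization step in the encoder can be sketched as follows. This is a minimal illustration in plain Python, not the authors' code: the function name `reparameterize`, the log-variance parameterization, and the 3-dimensional example values are all assumptions made for the example.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps elementwise, with eps ~ N(0, 1).

    Assumption: the encoder outputs a mean and a log-variance per latent
    dimension, a common VAE convention that this summary does not confirm.
    """
    z = []
    for m, lv in zip(mu, log_var):
        sigma = math.exp(0.5 * lv)   # log-variance -> standard deviation
        eps = rng.gauss(0.0, 1.0)    # the randomness does not depend on mu or sigma
        z.append(m + sigma * eps)
    return z

rng = random.Random(0)               # fixed seed so the sketch is repeatable
mu = [0.5, -1.0, 2.0]                # hypothetical encoder means
log_var = [0.0, -2.0, -20.0]         # hypothetical encoder log-variances
z = reparameterize(mu, log_var, rng)
```

Because the noise `eps` is drawn independently of `mu` and `sigma`, gradients can flow through both during training, which is the usual reason for writing the sample this way. A dimension with very small variance (the last one here) stays close to its mean.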
Decoder: Reconstructs both text and image from the latent vector:
- Textual decoder: Bi-directional LSTMs with softmax outputs to reconstruct word probabilities.
- Visual decoder: Fully connected layers to reconstruct VGG-19 features.
Fake news detector: Binary classifier with fully connected layers that operates on the learned latent representation.
Joint loss: Final training objective combines VAE reconstruction loss (cross-entropy for text, MSE for image features), KL divergence loss, and binary classification loss.
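The way the loss terms fit together can be sketched as below. This is a hedged illustration in plain Python, not the paper's implementation: all function names, the weights `w_vae` and `w_cls` (defaulting to 1.0), the averaging conventions, and the toy inputs are assumptions, and the KL term assumes a diagonal Gaussian posterior with a standard normal prior.

```python
import math

def text_cross_entropy(word_probs, target_ids):
    """Mean negative log-probability of each target word under the decoder softmax."""
    return -sum(math.log(probs[t]) for probs, t in zip(word_probs, target_ids)) / len(target_ids)

def image_mse(reconstructed, target):
    """Mean squared error between reconstructed and original image features."""
    return sum((r - t) ** 2 for r, t in zip(reconstructed, target)) / len(target)

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, I)) for a diagonal Gaussian."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))

def binary_cross_entropy(p_fake, label):
    """Classification loss for one post; label is 1 for fake, 0 for real."""
    return -(label * math.log(p_fake) + (1 - label) * math.log(1.0 - p_fake))

def joint_loss(word_probs, target_ids, recon_feats, vgg_feats,
               mu, log_var, p_fake, label, w_vae=1.0, w_cls=1.0):
    # VAE part: text reconstruction + image-feature reconstruction + KL
    vae = (text_cross_entropy(word_probs, target_ids)
           + image_mse(recon_feats, vgg_feats)
           + kl_to_standard_normal(mu, log_var))
    # Hypothetical weights; the summary does not say how the terms are balanced.
    return w_vae * vae + w_cls * binary_cross_entropy(p_fake, label)

# Toy example: two words over a 3-word vocabulary, 2-d image features, 2-d latent
loss = joint_loss(
    word_probs=[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
    target_ids=[0, 1],
    recon_feats=[0.5, 1.0],
    vgg_feats=[0.0, 1.0],
    mu=[0.0, 0.0],
    log_var=[0.0, 0.0],
    p_fake=0.9,
    label=1,
)
```

Since every term is summed into one scalar, a single backward pass would update the encoder from both the reconstruction and the classification signal, which is what "joint training" refers to here.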
Results
On Twitter (MediaEval dataset): MVAE achieves 74.5% accuracy and 0.73 F1, compared to 66.4% and 0.66 for the best prior method (att-RNN).
On Weibo (from [8]): MVAE achieves 82.4% accuracy and 0.82 F1, compared to 78.2% and 0.80 for the best baseline (EANN).
The improvement reflects the model's ability to discover correlations across modalities, something prior attention-based and event-discriminator approaches did not explicitly optimize for.
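The per-dataset margins implied by the figures above can be worked out directly. The snippet below only restates the numbers reported in this section; the dictionary layout and the helper name `margins` are illustrative.

```python
# (accuracy in %, F1) as reported in this Results section
results = {
    "Twitter": {"MVAE": (74.5, 0.73), "best_baseline": (66.4, 0.66)},
    "Weibo": {"MVAE": (82.4, 0.82), "best_baseline": (78.2, 0.80)},
}

def margins(entry):
    """Return (accuracy gain in percentage points, F1 gain) over the best baseline."""
    acc, f1 = entry["MVAE"]
    base_acc, base_f1 = entry["best_baseline"]
    # Round to the precision of the reported figures to hide float noise
    return round(acc - base_acc, 1), round(f1 - base_f1, 2)

per_dataset = {name: margins(entry) for name, entry in results.items()}
mean_acc_gain = sum(m[0] for m in per_dataset.values()) / len(per_dataset)
mean_f1_gain = sum(m[1] for m in per_dataset.values()) / len(per_dataset)
```

The gain is 8.1 accuracy points and 0.07 F1 on Twitter, and 4.2 accuracy points and 0.02 F1 on Weibo. Averaged over the two datasets this is about 6.2 accuracy points and 0.045 F1, in line with the headline figures of roughly 6% and 5%. Whether the paper derives its headline figures by this averaging is an assumption.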
Connections
- Related to SAFE via joint treatment of text and image for fake news detection, though SAFE uses similarity-aware fusion rather than VAE-based reconstruction.
- Extends multimodal fusion methods to the detection domain, building on VAE foundations from generative modeling.
- Differs from network-based detection by focusing on post content rather than propagation structure.
Notes
Strengths:
- Clear motivation: using shared representation learning (VAE) rather than just fusion or attention to exploit multimodal correlations.
- Well-executed experiments with appropriate baselines (unimodal, multimodal prior art).
- Substantial empirical gains on two datasets, suggesting the approach generalizes.
Limitations:
- Evaluation limited to microblogs (Twitter, Weibo); generalization to news articles or other platforms is unclear.
- No ablation isolating the contribution of the VAE reconstruction loss vs. the classification loss; it is unclear whether reconstruction is essential or acts only as regularization.
- VGG-19 weights are frozen; joint fine-tuning might improve performance but was avoided because of "parameter explosion."
- No analysis of failure modes or of the learned latent representations (e.g., interpretability of what correlations are discovered).
Follow-ups:
- Explore whether reconstruction loss is necessary or if a simpler joint embedding would suffice.
- Extend to more modalities (e.g., user metadata, temporal context, multi-image posts).
- Analyze the latent space to understand what multimodal correlations the VAE learns.