Skip to content
We used Neural Networks to Detect Clickbaits: You won't believe what happened Next!

We used Neural Networks to Detect Clickbaits: You won't believe what happened Next!

Authors: Ankesh Anand, Tanmoy Chakraborty, Noseong Park Venue: arXiv:1612.01340 — arXiv

TL;DR

Clickbaits exploit curiosity gaps to drive clicks on disappointing content. This paper proposes bidirectional RNNs with distributed word embeddings and character-level embeddings for clickbait detection, achieving 98% accuracy on a 15,000-headline dataset — a 5% improvement over hand-crafted feature baselines.

Contributions

  • Novel neural network architecture combining word and character embeddings for clickbait detection
  • Comparison of RNN variants (standard RNN, LSTM, GRU) in bidirectional setting
  • Empirical evaluation on a balanced 15,000-headline dataset with state-of-the-art results
  • Demonstration that deep learning eliminates the need for heavy feature engineering in this task

Method

The model uses a bidirectional RNN architecture with three main components:

Embedding Layer: Each input word is embedded as a concatenation of (1) pre-trained 300-dimensional word2vec embeddings from Google News and (2) character-level embeddings learned via a 3-layer 1D CNN with ReLU activations and max-pooling. Character embeddings capture orthographic and morphological features while handling out-of-vocabulary words.

Hidden Layer: A bidirectional RNN processes the embedded sequence in both directions to capture contextual information. The paper evaluates three RNN architectures: standard RNNs, LSTMs (which use gating to preserve long-range dependencies), and GRUs (a simpler gated variant). The final hidden state becomes a fixed-size representation.

Output Layer: The RNN representation passes through a fully connected network with a sigmoid output node for binary classification (clickbait vs. non-clickbait).

The model is trained with mini-batch gradient descent (batch size 64), ADAM optimizer, binary cross-entropy loss, and dropout (rate 0.3) for regularization.

Results

On a balanced dataset of 15,000 headlines (7,500 clickbait from BuzzFeed/Upworthy/ViralNova, 7,500 non-clickbait from Wikinews):

Model Accuracy Precision Recall F1 ROC-AUC
BiLSTM (CE+WE) 0.98 0.98 0.98 0.98 0.99
Chakraborty et al. (SVM) 0.93 0.95 0.90 0.93 0.97

The bidirectional LSTM with combined embeddings outperforms all baselines, achieving >5% accuracy improvement and >2% ROC-AUC improvement over the state-of-the-art SVM model.

Connections

Notes

This paper demonstrates the effectiveness of deep learning over feature engineering for clickbait detection. The use of both word and character embeddings is well-motivated: word embeddings capture semantic content while character embeddings handle morphological cues that signal clickbait (e.g., excessive punctuation, capitalization patterns). The evaluation via 10-fold cross-validation is rigorous, though limited to English headlines. The paper promises to open-source the model weights for reproducibility, though this is now standard practice.