We used Neural Networks to Detect Clickbaits: You won't believe what happened Next!¶

Authors: Ankesh Anand, Tanmoy Chakraborty, Noseong Park Venue: arXiv:1612.01340 — arXiv

TL;DR¶

Clickbaits exploit curiosity gaps to drive clicks on disappointing content. This paper proposes bidirectional RNNs with distributed word embeddings and character-level embeddings for clickbait detection, achieving 98% accuracy on a 15,000-headline dataset — a 5% improvement over hand-crafted feature baselines.

Contributions¶

Novel neural network architecture combining word and character embeddings for clickbait detection
Comparison of RNN variants (standard RNN, LSTM, GRU) in bidirectional setting
Empirical evaluation on a balanced 15,000-headline dataset with state-of-the-art results
Demonstration that deep learning eliminates the need for heavy feature engineering in this task

Method¶

The model uses a bidirectional RNN architecture with three main components:

Embedding Layer: Each input word is embedded as a concatenation of (1) pre-trained 300-dimensional word2vec embeddings from Google News and (2) character-level embeddings learned via a 3-layer 1D CNN with ReLU activations and max-pooling. Character embeddings capture orthographic and morphological features while handling out-of-vocabulary words.

Hidden Layer: A bidirectional RNN processes the embedded sequence in both directions to capture contextual information. The paper evaluates three RNN architectures: standard RNNs, LSTMs (which use gating to preserve long-range dependencies), and GRUs (a simpler gated variant). The final hidden state becomes a fixed-size representation.

Output Layer: The RNN representation passes through a fully connected network with a sigmoid output node for binary classification (clickbait vs. non-clickbait).

The model is trained with mini-batch gradient descent (batch size 64), ADAM optimizer, binary cross-entropy loss, and dropout (rate 0.3) for regularization.

Results¶

On a balanced dataset of 15,000 headlines (7,500 clickbait from BuzzFeed/Upworthy/ViralNova, 7,500 non-clickbait from Wikinews):

Model	Accuracy	Precision	Recall	F1	ROC-AUC
BiLSTM (CE+WE)	0.98	0.98	0.98	0.98	0.99
Chakraborty et al. (SVM)	0.93	0.95	0.90	0.93	0.97

The bidirectional LSTM with combined embeddings outperforms all baselines, achieving >5% accuracy improvement and >2% ROC-AUC improvement over the state-of-the-art SVM model.

Connections¶

Clickbait — directly addresses clickbait detection via neural networks
Deep learning — deep neural network methods applied to text classification
Recurrent Neural Networks — employs bidirectional LSTM and GRU architectures
Text classification — headline classification task

Notes¶

This paper demonstrates the effectiveness of deep learning over feature engineering for clickbait detection. The use of both word and character embeddings is well-motivated: word embeddings capture semantic content while character embeddings handle morphological cues that signal clickbait (e.g., excessive punctuation, capitalization patterns). The evaluation via 10-fold cross-validation is rigorous, though limited to English headlines. The paper promises to open-source the model weights for reproducibility, though this is now standard practice.