Some Like it Hoax: Automated Fake News Detection in Social Networks

Authors: Eugenio Tacchini, Gabriele Ballarin, Marco L. Della Vedova, Stefano Moret, Luca de Alfaro

Venue: Technical Report UCSC-SOE-17-05, School of Engineering, UC Santa Cruz, 2017 — arXiv

TL;DR

Hoaxes on Facebook can be detected with >99% accuracy using user interaction patterns (who "liked" posts) rather than content analysis. Two algorithms—logistic regression and harmonic boolean crowdsourcing—achieve high accuracy even with minimal labeled training data, and knowledge transfers across different Facebook communities.

Contributions

Novel approach to fake news detection based on user interaction patterns instead of content features.
Logistic regression classifier achieving >99% accuracy on Facebook hoax classification.
Harmonic boolean label crowdsourcing (BLC) adaptation for the hoax detection setting, achieving 99.4% accuracy.
Empirical demonstration that the method works with <1% of posts labeled, enabling practical deployment at scale.
Evidence that detection knowledge transfers across different Facebook communities (scientific vs. conspiracy pages).

Method

The paper treats hoax detection as a binary classification problem over Facebook posts using features derived from user interactions (likes).

Logistic regression approach: Each user u is modeled as having a weight w_u that indicates whether they preferentially like hoaxes or non-hoaxes. For a post i with likes from users U_i, the probability that i is a hoax is given by:

\[p_i = \frac{1}{1 + e^{-w}}\]

where w is the sum of user weights for all users who liked post i. The model learns weights in a supervised setting where a subset of posts has known labels.
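The model above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the update rule is ordinary stochastic gradient descent on the log loss, there is no bias term or regularization, and the names `train`, `hoax_probability`, the learning rate, the epoch count, and the toy users and posts are all assumptions made for the example.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def hoax_probability(likers, weights):
    # p_i = 1 / (1 + exp(-w)), where w sums the weights of users who liked post i.
    # Users never seen in training contribute a weight of 0.
    w = sum(weights.get(u, 0.0) for u in likers)
    return sigmoid(w)

def train(posts, labels, epochs=200, lr=0.5):
    # posts: {post_id: set of user ids}; labels: {post_id: 1 (hoax) or 0 (non-hoax)}.
    # Plain SGD on the log loss (an assumption, not the paper's solver).
    weights = {}
    for _ in range(epochs):
        for post_id, y in labels.items():
            likers = posts[post_id]
            error = hoax_probability(likers, weights) - y  # d(loss)/dw
            for u in likers:
                weights[u] = weights.get(u, 0.0) - lr * error
    return weights

# Hypothetical toy data: two labeled hoaxes, two labeled non-hoaxes, two unlabeled posts.
posts = {
    "h1": {"ann", "bob"},
    "h2": {"bob", "cat"},
    "s1": {"dan", "eve"},
    "s2": {"eve", "fay"},
    "x1": {"cat"},
    "x2": {"dan", "fay"},
}
labels = {"h1": 1, "h2": 1, "s1": 0, "s2": 0}

weights = train(posts, labels)
p_x1 = hoax_probability(posts["x1"], weights)  # liked by a user who liked a hoax
p_x2 = hoax_probability(posts["x2"], weights)  # liked by users who liked non-hoaxes
```

In this sketch a post liked only by unseen users gets probability 0.5, which illustrates why a model with one weight per user cannot say much about posts whose likers never appear in the training set.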

Harmonic boolean label crowdsourcing: An adaptation of the harmonic algorithm for crowdsourced labels, designed to handle settings where training data may be sparse and unbalanced. The algorithm maintains parameters α and β for each user/post representing beliefs about hoax/non-hoax likelihood, updated iteratively:

\[\alpha_u := A + \sum\{q_i \mid i \in \partial u,\ q_i > 0\}\]

\[\beta_u := B - \sum\{q_i \mid i \in \partial u,\ q_i < 0\}\]

where ∂u is the set of posts liked by user u, q_i is the current signed belief for post i, and A and B are constants tuned so that roughly 5 "likes" from known reliable users are needed to shift belief toward non-hoax. The algorithm propagates information through the user-post interaction graph and satisfies a non-interference property: information flows only along edges corresponding to "likes."
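The propagation can be sketched as follows. Several details here are assumptions rather than facts taken from the summary above: the signed belief is taken to be q = (α − β) / (α + β), labeled posts are pinned at q = +1 (non-hoax) or q = −1 (hoax), posts are updated with the same rule as users, and the defaults `A = B = 1.0` and `iterations = 5` are placeholder values, not the tuned constants from the paper. The function name `harmonic_blc` and the toy data are hypothetical.

```python
def harmonic_blc(likes, known, A=1.0, B=1.0, iterations=5):
    # likes: iterable of (user, post) pairs.
    # known: {post: +1 for non-hoax, -1 for hoax} (sign convention is an assumption).
    # Returns a signed belief q in [-1, 1] for every post that has at least one like.
    posts_of, users_of = {}, {}
    for user, post in likes:
        posts_of.setdefault(user, set()).add(post)
        users_of.setdefault(post, set()).add(user)

    q_post = {i: float(known.get(i, 0.0)) for i in users_of}
    q_user = {u: 0.0 for u in posts_of}

    def belief(neighbour_q):
        # alpha collects positive evidence, beta collects negative evidence.
        alpha = A + sum(q for q in neighbour_q if q > 0)
        beta = B - sum(q for q in neighbour_q if q < 0)
        return (alpha - beta) / (alpha + beta)

    for _ in range(iterations):
        # Posts -> users: each user is updated from the posts they liked.
        for u, liked in posts_of.items():
            q_user[u] = belief([q_post[i] for i in liked])
        # Users -> posts: only unlabeled posts are updated; labeled ones stay pinned.
        for i, likers in users_of.items():
            if i not in known:
                q_post[i] = belief([q_user[u] for u in likers])
    return q_post

# Hypothetical toy graph: x1 and x2 are unlabeled.
likes = [
    ("ann", "h1"), ("bob", "h1"), ("bob", "h2"), ("cat", "h2"),
    ("dan", "s1"), ("eve", "s1"), ("eve", "s2"), ("fay", "s2"),
    ("cat", "x1"), ("dan", "x2"), ("fay", "x2"),
]
known = {"h1": -1, "h2": -1, "s1": 1, "s2": 1}

q = harmonic_blc(likes, known)
predicted_hoaxes = {i for i, value in q.items() if value < 0}
```

Because every update reads only the beliefs of direct neighbours in the like graph, the sketch respects the non-interference property described above: a user's belief can change only through posts they liked.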

Dataset: The authors constructed a dataset of 15,500 Facebook posts from 32 pages (14 conspiracy-focused, 18 scientific news sources), liked by 909,236 users. Each post is labeled hoax or non-hoax according to whether its page is classified as a conspiracy or a scientific source.

Results

Cross-validation accuracy:
- Logistic regression: >99% accuracy on the complete dataset across all training set sizes tested.
- Harmonic BLC: 99.4% accuracy on the complete dataset.

Scaling with small training sets: Both methods maintain high accuracy with minimal labeled data:
- Logistic regression: >90% accuracy with as little as 1% of posts labeled (~150 posts).
- Harmonic BLC: still achieves 99%+ accuracy when trained on 0.5% of posts (~80 posts).

Cross-page transfer learning:
- One-page-out validation: logistic regression achieves 79.4% accuracy when trained on all but one page; harmonic BLC achieves 99.1%.
- Half-pages-out validation: logistic regression achieves 71.6% accuracy; harmonic BLC achieves 99.3%.
- Demonstrates that user behavior patterns generalize across different Facebook communities, even those with different topical focuses.
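The two page-level protocols can be sketched as split generators. This is an illustration under assumptions: the helper names `one_page_out_splits` and `half_pages_out_split` are hypothetical, and the half-pages-out variant is read here as holding out a random half of the pages for testing and training on the rest.

```python
import random

def one_page_out_splits(post_page):
    # post_page: {post_id: page_id}.
    # Yields (held_out_page, train_posts, test_posts), one split per page.
    for held_out in sorted(set(post_page.values())):
        train = [p for p, page in post_page.items() if page != held_out]
        test = [p for p, page in post_page.items() if page == held_out]
        yield held_out, train, test

def half_pages_out_split(post_page, seed=0):
    # Holds out a random half of the pages (rounded down) for testing.
    pages = sorted(set(post_page.values()))
    random.Random(seed).shuffle(pages)
    held_out = set(pages[: len(pages) // 2])
    train = [p for p, page in post_page.items() if page not in held_out]
    test = [p for p, page in post_page.items() if page in held_out]
    return train, test

# Hypothetical toy mapping from posts to the pages that published them.
post_page = {"p1": "A", "p2": "A", "p3": "B", "p4": "C"}
splits = list(one_page_out_splits(post_page))
half_train, half_test = half_pages_out_split(post_page)
```

Splitting by page rather than by post matters because no post from a held-out page is ever seen in training, so the classifier can rely only on users who also liked posts on other pages.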

Intersection dataset: When restricting to users who liked both hoax and non-hoax posts (to test transfer in mixed-preference communities), logistic regression degrades to 91.6% but harmonic BLC remains above 99%, suggesting greater robustness for mixed communities.
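One way to build such a restricted dataset is sketched below. The function names `intersection_users` and `restrict_likes`, the label strings, and the toy data are assumptions for illustration; the paper's exact construction may differ.

```python
def intersection_users(likes, post_label):
    # likes: iterable of (user, post); post_label: {post: "hoax" or "non-hoax"}.
    # Returns the users who liked at least one post of each class.
    seen = {}
    for user, post in likes:
        seen.setdefault(user, set()).add(post_label[post])
    return {u for u, labels in seen.items() if {"hoax", "non-hoax"} <= labels}

def restrict_likes(likes, users):
    # Keeps only the likes made by the selected users.
    return [(u, p) for u, p in likes if u in users]

# Hypothetical toy data.
likes = [
    ("ann", "h1"), ("ann", "s1"),
    ("bob", "h1"),
    ("cat", "s1"), ("cat", "h2"),
]
post_label = {"h1": "hoax", "h2": "hoax", "s1": "non-hoax"}

mixed = intersection_users(likes, post_label)
restricted = restrict_likes(likes, mixed)
```

In the toy example only `ann` and `cat` survive the filter, so every remaining like comes from a user whose behaviour does not cleanly separate the two classes.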

Connections

Related to Misinformation and fake news detection through the shared goal of identifying false or misleading information.
Related to crowdsourcing methods for data labeling and aggregation of noisy labels.
Related to User Behavior Analysis for understanding how users interact with different content types.
Related to social network analysis for exploiting diffusion and interaction patterns.

Notes

Strengths:
- Simple and scalable approach that avoids content analysis, which is language-dependent and evolves as misinformation tactics change.
- High accuracy with minimal labeled data is practically valuable for real-world deployment.
- Cross-page transfer learning suggests the method captures something fundamental about user communities, not just specifics of individual pages.
- The harmonic algorithm variant is computationally efficient and theoretically grounded.

Limitations:
- Assumes that conspiracy pages predominantly post hoaxes and science pages predominantly post accurate content, a reasonable but not universally true assumption.
- The intersection dataset results (users who liked both hoax and non-hoax content) hint at potential brittleness when user preferences are genuinely mixed.
- The method is specific to Facebook's "like" affordance; applicability to other platforms requires adapted interaction signals.
- Limited to detection; does not explain why certain hoaxes spread or how to intervene.

Open questions:
- How does the method perform in real time, before posts accumulate many likes?
- Can the approach extend to other platforms (Twitter, TikTok) where interaction signals differ?
- What are the failure modes: do certain types of hoaxes spread within scientific communities, or certain non-hoaxes within conspiracy communities?