Skip to content
Predicting Factuality of Reporting and Bias of News Media Sources

Predicting Factuality of Reporting and Bias of News Media Sources

Authors: Ramy Baly, Georgi Karadzhov, Dimitar Alexandrov, James Glass, Preslav Nakov

Venue: arXiv preprint, 2018 — arXiv:1810.01765

TL;DR

This paper frames predicting news media factuality and bias as classification tasks at the news outlet level rather than at the claim or article level. The authors extract 141 features from multiple sources (articles, Wikipedia, Twitter, URL structure, web traffic) and train SVM classifiers, creating a dataset of 1,066 websites manually annotated for both factuality (low/mixed/high) and bias (7-point political scale). Results show textual features from articles are most predictive for factuality, while Wikipedia and Twitter information help with bias detection; combining feature types yields the best performance.

Contributions

  • Media-level profiling: Shift focus from individual claims to characterizing entire news outlets as a prior for fact-checking systems
  • Large-scale dataset: Release 1,066 websites with annotations for both factuality and bias, 1–2 orders of magnitude larger than prior work
  • Multi-source feature extraction: Design 141 features spanning article text, Wikipedia presence, Twitter profiles, URL structure, and traffic patterns
  • Feature ablation: Demonstrate complementary value of different feature families; textual features dominate for factuality while bias benefits from diverse signals

Method

The authors model factuality (Low, Mixed, High) and bias (7-point scale from Extreme-Left to Extreme-Right) as separate classification tasks using Support Vector Machines.

Data collection: They curate labels from the Media Bias/Fact Check (MBFC) website, which provides manual annotations of news outlets' factuality and political slant. They filter to 1,066 websites where both labels are available or readily inferred (e.g., a satirical outlet is low-factuality).

Feature families:

  1. Articles: 141 features including structural properties (POS tags, readability), sentiment scores, engagement metrics (share/like counts), topic-based lexicons, text complexity, bias-related word lexicons, and morality foundation theory features.

  2. Wikipedia: Binary presence/absence of a Wikipedia page; seven vector representations using word embeddings from the article's content, infobox, summary, categories, and table of contents; counts of friends/statuses/favorites.

  3. Twitter: Account properties (verification status, account age, location presence); tweet-level description vector (Google News embeddings).

  4. URL: Character-based n-gram features (1–2-grams), orthographic properties (length, sections, special characters), credibility indicators (HTTPS usage, blog-hosting platforms, top-level domains like .gov).

  5. Web Traffic: Alexa Rank as a single feature (less discriminative than expected).

Classification: SVM with RBF kernel, hyperparameters tuned via 5-fold cross-validation. Results reported as Macro-F₁, accuracy, and MAE (for ordinal regression variant).

Results

Headline numbers (5-fold cross-validation on full feature set):

  • Factuality: 58.02 Macro-F₁ (64.35 accuracy, 0.43 MAE)
  • Article features yield best individual performance (58.02 Macro-F₁)
  • Wikipedia features contribute meaningfully (55.52 Macro-F₁ alone)
  • URL and Twitter features improve modestly when combined

  • Bias (7-point): 36.63 Macro-F₁ (41.74 accuracy, 1.15 MAE)

  • Twitter features most useful (21.38 Macro-F₁), especially counts (friends, statuses, favorites)
  • Wikipedia features also important (30.92 Macro-F₁)
  • Article features less effective for bias detection than factuality

Ablation studies show feature types are complementary; removing any major family reduces performance, especially for bias. The best model for factuality excludes Web traffic features (negligible contribution), while bias benefits from all feature families.

Connections

Notes

This is a foundational paper in source-level profiling for fake news and media bias. Its main contribution is reframing the task from evaluating individual claims to profiling entire outlets—a practical distinction for deployment on new articles before manual verification. The breadth of features (textual, behavioral, structural) and systematic ablation provide useful baselines for later work.

The dataset is manually curated from MBFC, limiting it to English-language sources known to public fact-checking orgs; coverage of non-Western outlets or emerging misinformation sources may be sparse. The reliance on Wikipedia and Twitter presence also creates a bias toward outlets with high digital footprints. The paper does not deeply explore the correlation between factuality and bias, though later work by Baly et al. (2019) explicitly models this joint structure.