Skip to content
CHECKED: Chinese COVID-19 Fake News Dataset

CHECKED: Chinese COVID-19 Fake News Dataset

Authors: Chen Yang, Xinyi Zhou, Reza Zafarani Venue: arXiv:2010.09029 (2020) — arXiv

TL;DR

CHECKED is the first Chinese-language, fact-checked COVID-19 social media dataset — 2,104 Weibo microblogs (344 fake, 1,760 real) spanning December 2019 to August 2020, labeled using Weibo's expert-operated Community Management Center for fake posts and People's Daily official account for real posts. Each microblog carries rich multimedia (text, up to 18 images, video URLs), propagation context (1,868,175 reposts, 1,185,702 comments, 56.8M likes), and privacy-hashed user identifiers. Benchmark experiments show TextCNN achieves macro F₁ = 0.938 on a temporal split, filling a gap that prior COVID-19 misinformation datasets — all non-Chinese and without multimedia — could not address.

Contributions

  • Introduces CHECKED, the first Chinese COVID-19 social media dataset with ground-truth credibility labels (real/fake), filling the intersection of two previously disjoint spaces: Chinese COVID-19 data (Weibo-COV, NAIST-COVID) and COVID-19 credibility datasets (ReCOVery, CoAID, MM-COVID), both of which were either unlabeled or non-Chinese.
  • Provides a rich multimedia schema per microblog: textual content, up to 18 image URLs, one video URL, comment/repost/like counts, full comment and repost thread content, and privacy-hashed user IDs.
  • Uses 39 keywords across five categories (virus names, pandemic terminology, key figures and organizations, medical supplies, policy terms) to filter COVID-19-relevant content; English keywords included for professional terms (N95, SARS-CoV-2, COVID).
  • Reports benchmark macro F₁ results for five text-classification methods — FastText (0.839), TextCNN (0.938), TextRNN (0.700), Att-TextRNN (0.871), Transformer (0.927) — providing the first performance reference on Chinese COVID-19 fake news detection.

Method

Fake news collection. CHECKED sources fake microblogs from Weibo's Community Management Center, an official platform where expert reviewers evaluate user reports of false information, cyberbullying, privacy violations, and impersonation. This center has publicly verified over two million microblogs. Only items labeled as false information are included.

Real news collection. Real microblogs are collected from the People's Daily Weibo account (@人民日报), ranked first in both the 2019 China online media white paper and the State Information Center COVID-19 dissemination report. People's Daily has over 120 million Weibo followers and 120,000+ posts — chosen for credibility, reach, and prolific COVID-19 coverage.

Relevance filtering. A microblog is included if it contains any of 39 curated keywords across five categories: (i) coronavirus names (新冠肺炎, SARS-CoV-2, COVID, etc.); (ii) pandemic terms (疫情, confirmed cases, lockdown, etc.); (iii) key figures and organizations (WHO, Zhong Nanshan, Wenliang Li, Fauci); (iv) medical supplies (vaccines, N95, nucleic acid tests); (v) policies (quarantine, herd immunity, health code). All collected microblogs are manually reviewed to remove irrelevant items.

Collection interval. December 2019 to August 2020 (nine months), bracketing the full period from Wuhan's first confirmed case to the dataset cutoff, providing temporal coverage of COVID-19's escalation, global spread, and partial containment phases.

Schema. Each microblog is stored with: hashed id (32-digit SHA-256 of the original 16-digit Weibo ID), label (real/fake), analysis (official expert evaluation text for fake posts), date, hashed user_id, text, pic_url (up to 18 images), video_url (mutually exclusive with images), comment_num, repost_num, like_num, full comments thread (hashed IDs, dates, texts), and full reposts thread.

Results

Experiments use a temporal 70%/10%/20% train/validation/test split (preserving time ordering to simulate realistic deployment), with macro F₁ as the evaluation metric given the 5:1 class imbalance (1,760 real vs. 344 fake). Padding size = 150, batch size = 16, dropout = 0.5 for all models.

Method Macro F₁
FastText 0.839
TextRNN 0.700
Att-TextRNN 0.871
Transformer 0.927
TextCNN 0.938

TextCNN achieves the best result. The Transformer model (0.927) is competitive, suggesting self-attention is effective for this task despite the relatively short microblog texts. TextRNN lags considerably (0.700), likely because RNNs without attention underfit the keyword-heavy, short-text domain.

All baselines are text-only; CHECKED's image and video fields, and its full propagation graphs, are deliberately left for future work.

Connections

  • CoAID (Cui & Lee, 2020) is a complementary English-language COVID-19 dataset integrating claims, news articles, tweets, and platform posts; CoAID emphasizes user engagement data while CHECKED emphasizes multimedia and propagation structure on a single platform (Weibo).
  • ReCOVery (Zhou et al., 2020) is the nearest structural analogue: also a COVID-19 credibility dataset with multimedia and social data, but covers English-language news articles with publisher-level labels rather than Chinese microblogs with per-item expert labels; the two datasets are complementary.
  • SAFE (Zhou et al., 2020) uses image2sentence visual similarity for fake news detection; CHECKED's image and video fields are a natural evaluation domain for SAFE-like multimodal models that the benchmark section explicitly leaves open.
  • FakeNewsNet is the dominant general fake news benchmark; CHECKED's propagation distributions (power-law comment/repost/like distributions with median 744/586/6,179) are structurally similar to FakeNewsNet's Twitter graphs, enabling methods designed for FakeNewsNet to transfer to CHECKED.
  • COVID-19 misinformation is the containing topic; CHECKED directly addresses the Chinese-language gap in infodemic research.
  • Multimodal detection is the open research direction CHECKED enables — image, video, and propagation features are collected but not exploited in the paper's benchmarks.
  • Social-context detection via repost/comment graphs is similarly unexploited in the current benchmarks, representing a direct extension opportunity.

Notes

The class imbalance (344 fake vs. 1,760 real, approximately 1:5) is significant and expected given the nature of the collection sources — Weibo's Community Management Center has a limited throughput of verified fake posts relative to the volume of People's Daily's output. Macro F₁ is the correct metric, but high overall performance may mask poor recall on the minority (fake) class; per-class results are not reported.

The choice of People's Daily as the sole source of real news is a potential confound: it is a state-run outlet. Real microblogs from People's Daily may have stylistic uniformity not representative of genuine grassroots credible information, and any classifier trained on CHECKED could be exploiting institutional register rather than factual accuracy signals.

The full propagation graphs (1.87M reposts) distinguish CHECKED from the text-only Chinese COVID-19 datasets (Weibo-COV, NAIST-COVID) but the paper does not evaluate graph-based detection methods — leaving this as the dataset's most valuable underexplored resource.