CHECKED

Full name: CHECKED: Chinese COVID-19 Fake News Dataset
Authors: Chen Yang, Xinyi Zhou, Reza Zafarani
Paper: Yang et al. (2020), arXiv:2010.09029
Access: https://github.com/cyang03/CHECKED (data and code)

Description

CHECKED is the first Chinese-language social media dataset for COVID-19 misinformation research with ground-truth credibility labels. It collects fact-checked Weibo microblogs from December 2019 to August 2020, labeled via Weibo's expert-run Community Management Center (fake) and People's Daily's official Weibo account (real). The dataset is distinguished by its rich multimedia schema and full propagation thread capture.

Statistics

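| Label | Microblogs | With images | With video | With reposts | With comments |
|---|---|---|---|---|---|
| Real | 1,760 | 1,149 | 563 | 1,151 | 1,151 |
| Fake | 344 | 53 | 106 | 229 | 292 |
| Total | 2,104 | 1,202 | 669 | 1,380 | 1,443 |

| Metric | Real | Fake | Total |
|---|---|---|---|
| Reposts | 1,827,817 | 40,358 | 1,868,175 |
| Comments | 1,169,246 | 16,456 | 1,185,702 |
| Likes | 56,407,610 | 445,116 | 56,852,726 |
| Unique users | 686,077 | 51,674 | 737,751 |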

Class ratio: approximately 5:1 real to fake.
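
The ratio follows directly from the microblog counts in the statistics above. The sketch below reproduces the arithmetic and also derives inverse-frequency class weights. The weighting scheme is an illustrative assumption (one common way to offset imbalance), not something the dataset or paper prescribes.

```python
# Microblog counts taken from the statistics above.
counts = {"real": 1760, "fake": 344}

total = sum(counts.values())
ratio = counts["real"] / counts["fake"]

# Inverse-frequency weights: total / (n_classes * class_count).
# Illustrative assumption only; CHECKED does not prescribe a weighting scheme.
weights = {label: total / (len(counts) * n) for label, n in counts.items()}

print(f"total microblogs: {total}")
print(f"real:fake ratio: {ratio:.2f}:1")
print({label: round(w, 3) for label, w in weights.items()})
```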

Schema

Each microblog record contains:

| Field | Description |
|---|---|
| id | 32-digit hashed Weibo microblog ID (SHA-256 of original 16-digit ID) |
| label | "real" or "fake" |
| analysis | Official Weibo expert evaluation text (fake microblogs only) |
| date | Posting timestamp (yyyy-mm-dd hh:mm) |
| user_id | 32-digit hashed Weibo user ID |
| text | Full microblog text |
| pic_url | List of image URLs (up to 18 per microblog) |
| video_url | Video URL (mutually exclusive with images) |
| comment_num | Total comment count as shown on Weibo |
| repost_num | Total repost count as shown on Weibo |
| like_num | Total like count |
| comments | Array of comment objects: hashed ID, date, text, hashed commenter ID, optional image |
| reposts | Array of repost objects: hashed ID, date, text, hashed reposter ID, optional image |

Note: due to Weibo's access restrictions, the comment/repost arrays may contain fewer entries than comment_num/repost_num, which capture the Weibo-displayed count.
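
To make the note concrete, the sketch below measures how much of each thread was actually collected by comparing array lengths against the displayed counts. It assumes a record has already been loaded as a Python dict keyed by the field names above. The function name `thread_coverage` and the sample record are hypothetical, and the on-disk file layout is not covered here.

```python
from typing import Any


def thread_coverage(record: dict[str, Any]) -> dict[str, float]:
    """Fraction of Weibo-displayed comments/reposts present in the arrays."""
    coverage = {}
    pairs = (("comments", "comment_num"), ("reposts", "repost_num"))
    for array_field, count_field in pairs:
        displayed = int(record.get(count_field, 0))
        collected = len(record.get(array_field, []))
        # With no displayed interactions there is nothing to miss.
        if displayed == 0:
            coverage[array_field] = 1.0
        else:
            coverage[array_field] = min(collected / displayed, 1.0)
    return coverage


# Made-up record for illustration only; the values are not from the dataset.
sample = {
    "id": "0" * 32,
    "label": "fake",
    "comment_num": 4,
    "repost_num": 10,
    "comments": [{"text": "a"}, {"text": "b"}, {"text": "c"}],
    "reposts": [{"text": "x"}] * 5,
}
print(thread_coverage(sample))
```

Records with low coverage can then be down-weighted or excluded in propagation-based experiments, depending on how sensitive the method is to missing edges.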

Labeling methodology

Fake: Microblogs verified as false information by Weibo's Community Management Center. Users report suspicious microblogs; Weibo experts investigate and publish detailed evaluations. The center has processed over two million reports as of publication.

Real: Microblogs from People's Daily (@人民日报), China's largest newspaper group, ranked first in both the 2019 China Online Media White Paper and the State Information Center COVID-19 Dissemination Report. The account has over 120 million Weibo followers.

Relevance filtering

39 keywords across five categories: (i) coronavirus names (新冠肺炎, SARS-CoV-2, COVID, Coronavirus, 冠状病毒, 新冠); (ii) pandemic terms (疫情, 确诊, 死亡病例, 输入病例, 隔离, 封城, 防控, etc.); (iii) key figures/organizations (WHO, CDC, 钟南山, 张文宏, 李文亮, 福奇); (iv) medical supplies (疫苗, 抗体, N95, 口罩, 火神山, 雷神山, 试剂盒, 核酸检测); (v) policies (群体免疫, 健康码, 战疫, 援鄂). English keywords are case-insensitive.
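
A keyword filter of this kind can be sketched as follows. The list is only a subset of the keywords named above (the full set of 39 is not reproduced here), and plain substring matching is an assumption, since the exact matching rule is not stated. `is_relevant` is a hypothetical helper name.

```python
# Subset of the keywords listed above; not the full 39-keyword list.
KEYWORDS = [
    "新冠肺炎", "SARS-CoV-2", "COVID", "Coronavirus", "冠状病毒", "新冠",
    "疫情", "确诊", "隔离", "封城", "防控",
    "WHO", "CDC", "钟南山", "张文宏", "李文亮", "福奇",
    "疫苗", "抗体", "N95", "口罩", "核酸检测",
    "群体免疫", "健康码", "战疫", "援鄂",
]

# casefold() lowercases the English keywords and leaves Chinese text unchanged.
_FOLDED = [kw.casefold() for kw in KEYWORDS]


def is_relevant(text: str) -> bool:
    """True if the text contains any keyword (substring match, an assumption)."""
    folded = text.casefold()
    return any(kw in folded for kw in _FOLDED)


print(is_relevant("covid-19 最新消息"))  # English keyword, different case
print(is_relevant("今天天气很好"))        # no keyword present
```

Substring matching on short English keywords such as WHO will also fire on unrelated words like "whole", so a token-boundary check may be preferable for mixed-language text.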

Benchmark results

Five text-classification methods were evaluated with a temporal 70/10/20 train/val/test split. Macro F₁ scores:

| Method | Macro F₁ |
|---|---|
| FastText | 0.839 |
| TextCNN | 0.938 |
| TextRNN | 0.700 |
| Att-TextRNN | 0.871 |
| Transformer | 0.927 |
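
For readers reproducing the setup, a temporal 70/10/20 split can be sketched as below. It assumes records are dicts with a `date` field in the `yyyy-mm-dd hh:mm` format from the schema, and that "temporal" means ordering by posting time and cutting at the 70% and 80% marks. The paper's exact procedure may differ, and `temporal_split` is a hypothetical helper.

```python
from datetime import datetime


def temporal_split(records, train_pct=70, val_pct=10):
    """Sort by posting time; earliest records train, latest records test."""
    ordered = sorted(
        records,
        key=lambda r: datetime.strptime(r["date"], "%Y-%m-%d %H:%M"),
    )
    n = len(ordered)
    # Integer arithmetic avoids floating-point rounding at the cut points.
    train_end = n * train_pct // 100
    val_end = train_end + n * val_pct // 100
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


# Ten made-up records in reverse chronological order, for illustration only.
records = [{"date": f"2020-01-{day:02d} 08:00"} for day in range(10, 0, -1)]
train, val, test = temporal_split(records)
print(len(train), len(val), len(test))
```

Splitting by time rather than at random keeps later microblogs out of training, which better reflects how a detector would meet new claims in deployment.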

Intended use

  • Chinese COVID-19 fake news detection
  • Multilingual infodemic research (complement to English ReCOVery and the multilingual MM-COVID)
  • Multimodal misinformation detection (image/video features not yet benchmarked)
  • Propagation-based detection (repost and comment threads are included, subject to Weibo's access restrictions)
  • Temporal analysis of COVID-19 misinformation spread on Chinese social media

Limitations

Real news is sourced exclusively from a state-run media outlet (People's Daily), introducing potential institutional register bias. The class imbalance (roughly 5:1 real to fake) means accuracy alone can be misleading, so imbalance-aware metrics such as macro F₁ are preferable. The paper reports no multimedia benchmarks. Retrieval of historical microblogs and reposts may be incomplete due to Weibo visibility controls.
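
The effect of the imbalance on metric choice can be illustrated with a trivial baseline that labels every microblog "real". The sketch below uses the full-dataset class counts rather than an actual test split, and a small hand-written `macro_f1` helper (hypothetical, written here to keep the example dependency-free).

```python
def macro_f1(y_true, y_pred, labels=("real", "fake")):
    """Unweighted mean of per-class F1 scores."""
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Class counts from the dataset statistics; predictions are a majority baseline.
y_true = ["real"] * 1760 + ["fake"] * 344
y_pred = ["real"] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy: {accuracy:.3f}")
print(f"macro F1: {macro_f1(y_true, y_pred):.3f}")
```

The baseline scores well on accuracy while detecting no fake microblogs at all, which macro F₁ exposes because the fake class contributes an F₁ of zero.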

Connections

  • ReCOVery is the complementary English-language COVID-19 credibility dataset; together, ReCOVery and CHECKED provide English and Chinese multimodal COVID-19 misinformation corpora.
  • FakeNewsNet is the dominant general-domain benchmark; CHECKED's propagation structure (power-law repost/comment distributions) is analogous to FakeNewsNet's Twitter graphs.
  • Yang et al. (2020) is the introducing paper.