CHECKED¶

Full name: CHECKED: Chinese COVID-19 Fake News Dataset
Authors: Chen Yang, Xinyi Zhou, Reza Zafarani
Paper: Yang et al. (2020), arXiv:2010.09029
Access: https://github.com/cyang03/CHECKED (data and code)

Description¶

CHECKED is the first Chinese-language social media dataset for COVID-19 misinformation research with ground-truth credibility labels. It collects fact-checked Weibo microblogs posted from December 2019 to August 2020. Fake microblogs are labeled via Weibo's expert-run Community Management Center, and real microblogs are drawn from People's Daily's official Weibo account. The dataset is distinguished by its multimedia fields (images and video) and by the repost and comment threads captured for each microblog.

Statistics¶

Label	Microblogs	w/ images	w/ video	w/ reposts	w/ comments
Real	1,760	1,149	563	1,151	1,151
Fake	344	53	106	229	292
Total	2,104	1,202	669	1,380	1,443

Metric	Real	Fake	Total
Reposts	1,827,817	40,358	1,868,175
Comments	1,169,246	16,456	1,185,702
Likes	56,407,610	445,116	56,852,726
Unique users	686,077	51,674	737,751

Class ratio: approximately 5:1 real to fake.
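The ratio above, and the difference in multimedia usage between classes, can be reproduced from the first statistics table. The following is a minimal sketch; the counts are copied from that table, and the helper names `class_ratio` and `share` are illustrative rather than part of the dataset's code.

```python
# Counts copied from the statistics table in this section.
microblogs = {"real": 1760, "fake": 344}
with_images = {"real": 1149, "fake": 53}


def class_ratio(counts):
    """Return the real-to-fake ratio of microblog counts."""
    return counts["real"] / counts["fake"]


def share(part, whole):
    """Per-label fraction of microblogs that have some property."""
    return {label: part[label] / whole[label] for label in whole}


ratio = class_ratio(microblogs)
image_share = share(with_images, microblogs)

print(f"real:fake = {ratio:.2f}:1")
print({label: round(value, 3) for label, value in image_share.items()})
```

By this calculation roughly 65% of real microblogs carry images versus about 15% of fake ones, so a multimodal model could in principle pick up on the mere presence of an image rather than its content.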

Schema¶

Each microblog record contains:

Field	Description
id	32-digit hashed Weibo microblog ID (hashed from the original 16-digit ID)
label	"real" or "fake"
analysis	Official Weibo expert evaluation text (fake microblogs only)
date	Posting timestamp (yyyy-mm-dd hh:mm)
user_id	32-digit hashed Weibo user ID
text	Full microblog text
pic_url	List of image URLs (up to 18 per microblog)
video_url	Video URL (mutually exclusive with images)
comment_num	Total comment count as shown on Weibo
repost_num	Total repost count as shown on Weibo
like_num	Total like count
comments	Array of comment objects: hashed ID, date, text, hashed commenter ID, optional image
reposts	Array of repost objects: hashed ID, date, text, hashed reposter ID, optional image

Note: due to Weibo's access restrictions, the comment/repost arrays may contain fewer entries than comment_num/repost_num, which capture the Weibo-displayed count.
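The gap described in the note can be measured per record. The sketch below assumes a record has already been loaded as a Python dict keyed by the field names in the schema table; the sample record and the keys inside each comment object (`id`, `text`) are hypothetical placeholders, not values from the dataset.

```python
def thread_coverage(record):
    """Fraction of Weibo-displayed comments/reposts present in the arrays.

    Returns None for a thread type whose displayed count is zero.
    """
    coverage = {}
    pairs = (("comments", "comment_num"), ("reposts", "repost_num"))
    for array_key, count_key in pairs:
        displayed = int(record.get(count_key, 0))
        collected = len(record.get(array_key, []))
        coverage[array_key] = collected / displayed if displayed else None
    return coverage


# Hypothetical record: Weibo shows 4 comments, but only 3 were collected.
sample = {
    "id": "0" * 32,
    "label": "fake",
    "comment_num": 4,
    "repost_num": 0,
    "comments": [
        {"id": "a" * 32, "text": "..."},
        {"id": "b" * 32, "text": "..."},
        {"id": "c" * 32, "text": "..."},
    ],
    "reposts": [],
}

print(thread_coverage(sample))
```

A check like this is useful before propagation-based experiments, since records with low coverage give only a partial view of the thread.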

Labeling methodology¶

Fake: Microblogs verified as false information by Weibo's Community Management Center. Users report suspicious microblogs; Weibo experts investigate and publish detailed evaluations. The center has processed over two million reports as of publication.

Real: Microblogs posted by People's Daily (@人民日报), China's largest newspaper group, which ranked first in both the 2019 China Online Media White Paper and the State Information Center COVID-19 Dissemination Report and has over 120 million Weibo followers.

Relevance filtering¶

39 keywords across five categories: (i) coronavirus names (新冠肺炎, SARS-CoV-2, COVID, Coronavirus, 冠状病毒, 新冠); (ii) pandemic terms (疫情, 确诊, 死亡病例, 输入病例, 隔离, 封城, 防控, etc.); (iii) key figures/organizations (WHO, CDC, 钟南山, 张文宏, 李文亮, 福奇); (iv) medical supplies (疫苗, 抗体, N95, 口罩, 火神山, 雷神山, 试剂盒, 核酸检测); (v) policies (群体免疫, 健康码, 战疫, 援鄂). English keywords are case-insensitive.
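A relevance filter of this kind can be sketched as follows. The keyword list is a subset of the keywords named above, not the full set of 39, and plain substring matching is an assumption: the source states only that English keywords are case-insensitive, not how matching is performed.

```python
# Subset of the keywords listed above (the full list has 39 entries).
KEYWORDS = [
    "新冠肺炎", "SARS-CoV-2", "COVID", "Coronavirus", "冠状病毒", "新冠",
    "疫情", "确诊", "隔离", "封城", "WHO", "CDC", "钟南山", "疫苗",
    "N95", "口罩", "核酸检测", "健康码",
]


def is_relevant(text, keywords=KEYWORDS):
    """Return True if any keyword occurs in the text.

    casefold() makes English keywords case-insensitive and leaves
    Chinese characters unchanged.
    """
    lowered = text.casefold()
    return any(keyword.casefold() in lowered for keyword in keywords)


print(is_relevant("武汉新增确诊病例"))
print(is_relevant("New covid-19 vaccine trial announced"))
print(is_relevant("今天天气很好"))
```

One caveat of substring matching is that short English keywords such as WHO also match inside unrelated words (for example "whoever"), so a stricter implementation might require word boundaries for Latin-script keywords.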

Benchmark results¶

Five text-classification methods were evaluated on a temporal 70/10/20 train/validation/test split. Macro F₁ scores:

Method	Macro F₁
FastText	0.839
TextCNN	0.938
TextRNN	0.700
Att-TextRNN	0.871
Transformer	0.927
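A temporal split orders microblogs by posting time so that the model is trained on earlier posts and tested on later ones. The sketch below is one way to do this under stated assumptions: records are dicts with a `date` field in the schema's yyyy-mm-dd hh:mm format, and the rounding of split boundaries is a choice made here, not necessarily the one used in the paper.

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d %H:%M"  # matches the schema's yyyy-mm-dd hh:mm


def temporal_split(records, train_pct=70, val_pct=10):
    """Split records chronologically into train, validation and test lists.

    Percentages are integers so the boundaries use exact integer arithmetic;
    the test set receives whatever remains after train and validation.
    """
    ordered = sorted(
        records, key=lambda r: datetime.strptime(r["date"], DATE_FORMAT)
    )
    n = len(ordered)
    train_end = n * train_pct // 100
    val_end = train_end + n * val_pct // 100
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


# Hypothetical records, deliberately supplied in reverse chronological order.
records = [{"date": f"2020-01-{day:02d} 08:00"} for day in range(10, 0, -1)]
train, val, test = temporal_split(records)
print(len(train), len(val), len(test))
```

Compared with a random split, this prevents later microblogs about the same event from leaking into training, which generally makes the reported scores a more realistic estimate of performance on new posts.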

Intended use¶

Chinese COVID-19 fake news detection
Multilingual infodemic research (complement to English ReCOVery and the multilingual MM-COVID)
Multimodal misinformation detection (image/video features not yet benchmarked)
Propagation-based detection (full repost/comment thread graphs available)
Temporal analysis of COVID-19 misinformation spread on Chinese social media
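For the temporal-analysis use case, a first step is often to count microblogs per month and label. The following is a minimal sketch, assuming records are dicts with the `date` and `label` fields from the schema; the example records are hypothetical.

```python
from collections import Counter
from datetime import datetime


def monthly_counts(records):
    """Count microblogs per (year-month, label) pair."""
    counts = Counter()
    for record in records:
        posted = datetime.strptime(record["date"], "%Y-%m-%d %H:%M")
        counts[(posted.strftime("%Y-%m"), record["label"])] += 1
    return counts


# Hypothetical records for illustration only.
example = [
    {"date": "2020-01-25 09:30", "label": "fake"},
    {"date": "2020-01-28 21:05", "label": "real"},
    {"date": "2020-02-03 14:10", "label": "real"},
    {"date": "2020-02-11 07:45", "label": "real"},
]

for (month, label), count in sorted(monthly_counts(example).items()):
    print(month, label, count)
```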

Limitations¶

Real news is sourced exclusively from a state-run media outlet (People's Daily), introducing potential institutional register bias: a classifier may learn to recognize the outlet's writing style rather than credibility. Class imbalance (about 5:1 real to fake) requires careful choice of evaluation metric. Multimedia benchmarks are absent from the paper. Historical comment/repost retrieval may be incomplete due to Weibo visibility controls.
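To illustrate why the metric matters under this imbalance, the sketch below scores a trivial classifier that always predicts "real". It uses the whole-dataset label counts from the statistics table purely for illustration (the paper's benchmark is computed on a test split), and `macro_f1` is a hand-written helper rather than a library call.

```python
def f1_score(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def macro_f1(y_true, y_pred, labels=("real", "fake")):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        scores.append(f1_score(tp, fp, fn))
    return sum(scores) / len(scores)


# Whole-dataset label counts: 1,760 real and 344 fake microblogs.
y_true = ["real"] * 1760 + ["fake"] * 344
y_pred = ["real"] * len(y_true)  # majority-class baseline

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro = macro_f1(y_true, y_pred)
print(f"accuracy = {accuracy:.3f}, macro F1 = {macro:.3f}")
```

The baseline reaches about 0.84 accuracy while its macro F₁ stays below 0.5, because the fake class contributes an F₁ of zero. This is why macro-averaged scores are the more informative choice for this dataset.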

Connections¶

ReCOVery is the complementary English-language COVID-19 credibility dataset; together, ReCOVery and CHECKED provide English and Chinese multimodal COVID-19 misinformation corpora.
FakeNewsNet is the dominant general-domain benchmark; CHECKED's propagation structure (power-law repost/comment distributions) is analogous to FakeNewsNet's Twitter graphs.
Yang et al. (2020) is the introducing paper.