CHECKED
- Full name: CHECKED: Chinese COVID-19 Fake News Dataset
- Authors: Chen Yang, Xinyi Zhou, Reza Zafarani
- Paper: Yang et al. (2020), arXiv:2010.09029
- Access: https://github.com/cyang03/CHECKED (data and code)
Description
CHECKED is the first Chinese-language social media dataset for COVID-19 misinformation research with ground-truth credibility labels. It collects fact-checked Weibo microblogs posted from December 2019 to August 2020. Fake microblogs are labeled via Weibo's expert-run Community Management Center, and real microblogs come from People's Daily's official Weibo account. The dataset is distinguished by its rich multimedia schema and by its capture of each microblog's repost and comment threads.
Statistics
| Label | Microblogs | w/ images | w/ video | w/ reposts | w/ comments |
|---|---|---|---|---|---|
| Real | 1,760 | 1,149 | 563 | 1,151 | 1,151 |
| Fake | 344 | 53 | 106 | 229 | 292 |
| Total | 2,104 | 1,202 | 669 | 1,380 | 1,443 |
Engagement totals by label:
| Metric | Real | Fake | Total |
|---|---|---|---|
| Reposts | 1,827,817 | 40,358 | 1,868,175 |
| Comments | 1,169,246 | 16,456 | 1,185,702 |
| Likes | 56,407,610 | 445,116 | 56,852,726 |
| Unique users | 686,077 | 51,674 | 737,751 |
Class ratio: approximately 5:1 real to fake.
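The ratio and the per-microblog engagement gap can be reproduced directly from the tables. The following is a minimal sketch that only re-derives numbers from the counts listed above; the dictionary and function names are illustrative and are not part of the dataset or its code.

```python
# Counts copied from the statistics tables above.
MICROBLOGS = {"real": 1760, "fake": 344}
ENGAGEMENT = {
    "reposts": {"real": 1_827_817, "fake": 40_358},
    "comments": {"real": 1_169_246, "fake": 16_456},
    "likes": {"real": 56_407_610, "fake": 445_116},
}


def class_ratio(counts):
    """Number of real microblogs per fake microblog."""
    return counts["real"] / counts["fake"]


def mean_per_microblog(metric):
    """Average engagement per microblog for each label."""
    return {
        label: ENGAGEMENT[metric][label] / MICROBLOGS[label]
        for label in MICROBLOGS
    }


print(f"real:fake ratio = {class_ratio(MICROBLOGS):.2f}")
for metric in ENGAGEMENT:
    means = mean_per_microblog(metric)
    print(f"{metric}: real {means['real']:.1f}, fake {means['fake']:.1f} per microblog")
```

Real microblogs are not only more numerous but also attract far more engagement per post, which matters for any model that uses engagement counts as features.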
Schema
Each microblog record contains:
| Field | Description |
|---|---|
| id | 32-digit hashed Weibo microblog ID (hashed from the original 16-digit ID) |
| label | "real" or "fake" |
| analysis | Official Weibo expert evaluation text (fake microblogs only) |
| date | Posting timestamp (yyyy-mm-dd hh:mm) |
| user_id | 32-digit hashed Weibo user ID |
| text | Full microblog text |
| pic_url | List of image URLs (up to 18 per microblog) |
| video_url | Video URL (mutually exclusive with images) |
| comment_num | Total comment count as shown on Weibo |
| repost_num | Total repost count as shown on Weibo |
| like_num | Total like count |
| comments | Array of comment objects: hashed ID, date, text, hashed commenter ID, optional image |
| reposts | Array of repost objects: hashed ID, date, text, hashed reposter ID, optional image |
Note: because of Weibo's access restrictions, the comments and reposts arrays may contain fewer entries than comment_num and repost_num, which record the counts displayed on Weibo.
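To make the gap between displayed counts and captured entries concrete, here is a minimal sketch that measures how much of a thread is actually present in a record. It assumes each record is a JSON object whose top-level keys match the schema table; the sample record is synthetic, the keys inside the comment objects are hypothetical, and the on-disk layout of the released files is not specified here.

```python
import json

# Synthetic record for illustration only; values are NOT taken from the dataset.
# Top-level keys follow the schema table; keys inside comment objects are hypothetical.
SAMPLE_JSON = """
{
  "id": "0123456789abcdef0123456789abcdef",
  "label": "fake",
  "analysis": "placeholder evaluation text",
  "date": "2020-02-01 09:30",
  "user_id": "fedcba9876543210fedcba9876543210",
  "text": "placeholder microblog text",
  "pic_url": [],
  "video_url": null,
  "comment_num": 3,
  "repost_num": 0,
  "like_num": 12,
  "comments": [
    {"id": "c1", "date": "2020-02-01 10:00", "text": "placeholder", "user_id": "u1"},
    {"id": "c2", "date": "2020-02-01 10:05", "text": "placeholder", "user_id": "u2"}
  ],
  "reposts": []
}
"""


def captured_fraction(record, array_field, count_field):
    """Share of the Weibo-displayed count that is present in the array.

    Returns 1.0 when the displayed count is zero, and caps the result at 1.0
    in case the array holds more entries than the displayed count.
    """
    displayed = record[count_field]
    if displayed == 0:
        return 1.0
    return min(len(record[array_field]) / displayed, 1.0)


record = json.loads(SAMPLE_JSON)
print(captured_fraction(record, "comments", "comment_num"))
print(captured_fraction(record, "reposts", "repost_num"))
```

A check like this is worth running before any propagation analysis, since a low captured fraction means the thread is only partially observed.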
Labeling methodology
Fake: Microblogs verified as false information by Weibo's Community Management Center. Users report suspicious microblogs; Weibo experts investigate and publish detailed evaluations. The center has processed over two million reports as of publication.
Real: Microblogs from People's Daily (@人民日报), China's largest newspaper group, ranked first in both the 2019 China Online Media White Paper and the State Information Center COVID-19 Dissemination Report. The account has over 120 million Weibo followers.
Relevance filtering
Relevance to COVID-19 is determined with 39 keywords across five categories: (i) coronavirus names (新冠肺炎, SARS-CoV-2, COVID, Coronavirus, 冠状病毒, 新冠); (ii) pandemic terms (疫情, 确诊, 死亡病例, 输入病例, 隔离, 封城, 防控, etc.); (iii) key figures/organizations (WHO, CDC, 钟南山, 张文宏, 李文亮, 福奇); (iv) medical supplies (疫苗, 抗体, N95, 口罩, 火神山, 雷神山, 试剂盒, 核酸检测); (v) policies (群体免疫, 健康码, 战疫, 援鄂). English keywords are matched case-insensitively.
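A keyword filter of this kind can be sketched in a few lines. The version below is an assumption about how matching might work (plain substring matching, with case folding for the English terms); it is not the authors' implementation, and it includes only the keywords explicitly listed above, not the full set of 39.

```python
# Only the keywords named in the text above; the elided ones are not guessed.
KEYWORDS = [
    # (i) coronavirus names
    "新冠肺炎", "SARS-CoV-2", "COVID", "Coronavirus", "冠状病毒", "新冠",
    # (ii) pandemic terms (partial list)
    "疫情", "确诊", "死亡病例", "输入病例", "隔离", "封城", "防控",
    # (iii) key figures/organizations
    "WHO", "CDC", "钟南山", "张文宏", "李文亮", "福奇",
    # (iv) medical supplies
    "疫苗", "抗体", "N95", "口罩", "火神山", "雷神山", "试剂盒", "核酸检测",
    # (v) policies
    "群体免疫", "健康码", "战疫", "援鄂",
]


def is_relevant(text, keywords=KEYWORDS):
    """Return True if any keyword occurs in the text (assumed substring match).

    casefold() makes English keywords case-insensitive and leaves
    Chinese characters unchanged.
    """
    lowered = text.casefold()
    return any(keyword.casefold() in lowered for keyword in keywords)


print(is_relevant("武汉疫情最新通报"))
print(is_relevant("covid-19 update"))
print(is_relevant("今天天气很好"))
```

One caveat of plain substring matching is that short English keywords such as WHO also match inside unrelated words (for example "whole"), so a word-boundary check may be preferable for English text.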
Benchmark results
Five text-classification methods were evaluated with a temporal 70/10/20 train/validation/test split. Macro F₁ scores:
| Method | Macro F₁ |
|---|---|
| FastText | 0.839 |
| TextCNN | 0.938 |
| TextRNN | 0.700 |
| Att-TextRNN | 0.871 |
| Transformer | 0.927 |
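A temporal split orders microblogs by posting time so that the model is always tested on posts later than those it was trained on. The sketch below shows one way to do this, assuming the `date` field uses the `yyyy-mm-dd hh:mm` format from the schema; the exact boundary handling in the paper is not known, and the sample records are synthetic.

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d %H:%M"  # matches the schema's yyyy-mm-dd hh:mm


def temporal_split(records, train_pct=70, val_pct=10):
    """Sort records chronologically and cut them into train/val/test.

    Integer percentages avoid floating-point rounding at the boundaries;
    whatever remains after train and val goes to test.
    """
    ordered = sorted(
        records, key=lambda r: datetime.strptime(r["date"], DATE_FORMAT)
    )
    n = len(ordered)
    train_end = n * train_pct // 100
    val_end = train_end + n * val_pct // 100
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


# Synthetic dates, deliberately out of order.
dates = [
    "2020-03-01 08:00", "2019-12-31 23:59", "2020-08-15 12:30",
    "2020-01-20 10:00", "2020-05-05 05:05", "2020-02-14 14:00",
    "2020-07-01 00:00", "2020-04-10 09:15", "2020-06-30 18:45",
    "2020-01-05 07:00",
]
records = [{"id": i, "date": d} for i, d in enumerate(dates)]
train, val, test = temporal_split(records)
print(len(train), len(val), len(test))
```

Compared with a random split, this prevents later posts about the same event from leaking into training, at the cost of a label distribution that may differ between splits.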
Intended use
- Chinese COVID-19 fake news detection
- Multilingual infodemic research (complement to the English ReCOVery and the multilingual MM-COVID)
- Multimodal misinformation detection (image/video features not yet benchmarked)
- Propagation-based detection (repost and comment threads available, though possibly truncated by Weibo's access restrictions)
- Temporal analysis of COVID-19 misinformation spread on Chinese social media
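For the temporal-analysis use case, a natural first step is to count microblogs per month and label. The following is a minimal sketch under the assumption that each record exposes the `date` and `label` fields from the schema; the sample records are synthetic and the function name is illustrative.

```python
from collections import Counter
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d %H:%M"  # matches the schema's yyyy-mm-dd hh:mm


def monthly_counts(records):
    """Count microblogs per (year-month, label) pair."""
    counts = Counter()
    for record in records:
        posted = datetime.strptime(record["date"], DATE_FORMAT)
        counts[(posted.strftime("%Y-%m"), record["label"])] += 1
    return counts


# Synthetic records for illustration only.
sample = [
    {"date": "2020-01-20 10:00", "label": "real"},
    {"date": "2020-01-25 11:00", "label": "fake"},
    {"date": "2020-01-28 09:00", "label": "real"},
    {"date": "2020-02-03 12:00", "label": "fake"},
]
counts = monthly_counts(sample)
for (month, label), n in sorted(counts.items()):
    print(month, label, n)
```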
Limitations
Real news is sourced exclusively from a state-run media outlet (People's Daily), introducing potential institutional register bias. Class imbalance (5:1 real:fake) requires careful choice of evaluation metric. Multimedia benchmarks are absent from the paper. Retrieval of historical comments and reposts may be incomplete due to Weibo visibility controls.
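The effect of the imbalance on metric choice can be seen with a trivial baseline that labels every microblog as real. The sketch below computes accuracy and macro F₁ for that baseline from the dataset-level counts; it is illustrative arithmetic over the whole dataset, not a reproduction of the paper's test split.

```python
REAL, FAKE = 1760, 344  # microblog counts from the statistics table


def f1(tp, fp, fn):
    """F1 score from true positives, false positives, and false negatives."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Baseline: predict "real" for every microblog.
accuracy = REAL / (REAL + FAKE)
f1_real = f1(tp=REAL, fp=FAKE, fn=0)  # every fake is a false positive for "real"
f1_fake = f1(tp=0, fp=0, fn=FAKE)     # no fake is ever predicted
macro_f1 = (f1_real + f1_fake) / 2

print(f"accuracy = {accuracy:.3f}")
print(f"macro F1 = {macro_f1:.3f}")
```

The baseline reaches roughly 84% accuracy while its macro F₁ stays below 0.5, which is why macro-averaged or per-class metrics are the more informative choice here.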
Connections
- ReCOVery is the complementary English-language COVID-19 credibility dataset; together with CHECKED they provide Chinese and English multimodal COVID-19 misinformation corpora.
- FakeNewsNet is a widely used general-domain benchmark; CHECKED's propagation structure (power-law repost/comment distributions) is analogous to FakeNewsNet's Twitter graphs.
- Yang et al. (2020) is the introducing paper.