COVID-19 Twitter Chatter Dataset¶
Paper: Banda et al., 2020
Repository: https://zenodo.org/record/3723939
Languages: Multilingual (English, Spanish, French, German, Russian)
Collection period: January 1 – November 8, 2020
Overview¶
This is a large-scale, openly available dataset of COVID-19 tweets collected and curated by an international collaboration of researchers. It contains over 800 million unique tweet identifiers collected between January 1 and November 8, 2020, and supports analysis of public sentiment, misinformation sources, and social dynamics during the pandemic.
Core statistics:
- Full dataset: 800,064,296 tweets (including retweets)
- Clean dataset: 194,272,176 tweets (original tweets only, no retweets)
- Time span: January 1 – November 8, 2020
- Languages: English, Spanish, French, German, Russian, and others
- Format: Tab-separated tweet IDs, timestamps, language, country code
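As a quick check on these figures, the share of original tweets follows directly from the two counts. A minimal sketch that uses only the numbers quoted above (the function name is illustrative):

```python
# Counts quoted in the core statistics above.
FULL_TWEETS = 800_064_296   # tweets including retweets
CLEAN_TWEETS = 194_272_176  # original tweets only


def original_share(clean: int, full: int) -> float:
    """Fraction of the full dataset that remains after dropping retweets."""
    return clean / full


share = original_share(CLEAN_TWEETS, FULL_TWEETS)
print(f"original tweets: {share:.1%}, retweets: {1 - share:.1%}")
```

Roughly one tweet in four is an original, which is why the clean version is considerably lighter to process.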
Data Collection and Methodology¶
Initial Phase (January–March 2020)¶
- Collected via Twitter Stream API with keywords: "coronavirus", "2019ncov", "corona virus"
- Covered January 1 – March 11, 2020
- Limited to publicly available 1% sample of Twitter stream
Expanded Phase (March 12 – November 8, 2020)¶
- Switched to a dedicated COVID-19 keyword set: "COVID19", "CoronavirusPandemic", "COVID-19", "2019nCoV", "CoronaOutbreak", "coronavirus", "WuhanVirus"
- Significantly expanded collection scope
- Collection tools: Twitter Stream API via Tweepy package and Social Media Mining Toolkit (SMMT)
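To check locally whether a hydrated tweet's text mentions any keyword from the expanded set, a simple case-insensitive substring match can serve as an approximation. This is a sketch only: it does not reproduce the Stream API's own matching rules, and the helper names are hypothetical.

```python
import re

# Keyword set listed for the expanded phase above.
EXPANDED_KEYWORDS = [
    "COVID19", "CoronavirusPandemic", "COVID-19", "2019nCoV",
    "CoronaOutbreak", "coronavirus", "WuhanVirus",
]


def build_matcher(keywords):
    """Compile one case-insensitive pattern that matches any keyword."""
    alternatives = "|".join(re.escape(k) for k in keywords)
    return re.compile(alternatives, re.IGNORECASE)


def mentions_keyword(text, matcher):
    """Return True if the text contains at least one keyword as a substring."""
    return matcher.search(text) is not None


matcher = build_matcher(EXPANDED_KEYWORDS)
print(mentions_keyword("Stay safe #covid19", matcher))
```

Because matching is by substring, hashtags such as "#covid19" are caught without special handling.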
International Collaboration Phase¶
- 30+ million tweets (January 27 – March 27, 2020) contributed by co-author Jingyuan Yu and collaborators
  - Keywords: "coronavirus", "wuhan", "pneumonia", "pneumonie", "neumonia", "lungenentzündung", "covid19"
  - Languages: English, French, Spanish, German
- ~1.5 million tweets (January 1 – May 8, 2020) in Russian contributed by co-authors Katya Artemova and Elena Tutubalina
- Full deduplication performed across all contributing datasets
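Deduplication across the contributing collections can be pictured as a merge keyed on tweet ID. The following is a minimal sketch, not the maintainers' actual procedure; the function name and the sample IDs are hypothetical.

```python
def merge_unique_ids(*sources):
    """Merge several iterables of tweet IDs, keeping the first occurrence of each."""
    seen = set()
    merged = []
    for source in sources:
        for tweet_id in source:
            tweet_id = str(tweet_id).strip()
            if tweet_id and tweet_id not in seen:
                seen.add(tweet_id)
                merged.append(tweet_id)
    return merged


# Hypothetical ID lists standing in for the contributing collections.
stream_ids = ["1001", "1002", "1003"]
multilingual_ids = ["1003", "1004"]
russian_ids = ["1004", "1005", "1001"]

all_ids = merge_unique_ids(stream_ids, multilingual_ids, russian_ids)
print(len(all_ids), "unique IDs")
```

IDs are handled as strings because tweet IDs are 64-bit integers that some tools round when they parse them as floating-point numbers.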
Data Processing¶
- Preprocessing via Social Media Mining Toolkit (SMMT) components
- Special character cleaning (carriage returns, URLs, excess whitespace)
- Language-preserving approach (all languages intact)
- Two dataset versions released:
  - Full: Includes tweets and retweets (useful for dissemination analysis)
  - Clean: Original tweets only, no retweets (preferred for NLP tasks with resource constraints)
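The cleaning and full/clean split described above can be sketched in a few lines. This is an illustration rather than the SMMT implementation. It assumes tweets are dictionaries in the Twitter API v1.1 JSON layout, where a retweet carries a "retweeted_status" field, and the function names are hypothetical.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")
WHITESPACE_PATTERN = re.compile(r"\s+")


def clean_text(text):
    """Strip URLs and collapse carriage returns, newlines and repeated spaces."""
    without_urls = URL_PATTERN.sub(" ", text)
    return WHITESPACE_PATTERN.sub(" ", without_urls).strip()


def is_retweet(tweet):
    """Assumption: retweets carry a 'retweeted_status' field (API v1.1 JSON)."""
    return "retweeted_status" in tweet


def split_versions(tweets):
    """Return (full, clean): every tweet, and the subset that are not retweets."""
    full = list(tweets)
    clean = [t for t in full if not is_retweet(t)]
    return full, clean


print(clean_text("Stay home\r\n  https://t.co/abc   please"))
```

The text is neither lowercased nor stripped of non-ASCII characters, in keeping with the language-preserving approach.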
Deliverables¶
The Zenodo repository contains the following files:
1. Tweet ID Collections¶
- full_dataset.tsv.gz: All tweet IDs with metadata (date, time, language, country code)
- full_dataset-clean.tsv.gz: Original tweets only (no retweets) with date and time
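Both ID files are large, so it helps to stream them rather than load them whole. A minimal sketch using only the standard library follows. The column names "tweet_id" and "lang" are assumptions about the header row and should be checked against the downloaded file.

```python
import csv
import gzip


def iter_rows(path):
    """Yield one dict per row from a gzip-compressed, tab-separated file with a header."""
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        yield from csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)


def ids_for_language(path, lang, id_column="tweet_id", lang_column="lang"):
    """Collect tweet IDs whose language column equals `lang`."""
    return [row[id_column] for row in iter_rows(path) if row.get(lang_column) == lang]
```

For example, `ids_for_language("full_dataset.tsv.gz", "en")` would return the English-language IDs, ready to be passed to a hydration tool.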
2. Daily Statistics¶
- statistics-full_dataset.tsv: Daily tweet count for full dataset
- statistics-full_dataset-clean.tsv: Daily tweet count for clean dataset
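Daily counts of this kind can be recomputed from the ID files by tallying rows per date. The sketch below assumes each row is a dictionary with a "date" column in YYYY-MM-DD form, so that sorting the strings also sorts them chronologically. The column name and sample rows are hypothetical.

```python
from collections import Counter


def daily_counts(rows, date_column="date"):
    """Count rows per calendar day and return (date, count) pairs sorted by date."""
    counts = Counter(row[date_column] for row in rows)
    return sorted(counts.items())


# Hypothetical rows standing in for lines of the ID file.
sample_rows = [
    {"tweet_id": "1", "date": "2020-03-13"},
    {"tweet_id": "2", "date": "2020-03-12"},
    {"tweet_id": "3", "date": "2020-03-12"},
]

for day, count in daily_counts(sample_rows):
    print(day, count)
```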
3. Frequent Terms and N-grams¶
- frequent_terms.csv: Top 1,000 terms with occurrence counts
- frequent_bigrams.csv: Top 1,000 bigrams with occurrence counts
- frequent_trigrams.csv: Top 1,000 trigrams with occurrence counts
Note: Stop words removed in English and Spanish using spaCy; other languages processed with stop-word awareness where applicable.
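The frequency files can be approximated from hydrated text with a counter over n-grams. The sketch below is not the repository's get_ngrams.py. It substitutes a tiny hand-written stop-word set for spaCy's lists, splits on whitespace only, and removes stop words before forming n-grams, all of which are assumptions made for illustration.

```python
from collections import Counter

# Tiny illustrative stop-word set; the source used spaCy's lists instead.
STOP_WORDS = {"the", "a", "of", "in", "is", "el", "la", "de", "en"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]


def ngrams(tokens, n):
    """Return consecutive n-token tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(texts, n, limit=1000):
    """Count n-grams across all texts and return the `limit` most frequent."""
    counts = Counter()
    for text in texts:
        counts.update(ngrams(tokenize(text), n))
    return counts.most_common(limit)
```

Calling `top_ngrams(texts, 1)`, `top_ngrams(texts, 2)` and `top_ngrams(texts, 3)` gives term, bigram and trigram rankings respectively.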
4. Additional Collections¶
- emoji.zip: Daily top emojis (both text and Unicode) with frequencies
- hashtag.zip: Daily top hashtags with frequencies
- mentions.zip: Daily top mentions (@users) with frequencies
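A per-day ranking like the one in hashtag.zip can be sketched as one counter per date. The regular expression, the lowercasing of tags, and the (date, text) input shape are assumptions for illustration, not the maintainers' method.

```python
import re
from collections import Counter, defaultdict

HASHTAG_PATTERN = re.compile(r"#(\w+)")


def daily_top_hashtags(tweets, limit=10):
    """tweets: iterable of (date, text) pairs. Returns {date: [(hashtag, count), ...]}."""
    per_day = defaultdict(Counter)
    for date, text in tweets:
        per_day[date].update(tag.lower() for tag in HASHTAG_PATTERN.findall(text))
    return {date: counts.most_common(limit) for date, counts in per_day.items()}


# Hypothetical hydrated tweets.
sample_tweets = [
    ("2020-03-12", "#COVID19 is here #StayHome"),
    ("2020-03-12", "wash your hands #covid19"),
    ("2020-03-13", "#lockdown"),
]

print(daily_top_hashtags(sample_tweets))
```

Swapping the pattern for `@(\w+)` would produce the equivalent ranking for mentions.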
Data Access and Compliance¶
Format: Tab-separated values (TSV), compressed with gzip
Compliance: Adheres to FAIR (Findable, Accessible, Interoperable, Reusable) principles
Twitter ToS: Tweet text is not included in the dataset; only tweet IDs are published. Researchers must hydrate tweets using tools like:
- Social Media Mining Toolkit (SMMT)
- twarc
- Other Tweepy-based tools
Important note: Tweets that were deleted or removed after the initial collection cannot be retrieved on rehydration. The dataset maintainers can share data on request, subject to Twitter's data-sharing policies.
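Hydration tools work through ID lists in batches, and any ID that no longer resolves simply drops out of the response. The sketch below shows that bookkeeping without calling any real API. `fetch_batch` is a hypothetical callable standing in for a hydration tool, and the batch size of 100 is an assumption modelled on the per-request limit of Twitter's v1.1 lookup endpoint.

```python
def batched(ids, size=100):
    """Yield successive lists of at most `size` IDs."""
    batch = []
    for tweet_id in ids:
        batch.append(tweet_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def hydrate_all(ids, fetch_batch, size=100):
    """Hydrate IDs batch by batch.

    `fetch_batch` (hypothetical) takes a list of IDs and returns a dict
    mapping each still-available ID to its tweet object.
    Returns (tweets_by_id, missing_ids).
    """
    found = {}
    missing = []
    for batch in batched(ids, size):
        result = fetch_batch(batch)
        found.update(result)
        missing.extend(i for i in batch if i not in result)
    return found, missing


# Stand-in fetcher: pretends that tweet "3" has been deleted.
def fake_fetch(batch):
    return {i: {"id_str": i} for i in batch if i != "3"}


tweets, missing = hydrate_all(["1", "2", "3", "4"], fake_fetch, size=2)
print(len(tweets), "hydrated;", len(missing), "unavailable")
```

Recording the missing IDs makes it possible to report how much of the dataset was still retrievable at hydration time.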
Use in Research¶
The dataset enables research into:
- Sentiment analysis of public response to pandemic measures
- Misinformation detection and spread patterns
- Social dynamics during global health crises
- Mental health impacts (emotional and psychological responses to lockdowns)
- Stratified sentiment measurement by geography and time period
- Linguistic patterns across languages and communities
- Bot detection and network analysis
- Hashtag and mention trends for public health communication
Accessibility and Reuse¶
GitHub repository: https://github.com/thepanacealab/covid19_twitter
- Processing and analysis scripts
- Data parsing utilities (parse_json_extreme.py, parse_json_lite.py, get_1grams.py, get_ngrams.py)
- Combination and statistics utilities
Update schedule: Two-day incremental updates with weekly cumulative releases (as of publication)
Community engagement: 41,592+ views and 33,274+ downloads as of dataset publication; international researchers contributing additional data and analysis expertise
Related Datasets¶
- CMU-MisCOV19: Annotated COVID-19 tweets with misinformation classification
- CoAID: Multimodal COVID-19 dataset with news articles and credibility assessment
- MM-COVID: Multilingual COVID-19 dataset
- CHECKED: Chinese-language COVID-19 dataset