COVID-19 Twitter Chatter Dataset

Paper: Banda et al., 2020

Repository: https://zenodo.org/record/3723939

Languages: Multilingual (English, Spanish, French, German, Russian)

Collection period: January 1 – November 8, 2020

Overview

This is a large-scale, openly available dataset of COVID-19 tweets collected and curated by an international collaboration of researchers. It contains over 800 million unique tweet identifiers (retweets included) collected between January 1 and November 8, 2020, and supports analysis of public sentiment, misinformation sources, and social dynamics during the pandemic.

Core statistics:
- Full dataset: 800,064,296 tweets (including retweets)
- Clean dataset: 194,272,176 tweets (original tweets only, no retweets)
- Time span: January 1 – November 8, 2020
- Languages: English, Spanish, French, German, Russian, and others
- Format: Tab-separated tweet IDs, timestamps, language, country code
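As a quick sanity check on the two headline counts, the short sketch below derives how many entries the clean version drops and what share of the full dataset it retains. It assumes that the difference between the two counts consists entirely of retweets, which follows from the descriptions above but is not stated as an exact figure in the source.

```python
# Counts taken from the core statistics above.
FULL_TWEETS = 800_064_296   # full dataset, including retweets
CLEAN_TWEETS = 194_272_176  # clean dataset, original tweets only

# Assumption: everything removed from the full dataset is a retweet.
retweets_removed = FULL_TWEETS - CLEAN_TWEETS
clean_share = CLEAN_TWEETS / FULL_TWEETS

print(f"Entries removed in the clean version: {retweets_removed:,}")
print(f"Clean share of the full dataset: {clean_share:.1%}")
```

Roughly one in four entries survives in the clean version, which is why it is the lighter choice for text-processing work.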

Data Collection and Methodology

Initial Phase (January–March 2020)

Collected via Twitter Stream API with keywords: "coronavirus", "2019ncov", "corona virus"
Covered January 1 – March 11, 2020
Limited to publicly available 1% sample of Twitter stream

Expanded Phase (March 12 – November 8, 2020)

Switched to a keyword set focused exclusively on COVID-19: "COVID19", "CoronavirusPandemic", "COVID-19", "2019nCoV", "CoronaOutbreak", "coronavirus", "WuhanVirus"
Significantly expanded collection scope
Collection tools: Twitter Stream API via Tweepy package and Social Media Mining Toolkit (SMMT)
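The keyword filter used in this phase can be approximated locally, for example to re-check which of the tracked terms a hydrated tweet matches. The sketch below is a minimal illustration, not the collection code: the function name `matches_keywords` is hypothetical, and the case-insensitive substring test is an assumption that only roughly mirrors the server-side matching done by the Twitter Stream API.

```python
# Keyword list copied from the expanded-phase description above.
EXPANDED_KEYWORDS = [
    "COVID19", "CoronavirusPandemic", "COVID-19", "2019nCoV",
    "CoronaOutbreak", "coronavirus", "WuhanVirus",
]


def matches_keywords(text, keywords=EXPANDED_KEYWORDS):
    """Return True if any keyword occurs in the text, ignoring case.

    Hypothetical helper: a plain substring test, which is only an
    approximation of how the streaming endpoint matches tracked terms.
    """
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


print(matches_keywords("New #covid19 cases reported today"))
print(matches_keywords("Nice weather this weekend"))
```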

International Collaboration Phase

30+ million tweets (January 27 – March 27, 2020) contributed by co-author Jingyuan Yu and collaborators
Keywords: "coronavirus", "wuhan", "pneumonia", "pneumonie", "neumonia", "lungenentzündung", "covid19"
Languages: English, French, Spanish, German
~1.5 million tweets (January 1 – May 8, 2020) in Russian contributed by co-authors Katya Artemova and Elena Tutubalina
Full deduplication performed across all contributing datasets
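Deduplicating across contributed collections amounts to keeping the first occurrence of each tweet ID. The following sketch shows one simple way to do that in memory; the function name `merge_unique_ids` is hypothetical and the approach is an assumption for illustration, not the procedure the maintainers actually ran.

```python
def merge_unique_ids(*sources):
    """Merge several iterables of tweet IDs, keeping first occurrences only.

    Hypothetical helper: IDs are compared as given, so callers should
    normalize them to one type (for example str) beforehand.
    """
    seen = set()
    merged = []
    for source in sources:
        for tweet_id in source:
            if tweet_id not in seen:
                seen.add(tweet_id)
                merged.append(tweet_id)
    return merged


# Toy ID lists standing in for the main, multilingual, and Russian collections.
main_ids = ["1", "2"]
multilingual_ids = ["2", "3"]
russian_ids = ["3", "1", "4"]
print(merge_unique_ids(main_ids, multilingual_ids, russian_ids))
```

At the scale of this dataset a set of several hundred million IDs needs a large amount of memory, so an external sort or a database would be a more realistic choice than this in-memory version.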

Data Processing

Preprocessing via Social Media Mining Toolkit (SMMT) components
Special character cleaning (carriage returns, URLs, excess whitespace)
No language filtering: tweets in all languages are retained
Two dataset versions released:
Full: Includes tweets and retweets (useful for dissemination analysis)
Clean: Original tweets only, no retweets (preferred for NLP tasks with resource constraints)
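The cleaning and retweet-filtering steps listed above can be sketched as follows. This is an illustrative stand-in and not the SMMT implementation: the regular expressions and the helper names `clean_text` and `is_retweet` are assumptions. The retweet check relies on the `retweeted_status` field that Twitter API v1.1 tweet objects carry when a tweet is a retweet.

```python
import re

URL_RE = re.compile(r"https?://\S+")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(text):
    """Remove URLs, then collapse whitespace runs into single spaces.

    The whitespace pattern also matches carriage returns and newlines,
    so those are removed in the same pass.
    """
    without_urls = URL_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", without_urls).strip()


def is_retweet(tweet):
    """Return True if a v1.1-style tweet dict is a retweet."""
    return "retweeted_status" in tweet


print(clean_text("Stay home\r\nhttps://t.co/abc   please"))
```

Keeping only tweets for which `is_retweet` is false would correspond to the clean version; keeping everything corresponds to the full version.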

Deliverables

The Zenodo repository contains the following files, grouped into four categories:

1. Tweet ID Collections

full_dataset.tsv.gz: All tweet IDs with metadata (date, time, language, country code)
full_dataset-clean.tsv.gz: Original tweets only (no retweets) with date and time
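Because the ID files are large gzip-compressed TSVs, it is usually preferable to stream them row by row rather than load them whole. The sketch below does this with the standard library. It assumes the file starts with a header row; the column names used in the usage comment (`tweet_id`, `lang`) are hypothetical and should be checked against the actual file.

```python
import csv
import gzip


def iter_rows(path):
    """Yield one dict per row from a gzip-compressed TSV with a header row."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            yield row


def ids_for_language(rows, lang, id_field="tweet_id", lang_field="lang"):
    """Yield IDs of rows in the given language (field names are assumed)."""
    for row in rows:
        if row.get(lang_field) == lang:
            yield row[id_field]


# Usage, assuming the file has been downloaded to the working directory:
# spanish_ids = ids_for_language(iter_rows("full_dataset.tsv.gz"), "es")
```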

2. Daily Statistics

statistics-full_dataset.tsv: Daily tweet count for full dataset
statistics-full_dataset-clean.tsv: Daily tweet count for clean dataset
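Daily counts of this kind can be recomputed from the ID files by tallying rows per date. The sketch below is a minimal illustration under the assumption that each row is a dict with a `date` field; that field name and the helper `daily_counts` are hypothetical.

```python
from collections import Counter


def daily_counts(rows, date_field="date"):
    """Count rows per day; `date_field` is an assumed column name."""
    return Counter(row[date_field] for row in rows)


# Toy rows standing in for parsed lines of an ID file.
sample_rows = [
    {"tweet_id": "1", "date": "2020-03-12"},
    {"tweet_id": "2", "date": "2020-03-12"},
    {"tweet_id": "3", "date": "2020-03-13"},
]
for day, count in sorted(daily_counts(sample_rows).items()):
    print(day, count)
```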

3. Frequent Terms and N-grams

frequent_terms.csv: Top 1,000 terms with occurrence counts
frequent_bigrams.csv: Top 1,000 bigrams with occurrence counts
frequent_trigrams.csv: Top 1,000 trigrams with occurrence counts

Note: Stop words removed in English and Spanish using spaCy; other languages processed with stop-word awareness where applicable.
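The term and n-gram tables can be approximated on hydrated text with a counter over sliding token windows. The sketch below is a simplified stand-in for the repository's scripts: it uses whitespace tokenization and a tiny hand-written stop-word set instead of spaCy, and the name `top_ngrams` is hypothetical.

```python
from collections import Counter

# Tiny illustrative stop-word set; the dataset itself relies on spaCy's lists.
STOP_WORDS = {"the", "a", "of", "in", "is", "to", "and"}


def top_ngrams(texts, n, k=1000, stop_words=STOP_WORDS):
    """Return the k most frequent n-grams as (tuple_of_tokens, count) pairs."""
    counts = Counter()
    for text in texts:
        tokens = [t for t in text.lower().split() if t not in stop_words]
        # zip over n shifted copies of the token list yields each n-gram once.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(k)


sample_texts = ["wash your hands", "wash your hands often", "wash your face"]
print(top_ngrams(sample_texts, n=2, k=3))
```

Setting `n=1`, `n=2`, or `n=3` corresponds to the terms, bigrams, and trigrams files respectively.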

4. Additional Collections

emoji.zip: Daily top emojis (both text and Unicode) with frequencies
hashtag.zip: Daily top hashtags with frequencies
mentions.zip: Daily top mentions (@users) with frequencies
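Hashtag and mention frequencies like those in the archives above can be derived from hydrated tweet text with two regular expressions. The sketch below is an assumption about how such counts might be produced, not the maintainers' code; the patterns are deliberately simple, the helper `top_entities` is hypothetical, and emoji extraction is left out.

```python
import re
from collections import Counter

HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")


def top_entities(texts, pattern, k=10):
    """Return the k most frequent case-folded matches of `pattern`."""
    counts = Counter()
    for text in texts:
        counts.update(match.lower() for match in pattern.findall(text))
    return counts.most_common(k)


sample_texts = [
    "#COVID19 update from @WHO",
    "Thanks @who #covid19 #StayHome",
]
print(top_entities(sample_texts, HASHTAG_RE, k=3))
print(top_entities(sample_texts, MENTION_RE, k=3))
```

Grouping the input texts by day before calling the function would give daily rankings of the kind the archives contain.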

Data Access and Compliance

Format: Tab-separated values (TSV), compressed with gzip

Compliance: Adheres to FAIR (Findable, Accessible, Interoperable, Reusable) principles

Twitter ToS: Tweet text is not included in the dataset; only tweet IDs are published. Researchers must hydrate the IDs (retrieve the full tweet objects from the Twitter API) using tools such as:
- Social Media Mining Toolkit (SMMT)
- twarc
- Other Tweepy-based tools

Important note: Tweets that were removed or deleted after collection cannot be retrieved by hydration. The dataset maintainers can share data on request, subject to Twitter's data-sharing policies.
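Hydration tools generally send IDs in batches and return only the tweets that still exist, so it helps to track which IDs came back. The sketch below shows the batching and bookkeeping only and makes no API calls. The batch size of 100 reflects the per-request limit of Twitter's v1.1 `statuses/lookup` endpoint and is treated here as an assumption; the helper names `batched` and `missing_ids` are hypothetical.

```python
def batched(ids, size=100):
    """Yield lists of at most `size` IDs, preserving order."""
    batch = []
    for tweet_id in ids:
        batch.append(tweet_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def missing_ids(requested, hydrated_tweets):
    """Return requested IDs absent from the hydrated tweet dicts.

    Assumes each hydrated tweet is a dict with an "id" key, as in
    v1.1-style tweet JSON.
    """
    returned = {str(tweet["id"]) for tweet in hydrated_tweets}
    return [i for i in requested if str(i) not in returned]


ids = [str(n) for n in range(250)]
print([len(batch) for batch in batched(ids)])
```

The list returned by `missing_ids` gives a direct measure of how much of a sample has been deleted or made private since collection.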

Use in Research

The dataset enables research into:
- Sentiment analysis of public response to pandemic measures
- Misinformation detection and spread patterns
- Social dynamics during global health crises
- Mental health impacts (emotional and psychological responses to lockdowns)
- Stratified sentiment measurement by geography and time period
- Linguistic patterns across languages and communities
- Bot detection and network analysis
- Hashtag and mention trends for public health communication

Accessibility and Reuse

GitHub repository: https://github.com/thepanacealab/covid19_twitter
- Processing and analysis scripts
- Data parsing utilities (parse_json_extreme.py, parse_json_lite.py, get_1grams.py, get_ngrams.py)
- Combination and statistics utilities

Update schedule: Two-day incremental updates with weekly cumulative releases (as of publication)

Community engagement: 41,592+ views and 33,274+ downloads as of dataset publication; international researchers contributing additional data and analysis expertise

Related datasets:

CMU-MisCOV19: Annotated COVID-19 tweets with misinformation classification
CoAID: Multimodal COVID-19 dataset with news articles and credibility assessment
MM-COVID: Multilingual COVID-19 dataset
CHECKED: Chinese-language COVID-19 dataset