Cross-lingual COVID-19 Fake News Detection

Authors: Jiangshu Du, Yingtong Dou, Congying Xia, Limeng Cui, Jing Ma, Philip S. Yu

Venue: arXiv preprint, October 2021 — arXiv:2110.06495

TL;DR

The paper addresses COVID-19 misinformation in low-resource languages by proposing CrossFake, a cross-lingual fake news detector trained on English COVID-19 news and applied to Chinese news via machine translation. The method slices long news texts into sub-text groups before BERT encoding to preserve fact-related information. On a manually annotated Chinese COVID-19 dataset (86 fake, 114 real), CrossFake achieves 75% accuracy, outperforming the monolingual and cross-lingual baselines, though machine translation quality remains a limiting factor.

Contributions

  • Cross-lingual COVID-19 dataset: Manually annotated Chinese COVID-19 news dataset (200 articles: 86 fake, 114 real) matched with existing English COVID-19 datasets, addressing the gap in fact-checked non-English pandemic misinformation
  • CrossFake framework: End-to-end neural architecture handling long news texts via sub-text slicing (500 tokens per group), mean pooling, and fully-connected aggregation before binary classification
  • Empirical validation: Demonstrates that machine-translated Chinese news can be classified with English-trained detectors; the 75% accuracy exceeds the strongest text-only baseline (SAFE, 71.6%) and the social-context-based CSI (68.3%)
  • Cross-lingual analysis: Identifies translation quality and information location as bottlenecks; shows pre-trained multilingual encoders (mBERT, multilingual transformers) underperform due to lack of domain knowledge for emerging COVID-19 terminology
  • Comparative benchmarking: Evaluates against monolingual baselines (CSI, SAFE, exBAKE, and the exBAKE-sub variant) and two cross-lingual baselines (CLEF, EMET) on COVID-19 cross-lingual detection

Method

Problem Definition: Train a binary classifier on English COVID-19 news (source language) to predict truthfulness of Chinese COVID-19 news (target language) without annotated Chinese training data.

Training Phase:

Text preprocessing: Break the tokenized news body text into sequential groups of 500 tokens: \(T_e^G = \{t_{e1}, \ldots, t_{em}\}\), where \(m = \lceil |T_e| / 500 \rceil\).
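
The slicing step can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `slice_into_groups` and the integer stand-ins for tokens are assumptions made for the example.

```python
import math

def slice_into_groups(tokens, group_size=500):
    """Split a token list into consecutive groups of at most group_size tokens."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return [tokens[i:i + group_size] for i in range(0, len(tokens), group_size)]

# Hypothetical 1,234-token article; integers stand in for token ids.
tokens = list(range(1234))
groups = slice_into_groups(tokens)

# The number of groups matches m = ceil(|T_e| / 500).
assert len(groups) == math.ceil(len(tokens) / 500)
print([len(g) for g in groups])  # [500, 500, 234]
```

Only the last group can be shorter than 500 tokens, and concatenating the groups recovers the original sequence, so no text is dropped.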

Encoding: Apply BERT to each sub-text group independently: \(h_e = \text{FC}\left(\frac{\sum_{i=1}^{m} \text{BERT}(t_{ei})}{m}\right)\), where FC is a fully-connected layer and mean pooling aggregates embeddings across all sub-texts.
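
A toy version of the pooling and projection step, assuming the BERT outputs are already available as fixed-length vectors. The three-dimensional embeddings and the FC weights below are made-up values chosen so the arithmetic is easy to follow; real BERT embeddings are much larger.

```python
def mean_pool(vectors):
    """Element-wise mean of m equal-length vectors."""
    m = len(vectors)
    return [sum(column) / m for column in zip(*vectors)]

def fully_connected(x, weights, bias):
    """Affine layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

# Stand-ins for BERT(t_ei): hypothetical embeddings for m = 2 sub-texts.
sub_text_embeddings = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
]
# Hypothetical FC parameters mapping 3 dimensions to 2.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 1.0]]
b = [0.5, -1.0]

pooled = mean_pool(sub_text_embeddings)  # [2.0, 3.0, 4.0]
h_e = fully_connected(pooled, W, b)      # [2.5, 6.0]
print(pooled, h_e)
```

Because the mean is taken over all \(m\) sub-texts, every part of the article contributes equally to \(h_e\), regardless of where it appears.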

Loss: Binary cross-entropy with sigmoid activation, optimized with SGD: \(L = \sum_{e \in N_e} -\log(y_e \cdot \sigma(\text{MLP}(h_e)))\), where \(\sigma\) is the sigmoid function.
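
For reference, the textbook form of binary cross-entropy with a sigmoid output can be computed as below. This sketch uses the standard two-term definition, which is an assumption about what the compact notation intends; the logits and labels are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(logit, label):
    """Standard BCE for one article: -[y log p + (1 - y) log(1 - p)], p = sigmoid(logit)."""
    p = sigmoid(logit)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

# Hypothetical (logit, label) pairs standing in for MLP(h_e) and y_e.
batch = [(2.0, 1), (-1.0, 0), (0.0, 1)]
total_loss = sum(binary_cross_entropy(z, y) for z, y in batch)
print(round(total_loss, 4))
```

A logit of 0 gives \(p = 0.5\) and a loss of \(\log 2\) for either label, which is a convenient sanity check.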

Testing Phase:

  1. Translate Chinese news to English via Google Translate API
  2. Tokenize and slice into groups of 100 tokens (the translated Chinese articles are shorter than the English training articles)
  3. Apply the trained classifier to each sub-text
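  4. Aggregate the sub-text predictions by voting against a threshold θ = 0.8: \(p_c = \begin{cases} 1, & \text{if } \frac{\sum_j |C(t_{cj})|}{n} \geq \theta \\ 0, & \text{otherwise} \end{cases}\), where \(C(t_{cj})\) is the classifier's prediction for the \(j\)-th sub-text and \(n\) is the number of sub-texts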
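
The aggregation rule in the last step can be sketched as follows, assuming each sub-text prediction is already a 0/1 label. The function name and the example predictions are illustrative only, and which class is encoded as 1 is left to the classifier.

```python
def aggregate_predictions(sub_text_predictions, theta=0.8):
    """Return 1 if the fraction of sub-texts predicted as class 1 is at least theta, else 0."""
    if not sub_text_predictions:
        raise ValueError("need at least one sub-text prediction")
    positive_fraction = sum(sub_text_predictions) / len(sub_text_predictions)
    return 1 if positive_fraction >= theta else 0

# Hypothetical per-sub-text outputs C(t_cj) for two translated articles.
print(aggregate_predictions([1, 1, 1, 1, 0]))  # 4/5 = 0.8 -> 1
print(aggregate_predictions([1, 1, 1, 0, 0]))  # 3/5 = 0.6 -> 0
```

With θ = 0.8 a simple majority is not enough: in the second article three of five sub-texts vote 1, yet the article-level prediction is 0.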

Results

Performance comparison (Table II)

| Model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| CLEF (cross-lingual) | 43.1% | 42.9% | 97.4% | 59.5% |
| EMET (cross-lingual) | 45.9% | 42.2% | 70.9% | 51.9% |
| CSI (monolingual, LSTM) | 68.3% | 61.4% | 71.2% | 65.8% |
| SAFE (monolingual, TextCNN) | 71.6% | 63.7% | 80.7% | 71.0% |
| exBAKE (monolingual, BERT) | 64.3% | 55.6% | 92.1% | 69.0% |
| exBAKE-sub (BERT + sub-text) | 66.8% | 59.7% | 70.5% | 64.3% |
| CrossFake-avg | 73.6% | 64.8% | 85.4% | 73.5% |
| CrossFake-sub | 75.0% | 71.5% | 70.5% | 70.7% |

Key findings:

  1. CrossFake outperforms baselines: CrossFake-sub reaches 75.0% accuracy, ahead of the CNN-based SAFE (71.6%) and both cross-lingual baselines (43.1% and 45.9%).
  2. Sub-text aggregation helps: Encoding the full article (CrossFake-avg, 73.6%) beats truncating to 512 tokens (exBAKE, 64.3%). Aggregating sub-text predictions (CrossFake-sub) raises precision from 64.8% to 71.5%, at the cost of recall (85.4% to 70.5%) and F1 (73.5% to 70.7%).
  3. Cross-lingual models fail: CLEF and EMET reach only 43.1% and 45.9% accuracy despite using multilingual encoders. Their high recall (97.4%, 70.9%) paired with low precision (42.9%, 42.2%) suggests a bias toward predicting "fake".
  4. CNN outperforms RNN: SAFE (TextCNN, 71.6%) outperforms CSI (LSTM, 68.3%) by 3.3 points, suggesting local feature extraction is more effective than sequential modeling for long news articles.

Analysis of failure modes

  • Translation quality: "Coronavirus" mistranslated as "new crown virus" (literal Chinese translation), confusing the classifier
  • Information location: Fake news articles whose misinformation appears in the middle or at the end are missed by models with fixed sequence length limits
  • Dataset size: 200-article test set is small compared to typical benchmarks; results may not generalize
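
The test-set size concern can be made concrete with a rough calculation. The sketch below assumes the 75% figure can be treated as a single proportion over 200 independent articles and uses the normal approximation; neither assumption comes from the paper, and the function name is illustrative.

```python
import math

def normal_approx_interval(accuracy, n, z=1.96):
    """Approximate 95% interval for a proportion: p +/- z * sqrt(p(1 - p) / n)."""
    half_width = z * math.sqrt(accuracy * (1.0 - accuracy) / n)
    return accuracy - half_width, accuracy + half_width

n_articles = 200                      # 86 fake + 114 real
majority_baseline = 114 / n_articles  # accuracy of always predicting "real"
low, high = normal_approx_interval(0.75, n_articles)

print(f"majority-class accuracy: {majority_baseline:.2f}")
print(f"75% accuracy, approx. 95% interval: [{low:.2f}, {high:.2f}]")
```

Under these assumptions the interval is roughly ±6 points, about 69% to 81%. That is wide enough to include the 71.6% reported for SAFE, while the majority-class rate of 57% lies well below it.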

Connections

Notes

Strengths:

  • Addresses a genuine and timely problem: non-English COVID-19 misinformation circulating unmoderated while English fact-checking dominates
  • Practical approach: leverages existing English datasets and off-the-shelf translation rather than assuming annotated Chinese training data
  • Sub-text slicing strategy is simple and effective, preserving information across long documents that exceed BERT's 512-token limit
  • Systematic comparison across monolingual and cross-lingual baselines clarifies why multilingual models underperform (lack of domain knowledge, sequence length constraints)

Weaknesses and limitations:

  • Machine translation bottleneck: Translation quality directly impacts accuracy; mistranslations of domain-specific terms (COVID-19, vaccine, hydroxychloroquine) create a fundamental error ceiling
  • Small test set: 200 articles (86 fake, 114 real) is small relative to other fake news benchmarks; confidence intervals or multiple runs would strengthen the claims
  • Dataset curation methodology: Manual matching of English news to Chinese news introduces selection bias; Chinese sources matching English stories may overrepresent duplicated or widely spread misinformation rather than indigenous Chinese misinformation
  • No social context: Unlike MM-COVID, CrossFake relies on text only; propagation patterns and user engagement likely carry additional signal for COVID-19 detection
  • Limited language scope: Only English→Chinese; claims about low-resource languages generalize from a single language pair
  • Evaluation metric concerns: F1 averaging (macro vs. micro) is unclear; the precision/recall gap for CrossFake-avg (64.8% vs. 85.4%) suggests class imbalance effects are not controlled for

Impact and open questions:

  • Demonstrates feasibility of cross-lingual transfer for emerging-event detection without in-language annotations
  • Raises a question: Can detection improve by addressing translation directly (better translation models, terminology dictionaries) rather than accepting translation as fixed input?
  • Opens an avenue for multi-hop transfer: English → intermediate-resource language (Spanish) → low-resource language


Related datasets: ReCOVery (Zhou et al., 2020), MM-COVID (Li et al., 2020), CHECKED (Yang et al., 2020)