Data-to-Text Generation

Data-to-text generation systems automatically produce natural language descriptions from structured data sources—tables, databases, knowledge bases, and logical forms. Applications include weather reports, financial summaries, sports scores, medical report generation, and knowledge base documentation.

Hallucination in data-to-text

Data-to-text generation is particularly vulnerable to numeric hallucinations—errors in numbers, dates, quantities, and other scalar values. These hallucinations are problematic because:

Text containing a small numeric error (a date off by one day, a quantity off by 1%) stays fluent, so nothing in the wording signals that it is factually wrong
Users tend to trust numeric information and may not catch such errors
Incorrect numeric claims can have serious consequences, for example in financial reports or medical data

Other hallucination types include:

Attribute hallucination: Generating descriptions of data not present in the input table
Relational hallucination: Claiming relationships (e.g., "X is larger than Y") unsupported by the data
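
A relational claim can be checked mechanically once it has been extracted from the generated text. The sketch below assumes the claim has already been parsed into a (left, relation, right) triple and that the table is a plain mapping from entity name to a numeric value. The function `check_relation`, the `RELATIONS` mapping, and the sample values are hypothetical and only illustrate the idea.

```python
import operator

# Hypothetical mapping from relation phrases to comparison operators.
RELATIONS = {
    "larger than": operator.gt,
    "smaller than": operator.lt,
    "equal to": operator.eq,
}


def check_relation(table, left, relation, right):
    """Return True if the table supports the claim, False if it
    contradicts it, and None if the table cannot decide it."""
    if left not in table or right not in table or relation not in RELATIONS:
        return None
    return RELATIONS[relation](table[left], table[right])


# Illustrative values only.
table = {"X": 12.0, "Y": 7.5}

print(check_relation(table, "X", "larger than", "Y"))  # True
print(check_relation(table, "Y", "larger than", "X"))  # False
print(check_relation(table, "Z", "larger than", "X"))  # None
```

The `None` case matters: a claim about an entity that is missing from the table is unsupported rather than contradicted, and an evaluation may want to count the two separately.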

Architectures

Sequence-to-sequence models paired with a table encoding are standard. Common encodings are flattened (linearized) tables and graph neural networks over the table structure; retrieval-augmented approaches are also used. Key models include TGen, Struct2Seq, and recent pretrained models (BART, T5) fine-tuned on data-to-text datasets.
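
Flattening is the simplest of these table encodings: each record is turned into a token sequence that a sequence-to-sequence model can read. The following is a minimal sketch. The `<row>` and `<cell>` separators and the function name `linearize_table` are assumed choices for illustration, not a format prescribed by any particular model.

```python
def linearize_table(rows):
    """Flatten a list of {column: value} records into one string.

    Each record is introduced by "<row>" and each cell is written as
    "<cell> column | value". The marker strings are illustrative.
    """
    parts = []
    for row in rows:
        parts.append("<row>")
        for column, value in row.items():
            parts.append(f"<cell> {column} | {value}")
    return " ".join(parts)


# Illustrative records only.
rows = [
    {"team": "A", "points": 3},
    {"team": "B", "points": 1},
]
print(linearize_table(rows))
# <row> <cell> team | A <cell> points | 3 <row> <cell> team | B <cell> points | 1
```

A string like this is what would be tokenized and passed to a fine-tuned model; the sketch deliberately stops before any model call. Keeping the column name next to each value is one way to preserve the link between a number and its meaning after the table structure is gone.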

Evaluation

Automatic metrics: BLEU, METEOR, PARENT (scores output n-grams against both the reference and the table, so content supported by neither is penalized)
Human evaluation: Fluency, informativeness, factuality
Data correctness metrics: Check whether each numeric claim in the generated text is supported by the input table
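
A data correctness check of the kind described in the last item can be sketched as follows. This is a simplified illustration under stated assumptions, not an established metric implementation: numbers are found with a regular expression, a claim counts as supported if it equals any number in the table, and the names `extract_numbers` and `numeric_support` are hypothetical. Negative numbers, thousands separators, units, and spelled-out numbers are not handled.

```python
import re

# Matches unsigned integers and decimals such as "15" or "3.5".
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def extract_numbers(text):
    """Return every integer or decimal literal in `text` as a float."""
    return [float(match) for match in NUMBER.findall(text)]


def numeric_support(generated, table_values, tol=1e-9):
    """Return (score, unsupported) for the numbers in `generated`.

    `score` is the fraction of numbers in the text that match some
    number in the table; it is 1.0 when the text contains no numbers.
    `unsupported` lists the numbers with no match in the table.
    """
    table_numbers = []
    for value in table_values:
        table_numbers.extend(extract_numbers(str(value)))
    claims = extract_numbers(generated)
    if not claims:
        return 1.0, []
    unsupported = [
        claim for claim in claims
        if not any(abs(claim - t) <= tol for t in table_numbers)
    ]
    return 1 - len(unsupported) / len(claims), unsupported


# Illustrative record and output only.
record = {"high": 21, "low": 12, "wind_kmh": 15}
text = "Highs of 21 and lows of 13, with winds at 15 km/h."

score, unsupported = numeric_support(text, record.values())
print(round(score, 2), unsupported)  # 0.67 [13.0]
```

Matching against any table value is lenient: a number that appears in the wrong column still counts as supported. A stricter check would align each claim with a specific cell before comparing.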

Datasets

E2E (restaurant descriptions generated from attribute-value meaning representations)
WebNLG (describing sets of RDF triples drawn from DBpedia)
Table-to-Text (Wikipedia tables)
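
The structured inputs in E2E and WebNLG are small enough to inspect by hand. The sketch below assumes the textual serializations in which these datasets are commonly distributed: E2E inputs written as `slot[value]` pairs separated by commas, and WebNLG triples written as `subject | predicate | object`. The function names and sample strings are illustrative, and the exact file formats should be checked against the dataset releases.

```python
import re

# One "slot[value]" pair; slot names may contain spaces.
SLOT = re.compile(r"([^,\[\]]+)\[([^\]]*)\]")


def parse_e2e_mr(mr):
    """Parse 'slot[value], slot[value]' into a {slot: value} dict."""
    return {slot.strip(): value.strip() for slot, value in SLOT.findall(mr)}


def parse_triple(triple):
    """Split 'subject | predicate | object' into a 3-tuple."""
    subject, predicate, obj = (part.strip() for part in triple.split("|"))
    return subject, predicate, obj


# Illustrative inputs only.
print(parse_e2e_mr("name[The Eagle], eatType[coffee shop], area[riverside]"))
print(parse_triple("Alan_Bean | occupation | Test_pilot"))
```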

Key papers

Survey of Hallucination in Natural Language Generation — Section 10 surveys hallucination definitions, metrics (especially numeric accuracy), and mitigation in data-to-text generation

Natural Language Generation — broader task
Hallucination in language models — cross-task phenomenon
Semantic Parsing — inverse task (natural language to logical form)
Information Extraction — related task