Dr. Sarah Chen
Associate Professor of Computer Science
Stanford University
My research focuses on natural language processing, machine learning, and the intersection of language understanding with knowledge representation.
Robust Factuality Evaluation for Open-Domain Text Generation
Published in Journal of Machine Learning Research, 2025
Ananya Gupta, Sarah Chen
Abstract
Evaluating the factual consistency of text generated by large language models remains a fundamental challenge. Existing metrics either rely on coarse document-level scores that mask fine-grained errors or require expensive human annotation pipelines that do not scale. The gap between automated metrics and human judgments of factuality continues to hinder progress on building reliable generation systems.
We introduce FactScore-R, a robust evaluation metric that decomposes generated text into atomic claims and assesses each claim against retrieved evidence at multiple levels of granularity. Unlike prior work, FactScore-R explicitly models partial entailment—recognizing that a claim may be partially supported rather than fully true or fully false. Our metric uses a hierarchical claim decomposition strategy that captures both fine-grained factual assertions and higher-level thematic consistency.
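The decomposition-and-scoring idea can be illustrated with a minimal sketch. This is not the authors' implementation: the `Claim` structure, the 0.5 weight on partially supported claims, and the atomic/thematic level weights are all illustrative assumptions, standing in for the learned entailment model and hierarchical decomposer the abstract describes.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    level: int  # 0 = atomic factual assertion, 1 = higher-level thematic claim

def partial_entailment_score(probs):
    """Collapse entailment probabilities into a graded support score.

    probs: dict with keys "supported", "partial", "refuted" summing to 1.
    Partial support counts half (illustrative choice), so a claim need not
    be scored as fully true or fully false.
    """
    return probs["supported"] + 0.5 * probs["partial"]

def factscore_r_sketch(claims, entailment_probs, level_weights=(0.7, 0.3)):
    """Average per-claim scores within each granularity level, then combine.

    level_weights is a hypothetical atomic-vs-thematic weighting, not a
    value from the paper.
    """
    buckets = {0: [], 1: []}
    for claim, probs in zip(claims, entailment_probs):
        buckets[claim.level].append(partial_entailment_score(probs))
    score = 0.0
    for level, w in zip((0, 1), level_weights):
        if buckets[level]:
            score += w * sum(buckets[level]) / len(buckets[level])
    return score

# Example: one fully supported atomic claim, one partially supported
# thematic claim -> 0.7 * 1.0 + 0.3 * 0.5 = 0.85
claims = [Claim("born in 1967", 0), Claim("spent her career in physics", 1)]
probs = [
    {"supported": 1.0, "partial": 0.0, "refuted": 0.0},
    {"supported": 0.0, "partial": 1.0, "refuted": 0.0},
]
print(factscore_r_sketch(claims, probs))
```

In a real pipeline, the entailment probabilities would come from a model comparing each claim against retrieved evidence; here they are hand-specified to show how graded support propagates into the aggregate score.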
We evaluate FactScore-R on four open-domain generation benchmarks spanning biography, science, history, and current events. Our metric achieves a Kendall tau correlation of 0.71 with expert human judgments, compared to 0.52 for the best existing automated metric. We further show that FactScore-R is robust to paraphrasing, domain shift, and adversarial perturbations, making it suitable for deployment in production evaluation pipelines.
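The reported correlation is a Kendall tau between metric scores and human judgments. As a quick illustration of the statistic itself (not the paper's evaluation code), a self-contained tau-a computation:

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) pairs / total pairs.

    x: automated metric scores; y: human ratings for the same outputs.
    Assumes no ties (tau-a); libraries like scipy handle ties (tau-b).
    """
    assert len(x) == len(y) and len(x) > 1
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Perfectly rank-aligned scores give tau = 1.0; one inverted pair
# out of three lowers it to (2 - 1) / 3.
print(kendall_tau([0.9, 0.4, 0.7, 0.2], [5, 2, 4, 1]))  # -> 1.0
print(kendall_tau([0.9, 0.4, 0.7], [3, 2, 1]))          # -> 0.333...
```

A tau of 0.71, as reported for FactScore-R, means the metric orders system outputs much closer to how experts order them than a 0.52-tau baseline does.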
Citation
A. Gupta, S. Chen. (2025). "Robust Factuality Evaluation for Open-Domain Text Generation." Journal of Machine Learning Research, 26(1), 1–45.