Dr. Sarah Chen
Associate Professor of Computer Science
Stanford University
My research focuses on natural language processing, machine learning, and the intersection of language understanding with knowledge representation.
Robust Factuality Evaluation for Open-Domain Text Generation
Published in Journal of Machine Learning Research, 2025
Ananya Gupta, Sarah Chen
Abstract
Evaluating the factual consistency of text generated by large language models remains a fundamental challenge. Existing metrics either rely on coarse document-level scores that mask fine-grained errors or require expensive human annotation pipelines that do not scale. The gap between automated metrics and human judgments of factuality continues to hinder progress on building reliable generation systems.
We introduce FactScore-R, a robust evaluation metric that decomposes generated text into atomic claims and assesses each claim against retrieved evidence at multiple levels of granularity. Unlike prior work, FactScore-R explicitly models partial entailment—recognizing that a claim may be partially supported rather than fully true or fully false. Our metric uses a hierarchical claim decomposition strategy that captures both fine-grained factual assertions and higher-level thematic consistency.
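The decomposition-and-scoring idea can be illustrated with a minimal sketch. This is not the authors' implementation: the `Claim` structure, the 0.5 weight on partially supported claims, and the atomic/thematic level weights are all illustrative assumptions, standing in for the learned entailment model and hierarchical decomposer the abstract describes.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    level: int  # 0 = atomic factual assertion, 1 = higher-level thematic claim

def partial_entailment_score(probs):
    """Collapse entailment probabilities into a graded support score.

    probs: dict with keys "supported", "partial", "refuted" summing to 1.
    Partial support counts half (illustrative choice), so a claim need not
    be scored as fully true or fully false.
    """
    return probs["supported"] + 0.5 * probs["partial"]

def factscore_r_sketch(claims, entailment_probs, level_weights=(0.7, 0.3)):
    """Average per-claim scores within each granularity level, then combine.

    level_weights is a hypothetical atomic-vs-thematic weighting, not a
    value from the paper.
    """
    buckets = {0: [], 1: []}
    for claim, probs in zip(claims, entailment_probs):
        buckets[claim.level].append(partial_entailment_score(probs))
    score = 0.0
    for level, w in zip((0, 1), level_weights):
        if buckets[level]:
            score += w * sum(buckets[level]) / len(buckets[level])
    return score

# Example: one fully supported atomic claim, one partially supported
# thematic claim -> 0.7 * 1.0 + 0.3 * 0.5 = 0.85
claims = [Claim("born in 1967", 0), Claim("spent her career in physics", 1)]
probs = [
    {"supported": 1.0, "partial": 0.0, "refuted": 0.0},
    {"supported": 0.0, "partial": 1.0, "refuted": 0.0},
]
print(factscore_r_sketch(claims, probs))
```

In a real pipeline, the entailment probabilities would come from a model comparing each claim against retrieved evidence; here they are hand-specified to show how graded support propagates into the aggregate score.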
We evaluate FactScore-R on four open-domain generation benchmarks spanning biography, science, history, and current events. Our metric achieves a Kendall tau correlation of 0.71 with expert human judgments, compared to 0.52 for the best existing automated metric. We further show that FactScore-R is robust to paraphrasing, domain shift, and adversarial perturbations, making it suitable for deployment in production evaluation pipelines.
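The reported correlation is a Kendall tau between metric scores and human judgments. As a quick illustration of the statistic itself (not the paper's evaluation code), a self-contained tau-a computation:

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) pairs / total pairs.

    x: automated metric scores; y: human ratings for the same outputs.
    Assumes no ties (tau-a); libraries like scipy handle ties (tau-b).
    """
    assert len(x) == len(y) and len(x) > 1
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Perfectly rank-aligned scores give tau = 1.0; one inverted pair
# out of three lowers it to (2 - 1) / 3.
print(kendall_tau([0.9, 0.4, 0.7, 0.2], [5, 2, 4, 1]))  # -> 1.0
print(kendall_tau([0.9, 0.4, 0.7], [3, 2, 1]))          # -> 0.333...
```

A tau of 0.71, as reported for FactScore-R, means the metric orders system outputs much closer to how experts order them than a 0.52-tau baseline does.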
Citation
A. Gupta, S. Chen. (2025). "Robust Factuality Evaluation for Open-Domain Text Generation." Journal of Machine Learning Research, 26(1), 1–45.