Over the weekend, I played with fine-tuning GPT-2 and XLNet (on Colab). Huge applause to Huggingface Transformers: it makes all sorts of pre-trained LMs extremely accessible. The framework has evolved a lot from being a wrapper around pre-trained BERT. It now unifies all models under the AutoModel* classes with different capabilities, so we only have to know the model key and not worry about per-model APIs. The repo also contains very handy fine-tuning and inference scripts.
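The AutoModel* idea looks roughly like this: a minimal sketch, assuming a causal-LM head and the `"gpt2"` model key (any other key, e.g. an XLNet checkpoint, would load through the same two lines).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The Auto* classes resolve the right architecture from the model key,
# so the same code loads GPT-2, XLNet, etc. without model-specific imports.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Standard generation: tokenize, generate, decode.
inputs = tokenizer("Over the weekend, I", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```

Swapping in a different checkpoint is just a change of the key string; the Auto* machinery picks the matching tokenizer and architecture.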
Our team TOAD ranked #1 in Terminal Live @ UIUC, sponsored by Correlation One and Citadel. We will be sharing a cash prize of $12,000!
Our paper Phrase Grounding by Soft-Label Chain Conditional Random Field was accepted as a long paper at EMNLP-IJCNLP 2019! arXiv link
Recently I experimented with neural networks, replacing the matrix multiplication in a network's propagation with a convolution, using the FFT to speed up computation. This architecture allows training neural networks with larger layer sizes, provided we allow weights to be reused in a certain way. Preliminary experiments show 93% accuracy on the MNIST dataset.
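The core trick can be sketched as follows: a minimal illustration, assuming the layer computes a circular convolution (the function names and the exact weight-reuse scheme here are my own, not necessarily the ones used in the experiment). A dense layer `y = W @ x` costs O(n²) and stores n² weights; a circular convolution reuses one length-n weight vector (its circular shifts form the rows of an implicit circulant matrix) and costs O(n log n) via the FFT.

```python
import numpy as np

def fft_conv_layer(x, w):
    """Circular-convolution 'layer' via the convolution theorem:
    ifft(fft(x) * fft(w)) equals the circular convolution of x and w,
    computed in O(n log n) instead of O(n^2)."""
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(w)))

def direct_circular_conv(x, w):
    """Reference O(n^2) circular convolution, for checking correctness."""
    n = len(x)
    return np.array(
        [sum(w[k] * x[(i - k) % n] for k in range(n)) for i in range(n)]
    )

# Sanity check: the FFT path matches the direct computation.
rng = np.random.default_rng(0)
x = rng.normal(size=8)
w = rng.normal(size=8)
assert np.allclose(fft_conv_layer(x, w), direct_circular_conv(x, w))
```

In a full network one would still apply a nonlinearity after this layer and learn `w` by backpropagation; the point of the sketch is only the O(n log n) replacement for the dense matrix multiply.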
In virtual reality, when a 360° monocular video canvas surrounds virtual objects, a depth mismatch arises that creates artifacts. In this scenario, the monocular depth cues provided by the canvas override the binocular depth cues on the virtual object. In this paper, I propose an algorithm that geometrically transforms the virtual object to compensate for the mismatch. This allows natural fusion of virtual objects and 360° environments in virtual reality.