March 18, 2025
[Paper-club sessions] Rho-1: Not All Tokens Are What You Need
Brief summary of our paper-club session discussing the paper "Rho-1: Not All Tokens Are What You Need"
Large language models (LLMs) have traditionally been trained with a one-size-fits-all approach: predicting every next token in a corpus using the same loss function. But what if not all tokens contribute equally to learning? In the paper "RHO-1: Not All Tokens Are What You Need", researchers from Xiamen University, Tsinghua University, Shanghai AI Laboratory, and Microsoft challenge this norm by introducing a new training paradigm called Selective Language Modeling (SLM), which focuses the model's learning on the tokens that matter most, as depicted in Figure 1. Notably, this work was recognized as the best paper runner-up at NeurIPS 2024, underscoring its impact on the field.

Figure 1 (taken from RHO-1 paper): Comparison of Causal Language Modeling (CLM) and Selective Language Modeling (SLM). Even with rigorous document-level filtering, high-quality datasets still contain noisy tokens that can hinder training. In contrast to the standard CLM approach, which computes loss uniformly for every token, SLM processes the entire sequence and selectively discards the loss from undesired tokens, leading to more efficient learning.
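To make the difference concrete, here is a minimal PyTorch-style sketch of the two loss computations (our own illustration, not code from the paper). It assumes logits of shape (batch, seq_len, vocab), integer targets, and a boolean keep_mask produced by the token-selection step described further below.

```python
import torch
import torch.nn.functional as F

def clm_loss(logits, targets):
    # Causal Language Modeling: cross-entropy averaged over every token.
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

def slm_loss(logits, targets, keep_mask):
    # Selective Language Modeling: compute the per-token loss, then average
    # only over the tokens the selection step decided to keep.
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1), reduction="none"
    )
    keep = keep_mask.reshape(-1).float()
    return (per_token * keep).sum() / keep.sum().clamp(min=1.0)
```
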
Conventional pretraining methods apply the next-token prediction loss uniformly across all tokens, but the RHO-1 team demonstrates that this approach is inefficient. By analyzing token-level training dynamics, the authors uncovered that tokens in a corpus exhibit markedly different loss patterns—some tokens are “easy” and quickly learned, while others remain persistently challenging or even fluctuate throughout training, as shown in Figure 2. This insight suggests that treating every token equally wastes valuable compute on parts of the data that offer little new information.

Figure 2 (adapted from RHO-1 paper): Token-level loss dynamics during pretraining. This figure evaluates the loss for individual tokens every 1B training tokens using a 320K-token validation set. It categorizes tokens into four groups: 26% exhibit significant loss reduction (H→L), 51% remain consistently low (L→L), 11% are persistently challenging (H→H), and 12% show an unexpected loss increase (L→H). These results highlight that individual token losses do not decrease uniformly, revealing complex training dynamics that motivate the selective focus of SLM.
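As a rough illustration of this bucketing (our own simplification; the paper tracks full per-token loss trajectories across checkpoints rather than a single threshold), one could classify tokens by whether their loss is high (H) or low (L) at an early versus a late checkpoint:

```python
import numpy as np

def categorize_tokens(loss_early, loss_late, threshold=1.0):
    # loss_early / loss_late: arrays of per-token losses measured at an early
    # and a late checkpoint; `threshold` (illustrative) splits high from low.
    early_high = loss_early > threshold
    late_high = loss_late > threshold
    return {
        "H->L": np.mean(early_high & ~late_high),   # learned during training
        "L->L": np.mean(~early_high & ~late_high),  # easy from the start
        "H->H": np.mean(early_high & late_high),    # persistently hard
        "L->H": np.mean(~early_high & late_high),   # loss rises unexpectedly
    }
```
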
At the heart of RHO-1 is the idea that only a subset of tokens, namely those exhibiting high "excess loss" relative to a high-quality reference model, should drive the learning process. The approach begins by training a reference model on a carefully curated corpus so that it captures the desired token distribution. Each token in the larger pretraining dataset is then scored by comparing its loss under the current model with its loss under the reference model; this "excess loss" indicates how much the model still stands to gain from that token. Rather than computing the loss over the entire corpus, SLM applies the loss only to tokens with high excess loss, filtering out noisy tokens and concentrating on the most informative parts of the data.
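Continuing the sketch above, the keep mask could be derived from excess loss roughly as follows, assuming loss_current and loss_reference are per-token loss tensors of the same shape; keep_ratio is an illustrative value, as the paper treats the fraction of tokens kept as a hyperparameter.

```python
import torch

@torch.no_grad()
def select_tokens(loss_current, loss_reference, keep_ratio=0.6):
    # Excess loss: per-token loss under the training model minus the loss
    # under the high-quality reference model. Keep the top keep_ratio
    # fraction of tokens in the batch and drop the rest from the loss.
    excess = loss_current - loss_reference
    k = max(1, int(keep_ratio * excess.numel()))
    cutoff = excess.reshape(-1).topk(k).values.min()
    return excess >= cutoff  # boolean keep_mask for slm_loss above
```
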
This selective approach not only improves data efficiency but also enhances overall model performance. Experiments on the OpenWebMath corpus showed that RHO-1 achieves an absolute few-shot accuracy boost of up to 30% on nine math tasks. Moreover, after fine-tuning, the 1B and 7B parameter models reached state-of-the-art results on the MATH dataset—matching or even exceeding the performance of models trained on far larger amounts of data. When extended to a diverse 80B-token corpus from the general domain, RHO-1 yielded an average improvement of 6.8% across 15 benchmarks, with especially notable gains in tasks involving code and math. In addition, by focusing on high-impact tokens, SLM accelerates learning, reaching comparable accuracy levels 5–10× faster than traditional methods.
Remarks
RHO-1 represents a promising new direction for LLM pretraining by challenging the assumption that “all tokens are created equal.” Instead of squandering compute on already-learned or irrelevant tokens, SLM selectively focuses on those that truly boost the model’s capabilities. Although the results in both math and general domains are impressive, this work also opens up several exciting avenues:
• Broader Applications: Could selective training be extended to multimodal tasks or even reinforcement learning scenarios?
• Reference Model Choices: What is the impact of using different reference models, and could proprietary APIs be harnessed for more robust token scoring?
• Dynamic Adaptation: How might the token selection ratio be adapted dynamically during training to further optimize performance?
While RHO-1 shows significant improvements on math and general-domain tasks with relatively small models and datasets, its scalability to very large models and corpora remains to be demonstrated. Moreover, relying on a reference model for token scoring introduces an additional dependency that may limit its applicability in certain settings.
Even so, RHO-1 not only achieves state-of-the-art results with fewer training tokens but also underscores the importance of rethinking how we treat data in LLM pretraining. By concentrating on what truly matters, we may be able to build more efficient and capable models in the future.
References
[1] Lin, Zhenghao, et al. "Rho-1: Not all tokens are what you need." arXiv preprint arXiv:2404.07965 (2024).