Farseer: A Refined Scaling Law in Large Language Models

Houyi Li1,2 Wenzhen Zheng1 Qiufeng Wang1 Zhenyu Ding3 Haoying Wang3 Zili Wang1
Shijie Xuyang1,2 Ning Ding3 Shuigeng Zhou2 Xiangyu Zhang1,4 Daxin Jiang1
1StepFun 2Fudan University 3Xi’an Jiaotong University 4Megvii Technology
img1
Figure 1. (a) Average relative error (BPC) vs.\ model size $N$ for \textit{Farseer} (red) and Chinchilla (blue). Chinchilla, lacking high-order cross terms, fits only near the central $N$, and its error diverges as model size grows. In contrast, \textbf{within the fitted range Chinchilla's error is 232\% higher than \textit{Farseer}'s, and \textit{Farseer}'s error remains stable across the full $N$ range.} (b) Chinchilla's rule of thumb ($D/N\approx20$) holds only at moderate budgets ($C\approx10^{20}\!-\!10^{21}$) and underestimates the data requirements of larger-scale regimes. In contrast, our analysis predicts a steadily increasing optimal $D/N$, consistent with the actual training configurations of recent large language models (e.g., Llama 3.1, Qwen3).

Abstract

We present Farseer, a novel and unified scaling law that offers unprecedented predictive accuracy for Large Language Models (LLMs). Our findings demonstrate remarkable extrapolation capabilities: compared with Chinchilla's law, whose prediction error is 433% higher, Farseer enables highly accurate forecasts of large-scale performance from small-scale experiments.

This research represents a significant computational investment, utilizing approximately 3 million NVIDIA H100 GPU hours to train a comprehensive suite of nearly 1,000 LLMs from scratch. By systematically constructing the model loss surface $L(N, D)$, Farseer provides a robust and generalizable framework that bridges the critical scaling gap between experimentation and production.

To support reproducibility and advance the field of LLM pre-training, we will progressively release our full dataset of loss measurements and checkpoints.

Step Law demonstrates that the optimal batch size $B(D)$ exhibits a primary dependence on dataset size $D$, while the optimal learning rate $\eta(N, D)$ manifests a joint dependence on both model parameters $N$ and dataset size $D$.

$$ \begin{aligned} \eta(N, D) & = 1.79N^{-0.713}D^{0.307} \\ B(D) & = 0.58D^{0.571} \end{aligned} $$
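As a quick illustration, the minimal sketch below evaluates the two fits above for a given model size and token budget. The constants come directly from the equations; the example $N$ and $D$ are arbitrary, and the batch size is assumed here to be expressed in tokens.

def step_law_lr(n_params: float, n_tokens: float) -> float:
    """Optimal peak learning rate eta(N, D) = 1.79 * N^-0.713 * D^0.307 (Step Law fit above)."""
    return 1.79 * n_params ** -0.713 * n_tokens ** 0.307

def step_law_batch_size(n_tokens: float) -> float:
    """Optimal batch size B(D) = 0.58 * D^0.571; assumed here to be measured in tokens."""
    return 0.58 * n_tokens ** 0.571

# Example: a 1B-parameter model trained on 100B tokens (arbitrary illustrative values).
N, D = 1e9, 100e9
print(f"eta(N, D) ~ {step_law_lr(N, D):.2e}")
print(f"B(D)      ~ {step_law_batch_size(D):.2e} tokens")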

Robustness and Data Generalization

Farseer's prediction error consistently decreases as the amount of fitting data increases, demonstrating strong robustness. It also exhibits excellent generalization across different data distributions.

img2
Figure 2. Robustness and data distribution generalizability of \textit{Farseer}. (a) Relative error on excluded 6.4B models as a function of the maximum model size included in the fitting data, assessing robustness to fitting data volume. (b) Relative error on excluded 3.2B models trained with an English-Chinese data recipe, demonstrating structural generalizability to different data mixes. Circle size and adjacent numbers indicate the number of model-size points used for fitting in each case.

Left (Figure 2a): Farseer's prediction error decreases significantly as the amount of fitting data increases. For instance, when the upper bound on model size in the fitting data is raised from 1.9B to 5.4B parameters, the relative prediction error for a 6.4B model drops from 0.6% to 0.058%, demonstrating robustness to the quantity of fitting data. Right (Figure 2b): On bilingual English-Chinese (EN-ZH) data, the prediction error for a 3.2B model converges to 0.076% as more fitting data is added, indicating that Farseer generalizes across data distributions and effectively captures the scaling behavior of cross-lingual data.

Extrapolation

Farseer's average relative error is just 0.50%, whereas the Chinchilla scaling law exhibits an average relative error of 2.68%, a 433% increase.

img3
Figure 3. Extrapolation of \textit{Farseer}. Blue circles represent the grid of $(N, D)$ points used for fitting. Red stars denote validation points beyond that distribution, including a 25.1B model, larger dataset sizes, and off-grid combinations. Annotated percentages give the relative errors of the extrapolated points.

Comparison with Chinchilla's Fitting Formula

In addition, our findings reveal that this scaling law not only applies to dense models but also generalizes effectively to MoE models across varying sparsity levels.

img4
Figure 4. Empirical BPC values (Ground Truth) are plotted alongside fits obtained from \textit{Farseer} and Chinchilla. \textit{Farseer} yields predictions that lie almost exactly on the ground truth curve, whereas Chinchilla's fit exhibits systematic under‐ and over‐estimations, particularly at small and large $D$.
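For concreteness, the sketch below shows how a Chinchilla-style parametric surface $L(N, D) = E + A/N^{\alpha} + B/D^{\beta}$ (the baseline compared against in Figure 4) can be fitted with a standard least-squares routine. The data grid and coefficients here are synthetic placeholders rather than our released measurements, and \textit{Farseer}'s refined form, which adds the cross terms this baseline lacks, is given in the paper.

# Sketch: fitting the Chinchilla-style parametric loss L(N, D) = E + A / N^alpha + B / D^beta.
# The (N, D) grid and loss values below are synthetic stand-ins for the released BPC grid,
# and A, B are fitted in log space for numerical stability.
import numpy as np
from scipy.optimize import curve_fit

def chinchilla_form(ND, E, log_A, log_B, alpha, beta):
    N, D = ND
    return E + np.exp(log_A) / N**alpha + np.exp(log_B) / D**beta

# Synthetic observations: a small (N, D) grid with losses generated from known coefficients.
rng = np.random.default_rng(0)
N_grid, D_grid = np.meshgrid([2e8, 6e8, 2e9, 6e9], [2e10, 6e10, 2e11, 6e11])
N_obs, D_obs = N_grid.ravel(), D_grid.ravel()
true_params = (0.6, np.log(400.0), np.log(1200.0), 0.34, 0.28)
L_obs = chinchilla_form((N_obs, D_obs), *true_params) + rng.normal(0.0, 1e-3, N_obs.size)

p0 = [0.5, np.log(100.0), np.log(100.0), 0.3, 0.3]  # rough initial guess
popt, _ = curve_fit(chinchilla_form, (N_obs, D_obs), L_obs, p0=p0, maxfev=20000)
E, log_A, log_B, alpha, beta = popt
print(f"E={E:.3f}, A={np.exp(log_A):.1f}, B={np.exp(log_B):.1f}, alpha={alpha:.3f}, beta={beta:.3f}")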

Comparison of Optimal Compute Allocation

Under the condition $C \approx 6ND$, Chinchilla recommends a fixed $D/N \approx 20$, which is only applicable at medium compute budgets ($10^{20}\!-\!10^{21}$ FLOPs). Farseer, however, predicts that the optimal $D/N$ ratio increases continuously with the compute budget, which is consistent with the training configurations of large models such as Llama 3.1 and Qwen3.
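The allocation itself reduces to a one-dimensional search: at a fixed budget $C$, sweep $N$, set $D = C/(6N)$, and take the $N$ that minimizes the fitted loss surface. The sketch below runs this procedure on a separable placeholder surface whose coefficients are chosen only so that it reproduces a roughly fixed $D/N \approx 20$; substituting a fitted surface with cross terms in $N$ and $D$, such as \textit{Farseer}'s, is what produces the rising optimal ratio described above.

# Sketch: compute-optimal (N, D) under the constraint C ~= 6 N D. Sweep candidate model
# sizes N at fixed compute C, set D = C / (6 N), evaluate a fitted loss surface, and take
# the argmin. The surface below is a separable placeholder (NOT Chinchilla's published fit),
# with coefficients chosen so that it yields an approximately constant D/N ~= 20.
import numpy as np

def placeholder_loss(N, D, E=1.7, A=100.0, B=447.0, alpha=0.5, beta=0.5):
    # Illustrative coefficients only.
    return E + A / N**alpha + B / D**beta

def optimal_allocation(C, loss_fn, n_points=4000):
    N = np.logspace(7, 13, n_points)   # candidate model sizes (parameters)
    D = C / (6.0 * N)                  # tokens implied by the compute constraint
    i = int(np.argmin(loss_fn(N, D)))
    return N[i], D[i]

for C in (1e20, 1e22, 1e24):
    N_opt, D_opt = optimal_allocation(C, placeholder_loss)
    print(f"C={C:.0e}: N*={N_opt:.2e}, D*={D_opt:.2e}, D/N~={D_opt / N_opt:.1f}")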

img5
Figure 5.

Point, Line, and Surface Comparison

img6
Figure 6. \textit{Farseer}'s normalized 3D surface of the relative BPC difference between datasets with 85\% and 50\% English proportions. The translucent pink plane marks $\Delta=0$: above it, the 50\%-English configuration outperforms; below it, the 85\%-English configuration outperforms. Green squares show small-scale experiments at individual $(N,D)$ points, and the yellow dashed curve connects several such points; conclusions drawn from these point/line comparisons do not hold at larger scales.

Regions above the plane indicate that the 50\% English mixture yields lower error, while regions below favor the 85\% mixture. Green squares denote individual point-comparison experiments at specific $(N, D)$ coordinates, and the yellow line connects several such points for a line comparison. Although these smaller-scale analyses can suggest that one mixture is better, the surface comparison reveals how those conclusions can reverse at larger scales. This demonstrates \textit{Farseer}'s power for low-cost, high-fidelity extrapolation across any two training recipes or model designs.
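The surface comparison can be reproduced mechanically once a loss surface has been fitted for each recipe: evaluate both surfaces over a shared $(N, D)$ grid, form the relative BPC difference, and examine where its sign changes. The sketch below uses two placeholder closed forms in place of the fitted \textit{Farseer} surfaces and an illustrative normalization.

# Sketch of a surface comparison between two training recipes. The two closed forms are
# placeholders standing in for fitted loss surfaces; a point or line comparison samples
# only a few cells of this grid, which is why its verdict can flip at larger scales.
import numpy as np

def loss_85_en(N, D):  # placeholder surface for the 85%-English recipe
    return 0.60 + 420.0 / N**0.33 + 900.0 / D**0.29

def loss_50_en(N, D):  # placeholder surface for the 50%-English recipe
    return 0.58 + 520.0 / N**0.33 + 820.0 / D**0.29

N = np.logspace(8, 11, 50)   # 0.1B .. 100B parameters
D = np.logspace(9, 13, 50)   # 1B .. 10T tokens
Ng, Dg = np.meshgrid(N, D)

# Delta > 0 means the 50%-English recipe reaches lower BPC (the region above the plane in Figure 6).
delta = (loss_85_en(Ng, Dg) - loss_50_en(Ng, Dg)) / loss_50_en(Ng, Dg)
print(f"50%-English recipe better on {100.0 * (delta > 0).mean():.1f}% of the (N, D) grid")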

Step Law Tools

Open Source Roadmap

Live Progress Tracking

Milestone | Release Status
Predictable Scale Part I: Optimal Hyperparameter Scaling Law |
Optimal Hyperparameter Tool |
Training Dynamics of 3700 Models |
Train Smoothed Loss of 3700 Models |
Predictable Scale Part II | 2025.5.1
Predictable Scale Part III | 2025.6.1

BibTeX

@misc{li2025predictablescalei,
  title    = {Predictable Scale: Part I -- Optimal Hyperparameter Scaling Law in Large Language Model Pretraining}, 
  author   = {Houyi Li and Wenzheng Zheng and Jingcheng Hu and Qiufeng Wang and Hanshan Zhang and Zili Wang and Shijie Xuyang and Yuantao Fan and Shuigeng Zhou and Xiangyu Zhang and Daxin Jiang},
  year     = {2025},
  eprint   = {2503.04715},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url      = {https://arxiv.org/abs/2503.04715}, 
}