Farseer: A Refined Scaling Law in Large Language Models

Houyi Li1,2 Wenzhen Zheng1 Qiufeng Wang1 Zhenyu Ding3 Haoying Wang3 Zili Wang1
Shijie Xuyang1,2 Ning Ding3 Shuigeng Zhou2 Xiangyu Zhang1,4 Daxin Jiang1
1StepFun 2Fudan University 3Xi’an Jiaotong University 4Megvii Technology
img1
Figure 1. (a) Average relative error (BPC) vs.\ model size $N$ for \textit{Farseer} (red) and Chinchilla (blue). Chinchilla, lacking high-order cross terms, fits only near the central $N$, and its error diverges as model size grows. In contrast, \textbf{within the fitted range Chinchilla's error is 232\% higher than \textit{Farseer}'s, and \textit{Farseer}'s error remains stable across the full $N$ range.} (b) Chinchilla's rule of thumb ($D/N\approx20$) holds only at moderate budgets ($C\approx10^{20}\!-\!10^{21}$) and underestimates the data requirements of larger-scale regimes. In contrast, our analysis predicts a steadily increasing optimal $D/N$, consistent with the actual training configurations of recent large language models (e.g., Llama 3.1, Qwen3).

Abstract

We present Farseer, a novel and unified scaling law that offers unprecedented predictive accuracy for Large Language Models (LLMs). Our findings demonstrate remarkable extrapolation capabilities: compared with Chinchilla's law, whose prediction error is 433% higher, Farseer enables highly accurate forecasts of large-scale performance from small-scale experiments.

This research represents a significant computational investment, utilizing approximately 3 million NVIDIA H100 GPU hours to train a comprehensive suite of nearly 1,000 LLMs from scratch. By systematically constructing the model loss surface $L(N, D)$, Farseer provides a robust and generalizable framework that bridges the critical scaling gap between experimentation and production.

To support reproducibility and advance the field of LLM pre-training, we will progressively release our full dataset of loss measurements and checkpoints.

Step Law demonstrates that the optimal batch size $B(D)$ exhibits a primary dependence on dataset size $D$, while the optimal learning rate $\eta(N, D)$ manifests a joint dependence on both model parameters $N$ and dataset size $D$.

$$ \begin{aligned} \eta(N, D) & = 1.79N^{-0.713}D^{0.307} \\ B(D) & = 0.58D^{0.571} \end{aligned} $$
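As a quick illustration, the minimal sketch below evaluates the two fits above for a given model size and token budget. The constants come directly from the equations; the example $N$ and $D$ are arbitrary, and the batch size is assumed here to be expressed in tokens.

def step_law_lr(n_params: float, n_tokens: float) -> float:
    """Optimal peak learning rate eta(N, D) = 1.79 * N^-0.713 * D^0.307 (Step Law fit above)."""
    return 1.79 * n_params ** -0.713 * n_tokens ** 0.307

def step_law_batch_size(n_tokens: float) -> float:
    """Optimal batch size B(D) = 0.58 * D^0.571; assumed here to be measured in tokens."""
    return 0.58 * n_tokens ** 0.571

# Example: a 1B-parameter model trained on 100B tokens (arbitrary illustrative values).
N, D = 1e9, 100e9
print(f"eta(N, D) ~ {step_law_lr(N, D):.2e}")
print(f"B(D)      ~ {step_law_batch_size(D):.2e} tokens")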

Robustness and Data Generalization

Farseer's prediction error consistently decreases as the amount of fitting data increases, demonstrating strong robustness. It also exhibits excellent generalization across different data distributions.

img2
Figure 2. Robustness and data distribution generalizability of \textit{Farseer}. (a) Relative error on excluded 6.4B models as a function of the maximum model size included in the fitting data, assessing robustness to fitting data volume. (b) Relative error on excluded 3.2B models trained with an English-Chinese data recipe, demonstrating structural generalizability to different data mixes. Circle size and adjacent numbers indicate the number of model-size points used for fitting in each case.

Left (Figure 2a): Farseer's prediction error decreases significantly as the amount of fitting data increases. For instance, when the upper bound on model size in the fitting data is raised from 1.9B to 5.4B parameters, the relative prediction error for a 6.4B model drops from 0.6% to 0.058%, demonstrating robustness to the quantity of fitting data. Right (Figure 2b): On bilingual English-Chinese (EN-ZH) data, the prediction error for a 3.2B model converges to 0.076% as more fitting data is added, indicating that Farseer generalizes across data distributions and effectively captures the scaling behavior of cross-lingual data.

Extrapolation

Farseer's average relative error is just 0.50%, whereas the Chinchilla scaling law exhibits an average relative error of 2.68%, a 433% increase.

img3
Figure 3. Extrapolation of \textit{Farseer}. Blue circles represent the grid of $(N, D)$ points used for fitting. Red stars denote validation points beyond that distribution, including a 25.1B model, larger dataset sizes, and off-grid combinations. Annotated percentages give the relative errors of the extrapolated points.

Comparison with Chinchilla's Fitting Formula

In addition, our findings reveal that this scaling law not only applies to dense models but also generalizes effectively to MoE models across varying sparsity levels.

img4
Figure 4. Empirical BPC values (Ground Truth) are plotted alongside fits obtained from \textit{Farseer} and Chinchilla. \textit{Farseer} yields predictions that lie almost exactly on the ground truth curve, whereas Chinchilla's fit exhibits systematic under‐ and over‐estimations, particularly at small and large $D$.
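For concreteness, the sketch below shows how a Chinchilla-style parametric surface $L(N, D) = E + A/N^{\alpha} + B/D^{\beta}$ (the baseline compared against in Figure 4) can be fitted with a standard least-squares routine. The data grid and coefficients here are synthetic placeholders rather than our released measurements, and \textit{Farseer}'s refined form, which adds the cross terms this baseline lacks, is given in the paper.

# Sketch: fitting the Chinchilla-style parametric loss L(N, D) = E + A / N^alpha + B / D^beta.
# The (N, D) grid and loss values below are synthetic stand-ins for the released BPC grid,
# and A, B are fitted in log space for numerical stability.
import numpy as np
from scipy.optimize import curve_fit

def chinchilla_form(ND, E, log_A, log_B, alpha, beta):
    N, D = ND
    return E + np.exp(log_A) / N**alpha + np.exp(log_B) / D**beta

# Synthetic observations: a small (N, D) grid with losses generated from known coefficients.
rng = np.random.default_rng(0)
N_grid, D_grid = np.meshgrid([2e8, 6e8, 2e9, 6e9], [2e10, 6e10, 2e11, 6e11])
N_obs, D_obs = N_grid.ravel(), D_grid.ravel()
true_params = (0.6, np.log(400.0), np.log(1200.0), 0.34, 0.28)
L_obs = chinchilla_form((N_obs, D_obs), *true_params) + rng.normal(0.0, 1e-3, N_obs.size)

p0 = [0.5, np.log(100.0), np.log(100.0), 0.3, 0.3]  # rough initial guess
popt, _ = curve_fit(chinchilla_form, (N_obs, D_obs), L_obs, p0=p0, maxfev=20000)
E, log_A, log_B, alpha, beta = popt
print(f"E={E:.3f}, A={np.exp(log_A):.1f}, B={np.exp(log_B):.1f}, alpha={alpha:.3f}, beta={beta:.3f}")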

Comparison of Optimal Compute Allocation

Under the condition $C \approx 6ND$, Chinchilla recommends a fixed $D/N \approx 20$, which is only applicable at medium compute budgets ($10^{20}\!-\!10^{21}$ FLOPs). Farseer, however, predicts that the optimal $D/N$ ratio increases continuously with the compute budget, which is consistent with the training configurations of large models such as Llama 3.1 and Qwen3.
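The allocation itself reduces to a one-dimensional search: at a fixed budget $C$, sweep $N$, set $D = C/(6N)$, and take the $N$ that minimizes the fitted loss surface. The sketch below runs this procedure on a separable placeholder surface whose coefficients are chosen only so that it reproduces a roughly fixed $D/N \approx 20$; substituting a fitted surface with cross terms in $N$ and $D$, such as \textit{Farseer}'s, is what produces the rising optimal ratio described above.

# Sketch: compute-optimal (N, D) under the constraint C ~= 6 N D. Sweep candidate model
# sizes N at fixed compute C, set D = C / (6 N), evaluate a fitted loss surface, and take
# the argmin. The surface below is a separable placeholder (NOT Chinchilla's published fit),
# with coefficients chosen so that it yields an approximately constant D/N ~= 20.
import numpy as np

def placeholder_loss(N, D, E=1.7, A=100.0, B=447.0, alpha=0.5, beta=0.5):
    # Illustrative coefficients only.
    return E + A / N**alpha + B / D**beta

def optimal_allocation(C, loss_fn, n_points=4000):
    N = np.logspace(7, 13, n_points)   # candidate model sizes (parameters)
    D = C / (6.0 * N)                  # tokens implied by the compute constraint
    i = int(np.argmin(loss_fn(N, D)))
    return N[i], D[i]

for C in (1e20, 1e22, 1e24):
    N_opt, D_opt = optimal_allocation(C, placeholder_loss)
    print(f"C={C:.0e}: N*={N_opt:.2e}, D*={D_opt:.2e}, D/N~={D_opt / N_opt:.1f}")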

img5
Figure 5.

Point, Line, and Surface Comparison

img6
Figure 6. \textit{Farseer}'s normalized 3D surface of the relative BPC difference between datasets with 85\% and 50\% English proportions. The translucent pink plane marks $\Delta=0$: above it, the 50\%-English configuration outperforms; below it, the 85\%-English configuration outperforms. Green squares show small-scale experiments at individual $(N,D)$ points, and the yellow dashed curve connects several such points; conclusions drawn from these point/line comparisons do not hold at larger scales.

Regions above the plane indicate that the 50\% English mixture yields lower error, while regions below favor the 85\% mixture. Green squares denote individual point-comparison experiments at specific $(N, D)$ coordinates, and the yellow line connects several such points for a line comparison. Although these smaller-scale analyses can suggest that one mixture is better, the surface comparison reveals how those conclusions can reverse at larger scales. This demonstrates \textit{Farseer}'s power for low-cost, high-fidelity extrapolation across any two training recipes or model designs.
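The surface comparison can be reproduced mechanically once a loss surface has been fitted for each recipe: evaluate both surfaces over a shared $(N, D)$ grid, form the relative BPC difference, and examine where its sign changes. The sketch below uses two placeholder closed forms in place of the fitted \textit{Farseer} surfaces and an illustrative normalization.

# Sketch of a surface comparison between two training recipes. The two closed forms are
# placeholders standing in for fitted loss surfaces; a point or line comparison samples
# only a few cells of this grid, which is why its verdict can flip at larger scales.
import numpy as np

def loss_85_en(N, D):  # placeholder surface for the 85%-English recipe
    return 0.60 + 420.0 / N**0.33 + 900.0 / D**0.29

def loss_50_en(N, D):  # placeholder surface for the 50%-English recipe
    return 0.58 + 520.0 / N**0.33 + 820.0 / D**0.29

N = np.logspace(8, 11, 50)   # 0.1B .. 100B parameters
D = np.logspace(9, 13, 50)   # 1B .. 10T tokens
Ng, Dg = np.meshgrid(N, D)

# Delta > 0 means the 50%-English recipe reaches lower BPC (the region above the plane in Figure 6).
delta = (loss_85_en(Ng, Dg) - loss_50_en(Ng, Dg)) / loss_50_en(Ng, Dg)
print(f"50%-English recipe better on {100.0 * (delta > 0).mean():.1f}% of the (N, D) grid")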

Step Law Tools

Open Source Roadmap

Live Progress Tracking

Milestone | Release Status
Predictable Scale Part I: Optimal Hyperparameter Scaling Law |
Optimal Hyperparameter Tool |
Training Dynamics of 3700 Models |
Train Smoothed Loss of 3700 Models |
Predictable Scale Part II | 2025.5.1
Predictable Scale Part III | 2025.6.1

BibTeX

@misc{li2025predictablescalei,
  title    = {Predictable Scale: Part I -- Optimal Hyperparameter Scaling Law in Large Language Model Pretraining}, 
  author   = {Houyi Li and Wenzheng Zheng and Jingcheng Hu and Qiufeng Wang and Hanshan Zhang and Zili Wang and Shijie Xuyang and Yuantao Fan and Shuigeng Zhou and Xiangyu Zhang and Daxin Jiang},
  year     = {2025},
  eprint   = {2503.04715},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url      = {https://arxiv.org/abs/2503.04715}, 
}