LLM Scaling Laws: A Practical Guide to Model Sizing
Use scaling laws (D≈20N), compute budgets, and experiments to find the optimal LLM size, balancing training, inference cost, and task performance.
Context: The Optimization Problem
Finding the optimal LLM size is not about finding a single magic number. It's an optimization problem with different potential goals:
- Training-Optimal: Achieve the lowest possible validation loss (L) for a fixed training compute budget (C_train). This is the focus of classic scaling laws.
- Inference-Optimal: Achieve a target quality (L_target) with the minimum total cost, factoring in both training and future inference costs. If inference volume is high, this favors smaller, more thoroughly trained models.
- Task-Optimal: Maximize performance on a specific downstream task (e.g., code generation, summarization), where validation loss is only a proxy metric.
Your choice of optimization target fundamentally changes the definition of "optimal."
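The tension between the training-optimal and inference-optimal targets can be sketched with a rough FLOP count. The snippet below assumes the commonly used approximations of about 6·N·D FLOPs for training and about 2·N FLOPs per token at inference; neither figure comes from this guide, and both ignore attention cost and hardware utilization. The two model configurations are hypothetical, and the code simply assumes they reach the same target quality, which would have to be verified experimentally.

```python
# Rough cost model (assumption): training ~ 6*N*D FLOPs, inference ~ 2*N FLOPs per token.

def training_flops(n_params: float, train_tokens: float) -> float:
    """Approximate training compute for a dense Transformer."""
    return 6.0 * n_params * train_tokens


def inference_flops(n_params: float, inference_tokens: float) -> float:
    """Approximate compute to serve `inference_tokens` tokens (forward passes only)."""
    return 2.0 * n_params * inference_tokens


def total_flops(n_params: float, train_tokens: float, inference_tokens: float) -> float:
    """Lifetime compute: training plus all expected inference."""
    return training_flops(n_params, train_tokens) + inference_flops(n_params, inference_tokens)


def break_even_inference_tokens(
    n_large: float, d_large: float, n_small: float, d_small: float
) -> float:
    """Inference volume above which the smaller, longer-trained model costs less overall.

    Assumes both models reach the same target quality (not checked here).
    """
    per_token_saving = 2.0 * (n_large - n_small)
    if per_token_saving <= 0:
        raise ValueError("n_small must be smaller than n_large")
    extra_training = training_flops(n_small, d_small) - training_flops(n_large, d_large)
    # If the small model is also cheaper to train, it wins at any inference volume.
    return max(extra_training / per_token_saving, 0.0)


# Hypothetical configurations, chosen only for illustration.
LARGE = (70e9, 1.4e12)   # 70B parameters, 1.4T training tokens
SMALL = (30e9, 4.0e12)   # 30B parameters, 4T training tokens

threshold = break_even_inference_tokens(*LARGE, *SMALL)
print(f"Break-even inference volume: {threshold:.3g} tokens")
```

Under these assumed numbers, the smaller model costs more to train but becomes cheaper overall once lifetime inference exceeds roughly 1.65 trillion tokens. The point of the sketch is the shape of the trade-off, not the specific threshold.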
Key Heuristic: The Chinchilla Rule
A widely adopted starting point for compute-optimal training of dense decoder-only Transformers is the Chinchilla scaling law. It suggests that, as the compute budget grows, model size (parameters, N) and training data size (tokens, D) should be scaled up in roughly equal proportion.
The Rule of Thumb:
D ≈ 20 * N
This means for every 1 parameter in your model, you should aim to train on approximately 20 tokens. The practical range is often cited as D ≈ 20N to D ≈ 30N.
| Effective Training Tokens (D) | Chinchilla Starting Point (N) |
| ----------------------------- | ----------------------------- |
| 300B tokens | ~10B - 15B parameters |
| 1T tokens | ~33B - 50B parameters |
| 2T tokens | ~67B - 100B parameters |
**Crucially, D refers to effective tokens after deduplication and quality filtering, not raw token count.** High data duplication can severely harm performance and waste model capacity on memorization.
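The rule of thumb and the table can be reproduced with a few lines of arithmetic. The sketch below is a minimal illustration, not a definitive sizing tool: the function names are made up for this example, and the conversion from a compute budget assumes the common approximation C_train ≈ 6·N·D FLOPs, which is not stated in this guide.

```python
import math

TOKENS_PER_PARAM_LOW = 20.0   # D ≈ 20N
TOKENS_PER_PARAM_HIGH = 30.0  # D ≈ 30N
FLOPS_PER_PARAM_TOKEN = 6.0   # assumption: C_train ≈ 6 * N * D


def params_range_for_tokens(effective_tokens: float) -> tuple[float, float]:
    """Return (low, high) parameter counts for a given effective token budget.

    `effective_tokens` should be the count after deduplication and filtering.
    """
    low = effective_tokens / TOKENS_PER_PARAM_HIGH
    high = effective_tokens / TOKENS_PER_PARAM_LOW
    return low, high


def compute_optimal_split(
    train_flops: float, tokens_per_param: float = TOKENS_PER_PARAM_LOW
) -> tuple[float, float]:
    """Split a training compute budget into (N, D) with D = tokens_per_param * N.

    Solves C = 6 * N * D for N under that constraint: N = sqrt(C / (6 * ratio)).
    """
    n_params = math.sqrt(train_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    return n_params, tokens_per_param * n_params


for tokens in (300e9, 1e12, 2e12):
    low, high = params_range_for_tokens(tokens)
    print(f"{tokens:.3g} tokens -> {low / 1e9:.0f}B to {high / 1e9:.0f}B parameters")

n, d = compute_optimal_split(5.88e23)
print(f"5.88e23 FLOPs -> N = {n:.3g} parameters, D = {d:.3g} tokens")
```

With a budget of 5.88e23 FLOPs and the 20:1 ratio, this yields about 70B parameters and 1.4T tokens. Treat the result as a starting point to refine with experiments, since the ratio and the FLOP approximation are both heuristics.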