This research focuses on optimization algorithms for training large language models (LLMs), which are essential for understanding and generating human language. Its aim is to reduce the high memory demand of these optimizers, making training more efficient and more accessible to researchers with limited resources.
The Adam optimizer has a high memory demand because it stores two optimizer states for every model parameter: the first-order and second-order momentum values. Together these states take up twice as much memory as the model itself, which creates a significant burden and makes training large models expensive and less accessible to researchers with limited resources.
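The arithmetic behind that figure can be sketched in a few lines of Python. This is only a rough illustration: the 7-billion-parameter count and the 4-byte (fp32) storage width are assumptions chosen for the example, not values from the research, and the function names are hypothetical.

```python
def weight_bytes(num_params: int, bytes_per_value: int = 4) -> int:
    """Memory needed to hold the model weights alone."""
    return num_params * bytes_per_value


def adam_state_bytes(num_params: int, bytes_per_value: int = 4) -> int:
    """Memory for Adam's two per-parameter buffers:
    one first-order and one second-order momentum value per weight."""
    return 2 * num_params * bytes_per_value


# Hypothetical model size, assumed for illustration only.
n_params = 7_000_000_000

weights = weight_bytes(n_params)
states = adam_state_bytes(n_params)

print(f"weights:     {weights / 1e9:.1f} GB")
print(f"Adam states: {states / 1e9:.1f} GB")
print(f"ratio:       {states / weights:.1f}x the model size")
```

The ratio stays the same at any model size and storage width, as long as the weights and both momentum buffers use the same precision.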
Adafactor is a stochastic optimization method based on Adam that reduces memory usage while retaining adaptivity. For matrix-valued variables, it keeps a factored representation of the squared-gradient accumulator across training steps: instead of storing a full accumulator, it tracks moving averages of the row sums and column sums of the squared gradients. For an n × m matrix, this reduces the memory needed for the accumulator from O(nm) to O(n+m), making Adafactor more memory-efficient than Adam.
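A minimal NumPy sketch of the factored accumulator may make the O(n+m) claim concrete. It covers only the row and column moving averages and the rank-1 reconstruction of the second-moment estimate. The fixed decay rate `beta2`, the small constant `eps`, and the function names are assumptions made for this illustration. The other parts of Adafactor, such as its decay schedule and update scaling, are left out.

```python
import numpy as np


def update_factored(row_acc, col_acc, grad, beta2=0.999, eps=1e-30):
    """One moving-average step on the row and column sums of grad**2.

    row_acc has shape (n,), col_acc has shape (m,), grad has shape (n, m).
    beta2 and eps are illustrative values, not tuned settings.
    """
    sq = grad ** 2 + eps
    row_acc = beta2 * row_acc + (1.0 - beta2) * sq.sum(axis=1)
    col_acc = beta2 * col_acc + (1.0 - beta2) * sq.sum(axis=0)
    return row_acc, col_acc


def reconstruct(row_acc, col_acc):
    """Rank-1 estimate of the full (n, m) accumulator from its factors."""
    return np.outer(row_acc, col_acc) / row_acc.sum()


n, m = 4, 6
rng = np.random.default_rng(0)
grad = rng.normal(size=(n, m))

# Only n + m values persist between steps, instead of n * m.
row_acc = np.zeros(n)
col_acc = np.zeros(m)
row_acc, col_acc = update_factored(row_acc, col_acc, grad)

v_hat = reconstruct(row_acc, col_acc)
print("stored values:", row_acc.size + col_acc.size, "vs full:", n * m)
print("estimate shape:", v_hat.shape)
```

The reconstruction is exact only when the squared-gradient matrix is itself rank 1. In general it is an approximation, and that approximation is the price of the memory saving.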