Large Language Model (LLM) compression techniques aim to improve LLM efficiency by reducing computational cost and latency [4]. Evaluating these methods typically means measuring a compressed model against its baseline by comparing accuracy on predefined tasks. The main challenge is capturing changes in model behavior, such as "flips," where answers that were correct become incorrect and vice versa [4]. To address this, researchers propose adding distance metrics such as KL divergence and the percentage of flips (% flips) alongside traditional accuracy metrics, giving a more complete picture of how closely a compressed model mimics its baseline counterpart [4].
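As a rough sketch of how these two metrics can be computed, the snippet below counts answers whose correctness changes between the baseline and compressed model and averages per-example KL divergence over their output distributions. The answers and probability distributions are hypothetical placeholders, not values from the paper; in practice they would come from running both models on the same evaluation set.

```python
import numpy as np
from scipy.special import rel_entr

def percent_flips(baseline_answers, compressed_answers, gold):
    """Share of examples whose correctness changes between models
    (correct -> incorrect or incorrect -> correct)."""
    flips = sum(
        (b == g) != (c == g)
        for b, c, g in zip(baseline_answers, compressed_answers, gold)
    )
    return 100.0 * flips / len(gold)

def mean_kl_divergence(baseline_probs, compressed_probs, eps=1e-12):
    """Mean KL(baseline || compressed) over per-example distributions,
    e.g. softmax outputs over multiple-choice options."""
    p = np.clip(baseline_probs, eps, 1.0)
    q = np.clip(compressed_probs, eps, 1.0)
    return float(np.mean(np.sum(rel_entr(p, q), axis=-1)))

# Hypothetical answers on a 5-question multiple-choice task.
gold       = ["A", "C", "B", "D", "A"]
baseline   = ["A", "C", "B", "D", "C"]
compressed = ["A", "B", "B", "D", "C"]
print(percent_flips(baseline, compressed, gold))  # 20.0: one answer's correctness changed

# Hypothetical per-example probability distributions over 4 options.
p = np.array([[0.7, 0.1, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25]])
q = np.array([[0.6, 0.2, 0.1, 0.1], [0.40, 0.20, 0.20, 0.20]])
print(mean_kl_divergence(p, q))
```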
Quantization methods improve LLM efficiency by reducing the precision of weights and activations in the model, which in turn reduces memory requirements and computational costs. This allows for faster inference and deployment on devices with limited resources, without significant loss of accuracy.
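To make the precision reduction concrete, here is a minimal sketch of symmetric per-tensor int8 weight quantization, one of the simpler schemes behind this idea. Practical LLM quantization methods (per-channel scaling, GPTQ, AWQ, and the like) are more involved, and the weight tensor below is synthetic.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float weights to int8 using a single per-tensor scale."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights for computation."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("max abs error:", np.abs(w - w_hat).max())          # small but nonzero
print("bytes: float32 =", w.nbytes, ", int8 =", q.nbytes)  # 4x smaller storage
```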
Accuracy metrics in LLM evaluation have limitations because they overlook flips: a compressed model can match the baseline's accuracy while answering many individual questions differently, giving a misleading impression of its reliability. Accuracy alone also fails to capture qualitative differences in model behavior, particularly on generative tasks, where the nuances of language generation are critical.
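A toy illustration, with made-up counts, shows how equal accuracy can coexist with substantial flipping: if a compressed model newly fails as many questions as it newly solves, accuracy is unchanged even though a fifth of its answers differ from the baseline.

```python
# Hypothetical per-question correctness on a 100-question benchmark.
baseline_correct = [True] * 75 + [False] * 25            # baseline: 75% accurate
compressed_correct = (
    [True] * 65 + [False] * 10 +                         # 10 questions: correct -> incorrect
    [True] * 10 + [False] * 15                           # 10 questions: incorrect -> correct
)
acc_base = sum(baseline_correct) / len(baseline_correct)
acc_comp = sum(compressed_correct) / len(compressed_correct)
flips = sum(b != c for b, c in zip(baseline_correct, compressed_correct))
print(acc_base, acc_comp, flips)  # 0.75 0.75 20 -> same accuracy, 20% of answers flipped
```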