The four key areas of LLM performance evaluated are factuality, toxicity, bias, and propensity for hallucinations. Factuality assesses whether the model provides accurate, verifiable information; toxicity measures how reliably it avoids producing offensive content; bias evaluates the presence of religious, political, gender, or racial prejudice; and propensity for hallucinations captures how often the model generates factually incorrect or nonsensical output.
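As a rough illustration of how such an evaluation can be organized, the sketch below groups prompts by these four categories and averages a per-category score. The `generate_fn` model call and the scoring functions are assumed to be supplied by the evaluator; nothing here is a specific benchmark's API.

```python
# Minimal evaluation-harness sketch over the four categories.
# The model call and the scorers are caller-supplied placeholders.
from statistics import mean
from typing import Callable

Scorer = Callable[[str, str], float]  # (prompt, response) -> score in [0, 1]

def evaluate(
    prompt_sets: dict[str, list[str]],   # category -> list of prompts
    generate_fn: Callable[[str], str],   # model under evaluation
    scorers: dict[str, Scorer],          # category -> scoring function
) -> dict[str, float]:
    """Return the mean score per category: factuality, toxicity, bias, hallucination."""
    results = {}
    for category, prompts in prompt_sets.items():
        scorer = scorers[category]
        results[category] = mean(scorer(p, generate_fn(p)) for p in prompts)
    return results
```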
Llama2 demonstrated strong performance in factuality tests, excelling at tasks that required grounding answers in verifiable facts [5]. The model was evaluated using a mix of summarization tasks and factual consistency checks, such as Correctness of Generated Summaries and Factual Consistency of Abstractive Summaries [5].
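One common way to approximate the factual consistency of an abstractive summary is to treat the source document as a premise and the summary as a hypothesis, then score entailment with an NLI model. The sketch below uses the Hugging Face transformers library; the choice of `roberta-large-mnli` and its label ordering are assumptions to verify against the model card, not part of the original evaluation.

```python
# Sketch: NLI-based factual consistency score for a generated summary.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"  # assumed NLI model; labels: contradiction, neutral, entailment
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

def consistency_score(source: str, summary: str) -> float:
    """Probability that the source document entails the generated summary."""
    inputs = tokenizer(source, summary, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]
    return probs[2].item()  # index 2 = entailment for this model
```

A summary whose entailment probability falls below a chosen threshold can then be flagged as factually inconsistent with its source.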
Toxicity was assessed using adversarial prompts designed to elicit potentially toxic responses, measuring each model's ability to avoid producing offensive or inappropriate content. Llama2 handled toxic content robustly, properly censoring inappropriate language when instructed; however, it still needs improvement in maintaining safety across multi-turn conversations.
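A toxicity pass can be sketched as scoring every model reply in a multi-turn exchange, since safety tends to degrade in later turns. The example below assumes the open-source Detoxify classifier (`pip install detoxify`) as the scorer and a caller-supplied `model_reply_fn`; both the conversation format and any flagging threshold are illustrative choices, not the evaluation described above.

```python
# Sketch: per-turn toxicity scoring of model replies in a conversation.
from detoxify import Detoxify

detector = Detoxify("original")  # assumed off-the-shelf toxicity classifier

def toxicity_scores(user_turns: list[str], model_reply_fn) -> list[float]:
    """Score each model reply in a multi-turn exchange (higher = more toxic)."""
    scores, history = [], ""
    for user_turn in user_turns:
        history += f"\nUser: {user_turn}"
        reply = model_reply_fn(history)   # model under evaluation
        history += f"\nAssistant: {reply}"
        scores.append(detector.predict(reply)["toxicity"])
    return scores
```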