...
Automatic evaluation of LLMs is usually done by comparing generated model output with a desired reference result. For this purpose, many well-established metrics, such as exact matching or similarity metrics (BLEU, n-gram overlap, ROUGE, BERTScore), are used. However, especially when the output deviates from the reference answers, conventional similarity metrics are insufficient to measure the factuality of the generated output. Incorporating information from knowledge graphs (KGs) into the evaluation can help ensure an accurate measurement of the factual integrity and reliability of LLM outputs.
Explanation of concepts
- Represented Knowledge: KG triples can be used to evaluate how much knowledge an LLM has retained from the training process and how consistently this knowledge can be retrieved (see the sketch after this list).
- Factuality: KG triples can be used to evaluate the output of an LLM by extracting structured information from the output and comparing it with a KG to check factuality or knowledge coverage. Examples of such knowledge coverage include political positions, cultural or sporting events, or current news. Furthermore, the extracted triples can be used to evaluate tasks where a similarity comparison of the LLM output is undesirable, as is the case when identifying and evaluating hallucinations of LLMs.
- Biases: Finally, KGs can be used to enrich LLM inputs with relevant information, for example when in-context learning is used to supply task-specific knowledge to the model. In addition, targeted adversarial attacks built from KG data can be carried out on the LLM to uncover biases or weak points.
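The following is a minimal sketch of the represented-knowledge idea: KG triples are turned into cloze-style questions and the model's answers are checked against the withheld object. The function `query_llm`, the relation templates, and the example triples are illustrative assumptions, not part of any specific framework.

```python
# Minimal sketch: probing an LLM's represented knowledge with KG triples.
# `query_llm` is a hypothetical stand-in for whatever model or API is used.

def query_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with the actual model/API in use."""
    raise NotImplementedError

def triple_to_cloze(subject: str, relation: str) -> str:
    # Turn a (subject, relation, object) triple into a question that
    # withholds the object, so the model has to supply it itself.
    templates = {
        "capital_of": f"What is the capital of {subject}?",
        "author_of": f"Who wrote {subject}?",
    }
    return templates[relation]

def knowledge_accuracy(triples: list[tuple[str, str, str]]) -> float:
    """Fraction of triples whose object the model reproduces."""
    hits = 0
    for subject, relation, obj in triples:
        answer = query_llm(triple_to_cloze(subject, relation))
        # Lenient match: the gold object appears somewhere in the answer.
        if obj.lower() in answer.lower():
            hits += 1
    return hits / len(triples)

# Illustrative triples, e.g. taken from a KG such as Wikidata:
# knowledge_accuracy([("Germany", "capital_of", "Berlin"),
#                     ("Faust", "author_of", "Johann Wolfgang von Goethe")])
```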
Brief description of the state of the art
Knowledge Graphs (KGs) provide a structured and reliable basis for evaluating the knowledge encoded in LLMs. Relational triples from the KGs can be used to systematically test whether an LLM can accurately retrieve relevant information. Additionally, in cases where direct comparisons between reference text and LLM-generated output fall short in assessing factual accuracy, the output can be converted into a meaningful representation to measure alignment with the KG. Finally, the neutral and structured nature of KG data makes it a valuable tool for identifying and analyzing potential biases within LLMs.
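As a rough illustration of this alignment check, the sketch below extracts triples from generated text and measures how many of them are supported by the KG. Both `extract_triples` and the set-based KG representation are assumptions; in practice the extractor could be an OpenIE system or a prompted LLM, and the KG would be queried through its own interface.

```python
# Minimal sketch: checking the factuality of generated text against a KG.
# `extract_triples` is a hypothetical relation-extraction step; the KG is
# modeled here as a plain set of (subject, relation, object) triples.

def extract_triples(text: str) -> set[tuple[str, str, str]]:
    """Hypothetical triple extractor; replace with a real IE pipeline."""
    raise NotImplementedError

def factuality_score(generated_text: str,
                     kg_triples: set[tuple[str, str, str]]) -> float:
    """Fraction of extracted claims that are supported by the KG."""
    claims = extract_triples(generated_text)
    if not claims:
        return 0.0  # nothing verifiable could be extracted
    supported = sum(1 for claim in claims if claim in kg_triples)
    return supported / len(claims)
```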
It should be noted that the accuracy of an LLM evaluation is inherently tied to the quality of the KG it relies on. This is especially relevant for publicly editable KGs, which are susceptible to factual inaccuracies due to unreliable or unverified sources and can even be intentionally manipulated to spread misinformation. Therefore, considering the quality and reliability of the underlying KG is crucial when evaluating the LLM.
Answer 1: Using KGs to Evaluate LLM Represented Knowledge
...