...
Automatic evaluation of LLMs usually compares the generated model output with a desired result. Well-established metrics for this include direct matching and similarity metrics such as BLEU, ROUGE, other n-gram overlap measures, and BERTScore. However, especially when the output deviates in wording from the reference answers, these conventional similarity metrics are insufficient to measure the factuality of the generated output. Incorporating information from knowledge graphs (KGs) into the evaluation can help measure the factual integrity and reliability of LLM outputs more accurately.
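The gap between surface similarity and factuality can be illustrated with a minimal sketch. It uses a simple unigram-overlap F1 as a stand-in for n-gram metrics (it is not BLEU or ROUGE themselves); the function names and example sentences are assumptions chosen for illustration.

```python
import re
from collections import Counter


def tokens(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def unigram_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 between a candidate and a reference sentence."""
    cand, ref = Counter(tokens(candidate)), Counter(tokens(reference))
    overlap = sum((cand & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "Marie Curie was born in Warsaw in 1867"
wrong_but_similar = "Marie Curie was born in Paris in 1867"     # factually wrong
correct_paraphrase = "Warsaw is the birthplace of Marie Curie"  # factually right

score_wrong = unigram_f1(wrong_but_similar, reference)
score_paraphrase = unigram_f1(correct_paraphrase, reference)
print(f"wrong but similar: {score_wrong:.3f}")
print(f"correct paraphrase: {score_paraphrase:.3f}")
```

In this toy case the factually wrong sentence scores 0.875, while the correct paraphrase scores only 0.4, which is the failure mode that motivates a KG-based check.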
Explanation of concepts
- Represented Knowledge: KG triples can be used to evaluate how much knowledge an LLM can leverage from the training process and how consistently this knowledge can be retrieved.
- Factuality: KG triples can be used to evaluate the output of an LLM by extracting information from the output and comparing it with a KG to check factuality or knowledge coverage. Examples of such knowledge are political positions, cultural or sporting events, and current news. Furthermore, the extracted triples can be used to evaluate tasks or features for which a similarity comparison of the LLM output is undesirable, such as identifying and evaluating hallucinations of LLMs.
- Biases: KGs can be used to enhance LLM inputs with relevant information. This is beneficial, for example, when in-context learning is used to provide the LLM with information relevant to a specific task. In addition, targeted adversarial attacks can be carried out on the LLM to uncover biases or weak points.
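The factuality check described in the list above can be sketched as a comparison of extracted triples against a KG. This is a minimal illustration, not a definitive implementation: it assumes the triples have already been extracted from the LLM output by some upstream component, represents the KG as a plain set of tuples, and treats a mismatch as a contradiction only for relations declared functional (at most one object per subject). All names, relation labels, and verdict labels are hypothetical.

```python
from collections import defaultdict

Triple = tuple[str, str, str]  # (subject, predicate, object)


def check_triples(
    extracted: list[Triple],
    kg: set[Triple],
    functional_relations: set[str],
) -> dict[Triple, str]:
    """Label each extracted triple as supported, contradicted, or unverifiable."""
    # Index the KG by (subject, predicate) for direct object lookup.
    index: dict[tuple[str, str], set[str]] = defaultdict(set)
    for s, p, o in kg:
        index[(s, p)].add(o)

    verdicts: dict[Triple, str] = {}
    for s, p, o in extracted:
        known_objects = index.get((s, p), set())
        if o in known_objects:
            verdicts[(s, p, o)] = "supported"
        elif known_objects and p in functional_relations:
            # The KG holds a different value for a single-valued relation.
            verdicts[(s, p, o)] = "contradicted"
        else:
            # Missing from the KG; the KG may simply be incomplete.
            verdicts[(s, p, o)] = "unverifiable"
    return verdicts


# Toy KG and toy extraction result (assumed, for illustration only).
kg = {
    ("Marie Curie", "place_of_birth", "Warsaw"),
    ("Marie Curie", "award_received", "Nobel Prize in Physics"),
    ("Marie Curie", "award_received", "Nobel Prize in Chemistry"),
}
extracted = [
    ("Marie Curie", "place_of_birth", "Paris"),
    ("Marie Curie", "award_received", "Nobel Prize in Physics"),
    ("Marie Curie", "award_received", "Fields Medal"),
    ("Marie Curie", "spouse", "Pierre Curie"),
]

verdicts = check_triples(extracted, kg, functional_relations={"place_of_birth"})
supported_share = sum(v == "supported" for v in verdicts.values()) / len(verdicts)
```

Separating "contradicted" from "unverifiable" is a design choice: a triple that is absent from the KG is not necessarily a hallucination, because the KG itself may be incomplete.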
Brief description of the state of the art
Knowledge Graphs (KGs) provide a structured and reliable basis for evaluating the knowledge encoded in LLMs. Relational triples from the KGs can be used to systematically test whether an LLM can accurately retrieve relevant information. Additionally, in cases where direct comparisons between reference text and LLM-generated output fall short in assessing factual accuracy, the output can be converted into a meaningful representation to measure alignment with the KG. Finally, the neutral and structured nature of KG data makes it a valuable tool for identifying and analyzing potential biases within LLMs.
It should be noted that the accuracy of an LM's evaluation is inherently tied to the quality of the KG it relies on. This is especially relevant for publicly editable KGs, which are susceptible to factual inaccuracies due to unreliable or unverified sources and can even be intentionally manipulated to disseminate misinformation. Therefore, considering the quality and reliability of the underlying KG is crucial when evaluating the LM.
Answer 1: Using KGs to Evaluate LLM Represented Knowledge
...
- While KGs are inherently structured to maintain consistency in factual representation, LMs do not always yield consistent answers, especially when queries are rephrased [2, 3]. Integrating KGs for evaluation set generation can address this by allowing multiple phrasings of a single query, all linked to the same answer and relational triple. This approach helps measure an LM’s robustness in recognizing equivalent rewordings of the same fact [1, 2, 3].
- When using a question-answering format to evaluate text-generating (autoregressive) LMs (for the application, see Chapter Selected Applications / Question Answering), the free-form answer of the model needs to be compared to the reference answer [1]. While there are multiple ways of comparing the answer to the reference [1], no single approach is ideal. Multiple-choice-based approaches avoid this problem entirely [1, 2, 6], but they inherently have a limited answer space and may encourage educated guessing, which simplifies the task by providing plausible options. Conversely, open-ended answers require the model to generate the correct response without cues and may align better with real-world use cases.
- Verbalizing a triple requires not only labels for the subject and object (which are typically annotated with one or multiple labels) but also a rule [1] or template [2] that translates the formal triple into natural-language text. This may require humans to create one or multiple rules per relation. Depending on the target language and the number of relations used, this can be a non-negligible amount of work.
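The points above (multiple phrasings per triple, comparison with a reference answer, and template-based verbalization) can be combined in a small sketch. It is an illustration under stated assumptions, not the method of the cited works: the templates are hand-written examples, the `ask` callable stands in for any LM interface, exact match after normalization is only one of several possible ways to compare answers, and `toy_model` is a hypothetical stub rather than a real model.

```python
from typing import Callable

Triple = tuple[str, str, str]  # (subject label, relation, object label)

# Hand-written templates per relation; each one verbalizes the same query.
TEMPLATES: dict[str, list[str]] = {
    "capital": [
        "What is the capital of {subject}?",
        "Which city is the capital of {subject}?",
        "Name the capital city of {subject}.",
    ],
}


def build_questions(triple: Triple, templates: dict[str, list[str]]) -> list[str]:
    """Verbalize one triple into all phrasings available for its relation."""
    subject, relation, _ = triple
    return [t.format(subject=subject) for t in templates[relation]]


def normalize(text: str) -> str:
    """Very light normalization for exact-match comparison."""
    return text.strip().lower().rstrip(".")


def evaluate(triple: Triple, templates: dict[str, list[str]],
             ask: Callable[[str], str]) -> dict[str, object]:
    """Ask every phrasing and report accuracy and answer consistency."""
    expected = normalize(triple[2])
    answers = [normalize(ask(q)) for q in build_questions(triple, templates)]
    return {
        "accuracy": sum(a == expected for a in answers) / len(answers),
        "consistent": len(set(answers)) == 1,  # same answer for all phrasings
    }


def toy_model(question: str) -> str:
    """Hypothetical stub: answers differently for one of the phrasings."""
    return "Lyon." if question.startswith("Name") else "Paris"


result = evaluate(("France", "capital", "Paris"), TEMPLATES, toy_model)
```

Because every phrasing is linked to the same triple, a disagreement between answers can be attributed to the rewording rather than to a different underlying fact.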
...