Problem statement (only a few sentences, at most one paragraph): Automatic evaluation of LLMs is usually done by cleverly comparing the generated model output with a desired result. Therefore, many well-established metrics, such as direct matching or similarity metrics (BLEU, n-gram overlap, ROUGE, BERTScore), are used. However, there are various reasons why KGs should be used in the evaluation to support or enhance these techniques.
Explanation of concepts: Firstly, KG triples can be used to evaluate how much knowledge an LLM can leverage from the training process and how consistently this knowledge can be retrieved. Secondly, KG triples can be used to evaluate the output of an LLM by extracting information from the output and comparing it with a KG to check factuality or knowledge coverage (a small sketch of such a check is given below). Examples of this knowledge coverage are political positions, cultural or sporting events, or current news. Furthermore, the extracted KG triples can be used to evaluate tasks/features where a similarity comparison of the LLM output is undesirable. This is the case for identifying and evaluating hallucinations of LLMs.
The final reason is to use KGs to enhance LLM inputs with relevant information. This method is beneficial, for example, if the goal is to use in-context learning to provide relevant information for a specific task to the LLM. In addition, planned adversarial attacks can be carried out on the LLM to uncover biases or weak points. Both variants are explained in more detail below as examples.
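A minimal sketch of the factuality check mentioned above, assuming the KG is available as a plain set of (subject, relation, object) triples; the `extract_triples` function is a hypothetical placeholder for whatever information-extraction step produces triples from the model output, and is not part of any cited approach.

```python
# Sketch: verify triples extracted from an LLM output against a KG.
# The KG is modelled as a set of (subject, relation, object) string triples;
# extract_triples() is a stand-in for a real information-extraction step.

KG = {
    ("Berlin", "capital_of", "Germany"),
    ("Angela Merkel", "born_in", "Hamburg"),
}

def extract_triples(llm_output: str) -> list[tuple[str, str, str]]:
    """Hypothetical placeholder: an IE model or the LLM itself would produce these."""
    return [
        ("Berlin", "capital_of", "Germany"),
        ("Angela Merkel", "born_in", "Berlin"),  # deliberately wrong fact
    ]

def factuality_score(llm_output: str, kg: set) -> float:
    """Fraction of extracted triples that are supported by the KG."""
    triples = extract_triples(llm_output)
    if not triples:
        return 0.0
    return sum(1 for t in triples if t in kg) / len(triples)

print(factuality_score("some model output ...", KG))  # 0.5 for this toy example
```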
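Similarly, the input-enhancement idea could look like the following sketch, in which triples about an entity mentioned in the question are looked up in the KG, verbalized naively, and prepended to the prompt as context. The lookup, verbalization, and prompt layout are illustrative assumptions only.

```python
# Sketch: enrich an LLM prompt with verbalized KG facts for in-context learning.

KG = {
    ("Marie Curie", "award_received", "Nobel Prize in Physics"),
    ("Marie Curie", "field_of_work", "radioactivity"),
}

def facts_about(entity: str, kg: set) -> list[str]:
    """Naively verbalize all triples whose subject matches the entity."""
    return [f"{s} {r.replace('_', ' ')} {o}." for (s, r, o) in sorted(kg) if s == entity]

def build_prompt(question: str, entity: str, kg: set) -> str:
    """Prepend the verbalized facts as context before the actual question."""
    context = "\n".join(facts_about(entity, kg))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

print(build_prompt("What did Marie Curie research?", "Marie Curie", KG))
```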
Brief description of the state of the art (only a few sentences, at most one paragraph): ...
Properties to evaluate the LLM on:
- Represented Knowledge: Which fact queries can the LLM answer correctly & consistently?
- Factuality: When generating an output, are the facts an LLM uses in its answer correct?
- Biases: How can bias in LLMs be detected and mitigated using KGs?
Answer 1: Using KGs to Evaluate LLM Represented Knowledge
...
- While KGs are inherently structured to maintain consistency in factual representation, LMs do not always yield consistent answers, especially when queries are rephrased [2, 3]. Integrating KGs for evaluation set generation can address this by allowing multiple phrasings of a single query, all linked to the same answer and relational triple. This approach helps measure an LM’s robustness in recognizing equivalent rewordings of the same fact [1, 2, 3].
- When using a question-answering format (to evaluate text-generating / autoregressive LMs), the free-form answer of the model needs to be compared to the reference answer [1]. While there are multiple ways of comparing the answer to the reference [1], no single approach is ideal. Multiple-choice-based approaches mitigate this problem entirely [1, 2, 6] but inherently have a limited answer space and may encourage educated guessing, simplifying the task by providing plausible options. Conversely, open-ended answers require the model to generate the correct response without cues and may align better with real-world use cases.
- Verbalizing a triple requires not only labels for the subject and object (which are typically annotated with one or multiple labels), but also a rule [1] or template [2] that translates the formal triple into natural text. This may require humans to create one or multiple rules per relation. Depending on the target language and the number of relations used, this can be a non-negligible amount of work (a small sketch of template-based verbalization follows this list).
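As referenced in the list above, the following is a minimal sketch of how a triple could be verbalized into several query phrasings via hand-written per-relation templates, and how consistently a model answers them could then be scored with a simple normalized exact match against the object label. The template set, the `ask_model` stub, and the normalization are illustrative assumptions, not the exact procedures of [1, 2].

```python
# Sketch: turn a KG triple into several natural-language query phrasings via
# hand-written per-relation templates, then score how consistently a model
# answers them (normalized exact match against the reference object label).

import string

# One or more templates per relation; "{s}" is replaced by the subject label.
TEMPLATES = {
    "capital_of": [
        "What country is {s} the capital of?",
        "{s} is the capital city of which country?",
    ],
}

def verbalize(subject: str, relation: str) -> list[str]:
    """All phrasings of the query for one (subject, relation) pair."""
    return [t.format(s=subject) for t in TEMPLATES[relation]]

def normalize(text: str) -> str:
    """Lowercase and strip punctuation/whitespace before comparison."""
    return text.lower().translate(str.maketrans("", "", string.punctuation)).strip()

def ask_model(query: str) -> str:
    """Hypothetical placeholder for an actual LLM call."""
    return "Germany." if "capital" in query else "unknown"

def consistency_score(subject: str, relation: str, obj: str) -> float:
    """Fraction of phrasings for which the model returns the reference object."""
    queries = verbalize(subject, relation)
    correct = sum(1 for q in queries if normalize(ask_model(q)) == normalize(obj))
    return correct / len(queries)

print(consistency_score("Berlin", "capital_of", "Germany"))  # 1.0 with this stub
```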
...