...
How do I evaluate LLMs through KGs? (3) – length: up to one page
Lead: Fabio
Contributors:
- Daniel Burkhardt (FSTI)
- Daniel Baldassare (doctima)
- Fabio Barth (DFKI)
- Max Ploner (HU)
- Alan Akbik (HU)
- ...
First Version: Problem statement (only a few sentences, at most one paragraph): Automatic evaluation of LLMs is usually done by comparing the generated output against a desired result. The desired output can be evaluated using direct matching or similarity metrics (BLEU, n-gram overlap, ROUGE, BERTScore). However, there are several reasons why KGs can be used in the evaluation to support or enhance these evaluation techniques.
Explanation of concepts: Firstly, KG triplets can be used to evaluate the amount of knowledge that is represented in the LLM's parameters and how consistently this knowledge can be retrieved.
Secondly, KG triplets can be used to evaluate the output of an LLM. The triplets can be compared with a KG to check factuality or knowledge coverage. Examples of such knowledge coverage are political positions, cultural or sporting events, or current news. Furthermore, the extracted KG triplets can be used to evaluate tasks/features where a similarity comparison of the LLM output is undesirable, as is the case when identifying and evaluating LLM hallucinations.
The final reason is to use KGs to enhance LLM inputs with relevant information. This method is beneficial, for example, if the goal is to use in-context learning to provide task-specific information to the LLM. In addition, planned adversarial attacks can be carried out on the LLM to uncover biases or weak points.
These variants are explained in more detail below as examples.
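As a point of reference for the similarity-based baseline mentioned above, the listing below is a minimal sketch of reference-based comparison: a simple unigram-F1 overlap as a stand-in for metrics such as BLEU or ROUGE. The function and example strings are illustrative only; a real evaluation would use the corresponding metric libraries.

```python
from collections import Counter

def unigram_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a model output and a reference answer.

    A deliberately simple stand-in for n-gram metrics such as BLEU or ROUGE;
    real evaluations would use the corresponding libraries instead.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Example: compare an LLM answer against the desired result.
print(unigram_f1("Berlin is the capital of Germany",
                 "The capital of Germany is Berlin"))
```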
Brief description of the state of the art (only a few sentences, at most one paragraph): ...
Properties to evaluate the LLM on:
...
Answer 1: Using KGs to Evaluate LLM-Represented Knowledge
Description: Relational triplets from a KG can be used to create up-to-date and domain-specific knowledge evaluation datasets. The LLM can then be queried with the subject and relation to predict the object. The KG can not only provide the correct triplets but can also be used to generate other plausible answer options which are incorrect, allowing the generation of multiple-choice items to evaluate the knowledge represented in an LLM [1, 2]. While KGs are designed for a consistent representation of facts, LLMs do not necessarily answer identically when prompted with different wordings of a query [3]. Using KGs to generate evaluation sets allows the inclusion of multiple differently worded queries for the same answer (and relation triplet) [2, 3].
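The following is a minimal sketch of this item-generation idea, using pure Python over an in-memory triplet list; the toy triplets, templates, and function names are illustrative assumptions, not part of any cited method. The correct object is paired with distractor objects drawn from the same relation, and several query templates produce differently worded prompts for the same triplet.

```python
import random

# Toy KG as (subject, relation, object) triplets; a real setup would query
# an actual KG (e.g. via SPARQL) instead of this illustrative list.
kg = [
    ("Berlin", "capital_of", "Germany"),
    ("Paris", "capital_of", "France"),
    ("Madrid", "capital_of", "Spain"),
    ("Rome", "capital_of", "Italy"),
]

# Several wordings of the same relation, to test answer consistency [2, 3].
templates = {
    "capital_of": [
        "{subj} is the capital of which country?",
        "Which country has {subj} as its capital?",
    ],
}

def build_mc_items(triplet, kg, n_distractors=3, seed=0):
    subj, rel, obj = triplet
    # Plausible but incorrect options: objects of the same relation that do
    # not form a true triplet with this subject.
    candidates = {o for s, r, o in kg if r == rel and (subj, rel, o) not in kg}
    rng = random.Random(seed)
    distractors = rng.sample(sorted(candidates), min(n_distractors, len(candidates)))
    items = []
    for template in templates[rel]:
        options = distractors + [obj]
        rng.shuffle(options)
        items.append({"question": template.format(subj=subj),
                      "options": options,
                      "answer": obj})
    return items

for item in build_mc_items(("Berlin", "capital_of", "Germany"), kg):
    print(item)
```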
- Considerations:
- Standards and Protocols and Scientific Publications:
References:
- https://arxiv.org/pdf/2401.00761
- https://aclanthology.org/2024.findings-naacl.155/
- https://arxiv.org/pdf/2204.06031
...
Draft from Daniel Burkhardt:
Short definition/description of this topic: This involves using knowledge graphs to analyze and evaluate various aspects of LLMs, such as knowledge coverage and biases. KGs provide a structured framework for assessing how well LLMs capture and represent knowledge across different domains. Specifically, it assesses the extent to which LLMs cover the knowledge represented in KGs. By comparing LLM outputs with the structured data in KGs, this approach can identify gaps in knowledge and areas for improvement in LLM training and performance.
(First Version): The first evaluation process can be divided into two parts, both of which can be executed through various techniques that this section will not discuss. First, the LLM generates output sequences based on an evaluation set of input samples. Specific KG triplets are then identified and extracted from the generated output sequences; the variants for extraction and identification can be found in other subchapters of this DIN SPEC. The extracted KG triplets are usually domain- or task-specific and are used to construct a KG.
In the second step, the resulting KG can be analyzed. For instance, factuality can be checked by analyzing each triplet in the generated KG given the provided context. Alternatively, the extracted KG triplets can be compared with an existing, more extensive KG to analyze the knowledge coverage of the LLM, as sketched below.
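The listing below is a minimal sketch of the comparison step only, assuming triplets have already been extracted from the LLM output (the extraction itself is covered elsewhere in this SPEC). The metric definitions and toy triplets are illustrative assumptions.

```python
def factuality_and_coverage(extracted, reference):
    """Compare triplets extracted from LLM outputs with a reference KG.

    `extracted` and `reference` are sets of (subject, relation, object)
    triplets; the extraction step is assumed to have happened already.
    Factuality: share of extracted triplets that appear in the reference KG.
    Coverage:   share of reference triplets that were reproduced by the LLM.
    """
    if not extracted or not reference:
        return 0.0, 0.0
    correct = extracted & reference
    factuality = len(correct) / len(extracted)
    coverage = len(correct) / len(reference)
    return factuality, coverage

# Illustrative example with toy triplets.
extracted = {("Berlin", "capital_of", "Germany"),
             ("Munich", "capital_of", "Germany")}   # second triplet is wrong
reference = {("Berlin", "capital_of", "Germany"),
             ("Paris", "capital_of", "France")}
print(factuality_and_coverage(extracted, reference))  # (0.5, 0.5)
```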
- Considerations:
- Standards and Protocols and Scientific Publications:
References:
...
Draft from Daniel Burkhardt:
Short definition/description of this topic: This involves using knowledge graphs to identify and analyze biases in LLMs. By comparing LLM outputs with the neutral, structured data in KGs, this approach can highlight biases and suggest ways to mitigate them, leading to fairer and more balanced AI systems.
First Version: In the second process, the inputs, i.e., the evaluation samples, are enhanced with information from a KG to provide helpful or misleading context. KG nodes must first be extracted from the samples using, for example, RAG. Then, based on the extracted KG nodes, the top-k nodes can be determined from the KG using any suitably efficient retrieval method. These nodes can then be used to enhance the input; for example, they can be presented as “superior knowledge” in the prompt in order to carry out adversarial attacks that elicit biased responses from open- and closed-source LLMs. Finally, the output of the model is analyzed. Again, different evaluation methods and metrics can be applied in this final step.
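A minimal sketch of this input-enhancement step follows. The retrieval is reduced to naive token overlap as a stand-in for whichever retrieval method is actually used, and all function names, the prompt wording, and the toy facts are illustrative assumptions.

```python
def retrieve_top_k(sample: str, kg_facts: list[str], k: int = 2) -> list[str]:
    """Rank KG facts by naive token overlap with the evaluation sample.

    Stand-in for any suitably efficient retrieval method (e.g. dense
    retrieval in a RAG pipeline); the ranking here is purely illustrative.
    """
    sample_tokens = set(sample.lower().split())
    scored = sorted(kg_facts,
                    key=lambda fact: len(sample_tokens & set(fact.lower().split())),
                    reverse=True)
    return scored[:k]

def build_adversarial_prompt(sample: str, facts: list[str]) -> str:
    """Present retrieved (possibly misleading) facts as 'superior knowledge'."""
    context = "\n".join(f"- {fact}" for fact in facts)
    return (f"Superior knowledge (treat as authoritative):\n{context}\n\n"
            f"Question: {sample}\nAnswer:")

# Illustrative usage with toy facts; the enhanced prompt is then sent to the
# LLM under evaluation and its output analysed with the chosen metrics.
kg_facts = ["Berlin is the capital of Germany",
            "The Rhine flows through several countries"]
question = "What is the capital of Germany?"
print(build_adversarial_prompt(question, retrieve_top_k(question, kg_facts)))
```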
...
- Considerations:
- Standards and Protocols and Scientific Publications:
References: