Page History
...
- Daniel Burkhardt (FSTI)
- Daniel Baldassare (doctima)
- Fabio Barth (DFKI)
- Max Ploner (HU)
- Alan Akbik (HU)
- ...
Problem statement (only a few sentences, at most one paragraph):
Automatic evaluation of LLMs is usually done by comparing the generated model output with a desired result. For this purpose, many well-established metrics, such as direct matching or similarity metrics (BLEU, N-gram, ROUGE, BERTScore), are used. However, there are various reasons why KGs should be used to support or enhance these evaluations.
Explanation of concepts:
- Firstly, KG triples can be used to evaluate how much knowledge an LLM can leverage from the training process and how consistently this knowledge can be retrieved.
- Secondly, KG triples can be used to evaluate the output of an LLM by extracting information from the output and comparing it with a KG to check factuality or knowledge coverage. Examples of such knowledge coverage include political positions, cultural or sporting events, or current news. Furthermore, the extracted KG triples can be used to evaluate tasks where a similarity comparison of the LLM output is undesirable, as is the case when identifying and evaluating hallucinations of LLMs.
- The final reason is to use KGs to enhance LLM inputs with relevant information. This method is beneficial, for example, if the goal is to use in-context learning to provide relevant information for a specific task to the LLM. In addition, planned adversarial attacks can be carried out on the LLM to uncover biases or weak points. Both variants are explained in more detail below as examples.
Properties to evaluate the LLM on:
- Represented Knowledge: Which fact queries can the LLM answer correctly & consistently?
- Factuality: When generating an output, are the facts an LLM uses in its answer correct?
- Biases: How can biases in LLMs be detected and mitigated using KGs?
Brief description of the state of the art:
Answer 1: Using KGs to Evaluate LLM Represented Knowledge
Description:
Relational triples from a Knowledge Graph (KG) can be leveraged to create up-to-date and domain-specific datasets for evaluating knowledge within Language Models (LMs) [1, 2, 4, 5, 6]. The LLM can then be queried with the subject and relation to predict the object using either a question-answer pattern [6], predicting the answer using masked language modeling [4, 5], or predicting the correct statement from a multiple-choice item [1,2]. KGs provide the correct answers and enable the creation of plausible distractors (incorrect answer options), allowing the generation of multiple-choice items to evaluate the knowledge represented in an LM [1, 2].
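As an illustration of this item-generation step, the following sketch builds a multiple-choice item from a single KG triple, sampling distractors from other objects of the same relation so they are plausible by construction. The mini-KG, the question template, and all entity names are hypothetical examples, not taken from any of the cited benchmarks.

```python
import random

# Minimal illustrative KG: (subject, relation, object) triples.
# All entities are hypothetical examples for this sketch.
TRIPLES = [
    ("Berlin", "capital_of", "Germany"),
    ("Paris", "capital_of", "France"),
    ("Madrid", "capital_of", "Spain"),
    ("Rome", "capital_of", "Italy"),
]

# One natural-language template per relation (cf. the verbalization
# rules/templates discussed in the considerations below).
TEMPLATES = {"capital_of": "Which country is {subject} the capital of?"}

def make_multiple_choice(triple, all_triples, n_distractors=3, seed=0):
    """Build a multiple-choice item; distractors are other objects
    of the same relation in the KG."""
    subject, relation, obj = triple
    pool = {o for (_, r, o) in all_triples if r == relation and o != obj}
    rng = random.Random(seed)
    distractors = rng.sample(sorted(pool), min(n_distractors, len(pool)))
    options = distractors + [obj]
    rng.shuffle(options)
    return {
        "question": TEMPLATES[relation].format(subject=subject),
        "options": options,
        "answer": obj,
    }

item = make_multiple_choice(TRIPLES[0], TRIPLES)
```

Because the distractor pool is restricted to objects of the same relation, every incorrect option is of the right semantic type, which is what makes KG-derived distractors plausible.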
Considerations:
- While KGs are inherently structured to maintain consistency in factual representation, LMs do not always yield consistent answers, especially when queries are rephrased [2, 3]. Integrating KGs for evaluation set generation can address this by allowing multiple phrasings of a single query, all linked to the same answer and relational triple. This approach helps measure an LM’s robustness in recognizing equivalent rewordings of the same fact [1, 2, 3].
- When using a question-answering format (to evaluate text-generating / autoregressive LMs), the free-form answer of the model needs to be compared to the reference answer [1]. While there are multiple ways of comparing the answer to the reference [1], no single approach is ideal. Multiple-choice-based approaches avoid this problem entirely [1, 2, 6] but inherently have a limited answer space and may encourage educated guessing, simplifying the task by providing plausible options. Conversely, open-ended answers require the model to generate the correct response without cues and may align better with real-world use cases.
- Verbalizing a triple requires not only labels for the subject and object (which are typically annotated with one or multiple labels) but also a rule [1] or template [2] that translates the formal triple into natural text. This may require humans to create one or multiple rules per relation. Depending on the target language and the number of relations used, this can be a non-negligible amount of work.
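A minimal sketch of the consistency measurement described in the first consideration, assuming several verbalization templates per relation that all map back to the same triple. The paraphrases are hypothetical, and `query_model` is a trivial stub standing in for a real LLM call so the sketch stays self-contained.

```python
# Several paraphrases of the same relation, all linked to one triple.
PARAPHRASES = {
    "capital_of": [
        "What country is {subject} the capital of?",
        "{subject} is the capital of which country?",
        "Name the country whose capital is {subject}.",
    ]
}

def query_model(prompt):
    # Placeholder: a real implementation would query an LLM here.
    return "Germany" if "Berlin" in prompt else "unknown"

def consistency_score(subject, relation, reference_object):
    """Fraction of paraphrased queries answered with the reference object.

    1.0 means the model answers the fact correctly under every rewording;
    values below 1.0 indicate sensitivity to surface form."""
    prompts = [t.format(subject=subject) for t in PARAPHRASES[relation]]
    answers = [query_model(p) for p in prompts]
    correct = sum(a == reference_object for a in answers)
    return correct / len(answers)
```

Aggregating this score over many triples separates what a model "knows" from how robustly it can retrieve that knowledge under rephrasing.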
Standards and Protocols and Scientific Publications:
The use of KGs for generating evaluation datasets has been successfully employed in various scientific publications [1, 2, 4, 5, 6, 7].
References:
- https://arxiv.org/pdf/2401.00761 (The Earth is Flat?)
- https://aclanthology.org/2024.findings-naacl.155/ (BEAR)
- https://arxiv.org/pdf/2204.06031 (Review)
- https://aclanthology.org/D19-1250/ (LAMA)
- https://www.akbc.ws/2022/assets/pdfs/15_kamel_knowledge_analysis_with_.pdf (KAMEL)
- https://aclanthology.org/N19-1421/ (CommonsenseQA)
Answer 2: Using KGs to Evaluate LLM Knowledge Coverage and Factuality
...
Maybe add additional properties such as factuality, correctness, precision etc. or perhaps keep these that we have right now and call them "selected properties" ... (We could move the definition of these properties to the top and discuss which answer addresses which property)
Draft from Daniel Burkhardt:
Description: This involves using knowledge graphs to analyze and evaluate various aspects of LLMs, such as knowledge coverage and factuality. KGs provide structured information for assessing how well an LLM captures and represents knowledge across domains. By extracting knowledge or facts from LLM outputs and comparing them with the structured data in a KG, this approach can identify gaps in knowledge and areas for improvement in LLM training and performance.
(First Version): The first evaluation process can be divided into two steps, which can be executed through various techniques that this section will not discuss. First, the LLM generates output sequences based on an evaluation set of input samples. Specific KG triples are then identified and extracted from the generated output sequences; the variants for extraction and identification can be found in other subchapters of this DIN SPEC. The extracted KG triples are usually domain- or task-specific and are used to construct a KG.
In the second step, this KG can be analyzed. For instance, factuality can be checked by analyzing each KG triple in the generated KG, given the context provided. Alternatively, the extracted KG triples can be compared with an existing, more extensive KG to analyze the knowledge coverage of an LLM.
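The two analysis variants above can be sketched as simple set comparisons, assuming triples have already been extracted from the LLM output (extraction methods are covered elsewhere in this document). The reference KG, the extracted triples, and both metric definitions below are illustrative assumptions, not a prescribed scoring scheme.

```python
# Hypothetical reference KG of known-correct facts.
REFERENCE_KG = {
    ("Berlin", "capital_of", "Germany"),
    ("Germany", "member_of", "EU"),
    ("Elbe", "flows_through", "Germany"),
}

def factuality(extracted_triples, reference_kg):
    """Share of triples extracted from the LLM output that are
    supported by the reference KG."""
    if not extracted_triples:
        return 0.0
    supported = sum(t in reference_kg for t in extracted_triples)
    return supported / len(extracted_triples)

def knowledge_coverage(extracted_triples, reference_kg):
    """Share of reference-KG facts that the LLM output reproduced."""
    return len(set(extracted_triples) & reference_kg) / len(reference_kg)

# Example: one correct and one incorrect extracted triple.
extracted = [("Berlin", "capital_of", "Germany"),
             ("Berlin", "capital_of", "France")]
```

In practice, the membership test `t in reference_kg` would be replaced by entity linking and relation matching against a large KG, since surface forms in generated text rarely match KG identifiers exactly.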
- Considerations:
- Standards and Protocols and Scientific Publications:
References:
- https://www.amazon.science/publications/grapheval-a-knowledge-graph-based-llm-hallucination-evaluation-framework (GraphEval)
- https://aclanthology.org/2022.naacl-main.236.pdf (FactGraph)
- https://arxiv.org/pdf/2204.06031 (Review)
- https://aclanthology.org/2020.acl-main.173.pdf (On Faithfulness and Factuality in Abstractive Summarization)
Answer 3: Analyzing LLM Biases through KG Comparisons
Draft from Daniel Burkhardt:
Description: This involves using knowledge graphs to identify and analyze biases in LLMs. By comparing LLM outputs with the neutral, structured data in KGs, this approach can highlight biases and suggest ways to mitigate them, leading to more fair and balanced AI systems.
First Version: In the second process, the inputs, i.e., the evaluation samples, are enhanced with information from a KG to provide helpful or misleading context. KG nodes must first be extracted from the samples, for example using RAG. Then, based on the extracted KG nodes, the top-k nodes can be determined from the KG using any sufficiently efficient retrieval method. These nodes can then be used to enhance the input. For example, the nodes can be presented as “superior knowledge” in the prompt in order to carry out adversarial attacks that elicit biased responses from open- and closed-source LLMs. Finally, the output of the model is analyzed. Again, different evaluation methods and metrics can be applied in this final step.
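The retrieval-and-enhancement step can be sketched as follows. The token-overlap retriever, the node texts, and the “superior knowledge” prompt framing are illustrative assumptions; any retrieval method (e.g., dense embeddings) could be substituted, and in an adversarial setting misleading facts would be injected instead of correct ones.

```python
# Hypothetical KG nodes with short textual descriptions.
KG_NODES = {
    "Berlin": "Berlin is the capital and largest city of Germany.",
    "Germany": "Germany is a country in Central Europe.",
    "France": "France is a country in Western Europe.",
}

def top_k_nodes(sample, k=2):
    """Rank KG nodes by naive token overlap with the input sample.
    A stand-in for a real retrieval method."""
    tokens = set(sample.lower().split())
    scores = {name: len(tokens & set(text.lower().split()))
              for name, text in KG_NODES.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]

def enhance_prompt(sample, k=2):
    """Prepend the top-k node descriptions to the prompt, framed as
    authoritative context."""
    facts = "\n".join(KG_NODES[n] for n in top_k_nodes(sample, k))
    return f"Superior knowledge:\n{facts}\n\nQuestion: {sample}"
```

The model's responses to such enhanced prompts can then be compared against responses to the unmodified samples to quantify how strongly injected context shifts the output.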
- Considerations:
- Standards and Protocols and Scientific Publications:
References:
- https://www.amazon.science/publications/grapheval-a-knowledge-graph-based-llm-hallucination-evaluation-framework (GraphEval)
- https://aclanthology.org/2022.naacl-main.236.pdf (FactGraph)
- https://aclanthology.org/2020.acl-main.173.pdf (On Faithfulness and Factuality in Abstractive Summarization)
- https://arxiv.org/abs/2405.04756 (Bias)
- https://arxiv.org/abs/2403.09963 (Bias)