...

Automatic evaluation of LLMs is usually done by comparing generated model output with a desired result. For this, many well-established metrics are used, such as exact matching or similarity metrics (BLEU, n-gram overlap, ROUGE, BERTScore). However, there are several reasons why KGs should be used in the evaluation to support or enhance these metrics.

Explanation of concepts 

  • Firstly, Represented Knowledge: KG triples can be used to evaluate how much knowledge an LLM retains from the training process and how consistently this knowledge can be retrieved.
  • Secondly, Factuality: KG triples can be used to evaluate the output of an LLM by extracting information from the output and comparing it with a KG to check factuality or knowledge coverage. Examples of such knowledge coverage are political positions, cultural or sporting events, or current news. Furthermore, the extracted KG triples can be used to evaluate tasks where a similarity comparison of the LLM output is undesirable, as is the case when identifying and evaluating LLM hallucinations.
  • Biases: The final reason is to use KGs to enrich LLM inputs with relevant information. This is beneficial, for example, when in-context learning is used to provide task-specific information to the LLM. In addition, planned adversarial attacks can be carried out on the LLM to uncover biases or weak points. Both variants are explained in more detail below as examples.
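The factuality idea from the list above can be sketched in a few lines: triples extracted from an LLM answer are looked up in a reference KG, and anything unsupported is flagged as a potential hallucination. The extraction step is out of scope here, and both the KG contents and the extracted triples are illustrative assumptions.

```python
# Minimal sketch: verifying triples extracted from an LLM output against a
# reference KG. The KG contents and the "extracted" triples are placeholder
# assumptions; a real pipeline would use an information-extraction model
# and a populated knowledge graph.

# Reference KG as a set of (subject, relation, object) triples (assumed data).
reference_kg = {
    ("Berlin", "capital_of", "Germany"),
    ("Paris", "capital_of", "France"),
}

def check_factuality(extracted_triples, kg):
    """Split extracted triples into KG-supported and unsupported ones."""
    supported = [t for t in extracted_triples if t in kg]
    unsupported = [t for t in extracted_triples if t not in kg]
    return supported, unsupported

# Triples assumed to have been extracted from an LLM answer.
llm_triples = [
    ("Berlin", "capital_of", "Germany"),  # consistent with the KG
    ("Berlin", "capital_of", "France"),   # hallucinated relation
]

supported, unsupported = check_factuality(llm_triples, reference_kg)
```

Unsupported triples are only *candidates* for hallucinations: a KG is rarely complete, so a missing triple may also indicate a coverage gap in the KG itself.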

Properties to evaluate the LLM on:

  • Represented Knowledge: Which fact queries can the LLM answer correctly & consistently?
  • Factuality: When generating an output, are the facts an LLM uses in its answer correct?
  • Biases: How can bias in LLMs be detected and mitigated using KG?

Brief description of the state of the art 

...

Answer 1: Using KGs to Evaluate LLM Represented Knowledge

...

The use of KGs for generating evaluation datasets has been successfully employed in various scientific publications [1, 2, 4, 5, 6, 7].
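One common way to generate such evaluation datasets is to verbalize KG triples into fact queries via relation templates. The templates and triples below are illustrative assumptions, not taken from any of the cited publications.

```python
# Sketch of generating fact queries from KG triples for a represented-knowledge
# evaluation. Relation templates and triples are illustrative assumptions.

# One natural-language template per relation type (assumed vocabulary).
templates = {
    "capital_of": "What country is {subj} the capital of?",
    "author_of": "Who wrote {obj}?",
}

def triples_to_questions(triples):
    """Turn (subject, relation, object) triples into (question, answer) pairs."""
    items = []
    for subj, rel, obj in triples:
        if rel == "capital_of":
            items.append((templates[rel].format(subj=subj), obj))
        elif rel == "author_of":
            items.append((templates[rel].format(obj=obj), subj))
    return items

kg_triples = [
    ("Berlin", "capital_of", "Germany"),
    ("George Orwell", "author_of", "1984"),
]
qa_items = triples_to_questions(kg_triples)
```

Consistency can then be measured by asking several paraphrases of the same generated question and checking whether the LLM's answers agree.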

Answer 2: Using KGs to Evaluate LLM Factuality

...

Maybe add additional properties such as factuality, correctness, precision, etc., or keep the ones we have right now and call them "selected properties" ... (We could move the definition of these properties to the top and discuss which answer addresses which property.)

Draft from Daniel Burkhardt

Description:

KGs hold factual knowledge for various domains, which can be used to analyze and evaluate the knowledge coverage of LLMs. This involves verifying the knowledge represented in an LLM against a KG. Similar to the previous solutions, the target object can be predicted using QA patterns, masked language modeling, or selecting the correct statement from a multiple-choice item. However, the information embedded in a KG cannot be compared with the target object of an LLM using strict matching or similarity metrics, due to the abstract structure of KG triples. Therefore, the output prediction has to be transformed into a meaning representation that describes the core semantic concepts and relations of the output sequence. Meaning representations should be extracted as graph-based semantic representations. The congruence of the extracted target graph and an objective KG can then be evaluated, and missing or misplaced relations as well as missing or false nodes can be detected.
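The congruence check described above can be sketched as plain set operations once both graphs are expressed as triples: edges present in the target graph but not in the objective KG are false relations, edges only in the objective KG are missing, and the same logic applies to nodes. The example triples are illustrative assumptions.

```python
# Sketch of comparing a target graph (extracted from an LLM output) with an
# objective KG. Both graphs are sets of (subject, relation, object) triples;
# the example data is assumed for illustration.

def graph_congruence(target, objective):
    """Report false/missing edges and false/missing nodes of the target graph."""
    target_nodes = {n for s, _, o in target for n in (s, o)}
    objective_nodes = {n for s, _, o in objective for n in (s, o)}
    return {
        "false_edges": target - objective,      # misplaced relations
        "missing_edges": objective - target,    # missing relations
        "false_nodes": target_nodes - objective_nodes,
        "missing_nodes": objective_nodes - target_nodes,
    }

objective_kg = {("Berlin", "capital_of", "Germany")}
target_kg = {("Berlin", "capital_of", "France")}
report = graph_congruence(target_kg, objective_kg)
```

Real systems replace the exact set difference with softer matching (entity linking, relation normalization), but the structure of the report stays the same.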

Considerations:

  • Meaning representation
  • KG congruence

...

  • Meaning representations: A graph-based meaning representation formally captures the semantics of a natural-language sentence. Since various representations can describe the meaning of a sentence, the representation has to be well defined before evaluating an LLM on factuality using KGs. Target and objective KG should be mapped onto the same graph-based meaning representation.
  • KG encoding: Every evaluated LLM output must be encoded into the pre-defined meaning representation. The concepts for this process are versatile and have already been discussed in other sections: a text-to-graph encoder, a KG-construction prompt, or multi-stage extraction in which first entities, then coreference resolutions, and finally relations are detected and extracted.
  • KG similarity: Depending on the KG encoding strategy, the target and objective KG can be compared and analyzed at different levels and granularities. The general idea is to check whether each triple in the target KG is factually consistent given an objective KG (or context). For instance, an encoded edge representation derived from the corresponding entity nodes can be classified as factual or non-factual by a pre-trained binary classification model.
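The triple-level consistency check from the last bullet can be turned into a simple aggregate score: classify each target triple and report the factual fraction. The lookup-based classifier below is a stand-in assumption for the pre-trained binary classifier; the triples are illustrative.

```python
# Sketch of triple-level factual-consistency scoring. Each target triple is
# classified as factual or not given an objective KG, and the overall score
# is the fraction of factual triples. The membership test stands in for a
# pre-trained binary edge classifier (an assumption for this sketch).

def classify_edge(triple, objective_kg):
    """Placeholder classifier: a triple is factual iff it appears in the KG."""
    return triple in objective_kg

def factuality_score(target_triples, objective_kg):
    """Fraction of target triples judged factual; 0.0 for an empty target."""
    if not target_triples:
        return 0.0
    hits = sum(classify_edge(t, objective_kg) for t in target_triples)
    return hits / len(target_triples)

objective_kg = {
    ("Marie Curie", "born_in", "Warsaw"),
    ("Marie Curie", "field", "physics"),
}
target_triples = [
    ("Marie Curie", "born_in", "Warsaw"),   # factual
    ("Marie Curie", "born_in", "Paris"),    # non-factual
]
score = factuality_score(target_triples, objective_kg)
```

Scoring at the triple level, rather than the whole output, makes it possible to localize exactly which claim in a generated answer is inconsistent.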


Standards and Protocols and Scientific Publications:

  • For graph-based meaning representations, the standard protocols are Abstract Meaning Representation (AMR) and Open Information Extraction (OpenIE). AMR is a semantic representation language whose graphs are rooted, directed, edge-labeled, and leaf-labeled; the edges are semantic relations and the nodes are concepts. AMR has a fixed vocabulary of approximately 100 relations, together with the inverse of each relation. In OpenIE, by contrast, relation triples are represented as a subject, an open relation, and the object of that relation; "open" means that OpenIE has no fixed relation vocabulary. Each sentence is represented as a directed acyclic graph, and an extractor enumerates all word pairs and predicts the relations in parallel.
  • Graph-Encoders KG ...
  • KG similarity ...
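The contrast between the two representation styles is easiest to see on a single sentence. The structures below are simplified sketches for "Marie Curie discovered polonium.", not the output of a real AMR parser or OpenIE extractor.

```python
# Simplified illustration of AMR-style vs. OpenIE-style representations of
# the sentence "Marie Curie discovered polonium." (assumed example; real
# parser output is richer).

# AMR-style: concepts as nodes, edges drawn from a fixed relation vocabulary
# (:ARG0 = agent, :ARG1 = patient), rooted at the main predicate frame.
amr_graph = {
    "root": "discover-01",
    "edges": [
        ("discover-01", ":ARG0", "person/Marie Curie"),
        ("discover-01", ":ARG1", "polonium"),
    ],
}

# OpenIE-style: (subject, open relation, object) triples where the relation
# is surface text from the sentence; no fixed relation vocabulary.
openie_triples = [("Marie Curie", "discovered", "polonium")]
```

The choice matters for the evaluation: AMR's fixed vocabulary makes triples from different sentences directly comparable, while OpenIE's open relations usually require an extra normalization step before the target and objective graphs can be matched.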


Answer 3: Analyzing LLM Biases through KG Comparisons

...

Description: This involves using knowledge graphs to identify and analyze biases in LLMs. By comparing LLM outputs with the neutral, structured data in KGs, this approach can highlight biases and suggest ways to mitigate them, leading to fairer and more balanced AI systems.

...

  1. https://www.amazon.science/publications/grapheval-a-knowledge-graph-based-llm-hallucination-evaluation-framework (Fact)
  2. https://aclanthology.org/2022.naacl-main.236.pdf (FactGraph)
  3. https://aclanthology.org/W13-2322.pdf (AMR)
  4. https://aclanthology.org/2022.findings-emnlp.103.pdf (OpenIE)
  5. https://aclanthology.org/2020.acl-main.173.pdf (On Faithfulness and Factuality in Abstractive Summarization)
  6. https://arxiv.org/abs/2405.04756 (Bias)
  7. http://arxiv.org/abs/2403.09963 (Bias)