...

How do I evaluate LLMs through KGs? (3) – length: up to one page

Lead: Fabio

Contributors:

  • Daniel Burkhardt (FSTI)
  • Daniel Baldassare (doctima)
  • Fabio Barth (DFKI)
  • Max Ploner (HU)
  • ...

First Version: Automatic evaluation of LLMs is usually done by comparing the generated output with a desired reference output. The output can be evaluated using direct matching or similarity metrics (e.g., BLEU, n-gram overlap, ROUGE, BERTScore). However, there are several reasons why KGs can be used in the evaluation to support or enhance these techniques.
Firstly, KG triplets can be extracted from the output of an LLM and compared with a KG to check factuality or knowledge coverage. Examples of this knowledge coverage would be political positions, cultural or sporting events, or current news information. Furthermore, the extracted KG triplets can be used to evaluate tasks/features where a similarity comparison of the LLM output is undesirable. This is the case when identifying and evaluating LLM hallucinations.
The second reason is to use KGs to enrich LLM inputs with relevant information. This method is beneficial, for example, if the goal is to use in-context learning to provide task-specific information to the LLM. In addition, targeted adversarial attacks can be carried out on the LLM to uncover biases or weak points.
Both variants are explained in more detail below as examples. 
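The triplet-comparison variant can be sketched as a simple factuality check: triplets extracted from the LLM output are looked up in a reference KG. The extractor, the toy triplets, and the reference KG below are illustrative assumptions, not part of this specification.

```python
# Hypothetical sketch: score factuality of LLM-output triplets against a
# reference KG. Triplets are (subject, relation, object) tuples.

def check_factuality(extracted, reference_kg):
    """Return the fraction of extracted triplets found in the reference KG."""
    if not extracted:
        return 0.0
    hits = sum(1 for triplet in extracted if triplet in reference_kg)
    return hits / len(extracted)

# Toy reference KG (assumed data for illustration).
reference_kg = {
    ("Berlin", "capital_of", "Germany"),
    ("Paris", "capital_of", "France"),
}

# Triplets extracted from an LLM output; the second one is a hallucination.
extracted = [
    ("Berlin", "capital_of", "Germany"),
    ("Paris", "capital_of", "Spain"),
]

print(check_factuality(extracted, reference_kg))  # 0.5
```

In practice, the exact-match lookup would be replaced by entity linking and relation alignment against the KG, but the scoring idea stays the same.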


Properties to evaluate the LLM on:

  • Coverage: Which fact-queries can the LLM answer correctly?
  • Factuality: When generating an output, are the facts an LLM uses in its answer correct?
  • Biases: Does the LLM's output reflect systematic biases that can be uncovered by comparison against a KG?


Answer 1: Using KGs to Evaluate LLM Knowledge Coverage

Relational triplets from a KG can be used to create up-to-date and domain-specific knowledge evaluation datasets. The LLM can then be queried with the subject and relation to predict the object. The KG can not only provide the correct triplets, but can also be used to generate plausible but incorrect answer options, allowing the generation of multiple-choice items to evaluate the knowledge represented in an LLM [1, 2].
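One way to realize this is to use other objects of the same relation as plausible distractors. The following is a minimal sketch under that assumption; the toy KG and the `make_mc_item` helper are hypothetical, not from the cited works.

```python
# Hypothetical sketch: build a multiple-choice item from a KG triplet.
# Distractors are other objects that occur with the same relation in the KG,
# so they are plausible but incorrect.
import random

def make_mc_item(kg, subject, relation, n_distractors=3, seed=0):
    """Return (question, shuffled options, correct answer) for one triplet."""
    correct = next(o for s, r, o in kg if s == subject and r == relation)
    candidates = sorted({o for s, r, o in kg if r == relation and o != correct})
    rng = random.Random(seed)  # seeded for reproducible evaluation sets
    options = rng.sample(candidates, min(n_distractors, len(candidates)))
    options.append(correct)
    rng.shuffle(options)
    question = f"{subject} {relation.replace('_', ' ')} ...?"
    return question, options, correct

# Toy KG (assumed data for illustration).
kg = {
    ("Berlin", "capital_of", "Germany"),
    ("Paris", "capital_of", "France"),
    ("Rome", "capital_of", "Italy"),
    ("Madrid", "capital_of", "Spain"),
}

question, options, answer = make_mc_item(kg, "Berlin", "capital_of")
print(question, options, answer)
```

The LLM's accuracy over many such items then serves as an estimate of its knowledge coverage for the domain represented by the KG.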

References:

  1. https://aclanthology.org/2024.findings-naacl.155/
  2. https://arxiv.org/pdf/2401.00761

Answer 2: Using KGs to Evaluate LLM Factuality

Maybe add additional properties such as factuality, correctness, precision, etc., or perhaps keep the ones we have right now and call them "selected properties" ...

Lead: Fabio

Contributors:

...

(We could move the definition of these properties to the top and discuss which answer addresses which property)

Draft from Daniel Burkhardt

...

First Version: The first evaluation process can be divided into two steps, which can be executed through various techniques that this section will not discuss. First, the LLM generates output sequences based on an evaluation set of input samples. Specific KG triplets are then identified and extracted from the generated output sequences. The variants for extraction and identification can be found in other subchapters of this DIN SPEC. The extracted KG triplets are usually domain- or task-specific and are used to construct a KG.
In the second step, the KG can now be analyzed. For instance, factuality can be checked by analyzing each KG triplet in the generated KG, given the context provided. Alternatively, the extracted KG triplets can be compared with an existing, more extensive KG to analyze the knowledge coverage of an LLM.
Alternatively, relational triplets from a KG can be used to create up-to-date and domain-specific knowledge evaluation datasets. The LLM can then be queried with the subject and relation to predict the object. The KG can not only provide the correct triplets, but can also be used to generate plausible but incorrect answer options, allowing the generation of multiple-choice items to evaluate the knowledge represented in an LLM [2, 3].
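The two-step process above can be sketched as follows. The line-based `extract_triplets` stands in for a real extraction model, and the toy KG data is assumed for illustration only.

```python
# Hypothetical sketch of the two-step evaluation: (1) extract triplets from
# the LLM's output sequences, (2) compare the resulting KG with a larger
# reference KG to estimate knowledge coverage.

def extract_triplets(llm_output):
    """Toy extractor: expects lines of the form 'subject | relation | object'.
    A real pipeline would use an extraction model instead."""
    triplets = set()
    for line in llm_output.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3:
            triplets.add(tuple(parts))
    return triplets

def knowledge_coverage(generated_kg, reference_kg):
    """Fraction of reference facts also present in the generated KG."""
    return len(generated_kg & reference_kg) / len(reference_kg)

# Toy reference KG (assumed data).
reference_kg = {
    ("Berlin", "capital_of", "Germany"),
    ("Paris", "capital_of", "France"),
    ("Rome", "capital_of", "Italy"),
}

llm_output = "Berlin | capital_of | Germany\nParis | capital_of | France"
generated_kg = extract_triplets(llm_output)
print(round(knowledge_coverage(generated_kg, reference_kg), 3))  # 0.667
```

The same generated KG can also be checked triplet by triplet against the provided context to assess factuality, as described above.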


References:

  1. https://www.amazon.science/publications/grapheval-a-knowledge-graph-based-llm-hallucination-evaluation-framework
  2. https://aclanthology.org/2024.findings-naacl.155/
  3. https://arxiv.org/pdf/2401.00761

...


Answer 3: Analyzing LLM Biases through KG Comparisons

...

Contributors:

...

Draft from Daniel Burkhardt

...