...

First Version: Automatic evaluation of LLMs is usually done by comparing the model output against a desired result. The desired output can be evaluated using direct matching or similarity metrics (BLEU, n-gram overlap, ROUGE, BERTScore). However, there are several reasons why KGs can be used to support or enhance these evaluation techniques.

Firstly, KG triplets can be used to evaluate how much knowledge is represented in the LLM's parameters and how consistently this knowledge can be retrieved.
Secondly, KG triplets can be used to evaluate the output of an LLM. Triplets extracted from the output can be compared with a KG to check factuality or knowledge coverage. Examples of such knowledge coverage are political positions, cultural or sporting events, or current news. Furthermore, the extracted triplets can be used to evaluate tasks where a similarity comparison of the LLM output is undesirable, for example when identifying and evaluating hallucinations of LLMs (see the sketch below).
Thirdly, KGs can be used to enrich LLM inputs with relevant information. This is beneficial, for example, when in-context learning is used to provide task-specific information to the LLM. In addition, targeted adversarial attacks can be carried out on the LLM to uncover biases or weak points.
The first two variants are explained in more detail below as examples.
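To make the factuality check described above more concrete, the following minimal sketch compares triplets extracted from an LLM answer against a reference KG held as a set of (subject, relation, object) tuples. All facts, names, and the extract_triplets stub are hypothetical placeholders: in practice, the extraction step would be handled by a relation-extraction model or an extraction prompt, and the KG would be a real resource such as Wikidata.

```python
# Minimal sketch: check triplets extracted from an LLM answer against a reference KG.
# All facts, names, and the extraction stub below are illustrative placeholders.

Triplet = tuple[str, str, str]  # (subject, relation, object)

# Toy reference KG as a set of triplets (stands in for a real KG such as Wikidata).
REFERENCE_KG: set[Triplet] = {
    ("Berlin", "capital_of", "Germany"),
    ("Paris", "capital_of", "France"),
    ("Angela Merkel", "born_in", "Hamburg"),
}

def extract_triplets(llm_output: str) -> list[Triplet]:
    """Hypothetical stub: a real pipeline would use a relation-extraction model
    or an extraction prompt to turn free text into triplets."""
    return [("Berlin", "capital_of", "Germany"),
            ("Angela Merkel", "born_in", "Berlin")]  # hard-coded for demonstration

def check_factuality(llm_output: str, kg: set[Triplet]) -> dict:
    """Label each extracted triplet as supported, contradicted, or unknown w.r.t. the KG."""
    report = {"supported": [], "contradicted": [], "unknown": []}
    known_pairs = {(s, r) for s, r, _ in kg}
    for s, r, o in extract_triplets(llm_output):
        if (s, r, o) in kg:
            report["supported"].append((s, r, o))
        elif (s, r) in known_pairs:
            # The KG stores a different object for this subject/relation -> likely hallucination.
            report["contradicted"].append((s, r, o))
        else:
            report["unknown"].append((s, r, o))
    return report

if __name__ == "__main__":
    answer = "Berlin is the capital of Germany. Angela Merkel was born in Berlin."
    print(check_factuality(answer, REFERENCE_KG))
```

Keeping an explicit "unknown" bucket is deliberate: incompleteness of the KG should not automatically be counted as an error of the LLM.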

...

Properties to evaluate the LLM on:

  • Represented Knowledge: Which fact queries can the LLM answer correctly and consistently?
  • Factuality: When generating an output, are the facts an LLM uses in its answer correct?
  • Biases:


Answer 1: Using KGs to Evaluate LLM Represented Knowledge

...

Relational triplets from a KG can be used to create up-to-date and domain-specific knowledge evaluation datasets. The LLM can then be queried with the subject and relation to predict the object. The KG can not only provide the correct triplets but can also be used to generate plausible yet incorrect answer options, allowing the generation of multiple-choice items to evaluate the knowledge represented in an LLM [1, 2]. While KGs are designed for a consistent representation of facts, LLMs do not necessarily answer identically when prompted with different wordings of a query [3]. Using KGs to generate evaluation sets allows the inclusion of multiple differently worded queries for the same answer (and relation triplet) [2, 3].
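As a rough illustration of this idea, the sketch below builds multiple-choice items from KG triplets, drawing incorrect-but-plausible distractors from other objects of the same relation and querying each fact through several paraphrased templates. The toy triplets, templates, and function names are assumptions for illustration only and are not taken from the cited papers.

```python
import random

# Minimal sketch: build multiple-choice knowledge probes from KG triplets.
# Toy data and templates are illustrative placeholders.

KG_TRIPLETS = [
    ("Berlin", "capital_of", "Germany"),
    ("Paris", "capital_of", "France"),
    ("Madrid", "capital_of", "Spain"),
    ("Rome", "capital_of", "Italy"),
]

# Several paraphrases per relation, so the same fact is queried in different wordings.
TEMPLATES = {
    "capital_of": [
        "{subj} is the capital of which country?",
        "Of which country is {subj} the capital?",
        "{subj} serves as the capital city of which country?",
    ],
}

def make_mc_items(triplets, templates, n_distractors=3, seed=0):
    rng = random.Random(seed)
    items = []
    for subj, rel, obj in triplets:
        # Plausible but incorrect options: other objects of the same relation.
        pool = [o for _, r, o in triplets if r == rel and o != obj]
        distractors = rng.sample(pool, min(n_distractors, len(pool)))
        for template in templates.get(rel, []):
            options = distractors + [obj]
            rng.shuffle(options)
            items.append({
                "question": template.format(subj=subj),
                "options": options,
                "answer": obj,
            })
    return items

if __name__ == "__main__":
    for item in make_mc_items(KG_TRIPLETS, TEMPLATES)[:3]:
        print(item)
```

Because each fact is asked through several differently worded templates, the same items can also be used to measure how consistently the LLM retrieves a fact, not just whether it can produce it at all.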

References:

  1. https://arxiv.org/pdf/2401.00761
  2. https://aclanthology.org/2024.findings-naacl.155/
  3. https://arxiv.org/pdf/2204.06031

Answer 2: Using KGs to Evaluate LLM Factuality

Maybe add additional properties such as factuality, correctness, precision, etc., or perhaps keep the ones we have right now and call them "selected properties" ... (We could move the definition of these properties to the top and discuss which answer addresses which property)

...