...

First Version: Automatic evaluation of LLMs is usually done by comparing the generated output with a desired result, using either direct matching or similarity metrics (n-gram overlap, BLEU, ROUGE, BERTScore). However, there are several reasons why KGs can be used to support or enhance these evaluation techniques. 
First, KG triplets can be extracted from the output of an LLM and then analyzed. The triplets can be compared with a KG to check factuality or knowledge coverage. Examples of such knowledge coverage include political positions, cultural or sporting events, or current news information. Furthermore, the extracted KG triplets can be used to evaluate tasks or features for which a similarity comparison of the LLM output is undesirable, such as identifying and evaluating LLM hallucinations.
The second reason is to use KGs to enhance LLM inputs with relevant information. This method is beneficial, for example, if the goal is to use in-context learning to provide the LLM with relevant information for a specific task. In addition, planned adversarial attacks can be carried out on the LLM to uncover biases or weak points. 
Both variants are explained in more detail below as examples. 

...

First Version: The first evaluation process can be divided into two parts. Each part can be carried out with various techniques, which this section does not discuss. First, the LLM generates output sequences based on an evaluation set of input samples. Specific KG triplets are then identified and extracted from the generated output sequences. The variants for extraction and identification can be found in other subchapters of this DIN SPEC. The extracted KG triplets are usually domain- or task-specific. These KG triplets are used to generate a KG. 
In the second step, the KG can now be analyzed. For instance, factuality can be checked by analyzing each KG triplet in the generated KG, given the context provided. Alternatively, the extracted KG triplets can be compared with an existing, more extensive KG to analyze the knowledge coverage of an LLM.
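
The comparison step can be illustrated with a minimal sketch. It assumes that the triplets have already been extracted from the LLM output and that matching on lowercased strings is sufficient; a real pipeline would likely need entity linking and relation alignment. The names `Triplet`, `normalize`, and `compare_with_kg` are hypothetical and chosen for illustration only. Note that a triplet missing from the reference KG is not necessarily false, since the reference KG may itself be incomplete.

```python
from typing import Iterable, NamedTuple


class Triplet(NamedTuple):
    subject: str
    relation: str
    obj: str


def normalize(t: Triplet) -> Triplet:
    # Assumption: case- and whitespace-insensitive string matching is good enough.
    return Triplet(*(part.strip().lower() for part in t))


def compare_with_kg(extracted: Iterable[Triplet], reference_kg: Iterable[Triplet]) -> dict:
    """Compare triplets extracted from LLM output with a reference KG."""
    ext = {normalize(t) for t in extracted}
    ref = {normalize(t) for t in reference_kg}
    supported = ext & ref      # extracted triplets that the reference KG confirms
    unsupported = ext - ref    # candidates for factual errors or hallucinations
    return {
        "supported": supported,
        "unsupported": unsupported,
        # Share of extracted triplets confirmed by the KG (factuality proxy).
        "precision": len(supported) / len(ext) if ext else 0.0,
        # Share of the reference KG reproduced by the LLM (knowledge coverage).
        "coverage": len(supported) / len(ref) if ref else 0.0,
    }


reference_kg = [
    Triplet("Berlin", "capital_of", "Germany"),
    Triplet("Paris", "capital_of", "France"),
    Triplet("Rome", "capital_of", "Italy"),
    Triplet("Madrid", "capital_of", "Spain"),
]
extracted = [
    Triplet("berlin", "capital_of", "Germany"),
    Triplet("Paris", "capital_of", "Italy"),
]
report = compare_with_kg(extracted, reference_kg)
print(report["precision"], report["coverage"])
```

Reporting both values keeps the two questions from the text apart: precision relates to factuality of the output, while coverage relates to how much of the reference KG the LLM reproduces.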

Alternatively, relational triplets from a KG can be used to create up-to-date and domain-specific knowledge evaluation datasets. The LLM can then be queried with the subject and relation to predict the object. The KG can not only provide the correct triplets but can also be used to generate other plausible but incorrect answer options, allowing the generation of multiple-choice items to evaluate the knowledge represented in an LLM [2, 3].
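
As a rough sketch of the multiple-choice idea, the following assumes the KG is available as a list of (subject, relation, object) tuples and that objects of the same relation are plausible distractors. This is a simplification for illustration and not necessarily the procedure used in [2, 3]; the function name `build_multiple_choice_item` and the question format are hypothetical.

```python
import random


def build_multiple_choice_item(triplet, kg, num_distractors=3, rng=None):
    """Turn one KG triplet into a multiple-choice item with KG-derived distractors."""
    rng = rng or random.Random(0)  # fixed seed only to keep the sketch reproducible
    subject, relation, correct = triplet
    # Every object that is also correct for this subject and relation
    # must be excluded from the distractors.
    valid = {o for s, r, o in kg if s == subject and r == relation}
    # Assumption: objects of the same relation are plausible wrong answers.
    candidates = sorted({o for s, r, o in kg if r == relation} - valid)
    distractors = rng.sample(candidates, min(num_distractors, len(candidates)))
    options = distractors + [correct]
    rng.shuffle(options)
    return {
        "question": f"({subject}, {relation}, ?)",
        "options": options,
        "answer": correct,
    }


kg = [
    ("Germany", "capital", "Berlin"),
    ("France", "capital", "Paris"),
    ("Italy", "capital", "Rome"),
    ("Spain", "capital", "Madrid"),
    ("Germany", "currency", "Euro"),
]
item = build_multiple_choice_item(kg[0], kg)
print(item["question"], item["options"])
```

Restricting distractors to objects of the same relation keeps the wrong options type-consistent, so the item tests knowledge rather than the ability to rule out answers of the wrong kind.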


literatureReferences:

  1. https://www.amazon.science/publications/grapheval-a-knowledge-graph-based-llm-hallucination-evaluation-framework
  2. https://aclanthology.org/2024.findings-naacl.155/
  3. https://arxiv.org/pdf/2401.00761


Answer 2: Analyzing LLM Biases through KG Comparisons

...

First Version: In the second process, the inputs, i.e., the evaluation samples, are enhanced with information from a KG to provide helpful or misleading context. KG nodes must first be extracted from the samples using, for example, RAG. Then, based on the extracted KG nodes, the top k nodes can be determined from the KG using any suitably efficient retrieval method. These nodes can then be used to enhance the input. For example, the nodes can be presented as “superior knowledge” in the prompt in order to carry out adversarial attacks that elicit biased responses from open- and closed-source LLMs. Finally, the output of the model is analyzed. Again, different evaluation methods and metrics can be applied in this final step.
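
The retrieval and enhancement steps can be sketched as follows. The sketch assumes the KG nodes are given as a mapping from node label to a short description and uses plain token overlap as the retrieval score; this stands in for whatever retrieval method is actually chosen. The names `top_k_nodes` and `enhance_prompt` and the prompt layout are hypothetical. The `header` parameter marks where helpful or deliberately misleading framing could be injected.

```python
import re


def tokenize(text: str) -> set:
    # Lowercased alphanumeric tokens; a real system might use embeddings instead.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_k_nodes(sample: str, kg_nodes: dict, k: int = 2) -> list:
    """Rank KG nodes by token overlap with the sample and return the top k labels."""
    sample_tokens = tokenize(sample)
    scored = []
    for label, description in kg_nodes.items():
        overlap = len(sample_tokens & tokenize(f"{label} {description}"))
        if overlap > 0:  # nodes without any overlap are never retrieved
            scored.append((overlap, label))
    # Highest overlap first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [label for _, label in scored[:k]]


def enhance_prompt(sample: str, kg_nodes: dict, k: int = 2,
                   header: str = "Context from the knowledge graph:") -> str:
    """Prepend the retrieved KG nodes to the evaluation sample."""
    nodes = top_k_nodes(sample, kg_nodes, k)
    context = "\n".join(f"- {label}: {kg_nodes[label]}" for label in nodes)
    return f"{header}\n{context}\n\nQuestion: {sample}"


kg_nodes = {
    "Marie Curie": "physicist who won two Nobel Prizes",
    "Albert Einstein": "physicist known for relativity",
    "Eiffel Tower": "landmark in Paris",
}
print(enhance_prompt("Which prizes did Marie Curie win?", kg_nodes))
```

The enhanced prompt is then sent to the LLM under evaluation, and its response is analyzed with the chosen evaluation method.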

literatureReferences:

  1. https://arxiv.org/abs/2405.04756
  2. http://arxiv.org/abs/2403.09963