
...

Automatic evaluation of LLMs is usually done by comparing generated model output with a desired result. For this, many well-established metrics, like direct matching or similarity metrics (BLEU, N-gram, ROUGE, BERTScore), are used. However, especially when the output deviates from the reference answers, conventional similarity metrics are insufficient to measure the factuality of the generated output. Incorporating information from knowledge graphs (KGs) into the evaluation can help ensure an accurate measurement of the factual integrity and reliability of LLM outputs.
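The limitation of surface-level similarity metrics can be made concrete with a minimal sketch: a crude BLEU-style unigram precision (a simplified stand-in for the full metrics named above) gives a high score to an output that contradicts the reference on the single fact that matters.

```python
# Minimal sketch: unigram-precision similarity (a crude BLEU-style score).
# Illustrates why surface overlap cannot capture factuality: a factually
# wrong output can still score high against the reference.
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    cand = candidate.lower().split()
    ref = Counter(reference.lower().split())
    if not cand:
        return 0.0
    matched = sum(min(count, ref[tok]) for tok, count in Counter(cand).items())
    return matched / len(cand)

reference = "Berlin is the capital of Germany"
wrong_but_similar = "Munich is the capital of Germany"
print(unigram_precision(wrong_but_similar, reference))  # 5/6 despite the factual error
```

Five of the six tokens match, so the score is about 0.83 even though the answer is wrong — exactly the gap that KG-based factuality checks are meant to close.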

Beyond this, there are several reasons why KGs should be used to support or enhance LLM evaluation.


Explanation of concepts 

  • Represented Knowledge: KG triples can be used to evaluate how much knowledge an LLM can leverage from the training process and how consistently this knowledge can be retrieved.
  • Factuality: KG triples can be used to evaluate the output of an LLM by extracting information from the output and comparing it with a KG to check factuality or knowledge coverage. Examples of this knowledge coverage would be political positions, cultural or sporting events, or current news information. Furthermore, the extracted KG triples can be used to evaluate tasks/features where a similarity comparison of the LLM output is undesirable. This is the case for identifying and evaluating hallucinations of LLMs.
  • Biases: The final reason is to use KGs to enhance LLM inputs with relevant information. This method is beneficial, for example, if the goal is to use in-context learning to provide relevant information for a specific task to the LLM. In addition, planned adversarial attacks can be carried out on the LLM to uncover biases or weak points. 
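The factuality check described above can be sketched as a simple triple lookup. The reference KG and the "extracted" triples below are toy placeholders standing in for a real extraction pipeline:

```python
# Hypothetical sketch: checking triples extracted from an LLM output
# against a reference KG. The KG contents are invented examples.
knowledge_graph = {
    ("Angela Merkel", "position_held", "Chancellor of Germany"),
    ("Berlin", "capital_of", "Germany"),
}

def check_factuality(extracted_triples):
    """Return the fraction of extracted triples supported by the KG."""
    if not extracted_triples:
        return 0.0
    supported = [t for t in extracted_triples if t in knowledge_graph]
    return len(supported) / len(extracted_triples)

llm_triples = [
    ("Berlin", "capital_of", "Germany"),                          # supported
    ("Angela Merkel", "position_held", "President of Germany"),  # hallucination
]
print(check_factuality(llm_triples))  # 0.5
```

A score below 1.0 flags unsupported claims directly, independent of how textually similar the output is to any reference answer.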

...

  • Meaningful graph representations: Meaningful graph representations formally represent the semantics that capture the meaning of a natural-language sentence. Various meaning representations can be used to describe the meaning of a sentence and therefore have to be well-defined before evaluating an LLM on factuality using KGs. The target and objective KGs should be mapped onto the same meaningful graph representation.
  • Information Extraction: Any evaluated LLM output must be encoded into the pre-defined KG meaning representation. The concepts behind this process are versatile, and multiple solutions have already been used and tested in research: text-to-graph encoders, graph generation models, KG construction prompts, or multi-component extraction, where entities, coreference resolutions, and relations are detected and extracted in successive stages.
  • KG similarity: Depending on the KG generation strategy, the target and objective KGs can be compared and analyzed at different levels and granularities. The general idea is to check whether each triple in the target KG is factually consistent given an objective KG (or context). For instance, a graph neural network (GNN) can encode edge representations derived from the corresponding entity nodes, and a pre-trained binary classification model can then classify each encoded edge as factual or non-factual.
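Two of the granularities mentioned above can be illustrated with set operations over triples — exact triple consistency and coarser entity coverage. This is an assumed simplification, not a specific published method:

```python
# Illustrative sketch (assumed granularities, toy data): comparing a target
# KG against an objective KG at triple level and at entity level.

def triple_consistency(target, objective):
    """Fraction of target triples present verbatim in the objective KG."""
    return sum(t in objective for t in target) / len(target)

def entity_coverage(target, objective):
    """Fraction of target entities that appear anywhere in the objective KG."""
    target_entities = {e for (s, _, o) in target for e in (s, o)}
    objective_entities = {e for (s, _, o) in objective for e in (s, o)}
    return len(target_entities & objective_entities) / len(target_entities)

objective_kg = {("Paris", "capital_of", "France"), ("France", "member_of", "EU")}
target_kg = {("Paris", "capital_of", "France"), ("Paris", "located_in", "Germany")}
print(triple_consistency(target_kg, objective_kg))  # 0.5
print(entity_coverage(target_kg, objective_kg))     # 2/3: Paris and France known, Germany not
```

A GNN-based edge classifier, as described above, generalizes this exact-match comparison to a learned, softer notion of consistency.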


Standards, Protocols, and Scientific Publications:

  • For meaningful graph representations, the standard protocols are, for instance, Abstract Meaning Representation (AMR) or Open Information Extraction (OpenIE). AMR is a semantic representation language that produces rooted, directed, edge-labeled, and leaf-labeled graphs. In AMR, the edges are semantic relations and the nodes are concepts. AMR has a fixed relation vocabulary of approximately 100 relations plus the inverse of each relation. In OpenIE, on the other hand, relation triples are represented as a subject, an open relation, and the object of the open relation. An open relation means that OpenIE does not use a fixed relation vocabulary. Each sentence is therefore represented as a directed acyclic graph, and an extractor enumerates all word pairs and predicts the relations in parallel.
  • KG Graph-Encoders ...
  • KG similarity ...
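The contrast between AMR's fixed relation vocabulary and OpenIE's open relations can be sketched as follows. The tiny vocabulary and the example triples are invented stand-ins (real AMR defines roughly 100 relations):

```python
# Toy contrast between a fixed relation vocabulary (AMR-like) and open
# relations (OpenIE-style). Vocabulary and triples are invented examples.
AMR_RELATIONS = {"ARG0", "ARG1", "location", "time"}  # tiny stand-in vocabulary

def validate_fixed_vocab(triples):
    """AMR-style constraint: every relation must come from the fixed vocabulary."""
    return all(rel in AMR_RELATIONS for _, rel, _ in triples)

# OpenIE-style triple: the relation is free text taken from the sentence itself.
openie_triples = [("the cat", "sat on", "the mat")]
amr_like_triples = [("cat", "location", "mat")]

print(validate_fixed_vocab(amr_like_triples))  # True
print(validate_fixed_vocab(openie_triples))    # False: "sat on" is an open relation
```

This difference matters for evaluation: a fixed vocabulary makes target and objective KGs directly comparable, while open relations require an additional alignment or normalization step.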

Answer 3: Analyzing LLM Biases through KG Comparisons

...

First Version: In the second process, the inputs, i.e., the evaluation samples, are enhanced with information from a KG to provide helpful or misleading context. KG nodes must first be extracted from the samples using, for example, RAG. Then, based on the extracted KG nodes, the top k nodes can be determined from the KG using any efficient retrieval method. These nodes can then be used to enhance the input. For example, the nodes can be presented as “superior knowledge” in the prompt in order to carry out adversarial attacks and obtain biased responses from open- and closed-source LLMs. Finally, the output of the model is analyzed. Again, different evaluation methods and metrics can be applied in the final step.
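The retrieval-and-enhancement step above can be sketched with a naive token-overlap scorer standing in for a real retrieval method; the facts and the "superior knowledge" prompt framing follow the description above, but all names are illustrative:

```python
# Hedged sketch of the input-enhancement process: retrieve the top-k most
# relevant KG facts by simple token overlap and prepend them to the prompt.
# The overlap scorer is a placeholder for an efficient retrieval method.
def score(query: str, fact: str) -> int:
    return len(set(query.lower().split()) & set(fact.lower().split()))

def enhance_prompt(query: str, kg_facts: list, k: int = 2) -> str:
    top_k = sorted(kg_facts, key=lambda f: score(query, f), reverse=True)[:k]
    context = "\n".join(f"- {fact}" for fact in top_k)
    return f"Superior knowledge:\n{context}\n\nQuestion: {query}"

facts = [
    "Berlin is the capital of Germany",
    "The Danube flows through ten countries",
    "Germany joined the EU in 1958",
]
print(enhance_prompt("What is the capital of Germany?", facts))
```

For an adversarial bias probe, the same mechanism would inject deliberately misleading facts instead, and the model's output would then be analyzed for whether it adopted them.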

...