Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Knowledge Graphs (KGs) provide a structured and reliable basis for evaluating the knowledge encoded in LLMs. Relational triples from the KGs can be used to systematically test whether an LLM can accurately retrieve relevant information. Additionally, in cases where direct comparisons between reference text and LLM-generated output fall short in assessing factual accuracy, the output can be converted into a meaningful representation to measure alignment with the KG. Finally, the neutral and structured nature of KG data makes it a valuable tool for identifying and analyzing potential biases within LLMs.

It should be noted that the accuracy of a LM's evaluation is inherently tied to the quality of the KG it relies on. This is especially relevant for publicly editable KGs, which are susceptible to factual inaccuracies due to unreliable or unverified sources and can even be intentionally manipulated to disseminate misinformation. Therefore, considering the quality and reliability of the underlying KG is crucial when evaluating the LM.


Answer 1: Using KGs to Evaluate LLM Represented Knowledge

...

  • While KGs are inherently structured to maintain consistency in factual representation, LMs do not always yield consistent answers, especially when queries are rephrased [2, 3]. Integrating KGs for evaluation set generation can address this by allowing multiple phrasings of a single query, all linked to the same answer and relational triple. This approach helps measure an LM’s robustness in recognizing equivalent rewordings of the same fact [1, 2, 3].
  • When using a question-answering format (to evaluate text-generating / autoregressive LMs; for the application, see Chapter Selected Applications / Question Answering), the free-form answer of the model needs to be compared to the reference answer [1]. While there are multiple ways of comparing the answer to the reference [1], no single approach is ideal. Multiple-choice-based approaches mitigate this problem entirely [1, 2, 6] but inherently have a limited answer space and may encourage educated guessing, simplifying the task by providing plausible options. Conversely, open-ended answers require the model to generate the correct response without cues and may align better with real-world use cases.
  • Verbalizing a triple requires not only labels for the subject and object (which are typically annotated with one or multiple labels) but also a rule [1] or template [2] that translates the formal triple into natural text. This may require humans to create one or multiple rules per relation. Depending on the target language and the number of relations used, this can be a non-negligible amount of work.

...