...

Answer 1: Using KGs to Evaluate LLM Represented Knowledge

Description:

Relational triples from a Knowledge Graph (KG) can be leveraged to create up-to-date and domain-specific datasets for evaluating the knowledge within Language Models (LMs) [1, 2, 4, 5, 6]. The LM can then be queried with the subject and relation to predict the object, either by using a question-answer pattern [6], by predicting the answer using masked language modeling [4, 5], or by selecting the correct statement from a multiple-choice item [1, 2]. KGs not only provide the correct answers but also enable the creation of plausible distractors (incorrect answer options), allowing the generation of multiple-choice items to evaluate the knowledge represented in an LM [1, 2].
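
The distractor idea above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in [1] or [2]: the toy triples, the function name build_multiple_choice_item, and the heuristic of drawing distractors from other objects of the same relation are all hypothetical choices made for this example.

```python
import random

# Hypothetical toy triples (subject, relation, object).
# A real setup would read them from a KG instead.
TRIPLES = [
    ("France", "capital", "Paris"),
    ("Germany", "capital", "Berlin"),
    ("Italy", "capital", "Rome"),
    ("Spain", "capital", "Madrid"),
    ("Japan", "capital", "Tokyo"),
]


def build_multiple_choice_item(triple, triples, num_distractors=3, rng=None):
    """Build one multiple-choice item for a triple.

    Distractors are objects of the same relation that are NOT a correct
    answer for this subject (an assumed heuristic for plausibility).
    """
    rng = rng or random.Random(0)
    subject, relation, obj = triple

    # Every object that is correct for this subject and relation is excluded,
    # so a distractor can never be a valid answer.
    correct = {o for s, r, o in triples if s == subject and r == relation}
    candidates = sorted({o for s, r, o in triples if r == relation} - correct)

    distractors = rng.sample(candidates, min(num_distractors, len(candidates)))
    options = distractors + [obj]
    rng.shuffle(options)

    return {
        "subject": subject,
        "relation": relation,
        "options": options,
        "answer_index": options.index(obj),
    }


item = build_multiple_choice_item(TRIPLES[0], TRIPLES)
```

Sampling distractors from objects of the same relation keeps them type-compatible with the correct answer (here, all options are capital cities), which is one simple way to make them plausible. Other strategies are possible.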

Considerations:

  • While KGs are inherently structured to maintain consistency in factual representation, LMs do not always yield consistent answers, especially when queries are rephrased [2, 3]. Using KGs to generate evaluation sets can address this by allowing multiple phrasings of a single query, all linked to the same answer and relational triple. This approach helps measure an LM’s robustness in recognizing equivalent rewordings of the same fact [1, 2, 3].
  • When using a question-answering format (to evaluate text-generating, autoregressive LMs), the free-form answer of the model needs to be compared to the reference answer [1]. While there are multiple ways of comparing the answer to the reference [1], no single approach is ideal. Multiple-choice-based approaches avoid this problem [1, 2, 6], but they inherently have a limited answer space and may encourage educated guessing, simplifying the task by providing plausible options. Conversely, open-ended answers require the model to generate the correct response without cues and may align better with real-world use cases.
  • Verbalizing a triple requires not only labels for the subject and object (entities are typically annotated with one or more labels), but also a rule [1] or template [2] that translates the formal triple into natural text. This may require humans to create one or more rules per relation. Depending on the target language and the number of relations used, this can be a non-negligible amount of work.
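
The considerations above (several phrasings per fact, comparing free-form answers to a reference, and hand-written templates per relation) can be illustrated with a short sketch. Everything in it is an assumption made for illustration: the template strings, the [X] placeholder convention, and the helper names verbalize, normalize, is_correct, and consistency are hypothetical, and normalized exact match is only one of several possible comparison strategies, not the one prescribed by the cited works.

```python
import re
import string

# Hypothetical templates: several phrasings per relation,
# each with an [X] slot for the subject label.
TEMPLATES = {
    "capital": [
        "What is the capital of [X]?",
        "Which city is the capital of [X]?",
    ],
}


def verbalize(subject, relation, templates=TEMPLATES):
    """Turn (subject, relation) into one question per template."""
    return [t.replace("[X]", subject) for t in templates[relation]]


def normalize(text):
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def is_correct(prediction, reference_labels):
    """Normalized exact match against any accepted label of the object."""
    return normalize(prediction) in {normalize(label) for label in reference_labels}


def consistency(predictions):
    """Share of predictions equal to the most frequent normalized answer.

    Expects a non-empty list with one prediction per phrasing of the same fact.
    """
    normalized = [normalize(p) for p in predictions]
    most_common = max(set(normalized), key=normalized.count)
    return normalized.count(most_common) / len(normalized)


questions = verbalize("France", "capital")
```

Each question in `questions` would be sent to the LM under evaluation. The answers can then be scored with `is_correct` against the labels of the object entity, and `consistency` gives a rough indication of how stable the model is across rephrasings of the same triple.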


Standards and Protocols and Scientific Publications:

The use of KGs for the generation of evaluation datasets has been employed successfully in several scientific publications [1, 2, 4, 5, 6, 7].


References:

  1. https://arxiv.org/pdf/2401.00761 (The Earth is Flat?)
  2. https://aclanthology.org/2024.findings-naacl.155/ (BEAR)
  3. https://arxiv.org/pdf/2204.06031 (Review)
  4. https://aclanthology.org/D19-1250/ (LAMA)
  5. https://www.akbc.ws/2022/assets/pdfs/15_kamel_knowledge_analysis_with_.pdf (KAMEL)
  6. https://aclanthology.org/N19-1421/ (CommonsenseQA)

Answer 2: Using KGs to Evaluate LLM Knowledge Coverage Factuality

...