...
KGs hold factual knowledge for various domains, which can be used to analyze and evaluate the knowledge coverage of LLMs [1]. This involves verifying the knowledge represented in an LLM against a KG. Similar to previous solutions, the target object can be predicted using QA patterns, masked language modeling, or selecting the correct statement from a multiple-choice item [7, 8]. However, due to the abstract structure of KG triples, the information embedded in a KG cannot be compared with the target object of an LLM using strict matching or similarity metrics. Therefore, the output prediction has to be transformed into a meaning representation that describes the core semantic concepts and relations of an output sequence [8]. Meaning representations should be extracted as a graph-based semantic representation. The congruence of the extracted target graph and an objective KG can then be evaluated, and missing or misplaced relations as well as missing or false nodes can be detected [7, 8].
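A minimal sketch of this graph-level comparison, assuming both sides are already mapped onto the same meaning representation; the toy triples and simple set operations are illustrative assumptions, not the trained comparison models of [7, 8]:

```python
# Toy objective KG and a target graph extracted from an LLM output sequence.
# Real pipelines obtain the target graph with an extraction model (see the
# "Information extraction" consideration below).
objective_kg = {
    ("Berlin", "capital_of", "Germany"),
    ("Germany", "member_of", "European Union"),
}
target_graph = {
    ("Berlin", "capital_of", "Germany"),  # consistent with the objective KG
    ("Berlin", "capital_of", "Austria"),  # misplaced relation
}

def nodes(graph):
    """Collect all entity nodes appearing in a set of triples."""
    return {n for s, _, o in graph for n in (s, o)}

supported = target_graph & objective_kg
false_or_misplaced = target_graph - objective_kg
missing_nodes = nodes(objective_kg) - nodes(target_graph)

print("supported:", supported)
print("false/misplaced:", false_or_misplaced)
print("missing nodes:", missing_nodes)  # knowledge the LLM did not reproduce
```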
Considerations:
- Graph-based meaning representations: A meaning representation formally captures the semantics of a sentence in natural language. Since various representations can describe the meaning of the same sentence, the representation has to be well-defined before evaluating an LLM on factuality using KGs. The target graph and the objective KG should be mapped onto the same meaning representation [8].
- Information extraction: Any evaluated LLM output must be encoded into the pre-defined meaning representation. Multiple solutions for this step have been used and tested in research: text-to-graph generation models [8, 9], KG construction prompts [7], or multi-component extraction, where entities, coreferences, and relations are detected and extracted in multiple stages [7].
- KG factuality: Depending on the KG generation strategy, the target and objective KG can be compared and analyzed at different levels of granularity. The general idea is to check whether each triple in the target KG is factually consistent with the objective KG (or context). For instance, a graph neural network (GNN) that encodes edge representations derived from the corresponding entity nodes can be trained to classify each encoded edge as factual or non-factual [8], as sketched below.
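A minimal sketch of this edge-classification idea, assuming pre-computed entity-node embeddings; the dimensions and the concatenation-based edge encoding are assumptions, and the actual FactGraph architecture [8] differs in detail:

```python
import torch
import torch.nn as nn

class EdgeFactualityClassifier(nn.Module):
    """Binary classifier over edge representations built from entity nodes."""

    def __init__(self, node_dim: int = 128):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(2 * node_dim, node_dim),  # fuse subject/object embeddings
            nn.ReLU(),
            nn.Linear(node_dim, 1),
        )

    def forward(self, subj_emb: torch.Tensor, obj_emb: torch.Tensor) -> torch.Tensor:
        edge_repr = torch.cat([subj_emb, obj_emb], dim=-1)  # edge from its nodes
        return torch.sigmoid(self.scorer(edge_repr))  # P(edge is factual)

# Toy usage: in practice the node embeddings come from a trained graph encoder,
# and the classifier is trained on factual/non-factual edge labels.
clf = EdgeFactualityClassifier()
subj, obj = torch.randn(4, 128), torch.randn(4, 128)  # 4 candidate edges
print(clf(subj, obj).shape)  # torch.Size([4, 1])
```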
Standards, Protocols, and Scientific Publications:
- For graph-based meaning representations, the standard protocols are, for instance, Abstract Meaning Representation (AMR) [10] and Open Information Extraction (OpenIE) [11]. AMR is a semantic representation language whose instances are rooted, directed, edge-labeled, and leaf-labeled graphs [10]. In AMR, the edges are semantic relations, and the nodes are concepts. AMR has a fixed vocabulary of approximately 100 relations plus the inverse of each relation. In OpenIE, on the other hand, relation triples are represented as a subject, an open relation, and the object of the open relation; open means that OpenIE does not have a fixed relation vocabulary. In [11], each sentence is represented as a directed acyclic graph, and an extractor enumerates all word pairs and predicts their relations in parallel.
- Extracting information from text and generating or enhancing a KG from it will be discussed in Chapter 4.2. NLP tasks like named entity recognition, coreference resolution, and relation extraction are well-established problems in this field of research and are solved using either generative LLMs or fine-tuned language models [12, 13]. The third option, prompting an LLM to generate a KG, is based on two techniques: in-context learning and chain-of-thought reasoning (explained in Section 4) [7].
- KG factuality: The standard protocol for checking the factuality of a KG generated from an LLM output sequence is to encode the KG using an LLM or a GNN and predict factuality via binary classification [8, 14]. For both models, context can be provided in addition to the generated KG for higher precision. The GNN has to be fine-tuned for factuality prediction. With an LLM, prompting can be used to predict the factuality of KG triples [7]; the prompt can be enhanced with in-context learning examples or the context of factual KG relations [14] (see the sketch after this list).
- Current publications that use the explained techniques are GraphEval [7] and FactGraph [8]. GraphEval uses SOTA LLMs like LLaMA to extract and generate the KG from a given model output; the framework then checks each extracted triple for factual consistency with the provided context. FactGraph builds on text and graph encoders that are augmented with structure-aware adapters to classify factuality [8].
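A minimal sketch of the prompting-based pipeline, assuming some chat/completion API behind the call_llm placeholder; the prompt wording is illustrative and not GraphEval's exact prompt [7]:

```python
# Step 1: ask an LLM to construct a KG from the evaluated output sequence.
def build_kg_construction_prompt(output_text: str) -> str:
    return (
        "Extract a knowledge graph from the text below as a list of "
        "(subject, relation, object) triples, one per line.\n\n"
        f"Text: {output_text}"
    )

# Step 2: ask an LLM to judge each triple against the provided context.
def build_factuality_prompt(triple: tuple, context: str) -> str:
    s, r, o = triple
    return (
        f"Context: {context}\n\n"
        f"Is the statement ({s}, {r}, {o}) factually consistent with the "
        "context above? Answer 'factual' or 'non-factual'."
    )

def call_llm(prompt: str) -> str:
    """Placeholder for any LLM API; in-context examples can be prepended [14]."""
    raise NotImplementedError

# Usage sketch:
#   triples = parse_triples(call_llm(build_kg_construction_prompt(llm_output)))
#   labels  = [call_llm(build_factuality_prompt(t, context)) for t in triples]
```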
Answer 3: Analyzing LLM Biases through KG Comparisons
...
KGs can also enhance model inputs with structured KG information instead of extracting meaning and knowledge from LLM outputs. For bias detection, in-context samples can be generated from domain-specific KGs to establish them as biased "superior knowledge" that manipulates the prediction of LLMs and tests their robustness [15]. This technique can be seen as an adversarial attack because the model is manipulated to check for bias leveraged from pre-training that is not mitigated by red-teaming or other bias mitigation techniques. The first step in setting up such an evaluation pipeline is to define a KG, or extract relevant subgraphs from a larger KG, covering a desirable evaluation bias or a biased context; such a KG is called a bias KG. The bias KG can then be encoded, and the top-k nodes representing a bias with respect to a context or gold standard can be extracted using an efficient retrieval method [15, 16]. With those top-k biased nodes, a set of k in-context samples can be generated using a graph-to-text generation model and embedded in the input prompt for evaluating an LLM. The generated output is then evaluated with respect to the pre-defined bias.
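A minimal sketch of the retrieval step, assuming an off-the-shelf sentence-embedding model as the retriever; the model name, the placeholder node texts, and k are illustrative assumptions rather than the exact BiasKG setup [15]:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

# Textual forms of bias KG nodes (innocuous placeholders; a real bias KG
# holds stereotype statements tied to sensitive attributes).
bias_nodes = [
    "stereotype linking attribute A to occupation X",
    "stereotype linking attribute B to trait Y",
    "stereotype linking attribute A to behavior Z",
]
context = "question probing the model about occupation X"  # evaluation context
k = 2

node_emb = model.encode(bias_nodes)
ctx_emb = model.encode(context)
scores = util.cos_sim(ctx_emb, node_emb)[0]  # similarity of context to nodes
top_k = [bias_nodes[int(i)] for i in scores.argsort(descending=True)[:k]]

# The top-k nodes are then verbalized by a graph-to-text model and embedded
# into the input prompt as adversarial "superior knowledge".
print(top_k)
```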
Considerations:
- The bias KG is generated with sensitive attributes that can be considered potential bias targets [15]. Defining those attributes is important for the quality of the adversarial attacks. A closed set of sensitive attributes is advised for the bias evaluation so that the results can be analyzed properly (see the sketch below). Various approaches can be used to generate the bias KG; most of the established ones are discussed in Section 4, in Answer 2 of this section, and in Chapter 4.2.
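A minimal sketch of enforcing such a closed attribute set during bias KG construction; the attribute names and triple schema are illustrative assumptions:

```python
# Closed set of sensitive attributes, fixed before generating the bias KG.
SENSITIVE_ATTRIBUTES = {"gender", "religion", "nationality", "age"}

def admit_bias_triple(attribute: str, relation: str, obj: str) -> bool:
    """Only admit triples that target a pre-defined sensitive attribute,
    keeping every adversarial sample traceable during analysis."""
    return attribute in SENSITIVE_ATTRIBUTES

assert admit_bias_triple("gender", "associated_with", "occupation X")
assert not admit_bias_triple("favorite_color", "associated_with", "trait Y")
```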
Standards, Protocols, and Scientific Publications:
- The standard protocols for each step are partially discussed in other sections; RAG systems are the closest best practice for this evaluation technique (see Section 4). In BiasKG [15], for instance, the bias KG is constructed from free text using RAG methodology: KG triples and entities are mapped into a vectorized embedding space, where they can be clustered and retrieved for the input text generation, as sketched below.
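A minimal sketch of the embedding-and-clustering step, assuming a sentence-embedding model and k-means; both choices are assumptions, as [15] only specifies a vectorized embedding space:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

# Linearize bias KG triples so they can be embedded (placeholder content).
triples = [
    ("group A", "stereotyped_as", "trait X"),
    ("group A", "stereotyped_as", "trait Y"),
    ("group B", "stereotyped_as", "trait Z"),
]
texts = [" ".join(t) for t in triples]

embeddings = model.encode(texts)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
print(list(zip(texts, clusters)))  # cluster ids group related bias statements
```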
References
- [1] Wenxuan Wang et al. The Earth is Flat? Unveiling Factual Errors in Large Language Models (2024)
- [2] Jacek Wiland et al. BEAR: A Unified Framework for Evaluating Relational Knowledge in Causal and Masked Language Models (2024)
- [3] Badr AlKhamissi et al. A Review on Language Models as Knowledge Bases (2022)
- [4] Fabio Petroni et al. Language Models as Knowledge Bases? (2019)
- [5] Jan-Christoph Kalo et al. KAMEL: Knowledge Analysis with Multitoken Entities in Language Models (2022)
- [6] Alon Talmor et al. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge (2019)
- [7] Hannah Sansford et al. GraphEval: A Knowledge Graph-Based LLM Hallucination Evaluation Framework (2023)
- [8] Leonardo F. R. Ribeiro et al. FactGraph: Evaluating Factuality in Summarization with Semantic Graph Representations (2022)
- [9] Yu Wang et al. Large Graph Generative Models (2024)
- [10] Laura Banarescu et al. Abstract Meaning Representation for Sembanking (2013)
- [11] Bowen Yu et al. Towards Generalized Open Information Extraction (2022)
- [12] Hanwen Zheng et al. A Survey of Document-Level Information Extraction (2023)
- [13] Derong Xu et al. Large Language Models for Generative Information Extraction: A Survey (2024)
- [14] Joshua Maynez et al. On Faithfulness and Factuality in Abstractive Summarization (2020)
- [15] Chu Fei Luo et al. BiasKG: Adversarial Knowledge Graphs to Induce Bias in Large Language Models (2024)
- [16] Zhilin Yang et al. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering (2018)
- [17] Ziyang Xu et al. Take Care of Your Prompt Bias! Investigating and Mitigating Prompt Bias in Factual Knowledge Extraction (2024)
Links:
https://arxiv.org/pdf/2401.00761 (The Earth is Flat?)
https://aclanthology.org/2024.findings-naacl.155/ (BEAR)
https://arxiv.org/pdf/2204.06031 (Review)
https://aclanthology.org/D19-1250/ (LAMA)
https://www.akbc.ws/2022/assets/pdfs/15_kamel_knowledge_analysis_with_.pdf (KAMEL)
https://aclanthology.org/N19-1421/ (CommonsenseQA)
...
https://www.amazon.science/publications/grapheval-a-knowledge-graph-based-llm-hallucination-evaluation-framework (GraphEval)
https://aclanthology.org/2022.naacl-main.236.pdf (FactGraph)
https://arxiv.org/pdf/2406.05109 (Large Graph Generative Models)
https://aclanthology.org/W13-2322.pdf (AMR)
https://aclanthology.org/2022.findings-emnlp.103.pdf (OpenIE)
https://arxiv.org/pdf/2309.13249 (Document-Level IE Survey)
https://arxiv.org/pdf/2312.17617 (Generative IE Survey)
https://aclanthology.org/2020.acl-main.173.pdf (On Faithfulness and Factuality in Abstractive Summarization)
https://arxiv.org/abs/2405.04756 (BiasKG)
https://aclanthology.org/D18-1259/ (HotpotQA)
http://arxiv.org/abs/2403.09963 (Take Care of Your Prompt Bias!)