Page History
First draft to be created by 11 October 2024.
ADD NEW TOP LEVEL SECTION: LLM TRAINING
...
- Diego Collarana (FIT)
- Daniel Baldassare (doctima) – Lead
- Michael Wetzel (Coreon)
- Rene Pietzsch (ECC)
- ...
Description:
Verbalizing knowledge graphs for LLMs is the task of representing knowledge graphs as text so that they can be written directly into the prompt, the main input source of LLMs. Verbalization consists of finding textual representations for nodes, relationships between nodes, and their metadata. It can take place at different stages of the LLM lifecycle, during training (pre-training, instruction fine-tuning) or during inference (in-context learning), and involves:
- Marking the boundaries of graph data using special tokens, as already done for SQL queries (Improving Generalization in Language Model-Based Text-to-SQL Semantic Parsing: Two Simple Semantic Boundary-Based Techniques)
- Encoding strategies for nodes, relationships between nodes, node communities, and metadata (Talk like a graph: Encoding graphs for large language models, research.google)
- Deciding what needs to be verbalized and where: the system prompt for static information such as the KG schema, the user prompt for data instances (a minimal sketch of triple verbalization follows below)
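A minimal sketch of such a verbalization step is shown below. The boundary tokens <graph> and </graph> and the label lookup are hypothetical; the concrete tokens and templates would depend on the target model and KG.

```python
# Minimal sketch of KG verbalization for prompting (illustrative only).
# The boundary tokens <graph>/</graph> and the label dictionary are assumptions.

def verbalize_triples(triples, labels):
    """Render (subject, relation, object) triples as one text block."""
    lines = [f"{labels.get(s, s)} {labels.get(r, r)} {labels.get(o, o)}." for s, r, o in triples]
    return "<graph>\n" + "\n".join(lines) + "\n</graph>"

labels = {"Q183": "Germany", "P36": "has capital", "Q64": "Berlin"}
prompt = (
    "Answer using only the facts below.\n"
    + verbalize_triples([("Q183", "P36", "Q64")], labels)
    + "\nQuestion: What is the capital of Germany?"
)
print(prompt)
```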
Considerations:
Standards:
Answer 2: Integrate KGs during pre-training
Description: These methods use KG knowledge directly during the LLM pre-training phase by modifying the encoder side of the transformer architecture and improving the training tasks.
LLMs do not process KG structure directly; therefore, a KG representation that can be combined with text embeddings is necessary, i.e., KG embeddings. The text and the (sub)graph pre-training data also need to be aligned. To allow the LLM to learn from KG embeddings, there are three main modifications to the transformer-encoder architecture (which future research may extend):
...
a) Incorporate a knowledge encoder to fuse textual context (text embeddings) and knowledge context (KG embeddings). The LLM can stay frozen and reuse just the output of the transformer encoder.
b) Insert knowledge encoding layers in the middle of the transformer layers to adjust the encoding mechanism, enabling the LLM to process knowledge from the KG.
c) Add independent adapters to process knowledge. These adapters are matched one-to-one with the transformer layers and are easy to train because they do not affect the parameters of the original LLM during pre-training (a sketch of this variant follows below).
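As an illustration of variant (c), the following sketch shows a small residual adapter that projects KG embeddings into the hidden space of a frozen transformer layer. The dimensions, module names, and fusion strategy are assumptions for illustration, not a specific published architecture.

```python
# Illustrative sketch of variant (c): a small knowledge adapter attached to a
# frozen transformer layer. Dimensions, names, and fusion strategy are assumptions.
import torch
import torch.nn as nn

class KnowledgeAdapter(nn.Module):
    def __init__(self, hidden_dim: int, kg_dim: int, bottleneck: int = 64):
        super().__init__()
        self.kg_proj = nn.Linear(kg_dim, hidden_dim)  # map KG embeddings into text space
        self.down = nn.Linear(hidden_dim, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_dim)
        self.act = nn.GELU()

    def forward(self, hidden_states, kg_embeddings):
        # hidden_states: (batch, seq_len, hidden_dim) from a frozen transformer layer
        # kg_embeddings: (batch, seq_len, kg_dim), aligned to tokens (zeros where no entity)
        fused = hidden_states + self.kg_proj(kg_embeddings)
        return hidden_states + self.up(self.act(self.down(fused)))  # residual adapter

# Only the adapter parameters are trained; the original LLM parameters stay frozen.
adapter = KnowledgeAdapter(hidden_dim=768, kg_dim=200)
h = torch.randn(2, 16, 768)   # hidden states of 2 sequences with 16 tokens
e = torch.zeros(2, 16, 200)   # KG embeddings (zeros = no linked entity)
print(adapter(h, e).shape)    # torch.Size([2, 16, 768])
```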
...
- Simple concatenation of KG triples with text
- Entity/Token alignment prediction
Considerations:
- Simple concatenation of tokens and triples from KG can cause "knowledge noise"
Standards:
- Predicting alignment links between tokens and entities
- Adding entity embeddings and an additional entity-prediction task to the token-only pre-training objective
Although nothing prohibits implementing all these modifications simultaneously, we recommend implementing just one of these variations during LLM pre-training.
The pre-training task allows the LLM to learn about and model the world. Thus, another option is to modify the pre-training task itself. On the encoder side of LLMs, the typical task is to mask words in the context. A simple modification is to mask not random words but entities represented in the KG (a minimal sketch follows below). Another option is multi-task pre-training, i.e., combining masked-word prediction with KG link prediction.
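The following sketch illustrates the idea of entity-oriented masking. Whitespace matching and the [MASK] placeholder are simplifications; a real setup would operate on subword tokens and use the model's own mask token.

```python
# Minimal sketch of entity-oriented masking: instead of masking random tokens,
# spans that match KG entity labels are masked (simplified for illustration).
import random

def mask_entities(text, entity_labels, mask_token="[MASK]", prob=0.5):
    masked = text
    for label in entity_labels:
        # mask a random subset of the entity mentions found in the text
        if label in masked and random.random() < prob:
            masked = masked.replace(label, mask_token)
    return masked

kg_entities = ["Berlin", "Germany"]  # entity labels taken from the KG
print(mask_entities("Berlin is the capital of Germany.", kg_entities))
```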
Considerations:
- As a result, we have LLMs with better language understanding.
- Empirical evaluation has shown that this combination can improve reasoning capabilities in LLMs.
- Tail entities, i.e., entities not frequently mentioned in the text, are better learned and modeled by the resulting LLM.
- Harmonizing and combining heterogeneous embedding spaces, such as text and graph embeddings, is challenging. Therefore, experts in NLP and Graph Machine Learning are required to properly apply these methods.
- More resources are required because the pre-training time is extended.
...
First draft to be created by 11 October 2024.
...
How do I enhance LLM explainability by using KGs? (2.2 – Answer Verification) – length: up to one page
...
Automatic evaluation of LLMs is usually done by comparing the generated model output with a desired result. Therefore, many well-established metrics, such as direct matching or similarity metrics (BLEU, n-gram overlap, ROUGE, BERTScore), are used. However, especially when the output deviates from the reference answers, conventional similarity metrics are insufficient to measure the factuality of the generated output (see the toy example below). Incorporating information from knowledge graphs (KGs) into the evaluation can help ensure an accurate measurement of the factual integrity and reliability of LLM outputs.
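The following toy example illustrates this point, using simple token-overlap F1 as a stand-in for surface-similarity metrics: a factually wrong answer can still receive a high similarity score.

```python
# Toy illustration: surface similarity does not capture factuality.
# Token-level F1 is used here as a simple stand-in for metrics such as ROUGE.

def token_f1(prediction: str, reference: str) -> float:
    pred, ref = prediction.lower().split(), reference.lower().split()
    common = sum(min(pred.count(t), ref.count(t)) for t in set(pred))
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "Marie Curie was born in Warsaw in 1867."
wrong_but_similar = "Marie Curie was born in Paris in 1867."
print(token_f1(wrong_but_similar, reference))  # ~0.88 despite the factual error
```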
There are several reasons why KGs should be used to support or enhance these evaluations.
Explanation of concepts
- Represented Knowledge: KG triples can be used to evaluate how much knowledge an LLM can leverage from the training process and how consistently this knowledge can be retrieved.
- Factuality: KG triples can be used to evaluate the output of an LLM by extracting information from the output and comparing it with a KG to check factuality or knowledge coverage. Examples of this knowledge coverage would be political positions, cultural or sporting events, or current news. Furthermore, the extracted KG triples can be used to evaluate tasks where a similarity comparison of the LLM output is undesirable, as is the case for identifying and evaluating hallucinations of LLMs.
- Biases: The final reason is to use KGs to enhance LLM inputs with relevant information, for example when in-context learning is used to provide task-specific information to the LLM. In addition, planned adversarial attacks can be carried out on the LLM to uncover biases or weak points.
Properties to evaluate the LLM on:
- Represented Knowledge: Which fact queries can the LLM answer correctly and consistently?
- Factuality: When generating an output, are the facts the LLM uses in its answer correct?
- Biases: How can bias in LLMs be detected and mitigated using KGs?
Brief description of the state of the art
Knowledge Graphs (KGs) provide a structured and reliable basis for evaluating the knowledge encoded in LLMs. Relational triples from the KGs can be used to systematically test whether an LLM can accurately retrieve relevant information. Additionally, in cases where direct comparisons between reference text and LLM-generated output fall short in assessing factual accuracy, the output can be converted into a meaningful representation to measure alignment with the KG. Finally, the neutral and structured nature of KG data makes it a valuable tool for identifying and analyzing potential biases within LLMs.
Answer 1: Using KGs to Evaluate LLM Represented Knowledge
Description:
Relational triples from a Knowledge Graph (KG) can be leveraged to create up-to-date and domain-specific datasets for evaluating knowledge within Language Models (LMs) [1, 2, 4, 5, 6]. The LLM can then be queried with the subject and relation to predict the object, using either a question-answer pattern [6], masked language modeling [4, 5], or selection of the correct statement from a multiple-choice item [1, 2]. KGs provide the correct answers and enable the creation of plausible distractors (incorrect answer options), allowing the generation of multiple-choice items to evaluate the knowledge represented in an LM [1, 2].
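A minimal sketch of turning a KG triple into evaluation items is shown below. The verbalization template and the distractor pool are hypothetical; in practice, one or more templates are defined per relation and distractors are sampled from the KG itself.

```python
# Sketch of building evaluation items from KG triples (illustrative).
import random

def cloze_item(subject, template, obj):
    # masked-language-modeling style item: the object is replaced by [MASK]
    return template.format(subject=subject, object="[MASK]"), obj

def multiple_choice_item(subject, template, obj, distractors, k=3):
    options = random.sample(distractors, k) + [obj]
    random.shuffle(options)
    question = template.format(subject=subject, object="___")
    return question, options, obj

template = "The capital of {subject} is {object}."
print(cloze_item("France", template, "Paris"))
print(multiple_choice_item("France", template, "Paris",
                           ["Lyon", "Marseille", "Berlin", "Madrid"]))
```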
...
- While KGs are inherently structured to maintain consistency in factual representation, LMs do not always yield consistent answers, especially when queries are rephrased [2, 3]. Integrating KGs for evaluation set generation can address this by allowing multiple phrasings of a single query, all linked to the same answer and relational triple. This approach helps measure an LM’s robustness in recognizing equivalent rewordings of the same fact [1, 2, 3].
- When using a question-answering format (to evaluate text-generating / autoregressive LMs; for the application, see Chapter Selected Applications / Question Answering), the free-form answer of the model needs to be compared to the reference answer [1]. While there are multiple ways of comparing the answer to the reference [1], no single approach is ideal. Multiple-choice-based approaches mitigate this problem entirely [1, 2, 6] but inherently have a limited answer space and may encourage educated guessing, simplifying the task by providing plausible options. Conversely, open-ended answers require the model to generate the correct response without cues and may align better with real-world use cases.
- Verbalizing a triple requires not only labels for the subject and object (which are typically annotated with one or multiple labels) but also a rule [1] or template [2] that translates the formal triple into natural text. This may require humans to create one or multiple rules per relation. Depending on the target language and the number of relations used, this can be a non-negligible amount of work.
- The accuracy of an LM evaluation is inherently tied to the quality of the KG it relies on. This is especially relevant for publicly editable KGs, which are susceptible to factual inaccuracies due to unreliable or unverified sources and can even be intentionally manipulated to disseminate misinformation. Therefore, considering the quality and reliability of the underlying KG is crucial when evaluating the LM.
...
Answer 2: Using KGs to Evaluate LLM Factuality
Maybe add additional properties such as factuality, correctness, precision, etc., or perhaps keep the ones we have right now and call them "selected properties" ... (We could move the definition of these properties to the top and discuss which answer addresses which property.)
Description:
KGs hold factual knowledge for various domains, which can be used to analyze and evaluate LLM knowledge coverage [1]. This involves verifying the knowledge represented in an LLM using KGs. Similar to the previous solutions, the target object can be predicted using QA patterns [7, 8]. However, the information embedded in a KG cannot be compared with the target output of an LLM using strict matching or similarity metrics, due to the abstract structure of KG triples. Therefore, the output prediction has to be transformed into a meaning representation that describes the core semantic concepts and relations of an output sequence [8]. Meaning representations should be extracted as a graph-based semantic representation. The congruence of the extracted target graph and an objective KG can then be evaluated, and missing or misplaced relations as well as missing or false nodes can be detected [7, 8].
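The following sketch illustrates the comparison step, assuming a triple extractor is available; the extractor itself is stubbed so that only the graph-comparison logic is shown.

```python
# Sketch of factuality checking via graph comparison (illustrative).
# extract_triples stands in for an OpenIE/AMR-style or LLM-based extractor.

def extract_triples(text):
    # Placeholder for a meaning-representation extractor.
    return {("marie curie", "born in", "paris")}

reference_kg = {
    ("marie curie", "born in", "warsaw"),
    ("marie curie", "awarded", "nobel prize in physics"),
}

predicted = extract_triples("Marie Curie was born in Paris.")
print("supported:", predicted & reference_kg)    # facts confirmed by the KG
print("unsupported:", predicted - reference_kg)  # candidate errors / hallucinations
```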
...
- For graph-based meaning representations, standard protocols are, for instance, Abstract Meaning Representation (AMR) [10] and Open Information Extraction (OpenIE) [11]. AMR is a semantic representation language generated as rooted, directed, edge-labeled, and leaf-labeled graphs [10]. In AMR, the edges are semantic relations, and the nodes are concepts. AMR has a fixed relation vocabulary of approximately 100 relations and the inverse of each relation. In OpenIE, on the other hand, relation triples are represented as a subject, an open relation, and the object of the open relation. An open relation means that OpenIE does not use a fixed relation vocabulary. Therefore, each sentence is represented as a directed acyclic graph, and an extractor is used to enumerate all word pairs and make a parallel prediction of the relation [11].
- Extracting information from a text and generating or enhancing a KG from it will be discussed in Chapter 4.2. NLP tasks like named entity recognition, coreference resolution, and relation extraction are well-established problems in this field of research that are solved using either generative LLMs or fine-tuned language models [12, 13]. The third option of using prompting to generate a KG is based on two techniques: in-context learning and chain-of-thought reasoning (explained in Section 4) [7].
- KG factuality: The standard protocol for checking the factuality of a KG generated from an LLM output sequence is to encode the KG using an LLM or a GNN and predict the factuality via binary classification [8, 14]. For both models, context can be provided in addition to the generated KG for higher precision in the prediction. For this task, the GNN has to be fine-tuned for factuality prediction. When using an LLM for the prediction, prompting can be used to predict the factuality of KG triples [7]. The prompt can be enhanced with in-context learning examples or the context of factual KG relations [14].
- Current publications that use these techniques are GraphEval [7] and FactGraph [8]. GraphEval uses SOTA LLMs like LLaMA to extract and generate the KG from a given model output and then checks each extracted triple for factual consistency given the provided context (see Section 3). FactGraph builds on text and graph encoders that are augmented with structure-aware adapters to classify factuality [8]. (A sketch of a prompting-based consistency check follows below.)
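Below is a sketch of such a prompting-based consistency check for an extracted triple. The prompt wording and the call_llm placeholder are assumptions; they would be replaced by the prompt design and model API of the concrete setup, and do not reproduce the GraphEval prompts.

```python
# Sketch of a prompting-based consistency check for an extracted triple.

def call_llm(prompt: str) -> str:
    # Placeholder: replace with a real model call.
    return "inconsistent"

def judge_triple(triple, context):
    s, r, o = triple
    prompt = (
        f"Context:\n{context}\n\n"
        f"Statement: {s} {r} {o}.\n"
        "Is the statement supported by the context? Answer 'consistent' or 'inconsistent'."
    )
    return call_llm(prompt).strip().lower() == "consistent"

print(judge_triple(("Marie Curie", "born in", "Paris"),
                   "Marie Curie was born in Warsaw in 1867."))  # False
```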
...
KGs can also be used to enhance model inputs with structured KG information, instead of extracting meaning and knowledge from LLM outputs. For bias detection, in-context samples can be generated from domain-specific KGs and established as biased "superior knowledge" to manipulate the predictions of LLMs and test their robustness [15]. This technique can be seen as an adversarial attack, because the model is manipulated to check for bias leveraged from pre-training that is not mitigated by red-teaming or other bias mitigation techniques. The first step in setting up such an evaluation pipeline is to define a KG, or extract relevant subgraphs from a larger KG, covering the bias to be evaluated or a biased context. Such a KG is called a bias KG. The nodes of the bias KG can be encoded, and the top k nodes representing a bias with respect to a context or gold standard can be extracted using any efficient retrieval method [15, 16]. From those top k biased nodes, a set of k in-context samples can be generated using a graph-to-text generation model and embedded in the input prompt for evaluating an LLM. The generated output is then evaluated with respect to the pre-defined bias.
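A sketch of this pipeline is shown below. The retrieval step and the graph-to-text step are stubbed with placeholders; in practice they would be an embedding-based retriever and a graph-to-text model, as in BiasKG [15].

```python
# Sketch of the bias-evaluation pipeline described above (illustrative).

def retrieve_top_k_nodes(bias_kg, context, k=3):
    # Placeholder: rank bias-KG nodes by relevance to the context.
    return bias_kg[:k]

def graph_to_text(node):
    # Placeholder for a graph-to-text generation model.
    return f"It is claimed that {node['statement']}."

def build_adversarial_prompt(bias_kg, context, question):
    samples = [graph_to_text(n) for n in retrieve_top_k_nodes(bias_kg, context)]
    return "\n".join(samples) + f"\n\nContext: {context}\nQuestion: {question}"

bias_kg = [{"statement": "group X performs worse at task Y"}]  # toy biased node
print(build_adversarial_prompt(bias_kg, "a hiring decision", "Who should be hired?"))
# The LLM's answer to such a prompt is then checked against the pre-defined bias.
```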
Considerations:
Identifying and evaluating bias in LLMs is a rising research topic with ethical impact and demands for the research community [17]. Improving this field of research using KGs is a recent development, with more publications expected in the near future. Although the current research is somewhat limited compared to the other topics, we decided to dedicate a chapter to this topic because of its importance and expected growth.
- The bias KG is generated with sensitive attributes that can be considered potential bias targets [15]. Defining those attributes is important for the quality of the adversarial attacks. A closed set of sensitive attributes is advised for the bias evaluation so that the results can be analyzed properly. Various approaches can be used to generate the bias KG; most of the established ones are discussed in Section 4, in Answer 2 of this section, and in Chapter 4.2.
...
- The standard protocols for each step are partially discussed in different sections; RAG systems would be the closest best practice for this evaluation technique (see Section 4). In BiasKG [15], for instance, the bias KG is constructed from free text using RAG methodology: KG triples and entities are mapped into a vectorized embedding space, where they can be clustered and retrieved for input text generation (a retrieval sketch follows below).
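The following sketch illustrates such a vectorized retrieval over KG triples. The embedding function is a toy placeholder; real systems would use a sentence- or graph-embedding model.

```python
# Sketch of RAG-style retrieval over KG triples (illustrative).
import math

def embed(text):
    # Toy placeholder: hash characters into a small fixed-size vector.
    vec = [0.0] * 8
    for i, ch in enumerate(text.lower()):
        vec[i % 8] += ord(ch)
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve(triples, query, k=2):
    scored = [(cosine(embed(" ".join(t)), embed(query)), t) for t in triples]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in scored[:k]]

triples = [("group X", "associated with", "stereotype Y"),
           ("person", "works at", "company")]
print(retrieve(triples, "stereotypes about group X", k=1))
```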
References
- [1] Wenxuan Wang et al. The Earth is Flat? Unveiling Factual Errors in Large Language Models (2024). https://arxiv.org/pdf/2401.00761
- [2] Jacek Wiland et al. BEAR: A Unified Framework for Evaluating Relational Knowledge in Causal and Masked Language Models (2024). https://aclanthology.org/2024.findings-naacl.155/
- [3] Badr AlKhamissi et al. A Review on Language Models as Knowledge Bases (2022). https://arxiv.org/pdf/2204.06031
- [4] Fabio Petroni et al. Language Models as Knowledge Bases? (2019). https://aclanthology.org/D19-1250/
- [5] Jan-Christoph Kalo et al. KAMEL: Knowledge Analysis with Multitoken Entities in Language Models (2022). https://www.akbc.ws/2022/assets/pdfs/15_kamel_knowledge_analysis_with_.pdf
- [6] Alon Talmor et al. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge (2019). https://aclanthology.org/N19-1421/
- [7] Hannah Sansford et al. GraphEval: A Knowledge Graph-Based LLM Hallucination Evaluation Framework (2023). https://www.amazon.science/publications/grapheval-a-knowledge-graph-based-llm-hallucination-evaluation-framework
- [8] Leonardo F. R. Ribeiro et al. FactGraph: Evaluating Factuality in Summarization with Semantic Graph Representations (2022). https://aclanthology.org/2022.naacl-main.236.pdf
- [9] Yu Wang et al. Large Graph Generative Models (2024). https://arxiv.org/pdf/2406.05109
- [10] Laura Banarescu et al. Abstract Meaning Representation for Sembanking (2013). https://aclanthology.org/W13-2322.pdf
- [11] Bowen Yu et al. Towards Generalized Open Information Extraction (2022). https://aclanthology.org/2022.findings-emnlp.103.pdf
- [12] Hanwen Zheng et al. A Survey of Document-Level Information Extraction (2023). https://arxiv.org/pdf/2309.13249
- [13] Derong Xu et al. Large Language Models for Generative Information Extraction: A Survey (2024). https://arxiv.org/pdf/2312.17617
- [14] Joshua Maynez et al. On Faithfulness and Factuality in Abstractive Summarization (2020). https://aclanthology.org/2020.acl-main.173.pdf
- [15] Chu Fei Luo et al. BiasKG: Adversarial Knowledge Graphs to Induce Bias in Large Language Models (2024). https://arxiv.org/abs/2405.04756
- [16] Zhilin Yang et al. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering (2018). https://aclanthology.org/D18-1259/
- [17] Jochen L. Leidner et al. Ethical by Design: Ethics Best Practices for Natural Language Processing (2017)
- [18] Ziyang Xu et al. Take Care of Your Prompt Bias! Investigating and Mitigating Prompt Bias in Factual Knowledge Extraction (2024). http://arxiv.org/abs/2403.09963