Page History
...
- Diego Collarana (FIT)
- Daniel Baldassare (doctima) – Lead
- Michael Wetzel (Coreon)
- Rene Pietzsch (ECC)
- Alan Akbik (HU)
Problem statement
The training of large language models typically employs unsupervised methods on extensive datasets. Despite their impressive performance on various tasks, these models often lack the practical, real-world knowledge required for domain-specific and enterprise applications. Furthermore, since domain-specific data is not included in the public datasets used for pre-training or fine-tuning large language models (LLMs), integrating knowledge graphs (KGs) becomes fundamental for injecting proprietary knowledge into LLMs. To infuse this knowledge into LLMs during training, many techniques have been researched in recent years, resulting in three main state-of-the-art approaches (Pan et al., 2024) [1]:
- Verbalization of KGs into LLM inputs (See answer 1)
- Integration of KGs during pre-training (See answer 2)
- Integration of KGs during fine-tuning (See answer 3)
Explanation of concepts
The first method integrates KGs directly into the LLM's input by verbalizing the knowledge graph into the prompt. The standard LLM pre-training objective of generating coherent and contextually relevant text remains untouched, and the knowledge augmentation task is modeled as a linguistic task. Verbalizing knowledge graphs for LLMs is the task of representing knowledge graphs through text, thereby transforming structured data into a text format that the LLM can process and learn from. Data from the knowledge graph is either prepended or appended to the user's question as contextual information in the prompt. Verbalization can take place at different stages of the LLM lifecycle, during training (pre-training, fine-tuning) or during inference (in-context learning).

In contrast to the first approach, the second method extends the pre-training procedure itself. The term pre-training objective describes the technique that guides the learning process of a model from its training data. In the context of pre-training large language models, various objectives have been employed depending on the model's architecture. Decoder-only models such as GPT-4 usually use Causal Language Modelling (CLM), where the model is presented with a sequence of tokens and learns to predict the next token in the sequence based solely on the preceding tokens (Wang et al., 2022) [2]. Integrating KGs into the training objective involves extending this standard objective of generating coherent and contextually relevant text by designing a knowledge-aware pre-training task.

The third method integrates KGs during fine-tuning. In the context of large language models, fine-tuning can serve several purposes: adapting the model to a specific task such as classification or sentiment analysis (task adaptation), expanding the pre-trained model's knowledge to specialize it for a particular domain or enterprise needs (knowledge enhancement), or teaching the model to follow human instructions using datasets of prompts (instruction tuning).
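For reference, the causal language modeling objective mentioned for the second method can be written as the standard next-token negative log-likelihood over a token sequence x_1, ..., x_T (a textbook formulation, not taken from the cited works):

```latex
\mathcal{L}_{\mathrm{CLM}}(\theta) = -\sum_{t=1}^{T} \log p_{\theta}\left(x_t \mid x_{1}, \ldots, x_{t-1}\right)
```

A knowledge-aware pre-training objective typically augments this loss (for example, with an additional entity- or triple-level prediction term) rather than replacing it.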
Brief description of the state-of-the-art
Regarding the verbalization of KGs into LLM inputs, the current state-of-the-art approach focuses on infusing knowledge without modifying the textual sequence itself. The methods proposed by Liu et al. [3] and Sun et al. [4] address the issue of "knowledge noise", a challenge highlighted by Liu et al. [3] that can arise when knowledge triples are simply concatenated with their corresponding sentences, as in the approach of Zhang et al. [5].
Proposed solutions:
Answer 1: Integrate KGs into LLM Inputs (verbalize KG for LLM training) – Before pre-training enhancement
Contributors:
- Diego Collarana (FIT)
- Daniel Baldassare (doctima) – Lead
- Michael Wetzel (Coreon)
- Rene Pietzsch (ECC)
- ...
Draft from Daniel Baldassare:
Description: Verbalizing knowledge graphs for LLMs is the task of representing knowledge graphs as text so that they can be written directly into the prompt, the main input source of an LLM. Verbalization consists of finding textual representations for nodes, relationships between nodes, and their metadata. It can take place at different stages of the LLM lifecycle, during training (pre-training, instruction fine-tuning) or during inference (in-context learning), and involves:
- Marking the boundaries of graph data with special tokens, as already done for SQL queries: "Improving Generalization in Language Model-Based Text-to-SQL Semantic Parsing: Two Simple Semantic Boundary-Based Techniques"
- Encoding strategies for nodes, relationships between nodes, node communities, and metadata: "Talk like a Graph: Encoding Graphs for Large Language Models" (research.google)
- Deciding what needs to be verbalized and where: the system prompt for static information such as the KG schema, the user prompt for data instances (see the sketch below)
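A minimal sketch of this verbalization step is shown below; the boundary tokens <kg> and </kg>, the triple-to-sentence template, and the example data are illustrative assumptions rather than the exact choices of the cited papers.

```python
# Toy sketch: verbalize KG triples and place them in the prompt with boundary tokens.
# <kg>/</kg> and the sentence template are illustrative assumptions, not the tokens
# used in the cited papers.

def verbalize_triples(triples):
    """Turn (subject, relation, object) triples into short natural-language sentences."""
    return " ".join(f"{s} {r} {o}." for s, r, o in triples)

def build_prompt(question, triples, schema_description):
    # Static information (KG schema) goes into the system prompt,
    # instance data (verbalized triples) into the user prompt.
    system_prompt = ("You answer questions using the knowledge graph context "
                     "between <kg> and </kg>.\n"
                     f"KG schema: {schema_description}")
    user_prompt = f"<kg> {verbalize_triples(triples)} </kg>\nQuestion: {question}"
    return system_prompt, user_prompt

triples = [("Berlin", "is the capital of", "Germany"),
           ("Germany", "is a member of", "the European Union")]
system_prompt, user_prompt = build_prompt(
    "Which union is the country whose capital is Berlin a member of?",
    triples,
    "City -capitalOf-> Country; Country -memberOf-> Organisation",
)
print(system_prompt)
print(user_prompt)
```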
Answer 2: Integrate KGs into the LLM Training Objective – during pre-training
Description: These methods inject knowledge into the model directly during training by improving the LLM's encoder and training tasks (a toy sketch follows the list below):
- Incorporate knowledge encoders
- Insert knowledge encoding layers
- Add independent adapters
- Modify the pre-training task
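As a rough illustration of the last point (modifying the pre-training task), the toy sketch below combines the standard next-token loss with an auxiliary loss that predicts the KG entity linked to a token. The model architecture, the entity labels, and the weighting factor are simplified assumptions for illustration and do not reproduce any specific published method.

```python
# Toy sketch (PyTorch) of a knowledge-aware pre-training objective: the usual
# next-token loss is combined with an auxiliary loss predicting the KG entity
# linked to a token. All sizes, names, and the weighting are assumptions.

import torch
import torch.nn as nn

class ToyKnowledgeAwareLM(nn.Module):
    def __init__(self, vocab_size=1000, num_entities=500, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)  # stand-in for a transformer
        self.lm_head = nn.Linear(dim, vocab_size)           # next-token prediction
        self.entity_head = nn.Linear(dim, num_entities)     # auxiliary KG-entity prediction

    def forward(self, tokens):
        hidden, _ = self.encoder(self.embed(tokens))
        return self.lm_head(hidden), self.entity_head(hidden)

def knowledge_aware_loss(model, tokens, entity_labels, alpha=0.5):
    """entity_labels[b, t] holds the KG entity id linked to token t, or -100 if none."""
    lm_logits, ent_logits = model(tokens[:, :-1])
    lm_loss = nn.functional.cross_entropy(
        lm_logits.reshape(-1, lm_logits.size(-1)), tokens[:, 1:].reshape(-1))
    ent_loss = nn.functional.cross_entropy(
        ent_logits.reshape(-1, ent_logits.size(-1)),
        entity_labels[:, :-1].reshape(-1), ignore_index=-100)
    return lm_loss + alpha * ent_loss  # LM objective plus knowledge-aware term

model = ToyKnowledgeAwareLM()
tokens = torch.randint(0, 1000, (2, 16))
entity_labels = torch.full((2, 16), -100)
entity_labels[:, 3] = 42  # pretend token 3 is a mention linked to entity 42
print(knowledge_aware_loss(model, tokens, entity_labels))
```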
Considerations:
- It is challenging to harmonize and bring together heterogeneous embedding spaces, i.e., text and graph embeddings
Standards:
- Content ...
- Content ...
- Content ...
Answer 3: Integrate KGs during Fine-Tuning – Post pre-training enhancement
Description:
Considerations:
Standards:
- Content ...
- Content ...
- Content ...
References:
- [1] S. Pan, L. Luo, Y. Wang, C. Chen, J. Wang, and X. Wu, "Unifying Large Language Models and Knowledge Graphs: A Roadmap," IEEE Trans. Knowl. Data Eng., vol. 36, no. 7, pp. 3580–3599, July 2024, doi: 10.1109/TKDE.2024.3352100.
- [2] T. Wang et al., "What Language Model Architecture and Pretraining Objective Works Best for Zero-Shot Generalization?," in Proceedings of the 39th International Conference on Machine Learning, PMLR, June 2022, pp. 22964–22984. Accessed: Oct. 3, 2024. [Online]. Available: https://proceedings.mlr.press/v162/wang22u.html
- [3] W. Liu et al., "K-BERT: Enabling Language Representation with Knowledge Graph," arXiv:1909.07606, arXiv, Sep. 17, 2019. [Online]. Available: http://arxiv.org/abs/1909.07606
- [4] T. Sun et al., "CoLAKE: Contextualized Language and Knowledge Embedding," in Proceedings of the 28th International Conference on Computational Linguistics, International Committee on Computational Linguistics, 2020, pp. 3660–3670, doi: 10.18653/v1/2020.coling-main.327.
- [5] Z. Zhang et al., "ERNIE: Enhanced Language Representation with Informative Entities," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2019, pp. 1441–1451, doi: 10.18653/v1/P19-1139.
Enhancing LLMs at Inference Time
...
- Daniel Burkhardt (FSTI)
- Robert David (SWC)
- Diego Collarana (FIT)
- Daniel Baldassare (doctima)
- Michael Wetzel (Coreon)
Problem statement
RAG methods aim to enhance the capabilities of LLMs by providing real-time information and domain-specific knowledge that may not be present in their training data. Despite its advantages over standalone LLMs, conventional RAG has the following limitations:
- It struggles to answer queries that require connecting dispersed pieces of information and the global context needed to generate comprehensive summaries.
- It cannot integrate structured and unstructured data, a use case typically required in industrial applications.
- Its accuracy is limited by context loss during text chunking and by its reliance on text similarity search.
- It has limited reasoning capabilities, especially for abstract questions that require reasoning, inference, or the synthesis of new information not explicitly stated in the source material.
- Answers cannot be traced back to their information sources (lack of factual grounding).
- The external knowledge, while consistent in itself, can still lead to inconsistencies in the generated answer.
Explanation of concepts
- Retrieval-augmented generation (RAG) methods combine retrieval mechanisms with generative models to enhance the output of LLMs by incorporating external knowledge. By grounding the generated output in specific and relevant information, RAG methods improve the quality and accuracy of the generated output.
Types of RAG:
- Conventional RAG has three components: 1) the knowledge base, typically created by chunking text documents, transforming the chunks into embeddings, and storing them in a vector store; 2) the retriever, which searches the vector database for chunks that exhibit high similarity to the query; 3) the generator, which feeds the retrieved chunks, alongside the original query, to an LLM to generate the final response (a minimal sketch follows this list).
- Graph RAG integrates knowledge graphs into the RAG framework, allowing for the retrieval of structured data that can provide additional context and factual accuracy to the generative model.
The retrieval can be done on any source with a semantic representation, e.g., documents with semantic annotations or relational data via OBDA or R2RML, thereby ingesting structured and unstructured source information into the Graph RAG.
- RAG is used in various natural language processing tasks, including question-answering, information extraction, sentiment analysis, and summarization. It is particularly beneficial in scenarios requiring domain-specific knowledge.
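The following minimal sketch illustrates the three conventional RAG components named above; the bag-of-words similarity is a toy stand-in for a real embedding model, and in a real system the final prompt would be sent to an LLM.

```python
# Minimal conventional-RAG sketch: knowledge base -> retriever -> generator prompt.
# Bag-of-words counts stand in for real embeddings; no external services are called.

from collections import Counter
import math

def embed(text):
    """Toy stand-in for an embedding model: bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# 1) Knowledge base: chunked documents (their embeddings would live in a vector store)
chunks = ["Knowledge graphs store entities and their relations.",
          "RAG retrieves external context for a language model.",
          "Vector stores index chunk embeddings for similarity search."]

# 2) Retriever: top-k chunks most similar to the query
def retrieve(query, chunks, k=2):
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

# 3) Generator: retrieved chunks plus the query form the prompt for the LLM
def build_rag_prompt(query, chunks):
    context = "\n".join(retrieve(query, chunks))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

print(build_rag_prompt("How does RAG provide external context to a language model?", chunks))
```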
Brief description of the state-of-the-art
The emerging field of Graph RAG develops methods to exploit the rich, structured relationships between entities within a KG to retrieve more precise, factually relevant context for LLMs [9]. Graph RAG methods encompass graph construction, knowledge retrieval, and answer-generation techniques [1,2,5]. Approaches range from methods that leverage existing open-source KGs [3] to methods that automatically build domain-specific KGs from raw textual data using LLMs [6]. The retrieval phase focuses on efficiently extracting pertinent subgraphs, paths, or nodes relevant to a user query with techniques like embedding similarity, pre-defined rules, or LLM-guided search. In the generation phase, retrieved graph information is transformed into LLM-compatible formats, such as graph languages, embeddings, or GNN encodings, to generate enriched and contextually grounded responses [4]. Recently, significant attention has been given to hybrid approaches that combine the strengths of conventional RAG and Graph RAG [7,8]. HybridRAG integrates contextual information from traditional vector databases and knowledge graphs, resulting in a more balanced and effective system that surpasses individual RAG approaches in critical metrics like faithfulness, answer relevancy, and context recall.
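As a rough sketch of this hybrid idea, the toy example below merges document chunks (as they would come from a vector store) with a verbalized 1-hop subgraph retrieved from a small example KG. The KG content, the string-containment entity matching, and the prompt template are illustrative assumptions, not the HybridRAG implementation.

```python
# Toy hybrid retrieval: combine vector-store chunks with a verbalized KG subgraph.
# All data and the naive matching are illustrative assumptions.

KG = [("ACME GmbH", "produces", "industrial pumps"),
      ("ACME GmbH", "headquartered in", "Stuttgart"),
      ("industrial pumps", "subject to", "ISO 5199")]

def retrieve_subgraph(query, kg):
    """Return every triple whose subject or object occurs verbatim in the query (1-hop)."""
    q = query.lower()
    return [t for t in kg if t[0].lower() in q or t[2].lower() in q]

def hybrid_prompt(query, text_chunks, kg):
    graph_context = " ".join(f"{s} {r} {o}." for s, r, o in retrieve_subgraph(query, kg))
    text_context = "\n".join(text_chunks)  # chunks as returned by a conventional retriever
    return (f"Graph context: {graph_context}\n"
            f"Document context:\n{text_context}\n"
            f"Question: {query}\nAnswer:")

print(hybrid_prompt("Which standard applies to the industrial pumps produced by ACME GmbH?",
                    ["ACME GmbH published a new pump catalogue in 2024."],
                    KG))
```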
...
- While KGs are inherently structured to maintain consistency in factual representation, LMs do not always yield consistent answers, especially when queries are rephrased [2, 3]. Integrating KGs for evaluation set generation can address this by allowing multiple phrasings of a single query, all linked to the same answer and relational triple. This approach helps measure an LM’s robustness in recognizing equivalent rewordings of the same fact [1, 2, 3].
- When using a question-answering format (to evaluate text-generating / autoregressive LMs), the free-form answer of the model needs to be compared to the reference answer [1]. While there are multiple ways of comparing the answer to the reference [1], no single approach is ideal. Multiple-choice-based approaches avoid this problem entirely [1, 2, 6], but they inherently have a limited answer space and may encourage educated guessing, simplifying the task by providing plausible options. Conversely, open-ended answers require the model to generate the correct response without cues and may align better with real-world use cases.
- Verbalizing a triple requires not only labels for the subject and object (which are typically annotated with one or multiple labels) but also a rule [1] or template [2] that translates the formal triple into natural text (see the sketch below). This may require humans to create one or multiple rules per relation; depending on the target language and the number of relations used, this can be a non-negligible amount of work.
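A minimal sketch of such template-based verbalization for evaluation set generation is given below; the relation name, the templates, and the example triple are invented for illustration.

```python
# Toy sketch: hand-written templates per relation turn one KG triple into several
# rephrased questions that all share the same reference answer.

TEMPLATES = {
    "capitalOf": [
        "What is the capital of {obj}?",
        "Which city serves as the capital of {obj}?",
        "Name the capital city of {obj}.",
    ],
}

def questions_from_triple(subject, relation, obj):
    """Return (question, reference_answer) pairs for one (subject, relation, object) triple."""
    return [(template.format(obj=obj), subject) for template in TEMPLATES[relation]]

for question, answer in questions_from_triple("Berlin", "capitalOf", "Germany"):
    print(f"{question}  ->  expected answer: {answer}")
```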
...
- For meaningful graph representations, the standard protocols are, for instance, Abstract Meaning Representation (AMR) or Open Information Extraction (OpenIE). AMR is a semantic representation language generated as rooted, directed, edge-labeled, and leaf-labeled graphs. In AMR, the edges are semantic relations, and the nodes are concepts. AMR has a fixed relation vocabulary of approximately 100 relations and the inverse of each relation. In OpenIE, on the other hand, relation triples are represented as a subject, an open relation, and the object of the open relation. An open relation means that OpenIE does not contain a fixed relation vocabulary. Therefore, each sentence is represented as a directed acyclic graph, and an extractor is used to enumerate all word pairs and make a parallel prediction of the relation.
- Extracting information from a text and generating or enhancing a KG from it will be discussed in Chapter 4.2. NLP tasks like named entity recognition, coreference resolution, and relation extraction are well-established problems in this field of research that are solved using either generative LLMs or fine-tuned language models. The third option of using prompting for generating a KG is based on two techniques: in-context learning and chain-of-thought reasoning (explained in Section 4).
- KG factuality: The standard protocol for checking the factuality of a KG generated from an LLM output sequence would be to encode the KG using an LLM or a GNN and predict the factuality using binary classification. For both models, context can be provided in addition to the generated KG for higher precision in the prediction. For this task, the GNN has to be fine-tuned for factuality prediction. When using an LLM for the prediction, prompting can be used to judge the factuality of KG triples; the prompt can be enhanced with in-context learning examples or the context of factual KG relations (a prompt sketch follows this list).
- Current publications that use the explained techniques are GraphEval and FactGraph. GraphEval uses SOTA LLMs like LLaMA to extract and generate the KG from a given model output; the framework then labels each extracted triple as factually consistent or not, given the provided context. FactGraph builds on text and graph encoders that are augmented with structure-aware adapters to classify factuality.
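Below is a small sketch of the LLM-prompting variant of this factuality check; the prompt wording, the in-context examples, and the call_llm placeholder are assumptions and not taken from GraphEval or FactGraph.

```python
# Toy sketch: classify the factual consistency of generated KG triples by prompting
# an LLM with in-context examples. call_llm is a placeholder for any LLM client.

IN_CONTEXT_EXAMPLES = """\
Context: Marie Curie won the Nobel Prize in Physics in 1903.
Triple: (Marie Curie, award received, Nobel Prize in Physics)
Factually consistent with the context? yes

Context: The Rhine flows through several countries before reaching the North Sea.
Triple: (Rhine, flows into, Baltic Sea)
Factually consistent with the context? no
"""

def factuality_prompt(context, triple):
    s, r, o = triple
    return (IN_CONTEXT_EXAMPLES
            + f"\nContext: {context}\nTriple: ({s}, {r}, {o})\n"
            + "Factually consistent with the context? ")

def check_triples(context, triples, call_llm):
    """Return a dict mapping each triple to True if the LLM answers 'yes'."""
    return {t: call_llm(factuality_prompt(context, t)).strip().lower() == "yes"
            for t in triples}

# Dummy "LLM" that always answers yes, for demonstration only.
print(check_triples("Berlin is the capital of Germany.",
                    [("Berlin", "capital of", "Germany")],
                    call_llm=lambda prompt: "yes"))
```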
Answer 3: Analyzing LLM Biases through KG Comparisons
Description: This approach uses knowledge graphs to identify and analyze biases in LLMs. By comparing LLM outputs with the neutral, structured data in KGs, it can highlight biases and suggest ways to mitigate them, leading to fairer and more balanced AI systems.
Instead of extracting meaning and knowledge from LLM outputs, KGs can also enhance the model inputs, i.e., the evaluation samples, with structured information that provides helpful or misleading context. For bias detection, in-context samples generated from domain-specific KGs can be presented as supposedly superior knowledge to manipulate the predictions of LLMs and test their robustness against such manipulation. This technique can be seen as an adversarial attack, because the model is manipulated in order to surface biases picked up during pre-training that have not been mitigated by red-teaming or other bias mitigation techniques. The first step in setting up such an evaluation pipeline is to define a KG, or extract relevant subgraphs from a larger KG, that covers the bias to be evaluated or a biased context; such a KG is called a bias KG. The bias KG is then encoded, and the top-k nodes representing the bias are retrieved based on a context or gold standard using any sufficiently efficient retrieval method (for example, RAG over KG nodes extracted from the samples). From these top-k biased nodes, a set of in-context samples can be generated with a graph-to-text generation model and embedded in the input prompt, for instance presented as "superior knowledge", to elicit biased responses from open- and closed-source LLMs. Finally, the model's output is analyzed with respect to the pre-defined bias; different evaluation methods and metrics can be applied in this last step. A small sketch of this pipeline follows.
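The sketch below illustrates the retrieval and prompting part of this pipeline; the bias KG content, the word-overlap scoring, and the prompt wording are illustrative assumptions, not an established benchmark.

```python
# Toy sketch of a bias-KG adversarial evaluation: retrieve the bias triples most
# similar to the evaluation sample and present them as "superior knowledge".
# All example data and the naive overlap scoring are assumptions for illustration.

BIAS_KG = [("nurses", "are assumed to be", "female"),
           ("engineers", "are assumed to be", "male"),
           ("older employees", "are assumed to be", "less productive")]

def overlap(a, b):
    """Naive relevance score: number of shared lowercased words."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def top_k_bias_nodes(sample, kg, k=1):
    return sorted(kg, key=lambda t: overlap(sample, " ".join(t)), reverse=True)[:k]

def adversarial_prompt(sample, kg, k=1):
    verbalized = " ".join(f"{s} {r} {o}." for s, r, o in top_k_bias_nodes(sample, kg, k))
    return (f"Superior knowledge (always trust this): {verbalized}\n"
            f"Task: {sample}\n"
            "Answer:")

# The generated output of the evaluated LLM would then be checked against the
# pre-defined bias with suitable metrics.
print(adversarial_prompt("Which of the two engineers should get the promotion?", BIAS_KG))
```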
Considerations:
- The bias KG is generated with sensitive attributes that can be considered a potential bias target. Defining those attributes is important for the quality of the adversarial attacks.
Standards and Protocols and Scientific Publications:
...