First draft to be created by 11 October 2024.
ADD NEW TOP LEVEL SECTION: LLM TRAINING
How do I enhance/augment/extend LLM training through KGs? (LLM TRAINING) – length: up to one page
Lead: Daniel Baldassare
Contributors:
- Diego Collarana (FIT)
- Daniel Baldassare (doctima) – Lead
- Michael Wetzel (Coreon)
- Rene Pietzsch (ECC)
Problem statement
The training of large language models (LLMs) typically employs unsupervised methods on extensive datasets. Despite their impressive performance on a range of tasks, these models often lack the practical, real-world knowledge required for certain applications. Furthermore, because domain-specific data is usually absent from the public datasets used for pre-training or fine-tuning LLMs, integrating knowledge graphs (KGs) becomes fundamental for injecting proprietary knowledge into LLMs, especially for enterprise solutions. Many techniques for infusing this knowledge into LLMs during training have been researched in recent years, resulting in three main state-of-the-art methods (Pan et al., 2024):
- Integration of KGs into training objectives (See answer 1)
- Verbalization of KGs into LLM inputs (See answer 2)
- Integration of KGs by fusion modules: joint training of graph and language models (See answer 3)
Explanation of concepts
The first method focuses on extending the pre-training procedure. The term pre-training objective describes the techniques that guide the learning process of a model from its training data. In the context of pre-training large language models, a variety of objectives have been employed, depending on the architecture of the model itself. Decoder-only models such as GPT-4 usually use Causal Language Modelling (CLM), where the model is presented with a sequence of tokens and learns to predict the next token in the sequence based solely on the preceding tokens (Wang et al., 2022). Integrating KGs into training objectives consists of extending the LLM's standard pre-training objective of generating coherent and contextually relevant text by designing a knowledge-aware pre-training objective.
The second method involves integrating KGs directly into the LLM's input by verbalizing the knowledge graph into the prompt, thereby transforming structured data into a text format that the LLM can process and learn from. Data from the knowledge graph is either prepended or appended to the user's question as contextual information in the prompt. With this approach, the standard LLM pre-training objective of generating coherent and contextually relevant text remains untouched, and the knowledge augmentation task is modeled as a linguistic task.
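As a minimal illustration of the causal language modelling objective described above, the following sketch derives (context, next token) training pairs from a token sequence. The whitespace tokenization and the function name `clm_pairs` are assumptions made for illustration only; real LLMs use subword tokenizers and optimize a cross-entropy loss over such pairs.

```python
def clm_pairs(tokens):
    """Return (context, target) pairs: each target is predicted from the preceding tokens only."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# Toy whitespace tokenization (assumption; not how production tokenizers work).
tokens = "Berlin is the capital of Germany".split()
pairs = clm_pairs(tokens)

for context, target in pairs:
    print(" ".join(context), "->", target)
```

A knowledge-aware objective would keep this next-token setup and add further signals derived from the KG; the sketch covers only the standard part.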
Brief description of the state of the art
First draft to be created by 11 October 2024
Proposed solutions:
Answer 1: integrate KGs into the LLM Training Objective
Contributors:
- Diego Collarana (FIT)
Short definition/description of this topic: please fill in ...
- Content ...
- Content ...
- Content ...
Answer 2: integrate KGs into LLM Inputs (verbalize KG for LLM training)
Contributors:
- Diego Collarana (FIT)
- Daniel Baldassare (doctima) – Lead
- Michael Wetzel (Coreon)
- Rene Pietzsch (ECC)
- ...
Draft from Daniel Baldassare :
Short definition/description of this topic: Verbalizing knowledge graphs for LLMs is the task of representing knowledge graphs as text so that they can be written directly into the prompt, the main input source of an LLM. Verbalization consists of finding textual representations for nodes, relationships between nodes, and their metadata. Verbalization can take place at different stages of the LLM lifecycle, during training (pre-training, instruction fine-tuning) or during inference (in-context learning), and involves the following:
- Mark boundaries of graph data using special tokens, as is already done for SQL queries: Improving Generalization in Language Model-Based Text-to-SQL Semantic Parsing: Two Simple Semantic Boundary-Based Techniques
- Encoding strategies for nodes, relationships between nodes, node communities and metadata: Talk like a graph: Encoding graphs for large language models (research.google)
- What needs to be verbalized and where? System prompt for static information like KG-schema, user prompt for data instances
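The points above can be sketched as follows. This is a minimal example under stated assumptions: the boundary tokens `<graph>` and `</graph>`, the function `verbalize_triples`, and the example schema and triples are all hypothetical, and the sentence template is only one of many possible encoding strategies, not the encoding proposed in the cited works.

```python
def verbalize_triples(triples, start="<graph>", end="</graph>"):
    """Render (subject, predicate, object) triples as sentences inside boundary tokens."""
    lines = [f"{s} {p.replace('_', ' ')} {o}." for s, p, o in triples]
    return "\n".join([start, *lines, end])


# Hypothetical data instances: these go into the user prompt.
triples = [
    ("Berlin", "is_capital_of", "Germany"),
    ("Germany", "is_member_of", "European Union"),
]

# Hypothetical static information (KG schema): this goes into the system prompt.
schema = "Node types: City, Country, Organization. Relations: is_capital_of, is_member_of."
system_prompt = f"You answer questions using the graph data provided.\n{schema}"
user_prompt = verbalize_triples(triples) + "\nQuestion: Which country is Berlin the capital of?"

print(system_prompt)
print(user_prompt)
```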
Answer 3: Integrate KGs by Fusion Modules
Contributors:
- Diego Collarana (FIT)
Short definition/description of this topic: please fill in ...
- Content ...
- Content ...
- Content ...
References:
- S. Pan, L. Luo, Y. Wang, C. Chen, J. Wang, and X. Wu, "Unifying Large Language Models and Knowledge Graphs: A Roadmap", IEEE Trans. Knowl. Data Eng., vol. 36, no. 7, pp. 3580–3599, July 2024, doi: 10.1109/TKDE.2024.3352100.
- T. Wang et al., "What Language Model Architecture and Pretraining Objective Works Best for Zero-Shot Generalization?", in Proceedings of the 39th International Conference on Machine Learning, PMLR, June 2022, pp. 22964–22984. Accessed: 3 October 2024. [Online]. Available: https://proceedings.mlr.press/v162/wang22u.html
ADD NEW TOP LEVEL SECTION: ENHANCING LLMs AT INFERENCE TIME
How do I use KGs for Retrieval-Augmented Generation (RAG)? (2.1 – Prompt Enhancement) – length: up to one page
Lead: Diego Collarana
Contributors:
- Daniel Burkhardt (FSTI)
- Robert David (SWC)
- Diego Collarana (FIT)
- Daniel Baldassare (doctima)
- Michael Wetzel (Coreon)
Problem statement
RAG methods aim to enhance the capabilities of LLMs by providing real-time information and domain-specific knowledge that may not be present in their training data. Despite its advantages over standalone LLMs, conventional RAG has the following limitations:
- It struggles to answer queries that require the intricate interconnectedness of information and global context crucial for generating comprehensive summaries.
- It cannot integrate structured and unstructured data, a use case typically required in industrial applications.
- Its accuracy is limited by context loss during text chunking and by its reliance on text similarity search.
- It has limited reasoning capabilities, especially with abstract questions that require reasoning, inference, or the synthesis of new information not explicitly stated in the source material.
- The answers cannot be traced back to the information sources (factual grounding).
- The external knowledge, while consistent, can still lead to inconsistencies in the generated answer.
Explanation of concepts
- Retrieval-augmented generation (RAG) methods combine retrieval mechanisms with generative models to enhance the output of LLMs by incorporating external knowledge. By grounding the generated output in specific and relevant information, RAG methods improve the quality and accuracy of the generated output.
Types of RAG:
- Conventional RAG has three components: 1) Knowledge Base, typically created by chunking text documents, transforming them into embeddings, and storing them in a vector store. 2) Retriever searches the vector database for chunks that exhibit high similarity to the query. 3) Generator feeds the retrieved chunks, alongside the original query, to an LLM to generate the final response.
- Graph RAG integrates knowledge graphs into the RAG framework, allowing for the retrieval of structured data that can provide additional context and factual accuracy to the generative model.
The retrieval can be done on any source with a semantic representation, e.g., documents with semantic annotations or relational data via OBDA or R2RML, thereby ingesting structured and unstructured source information into the Graph RAG.
- RAG is used in various natural language processing tasks, including question-answering, information extraction, sentiment analysis, and summarization. It is particularly beneficial in scenarios requiring domain-specific knowledge.
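The three components of conventional RAG listed above can be sketched in a few lines. This is an illustrative sketch only: the bag-of-words "embedding" is an assumption that stands in for a learned embedding model, the list `store` stands in for a vector store, and the final prompt would be sent to an LLM of choice.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding' (assumption; real systems use learned embedding models)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, store, k=1):
    """Return the k chunks whose embeddings are most similar to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


chunks = [
    "Knowledge graphs store entities and relations.",
    "Vector stores index embeddings of text chunks.",
]
store = [(c, embed(c)) for c in chunks]           # 1) knowledge base
query = "How are text chunks indexed?"
context = retrieve(query, store, k=1)             # 2) retriever
prompt = "Context:\n" + "\n".join(context) + f"\nQuestion: {query}"  # 3) generator input
print(prompt)
```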
Brief description of the state-of-the-art
The emerging field of Graph RAG develops methods to exploit the rich, structured relationships between entities within a KG to retrieve more precise, factually relevant context for LLMs [9]. Graph RAG methods encompass graph construction, knowledge retrieval, and answer-generation techniques [1,2,5]. Approaches range from leveraging existing open-source KGs [3] to automatically building domain-specific KGs from raw textual data using LLMs [6]. The retrieval phase focuses on efficiently extracting pertinent subgraphs, paths, or nodes relevant to a user query with techniques like embedding similarity, pre-defined rules, or LLM-guided search. In the generation phase, retrieved graph information is transformed into LLM-compatible formats, such as graph languages, embeddings, or GNN encodings, to generate enriched and contextually grounded responses [4]. Recently, significant attention has been given to hybrid approaches that combine the strengths of conventional RAG and Graph RAG [7,8]. HybridRAG [7] integrates contextual information from traditional vector databases and knowledge graphs, resulting in a more balanced and effective system that surpasses individual RAG approaches in critical metrics like faithfulness, answer relevancy and context recall.
We describe various solutions for integrating knowledge graphs into RAG systems to improve accuracy, reliability, and explainability.
Answer 1: Knowledge Graph as a Database with Natural Language Queries (NLQ)
Description: This solution treats the knowledge graph as a structured database and leverages natural language queries (NLQ) to retrieve specific information. The implementation steps are as follows:
- First, the user's question is processed to extract key entities and relationships using entity linking and relationship extraction techniques. (Natural Language Understanding)
- Next, the natural language query is partially or fully mapped into a graph query language, e.g., Cypher or SPARQL. (Graph Query Construction)
- Then, the constructed graph query is executed against the knowledge graph database, which retrieves precise and targeted information from the knowledge graph. (Knowledge Graph Execution)
- Finally, the retrieved results are passed to the LLM for summarization or further processing to generate the final answer. (Response generation)
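The four steps above can be sketched as a small pipeline. Everything in this example is an assumption for illustration: the `ex:` namespace, the keyword lexicons used for entity linking and relation extraction, and the in-memory set `KG`, which stands in for a graph database (the `execute` function matches a single triple pattern and is not a SPARQL engine).

```python
# Hypothetical lexicons and data; a real system would use trained NLU components.
ENTITIES = {"berlin": "ex:Berlin", "germany": "ex:Germany"}
RELATIONS = {"capital": "ex:capitalOf", "member": "ex:memberOf"}
KG = {
    ("ex:Berlin", "ex:capitalOf", "ex:Germany"),
    ("ex:Germany", "ex:memberOf", "ex:EuropeanUnion"),
}


def understand(question):
    """Step 1: naive entity linking and relation extraction by keyword lookup."""
    words = question.lower().replace("?", "").split()
    entity = next(ENTITIES[w] for w in words if w in ENTITIES)
    relation = next(RELATIONS[w] for w in words if w in RELATIONS)
    return entity, relation


def build_sparql(entity, relation):
    """Step 2: map the parsed question to a SPARQL query string."""
    return f"PREFIX ex: <http://example.org/> SELECT ?o WHERE {{ {entity} {relation} ?o . }}"


def execute(entity, relation):
    """Step 3: stand-in for query execution; matches one triple pattern in memory."""
    return sorted(o for s, p, o in KG if s == entity and p == relation)


question = "What is Berlin the capital of?"
entity, relation = understand(question)
query = build_sparql(entity, relation)
results = execute(entity, relation)
# Step 4: the results are handed to the LLM for response generation.
prompt = f"Question: {question}\nQuery results: {results}\nAnswer concisely."
print(query)
print(prompt)
```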
Considerations:
- Accurate Query Mapping: Requires advanced NLP techniques to map natural language queries to graph queries accurately. Entity linking and relationship extraction must be precise to ensure correct query formulation.
- Performance Efficiency: Executing complex graph queries may impact performance, especially with large-scale knowledge graphs. Optimization of graph databases and queries is necessary for real-time applications.
- Scalability: The system should handle growing knowledge graphs without significant performance loss. Scalable graph database solutions are essential.
- User Experience: The system must effectively interpret user intent from natural language inputs. Providing clear and concise answers enhances usability and trust.
Standards and Protocols:
- Compliance with Data Standards: Ensure the knowledge graph adheres to relevant data modeling standards. Where applicable, utilize standardized vocabularies and ontologies.
- Interoperability: Design the system for various graph databases and query languages. Support integration with external data sources and systems.
Answer 2: Knowledge Graph-Guided Retrieval Mechanisms
Description: KG-guided retrieval mechanisms use knowledge graphs, optionally in combination with other stores such as vector databases, to enhance the retrieval process in RAG systems. Knowledge graphs provide a structured representation of knowledge, enabling more precise and contextually aware information retrieval. This approach can directly query knowledge graphs or use them to augment queries to other data sources, improving the relevance and accuracy of the retrieved information.
- First, the user's question is processed to extract key entities and relationships using entity linking and relationship extraction techniques as a (semantic) graph representation of the question. (Natural Language Understanding)
- Next, the graph representation is executed against the knowledge graph database, which first retrieves information from the knowledge graph and then retrieves the associated mapped data source.
Data sources can be of different kinds:
- Knowledge graph data
- Non-knowledge graph data with a graph representation:
- Tabular and relational data, e.g., via OBDA or R2RML.
- Semi-structured data, e.g., XML or DITA.
- Unstructured natural language, e.g., via semantic annotations.
- Then, the retrieved (different kinds of) results are consolidated (preprocessed) to be ingested into the LLM prompt. (Data consolidation)
- Finally, the consolidated results are passed to the LLM for summarization or further processing to generate the final answer. (Response generation)
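The retrieval and consolidation steps above can be sketched as follows. The identifiers (`ex:PumpA`, `doc:manual-17`, `table:sensors`), the `SOURCES` registry and both functions are hypothetical; in practice, the mapped sources would be reached through mechanisms such as OBDA, R2RML or semantic annotations rather than a Python dictionary.

```python
# Hypothetical KG whose edges point to mapped, heterogeneous data sources.
KG = [
    ("ex:PumpA", "ex:hasManual", "doc:manual-17"),
    ("ex:PumpA", "ex:hasSensorTable", "table:sensors"),
]
SOURCES = {
    "doc:manual-17": {"kind": "text", "content": "Pump A must be serviced every 6 months."},
    "table:sensors": {"kind": "table", "content": [{"sensor": "pressure", "unit": "bar"}]},
}


def retrieve_mapped_sources(entity):
    """Follow KG edges from the entity, then fetch the data source each edge points to."""
    return [(p, SOURCES[o]) for s, p, o in KG if s == entity and o in SOURCES]


def consolidate(results):
    """Flatten heterogeneous results into plain text lines for the LLM prompt."""
    lines = []
    for predicate, source in results:
        if source["kind"] == "table":
            rows = "; ".join(
                ", ".join(f"{k}={v}" for k, v in row.items()) for row in source["content"]
            )
            lines.append(f"{predicate}: {rows}")
        else:
            lines.append(f"{predicate}: {source['content']}")
    return "\n".join(lines)


context = consolidate(retrieve_mapped_sources("ex:PumpA"))
print(context)
```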
Considerations:
- Limited input data: a short user's question poses a challenge to effectively create a graph representation sufficiently expressive for a high-quality retrieval of information.
- Knowledge model: a high-quality graph representation of both the user question and the actual information in the database is very likely to need a knowledge model with sufficient expressivity in the background.
- Graph representation: doing graph-based retrieval of (heterogeneous) data sources needs an established graph representation for each source.
- Consolidation architecture: Setting up a system architecture for consolidated data sources needs different kinds of integration components.
- Semantic gap: there is the risk of a gap of semantic information between the retrieved information and the LLM-generated answer, because any semantics contained in the knowledge graph and any knowledge model cannot be preserved during ingestion into the LLM generation.
Standards and Protocols:
- Compliance with Data Standards: Ensure the knowledge graph adheres to relevant data modeling standards. Where applicable, utilize standardized vocabularies and ontologies.
- Interoperability: Design the system for various graph databases and query languages. Support integration with external data sources and systems.
Answer 3: Hybrid RAG Combining KGs and Dense Vectors
Draft from Daniel Burkhardt:
Description: Hybrid Retrieval combines the strengths of knowledge graphs and dense vector representations to improve information retrieval. This approach leverages the structured, relational data from knowledge graphs and the semantic similarity captured by dense vectors, resulting in enhanced retrieval capabilities. Hybrid retrieval systems can improve semantic understanding and contextual insights while addressing scalability and integration complexity challenges.
- First, the user submits a query, which is analyzed to select the retrieval approach to use (1.*). (Arbitrator or Classification)
- The retrieval components are called either in parallel or sequentially (Hybrid Retrieval Process)
- Vector Search: Retrieves data based on vector embeddings
- Keyword Search: Retrieves data based on keyword matching
- Graph Queries: Retrieves structured data from the knowledge graph
- Then, we combine results from all retrieval methods. (Result Integration)
- Finally, the LLM generates and delivers the response. (Response Generation)
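The result integration step can be illustrated with reciprocal rank fusion (RRF). Note that the text above does not prescribe a fusion technique; RRF is swapped in here as one common option, and the document identifiers, the three hit lists and the constant `k=60` (a frequently used default) are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each document scores sum(1 / (k + rank)) over the lists it appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties are broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs of the three retrieval components.
vector_hits = ["d2", "d1", "d3"]
keyword_hits = ["d1", "d4"]
graph_hits = ["d1", "d2"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits, graph_hits])
print(fused)
```

Because RRF works on ranks rather than raw scores, the components do not need comparable scoring scales, which is convenient when mixing vector, keyword and graph results.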
Considerations:
- Requires efficient result fusion techniques
- Addresses diverse data types and sources
- Increase in latency of response
REFERENCE TO BE REMOVED
- Dense and sparse vectors (https://infiniflow.org/blog/best-hybrid-search-solution, https://aclanthology.org/2023.findings-acl.679.pdf)
- Hybrid Retrieval (https://arxiv.org/html/2408.05141v1, https://haystack.deepset.ai/blog/hybrid-retrieval, https://arxiv.org/pdf/1905.07129)
- Graph Embeddings (https://www.dfki.de/~declerck/semdeep-4/papers/SemDeep-4_paper_2.pdf, https://arxiv.org/pdf/1711.11231)
- Re-ranking, scoring, and filtering by fusion (https://www.elastic.co/blog/improving-information-retrieval-elastic-stack-hybrid, https://arxiv.org/pdf/2004.12832, https://arxiv.org/pdf/2009.07258)
- Integration of KG with dense vectors (https://github.com/InternLM/HuixiangDou)
- Benefits (enhance semantic understanding, contextual and structure insights, improve retrieval accuracy)
- Challenges (scalability, integration complexity) https://ragaboutit.com/how-to-build-a-jit-hybrid-graph-rag-with-code-tutorial/
Key Challenges
- Knowledge Graph Construction and Maintenance: Creating and updating high-quality knowledge graphs for specific domains can be challenging and resource-intensive.
- Scalability and Efficiency: Retrieving information from large and complex knowledge graphs while maintaining acceptable response times remains challenging.
- Evaluation Standardization: The lack of widely accepted benchmarks and evaluation metrics hinders progress and comparability across Graph RAG approaches. The quality of the KG is crucial.
- Human Element: Knowledge engineers and domain specialists are needed.
References
- [1] Boci Peng, Yun Zhu, Yongchao Liu, Xiaohe Bo, Haizhou Shi, Chuntao Hong, Yan Zhang, Siliang Tang: Graph Retrieval-Augmented Generation: A Survey. CoRR abs/2408.08921 (2024)
- [2] Diego Collarana, Moritz Busch, Christoph Lange: Knowledge Graph Treatments for Hallucinating Large Language Models. ERCIM News 2024(136) (2024)
- [3] Junde Wu, Jiayuan Zhu, Yunli Qi: Medical Graph RAG: Towards Safe Medical Large Language Model via Graph Retrieval-Augmented Generation. CoRR abs/2408.04187 (2024)
- [4] Sen, Priyanka, Sandeep Mavadia, and Amir Saffari. Knowledge graph-augmented language models for complex question answering. Proceedings of the 1st Workshop on Natural Language Reasoning and Structured Explanations - NLRSE (2023)
- [5] Shirui Pan, Linhao Luo, Yufei Wang, Chen Chen, Jiapu Wang, Xindong Wu: Unifying Large Language Models and Knowledge Graphs: A Roadmap. IEEE Trans. Knowl. Data Eng. 36 (7) (2024)
- [6] Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Jonathan Larson: From Local to Global: A Graph RAG Approach to Query-Focused Summarization. CoRR abs/2404.16130 (2024)
- [7] Bhaskarjit Sarmah, Benika Hall, Rohan Rao, Sunil Patel, Stefano Pasquali, Dhagash Mehta: HybridRAG: Integrating Knowledge Graphs and Vector Retrieval Augmented Generation for Efficient Information Extraction. CoRR abs/2408.04948 (2024)
- [8] Jens Lehmann, Dhananjay Bhandiwad, Preetam Gattogi, Sahar Vahdati: Beyond Boundaries: A Human-like Approach for Question Answering over Structured and Unstructured Information Sources. Trans. Assoc. Comput. Linguistics (2024)
- [9] Juan Sequeda, Dean Allemang, Bryon Jacob: A Benchmark to Understand the Role of Knowledge Graphs on Large Language Model's Accuracy for Question Answering on Enterprise SQL Databases. GRADES/NDA (2024)
How do I enhance LLM explainability by using KGs? (2.2 – Answer Verification) – length: up to one page
Lead: Daniel Burkhardt
Problem Statement (one paragraph)
KG-Enhanced LLM explainability focuses on integrating structured knowledge from Knowledge Graphs (KGs) to improve the explainability of large language models (LLMs). The primary goal is to allow LLMs to generate outputs that are more transparent and justifiable. This integration involves aligning LLM outputs with verifiable facts and structured data, enabling improved trust and factuality. The combination of KGs with LLMs ensures that the output remains grounded in known data, reducing issues such as hallucinations [1,2].
Explanation of concepts
KG Alignment with LLMs: This refers to ensuring that the representations generated by LLMs are in sync with the structured knowledge found in KGs. For example, frameworks like GLaM fine-tune LLMs to align their outputs with KG-based knowledge, ensuring that responses are factually accurate and well-grounded in known data [3]. By aligning LLMs with structured knowledge, the explainability of model predictions is improved, making it easier for users to verify how and why certain information was provided [1].
KG-Guided Explanation Generation: KGs assist in generating explanations for LLM outputs by providing a logical path or structure to the answer. By referencing entities and their relationships within a KG, LLMs can produce detailed, justifiable answers. Studies like those in the education domain use KG data to provide clear, factually supported explanations for LLM-generated responses [2,5].
Factuality and Verification: Factuality in LLM outputs is critical for trust, and KGs play a crucial role in verifying the truthfulness of LLM answers. Systems like GraphEval [6] analyze LLM outputs by comparing them to large-scale KGs, ensuring that the content is factual. This verification step mitigates hallucination risks and ensures outputs are reliable [6,7].
Brief description of the state of the art (one paragraph)
The integration of KGs into LLM explainability is currently focused on improving transparency through KG alignment and post-hoc verification methods. Existing models like GPT-3 and GPT-4 demonstrate high linguistic ability but suffer from opacity and factual inaccuracies. Research efforts, including KG-guided explanation generation and factual verification techniques (e.g., FACTKG [5]), aim to address these challenges by incorporating KGs during and after LLM inference [1,6,5].
References
1. Zhang et al., 2024, "Knowledgeable Preference Alignment for LLMs in Domain-specific Question Answering"
2. Zhao et al., 2023, "Explainability for Large Language Models: A Survey"
3. Dernbach et al., 2024, "GLaM: Fine-Tuning Large Language Models for Domain Knowledge Graph Alignment via Neighborhood Partitioning and Generative Subgraph Encoding"
4. Rasheed et al., 2024, "Knowledge Graphs as Context Sources for LLM-Based Explanations of Learning Recommendations"
5. Kim et al., 2023, "FACTKG: Fact Verification via Reasoning on Knowledge Graphs"
6. Liu et al., 2024, "Evaluating the Factuality of Large Language Models using Large-Scale Knowledge Graphs"
7. Hao et al., 2024, "ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings"
8. Jiang et al., 2023, "Efficient Knowledge Infusion via KG-LLM Alignment"
Answer 1: Measuring KG Alignment in LLM Representations
Measuring the alignment between LLM representations and KGs involves comparing how well the LLM’s output matches the structured knowledge in the KG. For example, in GLaM [3], fine-tuning is performed to align LLM outputs with KG-derived entities and relationships, ensuring that responses are not only accurate but also interpretable. The alignment helps reduce issues like hallucinations by grounding responses in verifiable data. This method was used to improve performance in domain-specific applications, where LLMs need to accurately reflect relationships and entities defined in KGs [3, 4, 5].
Answer 2: KG-Guided Explanation Generation
KGs can be used to guide the explanation generation process, where the LLM references structured data in the KG to justify its output. For instance, in the educational domain, explanations are generated using semantic relations from KGs to ensure that recommendations and answers are factually supported [4]. This method not only provides the user with an understandable explanation but also reduces hallucination risks by ensuring that every output can be traced back to a known fact in the KG. Studies on KG-guided explanation generation in various fields confirm its utility in making LLM outputs more transparent and understandable to non-experts [4,5].
Answer 3: KG-Based Fact-Checking and Verification
KG-based fact-checking is an essential method for improving LLM explainability. By cross-referencing LLM outputs with structured knowledge in KGs, fact-checking systems like GraphEval ensure that generated responses are accurate and grounded in truth [6, 7]. This is especially useful for reducing hallucinations. GraphEval automates the process of verifying LLM outputs against a KG containing millions of facts, allowing for scalable and efficient fact-checking that improves both explainability and user trust [6].
- First, the generated output is analyzed with respect to the knowledge graph, and key entities and relationships are extracted to create a graph representation of the LLM answer.
- Next, this graph representation is analyzed with respect to the knowledge graph used for retrieval, including any knowledge models in the background. The analysis retrieves a graph representation of an explanation or justification, which is returned as a (sub)graph or graph traversal with any additional information added, like RDF* weights.
- Finally, the explanation is returned to the user in a human-readable form so that it can be cross-checked against the LLM-generated answer.
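A minimal sketch of these steps follows. It assumes that an upstream step has already extracted triples from the LLM answer, and it treats every predicate as single-valued when flagging conflicts, which is a simplification. The function `verify` and the verdict labels are hypothetical and do not reproduce the GraphEval method.

```python
KG = {
    ("Marie Curie", "born_in", "Warsaw"),
    ("Marie Curie", "won", "Nobel Prize in Physics"),
}


def verify(answer_triples, kg):
    """Label each triple from an LLM answer as supported, conflict or unknown, with KG evidence."""
    report = []
    for s, p, o in answer_triples:
        if (s, p, o) in kg:
            verdict, evidence = "supported", [(s, p, o)]
        else:
            # Assumption: predicates are single-valued, so a different object is a conflict.
            evidence = sorted(t for t in kg if t[0] == s and t[1] == p)
            verdict = "conflict" if evidence else "unknown"
        report.append({"claim": (s, p, o), "verdict": verdict, "evidence": evidence})
    return report


# Triples assumed to have been extracted from the LLM answer.
answer_triples = [
    ("Marie Curie", "born_in", "Paris"),
    ("Marie Curie", "won", "Nobel Prize in Physics"),
]
report = verify(answer_triples, KG)
for item in report:
    print(item["verdict"], item["claim"], "evidence:", item["evidence"])
```

The evidence triples play the role of the explanation (sub)graph that is returned to the user for cross-checking.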
Considerations:
- Limited input data: a short LLM generated answer poses a challenge to effectively backtrack sufficient information in the knowledge graph for a high-quality explanation.
- Presentation: the explanation is graph-based data and difficult to explain or present to non-experts.
Standards and Protocols:
Query languages
Path retrieval
- https://graphdb.ontotext.com/documentation/10.7/graph-path-search.html
- https://neo4j.com/docs/graph-data-science/current/algorithms/pathfinding/
How do I enhance LLM reasoning through KGs? (2.3 – Answer Augmentation) – length: up to one page
Lead: Daniel Burkhardt
Problem Statement (one paragraph)
KG-Enhanced LLM Reasoning improves the reasoning capabilities of LLMs by leveraging structured knowledge from KGs. This allows LLMs to perform more complex reasoning tasks, such as multi-hop reasoning, where multiple entities and relationships need to be connected to answer a query. Integrating KGs enhances the ability of LLMs to make logical inferences and draw conclusions based on factual, interconnected data, rather than relying solely on unstructured text [9, 10].
Explanation of concepts
Multi-hop Reasoning with KGs: This involves connecting different pieces of information across multiple steps using relationships stored in KGs. By structuring queries through KGs, LLMs can reason through several layers of related entities and provide accurate answers to more complex questions [11, 10].
Tool-Augmented Reasoning: LLMs can use external tools, such as KG-based queries, to retrieve relevant data during inference, allowing for improved reasoning. ToolkenGPT [13] demonstrates how augmenting LLMs with such tools during multi-hop reasoning helps them perform more logical, structured reasoning by accessing real-time KG data [7, 13].
Consistency Checking in Reasoning: KG-based consistency checking ensures that LLMs maintain logical coherence throughout their reasoning processes. Systems like KONTEST [12] systematically test LLM outputs against KG facts to ensure that answers remain consistent with established knowledge, reducing logical errors [12, 13].
Brief description of the state of the art (one paragraph)
The use of KGs to enhance reasoning is advancing rapidly, with multi-hop reasoning and retrieval-augmented generation (RAG) methods emerging as key techniques. These methods allow LLMs to perform reasoning tasks that require connecting multiple pieces of information through structured KG paths [9, 11]. Furthermore, systems like ToolkenGPT [13] integrate KG-based tools during inference, allowing LLMs to access external factual data, improving their reasoning accuracy [10, 13].
References
9. Liao et al., 2021, "To hop or not, that is the question: Towards effective multi-hop reasoning over knowledge graphs"
10. Schick et al., 2023, "Toolformer: Language Models Can Teach Themselves to Use Tools"
11. Bratanič et al., 2024, "Knowledge Graphs & LLMs: Multi-Hop Question Answering"
12. Rajan et al., 2024, "Knowledge-based Consistency Testing of Large Language Models"
13. Hao et al., 2024, "ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings"
Answer 1: KG-Guided Multi-hop Reasoning
Multi-hop reasoning refers to the process of connecting multiple entities or facts across a KG to answer complex queries. Using KGs in this way allows LLMs to follow logical paths through the data to derive answers that would be challenging with unstructured text alone. For instance, the Neo4j framework enhances LLM multi-hop reasoning by allowing the LLM to query interconnected entities efficiently [11]. This method improves LLM performance in tasks requiring stepwise reasoning across multiple facts [9, 11].
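The idea of following a path through the KG can be sketched with a breadth-first search over triples. The toy graph, the function `find_path` and the `max_hops` limit are assumptions for illustration; a production system would typically delegate path search to the graph database.

```python
from collections import deque

# Hypothetical toy KG as a list of (subject, predicate, object) triples.
KG = [
    ("Alice", "works_for", "Acme"),
    ("Acme", "headquartered_in", "Berlin"),
    ("Berlin", "located_in", "Germany"),
]


def find_path(kg, start, goal, max_hops=3):
    """Breadth-first search returning the shortest chain of triples from start to goal."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        if len(path) == max_hops:
            continue  # do not expand beyond the hop limit
        for s, p, o in kg:
            if s == node and o not in visited:
                visited.add(o)
                queue.append((o, path + [(s, p, o)]))
    return None


path = find_path(KG, "Alice", "Germany")
print(path)
```

The returned chain of triples can be verbalized and given to the LLM as the reasoning path for a question such as "In which country does Alice work?".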
Answer 2: KG-Based Consistency Checking in LLM Outputs
KG-based consistency checking ensures that LLMs produce logically coherent and accurate outputs by comparing their answers with facts from a KG. KONTEST is an example of a system that uses KGs to systematically generate consistency tests, ensuring that LLM outputs are verified for logical consistency before being returned to the user [12]. This reduces errors in reasoning and improves the reliability of the model’s conclusions [12, 13].
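The general idea of deriving consistency tests from KG facts can be sketched as follows. This is not the KONTEST implementation: the question templates, the `distractor` argument and the placeholder model `fake_llm` are assumptions, and a real setup would call an actual LLM and parse its answers more robustly.

```python
def make_tests(triple, distractor):
    """Build yes/no questions from a KG fact: one that should be affirmed, one denied."""
    s, p, o = triple
    relation = p.replace("_", " ")
    return [
        (f"Is it true that {s} {relation} {o}?", "yes"),
        (f"Is it true that {s} {relation} {distractor}?", "no"),
    ]


def check_consistency(ask_llm, tests):
    """Return the questions whose answers disagree with the KG-derived expectation."""
    return [q for q, expected in tests if ask_llm(q).strip().lower() != expected]


def fake_llm(question):
    """Placeholder model that always agrees; replace with a real LLM call."""
    return "Yes"


tests = make_tests(("Berlin", "is_capital_of", "Germany"), distractor="France")
failures = check_consistency(fake_llm, tests)
print(failures)
```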
How do I evaluate LLMs through KGs? (3) – length: up to one page
First Version: Automatic evaluation of LLMs is usually done by comparing the model output with a desired result. The output can be evaluated against the desired result using direct matching or similarity metrics (BLEU, N-gram, ROUGE, BERTScore). However, there are various reasons why KGs can be used to support or enhance these evaluation techniques.
Firstly, KG triplets can be extracted from the output of an LLM and then analyzed. The triplets can be compared with a KG to check factuality or knowledge coverage. Examples of this knowledge coverage would be political positions, cultural or sporting events, or current news information. Furthermore, the extracted KG triplets can be used to evaluate tasks/features where a similarity comparison of the LLM output is undesirable. This is the case for identifying and evaluating hallucinations of LLMs.
The second reason is to use KGs to enhance LLM inputs with relevant information. This method is beneficial, for example, if the goal is to use in-context learning to provide relevant information for a specific task to the LLM. In addition, planned adversarial attacks can be carried out on the LLM to uncover biases or weak points.
Both variants are explained in more detail below as examples.
Answer 1: Using KGs to Evaluate LLM Knowledge Coverage
Maybe add additional properties such as factuality, correctness, precision etc. or perhaps keep these that we have right now and call them "selected properties" ...
Lead: Fabio
Contributors:
- Daniel Burkhardt (FSTI)
- Daniel Baldassare (doctima)
- Fabio Barth (DFKI)
- Max Ploner (HU)
- ...
Draft from Daniel Burkhardt:
Short definition/description of this topic: This involves using knowledge graphs to analyze and evaluate various aspects of LLMs, such as knowledge coverage and biases. KGs provide a structured framework for assessing how well LLMs capture and represent knowledge across different domains, that is, the extent to which LLMs cover the knowledge represented in KGs. By comparing LLM outputs with the structured data in KGs, this approach can identify gaps in knowledge and areas for improvement in LLM training and performance.
First Version: The first evaluation process can be divided into two parts. These can be executed through various techniques, which this section will not discuss. First, the LLM generates output sequences based on an evaluation set of input samples. Specific KG triplets are then identified and extracted from the generated output sequence. The variants for extraction and identification can be found in other subchapters of this DIN SPEC. The extracted KG triplets are usually domain or task-specific. These KG triplets are used to generate a KG.
In the second step, the KG can now be analyzed. For instance, factuality can be checked by analyzing each KG triplet in the generated KG, given the context provided. Alternatively, the extracted KG triplets can be compared with an existing, more extensive KG to analyze the knowledge coverage of an LLM.
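The comparison in the second step can be sketched with simple set operations. The metric names (`coverage`, `precision`, `unsupported`) and the example triples are assumptions for illustration; exact matching of triples is a simplification, since real extractions usually require entity and relation normalization first.

```python
def coverage_report(extracted, reference):
    """Compare triples extracted from LLM output with a reference KG."""
    extracted, reference = set(extracted), set(reference)
    matched = extracted & reference
    return {
        # Share of the reference KG that the LLM output reproduces.
        "coverage": len(matched) / len(reference) if reference else 0.0,
        # Share of extracted triples that are backed by the reference KG.
        "precision": len(matched) / len(extracted) if extracted else 0.0,
        # Extracted triples without support: candidates for factuality checks.
        "unsupported": sorted(extracted - reference),
    }


reference_kg = [
    ("Berlin", "capital_of", "Germany"),
    ("Paris", "capital_of", "France"),
    ("Rome", "capital_of", "Italy"),
    ("Madrid", "capital_of", "Spain"),
]
# Triples assumed to have been extracted from LLM outputs in the first step.
extracted = [
    ("Berlin", "capital_of", "Germany"),
    ("Paris", "capital_of", "France"),
    ("Rome", "capital_of", "Spain"),
]
report = coverage_report(extracted, reference_kg)
print(report)
```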
Answer 2: Analyzing LLM Biases through KG Comparisons
Contributors:
- Daniel Burkhardt (FSTI)
- Daniel Baldassare (doctima)
- Fabio Barth (DFKI)
- ...
Draft from Daniel Burkhardt:
Short definition/description of this topic: This involves using knowledge graphs to identify and analyze biases in LLMs. By comparing LLM outputs with the neutral, structured data in KGs, this approach can highlight biases and suggest ways to mitigate them, leading to more fair and balanced AI systems.
First Version: In the second process, the inputs, i.e., the evaluation samples, are enhanced with information from a KG to provide helpful or misleading context. KG nodes must first be extracted from the samples using, for example, RAG. Then, based on the extracted KG nodes, the top k nodes can be determined from the KG using any efficient retrieval method. These nodes can then be used to enhance the input. For example, the nodes can be presented as “superior knowledge” in the prompt in order to carry out adversarial attacks to obtain biased responses from open- and closed-source LLMs. Finally, the output of the model is analyzed. Again, different evaluation methods and metrics can be applied in the final step.
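The input enhancement described above can be sketched as follows. The node texts, the term set (assumed to come from the node-extraction step), the word-overlap ranking and the prompt layout are hypothetical; a misleading variant for adversarial testing would be built the same way from a different node selection.

```python
def top_k_nodes(query_terms, kg_nodes, k=2):
    """Rank KG nodes by how many query terms their text shares (toy retrieval)."""
    def overlap(node):
        return len(set(node["text"].lower().split()) & query_terms)

    ranked = sorted(kg_nodes, key=lambda n: (-overlap(n), n["id"]))
    return ranked[:k]


def build_prompt(sample, nodes, framing="Superior knowledge"):
    """Prepend the selected KG nodes to the evaluation sample under the given framing."""
    context = "\n".join(f"- {n['text']}" for n in nodes)
    return f"{framing}:\n{context}\n\nQuestion: {sample}"


kg_nodes = [
    {"id": "n1", "text": "nurses can be of any gender"},
    {"id": "n2", "text": "engineers can be of any gender"},
    {"id": "n3", "text": "rivers flow to the sea"},
]
sample = "Describe a typical nurse and a typical engineer."
terms = {"nurses", "engineers", "gender"}  # assumed output of the node-extraction step
selected = top_k_nodes(terms, kg_nodes)
helpful_prompt = build_prompt(sample, selected)
print(helpful_prompt)
```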
- Content ...
- Content ...
literature: https://arxiv.org/abs/2405.04756