ADD NEW TOP LEVEL SECTION: LLM TRAINING

How do I enhance/augment/extend LLM training through KGs? (LLM TRAINING) – length: up to one page

Lead: Daniel Baldassare

KG-Enhanced LLM Training

...

Contributors:

  • Diego Collarana (FIT)
  • Please add yourself if you want to contribute ...
  • Please add yourself if you want to contribute ...
  • Please add yourself if you want to contribute ...
  • ... 

Integrating KGs into LLM Inputs (verbalize KG for LLM training)

Contributors:

  • Diego Collarana (FIT)
  • Daniel Baldassare (doctima)
  • Michael Wetzel (Coreon)
  • Sabine Mahr (word b sign)
  • ... 

Draft from Daniel Baldassare :

Integrating KGs by Fusion Modules

Contributors:

  • Diego Collarana (FIT)
  • Please add yourself if you want to contribute ...
  • Please add yourself if you want to contribute ...
  • Please add yourself if you want to contribute ...
  • ... 

Retrieval-Augmented Generation (RAG)

Draft Daniel Burkhardt

  • Definition of RAG 
  • Types of RAG 
  • Applications for RAG 

KG-Guided Retrieval Mechanisms

Contributors:

  • Daniel Burkhardt (FSTI)
  • Robert David (SWC)
  • Diego Collarana (FIT)
  • Daniel Baldassare (doctima)
  • Michael Wetzel (Coreon)

Draft Robert David:

  • Initial RAG idea: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
  • RAG is commonly used with vector databases.
    • can only grasp semantic similarity represented in the document content
    • only unstructured data
    • vector distance instead of a DB search limits the retrieval capabilities
  • Graph RAG uses knowledge graphs as part of the RAG system
    • KGs for retrieval (directly), meaning the database is storing KG data
    • KGs for retrieval via a semantic layer, potentially retrieving over different data sources of structured and unstructured data
    • KGs for augmenting the retrieval, meaning the queries to some database are modified via KG data
  • Via Graph RAG, we can
    • ingest additional semantic background knowledge (knowledge model) not represented in the data itself
      • additional related knowledge based on defined paths (rule-based inference)
      • focus on certain aspects of a data set for the retrieval (search configuration)
      • personalization: represent different roles for retrieval via ingesting role description data into the retrieval (especially important in an enterprise environment)
    • reasoning
    • linked data relates factual knowledge to the LLM-generated knowledge and thereby provides a means to check for correctness
    • explainable AI: provide justifications via KG
    • consolidate different data sources: unstructured, semi-structured, structured (enterprise knowledge graph scenario)
    • doing the actual retrieval via KG queries: SPARQL
    • hybrid retrieval: combine KG-based retrieval with vector databases or search indexes
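The "KGs for augmenting the retrieval" idea from the list above can be sketched as a simple query expansion step. This is a minimal illustration under stated assumptions: the `RELATED` dictionary is a hypothetical stand-in for labels fetched from a KG (synonyms, narrower terms), and the expanded string would be handed to whatever search index or vector store is in use.

```python
from typing import Dict, List

# Hypothetical KG fragment: term -> related labels (synonyms, narrower terms).
# In a real system these labels would come from a KG lookup, not a dictionary.
RELATED: Dict[str, List[str]] = {
    "car": ["automobile", "vehicle"],
}


def expand_query(query: str, related: Dict[str, List[str]]) -> str:
    """Append KG-derived labels to the query terms, skipping duplicates."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        for label in related.get(term, []):
            if label not in expanded:
                expanded.append(label)
    return " ".join(expanded)


if __name__ == "__main__":
    print(expand_query("car insurance", RELATED))
```

The original terms are kept first, so the expansion only widens recall and does not replace what the user asked for.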

Hybrid Retrieval Combining KGs and Dense Vectors

Contributors:

  • Daniel Burkhardt (FSTI)
  • Diego Collarana (FIT)
  • Daniel Baldassare (doctima)
  • Please add yourself if you want to contribute ...
  • ...

Draft from Daniel Burkhardt

KG-Enhanced LLM Interpretability

Draft from Daniel Burkhardt

Measuring KG Alignment in LLM Representations

Draft from Daniel Burkhardt

literature: https://arxiv.org/abs/2311.06503 , https://arxiv.org/abs/2406.03746, https://arxiv.org/abs/2402.06764

Contributors:

  • Daniel Burkhardt (FSTI)
  • Please add yourself if you want to contribute ...
  • Please add yourself if you want to contribute ...
  • ... 

KG-Guided Explanation Generation

Draft from Daniel Burkhardt

literature: https://arxiv.org/abs/2312.00353, https://arxiv.org/abs/2403.03008

Contributors:

  • Daniel Burkhardt (FSTI)
  • Please add yourself if you want to contribute ...
  • Please add yourself if you want to contribute ...
  • ... 

KG-Based Fact-Checking and Verification

Contributors:

  • Daniel Burkhardt (FSTI)
  • Please add yourself if you want to contribute ...
  • Please add yourself if you want to contribute ...
  • ... 

Draft from Daniel Burkhardt

literature: https://arxiv.org/abs/2404.00942, https://aclanthology.org/2023.acl-long.895.pdf, https://arxiv.org/pdf/2406.01311

KG-Enhanced LLM Reasoning

Draft from Daniel Burkhardt

KG-Guided Multi-hop Reasoning

Contributors:

  • Daniel Burkhardt (FSTI)
  • Daniel Baldassare (doctima)
  • Please add yourself if you want to contribute ...
  • ... 

Draft from Daniel Burkhardt

literature: https://neo4j.com/developer-blog/knowledge-graphs-llms-multi-hop-question-answering/, https://link.springer.com/article/10.1007/s11280-021-00911-5

KG-Based Consistency Checking in LLM Outputs

Contributors:

  • Daniel Burkhardt (FSTI)
  • Daniel Baldassare (doctima)
  • Michael Wetzel (Coreon)
  • ... 

Draft from Daniel Burkhardt

literature: https://www.researchgate.net/publication/382363779_Knowledge-based_Consistency_Testing_of_Large_Language_Models

KGs for LLM Analysis

Using KGs to Evaluate LLM Knowledge Coverage

Contributors:

  • Daniel Burkhardt (FSTI)
  • Daniel Baldassare (doctima)
  • Please add yourself if you want to contribute ...
  • ... 

Draft from Daniel Burkhardt

literature: https://www.amazon.science/publications/grapheval-a-knowledge-graph-based-llm-hallucination-evaluation-framework

Analyzing LLM Biases through KG Comparisons

Contributors:

  • Daniel Burkhardt (FSTI)
  • Daniel Baldassare (doctima)
  • Please add yourself if you want to contribute ...
  • ... 

Draft from Daniel Burkhardt

  • Daniel Baldassare (doctima) – Lead
  • Michael Wetzel (Coreon)
  • Rene Pietzsch (ECC)
  • Alan Akbik (HU)

Problem statement

The training of large language models (LLMs) typically employs unsupervised methods on extensive datasets. Despite their impressive performance on various tasks, these models often lack the practical, real-world knowledge required for domain-specific and enterprise applications. Furthermore, since domain-specific data is not included in the public datasets used for pre-training or fine-tuning LLMs, integrating knowledge graphs (KGs) becomes fundamental for injecting proprietary knowledge into LLMs. To infuse this knowledge into LLMs during training, many techniques have been researched in recent years, resulting in three main state-of-the-art approaches [1]:

  1. Verbalization of KGs into LLM inputs (see Answer 1)
  2. Integration of KGs during pre-training (see Answer 2)
  3. Integration of KGs during fine-tuning (see Answer 3)

Explanation of concepts

The term pre-training objectives describes the techniques that guide the learning process of a model from its training data. In the context of pre-training LLMs, various methods have been employed depending on the model's architecture. Decoder-only models such as GPT-4 usually use Causal Language Modelling (CLM), where the model is presented with a sequence of tokens and learns to predict the next token in the sequence based solely on the preceding tokens [2].

Within the first approach, the standard LLM pre-training objective of generating coherent and contextually relevant text remains untouched, and the knowledge augmentation task is modeled as a linguistic task. Verbalizing knowledge graphs for LLMs is the task of representing knowledge graphs as text, thereby transforming structured data into a textual format that the LLM can process and learn from. Verbalization can take place at different stages of the LLM lifecycle: during training (pre-training, fine-tuning) or during inference (in-context learning).

In contrast to the first approach, the second approach extends the pre-training procedure. Integrating KGs into the training objectives means extending the standard LLM pre-training objective of generating coherent and contextually relevant text by designing a knowledge-aware pre-training objective.

The third approach builds on fine-tuning. In the context of LLMs, fine-tuning can serve several purposes: adapting the model for a specific task, such as classification or sentiment analysis (Task Adaptation), expanding the pre-trained model's knowledge to specialize it for a particular domain or enterprise needs (Knowledge Enhancement), or teaching the model to follow human instructions using datasets of prompts (Instruction Tuning).
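To make the verbalization concept concrete, the following minimal sketch turns KG triples into plain sentences that could be added to a training corpus. The relation names and the `TEMPLATES` table are hypothetical and chosen only for illustration; a real project would derive templates from its ontology or use a learned graph-to-text model.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical sentence templates per relation (illustrative assumption).
TEMPLATES = {
    "manufactured_by": "{s} is manufactured by {o}.",
    "has_part": "{s} has the part {o}.",
}


def verbalize(triples: Iterable[Triple]) -> List[str]:
    """Render each triple as one sentence."""
    sentences = []
    for s, p, o in triples:
        template = TEMPLATES.get(p)
        if template is None:
            # Fallback: use the predicate identifier as plain words.
            template = "{s} " + p.replace("_", " ") + " {o}."
        sentences.append(template.format(s=s, o=o))
    return sentences


if __name__ == "__main__":
    for sentence in verbalize([("Pump X200", "manufactured_by", "Acme")]):
        print(sentence)
```

The fallback branch produces rougher text than a hand-written template, which is one reason template coverage matters when verbalized text is used for training.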

Brief description of the state-of-the-art


In the context of integrating KGs into LLM inputs, the current state-of-the-art approach focuses on infusing knowledge without modifying the textual sequence itself. The methods proposed by Liu et al. [3] and Sun et al. [4] address the issue of "knowledge noise", a challenge highlighted by Liu et al. [3] that can arise when knowledge triples are simply concatenated with their corresponding sentences, as in the approach of Zhang et al. [5].

Answer 1: Integrate KGs into LLM Inputs (verbalize KG for LLM training) – before pre-training

Contributors:

  • Diego Collarana (FIT)
  • Daniel Baldassare (doctima) – Lead
  • Michael Wetzel (Coreon)
  • Rene Pietzsch (ECC)
  • ... 


Description:

Verbalizing knowledge graphs for the LLM pre-training task:

  • Simple concatenation of KG triples with text
  • Entity/Token alignment prediction

Considerations:

  • Simple concatenation of tokens and triples from KG can cause "knowledge noise"

Standards:

  • Predicting alignment links between tokens and entities
  • Entity embeddings + additional entity prediction task to token-only pretraining objective


Answer 2: Integrate KGs during pre-training

Description: These methods use KG knowledge directly during the LLM pre-training phase by modifying the encoder side of the transformer architecture and improving the training tasks.
LLMs do not process KG structure directly; therefore, a KG representation that can be combined with text embeddings is necessary, i.e., KG embeddings. The text and the (sub)graph pre-training data also need to be aligned. To allow the LLM to learn from KG embeddings, there are three main modifications to the transformer-encoder architecture (which future research may extend):

a) Incorporate a knowledge encoder to fuse textual context (Text Embeddings) and knowledge context (KG embeddings). The LLM could stay frozen and reuse just the output of the Transformer encoder.

b) Insert knowledge encoding layers in the middle of the transformer layers to adjust the encoding mechanism, enabling the LLM to process knowledge from the KG.

c) Add independent adapters to process knowledge. These adapters map one-to-one to the transformer layers and are easy to train because they do not affect the parameters of the original LLM during pre-training.


Although nothing prohibits implementing all these modifications simultaneously, we recommend implementing just one of these variations during LLM pre-training.

The pre-training task allows the LLM to learn and model the world. Thus, another option is to modify the pre-training task. On the encoder side of LLMs, the typical task is masked language modeling, i.e., masking words in the context. A simple modification is to mask not random words but entities represented in the KG. Another option is multi-task pre-training, i.e., performing masked-word prediction and KG link prediction jointly.
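The difference between random masking and entity masking can be sketched as follows. This is a minimal illustration, assuming that entity mentions have already been linked to the KG and are given as token spans; the function name and the `[MASK]` string are assumptions, and real implementations operate on subword IDs and usually mask only a fraction of the spans.

```python
import random
from typing import List, Optional, Sequence, Tuple

MASK = "[MASK]"


def mask_entities(
    tokens: Sequence[str],
    entity_spans: Sequence[Tuple[int, int]],
    mask_prob: float = 1.0,
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], List[int]]:
    """Mask whole entity mentions instead of random single tokens.

    entity_spans holds half-open (start, end) token indices of mentions
    that were linked to KG entities by an upstream step (assumed here).
    Returns the masked tokens and the masked positions (prediction targets).
    """
    rng = rng or random.Random(0)
    masked = list(tokens)
    positions: List[int] = []
    for start, end in entity_spans:
        if rng.random() < mask_prob:
            for i in range(start, end):
                masked[i] = MASK
                positions.append(i)
    return masked, positions


if __name__ == "__main__":
    tokens = ["Marie", "Curie", "was", "born", "in", "Warsaw", "."]
    print(mask_entities(tokens, [(0, 2), (5, 6)]))
```

Masking the full span forces the model to recover the entity from context rather than from its remaining tokens, which is the motivation for this variant of the masking task.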

Considerations:

  • As a result, we have LLMs with better language understanding.
  • Empirical evaluation has shown that this combination can improve reasoning capabilities in LLMs.
  • Tail entities, i.e., entities not frequently mentioned in the text, are better learned and modeled by the resulting LLM.
  • Harmonizing and combining heterogeneous embedding spaces, such as text and graph embeddings, is challenging. Therefore, experts in NLP and Graph Machine Learning are required to properly apply these methods.
  • More resources are required because pre-training time is extended.

Standards:

  • TODO 

Answer 3: Integrate KGs during Fine-Tuning – Post pre-training enhancement

Description: These methods inject KG knowledge into LLMs through fine-tuning with relevant data on additional tasks. The goal is to improve the model’s performance on specific domain tasks. We focus on Parameter Efficient Fine-Tuning (PEFT) methods on Decoder-Only transformers, such as GPT and Llama models, due to the significant potential to complement LLMs widely offered by different organizations. The methods transform structured knowledge from the KG into textual descriptions and are utilized in the following ways:

  1. Task- and domain-specific fine-tuning data: task specificity should go hand in hand with domain orientation. Thus, we generate fine-tuning data, leveraging the power of KGs and reasoning to build task- and domain-specific corpora for LLM fine-tuning.
  2. Knowledge-enhanced prompts: prompts are generated automatically to improve the outputs of LLMs in scenarios that require recommendations and explanations of causality. The neighborhood subgraph around each node is iteratively partitioned and encoded into textual sentences that serve as fine-tuning data. This transforms graph structure into a format that LLMs can ingest and be fine-tuned on. We explore encoding strategies.
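The neighborhood encoding mentioned in the second item can be illustrated with a small sketch. It groups the outgoing edges of each node, partitions them into chunks, and writes one sentence per chunk. This is only an illustrative encoding under simple assumptions (one-hop neighborhoods, underscore-separated relation names); it is not the encoding used by any of the cited methods.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def neighborhood_sentences(
    triples: Iterable[Triple], max_facts: int = 3
) -> Dict[str, List[str]]:
    """Encode each node's one-hop neighborhood as short sentences.

    Edges are partitioned into chunks of at most max_facts so that
    no single training sentence grows too long.
    """
    outgoing = defaultdict(list)
    for s, p, o in triples:
        outgoing[s].append((p, o))

    result: Dict[str, List[str]] = {}
    for node, edges in outgoing.items():
        chunks = [edges[i:i + max_facts] for i in range(0, len(edges), max_facts)]
        result[node] = [
            f"{node} " + ", ".join(f"{p.replace('_', ' ')} {o}" for p, o in chunk) + "."
            for chunk in chunks
        ]
    return result


if __name__ == "__main__":
    demo = [
        ("aspirin", "treats", "headache"),
        ("aspirin", "interacts_with", "warfarin"),
        ("aspirin", "is_a", "NSAID"),
    ]
    print(neighborhood_sentences(demo, max_facts=2))
```

The chunk size trades off sentence length against how many related facts the model sees together, which is one of the encoding choices worth comparing.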

Considerations:

  • The methods are low-cost and more straightforward to implement than pre-training LLMs.
  • They can effectively improve LLMs’ performance on specific tasks.
  • Suitable for domain-specific tasks and text generation scenarios that require sensitive information filtering.
  • Finding the most relevant knowledge in the KG may limit the Fine-Tuning process.
  • These methods may impose certain limitations on the LLM to freely create content.

Standards:

  • Fine-Tuning Large Enterprise Language Models via Ontological Reasoning
  • GLaM: Fine-Tuning Large Language Models for Domain Knowledge Graph Alignment via Neighborhood Partitioning and Generative Subgraph Encoding
  • GraphGPT: Graph Instruction Tuning for Large Language Models 

References:

  • [1] S. Pan, L. Luo, Y. Wang, C. Chen, J. Wang, and X. Wu, "Unifying Large Language Models and Knowledge Graphs: A Roadmap", IEEE Trans. Knowl. Data Eng., vol. 36, no. 7, pp. 3580–3599, July 2024, doi: 10.1109/TKDE.2024.3352100.
  • [2] T. Wang et al., "What Language Model Architecture and Pretraining Objective Works Best for Zero-Shot Generalization?", in Proceedings of the 39th International Conference on Machine Learning, PMLR, June 2022, pp. 22964–22984. Accessed: October 3, 2024. [Online]. Available: https://proceedings.mlr.press/v162/wang22u.html
  • [3] Liu, Weijie, et al. K-BERT: Enabling Language Representation with Knowledge Graph. arXiv:1909.07606, arXiv, 17 September 2019. arXiv.org, http://arxiv.org/abs/1909.07606.
  • [4] Sun, Tianxiang, et al. "CoLAKE: Contextualized Language and Knowledge Embedding". Proceedings of the 28th International Conference on Computational Linguistics, edited by Donia Scott et al., International Committee on Computational Linguistics, 2020, pp. 3660–70. ACLWeb, https://doi.org/10.18653/v1/2020.coling-main.327.
  • [5] Zhang, Zhengyan, et al. "ERNIE: Enhanced Language Representation with Informative Entities". Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, edited by Anna Korhonen et al., Association for Computational Linguistics, 2019, pp. 1441–51. ACLWeb, https://doi.org/10.18653/v1/P19-1139

ADD NEW TOP LEVEL SECTION: ENHANCING LLMs AT INFERENCE TIME

How do I use KGs for Retrieval-Augmented Generation (RAG)? (2.1 – Prompt Enhancement) – length: up to one page

Lead: Diego Collarana 

Contributors:

  • Daniel Burkhardt (FSTI)
  • Robert David (SWC)
  • Diego Collarana (FIT)
  • Daniel Baldassare (doctima)
  • Michael Wetzel (Coreon)

Problem statement

RAG methods aim to enhance the capabilities of LLMs by providing real-time information and domain-specific knowledge that may not be present in their training data. Despite its advantages over standalone LLMs, conventional RAG has the following limitations:

  1. It struggles to answer queries that depend on how pieces of information are interconnected and on the global context, both of which are crucial for generating comprehensive summaries.
  2. It cannot integrate structured and unstructured data, a use case typically required in industrial applications.
  3. It has limited accuracy due to context loss during text chunking and its reliance on text similarity search.
  4. It has limited reasoning capabilities, especially with abstract questions that require reasoning, inference, or the synthesis of new information not explicitly stated in the source material.
  5. The answers cannot be traced back to the information sources (factual grounding).
  6. The external knowledge, while consistent, can still lead to inconsistencies in the generated answer.

Explanation of concepts

  • Retrieval-augmented generation (RAG) methods combine retrieval mechanisms with generative models to enhance the output of LLMs by incorporating external knowledge. By grounding the generated output in specific and relevant information, RAG methods improve the quality and accuracy of the generated output.
  • Types of RAG:

    • Conventional RAG has three components: 1) Knowledge Base, typically created by chunking text documents, transforming them into embeddings, and storing them in a vector store. 2) Retriever searches the vector database for chunks that exhibit high similarity to the query. 3) Generator feeds the retrieved chunks, alongside the original query, to an LLM to generate the final response.
    • Graph RAG integrates knowledge graphs into the RAG framework, allowing for the retrieval of structured data that can provide additional context and factual accuracy to the generative model. 
      The retrieval can be done on any source with a semantic representation, e.g., documents with semantic annotations or relational data via OBDA or R2RML, thereby ingesting structured and unstructured source information into the Graph RAG.
  • RAG is used in various natural language processing tasks, including question-answering, information extraction, sentiment analysis, and summarization. It is particularly beneficial in scenarios requiring domain-specific knowledge.
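The three components of conventional RAG described above can be sketched in a few lines. This is a toy illustration, not a production design: the `embed` function is a bag-of-words stand-in for a dense embedding model, the in-memory list replaces a vector store, and the final LLM call is omitted, so the sketch stops at the assembled prompt.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Stand-in for an embedding model: bag-of-words term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: List[str], k: int = 2) -> List[str]:
    """Retriever: return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query: str, chunks: List[str]) -> str:
    """Generator input: retrieved chunks plus the original query."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only the context.\nContext:\n{context}\nQuestion: {query}"


if __name__ == "__main__":
    knowledge_base = [
        "The X200 pump is serviced every six months.",
        "Invoices are archived for ten years.",
        "The cafeteria opens at noon.",
    ]
    question = "How often is the X200 pump serviced?"
    print(build_prompt(question, retrieve(question, knowledge_base, k=1)))
```

Because retrieval relies only on surface similarity between query and chunk, the sketch also shows where relational or structured knowledge has no way to enter the pipeline.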

Brief description of the state-of-the-art

The emerging field of Graph RAG develops methods to exploit the rich, structured relationships between entities within a KG to retrieve more precise, factually relevant context for LLMs [9]. Graph RAG methods encompass graph construction, knowledge retrieval, and answer-generation techniques [1,2,5]. Methods range from those that leverage existing open-source KGs [3] to those that automatically build domain-specific KGs from raw textual data using LLMs [6]. The retrieval phase focuses on efficiently extracting pertinent subgraphs, paths, or nodes relevant to a user query with techniques like embedding similarity, pre-defined rules, or LLM-guided search. In the generation phase, retrieved graph information is transformed into LLM-compatible formats, such as graph languages, embeddings, or GNN encodings, to generate enriched and contextually grounded responses [4]. Recently, significant attention has been given to hybrid approaches combining the strengths of conventional RAG and Graph RAG [7,8]. HybridRAG integrates contextual information from traditional vector databases and knowledge graphs, resulting in a more balanced and effective system that surpasses individual RAG approaches in critical metrics like faithfulness, answer relevancy, and context recall.

We describe various solutions for integrating knowledge graphs into RAG systems to improve accuracy, reliability, and explainability. 

Answer 1: Knowledge Graph as a Database with Natural Language Queries (NLQ)

Description: This solution treats the knowledge graph as a structured database and leverages natural language queries (NLQ) to retrieve specific information. The implementation steps are as follows:

  • First, the user's question is processed to extract key entities and relationships using entity linking and relationship extraction techniques. (Natural Language Understanding)
  • Next, the natural language query is partially or fully mapped into a graph query language, e.g., Cypher or SPARQL. (Graph Query Construction)
  • Then, the constructed graph query is executed against the knowledge graph database, which retrieves precise and targeted information from the knowledge graph. (Knowledge Graph Execution)
  • Finally, the retrieved results are passed to the LLM for summarization or further processing to generate the final answer. (Response generation)
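The first two steps above (natural language understanding and graph query construction) can be sketched with a template-based mapping to SPARQL. The lookup tables and the `http://example.org/` IRIs are hypothetical placeholders for real entity linking and relation extraction components; executing the query against a triple store and passing the results to an LLM are left out.

```python
from typing import Dict, Optional

# Hypothetical lookup tables standing in for entity linking and
# relation extraction; the IRIs are placeholders, not a real vocabulary.
ENTITY_IRIS: Dict[str, str] = {"berlin": "http://example.org/Berlin"}
RELATION_IRIS: Dict[str, str] = {"population": "http://example.org/population"}


def link(question: str, table: Dict[str, str]) -> Optional[str]:
    """Return the IRI of the first surface form found in the question."""
    lowered = question.lower()
    for surface, iri in table.items():
        if surface in lowered:
            return iri
    return None


def to_sparql(question: str) -> Optional[str]:
    """Map a question to a single-triple SPARQL query, or None if unsure."""
    entity = link(question, ENTITY_IRIS)
    relation = link(question, RELATION_IRIS)
    if entity is None or relation is None:
        return None  # do not guess a query when linking fails
    return f"SELECT ?value WHERE {{ <{entity}> <{relation}> ?value . }}"


if __name__ == "__main__":
    print(to_sparql("What is the population of Berlin?"))
```

Returning `None` when linking fails reflects the "Accurate Query Mapping" consideration: an incorrect query retrieves precise but wrong facts, so a fallback path is usually preferable to a guessed query.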

Considerations:

  • Accurate Query Mapping: Requires advanced NLP techniques to map natural language queries to graph queries accurately. Entity linking and relationship extraction must be precise to ensure correct query formulation.
  • Performance Efficiency: Executing complex graph queries may impact performance, especially with large-scale knowledge graphs. Optimization of graph databases and queries is necessary for real-time applications.
  • Scalability: The system should handle growing knowledge graphs without significant performance loss. Scalable graph database solutions are essential.
  • User Experience: The system must effectively interpret user intent from natural language inputs. Providing clear and concise answers enhances usability and trust.

Standards and Protocols:

  • Compliance with Data Standards: Ensure the knowledge graph adheres to relevant data modeling standards. Where applicable, utilize standardized vocabularies and ontologies.
  • Interoperability: Design the system for various graph databases and query languages. Support integration with external data sources and systems.

Answer 2: Knowledge Graph-Guided Retrieval Mechanisms

Description: KG-guided retrieval mechanisms use knowledge graphs, for example alongside vector databases, to enhance the retrieval process in RAG systems. Knowledge graphs provide a structured representation of knowledge, enabling more precise and contextually aware information retrieval. This approach can directly query knowledge graphs or use them to augment queries to other data sources, improving the relevance and accuracy of the retrieved information.

  • First, the user's question is processed to extract key entities and relationships using entity linking and relationship extraction techniques as a (semantic) graph representation of the question. (Natural Language Understanding)
  • Next, the graph representation is executed against the knowledge graph database, which first retrieves information from the knowledge graph and then retrieves the associated mapped data source.
    Data sources can be of different kinds:
    • Knowledge graph data
    • Non-knowledge graph data with a graph representation:
      • Tabular and relational data, e.g., via OBDA or R2RML.
      • Semi-structured data, e.g., XML or DITA.
      • Unstructured natural language, e.g., via semantic annotations.
  • Then, the retrieved (different kinds of) results are consolidated (preprocessed) to be ingested into the LLM prompt. (Data consolidation)
  • Finally, the consolidated results are passed to the LLM for summarization or further processing to generate the final answer. (Response generation)
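The retrieval and consolidation steps above can be sketched as follows. The KG fragment, the source identifiers, and the `SOURCES` dictionary are hypothetical; in a real deployment the mapped sources would be reached via mechanisms such as OBDA/R2RML mappings or semantic annotations rather than a dictionary lookup.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical KG fragment whose objects point to mapped data sources.
KG: List[Triple] = [
    ("PumpX200", "documentedIn", "doc:manual-12"),
    ("PumpX200", "recordedIn", "row:assets/7"),
]

# Hypothetical mapped sources: one unstructured text, one relational row.
SOURCES: Dict[str, str] = {
    "doc:manual-12": "Replace the seal every 2,000 operating hours.",
    "row:assets/7": "location=Hall 3; status=active",
}


def retrieve_via_kg(entity: str) -> List[Tuple[str, str, str]]:
    """Follow KG edges from the entity to the mapped source content."""
    hits = []
    for s, p, o in KG:
        if s == entity and o in SOURCES:
            hits.append((p, o, SOURCES[o]))
    return hits


def consolidate(entity: str, hits: List[Tuple[str, str, str]]) -> str:
    """Merge heterogeneous results into one prompt-ready text block."""
    lines = [f"[{source_id} via {predicate}] {content}" for predicate, source_id, content in hits]
    return f"Facts about {entity}:\n" + "\n".join(lines)


if __name__ == "__main__":
    print(consolidate("PumpX200", retrieve_via_kg("PumpX200")))
```

Keeping the source identifier and predicate next to each snippet preserves some provenance in the prompt, although, as noted under "Semantic gap", the formal semantics of the KG are still lost at this point.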

Considerations:

  • Limited input data: a short user's question poses a challenge to effectively create a graph representation sufficiently expressive for a high-quality retrieval of information.
  • Knowledge model: a high-quality graph representation of both the user question and the actual information in the database is very likely to need a knowledge model with sufficient expressivity in the background.
  • Graph representation: doing graph-based retrieval of (heterogeneous) data sources needs an established graph representation for each source.
  • Consolidation architecture: Setting up a system architecture for consolidated data sources needs different kinds of integration components.
  • Semantic gap: there is the risk of a gap of semantic information between the retrieved information and the LLM-generated answer, because any semantics contained in the knowledge graph and any knowledge model cannot be preserved during ingestion into the LLM generation.

Standards and Protocols:

  • Compliance with Data Standards: Ensure the knowledge graph adheres to relevant data modeling standards. Where applicable, utilize standardized vocabularies and ontologies.
  • Interoperability: Design the system for various graph databases and query languages. Support integration with external data sources and systems.

Answer 3: Hybrid RAG Combining KGs and Dense Vectors

Draft from Daniel Burkhardt

Description: Hybrid Retrieval combines the strengths of knowledge graphs and dense vector representations to improve information retrieval. This approach leverages the structured, relational data from knowledge graphs and the semantic similarity captured by dense vectors, resulting in enhanced retrieval capabilities. Hybrid retrieval systems can improve semantic understanding and contextual insights while addressing scalability and integration complexity challenges.

  • First, the user submits a query, which is analyzed to select which retrieval approach to use (1.*). (Arbitrator or Classification)
  • The retrieval components are called either in parallel or sequentially (Hybrid Retrieval Process)
    • Vector Search: Retrieves data based on vector embeddings
    • Keyword Search: Retrieves data based on keyword matching
    • Graph Queries: Retrieves structured data from the knowledge graph
  • Then, we combine results from all retrieval methods. (Result Integration)
  • Finally, the LLM generates and delivers the response. (Response Generation)
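For the result integration step, one commonly used option is reciprocal rank fusion (RRF). The draft does not prescribe a fusion method, so RRF is an assumption chosen here for illustration; the document IDs are hypothetical, and `k = 60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked ID lists into one ranking.

    Each list contributes 1 / (k + rank) per item, so items ranked
    highly by several retrievers rise to the top.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; ties are broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    vector_hits = ["d1", "d2", "d3"]   # vector search
    keyword_hits = ["d2", "d4"]        # keyword search
    graph_hits = ["d2", "d1"]          # graph queries
    print(reciprocal_rank_fusion([vector_hits, keyword_hits, graph_hits]))
```

RRF only needs ranks, not comparable scores, which makes it convenient when vector similarity, keyword relevance, and graph query results are not on a common scale.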

Considerations:

  • Requires efficient result fusion techniques
  • Addresses diverse data types and sources
  • Increases response latency

REFERENCE TO BE REMOVED

References

  • [1] Boci Peng, Yun Zhu, Yongchao Liu, Xiaohe Bo, Haizhou Shi, Chuntao Hong, Yan Zhang, Siliang Tang: Graph Retrieval-Augmented Generation: A Survey. CoRR abs/2408.08921 (2024)
  • [2] Diego Collarana, Moritz Busch, Christoph Lange: Knowledge Graph Treatments for Hallucinating Large Language Models. ERCIM News 2024(136) (2024)
  • [3] Junde Wu, Jiayuan Zhu, Yunli Qi: Medical Graph RAG: Towards Safe Medical Large Language Model via Graph Retrieval-Augmented Generation. CoRR abs/2408.04187 (2024)
  • [4] Sen, Priyanka, Sandeep Mavadia, and Amir Saffari. Knowledge graph-augmented language models for complex question answering. Proceedings of the 1st Workshop on Natural Language Reasoning and Structured Explanations - NLRSE (2023)
  • [5] Shirui Pan, Linhao Luo, Yufei Wang, Chen Chen, Jiapu Wang, Xindong Wu: Unifying Large Language Models and Knowledge Graphs: A Roadmap. IEEE Trans. Knowl. Data Eng. 36 (7) (2024)
  • [6] Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Jonathan Larson: From Local to Global: A Graph RAG Approach to Query-Focused Summarization. CoRR abs/2404.16130 (2024)
  • [7] Bhaskarjit Sarmah, Benika Hall, Rohan Rao, Sunil Patel, Stefano Pasquali, Dhagash Mehta: HybridRAG: Integrating Knowledge Graphs and Vector Retrieval Augmented Generation for Efficient Information Extraction. CoRR abs/2408.04948 (2024)
  • [8] Jens Lehmann, Dhananjay Bhandiwad, Preetam Gattogi, Sahar Vahdati: Beyond Boundaries: A Human-like Approach for Question Answering over Structured and Unstructured Information Sources. Trans. Assoc. Comput. Linguistics (2024)
  • [9] Juan Sequeda, Dean Allemang, Bryon Jacob: A Benchmark to Understand the Role of Knowledge Graphs on Large Language Model's Accuracy for Question Answering on Enterprise SQL Databases. GRADES/NDA (2024)

How do I enhance LLM explainability by using KGs? (2.2 – Answer Verification) – length: up to one page

Lead: Daniel Burkhardt

Problem Statement

LLMs have demonstrated impressive capabilities in generating human-like responses across diverse applications. However, their internal workings remain opaque, posing challenges to explainability and interpretability [2]. This lack of transparency introduces risks, especially in high-stakes applications, where LLMs may produce outputs with factual inaccuracies (commonly known as hallucinations) or even harmful content due to misinterpretation of prompts [6, 9]. Consequently, there is a pressing need for enhanced explainability in LLMs to ensure the accuracy, trustworthiness, and accessibility of model outputs for end-users and researchers alike [2, 4].

One promising approach to improving LLM explainability is integrating KGs, which provide structured, fact-based representations of knowledge. KGs store relationships between entities in a networked format, enabling models to reference explicit connections between concepts and use these as reasoning pathways in generating text [10]. By aligning LLM responses with verified facts from KGs, we aim to reduce hallucinations and create outputs grounded in reliable data. For example, multi-hop reasoning over KGs can improve consistency by allowing LLMs to draw links across related entities—a particularly valuable approach for complex, domain-specific queries [11]. Additionally, retrieval-augmented methods that incorporate KG triplets can further enhance the factuality of LLM outputs by directly integrating structured knowledge into response generation, thereby minimizing unsupported claims [3].

However, the integration of KGs with LLMs presents unique challenges, particularly in terms of scalability and model complexity. The vast number of parameters in LLMs makes interpreting and tracing decision paths challenging, especially when the model must align with external knowledge sources like KGs [8]. Traditional interpretability methods, such as feature attribution techniques like SHAP and gradient-based approaches, are computationally intensive and less feasible for models with billions of parameters [2, 12]. Therefore, advancing KG-augmented approaches is essential for creating scalable, efficient solutions for real-world applications.

The need for KG-augmented LLMs is especially critical in domain-specific contexts, where high-fidelity, specialized information is essential. In fields such as medicine and scientific research, domain-specific KGs provide precise and contextually relevant information that general-purpose KGs cannot match [1]. Effective alignment with these KGs would not only support more accurate predictions but also enable structured, explainable reasoning, thereby making LLMs' decision-making processes transparent and accessible for domain experts and general users alike.

Explanation of concepts 

  • KG Alignment with LLMs: This refers to ensuring that the representations generated by LLMs are in sync with the structured knowledge found in KGs. For example, frameworks like GLaM fine-tune LLMs to align their outputs with KG-based knowledge, ensuring that responses are factually accurate and well-grounded in known data [3]. By aligning LLMs with structured knowledge, the explainability of model predictions is improved, making it easier for users to verify how and why certain information was provided [1].

  • KG-Guided (post-hoc) Explanation Generation: KGs assist in generating explanations for LLM outputs by providing a logical path or structure to the answer. By referencing entities and their relationships within a KG, LLMs can produce detailed, justifiable answers. Studies like those in the education domain use KG data to provide clear, factually supported explanations for LLM-generated responses [2,5]. Some approaches involve equipping LLMs with tools, such as fact-checking systems, that reference KG data for verifying outputs post-generation. Through this process, known as post-hoc explanation, LLMs can justify or clarify responses by referencing relevant facts from KGs, enhancing user trust and transparency. This augmentation allows LLMs to provide clearer justifications and improve the credibility of their outputs by aligning with trusted knowledge sources [7].

  • Domain-Specific Knowledge Enhancement: In specialized fields like medicine or science, domain-specific KGs provide high-fidelity information that general-purpose KGs cannot offer. Leveraging these specialized KGs, LLMs can generate responses that are both contextually relevant and reliable, meeting the specific knowledge needs of domain experts. This alignment with specialized KGs is critical to ensuring that outputs are appropriate for expert users and rooted in precise, authoritative knowledge [1].
  • Factuality and Verification: Knowledge Graphs (KGs) provide structured, factual knowledge that serves as a grounding source for LLM outputs. By referencing verified relationships between entities, KGs help reduce the occurrence of hallucinations and ensure that responses are factually accurate. This grounding aligns LLM outputs with established knowledge, which is essential in high-stakes fields where accuracy is critical. Systems like GraphEval [6] analyze LLM outputs by comparing them to large-scale KGs, ensuring that the content is factual. This verification step mitigates hallucination risks and ensures outputs are reliable [2,6,7].

Brief description of the state of the art

Recent research in integrating KGs with LLMs has produced several frameworks and methodologies designed to enhance model transparency, factuality, and domain relevance. Key initiatives include KG alignment and post-hoc verification techniques, both of which aim to improve the explainability and reliability of LLM outputs.

For KG alignment, approaches such as the GLaM framework fine-tune LLMs to align responses with KG-based knowledge. This ensures that model outputs remain factually grounded, particularly by embedding KG information into the LLM’s representation space. GLaM has demonstrated that aligning model outputs with structured knowledge can reduce factual inconsistencies, supporting applications that require reliable, fact-based answers [3].

In post-hoc explanation generation, frameworks like FACTKG leverage KG data to verify model responses after generation, producing detailed justifications that reference specific entities and relationships. This KG-guided approach has shown efficacy in fields like education, where models need to generate clear, factually supported answers to complex questions. FACTKG’s methodology enables LLMs to produce explanations that are both traceable and verifiable, thereby improving user trust in the generated content [5].

In domain-specific contexts, specialized KGs provide high-fidelity information that general-purpose KGs cannot. For instance, in the medical domain, projects like KnowPAT have incorporated domain-specific KGs to enhance LLM accuracy in delivering contextually appropriate responses. By training LLMs with healthcare-specific KGs, KnowPAT enables models to provide precise, authoritative responses that align with expert knowledge, which is crucial for sensitive fields where general-purpose knowledge may be insufficient [1].

Further, initiatives such as GraphEval underscore the role of KGs in factuality and verification. By analyzing LLM outputs through comparisons with large-scale KGs, GraphEval ensures that model responses align with known, structured facts, helping mitigate hallucination risks. This comparison process has proven valuable in high-stakes fields, as it enables verification of LLM-generated information against a vast repository of established facts, making outputs more reliable and reducing potential inaccuracies [6].

Answer 1: Measuring KG Alignment in LLM Representations

Description

Measuring the alignment between LLM representations and KGs involves comparing how well the LLM’s output matches the structured knowledge in the KG. For example, in GLaM [3], fine-tuning is performed to align LLM outputs with KG-derived entities and relationships, ensuring that responses are not only accurate but also interpretable. The alignment helps reduce issues like hallucinations by grounding responses in verifiable data. This method was used to improve performance in domain-specific applications, where LLMs need to accurately reflect relationships and entities defined in KGs [3, 4, 5].

Considerations

  • Data Quality and Coverage: The quality and completeness of the KG significantly impact alignment. If the KG lacks comprehensive data, the LLM might still produce hallucinations or incomplete outputs despite alignment efforts.
  • Alignment Metrics: Measuring alignment between LLM and KG representations requires specific metrics. These may include entity coverage, similarity scores between LLM-generated and KG-retrieved responses, or accuracy in matching relational paths in the KG.
  • Computational Complexity: Aligning LLM representations with extensive KGs is computationally intensive, especially with large-scale KGs. Efficient alignment techniques and resource allocation are crucial for scalable implementations.
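The alignment metrics mentioned above can be illustrated with a small sketch. It assumes that entities and triples have already been extracted from the LLM answer; the function names `entity_coverage` and `relation_accuracy` and the toy data are hypothetical and only show one simple way such scores could be computed.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]


def entity_coverage(answer_entities: Set[str], kg_entities: Set[str]) -> float:
    """Fraction of entities mentioned in the answer that also exist in the KG."""
    if not answer_entities:
        return 0.0
    return len(answer_entities & kg_entities) / len(answer_entities)


def relation_accuracy(answer_triples: Set[Triple], kg_triples: Set[Triple]) -> float:
    """Fraction of triples stated in the answer that match a KG triple exactly."""
    if not answer_triples:
        return 0.0
    return len(answer_triples & kg_triples) / len(answer_triples)


# Toy KG and toy triples assumed to be extracted from an LLM answer.
kg = {("aspirin", "treats", "headache"), ("aspirin", "is_a", "NSAID")}
answer = {("aspirin", "treats", "headache"), ("aspirin", "treats", "insomnia")}

kg_entities = {e for s, _, o in kg for e in (s, o)}
answer_entities = {e for s, _, o in answer for e in (s, o)}

coverage = entity_coverage(answer_entities, kg_entities)
accuracy = relation_accuracy(answer, kg)
print(coverage, accuracy)
```

Exact set matching is deliberately strict; a practical setup would likely need entity linking and relation normalization before such scores are meaningful.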

Standards, Protocols and Scientific Publications

  • Embedding Alignment: Techniques like TransE and BERT-based entity alignment support embedding alignment between LLMs and KGs.
  • KG Ontologies: Ontologies help structure KG data and provide a common format, such as OWL (Web Ontology Language) and RDFS (RDF Schema).
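As a rough illustration of the embedding idea behind TransE [13], the sketch below scores a triple by the distance between head + relation and tail. The two-dimensional vectors are made up for the example and are not learned embeddings; the function name `transe_distance` is hypothetical.

```python
import math
from typing import Sequence


def transe_distance(head: Sequence[float], relation: Sequence[float],
                    tail: Sequence[float]) -> float:
    """L2 distance ||h + r - t||; a lower value means a more plausible triple."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


# Toy embeddings chosen by hand so that paris + capital_of == france.
paris = [1.0, 0.0]
capital_of = [0.0, 1.0]
france = [1.0, 1.0]
germany = [3.0, 1.0]

plausible = transe_distance(paris, capital_of, france)
implausible = transe_distance(paris, capital_of, germany)
print(plausible, implausible)
```

In a trained model, the embeddings are learned so that true triples obtain small distances and corrupted triples obtain large ones.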

Answer 2: KG-Guided Explanation Generation

Description

KGs can be used to guide the explanation generation process, where the LLM references structured data in the KG to justify its output. For instance, in the educational domain, explanations are generated using semantic relations from KGs to ensure that recommendations and answers are factually supported [4]. This method not only provides the user with an understandable explanation but also reduces hallucination risks by ensuring that every output can be traced back to a known fact in the KG. Studies on KG-guided explanation generation in various fields confirm its utility in making LLM outputs more transparent and understandable to non-experts [4,5].

Considerations

  • Interpretability for Non-Experts: Presenting KG-based explanations to lay audiences can be challenging if the KG structure is complex. Simplified language or visual aids may be necessary to enhance comprehension.
  • Semantic Completeness: The KG needs to encompass the relevant relationships to generate comprehensive explanations. Missing relationships may lead to partial or insufficient justifications.
  • Consistency Across Domains: Cross-domain use cases require consistent explanations, which may be complex if the KG includes overlapping or domain-specific relations that vary significantly.

Standards, Protocols and Scientific Publications

Answer 3: KG-Based Fact-Checking and Verification

Description: 

KG-based fact-checking is an essential method for improving LLM explainability. By cross-referencing LLM outputs with structured knowledge in KGs, fact-checking systems like GraphEval ensure that generated responses are accurate and grounded in truth [6, 7]. This is especially useful for reducing hallucinations. GraphEval automates the process of verifying LLM outputs against a KG containing millions of facts, allowing for scalable and efficient fact-checking that improves both explainability and user trust [6].

  • First, the generated output is analyzed against the knowledge graph, and key entities and relationships are extracted to create a graph representation of the LLM answer.
  • Next, this graph representation is compared with the knowledge graph used for retrieval, including any background knowledge models. The analysis yields a graph representation of an explanation or justification, which is returned as a (sub)graph or graph traversal together with any additional information, such as RDF* weights.
  • Finally, the explanation is presented to the user in a human-readable form so that it can be cross-checked against the LLM-generated answer.
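The three steps above can be sketched as follows. This is a minimal illustration under strong assumptions: `extract_triples` is a hypothetical stand-in for an information-extraction model and simply parses "subject | relation | object" lines, and the KG is a plain set of triples rather than a graph database.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]


def extract_triples(answer: str) -> Set[Triple]:
    """Hypothetical stand-in for an extraction model.

    Expects one 'subject | relation | object' statement per line.
    """
    triples: Set[Triple] = set()
    for line in answer.splitlines():
        parts = [part.strip() for part in line.split("|")]
        if len(parts) == 3 and all(parts):
            triples.add((parts[0], parts[1], parts[2]))
    return triples


def check_against_kg(answer_triples: Set[Triple], kg: Set[Triple]) -> Dict[str, List[Triple]]:
    """Split the answer graph into triples found in the KG and triples that are not."""
    return {
        "supported": sorted(t for t in answer_triples if t in kg),
        "unsupported": sorted(t for t in answer_triples if t not in kg),
    }


def render_report(result: Dict[str, List[Triple]]) -> str:
    """Turn the verification result into a human-readable explanation."""
    lines = [f"SUPPORTED: {s} {p} {o}" for s, p, o in result["supported"]]
    lines += [f"NOT FOUND IN KG: {s} {p} {o}" for s, p, o in result["unsupported"]]
    return "\n".join(lines)


kg = {("Berlin", "capital_of", "Germany"), ("Germany", "member_of", "EU")}
llm_answer = "Berlin | capital_of | Germany\nBerlin | located_in | France"

report = render_report(check_against_kg(extract_triples(llm_answer), kg))
print(report)
```

A triple that is not found in the KG is not necessarily false; with an incomplete KG it only means the statement could not be verified.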

Considerations:

  • Limited Input Data: Short LLM-generated answers may lack sufficient information for thorough fact-checking. In such cases, additional context or auxiliary information may be required to validate the response accurately.
  • Presentation for Non-Experts: Graph-based explanations can be challenging to interpret for lay audiences. Translating complex graph structures into user-friendly formats or summaries can improve accessibility.
  • Data Complexity and Scale: Fact-checking across large KGs can be resource-intensive, requiring efficient query algorithms and substantial computational power for high-speed verification.

Standards, Protocols and Scientific Publications:

References

  1. Zhang et al., 2024, "Knowledgeable Preference Alignment for LLMs in Domain-specific Question Answering"
  2. Zhao et al., 2023, "Explainability for Large Language Models: A Survey"
  3. Dernbach et al., 2024, "GLaM: Fine-Tuning Large Language Models for Domain Knowledge Graph Alignment via Neighborhood Partitioning and Generative Subgraph Encoding"
  4. Rasheed et al., 2024, "Knowledge Graphs as Context Sources for LLM-Based Explanations of Learning Recommendations"
  5. Kim et al., 2023, "FACTKG: Fact Verification via Reasoning on Knowledge Graphs"
  6. Liu et al., 2024, "Evaluating the Factuality of Large Language Models using Large-Scale Knowledge Graphs"
  7. Hao et al., 2024, "ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings"
  8. Jiang et al., 2023, "Efficient Knowledge Infusion via KG-LLM Alignment"
  9. Weidinger et al., 2021, "Ethical and social risks of harm from Language Models", arXiv preprint arXiv:2112.04359
  10. Liao et al., 2021, "To hop or not, that is the question: Towards effective multi-hop reasoning over knowledge graphs"
  11. Bratanič et al., 2024, "Knowledge Graphs & LLMs: Multi-Hop Question Answering"
  12. Sundararajan et al., 2017, "Axiomatic Attribution for Deep Networks", International Conference on Machine Learning
  13. Bordes et al., 2013, "Translating Embeddings for Modeling Multi-relational Data"

How do I enhance LLM reasoning through KGs? (2.3 – Answer Augmentation) – length: up to one page

Lead: Daniel Burkhardt

Problem Statement 

Integrating KGs with LLMs offers promising enhancements for reasoning, yet it introduces several critical challenges. One primary issue is the complexity of multi-hop reasoning, which requires navigating multiple KG relationships. This process is computationally intensive and error-prone for LLMs, because each step along the reasoning path requires a separate model call. Errors accumulate across hops, and together with the extensive computational demands this makes it difficult to maintain accuracy and efficiency in multi-hop queries [1, 6, 7].

Scalability and adaptability further complicate the integration, as real-world KGs are often vast and incomplete, requiring costly adaptations that may not generalize well across different domains. This lack of adaptability limits LLM performance on task-specific KGs and can lead to knowledge gaps in complex or specialized domains. Additionally, LLMs are prone to hallucinations (plausible but incorrect outputs), particularly when KG paths are misaligned with the models’ internal representations. This misalignment leads to factual inaccuracies and inconsistencies, a significant risk in domains that require precision, such as healthcare [8, 9].

Efficiency in retrieval and processing also poses a challenge, as KG-augmented models need to filter relevant data without overwhelming the LLM with unnecessary information, and standard retrieval methods often struggle with the interconnected data in KGs [3, 7].

Finally, ensuring interpretability is a key challenge. While KGs can support transparent reasoning by providing structured pathways, integrating these with LLMs in a way that produces human-understandable rationales remains difficult. Current methods often fall short of providing coherent, traceable explanations, which affects the trustworthiness of LLM-based outputs. These challenges underscore the need for further research to address the computational, scalability, and interpretability limitations of KG-augmented LLM reasoning systems [9, 10].

Explanation of concepts 

  • Multi-hop Reasoning with KGs: Multi-hop reasoning involves traversing several relationships within a KG to connect multiple pieces of information. By structuring queries across these relational "hops," LLMs can access layered knowledge, making it possible to answer complex questions that require linking distant but related entities. This is particularly useful for in-depth domain-specific queries where multiple steps are essential to arrive at accurate answers [1, 6].

  • Tool-Augmented Reasoning: In tool-augmented reasoning, LLMs integrate external resources like KGs to aid in decision-making during reasoning. Models like ToolkenGPT [5] leverage tool-based embeddings, allowing the LLM to perform dynamic, real-time KG queries. This augmentation enables LLMs to retrieve structured knowledge that informs reasoning paths and aids logical, stepwise problem-solving [2, 5].

  • Consistency Checking in Reasoning: Consistency checking ensures that LLMs adhere to logical coherence throughout their reasoning process. By systematically cross-referencing LLM outputs with KG facts, systems like KONTEST [4] can evaluate the alignment of generated answers with established knowledge, identifying logical inconsistencies that may arise during reasoning. This reduces contradictions and improves the factual reliability of responses [4, 5].

  • Chain of Thought (CoT) Reasoning Enhanced by KGs: CoT reasoning, when combined with KGs, supports a structured multi-step reasoning process. By organizing KG-based reasoning paths in a sequence, LLMs can maintain logical flow and improve interpretability in complex queries. This structured reasoning, enabled by tracing relationships in the KG, enhances transparency in decision-making by allowing users to follow the steps that led to the final answer [3, 7].
  • Graph-Constrained Reasoning: In graph-constrained reasoning, LLMs are guided by the constraints of the KG, which restricts possible reasoning paths to those that align with verified entity relationships. By adhering to KG-encoded logical structures, LLMs can reduce spurious or unrelated associations, focusing only on reasoning paths that conform to the graph’s factual framework. This enhances logical accuracy and minimizes errors in multi-step reasoning [11, 12].

Brief description of the state of the art 

The current landscape for enhancing LLM reasoning with KGs is advancing with several state-of-the-art methods and tools. Multi-hop reasoning and RAG are foundational techniques enabling LLMs to connect multiple pieces of information across KG paths, facilitating answers to complex, layered questions. Multi-hop approaches, like Paths-over-Graph (PoG), dynamically explore reasoning paths in KGs, integrating path pruning to optimize the reasoning process and focus on relevant data [1, 6, 7, 11].

Another significant development is tool-augmented reasoning, exemplified by systems such as ToolkenGPT, which equips LLMs with the ability to access and utilize KG-based tools during inference. ToolkenGPT creates embeddings for external tools (referred to as "toolkens"), enabling real-time KG lookups that aid in logical, structured reasoning by supplementing the LLM's outputs with factual data drawn directly from KGs. Similarly, Toolformer offers dynamic API-based access to KG data, facilitating reasoning with external support for tasks requiring specific, fact-based insights [2, 5].

Consistency-checking frameworks are also essential for enhancing reasoning accuracy. Systems like KONTEST evaluate LLM outputs against KG facts, ensuring logical coherence and flagging inconsistencies. This method reduces the logical errors LLMs might otherwise produce in reasoning tasks by cross-referencing generated answers with verified KG knowledge. Furthermore, GraphEval is used to assess the factuality of LLM responses, leveraging a judge model that systematically aligns generated answers with KG-derived facts [4, 13, 14].

Chain of Thought (CoT) reasoning combined with KGs enables LLMs to approach reasoning tasks in a multi-step, structured manner. By organizing KG-based reasoning paths into sequential steps, CoT supports transparency and traceability in complex queries, which is particularly useful for answering multi-entity questions. Lastly, graph-constrained reasoning, as seen in frameworks like Graph-Constrained Reasoning and PoG, directs LLM reasoning within predefined KG paths, minimizing irrelevant associations and enhancing logical consistency by adhering to factual constraints within the graph structure [3, 7, 11, 12].

Answer 1: KG-Guided Multi-hop Reasoning

Description: 

KG-guided multi-hop reasoning enables LLMs to connect multiple entities or facts across a KG to address complex queries. By utilizing KGs, LLMs can follow structured, logical paths through interconnected data, which helps them generate answers that would be challenging to derive from unstructured data sources. Multi-hop reasoning allows the LLM to step through relevant nodes within the KG, using each "hop" to refine its understanding and response. For example, the Neo4j graph database supports multi-hop reasoning by allowing LLMs to query interconnected entities efficiently, which enhances performance on tasks that require detailed, stepwise reasoning across multiple facts and relationships. Additionally, models like Paths-over-Graph (PoG) leverage KGs to perform dynamic multi-hop path exploration, optimizing data retrieval by pruning irrelevant information, so that the LLM accesses only the paths most relevant to an accurate answer [1, 3, 7, 11].
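To make the idea of multi-hop path exploration with pruning more concrete, the following sketch runs a breadth-first search over a toy KG, limits the number of hops, and optionally prunes relations that are not considered relevant. It is a generic illustration and not the PoG algorithm; the function `find_paths`, its parameters, and the toy graph are assumptions made for this example.

```python
from collections import deque
from typing import Dict, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]
Graph = Dict[str, List[Tuple[str, str]]]  # node -> [(relation, target), ...]


def find_paths(graph: Graph, start: str, goal: str, max_hops: int,
               allowed_relations: Optional[Set[str]] = None) -> List[List[Triple]]:
    """Return all cycle-free paths from start to goal with at most max_hops edges."""
    paths: List[List[Triple]] = []
    queue = deque([(start, [])])
    while queue:
        node, path = queue.popleft()
        if node == goal and path:
            paths.append(path)
            continue
        if len(path) >= max_hops:
            continue  # hop limit reached, stop expanding this path
        visited = {start} | {target for _, _, target in path}
        for relation, target in graph.get(node, []):
            if allowed_relations is not None and relation not in allowed_relations:
                continue  # prune relations considered irrelevant to the question
            if target in visited:
                continue  # avoid cycles
            queue.append((target, path + [(node, relation, target)]))
    return paths


graph: Graph = {
    "Marie Curie": [("born_in", "Warsaw"), ("field", "Physics")],
    "Warsaw": [("capital_of", "Poland")],
    "Poland": [("located_in", "Europe")],
}

two_hop = find_paths(graph, "Marie Curie", "Poland", max_hops=2)
print(two_hop)
```

The returned path can be verbalized and passed to the LLM as context, so that the answer can be traced back to explicit KG edges.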

Considerations

  1. Path Optimization: Not all multi-hop paths within a KG are equally relevant. Applying path optimization techniques, like pruning irrelevant nodes and paths, ensures that LLMs focus on the most pertinent information, reducing computational overhead and enhancing answer accuracy [1, 6].
  2. Data Quality and Completeness: The reliability of multi-hop reasoning depends heavily on the quality and completeness of the KG. Incomplete or noisy data can lead to erroneous inferences or missed connections, so it is essential to maintain an accurate and well-curated KG [1, 6].
  3. Scalability and Efficiency: KGs, especially large-scale ones, can be computationally intensive to query in multi-hop settings. Efficient query mechanisms and algorithms are crucial to minimize latency and enhance the overall responsiveness of the LLM [1, 6].
  4. Error Accumulation in Multi-hop Paths: With each additional hop, the likelihood of error increases, especially if the KG contains outdated or incorrect information. Error correction techniques, such as consistency checks, can help maintain the quality of multi-hop reasoning [1, 6].

Standards and Protocols and Scientific Publications

Answer 2: KG-Based Consistency Checking in LLM Outputs

Description: 

KG-based consistency checking is a method to enhance the accuracy and logical coherence of LLM outputs by cross-referencing generated answers with structured facts in a KG. This approach helps to ensure that the information LLMs provide aligns with verified knowledge. Systems like KONTEST exemplify this method by systematically using KGs to generate consistency tests, checking the logical validity of LLM outputs before presenting them to users. By evaluating LLM responses against established facts, KONTEST reduces errors in reasoning and enhances the reliability and trustworthiness of model-generated conclusions. Additionally, the use of consistency-checking frameworks like GraphEval allows for scalable verification, applying KG-based facts to systematically evaluate and align LLM outputs, which further mitigates inaccuracies [4, 13, 14].

Considerations

  • Data Completeness and Accuracy: The effectiveness of consistency checking depends on the completeness and accuracy of the KG. Gaps or inaccuracies in the KG can lead to incorrect assessments of LLM outputs, so maintaining a high-quality KG is essential [4, 13, 14].
  • Computational Overhead: Consistency checking involves comparing multiple elements within the LLM response against the KG, which can introduce significant computational costs, especially for large KGs or high-frequency queries [4, 13, 14].
  • Contextual Matching: For effective consistency checks, it’s crucial that the KG context aligns with the LLM's response context. Misalignment may result in false positives or negatives in consistency assessments, affecting accuracy [4, 13, 14].
  • Human-Readable Output: Consistency checks often require translating graph-based verification results into explanations that are accessible to non-expert users, particularly in sensitive applications where explainability is critical [4, 13, 14].
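A minimal sketch of a KG-derived consistency test is given below. It is not the KONTEST implementation; it only illustrates the general idea of deriving a true statement and a corrupted statement from one KG triple and checking whether the model answers both coherently. The names `build_tests` and `find_inconsistencies` and the stub model are hypothetical.

```python
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]
Test = Tuple[str, str]  # (question, expected answer)


def build_tests(triple: Triple, distractor_object: str) -> List[Test]:
    """Derive one true and one corrupted yes/no question from a KG triple."""
    subject, predicate, obj = triple
    return [
        (f"Is it true that {subject} {predicate} {obj}? Answer yes or no.", "yes"),
        (f"Is it true that {subject} {predicate} {distractor_object}? Answer yes or no.", "no"),
    ]


def find_inconsistencies(tests: List[Test],
                         ask: Callable[[str], str]) -> List[Tuple[str, str, str]]:
    """Return (question, expected, actual) for every answer that contradicts the KG."""
    failures = []
    for question, expected in tests:
        answer = ask(question).strip().lower()
        if not answer.startswith(expected):
            failures.append((question, expected, answer))
    return failures


def always_yes_model(question: str) -> str:
    """Stub standing in for an LLM call; it agrees with every statement."""
    return "Yes"


tests = build_tests(("Paris", "is the capital of", "France"), "Italy")
failures = find_inconsistencies(tests, always_yes_model)
print(failures)
```

A model that affirms both the true and the corrupted statement is flagged once, which shows the kind of logical inconsistency such tests are meant to surface.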

Standards and Protocols and Scientific Publications

  • KONTEST Testing Protocol: Used to ensure logical consistency of outputs by cross-verifying LLM results with KG data [4].
  • GraphEval for Automated Consistency Testing: Applies large-scale KGs to systematically assess the factuality and consistency of LLM responses [13].

References

  1. Liao et al., 2021, "To hop or not, that is the question: Towards effective multi-hop reasoning over knowledge graphs"
  2. Schick et al., 2023, "Toolformer: Language Models Can Teach Themselves to Use Tools"
  3. Bratanič et al., 2024, "Knowledge Graphs & LLMs: Multi-Hop Question Answering"
  4. Rajan et al., 2024, "Knowledge-based Consistency Testing of Large Language Models"
  5. Hao et al., 2024, "ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings"
  6. Choudhary and Reddy, 2024, "Complex Logical Reasoning over Knowledge Graphs using Large Language Models"
  7. Jiang et al., 2024, "Reasoning on Efficient Knowledge Paths: Knowledge Graph Guides Large Language Model for Domain Question Answering"
  8. Ding et al., 2023, "A Unified Knowledge Graph Augmentation Service for Boosting Domain-specific NLP Tasks"
  9. Wang et al., 2023, "Unifying Structure Reasoning and Language Pre-training for Complex Reasoning Tasks"
  10. Zhao et al., 2023, "Explainability for Large Language Models: A Survey"
  11. Tan et al., 2024, "Paths-over-Graph: Knowledge Graph Empowered Large Language Model Reasoning"
  12. Akirato/LLM-KG-Reasoning GitHub repository, 2023, "Graph-Constrained Reasoning"
  13. Liu et al., 2024, "Evaluating the Factuality of Large Language Models using Large-Scale Knowledge Graphs"
  14. Lo et al., 2023, "On Exploring the Reasoning Capability of Large Language Models with Knowledge Graphs" (arXiv:2312.00353v1)


How do I evaluate LLMs through KGs? (3) – length: up to one page

Lead: Fabio

Contributors:

  • Daniel Burkhardt (FSTI)
  • Daniel Baldassare (doctima)
  • Fabio Barth (DFKI)
  • Max Ploner (HU)
  • Alan Akbik (HU)
  • ...

Problem statement 

Automatic evaluation of LLMs is usually done by comparing the generated model output with a desired result. For this purpose, many well-established metrics are used, such as exact matching or similarity metrics (BLEU, ROUGE, BERTScore, n-gram overlap). However, especially when the output deviates from the reference answers, conventional similarity metrics are insufficient to measure the factuality of the generated output. Incorporating information from knowledge graphs (KGs) into the evaluation can help ensure an accurate measurement of the factual integrity and reliability of LLM outputs.

KGs can support or enhance these evaluations in several ways, which are outlined below.

Explanation of concepts 

  • Represented Knowledge: KG triples can be used to evaluate how much knowledge an LLM has acquired during training and how consistently this knowledge can be retrieved.
  • Factuality: KG triples can be used to evaluate the output of an LLM by extracting information from the output and comparing it with a KG to check factuality or knowledge coverage. Examples of such knowledge coverage are political positions, cultural or sporting events, or current news. Furthermore, the extracted KG triples can be used to evaluate tasks or features for which a similarity comparison of the LLM output is undesirable, as is the case when identifying and evaluating LLM hallucinations.
  • Biases: KGs can be used to enrich LLM inputs with relevant information. This is beneficial, for example, when in-context learning is used to provide the LLM with relevant information for a specific task. In addition, planned adversarial attacks can be carried out on the LLM to uncover biases or weak points.

Properties to evaluate the LLM on:

  • Represented Knowledge: Which fact queries can the LLM answer correctly & consistently?
  • Factuality: When generating an output, are the facts an LLM uses in its answer correct?
  • Biases: How can bias in LLMs be detected and mitigated using KGs?

Brief description of the state of the art 

Knowledge Graphs (KGs) provide a structured and reliable basis for evaluating the knowledge encoded in LLMs. Relational triples from the KGs can be used to systematically test whether an LLM can accurately retrieve relevant information. Additionally, in cases where direct comparisons between reference text and LLM-generated output fall short in assessing factual accuracy, the output can be converted into a meaningful representation to measure alignment with the KG. Finally, the neutral and structured nature of KG data makes it a valuable tool for identifying and analyzing potential biases within LLMs.

Answer 1: Using KGs to Evaluate LLM Represented Knowledge

Description:

Relational triples from a Knowledge Graph (KG) can be leveraged to create up-to-date and domain-specific datasets for evaluating knowledge within Language Models (LMs) [1, 2, 4, 5, 6]. The LLM can then be queried with the subject and relation to predict the object using either a question-answer pattern [6], predicting the answer using masked language modeling [4, 5], or predicting the correct statement from a multiple-choice item [1,2]. KGs provide the correct answers and enable the creation of plausible distractors (incorrect answer options), allowing the generation of multiple-choice items to evaluate the knowledge represented in an LM [1, 2].
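As an illustration of how a multiple-choice item with plausible distractors might be derived from KG triples, consider the sketch below. It assumes a simple heuristic, namely that objects of the same relation are plausible distractors; the function `make_multiple_choice`, the question template, and the toy KG are hypothetical and not taken from the cited works.

```python
import random
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]


def make_multiple_choice(triple: Triple, kg: List[Triple], question_template: str,
                         n_distractors: int = 3,
                         rng: Optional[random.Random] = None) -> Dict[str, object]:
    """Build one multiple-choice item; distractors are other objects of the same relation."""
    rng = rng or random.Random(0)
    subject, relation, obj = triple
    # Every object that is also correct for this subject must not become a distractor.
    correct = {o for s, r, o in kg if s == subject and r == relation}
    candidates = sorted({o for _, r, o in kg if r == relation} - correct)
    distractors = rng.sample(candidates, min(n_distractors, len(candidates)))
    options = distractors + [obj]
    rng.shuffle(options)
    return {
        "question": question_template.format(subject=subject),
        "options": options,
        "answer": obj,
    }


kg = [
    ("France", "capital", "Paris"),
    ("Germany", "capital", "Berlin"),
    ("Italy", "capital", "Rome"),
    ("Spain", "capital", "Madrid"),
    ("France", "currency", "Euro"),
]

item = make_multiple_choice(kg[0], kg, "What is the capital of {subject}?")
print(item)
```

Sampling distractors from the same relation keeps them type-compatible with the correct answer, which makes guessing harder than with random entities.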

Considerations:

  • While KGs are inherently structured to maintain consistency in factual representation, LMs do not always yield consistent answers, especially when queries are rephrased [2, 3]. Integrating KGs for evaluation set generation can address this by allowing multiple phrasings of a single query, all linked to the same answer and relational triple. This approach helps measure an LM’s robustness in recognizing equivalent rewordings of the same fact [1, 2, 3].
  • When using a question-answering format (to evaluate text-generating / autoregressive LMs), the free-form answer of the model needs to be compared to the reference answer [1]. While there are multiple ways of comparing the answer to the reference [1], no single approach is ideal. Multiple-choice-based approaches mitigate this problem entirely [1, 2, 6] but inherently have a limited answer space and may encourage educated guessing, simplifying the task by providing plausible options. Conversely, open-ended answers require the model to generate the correct response without cues and may align better with real-world use cases.
  • Verbalizing a triple requires not only labels for the subject and object (which are typically annotated with one or multiple labels) but also a rule [1] or template [2] that translates the formal triple into natural text. This may require humans to create one or multiple rules per relation. Depending on the target language and the number of relations used, this can be a non-negligible amount of work.
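The template-based verbalization and the robustness check across rephrasings described above could look roughly like this. The templates, the function names `verbalize` and `paraphrase_accuracy`, and the stub model are assumptions made for the example; matching the reference by substring is only one of several possible comparison strategies.

```python
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]

# Hand-written templates per relation; each one is a rephrasing of the same query.
TEMPLATES: Dict[str, List[str]] = {
    "capital": [
        "What is the capital of {subject}?",
        "Name the capital city of {subject}.",
        "Which city serves as the capital of {subject}?",
    ],
}


def verbalize(triple: Triple, templates: Dict[str, List[str]]) -> List[str]:
    """Turn a (subject, relation, object) triple into several natural-language queries."""
    subject, relation, _ = triple
    return [template.format(subject=subject) for template in templates[relation]]


def paraphrase_accuracy(triple: Triple, templates: Dict[str, List[str]],
                        ask: Callable[[str], str]) -> float:
    """Share of rephrasings for which the answer contains the KG object."""
    expected = triple[2].lower()
    questions = verbalize(triple, templates)
    hits = sum(expected in ask(question).lower() for question in questions)
    return hits / len(questions)


def stub_model(question: str) -> str:
    """Stub standing in for an LLM; it only answers one phrasing correctly."""
    return "Paris" if question.startswith("What") else "I am not sure."


score = paraphrase_accuracy(("France", "capital", "Paris"), TEMPLATES, stub_model)
print(score)
```

A score below 1.0 for a fact the model answers correctly in at least one phrasing indicates inconsistency rather than missing knowledge.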


Standards and Protocols and Scientific Publications:

  • RDF (Resource Description Framework) is a W3C-standardized method for modeling graph data. It encodes information as triples: subject, predicate, and object, where the subject and object are nodes, and the predicate is an arc linking them, often identified by URIs; objects may also be literals.
  • LAMA [4]: Seminal work demonstrating the evaluation of knowledge represented in LLMs using KGs (the use of KGs for generating evaluation datasets has, since then, been employed in various further scientific publications [1, 2, 5, 6, 7]).

Answer 2: Using KGs to Evaluate LLM Factuality

Maybe add additional properties such as factuality, correctness, precision etc. or perhaps keep these that we have right now and call them "selected properties" ... (We could move the definition of these properties to the top and discuss which answer addresses which property)

Description:

KGs hold factual knowledge for various domains, which can be used to analyze and evaluate LLM knowledge coverage [1]. This involves verifying the knowledge represented in an LLM using KGs. As in the previous answer, the target object can be predicted using QA patterns [7, 8]. However, because of the abstract structure of KG triples, the information embedded in a KG cannot be compared with the target object predicted by an LLM using strict matching or similarity metrics. Therefore, the output prediction has to be transformed into a meaning representation that describes the core semantic concepts and relations of the output sequence [8]. Meaning representations should be extracted as graph-based semantic representations. The congruence of the extracted target graph and an objective KG can then be evaluated, and missing or misplaced relations as well as missing or false nodes can be detected [7, 8].

Considerations:

  • Meaning representations: Graph-based meaning representations formally capture the semantics of a sentence in natural language. Various meaning representations can be used to describe the meaning of a sentence, so the chosen representation has to be well-defined before evaluating an LLM on factuality using KGs. The target and objective KG should be mapped onto the same meaning representation [8].
  • Information Extraction: Any evaluated LLM output must be encoded into the pre-defined KG meaning representation. Several approaches have been used and tested in research: text-to-graph generation models [8, 9], KG construction prompts [7], or multi-component extraction, in which entities, coreferences, and relations are detected and extracted in multiple stages [7].
  • KG factuality: Depending on the KG generation strategy, the target and objective KG can be compared and analyzed at different levels and granularities. The general idea is to check whether each triple in the target KG is factually consistent given an objective KG (or context). For instance, a graph neural network (GNN) that encodes edge representations derived from the corresponding entity nodes can be trained on binary classification of factuality or non-factuality of each encoded edge [8]. 
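As a simple baseline for the comparison of a target graph with an objective KG, the sketch below uses plain set operations instead of a trained GNN classifier. It assumes that both graphs already share the same entity and relation vocabulary; the function names and the toy data are hypothetical.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]


def nodes(graph: Set[Triple]) -> Set[str]:
    """Collect all subject and object nodes of a graph."""
    return {node for subject, _, obj in graph for node in (subject, obj)}


def compare_graphs(target: Set[Triple], objective: Set[Triple]) -> Dict[str, set]:
    """Compare the graph extracted from an LLM output with the objective KG."""
    return {
        "supported": target & objective,      # triples confirmed by the KG
        "unsupported": target - objective,    # false or misplaced relations
        "missing": objective - target,        # KG facts absent from the output
        "unknown_nodes": nodes(target) - nodes(objective),
    }


objective = {
    ("Marie Curie", "won", "Nobel Prize in Physics"),
    ("Marie Curie", "born_in", "Warsaw"),
}
target = {
    ("Marie Curie", "won", "Nobel Prize in Physics"),
    ("Marie Curie", "born_in", "Paris"),
}

result = compare_graphs(target, objective)
print(result)
```

Exact matching of this kind cannot judge triples that are correct but absent from the objective KG, which is one reason learned classifiers such as those in [8] are used in practice.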


Standards and Protocols and Scientific Publications:

  • For meaningful graph representations, common standards include Abstract Meaning Representation (AMR) [10] and Open Information Extraction (OpenIE) [11]. AMR is a semantic representation language whose structures are rooted, directed, edge-labeled, and leaf-labeled graphs [10]. In AMR, the edges are semantic relations, and the nodes are concepts. AMR has a fixed relation vocabulary of approximately 100 relations and the inverse of each relation. In OpenIE, on the other hand, relation triples are represented as a subject, an open relation, and the object of the open relation. An open relation means that OpenIE does not rely on a fixed relation vocabulary. In [11], each sentence is represented as a directed acyclic graph, and an extractor is used to enumerate all word pairs and predict their relations in parallel.
  • Extracting information from a text and generating or enhancing a KG from it is discussed in Chapter 4.2. NLP tasks like named entity recognition, coreference resolution, and relation extraction are well-established problems in this field of research that are solved using either generative LLMs or fine-tuned language models [12, 13]. The third option, prompting an LLM to generate a KG, is based on two techniques: in-context learning and chain-of-thought reasoning (explained in Section 4) [7].
  • KG factuality: The standard protocol for checking the factuality of a KG generated from an LLM output sequence is to encode the KG using an LLM or a GNN and predict the factuality using binary classification [8, 14]. For both models, context can be provided in addition to the generated KG for higher precision in the prediction. For this task, the GNN has to be fine-tuned for factuality prediction. When using an LLM for the prediction, prompting can be used to predict the factuality of KG triples [7]. The prompt can be enhanced with in-context learning examples or with the context of factual KG relations [14].
  • Current publications that use the explained techniques are GraphEval [7] and FactGraph [8]. GraphEval uses SOTA LLMs like LLaMA to extract and generate the KG from a given model output. The framework then checks whether each extracted triple is factually consistent with the provided context. FactGraph builds on text and graph encoders that are augmented with structure-aware adapters to classify factuality [8].
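
Prompt-based factuality prediction for single triples can be sketched as follows. This is an illustration under stated assumptions, not the prompt or pipeline of GraphEval [7]: the prompt wording, the function names, and the llm callable (any function that maps a prompt string to an answer string) are hypothetical.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Triple = Tuple[str, str, str]
Example = Tuple[Triple, str, bool]  # (triple, context, is_consistent)


def build_factuality_prompt(triple: Triple, context: str,
                            examples: Iterable[Example] = ()) -> str:
    """Build a yes/no prompt for one triple, optionally with in-context examples."""
    lines = [
        "Decide whether the triple is factually consistent with the context.",
        "Answer with exactly one word: yes or no.",
        "",
    ]
    for ex_triple, ex_context, ex_label in examples:
        lines += [
            f"Context: {ex_context}",
            f"Triple: ({', '.join(ex_triple)})",
            f"Answer: {'yes' if ex_label else 'no'}",
            "",
        ]
    lines += [f"Context: {context}", f"Triple: ({', '.join(triple)})", "Answer:"]
    return "\n".join(lines)


def parse_verdict(answer: str) -> Optional[bool]:
    """Map the model answer to True/False; return None if it is neither yes nor no."""
    words = answer.strip().lower().split()
    first = words[0].strip(".,!:;") if words else ""
    if first == "yes":
        return True
    if first == "no":
        return False
    return None


def inconsistent_triples(triples: Iterable[Triple], context: str,
                         llm: Callable[[str], str],
                         examples: Iterable[Example] = ()) -> List[Triple]:
    """Return all triples that the model does not confirm as consistent."""
    examples = list(examples)
    flagged = []
    for triple in triples:
        answer = llm(build_factuality_prompt(triple, context, examples))
        # Unparseable answers are treated as not confirmed (a conservative assumption).
        if parse_verdict(answer) is not True:
            flagged.append(triple)
    return flagged
```

Passing the model as a plain callable keeps the sketch independent of a specific LLM client. An output would count as factually consistent only if inconsistent_triples returns an empty list.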

Answer 3: Analyzing LLM Biases through KG Comparisons

Description:

Instead of extracting meaning and knowledge from LLM outputs, KGs can also be used to enhance model inputs with structured information. For bias detection, in-context samples can be generated from domain-specific KGs and presented to the model as so-called biased "superior knowledge" in order to manipulate its predictions and test its robustness [15]. This technique can be seen as an adversarial attack: the model is manipulated to check for bias that was acquired during pre-training and has not been mitigated by red-teaming or other bias mitigation techniques. The first step in setting up such an evaluation pipeline is to define a KG, or to extract relevant subgraphs from a larger KG, that covers the bias to be evaluated or a biased context. Such a KG is called a bias KG. The nodes of the bias KG can be encoded, and the top-k nodes that represent a bias with respect to a context or gold standard can be extracted using any efficient retrieval method [15, 16]. From these top-k biased nodes, a set of k in-context samples can be generated using a graph-to-text generation model and embedded in the input prompt of the LLM under evaluation. The generated output is then evaluated against the pre-defined bias.
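
The retrieval and prompt-construction steps described above can be sketched as follows. This is a toy illustration, not the BiasKG implementation [15]: a bag-of-words vector replaces a learned encoder, a fixed template replaces a graph-to-text generation model, and all function names as well as the placeholder groups in the usage note are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Sequence, Tuple

Triple = Tuple[str, str, str]


def embed(text: str) -> Counter:
    # Toy embedding: word counts. A real pipeline would use a learned encoder.
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def top_k_triples(bias_kg: Sequence[Triple], query: str, k: int) -> List[Triple]:
    """Return the k triples of the bias KG that are most similar to the query."""
    q = embed(query)
    ranked = sorted(bias_kg, key=lambda t: cosine(embed(" ".join(t)), q), reverse=True)
    return ranked[:k]


def verbalize(triple: Triple) -> str:
    # Template-based stand-in for a graph-to-text generation model.
    return f"{triple[0]} {triple[1]} {triple[2]}."


def build_adversarial_prompt(bias_kg: Sequence[Triple], question: str, k: int = 2) -> str:
    """Embed the k most relevant bias statements as in-context samples before the question."""
    samples = [verbalize(t) for t in top_k_triples(bias_kg, question, k)]
    lines = ["Assume the following statements are true:"]
    lines += [f"- {s}" for s in samples]
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)
```

For example, with a bias KG that contains triples such as ("group A", "is assumed to be", "unreliable at work"), the prompt for a question about the reliability of group A would begin with the verbalized triples that overlap most with the question. The answer of the evaluated LLM is then compared against the pre-defined bias.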

Considerations:

  • The bias KG is generated with sensitive attributes that can be considered a potential bias target [15]. Defining those attributes is important for the quality of the adversarial attacks. A closed set of sensitive attributes is advised for the bias evaluation so that the results can be analyzed properly. Various approaches can be used to generate the bias KG. Most of the established approaches are discussed in Section 4, in Answer 2 of this section, and in Chapter 4.2.


Standards and Protocols and Scientific Publications:

  • The standard protocols for the individual steps are partially discussed in other sections. RAG systems are the closest best practice for this evaluation technique (see Section 4). In BiasKG [15], for instance, the bias KG is constructed from free text using RAG methodology. KG triples and entities are mapped into a vector embedding space. These embeddings can then be clustered and retrieved for the input text generation.

References

  1. Wenxuan Wang et al. The Earth is Flat? Unveiling Factual Errors in Large Language Models (2024)
  2. Jacek Wiland et al. BEAR: A Unified Framework for Evaluating Relational Knowledge in Causal and Masked Language Models (2024)
  3. Badr AlKhamissi et al. A Review on Language Models as Knowledge Bases (2022)
  4. Fabio Petroni et al. Language Models as Knowledge Bases? (2019)
  5. Jan-Christoph Kalo et al. KAMEL: Knowledge Analysis with Multitoken Entities in Language Models (2022)
  6. Alon Talmor et al. COMMONSENSEQA: A Question Answering Challenge Targeting Commonsense Knowledge (2019)
  7. Hannah Sansford et al. GRAPHEVAL: A Knowledge Graph-Based LLM Hallucination Evaluation Framework (2024)
  8. Leonardo F. R. Ribeiro et al. FACTGRAPH: Evaluating Factuality in Summarization with Semantic Graph Representations (2022)
  9. Yu Wang et al. Large Graph Generative Models (2024)
  10. Laura Banarescu et al. Abstract Meaning Representation for Sembanking (2013)
  11. Bowen Yu et al. Towards Generalized Open Information Extraction (2022)
  12. Hanwen Zheng et al. A Survey of Document-Level Information Extraction (2023)
  13. Derong Xu et al. Large Language Models for Generative Information Extraction: A Survey (2024)
  14. Joshua Maynez et al. On Faithfulness and Factuality in Abstractive Summarization (2020)
  15. Chu Fei Luo et al. BiasKG: Adversarial Knowledge Graphs to Induce Bias in Large Language Models (2024)
  16. Zhilin Yang et al. HOTPOTQA: A Dataset for Diverse, Explainable Multi-hop Question Answering (2018)
  17. Ziyang Xu et al. Take Care of Your Prompt Bias! Investigating and Mitigating Prompt Bias in Factual Knowledge Extraction (2024)

References:

  1. https://arxiv.org/pdf/2401.00761 (The Earth is Flat?)
  2. https://aclanthology.org/2024.findings-naacl.155/ (BEAR)
  3. https://arxiv.org/pdf/2204.06031 (Review)
  4. https://aclanthology.org/D19-1250/ (LAMA)
  5. https://www.akbc.ws/2022/assets/pdfs/15_kamel_knowledge_analysis_with_.pdf (KAMEL)
  6. https://aclanthology.org/N19-1421/ (CommonsenseQA)
  7. https://www.amazon.science/publications/grapheval-a-knowledge-graph-based-llm-hallucination-evaluation-framework (GraphEval)
  8. https://aclanthology.org/2022.naacl-main.236.pdf (FactGraph)
  9. https://arxiv.org/pdf/2406.05109 (Large Graph Generative Models)
  10. https://aclanthology.org/W13-2322.pdf (AMR)
  11. https://aclanthology.org/2022.findings-emnlp.103.pdf (OpenIE)
  12. https://arxiv.org/pdf/2309.13249 (NER, CR, RE on Doc lvl)
  13. https://arxiv.org/pdf/2312.17617 (IE survey)
  14. https://aclanthology.org/2020.acl-main.173.pdf (On Faithfulness and Factuality in Abstractive Summarization)
  15. https://arxiv.org/abs/2405.04756 (BiasKG)
  16. https://aclanthology.org/D18-1259/ (retrieval algorithm)
  17. http://arxiv.org/abs/2403.09963 (Take Care of Your Prompt Bias!)

...