...
How do I enhance/augment/extend LLM training through KGs? (LLM TRAINING) – length: up to one page
Lead: Daniel Baldassare
Contributors:
- Diego Collarana (FIT)
- Daniel Baldassare (doctima) – Lead
- Michael Wetzel (Coreon)
- Rene Pietzsch (ECC)
Problem statement
The training of large language models typically employs unsupervised methods on extensive datasets. Despite their impressive performance across a range of tasks, these models often lack the practical, real-world knowledge required for certain applications. Furthermore, since domain-specific data is not included in the public datasets used for pre-training or fine-tuning large language models (LLMs), the integration of knowledge graphs (KGs) becomes fundamental for injecting proprietary knowledge into LLMs, especially for enterprise solutions. To infuse such knowledge into LLMs during training, many techniques have been researched in recent years, resulting in three main state-of-the-art methods (Pan et al., 2024):
- Integration of KGs into training objectives (see Answer 1)
- Verbalization of KGs into LLM inputs (see Answer 2)
- Integration of KGs via fusion modules: joint training of graph and language models (see Answer 3)
Explanation of concepts
The first method focuses on extending the pre-training procedure. The term pre-training objective describes the techniques that guide how a model learns from its training data. In the context of pre-training large language models, a variety of objectives have been employed, depending on the architecture of the model itself. Decoder-only models such as GPT-4 usually use Causal Language Modelling (CLM), where the model is presented with a sequence of tokens and learns to predict the next token in the sequence based solely on the preceding tokens (Wang et al., 2022). Integrating KGs into the training objective consists of extending the standard LLM pre-training objective of generating coherent and contextually relevant text with a knowledge-aware pre-training objective (see the illustrative sketch below).
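As an illustration of what a knowledge-aware objective can look like (an assumption made for this draft, not a method prescribed by the cited works), the following Python sketch keeps the standard CLM next-token cross-entropy loss but weights tokens belonging to KG entity mentions more heavily, so that factual spans contribute more to the gradient. The function name, the weighting scheme, and the availability of an entity mask (e.g. from entity linking the training text against the KG) are all illustrative assumptions.

import torch.nn.functional as F

def knowledge_aware_clm_loss(logits, labels, entity_mask, entity_weight=2.0):
    """
    logits:      (batch, seq_len, vocab_size) raw model outputs
    labels:      (batch, seq_len) token ids of the training text
    entity_mask: (batch, seq_len) float tensor, 1.0 where the token belongs to a KG entity mention
    """
    # Shift so that position t predicts token t+1 (standard CLM alignment).
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]
    shift_entity = entity_mask[:, 1:]

    # Per-token cross entropy: the usual CLM loss, kept unreduced.
    per_token = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="none",
    ).reshape(shift_labels.shape)

    # Knowledge-aware weighting: entity tokens count entity_weight times as much.
    weights = 1.0 + (entity_weight - 1.0) * shift_entity
    return (per_token * weights).sum() / weights.sum()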
The second method involves integrating KGs directly into the LLM's input by verbalising the knowledge graph into the prompt, thereby transforming structured data into a text format that the LLM can process and learn from. Data from the knowledge graph is either prepended or appended to the user's question as contextual information in the prompt. With this approach, the standard LLM pre-training objective of generating coherent and contextually relevant text remains untouched, and the knowledge augmentation task is modelled as a linguistic task (see the illustrative sketch below).
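A minimal sketch of this input-level integration (the prompt template, the relation naming convention, and the example facts are assumptions made for illustration, not taken from the source): KG triples are verbalised into short sentences and prepended to the user's question as context.

def verbalize_triples(triples):
    # ("Berlin", "capital_of", "Germany") -> "Berlin capital of Germany."
    return " ".join(f"{s} {p.replace('_', ' ')} {o}." for s, p, o in triples)

def build_prompt(triples, question):
    # Prepend the verbalised facts as context; the LLM's training objective is unchanged.
    context = verbalize_triples(triples)
    return f"Context:\n{context}\n\nQuestion:\n{question}"

print(build_prompt(
    [("ACME GmbH", "produces", "industrial pumps"),
     ("ACME GmbH", "founded_in", "1987")],
    "What does ACME GmbH produce?",
))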
Brief description of the state of the art
First draft to be created by 11 October 2024
Proposed solutions:
Answer 1: Integrate KGs into the LLM Training Objective
Contributors:
- Diego Collarana (FIT)
Short definition/description of this topic: please fill in ...
- Content ...
- Content ...
- Content ...
Answer 2: Integrate KGs into LLM Inputs (verbalize KGs for LLM training)
Contributors:
- Diego Collarana (FIT)
- Daniel Baldassare (doctima) – Lead
- Michael Wetzel (Coreon)
- Rene Pietzsch (ECC)
- ...
Draft from Daniel Baldassare:
Short definition/description of this topic: Verbalizing knowledge graphs for LLMs is the task of representing knowledge graphs as text so that they can be written directly into the prompt, the main input source of LLMs. Verbalization consists of finding textual representations for nodes, relationships between nodes, and their metadata. Verbalization can take place at different stages of the LLM lifecycle, during training (pre-training, instruction fine-tuning) or during inference (in-context learning), and involves:
- Marking the boundaries of graph data with special tokens, as already done for SQL queries: "Improving Generalization in Language Model-Based Text-to-SQL Semantic Parsing: Two Simple Semantic Boundary-Based Techniques"
- Choosing encoding strategies for nodes, relationships between nodes, node communities, and metadata: "Talk like a graph: Encoding graphs for large language models" (research.google)
- Deciding what needs to be verbalized and where: the system prompt for static information such as the KG schema, the user prompt for data instances (see the sketch after this list)
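The Python sketch below illustrates the three points above under stated assumptions: the boundary tokens <graph> and </graph>, the node/edge dictionaries, and the chat-message layout are illustrative choices, not prescribed by the cited papers. The static KG schema goes into the system prompt; the verbalised data instances relevant to the user's question go into the user prompt, wrapped in boundary tokens.

GRAPH_OPEN, GRAPH_CLOSE = "<graph>", "</graph>"

def verbalize_node(node):
    # node: {"id": "Q64", "label": "Berlin", "type": "City"}
    return f'{node["label"]} (type: {node["type"]})'

def verbalize_edge(edge, nodes_by_id):
    # edge: {"source": "Q64", "relation": "capital_of", "target": "Q183"}
    s = nodes_by_id[edge["source"]]["label"]
    o = nodes_by_id[edge["target"]]["label"]
    return f'{s} {edge["relation"].replace("_", " ")} {o}.'

def build_messages(schema_text, nodes, edges, question):
    # Static schema -> system prompt; data instances -> user prompt, marked by boundary tokens.
    nodes_by_id = {n["id"]: n for n in nodes}
    facts = [verbalize_node(n) for n in nodes] + \
            [verbalize_edge(e, nodes_by_id) for e in edges]
    graph_block = GRAPH_OPEN + "\n" + "\n".join(facts) + "\n" + GRAPH_CLOSE
    return [
        {"role": "system", "content": f"KG schema:\n{schema_text}"},
        {"role": "user", "content": f"{graph_block}\n\n{question}"},
    ]

If such boundary tokens are used during training rather than only at inference, they would typically also be registered as special tokens in the tokenizer (for example via add_special_tokens in Hugging Face Transformers) and the model's embedding matrix resized accordingly.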
Answer 3: Integrate KGs by Fusion Modules
Contributors:
- Diego Collarana (FIT)
- Daniel Baldassare (doctima) – Lead
- Michael Wetzel (Coreon)
- Rene Pietzsch (ECC)
Short definition/description of this topic: please fill in ...
- Content ...
- Content ...
- Content ...
...
References:
- S. Pan, L. Luo, Y. Wang, C. Chen, J. Wang, and X. Wu, "Unifying Large Language Models and Knowledge Graphs: A Roadmap", IEEE Trans. Knowl. Data Eng., vol. 36, no. 7, pp. 3580–3599, July 2024, doi: 10.1109/TKDE.2024.3352100.
- T. Wang et al., "What Language Model Architecture and Pretraining Objective Works Best for Zero-Shot Generalization?", in Proceedings of the 39th International Conference on Machine Learning, PMLR, June 2022, pp. 22964–22984. Accessed: 3 October 2024. [Online]. Available: https://proceedings.mlr.press/v162/wang22u.html
...
ADD NEW TOP LEVEL SECTION: ENHANCING LLMs AT INFERENCE TIME
...