...

  1. Verbalization of KGs into LLM inputs (see Answer 1)
  2. Integration of KGs during pre-training (see Answer 2)
  3. Integration of KGs during fine-tuning (see Answer 3)

Explanation of concepts

...

  • Pre-training: an unsupervised training step in which a language model learns language representations by processing large unannotated corpora.
  • Pre-training objectives: the techniques that guide the learning process of a model from its training data.

...

  • Standard techniques for language model pre-training are Masked Language Modeling (MLM) and Causal Language Modeling (CLM); for knowledge-aware training that injects KG triplets directly into LLM inputs, Knowledge Masked Language Modeling is used (see the sketch after this list).
  • Masked Language Modeling (MLM): a pre-training technique where the model predicts a masked token in a sequence by considering the context of the surrounding tokens.
  • Causal Language Modeling (CLM): a pre-training technique where the model is presented with a sequence of tokens and learns to predict the next token based solely on the preceding tokens.
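As a rough illustration of the two objectives, the following sketch uses Hugging Face transformers with off-the-shelf BERT and GPT-2 checkpoints (the checkpoints and the toy sentence are illustrative choices, not prescribed by the sources cited in this document):

```python
# Minimal sketch of MLM vs. CLM (illustrative checkpoints and toy sentence).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM, AutoModelForCausalLM

# --- Masked Language Modeling: predict a masked token from two-sided context ---
mlm_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
enc = mlm_tok(f"Berlin is the {mlm_tok.mask_token} of Germany.", return_tensors="pt")
with torch.no_grad():
    logits = mlm(**enc).logits
mask_pos = (enc.input_ids == mlm_tok.mask_token_id).nonzero(as_tuple=True)[1]
print(mlm_tok.decode(logits[0, mask_pos].argmax(-1)))  # model's guess for [MASK]

# --- Causal Language Modeling: predict the next token from the left context only ---
clm_tok = AutoTokenizer.from_pretrained("gpt2")
clm = AutoModelForCausalLM.from_pretrained("gpt2")
enc = clm_tok("Berlin is the capital of", return_tensors="pt")
with torch.no_grad():
    next_id = clm(**enc).logits[0, -1].argmax().item()
print(clm_tok.decode([next_id]))  # most likely continuation token
```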

...

  • Knowledge masked language model (KMLM): a knowledge-aware pre-training technique that extends MLM by adding triplets from KGs directly to the text tokens. In contrast to MLM, not only tokens from the text sequence are masked, but also entities of the triplet that have been added as tokens to the LLM input.
  • Verbalization of KGs: the task of representing knowledge graphs as text, thereby transforming structured data into a textual format that the LLM can process and learn from. Verbalization can take place at different stages of the LLM lifecycle, during training (pre-training, fine-tuning) or during inference (in-context learning).
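A minimal sketch of how a single triplet could be verbalized and combined with text for a knowledge-masked input (the template, the example triplet, and the choice to mask the object entity are illustrative assumptions, not a specific paper's recipe):

```python
from transformers import AutoTokenizer  # used only for its [MASK] token string

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

sentence = "Marie Curie received the Nobel Prize in Physics in 1903."
triplet = ("Marie Curie", "field of work", "physics")  # (subject, relation, object)

# 1) Verbalize the structured triplet into plain text with a simple template.
verbalized = f"The {triplet[1]} of {triplet[0]} is {triplet[2]}."

# 2) Concatenate the original text and the verbalized knowledge into one LLM input.
lm_input = f"{sentence} {verbalized}"

# 3) Knowledge-masked variant: mask the object entity of the triplet so the
#    model has to recover it from the textual and structural context.
kmlm_input = lm_input.replace(f"is {triplet[2]}", f"is {tokenizer.mask_token}")
print(kmlm_input)
# Marie Curie received the Nobel Prize in Physics in 1903. The field of work of Marie Curie is [MASK].
```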

...

  • Embeddings: a numerical representation of data that captures semantic meaning in a continuous multidimensional space.
  • Fine-tuning: the process of taking a pre-trained model and extending it for:
    • Task Adaptation: adapting the model to a new specific task, such as classification or sentiment analysis

...

    • Knowledge Enhancement: expanding the pre-trained model's knowledge to specialize it for a particular domain or for enterprise needs

...

    • Instruction Tuning: teaching the model to follow human instructions using datasets of prompts

...

  • Adapter: a trainable layer that can be added to the original LLM architecture, introducing a small number of new parameters that are independent of the LLM's parameters. Adapters can be used either for knowledge-based pre-training (see Answer 2) or for parameter-efficient fine-tuning (PEFT).
  • Parameter-Efficient Fine-Tuning (PEFT): a family of fine-tuning techniques in which only a small number of the pre-trained model's parameters, or a small set of newly added parameters, are trained. This is useful when computational resources are limited (see the sketch below).
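As an example of PEFT via an adapter, the sketch below attaches LoRA adapter weights to a frozen pre-trained model using the Hugging Face peft library; the checkpoint and hyperparameters are illustrative, not a recommendation from the sources cited here:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load a pre-trained model (illustrative checkpoint).
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Attach a small set of new, independently trainable adapter parameters;
# the original GPT-2 weights stay frozen during fine-tuning.
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"], lora_dropout=0.05)
model = get_peft_model(model, lora_config)

# Only a fraction of a percent of the parameters are trainable.
model.print_trainable_parameters()
```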

Brief description of the state-of-the-art

The current state-of-the-art method for integrating knowledge graph (KG) triplets into LLM inputs is Knowledge Masked Language Modeling (KMLM), where tokens from both the text sequence and the KG triplets are masked during training to encourage the model to learn contextual relationships. For example, ERNIE [4] prepends knowledge triplets to the text sequence and randomly masks tokens from either the text or the triplets. In contrast, K-BERT [6] integrates triplets by appending knowledge information immediately after the corresponding entities in the text, while restricting the triplet tokens from influencing the text tokens of the sentence. This prevents semantic changes to the original sentence and mitigates the knowledge noise that can occur when triplet and text tokens interact directly, as is the case in ERNIE. In addition, verbalising knowledge graphs can improve performance, as demonstrated by KELM [7].
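The following toy sketch illustrates the prepend-and-mask idea described above (it is not ERNIE's actual implementation; the triplet, sentence, and word-level masking are simplifications for illustration):

```python
import random

MASK = "[MASK]"
sentence = "Albert Einstein developed the theory of relativity."
triplet = ["Albert Einstein", "born in", "Ulm"]  # prepended knowledge triplet

# Prepend the triplet to the sentence and remember which segment each unit
# comes from, so masking can target either the knowledge or the text part.
units = triplet + sentence.split()
segments = ["kg"] * len(triplet) + ["text"] * len(sentence.split())

# Randomly mask one knowledge element and one text token per training example.
for seg in ("kg", "text"):
    idx = random.choice([i for i, s in enumerate(segments) if s == seg])
    units[idx] = MASK

print(" ".join(units))
# e.g. "Albert Einstein born in [MASK] Albert Einstein developed the [MASK] of relativity."
```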

When it comes to integrating KG knowledge directly into the training process using knowledge embeddings, current state-of-the-art techniques modify the model architecture in three main ways:

  1. Inserting a knowledge encoder after the transformer encoder to fuse text embeddings with knowledge embeddings after the initial text encoding, allowing the model to use both textual and knowledge-based information (see the sketch after this list).
  2. Inserting a knowledge encoding layer between the transformer layers to enable the LLM to process knowledge from the KG.
  3. Adding an adapter that is trained independently of the LLM and is easier to train because it has its own, much smaller set of parameters.
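A minimal PyTorch sketch of option 1: token embeddings coming out of a text encoder are fused with pre-computed KG entity embeddings. All dimensions, the alignment of entities to token positions, and the additive fusion are illustrative assumptions rather than any specific model's architecture:

```python
import torch
import torch.nn as nn

class KnowledgeFusion(nn.Module):
    """Toy fusion module: combines token embeddings from a text encoder with
    KG entity embeddings aligned to the same positions (sizes are illustrative)."""

    def __init__(self, text_dim=768, kg_dim=200, hidden_dim=768):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.kg_proj = nn.Linear(kg_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, text_emb, kg_emb):
        # text_emb: [batch, seq_len, text_dim] from the transformer encoder
        # kg_emb:   [batch, seq_len, kg_dim] pre-trained knowledge embeddings,
        #           assumed to be aligned to token positions beforehand
        fused = torch.tanh(self.text_proj(text_emb) + self.kg_proj(kg_emb))
        return self.out(fused)

fusion = KnowledgeFusion()
text_emb = torch.randn(2, 16, 768)     # output of the text encoder
kg_emb = torch.randn(2, 16, 200)       # e.g. TransE-style entity embeddings
print(fusion(text_emb, kg_emb).shape)  # torch.Size([2, 16, 768])
```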

Answer 1: Integrate KGs into LLM Inputs (verbalize KG for LLM training) – before pre-training

Contributors:

  • Diego Collarana (FIT)
  • Daniel Baldassare (doctima) – Lead
  • Michael Wetzel (Coreon)
  • Rene Pietzsch (ECC)
  • ... 


Description:

Integrating Knowledge Graphs (KGs) into Large Language Models (LLMs) is a knowledge-aware pre-training technique that enhances LLMs by incorporating triplets from KGs directly into their textual input. This approach maintains the standard LLM pre-training goal of generating coherent and contextually relevant text while introducing structured factual knowledge, which LLMs otherwise struggle to capture effectively because standard pre-training involves only plain text [1,2,3,4,5,6]. Two main methods have been used for integrating KG triplets into LLM inputs:

  1. Concatenation of triplets with the text (either prepended to the text [4] or inserted within the text [2])
  2. Verbalisation of triplets followed by concatenation with the text [7]
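To make the contrast concrete, here is a tiny illustration of the two input formats (the separator token, the template, and the example facts are assumptions for illustration only):

```python
sentence = "Bob Dylan wrote Blowin' in the Wind in 1962."
triplet = ("Bob Dylan", "occupation", "songwriter")

# Method 1: concatenate the raw triplet with the text (here: prepended).
method_1 = f"{' '.join(triplet)} [SEP] {sentence}"

# Method 2: verbalize the triplet first, then concatenate it with the text.
method_2 = f"The occupation of Bob Dylan is songwriter. {sentence}"

print(method_1)  # Bob Dylan occupation songwriter [SEP] Bob Dylan wrote ...
print(method_2)  # The occupation of Bob Dylan is songwriter. Bob Dylan wrote ...
```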

Considerations:

  • Directly concatenating triplets with text can introduce knowledge noise and alter the original meaning of sentences, because tokens in the sentence interact directly with the tokens of the triplets, as observed in [6].
  • To overcome this problem, K-BERT and CoLAKE designed training procedures where only the corresponding entities can access the information in the triples. Using masked self-attention, K-BERT limits how much a token can influence the others in the sequence: the attention scores of the tokens that form the knowledge triples are set to zero for text tokens, preventing knowledge tokens from contributing to the hidden states of text tokens [6] (see the sketch after this list).
  • Only a few studies, such as Dict-BERT and DKPLM, also consider low-frequency and long-tail entities [2,3].
  • Agarwal et al. show that verbalizing triplets is an effective method for integrating knowledge graphs into LLM inputs, improving performance over simply concatenating triplets in their structured format (subject entity, relation, object entity) [7].
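A toy sketch of the masked self-attention idea from K-BERT, simplified to the behaviour described above: triplet tokens are appended after the entity they describe, and a visibility mask prevents them from influencing the hidden states of the original text tokens (the token layout and the boolean mask construction are simplifications, not the paper's exact visible matrix):

```python
import torch

# Toy input: four text tokens followed by three knowledge-triplet tokens that
# were appended after the entity "Cook" (layout is illustrative).
tokens = ["Tim", "Cook", "is", "visiting", "CEO", "of", "Apple"]
is_kg = [False, False, False, False, True, True, True]

n = len(tokens)
visible = torch.ones(n, n, dtype=torch.bool)
for i in range(n):
    for j in range(n):
        # Text tokens must not attend to knowledge tokens, so the injected
        # triplet cannot change the hidden states (and meaning) of the sentence.
        if not is_kg[i] and is_kg[j]:
            visible[i, j] = False

# Additive attention mask: 0 where attention is allowed, -inf where it is
# blocked, so blocked positions receive zero weight after the softmax.
attn_mask = torch.where(visible, torch.tensor(0.0), torch.tensor(float("-inf")))
print(attn_mask)
```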


Standards, Protocols, and Scientific Publications:

Masked language modelling is the standard procedure for improving LLM pre-training by integrating triplets into LLM inputs, whether they are verbalised or not. Unlike traditional masked language modelling, knowledge-masked language modelling masks not only tokens from the sentence, but also tokens that are part of the triplets.


  • ERNIE concatenates knowledge graph (KG) triplets with sentences, prepending the triplets to the sentence, and randomly masks tokens from either the sentence or the triplets. [4]
  • K-BERT integrates triplets into the input by appending knowledge information immediately after the corresponding entities in the text. To prevent semantic changes to the original meaning of the sentence, tokens from the triplets are restricted from influencing tokens in the sentence. [6]
  • CoLAKE proposes a unified word knowledge graph where tokens in the sequence are aligned with entities from the triplets and their neighbouring entities. [5]
  • KELM goes a step further by verbalising knowledge triplets into synthetic sentences using a text-to-text generator trained on a corpus of heuristically aligned Wikipedia text and Wikidata KG triplets. [7]
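As a rough sketch of generator-based verbalization in the spirit of KELM, the snippet below feeds a linearized triplet to a generic seq2seq checkpoint. The checkpoint, prompt format, and linearization are assumptions; an off-the-shelf model would need fine-tuning on aligned triplet–sentence pairs (as KELM did with Wikipedia/Wikidata) to produce good synthetic sentences:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Stand-in data-to-text generator (illustrative checkpoint, not KELM's model).
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Linearize a KG triplet and ask the generator for a synthetic sentence.
triplet = "subject: Douglas Adams | relation: notable work | object: The Hitchhiker's Guide to the Galaxy"
inputs = tokenizer("verbalize triple: " + triplet, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```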

Brief description of the state-of-the-art

In the context of integrating KGs into LLM inputs, the current state-of-the-art approaches focus on infusing knowledge without modifying the meaning of the textual sequence itself. The methods proposed by Liu et al. [6] (K-BERT) and Sun et al. [5] (CoLAKE) address the issue of "knowledge noise", a challenge highlighted by Liu et al. [6], which can arise when knowledge triples are simply concatenated with their corresponding sentences, as in ERNIE [4].

Answer 2: Integrate KGs during pre-training

...

  • [1] S. Pan, L. Luo, Y. Wang, C. Chen, J. Wang, and X. Wu, "Unifying Large Language Models and Knowledge Graphs: A Roadmap", IEEE Trans. Knowl. Data Eng., vol. 36, no. 7, pp. 3580–3599, Jul. 2024, doi: 10.1109/TKDE.2024.3352100.
  • [2] W. Yu et al., "Dict-BERT: Enhancing Language Model Pre-training with Dictionary", Mar. 20, 2022, arXiv: arXiv:2110.06490. Accessed: Nov. 15, 2024. [Online]. Available: http://arxiv.org/abs/2110.06490
  • [3] T. Zhang et al., "DKPLM: Decomposable Knowledge-enhanced Pre-trained Language Model for Natural Language Understanding", Oct. 16, 2022, arXiv: arXiv:2112.01047. doi: 10.48550/arXiv.2112.01047.
  • [4] Y. Sun et al., "ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation", Jul. 5, 2021, arXiv: arXiv:2107.02137. Accessed: Nov. 14, 2024. [Online]. Available: http://arxiv.org/abs/2107.02137
  • [5] T. Sun et al., "CoLAKE: Contextualized Language and Knowledge Embedding", in Proceedings of the 28th International Conference on Computational Linguistics, D. Scott, N. Bel, and C. Zong, Eds., Barcelona, Spain (Online): International Committee on Computational Linguistics, Dec. 2020, pp. 3660–3670. doi: 10.18653/v1/2020.coling-main.327.
  • [6] W. Liu et al., "K-BERT: Enabling Language Representation with Knowledge Graph", Sep. 17, 2019, arXiv: arXiv:1909.07606. Accessed: Nov. 7, 2024. [Online]. Available: http://arxiv.org/abs/1909.07606
  • [7] O. Agarwal, H. Ge, S. Shakeri, and R. Al-Rfou, "Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training", Mar. 13, 2021, arXiv: arXiv:2010.12688. Accessed: Nov. 14, 2024. [Online]. Available: http://arxiv.org/abs/2010.12688


Enhancing LLMs at Inference Time

...

It should be noted that the accuracy of an LM's evaluation is inherently tied to the quality of the KG it relies on. This is especially relevant for publicly editable KGs, which are susceptible to factual inaccuracies due to unreliable or unverified sources and can even be intentionally manipulated to disseminate misinformation. Therefore, considering the quality and reliability of the underlying KG is crucial when evaluating the LM.

Answer 1: Using KGs to Evaluate LLM Represented Knowledge

...