...

  • Diego Collarana (FIT)
  • Daniel Baldassare (doctima) – Lead
  • Michael Wetzel (Coreon)
  • Rene Pietzsch (ECC)
  • ... 


Description:

Verbalizing knowledge graphs for the LLM pre-training task:

  • Simple concatenation of KG triples with text
  • Entity/Token alignment prediction

Considerations:

  • Simple concatenation of text tokens and KG triples can cause "knowledge noise"

Standards:

Integrating Knowledge Graphs (KGs) into Large Language Models (LLMs) is a knowledge-aware pre-training technique that enhances LLMs by incorporating triplets from KGs directly into their textual input. This approach keeps the standard LLM pre-training objective of generating coherent and contextually relevant text while introducing structured factual knowledge. Because standard pre-training typically involves only plain text, LLMs struggle to capture structured factual knowledge effectively [1,2,3,4,5,6]. Two main methods have been used for integrating KG triplets into LLM inputs (a minimal sketch contrasting them follows the list):

  1. Concatenation of triplets with text (either prepended to the text [4] or inserted within the text [2])
  2. Verbalisation of triplets followed by concatenation with text [7]
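
To make the two options concrete, the following is a minimal Python sketch contrasting raw triple concatenation with a simple template-based verbalisation. The example triple, the bracketed triple format, and the one-line template are illustrative assumptions only; none of the cited systems uses exactly these conventions (KELM, for instance, verbalises triples with a trained text-to-text generator rather than a template [7]).

```python
# Illustrative sketch only: two ways to combine a KG triple with input text.
# The triple, the bracket format, and the template are assumptions, not any cited system's format.

triple = ("Leonardo da Vinci", "painted", "Mona Lisa")
sentence = "The Mona Lisa is displayed in the Louvre."

def concatenate_triple(triple, sentence):
    """Option 1: prepend the raw triple in (subject, relation, object) form."""
    subj, rel, obj = triple
    return f"[{subj} | {rel} | {obj}] {sentence}"

def verbalise_triple(triple, sentence):
    """Option 2: turn the triple into a natural-language sentence first, then concatenate."""
    subj, rel, obj = triple
    verbalised = f"{subj} {rel} the {obj}."  # naive template; KELM uses a trained generator instead
    return f"{verbalised} {sentence}"

print(concatenate_triple(triple, sentence))
print(verbalise_triple(triple, sentence))
```

The first variant exposes the raw (subject, relation, object) structure to the model, which is exactly the setting in which knowledge noise has been observed; the second feeds the model only fluent text.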

Considerations:

  • Directly concatenating triplets with text can introduce knowledge noise and alter the original meaning of a sentence, because the sentence tokens interact directly with the triplet tokens, as observed in [6].
  • To overcome this problem, K-BERT and CoLAKE design training procedures in which only the entities can access the information in the triples. Using a masked self-attention mechanism, K-BERT limits how much a token can influence the others in the sequence: attention between knowledge-triple tokens and ordinary sentence tokens is masked out, preventing knowledge tokens from contributing to the hidden states of text tokens [6] (see the attention sketch after this list).
  • Only a few studies, such as DKPLM and Dict-BERT, also consider low-frequency and long-tail entities [2,3].
  • Agarwal et al. show that verbalising triplets is an effective method for integrating knowledge graphs into LLM inputs, improving performance over simply concatenating triplets in their structured (subject entity, relation, object entity) form [7].
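
The masking idea described above can be illustrated with a small attention sketch: a visibility matrix blocks ordinary sentence tokens and knowledge-triple tokens from attending to each other, except through the entity the triple is attached to. The toy token layout, the visibility rules, and the NumPy attention routine below are simplified assumptions for illustration; they do not reproduce K-BERT's actual soft-position embeddings or visible-matrix implementation [6].

```python
import numpy as np

# Toy sequence: four sentence tokens followed by a three-token triple attached to "Apple".
tokens = ["Tim", "Cook", "runs", "Apple", "Apple", "is_a", "company"]
is_knowledge = np.array([False, False, False, False, True, True, True])
anchor_entity = 3  # index of the sentence entity the triple hangs off (toy assumption)

# Visibility matrix: sentence tokens see only sentence tokens; triple tokens see each
# other and their anchor entity (a simplified stand-in for a K-BERT-style visible matrix).
n = len(tokens)
visible = np.zeros((n, n), dtype=bool)
for i in range(n):
    for j in range(n):
        if not is_knowledge[i] and not is_knowledge[j]:
            visible[i, j] = True                      # sentence <-> sentence
        elif is_knowledge[i] and is_knowledge[j]:
            visible[i, j] = True                      # triple <-> triple
        elif is_knowledge[i] and j == anchor_entity:
            visible[i, j] = True                      # triple token reads its anchor entity
        elif is_knowledge[j] and i == anchor_entity:
            visible[i, j] = True                      # anchor entity reads its triple tokens

def masked_attention(q, k, v, visible):
    """Scaled dot-product attention with an additive visibility mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(visible, scores, -1e9)          # blocked pairs get ~zero weight after softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q = k = v = rng.normal(size=(n, 8))                   # stand-in token embeddings
out = masked_attention(q, k, v, visible)
print(out.shape)  # (7, 8); only the anchor entity's state is influenced by the triple tokens
```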


Standards, Protocols, and Scientific Publications:

Masked language modelling is the standard pre-training objective used when integrating triplets into LLM inputs, whether the triplets are verbalised or not. Unlike traditional masked language modelling, knowledge-masked language modelling masks not only tokens from the sentence but also tokens that are part of the triplets.
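
As a rough illustration of the difference, the sketch below builds a knowledge-masked LM input by masking tokens drawn from both the prepended triple and the sentence. The whitespace tokenisation, the 15% masking rate, and the [SEP]/[MASK] conventions are simplified assumptions rather than any specific system's recipe.

```python
import random

# Illustrative knowledge-masked LM input: candidate mask positions come from
# the triple *and* the sentence. Rates and tokenisation are simplifying assumptions.
MASK, RATE = "[MASK]", 0.15

triple_tokens = ["Berlin", "capital_of", "Germany"]
sentence_tokens = ["Berlin", "hosts", "the", "federal", "government", "."]
tokens = triple_tokens + ["[SEP]"] + sentence_tokens  # triple prepended, as in ERNIE-style inputs

random.seed(0)
masked_input, labels = [], []
for tok in tokens:
    if tok != "[SEP]" and random.random() < RATE:
        masked_input.append(MASK)
        labels.append(tok)     # the model must recover sentence tokens *and* knowledge tokens
    else:
        masked_input.append(tok)
        labels.append(None)    # position ignored by the loss

print(masked_input)
print(labels)
```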


  • ERNIE concatenates knowledge graph (KG) triplets with sentences, prepending the triplets to the sentence, and randomly masks tokens from either the sentence or the triplets. [4]
  • K-BERT integrates triplets into the input by appending knowledge information immediately after the corresponding entities in the text. To avoid altering the original meaning of the sentence, tokens from the triplets are restricted from influencing tokens in the sentence. [6]
  • CoLAKE proposes a unified word-knowledge graph in which tokens in the sequence are aligned with entities from the triplets and their neighbouring entities. [5]
  • KELM goes a step further by verbalising knowledge triplets into synthetic sentences using a text-to-text generator trained on a corpus of heuristically aligned Wikipedia text and Wikidata KG triplets. [7]
  • Some approaches predict alignment links between tokens in the text and entities in the KG.
  • Others add entity embeddings and an additional entity prediction task to the token-only pre-training objective (a minimal sketch of this idea follows the list).
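
A minimal sketch of the entity-alignment idea, assuming a PyTorch encoder whose token states are scored against a learned entity-embedding table and trained with an auxiliary entity-prediction loss alongside the MLM loss; the hidden size, entity vocabulary size, loss weighting, and class names are illustrative assumptions, not the recipe of ERNIE, CoLAKE, or any other cited system.

```python
import torch
import torch.nn as nn

class TokenEntityAlignmentHead(nn.Module):
    """Auxiliary head: predict which KG entity each token aligns with (illustrative sizes)."""
    def __init__(self, hidden_size=256, num_entities=10_000):
        super().__init__()
        self.entity_embeddings = nn.Embedding(num_entities, hidden_size)  # learned entity table
        self.proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, token_states):                  # (batch, seq_len, hidden)
        # Score every token state against every entity embedding.
        return self.proj(token_states) @ self.entity_embeddings.weight.T  # (batch, seq_len, num_entities)

head = TokenEntityAlignmentHead()
token_states = torch.randn(2, 12, 256)                # stand-in encoder outputs
entity_labels = torch.randint(0, 10_000, (2, 12))     # gold entity id per token (toy data)
alignment_loss = nn.CrossEntropyLoss()(
    head(token_states).flatten(0, 1), entity_labels.flatten()
)
# total_loss = mlm_loss + 0.5 * alignment_loss        # hypothetical weighting of the two objectives
print(alignment_loss.item())
```

In practice, tokens that align with no entity would be excluded from this auxiliary loss (for example via an ignore index).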

Answer 2: Integrate KGs during pre-training

...

  • [1] S. Pan, L. Luo, Y. Wang, C. Chen, J. Wang, and X. Wu, "Unifying Large Language Models and Knowledge Graphs: A Roadmap", IEEE Trans. Knowl. Data Eng., vol. 36, no. 7, pp. 3580–3599, July 2024, doi: 10.1109/TKDE.2024.3352100.
  • [2] W. Yu et al., "Dict-BERT: Enhancing Language Model Pre-training with Dictionary", 20 March 2022, arXiv: arXiv:2110.06490. Accessed: 15 November 2024. [Online]. Available: http://arxiv.org/abs/2110.06490
  • [3] T. Zhang et al., "DKPLM: Decomposable Knowledge-enhanced Pre-trained Language Model for Natural Language Understanding", 16 October 2022, arXiv: arXiv:2112.01047, doi: 10.48550/arXiv.2112.01047.
  • [4] Y. Sun et al., "ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation", 5 July 2021, arXiv: arXiv:2107.02137. Accessed: 14 November 2024. [Online]. Available: http://arxiv.org/abs/2107.02137
  • [5] T. Sun et al., "CoLAKE: Contextualized Language and Knowledge Embedding", in Proceedings of the 28th International Conference on Computational Linguistics, D. Scott, N. Bel, and C. Zong, Eds., Barcelona, Spain (Online): International Committee on Computational Linguistics, Dec. 2020, pp. 3660–3670, doi: 10.18653/v1/2020.coling-main.327.
  • [6] W. Liu et al., "K-BERT: Enabling Language Representation with Knowledge Graph", 17 September 2019, arXiv: arXiv:1909.07606. Accessed: 7 November 2024. [Online]. Available: http://arxiv.org/abs/1909.07606
  • [7] O. Agarwal, H. Ge, S. Shakeri, and R. Al-Rfou, "Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training", 13 March 2021, arXiv: arXiv:2010.12688. Accessed: 14 November 2024. [Online]. Available: http://arxiv.org/abs/2010.12688


ADD NEW TOP LEVEL SECTION: ENHANCING LLMs AT INFERENCE TIME

...

It should be noted that the accuracy of an LM's evaluation is inherently tied to the quality of the KG it relies on. This is especially relevant for publicly editable KGs, which are susceptible to factual inaccuracies stemming from unreliable or unverified sources and can even be intentionally manipulated to spread misinformation. The quality and reliability of the underlying KG must therefore be considered when evaluating the LM.

Answer 1: Using KGs to Evaluate LLM Represented Knowledge

...