Page History
...
- Diego Collarana (FIT)
- Daniel Baldassare (doctima) – Lead
- Michael Wetzel (Coreon)
- Rene Pietzsch (ECC)
- ...
Description:
Verbalizing knowledge graphs for LLMs is the task of representing knowledge graphs as text so that they can be written directly into the prompt, the main input source of LLMs. Verbalization consists of finding textual representations for nodes, relationships between nodes, and their metadata. It can take place at different stages of the LLM lifecycle, during training (pre-training, instruction fine-tuning) or during inference (in-context learning), and involves:
- Marking the boundaries of graph data with special tokens, as already done for SQL queries (see "Improving Generalization in Language Model-Based Text-to-SQL Semantic Parsing: Two Simple Semantic Boundary-Based Techniques")
- Choosing encoding strategies for nodes, relationships between nodes, node communities, and metadata (see "Talk like a graph: Encoding graphs for large language models", research.google)
- Deciding what needs to be verbalized and where: the system prompt for static information such as the KG schema, the user prompt for data instances (a minimal sketch follows this list)
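The sketch below illustrates this split, assuming a toy set of triples, hypothetical `<graph>`/`</graph>` boundary tokens, and an illustrative prompt layout; none of these names are prescribed by the references above.

```python
# Minimal sketch: verbalize KG triples as text and place them in a prompt.
# The <graph>/</graph> boundary tokens and the prompt layout are illustrative
# assumptions, not a fixed standard.

def verbalize_triples(triples):
    """Turn (subject, predicate, object) triples into one sentence per triple."""
    return "\n".join(f"{s} {p} {o}." for s, p, o in triples)

# Static schema information goes into the system prompt ...
kg_schema = "Person --worksFor--> Organization; Organization --locatedIn--> City"
system_prompt = (
    "You answer questions using the knowledge graph below.\n"
    f"Schema: {kg_schema}"
)

# ... while concrete data instances go into the user prompt,
# wrapped in boundary tokens that mark the graph data.
instance_triples = [
    ("Ada_Lovelace", "worksFor", "Analytical_Engine_Project"),
    ("Analytical_Engine_Project", "locatedIn", "London"),
]
user_prompt = (
    "<graph>\n" + verbalize_triples(instance_triples) + "\n</graph>\n"
    "Question: Where does Ada Lovelace work?"
)

print(system_prompt)
print(user_prompt)
```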
Considerations:
...
:
- Simple concatenation of KG triples with text (see the sketch after this list)
- Entity/Token alignment prediction
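A minimal sketch of the first option (simple concatenation), assuming the Hugging Face transformers tokenizer and a plain-text linearization of triples; the model choice and separator behaviour are illustrative assumptions only.

```python
# Minimal sketch: simple concatenation of linearized KG triples with input text.
# The BERT tokenizer and the space-separated linearization are illustrative assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Ada Lovelace wrote the first published algorithm."
triples = [("Ada_Lovelace", "occupation", "mathematician"),
           ("Ada_Lovelace", "collaboratedWith", "Charles_Babbage")]

# Linearize the triples into a single string.
linearized = " ".join(f"{s} {p} {o}" for s, p, o in triples)

# Concatenate text and triples into one input sequence; without any alignment
# signal, irrelevant triples can become "knowledge noise" for the model.
encoded = tokenizer(text, linearized, truncation=True)
print(tokenizer.decode(encoded["input_ids"]))
```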
Considerations:
- Simple concatenation of tokens and triples from the KG can cause "knowledge noise"
Standards:
- Predicting alignment links between tokens and entities
- Adding entity embeddings and an additional entity prediction task to the token-only pre-training objective (see the sketch below)
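A minimal PyTorch sketch of the last point, adding entity embeddings and a separate entity-prediction head next to the token-level objective; dimensions, vocabulary sizes, the addition-based fusion, and the toy targets are illustrative assumptions, not a specific published recipe.

```python
# Minimal sketch (PyTorch): add entity embeddings and an entity-prediction head
# on top of a token-only language-modelling objective. Dimensions, vocabulary
# sizes, and the fusion-by-addition step are illustrative assumptions.
import torch
import torch.nn as nn

class TokenEntityModel(nn.Module):
    def __init__(self, vocab_size=30522, entity_vocab_size=5000, hidden=256):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden)
        self.entity_emb = nn.Embedding(entity_vocab_size, hidden, padding_idx=0)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.token_head = nn.Linear(hidden, vocab_size)          # token-only objective
        self.entity_head = nn.Linear(hidden, entity_vocab_size)  # extra entity task

    def forward(self, token_ids, entity_ids):
        # Fuse entity embeddings into the token stream by simple addition.
        h = self.encoder(self.token_emb(token_ids) + self.entity_emb(entity_ids))
        return self.token_head(h), self.entity_head(h)

model = TokenEntityModel()
tokens = torch.randint(1, 30522, (2, 16))     # toy token ids
entities = torch.randint(0, 5000, (2, 16))    # aligned entity ids (0 = no entity)
token_logits, entity_logits = model(tokens, entities)

# Joint loss: token prediction plus the additional entity prediction task
# (toy targets; a real setup would use masked-token and masked-entity targets).
loss = (nn.functional.cross_entropy(token_logits.transpose(1, 2), tokens)
        + nn.functional.cross_entropy(entity_logits.transpose(1, 2), entities))
loss.backward()
```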
Answer 2: Integrate KGs during pre-training
...