🚀 Knowledge Graph Preparation for Talk2KnowledgeGraphs (T2KG)

📌 Overview

By default, Talk2KnowledgeGraphs (T2KG) includes a small subset of the PrimeKG knowledge graph focused on inflammatory bowel disease (IBD). This subset is enriched with multimodal biomedical metadata and embedded node/edge representations, powered by BioBridge and StarkQA.

These default files are available at:

aiagents4pharma/talk2knowledgegraphs/tests/files/biobridge_multimodal

If you'd like to use a different disease-specific graph or build your own custom PrimeKG graph, follow the step-by-step instructions below.

🧰 Preparing Your Local Environment

Before preprocessing your custom knowledge graph, you must set up your local environment. Please follow the general setup instructions in the repository's main README.

✅ Prerequisites

After installing the required Python packages, make sure you have the following:

✅ OpenAI API Key — for generating text embeddings.
✅ NVIDIA API Key — for creating a NIM instance.
✅ NVIDIA NIM for MolMIM — for embedding drug SMILES representations.

➡️ Refer to this notebook to enable MolMIM-based SMILES embedding: AIAgents4Pharma/aiagents4pharma/docs/notebooks/talk2knowledgegraphs/tutorial_primekg_smiles_enrich_embed.ipynb

🏗️ Constructing a Custom PrimeKG Graph

T2KG supports both disease-specific and full PrimeKG multimodal knowledge graphs.

🔹 Disease-Specific Multimodal Graph

You can filter and process subgraphs from PrimeKG using:

🧬 IBD-Specific PrimeKG Subgraph → Generates a focused graph for IBD with enriched and embedded node/edge features.
📤 Migrate IBD Data to Milvus → Prepares and formats the dataframes for Milvus ingestion. (Tip: You only need to follow steps up to storing the dataframes as Parquet files.)

🔹 Full PrimeKG Multimodal Graph

For processing the complete PrimeKG, use:

🔬 BioBridge-PrimeKG Multimodal → Utilizes preloaded multimodal BioBridge data to enrich PrimeKG.
📚 PrimeKG Enrichment Pipeline → Enriches and Embeds the entire PrimeKG using BioBridge, MolMIM, and textual embeddings.
📤 Migrate Full PrimeKG to Milvus → Formats and dumps the full graph into Milvus-ready Parquet files. (Tip: You only need to follow steps up to storing the dataframes as Parquet files.)

▶️ Running T2KG with Your Custom Graph

1. Copy the Environment Template

cp aiagents4pharma/talk2knowledgegraphs/.env.example .env

2. Set Environment Variables

Edit the .env file to match your custom setup. Most importantly, set your custom data directory:

...
DATA_DIR=/absolute/path/to/your/data/
...

3. Ensure Correct Folder Structure

T2KG expects the following folder structure inside your data directory:

project/
├── edges/
│   ├── embedding/
│   │   ├── edges_0.parquet.gzip
│   │   ├── edges_1.parquet.gzip
│   │   └── ...
│   └── enrichment/
│       └── edges.parquet.gzip
├── nodes/
│   ├── embedding/
│   │   ├── biological_process.parquet.gzip
│   │   ├── cellular_component.parquet.gzip
│   │   └── ...
│   └── enrichment/
│       ├── biological_process.parquet.gzip
│       ├── cellular_component.parquet.gzip
│       └── ...

This layout ensures that T2KG can properly load and query your graph content using Milvus database.

🧠 Launching the T2KG Interface

Once your environment and data are ready, you can launch T2KG and start interacting with your graph using natural language!

You can either:

🐳 Use Docker (recommended for easy deployment), or
🖥️ Run Milvus and Streamlit manually

For more information, you can find various ways to launching of the app here and here