PrimeKG Subgraph Construction
In this tutorial, we show how to construct a subgraph from PrimeKG and prepare the graph formats needed for further analysis.
In particular, we will slice a subgraph from PrimeKG related to inflammatory bowel disease (IBD).
The subgraph will contain all nodes and edges that are connected to IBD-related disease nodes, including the following relationships:
- Disease-Protein Relationship
- Disease-Disease Relationship (skipped as of now)
- Protein-Protein Relationship (skipped as of now)
- Drug-Protein Relationship
- Pathway-Protein Relationship
- Pathway-Pathway Relationship (skipped as of now)
- Bioprocess-Protein Relationship
- Molecular Function-Protein Relationship
- Cellular Component-Protein Relationship
In addition, to enrich the nodes and edges, we will perform the following tasks:
- Textual enrichment (only this task is implemented as of now)
- Multi-modal enrichment (to be added)
First of all, we need to import necessary libraries as follows:
# Import necessary libraries
import os
import numpy as np
import pandas as pd
import networkx as nx
import pickle
from tqdm import tqdm
from torch_geometric.utils import from_networkx
import sys
sys.path.append('../../..')
from aiagents4pharma.talk2knowledgegraphs.datasets.primekg import PrimeKG
from aiagents4pharma.talk2knowledgegraphs.datasets.starkqa_primekg import StarkQAPrimeKG
from aiagents4pharma.talk2knowledgegraphs.utils.embeddings.ollama import EmbeddingWithOllama
from aiagents4pharma.talk2knowledgegraphs.utils import kg_utils
# Set the logging level for httpx to WARNING to suppress INFO messages
import logging
logging.getLogger("httpx").setLevel(logging.WARNING)
PrimeKG
We use the PrimeKG class from the aiagents4pharma/talk2knowledgegraphs library.
The PrimeKG class must be initialized with the path to the local directory where the PrimeKG dataset is stored and loaded from.
# Define primekg data by providing a local directory where the data is stored
primekg_data = PrimeKG(local_dir="../../../../data/primekg/")
# Invoke a method to load the data
primekg_data.load_data()
# Get primekg_nodes and primekg_edges
primekg_nodes = primekg_data.get_nodes()
primekg_edges = primekg_data.get_edges()
Loading nodes of PrimeKG dataset ...
../../../../data/primekg/primekg_nodes.tsv.gz already exists. Loading the data from the local directory.
Loading edges of PrimeKG dataset ...
../../../../data/primekg/primekg_edges.tsv.gz already exists. Loading the data from the local directory.
IBD-related Data Filtering
IBD-related Disease Nodes
As a first step, we filter primekg_nodes by querying for disease nodes whose names contain any of the following terms:
- inflammatory bowel disease
- crohn
- ulcerative colitis
For now, this basic query is used to filter the data. It can be replaced with a more elaborate query that captures more IBD-related nodes.
# Query for nodes related to IBD
query_str = 'node_name_lower.str.contains("inflammatory bowel disease")'
query_str += ' or node_name_lower.str.contains("crohn")'
query_str += ' or node_name_lower.str.contains("ulcerative colitis")'
# Get the nodes related to IBD
ibd_nodes_df = primekg_nodes.copy()
ibd_nodes_df["node_name_lower"] = primekg_nodes.node_name.apply(lambda x: x.lower())
ibd_nodes_df = ibd_nodes_df[ibd_nodes_df.node_type == "disease"].query(query_str, engine='python')
ibd_nodes_df.drop(columns=["node_name_lower"], inplace=True)
ibd_nodes_df
index | node_index | node_name | node_source | node_id | node_type |
---|---|---|---|---|---|
27269 | 27269 | IL21-related infantile inflammatory bowel disease | MONDO | 14338 | disease |
28158 | 28158 | inflammatory bowel disease | MONDO_grouped | 9960_12845_33643_11471_12831_12875_12941_13153... | disease |
29293 | 29293 | inflammatory bowel disease, immunodeficiency, ... | MONDO | 32601 | disease |
35814 | 35814 | Crohn ileitis and jejunitis | MONDO_grouped | 709_21207 | disease |
35815 | 35815 | small bowel Crohn disease | MONDO | 5539 | disease |
37784 | 37784 | Crohn disease | MONDO_grouped | 5011_5535 | disease |
37785 | 37785 | ulcerative colitis (disease) | MONDO | 5101 | disease |
39013 | 39013 | immune dysregulation-inflammatory bowel diseas... | MONDO | 16542 | disease |
39787 | 39787 | immune dysregulation with inflammatory bowel d... | MONDO | 33967 | disease |
83770 | 83770 | Crohn's colitis | MONDO | 5532 | disease |
95279 | 95279 | Crohn jejunoileitis | MONDO | 708 | disease |
95280 | 95280 | gastroduodenal Crohn disease | MONDO | 710 | disease |
97088 | 97088 | perianal Crohn disease | MONDO | 5537 | disease |
99325 | 99325 | Crohn disease of the esophagus | MONDO | 22901 | disease |
99680 | 99680 | immune dysregulation-inflammatory bowel diseas... | MONDO | 33968 | disease |
99681 | 99681 | inflammatory bowel disease-recurrent sinopulmo... | MONDO | 33969 | disease |
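As one possible refinement of the keyword filter above, the three substring checks can be folded into a single case-insensitive regular expression. The sketch below runs on a small made-up node table (the rows and the `IBD_PATTERN` name are illustrative assumptions, not PrimeKG data) and relies only on `Series.str.contains`.

```python
import pandas as pd

# Toy node table with a subset of the primekg_nodes columns (rows are made up)
toy_nodes = pd.DataFrame({
    "node_index": [0, 1, 2, 3],
    "node_name": ["Crohn disease", "ulcerative colitis (disease)", "asthma", "CROHN1"],
    "node_type": ["disease", "disease", "disease", "gene/protein"],
})

# One regex covering all three search terms (hypothetical name)
IBD_PATTERN = r"inflammatory bowel disease|crohn|ulcerative colitis"

# Keep disease nodes whose name matches the pattern, ignoring case
is_disease = toy_nodes.node_type == "disease"
matches_ibd = toy_nodes.node_name.str.contains(IBD_PATTERN, case=False, regex=True)
toy_ibd_nodes = toy_nodes[is_disease & matches_ibd]
print(toy_ibd_nodes.node_name.tolist())
```

Because `case=False` handles the lowercasing, this variant does not need the temporary `node_name_lower` column.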
Disease-Protein Relationship
Based on the IBD-related nodes, we can now collect the edges that connect these disease nodes to gene/protein nodes.
# IBD disease_protein edges
ibd_disease_protein_edges_df = pd.concat([primekg_edges[(primekg_edges.head_index.isin(ibd_nodes_df.index.values)) &
(primekg_edges.tail_type == 'gene/protein')],
primekg_edges[(primekg_edges.tail_index.isin(ibd_nodes_df.index.values)) &
(primekg_edges.head_type == 'gene/protein')]])
# Check dataframe
ibd_disease_protein_edges_df
index | head_index | head_name | head_source | head_id | head_type | tail_index | tail_name | tail_source | tail_id | tail_type | display_relation | relation |
---|---|---|---|---|---|---|---|---|---|---|---|---|
5988787 | 37785 | ulcerative colitis (disease) | MONDO | 5101 | disease | 7359 | ADCY7 | NCBI | 113 | gene/protein | associated with | disease_protein |
5988788 | 28158 | inflammatory bowel disease | MONDO_grouped | 9960_12845_33643_11471_12831_12875_12941_13153... | disease | 7359 | ADCY7 | NCBI | 113 | gene/protein | associated with | disease_protein |
5988789 | 37785 | ulcerative colitis (disease) | MONDO | 5101 | disease | 2874 | PRDM1 | NCBI | 639 | gene/protein | associated with | disease_protein |
5988790 | 28158 | inflammatory bowel disease | MONDO_grouped | 9960_12845_33643_11471_12831_12875_12941_13153... | disease | 2874 | PRDM1 | NCBI | 639 | gene/protein | associated with | disease_protein |
5988791 | 37785 | ulcerative colitis (disease) | MONDO | 5101 | disease | 2712 | CASP3 | NCBI | 836 | gene/protein | associated with | disease_protein |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3304471 | 34780 | IRGM | NCBI | 345611 | gene/protein | 35814 | Crohn ileitis and jejunitis | MONDO_grouped | 709_21207 | disease | associated with | disease_protein |
3310277 | 5022 | ITGAM | NCBI | 3684 | gene/protein | 35814 | Crohn ileitis and jejunitis | MONDO_grouped | 709_21207 | disease | associated with | disease_protein |
3313160 | 2889 | TGFB1 | NCBI | 7040 | gene/protein | 29293 | inflammatory bowel disease, immunodeficiency, ... | MONDO | 32601 | disease | associated with | disease_protein |
3314800 | 9104 | INAVA | NCBI | 55765 | gene/protein | 28158 | inflammatory bowel disease | MONDO_grouped | 9960_12845_33643_11471_12831_12875_12941_13153... | disease | associated with | disease_protein |
3314949 | 34967 | IL21 | NCBI | 59067 | gene/protein | 27269 | IL21-related infantile inflammatory bowel disease | MONDO | 14338 | disease | associated with | disease_protein |
620 rows × 12 columns
# Get unique protein index
ibd_protein_index = np.unique(np.concatenate([ibd_disease_protein_edges_df[ibd_disease_protein_edges_df.head_type == 'gene/protein'].head_index.unique(),
ibd_disease_protein_edges_df[ibd_disease_protein_edges_df.tail_type == 'gene/protein'].tail_index.unique()]))
ibd_protein_index
array([ 144, 179, 192, 279, 417, 625, 657, 729, 772, 989, 1004, 1122, 1299, 1480, 1567, 1618, 1654, 1777, 1990, 2012, 2057, 2078, 2111, 2139, 2329, 2384, 2543, 2643, 2712, 2749, 2874, 2889, 2978, 2983, 3064, 3088, 3233, 3259, 3333, 3414, 3460, 3469, 3474, 3484, 3495, 3578, 3646, 4152, 4162, 4731, 4818, 4968, 4997, 5022, 5195, 5385, 5720, 5805, 5915, 6168, 6175, 6229, 6428, 6661, 7059, 7083, 7359, 7384, 7899, 7958, 8030, 8564, 9104, 9454, 9763, 10113, 10191, 10919, 11103, 11134, 11199, 11523, 11588, 12305, 12663, 12740, 12763, 12816, 13014, 13365, 21972, 22105, 34623, 34776, 34777, 34778, 34779, 34780, 34781, 34814, 34887, 34967, 35156])
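The same "collect indices from both endpoints" step is reused later for pathways, so it can be convenient to wrap it in a small function. The following is a minimal sketch on made-up edges; `unique_index_of_type` is a hypothetical helper name, not part of aiagents4pharma.

```python
import numpy as np
import pandas as pd

def unique_index_of_type(edges: pd.DataFrame, node_type: str) -> np.ndarray:
    """Collect sorted unique node indices of `node_type` from both edge endpoints."""
    heads = edges.loc[edges.head_type == node_type, "head_index"]
    tails = edges.loc[edges.tail_type == node_type, "tail_index"]
    return np.unique(np.concatenate([heads.to_numpy(), tails.to_numpy()]))

# Toy edges with the relevant PrimeKG edge columns (values are made up)
toy_edges = pd.DataFrame({
    "head_index": [10, 1, 11],
    "head_type": ["disease", "gene/protein", "disease"],
    "tail_index": [1, 10, 2],
    "tail_type": ["gene/protein", "disease", "gene/protein"],
})

toy_protein_index = unique_index_of_type(toy_edges, "gene/protein")
print(toy_protein_index)
```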
Disease-Disease Relationship
Here, we could get the records containing disease-disease relationships. This step is skipped for now, so the code is commented out.
# # IBD disease_disease edges
# ibd_disease_disease_edges_df = pd.concat([primekg_edges[(primekg_edges.head_index.isin(ibd_nodes_df.index.values)) &
# (primekg_edges.tail_type == 'disease')],
# primekg_edges[(primekg_edges.tail_index.isin(ibd_nodes_df.index.values)) &
# (primekg_edges.head_type == 'disease')]])
# # Check dataframe
# ibd_disease_disease_edges_df
Protein-Protein Relationship
We could also get the records containing gene/protein-gene/protein relationships. This step is skipped for now as well.
# # IBD protein_protein edges
# ibd_protein_protein_edges_df = pd.concat([primekg_edges[(primekg_edges.head_index.isin(ibd_protein_index)) &
# (primekg_edges.tail_type == 'gene/protein')],
# primekg_edges[(primekg_edges.tail_index.isin(ibd_protein_index)) &
# (primekg_edges.head_type == 'gene/protein')]])
# # Check dataframe
# ibd_protein_protein_edges_df
Drug-Protein Relationship
Next, we get the records containing drug-gene/protein relationships, restricted to the IBD-related proteins.
# IBD drug_protein edges
ibd_drug_protein_edges_df = pd.concat([primekg_edges[(primekg_edges.head_type == 'drug') &
(primekg_edges.tail_type == 'gene/protein') &
(primekg_edges.tail_index.isin(ibd_protein_index))],
primekg_edges[(primekg_edges.tail_type == 'drug') &
(primekg_edges.head_type == 'gene/protein') &
(primekg_edges.head_index.isin(ibd_protein_index))]])
# Check dataframe
ibd_drug_protein_edges_df
index | head_index | head_name | head_source | head_id | head_type | tail_index | tail_name | tail_source | tail_id | tail_type | display_relation | relation |
---|---|---|---|---|---|---|---|---|---|---|---|---|
321759 | 14118 | Rose bengal | DrugBank | DB11182 | drug | 3233 | LTF | NCBI | 4057 | gene/protein | carrier | drug_protein |
321763 | 14038 | Fluticasone furoate | DrugBank | DB08906 | drug | 4152 | ABCB1 | NCBI | 5243 | gene/protein | carrier | drug_protein |
321764 | 14555 | Technetium Tc-99m tetrofosmin | DrugBank | DB09160 | drug | 4152 | ABCB1 | NCBI | 5243 | gene/protein | carrier | drug_protein |
321765 | 14040 | Fluticasone | DrugBank | DB13867 | drug | 4152 | ABCB1 | NCBI | 5243 | gene/protein | carrier | drug_protein |
322373 | 14060 | Levothyroxine | DrugBank | DB00451 | drug | 4152 | ABCB1 | NCBI | 5243 | gene/protein | enzyme | drug_protein |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5731639 | 4152 | ABCB1 | NCBI | 5243 | gene/protein | 14498 | Risdiplam | DrugBank | DB15305 | drug | transporter | drug_protein |
5731640 | 4152 | ABCB1 | NCBI | 5243 | gene/protein | 14908 | Ubrogepant | DrugBank | DB15328 | drug | transporter | drug_protein |
5731641 | 4152 | ABCB1 | NCBI | 5243 | gene/protein | 14499 | Elexacaftor | DrugBank | DB15444 | drug | transporter | drug_protein |
5731642 | 4152 | ABCB1 | NCBI | 5243 | gene/protein | 14050 | Prednisolone acetate | DrugBank | DB15566 | drug | transporter | drug_protein |
5731643 | 4152 | ABCB1 | NCBI | 5243 | gene/protein | 15752 | Selpercatinib | DrugBank | DB15685 | drug | transporter | drug_protein |
2030 rows × 12 columns
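The filter used here (one endpoint in a known index set, the other endpoint of a given type, in either direction) recurs for every relationship in this tutorial. A possible way to factor it out is sketched below; `edges_touching` is a hypothetical helper and the toy rows are made up.

```python
import pandas as pd

def edges_touching(edges: pd.DataFrame, anchor_index, other_type: str) -> pd.DataFrame:
    """Edges with one endpoint in `anchor_index` and the other endpoint of `other_type`.

    Both directions are collected and concatenated, mirroring the pd.concat pattern
    used in this tutorial. A row that satisfies both directions would appear twice.
    """
    anchor_index = list(anchor_index)
    head_is_other = (edges.head_type == other_type) & edges.tail_index.isin(anchor_index)
    tail_is_other = (edges.tail_type == other_type) & edges.head_index.isin(anchor_index)
    return pd.concat([edges[head_is_other], edges[tail_is_other]])

# Toy edges (made-up values): proteins 1 and 2 play the role of ibd_protein_index
toy_edges = pd.DataFrame({
    "head_index": [100, 1, 101, 2],
    "head_type": ["drug", "gene/protein", "drug", "gene/protein"],
    "tail_index": [1, 100, 3, 200],
    "tail_type": ["gene/protein", "drug", "gene/protein", "pathway"],
})

toy_drug_edges = edges_touching(toy_edges, [1, 2], "drug")
print(toy_drug_edges)
```

With such a helper, the pathway, biological process, molecular function, and cellular component filters would differ only in the `other_type` argument.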
Pathway-Protein Relationship
Here, we get the records containing pathway-gene/protein relationships.
# IBD pathway_protein edges
ibd_pathway_protein_edges_df = pd.concat([primekg_edges[(primekg_edges.head_type == 'pathway') &
(primekg_edges.tail_type == 'gene/protein') &
(primekg_edges.tail_index.isin(ibd_protein_index))],
primekg_edges[(primekg_edges.tail_type == 'pathway') &
(primekg_edges.head_type == 'gene/protein') &
(primekg_edges.head_index.isin(ibd_protein_index))]])
# Check dataframe
ibd_pathway_protein_edges_df
index | head_index | head_name | head_source | head_id | head_type | tail_index | tail_name | tail_source | tail_id | tail_type | display_relation | relation |
---|---|---|---|---|---|---|---|---|---|---|---|---|
6505784 | 62703 | Adherens junctions interactions | REACTOME | R-HSA-418990 | pathway | 8030 | CDH3 | NCBI | 1001 | gene/protein | interacts with | pathway_protein |
6506102 | 128079 | Regulation of actin dynamics for phagocytic cu... | REACTOME | R-HSA-2029482 | pathway | 2139 | ARPC2 | NCBI | 10109 | gene/protein | interacts with | pathway_protein |
6506103 | 128183 | EPHB-mediated forward signaling | REACTOME | R-HSA-3928662 | pathway | 2139 | ARPC2 | NCBI | 10109 | gene/protein | interacts with | pathway_protein |
6506104 | 128022 | RHO GTPases Activate WASPs and WAVEs | REACTOME | R-HSA-5663213 | pathway | 2139 | ARPC2 | NCBI | 10109 | gene/protein | interacts with | pathway_protein |
6506105 | 62931 | Clathrin-mediated endocytosis | REACTOME | R-HSA-8856828 | pathway | 2139 | ARPC2 | NCBI | 10109 | gene/protein | interacts with | pathway_protein |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3834665 | 2543 | CDH1 | NCBI | 999 | gene/protein | 127731 | Integrin cell surface interactions | REACTOME | R-HSA-216083 | pathway | interacts with | pathway_protein |
3834666 | 2543 | CDH1 | NCBI | 999 | gene/protein | 127617 | Apoptotic cleavage of cell adhesion proteins | REACTOME | R-HSA-351906 | pathway | interacts with | pathway_protein |
3834667 | 2543 | CDH1 | NCBI | 999 | gene/protein | 62703 | Adherens junctions interactions | REACTOME | R-HSA-418990 | pathway | interacts with | pathway_protein |
3834668 | 2543 | CDH1 | NCBI | 999 | gene/protein | 128018 | RHO GTPases activate IQGAPs | REACTOME | R-HSA-5626467 | pathway | interacts with | pathway_protein |
3834669 | 2543 | CDH1 | NCBI | 999 | gene/protein | 129039 | InlA-mediated entry of Listeria monocytogenes ... | REACTOME | R-HSA-8876493 | pathway | interacts with | pathway_protein |
1030 rows × 12 columns
# Get unique pathway index
ibd_pathway_index = np.unique(np.concatenate([ibd_pathway_protein_edges_df[ibd_pathway_protein_edges_df.head_type == 'pathway'].head_index.unique(),
ibd_pathway_protein_edges_df[ibd_pathway_protein_edges_df.tail_type == 'pathway'].tail_index.unique()]))
ibd_pathway_index
array([ 62341, 62347, 62348, 62373, 62376, 62394, 62400, 62401, 62404, 62405, 62414, 62448, 62449, 62462, 62465, 62467, 62469, 62472, 62476, 62477, 62483, 62543, 62571, 62573, 62575, 62583, 62588, 62596, 62603, 62606, 62628, 62644, 62651, 62655, 62657, 62672, 62675, 62691, 62692, 62697, 62702, 62703, 62711, 62717, 62733, 62734, 62768, 62770, 62805, 62807, 62836, 62865, 62916, 62925, 62931, 62968, 62976, 62987, 62996, 63041, 63064, 63071, 63076, 127601, 127615, 127616, 127617, 127619, 127620, 127624, 127628, 127629, 127639, 127640, 127649, 127659, 127662, 127682, 127683, 127688, 127691, 127693, 127694, 127695, 127696, 127726, 127727, 127728, 127729, 127730, 127731, 127732, 127733, 127791, 127797, 127810, 127814, 127815, 127833, 127835, 127856, 127858, 127866, 127867, 127869, 127886, 127891, 127908, 127917, 127918, 127921, 127928, 127958, 127960, 127971, 127977, 127999, 128001, 128002, 128003, 128008, 128010, 128015, 128018, 128022, 128025, 128034, 128058, 128065, 128071, 128072, 128073, 128074, 128078, 128079, 128080, 128086, 128111, 128113, 128116, 128117, 128129, 128137, 128138, 128139, 128158, 128165, 128170, 128176, 128183, 128186, 128191, 128198, 128199, 128204, 128208, 128209, 128224, 128227, 128242, 128243, 128244, 128253, 128254, 128270, 128271, 128272, 128273, 128299, 128302, 128341, 128348, 128349, 128350, 128351, 128353, 128360, 128378, 128381, 128393, 128395, 128396, 128399, 128430, 128440, 128453, 128460, 128470, 128472, 128473, 128477, 128478, 128479, 128480, 128481, 128482, 128483, 128484, 128486, 128487, 128497, 128498, 128499, 128500, 128501, 128503, 128527, 128535, 128550, 128593, 128599, 128601, 128602, 128604, 128655, 128677, 128715, 128759, 128766, 128767, 128779, 128781, 128782, 128783, 128784, 128789, 128792, 128801, 128804, 128814, 128815, 128827, 128828, 128829, 128830, 128832, 128835, 128837, 128838, 128841, 128846, 128851, 128852, 128878, 128976, 128977, 128978, 128979, 128980, 128981, 128988, 128990, 129007, 129015, 129016, 129021, 
129023, 129035, 129039, 129040, 129042, 129044, 129047, 129048, 129052, 129099, 129110, 129124, 129125, 129126, 129127, 129128, 129131, 129135, 129136, 129139, 129140, 129141, 129148, 129155, 129167, 129181, 129183, 129190, 129195, 129196, 129197, 129198, 129215, 129217, 129238, 129257, 129258, 129259, 129264, 129266, 129289, 129294, 129296, 129302, 129303, 129310, 129355, 129360, 129361, 129365, 129366, 129367])
Pathway-Pathway Relationship
Similarly, we could get the records containing pathway-pathway relationships. This step is skipped for now, so the code is commented out.
# # # IBD pathway_pathway edges
# ibd_pathway_pathway_edges_df = pd.concat([primekg_edges[(primekg_edges.head_index.isin(ibd_pathway_index)) &
# (primekg_edges.tail_type == 'pathway')],
# primekg_edges[(primekg_edges.tail_index.isin(ibd_pathway_index)) &
# (primekg_edges.head_type == 'pathway')]])
# # Check dataframe
# ibd_pathway_pathway_edges_df
Bioprocess-Protein Relationship
The next step is to get the records containing biological_process-gene/protein relationships.
# IBD bioprocess_protein edges
ibd_bioprocess_protein_edges_df = pd.concat([primekg_edges[(primekg_edges.head_type == 'biological_process') &
(primekg_edges.tail_type == 'gene/protein') &
(primekg_edges.tail_index.isin(ibd_protein_index))],
primekg_edges[(primekg_edges.tail_type == 'biological_process') &
(primekg_edges.head_type == 'gene/protein') &
(primekg_edges.head_index.isin(ibd_protein_index))]])
# Check dataframe
ibd_bioprocess_protein_edges_df
index | head_index | head_name | head_source | head_id | head_type | tail_index | tail_name | tail_source | tail_id | tail_type | display_relation | relation |
---|---|---|---|---|---|---|---|---|---|---|---|---|
6351294 | 112487 | neutrophil degranulation | GO | 43312 | biological_process | 1990 | FCGR2A | NCBI | 2212 | gene/protein | interacts with | bioprocess_protein |
6351300 | 112487 | neutrophil degranulation | GO | 43312 | biological_process | 3333 | FPR2 | NCBI | 2358 | gene/protein | interacts with | bioprocess_protein |
6351340 | 112487 | neutrophil degranulation | GO | 43312 | biological_process | 2012 | CXCR1 | NCBI | 3577 | gene/protein | interacts with | bioprocess_protein |
6351341 | 112487 | neutrophil degranulation | GO | 43312 | biological_process | 3064 | CXCR2 | NCBI | 3579 | gene/protein | interacts with | bioprocess_protein |
6351346 | 112487 | neutrophil degranulation | GO | 43312 | biological_process | 5022 | ITGAM | NCBI | 3684 | gene/protein | interacts with | bioprocess_protein |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3781707 | 2111 | LRRK2 | NCBI | 120892 | gene/protein | 51599 | negative regulation of peroxidase activity | GO | 2000469 | biological_process | interacts with | bioprocess_protein |
3781708 | 2111 | LRRK2 | NCBI | 120892 | gene/protein | 52358 | regulation of kidney size | GO | 35564 | biological_process | interacts with | bioprocess_protein |
3781710 | 2111 | LRRK2 | NCBI | 120892 | gene/protein | 109343 | negative regulation of thioredoxin peroxidase ... | GO | 1903125 | biological_process | interacts with | bioprocess_protein |
3781811 | 22105 | GPBAR1 | NCBI | 151306 | gene/protein | 105254 | cell surface bile acid receptor signaling pathway | GO | 38184 | biological_process | interacts with | bioprocess_protein |
3781824 | 34779 | NKX2-3 | NCBI | 159296 | gene/protein | 100699 | post-embryonic digestive tract morphogenesis | GO | 48621 | biological_process | interacts with | bioprocess_protein |
6300 rows × 12 columns
MolFunc-Protein Relationship
Here, we get the records containing molecular_function-gene/protein relationships.
# IBD molfunc_protein edges
ibd_molfunc_protein_edges_df = pd.concat([primekg_edges[(primekg_edges.head_type == 'molecular_function') &
(primekg_edges.tail_type == 'gene/protein') &
(primekg_edges.tail_index.isin(ibd_protein_index))],
primekg_edges[(primekg_edges.tail_type == 'molecular_function') &
(primekg_edges.head_type == 'gene/protein') &
(primekg_edges.head_index.isin(ibd_protein_index))]])
# Check dataframe
ibd_molfunc_protein_edges_df
index | head_index | head_name | head_source | head_id | head_type | tail_index | tail_name | tail_source | tail_id | tail_type | display_relation | relation |
---|---|---|---|---|---|---|---|---|---|---|---|---|
6198264 | 54035 | interleukin-1 binding | GO | 19966 | molecular_function | 1654 | IL1R2 | NCBI | 7850 | gene/protein | interacts with | molfunc_protein |
6198359 | 54290 | enzyme binding | GO | 19899 | molecular_function | 3578 | ECM1 | NCBI | 1893 | gene/protein | interacts with | molfunc_protein |
6198366 | 54290 | enzyme binding | GO | 19899 | molecular_function | 2057 | FN1 | NCBI | 2335 | gene/protein | interacts with | molfunc_protein |
6198442 | 54290 | enzyme binding | GO | 19899 | molecular_function | 989 | PPARG | NCBI | 5468 | gene/protein | interacts with | molfunc_protein |
6198462 | 54290 | enzyme binding | GO | 19899 | molecular_function | 772 | RELA | NCBI | 5970 | gene/protein | interacts with | molfunc_protein |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3553533 | 6229 | NOD2 | NCBI | 64127 | gene/protein | 122117 | muramyl dipeptide binding | GO | 32500 | molecular_function | interacts with | molfunc_protein |
3553770 | 2111 | LRRK2 | NCBI | 120892 | gene/protein | 115199 | GTP-dependent protein kinase activity | GO | 34211 | molecular_function | interacts with | molfunc_protein |
3553771 | 2111 | LRRK2 | NCBI | 120892 | gene/protein | 118105 | beta-catenin destruction complex binding | GO | 1904713 | molecular_function | interacts with | molfunc_protein |
3553773 | 2111 | LRRK2 | NCBI | 120892 | gene/protein | 119847 | peroxidase inhibitor activity | GO | 36479 | molecular_function | interacts with | molfunc_protein |
3553832 | 22105 | GPBAR1 | NCBI | 151306 | gene/protein | 116806 | G protein-coupled bile acid receptor activity | GO | 38182 | molecular_function | interacts with | molfunc_protein |
1466 rows × 12 columns
CellComp-Protein Relationship
Finally, we get the records containing cellular_component-gene/protein relationships.
# IBD cellcomp_protein edges
ibd_cellcomp_protein_edges_df = pd.concat([primekg_edges[(primekg_edges.head_type == 'cellular_component') &
(primekg_edges.tail_type == 'gene/protein') &
(primekg_edges.tail_index.isin(ibd_protein_index))],
primekg_edges[(primekg_edges.tail_type == 'cellular_component') &
(primekg_edges.head_type == 'gene/protein') &
(primekg_edges.head_index.isin(ibd_protein_index))]])
# Check dataframe
ibd_cellcomp_protein_edges_df
index | head_index | head_name | head_source | head_id | head_type | tail_index | tail_name | tail_source | tail_id | tail_type | display_relation | relation |
---|---|---|---|---|---|---|---|---|---|---|---|---|
6267848 | 126078 | ficolin-1-rich granule lumen | GO | 1904813 | cellular_component | 3474 | MMP9 | NCBI | 4318 | gene/protein | interacts with | cellcomp_protein |
6268120 | 124245 | extracellular space | GO | 5615 | cellular_component | 2384 | CRP | NCBI | 1401 | gene/protein | interacts with | cellcomp_protein |
6268163 | 124245 | extracellular space | GO | 5615 | cellular_component | 5805 | DEFA5 | NCBI | 1670 | gene/protein | interacts with | cellcomp_protein |
6268164 | 124245 | extracellular space | GO | 5615 | cellular_component | 657 | DEFA6 | NCBI | 1671 | gene/protein | interacts with | cellcomp_protein |
6268173 | 124245 | extracellular space | GO | 5615 | cellular_component | 3578 | ECM1 | NCBI | 1893 | gene/protein | interacts with | cellcomp_protein |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3636708 | 2139 | ARPC2 | NCBI | 10109 | gene/protein | 126261 | muscle cell projection membrane | GO | 36195 | cellular_component | interacts with | cellcomp_protein |
3636819 | 9763 | ORMDL3 | NCBI | 94103 | gene/protein | 126815 | SPOTS complex | GO | 35339 | cellular_component | interacts with | cellcomp_protein |
3637211 | 6661 | ATG16L1 | NCBI | 55054 | gene/protein | 126444 | vacuole-isolation membrane contact site | GO | 120095 | cellular_component | interacts with | cellcomp_protein |
3637234 | 2111 | LRRK2 | NCBI | 120892 | gene/protein | 126938 | cytoplasmic side of mitochondrial outer membrane | GO | 32473 | cellular_component | interacts with | cellcomp_protein |
3637328 | 2111 | LRRK2 | NCBI | 120892 | gene/protein | 125942 | caveola neck | GO | 99400 | cellular_component | interacts with | cellcomp_protein |
1348 rows × 12 columns
Merge all dataframes
Once we have the edges for each relationship type, we can merge them into a single dataframe representing the IBD subgraph derived from PrimeKG.
# PrimeKG edges related to IBD
primekg_ibd_edges_df = pd.concat([ibd_disease_protein_edges_df,
# ibd_disease_disease_edges_df,
# ibd_protein_protein_edges_df,
ibd_drug_protein_edges_df,
ibd_pathway_protein_edges_df,
# ibd_pathway_pathway_edges_df,
ibd_bioprocess_protein_edges_df,
ibd_molfunc_protein_edges_df,
ibd_cellcomp_protein_edges_df])
primekg_ibd_edges_df["edge_type"] = primekg_ibd_edges_df.apply(lambda x: (x.head_type, x.display_relation, x.tail_type), axis=1)
primekg_ibd_edges_df.drop_duplicates(subset=['head_index', 'tail_index'], inplace=True)
primekg_ibd_edges_df.reset_index(drop=True, inplace=True)
primekg_ibd_edges_df
index | head_index | head_name | head_source | head_id | head_type | tail_index | tail_name | tail_source | tail_id | tail_type | display_relation | relation | edge_type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 37785 | ulcerative colitis (disease) | MONDO | 5101 | disease | 7359 | ADCY7 | NCBI | 113 | gene/protein | associated with | disease_protein | (disease, associated with, gene/protein) |
1 | 28158 | inflammatory bowel disease | MONDO_grouped | 9960_12845_33643_11471_12831_12875_12941_13153... | disease | 7359 | ADCY7 | NCBI | 113 | gene/protein | associated with | disease_protein | (disease, associated with, gene/protein) |
2 | 37785 | ulcerative colitis (disease) | MONDO | 5101 | disease | 2874 | PRDM1 | NCBI | 639 | gene/protein | associated with | disease_protein | (disease, associated with, gene/protein) |
3 | 28158 | inflammatory bowel disease | MONDO_grouped | 9960_12845_33643_11471_12831_12875_12941_13153... | disease | 2874 | PRDM1 | NCBI | 639 | gene/protein | associated with | disease_protein | (disease, associated with, gene/protein) |
4 | 37785 | ulcerative colitis (disease) | MONDO | 5101 | disease | 2712 | CASP3 | NCBI | 836 | gene/protein | associated with | disease_protein | (disease, associated with, gene/protein) |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
12747 | 2139 | ARPC2 | NCBI | 10109 | gene/protein | 126261 | muscle cell projection membrane | GO | 36195 | cellular_component | interacts with | cellcomp_protein | (gene/protein, interacts with, cellular_compon... |
12748 | 9763 | ORMDL3 | NCBI | 94103 | gene/protein | 126815 | SPOTS complex | GO | 35339 | cellular_component | interacts with | cellcomp_protein | (gene/protein, interacts with, cellular_compon... |
12749 | 6661 | ATG16L1 | NCBI | 55054 | gene/protein | 126444 | vacuole-isolation membrane contact site | GO | 120095 | cellular_component | interacts with | cellcomp_protein | (gene/protein, interacts with, cellular_compon... |
12750 | 2111 | LRRK2 | NCBI | 120892 | gene/protein | 126938 | cytoplasmic side of mitochondrial outer membrane | GO | 32473 | cellular_component | interacts with | cellcomp_protein | (gene/protein, interacts with, cellular_compon... |
12751 | 2111 | LRRK2 | NCBI | 120892 | gene/protein | 125942 | caveola neck | GO | 99400 | cellular_component | interacts with | cellcomp_protein | (gene/protein, interacts with, cellular_compon... |
12752 rows × 13 columns
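Note that the row counts of the six per-relationship tables above add up to 620 + 2030 + 1030 + 6300 + 1466 + 1348 = 12794, while the merged table has 12752 rows. The difference comes from `drop_duplicates` on `(head_index, tail_index)`: a pair of nodes linked by more than one `display_relation` keeps only its first edge. The toy example below (made-up indices and relations, not PrimeKG rows) illustrates the effect, together with one possible alternative that also keys on `display_relation` in case multi-relation pairs should be preserved.

```python
import pandas as pd

# Toy edges: nodes 1 -> 2 are linked by two different relations (values are made up)
toy_edges = pd.DataFrame({
    "head_index": [1, 1, 2],
    "tail_index": [2, 2, 1],
    "display_relation": ["enzyme", "transporter", "enzyme"],
})

# Deduplicate on the node pair only: the second relation of pair (1, 2) is dropped
by_pair = toy_edges.drop_duplicates(subset=["head_index", "tail_index"])

# Deduplicate on node pair plus relation: all three distinct edges are kept
by_pair_and_relation = toy_edges.drop_duplicates(
    subset=["head_index", "tail_index", "display_relation"]
)

print(len(by_pair), len(by_pair_and_relation))
```

Which key is appropriate depends on whether downstream graph formats should hold at most one edge per ordered node pair.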
We can get a dataframe of nodes based on the above edge dataframe as follows:
# PrimeKG nodes related to IBD
primekg_ibd_nodes_df = primekg_nodes[primekg_nodes.index.isin(np.unique(np.hstack([primekg_ibd_edges_df.head_index.unique(),
primekg_ibd_edges_df.tail_index.unique()])))]
primekg_ibd_nodes_df
index | node_index | node_name | node_source | node_id | node_type |
---|---|---|---|---|---|
144 | 144 | SMAD3 | NCBI | 4088 | gene/protein |
179 | 179 | IL10RB | NCBI | 3588 | gene/protein |
192 | 192 | GNA12 | NCBI | 2768 | gene/protein |
279 | 279 | HNF4A | NCBI | 3172 | gene/protein |
417 | 417 | VCAM1 | NCBI | 7412 | gene/protein |
... | ... | ... | ... | ... | ... |
129360 | 129360 | IRAK2 mediated activation of TAK1 complex upon... | REACTOME | R-HSA-975163 | pathway |
129361 | 129361 | TRAF6 mediated IRF7 activation in TLR7/8 or 9 ... | REACTOME | R-HSA-975110 | pathway |
129365 | 129365 | Antigen processing: Ubiquitination & Proteasom... | REACTOME | R-HSA-983168 | pathway |
129366 | 129366 | Antigen Presentation: Folding, assembly and pe... | REACTOME | R-HSA-983170 | pathway |
129367 | 129367 | Kinesins | REACTOME | R-HSA-983189 | pathway |
3426 rows × 5 columns
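The node selection above looks up edge endpoints in the dataframe's row index, which works because the row index of primekg_nodes coincides with node_index in the tables shown. A quick consistency check, sketched here on made-up data, is to confirm that no edge endpoint is missing from the selected nodes.

```python
import numpy as np
import pandas as pd

# Toy nodes and edges (made up); the row index equals node_index, as assumed above
toy_nodes = pd.DataFrame({"node_index": [0, 1, 2, 3], "node_name": ["A", "B", "C", "D"]})
toy_edges = pd.DataFrame({"head_index": [0, 2], "tail_index": [2, 3]})

# All node indices that appear as an edge endpoint
endpoint_index = np.union1d(toy_edges.head_index.to_numpy(), toy_edges.tail_index.to_numpy())

# Select the nodes and verify that every endpoint was found
toy_sub_nodes = toy_nodes[toy_nodes.index.isin(endpoint_index)]
missing = np.setdiff1d(endpoint_index, toy_sub_nodes.index.to_numpy())
print(toy_sub_nodes.node_name.tolist(), missing.size)
```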
We can store the IBD-related nodes and edges as parquet files for future use.
# Store the IBD-related nodes and edges
local_dir = '../../../../data/primekg_ibd/'
os.makedirs(local_dir, exist_ok=True)
primekg_ibd_nodes_df.to_parquet(os.path.join(local_dir, 'primekg_ibd_nodes.parquet'), compression='gzip', index=False)
primekg_ibd_edges_df.to_parquet(os.path.join(local_dir, 'primekg_ibd_edges.parquet'), compression='gzip', index=False)
# Statistics over the IBD-related nodes and edges
print(f"Number of IBD-related nodes: {primekg_ibd_nodes_df.shape[0]}")
print(f"Number of IBD-related edges: {primekg_ibd_edges_df.shape[0]}")
Number of IBD-related nodes: 3426
Number of IBD-related edges: 12752
# Count the number of nodes by node type
primekg_ibd_nodes_df.groupby('node_type').size()
node_type
biological_process    1642
cellular_component     207
disease                  7
drug                   835
gene/protein           103
molecular_function     324
pathway                308
dtype: int64
# Count the number of edges by relation and display_relation
primekg_ibd_edges_df.groupby(['relation','display_relation']).size()
relation            display_relation
bioprocess_protein  interacts with      6300
cellcomp_protein    interacts with      1348
disease_protein     associated with      620
drug_protein        carrier                8
                    enzyme                64
                    target               776
                    transporter         1140
molfunc_protein     interacts with      1466
pathway_protein     interacts with      1030
dtype: int64
# Count the number of edges by edge type
primekg_ibd_edges_df.groupby(['edge_type']).size()
edge_type
(biological_process, interacts with, gene/protein)    3150
(cellular_component, interacts with, gene/protein)     674
(disease, associated with, gene/protein)               310
(drug, carrier, gene/protein)                            4
(drug, enzyme, gene/protein)                            32
(drug, target, gene/protein)                           388
(drug, transporter, gene/protein)                      570
(gene/protein, associated with, disease)               310
(gene/protein, carrier, drug)                            4
(gene/protein, enzyme, drug)                            32
(gene/protein, interacts with, biological_process)    3150
(gene/protein, interacts with, cellular_component)     674
(gene/protein, interacts with, molecular_function)     733
(gene/protein, interacts with, pathway)                515
(gene/protein, target, drug)                           388
(gene/protein, transporter, drug)                      570
(molecular_function, interacts with, gene/protein)     733
(pathway, interacts with, gene/protein)                515
dtype: int64
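In the counts above, every edge type has the same number of edges as its reverse, for example 388 edges each for (drug, target, gene/protein) and (gene/protein, target, drug). A small check for this kind of symmetry could look like the sketch below, which uses a made-up edge table and a plain `collections.Counter` over the `edge_type` tuples.

```python
from collections import Counter

import pandas as pd

# Toy edge table with an edge_type column of (head_type, relation, tail_type) tuples
toy_edges = pd.DataFrame({
    "edge_type": [
        ("drug", "target", "gene/protein"),
        ("gene/protein", "target", "drug"),
        ("disease", "associated with", "gene/protein"),
        ("gene/protein", "associated with", "disease"),
        ("drug", "enzyme", "gene/protein"),  # deliberately has no reverse edge
    ]
})

# Count edges per type, then compare each type with its reversed tuple
counts = Counter(toy_edges["edge_type"])
unbalanced = {
    edge_type: n
    for edge_type, n in counts.items()
    if counts.get(edge_type[::-1], 0) != n
}
print(unbalanced)
```

An empty `unbalanced` dictionary would indicate that every direction is mirrored, as in the output above.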
Enrichment (textual only for now)
From this point onwards, we use the pre-processed IBD-related nodes and edges to create a set of graph formats.
Before that, we perform enrichment and embedding over the IBD-related nodes and edges.
For now, we only apply textual enrichment to the records.
Since StarkQA provides most of the node information, we use it to look up information about the IBD-related nodes.
# Define starkqa primekg data by providing a local directory where the data is stored
starkqa_data = StarkQAPrimeKG(local_dir="../../../../data/starkqa_primekg/")
# Invoke a method to load the data
starkqa_data.load_data()
# Get the StarkQAPrimeKG data, which are the QA pairs, split indices, and the node information
# starkqa_df = starkqa_data.get_starkqa()
starkqa_node_info = starkqa_data.get_starkqa_node_info()
Loading StarkQAPrimeKG dataset... ../../../../data/starkqa_primekg/qa/prime/stark_qa/stark_qa.csv already exists. Loading the data from the local directory. Loading StarkQAPrimeKG embeddings...
Note that not all nodes in StarkQA-PrimeKG have additional information.
For such nodes, we fall back to a basic text enrichment that states only the node name and type.
def do_enrichment_text(data, starkqa_node_info):
    """
    Enrich the node with additional textual information from StarkQA-PrimeKG.

    Args:
        data (dict): The node data from PrimeKG
        starkqa_node_info (dict): The node information from StarkQA-PrimeKG

    Returns:
        str: The enriched textual description of the node
    """
    # Basic textual enrichment of the node
    enriched_node = f"{data['node_name']} belongs to {data['node_type']} category. "
    # Only enrich the node if the node type is gene/protein, drug, disease, or pathway, which
    # has additional information in the node_info of StarkQA-PrimeKG
    added_info = ''
    if data['node_type'] == 'gene/protein':
        added_info = starkqa_node_info['details']['summary'] if 'summary' in starkqa_node_info['details'] else ''
    elif data['node_type'] == 'drug':
        added_info = ' '.join([str(starkqa_node_info['details']['description']).replace('nan', ''),
                               str(starkqa_node_info['details']['mechanism_of_action']).replace('nan', ''),
                               str(starkqa_node_info['details']['protein_binding']).replace('nan', ''),
                               str(starkqa_node_info['details']['pharmacodynamics']).replace('nan', ''),
                               str(starkqa_node_info['details']['indication']).replace('nan', '')])
    elif data['node_type'] == 'disease':
        added_info = ' '.join([str(starkqa_node_info['details']['mondo_definition']).replace('nan', ''),
                               str(starkqa_node_info['details']['mayo_symptoms']).replace('nan', ''),
                               str(starkqa_node_info['details']['mayo_causes']).replace('nan', '')])
    elif data['node_type'] == 'pathway':
        added_info += f"This pathway found in {starkqa_node_info['details']['speciesName']}. " + ' '.join([x['text'] for x in starkqa_node_info['details']['summation']]) if 'details' in starkqa_node_info else ''
    # Append the additional information for enrichment
    enriched_node += added_info
    return enriched_node
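One caveat with the `str(...).replace('nan', '')` pattern used above: it also strips the letters "nan" from inside ordinary words (for example, "pregnancy" becomes "pregcy"). The sketch below shows a stricter alternative that only blanks out values that are actually missing. `clean_text` and `join_fields` are hypothetical helper names, not part of StarkQA-PrimeKG or aiagents4pharma, and the sample record is made up for illustration.

```python
import math

def clean_text(value):
    """Return '' for missing values (None or float NaN), otherwise str(value)."""
    if value is None:
        return ''
    if isinstance(value, float) and math.isnan(value):
        return ''
    return str(value)

def join_fields(details, keys):
    """Join the non-empty text fields of `details` selected by `keys`."""
    parts = [clean_text(details.get(key)) for key in keys]
    return ' '.join(part for part in parts if part)

# Hypothetical record shaped like the 'details' dict used above
details = {'description': 'Used during pregnancy.', 'indication': float('nan')}
joined = join_fields(details, ['description', 'mechanism_of_action', 'indication'])
print(joined)
```

Because missing keys and NaN values are both mapped to an empty string and then filtered out, the joined text contains no stray spaces or damaged words.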
By using the above function, we can enrich the node information from PrimeKG with additional information from StarkQA-PrimeKG as shown below:
# Perform node enrichment for each row in primekg_nodes
text_enriched_nodes = primekg_ibd_nodes_df.apply(lambda x: do_enrichment_text(x, starkqa_node_info[x['node_index']]), axis=1).tolist()
primekg_ibd_nodes_df['enriched_node'] = text_enriched_nodes
primekg_ibd_nodes_df
/tmp/ipykernel_64662/2873064541.py:3: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy primekg_ibd_nodes_df['enriched_node'] = text_enriched_nodes
| | node_index | node_name | node_source | node_id | node_type | enriched_node |
---|---|---|---|---|---|---|
144 | 144 | SMAD3 | NCBI | 4088 | gene/protein | SMAD3 belongs to gene/protein category. The SM... |
179 | 179 | IL10RB | NCBI | 3588 | gene/protein | IL10RB belongs to gene/protein category. The p... |
192 | 192 | GNA12 | NCBI | 2768 | gene/protein | GNA12 belongs to gene/protein category. Predic... |
279 | 279 | HNF4A | NCBI | 3172 | gene/protein | HNF4A belongs to gene/protein category. The pr... |
417 | 417 | VCAM1 | NCBI | 7412 | gene/protein | VCAM1 belongs to gene/protein category. This g... |
... | ... | ... | ... | ... | ... | ... |
129360 | 129360 | IRAK2 mediated activation of TAK1 complex upon... | REACTOME | R-HSA-975163 | pathway | IRAK2 mediated activation of TAK1 complex upon... |
129361 | 129361 | TRAF6 mediated IRF7 activation in TLR7/8 or 9 ... | REACTOME | R-HSA-975110 | pathway | TRAF6 mediated IRF7 activation in TLR7/8 or 9 ... |
129365 | 129365 | Antigen processing: Ubiquitination & Proteasom... | REACTOME | R-HSA-983168 | pathway | Antigen processing: Ubiquitination & Proteasom... |
129366 | 129366 | Antigen Presentation: Folding, assembly and pe... | REACTOME | R-HSA-983170 | pathway | Antigen Presentation: Folding, assembly and pe... |
129367 | 129367 | Kinesins | REACTOME | R-HSA-983189 | pathway | Kinesins belongs to pathway category. This pat... |
3426 rows × 6 columns
Subsequently, we can perform similar textual enrichment for the edges in PrimeKG.
Since StarkQA only provides node information, each edge is enriched with a basic sentence built from its triple: the head node, the relation, and the tail node.
# Perform textual enrichment over the edges by combining the head node, the relation, and the tail node into one sentence
text_enriched_edges = primekg_ibd_edges_df.apply(lambda x: f"{x['head_name']} ({x['head_type']}) has a direct relationship of {x['relation']}:{x['display_relation']} with {x['tail_name']} ({x['tail_type']}).", axis=1).tolist()
primekg_ibd_edges_df['enriched_edge'] = text_enriched_edges
primekg_ibd_edges_df.head()
| | head_index | head_name | head_source | head_id | head_type | tail_index | tail_name | tail_source | tail_id | tail_type | display_relation | relation | edge_type | enriched_edge |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 37785 | ulcerative colitis (disease) | MONDO | 5101 | disease | 7359 | ADCY7 | NCBI | 113 | gene/protein | associated with | disease_protein | (disease, associated with, gene/protein) | ulcerative colitis (disease) (disease) has a d... |
1 | 28158 | inflammatory bowel disease | MONDO_grouped | 9960_12845_33643_11471_12831_12875_12941_13153... | disease | 7359 | ADCY7 | NCBI | 113 | gene/protein | associated with | disease_protein | (disease, associated with, gene/protein) | inflammatory bowel disease (disease) has a dir... |
2 | 37785 | ulcerative colitis (disease) | MONDO | 5101 | disease | 2874 | PRDM1 | NCBI | 639 | gene/protein | associated with | disease_protein | (disease, associated with, gene/protein) | ulcerative colitis (disease) (disease) has a d... |
3 | 28158 | inflammatory bowel disease | MONDO_grouped | 9960_12845_33643_11471_12831_12875_12941_13153... | disease | 2874 | PRDM1 | NCBI | 639 | gene/protein | associated with | disease_protein | (disease, associated with, gene/protein) | inflammatory bowel disease (disease) has a dir... |
4 | 37785 | ulcerative colitis (disease) | MONDO | 5101 | disease | 2712 | CASP3 | NCBI | 836 | gene/protein | associated with | disease_protein | (disease, associated with, gene/protein) | ulcerative colitis (disease) (disease) has a d... |
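The edge sentence template can be isolated into a small function, which makes it easier to inspect on a single record. The sketch below mirrors the f-string used above. `describe_edge` is a hypothetical helper, and the sample record is copied from the row with index 1 in the table.

```python
def describe_edge(edge):
    """Render one edge record as a sentence, mirroring the template above."""
    return (
        f"{edge['head_name']} ({edge['head_type']}) has a direct relationship of "
        f"{edge['relation']}:{edge['display_relation']} with "
        f"{edge['tail_name']} ({edge['tail_type']})."
    )

# Values taken from the row with index 1 in the table above
edge = {
    'head_name': 'inflammatory bowel disease', 'head_type': 'disease',
    'relation': 'disease_protein', 'display_relation': 'associated with',
    'tail_name': 'ADCY7', 'tail_type': 'gene/protein',
}
sentence = describe_edge(edge)
print(sentence)
```

Printing a single sentence makes two quirks of the template visible: relations that already end in "with" produce "with with", and node names that already carry a type suffix, such as "ulcerative colitis (disease)", get the type appended a second time, as the enriched_edge column shows.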
Embeddings (using textual embedding as of now)¶
We are going to embed the enriched nodes and edges using the EmbeddingWithOllama class.
For this purpose, we will use the nomic-embed-text model.
# Using nomic-ai/nomic-embed-text-v1.5 model
emb_model = EmbeddingWithOllama(model_name='nomic-embed-text')
Node Embedding¶
We will embed the IBD-related nodes with the Ollama model, processing them in mini-batches of 100 nodes at a time.
# Since there are many node records, we split them into mini-batches
mini_batch_size = 100
node_embeddings = []
for i in tqdm(range(0, primekg_ibd_nodes_df.shape[0], mini_batch_size)):
    outputs = emb_model.embed_documents(primekg_ibd_nodes_df.enriched_node.values.tolist()[i:i+mini_batch_size])
    node_embeddings.extend(outputs)
# node_embeddings
100%|██████████| 35/35 [00:19<00:00, 1.75it/s]
# Check the shape of the node embeddings
len(node_embeddings), len(node_embeddings[0])
(3426, 768)
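The mini-batching loop can be factored into a reusable function that accepts any embedding callable. The sketch below is only an illustration: `embed_in_batches` is a hypothetical helper, and `fake_embed` is a stand-in embedder so the example runs without an Ollama server.

```python
def embed_in_batches(texts, embed_fn, batch_size=100):
    """Call embed_fn on consecutive slices of texts and concatenate the results."""
    embeddings = []
    for start in range(0, len(texts), batch_size):
        embeddings.extend(embed_fn(texts[start:start + batch_size]))
    return embeddings

def fake_embed(batch):
    """Stand-in embedder (an assumption for illustration): one 3-d vector per text."""
    return [[float(len(text)), 0.0, 1.0] for text in batch]

texts = [f"node {i}" for i in range(250)]
vectors = embed_in_batches(texts, fake_embed, batch_size=100)
print(len(vectors), len(vectors[0]))
```

With the real model, the same helper could presumably be called as `embed_in_batches(texts, emb_model.embed_documents)`. At 100 records per batch, the 3426 nodes require 35 batches, which matches the progress bar above.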
# Add them as features to the dataframe
primekg_ibd_nodes_df['x'] = node_embeddings
# Drop and rename several columns
primekg_ibd_nodes_df.drop(columns=['node_source', 'node_id'], inplace=True)
primekg_ibd_nodes_df.rename(columns={'node_index': 'node_id'}, inplace=True)
# Check dataframe of nodes
primekg_ibd_nodes_df.head()
/tmp/ipykernel_64662/3470083233.py:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy primekg_ibd_nodes_df['x'] = node_embeddings /tmp/ipykernel_64662/3470083233.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy primekg_ibd_nodes_df.drop(columns=['node_source', 'node_id'], inplace=True) /tmp/ipykernel_64662/3470083233.py:6: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy primekg_ibd_nodes_df.rename(columns={'node_index': 'node_id'}, inplace=True)
| | node_id | node_name | node_type | enriched_node | x |
---|---|---|---|---|---|
144 | 144 | SMAD3 | gene/protein | SMAD3 belongs to gene/protein category. The SM... | [0.026536005, 0.05420931, -0.17033643, -0.0248... |
179 | 179 | IL10RB | gene/protein | IL10RB belongs to gene/protein category. The p... | [0.024764946, 0.022782002, -0.16956052, -0.033... |
192 | 192 | GNA12 | gene/protein | GNA12 belongs to gene/protein category. Predic... | [0.004795947, 0.04921528, -0.14488313, -0.0492... |
279 | 279 | HNF4A | gene/protein | HNF4A belongs to gene/protein category. The pr... | [0.013905027, 0.032602787, -0.15260702, 0.0074... |
417 | 417 | VCAM1 | gene/protein | VCAM1 belongs to gene/protein category. This g... | [0.047299746, 0.032621186, -0.15677826, -0.021... |
# Duplicate node_id into a new 'node' column and use it as the index
primekg_ibd_nodes_df['node'] = primekg_ibd_nodes_df['node_id']
primekg_ibd_nodes_df.set_index('node', inplace=True)
primekg_ibd_nodes_df.head()
/tmp/ipykernel_64662/1471123717.py:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy primekg_ibd_nodes_df['node'] = primekg_ibd_nodes_df['node_id']
node | node_id | node_name | node_type | enriched_node | x |
---|---|---|---|---|---|
144 | 144 | SMAD3 | gene/protein | SMAD3 belongs to gene/protein category. The SM... | [0.026536005, 0.05420931, -0.17033643, -0.0248... |
179 | 179 | IL10RB | gene/protein | IL10RB belongs to gene/protein category. The p... | [0.024764946, 0.022782002, -0.16956052, -0.033... |
192 | 192 | GNA12 | gene/protein | GNA12 belongs to gene/protein category. Predic... | [0.004795947, 0.04921528, -0.14488313, -0.0492... |
279 | 279 | HNF4A | gene/protein | HNF4A belongs to gene/protein category. The pr... | [0.013905027, 0.032602787, -0.15260702, 0.0074... |
417 | 417 | VCAM1 | gene/protein | VCAM1 belongs to gene/protein category. This g... | [0.047299746, 0.032621186, -0.15677826, -0.021... |
# Save the embedded nodes dataframes to parquet file
primekg_ibd_nodes_df.to_parquet(os.path.join(local_dir, 'primekg_ibd_nodes_embedded.parquet'), compression='gzip', index=False)
Edge Embedding¶
Likewise, we embed the IBD-related edges with the Ollama model, processing them in mini-batches of 100 edges at a time.
# Since there are many edge records, we split them into mini-batches
mini_batch_size = 100
edge_embeddings = []
for i in tqdm(range(0, primekg_ibd_edges_df.shape[0], mini_batch_size)):
    outputs = emb_model.embed_documents(primekg_ibd_edges_df.enriched_edge.values.tolist()[i:i+mini_batch_size])
    edge_embeddings.extend(outputs)
# edge_embeddings
100%|██████████| 128/128 [00:48<00:00, 2.64it/s]
# Check the shape of the edge embeddings
len(edge_embeddings), len(edge_embeddings[0])
(12752, 768)
# Add them as features to the dataframe
primekg_ibd_edges_df['edge_attr'] = edge_embeddings
# Drop and rename several columns
primekg_ibd_edges_df.drop(columns=['head_source', 'head_id', 'head_type', 'tail_source', 'tail_id', 'tail_type', 'display_relation', 'relation'], inplace=True)
primekg_ibd_edges_df.rename(columns={'head_index': 'head_id', 'tail_index': 'tail_id'}, inplace=True)
# Check dataframe of edges
primekg_ibd_edges_df.head()
| | head_id | head_name | tail_id | tail_name | edge_type | enriched_edge | edge_attr |
---|---|---|---|---|---|---|---|
0 | 37785 | ulcerative colitis (disease) | 7359 | ADCY7 | (disease, associated with, gene/protein) | ulcerative colitis (disease) (disease) has a d... | [0.061832674, 0.040013667, -0.15366873, -0.008... |
1 | 28158 | inflammatory bowel disease | 7359 | ADCY7 | (disease, associated with, gene/protein) | inflammatory bowel disease (disease) has a dir... | [0.050393466, 0.030410834, -0.15008788, -0.013... |
2 | 37785 | ulcerative colitis (disease) | 2874 | PRDM1 | (disease, associated with, gene/protein) | ulcerative colitis (disease) (disease) has a d... | [0.0401622, 0.028982995, -0.15433805, 0.006565... |
3 | 28158 | inflammatory bowel disease | 2874 | PRDM1 | (disease, associated with, gene/protein) | inflammatory bowel disease (disease) has a dir... | [0.02781422, 0.01603875, -0.14870141, 0.004470... |
4 | 37785 | ulcerative colitis (disease) | 2712 | CASP3 | (disease, associated with, gene/protein) | ulcerative colitis (disease) (disease) has a d... | [0.07853663, 0.050751355, -0.1470567, -0.01237... |
# Save the embedded edges dataframe to a parquet file
primekg_ibd_edges_df.to_parquet(os.path.join(local_dir, 'primekg_ibd_edges_embedded.parquet'), compression='gzip', index=False)
Knowledge Graph Construction¶
In this section, we convert our dataframes to a networkx DiGraph object.
# Modify the node dataframe
primekg_ibd_nodes_df["node"] = primekg_ibd_nodes_df.apply(lambda x: f"{x.node_name}_({x.node_id})", axis=1)
primekg_ibd_nodes_df["node_id"] = primekg_ibd_nodes_df.apply(lambda x: f"{x.node_name}_({x.node_id})", axis=1)
primekg_ibd_nodes_df.set_index('node', inplace=True)
primekg_ibd_nodes_df.head()
/tmp/ipykernel_64662/4233198491.py:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy primekg_ibd_nodes_df["node"] = primekg_ibd_nodes_df.apply(lambda x: f"{x.node_name}_({x.node_id})", axis=1) /tmp/ipykernel_64662/4233198491.py:3: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy primekg_ibd_nodes_df["node_id"] = primekg_ibd_nodes_df.apply(lambda x: f"{x.node_name}_({x.node_id})", axis=1)
node | node_id | node_name | node_type | enriched_node | x |
---|---|---|---|---|---|
SMAD3_(144) | SMAD3_(144) | SMAD3 | gene/protein | SMAD3 belongs to gene/protein category. The SM... | [0.026536005, 0.05420931, -0.17033643, -0.0248... |
IL10RB_(179) | IL10RB_(179) | IL10RB | gene/protein | IL10RB belongs to gene/protein category. The p... | [0.024764946, 0.022782002, -0.16956052, -0.033... |
GNA12_(192) | GNA12_(192) | GNA12 | gene/protein | GNA12 belongs to gene/protein category. Predic... | [0.004795947, 0.04921528, -0.14488313, -0.0492... |
HNF4A_(279) | HNF4A_(279) | HNF4A | gene/protein | HNF4A belongs to gene/protein category. The pr... | [0.013905027, 0.032602787, -0.15260702, 0.0074... |
VCAM1_(417) | VCAM1_(417) | VCAM1 | gene/protein | VCAM1 belongs to gene/protein category. This g... | [0.047299746, 0.032621186, -0.15677826, -0.021... |
# Modify the edge dataframe
primekg_ibd_edges_df["head_id"] = primekg_ibd_edges_df.apply(lambda x: f"{x.head_name}_({x.head_id})", axis=1)
primekg_ibd_edges_df["tail_id"] = primekg_ibd_edges_df.apply(lambda x: f"{x.tail_name}_({x.tail_id})", axis=1)
primekg_ibd_edges_df.reset_index(drop=True, inplace=True)
primekg_ibd_edges_df.head()
| | head_id | head_name | tail_id | tail_name | edge_type | enriched_edge | edge_attr |
---|---|---|---|---|---|---|---|
0 | ulcerative colitis (disease)_(37785) | ulcerative colitis (disease) | ADCY7_(7359) | ADCY7 | (disease, associated with, gene/protein) | ulcerative colitis (disease) (disease) has a d... | [0.061832674, 0.040013667, -0.15366873, -0.008... |
1 | inflammatory bowel disease_(28158) | inflammatory bowel disease | ADCY7_(7359) | ADCY7 | (disease, associated with, gene/protein) | inflammatory bowel disease (disease) has a dir... | [0.050393466, 0.030410834, -0.15008788, -0.013... |
2 | ulcerative colitis (disease)_(37785) | ulcerative colitis (disease) | PRDM1_(2874) | PRDM1 | (disease, associated with, gene/protein) | ulcerative colitis (disease) (disease) has a d... | [0.0401622, 0.028982995, -0.15433805, 0.006565... |
3 | inflammatory bowel disease_(28158) | inflammatory bowel disease | PRDM1_(2874) | PRDM1 | (disease, associated with, gene/protein) | inflammatory bowel disease (disease) has a dir... | [0.02781422, 0.01603875, -0.14870141, 0.004470... |
4 | ulcerative colitis (disease)_(37785) | ulcerative colitis (disease) | CASP3_(2712) | CASP3 | (disease, associated with, gene/protein) | ulcerative colitis (disease) (disease) has a d... | [0.07853663, 0.050751355, -0.1470567, -0.01237... |
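The `name_(index)` labels built above serve as readable graph keys, and the numeric PrimeKG index can still be recovered from them when needed. A small sketch, with `make_label` and `parse_label` as hypothetical helpers that are not part of the library:

```python
import re

# Greedy name part, followed by "_(digits)" anchored at the end of the label
_LABEL_RE = re.compile(r"^(?P<name>.*)_\((?P<index>\d+)\)$")

def make_label(name, index):
    """Build a label in the same 'name_(index)' format used above."""
    return f"{name}_({index})"

def parse_label(label):
    """Split a 'name_(index)' label back into (name, index)."""
    match = _LABEL_RE.match(label)
    if match is None:
        raise ValueError(f"Not a name_(index) label: {label!r}")
    return match.group('name'), int(match.group('index'))

label = make_label('ulcerative colitis (disease)', 37785)
print(label)
print(parse_label(label))
```

Anchoring the index pattern at the end of the string keeps names that themselves contain parentheses, such as "ulcerative colitis (disease)", intact.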
# Convert dataframes to knowledge graph as networkx object
kg = nx.DiGraph()
for i, row in primekg_ibd_nodes_df.iterrows():
    kg.add_node(row['node_id'], **row.to_dict())
for i, row in primekg_ibd_edges_df.iterrows():
    kg.add_edge(row['head_id'], row['tail_id'], key=i, **row.to_dict())
# Save graph object
local_dir = '../../../aiagents4pharma/talk2knowledgegraphs/tests/files/'
with open(os.path.join(local_dir, 'primekg_ibd_nx_graph.pkl'), 'wb') as f:
    pickle.dump(kg, f)
# Load graph object
# with open(os.path.join(local_dir, 'primekg_ibd_nx_graph.pkl'), 'rb') as f:
#     kg = pickle.load(f)
print("#Nodes", kg.number_of_nodes())
print("#Edges", kg.number_of_edges())
#Nodes 3426
#Edges 12752
In addition, we can convert the networkx graph to a PyG Data object for further processing (e.g., subgraph extraction).
# Convert networkx graph to PyG data object
pyg_graph = from_networkx(kg)
# Save graph object
with open(os.path.join(local_dir, 'primekg_ibd_pyg_graph.pkl'), 'wb') as f:
    pickle.dump(pyg_graph, f)
# Load graph object
# with open(os.path.join(local_dir, 'primekg_ibd_pyg_graph.pkl'), 'rb') as f:
#     pyg_graph = pickle.load(f)
Lastly, we prepare a textualized graph of nodes and edges, which can be used in applications such as retrieval-augmented generation (RAG).
# Prepare nodes
nodes_df = pd.DataFrame({
    'node_id': list(pyg_graph.node_id),
    'node_attr': list(pyg_graph.enriched_node),
})
nodes_df
| | node_id | node_attr |
---|---|---|
0 | SMAD3_(144) | SMAD3 belongs to gene/protein category. The SM... |
1 | IL10RB_(179) | IL10RB belongs to gene/protein category. The p... |
2 | GNA12_(192) | GNA12 belongs to gene/protein category. Predic... |
3 | HNF4A_(279) | HNF4A belongs to gene/protein category. The pr... |
4 | VCAM1_(417) | VCAM1 belongs to gene/protein category. This g... |
... | ... | ... |
3421 | IRAK2 mediated activation of TAK1 complex upon... | IRAK2 mediated activation of TAK1 complex upon... |
3422 | TRAF6 mediated IRF7 activation in TLR7/8 or 9 ... | TRAF6 mediated IRF7 activation in TLR7/8 or 9 ... |
3423 | Antigen processing: Ubiquitination & Proteasom... | Antigen processing: Ubiquitination & Proteasom... |
3424 | Antigen Presentation: Folding, assembly and pe... | Antigen Presentation: Folding, assembly and pe... |
3425 | Kinesins_(129367) | Kinesins belongs to pathway category. This pat... |
3426 rows × 2 columns
# Prepare edges
edges_df = pd.DataFrame({
    'head_id': list(pyg_graph.head_id),
    'edge_type': list(pyg_graph.edge_type),
    'tail_id': list(pyg_graph.tail_id),
})
edges_df
| | head_id | edge_type | tail_id |
---|---|---|---|
0 | SMAD3_(144) | (gene/protein, associated with, disease) | Crohn disease_(37784) |
1 | SMAD3_(144) | (gene/protein, associated with, disease) | inflammatory bowel disease_(28158) |
2 | SMAD3_(144) | (gene/protein, associated with, disease) | Crohn's colitis_(83770) |
3 | SMAD3_(144) | (gene/protein, associated with, disease) | Crohn ileitis and jejunitis_(35814) |
4 | SMAD3_(144) | (gene/protein, interacts with, pathway) | Signaling by NODAL_(62373) |
... | ... | ... | ... |
12747 | IRAK2 mediated activation of TAK1 complex upon... | (pathway, interacts with, gene/protein) | TLR4_(3259) |
12748 | TRAF6 mediated IRF7 activation in TLR7/8 or 9 ... | (pathway, interacts with, gene/protein) | TLR9_(10113) |
12749 | Antigen processing: Ubiquitination & Proteasom... | (pathway, interacts with, gene/protein) | HERC2_(1777) |
12750 | Antigen Presentation: Folding, assembly and pe... | (pathway, interacts with, gene/protein) | ERAP2_(12763) |
12751 | Kinesins_(129367) | (pathway, interacts with, gene/protein) | KIF21B_(8564) |
12752 rows × 3 columns
# Save the textualized graph (nodes and edges dataframes)
with open(os.path.join(local_dir, 'primekg_ibd_text_graph.pkl'), "wb") as f:
    pickle.dump({"nodes": nodes_df, "edges": edges_df}, f)
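As one illustration of how such a textualized graph might be consumed, the sketch below flattens node and edge records into a plain-text context block that could be placed in a prompt. The function name `textualize` and the output layout are assumptions, not something the library prescribes. The sample records follow the tables above, with the node descriptions shortened by hand.

```python
def textualize(nodes, edges):
    """Flatten node and edge records into a plain-text context block."""
    node_lines = [f"{n['node_id']}: {n['node_attr']}" for n in nodes]
    # edge_type is a (head type, relation, tail type) triple; keep the relation only
    edge_lines = [
        f"{e['head_id']} -[{e['edge_type'][1]}]-> {e['tail_id']}" for e in edges
    ]
    return "Nodes:\n" + "\n".join(node_lines) + "\n\nEdges:\n" + "\n".join(edge_lines)

# Hand-written sample records (descriptions shortened for illustration)
nodes = [
    {'node_id': 'SMAD3_(144)', 'node_attr': 'SMAD3 belongs to gene/protein category.'},
    {'node_id': 'Crohn disease_(37784)', 'node_attr': 'Crohn disease belongs to disease category.'},
]
edges = [
    {'head_id': 'SMAD3_(144)',
     'edge_type': ('gene/protein', 'associated with', 'disease'),
     'tail_id': 'Crohn disease_(37784)'},
]
context = textualize(nodes, edges)
print(context)
```

With the dataframes built above, the records could be obtained through `nodes_df.to_dict('records')` and `edges_df.to_dict('records')`, assuming each `edge_type` value is a three-element tuple as displayed.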