PrimeKG Subgraph Construction¶

In this tutorial, we will showcase how to construct a subgraph from PrimeKG and prepare the graph formats needed for further analysis.

In particular, we will slice a subgraph from PrimeKG related to inflammatory bowel disease (IBD).

The subgraph will contain all nodes and edges that are connected to IBD-related disease nodes, including the following relationships:

Disease-Protein Relationship
Disease-Disease Relationship (skipped as of now)
Protein-Protein Relationship (skipped as of now)
Drug-Protein Relationship
Pathway-Protein Relationship
Pathway-Pathway Relationship (skipped as of now)
Bioprocess-Protein Relationship
Molecular Function-Protein Relationship
Cellular Component-Protein Relationship

In addition, to enrich the nodes and edges, we will perform the following tasks:

Textual enrichment (only this task is implemented as of now)
Multi-modal enrichment (to be added)

First of all, we need to import necessary libraries as follows:

In [ ]:

# Import necessary libraries
import os
import numpy as np
import pandas as pd
import networkx as nx
import pickle
from tqdm import tqdm
from torch_geometric.utils import from_networkx
import sys
sys.path.append('../../..')
from aiagents4pharma.talk2knowledgegraphs.datasets.primekg import PrimeKG
from aiagents4pharma.talk2knowledgegraphs.datasets.starkqa_primekg import StarkQAPrimeKG
from aiagents4pharma.talk2knowledgegraphs.utils.embeddings.ollama import EmbeddingWithOllama
from aiagents4pharma.talk2knowledgegraphs.utils import kg_utils

# Set the logging level for httpx to WARNING to suppress INFO messages
import logging
logging.getLogger("httpx").setLevel(logging.WARNING)

PrimeKG¶

We utilize the PrimeKG class from the aiagents4pharma/talk2knowledgegraphs library.

The PrimeKG class is initialized with the path to a local directory, which is where the PrimeKG dataset is stored and from which it is loaded.

In [3]:

# Define primekg data by providing a local directory where the data is stored
primekg_data = PrimeKG(local_dir="../../../../data/primekg/")

# Invoke a method to load the data
primekg_data.load_data()

# Get primekg_nodes and primekg_edges
primekg_nodes = primekg_data.get_nodes()
primekg_edges = primekg_data.get_edges()

Loading nodes of PrimeKG dataset ...
../../../../data/primekg/primekg_nodes.tsv.gz already exists. Loading the data from the local directory.
Loading edges of PrimeKG dataset ...
../../../../data/primekg/primekg_edges.tsv.gz already exists. Loading the data from the local directory.

As a first step, we will filter primekg_nodes, keeping the disease nodes whose names contain any of the following terms:

inflammatory bowel disease
crohn
ulcerative colitis

As of now, this basic query is used to filter the data. However, this can be replaced with a more complex query that can capture more nodes related to IBD.
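As one possible refinement of the basic query described above, the three terms can be folded into a single case-insensitive regular expression. The sketch below runs on a small hypothetical table (`toy_nodes` and its rows are made up for illustration and are not PrimeKG data); only the column names `node_name` and `node_type` are taken from the tutorial.

```python
import pandas as pd

# Hypothetical stand-in for primekg_nodes (illustrative rows only)
toy_nodes = pd.DataFrame({
    "node_index": [0, 1, 2, 3],
    "node_name": [
        "Crohn disease",
        "ulcerative colitis (disease)",
        "asthma",
        "CROHN-like gene",
    ],
    "node_type": ["disease", "disease", "disease", "gene/protein"],
})

# One pattern covering all three search terms; case=False removes the
# need for a separate lower-cased helper column
ibd_pattern = r"inflammatory bowel disease|crohn|ulcerative colitis"

is_disease = toy_nodes["node_type"] == "disease"
name_matches = toy_nodes["node_name"].str.contains(ibd_pattern, case=False, regex=True)

# Keep only disease nodes whose name matches the pattern
toy_ibd_nodes = toy_nodes[is_disease & name_matches]
print(toy_ibd_nodes["node_name"].tolist())
```

Extending the filter then only requires adding another alternative to `ibd_pattern`.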

In [4]:

# Query for nodes related to IBD
query_str = 'node_name_lower.str.contains("inflammatory bowel disease")'
query_str += ' or node_name_lower.str.contains("crohn")'
query_str += ' or node_name_lower.str.contains("ulcerative colitis")'

# Get the nodes related to IBD
ibd_nodes_df = primekg_nodes.copy()
ibd_nodes_df["node_name_lower"] = primekg_nodes.node_name.apply(lambda x: x.lower())
ibd_nodes_df = ibd_nodes_df[ibd_nodes_df.node_type == "disease"].query(query_str, engine='python')
ibd_nodes_df.drop(columns=["node_name_lower"], inplace=True)
ibd_nodes_df

Out[4]:

	node_index	node_name	node_source	node_id	node_type
27269	27269	IL21-related infantile inflammatory bowel disease	MONDO	14338	disease
28158	28158	inflammatory bowel disease	MONDO_grouped	9960_12845_33643_11471_12831_12875_12941_13153...	disease
29293	29293	inflammatory bowel disease, immunodeficiency, ...	MONDO	32601	disease
35814	35814	Crohn ileitis and jejunitis	MONDO_grouped	709_21207	disease
35815	35815	small bowel Crohn disease	MONDO	5539	disease
37784	37784	Crohn disease	MONDO_grouped	5011_5535	disease
37785	37785	ulcerative colitis (disease)	MONDO	5101	disease
39013	39013	immune dysregulation-inflammatory bowel diseas...	MONDO	16542	disease
39787	39787	immune dysregulation with inflammatory bowel d...	MONDO	33967	disease
83770	83770	Crohn's colitis	MONDO	5532	disease
95279	95279	Crohn jejunoileitis	MONDO	708	disease
95280	95280	gastroduodenal Crohn disease	MONDO	710	disease
97088	97088	perianal Crohn disease	MONDO	5537	disease
99325	99325	Crohn disease of the esophagus	MONDO	22901	disease
99680	99680	immune dysregulation-inflammatory bowel diseas...	MONDO	33968	disease
99681	99681	inflammatory bowel disease-recurrent sinopulmo...	MONDO	33969	disease
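In the output above, the row labels coincide with the node_index column, and the following cells rely on this when they pass `ibd_nodes_df.index.values` to `isin`. A minimal sketch of how that assumption could be checked is shown below. The function `index_matches_node_index` and the `toy_nodes` table are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical node table whose row labels equal its node_index column
toy_nodes = pd.DataFrame(
    {"node_index": [10, 11, 12], "node_name": ["a", "b", "c"]},
    index=[10, 11, 12],
)

def index_matches_node_index(nodes):
    """Return True when every row label equals that row's node_index."""
    return bool((nodes.index.to_numpy() == nodes["node_index"].to_numpy()).all())

# Labels equal node_index here
aligned = index_matches_node_index(toy_nodes)

# After a reset the labels become 0, 1, 2 and no longer match
misaligned = index_matches_node_index(toy_nodes.reset_index(drop=True))

# Selecting the column explicitly does not depend on the row labels
selected_ids = toy_nodes.loc[
    toy_nodes["node_name"].isin(["a", "c"]), "node_index"
].to_numpy()
print(aligned, misaligned, selected_ids)
```

If the check fails for a given table, using the node_index column in place of the row labels avoids silently matching the wrong nodes.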

Disease-Protein Relationship¶

Based on the nodes related to IBD, we can further capture the records containing the relationships of disease-gene/protein nodes.

In [5]:

# IBD disease_protein edges
ibd_disease_protein_edges_df = pd.concat([primekg_edges[(primekg_edges.head_index.isin(ibd_nodes_df.index.values)) & 
                                                        (primekg_edges.tail_type == 'gene/protein')],
                                          primekg_edges[(primekg_edges.tail_index.isin(ibd_nodes_df.index.values)) & 
                                                        (primekg_edges.head_type == 'gene/protein')]])

# Check dataframe
ibd_disease_protein_edges_df

Out[5]:

	head_index	head_name	head_source	head_id	head_type	tail_index	tail_name	tail_source	tail_id	tail_type	display_relation	relation
5988787	37785	ulcerative colitis (disease)	MONDO	5101	disease	7359	ADCY7	NCBI	113	gene/protein	associated with	disease_protein
5988788	28158	inflammatory bowel disease	MONDO_grouped	9960_12845_33643_11471_12831_12875_12941_13153...	disease	7359	ADCY7	NCBI	113	gene/protein	associated with	disease_protein
5988789	37785	ulcerative colitis (disease)	MONDO	5101	disease	2874	PRDM1	NCBI	639	gene/protein	associated with	disease_protein
5988790	28158	inflammatory bowel disease	MONDO_grouped	9960_12845_33643_11471_12831_12875_12941_13153...	disease	2874	PRDM1	NCBI	639	gene/protein	associated with	disease_protein
5988791	37785	ulcerative colitis (disease)	MONDO	5101	disease	2712	CASP3	NCBI	836	gene/protein	associated with	disease_protein
...	...	...	...	...	...	...	...	...	...	...	...	...
3304471	34780	IRGM	NCBI	345611	gene/protein	35814	Crohn ileitis and jejunitis	MONDO_grouped	709_21207	disease	associated with	disease_protein
3310277	5022	ITGAM	NCBI	3684	gene/protein	35814	Crohn ileitis and jejunitis	MONDO_grouped	709_21207	disease	associated with	disease_protein
3313160	2889	TGFB1	NCBI	7040	gene/protein	29293	inflammatory bowel disease, immunodeficiency, ...	MONDO	32601	disease	associated with	disease_protein
3314800	9104	INAVA	NCBI	55765	gene/protein	28158	inflammatory bowel disease	MONDO_grouped	9960_12845_33643_11471_12831_12875_12941_13153...	disease	associated with	disease_protein
3314949	34967	IL21	NCBI	59067	gene/protein	27269	IL21-related infantile inflammatory bowel disease	MONDO	14338	disease	associated with	disease_protein

620 rows × 12 columns
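The two-sided filter used above (disease as head with a protein tail, or disease as tail with a protein head) can be wrapped in a small helper. This is only a sketch on made-up data: `edges_touching` and `toy_edges` are hypothetical names, while the column names follow the edge table shown above.

```python
import pandas as pd

def edges_touching(edges, node_index, other_type):
    """Edges with one endpoint in node_index and the other endpoint of type other_type."""
    head_hit = edges["head_index"].isin(node_index) & (edges["tail_type"] == other_type)
    tail_hit = edges["tail_index"].isin(node_index) & (edges["head_type"] == other_type)
    return pd.concat([edges[head_hit], edges[tail_hit]])

# Hypothetical edge table: node 1 is a disease, nodes 100/200/300 are proteins
toy_edges = pd.DataFrame({
    "head_index": [1, 100, 1, 200],
    "head_type": ["disease", "gene/protein", "disease", "gene/protein"],
    "tail_index": [100, 1, 2, 300],
    "tail_type": ["gene/protein", "disease", "disease", "gene/protein"],
    "relation": ["disease_protein", "disease_protein",
                 "disease_disease", "protein_protein"],
})

# Keeps the disease->protein and protein->disease rows for disease node 1
toy_disease_protein = edges_touching(toy_edges, [1], "gene/protein")
print(toy_disease_protein)
```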

In [6]:

# Get unique protein index
ibd_protein_index = np.unique(np.concatenate([ibd_disease_protein_edges_df[ibd_disease_protein_edges_df.head_type == 'gene/protein'].head_index.unique(),
                                              ibd_disease_protein_edges_df[ibd_disease_protein_edges_df.tail_type == 'gene/protein'].tail_index.unique()]))
ibd_protein_index

Out[6]:

array([  144,   179,   192,   279,   417,   625,   657,   729,   772,
         989,  1004,  1122,  1299,  1480,  1567,  1618,  1654,  1777,
        1990,  2012,  2057,  2078,  2111,  2139,  2329,  2384,  2543,
        2643,  2712,  2749,  2874,  2889,  2978,  2983,  3064,  3088,
        3233,  3259,  3333,  3414,  3460,  3469,  3474,  3484,  3495,
        3578,  3646,  4152,  4162,  4731,  4818,  4968,  4997,  5022,
        5195,  5385,  5720,  5805,  5915,  6168,  6175,  6229,  6428,
        6661,  7059,  7083,  7359,  7384,  7899,  7958,  8030,  8564,
        9104,  9454,  9763, 10113, 10191, 10919, 11103, 11134, 11199,
       11523, 11588, 12305, 12663, 12740, 12763, 12816, 13014, 13365,
       21972, 22105, 34623, 34776, 34777, 34778, 34779, 34780, 34781,
       34814, 34887, 34967, 35156])
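The same "collect the unique endpoints of one node type" step is needed again later for pathways, so it may be convenient to express it once. The helper `endpoint_index` below is a hypothetical name, and the data is a toy example rather than PrimeKG.

```python
import numpy as np
import pandas as pd

def endpoint_index(edges, node_type):
    """Sorted unique node indices of the given type, taken from both edge ends."""
    heads = edges.loc[edges["head_type"] == node_type, "head_index"]
    tails = edges.loc[edges["tail_type"] == node_type, "tail_index"]
    return np.unique(np.concatenate([heads.to_numpy(), tails.to_numpy()]))

# Hypothetical edges: disease node 1, protein nodes 100 and 101
toy_edges = pd.DataFrame({
    "head_index": [1, 100, 1],
    "head_type": ["disease", "gene/protein", "disease"],
    "tail_index": [100, 1, 101],
    "tail_type": ["gene/protein", "disease", "gene/protein"],
})

toy_protein_index = endpoint_index(toy_edges, "gene/protein")
print(toy_protein_index)
```

`np.unique` both deduplicates and sorts, so a protein that appears as head in one row and as tail in another is counted once.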

Disease-Disease Relationship¶

Here, we could collect the records containing the relationships of disease-disease nodes. This relationship is skipped for now, so the cell below is left commented out.

In [7]:

# # IBD disease_disease edges 
# ibd_disease_disease_edges_df = pd.concat([primekg_edges[(primekg_edges.head_index.isin(ibd_nodes_df.index.values)) & 
#                                                         (primekg_edges.tail_type == 'disease')],
#                                           primekg_edges[(primekg_edges.tail_index.isin(ibd_nodes_df.index.values)) & 
#                                                         (primekg_edges.head_type == 'disease')]])

# # Check dataframe
# ibd_disease_disease_edges_df

Protein-Protein Relationship¶

We could also collect the records containing the relationships of gene/protein-gene/protein nodes. This relationship is skipped for now as well, so the cell below is left commented out.

In [8]:

# # IBD protein_protein edges 
# ibd_protein_protein_edges_df = pd.concat([primekg_edges[(primekg_edges.head_index.isin(ibd_protein_index)) & 
#                                                         (primekg_edges.tail_type == 'gene/protein')],
#                                           primekg_edges[(primekg_edges.tail_index.isin(ibd_protein_index)) & 
#                                                         (primekg_edges.head_type == 'gene/protein')]])

# # Check dataframe
# ibd_protein_protein_edges_df

Drug-Protein Relationship¶

Next, we will get the records containing the relationships of drug-gene/protein nodes.

In [9]:

# IBD drug_protein edges
ibd_drug_protein_edges_df = pd.concat([primekg_edges[(primekg_edges.head_type == 'drug') & 
                                                     (primekg_edges.tail_type == 'gene/protein') & 
                                                     (primekg_edges.tail_index.isin(ibd_protein_index))], 
                                       primekg_edges[(primekg_edges.tail_type == 'drug') & 
                                                     (primekg_edges.head_type == 'gene/protein') & 
                                                     (primekg_edges.head_index.isin(ibd_protein_index))]])

# Check dataframe
ibd_drug_protein_edges_df

Out[9]:

	head_index	head_name	head_source	head_id	head_type	tail_index	tail_name	tail_source	tail_id	tail_type	display_relation	relation
321759	14118	Rose bengal	DrugBank	DB11182	drug	3233	LTF	NCBI	4057	gene/protein	carrier	drug_protein
321763	14038	Fluticasone furoate	DrugBank	DB08906	drug	4152	ABCB1	NCBI	5243	gene/protein	carrier	drug_protein
321764	14555	Technetium Tc-99m tetrofosmin	DrugBank	DB09160	drug	4152	ABCB1	NCBI	5243	gene/protein	carrier	drug_protein
321765	14040	Fluticasone	DrugBank	DB13867	drug	4152	ABCB1	NCBI	5243	gene/protein	carrier	drug_protein
322373	14060	Levothyroxine	DrugBank	DB00451	drug	4152	ABCB1	NCBI	5243	gene/protein	enzyme	drug_protein
...	...	...	...	...	...	...	...	...	...	...	...	...
5731639	4152	ABCB1	NCBI	5243	gene/protein	14498	Risdiplam	DrugBank	DB15305	drug	transporter	drug_protein
5731640	4152	ABCB1	NCBI	5243	gene/protein	14908	Ubrogepant	DrugBank	DB15328	drug	transporter	drug_protein
5731641	4152	ABCB1	NCBI	5243	gene/protein	14499	Elexacaftor	DrugBank	DB15444	drug	transporter	drug_protein
5731642	4152	ABCB1	NCBI	5243	gene/protein	14050	Prednisolone acetate	DrugBank	DB15566	drug	transporter	drug_protein
5731643	4152	ABCB1	NCBI	5243	gene/protein	15752	Selpercatinib	DrugBank	DB15685	drug	transporter	drug_protein

2030 rows × 12 columns
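The drug-protein filter above has the same shape as the pathway, biological process, molecular function and cellular component filters in the following sections: only the non-protein node type changes. A hedged sketch of a shared helper is given below. `protein_edges_with` and `toy_edges` are hypothetical, and the node type strings are the ones used in this tutorial.

```python
import pandas as pd

def protein_edges_with(edges, protein_index, other_type):
    """Edges linking a node of other_type to one of the given proteins, in either direction."""
    forward = (
        (edges["head_type"] == other_type)
        & (edges["tail_type"] == "gene/protein")
        & edges["tail_index"].isin(protein_index)
    )
    backward = (
        (edges["tail_type"] == other_type)
        & (edges["head_type"] == "gene/protein")
        & edges["head_index"].isin(protein_index)
    )
    return pd.concat([edges[forward], edges[backward]])

# Hypothetical edges: protein 100 is of interest, protein 999 is not
toy_edges = pd.DataFrame({
    "head_index": [500, 100, 600, 501, 100],
    "head_type": ["drug", "gene/protein", "pathway", "drug", "gene/protein"],
    "tail_index": [100, 500, 100, 999, 700],
    "tail_type": ["gene/protein", "drug", "gene/protein",
                  "gene/protein", "biological_process"],
})

other_types = ["drug", "pathway", "biological_process",
               "molecular_function", "cellular_component"]
toy_edges_by_type = {
    t: protein_edges_with(toy_edges, [100], t) for t in other_types
}
print({t: len(df) for t, df in toy_edges_by_type.items()})
```

Keeping the result as a dictionary keyed by node type makes it easy to inspect each relationship separately before merging them.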

Pathway-Protein Relationship¶

Next, we will get the records containing the relationships of pathway-gene/protein nodes, again restricted to the IBD-related proteins.

In [10]:

# IBD pathway_protein edges 
ibd_pathway_protein_edges_df = pd.concat([primekg_edges[(primekg_edges.head_type == 'pathway') & 
                                                        (primekg_edges.tail_type == 'gene/protein') & 
                                                        (primekg_edges.tail_index.isin(ibd_protein_index))],
                                          primekg_edges[(primekg_edges.tail_type == 'pathway') & 
                                                        (primekg_edges.head_type == 'gene/protein') & 
                                                        (primekg_edges.head_index.isin(ibd_protein_index))]])

# Check dataframe
ibd_pathway_protein_edges_df

Out[10]:

	head_index	head_name	head_source	head_id	head_type	tail_index	tail_name	tail_source	tail_id	tail_type	display_relation	relation
6505784	62703	Adherens junctions interactions	REACTOME	R-HSA-418990	pathway	8030	CDH3	NCBI	1001	gene/protein	interacts with	pathway_protein
6506102	128079	Regulation of actin dynamics for phagocytic cu...	REACTOME	R-HSA-2029482	pathway	2139	ARPC2	NCBI	10109	gene/protein	interacts with	pathway_protein
6506103	128183	EPHB-mediated forward signaling	REACTOME	R-HSA-3928662	pathway	2139	ARPC2	NCBI	10109	gene/protein	interacts with	pathway_protein
6506104	128022	RHO GTPases Activate WASPs and WAVEs	REACTOME	R-HSA-5663213	pathway	2139	ARPC2	NCBI	10109	gene/protein	interacts with	pathway_protein
6506105	62931	Clathrin-mediated endocytosis	REACTOME	R-HSA-8856828	pathway	2139	ARPC2	NCBI	10109	gene/protein	interacts with	pathway_protein
...	...	...	...	...	...	...	...	...	...	...	...	...
3834665	2543	CDH1	NCBI	999	gene/protein	127731	Integrin cell surface interactions	REACTOME	R-HSA-216083	pathway	interacts with	pathway_protein
3834666	2543	CDH1	NCBI	999	gene/protein	127617	Apoptotic cleavage of cell adhesion proteins	REACTOME	R-HSA-351906	pathway	interacts with	pathway_protein
3834667	2543	CDH1	NCBI	999	gene/protein	62703	Adherens junctions interactions	REACTOME	R-HSA-418990	pathway	interacts with	pathway_protein
3834668	2543	CDH1	NCBI	999	gene/protein	128018	RHO GTPases activate IQGAPs	REACTOME	R-HSA-5626467	pathway	interacts with	pathway_protein
3834669	2543	CDH1	NCBI	999	gene/protein	129039	InlA-mediated entry of Listeria monocytogenes ...	REACTOME	R-HSA-8876493	pathway	interacts with	pathway_protein

1030 rows × 12 columns
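The output above contains rows in both directions (a pathway as head with a protein as tail, and the reverse). If a later analysis treats the graph as undirected, such pairs may need to be collapsed. The following is a minimal sketch of one way to do that on toy data; the column names `node_lo` and `node_hi` are introduced here for illustration only, and whether deduplication is desirable depends on the downstream graph format.

```python
import numpy as np
import pandas as pd

# Hypothetical edges: rows 0 and 1 describe the same pair in opposite directions
toy_edges = pd.DataFrame({
    "head_index": [600, 100, 601],
    "tail_index": [100, 600, 100],
    "relation": ["pathway_protein"] * 3,
})

# Order each pair so that (600, 100) and (100, 600) share the same key
node_lo = np.minimum(toy_edges["head_index"], toy_edges["tail_index"])
node_hi = np.maximum(toy_edges["head_index"], toy_edges["tail_index"])

# Keep the first row seen for every unordered pair and relation
toy_undirected = (
    toy_edges.assign(node_lo=node_lo, node_hi=node_hi)
    .drop_duplicates(subset=["node_lo", "node_hi", "relation"])
)
print(toy_undirected)
```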

In [11]:

# Get unique pathway index
ibd_pathway_index = np.unique(np.concatenate([ibd_pathway_protein_edges_df[ibd_pathway_protein_edges_df.head_type == 'pathway'].head_index.unique(),
                                              ibd_pathway_protein_edges_df[ibd_pathway_protein_edges_df.tail_type == 'pathway'].tail_index.unique()]))
ibd_pathway_index

Out[11]:

array([ 62341,  62347,  62348,  62373,  62376,  62394,  62400,  62401,
        62404,  62405,  62414,  62448,  62449,  62462,  62465,  62467,
        62469,  62472,  62476,  62477,  62483,  62543,  62571,  62573,
        62575,  62583,  62588,  62596,  62603,  62606,  62628,  62644,
        62651,  62655,  62657,  62672,  62675,  62691,  62692,  62697,
        62702,  62703,  62711,  62717,  62733,  62734,  62768,  62770,
        62805,  62807,  62836,  62865,  62916,  62925,  62931,  62968,
        62976,  62987,  62996,  63041,  63064,  63071,  63076, 127601,
       127615, 127616, 127617, 127619, 127620, 127624, 127628, 127629,
       127639, 127640, 127649, 127659, 127662, 127682, 127683, 127688,
       127691, 127693, 127694, 127695, 127696, 127726, 127727, 127728,
       127729, 127730, 127731, 127732, 127733, 127791, 127797, 127810,
       127814, 127815, 127833, 127835, 127856, 127858, 127866, 127867,
       127869, 127886, 127891, 127908, 127917, 127918, 127921, 127928,
       127958, 127960, 127971, 127977, 127999, 128001, 128002, 128003,
       128008, 128010, 128015, 128018, 128022, 128025, 128034, 128058,
       128065, 128071, 128072, 128073, 128074, 128078, 128079, 128080,
       128086, 128111, 128113, 128116, 128117, 128129, 128137, 128138,
       128139, 128158, 128165, 128170, 128176, 128183, 128186, 128191,
       128198, 128199, 128204, 128208, 128209, 128224, 128227, 128242,
       128243, 128244, 128253, 128254, 128270, 128271, 128272, 128273,
       128299, 128302, 128341, 128348, 128349, 128350, 128351, 128353,
       128360, 128378, 128381, 128393, 128395, 128396, 128399, 128430,
       128440, 128453, 128460, 128470, 128472, 128473, 128477, 128478,
       128479, 128480, 128481, 128482, 128483, 128484, 128486, 128487,
       128497, 128498, 128499, 128500, 128501, 128503, 128527, 128535,
       128550, 128593, 128599, 128601, 128602, 128604, 128655, 128677,
       128715, 128759, 128766, 128767, 128779, 128781, 128782, 128783,
       128784, 128789, 128792, 128801, 128804, 128814, 128815, 128827,
       128828, 128829, 128830, 128832, 128835, 128837, 128838, 128841,
       128846, 128851, 128852, 128878, 128976, 128977, 128978, 128979,
       128980, 128981, 128988, 128990, 129007, 129015, 129016, 129021,
       129023, 129035, 129039, 129040, 129042, 129044, 129047, 129048,
       129052, 129099, 129110, 129124, 129125, 129126, 129127, 129128,
       129131, 129135, 129136, 129139, 129140, 129141, 129148, 129155,
       129167, 129181, 129183, 129190, 129195, 129196, 129197, 129198,
       129215, 129217, 129238, 129257, 129258, 129259, 129264, 129266,
       129289, 129294, 129296, 129302, 129303, 129310, 129355, 129360,
       129361, 129365, 129366, 129367])

Pathway-Pathway Relationship¶

The records containing the relationships of pathway-pathway nodes could be collected in the same way. This relationship is skipped for now, so the cell below is left commented out.

In [12]:

# # # IBD pathway_pathway edges 
# ibd_pathway_pathway_edges_df = pd.concat([primekg_edges[(primekg_edges.head_index.isin(ibd_pathway_index)) & 
#                                                         (primekg_edges.tail_type == 'pathway')],
#                                           primekg_edges[(primekg_edges.tail_index.isin(ibd_pathway_index)) & 
#                                                         (primekg_edges.head_type == 'pathway')]])

# # Check dataframe
# ibd_pathway_pathway_edges_df

Bioprocess-Protein Relationship¶

The next step is to get the records containing the relationships of biological_process-gene/protein nodes.

In [13]:

# IBD bioprocess_protein edges 
ibd_bioprocess_protein_edges_df = pd.concat([primekg_edges[(primekg_edges.head_type == 'biological_process') & 
                                                           (primekg_edges.tail_type == 'gene/protein') & 
                                                           (primekg_edges.tail_index.isin(ibd_protein_index))],
                                             primekg_edges[(primekg_edges.tail_type == 'biological_process') & 
                                                           (primekg_edges.head_type == 'gene/protein') & 
                                                           (primekg_edges.head_index.isin(ibd_protein_index))]])

# Check dataframe
ibd_bioprocess_protein_edges_df

Out[13]:

	head_index	head_name	head_source	head_id	head_type	tail_index	tail_name	tail_source	tail_id	tail_type	display_relation	relation
6351294	112487	neutrophil degranulation	GO	43312	biological_process	1990	FCGR2A	NCBI	2212	gene/protein	interacts with	bioprocess_protein
6351300	112487	neutrophil degranulation	GO	43312	biological_process	3333	FPR2	NCBI	2358	gene/protein	interacts with	bioprocess_protein
6351340	112487	neutrophil degranulation	GO	43312	biological_process	2012	CXCR1	NCBI	3577	gene/protein	interacts with	bioprocess_protein
6351341	112487	neutrophil degranulation	GO	43312	biological_process	3064	CXCR2	NCBI	3579	gene/protein	interacts with	bioprocess_protein
6351346	112487	neutrophil degranulation	GO	43312	biological_process	5022	ITGAM	NCBI	3684	gene/protein	interacts with	bioprocess_protein
...	...	...	...	...	...	...	...	...	...	...	...	...
3781707	2111	LRRK2	NCBI	120892	gene/protein	51599	negative regulation of peroxidase activity	GO	2000469	biological_process	interacts with	bioprocess_protein
3781708	2111	LRRK2	NCBI	120892	gene/protein	52358	regulation of kidney size	GO	35564	biological_process	interacts with	bioprocess_protein
3781710	2111	LRRK2	NCBI	120892	gene/protein	109343	negative regulation of thioredoxin peroxidase ...	GO	1903125	biological_process	interacts with	bioprocess_protein
3781811	22105	GPBAR1	NCBI	151306	gene/protein	105254	cell surface bile acid receptor signaling pathway	GO	38184	biological_process	interacts with	bioprocess_protein
3781824	34779	NKX2-3	NCBI	159296	gene/protein	100699	post-embryonic digestive tract morphogenesis	GO	48621	biological_process	interacts with	bioprocess_protein

6300 rows × 12 columns

MolFunc-Protein Relationship¶

Here, we would like to get the molecular_function-gene/protein relationships.

In [14]:

# IBD molfunc_protein edges 
ibd_molfunc_protein_edges_df = pd.concat([primekg_edges[(primekg_edges.head_type == 'molecular_function') & 
                                                        (primekg_edges.tail_type == 'gene/protein') & 
                                                        (primekg_edges.tail_index.isin(ibd_protein_index))],
                                           primekg_edges[(primekg_edges.tail_type == 'molecular_function') & 
                                                         (primekg_edges.head_type == 'gene/protein') & 
                                                         (primekg_edges.head_index.isin(ibd_protein_index))]])

# Check dataframe
ibd_molfunc_protein_edges_df

Out[14]:

	head_index	head_name	head_source	head_id	head_type	tail_index	tail_name	tail_source	tail_id	tail_type	display_relation	relation
6198264	54035	interleukin-1 binding	GO	19966	molecular_function	1654	IL1R2	NCBI	7850	gene/protein	interacts with	molfunc_protein
6198359	54290	enzyme binding	GO	19899	molecular_function	3578	ECM1	NCBI	1893	gene/protein	interacts with	molfunc_protein
6198366	54290	enzyme binding	GO	19899	molecular_function	2057	FN1	NCBI	2335	gene/protein	interacts with	molfunc_protein
6198442	54290	enzyme binding	GO	19899	molecular_function	989	PPARG	NCBI	5468	gene/protein	interacts with	molfunc_protein
6198462	54290	enzyme binding	GO	19899	molecular_function	772	RELA	NCBI	5970	gene/protein	interacts with	molfunc_protein
...	...	...	...	...	...	...	...	...	...	...	...	...
3553533	6229	NOD2	NCBI	64127	gene/protein	122117	muramyl dipeptide binding	GO	32500	molecular_function	interacts with	molfunc_protein
3553770	2111	LRRK2	NCBI	120892	gene/protein	115199	GTP-dependent protein kinase activity	GO	34211	molecular_function	interacts with	molfunc_protein
3553771	2111	LRRK2	NCBI	120892	gene/protein	118105	beta-catenin destruction complex binding	GO	1904713	molecular_function	interacts with	molfunc_protein
3553773	2111	LRRK2	NCBI	120892	gene/protein	119847	peroxidase inhibitor activity	GO	36479	molecular_function	interacts with	molfunc_protein
3553832	22105	GPBAR1	NCBI	151306	gene/protein	116806	G protein-coupled bile acid receptor activity	GO	38182	molecular_function	interacts with	molfunc_protein

1466 rows × 12 columns

CellComp-Protein Relationship¶

Finally, we get the records containing the relationships of cellular_component-gene/protein nodes.

In [15]:

# IBD cellcomp_protein edges
ibd_cellcomp_protein_edges_df = pd.concat([primekg_edges[(primekg_edges.head_type == 'cellular_component') & 
                                                        (primekg_edges.tail_type == 'gene/protein') & 
                                                        (primekg_edges.tail_index.isin(ibd_protein_index))],
                                           primekg_edges[(primekg_edges.tail_type == 'cellular_component') & 
                                                         (primekg_edges.head_type == 'gene/protein') & 
                                                         (primekg_edges.head_index.isin(ibd_protein_index))]])

# Check dataframe
ibd_cellcomp_protein_edges_df

Out[15]:

	head_index	head_name	head_source	head_id	head_type	tail_index	tail_name	tail_source	tail_id	tail_type	display_relation	relation
6267848	126078	ficolin-1-rich granule lumen	GO	1904813	cellular_component	3474	MMP9	NCBI	4318	gene/protein	interacts with	cellcomp_protein
6268120	124245	extracellular space	GO	5615	cellular_component	2384	CRP	NCBI	1401	gene/protein	interacts with	cellcomp_protein
6268163	124245	extracellular space	GO	5615	cellular_component	5805	DEFA5	NCBI	1670	gene/protein	interacts with	cellcomp_protein
6268164	124245	extracellular space	GO	5615	cellular_component	657	DEFA6	NCBI	1671	gene/protein	interacts with	cellcomp_protein
6268173	124245	extracellular space	GO	5615	cellular_component	3578	ECM1	NCBI	1893	gene/protein	interacts with	cellcomp_protein
...	...	...	...	...	...	...	...	...	...	...	...	...
3636708	2139	ARPC2	NCBI	10109	gene/protein	126261	muscle cell projection membrane	GO	36195	cellular_component	interacts with	cellcomp_protein
3636819	9763	ORMDL3	NCBI	94103	gene/protein	126815	SPOTS complex	GO	35339	cellular_component	interacts with	cellcomp_protein
3637211	6661	ATG16L1	NCBI	55054	gene/protein	126444	vacuole-isolation membrane contact site	GO	120095	cellular_component	interacts with	cellcomp_protein
3637234	2111	LRRK2	NCBI	120892	gene/protein	126938	cytoplasmic side of mitochondrial outer membrane	GO	32473	cellular_component	interacts with	cellcomp_protein
3637328	2111	LRRK2	NCBI	120892	gene/protein	125942	caveola neck	GO	99400	cellular_component	interacts with	cellcomp_protein

1348 rows × 12 columns
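
The filter above follows the same two-direction pattern for every relation type: one selection where the protein is the tail, one where it is the head. As a minimal sketch, assuming a hypothetical helper name `select_edges` (not part of PrimeKG or aiagents4pharma) and a toy edge table with made-up indices, the pattern could be factored out like this:

```python
import pandas as pd

def select_edges(edges, node_type, protein_index):
    """Hypothetical helper: keep edges linking `node_type` nodes to the given proteins, in either direction."""
    # Direction 1: node_type -> gene/protein
    head_side = edges[(edges.head_type == node_type) &
                      (edges.tail_type == 'gene/protein') &
                      (edges.tail_index.isin(protein_index))]
    # Direction 2: gene/protein -> node_type
    tail_side = edges[(edges.tail_type == node_type) &
                      (edges.head_type == 'gene/protein') &
                      (edges.head_index.isin(protein_index))]
    return pd.concat([head_side, tail_side])

# Toy edge table (made-up indices) using the same column names as primekg_edges
toy_edges = pd.DataFrame({
    'head_index': [10, 1, 11, 2],
    'head_type': ['cellular_component', 'gene/protein', 'cellular_component', 'gene/protein'],
    'tail_index': [1, 10, 3, 12],
    'tail_type': ['gene/protein', 'cellular_component', 'gene/protein', 'pathway'],
})

# Only the first two rows link a cellular_component to protein 1 or 2
toy_selected = select_edges(toy_edges, 'cellular_component', protein_index=[1, 2])
print(toy_selected)
```

Whether such a helper is worth introducing depends on how many relation types are sliced; the tutorial keeps the explicit form so each step stays visible.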

Merge all dataframes¶

Once we have collected each type of edge, we can merge them into a single dataframe representing the IBD subgraph extracted from PrimeKG.

In [47]:

# PrimeKG edges related to IBD
primekg_ibd_edges_df = pd.concat([ibd_disease_protein_edges_df,
                                #   ibd_disease_disease_edges_df,
                                #   ibd_protein_protein_edges_df,
                                  ibd_drug_protein_edges_df,
                                  ibd_pathway_protein_edges_df,
                                #   ibd_pathway_pathway_edges_df,
                                  ibd_bioprocess_protein_edges_df,
                                  ibd_molfunc_protein_edges_df,
                                  ibd_cellcomp_protein_edges_df])
primekg_ibd_edges_df["edge_type"] = primekg_ibd_edges_df.apply(lambda x: (x.head_type, x.display_relation, x.tail_type), axis=1)
primekg_ibd_edges_df.drop_duplicates(subset=['head_index', 'tail_index'], inplace=True)
primekg_ibd_edges_df.reset_index(drop=True, inplace=True)
primekg_ibd_edges_df

Out[47]:

	head_index	head_name	head_source	head_id	head_type	tail_index	tail_name	tail_source	tail_id	tail_type	display_relation	relation	edge_type
0	37785	ulcerative colitis (disease)	MONDO	5101	disease	7359	ADCY7	NCBI	113	gene/protein	associated with	disease_protein	(disease, associated with, gene/protein)
1	28158	inflammatory bowel disease	MONDO_grouped	9960_12845_33643_11471_12831_12875_12941_13153...	disease	7359	ADCY7	NCBI	113	gene/protein	associated with	disease_protein	(disease, associated with, gene/protein)
2	37785	ulcerative colitis (disease)	MONDO	5101	disease	2874	PRDM1	NCBI	639	gene/protein	associated with	disease_protein	(disease, associated with, gene/protein)
3	28158	inflammatory bowel disease	MONDO_grouped	9960_12845_33643_11471_12831_12875_12941_13153...	disease	2874	PRDM1	NCBI	639	gene/protein	associated with	disease_protein	(disease, associated with, gene/protein)
4	37785	ulcerative colitis (disease)	MONDO	5101	disease	2712	CASP3	NCBI	836	gene/protein	associated with	disease_protein	(disease, associated with, gene/protein)
...	...	...	...	...	...	...	...	...	...	...	...	...	...
12747	2139	ARPC2	NCBI	10109	gene/protein	126261	muscle cell projection membrane	GO	36195	cellular_component	interacts with	cellcomp_protein	(gene/protein, interacts with, cellular_compon...
12748	9763	ORMDL3	NCBI	94103	gene/protein	126815	SPOTS complex	GO	35339	cellular_component	interacts with	cellcomp_protein	(gene/protein, interacts with, cellular_compon...
12749	6661	ATG16L1	NCBI	55054	gene/protein	126444	vacuole-isolation membrane contact site	GO	120095	cellular_component	interacts with	cellcomp_protein	(gene/protein, interacts with, cellular_compon...
12750	2111	LRRK2	NCBI	120892	gene/protein	126938	cytoplasmic side of mitochondrial outer membrane	GO	32473	cellular_component	interacts with	cellcomp_protein	(gene/protein, interacts with, cellular_compon...
12751	2111	LRRK2	NCBI	120892	gene/protein	125942	caveola neck	GO	99400	cellular_component	interacts with	cellcomp_protein	(gene/protein, interacts with, cellular_compon...

12752 rows × 13 columns
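
Note that `drop_duplicates(subset=['head_index', 'tail_index'])` keeps only the first row for each ordered (head, tail) pair. A pair connected by more than one relation therefore retains a single edge, while the reverse direction counts as a different pair. A toy illustration with made-up indices:

```python
import pandas as pd

# Two relations for the pair (1, 2) and one edge in the reverse direction
toy = pd.DataFrame({
    'head_index': [1, 1, 2],
    'tail_index': [2, 2, 1],
    'display_relation': ['target', 'enzyme', 'target'],
})

# Default keep='first': the 'enzyme' row for (1, 2) is dropped
toy_dedup = toy.drop_duplicates(subset=['head_index', 'tail_index']).reset_index(drop=True)
print(toy_dedup)
```

If every relation between a pair should be preserved, adding the relation column to `subset` would be one option; that is an assumption about the intended use, not something the tutorial does.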

We can get a dataframe of nodes based on the above edge dataframe as follows:

In [48]:

# PrimeKG nodes related to IBD
primekg_ibd_nodes_df = primekg_nodes[primekg_nodes.index.isin(np.unique(np.hstack([primekg_ibd_edges_df.head_index.unique(), 
                                                                                   primekg_ibd_edges_df.tail_index.unique()])))]
primekg_ibd_nodes_df

Out[48]:

	node_index	node_name	node_source	node_id	node_type
144	144	SMAD3	NCBI	4088	gene/protein
179	179	IL10RB	NCBI	3588	gene/protein
192	192	GNA12	NCBI	2768	gene/protein
279	279	HNF4A	NCBI	3172	gene/protein
417	417	VCAM1	NCBI	7412	gene/protein
...	...	...	...	...	...
129360	129360	IRAK2 mediated activation of TAK1 complex upon...	REACTOME	R-HSA-975163	pathway
129361	129361	TRAF6 mediated IRF7 activation in TLR7/8 or 9 ...	REACTOME	R-HSA-975110	pathway
129365	129365	Antigen processing: Ubiquitination & Proteasom...	REACTOME	R-HSA-983168	pathway
129366	129366	Antigen Presentation: Folding, assembly and pe...	REACTOME	R-HSA-983170	pathway
129367	129367	Kinesins	REACTOME	R-HSA-983189	pathway

3426 rows × 5 columns
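
The node filter collects every index that occurs as a head or a tail and keeps the matching rows. A small sketch on toy data (made-up names and indices); it filters on the `node_index` column, which is equivalent to filtering on the dataframe index when the two coincide, as in the output above:

```python
import numpy as np
import pandas as pd

toy_nodes = pd.DataFrame({'node_index': [0, 1, 2, 3], 'node_name': ['A', 'B', 'C', 'D']})
toy_edges = pd.DataFrame({'head_index': [0, 2], 'tail_index': [2, 3]})

# Union of all node indices that take part in at least one edge
used = np.unique(np.hstack([toy_edges.head_index.unique(), toy_edges.tail_index.unique()]))

# Node 1 ('B') has no edge and is left out
toy_sub_nodes = toy_nodes[toy_nodes.node_index.isin(used)]
print(toy_sub_nodes.node_name.tolist())
```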

We can store the nodes and edges related to IBD in parquet files for future use.

In [49]:

# Store the IBD-related nodes and edges
local_dir = '../../../../data/primekg_ibd/'
os.makedirs(local_dir, exist_ok=True)
primekg_ibd_nodes_df.to_parquet(os.path.join(local_dir, 'primekg_ibd_nodes.parquet'), compression='gzip', index=False)
primekg_ibd_edges_df.to_parquet(os.path.join(local_dir, 'primekg_ibd_edges.parquet'), compression='gzip', index=False)

In [50]:

# Statistics over the IBD-related nodes and edges
print(f"Number of IBD-related nodes: {primekg_ibd_nodes_df.shape[0]}")
print(f"Number of IBD-related edges: {primekg_ibd_edges_df.shape[0]}")

Number of IBD-related nodes: 3426
Number of IBD-related edges: 12752

In [51]:

# Count the number of nodes by node type
primekg_ibd_nodes_df.groupby('node_type').size()

Out[51]:

node_type
biological_process    1642
cellular_component     207
disease                  7
drug                   835
gene/protein           103
molecular_function     324
pathway                308
dtype: int64

In [52]:

# Count the number of edges by relation and display_relation
primekg_ibd_edges_df.groupby(['relation','display_relation']).size()

Out[52]:

relation            display_relation
bioprocess_protein  interacts with      6300
cellcomp_protein    interacts with      1348
disease_protein     associated with      620
drug_protein        carrier                8
                    enzyme                64
                    target               776
                    transporter         1140
molfunc_protein     interacts with      1466
pathway_protein     interacts with      1030
dtype: int64

In [53]:

# Count the number of edges by edge type
primekg_ibd_edges_df.groupby(['edge_type']).size()

Out[53]:

edge_type
(biological_process, interacts with, gene/protein)    3150
(cellular_component, interacts with, gene/protein)     674
(disease, associated with, gene/protein)               310
(drug, carrier, gene/protein)                            4
(drug, enzyme, gene/protein)                            32
(drug, target, gene/protein)                           388
(drug, transporter, gene/protein)                      570
(gene/protein, associated with, disease)               310
(gene/protein, carrier, drug)                            4
(gene/protein, enzyme, drug)                            32
(gene/protein, interacts with, biological_process)    3150
(gene/protein, interacts with, cellular_component)     674
(gene/protein, interacts with, molecular_function)     733
(gene/protein, interacts with, pathway)                515
(gene/protein, target, drug)                           388
(gene/protein, transporter, drug)                      570
(molecular_function, interacts with, gene/protein)     733
(pathway, interacts with, gene/protein)                515
dtype: int64
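
In the counts above, every relation appears once per direction with the same count (for example, 3150 edges each way between biological_process and gene/protein nodes), which suggests that each edge is stored in both directions. A minimal sketch of such a symmetry check on toy triples; in the notebook, the input list would be `primekg_ibd_edges_df.edge_type.tolist()`:

```python
from collections import Counter

# Toy (head_type, display_relation, tail_type) triples
toy_edge_types = [
    ('drug', 'target', 'gene/protein'),
    ('gene/protein', 'target', 'drug'),
    ('disease', 'associated with', 'gene/protein'),
    ('gene/protein', 'associated with', 'disease'),
]

counts = Counter(toy_edge_types)

# Every triple must have a reversed counterpart with the same count
is_symmetric = all(counts[(tail, rel, head)] == n for (head, rel, tail), n in counts.items())
print(is_symmetric)
```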

Enrichment (using textual as of now)¶

From this point onwards, we will use the pre-processed IBD-related nodes and edges to create a set of graph formats.

Before that, we need to enrich and embed the IBD-related nodes and edges.

As of now, we only perform textual enrichment over the records.

Since StarkQA provides most of the node information, we will use it to obtain the details of the IBD-related nodes.

In [54]:

# Define starkqa primekg data by providing a local directory where the data is stored
starkqa_data = StarkQAPrimeKG(local_dir="../../../../data/starkqa_primekg/")

# Invoke a method to load the data
starkqa_data.load_data()

# Get the StarkQAPrimeKG data, which are the QA pairs, split indices, and the node information
# starkqa_df = starkqa_data.get_starkqa()
starkqa_node_info = starkqa_data.get_starkqa_node_info()

Loading StarkQAPrimeKG dataset...
../../../../data/starkqa_primekg/qa/prime/stark_qa/stark_qa.csv already exists. Loading the data from the local directory.
Loading StarkQAPrimeKG embeddings...

Note that not all nodes in StarkQA-PrimeKG have additional information.

For such nodes, we provide a basic text enrichment consisting only of the node name and type.

In [55]:

def do_enrichment_text(data, starkqa_node_info):
    """
    Enrich the node with additional textual information from StarkQA-PrimeKG.

    Args:
        data (dict): The node data from PrimeKG
        starkqa_node_info (dict): The node information from StarkQA-PrimeKG

    Returns:
        str: The enriched textual description of the node
    """
    # Basic textual enrichment of the node
    enriched_node = f"{data['node_name']} belongs to {data['node_type']} category. "

    # Nodes without a 'details' entry only receive the basic enrichment
    details = starkqa_node_info.get('details', {})

    def as_text(key):
        # Missing or NaN fields become an empty string. A plain
        # str(value).replace('nan', '') would also strip 'nan' inside
        # ordinary words such as 'pregnancy'.
        value = details.get(key)
        is_missing = value is None or (isinstance(value, float) and np.isnan(value))
        return '' if is_missing else str(value)

    # Only enrich the node if the node type is gene/protein, drug, disease, or pathway, which
    # has additional information in the node_info of StarkQA-PrimeKG
    added_info = ''
    if data['node_type'] == 'gene/protein':
        added_info = as_text('summary')
    elif data['node_type'] == 'drug':
        added_info = ' '.join(as_text(key) for key in ['description', 'mechanism_of_action', 'protein_binding', 'pharmacodynamics', 'indication'])
    elif data['node_type'] == 'disease':
        added_info = ' '.join(as_text(key) for key in ['mondo_definition', 'mayo_symptoms', 'mayo_causes'])
    elif data['node_type'] == 'pathway' and details:
        added_info = f"This pathway found in {details['speciesName']}. " + ' '.join([x['text'] for x in details['summation']])

    # Append the additional information for enrichment
    enriched_node += added_info
    return enriched_node
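
Optional text fields may arrive as NaN, and they need to be dropped without touching ordinary words that happen to contain the letters "nan". The sketch below contrasts a substring replacement with an explicit missing-value check; `field_to_text` is a hypothetical helper name and the sentence is made up for illustration:

```python
import math

def field_to_text(value):
    """Hypothetical helper: return '' for missing values (None or float NaN), else the value as text."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return ''
    return str(value)

sentence = 'Avoid use during pregnancy.'

# Substring replacement also removes 'nan' from inside 'pregnancy'
naive = str(sentence).replace('nan', '')

# Explicit check: the real NaN disappears, the sentence is untouched
safe = ' '.join([field_to_text(sentence), field_to_text(float('nan'))]).strip()

print(naive)
print(safe)
```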

By using the above function, we can enrich the node information from PrimeKG with additional information from StarkQA-PrimeKG as shown below:

In [56]:

# Perform node enrichment for each row in primekg_ibd_nodes_df
text_enriched_nodes = primekg_ibd_nodes_df.apply(lambda x: do_enrichment_text(x, starkqa_node_info[x['node_index']]), axis=1).tolist()
primekg_ibd_nodes_df['enriched_node'] = text_enriched_nodes
primekg_ibd_nodes_df

/tmp/ipykernel_64662/2873064541.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  primekg_ibd_nodes_df['enriched_node'] = text_enriched_nodes
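
The warning appears because `primekg_ibd_nodes_df` was created by boolean indexing on `primekg_nodes`, so pandas cannot tell whether the assignment is meant to reach the original dataframe. The new column is still added to the subset, as the output below shows. A common way to avoid the warning is to take an explicit `.copy()` when building the subset; a toy sketch with made-up rows:

```python
import pandas as pd

toy_nodes = pd.DataFrame({'node_index': [0, 1, 2], 'node_type': ['drug', 'disease', 'drug']})

# An explicit copy makes the subset an independent dataframe
toy_subset = toy_nodes[toy_nodes.node_type == 'drug'].copy()

# Adding a column now affects only the subset, not toy_nodes
toy_subset['enriched_node'] = ['text a', 'text b']
print(toy_subset)
```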

Out[56]:

	node_index	node_name	node_source	node_id	node_type	enriched_node
144	144	SMAD3	NCBI	4088	gene/protein	SMAD3 belongs to gene/protein category. The SM...
179	179	IL10RB	NCBI	3588	gene/protein	IL10RB belongs to gene/protein category. The p...
192	192	GNA12	NCBI	2768	gene/protein	GNA12 belongs to gene/protein category. Predic...
279	279	HNF4A	NCBI	3172	gene/protein	HNF4A belongs to gene/protein category. The pr...
417	417	VCAM1	NCBI	7412	gene/protein	VCAM1 belongs to gene/protein category. This g...
...	...	...	...	...	...	...
129360	129360	IRAK2 mediated activation of TAK1 complex upon...	REACTOME	R-HSA-975163	pathway	IRAK2 mediated activation of TAK1 complex upon...
129361	129361	TRAF6 mediated IRF7 activation in TLR7/8 or 9 ...	REACTOME	R-HSA-975110	pathway	TRAF6 mediated IRF7 activation in TLR7/8 or 9 ...
129365	129365	Antigen processing: Ubiquitination & Proteasom...	REACTOME	R-HSA-983168	pathway	Antigen processing: Ubiquitination & Proteasom...
129366	129366	Antigen Presentation: Folding, assembly and pe...	REACTOME	R-HSA-983170	pathway	Antigen Presentation: Folding, assembly and pe...
129367	129367	Kinesins	REACTOME	R-HSA-983189	pathway	Kinesins belongs to pathway category. This pat...

3426 rows × 6 columns

Subsequently, we can perform a similar textual enrichment for the edges in PrimeKG.

Since StarkQA only provides node information, we can only enrich each edge with the basic information of its triple: the head node, the relation, and the tail node.

In [57]:

# Perform textual enrichment over the edges by describing each triple:
# the head node, the relation, and the tail node (names and types only)
text_enriched_edges = primekg_ibd_edges_df.apply(lambda x: f"{x['head_name']} ({x['head_type']}) has a direct relationship of {x['relation']}:{x['display_relation']} with {x['tail_name']} ({x['tail_type']}).", axis=1).tolist()
primekg_ibd_edges_df['enriched_edge'] = text_enriched_edges
primekg_ibd_edges_df.head()

Out[57]:

	head_index	head_name	head_source	head_id	head_type	tail_index	tail_name	tail_source	tail_id	tail_type	display_relation	relation	edge_type	enriched_edge
0	37785	ulcerative colitis (disease)	MONDO	5101	disease	7359	ADCY7	NCBI	113	gene/protein	associated with	disease_protein	(disease, associated with, gene/protein)	ulcerative colitis (disease) (disease) has a d...
1	28158	inflammatory bowel disease	MONDO_grouped	9960_12845_33643_11471_12831_12875_12941_13153...	disease	7359	ADCY7	NCBI	113	gene/protein	associated with	disease_protein	(disease, associated with, gene/protein)	inflammatory bowel disease (disease) has a dir...
2	37785	ulcerative colitis (disease)	MONDO	5101	disease	2874	PRDM1	NCBI	639	gene/protein	associated with	disease_protein	(disease, associated with, gene/protein)	ulcerative colitis (disease) (disease) has a d...
3	28158	inflammatory bowel disease	MONDO_grouped	9960_12845_33643_11471_12831_12875_12941_13153...	disease	2874	PRDM1	NCBI	639	gene/protein	associated with	disease_protein	(disease, associated with, gene/protein)	inflammatory bowel disease (disease) has a dir...
4	37785	ulcerative colitis (disease)	MONDO	5101	disease	2712	CASP3	NCBI	836	gene/protein	associated with	disease_protein	(disease, associated with, gene/protein)	ulcerative colitis (disease) (disease) has a d...

Embeddings (using textual embedding as of now)¶

We are going to embed the enriched nodes and edges using the EmbeddingWithOllama class.

For this purpose, we will use the nomic-embed-text model.

In [58]:

# Using nomic-ai/nomic-embed-text-v1.5 model
emb_model = EmbeddingWithOllama(model_name='nomic-embed-text')

Node Embedding¶

We will embed the IBD-related nodes with the Ollama model, using mini-batches of 100 nodes at a time.

In [59]:

# Since there are many node records, we split them into mini-batches
mini_batch_size = 100
node_embeddings = []
for i in tqdm(range(0, primekg_ibd_nodes_df.shape[0], mini_batch_size)):
    outputs = emb_model.embed_documents(primekg_ibd_nodes_df.enriched_node.values.tolist()[i:i+mini_batch_size])
    node_embeddings.extend(outputs)
# node_embeddings

100%|██████████| 35/35 [00:19<00:00,  1.75it/s]

In [60]:

# Check the shape of the node embeddings
len(node_embeddings), len(node_embeddings[0])

Out[60]:

(3426, 768)
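
The progress bar reports 35 iterations because 3426 nodes in batches of 100 give ceil(3426 / 100) = 35 batches, the last one holding the remaining 26 nodes. The sketch below reproduces the batching loop with a hypothetical stand-in embedder (`fake_embed`, which is not the Ollama API and returns 4-dimensional zero vectors instead of 768-dimensional embeddings):

```python
import math

def fake_embed(texts, dim=4):
    """Hypothetical stand-in for an embedding call: one zero vector per input text."""
    return [[0.0] * dim for _ in texts]

def embed_in_batches(texts, embed_fn, batch_size=100):
    """Embed texts batch by batch and concatenate the results in order."""
    embeddings = []
    for start in range(0, len(texts), batch_size):
        embeddings.extend(embed_fn(texts[start:start + batch_size]))
    return embeddings

texts = [f"node {i}" for i in range(3426)]
n_batches = math.ceil(len(texts) / 100)
vectors = embed_in_batches(texts, fake_embed)

print(n_batches, len(vectors), len(vectors[0]))
```

The output order matches the input order, which is what allows the embeddings to be assigned back to the dataframe as a column.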

In [61]:

# Add them as features to the dataframe
primekg_ibd_nodes_df['x'] = node_embeddings

# Drop and rename several columns
primekg_ibd_nodes_df.drop(columns=['node_source', 'node_id'], inplace=True)
primekg_ibd_nodes_df.rename(columns={'node_index': 'node_id'}, inplace=True)

# Check dataframe of nodes
primekg_ibd_nodes_df.head()

/tmp/ipykernel_64662/3470083233.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  primekg_ibd_nodes_df['x'] = node_embeddings
/tmp/ipykernel_64662/3470083233.py:5: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  primekg_ibd_nodes_df.drop(columns=['node_source', 'node_id'], inplace=True)
/tmp/ipykernel_64662/3470083233.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  primekg_ibd_nodes_df.rename(columns={'node_index': 'node_id'}, inplace=True)

Out[61]:

	node_id	node_name	node_type	enriched_node	x
144	144	SMAD3	gene/protein	SMAD3 belongs to gene/protein category. The SM...	[0.026536005, 0.05420931, -0.17033643, -0.0248...
179	179	IL10RB	gene/protein	IL10RB belongs to gene/protein category. The p...	[0.024764946, 0.022782002, -0.16956052, -0.033...
192	192	GNA12	gene/protein	GNA12 belongs to gene/protein category. Predic...	[0.004795947, 0.04921528, -0.14488313, -0.0492...
279	279	HNF4A	gene/protein	HNF4A belongs to gene/protein category. The pr...	[0.013905027, 0.032602787, -0.15260702, 0.0074...
417	417	VCAM1	gene/protein	VCAM1 belongs to gene/protein category. This g...	[0.047299746, 0.032621186, -0.15677826, -0.021...

In [62]:

# Duplicate node_id as a new column and use it as the index
primekg_ibd_nodes_df['node'] = primekg_ibd_nodes_df['node_id']
primekg_ibd_nodes_df.set_index('node', inplace=True)
primekg_ibd_nodes_df.head()

/tmp/ipykernel_64662/1471123717.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  primekg_ibd_nodes_df['node'] = primekg_ibd_nodes_df['node_id']

Out[62]:

	node_id	node_name	node_type	enriched_node	x
node
144	144	SMAD3	gene/protein	SMAD3 belongs to gene/protein category. The SM...	[0.026536005, 0.05420931, -0.17033643, -0.0248...
179	179	IL10RB	gene/protein	IL10RB belongs to gene/protein category. The p...	[0.024764946, 0.022782002, -0.16956052, -0.033...
192	192	GNA12	gene/protein	GNA12 belongs to gene/protein category. Predic...	[0.004795947, 0.04921528, -0.14488313, -0.0492...
279	279	HNF4A	gene/protein	HNF4A belongs to gene/protein category. The pr...	[0.013905027, 0.032602787, -0.15260702, 0.0074...
417	417	VCAM1	gene/protein	VCAM1 belongs to gene/protein category. This g...	[0.047299746, 0.032621186, -0.15677826, -0.021...

In [63]:

# Save the embedded nodes dataframes to parquet file
primekg_ibd_nodes_df.to_parquet(os.path.join(local_dir, 'primekg_ibd_nodes_embedded.parquet'), compression='gzip', index=False)

Edge Embedding¶

Likewise, we embed the IBD-related edges with the Ollama model, using mini-batches of 100 edges at a time.

In [64]:

# Since there are many edge records, we split them into mini-batches
mini_batch_size = 100
edge_embeddings = []
for i in tqdm(range(0, primekg_ibd_edges_df.shape[0], mini_batch_size)):
    outputs = emb_model.embed_documents(primekg_ibd_edges_df.enriched_edge.values.tolist()[i:i+mini_batch_size])
    edge_embeddings.extend(outputs)
# edge_embeddings

100%|██████████| 128/128 [00:48<00:00,  2.64it/s]

In [65]:

# Check the shape of the edge embeddings
len(edge_embeddings), len(edge_embeddings[0])

Out[65]:

(12752, 768)

In [66]:

# Add them as features to the dataframe
primekg_ibd_edges_df['edge_attr'] = edge_embeddings

# Drop and rename several columns
primekg_ibd_edges_df.drop(columns=['head_source', 'head_id', 'head_type', 'tail_source', 'tail_id', 'tail_type', 'display_relation', 'relation'], inplace=True)
primekg_ibd_edges_df.rename(columns={'head_index': 'head_id', 'tail_index': 'tail_id'}, inplace=True)

# Check dataframe of edges
primekg_ibd_edges_df.head()

Out[66]:

	head_id	head_name	tail_id	tail_name	edge_type	enriched_edge	edge_attr
0	37785	ulcerative colitis (disease)	7359	ADCY7	(disease, associated with, gene/protein)	ulcerative colitis (disease) (disease) has a d...	[0.061832674, 0.040013667, -0.15366873, -0.008...
1	28158	inflammatory bowel disease	7359	ADCY7	(disease, associated with, gene/protein)	inflammatory bowel disease (disease) has a dir...	[0.050393466, 0.030410834, -0.15008788, -0.013...
2	37785	ulcerative colitis (disease)	2874	PRDM1	(disease, associated with, gene/protein)	ulcerative colitis (disease) (disease) has a d...	[0.0401622, 0.028982995, -0.15433805, 0.006565...
3	28158	inflammatory bowel disease	2874	PRDM1	(disease, associated with, gene/protein)	inflammatory bowel disease (disease) has a dir...	[0.02781422, 0.01603875, -0.14870141, 0.004470...
4	37785	ulcerative colitis (disease)	2712	CASP3	(disease, associated with, gene/protein)	ulcerative colitis (disease) (disease) has a d...	[0.07853663, 0.050751355, -0.1470567, -0.01237...

In [67]:

# Save the embedded edges dataframe to a parquet file
primekg_ibd_edges_df.to_parquet(os.path.join(local_dir, 'primekg_ibd_edges_embedded.parquet'), compression='gzip', index=False)

Knowledge Graph Construction¶

In this section, we convert our dataframes to a networkx DiGraph object.

In [68]:

# Modify the node dataframe
primekg_ibd_nodes_df["node"] = primekg_ibd_nodes_df.apply(lambda x: f"{x.node_name}_({x.node_id})", axis=1)
primekg_ibd_nodes_df["node_id"] = primekg_ibd_nodes_df.apply(lambda x: f"{x.node_name}_({x.node_id})", axis=1)
primekg_ibd_nodes_df.set_index('node', inplace=True)
primekg_ibd_nodes_df.head()

/tmp/ipykernel_64662/4233198491.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  primekg_ibd_nodes_df["node"] = primekg_ibd_nodes_df.apply(lambda x: f"{x.node_name}_({x.node_id})", axis=1)
/tmp/ipykernel_64662/4233198491.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  primekg_ibd_nodes_df["node_id"] = primekg_ibd_nodes_df.apply(lambda x: f"{x.node_name}_({x.node_id})", axis=1)

Out[68]:

	node_id	node_name	node_type	enriched_node	x
node
SMAD3_(144)	SMAD3_(144)	SMAD3	gene/protein	SMAD3 belongs to gene/protein category. The SM...	[0.026536005, 0.05420931, -0.17033643, -0.0248...
IL10RB_(179)	IL10RB_(179)	IL10RB	gene/protein	IL10RB belongs to gene/protein category. The p...	[0.024764946, 0.022782002, -0.16956052, -0.033...
GNA12_(192)	GNA12_(192)	GNA12	gene/protein	GNA12 belongs to gene/protein category. Predic...	[0.004795947, 0.04921528, -0.14488313, -0.0492...
HNF4A_(279)	HNF4A_(279)	HNF4A	gene/protein	HNF4A belongs to gene/protein category. The pr...	[0.013905027, 0.032602787, -0.15260702, 0.0074...
VCAM1_(417)	VCAM1_(417)	VCAM1	gene/protein	VCAM1 belongs to gene/protein category. This g...	[0.047299746, 0.032621186, -0.15677826, -0.021...

In [69]:

# Modify the edge dataframe
primekg_ibd_edges_df["head_id"] = primekg_ibd_edges_df.apply(lambda x: f"{x.head_name}_({x.head_id})", axis=1)
primekg_ibd_edges_df["tail_id"] = primekg_ibd_edges_df.apply(lambda x: f"{x.tail_name}_({x.tail_id})", axis=1)
primekg_ibd_edges_df.reset_index(drop=True, inplace=True)
primekg_ibd_edges_df.head()

Out[69]:

	head_id	head_name	tail_id	tail_name	edge_type	enriched_edge	edge_attr
0	ulcerative colitis (disease)_(37785)	ulcerative colitis (disease)	ADCY7_(7359)	ADCY7	(disease, associated with, gene/protein)	ulcerative colitis (disease) (disease) has a d...	[0.061832674, 0.040013667, -0.15366873, -0.008...
1	inflammatory bowel disease_(28158)	inflammatory bowel disease	ADCY7_(7359)	ADCY7	(disease, associated with, gene/protein)	inflammatory bowel disease (disease) has a dir...	[0.050393466, 0.030410834, -0.15008788, -0.013...
2	ulcerative colitis (disease)_(37785)	ulcerative colitis (disease)	PRDM1_(2874)	PRDM1	(disease, associated with, gene/protein)	ulcerative colitis (disease) (disease) has a d...	[0.0401622, 0.028982995, -0.15433805, 0.006565...
3	inflammatory bowel disease_(28158)	inflammatory bowel disease	PRDM1_(2874)	PRDM1	(disease, associated with, gene/protein)	inflammatory bowel disease (disease) has a dir...	[0.02781422, 0.01603875, -0.14870141, 0.004470...
4	ulcerative colitis (disease)_(37785)	ulcerative colitis (disease)	CASP3_(2712)	CASP3	(disease, associated with, gene/protein)	ulcerative colitis (disease) (disease) has a d...	[0.07853663, 0.050751355, -0.1470567, -0.01237...
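
The row-wise `apply` used above works, but the same `name_(id)` identifiers can also be built with vectorized string concatenation, which tends to be faster on large edge tables. The sketch below uses a two-row toy dataframe named `toy_edges_df` (a hypothetical name) with values taken from the output above.

```python
import pandas as pd

# Toy edge table with the original integer identifiers.
toy_edges_df = pd.DataFrame({
    "head_id": [37785, 28158],
    "head_name": ["ulcerative colitis (disease)", "inflammatory bowel disease"],
    "tail_id": [7359, 7359],
    "tail_name": ["ADCY7", "ADCY7"],
})

# Column-wise concatenation: "<name>_(<id>)" for both ends of each edge.
for side in ("head", "tail"):
    toy_edges_df[f"{side}_id"] = (
        toy_edges_df[f"{side}_name"]
        + "_("
        + toy_edges_df[f"{side}_id"].astype(str)
        + ")"
    )
```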

In [70]:

# Convert the dataframes to a knowledge graph as a networkx DiGraph object
kg = nx.DiGraph()
# Add every node with all of its dataframe columns as node attributes
for _, row in primekg_ibd_nodes_df.iterrows():
    kg.add_node(row['node_id'], **row.to_dict())
# Add every edge with all of its dataframe columns as edge attributes;
# `key=i` is stored as an ordinary edge attribute named "key"
for i, row in primekg_ibd_edges_df.iterrows():
    kg.add_edge(row['head_id'], row['tail_id'], key=i, **row.to_dict())
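
One property of `nx.DiGraph` is worth keeping in mind here: it stores at most one edge per ordered (head, tail) pair, and any extra keyword passed to `add_edge`, including `key`, becomes a plain edge attribute. The toy sketch below contrasts this with `nx.MultiDiGraph`, where `key` distinguishes parallel edges. Whether the IBD subgraph actually contains such parallel pairs is not checked in this tutorial, so this is an illustration rather than a diagnosis.

```python
import networkx as nx

# Two rows describing the same (head, tail) pair with different relations.
g = nx.DiGraph()
g.add_edge("A", "B", key=0, edge_type="rel_1")
g.add_edge("A", "B", key=1, edge_type="rel_2")

# DiGraph keeps a single A->B edge; the second call updated its attributes.
n_edges = g.number_of_edges()
attrs = dict(g["A"]["B"])

# MultiDiGraph treats `key` as an edge key, so both edges are kept.
mg = nx.MultiDiGraph()
mg.add_edge("A", "B", key=0, edge_type="rel_1")
mg.add_edge("A", "B", key=1, edge_type="rel_2")
n_multi_edges = mg.number_of_edges()
```

If the number of rows in the edge dataframe differs from `kg.number_of_edges()`, collapsed parallel edges are one possible explanation.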

In [71]:

# Save graph object
local_dir = '../../../aiagents4pharma/talk2knowledgegraphs/tests/files/'
with open(os.path.join(local_dir, 'primekg_ibd_nx_graph.pkl'), 'wb') as f:
    pickle.dump(kg, f)

# # Load graph object
# with open(os.path.join(local_dir, 'primekg_ibd_nx_graph.pkl'), 'rb') as f:
#     kg = pickle.load(f)

In [72]:

print("#Nodes", kg.number_of_nodes())
print("#Edges", kg.number_of_edges())

#Nodes 3426
#Edges 12752
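
A simple way to check a saved graph is to load the pickle back and compare node and edge counts with the in-memory object. The sketch below does this for a small toy graph in a temporary directory; the file name `toy_nx_graph.pkl` and the variable names are hypothetical, and the tutorial's own files are not touched.

```python
import os
import pickle
import tempfile

import networkx as nx

# Small toy graph with attributes on nodes and edges.
toy_kg = nx.DiGraph()
toy_kg.add_node("SMAD3_(144)", node_type="gene/protein")
toy_kg.add_node("inflammatory bowel disease_(28158)", node_type="disease")
toy_kg.add_edge(
    "SMAD3_(144)",
    "inflammatory bowel disease_(28158)",
    edge_type="(gene/protein, associated with, disease)",
)

# Write the graph to disk and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_nx_graph.pkl")
    with open(path, "wb") as f:
        pickle.dump(toy_kg, f)
    with open(path, "rb") as f:
        loaded_kg = pickle.load(f)

# The reloaded graph should match the original in size and attributes.
same_size = (
    loaded_kg.number_of_nodes() == toy_kg.number_of_nodes()
    and loaded_kg.number_of_edges() == toy_kg.number_of_edges()
)
same_nodes = dict(loaded_kg.nodes(data=True)) == dict(toy_kg.nodes(data=True))
```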

In addition, we can convert the networkx graph to a PyG Data object for further processing (e.g., subgraph extraction).

In [74]:

# Convert networkx graph to PyG data object
pyg_graph = from_networkx(kg)

# Save graph object
with open(os.path.join(local_dir, 'primekg_ibd_pyg_graph.pkl'), 'wb') as f:
    pickle.dump(pyg_graph, f)

# Load graph object
# with open(os.path.join(local_dir, 'primekg_ibd_pyg_graph.pkl'), 'rb') as f:
#     pyg_graph = pickle.load(f)

Lastly, we prepare a textualized graph of nodes and edges, which can be used, for instance, in a RAG application.

In [75]:

# Prepare nodes
nodes_df = pd.DataFrame({
    'node_id': list(pyg_graph.node_id),
    'node_attr': list(pyg_graph.enriched_node),
})
nodes_df

Out[75]:

	node_id	node_attr
0	SMAD3_(144)	SMAD3 belongs to gene/protein category. The SM...
1	IL10RB_(179)	IL10RB belongs to gene/protein category. The p...
2	GNA12_(192)	GNA12 belongs to gene/protein category. Predic...
3	HNF4A_(279)	HNF4A belongs to gene/protein category. The pr...
4	VCAM1_(417)	VCAM1 belongs to gene/protein category. This g...
...	...	...
3421	IRAK2 mediated activation of TAK1 complex upon...	IRAK2 mediated activation of TAK1 complex upon...
3422	TRAF6 mediated IRF7 activation in TLR7/8 or 9 ...	TRAF6 mediated IRF7 activation in TLR7/8 or 9 ...
3423	Antigen processing: Ubiquitination & Proteasom...	Antigen processing: Ubiquitination & Proteasom...
3424	Antigen Presentation: Folding, assembly and pe...	Antigen Presentation: Folding, assembly and pe...
3425	Kinesins_(129367)	Kinesins belongs to pathway category. This pat...

3426 rows × 2 columns

In [76]:

# Prepare edges
edges_df = pd.DataFrame({
    'head_id': list(pyg_graph.head_id),
    'edge_type': list(pyg_graph.edge_type),
    'tail_id': list(pyg_graph.tail_id),
})
edges_df

Out[76]:

	head_id	edge_type	tail_id
0	SMAD3_(144)	(gene/protein, associated with, disease)	Crohn disease_(37784)
1	SMAD3_(144)	(gene/protein, associated with, disease)	inflammatory bowel disease_(28158)
2	SMAD3_(144)	(gene/protein, associated with, disease)	Crohn's colitis_(83770)
3	SMAD3_(144)	(gene/protein, associated with, disease)	Crohn ileitis and jejunitis_(35814)
4	SMAD3_(144)	(gene/protein, interacts with, pathway)	Signaling by NODAL_(62373)
...	...	...	...
12747	IRAK2 mediated activation of TAK1 complex upon...	(pathway, interacts with, gene/protein)	TLR4_(3259)
12748	TRAF6 mediated IRF7 activation in TLR7/8 or 9 ...	(pathway, interacts with, gene/protein)	TLR9_(10113)
12749	Antigen processing: Ubiquitination & Proteasom...	(pathway, interacts with, gene/protein)	HERC2_(1777)
12750	Antigen Presentation: Folding, assembly and pe...	(pathway, interacts with, gene/protein)	ERAP2_(12763)
12751	Kinesins_(129367)	(pathway, interacts with, gene/protein)	KIF21B_(8564)

12752 rows × 3 columns
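
The node and edge tables above are read from the PyG object, but the same kind of tables can also be built straight from a networkx graph when PyG is not needed. The following is a minimal sketch on a toy graph; the node texts are shortened placeholders, and the names `toy_kg`, `toy_nodes_df` and `toy_edges_df` are hypothetical.

```python
import networkx as nx
import pandas as pd

# Toy graph carrying the same attribute names as the tutorial graph.
toy_kg = nx.DiGraph()
toy_kg.add_node("SMAD3_(144)", enriched_node="SMAD3 belongs to gene/protein category.")
toy_kg.add_node(
    "Signaling by NODAL_(62373)",
    enriched_node="Signaling by NODAL belongs to pathway category.",
)
toy_kg.add_edge(
    "SMAD3_(144)",
    "Signaling by NODAL_(62373)",
    edge_type="(gene/protein, interacts with, pathway)",
)

# One row per node: identifier plus its textual description.
toy_nodes_df = pd.DataFrame(
    [{"node_id": n, "node_attr": d["enriched_node"]} for n, d in toy_kg.nodes(data=True)]
)

# One row per edge: head, relation type, tail.
toy_edges_df = pd.DataFrame(
    [
        {"head_id": u, "edge_type": d["edge_type"], "tail_id": v}
        for u, v, d in toy_kg.edges(data=True)
    ]
)
```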

In [77]:

# Save the textualized graph (node and edge tables) as a single pickle file
with open(os.path.join(local_dir, 'primekg_ibd_text_graph.pkl'), "wb") as f:
    pickle.dump({"nodes": nodes_df, "edges": edges_df}, f)
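
To show how such a textualized graph might be consumed downstream, the sketch below serializes the two tables into one CSV-style text block that could be placed in an LLM prompt. This is only one possible format, not something prescribed by this tutorial; the helper `textualize_graph` and the toy tables are hypothetical.

```python
import pandas as pd

# Toy tables with the same columns as the textualized graph above.
toy_nodes_df = pd.DataFrame({
    "node_id": ["SMAD3_(144)", "inflammatory bowel disease_(28158)"],
    "node_attr": ["placeholder text for SMAD3", "placeholder text for IBD"],
})
toy_edges_df = pd.DataFrame({
    "head_id": ["SMAD3_(144)"],
    "edge_type": ["(gene/protein, associated with, disease)"],
    "tail_id": ["inflammatory bowel disease_(28158)"],
})


def textualize_graph(nodes: pd.DataFrame, edges: pd.DataFrame) -> str:
    """Concatenate the node and edge tables as CSV text (hypothetical helper)."""
    return nodes.to_csv(index=False) + "\n" + edges.to_csv(index=False)


graph_text = textualize_graph(toy_nodes_df, toy_edges_df)
```

Because the edge types contain commas, `to_csv` quotes them automatically, so the text can be parsed back unambiguously.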