Multimodal BioBridge-PrimeKG Graph Construction¶
In this tutorial, we will perform a simple pre-processing task over the BioBridge-PrimeKG dataset, which employs multimodal data. In particular, we use the pre-loaded embeddings already provided by BioBridge, joined with the PrimeKG IBD dataset obtained from the previous tutorial:
docs/notebooks/talk2knowledgegraphs/tutorial_primekg_subgraph.ipynb
First of all, we need to import necessary libraries as follows:
# Import necessary libraries
# %load_ext cudf.pandas
import os
import numpy as np
import pandas as pd
import networkx as nx
import pickle
import blosc
from tqdm import tqdm
from torch_geometric.utils import from_networkx
import sys
sys.path.append('../../..')
from aiagents4pharma.talk2knowledgegraphs.datasets.starkqa_primekg import StarkQAPrimeKG
from aiagents4pharma.talk2knowledgegraphs.datasets.biobridge_primekg import BioBridgePrimeKG
from aiagents4pharma.talk2knowledgegraphs.utils.embeddings.ollama import EmbeddingWithOllama
# from aiagents4pharma.talk2knowledgegraphs.utils import kg_utils
# Set the logging level for httpx to WARNING to suppress INFO messages
import logging
logging.getLogger("httpx").setLevel(logging.WARNING)
Prepare BioBridge dataset¶
The BioBridgePrimeKG class allows loading the data from the related GitHub repository if it is not available locally. Otherwise, the data is loaded from the local directories defined by local_dir and primekg_dir.
# Define biobridge primekg data by providing a local directory where the data is stored
biobridge_data = BioBridgePrimeKG(primekg_dir="../../../../data/primekg/",
local_dir="../../../../data/biobridge_primekg/")
# Invoke a method to load the data
biobridge_data.load_data()
# Get the node information of the BioBridge PrimeKG
biobridge_node_info = biobridge_data.get_node_info_dict()
biobridge_node_info.keys()
Loading PrimeKG dataset... Loading nodes of PrimeKG dataset ... ../../../../data/primekg/primekg_nodes.tsv.gz already exists. Loading the data from the local directory. Loading edges of PrimeKG dataset ... ../../../../data/primekg/primekg_edges.tsv.gz already exists. Loading the data from the local directory. Loading data config file of BioBridgePrimeKG... File data_config.json already exists in ../../../../data/biobridge_primekg/. Building node embeddings... Building full triplets... Building train-test split...
dict_keys(['gene/protein', 'molecular_function', 'cellular_component', 'biological_process', 'drug', 'disease'])
We also utilize another source of information, StarkQA PrimeKG, which provides additional information for each node in the graph.
We can use the StarkQAPrimeKG class to load the data.
Subsequently, after loading the data with the load_data method, we can use the get_starkqa_node_info method to obtain the node information of the StarkQA PrimeKG.
# As an additional source of information, we utilize StarkQA PrimeKG
starkqa_data = StarkQAPrimeKG(local_dir="../../../../data/starkqa_primekg/")
# Invoke a method to load the data
starkqa_data.load_data()
# Get the node information of the StarkQA PrimeKG
starkqa_node_info = starkqa_data.get_starkqa_node_info()
Loading StarkQAPrimeKG dataset... ../../../../data/starkqa_primekg/qa/prime/stark_qa/stark_qa.csv already exists. Loading the data from the local directory. Loading StarkQAPrimeKG embeddings...
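Before using this information for enrichment, we can peek at the structure of a single entry. The sketch below assumes, as used later in this tutorial, that starkqa_node_info is a dictionary keyed by node_index with a 'details' field per node (node_index 144 corresponds to SMAD3, which appears in the tables below).
# Inspect the structure of one StarkQA node entry (sketch)
sample_info = starkqa_node_info[144]
print(sample_info.keys())
# The 'details' field carries type-specific attributes (e.g., 'name' and 'summary' for genes/proteins)
if 'details' in sample_info:
    print(list(sample_info['details'].keys()))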
The following code prepares the node and edge dataframes from the BioBridge dataset.
# Prepare BioBridge-PrimeKG edges
# Build the node index list
node_info_dict = {}
node_index_list = []
for i, node_type in enumerate(biobridge_data.preselected_node_types):
df_node = pd.read_csv(os.path.join(biobridge_data.local_dir, "processed", f"{node_type}.csv"))
node_info_dict[biobridge_data.node_type_map[node_type]] = df_node
node_index_list.extend(df_node["node_index"].tolist())
# Filter the PrimeKG dataset to take into account only the selected node types
edges_df = biobridge_data.primekg.get_edges().copy()
edges_df = edges_df[
edges_df["head_index"].isin(node_index_list) &\
edges_df["tail_index"].isin(node_index_list)
]
edges_df = edges_df.reset_index(drop=True)
# Further filtering out some nodes in the embedding dictionary
edges_df = edges_df[
edges_df["head_index"].isin(list(biobridge_data.emb_dict.keys())) &\
edges_df["tail_index"].isin(list(biobridge_data.emb_dict.keys()))
].reset_index(drop=True)
# Prepare BioBridge-PrimeKG nodes
nodes_df = biobridge_data.primekg.get_nodes().copy()
nodes_df = nodes_df[nodes_df["node_index"].isin(np.unique(np.concatenate([edges_df.head_index.unique(),
edges_df.tail_index.unique()])))].reset_index(drop=True)
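As a quick sanity check over the dataframes built above, we can confirm that every edge endpoint is covered by the filtered node list (a minimal sketch):
# Sanity check (sketch): all head/tail indices in edges_df should appear in nodes_df
node_index_set = set(nodes_df["node_index"])
assert edges_df["head_index"].isin(node_index_set).all()
assert edges_df["tail_index"].isin(node_index_set).all()
print(f"Number of edges: {len(edges_df)}, number of nodes: {len(nodes_df)}")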
As we would like to use a small subset of the PrimeKG dataset in this tutorial, we will load the IBD graph data and use it to further filter the BioBridge-PrimeKG dataset.
# Load IBD PyG data to further filter the nodes
local_dir = '../../../aiagents4pharma/talk2knowledgegraphs/tests/files/'
with open(os.path.join(local_dir, 'primekg_ibd_pyg_graph.pkl'), 'rb') as f:
ibd_pyg_graph = pickle.load(f)
# Get node name
ibd_node_name = [node.split('_')[0] for node in ibd_pyg_graph.node_id]
# ibd_node_name
# Filter the nodes using node name existing in the IBD PyG graph
nodes_df = nodes_df[nodes_df["node_name"].isin(ibd_node_name)].reset_index(drop=True)
nodes_df.head(5)
| | node_index | node_name | node_source | node_id | node_type |
---|---|---|---|---|---|
0 | 144 | SMAD3 | NCBI | 4088 | gene/protein |
1 | 179 | IL10RB | NCBI | 3588 | gene/protein |
2 | 192 | GNA12 | NCBI | 2768 | gene/protein |
3 | 279 | HNF4A | NCBI | 3172 | gene/protein |
4 | 417 | VCAM1 | NCBI | 7412 | gene/protein |
# Check the number of nodes
print(f"Number of nodes: {len(nodes_df)}")
Number of nodes: 2991
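We can also break this count down by node type to see which modalities remain after the IBD filtering (a small sketch using pandas value_counts):
# Distribution of node types in the filtered subgraph (sketch)
print(nodes_df["node_type"].value_counts())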
Modal-Specific Enrichment & Embedding¶
The BioBridge dataset provides multimodal data for diverse node types, including gene/protein, molecular_function, cellular_component, biological_process, drug, and disease. The following code snippet demonstrates how to obtain such information.
# Define feature columns
dict_feature_columns = {
"gene/protein": "sequence",
"molecular_function": "description",
"cellular_component": "description",
"biological_process": "description",
"drug": "smiles",
"disease": "definition",
}
# Obtain the node embeddings of the BioBridge
biobridge_node_embeddings = biobridge_data.get_node_embeddings()
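As used in the filtering steps below, biobridge_node_embeddings behaves as a dictionary keyed by node_index. A minimal sketch to check its size and the dimensionality of a single pre-loaded embedding, under that assumption:
# Peek at the pre-loaded BioBridge embeddings (sketch; assumes a dict keyed by node_index)
print(f"Number of pre-loaded embeddings: {len(biobridge_node_embeddings)}")
sample_index = next(iter(biobridge_node_embeddings))
print(f"Embedding dimension of node {sample_index}: {len(biobridge_node_embeddings[sample_index])}")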
Node Enrichment & Embedding¶
As mentioned earlier, we can use the StarkQA PrimeKG dataset to simplify the textual enrichment of the nodes.
def get_textual_enrichment(data, node_info):
"""
Enrich the node with additional information from StarkQA-PrimeKG
Args:
data (dict): The node data from PrimeKG
node_info (dict): The node information from StarkQA-PrimeKG
"""
# Basic textual enrichment of the node
enriched_node = f"{data['node_name']} belongs to {data['node_type']} node. "
# Only enrich the node if the node type is gene/protein, drug, disease, or pathway, which
# has additional information in the node_info of StarkQA-PrimeKG
added_info = ''
if data['node_type'] == 'gene/protein':
added_info += f"{data['node_name']} is {node_info['details']['name']}. " if 'name' in node_info['details'] else ''
added_info += node_info['details']['summary'] if 'summary' in node_info['details'] else ''
elif data['node_type'] == 'drug':
added_info = ' '.join([str(node_info['details']['description']).replace('nan', ''),
str(node_info['details']['mechanism_of_action']).replace('nan', ''),
str(node_info['details']['protein_binding']).replace('nan', ''),
str(node_info['details']['pharmacodynamics']).replace('nan', ''),
str(node_info['details']['indication']).replace('nan', '')])
elif data['node_type'] == 'disease':
added_info = ' '.join([str(node_info['details']['mondo_definition']).replace('nan', ''),
str(node_info['details']['mayo_symptoms']).replace('nan', ''),
str(node_info['details']['mayo_causes']).replace('nan', '')])
elif data['node_type'] == 'pathway':
added_info += f"This pathway found in {node_info['details']['speciesName']}. " + ' '.join([x['text'] for x in node_info['details']['summation']]) if 'details' in node_info else ''
# Append the additional information for enrichment
enriched_node += added_info
return enriched_node
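Before applying the function to the whole dataframe, we can try it on a single row (a quick sketch using the first row of nodes_df):
# Try the textual enrichment on a single node first (sketch)
sample_row = nodes_df.iloc[0]
print(get_textual_enrichment(sample_row, starkqa_node_info[sample_row["node_index"]]))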
# Enrich the node with additional textual description from StarkQA-PrimeKG
nodes_df["desc"] = nodes_df.apply(lambda x: get_textual_enrichment(x, starkqa_node_info[x['node_index']]), axis=1)
nodes_df.head(5)
| | node_index | node_name | node_source | node_id | node_type | desc |
---|---|---|---|---|---|---|
0 | 144 | SMAD3 | NCBI | 4088 | gene/protein | SMAD3 belongs to gene/protein node. SMAD3 is S... |
1 | 179 | IL10RB | NCBI | 3588 | gene/protein | IL10RB belongs to gene/protein node. IL10RB is... |
2 | 192 | GNA12 | NCBI | 2768 | gene/protein | GNA12 belongs to gene/protein node. GNA12 is G... |
3 | 279 | HNF4A | NCBI | 3172 | gene/protein | HNF4A belongs to gene/protein node. HNF4A is h... |
4 | 417 | VCAM1 | NCBI | 7412 | gene/protein | VCAM1 belongs to gene/protein node. VCAM1 is v... |
Afterwards, we embed this description column using the Ollama model (i.e., nomic-embed-text).
# Update textual pre-loaded embeddings from BioBridge with 'nomic-embed-text' embeddings
# Using nomic-ai/nomic-embed-text-v1.5 model via Ollama
emb_model = EmbeddingWithOllama(model_name='nomic-embed-text')
# Use mini-batch processing to perform the embedding
mini_batch_size = 100
desc_embeddings = []
for i in tqdm(range(0, nodes_df.shape[0], mini_batch_size)):
outputs = emb_model.embed_documents(nodes_df.desc.values.tolist()[i:i+mini_batch_size])
desc_embeddings.extend(outputs)
# Add them as features to the dataframe
nodes_df['desc_x'] = desc_embeddings
nodes_df.head(5)
100%|██████████| 30/30 [00:18<00:00, 1.66it/s]
| | node_index | node_name | node_source | node_id | node_type | desc | desc_x |
---|---|---|---|---|---|---|---|
0 | 144 | SMAD3 | NCBI | 4088 | gene/protein | SMAD3 belongs to gene/protein node. SMAD3 is S... | [0.029749377, 0.053500228, -0.1706713, -0.0258... |
1 | 179 | IL10RB | NCBI | 3588 | gene/protein | IL10RB belongs to gene/protein node. IL10RB is... | [0.028421732, 0.019860065, -0.16853006, -0.038... |
2 | 192 | GNA12 | NCBI | 2768 | gene/protein | GNA12 belongs to gene/protein node. GNA12 is G... | [0.003668847, 0.05138056, -0.13865656, -0.0554... |
3 | 279 | HNF4A | NCBI | 3172 | gene/protein | HNF4A belongs to gene/protein node. HNF4A is h... | [0.017971933, 0.021827668, -0.15494126, -0.000... |
4 | 417 | VCAM1 | NCBI | 7412 | gene/protein | VCAM1 belongs to gene/protein node. VCAM1 is v... | [0.04492683, 0.02438596, -0.15689379, -0.02166... |
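As a quick check, we can inspect the dimensionality of the resulting description embeddings (nomic-embed-text produces 768-dimensional vectors by default; the sketch below simply measures the first row):
# Check the dimensionality of the description embeddings (sketch)
print(f"desc_x dimension: {len(nodes_df['desc_x'].iloc[0])}")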
We then obtain the enriched node features from the BioBridge data along with its pre-loaded embeddings.
# Obtain modality-specific information
nodes_df["enriched_node"] = nodes_df.apply(lambda x:
biobridge_node_info[x["node_type"]][biobridge_node_info[x["node_type"]]["node_index"] == x["node_index"]][dict_feature_columns[x["node_type"]]].values[0], axis=1)
nodes_df["enriched_node"] = nodes_df.apply(lambda x:
x["enriched_node"]
if not pd.isnull(x["enriched_node"]) else x["node_name"], axis=1)
nodes_df["x"] = nodes_df.apply(lambda x:
biobridge_node_embeddings[x["node_index"]]
if x["node_index"] in biobridge_node_embeddings else np.NaN, axis=1)
nodes_df.dropna(subset=["x"], inplace=True)
nodes_df.head(5)
| | node_index | node_name | node_source | node_id | node_type | desc | desc_x | enriched_node | x |
---|---|---|---|---|---|---|---|---|---|
0 | 144 | SMAD3 | NCBI | 4088 | gene/protein | SMAD3 belongs to gene/protein node. SMAD3 is S... | [0.029749377, 0.053500228, -0.1706713, -0.0258... | MSSILPFTPPIVKRLLGWKKGEQNGQEEKWCEKAVKSLVKKLKKTG... | [-0.014456028118729591, -0.03834506496787071, ... |
1 | 179 | IL10RB | NCBI | 3588 | gene/protein | IL10RB belongs to gene/protein node. IL10RB is... | [0.028421732, 0.019860065, -0.16853006, -0.038... | MAWSLGSWLGGCLLVSALGMVPPPENVRMNSVNFKNILQWESPAFA... | [-0.06711604446172714, 0.058091215789318085, 0... |
2 | 192 | GNA12 | NCBI | 2768 | gene/protein | GNA12 belongs to gene/protein node. GNA12 is G... | [0.003668847, 0.05138056, -0.13865656, -0.0554... | MSGVVRTLSRCLLPAEAGGARERRAGSGARDAEREARRRSRDIDAL... | [-0.015191752463579178, -0.13006462156772614, ... |
3 | 279 | HNF4A | NCBI | 3172 | gene/protein | HNF4A belongs to gene/protein node. HNF4A is h... | [0.017971933, 0.021827668, -0.15494126, -0.000... | MRLSKTLVDMDMADYSAALDPAYTTLEFENVQVLTMGNDTSPSEGT... | [0.0008836743654683232, 0.011145174503326416, ... |
4 | 417 | VCAM1 | NCBI | 7412 | gene/protein | VCAM1 belongs to gene/protein node. VCAM1 is v... | [0.04492683, 0.02438596, -0.15689379, -0.02166... | MPGKMVVILGASNILWIMFAASQAFKIETTPESRYLAQIGDSVSLT... | [0.008272849954664707, 0.04085301235318184, 0.... |
# Check if there are any NaN values in the enriched_node column
nodes_df["enriched_node"].isna().any()
False
Note that, for nodes with textual embeddings, we will replace the original embeddings with new ones retrieved from the Ollama model (to be further used in the subsequent talk2knowledgegraphs application).
# Update textual pre-loaded embeddings from BioBridge with 'nomic-embed-text' embeddings
# Using nomic-ai/nomic-embed-text-v1.5 model via Ollama
emb_model = EmbeddingWithOllama(model_name='nomic-embed-text')
# Since the node records contain a large amount of data, we will split them into mini-batches
mini_batch_size = 100
text_based_df = nodes_df[nodes_df.node_type.isin(['disease', 'biological_process', 'cellular_component', 'molecular_function'])]
text_node_indexes = []
text_node_embeddings = []
for i in tqdm(range(0, text_based_df.shape[0], mini_batch_size)):
outputs = emb_model.embed_documents(text_based_df.enriched_node.values.tolist()[i:i+mini_batch_size])
text_node_indexes.extend(text_based_df.node_index.values.tolist()[i:i+mini_batch_size])
text_node_embeddings.extend(outputs)
dic_text_embeddings = dict(zip(text_node_indexes, text_node_embeddings))
# dic_text_embeddings
100%|██████████| 22/22 [00:08<00:00, 2.50it/s]
# Replace the embeddings of the nodes with the updated embeddings for text-based nodes
nodes_df["x"] = nodes_df.apply(lambda x: dic_text_embeddings[x["node_index"]] if x["node_index"] in dic_text_embeddings else x["x"], axis=1)
nodes_df.head(5)
| | node_index | node_name | node_source | node_id | node_type | desc | desc_x | enriched_node | x |
---|---|---|---|---|---|---|---|---|---|
0 | 144 | SMAD3 | NCBI | 4088 | gene/protein | SMAD3 belongs to gene/protein node. SMAD3 is S... | [0.029749377, 0.053500228, -0.1706713, -0.0258... | MSSILPFTPPIVKRLLGWKKGEQNGQEEKWCEKAVKSLVKKLKKTG... | [-0.014456028118729591, -0.03834506496787071, ... |
1 | 179 | IL10RB | NCBI | 3588 | gene/protein | IL10RB belongs to gene/protein node. IL10RB is... | [0.028421732, 0.019860065, -0.16853006, -0.038... | MAWSLGSWLGGCLLVSALGMVPPPENVRMNSVNFKNILQWESPAFA... | [-0.06711604446172714, 0.058091215789318085, 0... |
2 | 192 | GNA12 | NCBI | 2768 | gene/protein | GNA12 belongs to gene/protein node. GNA12 is G... | [0.003668847, 0.05138056, -0.13865656, -0.0554... | MSGVVRTLSRCLLPAEAGGARERRAGSGARDAEREARRRSRDIDAL... | [-0.015191752463579178, -0.13006462156772614, ... |
3 | 279 | HNF4A | NCBI | 3172 | gene/protein | HNF4A belongs to gene/protein node. HNF4A is h... | [0.017971933, 0.021827668, -0.15494126, -0.000... | MRLSKTLVDMDMADYSAALDPAYTTLEFENVQVLTMGNDTSPSEGT... | [0.0008836743654683232, 0.011145174503326416, ... |
4 | 417 | VCAM1 | NCBI | 7412 | gene/protein | VCAM1 belongs to gene/protein node. VCAM1 is v... | [0.04492683, 0.02438596, -0.15689379, -0.02166... | MPGKMVVILGASNILWIMFAASQAFKIETTPESRYLAQIGDSVSLT... | [0.008272849954664707, 0.04085301235318184, 0.... |
# Statistics of nodes
print("Number of nodes in BioBridge-PrimeKG: %d" % nodes_df.shape[0])
Number of nodes in BioBridge-PrimeKG: 2991
def store_data_into_blosc(data, path, filename, typesize=8, cname='zstd', clevel=9):
"""
Store data into a blosc file.
"""
# Create the directory if it doesn't exist
os.makedirs(path, exist_ok=True)
# Serialize the data using pickle
serialized_data = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)
# Compress the serialized data using blosc
compressed_data = blosc.compress(serialized_data, typesize=typesize, cname=cname, clevel=clevel)
# Save the compressed data to a file
with open(os.path.join(path, filename), 'wb') as f:
f.write(compressed_data)
print(f"Data is successfully stored in {os.path.join(path, filename)}")
def load_data_from_blosc(path, filename):
"""
Load data from a blosc file.
"""
# Read the compressed data from the file
with open(os.path.join(path, filename), 'rb') as f:
compressed_data = f.read()
# Decompress the data using blosc
decompressed_data = blosc.decompress(compressed_data)
# Deserialize the data using pickle
data = pickle.loads(decompressed_data)
# Return the data
return data
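The two helpers are intended to be symmetric. A minimal round-trip sketch (using a hypothetical temporary file name) to confirm that stored data can be read back unchanged:
# Round-trip sketch for the blosc helpers (hypothetical file name)
_sample = {"a": [1, 2, 3], "b": "text"}
store_data_into_blosc(_sample, "/tmp", "blosc_roundtrip_test.blosc")
assert load_data_from_blosc("/tmp", "blosc_roundtrip_test.blosc") == _sample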
# We would like to store both the metadata and the embeddings as blosc files
# Save the nodes dataframe
local_dir = '../../../../data/biobridge_primekg/'
store_data_into_blosc(nodes_df[["node_index", "node_name", "node_source", "node_id", "node_type"]],
local_dir,
'biobridge_nodes.blosc')
# Save the node embeddings (desc)
store_data_into_blosc(dict(zip(nodes_df["node_index"], nodes_df["desc_x"])),
local_dir,
'biobridge_nodes_desc_embeddings.blosc')
# Save the node embeddings
store_data_into_blosc(dict(zip(nodes_df["node_index"], nodes_df["x"])),
local_dir,
'biobridge_nodes_embeddings.blosc')
Data is successfully stored in ../../../../data/biobridge_primekg/biobridge_nodes.blosc Data is successfully stored in ../../../../data/biobridge_primekg/biobridge_nodes_desc_embeddings.blosc Data is successfully stored in ../../../../data/biobridge_primekg/biobridge_nodes_embeddings.blosc
# Uncomment the following lines to load the data from the blosc files
# local_dir = '../../../../data/biobridge_primekg/'
# nodes_ = load_data_from_blosc(local_dir, 'biobridge_nodes.blosc')
# nodes_desc_embeddings_dict_ = load_data_from_blosc(local_dir, 'biobridge_nodes_desc_embeddings.blosc')
# nodes_embeddings_dict_ = load_data_from_blosc(local_dir, 'biobridge_nodes_embeddings.blosc')
# print("Number of nodes in BioBridge-PrimeKG: %d" % len(nodes_))
# print("Number of nodes embeddings in BioBridge-PrimeKG: %d" % len(nodes_embeddings_dict_))
Edge Enrichment & Embedding¶
We will also perform enrichment and embedding for the edges of the BioBridge-PrimeKG.
This time, we use simple textual enrichment by concatenating the head node, the relation, and the tail node.
# Filter edges whose head and tail nodes exist in BioBridge-PrimeKG
edges_df = edges_df[edges_df['head_index'].isin(nodes_df.node_index.unique()) &
edges_df['tail_index'].isin(nodes_df.node_index.unique())]
# Adding an additional column to the edges dataframe
edges_df["edge_type"] = edges_df.apply(lambda x: (x.head_type, x.display_relation, x.tail_type), axis=1)
# As of now, we are enriching each edge using textual information
# Perform textual enrichment over the edges by concatenating the head and tail nodes with the relation between them
text_enriched_edges = edges_df.apply(lambda x: f"{x['head_name']} ({x['head_type']}) has a direct relationship of {x['relation']}:{x['display_relation']} with {x['tail_name']} ({x['tail_type']}).", axis=1).tolist()
edges_df['enriched_edge'] = text_enriched_edges
edges_df.head(5)
| | head_index | head_name | head_source | head_id | head_type | tail_index | tail_name | tail_source | tail_id | tail_type | display_relation | relation | edge_type | enriched_edge |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4104 | 1004 | IL1B | NCBI | 3553 | gene/protein | 772 | RELA | NCBI | 5970 | gene/protein | ppi | protein_protein | (gene/protein, ppi, gene/protein) | IL1B (gene/protein) has a direct relationship ... |
11048 | 4968 | ICAM1 | NCBI | 3383 | gene/protein | 729 | STAT3 | NCBI | 6774 | gene/protein | ppi | protein_protein | (gene/protein, ppi, gene/protein) | ICAM1 (gene/protein) has a direct relationship... |
17692 | 772 | RELA | NCBI | 5970 | gene/protein | 11134 | NR1H4 | NCBI | 9971 | gene/protein | ppi | protein_protein | (gene/protein, ppi, gene/protein) | RELA (gene/protein) has a direct relationship ... |
17800 | 2384 | CRP | NCBI | 1401 | gene/protein | 2057 | FN1 | NCBI | 2335 | gene/protein | ppi | protein_protein | (gene/protein, ppi, gene/protein) | CRP (gene/protein) has a direct relationship o... |
20031 | 3259 | TLR4 | NCBI | 7099 | gene/protein | 4731 | RIPK2 | NCBI | 8767 | gene/protein | ppi | protein_protein | (gene/protein, ppi, gene/protein) | TLR4 (gene/protein) has a direct relationship ... |
Since we filtered the nodes with the IBD data above, we can filter the edges accordingly as well.
# Filter the edges based on the IBD PyG graph
ibd_edges_df = pd.DataFrame({
'head_index' : ibd_pyg_graph.head_id,
'tail_index' : ibd_pyg_graph.tail_id,
'edge_type' : ibd_pyg_graph.edge_type,
})
ibd_edges_df["head_index"] = ibd_edges_df["head_index"].apply(lambda x: int(x.split("_(")[1].replace(")", "")))
ibd_edges_df["tail_index"] = ibd_edges_df["tail_index"].apply(lambda x: int(x.split("_(")[1].replace(")", "")))
ibd_edges_df["display_relation"] = ibd_edges_df["edge_type"].apply(lambda x: x[1])
ibd_edges_df.drop(columns=["edge_type"], inplace=True)
# Merge the edges dataframe with the IBD edges dataframe
edges_df = pd.merge(edges_df, ibd_edges_df, how='inner', on=['head_index', 'tail_index', 'display_relation'], suffixes=('', '_y'))
edges_df.head(5)
| | head_index | head_name | head_source | head_id | head_type | tail_index | tail_name | tail_source | tail_id | tail_type | display_relation | relation | edge_type | enriched_edge |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 14118 | Rose bengal | DrugBank | DB11182 | drug | 3233 | LTF | NCBI | 4057 | gene/protein | carrier | drug_protein | (drug, carrier, gene/protein) | Rose bengal (drug) has a direct relationship o... |
1 | 14038 | Fluticasone furoate | DrugBank | DB08906 | drug | 4152 | ABCB1 | NCBI | 5243 | gene/protein | carrier | drug_protein | (drug, carrier, gene/protein) | Fluticasone furoate (drug) has a direct relati... |
2 | 14555 | Technetium Tc-99m tetrofosmin | DrugBank | DB09160 | drug | 4152 | ABCB1 | NCBI | 5243 | gene/protein | carrier | drug_protein | (drug, carrier, gene/protein) | Technetium Tc-99m tetrofosmin (drug) has a dir... |
3 | 14040 | Fluticasone | DrugBank | DB13867 | drug | 4152 | ABCB1 | NCBI | 5243 | gene/protein | carrier | drug_protein | (drug, carrier, gene/protein) | Fluticasone (drug) has a direct relationship o... |
4 | 14060 | Levothyroxine | DrugBank | DB00451 | drug | 4152 | ABCB1 | NCBI | 5243 | gene/protein | enzyme | drug_protein | (drug, enzyme, gene/protein) | Levothyroxine (drug) has a direct relationship... |
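At this point, we can also inspect the distribution of relation types among the remaining edges (a small sketch):
# Distribution of relation types in the IBD-filtered edges (sketch)
print(edges_df["display_relation"].value_counts())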
After that, we perform the same embedding process for the edges using the Ollama model.
# Using nomic-ai/nomic-embed-text-v1.5 model via Ollama
emb_model = EmbeddingWithOllama(model_name='nomic-embed-text')
# Since the edge records contain a large amount of data, we will split them into mini-batches
mini_batch_size = 100
edge_embeddings = []
for i in tqdm(range(0, edges_df.shape[0], mini_batch_size)):
outputs = emb_model.embed_documents(edges_df.enriched_edge.values.tolist()[i:i+mini_batch_size])
edge_embeddings.extend(outputs)
# Add them as features to the dataframe
edges_df['edge_attr'] = edge_embeddings
100%|██████████| 113/113 [00:43<00:00, 2.61it/s]
# Drop and rename several columns of the edges dataframe
edges_df.drop(columns=['head_source', 'head_id', 'head_type', 'tail_source', 'tail_id', 'tail_type', 'relation'], inplace=True)
edges_df.rename(columns={'head_index': 'head_id', 'tail_index': 'tail_id'}, inplace=True)
# Check dataframe of edges
edges_df.head(5)
| | head_id | head_name | tail_id | tail_name | display_relation | edge_type | enriched_edge | edge_attr |
---|---|---|---|---|---|---|---|---|
0 | 14118 | Rose bengal | 3233 | LTF | carrier | (drug, carrier, gene/protein) | Rose bengal (drug) has a direct relationship o... | [0.071049586, 0.0060329223, -0.17035195, 0.001... |
1 | 14038 | Fluticasone furoate | 4152 | ABCB1 | carrier | (drug, carrier, gene/protein) | Fluticasone furoate (drug) has a direct relati... | [0.025471492, 0.054160915, -0.17022943, -0.018... |
2 | 14555 | Technetium Tc-99m tetrofosmin | 4152 | ABCB1 | carrier | (drug, carrier, gene/protein) | Technetium Tc-99m tetrofosmin (drug) has a dir... | [-0.008589362, 0.06356438, -0.14342338, -0.003... |
3 | 14040 | Fluticasone | 4152 | ABCB1 | carrier | (drug, carrier, gene/protein) | Fluticasone (drug) has a direct relationship o... | [0.021936357, 0.05227478, -0.16180754, -0.0218... |
4 | 14060 | Levothyroxine | 4152 | ABCB1 | enzyme | (drug, enzyme, gene/protein) | Levothyroxine (drug) has a direct relationship... | [0.023618879, 0.018524365, -0.1605938, 0.00940... |
# Statistics of edges
print("Number of edges in BioBridge-PrimeKG: %d" % edges_df.shape[0])
Number of edges in BioBridge-PrimeKG: 11272
# Make an additional edge index column as identifier
edges_df.reset_index(inplace=True)
edges_df.rename(columns={'index': 'triplet_index'}, inplace=True)
edges_df.head(5)
| | triplet_index | head_id | head_name | tail_id | tail_name | display_relation | edge_type | enriched_edge | edge_attr |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 14118 | Rose bengal | 3233 | LTF | carrier | (drug, carrier, gene/protein) | Rose bengal (drug) has a direct relationship o... | [0.071049586, 0.0060329223, -0.17035195, 0.001... |
1 | 1 | 14038 | Fluticasone furoate | 4152 | ABCB1 | carrier | (drug, carrier, gene/protein) | Fluticasone furoate (drug) has a direct relati... | [0.025471492, 0.054160915, -0.17022943, -0.018... |
2 | 2 | 14555 | Technetium Tc-99m tetrofosmin | 4152 | ABCB1 | carrier | (drug, carrier, gene/protein) | Technetium Tc-99m tetrofosmin (drug) has a dir... | [-0.008589362, 0.06356438, -0.14342338, -0.003... |
3 | 3 | 14040 | Fluticasone | 4152 | ABCB1 | carrier | (drug, carrier, gene/protein) | Fluticasone (drug) has a direct relationship o... | [0.021936357, 0.05227478, -0.16180754, -0.0218... |
4 | 4 | 14060 | Levothyroxine | 4152 | ABCB1 | enzyme | (drug, enzyme, gene/protein) | Levothyroxine (drug) has a direct relationship... | [0.023618879, 0.018524365, -0.1605938, 0.00940... |
# We would like to store both the metadata and the embeddings as blosc files
# Save the edges dataframe
local_dir = '../../../../data/biobridge_primekg/'
store_data_into_blosc(edges_df[["head_id", "head_name", "tail_id", "tail_name", "display_relation", "enriched_edge"]],
local_dir,
'biobridge_edges.blosc')
Data is successfully stored in ../../../../data/biobridge_primekg/biobridge_edges.blosc
# Save the edges embeddings
store_data_into_blosc(dict(zip(edges_df["triplet_index"], edges_df["edge_attr"])),
local_dir,
'biobridge_edges_embeddings.blosc')
Data is successfully stored in ../../../../data/biobridge_primekg/biobridge_edges_embeddings.blosc
# Uncomment the following lines to load the data from the blosc files
# edges_ = load_data_from_blosc(local_dir, 'biobridge_edges.blosc')
# edges_embeddings_dict_ = load_data_from_blosc(local_dir, 'biobridge_edges_embeddings.blosc')
# print("Number of edges in BioBridge-PrimeKG: %d" % len(edges_))
# print("Number of edge embeddings in BioBridge-PrimeKG: %d" % len(edges_embeddings_dict_))
Knowledge Graph Construction¶
We would like to convert our dataframes to a networkx DiGraph object.
# Modify the node dataframe
nodes_df["node"] = nodes_df.apply(lambda x: f"{x.node_name}_({x.node_index})", axis=1)
nodes_df["node_id"] = nodes_df.apply(lambda x: f"{x.node_name}_({x.node_index})", axis=1)
nodes_df.drop(columns=['node_index', 'node_source'], inplace=True)
nodes_df.set_index('node', inplace=True)
nodes_df.head(5)
| node | node_name | node_id | node_type | desc | desc_x | enriched_node | x |
---|---|---|---|---|---|---|---|
SMAD3_(144) | SMAD3 | SMAD3_(144) | gene/protein | SMAD3 belongs to gene/protein node. SMAD3 is S... | [0.029749377, 0.053500228, -0.1706713, -0.0258... | MSSILPFTPPIVKRLLGWKKGEQNGQEEKWCEKAVKSLVKKLKKTG... | [-0.014456028118729591, -0.03834506496787071, ... |
IL10RB_(179) | IL10RB | IL10RB_(179) | gene/protein | IL10RB belongs to gene/protein node. IL10RB is... | [0.028421732, 0.019860065, -0.16853006, -0.038... | MAWSLGSWLGGCLLVSALGMVPPPENVRMNSVNFKNILQWESPAFA... | [-0.06711604446172714, 0.058091215789318085, 0... |
GNA12_(192) | GNA12 | GNA12_(192) | gene/protein | GNA12 belongs to gene/protein node. GNA12 is G... | [0.003668847, 0.05138056, -0.13865656, -0.0554... | MSGVVRTLSRCLLPAEAGGARERRAGSGARDAEREARRRSRDIDAL... | [-0.015191752463579178, -0.13006462156772614, ... |
HNF4A_(279) | HNF4A | HNF4A_(279) | gene/protein | HNF4A belongs to gene/protein node. HNF4A is h... | [0.017971933, 0.021827668, -0.15494126, -0.000... | MRLSKTLVDMDMADYSAALDPAYTTLEFENVQVLTMGNDTSPSEGT... | [0.0008836743654683232, 0.011145174503326416, ... |
VCAM1_(417) | VCAM1 | VCAM1_(417) | gene/protein | VCAM1 belongs to gene/protein node. VCAM1 is v... | [0.04492683, 0.02438596, -0.15689379, -0.02166... | MPGKMVVILGASNILWIMFAASQAFKIETTPESRYLAQIGDSVSLT... | [0.008272849954664707, 0.04085301235318184, 0.... |
# Modify the edge dataframe
edges_df["head_id"] = edges_df.apply(lambda x: f"{x.head_name}_({x.head_id})", axis=1)
edges_df["tail_id"] = edges_df.apply(lambda x: f"{x.tail_name}_({x.tail_id})", axis=1)
edges_df.reset_index(drop=True, inplace=True)
edges_df.head(5)
| | triplet_index | head_id | head_name | tail_id | tail_name | display_relation | edge_type | enriched_edge | edge_attr |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | Rose bengal_(14118) | Rose bengal | LTF_(3233) | LTF | carrier | (drug, carrier, gene/protein) | Rose bengal (drug) has a direct relationship o... | [0.071049586, 0.0060329223, -0.17035195, 0.001... |
1 | 1 | Fluticasone furoate_(14038) | Fluticasone furoate | ABCB1_(4152) | ABCB1 | carrier | (drug, carrier, gene/protein) | Fluticasone furoate (drug) has a direct relati... | [0.025471492, 0.054160915, -0.17022943, -0.018... |
2 | 2 | Technetium Tc-99m tetrofosmin_(14555) | Technetium Tc-99m tetrofosmin | ABCB1_(4152) | ABCB1 | carrier | (drug, carrier, gene/protein) | Technetium Tc-99m tetrofosmin (drug) has a dir... | [-0.008589362, 0.06356438, -0.14342338, -0.003... |
3 | 3 | Fluticasone_(14040) | Fluticasone | ABCB1_(4152) | ABCB1 | carrier | (drug, carrier, gene/protein) | Fluticasone (drug) has a direct relationship o... | [0.021936357, 0.05227478, -0.16180754, -0.0218... |
4 | 4 | Levothyroxine_(14060) | Levothyroxine | ABCB1_(4152) | ABCB1 | enzyme | (drug, enzyme, gene/protein) | Levothyroxine (drug) has a direct relationship... | [0.023618879, 0.018524365, -0.1605938, 0.00940... |
# Convert dataframes to a knowledge graph as a networkx object
kg = nx.DiGraph()
for i, row in nodes_df.iterrows():
kg.add_node(row['node_id'], **row.to_dict())
for i, row in edges_df.iterrows():
kg.add_edge(row['head_id'], row['tail_id'], key=i, **row.to_dict())
# Save graph object
local_dir = '../../../aiagents4pharma/talk2knowledgegraphs/tests/files/'
with open(os.path.join(local_dir, 'biobridge_multimodal_nx_graph.pkl'), 'wb') as f:
pickle.dump(kg, f)
# Load graph object
with open(os.path.join(local_dir, 'biobridge_multimodal_nx_graph.pkl'), 'rb') as f:
kg_2 = pickle.load(f)
print ("#Nodes", kg.number_of_nodes())
print ("#Edges", kg.number_of_edges())
#Nodes 2991 #Edges 11272
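To confirm that the node and edge attributes were carried into the graph, we can inspect one node and one edge (a minimal sketch; the attribute names follow the dataframes above):
# Inspect the attributes attached to the networkx graph (sketch)
sample_node = next(iter(kg.nodes))
print(sample_node, list(kg.nodes[sample_node].keys()))
sample_edge = next(iter(kg.edges))
print(sample_edge, list(kg.edges[sample_edge].keys()))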
We can convert the networkx graph to a PyG Data object.
# Convert networkx graph to PyG data object
pyg_graph = from_networkx(kg)
pyg_graph.num_nodes = kg.number_of_nodes()
pyg_graph.num_edges = kg.number_of_edges()
# Save graph object
with open(os.path.join(local_dir, 'biobridge_multimodal_pyg_graph.pkl'), 'wb') as f:
pickle.dump(pyg_graph, f)
# Load graph object
# with open(os.path.join(local_dir, 'biobridge_multimodal_pyg_graph.pkl'), 'rb') as f:
# pyg_graph = pickle.load(f)
Lastly, we will also prepare a textualized graph of nodes and edges, for instance for use in a RAG application.
# Prepare nodes
nodes_df = pd.DataFrame({
'node_id': list(pyg_graph.node_id),
'node_attr': list(pyg_graph.desc),
})
nodes_df.head(5)
| | node_id | node_attr |
---|---|---|
0 | SMAD3_(144) | SMAD3 belongs to gene/protein node. SMAD3 is S... |
1 | IL10RB_(179) | IL10RB belongs to gene/protein node. IL10RB is... |
2 | GNA12_(192) | GNA12 belongs to gene/protein node. GNA12 is G... |
3 | HNF4A_(279) | HNF4A belongs to gene/protein node. HNF4A is h... |
4 | VCAM1_(417) | VCAM1 belongs to gene/protein node. VCAM1 is v... |
# Prepare edges
edges_df = pd.DataFrame({
'head_id': list(pyg_graph.head_id),
'edge_type': list(pyg_graph.edge_type),
'tail_id': list(pyg_graph.tail_id),
})
edges_df.head(5)
| | head_id | edge_type | tail_id |
---|---|---|---|
0 | SMAD3_(144) | (gene/protein, associated with, disease) | Crohn disease_(37784) |
1 | SMAD3_(144) | (gene/protein, associated with, disease) | inflammatory bowel disease_(28158) |
2 | SMAD3_(144) | (gene/protein, associated with, disease) | Crohn's colitis_(83770) |
3 | SMAD3_(144) | (gene/protein, associated with, disease) | Crohn ileitis and jejunitis_(35814) |
4 | SMAD3_(144) | (gene/protein, interacts with, molecular_funct... | protein binding_(53699) |
with open(os.path.join(local_dir, 'biobridge_multimodal_text_graph.pkl'), "wb") as f:
pickle.dump({"nodes": nodes_df, "edges": edges_df}, f)