PrimeKG Loader¶
In this tutorial, we explain how to load the PrimeKG dataframes that describe the entities and relations of the knowledge graph.
Prior information about the PrimeKG can be found in the following repositories:
Note that we use the PrimeKG release hosted on Harvard Dataverse, which is publicly available at the following link:
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/IXA7BM
At the time of writing this tutorial, the latest version of PrimeKG (kg.csv) is 2.1.
First of all, we need to import necessary libraries as follows:
# Import necessary libraries
import sys
sys.path.append('../../..')
from aiagents4pharma.talk2knowledgegraphs.datasets.primekg import PrimeKG
Load PrimeKG¶
The PrimeKG dataset loader downloads the data from the Harvard Dataverse server if it is not available locally.
Otherwise, the data is loaded from the local directory given by local_dir.
# Define primekg data by providing a local directory where the data is stored
primekg_data = PrimeKG(local_dir="../../../../data/primekg/")
To load the dataframes of nodes and edges from PrimeKG, we call load_data() and then retrieve them with get_nodes() and get_edges(), as follows.
# Invoke a method to load the data
primekg_data.load_data()
# Get primekg_nodes and primekg_edges
primekg_nodes = primekg_data.get_nodes()
primekg_edges = primekg_data.get_edges()
Downloading node file from https://dataverse.harvard.edu/api/access/datafile/6180617
100%|██████████| 8.89M/8.89M [00:01<00:00, 8.06MiB/s]
Downloading edge file from https://dataverse.harvard.edu/api/access/datafile/6180616
100%|██████████| 387M/387M [00:35<00:00, 10.8MiB/s]
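The download-if-missing behavior described above can be illustrated with a minimal sketch. This is not the actual implementation of the PrimeKG class: `load_or_fetch` and the `fetch` callable are hypothetical names introduced here for illustration, and the fake fetcher stands in for a real download from a remote server.

```python
import tempfile
from pathlib import Path
from typing import Callable


def load_or_fetch(local_dir: str, filename: str, fetch: Callable[[], bytes]) -> bytes:
    """Return the file contents, calling `fetch` only if no local copy exists.

    Hypothetical helper: `fetch` stands in for a download from a remote server.
    """
    path = Path(local_dir) / filename
    if not path.exists():
        # No local copy yet: create the directory and store the fetched bytes.
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(fetch())
    # In both cases the data is read from the local copy.
    return path.read_bytes()


# Usage with a fake fetcher that records how often it is called.
calls = []


def fake_fetch() -> bytes:
    calls.append(1)
    return b"node_index,node_name\n0,PHYHIP\n"


with tempfile.TemporaryDirectory() as tmp:
    first = load_or_fetch(tmp, "nodes.csv", fake_fetch)   # triggers the fetch
    second = load_or_fetch(tmp, "nodes.csv", fake_fetch)  # served from disk
```

In this sketch the second call never reaches the fetcher, which mirrors why the download progress bars appear only on the first run.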
Check PrimeKG Dataframes¶
As mentioned before, the primekg_nodes and primekg_edges are the dataframes of nodes and edges, respectively.
We can further analyze the dataframes to extract the information we need.
For instance, we can construct a graph from the nodes and edges dataframes using the networkx library.
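As a minimal sketch of the graph construction mentioned above, the snippet below builds a networkx graph from two toy dataframes that reuse the PrimeKG column names. The toy rows, the variable names `nodes`, `edges`, and `graph`, and the choice of a directed multigraph are assumptions made for illustration; with the real data, `primekg_nodes` and `primekg_edges` would take the place of the toy frames.

```python
import networkx as nx
import pandas as pd

# Toy stand-ins for primekg_nodes and primekg_edges (same column names, few rows).
nodes = pd.DataFrame({
    "node_index": [0, 1, 8889],
    "node_name": ["PHYHIP", "GPANK1", "KIF15"],
    "node_type": ["gene/protein", "gene/protein", "gene/protein"],
})
edges = pd.DataFrame({
    "head_index": [0, 8889],
    "tail_index": [8889, 0],
    "display_relation": ["ppi", "ppi"],
})

# Build a directed multigraph so that parallel edges between two nodes are kept.
graph = nx.from_pandas_edgelist(
    edges,
    source="head_index",
    target="tail_index",
    edge_attr="display_relation",
    create_using=nx.MultiDiGraph,
)

# Attach node attributes; nodes that have no edges are added as isolated nodes.
for row in nodes.itertuples(index=False):
    graph.add_node(row.node_index, node_name=row.node_name, node_type=row.node_type)

print(graph.number_of_nodes(), graph.number_of_edges())
```

Building the full graph from about 8.1M edges can take noticeable time and memory, so filtering the edges dataframe to the relations of interest first may be worthwhile.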
PrimeKG Nodes¶
primekg_nodes is a dataframe of nodes, which has the following columns:
- node_index: the index of the node
- node_name: the name of the node
- node_source: the source database of the node
- node_id: the id of the node in its source database
- node_type: the type of the node
We can check a sample of the primekg nodes to see the list of nodes in the PrimeKG dataset as follows.
# Check a sample of the primekg nodes
primekg_nodes.head()
| | node_index | node_name | node_source | node_id | node_type |
|---|---|---|---|---|---|
| 0 | 0 | PHYHIP | NCBI | 9796 | gene/protein |
| 1 | 1 | GPANK1 | NCBI | 7918 | gene/protein |
| 2 | 2 | ZRSR2 | NCBI | 8233 | gene/protein |
| 3 | 3 | NRF1 | NCBI | 4899 | gene/protein |
| 4 | 4 | PI4KA | NCBI | 5297 | gene/protein |
The current version of PrimeKG has about 130K nodes in total (129,375), as we can observe in the following cell.
# Check dimensions of the primekg nodes
primekg_nodes.shape
(129375, 5)
We can break down the primekg nodes by type as follows.
# Show node types and their counts
primekg_nodes['node_type'].value_counts()
node_type
biological_process    28642
gene/protein          27671
disease               17080
effect/phenotype      15311
anatomy               14035
molecular_function    11169
drug                   7957
cellular_component     4176
pathway                2516
exposure                818
Name: count, dtype: int64
PrimeKG was built using various sources, as we can observe from their unique node sources as follows.
# Show source of the primekg nodes
primekg_nodes['node_source'].value_counts()
node_source
GO               43987
NCBI             27671
MONDO            15813
HPO              15311
UBERON           14035
DrugBank          7957
REACTOME          2516
MONDO_grouped     1267
CTD                818
Name: count, dtype: int64
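The two breakdowns can also be combined to see which source contributes which node types. The counts above are consistent with, for example, the 17,080 disease nodes coming from MONDO and MONDO_grouped together (15,813 + 1,267), but a cross-tabulation makes this explicit. The following is a minimal sketch on a toy dataframe with the same column names; the rows are made up for illustration, and `primekg_nodes` would replace `nodes` on the real data.

```python
import pandas as pd

# Toy stand-in for primekg_nodes; only the two columns of interest are included.
nodes = pd.DataFrame({
    "node_type": ["gene/protein", "gene/protein", "disease", "disease", "drug"],
    "node_source": ["NCBI", "NCBI", "MONDO", "MONDO_grouped", "DrugBank"],
})

# Rows are node types, columns are sources, cells are node counts.
type_by_source = pd.crosstab(nodes["node_type"], nodes["node_source"])
print(type_by_source)
```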
PrimeKG Edges¶
primekg_edges is a dataframe of edges, which has the following columns:
- head_index: the index of the head node
- head_name: the name of the head node
- head_source: the source database of the head node
- head_id: the id of the head node in its source database
- head_type: the type of the head node
- tail_index: the index of the tail node
- tail_name: the name of the tail node
- tail_source: the source database of the tail node
- tail_id: the id of the tail node in its source database
- tail_type: the type of the tail node
- display_relation: the type of the edge in display form (e.g., ppi)
- relation: the relation label of the edge (e.g., protein_protein)
We can also check a sample of the primekg edges to see the interconnections between the nodes in the PrimeKG dataset as follows.
# Check a sample of the primekg edges
primekg_edges.head()
| | head_index | head_name | head_source | head_id | head_type | tail_index | tail_name | tail_source | tail_id | tail_type | display_relation | relation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | PHYHIP | NCBI | 9796 | gene/protein | 8889 | KIF15 | NCBI | 56992 | gene/protein | ppi | protein_protein |
| 1 | 1 | GPANK1 | NCBI | 7918 | gene/protein | 2798 | PNMA1 | NCBI | 9240 | gene/protein | ppi | protein_protein |
| 2 | 2 | ZRSR2 | NCBI | 8233 | gene/protein | 5646 | TTC33 | NCBI | 23548 | gene/protein | ppi | protein_protein |
| 3 | 3 | NRF1 | NCBI | 4899 | gene/protein | 11592 | MAN1B1 | NCBI | 11253 | gene/protein | ppi | protein_protein |
| 4 | 4 | PI4KA | NCBI | 5297 | gene/protein | 2122 | RGS20 | NCBI | 8601 | gene/protein | ppi | protein_protein |
The current version of PrimeKG has about 8.1M edges in total (8,100,498), as we can observe in the following cell.
# Check dimensions of the primekg edges
primekg_edges.shape
(8100498, 12)
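With this many edges, a quick sanity check can confirm that every head_index and tail_index refers to a node that exists in the nodes dataframe. The snippet below is a minimal sketch on toy data with a deliberately broken edge; the frames `nodes` and `edges` and the name `dangling` are assumptions for illustration, and `primekg_nodes` and `primekg_edges` would replace the toy frames on the real data.

```python
import pandas as pd

# Toy stand-ins; the last edge points to a node index (7) that does not exist.
nodes = pd.DataFrame({"node_index": [0, 1, 2]})
edges = pd.DataFrame({"head_index": [0, 1, 2], "tail_index": [1, 2, 7]})

# Collect the known node indices once, then flag edges with an unknown endpoint.
known = set(nodes["node_index"])
dangling = edges[~edges["head_index"].isin(known) | ~edges["tail_index"].isin(known)]
print(len(dangling))
```

An empty `dangling` frame would indicate that the two dataframes are consistent with each other.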
We can break down the primekg edges by type as follows.
# Show edge types and their counts
primekg_edges['display_relation'].value_counts()
display_relation
expression present         3036406
synergistic interaction    2672628
interacts with              686550
ppi                         642150
phenotype present           300634
parent-child                281744
associated with             167482
side effect                 129568
contraindication             61350
expression absent            39774
target                       32760
indication                   18776
enzyme                       10634
transporter                   6184
off-label use                 5136
linked to                     4608
phenotype absent              2386
carrier                       1728
Name: count, dtype: int64
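Once the edge types are known, a single relation can be selected by filtering on display_relation. The following is a minimal sketch on a toy edges dataframe; the drug and disease names are made up, the direction of the edges (drug as head, disease as tail) is an assumption of the toy data, and `primekg_edges` would replace `edges` on the real data.

```python
import pandas as pd

# Toy stand-in for primekg_edges with a subset of its columns.
edges = pd.DataFrame({
    "head_name": ["drug_a", "drug_a", "drug_b", "PHYHIP"],
    "head_type": ["drug", "drug", "drug", "gene/protein"],
    "tail_name": ["disease_x", "disease_y", "disease_x", "KIF15"],
    "tail_type": ["disease", "disease", "disease", "gene/protein"],
    "display_relation": ["indication", "contraindication", "indication", "ppi"],
})

# Keep only indication edges whose head is a drug.
indications = edges[
    (edges["display_relation"] == "indication") & (edges["head_type"] == "drug")
]

# Group the drugs by the disease they are indicated for.
drugs_per_disease = indications.groupby("tail_name")["head_name"].apply(list).to_dict()
print(drugs_per_disease)
```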