Textual Enrichment over StarkQA-PrimeKG using Ollama
This tutorial walks through an example of textual enrichment of StarkQA-PrimeKG nodes using Ollama LLMs.
First, we import the necessary libraries:
# Import necessary libraries
import sys
sys.path.append("../../..")
import logging
from aiagents4pharma.talk2knowledgegraphs.datasets.primekg import PrimeKG
from aiagents4pharma.talk2knowledgegraphs.datasets.starkqa_primekg import StarkQAPrimeKG
from aiagents4pharma.talk2knowledgegraphs.utils.enrichments.ollama import (
    EnrichmentWithOllama,
)
# Set the logging level for httpx to WARNING to suppress INFO messages
logging.getLogger("httpx").setLevel(logging.WARNING)
Load PrimeKG
To load the PrimeKG dataset, we can use the PrimeKG class from the aiagents4pharma/talk2knowledgegraphs library.
The PrimeKG class is initialized with the path to a local directory, where the dataset is stored on first use and from which it is loaded afterwards.
# Define primekg data by providing a local directory where the data is stored
primekg_data = PrimeKG(local_dir="../../../../data/primekg/")
# Invoke a method to load the data
primekg_data.load_data()
# Get primekg_nodes and primekg_edges
primekg_nodes = primekg_data.get_nodes()
primekg_edges = primekg_data.get_edges()
Loading nodes of PrimeKG dataset ...
../../../../data/primekg/primekg_nodes.tsv.gz already exists. Loading the data from the local directory.
Loading edges of PrimeKG dataset ...
../../../../data/primekg/primekg_edges.tsv.gz already exists. Loading the data from the local directory.
Load StarkQA
The StarkQAPrimeKG class downloads the data from the HuggingFace Hub if it is not available locally.
Otherwise, the data is loaded from the local directory given by local_dir.
# Define starkqa primekg data by providing a local directory where the data is stored
starkqa_data = StarkQAPrimeKG(local_dir="../../../../data/starkqa_primekg/")
# Invoke a method to load the data
starkqa_data.load_data()
# Get the StarkQAPrimeKG data, which are the QA pairs, split indices, and the node information
starkqa_df = starkqa_data.get_starkqa()
starkqa_split_indices = starkqa_data.get_starkqa_split_indicies()
starkqa_node_info = starkqa_data.get_starkqa_node_info()
Loading StarkQAPrimeKG dataset...
../../../../data/starkqa_primekg/qa/prime/stark_qa/stark_qa.csv already exists. Loading the data from the local directory.
Loading StarkQAPrimeKG embeddings...
# We can obtain additional node information from StarkQAPrimeKG
starkqa_node_info[0]
{'id': 9796, 'type': 'gene/protein', 'name': 'PHYHIP', 'source': 'NCBI', 'details': {'query': 'PHYHIP', '_id': '9796', '_score': 17.934021, 'alias': ['DYRK1AP3', 'PAHX-AP', 'PAHXAP1'], 'genomic_pos': {'chr': '8', 'end': 22232101, 'ensemblgene': 'ENSG00000168490', 'start': 22219703, 'strand': -1}, 'name': 'phytanoyl-CoA 2-hydroxylase interacting protein', 'summary': 'Enables protein tyrosine kinase binding activity. Involved in protein localization. Located in cytoplasm. [provided by Alliance of Genome Resources, Apr 2022]'}}
Several gene/protein nodes have summary information that can be used for downstream analysis of the knowledge graph.
However, some nodes lack this summary information.
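To get a feel for how many nodes are affected, we can count the nodes with and without a summary per node type. The following is a minimal sketch that runs on a small made-up dictionary with the same layout as the entry shown above (a "type" field and a "details" dictionary); `summary_coverage` and the toy `node_info` are hypothetical names, not part of the library.

```python
from collections import Counter

# Hypothetical stand-in for starkqa_node_info: node id -> dict with "type" and
# "details". The content is made up; only the layout mirrors the example above.
node_info = {
    0: {"type": "gene/protein", "name": "A", "details": {"summary": "placeholder"}},
    1: {"type": "gene/protein", "name": "B", "details": {"name": "b protein"}},
    2: {"type": "disease", "name": "C", "details": {}},
    3: {"type": "gene/protein", "name": "D", "details": {}},
}


def summary_coverage(info):
    """Count, per node type, how many nodes have or lack a 'summary' in 'details'."""
    with_summary, without_summary = Counter(), Counter()
    for node in info.values():
        # Treat a missing or empty "details" entry as "no summary"
        details = node.get("details") or {}
        target = with_summary if "summary" in details else without_summary
        target[node["type"]] += 1
    return with_summary, without_summary


has, lacks = summary_coverage(node_info)
print(dict(has))
print(dict(lacks))
```

On the real data, the same function could be called as `summary_coverage(starkqa_node_info)`, assuming every entry stores its details as a dictionary.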
Textual Enrichment over PrimeKG Nodes using Ollama
StarkQA provides additional node information for PrimeKG as a dictionary for each node. This allows us to further enrich the features of the knowledge graph nodes.
However, StarkQA does not provide every piece of information for every node. Thus, in this example, we will fill in missing node information using Ollama.
As a disclaimer, since we perform textual enrichment using LLMs, there are several considerations to keep in mind:
- the risk of generating text that is not relevant to the input data
- the risk of generating text that is scientifically inaccurate or misleading due to hallucinations
- the risk of generating biased text based on the training data of particular LLMs
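These risks cannot be ruled out automatically, but a few cheap structural checks can catch obvious failures before the generated text is used. The sketch below assumes the output format used later in this tutorial (one dictionary with a "desc" key per input string); `basic_checks` is a hypothetical helper, and passing it says nothing about scientific accuracy.

```python
def basic_checks(inputs, outputs):
    """Return a list of human-readable problems found in an enrichment result."""
    problems = []
    if len(inputs) != len(outputs):
        problems.append(
            f"length mismatch: {len(inputs)} inputs, {len(outputs)} outputs"
        )
    for text, out in zip(inputs, outputs):
        desc = out.get("desc", "") if isinstance(out, dict) else ""
        # "OXLD1 (oxidoreductase ...)" -> "OXLD1"; plain names are kept as-is
        symbol = text.split(" (")[0]
        if not desc.strip():
            problems.append(f"{symbol}: empty or missing description")
        elif symbol.lower() not in desc.lower():
            problems.append(f"{symbol}: description does not mention the node name")
    return problems


# Made-up example: the second description never mentions "PHB"
inputs = ["OXLD1 (oxidoreductase like domain containing 1)", "PHB"]
outputs = [
    {"desc": "OXLD1 is a protein that contains an oxidoreductase-like domain."},
    {"desc": "A protein involved in transcriptional regulation."},
]
print(basic_checks(inputs, outputs))
```

Descriptions that fail such checks are candidates for manual review or re-generation; descriptions that pass still need expert validation.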
For a simple use case, we first find the IDs of the gene/protein nodes that have no summary information:
# Find the gene/protein nodes that lack summary information
node_wo_info_ids = [
    n_id
    for n_id in starkqa_node_info.keys()
    if starkqa_node_info[n_id]["type"] == "gene/protein"
    and "summary" not in starkqa_node_info[n_id]["details"]
]
# Inspect an example node without summary information
starkqa_node_info[node_wo_info_ids[0]]
{'id': 339229, 'type': 'gene/protein', 'name': 'OXLD1', 'source': 'NCBI', 'details': {'query': 'OXLD1', '_id': '339229', '_score': 17.692335, 'alias': 'C17orf90', 'genomic_pos': {'chr': '17', 'end': 81666635, 'ensemblgene': 'ENSG00000204237', 'start': 81665036, 'strand': -1}, 'name': 'oxidoreductase like domain containing 1'}}
Before enriching the data, we need to define the enrichment model along with its configuration.
Note that the enrichment prompt can be customized further to request more specific information.
# Config for Ollama enrichment
ollama_config = {
    "model_name": "llama3.1",
    "prompt_enrichment": """
You are a helpful expert in biomedical knowledge graph analysis.
Your role is to enrich the inputs (nodes or relations) using textual description.
A node is represented as string, e.g., "ADAM17" in the input list, while a relation is
represented as tuples, e.g., "(NOD2, gene causation disease, Crohn disease)".
DO NOT mistake one for the other. If the input is a list of nodes, treat each node as
a unique entity, and provide a description. If the input is a list of relations,
treat each tuple in the relation list as a unique relation between nodes,
and provide a description for each tuple.
All provided information about the node or relations should be concise
(a single sentence), informative, factual, and relevant in the biomedical domain.
! IMPORTANT ! Make sure that the output is in valid format and can be parsed as
a list of dictionaries correctly and without any prepend information.
DO NOT forget to close the brackets properly.
KEEP the order consistent between the input and the output.
See <example> for reference.
<example>
Input: ["ADAM17", "IL23R"]
Output: [{{"desc" : "ADAM17 is a metalloproteinase involved in the shedding of
membrane proteins and plays a role in inflammatory processes."}}, {{"desc":
"IL23R is a receptor for interleukin-23, which is involved in inflammatory responses
and has been linked to inflammatory bowel disease."}}]
</example>
<example>
Input: ["(NOD2, gene causation disease, Crohn disease)", "(IL23R,
gene causation disease, Inflammatory Bowel Disease)"]
Output: [{{"desc" : "NOD2 is a gene that contributes to immune responses and has
been implicated in Crohn's disease, particularly through genetic mutations linked to
inflammation."}}, {{"desc" : "IL23R is a receptor gene that plays a role in
immune system regulation and has been associated with susceptibility to
Inflammatory Bowel Disease."}}]
</example>
Input: {input}
""",
    "temperature": 0.0,
    "streaming": False,
}
# Prepare the enrichment model
enr_model = EnrichmentWithOllama(
    model_name=ollama_config["model_name"],
    prompt_enrichment=ollama_config["prompt_enrichment"],
    temperature=ollama_config["temperature"],
    streaming=ollama_config["streaming"],
)
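The prompt above writes the literal braces of its example outputs as `{{` and `}}`, while `{input}` is left as a single-brace placeholder. A minimal sketch of why, assuming the template is filled with Python format-style substitution (how EnrichmentWithOllama fills it internally is not shown in this tutorial, so plain `str.format` is used here as a stand-in):

```python
# Doubled braces are emitted as literal braces; {input} is the placeholder.
template = 'Output: [{{"desc": "text"}}]\nInput: {input}'

filled = template.format(input='["ADAM17", "IL23R"]')
print(filled)
```

With single braces around the example dictionaries, format-style substitution would try to interpret them as placeholders and fail, which is why custom prompts should keep the doubled braces.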
To perform textual enrichment, we can use the enrich_documents method of the EnrichmentWithOllama class, which performs enrichment on a list of queries (documents).
# Enrichment can be run over all gene/protein nodes that lack a summary.
# As an example, we enrich only the first 10 nodes.
batch_ = node_wo_info_ids[:10]
nodes_ = []
enriched_nodes = []
for n_id in batch_:
    # Build the query text: the gene symbol, plus the full name if available
    nodes_text_desc = f"{starkqa_node_info[n_id]['name']}"
    if "name" in starkqa_node_info[n_id]["details"]:
        nodes_text_desc += f" ({starkqa_node_info[n_id]['details']['name']})"
    nodes_.append(nodes_text_desc)
# Enrich the whole batch in a single call
enriched_nodes = enr_model.enrich_documents(nodes_)
nodes_, enriched_nodes
(['OXLD1 (oxidoreductase like domain containing 1)', 'CRCT1 (cysteine rich C-terminal 1)', 'ATXN7L2 (ataxin 7 like 2)', 'TSTD2 (thiosulfate sulfurtransferase like domain containing 2)', 'GUCD1 (guanylyl cyclase domain containing 1)', 'PRR22 (proline rich 22)', 'PHB', 'TRBV2 (T cell receptor beta variable 2)', 'TEX37', 'MT-ND5 (NADH dehydrogenase subunit 5)'], [{'desc': 'OXLD1 is a protein that contains an oxidoreductase-like domain and plays a role in cellular processes.'}, {'desc': 'CRCT1 is a protein with a cysteine-rich C-terminal region, which may be involved in protein-protein interactions.'}, {'desc': 'ATXN7L2 is a protein that shares similarities with ataxin-7 and may play a role in maintaining genome stability.'}, {'desc': 'TSTD2 is a protein containing a thiosulfate sulfurtransferase-like domain, which may be involved in sulfur metabolism.'}, {'desc': 'GUCD1 is a protein containing a guanylyl cyclase domain, which plays a crucial role in signaling pathways.'}, {'desc': 'PRR22 is a protein rich in proline residues and may play a structural or regulatory role in cellular processes.'}, {'desc': 'PHB is a protein involved in various cellular functions, including transcriptional regulation and cell cycle control.'}, {'desc': 'TRBV2 is a variable region of the T-cell receptor beta chain, which plays a key role in immune system recognition and response.'}, {'desc': 'TEX37 is a protein with unknown function, but may be involved in specific cellular processes or interactions.'}, {'desc': 'MT-ND5 is a mitochondrial NADH dehydrogenase subunit that plays a critical role in the electron transport chain and energy production within cells.'}])
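To go beyond the first 10 nodes, the same call can be repeated over consecutive batches and the descriptions written back to the node dictionaries. The following is a minimal sketch under several assumptions: `enrich_in_batches`, `chunked`, and the "llm_summary" key are hypothetical names, and a stub function stands in for `enr_model.enrich_documents` so the example runs without an Ollama server.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def enrich_in_batches(node_info, node_ids, enrich_fn, batch_size=10):
    """Enrich nodes batch by batch and store each result under 'llm_summary'."""
    for batch in chunked(node_ids, batch_size):
        texts = [node_info[n_id]["name"] for n_id in batch]
        results = enrich_fn(texts)
        if len(results) != len(texts):
            # Skip the batch rather than attach a description to the wrong node
            continue
        for n_id, res in zip(batch, results):
            node_info[n_id]["details"]["llm_summary"] = res["desc"]
    return node_info


# Stub standing in for enr_model.enrich_documents (assumption: it returns one
# {"desc": ...} dictionary per input, as in the output above).
def fake_enrich(texts):
    return [{"desc": f"{t} is a placeholder description."} for t in texts]


# Made-up node dictionary with the same layout as starkqa_node_info
node_info = {i: {"name": f"GENE{i}", "details": {}} for i in range(5)}
enrich_in_batches(node_info, list(node_info), fake_enrich, batch_size=2)
print(node_info[4]["details"]["llm_summary"])
```

Storing the generated text under a separate key, rather than under "summary", keeps LLM-generated descriptions distinguishable from the curated summaries provided by StarkQA.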