Cell annotations with cell2sentence¶

Please note this notebook was inspired from cell2sentence's tutorial 0 and tutorial 6 on how to preprocess and predict cell types.¶

In this tutorial, we will guide you through the process of preparing and preprocessing a single-cell RNA sequencing dataset for use with C2S.

Please note:

This notebook assumes you are starting from a single-cell dataset with raw transcript counts. You can follow these steps to preprocess your dataset for use with C2S.
If you are starting from pre-normalized data, keep in mind that C2S normalization consists of the standard filtering & count normalization steps outlined by Scanpy, with the exception that the log1p transformation is applied with base 10.
We use scverse 's Rapids-singlecell for high computational efficiency. For more information, check out scverse and the NVIDIA blog. Please see here for installations or follow the steps below:

pip install rapids-singlecell

pip install \
    --extra-index-url=https://pypi.nvidia.com \
    "cudf-cu11==25.4.*" "dask-cudf-cu11==25.4.*" "cuml-cu11==25.4.*" \
    "cugraph-cu11==25.4.*" "nx-cugraph-cu11==25.4.*" "cuspatial-cu11==25.4.*" \
    "cuproj-cu11==25.4.*" "cuxfilter-cu11==25.4.*" "cucim-cu11==25.4.*" \
    "pylibraft-cu11==25.4.*" "raft-dask-cu11==25.4.*" "cuvs-cu11==25.4.*" \
    "nx-cugraph-cu11==25.4.*"

This notebook will:

Load an Crohns disease single-cell dataset from Domínguez Conde et al. Citation: Wiendl, M., Dedden, M., Liu, LJ. et al. Etrolizumab-s fails to control E-Cadherin-dependent co-stimulation of highly activated cytotoxic T cells. Nat Commun 15, 1043 (2024). Link to paper. Link to GEO submission.
Perform preprocessing using Rapids-singlecell. Please note that not all SCANPY operations are supported by Rapids-singlecell. Hence, we may move the anndata b/w GPU and CPU.
Perform pathway and GO enrichment analsysis
Load a C2S model (C2S-Pythia-410M cell type prediction foundation model) which was pretrained to do cell type annotation on many datasets
Annotate the single-cell dataset using the C2S model
Save the enrichment analysis and predicted cell types in an excel sheet, which can be directly uploaded to T2KG and AA4P

Import standard libraries¶

In [ ]:

Copied!





# Python built-in libraries
import random

import rapids_singlecell as rsc

# Third-party libraries
from pygeneconverter import ensembl_to_hugo
import pandas as pd
import gseapy as gp
import numpy as np

# Single-cell libraries
import anndata
import scanpy as sc

# Cell2Sentence imports
import cell2sentence as cs
from cell2sentence.tasks import predict_cell_types_of_data
# Python built-in libraries
import random

import rapids_singlecell as rsc

# Third-party libraries
from pygeneconverter import ensembl_to_hugo
import pandas as pd
import gseapy as gp
import numpy as np

# Single-cell libraries
import anndata
import scanpy as sc

# Cell2Sentence imports
import cell2sentence as cs
from cell2sentence.tasks import predict_cell_types_of_data

Start with a random seed¶

In [12]:

Copied!

SEED = 1234
random.seed(SEED)
np.random.seed(SEED)
SEED = 1234
random.seed(SEED)
np.random.seed(SEED)

Load your data¶

In [11]:

Copied!

# Load the raw counts CSV
df = pd.read_csv("/data/GSM7994747_CD/GSM7994747_CD.csv.gz", index_col=0)  # Genes in rows, cells in columns
df.head()
# Load the raw counts CSV
df = pd.read_csv("/data/GSM7994747_CD/GSM7994747_CD.csv.gz", index_col=0)  # Genes in rows, cells in columns
df.head()

Out[11]:

	AAACCCAAGCATGGGT_SAN	AAACCCAAGTAGACAT_SAN	AAACCCAAGTCCCAGC_SAN	AAACCCAAGTGAGTTA_SAN	AAACCCACACTTGGCG_SAN	AAACCCAGTATGGGAC_SAN	AAACCCAGTTACCGTA_SAN	AAACCCATCGAGTCTA_SAN	AAACCCATCGTTAGAC_SAN	AAACGAAAGGGTTGCA_SAN	...	TTTGGTTAGATGAACT_SAN	TTTGGTTCAAGGTCAG_SAN	TTTGGTTCACAGTGAG_SAN	TTTGGTTCAGATTAAG_SAN	TTTGGTTCAGCTTCGG_SAN	TTTGGTTGTAGAGGAA_SAN	TTTGGTTGTTTGGAAA_SAN	TTTGTTGAGCGTGAGT_SAN	TTTGTTGGTAACACGG_SAN	TTTGTTGGTTATGTCG_SAN
ENSG00000243485	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
ENSG00000237613	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
ENSG00000186092	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
ENSG00000238009	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
ENSG00000239945	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0

5 rows × 9639 columns

Convert ENSEMBL IDs to HUGO nomenclature¶

In [13]:

Copied!





# Convert from ENSEMBL to HUGO
ensembl_df = ensembl_to_hugo(df.index.tolist())
# Select only protein coding genes (or choose another filter)
ensembl_df_protein_coding = ensembl_df[ensembl_df["GENE_TYPE"]=="protein_coding"]
ensembl_df_protein_coding.head()
# Convert from ENSEMBL to HUGO
ensembl_df = ensembl_to_hugo(df.index.tolist())
# Select only protein coding genes (or choose another filter)
ensembl_df_protein_coding = ensembl_df[ensembl_df["GENE_TYPE"]=="protein_coding"]
ensembl_df_protein_coding.head()

Out[13]:

	ENSEMBL_ID	GENE_TYPE	HGNC_ID
2	ENSG00000186092	protein_coding	OR4F5
20	ENSG00000187634	protein_coding	SAMD11
21	ENSG00000188976	protein_coding	NOC2L
22	ENSG00000187961	protein_coding	KLHL17
23	ENSG00000187583	protein_coding	PLEKHN1

Replace ENSEMBL IDs by HUGO symbols in the data¶

In [14]:

Copied!





# Ensure mapper_df has old and new IDs
mapping = dict(zip(ensembl_df_protein_coding['ENSEMBL_ID'], ensembl_df_protein_coding['HGNC_ID']))

# Filter only rows where index exists in mapping
df_filtered = df[df.index.isin(mapping.keys())]

# Replace index using the mapping
df_filtered.index = df_filtered.index.map(mapping)
df_filtered.head()
# Ensure mapper_df has old and new IDs
mapping = dict(zip(ensembl_df_protein_coding['ENSEMBL_ID'], ensembl_df_protein_coding['HGNC_ID']))

# Filter only rows where index exists in mapping
df_filtered = df[df.index.isin(mapping.keys())]

# Replace index using the mapping
df_filtered.index = df_filtered.index.map(mapping)
df_filtered.head()

Out[14]:

	AAACCCAGTTACCGTA_SAN	...
OR4F5	0.0	...
SAMD11	0.0	...
NOC2L	1.0	...
KLHL17	0.0	...
PLEKHN1	0.0	...

5 rows × 9639 columns

In [15]:

Copied!





# Transpose to match AnnData format: cells as rows, genes as columns
adata = anndata.AnnData(df_filtered.T)
adata.var_names_make_unique()
adata.var['gene_name'] = adata.var.index
adata.var.head()
# Transpose to match AnnData format: cells as rows, genes as columns
adata = anndata.AnnData(df_filtered.T)
adata.var_names_make_unique()
adata.var['gene_name'] = adata.var.index
adata.var.head()

/opt/conda/lib/python3.10/site-packages/anndata/_core/anndata.py:1758: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.

Out[15]:

	gene_name
OR4F5	OR4F5
SAMD11	SAMD11
NOC2L	NOC2L
KLHL17	KLHL17
PLEKHN1	PLEKHN1

Make var_names unique¶

In [16]:

Copied!

adata.var_names = adata.var_names.str.upper()
adata.var_names_make_unique()
adata.var_names
adata.var_names = adata.var_names.str.upper()
adata.var_names_make_unique()
adata.var_names

Out[16]:

Index(['OR4F5', 'SAMD11', 'NOC2L', 'KLHL17', 'PLEKHN1', 'C1ORF170', 'HES4',
       'ISG15', 'AGRN', 'RNF223',
       ...
       'MT-ATP6', 'MT-CO3', 'MT-ND3', 'MT-ND4L', 'MT-ND4', 'MT-ND5', 'MT-ND6',
       'MT-CYB', 'NBPF10-1', 'AL355149.2'],
      dtype='object', length=19121)

Move anndata to GPU and apply basic scanpy filters¶

In [17]:

Copied!

rsc.get.anndata_to_GPU(adata) # moves `.X` to the GPU
rsc.pp.filter_cells(adata, min_genes=200)
rsc.pp.filter_genes(adata, min_cells=3)
rsc.get.anndata_to_GPU(adata) # moves `.X` to the GPU
rsc.pp.filter_cells(adata, min_genes=200)
rsc.pp.filter_genes(adata, min_cells=3)

filtered out 5579 genes that are detected in less than 3 cells

In [18]:

Copied!

# DATA_PATH = "/data/sc_data_IBD/sample.h5ad"
# adata = anndata.read_h5ad(DATA_PATH)
adata
# DATA_PATH = "/data/sc_data_IBD/sample.h5ad"
# adata = anndata.read_h5ad(DATA_PATH)
adata

Out[18]:

AnnData object with n_obs × n_vars = 9639 × 13542
    obs: 'n_counts', 'n_genes'
    var: 'gene_name', 'n_counts', 'n_cells'

In [19]:

Copied!

## Print OBS head
adata.obs.head()
## Print OBS head
adata.obs.head()

Out[19]:

	n_counts	n_genes
AAACCCAAGCATGGGT_SAN	3624.0	1125
AAACCCAAGTAGACAT_SAN	7453.0	1978
AAACCCAAGTCCCAGC_SAN	6372.0	1684
AAACCCAAGTGAGTTA_SAN	4374.0	1214
AAACCCACACTTGGCG_SAN	4894.0	1069

In [20]:

Copied!

## Print VAR head
adata.var.head()
## Print VAR head
adata.var.head()

Out[20]:

	gene_name	n_counts	n_cells
SAMD11	SAMD11	8.0	8
NOC2L	NOC2L	1084.0	999
KLHL17	KLHL17	90.0	89
PLEKHN1	PLEKHN1	49.0	42
C1ORF170	C1orf170	4.0	4

Add new observations "organism" and "cell_type"¶

"organism" should contain the name of the organism and, "cell_type" should contain the known cell_type (if any; this can be useful when comparing with the predicted cell types later)

In [21]:

Copied!





adata.obs['organism'] = 'Homo sapiens'
adata.obs['cell_type'] = ''
# adata.obs = adata.obs.rename(columns={'cell_anno_NovershternPanel': 'cell_type'})
adata.obs.head()
adata.obs['organism'] = 'Homo sapiens'
adata.obs['cell_type'] = ''
# adata.obs = adata.obs.rename(columns={'cell_anno_NovershternPanel': 'cell_type'})
adata.obs.head()

Out[21]:

	n_counts	n_genes	organism
AAACCCAAGCATGGGT_SAN	3624.0	1125	Homo sapiens
AAACCCAAGTAGACAT_SAN	7453.0	1978	Homo sapiens
AAACCCAAGTCCCAGC_SAN	6372.0	1684	Homo sapiens
AAACCCAAGTGAGTTA_SAN	4374.0	1214	Homo sapiens
AAACCCACACTTGGCG_SAN	4894.0	1069	Homo sapiens

c2S requires only "organism" and "cell_type" obs. We can remove the others.¶

In [22]:

Copied!

adata_obs_cols_to_keep  = adata.obs.columns.tolist()
adata_obs_cols_to_keep = ['organism', 'cell_type']
adata_obs_cols_to_keep  = adata.obs.columns.tolist()
adata_obs_cols_to_keep = ['organism', 'cell_type']

Annotate mitochondrial gene counts for additional potential filtering steps¶

In [23]:

Copied!





# annotate the group of mitochondrial genes as "mt"
adata.var["mt"] = adata.var_names.str.startswith("MT-")
rsc.pp.calculate_qc_metrics(
    adata, qc_vars=["mt"], log1p=False
)
# annotate the group of mitochondrial genes as "mt"
adata.var["mt"] = adata.var_names.str.startswith("MT-")
rsc.pp.calculate_qc_metrics(
    adata, qc_vars=["mt"], log1p=False
)

In [24]:

Copied!





sc.pl.violin(
    adata,
    ["n_genes_by_counts", "total_counts", "pct_counts_mt"],
    jitter=0.4,
    multi_panel=True,
)
sc.pl.violin(
    adata,
    ["n_genes_by_counts", "total_counts", "pct_counts_mt"],
    jitter=0.4,
    multi_panel=True,
)

No description has been provided for this image

Filter out cells based on quality-control metrics based on mitochondrial genes (Optional)¶

Check out Scanpy guidance for more information: https://scanpy.readthedocs.io/en/stable/tutorials/basics/clustering.html#quality-control

In [25]:

Copied!

adata = adata[adata.obs.n_genes_by_counts < 6000, :]
adata = adata[adata.obs.pct_counts_mt < 50, :].copy()
adata = adata[adata.obs.n_genes_by_counts < 6000, :]
adata = adata[adata.obs.pct_counts_mt < 50, :].copy()

Count normalization and apply additional filters¶

In [26]:

Copied!





# Count normalization
rsc.pp.normalize_total(adata)
# moves `.X` to the CPU befre applying log transformation
rsc.get.anndata_to_CPU(adata)
# Lop1p transformation with base 10 - base 10 is important for C2S transformation!
sc.pp.log1p(adata, base=10)
# Count normalization
rsc.pp.normalize_total(adata)
# moves `.X` to the CPU befre applying log transformation
rsc.get.anndata_to_CPU(adata)
# Lop1p transformation with base 10 - base 10 is important for C2S transformation!
sc.pp.log1p(adata, base=10)

Visualization¶

In [27]:

Copied!





# moves `.X` back to the GPU
rsc.get.anndata_to_GPU(adata)
rsc.tl.pca(adata)
rsc.pp.neighbors(adata)
rsc.tl.umap(adata)
# moves `.X` back to the GPU
rsc.get.anndata_to_GPU(adata)
rsc.tl.pca(adata)
rsc.pp.neighbors(adata)
rsc.tl.umap(adata)

[2025-05-13 07:40:34.671] [CUML] [info] Unused keyword parameter: random_state during cuML estimator initialization

In [28]:

Copied!

# Clustering
rsc.tl.leiden(adata)
# Clustering
rsc.tl.leiden(adata)

In [29]:

Copied!

# Plot UMAP
sc.pl.umap(adata, color=["leiden"])
# Plot UMAP
sc.pl.umap(adata, color=["leiden"])

Enrichment analysis¶

In [30]:

Copied!

# Run differential expression (example: cluster 0 vs all, cluster 1 vs all, etc.)
rsc.tl.rank_genes_groups_logreg(adata, groupby='leiden', method='wilcoxon')
# Run differential expression (example: cluster 0 vs all, cluster 1 vs all, etc.)
rsc.tl.rank_genes_groups_logreg(adata, groupby='leiden', method='wilcoxon')

[2025-05-13 07:41:51.502] [CUML] [info] Unused keyword parameter: method during cuML estimator initialization
[2025-05-13 07:44:26.571] [CUML] [warning] L-BFGS: max iterations reached
[2025-05-13 07:44:26.572] [CUML] [warning] Maximum iterations reached before solver is converged. To increase model accuracy you can increase the number of iterations (max_iter) or improve the scaling of the input data.

Create a union of top 25 DE genes from each cluster¶

In [47]:

Copied!





top_n = 25
groups = adata.uns['rank_genes_groups']['names'].dtype.names

# Store top genes per group
top_genes = set()
for group in groups:
    genes = adata.uns['rank_genes_groups']['names'][group][:top_n]
    top_genes.update(genes)

# 3. Convert to list or sorted list
top_genes_union = sorted(top_genes)  # or list(top_genes)
top_genes_union = [str(g) for g in top_genes_union]

print(f"Total unique top genes across clusters: {len(top_genes_union)}")
print(top_genes_union)
top_n = 25
groups = adata.uns['rank_genes_groups']['names'].dtype.names

# Store top genes per group
top_genes = set()
for group in groups:
    genes = adata.uns['rank_genes_groups']['names'][group][:top_n]
    top_genes.update(genes)

# 3. Convert to list or sorted list
top_genes_union = sorted(top_genes)  # or list(top_genes)
top_genes_union = [str(g) for g in top_genes_union]

print(f"Total unique top genes across clusters: {len(top_genes_union)}")
print(top_genes_union)

Total unique top genes across clusters: 189
['ACTG1', 'AIF1', 'ALOX5AP', 'ANKRD12', 'ANXA1', 'ANXA5', 'APBB1IP', 'ARF6', 'ARHGDIB', 'ARID4B', 'ATP5D', 'B2M', 'BTG1', 'C12ORF57', 'C12ORF75', 'C19ORF60', 'C9ORF142', 'CALM1', 'CAPZB', 'CCL5', 'CCR7', 'CD2', 'CD3E', 'CD52', 'CD74', 'CD8A', 'CD8B', 'CD96', 'CEBPD', 'CHCHD10', 'CKLF', 'CMC1', 'COA4', 'CORO1B', 'COX4I1', 'CRIP1', 'CST7', 'CTSW', 'CX3CR1', 'DDX5', 'DSTN', 'DUSP1', 'EEF1A1', 'EEF1G', 'EEF2', 'EFHD2', 'EIF1', 'EIF1AX', 'EIF3E', 'FAM173A', 'FGD3', 'FGFBP2', 'FOS', 'FTH1', 'FUBP1', 'FYB', 'GLRX', 'GLTSCR2', 'GMFG', 'GNAS', 'GNLY', 'GPR183', 'GYPC', 'GZMA', 'GZMB', 'GZMH', 'GZMK', 'GZMM', 'H3F3A', 'HCST', 'HLA-A', 'HLA-DPB1', 'HMGB2', 'HNRNPU', 'HOPX', 'HSP90B1', 'IFITM2', 'IL32', 'IL7R', 'ILF2', 'ITGA4', 'ITGB1', 'ITM2B', 'KLF6', 'KLRB1', 'KLRD1', 'KLRF1', 'KLRG1', 'KLRK1', 'KRTCAP2', 'LEPROTL1', 'LGALS1', 'LGALS3', 'LIME1', 'LIMS1', 'LTB', 'LYAR', 'MALT1', 'MBP', 'MT-ATP6', 'MT-CO1', 'MT-CO2', 'MT-CO3', 'MT-CYB', 'MT-ND1', 'MT-ND2', 'MT-ND3', 'MT-ND4', 'MT-ND5', 'MT-ND6', 'MT2A', 'MTIF3', 'MTRNR2L12', 'NCR3', 'NDUFA13', 'NDUFV2', 'NKG7', 'NKTR', 'NOSIP', 'NSA2', 'NSG1', 'PASK', 'PFN1', 'PHACTR2', 'PHF11', 'PHF3', 'PLCG2', 'PLEK', 'PLIN2', 'PNISR', 'PPIB', 'PPP2R5C', 'PRF1', 'PTPRC', 'RNASET2', 'RPL13A', 'RPL14', 'RPL17', 'RPL21', 'RPL23', 'RPL30', 'RPL31', 'RPL32', 'RPL36', 'RPL36AL', 'RPL37', 'RPL38', 'RPL39', 'RPL41', 'RPLP0', 'RPLP1', 'RPS11', 'RPS12', 'RPS15A', 'RPS18', 'RPS19', 'RPS21', 'RPS24', 'RPS27', 'RPS29', 'RPS3A', 'RPS6', 'RPS8', 'RPSA', 'S100A10', 'S100A11', 'S100A4', 'S100A6', 'SELL', 'SET', 'SH2D1A', 'SLC25A5', 'SLC4A10', 'SPOCK2', 'SRP14', 'TCF7', 'TIGIT', 'TIMP1', 'TMEM66', 'TMSB10', 'TMSB4X', 'TPT1', 'TSHZ2', 'TXN', 'TXNIP', 'TYROBP', 'UCP2', 'VIM', 'ZEB2']

Peform enrichment analysis using EnrichR implementation in python (via gseapy)¶

In [33]:

Copied!

# Get a list of background databases available in EnrichR
names = gp.get_library_name()
print (names)
# Get a list of background databases available in EnrichR
names = gp.get_library_name()
print (names)

['ARCHS4_Cell-lines', 'ARCHS4_IDG_Coexp', 'ARCHS4_Kinases_Coexp', 'ARCHS4_TFs_Coexp', 'ARCHS4_Tissues', 'Achilles_fitness_decrease', 'Achilles_fitness_increase', 'Aging_Perturbations_from_GEO_down', 'Aging_Perturbations_from_GEO_up', 'Allen_Brain_Atlas_10x_scRNA_2021', 'Allen_Brain_Atlas_down', 'Allen_Brain_Atlas_up', 'Azimuth_2023', 'Azimuth_Cell_Types_2021', 'BioCarta_2013', 'BioCarta_2015', 'BioCarta_2016', 'BioPlanet_2019', 'BioPlex_2017', 'CCLE_Proteomics_2020', 'CORUM', 'COVID-19_Related_Gene_Sets', 'COVID-19_Related_Gene_Sets_2021', 'Cancer_Cell_Line_Encyclopedia', 'CellMarker_2024', 'CellMarker_Augmented_2021', 'ChEA_2013', 'ChEA_2015', 'ChEA_2016', 'ChEA_2022', 'Chromosome_Location', 'Chromosome_Location_hg19', 'ClinVar_2019', 'DGIdb_Drug_Targets_2024', 'DSigDB', 'Data_Acquisition_Method_Most_Popular_Genes', 'DepMap_CRISPR_GeneDependency_CellLines_2023', 'DepMap_WG_CRISPR_Screens_Broad_CellLines_2019', 'DepMap_WG_CRISPR_Screens_Sanger_CellLines_2019', 'Descartes_Cell_Types_and_Tissue_2021', 'Diabetes_Perturbations_GEO_2022', 'DisGeNET', 'Disease_Perturbations_from_GEO_down', 'Disease_Perturbations_from_GEO_up', 'Disease_Signatures_from_GEO_down_2014', 'Disease_Signatures_from_GEO_up_2014', 'DrugMatrix', 'Drug_Perturbations_from_GEO_2014', 'Drug_Perturbations_from_GEO_down', 'Drug_Perturbations_from_GEO_up', 'ENCODE_Histone_Modifications_2013', 'ENCODE_Histone_Modifications_2015', 'ENCODE_TF_ChIP-seq_2014', 'ENCODE_TF_ChIP-seq_2015', 'ENCODE_and_ChEA_Consensus_TFs_from_ChIP-X', 'ESCAPE', 'Elsevier_Pathway_Collection', 'Enrichr_Libraries_Most_Popular_Genes', 'Enrichr_Submissions_TF-Gene_Coocurrence', 'Enrichr_Users_Contributed_Lists_2020', 'Epigenomics_Roadmap_HM_ChIP-seq', 'FANTOM6_lncRNA_KD_DEGs', 'GO_Biological_Process_2013', 'GO_Biological_Process_2015', 'GO_Biological_Process_2017', 'GO_Biological_Process_2017b', 'GO_Biological_Process_2018', 'GO_Biological_Process_2021', 'GO_Biological_Process_2023', 'GO_Biological_Process_2025', 'GO_Cellular_Component_2013', 'GO_Cellular_Component_2015', 'GO_Cellular_Component_2017', 'GO_Cellular_Component_2017b', 'GO_Cellular_Component_2018', 'GO_Cellular_Component_2021', 'GO_Cellular_Component_2023', 'GO_Cellular_Component_2025', 'GO_Molecular_Function_2013', 'GO_Molecular_Function_2015', 'GO_Molecular_Function_2017', 'GO_Molecular_Function_2017b', 'GO_Molecular_Function_2018', 'GO_Molecular_Function_2021', 'GO_Molecular_Function_2023', 'GO_Molecular_Function_2025', 'GTEx_Aging_Signatures_2021', 'GTEx_Tissue_Expression_Down', 'GTEx_Tissue_Expression_Up', 'GTEx_Tissues_V8_2023', 'GWAS_Catalog_2019', 'GWAS_Catalog_2023', 'GeDiPNet_2023', 'GeneSigDB', 'Gene_Perturbations_from_GEO_down', 'Gene_Perturbations_from_GEO_up', 'Genes_Associated_with_NIH_Grants', 'Genome_Browser_PWMs', 'GlyGen_Glycosylated_Proteins_2022', 'HDSigDB_Human_2021', 'HDSigDB_Mouse_2021', 'HMDB_Metabolites', 'HMS_LINCS_KinomeScan', 'HomoloGene', 'HuBMAP_ASCT_plus_B_augmented_w_RNAseq_Coexpression', 'HuBMAP_ASCTplusB_augmented_2022', 'HumanCyc_2015', 'HumanCyc_2016', 'Human_Gene_Atlas', 'Human_Phenotype_Ontology', 'IDG_Drug_Targets_2022', 'InterPro_Domains_2019', 'Jensen_COMPARTMENTS', 'Jensen_DISEASES', 'Jensen_DISEASES_Curated_2025', 'Jensen_DISEASES_Experimental_2025', 'Jensen_TISSUES', 'KEA_2013', 'KEA_2015', 'KEGG_2013', 'KEGG_2015', 'KEGG_2016', 'KEGG_2019_Human', 'KEGG_2019_Mouse', 'KEGG_2021_Human', 'KOMP2_Mouse_Phenotypes_2022', 'Kinase_Perturbations_from_GEO_down', 'Kinase_Perturbations_from_GEO_up', 'L1000_Kinase_and_GPCR_Perturbations_down', 'L1000_Kinase_and_GPCR_Perturbations_up', 'LINCS_L1000_CRISPR_KO_Consensus_Sigs', 'LINCS_L1000_Chem_Pert_Consensus_Sigs', 'LINCS_L1000_Chem_Pert_down', 'LINCS_L1000_Chem_Pert_up', 'LINCS_L1000_Ligand_Perturbations_down', 'LINCS_L1000_Ligand_Perturbations_up', 'Ligand_Perturbations_from_GEO_down', 'Ligand_Perturbations_from_GEO_up', 'MAGMA_Drugs_and_Diseases', 'MAGNET_2023', 'MCF7_Perturbations_from_GEO_down', 'MCF7_Perturbations_from_GEO_up', 'MGI_Mammalian_Phenotype_2013', 'MGI_Mammalian_Phenotype_2017', 'MGI_Mammalian_Phenotype_Level_3', 'MGI_Mammalian_Phenotype_Level_4', 'MGI_Mammalian_Phenotype_Level_4_2019', 'MGI_Mammalian_Phenotype_Level_4_2021', 'MGI_Mammalian_Phenotype_Level_4_2024', 'MSigDB_Computational', 'MSigDB_Hallmark_2020', 'MSigDB_Oncogenic_Signatures', 'Metabolomics_Workbench_Metabolites_2022', 'Microbe_Perturbations_from_GEO_down', 'Microbe_Perturbations_from_GEO_up', 'MoTrPAC_2023', 'Mouse_Gene_Atlas', 'NCI-60_Cancer_Cell_Lines', 'NCI-Nature_2016', 'NIBR_DRUGseq_2025_down', 'NIBR_DRUGseq_2025_up', 'NIH_Funded_PIs_2017_AutoRIF_ARCHS4_Predictions', 'NIH_Funded_PIs_2017_GeneRIF_ARCHS4_Predictions', 'NIH_Funded_PIs_2017_Human_AutoRIF', 'NIH_Funded_PIs_2017_Human_GeneRIF', 'NURSA_Human_Endogenous_Complexome', 'OMIM_Disease', 'OMIM_Expanded', 'Old_CMAP_down', 'Old_CMAP_up', 'Orphanet_Augmented_2021', 'PFOCR_Pathways', 'PFOCR_Pathways_2023', 'PPI_Hub_Proteins', 'PanglaoDB_Augmented_2021', 'Panther_2015', 'Panther_2016', 'PerturbAtlas', 'Pfam_Domains_2019', 'Pfam_InterPro_Domains', 'PheWeb_2019', 'PhenGenI_Association_2021', 'Phosphatase_Substrates_from_DEPOD', 'ProteomicsDB_2020', 'Proteomics_Drug_Atlas_2023', 'RNA-Seq_Disease_Gene_and_Drug_Signatures_from_GEO', 'RNAseq_Automatic_GEO_Signatures_Human_Down', 'RNAseq_Automatic_GEO_Signatures_Human_Up', 'RNAseq_Automatic_GEO_Signatures_Mouse_Down', 'RNAseq_Automatic_GEO_Signatures_Mouse_Up', 'Rare_Diseases_AutoRIF_ARCHS4_Predictions', 'Rare_Diseases_AutoRIF_Gene_Lists', 'Rare_Diseases_GeneRIF_ARCHS4_Predictions', 'Rare_Diseases_GeneRIF_Gene_Lists', 'Reactome_2013', 'Reactome_2015', 'Reactome_2016', 'Reactome_2022', 'Reactome_Pathways_2024', 'Rummagene_kinases', 'Rummagene_signatures', 'Rummagene_transcription_factors', 'SILAC_Phosphoproteomics', 'SubCell_BarCode', 'SynGO_2022', 'SynGO_2024', 'SysMyo_Muscle_Gene_Sets', 'TF-LOF_Expression_from_GEO', 'TF_Perturbations_Followed_by_Expression', 'TG_GATES_2020', 'TRANSFAC_and_JASPAR_PWMs', 'TRRUST_Transcription_Factors_2019', 'Table_Mining_of_CRISPR_Studies', 'Tabula_Muris', 'Tabula_Sapiens', 'TargetScan_microRNA', 'TargetScan_microRNA_2017', 'The_Kinase_Library_2023', 'The_Kinase_Library_2024', 'Tissue_Protein_Expression_from_Human_Proteome_Map', 'Tissue_Protein_Expression_from_ProteomicsDB', 'Transcription_Factor_PPIs', 'UK_Biobank_GWAS_v1', 'Virus-Host_PPI_P-HIPSTer_2020', 'VirusMINT', 'Virus_Perturbations_from_GEO_down', 'Virus_Perturbations_from_GEO_up', 'WikiPathway_2021_Human', 'WikiPathway_2023_Human', 'WikiPathways_2013', 'WikiPathways_2015', 'WikiPathways_2016', 'WikiPathways_2019_Human', 'WikiPathways_2019_Mouse', 'WikiPathways_2024_Human', 'WikiPathways_2024_Mouse', 'dbGaP', 'huMAP', 'lncHUB_lncRNA_Co-Expression', 'miRTarBase_2017']

In [48]:

Copied!





# Run enrichment
enr = gp.enrichr(
    gene_list = top_genes_union,          # Your DEG list
    organism = 'Human',                  # or 'Mouse'
    gene_sets = ['Reactome_Pathways_2024', 'GO_Biological_Process_2025', 'GO_Molecular_Function_2025', 'GO_Cellular_Component_2025'],  # choose database(s)
    outdir = 'enrichr_results'           # output directory
)

# View top pathways
print(enr.results.head())
# Run enrichment
enr = gp.enrichr(
    gene_list = top_genes_union,          # Your DEG list
    organism = 'Human',                  # or 'Mouse'
    gene_sets = ['Reactome_Pathways_2024', 'GO_Biological_Process_2025', 'GO_Molecular_Function_2025', 'GO_Cellular_Component_2025'],  # choose database(s)
    outdir = 'enrichr_results'           # output directory
)

# View top pathways
print(enr.results.head())

                 Gene_set                                               Term  \
0  Reactome_Pathways_2024                  Eukaryotic Translation Elongation   
1  Reactome_Pathways_2024                           Peptide Chain Elongation   
2  Reactome_Pathways_2024           Formation of a Pool of Free 40S Subunits   
3  Reactome_Pathways_2024                                    rRNA Processing   
4  Reactome_Pathways_2024  L13a-mediated Translational Silencing of Cerul...   

  Overlap       P-value  Adjusted P-value  Old P-value  Old Adjusted P-value  \
0   32/99  6.008515e-41      3.611117e-38            0                     0   
1   31/94  5.357908e-40      1.610051e-37            0                     0   
2  31/106  4.116050e-38      8.245821e-36            0                     0   
3  39/237  2.865285e-37      4.305090e-35            0                     0   
4  31/116  9.935308e-37      1.194224e-34            0                     0   

   Odds Ratio  Combined Score  \
0   60.063504     5562.649971   
1   61.501708     5561.281480   
2   51.630042     4444.486449   
3   25.754444     2167.055722   
4   45.532837     3774.651853   

                                               Genes  
0  RPL30;RPL32;RPL31;RPLP1;RPLP0;RPS15A;RPS19;RPL...  
1  RPL30;RPL32;RPL31;RPLP1;RPLP0;RPS15A;RPS19;RPL...  
2  RPL30;RPL32;RPL31;RPLP1;RPLP0;RPS15A;RPS19;RPL...  
3  MT-ND4;RPL30;MT-ND5;RPL32;MT-CO1;RPL31;RPLP1;R...  
4  RPL30;RPL32;RPL31;RPLP1;RPLP0;RPS15A;RPS19;RPL...

Filter based on adjusted p-value¶

In [49]:

Copied!

filtered_df = enr.results[enr.results['Adjusted P-value'] < 0.01]
filtered_df
filtered_df = enr.results[enr.results['Adjusted P-value'] < 0.01]
filtered_df

Out[49]:

	Gene_set	Term	Overlap	P-value	Adjusted P-value	Old P-value	Old Adjusted P-value	Odds Ratio	Combined Score	Genes
0	Reactome_Pathways_2024	Eukaryotic Translation Elongation	32/99	6.008515e-41	3.611117e-38	0	0	60.063504	5562.649971	RPL30;RPL32;RPL31;RPLP1;RPLP0;RPS15A;RPS19;RPL...
1	Reactome_Pathways_2024	Peptide Chain Elongation	31/94	5.357908e-40	1.610051e-37	0	0	61.501708	5561.281480	RPL30;RPL32;RPL31;RPLP1;RPLP0;RPS15A;RPS19;RPL...
2	Reactome_Pathways_2024	Formation of a Pool of Free 40S Subunits	31/106	4.116050e-38	8.245821e-36	0	0	51.630042	4444.486449	RPL30;RPL32;RPL31;RPLP1;RPLP0;RPS15A;RPS19;RPL...
3	Reactome_Pathways_2024	rRNA Processing	39/237	2.865285e-37	4.305090e-35	0	0	25.754444	2167.055722	MT-ND4;RPL30;MT-ND5;RPL32;MT-CO1;RPL31;RPLP1;R...
4	Reactome_Pathways_2024	L13a-mediated Translational Silencing of Cerul...	31/116	9.935308e-37	1.194224e-34	0	0	45.532837	3774.651853	RPL30;RPL32;RPL31;RPLP1;RPLP0;RPS15A;RPS19;RPL...
...	...	...	...	...	...	...	...	...	...	...
2149	GO_Cellular_Component_2025	Secretory Granule Lumen (GO:0034774)	11/316	2.181071e-04	3.118931e-03	0	0	3.952220	33.319283	EEF1A1;GMFG;RNASET2;TMSB4X;CTSW;TIMP1;EEF2;SRP...
2150	GO_Cellular_Component_2025	Asymmetric Synapse (GO:0032279)	7/131	2.506586e-04	3.258561e-03	0	0	6.106390	50.630634	RPL30;RPS27;RPS19;RPS18;RPLP0;RPL14;RPL38
2151	GO_Cellular_Component_2025	T Cell Receptor Complex (GO:0042101)	3/15	3.475663e-04	4.141832e-03	0	0	26.611559	211.949228	CD8B;CD8A;CD3E
2152	GO_Cellular_Component_2025	Cytoplasmic Vesicle Lumen (GO:0060205)	6/117	8.637376e-04	9.501113e-03	0	0	5.818934	41.048164	EEF1A1;GMFG;GZMB;SRP14;EEF2;S100A11
2153	GO_Cellular_Component_2025	Postsynaptic Density (GO:0014069)	7/164	9.627301e-04	9.833600e-03	0	0	4.814797	33.442314	RPL30;RPS27;RPS19;RPS18;RPLP0;RPL14;RPL38

132 rows × 10 columns

Create a dataclass to store the results¶

In [57]:

Copied!





from dataclasses import dataclass
from typing import Optional
 
# A class for holding an SC results
@dataclass
class SingleCell:
    # Attributes Declaration
    # using Type Hints
    reactome_pathways: Optional[list] = None
    go_molecular_function: Optional[list] = None
    go_biological_process: Optional[list] = None
    go_cellular_component: Optional[list] = None
    de_genes: Optional[list] = None
    # Cell types will be predicted by C2S later in the notebook
    cell_types: Optional[list] = None
from dataclasses import dataclass
from typing import Optional
 
# A class for holding an SC results
@dataclass
class SingleCell:
    # Attributes Declaration
    # using Type Hints
    reactome_pathways: Optional[list] = None
    go_molecular_function: Optional[list] = None
    go_biological_process: Optional[list] = None
    go_cellular_component: Optional[list] = None
    de_genes: Optional[list] = None
    # Cell types will be predicted by C2S later in the notebook
    cell_types: Optional[list] = None

Create an object¶

In [56]:

Copied!





# Create an object
sc_obj = SingleCell()

# Add reactome_pathways
sc_obj.reactome_pathways = filtered_df[filtered_df['Gene_set'] == 'Reactome_Pathways_2024']['Term'].tolist()
# Add GO MF
sc_obj.go_molecular_function = filtered_df[filtered_df['Gene_set'] == 'GO_Molecular_Function_2025']['Term'].tolist()
# Add GO BP
sc_obj.go_biological_process = filtered_df[filtered_df['Gene_set'] == 'GO_Biological_Process_2025']['Term'].tolist()
# Add GO CC
sc_obj.go_cellular_component = filtered_df[filtered_df['Gene_set'] == 'GO_Cellular_Component_2025']['Term'].tolist()
# Add DE genes
sc_obj.de_genes = top_genes_union
# Create an object
sc_obj = SingleCell()

# Add reactome_pathways
sc_obj.reactome_pathways = filtered_df[filtered_df['Gene_set'] == 'Reactome_Pathways_2024']['Term'].tolist()
# Add GO MF
sc_obj.go_molecular_function = filtered_df[filtered_df['Gene_set'] == 'GO_Molecular_Function_2025']['Term'].tolist()
# Add GO BP
sc_obj.go_biological_process = filtered_df[filtered_df['Gene_set'] == 'GO_Biological_Process_2025']['Term'].tolist()
# Add GO CC
sc_obj.go_cellular_component = filtered_df[filtered_df['Gene_set'] == 'GO_Cellular_Component_2025']['Term'].tolist()
# Add DE genes
sc_obj.de_genes = top_genes_union

Construct an arrow dataset of cell sentences from an AnnData object¶

In [27]:

Copied!





arrow_ds, vocabulary = cs.CSData.adata_to_arrow(
    adata=adata, 
    random_state=SEED, 
    sentence_delimiter=' ',
    label_col_names=adata_obs_cols_to_keep
)
arrow_ds, vocabulary = cs.CSData.adata_to_arrow(
    adata=adata, 
    random_state=SEED, 
    sentence_delimiter=' ',
    label_col_names=adata_obs_cols_to_keep
)

WARN: more variables (13542) than observations (9639)... did you mean to transpose the object (e.g. adata.T)?
WARN: more variables (13542) than observations (9639), did you mean to transpose the object (e.g. adata.T)?
100%|██████████| 9639/9639 [00:04<00:00, 2053.15it/s]

In [28]:

Copied!

arrow_ds
arrow_ds

Out[28]:

Dataset({
    features: ['cell_name', 'cell_sentence', 'organism', 'cell_type'],
    num_rows: 9639
})

Show first sample in the CSData object¶

In [480]:

Copied!

sample_idx = 0
arrow_ds[sample_idx]
sample_idx = 0
arrow_ds[sample_idx]

Out[480]:

{'cell_name': 'AAACCCAAGCATGGGT_SAN',
 'cell_sentence': 'RPS27 RPL41 RPLP1 RPS18 RPL13 RPL10 RPS29 RPL34 RPS2 RPS12 MT-ND3 RPS19 EEF1A1 RPL30 RPL26 RPS14 B2M RPS4X RPS21 RPS23 RPL18 RPS15A RPS28 RPL21 TPT1 RPL17 RPS27A RPL32 RPL9 RPLP2 RPS24 RPL37 RPS7 RPL39 MT-CYB RPL23A RPL15 RPL28 RPS6 RPS8 RPL11 MT-CO3 RPS3A RPL5 RPL18A MT-CO2 PLCG2 RPL37A MT-CO1 RPL7A RPL19 RPL8 MT-ATP6 RPL3 RPS9 RPS3 RPL35 RPL13A FAU RPS26 RPL27A RPL14 RPS13 TMSB4X RPS15 RPL29 RPL6 RPL12 MT-ND1 RPL24 RPL36 RPS25 RPL35A RPSA MT-ND4 KLRB1 RPL31 TXNIP EIF1 FTL RPS11 RPL7 NACA RPS20 RPL27 UBA52 RPL10A RPL22 RPS16 RPL38 MT-ND2 TMSB10 EEF1B2 PFDN5 RPL36A RPS17L BTG1 HCST RPLP0 DDX5 COX4I1 FTH1 RPL36AL HSP90AA1 CCL5 CALM1 EEF1D ATP5E SERF2 IL32 RPS5 HLA-C NKG7 HMGB1 CST7 HLA-B HNRNPA2B1 PTMA GLTSCR2 GNB2L1 TMA7 HIST1H3D RPS10 MTRNR2L12 CD52 UBC CDKN2D UBL5 HNRNPK SRSF5 KLF13 ARL4C CD48 H3F3A MYL12A NPM1 C12ORF57 IL7R COX7A2 PSMB9 TMEM107 COMMD6 CPSF2 HSPA8 N4BP2L2 NUCKS1 HIST1H1D PPIA ATP5L EVL SUMO2 H3F3B ZNF652 CLIC1 NDUFA4 COX7C SLC25A3 SYNE2 BCLAF1 ATP6V0E1 SORL1 ZFP36L1 HIST1H1E LYAR FXYD5 UQCRB EIF4A1 HLA-A WIPF2 FUS EIF3E KIAA1551 KLRD1 PNRC1 DNAJC19 CENPC NDUFB2 WAS COQ7 CEBPD S100A6 SKP1 MZT2A TSTD1 GMFG TNRC6C TCEA1 FOS ARPC1B RNASET2 PNRC2 XBP1 PPP2R5C ELF1 DUSP1 STOML2 CAST IPCEF1 CD3G C7ORF73 C19ORF52 FAM208A PTPRC GSK3A SRRM1 EIF3F SUGT1 USMG5 RRM1 CORO1A UQCRQ LIMD2 ARID4B SRRM2 CTSC PSME1 MPHOSPH8 KRI1 MRPS15 PSMB1 BTF3 MYH9 S100A4 XPC MT2A DYNLL2 HAX1 PYHIN1 VIM CLSTN1 DDX24 IER5 ARNTL TTC3 LOH12CR1 HDGF NDUFB11 PRR5 MT-ND4L CREBRF MYL6 ZFAND6 RPS6KA3 CD3E LCP1 FAM107B HSPH1 CHD9 FAIM3 KHDRBS1 GNG2 RPL4 SRSF3 NCOA7 C14ORF2 GTF2B MGAT4A CSTB RTF1 C5ORF24 CXCR4 ATP6V1G1 TOMM7 CHCHD2 UXT ARF1 HUWE1 BIN2 UBB ZFYVE28 NOL11 ID2 PPIB CHD1 DNAJC8 PNISR SUMO1 NME1-NME2 MTF2 UBE2D3 RBM6 MT-ND5 SSR4 HLA-E YBX1 JOSD1 PAPD5 OAZ1 EEF1G TOMM20 PRR14L UQCRH SLC38A2 C4ORF48 ZFR COX6B1 MZF1 RGS19 ZFP36L2 MRPL54 TMEM128 PAK2 SNRPD2 IL18RAP SRF PEBP1 UQCR11 MYEOV2 OST4 ANP32E C1ORF63 PABPC1 LTB BOD1L1 FAM50A COX5B LARS NDUFA1 VTI1B SRSF7 DSTN FNBP1 STX11 EIF5 RAD23B BRIX1 KLRK1 UBE2B EHMT2 LTK ECI1 TRMT11 CTD-2260A17.2 CD46 DOCK2 TP53I13 PPP3CC SLC3A2 SEC61G SRP9 DUT ZNHIT1 SNX5 ETFB SSH2 HMGB2 MOV10 NFATC2IP NGRN AAK1 ITM2A MSRB2 MAX COX8A LAMTOR5 JHDM1D ZNF35 FAM192A PTPN4 POLR2M PFDN2 GAPDH SRGN PQLC3 TXNL4A ZBTB16 ADAMTS1 CYP4V2 NDUFS5 FAM20B BBS12 ZNF446 NADSYN1 STAG1 RABAC1 ZGPAT C15ORF61 PGAM1 SNRPG CYB5D2 SERP1 AKIRIN2 HINT1 CDC42SE2 ZNF281 PAXBP1 MAML1 TARS KIAA0922 GPR108 LAMTOR4 FBXL3 LEPROTL1 GNAI3 MAN1A2 PDCD7 SPOCK2 TFDP2 TWF1 PTP4A2 VPS28 TMEM147 SF1 QKI CYB5R4 DUSP23 WAC ME2 SCRG1 TMA16 MTIF3 C1ORF43 PMM2 DPY30 MALSU1 UBXN1 TRMT1 SOS1 PIGP NDUFA3 PRKD3 KAT2B DYNC1LI2 WDR83OS KIF9 YWHAH EIF3H SNX3 MT1X TBC1D5 SNRNP25 ASH1L PIM1 MED16 EMB TM2D1 AKIRIN1 PAFAH1B1 SLC5A6 RAP1A CAMLG PCBP3 HIPK3 ARHGAP26 DAD1 CDK11A SERPINB1 LAS1L PIBF1 SRP14 NUDT16L1 SLC4A4 SLU7 SIKE1 MCTS1 SF3B4 DCAF16 NOP14 UBE2M PDCL3 FAM89B ARGLU1 FAM21A MRP63 MSN IL16 MRPL34 TCF25 DCLRE1A CERS2 SRP19 SCNM1 HSPA9 CCR6 ZNF831 VPS25 ANAPC5 SNX1 HNRNPA3 ZNF626 SAMD9L ATP5C1 XRN2 ZNF217 REST OCIAD1 TRAT1 ERBB2IP CDK11B MINOS1 NUDT15 CHCHD10 CLTB MAP3K8 SEPT7 KMT2A CHIC2 DNAJA2 TCERG1 PSMC1 C16ORF54 MRPS33 ACAP2 ZBTB8OS KIAA0947 DPYD MDM4 CAPZA1 TMEM55B TGFBR2 GZMA SNAPIN SLC38A1 SRI COPE MRPL18 PPP1R14B EIF3K DCP2 SSSCA1 RASA3 ANKRD12 METTL2A TMEM243 POLR3GL ALDH9A1 MPRIP PLAC8 SPTSSA SLC25A40 TANK C10ORF137 ZNF566 SMYD2 MPC1 PTEN DEK ATP5I PREP TOLLIP PSMC3 SOCS3 NRBP1 SPRYD7 AP3D1 PTPRA SNRNP200 CCNL1 PARP8 STK10 WDR73 PCM1 PPDPF RASSF3 MCL1 TJAP1 PRKCH DDX17 EIF4A2 FAM96B TMEM106B LINC00998 AASDHPPT KLHDC4 TNIK NCKAP1L CCNK SUCO XIAP IQCG SGK3 ECI2 APOL2 RAC2 CIRBP MAP2K2 FKBP11 ACTR1A RNF145 B3GNT1 MRPS36 MBP ATP2B4 SOCS1 PXMP4 POLK TUBB4B ST6GALNAC6 HIST2H2AC EIF4E HLCS HTT PRMT3 NDUFB4 PRKCE CTSW RHOU EMC7 MRPL41 MRPL47 SH3KBP1 AZIN1 IDI1 STK32C SP110 P4HA1 ATP5H FBL RBM26 MAP3K11 DTX3L ACTG1 OXSM YWHAZ ERN1 PPM1B C19ORF53 DNAJC4 DUSP7 C21ORF91 SAMHD1 METAP2 DYRK1B WDR33 RSF1 PREB DDOST NDUFA5 PLEK NLRP2 IMPDH2 FNIP1 UCP2 TTC39C MAP1S ZC3H13 LYSMD2 SLA2 ATP2A2 APEX1 PTPN2 VRK3 MPST MPLKIP EAPP CDC27 TWF2 NDUFA8 CYTH1 EOMES ZNF266 PMF1 CALM2 NIPBL BAG1 PSMD9 IDNK GBP2 SETX HADHA ANKRD49 BDP1 HACE1 METTL12 SNRPN TMEM223 HNRNPH2 SF3A1 MATK AURKAIP1 LATS1 PRELID1 AGPAT1 STAU1 FDFT1 GOSR2 DNASE2 ZNF263 ACTR3 FAM133B ITGB1BP1 EIF3A UFC1 OTUD5 RBMS1 CWC15 TAF15 FXR1 S100A10 FAM49A DNAJC21 UBE2L6 ISOC1 NPEPPS SLA CSTF3 TMEM8A SLC25A32 IGIP GABARAPL2 PTCD3 MAP7D1 FAM120A ARPC2 ARPC3 DBI SEC61B NME4 MAP3K10 VPS13B JTB EIF5A STT3B ASXL2 OSBPL8 PPP4C ITGA4 JAK1 NAP1L4 GIMAP7 FOSB TMEM50B ATF7 USP22 DTNBP1 C19ORF70 AGER ATP5J2 ADD1 ACADVL UPP1 CCDC167 BIN1 WASF2 SPCS2 UBE2K UBE2L3 CCNI C14ORF166 FAM65B XYLT2 CLN5 VDAC2 MAP2K3 COPS6 RGL4 USP14 ABCF1 RRP1B MYCBP2 IFNGR1 CFL1 SF3B2 DEDD ST13 RAD21 UBALD2 RASGRP1 RCAN3 C6ORF226 ITGB7 ZFP36 NUCB2 GTF2F2 SFT2D1 ANAPC16 NKAP POLR2A NRROS SH3BGRL3 DMXL1 RAB11B GTF2F1 WDR6 RGS2 ZCCHC17 NFAT5 SOD1 TBC1D10C TAF7 ACTB OAS3 NKTR NUP210 MTHFS RCSD1 GTF3A PSMD4 TBCK TMEM160 XPO1 TPM3 EIF3M SF3B1 NFKBIZ TCEB1 BUB3 GUK1 LMAN1 ZNF606 TLE1 HAUS1 EDEM2 IAH1 DNAJB6 PTPN22 CCDC85B EIF4B MPI SNRPE FEZ1 GCLM MRPL22 MFF RNPEP ACAA2 AES H2AFV PIEZO1 RNF126 TRAPPC5 PRKAA1 NDFIP1 NSA2 MSH6 FKBP1A POLR1D RAB5B GSTM4 P2RY8 COX7A2L SMARCD3 TUBA1A HSP90AB1 SUDS3 PIH1D1 YME1L1 ZNF576 OFD1 DAB1 TSN GK5 APMAP CFDP1 EEF2 COMMD1 TESPA1 ARHGAP10 TIMM8B DYNC1H1 TAF10 CNN2 KTN1 SYTL1 HPCAL1 REEP5 LSG1 BECN1 MAP7D3 ACBD3 TSC22D3 SLC30A4 CCS ARL6IP6 CRB3 SAP30BP CKLF HABP4 VPS37A SNRNP27 POLR2E ANAPC1 OSTF1 CDC14A CHCHD1 NOL8 RPAIN SAT2 THAP1 ZFP3 RBBP4 NDUFA13 NHP2L1 NT5C3A SERPINB6 PPM1M SCP2 MARCH9 TC2N GIMAP1 FLNA DYNLL1 GHDC UBE2O ACSL3 NRIP1 IFNAR1 C8ORF33 SPSB3 ARL14EP MPHOSPH10 ERP29 DDT MOB1A REL RNASEK RPL23 SAFB RAB3GAP2 VAMP2 BTG2 EPM2A SF3B5 LSM4 NAP1L1 DRAM2 PLCL2 NDUFA7 HP1BP3 DBP BRI3 KLHL6 RICTOR DAZAP1 CWF19L2 NFE2L1 HELZ UBE2Z PPP1CA THEMIS FOXN2 VPS13C MT-ATP8 MIF BTAF1 ZNF146 PLEKHF2 PHF11 TTC7A PIGF CD74 VPS4B RANBP1 PRR13 G3BP1 RBM28 SLC12A2 STAT4 RBM4 TMEM251 CAB39 ADRBK1 CALCOCO2 HNRNPDL IRF2 RBM39 SUPT3H ACP1 DYNC1I2 HERPUD2 TERF2IP C3ORF58 KLF2 ZNF559 EIF3D LRIF1 NBAS SPSB2 ZNF738 SLC25A6 ARID4A PER1 GGA1 SACM1L CLEC2D MRPL20 APOL3 MRPL24 CUL5 SLC12A5 PET100 VAMP7 ZFAND5 MRPS24 NAE1 CAPN2 ARF4 RHEB RBM14 LEMD3 TRIM26 C16ORF80 CD63 STX16 CGGBP1 FBXL19 LSM14A SNRPF GLRX PRDM1 RNF215 ITM2B KIAA1191 TDP1 AMICA1 PSMD8 H2AFZ GPSM3 RHOA RBM33 CCDC69 TOPORS PTPN6 PGK1 MPG IK PDCD4 TAF1D ZMYM6NB STRAP RRAS2 VTI1A SPTY2D1 VIMP SLAMF7 MSL2 MTHFSD FAM136A ARPC5L NASP HNRNPF COX14 SMEK1 LTN1 PSMA7 SDCBP LGALS3 NT5C PRDX6 PSMD13 PARP14 ACYP1 C11ORF58 ATP6V0A2 DNAJB9 PSMC5 ARL3 DYRK1A ANKRD28 KCTD20 LYSMD4 PGAP3 GCHFR TMEM134 FAM32A OXA1L FAM129A TCEB2 USP3 INADL SNRPB2 SELK DDX39B ADAT2 MAPK13 FERMT3 STK38L TXN STK17B GOLGA8B C15ORF40 KANSL1 UBE2D2 TM9SF2 MRPL36 CENPB TRIM4 INO80E KDELR1 DTYMK ZNF638 SMAP1 UFM1 CD151 CEP250 ZFYVE20 TOE1 RSL24D1 TOMM5 MGST3 KCTD2 ZNF280D CNBP SYF2 RNASEH2C BRD2 CYBA ATPIF1 PYROXD1 COG8 ECHS1 SH3BGRL COX6A1 GCDH BZW1 PSENEN USPL1 NMRK1 MSL3 MTMR9 CCDC88C PUM1 HBP1 C19ORF43 PSMB8 FLYWCH2 CCNDBP1 DDX1 ERBB2 ARHGAP30 C5ORF56 COX20 USP16 COX7B SMCR8',
 'organism': 'Homo sapiens',
 'cell_type': ''}

Load the C2S model¶

In [15]:

Copied!

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("vandijklab/C2S-Pythia-410m-cell-type-prediction")
model = AutoModelForCausalLM.from_pretrained("vandijklab/C2S-Pythia-410m-cell-type-prediction")
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("vandijklab/C2S-Pythia-410m-cell-type-prediction")
model = AutoModelForCausalLM.from_pretrained("vandijklab/C2S-Pythia-410m-cell-type-prediction")

The `GPTNeoXSdpaAttention` class is deprecated in favor of simply modifying the `config._attn_implementation`attribute of the `GPTNeoXAttention` class! It will be removed in v4.48

In [86]:

Copied!

model.save_pretrained("/scratch/model")
tokenizer.save_pretrained("/scratch/model")
model.save_pretrained("/scratch/model")
tokenizer.save_pretrained("/scratch/model")

Out[86]:

('/scratch/model/tokenizer_config.json',
 '/scratch/model/special_tokens_map.json',
 '/scratch/model/tokenizer.json')

Create CSModel object of the loaded model¶

In [89]:

Copied!





# Define CSModel object
cell_type_prediction_model_path = "/scratch/model"
save_dir = "/scratch/"
save_name = "cell_type_pred_pythia_410M_inference"
csmodel = cs.CSModel(
    model_name_or_path=cell_type_prediction_model_path,
    save_dir=save_dir,
    save_name=save_name
)
# Define CSModel object
cell_type_prediction_model_path = "/scratch/model"
save_dir = "/scratch/"
save_name = "cell_type_pred_pythia_410M_inference"
csmodel = cs.CSModel(
    model_name_or_path=cell_type_prediction_model_path,
    save_dir=save_dir,
    save_name=save_name
)

Using device: cuda

Run the following cell if you want to sample out only a subset of rows¶

For example in the cell below, we select only 10 rows

In [472]:

Copied!

increm = arrow_ds.num_rows // 10
arrow_ds = arrow_ds.select(range(0, arrow_ds.num_rows, increm))
arrow_ds
increm = arrow_ds.num_rows // 10
arrow_ds = arrow_ds.select(range(0, arrow_ds.num_rows, increm))
arrow_ds

Out[472]:

Dataset({
    features: ['cell_name', 'cell_sentence', 'organism', 'cell_type'],
    num_rows: 11
})

Specify the output directory¶

In [481]:

Copied!

c2s_save_dir = "./output"  # C2S dataset will be saved into this directory
c2s_save_name = "test_ibd"  # This will be the name of our C2S dataset on disk
c2s_save_dir = "./output"  # C2S dataset will be saved into this directory
c2s_save_name = "test_ibd"  # This will be the name of our C2S dataset on disk

Create new CSData object from a the arrow dataset¶

In [482]:

Copied!





csdata = cs.CSData.csdata_from_arrow(
    arrow_dataset=arrow_ds, 
    vocabulary=vocabulary,
    save_dir=c2s_save_dir,
    save_name=c2s_save_name,
    dataset_backend="arrow"
)
csdata = cs.CSData.csdata_from_arrow(
    arrow_dataset=arrow_ds, 
    vocabulary=vocabulary,
    save_dir=c2s_save_dir,
    save_name=c2s_save_name,
    dataset_backend="arrow"
)

Saving the dataset (1/1 shards): 100%|██████████| 9639/9639 [00:00<00:00, 100934.47 examples/s]

Predict cell types¶

Use the predict_cell_types_of_data function of C2S

In [483]:

Copied!





predicted_cell_types = predict_cell_types_of_data(
    csdata=csdata,
    csmodel=csmodel,
    n_genes=200
)
predicted_cell_types = predict_cell_types_of_data(
    csdata=csdata,
    csmodel=csmodel,
    n_genes=200
)

Reloading model from path on disk: /scratch/cell_type_pred_pythia_410M_inference
Predicting cell types for 9639 cells using CSModel...

100%|██████████| 9639/9639 [5:44:51<00:00,  2.15s/it]

Save the annotated adata¶

In [485]:

Copied!

adata.obs['predicted_cell_type'] = predicted_cell_types
SAVE_PATH = "/scratch/GSM7994747_CD_annotated.h5ad"
adata.write_h5ad(SAVE_PATH)
adata.obs['predicted_cell_type'] = predicted_cell_types
SAVE_PATH = "/scratch/GSM7994747_CD_annotated.h5ad"
adata.write_h5ad(SAVE_PATH)

Read the annotated adata¶

In [58]:

Copied!

SAVE_PATH = "/scratch/GSM7994747_CD_annotated.h5ad"
adata_new = anndata.read_h5ad(SAVE_PATH)
adata_new.obs.head()
SAVE_PATH = "/scratch/GSM7994747_CD_annotated.h5ad"
adata_new = anndata.read_h5ad(SAVE_PATH)
adata_new.obs.head()

Out[58]:

	n_genes	organism	n_genes_by_counts	total_counts	total_counts_mt	pct_counts_mt	predicted_cell_type
AAACCCAAGCATGGGT_SAN	1125	Homo sapiens	1125	3624.0	178.0	4.911700	effector memory CD8-positive, alpha-beta T cell.
AAACCCAAGTAGACAT_SAN	1977	Homo sapiens	1977	7452.0	371.0	4.978529	effector memory CD8-positive, alpha-beta T cell.
AAACCCAAGTCCCAGC_SAN	1684	Homo sapiens	1684	6372.0	300.0	4.708098	central memory CD4-positive, alpha-beta T cell.
AAACCCAAGTGAGTTA_SAN	1214	Homo sapiens	1214	4374.0	278.0	6.355738	CD4-positive, alpha-beta T cell.
AAACCCACACTTGGCG_SAN	1069	Homo sapiens	1069	4894.0	165.0	3.371475	naive thymus-derived CD4-positive, alpha-beta ...

Add predicted cell types to the SingleCell object¶

In [71]:

Copied!





sc_obj.cell_types = list(set(adata_new.obs['predicted_cell_type'].tolist()))

# Exclude '.' at the end
for i in range(0,  len(sc_obj.cell_types)):
    if sc_obj.cell_types[i][-1] == '.':
        sc_obj.cell_types[i] = sc_obj.cell_types[i][:-1]
sc_obj.cell_types
sc_obj.cell_types = list(set(adata_new.obs['predicted_cell_type'].tolist()))

# Exclude '.' at the end
for i in range(0,  len(sc_obj.cell_types)):
    if sc_obj.cell_types[i][-1] == '.':
        sc_obj.cell_types[i] = sc_obj.cell_types[i][:-1]
sc_obj.cell_types

Out[71]:

['CD4-positive, alpha-beta memory T cell',
 'naive thymus-derived CD4-positive, alpha-beta T cell',
 'CD8-positive, alpha-beta memory T cell',
 'mature NK T cell',
 'effector memory CD4-positive, alpha-beta T cell',
 'regulatory T cell',
 'lymphocyte',
 'T cell',
 'naive thymus-derived CD8-positive, alpha-beta T cell',
 'mature alpha-beta T cell',
 'effector memory CD8-positive, alpha-beta T cell',
 'central memory CD4-positive, alpha-beta T cell',
 'CD8-positive, alpha-beta T cell',
 'CD4-positive, alpha-beta T cell',
 'natural killer cell',
 'gamma-delta T cell']

Save results in an excel workbook¶

In [76]:

Copied!





path = "test.xlsx"

df_reactome_pathways = pd.DataFrame(sc_obj.reactome_pathways, columns=['name'])
df_go_molecular_function = pd.DataFrame(sc_obj.go_molecular_function, columns=['name'])
df_go_biological_process = pd.DataFrame(sc_obj.go_biological_process, columns=['name'])
df_go_cellular_component = pd.DataFrame(sc_obj.go_cellular_component, columns=['name'])
df_de_genes = pd.DataFrame(sc_obj.de_genes, columns=['name'])
df_cell_types = pd.DataFrame(sc_obj.cell_types, columns=['name'])

writer = pd.ExcelWriter(path, engine = 'xlsxwriter')
df_reactome_pathways.to_excel(writer, sheet_name = 'reactome_pathways', index=False)
df_go_molecular_function.to_excel(writer, sheet_name = 'molecular_function', index=False)
df_go_biological_process.to_excel(writer, sheet_name = 'biological_process', index=False)
df_go_cellular_component.to_excel(writer, sheet_name = 'cellular_component', index=False)
df_de_genes.to_excel(writer, sheet_name = 'genes', index=False)
df_cell_types.to_excel(writer, sheet_name = 'cell_types', index=False)

writer.close()
path = "test.xlsx"

df_reactome_pathways = pd.DataFrame(sc_obj.reactome_pathways, columns=['name'])
df_go_molecular_function = pd.DataFrame(sc_obj.go_molecular_function, columns=['name'])
df_go_biological_process = pd.DataFrame(sc_obj.go_biological_process, columns=['name'])
df_go_cellular_component = pd.DataFrame(sc_obj.go_cellular_component, columns=['name'])
df_de_genes = pd.DataFrame(sc_obj.de_genes, columns=['name'])
df_cell_types = pd.DataFrame(sc_obj.cell_types, columns=['name'])

writer = pd.ExcelWriter(path, engine = 'xlsxwriter')
df_reactome_pathways.to_excel(writer, sheet_name = 'reactome_pathways', index=False)
df_go_molecular_function.to_excel(writer, sheet_name = 'molecular_function', index=False)
df_go_biological_process.to_excel(writer, sheet_name = 'biological_process', index=False)
df_go_cellular_component.to_excel(writer, sheet_name = 'cellular_component', index=False)
df_de_genes.to_excel(writer, sheet_name = 'genes', index=False)
df_cell_types.to_excel(writer, sheet_name = 'cell_types', index=False)

writer.close()