Tutorial spatial
This tutorial is for running NS-Forest for spatial transcriptomic gene panel design.
The first step is filtering non-zero median expression of all genes represented in the AnnData object. This is done by taking the expression values for each gene for all cells, removing the zero values, and calculating the median per gene. Median is used instead of mean because mean is sensitive to outliers and skewed with noise. We can visualize these genes’ non-zero median expression in a histogram and plot the 10th and 99th percentiles. We remove genes with relatively low or high expression. High expression can effect the spatial imaging results due to “optical crowding”, resulting from excessive fluorescent signal from high transcript dense regions (https://www.nature.com/articles/s41467-021-27798-0).
The next step is filtering by transcript length. In spatial transcriptomics, probes are designed to target each gene; genes with more probes tend to have higher sensitivity to be detected (https://pmc.ncbi.nlm.nih.gov/articles/PMC10054990/). The minimum transcript length is to ensure high detection. The transcript lengths are from Gencode’s BioMart. Due to multiple transcript lengths associated with a single gene, the longest length is choosen. The NS-Forest package contains the gencode
files for human (v47) and mouse (vM36), which can be found in the gencode_annotation folder. The default minimum transcript length is 700. This value is adjustable depending on the organ and species. For example, when looking at a mouse brain dataset, we chose 700 to include Sst (721bp), which is a well known neuronal inhibitory marker gene.
Note: adata.var.index should be unique. It is recommended that the adata.var_names be the ensembl_id, but gene_symbol should work.
After running spaceTx_genefilter, the standard NS-Forest pipeline is run, including prep_medians, prep_binary_score, and nsforesting.NSForest. This pipeline outputs a set of 1-6 markers per cell type that can be used for spatial transcriptomic gene panel.
Setting up environment
[1]:
import sys
import os
code_folder = "C:/Users/bpeng/OneDrive - J. Craig Venter Institute/Documents/Github/NSForest"
sys.path.insert(0, os.path.abspath(code_folder))
import numpy as np
import pandas as pd
import scanpy as sc
import anndata as ad
import matplotlib.pyplot as plt
import plotly.io as pio
pio.renderers.default = "notebook"
import nsforest as ns
from nsforest import utils
[2]:
cluster_header = "cluster"
output_folder = "../outputs_layer1_spatial/"
data_folder = "../demo_data/"
file = data_folder + "adata_layer1.h5ad"
adata = sc.read_h5ad(file)
adata
[2]:
AnnData object with n_obs × n_vars = 871 × 16497
obs: 'cluster'
Filtering genes to optimize spatial gene panels.
[3]:
adata = ns.pp.spaceTx_genefilter(adata, gencode_folder = f"{code_folder}/gencode_annotation")
adata
Non-zero median expression percentile limits:
0.10 1.584962
0.99 8.898539
dtype: float64
FILTER 1: 13944 out of 16497 total genes passed the expression filter (lower_percentile = 0.1, upper_percentile = 0.99).
FILTER 2: 14278 out of 16497 total genes passed the transcript length filter (min_txLength = 700).
FINAL SELECTION: 12147 out of 16497 total genes passed both filters.
[3]:
AnnData object with n_obs × n_vars = 871 × 12147
obs: 'cluster'
Calculating medians and binary scores per cluster
[4]:
adata = ns.pp.prep_medians(adata, cluster_header)
adata = ns.pp.prep_binary_scores(adata, cluster_header)
Calculating medians per cluster: 100%|██████████| 16/16 [00:01<00:00, 10.98it/s]
Saving medians as adata.varm.medians_cluster
median: 0.0
mean: 1.828
std: 2.518
Only positive genes selected. 9402 positive genes out of 12147 total genes
--- 1.6438896656036377 seconds ---
Calculating binary scores per cluster: 100%|██████████| 16/16 [01:11<00:00, 4.46s/it]
Saving binary scores as adata.varm.binary_scores_cluster
median: 0.118
mean: 0.207
std: 0.251
--- 71.59167551994324 seconds ---
Running NS-Forest
[5]:
outputfilename_prefix = cluster_header
results = ns.nsforesting.NSForest(adata, cluster_header, save_supplementary = True, save = True, output_folder = output_folder, outputfilename_prefix = outputfilename_prefix)
Running NS-Forest version 4.1
Preparing adata...
Pre-selecting genes based on binary scores...
BinaryFirst_high Threshold (mean + 2 * std): 0.709
Average number of genes after gene_selection in each cluster: 577.5625
Saving number of genes selected per cluster as...
../outputs_layer1_spatial/cluster_gene_selection.csv
--- 0.04798698425292969 seconds ---
Number of clusters to evaluate: 16
1 out of 16:
e1_e299_SLC17A7_L5b_Cdh13
Pre-selected 1122 genes to feed into Random Forest.
NSForest-selected markers: ['SLC17A7', 'ANKRD33B']
fbeta: 0.959
precision: 0.988
recall: 0.856
2 out of 16:
g1_g48_GLI3_Astro_Gja1
Pre-selected 441 genes to feed into Random Forest.
NSForest-selected markers: ['LINC00498']
fbeta: 0.95
precision: 1.0
recall: 0.792
3 out of 16:
g2_g27_APBB1IP_Micro_Ctss
Pre-selected 281 genes to feed into Random Forest.
NSForest-selected markers: ['C1QC']
fbeta: 0.935
precision: 1.0
recall: 0.741
4 out of 16:
g3_g18_GPNMB_OPC_Pdgfra
Pre-selected 251 genes to feed into Random Forest.
NSForest-selected markers: ['COL20A1']
fbeta: 0.833
precision: 1.0
recall: 0.5
5 out of 16:
g4_g9_MOG_Oligo_Opalin
Pre-selected 401 genes to feed into Random Forest.
NSForest-selected markers: ['CNDP1']
fbeta: 0.976
precision: 1.0
recall: 0.889
6 out of 16:
i10_i16_TSPAN12_Vip_Mybpc1
Pre-selected 811 genes to feed into Random Forest.
NSForest-selected markers: ['CHRNB3', 'WIF1']
fbeta: 0.865
precision: 1.0
recall: 0.562
7 out of 16:
i11_i6_EGF_Vip_Mybpc1
Pre-selected 1459 genes to feed into Random Forest.
NSForest-selected markers: ['FZD8']
fbeta: 0.556
precision: 0.667
recall: 0.333
8 out of 16:
i1_i90_COL5A2_Ndnf_Car4
Pre-selected 191 genes to feed into Random Forest.
NSForest-selected markers: ['COL5A2', 'TRPC3']
fbeta: 0.845
precision: 1.0
recall: 0.522
9 out of 16:
i2_i77_LHX6_Sst_Cbln4
Pre-selected 212 genes to feed into Random Forest.
NSForest-selected markers: ['LHX6', 'GRIK3']
fbeta: 0.857
precision: 1.0
recall: 0.545
10 out of 16:
i3_i56_BAGE2_Ndnf_Cxcl14
Pre-selected 110 genes to feed into Random Forest.
NSForest-selected markers: ['SCN5A', 'GREM2']
fbeta: 0.642
precision: 0.826
recall: 0.339
11 out of 16:
i4_i54_MC4R_Ndnf_Cxcl14
Pre-selected 163 genes to feed into Random Forest.
NSForest-selected markers: ['ADAM33', 'NDNF', 'CD36']
fbeta: 0.783
precision: 0.929
recall: 0.481
12 out of 16:
i5_i47_TRPC3_Ndnf_Car4
Pre-selected 773 genes to feed into Random Forest.
NSForest-selected markers: ['TRPC3', 'CA2']
fbeta: 0.828
precision: 0.962
recall: 0.532
13 out of 16:
i6_i44_GPR149_Vip_Mybpc1
Pre-selected 303 genes to feed into Random Forest.
NSForest-selected markers: ['GPR149', 'SHISA8', 'ASIC4']
fbeta: 0.757
precision: 0.852
recall: 0.523
14 out of 16:
i7_i31_CLMP_Ndnf_Cxcl14
Pre-selected 829 genes to feed into Random Forest.
NSForest-selected markers: ['RASSF3', 'PAX6', 'CPLX3']
fbeta: 0.841
precision: 0.947
recall: 0.581
15 out of 16:
i8_i27_SNCG_Vip_Mybpc1
Pre-selected 1082 genes to feed into Random Forest.
NSForest-selected markers: ['SNCG', 'EDNRA']
fbeta: 0.759
precision: 0.923
recall: 0.444
16 out of 16:
i9_i22_TAC3_Vip_Mybpc1
Pre-selected 812 genes to feed into Random Forest.
NSForest-selected markers: ['MCTP2', 'ANGPT1']
fbeta: 0.806
precision: 1.0
recall: 0.455
--- 80.83747506141663 seconds ---
Saving supplementary table as...
../outputs_layer1_spatial/cluster_supplementary.csv
Saving markers table as...
../outputs_layer1_spatial/cluster_markers.csv
using median
Calculating medians per cluster: 100%|██████████| 16/16 [00:00<00:00, 379.53it/s]
Saving supplementary table as...
../outputs_layer1_spatial/cluster_markers_onTarget_supp.csv
Saving supplementary table as...
../outputs_layer1_spatial/cluster_markers_onTarget.csv
Saving final results table as...
../outputs_layer1_spatial/cluster_results.csv
Saving final results table as...
../outputs_layer1_spatial/cluster_results.pkl
[6]:
results
[6]:
| software_version | cluster_header | clusterName | clusterSize | f_score | precision | recall | TN | FP | FN | TP | marker_count | NSForest_markers | binary_genes | onTarget | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4.1 | cluster | e1_e299_SLC17A7_L5b_Cdh13 | 299 | 0.958801 | 0.988417 | 0.856187 | 569 | 3 | 43 | 256 | 2 | [SLC17A7, ANKRD33B] | [SLC17A7, ANKRD33B, SFTA1P, TBR1, LINC00152, N... | 1.000000 |
| 1 | 4.1 | cluster | g1_g48_GLI3_Astro_Gja1 | 48 | 0.950000 | 1.000000 | 0.791667 | 823 | 0 | 10 | 38 | 1 | [LINC00498] | [LINC00498, SLC25A18, FGFR3, EMX2OS, ALDH1L1, ... | 1.000000 |
| 2 | 4.1 | cluster | g2_g27_APBB1IP_Micro_Ctss | 27 | 0.934579 | 1.000000 | 0.740741 | 844 | 0 | 7 | 20 | 1 | [C1QC] | [CSF2RA, C1QC, SYK, P2RY13, FGD2, MS4A7, CX3CR... | 1.000000 |
| 3 | 4.1 | cluster | g3_g18_GPNMB_OPC_Pdgfra | 18 | 0.833333 | 1.000000 | 0.500000 | 853 | 0 | 9 | 9 | 1 | [COL20A1] | [OLIG2, STK32A, COL20A1, KLRC3, B3GNT7, CSPG4,... | 1.000000 |
| 4 | 4.1 | cluster | g4_g9_MOG_Oligo_Opalin | 9 | 0.975610 | 1.000000 | 0.888889 | 862 | 0 | 1 | 8 | 1 | [CNDP1] | [MOBP, CNDP1, TF, ASPA, CLMN, CD22, PPP1R14A, ... | 1.000000 |
| 5 | 4.1 | cluster | i10_i16_TSPAN12_Vip_Mybpc1 | 16 | 0.865385 | 1.000000 | 0.562500 | 855 | 0 | 7 | 9 | 2 | [CHRNB3, WIF1] | [TSPAN12, TMC5, CHRNB3, ANGPT1, DCN, WIF1, HCR... | 0.494252 |
| 6 | 4.1 | cluster | i11_i6_EGF_Vip_Mybpc1 | 6 | 0.555556 | 0.666667 | 0.333333 | 864 | 1 | 4 | 2 | 1 | [FZD8] | [EGF, XYLT2, FZD8, ZSCAN23, NSRP1P1, RPUSD4, S... | 1.000000 |
| 7 | 4.1 | cluster | i1_i90_COL5A2_Ndnf_Car4 | 90 | 0.845324 | 1.000000 | 0.522222 | 781 | 0 | 43 | 47 | 2 | [COL5A2, TRPC3] | [NMBR, BMP2, COL5A2, ADRA1D, TRPC3, PAPSS2, BM... | 0.618906 |
| 8 | 4.1 | cluster | i2_i77_LHX6_Sst_Cbln4 | 77 | 0.857143 | 1.000000 | 0.545455 | 794 | 0 | 35 | 42 | 2 | [LHX6, GRIK3] | [LHX6, RSPO3, CALB1, TAC1, TRBC2, GRIK3, MAFB,... | 0.942058 |
| 9 | 4.1 | cluster | i3_i56_BAGE2_Ndnf_Cxcl14 | 56 | 0.641892 | 0.826087 | 0.339286 | 811 | 4 | 37 | 19 | 2 | [SCN5A, GREM2] | [SCN5A, GREM2, SYT10, ARHGAP18, ZNF423, GRB14,... | 0.531211 |
| 10 | 4.1 | cluster | i4_i54_MC4R_Ndnf_Cxcl14 | 54 | 0.783133 | 0.928571 | 0.481481 | 815 | 2 | 28 | 26 | 3 | [ADAM33, NDNF, CD36] | [EGFL6, ADAM33, PGAM1P5, CHRNB3, NDNF, CD36, A... | 0.351766 |
| 11 | 4.1 | cluster | i5_i47_TRPC3_Ndnf_Car4 | 47 | 0.827815 | 0.961538 | 0.531915 | 823 | 1 | 22 | 25 | 2 | [TRPC3, CA2] | [SOX13, PRSS12, SSTR2, TRPC3, PAPSS2, CA2, HAP... | 0.448483 |
| 12 | 4.1 | cluster | i6_i44_GPR149_Vip_Mybpc1 | 44 | 0.756579 | 0.851852 | 0.522727 | 823 | 4 | 21 | 23 | 3 | [GPR149, SHISA8, ASIC4] | [CXCL12, GPR149, SHISA8, IGFBP5, PXDN, ASIC4, ... | 0.546603 |
| 13 | 4.1 | cluster | i7_i31_CLMP_Ndnf_Cxcl14 | 31 | 0.841121 | 0.947368 | 0.580645 | 839 | 1 | 13 | 18 | 3 | [RASSF3, PAX6, CPLX3] | [FGF10, RASSF3, CLMP, PAX6, SP8, CPLX3, TGFBR2... | 0.629389 |
| 14 | 4.1 | cluster | i8_i27_SNCG_Vip_Mybpc1 | 27 | 0.759494 | 0.923077 | 0.444444 | 843 | 1 | 15 | 12 | 2 | [SNCG, EDNRA] | [SNCG, EDNRA, MMRN2, FBN3, RGS2, SCML4, NRP2, ... | 1.000000 |
| 15 | 4.1 | cluster | i9_i22_TAC3_Vip_Mybpc1 | 22 | 0.806452 | 1.000000 | 0.454545 | 849 | 0 | 12 | 10 | 2 | [MCTP2, ANGPT1] | [BSPRY, OFD1P10Y, MCTP2, ARHGAP29, FHDC1, ANGP... | 0.752842 |