Tutorial spatial

This tutorial is for running NS-Forest for spatial transcriptomic gene panel design.

The first step is filtering non-zero median expression of all genes represented in the AnnData object. This is done by taking the expression values for each gene for all cells, removing the zero values, and calculating the median per gene. Median is used instead of mean because mean is sensitive to outliers and skewed with noise. We can visualize these genes’ non-zero median expression in a histogram and plot the 10th and 99th percentiles. We remove genes with relatively low or high expression. High expression can effect the spatial imaging results due to “optical crowding”, resulting from excessive fluorescent signal from high transcript dense regions (https://www.nature.com/articles/s41467-021-27798-0).

The next step is filtering by transcript length. In spatial transcriptomics, probes are designed to target each gene; genes with more probes tend to have higher sensitivity to be detected (https://pmc.ncbi.nlm.nih.gov/articles/PMC10054990/). The minimum transcript length is to ensure high detection. The transcript lengths are from Gencode’s BioMart. Due to multiple transcript lengths associated with a single gene, the longest length is choosen. The NS-Forest package contains the gencode files for human (v47) and mouse (vM36), which can be found in the gencode_annotation folder. The default minimum transcript length is 700. This value is adjustable depending on the organ and species. For example, when looking at a mouse brain dataset, we chose 700 to include Sst (721bp), which is a well known neuronal inhibitory marker gene.

Note: adata.var.index should be unique. It is recommended that the adata.var_names be the ensembl_id, but gene_symbol should work.

After running spaceTx_genefilter, the standard NS-Forest pipeline is run, including prep_medians, prep_binary_score, and nsforesting.NSForest. This pipeline outputs a set of 1-6 markers per cell type that can be used for spatial transcriptomic gene panel.

Setting up environment

[1]:
import sys
import os
code_folder = "C:/Users/bpeng/OneDrive - J. Craig Venter Institute/Documents/Github/NSForest"
sys.path.insert(0, os.path.abspath(code_folder))
import numpy as np
import pandas as pd
import scanpy as sc
import anndata as ad
import matplotlib.pyplot as plt
import plotly.io as pio
pio.renderers.default = "notebook"
import nsforest as ns
from nsforest import utils
[2]:
cluster_header = "cluster"
output_folder = "../outputs_layer1_spatial/"
data_folder = "../demo_data/"
file = data_folder + "adata_layer1.h5ad"
adata = sc.read_h5ad(file)
adata
[2]:
AnnData object with n_obs × n_vars = 871 × 16497
    obs: 'cluster'

Filtering genes to optimize spatial gene panels.

[3]:
adata = ns.pp.spaceTx_genefilter(adata, gencode_folder = f"{code_folder}/gencode_annotation")
adata
Non-zero median expression percentile limits:
0.10    1.584962
0.99    8.898539
dtype: float64
_images/tutorial_spatial_6_1.png
FILTER 1: 13944 out of 16497 total genes passed the expression filter (lower_percentile = 0.1, upper_percentile = 0.99).

FILTER 2: 14278 out of 16497 total genes passed the transcript length filter (min_txLength = 700).

FINAL SELECTION: 12147 out of 16497 total genes passed both filters.
[3]:
AnnData object with n_obs × n_vars = 871 × 12147
    obs: 'cluster'

Calculating medians and binary scores per cluster

[4]:
adata = ns.pp.prep_medians(adata, cluster_header)
adata = ns.pp.prep_binary_scores(adata, cluster_header)
Calculating medians per cluster: 100%|██████████| 16/16 [00:01<00:00, 10.98it/s]
Saving medians as adata.varm.medians_cluster
median: 0.0
mean: 1.828
std: 2.518
Only positive genes selected. 9402 positive genes out of 12147 total genes
--- 1.6438896656036377 seconds ---
Calculating binary scores per cluster: 100%|██████████| 16/16 [01:11<00:00,  4.46s/it]
Saving binary scores as adata.varm.binary_scores_cluster
median: 0.118
mean: 0.207
std: 0.251
--- 71.59167551994324 seconds ---

Running NS-Forest

[5]:
outputfilename_prefix = cluster_header
results = ns.nsforesting.NSForest(adata, cluster_header, save_supplementary = True, save = True, output_folder = output_folder, outputfilename_prefix = outputfilename_prefix)
Running NS-Forest version 4.1

Preparing adata...
Pre-selecting genes based on binary scores...
        BinaryFirst_high Threshold (mean + 2 * std): 0.709
        Average number of genes after gene_selection in each cluster: 577.5625
Saving number of genes selected per cluster as...
../outputs_layer1_spatial/cluster_gene_selection.csv
--- 0.04798698425292969 seconds ---

Number of clusters to evaluate: 16
1 out of 16:
        e1_e299_SLC17A7_L5b_Cdh13
        Pre-selected 1122 genes to feed into Random Forest.
        NSForest-selected markers: ['SLC17A7', 'ANKRD33B']
        fbeta: 0.959
        precision: 0.988
        recall: 0.856
2 out of 16:
        g1_g48_GLI3_Astro_Gja1
        Pre-selected 441 genes to feed into Random Forest.
        NSForest-selected markers: ['LINC00498']
        fbeta: 0.95
        precision: 1.0
        recall: 0.792
3 out of 16:
        g2_g27_APBB1IP_Micro_Ctss
        Pre-selected 281 genes to feed into Random Forest.
        NSForest-selected markers: ['C1QC']
        fbeta: 0.935
        precision: 1.0
        recall: 0.741
4 out of 16:
        g3_g18_GPNMB_OPC_Pdgfra
        Pre-selected 251 genes to feed into Random Forest.
        NSForest-selected markers: ['COL20A1']
        fbeta: 0.833
        precision: 1.0
        recall: 0.5
5 out of 16:
        g4_g9_MOG_Oligo_Opalin
        Pre-selected 401 genes to feed into Random Forest.
        NSForest-selected markers: ['CNDP1']
        fbeta: 0.976
        precision: 1.0
        recall: 0.889
6 out of 16:
        i10_i16_TSPAN12_Vip_Mybpc1
        Pre-selected 811 genes to feed into Random Forest.
        NSForest-selected markers: ['CHRNB3', 'WIF1']
        fbeta: 0.865
        precision: 1.0
        recall: 0.562
7 out of 16:
        i11_i6_EGF_Vip_Mybpc1
        Pre-selected 1459 genes to feed into Random Forest.
        NSForest-selected markers: ['FZD8']
        fbeta: 0.556
        precision: 0.667
        recall: 0.333
8 out of 16:
        i1_i90_COL5A2_Ndnf_Car4
        Pre-selected 191 genes to feed into Random Forest.
        NSForest-selected markers: ['COL5A2', 'TRPC3']
        fbeta: 0.845
        precision: 1.0
        recall: 0.522
9 out of 16:
        i2_i77_LHX6_Sst_Cbln4
        Pre-selected 212 genes to feed into Random Forest.
        NSForest-selected markers: ['LHX6', 'GRIK3']
        fbeta: 0.857
        precision: 1.0
        recall: 0.545
10 out of 16:
        i3_i56_BAGE2_Ndnf_Cxcl14
        Pre-selected 110 genes to feed into Random Forest.
        NSForest-selected markers: ['SCN5A', 'GREM2']
        fbeta: 0.642
        precision: 0.826
        recall: 0.339
11 out of 16:
        i4_i54_MC4R_Ndnf_Cxcl14
        Pre-selected 163 genes to feed into Random Forest.
        NSForest-selected markers: ['ADAM33', 'NDNF', 'CD36']
        fbeta: 0.783
        precision: 0.929
        recall: 0.481
12 out of 16:
        i5_i47_TRPC3_Ndnf_Car4
        Pre-selected 773 genes to feed into Random Forest.
        NSForest-selected markers: ['TRPC3', 'CA2']
        fbeta: 0.828
        precision: 0.962
        recall: 0.532
13 out of 16:
        i6_i44_GPR149_Vip_Mybpc1
        Pre-selected 303 genes to feed into Random Forest.
        NSForest-selected markers: ['GPR149', 'SHISA8', 'ASIC4']
        fbeta: 0.757
        precision: 0.852
        recall: 0.523
14 out of 16:
        i7_i31_CLMP_Ndnf_Cxcl14
        Pre-selected 829 genes to feed into Random Forest.
        NSForest-selected markers: ['RASSF3', 'PAX6', 'CPLX3']
        fbeta: 0.841
        precision: 0.947
        recall: 0.581
15 out of 16:
        i8_i27_SNCG_Vip_Mybpc1
        Pre-selected 1082 genes to feed into Random Forest.
        NSForest-selected markers: ['SNCG', 'EDNRA']
        fbeta: 0.759
        precision: 0.923
        recall: 0.444
16 out of 16:
        i9_i22_TAC3_Vip_Mybpc1
        Pre-selected 812 genes to feed into Random Forest.
        NSForest-selected markers: ['MCTP2', 'ANGPT1']
        fbeta: 0.806
        precision: 1.0
        recall: 0.455
--- 80.83747506141663 seconds ---

Saving supplementary table as...
../outputs_layer1_spatial/cluster_supplementary.csv
Saving markers table as...
../outputs_layer1_spatial/cluster_markers.csv
using median
Calculating medians per cluster: 100%|██████████| 16/16 [00:00<00:00, 379.53it/s]
Saving supplementary table as...
../outputs_layer1_spatial/cluster_markers_onTarget_supp.csv
Saving supplementary table as...
../outputs_layer1_spatial/cluster_markers_onTarget.csv

Saving final results table as...
../outputs_layer1_spatial/cluster_results.csv
Saving final results table as...
../outputs_layer1_spatial/cluster_results.pkl

[6]:
results
[6]:
software_version cluster_header clusterName clusterSize f_score precision recall TN FP FN TP marker_count NSForest_markers binary_genes onTarget
0 4.1 cluster e1_e299_SLC17A7_L5b_Cdh13 299 0.958801 0.988417 0.856187 569 3 43 256 2 [SLC17A7, ANKRD33B] [SLC17A7, ANKRD33B, SFTA1P, TBR1, LINC00152, N... 1.000000
1 4.1 cluster g1_g48_GLI3_Astro_Gja1 48 0.950000 1.000000 0.791667 823 0 10 38 1 [LINC00498] [LINC00498, SLC25A18, FGFR3, EMX2OS, ALDH1L1, ... 1.000000
2 4.1 cluster g2_g27_APBB1IP_Micro_Ctss 27 0.934579 1.000000 0.740741 844 0 7 20 1 [C1QC] [CSF2RA, C1QC, SYK, P2RY13, FGD2, MS4A7, CX3CR... 1.000000
3 4.1 cluster g3_g18_GPNMB_OPC_Pdgfra 18 0.833333 1.000000 0.500000 853 0 9 9 1 [COL20A1] [OLIG2, STK32A, COL20A1, KLRC3, B3GNT7, CSPG4,... 1.000000
4 4.1 cluster g4_g9_MOG_Oligo_Opalin 9 0.975610 1.000000 0.888889 862 0 1 8 1 [CNDP1] [MOBP, CNDP1, TF, ASPA, CLMN, CD22, PPP1R14A, ... 1.000000
5 4.1 cluster i10_i16_TSPAN12_Vip_Mybpc1 16 0.865385 1.000000 0.562500 855 0 7 9 2 [CHRNB3, WIF1] [TSPAN12, TMC5, CHRNB3, ANGPT1, DCN, WIF1, HCR... 0.494252
6 4.1 cluster i11_i6_EGF_Vip_Mybpc1 6 0.555556 0.666667 0.333333 864 1 4 2 1 [FZD8] [EGF, XYLT2, FZD8, ZSCAN23, NSRP1P1, RPUSD4, S... 1.000000
7 4.1 cluster i1_i90_COL5A2_Ndnf_Car4 90 0.845324 1.000000 0.522222 781 0 43 47 2 [COL5A2, TRPC3] [NMBR, BMP2, COL5A2, ADRA1D, TRPC3, PAPSS2, BM... 0.618906
8 4.1 cluster i2_i77_LHX6_Sst_Cbln4 77 0.857143 1.000000 0.545455 794 0 35 42 2 [LHX6, GRIK3] [LHX6, RSPO3, CALB1, TAC1, TRBC2, GRIK3, MAFB,... 0.942058
9 4.1 cluster i3_i56_BAGE2_Ndnf_Cxcl14 56 0.641892 0.826087 0.339286 811 4 37 19 2 [SCN5A, GREM2] [SCN5A, GREM2, SYT10, ARHGAP18, ZNF423, GRB14,... 0.531211
10 4.1 cluster i4_i54_MC4R_Ndnf_Cxcl14 54 0.783133 0.928571 0.481481 815 2 28 26 3 [ADAM33, NDNF, CD36] [EGFL6, ADAM33, PGAM1P5, CHRNB3, NDNF, CD36, A... 0.351766
11 4.1 cluster i5_i47_TRPC3_Ndnf_Car4 47 0.827815 0.961538 0.531915 823 1 22 25 2 [TRPC3, CA2] [SOX13, PRSS12, SSTR2, TRPC3, PAPSS2, CA2, HAP... 0.448483
12 4.1 cluster i6_i44_GPR149_Vip_Mybpc1 44 0.756579 0.851852 0.522727 823 4 21 23 3 [GPR149, SHISA8, ASIC4] [CXCL12, GPR149, SHISA8, IGFBP5, PXDN, ASIC4, ... 0.546603
13 4.1 cluster i7_i31_CLMP_Ndnf_Cxcl14 31 0.841121 0.947368 0.580645 839 1 13 18 3 [RASSF3, PAX6, CPLX3] [FGF10, RASSF3, CLMP, PAX6, SP8, CPLX3, TGFBR2... 0.629389
14 4.1 cluster i8_i27_SNCG_Vip_Mybpc1 27 0.759494 0.923077 0.444444 843 1 15 12 2 [SNCG, EDNRA] [SNCG, EDNRA, MMRN2, FBN3, RGS2, SCML4, NRP2, ... 1.000000
15 4.1 cluster i9_i22_TAC3_Vip_Mybpc1 22 0.806452 1.000000 0.454545 849 0 12 10 2 [MCTP2, ANGPT1] [BSPRY, OFD1P10Y, MCTP2, ARHGAP29, FHDC1, ANGP... 0.752842