Tutorial

Tutorial

Setting up environment

[1]:
import sys
import os
import numpy as np
import pandas as pd
import scanpy as sc
import matplotlib.pyplot as plt
CODE_PATH = "../.." # location of NSForest folder
sys.path.insert(0, os.path.abspath(CODE_PATH))
from nsforest import ns, nsforesting, utils, NSFOREST_VERSION

Data Exploration

Loading h5ad AnnData file

[2]:
data_folder = f"{CODE_PATH}/demo_data/"
file = data_folder + "adata_layer1.h5ad"
adata = sc.read_h5ad(file)
adata
[2]:
AnnData object with n_obs × n_vars = 871 × 16497
    obs: 'cluster'

Defining cluster_header as cell type annotation.

Note: Some datasets have multiple annotations per sample (ex. “broad_cell_type” and “granular_cell_type”). NS-Forest can be run on multiple cluster_header’s. Combining the parent and child markers may improve classification results.

[3]:
cluster_header = "cluster"

Defining output_folder for saving results

[4]:
output_folder = f"{CODE_PATH}/outputs_layer1/"

Looking at sample labels

[5]:
adata.obs_names
[5]:
Index(['A01_1_Nuclei_NeuNP_H200_1025_MTG_layer1_BCH9',
       'A01_BCH3_1NeuNP_H200.1030_MTG_Layer_1',
       'A02_BCH1_1NeuNP_H200.1025_MTG_layer_1',
       'A03_1_Nuclei_NeuNP_H200_1025_MTG_layer1_BCH9',
       'A04_1_Nuclei_NeuNP_H200_1025_MTG_layer1_BCH9',
       'A04_BCH1_1NeuNP_H200.1025_MTG_layer_1',
       'A04_BCH3_1NeuNP_H200.1030_MTG_Layer_1',
       'A05_1_Nuclei_NeuNP_H200_1025_MTG_layer1_BCH9',
       'A05_BCH1_1NeuNP_H200.1025_MTG_layer_1',
       'A05_BCH3_1NeuNP_H200.1030_MTG_Layer_1',
       ...
       'P09_1_Nuclei_NeuNN_H200_1025_MTG_layer1_BCH7',
       'P09_1_Nuclei_NeuNN_H200_1025_MTG_layer1_BCH9',
       'P09_1_Nuclei_NeuNN_H200_1030_MTG_layer1_BCH8',
       'P09_BCH1_1NeuNN_H200.1025_MTG_layer_1',
       'P10_1_Nuclei_NeuNN_H200_1025_MTG_layer1_BCH6',
       'P10_1_Nuclei_NeuNN_H200_1025_MTG_layer1_BCH9',
       'P10_BCH1_1NeuNN_H200.1025_MTG_layer_1',
       'P11_1_Nuclei_NeuNN_H200_1025_MTG_layer1_BCH7',
       'P11_1_Nuclei_NeuNN_H200_1025_MTG_layer1_BCH9',
       'P11_1_Nuclei_NeuNN_H200_1030_MTG_layer1_BCH8'],
      dtype='object', length=871)

Looking at genes

Note: adata.var_names must be unique. If there is a problem, usually it can be solved by assigning adata.var.index = adata.var["ensembl_id"].

[6]:
adata.var_names
[6]:
Index(['A1CF', 'A2M', 'A2M_AS1', 'A2ML1', 'A2ML1_AS1', 'A2MP1', 'A3GALT2',
       'A4GALT', 'AAAS', 'AACS',
       ...
       'ZUFSP', 'ZW10', 'ZWILCH', 'ZWINT', 'ZXDC', 'ZYG11A', 'ZYG11B', 'ZYX',
       'ZZEF1', 'ZZZ3'],
      dtype='object', length=16497)

Checking cell annotation sizes

Note: Some datasets are too large and need to be downsampled to be run through the pipeline. When downsampling, be sure to have all the granular cluster annotations represented.

[7]:
adata.obs[cluster_header].value_counts()
[7]:
e1_e299_SLC17A7_L5b_Cdh13     299
i1_i90_COL5A2_Ndnf_Car4        90
i2_i77_LHX6_Sst_Cbln4          77
i3_i56_BAGE2_Ndnf_Cxcl14       56
i4_i54_MC4R_Ndnf_Cxcl14        54
g1_g48_GLI3_Astro_Gja1         48
i5_i47_TRPC3_Ndnf_Car4         47
i6_i44_GPR149_Vip_Mybpc1       44
i7_i31_CLMP_Ndnf_Cxcl14        31
g2_g27_APBB1IP_Micro_Ctss      27
i8_i27_SNCG_Vip_Mybpc1         27
i9_i22_TAC3_Vip_Mybpc1         22
g3_g18_GPNMB_OPC_Pdgfra        18
i10_i16_TSPAN12_Vip_Mybpc1     16
g4_g9_MOG_Oligo_Opalin          9
i11_i6_EGF_Vip_Mybpc1           6
Name: cluster, dtype: int64

Preprocessing

Generating scanpy dendrogram

Note: Only run if there is no pre-defined dendrogram order. Dendrogram order is stored in adata.uns["dendrogram_cluster"]["categories_ordered"].

You can add additional arguments into ns.pp.dendrogram() that can be passed into the scanpy dendrogram function. Ex: ns.pp.dendrogram(adata, cluster_header, orientation = “bottom”)

[8]:
ns.pp.dendrogram(adata, cluster_header, save = True, output_folder = output_folder, outputfilename_suffix = cluster_header)
WARNING: You’re trying to run this on 16497 dimensions of `.X`, if you really want this, set `use_rep='X'`.
         Falling back to preprocessing with `sc.pp.pca` and default params.
C:\Users\bpeng\AppData\Local\anaconda3\Lib\site-packages\anndata\_core\anndata.py:522: FutureWarning: The dtype argument is deprecated and will be removed in late 2024.
  warnings.warn(
WARNING: saving figure to file ..\..\outputs_layer1\dendrogram_cluster.png
_images/tutorial_19_3.png

Calculating cluster medians per gene

Run ns.pp.prep_medians before running NS-Forest.

Note: Do not run if evaluating marker lists. Do not run before generating scanpy plots (e.g. dot plot, violin plot, matrix plot). This step removes genes from adata. The full adata should be used for evaluating and plotting.

[9]:
adata_subset = adata.copy()
[10]:
adata_subset = ns.pp.prep_medians(adata_subset, cluster_header)
adata_subset
Calculating medians...
Calculating medians (means) per cluster: 100%|██████████| 16/16 [00:00<00:00, 57.19it/s]
Saving calculated medians as adata.varm.medians_cluster
--- 0.28774333000183105 seconds ---
median: 0.0
mean: 1.62611
std: 2.488900899887085
Only positive genes selected. 11688 positive genes out of 16497 total genes
[10]:
AnnData object with n_obs × n_vars = 871 × 11688
    obs: 'cluster'
    uns: 'dendrogram_cluster'
    obsm: 'X_pca'
    varm: 'medians_cluster'

Calculating binary scores per gene per cluster

Run ns.pp.prep_binary_scores before running NS-Forest. Do not need to run if evaluating marker lists. Do not need to run when generating scanpy plots.

[11]:
adata_subset = ns.pp.prep_binary_scores(adata_subset, cluster_header)
adata_subset
Calculating binary scores...
Calculating binary scores per cluster: 100%|██████████| 16/16 [00:53<00:00,  3.36s/it]
Saving calculated binary scores as adata.varm.binary_scores_cluster
--- 54.23224496841431 seconds ---
median: 0.1
mean: 0.2016436226562706
std: 0.25240082537940856
[11]:
AnnData object with n_obs × n_vars = 871 × 11688
    obs: 'cluster'
    uns: 'dendrogram_cluster'
    obsm: 'X_pca'
    varm: 'medians_cluster', 'binary_scores_cluster'

Plotting median and binary score distributions

[12]:
plt.clf()
filename = output_folder + cluster_header + '_medians.png'
print(f"Saving median distributions as...\n{filename}")
a = plt.figure(figsize = (6, 4))
a = plt.hist(adata_subset.varm["medians_" + cluster_header].unstack(), bins = 100)
a = plt.title(f'{file.split("/")[-1].replace(".h5ad", "")}: {"medians_" + cluster_header} histogram')
a = plt.xlabel("medians_" + cluster_header)
a = plt.yscale("log")
a = plt.savefig(filename, bbox_inches='tight')
plt.show()
Saving median distributions as...
../../outputs_layer1/cluster_medians.png
<Figure size 640x480 with 0 Axes>
_images/tutorial_26_2.png
[13]:
plt.clf()
filename = output_folder + cluster_header + '_binary_scores.png'
print(f"Saving binary_score distributions as...\n{filename}")
a = plt.figure(figsize = (6, 4))
a = plt.hist(adata_subset.varm["binary_scores_" + cluster_header].unstack(), bins = 100)
a = plt.title(f'{file.split("/")[-1].replace(".h5ad", "")}: {"binary_scores_" + cluster_header} histogram')
a = plt.xlabel("binary_scores_" + cluster_header)
a = plt.yscale("log")
a = plt.savefig(filename, bbox_inches='tight')
plt.show()
Saving binary_score distributions as...
../../outputs_layer1/cluster_binary_scores.png
<Figure size 640x480 with 0 Axes>
_images/tutorial_27_2.png

Saving preprocessed AnnData as new h5ad

[14]:
# Optional, can be useful if you don't want to re-run prep_medians and prep_binary_scores
filename = file.replace(".h5ad", "_preprocessed.h5ad")
print(f"Saving new anndata object as...\n{filename}")
adata_subset.write_h5ad(filename)
adata_subset
Saving new anndata object as...
../../demo_data/adata_layer1_preprocessed.h5ad
[14]:
AnnData object with n_obs × n_vars = 871 × 11688
    obs: 'cluster'
    uns: 'dendrogram_cluster'
    obsm: 'X_pca'
    varm: 'medians_cluster', 'binary_scores_cluster'

Running NS-Forest

Note: Do not run NS-Forest if only evaluating input marker lists.

[15]:
outputfilename_prefix = cluster_header
results = nsforesting.NSForest(adata_subset, cluster_header, save = True, save_supplementary = True,
                               output_folder = output_folder, outputfilename_prefix = outputfilename_prefix)
Running NS-Forest version 4.1
Preparing adata...
--- 0.007977008819580078 seconds ---
Pre-selecting genes based on binary scores...
        BinaryFirst_high Threshold (mean + 2 * std): 0.7064452734150877
        Average number of genes after gene_selection in each cluster: 735.5
Saving number of genes selected per cluster as...
../../outputs_layer1/cluster_gene_selection.csv
Number of clusters to evaluate: 16
1 out of 16:
        e1_e299_SLC17A7_L5b_Cdh13
        Pre-selected 1356 genes to feed into Random Forest.
          NSForest-selected markers: ['LINC00507']
          fbeta: 0.96
          precision: 0.978
          recall: 0.893
2 out of 16:
        g1_g48_GLI3_Astro_Gja1
        Pre-selected 583 genes to feed into Random Forest.
          NSForest-selected markers: ['LINC00498']
          fbeta: 0.95
          precision: 1.0
          recall: 0.792
3 out of 16:
        g2_g27_APBB1IP_Micro_Ctss
        Pre-selected 420 genes to feed into Random Forest.
          NSForest-selected markers: ['ADAM28', 'PTPRC']
          fbeta: 0.976
          precision: 1.0
          recall: 0.889
4 out of 16:
        g3_g18_GPNMB_OPC_Pdgfra
        Pre-selected 353 genes to feed into Random Forest.
          NSForest-selected markers: ['GPNMB', 'OLIG2']
          fbeta: 0.862
          precision: 1.0
          recall: 0.556
5 out of 16:
        g4_g9_MOG_Oligo_Opalin
        Pre-selected 571 genes to feed into Random Forest.
          NSForest-selected markers: ['ST18']
          fbeta: 1.0
          precision: 1.0
          recall: 1.0
6 out of 16:
        i10_i16_TSPAN12_Vip_Mybpc1
        Pre-selected 1007 genes to feed into Random Forest.
          NSForest-selected markers: ['TSPAN12', 'CHRNB3']
          fbeta: 0.804
          precision: 0.9
          recall: 0.562
7 out of 16:
        i11_i6_EGF_Vip_Mybpc1
        Pre-selected 1912 genes to feed into Random Forest.
          NSForest-selected markers: ['EGF', 'FBRSL1']
          fbeta: 0.714
          precision: 1.0
          recall: 0.333
8 out of 16:
        i1_i90_COL5A2_Ndnf_Car4
        Pre-selected 238 genes to feed into Random Forest.
          NSForest-selected markers: ['COL5A2', 'BMP6']
          fbeta: 0.908
          precision: 0.97
          recall: 0.722
9 out of 16:
        i2_i77_LHX6_Sst_Cbln4
        Pre-selected 292 genes to feed into Random Forest.
          NSForest-selected markers: ['LHX6']
          fbeta: 0.817
          precision: 0.838
          recall: 0.74
10 out of 16:
        i3_i56_BAGE2_Ndnf_Cxcl14
        Pre-selected 151 genes to feed into Random Forest.
          NSForest-selected markers: ['BAGE2', 'SYT10']
          fbeta: 0.781
          precision: 0.962
          recall: 0.446
11 out of 16:
        i4_i54_MC4R_Ndnf_Cxcl14
        Pre-selected 223 genes to feed into Random Forest.
          NSForest-selected markers: ['ARHGAP36', 'ADAM33']
          fbeta: 0.857
          precision: 0.923
          recall: 0.667
12 out of 16:
        i5_i47_TRPC3_Ndnf_Car4
        Pre-selected 942 genes to feed into Random Forest.
          NSForest-selected markers: ['NTNG1', 'EYA4']
          fbeta: 0.906
          precision: 1.0
          recall: 0.66
13 out of 16:
        i6_i44_GPR149_Vip_Mybpc1
        Pre-selected 377 genes to feed into Random Forest.
          NSForest-selected markers: ['FLT1', 'GPR149']
          fbeta: 0.792
          precision: 1.0
          recall: 0.432
14 out of 16:
        i7_i31_CLMP_Ndnf_Cxcl14
        Pre-selected 1012 genes to feed into Random Forest.
          NSForest-selected markers: ['PAX6', 'TGFBR2']
          fbeta: 0.901
          precision: 1.0
          recall: 0.645
15 out of 16:
        i8_i27_SNCG_Vip_Mybpc1
        Pre-selected 1326 genes to feed into Random Forest.
          NSForest-selected markers: ['SNCG', 'EDNRA']
          fbeta: 0.759
          precision: 0.923
          recall: 0.444
16 out of 16:
        i9_i22_TAC3_Vip_Mybpc1
        Pre-selected 1005 genes to feed into Random Forest.
          NSForest-selected markers: ['BSPRY', 'MCTP2']
          fbeta: 0.69
          precision: 0.889
          recall: 0.364
Saving supplementary table as...
../../outputs_layer1/cluster_supplementary.csv
Saving markers table as...
../../outputs_layer1/cluster_markers.csv
using mean
Calculating medians (means) per cluster: 100%|██████████| 16/16 [00:00<00:00, 314.34it/s]
Saving supplementary table as...
../../outputs_layer1/cluster_markers_onTarget_supp.csv
Saving supplementary table as...
../../outputs_layer1/cluster_markers_onTarget.csv
Saving final results table as...
../../outputs_layer1/cluster_results.csv
Saving final results table as...
../../outputs_layer1/cluster_results.pkl
--- 63.96290040016174 seconds ---

[16]:
results
[16]:
software_version cluster_header clusterName clusterSize f_score precision recall TN FP FN TP marker_count NSForest_markers binary_genes onTarget
0 4.1 cluster e1_e299_SLC17A7_L5b_Cdh13 299 0.959741 0.978022 0.892977 566 6 32 267 1 [LINC00507] [SLC17A7, LINC00508, TBR1, ANKRD33B, NPTX1, LI... 0.479336
1 4.1 cluster g1_g48_GLI3_Astro_Gja1 48 0.950000 1.000000 0.791667 823 0 10 38 1 [LINC00498] [LINC00498, SLC25A18, EMX2OS, FAM189A2, SLC7A1... 0.882322
2 4.1 cluster g2_g27_APBB1IP_Micro_Ctss 27 0.975610 1.000000 0.888889 844 0 3 24 2 [ADAM28, PTPRC] [ADAM28, PLCG2, INPP5D, PTPRC, CSF2RA, P2RY13,... 0.736514
3 4.1 cluster g3_g18_GPNMB_OPC_Pdgfra 18 0.862069 1.000000 0.555556 853 0 8 10 2 [GPNMB, OLIG2] [GPNMB, COL20A1, OLIG2, STK32A, KLRC3, KLRC2, ... 0.656991
4 4.1 cluster g4_g9_MOG_Oligo_Opalin 9 1.000000 1.000000 1.000000 862 0 0 9 1 [ST18] [ST18, MOBP, CNDP1, MOG, CD22, FOLH1, TF, CARN... 0.622487
5 4.1 cluster i10_i16_TSPAN12_Vip_Mybpc1 16 0.803571 0.900000 0.562500 854 1 7 9 2 [TSPAN12, CHRNB3] [TSPAN12, TMC5, LINC01539, CHRNB3, FAM46A, ANG... 0.370267
6 4.1 cluster i11_i6_EGF_Vip_Mybpc1 6 0.714286 1.000000 0.333333 865 0 4 2 2 [EGF, FBRSL1] [EGF, FZD8, KCNJ2_AS1, FBRSL1, TEKT1, NRG3_AS1... 0.356487
7 4.1 cluster i1_i90_COL5A2_Ndnf_Car4 90 0.907821 0.970149 0.722222 779 2 25 65 2 [COL5A2, BMP6] [NMBR, COL5A2, C8ORF4, PAPSS2, TRPC3, BMP6, SS... 0.299027
8 4.1 cluster i2_i77_LHX6_Sst_Cbln4 77 0.816619 0.838235 0.740260 783 11 20 57 1 [LHX6] [LHX6, FLT3, TAC1, CALB1, RSPO3, TRBC2, GRIK3,... 0.795864
9 4.1 cluster i3_i56_BAGE2_Ndnf_Cxcl14 56 0.781250 0.961538 0.446429 814 1 31 25 2 [BAGE2, SYT10] [BAGE2, SCN5A, GREM2, FAM19A4, SYT10, ARHGAP18... 0.282582
10 4.1 cluster i4_i54_MC4R_Ndnf_Cxcl14 54 0.857143 0.923077 0.666667 814 3 18 36 2 [ARHGAP36, ADAM33] [ARHGAP36, MC4R, COBLL1, HLA_B, LINC01435, ADA... 0.394957
11 4.1 cluster i5_i47_TRPC3_Ndnf_Car4 47 0.906433 1.000000 0.659574 824 0 16 31 2 [NTNG1, EYA4] [SSTR2, KIRREL, TRPC3, NTNG1, TARID, EYA4, CA2... 0.265959
12 4.1 cluster i6_i44_GPR149_Vip_Mybpc1 44 0.791667 1.000000 0.431818 827 0 25 19 2 [FLT1, GPR149] [FLT1, PLCE1_AS1, CXCL12, SLC22A3, PLCE1_AS2, ... 0.455438
13 4.1 cluster i7_i31_CLMP_Ndnf_Cxcl14 31 0.900901 1.000000 0.645161 840 0 11 20 2 [PAX6, TGFBR2] [KIAA1644, FGF10, CLMP, PAX6, SP8, TGFBR2, WIF... 0.364283
14 4.1 cluster i8_i27_SNCG_Vip_Mybpc1 27 0.759494 0.923077 0.444444 843 1 15 12 2 [SNCG, EDNRA] [SNCG, MMRN2, EDNRA, FBN3, KCNK2, RGS2, SCML4,... 0.474188
15 4.1 cluster i9_i22_TAC3_Vip_Mybpc1 22 0.689655 0.888889 0.363636 848 1 14 8 2 [BSPRY, MCTP2] [BSPRY, OFD1P10Y, MCTP2, OFD1P8Y, OFD1P15Y, OF... 0.467223

Plotting classification metrics from NS-Forest results

[17]:
ns.pl.boxplot(results, "f_score", save = "html", output_folder = output_folder, outputfilename_prefix = outputfilename_prefix)
Saving...
 ../../outputs_layer1/cluster_boxplot_f_score.html
[18]:
ns.pl.boxplot(results, "f_score", save = True, output_folder = output_folder, outputfilename_prefix = outputfilename_prefix)
Saving...
 ../../outputs_layer1/cluster_boxplot_f_score.html
[19]:
ns.pl.boxplot(results, "precision", save = True, output_folder = output_folder, outputfilename_prefix = outputfilename_prefix)
Saving...
 ../../outputs_layer1/cluster_boxplot_precision.html
[20]:
ns.pl.boxplot(results, "recall", save = True, output_folder = output_folder, outputfilename_prefix = outputfilename_prefix)
Saving...
 ../../outputs_layer1/cluster_boxplot_recall.html
[21]:
ns.pl.boxplot(results, "onTarget", save = True, output_folder = output_folder, outputfilename_prefix = outputfilename_prefix)
Saving...
 ../../outputs_layer1/cluster_boxplot_onTarget.html
[22]:
ns.pl.scatter_w_clusterSize(results, "f_score", save = True, output_folder = output_folder, outputfilename_prefix = outputfilename_prefix)
Saving...
 ../../outputs_layer1/cluster_scatter_f_score.html
[23]:
ns.pl.scatter_w_clusterSize(results, "precision", save = True, output_folder = output_folder, outputfilename_prefix = outputfilename_prefix)
Saving...
 ../../outputs_layer1/cluster_scatter_precision.html
[24]:
ns.pl.scatter_w_clusterSize(results, "recall", save = True, output_folder = output_folder, outputfilename_prefix = outputfilename_prefix)
Saving...
 ../../outputs_layer1/cluster_scatter_recall.html
[25]:
ns.pl.scatter_w_clusterSize(results, "onTarget", save = True, output_folder = output_folder, outputfilename_prefix = outputfilename_prefix)
Saving...
 ../../outputs_layer1/cluster_scatter_onTarget.html

Plotting scanpy dot plot, violin plot, matrix plot for NS-Forest markers

Note: Assign pre-defined dendrogram order here or use adata.uns["dendrogram_" + cluster_header]["categories_ordered"].

Notice how the non-altered adata is used to plot, instead of adata_subset.

[26]:
to_plot = results.copy()
[27]:
dendrogram = [] # custom dendrogram order
dendrogram = list(adata.uns["dendrogram_" + cluster_header]["categories_ordered"])
to_plot["clusterName"] = to_plot["clusterName"].astype("category")
to_plot["clusterName"] = to_plot["clusterName"].cat.set_categories(dendrogram)
to_plot = to_plot.sort_values("clusterName")
to_plot = to_plot.rename(columns = {"NSForest_markers": "markers"})
to_plot.head()
[27]:
software_version cluster_header clusterName clusterSize f_score precision recall TN FP FN TP marker_count markers binary_genes onTarget
0 4.1 cluster e1_e299_SLC17A7_L5b_Cdh13 299 0.959741 0.978022 0.892977 566 6 32 267 1 [LINC00507] [SLC17A7, LINC00508, TBR1, ANKRD33B, NPTX1, LI... 0.479336
13 4.1 cluster i7_i31_CLMP_Ndnf_Cxcl14 31 0.900901 1.000000 0.645161 840 0 11 20 2 [PAX6, TGFBR2] [KIAA1644, FGF10, CLMP, PAX6, SP8, TGFBR2, WIF... 0.364283
7 4.1 cluster i1_i90_COL5A2_Ndnf_Car4 90 0.907821 0.970149 0.722222 779 2 25 65 2 [COL5A2, BMP6] [NMBR, COL5A2, C8ORF4, PAPSS2, TRPC3, BMP6, SS... 0.299027
11 4.1 cluster i5_i47_TRPC3_Ndnf_Car4 47 0.906433 1.000000 0.659574 824 0 16 31 2 [NTNG1, EYA4] [SSTR2, KIRREL, TRPC3, NTNG1, TARID, EYA4, CA2... 0.265959
5 4.1 cluster i10_i16_TSPAN12_Vip_Mybpc1 16 0.803571 0.900000 0.562500 854 1 7 9 2 [TSPAN12, CHRNB3] [TSPAN12, TMC5, LINC01539, CHRNB3, FAM46A, ANG... 0.370267
[28]:
markers_dict = dict(zip(to_plot["clusterName"], to_plot["markers"]))
markers_dict
[28]:
{'e1_e299_SLC17A7_L5b_Cdh13': ['LINC00507'],
 'i7_i31_CLMP_Ndnf_Cxcl14': ['PAX6', 'TGFBR2'],
 'i1_i90_COL5A2_Ndnf_Car4': ['COL5A2', 'BMP6'],
 'i5_i47_TRPC3_Ndnf_Car4': ['NTNG1', 'EYA4'],
 'i10_i16_TSPAN12_Vip_Mybpc1': ['TSPAN12', 'CHRNB3'],
 'i4_i54_MC4R_Ndnf_Cxcl14': ['ARHGAP36', 'ADAM33'],
 'i9_i22_TAC3_Vip_Mybpc1': ['BSPRY', 'MCTP2'],
 'i6_i44_GPR149_Vip_Mybpc1': ['FLT1', 'GPR149'],
 'i8_i27_SNCG_Vip_Mybpc1': ['SNCG', 'EDNRA'],
 'g1_g48_GLI3_Astro_Gja1': ['LINC00498'],
 'g3_g18_GPNMB_OPC_Pdgfra': ['GPNMB', 'OLIG2'],
 'g2_g27_APBB1IP_Micro_Ctss': ['ADAM28', 'PTPRC'],
 'g4_g9_MOG_Oligo_Opalin': ['ST18'],
 'i2_i77_LHX6_Sst_Cbln4': ['LHX6'],
 'i11_i6_EGF_Vip_Mybpc1': ['EGF', 'FBRSL1'],
 'i3_i56_BAGE2_Ndnf_Cxcl14': ['BAGE2', 'SYT10']}
[29]:
ns.pl.dotplot(adata, markers_dict, cluster_header, dendrogram = dendrogram, save = True, output_folder = output_folder, outputfilename_suffix = outputfilename_prefix)
WARNING: saving figure to file ..\..\outputs_layer1\dotplot_cluster.png
C:\Users\bpeng\AppData\Local\anaconda3\Lib\site-packages\scanpy\plotting\_dotplot.py:747: UserWarning:

No data for colormapping provided via 'c'. Parameters 'cmap', 'norm' will be ignored

_images/tutorial_47_2.png
[30]:
ns.pl.stackedviolin(adata, markers_dict, cluster_header, dendrogram = dendrogram, save = True, output_folder = output_folder, outputfilename_suffix = outputfilename_prefix)
WARNING: saving figure to file ..\..\outputs_layer1\stacked_violin_cluster.png
_images/tutorial_48_1.png
[31]:
ns.pl.matrixplot(adata, markers_dict, cluster_header, dendrogram = dendrogram, save = True, output_folder = output_folder, outputfilename_suffix = outputfilename_prefix)
WARNING: saving figure to file ..\..\outputs_layer1\matrixplot_cluster.png
_images/tutorial_49_1.png

Evaluating input marker list

Getting marker list in dictionary format: {cluster: marker_list}

[32]:
markers = pd.read_csv("../../demo_data/marker_list.csv")
markers_dict = utils.prepare_markers(markers, "clusterName", "markers")
markers_dict
[32]:
{'e1_e299_SLC17A7_L5b_Cdh13': ['SLC17A7', 'CDH13'],
 'g1_g48_GLI3_Astro_Gja1': ['GLI3', 'GJA1'],
 'g2_g27_APBB1IP_Micro_Ctss': ['APBB1IP', 'CTSS'],
 'g3_g18_GPNMB_OPC_Pdgfra': ['GPNMB', 'PDGFRA'],
 'g4_g9_MOG_Oligo_Opalin': ['MOG', 'OPALIN'],
 'i10_i16_TSPAN12_Vip_Mybpc1': ['TSPAN12', 'VIP', 'MYBPC1'],
 'i11_i6_EGF_Vip_Mybpc1': ['EGF', 'VIP', 'MYBPC1'],
 'i1_i90_COL5A2_Ndnf_Car4': ['COL5A2', 'NDNF', 'CAR4'],
 'i2_i77_LHX6_Sst_Cbln4': ['LHX6', 'SST', 'CBLN4'],
 'i3_i56_BAGE2_Ndnf_Cxcl14': ['BAGE2', 'NDNF', 'CXCL14'],
 'i4_i54_MC4R_Ndnf_Cxcl14': ['MC4R', 'NDNF', 'CXCL14'],
 'i5_i47_TRPC3_Ndnf_Car4': ['TRPC3', 'NDNF', 'CAR4'],
 'i6_i44_GPR149_Vip_Mybpc1': ['GPR149', 'VIP', 'MYBPC1'],
 'i7_i31_CLMP_Ndnf_Cxcl14': ['CLMP', 'NDNF', 'CXCL14'],
 'i8_i27_SNCG_Vip_Mybpc1': ['SNCG', 'VIP', 'MYBPC1'],
 'i9_i22_TAC3_Vip_Mybpc1': ['TAC3', 'VIP', 'MYBPC1']}
[33]:
outputfilename_prefix = "marker_eval"
evaluation_results = ns.ev.DecisionTree(adata, cluster_header, markers_dict, combinations = False, use_mean = False,
                                        save = True, save_supplementary = True,
                                        output_folder = output_folder, outputfilename_prefix = outputfilename_prefix)
Running NS-Forest version 4.1
Preparing data...
--- 0.005984306335449219 seconds ---
Number of clusters to evaluate: 16
1 out of 16:
        e1_e299_SLC17A7_L5b_Cdh13
        marker genes to be evaluated: ['SLC17A7', 'CDH13']
          fbeta: 0.945
          precision: 0.988
          recall: 0.806
2 out of 16:
        g1_g48_GLI3_Astro_Gja1
        marker genes to be evaluated: ['GLI3', 'GJA1']
          fbeta: 0.893
          precision: 1.0
          recall: 0.625
3 out of 16:
        g2_g27_APBB1IP_Micro_Ctss
        marker genes to be evaluated: ['APBB1IP', 'CTSS']
          fbeta: 0.385
          precision: 1.0
          recall: 0.111
4 out of 16:
        g3_g18_GPNMB_OPC_Pdgfra
        marker genes to be evaluated: ['GPNMB', 'PDGFRA']
          fbeta: 0.929
          precision: 1.0
          recall: 0.722
5 out of 16:
        g4_g9_MOG_Oligo_Opalin
        marker genes to be evaluated: ['MOG', 'OPALIN']
          fbeta: 0.909
          precision: 1.0
          recall: 0.667
6 out of 16:
        i10_i16_TSPAN12_Vip_Mybpc1
        marker genes to be evaluated: ['TSPAN12', 'VIP', 'MYBPC1']
          fbeta: 0.469
          precision: 0.75
          recall: 0.188
7 out of 16:
        i11_i6_EGF_Vip_Mybpc1
        marker genes to be evaluated: ['EGF', 'VIP', 'MYBPC1']
          fbeta: 0.0
          precision: 0.0
          recall: 0.0
8 out of 16:
        i1_i90_COL5A2_Ndnf_Car4
        marker genes to be evaluated: ['COL5A2', 'NDNF', 'CAR4']
warning: cannot find CAR4 in adata.var_names, excluding from DecisionTree.
          fbeta: 0.88
          precision: 0.94
          recall: 0.7
9 out of 16:
        i2_i77_LHX6_Sst_Cbln4
        marker genes to be evaluated: ['LHX6', 'SST', 'CBLN4']
          fbeta: 0.118
          precision: 1.0
          recall: 0.026
10 out of 16:
        i3_i56_BAGE2_Ndnf_Cxcl14
        marker genes to be evaluated: ['BAGE2', 'NDNF', 'CXCL14']
          fbeta: 0.565
          precision: 0.824
          recall: 0.25
11 out of 16:
        i4_i54_MC4R_Ndnf_Cxcl14
        marker genes to be evaluated: ['MC4R', 'NDNF', 'CXCL14']
          fbeta: 0.719
          precision: 0.913
          recall: 0.389
12 out of 16:
        i5_i47_TRPC3_Ndnf_Car4
        marker genes to be evaluated: ['TRPC3', 'NDNF', 'CAR4']
warning: cannot find CAR4 in adata.var_names, excluding from DecisionTree.
          fbeta: 0.0
          precision: 0.0
          recall: 0.0
13 out of 16:
        i6_i44_GPR149_Vip_Mybpc1
        marker genes to be evaluated: ['GPR149', 'VIP', 'MYBPC1']
          fbeta: 0.188
          precision: 0.333
          recall: 0.068
14 out of 16:
        i7_i31_CLMP_Ndnf_Cxcl14
        marker genes to be evaluated: ['CLMP', 'NDNF', 'CXCL14']
          fbeta: 0.355
          precision: 0.355
          recall: 0.355
15 out of 16:
        i8_i27_SNCG_Vip_Mybpc1
        marker genes to be evaluated: ['SNCG', 'VIP', 'MYBPC1']
          fbeta: 0.0
          precision: 0.0
          recall: 0.0
16 out of 16:
        i9_i22_TAC3_Vip_Mybpc1
        marker genes to be evaluated: ['TAC3', 'VIP', 'MYBPC1']
          fbeta: 0.526
          precision: 1.0
          recall: 0.182
using median
Calculating medians (means) per cluster: 100%|██████████| 16/16 [00:00<00:00, 320.85it/s]
Saving supplementary table as...
../../outputs_layer1/marker_eval_markers_onTarget_supp.csv
Saving supplementary table as...
../../outputs_layer1/marker_eval_markers_onTarget.csv
Saving final results table as...
../../outputs_layer1/marker_eval_results.csv
Saving final results table as...
../../outputs_layer1/marker_eval_results.pkl
--- 0.7569146156311035 seconds ---

[34]:
evaluation_results
[34]:
software_version cluster_header clusterName clusterSize f_score precision recall TN FP FN TP marker_count markers onTarget
0 4.1 cluster e1_e299_SLC17A7_L5b_Cdh13 299 0.945098 0.987705 0.806020 569 3 58 241 2 [SLC17A7, CDH13] 0.548527
1 4.1 cluster g1_g48_GLI3_Astro_Gja1 48 0.892857 1.000000 0.625000 823 0 18 30 2 [GLI3, GJA1] 0.854088
2 4.1 cluster g2_g27_APBB1IP_Micro_Ctss 27 0.384615 1.000000 0.111111 844 0 24 3 2 [APBB1IP, CTSS] 0.955321
3 4.1 cluster g3_g18_GPNMB_OPC_Pdgfra 18 0.928571 1.000000 0.722222 853 0 5 13 2 [GPNMB, PDGFRA] 0.680299
4 4.1 cluster g4_g9_MOG_Oligo_Opalin 9 0.909091 1.000000 0.666667 862 0 3 6 2 [MOG, OPALIN] 1.000000
5 4.1 cluster i10_i16_TSPAN12_Vip_Mybpc1 16 0.468750 0.750000 0.187500 854 1 13 3 3 [TSPAN12, VIP, MYBPC1] 0.209097
6 4.1 cluster i11_i6_EGF_Vip_Mybpc1 6 0.000000 0.000000 0.000000 865 0 6 0 3 [EGF, VIP, MYBPC1] 0.112842
7 4.1 cluster i1_i90_COL5A2_Ndnf_Car4 90 0.879888 0.940299 0.700000 777 4 27 63 2 [COL5A2, NDNF] 0.608344
8 4.1 cluster i2_i77_LHX6_Sst_Cbln4 77 0.117647 1.000000 0.025974 794 0 75 2 3 [LHX6, SST, CBLN4] 0.420224
9 4.1 cluster i3_i56_BAGE2_Ndnf_Cxcl14 56 0.564516 0.823529 0.250000 812 3 42 14 3 [BAGE2, NDNF, CXCL14] 0.103398
10 4.1 cluster i4_i54_MC4R_Ndnf_Cxcl14 54 0.719178 0.913043 0.388889 815 2 33 21 3 [MC4R, NDNF, CXCL14] 0.351766
11 4.1 cluster i5_i47_TRPC3_Ndnf_Car4 47 0.000000 0.000000 0.000000 803 21 47 0 2 [TRPC3, NDNF] 0.292197
12 4.1 cluster i6_i44_GPR149_Vip_Mybpc1 44 0.187500 0.333333 0.068182 821 6 41 3 3 [GPR149, VIP, MYBPC1] 0.235686
13 4.1 cluster i7_i31_CLMP_Ndnf_Cxcl14 31 0.354839 0.354839 0.354839 820 20 20 11 3 [CLMP, NDNF, CXCL14] 0.180560
14 4.1 cluster i8_i27_SNCG_Vip_Mybpc1 27 0.000000 0.000000 0.000000 844 0 27 0 3 [SNCG, VIP, MYBPC1] 0.153438
15 4.1 cluster i9_i22_TAC3_Vip_Mybpc1 22 0.526316 1.000000 0.181818 849 0 18 4 3 [TAC3, VIP, MYBPC1] 0.288938

Plotting classification metrics from marker evaluation

[35]:
ns.pl.boxplot(evaluation_results, "f_score", save = True, output_folder = output_folder, outputfilename_prefix = outputfilename_prefix)
Saving...
 ../../outputs_layer1/marker_eval_boxplot_f_score.html
[36]:
ns.pl.boxplot(evaluation_results, "precision", save = True, output_folder = output_folder, outputfilename_prefix = outputfilename_prefix)
Saving...
 ../../outputs_layer1/marker_eval_boxplot_precision.html
[37]:
ns.pl.boxplot(evaluation_results, "recall", save = True, output_folder = output_folder, outputfilename_prefix = outputfilename_prefix)
Saving...
 ../../outputs_layer1/marker_eval_boxplot_recall.html
[38]:
ns.pl.boxplot(evaluation_results, "onTarget", save = True, output_folder = output_folder, outputfilename_prefix = outputfilename_prefix)
Saving...
 ../../outputs_layer1/marker_eval_boxplot_onTarget.html
[39]:
ns.pl.scatter_w_clusterSize(evaluation_results, "f_score", save = True, output_folder = output_folder, outputfilename_prefix = outputfilename_prefix)
Saving...
 ../../outputs_layer1/marker_eval_scatter_f_score.html
[40]:
ns.pl.scatter_w_clusterSize(evaluation_results, "precision", save = True, output_folder = output_folder, outputfilename_prefix = outputfilename_prefix)
Saving...
 ../../outputs_layer1/marker_eval_scatter_precision.html
[41]:
ns.pl.scatter_w_clusterSize(evaluation_results, "recall", save = True, output_folder = output_folder, outputfilename_prefix = outputfilename_prefix)
Saving...
 ../../outputs_layer1/marker_eval_scatter_recall.html
[42]:
ns.pl.scatter_w_clusterSize(evaluation_results, "onTarget", save = True, output_folder = output_folder, outputfilename_prefix = outputfilename_prefix)
Saving...
 ../../outputs_layer1/marker_eval_scatter_onTarget.html

Plotting scanpy dot plot, violin plot, matrix plot for input marker list

Note: Assign pre-defined dendrogram order here or use adata.uns["dendrogram_" + cluster_header]["categories_ordered"].

[43]:
to_plot = evaluation_results.copy()
[44]:
dendrogram = [] # custom dendrogram order
dendrogram = list(adata.uns["dendrogram_" + cluster_header]["categories_ordered"])
to_plot["clusterName"] = to_plot["clusterName"].astype("category")
to_plot["clusterName"] = to_plot["clusterName"].cat.set_categories(dendrogram)
to_plot = to_plot.sort_values("clusterName").reset_index(drop = True)
to_plot = to_plot.rename(columns = {"NSForest_markers": "markers"})
to_plot.head()
[44]:
software_version cluster_header clusterName clusterSize f_score precision recall TN FP FN TP marker_count markers onTarget
0 4.1 cluster e1_e299_SLC17A7_L5b_Cdh13 299 0.945098 0.987705 0.806020 569 3 58 241 2 [SLC17A7, CDH13] 0.548527
1 4.1 cluster i7_i31_CLMP_Ndnf_Cxcl14 31 0.354839 0.354839 0.354839 820 20 20 11 3 [CLMP, NDNF, CXCL14] 0.180560
2 4.1 cluster i1_i90_COL5A2_Ndnf_Car4 90 0.879888 0.940299 0.700000 777 4 27 63 2 [COL5A2, NDNF] 0.608344
3 4.1 cluster i5_i47_TRPC3_Ndnf_Car4 47 0.000000 0.000000 0.000000 803 21 47 0 2 [TRPC3, NDNF] 0.292197
4 4.1 cluster i10_i16_TSPAN12_Vip_Mybpc1 16 0.468750 0.750000 0.187500 854 1 13 3 3 [TSPAN12, VIP, MYBPC1] 0.209097
[45]:
markers_dict = dict(zip(to_plot["clusterName"], to_plot["markers"]))
markers_dict
[45]:
{'e1_e299_SLC17A7_L5b_Cdh13': ['SLC17A7', 'CDH13'],
 'i7_i31_CLMP_Ndnf_Cxcl14': ['CLMP', 'NDNF', 'CXCL14'],
 'i1_i90_COL5A2_Ndnf_Car4': ['COL5A2', 'NDNF'],
 'i5_i47_TRPC3_Ndnf_Car4': ['TRPC3', 'NDNF'],
 'i10_i16_TSPAN12_Vip_Mybpc1': ['TSPAN12', 'VIP', 'MYBPC1'],
 'i4_i54_MC4R_Ndnf_Cxcl14': ['MC4R', 'NDNF', 'CXCL14'],
 'i9_i22_TAC3_Vip_Mybpc1': ['TAC3', 'VIP', 'MYBPC1'],
 'i6_i44_GPR149_Vip_Mybpc1': ['GPR149', 'VIP', 'MYBPC1'],
 'i8_i27_SNCG_Vip_Mybpc1': ['SNCG', 'VIP', 'MYBPC1'],
 'g1_g48_GLI3_Astro_Gja1': ['GLI3', 'GJA1'],
 'g3_g18_GPNMB_OPC_Pdgfra': ['GPNMB', 'PDGFRA'],
 'g2_g27_APBB1IP_Micro_Ctss': ['APBB1IP', 'CTSS'],
 'g4_g9_MOG_Oligo_Opalin': ['MOG', 'OPALIN'],
 'i2_i77_LHX6_Sst_Cbln4': ['LHX6', 'SST', 'CBLN4'],
 'i11_i6_EGF_Vip_Mybpc1': ['EGF', 'VIP', 'MYBPC1'],
 'i3_i56_BAGE2_Ndnf_Cxcl14': ['BAGE2', 'NDNF', 'CXCL14']}
[46]:
ns.pl.dotplot(adata, markers_dict, cluster_header, dendrogram = dendrogram, save = True, output_folder = output_folder, outputfilename_suffix = outputfilename_prefix)
WARNING: saving figure to file ..\..\outputs_layer1\dotplot_marker_eval.png
C:\Users\bpeng\AppData\Local\anaconda3\Lib\site-packages\scanpy\plotting\_dotplot.py:747: UserWarning:

No data for colormapping provided via 'c'. Parameters 'cmap', 'norm' will be ignored

_images/tutorial_68_2.png
[47]:
ns.pl.stackedviolin(adata, markers_dict, cluster_header, dendrogram = dendrogram, save = True, output_folder = output_folder, outputfilename_suffix = outputfilename_prefix)
WARNING: saving figure to file ..\..\outputs_layer1\stacked_violin_marker_eval.png
_images/tutorial_69_1.png
[48]:
ns.pl.matrixplot(adata, markers_dict, cluster_header, dendrogram = dendrogram, save = True, output_folder = output_folder, outputfilename_suffix = outputfilename_prefix)
WARNING: saving figure to file ..\..\outputs_layer1\matrixplot_marker_eval.png
_images/tutorial_70_1.png
[49]:
adata
[49]:
AnnData object with n_obs × n_vars = 871 × 16497
    obs: 'cluster'
    uns: 'dendrogram_cluster'
    obsm: 'X_pca'
[50]:
adata_subset
[50]:
AnnData object with n_obs × n_vars = 871 × 11688
    obs: 'cluster'
    uns: 'dendrogram_cluster'
    obsm: 'X_pca'
    varm: 'medians_cluster', 'binary_scores_cluster'