Tutorial

Setting up environment

[1]:

import sys
import os
import numpy as np
import pandas as pd
import scanpy as sc
import matplotlib.pyplot as plt
CODE_PATH = "../.." # location of NSForest folder
sys.path.insert(0, os.path.abspath(CODE_PATH))
from nsforest import ns, nsforesting, utils, NSFOREST_VERSION

Data Exploration

Loading h5ad AnnData file

[2]:

data_folder = f"{CODE_PATH}/demo_data/"
file = data_folder + "adata_layer1.h5ad"
adata = sc.read_h5ad(file)
adata

[2]:

AnnData object with n_obs × n_vars = 871 × 16497
    obs: 'cluster'

Defining `cluster_header` as cell type annotation.

Note: Some datasets have multiple annotations per sample (ex. “broad_cell_type” and “granular_cell_type”). NS-Forest can be run on multiple cluster_header’s. Combining the parent and child markers may improve classification results.

[3]:

cluster_header = "cluster"

Defining `output_folder` for saving results

[4]:

output_folder = f"{CODE_PATH}/outputs_layer1/"

Looking at sample labels

[5]:

adata.obs_names

[5]:

Index(['A01_1_Nuclei_NeuNP_H200_1025_MTG_layer1_BCH9',
       'A01_BCH3_1NeuNP_H200.1030_MTG_Layer_1',
       'A02_BCH1_1NeuNP_H200.1025_MTG_layer_1',
       'A03_1_Nuclei_NeuNP_H200_1025_MTG_layer1_BCH9',
       'A04_1_Nuclei_NeuNP_H200_1025_MTG_layer1_BCH9',
       'A04_BCH1_1NeuNP_H200.1025_MTG_layer_1',
       'A04_BCH3_1NeuNP_H200.1030_MTG_Layer_1',
       'A05_1_Nuclei_NeuNP_H200_1025_MTG_layer1_BCH9',
       'A05_BCH1_1NeuNP_H200.1025_MTG_layer_1',
       'A05_BCH3_1NeuNP_H200.1030_MTG_Layer_1',
       ...
       'P09_1_Nuclei_NeuNN_H200_1025_MTG_layer1_BCH7',
       'P09_1_Nuclei_NeuNN_H200_1025_MTG_layer1_BCH9',
       'P09_1_Nuclei_NeuNN_H200_1030_MTG_layer1_BCH8',
       'P09_BCH1_1NeuNN_H200.1025_MTG_layer_1',
       'P10_1_Nuclei_NeuNN_H200_1025_MTG_layer1_BCH6',
       'P10_1_Nuclei_NeuNN_H200_1025_MTG_layer1_BCH9',
       'P10_BCH1_1NeuNN_H200.1025_MTG_layer_1',
       'P11_1_Nuclei_NeuNN_H200_1025_MTG_layer1_BCH7',
       'P11_1_Nuclei_NeuNN_H200_1025_MTG_layer1_BCH9',
       'P11_1_Nuclei_NeuNN_H200_1030_MTG_layer1_BCH8'],
      dtype='object', length=871)

Looking at genes

Note: adata.var_names must be unique. If there is a problem, usually it can be solved by assigning adata.var.index = adata.var["ensembl_id"].

[6]:

adata.var_names

[6]:

Index(['A1CF', 'A2M', 'A2M_AS1', 'A2ML1', 'A2ML1_AS1', 'A2MP1', 'A3GALT2',
       'A4GALT', 'AAAS', 'AACS',
       ...
       'ZUFSP', 'ZW10', 'ZWILCH', 'ZWINT', 'ZXDC', 'ZYG11A', 'ZYG11B', 'ZYX',
       'ZZEF1', 'ZZZ3'],
      dtype='object', length=16497)

Checking cell annotation sizes

Note: Some datasets are too large and need to be downsampled to be run through the pipeline. When downsampling, be sure to have all the granular cluster annotations represented.

[7]:

adata.obs[cluster_header].value_counts()

[7]:

e1_e299_SLC17A7_L5b_Cdh13     299
i1_i90_COL5A2_Ndnf_Car4        90
i2_i77_LHX6_Sst_Cbln4          77
i3_i56_BAGE2_Ndnf_Cxcl14       56
i4_i54_MC4R_Ndnf_Cxcl14        54
g1_g48_GLI3_Astro_Gja1         48
i5_i47_TRPC3_Ndnf_Car4         47
i6_i44_GPR149_Vip_Mybpc1       44
i7_i31_CLMP_Ndnf_Cxcl14        31
g2_g27_APBB1IP_Micro_Ctss      27
i8_i27_SNCG_Vip_Mybpc1         27
i9_i22_TAC3_Vip_Mybpc1         22
g3_g18_GPNMB_OPC_Pdgfra        18
i10_i16_TSPAN12_Vip_Mybpc1     16
g4_g9_MOG_Oligo_Opalin          9
i11_i6_EGF_Vip_Mybpc1           6
Name: cluster, dtype: int64

Preprocessing

Generating scanpy dendrogram

Note: Only run if there is no pre-defined dendrogram order. Dendrogram order is stored in adata.uns["dendrogram_cluster"]["categories_ordered"].

You can add additional arguments into ns.pp.dendrogram() that can be passed into the scanpy dendrogram function. Ex: ns.pp.dendrogram(adata, cluster_header, orientation = “bottom”)

[8]:

ns.pp.dendrogram(adata, cluster_header, save = True, output_folder = output_folder, outputfilename_suffix = cluster_header)

WARNING: You’re trying to run this on 16497 dimensions of `.X`, if you really want this, set `use_rep='X'`.
         Falling back to preprocessing with `sc.pp.pca` and default params.

C:\Users\bpeng\AppData\Local\anaconda3\Lib\site-packages\anndata\_core\anndata.py:522: FutureWarning: The dtype argument is deprecated and will be removed in late 2024.
  warnings.warn(

WARNING: saving figure to file ..\..\outputs_layer1\dendrogram_cluster.png

Calculating cluster medians per gene

Run ns.pp.prep_medians before running NS-Forest.

Note: Do not run if evaluating marker lists. Do not run before generating scanpy plots (e.g. dot plot, violin plot, matrix plot). This step removes genes from adata. The full adata should be used for evaluating and plotting.

[9]:

adata_subset = adata.copy()

[10]:

adata_subset = ns.pp.prep_medians(adata_subset, cluster_header)
adata_subset

Calculating medians...

Calculating medians (means) per cluster: 100%|██████████| 16/16 [00:00<00:00, 57.19it/s]

Saving calculated medians as adata.varm.medians_cluster
--- 0.28774333000183105 seconds ---
median: 0.0
mean: 1.62611
std: 2.488900899887085
Only positive genes selected. 11688 positive genes out of 16497 total genes

[10]:

AnnData object with n_obs × n_vars = 871 × 11688
    obs: 'cluster'
    uns: 'dendrogram_cluster'
    obsm: 'X_pca'
    varm: 'medians_cluster'

Calculating binary scores per gene per cluster

Run ns.pp.prep_binary_scores before running NS-Forest. Do not need to run if evaluating marker lists. Do not need to run when generating scanpy plots.

[11]:

adata_subset = ns.pp.prep_binary_scores(adata_subset, cluster_header)
adata_subset

Calculating binary scores...

Calculating binary scores per cluster: 100%|██████████| 16/16 [00:53<00:00,  3.36s/it]

Saving calculated binary scores as adata.varm.binary_scores_cluster
--- 54.23224496841431 seconds ---
median: 0.1
mean: 0.2016436226562706
std: 0.25240082537940856

[11]:

AnnData object with n_obs × n_vars = 871 × 11688
    obs: 'cluster'
    uns: 'dendrogram_cluster'
    obsm: 'X_pca'
    varm: 'medians_cluster', 'binary_scores_cluster'

Plotting median and binary score distributions

[12]:

plt.clf()
filename = output_folder + cluster_header + '_medians.png'
print(f"Saving median distributions as...\n{filename}")
a = plt.figure(figsize = (6, 4))
a = plt.hist(adata_subset.varm["medians_" + cluster_header].unstack(), bins = 100)
a = plt.title(f'{file.split("/")[-1].replace(".h5ad", "")}: {"medians_" + cluster_header} histogram')
a = plt.xlabel("medians_" + cluster_header)
a = plt.yscale("log")
a = plt.savefig(filename, bbox_inches='tight')
plt.show()

Saving median distributions as...
../../outputs_layer1/cluster_medians.png

<Figure size 640x480 with 0 Axes>

[13]:

plt.clf()
filename = output_folder + cluster_header + '_binary_scores.png'
print(f"Saving binary_score distributions as...\n{filename}")
a = plt.figure(figsize = (6, 4))
a = plt.hist(adata_subset.varm["binary_scores_" + cluster_header].unstack(), bins = 100)
a = plt.title(f'{file.split("/")[-1].replace(".h5ad", "")}: {"binary_scores_" + cluster_header} histogram')
a = plt.xlabel("binary_scores_" + cluster_header)
a = plt.yscale("log")
a = plt.savefig(filename, bbox_inches='tight')
plt.show()

Saving binary_score distributions as...
../../outputs_layer1/cluster_binary_scores.png

<Figure size 640x480 with 0 Axes>

Saving preprocessed AnnData as new h5ad

[14]:

# Optional, can be useful if you don't want to re-run prep_medians and prep_binary_scores
filename = file.replace(".h5ad", "_preprocessed.h5ad")
print(f"Saving new anndata object as...\n{filename}")
adata_subset.write_h5ad(filename)
adata_subset

Saving new anndata object as...
../../demo_data/adata_layer1_preprocessed.h5ad

[14]:

AnnData object with n_obs × n_vars = 871 × 11688
    obs: 'cluster'
    uns: 'dendrogram_cluster'
    obsm: 'X_pca'
    varm: 'medians_cluster', 'binary_scores_cluster'

Running NS-Forest

Note: Do not run NS-Forest if only evaluating input marker lists.

[15]:

outputfilename_prefix = cluster_header
results = nsforesting.NSForest(adata_subset, cluster_header, save = True, save_supplementary = True,
                               output_folder = output_folder, outputfilename_prefix = outputfilename_prefix)

Running NS-Forest version 4.1
Preparing adata...
--- 0.007977008819580078 seconds ---
Pre-selecting genes based on binary scores...
        BinaryFirst_high Threshold (mean + 2 * std): 0.7064452734150877
        Average number of genes after gene_selection in each cluster: 735.5
Saving number of genes selected per cluster as...
../../outputs_layer1/cluster_gene_selection.csv
Number of clusters to evaluate: 16
1 out of 16:
        e1_e299_SLC17A7_L5b_Cdh13
        Pre-selected 1356 genes to feed into Random Forest.
          NSForest-selected markers: ['LINC00507']
          fbeta: 0.96
          precision: 0.978
          recall: 0.893
2 out of 16:
        g1_g48_GLI3_Astro_Gja1
        Pre-selected 583 genes to feed into Random Forest.
          NSForest-selected markers: ['LINC00498']
          fbeta: 0.95
          precision: 1.0
          recall: 0.792
3 out of 16:
        g2_g27_APBB1IP_Micro_Ctss
        Pre-selected 420 genes to feed into Random Forest.
          NSForest-selected markers: ['ADAM28', 'PTPRC']
          fbeta: 0.976
          precision: 1.0
          recall: 0.889
4 out of 16:
        g3_g18_GPNMB_OPC_Pdgfra
        Pre-selected 353 genes to feed into Random Forest.
          NSForest-selected markers: ['GPNMB', 'OLIG2']
          fbeta: 0.862
          precision: 1.0
          recall: 0.556
5 out of 16:
        g4_g9_MOG_Oligo_Opalin
        Pre-selected 571 genes to feed into Random Forest.
          NSForest-selected markers: ['ST18']
          fbeta: 1.0
          precision: 1.0
          recall: 1.0
6 out of 16:
        i10_i16_TSPAN12_Vip_Mybpc1
        Pre-selected 1007 genes to feed into Random Forest.
          NSForest-selected markers: ['TSPAN12', 'CHRNB3']
          fbeta: 0.804
          precision: 0.9
          recall: 0.562
7 out of 16:
        i11_i6_EGF_Vip_Mybpc1
        Pre-selected 1912 genes to feed into Random Forest.
          NSForest-selected markers: ['EGF', 'FBRSL1']
          fbeta: 0.714
          precision: 1.0
          recall: 0.333
8 out of 16:
        i1_i90_COL5A2_Ndnf_Car4
        Pre-selected 238 genes to feed into Random Forest.
          NSForest-selected markers: ['COL5A2', 'BMP6']
          fbeta: 0.908
          precision: 0.97
          recall: 0.722
9 out of 16:
        i2_i77_LHX6_Sst_Cbln4
        Pre-selected 292 genes to feed into Random Forest.
          NSForest-selected markers: ['LHX6']
          fbeta: 0.817
          precision: 0.838
          recall: 0.74
10 out of 16:
        i3_i56_BAGE2_Ndnf_Cxcl14
        Pre-selected 151 genes to feed into Random Forest.
          NSForest-selected markers: ['BAGE2', 'SYT10']
          fbeta: 0.781
          precision: 0.962
          recall: 0.446
11 out of 16:
        i4_i54_MC4R_Ndnf_Cxcl14
        Pre-selected 223 genes to feed into Random Forest.
          NSForest-selected markers: ['ARHGAP36', 'ADAM33']
          fbeta: 0.857
          precision: 0.923
          recall: 0.667
12 out of 16:
        i5_i47_TRPC3_Ndnf_Car4
        Pre-selected 942 genes to feed into Random Forest.
          NSForest-selected markers: ['NTNG1', 'EYA4']
          fbeta: 0.906
          precision: 1.0
          recall: 0.66
13 out of 16:
        i6_i44_GPR149_Vip_Mybpc1
        Pre-selected 377 genes to feed into Random Forest.
          NSForest-selected markers: ['FLT1', 'GPR149']
          fbeta: 0.792
          precision: 1.0
          recall: 0.432
14 out of 16:
        i7_i31_CLMP_Ndnf_Cxcl14
        Pre-selected 1012 genes to feed into Random Forest.
          NSForest-selected markers: ['PAX6', 'TGFBR2']
          fbeta: 0.901
          precision: 1.0
          recall: 0.645
15 out of 16:
        i8_i27_SNCG_Vip_Mybpc1
        Pre-selected 1326 genes to feed into Random Forest.
          NSForest-selected markers: ['SNCG', 'EDNRA']
          fbeta: 0.759
          precision: 0.923
          recall: 0.444
16 out of 16:
        i9_i22_TAC3_Vip_Mybpc1
        Pre-selected 1005 genes to feed into Random Forest.
          NSForest-selected markers: ['BSPRY', 'MCTP2']
          fbeta: 0.69
          precision: 0.889
          recall: 0.364
Saving supplementary table as...
../../outputs_layer1/cluster_supplementary.csv
Saving markers table as...
../../outputs_layer1/cluster_markers.csv
using mean

Calculating medians (means) per cluster: 100%|██████████| 16/16 [00:00<00:00, 314.34it/s]

Saving supplementary table as...
../../outputs_layer1/cluster_markers_onTarget_supp.csv
Saving supplementary table as...
../../outputs_layer1/cluster_markers_onTarget.csv
Saving final results table as...
../../outputs_layer1/cluster_results.csv
Saving final results table as...
../../outputs_layer1/cluster_results.pkl
--- 63.96290040016174 seconds ---

[16]:

results

[16]:

	software_version	cluster_header	clusterName	clusterSize	f_score	precision	recall	TN	FP	FN	TP	marker_count	NSForest_markers	binary_genes	onTarget
0	4.1	cluster	e1_e299_SLC17A7_L5b_Cdh13	299	0.959741	0.978022	0.892977	566	6	32	267	1	[LINC00507]	[SLC17A7, LINC00508, TBR1, ANKRD33B, NPTX1, LI...	0.479336
1	4.1	cluster	g1_g48_GLI3_Astro_Gja1	48	0.950000	1.000000	0.791667	823	0	10	38	1	[LINC00498]	[LINC00498, SLC25A18, EMX2OS, FAM189A2, SLC7A1...	0.882322
2	4.1	cluster	g2_g27_APBB1IP_Micro_Ctss	27	0.975610	1.000000	0.888889	844	0	3	24	2	[ADAM28, PTPRC]	[ADAM28, PLCG2, INPP5D, PTPRC, CSF2RA, P2RY13,...	0.736514
3	4.1	cluster	g3_g18_GPNMB_OPC_Pdgfra	18	0.862069	1.000000	0.555556	853	0	8	10	2	[GPNMB, OLIG2]	[GPNMB, COL20A1, OLIG2, STK32A, KLRC3, KLRC2, ...	0.656991
4	4.1	cluster	g4_g9_MOG_Oligo_Opalin	9	1.000000	1.000000	1.000000	862	0	0	9	1	[ST18]	[ST18, MOBP, CNDP1, MOG, CD22, FOLH1, TF, CARN...	0.622487
5	4.1	cluster	i10_i16_TSPAN12_Vip_Mybpc1	16	0.803571	0.900000	0.562500	854	1	7	9	2	[TSPAN12, CHRNB3]	[TSPAN12, TMC5, LINC01539, CHRNB3, FAM46A, ANG...	0.370267
6	4.1	cluster	i11_i6_EGF_Vip_Mybpc1	6	0.714286	1.000000	0.333333	865	0	4	2	2	[EGF, FBRSL1]	[EGF, FZD8, KCNJ2_AS1, FBRSL1, TEKT1, NRG3_AS1...	0.356487
7	4.1	cluster	i1_i90_COL5A2_Ndnf_Car4	90	0.907821	0.970149	0.722222	779	2	25	65	2	[COL5A2, BMP6]	[NMBR, COL5A2, C8ORF4, PAPSS2, TRPC3, BMP6, SS...	0.299027
8	4.1	cluster	i2_i77_LHX6_Sst_Cbln4	77	0.816619	0.838235	0.740260	783	11	20	57	1	[LHX6]	[LHX6, FLT3, TAC1, CALB1, RSPO3, TRBC2, GRIK3,...	0.795864
9	4.1	cluster	i3_i56_BAGE2_Ndnf_Cxcl14	56	0.781250	0.961538	0.446429	814	1	31	25	2	[BAGE2, SYT10]	[BAGE2, SCN5A, GREM2, FAM19A4, SYT10, ARHGAP18...	0.282582
10	4.1	cluster	i4_i54_MC4R_Ndnf_Cxcl14	54	0.857143	0.923077	0.666667	814	3	18	36	2	[ARHGAP36, ADAM33]	[ARHGAP36, MC4R, COBLL1, HLA_B, LINC01435, ADA...	0.394957
11	4.1	cluster	i5_i47_TRPC3_Ndnf_Car4	47	0.906433	1.000000	0.659574	824	0	16	31	2	[NTNG1, EYA4]	[SSTR2, KIRREL, TRPC3, NTNG1, TARID, EYA4, CA2...	0.265959
12	4.1	cluster	i6_i44_GPR149_Vip_Mybpc1	44	0.791667	1.000000	0.431818	827	0	25	19	2	[FLT1, GPR149]	[FLT1, PLCE1_AS1, CXCL12, SLC22A3, PLCE1_AS2, ...	0.455438
13	4.1	cluster	i7_i31_CLMP_Ndnf_Cxcl14	31	0.900901	1.000000	0.645161	840	0	11	20	2	[PAX6, TGFBR2]	[KIAA1644, FGF10, CLMP, PAX6, SP8, TGFBR2, WIF...	0.364283
14	4.1	cluster	i8_i27_SNCG_Vip_Mybpc1	27	0.759494	0.923077	0.444444	843	1	15	12	2	[SNCG, EDNRA]	[SNCG, MMRN2, EDNRA, FBN3, KCNK2, RGS2, SCML4,...	0.474188
15	4.1	cluster	i9_i22_TAC3_Vip_Mybpc1	22	0.689655	0.888889	0.363636	848	1	14	8	2	[BSPRY, MCTP2]	[BSPRY, OFD1P10Y, MCTP2, OFD1P8Y, OFD1P15Y, OF...	0.467223

Plotting classification metrics from NS-Forest results

[17]:

ns.pl.boxplot(results, "f_score", save = "html", output_folder = output_folder, outputfilename_prefix = outputfilename_prefix)

Saving...
 ../../outputs_layer1/cluster_boxplot_f_score.html

[18]:

ns.pl.boxplot(results, "f_score", save = True, output_folder = output_folder, outputfilename_prefix = outputfilename_prefix)

Saving...
 ../../outputs_layer1/cluster_boxplot_f_score.html

[19]:

ns.pl.boxplot(results, "precision", save = True, output_folder = output_folder, outputfilename_prefix = outputfilename_prefix)

Saving...
 ../../outputs_layer1/cluster_boxplot_precision.html

[20]:

ns.pl.boxplot(results, "recall", save = True, output_folder = output_folder, outputfilename_prefix = outputfilename_prefix)

Saving...
 ../../outputs_layer1/cluster_boxplot_recall.html

[21]:

ns.pl.boxplot(results, "onTarget", save = True, output_folder = output_folder, outputfilename_prefix = outputfilename_prefix)

Saving...
 ../../outputs_layer1/cluster_boxplot_onTarget.html

[22]:

ns.pl.scatter_w_clusterSize(results, "f_score", save = True, output_folder = output_folder, outputfilename_prefix = outputfilename_prefix)

Saving...
 ../../outputs_layer1/cluster_scatter_f_score.html

[23]:

ns.pl.scatter_w_clusterSize(results, "precision", save = True, output_folder = output_folder, outputfilename_prefix = outputfilename_prefix)

Saving...
 ../../outputs_layer1/cluster_scatter_precision.html

[24]:

ns.pl.scatter_w_clusterSize(results, "recall", save = True, output_folder = output_folder, outputfilename_prefix = outputfilename_prefix)

Saving...
 ../../outputs_layer1/cluster_scatter_recall.html

[25]:

ns.pl.scatter_w_clusterSize(results, "onTarget", save = True, output_folder = output_folder, outputfilename_prefix = outputfilename_prefix)

Saving...
 ../../outputs_layer1/cluster_scatter_onTarget.html

Plotting scanpy dot plot, violin plot, matrix plot for NS-Forest markers

Note: Assign pre-defined dendrogram order here or use adata.uns["dendrogram_" + cluster_header]["categories_ordered"].

Notice how the non-altered adata is used to plot, instead of adata_subset.

[26]:

to_plot = results.copy()

[27]:

dendrogram = [] # custom dendrogram order
dendrogram = list(adata.uns["dendrogram_" + cluster_header]["categories_ordered"])
to_plot["clusterName"] = to_plot["clusterName"].astype("category")
to_plot["clusterName"] = to_plot["clusterName"].cat.set_categories(dendrogram)
to_plot = to_plot.sort_values("clusterName")
to_plot = to_plot.rename(columns = {"NSForest_markers": "markers"})
to_plot.head()

[27]:

	software_version	cluster_header	clusterName	clusterSize	f_score	precision	recall	TN	FP	FN	TP	marker_count	markers	binary_genes	onTarget
0	4.1	cluster	e1_e299_SLC17A7_L5b_Cdh13	299	0.959741	0.978022	0.892977	566	6	32	267	1	[LINC00507]	[SLC17A7, LINC00508, TBR1, ANKRD33B, NPTX1, LI...	0.479336
13	4.1	cluster	i7_i31_CLMP_Ndnf_Cxcl14	31	0.900901	1.000000	0.645161	840	0	11	20	2	[PAX6, TGFBR2]	[KIAA1644, FGF10, CLMP, PAX6, SP8, TGFBR2, WIF...	0.364283
7	4.1	cluster	i1_i90_COL5A2_Ndnf_Car4	90	0.907821	0.970149	0.722222	779	2	25	65	2	[COL5A2, BMP6]	[NMBR, COL5A2, C8ORF4, PAPSS2, TRPC3, BMP6, SS...	0.299027
11	4.1	cluster	i5_i47_TRPC3_Ndnf_Car4	47	0.906433	1.000000	0.659574	824	0	16	31	2	[NTNG1, EYA4]	[SSTR2, KIRREL, TRPC3, NTNG1, TARID, EYA4, CA2...	0.265959
5	4.1	cluster	i10_i16_TSPAN12_Vip_Mybpc1	16	0.803571	0.900000	0.562500	854	1	7	9	2	[TSPAN12, CHRNB3]	[TSPAN12, TMC5, LINC01539, CHRNB3, FAM46A, ANG...	0.370267

[28]:

markers_dict = dict(zip(to_plot["clusterName"], to_plot["markers"]))
markers_dict

[28]:

{'e1_e299_SLC17A7_L5b_Cdh13': ['LINC00507'],
 'i7_i31_CLMP_Ndnf_Cxcl14': ['PAX6', 'TGFBR2'],
 'i1_i90_COL5A2_Ndnf_Car4': ['COL5A2', 'BMP6'],
 'i5_i47_TRPC3_Ndnf_Car4': ['NTNG1', 'EYA4'],
 'i10_i16_TSPAN12_Vip_Mybpc1': ['TSPAN12', 'CHRNB3'],
 'i4_i54_MC4R_Ndnf_Cxcl14': ['ARHGAP36', 'ADAM33'],
 'i9_i22_TAC3_Vip_Mybpc1': ['BSPRY', 'MCTP2'],
 'i6_i44_GPR149_Vip_Mybpc1': ['FLT1', 'GPR149'],
 'i8_i27_SNCG_Vip_Mybpc1': ['SNCG', 'EDNRA'],
 'g1_g48_GLI3_Astro_Gja1': ['LINC00498'],
 'g3_g18_GPNMB_OPC_Pdgfra': ['GPNMB', 'OLIG2'],
 'g2_g27_APBB1IP_Micro_Ctss': ['ADAM28', 'PTPRC'],
 'g4_g9_MOG_Oligo_Opalin': ['ST18'],
 'i2_i77_LHX6_Sst_Cbln4': ['LHX6'],
 'i11_i6_EGF_Vip_Mybpc1': ['EGF', 'FBRSL1'],
 'i3_i56_BAGE2_Ndnf_Cxcl14': ['BAGE2', 'SYT10']}

[29]:

ns.pl.dotplot(adata, markers_dict, cluster_header, dendrogram = dendrogram, save = True, output_folder = output_folder, outputfilename_suffix = outputfilename_prefix)

WARNING: saving figure to file ..\..\outputs_layer1\dotplot_cluster.png

C:\Users\bpeng\AppData\Local\anaconda3\Lib\site-packages\scanpy\plotting\_dotplot.py:747: UserWarning:

No data for colormapping provided via 'c'. Parameters 'cmap', 'norm' will be ignored

[30]:

ns.pl.stackedviolin(adata, markers_dict, cluster_header, dendrogram = dendrogram, save = True, output_folder = output_folder, outputfilename_suffix = outputfilename_prefix)

WARNING: saving figure to file ..\..\outputs_layer1\stacked_violin_cluster.png

[31]:

ns.pl.matrixplot(adata, markers_dict, cluster_header, dendrogram = dendrogram, save = True, output_folder = output_folder, outputfilename_suffix = outputfilename_prefix)

WARNING: saving figure to file ..\..\outputs_layer1\matrixplot_cluster.png

Evaluating input marker list

Getting marker list in dictionary format: {cluster: marker_list}

[32]:

markers = pd.read_csv("../../demo_data/marker_list.csv")
markers_dict = utils.prepare_markers(markers, "clusterName", "markers")
markers_dict

[32]:

{'e1_e299_SLC17A7_L5b_Cdh13': ['SLC17A7', 'CDH13'],
 'g1_g48_GLI3_Astro_Gja1': ['GLI3', 'GJA1'],
 'g2_g27_APBB1IP_Micro_Ctss': ['APBB1IP', 'CTSS'],
 'g3_g18_GPNMB_OPC_Pdgfra': ['GPNMB', 'PDGFRA'],
 'g4_g9_MOG_Oligo_Opalin': ['MOG', 'OPALIN'],
 'i10_i16_TSPAN12_Vip_Mybpc1': ['TSPAN12', 'VIP', 'MYBPC1'],
 'i11_i6_EGF_Vip_Mybpc1': ['EGF', 'VIP', 'MYBPC1'],
 'i1_i90_COL5A2_Ndnf_Car4': ['COL5A2', 'NDNF', 'CAR4'],
 'i2_i77_LHX6_Sst_Cbln4': ['LHX6', 'SST', 'CBLN4'],
 'i3_i56_BAGE2_Ndnf_Cxcl14': ['BAGE2', 'NDNF', 'CXCL14'],
 'i4_i54_MC4R_Ndnf_Cxcl14': ['MC4R', 'NDNF', 'CXCL14'],
 'i5_i47_TRPC3_Ndnf_Car4': ['TRPC3', 'NDNF', 'CAR4'],
 'i6_i44_GPR149_Vip_Mybpc1': ['GPR149', 'VIP', 'MYBPC1'],
 'i7_i31_CLMP_Ndnf_Cxcl14': ['CLMP', 'NDNF', 'CXCL14'],
 'i8_i27_SNCG_Vip_Mybpc1': ['SNCG', 'VIP', 'MYBPC1'],
 'i9_i22_TAC3_Vip_Mybpc1': ['TAC3', 'VIP', 'MYBPC1']}

[33]:

outputfilename_prefix = "marker_eval"
evaluation_results = ns.ev.DecisionTree(adata, cluster_header, markers_dict, combinations = False, use_mean = False,
                                        save = True, save_supplementary = True,
                                        output_folder = output_folder, outputfilename_prefix = outputfilename_prefix)

Running NS-Forest version 4.1
Preparing data...
--- 0.005984306335449219 seconds ---
Number of clusters to evaluate: 16
1 out of 16:
        e1_e299_SLC17A7_L5b_Cdh13
        marker genes to be evaluated: ['SLC17A7', 'CDH13']
          fbeta: 0.945
          precision: 0.988
          recall: 0.806
2 out of 16:
        g1_g48_GLI3_Astro_Gja1
        marker genes to be evaluated: ['GLI3', 'GJA1']
          fbeta: 0.893
          precision: 1.0
          recall: 0.625
3 out of 16:
        g2_g27_APBB1IP_Micro_Ctss
        marker genes to be evaluated: ['APBB1IP', 'CTSS']
          fbeta: 0.385
          precision: 1.0
          recall: 0.111
4 out of 16:
        g3_g18_GPNMB_OPC_Pdgfra
        marker genes to be evaluated: ['GPNMB', 'PDGFRA']
          fbeta: 0.929
          precision: 1.0
          recall: 0.722
5 out of 16:
        g4_g9_MOG_Oligo_Opalin
        marker genes to be evaluated: ['MOG', 'OPALIN']
          fbeta: 0.909
          precision: 1.0
          recall: 0.667
6 out of 16:
        i10_i16_TSPAN12_Vip_Mybpc1
        marker genes to be evaluated: ['TSPAN12', 'VIP', 'MYBPC1']
          fbeta: 0.469
          precision: 0.75
          recall: 0.188
7 out of 16:
        i11_i6_EGF_Vip_Mybpc1
        marker genes to be evaluated: ['EGF', 'VIP', 'MYBPC1']
          fbeta: 0.0
          precision: 0.0
          recall: 0.0
8 out of 16:
        i1_i90_COL5A2_Ndnf_Car4
        marker genes to be evaluated: ['COL5A2', 'NDNF', 'CAR4']
warning: cannot find CAR4 in adata.var_names, excluding from DecisionTree.
          fbeta: 0.88
          precision: 0.94
          recall: 0.7
9 out of 16:
        i2_i77_LHX6_Sst_Cbln4
        marker genes to be evaluated: ['LHX6', 'SST', 'CBLN4']
          fbeta: 0.118
          precision: 1.0
          recall: 0.026
10 out of 16:
        i3_i56_BAGE2_Ndnf_Cxcl14
        marker genes to be evaluated: ['BAGE2', 'NDNF', 'CXCL14']
          fbeta: 0.565
          precision: 0.824
          recall: 0.25
11 out of 16:
        i4_i54_MC4R_Ndnf_Cxcl14
        marker genes to be evaluated: ['MC4R', 'NDNF', 'CXCL14']
          fbeta: 0.719
          precision: 0.913
          recall: 0.389
12 out of 16:
        i5_i47_TRPC3_Ndnf_Car4
        marker genes to be evaluated: ['TRPC3', 'NDNF', 'CAR4']
warning: cannot find CAR4 in adata.var_names, excluding from DecisionTree.
          fbeta: 0.0
          precision: 0.0
          recall: 0.0
13 out of 16:
        i6_i44_GPR149_Vip_Mybpc1
        marker genes to be evaluated: ['GPR149', 'VIP', 'MYBPC1']
          fbeta: 0.188
          precision: 0.333
          recall: 0.068
14 out of 16:
        i7_i31_CLMP_Ndnf_Cxcl14
        marker genes to be evaluated: ['CLMP', 'NDNF', 'CXCL14']
          fbeta: 0.355
          precision: 0.355
          recall: 0.355
15 out of 16:
        i8_i27_SNCG_Vip_Mybpc1
        marker genes to be evaluated: ['SNCG', 'VIP', 'MYBPC1']
          fbeta: 0.0
          precision: 0.0
          recall: 0.0
16 out of 16:
        i9_i22_TAC3_Vip_Mybpc1
        marker genes to be evaluated: ['TAC3', 'VIP', 'MYBPC1']
          fbeta: 0.526
          precision: 1.0
          recall: 0.182
using median

Calculating medians (means) per cluster: 100%|██████████| 16/16 [00:00<00:00, 320.85it/s]

Saving supplementary table as...
../../outputs_layer1/marker_eval_markers_onTarget_supp.csv
Saving supplementary table as...
../../outputs_layer1/marker_eval_markers_onTarget.csv
Saving final results table as...
../../outputs_layer1/marker_eval_results.csv
Saving final results table as...
../../outputs_layer1/marker_eval_results.pkl
--- 0.7569146156311035 seconds ---

[34]:

evaluation_results

[34]:

	software_version	cluster_header	clusterName	clusterSize	f_score	precision	recall	TN	FP	FN	TP	marker_count	markers	onTarget
0	4.1	cluster	e1_e299_SLC17A7_L5b_Cdh13	299	0.945098	0.987705	0.806020	569	3	58	241	2	[SLC17A7, CDH13]	0.548527
1	4.1	cluster	g1_g48_GLI3_Astro_Gja1	48	0.892857	1.000000	0.625000	823	0	18	30	2	[GLI3, GJA1]	0.854088
2	4.1	cluster	g2_g27_APBB1IP_Micro_Ctss	27	0.384615	1.000000	0.111111	844	0	24	3	2	[APBB1IP, CTSS]	0.955321
3	4.1	cluster	g3_g18_GPNMB_OPC_Pdgfra	18	0.928571	1.000000	0.722222	853	0	5	13	2	[GPNMB, PDGFRA]	0.680299
4	4.1	cluster	g4_g9_MOG_Oligo_Opalin	9	0.909091	1.000000	0.666667	862	0	3	6	2	[MOG, OPALIN]	1.000000
5	4.1	cluster	i10_i16_TSPAN12_Vip_Mybpc1	16	0.468750	0.750000	0.187500	854	1	13	3	3	[TSPAN12, VIP, MYBPC1]	0.209097
6	4.1	cluster	i11_i6_EGF_Vip_Mybpc1	6	0.000000	0.000000	0.000000	865	0	6	0	3	[EGF, VIP, MYBPC1]	0.112842
7	4.1	cluster	i1_i90_COL5A2_Ndnf_Car4	90	0.879888	0.940299	0.700000	777	4	27	63	2	[COL5A2, NDNF]	0.608344
8	4.1	cluster	i2_i77_LHX6_Sst_Cbln4	77	0.117647	1.000000	0.025974	794	0	75	2	3	[LHX6, SST, CBLN4]	0.420224
9	4.1	cluster	i3_i56_BAGE2_Ndnf_Cxcl14	56	0.564516	0.823529	0.250000	812	3	42	14	3	[BAGE2, NDNF, CXCL14]	0.103398
10	4.1	cluster	i4_i54_MC4R_Ndnf_Cxcl14	54	0.719178	0.913043	0.388889	815	2	33	21	3	[MC4R, NDNF, CXCL14]	0.351766
11	4.1	cluster	i5_i47_TRPC3_Ndnf_Car4	47	0.000000	0.000000	0.000000	803	21	47	0	2	[TRPC3, NDNF]	0.292197
12	4.1	cluster	i6_i44_GPR149_Vip_Mybpc1	44	0.187500	0.333333	0.068182	821	6	41	3	3	[GPR149, VIP, MYBPC1]	0.235686
13	4.1	cluster	i7_i31_CLMP_Ndnf_Cxcl14	31	0.354839	0.354839	0.354839	820	20	20	11	3	[CLMP, NDNF, CXCL14]	0.180560
14	4.1	cluster	i8_i27_SNCG_Vip_Mybpc1	27	0.000000	0.000000	0.000000	844	0	27	0	3	[SNCG, VIP, MYBPC1]	0.153438
15	4.1	cluster	i9_i22_TAC3_Vip_Mybpc1	22	0.526316	1.000000	0.181818	849	0	18	4	3	[TAC3, VIP, MYBPC1]	0.288938

Plotting classification metrics from marker evaluation

[35]:

ns.pl.boxplot(evaluation_results, "f_score", save = True, output_folder = output_folder, outputfilename_prefix = outputfilename_prefix)

Saving...
 ../../outputs_layer1/marker_eval_boxplot_f_score.html

[36]:

ns.pl.boxplot(evaluation_results, "precision", save = True, output_folder = output_folder, outputfilename_prefix = outputfilename_prefix)

Saving...
 ../../outputs_layer1/marker_eval_boxplot_precision.html

[37]:

ns.pl.boxplot(evaluation_results, "recall", save = True, output_folder = output_folder, outputfilename_prefix = outputfilename_prefix)

Saving...
 ../../outputs_layer1/marker_eval_boxplot_recall.html

[38]:

ns.pl.boxplot(evaluation_results, "onTarget", save = True, output_folder = output_folder, outputfilename_prefix = outputfilename_prefix)

Saving...
 ../../outputs_layer1/marker_eval_boxplot_onTarget.html

[39]:

ns.pl.scatter_w_clusterSize(evaluation_results, "f_score", save = True, output_folder = output_folder, outputfilename_prefix = outputfilename_prefix)

Saving...
 ../../outputs_layer1/marker_eval_scatter_f_score.html

[40]:

ns.pl.scatter_w_clusterSize(evaluation_results, "precision", save = True, output_folder = output_folder, outputfilename_prefix = outputfilename_prefix)

Saving...
 ../../outputs_layer1/marker_eval_scatter_precision.html

[41]:

ns.pl.scatter_w_clusterSize(evaluation_results, "recall", save = True, output_folder = output_folder, outputfilename_prefix = outputfilename_prefix)

Saving...
 ../../outputs_layer1/marker_eval_scatter_recall.html

[42]:

ns.pl.scatter_w_clusterSize(evaluation_results, "onTarget", save = True, output_folder = output_folder, outputfilename_prefix = outputfilename_prefix)

Saving...
 ../../outputs_layer1/marker_eval_scatter_onTarget.html

Plotting scanpy dot plot, violin plot, matrix plot for input marker list

Note: Assign pre-defined dendrogram order here or use adata.uns["dendrogram_" + cluster_header]["categories_ordered"].

[43]:

to_plot = evaluation_results.copy()

[44]:

dendrogram = [] # custom dendrogram order
dendrogram = list(adata.uns["dendrogram_" + cluster_header]["categories_ordered"])
to_plot["clusterName"] = to_plot["clusterName"].astype("category")
to_plot["clusterName"] = to_plot["clusterName"].cat.set_categories(dendrogram)
to_plot = to_plot.sort_values("clusterName").reset_index(drop = True)
to_plot = to_plot.rename(columns = {"NSForest_markers": "markers"})
to_plot.head()

[44]:

	software_version	cluster_header	clusterName	clusterSize	f_score	precision	recall	TN	FP	FN	TP	marker_count	markers	onTarget
0	4.1	cluster	e1_e299_SLC17A7_L5b_Cdh13	299	0.945098	0.987705	0.806020	569	3	58	241	2	[SLC17A7, CDH13]	0.548527
1	4.1	cluster	i7_i31_CLMP_Ndnf_Cxcl14	31	0.354839	0.354839	0.354839	820	20	20	11	3	[CLMP, NDNF, CXCL14]	0.180560
2	4.1	cluster	i1_i90_COL5A2_Ndnf_Car4	90	0.879888	0.940299	0.700000	777	4	27	63	2	[COL5A2, NDNF]	0.608344
3	4.1	cluster	i5_i47_TRPC3_Ndnf_Car4	47	0.000000	0.000000	0.000000	803	21	47	0	2	[TRPC3, NDNF]	0.292197
4	4.1	cluster	i10_i16_TSPAN12_Vip_Mybpc1	16	0.468750	0.750000	0.187500	854	1	13	3	3	[TSPAN12, VIP, MYBPC1]	0.209097

[45]:

markers_dict = dict(zip(to_plot["clusterName"], to_plot["markers"]))
markers_dict

[45]:

{'e1_e299_SLC17A7_L5b_Cdh13': ['SLC17A7', 'CDH13'],
 'i7_i31_CLMP_Ndnf_Cxcl14': ['CLMP', 'NDNF', 'CXCL14'],
 'i1_i90_COL5A2_Ndnf_Car4': ['COL5A2', 'NDNF'],
 'i5_i47_TRPC3_Ndnf_Car4': ['TRPC3', 'NDNF'],
 'i10_i16_TSPAN12_Vip_Mybpc1': ['TSPAN12', 'VIP', 'MYBPC1'],
 'i4_i54_MC4R_Ndnf_Cxcl14': ['MC4R', 'NDNF', 'CXCL14'],
 'i9_i22_TAC3_Vip_Mybpc1': ['TAC3', 'VIP', 'MYBPC1'],
 'i6_i44_GPR149_Vip_Mybpc1': ['GPR149', 'VIP', 'MYBPC1'],
 'i8_i27_SNCG_Vip_Mybpc1': ['SNCG', 'VIP', 'MYBPC1'],
 'g1_g48_GLI3_Astro_Gja1': ['GLI3', 'GJA1'],
 'g3_g18_GPNMB_OPC_Pdgfra': ['GPNMB', 'PDGFRA'],
 'g2_g27_APBB1IP_Micro_Ctss': ['APBB1IP', 'CTSS'],
 'g4_g9_MOG_Oligo_Opalin': ['MOG', 'OPALIN'],
 'i2_i77_LHX6_Sst_Cbln4': ['LHX6', 'SST', 'CBLN4'],
 'i11_i6_EGF_Vip_Mybpc1': ['EGF', 'VIP', 'MYBPC1'],
 'i3_i56_BAGE2_Ndnf_Cxcl14': ['BAGE2', 'NDNF', 'CXCL14']}

[46]:

ns.pl.dotplot(adata, markers_dict, cluster_header, dendrogram = dendrogram, save = True, output_folder = output_folder, outputfilename_suffix = outputfilename_prefix)

WARNING: saving figure to file ..\..\outputs_layer1\dotplot_marker_eval.png

C:\Users\bpeng\AppData\Local\anaconda3\Lib\site-packages\scanpy\plotting\_dotplot.py:747: UserWarning:

No data for colormapping provided via 'c'. Parameters 'cmap', 'norm' will be ignored

[47]:

ns.pl.stackedviolin(adata, markers_dict, cluster_header, dendrogram = dendrogram, save = True, output_folder = output_folder, outputfilename_suffix = outputfilename_prefix)

WARNING: saving figure to file ..\..\outputs_layer1\stacked_violin_marker_eval.png

[48]:

ns.pl.matrixplot(adata, markers_dict, cluster_header, dendrogram = dendrogram, save = True, output_folder = output_folder, outputfilename_suffix = outputfilename_prefix)

WARNING: saving figure to file ..\..\outputs_layer1\matrixplot_marker_eval.png

[49]:

adata

[49]:

AnnData object with n_obs × n_vars = 871 × 16497
    obs: 'cluster'
    uns: 'dendrogram_cluster'
    obsm: 'X_pca'

[50]:

adata_subset

[50]:

AnnData object with n_obs × n_vars = 871 × 11688
    obs: 'cluster'
    uns: 'dendrogram_cluster'
    obsm: 'X_pca'
    varm: 'medians_cluster', 'binary_scores_cluster'

Tutorial

Tutorial

Setting up environment

Data Exploration

Loading h5ad AnnData file

Defining cluster_header as cell type annotation.

Defining output_folder for saving results

Looking at sample labels

Looking at genes

Checking cell annotation sizes

Preprocessing

Generating scanpy dendrogram

Calculating cluster medians per gene

Calculating binary scores per gene per cluster

Plotting median and binary score distributions

Saving preprocessed AnnData as new h5ad

Running NS-Forest

Plotting classification metrics from NS-Forest results

Plotting scanpy dot plot, violin plot, matrix plot for NS-Forest markers

Evaluating input marker list

Getting marker list in dictionary format: {cluster: marker_list}

Plotting classification metrics from marker evaluation

Plotting scanpy dot plot, violin plot, matrix plot for input marker list

Defining `cluster_header` as cell type annotation.

Defining `output_folder` for saving results