[["index.html", "Assigning cell types with SingleR Preface", " Assigning cell types with SingleR Authors: Aaron Lun [aut, cre] Version: 1.2.0 Modified: 2021-02-07 Compiled: 2021-05-24 Environment: R version 4.1.0 (2021-05-18), Bioconductor 3.13 License: CC BY Copyright: Aaron Lun, 2020 Source: https://github.com/LTLA/SingleRBook-base Preface Imagine a world without a reference genome. Whenever we receive new RNA-seq data, we’d need to run it through an assembler to identify the expressed sequences. We would then need to inspect each sequence to determine its likely function, e.g., based on sequence motifs. This process is analogous to current practice in single-cell data analysis; simply replace reads with cells, assemblies with clusters, and genes with cell types. A typical practitioner will hope that their clusters are reasonable proxies for the biological states of interest and that their manual annotation of the clusters is accurate. Such an “artisanal” process is difficult to reproduce and scale to larger datasets involving more diverse cell types. The solution is to perform automated cell type annotation, a.k.a. cell type classification (or occasionally, “label transfer”). These methods compare cells in a new dataset against curated reference profiles of known cell types, assigning each new cell to the reference type that its expression profile is most similar to. This allows users to skip the mundane annotation of their data and jump directly to the interesting questions - does my cell type change in abundance or expression across treatments? Is there interesting substructure within an existing population? In this respect, automated annotation methods are the single-cell field’s equivalent to genome aligners, and we anticipate that the former will also become standard procedure for single-cell data analysis. This book covers the use of SingleR, one implementation of an automated annotation method. 
If you want a survey of different annotation methods - this book is not for you. If you want to create hand-crafted cluster definitions - this book is not for you. (Read the other one instead.) If you want to use the pre-Bioconductor version of the package - this book is not for you. But if you’re tired of manually annotating your single-cell data and you want to do something better with your life, then read on. "],["introduction.html", "Chapter 1 Introduction 1.1 Motivation 1.2 Method description 1.3 Quick start 1.4 Where to get help Session information", " Chapter 1 Introduction 1.1 Motivation The Bioconductor package SingleR implements an automatic annotation method for single-cell RNA sequencing (scRNA-seq) data (Aran et al. 2019). Given a reference dataset of samples (single-cell or bulk) with known labels, it assigns those labels to new cells from a test dataset based on similarities in their expression profiles. This provides a convenient way of transferring biological knowledge across datasets, allowing users to leverage the domain expertise implicit in the creation of each reference. The most common application of SingleR involves predicting cell type (or “state”, or “kind”) in a new dataset, a process that is facilitated by the availability of curated references and compatibility with user-supplied datasets. In this manner, the burden of manually interpreting clusters and defining marker genes only has to be done once, for the reference dataset, and this knowledge can be propagated to new datasets in an automated manner. 1.2 Method description SingleR can be considered a robust variant of nearest-neighbors classification, with some tweaks to improve resolution for closely related labels. 
For each test cell: We compute the Spearman correlation between its expression profile and that of each reference sample. The use of Spearman’s correlation provides a measure of robustness to batch effects across datasets. The calculation only uses the union of marker genes identified by pairwise comparisons between labels in the reference data, so as to improve resolution of separation between labels. We define the per-label score as a fixed quantile (by default, 0.8) of the correlations across all samples with that label. This accounts for differences in the number of reference samples for each label, which interferes with simpler flavors of nearest neighbor classification; it also avoids penalizing classifications to heterogeneous labels by only requiring a good match to a minority of samples. We repeat the score calculation for all labels in the reference dataset. The label with the highest score is used as SingleR’s prediction for this cell. We optionally perform a fine-tuning step to improve resolution between closely related labels. The reference dataset is subsetted to only include labels with scores close to the maximum; scores are recomputed using only marker genes for the subset of labels, thus focusing on the most relevant features; and this process is iterated until only one label remains. 1.3 Quick start We will demonstrate the use of SingleR() on a well-known 10X Genomics dataset (Zheng et al. 2017) with the Human Primary Cell Atlas dataset (Mabbott et al. 2013) as the reference. # Loading test data. library(TENxPBMCData) new.data &lt;- TENxPBMCData(&quot;pbmc4k&quot;) # Loading reference data with Ensembl annotations. library(celldex) ref.data &lt;- HumanPrimaryCellAtlasData(ensembl=TRUE) # Performing predictions. 
library(SingleR) predictions &lt;- SingleR(test=new.data, assay.type.test=1, ref=ref.data, labels=ref.data$label.main) table(predictions$labels) ## ## B_cell CMP DC GMP ## 606 8 1 2 ## Monocyte NK_cell Platelets Pre-B_cell_CD34- ## 1164 217 3 46 ## T_cells ## 2293 And that’s it, really. 1.4 Where to get help Questions on the general use of SingleR should be posted to the Bioconductor support site. Please send requests for general assistance and advice to the support site rather than to the individual authors. Bug reports or feature requests should be made to the GitHub repository; well-considered suggestions for improvements are always welcome. Session information View session info R version 4.1.0 (2021-05-18) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] SingleR_1.6.1 ensembldb_2.16.0 [3] AnnotationFilter_1.16.0 GenomicFeatures_1.44.0 [5] AnnotationDbi_1.54.0 celldex_1.2.0 [7] TENxPBMCData_1.10.0 HDF5Array_1.20.0 [9] rhdf5_2.36.0 DelayedArray_0.18.0 [11] Matrix_1.3-3 SingleCellExperiment_1.14.1 [13] SummarizedExperiment_1.22.0 Biobase_2.52.0 [15] GenomicRanges_1.44.0 GenomeInfoDb_1.28.0 [17] IRanges_2.26.0 S4Vectors_0.30.0 [19] BiocGenerics_0.38.0 MatrixGenerics_1.4.0 [21] matrixStats_0.58.0 BiocStyle_2.20.0 [23] rebook_1.2.0 loaded via a namespace (and not attached): [1] rjson_0.2.20 ellipsis_0.3.2 [3] XVector_0.32.0 BiocNeighbors_1.10.0 [5] bit64_4.0.5 interactiveDisplayBase_1.30.0 [7] fansi_0.4.2 codetools_0.2-18 [9] 
sparseMatrixStats_1.4.0 cachem_1.0.5 [11] knitr_1.33 jsonlite_1.7.2 [13] Rsamtools_2.8.0 dbplyr_2.1.1 [15] png_0.1-7 graph_1.70.0 [17] shiny_1.6.0 BiocManager_1.30.15 [19] compiler_4.1.0 httr_1.4.2 [21] assertthat_0.2.1 fastmap_1.1.0 [23] lazyeval_0.2.2 BiocSingular_1.8.0 [25] later_1.2.0 htmltools_0.5.1.1 [27] prettyunits_1.1.1 tools_4.1.0 [29] rsvd_1.0.5 glue_1.4.2 [31] GenomeInfoDbData_1.2.6 dplyr_1.0.6 [33] rappdirs_0.3.3 Rcpp_1.0.6 [35] jquerylib_0.1.4 vctrs_0.3.8 [37] Biostrings_2.60.0 rhdf5filters_1.4.0 [39] ExperimentHub_2.0.0 rtracklayer_1.52.0 [41] DelayedMatrixStats_1.14.0 xfun_0.23 [43] stringr_1.4.0 beachmat_2.8.0 [45] irlba_2.3.3 mime_0.10 [47] lifecycle_1.0.0 restfulr_0.0.13 [49] XML_3.99-0.6 AnnotationHub_3.0.0 [51] zlibbioc_1.38.0 ProtGenerics_1.24.0 [53] hms_1.1.0 promises_1.2.0.1 [55] yaml_2.2.1 curl_4.3.1 [57] memoise_2.0.0 sass_0.4.0 [59] biomaRt_2.48.0 stringi_1.6.2 [61] RSQLite_2.2.7 BiocVersion_3.13.1 [63] BiocIO_1.2.0 ScaledMatrix_1.0.0 [65] filelock_1.0.2 BiocParallel_1.26.0 [67] rlang_0.4.11 pkgconfig_2.0.3 [69] bitops_1.0-7 evaluate_0.14 [71] lattice_0.20-44 purrr_0.3.4 [73] Rhdf5lib_1.14.0 GenomicAlignments_1.28.0 [75] CodeDepends_0.6.5 bit_4.0.4 [77] tidyselect_1.1.1 magrittr_2.0.1 [79] bookdown_0.22 R6_2.5.0 [81] generics_0.1.0 DBI_1.1.1 [83] pillar_1.6.1 withr_2.4.2 [85] KEGGREST_1.32.0 RCurl_1.98-1.3 [87] tibble_3.1.2 dir.expiry_1.0.0 [89] crayon_1.4.1 utf8_1.2.1 [91] BiocFileCache_2.0.0 rmarkdown_2.8 [93] progress_1.2.2 grid_4.1.0 [95] blob_1.2.1 digest_0.6.27 [97] xtable_1.8-4 httpuv_1.6.1 [99] bslib_0.2.5.1 Bibliography "],["using-the-classic-mode.html", "Chapter 2 Using the classic mode 2.1 Overview 2.2 Annotating the test dataset 2.3 Interaction with quality control 2.4 Choices of assay data 2.5 Comments on choice of references Session information", " Chapter 2 Using the classic mode .rebook-collapse { background-color: #eee; color: #444; cursor: pointer; padding: 18px; width: 100%; border: none; text-align: left; outline: 
none; font-size: 15px; } .rebook-content { padding: 0 18px; display: none; overflow: hidden; background-color: #f1f1f1; } 2.1 Overview SingleR detects markers in a pairwise manner between labels in the reference dataset. Specifically, for each label of interest, it performs pairwise comparisons to every other label in the reference and identifies the genes that are upregulated in the label of interest for each comparison. The initial score calculation is then performed on the union of marker genes across all comparisons for all labels. This approach ensures that the selected subset of features will contain genes that distinguish each label from any other label. (In contrast, other approaches that treat the “other” labels as a single group do not offer this guarantee; see here for a discussion.) It also allows the fine-tuning step to aggressively improve resolution by only using marker genes from comparisons where both labels have scores close to the maximum. The original (“classic”) marker detection algorithm used in Aran et al. (2019) identified marker genes based on their log-fold changes in each pairwise comparison. Specifically, it used the genes with the largest positive differences in the per-label median log-expression values between labels. The number of genes taken from each pairwise comparison was defined as \\(500 (\\frac{2}{3})^{\\log_{2}(n)}\\), where \\(n\\) is the number of unique labels in the reference; this scheme aimed to reduce the number of genes (and thus the computational time) as the number of labels and pairwise comparisons increased. Classic mode is primarily intended for reference datasets that have little or no replication, a description that covers many of the bulk-derived references and precludes more complicated marker detection procedures (Chapter 3). 2.2 Annotating the test dataset For demonstration purposes, we will use the Grun et al. (2016) haematopoietic stem cell (HSC) dataset from the scRNAseq package. 
The GrunHSCData() function conveniently returns a SingleCellExperiment object containing the count matrix for this dataset. library(scRNAseq) sce &lt;- GrunHSCData(ensembl=TRUE) sce ## class: SingleCellExperiment ## dim: 21817 1915 ## metadata(0): ## assays(1): counts ## rownames(21817): ENSMUSG00000109644 ENSMUSG00000007777 ... ## ENSMUSG00000055670 ENSMUSG00000039068 ## rowData names(3): symbol chr originalName ## colnames(1915): JC4_349_HSC_FE_S13_ JC4_350_HSC_FE_S13_ ... ## JC48P6_1203_HSC_FE_S8_ JC48P6_1204_HSC_FE_S8_ ## colData names(2): sample protocol ## reducedDimNames(0): ## mainExpName: NULL ## altExpNames(0): Our aim is to annotate each cell with the ImmGen reference dataset (Heng et al. 2008) from the celldex package. (Some further comments on the choice of reference are provided below in Section 2.5.) Calling the ImmGenData() function returns a SummarizedExperiment object containing a matrix of log-expression values with sample-level labels. We also set ensembl=TRUE to match the reference’s gene annotation with that in the sce object - the default behavior is to use the gene symbol. library(celldex) immgen &lt;- ImmGenData(ensembl=TRUE) immgen ## class: SummarizedExperiment ## dim: 21352 830 ## metadata(0): ## assays(1): logcounts ## rownames(21352): ENSMUSG00000079681 ENSMUSG00000066372 ... ## ENSMUSG00000034640 ENSMUSG00000036940 ## rowData names(0): ## colnames(830): ## GSM1136119_EA07068_260297_MOGENE-1_0-ST-V1_MF.11C-11B+.LU_1.CEL ## GSM1136120_EA07068_260298_MOGENE-1_0-ST-V1_MF.11C-11B+.LU_2.CEL ... ## GSM920654_EA07068_201214_MOGENE-1_0-ST-V1_TGD.VG4+24ALO.E17.TH_1.CEL ## GSM920655_EA07068_201215_MOGENE-1_0-ST-V1_TGD.VG4+24ALO.E17.TH_2.CEL ## colData names(3): label.main label.fine label.ont Each celldex dataset actually has three sets of labels that primarily differ in their resolution. 
For the purposes of this demonstration, we will use the “fine” labels in the label.fine metadata field, which represents the highest resolution of annotation available for this dataset. head(immgen$label.fine) ## [1] &quot;Macrophages (MF.11C-11B+)&quot; &quot;Macrophages (MF.11C-11B+)&quot; ## [3] &quot;Macrophages (MF.11C-11B+)&quot; &quot;Macrophages (MF.ALV)&quot; ## [5] &quot;Macrophages (MF.ALV)&quot; &quot;Macrophages (MF.ALV)&quot; We perform annotation by calling SingleR() on our test (Grun) dataset and the reference (ImmGen) dataset, leaving the default of de.method=\"classic\" to use the original marker detection scheme. This applies the algorithm described in Section 1.2, returning a DataFrame where each row contains prediction results for a single cell in the sce object. Labels are provided before fine-tuning (first.labels), after fine-tuning (labels) and after pruning (pruned.labels); some of the other fields are discussed in more detail in Chapter 4. library(SingleR) # See &#39;Choices of assay data&#39; for &#39;assay.type.test=&#39; explanation. pred &lt;- SingleR(test = sce, ref = immgen, labels = immgen$label.fine, assay.type.test=1) colnames(pred) ## [1] &quot;scores&quot; &quot;first.labels&quot; &quot;tuning.scores&quot; &quot;labels&quot; ## [5] &quot;pruned.labels&quot; 2.3 Interaction with quality control Upon examining the distribution of assigned labels, we see that many of them are related to stem cells. However, there are quite a large number of more differentiated labels mixed in, which is not what we expect from a sorted population of HSCs. head(sort(table(pred$labels), decreasing=TRUE)) ## ## Stem cells (SC.MEP) Neutrophils (GN.ARTH) Macrophages (MF) ## 362 306 166 ## Stem cells (SC.STSL) B cells (proB.FrA) Stem cells (SC.LT34F) ## 143 121 103 This is probably because - despite what its name might suggest - the dataset obtained by GrunHSCData() actually contains more than HSCs. 
If we restrict our analysis to the sorted HSCs (obviously) and remove one low-quality batch (see the analysis here for the rationale) we can see that the distribution of cell type labels is more similar to what we might expect. Low-quality cells lack information for accurate label assignment and need to be removed to enable interpretation of the results. actual.hsc &lt;- pred$labels[sce$protocol==&quot;sorted hematopoietic stem cells&quot; &amp; sce$sample!=&quot;JC4&quot;] head(sort(table(actual.hsc), decreasing=TRUE)) ## actual.hsc ## Stem cells (SC.STSL) Stem cells (SC.LT34F) ## 110 98 ## Stem cells (SC.ST34F) Stem cells (SC.CD150-CD48-) ## 37 15 ## Stem cells (LTHSC) Stem cells (MLP) ## 12 7 Filtering the annotation results in the above manner is valid because SingleR() operates independently on each test cell. The annotation is orthogonal to any decisions about the relative quality of the cells in the test dataset; the same results will be obtained regardless of whether SingleR is run before or after quality control. This is logistically convenient as it means that the annotation does not have to be repeated if the quality control scheme (or any other downstream step, like clustering) changes throughout the lifetime of the analysis. 2.4 Choices of assay data For the reference dataset, the assay matrix must contain log-transformed normalized expression values. This is because the default marker detection scheme computes log-fold changes by subtracting the medians, which makes little sense unless the input expression values are already log-transformed. For alternative schemes, this requirement may be relaxed (e.g., Wilcoxon rank sum tests do not require transformation); similarly, if pre-defined markers are supplied, no transformation or normalization is necessary. For the test data, the assay data need not be log-transformed or even (scale) normalized. 
This is because SingleR() computes Spearman correlations within each cell, which are unaffected by monotonic transformations like cell-specific scaling or log-transformation. It is perfectly satisfactory to provide the raw counts for the test dataset to SingleR(), which is the reason for setting assay.type.test=1 in our previous SingleR() call for the Grun dataset. The exception to this rule occurs when comparing data from full-length technologies to the celldex references. These references are intended to be comparable to data from unique molecular identifier (UMI) protocols where the expression values are less sensitive to differences in gene length. Thus, when annotating Smart-seq2 test datasets against the celldex references, better performance can often be achieved by processing the test counts to transcripts-per-million values. We demonstrate below using another HSC dataset that was generated using the Smart-seq2 protocol (Nestorowa et al. 2016). Again, we see that most of the predicted labels are related to stem cells, which is comforting. sce.nest &lt;- NestorowaHSCData() # Getting the exonic gene lengths. library(AnnotationHub) mm.db &lt;- AnnotationHub()[[&quot;AH73905&quot;]] mm.exons &lt;- exonsBy(mm.db, by=&quot;gene&quot;) mm.exons &lt;- reduce(mm.exons) mm.len &lt;- sum(width(mm.exons)) # Computing the TPMs with a simple scaling by gene length. library(scater) keep &lt;- intersect(names(mm.len), rownames(sce.nest)) tpm.nest &lt;- calculateTPM(sce.nest[keep,], lengths=mm.len[keep]) # Performing the assignment. 
pred &lt;- SingleR(test = tpm.nest, ref = immgen, labels = immgen$label.fine) head(sort(table(pred$labels), decreasing=TRUE), 10) ## ## Stem cells (SC.MEP) Stem cells (SC.ST34F) ## 409 357 ## Stem cells (SC.MPP34F) Stem cells (SC.CMP.DR) ## 329 298 ## Stem cells (MLP) Stem cells (GMP) ## 167 102 ## Stem cells (SC.STSL) Stem cells (SC.MDP) ## 71 66 ## Stem cells (SC.CD150-CD48-) Stem cells (SC.LT34F) ## 55 37 2.5 Comments on choice of references Unsurprisingly, the choice of reference has a major impact on the annotation results. We need to pick a reference that contains a superset of the labels that we expect to be present in our test dataset. Whether the original authors assigned appropriate labels to the reference samples is largely taken as a matter of faith; it is not entirely unexpected that some references are “better” than others depending on the quality of sample preparation. We would also prefer a reference that is generated from a similar technology or protocol as our test dataset, though this is usually not an issue when using SingleR() to annotate well-defined cell types. Users are advised to read the relevant vignette for more details about the available references as well as some recommendations on which to use. (As an aside, the ImmGen dataset and other references were originally supplied along with SingleR itself but have since been migrated to the separate celldex package for more general use throughout Bioconductor.) Of course, as we shall see in the next Chapter, it is entirely possible to supply your own reference datasets instead; all we need are log-expression values and a set of labels for the cells or samples. 
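To see why a matrix of log-expression values and a vector of labels are enough, the following toy sketch walks through classic-mode annotation by hand: markers are taken from differences in per-label medians (Section 2.1), Spearman correlations are computed on those markers, and each label is scored by the 0.8 quantile of its correlations (Section 1.2). This is an illustration under stated assumptions and not the code used by SingleR(): the data, the labels X, Y and Z, and the helper names classic_n, top_up and classify are all invented here, the rounding of the gene count to an integer is an assumption, the number of genes per comparison is capped at 2 to suit the tiny matrix, and the fine-tuning step is omitted.

```r
# Toy reference: log-expression values (genes x samples) plus one label per sample.
# All values, gene names and labels are invented for illustration.
ref <- matrix(c(
    5, 5, 1, 1, 1, 1,   # geneA: high in X
    4, 4, 1, 0, 1, 1,   # geneB: high in X
    1, 1, 6, 5, 1, 1,   # geneC: high in Y
    0, 1, 5, 6, 1, 0,   # geneD: high in Y
    1, 1, 1, 1, 7, 6,   # geneE: high in Z
    1, 0, 1, 1, 6, 7,   # geneF: high in Z
    2, 2, 2, 2, 2, 2    # geneG: uninformative
), ncol = 6, byrow = TRUE,
    dimnames = list(paste0('gene', LETTERS[1:7]), paste0('s', 1:6)))
labels <- c('X', 'X', 'Y', 'Y', 'Z', 'Z')

# Genes per pairwise comparison for n labels: 500 * (2/3)^log2(n).
# Rounding to a whole number is an assumption of this sketch.
classic_n <- function(n.labels) round(500 * (2/3)^log2(n.labels))

# Per-label median log-expression (genes x labels).
med <- sapply(split(seq_along(labels), labels), function(idx) {
    apply(ref[, idx, drop = FALSE], 1, median)
})

# Genes with the largest positive median differences for 'first' over 'second'.
top_up <- function(first, second, n) {
    delta <- med[, first] - med[, second]
    delta <- sort(delta[delta > 0], decreasing = TRUE)
    names(head(delta, n))
}

# Union of markers over all ordered pairs of labels (n capped at 2 for the toy data).
pairs <- expand.grid(first = colnames(med), second = colnames(med),
    stringsAsFactors = FALSE)
pairs <- pairs[pairs$first != pairs$second, ]
markers <- unique(unlist(Map(top_up, pairs$first, pairs$second, n = 2)))

# Score one test cell: Spearman correlation with every reference sample on the
# marker genes, then the 0.8 quantile of the correlations within each label.
classify <- function(cell) {
    rho <- apply(ref[markers, ], 2, function(s) {
        cor(cell[markers], s, method = 'spearman')
    })
    scores <- sapply(split(rho, labels), function(r) unname(quantile(r, probs = 0.8)))
    names(scores)[which.max(scores)]
}

# A toy test cell supplied as raw counts.
test.counts <- c(geneA = 0, geneB = 1, geneC = 40, geneD = 25,
    geneE = 2, geneF = 3, geneG = 3)
classify(test.counts)
classify(log1p(test.counts))  # ranks are unchanged, so the label is the same
```

The last two calls return the same label because the log-transformation does not change the ranks of the test cell, which is why raw counts are acceptable for the test dataset (Section 2.4). The reference, by contrast, has to be on the log scale for the median differences to be interpretable as log-fold changes.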
Session information View session info R version 4.1.0 (2021-05-18) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] scater_1.20.0 ggplot2_3.3.3 [3] scuttle_1.2.0 AnnotationHub_3.0.0 [5] BiocFileCache_2.0.0 dbplyr_2.1.1 [7] SingleR_1.6.1 celldex_1.2.0 [9] ensembldb_2.16.0 AnnotationFilter_1.16.0 [11] GenomicFeatures_1.44.0 AnnotationDbi_1.54.0 [13] scRNAseq_2.6.0 SingleCellExperiment_1.14.1 [15] SummarizedExperiment_1.22.0 Biobase_2.52.0 [17] GenomicRanges_1.44.0 GenomeInfoDb_1.28.0 [19] IRanges_2.26.0 S4Vectors_0.30.0 [21] BiocGenerics_0.38.0 MatrixGenerics_1.4.0 [23] matrixStats_0.58.0 BiocStyle_2.20.0 [25] rebook_1.2.0 loaded via a namespace (and not attached): [1] ggbeeswarm_0.6.0 colorspace_2.0-1 [3] rjson_0.2.20 ellipsis_0.3.2 [5] XVector_0.32.0 BiocNeighbors_1.10.0 [7] bit64_4.0.5 interactiveDisplayBase_1.30.0 [9] fansi_0.4.2 codetools_0.2-18 [11] sparseMatrixStats_1.4.0 cachem_1.0.5 [13] knitr_1.33 jsonlite_1.7.2 [15] Rsamtools_2.8.0 png_0.1-7 [17] graph_1.70.0 shiny_1.6.0 [19] BiocManager_1.30.15 compiler_4.1.0 [21] httr_1.4.2 assertthat_0.2.1 [23] Matrix_1.3-3 fastmap_1.1.0 [25] lazyeval_0.2.2 BiocSingular_1.8.0 [27] later_1.2.0 htmltools_0.5.1.1 [29] prettyunits_1.1.1 tools_4.1.0 [31] rsvd_1.0.5 gtable_0.3.0 [33] glue_1.4.2 GenomeInfoDbData_1.2.6 [35] dplyr_1.0.6 rappdirs_0.3.3 [37] Rcpp_1.0.6 jquerylib_0.1.4 [39] vctrs_0.3.8 Biostrings_2.60.0 [41] ExperimentHub_2.0.0 rtracklayer_1.52.0 [43] DelayedMatrixStats_1.14.0 
xfun_0.23 [45] stringr_1.4.0 beachmat_2.8.0 [47] irlba_2.3.3 mime_0.10 [49] lifecycle_1.0.0 restfulr_0.0.13 [51] XML_3.99-0.6 scales_1.1.1 [53] zlibbioc_1.38.0 hms_1.1.0 [55] promises_1.2.0.1 ProtGenerics_1.24.0 [57] yaml_2.2.1 curl_4.3.1 [59] gridExtra_2.3 memoise_2.0.0 [61] sass_0.4.0 biomaRt_2.48.0 [63] stringi_1.6.2 RSQLite_2.2.7 [65] BiocVersion_3.13.1 BiocIO_1.2.0 [67] ScaledMatrix_1.0.0 filelock_1.0.2 [69] BiocParallel_1.26.0 rlang_0.4.11 [71] pkgconfig_2.0.3 bitops_1.0-7 [73] evaluate_0.14 lattice_0.20-44 [75] purrr_0.3.4 GenomicAlignments_1.28.0 [77] CodeDepends_0.6.5 bit_4.0.4 [79] tidyselect_1.1.1 magrittr_2.0.1 [81] bookdown_0.22 R6_2.5.0 [83] generics_0.1.0 DelayedArray_0.18.0 [85] DBI_1.1.1 pillar_1.6.1 [87] withr_2.4.2 KEGGREST_1.32.0 [89] RCurl_1.98-1.3 tibble_3.1.2 [91] dir.expiry_1.0.0 crayon_1.4.1 [93] utf8_1.2.1 rmarkdown_2.8 [95] viridis_0.6.1 progress_1.2.2 [97] grid_4.1.0 blob_1.2.1 [99] digest_0.6.27 xtable_1.8-4 [101] httpuv_1.6.1 munsell_0.5.0 [103] viridisLite_0.4.0 beeswarm_0.3.1 [105] vipor_0.4.5 bslib_0.2.5.1 Bibliography "],["more-markers.html", "Chapter 3 Controlling marker detection 3.1 Overview 3.2 Annotation with test-based marker detection 3.3 Defining custom markers 3.4 Pseudo-bulk aggregation Session information", " Chapter 3 Controlling marker detection 3.1 Overview One of the most important steps in SingleR (beyond the choice of reference, of course) is the derivation of the marker genes used in the score calculation. We have already introduced the classic approach in the previous chapter, but it is similarly straightforward to perform marker detection with conventional statistical tests. 
In particular, we identify top-ranked markers based on pairwise Wilcoxon rank sum tests or \\(t\\)-tests between labels; this allows us to account for the variability across cells to choose genes that are robustly upregulated in each label. The availability of variance-aware marker detection methods is most relevant for reference datasets that contain a reasonable number (i.e., at least two) of replicate samples for each label. An obvious use case is that of single-cell datasets that are used as a reference to annotate other single-cell datasets. It is also possible for users to supply their own custom marker lists to SingleR(), facilitating incorporation of prior biological knowledge into the annotation process. We will demonstrate these capabilities below in this chapter. 3.2 Annotation with test-based marker detection To demonstrate, we will use two human pancreas scRNA-seq datasets from the scRNAseq package. The aim is to use one pre-labelled dataset to annotate the other unlabelled dataset. First, we set up the Muraro et al. (2016) dataset to be our reference, computing log-normalized expression values as discussed in Section 2.4. library(scRNAseq) sceM &lt;- MuraroPancreasData() # Removing unlabelled cells or cells without a clear label. sceM &lt;- sceM[,!is.na(sceM$label) &amp; sceM$label!=&quot;unclear&quot;] library(scater) sceM &lt;- logNormCounts(sceM) sceM ## class: SingleCellExperiment ## dim: 19059 2122 ## metadata(0): ## assays(2): counts logcounts ## rownames(19059): A1BG-AS1__chr19 A1BG__chr19 ... ZZEF1__chr17 ## ZZZ3__chr1 ## rowData names(2): symbol chr ## colnames(2122): D28-1_1 D28-1_2 ... D30-8_93 D30-8_94 ## colData names(4): label donor plate sizeFactor ## reducedDimNames(0): ## mainExpName: endogenous ## altExpNames(1): ERCC # Seeing the available labels in this dataset. 
table(sceM$label) ## ## acinar alpha beta delta duct endothelial ## 219 812 448 193 245 21 ## epsilon mesenchymal pp ## 3 80 101 We then set up our test dataset from Grun et al. (2016), applying some basic quality control as discussed here and in Section 2.3. We also compute the log-transformed values here, not because it is strictly necessary but so that we don’t have to keep on typing assay.type.test=1 in later calls to SingleR(). sceG &lt;- GrunPancreasData() sceG &lt;- addPerCellQC(sceG) qc &lt;- quickPerCellQC(colData(sceG), percent_subsets=&quot;altexps_ERCC_percent&quot;, batch=sceG$donor, subset=sceG$donor %in% c(&quot;D17&quot;, &quot;D7&quot;, &quot;D2&quot;)) sceG &lt;- sceG[,!qc$discard] sceG &lt;- logNormCounts(sceG) sceG ## class: SingleCellExperiment ## dim: 20064 1064 ## metadata(0): ## assays(2): counts logcounts ## rownames(20064): A1BG-AS1__chr19 A1BG__chr19 ... ZZEF1__chr17 ## ZZZ3__chr1 ## rowData names(2): symbol chr ## colnames(1064): D2ex_1 D2ex_2 ... D17TGFB_94 D17TGFB_95 ## colData names(9): donor sample ... total sizeFactor ## reducedDimNames(0): ## mainExpName: endogenous ## altExpNames(1): ERCC We run SingleR() as described previously but with a marker detection mode that considers the variance of expression across cells. Here, we will use the Wilcoxon rank sum test to identify the top markers for each pairwise comparison between labels. This is slower but more appropriate for single-cell data compared to the default marker detection algorithm, as the latter may fail for low-coverage data where the median for each label is often zero. library(SingleR) pred.grun &lt;- SingleR(test=sceG, ref=sceM, labels=sceM$label, de.method=&quot;wilcox&quot;) table(pred.grun$labels) ## ## acinar alpha beta delta duct endothelial ## 277 203 181 50 306 5 ## epsilon mesenchymal pp ## 1 22 19 By default, the function will take the top de.n (default: 10) genes from each pairwise comparison between labels. 
A larger number of markers increases the robustness of the annotation by ensuring that relevant genes are not omitted, especially if the reference dataset has study-specific effects that cause uninteresting genes to dominate the top set. However, this comes at the cost of increasing noise and computational time. library(SingleR) pred.grun &lt;- SingleR(test=sceG, ref=sceM, labels=sceM$label, de.method=&quot;wilcox&quot;, de.n=50) table(pred.grun$labels) ## ## acinar alpha beta delta duct endothelial ## 275 203 177 55 307 5 ## epsilon mesenchymal pp ## 1 23 18 3.3 Defining custom markers The marker detection in SingleR() is built on top of the testing framework in scran, so most options in ?pairwiseWilcox and friends can be applied via the de.args= option. For example, we could use the \\(t\\)-test and test against a log-fold change threshold with de.args=list(lfc=1). library(SingleR) pred.grun2 &lt;- SingleR(test=sceG, ref=sceM, labels=sceM$label, de.method=&quot;t&quot;, de.args=list(lfc=1)) table(pred.grun2$labels) ## ## acinar alpha beta delta duct endothelial ## 285 200 177 54 296 5 ## epsilon mesenchymal pp ## 5 24 18 However, users can also construct their own marker lists with any DE testing machinery. For example, we can perform pairwise binomial tests to identify genes that are differentially detected (i.e., have differences in the proportion of cells with non-zero counts) between labels in the reference Muraro dataset. We then take the top 10 marker genes from each pairwise comparison, obtaining a list of lists of character vectors containing the identities of the markers for that comparison. 
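To make the shape of this list of lists concrete, the sketch below builds a small one by hand in base R. The labels and gene names are hypothetical placeholders and are not taken from the Muraro dataset; the only point is the nesting, where the outer name is the label of interest and the inner name is the label it is compared against.

```r
# Hand-written marker lists; labels and gene names are hypothetical.
# markers[[A]][[B]] holds the genes upregulated in label A relative to label B.
markers <- list(
    alpha = list(
        beta  = c('GENE1', 'GENE2'),
        delta = c('GENE1', 'GENE3')
    ),
    beta = list(
        alpha = c('GENE4', 'GENE5'),
        delta = c('GENE4', 'GENE6')
    ),
    delta = list(
        alpha = c('GENE7', 'GENE8'),
        beta  = c('GENE8', 'GENE9')
    )
)

# Markers for one ordered comparison: up in alpha relative to beta.
markers$alpha$beta

# Union across all comparisons, i.e. the feature set described for the
# initial score calculation in Section 2.1.
all.markers <- unique(unlist(markers, use.names = FALSE))
length(all.markers)
```

Keeping the comparisons separate, rather than storing only the union, is what allows fine-tuning to restrict itself to markers from comparisons between the labels that are still in contention.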
library(scran) out &lt;- pairwiseBinom(counts(sceM), sceM$label, direction=&quot;up&quot;) markers &lt;- getTopMarkers(out$statistics, out$pairs, n=10) # Upregulated in acinar compared to alpha: markers$acinar$alpha ## [1] &quot;KCNQ1__chr11&quot; &quot;FAM129A__chr1&quot; &quot;KLK1__chr19&quot; &quot;NTN4__chr12&quot; ## [5] &quot;RASEF__chr9&quot; &quot;CTRL__chr16&quot; &quot;LGALS2__chr22&quot; &quot;NUPR1__chr16&quot; ## [9] &quot;LGALS3__chr14&quot; &quot;NR5A2__chr1&quot; # Upregulated in alpha compared to acinar: markers$alpha$acinar ## [1] &quot;SLC38A4__chr12&quot; &quot;ARX__chrX&quot; &quot;CRYBA2__chr2&quot; &quot;FSTL5__chr4&quot; ## [5] &quot;GNG2__chr14&quot; &quot;NOL4__chr18&quot; &quot;IRX2__chr5&quot; &quot;KCNMB2__chr3&quot; ## [9] &quot;CFC1__chr2&quot; &quot;KCNJ6__chr21&quot; Once we have this list of lists, we supply it to SingleR() via the genes= argument, which causes the function to bypass the internal marker detection to use the supplied gene sets instead. The most obvious benefit of this approach is that the user can achieve greater control of the markers, allowing integration of prior biological knowledge to obtain more relevant genes and a more robust annotation. pred.grun2b &lt;- SingleR(test=sceG, ref=sceM, labels=sceM$label, genes=markers) table(pred.grun2b$labels) ## ## acinar alpha beta delta duct endothelial ## 276 202 175 54 302 5 ## epsilon mesenchymal pp ## 2 25 23 In some cases, markers may only be available for specific labels rather than for pairwise comparisons between labels. This is accommodated by supplying a named list of character vectors to genes. Note that this is likely to be less powerful than the list-of-lists approach as information about pairwise differences is discarded. # Creating label-specific markers. 
label.markers &lt;- lapply(markers, unlist) label.markers &lt;- lapply(label.markers, unique) str(label.markers) ## List of 9 ## $ acinar : chr [1:40] &quot;KCNQ1__chr11&quot; &quot;FAM129A__chr1&quot; &quot;KLK1__chr19&quot; &quot;NTN4__chr12&quot; ... ## $ alpha : chr [1:41] &quot;SLC38A4__chr12&quot; &quot;ARX__chrX&quot; &quot;CRYBA2__chr2&quot; &quot;FSTL5__chr4&quot; ... ## $ beta : chr [1:47] &quot;ELAVL4__chr1&quot; &quot;PRUNE2__chr9&quot; &quot;NMNAT2__chr1&quot; &quot;PLCB4__chr20&quot; ... ## $ delta : chr [1:44] &quot;NOL4__chr18&quot; &quot;CABP7__chr22&quot; &quot;UNC80__chr2&quot; &quot;HEPACAM2__chr7&quot; ... ## $ duct : chr [1:50] &quot;ADCY5__chr3&quot; &quot;PDE3A__chr12&quot; &quot;SLC3A1__chr2&quot; &quot;BICC1__chr10&quot; ... ## $ endothelial: chr [1:26] &quot;GPR4__chr19&quot; &quot;TMEM204__chr16&quot; &quot;GPR116__chr6&quot; &quot;CYYR1__chr21&quot; ... ## $ epsilon : chr [1:14] &quot;BHMT__chr5&quot; &quot;JPH3__chr16&quot; &quot;SERPINA10__chr14&quot; &quot;UGT2B4__chr4&quot; ... ## $ mesenchymal: chr [1:34] &quot;TNFAIP6__chr2&quot; &quot;THBS2__chr6&quot; &quot;CDH11__chr16&quot; &quot;SRPX2__chrX&quot; ... ## $ pp : chr [1:44] &quot;SERTM1__chr13&quot; &quot;ETV1__chr7&quot; &quot;ARX__chrX&quot; &quot;ELAVL4__chr1&quot; ... pred.grun2c &lt;- SingleR(test=sceG, ref=sceM, labels=sceM$label, genes=label.markers) table(pred.grun2c$labels) ## ## acinar alpha beta delta duct endothelial ## 262 204 169 59 317 6 ## epsilon mesenchymal pp ## 2 24 21 3.4 Pseudo-bulk aggregation Single-cell reference datasets provide a like-for-like comparison to our test single-cell datasets, yielding a more accurate classification of the cells in the latter (hopefully). However, there are frequently many more samples in single-cell references compared to bulk references, increasing the computational work involved in classification. 
We overcome this by aggregating cells into one “pseudo-bulk” sample per label (e.g., by averaging across log-expression values) and using that as the reference profile, which allows us to achieve the same efficiency as the use of bulk references. The obvious cost of this approach is that we discard potentially useful information about the distribution of cells within each label. Cells that belong to a heterogeneous population may not be correctly assigned if they are far from the population center. To preserve some of this information, we perform \\(k\\)-means clustering within each label to create pseudo-bulk samples that are representative of a particular region of the expression space (i.e., vector quantization). We create \\(\\sqrt{N}\\) clusters given a label with \\(N\\) cells, which provides a reasonable compromise between reducing computational work and preserving the label’s internal distribution. To enable this aggregation, we simply set aggr.ref=TRUE in the SingleR() call. This uses the aggregateReference() function to perform \\(k\\)-means clustering within each label (typically after principal components analysis on the log-expression matrix, for greater speed) and average expression values for each within-label cluster. Note that marker detection is still performed on the unaggregated data so as to make full use of the distribution of expression values across cells. set.seed(100) # for the k-means step. pred.grun3 &lt;- SingleR(test=sceG, ref=sceM, labels=sceM$label, de.method=&quot;wilcox&quot;, aggr.ref=TRUE) table(pred.grun3$labels) ## ## acinar alpha beta delta duct endothelial ## 271 202 181 51 311 5 ## epsilon mesenchymal pp ## 1 22 20 Obviously, the aggregation itself requires computational work so setting aggr.ref=TRUE in SingleR() itself may not improve speed. Rather, the real power of this approach lies in pre-aggregating the reference dataset so that it can be repeatedly applied to quickly annotate multiple test datasets. 
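The within-label aggregation described above can be sketched in base R on toy data. This is only an illustration of the idea, not the code inside aggregateReference(): the matrix, the labels and the helper aggregate_by_label() are all made up, the PCA step is skipped, and rounding the square root upwards is an arbitrary choice.

```r
# Toy log-expression matrix: 50 genes x 120 cells, with made-up labels.
set.seed(42)
n.genes <- 50
labels <- rep(c("A", "B", "C"), times = c(64, 36, 20))
logexpr <- matrix(rnorm(n.genes * length(labels)), nrow = n.genes,
    dimnames = list(paste0("gene", seq_len(n.genes)), NULL))

# Hypothetical helper: k-means within each label, one profile per cluster.
aggregate_by_label <- function(x, labels) {
    out <- list()
    for (lab in unique(labels)) {
        current <- x[, labels == lab, drop = FALSE]
        k <- ceiling(sqrt(ncol(current)))  # about sqrt(N) clusters for N cells.
        # kmeans() clusters rows, so cells must be placed in rows.
        km <- kmeans(t(current), centers = k, iter.max = 50)
        # Each centroid is the mean log-expression profile of one cluster.
        profiles <- t(km$centers)
        colnames(profiles) <- paste0(lab, ".", seq_len(k))
        out[[lab]] <- profiles
    }
    out
}

pseudo <- aggregate_by_label(logexpr, labels)
vapply(pseudo, ncol, 0L)  # number of pseudo-bulk profiles per label
```

Each label with N cells is reduced to roughly the square root of N profiles, which is where the saving in classification time comes from. In practice, aggregateReference() would be run once on the reference and the result reused across test datasets.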
This approach is discussed in more detail in Chapter 7. Session information View session info R version 4.1.0 (2021-05-18) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] scran_1.20.0 SingleR_1.6.1 [3] scater_1.20.0 ggplot2_3.3.3 [5] scuttle_1.2.0 scRNAseq_2.6.0 [7] SingleCellExperiment_1.14.1 SummarizedExperiment_1.22.0 [9] Biobase_2.52.0 GenomicRanges_1.44.0 [11] GenomeInfoDb_1.28.0 IRanges_2.26.0 [13] S4Vectors_0.30.0 BiocGenerics_0.38.0 [15] MatrixGenerics_1.4.0 matrixStats_0.58.0 [17] BiocStyle_2.20.0 rebook_1.2.0 loaded via a namespace (and not attached): [1] AnnotationHub_3.0.0 BiocFileCache_2.0.0 [3] igraph_1.2.6 lazyeval_0.2.2 [5] BiocParallel_1.26.0 digest_0.6.27 [7] ensembldb_2.16.0 htmltools_0.5.1.1 [9] viridis_0.6.1 fansi_0.4.2 [11] magrittr_2.0.1 memoise_2.0.0 [13] ScaledMatrix_1.0.0 cluster_2.1.2 [15] limma_3.48.0 Biostrings_2.60.0 [17] prettyunits_1.1.1 colorspace_2.0-1 [19] blob_1.2.1 rappdirs_0.3.3 [21] xfun_0.23 dplyr_1.0.6 [23] crayon_1.4.1 RCurl_1.98-1.3 [25] jsonlite_1.7.2 graph_1.70.0 [27] glue_1.4.2 gtable_0.3.0 [29] zlibbioc_1.38.0 XVector_0.32.0 [31] DelayedArray_0.18.0 BiocSingular_1.8.0 [33] scales_1.1.1 edgeR_3.34.0 [35] DBI_1.1.1 Rcpp_1.0.6 [37] viridisLite_0.4.0 xtable_1.8-4 [39] progress_1.2.2 dqrng_0.3.0 [41] bit_4.0.4 rsvd_1.0.5 [43] metapod_1.0.0 httr_1.4.2 [45] dir.expiry_1.0.0 ellipsis_0.3.2 [47] pkgconfig_2.0.3 XML_3.99-0.6 [49] CodeDepends_0.6.5 sass_0.4.0 [51] dbplyr_2.1.1 locfit_1.5-9.4 
[53] utf8_1.2.1 tidyselect_1.1.1 [55] rlang_0.4.11 later_1.2.0 [57] AnnotationDbi_1.54.0 munsell_0.5.0 [59] BiocVersion_3.13.1 tools_4.1.0 [61] cachem_1.0.5 generics_0.1.0 [63] RSQLite_2.2.7 ExperimentHub_2.0.0 [65] evaluate_0.14 stringr_1.4.0 [67] fastmap_1.1.0 yaml_2.2.1 [69] knitr_1.33 bit64_4.0.5 [71] purrr_0.3.4 KEGGREST_1.32.0 [73] AnnotationFilter_1.16.0 sparseMatrixStats_1.4.0 [75] mime_0.10 biomaRt_2.48.0 [77] compiler_4.1.0 beeswarm_0.3.1 [79] filelock_1.0.2 curl_4.3.1 [81] png_0.1-7 interactiveDisplayBase_1.30.0 [83] statmod_1.4.36 tibble_3.1.2 [85] bslib_0.2.5.1 stringi_1.6.2 [87] GenomicFeatures_1.44.0 lattice_0.20-44 [89] bluster_1.2.0 ProtGenerics_1.24.0 [91] Matrix_1.3-3 vctrs_0.3.8 [93] pillar_1.6.1 lifecycle_1.0.0 [95] BiocManager_1.30.15 jquerylib_0.1.4 [97] BiocNeighbors_1.10.0 bitops_1.0-7 [99] irlba_2.3.3 httpuv_1.6.1 [101] rtracklayer_1.52.0 R6_2.5.0 [103] BiocIO_1.2.0 bookdown_0.22 [105] promises_1.2.0.1 gridExtra_2.3 [107] vipor_0.4.5 codetools_0.2-18 [109] assertthat_0.2.1 rjson_0.2.20 [111] withr_2.4.2 GenomicAlignments_1.28.0 [113] Rsamtools_2.8.0 GenomeInfoDbData_1.2.6 [115] hms_1.1.0 grid_4.1.0 [117] beachmat_2.8.0 rmarkdown_2.8 [119] DelayedMatrixStats_1.14.0 shiny_1.6.0 [121] ggbeeswarm_0.6.0 restfulr_0.0.13 Bibliography "],["annotation-diagnostics.html", "Chapter 4 Annotation diagnostics 4.1 Overview 4.2 Based on the scores within cells 4.3 Based on the deltas across cells 4.4 Based on marker gene expression 4.5 Comparing to unsupervised clustering Session information", " Chapter 4 Annotation diagnostics 4.1 Overview In addition to the labels, SingleR() returns a number of helpful diagnostics about the annotation process that can be used to determine whether the assignments
are appropriate. Unambiguous assignments corroborated by expression of canonical markers add confidence to the results; conversely, low-confidence assignments can be pruned out to avoid adding noise to downstream analyses. This chapter will demonstrate some of these common sanity checks on the pancreas datasets from Chapter 3 (Muraro et al. 2016; Grun et al. 2016). View set-up code (Chapter 8) #--- loading-muraro ---# library(scRNAseq) sceM &lt;- MuraroPancreasData() sceM &lt;- sceM[,!is.na(sceM$label) &amp; sceM$label!=&quot;unclear&quot;] #--- normalize-muraro ---# library(scater) sceM &lt;- logNormCounts(sceM) #--- loading-grun ---# sceG &lt;- GrunPancreasData() sceG &lt;- addPerCellQC(sceG) qc &lt;- quickPerCellQC(colData(sceG), percent_subsets=&quot;altexps_ERCC_percent&quot;, batch=sceG$donor, subset=sceG$donor %in% c(&quot;D17&quot;, &quot;D7&quot;, &quot;D2&quot;)) sceG &lt;- sceG[,!qc$discard] #--- normalize-grun ---# sceG &lt;- logNormCounts(sceG) #--- annotation ---# library(SingleR) pred.grun &lt;- SingleR(test=sceG, ref=sceM, labels=sceM$label, de.method=&quot;wilcox&quot;) 4.2 Based on the scores within cells The most obvious diagnostic reported by SingleR() is the nested matrix of per-cell scores in the scores field. This contains the correlation-based scores prior to any fine-tuning for each cell (row) and reference label (column). Ideally, we would see unambiguous assignments where, for any given cell, one label’s score is clearly larger than the others. 
pred.grun$scores[1:10,] ## acinar alpha beta delta duct endothelial epsilon mesenchymal ## [1,] 0.6797 0.1463 0.1554 0.1029 0.4971 0.2030 0.04434 0.1848 ## [2,] 0.6832 0.1849 0.1611 0.1250 0.5133 0.2231 0.08909 0.2140 ## [3,] 0.6845 0.2032 0.2141 0.1875 0.5060 0.2286 0.08960 0.1961 ## [4,] 0.6472 0.2050 0.2115 0.1879 0.7300 0.2285 0.12107 0.2708 ## [5,] 0.6079 0.2285 0.2419 0.1938 0.6616 0.2659 0.16007 0.3228 ## [6,] 0.5816 0.2789 0.2611 0.2661 0.6827 0.2857 0.21781 0.3151 ## [7,] 0.5704 0.3268 0.2990 0.2618 0.6293 0.2947 0.20834 0.3263 ## [8,] 0.5692 0.2455 0.2152 0.2033 0.6495 0.2652 0.18469 0.2943 ## [9,] 0.6421 0.2101 0.1929 0.1894 0.7201 0.2795 0.14615 0.3142 ## [10,] 0.7153 0.2109 0.1747 0.1606 0.5896 0.1687 0.15639 0.1778 ## pp ## [1,] 0.08954 ## [2,] 0.11207 ## [3,] 0.15207 ## [4,] 0.14351 ## [5,] 0.18475 ## [6,] 0.21898 ## [7,] 0.25871 ## [8,] 0.19160 ## [9,] 0.17281 ## [10,] 0.16314 To check whether this is indeed the case, we use the plotScoreHeatmap() function to visualize the score matrix (Figure 4.1). Here, the key is to examine the spread of scores within each cell, i.e., down the columns of the heatmap. Similar scores for a group of labels indicate that the assignment is uncertain for those columns, though this may be acceptable if the uncertainty is distributed across closely related cell types. (Note that the assigned label for a cell may not be the visually top-scoring label if fine-tuning is applied, as only the pre-tuned scores are directly comparable across all labels.) library(SingleR) plotScoreHeatmap(pred.grun) Figure 4.1: Heatmap of normalized scores for the Grun dataset. Each cell is a column while each row is a label in the reference Muraro dataset. The final label (after fine-tuning) for each cell is shown in the top color bar. We can also display other metadata information for each cell by setting clusters= or annotation_col=.
This is occasionally useful for examining potential batch effects, differences in cell type composition between conditions, relationship to clusters from an unsupervised analysis and so on. For example, Figure 4.2 displays the donor of origin for each cell; we can see that each cell type has contributions from multiple donors, which is reassuring as it indicates that our assignments are not (purely) driven by donor effects. plotScoreHeatmap(pred.grun, annotation_col=as.data.frame(colData(sceG)[,&quot;donor&quot;,drop=FALSE])) Figure 4.2: Heatmap of normalized scores for the Grun dataset, including the donor of origin for each cell. The scores matrix has several caveats associated with its interpretation. Only the pre-tuned scores are stored in this matrix, as scores after fine-tuning are not comparable across all labels. This means that the label with the highest score for a cell may not be the cell’s final label if fine-tuning is applied. Moreover, the magnitude of differences in the scores has no clear interpretation; indeed, plotScoreHeatmap() dispenses with any faithful representation of the scores and instead adjusts the values to highlight any differences between labels within each cell. 4.3 Based on the deltas across cells We identify poor-quality or ambiguous assignments based on the per-cell “delta”, i.e., the difference between the score for the assigned label and the median across all labels for each cell. Our assumption is that most of the labels in the reference are not relevant to any given cell. Thus, the median across all labels can be used as a measure of the baseline correlation, while the gap from the assigned label to this baseline can be used as a measure of the assignment confidence. Low deltas indicate that the assignment is uncertain, possibly because the cell’s true label does not exist in the reference. An obvious next step is to apply a threshold on the delta to filter out these low-confidence assignments.
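As a rough illustration of the definition above, the delta can be computed by hand from a small score matrix. The scores below are invented, the assigned label is taken to be the top-scoring one (which ignores fine-tuning), and the cutoff of 0.2 is arbitrary; this is a sketch of the arithmetic rather than what SingleR() does internally.

```r
# Toy score matrix: 4 cells (rows) x 5 labels (columns), values made up.
scores <- rbind(
    c(0.70, 0.15, 0.16, 0.10, 0.50),
    c(0.30, 0.28, 0.31, 0.29, 0.27),
    c(0.20, 0.65, 0.22, 0.18, 0.25),
    c(0.60, 0.21, 0.19, 0.19, 0.72)
)
colnames(scores) <- c("acinar", "alpha", "beta", "delta", "duct")

# Assumption: the assigned label is simply the top-scoring label.
best <- max.col(scores, ties.method = "first")
assigned <- colnames(scores)[best]
assigned.score <- scores[cbind(seq_len(nrow(scores)), best)]

# Delta: assigned score minus the per-cell median across all labels.
deltas <- assigned.score - apply(scores, 1, median)

# A fixed (arbitrary) threshold flags the low-confidence assignments.
low.conf <- deltas < 0.2
data.frame(assigned, deltas, low.conf)
```

The second cell has nearly identical scores for every label, so its delta is close to zero and it is the only one flagged, even though its top score is not much lower than a typical baseline score in the other cells.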
We use the delta rather than the assignment score as the latter is more sensitive to technical effects. For example, changes in library size affect the technical noise and can increase/decrease all scores for a given cell, while the delta is somewhat more robust as it focuses on the differences between scores within each cell. SingleR() will set a threshold on the delta for each label using an outlier-based strategy. Specifically, we identify cells with deltas that are small outliers relative to the deltas of other cells with the same label. This assumes that, for any given label, most cells assigned to that label are correct. We focus on outliers to avoid difficulties with setting a fixed threshold, especially given that the magnitudes of the deltas are about as uninterpretable as the scores themselves. Pruned labels are reported in the pruned.labels field where low-quality assignments are replaced with NA. to.remove &lt;- is.na(pred.grun$pruned.labels) table(Label=pred.grun$labels, Removed=to.remove) ## Removed ## Label FALSE TRUE ## acinar 251 26 ## alpha 198 5 ## beta 180 1 ## delta 49 1 ## duct 301 5 ## endothelial 4 1 ## epsilon 1 0 ## mesenchymal 22 0 ## pp 17 2 However, the default pruning parameters may not be appropriate for every dataset. For example, if one label is consistently misassigned, the assumption that most cells are correctly assigned will not be appropriate. In such cases, we can revert to a fixed threshold by manually calling the underlying pruneScores() function with min.diff.med=. The example below discards cells with deltas below an arbitrary threshold of 0.2, where higher thresholds correspond to greater assignment certainty. 
to.remove &lt;- pruneScores(pred.grun, min.diff.med=0.2) table(Label=pred.grun$labels, Removed=to.remove) ## Removed ## Label FALSE TRUE ## acinar 250 27 ## alpha 155 48 ## beta 148 33 ## delta 33 17 ## duct 301 5 ## endothelial 4 1 ## epsilon 0 1 ## mesenchymal 22 0 ## pp 4 15 This entire process can be visualized using the plotDeltaDistribution() function, which displays the per-label distribution of the deltas across cells (Figure 4.3). We can use this plot to check that outlier detection in pruneScores() behaved sensibly. Labels with especially low deltas may warrant some additional caution in their interpretation. plotDeltaDistribution(pred.grun) Figure 4.3: Distribution of deltas for the Grun dataset. Each facet represents a label in the Muraro dataset, and each point represents a cell assigned to that label (colored by whether it was pruned). If fine-tuning was performed, we can apply an even more stringent filter based on the difference between the highest and second-highest scores after fine-tuning. Cells will only pass the filter if they are assigned to a label that is clearly distinguishable from any other label. In practice, this approach tends to be too conservative as assignments involving closely related labels are heavily penalized. to.remove2 &lt;- pruneScores(pred.grun, min.diff.next=0.1) table(Label=pred.grun$labels, Removed=to.remove2) ## Removed ## Label FALSE TRUE ## acinar 235 42 ## alpha 166 37 ## beta 117 64 ## delta 25 25 ## duct 157 149 ## endothelial 4 1 ## epsilon 0 1 ## mesenchymal 22 0 ## pp 9 10 4.4 Based on marker gene expression Another simple yet effective diagnostic is to examine the expression of the marker genes for each label in the test dataset. The marker genes used for each label are reported in the metadata() of the SingleR() output, so we can simply retrieve them to visualize their (usually log-transformed) expression values across the test dataset.
In Figure 4.4, we use the plotHeatmap() function from scater to examine the expression of markers used to identify beta cells. all.markers &lt;- metadata(pred.grun)$de.genes beta.markers &lt;- unique(unlist(all.markers$beta)) sceG$labels &lt;- pred.grun$labels library(scater) plotHeatmap(sceG, order_columns_by=&quot;labels&quot;, features=beta.markers) Figure 4.4: Heatmap of log-expression values in the Grun dataset for all marker genes upregulated in beta cells in the Muraro reference dataset. Assigned labels for each cell are shown at the top of the plot. If a cell in the test dataset is confidently assigned to a particular label, we would expect it to have strong expression of that label’s markers. We would also hope that that label’s markers are biologically meaningful; in this case, we do observe strong upregulation of insulin (INS) in the beta cells, which is reassuring and gives greater confidence to the correctness of the assignment. If the identified markers are not meaningful or not consistently upregulated, some skepticism towards the quality of the assignments is warranted. In practice, the heatmap may be overwhelmingly large if there are too many reference-derived markers. To resolve this, we can prune the set of markers to focus on the most interesting genes based on their test expression profiles. Figure 4.5 is limited to the top genes with the strongest evidence for upregulation in our test dataset using the assigned labels; such genes are effectively markers for beta cells in both the reference and test datasets. As a diagnostic plot, this is much more amenable to quick inspection to check that the expected genes are present. # Taking the first 20 reference markers that are the top empirical markers.
library(scran) empirical.markers &lt;- findMarkers(sceG, sceG$labels, direction=&quot;up&quot;) m &lt;- match(beta.markers, rownames(empirical.markers$beta)) m &lt;- beta.markers[rank(m) &lt;= 20] library(scater) plotHeatmap(sceG, order_columns_by=&quot;labels&quot;, features=m) Figure 4.5: Heatmap of log-expression values in the Grun dataset for all marker genes upregulated in beta cells in the Muraro reference dataset, pruned to those that are also upregulated in the assigned cells in the Grun dataset. Assigned labels for each cell are shown at the top of the plot. It is straightforward to repeat this process for all labels by wrapping this code in a loop, as shown below in Figure 4.6. Note that plotHeatmap() is not the only function that can be used for this visualization; we could also use plotDots() to create a Seurat-style dot plot, or we could use other heatmap plotting functions such as dittoHeatmap() from dittoSeq. collected &lt;- list() for (lab in unique(pred.grun$labels)) { lab.markers &lt;- unique(unlist(all.markers[[lab]])) m &lt;- match(lab.markers, rownames(empirical.markers[[lab]])) m &lt;- lab.markers[rank(m) &lt;= 20] collected[[lab]] &lt;- plotHeatmap(sceG, silent=TRUE, order_columns_by=&quot;labels&quot;, main=lab, features=m)[[4]] } do.call(gridExtra::grid.arrange, collected) Figure 4.6: Heatmaps of log-expression values in the Grun dataset for all marker genes upregulated in each label in the Muraro reference dataset. Assigned labels for each cell are shown at the top of each plot. In general, the heatmap provides a more interpretable diagnostic visualization than the plots of scores and deltas. However, it does require more effort to inspect and may not be feasible for large numbers of labels. It is also difficult to use a heatmap to determine the correctness of assignment for closely related labels. 
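The match()/rank() idiom used in this section to pick the reference markers that rank highest in the test dataset can be easier to follow on a toy example. The gene names and their ordering below are purely illustrative, and every reference marker is assumed to be present in the test ranking (a marker absent from the ranking would give an NA from match() and would need to be handled separately).

```r
# Reference-derived markers for a label, in no particular order.
ref.markers <- c("INS", "GCG", "IAPP", "MAFA", "SST")

# Genes in the test dataset, ordered from strongest to weakest evidence
# for upregulation (standing in for the row names of a findMarkers() result).
test.ranking <- c("IAPP", "TTR", "INS", "PPY", "MAFA", "GCG", "SST", "KRT19")

# Position of each reference marker within the test ranking.
m <- match(ref.markers, test.ranking)

# rank() converts positions into ranks among the reference markers only,
# so 'rank(m) <= 2' keeps the two reference markers that rank best in the test.
top <- ref.markers[rank(m) <= 2]
top
```

The result keeps the order of the reference marker vector rather than the order of the test ranking, which is worth remembering if the row order of the heatmap matters.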
4.5 Comparing to unsupervised clustering It can also be instructive to compare the assigned labels to the groupings generated from unsupervised clustering algorithms. The assumption is that the differences between reference labels are also the dominant factor of variation in the test dataset; this implies that we should expect strong agreement between the clusters and the assigned labels. To demonstrate, we’ll use the sceG from Chapter 8 where clusters have been generated using a graph-based method (Xu and Su 2015). View set-up code (Chapter 8) #--- loading-muraro ---# library(scRNAseq) sceM &lt;- MuraroPancreasData() sceM &lt;- sceM[,!is.na(sceM$label) &amp; sceM$label!=&quot;unclear&quot;] #--- normalize-muraro ---# library(scater) sceM &lt;- logNormCounts(sceM) #--- loading-grun ---# sceG &lt;- GrunPancreasData() sceG &lt;- addPerCellQC(sceG) qc &lt;- quickPerCellQC(colData(sceG), percent_subsets=&quot;altexps_ERCC_percent&quot;, batch=sceG$donor, subset=sceG$donor %in% c(&quot;D17&quot;, &quot;D7&quot;, &quot;D2&quot;)) sceG &lt;- sceG[,!qc$discard] #--- normalize-grun ---# sceG &lt;- logNormCounts(sceG) #--- annotation ---# library(SingleR) pred.grun &lt;- SingleR(test=sceG, ref=sceM, labels=sceM$label, de.method=&quot;wilcox&quot;) #--- clustering ---# library(scran) decG &lt;- modelGeneVarWithSpikes(sceG, &quot;ERCC&quot;) set.seed(1000100) sceG &lt;- denoisePCA(sceG, decG) library(bluster) sceG$cluster &lt;- clusterRows(reducedDim(sceG), NNGraphParam(k=5)) We compare these clusters to the labels generated by SingleR. Any similarity can be quantified with the adjusted Rand index (ARI) with pairwiseRand() from the bluster package. Large ARIs indicate that the two partitionings are in agreement, though an acceptable definition of “large” is difficult to gauge; experience suggests that a reasonable level of consistency is achieved at ARIs above 0.5.
library(bluster) pairwiseRand(sceG$cluster, pred.grun$labels, mode=&quot;index&quot;) ## [1] 0.3922 In practice, it is more informative to examine the distribution of cells across each cluster/label combination. Figure 4.7 shows that most clusters are nested within labels, a difference in resolution that is likely responsible for reducing the ARI. Clusters containing multiple labels are particularly interesting for diagnostic purposes, as this suggests that the differences between labels are not strong enough to drive formation of distinct clusters in the test. tab &lt;- table(cluster=sceG$cluster, label=pred.grun$labels) pheatmap::pheatmap(log10(tab+10)) # using a larger pseudo-count for smoothing. Figure 4.7: Heatmap of the log-transformed number of cells in each combination of label (column) and cluster (row) in the Grun dataset. The underlying assumption is somewhat reasonable in most scenarios where the labels relate to cell type identity. However, disagreements between the clusters and labels should not be cause for much concern. The whole point of unsupervised clustering is to identify novel variation that, by definition, is not in the reference. It is entirely possible for the clustering and labels to be different without compromising the validity or utility of either; the former captures new heterogeneity while the latter facilitates interpretation in the context of existing knowledge.
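To see why clusters nested within labels pull the ARI down even when no cluster mixes labels, it may help to compute the index by hand on toy data. The function adjusted_rand() below is a from-scratch sketch of the standard ARI formula, not the implementation used by pairwiseRand(), and the labels and clusters are invented.

```r
# Hypothetical helper: adjusted Rand index from the contingency table.
adjusted_rand <- function(x, y) {
    tab <- table(x, y)
    npairs <- function(n) n * (n - 1) / 2  # number of pairs among n items
    both <- sum(npairs(tab))               # pairs together in both partitions
    in.x <- sum(npairs(rowSums(tab)))      # pairs together in x
    in.y <- sum(npairs(colSums(tab)))      # pairs together in y
    expected <- in.x * in.y / npairs(sum(tab))
    (both - expected) / ((in.x + in.y) / 2 - expected)
}

labels <- rep(c("acinar", "duct", "beta"), each = 40)

# Clusters nested within labels: each label is split into two clusters.
nested <- paste0(labels, ".", rep(1:2, length.out = length(labels)))

adjusted_rand(labels, labels)  # identical partitions
adjusted_rand(nested, labels)  # perfectly nested, but finer resolution
```

Here every cluster contains cells from a single label, yet the second ARI is well below that of the identical partitions, simply because the clustering has a finer resolution than the labels.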
Session information View session info R version 4.1.0 (2021-05-18) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] bluster_1.2.0 scran_1.20.0 [3] scater_1.20.0 ggplot2_3.3.3 [5] scuttle_1.2.0 SingleCellExperiment_1.14.1 [7] SingleR_1.6.1 SummarizedExperiment_1.22.0 [9] Biobase_2.52.0 GenomicRanges_1.44.0 [11] GenomeInfoDb_1.28.0 IRanges_2.26.0 [13] MatrixGenerics_1.4.0 matrixStats_0.58.0 [15] S4Vectors_0.30.0 BiocGenerics_0.38.0 [17] BiocStyle_2.20.0 rebook_1.2.0 loaded via a namespace (and not attached): [1] bitops_1.0-7 filelock_1.0.2 [3] RColorBrewer_1.1-2 tools_4.1.0 [5] bslib_0.2.5.1 utf8_1.2.1 [7] R6_2.5.0 irlba_2.3.3 [9] vipor_0.4.5 DBI_1.1.1 [11] colorspace_2.0-1 withr_2.4.2 [13] tidyselect_1.1.1 gridExtra_2.3 [15] compiler_4.1.0 graph_1.70.0 [17] BiocNeighbors_1.10.0 DelayedArray_0.18.0 [19] labeling_0.4.2 bookdown_0.22 [21] sass_0.4.0 scales_1.1.1 [23] stringr_1.4.0 digest_0.6.27 [25] rmarkdown_2.8 XVector_0.32.0 [27] pkgconfig_2.0.3 htmltools_0.5.1.1 [29] sparseMatrixStats_1.4.0 limma_3.48.0 [31] highr_0.9 rlang_0.4.11 [33] DelayedMatrixStats_1.14.0 jquerylib_0.1.4 [35] generics_0.1.0 farver_2.1.0 [37] jsonlite_1.7.2 BiocParallel_1.26.0 [39] dplyr_1.0.6 RCurl_1.98-1.3 [41] magrittr_2.0.1 BiocSingular_1.8.0 [43] GenomeInfoDbData_1.2.6 Matrix_1.3-3 [45] ggbeeswarm_0.6.0 Rcpp_1.0.6 [47] munsell_0.5.0 fansi_0.4.2 [49] viridis_0.6.1 lifecycle_1.0.0 [51] edgeR_3.34.0 stringi_1.6.2 [53] yaml_2.2.1 zlibbioc_1.38.0 [55] 
grid_4.1.0 dqrng_0.3.0 [57] crayon_1.4.1 dir.expiry_1.0.0 [59] lattice_0.20-44 beachmat_2.8.0 [61] locfit_1.5-9.4 CodeDepends_0.6.5 [63] metapod_1.0.0 knitr_1.33 [65] pillar_1.6.1 igraph_1.2.6 [67] codetools_0.2-18 ScaledMatrix_1.0.0 [69] XML_3.99-0.6 glue_1.4.2 [71] evaluate_0.14 BiocManager_1.30.15 [73] vctrs_0.3.8 gtable_0.3.0 [75] purrr_0.3.4 assertthat_0.2.1 [77] xfun_0.23 rsvd_1.0.5 [79] viridisLite_0.4.0 tibble_3.1.2 [81] pheatmap_1.0.12 beeswarm_0.3.1 [83] cluster_2.1.2 statmod_1.4.36 [85] ellipsis_0.3.2 Bibliography "],["using-multiple-references.html", "Chapter 5 Using multiple references 5.1 Overview 5.2 Using reference-specific labels 5.3 Comparing scores across references 5.4 Using harmonized labels Session info", " Chapter 5 Using multiple references 5.1 Overview In some cases, we may wish to use multiple references for annotation of a test dataset. This yields a more comprehensive set of cell types that are not covered by any individual reference, especially when differences in the resolution are considered. However, it is not trivial due to the presence of batch effects across references (from differences in technology, experimental protocol or the biological system) as well as differences in the annotation vocabulary between investigators. Several strategies are available to combine inferences from multiple references: using reference-specific labels in a combined reference; using harmonized labels in a combined reference; and combining scores across multiple references. This chapter discusses the various strengths and weaknesses of each strategy and provides some practical demonstrations of each.
Here, we will use the HPCA and Blueprint/ENCODE datasets as our references and (yet another) PBMC dataset as the test. library(TENxPBMCData) pbmc &lt;- TENxPBMCData(&quot;pbmc8k&quot;) library(celldex) hpca &lt;- HumanPrimaryCellAtlasData(ensembl=TRUE) bpe &lt;- BlueprintEncodeData(ensembl=TRUE) 5.2 Using reference-specific labels In this strategy, each label is defined in the context of its reference dataset. This means that a label - say, “B cell” - in reference dataset X is considered to be different from a “B cell” label in reference dataset Y. Use of reference-specific labels is most appropriate if there are relevant biological differences between the references; for example, if one reference is concerned with healthy tissue while the other reference considers diseased tissue, it can be helpful to distinguish between the same cell type in different biological contexts. We can easily implement this approach by combining the expression matrices together and pasting the reference name onto the corresponding character vector of labels. This modification ensures that the downstream SingleR() call will treat each label-reference combination as a distinct entity. hpca2 &lt;- hpca hpca2$label.main &lt;- paste0(&quot;HPCA.&quot;, hpca2$label.main) bpe2 &lt;- bpe bpe2$label.main &lt;- paste0(&quot;BPE.&quot;, bpe2$label.main) shared &lt;- intersect(rownames(hpca2), rownames(bpe2)) combined &lt;- cbind(hpca2[shared,], bpe2[shared,]) It is then straightforward to perform annotation with the usual methods. 
library(SingleR) com.res1 &lt;- SingleR(pbmc, ref=combined, labels=combined$label.main, assay.type.test=1) table(com.res1$labels) ## ## BPE.B-cells BPE.CD4+ T-cells BPE.CD8+ T-cells BPE.HSC ## 1179 1708 2656 20 ## BPE.Monocytes BPE.NK cells HPCA.HSC_-G-CSF HPCA.Platelets ## 2348 460 1 7 ## HPCA.T_cells ## 2 However, this strategy identifies markers by directly comparing expression values across references, meaning that the marker set is likely to contain genes responsible for uninteresting batch effects. This will increase noise during the calculation of the score in each reference, possibly leading to a loss of precision and a greater risk of technical variation dominating the classification results. The use of reference-specific labels also complicates interpretation of the results as the cell type is always qualified by its reference of origin. 5.3 Comparing scores across references 5.3.1 Combining inferences from individual references Another strategy - and the default approach implemented in SingleR() - involves performing classification separately within each reference, and then collating the results to choose the label with the highest score across references. This is a relatively expedient approach that avoids the need for explicit harmonization while also reducing exposure to reference-specific batch effects. To use this method, we simply pass multiple objects to the ref= and labels= arguments in SingleR(). The combining strategy is as follows: The function first annotates the test dataset with each reference individually in the same manner as described in Section 1.2. This step is almost equivalent to simply looping over all individual references and running SingleR() on each. For each cell, the function collects its predicted labels across all references. In doing so, it also identifies the union of markers that are upregulated in the predicted label in each reference.
The function identifies the overall best-scoring label as the final prediction for that cell. This step involves a recomputation of the scores across the identified marker subset to ensure that these scores are derived from the same set of genes (and are thus comparable across references). The function will then return a DataFrame of combined results for each cell in the test dataset, including the overall label and the reference from which it was assigned. com.res2 &lt;- SingleR(test = pbmc, assay.type.test=1, ref = list(BPE=bpe, HPCA=hpca), labels = list(bpe$label.main, hpca$label.main)) # Check the final label from the combined assignment. table(com.res2$labels) ## ## B-cells B_cell CD4+ T-cells CD8+ T-cells ## 1170 14 1450 2936 ## GMP HSC Monocyte Monocytes ## 1 22 753 1560 ## NK cells NK_cell Platelets Pre-B_cell_CD34- ## 372 10 9 16 ## T_cells ## 68 # Check the &#39;winning&#39; reference for each cell. table(com.res2$reference) ## ## 1 2 ## 7510 871 The main appeal of this approach lies in the fact that it is based on the results of annotation with individual references. This avoids batch effects from comparing expression values across references; it reduces the need for any coordination in the label scheme between references; and simultaneously provides the per-reference annotations in the results. The last feature is particularly useful as it allows for more detailed diagnostics, troubleshooting and further analysis. 
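The final step of the combining strategy described above can be sketched with toy values. Everything below is hypothetical: the per-reference labels and the recomputed scores are invented, and the scores are simply assumed to have been recomputed on a common marker set so that they are comparable across references. It is not the code that SingleR() runs internally.

```r
# Per-reference predictions for 4 cells (labels are made up).
pred <- list(
    BPE = c("B-cells", "Monocytes", "CD8+ T-cells", "NK cells"),
    HPCA = c("B_cell", "Monocyte", "T_cells", "NK_cell")
)

# Recomputed score of each cell's predicted label in each reference,
# assumed to be derived from the same set of genes.
recomputed <- cbind(BPE = c(0.62, 0.55, 0.48, 0.40),
                    HPCA = c(0.58, 0.60, 0.41, 0.44))

# For each cell, the reference with the highest recomputed score wins.
winner <- max.col(recomputed, ties.method = "first")
combined <- data.frame(
    labels = vapply(seq_along(winner), function(i) pred[[winner[i]]][i], ""),
    reference = winner
)
combined
```

The reference column plays the same role as the winning reference index reported in the combined results, while the per-reference predictions remain available alongside it.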
head(com.res2$orig.results$BPE$labels) ## [1] &quot;B-cells&quot; &quot;Monocytes&quot; &quot;CD8+ T-cells&quot; &quot;CD8+ T-cells&quot; &quot;Monocytes&quot; ## [6] &quot;Monocytes&quot; head(com.res2$orig.results$HPCA$labels) ## [1] &quot;B_cell&quot; &quot;Monocyte&quot; &quot;T_cells&quot; &quot;T_cells&quot; &quot;Monocyte&quot; &quot;Monocyte&quot; The main downside is that it is somewhat suboptimal if there are many labels that are unique to one reference, as markers are not identified with the aim of distinguishing a label in one reference from another label in another reference. The continued lack of consistency in the labels across references also complicates interpretation of the results, though we can overcome this by using harmonized labels as described below. 5.3.2 Combined diagnostics All of the diagnostic plots in SingleR will naturally operate on these combined results. For example, we can create a heatmap of the scores in all of the individual references as well as for the recomputed scores in the combined results (Figure 5.1). Note that scores are only recomputed for the labels predicted in the individual references, so all labels outside of those are simply set to NA - hence the swathes of grey. plotScoreHeatmap(com.res2) Figure 5.1: Heatmaps of assignment scores for each cell in the PBMC test dataset after being assigned to the Blueprint/ENCODE and Human Primary Cell Atlas reference datasets. One heatmap is shown for the recomputed scores and the scores from each individual reference. The annotation at the top of each heatmap represents the final combined prediction for each cell. The deltas for each individual reference can also be plotted with plotDeltaDistribution() (Figure 5.2). No deltas are shown for the recomputed scores as the assumption described in Section 4.3 may not be applicable across the predicted labels from the individual references. 
For example, if all individual references suggest the same cell type with similar recomputed scores, any delta would be low even though the assignment is highly confident. plotDeltaDistribution(com.res2) Figure 5.2: Distribution of the deltas across cells in the PBMC test dataset for each label in the Blueprint/ENCODE and Human Primary Cell Atlas reference datasets. Each point represents a cell that was assigned to that label in the combined results, colored by whether it was pruned or not in the corresponding individual reference. We can similarly extract marker genes to use in heatmaps as described in Section 4.4. As annotation was performed against each individual reference, we can simply extract the marker genes from the nested DataFrames as shown in Figure 5.3. hpca.markers &lt;- metadata(com.res2$orig.results$HPCA)$de.genes bpe.markers &lt;- metadata(com.res2$orig.results$BPE)$de.genes # Concatenate both marker lists before flattening so that neither is dropped. mono.markers &lt;- unique(unlist(c(hpca.markers$Monocyte, bpe.markers$Monocytes))) library(scater) plotHeatmap(logNormCounts(pbmc), order_columns_by=list(I(com.res2$labels)), features=mono.markers) Figure 5.3: Heatmap of log-expression values in the PBMC dataset for all marker genes upregulated in monocytes in the Blueprint/ENCODE and Human Primary Cell Atlas reference datasets. Combined labels for each cell are shown at the top. 5.4 Using harmonized labels 5.4.1 Sharing information during marker detection One of the major problems with using multiple references is the presence of study-specific nomenclature. For example, the concept of a B cell may be annotated as B cells in one reference, B_cells in another reference, and then B and B-cell and so on in other references. We can overcome this by using harmonized labels where the same cell type is assigned the same label across references, simplifying interpretation and ensuring that irrelevant discrepancies in labelling do not interfere with downstream analysis. 
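The idea of harmonization can be sketched with a plain lookup table in base R. This is only an illustration under stated assumptions: the label vectors, the harmonize.map table and the harmonize() helper are all hypothetical and are not part of SingleR or celldex.

```r
# Hypothetical author-provided labels from two references (assumed for illustration).
labels.a <- c('B cells', 'Monocytes', 'B cells', 'NK cells')
labels.b <- c('B_cell', 'Monocyte', 'NK_cell', 'B-cell')

# Hypothetical lookup table: names are study-specific labels,
# values are the shared vocabulary that we want to end up with.
harmonize.map <- c(
    'B cells'='B cell', 'B_cell'='B cell', 'B-cell'='B cell',
    'Monocytes'='monocyte', 'Monocyte'='monocyte',
    'NK cells'='NK cell', 'NK_cell'='NK cell'
)

# Hypothetical helper: translate labels via the table, keeping the
# original label when no mapping exists so that nothing is silently lost.
harmonize <- function(labels, map) {
    out <- unname(map[labels])
    unmapped <- is.na(out)
    out[unmapped] <- labels[unmapped]
    out
}

harmonized.a <- harmonize(labels.a, harmonize.map)
harmonized.b <- harmonize(labels.b, harmonize.map)

# Both references now share one vocabulary and can be tabulated together.
table(c(harmonized.a, harmonized.b))
```

Keeping unmapped labels as they are, rather than replacing them with NA, is a deliberate choice in this sketch: it makes any gaps in the lookup table visible in downstream tables.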
Many of the SingleR reference datasets already have their labels mapped to the Cell Ontology, which provides a standard vocabulary to refer to the same cell type across diverse datasets. We will describe the utility of Cell Ontology terms in more detail in Chapter 6; at this point, the key idea is that the same term is used for the same conceptual cell type in each reference. To simplify interpretation, we set cell.ont=\"nonna\" to remove all samples that could not be mapped to the ontology. hpca.ont &lt;- HumanPrimaryCellAtlasData(ensembl=TRUE, cell.ont=&quot;nonna&quot;) bpe.ont &lt;- BlueprintEncodeData(ensembl=TRUE, cell.ont=&quot;nonna&quot;) # Using the same sets of genes: shared &lt;- intersect(rownames(hpca.ont), rownames(bpe.ont)) hpca.ont &lt;- hpca.ont[shared,] bpe.ont &lt;- bpe.ont[shared,] # Showing the 10 least frequent terms: head(sort(table(hpca.ont$label.ont)), 10) ## ## CL:0002259 CL:0000017 CL:0000049 CL:0000050 CL:0000084 CL:0000127 CL:0000557 ## 1 2 2 2 2 2 2 ## CL:0000798 CL:0000816 CL:0000836 ## 2 2 2 head(sort(table(bpe.ont$label.ont)), 10) ## ## CL:0000451 CL:0000771 CL:0000787 CL:0000815 CL:0000904 CL:0000905 CL:0000907 ## 1 1 1 1 1 1 1 ## CL:0000913 CL:0000972 CL:0000127 ## 1 1 2 The simplest way to take advantage of the standardization in terminology is to use label.ont in place of label.main in the previous section’s SingleR() call. This yields annotations that follow the same vocabulary regardless of the reference used for assignment. 
com.res3a &lt;- SingleR(test = pbmc, assay.type.test=1, ref = list(BPE=bpe.ont, HPCA=hpca.ont), labels = list(bpe.ont$label.ont, hpca.ont$label.ont)) table(Label=com.res3a$labels, Reference=com.res3a$reference) ## Reference ## Label 1 2 ## CL:0000037 2 0 ## CL:0000050 6 0 ## CL:0000051 6 0 ## CL:0000233 0 3 ## CL:0000236 0 70 ## CL:0000556 7 0 ## CL:0000557 4 1 ## CL:0000576 1520 548 ## CL:0000623 304 10 ## CL:0000624 734 34 ## CL:0000625 591 89 ## CL:0000786 2 0 ## CL:0000787 270 3 ## CL:0000788 728 12 ## CL:0000798 0 2 ## CL:0000815 78 0 ## CL:0000816 0 21 ## CL:0000837 5 0 ## CL:0000895 0 399 ## CL:0000904 126 867 ## CL:0000905 137 231 ## CL:0000907 729 0 ## CL:0000913 479 5 ## CL:0000955 0 13 ## CL:0000972 101 0 ## CL:0001054 0 244 A more advanced approach is to share information across references during the marker detection stage. This is done by favoring genes that exhibit upregulation consistently in multiple references, which increases the likelihood that those markers will generalize to other datasets. For classic marker detection, we achieve this by calling getClassicMarkers() to obtain markers for use in SingleR(); the same effect can be achieved for test-based methods in scran functions by setting block=. We then use these improved markers by passing them to genes= as described in Section 3.3. In this case, we specify com.markers twice in a list to indicate that we are using them for both of our references. 
com.markers &lt;- getClassicMarkers( ref = list(BPE=bpe.ont, HPCA=hpca.ont), labels = list(bpe.ont$label.ont, hpca.ont$label.ont)) com.res3b &lt;- SingleR(test = pbmc, assay.type.test=1, ref = list(BPE=bpe.ont, HPCA=hpca.ont), labels = list(bpe.ont$label.ont, hpca.ont$label.ont), genes = list(com.markers, com.markers)) table(Label=com.res3b$labels, Reference=com.res3b$reference) ## Reference ## Label 1 2 ## CL:0000037 4 0 ## CL:0000050 5 0 ## CL:0000051 8 0 ## CL:0000233 0 2 ## CL:0000236 0 106 ## CL:0000556 8 0 ## CL:0000557 1 2 ## CL:0000576 1435 669 ## CL:0000623 306 20 ## CL:0000624 605 90 ## CL:0000625 424 171 ## CL:0000786 3 0 ## CL:0000787 225 2 ## CL:0000788 719 30 ## CL:0000798 0 2 ## CL:0000815 111 0 ## CL:0000816 0 36 ## CL:0000837 5 0 ## CL:0000895 0 410 ## CL:0000904 55 986 ## CL:0000905 108 272 ## CL:0000907 728 0 ## CL:0000913 503 24 ## CL:0000955 0 10 ## CL:0000972 91 0 ## CL:0001054 0 205 It is worth noting that, in the above code, the DE genes are still identified within each reference and then the statistics are merged across references to identify the top markers. This ensures that we do not directly compare expression values across references, which reduces the susceptibility of marker detection to batch effects. The most obvious problem with this approach is that it assumes that harmonized labels are available. This is usually not true and requires some manual mapping of the author-provided labels to a common vocabulary. The mapping process also runs the risk of discarding relevant information about the biological status (e.g., activation status, disease condition) if there is no obvious counterpart for that state in the ontology. 5.4.2 Manual label harmonization The matchReferences() function provides a simple approach for label harmonization between two references. 
Each reference is used to annotate the other and the probability of mutual assignment between each pair of labels is computed, i.e., for each pair of labels, what is the probability that a cell with one label is assigned the other and vice versa? Probabilities close to 1 in Figure 5.4 indicate there is a 1:1 relation between that pair of labels; on the other hand, an all-zero probability vector indicates that a label is unique to a particular reference. library(SingleR) bp.se &lt;- BlueprintEncodeData() hpca.se &lt;- HumanPrimaryCellAtlasData() matched &lt;- matchReferences(bp.se, hpca.se, bp.se$label.main, hpca.se$label.main) pheatmap::pheatmap(matched, col=viridis::plasma(100)) Figure 5.4: Heatmap of mutual assignment probabilities between the Blueprint/ENCODE reference dataset (labels in rows) and the Human primary cell atlas reference (labels in columns). This function can be used to guide harmonization to enforce a consistent vocabulary between two sets of labels. However, some manual intervention is still required in this process given the ambiguities posed by differences in biological systems and technologies. In the example above, neurons are considered to be unique to each reference while smooth muscle cells in the HPCA data are incorrectly matched to fibroblasts in the Blueprint/ENCODE data. CD4+ and CD8+ T cells are also both assigned to “T cells”, so some decision about the acceptable resolution of the harmonized labels is required here. As an aside, we can also use this function to identify the matching clusters between two independent scRNA-seq analyses. This involves substituting the cluster assignments as proxies for the labels, allowing us to match up clusters and integrate conclusions from multiple datasets without the difficulties of batch correction and reclustering. 
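To make the notion of mutual assignment more concrete, the following base R sketch computes such a matrix from toy labels. It assumes that the mutual probability for a pair of labels is the product of the two conditional assignment probabilities (one in each direction); this is an illustrative assumption and the actual computation inside matchReferences() may differ. All label vectors here are made up.

```r
# Hypothetical true labels in reference A, and the labels assigned to
# those same samples when annotated against reference B.
truth.a <- c('T', 'T', 'T', 'B', 'B', 'Neuron')
assigned.a <- c('T_cell', 'T_cell', 'T_cell', 'B_cell', 'B_cell', 'B_cell')

# The reverse direction: samples of reference B annotated against reference A.
truth.b <- c('T_cell', 'T_cell', 'B_cell', 'B_cell', 'B_cell')
assigned.b <- c('T', 'T', 'B', 'B', 'T')

labels.a <- sort(unique(truth.a))
labels.b <- sort(unique(truth.b))

# P(assigned label in B | true label in A); one row per label in A.
a.to.b <- prop.table(table(factor(truth.a, labels.a), factor(assigned.a, labels.b)), 1)

# P(assigned label in A | true label in B); one row per label in B.
b.to.a <- prop.table(table(factor(truth.b, labels.b), factor(assigned.b, labels.a)), 1)

# Assumed definition: product of the two directions, with A labels in rows.
mutual <- unclass(a.to.b) * t(unclass(b.to.a))
mutual
```

In this toy example, the neuron row is all-zero because no sample in reference B is ever assigned to it, which mimics a label that is unique to one reference.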
Session info View session info R version 4.1.0 (2021-05-18) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] scater_1.20.0 ggplot2_3.3.3 [3] scuttle_1.2.0 SingleR_1.6.1 [5] ensembldb_2.16.0 AnnotationFilter_1.16.0 [7] GenomicFeatures_1.44.0 AnnotationDbi_1.54.0 [9] celldex_1.2.0 TENxPBMCData_1.10.0 [11] HDF5Array_1.20.0 rhdf5_2.36.0 [13] DelayedArray_0.18.0 Matrix_1.3-3 [15] SingleCellExperiment_1.14.1 SummarizedExperiment_1.22.0 [17] Biobase_2.52.0 GenomicRanges_1.44.0 [19] GenomeInfoDb_1.28.0 IRanges_2.26.0 [21] S4Vectors_0.30.0 BiocGenerics_0.38.0 [23] MatrixGenerics_1.4.0 matrixStats_0.58.0 [25] BiocStyle_2.20.0 rebook_1.2.0 loaded via a namespace (and not attached): [1] AnnotationHub_3.0.0 BiocFileCache_2.0.0 [3] lazyeval_0.2.2 BiocParallel_1.26.0 [5] digest_0.6.27 htmltools_0.5.1.1 [7] viridis_0.6.1 fansi_0.4.2 [9] magrittr_2.0.1 memoise_2.0.0 [11] ScaledMatrix_1.0.0 Biostrings_2.60.0 [13] prettyunits_1.1.1 colorspace_2.0-1 [15] blob_1.2.1 rappdirs_0.3.3 [17] xfun_0.23 dplyr_1.0.6 [19] crayon_1.4.1 RCurl_1.98-1.3 [21] jsonlite_1.7.2 graph_1.70.0 [23] glue_1.4.2 gtable_0.3.0 [25] zlibbioc_1.38.0 XVector_0.32.0 [27] BiocSingular_1.8.0 Rhdf5lib_1.14.0 [29] scales_1.1.1 pheatmap_1.0.12 [31] DBI_1.1.1 Rcpp_1.0.6 [33] viridisLite_0.4.0 xtable_1.8-4 [35] progress_1.2.2 bit_4.0.4 [37] rsvd_1.0.5 httr_1.4.2 [39] dir.expiry_1.0.0 RColorBrewer_1.1-2 [41] ellipsis_0.3.2 pkgconfig_2.0.3 [43] XML_3.99-0.6 farver_2.1.0 [45] 
CodeDepends_0.6.5 sass_0.4.0 [47] dbplyr_2.1.1 utf8_1.2.1 [49] tidyselect_1.1.1 labeling_0.4.2 [51] rlang_0.4.11 later_1.2.0 [53] munsell_0.5.0 BiocVersion_3.13.1 [55] tools_4.1.0 cachem_1.0.5 [57] generics_0.1.0 RSQLite_2.2.7 [59] ExperimentHub_2.0.0 evaluate_0.14 [61] stringr_1.4.0 fastmap_1.1.0 [63] yaml_2.2.1 knitr_1.33 [65] bit64_4.0.5 purrr_0.3.4 [67] KEGGREST_1.32.0 sparseMatrixStats_1.4.0 [69] mime_0.10 biomaRt_2.48.0 [71] compiler_4.1.0 beeswarm_0.3.1 [73] filelock_1.0.2 curl_4.3.1 [75] png_0.1-7 interactiveDisplayBase_1.30.0 [77] tibble_3.1.2 bslib_0.2.5.1 [79] stringi_1.6.2 highr_0.9 [81] lattice_0.20-44 ProtGenerics_1.24.0 [83] vctrs_0.3.8 pillar_1.6.1 [85] lifecycle_1.0.0 rhdf5filters_1.4.0 [87] BiocManager_1.30.15 jquerylib_0.1.4 [89] BiocNeighbors_1.10.0 bitops_1.0-7 [91] irlba_2.3.3 httpuv_1.6.1 [93] rtracklayer_1.52.0 R6_2.5.0 [95] BiocIO_1.2.0 bookdown_0.22 [97] promises_1.2.0.1 gridExtra_2.3 [99] vipor_0.4.5 codetools_0.2-18 [101] assertthat_0.2.1 rjson_0.2.20 [103] withr_2.4.2 GenomicAlignments_1.28.0 [105] Rsamtools_2.8.0 GenomeInfoDbData_1.2.6 [107] hms_1.1.0 grid_4.1.0 [109] beachmat_2.8.0 rmarkdown_2.8 [111] DelayedMatrixStats_1.14.0 shiny_1.6.0 [113] ggbeeswarm_0.6.0 restfulr_0.0.13 "],["exploiting-the-cell-ontology.html", "Chapter 6 Exploiting the cell ontology 6.1 Motivation 6.2 Basic manipulation 6.3 Adjusting resolution Session information", " Chapter 6 Exploiting the cell ontology 6.1 Motivation As previously discussed in Section 5.4, SingleR maps the labels in its references to the Cell Ontology. The most obvious advantage of doing this is to provide a standardized vocabulary with which to describe cell types, thus facilitating integrated analyses with multiple references. 
However, another useful feature of the Cell Ontology is its hierarchical organization of terms, allowing us to adjust cell type annotations to the desired resolution. This represents a more dynamic alternative to the static label.main and label.fine options in each reference. 6.2 Basic manipulation We use the ontoProc package to load in the Cell Ontology. This produces an ontology_index object (from the ontologyIndex package) that we can query for various pieces of information. # TODO: wrap in utility function. library(ontoProc) bfc &lt;- BiocFileCache::BiocFileCache(ask=FALSE) path &lt;- BiocFileCache::bfcrpath(bfc, &quot;http://purl.obolibrary.org/obo/cl.obo&quot;) cl &lt;- get_ontology(path, extract_tags=&quot;everything&quot;) cl ## Ontology with 9941 terms ## ## format-version: 1.2 ## data-version: releases/2021-03-05 ## ontology: cl ## ## Properties: ## id: character ## name: character ## parents: list ## children: list ## ancestors: list ## obsolete: logical ## RO:0002100: list ## RO:0002102: list ## RO:0002103: list ## RO:0002104: list ## RO:0002120: list ## RO:0002130: list ## RO:0002292: list ## RO:0013001: list ## adjacent_to: list ## alt_id: list ## anterior_to: list ## anteriorly_connected_to: list ## attaches_to: list ## attaches_to_part_of: list ## bearer_of: list ## bounding_layer_of: list ## branching_part_of: list ## capable_of: list ## capable_of_part_of: list ## channel_for: list ## channels_from: list ## channels_into: list ## comment: character ## composed_primarily_of: list ## conduit_for: list ## connected_to: list ## connects: list ## consider: list ## contains: list ## continuous_with: list ## contributes_to_morphology_of: list ## created_by: character ## creation_date: character ## data-version: list ## dc-contributor: list ## dc-creator: list ## decreased_in_magnitude_relative_to: list ## deep_to: list ## def: character ## derives_from: list ## developmentally_induced_by: list ## developmentally_replaces: list ## develops_from: list ## 
develops_from_part_of: list ## develops_in: list ## develops_into: list ## directly_develops_from: list ## disjoint_from: list ## distal_to: list ## distally_connected_to: list ## distalmost_part_of: list ## domain: list ## dorsal_to: list ## drains: list ## ends: list ## ends_with: list ## equivalent_to_chain: list ## existence_ends_during: list ## existence_ends_during_or_before: list ## existence_ends_with: list ## existence_starts_and_ends_during: list ## existence_starts_during: list ## existence_starts_during_or_after: list ## existence_starts_with: list ## expand_assertion_to: list ## expand_expression_to: list ## extends_fibers_into: list ## filtered_through: list ## format-version: list ## has_boundary: list ## has_completed: list ## has_component: list ## has_cross_section: list ## has_developmental_contribution_from: list ## has_gene_template: list ## has_high_plasma_membrane_amount: list ## has_low_plasma_membrane_amount: list ## has_member: list ## has_muscle_antagonist: list ## has_muscle_insertion: list ## has_muscle_origin: list ## has_not_completed: list ## has_part: list ## has_potential_to_develop_into: list ## has_potential_to_developmentally_contribute_to: list ## has_quality: list ## has_relative_magnitude: list ## has_role: list ## has_skeleton: list ## holds_over_chain: list ## immediate_transformation_of: list ## immediately_deep_to: list ## immediately_preceded_by: list ## immediately_superficial_to: list ## in_anterior_side_of: list ## in_deep_part_of: list ## in_distal_side_of: list ## in_dorsal_side_of: list ## in_lateral_side_of: list ## in_left_side_of: list ## in_posterior_side_of: list ## in_proximal_side_of: list ## in_right_side_of: list ## in_taxon: list ## increased_in_magnitude_relative_to: list ## indirectly_supplies: list ## innervated_by: list ## innervates: list ## intersection_of: list ## intersects_midsagittal_plane_of: list ## inverse_of: list ## is_a: list ## is_class_level: list ## is_cyclic: list ## is_functional: 
list ## is_metadata_tag: list ## is_symmetric: list ## is_transitive: list ## lacks_part: list ## lacks_plasma_membrane_part: list ## located_in: list ## location_of: list ## lumen_of: list ## luminal_space_of: list ## namespace: list ## negatively_regulates: list ## never_in_taxon: list ## occurs_in: list ## only_in_taxon: list ## ontology: list ## output_of: list ## part_of: list ## participates_in: list ## positively_regulates: list ## posterior_to: list ## posteriorly_connected_to: list ## preaxialmost_part_of: list ## preceded_by: list ## precedes: list ## produced_by: list ## produces: list ## property_value: list ## protects: list ## proximal_to: list ## proximally_connected_to: list ## proximalmost_part_of: list ## range: list ## regulates: list ## remark: list ## replaced_by: list ## seeAlso: list ## sexually_homologous_to: list ## skeleton_of: list ## starts: list ## starts_with: list ## subdivision_of: list ## subset: list ## subsetdef: list ## superficial_to: list ## supplies: list ## surrounded_by: list ## surrounds: list ## synonym: list ## synonymtypedef: list ## transformation_of: list ## transitive_over: list ## tributary_of: list ## trunk_part_of: list ## union_of: list ## ventral_to: list ## xref: list ## Roots: ## BFO:0000003 - occurrent ## functionally_related_to - functionally related to ## CHEBI:36342 - subatomic particle ## RO:0002410 - causally related to ## BFO:0000002 - continuant ## UBERON:0001062 - anatomical entity ## RO:0002323 - mereotopologically related to ## GO:0005575 - cellular_component ## UBERON:0000000 - processual entity ## RO:0000057 - has participant ## ... 82 more The most immediate use of this object lies in mapping ontology terms to their plain-English descriptions. We can use this to translate annotations produced by SingleR() from the label.ont labels into a more interpretable form. We demonstrate this approach using celldex’s collection of mouse RNA-seq references (Aran et al. 2019). 
head(cl$name) # short name ## BFO:0000002 BFO:0000003 BFO:0000004 ## &quot;continuant&quot; &quot;occurrent&quot; &quot;independent continuant&quot; ## BFO:0000006 BFO:0000015 BFO:0000019 ## &quot;spatial region&quot; &quot;process&quot; &quot;quality&quot; head(cl$def) # longer definition ## BFO:0000002 ## &quot;\\&quot;An entity that exists in full at any time in which it exists at all, persists through time while maintaining its identity and has no temporal parts.\\&quot; []&quot; ## BFO:0000003 ## &quot;\\&quot;An entity that has temporal parts and that happens, unfolds or develops through time.\\&quot; []&quot; ## BFO:0000004 ## &quot;\\&quot;A continuant that is a bearer of quality and realizable entity entities, in which other entities inhere and which itself cannot inhere in anything.\\&quot; []&quot; ## BFO:0000006 ## NA ## BFO:0000015 ## &quot;\\&quot;An occurrent that has temporal proper parts and for some time t, p s-depends_on some material entity at t.\\&quot; []&quot; ## BFO:0000019 ## NA library(celldex) ref &lt;- MouseRNAseqData(cell.ont=&quot;nonna&quot;) translated &lt;- cl$name[ref$label.ont] head(translated) ## CL:0000136 CL:0000136 CL:0000136 CL:0000136 CL:0000136 CL:0000136 ## &quot;fat cell&quot; &quot;fat cell&quot; &quot;fat cell&quot; &quot;fat cell&quot; &quot;fat cell&quot; &quot;fat cell&quot; Another interesting application involves examining the relationship between different terms. The ontology itself is a directed acyclic graph, so we can convert it into a graph object for advanced queries using the igraph package. Each edge represents an “is a” relationship, where the child vertex is a specialized case of the concept represented by the parent vertex. # TODO: wrap in utility function. 
parents &lt;- cl$parents self &lt;- rep(names(parents), lengths(parents)) library(igraph) g &lt;- make_graph(rbind(unlist(parents), self)) g ## IGRAPH 409d7b0 DN-- 9886 15674 -- ## + attr: name (v/c) ## + edges from 409d7b0 (vertex names): ## [1] BFO:0000002 -&gt;BFO:0000004 BFO:0000141 -&gt;BFO:0000006 ## [3] BFO:0000003 -&gt;BFO:0000015 BFO:0000020 -&gt;BFO:0000019 ## [5] BFO:0000002 -&gt;BFO:0000020 BFO:0000040 -&gt;BFO:0000024 ## [7] BFO:0000040 -&gt;BFO:0000030 BFO:0000002 -&gt;BFO:0000031 ## [9] BFO:0000004 -&gt;BFO:0000040 BFO:0000004 -&gt;BFO:0000141 ## [11] CARO:0030000-&gt;CARO:0000000 CARO:0000006-&gt;CARO:0000003 ## [13] BFO:0000040 -&gt;CARO:0000006 CARO:0000000-&gt;CARO:0000006 ## [15] CARO:0000003-&gt;CARO:0001000 CARO:0001000-&gt;CARO:0001001 ## + ... omitted several edges One query involves identifying all descendents of a particular term of interest. This can be useful when searching for a cell type in the presence of variable annotation resolution; for example, a search for “epithelial cell” can be configured to pick up all child terms such as “endothelial cell” and “ependymal cell”. term &lt;- &quot;CL:0000624&quot; cl$name[term] ## CL:0000624 ## &quot;CD4-positive, alpha-beta T cell&quot; all.kids &lt;- names(subcomponent(g, term)) head(cl$name[all.kids]) ## CL:0000624 ## &quot;CD4-positive, alpha-beta T cell&quot; ## CL:0000492 ## &quot;CD4-positive helper T cell&quot; ## CL:0001051 ## &quot;CD4-positive, CXCR3-negative, CCR6-negative, alpha-beta T cell&quot; ## CL:0000791 ## &quot;mature alpha-beta T cell&quot; ## CL:0000792 ## &quot;CD4-positive, CD25-positive, alpha-beta regulatory T cell&quot; ## CL:0000793 ## &quot;CD4-positive, alpha-beta intraepithelial T cell&quot; Alternatively, we might be interested in the last common ancestor (LCA) for a set of terms. This is the furthest term - or, in some cases, multiple terms - from the root of the ontology that is also an ancestor of all of the terms of interest. 
We will use this LCA concept in the next section to adjust resolution across multiple references. terms &lt;- c(&quot;CL:0000624&quot;, &quot;CL:0000785&quot;, &quot;CL:0000623&quot;) cl$name[terms] ## CL:0000624 CL:0000785 ## &quot;CD4-positive, alpha-beta T cell&quot; &quot;mature B cell&quot; ## CL:0000623 ## &quot;natural killer cell&quot; # TODO: god, put this in a function somewhere. all.ancestors &lt;- lapply(terms, subcomponent, graph=g, mode=&quot;in&quot;) all.ancestors &lt;- lapply(all.ancestors, names) common.ancestors &lt;- Reduce(intersect, all.ancestors) ancestors.of.ancestors &lt;- lapply(common.ancestors, subcomponent, graph=g, mode=&quot;in&quot;) ancestors.of.ancestors &lt;- lapply(ancestors.of.ancestors, names) ancestors.of.ancestors &lt;- mapply(setdiff, ancestors.of.ancestors, common.ancestors) latest.common.ancestors &lt;- setdiff(common.ancestors, unlist(ancestors.of.ancestors)) cl$name[latest.common.ancestors] ## CL:0000542 ## &quot;lymphocyte&quot; 6.3 Adjusting resolution We can use the ontology graph to adjust the resolution of the reference labels by rolling up overly-specific terms to their LCA. The findCommonAncestors() utility takes a set of terms and returns a list of potential LCAs for various subsets of those terms. Users can inspect this list to identify LCAs at the desired resolution and then map their descendent terms to those LCAs. findCommonAncestors &lt;- function(..., g, remove.self=TRUE, names=NULL) { terms &lt;- list(...) if (is.null(names(terms))) { names(terms) &lt;- sprintf(&quot;set%i&quot;, seq_along(terms)) } all.terms &lt;- unique(unlist(terms)) all.ancestors &lt;- lapply(all.terms, subcomponent, graph=g, mode=&quot;in&quot;) all.ancestors &lt;- lapply(all.ancestors, names) by.ancestor &lt;- split( rep(all.terms, lengths(all.ancestors)), unlist(all.ancestors) ) # Removing ancestor nodes with the same count as its children. 
available &lt;- names(by.ancestor) for (i in available) { if (!i %in% names(by.ancestor)) { next } counts &lt;- lengths(by.ancestor) cur.ancestors &lt;- subcomponent(g, i, mode=&quot;in&quot;) cur.ancestors &lt;- setdiff(names(cur.ancestors), i) drop &lt;- cur.ancestors[counts[i]==counts[cur.ancestors]] by.ancestor &lt;- by.ancestor[!names(by.ancestor) %in% drop] } if (remove.self) { by.ancestor &lt;- by.ancestor[lengths(by.ancestor) &gt; 1L] } by.ancestor &lt;- by.ancestor[order(lengths(by.ancestor))] # most specific terms first. # Decorating the output. for (i in names(by.ancestor)) { current &lt;- by.ancestor[[i]] df &lt;- DataFrame(row.names=current) curout &lt;- list() if (!is.null(names)) { curout$name &lt;- unname(names[i]) df$name &lt;- names[current] } presence &lt;- list() for (b in names(terms)) { presence[[b]] &lt;- current %in% terms[[b]] } df &lt;- cbind(df, do.call(DataFrame, presence)) curout$descendents &lt;- df by.ancestor[[i]] &lt;- curout } by.ancestor } lca &lt;- findCommonAncestors(ref$label.ont, g=g, names=cl$name) head(lca) ## $`CL:0000081` ## $`CL:0000081`$name ## [1] &quot;blood cell&quot; ## ## $`CL:0000081`$descendents ## DataFrame with 2 rows and 2 columns ## name set1 ## &lt;character&gt; &lt;logical&gt; ## CL:0000232 erythrocyte TRUE ## CL:0000094 granulocyte TRUE ## ## ## $`CL:0000126` ## $`CL:0000126`$name ## [1] &quot;macroglial cell&quot; ## ## $`CL:0000126`$descendents ## DataFrame with 2 rows and 2 columns ## name set1 ## &lt;character&gt; &lt;logical&gt; ## CL:0000127 astrocyte TRUE ## CL:0000128 oligodendrocyte TRUE ## ## ## $`CL:0000393` ## $`CL:0000393`$name ## [1] &quot;electrically responsive cell&quot; ## ## $`CL:0000393`$descendents ## DataFrame with 2 rows and 2 columns ## name set1 ## &lt;character&gt; &lt;logical&gt; ## CL:0000540 neuron TRUE ## CL:0000746 cardiac muscle cell TRUE ## ## ## $`CL:0002320` ## $`CL:0002320`$name ## [1] &quot;connective tissue cell&quot; ## ## $`CL:0002320`$descendents ## DataFrame with 2 
rows and 2 columns ## name set1 ## &lt;character&gt; &lt;logical&gt; ## CL:0000136 fat cell TRUE ## CL:0000057 fibroblast TRUE ## ## ## $`CL:0011115` ## $`CL:0011115`$name ## [1] &quot;precursor cell&quot; ## ## $`CL:0011115`$descendents ## DataFrame with 2 rows and 2 columns ## name set1 ## &lt;character&gt; &lt;logical&gt; ## CL:0000047 neuronal stem cell TRUE ## CL:0000576 monocyte TRUE ## ## ## $`CL:0000066` ## $`CL:0000066`$name ## [1] &quot;epithelial cell&quot; ## ## $`CL:0000066`$descendents ## DataFrame with 3 rows and 2 columns ## name set1 ## &lt;character&gt; &lt;logical&gt; ## CL:0000115 endothelial cell TRUE ## CL:0000182 hepatocyte TRUE ## CL:0000065 ependymal cell TRUE We can also use this function to synchronize multiple sets of terms to the same resolution. Here, we consider the ImmGen dataset (Heng et al. 2008), which provides highly resolved annotation of immune cell types. The findCommonAncestors() function specifies the origins of the descendents for each LCA, allowing us to focus on LCAs that have representatives in both sets of terms. 
ref2 &lt;- ImmGenData(cell.ont=&quot;nonna&quot;) lca2 &lt;- findCommonAncestors(MouseRNA=ref$label.ont, ImmGen=ref2$label.ont, g=g, names=cl$name) head(lca2) ## $`CL:0000126` ## $`CL:0000126`$name ## [1] &quot;macroglial cell&quot; ## ## $`CL:0000126`$descendents ## DataFrame with 2 rows and 3 columns ## name MouseRNA ImmGen ## &lt;character&gt; &lt;logical&gt; &lt;logical&gt; ## CL:0000127 astrocyte TRUE FALSE ## CL:0000128 oligodendrocyte TRUE FALSE ## ## ## $`CL:0000393` ## $`CL:0000393`$name ## [1] &quot;electrically responsive cell&quot; ## ## $`CL:0000393`$descendents ## DataFrame with 2 rows and 3 columns ## name MouseRNA ImmGen ## &lt;character&gt; &lt;logical&gt; &lt;logical&gt; ## CL:0000540 neuron TRUE FALSE ## CL:0000746 cardiac muscle cell TRUE FALSE ## ## ## $`CL:0000623` ## $`CL:0000623`$name ## [1] &quot;natural killer cell&quot; ## ## $`CL:0000623`$descendents ## DataFrame with 2 rows and 3 columns ## name MouseRNA ImmGen ## &lt;character&gt; &lt;logical&gt; &lt;logical&gt; ## CL:0000623 natural killer cell TRUE TRUE ## CL:0002438 NK1.1-positive natur.. FALSE TRUE ## ## ## $`CL:0000813` ## $`CL:0000813`$name ## [1] &quot;memory T cell&quot; ## ## $`CL:0000813`$descendents ## DataFrame with 2 rows and 3 columns ## name MouseRNA ImmGen ## &lt;character&gt; &lt;logical&gt; &lt;logical&gt; ## CL:0000897 CD4-positive, alpha-.. FALSE TRUE ## CL:0000909 CD8-positive, alpha-.. FALSE TRUE ## ## ## $`CL:0000815` ## $`CL:0000815`$name ## [1] &quot;regulatory T cell&quot; ## ## $`CL:0000815`$descendents ## DataFrame with 2 rows and 3 columns ## name MouseRNA ImmGen ## &lt;character&gt; &lt;logical&gt; &lt;logical&gt; ## CL:0000792 CD4-positive, CD25-p.. 
FALSE TRUE ## CL:0000815 regulatory T cell FALSE TRUE ## ## ## $`CL:0000819` ## $`CL:0000819`$name ## [1] &quot;B-1 B cell&quot; ## ## $`CL:0000819`$descendents ## DataFrame with 2 rows and 3 columns ## name MouseRNA ImmGen ## &lt;character&gt; &lt;logical&gt; &lt;logical&gt; ## CL:0000820 B-1a B cell FALSE TRUE ## CL:0000821 B-1b B cell FALSE TRUE For example, we might notice that the mouse RNA-seq reference only has a single “T cell” term. To synchronize resolution across references, we would need to roll up all of ImmGen’s finely resolved subsets into that LCA as shown below. Of course, this results in some loss of precision and information; whether this is an acceptable price for simpler interpretation is a decision that is left to the user. children &lt;- lca2$`CL:0000084`$descendents children ## DataFrame with 35 rows and 3 columns ## name MouseRNA ImmGen ## &lt;character&gt; &lt;logical&gt; &lt;logical&gt; ## CL:0000084 T cell TRUE TRUE ## CL:0002427 resting double-posit.. FALSE TRUE ## CL:0000809 double-positive, alp.. FALSE TRUE ## CL:0002429 CD69-positive double.. FALSE TRUE ## CL:0000624 CD4-positive, alpha-.. FALSE TRUE ## ... ... ... ... ## CL:0002415 immature Vgamma1.1-p.. FALSE TRUE ## CL:0002411 Vgamma1.1-positive, .. FALSE TRUE ## CL:0002416 mature Vgamma1.1-pos.. FALSE TRUE ## CL:0002407 mature Vgamma2-posit.. 
FALSE TRUE ## CL:0000815 regulatory T cell FALSE TRUE # Synchronization: synced.mm &lt;- ref$label.ont synced.mm[synced.mm %in% rownames(children)] &lt;- &quot;CL:0000084&quot; synced.ig &lt;- ref2$label.ont synced.ig[synced.ig %in% rownames(children)] &lt;- &quot;CL:0000084&quot; Session information View session info R version 4.1.0 (2021-05-18) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] igraph_1.2.6 celldex_1.2.0 [3] SummarizedExperiment_1.22.0 Biobase_2.52.0 [5] GenomicRanges_1.44.0 GenomeInfoDb_1.28.0 [7] IRanges_2.26.0 S4Vectors_0.30.0 [9] BiocGenerics_0.38.0 MatrixGenerics_1.4.0 [11] matrixStats_0.58.0 ontoProc_1.14.0 [13] ontologyIndex_2.7 BiocStyle_2.20.0 [15] rebook_1.2.0 loaded via a namespace (and not attached): [1] bitops_1.0-7 bit64_4.0.5 [3] filelock_1.0.2 httr_1.4.2 [5] Rgraphviz_2.36.0 tools_4.1.0 [7] bslib_0.2.5.1 utf8_1.2.1 [9] R6_2.5.0 DT_0.18 [11] DBI_1.1.1 withr_2.4.2 [13] tidyselect_1.1.1 bit_4.0.4 [15] curl_4.3.1 compiler_4.1.0 [17] graph_1.70.0 DelayedArray_0.18.0 [19] bookdown_0.22 sass_0.4.0 [21] rappdirs_0.3.3 stringr_1.4.0 [23] digest_0.6.27 rmarkdown_2.8 [25] XVector_0.32.0 pkgconfig_2.0.3 [27] htmltools_0.5.1.1 sparseMatrixStats_1.4.0 [29] dbplyr_2.1.1 fastmap_1.1.0 [31] htmlwidgets_1.5.3 rlang_0.4.11 [33] RSQLite_2.2.7 DelayedMatrixStats_1.14.0 [35] shiny_1.6.0 jquerylib_0.1.4 [37] generics_0.1.0 paintmap_1.0 [39] jsonlite_1.7.2 dplyr_1.0.6 [41] RCurl_1.98-1.3 magrittr_2.0.1 [43] 
GenomeInfoDbData_1.2.6 Matrix_1.3-3 [45] Rcpp_1.0.6 fansi_0.4.2 [47] lifecycle_1.0.0 stringi_1.6.2 [49] yaml_2.2.1 zlibbioc_1.38.0 [51] AnnotationHub_3.0.0 BiocFileCache_2.0.0 [53] grid_4.1.0 blob_1.2.1 [55] promises_1.2.0.1 ExperimentHub_2.0.0 [57] crayon_1.4.1 lattice_0.20-44 [59] dir.expiry_1.0.0 Biostrings_2.60.0 [61] ontologyPlot_1.6 KEGGREST_1.32.0 [63] CodeDepends_0.6.5 knitr_1.33 [65] pillar_1.6.1 codetools_0.2-18 [67] BiocVersion_3.13.1 XML_3.99-0.6 [69] glue_1.4.2 evaluate_0.14 [71] BiocManager_1.30.15 png_0.1-7 [73] vctrs_0.3.8 httpuv_1.6.1 [75] purrr_0.3.4 assertthat_0.2.1 [77] cachem_1.0.5 xfun_0.23 [79] mime_0.10 xtable_1.8-4 [81] later_1.2.0 tibble_3.1.2 [83] AnnotationDbi_1.54.0 memoise_2.0.0 [85] interactiveDisplayBase_1.30.0 ellipsis_0.3.2 Bibliography "],["advanced-options.html", "Chapter 7 Advanced options 7.1 Preconstructed indices 7.2 Parallelization 7.3 Approximate algorithms 7.4 Cluster-level annotation Session information", " Chapter 7 Advanced options 7.1 Preconstructed indices Advanced users can split the SingleR() workflow into two separate training and classification steps. This means that training (e.g., marker detection, assembling of nearest-neighbor indices) only needs to be performed once for any reference. The resulting data structure can then be re-used across multiple classifications with different test datasets, provided the gene annotation in the test dataset is identical to or a superset of the genes in the training set. To illustrate, we will consider the DICE reference dataset (Schmiedel et al. 2018) from the celldex package. 
library(celldex) dice &lt;- DatabaseImmuneCellExpressionData(ensembl=TRUE) dice ## class: SummarizedExperiment ## dim: 29914 1561 ## metadata(0): ## assays(1): logcounts ## rownames(29914): ENSG00000121410 ENSG00000268895 ... ENSG00000159840 ## ENSG00000074755 ## rowData names(0): ## colnames(1561): TPM_1 TPM_2 ... TPM_101 TPM_102 ## colData names(3): label.main label.fine label.ont table(dice$label.fine) ## ## B cells, naive Monocytes, CD14+ ## 106 106 ## Monocytes, CD16+ NK cells ## 105 105 ## T cells, CD4+, TFH T cells, CD4+, Th1 ## 104 104 ## T cells, CD4+, Th17 T cells, CD4+, Th1_17 ## 104 104 ## T cells, CD4+, Th2 T cells, CD4+, memory TREG ## 104 104 ## T cells, CD4+, naive T cells, CD4+, naive TREG ## 103 104 ## T cells, CD4+, naive, stimulated T cells, CD8+, naive ## 102 104 ## T cells, CD8+, naive, stimulated ## 102 Let’s say we want to use the DICE reference to annotate the PBMC dataset from Chapter 1. library(TENxPBMCData) sce &lt;- TENxPBMCData(&quot;pbmc3k&quot;) We use the trainSingleR() function to do all the necessary calculations that are independent of the test dataset. (Almost; see comments below about common.) This yields a list of various components that contains all identified marker genes and precomputed rank indices to be used in the score calculation. We can also turn on aggregation with aggr.ref=TRUE (see the section on pseudo-bulk aggregation) to further reduce computational work. common &lt;- intersect(rownames(sce), rownames(dice)) library(SingleR) set.seed(2000) trained &lt;- trainSingleR(dice[common,], labels=dice$label.fine, aggr.ref=TRUE) We then use the trained object to annotate our dataset of interest through the classifySingleR() function. As we can see, this yields exactly the same result as applying SingleR() directly. The advantage here is that trained can be re-used for multiple classifySingleR() calls - possibly on different datasets - without having to repeat unnecessary steps when the reference is unchanged. 
pred &lt;- classifySingleR(sce, trained, assay.type=1) table(pred$labels) ## ## B cells, naive Monocytes, CD14+ ## 344 515 ## Monocytes, CD16+ NK cells ## 187 320 ## T cells, CD4+, TFH T cells, CD4+, Th1 ## 365 222 ## T cells, CD4+, Th17 T cells, CD4+, Th1_17 ## 64 62 ## T cells, CD4+, Th2 T cells, CD4+, memory TREG ## 69 169 ## T cells, CD4+, naive T cells, CD4+, naive TREG ## 115 57 ## T cells, CD8+, naive ## 211 # Comparing to the direct approach. set.seed(2000) direct &lt;- SingleR(sce, ref=dice, labels=dice$label.fine, assay.type.test=1, aggr.ref=TRUE) identical(pred$labels, direct$labels) ## [1] TRUE The big caveat is that the universe of genes in the test dataset must be a superset of that of the reference. This is the reason behind the intersection to common genes and the subsequent subsetting of dice. Practical use of preconstructed indices is best combined with some prior information about the gene-level annotation; for example, we might know that we always use a particular version of the Ensembl gene models, so we would filter out any genes in the reference dataset that are not in our test datasets. 7.2 Parallelization Parallelization is an obvious approach to increasing annotation throughput. This is done using the framework in the BiocParallel package, which provides several options for parallelization depending on the available hardware. On POSIX-compliant systems (i.e., Linux and macOS), the simplest method is to use forking by passing MulticoreParam() to the BPPARAM= argument: library(BiocParallel) pred2a &lt;- SingleR(sce, ref=dice, assay.type.test=1, labels=dice$label.fine, BPPARAM=MulticoreParam(8)) # 8 CPUs. Alternatively, one can use separate processes with SnowParam(), which is slower but can be used on all systems - including Windows, our old nemesis. 
pred2b &lt;- SingleR(sce, ref=dice, assay.type.test=1, labels=dice$label.fine, BPPARAM=SnowParam(8)) identical(pred2a$labels, pred2b$labels) ## [1] TRUE When working on a cluster, passing BatchtoolsParam() to SingleR() allows us to seamlessly interface with various job schedulers like SLURM, LSF and so on. This permits heavy-duty parallelization across hundreds of CPUs for highly intensive jobs, though often some configuration is required - see the vignette for more details. 7.3 Approximate algorithms It is possible to sacrifice accuracy to squeeze more speed out of SingleR. The most obvious approach is to simply turn off the fine-tuning with fine.tune=FALSE, which avoids the time-consuming fine-tuning iterations. When the reference labels are well-separated, this is probably an acceptable trade-off. pred3a &lt;- SingleR(sce, ref=dice, assay.type.test=1, labels=dice$label.main, fine.tune=FALSE) table(pred3a$labels) ## ## B cells Monocytes NK cells T cells, CD4+ T cells, CD8+ ## 348 705 357 950 340 Another approximation is based on the fact that the initial score calculation is done using a nearest-neighbors search. By default, this is an exact search but we can switch to an approximate algorithm via the BNPARAM= argument. In the example below, we use the Annoy algorithm via the BiocNeighbors framework, which yields mostly similar results. (Note, though, that the Annoy method does involve a considerable amount of overhead, so for small jobs it will actually be slower than the exact search.) library(BiocNeighbors) pred3b &lt;- SingleR(sce, ref=dice, assay.type.test=1, labels=dice$label.main, fine.tune=FALSE, # for comparison with pred3a. 
BNPARAM=AnnoyParam()) table(pred3a$labels, pred3b$labels) ## ## B cells Monocytes NK cells T cells, CD4+ T cells, CD8+ ## B cells 348 0 0 0 0 ## Monocytes 0 705 0 0 0 ## NK cells 0 0 357 0 0 ## T cells, CD4+ 0 0 0 950 0 ## T cells, CD8+ 0 0 0 0 340 7.4 Cluster-level annotation The default philosophy of SingleR is to perform annotation of each individual cell in the test dataset. An alternative strategy is to perform annotation of aggregated profiles for groups or clusters of cells. To demonstrate, we will perform a quick-and-dirty clustering of our PBMC dataset with a variety of Bioconductor packages. library(scuttle) sce &lt;- logNormCounts(sce) library(scran) dec &lt;- modelGeneVarByPoisson(sce) sce &lt;- denoisePCA(sce, dec, subset.row=getTopHVGs(dec, n=5000)) library(bluster) colLabels(sce) &lt;- clusterRows(reducedDim(sce), NNGraphParam()) library(scater) set.seed(117) sce &lt;- runTSNE(sce, dimred=&quot;PCA&quot;) plotTSNE(sce, colour_by=&quot;label&quot;) By passing clusters= to SingleR(), we direct the function to compute an aggregated profile per cluster. Annotation is then performed on the cluster-level profiles rather than on the single-cell level. This has the major advantage of being much faster to compute as there are obviously fewer clusters than cells; it is also easier to interpret as it directly returns the likely cell type identity of each cluster. SingleR(sce, dice, clusters=colLabels(sce), labels=dice$label.main) ## DataFrame with 11 rows and 5 columns ## scores first.labels tuning.scores ## &lt;matrix&gt; &lt;character&gt; &lt;DataFrame&gt; ## 1 0.1534120:0.261500:0.599638:... NK cells 0.391189:0.359560 ## 2 0.2061287:0.232735:0.357030:... T cells, CD4+ 0.551037:0.503613 ## 3 0.0526260:0.271140:0.727792:... NK cells 0.727792:0.393413 ## 4 0.1419607:0.781275:0.209402:... Monocytes 0.781275:0.209402 ## 5 0.1642361:0.763198:0.206301:... Monocytes 0.763198:0.206301 ## 6 0.6402007:0.335862:0.223424:... 
B cells 0.640201:0.335862 ## 7 0.2275805:0.602347:0.211547:... Monocytes 0.602347:0.227580 ## 8 0.2136143:0.276442:0.404099:... T cells, CD4+ 0.687793:0.580251 ## 9 0.1675653:0.753343:0.260288:... Monocytes 0.753343:0.260288 ## 10 0.2470206:0.245272:0.333286:... T cells, CD4+ 0.359902:0.304589 ## 11 0.0713926:0.223101:0.117047:... Monocytes 0.223101:0.117047 ## labels pruned.labels ## &lt;character&gt; &lt;character&gt; ## 1 T cells, CD4+ T cells, CD4+ ## 2 T cells, CD4+ T cells, CD4+ ## 3 NK cells NK cells ## 4 Monocytes Monocytes ## 5 Monocytes Monocytes ## 6 B cells B cells ## 7 Monocytes Monocytes ## 8 T cells, CD4+ T cells, CD4+ ## 9 Monocytes Monocytes ## 10 T cells, CD4+ T cells, CD4+ ## 11 Monocytes NA This approach assumes that each cluster in the test dataset corresponds to exactly one reference label. If a cluster actually contains a mixture of multiple labels, this will not be reflected in its lone assigned label. (We note that it would be very difficult to determine the composition of the mixture from the SingleR() scores.) Indeed, there is no guarantee that the clustering is driven by the same factors that distinguish the reference labels, decreasing the reliability of the annotations when novel heterogeneity is present in the test dataset. The default per-cell strategy is safer and provides more information about the ambiguity of the annotations, which is important for closely related labels where a close correspondence between clusters and labels cannot be expected. 
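A practical way to gauge whether the one-label-per-cluster assumption holds is to cross-tabulate per-cell labels against the cluster assignments. The sketch below is only an illustration on toy vectors: cell.labels and clusters are hypothetical stand-ins for the labels of a per-cell SingleR() result and for colLabels(sce), and purity is an ad hoc summary rather than anything computed by SingleR itself.

```r
# Toy stand-ins (hypothetical): in practice these would be the per-cell
# labels from SingleR() and the cluster assignments, e.g., colLabels(sce).
cell.labels <- c('B', 'B', 'B', 'T', 'T', 'NK', 'T', 'T', 'NK', 'NK')
clusters    <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3)

# Cross-tabulate and convert each cluster (row) to label proportions.
tab <- table(cluster=clusters, label=cell.labels)
props <- prop.table(tab, margin=1)

# Purity: fraction of cells carrying the most common label in each cluster.
purity <- apply(props, 1, max)
majority <- colnames(props)[apply(props, 1, which.max)]

cluster.summary <- data.frame(cluster=rownames(props),
    majority=majority, purity=unname(purity))
print(cluster.summary)
```

Clusters with low purity are candidates for mixtures of labels, in which case the per-cell annotations are probably the more informative summary.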
Session information View session info R version 4.1.0 (2021-05-18) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] scater_1.20.0 ggplot2_3.3.3 [3] bluster_1.2.0 scran_1.20.0 [5] scuttle_1.2.0 BiocNeighbors_1.10.0 [7] BiocParallel_1.26.0 SingleR_1.6.1 [9] TENxPBMCData_1.10.0 HDF5Array_1.20.0 [11] rhdf5_2.36.0 DelayedArray_0.18.0 [13] Matrix_1.3-3 SingleCellExperiment_1.14.1 [15] ensembldb_2.16.0 AnnotationFilter_1.16.0 [17] GenomicFeatures_1.44.0 AnnotationDbi_1.54.0 [19] celldex_1.2.0 SummarizedExperiment_1.22.0 [21] Biobase_2.52.0 GenomicRanges_1.44.0 [23] GenomeInfoDb_1.28.0 IRanges_2.26.0 [25] S4Vectors_0.30.0 BiocGenerics_0.38.0 [27] MatrixGenerics_1.4.0 matrixStats_0.58.0 [29] BiocStyle_2.20.0 rebook_1.2.0 loaded via a namespace (and not attached): [1] snow_0.4-3 AnnotationHub_3.0.0 [3] BiocFileCache_2.0.0 igraph_1.2.6 [5] lazyeval_0.2.2 digest_0.6.27 [7] htmltools_0.5.1.1 viridis_0.6.1 [9] fansi_0.4.2 magrittr_2.0.1 [11] memoise_2.0.0 ScaledMatrix_1.0.0 [13] cluster_2.1.2 limma_3.48.0 [15] Biostrings_2.60.0 prettyunits_1.1.1 [17] colorspace_2.0-1 blob_1.2.1 [19] rappdirs_0.3.3 xfun_0.23 [21] dplyr_1.0.6 crayon_1.4.1 [23] RCurl_1.98-1.3 jsonlite_1.7.2 [25] graph_1.70.0 glue_1.4.2 [27] gtable_0.3.0 zlibbioc_1.38.0 [29] XVector_0.32.0 BiocSingular_1.8.0 [31] Rhdf5lib_1.14.0 scales_1.1.1 [33] DBI_1.1.1 edgeR_3.34.0 [35] Rcpp_1.0.6 viridisLite_0.4.0 [37] xtable_1.8-4 progress_1.2.2 [39] dqrng_0.3.0 bit_4.0.4 [41] rsvd_1.0.5 
metapod_1.0.0 [43] httr_1.4.2 dir.expiry_1.0.0 [45] ellipsis_0.3.2 farver_2.1.0 [47] pkgconfig_2.0.3 XML_3.99-0.6 [49] CodeDepends_0.6.5 sass_0.4.0 [51] dbplyr_2.1.1 locfit_1.5-9.4 [53] utf8_1.2.1 labeling_0.4.2 [55] tidyselect_1.1.1 rlang_0.4.11 [57] later_1.2.0 munsell_0.5.0 [59] BiocVersion_3.13.1 tools_4.1.0 [61] cachem_1.0.5 generics_0.1.0 [63] RSQLite_2.2.7 ExperimentHub_2.0.0 [65] evaluate_0.14 stringr_1.4.0 [67] fastmap_1.1.0 yaml_2.2.1 [69] knitr_1.33 bit64_4.0.5 [71] purrr_0.3.4 KEGGREST_1.32.0 [73] sparseMatrixStats_1.4.0 mime_0.10 [75] biomaRt_2.48.0 compiler_4.1.0 [77] beeswarm_0.3.1 filelock_1.0.2 [79] curl_4.3.1 png_0.1-7 [81] interactiveDisplayBase_1.30.0 tibble_3.1.2 [83] statmod_1.4.36 bslib_0.2.5.1 [85] stringi_1.6.2 highr_0.9 [87] lattice_0.20-44 ProtGenerics_1.24.0 [89] vctrs_0.3.8 pillar_1.6.1 [91] lifecycle_1.0.0 rhdf5filters_1.4.0 [93] BiocManager_1.30.15 jquerylib_0.1.4 [95] cowplot_1.1.1 bitops_1.0-7 [97] irlba_2.3.3 httpuv_1.6.1 [99] rtracklayer_1.52.0 R6_2.5.0 [101] BiocIO_1.2.0 bookdown_0.22 [103] promises_1.2.0.1 gridExtra_2.3 [105] vipor_0.4.5 codetools_0.2-18 [107] assertthat_0.2.1 rjson_0.2.20 [109] withr_2.4.2 GenomicAlignments_1.28.0 [111] Rsamtools_2.8.0 GenomeInfoDbData_1.2.6 [113] hms_1.1.0 grid_4.1.0 [115] beachmat_2.8.0 rmarkdown_2.8 [117] DelayedMatrixStats_1.14.0 Rtsne_0.15 [119] shiny_1.6.0 ggbeeswarm_0.6.0 [121] restfulr_0.0.13 Bibliography "],["pancreas-case-study.html", "Chapter 8 Cross-annotating human pancreas 8.1 Loading the data 8.2 Applying the annotation 8.3 Diagnostics 8.4 Comparison to clusters Session information", " Chapter 8 Cross-annotating human pancreas 8.1 Loading the data We load the Muraro et al. 
(2016) dataset as our reference, removing unlabelled cells or cells without a clear label. library(scRNAseq) sceM &lt;- MuraroPancreasData() sceM &lt;- sceM[,!is.na(sceM$label) &amp; sceM$label!=&quot;unclear&quot;] We compute log-expression values for use in marker detection inside SingleR(). library(scater) sceM &lt;- logNormCounts(sceM) We examine the distribution of labels in this reference. table(sceM$label) ## ## acinar alpha beta delta duct endothelial ## 219 812 448 193 245 21 ## epsilon mesenchymal pp ## 3 80 101 We load the Grun et al. (2016) dataset as our test, applying some basic quality control to remove low-quality cells in some of the batches (see here for details). sceG &lt;- GrunPancreasData() sceG &lt;- addPerCellQC(sceG) qc &lt;- quickPerCellQC(colData(sceG), percent_subsets=&quot;altexps_ERCC_percent&quot;, batch=sceG$donor, subset=sceG$donor %in% c(&quot;D17&quot;, &quot;D7&quot;, &quot;D2&quot;)) sceG &lt;- sceG[,!qc$discard] Technically speaking, the test dataset does not need log-expression values but we compute them anyway for convenience. sceG &lt;- logNormCounts(sceG) 8.2 Applying the annotation We apply SingleR() with Wilcoxon rank sum test-based marker detection to annotate the Grun dataset with the Muraro labels. library(SingleR) pred.grun &lt;- SingleR(test=sceG, ref=sceM, labels=sceM$label, de.method=&quot;wilcox&quot;) We examine the distribution of predicted labels: table(pred.grun$labels) ## ## acinar alpha beta delta duct endothelial ## 277 203 181 50 306 5 ## epsilon mesenchymal pp ## 1 22 19 We can also examine the number of discarded cells for each label: table(Label=pred.grun$labels, Lost=is.na(pred.grun$pruned.labels)) ## Lost ## Label FALSE TRUE ## acinar 251 26 ## alpha 198 5 ## beta 180 1 ## delta 49 1 ## duct 301 5 ## endothelial 4 1 ## epsilon 1 0 ## mesenchymal 22 0 ## pp 17 2 8.3 Diagnostics We visualize the assignment scores for each label in Figure 8.1. 
plotScoreHeatmap(pred.grun) Figure 8.1: Heatmap of the (normalized) assignment scores for each cell (column) in the Grun test dataset with respect to each label (row) in the Muraro reference dataset. The final assignment for each cell is shown in the annotation bar at the top. The delta for each cell is visualized in Figure 8.2. plotDeltaDistribution(pred.grun) Figure 8.2: Distributions of the deltas for each cell in the Grun dataset assigned to each label in the Muraro dataset. Each cell is represented by a point; low-quality assignments that were pruned out are colored in orange. Finally, we visualize the heatmaps of the marker genes for each label in Figure 8.3. library(scater) collected &lt;- list() all.markers &lt;- metadata(pred.grun)$de.genes sceG$labels &lt;- pred.grun$labels for (lab in unique(pred.grun$labels)) { collected[[lab]] &lt;- plotHeatmap(sceG, silent=TRUE, order_columns_by=&quot;labels&quot;, main=lab, features=unique(unlist(all.markers[[lab]])))[[4]] } do.call(gridExtra::grid.arrange, collected) Figure 8.3: Heatmaps of log-expression values in the Grun dataset for all marker genes upregulated in each label in the Muraro reference dataset. Assigned labels for each cell are shown at the top of each plot. 8.4 Comparison to clusters For comparison, we will perform a quick unsupervised analysis of the Grun dataset. We model the variances using the spike-in data and we perform graph-based clustering (increasing the resolution by lowering k to 5). library(scran) decG &lt;- modelGeneVarWithSpikes(sceG, &quot;ERCC&quot;) set.seed(1000100) sceG &lt;- denoisePCA(sceG, decG) library(bluster) sceG$cluster &lt;- clusterRows(reducedDim(sceG), NNGraphParam(k=5)) We see that the clusters map reasonably well to the labels in Figure 8.4. tab &lt;- table(cluster=sceG$cluster, label=pred.grun$labels) pheatmap::pheatmap(log10(tab+10)) Figure 8.4: Heatmap of the log-transformed number of cells in each combination of label (column) and cluster (row) in the Grun dataset. 
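The visual agreement in Figure 8.4 could also be summarized as a single number, for example with the adjusted Rand index. The following is a minimal base-R sketch and is not part of the original workflow: adjusted.rand() is a hypothetical helper written for illustration, and the toy vectors stand in for sceG$cluster and pred.grun$labels.

```r
# Hypothetical helper: adjusted Rand index between two partitions
# of the same set of cells.
adjusted.rand <- function(x, y) {
    tab <- table(x, y)
    choose2 <- function(n) n * (n - 1) / 2

    # Pairs of cells that are together in both partitions.
    sum.cells <- sum(choose2(tab))
    # Pairs of cells that are together in each partition separately.
    sum.rows <- sum(choose2(rowSums(tab)))
    sum.cols <- sum(choose2(colSums(tab)))

    expected <- sum.rows * sum.cols / choose2(sum(tab))
    maximum <- (sum.rows + sum.cols) / 2
    (sum.cells - expected) / (maximum - expected)
}

# Toy stand-ins for cluster identities and predicted labels.
toy.clusters <- c(1, 1, 1, 2, 2, 2)
toy.labels <- c('alpha', 'alpha', 'beta', 'beta', 'duct', 'duct')
adjusted.rand(toy.clusters, toy.labels)
```

Values close to 1 indicate that clusters and labels define nearly the same partition, while values near 0 are what would be expected from unrelated partitions.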
We proceed to the most important part of the analysis. Yes, that’s right, the \\(t\\)-SNE plot (Figure 8.5). set.seed(101010100) sceG &lt;- runTSNE(sceG, dimred=&quot;PCA&quot;) plotTSNE(sceG, colour_by=&quot;cluster&quot;, text_colour=&quot;red&quot;, text_by=I(pred.grun$labels)) Figure 8.5: \\(t\\)-SNE plot of the Grun dataset, where each point is a cell and is colored by the assigned cluster. Reference labels from the Muraro dataset are also placed on the median coordinate across all cells assigned with that label. Session information View session info R version 4.1.0 (2021-05-18) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] bluster_1.2.0 scran_1.20.0 [3] SingleR_1.6.1 scater_1.20.0 [5] ggplot2_3.3.3 scuttle_1.2.0 [7] scRNAseq_2.6.0 SingleCellExperiment_1.14.1 [9] SummarizedExperiment_1.22.0 Biobase_2.52.0 [11] GenomicRanges_1.44.0 GenomeInfoDb_1.28.0 [13] IRanges_2.26.0 S4Vectors_0.30.0 [15] BiocGenerics_0.38.0 MatrixGenerics_1.4.0 [17] matrixStats_0.58.0 BiocStyle_2.20.0 [19] rebook_1.2.0 loaded via a namespace (and not attached): [1] AnnotationHub_3.0.0 BiocFileCache_2.0.0 [3] igraph_1.2.6 lazyeval_0.2.2 [5] BiocParallel_1.26.0 digest_0.6.27 [7] ensembldb_2.16.0 htmltools_0.5.1.1 [9] viridis_0.6.1 fansi_0.4.2 [11] magrittr_2.0.1 memoise_2.0.0 [13] ScaledMatrix_1.0.0 cluster_2.1.2 [15] limma_3.48.0 Biostrings_2.60.0 [17] prettyunits_1.1.1 colorspace_2.0-1 [19] blob_1.2.1 rappdirs_0.3.3 [21] xfun_0.23 dplyr_1.0.6 [23] 
crayon_1.4.1 RCurl_1.98-1.3 [25] jsonlite_1.7.2 graph_1.70.0 [27] glue_1.4.2 gtable_0.3.0 [29] zlibbioc_1.38.0 XVector_0.32.0 [31] DelayedArray_0.18.0 BiocSingular_1.8.0 [33] scales_1.1.1 pheatmap_1.0.12 [35] edgeR_3.34.0 DBI_1.1.1 [37] Rcpp_1.0.6 viridisLite_0.4.0 [39] xtable_1.8-4 progress_1.2.2 [41] dqrng_0.3.0 bit_4.0.4 [43] rsvd_1.0.5 metapod_1.0.0 [45] httr_1.4.2 RColorBrewer_1.1-2 [47] dir.expiry_1.0.0 ellipsis_0.3.2 [49] farver_2.1.0 pkgconfig_2.0.3 [51] XML_3.99-0.6 CodeDepends_0.6.5 [53] sass_0.4.0 dbplyr_2.1.1 [55] locfit_1.5-9.4 utf8_1.2.1 [57] labeling_0.4.2 tidyselect_1.1.1 [59] rlang_0.4.11 later_1.2.0 [61] AnnotationDbi_1.54.0 munsell_0.5.0 [63] BiocVersion_3.13.1 tools_4.1.0 [65] cachem_1.0.5 generics_0.1.0 [67] RSQLite_2.2.7 ExperimentHub_2.0.0 [69] evaluate_0.14 stringr_1.4.0 [71] fastmap_1.1.0 yaml_2.2.1 [73] knitr_1.33 bit64_4.0.5 [75] purrr_0.3.4 KEGGREST_1.32.0 [77] AnnotationFilter_1.16.0 sparseMatrixStats_1.4.0 [79] mime_0.10 biomaRt_2.48.0 [81] compiler_4.1.0 beeswarm_0.3.1 [83] filelock_1.0.2 curl_4.3.1 [85] png_0.1-7 interactiveDisplayBase_1.30.0 [87] statmod_1.4.36 tibble_3.1.2 [89] bslib_0.2.5.1 stringi_1.6.2 [91] highr_0.9 GenomicFeatures_1.44.0 [93] lattice_0.20-44 ProtGenerics_1.24.0 [95] Matrix_1.3-3 vctrs_0.3.8 [97] pillar_1.6.1 lifecycle_1.0.0 [99] BiocManager_1.30.15 jquerylib_0.1.4 [101] BiocNeighbors_1.10.0 cowplot_1.1.1 [103] bitops_1.0-7 irlba_2.3.3 [105] httpuv_1.6.1 rtracklayer_1.52.0 [107] R6_2.5.0 BiocIO_1.2.0 [109] bookdown_0.22 promises_1.2.0.1 [111] gridExtra_2.3 vipor_0.4.5 [113] codetools_0.2-18 assertthat_0.2.1 [115] rjson_0.2.20 withr_2.4.2 [117] GenomicAlignments_1.28.0 Rsamtools_2.8.0 [119] GenomeInfoDbData_1.2.6 hms_1.1.0 [121] grid_4.1.0 beachmat_2.8.0 [123] rmarkdown_2.8 DelayedMatrixStats_1.14.0 [125] Rtsne_0.15 shiny_1.6.0 [127] ggbeeswarm_0.6.0 restfulr_0.0.13 Bibliography "],["cross-annotating-mouse-brains.html", "Chapter 9 Cross-annotating mouse brains 9.1 Loading the data 9.2 Applying the annotation 9.3 
Diagnostics 9.4 Comparison to clusters Session information", " Chapter 9 Cross-annotating mouse brains 9.1 Loading the data We load the classic Zeisel et al. (2015) dataset as our reference. Here, we’ll rely on the fact that the authors have already performed quality control. library(scRNAseq) sceZ &lt;- ZeiselBrainData() We compute log-expression values for use in marker detection inside SingleR(). library(scater) sceZ &lt;- logNormCounts(sceZ) We examine the distribution of labels in this reference. table(sceZ$level2class) ## ## (none) Astro1 Astro2 CA1Pyr1 CA1Pyr2 CA1PyrInt CA2Pyr2 Choroid ## 189 68 61 380 447 49 41 10 ## ClauPyr Epend Int1 Int10 Int11 Int12 Int13 Int14 ## 5 20 12 21 10 21 15 22 ## Int15 Int16 Int2 Int3 Int4 Int5 Int6 Int7 ## 18 20 24 10 15 20 22 23 ## Int8 Int9 Mgl1 Mgl2 Oligo1 Oligo2 Oligo3 Oligo4 ## 26 11 17 16 45 98 87 106 ## Oligo5 Oligo6 Peric Pvm1 Pvm2 S1PyrDL S1PyrL23 S1PyrL4 ## 125 359 21 32 33 81 74 26 ## S1PyrL5 S1PyrL5a S1PyrL6 S1PyrL6b SubPyr Vend1 Vend2 Vsmc ## 16 28 39 21 22 32 105 62 We load the Tasic et al. (2016) dataset as our test. While not strictly necessary, we remove putative low-quality cells to simplify later interpretation. sceT &lt;- TasicBrainData() sceT &lt;- addPerCellQC(sceT, subsets=list(mito=grep(&quot;^mt_&quot;, rownames(sceT)))) qc &lt;- quickPerCellQC(colData(sceT), percent_subsets=c(&quot;subsets_mito_percent&quot;, &quot;altexps_ERCC_percent&quot;)) sceT &lt;- sceT[,which(!qc$discard)] The Tasic dataset was generated using read-based technologies so we need to adjust for the transcript length. 
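To make the length adjustment concrete, here is a small base-R sketch of the usual transcripts-per-million calculation on a toy count matrix. It assumes the textbook definition of TPM (counts divided by gene length, then scaled so each cell sums to one million); the gene names, lengths and counts are invented for illustration, and the packaged function used in this chapter may differ in its details.

```r
# Toy counts: three genes (rows) in two cells (columns); values are made up.
counts <- matrix(c(10, 20, 30, 0, 50, 50), ncol=2,
    dimnames=list(c('g1', 'g2', 'g3'), c('c1', 'c2')))

# Hypothetical gene lengths in bases, one per row of 'counts'.
gene.len <- c(g1=1000, g2=2000, g3=3000)

# Reads per base: each row is divided by the length of its gene.
rate <- counts / gene.len

# Scale each cell (column) so that its values sum to one million.
tpm <- t(t(rate) / colSums(rate)) * 1e6
tpm
```

In the first toy cell, the three genes have counts proportional to their lengths and therefore end up with identical TPM values, which is exactly the bias that the adjustment is meant to remove.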
library(AnnotationHub) mm.db &lt;- AnnotationHub()[[&quot;AH73905&quot;]] mm.exons &lt;- exonsBy(mm.db, by=&quot;gene&quot;) mm.exons &lt;- reduce(mm.exons) mm.len &lt;- sum(width(mm.exons)) mm.symb &lt;- mapIds(mm.db, keys=names(mm.len), keytype=&quot;GENEID&quot;, column=&quot;SYMBOL&quot;) names(mm.len) &lt;- mm.symb library(scater) keep &lt;- intersect(names(mm.len), rownames(sceT)) sceT &lt;- sceT[keep,] assay(sceT, &quot;TPM&quot;) &lt;- calculateTPM(sceT, lengths=mm.len[keep]) 9.2 Applying the annotation We apply SingleR() with Wilcoxon rank sum test-based marker detection to annotate the Tasic dataset with the Zeisel labels. library(SingleR) pred.tasic &lt;- SingleR(test=sceT, ref=sceZ, labels=sceZ$level2class, assay.type.test=&quot;TPM&quot;, de.method=&quot;wilcox&quot;) We examine the distribution of predicted labels: table(pred.tasic$labels) ## ## Astro1 Astro2 CA1Pyr2 CA2Pyr2 ClauPyr Int1 Int10 Int11 ## 1 5 1 3 1 152 98 2 ## Int12 Int13 Int14 Int15 Int16 Int2 Int3 Int4 ## 9 18 24 16 10 146 94 29 ## Int6 Int7 Int8 Int9 Oligo1 Oligo2 Oligo3 Oligo4 ## 14 2 35 31 8 1 7 1 ## Oligo6 Peric S1PyrDL S1PyrL23 S1PyrL4 S1PyrL5 S1PyrL5a S1PyrL6 ## 1 1 319 73 16 4 201 46 ## S1PyrL6b SubPyr ## 73 5 We can also examine the number of discarded cells for each label: table(Label=pred.tasic$labels, Lost=is.na(pred.tasic$pruned.labels)) ## Lost ## Label FALSE TRUE ## Astro1 1 0 ## Astro2 5 0 ## CA1Pyr2 1 0 ## CA2Pyr2 3 0 ## ClauPyr 1 0 ## Int1 152 0 ## Int10 98 0 ## Int11 2 0 ## Int12 9 0 ## Int13 18 0 ## Int14 23 1 ## Int15 16 0 ## Int16 10 0 ## Int2 144 2 ## Int3 94 0 ## Int4 29 0 ## Int6 14 0 ## Int7 2 0 ## Int8 35 0 ## Int9 31 0 ## Oligo1 8 0 ## Oligo2 1 0 ## Oligo3 7 0 ## Oligo4 1 0 ## Oligo6 1 0 ## Peric 1 0 ## S1PyrDL 303 16 ## S1PyrL23 73 0 ## S1PyrL4 16 0 ## S1PyrL5 4 0 ## S1PyrL5a 200 1 ## S1PyrL6 45 1 ## S1PyrL6b 72 1 ## SubPyr 5 0 9.3 Diagnostics We visualize the assignment scores for each label in Figure 9.1. 
plotScoreHeatmap(pred.tasic) Figure 9.1: Heatmap of the (normalized) assignment scores for each cell (column) in the Tasic test dataset with respect to each label (row) in the Zeisel reference dataset. The final assignment for each cell is shown in the annotation bar at the top. The delta for each cell is visualized in Figure 9.2. plotDeltaDistribution(pred.tasic) Figure 9.2: Distributions of the deltas for each cell in the Tasic dataset assigned to each label in the Zeisel dataset. Each cell is represented by a point; low-quality assignments that were pruned out are colored in orange. Finally, we visualize the heatmaps of the marker genes for the most frequent label in Figure 9.3. We could show these for all labels but I wouldn’t want to bore you with a parade of large heatmaps. library(scater) collected &lt;- list() all.markers &lt;- metadata(pred.tasic)$de.genes sceT &lt;- logNormCounts(sceT) top.label &lt;- names(sort(table(pred.tasic$labels), decreasing=TRUE))[1] per.label &lt;- sumCountsAcrossCells(logcounts(sceT), ids=pred.tasic$labels, average=TRUE) per.label &lt;- assay(per.label)[unique(unlist(all.markers[[top.label]])),] pheatmap::pheatmap(per.label, main=top.label) Figure 9.3: Heatmap of log-expression values in the Tasic dataset for all marker genes upregulated in the most frequent label from the Zeisel reference dataset. 9.4 Comparison to clusters For comparison, we will perform a quick unsupervised analysis of the Tasic dataset. We model the variances using the spike-in data and we perform graph-based clustering. 
library(scran) decT &lt;- modelGeneVarWithSpikes(sceT, &quot;ERCC&quot;) set.seed(1000100) sceT &lt;- denoisePCA(sceT, decT, subset.row=getTopHVGs(decT, n=2500)) library(bluster) sceT$cluster &lt;- clusterRows(reducedDim(sceT, &quot;PCA&quot;), NNGraphParam()) We do not observe a clean 1:1 mapping between clusters and labels in Figure 9.4, probably because many of the labels represent closely related cell types that are difficult to distinguish. tab &lt;- table(cluster=sceT$cluster, label=pred.tasic$labels) pheatmap::pheatmap(log10(tab+10)) Figure 9.4: Heatmap of the log-transformed number of cells in each combination of label (column) and cluster (row) in the Tasic dataset. We proceed to the most important part of the analysis. Yes, that’s right, the \\(t\\)-SNE plot (Figure 9.5). set.seed(101010100) sceT &lt;- runTSNE(sceT, dimred=&quot;PCA&quot;) plotTSNE(sceT, colour_by=&quot;cluster&quot;, text_colour=&quot;red&quot;, text_by=I(pred.tasic$labels)) Figure 9.5: \\(t\\)-SNE plot of the Tasic dataset, where each point is a cell and is colored by the assigned cluster. Reference labels from the Zeisel dataset are also placed on the median coordinate across all cells assigned with that label. 
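The label placement described in the caption of Figure 9.5 amounts to taking a per-label median of each plotting coordinate. The base-R sketch below shows that calculation on invented coordinates; coords and toy.labels are hypothetical stand-ins for the t-SNE coordinates and the predicted labels, and this is an illustration of the idea rather than the plotting function's actual internals.

```r
# Toy 2D coordinates for six cells (values are made up).
coords <- cbind(TSNE1=c(0, 1, 2, 10, 11, 12),
                TSNE2=c(5, 6, 7, -1, 0, 1))

# Toy labels for the same six cells.
toy.labels <- c('A', 'A', 'A', 'B', 'B', 'B')

# For each coordinate (column), take the median within each label.
# The result has one row per label and one column per coordinate.
centers <- apply(coords, 2, function(v) tapply(v, toy.labels, median))
centers
```

Using the median rather than the mean keeps each text label close to the bulk of its cells even when a few cells with that label are scattered elsewhere on the plot.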
Session information View session info R version 4.1.0 (2021-05-18) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] bluster_1.2.0 scran_1.20.0 [3] SingleR_1.6.1 ensembldb_2.16.0 [5] AnnotationFilter_1.16.0 GenomicFeatures_1.44.0 [7] AnnotationDbi_1.54.0 AnnotationHub_3.0.0 [9] BiocFileCache_2.0.0 dbplyr_2.1.1 [11] scater_1.20.0 ggplot2_3.3.3 [13] scuttle_1.2.0 scRNAseq_2.6.0 [15] SingleCellExperiment_1.14.1 SummarizedExperiment_1.22.0 [17] Biobase_2.52.0 GenomicRanges_1.44.0 [19] GenomeInfoDb_1.28.0 IRanges_2.26.0 [21] S4Vectors_0.30.0 BiocGenerics_0.38.0 [23] MatrixGenerics_1.4.0 matrixStats_0.58.0 [25] BiocStyle_2.20.0 rebook_1.2.0 loaded via a namespace (and not attached): [1] igraph_1.2.6 lazyeval_0.2.2 [3] BiocParallel_1.26.0 digest_0.6.27 [5] htmltools_0.5.1.1 viridis_0.6.1 [7] fansi_0.4.2 magrittr_2.0.1 [9] memoise_2.0.0 ScaledMatrix_1.0.0 [11] cluster_2.1.2 limma_3.48.0 [13] Biostrings_2.60.0 prettyunits_1.1.1 [15] colorspace_2.0-1 blob_1.2.1 [17] rappdirs_0.3.3 xfun_0.23 [19] dplyr_1.0.6 crayon_1.4.1 [21] RCurl_1.98-1.3 jsonlite_1.7.2 [23] graph_1.70.0 glue_1.4.2 [25] gtable_0.3.0 zlibbioc_1.38.0 [27] XVector_0.32.0 DelayedArray_0.18.0 [29] BiocSingular_1.8.0 scales_1.1.1 [31] pheatmap_1.0.12 edgeR_3.34.0 [33] DBI_1.1.1 Rcpp_1.0.6 [35] viridisLite_0.4.0 xtable_1.8-4 [37] progress_1.2.2 dqrng_0.3.0 [39] bit_4.0.4 rsvd_1.0.5 [41] metapod_1.0.0 httr_1.4.2 [43] RColorBrewer_1.1-2 dir.expiry_1.0.0 [45] ellipsis_0.3.2 
farver_2.1.0 [47] pkgconfig_2.0.3 XML_3.99-0.6 [49] CodeDepends_0.6.5 sass_0.4.0 [51] locfit_1.5-9.4 utf8_1.2.1 [53] labeling_0.4.2 tidyselect_1.1.1 [55] rlang_0.4.11 later_1.2.0 [57] munsell_0.5.0 BiocVersion_3.13.1 [59] tools_4.1.0 cachem_1.0.5 [61] generics_0.1.0 RSQLite_2.2.7 [63] ExperimentHub_2.0.0 evaluate_0.14 [65] stringr_1.4.0 fastmap_1.1.0 [67] yaml_2.2.1 knitr_1.33 [69] bit64_4.0.5 purrr_0.3.4 [71] KEGGREST_1.32.0 sparseMatrixStats_1.4.0 [73] mime_0.10 biomaRt_2.48.0 [75] compiler_4.1.0 beeswarm_0.3.1 [77] filelock_1.0.2 curl_4.3.1 [79] png_0.1-7 interactiveDisplayBase_1.30.0 [81] statmod_1.4.36 tibble_3.1.2 [83] bslib_0.2.5.1 stringi_1.6.2 [85] highr_0.9 lattice_0.20-44 [87] ProtGenerics_1.24.0 Matrix_1.3-3 [89] vctrs_0.3.8 pillar_1.6.1 [91] lifecycle_1.0.0 BiocManager_1.30.15 [93] jquerylib_0.1.4 BiocNeighbors_1.10.0 [95] cowplot_1.1.1 bitops_1.0-7 [97] irlba_2.3.3 httpuv_1.6.1 [99] rtracklayer_1.52.0 R6_2.5.0 [101] BiocIO_1.2.0 bookdown_0.22 [103] promises_1.2.0.1 gridExtra_2.3 [105] vipor_0.4.5 codetools_0.2-18 [107] assertthat_0.2.1 rjson_0.2.20 [109] withr_2.4.2 GenomicAlignments_1.28.0 [111] Rsamtools_2.8.0 GenomeInfoDbData_1.2.6 [113] hms_1.1.0 grid_4.1.0 [115] beachmat_2.8.0 rmarkdown_2.8 [117] DelayedMatrixStats_1.14.0 Rtsne_0.15 [119] shiny_1.6.0 ggbeeswarm_0.6.0 [121] restfulr_0.0.13 Bibliography "],["contributors.html", "Chapter 10 Contributors", " Chapter 10 Contributors Aaron Lun An ancient interdimensional horror who has recently awoken from his aeons-long slumber. His true name cannot be pronounced by the human tongue and his true form cannot be grasped by the human mind. It is said that, when the stars are right, he will herald the return of the Old Ones to plunge the world into darkness once more. In the meantime, he maintains about 20 Bioconductor packages for analyzing a range of genomics data modalities. "],["bibliography.html", "Chapter 11 Bibliography", " Chapter 11 Bibliography "]]