scExplorer Guide
Comprehensive documentation on biological foundations, computational methods, and critical interpretation of each module in the scRNA-seq analysis pipeline (Figure 1).
Introduction to Single-Cell RNA Sequencing
Single-cell RNA sequencing (scRNA-seq) enables the quantification of the transcriptome of individual cells, revealing the heterogeneity that conventional bulk RNA-seq averages out. In any tissue, seemingly homogeneous populations harbor distinct transcriptional states: progenitor cells co-exist with terminally differentiated progeny, activated immune cells reside alongside quiescent counterparts, and rare populations (stem cells, circulating tumor cells) that constitute <1% of the sample become identifiable.
Capture technologies
Droplet-based platforms (10x Genomics Chromium, Drop-seq, inDrop) encapsulate individual cells in nanoliter droplets with barcoded beads, enabling high throughput (thousands to tens of thousands of cells) at the cost of shallow sequencing depth per cell (~1,000–3,000 genes/cell). Plate-based methods (Smart-seq2, MARS-seq) sort cells into microtiter plates via FACS, providing deeper sequencing (~4,000–8,000 genes/cell) and, in the case of Smart-seq2, full-length transcript coverage, but at lower throughput (hundreds of cells). The choice of platform directly impacts downstream analysis: deeper sequencing reduces dropout events and improves detection of lowly expressed transcription factors.
The count matrix
The fundamental data structure in scRNA-seq is a genes × cells matrix where each entry represents the number of unique molecular identifiers (UMIs) or read counts for a gene in a cell. This matrix is characteristically sparse: typically 80–95% of entries are zero. These zeros arise from a combination of biological absence (the gene is genuinely not expressed in that cell) and technical dropout (the transcript was present but not captured due to the stochastic nature of mRNA capture and reverse transcription). Distinguishing between these two sources of zeros remains one of the central challenges in scRNA-seq analysis.
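As a small illustration of the sparsity described above, the sketch below computes the zero fraction and the number of genes detected per cell for a toy genes × cells matrix. The matrix values and the function name `sparsity` are made up for the example.

```python
import numpy as np

def sparsity(counts: np.ndarray) -> float:
    """Fraction of entries in a count matrix that are exactly zero."""
    return float(np.mean(counts == 0))

# Toy genes x cells matrix of UMI counts (4 genes, 5 cells); values are illustrative.
counts = np.array([
    [0, 0, 3, 0, 0],
    [1, 0, 0, 0, 2],
    [0, 0, 0, 0, 0],
    [5, 0, 0, 1, 0],
])

# Genes detected per cell = number of non-zero entries in each column.
genes_per_cell = (counts > 0).sum(axis=0)

print(f"sparsity = {sparsity(counts):.2f}")
print("genes detected per cell:", genes_per_cell.tolist())
```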
Because zero values can reflect both molecular undersampling and genuine biological heterogeneity, downstream analysis has diverged into two broad strategies. One strategy applies denoising or imputation to recover signal that may be masked by sparse sampling, particularly when the objective is to stabilize low abundance transcripts, recover regulatory programs, or improve gene-gene structure. This is the rationale behind methods such as MAGIC, SAVER, and scImpute. A second strategy avoids explicit imputation and models the observed counts directly, especially in UMI based droplet datasets where many zeros are compatible with standard count sampling. In this framework, excessive smoothing may blur cell state boundaries, inflate correlations, and distort downstream inference.
Current evidence indicates that the value of imputation is task dependent rather than universal. Svensson showed that droplet scRNA-seq data are often not zero inflated beyond standard count models, whereas Qiu emphasized that dropout patterns can themselves contain biologically informative structure. A systematic benchmark by Hou et al. further showed that imputation may improve some tasks, such as agreement with bulk profiles, while providing limited or inconsistent benefit for clustering and trajectory inference. Accordingly, imputation is best treated as an optional analytical layer for visualization or exploratory denoising, whereas conclusions about marker genes, differential expression, trajectories, or regulatory networks should ideally be confirmed in the observed count space or in a parallel non-imputed analysis.
- van Dijk D et al. (2018). Recovering Gene Interactions from Single-Cell Data Using Data Diffusion. Cell.
- Huang M et al. (2018). SAVER: gene expression recovery for single-cell RNA sequencing. Nature Methods.
- Li WV, Li JJ (2018). An accurate and robust imputation method scImpute for single-cell RNA-seq data. Nature Communications.
- Andrews TS, Hemberg M (2019). False signals induced by single-cell imputation. F1000Research.
- Svensson V (2020). Droplet scRNA-seq is not zero-inflated. Nature Biotechnology.
- Qiu P (2020). Embracing the dropouts in single-cell RNA-seq analysis. Nature Communications.
- Hou W et al. (2020). A systematic evaluation of single-cell RNA-sequencing imputation methods. Genome Biology.
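To make the "not zero-inflated" argument concrete: a standard count model already predicts many zeros for lowly expressed genes. The sketch below evaluates the expected zero fraction under a Poisson model, P(0) = exp(−μ), and under a negative binomial model with mean μ and size parameter θ, P(0) = (θ / (θ + μ))^θ. The function names and the example means are illustrative assumptions, not part of scExplorer.

```python
import math

def poisson_zero_fraction(mean_count: float) -> float:
    """Expected fraction of zeros for a Poisson-distributed count."""
    return math.exp(-mean_count)

def negbin_zero_fraction(mean_count: float, size: float) -> float:
    """Expected fraction of zeros for a negative binomial count (mean, size parameterization)."""
    return (size / (size + mean_count)) ** size

for mean_count in (0.05, 0.1, 1.0, 5.0):
    print(
        f"mean={mean_count:>4}: "
        f"Poisson zeros={poisson_zero_fraction(mean_count):.3f}, "
        f"NB(size=1) zeros={negbin_zero_fraction(mean_count, 1.0):.3f}"
    )
```

A gene averaging 0.1 UMIs per cell is therefore expected to be zero in roughly 90% of cells from sampling alone, without invoking any additional dropout mechanism.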
0 Upload
AnnData (h5ad)
The AnnData format (used by Scanpy) organizes single-cell data into a structured object: X stores the count matrix (cells × genes), obs contains cell-level metadata (sample origin, QC metrics, cluster assignments), var holds gene-level annotations (gene symbols, Ensembl IDs, mitochondrial flags), obsm stores embeddings (PCA coordinates, UMAP), and layers can hold alternative representations (raw counts, normalized values, spliced/unspliced for velocity). This hierarchical structure ensures all analytical outputs remain bound to the original data, maintaining full provenance.
10x Genomics h5 (CellRanger output)
CellRanger produces filtered feature-barcode matrices in HDF5 format. The filtering step uses an algorithm based on the barcode rank plot (the "knee plot") to distinguish genuine cell-containing droplets from empty droplets containing ambient RNA. The h5 file contains the sparse count matrix, gene IDs (both Ensembl and symbol), and barcode sequences.
Seurat objects (RDS)
R-based workflows store data as Seurat objects serialized in RDS format. scExplorer can ingest these and convert them internally to AnnData for processing.
Mitochondrial genes are identified by their gene-symbol prefix: MT- for human (e.g., MT-CO1, MT-ND1) and mt- for mouse (e.g., mt-Co1, mt-Nd1). Incorrect species assignment will result in zero mitochondrial genes detected and an unreliable QC metric.
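The prefix rule above can be sketched in a few lines. The names `MITO_PREFIX` and `flag_mito_genes` are hypothetical helpers for illustration, not scExplorer functions.

```python
# Species-specific gene-symbol prefixes for mitochondrial genes (as stated in the text).
MITO_PREFIX = {"human": "MT-", "mouse": "mt-"}

def flag_mito_genes(gene_symbols, species):
    """Return one boolean per gene: True if the symbol carries the species prefix."""
    prefix = MITO_PREFIX[species]
    return [symbol.startswith(prefix) for symbol in gene_symbols]

human_genes = ["MT-CO1", "MT-ND1", "GAPDH", "MTOR"]

print(flag_mito_genes(human_genes, "human"))  # MTOR is not flagged: it lacks the hyphen
print(flag_mito_genes(human_genes, "mouse"))  # wrong species: nothing is flagged
```

The second call shows the failure mode described above: with the wrong species, no mitochondrial genes are found and the mitochondrial fraction is reported as zero for every cell.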
1 Preprocessing & Quality Control
Quality control in scRNA-seq is fundamentally a biological problem: distinguishing intact, viable cells from damaged cells, empty droplets, and doublets. Each type of artifact has a specific molecular signature rooted in cell biology:
Empty droplets and low-complexity barcodes
In droplet-based platforms, a fraction of droplets fail to capture a cell but still contain ambient RNA released from lysed cells during sample preparation. These barcodes show characteristically low gene counts and total UMIs. The min_genes threshold removes these, but the appropriate cutoff depends on cell type: red blood cells and platelets naturally express very few genes (~200–500), whereas hepatocytes and neurons routinely express >4,000. Setting min_genes too high risks eliminating genuine low-complexity cell types.
Mitochondrial content as a viability marker
When the plasma membrane loses integrity (apoptosis, mechanical damage during dissociation), cytoplasmic mRNA leaks out of the cell. However, mitochondrial transcripts, protected by the double mitochondrial membrane, are retained proportionally. The result is a cell with an artificially elevated fraction of mitochondrial reads. The mito_threshold parameter sets the maximum tolerable proportion. Typical thresholds: 5% for brain tissue (neurons have low mitochondrial content), 10–15% for PBMCs, up to 20% for metabolically active tissues (heart, kidney) and solid tumors where dissociation-induced damage is common.
Doublet detection
Doublets occur when two cells are captured in the same droplet (~2–8% rate, increasing roughly linearly with the number of cells loaded). Computational methods like Scrublet simulate artificial doublets by combining the counts of randomly chosen pairs of observed transcriptomes, then score each real barcode by its similarity to these synthetic doublets. Heterotypic doublets (two different cell types) are detectable because they occupy intermediate positions in transcriptomic space; homotypic doublets (same cell type) are essentially undetectable computationally (Figure 2).
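The simulation idea can be sketched with NumPy alone. This is a simplified illustration, not Scrublet's implementation: the toy data, the two cell types, the number of synthetic doublets, and the plain "fraction of synthetic neighbors" score are all assumptions made for the example (Scrublet additionally uses PCA and a likelihood-based score).

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_per_type, n_real_doublets = 30, 100, 10
n_synthetic, k = 60, 15

# Two toy cell types with non-overlapping marker programs (illustrative rates).
rate_a = np.full(n_genes, 0.5)
rate_a[:10] = 20.0
rate_b = np.full(n_genes, 0.5)
rate_b[10:20] = 20.0
cells_a = rng.poisson(rate_a, size=(n_per_type, n_genes))
cells_b = rng.poisson(rate_b, size=(n_per_type, n_genes))
# "Real" heterotypic doublets: one A cell and one B cell in the same droplet.
real_doublets = rng.poisson(rate_a + rate_b, size=(n_real_doublets, n_genes))
observed = np.vstack([cells_a, cells_b, real_doublets])
n_obs = len(observed)

# Synthetic doublets: sum the counts of random pairs of observed barcodes.
pairs = rng.integers(0, n_obs, size=(n_synthetic, 2))
synthetic = observed[pairs[:, 0]] + observed[pairs[:, 1]]

def normalize(counts):
    """Scale each barcode to the same total count, then log-transform."""
    totals = counts.sum(axis=1, keepdims=True)
    return np.log1p(1000.0 * counts / totals)

combined = normalize(np.vstack([observed, synthetic]))
is_synthetic = np.arange(len(combined)) >= n_obs

# Score = fraction of each observed barcode's k nearest neighbors that are synthetic.
dist = np.linalg.norm(combined[:n_obs, None, :] - combined[None, :, :], axis=2)
dist[np.arange(n_obs), np.arange(n_obs)] = np.inf  # exclude each barcode itself
neighbors = np.argsort(dist, axis=1)[:, :k]
score = is_synthetic[neighbors].mean(axis=1)

singlet_score = score[: 2 * n_per_type].mean()
doublet_score = score[2 * n_per_type :].mean()
print(f"mean score, singlets: {singlet_score:.2f}; heterotypic doublets: {doublet_score:.2f}")
```

Singlets still receive a non-zero score because synthetic pairs drawn from the same cell type look like ordinary cells after normalization, which is the same reason homotypic doublets cannot be separated from singlets.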
Figure 2. Quality control removes empty droplets (low gene counts, filtered by min_genes), damaged cells (elevated mitochondrial RNA fraction, filtered by mito_threshold), and doublets (hybrid transcriptomes from two co-captured cells, detected by Scrublet). Only barcodes passing all three filters are retained as viable cells for downstream analysis.
| Parameter | Function | Default | Range | Biological Impact |
|---|---|---|---|---|
| min_genes | Minimum genes detected per cell | 200 | 100–500 | Low values retain empty droplets; high values remove genuine low-complexity cells (erythrocytes ~200, platelets ~300). Check the gene count distribution for bimodal separation between empties and cells. |
| min_cells | Minimum cells expressing a gene | 3 | 1–10 | Removes genes detected in very few cells: likely technical noise or ambient RNA contamination. Setting to 1 retains potential rare markers; setting >10 risks losing genes specific to very small populations. |
| mito_threshold | Maximum % mitochondrial reads | 20% | 5–30% | Tissue-dependent: brain (5%), PBMCs (10–15%), solid tumors (20%), cardiomyocytes (up to 30% due to high mitochondrial content). Plot the distribution before choosing. |
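A minimal sketch of how the three parameters in the table could act on a cells × genes count matrix. The function `qc_filter`, the filtering order (cells first, then genes), and the toy thresholds are assumptions for illustration; scExplorer's internal implementation may differ.

```python
import numpy as np

def qc_filter(counts, gene_names, min_genes=200, min_cells=3,
              mito_threshold=20.0, mito_prefix="MT-"):
    """Return boolean masks (cells to keep, genes to keep) for a cells x genes matrix."""
    is_mito = np.array([name.startswith(mito_prefix) for name in gene_names])
    genes_per_cell = (counts > 0).sum(axis=1)
    total_per_cell = counts.sum(axis=1)
    # Percentage of each cell's counts that come from mitochondrial genes.
    pct_mito = 100.0 * counts[:, is_mito].sum(axis=1) / np.maximum(total_per_cell, 1)
    cell_mask = (genes_per_cell >= min_genes) & (pct_mito <= mito_threshold)
    # Gene filter is applied to the cells that survived the cell filter.
    cells_per_gene = (counts[cell_mask] > 0).sum(axis=0)
    gene_mask = cells_per_gene >= min_cells
    return cell_mask, gene_mask

gene_names = ["MT-CO1", "GAPDH", "ACTB", "CD3E", "MS4A1"]
counts = np.array([
    [1, 5, 4, 3, 0],   # healthy cell
    [9, 1, 1, 1, 0],   # 75% mitochondrial counts: likely damaged
    [0, 1, 0, 0, 0],   # one gene detected: likely empty droplet
    [2, 6, 3, 0, 4],   # healthy cell
])

# Thresholds are scaled down to suit the toy matrix.
cell_mask, gene_mask = qc_filter(counts, gene_names, min_genes=3, min_cells=2)
print("cells kept:", cell_mask.tolist())
print("genes kept:", [g for g, keep in zip(gene_names, gene_mask) if keep])
```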
2 Embedding & Clustering
Not all genes carry equal information about cellular identity. Housekeeping genes (GAPDH, ACTB) are expressed at similar levels across cell types and contribute noise rather than signal to clustering. Highly variable gene (HVG) selection identifies genes with biological variance exceeding the technical (Poisson) expectation. These genes define cell type identity and state transitions. Typical selections retain 2,000–5,000 HVGs, capturing major identity markers while excluding uninformative genes that would dilute cluster resolution.
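One simple way to see "variance exceeding the Poisson expectation" is the variance-to-mean ratio (Fano factor), which is close to 1 for pure Poisson sampling. The sketch below is an illustration on simulated data and is not the HVG procedure used by Scanpy, which ranks dispersion relative to genes of similar mean expression; the gene rates are made up.

```python
import numpy as np

def fano_factor(counts):
    """Variance-to-mean ratio per gene (columns) of a cells x genes count matrix."""
    mean = counts.mean(axis=0)
    var = counts.var(axis=0, ddof=1)
    return var / np.maximum(mean, 1e-12)

rng = np.random.default_rng(0)
n_cells = 500

# Housekeeping-like gene: same expected count in every cell (pure sampling noise).
housekeeping = rng.poisson(10.0, size=n_cells)
# Marker-like gene: high in one half of the cells, low in the other half.
marker_rate = np.where(np.arange(n_cells) < n_cells // 2, 10.0, 0.5)
marker = rng.poisson(marker_rate)

counts = np.column_stack([housekeeping, marker])
fano = fano_factor(counts)
print(f"housekeeping Fano = {fano[0]:.2f}, marker Fano = {fano[1]:.2f}")
```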
Principal Component Analysis performs a linear transformation of the expression matrix to identify orthogonal axes (principal components) that capture decreasing amounts of variance. In scRNA-seq, the first 10–50 PCs typically capture biologically meaningful variance, including cell type identity, activation state, and cell cycle phase, while later PCs are dominated by technical noise.
The elbow plot (variance explained per PC) helps identify this transition: the point where the curve flattens marks the boundary between signal and noise. Using too few PCs risks losing subtle biological variation (e.g., functional states within a cell type); using too many introduces noise that can fragment clusters artificially. The full dimensionality reduction pipeline is illustrated in Figure 3.
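The quantity shown on an elbow plot can be computed directly from a singular value decomposition of the centered matrix. The sketch below uses simulated data with two latent factors plus noise; the data and the function name are assumptions for illustration.

```python
import numpy as np

def pca_explained_variance_ratio(X):
    """Fraction of total variance captured by each principal component, in descending order."""
    centered = X - X.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

rng = np.random.default_rng(0)
# 200 cells x 20 genes driven by two latent "biological" factors plus unit noise.
latent = rng.normal(scale=5.0, size=(200, 2))
loadings = rng.normal(size=(2, 20))
X = latent @ loadings + rng.normal(size=(200, 20))

ratios = pca_explained_variance_ratio(X)
print("variance explained by first 5 PCs:", np.round(ratios[:5], 3).tolist())
```

In this toy case the curve drops sharply after the second component, which is the "elbow" separating the two simulated signal axes from noise.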
k-Nearest Neighbor graph
The k-NN graph connects each cell to its k nearest neighbors in PC space, creating a network where edges represent transcriptomic similarity. This graph is the substrate for both clustering and UMAP projection. The n_neighbors parameter controls the locality of connections. Low values (5–15) preserve fine local structure, useful for identifying rare subpopulations or transitional states, while high values (30–100) capture broader relationships between cell populations at the cost of blurring subtle distinctions.
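A brute-force sketch of the neighbor search on toy PC coordinates. Production tools typically use approximate nearest-neighbor search and weight the edges; the function `knn_indices` and the simulated coordinates here are illustrative assumptions.

```python
import numpy as np

def knn_indices(pcs, n_neighbors):
    """Indices of each cell's n_neighbors nearest cells (Euclidean, self excluded)."""
    sq_norms = (pcs ** 2).sum(axis=1)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * pcs @ pcs.T
    np.fill_diagonal(sq_dist, np.inf)  # a cell is not its own neighbor
    return np.argsort(sq_dist, axis=1)[:, :n_neighbors]

rng = np.random.default_rng(0)
# Two well-separated toy populations of 20 cells each in a 5-dimensional PC space.
pcs = np.vstack([
    rng.normal(loc=0.0, size=(20, 5)),
    rng.normal(loc=10.0, size=(20, 5)),
])

neighbors = knn_indices(pcs, n_neighbors=5)
print("neighbors of cell 0:", neighbors[0].tolist())
```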
UMAP projection
UMAP (Uniform Manifold Approximation and Projection) is a non-linear dimensionality reduction that projects the neighbor graph into two dimensions for visualization. It preserves local topology (cells that are neighbors in high-dimensional space remain close in UMAP) but does not preserve global distances. Two clusters separated by a large gap in UMAP are not necessarily more transcriptomically different than two adjacent clusters. The relative position, distance, and size of clusters in UMAP can vary across random seeds. UMAP is a visualization tool, not an analytical one: all quantitative analyses (clustering, DEA, trajectory inference) operate on the graph or PC space directly.
The Leiden algorithm detects communities in the k-NN graph by optimizing a modularity-based objective function. It partitions the graph into groups of cells with dense internal connections (transcriptomically similar) and sparse connections between groups. Unlike the earlier Louvain algorithm, Leiden guarantees that each resulting community is connected, avoiding the generation of disconnected cluster fragments.
The resolution parameter is the primary control for clustering granularity. At low resolution (0.2–0.5), the algorithm identifies broad lineages: all T cells in one cluster, all myeloid cells in another. At intermediate resolution (0.5–1.0), major subtypes emerge: CD4+ and CD8+ T cells separate, classical and non-classical monocytes distinguish. At high resolution (1.0–2.0+), functional states within subtypes become visible: naive vs memory CD4+, Th1 vs Th2 vs Th17 vs Treg. However, excessive resolution fragments continuous biological variation into discrete clusters that can generate misleading differential expression results.
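Why a higher resolution yields more clusters can be seen from a resolution-weighted modularity, Q(γ) = Σ_c [e_c/m − γ·(K_c/2m)²], where e_c is the number of edges inside community c, K_c the summed degree of its nodes, m the total number of edges, and γ the resolution. The sketch below evaluates this objective for two candidate partitions of a toy graph; it illustrates the objective only and does not implement the Leiden optimization itself.

```python
def modularity(edges, membership, resolution=1.0):
    """Resolution-weighted modularity of a partition of an unweighted, undirected graph."""
    m = len(edges)
    degree_sum = {}      # K_c: summed node degree per community
    internal_edges = {}  # e_c: number of edges with both ends in the community
    for u, v in edges:
        for node in (u, v):
            c = membership[node]
            degree_sum[c] = degree_sum.get(c, 0) + 1
        if membership[u] == membership[v]:
            c = membership[u]
            internal_edges[c] = internal_edges.get(c, 0) + 1
    return sum(
        internal_edges.get(c, 0) / m - resolution * (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )

# Two triangles joined by a single bridging edge (2-3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
merged = {node: 0 for node in range(6)}
split = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

for gamma in (0.2, 1.0):
    print(
        f"resolution={gamma}: merged Q={modularity(edges, merged, gamma):.3f}, "
        f"split Q={modularity(edges, split, gamma):.3f}"
    )
```

At resolution 0.2 the single merged cluster scores higher; at 1.0 the two-cluster split does. For this graph the crossover sits at γ = 2/7.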
| Parameter | Function | Default | Range | Biological Impact |
|---|---|---|---|---|
| n_neighbors | Neighbors per cell in k-NN graph | 15 | 5–100 | Low: fine local structure (rare subsets); High: global relationships (lineages). For datasets with expected rare populations (<1%), use lower values. |
| n_pcs | PCs used for graph/UMAP | 30 | 10–50 | Guided by elbow plot. Too few loses fine variation; too many introduces noise. Complex tissues (developing organs) benefit from more PCs. |
| resolution | Leiden clustering granularity | 1.0 | 0.1–3.0 | 0.3: major lineages (T, B, myeloid). 1.0: subtypes (CD4, CD8, NK). 2.0+: states (Th1, Th17, Treg). Iterate and validate with known markers. |
| min_dist | UMAP minimum distance | 0.5 | 0.0–1.0 | Visual only. Low values create tighter clusters; high values produce more uniform spread. Does not affect analytical results. |
3 Differential Expression Analysis
Differential expression analysis identifies genes whose expression differs significantly between clusters, enabling cell type annotation. Each cluster is compared against all other cells (one-vs-rest) to find marker genes: transcripts that are both highly expressed within the cluster and absent or low in others. Ideal markers combine high fold change (magnitude of upregulation) with high specificity (expression restricted to the cluster of interest). A gene upregulated 10-fold but expressed in multiple clusters is a poor marker; a gene upregulated 3-fold but exclusive to one cluster is far more informative for identity assignment.
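The distinction between fold change and specificity can be made concrete with two summary statistics per gene: the log₂ fold change of mean expression and the fraction of cells expressing the gene inside versus outside the cluster. The function `marker_stats`, the pseudocount of 1, and the toy expression vectors are assumptions for illustration.

```python
import numpy as np

def marker_stats(expr, in_cluster, pseudocount=1.0):
    """One-vs-rest summary for a single gene: log2 fold change and fraction of expressing cells."""
    inside, outside = expr[in_cluster], expr[~in_cluster]
    return {
        "log2fc": float(np.log2((inside.mean() + pseudocount) / (outside.mean() + pseudocount))),
        "pct_in": float((inside > 0).mean()),
        "pct_out": float((outside > 0).mean()),
    }

# 3 cells in the cluster of interest, 6 cells in the rest of the dataset.
in_cluster = np.array([True] * 3 + [False] * 6)
broad_gene = np.array([40.0, 40, 40, 4, 4, 4, 4, 4, 4])    # strongly up, but expressed everywhere
specific_gene = np.array([6.0, 6, 6, 0, 0, 0, 0, 0, 0])    # modest level, restricted to the cluster

broad = marker_stats(broad_gene, in_cluster)
specific = marker_stats(specific_gene, in_cluster)
print("broad gene:   ", broad)
print("specific gene:", specific)
```

The broad gene has the larger fold change, yet it is detected in every cell outside the cluster, whereas the specific gene is detected in none of them.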
Wilcoxon rank-sum test
A non-parametric test that compares the rank distributions of expression values between two groups. It makes no assumptions about the underlying distribution, making it robust for scRNA-seq data, which are sparse and right-skewed. Recommended as the default choice for most analyses.
Welch's t-test
A parametric test assuming approximately normal distributions. It is more statistically powerful (lower false negative rate) when the normality assumption holds, which is more nearly the case after log-normalization and for moderately to highly expressed genes. However, for sparse genes with many zeros, the normality assumption breaks down and false positive rates increase.
Logistic regression
Models the probability that a cell belongs to a given cluster as a function of each gene's expression. The regression coefficient reflects each gene's discriminative capacity. This approach naturally accounts for the binary nature of classification and can be more robust to outlier expression values. Particularly useful for large datasets where computational efficiency matters.
Interpreting results: the volcano plot
Volcano plots display statistical significance (−log₁₀ p-value, y-axis) against effect size (log₂ fold change, x-axis), as shown in Figure 4. Genes in the upper right quadrant (high significance, strong upregulation) are the primary marker candidates. Genes with high fold change but low significance may be expressed in only a few cells (noisy). Genes with high significance but low fold change are broadly but weakly differentially expressed: they are less useful for identity assignment but may be interesting for regulatory analysis.
Over-clustering artifacts: If resolution is set too high, a continuous population (e.g., a gradient of monocyte activation) may be split into multiple clusters. DEA will then identify genes that differ along this gradient as "markers," but they represent arbitrary cut-points in a continuum rather than biologically discrete populations. If the top DEGs between two clusters are graded rather than binary, consider merging them.
Compositional effects: In one-vs-rest comparisons, a gene can appear as a "marker" for a rare cluster simply because it is absent in the dominant cluster type. Always verify that markers are specifically enriched in the target cluster, not just absent elsewhere.
Multiple testing: With thousands of genes tested across multiple clusters, false discovery correction (Benjamini-Hochberg) is essential. Use adjusted p-values for interpretation.
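For reference, the Benjamini-Hochberg adjustment can be written in a few lines: each sorted p-value is scaled by n/rank, and monotonicity is then enforced from the largest p-value downward. The function name and the example p-values are illustrative.

```python
import numpy as np

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the original order."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity: an adjusted value may not exceed that of any larger p-value.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted

raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw).tolist())
```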
4 Gene Expression Visualization
This module projects the expression of individual genes onto UMAP or PCA coordinates, enabling visual validation of cluster identity with known markers. By coloring each cell according to its expression level of a gene, you can assess whether marker genes co-localize with expected clusters.
Canonical marker validation
Cell type annotation typically starts with established markers: CD3E (T lymphocytes), MS4A1/CD20 (B lymphocytes), CD14/LYZ (classical monocytes), FCGR3A/CD16 (non-classical monocytes), NKG7/GNLY (NK cells), PECAM1/CD31 (endothelial cells), COL1A1 (fibroblasts), EPCAM (epithelial cells). Expression should be restricted to the expected cluster; diffuse expression across multiple clusters suggests either poor resolution or the gene is not a good discriminator for that dataset.
5 Heatmap
Heatmaps display the expression of selected genes (rows) across clusters or individual cells (columns), providing a comprehensive view of marker specificity and co-expression patterns. They are the standard method for presenting the transcriptional signature of each cluster in publications.
Z-score normalization (row scaling)
When enabled, each gene's expression is transformed to a z-score across clusters: values represent standard deviations from the gene's own mean. This is essential when comparing genes with vastly different absolute expression levels: a transcription factor expressed at 50 counts and a ribosomal gene at 10,000 counts become comparable when scaled. The color then reflects relative enrichment or depletion within each gene's own distribution.
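Row scaling as described above amounts to subtracting each gene's mean across clusters and dividing by its standard deviation. The sketch below applies it to a toy genes × clusters matrix of mean expression; the values mirror the transcription factor versus ribosomal gene example and are otherwise made up.

```python
import numpy as np

def row_zscore(mean_expr):
    """Z-score each gene (row) across clusters (columns)."""
    mu = mean_expr.mean(axis=1, keepdims=True)
    sd = mean_expr.std(axis=1, keepdims=True)
    sd[sd == 0] = 1.0  # constant genes stay at zero instead of dividing by zero
    return (mean_expr - mu) / sd

# Rows: a transcription factor (~50 counts) and a ribosomal gene (~10,000 counts).
mean_expr = np.array([
    [50.0, 5.0, 5.0],
    [10000.0, 9000.0, 11000.0],
])

scaled = row_zscore(mean_expr)
print(np.round(scaled, 2).tolist())
```

After scaling, both genes span the same range of roughly −1.2 to +1.4, so the enrichment of the transcription factor in the first cluster is as visible as the variation of the far more abundant ribosomal gene.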
Interpretation patterns
- Block-diagonal pattern: When marker genes are correctly ordered, each cluster should show a distinct block of upregulated genes. The diagonal blocks indicate well-separated populations (Figure 9).
- Shared expression blocks: Two clusters sharing the same upregulated gene modules may represent subclusters of the same cell type that should be merged.
- Gradient expression: Genes showing progressive change across multiple clusters suggest a differentiation trajectory rather than discrete populations.
6 Downstream Analyses
Downstream analyses extend beyond cell type identification to address dynamic and regulatory questions: How do cells transition between states? What directionality does this trajectory have? Which transcription factors drive cell identity? How do cell populations communicate? scExplorer integrates four specialized modules to address these questions.
Trajectory Inference
Many biological processes involve continuous transitions between cellular states: hematopoietic differentiation from stem cells through progenitors to mature blood cells, epithelial-to-mesenchymal transition in development and cancer, T cell activation and exhaustion gradients. Trajectory inference reconstructs these continuous processes from snapshot scRNA-seq data by ordering cells along inferred differentiation paths.
PAGA (Partition-based Graph Abstraction)
PAGA quantifies the statistical connectivity between Leiden clusters in the k-NN graph, generating a coarsened abstraction of the data topology. If two clusters share many inter-cluster edges (more than expected by random), PAGA assigns a high connectivity weight, indicating a probable biological transition. The resulting graph reveals which populations are directly connected (e.g., HSC → CMP → GMP → monocyte) and which are disconnected (e.g., T cells and epithelial cells in a tumor sample should not show strong connectivity).
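The idea of comparing observed inter-cluster edges with a random expectation can be sketched as a simple ratio: observed edges between clusters a and b divided by K_a·K_b/(2m), the number expected if edges were wired at random given each cluster's summed degree. This is a simplified stand-in chosen for illustration, not PAGA's exact statistic.

```python
from collections import Counter
from itertools import combinations

def cluster_connectivity(edges, membership):
    """Observed / expected inter-cluster edge counts for every pair of clusters."""
    degree_sum = Counter()  # summed node degree per cluster
    between = Counter()     # observed edges between each pair of clusters
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu != cv:
            between[tuple(sorted((cu, cv)))] += 1
    two_m = 2 * len(edges)
    return {
        (a, b): between[(a, b)] / (degree_sum[a] * degree_sum[b] / two_m)
        for a, b in combinations(sorted(degree_sum), 2)
    }

# Three triangles (clusters 0, 1, 2); two bridges join 0-1, one joins 1-2, none join 0-2.
edges = [
    (0, 1), (1, 2), (0, 2),
    (3, 4), (4, 5), (3, 5),
    (6, 7), (7, 8), (6, 8),
    (2, 3), (2, 4), (5, 6),
]
membership = {node: node // 3 for node in range(9)}

connectivity = cluster_connectivity(edges, membership)
for pair, value in connectivity.items():
    print(pair, round(value, 3))
```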
DPT (Diffusion Pseudotime)
DPT measures the distance between cells using a diffusion process on the neighbor graph. Starting from a user-defined root cell (which should correspond to the biologically most undifferentiated state), DPT assigns each cell a pseudotime value representing its progress along the differentiation trajectory. Unlike simple path-based methods, DPT can handle branching, identifying bifurcation points where progenitors commit to different lineage fates.
RNA Velocity (scVelo)
RNA velocity leverages the kinetics of mRNA processing to infer the future transcriptional state of each cell. Newly transcribed pre-mRNA retains intronic sequences (unspliced), which are progressively removed to generate mature mRNA (spliced). In steady state, the ratio of unspliced to spliced transcripts is constant. Deviations from this equilibrium are informative:
- Excess of unspliced RNA → the gene is being actively upregulated (transcription rate exceeds degradation of unspliced intermediates).
- Excess of spliced RNA → the gene is being downregulated (transcription has slowed; remaining spliced mRNA has not yet been degraded).
By computing this ratio across all genes for each cell, scVelo generates a velocity vector that indicates the probable direction of transcriptional change (Figure 5). When projected onto the UMAP embedding, these vectors form a flow field showing the predicted directionality of cell state transitions.
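A minimal sketch of the steady-state reasoning for a single gene: fit the equilibrium ratio γ of unspliced to spliced counts, then take the residual u − γ·s as the velocity. Assumptions made for the example: the splicing rate is normalized to 1, γ is fitted by least squares through the origin over all cells, and the counts are made up. Real tools fit γ more carefully and offer additional models.

```python
import numpy as np

def steady_state_velocity(unspliced, spliced):
    """Fit u = gamma * s through the origin; velocity is the residual u - gamma * s."""
    gamma = float(unspliced @ spliced / (spliced @ spliced))
    return gamma, unspliced - gamma * spliced

# One gene across six cells. Cells 0-3 sit at equilibrium (u = 0.5 * s),
# cell 4 has excess unspliced RNA, and cell 5 has excess spliced RNA.
spliced = np.array([1.0, 2.0, 4.0, 6.0, 2.0, 4.0])
unspliced = np.array([0.5, 1.0, 2.0, 3.0, 3.0, 0.5])

gamma, velocity = steady_state_velocity(unspliced, spliced)
print(f"gamma = {gamma:.3f}")
print("velocity per cell:", np.round(velocity, 2).tolist())
```

Cell 4 receives a positive velocity (the gene is being induced) and cell 5 a negative one (the gene is being repressed), while the equilibrium cells stay near zero.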
Data requirements
RNA velocity requires quantification of both spliced and unspliced reads. Standard CellRanger count does not separate these; the data must be processed with velocyto (Loom files) or STARsolo with the --soloFeatures Gene Velocyto option. The unspliced/spliced layers must be present in the AnnData object.
Modes
- Deterministic mode: Solves the ODE model for splicing kinetics analytically. Faster but assumes a single set of rate parameters per gene.
- Stochastic mode: Models the variance of the spliced/unspliced distributions, capturing intrinsic stochasticity of transcription. More robust for genes with complex kinetics but computationally more expensive.
SCENIC: Gene Regulatory Networks
Cell identity is ultimately determined by the activity of transcription factors (TFs): proteins that bind specific DNA motifs in promoter/enhancer regions and regulate the expression of downstream target genes. SCENIC reconstructs these regulatory relationships from scRNA-seq data, identifying which TFs are active in each cell and which gene programs (regulons) they control (Figure 6).
Step 1: GRNBoost2 (or GENIE3)
Uses gradient boosting regression to predict each gene's expression from TF expression levels, identifying TF-target co-expression modules. This step generates a large number of candidate regulatory links, many of which are indirect (gene A correlates with gene B because both are regulated by an unmodeled factor).
Step 2: RcisTarget
Filters candidate TF-target links by requiring the presence of the TF's binding motif in the promoter region (typically ±500 bp around TSS) of the target gene. This critical validation step eliminates ~80–90% of spurious co-expression links, retaining only those supported by regulatory sequence evidence. The result is a set of regulons, each consisting of a TF and its validated target genes.
Step 3: AUCell
Quantifies the activity of each regulon in each individual cell using the Area Under the recovery Curve (AUC) metric. For each cell, genes are ranked by expression, and the AUC measures how many regulon targets are among the top-expressed genes. This produces a cells × regulons activity matrix that can be used for clustering, differential analysis, and regulatory state characterization.
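The recovery-curve idea can be sketched as follows: rank the genes of one cell by expression, walk down the top of the ranking while counting regulon targets, and normalize the area under that curve by the best achievable area. This is a simplified illustration, not the AUCell implementation; the top fraction, gene names, and expression values are assumptions.

```python
import numpy as np

def recovery_auc(expression, gene_names, regulon, top_fraction=0.05):
    """Normalized area under the recovery curve of regulon genes in one cell's ranking."""
    order = np.argsort(-expression, kind="stable")  # highest expression first
    n_top = max(1, int(round(top_fraction * len(gene_names))))
    top_genes = np.asarray(gene_names)[order[:n_top]]
    recovered = np.cumsum(np.isin(top_genes, list(regulon)))
    # Best case: every regulon gene sits at the very top of the ranking.
    best = np.minimum(np.arange(1, n_top + 1), len(regulon))
    return float(recovered.sum() / best.sum())

gene_names = [f"g{i}" for i in range(20)]
regulon = {"g0", "g1", "g2"}

cell_active = np.array([10.0, 9, 8, 7, 6] + [0] * 15)             # targets are the top genes
cell_inactive = np.array([0.0, 0, 0, 7, 6, 5, 4, 3] + [0] * 12)   # targets are not expressed

auc_active = recovery_auc(cell_active, gene_names, regulon, top_fraction=0.25)
auc_inactive = recovery_auc(cell_inactive, gene_names, regulon, top_fraction=0.25)
print(f"active cell AUC = {auc_active:.2f}, inactive cell AUC = {auc_inactive:.2f}")
```

Because the score depends only on within-cell ranks, it does not depend on the absolute scale of the expression values in each cell.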
Example regulons by cell type
- FOXP3 regulon: active in regulatory T cells (Tregs)
- SPI1 (PU.1) regulon: active in myeloid lineage (monocytes, macrophages, dendritic cells)
- PAX5 regulon: active in B lymphocytes
- GATA1 regulon: active in erythroid progenitors
- HNF4A regulon: active in hepatocytes
CellChat: Cell-Cell Communication
Cells within tissues do not operate in isolation. They communicate through a repertoire of signaling molecules: secreted ligands that bind cognate receptors on neighboring or distant cells, direct cell-cell contact via adhesion molecules, and extracellular matrix (ECM)-mediated signals. CellChat infers these intercellular communication networks from scRNA-seq data by evaluating the co-expression of ligand-receptor pairs across cell type pairs (Figure 7).
Signaling probability model
For each ligand-receptor pair, CellChat computes the probability of communication between each pair of cell types based on the average expression of the ligand in the sender population and the receptor in the receiver population, incorporating co-factors and mediators where known. A permutation test establishes statistical significance by shuffling cell type labels.
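The permutation logic can be sketched with a deliberately simplified score: mean ligand expression in the sender population multiplied by mean receptor expression in the receiver population. CellChat's actual probability model is more elaborate, and the score, function names, cell types, and expression values below are assumptions for illustration only.

```python
import numpy as np

def communication_score(ligand, receptor, labels, sender, receiver):
    """Simplified score: mean ligand in sender cells times mean receptor in receiver cells."""
    return float(ligand[labels == sender].mean() * receptor[labels == receiver].mean())

def permutation_pvalue(ligand, receptor, labels, sender, receiver, n_perm=200, seed=0):
    """Empirical p-value from shuffling cell type labels across cells."""
    rng = np.random.default_rng(seed)
    observed = communication_score(ligand, receptor, labels, sender, receiver)
    null = [
        communication_score(ligand, receptor, rng.permutation(labels), sender, receiver)
        for _ in range(n_perm)
    ]
    exceed = sum(score >= observed for score in null)
    return observed, (exceed + 1) / (n_perm + 1)

# 60 toy cells: the ligand is expressed only by fibroblasts, the receptor only by T cells.
labels = np.repeat(["fibroblast", "T cell", "B cell"], 20)
ligand = np.where(labels == "fibroblast", 5.0, 0.0)
receptor = np.where(labels == "T cell", 4.0, 0.0)

observed, pvalue = permutation_pvalue(ligand, receptor, labels, "fibroblast", "T cell")
print(f"observed score = {observed:.1f}, permutation p-value = {pvalue:.4f}")
```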
CellChatDB
The analysis relies on a curated database of >2,000 known interactions for human and mouse, categorized into:
- Secreted signaling: Ligands released into the extracellular space (cytokines, chemokines, growth factors). Examples: CXCL12→CXCR4, TNF→TNFRSF1A.
- Cell-cell contact: Direct interaction requiring physical proximity. Examples: NOTCH ligands (DLL1, JAG1)→NOTCH receptors, CD80/CD86→CD28/CTLA4.
- ECM-receptor: Extracellular matrix components signaling through surface receptors. Examples: COL1A1→integrins, LAMA→dystroglycan.
Output interpretation
- Communication networks: Directed graphs showing signaling strength between cell type pairs. Edge weight reflects probability and number of significant interactions.
- Pathway-level analysis: Aggregation of individual interactions into pathways (WNT, NOTCH, TGFb) identifies dominant signaling programs.
- Sender/receiver roles: Identifies which cell types are predominantly sources vs targets of communication, which is useful for understanding immune microenvironment dynamics in tumors.
scTag: Collaborative Cell Type Annotation
Cell type annotation, the assignment of biological identities to computationally defined clusters, is arguably the most subjective step in scRNA-seq analysis. It requires domain expertise and familiarity with tissue-specific marker panels, and it often produces disagreements between researchers. scTag addresses this by enabling collaborative, multi-annotator workflows with transparent consensus building (Figure 8).
Annotation workflow
- Export dataset from the scExplorer pipeline: the UMAP coordinates, cluster assignments, and top DEGs are loaded into scTag.
- Interactive visualization: annotators explore the UMAP with cluster overlays and can query expression of specific markers in real time.
- Independent annotation: each annotator labels clusters with cell type identities. The system provides marker-based suggestions using a scoring function (sigmoid-weighted scoring of positive markers that should be expressed and negative markers that should be absent).
- Consensus generation: when multiple annotators complete their annotations, the system generates consensus labels using weighted voting. Annotator confidence scores are factored into the weighting.
- Export: final consensus labels are exported as metadata (h5ad/RDS/CSV) for use in downstream analysis, publication, and data sharing.
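The consensus step in the workflow above can be illustrated with confidence-weighted voting for a single cluster. The exact weighting used by scTag is not specified here, so the function below (summing confidences per label and reporting the winner's share) is an assumption for illustration.

```python
from collections import defaultdict

def consensus_label(votes):
    """votes: list of (label, confidence) for one cluster -> (winning label, share of total weight)."""
    weight = defaultdict(float)
    for label, confidence in votes:
        weight[label] += confidence
    total = sum(weight.values())
    winner = max(weight, key=weight.get)
    return winner, weight[winner] / total

# Three hypothetical annotators label the same cluster with different confidence.
votes = [
    ("CD4+ T cells", 0.9),
    ("CD8+ T cells", 0.6),
    ("CD4+ T cells", 0.5),
]

label, support = consensus_label(votes)
print(f"consensus: {label} (support {support:.0%})")
```

Reporting the support share alongside the label keeps disagreements visible, so low-support clusters can be flagged for discussion.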
7 Results Hub
The Results page aggregates all artifacts generated throughout the pipeline into a navigable dashboard organized by analysis step:
- Raw Data: Original dataset summary: cell count, gene count, sparsity metrics.
- QC Metrics: Violin plots and scatter plots of gene counts, UMI counts, and mitochondrial fraction before and after filtering.
- Preprocessing: Normalization parameters, HVG list.
- Embedding: PCA variance plots, UMAP coordinates, cluster assignments.
- DEA: Ranked gene tables per cluster, volcano plots, statistical summaries.
- Downstream Analyses: Trajectory graphs, velocity fields, SCENIC regulon heatmaps, CellChat networks (lazy-loaded for performance).
What to export for publication
For a typical manuscript: UMAP with cluster identities, annotated UMAP with cell type labels, top-10 marker gene heatmap, volcano plots for key comparisons, and relevant downstream analysis figures. All plots can be downloaded directly from the Results page. The underlying h5ad file contains all computed metadata for reproducibility.
Integration: Batch Effect Correction
When combining scRNA-seq datasets from different experiments, conditions, or laboratories, batch effects (systematic technical variations) can dominate the biological signal. Differences in cell dissociation protocols, library preparation kits, sequencing depth, or even the day of the experiment can cause cells of the same type to cluster separately by batch rather than by identity. Integration methods correct these technical variations while preserving genuine biological differences (e.g., treatment-induced changes, disease vs healthy states), as illustrated in Figure 10.
| Method | Approach | Best for | Limitations |
|---|---|---|---|
| Harmony | Iterative soft-clustering in PC space; adjusts embeddings to remove batch variation while preserving within-batch structure | Fast integration of batches with similar cell type compositions; large datasets; quick iteration | May over-correct if batches have genuinely different compositions; operates only in PC space (not on raw counts) |
| Scanorama | Finds mutual nearest neighbors (MNNs) across batches and uses these anchor pairs to learn a batch correction | Batches with partially overlapping compositions; preserves batch-specific populations | Slower than Harmony; requires sufficient shared cell types between batches as anchors |
| ComBat | Empirical Bayes linear model that estimates and removes batch-specific shifts and scaling | Simple technical variation (sequencing depth differences, kit differences) with known batch labels | Assumes linear batch effects; can over-correct complex non-linear variations; may not handle cell composition differences well |
| BBKNN | Modifies the k-NN graph construction to balance neighbors across batches, ensuring each cell's neighborhood includes cells from all batches | Preserving rare populations; moderate batch effects; graph-based analyses | Does not produce corrected expression values (correction is at the graph level only); depends on balanced batch sizes |
- Korsunsky I, Millard N, et al. (2019). Fast, sensitive and accurate integration of single-cell data with Harmony. Nature Methods.
- Hie B, Bryson B, Berger B (2019). Efficient integration of heterogeneous single-cell transcriptomes using Scanorama. Nature Biotechnology.
- Polański K, Young MD, et al. (2020). BBKNN: fast batch alignment of single cell transcriptomes. Bioinformatics.
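To illustrate what a linear batch correction does, and why it can over-correct, the sketch below removes per-batch mean shifts gene by gene. This is only the simplest location adjustment, without the scaling or empirical Bayes shrinkage that ComBat applies; the function name and toy values are assumptions.

```python
import numpy as np

def center_batches(expr, batch):
    """Shift each batch so its per-gene mean matches the overall per-gene mean."""
    corrected = expr.astype(float).copy()
    grand_mean = expr.mean(axis=0)
    for b in np.unique(batch):
        rows = batch == b
        corrected[rows] += grand_mean - expr[rows].mean(axis=0)
    return corrected

# 4 cells x 2 genes; batch "b" carries a constant offset of +10 on every gene.
expr = np.array([
    [1.0, 2.0],
    [3.0, 4.0],
    [11.0, 12.0],
    [13.0, 14.0],
])
batch = np.array(["a", "a", "b", "b"])

corrected = center_batches(expr, batch)
print(corrected.tolist())
```

If the offset in batch "b" were biological (for example, a different mix of cell types) rather than technical, this correction would erase it, which is the over-correction risk listed in the table.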
Labeling: Manual Cell Type Assignment
The Labeling module allows you to assign cell type names to Leiden clusters directly within the pipeline. After reviewing DEA results and validating markers in the Visualization module, each cluster receives a biological identity label (e.g., "CD4+ T cells," "Classical monocytes," "Alveolar type II cells").
Merging over-clustered populations
When multiple clusters represent fragments of the same biological population (common at high Leiden resolution), Labeling allows you to assign the same label to multiple clusters. This effectively merges them for downstream interpretation without needing to re-run the clustering. For example, if clusters 3, 7, and 12 all express CD3E, CD4, and TCF7 (naive CD4+ T cell markers) and their DEGs show only quantitative (not qualitative) differences, they can all be labeled "Naive CD4+ T cells."
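Merging by label is simply a many-to-one mapping from cluster IDs to names. The sketch below uses the clusters from the example above; cluster "5" and its label are hypothetical additions for contrast.

```python
# Several cluster IDs may share one label; cluster "5" is a made-up contrasting cluster.
cluster_to_label = {
    "3": "Naive CD4+ T cells",
    "7": "Naive CD4+ T cells",
    "12": "Naive CD4+ T cells",
    "5": "Classical monocytes",
}

# Leiden cluster assignment of five example cells.
cell_clusters = ["3", "7", "5", "12", "3"]

# Per-cell labels: clusters 3, 7, and 12 collapse into a single population.
cell_labels = [cluster_to_label[cluster] for cluster in cell_clusters]
print(cell_labels)
```

The original cluster assignments are left unchanged, so the merge can be revisited later without re-running the clustering.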
Glossary
- AnnData
- Annotated data matrix format used by Scanpy. Stores the count matrix (X), cell metadata (obs), gene metadata (var), embeddings (obsm), and additional layers.
- Barcode
- Short DNA sequence (typically 16 nt in 10x) that uniquely identifies a cell-containing droplet. Each cell's reads carry a shared barcode.
- Batch effect
- Systematic technical variation between experiments or samples that can confound biological signals. Sources include different library prep dates, sequencing runs, or operators.
- Doublet
- A barcode representing two cells captured in the same droplet. Produces a hybrid transcriptome that can appear as an intermediate state or novel population.
- Dropout
- Failure to detect an expressed transcript due to the stochastic nature of mRNA capture and reverse transcription. Results in excess zeros in the count matrix.
- Elbow plot
- Plot of variance explained per principal component. The "elbow" indicates the transition from biologically informative PCs to noise-dominated ones.
- HVG
- Highly Variable Gene. A gene whose variance across cells exceeds the expected technical variance, indicating biological relevance for cell type distinction.
- k-NN graph
- k-Nearest Neighbor graph. Network connecting each cell to its k most similar neighbors in PC space. Substrate for clustering and UMAP.
- Leiden
- Community detection algorithm that partitions the k-NN graph into clusters by optimizing modularity with guaranteed connectivity.
- Log₂FC
- Log₂ fold change. Measure of expression difference between groups. log₂FC of 1 = 2× increase; log₂FC of 2 = 4× increase.
- Pseudotime
- Continuous variable ordering cells along an inferred differentiation trajectory. Not real time: represents transcriptional progression.
- Regulon
- A transcription factor and its set of validated target genes, as inferred by SCENIC. Represents an active regulatory program.
- Spliced / Unspliced
- Mature mRNA (introns removed) vs pre-mRNA (introns retained). Their ratio informs RNA velocity: the predicted direction of transcriptional change.
- UMAP
- Uniform Manifold Approximation and Projection. Non-linear dimensionality reduction for visualization. Preserves local topology; global distances are not meaningful.
- UMI
- Unique Molecular Identifier. Short random sequence appended during library prep to tag individual mRNA molecules, enabling PCR duplicate removal and absolute quantification.
- Zero-inflated
- Statistical property of scRNA-seq data where the number of zeros exceeds what is expected from a standard count distribution, due to dropout and biological absence combined.